
FinTech Data Generator - Week 1 Sprint 0
========================================

This notebook creates realistic financial datasets for our 11-week FinTech project.
We'll generate stock prices, crypto data, economic indicators, portfolio holdings, 
and customer demographics that mirror real-world financial data patterns.

Learning Objectives:
- Understand financial data structures (OHLCV format)
- Learn about time series data generation using statistical models
- Practice working with pandas DataFrames
- Set up reproducible data pipelines using random seeds

Key Concepts:
- Geometric Brownian Motion for price modeling
- Log-normal distributions for financial variables
- Time series patterns and volatility clustering
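
The reproducibility objective above can be checked with a small experiment. The sketch below is illustrative only: `draw_samples` is a hypothetical helper, not part of the project code. It seeds both NumPy and Python's `random` module, as the generator class later in this notebook does, and shows that the same seed gives the same draws.

```python
import random

import numpy as np


def draw_samples(seed: int, n: int = 3):
    """Seed both generators, then draw n values from each."""
    np.random.seed(seed)   # seeds NumPy's global generator
    random.seed(seed)      # seeds Python's built-in generator
    numpy_draws = np.random.normal(0, 1, size=n)
    python_draws = [random.random() for _ in range(n)]
    return numpy_draws, python_draws


first_np, first_py = draw_samples(seed=42)
second_np, second_py = draw_samples(seed=42)
other_np, _ = draw_samples(seed=7)

print(np.array_equal(first_np, second_np))  # same seed, same NumPy draws
print(first_py == second_py)                # same seed, same Python draws
print(np.array_equal(first_np, other_np))   # different seed, different draws
```

Because the seed fixes the whole sequence of draws, the order of calls matters too: running cells in a different order after seeding gives different data.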

In [1]:
# Library imports for data manipulation and file operations (pandas and NumPy are third-party)
import pandas as pd              # Data manipulation and analysis
import numpy as np              # Numerical computing and random number generation
from datetime import datetime, timedelta  # Date and time handling
import random                   # Additional random number generation
import os                      # Operating system interface for file operations
from typing import Tuple, List, Dict  # Type hints for better code documentation

# Display settings for better output formatting
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("✅ All libraries imported successfully!")
print("📊 Ready to generate realistic financial datasets for our FinTech project")
print("🎯 This data will support all 11 weeks of our Agile development sprints")

✅ All libraries imported successfully!
📊 Ready to generate realistic financial datasets for our FinTech project
🎯 This data will support all 11 weeks of our Agile development sprints


Class-Based Data Generation Architecture
=======================================

We use Object-Oriented Programming (OOP) to organize our data generation logic.
The FinancialDataGenerator class encapsulates all methods and market constants,
making our code modular, reusable, and easy to maintain - key Agile principles!

Key Financial Concepts:
- Stock symbols represent publicly traded companies
- Crypto symbols represent digital assets with 24/7 trading
- Economic indicators measure macroeconomic health
- Each asset class has different volatility and behavior patterns
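
Later cells in this notebook define each generation method as a plain function and then attach it to the class. A minimal sketch of that pattern is shown here with a hypothetical `MiniGenerator` class and `describe` function, both invented for illustration.

```python
class MiniGenerator:
    """Hypothetical stand-in for a generator class holding shared configuration."""

    def __init__(self, symbols):
        self.symbols = list(symbols)


def describe(self) -> str:
    """Defined outside the class; `self` is bound once attached as a method."""
    return f"{len(self.symbols)} symbols: {', '.join(self.symbols)}"


# Attach the function to the class; existing and future instances can call it
MiniGenerator.describe = describe

gen = MiniGenerator(["AAPL", "BTC"])
print(gen.describe())  # 2 symbols: AAPL, BTC
```

This suits a notebook, where a class is built up cell by cell. In a regular module, defining the methods inside the class body is usually clearer.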

In [2]:

class FinancialDataGenerator:
    """
    Comprehensive mock financial data generator for FinTech projects.
    
    In Python, a class is a blueprint for generating objects (instances) that
    share the same data attributes and behavior (methods). This class creates
    realistic datasets that simulate:
    1. Stock market behavior (geometric Brownian motion)
    2. Cryptocurrency volatility (higher volatility, 24/7 trading)
    3. Economic indicators (mean-reverting time series)
    4. Portfolio allocations (risk-based asset allocation)
    5. Customer demographics (realistic distributions)
    
    Design Pattern: This loosely resembles the Factory Pattern - one class that creates
    multiple types of related objects (different financial datasets).
    """
    
    def __init__(self, seed: int = 42):
        """
        Initialize the generator with predefined market data and random seed.
        
        Args:
            seed: Random seed for reproducibility (crucial for testing and validation)
        
        Why use a seed?
        - Ensures our data generation is reproducible
        - Critical for debugging and validation
        - Allows team members to generate identical datasets
        - Follows best practices in quantitative finance
        """
        # Set random seeds for reproducible results
        np.random.seed(seed)  # NumPy random operations
        random.seed(seed)     # Python random module operations
        
        # Define major stock symbols - representing different sectors and market caps
        # These are real S&P 500 companies for realistic modeling
        self.stock_symbols = [
            # Technology Giants (FAANG + others)
            'AAPL', 'GOOGL', 'MSFT', 'AMZN', 'TSLA', 'META', 'NVDA', 
            
            # Financial Services
            'JPM', 'BAC', 'V', 'MA', 
            
            # Healthcare & Consumer Goods
            'JNJ', 'PG', 'UNH', 'PFE', 'KO', 'WMT',
            
            # Entertainment & Retail
            'DIS', 'HD', 'NKE', 'COST',
            
            # Software & Cloud
            'ADBE', 'CRM', 'ORCL', 'CSCO',
            
            # Traditional Industries
            'T', 'VZ', 'XOM', 'CVX', 'IBM'
        ]
        
        # Major cryptocurrency symbols by market capitalization (as of 2024-2025)
        # Note: Crypto markets are more volatile and trade 24/7
        self.crypto_symbols = [
            'BTC',   # Bitcoin - digital gold, store of value
            'ETH',   # Ethereum - smart contract platform
            'BNB',   # Binance Coin - exchange token
            'XRP',   # Ripple - cross-border payments
            'ADA',   # Cardano - proof-of-stake blockchain
            'DOGE',  # Dogecoin - meme coin with high volatility
            'SOL',   # Solana - high-performance blockchain
            'TRX',   # Tron - decentralized entertainment platform
            'DOT',   # Polkadot - interoperability protocol
            'MATIC', # Polygon - Ethereum scaling solution
            'SHIB',  # Shiba Inu - another meme coin
            'AVAX',  # Avalanche - smart contracts platform
            'LTC',   # Litecoin - Bitcoin fork
            'UNI',   # Uniswap - decentralized exchange token
            'LINK'   # Chainlink - oracle network
        ]
        
        # Key macroeconomic indicators that affect financial markets
        # These drive fundamental analysis and economic forecasting
        self.economic_indicators = [
            'GDP_GROWTH',           # Gross Domestic Product growth rate
            'INFLATION_RATE',       # Consumer Price Index changes
            'UNEMPLOYMENT_RATE',    # Labor market health
            'INTEREST_RATE',        # Federal funds rate (monetary policy)
            'CONSUMER_CONFIDENCE',  # Consumer sentiment index
            'RETAIL_SALES',         # Consumer spending patterns
            'INDUSTRIAL_PRODUCTION',# Manufacturing output
            'HOUSING_STARTS',       # Real estate market health
            'TRADE_BALANCE',        # Imports vs exports
            'MONEY_SUPPLY'          # M2 money supply (liquidity)
        ]
        
        print(f"🏭 FinancialDataGenerator initialized with seed={seed}")
        print(f"📈 Configured for {len(self.stock_symbols)} stock symbols")
        print(f"💰 Configured for {len(self.crypto_symbols)} crypto symbols") 
        print(f"📊 Tracking {len(self.economic_indicators)} economic indicators")
        print("🔄 All random generators seeded for reproducible results")

# Test the class initialization
generator = FinancialDataGenerator(seed=42)
print("\n✅ Generator class created successfully!")
print("📝 Next: We'll implement individual data generation methods")

🏭 FinancialDataGenerator initialized with seed=42
📈 Configured for 30 stock symbols
💰 Configured for 15 crypto symbols
📊 Tracking 10 economic indicators
🔄 All random generators seeded for reproducible results

✅ Generator class created successfully!
📝 Next: We'll implement individual data generation methods


Stock Price Modeling: Geometric Brownian Motion (GBM)
====================================================

Financial Theory Background:
- Stock prices follow a random walk with drift
- Daily returns are approximately normally distributed
- Prices cannot go negative (geometric vs arithmetic Brownian motion)
- Volatility clustering: periods of high/low volatility tend to cluster

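The GBM Formula: S(t+dt) = S(t) * exp((μ - σ²/2)*dt + σ*sqrt(dt)*Z)
The generator in this notebook simplifies this by drawing a per-day log drift
directly instead of computing (μ - σ²/2)*dt.
Where:
- S(t) = current stock price
- μ = drift (expected annual return)
- σ = volatility (annualized standard deviation of returns)
- Z = standard normal random variable, Z ~ N(0, 1)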
- dt = time step (1 day = 1/252 years)
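
To make the recursion concrete, here is a minimal sketch of a single GBM path. It uses the standard exact discretization, which subtracts σ²/2 from the drift so that μ is the expected arithmetic return. The function name `simulate_gbm` and all parameter values are assumptions chosen for illustration, not the notebook's generator.

```python
import numpy as np


def simulate_gbm(s0: float, mu: float, sigma: float,
                 n_steps: int, dt: float = 1 / 252, seed: int = 42) -> np.ndarray:
    """Simulate one GBM price path of length n_steps + 1 (including s0)."""
    rng = np.random.default_rng(seed)       # local generator, leaves global state alone
    z = rng.standard_normal(n_steps)        # Z ~ N(0, 1), one draw per step
    # Log return per step; the -0.5*sigma^2 term is the Ito correction
    log_returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    # Cumulative sum of log returns, exponentiated, gives the price path
    log_path = np.concatenate(([0.0], np.cumsum(log_returns)))
    return s0 * np.exp(log_path)


path = simulate_gbm(s0=100.0, mu=0.08, sigma=0.25, n_steps=252)
print(len(path))          # 253 values: the start price plus 252 daily steps
print(path.min() > 0)     # prices stay positive because exp(...) > 0
```

Because each step multiplies the previous price by a positive factor, the simulated price can approach zero but never cross it. This is the practical reason to prefer geometric over arithmetic Brownian motion for prices.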

OHLCV Format:
- Open: First trade price of the day
- High: Highest price during the day
- Low: Lowest price during the day
- Close: Last trade price of the day
- Volume: Number of shares traded
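
The OHLCV definitions above can be read as an aggregation rule over a day's trades. The sketch below uses only the standard library. `build_ohlcv_bar`, `is_consistent`, and the sample trades are hypothetical names and values for illustration.

```python
from typing import Dict, List, Tuple


def build_ohlcv_bar(trades: List[Tuple[float, int]]) -> Dict[str, float]:
    """Collapse a time-ordered list of (price, size) trades into one OHLCV bar."""
    if not trades:
        raise ValueError("need at least one trade to build a bar")
    prices = [price for price, _ in trades]
    return {
        "Open": prices[0],                          # first trade price
        "High": max(prices),                        # highest trade price
        "Low": min(prices),                         # lowest trade price
        "Close": prices[-1],                        # last trade price
        "Volume": sum(size for _, size in trades),  # total shares traded
    }


def is_consistent(bar: Dict[str, float]) -> bool:
    """Check the invariant Low <= Open, Close <= High."""
    return (bar["Low"] <= min(bar["Open"], bar["Close"])
            and max(bar["Open"], bar["Close"]) <= bar["High"])


# Hypothetical trades for one day: (price, shares)
day_trades = [(100.0, 200), (101.5, 50), (99.2, 300), (100.8, 150)]
bar = build_ohlcv_bar(day_trades)
print(bar)
print(is_consistent(bar))
```

A bar built from real trades satisfies the invariant automatically. Synthetic data that draws Open, High, and Low separately has to enforce it explicitly, which is why a check like `is_consistent` is worth running on generated data.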

In [3]:
def generate_stock_prices(self, 
                         symbols: List[str] = None,
                         start_date: str = '2020-01-01',
                         end_date: str = '2024-12-31',
                         initial_price_range: Tuple[float, float] = (20, 500)) -> pd.DataFrame:
    """
    Generate realistic stock price data using Geometric Brownian Motion.
    
    This method simulates how stock prices evolve over time, incorporating:
    1. Random price movements (market efficiency)
    2. Per-symbol volatility (drawn once per symbol; volatility clustering is not modeled)
    3. Mean reversion tendencies (prices don't drift too far from fundamentals)
    4. Realistic trading volumes correlated with price volatility
    
    Args:
        symbols: List of stock symbols to generate (default: first 20 predefined)
        start_date: Start date for price series
        end_date: End date for price series  
        initial_price_range: Range for starting stock prices
        
    Returns:
        DataFrame with columns: Date, Symbol, Open, High, Low, Close, Volume
        
    Financial Insights:
        - Higher volatility stocks have more dramatic price swings
        - Volume increases during high volatility periods (realistic behavior)
        - Mean reversion prevents prices from drifting to unrealistic levels
        - Weekend gaps are handled by excluding weekends from trading days
    """
    if symbols is None:
        symbols = self.stock_symbols[:20]  # Use first 20 symbols for manageable dataset
        
    # Create business day range (exclude weekends - NYSE is closed Sat/Sun)
    date_range = pd.date_range(start=start_date, end=end_date, freq='D')
    business_days = date_range[date_range.weekday < 5]  # Monday=0, Friday=4
    
    print(f"📅 Generating stock data for {len(business_days)} trading days")
    print(f"📈 Creating price series for {len(symbols)} symbols")
    
    all_stock_data = []
    
    for i, symbol in enumerate(symbols):
        print(f"  📊 Processing {symbol} ({i+1}/{len(symbols)})")
        
        # Initialize stock-specific parameters
        initial_price = np.random.uniform(*initial_price_range)
        annual_volatility = np.random.uniform(0.15, 0.45)  # 15-45% annual volatility
        daily_volatility = annual_volatility / np.sqrt(252)  # Convert to daily
        
        # Store prices for mean reversion calculation
        prices = [initial_price]
        volumes = []
        
        # Generate daily price evolution
        for day_idx, date in enumerate(business_days):
            # Base daily return components
            base_drift = np.random.normal(0.0008, 0.002)  # ~20% annual drift with variation
            volatility_shock = np.random.normal(0, daily_volatility)
            
            # Add mean reversion after 30 days (prevents unrealistic price drift)
            # Note: the gap is measured in price units, so the pull scales with price level
            if day_idx > 30:
                # Calculate 30-day moving average
                recent_avg = np.mean(prices[-30:])
                mean_reversion_force = (recent_avg - prices[-1]) * 0.001
                base_drift += mean_reversion_force
            
            # Apply geometric Brownian motion formula
            price_multiplier = np.exp(base_drift + volatility_shock)
            new_price = prices[-1] * price_multiplier
            
            # Apply circuit breakers (realistic market limits)
            # No stock can drop more than 30% or gain more than 50% in one day
            new_price = max(new_price, prices[-1] * 0.70)  # Max 30% daily drop
            new_price = min(new_price, prices[-1] * 1.50)  # Max 50% daily gain
            
            prices.append(new_price)
            
            # Generate realistic trading volume
            # Volume correlates with volatility (high volatility = high volume)
            base_volume = np.random.lognormal(15, 1)  # Log-normal distribution for volume
            volatility_multiplier = abs(volatility_shock) * 5 + 1
            daily_volume = int(base_volume * volatility_multiplier)
            volumes.append(daily_volume)
        
        # Convert daily close prices to OHLCV format
        for day_idx, date in enumerate(business_days):
            close_price = prices[day_idx + 1]  # +1 because prices[0] is initial
            previous_close = prices[day_idx]
            
            # Generate intraday price range
            intraday_volatility = abs(np.random.normal(0, daily_volatility * close_price))
            
            # Calculate OHLC with realistic constraints
            high_price = close_price + np.random.uniform(0, 1) * intraday_volatility
            low_price = close_price - np.random.uniform(0, 1) * intraday_volatility  
            open_price = previous_close + np.random.normal(0, daily_volatility * previous_close * 0.3)
            
            # Ensure OHLC logical consistency: Low ≤ Open,Close ≤ High
            high_price = max(high_price, open_price, close_price)
            low_price = min(low_price, open_price, close_price)
            
            # Add to dataset
            all_stock_data.append({
                'Date': date,
                'Symbol': symbol,
                'Open': round(open_price, 2),
                'High': round(high_price, 2), 
                'Low': round(low_price, 2),
                'Close': round(close_price, 2),
                'Volume': volumes[day_idx]
            })
    
    stock_df = pd.DataFrame(all_stock_data)
    print(f"✅ Generated {len(stock_df):,} stock price records")
    print(f"📊 Data shape: {stock_df.shape}")
    return stock_df

# Add the method to our generator class
FinancialDataGenerator.generate_stock_prices = generate_stock_prices

# Test the stock price generation
print("🧪 Testing stock price generation with 3 symbols...")
test_stocks = generator.generate_stock_prices(
    symbols=['AAPL', 'GOOGL', 'TSLA'], 
    start_date='2024-01-01', 
    end_date='2024-01-31'
)

print("\n📈 Sample Stock Data:")
print(test_stocks.head())
print(f"\n📊 Price range for AAPL: ${test_stocks[test_stocks['Symbol']=='AAPL']['Close'].min():.2f} - ${test_stocks[test_stocks['Symbol']=='AAPL']['Close'].max():.2f}")
print("✅ Stock price generation method working correctly!")

🧪 Testing stock price generation with 3 symbols...
📅 Generating stock data for 23 trading days
📈 Creating price series for 3 symbols
  📊 Processing AAPL (1/3)
  📊 Processing GOOGL (2/3)
  📊 Processing TSLA (3/3)
✅ Generated 69 stock price records
📊 Data shape: (69, 7)

📈 Sample Stock Data:
        Date Symbol    Open    High     Low   Close   Volume
0 2024-01-01   AAPL  195.47  215.52  195.47  208.73  3126595
1 2024-01-02   AAPL  208.88  218.17  208.88  218.04  8566710
2 2024-01-03   AAPL  217.65  221.85  217.65  221.28  2209596
3 2024-01-04   AAPL  219.81  223.62  219.81  222.72   498495
4 2024-01-05   AAPL  224.40  224.40  216.30  218.74  1278784

📊 Price range for AAPL: $192.08 - $222.72
✅ Stock price generation method working correctly!


print(test_stocks[test_stocks['Symbol'] == 'AAPL'])

Cryptocurrency Market Characteristics
====================================

Key Differences from Traditional Assets:
1. 24/7 Trading: No market closure, continuous price discovery
2. Higher Volatility: 50-120% annual volatility vs 15-45% for stocks  
3. Less Liquidity: More susceptible to large price swings
4. Market Sentiment: Heavily influenced by news, social media, regulatory events
5. Technological Factors: Network upgrades, adoption metrics affect prices

Time Frequency: 6-hour intervals (4 data points per day)
This provides sufficient granularity while keeping dataset manageable.
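
Changing the time frequency also changes how annual volatility is scaled to one period. A minimal sketch of the square-root-of-time rule follows, which assumes returns are independent across periods. The helper `period_volatility` and the example volatilities are illustrative assumptions.

```python
import math


def period_volatility(annual_volatility: float, periods_per_year: int) -> float:
    """Scale annual volatility to one period using the square-root-of-time rule."""
    return annual_volatility / math.sqrt(periods_per_year)


STOCK_PERIODS = 252        # trading days per year
CRYPTO_PERIODS = 365 * 4   # four 6-hour bars per day, every day of the year

stock_daily = period_volatility(0.30, STOCK_PERIODS)
crypto_6h = period_volatility(0.80, CRYPTO_PERIODS)
print(f"30% annual stock volatility  -> {stock_daily:.4f} per trading day")
print(f"80% annual crypto volatility -> {crypto_6h:.4f} per 6-hour bar")
```

Crypto uses 365 days rather than 252 because the market never closes, so every calendar day contributes four bars.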

DeFi & Crypto Fundamentals:
- Bitcoin: Digital gold, store of value narrative
- Ethereum: Smart contract platform, powers DeFi ecosystem
- Altcoins: Various use cases (payments, DeFi, NFTs, gaming)

In [12]:
def generate_crypto_prices(self,
                          symbols: List[str] = None,
                          start_date: str = '2020-01-01', 
                          end_date: str = '2024-12-31') -> pd.DataFrame:
    """
    Generate cryptocurrency price data with realistic 24/7 market behavior.
    
    Crypto markets exhibit unique characteristics:
    - Much higher volatility (50-120% annually)
    - 24/7 trading (no weekend gaps)
    - Sentiment-driven price action
    - Lower liquidity leads to more extreme movements
    - Different behavior patterns for weekends vs weekdays
    
    Args:
        symbols: List of crypto symbols (default: top 10 by market cap)
        start_date: Start date for generation
        end_date: End date for generation
        
    Returns:
        DataFrame with columns: Timestamp, Symbol, Open, High, Low, Close, Volume
        
    Technical Implementation:
        - 6-hour intervals (4 data points per day)
        - Higher volatility parameters than stocks
        - Weekend and night-time drift damping (volume is not adjusted)
        - Realistic initial price ranges for major cryptocurrencies
    """
    if symbols is None:
        symbols = self.crypto_symbols[:10]  # Top 10 cryptocurrencies
    
    # Crypto trades 24/7 - generate 6-hour intervals
    full_date_range = pd.date_range(start=start_date, end=end_date, freq='h')
    # Take every 6th hour: 00:00, 06:00, 12:00, 18:00 UTC
    crypto_timestamps = full_date_range[::6]
    
    print(f"⏰ Generating crypto data for {len(crypto_timestamps)} 6-hour intervals")
    print(f"💎 Creating price series for {len(symbols)} cryptocurrencies")
    print("🌍 Modeling 24/7 global crypto markets")
    
    all_crypto_data = []
    
    # Realistic initial price ranges for major cryptocurrencies (USD)
    # These reflect approximate price levels as of 2024-2025
    initial_price_ranges = {
        'BTC': (30000, 60000),    # Bitcoin: $30k-60k range
        'ETH': (2000, 4000),      # Ethereum: $2k-4k range  
        'BNB': (300, 600),        # Binance Coin: $300-600
        'XRP': (0.5, 1.5),        # Ripple: $0.50-1.50
        'ADA': (0.3, 1.2),        # Cardano: $0.30-1.20
        'DOGE': (0.05, 0.3),      # Dogecoin: $0.05-0.30
        'SOL': (50, 200),         # Solana: $50-200
        'TRX': (0.06, 0.12),      # Tron: $0.06-0.12
        'DOT': (5, 30),           # Polkadot: $5-30
        'MATIC': (0.5, 2.5)       # Polygon: $0.50-2.50
    }
    
    for i, symbol in enumerate(symbols):
        print(f"  💰 Processing {symbol} ({i+1}/{len(symbols)})")
        
        # Set initial price and volatility for this crypto
        price_range = initial_price_ranges.get(symbol, (1, 100))  # Default range for unlisted coins
        initial_price = np.random.uniform(*price_range)
        
        # Crypto volatility is much higher than stocks
        annual_volatility = np.random.uniform(0.5, 1.2)  # 50-120% annual volatility
        six_hour_volatility = annual_volatility / np.sqrt(365 * 4)  # Convert to 6-hour periods
        
        prices = [initial_price]
        volumes = []
        
        # Generate price evolution for each 6-hour period
        for period_idx, timestamp in enumerate(crypto_timestamps):
            # Base price movement
            base_drift = np.random.normal(0, 0.001)  # Zero-mean drift with small random variation
            volatility_shock = np.random.normal(0, six_hour_volatility)
            
            # Weekend effect: Crypto markets are less active on weekends
            if timestamp.weekday() >= 5:  # Saturday=5, Sunday=6
                base_drift *= 0.7  # Reduced weekend activity
            
            # Night-time effect: dampen drift for overnight bars (UTC)
            # With bars at 00:00, 06:00, 12:00 and 18:00, only the 00:00 bar matches
            if timestamp.hour < 6 or timestamp.hour > 22:
                base_drift *= 0.5  # Lower overnight drift
            
            # Apply geometric Brownian motion with higher volatility bounds
            price_multiplier = np.exp(base_drift + volatility_shock)
            new_price = prices[-1] * price_multiplier
            
            # Crypto circuit breakers (more lenient than stocks due to higher volatility)
            new_price = max(new_price, prices[-1] * 0.5)   # Max 50% period drop
            new_price = min(new_price, prices[-1] * 2.0)   # Max 100% period gain
            
            prices.append(new_price)
            
            # Generate trading volume (crypto volumes are typically lower than stocks)
            base_volume = np.random.lognormal(12, 1.5)  # Smaller base volume than stocks
            volatility_multiplier = abs(volatility_shock) * 10 + 1  # Higher sensitivity to volatility
            period_volume = int(base_volume * volatility_multiplier)
            volumes.append(period_volume)
        
        # Convert to OHLCV format for each 6-hour period
        for period_idx, timestamp in enumerate(crypto_timestamps):
            close_price = prices[period_idx + 1]
            previous_close = prices[period_idx]
            
            # Generate intraday range for 6-hour period
            period_volatility = abs(np.random.normal(0, six_hour_volatility * close_price * 2))
            
            # Calculate OHLC
            high_price = close_price + np.random.uniform(0, 1) * period_volatility
            low_price = close_price - np.random.uniform(0, 1) * period_volatility
            open_price = previous_close + np.random.normal(0, six_hour_volatility * previous_close)
            
            # Ensure OHLC consistency
            high_price = max(high_price, open_price, close_price)
            low_price = min(low_price, open_price, close_price)
            
            # Add to dataset with appropriate precision
            # Crypto prices need more decimal places due to wide price ranges
            decimal_places = 6 if close_price < 10 else 2
            
            all_crypto_data.append({
                'Timestamp': timestamp,
                'Symbol': symbol,
                'Open': round(open_price, decimal_places),
                'High': round(high_price, decimal_places),
                'Low': round(low_price, decimal_places), 
                'Close': round(close_price, decimal_places),
                'Volume': volumes[period_idx]
            })
    
    crypto_df = pd.DataFrame(all_crypto_data)
    print(f"✅ Generated {len(crypto_df):,} cryptocurrency price records")
    print(f"📊 Data shape: {crypto_df.shape}")
    return crypto_df

# Add method to the generator class
FinancialDataGenerator.generate_crypto_prices = generate_crypto_prices

# Test cryptocurrency generation
print("🧪 Testing crypto price generation with BTC and ETH...")
test_crypto = generator.generate_crypto_prices(
    symbols=['BTC', 'ETH'],
    start_date='2024-01-01',
    end_date='2024-01-07'  # One week for testing
)

print("\n💎 Sample Cryptocurrency Data:")
print(test_crypto.head())
print(f"\n📊 BTC price range: ${test_crypto[test_crypto['Symbol']=='BTC']['Close'].min():.2f} - ${test_crypto[test_crypto['Symbol']=='BTC']['Close'].max():.2f}")
print(f"📊 ETH price range: ${test_crypto[test_crypto['Symbol']=='ETH']['Close'].min():.2f} - ${test_crypto[test_crypto['Symbol']=='ETH']['Close'].max():.2f}")
print("✅ Cryptocurrency price generation working correctly!")
print("🌍 Note: Crypto data includes 24/7 trading with 6-hour intervals")

🧪 Testing crypto price generation with BTC and ETH...
⏰ Generating crypto data for 25 6-hour intervals
💎 Creating price series for 2 cryptocurrencies
🌍 Modeling 24/7 global crypto markets
  💰 Processing BTC (1/2)
  💰 Processing ETH (2/2)
✅ Generated 50 cryptocurrency price records
📊 Data shape: (50, 7)

💎 Sample Cryptocurrency Data:
            Timestamp Symbol      Open      High       Low     Close  Volume
0 2024-01-01 00:00:00    BTC  49291.92  49742.25  49291.92  49621.41   53815
1 2024-01-01 06:00:00    BTC  48645.36  48844.56  48645.36  48788.78   82863
2 2024-01-01 12:00:00    BTC  48484.04  50399.66  48484.04  50310.54  226355
3 2024-01-01 18:00:00    BTC  49839.12  50870.34  49326.46  50450.39  142297
4 2024-01-02 00:00:00    BTC  51631.66  51631.66  50589.41  50950.55  557573

📊 BTC price range: $42166.45 - $51512.48
📊 ETH price range: $3115.87 - $3497.23
✅ Cryptocurrency price generation working correctly!
🌍 Note: Crypto data includes 24/7 trading with 6-hour intervals


Macroeconomic Indicators and Market Impact
=========================================

Economic indicators are crucial for:
1. Fundamental Analysis: Understanding economic health
2. Market Prediction: Economic data drives asset prices
3. Risk Assessment: Economic cycles affect all investments
4. Policy Analysis: Central bank decisions impact markets

Key Indicators We Model:
- GDP Growth: Overall economic expansion/contraction
- Inflation Rate: Price level changes (CPI-based)
- Unemployment Rate: Labor market health
- Interest Rate: Federal funds rate (monetary policy)
- Consumer Confidence: Sentiment indicator
- Retail Sales: Consumer spending patterns
- Industrial Production: Manufacturing output
- Housing Starts: Real estate market health
- Trade Balance: Imports vs exports
- Money Supply: M2 money supply (liquidity)

Time Series Properties:
- Mean Reversion: Economic indicators tend to revert to long-term averages
- Persistence: Current values influence future values
- Seasonality: Some indicators have seasonal patterns
- Structural Breaks: Economic crises can shift baseline levels
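
Mean reversion and persistence can both be captured by one simple update rule: each period, move a fraction of the way back toward the baseline, then add a random shock. The sketch below is a simplified stand-alone illustration. `simulate_indicator` and its parameter values are assumptions for this example.

```python
import numpy as np


def simulate_indicator(start: float, baseline: float, reversion_speed: float,
                       shock_std: float, n_periods: int, seed: int = 42) -> np.ndarray:
    """Simulate x[t+1] = x[t] + speed * (baseline - x[t]) + shock."""
    rng = np.random.default_rng(seed)
    values = np.empty(n_periods + 1)
    values[0] = start
    for t in range(n_periods):
        pull = reversion_speed * (baseline - values[t])   # drift toward baseline
        shock = rng.normal(0.0, shock_std)                # random disturbance
        values[t + 1] = values[t] + pull + shock
    return values


# With no shocks, the gap to the baseline shrinks by (1 - speed) each period
calm = simulate_indicator(start=5.0, baseline=2.0, reversion_speed=0.1,
                          shock_std=0.0, n_periods=3)
print(calm)

noisy = simulate_indicator(start=5.0, baseline=2.0, reversion_speed=0.1,
                           shock_std=0.3, n_periods=60)
print(noisy[-1])
```

A small reversion speed gives high persistence: with a speed of 0.1, 90% of any deviation carries over to the next period.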

In [13]:
def generate_economic_data(self,
                          start_date: str = '2020-01-01',
                          end_date: str = '2024-12-31', 
                          frequency: str = 'ME') -> pd.DataFrame:  # 'ME' = month-end (pandas >= 2.2; older versions use 'M')
    """
    Generate macroeconomic indicator time series with realistic patterns.
    
    Economic indicators exhibit different behaviors than asset prices:
    - Mean reversion to long-term equilibrium values
    - Lower volatility than financial assets
    - Autocorrelation (current values predict future values)
    - Different measurement frequencies (monthly, quarterly)
    
    Args:
        start_date: Start date for data generation
        end_date: End date for data generation
        frequency: 'ME' for month-end, 'QE' for quarter-end, 'YE' for year-end
        
    Returns:
        DataFrame with columns: Date, Indicator, Value
        
    Economic Theory Context (background; each indicator is simulated independently):
    - Business Cycle: Indicators move together in cycles
    - Phillips Curve: Inverse relationship between unemployment and inflation
    - Taylor Rule: Interest rates respond to inflation and output gaps
    """
    # Generate date range based on frequency
    date_range = pd.date_range(start=start_date, end=end_date, freq=frequency)
    
    print(f"📊 Generating economic indicators for {len(date_range)} periods")
    print(f"📅 Frequency: {frequency} ({len(date_range)} data points)")
    print("🏛️ Modeling key macroeconomic relationships")
    
    all_economic_data = []
    
    # Historical baseline values (approximately US averages 2020-2024)
    # These represent "normal" economic conditions
    baseline_values = {
        'GDP_GROWTH': 2.5,          # 2.5% annual GDP growth
        'INFLATION_RATE': 2.0,      # 2% Fed (as well as ECB) inflation target
        'UNEMPLOYMENT_RATE': 5.0,   # ~5% natural rate of unemployment
        'INTEREST_RATE': 1.5,       # Federal funds rate
        'CONSUMER_CONFIDENCE': 100.0, # Index value (100 = baseline)
        'RETAIL_SALES': 0.5,        # Monthly % change
        'INDUSTRIAL_PRODUCTION': 1.0, # Monthly % change
        'HOUSING_STARTS': 1200000,  # Annualized units (1.2 million)
        'TRADE_BALANCE': -50000,    # Million USD (negative = deficit)
        'MONEY_SUPPLY': 15000       # Billion USD (M2 money supply)
    }
    
    # Initialize current values at baseline
    current_values = baseline_values.copy()
    
    # Generate data for each time period
    for date_idx, date in enumerate(date_range):
        print(f"  📈 Processing period {date_idx + 1}/{len(date_range)}: {date.strftime('%Y-%m')}")
        
        for indicator in self.economic_indicators:
            baseline = baseline_values[indicator]
            current = current_values[indicator]
            
            # Mean reversion component
            # Economic indicators tend to return to long-term averages
            reversion_speed = 0.1  # 10% reversion per period
            mean_reversion = (baseline - current) * reversion_speed
            
            # Random shock component (varies by indicator type)
            if indicator in ['GDP_GROWTH', 'INFLATION_RATE', 'UNEMPLOYMENT_RATE']:
                # Core economic indicators have moderate volatility
                shock_std = 0.3
            elif indicator == 'INTEREST_RATE':
                # Interest rates change more gradually (Fed policy)
                shock_std = 0.2
            elif indicator == 'CONSUMER_CONFIDENCE':
                # Sentiment can be quite volatile
                shock_std = 5.0
            elif indicator in ['RETAIL_SALES', 'INDUSTRIAL_PRODUCTION']:
                # Monthly economic activity indicators
                shock_std = 0.5
            elif indicator == 'HOUSING_STARTS':
                # Housing market can be quite volatile
                shock_std = 50000
            elif indicator == 'TRADE_BALANCE':
                # Trade balance fluctuates with global conditions
                shock_std = 10000
            else:  # MONEY_SUPPLY
                # Money supply grows steadily with occasional policy changes
                shock_std = 500
            
            # Generate random shock
            random_shock = np.random.normal(0, shock_std)
            
            # Calculate new value
            new_value = current + mean_reversion + random_shock
            
            # Apply realistic bounds to prevent unrealistic values
            if indicator == 'UNEMPLOYMENT_RATE':
                new_value = max(2.0, min(15.0, new_value))  # 2-15% range
            elif indicator == 'INFLATION_RATE':
                new_value = max(-2.0, min(10.0, new_value))  # -2% to 10% range
            elif indicator == 'INTEREST_RATE':
                new_value = max(0.0, min(10.0, new_value))   # 0-10% range
            elif indicator == 'CONSUMER_CONFIDENCE':
                new_value = max(50, min(150, new_value))     # 50-150 index range
            elif indicator == 'GDP_GROWTH':
                new_value = max(-5.0, min(8.0, new_value))   # -5% to 8% growth range
            elif indicator == 'HOUSING_STARTS':
                new_value = max(500000, min(2000000, new_value))  # Realistic housing range
            elif indicator == 'MONEY_SUPPLY':
                new_value = max(new_value, current_values[indicator])  # Floor at previous value: never decreases in this model
            
            # Update current value for next period
            current_values[indicator] = new_value
            
            # Add to dataset
            all_economic_data.append({
                'Date': date,
                'Indicator': indicator,
                'Value': round(new_value, 2)
            })
    
    economic_df = pd.DataFrame(all_economic_data)
    print(f"✅ Generated {len(economic_df):,} economic data points")
    print(f"📊 Data shape: {economic_df.shape}")
    
    # Display summary statistics
    print("\n📈 Economic Indicator Ranges:")
    for indicator in self.economic_indicators:
        indicator_data = economic_df[economic_df['Indicator'] == indicator]['Value']
        print(f"  {indicator}: {indicator_data.min():.2f} to {indicator_data.max():.2f}")
    
    return economic_df

# Add method to the generator class
FinancialDataGenerator.generate_economic_data = generate_economic_data

# Test economic data generation
print("🧪 Testing economic data generation...")
test_economic = generator.generate_economic_data(
    start_date='2024-01-01',
    end_date='2024-06-30',  # 6 months for testing
    frequency='ME'  # Monthly data
)

print("\n🏛️ Sample Economic Data:")
print(test_economic.head(10))

# Show data for one specific indicator
gdp_data = test_economic[test_economic['Indicator'] == 'GDP_GROWTH']
print("\n📊 GDP Growth over test period:")
print(gdp_data[['Date', 'Value']].head())

print("✅ Economic indicators generation working correctly!")
print("📈 Note: Data exhibits mean reversion and realistic bounds")

🧪 Testing economic data generation...
📊 Generating economic indicators for 6 periods
📅 Frequency: ME (6 data points)
🏛️ Modeling key macroeconomic relationships
  📈 Processing period 1/6: 2024-01
  📈 Processing period 2/6: 2024-02
  📈 Processing period 3/6: 2024-03
  📈 Processing period 4/6: 2024-04
  📈 Processing period 5/6: 2024-05
  📈 Processing period 6/6: 2024-06
✅ Generated 60 economic data points
📊 Data shape: (60, 3)

📈 Economic Indicator Ranges:
  GDP_GROWTH: 2.28 to 2.72
  INFLATION_RATE: 2.08 to 3.16
  UNEMPLOYMENT_RATE: 4.37 to 4.62
  INTEREST_RATE: 1.22 to 1.45
  CONSUMER_CONFIDENCE: 95.45 to 106.08
  RETAIL_SALES: 0.48 to 2.16
  INDUSTRIAL_PRODUCTION: 0.67 to 1.94
  HOUSING_STARTS: 1201415.92 to 1280341.51
  TRADE_BALANCE: -58678.27 to -47155.64
  MONEY_SUPPLY: 15469.14 to 15469.14

🏛️ Sample Economic Data:
        Date              Indicator       Value
0 2024-01-31             GDP_GROWTH        2.56
1 2024-01-31         INFLATION_RATE        2.08
2 2024-01-31      UNEMP
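
The bounds logic in the cell above can also be expressed as a lookup table plus np.clip. The sketch below is a minimal illustration, not the notebook's implementation: INDICATOR_BOUNDS and step_indicator are hypothetical names, the mean-reverting update rule and its parameters (reversion_speed, noise_std, long_run_mean) are assumptions, and only the bounds whose indicator names are visible in the if/elif chain are included.

```python
import numpy as np

# Hypothetical lookup table; the (low, high) pairs mirror the visible if/elif chain.
INDICATOR_BOUNDS = {
    "INFLATION_RATE": (-2.0, 10.0),
    "INTEREST_RATE": (0.0, 10.0),
    "CONSUMER_CONFIDENCE": (50.0, 150.0),
    "GDP_GROWTH": (-5.0, 8.0),
    "HOUSING_STARTS": (500_000.0, 2_000_000.0),
}


def step_indicator(name, current, long_run_mean, rng,
                   reversion_speed=0.1, noise_std=0.1):
    """Apply one assumed mean-reverting update, then clip to the indicator's bounds."""
    shock = rng.normal(0.0, noise_std)
    proposed = current + reversion_speed * (long_run_mean - current) + shock
    # Indicators without an entry are left unbounded.
    low, high = INDICATOR_BOUNDS.get(name, (-np.inf, np.inf))
    return float(np.clip(proposed, low, high))


rng = np.random.default_rng(42)
value = 9.5  # start far from the assumed long-run mean
path = []
for _ in range(12):
    value = step_indicator("INFLATION_RATE", value, long_run_mean=2.5, rng=rng)
    path.append(round(value, 2))

print(path)  # values drift toward 2.5 and never leave the -2 to 10 range
```

Keeping the bounds in one dictionary makes it easier to see at a glance which indicators are constrained and to add a new one without extending the if/elif chain.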

Portfolio Management and Asset Allocation Theory
===============================================

Modern Portfolio Theory (MPT) Concepts:
1. Risk-Return Tradeoff: Higher expected returns generally require accepting higher risk
2. Diversification: Spreading investments across asset classes
3. Asset Allocation: Strategic mix of stocks, bonds, cash based on risk tolerance
4. Rebalancing: Adjusting portfolio weights over time

Risk Profiles:
- Conservative: Capital preservation, lower volatility, higher bond allocation
- Moderate: Balanced growth and income, mixed allocation
- Aggressive: Growth-focused, higher equity allocation, higher volatility

Key Metrics We Generate:
- Portfolio Value: Total market value of holdings
- Asset Weights: Percentage allocation to stocks, bonds, cash
- Monthly Returns: Period-over-period performance
- Risk Level: Conservative, Moderate, Aggressive classification

This data supports:
- Portfolio optimization algorithms
- Risk management systems
- Performance attribution analysis
- Client reporting and advisory services
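
The key metrics listed above can be derived from a series of month-end values. The following is a small sketch under stated assumptions: the portfolio values and dollar holdings are made-up numbers, and the annualization uses the simple conventions (mean × 12, standard deviation × √12) that the summary code in this notebook also applies.

```python
import numpy as np
import pandas as pd

# Made-up month-end values for a single portfolio.
values = pd.Series(
    [100_000.0, 102_000.0, 99_960.0, 104_958.0],
    index=pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31", "2024-04-30"]),
)

# Monthly return: period-over-period percentage change in total value.
monthly_returns = values.pct_change().dropna()

# Simple annualization: arithmetic mean times 12, sample std times sqrt(12).
annual_return = monthly_returns.mean() * 12
annual_vol = monthly_returns.std() * np.sqrt(12)

# Asset weights: each holding's share of total market value.
holdings = {"stocks": 60_000.0, "bonds": 30_000.0, "cash": 10_000.0}
total_value = sum(holdings.values())
weights = {asset: amount / total_value for asset, amount in holdings.items()}

print(monthly_returns.round(4).tolist())
print(f"Annualized return: {annual_return:.1%}, volatility: {annual_vol:.1%}")
print(weights)
```

With only a handful of months, the annualized figures are very noisy, which is worth keeping in mind when reading summaries built from short test runs.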

In [14]:
def generate_portfolio_data(self,
                           n_portfolios: int = 100,
                           start_date: str = '2020-01-01',
                           end_date: str = '2024-12-31') -> pd.DataFrame:
    """
    Generate realistic portfolio holdings and performance data.
    
    Each portfolio has:
    - Consistent risk profile (Conservative/Moderate/Aggressive)
    - Realistic asset allocation based on risk tolerance
    - Monthly rebalancing with small drifts
    - Returns correlated with allocation and market conditions
    
    Args:
        n_portfolios: Number of unique portfolios to generate
        start_date: Start date for portfolio tracking
        end_date: End date for portfolio tracking
        
    Returns:
        DataFrame with columns: Date, PortfolioID, RiskLevel, TotalValue, 
                              StockWeight, BondWeight, CashWeight, MonthlyReturn
                              
    Portfolio Theory Implementation:
    - Risk-based asset allocation follows common rules of thumb (lower risk tolerance means more bonds and cash)
    - Returns follow normal distribution with risk-appropriate parameters
    - Allocation drift simulates real-world portfolio management
    - Performance is consistent with risk profile expectations
    """
    # Generate monthly date range for portfolio snapshots
    date_range = pd.date_range(start=start_date, end=end_date, freq='ME')  # Month end frequency
    
    print(f"💼 Generating portfolio data for {n_portfolios} portfolios")
    print(f"📅 Tracking {len(date_range)} monthly periods")
    print("📊 Modeling Conservative, Moderate, and Aggressive risk profiles")
    
    all_portfolio_data = []
    
    # Generate each portfolio's characteristics and evolution
    for portfolio_id in range(1, n_portfolios + 1):
        if portfolio_id % 20 == 0:  # Progress indicator
            print(f"  💼 Processing portfolio {portfolio_id}/{n_portfolios}")
        
        # Randomly assign risk level (realistic distribution)
        risk_level = np.random.choice(
            ['Conservative', 'Moderate', 'Aggressive'], 
            p=[0.3, 0.5, 0.2]  # Most clients are moderate risk
        )
        
        # Generate initial portfolio value (realistic range for different client types)
        if risk_level == 'Conservative':
            initial_value = np.random.uniform(500000, 5000000)  # Older, wealthier clients
        elif risk_level == 'Moderate':
            initial_value = np.random.uniform(100000, 2000000)  # Middle-market clients
        else:  # Aggressive
            initial_value = np.random.uniform(50000, 1000000)   # Younger, growth-oriented
        
        # Set risk-appropriate asset allocation and return expectations
        if risk_level == 'Conservative':
            # Capital preservation focus
            base_stock_weight = np.random.uniform(0.3, 0.5)    # 30-50% stocks
            base_bond_weight = np.random.uniform(0.4, 0.6)     # 40-60% bonds
            base_cash_weight = 1 - base_stock_weight - base_bond_weight
            expected_annual_return = 0.06  # 6% expected return
            annual_volatility = 0.08       # 8% volatility
            
        elif risk_level == 'Moderate':
            # Balanced growth and income
            base_stock_weight = np.random.uniform(0.5, 0.7)    # 50-70% stocks
            base_bond_weight = np.random.uniform(0.2, 0.4)     # 20-40% bonds
            base_cash_weight = 1 - base_stock_weight - base_bond_weight
            expected_annual_return = 0.08  # 8% expected return
            annual_volatility = 0.12       # 12% volatility
            
        else:  # Aggressive
            # Growth maximization
            base_stock_weight = np.random.uniform(0.7, 0.9)    # 70-90% stocks
            base_bond_weight = np.random.uniform(0.05, 0.2)    # 5-20% bonds
            base_cash_weight = 1 - base_stock_weight - base_bond_weight
            expected_annual_return = 0.10  # 10% expected return
            annual_volatility = 0.18       # 18% volatility
        
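        # Clamp cash at zero (the sampled stock and bond weights can add up to more than 100%),
        # then rescale so the three weights sum to 1.0
        base_cash_weight = max(0.0, base_cash_weight)
        total_weight = base_stock_weight + base_bond_weight + base_cash_weight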
        stock_weight = base_stock_weight / total_weight
        bond_weight = base_bond_weight / total_weight
        cash_weight = base_cash_weight / total_weight
        
        # Initialize portfolio tracking variables
        current_value = initial_value
        current_stock_weight = stock_weight
        current_bond_weight = bond_weight
        current_cash_weight = cash_weight
        
        # Generate monthly performance data
        for date in date_range:
            # Generate monthly return based on risk profile
            # Convert annual parameters to monthly
            monthly_expected_return = expected_annual_return / 12
            monthly_volatility = annual_volatility / np.sqrt(12)
            
            # Add a market-style shock scaled by equity exposure
            # (note: it is drawn separately for each portfolio and month, so it is not shared across portfolios)
            market_factor = np.random.normal(0, 0.02)  # Market-style shock
            idiosyncratic_return = np.random.normal(monthly_expected_return, monthly_volatility)
            monthly_return = idiosyncratic_return + market_factor * (stock_weight * 1.5)
            
            # Update portfolio value
            current_value *= (1 + monthly_return)
            
            # Simulate allocation drift (realistic portfolio management)
            # Weights drift slightly due to different asset class performance
            drift_std = 0.02  # 2% standard deviation for weight changes
            stock_drift = np.random.normal(0, drift_std)
            bond_drift = np.random.normal(0, drift_std)
            
            current_stock_weight += stock_drift
            current_bond_weight += bond_drift
            
            # Normalize weights to ensure they sum to 1.0
            total_current_weight = current_stock_weight + current_bond_weight + current_cash_weight
            current_stock_weight /= total_current_weight
            current_bond_weight /= total_current_weight
            current_cash_weight = 1 - current_stock_weight - current_bond_weight
            
            # Keep each weight between 5% and 95% (this also rules out negative weights)
            current_stock_weight = max(0.05, min(0.95, current_stock_weight))
            current_bond_weight = max(0.05, min(0.95, current_bond_weight))
            current_cash_weight = max(0.05, min(0.95, current_cash_weight))
            
            # Re-normalize after applying bounds
            total_bounded = current_stock_weight + current_bond_weight + current_cash_weight
            current_stock_weight /= total_bounded
            current_bond_weight /= total_bounded
            current_cash_weight = 1 - current_stock_weight - current_bond_weight
            
            # Add record to dataset
            all_portfolio_data.append({
                'Date': date,
                'PortfolioID': f'PF_{portfolio_id:03d}',
                'RiskLevel': risk_level,
                'TotalValue': round(current_value, 2),
                'StockWeight': round(current_stock_weight, 3),
                'BondWeight': round(current_bond_weight, 3),
                'CashWeight': round(current_cash_weight, 3),
                'MonthlyReturn': round(monthly_return, 4)
            })
    
    portfolio_df = pd.DataFrame(all_portfolio_data)
    print(f"✅ Generated {len(portfolio_df):,} portfolio records")
    print(f"📊 Data shape: {portfolio_df.shape}")
    
    # Display summary statistics by risk level
    print("\n📈 Portfolio Performance Summary by Risk Level:")
    for risk_level in ['Conservative', 'Moderate', 'Aggressive']:
        risk_data = portfolio_df[portfolio_df['RiskLevel'] == risk_level]
        avg_return = risk_data['MonthlyReturn'].mean() * 12 * 100  # Annualized %
        volatility = risk_data['MonthlyReturn'].std() * np.sqrt(12) * 100  # Annualized %
        avg_stock_weight = risk_data['StockWeight'].mean() * 100
        
        print(f"  {risk_level}: {avg_return:.1f}% return, {volatility:.1f}% volatility, {avg_stock_weight:.1f}% stocks")
    
    return portfolio_df

# Add method to the generator class
FinancialDataGenerator.generate_portfolio_data = generate_portfolio_data

# Test portfolio data generation
print("🧪 Testing portfolio data generation...")
test_portfolios = generator.generate_portfolio_data(
    n_portfolios=5,  # Small test set
    start_date='2024-01-01',
    end_date='2024-06-30'
)

print("\n💼 Sample Portfolio Data:")
print(test_portfolios.head())

# Show one portfolio's evolution over time
portfolio_1 = test_portfolios[test_portfolios['PortfolioID'] == 'PF_001']
print("\n📊 Portfolio PF_001 Evolution:")
print(portfolio_1[['Date', 'TotalValue', 'StockWeight', 'MonthlyReturn']].head())

print("✅ Portfolio data generation working correctly!")
print("💡 Data includes realistic risk-based allocations and performance patterns")

🧪 Testing portfolio data generation...
💼 Generating portfolio data for 5 portfolios
📅 Tracking 6 monthly periods
📊 Modeling Conservative, Moderate, and Aggressive risk profiles
✅ Generated 30 portfolio records
📊 Data shape: (30, 8)

📈 Portfolio Performance Summary by Risk Level:
  Conservative: 9.0% return, 9.1% volatility, 44.5% stocks
  Moderate: 7.7% return, 9.0% volatility, 57.1% stocks
  Aggressive: 66.3% return, 16.9% volatility, 77.2% stocks

💼 Sample Portfolio Data:
        Date PortfolioID RiskLevel  TotalValue  StockWeight  BondWeight  \
0 2024-01-31      PF_001  Moderate   801471.58        0.515       0.388   
1 2024-02-29      PF_001  Moderate   797281.44        0.519       0.386   
2 2024-03-31      PF_001  Moderate   826790.11        0.542       0.365   
3 2024-04-30      PF_001  Moderate   835570.01        0.538       0.371   
4 2024-05-31      PF_001  Moderate   861796.67        0.545       0.359   

   CashWeight  MonthlyReturn  
0       0.097         0.0473  
1       
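
In generate_portfolio_data, market_factor is drawn inside the inner loop, so each portfolio receives its own shock every month. If a market movement that is truly common to all portfolios is wanted, one option is to draw a single shock per date up front and reuse it. The sketch below illustrates that idea only; the portfolio IDs, stock weights, and the MarketShock column are hypothetical, and the return parameters are borrowed from the Moderate profile.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
dates = pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31"])

# One market shock per date, drawn once and shared by every portfolio.
market_shocks = pd.Series(rng.normal(0.0, 0.02, size=len(dates)), index=dates)

# Hypothetical portfolios: ID -> stock weight (equity exposure).
stock_weights = {"PF_001": 0.45, "PF_002": 0.80}

records = []
for portfolio_id, stock_weight in stock_weights.items():
    for date in dates:
        # Portfolio-specific part of the return (Moderate-profile parameters).
        own_return = rng.normal(0.08 / 12, 0.12 / np.sqrt(12))
        # Shared part: the same shock for this date, scaled by equity exposure.
        shock = market_shocks.loc[date]
        records.append({
            "Date": date,
            "PortfolioID": portfolio_id,
            "MarketShock": shock,
            "MonthlyReturn": own_return + shock * stock_weight * 1.5,
        })

returns_df = pd.DataFrame(records)
print(returns_df.round(4))
```

Because every portfolio sees the same shock on a given date, returns across portfolios become positively correlated, with higher-equity portfolios reacting more strongly.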

FinTech Customer Analytics and Risk Modeling
===========================================

Customer data is crucial for:
1. Credit Risk Assessment: Scoring models for lending decisions
2. Customer Segmentation: Targeted product offerings and marketing
3. Fraud Detection: Identifying unusual transaction patterns
4. Regulatory Compliance: KYC (Know Your Customer) requirements
5. Product Development: Understanding customer needs and behaviors

Key Variables We Model:
- Demographics: Age, income (realistic distributions)
- Credit Profile: Credit score, existing loans
- Account Behavior: Transaction frequency, amounts, tenure
- Product Usage: Number of products, account balance
- Risk Segmentation: High/Medium/Low risk classification

Statistical Distributions Used:
- Age: Normal distribution (mean=40, SD=15), clipped to 18-80
- Income: Log-normal distribution (reflects real-world income inequality), clipped to $20,000-$500,000
- Credit Score: Normal distribution (mean=700, SD=100) shifted by income, clipped to 300-850
- Transaction Behavior: Poisson distribution for count data
- Account Balance: Log-normal distribution

This supports:
- Credit scoring models
- Customer lifetime value analysis
- Churn prediction
- Fraud detection systems
- Regulatory reporting
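
The distributions listed above can be sampled in a few vectorized NumPy calls. This is a simplified sketch rather than the notebook's generator: it draws each variable independently (the function below adds correlations between them), it uses the newer np.random.default_rng API with an arbitrary seed, and the column names are chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility
n = 1_000

sample = pd.DataFrame({
    # Normal, clipped to the adult age range
    "Age": np.clip(rng.normal(40, 15, size=n), 18, 80).astype(int),
    # Log-normal: right-skewed, like real income data
    "Income": np.clip(rng.lognormal(mean=10.5, sigma=0.5, size=n), 20_000, 500_000),
    # Normal, clipped to the 300-850 score range
    "CreditScore": np.clip(rng.normal(700, 100, size=n), 300, 850).astype(int),
    # Poisson: non-negative integer counts
    "MonthlyTransactions": rng.poisson(lam=25, size=n),
    # Log-normal with a heavier tail than income
    "AccountBalance": rng.lognormal(mean=8, sigma=1.5, size=n),
})

print(sample.describe().round(1))
# Right skew shows up as a mean above the median.
print(sample["Income"].mean() > sample["Income"].median())
```

Comparing the mean and median of Income and AccountBalance is a quick way to confirm the right skew that the log-normal choice is meant to produce.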

In [15]:
def generate_customer_data(self, n_customers: int = 10000) -> pd.DataFrame:
    """
    Generate realistic customer demographic and account data for FinTech analysis.
    
    Creates a comprehensive customer database with correlated variables that
    reflect real-world patterns in financial services:
    - Income correlates with credit score and account balance
    - Age affects product usage and loan probability
    - Account tenure influences transaction behavior
    - Risk segmentation based on credit score thresholds
    
    Args:
        n_customers: Number of customer records to generate
        
    Returns:
        DataFrame with customer demographics, account details, and risk metrics
        
    Data Science Applications:
    - Feature engineering for ML models
    - Customer segmentation analysis  
    - Credit risk modeling
    - Behavioral analytics
    - A/B testing frameworks
    """
    print(f"👥 Generating customer data for {n_customers:,} customers")
    print("📊 Modeling realistic demographic and behavioral patterns")
    print("🎯 Creating data suitable for credit scoring, segmentation, and analytics")
    
    all_customer_data = []
    
    # Generate each customer's profile
    for customer_id in range(1, n_customers + 1):
        if customer_id % 1000 == 0:  # Progress indicator
            print(f"  👤 Processing customer {customer_id:,}/{n_customers:,}")
        
        # Demographics
        # Age: Normal distribution centered at 40 with realistic bounds
        age = np.random.normal(40, 15)
        age = max(18, min(80, int(age)))  # Legal age bounds
        
        # Income: Log-normal distribution (realistic income inequality)
        # This creates a right-skewed distribution like real income data
        raw_income = np.random.lognormal(10.5, 0.5)  # Median of exp(10.5), about $36,300, before clipping
        income = max(20000, min(500000, raw_income))  # Reasonable bounds
        
        # Credit Score: Normal distribution with income correlation
        # Higher income generally correlates with higher credit scores
        income_effect = (income - 60000) / 100000 * 50  # Income influence on credit
        base_credit_score = np.random.normal(700, 100)  # Base score
        credit_score = base_credit_score + income_effect
        credit_score = max(300, min(850, int(credit_score)))  # FICO range bounds
        
        # Account Details
        # Account age: Uniform distribution over the whole range (no bias toward newer accounts)
        account_age_days = np.random.randint(30, 2000)  # 1 month to ~5.5 years
        
        # Account Balance: Log-normal with correlation to income and credit score
        # Higher income and credit scores lead to higher balances
        income_factor = income / 60000  # Scale by a $60,000 reference income (above the simulated median of about $36,300)
        credit_factor = credit_score / 700  # Normalize around good credit
        balance_multiplier = (income_factor + credit_factor) / 2
        
        base_balance = np.random.lognormal(8, 1.5)  # Base log-normal distribution
        account_balance = max(0, base_balance * balance_multiplier)
        
        # Transaction Behavior
        # Monthly transactions: Poisson distribution (count data)
        # More active users tend to have higher incomes and longer tenure
        activity_factor = min(2.0, income / 50000 + account_age_days / 1000)
        base_transaction_rate = 25  # Average transactions per month
        monthly_transactions = np.random.poisson(base_transaction_rate * activity_factor)
        
        # Average transaction amount: Log-normal with income correlation
        base_transaction_amount = np.random.lognormal(4, 1)  # Base amount
        transaction_income_factor = max(0.5, income / 60000)
        avg_transaction_amount = base_transaction_amount * transaction_income_factor
        
        # Product Usage
        # Number of products: Poisson with income/age correlation
        # Older, wealthier customers typically have more products
        age_factor = max(0.5, age / 40)
        income_product_factor = max(0.5, income / 60000)
        product_rate = 2 * age_factor * income_product_factor
        num_products = max(1, np.random.poisson(product_rate))
        
        # Loan Status: Probability based on age, income, and credit score
        # Younger people and those with good credit more likely to have loans
        age_loan_factor = max(0.1, (40 - age) / 40)  # Younger = higher probability
        credit_loan_factor = max(0.1, (credit_score - 600) / 250)  # Better credit = higher prob
        loan_probability = 0.3 * age_loan_factor * credit_loan_factor
        has_loan = np.random.choice([True, False], p=[loan_probability, 1 - loan_probability])
        
        # Risk Segmentation based on credit score (illustrative thresholds; real lenders use varying cutoffs)
        if credit_score < 600:
            risk_segment = 'High'      # Subprime
        elif credit_score < 750:
            risk_segment = 'Medium'    # Near prime
        else:
            risk_segment = 'Low'       # Prime
        
        # Compile customer record
        all_customer_data.append({
            'CustomerID': f'CUST_{customer_id:06d}',
            'Age': age,
            'Income': round(income, 2),
            'CreditScore': credit_score,
            'AccountAgeDays': account_age_days,
            'AccountBalance': round(account_balance, 2),
            'MonthlyTransactions': monthly_transactions,
            'AvgTransactionAmount': round(avg_transaction_amount, 2),
            'NumProducts': num_products,
            'HasLoan': has_loan,
            'RiskSegment': risk_segment
        })
    
    customer_df = pd.DataFrame(all_customer_data)
    print(f"✅ Generated {len(customer_df):,} customer records")
    print(f"📊 Data shape: {customer_df.shape}")
    
    # Display summary statistics
    print("\n👥 Customer Demographics Summary:")
    print(f"  Average Age: {customer_df['Age'].mean():.1f} years")
    print(f"  Average Income: ${customer_df['Income'].mean():,.0f}")
    print(f"  Average Credit Score: {customer_df['CreditScore'].mean():.0f}")
    print(f"  Average Account Balance: ${customer_df['AccountBalance'].mean():,.0f}")
    
    print("\n📊 Risk Segment Distribution:")
    risk_counts = customer_df['RiskSegment'].value_counts()
    for segment in ['Low', 'Medium', 'High']:
        count = risk_counts.get(segment, 0)
        percentage = (count / len(customer_df)) * 100
        print(f"  {segment} Risk: {count:,} customers ({percentage:.1f}%)")
    
    print(f"\n💳 Loan Penetration: {customer_df['HasLoan'].sum():,} customers ({customer_df['HasLoan'].mean()*100:.1f}%)")
    
    return customer_df

# Add method to the generator class
FinancialDataGenerator.generate_customer_data = generate_customer_data

# Test customer data generation
print("🧪 Testing customer data generation...")
test_customers = generator.generate_customer_data(n_customers=1000)  # Smaller test set

print("\n👥 Sample Customer Data:")
print(test_customers.head())

# Show correlation analysis
print("\n📈 Income vs Credit Score Correlation:")
correlation = test_customers['Income'].corr(test_customers['CreditScore'])
print(f"Correlation coefficient: {correlation:.3f}")

# Risk segment analysis
print("\n🎯 Risk Segment Statistics:")
for segment in ['Low', 'Medium', 'High']:
    segment_data = test_customers[test_customers['RiskSegment'] == segment]
    if len(segment_data) > 0:
        avg_income = segment_data['Income'].mean()
        avg_credit = segment_data['CreditScore'].mean()
        avg_balance = segment_data['AccountBalance'].mean()
        print(f"  {segment}: Income=${avg_income:,.0f}, Credit={avg_credit:.0f}, Balance=${avg_balance:,.0f}")

print("✅ Customer data generation working correctly!")
print("💡 Data includes realistic correlations and distributions suitable for ML modeling")

🧪 Testing customer data generation...
👥 Generating customer data for 1,000 customers
📊 Modeling realistic demographic and behavioral patterns
🎯 Creating data suitable for credit scoring, segmentation, and analytics
  👤 Processing customer 1,000/1,000
✅ Generated 1,000 customer records
📊 Data shape: (1000, 11)

👥 Customer Demographics Summary:
  Average Age: 40.2 years
  Average Income: $41,936
  Average Credit Score: 692
  Average Account Balance: $7,453

📊 Risk Segment Distribution:
  Low Risk: 288 customers (28.8%)
  Medium Risk: 542 customers (54.2%)
  High Risk: 170 customers (17.0%)

💳 Loan Penetration: 28 customers (2.8%)

👥 Sample Customer Data:
    CustomerID  Age    Income  CreditScore  AccountAgeDays  AccountBalance  \
0  CUST_000001   53  32516.99          688            1918          610.48   
1  CUST_000002   42  21347.37          779            1023         5575.62   
2  CUST_000003   53  39070.54          741            1137        12302.13   
3  CUST_000004   30  28511.
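
The credit-score thresholds used for RiskSegment (below 600, 600 to 749, 750 and above) can also be applied to a whole column at once with pd.cut. The scores below are made-up values chosen to sit on the boundaries; this is a sketch of an alternative, not the code the notebook runs.

```python
import pandas as pd

# Made-up scores, including values exactly on the 600 and 750 thresholds.
scores = pd.Series([580, 600, 749, 750, 820], name="CreditScore")

# right=False makes each bin closed on the left: [300, 600), [600, 750), [750, 851).
# The upper edge is 851 so that a score of 850 still falls in the last bin.
segments = pd.cut(
    scores,
    bins=[300, 600, 750, 851],
    right=False,
    labels=["High", "Medium", "Low"],
)

print(pd.DataFrame({"CreditScore": scores, "RiskSegment": segments}))
print(segments.value_counts(normalize=True))
```

Checking boundary values such as 600 and 750 explicitly helps confirm that the bins match the strict less-than comparisons in the if/elif version.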

Production Data Pipeline and Export System
==========================================

This final cell brings together all our data generation methods into a 
production-ready pipeline that creates a complete FinTech dataset suitable
for all 11 weeks of our Agile development sprints.

Data Outputs:
1. stock_prices.csv - Daily OHLCV data for major stocks
2. crypto_prices.csv - 6-hourly cryptocurrency data  
3. economic_indicators.csv - Monthly macroeconomic indicators
4. portfolio_data.csv - Monthly portfolio holdings and performance
5. customer_data.csv - Customer demographics and behavior

File Format: CSV (Comma-Separated Values)
- Portable across all platforms and tools
- Easy to import into Python, R, SQL databases
- Human-readable for data inspection
- Compatible with Excel, Google Sheets

Data Governance:
- Consistent date formats (ISO 8601)
- Standardized column naming (PascalCase, e.g. PortfolioID, RiskLevel)
- Appropriate data types and precision
- Comprehensive metadata and documentation

In [18]:
# Cell 8: Complete Dataset Generation and Export System

def save_all_datasets(self, output_dir: str = 'mock_financial_data') -> Dict[str, pd.DataFrame]:
    """
    Generate and save all financial datasets to CSV files.
    
    This method orchestrates the complete data generation pipeline:
    1. Creates output directory structure
    2. Generates all dataset types with appropriate parameters
    3. Saves data in CSV format with proper encoding
    4. Provides comprehensive logging and error handling
    5. Returns datasets for immediate analysis
    
    Args:
        output_dir: Directory name for output files
        
    Returns:
        Dictionary containing all generated datasets
        
    Production Considerations:
    - Error handling for file system operations
    - Progress tracking for long-running operations  
    - Memory-efficient processing for large datasets
    - Consistent file naming conventions
    - UTF-8 encoding for international compatibility
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    print(f"📁 Output directory ready: {output_dir}")
    
    print("\n🏭 Starting complete financial dataset generation...")
    print("=" * 60)
    
    # Dictionary to store all generated datasets
    datasets = {}
    
    try:
        # 1. Generate Stock Price Data
        print("📈 Generating stock market data...")
        print("   - 20 major US stocks (S&P 500 components)")
        print("   - Daily OHLCV format, business days only")
        print("   - 5 years of historical data (2020-2024)")
        
        stock_data = self.generate_stock_prices(
            symbols=self.stock_symbols[:20],  # First 20 symbols, to keep generation fast
            start_date='2020-01-01',
            end_date='2024-12-31'
        )
        datasets['stock_prices.csv'] = stock_data
        print(f"   ✅ Generated {len(stock_data):,} stock price records")
        
        # 2. Generate Cryptocurrency Data  
        print("\n💎 Generating cryptocurrency data...")
        print("   - 10 major cryptocurrencies by market cap")
        print("   - 6-hourly OHLCV format, 24/7 trading")
        print("   - Higher volatility modeling")
        
        crypto_data = self.generate_crypto_prices(
            symbols=self.crypto_symbols[:10],  # Top 10 cryptos
            start_date='2020-01-01',
            end_date='2024-12-31'
        )
        datasets['crypto_prices.csv'] = crypto_data
        print(f"   ✅ Generated {len(crypto_data):,} crypto price records")
        
        # 3. Generate Economic Indicators
        print("\n🏛️ Generating macroeconomic data...")
        print("   - 10 key economic indicators")
        print("   - Monthly frequency with mean reversion")
        print("   - Realistic bounds and relationships")
        
        economic_data = self.generate_economic_data(
            start_date='2020-01-01',
            end_date='2024-12-31',
            frequency='ME'  # Month-end frequency
        )
        datasets['economic_indicators.csv'] = economic_data
        print(f"   ✅ Generated {len(economic_data):,} economic data points")
        
        # 4. Generate Portfolio Data
        print("\n💼 Generating portfolio management data...")
        print("   - 100 diversified investment portfolios")
        print("   - Risk-based asset allocation (Conservative/Moderate/Aggressive)")
        print("   - Monthly rebalancing and performance tracking")
        
        portfolio_data = self.generate_portfolio_data(
            n_portfolios=100,
            start_date='2020-01-01',
            end_date='2024-12-31'
        )
        datasets['portfolio_data.csv'] = portfolio_data
        print(f"   ✅ Generated {len(portfolio_data):,} portfolio records")
        
        # 5. Generate Customer Data
        print("\n👥 Generating customer demographics...")
        print("   - 10,000 realistic customer profiles")
        print("   - Credit scores, income, transaction behavior")
        print("   - Risk segmentation for analytics")
        
        customer_data = self.generate_customer_data(n_customers=10000)
        datasets['customer_data.csv'] = customer_data
        print(f"   ✅ Generated {len(customer_data):,} customer records")
        
        # Save all datasets to CSV files
        print(f"\n💾 Saving datasets to {output_dir}/...")
        for filename, dataframe in datasets.items():
            filepath = os.path.join(output_dir, filename)
            
            # Save with UTF-8 encoding and proper formatting
            dataframe.to_csv(filepath, index=False, encoding='utf-8')
            file_size_mb = os.path.getsize(filepath) / (1024 * 1024)
            
            print(f"   📄 Saved {filename}")
            print(f"      - {len(dataframe):,} rows × {len(dataframe.columns)} columns")
            print(f"      - File size: {file_size_mb:.1f} MB")
        
        print(f"\n🎉 All datasets successfully saved to '{output_dir}' directory!")
        
        # Generate data summary report
        total_records = sum(len(df) for df in datasets.values())
        total_size_mb = sum(os.path.getsize(os.path.join(output_dir, f)) for f in datasets.keys()) / (1024 * 1024)
        
        print("\n📊 Dataset Generation Summary:")
        print(f"   - Total records: {total_records:,}")
        print(f"   - Total file size: {total_size_mb:.1f} MB")
        print("   - Data coverage: 2020-2024 (5 years)")
        print("   - Datasets: Stocks, Crypto, Economic, Portfolio, Customer")
        
        return datasets
        
    except Exception as e:
        print(f"❌ Error during dataset generation: {str(e)}")
        print("🔧 Check file permissions and available disk space")
        raise

# Add the master method to our generator class
FinancialDataGenerator.save_all_datasets = save_all_datasets

# Execute the complete data generation pipeline
print("🚀 Starting complete FinTech dataset generation pipeline...")
print("⏱️  This may take 2-3 minutes depending on your system...")

# Generate all datasets
all_datasets = generator.save_all_datasets()

print("\n" + "=" * 70)
print("📋 DATASET PREVIEW AND VALIDATION")
print("=" * 70)

# Display sample data from each dataset for validation
for filename, dataframe in all_datasets.items():
    print(f"\n📊 {filename.upper().replace('_', ' ').replace('.CSV', '')}:")
    print("-" * 50)
    
    # Show first few rows
    print("Sample data:")
    print(dataframe.head(3).to_string(index=False))
    
    # Show shape and column names
    print(f"\nData Info:")
    print(f"  Shape: {dataframe.shape}")
    print(f"  Columns: {list(dataframe.columns)}")
    
    # Show data quality metrics
    missing_data = dataframe.isnull().sum().sum()
    print(f"  Missing values: {missing_data}")
    
    # Dataset-specific insights
    if 'stock_prices' in filename:
        unique_symbols = dataframe['Symbol'].nunique()
        date_range = f"{dataframe['Date'].min()} to {dataframe['Date'].max()}"
        print(f"  Symbols: {unique_symbols}, Date range: {date_range}")
        
    elif 'crypto_prices' in filename:
        unique_symbols = dataframe['Symbol'].nunique()
        timestamp_range = f"{dataframe['Timestamp'].min()} to {dataframe['Timestamp'].max()}"
        print(f"  Cryptocurrencies: {unique_symbols}, Time range: {timestamp_range}")
        
    elif 'economic_indicators' in filename:
        unique_indicators = dataframe['Indicator'].nunique()
        value_range = f"{dataframe['Value'].min():.2f} to {dataframe['Value'].max():.2f}"
        print(f"  Indicators: {unique_indicators}, Value range: {value_range}")
        
    elif 'portfolio_data' in filename:
        unique_portfolios = dataframe['PortfolioID'].nunique()
        risk_levels = dataframe['RiskLevel'].value_counts().to_dict()
        print(f"  Portfolios: {unique_portfolios}, Risk distribution: {risk_levels}")
        
    elif 'customer_data' in filename:
        risk_distribution = dataframe['RiskSegment'].value_counts().to_dict()
        avg_age = dataframe['Age'].mean()
        print(f"  Average age: {avg_age:.1f}, Risk segments: {risk_distribution}")

print("\n" + "=" * 70)
print("🎯 DATA GENERATION COMPLETE - READY FOR AGILE SPRINTS!")
print("=" * 70)

print("""
🚀 Your FinTech Data Pipeline is Ready!

Next Steps for Week 1 Sprint:
1. 📋 Complete your project charter (use this data for problem statement)
2. 👥 Form your squad (5 diverse team members)  
3. 📊 Set up JIRA board with first sprint tasks
4. 💾 Import raw data into PostgreSQL database
5. 🔗 Create your first SQL joins between datasets

Week 2 Preview - Data Wrangling Tasks:
• Clean and validate all CSV files
• Handle missing values and outliers  
• Create tidy data formats for analysis
• Build your first exploratory visualizations
• Generate data quality reports

🎓 Learning Objectives Achieved:
✅ Understand financial data structures (OHLCV, customer profiles)
✅ Apply statistical distributions to model real-world data
✅ Implement Object-Oriented Programming principles
✅ Create reproducible data pipelines with proper error handling
✅ Generate production-ready datasets for FinTech applications

Happy coding! 💻📈
""")

# Optional: Quick data validation checks
print("\n🔍 Running final data validation checks...")

# Check for data consistency
validation_passed = True

# Validate date ranges
stock_dates = pd.to_datetime(all_datasets['stock_prices.csv']['Date'])
if stock_dates.min().year != 2020 or stock_dates.max().year != 2024:
    print("⚠️  Stock date range validation failed")
    validation_passed = False

# Validate OHLC consistency (High >= Low, etc.)
stock_data = all_datasets['stock_prices.csv']
ohlc_valid = (
    (stock_data['High'] >= stock_data['Low']).all() and
    (stock_data['High'] >= stock_data['Open']).all() and  
    (stock_data['High'] >= stock_data['Close']).all() and
    (stock_data['Low'] <= stock_data['Open']).all() and
    (stock_data['Low'] <= stock_data['Close']).all()
)

if not ohlc_valid:
    print("⚠️  OHLC consistency validation failed")
    validation_passed = False
    
# Validate customer credit scores are in valid range
customer_data = all_datasets['customer_data.csv']
credit_valid = (
    (customer_data['CreditScore'] >= 300).all() and
    (customer_data['CreditScore'] <= 850).all()
)

if not credit_valid:
    print("⚠️  Credit score range validation failed")
    validation_passed = False

if validation_passed:
    print("✅ All validation checks passed!")
    print("🎉 Dataset is ready for use in your FinTech projects!")
else:
    print("⚠️  Some validation checks failed - please review the data")

print(f"\n📁 All files saved in: {os.path.abspath('mock_financial_data')}")
print("🎯 Ready to begin your 11-week FinTech development journey!")

🚀 Starting complete FinTech dataset generation pipeline...
⏱️  This may take 2-3 minutes depending on your system...
📁 Created output directory: mock_financial_data

🏭 Starting complete financial dataset generation...
📈 Generating stock market data...
   - 20 major US stocks (S&P 500 components)
   - Daily OHLCV format, business days only
   - 5 years of historical data (2020-2024)
📅 Generating stock data for 1305 trading days
📈 Creating price series for 20 symbols
  📊 Processing AAPL (1/20)
  📊 Processing GOOGL (2/20)
  📊 Processing MSFT (3/20)
  📊 Processing AMZN (4/20)
  📊 Processing TSLA (5/20)
  📊 Processing META (6/20)
  📊 Processing NVDA (7/20)
  📊 Processing JPM (8/20)
  📊 Processing BAC (9/20)
  📊 Processing V (10/20)
  📊 Processing MA (11/20)
  📊 Processing JNJ (12/20)
  📊 Processing PG (13/20)
  📊 Processing UNH (14/20)
  📊 Processing PFE (15/20)
  📊 Processing KO (16/20)
  📊 Processing WMT (17/20)
  📊 Processing DIS (18/20)
  📊 Processing HD (19/20)
  📊 Processing NKE (20/2
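
The OHLC check in the cell above reduces everything to a single pass/fail flag. When a check fails, it helps to see which rows are responsible. The following is a minimal sketch, assuming the `Open`, `High`, `Low`, and `Close` column names used above; `find_ohlc_violations` is a hypothetical helper, not part of `FinancialDataGenerator`, and the sample frame is hand-made for illustration.

```python
import pandas as pd


def find_ohlc_violations(prices: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that break an OHLC ordering rule (hypothetical helper)."""
    # High must be at least as large as every other price in the bar.
    high_ok = prices["High"] >= prices[["Open", "Close", "Low"]].max(axis=1)
    # Low must be no larger than every other price in the bar.
    low_ok = prices["Low"] <= prices[["Open", "Close", "High"]].min(axis=1)
    return prices.loc[~(high_ok & low_ok)]


# Tiny hand-made frame: the second row has High below Close on purpose.
sample = pd.DataFrame({
    "Symbol": ["AAPL", "AAPL"],
    "Open": [100.0, 101.0],
    "High": [102.0, 101.5],
    "Low": [99.0, 100.0],
    "Close": [101.0, 103.0],
})

violations = find_ohlc_violations(sample)
print(f"{len(violations)} of {len(sample)} rows violate OHLC ordering")
```

Against the generated data, the equivalent call would be `find_ohlc_violations(all_datasets['stock_prices.csv'])`. For data without missing values, an empty result corresponds to `ohlc_valid` being true.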

In [None]:
# Cell 9: Agile Sprint Integration Guide and Next Steps
"""
Integration with 11-Week FinTech Development Sprints
===================================================

This data generator supports all phases of your Agile FinTech project:

🏃‍♂️ SPRINT ROADMAP:

Week 1 - Startup Onboarding:
• Use generated data in your project charter problem statement
• Import CSVs into PostgreSQL for SQL practice  
• Form diverse squads with complementary skills

Week 2 - Data Wrangling:
• Clean and validate all generated datasets
• Practice pandas/polars operations on realistic data
• Create data quality reports and visualizations

Week 3 - Classical Econometrics:
• Run OLS regressions on stock prices vs economic indicators
• Test cointegration between BTC and ETH prices
• Build CAPM models using generated returns

Week 4 - Time Series Forecasting:
• Forecast stock prices using ARIMA models
• Apply Prophet to cryptocurrency data
• Compare forecasting performance across assets

Week 5 - ML & Regularization:
• Use customer data for credit scoring models
• Apply Ridge/Lasso to portfolio optimization
• Implement prompt-assisted feature engineering

Week 6 - Tree Ensembles:
• Build fraud detection models with customer transaction data
• Use Random Forest for portfolio risk classification
• Apply SHAP for model interpretability

Week 7 - Dimensionality Reduction:
• Apply PCA to economic indicators
• Cluster customers by behavior patterns
• Factor analysis of portfolio returns

Week 8 - Deep Learning:
• LSTM forecasting of crypto prices (24/7 data advantage)
• RNN for sequential customer behavior modeling
• Neural networks for portfolio optimization

Week 9 - Portfolio Analytics:
• Backtest trading strategies using generated price data
• Risk-parity portfolio construction
• Performance attribution analysis

Week 10 - Model Audit:
• Cross-validation on all generated datasets
• Stress testing with economic scenario analysis
• Model validation frameworks

Week 11 - Productization:
• Deploy models as FastAPI services
• Create dashboards using generated data
• Final board presentation with real insights

🔧 TECHNICAL INTEGRATION TIPS:
"""

# Helper functions for ongoing sprint work
def quick_data_loader(dataset_name: str, data_dir: str = 'mock_financial_data') -> pd.DataFrame:
    """
    Quick loader for sprint work - use in subsequent notebooks.
    
    Args:
        dataset_name: One of 'stocks', 'crypto', 'economic', 'portfolio', 'customer'
        data_dir: Directory containing the CSV files
        
    Returns:
        Loaded DataFrame; date columns are parsed and rows sorted where applicable
    """
    
    filename_map = {
        'stocks': 'stock_prices.csv',
        'crypto': 'crypto_prices.csv', 
        'economic': 'economic_indicators.csv',
        'portfolio': 'portfolio_data.csv',
        'customer': 'customer_data.csv'
    }
    
    if dataset_name not in filename_map:
        raise ValueError(f"Dataset must be one of: {list(filename_map.keys())}")
    
    filepath = os.path.join(data_dir, filename_map[dataset_name])
    
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Data file not found: {filepath}")
    
    # Load the raw CSV; date columns are parsed per dataset below
    df = pd.read_csv(filepath)
    
    # Apply basic preprocessing based on dataset type
    if dataset_name in ['stocks', 'crypto']:
        # Parse date columns
        date_col = 'Date' if dataset_name == 'stocks' else 'Timestamp'
        df[date_col] = pd.to_datetime(df[date_col])
        df = df.sort_values([date_col, 'Symbol']).reset_index(drop=True)
        
    elif dataset_name == 'economic':
        df['Date'] = pd.to_datetime(df['Date'])
        df = df.sort_values(['Date', 'Indicator']).reset_index(drop=True)
        
    elif dataset_name == 'portfolio':
        df['Date'] = pd.to_datetime(df['Date'])  
        df = df.sort_values(['Date', 'PortfolioID']).reset_index(drop=True)
    
    print(f"✅ Loaded {dataset_name} data: {df.shape}")
    return df

def create_sprint_workspace(sprint_number: int, sprint_name: str):
    """
    Create organized workspace for each sprint.
    
    Args:
        sprint_number: Week number (1-11)
        sprint_name: Descriptive name for the sprint
    """
    
    workspace_dir = f"sprint_{sprint_number:02d}_{sprint_name.lower().replace(' ', '_')}"
    
    # Create directory structure
    dirs_to_create = [
        workspace_dir,
        os.path.join(workspace_dir, 'notebooks'),
        os.path.join(workspace_dir, 'data'),
        os.path.join(workspace_dir, 'models'),
        os.path.join(workspace_dir, 'reports'),
        os.path.join(workspace_dir, 'src')
    ]
    
    for dir_path in dirs_to_create:
        os.makedirs(dir_path, exist_ok=True)
    
    # Create template files
    readme_content = f"""# Sprint {sprint_number}: {sprint_name}

## Objectives
- [Add your sprint objectives here]

## Datasets Used
- Stock prices: Daily OHLCV data
- Crypto prices: 6-hourly data
- Economic indicators: Monthly data
- Portfolio data: Monthly holdings
- Customer data: Demographics and behavior

## Deliverables
- [ ] Jupyter notebook with analysis
- [ ] Data quality report
- [ ] Model/analysis results
- [ ] Brief presentation slides

## Team Members
- Product Owner: [Name]
- Scrum Master: [Name] 
- Developers: [Names]

## Sprint Retrospective
- What went well:
- What could be improved:
- Action items for next sprint:
"""
    
    with open(os.path.join(workspace_dir, 'README.md'), 'w', encoding='utf-8') as f:
        f.write(readme_content)
    
    print(f"📁 Created sprint workspace: {workspace_dir}")
    print("📋 Template README.md created")
    print(f"🎯 Ready for Sprint {sprint_number}: {sprint_name}")

def generate_data_dictionary():
    """Generate comprehensive data dictionary for all datasets."""
    
    data_dict = {
        'stock_prices.csv': {
            'description': 'Daily stock price data in OHLCV format',
            'columns': {
                'Date': 'Trading date (business days only)',
                'Symbol': 'Stock ticker symbol',
                'Open': 'Opening price (USD)',
                'High': 'Highest price during trading day (USD)',
                'Low': 'Lowest price during trading day (USD)',
                'Close': 'Closing price (USD)',
                'Volume': 'Number of shares traded'
            },
            'frequency': 'Daily (business days)',
            'date_range': '2020-01-01 to 2024-12-31'
        },
        
        'crypto_prices.csv': {
            'description': '6-hourly cryptocurrency price data',
            'columns': {
                'Timestamp': 'Trading timestamp (6-hour intervals)',
                'Symbol': 'Cryptocurrency symbol',
                'Open': 'Opening price (USD)',
                'High': 'Highest price during 6-hour period (USD)',
                'Low': 'Lowest price during 6-hour period (USD)',
                'Close': 'Closing price (USD)',
                'Volume': 'Number of units traded'
            },
            'frequency': '6-hourly (24/7 trading)',
            'date_range': '2020-01-01 to 2024-12-31'
        },
        
        'economic_indicators.csv': {
            'description': 'Monthly macroeconomic indicators',
            'columns': {
                'Date': 'Month-end date',
                'Indicator': 'Economic indicator name',
                'Value': 'Indicator value (units vary by indicator)'
            },
            'frequency': 'Monthly',
            'indicators': [
                'GDP_GROWTH: GDP growth rate (%)',
                'INFLATION_RATE: CPI inflation rate (%)',
                'UNEMPLOYMENT_RATE: Unemployment rate (%)',
                'INTEREST_RATE: Federal funds rate (%)',
                'CONSUMER_CONFIDENCE: Consumer confidence index',
                'RETAIL_SALES: Retail sales growth (%)',
                'INDUSTRIAL_PRODUCTION: Industrial production growth (%)',
                'HOUSING_STARTS: Housing starts (thousands of units)',
                'TRADE_BALANCE: Trade balance (million USD)',
                'MONEY_SUPPLY: M2 money supply (billion USD)'
            ]
        },
        
        'portfolio_data.csv': {
            'description': 'Monthly portfolio holdings and performance',
            'columns': {
                'Date': 'Month-end date',
                'PortfolioID': 'Unique portfolio identifier',
                'RiskLevel': 'Risk profile (Conservative/Moderate/Aggressive)',
                'TotalValue': 'Total portfolio value (USD)',
                'StockWeight': 'Fraction of portfolio allocated to stocks (0-1)',
                'BondWeight': 'Fraction of portfolio allocated to bonds (0-1)',
                'CashWeight': 'Fraction of portfolio held in cash (0-1)',
                'MonthlyReturn': 'Monthly return rate (decimal)'
            },
            'frequency': 'Monthly',
            'portfolios': 100
        },
        
        'customer_data.csv': {
            'description': 'Customer demographics and account information',
            'columns': {
                'CustomerID': 'Unique customer identifier',
                'Age': 'Customer age (years)',
                'Income': 'Annual income (USD)',
                'CreditScore': 'FICO credit score (300-850)',
                'AccountAgeDays': 'Days since account opening',
                'AccountBalance': 'Current account balance (USD)',
                'MonthlyTransactions': 'Average monthly transaction count',
                'AvgTransactionAmount': 'Average transaction amount (USD)',
                'NumProducts': 'Number of financial products held',
                'HasLoan': 'Boolean - customer has active loan',
                'RiskSegment': 'Risk classification (High/Medium/Low)'
            },
            'frequency': 'Cross-sectional (one record per customer)',
            'customers': 10000
        }
    }
    
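    # Save data dictionary as JSON (create the folder first in case the
    # generation cell has not been run yet)
    import json
    os.makedirs('mock_financial_data', exist_ok=True)
    with open('mock_financial_data/data_dictionary.json', 'w', encoding='utf-8') as f:
        json.dump(data_dict, f, indent=2)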
    
    print("📖 Data dictionary created: mock_financial_data/data_dictionary.json")
    return data_dict

# Demo the helper functions
print("🛠️ SPRINT INTEGRATION TOOLS READY!")
print("=" * 50)

# Test the quick loader
print("\n📊 Testing quick data loader...")
try:
    sample_stocks = quick_data_loader('stocks')
    print(f"   Sample stock data loaded: {sample_stocks.shape}")
    print(f"   Date range: {sample_stocks['Date'].min()} to {sample_stocks['Date'].max()}")
    print(f"   Symbols: {sorted(sample_stocks['Symbol'].unique())}")
except Exception as e:
    print(f"   ⚠️ Could not load stock data (has the generation cell been run?): {e}")

# Create sample sprint workspace
print("\n📁 Creating sample sprint workspace...")
create_sprint_workspace(1, "Startup Onboarding")

# Generate data dictionary
print("\n📖 Generating data dictionary...")
data_dict = generate_data_dictionary()

print("""
🎯 READY FOR AGILE DEVELOPMENT!

Your FinTech data pipeline is complete and ready to support all 11 sprints.

Key Integration Points:
✅ Realistic financial data across all major asset classes
✅ Production-ready CSV formats for easy import
✅ Helper functions for sprint-specific data loading
✅ Workspace templates for organized development
✅ Comprehensive data dictionary for reference

Next Actions:
1. 📋 Complete your project charter using this data
2. 👥 Form your cross-functional squad
3. 🎯 Set up JIRA board with Sprint 1 tasks
4. 💾 Import data into PostgreSQL database
5. 📈 Begin exploratory data analysis

Good luck with your FinTech development journey! 🚀
""")

# Final memory cleanup
import gc
gc.collect()
print("\n🔧 Memory cleanup completed - ready for next sprint!")
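
Several roadmap items above (the Week 3 regressions and CAPM models, the Week 4 forecasts) work on returns rather than raw prices. A minimal sketch of that first step follows, assuming a stock frame with the `Date`, `Symbol`, and `Close` columns listed in the data dictionary. `add_log_returns` is a hypothetical helper, not part of the generator, and the sample prices are hand-made for illustration.

```python
import numpy as np
import pandas as pd


def add_log_returns(prices: pd.DataFrame, price_col: str = "Close") -> pd.DataFrame:
    """Append a per-symbol log return column (hypothetical helper)."""
    out = prices.sort_values(["Symbol", "Date"]).reset_index(drop=True)
    # diff() within each symbol keeps one ticker's last price from leaking
    # into the next ticker's first return.
    out["LogReturn"] = np.log(out[price_col]).groupby(out["Symbol"]).diff()
    return out


# Hand-made prices for illustration only; real use would pass the frame
# returned by quick_data_loader('stocks') instead.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-02", "2020-01-03", "2020-01-03"]),
    "Symbol": ["AAPL", "MSFT", "AAPL", "MSFT"],
    "Close": [100.0, 200.0, 110.0, 190.0],
})

returns = add_log_returns(sample)
print(returns[["Date", "Symbol", "Close", "LogReturn"]])
```

The first row of each symbol has no previous price, so its `LogReturn` is `NaN`; drop or ignore those rows before fitting a model.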