# Data Ingestion & Master Feature Store

This notebook serves as the entry point for the project's data lifecycle. It performs the following steps:
1. **Download**: Fetches historical stock and crypto data using `yfinance`.
2. **Raw Storage**: Casts numeric columns to `float32` to reduce memory usage and saves the raw panel to `data/raw/stock_data_raw.parquet`.
3. **Universal Feature Engineering**: Computes all features required for Stages 1 through 7, including:
    - Lagged returns (Classical ML)
    - Intraday OHLCV features (Deep Learning/Transformers)
    - Rolling volatility & volume z-scores (Probabilistic/Amazon Chronos)
4. **Processed Storage**: Saves the final engineered dataset to `data/processed/stock_data_processed.parquet`.

**Note for GitHub Users**: Run this notebook once to populate the `data/` directory before running any subsequent Stage notebooks.
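
As a minimal sketch of how a later Stage notebook might guard against a missing `data/` directory, the helper below fails early with a pointer back to this notebook. The name `require_processed_data` and the constant `PROCESSED_PATH` are hypothetical and not part of the project; the path assumes the notebook runs one level below the repository root, as the `../data/...` paths in this notebook suggest.

```python
from pathlib import Path

# Assumed location of the processed panel, relative to the notebook directory.
PROCESSED_PATH = Path("../data/processed/stock_data_processed.parquet")


def require_processed_data(path: Path = PROCESSED_PATH) -> Path:
    """Return `path` if the file exists; otherwise raise a descriptive error."""
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Run the data ingestion notebook once to populate data/."
        )
    return path
```

A downstream notebook could then load the panel with `pd.read_parquet(require_processed_data())`.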

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import os
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

# =============================================================================
# Configuration
# =============================================================================
START_DATE  = "2010-01-01"
END_DATE    = "2024-12-31"   # yfinance treats `end` as exclusive: 2024-12-31 itself is not downloaded
ROLLING_WIN = 20

# 12 tickers: 10 stocks across 5 sectors + market ETF + crypto
TICKERS = [
    "AAPL", "MSFT", "NVDA",       # Tech
    "JPM",  "GS",                  # Finance
    "JNJ",  "UNH",                 # Healthcare
    "KO",   "MCD",                 # Consumer
    "XOM",                         # Energy
    "SPY",                         # Market ETF (feature only)
    "BTC-USD",                     # Crypto
]

# Ensure directories exist
os.makedirs("../data/raw", exist_ok=True)
os.makedirs("../data/processed", exist_ok=True)

## 1. Download & Save Raw Data
Each ticker is downloaded separately and its numeric columns are cast to `float32` immediately. The per-ticker frames are then stacked into one long-format panel with a `ticker` column and saved to `data/raw/stock_data_raw.parquet`.
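
To illustrate the effect of the `float32` cast, here is a small sketch on synthetic data (the column values are made up, not downloaded). It also shows one caveat worth knowing: `float32` represents integers exactly only up to 2**24, so very large volume figures are rounded slightly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a downloaded OHLCV frame (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "close": rng.normal(100.0, 5.0, size=1_000),                           # float64
    "volume": rng.integers(1_000, 1_000_000, size=1_000, dtype=np.int64),  # int64
})
before = int(toy.memory_usage(index=False).sum())

# Downcast every numeric column, integers included, to float32.
num_cols = toy.select_dtypes(include="number").columns
toy32 = toy.astype({col: "float32" for col in num_cols})
after = int(toy32.memory_usage(index=False).sum())

print(f"{before} bytes -> {after} bytes")  # 16000 bytes -> 8000 bytes

# float32 has a 24-bit significand, so integers above 2**24 are rounded.
print(np.float32(16_777_217) == np.float32(16_777_216))  # True
```

The rounding error is tiny relative to the value, which is usually acceptable for features such as log volume.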

In [None]:
raw_dict = {}
for ticker in tqdm(TICKERS, desc="Downloading"):
    df = yf.download(
        ticker, start=START_DATE, end=END_DATE,
        interval="1d", auto_adjust=False, progress=False,
    )
    if df.empty:
        print(f"  ⚠ {ticker}: empty — skipped")
        continue
    
    # Handle MultiIndex columns from newer yfinance versions
    if isinstance(df.columns, pd.MultiIndex):
        df = df.droplevel(1, axis=1)
    
    df.columns = df.columns.str.lower()
    df.index.name = "date"
    
    # Cast all numeric columns (integer volume included) to float32 for memory efficiency
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].astype("float32")
    
    raw_dict[ticker] = df

print(f"  ✓ {len(raw_dict)}/{len(TICKERS)} tickers downloaded")

raw_panel = pd.concat([df.assign(ticker=t) for t, df in raw_dict.items()])
raw_panel.to_parquet("../data/raw/stock_data_raw.parquet", index=True)
print("  ✓ Raw data saved to data/raw/stock_data_raw.parquet")

## 2. Universal Feature Engineering
We compute a wide array of features to satisfy the requirements of all modeling stages (Baselines, LSTM, Transformers, Probabilistic, and Latent Analysis).
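
The lagged and rolling features rely on `shift(1)` so that the value stored at date *t* depends only on data through *t-1*. A toy sketch with made-up prices shows the pattern; the window `TOY_WIN = 2` is an assumption chosen to keep the example short, whereas the notebook uses `ROLLING_WIN = 20`.

```python
import numpy as np
import pandas as pd

TOY_WIN = 2  # illustrative window only

# Four made-up adjusted closes on consecutive business days.
toy = pd.DataFrame(
    {"adj close": [100.0, 110.0, 121.0, 108.9]},
    index=pd.date_range("2024-01-02", periods=4, freq="B"),
)

# Daily log return: the first row has no previous price, so it is NaN.
toy["log_return"] = np.log(toy["adj close"] / toy["adj close"].shift(1))

# The row at date t holds the return realized at t-1.
toy["ret_lag1"] = toy["log_return"].shift(1)

# Rolling volatility over the previous TOY_WIN returns, excluding today's.
toy["roll_vol"] = toy["log_return"].rolling(TOY_WIN).std().shift(1)

print(toy)
```

Rows without enough history come out as NaN, which is why the notebook drops incomplete rows before saving.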

In [None]:
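# Align dates: keep only days on which every ticker has data
# (drops BTC-USD weekend bars and any dates before the latest-starting ticker)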
common_dates = sorted(set.intersection(*(set(d.index) for d in raw_dict.values())))
aligned = {t: d.loc[common_dates].copy() for t, d in raw_dict.items()}
print(f"  ✓ {len(common_dates)} common trading days")

pieces = []
for ticker, df in aligned.items():
    d = df.copy()

    # --- Block A: Target & Lagged Returns (Classical ML) ---
    d["log_return"] = np.log(d["adj close"] / d["adj close"].shift(1))
    d["ret_lag1"] = d["log_return"].shift(1)
    d["ret_lag2"] = d["log_return"].shift(2)
    d["ret_lag5"] = d["log_return"].shift(5)

    # --- Block B: Intraday & OHLCV Features (DL & Transformers) ---
    d["oc_return"] = (d["close"] - d["open"]) / d["open"]
    d["hl_range"]  = (d["high"] - d["low"]) / d["close"]
    d["close_pos"] = (d["close"] - d["low"]) / (d["high"] - d["low"] + 1e-8)
    d["log_vol"]   = np.log1p(d["volume"])
    d["vol_change"] = np.log((d["volume"] + 1) / (d["volume"].shift(1) + 1))

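    # --- Block C: Rolling Statistics (Risk & Uncertainty) ---
    # shift(1) lags each rolling statistic by one day, so the value at date t uses data through t-1 only
    d["roll_vol"] = d["log_return"].rolling(ROLLING_WIN).std().shift(1)
    d["range_norm"] = d["hl_range"].rolling(ROLLING_WIN).mean().shift(1)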
    
    # Volume z-score
    v_mu  = d["log_vol"].rolling(ROLLING_WIN).mean()
    v_sig = d["log_vol"].rolling(ROLLING_WIN).std()
    d["vol_zscore"] = ((d["log_vol"] - v_mu) / (v_sig + 1e-8)).shift(1)

    d["ticker"] = ticker
    pieces.append(d)

panel = pd.concat(pieces).reset_index().set_index(["date", "ticker"]).sort_index()

# --- Block D: Market Alignment (SPY Return) ---
if "SPY" in raw_dict:
    spy_lr = (
        panel.xs("SPY", level="ticker")["log_return"]
        .shift(1)
        .rename("mkt_return")
        .reset_index()
    )
    panel = (
        panel.reset_index()
        .merge(spy_lr, on="date", how="left")
        .set_index(["date", "ticker"])
    )

# Drop warm-up rows that lack a full lag/rolling history (NaN in any feature)
panel.dropna(inplace=True)

# Final memory optimization
float64_cols = panel.select_dtypes(include=['float64']).columns
panel[float64_cols] = panel[float64_cols].astype('float32')

panel.to_parquet("../data/processed/stock_data_processed.parquet", index=True)
print(f"  ✓ Final Master Panel shape: {panel.shape}")
print("  ✓ Master features saved to data/processed/stock_data_processed.parquet")
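
Block D above attaches the previous day's SPY log return to every ticker on the same date. The toy sketch below reproduces that merge on a made-up two-ticker panel; the ticker `AAA` and all return values are hypothetical.

```python
import pandas as pd

# Made-up panel: three business days, one hypothetical stock plus SPY.
idx = pd.MultiIndex.from_product(
    [pd.date_range("2024-01-02", periods=3, freq="B"), ["AAA", "SPY"]],
    names=["date", "ticker"],
)
toy = pd.DataFrame(
    {"log_return": [0.01, 0.002, -0.02, 0.004, 0.03, -0.001]}, index=idx
)

# Lag the SPY return by one day and name it mkt_return.
mkt = (
    toy.xs("SPY", level="ticker")["log_return"]
    .shift(1)
    .rename("mkt_return")
    .reset_index()
)

# A left merge on date broadcasts the market return to every ticker on that date.
merged = (
    toy.reset_index()
    .merge(mkt, on="date", how="left")
    .set_index(["date", "ticker"])
)
print(merged)
```

On the first date `mkt_return` is NaN for every ticker, since there is no earlier SPY return to lag from.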