# Regime-Aware Multifactor + ML/RL Alpha Engine with Backward & Forward Testing

## Project Description

This project is a **modular trading research system** designed to generate **pure alpha** (market-neutral returns independent of market beta) by combining **proven multifactor investing principles** with **modern machine learning and reinforcement learning techniques**, and testing them with **rigorous statistical validation**.

The strategy’s profit engine comes from exploiting cross-sectional mispricings in a broad large-cap U.S. universe (**S&P 500 training set with dynamic top-N selection by confidence**) by identifying which stocks are likely to outperform or underperform others over the next 5–10 days. This is achieved through:

- **Multifactor Alpha Layer:**  
  - **Value** (cheap stocks with potential to mean-revert up)  
  - **Momentum** (stocks in persistent trends)  
  - **Quality** (financially strong, operationally robust companies)  
  - Per-regime factor blending with shrinkage to avoid overfitting.

- **Machine Learning Overlays:**  
  - **LSTM** (sequence model) to capture time-series patterns in returns, volatility, and technicals.  
  - **LightGBM/XGBoost/MLP** (tabular models) to detect nonlinear interactions in cross-sectional features.  
  - **Stacking meta-learner** to optimally blend factor and ML outputs.  
  - **Uncertainty quantification** via MC-dropout and quantile models to control position sizing.

- **Regime Detection:**  
  - Hidden Markov Model (HMM) to classify markets as **Risk-On**, **Risk-Off**, or **Transition**, adjusting model weights and risk accordingly.

- **Portfolio Construction & Risk Management:**  
  - **Black–Litterman optimization** to integrate model views with market-implied returns.  
  - **Risk parity** to balance sector/factor exposures.  
  - **Dynamic hedging** against SPY/sector ETFs to maintain market neutrality.

- **Reinforcement Learning (PPO):**  
  - Learns a sizing and hedging policy that adapts risk-taking to forecast strength, uncertainty, and current market regime, maximizing return per unit of tail risk (CVaR-aware reward).

## Testing & Validation

The project integrates **both backward and forward testing** to ensure robustness:

- **Backward Testing (Historical):**  
  - Walk-forward analysis with purged cross-validation to avoid look-ahead bias.  
  - Statistical significance tests (Diebold–Mariano, SPA/White Reality Check) to confirm non-randomness.  
  - Monte Carlo block bootstrap to estimate confidence intervals and failure probabilities.  
  - VaR/CVaR analysis and stress testing against historical crisis scenarios.

- **Forward Testing (Shadow, No Trades):**  
  - Daily simulation using only forward data, logging PnL and risk metrics without sending orders.  
  - Weekly retraining and monthly auto-generated tear sheets to track live performance against backtest expectations.  
  - Recommended forward-testing period: 4–12 weeks before considering paper/live execution.

## Goal

The system’s goal is to produce **consistent, statistically validated alpha** with low correlation to the market and controlled drawdowns, using a combination of **factor investing, machine learning, and reinforcement learning**. This approach maximizes the probability of sustainable profitability before any real capital is risked.



# Objectives & Success Criteria
- Primary objective: Generate statistically significant pure alpha (market-neutral) with controlled drawdowns after transaction costs.

- Secondary objective: Build a repeatable process capable of ongoing, unattended forward testing that outputs monthly tear sheets.

- Pass/Fail gates (OO-S):
  - Annualized Sharpe ≥ 1.0 (cost-adjusted) across walk-forward windows.
  - SPA/White Reality Check non-rejection vs family of alternatives at 5–10% level.
  - Max DD ≤ 15–20% (tunable) in backtests.
  - Forward test (4–8+ weeks): positive return, rolling Sharpe > 0.8, tail losses consistent with backtest VaR/CVaR.



# 1. Data & Universe

In [6]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(torch.cuda.get_device_name(0))


Using device: cpu


In [7]:
%pip -q install yfinance pandas numpy PyYAML pyarrow statsmodels tenacity

In [8]:
# ============================================================
# 1.1 UNIVERSE (UPDATED)
# S&P 500 training set with dynamic top-N selection by confidence (later in pipeline).
# Hedging instruments: SPY + sector ETFs.
# Source: Yahoo Finance (daily bars). Lookback from 2006-01-01 to today.
# Saves: universe.csv and raw_prices.parquet (OHLCV + Adj Close for all tickers incl. hedges + ^VIX)
# ============================================================

import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime

START_DATE = "2006-01-01"
END_DATE = datetime.today().strftime("%Y-%m-%d")

def to_fmp_symbol(sym: str) -> str:
    # map Yahoo/WSJ style class tickers to FMP
    return sym.replace("-", ".") if "-" in sym else sym

def is_index_like(sym: str) -> bool:
    # skip ^VIX and other index-style series for FMP backfill
    return sym.startswith("^")

# --- Get S&P 500 constituents from Wikipedia (survivorship bias acknowledged) ---
sp500_url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(sp500_url)
sp500 = tables[0]  # first table
tickers_raw = sp500["Symbol"].tolist()
# Some tickers on Wikipedia have periods; yfinance uses dashes for certain cases
tickers = [t.replace(".", "-") for t in tickers_raw]

# --- Hedging instruments (market & sector ETFs) ---
hedges = ["SPY", "XLY", "XLF", "XLV", "XLK", "XLI", "XLE", "XLP", "XLB", "XLU", "XLRE"]
context_symbols = ["^VIX"]  # market context series

universe = sorted(set(tickers))
universe_all = sorted(set(universe + hedges + context_symbols))

# --- Save universe to CSV ---
pd.DataFrame({"ticker": universe}).to_csv(f"universe_{END_DATE}.csv", index=False)
pd.DataFrame({"ticker": universe}).to_csv("universe.csv", index=False)  # pointer


# --- Download daily OHLCV for all symbols ---
# yfinance handles adjusted prices; we’ll keep both Close & Adj Close.
data = yf.download(
    universe_all,
    start=START_DATE,
    end=END_DATE,
    auto_adjust=False,
    group_by="ticker",
    progress=False,
    threads=True,
)

if data is None or getattr(data, "empty", False):
    raise RuntimeError("yfinance returned no data — try rerunning or chunking the request.")

def top_level_symbols(df):
    # Handles both MultiIndex (normal multi-ticker) and flat columns (edge cases)
    if isinstance(df.columns, pd.MultiIndex):
        return set(df.columns.get_level_values(0))
    # flat columns -> we can only have one symbol; yfinance puts OHLCV names as columns
    return set()  # treat as empty to trigger backfill logic safely

# added: tells us if yfinance skipped any tickers
available = top_level_symbols(data)
missing = [sym for sym in universe_all if sym not in available]
if missing:
    pd.Series(missing, name="missing_symbols").to_csv("missing_symbols.csv", index=False)
    print(f"WARNING: {len(missing)} symbols missing from download. Saved to missing_symbols.csv")

# Normalize to tidy format: MultiIndex -> long DataFrame
frames = []
if isinstance(data.columns, pd.MultiIndex):
    for sym in universe_all:
        if sym not in available:
            continue
        df = data[sym].copy()
        df.columns = [c.lower().replace(" ", "_") for c in df.columns]
        df["ticker"] = sym
        frames.append(df.reset_index().rename(columns={"Date": "date"}))
else:
    # Edge: flat columns — shouldn't happen with many symbols, but keep it safe
    df = data.copy()
    df.columns = [c.lower().replace(" ", "_") for c in df.columns]
    df["ticker"] = universe_all[0]
    frames.append(df.reset_index().rename(columns={"Date": "date"}))

prices = pd.concat(frames, ignore_index=True).sort_values(["ticker", "date"])
prices["date"] = pd.to_datetime(prices["date"])

# Basic sanity: drop rows with all NaNs for OHLCV
keep_cols = ["open", "high", "low", "close", "adj_close", "volume"]
prices = prices.dropna(subset=keep_cols, how="all")

# Save raw prices
prices.to_parquet("raw_prices.parquet", index=False)

print(f"Universe size (S&P 500): {len(universe)} tickers")
print(f"Total symbols incl. hedges/context: {len(universe_all)}")
print("Saved: universe.csv, raw_prices.parquet")


Universe size (S&P 500): 503 tickers
Total symbols incl. hedges/context: 515
Saved: universe.csv, raw_prices.parquet


In [9]:
# ---- Optional: Backfill any missing tickers with FMP (skip ^VIX etc.) ----
import os, requests, time
from getpass import getpass

if os.path.exists("missing_symbols.csv"):
    missing = pd.read_csv("missing_symbols.csv")["missing_symbols"].tolist()
else:
    uni = pd.read_csv("universe.csv")["ticker"].tolist()
    hedges = ["SPY","XLY","XLF","XLV","XLK","XLI","XLE","XLP","XLB","XLU","XLRE"]
    context = ["^VIX"]
    universe_all = sorted(set(uni + hedges + context))
    base_prices = pd.read_parquet("raw_prices.parquet")
    present = set(base_prices["ticker"].unique())
    missing = [s for s in universe_all if s not in present]

missing = [s for s in missing if not is_index_like(s)]
if not missing:
    print("No missing symbols to backfill.")
else:
    print(f"Backfilling {len(missing)} symbols from FMP (skipping indexes):", missing[:8], "...")
    FMP_API_KEY = os.environ.get("FMP_API_KEY", "").strip() or getpass("Enter FMP API key for price backfill: ").strip()
    if not FMP_API_KEY:
        raise RuntimeError("FMP_API_KEY required for backfill.")

    base_url = "https://financialmodelingprep.com/api/v3/historical-price-full"
    def fetch_fmp_prices(sym):
        fmp_sym = to_fmp_symbol(sym)
        url = f"{base_url}/{fmp_sym}?from={START_DATE}&to={END_DATE}&serietype=line&apikey={FMP_API_KEY}"
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        js = r.json()
        hist = js.get("historical", [])
        if not hist:
            return None
        df = pd.DataFrame(hist)
        df["date"] = pd.to_datetime(df["date"])
        # Map columns; fall back if adjClose missing
        df = df.rename(columns={"adjClose":"adj_close"})
        if "adj_close" not in df.columns:
            df["adj_close"] = df["close"]
        cols = ["date","open","high","low","close","adj_close","volume"]
        for c in cols:
            if c not in df.columns: df[c] = np.nan
        df = df[cols]
        df["ticker"] = sym
        return df.sort_values("date")

    filled = []
    for i, sym in enumerate(missing, 1):
        try:
            df = fetch_fmp_prices(sym)
            if df is not None and len(df):
                filled.append(df)
        except Exception:
            pass
        if i % 10 == 0:
            time.sleep(0.5)  # be polite

    if filled:
        add = pd.concat(filled, ignore_index=True)
        base_prices = pd.read_parquet("raw_prices.parquet")
        prices_fixed = pd.concat([base_prices, add], ignore_index=True).sort_values(["ticker","date"])
        prices_fixed.to_parquet("raw_prices.parquet", index=False)
        print(f"Backfilled {add['ticker'].nunique()} symbols and re-saved raw_prices.parquet")
    else:
        print("FMP backfill returned no data; proceeding without these tickers.")

No missing symbols to backfill.


In [1]:
# ============================================================
# 1.2 FEATURES (FMP Premium, no hard-coded key)
# ------------------------------------------------------------
# Builds:
#   • Price/technical features (returns/vol/ATR/momentum/trend)
#   • Market context (SPY vol, ^VIX, breadth)
#   • Fundamentals via FMP (quarterly BS/IS/CF), cached per ticker,
#     forward-filled to daily, and ratio metrics (Value + Quality)
# Post-merge:
#   • Leakage control (shift all predictive features by 1 day)
#   • Winsorize & cross-sectional z-score (by date)
#   • Fundamentals imputation + missing masks
# Saves:
#   • features.parquet
#   • funda_quarterly.parquet, funda_daily.parquet
#   • cache/funda_q_<TICKER>.parquet (per-ticker cache)
# Notes:
#   - API key is taken from env var FMP_API_KEY or prompted securely.
#   - GPU not used here (CPU/I/O heavy); that’s normal.
# ============================================================

# %pip -q install yfinance pyarrow tenacity

import os, time, random, gc
import pandas as pd
import numpy as np
import yfinance as yf
from getpass import getpass
from concurrent.futures import ThreadPoolExecutor, as_completed
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

# ---------- Config / Toggles ----------
COMPUTE_SLOPE = True      # slope_20 via vectorized method
SLOPE_WINDOW = 20
RV_WIN = 20
ATR_WIN = 14

# Fundamentals provider config (FMP Premium)
PROVIDER = "fmp"          # fixed to FMP for reliability
FMP_API_KEY = os.environ.get("FMP_API_KEY", "").strip()
if not FMP_API_KEY:
    # Prompt securely; not echoed, not written to disk
    FMP_API_KEY = getpass("Enter your FMP API key (kept in-memory for this session): ").strip()
if not FMP_API_KEY:
    raise RuntimeError("FMP_API_KEY is required. Set env var FMP_API_KEY or enter it when prompted.")

def to_fmp_symbol(sym: str) -> str:
    """
    Convert Yahoo-style tickers to FMP-style.
    Yahoo uses '-' for class/shared tickers (e.g., BRK-B),
    while FMP uses '.' (e.g., BRK.B). Everything else stays the same.
    """
    # common class/delimiter cases
    # e.g., BRK-B, BF-B, FOXA (no change), META (no change)
    if "-" in sym:
        return sym.replace("-", ".")
    return sym

# Chunking: Premium can fetch all at once. If you ever need throttling, set CHUNK_TICKERS to an int.
CHUNK_TICKERS = 100      # None = process entire universe in one go
START_AT = 0              # offset if chunking
SKIP_IF_CACHED = True     # skip ticker if cache exists

MAX_WORKERS = 4           # Premium can handle more concurrency; tune 4–12 as you like
RETRY_ATTEMPTS = 5
BATCH_SLEEP = (0.2, 0.6)  # polite jitter between HTTP calls
CACHE_DIR = "cache"
os.makedirs(CACHE_DIR, exist_ok=True)

# ---------- Load raw prices & universe (from 1.1) ----------
prices = pd.read_parquet("raw_prices.parquet")
universe_full = list(pd.read_csv("universe.csv")["ticker"])
hedges = {"SPY", "XLY", "XLF", "XLV", "XLK", "XLI", "XLE", "XLP", "XLB", "XLU", "XLRE"}
context_symbols = {"^VIX"}

# ============================================================
# A) PRICE / TECHNICAL FEATURES
# ============================================================

def compute_atr(df, window=ATR_WIN):
    high, low, close = df["high"], df["low"], df["close"]
    prev_close = close.shift(1)
    tr = pd.concat([(high - low),
                    (high - prev_close).abs(),
                    (low - prev_close).abs()], axis=1).max(axis=1)
    return tr.rolling(window).mean()

def vectorized_rolling_slope(y: pd.Series, window=SLOPE_WINDOW) -> pd.Series:
    N = window
    if N <= 1:
        return pd.Series(np.nan, index=y.index, dtype=float)
    x = np.arange(N, dtype=float)
    Sx = x.sum()
    Sxx = (x**2).sum()
    yv = y.to_numpy(dtype=float)
    yv = np.where(np.isfinite(yv), yv, 0.0)
    k = np.ones(N, dtype=float)
    Sy  = np.convolve(yv, k[::-1], mode="full")[N-1:len(yv)+N-1]
    Sxy = np.convolve(yv, x[::-1], mode="full")[N-1:len(yv)+N-1]
    denom = N * Sxx - Sx * Sx + 1e-12
    slope = (N * Sxy - Sx * Sy) / denom
    out = pd.Series(np.nan, index=y.index, dtype=float)
    out.iloc[N-1:] = slope[N-1:]
    return out

def mom_over_n(adj_close, n):
    return np.log(adj_close / adj_close.shift(n))

feat_frames = []
tickers = sorted(prices["ticker"].unique())
total = len(tickers)

for i, (sym, df_sym) in enumerate(prices.groupby("ticker"), start=1):
    if sym in context_symbols:
        continue
    if i % 25 == 0:
        print(f"[Features] {i}/{total} processed… ({sym})")

    df = df_sym.sort_values("date").copy()
    df["ret_1d"] = np.log(df["adj_close"] / df["adj_close"].shift(1))
    for l in range(1, 61):
        df[f"ret_lag_{l}"] = df["ret_1d"].shift(l)

    df["rv_20"] = df["ret_1d"].rolling(RV_WIN).std() * np.sqrt(252)
    df["atr_14"] = compute_atr(df, ATR_WIN)

    df["mom_20"]  = mom_over_n(df["adj_close"], 20)
    df["mom_6m"]  = mom_over_n(df["adj_close"], 126)
    df["mom_12m"] = mom_over_n(df["adj_close"], 252)
    df["mom_12_1"] = np.log(df["adj_close"].shift(21) / df["adj_close"].shift(252))
    df["mom_6_1"]  = np.log(df["adj_close"].shift(21) / df["adj_close"].shift(126))

    df["sma_20"] = df["adj_close"].rolling(20).mean()
    df["sma_50"] = df["adj_close"].rolling(50).mean()
    df["sma_20_gt_50"] = (df["sma_20"] > df["sma_50"]).astype("float32")
    df["slope_20"] = vectorized_rolling_slope(df["adj_close"], window=SLOPE_WINDOW) if COMPUTE_SLOPE else np.nan

    df["mom_20_vs_vol"] = df["mom_20"] / (df["ret_1d"].rolling(20).std() + 1e-8)

    feat_frames.append(df)

features = pd.concat(feat_frames, ignore_index=True)

# ============================================================
# B) MARKET CONTEXT (SPY vol, VIX, breadth)
# ============================================================

vix = prices[prices["ticker"] == "^VIX"][["date", "adj_close"]].rename(columns={"adj_close": "vix_close"})
spy = prices[prices["ticker"] == "SPY"].copy()
spy["spy_ret"] = np.log(spy["adj_close"] / spy["adj_close"].shift(1))
spy["spy_rv_20"] = spy["spy_ret"].rolling(20).std() * np.sqrt(252)
ctx = spy[["date", "spy_rv_20"]].merge(vix, on="date", how="left")

rets = features.pivot(index="date", columns="ticker", values="ret_1d")
advancers = (rets > 0).sum(axis=1)
# Fixed denominator for stability = full S&P 500 count from universe.csv
breadth = (advancers / len(universe_full)).rename("breadth")
ctx = ctx.merge(breadth.reset_index(), on="date", how="left")

features = features.merge(ctx, on="date", how="left")

# ============================================================
# C) FUNDAMENTALS (FMP Premium primary; cached per ticker)
# ============================================================

px_daily_all = prices[prices["ticker"].isin(universe_full)][["date", "ticker", "adj_close"]].copy()
px_daily_all["date"] = pd.to_datetime(px_daily_all["date"])
dates_all = px_daily_all[["date"]].drop_duplicates().sort_values("date")

def _tidy_quarterly_df(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"])
    elif "fillingDate" in df.columns:
        df["date"] = pd.to_datetime(df["fillingDate"])
    return df

def _coalesce_cols(df: pd.DataFrame, cols: list[str], default=np.nan) -> pd.Series:
    avail = [c for c in cols if c in df.columns]
    if not avail:
        return pd.Series(default, index=df.index)
    tmp = df[avail].apply(pd.to_numeric, errors="coerce")
    # first non-null across the candidate columns
    s = tmp.bfill(axis=1).iloc[:, 0]
    return s

def _fetch_quarterly_funda_fmp(ticker: str) -> pd.DataFrame:
    import requests
    base = "https://financialmodelingprep.com/api/v3"
    fmp_ticker = to_fmp_symbol(ticker)   # BRK-B -> BRK.B

    def jget(url):
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return r.json()

    # pull a long history (Premium supports it)
    bs = _tidy_quarterly_df(pd.DataFrame(jget(f"{base}/balance-sheet-statement/{fmp_ticker}?period=quarter&limit=120&apikey={FMP_API_KEY}")))
    is_ = _tidy_quarterly_df(pd.DataFrame(jget(f"{base}/income-statement/{fmp_ticker}?period=quarter&limit=120&apikey={FMP_API_KEY}")))
    cf = _tidy_quarterly_df(pd.DataFrame(jget(f"{base}/cash-flow-statement/{fmp_ticker}?period=quarter&limit=120&apikey={FMP_API_KEY}")))
    if bs.empty and is_.empty and cf.empty:
        raise RuntimeError(f"FMP fundamentals empty for {ticker} (queried as {fmp_ticker})")

    out = bs.merge(is_, on="date", how="outer").merge(cf, on="date", how="outer")

    # Coalesce across schema variants
    out["book_equity"]  = _coalesce_cols(out, ["totalStockholdersEquity","totalShareholderEquity","totalEquity"]).astype(float)
    out["net_income"]   = _coalesce_cols(out, ["netIncome","netIncomeApplicableToCommonShares"]).astype(float)
    out["ocf"]          = _coalesce_cols(out, [
        "netCashProvidedByOperatingActivities",
        "netCashProvidedByUsedInOperatingActivities",
        "netCashProvidedByUsedInOperatingActivitiesContinuingOperations"
    ]).astype(float)
    out["gross_profit"] = _coalesce_cols(out, ["grossProfit"]).astype(float)
    out["total_assets"] = _coalesce_cols(out, ["totalAssets"]).astype(float)

    # total_debt: prefer totalDebt; else short + long
    td = _coalesce_cols(out, ["totalDebt"])
    if td.isna().all():
        short = _coalesce_cols(out, ["shortTermDebt","shortLongTermDebtTotal"])
        long  = _coalesce_cols(out, ["longTermDebt"])
        td = (short.fillna(0) + long.fillna(0)).replace({0: np.nan})
    out["total_debt"] = td.astype(float)

    # dividends / buybacks (raw signs as provided by FMP)
    out["dividends"] = _coalesce_cols(out, ["dividendsPaid","dividendsPaidCashFlow"]).astype(float)
    out["buybacks"]  = _coalesce_cols(out, ["commonStockRepurchased","purchaseOfCommonStock"]).astype(float)

    out["ticker"] = ticker  # keep Yahoo-style symbol for our dataset

    cols = ["date","ticker","book_equity","net_income","ocf","gross_profit",
            "total_assets","total_debt","dividends","buybacks"]
    return out[cols].dropna(subset=["date"])


def fetch_or_load_cached_quarterly(ticker: str) -> pd.DataFrame | None:
    path = os.path.join(CACHE_DIR, f"funda_q_{ticker}.parquet")
    if SKIP_IF_CACHED and os.path.exists(path):
        try:
            return pd.read_parquet(path)
        except Exception:
            pass
    try:
        df = _fetch_quarterly_funda_fmp(ticker)
        if df is None or df.empty:
            return None
        df.to_parquet(path, index=False)
        time.sleep(random.uniform(*BATCH_SLEEP))  # polite pause
        return df
    except Exception:
        return None

# ---- SMOKE TEST (run once, then you can comment it out) ----
try:
    from IPython.display import display
except Exception:
    pass

test_syms = ["AAPL", "MSFT", "BRK-B", "BF-B"]
for s in test_syms:
    try:
        df = _fetch_quarterly_funda_fmp(s)
        # 👇 trim preview to match your price history
        df = df[df["date"] >= pd.to_datetime(START_DATE)]
        print(s, "→", to_fmp_symbol(s), "rows:", len(df))
        try:
            display(df.head(2))
        except Exception:
            print(df.head(2))
    except Exception as e:
        print("ERR", s, e)

# Determine chunk (or all)
if CHUNK_TICKERS:
    end_at = min(len(universe_full), START_AT + CHUNK_TICKERS)
    tickers_chunk = sorted(universe_full[START_AT:end_at])
    print(f"[FMP] Processing chunk {START_AT}:{end_at} (size={len(tickers_chunk)})")
else:
    tickers_chunk = sorted(universe_full)
    print(f"[FMP] Processing entire universe (size={len(tickers_chunk)})")

# Parallel fetch with caching
funda_parts, successes = [], 0
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
    futs = {ex.submit(fetch_or_load_cached_quarterly, t): t for t in tickers_chunk}
    for j, fut in enumerate(as_completed(futs), start=1):
        df = fut.result()
        if df is not None and len(df):
            funda_parts.append(df)
            successes += 1
        if j % 25 == 0:
            print(f"[Fundamentals/FMP] {j}/{len(tickers_chunk)} processed… (successes: {successes})")

if successes == 0:
    raise RuntimeError("No fundamentals fetched. Check your FMP key or try a smaller CHUNK_TICKERS with fewer workers.")

# Merge chunk with existing quarterly file (so multiple runs accumulate)
q_path = "funda_quarterly.parquet"
funda_q_chunk = pd.concat(funda_parts, ignore_index=True).sort_values(["ticker","date"])

# after you build funda_q_chunk
cutoff = pd.to_datetime(px_daily_all["date"].min())
funda_q_chunk = funda_q_chunk[funda_q_chunk["date"] >= cutoff]

if os.path.exists(q_path):
    old = pd.read_parquet(q_path)
    old["date"] = pd.to_datetime(old["date"])
    old = old[old["date"] >= cutoff]  # <- trim the old file too
    funda_q = (
        pd.concat([old, funda_q_chunk], ignore_index=True)
          .drop_duplicates(["ticker","date"], keep="last")
          .sort_values(["ticker","date"])
    )
else:
    funda_q = funda_q_chunk

funda_q = funda_q.sort_values(["ticker","date"])
funda_q.to_parquet(q_path, index=False)
print(
    f"Saved: {q_path}  "
    f"(tickers with funda total: {funda_q['ticker'].nunique()}, "
    f"rows: {len(funda_q)})"
)

# Quarterly → daily (forward-fill per ticker) across ALL tickers collected so far
ff = []
for sym, grp in funda_q.groupby("ticker"):
    g = dates_all.merge(grp, on="date", how="left")
    g["ticker"] = sym
    g = g.sort_values("date").ffill()
    ff.append(g)
funda_daily = pd.concat(ff, ignore_index=True)

# Ratios (Value & Quality)
fd = funda_daily.merge(px_daily_all, on=["date","ticker"], how="left")
price = fd["adj_close"].replace(0, np.nan)

fd["book_to_price"]     = fd["book_equity"] / price
fd["earnings_yield"]    = fd["net_income"]  / price
fd["cf_yield"]          = fd["ocf"]         / price
fd["shareholder_yield"] = (fd["dividends"].fillna(0) * -1 + fd["buybacks"].fillna(0)) / price

fd["gross_profitability"] = fd["gross_profit"] / fd["total_assets"].replace(0, np.nan)
fd["roe"]                 = fd["net_income"] / fd["book_equity"].replace(0, np.nan)
fd["accruals"]            = (fd["net_income"] - fd["ocf"]) / fd["total_assets"].replace(0, np.nan)
fd["leverage"]            = fd["total_debt"] / fd["total_assets"].replace(0, np.nan)

funda_daily = fd[[
    "date","ticker","book_to_price","earnings_yield","cf_yield","shareholder_yield",
    "gross_profitability","roe","accruals","leverage"
]]
funda_daily.to_parquet("funda_daily.parquet", index=False)
print("Saved: funda_daily.parquet")

# Merge fundamentals into features
features = features.merge(funda_daily, on=["date","ticker"], how="left")

# ============================================================
# D) POST-MERGE HYGIENE
# ============================================================

# 1) Leakage control
non_feature_cols = {"date","ticker","open","high","low","close","adj_close","volume"}
cols_to_shift = [c for c in features.columns if c not in non_feature_cols]
features[cols_to_shift] = features.groupby("ticker")[cols_to_shift].shift(1)

# 2) Winsorize & cross-sectional z-score
def winsorize_cs(s, lo=0.01, hi=0.99):
    ql, qh = s.quantile(lo), s.quantile(hi)
    return s.clip(ql, qh)

# --- choose features for cross-sectional standardization (exclude context & raw SMAs) ---
cs_cols = [
    "rv_20","atr_14","mom_20","mom_6m","mom_12m","mom_12_1","mom_6_1",
    "sma_20_gt_50","slope_20","mom_20_vs_vol",
    # fundamentals
    "book_to_price","earnings_yield","cf_yield","shareholder_yield",
    "gross_profitability","roe","accruals","leverage"
] + [f"ret_lag_{l}" for l in range(1,61)]

# keep context raw (no CS z-score)
context_keep_raw = ["spy_rv_20","vix_close","breadth"]

present = [c for c in cs_cols if c in features.columns]

print(f"[Standardize] Cross-sectional z-score on {len(present)} features")

def cs_standardize_fast(df, cols, lo=0.01, hi=0.99):
    out = df.copy()
    out[cols] = out[cols].astype("float32")

    d = out["date"]
    for c in cols:
        s = out[c]

        ql = s.groupby(d).transform(lambda x: x.quantile(lo))
        qh = s.groupby(d).transform(lambda x: x.quantile(hi))
        s_clip = s.clip(ql, qh)

        mu = s_clip.groupby(d).transform("mean")
        sd = s_clip.groupby(d).transform("std")

        # if std is 0 or NaN (date-constant or all-NaN), set denom=1 to avoid blowing up / NaNs
        denom = sd.fillna(0.0).replace(0.0, 1.0)

        out[c] = ((s_clip - mu) / (denom + 1e-9)).astype("float32")

    return out

features = cs_standardize_fast(features, present)

# 3) Fundamentals imputation + masks
funda_cols = ["book_to_price","earnings_yield","cf_yield","shareholder_yield",
              "gross_profitability","roe","accruals","leverage"]
for c in funda_cols:
    if c in features.columns:
        features[f"{c}_is_missing"] = features[c].isna().astype(int)
        features[c] = features.groupby("date")[c].transform(lambda s: s.fillna(s.median()))

# Save final
features.to_parquet("features.parquet", index=False)
print("Saved: features.parquet (lagged, winsorized, cross-sectional z-scored)")
print("Artifacts: funda_quarterly.parquet, funda_daily.parquet, cache/funda_q_*.parquet")

gc.collect()

Enter your FMP API key (kept in-memory for this session): ··········
[Features] 25/515 processed… (AMAT)
[Features] 50/515 processed… (BA)
[Features] 75/515 processed… (CARR)
[Features] 100/515 processed… (CNP)
[Features] 125/515 processed… (DAL)
[Features] 150/515 processed… (EA)
[Features] 175/515 processed… (EXE)
[Features] 200/515 processed… (GEHC)
[Features] 225/515 processed… (HON)
[Features] 250/515 processed… (IT)
[Features] 275/515 processed… (LDOS)
[Features] 300/515 processed… (MCO)
[Features] 325/515 processed… (MTCH)
[Features] 350/515 processed… (OKE)
[Features] 375/515 processed… (PNC)
[Features] 400/515 processed… (RSG)
[Features] 425/515 processed… (SWKS)
[Features] 450/515 processed… (TSN)
[Features] 475/515 processed… (VST)
[Features] 500/515 processed… (XLF)
ERR AAPL name 'START_DATE' is not defined
ERR MSFT name 'START_DATE' is not defined
ERR BRK-B name 'START_DATE' is not defined
ERR BF-B FMP fundamentals empty for BF-B (queried as BF.B)
[FMP] Processing chunk 0:

0

In [2]:
# ============================================================
# 1.3 DATA HYGIENE / QC (non-destructive)
# ------------------------------------------------------------
# - Summarize coverage & missingness (post 1.2)
# - Optional pruning: drop early warmup dates & low-coverage dates
# - Write meta.yaml and QC CSVs
# ============================================================

import pandas as pd
import numpy as np
import yaml

FEATURES_PATH = "features.parquet"
UNIVERSE_PATH = "universe.csv"

features = pd.read_parquet(FEATURES_PATH)
universe_df = pd.read_csv(UNIVERSE_PATH)

# ---------- QC: basics ----------
min_date = pd.to_datetime(features["date"]).min()
max_date = pd.to_datetime(features["date"]).max()
n_rows = len(features)
n_tickers = features["ticker"].nunique()

# Columns we standardized in 1.2 (will exist if 1.2 ran)
feature_cols = [c for c in features.columns
                if c not in {"date","ticker","open","high","low","close","adj_close","volume"}]

# Per-column missingness (after 1.2; should be low except earliest windows)
missing_pct = (1.0 - features[feature_cols].notna().mean()).sort_values(ascending=False)
missing_pct.to_csv("qc_missing_by_feature.csv", header=["missing_pct"])

# Coverage by date (# of tickers with at least 1 valid feature on that date)
valid_any = features[feature_cols].notna().sum(axis=1) > 0
coverage_by_date = (features.assign(valid_any=valid_any)
                             .groupby("date")["ticker"]
                             .nunique()
                             .rename("n_tickers"))
coverage_by_date.to_csv("qc_coverage_by_date.csv")

# ---------- Optional: pruning rules (non-destructive by default) ----------
# 1) Warmup: many features need long windows (max ≈ 252 + 21). Keep dates after first 273 trading days.
#    We'll infer a warmup cutoff from SPY availability to be robust.
spy_dates = features.loc[features["ticker"]=="SPY", "date"].sort_values().unique()
if len(spy_dates) > 300:
    warmup_cutoff = pd.to_datetime(spy_dates[min(273, len(spy_dates)-1)])
else:
    warmup_cutoff = min_date  # fallback

# 2) Low coverage: drop dates with very few names (e.g., <300) — tweak if you want.
COVERAGE_MIN = 300
low_cov_dates = coverage_by_date[coverage_by_date < COVERAGE_MIN].index

# We don’t mutate features here; write a recommended mask so training can filter.
date_mask_keep = (~pd.Series(features["date"]).isin(low_cov_dates)) & (features["date"] >= warmup_cutoff)
keep_rate = date_mask_keep.mean()
pd.DataFrame({
    "warmup_cutoff":[warmup_cutoff],
    "coverage_min":[COVERAGE_MIN],
    "keep_rate":[float(keep_rate)]
}).to_csv("qc_recommendations.csv", index=False)

# ---------- Meta ----------
meta = {
    "universe": {
        "description": "S&P 500 (current constituents; survivorship bias acknowledged).",
        "count": int(len(universe_df)),
        "hedges": ["SPY","XLY","XLF","XLV","XLK","XLI","XLE","XLP","XLB","XLU","XLRE"],
        "context_symbols": ["^VIX"],
        "lookback": {"start": str(min_date.date()), "end": str(max_date.date())}
    },
    "pricing": {
        "source": "Yahoo Finance via yfinance",
        "adjusted_prices_used": True,
        "file": "raw_prices.parquet"
    },
    "features": {
        "file": "features.parquet",
        "rows": int(n_rows),
        "tickers": int(n_tickers),
        "leakage_control": "All predictive features shifted by 1 day.",
        "cross_sectional_processing": "Winsorized [1%,99%] & z-scored by date (see 1.2).",
        "imputation": "Fundamentals imputed (cross-sectional median) in 1.2; *_is_missing masks present."
    },
    "qc": {
        "missing_by_feature_csv": "qc_missing_by_feature.csv",
        "coverage_by_date_csv": "qc_coverage_by_date.csv",
        "recommendations_csv": "qc_recommendations.csv",
        "warmup_cutoff": str(warmup_cutoff.date()),
        "coverage_min": COVERAGE_MIN,
        "recommendation": "Filter training rows to dates >= warmup_cutoff and dates with coverage >= coverage_min."
    },
    "deliverables": ["universe.csv", "raw_prices.parquet", "features.parquet",
                     "funda_quarterly.parquet", "funda_daily.parquet", "meta.yaml",
                     "qc_missing_by_feature.csv", "qc_coverage_by_date.csv", "qc_recommendations.csv"]
}

with open("meta.yaml", "w") as f:
    yaml.safe_dump(meta, f, sort_keys=False)

print("Saved: meta.yaml + QC CSVs")

Saved: meta.yaml + QC CSVs


In [3]:
# ============================================================
# 1.4 DATA QC & ASSERTIONS (non-destructive; optional filtered view)
# Produces: qc_summary.json, qc_constant_cols.csv, qc_missing_by_feature.csv (again),
#           qc_skew_kurtosis.csv, qc_outlier_rate.csv, qc_drift.csv,
#           features_filtered.parquet (optional, if you turn on APPLY_FILTERS)
# ============================================================

import json, pandas as pd, numpy as np
from scipy.stats import skew, kurtosis
import warnings
warnings.filterwarnings("ignore", message="Precision loss occurred in moment calculation")
warnings.filterwarnings("ignore", message="Degrees of freedom <= 0 for slice")

FEATURES_PATH = "features.parquet"

APPLY_FILTERS = True          # set False if you only want reports
COVERAGE_MIN = 300            # min tickers per date
Z_OUTLIER = 5.0               # |z| threshold post-standardization
EARLY_YEARS = 5               # windows for drift check
RECENT_YEARS = 5

df = pd.read_parquet(FEATURES_PATH)
df["date"] = pd.to_datetime(df["date"])
feature_cols = [c for c in df.columns if c not in {"date","ticker","open","high","low","close","adj_close","volume"}]

# Basic shape / duplicates
dup_count = df.duplicated(["date","ticker"]).sum()
idx_dupes = int(dup_count)

# Per-ticker monotonic date check
monotonic_bad = []
for t, g in df.groupby("ticker"):
    if not g["date"].sort_values().is_monotonic_increasing:
        monotonic_bad.append(t)

# Constant/empty columns
MIN_N = 200  # only compute moments if we’ve got enough points
sk_stats = []

const_cols, empty_cols = [], []
for c in feature_cols:
    nn = df[c].notna().sum()
    if nn == 0:
        empty_cols.append(c)
        continue
    # treat “constant” as very low variance or single unique value
    if df[c].nunique(dropna=True) == 1 or np.nanstd(df[c].to_numpy(dtype=float)) < 1e-12:
        const_cols.append(c)

pd.Series(const_cols, name="constant_cols").to_csv("qc_constant_cols.csv", index=False)
pd.Series(empty_cols,  name="empty_cols").to_csv("qc_empty_cols.csv", index=False)

# Missingness
missing_pct = (1.0 - df[feature_cols].notna().mean()).sort_values(ascending=False)
missing_pct.to_csv("qc_missing_by_feature.csv", header=["missing_pct"])

# Coverage by date and warmup/low-coverage mask (reuse warmup logic from 1.3)
spy_dates = df.loc[df["ticker"]=="SPY", "date"].sort_values().unique()
warmup_cutoff = pd.to_datetime(spy_dates[min(273, len(spy_dates)-1)]) if len(spy_dates) > 300 else df["date"].min()
coverage = df.groupby("date")["ticker"].nunique()
low_cov_dates = coverage[coverage < COVERAGE_MIN].index
keep_mask = (df["date"] >= warmup_cutoff) & (~df["date"].isin(low_cov_dates))
keep_rate = float(keep_mask.mean())

# Outlier rate (features are z-scored per date already)
outlier_rate = {}
for c in feature_cols:
    s = df[c]
    outlier_rate[c] = float((s.abs() > Z_OUTLIER).mean())
pd.Series(outlier_rate, name="outlier_rate").sort_values(ascending=False).to_csv("qc_outlier_rate.csv")

# Skew/Kurtosis (global, ignoring NaNs)
sk_rows = []
for c in feature_cols:
    x = df[c].to_numpy(dtype=float)
    x = x[np.isfinite(x)]
    if len(x) < MIN_N or np.nanstd(x) < 1e-8:
        # optional: quantile-based skew as fallback
        try:
            q1,q2,q3 = np.nanpercentile(x, [25,50,75])
            bowley = ((q3 + q1) - 2*q2) / ((q3 - q1) + 1e-9)
        except Exception:
            bowley = np.nan
        sk_rows.append([c, np.nan, np.nan, bowley, np.nan, np.nan])
        continue
    sk = float(skew(x, bias=False))
    ku = float(kurtosis(x, fisher=True, bias=False))
    p99 = float(np.nanpercentile(x, 99))
    med = float(np.nanmedian(x))
    dom = abs(p99) / (abs(med) + 1e-9)
    sk_rows.append([c, sk, ku, np.nan, dom, p99])

pd.DataFrame(sk_rows, columns=["feature","skew","kurtosis_fisher","bowley_skew","p99_to_median_abs","p99"])\
  .sort_values("p99_to_median_abs", ascending=False)\
  .to_csv("qc_skew_kurtosis.csv", index=False)

# Drift: early vs recent windows
dstart, dend = df["date"].min(), df["date"].max()
span_years = (dend - dstart).days / 365.25
if span_years < (EARLY_YEARS + RECENT_YEARS):
    # fallback: split the dataset in half
    mid = dstart + (dend - dstart) / 2
    early = df[(df["date"] >= dstart) & (df["date"] <= mid)]
    late  = df[(df["date"] >  mid) & (df["date"] <= dend)]
else:
    early_end    = pd.Timestamp(dstart) + pd.DateOffset(years=EARLY_YEARS)
    recent_start = pd.Timestamp(dend)   - pd.DateOffset(years=RECENT_YEARS)
    early = df[(df["date"] >= dstart) & (df["date"] <= early_end)]
    late  = df[(df["date"] >= recent_start) & (df["date"] <= dend)]

drift_rows = []
for c in feature_cols:
    e = early[c].astype("float64"); l = late[c].astype("float64")
    e = e[np.isfinite(e)]; l = l[np.isfinite(l)]
    if len(e) < MIN_N or len(l) < MIN_N:
        drift_rows.append([c, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan])
        continue
    e_mean, e_std = float(np.nanmean(e)), float(np.nanstd(e))
    l_mean, l_std = float(np.nanmean(l)), float(np.nanstd(l))
    drift_rows.append([c, e_mean, l_mean, l_mean - e_mean, e_std, l_std, (l_std+1e-9)/(e_std+1e-9)])

pd.DataFrame(
    drift_rows,
    columns=["feature","early_mean","late_mean","mean_diff","early_std","late_std","std_ratio_late_over_early"]
).to_csv("qc_drift.csv", index=False)

def bowley_skew(x):
    q1, q2, q3 = np.nanpercentile(x, [25,50,75])
    denom = (q3 - q1) + 1e-9
    return float(((q3 + q1) - 2*q2) / denom)
# you can compute this alongside or instead of moment skew for each feature

# Optionally write filtered view for modeling
if APPLY_FILTERS:
    # Also drop truly empty/constant cols from the filtered file only
    drop_cols = list(set(empty_cols) | set(const_cols))
    cols_keep = [c for c in df.columns if c not in drop_cols]
    df_filt = df.loc[keep_mask, cols_keep].copy()
    df_filt.to_parquet("features_filtered.parquet", index=False)

# Summary JSON (for quick eyeball)
summary = {
    "rows": int(len(df)),
    "tickers": int(df["ticker"].nunique()),
    "dates": int(df["date"].nunique()),
    "date_min": str(df["date"].min().date()),
    "date_max": str(df["date"].max().date()),
    "duplicates_idx": idx_dupes,
    "monotonic_date_issues": len(monotonic_bad),
    "constant_cols": len(const_cols),
    "empty_cols": len(empty_cols),
    "warmup_cutoff": str(warmup_cutoff.date()),
    "coverage_min": COVERAGE_MIN,
    "keep_rate_after_filters": keep_rate,
    "median_missing_pct": float(missing_pct.median()),
    "max_missing_pct": float(missing_pct.max()),
    "mean_outlier_rate_|z|>5": float(pd.Series(outlier_rate).mean()),
    "filtered_file_written": APPLY_FILTERS
}

# 🚦 Hard QC checks — stop if these fail
assert summary["duplicates_idx"] == 0, "Duplicate (date,ticker) rows found."
assert summary["keep_rate_after_filters"] >= 0.85, "Too many rows dropped by filters."
assert summary["constant_cols"] <= 10, "Suspicious number of constant columns."

with open("qc_summary.json","w") as f:
    json.dump(summary, f, indent=2)

print("QC done → qc_summary.json, qc_* CSVs",
      "and features_filtered.parquet" if APPLY_FILTERS else "")

  return np.nanmean(a, axis, out=out, keepdims=keepdims)


QC done → qc_summary.json, qc_* CSVs and features_filtered.parquet


In [4]:
import json, pandas as pd

with open("qc_summary.json") as f:
    s = json.load(f)

print("=== QC SUMMARY ===")
for k in [
    "rows","tickers","dates","date_min","date_max",
    "duplicates_idx","monotonic_date_issues",
    "constant_cols","empty_cols",
    "warmup_cutoff","coverage_min","keep_rate_after_filters",
    "median_missing_pct","max_missing_pct","mean_outlier_rate_|z|>5",
    "filtered_file_written"
]:
    print(f"{k}: {s.get(k)}")

print("\n=== Top 10 most-missing features ===")
print(pd.read_csv("qc_missing_by_feature.csv").head(10))

print("\n=== Top 10 highest outlier rates (|z|>5) ===")
print(pd.read_csv("qc_outlier_rate.csv").head(10))

print("\n=== Constant / Empty columns ===")
try: print(pd.read_csv("qc_constant_cols.csv").head())
except: print("none")
try: print(pd.read_csv("qc_empty_cols.csv").head())
except: print("none")

print("\n=== Drift (largest mean change early→late) ===")
drift = pd.read_csv("qc_drift.csv")
drift["abs_mean_diff"] = drift["mean_diff"].abs()
print(drift.sort_values("abs_mean_diff", ascending=False).head(10))

print("\n=== Filtered file shape ===")
ff = pd.read_parquet("features_filtered.parquet")
print(ff.shape, "rows x cols; dates:", ff['date'].min(), "→", ff['date'].max(), "; tickers:", ff['ticker'].nunique())


=== QC SUMMARY ===
rows: 2336181
tickers: 514
dates: 4930
date_min: 2006-01-03
date_max: 2025-08-07
duplicates_idx: 0
monotonic_date_issues: 0
constant_cols: 3
empty_cols: 3
warmup_cutoff: 2007-02-05
coverage_min: 300
keep_rate_after_filters: 0.9515307247169633
median_missing_pct: 0.005390421375741028
max_missing_pct: 1.0
mean_outlier_rate_|z|>5: 0.03214321560068222
filtered_file_written: True

=== Top 10 most-missing features ===
       Unnamed: 0  missing_pct
0  earnings_yield     1.000000
1             roe     1.000000
2        accruals     1.000000
3        mom_12_1     0.055664
4         mom_12m     0.055664
5         mom_6_1     0.027942
6          mom_6m     0.027942
7      ret_lag_60     0.013641
8      ret_lag_59     0.013421
9      ret_lag_58     0.013201

=== Top 10 highest outlier rates (|z|>5) ===
          Unnamed: 0  outlier_rate
0          vix_close      0.999780
1             sma_20      0.973354
2             sma_50      0.967136
3             atr_14      0.005795
4  

<details>
<summary>📦 Summary — Section 1 (Data & Universe)</summary>

In this section, we **built the full, modeling-ready dataset** by merging historical prices, technical indicators, market context, and fundamentals into a single leakage-controlled feature matrix.  
Key steps included:

- **Data acquisition** — pulled long-term daily OHLCV for the equity universe, hedges, and context symbols, plus quarterly fundamentals from FMP.
- **Feature engineering** — created lagged returns/volatility, momentum metrics, trend filters, ATR, volatility-adjusted momentum, and value/quality factor composites. Fundamentals were forward-filled to daily frequency.
- **Leakage control & scaling** — shifted predictive features by one day, winsorized extreme values, and cross-sectionally z-scored each feature per date.
- **Missing data handling** — conservative imputation for fundamentals and binary masks to record missingness.
- **Quality control** — removed low-coverage dates, early warmup period, constant/empty columns, and duplicate rows; generated QC reports and metadata.

**Outcome:** A clean, consistent, and statistically robust `features_filtered.parquet` file — ready for direct use in **Section 2 (Regime Modeling)** without recomputing or re-fetching any raw data.

</details>


<details>
<summary> Variables to reuse — Section 1 (Data & Universe) </summary>
**Status:** Done. Artifacts are written; QC checks passed; ready to start **Section 2 (Regime Modeling)** using the saved files and globals below.

---

## Canonical Artifacts (reuse, don’t recompute)
- `universe.csv` – S&P 500 tickers (Yahoo-style), excludes hedges/context.
- `raw_prices.parquet` – OHLCV + `adj_close` for equities + hedges + `^VIX` (long format).
- `features.parquet` – lagged, winsorized, cross-sectionally z-scored features (+ *_is_missing masks).
- `features_filtered.parquet` – modeling-ready view (warmup & low-coverage dates removed; empty/constant cols dropped).
- `funda_quarterly.parquet`, `funda_daily.parquet` – fundamentals at quarterly/daily granularity.
- `meta.yaml` – machine-readable metadata (sources, lookback, QC guidance).
- QC reports: `qc_summary.json`, `qc_missing_by_feature.csv`, `qc_coverage_by_date.csv`, `qc_constant_cols.csv`, `qc_empty_cols.csv`, `qc_outlier_rate.csv`, `qc_skew_kurtosis.csv`, `qc_drift.csv`, `qc_recommendations.csv`.

---

## Reusable Globals (organized)
> These exist (or are trivially reloadable) after Section 1. Prefer these over re-deriving.

### Dates / Ranges
- `START_DATE = "2006-01-01"`  
- `END_DATE = datetime.today().strftime("%Y-%m-%d")`

### Universe & Symbols
- `sp500_url` – Wikipedia source for constituents.
- `tickers_raw` → raw symbols from Wikipedia.
- `tickers` → Yahoo-normalized tickers (periods → dashes).
- `hedges` → `["SPY","XLY","XLF","XLV","XLK","XLI","XLE","XLP","XLB","XLU","XLRE"]`
- `context_symbols` → `["^VIX"]`  *(later used as `{"^VIX"}` set in 1.2)*
- `universe` → sorted unique S&P tickers.
- `universe_all` → `universe + hedges + context_symbols`
- `universe_full` → list from `universe.csv` (canonical equities universe for downstream code).

### DataFrames (load-once, reuse)
- `prices` → long OHLCV for `universe_all` (saved as `raw_prices.parquet`).
- `features` → merged technical + context + fundamentals (post-shift, winsorize, z-score) (saved).
- `vix` → `^VIX` close series; `spy` → SPY prices with `spy_ret`, `spy_rv_20`.
- `ctx` → market context by date: `["spy_rv_20","vix_close","breadth"]`.
- `px_daily_all` → `["date","ticker","adj_close"]` for equities universe.
- `dates_all` → unique trading dates.
- `funda_q` → quarterly fundamentals by ticker (saved).
- `funda_daily` → daily forward-filled fundamentals (saved).

### Feature Engineering Toggles / Windows
- `COMPUTE_SLOPE = True`
- `SLOPE_WINDOW = 20`
- `RV_WIN = 20`
- `ATR_WIN = 14`

### Provider / API / Caching
- `PROVIDER = "fmp"`
- `FMP_API_KEY` – from env or prompt (in-memory only).
- `CACHE_DIR = "cache"`
- `CHUNK_TICKERS = 100`, `START_AT = 0`, `SKIP_IF_CACHED = True`
- `MAX_WORKERS = 4`, `RETRY_ATTEMPTS = 5`, `BATCH_SLEEP = (0.2, 0.6)`

### Useful Function Handles
- `to_fmp_symbol(sym)` – Yahoo “-” ↔ FMP “.” class ticker mapping.
- `is_index_like(sym)` – identifies index symbols (e.g., `^VIX`).
- `compute_atr(df, window=ATR_WIN)`
- `vectorized_rolling_slope(y, window=SLOPE_WINDOW)`
- `mom_over_n(adj_close, n)`
- `_tidy_quarterly_df(df)`, `_coalesce_cols(df, cols, default)`
- `_fetch_quarterly_funda_fmp(ticker)` – pulls BS/IS/CF, coalesces variants.
- `fetch_or_load_cached_quarterly(ticker)` – cached loader for fundamentals.
- `cs_standardize_fast(df, cols, lo=0.01, hi=0.99)` – per-date winsorize+z-score.

### Column Sets / Masks (downstream-friendly)
- `non_feature_cols = {"date","ticker","open","high","low","close","adj_close","volume"}`
- `cols_to_shift` – all predictive feature columns actually shifted by 1 bar.
- `cs_cols` – features standardized cross-sectionally (lags, vol, mom, fundamentals, etc.).
- `context_keep_raw = ["spy_rv_20","vix_close","breadth"]`
- *(QC section)*
  - `FEATURES_PATH = "features.parquet"`, `UNIVERSE_PATH = "universe.csv"`
  - `COVERAGE_MIN = 300`
  - `APPLY_FILTERS = True`
  - `Z_OUTLIER = 5.0`, `EARLY_YEARS = 5`, `RECENT_YEARS = 5`
  - `warmup_cutoff` – computed from SPY date series (≈273 trading-day warmup).
  - `keep_mask` – dates ≥ `warmup_cutoff` and with coverage ≥ `COVERAGE_MIN`.
  - *(Note: `features_filtered.parquet` is written using `keep_mask` and pruned columns.)*

---

## What this means for Section 2 (Regimes)
- **Use** `features_filtered.parquet` (or reload `features` and apply `keep_mask`) to build HMM inputs.
- Inputs available out of the box: `spy_rv_20`, `vix_close`, `breadth`, and per-asset returns (`ret_1d`), plus everything in `cs_cols`.
- **No duplicate `(date, ticker)` rows**, **no monotonic issues**; early sparse periods removed by `warmup_cutoff`/`keep_mask`.

---

## Sanity Questions (short answers)
- **“Are we good to go?”** Yes — Section 1 is complete and validated; proceed to regime modeling.
- **“Empty rows?”** Raw OHLCV rows with all NaNs were dropped; the modeling file (`features_filtered.parquet`) is filtered to warmup/coverage and prunes empty/constant columns. Row-level all-NaN feature cases should not remain after these filters.
- **“Add the assertions?”** Already present and passing in QC (`qc_summary.json`). No need to add them again unless you change the pipeline.
</details>

# 2. Regime Modeling

<details> <summary>
Outline (HMM → Regime Labels & Probabilities)</summary>

# 2) Regime Modeling — Updated Outline (HMM → Regime Labels & Probabilities)

## 2.0 Scope & Interfaces
- **Goal:** Assign a daily market regime (Risk-On, Risk-Off, Transition) with posterior probabilities to drive regime-aware weighting, turnover caps, and risk targets in Sections 3–5.
- **Inputs (from Section 1):**
  - `features_filtered.parquet` with **raw** `spy_rv_20`, `vix_close`, `breadth`, and SPY `adj_close` for return computation.
  - Trading calendar (aligned daily business days).
- **Outputs (artifacts):**
  - `regime_labels.parquet`: `date, state_id, p0..pK, regime_label`
  - `regime_labels.csv` (plot-friendly)
  - `regime_plot.png` (timeline with shading), `state_profiles.csv` (state stats)
  - `regime_hmm.pkl` (bundle: scaler + HMM per walk-forward window)
  - `regime_meta.json` (config, state→label map, scaler params, transition matrix, diagnostics)
  - `regime_sensitivity.json` (K/feature/era stability tests)
- **Pass/Fail gates:**
  - Interpretable state profiles (return/vol ordering aligns with labels)
  - Reasonable persistence (median run length > 5–10 days; no chattering)
  - Stable mapping across walk-forward windows (low semantic flip rate)
  - No leakage (all inputs at t known at t)

---

## 2.1 Data Assembly (Market Panel)
- **Series:**
  - SPY **log return** at t (computed from `adj_close`, shifted to avoid leakage if needed).
  - SPY realized volatility (20-day) — from raw `spy_rv_20`.
  - VIX **level** (`vix_close`) and optionally **daily Δ** (t − t-1).
  - Market breadth (% advancers in S&P, known at t).
- **IMPORTANT:** Use **raw** context series from Section 1 (`spy_rv_20`, `vix_close`, `breadth`), **not** cross-sectional z-scored features.
- **Breadth timing:** Confirm that `breadth` reflects t-1 data available at t; if not, shift by 1.
- **Alignment:** Daily business days; merge by `date`; forward-fill only for indicators known at t; drop rows with missing core inputs.
- **Standardization:** Fit `StandardScaler` **per train window** on the raw context features; persist scaler per window (stored in `regime_hmm.pkl`).
- **Sanity checks:**
  - Stationarity proxy (mean/var drift over eras).
  - Outlier handling: no winsorization needed for HMM since we scale raw series per window.
  - Coverage check: ensure no missing dates in test stitching.

---

## 2.2 Model Choice & Configuration
- **Primary:** Gaussian HMM with `covariance_type="full"`; components K ∈ {2,3} (default 3).
- **Alternative (optional):** Student-t HMM, GMM-HMM, Markov-Switching VAR, or Bayesian HMM with sticky priors.
- **Hyperparameters:**
  - `n_components`, `covariance_type`, `n_iter`, `random_state`.
  - Optional: Dirichlet priors / sticky transitions to enforce regime persistence.
- **Training protocol:**
  - Train on standardized features in the train window.
  - Multiple random restarts; choose model with highest log-likelihood.
  - If applying **finance recency weighting rule**: optionally weight log-likelihood so recent data has more influence (can be implemented here if desired).

---

## 2.3 State Labeling & Semantics
- **Profile each state:**
  - Mean and vol of SPY returns.
  - Mean VIX level, mean ΔVIX.
  - Mean breadth, tail metrics (5% quantile returns).
- **Label rules:**
  - Highest mean return & lowest vol → **Risk-On**
  - Highest vol & lowest return → **Risk-Off**
  - Remaining state → **Transition**
- **Tie-breakers:** breadth, VIX changes, downside tails.
- **Persist mapping:** Save `state_id → regime_label` per window in `regime_meta.json` so semantics don’t silently drift across walk-forward windows.

---

## 2.4 Smoothing, Persistence & Debounce
- **Posterior smoothing:** Option to use Viterbi most-likely path vs. raw posterior argmax.
- **Debounce parameters:** `MIN_DWELL_DAYS` and `POSTERIOR_THRESH` from `config.yaml`.
- **Gap handling:** Holidays/missing days inherit last known regime; no forward-looking fill.

---

## 2.5 Robustness & Sensitivity
- **K sensitivity:** Run K=2 and K=3; prefer K with clearest separation (return/vol) and healthy dwell-time.
- **Feature sensitivity:** Drop-one/add-one tests (remove VIX, remove breadth, etc.) to check label stability.
- **Era stability:** Compare state profiles and transition matrices pre/post-2015 and during crisis years (e.g., 2020).
- **Bootstrap:** Block bootstrap re-fit; produce confusion matrix for label stability across samples.

---

## 2.6 Diagnostics & QA
- **Plots:**
  - Timeline with regime shading over SPY price & drawdown.
  - Posterior probabilities (stacked area).
  - State return histograms, QQ plots.
  - Transition matrix heatmap, dwell-time distribution.
- **Tables:**
  - State profiles (returns, vol, VIX, breadth, tails).
  - Transition matrix & steady-state distribution.
  - Switch frequency and chattering metrics.
- **Alerts:**
  - Flag if any state has inconsistent semantics (positive mean but top-2 vol, dwell-time < 3 days, mapping flips).

---

## 2.7 Regime-Aware Policy Hooks (Interfaces to Sections 3–5)
- **Weights & turnover caps:** JSON map per regime (e.g., throttle momentum in Risk-Off, upweight quality).
- **Risk targets:** Per-regime vol targets (e.g., 10%/8%/6% for On/Trans/Off).
- **Hedge intensity:** Baseline hedge ratios per regime; pass to RL policy as defaults.
- **Confidence proxy:** Use max posterior or entropy to scale aggressiveness.

---

## 2.8 Walk-Forward Integration
- **Windows:** Match Section 6 (rolling/expanding).
- **Per window:**
  - Fit scaler + HMM on train subset.
  - Apply to test subset only.
  - Save artifacts: `regime_labels_<winid>.parquet`, `regime_hmm.pkl`, `regime_meta.json`.
- **Stitching:** Concatenate per-window outputs into one continuous timeline for backtests.
- **Label stability:** Use saved state→label mapping to avoid regime meaning drift.

---

## 2.9 Forward (Shadow) Mode
- **Daily update:** Apply persisted scaler + HMM to latest t; append to `regime_labels.parquet`.
- **Retrain cadence:** Weekly/bi-weekly.
- **Logging:** Save model hash, posterior, chosen label, features vector.
- **Alerts:** If mapping flips or dwell-time anomaly detected.

---

## 2.10 Configuration & Reproducibility
- **Config keys (`config.yaml`):**
  - Features list for HMM.
  - `n_components`, `MIN_DWELL_DAYS`, `POSTERIOR_THRESH`.
  - Finance recency weighting toggle & decay parameter (if implemented here).
  - Random seed, plot toggles.
- **Serialization:**
  - joblib for model + scaler.
  - JSON for meta (labels, thresholds, diagnostics).
- **Tests:**
  - Deterministic output with fixed seed.
  - No leakage (t-only features).
  - Posterior rows sum to 1; dates strictly increasing.
  - No gaps after stitching.
  - Label semantics test per window.

---

## 2.11 Deliverables Checklist
- `regime_labels.parquet` (+ CSV).
- `regime_hmm.pkl` (model + scaler per window).
- `regime_meta.json` (state→label, scaler params, diagnostics).
- `regime_timeline.png`, `regime_posteriors.png`, `state_profiles.csv`, `transition_matrix.csv`.
- `regime_sensitivity.json` (K/feature/era stability).
- `regime_policy_map.json` (interfaces to Sections 3–5).


---
</details>

# 3. Alpha Layer (Signals)

# 4. Portfolio Construction & Risk

# 5. RL Sizing Policy (PPO)

# 6. Backtesting (Backward Testing) — Rigor

# 7. Forward Testing (No Orders; Shadow Runs)

# 8. Cost Model & Execution Assumptions

# 9. Reproducibility & Testability

# 10. Visualization & Reporting

# 11. Automation Options (Optional, no trading)

# 12. Optional Alpaca Integration (disabled by default)

# 13. File/Module Structure (Colab-friendly)




```
/project
  config.yaml
  data/
    universe.csv
    features.parquet
    regime_labels.parquet
  models/
    lstm_*.pt / .h5
    gbm_*.txt
    stacker_*.pkl
    rl_policy_*.pkl
  runs/YYYY-MM-DD/
    signals.parquet
    weights.parquet
    hedges.parquet
    daily_pnl.csv
    risk.json
  reports/
    backtest_tearsheet.html
    forward_tearsheet_YYYY-MM.html
  src/
    data_loader.py
    feature_engineering.py
    regime.py
    models_lstm.py
    models_tabular.py
    stacking.py
    uncertainty.py
    portfolio_bl_rp.py
    hedging.py
    rl_policy.py
    backtest.py
    forward_shadow.py
    risk_metrics.py
    stats_tests.py  # DM, SPA/White RC, Sharpe inference
    monte_carlo.py  # block bootstrap
    reporting.py    # plots & HTML/PDF
  main.py          # CLI: daily-shadow / weekly-train / monthly-report
  notebook.ipynb   # Colab master: end-to-end run with toggles

```



# 14. More info

- Suggested stack: pandas, numpy, scikit-learn, lightgbm, xgboost, tensorflow/PyTorch (choose one for LSTM), hmmlearn, stable-baselines3, cvxpy (for BL/optimization), arch (optional), statsmodels, scipy, matplotlib/plotly.

Compute plan (fits $50–$100):

- S&P 100, 5–8 walk-forward windows.

- LSTM 1–2 layers (64–128 units), MC-dropout 20 samples.

- PPO with modest timesteps per window.

- 200–400 Monte Carlo bootstrap paths.

- 1–3 GPU hours on Colab Pro/Pro+; RAM < 24GB.

# 15. Build Order (fastest to value)

1. Data + Features + Regimes → validate leakage & plots.

2. Multifactor composite → baseline cross-sec L/S backtest.

3. GBM/MLP + LSTM → stacking + uncertainty; re-run backtest.

4. BL + RP + Dynamic hedge → re-run backtest & stress.

5. RL sizing → ablation vs no-RL; finalize backtest.

6. Forward shadow loop (daily), weekly retrain, monthly reports.

7. Automation (Actions/cron), optional Alpaca paper stub (off).

# 16. What you'll see in the first results
- Backtest tear sheet with OO-S equity curve, MC bands, by-regime tables, SPA/DM outcomes, VaR/CVaR & stress.

- Ablation:

  - Multifactor only → +ML → +ML+RL;

  - Market-neutral vs long-only w/ hedging;

  - Cost sensitivity 5–20 bps.

- A live forward dashboard (from Day 1) accumulating daily PnL + monthly report.



# 17. Forward-Testing Duration Recommendation

- Run at least 4 weeks forward shadow to confirm plumbing & stability.

- Prefer 8–12 weeks to evaluate regime adaptation, RL sizing behavior under drawdowns, and cost realism.

- Only after the forward period matches backtest risk/return within expected error bands should you consider paper-trading execution.



<details>
<summary><strong>Outline Details</strong></summary>

# Project Outline — Regime-Aware Multifactor + LSTM/Ensembles + RL (with rigorous back & forward testing)

## 0) Objectives & Success Criteria
**Primary objective:** Generate statistically significant pure alpha (market-neutral) with controlled drawdowns after transaction costs.  

**Secondary objective:** Build a repeatable process capable of ongoing, unattended forward testing that outputs monthly tear sheets.  

**Pass/Fail gates (OO-S):**  
- Annualized Sharpe ≥ 1.0 (cost-adjusted) across walk-forward windows.  
- SPA/White Reality Check non-rejection vs family of alternatives at 5–10% level.  
- Max DD ≤ 15–20% (tunable) in backtests.  
- Forward test (4–8+ weeks): positive return, rolling Sharpe > 0.8, tail losses consistent with backtest VaR/CVaR.  

---

## 1) Data & Universe

### 1.1 Universe
- S&P 100 equities (liquid, keeps compute sane).  
- Hedging instruments: SPY + sector ETFs (XLY, XLF, XLV, XLK, XLI, XLE, XLP, XLB, XLU, XLRE).  
- Source: Yahoo Finance (daily bars).  
- Lookback: 10–15 years if available (train 2012→, test recent).  

### 1.2 Features
- **Returns/vol:** log returns (1–60d lags), realized vol, ATR.  
- **Momentum:** 12–1, 6–1, 20d, trend filters (e.g., SMA cross, slope).  
- **Value:** B/P, E/P, CF/P, shareholder yield (latest available; forward-fill monthly/quarterly).  
- **Quality:** gross profitability, ROE, accruals, leverage, F-Score-like composite.  
- **Market context:** VIX, SPY vol, market breadth (% advancers, optional).  
- Leakage controls: strictly lag all features, align to t-1; winsorize & z-score cross-sectionally.  

### 1.3 Data Hygiene
- Survivorship-bias approach: use current S&P 100 for practicality; (optional) point-in-time later.  
- Corporate actions: use adjusted prices.  
- Missing fundamentals: impute conservatively or drop; record masks for model.  
- **Deliverables:** `features.parquet`, `universe.csv`, `meta.yaml`.  

---

## 2) Regime Modeling

### 2.1 HMM (2–3 states)
- Inputs: SPY daily returns/vol, VIX level/change, market breadth.  
- States: Risk-On, Risk-Off, Transition (labeled by average return/vol).  
- **Output:** daily regime label + posterior probabilities.  

### 2.2 Usage
- Regime-specific ensemble weights, turnover caps, and risk targets.  
- Momentum throttled in Risk-Off; quality emphasized.  
- **Deliverables:** `regime_labels.parquet`, regime plot.  

---

## 3) Alpha Layer (Signals)

### 3.1 Multifactor Composite
- Value/Momentum/Quality composites (winsorized, z-scored).  
- Per-regime blend fit with ridge.  
- **Output:** factor alpha score per asset/day.  

### 3.2 ML Overlays
- **LSTM:** 60-day sequences → t+5/t+10 returns; MC-dropout for uncertainty.  
- **Tabular ensembles:** LightGBM (primary), XGBoost, small MLP; also quantile versions.  
- **Stacking meta-learner:** ridge/LightGBM; OOF training within walk-forward train window.  
- **Output:** final forecast (mean) + uncertainty proxy.  

### 3.3 Uncertainty → Confidence
- Expected Sharpe proxy = mean / std_hat.  
- Bucket confidence for analytics.  
- **Deliverables:** `alpha_raw.parquet`, `alpha_ensemble.parquet`, feature importance charts.  

---

## 4) Portfolio Construction & Risk

### 4.1 Baseline Weights
- Cross-sectional L/S: long top decile, short bottom decile by forecasted Sharpe.  
- Beta-neutral, per-name and sector caps.  

### 4.2 Black–Litterman (BL)
- Prior: market-cap weights → implied μ.  
- Views: ensemble alphas scaled by uncertainty.  
- Posterior μ̂ → mean-variance with L2 & turnover penalty.  

### 4.3 Risk Parity & Vol Target
- Equalize risk across sector/factor clusters.  
- Target portfolio vol (8–12% ann.).  

### 4.4 Dynamic Hedging
- Daily orthogonalization vs SPY + sectors; hedge ratios adjustable by RL.  
- **Deliverables:** weights, exposures, hedge plots.  

---

## 5) RL Sizing Policy (PPO)

### 5.1 Role
- Scales risk target and tunes hedges.  

### 5.2 State
- Regime, vol, drawdown, alpha strength, uncertainty, turnover, betas, cost model.  

### 5.3 Reward
- PnL – costs – λ·CVaR_tail – κ·Δdrawdown – penalties.  

### 5.4 Training
- Train within walk-forward segments; fixed seeds.  
- **Deliverables:** `rl_policy.pkl`, diagnostics.  

---

## 6) Backtesting (Backward Testing) — Rigor

### 6.1 Walk-Forward Engine
- Rolling/expanding windows; purged & embargoed CV.  
- Refit all models per window; test daily with costs.  

### 6.2 Significance & Reality Checks
- DM test, SPA/White RC, Sharpe inference.  

### 6.3 Tail Risk & Stress
- VaR/CVaR; stress tests (2008/2020, vol shocks, liquidity cuts).  

### 6.4 Monte Carlo Robustness
- Block bootstrap; output PnL envelopes.  
- **Deliverables:** equity curves, DD charts, ablations.  

---

## 7) Forward Testing (Shadow Mode)

### 7.1 Daily Shadow Run
- No backfill; use latest models; log all artifacts.  

### 7.2 Retraining Cadence
- Weekly or bi-weekly; strict forward-only.  

### 7.3 Monthly Auto-Report
- Tear sheets with returns, Sharpe, DD, risk, regime PnL, VaR/CVaR.  

### 7.4 Duration
- Min: 4 weeks; Pref: 8–12 weeks.  
- **Deliverables:** daily run files, monthly reports.  

---

## 8) Cost Model & Execution Assumptions
- Costs: 10 bps round-trip (sweep 5–20).  
- Slippage: 1–2 bps; higher in Risk-Off.  
- Short borrow: 10–50 bps ann.  
- Liquidity caps: ≤5–10% ADV.  

---

## 9) Reproducibility & Testability
- Config-driven (`config.yaml`); fixed seeds.  
- Unit/integration tests for leakage, CV folds, NaNs, RL bounds.  
- Experiment tracking with CSV/JSON + git hash.  

---

## 10) Visualization & Reporting
- Equity curves with regime shading, rolling metrics, exposures, attribution, bucket PnL, by-regime performance, risk dashboards.  

---

## 11) Automation Options
- **Colab:** manual or scheduled;  
- **GitHub Actions:** nightly, weekly, monthly;  
- **VM + cron:** low-budget option.  

---

## 12) Optional Alpaca Integration
- Disabled by default; forward test never sends orders; later optional paper fills.  

</details>
