# Regime-Aware Multifactor + ML/RL Alpha Engine with Backward & Forward Testing

## Project Description

This project is a **modular trading research system** designed to generate **pure alpha** (market-neutral returns independent of market beta) by combining **proven multifactor investing principles** with **modern machine learning and reinforcement learning techniques**, and testing them with **rigorous statistical validation**.

The strategy's profit engine comes from exploiting cross-sectional mispricings in a broad large-cap U.S. universe (**S&P 500 training set with dynamic top-N selection by confidence**) by identifying which stocks are likely to outperform or underperform others over the next 5–10 days. This is achieved through:

- **Multifactor Alpha Layer:**  
  - **Value** (cheap stocks with potential to mean-revert up)  
  - **Momentum** (stocks in persistent trends)  
  - **Quality** (financially strong, operationally robust companies)  
  - Per-regime factor blending with shrinkage to reduce overfitting.

- **Machine Learning Overlays:**  
  - **LSTM** (sequence model) to capture time-series patterns in returns, volatility, and technicals.  
  - **LightGBM/XGBoost/MLP** (tabular models) to detect nonlinear interactions in cross-sectional features.  
  - **Stacking meta-learner** to optimally blend factor and ML outputs.  
  - **Uncertainty quantification** via MC-dropout and quantile models to control position sizing.

- **Regime Detection:**  
  - Hidden Markov Model (HMM) to classify markets as **Risk-On**, **Risk-Off**, or **Transition**, adjusting model weights and risk accordingly.

- **Portfolio Construction & Risk Management:**  
  - **Black-Litterman optimization** to integrate model views with market-implied returns.  
  - **Risk parity** to balance sector/factor exposures.  
  - **Dynamic hedging** against SPY/sector ETFs to maintain market neutrality.

- **Reinforcement Learning (PPO):**  
  - Learns a sizing and hedging policy that adapts risk-taking to forecast strength, uncertainty, and current market regime, maximizing return per unit of tail risk (CVaR-aware reward).
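
The CVaR-aware reward mentioned in the PPO bullet can be sketched as follows. This is a minimal illustration and not the project's reward function: the names `empirical_cvar` and `cvar_aware_reward`, the penalty weight `lam`, and the confidence level `alpha` are all assumptions.

```python
import numpy as np

def empirical_cvar(returns, alpha=0.95):
    """Average loss in the worst (1 - alpha) tail; losses are positive numbers."""
    losses = -np.asarray(returns, dtype=float)
    var = np.quantile(losses, alpha)      # Value-at-Risk at level alpha
    tail = losses[losses >= var]          # losses at or beyond VaR
    return float(tail.mean())

def cvar_aware_reward(step_return, recent_returns, lam=0.5, alpha=0.95):
    """Hypothetical reward: step return minus a penalty on recent tail risk."""
    cvar = empirical_cvar(recent_returns, alpha)
    return float(step_return - lam * max(cvar, 0.0))

# Toy window of recent daily returns (illustrative numbers only)
window = np.array([0.01, -0.02, 0.005, -0.04, 0.015, 0.0, -0.01, 0.02, 0.01, -0.03])
cvar_90 = empirical_cvar(window, alpha=0.9)
reward = cvar_aware_reward(0.01, window, lam=0.5, alpha=0.9)
print(cvar_90, reward)
```

A larger `lam` makes the policy more conservative when the recent window contains heavy losses.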

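The Black-Litterman step in the portfolio-construction bullet combines market-implied returns with model views. Below is a minimal sketch of the standard posterior-mean formula; the function name, the toy covariance matrix, the single relative view, and `tau=0.05` are illustrative assumptions rather than the project's configuration.

```python
import numpy as np

def black_litterman_posterior(pi, sigma, P, q, omega, tau=0.05):
    """Posterior mean of the standard Black-Litterman model.

    mu = [(tau*Sigma)^-1 + P' Omega^-1 P]^-1 [(tau*Sigma)^-1 pi + P' Omega^-1 q]
    """
    pi, q = np.asarray(pi, float), np.asarray(q, float)
    sigma, P, omega = (np.asarray(a, float) for a in (sigma, P, omega))
    prior_prec = np.linalg.inv(tau * sigma)   # precision of the equilibrium prior
    view_prec = np.linalg.inv(omega)          # precision of the views
    A = prior_prec + P.T @ view_prec @ P
    b = prior_prec @ pi + P.T @ view_prec @ q
    return np.linalg.solve(A, b)

# Toy inputs (illustrative numbers only)
pi = np.array([0.04, 0.05, 0.06])             # market-implied returns
sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
P = np.array([[1.0, -1.0, 0.0]])              # view: asset 0 outperforms asset 1 ...
q = np.array([0.02])                          # ... by 2%

mu_confident = black_litterman_posterior(pi, sigma, P, q, omega=np.array([[1e-8]]))
mu_vague = black_litterman_posterior(pi, sigma, P, q, omega=np.array([[1e6]]))
print(P @ mu_confident, mu_vague)
```

With a very small view variance the posterior spread between the two assets approaches the view, and with a very large one the posterior stays at the market-implied returns. That is the behaviour uncertainty-aware views are meant to exploit.
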
## Testing & Validation

The project integrates **both backward and forward testing** to ensure robustness:

- **Backward Testing (Historical):**  
  - Walk-forward analysis with purged cross-validation to avoid look-ahead bias.  
  - Statistical significance tests (Diebold-Mariano, SPA/White Reality Check) to confirm non-randomness.  
  - Monte Carlo block bootstrap to estimate confidence intervals and failure probabilities.  
  - VaR/CVaR analysis and stress testing against historical crisis scenarios.

- **Forward Testing (Shadow, No Trades):**  
  - Daily simulation using only forward data, logging PnL and risk metrics without sending orders.  
  - Weekly retraining and monthly auto-generated tear sheets to track live performance against backtest expectations.  
  - Recommended forward-testing period: 4–12 weeks before considering paper/live execution.
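
The Monte Carlo block bootstrap listed under backward testing can be sketched like this. It is a minimal moving-block version on synthetic returns; the block length, number of resamples, and the definition of "failure" as a negative Sharpe ratio are assumptions for illustration.

```python
import numpy as np

def annualized_sharpe(r, periods=252):
    r = np.asarray(r, dtype=float)
    sd = r.std(ddof=1)
    return float(np.sqrt(periods) * r.mean() / sd) if sd > 0 else 0.0

def block_bootstrap_sharpe(returns, block=10, n_boot=500, seed=0):
    """Moving-block bootstrap: resample contiguous blocks to keep short-range dependence."""
    r = np.asarray(returns, dtype=float)
    n = len(r)
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block))
    out = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n - block + 1, size=n_blocks)   # block start indices
        idx = (starts[:, None] + np.arange(block)[None, :]).ravel()[:n]
        out[b] = annualized_sharpe(r[idx])
    return out

# Synthetic daily returns, for illustration only
daily = np.random.default_rng(42).normal(0.0005, 0.01, size=750)
dist = block_bootstrap_sharpe(daily)
lo, hi = np.percentile(dist, [2.5, 97.5])      # 95% confidence interval
p_fail = float((dist < 0).mean())              # share of resamples with Sharpe < 0
print(f"Sharpe 95% CI: [{lo:.2f}, {hi:.2f}], P(Sharpe < 0) = {p_fail:.3f}")
```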

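Walk-forward analysis with purging can be illustrated with plain index arithmetic. In this sketch the helper name `purged_walk_forward` and all window sizes are hypothetical; the purge gap is set to 10 observations on the assumption that labels look up to 10 days ahead.

```python
def purged_walk_forward(n_obs, train_size, test_size, purge):
    """Return (train_idx, test_idx) pairs with `purge` observations dropped between them."""
    splits = []
    start = 0
    while start + train_size + purge + test_size <= n_obs:
        train = list(range(start, start + train_size))
        test_start = start + train_size + purge     # skip the purge gap
        test = list(range(test_start, test_start + test_size))
        splits.append((train, test))
        start += test_size                          # roll forward by one test block
    return splits

splits = purged_walk_forward(n_obs=100, train_size=50, test_size=10, purge=10)
for train, test in splits:
    print(f"train {train[0]}-{train[-1]}  test {test[0]}-{test[-1]}")
```

Because the last training label would otherwise overlap the first test observation, the gap keeps forward-looking targets from leaking across the boundary.
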
## Goal

The system's goal is to produce **consistent, statistically validated alpha** with low correlation to the market and controlled drawdowns, using a combination of **factor investing, machine learning, and reinforcement learning**. This approach is intended to raise the probability of sustainable profitability before any real capital is risked.



# Objectives & Success Criteria
- Primary objective: Generate statistically significant pure alpha (market-neutral) with controlled drawdowns after transaction costs.

- Secondary objective: Build a repeatable process capable of ongoing, unattended forward testing that outputs monthly tear sheets.

- Pass/Fail gates (OOS):
  - Annualized Sharpe ≥ 1.0 (cost-adjusted) across walk-forward windows.
  - SPA/White Reality Check rejects the null of no outperformance vs. the family of alternatives at the 5–10% level.
  - Max DD ≤ 15–20% (tunable) in backtests.
  - Forward test (4–8+ weeks): positive return, rolling Sharpe > 0.8, tail losses consistent with backtest VaR/CVaR.
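
The Sharpe and drawdown gates above can be checked mechanically. The sketch below assumes the input is a series of daily returns already net of transaction costs; the function names and the default thresholds (Sharpe 1.0, drawdown 20%) are illustrative choices within the ranges listed.

```python
import numpy as np

def annualized_sharpe(daily_returns, periods=252):
    r = np.asarray(daily_returns, dtype=float)
    sd = r.std(ddof=1)
    return float(np.sqrt(periods) * r.mean() / sd) if sd > 0 else float("nan")

def max_drawdown(daily_returns):
    r = np.asarray(daily_returns, dtype=float)
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])  # start from 1.0 of capital
    peak = np.maximum.accumulate(equity)                   # running high-water mark
    return float((1.0 - equity / peak).max())

def check_gates(daily_returns, min_sharpe=1.0, max_dd=0.20):
    sharpe = annualized_sharpe(daily_returns)
    dd = max_drawdown(daily_returns)
    return {"sharpe": sharpe, "max_dd": dd,
            "sharpe_ok": bool(sharpe >= min_sharpe), "dd_ok": bool(dd <= max_dd)}

report = check_gates([0.10, -0.50, 0.20])   # toy path with a 50% drawdown
print(report)
```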



# 1. Data & Universe

In [None]:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

if device == "cuda":
    print(torch.cuda.get_device_name(0))


Using device: cuda
Tesla T4


In [None]:
%pip -q install yfinance pandas numpy PyYAML pyarrow statsmodels tenacity

In [None]:
# ============================================================
# 1.1 UNIVERSE (UPDATED)
# S&P 500 training set with dynamic top-N selection by confidence (later in pipeline).
# Hedging instruments: SPY + sector ETFs.
# Source: Yahoo Finance (daily bars). Lookback from 2006-01-01 to today.
# Saves: universe.csv and raw_prices.parquet (OHLCV + Adj Close for all tickers incl. hedges + ^VIX)
# ============================================================

import pandas as pd
import numpy as np
import yfinance as yf
from datetime import datetime

START_DATE = "2006-01-01"
END_DATE = datetime.today().strftime("%Y-%m-%d")

def to_fmp_symbol(sym: str) -> str:
    # map Yahoo/WSJ style class tickers to FMP
    return sym.replace("-", ".") if "-" in sym else sym

def is_index_like(sym: str) -> bool:
    # skip ^VIX and other index-style series for FMP backfill
    return sym.startswith("^")

# --- Get S&P 500 constituents from Wikipedia (survivorship bias acknowledged) ---
sp500_url = "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
tables = pd.read_html(sp500_url)
sp500 = tables[0]  # first table
tickers_raw = sp500["Symbol"].tolist()
# Some tickers on Wikipedia have periods; yfinance uses dashes for certain cases
tickers = [t.replace(".", "-") for t in tickers_raw]

# --- Hedging instruments (market & sector ETFs) ---
hedges = ["SPY", "XLY", "XLF", "XLV", "XLK", "XLI", "XLE", "XLP", "XLB", "XLU", "XLRE"]
context_symbols = ["^VIX"]  # market context series

universe = sorted(set(tickers))
universe_all = sorted(set(universe + hedges + context_symbols))

# --- Save universe to CSV ---
pd.DataFrame({"ticker": universe}).to_csv(f"universe_{END_DATE}.csv", index=False)
pd.DataFrame({"ticker": universe}).to_csv("universe.csv", index=False)  # pointer


# --- Download daily OHLCV for all symbols ---
# yfinance handles adjusted prices; we'll keep both Close & Adj Close.
data = yf.download(
    universe_all,
    start=START_DATE,
    end=END_DATE,
    auto_adjust=False,
    group_by="ticker",
    progress=False,
    threads=True,
)

if data is None or getattr(data, "empty", False):
    raise RuntimeError("yfinance returned no data; try rerunning or chunking the request.")

def top_level_symbols(df):
    # Handles both MultiIndex (normal multi-ticker) and flat columns (edge cases)
    if isinstance(df.columns, pd.MultiIndex):
        return set(df.columns.get_level_values(0))
    # flat columns -> we can only have one symbol; yfinance puts OHLCV names as columns
    return set()  # treat as empty to trigger backfill logic safely

# added: tells us if yfinance skipped any tickers
available = top_level_symbols(data)
missing = [sym for sym in universe_all if sym not in available]
if missing:
    pd.Series(missing, name="missing_symbols").to_csv("missing_symbols.csv", index=False)
    print(f"WARNING: {len(missing)} symbols missing from download. Saved to missing_symbols.csv")

# Normalize to tidy format: MultiIndex -> long DataFrame
frames = []
if isinstance(data.columns, pd.MultiIndex):
    for sym in universe_all:
        if sym not in available:
            continue
        df = data[sym].copy()
        df.columns = [c.lower().replace(" ", "_") for c in df.columns]
        df["ticker"] = sym
        frames.append(df.reset_index().rename(columns={"Date": "date"}))
else:
    # Edge: flat columns; shouldn't happen with many symbols, but keep it safe
    df = data.copy()
    df.columns = [c.lower().replace(" ", "_") for c in df.columns]
    df["ticker"] = universe_all[0]
    frames.append(df.reset_index().rename(columns={"Date": "date"}))

prices = pd.concat(frames, ignore_index=True).sort_values(["ticker", "date"])
prices["date"] = pd.to_datetime(prices["date"])

# Basic sanity: drop rows with all NaNs for OHLCV
keep_cols = ["open", "high", "low", "close", "adj_close", "volume"]
prices = prices.dropna(subset=keep_cols, how="all")

# Save raw prices
prices.to_parquet("raw_prices.parquet", index=False)

print(f"Universe size (S&P 500): {len(universe)} tickers")
print(f"Total symbols incl. hedges/context: {len(universe_all)}")
print("Saved: universe.csv, raw_prices.parquet")


Universe size (S&P 500): 503 tickers
Total symbols incl. hedges/context: 515
Saved: universe.csv, raw_prices.parquet
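
The error message in the download cell suggests chunking the request when the bulk download returns nothing. A minimal sketch of such a helper follows; `chunked` is a hypothetical name and the batch size is arbitrary. No network call is made here.

```python
from typing import Iterator, List, Sequence

def chunked(symbols: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` symbols."""
    if size <= 0:
        raise ValueError("size must be positive")
    for i in range(0, len(symbols), size):
        yield list(symbols[i:i + size])

batches = list(chunked(["AAPL", "MSFT", "BRK-B", "SPY", "^VIX"], size=2))
print(batches)
```

Each batch could then be passed to `yf.download` in turn and the resulting frames concatenated, which keeps a single failed batch from discarding the whole universe.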


In [None]:
# ---- Optional: Backfill any missing tickers with FMP (skip ^VIX etc.) ----
import os, requests, time
from getpass import getpass

if os.path.exists("missing_symbols.csv"):
    missing = pd.read_csv("missing_symbols.csv")["missing_symbols"].tolist()
else:
    uni = pd.read_csv("universe.csv")["ticker"].tolist()
    hedges = ["SPY","XLY","XLF","XLV","XLK","XLI","XLE","XLP","XLB","XLU","XLRE"]
    context = ["^VIX"]
    universe_all = sorted(set(uni + hedges + context))
    base_prices = pd.read_parquet("raw_prices.parquet")
    present = set(base_prices["ticker"].unique())
    missing = [s for s in universe_all if s not in present]

missing = [s for s in missing if not is_index_like(s)]
if not missing:
    print("No missing symbols to backfill.")
else:
    print(f"Backfilling {len(missing)} symbols from FMP (skipping indexes):", missing[:8], "...")
    FMP_API_KEY = os.environ.get("FMP_API_KEY", "").strip() or getpass("Enter FMP API key for price backfill: ").strip()
    if not FMP_API_KEY:
        raise RuntimeError("FMP_API_KEY required for backfill.")

    base_url = "https://financialmodelingprep.com/api/v3/historical-price-full"
    def fetch_fmp_prices(sym):
        fmp_sym = to_fmp_symbol(sym)
        # Request full OHLCV history; with `serietype=line` only close prices come back.
        url = f"{base_url}/{fmp_sym}?from={START_DATE}&to={END_DATE}&apikey={FMP_API_KEY}"
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        js = r.json()
        hist = js.get("historical", [])
        if not hist:
            return None
        df = pd.DataFrame(hist)
        df["date"] = pd.to_datetime(df["date"])
        # Map columns; fall back if adjClose missing
        df = df.rename(columns={"adjClose":"adj_close"})
        if "adj_close" not in df.columns:
            df["adj_close"] = df["close"]
        cols = ["date","open","high","low","close","adj_close","volume"]
        for c in cols:
            if c not in df.columns: df[c] = np.nan
        df = df[cols]
        df["ticker"] = sym
        return df.sort_values("date")

    filled = []
    for i, sym in enumerate(missing, 1):
        try:
            df = fetch_fmp_prices(sym)
            if df is not None and len(df):
                filled.append(df)
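        except Exception as e:
            # Report the failure instead of skipping the symbol silently
            print(f"  backfill failed for {sym}: {e}")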
        if i % 10 == 0:
            time.sleep(0.5)  # be polite

    if filled:
        add = pd.concat(filled, ignore_index=True)
        base_prices = pd.read_parquet("raw_prices.parquet")
        prices_fixed = pd.concat([base_prices, add], ignore_index=True).sort_values(["ticker","date"])
        prices_fixed.to_parquet("raw_prices.parquet", index=False)
        print(f"Backfilled {add['ticker'].nunique()} symbols and re-saved raw_prices.parquet")
    else:
        print("FMP backfill returned no data; proceeding without these tickers.")

No missing symbols to backfill.
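
The backfill loop gives each symbol a single attempt. For transient HTTP errors, a retry with exponential backoff is a common remedy. The helper below is a hypothetical, dependency-free sketch (`retry_call` is not part of the notebook), with the sleep function injectable so the example runs instantly.

```python
import time

def retry_call(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**k and retry, re-raising after the last attempt."""
    for k in range(attempts):
        try:
            return fn()
        except Exception:
            if k == attempts - 1:
                raise
            sleep(base_delay * (2 ** k))

# Simulated flaky request: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

delays = []
result = retry_call(flaky, attempts=5, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

In the notebook, `fn` would be something like `lambda: fetch_fmp_prices(sym)`.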


In [None]:
# ============================================================
# 1.2 FEATURES (FMP Premium, no hard-coded key)
# ------------------------------------------------------------
# Builds:
#   • Price/technical features (returns/vol/ATR/momentum/trend)
#   • Market context (SPY vol, ^VIX, breadth)
#   • Fundamentals via FMP (quarterly BS/IS/CF), cached per ticker,
#     forward-filled to daily, and ratio metrics (Value + Quality)
# Post-merge:
#   • Leakage control (shift all predictive features by 1 day)
#   • Winsorize & cross-sectional z-score (by date)
#   • Fundamentals imputation + missing masks
# Saves:
#   • features.parquet
#   • funda_quarterly.parquet, funda_daily.parquet
#   • cache/funda_q_<TICKER>.parquet (per-ticker cache)
# Notes:
#   - API key is taken from env var FMP_API_KEY or prompted securely.
# ============================================================

# %pip -q install yfinance pyarrow tenacity

import os, time, random, gc
import pandas as pd
import numpy as np
import yfinance as yf
from getpass import getpass
from concurrent.futures import ThreadPoolExecutor, as_completed
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

# ---------- Config / Toggles ----------
COMPUTE_SLOPE = True      # slope_20 via vectorized method
SLOPE_WINDOW = 20
RV_WIN = 20
ATR_WIN = 14

# Fundamentals provider config (FMP Premium)
PROVIDER = "fmp"          # fixed to FMP for reliability
FMP_API_KEY = os.environ.get("FMP_API_KEY", "").strip()
if not FMP_API_KEY:
    # Prompt securely; not echoed, not written to disk
    FMP_API_KEY = getpass("Enter your FMP API key (kept in-memory for this session): ").strip()
if not FMP_API_KEY:
    raise RuntimeError("FMP_API_KEY is required. Set env var FMP_API_KEY or enter it when prompted.")

def to_fmp_symbol(sym: str) -> str:
    """
    Convert Yahoo-style tickers to FMP-style.
    Yahoo uses '-' for class/shared tickers (e.g., BRK-B),
    while FMP uses '.' (e.g., BRK.B). Everything else stays the same.
    """
    # common class/delimiter cases
    # e.g., BRK-B, BF-B, FOXA (no change), META (no change)
    if "-" in sym:
        return sym.replace("-", ".")
    return sym

# Chunking: Premium can fetch all at once. If you ever need throttling, set CHUNK_TICKERS to an int.
CHUNK_TICKERS = 20        # TODO: set to None to process the entire universe in one go instead of only the first 20 tickers
START_AT = 0              # offset if chunking
SKIP_IF_CACHED = True     # skip ticker if cache exists

MAX_WORKERS = 4           # Premium can handle more concurrency; tune 4–12 as you like
RETRY_ATTEMPTS = 5
BATCH_SLEEP = (0.2, 0.6)  # polite jitter between HTTP calls
CACHE_DIR = "cache"
os.makedirs(CACHE_DIR, exist_ok=True)

# ---------- Load raw prices & universe (from 1.1) ----------
prices = pd.read_parquet("raw_prices.parquet")
universe_full = list(pd.read_csv("universe.csv")["ticker"])
hedges = {"SPY", "XLY", "XLF", "XLV", "XLK", "XLI", "XLE", "XLP", "XLB", "XLU", "XLRE"}
context_symbols = {"^VIX"}

# ============================================================
# A) PRICE / TECHNICAL FEATURES
# ============================================================

def compute_atr(df, window=ATR_WIN):
    high, low, close = df["high"], df["low"], df["close"]
    prev_close = close.shift(1)
    tr = pd.concat([(high - low),
                    (high - prev_close).abs(),
                    (low - prev_close).abs()], axis=1).max(axis=1)
    return tr.rolling(window).mean()

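def vectorized_rolling_slope(y: pd.Series, window=SLOPE_WINDOW) -> pd.Series:
    """OLS slope of y on x = 0..N-1 over each trailing window of N observations."""
    N = window
    if N <= 1:
        return pd.Series(np.nan, index=y.index, dtype=float)
    x = np.arange(N, dtype=float)
    Sx = x.sum()
    Sxx = (x**2).sum()
    yv = y.to_numpy(dtype=float)
    yv = np.where(np.isfinite(yv), yv, 0.0)  # non-finite values are treated as 0
    k = np.ones(N, dtype=float)
    # Element t of the "full" convolution sums over the window ENDING at t, so keep
    # the first len(yv) elements; a later slice would use future observations.
    Sy  = np.convolve(yv, k[::-1], mode="full")[:len(yv)]
    Sxy = np.convolve(yv, x[::-1], mode="full")[:len(yv)]
    denom = N * Sxx - Sx * Sx + 1e-12
    slope = (N * Sxy - Sx * Sy) / denom
    out = pd.Series(np.nan, index=y.index, dtype=float)
    out.iloc[N-1:] = slope[N-1:]  # the first N-1 windows are incomplete
    return out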

def mom_over_n(adj_close, n):
    return np.log(adj_close / adj_close.shift(n))

feat_frames = []
tickers = sorted(prices["ticker"].unique())
total = len(tickers)

for i, (sym, df_sym) in enumerate(prices.groupby("ticker"), start=1):
    if sym in context_symbols:
        continue
    if i % 25 == 0:
        print(f"[Features] {i}/{total} processed… ({sym})")

    df = df_sym.sort_values("date").copy()
    df["ret_1d"] = np.log(df["adj_close"] / df["adj_close"].shift(1))
    for l in range(1, 61):
        df[f"ret_lag_{l}"] = df["ret_1d"].shift(l)

    df["rv_20"] = df["ret_1d"].rolling(RV_WIN).std() * np.sqrt(252)
    df["atr_14"] = compute_atr(df, ATR_WIN)

    df["mom_20"]  = mom_over_n(df["adj_close"], 20)
    df["mom_6m"]  = mom_over_n(df["adj_close"], 126)
    df["mom_12m"] = mom_over_n(df["adj_close"], 252)
    df["mom_12_1"] = np.log(df["adj_close"].shift(21) / df["adj_close"].shift(252))
    df["mom_6_1"]  = np.log(df["adj_close"].shift(21) / df["adj_close"].shift(126))

    df["sma_20"] = df["adj_close"].rolling(20).mean()
    df["sma_50"] = df["adj_close"].rolling(50).mean()
    df["sma_20_gt_50"] = (df["sma_20"] > df["sma_50"]).astype("float32")
    df["slope_20"] = vectorized_rolling_slope(df["adj_close"], window=SLOPE_WINDOW) if COMPUTE_SLOPE else np.nan

    df["mom_20_vs_vol"] = df["mom_20"] / (df["ret_1d"].rolling(20).std() + 1e-8)

    feat_frames.append(df)

features = pd.concat(feat_frames, ignore_index=True)

# ============================================================
# B) MARKET CONTEXT (SPY vol, VIX, breadth)
# ============================================================

vix = prices[prices["ticker"] == "^VIX"][["date", "adj_close"]].rename(columns={"adj_close": "vix_close"})
spy = prices[prices["ticker"] == "SPY"].copy()
spy["spy_ret"] = np.log(spy["adj_close"] / spy["adj_close"].shift(1))
spy["spy_rv_20"] = spy["spy_ret"].rolling(20).std() * np.sqrt(252)
ctx = spy[["date", "spy_rv_20"]].merge(vix, on="date", how="left")

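# Breadth = share of S&P 500 names advancing; hedge ETFs are excluded from the count
rets = features[features["ticker"].isin(universe_full)].pivot(index="date", columns="ticker", values="ret_1d")
advancers = (rets > 0).sum(axis=1)
# Fixed denominator for stability = full S&P 500 count from universe.csv
breadth = (advancers / len(universe_full)).rename("breadth")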
ctx = ctx.merge(breadth.reset_index(), on="date", how="left")

features = features.merge(ctx, on="date", how="left")

# ============================================================
# C) FUNDAMENTALS (FMP Premium primary; cached per ticker)
# ============================================================

px_daily_all = prices[prices["ticker"].isin(universe_full)][["date", "ticker", "adj_close"]].copy()
px_daily_all["date"] = pd.to_datetime(px_daily_all["date"])
dates_all = px_daily_all[["date"]].drop_duplicates().sort_values("date")

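def _tidy_quarterly_df(df: pd.DataFrame) -> pd.DataFrame:
    # NOTE: "date" is the fiscal period end, not the day the report became public,
    # so keying on it lets values appear before they were filed. "fillingDate"
    # (FMP's spelling) is only used as a fallback here.
    df = df.copy()
    if "date" in df.columns:
        df["date"] = pd.to_datetime(df["date"])
    elif "fillingDate" in df.columns:
        df["date"] = pd.to_datetime(df["fillingDate"])
    return df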

def _coalesce_cols(df: pd.DataFrame, cols: list[str], default=np.nan) -> pd.Series:
    avail = [c for c in cols if c in df.columns]
    if not avail:
        return pd.Series(default, index=df.index)
    tmp = df[avail].apply(pd.to_numeric, errors="coerce")
    # first non-null across the candidate columns
    s = tmp.bfill(axis=1).iloc[:, 0]
    return s

def _fetch_quarterly_funda_fmp(ticker: str) -> pd.DataFrame:
    import requests
    base = "https://financialmodelingprep.com/api/v3"
    fmp_ticker = to_fmp_symbol(ticker)   # BRK-B -> BRK.B

    def jget(url):
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        return r.json()

    # pull a long history (Premium supports it)
    bs = _tidy_quarterly_df(pd.DataFrame(jget(f"{base}/balance-sheet-statement/{fmp_ticker}?period=quarter&limit=120&apikey={FMP_API_KEY}")))
    is_ = _tidy_quarterly_df(pd.DataFrame(jget(f"{base}/income-statement/{fmp_ticker}?period=quarter&limit=120&apikey={FMP_API_KEY}")))
    cf = _tidy_quarterly_df(pd.DataFrame(jget(f"{base}/cash-flow-statement/{fmp_ticker}?period=quarter&limit=120&apikey={FMP_API_KEY}")))
    if bs.empty and is_.empty and cf.empty:
        raise RuntimeError(f"FMP fundamentals empty for {ticker} (queried as {fmp_ticker})")

    out = bs.merge(is_, on="date", how="outer").merge(cf, on="date", how="outer")

    # Coalesce across schema variants
    out["book_equity"]  = _coalesce_cols(out, ["totalStockholdersEquity","totalShareholderEquity","totalEquity"]).astype(float)
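    # netIncome appears in both the income and cash-flow statements, so the merges
    # above suffix it (_x/_y); include the suffixed names or the lookup finds nothing.
    out["net_income"]   = _coalesce_cols(out, ["netIncome","netIncome_x","netIncome_y","netIncomeApplicableToCommonShares"]).astype(float)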
    out["ocf"]          = _coalesce_cols(out, [
        "netCashProvidedByOperatingActivities",
        "netCashProvidedByUsedInOperatingActivities",
        "netCashProvidedByUsedInOperatingActivitiesContinuingOperations"
    ]).astype(float)
    out["gross_profit"] = _coalesce_cols(out, ["grossProfit"]).astype(float)
    out["total_assets"] = _coalesce_cols(out, ["totalAssets"]).astype(float)

    # total_debt: prefer totalDebt; else short + long
    td = _coalesce_cols(out, ["totalDebt"])
    if td.isna().all():
        short = _coalesce_cols(out, ["shortTermDebt","shortLongTermDebtTotal"])
        long  = _coalesce_cols(out, ["longTermDebt"])
        td = (short.fillna(0) + long.fillna(0)).replace({0: np.nan})
    out["total_debt"] = td.astype(float)

    # dividends / buybacks (raw signs as provided by FMP)
    out["dividends"] = _coalesce_cols(out, ["dividendsPaid","dividendsPaidCashFlow"]).astype(float)
    out["buybacks"]  = _coalesce_cols(out, ["commonStockRepurchased","purchaseOfCommonStock"]).astype(float)

    out["ticker"] = ticker  # keep Yahoo-style symbol for our dataset

    cols = ["date","ticker","book_equity","net_income","ocf","gross_profit",
            "total_assets","total_debt","dividends","buybacks"]
    return out[cols].dropna(subset=["date"])


def fetch_or_load_cached_quarterly(ticker: str) -> pd.DataFrame | None:
    path = os.path.join(CACHE_DIR, f"funda_q_{ticker}.parquet")
    if SKIP_IF_CACHED and os.path.exists(path):
        try:
            return pd.read_parquet(path)
        except Exception:
            pass
    try:
        df = _fetch_quarterly_funda_fmp(ticker)
        if df is None or df.empty:
            return None
        df.to_parquet(path, index=False)
        time.sleep(random.uniform(*BATCH_SLEEP))  # polite pause
        return df
    except Exception:
        return None

# ---- SMOKE TEST (run once, then you can comment it out) ----
try:
    from IPython.display import display
except Exception:
    pass

test_syms = ["AAPL", "MSFT", "BRK-B", "BF-B"]
for s in test_syms:
    try:
        df = _fetch_quarterly_funda_fmp(s)
        # trim preview to match your price history
        df = df[df["date"] >= pd.to_datetime(START_DATE)]
        print(s, "→", to_fmp_symbol(s), "rows:", len(df))
        try:
            display(df.head(2))
        except Exception:
            print(df.head(2))
    except Exception as e:
        print("ERR", s, e)

# Determine chunk (or all)
if CHUNK_TICKERS:
    end_at = min(len(universe_full), START_AT + CHUNK_TICKERS)
    tickers_chunk = sorted(universe_full[START_AT:end_at])
    print(f"[FMP] Processing chunk {START_AT}:{end_at} (size={len(tickers_chunk)})")
else:
    tickers_chunk = sorted(universe_full)
    print(f"[FMP] Processing entire universe (size={len(tickers_chunk)})")

# Parallel fetch with caching
funda_parts, successes = [], 0
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex:
    futs = {ex.submit(fetch_or_load_cached_quarterly, t): t for t in tickers_chunk}
    for j, fut in enumerate(as_completed(futs), start=1):
        df = fut.result()
        if df is not None and len(df):
            funda_parts.append(df)
            successes += 1
        if j % 25 == 0:
            print(f"[Fundamentals/FMP] {j}/{len(tickers_chunk)} processed… (successes: {successes})")

if successes == 0:
    raise RuntimeError("No fundamentals fetched. Check your FMP key or try a smaller CHUNK_TICKERS with fewer workers.")

# Merge chunk with existing quarterly file (so multiple runs accumulate)
q_path = "funda_quarterly.parquet"
funda_q_chunk = pd.concat(funda_parts, ignore_index=True).sort_values(["ticker","date"])

# after you build funda_q_chunk
cutoff = pd.to_datetime(px_daily_all["date"].min())
funda_q_chunk = funda_q_chunk[funda_q_chunk["date"] >= cutoff]

if os.path.exists(q_path):
    old = pd.read_parquet(q_path)
    old["date"] = pd.to_datetime(old["date"])
    old = old[old["date"] >= cutoff]  # <- trim the old file too
    funda_q = (
        pd.concat([old, funda_q_chunk], ignore_index=True)
          .drop_duplicates(["ticker","date"], keep="last")
          .sort_values(["ticker","date"])
    )
else:
    funda_q = funda_q_chunk

funda_q = funda_q.sort_values(["ticker","date"])
funda_q.to_parquet(q_path, index=False)
print(
    f"Saved: {q_path}  "
    f"(tickers with funda total: {funda_q['ticker'].nunique()}, "
    f"rows: {len(funda_q)})"
)

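# Quarterly → daily (forward-fill per ticker) across ALL tickers collected so far
ff = []
for sym, grp in funda_q.groupby("ticker"):
    # Outer merge keeps period-end dates that fall on non-trading days (e.g. a
    # Saturday); after the forward-fill, only trading dates are retained.
    g = dates_all.merge(grp, on="date", how="outer").sort_values("date")
    g["ticker"] = sym
    g = g.ffill()
    g = g[g["date"].isin(dates_all["date"])]
    ff.append(g)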
funda_daily = pd.concat(ff, ignore_index=True)

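# Ratios (Value & Quality)
# NOTE: the numerators are company-level totals while `price` is per share, so the
# "yield" ratios are scaled by share count; dividing by market cap would remove that.
fd = funda_daily.merge(px_daily_all, on=["date","ticker"], how="left")
price = fd["adj_close"].replace(0, np.nan)

fd["book_to_price"]     = fd["book_equity"] / price
fd["earnings_yield"]    = fd["net_income"]  / price
fd["cf_yield"]          = fd["ocf"]         / price
# Dividends and buybacks both arrive as negative cash flows (see the smoke-test
# output), so negate both to make payouts to shareholders positive.
fd["shareholder_yield"] = -(fd["dividends"].fillna(0) + fd["buybacks"].fillna(0)) / price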

fd["gross_profitability"] = fd["gross_profit"] / fd["total_assets"].replace(0, np.nan)
fd["roe"]                 = fd["net_income"] / fd["book_equity"].replace(0, np.nan)
fd["accruals"]            = (fd["net_income"] - fd["ocf"]) / fd["total_assets"].replace(0, np.nan)
fd["leverage"]            = fd["total_debt"] / fd["total_assets"].replace(0, np.nan)

funda_daily = fd[[
    "date","ticker","book_to_price","earnings_yield","cf_yield","shareholder_yield",
    "gross_profitability","roe","accruals","leverage"
]]
funda_daily.to_parquet("funda_daily.parquet", index=False)
print("Saved: funda_daily.parquet")

# Merge fundamentals into features
features = features.merge(funda_daily, on=["date","ticker"], how="left")

# ============================================================
# D) POST-MERGE HYGIENE
# ============================================================

# 1) Leakage control
non_feature_cols = {"date","ticker","open","high","low","close","adj_close","volume"}
cols_to_shift = [c for c in features.columns if c not in non_feature_cols]
features[cols_to_shift] = features.groupby("ticker")[cols_to_shift].shift(1)

# 2) Winsorize & cross-sectional z-score
def winsorize_cs(s, lo=0.01, hi=0.99):
    ql, qh = s.quantile(lo), s.quantile(hi)
    return s.clip(ql, qh)

# --- choose features for cross-sectional standardization (exclude context & raw SMAs) ---
cs_cols = [
    "rv_20","atr_14","mom_20","mom_6m","mom_12m","mom_12_1","mom_6_1",
    "sma_20_gt_50","slope_20","mom_20_vs_vol",
    # fundamentals
    "book_to_price","earnings_yield","cf_yield","shareholder_yield",
    "gross_profitability","roe","accruals","leverage"
] + [f"ret_lag_{l}" for l in range(1,61)]

# keep context raw (no CS z-score)
context_keep_raw = ["spy_rv_20","vix_close","breadth"]

present = [c for c in cs_cols if c in features.columns]

print(f"[Standardize] Cross-sectional z-score on {len(present)} features")

def cs_standardize_fast(df, cols, lo=0.01, hi=0.99):
    out = df.copy()
    out[cols] = out[cols].astype("float32")

    d = out["date"]
    for c in cols:
        s = out[c]

        ql = s.groupby(d).transform(lambda x: x.quantile(lo))
        qh = s.groupby(d).transform(lambda x: x.quantile(hi))
        s_clip = s.clip(ql, qh)

        mu = s_clip.groupby(d).transform("mean")
        sd = s_clip.groupby(d).transform("std")

        # if std is 0 or NaN (date-constant or all-NaN), set denom=1 to avoid blowing up / NaNs
        denom = sd.fillna(0.0).replace(0.0, 1.0)

        out[c] = ((s_clip - mu) / (denom + 1e-9)).astype("float32")

    return out

features = cs_standardize_fast(features, present)

# 3) Fundamentals imputation + masks
funda_cols = ["book_to_price","earnings_yield","cf_yield","shareholder_yield",
              "gross_profitability","roe","accruals","leverage"]
for c in funda_cols:
    if c in features.columns:
        features[f"{c}_is_missing"] = features[c].isna().astype(int)
        features[c] = features.groupby("date")[c].transform(lambda s: s.fillna(s.median()))

# Save final
features.to_parquet("features.parquet", index=False)
print("Saved: features.parquet (lagged, winsorized, cross-sectional z-scored)")
print("Artifacts: funda_quarterly.parquet, funda_daily.parquet, cache/funda_q_*.parquet")

gc.collect()

Enter your FMP API key (kept in-memory for this session): ··········
[Features] 25/515 processed… (AMAT)
[Features] 50/515 processed… (BA)
[Features] 75/515 processed… (CARR)
[Features] 100/515 processed… (CNP)
[Features] 125/515 processed… (DAL)
[Features] 150/515 processed… (EA)
[Features] 175/515 processed… (EXE)
[Features] 200/515 processed… (GEHC)
[Features] 225/515 processed… (HON)
[Features] 250/515 processed… (IT)
[Features] 275/515 processed… (LDOS)
[Features] 300/515 processed… (MCO)
[Features] 325/515 processed… (MTCH)
[Features] 350/515 processed… (OKE)
[Features] 375/515 processed… (PNR)
[Features] 400/515 processed… (RSG)
[Features] 425/515 processed… (SWKS)
[Features] 450/515 processed… (TSN)
[Features] 475/515 processed… (VST)
[Features] 500/515 processed… (XLF)
AAPL → AAPL rows: 78


| | date | ticker | book_equity | net_income | ocf | gross_profit | total_assets | total_debt | dividends | buybacks |
|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 2006-04-01 | AAPL | 8682000000.0 | NaN | -125000000.0 | 1297000000.0 | 13911000000.0 | 0.0 | 0.0 | 0.0 |
| 43 | 2006-07-01 | AAPL | 9330000000.0 | NaN | 1007000000.0 | 1325000000.0 | 15114000000.0 | 0.0 | 0.0 | -1000000.0 |


MSFT → MSFT rows: 78


| | date | ticker | book_equity | net_income | ocf | gross_profit | total_assets | total_debt | dividends | buybacks |
|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 2006-03-31 | MSFT | 42038000000.0 | NaN | 4563000000.0 | 8872000000.0 | 66854000000.0 | 0.0 | -925000000.0 | -4675000000.0 |
| 43 | 2006-06-30 | MSFT | 40104000000.0 | NaN | 3281000000.0 | 9674000000.0 | 69597000000.0 | 0.0 | -917000000.0 | -3981000000.0 |


BRK-B → BRK.B rows: 78


| | date | ticker | book_equity | net_income | ocf | gross_profit | total_assets | total_debt | dividends | buybacks |
|---|---|---|---|---|---|---|---|---|---|---|
| 42 | 2006-03-31 | BRK-B | 95349000000.0 | NaN | 2359000000.0 | 6162000000.0 | 230206000000.0 | 30479000000.0 | 0.0 | 0.0 |
| 43 | 2006-06-30 | BRK-B | 97613000000.0 | NaN | 1092000000.0 | 8197000000.0 | 232331000000.0 | 30557000000.0 | 0.0 | 0.0 |


ERR BF-B FMP fundamentals empty for BF-B (queried as BF.B)
[FMP] Processing chunk 0:20 (size=20)
Saved: funda_quarterly.parquet  (tickers with funda total: 20, rows: 1487)
Saved: funda_daily.parquet
[Standardize] Cross-sectional z-score on 78 features
Saved: features.parquet (lagged, winsorized, cross-sectional z-scored)
Artifacts: funda_quarterly.parquet, funda_daily.parquet, cache/funda_q_*.parquet


0
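
The `slope_20` feature is meant to be the OLS slope of price over a trailing window. A brute-force reference is useful for spot-checking any vectorized version against look-ahead. The function below is an illustrative sketch with a hypothetical name; it is slow and only meant for tests on short series.

```python
import numpy as np

def rolling_slope_reference(y, window):
    """OLS slope of y against 0..window-1 over each trailing window (NaN until full)."""
    y = np.asarray(y, dtype=float)
    x = np.arange(window, dtype=float)
    out = np.full(len(y), np.nan)
    for t in range(window - 1, len(y)):
        # Uses only observations t-window+1 .. t, never anything after t
        out[t] = np.polyfit(x, y[t - window + 1 : t + 1], 1)[0]
    return out

y = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 16.0])
slopes = rolling_slope_reference(y, window=3)
print(slopes)
```

The value at index 3 depends only on `[2, 3, 4]`; if it reacted to the jump to 10 at index 4, the implementation under test would be using future data.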

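The winsorize-then-z-score step is easier to follow on a toy panel. This sketch mirrors the idea of per-date clipping and standardization on a tiny frame; the helper name, the toy values, and the 10%/90% clip levels are assumptions chosen so the effect is visible with five names per date.

```python
import numpy as np
import pandas as pd

def cs_winsor_zscore(df, col, lo=0.01, hi=0.99):
    """Winsorize `col` within each date, then z-score it within each date."""
    by_date = df.groupby("date")[col]
    lower = by_date.transform(lambda s: s.quantile(lo))
    upper = by_date.transform(lambda s: s.quantile(hi))
    clipped = df[col].clip(lower, upper)
    grp = clipped.groupby(df["date"])
    mu, sd = grp.transform("mean"), grp.transform("std")
    sd = sd.fillna(0.0).replace(0.0, 1.0)   # constant or single-name dates: avoid division by zero
    return (clipped - mu) / sd

toy = pd.DataFrame({
    "date": ["2024-01-02"] * 5 + ["2024-01-03"] * 5,
    "ticker": list("ABCDE") * 2,
    "mom_20": [1.0, 2.0, 3.0, 4.0, 100.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})
toy["mom_20_z"] = cs_winsor_zscore(toy, "mom_20", lo=0.1, hi=0.9)
stats = toy.groupby("date")["mom_20_z"].agg(["mean", "std"])
print(stats)
```

Each date ends up with mean 0 and unit standard deviation, and the outlier of 100 on the first date is pulled in before it can dominate that date's scale.
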
In [None]:
# === Feature Inventory & Dictionary ===
# Point this to your artifact; adjust if you saved elsewhere.
FEATURES_PATH = "features.parquet"

import os, re, pandas as pd

if not os.path.exists(FEATURES_PATH):
    raise FileNotFoundError(f"Could not find {FEATURES_PATH}. "
                            "Run your Section 1.2 pipeline first or update FEATURES_PATH.")

df = pd.read_parquet(FEATURES_PATH)

# Columns that are not predictive "features" (aligns with your Section 1.2 script)
NON_FEATURE_COLS = {"date","ticker","open","high","low","close","adj_close","volume"}

# Known groups (from your build)
CONTEXT_COLS = {"spy_rv_20","vix_close","breadth"}
TECH_BASE = {
    "ret_1d","rv_20","atr_14","mom_20","mom_6m","mom_12m","mom_12_1","mom_6_1",
    "sma_20","sma_50","sma_20_gt_50","slope_20","mom_20_vs_vol"
}
FUNDAMENTAL_COLS = {
    "book_to_price","earnings_yield","cf_yield","shareholder_yield",
    "gross_profitability","roe","accruals","leverage"
}

# 1) Figure out which columns are in the file
all_cols = list(df.columns)
feature_cols = [c for c in all_cols if c not in NON_FEATURE_COLS]

# 2) Detect buckets
lags = sorted([c for c in feature_cols if re.fullmatch(r"ret_lag_\d+", c)], key=lambda x: int(x.split("_")[-1]))
context = [c for c in feature_cols if c in CONTEXT_COLS]
fundas = [c for c in feature_cols if c in FUNDAMENTAL_COLS]
funda_masks = sorted([c for c in feature_cols if c.endswith("_is_missing") and c.replace("_is_missing","") in FUNDAMENTAL_COLS])

# Technicals include the base tech set + anything that looks like SMA/trend/ATR/vol/momentum beyond the lags
tech_known = sorted([c for c in feature_cols if c in TECH_BASE])
# capture any extra tech-style columns you might add later (prefix match heuristics)
TECH_PREFIXES = ("rv_", "atr_", "mom_", "sma_", "slope_")
tech_extra = sorted([
    c for c in feature_cols
    if c not in tech_known
    and c not in lags
    and c not in context
    and (c.startswith(TECH_PREFIXES) or c in {"ret_1d"})
])

# 3) Anything not covered falls into "other"
covered = set(lags) | set(context) | set(fundas) | set(funda_masks) | set(tech_known) | set(tech_extra)
other = sorted([c for c in feature_cols if c not in covered])

# 4) Build a compact report
def hdr(title, items):
    return f"{title} ({len(items)}):\n" + (", ".join(items) if items else "-")

report = "\n".join([
    f"Total columns: {len(all_cols)}",
    f"Predictive feature columns: {len(feature_cols)}",
    "",
    hdr("Market context", context),
    hdr("Price/technical (known)", tech_known),
    hdr("Price/technical (extra detected)", tech_extra),
    hdr("Return lags", lags),
    hdr("Fundamentals", fundas),
    hdr("Fundamentals - missing masks", funda_masks),
    hdr("Other", other),
])

print(report)

# 5) Also emit a Markdown dictionary to disk for teammates
def sample_stats(cols):
    if not cols: return ""
    sub = df[cols]
    # % non-null and dtype summary
    nn = sub.notna().mean().rename("non_null_frac")
    dtypes = sub.dtypes.rename("dtype").astype(str)
    return pd.concat([dtypes, nn.round(4)], axis=1).sort_index()

sections = [
    ("Market context", context),
    ("Price/technical (known)", tech_known),
    ("Price/technical (extra detected)", tech_extra),
    ("Return lags", lags),
    ("Fundamentals", fundas),
    ("Fundamentals - missing masks", funda_masks),
    ("Other", other),
]

lines = ["# Feature Dictionary\n",
         f"- Source file: `{FEATURES_PATH}`",
         f"- Total columns: **{len(all_cols)}**",
         f"- Predictive feature columns: **{len(feature_cols)}**",
         ""]

for title, cols in sections:
    lines.append(f"## {title} ({len(cols)})")
    if cols:
        lines.append(", ".join(cols))
        stats = sample_stats(cols)
        lines.append("\n<details><summary>Schema & coverage</summary>\n\n")
        lines.append(stats.to_markdown())
        lines.append("\n</details>\n")
    else:
        lines.append("-\n")

md_path = "feature_dictionary.md"
with open(md_path, "w") as f:
    f.write("\n".join(lines))

print(f"\nSaved detailed dictionary → {md_path}")


Total columns: 100
Predictive feature columns: 92

Market context (3):
spy_rv_20, vix_close, breadth
Price/technical (known) (13):
atr_14, mom_12_1, mom_12m, mom_20, mom_20_vs_vol, mom_6_1, mom_6m, ret_1d, rv_20, slope_20, sma_20, sma_20_gt_50, sma_50
Price/technical (extra detected) (0):
-
Return lags (60):
ret_lag_1, ret_lag_2, ret_lag_3, ret_lag_4, ret_lag_5, ret_lag_6, ret_lag_7, ret_lag_8, ret_lag_9, ret_lag_10, ret_lag_11, ret_lag_12, ret_lag_13, ret_lag_14, ret_lag_15, ret_lag_16, ret_lag_17, ret_lag_18, ret_lag_19, ret_lag_20, ret_lag_21, ret_lag_22, ret_lag_23, ret_lag_24, ret_lag_25, ret_lag_26, ret_lag_27, ret_lag_28, ret_lag_29, ret_lag_30, ret_lag_31, ret_lag_32, ret_lag_33, ret_lag_34, ret_lag_35, ret_lag_36, ret_lag_37, ret_lag_38, ret_lag_39, ret_lag_40, ret_lag_41, ret_lag_42, ret_lag_43, ret_lag_44, ret_lag_45, ret_lag_46, ret_lag_47, ret_lag_48, ret_lag_49, ret_lag_50, ret_lag_51, ret_lag_52, ret_lag_53, ret_lag_54, ret_lag_55, ret_lag_56, ret_lag_57, ret_lag_58, r

In [None]:
# ============================================================
# 1.3 DATA HYGIENE / QC (non-destructive)
# ------------------------------------------------------------
# - Summarize coverage & missingness (post 1.2)
# - Optional pruning: drop early warmup dates & low-coverage dates
# - Write meta.yaml and QC CSVs
# ============================================================

import pandas as pd
import numpy as np
import yaml

FEATURES_PATH = "features.parquet"
UNIVERSE_PATH = "universe.csv"

features = pd.read_parquet(FEATURES_PATH)
universe_df = pd.read_csv(UNIVERSE_PATH)

# ---------- QC: basics ----------
min_date = pd.to_datetime(features["date"]).min()
max_date = pd.to_datetime(features["date"]).max()
n_rows = len(features)
n_tickers = features["ticker"].nunique()

# Columns we standardized in 1.2 (will exist if 1.2 ran)
feature_cols = [c for c in features.columns
                if c not in {"date","ticker","open","high","low","close","adj_close","volume"}]

# Per-column missingness (after 1.2; should be low except earliest windows)
missing_pct = (1.0 - features[feature_cols].notna().mean()).sort_values(ascending=False)
missing_pct.to_csv("qc_missing_by_feature.csv", header=["missing_pct"])

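# Coverage by date (# of tickers with at least 1 valid feature on that date)
valid_any = features[feature_cols].notna().sum(axis=1) > 0
all_dates = pd.Index(features["date"].unique(), name="date").sort_values()
coverage_by_date = (features.loc[valid_any]          # count only rows with at least one valid feature
                            .groupby("date")["ticker"]
                            .nunique()
                            .reindex(all_dates, fill_value=0)  # dates with no valid rows get coverage 0
                            .rename("n_tickers"))
coverage_by_date.to_csv("qc_coverage_by_date.csv")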

# ---------- Optional: pruning rules (non-destructive by default) ----------
# 1) Warmup: many features need long windows (max ≈ 252 + 21). Keep dates after first 273 trading days.
#    We'll infer a warmup cutoff from SPY availability to be robust.
spy_dates = features.loc[features["ticker"]=="SPY", "date"].sort_values().unique()
if len(spy_dates) > 300:
    warmup_cutoff = pd.to_datetime(spy_dates[min(273, len(spy_dates)-1)])
else:
    warmup_cutoff = min_date  # fallback

# 2) Low coverage: drop dates with very few names (e.g., <300); tweak if you want.
COVERAGE_MIN = 300
low_cov_dates = coverage_by_date[coverage_by_date < COVERAGE_MIN].index

# We don't mutate features here; write a recommended mask so training can filter.
date_mask_keep = (~features["date"].isin(low_cov_dates)) & (pd.to_datetime(features["date"]) >= warmup_cutoff)
keep_rate = date_mask_keep.mean()
pd.DataFrame({
    "warmup_cutoff":[warmup_cutoff],
    "coverage_min":[COVERAGE_MIN],
    "keep_rate":[float(keep_rate)]
}).to_csv("qc_recommendations.csv", index=False)

# ---------- Meta ----------
meta = {
    "universe": {
        "description": "S&P 500 (current constituents; survivorship bias acknowledged).",
        "count": int(len(universe_df)),
        "hedges": ["SPY","XLY","XLF","XLV","XLK","XLI","XLE","XLP","XLB","XLU","XLRE"],
        "context_symbols": ["^VIX"],
        "lookback": {"start": str(min_date.date()), "end": str(max_date.date())}
    },
    "pricing": {
        "source": "Yahoo Finance via yfinance",
        "adjusted_prices_used": True,
        "file": "raw_prices.parquet"
    },
    "features": {
        "file": "features.parquet",
        "rows": int(n_rows),
        "tickers": int(n_tickers),
        "leakage_control": "All predictive features shifted by 1 day.",
        "cross_sectional_processing": "Winsorized [1%,99%] & z-scored by date (see 1.2).",
        "imputation": "Fundamentals imputed (cross-sectional median) in 1.2; *_is_missing masks present."
    },
    "qc": {
        "missing_by_feature_csv": "qc_missing_by_feature.csv",
        "coverage_by_date_csv": "qc_coverage_by_date.csv",
        "recommendations_csv": "qc_recommendations.csv",
        "warmup_cutoff": str(warmup_cutoff.date()),
        "coverage_min": COVERAGE_MIN,
        "recommendation": "Filter training rows to dates >= warmup_cutoff and dates with coverage >= coverage_min."
    },
    "deliverables": ["universe.csv", "raw_prices.parquet", "features.parquet",
                     "funda_quarterly.parquet", "funda_daily.parquet", "meta.yaml",
                     "qc_missing_by_feature.csv", "qc_coverage_by_date.csv", "qc_recommendations.csv"]
}

with open("meta.yaml", "w") as f:
    yaml.safe_dump(meta, f, sort_keys=False)

print("Saved: meta.yaml + QC CSVs")

Saved: meta.yaml + QC CSVs


In [None]:
# ============================================================
# 1.4 DATA QC & ASSERTIONS (non-destructive; optional filtered view)
# Produces: qc_summary.json, qc_constant_cols.csv, qc_missing_by_feature.csv (again),
#           qc_skew_kurtosis.csv, qc_outlier_rate.csv, qc_drift.csv,
#           features_filtered.parquet (optional, if you turn on APPLY_FILTERS)
# ============================================================

import json, pandas as pd, numpy as np
from scipy.stats import skew, kurtosis
import warnings
warnings.filterwarnings("ignore", message="Precision loss occurred in moment calculation")
warnings.filterwarnings("ignore", message="Degrees of freedom <= 0 for slice")

FEATURES_PATH = "features.parquet"

APPLY_FILTERS = True          # set False if you only want reports
COVERAGE_MIN = 300            # min tickers per date
Z_OUTLIER = 5.0               # |z| threshold post-standardization
EARLY_YEARS = 5               # windows for drift check
RECENT_YEARS = 5

df = pd.read_parquet(FEATURES_PATH)
df["date"] = pd.to_datetime(df["date"])
feature_cols = [c for c in df.columns if c not in {"date","ticker","open","high","low","close","adj_close","volume"}]

# Basic shape / duplicates
dup_count = df.duplicated(["date","ticker"]).sum()
idx_dupes = int(dup_count)

# Per-ticker monotonic date check (test the stored row order; sorting first would make the check pass trivially)
monotonic_bad = []
for t, g in df.groupby("ticker"):
    if not g["date"].is_monotonic_increasing:
        monotonic_bad.append(t)

# Constant/empty columns
MIN_N = 200  # only compute moments if we've got enough points (used by the skew/kurtosis and drift checks below)

const_cols, empty_cols = [], []
for c in feature_cols:
    nn = df[c].notna().sum()
    if nn == 0:
        empty_cols.append(c)
        continue
    # treat "constant" as very low variance or single unique value
    if df[c].nunique(dropna=True) == 1 or np.nanstd(df[c].to_numpy(dtype=float)) < 1e-12:
        const_cols.append(c)

pd.Series(const_cols, name="constant_cols").to_csv("qc_constant_cols.csv", index=False)
pd.Series(empty_cols,  name="empty_cols").to_csv("qc_empty_cols.csv", index=False)

# Missingness
missing_pct = (1.0 - df[feature_cols].notna().mean()).sort_values(ascending=False)
missing_pct.to_csv("qc_missing_by_feature.csv", header=["missing_pct"])

# Coverage by date and warmup/low-coverage mask (reuse warmup logic from 1.3)
spy_dates = df.loc[df["ticker"]=="SPY", "date"].sort_values().unique()
warmup_cutoff = pd.to_datetime(spy_dates[min(273, len(spy_dates)-1)]) if len(spy_dates) > 300 else df["date"].min()
coverage = df.groupby("date")["ticker"].nunique()
low_cov_dates = coverage[coverage < COVERAGE_MIN].index
keep_mask = (df["date"] >= warmup_cutoff) & (~df["date"].isin(low_cov_dates))
keep_rate = float(keep_mask.mean())

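# Outlier rate (cross-sectional features are z-scored per date already).
# Note: columns kept raw (e.g., market-context series such as vix_close) are not z-scores,
# so their |value| > Z_OUTLIER rates are not meaningful as outlier diagnostics.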
outlier_rate = {}
for c in feature_cols:
    s = df[c]
    outlier_rate[c] = float((s.abs() > Z_OUTLIER).mean())
pd.Series(outlier_rate, name="outlier_rate").sort_values(ascending=False).to_csv("qc_outlier_rate.csv")

# Skew/Kurtosis (global, ignoring NaNs)
sk_rows = []
for c in feature_cols:
    x = df[c].to_numpy(dtype=float)
    x = x[np.isfinite(x)]
    if len(x) < MIN_N or np.nanstd(x) < 1e-8:
        # optional: quantile-based skew as fallback
        try:
            q1,q2,q3 = np.nanpercentile(x, [25,50,75])
            bowley = ((q3 + q1) - 2*q2) / ((q3 - q1) + 1e-9)
        except Exception:
            bowley = np.nan
        sk_rows.append([c, np.nan, np.nan, bowley, np.nan, np.nan])
        continue
    sk = float(skew(x, bias=False))
    ku = float(kurtosis(x, fisher=True, bias=False))
    p99 = float(np.nanpercentile(x, 99))
    med = float(np.nanmedian(x))
    dom = abs(p99) / (abs(med) + 1e-9)
    sk_rows.append([c, sk, ku, np.nan, dom, p99])

pd.DataFrame(sk_rows, columns=["feature","skew","kurtosis_fisher","bowley_skew","p99_to_median_abs","p99"])\
  .sort_values("p99_to_median_abs", ascending=False)\
  .to_csv("qc_skew_kurtosis.csv", index=False)

# Drift: early vs recent windows
dstart, dend = df["date"].min(), df["date"].max()
span_years = (dend - dstart).days / 365.25
if span_years < (EARLY_YEARS + RECENT_YEARS):
    # fallback: split the dataset in half
    mid = dstart + (dend - dstart) / 2
    early = df[(df["date"] >= dstart) & (df["date"] <= mid)]
    late  = df[(df["date"] >  mid) & (df["date"] <= dend)]
else:
    early_end    = pd.Timestamp(dstart) + pd.DateOffset(years=EARLY_YEARS)
    recent_start = pd.Timestamp(dend)   - pd.DateOffset(years=RECENT_YEARS)
    early = df[(df["date"] >= dstart) & (df["date"] <= early_end)]
    late  = df[(df["date"] >= recent_start) & (df["date"] <= dend)]

drift_rows = []
for c in feature_cols:
    e = early[c].astype("float64"); l = late[c].astype("float64")
    e = e[np.isfinite(e)]; l = l[np.isfinite(l)]
    if len(e) < MIN_N or len(l) < MIN_N:
        drift_rows.append([c, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan])
        continue
    e_mean, e_std = float(np.nanmean(e)), float(np.nanstd(e))
    l_mean, l_std = float(np.nanmean(l)), float(np.nanstd(l))
    drift_rows.append([c, e_mean, l_mean, l_mean - e_mean, e_std, l_std, (l_std+1e-9)/(e_std+1e-9)])

pd.DataFrame(
    drift_rows,
    columns=["feature","early_mean","late_mean","mean_diff","early_std","late_std","std_ratio_late_over_early"]
).to_csv("qc_drift.csv", index=False)

def bowley_skew(x):
    q1, q2, q3 = np.nanpercentile(x, [25,50,75])
    denom = (q3 - q1) + 1e-9
    return float(((q3 + q1) - 2*q2) / denom)
# you can compute this alongside or instead of moment skew for each feature

# Optionally write filtered view for modeling
if APPLY_FILTERS:
    # Also drop truly empty/constant cols from the filtered file only
    drop_cols = list(set(empty_cols) | set(const_cols))
    cols_keep = [c for c in df.columns if c not in drop_cols]
    df_filt = df.loc[keep_mask, cols_keep].copy()
    df_filt.to_parquet("features_filtered.parquet", index=False)

# Summary JSON (for quick eyeball)
summary = {
    "rows": int(len(df)),
    "tickers": int(df["ticker"].nunique()),
    "dates": int(df["date"].nunique()),
    "date_min": str(df["date"].min().date()),
    "date_max": str(df["date"].max().date()),
    "duplicates_idx": idx_dupes,
    "monotonic_date_issues": len(monotonic_bad),
    "constant_cols": len(const_cols),
    "empty_cols": len(empty_cols),
    "warmup_cutoff": str(warmup_cutoff.date()),
    "coverage_min": COVERAGE_MIN,
    "keep_rate_after_filters": keep_rate,
    "median_missing_pct": float(missing_pct.median()),
    "max_missing_pct": float(missing_pct.max()),
    "mean_outlier_rate_|z|>5": float(pd.Series(outlier_rate).mean()),
    "filtered_file_written": APPLY_FILTERS
}

# 🚦 Hard QC checks: stop if these fail
assert summary["duplicates_idx"] == 0, "Duplicate (date,ticker) rows found."
assert summary["keep_rate_after_filters"] >= 0.85, "Too many rows dropped by filters."
assert summary["constant_cols"] <= 10, "Suspicious number of constant columns."

with open("qc_summary.json","w") as f:
    json.dump(summary, f, indent=2)

print("QC done → qc_summary.json, qc_* CSVs",
      "and features_filtered.parquet" if APPLY_FILTERS else "")



QC done → qc_summary.json, qc_* CSVs and features_filtered.parquet


In [None]:
import json, pandas as pd

with open("qc_summary.json") as f:
    s = json.load(f)

print("=== QC SUMMARY ===")
for k in [
    "rows","tickers","dates","date_min","date_max",
    "duplicates_idx","monotonic_date_issues",
    "constant_cols","empty_cols",
    "warmup_cutoff","coverage_min","keep_rate_after_filters",
    "median_missing_pct","max_missing_pct","mean_outlier_rate_|z|>5",
    "filtered_file_written"
]:
    print(f"{k}: {s.get(k)}")

print("\n=== Top 10 most-missing features ===")
print(pd.read_csv("qc_missing_by_feature.csv").head(10))

print("\n=== Top 10 highest outlier rates (|z|>5) ===")
print(pd.read_csv("qc_outlier_rate.csv").head(10))

print("\n=== Constant / Empty columns ===")
for path in ("qc_constant_cols.csv", "qc_empty_cols.csv"):
    try:
        print(pd.read_csv(path).head())
    except (FileNotFoundError, pd.errors.EmptyDataError):
        print("none")

print("\n=== Drift (largest mean change early→late) ===")
drift = pd.read_csv("qc_drift.csv")
drift["abs_mean_diff"] = drift["mean_diff"].abs()
print(drift.sort_values("abs_mean_diff", ascending=False).head(10))

print("\n=== Filtered file shape ===")
ff = pd.read_parquet("features_filtered.parquet")
print(ff.shape, "rows x cols; dates:", ff['date'].min(), "→", ff['date'].max(), "; tickers:", ff['ticker'].nunique())


=== QC SUMMARY ===
rows: 2337209
tickers: 514
dates: 4932
date_min: 2006-01-03
date_max: 2025-08-11
duplicates_idx: 0
monotonic_date_issues: 0
constant_cols: 3
empty_cols: 3
warmup_cutoff: 2007-02-05
coverage_min: 300
keep_rate_after_filters: 0.951552043484344
median_missing_pct: 0.005388050448205506
max_missing_pct: 1.0
mean_outlier_rate_|z|>5: 0.032075781133748024
filtered_file_written: True

=== Top 10 most-missing features ===
       Unnamed: 0  missing_pct
0  earnings_yield     1.000000
1             roe     1.000000
2        accruals     1.000000
3        mom_12_1     0.055640
4         mom_12m     0.055640
5         mom_6_1     0.027930
6          mom_6m     0.027930
7      ret_lag_60     0.013635
8      ret_lag_59     0.013415
9      ret_lag_58     0.013195

=== Top 10 highest outlier rates (|z|>5) ===
     Unnamed: 0  outlier_rate
0     vix_close      0.999780
1        sma_20      0.973318
2        sma_50      0.967107
3        atr_14      0.005798
4      slope_20      0.00214

<details>
<summary>📦 Summary: Section 1 (Data & Universe)</summary>

In this section, we **built the full, modeling-ready dataset** by merging historical prices, technical indicators, market context, and fundamentals into a single leakage-controlled feature matrix.  
Key steps included:

- **Data acquisition**: pulled long-term daily OHLCV for the equity universe, hedges, and context symbols, plus quarterly fundamentals from FMP.
- **Feature engineering**: created lagged returns/volatility, momentum metrics, trend filters, ATR, volatility-adjusted momentum, and value/quality factor composites. Fundamentals were forward-filled to daily frequency.
- **Leakage control & scaling**: shifted predictive features by one day, winsorized extreme values, and cross-sectionally z-scored each feature per date.
- **Missing data handling**: conservative imputation for fundamentals and binary masks to record missingness.
- **Quality control**: removed low-coverage dates, the early warmup period, and constant/empty columns from the filtered view; asserted that no duplicate `(date, ticker)` rows exist; generated QC reports and metadata.

**Outcome:** A filtered, QC-checked `features_filtered.parquet` file, ready for direct use in **Section 2 (Regime Modeling)** without recomputing or re-fetching any raw data. Per the QC summary, about 95% of rows survive the filters, and the three entirely missing fundamentals (`earnings_yield`, `roe`, `accruals`) are dropped as empty columns.

</details>
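
The leakage-control and scaling step summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Section 1.2 implementation: the helper name `lag_winsorize_zscore` and the toy column `x` are hypothetical, and the 1%/99% bounds are taken from the winsorization range recorded in `meta.yaml`.

```python
import numpy as np
import pandas as pd

def lag_winsorize_zscore(df, cols, lo=0.01, hi=0.99):
    """Hypothetical helper: lag each feature one bar per ticker, then
    winsorize and z-score it across tickers within each date."""
    out = df.sort_values(["ticker", "date"]).copy()
    # Leakage control: the row for date t only sees the value observed at t-1.
    out[cols] = out.groupby("ticker")[cols].shift(1)
    for c in cols:
        by_date = out.groupby("date")[c]
        # Winsorize: clip to the per-date [lo, hi] quantiles.
        q_lo = by_date.transform(lambda s: s.quantile(lo))
        q_hi = by_date.transform(lambda s: s.quantile(hi))
        clipped = out[c].clip(lower=q_lo, upper=q_hi)
        # Cross-sectional z-score: subtract the date mean, divide by the date std.
        grouped = clipped.groupby(out["date"])
        mu = grouped.transform("mean")
        sd = grouped.transform("std").replace(0.0, np.nan)
        out[c] = (clipped - mu) / sd
    return out

# Toy usage: three tickers, two dates.
dates = pd.to_datetime(["2024-01-02", "2024-01-03"])
toy = pd.DataFrame({
    "date": dates.repeat(3),
    "ticker": ["A", "B", "C"] * 2,
    "x": [1.0, 2.0, 6.0, 3.0, 3.5, 4.0],
})
res = lag_winsorize_zscore(toy, ["x"])
```

After the one-bar shift, the first date of every ticker has no usable value, which is one reason the QC step applies a warmup cutoff before modeling.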


<details>
<summary> Variables to reuse: Section 1 (Data & Universe) </summary>

**Status:** Done. Artifacts are written; QC checks passed; ready to start **Section 2 (Regime Modeling)** using the saved files and globals below.

---

## Canonical Artifacts (reuse, don't recompute)
- `universe.csv` – S&P 500 tickers (Yahoo-style), excludes hedges/context.
- `raw_prices.parquet` – OHLCV + `adj_close` for equities + hedges + `^VIX` (long format).
- `features.parquet` – lagged, winsorized, cross-sectionally z-scored features (+ `*_is_missing` masks).
- `features_filtered.parquet` – modeling-ready view (warmup & low-coverage dates removed; empty/constant cols dropped).
- `funda_quarterly.parquet`, `funda_daily.parquet` – fundamentals at quarterly/daily granularity.
- `meta.yaml` – machine-readable metadata (sources, lookback, QC guidance).
- QC reports: `qc_summary.json`, `qc_missing_by_feature.csv`, `qc_coverage_by_date.csv`, `qc_constant_cols.csv`, `qc_empty_cols.csv`, `qc_outlier_rate.csv`, `qc_skew_kurtosis.csv`, `qc_drift.csv`, `qc_recommendations.csv`.

---

## Reusable Globals (organized)
> These exist (or are trivially reloadable) after Section 1. Prefer these over re-deriving.

### Dates / Ranges
- `START_DATE = "2006-01-01"`  
- `END_DATE = datetime.today().strftime("%Y-%m-%d")`

### Universe & Symbols
- `sp500_url` – Wikipedia source for constituents.
- `tickers_raw` → raw symbols from Wikipedia.
- `tickers` → Yahoo-normalized tickers (periods → dashes).
- `hedges` → `["SPY","XLY","XLF","XLV","XLK","XLI","XLE","XLP","XLB","XLU","XLRE"]`
- `context_symbols` → `["^VIX"]`  *(later used as `{"^VIX"}` set in 1.2)*
- `universe` → sorted unique S&P tickers.
- `universe_all` → `universe + hedges + context_symbols`
- `universe_full` → list from `universe.csv` (canonical equities universe for downstream code).

### DataFrames (load-once, reuse)
- `prices` → long OHLCV for `universe_all` (saved as `raw_prices.parquet`).
- `features` → merged technical + context + fundamentals (post-shift, winsorize, z-score) (saved).
- `vix` → `^VIX` close series; `spy` → SPY prices with `spy_ret`, `spy_rv_20`.
- `ctx` → market context by date: `["spy_rv_20","vix_close","breadth"]`.
- `px_daily_all` → `["date","ticker","adj_close"]` for equities universe.
- `dates_all` → unique trading dates.
- `funda_q` → quarterly fundamentals by ticker (saved).
- `funda_daily` → daily forward-filled fundamentals (saved).

### Feature Engineering Toggles / Windows
- `COMPUTE_SLOPE = True`
- `SLOPE_WINDOW = 20`
- `RV_WIN = 20`
- `ATR_WIN = 14`

### Provider / API / Caching
- `PROVIDER = "fmp"`
- `FMP_API_KEY` – from env or prompt (in-memory only).
- `CACHE_DIR = "cache"`
- `CHUNK_TICKERS = 100`, `START_AT = 0`, `SKIP_IF_CACHED = True`
- `MAX_WORKERS = 4`, `RETRY_ATTEMPTS = 5`, `BATCH_SLEEP = (0.2, 0.6)`

### Useful Function Handles
- `to_fmp_symbol(sym)` – Yahoo "-" ↔ FMP "." class ticker mapping.
- `is_index_like(sym)` – identifies index symbols (e.g., `^VIX`).
- `compute_atr(df, window=ATR_WIN)`
- `vectorized_rolling_slope(y, window=SLOPE_WINDOW)`
- `mom_over_n(adj_close, n)`
- `_tidy_quarterly_df(df)`, `_coalesce_cols(df, cols, default)`
- `_fetch_quarterly_funda_fmp(ticker)` – pulls BS/IS/CF, coalesces variants.
- `fetch_or_load_cached_quarterly(ticker)` – cached loader for fundamentals.
- `cs_standardize_fast(df, cols, lo=0.01, hi=0.99)` – per-date winsorize+z-score.

### Column Sets / Masks (downstream-friendly)
- `non_feature_cols = {"date","ticker","open","high","low","close","adj_close","volume"}`
- `cols_to_shift` – all predictive feature columns actually shifted by 1 bar.
- `cs_cols` – features standardized cross-sectionally (lags, vol, mom, fundamentals, etc.).
- `context_keep_raw = ["spy_rv_20","vix_close","breadth"]`
- *(QC section)*
  - `FEATURES_PATH = "features.parquet"`, `UNIVERSE_PATH = "universe.csv"`
  - `COVERAGE_MIN = 300`
  - `APPLY_FILTERS = True`
  - `Z_OUTLIER = 5.0`, `EARLY_YEARS = 5`, `RECENT_YEARS = 5`
  - `warmup_cutoff` – computed from SPY date series (≈273 trading-day warmup).
  - `keep_mask` – dates ≥ `warmup_cutoff` and with coverage ≥ `COVERAGE_MIN`.
  - *(Note: `features_filtered.parquet` is written using `keep_mask` and pruned columns.)*

---

## What this means for Section 2 (Regimes)
- **Use** `features_filtered.parquet` (or reload `features` and apply `keep_mask`) to build HMM inputs.
- Inputs available out of the box: `spy_rv_20`, `vix_close`, `breadth`, and per-asset returns (`ret_1d`), plus everything in `cs_cols`.
- **No duplicate `(date, ticker)` rows**, **no monotonic issues**; early sparse periods removed by `warmup_cutoff`/`keep_mask`.

---

## Sanity Questions (short answers)
- **"Are we good to go?"** Yes. Section 1 is complete and validated; proceed to regime modeling.
- **"Empty rows?"** Raw OHLCV rows with all NaNs were dropped; the modeling file (`features_filtered.parquet`) is filtered to warmup/coverage and prunes empty/constant columns. Row-level all-NaN feature cases should not remain after these filters.
- **"Add the assertions?"** Already present and passing in QC (`qc_summary.json`). No need to add them again unless you change the pipeline.
</details>

# 2. Regime Modeling

<details> <summary>
Outline (HMM → Regime Labels & Probabilities)</summary>

# 2) Regime Modeling: Updated Outline (HMM → Regime Labels & Probabilities)

## 2.0 Scope & Interfaces
- **Goal:** Assign a daily market regime (Risk-On, Risk-Off, Transition) with posterior probabilities to drive regime-aware weighting, turnover caps, and risk targets in Sections 3–5.
- **Inputs (from Section 1):**
  - `features_filtered.parquet` with **raw** `spy_rv_20`, `vix_close`, `breadth`, and SPY `adj_close` for return computation.
  - Trading calendar (aligned daily business days).
- **Outputs (artifacts):**
  - `regime_labels.parquet`: `date, state_id, p0..p{K-1}, regime_label` (one posterior column per state of a K-state HMM)
  - `regime_labels.csv` (plot-friendly)
  - `regime_plot.png` (timeline with shading), `state_profiles.csv` (state stats)
  - `regime_hmm.pkl` (bundle: scaler + HMM per walk-forward window)
  - `regime_meta.json` (config, state→label map, scaler params, transition matrix, diagnostics)
  - `regime_sensitivity.json` (K/feature/era stability tests)
- **Pass/Fail gates:**
  - Interpretable state profiles (return/vol ordering aligns with labels)
  - Reasonable persistence (median run length > 5–10 days; no chattering)
  - Stable mapping across walk-forward windows (low semantic flip rate)
  - No leakage (all inputs at t known at t)

---

## 2.1 Data Assembly (Market Panel)
- **Series:**
  - SPY **log return** at t (computed from `adj_close`, shifted to avoid leakage if needed).
  - SPY realized volatility (20-day), taken from raw `spy_rv_20`.
  - VIX **level** (`vix_close`) and optionally its **daily Δ** (level at t minus level at t-1).
  - Market breadth (% advancers in S&P, known at t).
- **IMPORTANT:** Use **raw** context series from Section 1 (`spy_rv_20`, `vix_close`, `breadth`), **not** cross-sectional z-scored features.
- **Breadth timing:** Confirm that `breadth` reflects t-1 data available at t; if not, shift by 1.
- **Alignment:** Daily business days; merge by `date`; forward-fill only for indicators known at t; drop rows with missing core inputs.
- **Standardization:** Fit `StandardScaler` **per train window** on the raw context features; persist scaler per window (stored in `regime_hmm.pkl`).
- **Sanity checks:**
  - Stationarity proxy (mean/var drift over eras).
  - Outlier handling: no winsorization needed for HMM since we scale raw series per window.
  - Coverage check: ensure no missing dates in test stitching.
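
The assembly and per-window scaling steps above could look roughly like the sketch below. It assumes a date-indexed `ctx` frame with the three raw context columns and a date-indexed SPY `adj_close` series; the helper names are hypothetical. The outline calls for `StandardScaler`; to stay dependency-free, this sketch uses plain pandas arithmetic that mirrors its default centering and population-std scaling.

```python
import numpy as np
import pandas as pd

def build_market_panel(spy_adj_close, ctx, include_dvix=True):
    """Hypothetical helper: one row per date with the raw HMM inputs."""
    panel = ctx[["spy_rv_20", "vix_close", "breadth"]].copy()
    # SPY log return at t, computed from adjusted closes at t and t-1.
    panel["spy_logret"] = np.log(spy_adj_close).diff()
    if include_dvix:
        panel["dvix"] = panel["vix_close"].diff()  # daily change in the VIX level
    # Drop rows with any missing core input (e.g. the first day has no return).
    return panel.dropna().sort_index()

def fit_scaler(train):
    """Mean and population std (ddof=0), fit on the train window only."""
    mu = train.mean()
    sd = train.std(ddof=0).replace(0.0, 1.0)  # guard against constant columns
    return mu, sd

def apply_scaler(x, mu, sd):
    """Scale any window (train or test) with parameters fit on the train window."""
    return (x - mu) / sd
```

Fitting `mu` and `sd` on the train window and reusing them on the test window keeps test-period information out of the scaling.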

---

## 2.2 Model Choice & Configuration
- **Primary:** Gaussian HMM with `covariance_type="full"`; components K ‚àà {2,3} (default 3).
- **Alternative (optional):** Student-t HMM, GMM-HMM, Markov-Switching VAR, or Bayesian HMM with sticky priors.
- **Hyperparameters:**
  - `n_components`, `covariance_type`, `n_iter`, `random_state`.
  - Dirichlet priors / sticky transitions to enforce regime persistence.
- **Training protocol:**
  - Train on standardized features in the train window.
  - Multiple random restarts; choose model with highest log-likelihood.
  - If the **finance recency weighting rule** is applied, weight the log-likelihood so that recent data has more influence (can be implemented here if desired).
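
The restart step of the training protocol can be sketched independently of the HMM library. In the sketch below, `best_of_restarts` and `fit_fn` are hypothetical names: `fit_fn(seed)` is assumed to fit one model with the given seed and return `(model, log_likelihood)`.

```python
import math
from typing import Any, Callable, Iterable, Tuple

def best_of_restarts(fit_fn: Callable[[int], Tuple[Any, float]],
                     seeds: Iterable[int]) -> Tuple[Any, float, int]:
    """Run fit_fn once per seed and keep the fit with the highest log-likelihood."""
    best = None
    for seed in seeds:
        model, loglik = fit_fn(seed)
        if not math.isfinite(loglik):
            continue  # skip fits whose likelihood is NaN or infinite
        if best is None or loglik > best[1]:
            best = (model, loglik, seed)
    if best is None:
        raise ValueError("no restart produced a finite log-likelihood")
    return best
```

With `hmmlearn`, one plausible `fit_fn` would build `GaussianHMM(n_components=3, covariance_type="full", random_state=seed)`, call `.fit(X)` on the standardized train window, and return the model together with `.score(X)`. Returning the winning seed makes it easy to log it for reproducibility.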

---

## 2.3 State Labeling & Semantics
- **Profile each state:**
  - Mean and vol of SPY returns.
  - Mean VIX level, mean ΔVIX.
  - Mean breadth, tail metrics (5% quantile returns).
- **Label rules:**
  - Highest mean return & lowest vol ‚Üí **Risk-On**
  - Highest vol & lowest return ‚Üí **Risk-Off**
  - Remaining state ‚Üí **Transition**
- **Tie-breakers:** breadth, VIX changes, downside tails.
- **Persist mapping:** Save `state_id → regime_label` per window in `regime_meta.json` so semantics don't silently drift across walk-forward windows.
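
One way to turn the label rules above into code is shown below. It is a sketch under an explicit assumption: the two criteria are combined as `rank(mean_ret) - rank(vol)`, which the outline does not prescribe, and the tie-breakers are not implemented. The column names `mean_ret` and `vol` and the function name are hypothetical.

```python
import pandas as pd

def label_states(profiles: pd.DataFrame) -> dict:
    """Map state_id -> regime label from per-state mean return and vol.

    Assumption: states are ordered by rank(mean_ret) - rank(vol); ties
    are not resolved here (breadth, VIX changes, tails would be needed)."""
    score = profiles["mean_ret"].rank() - profiles["vol"].rank()
    order = score.sort_values(ascending=False).index.tolist()
    labels = {state: "Transition" for state in order}
    labels[order[0]] = "Risk-On"     # high return, low vol
    labels[order[-1]] = "Risk-Off"   # low return, high vol
    return labels

# Toy state profiles indexed by state_id.
profiles = pd.DataFrame(
    {"mean_ret": [0.0008, -0.0010, 0.0001], "vol": [0.007, 0.020, 0.011]},
    index=[0, 1, 2],
)
mapping = label_states(profiles)
```

The returned dictionary is the kind of per-window `state_id → regime_label` map that would be persisted in `regime_meta.json`.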

---

## 2.4 Smoothing, Persistence & Debounce
- **Posterior smoothing:** Option to use Viterbi most-likely path vs. raw posterior argmax.
- **Debounce parameters:** `MIN_DWELL_DAYS` and `POSTERIOR_THRESH` from `config.yaml`.
- **Gap handling:** Holidays/missing days inherit last known regime; no forward-looking fill.
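
The outline names the debounce parameters but not their exact semantics. The sketch below assumes one reasonable interpretation: a switch is accepted only after the new state has been the posterior argmax, with probability at least `POSTERIOR_THRESH`, for `MIN_DWELL_DAYS` consecutive days. The function name is hypothetical, and the rule is causal (it only looks at days up to t).

```python
import numpy as np

def debounce_states(post, min_dwell_days=3, posterior_thresh=0.6):
    """Causal debounce of an HMM posterior matrix (rows = days, cols = states)."""
    post = np.asarray(post, dtype=float)
    raw = post.argmax(axis=1)    # undebounced state per day
    conf = post.max(axis=1)      # posterior of that state
    out = np.empty(len(raw), dtype=int)
    current = raw[0]
    candidate, streak = current, 0
    for t in range(len(raw)):
        if raw[t] != current and conf[t] >= posterior_thresh:
            if raw[t] == candidate:
                streak += 1                       # same challenger as yesterday
            else:
                candidate, streak = raw[t], 1     # new challenger, restart the count
            if streak >= min_dwell_days:
                current, streak = candidate, 0    # switch confirmed
        else:
            streak = 0                            # challenger interrupted or too weak
        out[t] = current
    return out
```

The cost of this rule is a confirmation lag of `min_dwell_days - 1` days on every genuine regime change, which is the usual trade-off against chattering.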

---

## 2.5 Robustness & Sensitivity
- **K sensitivity:** Run K=2 and K=3; prefer K with clearest separation (return/vol) and healthy dwell-time.
- **Feature sensitivity:** Drop-one/add-one tests (remove VIX, remove breadth, etc.) to check label stability.
- **Era stability:** Compare state profiles and transition matrices pre/post-2015 and during crisis years (e.g., 2020).
- **Bootstrap:** Block bootstrap re-fit; produce confusion matrix for label stability across samples.

---

## 2.6 Diagnostics & QA
- **Plots:**
  - Timeline with regime shading over SPY price & drawdown.
  - Posterior probabilities (stacked area).
  - State return histograms, QQ plots.
  - Transition matrix heatmap, dwell-time distribution.
- **Tables:**
  - State profiles (returns, vol, VIX, breadth, tails).
  - Transition matrix & steady-state distribution.
  - Switch frequency and chattering metrics.
- **Alerts:**
  - Flag if any state has inconsistent semantics (positive mean but top-2 vol, dwell-time < 3 days, mapping flips).

---

## 2.7 Regime-Aware Policy Hooks (Interfaces to Sections 3–5)
- **Weights & turnover caps:** JSON map per regime (e.g., throttle momentum in Risk-Off, upweight quality).
- **Risk targets:** Per-regime vol targets (e.g., 10%/8%/6% for On/Trans/Off).
- **Hedge intensity:** Baseline hedge ratios per regime; pass to RL policy as defaults.
- **Confidence proxy:** Use max posterior or entropy to scale aggressiveness.
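
A sketch of the confidence proxy and a per-regime risk-target map follows. The vol targets reuse the example values from the bullet above; the entropy normalization and the `scaled_vol_target` rule (including its `floor` parameter) are illustrative assumptions, not a specification of the policy used in Sections 3–5.

```python
import numpy as np

# Per-regime vol targets, using the example values from the outline (10%/8%/6%).
VOL_TARGETS = {"Risk-On": 0.10, "Transition": 0.08, "Risk-Off": 0.06}

def posterior_confidence(post):
    """Return 1 - normalized entropy per row: 0 for a uniform posterior,
    close to 1 for a one-hot posterior."""
    p = np.clip(np.asarray(post, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return 1.0 - entropy / np.log(p.shape[1])

def scaled_vol_target(label, confidence, floor=0.5):
    """Hypothetical rule: shrink the regime's vol target toward floor * target
    as confidence falls from 1 to 0."""
    return VOL_TARGETS[label] * (floor + (1.0 - floor) * confidence)
```

Using entropy rather than the max posterior penalizes a posterior that is split between two states even when one of them is nominally the winner.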

---

## 2.8 Walk-Forward Integration
- **Windows:** Match Section 6 (rolling/expanding).
- **Per window:**
  - Fit scaler + HMM on train subset.
  - Apply to test subset only.
  - Save artifacts: `regime_labels_<winid>.parquet`, `regime_hmm.pkl`, `regime_meta.json`.
- **Stitching:** Concatenate per-window outputs into one continuous timeline for backtests.
- **Label stability:** Use saved state→label mapping to avoid regime meaning drift.
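
The window logic can be sketched as a small generator. The name `walk_forward_windows` and the choice to measure sizes in trading days are assumptions; the actual window definitions are meant to match Section 6.

```python
import pandas as pd

def walk_forward_windows(dates, train_size, test_size, expanding=True):
    """Yield (window_id, train_dates, test_dates); test blocks tile the timeline.

    Sizes are counts of trading days. With expanding=True the train window
    always starts at the first date; otherwise it rolls with a fixed length."""
    cal = pd.DatetimeIndex(pd.to_datetime(dates)).unique().sort_values()
    start, win_id = train_size, 0
    while start < len(cal):
        train_lo = 0 if expanding else start - train_size
        yield win_id, cal[train_lo:start], cal[start:start + test_size]
        start += test_size
        win_id += 1
```

Because each test block starts where the previous one ended, concatenating the per-window outputs gives a timeline without gaps or overlaps, and every train window ends strictly before its test block begins.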

---

## 2.9 Forward (Shadow) Mode
- **Daily update:** Apply persisted scaler + HMM to latest t; append to `regime_labels.parquet`.
- **Retrain cadence:** Weekly/bi-weekly.
- **Logging:** Save model hash, posterior, chosen label, features vector.
- **Alerts:** Raise an alert if the mapping flips or a dwell-time anomaly is detected.

---

## 2.10 Configuration & Reproducibility
- **Config keys (`config.yaml`):**
  - Features list for HMM.
  - `n_components`, `MIN_DWELL_DAYS`, `POSTERIOR_THRESH`.
  - Finance recency weighting toggle & decay parameter (if implemented here).
  - Random seed, plot toggles.
- **Serialization:**
  - joblib for model + scaler.
  - JSON for meta (labels, thresholds, diagnostics).
- **Tests:**
  - Deterministic output with fixed seed.
  - No leakage (t-only features).
  - Posterior rows sum to 1; dates strictly increasing.
  - No gaps after stitching.
  - Label semantics test per window.
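
Several of the tests listed above can be expressed as plain assertions on the stitched label table. The sketch below assumes the `date, state_id, p0.., regime_label` layout described in 2.0 and a trading calendar passed in by the caller; the function name is hypothetical.

```python
import re
import numpy as np
import pandas as pd

def check_regime_labels(labels, calendar):
    """Raise AssertionError if the stitched regime table violates basic invariants."""
    prob_cols = [c for c in labels.columns if re.fullmatch(r"p\d+", c)]
    assert prob_cols, "no posterior columns (p0, p1, ...) found"
    # Posterior rows sum to 1.
    row_sums = labels[prob_cols].sum(axis=1)
    assert np.allclose(row_sums, 1.0, atol=1e-6), "posterior rows must sum to 1"
    # Dates strictly increasing (sorted and without repeats).
    dates = pd.DatetimeIndex(pd.to_datetime(labels["date"]))
    assert dates.is_monotonic_increasing and dates.is_unique, "dates must be strictly increasing"
    # No gaps after stitching: every calendar date has a row.
    missing = pd.DatetimeIndex(calendar).difference(dates)
    assert len(missing) == 0, f"{len(missing)} calendar dates missing after stitching"
```

The leakage and label-semantics tests need access to the inputs and state profiles, so they are not covered by this table-level check.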

---

## 2.11 Deliverables Checklist
- `regime_labels.parquet` (+ CSV).
- `regime_hmm.pkl` (model + scaler per window).
- `regime_meta.json` (state→label, scaler params, diagnostics).
- `regime_timeline.png`, `regime_posteriors.png`, `state_profiles.csv`, `transition_matrix.csv`.
- `regime_sensitivity.json` (K/feature/era stability).
- `regime_policy_map.json` (interfaces to Sections 3–5).


---
</details>

In [None]:
# ============================================================
# Section 2.0: Scope & Interfaces (Regime Modeling bootstrap)
# Builds on Section 1 artifacts; defines config, I/O, sanity checks,
# and prepares the market-level panel stub used by 2.1+ (no HMM yet).
# ============================================================

from __future__ import annotations

import os
import json
import yaml
from dataclasses import dataclass, asdict
from typing import Dict, Any, List
from datetime import datetime

import numpy as np
import pandas as pd

# ─────────────────────────────────────────────────────────────
# 0) Paths & directories (reuse Section 1 outputs)
# ─────────────────────────────────────────────────────────────
FEATURES_PATH_DEFAULT = (
    "features_filtered.parquet"
    if os.path.exists("features_filtered.parquet")
    else "features.parquet"
)
UNIVERSE_PATH = "universe.csv"
ARTIFACT_DIR = "artifacts"
REGIME_DIR = os.path.join(ARTIFACT_DIR, "regimes")
PLOTS_DIR = os.path.join(REGIME_DIR, "plots")

os.makedirs(REGIME_DIR, exist_ok=True)
os.makedirs(PLOTS_DIR, exist_ok=True)

# ─────────────────────────────────────────────────────────────
# 1) Config: defaults + optional override via config.yaml
# Keys are intentionally minimal here; 2.1–2.10 will read them.
# ─────────────────────────────────────────────────────────────
@dataclass
class RegimeConfig:
    # Raw context features for HMM (NOT cross-sectional z-scores)
    hmm_features: List[str]
    include_dvix: bool                 # add ΔVIX feature to panel
    n_components_grid: List[int]       # HMM K sensitivity (e.g., [2,3])
    covariance_type: str               # "full" by default
    random_seed: int
    # Debounce (used later in 2.4)
    min_dwell_days: int
    posterior_thresh: float
    # Optional finance rule: give more weight to recent samples during HMM fit
    recency_weighting: bool
    recency_half_life_days: int
    # I/O
    plots_enabled: bool
    save_csv_alongside_parquet: bool
    features_path: str = FEATURES_PATH_DEFAULT
    universe_path: str = UNIVERSE_PATH
    regime_dir: str = REGIME_DIR
    plots_dir: str = PLOTS_DIR

DEFAULT_CFG = RegimeConfig(
    hmm_features=["spy_rv_20", "vix_close", "breadth"],  # from Section 1 (raw context)
    include_dvix=True,
    n_components_grid=[2, 3],
    covariance_type="full",
    random_seed=42,
    min_dwell_days=3,
    posterior_thresh=0.55,
    recency_weighting=False,           # flip to True if enabling in 2.2
    recency_half_life_days=90,
    plots_enabled=True,
    save_csv_alongside_parquet=True,
)

CONFIG_FILE = "config.yaml"
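user_cfg = {}
if os.path.exists(CONFIG_FILE):
    try:
        with open(CONFIG_FILE, "r") as f:
            raw_cfg = yaml.safe_load(f) or {}
        if isinstance(raw_cfg, dict):
            user_cfg = raw_cfg.get("regimes", {}) or {}
    except Exception as exc:
        # Fall back to defaults, but report it instead of failing silently
        print(f"WARNING: could not read {CONFIG_FILE} ({exc}); using default regime config.")
        user_cfg = {}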

def merge_cfg(default: RegimeConfig, override: Dict[str, Any]) -> RegimeConfig:
    d = asdict(default)
    for k, v in override.items():
        if k in d and v is not None:
            d[k] = v
    return RegimeConfig(**d)

CFG = merge_cfg(DEFAULT_CFG, user_cfg)

# Persist effective config for traceability
with open(os.path.join(REGIME_DIR, "regime_config_effective.json"), "w") as f:
    json.dump(asdict(CFG), f, indent=2)

# ─────────────────────────────────────────────────────────────
# 2) Load Section 1 artifacts and build the market panel stub
# NOTE: use RAW context features from Section 1 (no CS-z).
# This version auto-detects whether ^VIX exists as a ticker,
# or vix_close/breadth/spy_rv_20 are already on every row.
# ─────────────────────────────────────────────────────────────
assert os.path.exists(CFG.features_path), f"Missing features file: {CFG.features_path}"
fe = pd.read_parquet(CFG.features_path)
fe["date"] = pd.to_datetime(fe["date"], utc=False, errors="coerce")
fe = fe.dropna(subset=["date"]).sort_values(["date", "ticker"])

# Required columns present?
required_cols = {"date", "ticker", "adj_close", "spy_rv_20", "vix_close", "breadth"}
missing = list(required_cols - set(fe.columns))
if missing:
    raise ValueError(f"Required columns missing in features file: {sorted(missing)}")

# SPY must exist for returns
if not (fe["ticker"] == "SPY").any():
    raise ValueError("SPY rows not found in features file; cannot compute spy_ret.")

# Build SPY returns
spy = fe.loc[fe["ticker"] == "SPY", ["date", "adj_close", "spy_rv_20"]].copy()
spy["spy_ret"] = np.log(spy["adj_close"] / spy["adj_close"].shift(1))

# vix_close / breadth / rv_20 may be replicated on every row; prefer a unique-by-date view.
# If ^VIX rows exist, unique-by-date still works for both layouts.
# Drop NaNs first so the row kept per date is one that actually carries the value.
vix_by_date = fe[["date", "vix_close"]].dropna().drop_duplicates("date").copy()
breadth_by_date = fe[["date", "breadth"]].dropna().drop_duplicates("date").copy()
rv20_by_date = fe[["date", "spy_rv_20"]].dropna().drop_duplicates("date").copy()

# Merge market panel (date-level)
mkt = (
    spy[["date", "spy_ret"]]                     # SPY returns
    .merge(rv20_by_date, on="date", how="inner") # realized vol
    .merge(vix_by_date, on="date", how="inner")  # VIX level
    .merge(breadth_by_date, on="date", how="inner")  # breadth
    .sort_values("date")
)

# Optional ΔVIX (level change)
if CFG.include_dvix:
    mkt["dvix"] = mkt["vix_close"].diff()

# Breadth timing guard: uncomment if your breadth is same-day and should be known-at-t
# mkt["breadth"] = mkt["breadth"].shift(1)

# Complete-case rows only (HMM requires no NaNs)
core_cols = ["spy_ret", "spy_rv_20", "vix_close", "breadth"] + (["dvix"] if CFG.include_dvix else [])
mkt = mkt.dropna(subset=core_cols).reset_index(drop=True)

# Save panel (consumed by 2.1/2.2)
panel_path = os.path.join(CFG.regime_dir, "market_panel.parquet")
mkt.to_parquet(panel_path, index=False)
if CFG.save_csv_alongside_parquet:
    mkt.to_csv(os.path.join(CFG.regime_dir, "market_panel.csv"), index=False)

# ─────────────────────────────────────────────────────────────
# 3) Finalize & console summary (with robust date handling)
# ─────────────────────────────────────────────────────────────
def _fmt_date(ts):
    return None if pd.isna(ts) else pd.Timestamp(ts).strftime("%Y-%m-%d")

if mkt.empty:
    # Diagnostics to help you decide if breadth shift is needed, etc.
    core_for_diag = ["spy_ret", "spy_rv_20", "vix_close", "breadth"] + (["dvix"] if CFG.include_dvix else [])
    non_null_counts = {c: int(fe[c].notna().sum()) if c in fe.columns else 0 for c in core_for_diag}
    spy_src = fe.loc[fe["ticker"] == "SPY", ["date", "adj_close", "spy_rv_20"]].assign(
        spy_ret=lambda d: np.log(d["adj_close"] / d["adj_close"].shift(1))
    )
    coverage_diag = {
        "rows_with_spy_ret_and_rv20": int(spy_src.dropna(subset=["spy_ret", "spy_rv_20"]).shape[0]),
        "unique_dates_with_vix": int(vix_by_date.dropna(subset=["vix_close"]).shape[0]),
        "unique_dates_with_breadth": int(breadth_by_date.dropna(subset=["breadth"]).shape[0]),
    }
    raise ValueError(
        "Market panel is empty after merging/dropping NaNs. "
        f"Non-null counts (in features file): {non_null_counts}. "
        f"Coverage by component: {coverage_diag}. "
        "Common fixes: ensure breadth timing (try shifting breadth by 1), "
        "or check for gaps in SPY/VIX/breadth date alignment."
    )

summary = {
    "features_file": CFG.features_path,
    "universe_file": CFG.universe_path,
    "market_panel_rows": int(mkt.shape[0]),
    "market_panel_cols": list(mkt.columns),
    "date_min": _fmt_date(mkt['date'].min()),
    "date_max": _fmt_date(mkt['date'].max()),
    "config_effective": os.path.abspath(os.path.join(CFG.regime_dir, "regime_config_effective.json")),
    "panel_path": os.path.abspath(panel_path),
    "meta_path": os.path.abspath(os.path.join(CFG.regime_dir, 'regime_meta.json')),
}
print(json.dumps(summary, indent=2))

{
  "features_file": "features_filtered.parquet",
  "universe_file": "universe.csv",
  "market_panel_rows": 4658,
  "market_panel_cols": [
    "date",
    "spy_ret",
    "spy_rv_20",
    "vix_close",
    "breadth",
    "dvix"
  ],
  "date_min": "2007-02-06",
  "date_max": "2025-08-11",
  "config_effective": "/content/artifacts/regimes/regime_config_effective.json",
  "panel_path": "/content/artifacts/regimes/market_panel.parquet",
  "meta_path": "/content/artifacts/regimes/regime_meta.json"
}
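
One property of `merge_cfg` worth keeping in mind: keys that do not match a `RegimeConfig` field are dropped without any message, so a typo in `config.yaml` leaves the default in place. The sketch below reproduces the same merge rule on a two-field stand-in and also reports the ignored keys. `DemoConfig` and `merge_with_report` are hypothetical names used only for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class DemoConfig:
    # Two fields standing in for the full RegimeConfig
    n_components_grid: List[int]
    min_dwell_days: int


def merge_with_report(default: DemoConfig, override: Dict[str, Any]) -> Tuple[DemoConfig, List[str]]:
    """Same merge rule as merge_cfg, but also return the override keys that were ignored."""
    d = asdict(default)
    ignored = []
    for k, v in override.items():
        if k in d and v is not None:
            d[k] = v          # known key with a real value: override
        elif k not in d:
            ignored.append(k)  # unknown key: likely a typo
    return DemoConfig(**d), sorted(ignored)


cfg, ignored = merge_with_report(
    DemoConfig(n_components_grid=[2, 3], min_dwell_days=3),
    {"min_dwell_days": 5, "min_dwel_days": 7, "n_components_grid": None},
)
print(cfg, ignored)
```

Here `min_dwell_days` is overridden, the `None` value leaves `n_components_grid` at its default, and the misspelled key ends up in `ignored`.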


In [None]:
# ============================================================
# Section 2.1: Data Assembly (windowed extraction + scaling)
# Uses artifacts from 2.0; prepares X_train/X_test for HMM.
# ============================================================

from __future__ import annotations

import os
import json
from dataclasses import asdict
from typing import Dict, Any, Tuple, List, Optional
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
import joblib

# Reuse CFG, paths from 2.0
REGIME_DIR = CFG.regime_dir
PLOTS_DIR = CFG.plots_dir
PANEL_PATH = os.path.join(REGIME_DIR, "market_panel.parquet")
META_PATH = os.path.join(REGIME_DIR, "regime_meta.json")

assert os.path.exists(PANEL_PATH), f"Missing market panel: {PANEL_PATH}"
mkt = pd.read_parquet(PANEL_PATH)
mkt["date"] = pd.to_datetime(mkt["date"])
mkt = mkt.sort_values("date").reset_index(drop=True)

# Choose feature list (raw context features only; dvix optional)
hmm_feat_cols = list(CFG.hmm_features)
if CFG.include_dvix and "dvix" not in hmm_feat_cols:
    hmm_feat_cols.append("dvix")

# Safety: ensure columns exist
missing_cols = [c for c in hmm_feat_cols + ["spy_ret"] if c not in mkt.columns]
if missing_cols:
    raise ValueError(f"Missing required columns in market panel: {missing_cols}")

def make_window_masks(df: pd.DataFrame,
                      train_start: str,
                      train_end: str,
                      test_start: str,
                      test_end: str) -> Tuple[pd.Series, pd.Series]:
    d = df["date"]
    train_mask = (d >= pd.to_datetime(train_start)) & (d <= pd.to_datetime(train_end))
    test_mask  = (d >= pd.to_datetime(test_start))  & (d <= pd.to_datetime(test_end))
    return train_mask, test_mask

def build_hmm_matrices(df: pd.DataFrame,
                       features: List[str],
                       train_start: str,
                       train_end: str,
                       test_start: str,
                       test_end: str,
                       scaler_out_path: Optional[str] = None,
                       breadth_shift_days: int = 0) -> Dict[str, Any]:
    """
    Slice the train/test windows, fit a StandardScaler on train only,
    transform both windows with it, and persist the scaler plus a QC file.

    Returns:
      {
        'X_train': np.ndarray,
        'X_test': np.ndarray,
        'dates_train': list of pd.Timestamp,
        'dates_test': list of pd.Timestamp,
        'scaler_path': str,
        'scaler_mean': list,
        'scaler_scale': list,
        'qc': dict
      }
    """
    dfw = df.copy()

    # Optional breadth shift (if you decide breadth should be known-at-t from t-1)
    if breadth_shift_days != 0 and "breadth" in features:
        dfw["breadth"] = dfw["breadth"].shift(breadth_shift_days)

    # Drop rows with missing features
    dfw = dfw.dropna(subset=features).reset_index(drop=True)

    # Window masks
    tr_mask, te_mask = make_window_masks(dfw, train_start, train_end, test_start, test_end)

    # Slice
    train_df = dfw.loc[tr_mask, ["date"] + features].dropna()
    test_df  = dfw.loc[te_mask, ["date"] + features].dropna()

    if train_df.empty or test_df.empty:
        raise ValueError(
            f"Empty train/test after slicing: "
            f"train({train_start}→{train_end}) rows={train_df.shape[0]}, "
            f"test({test_start}→{test_end}) rows={test_df.shape[0]}. "
            f"Consider adjusting dates or breadth_shift_days."
        )

    # Standardize on TRAIN ONLY; transform TEST with same scaler
    scaler = StandardScaler()
    X_train = scaler.fit_transform(train_df[features].to_numpy(dtype=float))
    X_test  = scaler.transform(test_df[features].to_numpy(dtype=float))

    # Persist scaler per window
    if scaler_out_path is None:
        win_tag = f"{train_start}_{train_end}__{test_start}_{test_end}".replace("-", "")
        scaler_out_path = os.path.join(REGIME_DIR, f"scaler_{win_tag}.joblib")
    joblib.dump(scaler, scaler_out_path)

    # Quick QC: mean/var drift (train vs test) and feature coverage
    qc = {
        "train_rows": int(train_df.shape[0]),
        "test_rows": int(test_df.shape[0]),
        "features": features,
        "train_means": dict(zip(features, np.mean(X_train, axis=0).round(6).tolist())),
        "train_stds": dict(zip(features, np.std(X_train, axis=0, ddof=0).round(6).tolist())),
        "test_means": dict(zip(features, np.mean(X_test, axis=0).round(6).tolist())),
        "test_stds": dict(zip(features, np.std(X_test, axis=0, ddof=0).round(6).tolist())),
    }

    # Save a tiny per-window QC file
    win_qc_path = scaler_out_path.replace(".joblib", "_qc.json")
    with open(win_qc_path, "w") as f:
        json.dump(qc, f, indent=2)

    return {
        "X_train": X_train,
        "X_test": X_test,
        "dates_train": train_df["date"].to_list(),
        "dates_test": test_df["date"].to_list(),
        "scaler_path": scaler_out_path,
        "scaler_mean": scaler.mean_.round(12).tolist(),
        "scaler_scale": scaler.scale_.round(12).tolist(),
        "qc": qc,
    }

# Example: pick a first walk-forward split anchored to your warmup_cutoff
# You can replace these with your Section 6 generator later.
train_start = "2007-02-06"  # day after warmup_cutoff in your QC
train_end   = "2016-12-30"
test_start  = "2017-01-03"
test_end    = mkt["date"].max().strftime("%Y-%m-%d")

window = build_hmm_matrices(
    df=mkt,
    features=hmm_feat_cols,
    train_start=train_start,
    train_end=train_end,
    test_start=test_start,
    test_end=test_end,
    scaler_out_path=None,
    breadth_shift_days=0,  # set to 1 if you confirm breadth must be known-at-t from t-1
)

# Persist a small window manifest so later steps (2.2+) can load it
manifest = {
    "window": {
        "train_start": train_start,
        "train_end": train_end,
        "test_start": test_start,
        "test_end": test_end,
    },
    "features": hmm_feat_cols,
    "scaler_path": window["scaler_path"],
    "n_train": len(window["dates_train"]),
    "n_test": len(window["dates_test"]),
}
with open(os.path.join(REGIME_DIR, "window_manifest.json"), "w") as f:
    json.dump(manifest, f, indent=2)

print(json.dumps({
    "status": "2.1 ready",
    "features_used": hmm_feat_cols,
    "scaler_saved": window["scaler_path"],
    "train_rows": manifest["n_train"],
    "test_rows": manifest["n_test"],
}, indent=2))

{
  "status": "2.1 ready",
  "features_used": [
    "spy_rv_20",
    "vix_close",
    "breadth",
    "dvix"
  ],
  "scaler_saved": "artifacts/regimes/scaler_20070206_20161230__20170103_20250811.joblib",
  "train_rows": 2495,
  "test_rows": 2163
}
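
The split above is a single hard-coded window. A generator of expanding windows could look like the sketch below, assuming one calendar year per test block and a training set that always starts at the first available date. `expanding_windows` is a hypothetical stand-in; the actual Section 6 scheme may differ.

```python
from typing import Iterator, Tuple

import pandas as pd


def expanding_windows(dates: pd.Series, first_test_year: int) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (train_start, train_end, test_start, test_end), one calendar year per test block."""
    d = pd.to_datetime(dates).sort_values().reset_index(drop=True)
    fmt = "%Y-%m-%d"
    for year in range(first_test_year, int(d.dt.year.max()) + 1):
        train = d[d.dt.year < year]   # everything strictly before the test year
        test = d[d.dt.year == year]
        if train.empty or test.empty:
            continue
        yield (train.iloc[0].strftime(fmt), train.iloc[-1].strftime(fmt),
               test.iloc[0].strftime(fmt), test.iloc[-1].strftime(fmt))


# Business-day calendar as a stand-in for mkt["date"]
dates = pd.Series(pd.bdate_range("2015-01-01", "2017-12-29"))
windows = list(expanding_windows(dates, first_test_year=2016))
for w in windows:
    print(w)
```

Each tuple can be passed straight to `build_hmm_matrices`, which already takes the four boundaries as date strings.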


In [None]:
!pip install hmmlearn --quiet

In [None]:
# ============================================================
# Section 2.2: Model Choice & Configuration (Gaussian HMM)
# Primary: GaussianHMM (full covariance), K in {2,3}
# - Multiple restarts; pick best train log-likelihood
# - Sticky transitions (Dirichlet-like persistence) via diagonal bias
# - Finance recency weighting: time-decayed sub-sequences (ENABLED)
# Reuses:
#   - artifacts/regimes/market_panel.parquet (from 2.0)
#   - artifacts/regimes/window_manifest.json (from 2.1)
#   - scaler_*.joblib (from 2.1)
# Outputs:
#   - artifacts/regimes/regime_hmm.pkl (joblib bundle: model + meta)
#   - artifacts/regimes/hmm_kgrid.json (scores by K)
# ============================================================

from __future__ import annotations

import os
import json
from typing import Dict, Any, List, Tuple
from datetime import datetime

import numpy as np
import pandas as pd
import joblib

from hmmlearn.hmm import GaussianHMM

# Reuse config and paths from 2.0 / 2.1
REGIME_DIR = CFG.regime_dir
PANEL_PATH = os.path.join(REGIME_DIR, "market_panel.parquet")
MANIFEST_PATH = os.path.join(REGIME_DIR, "window_manifest.json")
META_PATH = os.path.join(REGIME_DIR, "regime_meta.json")

assert os.path.exists(PANEL_PATH), f"Missing market panel: {PANEL_PATH}"
assert os.path.exists(MANIFEST_PATH), f"Missing window manifest: {MANIFEST_PATH}"

with open(MANIFEST_PATH, "r") as f:
    MAN = json.load(f)

mkt = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
mkt["date"] = pd.to_datetime(mkt["date"])

scaler = joblib.load(MAN["scaler_path"])
features = MAN["features"]
assert all(c in mkt.columns for c in features), f"Panel missing features: {set(features) - set(mkt.columns)}"

# Train/test windows
train_start = pd.to_datetime(MAN["window"]["train_start"])
train_end   = pd.to_datetime(MAN["window"]["train_end"])
test_start  = pd.to_datetime(MAN["window"]["test_start"])
test_end    = pd.to_datetime(MAN["window"]["test_end"])

train_df = mkt[(mkt["date"] >= train_start) & (mkt["date"] <= train_end)][["date"] + features].dropna().reset_index(drop=True)
test_df  = mkt[(mkt["date"] >= test_start)  & (mkt["date"] <= test_end)][["date"] + features].dropna().reset_index(drop=True)

X_train = scaler.transform(train_df[features].to_numpy(dtype=float))
X_test  = scaler.transform(test_df[features].to_numpy(dtype=float))
dates_train = train_df["date"].to_numpy()
dates_test  = test_df["date"].to_numpy()

# ─────────────────────────────────────────────────────────────
# Hyperparameters: test run now, bump for real run (marked TOCHANGE)
# ─────────────────────────────────────────────────────────────
N_COMPONENTS_GRID = [2]   # TOCHANGE: [3] or [2,3] for real run
N_ITER = 200              # TOCHANGE: 1000 for real run
N_INIT = 2                # TOCHANGE: 10 for real run
RANDOM_STATE = 42
COVARIANCE_TYPE = "full"
TOL = 1e-3                # TOCHANGE: 1e-4 for real run

# Sticky transitions strength (Dirichlet-like, diagonal blend post-fit)
# Larger -> stickier regimes (longer dwell times)
LAMBDA_STICK = 0.15       # TOCHANGE: 0.30–0.50 for real run

# Finance recency weighting: ENABLED
APPLY_RECENCY = True
HALF_LIFE_DAYS = 756      # NOTE: age is measured in calendar days (toordinal), so 756 is ~2.1 years, not ~3
# TOCHANGE: in calendar days, ~2y = 730, ~3y = 1095, ~5y = 1826 (504/756/1260 are the trading-day equivalents)

EPSILON_FLOOR = 0.10      # ensures old episodes never get <10% of peak weight
# TOCHANGE: 0.05–0.15 depending on how protective you want to be

# For recency sampler (still lightweight in test; scale for real run)
SEG_LEN = 60              # TOCHANGE: 90–120 for real run
N_SEGMENTS = 80           # TOCHANGE: 200–400 for real run

# ─────────────────────────────────────────────────────────────
# Utilities
# ─────────────────────────────────────────────────────────────
def _diag_sticky_blend(transmat: np.ndarray, lam: float) -> np.ndarray:
    k = transmat.shape[0]
    T = (1.0 - lam) * transmat + lam * np.eye(k)
    T = T / T.sum(axis=1, keepdims=True)
    return T

def _build_time_decay_weights(dates: np.ndarray, half_life_days: int) -> np.ndarray:
    t = np.array([pd.Timestamp(d).toordinal() for d in dates], dtype=float)
    age = (t.max() - t)  # newer dates -> smaller age
    decay = np.log(2) / max(1, half_life_days)
    w = np.exp(-decay * age)
    return w / (w.sum() + 1e-12)

def _sample_time_weighted_subsequences(
    X: np.ndarray,
    dates: np.ndarray,
    seg_len: int,
    n_segments: int,
    half_life_days: int,
    random_state: int,
) -> Tuple[np.ndarray, List[int]]:
    rng = np.random.RandomState(random_state)
    n = X.shape[0]
    if n < seg_len:
        return X.copy(), [n]
    ends = np.arange(seg_len - 1, n)
    p = _build_time_decay_weights(dates[ends], half_life_days)
    p = np.maximum(p, EPSILON_FLOOR * p.max())
    p = p / p.sum()

    chosen = rng.choice(ends, size=min(n_segments, len(ends)), replace=True, p=p)
    lengths, chunks = [], []
    for e in chosen:
        s = e - (seg_len - 1)
        chunk = X[s:e+1]
        chunks.append(chunk)
        lengths.append(len(chunk))
    X_concat = np.vstack(chunks)
    return X_concat, lengths

# ─────────────────────────────────────────────────────────────
# Train HMMs; pick best by train log-likelihood
# ─────────────────────────────────────────────────────────────
best = {"score": -np.inf, "model": None, "k": None, "seed": None, "train_lengths": None, "fit_mode": None}
results = []

for k in N_COMPONENTS_GRID:
    for r in range(N_INIT):
        seed = RANDOM_STATE + r

        if APPLY_RECENCY:
            X_fit, lengths = _sample_time_weighted_subsequences(
                X_train, dates_train,
                seg_len=SEG_LEN,
                n_segments=N_SEGMENTS,
                half_life_days=HALF_LIFE_DAYS,
                random_state=seed,
            )
            fit_mode = "recency"
        else:
            X_fit, lengths = X_train, [len(X_train)]
            fit_mode = "plain"

        # Init model
        model = GaussianHMM(
            n_components=k,
            covariance_type=COVARIANCE_TYPE,
            n_iter=N_ITER,
            tol=TOL,
            random_state=seed,
            verbose=False,
            # IMPORTANT: do not include 's' or 't' here, otherwise your custom
            # startprob_/transmat_ get overwritten on init.
            init_params="mc",      # means, covars only
            params="stmc",         # learn startprob, transmat, means, covars
        )

        # Sticky-biased initialization (near-diagonal)
        trans0 = np.full((k, k), (1.0 - 0.90) / max(1, k - 1))
        np.fill_diagonal(trans0, 0.90)
        model.transmat_ = trans0

        # Uniform start probabilities
        model.startprob_ = np.full(k, 1.0 / k)

        # Fit
        model.fit(X_fit, lengths=lengths)

        # Post-fit sticky blend (Dirichlet-like)
        model.transmat_ = _diag_sticky_blend(model.transmat_, LAMBDA_STICK)

        score = model.score(X_train)  # comparable scoring on original train sequence

        results.append({
            "k": k,
            "seed": seed,
            "score": float(score),
            "fit_mode": fit_mode,
            "transmat": model.transmat_.tolist(),
        })

        if score > best["score"]:
            best.update({"score": score, "model": model, "k": k, "seed": seed, "train_lengths": lengths, "fit_mode": fit_mode})

# Save grid scores
with open(os.path.join(REGIME_DIR, "hmm_kgrid.json"), "w") as f:
    json.dump({"results": results, "chosen": {"k": best["k"], "seed": best["seed"], "score": float(best["score"]), "fit_mode": best["fit_mode"]}}, f, indent=2)

# Persist best model bundle
bundle = {
    "model": best["model"],
    "k": best["k"],
    "random_state": best["seed"],
    "features": features,
    "scaler_path": MAN["scaler_path"],
    "train_dates": [str(d) for d in dates_train],
    "test_dates": [str(d) for d in dates_test],
    "fit_mode": best["fit_mode"],
    "sticky_lambda": LAMBDA_STICK,
    "n_iter": N_ITER,
    "n_init": N_INIT,
    "tol": TOL,
    "covariance_type": COVARIANCE_TYPE,
    "recency_weighting": APPLY_RECENCY,  # record the actual toggle, not a hard-coded True
    "recency_half_life_days": HALF_LIFE_DAYS,
    "recency_seg_len": SEG_LEN,         # TOCHANGE: 90–120
    "recency_n_segments": N_SEGMENTS,   # TOCHANGE: 200–400
    "created_at": datetime.utcnow().isoformat() + "Z",
    "recency_epsilon_floor": EPSILON_FLOOR
}
joblib.dump(bundle, os.path.join(REGIME_DIR, "regime_hmm.pkl"))

print(json.dumps({
    "status": "2.2 trained",
    "chosen_k": best["k"],
    "fit_mode": best["fit_mode"],
    "train_score": float(best["score"]),
    "n_iter": N_ITER,
    "n_init": N_INIT,
    "sticky_lambda": LAMBDA_STICK,
    "recency_weighting": APPLY_RECENCY,
    "half_life_days": HALF_LIFE_DAYS,
    "seg_len": SEG_LEN,
    "n_segments": N_SEGMENTS,
}, indent=2))

{
  "status": "2.2 trained",
  "chosen_k": 2,
  "fit_mode": "recency",
  "train_score": -8178.604987870423,
  "n_iter": 200,
  "n_init": 2,
  "sticky_lambda": 0.15,
  "recency_weighting": true,
  "half_life_days": 756,
  "seg_len": 60,
  "n_segments": 80
}
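
The effect of `LAMBDA_STICK` on regime persistence can be quantified. Blending a row-stochastic matrix with the identity changes each diagonal entry to `(1 - lam) * a_ii + lam`, so `1 - a_ii` shrinks by the factor `(1 - lam)`. Since the expected dwell time of a Markov state is `1 / (1 - a_ii)`, every state's expected dwell time is multiplied by `1 / (1 - lam)`. The sketch below checks this on an illustrative matrix, not the fitted one; `expected_dwell` is a hypothetical helper.

```python
import numpy as np


def diag_sticky_blend(transmat: np.ndarray, lam: float) -> np.ndarray:
    """Blend a transition matrix toward the identity, mirroring _diag_sticky_blend."""
    k = transmat.shape[0]
    blended = (1.0 - lam) * transmat + lam * np.eye(k)
    return blended / blended.sum(axis=1, keepdims=True)


def expected_dwell(transmat: np.ndarray) -> np.ndarray:
    """Expected consecutive steps spent in each state: 1 / (1 - a_ii)."""
    return 1.0 / (1.0 - np.diag(transmat))


T = np.array([[0.90, 0.10], [0.20, 0.80]])  # illustrative, not fitted
lam = 0.15
T_sticky = diag_sticky_blend(T, lam)

print(expected_dwell(T))         # dwell before the blend
print(expected_dwell(T_sticky))  # dwell after the blend
```

With `lam = 0.15` dwell times grow by about 18%, and at `lam = 0.50` they double. Because the blend is applied after EM has converged, the stored transition matrix is no longer the fitted estimate, and the train score used for model selection is computed with the blended matrix.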


In [None]:
# Quick peek at effective sampling weights (fixed timedelta math)
ends = np.arange(SEG_LEN - 1, len(dates_train))
end_dates = pd.to_datetime(dates_train[ends])

pp = _build_time_decay_weights(end_dates, HALF_LIFE_DAYS)
pp = np.maximum(pp, EPSILON_FLOOR * pp.max())
pp = pp / pp.sum()

# Age in *days* relative to the most recent end_date
max_date = end_dates.max()
ages_days = (max_date - end_dates) / np.timedelta64(1, "D")  # float days

# Weighted mean age (how "old" the typical sampled segment end is)
w_mean_age_days = float(np.sum(ages_days * pp))

# 95% weight age: age threshold below which 95% of weight lies
order = np.argsort(ages_days)                  # youngest → oldest
cumw = np.cumsum(pp[order])
w95_idx = np.searchsorted(cumw, 0.95)
w95_age_days = float(ages_days[order][min(w95_idx, len(ages_days)-1)])

print({
    "weights_min": float(pp.min()),
    "weights_max": float(pp.max()),
    "weighted_mean_age_days": w_mean_age_days,
    "weighted_p95_age_days": w95_age_days,
})

{'weights_min': 0.0001335955016895934, 'weights_max': 0.001335955016895934, 'weighted_mean_age_days': 1017.4276078929489, 'weighted_p95_age_days': 2990.0}
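
In the output above, `weights_min / weights_max` is 0.10, which is `EPSILON_FLOOR`: the oldest segment ends are all sitting on the floor. The floor starts to bind where `exp(-ln(2) * age / H)` drops below the floor value, that is at `age = H * log2(1 / eps)`. The sketch below computes that age for the settings used here and reproduces the ratio on an illustrative grid of ages. `floored_weights` is a hypothetical helper that applies the same decay-then-floor rule.

```python
import numpy as np


def floored_weights(ages_days: np.ndarray, half_life_days: float, eps: float) -> np.ndarray:
    """Exponential-decay sampling weights, floored at eps * peak, normalized to sum to 1."""
    w = np.exp(-np.log(2.0) * ages_days / half_life_days)
    w = np.maximum(w, eps * w.max())
    return w / w.sum()


half_life, eps = 756, 0.10
floor_age = half_life * np.log2(1.0 / eps)   # age beyond which the floor binds
ages = np.arange(0, 3600, dtype=float)       # illustrative calendar-day ages
w = floored_weights(ages, half_life, eps)

print(round(float(floor_age), 1))            # calendar days
print(float(w.min() / w.max()))              # ratio of floor weight to peak weight
print(float(w[ages > floor_age].sum()))      # share of sampling mass at the floor
```

For a 756-day half-life and a 10% floor the threshold is about 2,511 calendar days, so segment ends older than roughly 6.9 years are sampled at a flat rate regardless of how much older they are.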


In [None]:
# ============================================================
# Section 2.3: State Labeling & Semantics
# - Score posteriors for all dates
# - Profile states on TRAIN window only (no peeking)
# - Label states: Risk-On (↑ret, ↓vol), Risk-Off (↓ret, ↑vol), Transition (rest)
# - Persist labels, posteriors, and profiles
# Outputs:
#   artifacts/regimes/regime_labels.parquet (date, state_id, p0..p{K-1}, regime_label)
#   artifacts/regimes/state_profiles.csv
#   artifacts/regimes/regime_meta.json (updated label map)
# ============================================================

from __future__ import annotations
import os, json
from typing import Dict, Any
import numpy as np
import pandas as pd
import joblib
from datetime import datetime

REGIME_DIR = CFG.regime_dir
PANEL_PATH = os.path.join(REGIME_DIR, "market_panel.parquet")
MANIFEST_PATH = os.path.join(REGIME_DIR, "window_manifest.json")
BUNDLE_PATH = os.path.join(REGIME_DIR, "regime_hmm.pkl")
META_PATH = os.path.join(REGIME_DIR, "regime_meta.json")

# Load artifacts
mkt = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
mkt["date"] = pd.to_datetime(mkt["date"])
with open(MANIFEST_PATH, "r") as f:
    MAN = json.load(f)
bundle = joblib.load(BUNDLE_PATH)
model = bundle["model"]
features = bundle["features"]
scaler = joblib.load(bundle["scaler_path"])

# Prepare matrices for ALL dates (but label semantics computed on TRAIN ONLY)
X_all = scaler.transform(mkt[features].to_numpy(dtype=float))
dates_all = mkt["date"].to_numpy()

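# Score posteriors. NOTE: predict_proba runs forward-backward over the whole sequence,
# so each row is a smoothed posterior that also reflects later observations.
post = model.predict_proba(X_all)  # shape: (T, K)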
states_argmax = post.argmax(axis=1)
K = post.shape[1]

# Train/test masks
train_start = pd.to_datetime(MAN["window"]["train_start"])
train_end   = pd.to_datetime(MAN["window"]["train_end"])
test_start  = pd.to_datetime(MAN["window"]["test_start"])
test_end    = pd.to_datetime(MAN["window"]["test_end"])
train_mask = (mkt["date"] >= train_start) & (mkt["date"] <= train_end)
test_mask  = (mkt["date"] >= test_start)  & (mkt["date"] <= test_end)

# Helper: posterior-weighted stats on TRAIN window
def weighted_mean(x, w):
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    s = w.sum()
    return float((x * w).sum() / s) if s > 0 else np.nan

def weighted_std(x, w):
    mu = weighted_mean(x, w)
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    s = w.sum()
    if s <= 1: return np.nan
    var = ((w * (x - mu)**2).sum()) / s
    return float(np.sqrt(max(var, 0.0)))

def weighted_quantile(x, w, q=0.05):
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)
    x_sorted, w_sorted = x[order], w[order]
    cw = np.cumsum(w_sorted)
    if cw[-1] == 0: return np.nan
    return float(x_sorted[np.searchsorted(cw, q * cw[-1])])

# Compute per-state profiles on TRAIN
train_ix = np.where(train_mask.values)[0]
has_dvix = "dvix" in mkt.columns
profiles = []
for s in range(K):
    w = post[train_ix, s]
    if w.sum() == 0:
        # Include mu_dvix here; otherwise it is undefined (or stale) when used below
        mu_ret = mu_vol = mu_vix = mu_brd = mu_dvix = q05 = np.nan
        sd_ret = np.nan
    else:
        mu_ret = weighted_mean(mkt.loc[train_mask, "spy_ret"].values, w)
        sd_ret = weighted_std(mkt.loc[train_mask, "spy_ret"].values, w)
        mu_vol = weighted_mean(mkt.loc[train_mask, "spy_rv_20"].values, w)
        mu_vix = weighted_mean(mkt.loc[train_mask, "vix_close"].values, w)
        mu_brd = weighted_mean(mkt.loc[train_mask, "breadth"].values, w)
        mu_dvix = weighted_mean(mkt.loc[train_mask, "dvix"].values, w) if has_dvix else np.nan
        q05    = weighted_quantile(mkt.loc[train_mask, "spy_ret"].values, w, q=0.05)

    profiles.append({
        "state_id": s,
        "ret_mean": mu_ret,
        "ret_std": sd_ret,
        "rv20_mean": mu_vol,
        "vix_mean": mu_vix,
        "dvix_mean": mu_dvix if has_dvix else np.nan,
        "breadth_mean": mu_brd,
        "ret_q05": q05,
    })

prof_df = pd.DataFrame(profiles)

# Labeling rules (train window only, no peeking):
#  - Risk-On: highest mean return, lowest vol (tie-breakers help if ambiguous)
#  - Risk-Off: highest vol, lowest mean return (tie-breakers help if ambiguous)
#  - Transition: whichever state is not assigned above
# Primary ranks
ret_rank = prof_df["ret_mean"].rank(method="dense")                        # higher = better
# For clarity: choose Risk-Off by *highest* rv20 (vol spike)
risk_off_id = int(prof_df["rv20_mean"].idxmax())                           # highest vol
risk_on_id  = int(ret_rank.idxmax())                                       # highest return

# Tie-breaker refinement (only if they collide or look ambiguous)
# Risk-On tie-breakers: breadth↑, VIX↓, tail q05↑
# Risk-Off tie-breakers: vol↑, ret↓, ΔVIX↑, breadth↓
# Each rank below gives 1 to the most favorable state, so the best row has the LOWEST rank sum.
def _best_risk_on_row(df: pd.DataFrame) -> int:
    score = (
        df["breadth_mean"].fillna(-1.0).rank(method="dense", ascending=False) +
        df["vix_mean"].fillna(np.inf).rank(method="dense", ascending=True) +
        df["ret_q05"].fillna(-np.inf).rank(method="dense", ascending=False)
    )
    return int(score.idxmin())

def _best_risk_off_row(df: pd.DataFrame) -> int:
    score = (
        df["rv20_mean"].fillna(-np.inf).rank(method="dense", ascending=False) +
        df["ret_mean"].fillna(np.inf).rank(method="dense", ascending=True) +
        (df["dvix_mean"] if "dvix_mean" in df.columns else pd.Series(0.0, index=df.index)).fillna(0.0).rank(method="dense", ascending=False) +
        df["breadth_mean"].fillna(np.inf).rank(method="dense", ascending=True)
    )
    return int(score.idxmin())

if risk_on_id == risk_off_id:
    risk_on_id  = _best_risk_on_row(prof_df)
    risk_off_id = _best_risk_off_row(prof_df)
    # Safety: if still colliding (extremely rare), force Risk-Off = highest vol, Risk-On = highest return
    if risk_on_id == risk_off_id:
        risk_off_id = int(prof_df["rv20_mean"].idxmax())
        risk_on_id  = int(prof_df["ret_mean"].idxmax())

# Final label map
label_map = {risk_on_id: "Risk-On", risk_off_id: "Risk-Off"}
for s in range(K):
    if s not in label_map:
        label_map[s] = "Transition"

# ---------- Build outputs USING the FINAL label_map ----------
out = mkt[["date"]].copy()
out["state_id"] = post.argmax(axis=1)
for s in range(K):
    out[f"p{s}"] = post[:, s]
out["regime_label"] = out["state_id"].map(label_map)

out_path = os.path.join(REGIME_DIR, "regime_labels.parquet")
out.to_parquet(out_path, index=False)
out.to_csv(os.path.join(REGIME_DIR, "regime_labels.csv"), index=False)

prof_df.to_csv(os.path.join(REGIME_DIR, "state_profiles.csv"), index=False)

# ---------- Update meta WITH the FINAL label_map ----------
if os.path.exists(META_PATH):
    with open(META_PATH, "r") as f:
        meta = json.load(f)
else:
    meta = {}

meta.setdefault("created_at", datetime.utcnow().isoformat() + "Z")
meta.setdefault("config", {})
meta.setdefault("diagnostics", {})
meta["diagnostics"]["state_profiles_train"] = prof_df.to_dict(orient="records")
meta["state_label_map"] = {int(k): v for k, v in label_map.items()}
meta.setdefault("features_used", features)
meta["notes"] = meta.get("notes", []) + [
    "State labeling computed on train window only (no peeking).",
    "Risk-On: highest mean ret & lowest vol; Risk-Off: highest vol & lowest ret; else Transition.",
    "Tie-breakers: breadth↑, VIX↓, tail q05↑ (Risk-On); vol↑, ret↓, ΔVIX↑, breadth↓ (Risk-Off).",
]

with open(META_PATH, "w") as f:
    json.dump(meta, f, indent=2)

print(json.dumps({
    "status": "2.3 labeled",
    "k": K,
    "label_map": label_map,
    "profiles_path": os.path.join(REGIME_DIR, "state_profiles.csv"),
    "labels_path": out_path,
}, indent=2))

# Posteriors sanity
assert np.allclose(post.sum(axis=1), 1.0, atol=1e-6), "Posterior rows must sum to 1."
assert not pd.isna(out["regime_label"]).any(), "All states must map to a regime label."

{
  "status": "2.3 labeled",
  "k": 2,
  "label_map": {
    "0": "Risk-On",
    "1": "Risk-Off"
  },
  "profiles_path": "artifacts/regimes/state_profiles.csv",
  "labels_path": "artifacts/regimes/regime_labels.parquet"
}
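
The tie-breaker scores in 2.3 sum dense ranks across several profile columns. Because rank 1 marks the most favorable value in each component, the strongest candidate is the row with the smallest sum. The sketch below illustrates this on a made-up three-state table; `toy_profiles`, `summed_rank_risk_on`, and all numbers are hypothetical and are not results from this project.

```python
import pandas as pd

# Hypothetical state profiles; the numbers are illustrative only.
toy_profiles = pd.DataFrame({
    "breadth_mean": [0.62, 0.35, 0.50],
    "vix_mean":     [14.0, 31.0, 20.0],
    "ret_q05":      [-0.010, -0.035, -0.018],
})

def summed_rank_risk_on(df: pd.DataFrame) -> pd.Series:
    """Sum of dense ranks where rank 1 is the most Risk-On-like value."""
    return (
        df["breadth_mean"].rank(method="dense", ascending=False)  # high breadth -> rank 1
        + df["vix_mean"].rank(method="dense", ascending=True)     # low VIX -> rank 1
        + df["ret_q05"].rank(method="dense", ascending=False)     # mild left tail -> rank 1
    )

toy_score = summed_rank_risk_on(toy_profiles)
toy_risk_on_id = int(toy_score.idxmin())  # lowest summed rank = best candidate
print(toy_score.tolist(), toy_risk_on_id)
```

State 0 leads on every component here, so it receives the lowest summed rank and is selected.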


In [None]:
# ============================================================
# Section 2.4 - Smoothing, Persistence & Debounce
# - Option: Viterbi most-likely path vs. posterior argmax
# - Debounce: POSTERIOR_THRESH and MIN_DWELL_DAYS from config
# - Gap handling: inherit last known regime (dates already market-days)
# Reuses:
#   - artifacts/regimes/market_panel.parquet (2.0)
#   - artifacts/regimes/window_manifest.json (2.1)
#   - artifacts/regimes/regime_hmm.pkl (2.2)
#   - artifacts/regimes/regime_labels.parquet (2.3)
# Outputs:
#   - artifacts/regimes/regime_labels.parquet (updated with *_smoothed cols)
#   - artifacts/regimes/regime_meta.json (updated diagnostics)
#   - console summary of dwell-time stats
# ============================================================

from __future__ import annotations
import os, json
from typing import List, Dict, Any, Tuple
from datetime import datetime
import numpy as np
import pandas as pd
import joblib

REGIME_DIR     = CFG.regime_dir
PANEL_PATH     = os.path.join(REGIME_DIR, "market_panel.parquet")
LABELS_PATH    = os.path.join(REGIME_DIR, "regime_labels.parquet")
META_PATH      = os.path.join(REGIME_DIR, "regime_meta.json")
BUNDLE_PATH    = os.path.join(REGIME_DIR, "regime_hmm.pkl")

assert os.path.exists(PANEL_PATH),  f"Missing market panel: {PANEL_PATH}"
assert os.path.exists(LABELS_PATH), f"Missing labels from 2.3: {LABELS_PATH}"
assert os.path.exists(BUNDLE_PATH), f"Missing HMM bundle: {BUNDLE_PATH}"

# --- Config knobs (extend 2.0 config if not present) ---
P_THRESH   = getattr(CFG, "posterior_thresh", 0.55)  # TOCHANGE: consider 0.60–0.65 for a stricter switch confirmation.
MIN_DWELL  = getattr(CFG, "min_dwell_days", 3)  # TOCHANGE: consider 5–10 to further reduce chattering.
SMOOTH_MTH = getattr(CFG, "smoothing_method", "posterior")  # "posterior" | "viterbi" # TOCHANGE: try "viterbi" for the real run and compare dwell-time stats and chattering.

# --- Load artifacts ---
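labels = pd.read_parquet(LABELS_PATH).sort_values("date").reset_index(drop=True)
labels["date"] = pd.to_datetime(labels["date"])  # match mkt["date"] dtype before the alignment check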
bundle = joblib.load(BUNDLE_PATH)
model  = bundle["model"]
features = bundle["features"]
scaler  = joblib.load(bundle["scaler_path"])

mkt = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
mkt["date"] = pd.to_datetime(mkt["date"])

# sanity
assert np.array_equal(labels["date"].values, mkt["date"].values), "Date alignment mismatch between labels and market panel."

# --- Choose base path: Viterbi vs posterior argmax ---
# We need posteriors for thresholding either way; for Viterbi we re-score X_all.
X_all = scaler.transform(mkt[features].to_numpy(dtype=float))
post  = model.predict_proba(X_all)  # (T, K)
K     = post.shape[1]

if SMOOTH_MTH.lower() == "viterbi":
    base_states = model.predict(X_all)     # most-likely state path
else:
    base_states = post.argmax(axis=1)      # raw posterior argmax (already in 2.3)

# --- Debounce step 1: posterior threshold gating (no switch if low confidence) ---
debounce_states = np.array(base_states, dtype=int)
for i in range(1, len(debounce_states)):
    new_state = debounce_states[i]
    if new_state != debounce_states[i-1]:
        # require sufficient posterior confidence on the *new* state; with a
        # Viterbi base path the new state need not be the posterior argmax,
        # so index the posterior by state rather than taking the row max
        if post[i, new_state] < P_THRESH:
            debounce_states[i] = debounce_states[i-1]

# --- Debounce step 2: enforce minimum dwell time by collapsing short runs ---
def _runs(state_series: np.ndarray) -> List[Tuple[int,int,int]]:
    """Return list of (start_idx, end_idx_inclusive, state) runs."""
    out = []
    s = 0
    cur = state_series[0]
    for i in range(1, len(state_series)):
        if state_series[i] != cur:
            out.append((s, i-1, cur))
            s = i
            cur = state_series[i]
    out.append((s, len(state_series)-1, cur))
    return out

def _collapse_short_runs(states: np.ndarray, min_len: int, post: np.ndarray) -> np.ndarray:
    arr = states.copy()
    changed = True
    # iterate until stable (collapsing can merge adjacent runs)
    while changed:
        changed = False
        runs = _runs(arr)
        for j, (s, e, st) in enumerate(runs):
            run_len = e - s + 1
            if run_len < min_len:
                # Candidate neighbors: previous and next, choose higher avg posterior over this segment
                prev_state = runs[j-1][2] if j > 0 else None
                next_state = runs[j+1][2] if j < len(runs)-1 else None

                # If no neighbors (degenerate), skip
                if prev_state is None and next_state is None:
                    continue

                # Compute average posterior for neighbors over the short segment
                best_neighbor = None
                best_score = -np.inf
                for cand in [prev_state, next_state]:
                    if cand is None:
                        continue
                    score = float(post[s:e+1, cand].mean())
                    if score > best_score:
                        best_score = score
                        best_neighbor = cand
                # Relabel the short run to best neighbor
                arr[s:e+1] = best_neighbor
                changed = True
                break  # restart since runs have changed
    return arr

smoothed_states = _collapse_short_runs(debounce_states, MIN_DWELL, post)

# --- Map to labels using the semantics from 2.3 (state_label_map) ---
# read label map from meta
if os.path.exists(META_PATH):
    with open(META_PATH, "r") as f:
        meta = json.load(f)
else:
    meta = {}

state_label_map = meta.get("state_label_map", None)
if state_label_map is None:
    # fallback to identity names if meta missing (shouldn't happen)
    state_label_map = {int(s): f"State{s}" for s in range(K)}

# Update labels DataFrame with smoothed outputs
labels["state_id_smoothed"] = smoothed_states
for s in range(K):
    # keep original p0..p{K-1} as-is from 2.3; they reflect the raw model posteriors
    if f"p{s}" not in labels.columns:
        labels[f"p{s}"] = post[:, s]

labels["regime_label_smoothed"] = labels["state_id_smoothed"].map({int(k): v for k, v in state_label_map.items()})

# --- Dwell-time diagnostics ---
def _dwell_stats(states: np.ndarray) -> pd.DataFrame:
    rr = _runs(states)
    return pd.DataFrame({
        "state_id": [st for (_,_,st) in rr],
        "run_len":  [e - s + 1 for (s,e,_) in rr],
    }).groupby("state_id").agg(
        median_run_length=("run_len", "median"),
        mean_run_length  =("run_len", "mean"),
        n_runs           =("run_len", "count"),
        max_run_length   =("run_len", "max"),
    ).reset_index()

dwell_df = _dwell_stats(labels["state_id_smoothed"].to_numpy())

# --- Save updated labels back to disk ---
labels.to_parquet(LABELS_PATH, index=False)
labels.to_csv(LABELS_PATH.replace(".parquet", ".csv"), index=False)

# --- Update regime_meta.json diagnostics & config snapshot ---
meta.setdefault("diagnostics", {})
meta["diagnostics"]["smoothing"] = {
    "method": SMOOTH_MTH,
    "posterior_thresh": P_THRESH,
    "min_dwell_days": MIN_DWELL,
    "dwell_stats": dwell_df.to_dict(orient="records"),
}
meta.setdefault("notes", [])
meta["notes"] += [
    "2.4 smoothing applied with debounce (posterior threshold + min dwell).",
    "If method='viterbi', base path is Viterbi; else posterior argmax.",
]
# de-dup notes
meta["notes"] = list(dict.fromkeys(meta["notes"]))

with open(META_PATH, "w") as f:
    json.dump(meta, f, indent=2)

print(json.dumps({
    "status": "2.4 smoothed",
    "method": SMOOTH_MTH,
    "posterior_thresh": P_THRESH,
    "min_dwell_days": MIN_DWELL,
    "k": K,
    "median_dwell_by_state": {
        int(r["state_id"]): float(r["median_run_length"]) for r in dwell_df.to_dict(orient="records")
    },
    "labels_path": LABELS_PATH
}, indent=2))

{
  "status": "2.4 smoothed",
  "method": "posterior",
  "posterior_thresh": 0.55,
  "min_dwell_days": 3,
  "k": 2,
  "median_dwell_by_state": {
    "0": 33.5,
    "1": 12.0
  },
  "labels_path": "artifacts/regimes/regime_labels.parquet"
}
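
The minimum-dwell step in 2.4 can be illustrated on a toy sequence. This is a simplified sketch, not the notebook's implementation: it absorbs a short run into the preceding run (or the following one for the first run) instead of choosing the neighbor by average posterior. `run_lengths`, `collapse_short`, and the toy data are hypothetical names and values.

```python
import numpy as np

def run_lengths(states: np.ndarray):
    """Return (start, end_inclusive, state) for each constant run."""
    change = np.flatnonzero(np.diff(states)) + 1       # indices where a new run starts
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change - 1, [len(states) - 1]))
    return [(int(s), int(e), int(states[s])) for s, e in zip(starts, ends)]

def collapse_short(states: np.ndarray, min_len: int) -> np.ndarray:
    """Absorb runs shorter than min_len into a neighboring run.

    Assumption for this sketch: the previous run wins, and the next run is
    used only when the short run is the first one.
    """
    arr = states.copy()
    while True:
        runs = run_lengths(arr)
        short = [(j, r) for j, r in enumerate(runs) if r[1] - r[0] + 1 < min_len]
        if not short or len(runs) == 1:
            return arr
        j, (s, e, _) = short[0]
        arr[s:e + 1] = runs[j - 1][2] if j > 0 else runs[j + 1][2]

# Toy path with a single-day blip into state 1 at index 4.
toy_states = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1])
toy_smoothed = collapse_short(toy_states, min_len=3)
print(toy_smoothed.tolist())
print(run_lengths(toy_smoothed))
```

The one-day run is merged into the surrounding state-0 run, leaving two runs that both satisfy the minimum dwell. Each collapse reduces the number of runs by at least one, so the loop terminates.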


In [None]:
# ============================================================
# Section 2.5 - Robustness & Sensitivity
# - K sensitivity: K ∈ {2,3}
# - Feature sensitivity: drop-one/add-one variants
# - Era stability: pre/post-2015 and 2020 crisis
# - Bootstrap: block bootstrap label stability
# Reuses:
#   - artifacts/regimes/market_panel.parquet (2.0)
#   - artifacts/regimes/window_manifest.json (2.1)
#   - scaler_*.joblib (2.1)
#   - artifacts/regimes/regime_hmm.pkl (2.2 baseline)
#   - artifacts/regimes/regime_labels.parquet (2.3 baseline labels)
# Outputs:
#   - artifacts/regimes/regime_sensitivity.json
# Notes:
#   This is a light test pass; heavier settings are tagged with # TOCHANGE
# ============================================================

from __future__ import annotations
import os, json
from typing import Dict, Any, List, Tuple
from datetime import datetime
import numpy as np
import pandas as pd
import joblib
from hmmlearn.hmm import GaussianHMM

REGIME_DIR  = CFG.regime_dir
PANEL_PATH  = os.path.join(REGIME_DIR, "market_panel.parquet")
MAN_PATH    = os.path.join(REGIME_DIR, "window_manifest.json")
BUNDLE_PATH = os.path.join(REGIME_DIR, "regime_hmm.pkl")
LABELS_PATH = os.path.join(REGIME_DIR, "regime_labels.parquet")
OUT_PATH    = os.path.join(REGIME_DIR, "regime_sensitivity.json")

assert os.path.exists(PANEL_PATH) and os.path.exists(MAN_PATH) and os.path.exists(BUNDLE_PATH)
mkt = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
mkt["date"] = pd.to_datetime(mkt["date"])

with open(MAN_PATH, "r") as f:
    MAN = json.load(f)

bundle   = joblib.load(BUNDLE_PATH)
features_base = bundle["features"]
scaler   = joblib.load(bundle["scaler_path"])
k_base   = int(bundle["k"])
recency  = bool(bundle.get("recency_weighting", True))
hl_days  = int(bundle.get("recency_half_life_days", 756))
seg_len  = int(bundle.get("recency_seg_len", 60))
n_segs   = int(bundle.get("recency_n_segments", 80))
tol      = float(bundle.get("tol", 1e-3))
n_iter   = int(bundle.get("n_iter", 200))         # TOCHANGE: 1000 for real run
n_init   = int(bundle.get("n_init", 2))           # TOCHANGE: 10 for real run
covtype  = bundle.get("covariance_type", "full")
lam_stick= float(bundle.get("sticky_lambda", 0.15))  # TOCHANGE: 0.30–0.50 for real run
rand0    = int(bundle.get("random_state", 42))

train_start = pd.to_datetime(MAN["window"]["train_start"])
train_end   = pd.to_datetime(MAN["window"]["train_end"])
test_start  = pd.to_datetime(MAN["window"]["test_start"])
test_end    = pd.to_datetime(MAN["window"]["test_end"])

mask_train = (mkt["date"] >= train_start) & (mkt["date"] <= train_end)
mask_test  = (mkt["date"] >= test_start)  & (mkt["date"] <= test_end)

# Helper: fit a local scaler on the train (or era) subset
from sklearn.preprocessing import StandardScaler

def _fit_local_scaler(feats: List[str], subset_mask: pd.Series) -> StandardScaler:
    df = mkt.loc[subset_mask, feats].dropna()
    scaler_local = StandardScaler()
    scaler_local.fit(df.to_numpy(dtype=float))
    return scaler_local

def _diag_sticky_blend(T: np.ndarray, lam: float) -> np.ndarray:
    k = T.shape[0]
    out = (1.0 - lam) * T + lam * np.eye(k)
    return out / out.sum(axis=1, keepdims=True)

def _build_time_decay_weights(dates: np.ndarray, half_life_days: int) -> np.ndarray:
    t = np.array([pd.Timestamp(d).toordinal() for d in dates], dtype=float)
    age = (t.max() - t)
    decay = np.log(2) / max(1, half_life_days)
    w = np.exp(-decay * age)
    return w / (w.sum() + 1e-12)

# Canonical recency-sampling params (align with 2.2 bundle keys)
REC_SEG_LEN    = int(bundle.get("recency_seg_len", 60))
REC_N_SEGMENTS = int(bundle.get("recency_n_segments", 80))
REC_HALF_LIFE  = int(bundle.get("recency_half_life_days", 756))
REC_EPS        = float(bundle.get("recency_epsilon_floor", 0.10))  # matches 2.2 key

def _sample_time_weighted_subsequences(
    X: np.ndarray,
    dates: np.ndarray,
    seg_len: int = REC_SEG_LEN,
    n_segments: int = REC_N_SEGMENTS,
    half_life_days: int = REC_HALF_LIFE,
    seed: int = 42,
):
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    if n < seg_len:
        return X.copy(), [n]
    ends = np.arange(seg_len - 1, n)
    p = _build_time_decay_weights(dates[ends], half_life_days)
    p = np.maximum(p, REC_EPS * p.max())
    p = p / p.sum()
    chosen = rng.choice(ends, size=min(n_segments, len(ends)), replace=True, p=p)
    chunks, lengths = [], []
    for e in chosen:
        s = e - (seg_len - 1)
        chunks.append(X[s:e+1])
        lengths.append(seg_len)
    return np.vstack(chunks), lengths

# _fit_hmm_for_features fits and returns a local scaler,
# and uses it for both training and scoring.
def _fit_hmm_for_features(feats: List[str], k: int, rs: int, subset_mask: pd.Series) -> Dict[str, Any]:
    # Fit local scaler on the subset (train or era) to avoid feature-count mismatch
    scaler_local = _fit_local_scaler(feats, subset_mask)

    df = mkt.loc[subset_mask, ["date"] + feats].dropna().reset_index(drop=True)
    dates = df["date"].to_numpy()
    X = scaler_local.transform(df[feats].to_numpy(dtype=float))

    if recency:
        X_fit, lengths = _sample_time_weighted_subsequences(
            X, dates, seg_len=seg_len, n_segments=n_segs, half_life_days=hl_days, seed=rs
        )
    else:
        X_fit, lengths = X, [len(X)]

    model = GaussianHMM(
        n_components=k,
        covariance_type=covtype,
        n_iter=n_iter,
        tol=tol,
        random_state=rs,
        verbose=False,
        init_params="mc",   # means, covars
        params="stmc",      # learn startprob/transmat as well
    )
    # sticky-ish init
    T0 = np.full((k, k), (1.0 - 0.90) / max(1, k - 1)); np.fill_diagonal(T0, 0.90)
    model.transmat_ = T0
    model.startprob_ = np.full(k, 1.0 / k)

    model.fit(X_fit, lengths=lengths)
    model.transmat_ = _diag_sticky_blend(model.transmat_, lam_stick)

    # score on original (non-sampled) sequence
    score = float(model.score(X))

    return {
        "model": model,
        "score": score,
        "dates": dates,
        "feats": feats,
        "scaler": scaler_local,  # returned so callers score with the same scaling
    }

# _profile_and_label takes the local scaler fit above
def _profile_and_label(model, feats: List[str], scaler_local: StandardScaler) -> Dict[str, Any]:
    X_all = scaler_local.transform(mkt[feats].to_numpy(dtype=float))
    post  = model.predict_proba(X_all)
    K = post.shape[1]

    # compute profiles on TRAIN ONLY (no peeking)
    train_ix = np.where(mask_train.values)[0]

    def wmean(x, w):
        w = np.asarray(w); x = np.asarray(x); s = w.sum()
        return float((x*w).sum()/s) if s>0 else np.nan
    def wstd(x, w):
        mu = wmean(x, w); w = np.asarray(w); x = np.asarray(x); s=w.sum()
        if s<=1: return np.nan
        return float(np.sqrt(max(((w*(x-mu)**2).sum()/s), 0.0)))
    def wq05(x, w):
        x=np.asarray(x); w=np.asarray(w); o=np.argsort(x); xs,ws=x[o],w[o]; cw=np.cumsum(ws)
        return float(xs[np.searchsorted(cw, 0.05*cw[-1])]) if cw[-1]>0 else np.nan

    prof = []
    for s in range(K):
        w = post[train_ix, s]
        prof.append({
            "state_id": s,
            "ret_mean": wmean(mkt.loc[mask_train,"spy_ret"].values, w),
            "ret_std":  wstd (mkt.loc[mask_train,"spy_ret"].values, w),
            "rv20_mean": wmean(mkt.loc[mask_train,"spy_rv_20"].values, w),
            "vix_mean":  wmean(mkt.loc[mask_train,"vix_close"].values, w),
            "dvix_mean": wmean(mkt.loc[mask_train,"dvix"].values, w) if "dvix" in mkt.columns else np.nan,
            "breadth_mean": wmean(mkt.loc[mask_train,"breadth"].values, w),
            "ret_q05":  wq05 (mkt.loc[mask_train,"spy_ret"].values, w),
        })
    prof_df = pd.DataFrame(prof)

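    risk_off_id = int(prof_df["rv20_mean"].idxmax())
    risk_on_id  = int(prof_df["ret_mean"].idxmax())
    if risk_on_id == risk_off_id and K > 1:
        # Same state wins both criteria: keep it as Risk-Off and give Risk-On to the
        # best-returning remaining state, so the dict below does not silently drop a label.
        risk_on_id = int(prof_df.drop(index=risk_off_id)["ret_mean"].idxmax())
    label_map = {risk_on_id: "Risk-On", risk_off_id: "Risk-Off"}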
    for s in range(K):
        if s not in label_map:
            label_map[s] = "Transition"

    return {
        "profiles": prof_df.to_dict(orient="records"),
        "label_map": {int(k): v for k,v in label_map.items()},
        "posteriors_shape": list(post.shape),
        "transmat": model.transmat_.tolist(),
    }

def _agreement_vs_baseline(new_states: np.ndarray, baseline_states: np.ndarray) -> float:
    # simple percent agreement
    if len(new_states) != len(baseline_states):
        return np.nan
    return float((new_states == baseline_states).mean())

# --- Load baseline label sequence (we'll compare to *smoothed* if present) ---
base_labels = pd.read_parquet(LABELS_PATH).sort_values("date")
base_labels["date"] = pd.to_datetime(base_labels["date"])  # ensure datetime dtype for the date merges below

if "state_id_smoothed" in base_labels.columns:
    base_states = base_labels["state_id_smoothed"].to_numpy()
else:
    base_states = base_labels["state_id"].to_numpy()

results: Dict[str, Any] = {"k_sensitivity": [], "feature_sensitivity": [], "era_stability": [], "bootstrap": {}}

def _score_states_on_valid_dates(model, feats, scaler_local):
    full_df = mkt[["date"] + feats].dropna().reset_index(drop=True)
    Xa = scaler_local.transform(full_df[feats].to_numpy(dtype=float))
    states = model.predict_proba(Xa).argmax(axis=1)
    return full_df["date"].to_numpy(), states

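# NOTE: this compares raw state ids. HMM state ids are arbitrary across refits
# (label switching), so low agreement can reflect a relabeling rather than a
# different segmentation; align ids or compare regime labels before interpreting.
def _agreement_on_intersection(dates_new, states_new, base_labels_df) -> float: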
    df_new = pd.DataFrame({"date": dates_new, "state_new": states_new})
    df_join = df_new.merge(
        base_labels_df[["date", "state_id_smoothed" if "state_id_smoothed" in base_labels_df.columns else "state_id"]]
        .rename(columns={"state_id_smoothed":"state_base","state_id":"state_base"}),
        on="date", how="inner"
    )
    if len(df_join) == 0:
        return np.nan
    return float((df_join["state_new"].to_numpy() == df_join["state_base"].to_numpy()).mean())

# 1) K sensitivity ------------------------------------------------------------
K_GRID = [2, 3]  # TOCHANGE: restore the full [2, 3] grid for the real run if narrowed for a quick pass
for k in K_GRID:
    best = {"score": -np.inf, "meta": None}
    for r in range(n_init):
        rs = rand0 + r
        fit = _fit_hmm_for_features(features_base, k, rs, mask_train)
        meta = _profile_and_label(fit["model"], features_base, fit["scaler"])
        dates_scored, states_all = _score_states_on_valid_dates(fit["model"], features_base, fit["scaler"])
        agree = _agreement_on_intersection(dates_scored, states_all, base_labels)

        entry = {
            "k": k, "seed": rs, "score": fit["score"], "agreement_vs_baseline": agree,
            "profiles": meta["profiles"], "label_map": meta["label_map"],
            "transmat": meta["transmat"]
        }
        if fit["score"] > best["score"]:
            best = {"score": fit["score"], "meta": entry}
    results["k_sensitivity"].append(best["meta"])

# 2) Feature sensitivity -------------------------------------------------------
# Define variants relative to baseline features
fsets = []
fsets.append(("baseline", features_base))
if "vix_close" in features_base: fsets.append(("no_vix", [f for f in features_base if f!="vix_close"]))
if "breadth"   in features_base: fsets.append(("no_breadth", [f for f in features_base if f!="breadth"]))
if "dvix"      in features_base: fsets.append(("no_dvix", [f for f in features_base if f!="dvix"]))
# minimal core set
core = [f for f in ["spy_rv_20","vix_close"] if f in mkt.columns]
if core: fsets.append(("core_rv_vix", core))

for name, feats in fsets:
    k = k_base
    best = {"score": -np.inf, "meta": None}
    for r in range(n_init):
        rs = rand0 + 100 + r
        fit = _fit_hmm_for_features(feats, k, rs, mask_train)
        meta = _profile_and_label(fit["model"], feats, fit["scaler"])
        dates_scored, states_all = _score_states_on_valid_dates(fit["model"], feats, fit["scaler"])
        agree = _agreement_on_intersection(dates_scored, states_all, base_labels)
        entry = {
            "feature_set": name, "k": k, "seed": rs, "feats": feats,
            "score": fit["score"], "agreement_vs_baseline": agree,
            "profiles": meta["profiles"], "label_map": meta["label_map"],
            "transmat": meta["transmat"]
        }
        if fit["score"] > best["score"]:
            best = {"score": fit["score"], "meta": entry}
    results["feature_sensitivity"].append(best["meta"])


# 3) Era stability -------------------------------------------------------------
def _fit_on_era(start: str, end: str, k: int, seed: int, feats: List[str], name: str) -> Dict[str, Any]:
    mask = (mkt["date"] >= pd.to_datetime(start)) & (mkt["date"] <= pd.to_datetime(end))
    fit  = _fit_hmm_for_features(feats, k, seed, mask)
    meta = _profile_and_label(fit["model"], feats, fit["scaler"])
    return {
        "era": name, "k": k, "seed": seed, "score": fit["score"],
        "profiles": meta["profiles"], "label_map": meta["label_map"],
        "transmat": meta["transmat"], "start": str(start), "end": str(end)
    }



eras = [
    ("pre_2015",  "2007-02-06", "2014-12-31"),
    ("post_2015", "2015-01-01", str(train_end.date())),
    ("crisis_2020","2020-02-15","2020-12-31"),
]
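for i, (name, s, e) in enumerate(eras):
    # Deterministic per-era seed: built-in hash() of a str is salted per process
    # (PYTHONHASHSEED), which would make these fits non-reproducible across runs.
    rs = rand0 + 200 + i
    results["era_stability"].append(_fit_on_era(s, e, k_base, rs, features_base, name))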


# 4) Bootstrap (block) ---------------------------------------------------------
def _block_bootstrap_indices(n: int, block: int, n_blocks: int, rng: np.random.RandomState):
    starts = rng.randint(0, n, size=n_blocks)
    idx = []
    for st in starts:
        idx.extend([(st + j) % n for j in range(block)])
    return np.array(idx[:n])

BOOT_REPS   = 5    # TOCHANGE: 100–300 for real run
BLOCK_DAYS  = 20   # TOCHANGE: 20–60 depending on serial corr
rng = np.random.RandomState(rand0+999)

# Build train matrix for bootstrap using local scaler
df_tr = mkt.loc[mask_train, ["date"] + features_base].dropna().reset_index(drop=True)
scaler_base_local = StandardScaler().fit(df_tr[features_base].to_numpy(dtype=float))
X_tr  = scaler_base_local.transform(df_tr[features_base].to_numpy(dtype=float))
dt_tr = df_tr["date"].to_numpy()

boot_summ = {"k": k_base, "reps": BOOT_REPS, "block_days": BLOCK_DAYS, "agreement_vs_baseline": []}
for b in range(BOOT_REPS):
    idx = _block_bootstrap_indices(len(dt_tr), BLOCK_DAYS, max(1, len(dt_tr)//BLOCK_DAYS), rng)
    Xb  = X_tr[idx]

    model = GaussianHMM(
        n_components=k_base, covariance_type=covtype, n_iter=n_iter, tol=tol,
        random_state=rand0 + 500 + b, verbose=False, init_params="mc", params="stmc"
    )
    T0 = np.full((k_base, k_base), (1.0 - 0.90) / max(1, k_base - 1)); np.fill_diagonal(T0, 0.90)
    model.transmat_ = T0; model.startprob_ = np.full(k_base, 1.0 / k_base)
    model.fit(Xb, lengths=[len(Xb)])
    model.transmat_ = _diag_sticky_blend(model.transmat_, lam_stick)

    # Score on valid dates & align to baseline
    dates_scored, states_all = _score_states_on_valid_dates(model, features_base, scaler_base_local)
    agree = _agreement_on_intersection(dates_scored, states_all, base_labels)
    boot_summ["agreement_vs_baseline"].append(agree)

boot_summ["agreement_mean"] = float(np.mean(boot_summ["agreement_vs_baseline"]))
boot_summ["agreement_std"]  = float(np.std (boot_summ["agreement_vs_baseline"]))
results["bootstrap"] = boot_summ

# Save results
with open(OUT_PATH, "w") as f:
    json.dump({
        "created_at": datetime.utcnow().isoformat() + "Z",
        "inputs": {
            "features_base": features_base,
            "k_base": k_base,
            "recency_weighting": recency,
            "half_life_days": hl_days,
            "lam_stick": lam_stick,
            "n_iter": n_iter, "n_init": n_init, "tol": tol, "covariance_type": covtype
        },
        "results": results
    }, f, indent=2)

print(json.dumps({
    "status": "2.5 done",
    "out": OUT_PATH,
    "k_choices": K_GRID,  # report the grid actually used instead of a hard-coded list
    "feature_sets_tested": [fs[0] for fs in fsets],
    "bootstrap": {"reps": BOOT_REPS, "agreement_mean": boot_summ["agreement_mean"], "agreement_std": boot_summ["agreement_std"]},
}, indent=2))


{
  "status": "2.5 done",
  "out": "artifacts/regimes/regime_sensitivity.json",
  "k_choices": [
    2,
    3
  ],
  "feature_sets_tested": [
    "baseline",
    "no_vix",
    "no_breadth",
    "no_dvix",
    "core_rv_vix"
  ],
  "bootstrap": {
    "reps": 5,
    "agreement_mean": 0.531172176899957,
    "agreement_std": 0.247859990522504
  }
}
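
The agreement figures in 2.5 compare raw state ids between a refit model and the baseline. HMM state ids are arbitrary across refits, so a raw comparison can report low agreement even when the two segmentations are identical up to relabeling. Whether that accounts for the bootstrap agreement reported above is not established here. The sketch below shows one possible permutation-invariant alternative; `aligned_agreement` and the toy arrays are hypothetical and not part of the notebook.

```python
from itertools import permutations

import numpy as np

def aligned_agreement(new_states, base_states, n_states: int) -> float:
    """Best fraction of matching days over all one-to-one relabelings of new_states.

    n_states should cover the ids used by both sequences (e.g. max of the two K values).
    Enumerating permutations costs n_states!, which is fine for K <= 3.
    """
    new_states = np.asarray(new_states)
    base_states = np.asarray(base_states)
    best = 0.0
    for perm in permutations(range(n_states)):
        relabeled = np.asarray(perm)[new_states]   # apply the candidate relabeling
        best = max(best, float((relabeled == base_states).mean()))
    return best

# Toy example: identical segmentation, but the refit swapped ids 0 and 1.
toy_base = np.array([0, 0, 0, 1, 1, 0, 0, 1])
toy_new = 1 - toy_base

raw = float((toy_new == toy_base).mean())
aligned = aligned_agreement(toy_new, toy_base, n_states=2)
print(raw, aligned)
```

In the toy case the raw comparison reports no agreement while the aligned one reports full agreement. An alternative with the same intent is to map each model's states to regime labels first and compare the label sequences.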


In [None]:
# ============================================================
# Section 2.6 - Diagnostics & QA
# Plots:
#   • Timeline with regime shading over SPY price (rebased) & drawdown
#   • Posterior probabilities (stacked area)
#   • State return histograms, QQ plots
#   • Transition matrix heatmap, dwell-time distribution
# Tables:
#   • State profiles (load from 2.3), transition matrix & steady-state
#   • Switch frequency & chattering metrics
# Alerts:
#   • Inconsistent semantics (e.g., positive mean but highest vol)
#   • Very short dwell (median < 3d)
#   • Mapping flips / excessive chattering
# Reuses (do not recompute):
#   - artifacts/regimes/market_panel.parquet (2.0)
#   - artifacts/regimes/window_manifest.json (2.1)
#   - artifacts/regimes/regime_hmm.pkl (2.2)
#   - artifacts/regimes/regime_labels.parquet (2.3/2.4)
#   - artifacts/regimes/state_profiles.csv (2.3)
#   - artifacts/regimes/regime_meta.json (2.3/2.4)
# Outputs:
#   - artifacts/regimes/diagnostics/*.png
#   - artifacts/regimes/diagnostics/*.csv / *.json
#   - console summary + alerts
# Notes:
#   - #TOCHANGE marks spots to bump for real run (heavier plots/points)
# ============================================================

from __future__ import annotations
import os, json
from typing import Dict, Any, List, Tuple
import numpy as np
import pandas as pd
import joblib
import matplotlib.pyplot as plt

REGIME_DIR  = CFG.regime_dir
DIAG_DIR    = os.path.join(REGIME_DIR, "diagnostics")
os.makedirs(DIAG_DIR, exist_ok=True)

PANEL_PATH  = os.path.join(REGIME_DIR, "market_panel.parquet")
MAN_PATH    = os.path.join(REGIME_DIR, "window_manifest.json")
BUNDLE_PATH = os.path.join(REGIME_DIR, "regime_hmm.pkl")
LABELS_PATH = os.path.join(REGIME_DIR, "regime_labels.parquet")
META_PATH   = os.path.join(REGIME_DIR, "regime_meta.json")
PROF_PATH   = os.path.join(REGIME_DIR, "state_profiles.csv")

# --- Load artifacts
assert os.path.exists(PANEL_PATH) and os.path.exists(MAN_PATH) and os.path.exists(BUNDLE_PATH) and os.path.exists(LABELS_PATH)

mkt = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
mkt["date"] = pd.to_datetime(mkt["date"])
labels = pd.read_parquet(LABELS_PATH).sort_values("date").reset_index(drop=True)
labels["date"] = pd.to_datetime(labels["date"])

with open(MAN_PATH, "r") as f:
    MAN = json.load(f)

bundle = joblib.load(BUNDLE_PATH)
features = bundle["features"]
k = int(bundle["k"])

meta = {}
if os.path.exists(META_PATH):
    with open(META_PATH, "r") as f:
        meta = json.load(f)

# --- Derived / convenience
date_equal = np.array_equal(mkt["date"].values, labels["date"].values)
if not date_equal:
    # Align by inner-join on date (some rows may be dropped if any side had NA)
    labels = labels.merge(mkt[["date"]], on="date", how="inner").sort_values("date").reset_index(drop=True)
    mkt    = mkt.merge(labels[["date"]], on="date", how="inner").sort_values("date").reset_index(drop=True)

# Prefer smoothed series if present
state_col  = "state_id_smoothed" if "state_id_smoothed" in labels.columns else "state_id"
label_col  = "regime_label_smoothed" if "regime_label_smoothed" in labels.columns else "regime_label"

# K from posterior columns p0..p{K-1} (robust to re-fits); match only "p<digits>" names
p_cols = [c for c in labels.columns if c.startswith("p") and c[1:].isdigit()]
K = len(p_cols) if len(p_cols) > 0 else k

# --- Helpers
def _runs(series: np.ndarray) -> List[Tuple[int,int,int]]:
    """Return (start_idx, end_idx, value) runs for integer state series."""
    out = []
    s = 0; cur = series[0]
    for i in range(1, len(series)):
        if series[i] != cur:
            out.append((s, i-1, int(cur)))
            s = i; cur = series[i]
    out.append((s, len(series)-1, int(cur)))
    return out

def _transition_matrix(states: np.ndarray, K: int) -> np.ndarray:
    T = np.zeros((K, K), dtype=float)
    for i in range(len(states)-1):
        T[states[i], states[i+1]] += 1.0
    row_sums = T.sum(axis=1, keepdims=True)
    row_sums[row_sums==0] = 1.0
    return T / row_sums

def _steady_state(T: np.ndarray) -> np.ndarray:
    # Empirical steady-state as left eigenvector (falls back to uniform if that fails)
    try:
        vals, vecs = np.linalg.eig(T.T)
        i = np.argmin(np.abs(vals - 1.0))
        v = np.real(vecs[:, i])
        if v.sum() < 0:
            v = -v  # eigenvectors are defined only up to sign
        v = np.maximum(v, 0)
        if v.sum() == 0: raise ValueError
        return v / v.sum()
    except Exception:
        return np.ones(T.shape[0]) / T.shape[0]

def _rebase_price_from_returns(rets: np.ndarray, start=100.0) -> np.ndarray:
    # Assumes rets are simple daily returns (e.g., spy_ret). If logrets, replace with exp(cumsum).
    out = np.empty_like(rets, dtype=float); out[0] = start * (1.0 + np.nan_to_num(rets[0], nan=0.0))
    for i in range(1, len(rets)):
        out[i] = out[i-1] * (1.0 + np.nan_to_num(rets[i], nan=0.0))
    return out

def _drawdown(price: np.ndarray) -> np.ndarray:
    cummax = np.maximum.accumulate(price)
    dd = price / np.where(cummax==0, 1.0, cummax) - 1.0
    return dd

# --- Compute core diagnostics
states = labels[state_col].to_numpy(dtype=int)
Tmat   = _transition_matrix(states, K)
ss_emp = _steady_state(Tmat)
runs   = _runs(states)
dwell  = pd.DataFrame({
    "state_id": [st for (s,e,st) in runs],
    "run_len":  [e-s+1 for (s,e,st) in runs],
})

# --- Switch/chattering metrics
switches = (states[1:] != states[:-1]).sum()
switch_rate = switches / max(1, len(states)-1)
one_day_runs = (dwell["run_len"] == 1).mean()  # fraction single-day
lt3_runs = (dwell["run_len"] < 3).mean()

# --- Load state profiles (2.3) if present, else compute a rough unweighted fallback on all data
profiles_df = None
if os.path.exists(PROF_PATH):
    profiles_df = pd.read_csv(PROF_PATH)
else:
    # Fallback: rough unweighted per-state profiles on all data (not ideal, but avoids recompute)
    tmp = []
    for s in range(K):
        mask = (states == s)
        tmp.append({
            "state_id": s,
            "ret_mean": float(np.nanmean(mkt.loc[mask,"spy_ret"])),
            "ret_std":  float(np.nanstd (mkt.loc[mask,"spy_ret"])),
            "rv20_mean": float(np.nanmean(mkt.loc[mask,"spy_rv_20"])),
            "vix_mean":  float(np.nanmean(mkt.loc[mask,"vix_close"])),
            "dvix_mean": float(np.nanmean(mkt.loc[mask,"dvix"])) if "dvix" in mkt.columns else np.nan,
            "breadth_mean": float(np.nanmean(mkt.loc[mask,"breadth"])),
            "ret_q05":  float(np.nanquantile(mkt.loc[mask,"spy_ret"], 0.05)),
        })
    profiles_df = pd.DataFrame(tmp)
profiles_df.to_csv(os.path.join(DIAG_DIR, "state_profiles_table.csv"), index=False)

# --- Transition matrix & steady-state tables
pd.DataFrame(Tmat, columns=[f"to_{i}" for i in range(K)], index=[f"from_{i}" for i in range(K)]) \
  .to_csv(os.path.join(DIAG_DIR, "transition_matrix.csv"))
pd.DataFrame({"state_id": list(range(K)), "steady_state_prob": ss_emp}) \
  .to_csv(os.path.join(DIAG_DIR, "steady_state.csv"), index=False)

# --- Switch frequency table (yearly)
lab = labels[["date", state_col]].copy()
lab["year"] = lab["date"].dt.year
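# Compare each day with the previous one. The first row has no predecessor, so it is
# not a switch (shift(-1) would flag the last row, because NaN != state is True).
lab["sw"] = (lab[state_col] != lab[state_col].shift(1)).astype(int)
lab.loc[lab.index[0], "sw"] = 0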
switch_by_year = lab.groupby("year")["sw"].sum().reset_index().rename(columns={"sw":"n_switches"})
switch_by_year.to_csv(os.path.join(DIAG_DIR, "switches_by_year.csv"), index=False)

# ============================================================
# PLOTS
# ============================================================

# 1) Price timeline with regime shading & drawdown
spy_price = _rebase_price_from_returns(mkt["spy_ret"].to_numpy(dtype=float), start=100.0)
spy_dd    = _drawdown(spy_price)

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(mkt["date"], spy_price, lw=1.25)
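# Shade by regime: one color per state (default color cycle), each state labeled once
seen_states = set()
for (s,e,st) in runs:
    # extend the span to the next run's first date so single-day runs have visible width
    span_end = mkt["date"].iloc[min(e + 1, len(mkt) - 1)]
    ax.axvspan(mkt["date"].iloc[s], span_end, alpha=0.15, color=f"C{st % 10}",
               label=None if st in seen_states else f"State {st}")
    seen_states.add(st)
ax.legend(loc="upper left")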
ax.set_title("SPY (rebased) with Regime Shading")
ax.set_xlabel("Date"); ax.set_ylabel("Rebased Price")
fig.tight_layout()
fig.savefig(os.path.join(DIAG_DIR, "regime_timeline.png"), dpi=150)
plt.close(fig)

fig, ax = plt.subplots(figsize=(12, 3))
ax.plot(mkt["date"], spy_dd, lw=1.0)
ax.set_title("SPY Drawdown")
ax.set_xlabel("Date"); ax.set_ylabel("Drawdown")
fig.tight_layout()
fig.savefig(os.path.join(DIAG_DIR, "timeline_drawdown.png"), dpi=150)
plt.close(fig)

# 2) Posterior probabilities (stacked area)
if len(p_cols) == K:
    fig, ax = plt.subplots(figsize=(12, 4))
    ax.stackplot(labels["date"], *(labels[c].to_numpy() for c in p_cols))
    ax.set_ylim(0,1); ax.set_title("Posterior Probabilities (Stacked)")
    ax.set_xlabel("Date"); ax.set_ylabel("Probability")
    fig.tight_layout()
    fig.savefig(os.path.join(DIAG_DIR, "regime_posteriors.png"), dpi=150)
    plt.close(fig)


# (erfinv helper without SciPy)
def erfinv(y):
    # Approximation (Winitzki) good enough for QQ visual; replace with SciPy in real run
    a = 0.147
    sign = np.sign(y)
    x = np.clip(y, -0.999999, 0.999999)
    ln = np.log(1 - x**2)
    first = 2/(np.pi*a) + ln/2
    return sign * np.sqrt( np.sqrt(first**2 - ln/a) - first )

# 3) State return histograms + QQ plots
# TOCHANGE: bump N_QQ_POINTS to 1000 for real run
N_QQ_POINTS = 200
qs = np.linspace(0.01, 0.99, N_QQ_POINTS)
for s in range(K):
    mask = (states == s)
    r = mkt.loc[mask, "spy_ret"].dropna().to_numpy()
    if len(r) == 0:
        continue

    # Histogram
    fig, ax = plt.subplots(figsize=(5,3))
    ax.hist(r, bins=40, alpha=0.8)  # #TOCHANGE: 80 bins for real run
    ax.set_title(f"State {s} return histogram")
    fig.tight_layout()
    fig.savefig(os.path.join(DIAG_DIR, f"state_{s}_ret_hist.png"), dpi=150)
    plt.close(fig)

    # QQ vs normal
    mu, sd = float(np.mean(r)), float(np.std(r, ddof=0))
    if sd <= 0:
        sd = 1e-9
    emp_q = np.quantile(r, qs)
    nor_q = mu + sd * np.sqrt(2) * erfinv(2*qs - 1)  # inverse CDF via erfinv
    fig, ax = plt.subplots(figsize=(5,3))
    ax.scatter(nor_q, emp_q, s=6, alpha=0.7)
    lims = [min(nor_q.min(), emp_q.min()), max(nor_q.max(), emp_q.max())]
    ax.plot(lims, lims, lw=1.0)
    ax.set_title(f"State {s} QQ vs Normal")
    ax.set_xlabel("Theoretical quantiles"); ax.set_ylabel("Empirical quantiles")
    fig.tight_layout()
    fig.savefig(os.path.join(DIAG_DIR, f"state_{s}_qq.png"), dpi=150)
    plt.close(fig)


# 4) Transition heatmap
fig, ax = plt.subplots(figsize=(5,4))
im = ax.imshow(Tmat, aspect="auto", vmin=0, vmax=np.max(Tmat))
ax.set_title("Transition Matrix")
ax.set_xlabel("to"); ax.set_ylabel("from")
ax.set_xticks(range(K)); ax.set_yticks(range(K))
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
fig.tight_layout()
fig.savefig(os.path.join(DIAG_DIR, "transition_matrix_heatmap.png"), dpi=150)
plt.close(fig)

# 5) Dwell-time distribution per state
fig, ax = plt.subplots(figsize=(6,4))
for s in range(K):
    ax.hist(dwell.loc[dwell["state_id"]==s, "run_len"], bins=range(1,51), alpha=0.6, label=f"State {s}")
ax.legend()
ax.set_title("Dwell-time distribution (days)")
ax.set_xlabel("Run length (days)"); ax.set_ylabel("Count")
fig.tight_layout()
fig.savefig(os.path.join(DIAG_DIR, "dwell_time_distribution.png"), dpi=150)
plt.close(fig)

# ============================================================
# ALERTS
# ============================================================
alerts = []

# Semantics: positive mean but highest vol -> suspicious "Risk-On"
if profiles_df is not None and len(profiles_df) >= K:
    vol_rank = profiles_df["rv20_mean"].rank(ascending=True)  # 1 = lowest vol
    hi_vol_state = int(profiles_df.loc[profiles_df["rv20_mean"].idxmax(), "state_id"])
    pos_mean_states = profiles_df.loc[profiles_df["ret_mean"] > 0, "state_id"].astype(int).tolist()
    if hi_vol_state in pos_mean_states and K >= 2:
        alerts.append(f"State {hi_vol_state}: positive mean return but highest realized vol (check semantics).")

# Dwell-time < 3 days median
dwell_median = dwell.groupby("state_id")["run_len"].median()
for s, med in dwell_median.items():
    if med < 3:
        alerts.append(f"State {s}: median dwell {med}d < 3 (too chatty).")

# Chattering: high switch rate or many single-day runs
if switch_rate > 0.15:  # #TOCHANGE: tighten to 0.10 for real run
    alerts.append(f"High switch rate: {switch_rate:.2%}")
if one_day_runs > 0.10:  # #TOCHANGE: tighten to 0.05 for real run
    alerts.append(f"Single-day run fraction elevated: {one_day_runs:.2%}")

# Mapping flips heuristic: compare label continuity around major drawdowns
# (simple heuristic: if label changes >3 times within any 20-day window)
# #TOCHANGE: widen window to 60 days for real run
WINDOW = 20
roll_switch = pd.Series((states[1:] != states[:-1]).astype(int)).rolling(WINDOW).sum().fillna(0)
if (roll_switch > 3).any():
    alerts.append("Frequent label flips in short windows (potential mapping instability).")

# Save alerts
with open(os.path.join(DIAG_DIR, "alerts.json"), "w") as f:
    json.dump({"alerts": alerts}, f, indent=2)

# Save a compact summary CSV
pd.DataFrame({
    "metric": ["K", "switches", "switch_rate", "one_day_runs_frac", "lt3_runs_frac"],
    "value": [K, switches, switch_rate, one_day_runs, lt3_runs]
}).to_csv(os.path.join(DIAG_DIR, "summary_metrics.csv"), index=False)

print(json.dumps({
    "status": "2.6 diagnostics complete",
    "plots_dir": DIAG_DIR,
    "alerts_count": len(alerts),
    "notes": [
        "State profiles loaded from 2.3 if available; else quick fallback was used.",
        "Semantics checks are heuristics; confirm with 2.3 profiles and 2.5 sensitivity."
    ]
}, indent=2))

{
  "status": "2.6 diagnostics complete",
  "plots_dir": "artifacts/regimes/diagnostics",
  "alerts_count": 1,
  "notes": [
    "State profiles loaded from 2.3 if available; else quick fallback was used.",
    "Semantics checks are heuristics; confirm with 2.3 profiles and 2.5 sensitivity."
  ]
}
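
The transition, steady-state, and dwell diagnostics from 2.6 can be sanity-checked on a toy state path. The following is a minimal sketch, not the notebook's own helpers: the function names and the toy sequence are illustrative assumptions, and the expected dwell time 1 / (1 - T[i, i]) assumes the state path behaves like a first-order Markov chain.

```python
import numpy as np

def transition_matrix(states: np.ndarray, k: int) -> np.ndarray:
    """Row-normalized transition counts; rows for unvisited states stay all-zero."""
    counts = np.zeros((k, k), dtype=float)
    np.add.at(counts, (states[:-1], states[1:]), 1.0)  # count i -> j moves
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

def stationary_distribution(T: np.ndarray) -> np.ndarray:
    """Left eigenvector of T for the eigenvalue closest to 1, normalized to sum to 1."""
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    v = np.abs(v)  # eigenvector sign is arbitrary; for an irreducible chain all entries share one sign
    return v / v.sum()

def expected_dwell(T: np.ndarray) -> np.ndarray:
    """Mean run length implied by the self-transition probability (geometric dwell)."""
    stay = np.clip(np.diag(T), 0.0, 1.0 - 1e-12)
    return 1.0 / (1.0 - stay)

# Toy path (illustrative only): three states, ten transitions
toy_states = np.array([0, 0, 0, 1, 1, 0, 0, 2, 2, 2, 0])
T_toy = transition_matrix(toy_states, k=3)
pi_toy = stationary_distribution(T_toy)
dwell_toy = expected_dwell(T_toy)
print(np.round(T_toy, 3))
print(np.round(pi_toy, 3), np.round(dwell_toy, 2))
```

Comparing the implied dwell times with the empirical run-length medians is one quick way to see whether the "too chatty" alert reflects the fitted dynamics or only a few noisy blips.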


In [None]:
# ============================================================
# Section 2.7 - Regime-Aware Policy Hooks (Interfaces to Sec 3–5)
# Reuses:
#   - artifacts/regimes/regime_labels.parquet (2.3/2.4)
#   - artifacts/regimes/regime_meta.json (2.3/2.4)
#   - artifacts/regimes/regime_hmm.pkl (2.2)  [fallback if p-cols missing]
#   - artifacts/regimes/window_manifest.json (2.1)
#   - artifacts/regimes/market_panel.parquet (2.0)  [fallback scoring]
# Outputs:
#   - artifacts/regimes/regime_policy_map.json
# Notes:
#   - This file is the single interface consumed by Sections 3–5.
#   - #TOCHANGE marks values to tune for the real run.
# ============================================================

from __future__ import annotations
import os, json, hashlib
from typing import Dict, Any
import numpy as np
import pandas as pd
import joblib

REGIME_DIR  = CFG.regime_dir
PANEL_PATH  = os.path.join(REGIME_DIR, "market_panel.parquet")
LABELS_PATH = os.path.join(REGIME_DIR, "regime_labels.parquet")
META_PATH   = os.path.join(REGIME_DIR, "regime_meta.json")
MAN_PATH    = os.path.join(REGIME_DIR, "window_manifest.json")
BUNDLE_PATH = os.path.join(REGIME_DIR, "regime_hmm.pkl")
OUT_PATH    = os.path.join(REGIME_DIR, "regime_policy_map.json")

# --- Load essentials
labels = pd.read_parquet(LABELS_PATH).sort_values("date").reset_index(drop=True)
labels["date"] = pd.to_datetime(labels["date"])
with open(MAN_PATH, "r") as f: MAN = json.load(f)
bundle = joblib.load(BUNDLE_PATH)

# state→label semantics
state_label_map = None
if os.path.exists(META_PATH):
    with open(META_PATH, "r") as f:
        meta = json.load(f)
    state_label_map = meta.get("state_label_map", None)
else:
    meta = {}

# infer K and get posteriors (match only "p<digits>" so other "p..." columns are ignored)
p_cols = [c for c in labels.columns if c.startswith("p") and c[1:].isdigit()]
K = len(p_cols) if p_cols else int(bundle["k"])

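# fallback: if no p-cols, score from model on all dates and align to labels by date
if not p_cols:
    feats = bundle["features"]
    # the bundle may carry the scaler object itself (as in 2.8) or only a path to it
    scaler = bundle["scaler"] if "scaler" in bundle else joblib.load(bundle["scaler_path"])
    mkt = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
    mkt["date"] = pd.to_datetime(mkt["date"])
    mkt = mkt.dropna(subset=feats).reset_index(drop=True)  # the HMM cannot score NaN rows
    X_all = scaler.transform(mkt[feats].to_numpy(dtype=float))
    post = bundle["model"].predict_proba(X_all)
    p_cols = [f"p{s}" for s in range(post.shape[1])]
    post_df = pd.DataFrame(post, columns=p_cols)
    post_df.insert(0, "date", mkt["date"].values)
    # join on date: a positional assignment breaks when labels and panel differ in length
    labels = labels.merge(post_df, on="date", how="left")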

# choose smoothed ids/labels if available
state_col = "state_id_smoothed" if "state_id_smoothed" in labels.columns else "state_id"
label_col = "regime_label_smoothed" if "regime_label_smoothed" in labels.columns else "regime_label"

# if meta has mapping but label_col missing, map on the fly
if label_col not in labels.columns and state_label_map is not None:
    labels[label_col] = labels[state_col].map({int(k): v for k, v in state_label_map.items()})

# --- Confidence proxies
def entropy(p: np.ndarray) -> float:
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p*np.log(p)).sum() / np.log(len(p)))  # normalized to [0,1]

def aggressiveness_from_confidence(p: np.ndarray) -> Dict[str, float]:
    # Proxy 1: max posterior
    c_max = float(p.max())
    # Proxy 2: 1 - normalized entropy (higher -> more certain)
    c_ent = 1.0 - entropy(p)
    # Combine (simple mean)  #TOCHANGE: use weighted combo or monotone spline
    c = 0.5 * (c_max + c_ent)
    # Map to aggressiveness scalar g ∈ [g_min, g_max]
    g_min, g_max = 0.35, 1.00     #TOCHANGE: (0.25,1.00) if you want deeper throttling
    g = g_min + (g_max - g_min) * c
    return {"c_max": c_max, "c_entropy": c_ent, "c": c, "g": g}

# --- Latest regime & confidence (optionally smooth over last N days)
#TOCHANGE: set N_SMOOTH=5–10 for prod; 1 for fast test
N_SMOOTH = 3
tail = labels.tail(N_SMOOTH)
p_tail = tail[p_cols].to_numpy(dtype=float)
p_mean = p_tail.mean(axis=0)
latest_row = labels.iloc[-1]
latest_label = str(latest_row[label_col]) if label_col in labels.columns else f"State{int(latest_row[state_col])}"
conf = aggressiveness_from_confidence(p_mean)

# --- Per-regime policy defaults (edit for your stack)
# Use intuitive names; downstream can match by these labels
# Turnover caps are relative (e.g., fraction of portfolio eligible to trade)
policy_by_regime: Dict[str, Dict[str, Any]] = {
    "Risk-On": {
        "weights_multipliers": {          #TOCHANGE: tailor to your factors
            "momentum": 1.20,
            "quality":  1.00,
            "value":    1.00,
            "low_vol":  0.85,
        },
        "turnover_cap": 0.20,             #TOCHANGE: 0.25
        "risk_target_vol_annual": 0.10,   # 10%
        "hedge_intensity": 0.0,           # baseline hedge ratio
    },
    "Transition": {
        "weights_multipliers": {
            "momentum": 0.95,
            "quality":  1.05,
            "value":    1.05,
            "low_vol":  1.05,
        },
        "turnover_cap": 0.15,
        "risk_target_vol_annual": 0.08,   # 8%
        "hedge_intensity": 0.15,
    },
    "Risk-Off": {
        "weights_multipliers": {
            "momentum": 0.70,             # throttle momo
            "quality":  1.15,             # upweight quality/defensive
            "value":    1.05,
            "low_vol":  1.25,
        },
        "turnover_cap": 0.10,
        "risk_target_vol_annual": 0.06,   # 6%
        "hedge_intensity": 0.35,
    },
}

# --- If our label universe differs (e.g., only 2 states), drop policies for absent labels
present_labels = set(labels[label_col].dropna().astype(str).unique()) if label_col in labels.columns else set()
for lbl in list(policy_by_regime.keys()):
    if lbl not in present_labels and present_labels:
        # no fallback mapping is applied here; the unused policy is simply removed
        # #TOCHANGE: make explicit mapping per run
        del policy_by_regime[lbl]
# If states are only numeric (no semantic labels), synthesize keys
if not policy_by_regime and state_label_map is None:
    unique_states = sorted(labels[state_col].unique())
    for s in unique_states:
        policy_by_regime[f"State{s}"] = {
            "weights_multipliers": {"momentum":1.0,"quality":1.0,"value":1.0,"low_vol":1.0},
            "turnover_cap": 0.15, "risk_target_vol_annual": 0.08, "hedge_intensity": 0.15,
        }

# --- Global scaling by confidence g (downstream can apply this linearly)
# We expose both the raw confidence and recommend common scalings.
scaling = {
    "aggressiveness_scalar_g": conf["g"],
    "confidence": conf,                           # contains c_max, c_entropy, c (combined)
    "recommendations": {
        # Downstream usage suggestions
        "scale_position_sizes_by_g": True,
        "scale_turnover_cap_by_g": True,
        "scale_hedge_intensity_by_(1-g)": True
    }
}

# --- Package the full map
out = {
    # Timestamp is tz-aware, so isoformat() + "Z" would yield a malformed "+00:00Z" suffix
    "created_at": pd.Timestamp.now(tz="UTC").strftime("%Y-%m-%dT%H:%M:%SZ"),
    "latest_date": str(latest_row["date"].date()),
    "k": int(K),
    "latest_regime_label": latest_label,
    "latest_state_id": int(latest_row[state_col]),
    "latest_posteriors": {f"p{s}": float(latest_row.get(f"p{s}", np.nan)) for s in range(K)},
    "confidence": scaling,
    "policy_by_regime": policy_by_regime,
    "inputs": {
        "labels_path": LABELS_PATH,
        "meta_path": META_PATH,
        "bundle_path": BUNDLE_PATH,
        # tolerate manifests without a scaler path (e.g., scaler stored inside the bundle)
        "scaler_path": MAN.get("scaler_path", bundle.get("scaler_path")),
        "features": bundle["features"],
        "window": MAN.get("window", {}),
        "smoothing_window_days": N_SMOOTH,  #TOCHANGE
    },
}

# include sensitivity & diagnostics pointers if present
sens_path = os.path.join(REGIME_DIR, "regime_sensitivity.json")
diag_dir  = os.path.join(REGIME_DIR, "diagnostics")
if os.path.exists(sens_path):
    out["inputs"]["sensitivity_path"] = sens_path
if os.path.isdir(diag_dir):
    out["inputs"]["diagnostics_dir"] = diag_dir

# hash a minimal signature (useful for caching / auditing)
sig = hashlib.sha256(json.dumps({
    "features": out["inputs"]["features"],
    "window": out["inputs"]["window"],
    "k": out["k"]
}, sort_keys=True).encode()).hexdigest()
out["signature"] = sig

with open(OUT_PATH, "w") as f:
    json.dump(out, f, indent=2)

print(json.dumps({
    "status": "2.7 policy hooks exported",
    "out": OUT_PATH,
    "latest_label": latest_label,
    "g_scalar": round(out["confidence"]["aggressiveness_scalar_g"], 4),
}, indent=2))

{
  "status": "2.7 policy hooks exported",
  "out": "artifacts/regimes/regime_policy_map.json",
  "latest_label": "Risk-On",
  "g_scalar": 0.9681
}
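
A downstream consumer of `regime_policy_map.json` might apply the exported scalar `g` roughly as follows. This is a hedged sketch only: `apply_policy`, the equal base weights, and the renormalization of tilted factor weights are illustrative assumptions, not the interface Sections 3–5 actually implement. The confidence mapping mirrors `aggressiveness_from_confidence` above, and hedge scaling is left out because the map only recommends it without fixing a formula.

```python
import numpy as np

def normalized_entropy(p) -> float:
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def aggressiveness(p, g_min: float = 0.35, g_max: float = 1.00) -> float:
    """Mean of max-posterior and (1 - entropy), mapped linearly to [g_min, g_max]."""
    p = np.asarray(p, dtype=float)
    c = 0.5 * (float(p.max()) + (1.0 - normalized_entropy(p)))
    return g_min + (g_max - g_min) * c

def apply_policy(base_weights: dict, policy: dict, g: float) -> dict:
    """Hypothetical consumer: tilt factor weights by regime, then throttle by g."""
    mult = policy["weights_multipliers"]
    tilted = {f: w * mult.get(f, 1.0) for f, w in base_weights.items()}
    total = sum(tilted.values())
    return {
        # assumption: tilted weights are renormalized to sum to 1
        "factor_weights": {f: w / total for f, w in tilted.items()},
        "turnover_cap": policy["turnover_cap"] * g,  # "scale_turnover_cap_by_g"
        "position_scale": g,                         # "scale_position_sizes_by_g"
    }

# Risk-Off defaults copied from policy_by_regime above
risk_off = {
    "weights_multipliers": {"momentum": 0.70, "quality": 1.15, "value": 1.05, "low_vol": 1.25},
    "turnover_cap": 0.10,
}
base = {"momentum": 0.25, "quality": 0.25, "value": 0.25, "low_vol": 0.25}  # illustrative

g_uniform = aggressiveness([1/3, 1/3, 1/3])  # no information: g sits near g_min
g_certain = aggressiveness([1.0, 0.0, 0.0])  # one dominant state: g sits near g_max
result = apply_policy(base, risk_off, g_uniform)
print(round(g_uniform, 4), round(g_certain, 4))
print(result)
```

Because `g_min` is 0.35 and the max-posterior term never drops below 1/K, even a completely flat posterior leaves positions and turnover at a bit under half of their regime defaults rather than switching the book off.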


In [None]:
# ============================================================
# Section 2.8 - Walk-Forward Integration
# - Rolling/expanding windows (match Section 6 when available)
# - Fit scaler+HMM on TRAIN; score/label TEST ONLY
# - Save per-window artifacts; stitch into a continuous timeline
# - Preserve state→label semantics per window (no drift)
# Reuses:
#   - CFG.regime_dir (from 2.0)
#   - artifacts/regimes/market_panel.parquet (2.0)
#   - artifacts/regimes/window_manifest.json (2.1; single-window fallback)
#   - artifacts/regimes/windows_manifest.json (preferred multi-window; else autogen)
#   - artifacts/regimes/regime_sensitivity.json (2.5, optional pointer)
#   - artifacts/regimes/diagnostics/* (2.6, optional pointer)
# Outputs:
#   - artifacts/regimes/windowed/regime_labels_<winid>.parquet (+ .csv)
#   - artifacts/regimes/windowed/regime_hmm_<winid>.pkl
#   - artifacts/regimes/windowed/regime_meta_<winid>.json
#   - artifacts/regimes/regime_labels.parquet (+ .csv)  [stitched TEST]
# Notes:
#   - #TOCHANGE marks heavier settings for real run.
# ============================================================

from __future__ import annotations
import os, json
from typing import Dict, Any, List, Tuple
from datetime import datetime
import numpy as np
import pandas as pd
import joblib
from hmmlearn.hmm import GaussianHMM
from sklearn.preprocessing import StandardScaler

REGIME_DIR   = CFG.regime_dir
PANEL_PATH   = os.path.join(REGIME_DIR, "market_panel.parquet")
MAN_SINGLE   = os.path.join(REGIME_DIR, "window_manifest.json")        # from 2.1 (single window)
MAN_WINDOWS  = os.path.join(REGIME_DIR, "windows_manifest.json")       # preferred (multi)
OUT_DIR_WIN  = os.path.join(REGIME_DIR, "windowed")
os.makedirs(OUT_DIR_WIN, exist_ok=True)

assert os.path.exists(PANEL_PATH), f"Missing market panel: {PANEL_PATH}"
mkt = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
mkt["date"] = pd.to_datetime(mkt["date"])

# ─────────────────────────────────────────────────────────────
# 0) Windows manifest: prefer multi-window; else autogen a LIGHT test manifest (#TOCHANGE)
# ─────────────────────────────────────────────────────────────
def _autogen_windows(dates: pd.Series) -> List[Dict[str, Any]]:
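    # LIGHT test plan: 2 small rolling windows (#TOCHANGE for real run)
    # Real run suggestion (#TOCHANGE): expanding train of ~5–8y; test 6–12m; stride 6m
    dmin, dmax = dates.min(), dates.max()
    # crude splits for quick smoke test, anchored at the first panel date: each window
    # trains on 6y; W1 tests on the following year, W2 tests through the last panel date
    # (assumes the panel spans more than 10y)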
    cuts = [
        {"win_id":"W1", "train_start":str((dmin + pd.Timedelta(days=365*3)).date()),
         "train_end":  str((dmin + pd.Timedelta(days=365*9)).date()),
         "test_start": str((dmin + pd.Timedelta(days=365*9)+pd.Timedelta(days=1)).date()),
         "test_end":   str((dmin + pd.Timedelta(days=365*10)).date())},
        {"win_id":"W2", "train_start":str((dmin + pd.Timedelta(days=365*4)).date()),
         "train_end":  str((dmin + pd.Timedelta(days=365*10)).date()),
         "test_start": str((dmin + pd.Timedelta(days=365*10)+pd.Timedelta(days=1)).date()),
         "test_end":   str(dmax.date())},
    ]
    return cuts

if os.path.exists(MAN_WINDOWS):
    with open(MAN_WINDOWS, "r") as f:
        WINS = json.load(f)
elif os.path.exists(MAN_SINGLE):
    # Wrap the single window as a one-window WF run
    with open(MAN_SINGLE, "r") as f:
        man = json.load(f)
    WINS = [{
        "win_id":"W0",
        "train_start": man["window"]["train_start"],
        "train_end":   man["window"]["train_end"],
        "test_start":  man["window"]["test_start"],
        "test_end":    man["window"]["test_end"],
    }]
else:
    # autogen LIGHT test manifest (#TOCHANGE)
    WINS = _autogen_windows(mkt["date"])
    with open(MAN_WINDOWS, "w") as f:
        json.dump(WINS, f, indent=2)

# ─────────────────────────────────────────────────────────────
# 1) HMM hyperparams - reuse Section 2.2 defaults; mark heavier real-run values
# ─────────────────────────────────────────────────────────────
N_ITER = 200            # TOCHANGE: 1000 for real run
N_INIT = 2              # TOCHANGE: 10 for real run
COVTYPE = "full"
TOL = 1e-3              # TOCHANGE: 1e-4 for real run
RANDOM_STATE = 42
LAMBDA_STICK = 0.15     # TOCHANGE: 0.30–0.50 for real run
K = 3                   # TOCHANGE: choose from sensitivity (2 or 3); default 3

APPLY_RECENCY   = True  # reuse the finance recency rule from 2.2
HALF_LIFE_DAYS  = 756   # TOCHANGE: try 504/756/1260
SEG_LEN         = 60    # TOCHANGE: 90–120
N_SEGMENTS      = 80    # TOCHANGE: 200–400
EPSILON_FLOOR   = 0.10  # TOCHANGE: 0.05–0.15

FEATURES = CFG.hmm_features.copy()
if getattr(CFG, "include_dvix", False) and "dvix" in mkt.columns and "dvix" not in FEATURES:
    FEATURES.append("dvix")

# ─────────────────────────────────────────────────────────────
# 2) Utilities reused from 2.2/2.3/2.4
# ─────────────────────────────────────────────────────────────
def _diag_sticky_blend(T: np.ndarray, lam: float) -> np.ndarray:
    k = T.shape[0]
    out = (1.0 - lam) * T + lam * np.eye(k)
    return out / out.sum(axis=1, keepdims=True)

def _build_time_decay_weights(dates: np.ndarray, half_life_days: int) -> np.ndarray:
    t = np.array([pd.Timestamp(d).toordinal() for d in dates], dtype=float)
    age = (t.max() - t)
    decay = np.log(2) / max(1, half_life_days)
    w = np.exp(-decay * age)
    return w / (w.sum() + 1e-12)

def _sample_time_weighted_subsequences(
    X: np.ndarray, dates: np.ndarray,
    seg_len: int=SEG_LEN, n_segments: int=N_SEGMENTS, half_life_days: int=HALF_LIFE_DAYS, seed: int=RANDOM_STATE
) -> Tuple[np.ndarray, List[int]]:
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    if n < seg_len:
        return X.copy(), [n]
    ends = np.arange(seg_len - 1, n)
    p = _build_time_decay_weights(dates[ends], half_life_days)
    p = np.maximum(p, EPSILON_FLOOR * p.max()); p = p / p.sum()
    chosen = rng.choice(ends, size=min(n_segments, len(ends)), replace=True, p=p)
    chunks, lengths = [], []
    for e in chosen:
        s = e - (seg_len - 1)
        chunks.append(X[s:e+1]); lengths.append(seg_len)
    return np.vstack(chunks), lengths

def _fit_hmm(X_train: np.ndarray, dates_train: np.ndarray, k: int, seed: int) -> GaussianHMM:
    if APPLY_RECENCY:
        X_fit, lengths = _sample_time_weighted_subsequences(X_train, dates_train, seg_len=SEG_LEN, n_segments=N_SEGMENTS, half_life_days=HALF_LIFE_DAYS, seed=seed)
    else:
        X_fit, lengths = X_train, [len(X_train)]
    model = GaussianHMM(
        n_components=k, covariance_type=COVTYPE, n_iter=N_ITER, tol=TOL,
        random_state=seed, verbose=False, init_params="mc", params="stmc"
    )
    T0 = np.full((k, k), (1.0 - 0.90) / max(1, k - 1)); np.fill_diagonal(T0, 0.90)
    model.transmat_ = T0; model.startprob_ = np.full(k, 1.0 / k)
    model.fit(X_fit, lengths=lengths)
    model.transmat_ = _diag_sticky_blend(model.transmat_, LAMBDA_STICK)
    return model

def _label_states(df_tr: pd.DataFrame, st_train: np.ndarray) -> Dict[int, str]:
    # df_tr is TRAIN-only and aligned to st_train
    assert len(df_tr) == len(st_train), "Train data and train states length mismatch"
    tmp = []
    states_unique = sorted(np.unique(st_train))
    for s in states_unique:
        mask = (st_train == s)
        def wmean(x, w):
            w = np.asarray(w, float); x = np.asarray(x, float)
            z = w.sum()
            return float((x*w).sum()/z) if z > 0 else np.nan
        tmp.append({
            "state_id":  s,
            "ret_mean":  wmean(df_tr["spy_ret"].values, mask),
            "rv20_mean": wmean(df_tr["spy_rv_20"].values, mask) if "spy_rv_20" in df_tr else np.nan,
            "vix_mean":  wmean(df_tr["vix_close"].values, mask) if "vix_close" in df_tr else np.nan,
            "breadth_mean": wmean(df_tr["breadth"].values, mask) if "breadth" in df_tr else np.nan,
        })
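    prof = pd.DataFrame(tmp)
    # idxmax() returns a row index, so look up state_id explicitly
    # (the two differ when a state never appears in TRAIN)
    risk_off_id = int(prof.loc[prof["rv20_mean"].idxmax(), "state_id"])
    # pick Risk-On among the remaining states so one state cannot receive both labels
    rest = prof[prof["state_id"] != risk_off_id]
    mapping = {risk_off_id: "Risk-Off"}
    if not rest.empty:
        mapping[int(rest.loc[rest["ret_mean"].idxmax(), "state_id"])] = "Risk-On"
    for s in states_unique:
        if s not in mapping:
            mapping[s] = "Transition"
    return mapping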

def _debounce_series(state_ids: np.ndarray, min_dwell_days: int=CFG.min_dwell_days) -> np.ndarray:
    # 2.4 minimal debounce: enforce min dwell by suppressing singleton flips
    out = state_ids.copy()
    i = 1
    while i < len(out)-1:
        if out[i] != out[i-1] and out[i] != out[i+1]:
            out[i] = out[i-1]  # squash 1-day blip
            i += 1
        i += 1
    # NOTE: #TOCHANGE implement full min-run-length >= min_dwell_days if needed
    return out

# ─────────────────────────────────────────────────────────────
# 3) Walk-forward loop
# ─────────────────────────────────────────────────────────────
stitched = []  # list[DataFrame] of TEST chunks with posteriors + labels
win_summ = []

for w in WINS:
    win_id = w.get("win_id", f"W{len(win_summ)}")
    ts, te = pd.to_datetime(w["train_start"]), pd.to_datetime(w["train_end"])
    us, ue = pd.to_datetime(w["test_start"]),  pd.to_datetime(w["test_end"])
    m_train = (mkt["date"] >= ts) & (mkt["date"] <= te)
    m_test  = (mkt["date"] >= us) & (mkt["date"] <= ue)
    df_tr = mkt.loc[m_train, ["date", "spy_ret"] + FEATURES].dropna().reset_index(drop=True)
    df_te = mkt.loc[m_test,  ["date", "spy_ret"] + FEATURES].dropna().reset_index(drop=True)

    if df_tr.empty or df_te.empty:
        continue

    scaler = StandardScaler().fit(df_tr[FEATURES].to_numpy(dtype=float))
    X_tr = scaler.transform(df_tr[FEATURES].to_numpy(dtype=float))
    X_te = scaler.transform(df_te[FEATURES].to_numpy(dtype=float))

    # Fit HMM (single best run for speed; #TOCHANGE run N_INIT restarts, keep best)
    seed = RANDOM_STATE
    model = _fit_hmm(X_tr, df_tr["date"].to_numpy(), k=K, seed=seed)

    # Score TEST only
    post = model.predict_proba(X_te)
    hard = post.argmax(axis=1)

    # Map semantics using TRAIN (no peeking)
    # We need states on TRAIN for profiling; use predict_proba on train too (cheap)
    st_train = model.predict_proba(X_tr).argmax(axis=1)
    mapping  = _label_states(df_tr, st_train)
    lbl_test = pd.Series([mapping.get(int(s), f"State{int(s)}") for s in hard], index=df_te.index)

    # Debounce (light)
    hard_db = _debounce_series(hard, min_dwell_days=CFG.min_dwell_days)

    # Package per-window labels
    out_cols = {
        "date": df_te["date"],
        "state_id": hard,
        "state_id_smoothed": hard_db,
        "regime_label": lbl_test.values,
        "regime_label_smoothed": pd.Series([mapping.get(int(s), f"State{int(s)}") for s in hard_db], index=df_te.index).values,
    }
    # Add posterior columns p0..pK-1
    for s in range(post.shape[1]):
        out_cols[f"p{s}"] = post[:, s]
    lab_te = pd.DataFrame(out_cols).sort_values("date").reset_index(drop=True)

    # Save per-window artifacts
    # Bundle: model + scaler + meta knobs for traceability
    bundle = {
        "model": model,
        "k": K,
        "features": FEATURES,
        "scaler": scaler,  # keep in-bundle object; also persist path below if desired
        "random_state": seed,
        "n_iter": N_ITER, "n_init": N_INIT, "tol": TOL, "covariance_type": COVTYPE,
        "recency_weighting": APPLY_RECENCY, "recency_half_life_days": HALF_LIFE_DAYS,
        "recency_seg_len": SEG_LEN, "recency_n_segments": N_SEGMENTS, "recency_epsilon_floor": EPSILON_FLOOR,
        "sticky_lambda": LAMBDA_STICK,
        "train_dates": [str(d) for d in df_tr["date"].to_numpy()],
        "test_dates":  [str(d) for d in df_te["date"].to_numpy()],
        "created_at": datetime.utcnow().isoformat() + "Z",
        "fit_mode": "recency" if APPLY_RECENCY else "plain",
    }
    bpath = os.path.join(OUT_DIR_WIN, f"regime_hmm_{win_id}.pkl")
    joblib.dump(bundle, bpath)

    meta = {
        "win_id": win_id,
        "window": {"train_start": str(ts.date()), "train_end": str(te.date()), "test_start": str(us.date()), "test_end": str(ue.date())},
        "features": FEATURES,
        "k": K,
        "state_label_map": {int(k): v for k, v in mapping.items()},
        "sticky_lambda": LAMBDA_STICK,
        "recency": {"enabled": APPLY_RECENCY, "half_life_days": HALF_LIFE_DAYS, "seg_len": SEG_LEN, "n_segments": N_SEGMENTS, "epsilon_floor": EPSILON_FLOOR},
        "bundle_path": bpath,
        "panel_path": PANEL_PATH,
        "sensitivity_path": os.path.join(REGIME_DIR, "regime_sensitivity.json") if os.path.exists(os.path.join(REGIME_DIR, "regime_sensitivity.json")) else None,
        "diagnostics_dir": os.path.join(REGIME_DIR, "diagnostics") if os.path.isdir(os.path.join(REGIME_DIR, "diagnostics")) else None,
        "created_at": datetime.utcnow().isoformat() + "Z",
    }
    mpath = os.path.join(OUT_DIR_WIN, f"regime_meta_{win_id}.json")
    with open(mpath, "w") as f:
        json.dump(meta, f, indent=2)

    # Always write CSV alongside parquet  #TOCHANGE: keep in prod when many windows
    lpath = os.path.join(OUT_DIR_WIN, f"regime_labels_{win_id}.parquet")
    lab_te.to_parquet(lpath, index=False)
    try:
        lab_te.to_csv(lpath.replace(".parquet", ".csv"), index=False)
    except Exception:
        pass

    stitched.append(lab_te)
    win_summ.append({
        "win_id": win_id,
        "train_start": str(ts.date()),
        "train_end":   str(te.date()),
        "test_start":  str(us.date()),
        "test_end":    str(ue.date()),
        "n_train": int(len(df_tr)),
        "n_test": int(len(df_te)),
        "bundle_path": bpath,
        "labels_path": lpath,
        "meta_path": mpath,
    })

# ─────────────────────────────────────────────────────────────
# 4) Save a windows index (QoL)
# ─────────────────────────────────────────────────────────────
idx_json = os.path.join(OUT_DIR_WIN, "windows_index.json")
idx_csv  = os.path.join(OUT_DIR_WIN, "windows_index.csv")
with open(idx_json, "w") as f:
    json.dump(win_summ, f, indent=2)
try:
    pd.DataFrame(win_summ).to_csv(idx_csv, index=False)
except Exception:
    pass


# ─────────────────────────────────────────────────────────────
# 5) Stitch all TEST chunks into one continuous timeline
# ─────────────────────────────────────────────────────────────
if len(stitched) == 0:
    raise RuntimeError("No window produced test labels; check window coverage.")

lab_all = pd.concat(stitched, axis=0, ignore_index=True).sort_values("date").reset_index(drop=True)
lab_all.to_parquet(os.path.join(REGIME_DIR, "regime_labels.parquet"), index=False)
try:
    lab_all.to_csv(os.path.join(REGIME_DIR, "regime_labels.csv"), index=False)
except Exception:
    pass

print(json.dumps({
    "status": "2.8 walk-forward complete",
    "n_windows": len(win_summ),
    "windows": win_summ,
    "stitched_out": {
        "parquet": os.path.join(REGIME_DIR, "regime_labels.parquet"),
        "csv": os.path.join(REGIME_DIR, "regime_labels.csv"),
    },
    "windows_index": {
        "json": idx_json,
        "csv": idx_csv,
    },
    "notes": [
        "Per-window scaler fitted on TRAIN only; TEST scored out-of-sample.",
        "State→label semantics are saved per window and applied to the test chunk.",
        "For real run: increase N_ITER/N_INIT and recency sampler size; align windows with Section 6."
    ]
}, indent=2))


{
  "status": "2.8 walk-forward complete",
  "n_windows": 1,
  "windows": [
    {
      "win_id": "W0",
      "train_start": "2007-02-06",
      "train_end": "2016-12-30",
      "test_start": "2017-01-03",
      "test_end": "2025-08-11",
      "n_train": 2495,
      "n_test": 2163,
      "bundle_path": "artifacts/regimes/windowed/regime_hmm_W0.pkl",
      "labels_path": "artifacts/regimes/windowed/regime_labels_W0.parquet",
      "meta_path": "artifacts/regimes/windowed/regime_meta_W0.json"
    }
  ],
  "stitched_out": {
    "parquet": "artifacts/regimes/regime_labels.parquet",
    "csv": "artifacts/regimes/regime_labels.csv"
  },
  "windows_index": {
    "json": "artifacts/regimes/windowed/windows_index.json",
    "csv": "artifacts/regimes/windowed/windows_index.csv"
  },
  "notes": [
    "Per-window scaler fitted on TRAIN only; TEST scored out-of-sample.",
    "State\u2192label semantics are saved per window and applied to the test chunk.",
    "For real run: increase N_ITER/N_INIT and recency sampler size; align windows with Section 6."
  ]
}

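The per-window meta file written in 2.8 stores `state_label_map` with integer state ids as keys, but JSON object keys are always strings. Any later lookup by `int` after `json.load` therefore misses silently unless the keys are cast back. A minimal sketch of that round trip follows; the three-state mapping is an illustrative assumption, not taken from a real run.

```python
import json

# Hypothetical mapping for illustration only; real labels come from the profiling step.
mapping = {0: "Risk-On", 1: "Transition", 2: "Risk-Off"}

# json.dumps coerces int keys to strings, just as the meta file on disk would hold them.
loaded = json.loads(json.dumps({"state_label_map": mapping}))["state_label_map"]

# A naive int lookup misses and falls through to the default label.
naive = loaded.get(1, "State1")

# Casting the keys back to int restores the intended lookup.
restored = {int(k): v for k, v in loaded.items()}
fixed = restored.get(1, "State1")

print(naive, fixed)
```

Any consumer of `regime_meta_<winid>.json` that indexes the map with `argmax` output would need the same cast.
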
In [None]:
# ============================================================
# Section 2.9: Forward (Shadow) Mode
# Daily update:
#   • Load latest window bundle (model+scaler+features) and meta
#   • Read newest feature rows (from Section 1 products via market_panel.parquet)
#   • Transform → predict_proba → state_id → regime_label (via saved mapping)
#   • Append to regime_labels.parquet (+ CSV), no backfilling
# Retrain cadence: weekly/bi-weekly (#TOCHANGE)
# Logging: JSONL with model hash/date/posteriors/label
# Alerts: simple chattering/dwell anomaly (rolling window)
# Reuses:
#   - CFG.regime_dir (2.0)
#   - artifacts/regimes/windowed/windows_index.json (2.8)
#   - artifacts/regimes/windowed/regime_meta_<winid>.json (2.8)
#   - artifacts/regimes/windowed/regime_hmm_<winid>.pkl (2.8)
#   - artifacts/regimes/market_panel.parquet (2.0)
#   - artifacts/regimes/regime_labels.parquet (2.8 stitched history)
#   - artifacts/regimes/regime_policy_map.json (2.7; optional refresh)
# Notes:
#   - #TOCHANGE marks production choices (smoothing, policy refresh cadence, thresholds).
# ============================================================

from __future__ import annotations
import os, json, hashlib
from typing import Dict, Any, List
from datetime import datetime
import numpy as np
import pandas as pd
import joblib

# ─────────────────────────────────────────────────────────────
# Paths (reuse CFG from 2.0)
# ─────────────────────────────────────────────────────────────
REGIME_DIR   = CFG.regime_dir
PANEL_PATH   = os.path.join(REGIME_DIR, "market_panel.parquet")
LAB_PATH_PQ  = os.path.join(REGIME_DIR, "regime_labels.parquet")
LAB_PATH_CSV = os.path.join(REGIME_DIR, "regime_labels.csv")
WIN_DIR      = os.path.join(REGIME_DIR, "windowed")
WIN_INDEX    = os.path.join(WIN_DIR, "windows_index.json")

# Optional: Section 1 global (won't fail if missing)
START_DATE = globals().get("START_DATE", None)

# Optional policy refresh (2.7-lite)
UPDATE_POLICY_MAP = True   # TOCHANGE: set True for prod daily refresh, False for fast tests
POLICY_OUT        = os.path.join(REGIME_DIR, "regime_policy_map.json")

# Forward log & alerts
FWD_LOG   = os.path.join(REGIME_DIR, "regime_forward_log.jsonl")
ALERTS_FP = os.path.join(REGIME_DIR, "forward_alerts.json")

# Smoothing / chattering guardrails
FWD_DEBOUNCE    = False    # TOCHANGE: consider True in prod (requires short context window)
ROLL_WINDOW_D   = 20       # TOCHANGE: 60 for prod
ROLL_MAX_SWITCH = 4        # TOCHANGE: 3 for prod

# Confidence tail length for policy scaling
N_CONF_TAIL = 3            # TOCHANGE: 5–10 for prod

os.makedirs(WIN_DIR, exist_ok=True)

# ─────────────────────────────────────────────────────────────
# Helpers
# ─────────────────────────────────────────────────────────────
def _load_latest_window_meta() -> Dict[str, Any]:
    """Pick the window with the latest test_end from windows_index.json; fallback: scan meta files."""
    if os.path.exists(WIN_INDEX):
        with open(WIN_INDEX, "r") as f:
            idx = json.load(f)
        if isinstance(idx, list) and len(idx) > 0:
            # sort by test_end
            idx_sorted = sorted(idx, key=lambda d: d.get("test_end", ""), reverse=True)
            return idx_sorted[0]
    # Fallback: scan meta files
    metas = [p for p in os.listdir(WIN_DIR) if p.startswith("regime_meta_") and p.endswith(".json")]
    if not metas:
        raise FileNotFoundError("No window meta files found; run 2.8 first.")
    # choose the lexicographically latest as a fallback heuristic
    metas.sort(reverse=True)
    with open(os.path.join(WIN_DIR, metas[0]), "r") as f:
        m = json.load(f)
    return {
        "win_id": m.get("win_id", "W?"),
        "bundle_path": m.get("bundle_path"),
        "meta_path": os.path.join(WIN_DIR, metas[0]),
        "labels_path": None,
        "test_end": m.get("window", {}).get("test_end", ""),
    }

def _model_signature(features: List[str], k: int, window: Dict[str, Any]) -> str:
    return hashlib.sha256(json.dumps({"features": features, "k": k, "window": window}, sort_keys=True).encode()).hexdigest()

def _entropy(p: np.ndarray) -> float:
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p*np.log(p)).sum() / np.log(len(p)))

def _aggressiveness_from_posterior(p_mean: np.ndarray) -> Dict[str, float]:
    c_max = float(p_mean.max())
    c_ent = 1.0 - _entropy(p_mean)
    c = 0.5 * (c_max + c_ent)
    g_min, g_max = 0.35, 1.00  # TOCHANGE
    g = g_min + (g_max - g_min) * c
    return {"c_max": c_max, "c_entropy": c_ent, "c": c, "g": g}

def _append_jsonl(path: str, rec: Dict[str, Any]) -> None:
    with open(path, "a") as f:
        f.write(json.dumps(rec) + "\n")

# ─────────────────────────────────────────────────────────────
# Load artifacts
# ─────────────────────────────────────────────────────────────
assert os.path.exists(PANEL_PATH), f"Missing market panel: {PANEL_PATH}"
panel = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
panel["date"] = pd.to_datetime(panel["date"])

meta_idx = _load_latest_window_meta()
assert meta_idx.get("bundle_path"), "Latest window has no bundle_path; re-run 2.8."
with open(meta_idx.get("meta_path") or os.path.join(WIN_DIR, f"regime_meta_{meta_idx['win_id']}.json"), "r") as f:
    META = json.load(f)

BUNDLE = joblib.load(meta_idx["bundle_path"])
MODEL  = BUNDLE["model"]
SCALER = BUNDLE["scaler"]
FEATS  = BUNDLE["features"]
K      = int(BUNDLE["k"])

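# State→label semantics. JSON object keys load as strings, so cast them back to int
# to match the integer state ids produced by argmax below.
state_label_map: Dict[int, str] = {int(k): v for k, v in META.get("state_label_map", {}).items()}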

# ─────────────────────────────────────────────────────────────
# Determine the forward slice (dates to append)
# ─────────────────────────────────────────────────────────────
if os.path.exists(LAB_PATH_PQ):
    labels_hist = pd.read_parquet(LAB_PATH_PQ).sort_values("date").reset_index(drop=True)
    labels_hist["date"] = pd.to_datetime(labels_hist["date"])
    last_date_done = labels_hist["date"].max()
else:
    labels_hist = None
    # If no file yet, use START_DATE if provided, else take panel min-1
    last_date_done = pd.to_datetime(START_DATE) - pd.Timedelta(days=1) if START_DATE else panel["date"].min() - pd.Timedelta(days=1)

new_mask = panel["date"] > last_date_done
to_score = panel.loc[new_mask, ["date"] + FEATS].dropna().reset_index(drop=True)

if to_score.empty:
    print(json.dumps({
        "status": "2.9 forward: nothing to do",
        "last_processed": str(last_date_done.date()) if pd.notnull(last_date_done) else None,
        "note": "No new rows in market_panel.parquet beyond current regime_labels."
    }, indent=2))
else:
    # ─────────────────────────────────────────────────────────
    # Transform → score → label
    # ─────────────────────────────────────────────────────────
    X = SCALER.transform(to_score[FEATS].to_numpy(dtype=float))
    post = MODEL.predict_proba(X)
    hard = post.argmax(axis=1)

    # Map to labels (no re-profiling)
    lbl = [state_label_map.get(int(s), f"State{int(s)}") for s in hard]

    # Optional forward debounce (minimal). For single-day updates this does little.
    if FWD_DEBOUNCE and labels_hist is not None and len(labels_hist) > 2:
        prev_state = int(labels_hist.iloc[-1]["state_id"])
        if len(hard) == 1:
            # squash a 1-day blip if it differs from the last two states
            prev_prev_state = int(labels_hist.iloc[-2]["state_id"])
            if hard[0] != prev_state and prev_state == prev_prev_state:
                hard[0] = prev_state
                lbl[0]  = state_label_map.get(prev_state, f"State{prev_state}")

    # Package new rows
    out = pd.DataFrame({
        "date": to_score["date"],
        "state_id": hard,
        "regime_label": lbl,
        **{f"p{s}": post[:, s] for s in range(K)}
    }).sort_values("date").reset_index(drop=True)

    # Preserve smoothed columns if they exist by copying raw (forward mode = minimal smoothing)
    if labels_hist is not None and "state_id_smoothed" in labels_hist.columns:
        out["state_id_smoothed"] = out["state_id"].values
    if labels_hist is not None and "regime_label_smoothed" in labels_hist.columns:
        out["regime_label_smoothed"] = out["regime_label"].values

    # Append to history (or create)
    if labels_hist is not None:
        labels_new = pd.concat([labels_hist, out], axis=0, ignore_index=True)
    else:
        labels_new = out

    labels_new = labels_new.drop_duplicates(subset=["date"]).sort_values("date").reset_index(drop=True)
    labels_new.to_parquet(LAB_PATH_PQ, index=False)
    try:
        labels_new.to_csv(LAB_PATH_CSV, index=False)
    except Exception:
        pass

    # ─────────────────────────────────────────────────────────
    # Logging & alerts
    # ─────────────────────────────────────────────────────────
    sig = _model_signature(FEATS, K, META.get("window", {}))
    for i, r in out.iterrows():
        rec = {
            "ts": datetime.utcnow().isoformat() + "Z",
            "date": str(pd.to_datetime(r["date"]).date()),
            "model_sig": sig,
            "state_id": int(r["state_id"]),
            "regime_label": str(r["regime_label"]),
            "posteriors": {f"p{s}": float(r[f"p{s}"]) for s in range(K)}
        }
        _append_jsonl(FWD_LOG, rec)

    # Simple chattering alert (rolling window over recent labels)
    alerts = {"generated_at": datetime.utcnow().isoformat() + "Z", "alerts": []}
    tail = labels_new.tail(ROLL_WINDOW_D).reset_index(drop=True)
    if len(tail) >= 3:
        sw = int((tail["state_id"].diff().fillna(0) != 0).sum())
        if sw >= ROLL_MAX_SWITCH:
            alerts["alerts"].append(
                f"High switch count in last {ROLL_WINDOW_D}d: {sw} (>= {ROLL_MAX_SWITCH})"
            )
    if alerts["alerts"]:
        with open(ALERTS_FP, "w") as f:
            json.dump(alerts, f, indent=2)

    # ─────────────────────────────────────────────────────────
    # Optional: refresh policy hooks (2.7-lite)
    # ─────────────────────────────────────────────────────────
    if UPDATE_POLICY_MAP:
        # Select exactly p0..p{K-1}; a bare startswith("p") would also match unrelated columns.
        p_cols = [f"p{s}" for s in range(K) if f"p{s}" in labels_new.columns]
        if len(p_cols) == K:
            tail_p = labels_new[p_cols].tail(N_CONF_TAIL).to_numpy(dtype=float)
            p_mean = tail_p.mean(axis=0)
            conf   = _aggressiveness_from_posterior(p_mean)
            latest = labels_new.iloc[-1]
            policy = {
                "created_at": datetime.utcnow().isoformat() + "Z",
                "latest_date": str(pd.to_datetime(latest["date"]).date()),
                "k": K,
                "latest_regime_label": str(latest.get("regime_label")),
                "latest_state_id": int(latest.get("state_id")),
                "latest_posteriors": {f"p{s}": float(latest.get(f"p{s}", np.nan)) for s in range(K)},
                "confidence": {
                    "aggressiveness_scalar_g": conf["g"],
                    "confidence": conf,
                    "recommendations": {
                        "scale_position_sizes_by_g": True,
                        "scale_turnover_cap_by_g": True,
                        "scale_hedge_intensity_by_(1-g)": True
                    }
                },
                "inputs": {
                    "bundle_path": meta_idx["bundle_path"],
                    "meta_path": meta_idx.get("meta_path"),
                    "labels_path": LAB_PATH_PQ,
                    "panel_path": PANEL_PATH,
                    "smoothing_window_days": N_CONF_TAIL  # TOCHANGE
                },
                "signature": sig
            }
            with open(POLICY_OUT, "w") as f:
                json.dump(policy, f, indent=2)

    print(json.dumps({
        "status": "2.9 forward appended",
        "n_rows_appended": int(out.shape[0]),
        "last_appended_date": str(out["date"].iloc[-1].date()),
        "bundle_used": meta_idx["bundle_path"],
        "policy_refreshed": bool(UPDATE_POLICY_MAP),
        "log_path": FWD_LOG,
        "alerts_written": bool(alerts["alerts"]),  # True only if this run raised alerts; a stale file from an earlier run does not count
    }, indent=2))

{
  "status": "2.9 forward: nothing to do",
  "last_processed": "2025-08-11",
  "note": "No new rows in market_panel.parquet beyond current regime_labels."
}
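
The chattering alert in 2.9 counts how many times `state_id` changes within the last `ROLL_WINDOW_D` labels and fires once the count reaches `ROLL_MAX_SWITCH`. A minimal pure-Python sketch of that rule is below. `count_switches` and `chatter_alert` are illustrative helper names rather than functions from this notebook, and both state sequences are made up.

```python
from typing import Optional, Sequence

def count_switches(states: Sequence[int]) -> int:
    """Number of day-over-day changes in a state sequence."""
    return sum(1 for prev, cur in zip(states, states[1:]) if cur != prev)

def chatter_alert(states: Sequence[int], window: int = 20, max_switch: int = 4) -> Optional[str]:
    """Return an alert message if the trailing window switches too often, else None."""
    tail = list(states)[-window:]
    if len(tail) < 3:  # same minimum-length guard as the notebook cell
        return None
    sw = count_switches(tail)
    if sw >= max_switch:
        return f"High switch count in last {window}d: {sw} (>= {max_switch})"
    return None

calm = [0] * 10 + [1] * 10             # a single regime change in 20 days
noisy = [0, 1, 0, 1, 0, 2] + [2] * 14  # five changes in 20 days

print(chatter_alert(calm))
print(chatter_alert(noisy))
```

Counting adjacent pairs gives the same number as `diff().fillna(0) != 0` on the pandas column, because the first row has no predecessor and contributes zero in both versions.
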


In [None]:
# ============================================================
# Section Forward Smoke Test (Prep + Run)
# Goal: simulate a realistic forward append using the *actual*
# latest tail of the panel, without touching production files.
#
# What it does:
#   - Truncates current labels by last N days into a smoke copy
#   - Loads latest windowed HMM bundle + meta (from 2.8 outputs)
#   - Transforms last N feature rows with saved scaler
#   - Predicts posteriors/labels using saved state map
#   - Appends to smoke labels and writes a smoke policy map
#
# Reuses:
#   - CFG (from 2.0)
#   - artifacts/regimes/market_panel.parquet (1.→2.0)
#   - artifacts/regimes/regime_labels.parquet (2.8 stitched)
#   - artifacts/regimes/windowed/windows_index.json (2.8)
#   - artifacts/regimes/windowed/regime_hmm_<win>.pkl (2.8)
#   - artifacts/regimes/windowed/regime_meta_<win>.json (2.8)
#
# Outputs (smoke-only; production files untouched):
#   - artifacts/regimes/forward_smoketest/regime_labels_smoke_base.parquet
#   - artifacts/regimes/forward_smoketest/regime_labels_smoke.parquet (+csv)
#   - artifacts/regimes/forward_smoketest/regime_policy_map_smoke.json
#
# Notes:
#   - #TOCHANGE marks heavier/production settings.
#   - This section is safe to run multiple times; it overwrites files in the smoke folder.
# ============================================================

from __future__ import annotations
import os, json, hashlib
from typing import Dict, Any
import numpy as np
import pandas as pd
import joblib

# ---- Tunables ------------------------------------------------
SMOKE_N_DAYS = 3     # #TOCHANGE: 5–10 for a beefier smoke
REGIME_DIR   = CFG.regime_dir
SMOKE_DIR    = os.path.join(REGIME_DIR, "forward_smoketest")
os.makedirs(SMOKE_DIR, exist_ok=True)

# ---- Paths ---------------------------------------------------
PANEL_PATH    = os.path.join(REGIME_DIR, "market_panel.parquet")
LAB_PROD_PATH = os.path.join(REGIME_DIR, "regime_labels.parquet")
WINDEX_PATH   = os.path.join(REGIME_DIR, "windowed", "windows_index.json")
WIN_DIR       = os.path.join(REGIME_DIR, "windowed")

# ---- Load panel & prod labels --------------------------------
assert os.path.exists(PANEL_PATH), f"Missing panel: {PANEL_PATH}"
panel = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
panel["date"] = pd.to_datetime(panel["date"])

if not os.path.exists(LAB_PROD_PATH):
    raise FileNotFoundError(
        "Expected stitched labels at artifacts/regimes/regime_labels.parquet from 2.8. "
        "Run 2.8 first."
    )
lab_prod = pd.read_parquet(LAB_PROD_PATH).sort_values("date").reset_index(drop=True)
lab_prod["date"] = pd.to_datetime(lab_prod["date"])

# ---- Choose the latest window meta/bundle --------------------
def _pick_latest_window(win_index_json: str) -> Dict[str, Any]:
    # Prefer windows_index.json (2.8 QoL). Fallback to single W0.
    if os.path.exists(win_index_json):
        with open(win_index_json, "r") as f:
            idx = json.load(f)
        if isinstance(idx, list) and len(idx) > 0:
            # sort by test_end
            idx_sorted = sorted(idx, key=lambda d: pd.to_datetime(d["test_end"]))
            return idx_sorted[-1]
    # Fallback: try W0 paths
    meta_fallback = os.path.join(WIN_DIR, "regime_meta_W0.json")
    bundle_fallback = os.path.join(WIN_DIR, "regime_hmm_W0.pkl")
    if os.path.exists(meta_fallback) and os.path.exists(bundle_fallback):
        with open(meta_fallback, "r") as f:
            m = json.load(f)
        return {
            "win_id": "W0",
            "train_start": m["window"]["train_start"],
            "train_end": m["window"]["train_end"],
            "test_start": m["window"]["test_start"],
            "test_end": m["window"]["test_end"],
            "bundle_path": m["bundle_path"],
            "labels_path": os.path.join(WIN_DIR, "regime_labels_W0.parquet"),
            "meta_path": meta_fallback,
        }
    raise FileNotFoundError("No windowed bundles found. Run 2.8 first.")

latest = _pick_latest_window(WINDEX_PATH)
bundle_path = latest["bundle_path"]
meta_path   = latest["meta_path"]

assert os.path.exists(bundle_path), f"Missing bundle: {bundle_path}"
assert os.path.exists(meta_path),   f"Missing meta:   {meta_path}"

bundle = joblib.load(bundle_path)
with open(meta_path, "r") as f:
    meta = json.load(f)

features = bundle["features"]
scaler   = bundle.get("scaler", None)
if scaler is None and "scaler_path" in bundle:
    scaler = joblib.load(bundle["scaler_path"])
if scaler is None:
    raise RuntimeError("Bundle does not contain a scaler; ensure 2.8 saved the scaler in-bundle.")

state_label_map = {int(k): v for k, v in meta.get("state_label_map", {}).items()}

# ---- Build the smoke base (truncate last N days of labels) ---
lab_prod_sorted = lab_prod.sort_values("date")
if len(lab_prod_sorted) <= SMOKE_N_DAYS:
    raise RuntimeError("Not enough labeled rows to build a smoke base; reduce SMOKE_N_DAYS.")

smoke_cutoff = lab_prod_sorted["date"].iloc[-SMOKE_N_DAYS-1]
lab_smoke_base = lab_prod_sorted.loc[lab_prod_sorted["date"] <= smoke_cutoff].copy()
base_last_date = lab_smoke_base["date"].max()

base_path_parq = os.path.join(SMOKE_DIR, "regime_labels_smoke_base.parquet")
lab_smoke_base.to_parquet(base_path_parq, index=False)
try:
    lab_smoke_base.to_csv(base_path_parq.replace(".parquet", ".csv"), index=False)
except Exception:
    pass

# ---- Determine the forward tail to score ---------------------
panel_tail = panel.loc[panel["date"] > base_last_date].copy()
if panel_tail.empty:
    raise RuntimeError("Panel has no dates after the smoke base cutoff; cannot simulate forward.")

# Use only the first SMOKE_N_DAYS after base_last_date
fwd_dates = sorted(panel_tail["date"].unique())[:SMOKE_N_DAYS]
panel_fwd = panel.loc[panel["date"].isin(fwd_dates)].copy()
panel_fwd = panel_fwd[["date"] + features].dropna().sort_values("date").reset_index(drop=True)

if panel_fwd.empty:
    raise RuntimeError("No complete feature rows for smoke forward dates; check features coverage.")

# ---- Transform with saved scaler & predict posteriors --------
X_fwd = scaler.transform(panel_fwd[features].to_numpy(dtype=float))
post  = bundle["model"].predict_proba(X_fwd)
hard  = post.argmax(axis=1)

# ---- Map to labels (no re-profiling; use saved mapping) ------
if state_label_map:
    reg_lbl = [state_label_map.get(int(s), f"State{int(s)}") for s in hard]
else:
    reg_lbl = [f"State{int(s)}" for s in hard]  # fallback

# ---- (Optional) minimal debounce just within the smoke block -
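def _debounce_series(state_ids: np.ndarray, min_dwell_days: int=CFG.min_dwell_days):
    """Squash isolated 1-day blips: a state that differs from both of its neighbours.

    Note: min_dwell_days is accepted but not used by this minimal version;
    only single-day blips are removed.
    """
    out = state_ids.copy()
    i = 1
    while i < len(out)-1:
        if out[i] != out[i-1] and out[i] != out[i+1]:
            out[i] = out[i-1]  # carry the previous state over the blip
            i += 1             # extra increment: the index right after a squash is not re-examined
        i += 1
    return out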

hard_db = _debounce_series(hard)

# ---- Build the rows to append --------------------------------
rows = {
    "date": panel_fwd["date"].to_numpy(),
    "state_id": hard,
    "state_id_smoothed": hard_db,
    "regime_label": np.array(reg_lbl, dtype=object),
    # Label each smoothed state through the same saved mapping (falls back to "State<n>")
    "regime_label_smoothed": np.array([state_label_map.get(int(s), f"State{int(s)}") for s in hard_db], dtype=object),
}
# Add posteriors p0..pK-1
K = post.shape[1]
for s in range(K):
    rows[f"p{s}"] = post[:, s]

lab_smoke_new = pd.DataFrame(rows).sort_values("date").reset_index(drop=True)

# ---- Append to smoke base & save ------------------------------
lab_smoke_all = pd.concat([lab_smoke_base, lab_smoke_new], axis=0).sort_values("date").reset_index(drop=True)

out_parq = os.path.join(SMOKE_DIR, "regime_labels_smoke.parquet")
lab_smoke_all.to_parquet(out_parq, index=False)
try:
    lab_smoke_all.to_csv(out_parq.replace(".parquet", ".csv"), index=False)
except Exception:
    pass

# ---- Also emit a smoke policy map (same spirit as 2.7) -------
def _entropy(p: np.ndarray) -> float:
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p*np.log(p)).sum() / np.log(len(p)))

# #TOCHANGE: smooth window for confidence in prod 5–10
N_SMOOTH = 3
tail = lab_smoke_all.tail(N_SMOOTH)
p_cols = [f"p{s}" for s in range(K)]  # exactly p0..p{K-1}; startswith("p") could match other columns
p_tail = tail[p_cols].to_numpy(dtype=float)
p_mean = p_tail.mean(axis=0) if p_tail.size else np.ones(K)/K
c_max = float(p_mean.max())
c_ent = 1.0 - _entropy(p_mean)
c_comb = 0.5*(c_max + c_ent)
g_min, g_max = 0.35, 1.00   # #TOCHANGE deeper throttling for prod
g = g_min + (g_max-g_min)*c_comb

latest_row = lab_smoke_all.iloc[-1]
latest_label = str(latest_row["regime_label_smoothed"] if "regime_label_smoothed" in lab_smoke_all.columns else latest_row["regime_label"])

policy_smoke = {
    "created_at": pd.Timestamp.now(tz="UTC").tz_localize(None).isoformat() + "Z",  # naive UTC, so "Z" is not appended after "+00:00"
    "mode": "smoke",
    "latest_date": str(pd.to_datetime(latest_row["date"]).date()),
    "k": int(K),
    "latest_regime_label": latest_label,
    "latest_state_id": int(latest_row["state_id_smoothed"] if "state_id_smoothed" in lab_smoke_all.columns else latest_row["state_id"]),
    "latest_posteriors": {f"p{s}": float(latest_row.get(f"p{s}", np.nan)) for s in range(K)},
    "confidence": {
        "aggressiveness_scalar_g": g,
        "confidence_components": {"c_max": c_max, "c_entropy": c_ent, "combined": c_comb},
        "recommendations": {
            "scale_position_sizes_by_g": True,
            "scale_turnover_cap_by_g": True,
            "scale_hedge_intensity_by_(1-g)": True
        }
    },
    "inputs": {
        "bundle_path": bundle_path,
        "meta_path": meta_path,
        "features": features,
        "smoke_base_labels": base_path_parq,
        "smoke_labels_out": out_parq,
        "n_days_scored": int(len(lab_smoke_new)),
        "window": latest.get("window", {
            "train_start": latest.get("train_start"),
            "train_end":   latest.get("train_end"),
            "test_start":  latest.get("test_start"),
            "test_end":    latest.get("test_end"),
        })
    }
}
with open(os.path.join(SMOKE_DIR, "regime_policy_map_smoke.json"), "w") as f:
    json.dump(policy_smoke, f, indent=2)

print(json.dumps({
    "status": "Forward smoke COMPLETE",
    "base_last_date": str(pd.to_datetime(base_last_date).date()),
    "scored_dates": [str(pd.to_datetime(d).date()) for d in lab_smoke_new["date"]],
    "out_labels_parquet": out_parq,
    "out_policy_smoke": os.path.join(SMOKE_DIR, "regime_policy_map_smoke.json"),
    "notes": [
        f"Truncated prod labels by last {SMOKE_N_DAYS} day(s) to create smoke base.",
        "Used latest windowed HMM bundle + scaler; no re-profiling (state map from meta).",
        "For a heavier smoke, increase SMOKE_N_DAYS (#TOCHANGE)."
    ]
}, indent=2))


{
  "status": "Forward smoke COMPLETE",
  "base_last_date": "2025-08-06",
  "scored_dates": [
    "2025-08-07",
    "2025-08-08",
    "2025-08-11"
  ],
  "out_labels_parquet": "artifacts/regimes/forward_smoketest/regime_labels_smoke.parquet",
  "out_policy_smoke": "artifacts/regimes/forward_smoketest/regime_policy_map_smoke.json",
  "notes": [
    "Truncated prod labels by last 3 day(s) to create smoke base.",
    "Used latest windowed HMM bundle + scaler; no re-profiling (state map from meta).",
    "For a heavier smoke, increase SMOKE_N_DAYS (#TOCHANGE)."
  ]
}
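
The smoke policy map derives the aggressiveness scalar `g` from the mean posterior over the last few days. It averages the top-state probability with one minus the normalized entropy, then maps that confidence linearly into `[g_min, g_max]`. A small worked sketch in plain Python is below. The function name `aggressiveness` and the example posteriors are illustrative assumptions; the 0.35 and 1.00 bounds are the notebook's current defaults.

```python
import math
from typing import Dict, Sequence

def aggressiveness(p_mean: Sequence[float], g_min: float = 0.35, g_max: float = 1.00) -> Dict[str, float]:
    # Clip away zeros so log() is defined, as the notebook's _entropy helper does.
    p = [min(max(x, 1e-12), 1.0) for x in p_mean]
    # Entropy normalized by log(K): near 0 for a one-hot posterior, 1 for a uniform one.
    h = -sum(x * math.log(x) for x in p) / math.log(len(p))
    c_max = max(p_mean)
    c_ent = 1.0 - h
    c = 0.5 * (c_max + c_ent)
    return {"c_max": c_max, "c_entropy": c_ent, "c": c, "g": g_min + (g_max - g_min) * c}

uniform = aggressiveness([1 / 3, 1 / 3, 1 / 3])  # least confident case for K = 3
confident = aggressiveness([0.90, 0.05, 0.05])   # one clearly dominant state

print(round(uniform["g"], 4), round(confident["g"], 4))
```

Because the top probability can never fall below `1/K`, the combined confidence is at least `1/(2K)`. With `K = 3` the uniform posterior therefore already yields `g` of about 0.458, so `g_min` itself is never reached under this formula.
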


In [None]:
# ============================================================
# Section 2.10: Configuration & Reproducibility
# Tasks:
#   • Snapshot effective config + key artifacts (hashes, sizes, dates)
#   • Basic determinism check (same bundle, same scores)
#   • Validations: posterior sums, date order, gaps, label semantics sanity
#   • Emit run manifest + validation report for auditability
#
# Reuses:
#   - CFG (from 2.0)
#   - artifacts/regimes/market_panel.parquet (2.0)
#   - artifacts/regimes/regime_labels.parquet (2.8)
#   - artifacts/regimes/windowed/windows_index.json (2.8 QoL)
#   - artifacts/regimes/windowed/regime_hmm_<win>.pkl + meta (2.8)
#   - artifacts/regimes/diagnostics/state_profiles_table.csv (2.6; optional)
#
# Outputs:
#   - artifacts/regimes/run_manifest.json
#   - artifacts/regimes/validation_report.json
#   - artifacts/regimes/run_fingerprint.txt  (short human-readable summary)
#
# Notes:
#   - #TOCHANGE marks heavier/production settings.
# ============================================================

from __future__ import annotations
import os, json, hashlib, sys
from typing import Dict, Any
import numpy as np
import pandas as pd
import joblib

REGIME_DIR   = CFG.regime_dir
PANEL_PATH   = os.path.join(REGIME_DIR, "market_panel.parquet")
LABELS_PATH  = os.path.join(REGIME_DIR, "regime_labels.parquet")
CONFIG_EFF   = os.path.join(REGIME_DIR, "regime_config_effective.json")
WINDEX_PATH  = os.path.join(REGIME_DIR, "windowed", "windows_index.json")
WIN_DIR      = os.path.join(REGIME_DIR, "windowed")
DIAG_DIR     = os.path.join(REGIME_DIR, "diagnostics")
PROF_TABLE   = os.path.join(DIAG_DIR, "state_profiles_table.csv")

def _sha256_file(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def _pick_latest_window(windex: str) -> Dict[str, Any]:
    if os.path.exists(windex):
        with open(windex, "r") as f:
            idx = json.load(f)
        if isinstance(idx, list) and len(idx) > 0:
            idx_sorted = sorted(idx, key=lambda d: pd.to_datetime(d["test_end"]))
            return idx_sorted[-1]
    # fallback to W0
    meta_fallback = os.path.join(WIN_DIR, "regime_meta_W0.json")
    if os.path.exists(meta_fallback):
        with open(meta_fallback, "r") as f:
            m = json.load(f)
        return {
            "win_id": "W0",
            "train_start": m["window"]["train_start"],
            "train_end":   m["window"]["train_end"],
            "test_start":  m["window"]["test_start"],
            "test_end":    m["window"]["test_end"],
            "bundle_path": m["bundle_path"],
            "meta_path":   meta_fallback,
            "labels_path": os.path.join(WIN_DIR, "regime_labels_W0.parquet"),
        }
    raise FileNotFoundError("No windowed artifacts found; run 2.8 first.")

# ── 1) Load essentials
assert os.path.exists(PANEL_PATH), f"Missing panel: {PANEL_PATH}"
assert os.path.exists(LABELS_PATH), f"Missing stitched labels: {LABELS_PATH}"
panel  = pd.read_parquet(PANEL_PATH).sort_values("date").reset_index(drop=True)
labels = pd.read_parquet(LABELS_PATH).sort_values("date").reset_index(drop=True)
panel["date"]  = pd.to_datetime(panel["date"])
labels["date"] = pd.to_datetime(labels["date"])

latest = _pick_latest_window(WINDEX_PATH)
bundle_path = latest["bundle_path"]; meta_path = latest["meta_path"]
assert os.path.exists(bundle_path), f"Missing bundle: {bundle_path}"
assert os.path.exists(meta_path),   f"Missing meta:   {meta_path}"
bundle = joblib.load(bundle_path)


# ── 2) Build run manifest (config + artifacts snapshot)
manifest: Dict[str, Any] = {
    "created_at": pd.Timestamp.now(tz="UTC").isoformat(),  # tz-aware: already ends in +00:00, so no "Z" suffix
    "config_effective_path": CONFIG_EFF if os.path.exists(CONFIG_EFF) else None,
    "artifacts": {
        "panel": {"path": PANEL_PATH, "rows": int(len(panel)), "sha256": _sha256_file(PANEL_PATH)},
        "labels_stitched": {"path": LABELS_PATH, "rows": int(len(labels)), "sha256": _sha256_file(LABELS_PATH)},
        "bundle_latest": {"path": bundle_path, "sha256": _sha256_file(bundle_path)},
        "meta_latest":   {"path": meta_path,   "sha256": _sha256_file(meta_path)},
    },
    "latest_window": {
        "win_id": latest.get("win_id"),
        "train_start": latest.get("train_start"), "train_end": latest.get("train_end"),
        "test_start":  latest.get("test_start"),  "test_end":  latest.get("test_end"),
        "labels_path": latest.get("labels_path"),
    },
    "features": bundle.get("features", []),
    "hmm": {
        "k": int(bundle.get("k", -1)),
        "covariance_type": bundle.get("covariance_type", "full"),
        "n_iter": int(bundle.get("n_iter", -1)),
        "n_init": int(bundle.get("n_init", -1)),
        "tol": float(bundle.get("tol", -1)),
        "random_state": int(bundle.get("random_state", -1)),
        "sticky_lambda": float(bundle.get("sticky_lambda", np.nan)),
        "recency": {
            "enabled": bool(bundle.get("recency_weighting", False)),
            "half_life_days": int(bundle.get("recency_half_life_days", 0)),
            "seg_len": int(bundle.get("recency_seg_len", 0)),
            "n_segments": int(bundle.get("recency_n_segments", 0)),
            "epsilon_floor": float(bundle.get("recency_epsilon_floor", 0.0)),
        },
    }
}
if manifest["config_effective_path"]:
    manifest["config_effective_sha256"] = _sha256_file(manifest["config_effective_path"])

with open(os.path.join(REGIME_DIR, "run_manifest.json"), "w") as f:
    json.dump(manifest, f, indent=2)

# --- Checks container + helper (must be defined before any _add_check calls)
report = {"checks": [], "status": "ok"}

def _add_check(name: str, ok: bool, details: Dict[str, Any] | None=None):
    report["checks"].append({"name": name, "pass": bool(ok), "details": details or {}})
    if not ok:
        report["status"] = "fail"

# Config keys presence (from CFG)
try:
    cfg_checks = {
        "hmm_features_present": bool(getattr(CFG, "hmm_features", None)),
        "min_dwell_days_present": hasattr(CFG, "min_dwell_days"),
        "posterior_thresh_present": hasattr(CFG, "posterior_thresh"),
        "recency_weighting_flag_present": hasattr(CFG, "recency_weighting"),
    }
    _add_check("config_keys_present", all(cfg_checks.values()), cfg_checks)
except Exception as e:
    _add_check("config_keys_present_error", False, {"error": repr(e)})

# Optional: environment snapshot (helps reproducibility)  # TOCHANGE: expand in prod
try:
    import platform, sys
    env = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "numpy": np.__version__,
        "pandas": pd.__version__,
        "hmmlearn": __import__("hmmlearn").__version__,
        "scikit_learn": __import__("sklearn").__version__,
        # "git_commit": <inject if available>  # TOCHANGE
    }
    with open(os.path.join(REGIME_DIR, "run_env.json"), "w") as f:
        json.dump(env, f, indent=2)
except Exception:
    pass


# ── 3) Validations

# 3.a Dates monotonic / duplicates
is_sorted = labels["date"].is_monotonic_increasing
has_dupes = labels["date"].duplicated().any()
_add_check("dates_sorted", is_sorted, {"monotonic_increasing": bool(is_sorted)})
_add_check("dates_no_duplicates", not has_dupes, {"duplicates": int(labels["date"].duplicated().sum())})

# 3.b Posterior rows sum ≈ 1 (if present)
p_cols = [c for c in labels.columns if c.startswith("p") and c[1:].isdigit()]  # p0..pK-1 only
if p_cols:
    row_sums = labels[p_cols].sum(axis=1).to_numpy(dtype=float)
    max_dev = float(np.max(np.abs(row_sums - 1.0)))
    _add_check("posteriors_sum_to_one", bool(max_dev < 1e-6), {"max_abs_deviation": max_dev})
else:
    _add_check("posteriors_present", False, {"note": "No p* columns found."})

# 3.c Gap check vs panel dates (labels ⊆ panel)
panel_dates  = set(pd.to_datetime(panel["date"]).dt.date)
labels_dates = set(pd.to_datetime(labels["date"]).dt.date)
missing_in_panel = sorted([str(d) for d in (labels_dates - panel_dates)])
_add_check("labels_dates_subset_of_panel", len(missing_in_panel)==0, {"labels_not_in_panel": missing_in_panel[:10]})


# 3.bis No gaps after stitching (business days)
try:
    bd = pd.bdate_range(start=labels["date"].min(), end=labels["date"].max(), freq="C")  # Mon-Fri grid, no holiday calendar
    labd = pd.to_datetime(labels["date"]).dt.normalize().unique()
    labd = pd.DatetimeIndex(labd).sort_values()
    missing = bd.difference(labd)
    # Market holidays can't be listed here, so isolated missing weekdays are tolerated;
    # only runs of more than MAX_GAP_BD consecutive missing business days are flagged.
    # TOCHANGE: tighten policy (e.g., require an exchange calendar)
    MAX_GAP_BD = 2  # assumed tolerance (not taken from CFG)
    pos = bd.get_indexer(labd)
    pos = pos[pos >= 0]  # drop label dates that fall off the weekday grid
    longest_gap = int(np.max(np.diff(pos) - 1)) if len(pos) > 1 else 0
    ok_nogaps = longest_gap <= MAX_GAP_BD
    report_missing = [str(d.date()) for d in missing[:10]]
    _add_check("no_business_day_gaps_in_labels", ok_nogaps, {"missing_first_10": report_missing, "missing_count": int(len(missing)), "longest_gap_bdays": longest_gap})
except Exception as e:
    _add_check("no_business_day_gaps_in_labels_error", False, {"error": repr(e)})

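# 3.d Label semantics sanity (target: Risk-On higher mean than Risk-Off; vol ordering)
# As written, both branches only confirm that state profiles exist and are finite.  # TOCHANGE: enforce ordering
# Prefer 2.6 table if available; else quick compute on full labeled set (heuristic).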
try:
    if os.path.exists(PROF_TABLE):
        prof = pd.read_csv(PROF_TABLE)
        # columns expected: state_id, ret_mean, rv20_mean, ...
        ron_mean = float(prof["ret_mean"].max())
        roff_vol = float(prof["rv20_mean"].max())
        ok = np.isfinite(ron_mean) and np.isfinite(roff_vol)
        _add_check("semantics_profiles_present", ok, {})
    else:
        # crude fallback by state_id on labeled rows
        tmp = labels.merge(panel[["date","spy_ret","spy_rv_20"]], on="date", how="left")
        grp = tmp.groupby("state_id", dropna=True).agg(ret_mean=("spy_ret","mean"), rv20_mean=("spy_rv_20","mean"))
        # profiles must be computable and finite (ordering is not enforced here)
        cond = bool(np.isfinite(grp["ret_mean"].max()) and np.isfinite(grp["rv20_mean"].max()))
        _add_check("semantics_profiles_fallback", bool(cond), {"n_states": int(grp.shape[0])})
except Exception as e:
    _add_check("semantics_profiles_error", False, {"error": repr(e)})

# 3.e Determinism smoke: rescore a tiny slice with same bundle
# #TOCHANGE: widen slice or repeat N times in prod
try:
    # take last ~200 rows available in panel∩labels
    common = labels.merge(panel[["date"]], on="date", how="inner").tail(200)
    if not common.empty and "scaler" in bundle:
        feats = bundle["features"]
        scaler = bundle["scaler"]
        # join features
        X = panel.merge(common[["date"]], on="date", how="inner").sort_values("date")
        X = X[feats].dropna().to_numpy(dtype=float)
        post1 = bundle["model"].predict_proba(scaler.transform(X))
        post2 = bundle["model"].predict_proba(scaler.transform(X))
        max_diff = float(np.max(np.abs(post1 - post2)))
        _add_check("determinism_same_inputs_same_scores", bool(max_diff < 1e-10), {"max_abs_diff": max_diff})
    else:
        _add_check("determinism_skipped", True, {"reason": "No overlap or scaler missing in bundle."})
except Exception as e:
    _add_check("determinism_error", False, {"error": repr(e)})

# 3.f Leakage guard (structural): ensure features are present at t
# (We can't fully prove non-leakage without lineage; this is a structural check.)
feats = bundle.get("features", [])
missing_feats = [f for f in feats if f not in panel.columns]
_add_check("features_present_in_panel", len(missing_feats)==0, {"missing": missing_feats})

# ── 4) Save report + short fingerprint
rep_path = os.path.join(REGIME_DIR, "validation_report.json")
with open(rep_path, "w") as f:
    json.dump(report, f, indent=2)

# short human-readable fingerprint
fp_lines = [
    f"Run: {manifest['created_at']}",
    f"K={manifest['hmm']['k']}, cov={manifest['hmm']['covariance_type']}, sticky_lambda={manifest['hmm']['sticky_lambda']}",
    f"Panel rows: {manifest['artifacts']['panel']['rows']}, Labels rows: {manifest['artifacts']['labels_stitched']['rows']}",
    f"Latest window: {manifest['latest_window']['win_id']}  "
    f"({manifest['latest_window']['train_start']}→{manifest['latest_window']['test_end']})",
    f"Validation status: {report['status']}  (checks: {len(report['checks'])})"
]
with open(os.path.join(REGIME_DIR, "run_fingerprint.txt"), "w", encoding="utf-8") as f:  # fingerprint contains non-ASCII "→"
    f.write("\n".join(fp_lines) + "\n")

print(json.dumps({
    "status": "2.10 reproducibility snapshot complete",
    "manifest": os.path.join(REGIME_DIR, "run_manifest.json"),
    "validation_report": rep_path,
    "fingerprint": os.path.join(REGIME_DIR, "run_fingerprint.txt"),
    "notes": [
        "For production, expand determinism checks (multiple random restarts held fixed).",  # TOCHANGE
        "Consider capturing git commit hash and Python/package versions for full lineage.",  # TOCHANGE
        "Posterior-sum, date order, and feature presence checks included."
    ]
}, indent=2))


{
  "status": "2.10 reproducibility snapshot complete",
  "manifest": "artifacts/regimes/run_manifest.json",
  "validation_report": "artifacts/regimes/validation_report.json",
  "fingerprint": "artifacts/regimes/run_fingerprint.txt",
  "notes": [
    "For production, expand determinism checks (multiple random restarts held fixed).",
    "Consider capturing git commit hash and Python/package versions for full lineage.",
    "Posterior-sum, date order, and feature presence checks included."
  ]
}


<details>
<summary>Summary of section 2</summary>

# Section 2.0–2.1

### **What We've Done**
- **2.0**:  
  - Defined configuration (`RegimeConfig`) for regime modeling: feature set, HMM params, recency weighting, I/O paths.
  - Loaded `features_filtered.parquet` from Section 1 and **assembled a clean, date-aligned market panel** containing:
    - SPY daily log returns (`spy_ret`)
    - SPY realized volatility (20-day) (`spy_rv_20`)
    - VIX level (`vix_close`) and change (`dvix`, optional)
    - Market breadth (`breadth`)
  - Saved this **market-level panel** to:
    - `artifacts/regimes/market_panel.parquet` (+ CSV if enabled), the *core input for all subsequent regime modeling steps*.
  - Wrote `regime_config_effective.json`, the **final config** used for reproducibility.

- **2.1**:  
  - Loaded the above market panel and **prepared train/test matrices for HMM** on a first walk-forward window:
    - Features: `spy_rv_20`, `vix_close`, `breadth`, `dvix`
    - Train period: `2007-02-06` → `2016-12-30`
    - Test period: `2017-01-03` → latest date (`2025-08-08`)
  - Standardized features **per train window** with `StandardScaler`; applied same transform to test.
  - Saved:
    - Per-window **scaler**: `scaler_<dates>.joblib`
    - **QC JSON** with mean/std per feature in train vs test.
    - **Window manifest** (`window_manifest.json`) describing date ranges, features, scaler path, and sample counts.
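
The train-only standardization from 2.1 can be sketched as below. This is a minimal NumPy stand-in for `StandardScaler` (which likewise uses the population standard deviation, `ddof=0`); the helper names `fit_standardizer` and `apply_standardizer` and the toy data are illustrative assumptions, not code from the notebook.

```python
import numpy as np

def fit_standardizer(X_train: np.ndarray) -> dict:
    """Estimate per-feature mean/std on the TRAIN window only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)                 # ddof=0, as StandardScaler does
    std = np.where(std == 0.0, 1.0, std)      # guard against constant features
    return {"mean": mean, "std": std}

def apply_standardizer(X: np.ndarray, params: dict) -> np.ndarray:
    """Apply the train-fitted statistics to any window (train or test)."""
    return (X - params["mean"]) / params["std"]

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(500, 4))  # toy stand-in for 4 HMM features
X_test = rng.normal(loc=6.0, scale=3.0, size=(200, 4))   # test window with drifted stats

params = fit_standardizer(X_train)
Z_train = apply_standardizer(X_train, params)
Z_test = apply_standardizer(X_test, params)               # never refit on test data
```

Because the statistics come from train only, the transformed test window is generally not zero-mean; that drift is what the QC JSON (train vs test mean/std) is meant to expose.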

---

### **Artifacts for Reuse**
| File | Contents | Purpose |
|------|----------|---------|
| `artifacts/regimes/market_panel.parquet` | Date-level DataFrame: `date`, `spy_ret`, `spy_rv_20`, `vix_close`, `breadth`, `dvix` (if enabled) | Core market context for HMM; already cleaned, aligned, NaN-free. |
| `artifacts/regimes/regime_config_effective.json` | JSON dump of final `RegimeConfig` | Ensures downstream sections use same config; includes feature list, HMM params, I/O paths. |
| `artifacts/regimes/scaler_<train>__<test>.joblib` | Fitted `StandardScaler` for the given walk-forward window | Apply same scaling to new data in this window. |
| `artifacts/regimes/scaler_<train>__<test>_qc.json` | QC metrics: train/test row counts, means, stds per feature | For diagnostics and reproducibility. |
| `artifacts/regimes/window_manifest.json` | Dict with train/test date ranges, feature list, scaler path, sample counts | Downstream code can iterate over these windows without recalculating splits. |

---

### **Key Variables**
| Variable | Value Example | Description |
|----------|---------------|-------------|
| `CFG` | `RegimeConfig(...)` | Active config object after merging defaults and `config.yaml`. |
| `mkt` | Pandas DataFrame, ~4657 rows × 6 cols | Market panel from 2.0; date-indexed features for HMM. |
| `hmm_feat_cols` | `["spy_rv_20", "vix_close", "breadth", "dvix"]` | Feature list for HMM modeling. |
| `window` | Dict with keys: `X_train`, `X_test`, `dates_train`, `dates_test`, `scaler_path`, `qc` | All matrices and metadata for one walk-forward window. |
| `manifest` | Dict with `window`, `features`, `scaler_path`, `n_train`, `n_test` | Summary of the first walk-forward split, persisted for reuse. |

---


# 2.2: Model Choice & Configuration (Gaussian HMM)

## What we did
- Trained a **GaussianHMM (covariance_type="full")** on the standardized train window from 2.1.
- Used a **time-decayed "recency" sampler** to bias training toward recent data (segment length 60, 80 segments, half-life 756 days).
- Enforced **regime persistence** with a sticky diagonal blend on the transition matrix (λ=0.15).
- Tried a small **K grid** (test run K=[2]) and **multiple inits** (2 restarts); **picked the best** by train log-likelihood on the *full* train sequence.
- Persisted a self-contained bundle for reuse.
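
One plausible reading of the "sticky diagonal blend" is a convex combination of the fitted transition matrix with the identity, `(1 - λ) * T + λ * I`. The sketch below assumes exactly that form; the function name `sticky_blend` and the toy matrix are hypothetical, and the notebook's actual blend may differ.

```python
import numpy as np

def sticky_blend(transmat: np.ndarray, lam: float) -> np.ndarray:
    """Blend a row-stochastic matrix toward the identity: (1 - lam) * T + lam * I."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    k = transmat.shape[0]
    blended = (1.0 - lam) * transmat + lam * np.eye(k)
    return blended / blended.sum(axis=1, keepdims=True)  # renormalize against float drift

T = np.array([[0.90, 0.10],
              [0.20, 0.80]])          # toy 2-state transition matrix
T_sticky = sticky_blend(T, lam=0.15)  # lam matches the test-run sticky_lambda
```

Under this assumed form, each diagonal entry moves toward 1 by a fraction λ of its distance, so off-diagonal (switching) probabilities shrink by the factor `1 - λ` while rows remain valid distributions.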

## Key hyperparams (test run; marked **#TOCHANGE** in code)
- `K grid`: `[2]` (real run: `[2, 3]`)
- `n_iter`: `200` (real run: `1000`)
- `n_init`: `2` (real run: `10`)
- `tol`: `1e-3` (real run: `1e-4`)
- `sticky_lambda`: `0.15` (real run: `0.30–0.50`)
- Recency: `half_life_days=756`, `seg_len=60`, `n_segments=80`, `epsilon_floor=0.10`
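
As a rough illustration of how the recency knobs could interact, the sketch below draws fixed-length segments with probability proportional to an exponentially decayed weight, floored at `epsilon_floor`. The weighting formula, the function name `sample_recency_segments`, and the choice to treat one row as one day are all assumptions; the notebook's sampler is not shown in this section.

```python
import numpy as np

def sample_recency_segments(n_obs, seg_len=60, n_segments=80, half_life_days=756,
                            epsilon_floor=0.10, seed=0):
    """Draw (start, end) index pairs, favoring segments that end near the latest row."""
    if n_obs < seg_len:
        raise ValueError("series shorter than one segment")
    starts = np.arange(0, n_obs - seg_len + 1)
    age = (n_obs - 1) - (starts + seg_len - 1)        # rows between segment end and last row
    weights = np.maximum(0.5 ** (age / half_life_days), epsilon_floor)
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    chosen = rng.choice(starts, size=n_segments, replace=True, p=probs)
    return [(int(s), int(s + seg_len)) for s in chosen], weights

segments, weights = sample_recency_segments(n_obs=2495)   # 2495 = n_train of window W0
lengths = [end - start for start, end in segments]        # per-sequence lengths for a multi-sequence fit
```

The floor keeps old history from vanishing entirely: even the oldest segment retains at least `epsilon_floor` relative weight, so crisis periods early in the train window can still be sampled.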

## Files produced (and what's inside)
- **`artifacts/regimes/regime_hmm.pkl`** *(joblib bundle)*  
  - `model`: fitted `GaussianHMM`  
  - `k`, `random_state`, `covariance_type`, `n_iter`, `n_init`, `tol`  
  - `features`: the exact feature list used  
  - **Scaler info:** `scaler_path` (points to the 2.1 scaler)  
  - **Windows:** `train_dates`, `test_dates`  
  - **Stickiness:** `sticky_lambda`  
  - **Recency config** and `fit_mode`  
  - `created_at`
- **`artifacts/regimes/hmm_kgrid.json`**: scores for all `(k, seed)` runs and the chosen combo.

## Reusable variables/outputs for Section 3
- **Bundle** (`regime_hmm.pkl`): trained HMM, feature list, scaler path, recency/stickiness configs.
- **Feature list**: `bundle["features"]`
- **Hyperparams**: recency and persistence knobs, if needed for downstream logic.

---

# 2.3: State Labeling & Semantics

## What we did
- Loaded `regime_hmm.pkl` + 2.1 scaler, scored posteriors for **all dates** in `market_panel.parquet`.
- Built **state profiles** on TRAIN only:
  - `spy_ret` (mean & std), `spy_rv_20` (mean), `vix_close` (mean), `breadth` (mean), `dvix` (if present), `ret_q05` (5% tail).
- **Assigned semantic labels**:
  - **Risk-On**: highest mean return (tie-breakers: breadth↑, VIX↓, better tails)
  - **Risk-Off**: highest vol & lowest return (tie-breakers: vol↑, ΔVIX↑, breadth↓)
  - **Transition**: remaining state
- Persisted labels + posteriors and saved the label map to ensure stability.
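
The labeling policy above can be approximated as follows. This sketch is deliberately simplified: it omits the tie-breakers, and it scores Risk-Off with a rank sum of volatility and negative return, which is an assumed way of combining "highest vol & lowest return". The function name and the toy profile values are hypothetical.

```python
import pandas as pd

def assign_semantic_labels(profiles: pd.DataFrame) -> dict:
    """Map state_id -> label from per-state TRAIN profiles (index: state_id)."""
    label_map = {}
    risk_on = profiles["ret_mean"].idxmax()
    label_map[int(risk_on)] = "Risk-On"
    rest = profiles.drop(index=risk_on)
    if not rest.empty:
        # Rank-sum score: high volatility and low return both push toward Risk-Off.
        score = rest["rv20_mean"].rank() + (-rest["ret_mean"]).rank()
        risk_off = score.idxmax()
        label_map[int(risk_off)] = "Risk-Off"
        for s in rest.index.drop([risk_off]):
            label_map[int(s)] = "Transition"
    return label_map

profiles = pd.DataFrame(
    {"ret_mean": [0.0008, -0.0012, 0.0001], "rv20_mean": [0.09, 0.31, 0.17]},  # toy values
    index=pd.Index([0, 1, 2], name="state_id"),
)
label_map = assign_semantic_labels(profiles)
```

With `K=2` there is no Transition state: the state that is not Risk-On becomes Risk-Off, which is consistent with the label map reported for the test run.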

## Files produced
- **`artifacts/regimes/regime_labels.parquet`** (+ CSV):
  - `date`
  - `state_id` (hard assignment)
  - `p0..pK-1` (posteriors)
  - `regime_label`
- **`artifacts/regimes/state_profiles.csv`**:
  - Per-state TRAIN stats (returns, vol, breadth, tails).
- **`artifacts/regimes/regime_meta.json`**:
  - `state_label_map`
  - `diagnostics.state_profiles_train`
  - `features_used`
  - Notes on labeling policy.

## Reusable variables/outputs for Section 3
- **`regime_labels.parquet`** (main regime feed for Section 3):
  - `regime_label` or max-posterior state.
  - `p*` columns for confidence metrics.
- **`regime_meta.json → state_label_map`**: ensures consistent label meanings across windows.
- **`state_profiles.csv`**: sanity check and seed values for regime-aware policies.

---

## Quick outputs recap from run
- 2.2: `{"chosen_k": 2, "fit_mode": "recency", "train_score": -8178.567..., "n_iter": 200, "n_init": 2, "sticky_lambda": 0.15, "half_life_days": 756, "seg_len": 60, "n_segments": 80}`
- 2.3: `{"k": 2, "label_map": {"0": "Risk-On", "1": "Risk-Off"}, "profiles_path": ".../state_profiles.csv", "labels_path": ".../regime_labels.parquet"}`

# 2.4: Smoothing, Persistence & Debounce

## What we did
- **Base path selection** (controlled by config):
  - `SMOOTH_MTH` in `CFG` → `"viterbi"` uses `model.predict(...)`; `"posterior"` uses `post.argmax(...)`.
- **Posterior threshold gating** (debounce step 1):
  - If a day's new state differs from the prior day but the **max posterior** that day `< P_THRESH`, keep the **previous** state.
  - Config knobs:
    - `P_THRESH = CFG.posterior_thresh` (default **0.55**)
    - `MIN_DWELL = CFG.min_dwell_days` (default **3**)
    - `SMOOTH_MTH = CFG.smoothing_method` (default **"posterior"**)
- **Minimum dwell enforcement** (debounce step 2):
  - Collapse any **short runs** (`run_len < MIN_DWELL`) to the better neighbor using average posterior over the short segment.
- **Label mapping**:
  - Map smoothed state IDs to labels via `state_label_map` from `regime_meta.json` (set in 2.3).
- **Gap handling**:
  - Dates are already business days from the panel; no forward-looking fills are introduced.
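
The two debounce steps can be sketched as a single function over a posterior matrix. This is an illustrative reimplementation based only on the description above (posterior base path, defaults `0.55` and `3`); the function name `debounce` and the toy posteriors are assumptions, and edge handling in the notebook may differ.

```python
import numpy as np

def debounce(post: np.ndarray, p_thresh: float = 0.55, min_dwell: int = 3) -> np.ndarray:
    """Posterior-threshold gating followed by minimum-dwell collapse."""
    out = post.argmax(axis=1).copy()
    # Step 1: a switch needs max posterior >= p_thresh, otherwise hold the previous state.
    for t in range(1, len(out)):
        if out[t] != out[t - 1] and post[t].max() < p_thresh:
            out[t] = out[t - 1]
    # Step 2: merge runs shorter than min_dwell into the neighbor with higher mean posterior.
    changed = True
    while changed:
        changed = False
        edges = np.flatnonzero(np.diff(out)) + 1
        starts = np.concatenate(([0], edges))
        ends = np.concatenate((edges, [len(out)]))
        for s, e in zip(starts, ends):
            if e - s >= min_dwell or (s == 0 and e == len(out)):
                continue
            neighbors = []
            if s > 0:
                neighbors.append(out[s - 1])
            if e < len(out):
                neighbors.append(out[e])
            out[s:e] = max(neighbors, key=lambda k: post[s:e, k].mean())
            changed = True
            break  # run boundaries moved; recompute them
    return out

p0 = np.array([0.9, 0.9, 0.9, 0.48, 0.9, 0.9, 0.2, 0.9, 0.9, 0.9])  # toy posterior of state 0
post = np.column_stack([p0, 1.0 - p0])
smoothed = debounce(post, p_thresh=0.55, min_dwell=3)
```

In the toy series, day 3 is a low-confidence flip (max posterior 0.52) removed by gating, while day 6 is a confident one-day flip that survives gating and is removed only by the dwell rule.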

## Inputs reused
- `artifacts/regimes/market_panel.parquet` (from 2.0)  
- `artifacts/regimes/window_manifest.json` (from 2.1)  
- `artifacts/regimes/regime_hmm.pkl` (from 2.2)  
- `artifacts/regimes/regime_labels.parquet` (from 2.3)  
- `artifacts/regimes/regime_meta.json` (from 2.3)

## Outputs
- **`artifacts/regimes/regime_labels.parquet`** *(updated in-place)*  
  - `date`
  - `p0..pK-1` (posteriors)
  - `state_id` (original argmax)
  - `state_id_smoothed` (after threshold + dwell collapse)  
  - `regime_label_smoothed` (`state_id_smoothed` → `state_label_map`)
- **`artifacts/regimes/regime_labels.csv`**
- **`artifacts/regimes/regime_meta.json`** *(updated)*  
  - Adds `diagnostics.smoothing`:
    - `method` (posterior|viterbi)
    - `posterior_thresh`
    - `min_dwell_days`
    - `dwell_stats`

## Reuse in Section 3+
- Use **`regime_label_smoothed`** (or `state_id_smoothed`) as the regime signal.
- Use **`p0..pK-1`** for confidence logic.
- Read **`diagnostics.smoothing.dwell_stats`** for dwell/chattering monitoring.

---

# 2.5: Robustness & Sensitivity

## What we did
- **Baseline context** from 2.2 and 2.1:
  - `features_base = bundle["features"]`
  - `k_base = bundle["k"]`
  - Recency/sticky params from bundle.
- **K sensitivity** (`K_GRID = [2, 3]`):
  - Refit with recency-weighted subsequences.
  - Score on train sequence; record best per K.
  - Compute **agreement vs baseline**.
- **Feature sensitivity**:
  - Variants: `baseline`, `no_vix`, `no_breadth`, `no_dvix`, `core_rv_vix`.
  - Refit at `k_base`, record label map, **agreement vs baseline**.
- **Era stability**:
  - Refit on `pre_2015`, `post_2015`, `crisis_2020`.
  - Record profiles, label map, transition matrix.
- **Bootstrap (block)**:
  - `BLOCK_DAYS = 20`, `BOOT_REPS = 5` (light test).
  - Refit and compute **agreement vs baseline**; report mean/std.

> All refits fit a local `StandardScaler` on the relevant subset.
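
A minimal sketch of the two building blocks behind the bootstrap check: block resampling of row indices and an agreement score. It assumes agreement is the share of dates with matching semantic labels (raw state IDs can permute across refits, so labels are compared instead); both helper names and the toy label sequences are hypothetical.

```python
import numpy as np

def block_bootstrap_indices(n_obs: int, block_days: int = 20, seed: int = 0) -> np.ndarray:
    """Resample row indices in contiguous blocks so short-range dependence is kept."""
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n_obs / block_days))
    starts = rng.integers(0, n_obs - block_days + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_days) for s in starts])
    return idx[:n_obs]

def agreement(labels_a, labels_b) -> float:
    """Share of dates on which two aligned semantic label sequences coincide."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("label sequences must be aligned")
    return float(np.mean(a == b))

idx = block_bootstrap_indices(n_obs=2495, block_days=20, seed=1)
baseline = np.array(["Risk-On", "Risk-On", "Risk-Off", "Risk-Off", "Risk-On"])
variant = np.array(["Risk-On", "Risk-Off", "Risk-Off", "Risk-Off", "Risk-On"])
score = agreement(baseline, variant)
```

In a full run, each bootstrap replicate would refit on `X[idx]`, relabel the original dates, and append its agreement score to the list whose mean and std are reported.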

## Inputs reused
- `artifacts/regimes/market_panel.parquet` (2.0)  
- `artifacts/regimes/window_manifest.json` (2.1)  
- `artifacts/regimes/regime_hmm.pkl` (2.2)  
- `artifacts/regimes/regime_labels.parquet` (2.3/2.4)

## Outputs
- **`artifacts/regimes/regime_sensitivity.json`**  
  - `created_at`
  - `inputs` (features_base, k_base, recency, sticky, etc.)
  - `results`:
    - **`k_sensitivity`**: best per K, agreement vs baseline, profiles, label map, transmat.
    - **`feature_sensitivity`**: best per feature set, agreement, profiles, label map, transmat.
    - **`era_stability`**: per era, profiles, label map, transmat.
    - **`bootstrap`**: agreement list, mean, std.

## Reuse in Section 3+
- Pick **K** balancing separation and stability.
- Decide on features based on agreement.
- Adapt hedging/caps for era drift.
- Gate production with bootstrap agreement thresholds.

---

## File Index for 2.4 & 2.5

| File | Produced/Updated in | Purpose |
|------|---------------------|---------|
| `artifacts/regimes/regime_labels.parquet` (+ `.csv`) | 2.4 | Adds smoothed IDs/labels; ensures posteriors. |
| `artifacts/regimes/regime_meta.json` | 2.4 | Smoothing diagnostics + state map. |
| `artifacts/regimes/regime_sensitivity.json` | 2.5 | K/feature/era/bootstrap results with stability metrics. |

# 2.6 & 2.7

# 2.6: Diagnostics & QA

**What we did**
- Computed core diagnostics from existing labels:
  - Transition matrix, steady-state distribution, dwell/run statistics, switch/chattering metrics.
- Generated plots:
  - `regime_timeline.png` (SPY price w/ regime shading), `timeline_drawdown.png`
  - `regime_posteriors.png` (stacked p's)
  - Per-state return histograms `state_<s>_ret_hist.png` and QQ plots `state_<s>_qq.png`
  - `transition_matrix_heatmap.png`, `dwell_time_distribution.png`
- Emitted tables and alerts for QA (semantics, dwell < 3d, chattering, mapping flips).

**Reused inputs (exact paths/objects)**
- `artifacts/regimes/market_panel.parquet` → market series (`date, spy_ret, spy_rv_20, vix_close, breadth, dvix`)
- `artifacts/regimes/window_manifest.json` → window bounds (train/test)
- `artifacts/regimes/regime_hmm.pkl` → bundle with `features`, `k` (for K fallback)
- `artifacts/regimes/regime_labels.parquet` → label timeline
  - Columns used: `date`, `p0..pK-1` (if present), `state_id` (or `state_id_smoothed`), `regime_label` (or `regime_label_smoothed`)
- `artifacts/regimes/state_profiles.csv` → state profile stats from 2.3 (fallback recompute if missing)
- `artifacts/regimes/regime_meta.json` → optional label map/notes

**Key variables (in-code names)**
- `DIAG_DIR = artifacts/regimes/diagnostics`
- `state_col` = `"state_id_smoothed"` if present else `"state_id"`
- `label_col` = `"regime_label_smoothed"` if present else `"regime_label"`
- `p_cols` = all columns starting with `"p"`; `K = len(p_cols)` (else fallback to bundle `k`)
- Diagnostics computed:
  - `Tmat` (K×K transition matrix), `ss_emp` (steady state)
  - `dwell` (run lengths by state), `switches`, `switch_rate`
  - `one_day_runs` (share of 1-day runs), `lt3_runs` (share runs <3d)
  - `alerts` list (semantics, dwell, chattering, flips)
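
The core diagnostics can be derived from a hard state sequence alone, as sketched below. The dictionary keys reuse the in-code names listed above; interpreting `ss_emp` as empirical state frequencies is an assumption, and `regime_diagnostics` is a hypothetical helper name.

```python
import numpy as np

def regime_diagnostics(states: np.ndarray, k: int) -> dict:
    """Empirical transition matrix, occupancy, run lengths, and chattering shares."""
    counts = np.zeros((k, k))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_tot = counts.sum(axis=1, keepdims=True)
    tmat = np.divide(counts, row_tot, out=np.zeros_like(counts), where=row_tot > 0)
    edges = np.flatnonzero(np.diff(states)) + 1           # first index of each new run
    run_lengths = np.diff(np.concatenate(([0], edges, [len(states)])))
    switches = int(len(edges))
    return {
        "Tmat": tmat,
        "ss_emp": np.bincount(states, minlength=k) / len(states),
        "run_lengths": run_lengths,
        "switches": switches,
        "switch_rate": switches / max(len(states) - 1, 1),
        "one_day_runs": float(np.mean(run_lengths == 1)),
        "lt3_runs": float(np.mean(run_lengths < 3)),
    }

states = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1, 1])  # toy smoothed state path
diag = regime_diagnostics(states, k=2)
```

A high `one_day_runs` or `lt3_runs` share is the kind of signal the dwell and chattering alerts would key on.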

**Outputs (exact filenames & contents)**
- PNGs in `artifacts/regimes/diagnostics/`:
  - `regime_timeline.png`, `timeline_drawdown.png`, `regime_posteriors.png`,
    `state_<s>_ret_hist.png`, `state_<s>_qq.png`,
    `transition_matrix_heatmap.png`, `dwell_time_distribution.png`
- CSVs in `artifacts/regimes/diagnostics/`:
  - `state_profiles_table.csv` (state_id, ret_mean, ret_std, rv20_mean, vix_mean, dvix_mean, breadth_mean, ret_q05)
  - `transition_matrix.csv` (row=from_i, cols=to_j), `steady_state.csv` (state_id, steady_state_prob)
  - `switches_by_year.csv` (year, n_switches), `summary_metrics.csv` (K, switches, switch_rate, one_day_runs_frac, lt3_runs_frac)
- JSON:
  - `alerts.json` (list of QA alerts)

---

# 2.7: Regime-Aware Policy Hooks (Interfaces to Sections 3–5)

**What we did**
- Built a single hand-off file for downstream portfolio logic:
  - Latest regime, smoothed confidence, and per-regime policy defaults (weights multipliers, turnover caps, risk targets, hedge intensity).
- Confidence proxy combines **max posterior** and **(1 − normalized entropy)**, then maps to an **aggressiveness scalar `g` ∈ [0.35, 1.00]**.
- If `p*` columns are missing, we rescore posteriors from the HMM bundle.
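
The confidence proxy might look like the sketch below. The text specifies the two ingredients and the output range, but not how they are combined; the equal-weight average and the linear map onto `[0.35, 1.00]` are assumptions made here for illustration.

```python
import numpy as np

G_MIN, G_MAX = 0.35, 1.00  # range stated in the text

def normalized_entropy(p: np.ndarray) -> float:
    """Shannon entropy divided by log(K): 0 = certain, 1 = uniform (requires K >= 2)."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def aggressiveness_from_confidence(p: np.ndarray) -> dict:
    c_max = float(np.max(p))
    c_entropy = 1.0 - normalized_entropy(p)
    c = 0.5 * (c_max + c_entropy)                                  # assumed equal-weight blend
    g = G_MIN + (G_MAX - G_MIN) * float(np.clip(c, 0.0, 1.0))      # assumed linear map
    return {"c_max": c_max, "c_entropy": c_entropy, "c": c, "g": g}

recent_post = np.array([[0.80, 0.20], [0.90, 0.10], [0.85, 0.15]])  # toy last N_SMOOTH=3 days
conf = aggressiveness_from_confidence(recent_post.mean(axis=0))
```

Averaging the last few posteriors before scoring keeps `g` from reacting to a single noisy day, which is the stated purpose of `N_SMOOTH`.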

**Reused inputs (exact paths/objects)**
- `artifacts/regimes/regime_labels.parquet` → posteriors & (smoothed) states
  - Uses `p_cols` when present; else rescored via bundle
  - Picks `state_col` / `label_col` as in 2.6
- `artifacts/regimes/regime_meta.json` → `state_label_map` (state → semantic label)
- `artifacts/regimes/regime_hmm.pkl` → `model`, `features`, `k`, `scaler_path` (for fallback scoring)
- `artifacts/regimes/window_manifest.json` → `scaler_path`, `window`
- `artifacts/regimes/market_panel.parquet` → fallback features matrix for rescoring

**Key variables (in-code names)**
- `OUT_PATH = artifacts/regimes/regime_policy_map.json`
- `N_SMOOTH = 3` (days) → average recent posteriors for confidence
- `p_cols` (derived or rescored), `K` (len(p_cols) or bundle `k`)
- `state_col`, `label_col` (same logic as 2.6)
- Confidence helpers: `entropy(p)`, `aggressiveness_from_confidence(p)` → returns `{c_max, c_entropy, c, g}`

**Output (exact file & schema)**
- `artifacts/regimes/regime_policy_map.json`
  - Top-level:
    - `created_at`, `latest_date`, `k`, `latest_regime_label`, `latest_state_id`
    - `latest_posteriors`: `{ "p0": float, ..., "p{K-1}": float }`
    - `confidence`:
      - `aggressiveness_scalar_g` (float in [0.35, 1.00])
      - `confidence`: `{ "c_max", "c_entropy", "c", "g" }`
      - `recommendations`: `{ "scale_position_sizes_by_g", "scale_turnover_cap_by_g", "scale_hedge_intensity_by_(1-g)" }`
    - `policy_by_regime`:
      - Keys: semantic labels present in data (e.g., `"Risk-On"`, `"Transition"`, `"Risk-Off"`) or synthesized `State<i>` when no mapping.
      - Values per regime:
        - `weights_multipliers`: `{ "momentum", "quality", "value", "low_vol" }`
        - `turnover_cap` (float)
        - `risk_target_vol_annual` (float, e.g., 0.10/0.08/0.06)
        - `hedge_intensity` (float)
    - `inputs`:
      - `labels_path`, `meta_path`, `bundle_path`, `scaler_path`
      - `features` (list), `window` (train/test bounds), `smoothing_window_days`
      - If present: `sensitivity_path = artifacts/regimes/regime_sensitivity.json`
      - If present: `diagnostics_dir = artifacts/regimes/diagnostics`
    - `signature` (sha256 over features/window/k)

**How Sections 3–5 should reuse**
- Read **one file**: `artifacts/regimes/regime_policy_map.json`.
  - Use `latest_regime_label` to branch logic.
  - Scale exposures and caps by `confidence.aggressiveness_scalar_g`.
  - Pull regime-specific knobs from `policy_by_regime[<label>]`.
  - Optionally reference `inputs.sensitivity_path` and `inputs.diagnostics_dir` for auditability.


## **Section 2.8: Walk-Forward Integration**

This section implements **rolling or expanding window walk-forward evaluation** for regime detection, matching the methodology in Section 6 (when available).  
It ensures **out-of-sample (OOS) scoring** and prevents **regime meaning drift** by preserving the `state → label` mapping per window.

---

### **Core Logic**
1. **Window Handling**
   - **Preferred:** Load `windows_manifest.json` (multi-window plan).
   - **Fallback:** Wrap `window_manifest.json` (single-window).
   - **Smoke Test Autogen:** Generate a small rolling test plan for quick runs.

2. **Per-Window Workflow**
   - **Train Phase:**
     - Fit `StandardScaler` and `GaussianHMM` **only on training subset**.
     - Apply **recency-weighted sampling** if enabled (`APPLY_RECENCY`).
   - **Test Phase:**
     - Transform features using the **train-fitted scaler**.
     - Predict posteriors (`p0...pK-1`) and hard regime states.
     - Map numeric states to semantic labels (`Risk-On`, `Risk-Off`, `Transition`) using training-set profiling.
     - Apply light debouncing (`min_dwell_days`) to remove 1-day flips.
   - **Save Artifacts:**
     - Model bundle (`regime_hmm_<winid>.pkl`) with scaler + params.
     - Labels (`regime_labels_<winid>.parquet` + `.csv`) with smoothed & raw states + posteriors.
     - Metadata (`regime_meta_<winid>.json`) with window dates, features, mappings, and file paths.

3. **Output Stitching**
   - Concatenate all test chunks into a **continuous timeline** (`regime_labels.parquet` + `.csv`) for backtests.
   - Save a **window index** (`windows_index.json` + `.csv`) with summary info.

---

### **Output Example**
```json
{
  "status": "2.8 walk-forward complete",
  "n_windows": 1,
  "windows": [
    {
      "win_id": "W0",
      "train_start": "2007-02-06",
      "train_end": "2016-12-30",
      "test_start": "2017-01-03",
      "test_end": "2025-08-08",
      "n_train": 2495,
      "n_test": 2162,
      "bundle_path": "artifacts/regimes/windowed/regime_hmm_W0.pkl",
      "labels_path": "artifacts/regimes/windowed/regime_labels_W0.parquet",
      "meta_path": "artifacts/regimes/windowed/regime_meta_W0.json"
    }
  ],
  "stitched_out": {
    "parquet": "artifacts/regimes/regime_labels.parquet",
    "csv": "artifacts/regimes/regime_labels.csv"
  },
  "windows_index": {
    "json": "artifacts/regimes/windowed/windows_index.json",
    "csv": "artifacts/regimes/windowed/windows_index.csv"
  },
  "notes": [
    "Per-window scaler fitted on TRAIN only; TEST scored out-of-sample.",
    "State→label semantics are saved per window and applied to the test chunk.",
    "For real run: increase N_ITER/N_INIT and recency sampler size; align windows with Section 6."
  ]
}
```

### **Other Details About 2.8**

**Files Reused**
- `market_panel.parquet` (from 2.0): main panel of market features (`date`, `spy_ret`, and `CFG.hmm_features`), sorted by date.
- `window_manifest.json` / `windows_manifest.json` (from 2.1): defines rolling/expanding window splits with train/test boundaries.
- `regime_sensitivity.json` (from 2.5, optional): stores results of regime sensitivity tests (e.g., best K values).
- Diagnostics directory (from 2.6, optional): extra per-state statistics or plots for debugging.

**Variables Reused**
- `CFG.hmm_features`: feature names for the HMM (from config).
- `CFG.min_dwell_days`: minimum days before switching regimes.
- `APPLY_RECENCY` / `HALF_LIFE_DAYS` / `SEG_LEN` / `N_SEGMENTS`: recency sampling parameters (from 2.2).
- `K`: number of HMM states (can come from sensitivity analysis in 2.5).
- `LAMBDA_STICK`: sticky transition smoothing factor.

**Artifacts Produced**
- Per-window:
  - Model + scaler bundle → `regime_hmm_<winid>.pkl`
  - Regime labels → `regime_labels_<winid>.parquet` (+ `.csv`)
  - Metadata → `regime_meta_<winid>.json`
- Global:
  - Continuous labels → `regime_labels.parquet` (+ `.csv`)
  - Windows index → `windows_index.json` (+ `.csv`)

## 2.9: Forward (Shadow) Mode

### What we implemented
- **Daily append loop**:
  1) Load latest window metadata via `windows_index.json` (fallback: scan `regime_meta_*.json`), then load the **bundle** (`regime_hmm_<winid>.pkl`) → `MODEL`, `SCALER`, `FEATS`, `K`.
  2) Read newest feature rows from `market_panel.parquet` **after** the last date in `regime_labels.parquet`.
  3) `SCALER.transform` → `MODEL.predict_proba` → `argmax` to get `state_id`.
  4) Map `state_id` → `regime_label` using meta's `state_label_map` (no re-profiling).
  5) Append rows (with `p0..p{K-1}`) to `regime_labels.parquet` (+ CSV), **no backfill**.
  6) **Log** each new row to JSONL with a **model signature hash**.
  7) **Alerts**: rolling last `ROLL_WINDOW_D` days; flag if switches ≥ `ROLL_MAX_SWITCH`.
  8) **Optional** policy refresh (2.7-lite): compute confidence on last `N_CONF_TAIL` posteriors and write `regime_policy_map.json`.

- **Retrain cadence**: not in the daily path; set to weekly/bi-weekly (**#TOCHANGE**).
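
Steps 2 and 7 of the daily loop can be sketched in isolation as follows. The helper names `select_new_rows` and `switch_alert`, the toy dates, and the threshold values passed in are assumptions; the notebook's actual values for `ROLL_WINDOW_D` and `ROLL_MAX_SWITCH` are not given in this section.

```python
import numpy as np
import pandas as pd

def select_new_rows(panel: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Panel rows dated strictly after the last labeled date (append-only, no backfill)."""
    if labels.empty:
        return panel.sort_values("date")
    return panel.loc[panel["date"] > labels["date"].max()].sort_values("date")

def switch_alert(state_ids, roll_window_d: int, roll_max_switch: int) -> dict:
    """Count state changes over the trailing window and flag chattering."""
    tail = np.asarray(state_ids)[-roll_window_d:]
    n_switches = int(np.count_nonzero(np.diff(tail)))
    return {"n_switches": n_switches, "alert": bool(n_switches >= roll_max_switch)}

panel = pd.DataFrame({"date": pd.bdate_range("2025-08-04", periods=5)})   # toy panel
labels = pd.DataFrame({"date": pd.bdate_range("2025-08-04", periods=3),
                       "state_id": [0, 0, 1]})                            # toy labels so far
new_rows = select_new_rows(panel, labels)

calm = switch_alert([0] * 30, roll_window_d=20, roll_max_switch=4)
choppy = switch_alert([0, 1] * 15, roll_window_d=20, roll_max_switch=4)
```

Selecting strictly later dates is what makes the loop idempotent: rerunning it on the same day finds no new rows and appends nothing.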

---

### Reusable globals / config knobs
- Paths:
  - `REGIME_DIR`, `PANEL_PATH`, `LAB_PATH_PQ`, `LAB_PATH_CSV`, `WIN_DIR`, `WIN_INDEX`
- Optional:
  - `START_DATE` (taken from `globals()` if present)
- Policy refresh:
  - `UPDATE_POLICY_MAP` (bool), `POLICY_OUT`
- Logging & alerts:
  - `FWD_LOG`, `ALERTS_FP`
- Guardrails / smoothing:
  - `FWD_DEBOUNCE` (bool), `ROLL_WINDOW_D` (int), `ROLL_MAX_SWITCH` (int)
- Confidence window:
  - `N_CONF_TAIL` (int)

**Bundle/meta fields (loaded, reused):**
- From `regime_hmm_<winid>.pkl` → `BUNDLE`: `model`, `scaler`, `features`, `k`
- From `regime_meta_<winid>.json` → `META`: `"state_label_map"`, `"window"` (for signature)

---

### Reusable helper functions
- `_load_latest_window_meta()`: choose latest window from `windows_index.json` (fallback scan of `regime_meta_*.json`); returns keys like `win_id`, `bundle_path`, `meta_path`, `test_end`.
- `_model_signature(features, k, window)`: SHA256 hash for audit/lineage.
- `_entropy(p)`: normalized entropy of a posterior vector.
- `_aggressiveness_from_posterior(p_mean)`: returns `{c_max, c_entropy, c, g}` for policy scaling.
- `_append_jsonl(path, rec)`: append a record to JSONL log.

*(Smoke-test only, but reusable if desired)*  
- `_pick_latest_window(win_index_json)`: pick latest window from index (used in forward smoke test).
- `_debounce_series(state_ids, min_dwell_days=CFG.min_dwell_days)`: minimal 1-day blip squash (used in smoke test).
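
As a rough illustration of the two confidence helpers, the sketch below computes a normalized entropy and maps a mean posterior to `{c_max, c_entropy, c, g}`. The entropy normalization (divide by `log K`) is standard; the blend of `c_max` and `c_entropy` and the linear map into `[0.35, 1.00]` are assumptions made for this example, since the notebook's exact formula is not shown here.

```python
import math
from typing import Dict, Sequence

G_MIN, G_MAX = 0.35, 1.00  # range quoted for g in the policy map (2.7)


def normalized_entropy(p: Sequence[float]) -> float:
    """Shannon entropy divided by log(K): 0 for one-hot, 1 for uniform."""
    k = len(p)
    if k < 2:
        return 0.0
    h = -sum(pi * math.log(pi) for pi in p if pi > 0.0)
    return h / math.log(k)


def aggressiveness_from_posterior(p_mean: Sequence[float]) -> Dict[str, float]:
    """Hypothetical mapping from a mean posterior to confidence terms."""
    k = len(p_mean)
    # Rescale the top posterior so that uniform maps to 0 and one-hot to 1.
    c_max = (max(p_mean) - 1.0 / k) / (1.0 - 1.0 / k) if k > 1 else 1.0
    c_entropy = 1.0 - normalized_entropy(p_mean)
    c = 0.5 * (c_max + c_entropy)        # assumed: equal-weight blend
    g = G_MIN + (G_MAX - G_MIN) * c      # assumed: linear map into [G_MIN, G_MAX]
    return {"c_max": c_max, "c_entropy": c_entropy, "c": c, "g": g}


print(aggressiveness_from_posterior([0.8, 0.15, 0.05]))
```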

---

### Files / artifacts (with exact paths) and what they contain
- **`artifacts/regimes/market_panel.parquet`**: full feature panel (`date` + `FEATS`) used to find new rows to score.
- **`artifacts/regimes/windowed/windows_index.json`**: list of window records; includes `win_id`, `train_*`, `test_*`, `bundle_path`, `meta_path`, `labels_path`.
- **`artifacts/regimes/windowed/regime_meta_<winid>.json`**: per-window meta including `"state_label_map"`, `"window"`, `"features"`, `"k"`, recency/sticky knobs, `"bundle_path"`.
- **`artifacts/regimes/windowed/regime_hmm_<winid>.pkl`**: joblib bundle with `model` (GaussianHMM), `scaler` (StandardScaler), `features` (list), `k` (int).
- **`artifacts/regimes/regime_labels.parquet`** (+ `regime_labels.csv`): master labels time series (stitched history + new forward rows). Columns:  
  `date`, `state_id`, `regime_label`, `p0..p{K-1}`, and if present, `state_id_smoothed`, `regime_label_smoothed`.
- **`artifacts/regimes/regime_forward_log.jsonl`**: one JSON record per appended row:  
  `{ "ts", "date", "model_sig", "state_id", "regime_label", "posteriors": { "p0":..., ... } }`.
- **`artifacts/regimes/forward_alerts.json`**: alerts like `"High switch count in last {ROLL_WINDOW_D}d: {sw} (>= {ROLL_MAX_SWITCH})"`.
- **`artifacts/regimes/regime_policy_map.json`** (optional refresh): latest label, posteriors, confidence `{g, c_max, c_entropy, c}`, and `"inputs"` with paths + `"signature"`.

*(Smoke-test outputs, written to a safe sandbox)*  
- **`artifacts/regimes/forward_smoketest/regime_labels_smoke_base.parquet`**: truncated base.
- **`artifacts/regimes/forward_smoketest/regime_labels_smoke.parquet`** (+ CSV): base + newly scored tail.
- **`artifacts/regimes/forward_smoketest/regime_policy_map_smoke.json`**: policy map built from smoke labels tail.
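
To make the forward-log record concrete, here is a small sketch of how one JSONL row with a model signature could be produced. The record keys follow the layout listed for `regime_forward_log.jsonl`; what exactly gets hashed (a canonical JSON of features, `k`, and window) and the `train_end` key are assumptions for illustration.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone


def model_signature(features, k, window) -> str:
    """SHA256 over a canonical JSON encoding (assumed payload layout)."""
    payload = json.dumps(
        {"features": list(features), "k": int(k), "window": window},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def forward_record(date, state_id, regime_label, posteriors, model_sig) -> dict:
    """Build one log record in the documented shape."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "date": date,
        "model_sig": model_sig,
        "state_id": int(state_id),
        "regime_label": regime_label,
        "posteriors": {f"p{i}": float(p) for i, p in enumerate(posteriors)},
    }


def append_jsonl(path, rec) -> None:
    """Append a single JSON object as one line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(rec) + "\n")


sig = model_signature(["spy_rv_20", "vix_close", "breadth"], 3, {"train_end": "2024-12-31"})
rec = forward_record("2025-01-02", 1, "Risk-Off", [0.1, 0.8, 0.1], sig)
log_path = os.path.join(tempfile.mkdtemp(), "regime_forward_log.jsonl")
append_jsonl(log_path, rec)
append_jsonl(log_path, rec)
```

Because the payload is serialized with `sort_keys=True`, the same features, `k`, and window always give the same signature, which is what makes it usable for lineage checks.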

---

### Key columns appended each day
- `date`, `state_id`, `regime_label`, `p0..p{K-1}`  
- (If the historical file already had them) `state_id_smoothed`, `regime_label_smoothed` are mirrored from raw.

## 2.10 - Configuration & Reproducibility

### What we implemented
- **Snapshot effective config & artifacts**:
  - Captures `CFG` keys (HMM features, min dwell, posterior threshold, recency flags).
  - Records artifact paths, row counts, SHA256 hashes.
  - Stores latest window metadata (`win_id`, train/test dates, bundle/meta paths).
  - Saves model parameters (`k`, covariance, n_iter, sticky_lambda, recency params).
  - Optional: environment snapshot (Python, platform, package versions).

- **Validations**:
  - Config key presence.
  - Dates sorted, no duplicates.
  - Posterior columns present and sum to 1.
  - Labels' dates ⊆ panel dates, no unexpected business day gaps.
  - Label semantics sanity check (risk-on/off profiles).
  - Determinism check (same inputs → identical posteriors).
  - Features in bundle all present in panel (basic leakage guard).

- **Auditability outputs**:
  - `run_manifest.json` → full config & artifact snapshot.
  - `validation_report.json` → pass/fail status of each check.
  - `run_fingerprint.txt` → short human-readable summary.
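
A few of the validations above can be sketched in plain Python. This is an illustrative reduction, not the notebook's implementation: `validate_labels` is a hypothetical wrapper, dates are assumed to be ISO strings (so lexical order equals chronological order), and the report layout is only modeled on the `status` plus per-check structure described for `validation_report.json`.

```python
from typing import Dict, List, Sequence


def add_check(report: List[Dict], name: str, ok: bool, details: str = "") -> None:
    """Append one validation result to the report."""
    report.append({"name": name, "ok": bool(ok), "details": details})


def validate_labels(dates: Sequence[str],
                    posteriors: Sequence[Sequence[float]],
                    tol: float = 1e-6) -> Dict:
    """Run date-order, uniqueness, and posterior-sum checks."""
    checks: List[Dict] = []
    add_check(checks, "dates_sorted", list(dates) == sorted(dates))
    add_check(checks, "dates_unique", len(set(dates)) == len(dates))
    worst = max((abs(sum(row) - 1.0) for row in posteriors), default=0.0)
    add_check(checks, "posteriors_sum_to_one", worst <= tol,
              f"max |sum - 1| = {worst:.2e}")
    status = "pass" if all(c["ok"] for c in checks) else "fail"
    return {"status": status, "checks": checks}


good = validate_labels(["2025-01-02", "2025-01-03"], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
bad = validate_labels(["2025-01-03", "2025-01-03"], [[0.7, 0.2, 0.2], [0.1, 0.1, 0.8]])
print(good["status"], bad["status"])
```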

---

### Reusable globals / config knobs
- **Paths**:
  - `REGIME_DIR`, `PANEL_PATH`, `LABELS_PATH`, `CONFIG_EFF`,  
    `WINDEX_PATH`, `WIN_DIR`, `DIAG_DIR`, `PROF_TABLE`
- **From CFG**:
  - `hmm_features`, `min_dwell_days`, `posterior_thresh`, `recency_weighting`

---

### Reusable helper functions
- `_sha256_file(path)`: file hash for artifact integrity.
- `_pick_latest_window(windex)`: load latest window metadata (fallback to W0).
- `_add_check(name, ok, details)`: append validation result to report.

---

### Files / artifacts produced
- **`artifacts/regimes/run_manifest.json`**: config + artifact snapshot with hashes, sizes, dates.
- **`artifacts/regimes/validation_report.json`**: structured validation results (`status`, per-check pass/fail).
- **`artifacts/regimes/run_fingerprint.txt`**: concise run summary (K, sticky_lambda, row counts, latest window).
- **`artifacts/regimes/run_env.json`** *(optional)*: Python & package versions, platform info.

---

### Reused artifacts from earlier sections
- `artifacts/regimes/market_panel.parquet` (2.0)
- `artifacts/regimes/regime_labels.parquet` (2.8)
- `artifacts/regimes/windowed/windows_index.json` (2.8 QoL)
- Latest `regime_hmm_<win>.pkl` + `regime_meta_<win>.json` (2.8)
- `artifacts/regimes/diagnostics/state_profiles_table.csv` (2.6; optional)

# Quick summary of everything

# 📦 Section 2 - Regime Modeling (HMM → Regime Labels & Probabilities)

**Goal:** Detect daily market regimes (**Risk-On**, **Risk-Off**, **Transition**) with posterior probabilities, for use in Sections 3–5 (alpha models, sizing, risk caps).

---

## 1️⃣ What We Built

### 2.0–2.1 - Market Panel & Train/Test Prep
- Created **clean, date-aligned market panel** from Section 1 features:
  - `spy_ret` (SPY log returns)
  - `spy_rv_20` (20d realized vol)
  - `vix_close` (+ optional `dvix` daily change)
  - `breadth` (% advancers in S&P)
- Saved to: `artifacts/regimes/market_panel.parquet`
- Generated **train/test matrices** for walk-forward window:
  - Standardized **per-train-window** using `StandardScaler`
  - QC checks: row counts, mean/std drift, NaNs
  - Saved `scaler_<train>__<test>.joblib` + QC JSON + `window_manifest.json`
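
The per-train-window standardization matters for leakage control: scaling statistics come from the train slice only and are then applied unchanged to the test slice. The NumPy sketch below mirrors what `StandardScaler` does (population standard deviation, constant columns left unscaled); the synthetic data and function names are illustrative only.

```python
import numpy as np


def fit_standardizer(x_train: np.ndarray):
    """Estimate per-column mean and std on the train window only."""
    mu = x_train.mean(axis=0)
    sd = x_train.std(axis=0)             # ddof=0, the convention StandardScaler uses
    sd = np.where(sd == 0.0, 1.0, sd)    # constant columns are left unscaled
    return mu, sd


def apply_standardizer(x: np.ndarray, mu: np.ndarray, sd: np.ndarray) -> np.ndarray:
    """Apply train-window statistics to any slice (train or test)."""
    return (x - mu) / sd


rng = np.random.default_rng(0)
x_train = rng.normal(loc=[0.0, 20.0], scale=[0.01, 5.0], size=(500, 2))
x_test = rng.normal(loc=[0.0, 30.0], scale=[0.01, 8.0], size=(60, 2))

mu, sd = fit_standardizer(x_train)
z_train = apply_standardizer(x_train, mu, sd)
z_test = apply_standardizer(x_test, mu, sd)

# A test-window mean far from 0 is the kind of drift the QC step would report.
print("test mean drift:", z_test.mean(axis=0).round(2))
```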

### 2.2 - HMM Model Training
- Trained **GaussianHMM** (`covariance_type="full"`) with:
  - Optional **recency-weighted sampling**
  - Sticky transitions for persistence
- Searched `K` in {2, 3}, picked best by log-likelihood
- Saved self-contained bundle: model, scaler path, config, training dates

### 2.3 - State Labeling
- Profiled states on train set → assigned semantic labels:
  - Risk-On: highest mean return, lowest vol
  - Risk-Off: highest vol, lowest return
  - Transition: remainder
- Persisted mapping in `regime_meta.json`
- Created regime timeline with posteriors
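
One way to turn the labeling rule into code is a rank-based score, sketched below. The tie-breaking scheme (return rank minus volatility rank) is an assumption for illustration; the summary only states which state should end up as Risk-On, Risk-Off, or Transition. The profile keys `mean_ret` and `vol` are likewise hypothetical.

```python
from typing import Dict


def label_states(profiles: Dict[int, Dict[str, float]]) -> Dict[int, str]:
    """Map state ids to semantic labels from per-state train-set profiles."""
    states = sorted(profiles)
    by_ret = sorted(states, key=lambda s: profiles[s]["mean_ret"])  # ascending return
    by_vol = sorted(states, key=lambda s: profiles[s]["vol"])       # ascending vol
    # High return rank and low vol rank give a high score.
    score = {s: by_ret.index(s) - by_vol.index(s) for s in states}
    ordered = sorted(states, key=lambda s: score[s])
    labels = {s: "Transition" for s in states}
    labels[ordered[0]] = "Risk-Off"
    labels[ordered[-1]] = "Risk-On"
    return labels


profiles = {
    0: {"mean_ret": 0.0008, "vol": 0.007},
    1: {"mean_ret": -0.0012, "vol": 0.022},
    2: {"mean_ret": 0.0002, "vol": 0.012},
}
print(label_states(profiles))
```

With `K = 2` the same function yields only Risk-On and Risk-Off, which matches the searched range of `K`.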

### 2.4 - Smoothing & Debounce
- Removed short noisy flips using:
  - Posterior threshold (`posterior_thresh`)
  - Min dwell days (`min_dwell_days`)
- Updated regime labels with smoothed states
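
A causal min-dwell filter is one plausible reading of the debounce step: the smoothed state only changes after the new state has persisted for `min_dwell_days` consecutive days, so no future information is used. The sketch below is an assumption about the mechanics, not a copy of `_debounce_series`, and it omits the posterior-threshold condition.

```python
from typing import List, Sequence


def debounce(state_ids: Sequence[int], min_dwell_days: int) -> List[int]:
    """Suppress flips that do not persist for at least min_dwell_days."""
    if not state_ids:
        return []
    out: List[int] = []
    current = state_ids[0]        # accepted (smoothed) state
    candidate, run = current, 0   # pending state and its consecutive-day count
    for s in state_ids:
        if s == current:
            candidate, run = current, 0   # pending switch is cancelled
        elif s == candidate:
            run += 1
        else:
            candidate, run = s, 1
        if candidate != current and run >= min_dwell_days:
            current, run = candidate, 0   # switch is confirmed
        out.append(current)
    return out


raw = [0, 0, 1, 0, 0, 1, 1, 1, 0]
print(debounce(raw, min_dwell_days=2))
```

The price of causality is lag: a genuine regime change is recognized `min_dwell_days - 1` days late.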

### 2.5 - Robustness Tests
- Sensitivity to:
  - `K` choice
  - Feature removal
  - Era splits (pre/post-2015, 2020 crisis)
- Block bootstrap stability check
- Saved results for audit

### 2.6 - Diagnostics & QA
- Computed:
  - Transition matrix, dwell-time stats, chattering metrics
  - State return distributions & QQ plots
- Generated plots + summary CSVs + QA alerts

### 2.7 - Regime Policy Map
- Created single JSON for downstream use:
  - Latest regime + confidence score
  - Per-regime weights, turnover caps, risk targets, hedge intensity
  - Confidence scalar `g` ∈ [0.35, 1.00]

### 2.8 - Walk-Forward Integration
- Automated multi-window HMM training & stitching of test outputs
- Ensured **state→label** stability across windows
- Produced continuous regime timeline for backtests

### 2.9 - Forward (Shadow) Mode
- Daily append loop:
  - Score new rows from `market_panel.parquet`
  - Append regime + posteriors to `regime_labels.parquet`
  - Optional policy map refresh
  - Alerts if excessive regime switches

### 2.10 - Config & Reproducibility
- Snapshotted:
  - Effective config
  - Artifact paths & hashes
  - Validation checks
- Produced concise run fingerprint

---

## 2️⃣ Key Global Variables & Functions (Reusable)

| Name | Type | Description |
|------|------|-------------|
| `CFG` | `RegimeConfig` | Loaded from `config.yaml` + defaults; holds all regime model params & paths |
| `mkt` | `pd.DataFrame` | Clean market panel (`market_panel.parquet`) |
| `hmm_feat_cols` | `list[str]` | Features used for HMM (e.g. `["spy_rv_20","vix_close","breadth","dvix"]`) |
| `window` | `dict` | Train/test matrices & metadata for one walk-forward window |
| `manifest` | `dict` | Summary of window bounds, features, scaler path, sample counts |
| `_entropy(p)` | `func` | Normalized entropy from posterior vector |
| `_aggressiveness_from_posterior(p)` | `func` | Returns `{c_max, c_entropy, c, g}` for sizing/risk scaling |
| `_debounce_series(states, min_dwell)` | `func` | Remove short state flips |
| `_model_signature(features,k,window)` | `func` | SHA256 signature for auditability |
| `_load_latest_window_meta()` | `func` | Loads latest walk-forward model/meta paths |

---

## 3️⃣ Artifacts & Their Contents

| File | Purpose | Key Fields |
|------|---------|------------|
| `market_panel.parquet` | Core HMM input features | `date, spy_ret, spy_rv_20, vix_close, breadth, dvix` |
| `regime_hmm.pkl` | HMM model bundle | model, features, scaler_path, training dates, config |
| `regime_labels.parquet` (+ CSV) | Regime timeline | `date, state_id, p0..pK-1, regime_label` (+ smoothed) |
| `regime_meta.json` | State→label mapping & diagnostics | mapping, profiles, config |
| `state_profiles.csv` | Per-state stats | mean/std returns, vol, VIX, breadth, tails |
| `regime_sensitivity.json` | Robustness test results | k/feature/era/bootstrap outcomes |
| `diagnostics/*.png` | Plots | timeline, posteriors, histograms, QQ, transmat, dwell dist |
| `diagnostics/*.csv` | Metrics | state_profiles_table, transition_matrix, steady_state, run stats |
| `regime_policy_map.json` | Per-regime strategy knobs | latest regime, g-scalar, per-regime caps & weights |
| `windows_index.json` | Walk-forward plan | window IDs, dates, artifact paths |
| `regime_forward_log.jsonl` | Forward mode log | date, state, posteriors, model signature |
| `forward_alerts.json` | Alerts | excessive switching, anomalies |
| `run_manifest.json` | Full config + artifact snapshot | cfg keys, paths, hashes |
| `validation_report.json` | Pass/fail checks | leakage, dates, semantics |
| `run_fingerprint.txt` | Short summary | key params & latest status |

---

## 4️⃣ How to Reuse in Later Sections

- **For alpha models (Section 3)**  
  - Read `regime_labels.parquet` (use smoothed label columns)  
  - Use `p*` columns for regime-confidence scaling  
  - Read `regime_policy_map.json` to set factor weights, turnover caps, hedge targets  

- **For walk-forward runs**  
  - Use `windows_index.json` to iterate windows  
  - Load matching `regime_hmm_<win>.pkl` and `regime_meta_<win>.json`  

- **For forward mode**  
  - Extend `regime_labels.parquet` daily using `_load_latest_window_meta()` and scoring pipeline  
  - Refresh `regime_policy_map.json` as needed  

- **For diagnostics or tuning**  
  - Use `regime_sensitivity.json` to choose stable `K` and feature set  
  - Use `diagnostics/` CSVs for deeper QA or visual overlays  
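
For downstream consumers, the reading pattern above could look like the following sketch. It works on an in-memory frame with the documented column names; using the row-wise maximum posterior as the confidence value is an assumption for illustration, and `regime_view` is a hypothetical helper.

```python
import pandas as pd


def regime_view(labels: pd.DataFrame) -> pd.DataFrame:
    """Pick the smoothed label when available and derive a max-posterior confidence."""
    has_smoothed = "regime_label_smoothed" in labels.columns
    label_col = "regime_label_smoothed" if has_smoothed else "regime_label"
    # Posterior columns are named p0, p1, ..., p{K-1}.
    p_cols = [c for c in labels.columns if c.startswith("p") and c[1:].isdigit()]
    out = labels[["date"]].copy()
    out["regime"] = labels[label_col]
    out["regime_conf"] = labels[p_cols].max(axis=1)
    return out


labels = pd.DataFrame({
    "date": pd.to_datetime(["2025-01-02", "2025-01-03"]),
    "state_id": [0, 1],
    "regime_label": ["Risk-On", "Risk-Off"],
    "regime_label_smoothed": ["Risk-On", "Risk-On"],
    "p0": [0.9, 0.4],
    "p1": [0.1, 0.6],
})
print(regime_view(labels))
```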

---

## 5️⃣ Deliverables Checklist ✅

- [x] `regime_labels.parquet` (+ CSV)  
- [x] `regime_hmm.pkl`  
- [x] `regime_meta.json`  
- [x] `regime_timeline.png`, `regime_posteriors.png`  
- [x] `state_profiles.csv`  
- [x] `transition_matrix.csv`  
- [x] `regime_sensitivity.json`  
- [x] `regime_policy_map.json`  

---

**Next Dev Tip:**  
All core regime logic is already modularized. Before coding, scan `regime_policy_map.json` and `regime_labels.parquet`; they cover 90% of what you'll need without touching model code.



</details>


In [None]:
# 2.11

# 3. Alpha Layer (Signals)

<details>
<summary><strong>Section 3 - Alpha Layer (Signals)</strong></summary>

**Goal:**  
Build the full **signal-generation layer** that produces **daily, regime-aware cross-sectional alpha forecasts** (mean + uncertainty) for each asset in the universe, with strict leakage control. Outputs must be clean, reproducible, and validated, ready for **Section 4 (Portfolio Construction & Risk)** and **Section 5 (RL Sizing)**.  
This section transforms **modeling-ready features** (Section 1) and **regime context** (Section 2) into actionable, confidence-scored forecasts via **multifactor composites, ML overlays, regime-aware blending, and uncertainty aggregation**.

---

### **3.0 Scope & Inputs (reuse, don't recompute)**  
**Description:** Define what data and configurations are needed to run the alpha modeling. Pull in clean, preprocessed features from Section 1 and regime labels from Section 2, and set up the alpha modeling configuration (`AlphaConfig`) with horizons, models, CV parameters, and uncertainty estimation methods.  

**From Section 1:**  
- `features_filtered.parquet`: modeling-ready per-asset panel.  
- `universe.csv`: canonical equities universe.  
- `cs_cols`, `non_feature_cols`, `cols_to_shift`, `dates_all`, `px_daily_all`.

**From Section 2:**  
- `artifacts/regimes/regime_labels.parquet`: regimes & probabilities.  
- `artifacts/regimes/regime_policy_map.json`: latest regime & confidence scalar `g`.  
- `artifacts/regimes/window_manifest.json` or `windows_index.json`: walk-forward bounds.

**New config:**  
- `AlphaConfig` (horizons, target definition, models, CV, UQ, losses, regime hooks, artifact paths).

---

### **3.1 Targets & Panels**  
**Description:** Generate prediction targets and assemble training/testing panels for each walk-forward window. Targets are excess returns over hedges (SPY, sector ETFs) for 5- and 10-day horizons. Ensure leakage-free construction.  

**Outputs:**  
- `targets.parquet`: per horizon.  
- Per-window `panel_train` / `panel_test` parquet files with features, targets, regime info.
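
A leakage-free forward target can be sketched for a single asset as below. It assumes daily log returns (as used for `spy_ret`), defines the h-day target at date `t` as the sum of returns on `t+1 .. t+h`, and subtracts the same quantity for the hedge; the toy numbers are made up. For the full panel the same computation would be applied per ticker.

```python
import pandas as pd


def forward_log_return(r: pd.Series, h: int) -> pd.Series:
    """Sum of the next h daily log returns: r[t+1] + ... + r[t+h]."""
    # rolling(h).sum() at t+h covers t+1..t+h; shift(-h) moves it back to t.
    return r.rolling(h).sum().shift(-h)


def excess_target(asset_ret: pd.Series, hedge_ret: pd.Series, h: int) -> pd.Series:
    """Forward h-day asset return minus forward h-day hedge return."""
    hedge = hedge_ret.reindex(asset_ret.index)
    return forward_log_return(asset_ret, h) - forward_log_return(hedge, h)


dates = pd.bdate_range("2025-01-01", periods=8)
asset = pd.Series([0.01, -0.02, 0.03, 0.00, 0.01, 0.02, -0.01, 0.01], index=dates)
spy = pd.Series(0.005, index=dates)

y5 = excess_target(asset, spy, h=5)
print(y5)
```

The last `h` rows are NaN by construction, because their label window extends past the available data; dropping them rather than filling them is what keeps the target free of look-ahead.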

---

### **3.2 Feature Sets**  
**Description:** Select and optionally enhance the clean features from Section 1. Add interaction features and regime-aware taps if configured. Save exact feature lists for reproducibility.  

**Outputs:**  
- `feature_list.json`: exact features used.  
- `feature_importance_baseline.csv`: baseline feature importances.

---

### **3.3 Model Suite**  
**Description:** Train multiple base models for each horizon and window:  
1. Multifactor composite: value, momentum, quality blend per regime.  
2. LSTM: sequence modeling with MC-Dropout for uncertainty.  
3. Tabular ensembles: LightGBM, XGBoost, optional MLP with quantile heads.  
4. Stacking meta-learner: combines base model outputs optimally.  

**Outputs:**  
- Model artifacts, OOF/TEST predictions, feature importances.

---

### **3.4 Uncertainty, Calibration, & Confidence**  
**Description:** Aggregate uncertainty estimates from MC-Dropout, quantile spreads, and model dispersion. Calibrate probability outputs and compute expected Sharpe ratios as a confidence score.  

**Outputs:**  
- `uq_summary.json`: uncertainty stats.  
- Calibration plots & CSVs.
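
As a hedged illustration of the aggregation idea, the sketch below converts a quantile spread into a standard-deviation estimate (assuming roughly normal forecast errors), pools several sigma sources by root-mean-square, and forms an expected Sharpe as mean over sigma. The pooling rule and function names are assumptions; the plan above does not fix them.

```python
import math
from typing import Sequence

Z90 = 1.2815515655446004  # standard normal 90th percentile


def sigma_from_quantiles(q10: float, q90: float) -> float:
    """Std implied by a 10%-90% spread under a normal assumption."""
    return (q90 - q10) / (2.0 * Z90)


def combine_sigmas(sigmas: Sequence[float]) -> float:
    """Root-mean-square pooling of several uncertainty estimates (assumed rule)."""
    return math.sqrt(sum(s * s for s in sigmas) / len(sigmas))


def expected_sharpe(mean: float, sigma: float, eps: float = 1e-12) -> float:
    """Forecast mean per unit of forecast uncertainty."""
    return mean / max(sigma, eps)


sigma_mc = 0.020                                   # e.g. std across MC-Dropout passes
sigma_q = sigma_from_quantiles(-0.030, 0.034)      # from quantile heads
sigma_disp = 0.015                                 # dispersion across base models
sigma = combine_sigmas([sigma_mc, sigma_q, sigma_disp])
print(round(expected_sharpe(0.004, sigma), 3))
```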

---

### **3.5 Regime-Aware Blending**  
**Description:** Fit per-regime weights on base model outputs and blend them according to regime probabilities. Smooth weights over time and cap risk-prone factors in adverse regimes.  

**Outputs:**  
- `blend_<h>_win<id>.json`: blending configs.  
- Diagnostics CSVs.
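
The probability-weighted blend described above can be sketched in a few lines: each regime has its own model weights, and the effective weights for a given day are the posterior-weighted average of them. The weight values, model names, and helper names below are placeholders, and the time-smoothing and capping steps are omitted.

```python
from typing import Dict


def blend_weights(regime_probs: Dict[str, float],
                  per_regime_w: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Mix per-regime model weights by regime probabilities, then renormalize."""
    models = sorted({m for w in per_regime_w.values() for m in w})
    mixed = {
        m: sum(p * per_regime_w[r].get(m, 0.0) for r, p in regime_probs.items())
        for m in models
    }
    total = sum(mixed.values())
    return {m: v / total for m, v in mixed.items()} if total > 0 else mixed


def blend_forecast(weights: Dict[str, float], preds: Dict[str, float]) -> float:
    """Weighted sum of base-model forecasts for one asset and date."""
    return sum(weights[m] * preds[m] for m in weights)


per_regime_w = {
    "Risk-On":    {"factor": 0.3, "lstm": 0.3, "gbm": 0.4},
    "Risk-Off":   {"factor": 0.6, "lstm": 0.1, "gbm": 0.3},
    "Transition": {"factor": 0.4, "lstm": 0.2, "gbm": 0.4},
}
probs = {"Risk-On": 0.7, "Risk-Off": 0.1, "Transition": 0.2}
w = blend_weights(probs, per_regime_w)
print(w, blend_forecast(w, {"factor": 0.002, "lstm": -0.001, "gbm": 0.004}))
```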

---

### **3.6 Walk-Forward Integration**  
**Description:** Execute the modeling loop over each walk-forward window, save OOF and TEST predictions, and stitch TEST predictions into continuous time-series files.  

**Outputs:**  
- `alpha_raw.parquet`: per model/horizon.  
- `alpha_ensemble.parquet`: final blended forecasts with mean, sigma, e_sharpe, regime info.

---

### **3.7 Quality Gates & Diagnostics**  
**Description:** Validate outputs against leakage, stability, and performance thresholds. Check information coefficients, spreads, residual betas, and uncertainty calibration.  

**Outputs:**  
- `validation_report.json`: pass/fail on all checks.  
- Diagnostics plots and ablation studies.

---

### **3.8 Interfaces to §4 & §5**  
**Description:** Export all necessary artifacts for the portfolio construction and RL sizing stages, including the stitched ensemble predictions, targets, and feature lists. Provide helper functions to build RL state inputs.  

**Outputs:**  
- `alpha_ensemble.parquet`  
- `targets.parquet`  
- `feature_list.json`

---

### **3.9 Reproducibility & Manifest**  
**Description:** Save all configuration, run manifests, hashes, seeds, and metadata to guarantee deterministic reruns.  

**Outputs:**  
- `alpha_config_effective.json`  
- `run_manifest.json`  
- `run_fingerprint.txt`  
- `cv_manifest_<winid>.json`

---

### **3.T - Test Plan & Definition of Done (DoD)**  
**Description:** Define all validation checks required for completion and passing of Section 3.  

**Checks:**  
- **Data integrity & leakage**: no NaNs, unique keys, lag compliance.  
- **Model integrity**: correct CV setup, acceptable OOF–TEST gap.  
- **Performance**: IC ≥ 0.05 or spread ≥ 20 bps, per-regime IC ≥ 0.03 in ≥2 regimes.  
- **Stability**: no long negative IC runs.  
- **UQ**: monotonic relationship between confidence and realized returns.  
- **Neutrality**: |β_SPY| ≤ 0.05, balanced sector exposure.  
- **Blending**: regime weights bounded, stable over time.  
- **Artifacts**: all required outputs present and match manifest.

**DoD:** Section 3 is complete when:  
- All artifacts are written.  
- All tests in `validation_report.json` pass.  
- Performance thresholds met.  
- Reproducibility manifests are updated and correct.
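
The IC threshold above is usually evaluated as a rank correlation between forecasts and realized returns across assets on one date. The dependency-free sketch below computes a Spearman IC with average ranks for ties; treating IC as Spearman rather than Pearson is an assumption, since the checklist does not say which is meant.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; NaN when either side is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def spearman_ic(pred: Sequence[float], realized: Sequence[float]) -> float:
    """Cross-sectional information coefficient for one date."""
    return pearson(average_ranks(pred), average_ranks(realized))


ic = spearman_ic([0.01, 0.03, -0.02, 0.00], [0.004, 0.010, -0.001, 0.020])
print(round(ic, 3), "passes" if ic >= 0.05 else "fails")
```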

</details>


## **3.0 Scope & Inputs (Reuse, Don't Recompute)**

**Goal:**  
Initialize the **Alpha Layer** workflow by defining **all inputs, configurations, and dependencies** required for model training and forecasting. This step reuses existing cleaned artifacts from Section 1 (features) and Section 2 (regime modeling) rather than recomputing them, ensuring consistency and reproducibility across the pipeline.

---

### **Description**
The Alpha Layer relies on:
- **Feature data**: modeling-ready, leakage-free cross-sectional features from Section 1.
- **Regime data**: daily regime labels and probabilities from Section 2, plus regime-specific control parameters.
- **Model configuration**: an `AlphaConfig` object specifying target horizons, models to train, cross-validation structure, uncertainty estimation methods, and artifact paths.

This setup phase ensures that all subsequent modeling steps (Sections 3.1–3.9) have consistent, correctly versioned inputs and configuration files. No modeling occurs here; the step covers only input gathering, integrity checks, and config initialization.

---

### **Inputs from Section 1 - Feature Engineering**
- `features_filtered.parquet`  
  → Fully preprocessed, date-aligned, leakage-free per-asset panel for the trading universe.  
- `universe.csv`  
  → Canonical equity universe (S&P 500 constituents or a dynamically filtered subset).  
- Supporting metadata:  
  - `cs_cols`: cross-sectional feature names.  
  - `non_feature_cols`: identifiers, timestamps, and metadata columns.  
  - `cols_to_shift`: features requiring lag alignment.  
  - `dates_all`: aligned date index for all assets.  
  - `px_daily_all`: adjusted daily close prices for target computation.

---

### **Inputs from Section 2 - Regime Modeling**
- `artifacts/regimes/regime_labels.parquet`  
  → Daily regime classification (Risk-On, Risk-Off, Transition) with posterior probabilities.  
- `artifacts/regimes/regime_policy_map.json`  
  → Latest regime label and **confidence scalar** `g` for blending and model weighting.  
- `artifacts/regimes/window_manifest.json` or `windows_index.json`  
  → Walk-forward train/test window boundaries for reproducible modeling.

---

### **New Configuration: AlphaConfig**
A single, explicit configuration object for the Alpha Layer, containing:
- **Horizon settings**: prediction targets (e.g., `t+5`, `t+10` excess returns).  
- **Target definition**: return computation method (hedged/unhedged, volatility adjustment).  
- **Model suite**: multifactor composite, LSTM, tabular ensembles, stacking meta-learner.  
- **Cross-validation**: purged & embargoed folds, windowing parameters.  
- **Uncertainty estimation**: MC-Dropout, quantile regression, ensemble dispersion.  
- **Loss functions & optimization**: per-model objective configuration.  
- **Regime hooks**: regime-specific model weights, factor emphasis, and risk constraints.  
- **Artifact paths**: output directories for models, predictions, diagnostics, and manifests.
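
Because the targets span 5 to 10 days, plain K-fold would let training labels overlap the test block. The sketch below shows one conservative way to build purged and embargoed folds over date indices: training dates within `horizon` days of the test block on either side are dropped, plus an extra `embargo` after it. The function name and parameters are hypothetical and not taken from `AlphaConfig`.

```python
from typing import Iterator, List, Tuple


def purged_embargoed_splits(n_dates: int, n_folds: int, horizon: int,
                            embargo: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) over contiguous date blocks."""
    fold_size = n_dates // n_folds
    for f in range(n_folds):
        start = f * fold_size
        stop = n_dates if f == n_folds - 1 else start + fold_size
        test = list(range(start, stop))
        # Purge: labels of train dates in [start - horizon, start) reach into the test block.
        lo = start - horizon
        # Purge + embargo: labels of test dates reach up to `horizon` days past the block.
        hi = stop + horizon + embargo
        train = [t for t in range(n_dates) if t < lo or t >= hi]
        yield train, test


for train, test in purged_embargoed_splits(n_dates=100, n_folds=5, horizon=5, embargo=2):
    print(f"test=[{test[0]}, {test[-1]}]  train_size={len(train)}")
```

In a panel setting the indices would refer to unique dates, with all tickers on a given date assigned to the same side of the split.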

---

### **Outputs of 3.0**
- `alpha_config_effective.json`: frozen configuration used for this run.  
- `alpha_input_manifest.json`: versioned list of all Section 1 & 2 inputs, with hashes for reproducibility.  
- Initial log entry confirming that **all required inputs are present, consistent, and in sync** with the latest run of Sections 1 & 2.

---

**Definition of Done (3.0)**  
✅ All inputs from Sections 1 & 2 successfully loaded and validated.  
✅ `AlphaConfig` populated and saved.  
✅ No recomputation of features or regimes; hashes match expected values.  
✅ Input manifest written and logged for reproducibility.


In [None]:
# load features_filtered.parquet
# --- Inspect features_filtered.parquet & flag any "SPY" in column names ---

# If needed in your Colab, uncomment the next line:
# !pip -q install pyarrow fastparquet

import re, pandas as pd, numpy as np

PATH = "features_filtered.parquet"   # adjust if stored elsewhere
df = pd.read_parquet(PATH)

# Summary
n_rows, n_cols = df.shape
date_min = pd.to_datetime(df["date"]).min() if "date" in df.columns else None
date_max = pd.to_datetime(df["date"]).max() if "date" in df.columns else None
n_tickers = df["ticker"].nunique() if "ticker" in df.columns else None
print(f"{PATH} → rows={n_rows:,}, cols={n_cols:,}, tickers={n_tickers}, dates=[{date_min} → {date_max}]")

# Columns that are *not* predictive features
NON_FEATURE_COLS = {"date","ticker","open","high","low","close","adj_close","volume"}

# Buckets for readability (same logic as your pipeline)
CONTEXT_COLS = {"spy_rv_20","vix_close","breadth","spy_ret"}
FUNDAMENTALS = {"book_to_price","earnings_yield","cf_yield","shareholder_yield",
                "gross_profitability","roe","accruals","leverage"}

all_cols = list(df.columns)
feature_cols = [c for c in all_cols if c not in NON_FEATURE_COLS]

# Group by type
lags = sorted([c for c in feature_cols if re.fullmatch(r"ret_lag_\d+", c)], key=lambda x: int(x.split("_")[-1]))
context = [c for c in feature_cols if c in CONTEXT_COLS]
fundas = [c for c in feature_cols if c in FUNDAMENTALS]
funda_masks = sorted([c for c in feature_cols if c.endswith("_is_missing") and c.replace("_is_missing","") in FUNDAMENTALS])

TECH_BASE = {
    "ret_1d","rv_20","atr_14","mom_20","mom_6m","mom_12m","mom_12_1","mom_6_1",
    "sma_20","sma_50","sma_20_gt_50","slope_20","mom_20_vs_vol"
}
tech_known = sorted([c for c in feature_cols if c in TECH_BASE])
tech_extra = sorted([
    c for c in feature_cols
    if c not in tech_known + lags + context + fundas + funda_masks
])

# Print clean inventory
def show(title, cols):
    print(f"\n{title} ({len(cols)}):")
    if cols: print(", ".join(cols))
    else:    print("-")

print(f"\nPredictive feature columns: {len(feature_cols)} of {n_cols}")
show("Market context", context)
show("Price/technical (known)", tech_known)
show("Price/technical (extra detected)", tech_extra)
show("Return lags", lags)
show("Fundamentals", fundas)
show("Fundamentals: missing masks", funda_masks)

# ----- SPY name scan -----
# Flag any column that contains "SPY" anywhere (e.g., "ESPY"), case-insensitive.
spy_name_hits = [c for c in feature_cols if "SPY" in c.upper()]

# Allow-listed SPY references (expected market context)
SPY_WHITELIST = {"spy_rv_20","spy_ret"}  # extend if you intentionally keep others
spy_suspect = [c for c in spy_name_hits if c.lower() not in SPY_WHITELIST]

print("\nColumns containing 'SPY' (case-insensitive):", spy_name_hits or "None")
if spy_suspect:
    print("⚠️ Unexpected 'SPY' in feature names (not in whitelist):", spy_suspect)
else:
    print("✅ No unexpected 'SPY' tokens in feature names (only allowed context present).")


features_filtered.parquet → rows=2,223,976, cols=94, tickers=514, dates=[2007-02-05 00:00:00 → 2025-08-11 00:00:00]

Predictive feature columns: 86 of 94

Market context (3):
spy_rv_20, vix_close, breadth

Price/technical (known) (13):
atr_14, mom_12_1, mom_12m, mom_20, mom_20_vs_vol, mom_6_1, mom_6m, ret_1d, rv_20, slope_20, sma_20, sma_20_gt_50, sma_50

Price/technical (extra detected) (0):
-

Return lags (60):
ret_lag_1, ret_lag_2, ret_lag_3, ret_lag_4, ret_lag_5, ret_lag_6, ret_lag_7, ret_lag_8, ret_lag_9, ret_lag_10, ret_lag_11, ret_lag_12, ret_lag_13, ret_lag_14, ret_lag_15, ret_lag_16, ret_lag_17, ret_lag_18, ret_lag_19, ret_lag_20, ret_lag_21, ret_lag_22, ret_lag_23, ret_lag_24, ret_lag_25, ret_lag_26, ret_lag_27, ret_lag_28, ret_lag_29, ret_lag_30, ret_lag_31, ret_lag_32, ret_lag_33, ret_lag_34, ret_lag_35, ret_lag_36, ret_lag_37, ret_lag_38, ret_lag_39, ret_lag_40, ret_lag_41, ret_lag_42, ret_lag_43, ret_lag_44, ret_lag_45, ret_lag_46, ret_lag_47, ret_lag_48, ret_lag_49

In [None]:
# 3.1 - Config Primer (globals only; safe to import everywhere)

from pathlib import Path

ARTIFACTS_DIR = Path("artifacts")
ALPHA_DIR     = ARTIFACTS_DIR / "alpha"
PANELS_DIR    = ALPHA_DIR / "panels"
REGIME_DIR    = ARTIFACTS_DIR / "regimes"

ALPHA_DIR.mkdir(parents=True, exist_ok=True)
PANELS_DIR.mkdir(parents=True, exist_ok=True)

# Section-1 inputs
FEATURES_FP   = Path("features_filtered.parquet")
UNIVERSE_FP   = Path("universe.csv")
META_YAML_FP  = Path("meta.yaml")  # optional

# Section-2 inputs
REGIME_LABELS_FP     = REGIME_DIR / "regime_labels.parquet"
REGIME_POLICY_MAP_FP = REGIME_DIR / "regime_policy_map.json"
WIN_DIR              = REGIME_DIR / "windowed"
WINDEX_FP            = WIN_DIR / "windows_index.json"      # preferred
WMANIFEST_FP         = REGIME_DIR / "window_manifest.json" # fallback

# Config knobs (#TOCHANGE for real runs)
HORIZONS         = [5, 10]      # TOCHANGE real: [5, 10, 20]
ROLL_LOOKBACK_D  = 126          # TOCHANGE real: 252
SECTOR_NEUTRAL   = False        # TOCHANGE real: True (with sector_map.csv)
DEBUG_MAX_TICKERS= None         # TOCHANGE e.g., 120 for smoke; real: None
N_SMOOTH_G       = 3            # TOCHANGE real: 5-10
LAMBDA_RIDGE     = 1e-8

# Optional sector map
SECTOR_ETF_MAP_FP = Path("sector_map.csv")

# Planned outputs used by later cells
PANEL_MASTER_FP = ALPHA_DIR / "panel_master.parquet"
TARGETS_FP      = ALPHA_DIR / "targets.parquet"
TARGETS_QC_FP   = ALPHA_DIR / "targets_qc.json"
FEATURE_LIST_FP = ALPHA_DIR / "feature_list.json"
LEAK_SCAN_FP    = ALPHA_DIR / "leakage_scan.json"

In [None]:
# Feature inventory + SPY token scan (no side effects)

import pandas as pd, re

df = pd.read_parquet(FEATURES_FP)
n_rows, n_cols = df.shape
date_min = pd.to_datetime(df["date"]).min()
date_max = pd.to_datetime(df["date"]).max()
n_tickers = df["ticker"].nunique()
print(f"{FEATURES_FP} → rows={n_rows:,}, cols={n_cols:,}, tickers={n_tickers}, dates=[{date_min} → {date_max}]")

NON_FEATURE_COLS = {"date","ticker","open","high","low","close","adj_close","volume"}
feature_cols = [c for c in df.columns if c not in NON_FEATURE_COLS]

TECH_BASE = {"ret_1d","rv_20","atr_14","mom_20","mom_6m","mom_12m","mom_12_1","mom_6_1",
             "sma_20","sma_50","sma_20_gt_50","slope_20","mom_20_vs_vol"}
lags = sorted([c for c in feature_cols if re.fullmatch(r"ret_lag_\d+", c)], key=lambda x: int(x.split("_")[-1]))
tech_known = sorted([c for c in feature_cols if c in TECH_BASE])

print(f"\nPredictive feature columns: {len(feature_cols)} of {n_cols}")
print(f"Price/technical (known) ({len(tech_known)}): {', '.join(tech_known) or '-'}")
print(f"Return lags ({len(lags)}): {', '.join(lags[:12])}{' …' if len(lags)>12 else ''}")

spy_hits = [c for c in feature_cols if 'SPY' in c.upper()]
whitelist = {'spy_rv_20','spy_ret'}  # allowed context, if present
suspect = [c for c in spy_hits if c.lower() not in whitelist]
print("\nColumns containing 'SPY':", spy_hits or "None")
print("✅ No unexpected 'SPY' tokens." if not suspect else f"⚠️ Unexpected 'SPY' features: {suspect}")

features_filtered.parquet → rows=2,223,976, cols=94, tickers=514, dates=[2007-02-05 00:00:00 → 2025-08-11 00:00:00]

Predictive feature columns: 86 of 94
Price/technical (known) (13): atr_14, mom_12_1, mom_12m, mom_20, mom_20_vs_vol, mom_6_1, mom_6m, ret_1d, rv_20, slope_20, sma_20, sma_20_gt_50, sma_50
Return lags (60): ret_lag_1, ret_lag_2, ret_lag_3, ret_lag_4, ret_lag_5, ret_lag_6, ret_lag_7, ret_lag_8, ret_lag_9, ret_lag_10, ret_lag_11, ret_lag_12 …

Columns containing 'SPY': ['spy_rv_20']
✅ No unexpected 'SPY' tokens.


In [None]:
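# ---- Hedge loaders (new) ----
# Imports repeated here so the cell also runs on its own; REGIME_DIR comes from the config primer cell.
from pathlib import Path

import numpy as np
import pandas as pd

RAW_PRICES_FP = Path("raw_prices.parquet")                    # from Section 1
MKT_PANEL_FP  = REGIME_DIR / "market_panel.parquet"           # from Section 2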

def load_spy_returns() -> pd.Series:
    """
    Prefer Section 2's market_panel (column 'spy_ret').
    Fallback: compute from Section 1's raw_prices for 'SPY'.
    Returns a pd.Series indexed by date named 'spy_ret'.
    """
    # Preferred source
    if MKT_PANEL_FP.exists():
        mp = pd.read_parquet(MKT_PANEL_FP)
        mp["date"] = pd.to_datetime(mp["date"])
        mp = mp.sort_values("date")
        if "spy_ret" in mp.columns:
            return mp.set_index("date")["spy_ret"].rename("spy_ret")

    # Fallback
    if RAW_PRICES_FP.exists():
        rp = pd.read_parquet(RAW_PRICES_FP)  # expected long format: date, ticker, adj_close (at minimum)
        rp["date"] = pd.to_datetime(rp["date"])
        spy = rp[rp["ticker"] == "SPY"][["date", "adj_close"]].sort_values("date")
        if spy.empty:
            raise RuntimeError("SPY not found in raw_prices.parquet; cannot compute spy_ret.")
        spy["spy_ret"] = np.log(spy["adj_close"]).diff()
        return spy.set_index("date")["spy_ret"].rename("spy_ret")

    raise RuntimeError("Neither market_panel.parquet nor raw_prices.parquet available for SPY returns.")


def load_sector_etf_returns(etf_list: list[str]) -> pd.DataFrame:
    """
    Load sector ETF daily log returns from raw_prices.parquet for tickers in etf_list.
    Returns a DataFrame with columns: date, ticker, sector_ret_1d
    """
    if not RAW_PRICES_FP.exists():
        raise RuntimeError("raw_prices.parquet missing; required for sector ETF returns.")
    rp = pd.read_parquet(RAW_PRICES_FP)
    rp["date"] = pd.to_datetime(rp["date"])
    etf_px = rp[rp["ticker"].isin(etf_list)][["date", "ticker", "adj_close"]].sort_values(["ticker","date"])
    if etf_px.empty:
        raise RuntimeError(f"No sector ETF prices found in raw_prices.parquet for: {etf_list}")
    etf_px["sector_ret_1d"] = etf_px.groupby("ticker")["adj_close"].transform(lambda x: np.log(x).diff())
    return etf_px[["date", "ticker", "sector_ret_1d"]]


In [None]:
# ──────────────────────────────────────────────────────────────────────────────
# Section 3.1.1 + 3.1.2: Data Load/Trim & Regime Context
# What this provides:
#   - load_base_data():   loads Section-1 features, trims to universe, extracts price panel,
#                         determines X feature columns (cs_cols), and runs light QC.
#   - load_regime_context(): loads Section-2 regime labels, picks smoothed label when present,
#                         computes historical 'g' from posteriors (if needed), smooths it, light QC.
#
# Notes:
#   • Reuses artifacts from Sections 1–2. No recomputation of features or regimes here.
#   • Low-compute defaults are set for smoke tests; real-run values are marked with #TOCHANGE.
#   • Outputs are returned as in-memory DataFrames; optional QC artifacts are written to disk.
# ──────────────────────────────────────────────────────────────────────────────

from __future__ import annotations

import json
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd

# Optional: pull cs_cols from meta.yaml if present
try:
    import yaml  # type: ignore
    HAVE_YAML = True
except Exception:
    HAVE_YAML = False

# ──────────────────────────────────────────────────────────────────────────────
# Paths (reuse from earlier sections where possible)
# ──────────────────────────────────────────────────────────────────────────────

ARTIFACTS_DIR = Path("artifacts")
ALPHA_DIR = ARTIFACTS_DIR / "alpha"
PANELS_DIR = ALPHA_DIR / "panels"
REGIME_DIR = ARTIFACTS_DIR / "regimes"

ALPHA_DIR.mkdir(parents=True, exist_ok=True)
PANELS_DIR.mkdir(parents=True, exist_ok=True)

FEATURES_FP = Path("features_filtered.parquet")          # Section 1 output
UNIVERSE_FP = Path("universe.csv")                       # Section 1 output
META_YAML_FP = Path("meta.yaml")                         # Section 1 optional meta

REGIME_LABELS_FP = REGIME_DIR / "regime_labels.parquet"  # Section 2 output
REGIME_POLICY_MAP_FP = REGIME_DIR / "regime_policy_map.json"  # Section 2 (latest-only g)

# ──────────────────────────────────────────────────────────────────────────────
# Config knobs for 3.1.1 / 3.1.2
# ──────────────────────────────────────────────────────────────────────────────

DEBUG_MAX_TICKERS: Optional[int] = None   # #TOCHANGE real-run: None; smoke: e.g., 120
PX_FALLBACK_COL = "adj_close"             # must exist in features parquet
N_SMOOTH_G = 3                            # #TOCHANGE real-run: 5–10 (smoother 'g')

# Non-feature columns (from Section 1 conventions)
NON_FEATURE_COLS_BASE = {
    "date", "ticker", "open", "high", "low", "close", "adj_close", "volume"
}

# ──────────────────────────────────────────────────────────────────────────────
# Utilities
# ──────────────────────────────────────────────────────────────────────────────

def _to_datetime(df: pd.DataFrame, col: str = "date") -> pd.DataFrame:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col])
    return df

def _load_meta_cs_cols(meta_yaml_fp: Path) -> Optional[List[str]]:
    if meta_yaml_fp.exists() and HAVE_YAML:
        try:
            meta = yaml.safe_load(meta_yaml_fp.read_text())
            cs_cols = meta.get("cs_cols", None)
            if isinstance(cs_cols, list) and cs_cols:
                return cs_cols
        except Exception:
            pass
    return None

def _derive_cs_cols(df: pd.DataFrame) -> List[str]:
    # Heuristic: everything that is not in known non-feature cols and not a mask label
    # (Section 1 already standardized & winsorized these)
    return [
        c for c in df.columns
        if c not in NON_FEATURE_COLS_BASE
        and not c.lower().startswith("mask_")
        and c not in {"regime_label", "regime_label_smoothed"}  # safety
    ]

def _assert_unique_keys(df: pd.DataFrame, keys=("date", "ticker")) -> None:
    if df.duplicated(list(keys)).any():
        dups = int(df.duplicated(list(keys)).sum())
        raise AssertionError(f"Found {dups} duplicate ({','.join(keys)}) rows in features panel.")

def _check_monotonic_dates(df: pd.DataFrame) -> None:
    # Light check: sample a few tickers to verify monotonicity
    sample = df["ticker"].drop_duplicates().sample(min(10, df["ticker"].nunique()), random_state=42)
    for t in sample:
        s = df.loc[df["ticker"] == t, "date"]
        if not s.is_monotonic_increasing:
            raise AssertionError(f"Non-monotonic dates for ticker {t}.")

def _write_qc_csv(path: Path, stats: Dict) -> None:
    pd.DataFrame([stats]).to_csv(path, index=False)

# ──────────────────────────────────────────────────────────────────────────────
# 3.1.1: Load & Trim Base Data (reuse)
# ──────────────────────────────────────────────────────────────────────────────

def load_base_data(
    features_fp: Path = FEATURES_FP,
    universe_fp: Path = UNIVERSE_FP,
    meta_yaml_fp: Path = META_YAML_FP,
    debug_max_tickers: Optional[int] = DEBUG_MAX_TICKERS,
) -> Tuple[pd.DataFrame, pd.DataFrame, List[str]]:
    """
    Load modeling-ready features (Section 1), trim to universe, optionally
    downsample tickers for smoke runs, extract a minimal price panel, and
    determine X feature columns (cs_cols).

    Returns:
        feats  : DataFrame, trimmed features (ready for merge in 3.1.6)
        px     : DataFrame, minimal price panel ['date','ticker','adj_close']
        cs_cols: List[str], feature columns to feed models later
    """
    # Load features
    feats = pd.read_parquet(features_fp)
    feats = _to_datetime(feats, "date")
    feats["ticker"] = feats["ticker"].astype(str)

    # Join to canonical universe
    uni = pd.read_csv(universe_fp)["ticker"].astype(str)
    feats = feats[feats["ticker"].isin(uni)].copy()

    # Optional: downsample for smoke runs
    if debug_max_tickers is not None:
        keep = feats["ticker"].drop_duplicates().head(debug_max_tickers)
        feats = feats[feats["ticker"].isin(keep)].copy()

    # Basic sort & schema assertions
    feats = feats.sort_values(["ticker", "date"])
    _assert_unique_keys(feats, keys=("date", "ticker"))
    _check_monotonic_dates(feats)

    # Determine cs_cols
    cs_cols = _load_meta_cs_cols(meta_yaml_fp)
    if cs_cols is None:
        cs_cols = _derive_cs_cols(feats)

    # Extract minimal price panel for later target calc
    if PX_FALLBACK_COL not in feats.columns:
        raise KeyError(
            f"'{PX_FALLBACK_COL}' not found in features parquet; "
            f"please ensure Section 1 wrote adj_close. "
            f"(Alternatively, pass a separate price panel from raw_prices.parquet.)"
        )
    px = feats[["date", "ticker", PX_FALLBACK_COL]].drop_duplicates().copy()

    # QC write (optional)
    qc_stats = {
        "rows": int(len(feats)),
        "dates": int(feats["date"].nunique()),
        "tickers": int(feats["ticker"].nunique()),
        "debug_max_tickers": debug_max_tickers,
        "adj_close_missing_frac": float(px[PX_FALLBACK_COL].isna().mean()),
        "n_cs_cols": len(cs_cols),
    }
    _write_qc_csv(PANELS_DIR / "panel_base_qc.csv", qc_stats)

    return feats, px, cs_cols

# ──────────────────────────────────────────────────────────────────────────────
# 3.1.2: Regime Context (reuse + light transform)
# ──────────────────────────────────────────────────────────────────────────────

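def _entropy_norm(p: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized Shannon entropy in [0,1]; returns 0.0 for a single-state input."""
    if len(p) < 2:
        return 0.0  # log(1) == 0 would otherwise cause a division by zero
    p = np.clip(p, eps, 1.0)
    p = p / p.sum()
    H = -(p * np.log(p)).sum()
    return float(H / np.log(len(p)))  # normalize by log(K)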

def _compute_g_from_posteriors(df: pd.DataFrame, n_smooth: int = N_SMOOTH_G) -> pd.Series:
    """
    Build a per-date aggressiveness scalar 'g' from posteriors p*:
      c_max = max(p)
      c_ent = 1 - entropy_norm(p)
      c = 0.5*c_max + 0.5*c_ent
      g = 0.35 + (1.00 - 0.35) * c  (same as §2.7)
    Then smooth with a trailing mean over n_smooth days.
    The result is indexed like df (not by date) so it can be assigned back as a column.
    """
    p_cols = [c for c in df.columns if c.startswith("p")]
    if not p_cols:
        # No posteriors to compute g from: return all-NaN, aligned to df's index
        return pd.Series(np.nan, index=df.index, dtype=float, name="g")

    P = df[p_cols].to_numpy()
    c_max = P.max(axis=1)
    c_ent = np.array([1.0 - _entropy_norm(P[i, :]) for i in range(P.shape[0])])
    c = 0.5 * c_max + 0.5 * c_ent
    g = 0.35 + (1.00 - 0.35) * c
    g_ser = pd.Series(g, index=df.index, name="g")
    if n_smooth and n_smooth > 1:
        # Trailing window only, so no future information leaks into g
        g_ser = g_ser.rolling(n_smooth, min_periods=1).mean()
    return g_ser

def load_regime_context(
    regime_labels_fp: Path = REGIME_LABELS_FP,
    regime_policy_map_fp: Path = REGIME_POLICY_MAP_FP,
    n_smooth_g: int = N_SMOOTH_G,
) -> pd.DataFrame:
    """
    Load regime labels from Section 2, pick smoothed label if present,
    compute historical per-date 'g' from posteriors if needed, and
    return a compact DataFrame for merging in 3.1.6.

    Returns:
        regimes_keep: DataFrame with columns:
            ['date','regime_label_use', 'p0..pK-1'(if any), 'g']
            (+ 'state_id_smoothed' optionally for debugging)
    """
    rg = pd.read_parquet(regime_labels_fp).copy()
    rg["date"] = pd.to_datetime(rg["date"])
    rg = rg.sort_values("date").reset_index(drop=True)

    # Choose label column
    if "regime_label_smoothed" in rg.columns:
        rg["regime_label_use"] = rg["regime_label_smoothed"]
    elif "regime_label" in rg.columns:
        rg["regime_label_use"] = rg["regime_label"]
    else:
        raise KeyError("Regime labels missing both 'regime_label_smoothed' and 'regime_label'.")

    # If historical g is not stored, compute it from posteriors
    p_cols = [c for c in rg.columns if c.startswith("p")]
    if "g" not in rg.columns:
        rg["g"] = _compute_g_from_posteriors(rg, n_smooth=n_smooth_g)

    # Keep compact schema for downstream merges
    keep_cols = ["date", "regime_label_use"]
    if "state_id_smoothed" in rg.columns:
        keep_cols.append("state_id_smoothed")
    keep_cols += p_cols
    if "g" in rg.columns:
        keep_cols.append("g")

    regimes_keep = rg[keep_cols].copy()

    # Optional QC snapshot
    qc = {
        "rows": int(len(regimes_keep)),
        "dates": int(regimes_keep["date"].nunique()),
        "has_posteriors": bool(len(p_cols) > 0),
        "K": int(len(p_cols)) if p_cols else 0,
        "g_present": bool("g" in regimes_keep.columns),
        "g_stats": None,
        "n_smooth_g": n_smooth_g,
        "label_used": "regime_label_smoothed" if "regime_label_smoothed" in rg.columns else "regime_label",
    }
    if "g" in regimes_keep.columns:
        gvals = regimes_keep["g"].dropna()
        if len(gvals):
            qc["g_stats"] = {
                "min": float(gvals.min()),
                "mean": float(gvals.mean()),
                "max": float(gvals.max()),
            }
    (ALPHA_DIR / "regimes_qc.json").write_text(json.dumps(qc, indent=2))

    return regimes_keep

# ──────────────────────────────────────────────────────────────────────────────
# Example (manual) usage:
# feats, px, cs_cols = load_base_data()
# regimes_keep = load_regime_context()
# Next steps in 3.1.x: compute residual daily returns (vs SPY/sector), build forward
# r_ex_h targets, merge features×targets×regimes, and split per walk-forward window.
# ──────────────────────────────────────────────────────────────────────────────

# 3.1.1 + 3.1.2 glue (optional helper)
def prep_3_1_3_context(
    debug_max_tickers: int | None = DEBUG_MAX_TICKERS,
    n_smooth_g: int = N_SMOOTH_G,
):
    """
    One-call handoff for 3.1.3:
      - Loads trimmed features & price panel
      - Loads regime context with per-date g
      - Performs a couple of readiness checks
    Returns:
      feats, px, cs_cols, regimes_keep
    """
    feats, px, cs_cols = load_base_data(
        features_fp=FEATURES_FP,
        universe_fp=UNIVERSE_FP,
        meta_yaml_fp=META_YAML_FP,
        debug_max_tickers=debug_max_tickers,
    )
    regimes_keep = load_regime_context(
        regime_labels_fp=REGIME_LABELS_FP,
        regime_policy_map_fp=REGIME_POLICY_MAP_FP,
        n_smooth_g=n_smooth_g,
    )

    # readiness checks for 3.1.3
    if "adj_close" not in px.columns:
        raise RuntimeError("adj_close column is required in px for return computation.")

    return feats, px, cs_cols, regimes_keep




    # Persist
    RUN_SUMMARY_FP.write_text(json.dumps(s, indent=2))

    # Human-readable fingerprint
    txt = []
    txt.append("=== Section 3.1 Run Summary ===")
    txt.append(f"Horizons: {s['params']['horizons']}; Sector-neutral: {s['params']['sector_neutral']}; "
               f"Lookback: {s['params']['roll_lookback_days']}d; Debug tickers: {s['params']['debug_max_tickers']}")
    txt.append(f"Features rows: {s['shapes']['feats_rows']:,} | Panel rows: {s['shapes']['panel_master_rows']:,} | "
               f"Targets rows: {s['shapes']['targets_rows']:,}")
    if qc_excess:
        txt.append(f"Sector-neutral enabled: {qc_excess.get('sector_neutral_enabled')}, "
                   f"sector_beta_coverage: {qc_excess.get('sector_beta_coverage'):.3f}")
    txt.append(f"Validation (3.1): {s['qc']['val_3_1_status']}  Notes: {', '.join(s['qc']['val_3_1_notes']) if s['qc']['val_3_1_notes'] else '-'}")
    txt.append("Artifacts:")
    for k, v in s["artifacts"].items():
        mark = "✓" if v["exists"] else "✗"
        size = f"{v['size']:,}B" if v["size"] else "-"
        txt.append(f"  {mark} {k}: {v['path']} ({size})")
    if s["windows"]:
        txt.append("Windows:")
        for w in s["windows"]:
            txt.append(f"  {w['win_id']}  train[{w['train_start']} → {w['train_end']}] rows={w['train_rows']:,} | "
                       f"test[{w['test_start']} → {w['test_end']}] rows={w['test_rows']:,}")
    RUN_FINGERPRINT_FP.write_text("\n".join(txt))

    if print_summary:
        print("\n".join(txt))

    return s

# --- Guard the orchestrator so it doesn't run yet ---
RUN_ORCHESTRATOR = False  # set to True when 3.1.3+ are implemented

# Allow `python yourfile.py --run-3-1` style usage
if __name__ == "__main__" and RUN_ORCHESTRATOR:
    run_section_3_1(
        horizons=HORIZONS,
        sector_neutral=SECTOR_NEUTRAL,
        roll_lookback=ROLL_LOOKBACK_D,
        debug_max_tickers=DEBUG_MAX_TICKERS,
        n_smooth_g=N_SMOOTH_G,
        print_summary=True,
    )


# Example usage after running prep_3_1_3_context:
feats, px, cs_cols, regimes_keep = prep_3_1_3_context()
log_3_1_1_2_summary(feats, px, cs_cols, regimes_keep)


[3.1.1 + 3.1.2] SUMMARY
------------------------------------------------------------
Features loaded: 2,174,912 rows, 86 features
Price panel:     2,174,912 rows (503 tickers)
Regimes loaded:  2,163 rows
Feature cols:    ['ret_1d', 'ret_lag_1', 'ret_lag_2', 'ret_lag_3', 'ret_lag_4']...

Artifacts written:
  ✔ artifacts/alpha/panels/panel_base_qc.csv (exists)
  ✔ artifacts/alpha/regimes_qc.json (exists)

Preview feats:
      date      open      high       low     close  adj_close    volume ticker    ret_1d  ret_lag_1  ret_lag_2  ret_lag_3  ret_lag_4  ret_lag_5  ret_lag_6  ret_lag_7  ret_lag_8  ret_lag_9  ret_lag_10  ret_lag_11  ret_lag_12  ret_lag_13  ret_lag_14  ret_lag_15  ret_lag_16  ret_lag_17  ret_lag_18  ret_lag_19  ret_lag_20  ret_lag_21  ret_lag_22  ret_lag_23  ret_lag_24  ret_lag_25  ret_lag_26  ret_lag_27  ret_lag_28  ret_lag_29  ret_lag_30  ret_lag_31  ret_lag_32  ret_lag_33  ret_lag_34  ret_lag_35  ret_lag_36  ret_lag_37  ret_lag_38  ret_lag_39  ret_lag_40  ret_lag_41  
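
To make the aggressiveness scalar from `_compute_g_from_posteriors` concrete, here is a small standalone sketch of the same formula applied to hypothetical three-state posteriors (the probability rows are invented for illustration, and the helper names `entropy_norm` and `g_from_posterior` exist only in this example). A uniform posterior yields the lowest confidence, while a near-certain posterior pushes `g` toward 1.0.

```python
import numpy as np

def entropy_norm(p, eps=1e-12):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    p = p / p.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

def g_from_posterior(p, g_min=0.35, g_max=1.00):
    """Blend max-probability and (1 - normalized entropy) into g in [g_min, g_max]."""
    c_max = float(np.max(p))
    c_ent = 1.0 - entropy_norm(p)
    c = 0.5 * c_max + 0.5 * c_ent
    return g_min + (g_max - g_min) * c

# Hypothetical posteriors over (Risk-On, Risk-Off, Transition)
g_uniform = g_from_posterior([1 / 3, 1 / 3, 1 / 3])    # c = 0.5 * (1/3) + 0.5 * 0
g_confident = g_from_posterior([0.98, 0.01, 0.01])
g_one_hot = g_from_posterior([1.0, 0.0, 0.0])

print(round(g_uniform, 4), round(g_confident, 4), round(g_one_hot, 4))
```

Because the confidence `c` never drops below `0.5 / K` for K states, `g` stays strictly above the 0.35 floor even for a fully uncertain posterior.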

In [None]:
# ──────────────────────────────────────────────────────────────────────────────
# Section 3.1.1–3.1.2: Summary Helper & QC Utilities (no orchestrator yet)
# Purpose:
#   • log_3_1_1_2_summary(...) → human-readable printout of loads from 3.1.1–3.1.2
#   • _exists_size(...)        → tiny file helper used by later summaries
# Notes:
#   This cell does NOT run the 3.1 orchestrator; that comes after 3.1.3+ are implemented.
# ──────────────────────────────────────────────────────────────────────────────

from __future__ import annotations
import json
from pathlib import Path
import pandas as pd

# Reuse config/paths from earlier cells/files
# HORIZONS, ROLL_LOOKBACK_D, SECTOR_NEUTRAL, N_SMOOTH_G, etc.
# PANEL_MASTER_FP, TARGETS_FP, TARGETS_QC_FP, FEATURE_LIST_FP, LEAK_SCAN_FP
# WINDEX_FP, WMANIFEST_FP, ALPHA_DIR, PANELS_DIR

RUN_SUMMARY_FP = ALPHA_DIR / "3_1_run_summary.json"
RUN_FINGERPRINT_FP = ALPHA_DIR / "3_1_run_fingerprint.txt"

def log_3_1_1_2_summary(feats, px, cs_cols, regimes_keep):
    created_files = [
        str(PANELS_DIR / "panel_base_qc.csv"),
        str(ALPHA_DIR / "regimes_qc.json"),
    ]
    print("\n[3.1.1 + 3.1.2] SUMMARY")
    print("-" * 60)
    print(f"Features loaded: {len(feats):,} rows, {len(cs_cols):,} features")
    print(f"Price panel:     {len(px):,} rows ({px['ticker'].nunique()} tickers)")
    print(f"Regimes loaded:  {len(regimes_keep):,} rows")
    print(f"Feature cols:    {cs_cols[:5]}{'...' if len(cs_cols) > 5 else ''}")
    print("\nArtifacts written:")
    for f in created_files:
        print(f"  ✔ {f} {'(exists)' if Path(f).exists() else '(missing!)'}")

    # Optional: quick preview
    print("\nPreview feats:")
    print(feats.head(2).to_string(index=False))
    print("\nPreview regimes:")
    print(regimes_keep.head(2).to_string(index=False))
    print("-" * 60)

def _exists_size(path: Path) -> dict:
    return {"exists": path.exists(), "size": (path.stat().st_size if path.exists() else 0), "path": str(path)}

# ── 3.1.1–3.1.2 smoke summary (prints shapes + created files) ──
RUN_3_1_12_SMOKE = True  # set False to silence

if RUN_3_1_12_SMOKE:
    try:
        feats, px, cs_cols, regimes_keep = prep_3_1_3_context()
        log_3_1_1_2_summary(feats, px, cs_cols, regimes_keep)

        # Extra: show that the two QC artifacts exist and their sizes
        files = [
            PANELS_DIR / "panel_base_qc.csv",
            ALPHA_DIR / "regimes_qc.json",
        ]
        print("\n[Artifacts check]")
        for f in files:
            es = _exists_size(f)
            mark = "✓" if es["exists"] else "✗"
            size = f"{es['size']:,}B" if es["size"] else "-"
            print(f"  {mark} {es['path']} ({size})")
    except Exception as e:
        print("[3.1.1–3.1.2 smoke] failed:", repr(e))





In [None]:
# ──────────────────────────────────────────────────────────────────────────────
# Section 3.1.3 + 3.1.4: Daily Returns & Excess-Return Model (SPY-only / Sector-Neutral)
#
# What this provides:
#   3.1.3
#     - compute_daily_returns(px): per-ticker log returns from adj_close
#     - extract_spy_returns(px_ret): SPY daily log returns
#     - prepare_sector_returns(px_ret, sector_map): long-form sector ETF returns (optional)
#
#   3.1.4
#     - rolling two-factor beta helper (fast, covariance-based)
#     - compute_excess_returns(...): residual daily returns vs SPY (default) or vs SPY+sector (if enabled)
#     - coverage/QC summary for sector betas when sector-neutral is used
#
# Reuse:
#   - Assumes you already have from 3.1.1/3.1.2:
#       feats, px, cs_cols = load_base_data(...)
#       regimes_keep       = load_regime_context(...)
#
# #TOCHANGE knobs:
#   - ROLL_LOOKBACK_D   = 126  (real run: 252)
#   - SECTOR_NEUTRAL    = False (real run: True, plus sector_map.csv)
#   - SECTOR_ETF_MAP_FP = Path("sector_map.csv")  (provide/commit this mapping)
#   - LAMBDA_RIDGE      = 0.0  (real run: 1e-6 if your 2x2 determinant gets unstable)
# ──────────────────────────────────────────────────────────────────────────────

from __future__ import annotations

from pathlib import Path
from typing import Dict, Iterable, Optional, Tuple

import numpy as np
import pandas as pd

# ── Config (light for smoke; mark #TOCHANGE for real runs) ────────────────────
ROLL_LOOKBACK_D = 126               # #TOCHANGE real run: 252
SECTOR_NEUTRAL = False              # #TOCHANGE real run: True (requires sector_map.csv)
SECTOR_ETF_MAP_FP = Path("sector_map.csv")  # ticker,sector_etf mapping file (equities only)
SPY_TICKER = "SPY"
PX_FALLBACK_COL = "adj_close"
LAMBDA_RIDGE = 0.0                  # #TOCHANGE real run: 1e-6 if needed for numerical stability

# ──────────────────────────────────────────────────────────────────────────────
# 3.1.3: Daily returns for assets and hedges (reuse data, new calc)
# ──────────────────────────────────────────────────────────────────────────────

# Returns
def compute_daily_returns(px: pd.DataFrame, price_col: str = PX_FALLBACK_COL) -> pd.DataFrame:
    """
    Per-ticker daily *log* returns from adjusted close.
    Input:
        px : ['date','ticker', price_col], unique per (date,ticker)
    Output:
        ['date','ticker', price_col, 'ret_1d'] (log returns t-1 -> t)
    """
    req = {"date", "ticker", price_col}
    if not req.issubset(px.columns):
        missing = req - set(px.columns)
        raise KeyError(f"compute_daily_returns: missing columns {missing}")
    df = px.sort_values(["ticker", "date"]).copy()
    df["ret_1d"] = np.log(df[price_col]) - np.log(df.groupby("ticker")[price_col].shift(1))
    return df

feats, px, cs_cols, regimes_keep = prep_3_1_3_context()
px_ret  = compute_daily_returns(px)
print(px_ret.head()[["date","ticker","ret_1d"]])

def extract_spy_returns(px_ret: pd.DataFrame) -> pd.Series:
    """
    Extract SPY ret_1d series indexed by date.
    Input:
        px_ret : output of compute_daily_returns
    """
    if SPY_TICKER not in px_ret["ticker"].unique():
        raise RuntimeError(
            f"extract_spy_returns: '{SPY_TICKER}' not found in px; include SPY in features or "
            f"load it from raw_prices.parquet before 3.1.3."
        )
    spy = (
        px_ret.loc[px_ret["ticker"].eq(SPY_TICKER), ["date", "ret_1d"]]
        .drop_duplicates()
        .sort_values("date")
        .set_index("date")["ret_1d"]
    )
    return spy


def load_sector_map(fp: Path = SECTOR_ETF_MAP_FP) -> Optional[pd.DataFrame]:
    """
    Load ticker->sector_etf mapping (equity rows only). File format:
        ticker,sector_etf
        AAPL,XLK
        XOM,XLE
        ...
    Returns None if file does not exist.
    """
    if not fp.exists():
        return None
    sm = pd.read_csv(fp, dtype={"ticker": str, "sector_etf": str})
    # Keep clean, drop empty rows
    sm = sm.dropna(subset=["ticker", "sector_etf"]).copy()
    sm["ticker"] = sm["ticker"].astype(str)
    sm["sector_etf"] = sm["sector_etf"].astype(str)
    return sm


def prepare_sector_returns(
    px_ret: pd.DataFrame,
    sector_map: pd.DataFrame,
    sector_etfs: Optional[Iterable[str]] = None,
) -> pd.DataFrame:
    """
    Build long-form sector ETF daily returns from px_ret for the ETFs referenced by sector_map.
    Input:
        px_ret     : ['date','ticker','ret_1d'] for all tickers inc. hedges
        sector_map : ['ticker','sector_etf'] for equities
        sector_etfs: optional explicit list of ETFs to pull; else inferred from sector_map
    Output:
        DataFrame: ['date','sector_etf','sector_ret_1d'] (one row per ETF/date)
    """
    if sector_etfs is None:
        sector_etfs = sector_map["sector_etf"].dropna().unique().tolist()
    etf_ret = (
        px_ret[px_ret["ticker"].isin(set(sector_etfs))][["date", "ticker", "ret_1d"]]
        .rename(columns={"ticker": "sector_etf", "ret_1d": "sector_ret_1d"})
        .sort_values(["sector_etf", "date"])
    )
    return etf_ret
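
The long-form output of `prepare_sector_returns` still has to be attached to each equity row before a sector beta can be estimated. The sketch below shows one way to do that with two merges on hypothetical data (tickers, ETF assignments, and returns are made up). It is an illustration of the join shape, not necessarily the exact join used by the pipeline.

```python
import pandas as pd

# Hypothetical mapping and returns (values invented for the example)
sector_map = pd.DataFrame({"ticker": ["AAPL", "XOM"], "sector_etf": ["XLK", "XLE"]})
dates = pd.to_datetime(["2024-01-03", "2024-01-04"])
px_ret = pd.DataFrame({
    "date": list(dates) * 4,
    "ticker": ["AAPL"] * 2 + ["XOM"] * 2 + ["XLK"] * 2 + ["XLE"] * 2,
    "ret_1d": [0.010, -0.020, 0.005, 0.015, 0.008, -0.012, 0.002, 0.011],
})

# ETF rows become a (date, sector_etf) lookup table
etf_ret = (
    px_ret[px_ret["ticker"].isin(sector_map["sector_etf"])]
    .rename(columns={"ticker": "sector_etf", "ret_1d": "sector_ret_1d"})
)

# Step 1: keep only mapped equities and tag each with its sector ETF
equities = px_ret.merge(sector_map, on="ticker", how="inner")
# Step 2: bring in that ETF's return for the same date
joined = equities.merge(etf_ret, on=["date", "sector_etf"], how="left")

print(joined)
```

A left join in the second step keeps equity rows even when the ETF has no return on that date, which leaves `sector_ret_1d` as NaN and lets downstream code fall back to a SPY-only beta.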


# ──────────────────────────────────────────────────────────────────────────────
# 3.1.4: Excess-return model (SPY-only / Sector-Neutral)
# ──────────────────────────────────────────────────────────────────────────────

def _rolling_two_factor_betas(
    y: pd.Series, x1: pd.Series, x2: pd.Series, lookback: int, lambda_ridge: float = LAMBDA_RIDGE
) -> Tuple[pd.Series, pd.Series]:
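    """
    Rolling betas for y ~ b1*x1 + b2*x2 (intercept absorbed by centering), solved in
    closed form from the 2x2 normal equations built on rolling second moments.
    Each observation is centered by its own trailing mean, so the moments approximate
    (rather than exactly equal) the in-window covariances. Returns (b1, b2).
    """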
    # rolling means (min_periods ~ half window helps early segments)
    m_y  = y.rolling(lookback, min_periods=lookback//2).mean()
    m_x1 = x1.rolling(lookback, min_periods=lookback//2).mean()
    m_x2 = x2.rolling(lookback, min_periods=lookback//2).mean()

    # centered
    yc  = y  - m_y
    x1c = x1 - m_x1
    x2c = x2 - m_x2

    # covariances/variances
    cov_yx1  = (yc * x1c).rolling(lookback, min_periods=lookback//2).mean()
    cov_yx2  = (yc * x2c).rolling(lookback, min_periods=lookback//2).mean()
    cov_x1x2 = (x1c * x2c).rolling(lookback, min_periods=lookback//2).mean()
    var_x1   = (x1c * x1c).rolling(lookback, min_periods=lookback//2).mean()
    var_x2   = (x2c * x2c).rolling(lookback, min_periods=lookback//2).mean()

    # Solve:
    # [var_x1+λ   cov_x1x2] [b1] = [cov_yx1]
    # [cov_x1x2   var_x2+λ] [b2]   [cov_yx2]
    var_x1_r = var_x1 + lambda_ridge
    var_x2_r = var_x2 + lambda_ridge
    det = var_x1_r * var_x2_r - cov_x1x2 * cov_x1x2
    eps = 1e-12
    det = det.where(det.abs() > eps, np.nan)

    b1 = ( var_x2_r * cov_yx1 - cov_x1x2 * cov_yx2) / det
    b2 = (-cov_x1x2 * cov_yx1 + var_x1_r   * cov_yx2) / det
    return b1, b2

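
def compute_excess_returns(
    px_ret: pd.DataFrame,
    spy_ret: pd.Series,
    lookback: int = ROLL_LOOKBACK_D,
    sector_neutral: bool = SECTOR_NEUTRAL,
    sector_map: Optional[pd.DataFrame] = None,
    sector_returns: Optional[pd.DataFrame] = None,
    lambda_ridge: float = LAMBDA_RIDGE,
) -> Tuple[pd.DataFrame, Dict]:
    """
    Residual daily returns vs SPY (default) or vs SPY + sector ETF (if enabled).
    NOTE: the signature and merge preamble were missing from this cell; they are
    reconstructed from the example-usage block below and should be treated as a sketch.
    """
    df = px_ret[["date", "ticker", "ret_1d"]].sort_values(["ticker", "date"]).copy()
    spy_df = spy_ret.rename("spy_ret_1d").rename_axis("date").reset_index()
    df = df.merge(spy_df, on="date", how="left")

    have_sector = bool(sector_neutral and sector_map is not None and sector_returns is not None)
    if have_sector:
        df = df.merge(sector_map[["ticker", "sector_etf"]], on="ticker", how="left")
        df = df.merge(sector_returns, on=["date", "sector_etf"], how="left")

    def _per_ticker(g: pd.DataFrame) -> pd.DataFrame:
        g = g.copy()  # avoid mutating a groupby slice
        # choose 2-factor if we have sector_ret_1d and enough non-NaNs
        if have_sector and ("sector_ret_1d" in g.columns) and (g["sector_ret_1d"].notna().sum() >= lookback // 2):
            b1, b2 = _rolling_two_factor_betas(
                y=g["ret_1d"], x1=g["spy_ret_1d"], x2=g["sector_ret_1d"],
                lookback=lookback, lambda_ridge=lambda_ridge
            )
            g["beta_spy"]    = b1
            g["beta_sector"] = b2
            g["resid_1d"]    = g["ret_1d"] - (g["beta_spy"]*g["spy_ret_1d"] + g["beta_sector"]*g["sector_ret_1d"])
        else:
            # fallback: SPY-only
            cov = g["ret_1d"].rolling(lookback, min_periods=lookback//2).cov(g["spy_ret_1d"])
            var = g["spy_ret_1d"].rolling(lookback, min_periods=lookback//2).var()
            beta = cov / var.replace(0, np.nan)
            g["beta_spy"]    = beta
            g["beta_sector"] = np.nan
            g["resid_1d"]    = g["ret_1d"] - g["beta_spy"] * g["spy_ret_1d"]
        return g

    # Explicit loop keeps the 'ticker' column regardless of pandas' groupby.apply behavior
    out = pd.concat([_per_ticker(g) for _, g in df.groupby("ticker", sort=False)])

    # Coverage metric: fraction of rows with valid sector beta (0.0 when sector-neutral is off,
    # so the value is always defined and formattable in the run summary)
    sector_cov_frac = float(out["beta_sector"].notna().mean()) if have_sector else 0.0

    qc = {
        "lookback": lookback,
        "sector_neutral_enabled": bool(have_sector),
        "sector_beta_coverage": sector_cov_frac,
        "lambda_ridge": lambda_ridge,
    }
    return out[["date", "ticker", "resid_1d", "beta_spy", "beta_sector"]], qc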
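
As a sanity check on the closed-form 2x2 solution used for the two-factor betas, the sketch below compares it against `numpy.linalg.lstsq` on a single full sample of synthetic returns. The data-generating coefficients (1.2 and 0.5), the noise scales, and the seed are assumptions chosen only for the illustration. The check uses exact sample covariances over one window rather than the rolling approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0.0, 0.01, n)                # stand-in for SPY daily returns
x2 = 0.6 * x1 + rng.normal(0.0, 0.008, n)    # stand-in for a correlated sector ETF
y = 1.2 * x1 + 0.5 * x2 + rng.normal(0.0, 0.005, n)

# Sample covariance matrix; rows/cols are ordered (y, x1, x2)
C = np.cov(np.vstack([y, x1, x2]))
var_x1, var_x2, cov_x1x2 = C[1, 1], C[2, 2], C[1, 2]
cov_yx1, cov_yx2 = C[0, 1], C[0, 2]

# Closed-form solve of the 2x2 normal equations (lambda_ridge = 0)
det = var_x1 * var_x2 - cov_x1x2 ** 2
b1 = (var_x2 * cov_yx1 - cov_x1x2 * cov_yx2) / det
b2 = (-cov_x1x2 * cov_yx1 + var_x1 * cov_yx2) / det

# Reference: ordinary least squares with an explicit intercept column
X = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("closed form:", b1, b2)
print("lstsq      :", coef[1], coef[2])
```

The two slope estimates agree up to floating-point error because centering the variables is equivalent to fitting an intercept.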


# ──────────────────────────────────────────────────────────────────────────────
# Example usage inside your 3.1 pipeline:
#
# feats, px, cs_cols, regimes_keep = prep_3_1_3_context(...)
# px_ret  = compute_daily_returns(px)
# spy_ret = extract_spy_returns(px_ret)
# sector_map = load_sector_map(SECTOR_ETF_MAP_FP) if SECTOR_NEUTRAL else None
# sector_rets = prepare_sector_returns(px_ret, sector_map) if (SECTOR_NEUTRAL and sector_map is not None) else None
# resid_df, qc = compute_excess_returns(
#     px_ret=px_ret,
#     spy_ret=spy_ret,
#     lookback=ROLL_LOOKBACK_D,
#     sector_neutral=SECTOR_NEUTRAL,
#     sector_map=sector_map,
#     sector_returns=sector_rets,
#     lambda_ridge=LAMBDA_RIDGE,
# )
#
# Next (3.1.5): roll forward residual sums to build r_ex_h targets; compute ranks & valid_mask;
# then 3.1.6 merge feats × targets × regimes and split per walk-forward window.
# ──────────────────────────────────────────────────────────────────────────────

# smoke test
# 3.1.3 smoke summary + optional artifacts
from pathlib import Path
import json

px_ret = compute_daily_returns(px)

n_rows = len(px_ret)
n_tickers = px_ret["ticker"].nunique()
date_min = px_ret["date"].min()
date_max = px_ret["date"].max()
nan_frac = float(px_ret["ret_1d"].isna().mean())

has_spy = (SPY_TICKER in px_ret["ticker"].unique())
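# GroupBy.first() skips NaN and would always report 0 here; head(1) inspects
# the actual first row per ticker.
first_nan_by_ticker = int(px_ret.sort_values(["ticker","date"])
                          .groupby("ticker")["ret_1d"].head(1).isna().sum())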

print("\n[3.1.3] RETURNS SUMMARY")
print("-" * 60)
print(f"Rows={n_rows:,}  Tickers={n_tickers}  Dates=[{date_min} → {date_max}]")
print(f"ret_1d NaN fraction: {nan_frac:.4f} (should be ~1/avg_days_per_ticker for first obs)")
print(f"SPY present: {has_spy}  | tickers with first ret_1d = NaN: {first_nan_by_ticker}")

# Optional: write tiny artifacts so we can eyeball later
ALPHA_DIR.mkdir(parents=True, exist_ok=True)  # safe if already exists
qc = {
    "rows": n_rows,
    "tickers": n_tickers,
    "date_min": str(date_min),
    "date_max": str(date_max),
    "nan_frac": nan_frac,
    "spy_present": has_spy,
    "first_nan_tickers": first_nan_by_ticker,
}
(ALPHA_DIR / "px_returns_qc.json").write_text(json.dumps(qc, indent=2))
px_ret.head(200).to_csv(ALPHA_DIR / "px_returns_head.csv", index=False)

print("\nArtifacts:")
print(f"  ✓ {ALPHA_DIR / 'px_returns_qc.json'}")
print(f"  ✓ {ALPHA_DIR / 'px_returns_head.csv'}")

# quick consistency check with Section-1 ret_1d on overlapping rows
chk = (px_ret.merge(feats[["date","ticker","ret_1d"]], on=["date","ticker"], how="inner", suffixes=("_new","_s1"))
              .dropna(subset=["ret_1d_new","ret_1d_s1"]))
corr = chk["ret_1d_new"].corr(chk["ret_1d_s1"])
print(f"[check] corr(ret_1d_new, ret_1d_s1) = {corr:.6f}")

        date ticker    ret_1d
0 2007-02-05      A       NaN
1 2007-02-06      A  0.005327
2 2007-02-07      A  0.017656
3 2007-02-08      A  0.003066
4 2007-02-09      A -0.000306

[3.1.3] RETURNS SUMMARY
------------------------------------------------------------
Rows=2,174,912  Tickers=503  Dates=[2007-02-05 00:00:00 → 2025-08-11 00:00:00]
ret_1d NaN fraction: 0.0002 (should be ~1/avg_days_per_ticker for first obs)
SPY present: False  | tickers with first ret_1d = NaN: 0

Artifacts:
  ✓ artifacts/alpha/px_returns_qc.json
  ✓ artifacts/alpha/px_returns_head.csv
[check] corr(ret_1d_new, ret_1d_s1) = -0.034104
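
The SPY-only fallback in 3.1.4 estimates a rolling beta as rolling covariance divided by rolling variance and subtracts the hedged component from the asset return. A minimal sketch on synthetic data is shown here; the `true_beta` of 1.5, the noise levels, and the 60-day lookback are illustrative assumptions, not values taken from this project. It checks that the rolling estimate sits near the planted beta and that the residual is roughly uncorrelated with the market series:

```python
import numpy as np
import pandas as pd

# Synthetic market and asset returns (assumed parameters, for illustration only).
rng = np.random.default_rng(0)
n = 500
dates = pd.bdate_range("2020-01-01", periods=n)
spy = pd.Series(rng.normal(0.0, 0.01, n), index=dates)
true_beta = 1.5
asset = true_beta * spy + pd.Series(rng.normal(0.0, 0.005, n), index=dates)

# Rolling beta = rolling cov(asset, spy) / rolling var(spy), as in the SPY-only path.
lookback = 60
cov = asset.rolling(lookback, min_periods=lookback // 2).cov(spy)
var = spy.rolling(lookback, min_periods=lookback // 2).var()
beta = cov / var.replace(0, np.nan)

# Residual ("excess") return after removing the market component.
resid = asset - beta * spy

beta_tail_mean = float(beta.iloc[lookback:].mean())
corr_resid_spy = float(resid.iloc[lookback:].corr(spy.iloc[lookback:]))
print(f"mean rolling beta = {beta_tail_mean:.3f}, corr(resid, spy) = {corr_resid_spy:.3f}")
```

Note that the beta at date t in this sketch uses the return on date t itself, which matches the rolling formulation above; lagging the beta by one day is a possible variation if a strictly ex-ante hedge ratio is wanted.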


In [None]:
# ──────────────────────────────────────────────────────────────────────────────
# Section 3.1.5 + 3.1.6: Forward Targets (no leakage) & Unified Modeling Panel
#
# What this provides:
#   3.1.5
#     - forward_excess_targets(resid_df, horizons): builds r_ex_h from resid_1d (t+1..t+h),
#       cross-sectional ranks per date, and a valid_mask. Writes targets.parquet + targets_qc.json.
#
#   3.1.6
#     - build_unified_model_panel(feats, targets, regimes_keep, cs_cols): merges X×Y×R into a
#       single modeling panel (no split yet), writes feature_list.json and panel_master.parquet.
#     - assert_leakage_free(panel, horizons): essential schema/leakage checks, writes leakage_scan.json.
#
# Reuse from earlier steps:
#   - feats, px, cs_cols = load_base_data(...)
#   - regimes_keep       = load_regime_context(...)
#   - px_ret, spy_ret, sector_map, sector_returns
#   - resid_df, qc_excess = compute_excess_returns(...)
#
# #TOCHANGE knobs:
#   - HORIZONS        = [5, 10]     (real run: add 20 → [5,10,20])
#   - RANK_METHOD     = "average"   (ok; or "dense")
#   - RANK_CENTER     = False       (real run: consider centering to [-0.5, +0.5] for rank-loss)
# ──────────────────────────────────────────────────────────────────────────────

from __future__ import annotations

import json
from pathlib import Path
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd

# Reuse shared dirs from earlier files
ARTIFACTS_DIR = Path("artifacts")
ALPHA_DIR = ARTIFACTS_DIR / "alpha"
PANELS_DIR = ALPHA_DIR / "panels"
REGIME_DIR = ARTIFACTS_DIR / "regimes"

ALPHA_DIR.mkdir(parents=True, exist_ok=True)
PANELS_DIR.mkdir(parents=True, exist_ok=True)

# ── Config (mark #TOCHANGE for real runs) ─────────────────────────────────────
HORIZONS: List[int] = [5, 10]   # #TOCHANGE real run: [5, 10, 20]
RANK_METHOD = "average"         # "average"|"min"|"max"|"first"|"dense"
RANK_CENTER = False             # #TOCHANGE real run: True to map pct→(pct-0.5)

TARGETS_FP       = ALPHA_DIR / "targets.parquet"
TARGETS_QC_FP    = ALPHA_DIR / "targets_qc.json"
FEATURE_LIST_FP  = ALPHA_DIR / "feature_list.json"
LEAK_SCAN_FP     = ALPHA_DIR / "leakage_scan.json"
PANEL_MASTER_FP  = PANELS_DIR / "panel_master.parquet"  # unified (pre-split) panel


# ──────────────────────────────────────────────────────────────────────────────
# 3.1.5: Forward targets (no leakage)
# ──────────────────────────────────────────────────────────────────────────────

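def _future_sum(series: pd.Series, h: int) -> pd.Series:
    """
    Sum of the next h observations, i.e., Σ_{k=1..h} s[t+k].
    Implemented as rolling(h).sum().shift(-h): the trailing sum ending at t+h,
    re-indexed to t. Shifting by -1 before rolling would instead cover
    t-h+2..t+1, which includes current and past values rather than the future window.
    The last h rows are NaN because their forward window is incomplete.
    """
    return series.rolling(window=h, min_periods=h).sum().shift(-h)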


def forward_excess_targets(
    resid_df: pd.DataFrame,
    horizons: List[int] = HORIZONS,
    write_artifacts: bool = True,
    extra_qc: Dict | None = None,
) -> pd.DataFrame:
    """
    Build forward targets from residual daily returns (resid_1d).

    Inputs:
        resid_df: ['date','ticker','resid_1d', 'beta_spy', 'beta_sector'(optional)]
    Outputs:
        targets_df with columns:
          - 'date','ticker'
          - r_ex_<h> for each h in horizons
          - y<h>_rank  (cross-sectional percentile per date, in [0,1])
          - valid_mask (True iff all r_ex_* present)
    Artifacts:
        targets.parquet, targets_qc.json
    """
    req = {"date", "ticker", "resid_1d"}
    if not req.issubset(resid_df.columns):
        raise KeyError(f"forward_excess_targets: missing columns {req - set(resid_df.columns)}")

    df = resid_df[["date", "ticker", "resid_1d"]].sort_values(["ticker", "date"]).copy()

    # Per-ticker future sums for each horizon
    out = df[["date", "ticker"]].copy()
    for h in horizons:
        col = f"r_ex_{h}"
        out[col] = (
            df.groupby("ticker", group_keys=False)["resid_1d"]
              .apply(lambda s, _h=h: _future_sum(s, _h))
        )

    # Cross-sectional ranks per date
    for h in horizons:
        rx = f"r_ex_{h}"
        rk = f"y{h}_rank"
        # rank in [0,1]; optionally center to [-0.5,+0.5]
        pct = out.groupby("date")[rx].rank(pct=True, method=RANK_METHOD)
        out[rk] = pct - 0.5 if RANK_CENTER else pct

    # Valid mask: rows where all r_ex_* are present
    r_cols = [f"r_ex_{h}" for h in horizons]
    out["valid_mask"] = out[r_cols].notna().all(axis=1)

    # Optional QC + artifacts
    if write_artifacts:
        qc = {
            "rows": int(len(out)),
            "dates": int(out["date"].nunique()),
            "tickers": int(out["ticker"].nunique()),
            "horizons": list(horizons),
            "valid_frac": float(out["valid_mask"].mean()),
        }
        if extra_qc:
            qc.update(extra_qc)

        out.sort_values(["date", "ticker"]).to_parquet(TARGETS_FP, index=False)
        TARGETS_QC_FP.write_text(json.dumps(qc, indent=2))

    return out


# ──────────────────────────────────────────────────────────────────────────────
# 3.1.6: Merge unified modeling panel (X × Y × R)
# ──────────────────────────────────────────────────────────────────────────────

def build_unified_model_panel(
    feats: pd.DataFrame,
    targets: pd.DataFrame,
    regimes_keep: pd.DataFrame,
    cs_cols: List[str],
    write_artifacts: bool = True,
) -> Tuple[pd.DataFrame, Dict]:
    """
    Merge:
      X (features from §1) × Y (targets from 3.1.5) × R (regimes from §2)
    and produce a single modeling panel. No re-standardization is done
    (features are already CS-standardized in §1).

    Inputs:
        feats        : features_filtered subset with ['date','ticker', <cs_cols>, 'adj_close', ...]
        targets      : output of forward_excess_targets
        regimes_keep : ['date','regime_label_use', 'p*'(opt), 'g'(opt)]
        cs_cols      : feature columns to keep as X

    Outputs:
        panel        : unified panel with X, Y, R (pre-split)
        meta         : dict with X_cols, Y_cols, R_cols for feature_list.json

    Artifacts:
        feature_list.json, panel_master.parquet, leakage_scan.json (via assert)
    """
    # Keep only the columns we truly need from feats: identifiers + X
    keep_X = ["date", "ticker"] + list(cs_cols)
    missing_in_feats = set(keep_X) - set(feats.columns)
    if missing_in_feats:
        raise KeyError(f"build_unified_model_panel: missing feature columns: {missing_in_feats}")

    X_df = feats[keep_X].copy()

    # Targets
    req_tgt = {"date", "ticker"}
    if not req_tgt.issubset(targets.columns):
        raise KeyError(f"build_unified_model_panel: targets missing {req_tgt - set(targets.columns)}")
    Y_cols = [c for c in targets.columns if c.startswith("r_ex_")] + \
             [c for c in targets.columns if c.startswith("y") and c.endswith("_rank")] + ["valid_mask"]
    Y_df = targets[["date", "ticker"] + Y_cols].copy()

    # Regimes (per-date)
    req_rg = {"date", "regime_label_use"}
    if not req_rg.issubset(regimes_keep.columns):
        raise KeyError(f"build_unified_model_panel: regimes missing {req_rg - set(regimes_keep.columns)}")
    R_cols_extra = [c for c in regimes_keep.columns if c.startswith("p")] + \
                   (["g"] if "g" in regimes_keep.columns else [])
    R_df = regimes_keep[["date", "regime_label_use"] + R_cols_extra].copy()

    # Merge: ((X ⨝ Y) ⨝ R_on_date)
    panel = (
        X_df.merge(Y_df, on=["date", "ticker"], how="inner")
            .merge(R_df, on="date", how="left")
            .sort_values(["date", "ticker"])
            .reset_index(drop=True)
    )

    # Column groups
    X_cols = cs_cols
    R_cols = ["regime_label_use"] + R_cols_extra

    meta = {
        "X_cols": X_cols,
        "Y_cols": Y_cols,
        "R_cols": R_cols,
        "notes": "CS features from §1 (already winsorized & CS-z-scored); regimes from §2; targets are forward sums of residual daily returns.",
    }

    # Leakage/safety checks
    scan = assert_leakage_free(panel, horizons=[int(c.split("_")[-1]) for c in Y_cols if c.startswith("r_ex_")])

    # Write artifacts
    if write_artifacts:
        FEATURE_LIST_FP.write_text(json.dumps(meta, indent=2))
        panel.to_parquet(PANEL_MASTER_FP, index=False)
        LEAK_SCAN_FP.write_text(json.dumps(scan, indent=2))

    return panel, meta


def assert_leakage_free(panel: pd.DataFrame, horizons: List[int]) -> Dict:
    """
    Minimal-yet-critical validations for §3.1:
      • Unique (date,ticker)
      • No target defined on last max(h) days per ticker (should be NaN → valid_mask=False)
      • No forward fill (spot-check last obs per ticker)
      • Basic NA/coverage summaries
    Returns a JSON-serializable dict; also used to write leakage_scan.json upstream.
    """
    res: Dict = {"status": "PASS", "notes": []}

    def _warn(msg: str) -> None:
        # Record a warning without downgrading an earlier FAIL.
        if res["status"] != "FAIL":
            res["status"] = "WARN"
        res["notes"].append(msg)

    # Unique keys
    if panel.duplicated(["date", "ticker"]).any():
        n_dup = int(panel.duplicated(["date", "ticker"]).sum())
        res["status"] = "FAIL"
        res["notes"].append(f"Duplicate (date,ticker) rows: {n_dup}")

    # Tail guard: on the last available date per ticker, targets must be NaN
    r_cols = [c for c in panel.columns if c.startswith("r_ex_")]
    if r_cols:
        last_rows = panel.groupby("ticker", as_index=False).tail(1)
        non_nan_last = {c: int(last_rows[c].notna().sum()) for c in r_cols}
        # It's OK if some last rows exist for shorter horizons; but conservative guard:
        if any(v > 0 for v in non_nan_last.values()):
            _warn(f"Some last ticker rows have non-NaN targets (check rolling logic): {non_nan_last}")

    # Spot-check: last date overall should have NaNs in all r_ex_* (train tail)
    if r_cols:
        last_date = panel["date"].max()
        last_slice = panel.loc[panel["date"].eq(last_date), r_cols]
        if last_slice.notna().any().any():
            _warn("Latest calendar date contains non-NaN targets; confirm tail exclusion in §3.1.7 split.")

    # NA/coverage
    cov = {
        "rows": int(len(panel)),
        "dates": int(panel["date"].nunique()),
        "tickers": int(panel["ticker"].nunique()),
        "valid_frac": float(panel["valid_mask"].mean()) if "valid_mask" in panel.columns else None,
        "target_na_frac": {c: float(panel[c].isna().mean()) for c in r_cols},
    }
    res["coverage"] = cov
    return res


# ──────────────────────────────────────────────────────────────────────────────
# Example orchestration for 3.1.5 + 3.1.6 (call these from your section runner):
#
# 1) From 3.1.1/3.1.2:
# feats, px, cs_cols, regimes_keep = prep_3_1_3_context(...)
#
# 2) From 3.1.3/3.1.4:
# px_ret  = compute_daily_returns(px)
# spy_ret = extract_spy_returns(px_ret)
# sector_map  = load_sector_map(SECTOR_ETF_MAP_FP) if SECTOR_NEUTRAL else None
# sector_rets = prepare_sector_returns(px_ret, sector_map) if (SECTOR_NEUTRAL and sector_map is not None) else None
# resid_df, qc_excess = compute_excess_returns(
#     px_ret=px_ret,
#     spy_ret=spy_ret,
#     lookback=ROLL_LOOKBACK_D,
#     sector_neutral=SECTOR_NEUTRAL,
#     sector_map=sector_map,
#     sector_returns=sector_rets,
# )
#
# 3) 3.1.5: targets
# targets = forward_excess_targets(resid_df, horizons=HORIZONS, write_artifacts=True, extra_qc=qc_excess)
#
# 4) 3.1.6: unified modeling panel
# panel, meta = build_unified_model_panel(feats, targets, regimes_keep, cs_cols, write_artifacts=True)
#
# Next (3.1.7): read windows_index.json / window_manifest.json to slice `panel_master.parquet`
# into `panel_train_<winid>.parquet` and `panel_test_<winid>.parquet`, plus per-window QC.
# ──────────────────────────────────────────────────────────────────────────────
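
The forward targets in 3.1.5 are meant to sum `resid_1d` over t+1..t+h, and the order of `shift` and `rolling` decides whether that window really lies in the future. The sketch below is a small alignment check on a toy series; the helper names `future_sum` and `brute_future_sum` and the sample values are illustrative assumptions, not part of the pipeline. It compares a roll-then-shift implementation and a shift-then-roll variant against an explicit loop:

```python
import numpy as np
import pandas as pd


def future_sum(series: pd.Series, h: int) -> pd.Series:
    """Trailing h-sum ending at t+h, re-indexed to t (covers t+1..t+h)."""
    return series.rolling(window=h, min_periods=h).sum().shift(-h)


def brute_future_sum(series: pd.Series, h: int) -> pd.Series:
    """Reference implementation: explicit loop over the forward window."""
    vals = series.to_numpy()
    out = np.full(len(vals), np.nan)
    for t in range(len(vals) - h):
        out[t] = vals[t + 1 : t + 1 + h].sum()
    return pd.Series(out, index=series.index)


s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
h = 2

fwd = future_sum(s, h)
ref = brute_future_sum(s, h)

# Variant that shifts first and then rolls: its window covers t-h+2..t+1,
# so for h=2 the value at t includes s[t] itself.
shift_then_roll = s.shift(-1).rolling(window=h, min_periods=h).sum()

matches_reference = bool(np.allclose(fwd, ref, equal_nan=True))
variant_matches = bool(np.allclose(shift_then_roll, ref, equal_nan=True))
print(pd.DataFrame({"s": s, "fwd": fwd, "ref": ref, "shift_then_roll": shift_then_roll}))
print("roll-then-shift matches reference:", matches_reference)
print("shift-then-roll matches reference:", variant_matches)
```

A check of this kind is cheap to run per horizon before ranks are computed, since a misaligned window silently turns a forecasting target into a partly contemporaneous one.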

In [None]:
# ──────────────────────────────────────────────────────────────────────────────
# Section 3.1.7 (+ optional 3.1.8): Walk-Forward Split & Minimal Validation
#
# What this provides:
#   • load_windows_plan(): load windows_index.json (preferred) or window_manifest.json (fallback)
#   • split_panel_by_windows(): slice panel_master → per-window train/test parquet files
#       - Applies defensive tail drop: remove rows after (last_train_date - max(HORIZONS))
#       - Writes per-window QC CSVs (row/date/ticker counts, NA rates)
#   • (optional) build_3_1_validation_report(): roll up 3.1 checks into one JSON
#
# Reuse:
#   - PANEL_MASTER_FP from 3.1.6 (unified X×Y×R panel)
#   - HORIZONS from 3.1.5 (to compute defensive tail)
#   - windows_index.json / window_manifest.json from Section 2.8
#
# #TOCHANGE knobs:
#   - HORIZONS        = [5, 10] (real run: add 20)
#   - FORCE_STITCHED_SPLIT = False (leave False; for diagnostics you can set True to re-stitch strict OOS slices)
# ──────────────────────────────────────────────────────────────────────────────

from __future__ import annotations

import json
from pathlib import Path
from typing import Dict, List, Tuple, Optional

import numpy as np
import pandas as pd

ARTIFACTS_DIR = Path("artifacts")
REGIME_DIR    = ARTIFACTS_DIR / "regimes"
ALPHA_DIR     = ARTIFACTS_DIR / "alpha"
PANELS_DIR    = ALPHA_DIR / "panels"

PANELS_DIR.mkdir(parents=True, exist_ok=True)

# From 3.1.5/3.1.6
HORIZONS: List[int] = [5, 10]  # #TOCHANGE real run: [5,10,20]
PANEL_MASTER_FP  = PANELS_DIR / "panel_master.parquet"
FEATURE_LIST_FP  = ALPHA_DIR / "feature_list.json"
TARGETS_FP       = ALPHA_DIR / "targets.parquet"
TARGETS_QC_FP    = ALPHA_DIR / "targets_qc.json"
LEAK_SCAN_FP     = ALPHA_DIR / "leakage_scan.json"

# WF plan inputs (from Section 2.8)
WINDEX_FP    = REGIME_DIR / "windowed" / "windows_index.json"
WMANIFEST_FP = REGIME_DIR / "window_manifest.json"  # fallback (single-window)

# Output QC
VAL_3_1_FP   = ALPHA_DIR / "3_1_validation_report.json"

# Optional behavior
FORCE_STITCHED_SPLIT = False  # #TOCHANGE: keep False. If True, enforces hard OOS no-overlap beyond provided plan.


# ──────────────────────────────────────────────────────────────────────────────
# 3.1.7: Walk-Forward split helpers
# ──────────────────────────────────────────────────────────────────────────────

def load_windows_plan(
    windex_fp: Path = WINDEX_FP,
    wmanifest_fp: Path = WMANIFEST_FP
) -> List[Dict]:
    """
    Load a walk-forward plan from:
      1) windows_index.json (preferred, possibly multi-window)
      2) window_manifest.json (fallback, single-window)
    Returns a list of window dicts with keys:
      win_id, train_start, train_end, test_start, test_end
    """
    if windex_fp.exists():
        data = json.loads(windex_fp.read_text())
        # Expected schema (from §2.8): {"windows": [{...}, ...]}
        windows = data.get("windows", [])
        if not windows:
            raise ValueError("windows_index.json present but 'windows' empty.")
        # Normalize
        outs = []
        for i, w in enumerate(windows):
            outs.append({
                "win_id": w.get("win_id", f"W{i}"),
                "train_start": w["train_start"],
                "train_end":   w["train_end"],
                "test_start":  w["test_start"],
                "test_end":    w["test_end"],
            })
        return outs

    if wmanifest_fp.exists():
        w = json.loads(wmanifest_fp.read_text()).get("window", None)
        if not w:
            raise ValueError("window_manifest.json present but no 'window' key.")
        return [{
            "win_id": "W0",
            "train_start": w["train_start"],
            "train_end":   w["train_end"],
            "test_start":  w["test_start"],
            "test_end":    w["test_end"],
        }]

    raise FileNotFoundError(
        "No windows_index.json or window_manifest.json found. "
        "Run Section 2.8 to generate the walk-forward plan."
    )


def _defensive_tail_drop(df_train: pd.DataFrame, max_h: int) -> pd.DataFrame:
    """
    Defensive tail: drop any training rows after (last_train_date - max_h business days).
    Rationale: ensure no accidental leakage from partially formed forward targets.
    """
    if df_train.empty:
        return df_train
    last_train_date = df_train["date"].max()
    # BDay skips weekends but not exchange holidays, so the buffer can be slightly
    # shorter than max_h trading days; fine for defense (targets are already forward-shifted).
    cutoff = pd.to_datetime(last_train_date) - pd.tseries.offsets.BDay(max_h)
    return df_train[df_train["date"] <= cutoff]


def _write_panel_qc(panel: pd.DataFrame, out_csv: Path, label: str) -> Dict:
    """
    Emit basic QC: row counts, NA rates for targets, valid_mask share, unique keys, date/ticker counts.
    """
    r_cols = [c for c in panel.columns if c.startswith("r_ex_")]
    qc = {
        "label": label,
        "rows": int(len(panel)),
        "dates": int(panel["date"].nunique()),
        "tickers": int(panel["ticker"].nunique()),
        "duplicate_keys": int(panel.duplicated(["date","ticker"]).sum()),
        "valid_frac": float(panel["valid_mask"].mean()) if "valid_mask" in panel.columns else None,
        "target_na_frac": {c: float(panel[c].isna().mean()) for c in r_cols},
    }
    pd.DataFrame([qc]).to_csv(out_csv, index=False)
    return qc


def split_panel_by_windows(
    panel_master_fp: Path = PANEL_MASTER_FP,
    windex_fp: Path = WINDEX_FP,
    wmanifest_fp: Path = WMANIFEST_FP,
    horizons: List[int] = HORIZONS,
) -> Dict:
    """
    Split the unified modeling panel into per-window train/test parquet files.
    Also writes per-window QC CSVs.
    Returns a summary dict with file paths and counts.
    """
    if not panel_master_fp.exists():
        raise FileNotFoundError(f"Unified panel not found at {panel_master_fp}. Run 3.1.6 first.")

    panel = pd.read_parquet(panel_master_fp)
    panel["date"] = pd.to_datetime(panel["date"])

    windows = load_windows_plan(windex_fp, wmanifest_fp)
    max_h = int(max(horizons)) if horizons else 0

    summary = {"windows": [], "max_h": max_h, "panel_master": str(panel_master_fp)}

    for w in windows:
        win_id = w["win_id"]
        t0, t1 = pd.to_datetime(w["train_start"]), pd.to_datetime(w["train_end"])
        u0, u1 = pd.to_datetime(w["test_start"]),  pd.to_datetime(w["test_end"])

        # Train/Test slices (by date only; (date,ticker) are already unique)
        df_train = panel[(panel["date"] >= t0) & (panel["date"] <= t1)].copy()
        df_test  = panel[(panel["date"] >= u0) & (panel["date"] <= u1)].copy()

        # Defensive tail for train
        df_train = _defensive_tail_drop(df_train, max_h=max_h)

        # Optional enforcement (usually not required if §2.8 plan is clean)
        if FORCE_STITCHED_SPLIT:
            # Drop any accidental overlaps
            max_train_date = df_train["date"].max() if not df_train.empty else t1
            df_test = df_test[df_test["date"] > max_train_date]

        # Write files
        train_fp = PANELS_DIR / f"panel_train_{win_id}.parquet"
        test_fp  = PANELS_DIR / f"panel_test_{win_id}.parquet"
        df_train.to_parquet(train_fp, index=False)
        df_test.to_parquet(test_fp, index=False)

        # Per-window QC
        qc_train = _write_panel_qc(df_train, PANELS_DIR / f"panel_train_{win_id}_QC.csv", f"train_{win_id}")
        qc_test  = _write_panel_qc(df_test,  PANELS_DIR / f"panel_test_{win_id}_QC.csv",  f"test_{win_id}")

        summary["windows"].append({
            "win_id": win_id,
            "train_start": w["train_start"],
            "train_end":   w["train_end"],
            "test_start":  w["test_start"],
            "test_end":    w["test_end"],
            "train_rows":  int(len(df_train)),
            "test_rows":   int(len(df_test)),
            "train_fp":    str(train_fp),
            "test_fp":     str(test_fp),
            "qc_train":    qc_train,
            "qc_test":     qc_test,
        })

    # Save a simple index for Section 3 consumers
    (ALPHA_DIR / "panels_index.json").write_text(json.dumps(summary, indent=2))
    return summary


# ──────────────────────────────────────────────────────────────────────────────
# 3.1.8 (optional): Minimal roll-up validation for §3.1
# This aggregates what we already wrote in earlier steps (targets_qc, leakage scan,
# per-window QC) so you get a single line in CI telling you whether §3.1 is healthy.
# ──────────────────────────────────────────────────────────────────────────────

def build_3_1_validation_report() -> Dict:
    """
    Collects and summarizes key QC from §3.1:
      - targets_qc.json (coverage, sector-neutral coverage if enabled)
      - leakage_scan.json (schema/tail checks)
      - panels_index.json (per-window row counts)
    Writes artifacts/alpha/3_1_validation_report.json
    """
    report = {"status": "PASS", "checks": {}}

    # Targets QC
    if TARGETS_QC_FP.exists():
        tgt_qc = json.loads(TARGETS_QC_FP.read_text())
        report["checks"]["targets_qc"] = tgt_qc
        # Basic guard: decent valid fraction
        if tgt_qc.get("valid_frac", 0.0) < 0.75:  # #TOCHANGE threshold if needed
            report["status"] = "WARN"
            report.setdefault("notes", []).append("Low valid_frac in targets.")
    else:
        report["status"] = "FAIL"
        report["checks"]["targets_qc"] = "missing"
        report.setdefault("notes", []).append("targets_qc.json not found.")

    # Leakage scan
    if LEAK_SCAN_FP.exists():
        leak = json.loads(LEAK_SCAN_FP.read_text())
        report["checks"]["leakage_scan"] = leak
        if leak.get("status") == "FAIL":
            report["status"] = "FAIL"
        elif leak.get("status") == "WARN" and report["status"] != "FAIL":
            report["status"] = "WARN"
    else:
        report["status"] = "FAIL"
        report["checks"]["leakage_scan"] = "missing"
        report.setdefault("notes", []).append("leakage_scan.json not found.")

    # Panels index
    panels_idx_fp = ALPHA_DIR / "panels_index.json"
    if panels_idx_fp.exists():
        panels_idx = json.loads(panels_idx_fp.read_text())
        report["checks"]["panels_index"] = {
            "n_windows": len(panels_idx.get("windows", [])),
            "max_h": panels_idx.get("max_h"),
        }
        # Sanity: all windows must have some test rows
        for w in panels_idx.get("windows", []):
            if w.get("test_rows", 0) <= 0:
                report["status"] = "WARN" if report["status"] != "FAIL" else "FAIL"
                report.setdefault("notes", []).append(f"No TEST rows for {w['win_id']}.")
    else:
        report["status"] = "FAIL"
        report["checks"]["panels_index"] = "missing"
        report.setdefault("notes", []).append("panels_index.json not found.")

    VAL_3_1_FP.write_text(json.dumps(report, indent=2))
    return report


# ──────────────────────────────────────────────────────────────────────────────
# Example end-to-end for §3.1.7-3.1.8:
#
# 1) Ensure you've run:
#    - 3.1.1/3.1.2 (load_base_data, load_regime_context)
#    - 3.1.3/3.1.4 (compute_daily_returns, compute_excess_returns)
#    - 3.1.5       (forward_excess_targets → targets.parquet/targets_qc.json)
#    - 3.1.6       (build_unified_model_panel → panel_master.parquet, feature_list.json, leakage_scan.json)
#
# 2) Split into windows:
#    summary = split_panel_by_windows(
#        panel_master_fp=PANEL_MASTER_FP,
#        windex_fp=WINDEX_FP,
#        wmanifest_fp=WMANIFEST_FP,
#        horizons=HORIZONS,
#    )
#
# 3) Optional: roll-up validation
#    report = build_3_1_validation_report()
#    print(report["status"], report.get("notes"))
#
# Deliverables from this step:
#   - artifacts/alpha/panels/panel_train_<winid>.parquet
#   - artifacts/alpha/panels/panel_test_<winid>.parquet
#   - artifacts/alpha/panels/panel_train_<winid>_QC.csv
#   - artifacts/alpha/panels/panel_test_<winid>_QC.csv
#   - artifacts/alpha/panels_index.json
#   - artifacts/alpha/3_1_validation_report.json
# ──────────────────────────────────────────────────────────────────────────────
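
The defensive tail drop in 3.1.7 exists because a training row dated t carries a label built from returns on t+1..t+h, so rows within `max_h` business days of `train_end` have labels that reach into the test period. A minimal sketch on a toy single-ticker panel follows; the dates, the `window` dict, and `max_h = 10` are made-up values for illustration only:

```python
import pandas as pd

# Toy panel on consecutive business days (synthetic, for illustration only).
dates = pd.bdate_range("2024-01-01", periods=40)
panel = pd.DataFrame({"date": dates, "ticker": "AAA"})

# Hypothetical window using the same keys as the walk-forward plan.
window = {
    "train_start": dates[0], "train_end": dates[24],
    "test_start": dates[25], "test_end": dates[39],
}
max_h = 10  # longest forecast horizon, in business days

# Slice by date, inclusive on both ends.
train = panel[(panel["date"] >= window["train_start"]) & (panel["date"] <= window["train_end"])]
test = panel[(panel["date"] >= window["test_start"]) & (panel["date"] <= window["test_end"])]

# Defensive tail drop: discard training rows whose forward window could extend
# past the last training date.
cutoff = train["date"].max() - pd.offsets.BDay(max_h)
train = train[train["date"] <= cutoff]

# Number of panel dates left unused between the trimmed train set and the test set.
gap_days = int(((panel["date"] > train["date"].max()) & (panel["date"] < test["date"].min())).sum())
print(len(train), len(test), gap_days)
```

In this toy setup the gap equals `max_h` because every weekday is a row; with real trading calendars the business-day offset ignores exchange holidays, so the gap in trading rows can differ slightly.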

In [None]:
## this needs work later
# 3.1: Orchestrator (stub)

RUN_ORCHESTRATOR = False

def run_section_3_1(
    horizons=None,
    sector_neutral=None,
    roll_lookback=None,
    debug_max_tickers=None,
    n_smooth_g=None,
    print_summary: bool = True,
) -> dict:
    """
    End-to-end 3.1 runner:
      1) 3.1.1/3.1.2 load base & regimes
      2) 3.1.3 daily returns (assets/hedges)
      3) 3.1.4 residuals (SPY-only or sector-neutral)
      4) 3.1.5 forward targets + ranks
      5) 3.1.6 unified modeling panel (X×Y×R)
      6) 3.1.7 windowed split + per-window QC
      7) 3.1.8 (optional) roll-up validation for §3.1
    """
    # honor overrides (use #TOCHANGE defaults otherwise)
    _H = horizons or HORIZONS
    _SN = SECTOR_NEUTRAL if sector_neutral is None else bool(sector_neutral)
    _RL = ROLL_LOOKBACK_D if roll_lookback is None else int(roll_lookback)
    _DM = DEBUG_MAX_TICKERS if debug_max_tickers is None else int(debug_max_tickers)
    _NG = N_SMOOTH_G if n_smooth_g is None else int(n_smooth_g)

    # 1) Base + regimes (pass the resolved overrides, not the module defaults)
    feats, px, cs_cols, regimes_keep = prep_3_1_3_context(
        debug_max_tickers=_DM,
        n_smooth_g=_NG,
    )
    log_3_1_1_2_summary(feats, px, cs_cols, regimes_keep)

    # 2) Daily returns
    px_ret = compute_daily_returns(px)   # equities only
    spy_ret = load_spy_returns()         # from S2 market_panel or S1 raw_prices

    # 3) Sector map/returns (only if enabled)
    sector_map = load_sector_map(SECTOR_ETF_MAP_FP) if _SN else None
    if _SN and sector_map is not None:
        etfs = sector_map["sector_etf"].dropna().unique().tolist()
        sector_rets = load_sector_etf_returns(etfs)   # from raw_prices.parquet
    else:
        sector_rets = None

    # 4) Excess returns (residuals)
    def compute_excess_returns(
        px_ret: pd.DataFrame,
        spy_ret: pd.Series,
        lookback: int = ROLL_LOOKBACK_D,
        sector_neutral: bool = SECTOR_NEUTRAL,
        sector_map: Optional[pd.DataFrame] = None,
        sector_returns: Optional[pd.DataFrame] = None,
        lambda_ridge: float = LAMBDA_RIDGE,
    ) -> Tuple[pd.DataFrame, Dict]:
        """
        Compute residual daily log returns ('excess') for each equity:
          ‚Ä¢ SPY-only mode: resid_1d = ret_1d - Œ≤_SPY * spy_ret
          ‚Ä¢ Sector-neutral mode: resid_1d = ret_1d - (Œ≤_SPY*spy + Œ≤_SECTOR*sector)

        Inputs:
            px_ret        : ['date','ticker','ret_1d'] for equities + hedges
            spy_ret       : SPY series indexed by date
            lookback      : rolling window length for betas
            sector_neutral: enable sector model (requires sector_map and ETF returns)
            sector_map    : ['ticker','sector_etf'] for equities (if None → SPY-only)
            sector_returns: ['date','sector_etf','sector_ret_1d'] (if None, inferred from px_ret)
            lambda_ridge  : small ridge (stabilizer) for 2x2 inverse #TOCHANGE real run: 1e-6

        Outputs:
            resid_df: ['date','ticker','resid_1d','beta_spy','beta_sector'] (beta_sector may be NaN)
            qc      : dict with coverage info (e.g., sector_beta_coverage)
        """
        # Base frame with asset ret_1d and SPY ret on each date
        df = px_ret[["date", "ticker", "ret_1d"]].copy()
        df = df.merge(spy_ret.rename("spy_ret_1d"), left_on="date", right_index=True, how="left")

        # Defaults
        have_sector = False
        sector_cov_frac = 0.0

        if sector_neutral and (sector_map is not None) and ("sector_etf" in sector_map.columns):
            # Attach sector ETF daily ret to each equity row
            if sector_returns is None:
                # infer sector_returns from px_ret
                etfs = sector_map["sector_etf"].dropna().unique().tolist()
                sector_returns = (
                    px_ret[px_ret["ticker"].isin(etfs)][["date", "ticker", "ret_1d"]]
                    .rename(columns={"ticker": "sector_etf", "ret_1d": "sector_ret_1d"})
                )
            df = df.merge(sector_map[["ticker", "sector_etf"]], on="ticker", how="left")
            df = df.merge(sector_returns, on=["date", "sector_etf"], how="left")
            have_sector = True

    resid_df, qc_excess = compute_excess_returns(
        px_ret=px_ret,
        spy_ret=spy_ret,
        lookback=_RL,
        sector_neutral=_SN,
        sector_map=sector_map,
        sector_returns=sector_rets,
    )

    # 5) Forward targets + ranks
    targets = forward_excess_targets(resid_df, horizons=_H, write_artifacts=True, extra_qc=qc_excess)

    # 6) Unified panel (X×Y×R)
    panel, meta = build_unified_model_panel(feats, targets, regimes_keep, cs_cols, write_artifacts=True)

    # 7) Split by walk-forward windows + per-window QC
    split_summary = split_panel_by_windows(
        panel_master_fp=PANEL_MASTER_FP, windex_fp=WINDEX_FP, wmanifest_fp=WMANIFEST_FP, horizons=_H
    )

    # 8) Optional: roll-up validation for §3.1
    val = build_3_1_validation_report()

    # Compose run summary
    s = {
        "params": {
            "horizons": _H,
            "sector_neutral": _SN,
            "roll_lookback_days": _RL,
            "debug_max_tickers": _DM,
            "n_smooth_g": _NG,
        },
        "artifacts": {
            "targets": _exists_size(TARGETS_FP),
            "targets_qc": _exists_size(TARGETS_QC_FP),
            "feature_list": _exists_size(FEATURE_LIST_FP),
            "leakage_scan": _exists_size(LEAK_SCAN_FP),
            "panel_master": _exists_size(PANEL_MASTER_FP),
            "panels_index": _exists_size(ALPHA_DIR / "panels_index.json"),
            "val_3_1": _exists_size(ALPHA_DIR / "3_1_validation_report.json"),
        },
        "shapes": {
            "feats_rows": int(len(feats)),
            "panel_master_rows": int(len(panel)),
            "targets_rows": int(len(targets)),
        },
        "qc": {
            "excess_qc": qc_excess,
            "val_3_1_status": val.get("status"),
            "val_3_1_notes": val.get("notes", []),
        },
        "windows": split_summary.get("windows", []),
    }
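
The SPY-only mode described in the `compute_excess_returns` docstring above (`resid_1d = ret_1d - β_SPY * spy_ret`) can be sketched as follows. This is a minimal illustration, not the project's implementation: the helper name `rolling_spy_residuals`, the default 60-day window, and the one-day lag on the beta (so the residual at `t` uses a beta estimated through `t-1`) are all assumptions. The sector-neutral branch with its ridge-stabilised 2x2 solve is not covered.

```python
import numpy as np
import pandas as pd


def rolling_spy_residuals(px_ret: pd.DataFrame, spy_ret: pd.Series, lookback: int = 60) -> pd.DataFrame:
    """Residualise each ticker's daily return against SPY with a trailing-window beta."""
    df = (
        px_ret[["date", "ticker", "ret_1d"]]
        .merge(spy_ret.rename("spy_ret_1d"), left_on="date", right_index=True, how="left")
        .sort_values(["ticker", "date"])
        .reset_index(drop=True)
    )
    parts = []
    for _, g in df.groupby("ticker", sort=False):
        g = g.copy()
        # OLS slope over the trailing window: cov(asset, SPY) / var(SPY)
        cov = g["ret_1d"].rolling(lookback).cov(g["spy_ret_1d"])
        var = g["spy_ret_1d"].rolling(lookback).var()
        # Assumption: lag the beta by one day so day t never uses day-t data
        g["beta_spy"] = (cov / var).shift(1)
        g["resid_1d"] = g["ret_1d"] - g["beta_spy"] * g["spy_ret_1d"]
        parts.append(g)
    return pd.concat(parts, ignore_index=True)[["date", "ticker", "resid_1d", "beta_spy"]]
```

Rows inside the warm-up window carry `NaN` for both `beta_spy` and `resid_1d`, so downstream target construction can drop them explicitly.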

# 4. Portfolio Construction & Risk

# 5. RL Sizing Policy (PPO)

# 6. Backtesting (Backward Testing): Rigor

# 7. Forward Testing (No Orders; Shadow Runs)

# 8. Cost Model & Execution Assumptions

# 9. Reproducibility & Testability

# 10. Visualization & Reporting

# 11. Automation Options (Optional, no trading)

# 12. Optional Alpaca Integration (disabled by default)

# 13. File/Module Structure (Colab-friendly)




```
/project
  config.yaml
  data/
    universe.csv
    features.parquet
    regime_labels.parquet
  models/
    lstm_*.pt / .h5
    gbm_*.txt
    stacker_*.pkl
    rl_policy_*.pkl
  runs/YYYY-MM-DD/
    signals.parquet
    weights.parquet
    hedges.parquet
    daily_pnl.csv
    risk.json
  reports/
    backtest_tearsheet.html
    forward_tearsheet_YYYY-MM.html
  src/
    data_loader.py
    feature_engineering.py
    regime.py
    models_lstm.py
    models_tabular.py
    stacking.py
    uncertainty.py
    portfolio_bl_rp.py
    hedging.py
    rl_policy.py
    backtest.py
    forward_shadow.py
    risk_metrics.py
    stats_tests.py  # DM, SPA/White RC, Sharpe inference
    monte_carlo.py  # block bootstrap
    reporting.py    # plots & HTML/PDF
  main.py          # CLI: daily-shadow / weekly-train / monthly-report
  notebook.ipynb   # Colab master: end-to-end run with toggles

```



# 14. More info

- Suggested stack: pandas, numpy, scikit-learn, lightgbm, xgboost, TensorFlow or PyTorch (choose one for the LSTM), hmmlearn, stable-baselines3, cvxpy (for BL/optimization), arch (optional), statsmodels, scipy, matplotlib/plotly.

Compute plan (fits $50–$100):

- S&P 100, 5–8 walk-forward windows.

- LSTM 1–2 layers (64–128 units), MC-dropout 20 samples.

- PPO with modest timesteps per window.

- 200–400 Monte Carlo bootstrap paths.

- 1–3 GPU hours on Colab Pro/Pro+; RAM < 24GB.

# 15. Build Order (fastest to value)

1. Data + Features + Regimes → validate leakage & plots.

2. Multifactor composite → baseline cross-sec L/S backtest.

3. GBM/MLP + LSTM → stacking + uncertainty; re-run backtest.

4. BL + RP + Dynamic hedge → re-run backtest & stress.

5. RL sizing → ablation vs no-RL; finalize backtest.

6. Forward shadow loop (daily), weekly retrain, monthly reports.

7. Automation (Actions/cron), optional Alpaca paper stub (off).

# 16. What you'll see in the first results
- Backtest tear sheet with out-of-sample (OOS) equity curve, MC bands, by-regime tables, SPA/DM outcomes, VaR/CVaR & stress.

- Ablation:

  - Multifactor only → +ML → +ML+RL;

  - Market-neutral vs long-only w/ hedging;

  - Cost sensitivity 5–20 bps.

- A live forward dashboard (from Day 1) accumulating daily PnL + monthly report.



# 17. Forward-Testing Duration Recommendation

- Run at least 4 weeks forward shadow to confirm plumbing & stability.

- Prefer 8–12 weeks to evaluate regime adaptation, RL sizing behavior under drawdowns, and cost realism.

- Only after the forward period matches backtest risk/return within expected error bands should you consider paper-trading execution.



<details>
<summary><strong>Outline Details</strong></summary>

# Project Outline: Regime-Aware Multifactor + LSTM/Ensembles + RL (with rigorous back & forward testing)

## 0) Objectives & Success Criteria
**Primary objective:** Generate statistically significant pure alpha (market-neutral) with controlled drawdowns after transaction costs.  

**Secondary objective:** Build a repeatable process capable of ongoing, unattended forward testing that outputs monthly tear sheets.  

**Pass/Fail gates (OOS):**  
- Annualized Sharpe ≥ 1.0 (cost-adjusted) across walk-forward windows.  
- SPA/White Reality Check non-rejection vs family of alternatives at 5–10% level.  
- Max DD ≤ 15–20% (tunable) in backtests.  
- Forward test (4–8+ weeks): positive return, rolling Sharpe > 0.8, tail losses consistent with backtest VaR/CVaR.  

---

## 1) Data & Universe

### 1.1 Universe
- S&P 100 equities (liquid, keeps compute sane).  
- Hedging instruments: SPY + sector ETFs (XLY, XLF, XLV, XLK, XLI, XLE, XLP, XLB, XLU, XLRE).  
- Source: Yahoo Finance (daily bars).  
- Lookback: 10–15 years if available (train from 2012 onward, test on the most recent period).  

### 1.2 Features
- **Returns/vol:** log returns (1–60d lags), realized vol, ATR.  
- **Momentum:** 12–1, 6–1, 20d, trend filters (e.g., SMA cross, slope).  
- **Value:** B/P, E/P, CF/P, shareholder yield (latest available; forward-fill monthly/quarterly).  
- **Quality:** gross profitability, ROE, accruals, leverage, F-Score-like composite.  
- **Market context:** VIX, SPY vol, market breadth (% advancers, optional).  
- Leakage controls: strictly lag all features, align to t-1; winsorize & z-score cross-sectionally.  
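
The leakage controls above (lag to t-1, then winsorize and z-score within each date) can be sketched as below. This is an illustrative helper under stated assumptions, not the project's code: the name `lag_and_standardize`, the long-format columns `date` and `ticker`, and the 1%/99% clipping quantiles are hypothetical choices.

```python
import pandas as pd


def lag_and_standardize(feats: pd.DataFrame, cols: list, lag: int = 1, clip_q: float = 0.01) -> pd.DataFrame:
    """Lag features per ticker, then winsorize and z-score them within each date."""
    out = feats.sort_values(["ticker", "date"]).copy()
    # 1) Per-ticker lag: the row dated t only carries values observed at t - lag
    out[cols] = out.groupby("ticker")[cols].shift(lag)

    def _cross_section(x: pd.Series) -> pd.Series:
        # 2) Winsorize at the chosen quantiles, then z-score across tickers
        lo, hi = x.quantile(clip_q), x.quantile(1 - clip_q)
        x = x.clip(lo, hi)
        sd = x.std(ddof=0)
        return (x - x.mean()) / sd if sd > 0 else x * 0.0

    for c in cols:
        out[c] = out.groupby("date")[c].transform(_cross_section)
    return out
```

Lagging before standardizing matters: the cross-sectional mean and standard deviation on date t are then computed only from values that were already known at t-1.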

### 1.3 Data Hygiene
- Survivorship bias: use the current S&P 100 constituents for practicality; optionally move to point-in-time membership later.  
- Corporate actions: use adjusted prices.  
- Missing fundamentals: impute conservatively or drop; record masks for model.  
- **Deliverables:** `features.parquet`, `universe.csv`, `meta.yaml`.  

---

## 2) Regime Modeling

### 2.1 HMM (2–3 states)
- Inputs: SPY daily returns/vol, VIX level/change, market breadth.  
- States: Risk-On, Risk-Off, Transition (labeled by average return/vol).  
- **Output:** daily regime label + posterior probabilities.  
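
An HMM returns anonymous integer states, so a labeling step is needed to turn them into Risk-On / Risk-Off / Transition. The sketch below assumes the state sequence has already been decoded elsewhere and ranks states by mean SPY return divided by SPY volatility. That ranking rule and the helper name `label_regimes` are assumptions; the outline only says states are labeled by average return/vol.

```python
import pandas as pd


def label_regimes(state_ids: pd.Series, spy_ret: pd.Series) -> pd.Series:
    """Map integer HMM states to regime names using per-state return/vol."""
    stats = spy_ret.groupby(state_ids).agg(["mean", "std"])
    # Assumption: rank states by mean/std; states with undefined std are skipped
    score = (stats["mean"] / stats["std"]).dropna().sort_values()
    names = {score.index[0]: "Risk-Off", score.index[-1]: "Risk-On"}
    # Any state that is neither best nor worst is treated as Transition
    return state_ids.map(lambda s: names.get(s, "Transition"))
```

Because state numbering can change between refits, re-deriving the labels after every fit keeps regime names comparable across walk-forward windows.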

### 2.2 Usage
- Regime-specific ensemble weights, turnover caps, and risk targets.  
- Momentum throttled in Risk-Off; quality emphasized.  
- **Deliverables:** `regime_labels.parquet`, regime plot.  

---

## 3) Alpha Layer (Signals)

### 3.1 Multifactor Composite
- Value/Momentum/Quality composites (winsorized, z-scored).  
- Per-regime blend fit with ridge.  
- **Output:** factor alpha score per asset/day.  

### 3.2 ML Overlays
- **LSTM:** 60-day sequences → t+5/t+10 returns; MC-dropout for uncertainty.  
- **Tabular ensembles:** LightGBM (primary), XGBoost, small MLP; also quantile versions.  
- **Stacking meta-learner:** ridge/LightGBM; OOF training within walk-forward train window.  
- **Output:** final forecast (mean) + uncertainty proxy.  

### 3.3 Uncertainty → Confidence
- Expected Sharpe proxy = mean / std_hat.  
- Bucket confidence for analytics.  
- **Deliverables:** `alpha_raw.parquet`, `alpha_ensemble.parquet`, feature importance charts.  
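
A minimal sketch of the expected-Sharpe proxy and the confidence buckets, assuming the forecast mean and its uncertainty estimate arrive as aligned pandas Series. The helper name `confidence_scores`, the five buckets, and bucketing on the absolute value of the proxy (so strong shorts count as high confidence) are assumptions.

```python
import numpy as np
import pandas as pd


def confidence_scores(mean: pd.Series, std_hat: pd.Series, n_buckets: int = 5, eps: float = 1e-8) -> pd.DataFrame:
    """Expected Sharpe proxy (mean / std_hat) plus a quantile bucket of its magnitude."""
    # Floor the uncertainty so near-zero std_hat cannot blow up the ratio
    sharpe_proxy = mean / std_hat.clip(lower=eps)
    # Rank first so tied values cannot produce duplicate bin edges in qcut
    ranks = sharpe_proxy.abs().rank(method="first")
    bucket = pd.qcut(ranks, n_buckets, labels=False)
    return pd.DataFrame({"sharpe_proxy": sharpe_proxy, "conf_bucket": bucket})
```

Bucket 0 holds the weakest signals and bucket `n_buckets - 1` the strongest, which makes bucket-level PnL tables straightforward to build.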

---

## 4) Portfolio Construction & Risk

### 4.1 Baseline Weights
- Cross-sectional L/S: long top decile, short bottom decile by forecasted Sharpe.  
- Beta-neutral, per-name and sector caps.  
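
The baseline decile long/short book can be sketched as follows. This version is only dollar-neutral with equal weights inside each leg; beta neutrality and the per-name and sector caps mentioned above would need an extra step. The helper name `decile_long_short` and the unit gross exposure are assumptions.

```python
import pandas as pd


def decile_long_short(score: pd.Series, q: float = 0.1, gross: float = 1.0) -> pd.Series:
    """Equal-weight long the top q fraction and short the bottom q fraction by score."""
    ranked = score.dropna().sort_values()
    n = max(int(len(ranked) * q), 1)
    w = pd.Series(0.0, index=score.index)
    # Half of gross exposure on each side keeps the book dollar-neutral
    w.loc[ranked.index[-n:]] = 0.5 * gross / n
    w.loc[ranked.index[:n]] = -0.5 * gross / n
    return w
```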

### 4.2 Black–Litterman (BL)
- Prior: market-cap weights → implied μ.  
- Views: ensemble alphas scaled by uncertainty.  
- Posterior μ̂ → mean-variance with L2 & turnover penalty.  
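
The prior-to-posterior step can be sketched with the textbook Black-Litterman formulas. The risk-aversion `delta=2.5` and scaling `tau=0.05` are illustrative values, not project settings, and the helper name `bl_posterior` is hypothetical. Here `P` picks the assets each view refers to, `q` holds the view returns (the ensemble alphas), and `Omega` is the view covariance (larger entries mean less confidence).

```python
import numpy as np


def bl_posterior(Sigma, w_mkt, P, q, Omega, delta=2.5, tau=0.05):
    """Return (implied prior returns, posterior mean) under Black-Litterman."""
    Sigma, w_mkt = np.asarray(Sigma, float), np.asarray(w_mkt, float)
    P, q, Omega = np.asarray(P, float), np.asarray(q, float), np.asarray(Omega, float)
    # Reverse optimisation: returns implied by market-cap weights
    pi = delta * Sigma @ w_mkt
    prior_prec = np.linalg.inv(tau * Sigma)
    view_prec = np.linalg.inv(Omega)
    # Precision-weighted blend of the prior and the views
    A = prior_prec + P.T @ view_prec @ P
    b = prior_prec @ pi + P.T @ view_prec @ q
    return pi, np.linalg.solve(A, b)
```

Two limiting cases are a useful sanity check: with very uncertain views the posterior collapses to the implied prior, and with near-certain views on every asset it collapses to `q`.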

### 4.3 Risk Parity & Vol Target
- Equalize risk across sector/factor clusters.  
- Target portfolio vol (8–12% ann.).  
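
A rough sketch of both steps, under simplifying assumptions: inverse-volatility weights stand in for risk parity (they equalize risk contributions only when correlations are ignored), and the vol target is applied as a single leverage scalar on a daily covariance matrix. The helper names, the 10% target, the 252-day annualization, and the leverage cap are hypothetical.

```python
import numpy as np


def inverse_vol_weights(vols):
    """Naive risk parity: weight each cluster proportionally to 1 / volatility."""
    inv = 1.0 / np.asarray(vols, float)
    return inv / inv.sum()


def scale_to_target_vol(weights, cov_daily, target_ann_vol=0.10, periods=252, max_leverage=3.0):
    """Scale weights so the ex-ante annualized volatility matches the target."""
    weights = np.asarray(weights, float)
    port_vol = np.sqrt(weights @ np.asarray(cov_daily, float) @ weights * periods)
    if port_vol <= 0:
        return weights * 0.0
    # Cap the scalar so a quiet covariance estimate cannot imply extreme leverage
    return weights * min(target_ann_vol / port_vol, max_leverage)
```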

### 4.4 Dynamic Hedging
- Daily orthogonalization vs SPY + sectors; hedge ratios adjustable by RL.  
- **Deliverables:** weights, exposures, hedge plots.  

---

## 5) RL Sizing Policy (PPO)

### 5.1 Role
- Scales risk target and tunes hedges.  

### 5.2 State
- Regime, vol, drawdown, alpha strength, uncertainty, turnover, betas, cost model.  

### 5.3 Reward
- PnL - costs - λ·CVaR_tail - κ·Δdrawdown - penalties.  
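
The reward terms above can be written as a per-step function to pin down the sign conventions. This sketch assumes `cvar_tail` and the drawdowns are passed in as positive fractions (0.03 means 3% below the running peak) and that only increases in drawdown are penalized. Those conventions, the default `lam` and `kappa`, and the name `step_reward` are assumptions.

```python
def step_reward(pnl, cost, cvar_tail, drawdown, prev_drawdown, lam=0.1, kappa=0.5, penalties=0.0):
    """Per-step reward: PnL net of costs, tail-risk, drawdown-increase and other penalties."""
    # Assumption: a shrinking drawdown earns no bonus, it simply is not penalized
    dd_increase = max(drawdown - prev_drawdown, 0.0)
    return pnl - cost - lam * cvar_tail - kappa * dd_increase - penalties
```

For example, `step_reward(0.01, 0.001, 0.05, 0.03, 0.02)` nets 0.009 of PnL after costs against a 0.005 tail-risk charge and a 0.005 drawdown charge, leaving a slightly negative reward.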

### 5.4 Training
- Train within walk-forward segments; fixed seeds.  
- **Deliverables:** `rl_policy.pkl`, diagnostics.  

---

## 6) Backtesting (Backward Testing): Rigor

### 6.1 Walk-Forward Engine
- Rolling/expanding windows; purged & embargoed CV.  
- Refit all models per window; test daily with costs.  
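
A minimal sketch of a walk-forward splitter with a purge gap, working on integer positions into the sorted list of trading dates. The gap removes the last `horizon` training dates, whose forward-return labels would otherwise overlap the test window. Treating the embargo as extra dates added to that same gap is a simplification, and the helper name and signature are hypothetical.

```python
import numpy as np


def purged_walk_forward(n_dates, train_size, test_size, horizon, embargo=0, expanding=False):
    """Return a list of (train_positions, test_positions) with a purge gap between them."""
    gap = horizon + embargo
    splits = []
    test_start = train_size + gap
    while test_start + test_size <= n_dates:
        train_end = test_start - gap  # exclusive: labels from here on would reach the test set
        train_start = 0 if expanding else train_end - train_size
        train_idx = np.arange(train_start, train_end)
        test_idx = np.arange(test_start, test_start + test_size)
        splits.append((train_idx, test_idx))
        test_start += test_size  # non-overlapping test windows
    return splits
```

With `n_dates=100, train_size=50, test_size=10, horizon=5` this yields four windows, and in each one the last training label ends strictly before the first test date.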

### 6.2 Significance & Reality Checks
- DM test, SPA/White RC, Sharpe inference.  
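
For the Sharpe inference item, one simple option is the large-sample standard error for i.i.d. returns, where the per-period Sharpe estimate has variance of roughly (1 + SR²/2) / T. The sketch below uses that approximation. It ignores autocorrelation and fat tails, which would typically widen the error, and the helper name `sharpe_with_se` and the 252-day annualization are assumptions.

```python
import numpy as np


def sharpe_with_se(returns, periods=252):
    """Annualized Sharpe, its approximate standard error, and the implied t-statistic."""
    r = np.asarray(returns, float)
    T = len(r)
    sr = r.mean() / r.std(ddof=1)            # per-period Sharpe
    se = np.sqrt((1.0 + 0.5 * sr ** 2) / T)  # i.i.d. large-sample approximation
    scale = np.sqrt(periods)
    return sr * scale, se * scale, sr / se
```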

### 6.3 Tail Risk & Stress
- VaR/CVaR; stress tests (2008/2020, vol shocks, liquidity cuts).  
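
A minimal historical-simulation sketch of VaR and CVaR, reported as positive loss numbers. The 5% tail level and the helper name `hist_var_cvar` are assumptions, and parametric or filtered variants are out of scope here.

```python
import numpy as np


def hist_var_cvar(returns, alpha=0.05):
    """Historical VaR and CVaR at tail probability alpha, both as positive losses."""
    r = np.asarray(returns, float)
    cutoff = np.quantile(r, alpha)   # return level at the alpha quantile
    var = -cutoff
    cvar = -r[r <= cutoff].mean()    # average loss at or beyond the cutoff
    return var, cvar
```

CVaR is never smaller than VaR by construction, which is a quick check to run on backtest output.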

### 6.4 Monte Carlo Robustness
- Block bootstrap; output PnL envelopes.  
- **Deliverables:** equity curves, DD charts, ablations.  
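
The block bootstrap and the PnL envelope can be sketched as below, using a circular moving-block scheme so that short-range dependence inside each block is preserved. Block length, path count, the quantile levels, and summing returns (appropriate for log returns) are illustrative assumptions, and both helper names are hypothetical.

```python
import numpy as np


def block_bootstrap_paths(returns, n_paths=200, block_len=10, seed=0):
    """Resample a return series in contiguous blocks; returns an (n_paths, T) array."""
    r = np.asarray(returns, float)
    n = len(r)
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block_len)  # ceiling division
    starts = rng.integers(0, n, size=(n_paths, n_blocks))
    # Each block is block_len consecutive positions, wrapping around the end
    idx = (starts[:, :, None] + np.arange(block_len)[None, None, :]) % n
    return r[idx.reshape(n_paths, -1)[:, :n]]


def pnl_envelope(paths, qs=(0.05, 0.5, 0.95)):
    """Quantile bands of cumulative PnL across bootstrap paths, shape (len(qs), T)."""
    cum = np.cumsum(paths, axis=1)
    return np.quantile(cum, qs, axis=0)
```

Fixing `seed` makes the envelopes reproducible across runs, in line with the fixed-seed policy used elsewhere in the outline.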

---

## 7) Forward Testing (Shadow Mode)

### 7.1 Daily Shadow Run
- No backfill; use latest models; log all artifacts.  

### 7.2 Retraining Cadence
- Weekly or bi-weekly; strict forward-only.  

### 7.3 Monthly Auto-Report
- Tear sheets with returns, Sharpe, DD, risk, regime PnL, VaR/CVaR.  

### 7.4 Duration
- Min: 4 weeks; Pref: 8–12 weeks.  
- **Deliverables:** daily run files, monthly reports.  

---

## 8) Cost Model & Execution Assumptions
- Costs: 10 bps round-trip (sweep 5–20).  
- Slippage: 1–2 bps; higher in Risk-Off.  
- Short borrow: 10–50 bps ann.  
- Liquidity caps: ≤5–10% ADV.  
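
One way to turn these assumptions into a daily cost charge is sketched below. It charges half the round-trip cost plus slippage on each unit of one-way turnover and accrues short borrow daily on the short notional. That split, the mid-range defaults, the 252-day accrual, and the name `daily_cost` are assumptions. Regime-dependent slippage and the ADV cap are left out.

```python
import numpy as np


def daily_cost(w_prev, w_new, cost_bps_roundtrip=10.0, slippage_bps=1.5,
               short_borrow_bps_ann=30.0, periods=252):
    """Trading plus borrow cost for one rebalance, as a fraction of capital."""
    w_prev, w_new = np.asarray(w_prev, float), np.asarray(w_new, float)
    turnover = np.abs(w_new - w_prev).sum()  # one-way traded notional
    # Assumption: half of the round-trip cost applies to each one-way trade
    trade_cost = turnover * (cost_bps_roundtrip / 2.0 + slippage_bps) * 1e-4
    short_notional = np.clip(-w_new, 0.0, None).sum()
    borrow_cost = short_notional * short_borrow_bps_ann * 1e-4 / periods
    return trade_cost + borrow_cost
```

Sweeping `cost_bps_roundtrip` from 5 to 20 with the same weight history gives the cost-sensitivity ablation directly.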

---

## 9) Reproducibility & Testability
- Config-driven (`config.yaml`); fixed seeds.  
- Unit/integration tests for leakage, CV folds, NaNs, RL bounds.  
- Experiment tracking with CSV/JSON + git hash.  

---

## 10) Visualization & Reporting
- Equity curves with regime shading, rolling metrics, exposures, attribution, bucket PnL, by-regime performance, risk dashboards.  

---

## 11) Automation Options
- **Colab:** manual or scheduled.  
- **GitHub Actions:** nightly, weekly, monthly.  
- **VM + cron:** low-budget option.  

---

## 12) Optional Alpaca Integration
- Disabled by default; forward test never sends orders; later optional paper fills.  

</details>
