# AI Stock Forecaster  
**(FMP + Kronos + FinText-TSFM | Signal-Only, Point-in-Time Safe)**

---

## 0) Project Explanation & Philosophy

### What this project is

This project builds a **decision-support forecasting model** that answers one core question:

> **Which AI stocks are most attractive to buy today, on a risk-adjusted basis, over the next 20 / 60 / 90 trading days?**

The system outputs **ranked stock recommendations and return distributions**, not trades.  
Its purpose is to generate **credible alpha signals** that survive realistic financial constraints.

The design explicitly accounts for:
- non-stationary market behavior,
- weak and noisy financial signals,
- transaction costs and liquidity effects,
- and strict point-in-time (PIT) correctness.

---

### What this project is NOT

This project does **not**:
- place trades,
- connect to brokers,
- optimize execution,
- or manage live capital.

Any portfolio-related logic exists **only to validate signal realism**, not to implement trading.

### Core modeling philosophy

1. **Ranking beats regression**  
   Relative ordering of stocks is more stable and economically useful than exact price prediction.

2. **Point-in-time correctness is non-negotiable**  
   Any signal unavailable at time *T* must not influence predictions at time *T*.

3. **Economic validity > statistical fit**  
   Signals must survive transaction costs, turnover, and regime shifts.

4. **Multiple weak signals > single strong model**  
   Combine complementary views:
   - price dynamics (Kronos),
   - return structure (FinText-TSFM),
   - fundamentals and context (tabular models).

## 1) System Outputs (Signal-Only)

At each rebalance date **T**, for each stock and horizon (20 / 60 / 90 trading days):

### Per-stock outputs
- **Expected excess return** vs benchmark (QQQ default; XLK/SMH optional)
- **Return distribution** (5th / 50th / 95th percentiles)
- **Alpha ranking score** (cross-sectional)
- **Confidence score** (calibrated uncertainty)
- **Key drivers** (feature blocks influencing the rank)

### Cross-sectional outputs
- Ranked list: **Top buys / neutral / avoid**
- Optional confidence buckets (high vs low confidence)

## 2) Scope & Validation Philosophy (Signal-Only)

### Scope
- The system produces **signals**, not trades.
- No execution or order placement logic is implemented.

### Why portfolio concepts still appear
Portfolio concepts (turnover, costs, constraints) are used **only for evaluation realism**, to answer:
> *Would these signals remain economically meaningful if followed by an investor?*

### Optional realism check
- Paper trading (e.g., Alpaca paper) may be used **post-hoc** to validate:
  - timestamp integrity,
  - universe construction,
  - signal stability.
- Paper trading results are **never** used for training or model selection.

## 3) Data & Point-in-Time Infrastructure (FMP-First)

### 3.1 Data sources
- **Market**: Daily OHLCV, splits, dividends
- **Fundamentals**: Income, balance sheet, cash flow (quarterly)
- **Metadata**: Sector, industry, shares outstanding, market cap
- **Events**: Earnings dates with announcement time
- **Benchmarks**: QQQ (default), optional XLK / SMH
- **Regime proxies**: VIX, market breadth, rate proxies

---

### 3.2 Point-in-time (PIT) rules

Each datapoint stores:
- `value`
- `observed_at` (first public release timestamp)
- `effective_from`
- `source`

Rules:
- Fundamentals are **as-reported**, never restated historically
- Forward-fill allowed **only after** `observed_at`
- No feature may use information released after the cutoff time

---

### 3.3 Daily cutoff policy (anti-lookahead)

- Fixed cutoff time (e.g., 4:00pm ET)
- Features for date *T* may only use data with timestamps ≤ cutoff(T)
- Earnings handling distinguishes pre-market vs after-close announcements
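
The cutoff policy can be sketched with the standard library, assuming the 4:00pm ET example above. `cutoff_utc` and `usable_on` are illustrative names, not project functions.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")
CUTOFF_LOCAL = time(16, 0)  # 4:00pm ET, the example cutoff named above


def cutoff_utc(day: date) -> datetime:
    """Cutoff for `day` as a timezone-aware UTC datetime (DST handled by zoneinfo)."""
    return datetime.combine(day, CUTOFF_LOCAL, tzinfo=NY).astimezone(timezone.utc)


def usable_on(day: date, observed_at: datetime) -> bool:
    """True if a datapoint released at `observed_at` may feed features for `day`."""
    if observed_at.tzinfo is None:
        raise ValueError("observed_at must be timezone-aware")
    return observed_at <= cutoff_utc(day)


# Pre-market vs after-close earnings on the same calendar day.
bmo = datetime(2023, 6, 15, 7, 0, tzinfo=NY)    # before market open
amc = datetime(2023, 6, 15, 16, 5, tzinfo=NY)   # after market close
same_day = (usable_on(date(2023, 6, 15), bmo), usable_on(date(2023, 6, 15), amc))
```

Under this sketch a pre-market announcement is visible to features for the same date, while an after-close announcement only becomes visible on the next date.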

---

### 3.4 Data audits & bias detection

Automated checks:
- PIT violation scanner
- Survivorship reconstruction audit
- Corporate action sanity checks
- Missingness and outlier detection

**Success criteria**
- < 0.1% PIT violations
- Universe reproducible for any historical date
- All datasets auditable and replayable

## 4) Survivorship-Safe Dynamic Universe

### 4.1 Universe construction (critical)

At each rebalance date **T**:
- Start with all U.S. equities meeting liquidity and price thresholds
- Filter by AI-relevant sector / industry tags
- Select **Top N by market cap as-of T**
- Persist constituents with timestamp

Hardcoded "today's winners" are explicitly disallowed.
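
As a rough illustration of the selection steps above, the following assumes a per-date snapshot DataFrame whose column names (`price`, `adv_20d`, `industry`, `market_cap`) and thresholds are hypothetical. It is a sketch, not the `UniverseBuilder` implementation.

```python
import pandas as pd


def build_universe(snapshot: pd.DataFrame, as_of: str, top_n: int,
                   min_price: float, min_adv: float,
                   ai_industries: set[str]) -> pd.DataFrame:
    """Select the top-N names by market cap from a snapshot taken as of T."""
    eligible = snapshot[
        (snapshot["price"] >= min_price)
        & (snapshot["adv_20d"] >= min_adv)
        & snapshot["industry"].isin(ai_industries)
    ]
    picked = eligible.nlargest(top_n, "market_cap").copy()
    picked["as_of"] = pd.Timestamp(as_of)  # persist constituents with a timestamp
    return picked[["as_of", "ticker", "market_cap"]].reset_index(drop=True)


# Placeholder tickers; every value must itself be known as of T.
snapshot = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "price": [50.0, 3.0, 120.0, 80.0],
    "adv_20d": [5e7, 9e7, 2e7, 1e5],
    "industry": ["Semiconductors", "Semiconductors", "Software", "Software"],
    "market_cap": [4e10, 9e10, 6e10, 8e10],
})
universe = build_universe(snapshot, "2020-01-02", top_n=2, min_price=5.0,
                          min_adv=1e6, ai_industries={"Semiconductors", "Software"})
```

Here `BBB` fails the price filter and `DDD` fails the liquidity filter, so the larger market caps do not help them: eligibility is decided before ranking.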

---

### 4.2 Delistings & mergers
- Delisted stocks remain in historical universes where data exists
- Missing data is explicitly modeled rather than silently dropped

**Success criteria**
- Constituents vary meaningfully through time
- Backtests include both winners and failures

## 5) Feature Engineering (Bias-Safe)

### 5.0 Readiness Checklist & Implementation Plan

#### Infrastructure Available (from Chapters 3-4) ✅
| Component | Module | What It Provides |
|-----------|--------|------------------|
| Prices | `FMPClient.get_historical_prices()` | Split-adjusted OHLCV with `observed_at` |
| Fundamentals | `FMPClient.get_income_statement()` etc. | With `fillingDate` for PIT |
| Volume/ADV | `DuckDBPITStore.get_avg_volume()` | Computed from OHLCV |
| Events | `EventStore` | EARNINGS, FILING, SENTIMENT with PIT |
| Earnings | `AlphaVantageClient` + `ExpectationsClient` | BMO/AMC timing, surprises |
| Regime/VIX | `FMPClient.get_index_historical()` | SPY, VIX for regime detection |
| Universe | `UniverseBuilder` | FULL survivorship via Polygon |
| ID Mapping | `SecurityMaster` | Stable IDs, ticker changes |
| Calendar | `TradingCalendarImpl` | NYSE holidays, cutoffs |
| Caching | All clients | `data/cache/*` directories |

#### API Keys Available ✅
- `FMP_KEYS` - Prices, fundamentals, profiles (free tier: 250/day)
- `POLYGON_KEYS` - Symbol master, universe (free tier: 5/min)
- `ALPHAVANTAGE_KEYS` - Earnings calendar (free tier: 25/day)

---

#### Chapter 5 TODO List ✅ COMPLETE

**5.1 Targets (Labels)** ✅
- [x] Implement forward excess return calculation vs QQQ benchmark
- [x] Create label generator for 20/60/90 trading day horizons
- [x] Ensure labels are strictly PIT-safe (no future leakage)
- [x] **v2: Total return labels (dividends included) - DEFAULT**

**5.2 Price & Volume Features** ✅
- [x] Momentum features (1m, 3m, 6m, 12m returns)
- [x] Volatility (realized vol, vol-of-vol)
- [x] Drawdown (max drawdown, current vs high)
- [x] Relative strength vs universe median
- [x] Beta vs benchmark (rolling window)
- [x] ADV and volatility-adjusted ADV

**5.3 Fundamental Features (Relative)** ✅
- [x] P/E vs own 3-year history (z-score)
- [x] P/S vs sector median
- [x] Margins vs sector peers
- [x] Revenue/earnings growth vs sector
- [x] All ratios rank-transformed cross-sectionally

**5.4 Event & Calendar Features** ✅
- [x] Days to next earnings
- [x] Days since last earnings
- [x] Post-earnings drift window indicator (PEAD 63 days)
- [x] Surprise magnitude (last N quarters)
- [x] Surprise streak and cross-sectional z-score
- [x] Filing recency (days since last 10-Q/10-K)

**5.5 Regime & Macro Features** ✅
- [x] VIX level and percentile (2-year window)
- [x] VIX regime classification (low/normal/elevated/high)
- [x] Market trend regime (bull/bear/neutral)
- [x] Sector rotation indicators (tech vs defensives)
- [x] All features timestamped with cutoff enforcement

**5.6 Missingness Masks** ✅
- [x] Create explicit "known at time T" indicators
- [x] Missingness as first-class feature (not just imputation)
- [x] Track data coverage statistics by category
- [x] Generate coverage reports

**5.7 Feature Hygiene & Redundancy** ✅
- [x] Cross-sectional z-score/rank standardization
- [x] Rolling Spearman correlation matrix
- [x] Feature clustering (identify blocks)
- [x] VIF diagnostics (tabular features)
- [x] Rolling IC stability checks
- [x] Sign consistency analysis

**5.8 Feature Neutralization (Diagnostics)** ✅
- [x] Sector-neutral IC computation
- [x] Beta-neutral IC computation
- [x] Sector+Beta neutral IC computation
- [x] Delta (Δ) reporting for interpretation

**Testing & Validation** ✅
- [x] Unit tests for each feature block (5.1-5.7 all have tests)
- [x] PIT violation scanner on all features
- [x] Univariate IC ≥ 0.03 check for strong signals (IC tools available)
- [x] IC stability across rolling windows (FeatureHygiene.compute_ic_stability)
- [x] Feature coverage > 95% (MissingnessTracker.compute_coverage_stats)

---

### 5.1 Targets (Labels) ✅ COMPLETE
**Implemented in `src/features/labels.py`**

**v2 (DEFAULT): Total Return Labels**
- Forward excess returns include dividends
- Formula: `TR_i,T(H) - TR_b,T(H)` where TR = price return + dividend yield
- Horizons: 20 / 60 / 90 trading days
- PIT-safe with `label_matured_at` timestamps
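
The label formula can be sketched as follows, assuming split-adjusted closes and per-share dividends recorded on their ex-dates, both on the same trading-day index. Function names are illustrative and are not taken from `src/features/labels.py`.

```python
import pandas as pd


def forward_total_return(close: pd.Series, dividends: pd.Series, horizon: int) -> pd.Series:
    """Price return plus dividend yield over the next `horizon` trading days."""
    cum_div = dividends.cumsum()
    fwd_div = cum_div.shift(-horizon) - cum_div  # dividends on days T+1 .. T+H
    return (close.shift(-horizon) - close + fwd_div) / close


def forward_excess_return(stock_close, stock_div, bench_close, bench_div, horizon):
    """TR_i,T(H) - TR_b,T(H); the last `horizon` rows are NaN (label not matured)."""
    return (forward_total_return(stock_close, stock_div, horizon)
            - forward_total_return(bench_close, bench_div, horizon))


stock_close = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0])
stock_div = pd.Series([0.0, 0.0, 1.0, 0.0, 0.0])
bench_close = pd.Series([200.0] * 5)
bench_div = pd.Series([0.0] * 5)
labels = forward_excess_return(stock_close, stock_div, bench_close, bench_div, horizon=2)
```

The trailing NaN rows are the ones a `label_matured_at` check would exclude: their forward window has not closed yet.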

**v1 (Legacy): Price-Only Labels**
- Available via `label_version='v1'` flag
- For backward compatibility only

---

### 5.2 Price & volume features ✅ COMPLETE
**Implemented in `src/features/price_features.py`**

| Feature | Description |
|---------|-------------|
| `mom_1m/3m/6m/12m` | Returns over 21/63/126/252 trading days |
| `vol_20d/60d` | Annualized volatility |
| `vol_of_vol` | Volatility of rolling volatility |
| `max_drawdown_60d` | Maximum drawdown |
| `rel_strength_1m/3m` | Z-score vs universe |
| `beta_252d` | Beta vs QQQ benchmark |
| `adv_20d/60d` | Average daily dollar volume |

---

### 5.3 Fundamentals (relative, normalized) ✅ COMPLETE
**Implemented in `src/features/fundamental_features.py`**

Raw ratios are avoided — all features are RELATIVE:
- `pe_zscore_3y`: P/E vs own 3-year history
- `pe_vs_sector`: P/E relative to sector median
- `ps_vs_sector`: P/S relative to sector median
- `gross_margin_vs_sector`: Margins vs sector
- `revenue_growth_vs_sector`: Growth vs sector peers
- `roe_zscore`, `roa_zscore`: Quality metrics z-scored

---

### Time-Decay Sample Weighting (Training Policy) ✅
**Implemented in `src/features/time_decay.py`**

**Why time decay matters for AI stocks:**
- AI business models and the "AI regime" (2020+) differ from earlier eras
- Market microstructure evolves (HFT, retail flow)
- Many AI stocks didn't exist 15+ years ago — that's OK
- Recent observations are more relevant for forward predictions

**Recommended half-lives:**
| Horizon | Half-Life | Weight at 6y | Weight at 10y |
|---------|-----------|--------------|--------------|
| 20d     | 2.5 years | ~18%         | ~7%          |
| 60d     | 3.5 years | ~30%         | ~14%         |
| 90d     | 4.5 years | ~38%         | ~21%         |
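
For reference, a half-life maps to a sample weight through plain exponential decay. The sketch below assumes sample age is measured in years and is not the code in `src/features/time_decay.py`.

```python
import numpy as np

HALF_LIFE_YEARS = {20: 2.5, 60: 3.5, 90: 4.5}  # horizon (trading days) -> half-life


def time_decay_weight(age_years, horizon: int) -> np.ndarray:
    """Exponential decay: the weight halves for every half-life of sample age."""
    age = np.asarray(age_years, dtype=float)
    return 0.5 ** (age / HALF_LIFE_YEARS[horizon])


# Ages of 0, one and two half-lives for the 20d horizon.
weights_20d = time_decay_weight([0.0, 2.5, 5.0], horizon=20)  # 1.0, 0.5, 0.25
```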

---

### 5.4 Events & calendars ✅ COMPLETE
**Implemented in `src/features/event_features.py`**

| Feature | Description |
|---------|-------------|
| `days_to_earnings` | Days until next expected earnings |
| `days_since_earnings` | Days since last report |
| `in_pead_window` | Post-earnings drift window (63 days) |
| `last_surprise_pct` | Most recent surprise % |
| `avg_surprise_4q` | Rolling 4Q average |
| `surprise_streak` | Consecutive beats/misses |
| `surprise_zscore` | Cross-sectional z-score |
| `days_since_10k/10q` | Filing recency |
| `reports_bmo` | Typical announcement timing |

---

### 5.5 Regime & macro ✅ COMPLETE
**Implemented in `src/features/regime_features.py`**

| Feature | Description |
|---------|-------------|
| `vix_level`, `vix_percentile` | VIX level and 2-year percentile |
| `vix_regime` | low/normal/elevated/high |
| `market_return_5d/21d/63d` | SPY returns at various windows |
| `market_regime` | bull/bear/neutral (MA-based) |
| `above_ma_50`, `above_ma_200` | Price vs moving averages |
| `tech_vs_staples/utilities` | Sector rotation signals |

---

### 5.6 Availability & missingness masks ✅ COMPLETE
**Implemented in `src/features/missingness.py`**

| Feature | Description |
|---------|-------------|
| `coverage_pct` | Overall feature coverage (0-1) |
| `{category}_coverage` | Per-category availability |
| `has_{type}_data` | Boolean availability flags |
| `is_new_stock` | < 1 year of history |

**Key Philosophy:** Missingness is a SIGNAL, not just noise.
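
A small sketch of how such mask features might be derived, assuming a mapping from category name to feature columns. The helper name and the example columns are illustrative only.

```python
import numpy as np
import pandas as pd


def add_missingness_features(features: pd.DataFrame,
                             categories: dict[str, list[str]]) -> pd.DataFrame:
    """Append coverage and availability columns without imputing anything."""
    out = features.copy()
    all_cols = [c for cols in categories.values() for c in cols]
    out["coverage_pct"] = features[all_cols].notna().mean(axis=1)
    for name, cols in categories.items():
        out[f"{name}_coverage"] = features[cols].notna().mean(axis=1)
        out[f"has_{name}_data"] = features[cols].notna().any(axis=1)
    return out


raw = pd.DataFrame({"mom_1m": [0.10, np.nan], "pe_zscore_3y": [np.nan, np.nan]})
masked = add_missingness_features(
    raw, {"price": ["mom_1m"], "fundamental": ["pe_zscore_3y"]}
)
```

The original NaNs stay in place, so a downstream model sees both the gap and the flag describing it.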

---

### 5.7 Feature Hygiene & Redundancy Control ✅ COMPLETE
**Implemented in `src/features/hygiene.py`**

| Component | Description |
|-----------|-------------|
| Cross-sectional standardization | z-score or rank-transform within date |
| Rolling Spearman correlation | Correlation matrix computation |
| Feature clustering | Hierarchical clustering to identify blocks |
| VIF diagnostics | Variance Inflation Factor (diagnostic, not filter) |
| IC stability analysis | Rolling IC with sign consistency tracking |

> **Principle**: A feature with IC 0.04 once and −0.01 later is worse than IC 0.02 stable forever.
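
One way to make the principle measurable is a sign-consistency score over rolling windows. The definition below (share of windows whose mean IC has the same sign as the overall mean) is an assumption for illustration and may differ from `FeatureHygiene.compute_ic_stability`.

```python
import numpy as np
import pandas as pd


def sign_consistency(ic: pd.Series, window: int) -> float:
    """Share of rolling windows whose mean IC matches the sign of the overall mean."""
    rolling_mean = ic.rolling(window).mean().dropna()
    return float((np.sign(rolling_mean) == np.sign(ic.mean())).mean())


# Feature A: IC 0.04 early, then -0.01. Feature B: IC 0.02 throughout.
ic_a = pd.Series([0.04] * 6 + [-0.01] * 6)
ic_b = pd.Series([0.02] * 12)

consistency_a = sign_consistency(ic_a, window=3)
consistency_b = sign_consistency(ic_b, window=3)
```

Feature A has a similar average IC to B, yet it would fail a 70% sign-consistency bar while B passes it.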

---

### 5.8 Feature Neutralization ✅ COMPLETE
**Implemented in `src/features/neutralization.py`**

**Purpose:** For diagnostics ONLY (not training). Reveals WHERE alpha comes from.

| Component | Description |
|-----------|-------------|
| Sector-neutral IC | IC after removing sector effects |
| Beta-neutral IC | IC after removing market beta |
| Sector+Beta neutral IC | IC after removing both factors |
| Delta (Δ) reporting | neutral_IC - raw_IC for interpretation |

**Interpretation:**
- Large negative Δ_sector → feature was mostly sector rotation
- Large negative Δ_beta → feature was mostly market exposure
- Small Δ → alpha is genuinely stock-specific
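
A minimal sketch of the sector case on a single cross-section, assuming neutralization means demeaning both feature and label within sector. That choice, and the function names, are illustrative rather than the `neutralization.py` implementation.

```python
import pandas as pd


def rank_ic(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation computed as the Pearson correlation of ranks."""
    return float(x.rank().corr(y.rank()))


def sector_neutral_delta(df: pd.DataFrame, feature: str, label: str):
    """Return (raw_IC, neutral_IC, delta) for one date's cross-section."""
    raw = rank_ic(df[feature], df[label])
    # Demean within sector to strip sector-level effects from both sides.
    f_neutral = df[feature] - df.groupby("sector")[feature].transform("mean")
    y_neutral = df[label] - df.groupby("sector")[label].transform("mean")
    neutral = rank_ic(f_neutral, y_neutral)
    return raw, neutral, neutral - raw


# The feature separates the two sectors but ranks stocks wrongly inside each one.
cross_section = pd.DataFrame({
    "sector": ["A", "A", "B", "B"],
    "feature": [1.0, 2.0, 5.0, 6.0],
    "label": [2.0, 1.0, 10.0, 9.0],  # excess return in percent
})
raw_ic, neutral_ic, delta = sector_neutral_delta(cross_section, "feature", "label")
```

The raw IC is positive only because the feature picks the better sector; once sector means are removed the IC turns negative, which is the large negative Δ_sector case.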

---

**Feature success criteria** ✅
- > 95% completeness (post-masking)
- Strong univariate signals show IC ≳ 0.03
- No feature introduces PIT violations
- **Stability**: IC sign consistent across ≥70% of rolling windows
- **Redundancy understood**: Feature blocks documented, correlation matrix computed

## 6) Evaluation Framework (Core Credibility Layer) ✅ CLOSED & FROZEN

---

### 🔒 CHAPTER 6 FREEZE STATUS

**Status:** CLOSED & FROZEN (December 30, 2025)  
**Tests:** 413/413 passing  
**Commits:**
- `18bad8a` - Chapter 6: Closure fixes + freeze REAL baseline reference
- `7e6fa3a` - Chapter 6: Freeze REAL baseline reference artifacts

**Frozen Baseline Floor (REAL DuckDB Data):**

| Horizon | Best Baseline | Median RankIC | Quintile Spread | Hit Rate @10 | N Folds |
|---------|---------------|---------------|-----------------|--------------|---------|
| **20d** | `mom_12m_monthly` | **0.0283** | 0.0035 | 0.50 | 109 |
| **60d** | `momentum_composite_monthly` | **0.0392** | 0.0370 | 0.60 | 109 |
| **90d** | `momentum_composite_monthly` | **0.0169** | 0.0374 | 0.60 | 109 |

**Sanity Check:** ✅ PASSED (`naive_random` RankIC ≈ 0 for all horizons)

**Frozen Artifacts (tracked in git):**
- `evaluation_outputs/chapter6_closure_real/` - Baseline reference (IMMUTABLE)
- `BASELINE_FLOOR.json` - Metrics to beat for Chapter 7+
- `BASELINE_REFERENCE.md` - Usage instructions
- `CLOSURE_MANIFEST.json` - Commit hash (`bf2cf8e`), data hash (`5723d4c88b8ecba1...`)

**Data Snapshot:**
- Source: DuckDB (`data/features.duckdb`)
- Rows: 192,307 (2016-01-04 → 2025-02-19)
- Tickers: ~100 (AI universe)
- Horizons: 20d, 60d, 90d (TRADING DAYS)
- Label Version: v2 (total return with dividends)

**Reference Doc:** See `CHAPTER_6_FREEZE.md` for complete details.

**What This Means:**
- ✅ Chapter 6 evaluation pipeline is COMPLETE and may not be modified
- ✅ Baseline reference is FROZEN and is the immutable comparison anchor for Chapter 7+
- ✅ All future models must use this frozen pipeline and beat the frozen baseline floor
- ⚠️ Any changes to evaluation definitions require a new version and complete re-freeze

---

> **CRITICAL PHILOSOPHY**: "You've crossed the line where bad evaluation can ruin a good system."
>
> **Approach**: Be conservative. Let results look "boring" if they are. Resist the urge to tweak features/models early. If signals survive Chapter 6 as-is, Chapters 7-11 will feel almost easy.

### 6.0 Prerequisites Check ✅
- **Labels**: v2 total return (dividends), mature-aware, PIT-safe ✅
- **Features**: 5.1-5.8 complete, stable, interpretable, auditable ✅
- **Missingness**: Explicit, not dropped ✅
- **Regime**: Visible but not leaked ✅
- **Alpha attribution**: Neutralization working ✅
- **PIT discipline**: Scanner enforced, 0 CRITICAL violations ✅

---

### 6.0.2 Definition Lock ✅ IMPLEMENTED

> **CRITICAL**: All time conventions are locked in `src/evaluation/definitions.py`

**Canonical Definitions (FROZEN):**

| Parameter | Value | Unit | Enforcement |
|-----------|-------|------|-------------|
| **Horizons** | 20, 60, 90 | TRADING DAYS | `validate_horizon()` |
| **Embargo** | 90 | TRADING DAYS | `validate_embargo()` |
| **Rebalance** | 1st of month | Calendar day | Walk-forward splitter |
| **Pricing** | Close-to-close | - | Label generator |
| **Maturity** | label_matured_at <= cutoff_utc | UTC datetime | fold.filter_labels() |

**Anti-Leakage Enforcement (HARD CONSTRAINTS):**
```python
from datetime import date

# Embargo validation (raises ValueError if < 90 TRADING DAYS)
from src.evaluation import validate_embargo
validate_embargo(90)  # ✅ Passes
validate_embargo(30)  # ❌ ValueError: "must be at least 90 TRADING DAYS"

# Maturity check (UTC datetime, not naive date)
from src.evaluation import get_market_close_utc, is_label_mature
cutoff_utc = get_market_close_utc(date(2023, 6, 15))  # 4 PM ET → UTC
is_label_mature(label_matured_at, cutoff_utc)  # Rejects naive datetimes
```

**Purging Rules (Per-Row-Per-Horizon):**
- Train labels: Purge if T + H (trading days) > train_end
- Val labels: Purge if T - H (trading days) < train_end
- NOT a global rule that "happens to work because embargo = 90"
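
The per-row rules above could be sketched as follows. This assumes dates are mapped to positions in a trading calendar; `pd.bdate_range` is used as a stand-in that ignores exchange holidays, and `purge_rows` is a hypothetical helper, not the project's splitter.

```python
import pandas as pd


def purge_rows(rows: pd.DataFrame, calendar: pd.DatetimeIndex,
               train_end: pd.Timestamp, split: str) -> pd.DataFrame:
    """Keep only rows whose horizon window respects the train_end boundary.

    rows: one row per (date, horizon), with horizon in trading days.
    """
    pos = calendar.get_indexer(rows["date"])  # trading-day position of T
    end_pos = calendar.get_loc(train_end)
    h = rows["horizon"].to_numpy()
    if split == "train":
        keep = pos + h <= end_pos   # purge if T + H > train_end
    else:
        keep = pos - h >= end_pos   # purge if T - H < train_end
    return rows[keep]


calendar = pd.bdate_range("2023-01-02", periods=300)
train_end = calendar[150]

train_rows = pd.DataFrame({"date": [calendar[100], calendar[140], calendar[100]],
                           "horizon": [20, 20, 90]})
val_rows = pd.DataFrame({"date": [calendar[160], calendar[170]],
                         "horizon": [20, 20]})

train_kept = purge_rows(train_rows, calendar, train_end, "train")
val_kept = purge_rows(val_rows, calendar, train_end, "val")
```

The same as-of date survives for the 20d horizon and is purged for the 90d horizon, which is why the rule has to be evaluated per row and per horizon.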

**End-of-Sample Eligibility:**
- All horizons (20/60/90) must be valid for an as-of date
- No partial horizons near end of evaluation period
- `require_all_horizons=True` in splitter

**Test Coverage:** 65/65 tests passing
- `tests/test_definitions.py`: 40 tests
- `tests/test_walk_forward.py`: 25 tests

---

### 6.0.1 Pre-Implementation Sanity Checks (COMPLETE BEFORE CODING)

> **CRITICAL**: These are not blockers—they are sanity locks. Complete both before writing any Chapter 6 code.

#### ✅ Sanity Check 1: Manual IC vs Qlib IC Parity Test

**Purpose:** Ensure adapter/indexing is correct before generating hundreds of evaluation reports.

**Test Protocol:**
```python
from scipy.stats import spearmanr

# One fold, one horizon, same predictions
fold = "2023-Q1"
horizon = 20

# Manual RankIC calculation (df holds this fold's date, prediction, label columns)
manual_rankic = df.groupby("date").apply(
    lambda x: spearmanr(x["prediction"], x["label"])[0]
).median()

# Qlib RankIC calculation
qlib_df = adapter.to_qlib_format(predictions, labels)
qlib_rankic = qlib.evaluate(qlib_df)["IC"].median()

# STOP if they don't match
assert abs(manual_rankic - qlib_rankic) < 0.001, "IC mismatch - fix adapter!"
```

**Acceptance:** Manual and Qlib RankIC must agree to 3 decimal places.

**If they don't match → STOP immediately:**
- Check MultiIndex formatting (datetime, instrument)
- Check date alignment (T vs T+H)
- Check for missing data handling differences
- Check for sign flips (prediction vs label)

#### ✅ Sanity Check 2: Experiment Naming Convention

**Purpose:** Prevent chaos when Recorder usage explodes across hundreds of experiments.

**Convention (LOCK THIS IN NOW):**
```
exp = ai_forecaster/
      horizon={20,60,90}/
      model={kronos_v0, fintext_v0, tabular_lgb, baseline_mom12m}/
      labels={v1_priceonly, v2_totalreturn}/
      fold={01, 02, ..., 40}/
```

**Example paths:**
```
ai_forecaster/horizon=20/model=kronos_v0/labels=v2_totalreturn/fold=03
ai_forecaster/horizon=60/model=baseline_mom12m/labels=v2_totalreturn/fold=12
ai_forecaster/horizon=90/model=tabular_lgb/labels=v2_totalreturn/fold=25
```

---

### 6.1 Walk-Forward Engine

**Expanding window** (not rolling):
- Grows forward, never shrinks
- Preserves long-term signal stability
- Avoids artificial lookback dependency

**Rebalance Cadence (LOCKED):**

| Frequency | Purpose | Details |
|-----------|---------|---------|
| **Monthly (Primary)** | Main evaluation | First trading day of month, ~110 points (2016-2025) |
| **Quarterly (Secondary)** | Robustness check | Supplementary slice only |

**Evaluation Date Range (LOCKED):**
```python
EVAL_START = "2016-01-01"  # Earliest reliable fundamentals + universe snapshots
EVAL_END = "2025-06-30"    # Conservative: guarantees 90d label maturity
```

**Why these dates?**
- **2016-01-01**: FMP fundamentals reliable, universe coverage sufficient
- **2025-06-30**: All 90d labels mature (PIT-safe), includes 2023-25 AI rally
- **Result**: ~110 monthly points, multiple regimes (pre-COVID, COVID, drawdown, AI mania)

**Universe snapshots:**
- Uses `stable_id` snapshots from Chapter 4 (survivorship-safe)
- Respects `label_matured_at` timestamps (PIT-safe)
- No forward-looking information leakage

**Time-decay sample weighting** (training only):
- From `src/features/time_decay.py`
- Horizon-specific half-lives: 2.5y (20d), 3.5y (60d), 4.5y (90d)
- Per-date normalization for cross-sectional ranking

---

### 6.1.1 Baselines (Models to Beat)

**3 baselines to beat:**

| Baseline | Feature(s) | Purpose |
|----------|-----------|---------|
| **A: `mom_12m`** | 12-month momentum | Primary naive baseline (embarrassing if we can't beat) |
| **B: `momentum_composite`** | `(mom_1m + mom_3m + mom_6m + mom_12m) / 4` | Stronger but transparent |
| **C: `short_term_strength`** | `mom_1m` or `rel_strength_1m` | Diagnostic for horizon sensitivity |

**+ 1 sanity baseline (not a target):**
- **`naive_random`**: Deterministic random scores (sanity check: RankIC ≈ 0, confirms no systematic bias)

**Critical Guardrail:**
All baselines run through **identical pipeline:**
- Same universe snapshots (`stable_id`)
- Same missingness handling
- Same neutralization setting (raw or sector/beta-neutral)
- Same cost diagnostic treatment (6.4)
- Same purging/embargo
- Same walk-forward splits

**No baseline shopping:** Adding more baselines = temptation to cherry-pick weak ones.

---

### 6.2 Label Hygiene
**Enforce maturity rule:**
- `label_matured_at <= asof` strictly enforced
- No label is used before it matures

**Horizon-aware purging:**
- Remove overlapping labels across training/validation folds
- Prevents leakage from correlated label windows

**Embargo = max horizon:**
- Gap between train and validation = 90 trading days
- Conservative cushion for all horizons (20/60/90)

---

### 6.3 Metrics (Ranking-First)

**Primary Metric: RankIC**
- Spearman correlation of predicted ranks vs actual excess returns
- More stable than Pearson IC (robust to outliers)
- Directly measures ranking quality

**IC by Regime:**
- VIX low/high quartiles
- Bull/bear markets (SPY 200-day MA)
- Sector rotation periods

**Top-Bottom Quintile Spread:**
- Return of top 20% - return of bottom 20%
- Measures practical exploitability
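
For a single rebalance date, the spread can be computed along these lines. This is a sketch with assumed column names (`prediction`, `label`), not the project's metric code.

```python
import pandas as pd


def quintile_spread(df: pd.DataFrame, score: str = "prediction", ret: str = "label") -> float:
    """Mean return of the top 20% minus the bottom 20% by score, for one date."""
    pct_rank = df[score].rank(pct=True, method="first")
    top = df.loc[pct_rank > 0.8, ret].mean()
    bottom = df.loc[pct_rank <= 0.2, ret].mean()
    return float(top - bottom)


# Ten stocks whose realized excess return rises with the score.
one_date = pd.DataFrame({
    "prediction": range(1, 11),
    "label": [0.01 * i for i in range(1, 11)],
})
spread = quintile_spread(one_date)
```

In a walk-forward run this would be applied per date (for example via `groupby("date")`) and then summarized per fold.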

**Top-K Definition (LOCKED):**

| Metric | Primary | Secondary | Target |
|--------|---------|-----------|--------|
| **Top-K size** | Top-10 | Top-20 | - |
| **Churn** | Jaccard or % retained | - | < 30% |
| **Hit Rate** | Excess return > 0 | - | > 55% |

**Churn formula:**
```python
churn = 1 - len(set(top_k_t) & set(top_k_t_minus_1)) / len(set(top_k_t) | set(top_k_t_minus_1))
```

**Hit Rate definition:**
```python
hit = (top_k_portfolio_return - benchmark_return) > 0
hit_rate = hits / total_rebalances
```

---

### 6.4 Cost Realism (Diagnostic)

**Base costs:**
- 20 bps round-trip (conservative for liquid largecaps)
- Higher for small/mid caps if needed

**ADV-scaled slippage:**
- Function of position size / average daily dollar volume
- Penalizes illiquid names

**Question: "Does alpha survive?"**
- This is diagnostic, NOT optimization
- If alpha vanishes post-cost → reject signal
- If survives → document slippage sensitivity
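
The source does not specify the slippage function, so the following is only one possible shape: base cost plus a term linear in participation (position size / ADV). The linear form and `impact_coef_bps` are illustrative assumptions, not calibrated values.

```python
BASE_COST_BPS = 20.0  # round-trip base cost from the bullets above


def round_trip_cost_bps(position_usd: float, adv_usd: float,
                        impact_coef_bps: float = 10.0) -> float:
    """Base cost plus slippage that grows with participation (assumed linear)."""
    participation = position_usd / adv_usd
    return BASE_COST_BPS + impact_coef_bps * participation


def net_excess_return(gross_excess: float, turnover: float, cost_bps: float) -> float:
    """Subtract trading costs for the fraction of the book that turned over."""
    return gross_excess - turnover * cost_bps / 10_000.0


# A $1M position in a name trading $100M per day, with half the book turning over.
cost = round_trip_cost_bps(1_000_000, 100_000_000)
net = net_excess_return(gross_excess=0.01, turnover=0.5, cost_bps=cost)
```

Because the overlay is diagnostic, the useful output is a sensitivity table across several coefficients rather than a single tuned number.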

---

### 6.5 Stability Reports

**IC decay plots:**
- How does IC degrade over time within a fold?
- Stable signals show flat or slow decay
- Rapid decay → overfitting or regime shift

**Regime-conditional performance:**
- IC in VIX high vs low
- IC in bull vs bear
- IC across sectors

**Churn diagnostics:**
- Top-10 ranking turnover month-over-month
- High churn (>50%) → unstable, costly
- Target: <30% for exploitability

---

### 6.6 Qlib Integration (Shadow Evaluator)

**Philosophy:** Use Microsoft's Qlib as a "shadow evaluator" for standardized reporting, NOT as a replacement for our core infrastructure.

**What Qlib Does for Us:**

| Chapter 6 Component | Qlib Feature |
|---------------------|--------------|
| **6.3 Metrics** | Built-in IC/RankIC analysis, monthly IC, regime IC, autocorrelation plots |
| **6.5 Reporting** | Quintile analysis, cumulative returns, long-short distribution, drawdown |
| **6.4 Cost Realism** | Backtest engine with configurable transaction costs (second opinion) |
| **Experiment Tracking** | Recorder system for managing walk-forward folds and model variants |

**Integration Pattern (Narrow & Safe):**
```
Our System (source of truth) → predictions + labels → Qlib → evaluation reports
```

**Data Flow:**
1. Our pipeline generates: `(date, ticker, prediction, label, optional_group)`
2. Qlib receives this DataFrame (not raw data/features)
3. Qlib outputs: standardized factor evaluation + backtest summaries

**What Qlib Does NOT Replace:**
- ❌ Universe construction (we keep stable_id + survivorship)
- ❌ Feature engineering (we keep PIT discipline + 5.1-5.8)
- ❌ Label generation (we keep v2 total return)
- ❌ Data storage (we keep DuckDB PIT store)

**Implementation:**
```python
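import pandas as pd

# Adapter layer
def our_predictions_to_qlib_format(predictions_df, labels_df):
    qlib_df = pd.merge(predictions_df, labels_df, on=["date", "ticker"])
    # Qlib expects a (datetime, instrument) MultiIndex, so rename before indexing
    qlib_df = qlib_df.rename(columns={"date": "datetime", "ticker": "instrument"})
    qlib_df = qlib_df.set_index(["datetime", "instrument"])  # Qlib format
    return qlib_df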

# Generate reports
from qlib.contrib.evaluate import backtest_daily
reports = backtest_daily(prediction=qlib_df, ...)
```

**References:**
- Qlib GitHub: https://github.com/microsoft/qlib
- Qlib Docs: https://qlib.readthedocs.io/en/latest/
- Evaluation: https://qlib.readthedocs.io/en/latest/component/report.html

---

**Acceptance Criteria:**
- ✅ Median walk-forward RankIC > baseline by ≥ 0.02
- ✅ Net-of-cost improvement: % positive folds ≥ baseline + 10pp (relative gate; frozen floor: 5.8%-40.1%)
- ✅ Top-10 ranking churn < 30% month-over-month
- ✅ Performance degrades gracefully under regime shifts
- ✅ NO PIT violations (enforced by scanner)

**Guardrails:**
- ❌ NO new features mid-evaluation
- ❌ NO retraining models to "fix" bad folds
- ❌ NO cherry-picking good time periods
- ❌ NO optimizing to costs (diagnostic only)
- ❌ NO hiding negative results

**Success = Boring Results That Don't Break**
- Median IC of 0.03-0.05 is GOOD
- Stable across regimes is EXCELLENT
- Survives costs is SUFFICIENT

## 7) Baseline Models (Models to Beat) ✅ IMPLEMENTED

---

### 7.0 Baseline Philosophy

Baselines establish the **floor** that models must clear. They serve three purposes:

1. **Sanity check**: If a model can't beat momentum, something is fundamentally wrong
2. **Value demonstration**: ML must add measurable value over transparent alternatives  
3. **Stability anchor**: Frozen baselines prevent "drifting targets" during model iteration

**Critical rule**: Baselines are **locked before any model is trained**. No baseline shopping.

---

### 7.1 Baseline Categories

All baselines run through the **identical evaluation pipeline** as models:
- Same universe snapshots (stable_id + survivorship)
- Same walk-forward folds (purging/embargo/maturity)
- Same EvaluationRow contract
- Same metrics, costs, and stability reports

#### 7.1.1 Factor Baselines (Models Must Beat)

| Baseline | Description | Formula | Purpose |
|----------|-------------|---------|---------|
| `mom_12m` | 12-month momentum | `score = mom_12m` | **Primary naive baseline** — If models can't beat this, something is wrong |
| `momentum_composite` | Multi-horizon momentum | `score = (mom_1m + mom_3m + mom_6m + mom_12m) / 4` | **Stronger transparent baseline** — Realistic bar for "is ML worth it?" |
| `short_term_strength` | 1-month momentum | `score = mom_1m` | **Diagnostic baseline** — Exposes horizon sensitivity and mean-reversion regimes |

#### 7.1.2 Sanity Baselines (Pipeline Verification Only)

| Baseline | Description | Formula | Purpose |
|----------|-------------|---------|---------|
| `naive_random` | Deterministic random | `score = hash(as_of_date, horizon, stable_id)` | **Pipeline sanity** — If RankIC ≠ ~0, evaluation is hallucinating alpha |

**Note**: `naive_random` should NEVER be used as a "bar to clear". It's purely a sanity check.
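
A deterministic score of this kind can be sketched with `hashlib`. Python's built-in `hash()` is salted per process for strings, so it would not reproduce across runs; the key layout and function name below are assumptions, not the project's implementation.

```python
import hashlib


def naive_random_score(as_of_date: str, horizon: int, stable_id: str) -> float:
    """Deterministic pseudo-random score in [0, 1) keyed on (date, horizon, id)."""
    key = f"{as_of_date}|{horizon}|{stable_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


score_1 = naive_random_score("2023-06-01", 20, "SEC_000123")
score_2 = naive_random_score("2023-06-01", 20, "SEC_000123")  # identical to score_1
score_3 = naive_random_score("2023-06-01", 60, "SEC_000123")  # different key
```

Keying on the horizon as well as the date and ID means each horizon gets its own independent ordering, while reruns of the same experiment stay reproducible.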

#### 7.1.3 ML Baselines (Tuned, Then Frozen)

| Baseline | Description | Status |
|----------|-------------|--------|
| `tabular_lgb` | LightGBM on feature stack | ✅ Implemented + frozen (see 7.6) |

The ML baseline establishes: "Does deep learning add value over tuned gradient boosting?"

---

### 7.2 Baseline Gates (Pass/Fail Thresholds)

These are the minimum thresholds for each baseline category:

| Gate | Metric | Threshold | What It Means |
|------|--------|-----------|---------------|
| **Factor Gate** | median RankIC (best factor) | ≥ 0.02 | Momentum signal exists in data |
| **ML Gate** | median RankIC (tabular_lgb) | ≥ 0.05 | ML extracts signal beyond factors |
| **Model Gate** | median RankIC (model) | ≥ ML baseline + 0.02 | Deep learning adds value |

**Gating policy**:
- Factor gate must pass before proceeding (confirms data quality)
- ML gate establishes the "ML floor" for TSFM models
- TSFM models must beat tuned ML baseline on **median OOS RankIC**
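
The three gates reduce to simple comparisons on median RankIC. The helper below is a hypothetical sketch of that logic, not `compute_acceptance_verdict`.

```python
FACTOR_GATE_MIN = 0.02   # best factor baseline must reach this median RankIC
ML_GATE_MIN = 0.05       # tabular_lgb must reach this median RankIC
MODEL_LIFT_MIN = 0.02    # model must beat the ML baseline by this margin


def check_gates(best_factor_ic: float, ml_ic: float, model_ic: float) -> dict:
    """Evaluate the factor, ML and model gates from median RankIC values."""
    return {
        "factor_gate": best_factor_ic >= FACTOR_GATE_MIN,
        "ml_gate": ml_ic >= ML_GATE_MIN,
        "model_gate": model_ic >= ml_ic + MODEL_LIFT_MIN,
    }


# Hypothetical inputs chosen only to exercise each branch.
passing = check_gates(best_factor_ic=0.03, ml_ic=0.06, model_ic=0.09)
failing = check_gates(best_factor_ic=0.03, ml_ic=0.06, model_ic=0.07)
```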

---

### 7.3 Running Baselines

```python
from pathlib import Path
from src.evaluation import (
    ExperimentSpec,
    run_experiment,
    SMOKE_MODE,
    FULL_MODE,
    FACTOR_BASELINES,
    SANITY_BASELINES,
)

# Run all factor baselines (SMOKE mode for CI)
for baseline_name in FACTOR_BASELINES:
    spec = ExperimentSpec.baseline(baseline_name, cadence="monthly")
    results = run_experiment(
        experiment_spec=spec,
        features_df=features,  # Your features DataFrame
        output_dir=Path("evaluation_outputs"),
        mode=SMOKE_MODE  # or FULL_MODE for production
    )
    print(f"{baseline_name}: {results['n_folds']} folds")

# Run sanity baseline (verify ~0 RankIC)
spec = ExperimentSpec.baseline("naive_random", cadence="monthly")
sanity_results = run_experiment(spec, features, Path("evaluation_outputs"), SMOKE_MODE)
assert abs(sanity_results["median_rankic"]) < 0.05, "Pipeline sanity failed!"
```

---

### 7.4 Acceptance Criteria (Models Must Clear)

| Criterion | Threshold | Rationale |
|-----------|-----------|-----------|
| **RankIC Lift** | Model median RankIC >= best baseline + 0.02 | ML must add meaningful signal |
| **Net-Positive Folds** | >= 70% of folds positive after base costs | Signal must survive realistic trading |
| **Top-10 Churn** | Median < 30% | Rankings must be stable enough to trade |
| **No Collapse** | 0 folds with negative median RankIC | Robust across regimes |

These criteria are computed via:
```python
from src.evaluation import compute_acceptance_verdict, save_acceptance_summary

verdict = compute_acceptance_verdict(
    model_summary,
    baseline_summaries={
        "mom_12m": ..., 
        "momentum_composite": ..., 
        "short_term_strength": ...,
        "tabular_lgb": ...  # Once implemented
    },
    cost_overlays=cost_df,
    churn_df=churn_df
)
save_acceptance_summary(verdict, Path("outputs"), "model_vs_baselines")
```

---

### 7.5 FULL_MODE Reference Run (Required Before Model Work)

**Before training any model**, freeze a FULL_MODE baseline run:
- **Range**: 2016-01-01 → 2025-06-30 (locked)
- **Cadence**: Monthly (primary), Quarterly (robustness)
- **Outputs**: All baselines, all 3 horizons (20/60/90)

This produces:
```
evaluation_outputs/
├── baseline_mom_12m_monthly/
│   ├── eval_rows.parquet
│   ├── per_date_metrics.csv
│   ├── fold_summaries.csv
│   ├── cost_overlays.csv
│   ├── stability_scorecard.csv
│   └── REPORT_SUMMARY.md
├── baseline_momentum_composite_monthly/
├── baseline_short_term_strength_monthly/
├── baseline_naive_random_monthly/
├── baseline_tabular_lgb_monthly/  # Once implemented
└── BASELINE_REFERENCE.md  # Frozen floor for all horizons
```

**Freeze requirements**:
- Git commit hash of the run
- Run configuration (cadence, horizons, eval range, cost scenarios)
- Data snapshot identity (DuckDB hash + row counts)
- Output directory + manifest
- Environment snapshot (Python version + pip freeze)
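
A minimal sketch of how the freeze record could be assembled, assuming the data snapshot is a single DuckDB file on disk. The function names and manifest keys are hypothetical; capturing the git hash and `pip freeze` output would typically be done with subprocess calls and is left to the caller here.

```python
import hashlib
import json
import platform
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_freeze_manifest(db_path: Path, output_dir: Path, git_commit: str,
                          run_config: dict, row_counts: dict) -> dict:
    """Collect the freeze requirements into one JSON-serializable dict."""
    return {
        "git_commit": git_commit,            # supplied by the caller
        "run_config": run_config,            # cadence, horizons, eval range, costs
        "data_snapshot": {
            "path": str(db_path),
            "sha256": file_sha256(db_path),
            "row_counts": row_counts,
        },
        "output_dir": str(output_dir),
        "python_version": platform.python_version(),
    }


def save_manifest(manifest: dict, path: Path) -> None:
    """Write the manifest with sorted keys so diffs between runs stay readable."""
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
```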

---

### 7.6 Tabular ML Baseline ✅ IMPLEMENTED + FROZEN

The ML baseline uses LightGBM Regressor on the same feature stack as factor baselines.

#### 7.6.1 Training Protocol (Implemented)

```python
import lightgbm as lgb

# Per-fold training using walk-forward split.
# `walk_forward_folds`, `features`, and `time_decay_weights` are defined upstream.
for fold in walk_forward_folds:
    # Uses same purging + embargo + maturity
    train_data = fold.train_df
    val_data = fold.val_data

    # Horizon-specific models (separate model per horizon)
    for horizon in [20, 60, 90]:
        model = lgb.LGBMRegressor(  # Regressor (not Ranker) for continuous returns
            objective="regression_l1",
            metric="rmse",
            n_estimators=100,  # FROZEN after initial tuning
            learning_rate=0.05,
            max_depth=5,
            num_leaves=31,
            random_state=42,
        )
        # LGBMRegressor.fit() has no `group` argument (that is LGBMRanker-only),
        # so rows are weighted individually instead of grouped by as_of_date.
        # Time-decay weighting: recent data weighted higher (half-life=252d);
        # `time_decay_weights` needs one entry per row of train_data.
        model.fit(
            train_data[features],
            train_data[f"excess_return_{horizon}d"],
            sample_weight=time_decay_weights,  # Exponential decay
        )
```
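
The `time_decay_weights` used in the training snippet are not defined in this document. One plausible construction, assuming age is measured in calendar days back from the most recent training date (the source only states "half-life=252d", so the unit is an assumption), is:

```python
import numpy as np
import pandas as pd


def time_decay_weights(as_of_dates: pd.Series, half_life_days: float = 252.0) -> np.ndarray:
    """Exponential decay by age: weight = 0.5 ** (age / half_life).

    The newest row gets weight 1.0; a row one half-life older gets 0.5.
    """
    dates = pd.to_datetime(as_of_dates)
    # Age of each row relative to the latest date in the training window.
    age_days = (dates.max() - dates).dt.days.to_numpy(dtype=float)
    return np.power(0.5, age_days / half_life_days)
```

If the half-life is meant in trading days, the age would instead be the row's position in the trading calendar rather than a calendar-day difference.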

#### 7.6.2 Frozen Baseline Floor (COMPLETE)

**Status:** FROZEN at tag `chapter7-tabular-lgb-freeze`  
**Artifacts:** `evaluation_outputs/chapter7_tabular_lgb_full/`  
**Reference:** `BASELINE_REFERENCE.md` + `CLOSURE_MANIFEST.json`

| Horizon | Frozen Factor Floor | tabular_lgb (ML) | Lift |
|---------|---------------------|------------------|------|
| 20d | 0.0283 | **0.1009** | **+0.0726** |
| 60d | 0.0392 | **0.1275** | **+0.0883** |
| 90d | 0.0169 | **0.1808** | **+0.1639** |

**Implementation complete:**
1. ✅ One-time param tuning completed
2. ✅ Params frozen and recorded
3. ✅ FULL_MODE reference run executed (109 monthly folds, 36 quarterly)
4. ✅ Artifacts frozen with git tag
5. ✅ **ML baseline is now immutable**

---

### 7.7 Output Schema (EvaluationRow Contract)

Every baseline (and model) produces rows in this exact format:

| Field | Type | Description |
|-------|------|-------------|
| `as_of_date` | date | Evaluation date |
| `ticker` | str | Stock symbol |
| `stable_id` | str | Survivorship-safe identifier |
| `horizon` | int | 20, 60, or 90 trading days |
| `fold_id` | str | Walk-forward fold |
| `score` | float | Ranking score (HIGHER = BETTER) |
| `excess_return` | float | v2 total return vs benchmark |
| `adv_20d` | float | Average daily volume (cost realism) |

**Rules**:
- No duplicates per (as_of_date, stable_id, horizon)
- Missing score/return → row dropped (logged)
- Tie-breaking: deterministic via stable_id
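
The three rules can be enforced in a few lines of pandas. This is a sketch under the column names from the table above; `enforce_eval_row_contract` is a hypothetical helper, and printing stands in for whatever logging the pipeline uses.

```python
import pandas as pd

KEY = ["as_of_date", "stable_id", "horizon"]


def enforce_eval_row_contract(rows: pd.DataFrame) -> pd.DataFrame:
    """Apply the EvaluationRow rules: drop missing, reject duplicates, fix order."""
    # Missing score/return: drop the row and report how many were lost.
    clean = rows.dropna(subset=["score", "excess_return"])
    dropped = len(rows) - len(clean)
    if dropped:
        print(f"dropped {dropped} rows with missing score/excess_return")

    # Duplicates on (as_of_date, stable_id, horizon) are a hard error.
    dupes = clean.duplicated(subset=KEY)
    if dupes.any():
        raise ValueError(f"{int(dupes.sum())} duplicate rows on {KEY}")

    # Deterministic order: best score first, ties broken by stable_id.
    return clean.sort_values(
        ["as_of_date", "horizon", "score", "stable_id"],
        ascending=[True, True, False, True],
    ).reset_index(drop=True)
```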

---

### 7.8 Implementation Location

```
src/evaluation/baselines.py     # Baseline definitions and scoring
src/evaluation/run_evaluation.py # End-to-end runner (SMOKE/FULL modes)
tests/test_baselines.py         # 39 tests (monotonicity, determinism, etc.)
tests/test_end_to_end_smoke.py  # 22 integration tests
```

**Tests**: 61 tests specific to baselines and runner, all passing.

---

### 7.9 What "Done" Looks Like

When Chapter 7 baseline work is complete:

```
evaluation_outputs/
├── baseline_mom_12m_monthly/
├── baseline_mom_12m_quarterly/
├── baseline_momentum_composite_monthly/
├── baseline_momentum_composite_quarterly/
├── baseline_short_term_strength_monthly/
├── baseline_short_term_strength_quarterly/
├── baseline_naive_random_monthly/  # Sanity check
├── baseline_tabular_lgb_monthly/   # ML baseline
├── baseline_tabular_lgb_quarterly/
└── BASELINE_REFERENCE.md
```

The `BASELINE_REFERENCE.md` contains:
- Median RankIC per horizon (20/60/90), monthly and quarterly
- Churn and cost-survival diagnostics
- Pass/fail for each baseline gate
- Frozen commit hash + output path

**This becomes the only reference point for Chapter 8+ (Kronos, FinText, Fusion).**

---

### DEV / FINAL Holdout Protocol (established Feb 19, 2026)

All evaluation is partitioned into two windows:

| Window | Date Range | Months | Purpose |
|--------|-----------|:------:|---------|
| **DEV** | Feb 2016 – Dec 2023 | 95 | Research iteration |
| **FINAL** | Jan 2024 – Feb 2025 | 14 | One-shot confirmation |

**Cutoff:** `HOLDOUT_START = 2024-01-01`

**Retroactive caveat:** This split was established after development iteration
on the full 109-fold aggregate. It is a soft holdout, not a true blind holdout.

**Key finding:** LGB signal collapses at 60d/90d in the holdout (RankIC goes
negative) but 20d retains weak positive signal (0.010). Shadow portfolio at 20d
holds Sharpe 1.91 in holdout (vs 3.15 DEV). Year-by-year analysis shows regime
dependency — model fails in strong thematic bull markets (2021, 2024) — not
pure overfitting.

**Rules:**
1. All chapters report both **DEV** and **FINAL** metrics
2. Model iteration uses **DEV only**
3. FINAL evaluated **once per chapter** as confirmation
4. 20d is the **confirmed primary horizon** (others are DEV-only until confirmed)

See `documentation/CHAPTER_12.md` Appendix and `documentation/ROADMAP.md` for
full analysis.

## 8) Kronos Module (Price Dynamics)

### 8.0 Scope (Keep Chapter 8 Signal-Researcher Grade)

Chapter 8 is about **integrating Kronos** cleanly into the **frozen** evaluation pipeline and validating signal quality via:
- **RankIC / IC stability**
- **Quintile spread**
- **Churn / turnover proxies**
- **Net-of-cost survival** (diagnostic overlays)
- **Regime slices**
- **PIT + survivorship correctness**

We intentionally do **not** introduce full portfolio backtests (Sharpe/IR/DD) here. That belongs later (see Chapter 11).

---

### 8.1 Kronos Input/Output (Locked)

- **Input**: OHLCV sequences (from DuckDB `prices` table via `PricesStore`)
- **Future timestamps**: derived from **global trading calendar** (DuckDB dates) — **never** `freq="B"`
- **Inference path**: **batch-first** (`predict_batch`) with deterministic SMOKE settings
- **Score mapping (ranking proxy)**:

$$
score_{t,h} = \frac{\hat{C}_{t+h} - C_t}{C_t}
$$

Where:
- $C_t$ is the close at as-of date $t$ (from `PricesStore`)
- $\hat{C}_{t+h}$ is the predicted close at horizon $h$ (from Kronos)
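
A small sketch of the two locked pieces: future timestamps drawn from the stored trading calendar (so holidays are skipped, unlike `freq="B"`) and the score mapping. Both function names are hypothetical, and the calendar is assumed to be a sorted `DatetimeIndex` of session dates.

```python
import pandas as pd


def future_trading_dates(calendar: pd.DatetimeIndex, asof: pd.Timestamp,
                         horizon: int) -> pd.DatetimeIndex:
    """Next `horizon` sessions strictly after `asof`, from the stored calendar."""
    pos = calendar.searchsorted(asof, side="right")
    future = calendar[pos:pos + horizon]
    if len(future) < horizon:
        raise ValueError("calendar does not extend far enough past asof")
    return future


def kronos_score(close_t: float, predicted_close: float) -> float:
    """Ranking proxy: predicted simple return from the as-of close."""
    return (predicted_close - close_t) / close_t
```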

---

### 8.2 Kronos Success Criteria (Signal Quality)

**Minimum gates (signal-only):**
- Runs end-to-end with correct scorer contract (scores for every `(date, ticker)` in validation)
- RankIC is meaningfully positive for at least one horizon (primary focus: **20d**)
- Churn/cost overlays do not destroy the signal
- Performance is not purely a momentum clone (correlation checks)

---

### 8.3 Kronos Leak Tripwires (Evaluation-only, Cheap Insurance)

Because strong metrics can still hide subtle alignment bugs, add **two negative controls** as a leak tripwire:

1) **Shuffle-within-date**
- Shuffle scores **within each date** (cross-section)
- Expectation: **RankIC ≈ 0**

2) **+1 trading-day lag**
- Shift scores forward by **+1 trading day**, so each score is paired with the label of the following session (a deliberate score/label date misalignment)
- Expectation: RankIC **collapses materially**

**Notes:**
- These are evaluation-only checks; they do **not** change model training.
- They should be implemented as part of **Phase 3+ (evaluation integration)**, not Phase 2.
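
Both controls can be sketched directly on an eval-rows frame. This assumes columns `as_of_date`, `ticker`, `score`, and `excess_return`, plus a sorted trading calendar; the function names are hypothetical.

```python
import numpy as np
import pandas as pd


def shuffle_within_date(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Permute scores inside each as_of_date cross-section; labels stay put."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["score"] = out.groupby("as_of_date")["score"].transform(
        lambda s: rng.permutation(s.to_numpy())
    )
    return out


def lag_scores(df: pd.DataFrame, calendar: pd.DatetimeIndex, lag: int = 1) -> pd.DataFrame:
    """Re-stamp each score `lag` sessions later and pair it with that date's label."""
    next_session = pd.Series(calendar[lag:], index=calendar[:-lag])
    shifted = df[["as_of_date", "ticker", "score"]].copy()
    shifted["as_of_date"] = shifted["as_of_date"].map(next_session)
    shifted = shifted.dropna(subset=["as_of_date"])  # last sessions have no successor
    labels = df.drop(columns="score")
    return labels.merge(shifted, on=["as_of_date", "ticker"], how="inner")
```

Feeding either output through the normal RankIC computation gives the tripwire metric; only the score column is disturbed, so any surviving signal points at an alignment problem rather than the model.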

---

### 8.4 What We Do NOT Add in Chapter 8

- No canonical Sharpe/IR/DD yet (that’s Chapter 11+)
- No optimizer/execution
- No overlapping-hold portfolio logic for 60/90 horizons

(Keep Chapter 8 focused: correctness + scalable integration + signal research metrics.)

## 9) FinText-TSFM Module (Return Structure)

### 9.0 Scope & Context

Chapter 9 integrates **FinText-TSFM** — a suite of 600+ time series foundation models pre-trained from scratch on financial excess returns — into our frozen evaluation pipeline.

**Why FinText after Kronos?**
Chapter 8 showed that a generic TSFM (Kronos, pre-trained on OHLCV across exchanges) produces inverted rankings zero-shot. The [FinText paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5770562) (Rahimikia, Ni & Wang, 2025) reports that **finance-native pre-training** on excess returns substantially outperforms generic foundation models, which is consistent with the failure we observed with Kronos.

**Key advantages over Kronos:**
- **Input = daily excess returns** (not OHLCV) → directly matches our label definition
- **Pre-trained on 2B+ financial observations** across 94 markets (not generic time series)
- **Year-specific models** → inherently PIT-safe (no look-ahead bias by construction)
- **Proven academic results** with proper out-of-sample evaluation

**References:**
- Paper: "Re(Visiting) Time Series Foundation Models in Finance" ([SSRN 5770562](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5770562))
- Models: [HuggingFace/FinText](https://huggingface.co/FinText) (613 models)
- Code: [GitHub/DeepIntoStreams/TSFM_Finance](https://github.com/DeepIntoStreams/TSFM_Finance)

---

### 9.1 FinText Input / Output (Locked)

**Input:** Historical daily excess return sequences (PIT-safe)
- Source: DuckDB `prices` table + QQQ benchmark prices
- Daily excess return: `r_stock_t - r_QQQ_t` where `r = (close_t / close_{t-1}) - 1`
- Lookback window: **21 trading days** (primary), 252d and 512d (ablation)
- Shape: `torch.Tensor` of shape `(n_stocks, lookback_window)`

**Model:** Amazon Chronos architecture, finance-pre-trained by FinText team
- Primary: `FinText/Chronos_Small_{YEAR}_US` (46M params, U.S. excess returns)
- Secondary ablations: `Chronos_Mini` (20M), `Chronos_Tiny` (8M), `Global`, `Augmented`
- Year selection (PIT-safe): For as-of date in year Y, load model trained through Y-1
  - Example: evaluating at 2024-03-01 → use `FinText/Chronos_Small_2023_US`
  - Example: evaluating at 2020-06-15 → use `FinText/Chronos_Small_2019_US`

**Output:** Predicted future excess return distribution
- `num_samples` draws from the predictive distribution (default: 20)
- Shape: `(n_stocks, num_samples, prediction_length)`
- prediction_length = 1 (next-day excess return; aggregate for multi-day horizons)

**Score mapping (ranking proxy):**

$$
\text{score}_{i,t} = \text{median}\left(\hat{r}^{(1)}_{i,t+1}, \ldots, \hat{r}^{(S)}_{i,t+1}\right)
$$

Where $\hat{r}^{(s)}_{i,t+1}$ is sample $s$ of the predicted excess return for stock $i$.

**Uncertainty quantification (for Chapter 13):**
- Spread: IQR of samples → natural confidence measure
- Save full sample distribution for calibration in Chapter 13
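
The score mapping and the IQR spread reduce to two NumPy reductions over the sample axis. A minimal sketch, assuming the sample tensor has already been converted to a NumPy array with the shape given above; `score_and_spread` is a hypothetical name.

```python
import numpy as np


def score_and_spread(samples: np.ndarray):
    """samples: (n_stocks, num_samples, prediction_length); uses the first step.

    Returns (score, iqr), each of shape (n_stocks,).
    """
    step1 = samples[:, :, 0]                      # next-day draws per stock
    score = np.median(step1, axis=1)              # ranking score
    q25, q75 = np.percentile(step1, [25, 75], axis=1)
    return score, q75 - q25                       # IQR as confidence proxy
```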

---

### 9.2 Data Plumbing: Daily Excess Return Sequences

**New component:** `src/data/excess_return_store.py`

Responsibilities:
1. Load stock daily closes from DuckDB `prices` table
2. Load QQQ daily closes (add to DuckDB `prices` table if not present)
3. Compute daily excess returns: `stock_daily_return - QQQ_daily_return`
4. Return windowed sequences for any `(ticker, asof_date, lookback)`
5. Cache computed sequences for efficiency

**PIT rules (same as Chapter 8):**
- Only use data with date ≤ asof_date
- All returns computed from split-adjusted prices
- QQQ must be on the same time grid as the stocks (NYSE trading days)

**QQQ benchmark data:**
- Already cached from FMP: `data/cache/fmp/...symbol=QQQ...`
- One-time script to add QQQ to DuckDB `prices` table
- Verify date alignment with stock universe
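
The core of the store is a PIT-filtered, date-aligned return difference. A sketch of that computation for one ticker, assuming both inputs are close-price Series indexed by date; `excess_return_window` is a hypothetical name and caching is omitted.

```python
import pandas as pd


def excess_return_window(stock_close: pd.Series, bench_close: pd.Series,
                         asof_date: pd.Timestamp, lookback: int = 21) -> pd.Series:
    """Last `lookback` daily excess returns using only data with date <= asof_date."""
    # PIT rule: nothing after asof_date may enter the window.
    stock = stock_close.sort_index().loc[:asof_date]
    bench = bench_close.sort_index().loc[:asof_date]
    # Inner join keeps only dates present in both series (same trading grid).
    closes = pd.concat({"stock": stock, "bench": bench}, axis=1, join="inner")
    daily = closes.pct_change().dropna()
    excess = daily["stock"] - daily["bench"]
    if len(excess) < lookback:
        raise ValueError(f"need {lookback} returns, have {len(excess)}")
    return excess.iloc[-lookback:]
```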

---

### 9.3 FinText Adapter

**New component:** `src/models/fintext_adapter.py`

```python
class FinTextAdapter:
    """
    Adapter for FinText-TSFM (Chronos) foundation model inference.
    
    Key design decisions:
    1. Uses ExcessReturnStore for input sequences (daily excess returns)
    2. Year-aware model loading (PIT-safe)
    3. Batch inference via ChronosPipeline
    4. Score = median predicted excess return
    """
    
    # Year-aware model selection
    def get_model_id(self, asof_date) -> str:
        year = asof_date.year - 1  # Use model trained through previous year
        year = min(year, 2023)     # Latest available model
        year = max(year, 2000)     # Earliest available model
        return f"FinText/Chronos_Small_{year}_US"
    
    # Core inference
    def score_universe_batch(
        self,
        asof_date: pd.Timestamp,
        tickers: List[str],
        horizon: int,
        lookback: int = 21,
        num_samples: int = 20,
    ) -> pd.DataFrame:
        """Score all tickers for one as-of date."""
```

**Model caching strategy:**
- Cache loaded models in memory (keyed by model_id)
- Walk-forward evaluation processes dates chronologically → same model reused for months within a year
- Memory footprint: ~200MB per Chronos-Small model
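
The caching strategy amounts to a dictionary keyed by `model_id`. A minimal sketch with an injectable loader, so tests can run without downloading weights; `ModelCache` is a hypothetical class, and in real use the loader would wrap whatever call loads the Chronos pipeline (for example `ChronosPipeline.from_pretrained`).

```python
from typing import Callable, Dict


class ModelCache:
    """Keep each loaded model in memory, keyed by its model_id."""

    def __init__(self, loader: Callable[[str], object]):
        self._loader = loader
        self._models: Dict[str, object] = {}

    def get(self, model_id: str) -> object:
        # Load on first request only; later dates in the same year reuse it.
        if model_id not in self._models:
            self._models[model_id] = self._loader(model_id)
        return self._models[model_id]

    def __len__(self) -> int:
        return len(self._models)
```

Because walk-forward dates are processed chronologically, an unbounded dictionary holds at most one model per evaluation year; evicting older years would be a straightforward extension if memory becomes a constraint.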

**Dependencies:**
- `chronos-forecasting>=1.5` (Amazon Chronos)
- `transformers>=4.35` (already in requirements)
- `torch>=2.0` (already in requirements)

---

### 9.4 Multi-Day Horizon Strategy

FinText Chronos predicts **1-day-ahead** excess returns. Our evaluation uses 20/60/90 trading day horizons.

**Approach: Autoregressive multi-step (primary)**
- For horizon H, run the model autoregressively H steps
- Use `prediction_length=H` in `pipeline.predict()` (Chronos handles internal AR)
- Score = median of predicted return at step H

**Alternative: Cumulative sum (ablation)**
- Predict H daily returns, sum them for cumulative excess return
- `cum_score = sum(median_prediction[1..H])`
- May capture path effects better

**Start with:** `prediction_length=1` (simplest, cleanest) and use the single-step predicted return as the ranking score for all horizons. This is justified because:
- Cross-sectional ranking is relative (absolute magnitude doesn't matter)
- If a stock's 1-day predicted excess return is positive, its multi-day return tends to be too
- Reduces inference cost by 20-90x vs full autoregressive unrolling

Test multi-step as an ablation if 1-step shows promising RankIC.
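
The three scoring variants discussed above differ only in how the sample tensor is reduced. A sketch, assuming samples of shape `(n_stocks, num_samples, prediction_length)` as a NumPy array with `prediction_length >= horizon`; `horizon_scores` is a hypothetical name.

```python
import numpy as np


def horizon_scores(samples: np.ndarray, horizon: int) -> dict:
    """Compute single-step, step-H, and cumulative-sum scores per stock."""
    if samples.shape[2] < horizon:
        raise ValueError("prediction_length is shorter than the horizon")
    path = samples[:, :, :horizon]
    step_medians = np.median(path, axis=1)        # (n_stocks, horizon)
    return {
        "single_step": step_medians[:, 0],        # starting point: 1-day score
        "step_h": step_medians[:, horizon - 1],   # autoregressive, step H only
        "cumulative": step_medians.sum(axis=1),   # sum of per-step medians
    }
```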

---

### 9.5 Walk-Forward Model Selection (PIT-Safe)

Unlike Kronos (single pre-trained model), FinText provides year-specific models.

**Walk-forward model assignment:**

| Evaluation Period | Model Used | Rationale |
|-------------------|------------|-----------|
| 2016-01 to 2016-12 | `Chronos_Small_2015_US` | Trained through 2015 |
| 2017-01 to 2017-12 | `Chronos_Small_2016_US` | Trained through 2016 |
| ... | ... | ... |
| 2024-01 to 2025-06 | `Chronos_Small_2023_US` | Latest model (trained through 2023) |

This is strictly PIT-safe: each model only uses data available before the evaluation period.

---

### 9.6 FinText Success Criteria (Signal Quality)

**Minimum gates (same framework as Chapter 8):**

| Gate | Criterion | Reference | Rationale |
|------|--------|-----------|-----------|
| Gate 1 (Factor) | RankIC ≥ 0.02 for ≥2 horizons | Factor baseline | Confirms signal exists |
| Gate 2 (ML) | Any horizon RankIC ≥ 0.05 or within 0.03 of LGB | ML comparison | Worth pursuing for fusion |
| Gate 3 (Practical) | Churn ≤ 30%, stable across regimes | Implementability | Rankings must be tradeable |

**Additional success criteria:**
- Low correlation with `tabular_lgb` scores (< 0.5) → complementary signal for fusion
- Low correlation with Kronos scores → orthogonal view
- Stable IC sign across ≥ 70% of rolling windows

**Comparison targets (from frozen baselines):**

| Horizon | Factor Floor | ML Floor (tabular_lgb) | FinText Target |
|---------|-------------|----------------------|----------------|
| 20d | 0.0283 | 0.1009 | ≥ 0.02 (Gate 1) |
| 60d | 0.0392 | 0.1275 | ≥ 0.05 (Gate 2) |
| 90d | 0.0169 | 0.1808 | ≥ 0.02 (Gate 1) |

**Note:** Even if FinText doesn't beat LGB on RankIC, a weakly positive but **orthogonal** signal is valuable for fusion (Chapter 11).

---

### 9.7 Leak Tripwires (Evaluation-Only)

Reuse the same negative controls used for Kronos:

1. **Shuffle-within-date** → RankIC ≈ 0
   - Shuffle scores within each date (cross-section)
   - Confirms signal is stock-specific, not systematic

2. **+1 trading-day lag** → RankIC collapses
   - Shift scores forward by 1 trading day
   - Confirms signal is time-aligned

3. **Year-mismatch control** (NEW for FinText)
   - Use deliberately wrong model year (e.g., 2023 model for 2018 dates)
   - If RankIC doesn't change → model year doesn't matter (suspicious)
   - If RankIC degrades → year-specific training provides genuine value

---

### 9.8 Ablation Studies

| Ablation | What We Learn |
|----------|--------------|
| **Lookback window**: 21 vs 252 vs 512 | Optimal context length for ranking |
| **Model size**: Tiny (8M) vs Mini (20M) vs Small (46M) | Quality vs speed tradeoff |
| **Model dataset**: US vs Global vs Augmented | Domain specificity impact |
| **Prediction length**: 1-step vs H-step autoregressive | Multi-day horizon strategy |
| **Score aggregation**: Median vs mean vs trimmed mean | Robustness of score mapping |
| **Num samples**: 5 vs 20 vs 50 | Distribution estimation stability |

Run ablations in SMOKE mode first; only promote to FULL for promising variants.

---

### 9.9 Implementation Phases

**Phase 1: Data Plumbing (½ day)**
1. Add QQQ to DuckDB `prices` table (from existing FMP cache)
2. Implement `ExcessReturnStore` (daily excess return sequences)
3. Unit tests for return computation and PIT safety
4. Verify: sequences match manual calculation for sample dates

**Phase 2: FinText Adapter (1 day)**
1. Install `chronos-forecasting` package
2. Implement `FinTextAdapter` with year-aware model loading
3. Implement `fintext_scoring_function()` for `run_experiment()` integration
4. Stub mode for testing without model download
5. Single-stock sanity test
6. Unit tests (15+ tests covering all non-negotiables)

**Phase 3: Evaluation Integration (½ day)**
1. Create `scripts/run_chapter9_fintext.py`
2. SMOKE evaluation (3 folds, verify pipeline)
3. Leak tripwires (shuffle + lag + year-mismatch)
4. Momentum-clone check (correlation vs factor baselines)

**Phase 4: Full Evaluation & Ablation (1 day)**
1. FULL mode evaluation (109 monthly folds, 3 horizons)
2. Compare vs frozen baselines (factor + ML)
3. Run primary ablations (lookback, model size)
4. Document results and gate evaluation
5. Freeze Chapter 9 artifacts (if gates pass)

---

### 9.10 Files to Create

```
src/data/excess_return_store.py        # Daily excess return sequences
src/models/fintext_adapter.py          # FinText-TSFM adapter
scripts/add_qqq_to_duckdb.py           # One-time: add QQQ prices
scripts/run_chapter9_fintext.py        # Walk-forward evaluation
scripts/test_fintext_single_stock.py   # Quick sanity check
tests/test_excess_return_store.py      # Unit tests for data
tests/test_fintext_adapter.py          # Unit tests for adapter
```

---

### 9.11 What We Do NOT Add in Chapter 9

- No Sharpe/IR/DD yet (that's Chapter 11+)
- No optimizer/execution
- No overlapping-hold portfolio logic
- No fine-tuning of FinText models (zero-shot only; the models are already finance-pre-trained)
- No fusion with tabular features (that's Chapter 11)

(Keep Chapter 9 focused: correctness + scalable integration + signal research metrics.)

## 10) NLP Sentiment Signal

**Goal:** Add a text-based sentiment signal that is **orthogonal** to price/fundamental features, for use in Chapter 11 fusion.

**Why this matters for fusion:** Our existing features (Ch5) capture price dynamics, fundamentals, and macro regime. Sentiment captures **information flow** — what people are saying about a stock before it's fully reflected in prices. For AI stocks (news-heavy sector), this is especially valuable.

**Model:** ProsusAI/FinBERT (pre-trained finance sentiment, zero-shot, no fine-tuning needed — same philosophy as FinText in Ch9).

**Data Sources:**
- **SEC EDGAR 8-K filings** (free, unlimited, PIT-safe): Material event disclosures, earnings releases, management commentary
- **FinnHub company news API** (free tier, 60 req/min): Real-time news headlines with timestamps for our 100 AI stocks
- **Existing EventStore** infrastructure (Ch4): Already supports `SENTIMENT` event type with PIT timestamps

---

### 10.1 Sentiment Data Pipeline

**SEC EDGAR 8-K Text Extraction:**
- Download 8-K filing text for all 100 AI stocks (2016–2025)
- Extract key sections: Item 2.02 (Results of Operations), Item 7.01 (Regulation FD), Item 8.01 (Other Events)
- PIT enforcement: use `acceptanceDateTime` from EDGAR (to-the-second accuracy)
- Rate limit: 8 req/sec (kept below EDGAR's published fair-access cap of 10 req/sec), no daily cap

**FinnHub News Collection:**
- Collect company-specific news for all 100 AI stocks
- Fields: headline, summary, datetime, source, category
- PIT enforcement: use article `datetime` as observed_at
- Rate limit: 60 req/min (free tier)
- **Requires:** Free FinnHub API key (https://finnhub.io/register)

**Storage:**
- Store raw text + metadata in `EventStore` with `event_type='SENTIMENT'`
- Deduplicate by source + datetime + ticker
- Cache locally to avoid re-downloading

**Non-negotiables:**
- Every text record has a PIT-safe timestamp
- No future text leaks into historical scoring
- Text is stored raw (scoring happens separately)

---

### 10.2 FinBERT Sentiment Scoring

**Model:** `ProsusAI/finbert` (HuggingFace, ~110M params)
- Pre-trained on financial text (news, reports, SEC filings)
- Outputs: P(positive), P(negative), P(neutral) per text chunk
- Zero-shot — no fine-tuning required

**Scoring Pipeline:**
1. For each text record (8-K excerpt or news headline+summary):
   - Tokenize with FinBERT tokenizer (max 512 tokens)
   - For long filings: split into chunks, score each, aggregate (mean of top-3 most opinionated chunks)
   - Output: `sentiment_score` = P(positive) - P(negative) ∈ [-1, +1]
2. Store scored results back in EventStore with original PIT timestamp
3. Batch processing: score all historical text in one pass, then incremental updates
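
The chunk aggregation in step 1 can be sketched as follows. Two assumptions are made here: the probability columns are ordered `[positive, negative, neutral]`, and "most opinionated" is read as the largest absolute `P(positive) - P(negative)`. `aggregate_chunk_sentiment` is a hypothetical name.

```python
import numpy as np


def aggregate_chunk_sentiment(probs, top_k: int = 3) -> float:
    """probs: (n_chunks, 3) class probabilities per chunk.

    Returns the mean sentiment of the top_k most opinionated chunks, in [-1, +1].
    """
    probs = np.asarray(probs, dtype=float)
    if probs.ndim != 2 or probs.shape[0] == 0:
        raise ValueError("expected a non-empty (n_chunks, 3) array")
    chunk_scores = probs[:, 0] - probs[:, 1]      # P(positive) - P(negative)
    k = min(top_k, len(chunk_scores))
    # Indices of the k chunks with the largest absolute sentiment.
    top = np.argsort(-np.abs(chunk_scores))[:k]
    return float(chunk_scores[top].mean())
```

Short texts with a single chunk fall through the same path, since `k` is capped at the number of chunks.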

**Quality Checks:**
- Verify score distribution is not degenerate (not all neutral)
- Spot-check known events (e.g., NVDA earnings beat → should be positive)
- Cross-validate against known market reactions

---

### 10.3 Sentiment Feature Engineering

Build per-stock sentiment features for each evaluation date, using only PIT-safe data:

**Filing Sentiment Features (3):**
- `filing_sentiment_latest`: Sentiment of most recent 8-K filing
- `filing_sentiment_change`: Change in sentiment between last two filings
- `filing_sentiment_90d`: Mean sentiment of all filings in past 90 days

**News Sentiment Features (4):**
- `news_sentiment_7d`: Mean sentiment of news in past 7 trading days
- `news_sentiment_30d`: Mean sentiment of news in past 30 trading days
- `news_sentiment_momentum`: 7d minus 30d sentiment (acceleration)
- `news_volume_30d`: Count of news articles in past 30 days (attention proxy)

**Cross-Sectional Features (2):**
- `sentiment_zscore`: Cross-sectional z-score of 30d sentiment (rank within universe)
- `sentiment_vs_momentum`: Residual of sentiment after regressing on price momentum (captures divergence between text and price signals)

**Total: 9 new features** (deliberately small — avoid feature bloat, let fusion model decide weighting)

**PIT Safety:** All features computed using only text observed before the evaluation date.
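
A sketch of the news features and the cross-sectional z-score for one evaluation timestamp. It assumes an events frame with columns `ticker`, `observed_at`, and `sentiment_score`, and it uses calendar-day windows as a simplification of the trading-day windows described above; `news_features` is a hypothetical name.

```python
import pandas as pd


def news_features(events: pd.DataFrame, asof: pd.Timestamp,
                  short_days: int = 7, long_days: int = 30) -> pd.DataFrame:
    """Per-ticker news sentiment features using only text observed before asof."""
    # PIT: strictly earlier than the evaluation timestamp.
    past = events[events["observed_at"] < asof]

    def window_mean(days: int) -> pd.Series:
        recent = past[past["observed_at"] >= asof - pd.Timedelta(days=days)]
        return recent.groupby("ticker")["sentiment_score"].mean()

    feats = pd.DataFrame({
        "news_sentiment_7d": window_mean(short_days),
        "news_sentiment_30d": window_mean(long_days),
    })
    feats["news_sentiment_momentum"] = (
        feats["news_sentiment_7d"] - feats["news_sentiment_30d"]
    )
    # Cross-sectional z-score of the 30d mean across the universe.
    base = feats["news_sentiment_30d"]
    feats["sentiment_zscore"] = (base - base.mean()) / base.std(ddof=0)
    return feats
```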

---

### 10.4 Walk-Forward Evaluation & Gates ✅

**SMOKE Results (3 folds, 2024):** 19,110 eval rows, 53 seconds.

| Horizon | Mean RankIC | Churn | Gate 1 | Gate 2 | Gate 3 |
|---------|-------------|-------|--------|--------|--------|
| 20d | -0.015 | 10% | ❌ | ❌ | ✅ |
| 60d | -0.029 | 10% | ❌ | ❌ | ✅ |
| 90d | -0.066 | 10% | ❌ | ❌ | ✅ |

**Orthogonality (KEY RESULT):** ρ < 0.16 vs ALL existing signals → HIGH fusion value.

---

### 10.5 Freeze & Documentation ✅

- 9 features frozen, artifacts saved to `evaluation_outputs/chapter10_sentiment_smoke/`
- 101 tests passing (28 + 22 + 33 + 18)
- CHAPTER_10.md + ROADMAP.md updated

**Files to create:**
```
src/data/sentiment_store.py           # Sentiment data collection & storage
src/models/finbert_scorer.py          # FinBERT sentiment scoring
src/features/sentiment_features.py    # Sentiment feature engineering
scripts/collect_sentiment_data.py     # One-time: collect historical sentiment
scripts/run_chapter10_sentiment.py    # Walk-forward evaluation
tests/test_sentiment_store.py         # Unit tests for data layer
tests/test_finbert_scorer.py          # Unit tests for scoring
tests/test_sentiment_features.py      # Unit tests for features
```

## 11) Fusion Model (Ranking-First) ✅ COMPLETE

**Goal:** combine orthogonal signals into a robust cross-sectional ranking expert.

**Signal families:** LGB (Ch7), FinText (Ch9, real Chronos), Sentiment (Ch10)

### 11.1–11.4 ✅ COMPLETE
- Score alignment, fusion architecture (rank-avg, enriched LGB, stacking), gate evaluation, residual archive, expert interface
- 36 tests passing, metric gap fixed (IC stability + cost survival added)

### 11.5 FULL-Mode Comparison & Shadow Portfolio ✅ COMPLETE

All sub-models (including real Chronos FinText) re-run in FULL mode (109 folds, monthly).

**Signal quality (109 folds, 2277 eval dates):**

| Model | Hz | Med. RankIC | IC Stab | Cost Surv |
|-------|---:|:----------:|:------:|:---------:|
| **LGB baseline** | 20d | **0.0805** | **0.3534** | **66.8%** |
| Learned Stacking | 20d | 0.0783 | 0.3452 | 65.9% |
| Rank Avg 2 | 20d | 0.0538 | 0.2978 | 61.4% |
| **LGB baseline** | 60d | **0.1478** | **0.7119** | **77.5%** |
| Learned Stacking | 60d | 0.1480 | 0.7075 | 77.5% |
| Rank Avg 2 | 60d | 0.0920 | 0.5610 | 70.4% |
| **LGB baseline** | 90d | **0.1833** | **0.7972** | **79.8%** |
| Learned Stacking | 90d | 0.1802 | 0.7961 | 79.9% |
| Rank Avg 2 | 90d | 0.1173 | 0.6069 | 72.1% |

**Shadow portfolio (20d):** LGB Sharpe **1.26** vs Stacking 1.14 vs RankAvg 0.84

**FinText standalone:** near-zero signal (20d RankIC 0.014, 60d ~0, 90d slightly negative)

### 11.6 Freeze ✅ COMPLETE

**Verdict:** Gate 4 FAIL, definitive with real FinText. The LGB baseline wins or ties on every metric; Learned Stacking edges it only by negligible margins (60d RankIC 0.1480 vs 0.1478, 90d cost survival 79.9% vs 79.8%). Root cause: FinText (Chronos) produces near-zero standalone signal for this AI stock universe. Learned Stacking matches LGB because Ridge learns to discard the weak sub-models. Infrastructure value delivered: residual archive, expert interface, and disagreement proxy, all ready for Ch13 UQ.


## 12) Regime-Aware Analysis & Heuristic Ensemble — ✅ COMPLETE

**Goal:** Understand when LGB fails, test whether simple regime heuristics improve performance, and build the regime-based baseline that Chapter 13 DEUP must beat.

**Result:** LGB performs 2.6× better in calm markets (20d RankIC 0.18 vs 0.07). Vol-sizing heuristic provides modest improvement (Sharpe 2.65→2.73, max DD −22%→−18%). Regime blending fails. Vol-sized LGB is the DEUP ablation baseline. `data/regime_context.parquet` frozen for Ch13 (201K rows, 16 features, 100% coverage). See `documentation/CHAPTER_12.md`.

**Context from Chapter 11:** LGB baseline (Ch7) dominates all fusion variants. FinText has near-zero signal. There is no portfolio of strong models to ensemble — the value of Chapter 12 is *diagnostic* (when does our best model struggle?) and *infrastructure* (regime context for Ch13 UCB expert selection).

**Role in UQ pipeline:** The UQ reference document (§11.4) specifies four ablation baselines for Chapter 13's DEUP-based expert selection. One is **"UCB with DEUP vs Regime-based switching (heuristic rules)."** Chapter 12 builds that heuristic baseline. Additionally, the "aleatoric" half of dual-gated position sizing (§7.5: `w = min(1, d / √(a(x) + ε))`) — using realized volatility — belongs here.

**Existing infrastructure to reuse:**
- `src/features/regime_features.py` — VIX regime (low/normal/elevated/high), market regime (bull/bear/neutral), sector rotation
- `src/evaluation/metrics.py` — `evaluate_with_regime_slicing()`, `REGIME_DEFINITIONS` (VIX percentile, market regime, market vol)
- `src/evaluation/reports.py` — `format_regime_performance()`, `plot_regime_bars()`
- `scripts/run_shadow_portfolio.py` — Shadow portfolio (Sharpe, DD, turnover, hit rate)
- LGB FULL eval_rows (591K rows, 109 folds, 2277 dates) already available

---

### 12.1 Regime-Conditional Performance Diagnostics

Slice the LGB baseline across regime buckets to understand failure modes.

**Work:**
- Load LGB FULL eval_rows + annotate each `as_of_date` with regime features from `RegimeFeatureGenerator`
- Compute per-regime: RankIC (median/mean), IC stability, cost survival, hit rate
- Two regime axes: **VIX regime** (low/normal/elevated/high) and **market regime** (bull/bear/neutral)
- Identify: does LGB struggle in high-VIX? In bear markets? At regime transitions?
- Compute per-date RankIC correlation with VIX level and market return (continuous, not bucketed)
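
The per-regime slicing can be sketched as a per-date Spearman correlation followed by a group-by on the regime label. This is an illustration, not the existing `evaluate_with_regime_slicing()`; it assumes eval rows with `as_of_date`, `score`, and `excess_return`, and a Series of regime labels indexed by date. `rankic_by_regime` is a hypothetical name.

```python
import pandas as pd


def rankic_by_regime(eval_rows: pd.DataFrame, regimes: pd.Series) -> pd.DataFrame:
    """Median/mean per-date RankIC within each regime bucket."""
    # Spearman = Pearson correlation of ranks, computed per cross-section.
    per_date = (
        eval_rows.groupby("as_of_date")[["score", "excess_return"]]
        .apply(lambda g: g["score"].rank().corr(g["excess_return"].rank()))
        .rename("rankic")
    )
    # Attach the regime label of each date, then aggregate per regime.
    joined = per_date.to_frame().join(regimes.rename("regime"), how="inner")
    return joined.groupby("regime")["rankic"].agg(["median", "mean", "count"])
```

The same `per_date` series can be correlated against a continuous VIX level or market return for the non-bucketed check.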

**Output:** `evaluation_outputs/chapter12/regime_diagnostics.csv`, regime performance tables

**Why it matters:** Foundation for everything downstream. DEUP (Ch13) needs to know *which features predict model failure* — regime is the most obvious candidate. Also essential for the thesis: "Our model performs X in normal conditions and Y in stress conditions."

---

### 12.2 Regime Stress Tests (Shadow Portfolio)

Extend the shadow portfolio with regime-conditional reporting.

**Work:**
- Annotate shadow portfolio daily returns with regime labels
- Compute per-regime: annualized Sharpe, annualized return, max drawdown, hit rate
- Plot rolling 12-month Sharpe with regime overlay (colored background)
- Identify worst 5 drawdown episodes and their regime context
- Compare: LGB vs Learned Stacking vs Rank Avg 2 by regime (using existing shadow portfolio returns)

**Output:** `evaluation_outputs/chapter12/regime_stress_report.json`, rolling Sharpe plot

**Why it matters:** Required for institutional-grade reporting. Shows if the Sharpe 1.26 is driven by one regime or is robust across conditions.

---

### 12.3 Regime-Aware Heuristic Baseline

Build simple regime-conditioned approaches that serve as baselines for Ch13 DEUP.

**Approach A — Volatility-Scaled Position Sizing:**
- Scale signal by inverse realized volatility: `sized_score = score × min(1, c / σ_realized)`
- The "aleatoric" half of dual-gated sizing from UQ reference §7.5
- Re-run shadow portfolio with sized scores, compare Sharpe/DD vs uniform

**Approach B — Regime-Blended Ensemble:**
- In normal/low-VIX regimes → use LGB scores as-is (full weight)
- In elevated/high-VIX regimes → blend LGB with momentum factor (simpler, more robust model)
- Smooth weights with exponential decay across regime boundaries (avoid hard switching artifacts)
- Formula: `w_lgb = 1 - α × sigmoid((vix_pctile - 67) / τ)`, where α and τ are fixed (not optimized)
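
Both formulas are small enough to write out. In this sketch `c`, `alpha`, and `tau` are passed in explicitly because their fixed values are not stated here, and the blend assumes LGB and momentum scores are already on a comparable scale (for example cross-sectional ranks). The function names are hypothetical.

```python
import math


def vol_sized_score(score: float, sigma_realized: float, c: float) -> float:
    """Approach A: scale the signal down when realized vol exceeds the target c."""
    return score * min(1.0, c / sigma_realized)


def lgb_weight(vix_pctile: float, alpha: float, tau: float) -> float:
    """Approach B: w_lgb = 1 - alpha * sigmoid((vix_pctile - 67) / tau)."""
    sig = 1.0 / (1.0 + math.exp(-(vix_pctile - 67.0) / tau))
    return 1.0 - alpha * sig


def blended_score(lgb_score: float, momentum_score: float,
                  vix_pctile: float, alpha: float, tau: float) -> float:
    """Convex blend of LGB and momentum, shifting toward momentum in high VIX."""
    w = lgb_weight(vix_pctile, alpha, tau)
    return w * lgb_score + (1.0 - w) * momentum_score
```

At the 67th VIX percentile the sigmoid equals 0.5, so the LGB weight is `1 - alpha / 2`; well below that percentile the weight approaches 1 and LGB scores pass through almost unchanged.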

**Evaluation:**
- Run both through walk-forward evaluation (reuse `run_experiment()`)
- Compare vs pure LGB: RankIC, IC stability, cost survival, churn
- Shadow portfolio comparison: Sharpe, return, DD, worst month
- SMOKE first (sanity), then FULL if either approach shows promise

**Gate criteria:**
- Does volatility sizing improve Sharpe or reduce max drawdown vs uniform?
- Does regime blending improve any metric vs pure LGB?
- If both fail → document negative result (still useful as Ch13 ablation baseline)

**Output:** `evaluation_outputs/chapter12/regime_heuristic/` with eval_rows + shadow portfolio

---

### 12.4 Freeze & Documentation

**Work:**
- Compile final results table: LGB (uniform) vs vol-sized vs regime-blended
- Store regime context features per-date for Ch13 (`data/regime_context.parquet`)
  - Must include **per-stock realized volatility** (`vol_20d` from features store), not just market-level VIX
  - DEUP's aleatoric baseline a(x) needs stock-level noise estimates for per-stock position sizing
  - `vol_20d` already exists in `data/features.duckdb` (computed in Ch5 price features)
- Document regime effects (which regimes hurt performance, by how much)
- Write `documentation/CHAPTER_12.md`
- Update `outline.ipynb` and `ROADMAP.md`

**Success criteria (pragmatic):**
- ✅ Regime-conditional LGB performance quantified across all metrics
- ✅ Shadow portfolio stress-tested by regime with rolling analysis
- ✅ At least one heuristic baseline built for Ch13 DEUP ablation
- ✅ Regime context features stored and validated

**Gate (Chapter 12-specific):**
- If regime heuristics beat LGB → adopt as new primary signal
- If regime heuristics ≈ LGB → useful negative result; heuristic serves as Ch13 ablation baseline
- Either outcome is valid — the key deliverable is the diagnostic analysis and the baseline

**Estimated effort:** ~1 day (heavy reuse of existing evaluation + shadow portfolio infrastructure)

### 12.5 What We Do NOT Add in Chapter 12

- No HMM / Markov regime-switching models (overkill for ~50 stocks, and not needed before DEUP)
- No optimized ensemble weights (would overfit 109 folds; Ch13 handles adaptive weighting properly)
- No DEUP or epistemic uncertainty (that's Chapter 13)
- No new model training (LGB is frozen from Ch7; we're testing regime-conditional *use* of existing models)
- No execution logic (signal-only, evaluation-only)

## 13) DEUP Uncertainty Quantification

**Goal:** Implement DEUP (Direct Epistemic Uncertainty Prediction) for the AI Stock Forecaster expert. Train an error predictor g(x) on walk-forward residuals, calibrate an aleatoric baseline a(x), compute epistemic uncertainty e_hat(x) = max(0, g(x) - a(x)), and prove it adds economic value beyond simple volatility scaling AND beyond industry-standard UQ alternatives.

**Primary horizon:** 20d (confirmed in holdout; 60d/90d are secondary).

**DEUP test case:** Does e_hat(x) spike in Mar-Jul 2024 when the 20d signal goes negative? If yes, the system would have reduced positions during the AI rotation -- that's the core value proposition.

**Ablation baseline:** Vol-sized LGB (Sharpe 2.73 ALL-period, from Ch12.3). DEUP must beat this.

**All metrics reported as DEV (pre-2024) and FINAL (2024+) side by side.**

---

### 13.0 Populate Residual Archive & Loss Definition (infrastructure)

**Loss definition (CRITICAL -- validated empirically):**

The LGB model is a regressor (target = excess_return), but we evaluate it as a RANKER (RankIC). There are two candidate losses for g(x):

| Loss | Definition | Corr with daily RankIC | vol_20d corr |
|------|-----------|----------------------|-------------|
| **Rank displacement** | \|rank_pct(ER) - rank_pct(score)\| per stock per date | rho = -0.974 | rho = 0.054 |
| **MAE** | \|excess_return - score\| | rho = -0.119 | rho = 0.254 |

**Decision: use rank displacement as the PRIMARY g(x) target.** Reasons:
1. It correlates almost perfectly with RankIC (the metric we care about)
2. vol_20d does NOT predict rank displacement (rho = 0.05) -- so e_hat will NOT be repackaged volatility
3. Score std (0.025) is 4x smaller than ER std (0.104) -- raw MAE is dominated by irreducible return noise, not model error

**Secondary: also train g(x) on MAE for comparison.** If the MAE-based DEUP produces similar results, the method is robust to loss choice.

**Steps:**
- Load LGB FULL eval_rows (591K rows, 109 folds, 3 horizons)
- Compute per-date cross-sectional ranks for both scores and excess returns
- Compute rank_loss = |rank_pct(ER) - rank_pct(score)| per (date, stock, horizon)
- Also compute mae_loss = |ER - score| for secondary comparison
- Store in `data/residuals.duckdb` via `ResidualArchive.save_from_eval_rows()`
- Join regime_context features (vol_20d, vol_60d, mom_1m, vix_percentile_252d, market_regime, etc.) by (date, stable_id)
- Also populate for Rank Avg 2 (844K rows, 109 folds)
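
**Sketch (illustrative):** A minimal pure-Python version of the two loss computations in the steps above. The row layout (dicts with `date`, `horizon`, `stable_id`, `score`, `excess_return`) is an assumption for the example; the real pipeline reads eval_rows and writes through `ResidualArchive`, which is not reproduced here. Ties are given average ranks.

```python
from collections import defaultdict


def rank_pct(values):
    """Percentile ranks in (0, 1], with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / len(values)
        i = j + 1
    return ranks


def add_losses(rows):
    """Attach rank_loss and mae_loss, ranking within each (date, horizon) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["date"], row["horizon"])].append(row)
    out = []
    for group in groups.values():
        score_rank = rank_pct([r["score"] for r in group])
        er_rank = rank_pct([r["excess_return"] for r in group])
        for r, s, e in zip(group, score_rank, er_rank):
            out.append({**r,
                        "rank_loss": abs(e - s),
                        "mae_loss": abs(r["excess_return"] - r["score"])})
    return out


example = [
    {"date": "2023-01-03", "horizon": 20, "stable_id": "A", "score": 0.03, "excess_return": 0.10},
    {"date": "2023-01-03", "horizon": 20, "stable_id": "B", "score": 0.01, "excess_return": -0.05},
    {"date": "2023-01-03", "horizon": 20, "stable_id": "C", "score": -0.02, "excess_return": 0.02},
]
for row in add_losses(example):
    print(row["stable_id"], round(row["rank_loss"], 3), round(row["mae_loss"], 3))
```

Stock A is ranked first by both score and realized excess return, so its rank_loss is zero even though its MAE is the largest of the three. This is the gap between the two losses that the table above quantifies.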

**Output:** `data/residuals.duckdb` populated with both loss types, enriched with regime features

---

### 13.1 Train g(x) Error Predictor

g(x) predicts how wrong the primary model's RANKING will be at input x. Core of DEUP.

**Model:** LightGBM regression (target = rank_loss)

**Walk-forward for g(x):** Expanding window on residuals:
- For fold k (k >= 20): train g(x) on all residuals from folds 1..k-1
- Predict fold k's residuals -> g(x) predictions for ~89 folds
- Start from fold 20 (~35K residual rows) for sufficient training data
- Folds 1-19 get no g(x) prediction
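
**Sketch (illustrative):** The expanding-window scheme above, written with generic `fit` and `predict` callables so it does not depend on any particular library. The row layout (dicts with a `fold` key) and the function names are hypothetical. The sketch assumes fold boundaries already guarantee that residuals from earlier folds have matured before fold k is predicted.

```python
def expanding_folds(fold_ids, first_pred_fold=20):
    """Yield (k, train_folds) where train_folds are all folds strictly before k."""
    folds = sorted(set(fold_ids))
    for k in folds:
        if k < first_pred_fold:
            continue  # early folds get no g(x) prediction
        yield k, [f for f in folds if f < k]


def walk_forward_predict(rows, fit, predict, first_pred_fold=20):
    """fit(train_rows) -> model; predict(model, test_rows) -> list of floats."""
    by_fold = {}
    for r in rows:
        by_fold.setdefault(r["fold"], []).append(r)
    preds = []
    for k, train_folds in expanding_folds(by_fold, first_pred_fold):
        train_rows = [r for f in train_folds for r in by_fold[f]]
        model = fit(train_rows)  # never sees fold k
        test_rows = by_fold[k]
        for r, p in zip(test_rows, predict(model, test_rows)):
            preds.append({**r, "g_pred": p})
    return preds


# Toy usage: the "model" is just the mean rank_loss of the training rows.
rows = [{"fold": f, "rank_loss": 0.01 * f} for f in range(1, 23)]
mean_fit = lambda train: sum(r["rank_loss"] for r in train) / len(train)
mean_predict = lambda model, test: [model] * len(test)
for p in walk_forward_predict(rows, mean_fit, mean_predict):
    print(p["fold"], round(p["g_pred"], 4))
```

Keeping the split logic separate from the model makes the out-of-sample test in this section easy to assert: every training fold index is strictly smaller than the predicted fold.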

**g(x) features** (candidate set -- ablate to find the useful ones):

| Feature | Source | Rationale |
|---------|--------|-----------|
| `score` (LGB prediction) | eval_rows | Extreme predictions may carry higher ranking error |
| `abs_score` | derived | Magnitude -> confidence proxy |
| `vol_20d` | regime_context | Per-stock noise level |
| `vol_60d` | regime_context | Longer-term vol |
| `mom_1m` | regime_context | Model's top feature; may predict own ranking failures |
| `adv_20d` | eval_rows | Liquidity (low liquidity -> harder to rank) |
| `vix_percentile_252d` | regime_context | Strongest epistemic predictor (rho = -0.21 with RankIC) |
| `market_regime` | regime_context | Bull/neutral/bear (encoded) |
| `market_vol_21d` | regime_context | Market-level noise |
| `market_return_21d` | regime_context | Recent market direction |
| `cross_sectional_rank` | derived | Stock's rank-normalized score within date |

**Recommended density feature** (theoretically justified by Kotelevskii et al., NeurIPS 2022):
- `knn_distance_5`: Euclidean distance to 5th-nearest training neighbor in PCA-reduced (~5 dims) feature space
- The NUQ paper (Kotelevskii et al., 2022) shows epistemic uncertainty is proportional to sigma^2(x)/p(x) -- label variance divided by data density. In regions where p(x) is low (few nearby training examples), epistemic uncertainty is high regardless of model confidence.
- This is the only feature that captures "how unusual is this prediction context compared to training data?" -- the other features capture market conditions and stock characteristics, not distributional novelty.
- Implementation: PCA-reduce g(x) feature space to ~5 dims, use sklearn NearestNeighbors on training set, compute distance to 5th neighbor for each prediction. ~2 hours to implement.
- Can be added as a g(x) feature improvement after initial 13.1 run and compared via ablation.

**Hyperparameters:** Simpler than primary model (avoid overfitting the error predictor):
- `n_estimators=50`, `max_depth=3`, `num_leaves=8`, `min_child_samples=50`

**Secondary g(x):** Also train on MAE target with same features for robustness comparison.

**Output:** g(x) predictions for each (date, ticker, horizon) from fold 20 onwards

**Tests:**
- g(x) is trained out-of-sample (no fold k residuals used to predict fold k)
- g(x) > 0 for >99% of predictions (rank_loss is non-negative)
- g(x) correlates with realized rank_loss (Spearman rho > 0.05 on held-out folds)

---

### 13.2 Aleatoric Baseline a(x)

a(x) estimates irreducible ranking noise -- the floor of rank displacement that ANY model would experience at this point.

**Key insight from data analysis:** vol_20d does NOT predict rank_loss (rho = 0.054). So the standard DEUP approach of a(x) = scaled_vol is WRONG for rank-based loss. Ranking noise comes from a different source than return volatility: it comes from how LITTLE cross-sectional differentiation exists on a given date. When all AI stocks move together (low dispersion), ranking is meaningless -- that's high aleatoric noise. When stocks differentiate (high dispersion), ranking is easy -- that's low aleatoric noise.

**Critical direction rule: a(x) must be HIGH when ranking is HARD (low dispersion) and LOW when ranking is EASY (high dispersion).** All tier formulas below respect this inversion.

**Important subtlety:** Raw cross-sectional dispersion (IQR) has a known weakness: it measures "were outcomes spread out?" not "were outcomes PREDICTABLY spread out given available information?" A day where stocks randomly scatter +/-20% has high IQR but is hard to rank (noise dispersion). A day where one stock dominates and everything else is flat also has high IQR but is trivially easy to rank (signal dispersion). Raw IQR conflates the two. We address this with a tiered approach.

**Tier 0 -- Inverse IQR dispersion (baseline, implement first):**
```
a(date) = c / (IQR(excess_returns on date d) + eps)
```
where c is calibrated so median a(date) ~ median rank_loss, and eps prevents division by zero.
- High dispersion (easy to rank) -> low a(date). Low dispersion (hard to rank) -> high a(date).
- Fastest to implement (~30 min)
- Shared across all stocks on a date (per-date, not per-stock)
- Known weakness: doesn't condition on available information
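
**Sketch (illustrative):** One way to compute the Tier 0 baseline with the standard library. The input layout (`excess_by_date` mapping each date to its list of excess returns), the `eps` value, and calibrating `c` against a single overall median rank_loss are assumptions for the example. If the excess returns are forward labels, the result is only available once those labels mature.

```python
import statistics


def iqr(values):
    """Interquartile range using inclusive quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def tier0_aleatoric(excess_by_date, median_rank_loss, eps=1e-4):
    """a(date) = c / (IQR(excess returns on date) + eps), with c chosen so that
    the median of a(date) equals median_rank_loss."""
    raw = {d: 1.0 / (iqr(v) + eps) for d, v in excess_by_date.items()}
    c = median_rank_loss / statistics.median(raw.values())
    return {d: c * x for d, x in raw.items()}


excess_by_date = {
    "wide": [-0.10, -0.05, 0.0, 0.05, 0.10],    # high dispersion: easy to rank
    "tight": [-0.02, -0.01, 0.0, 0.01, 0.02],   # low dispersion: hard to rank
}
a = tier0_aleatoric(excess_by_date, median_rank_loss=0.3)
print({d: round(v, 3) for d, v in a.items()})
```

The low-dispersion date receives the larger a(date), which is the inversion required by the direction rule above.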

**Tier 1 -- Inverse factor-residual dispersion (upgrade if Tier 0 fails alignment diagnostic):**
```
For each date t:
  1. Regress cross-sectional excess returns on sector dummies + market_return_21d + mom_1m
     (simple factors any reasonable model could capture)
  2. Take residuals = excess_return - predicted
  3. a(date) = c / (IQR(residuals on date t) + eps)
```
- High residual dispersion = stocks differentiate BEYOND factors = easier to rank = low a(date)
- Low residual dispersion = pure noise after factors = harder to rank = high a(date)
- Measures "unexplained cross-sectional mess" after removing trivial structure
- Still per-date, still fast
- Paper-defensible: "We approximate irreducible ranking noise as the inverse of cross-sectional dispersion of factor-residual returns"
- Add ~2-3 hours implementation time

**Tier 2 -- Heteroscedastic per-stock noise (implement if you want the strongest result, ~3-4h):**
```
Train LGB quantile regression (Q25, Q75) of per-stock rank displacement
on [vol_20d, adv_20d, sector, cross_sectional_dispersion]. Walk-forward fitted.
a(stock, date) = predicted IQR = Q75(rank_loss | features) - Q25(rank_loss | features)
```
- Per-stock aleatoric noise: some stocks are inherently harder to rank (noisy small-caps vs stable large-caps)
- Connects directly to the heteroscedastic likelihood literature (learning sigma^2(x) that varies with inputs)
- Unlike heteroscedastic NNs that learn sigma^2 via negative log-likelihood, this uses tree-compatible quantile regression
- Walk-forward fitting required for PIT safety
- No inversion needed: directly predicts per-stock rank_loss IQR (high predicted IQR = hard to rank = high a)
- This is the first tier that produces per-stock (not just per-date) aleatoric estimates

**Tier 3 -- EXCLUDED (posterior-predictive simulation Bayes risk):**
Intentionally not implemented. Uses the same features as Tier 2 (vol, adv, sector, VIX, mom). If Tier 2's quantile regression cannot predict 20d rank_loss IQR from those features (rho = 0.024), a simulation wrapper won't create signal that isn't there. The bottleneck is information content, not model complexity.

**Prospective empirical (PIT-safe, deployment-ready):**
```
a(date) = P10(rank_loss over trailing 60 trading days, strictly before current date)
```
- Uses only historical information -- no same-date leakage
- Works when ranking difficulty is persistent (autocorrelation > 0.5)
- RESULTS: PASS at 90d (rho=0.516), MARGINAL at 20d (rho=0.192)
- 20d autocorrelation at lag-20 is only 0.061 -- difficulty changes too fast
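
**Sketch (illustrative):** A trailing-window P10 that never reads the current date. The `lag` parameter is an addition for the example: with `lag=0` it matches the formula as written (strictly before the current date), while a lag equal to the horizon would restrict the window to rank_losses whose labels have already matured. Whether that extra lag is needed depends on how rank_loss is timestamped, which is an assumption here. `min_days` is also hypothetical.

```python
def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def prospective_a(dates, loss_by_date, window=60, lag=0, min_days=20):
    """dates: chronologically sorted trading dates.
    loss_by_date: date -> list of per-stock rank_loss values."""
    out = {}
    for i, d in enumerate(dates):
        end = i - lag                      # exclusive bound: date i is never used
        start = max(0, end - window)
        past = dates[start:end] if end > 0 else []
        if len(past) < min_days:
            out[d] = None                  # not enough history yet
            continue
        pooled = [x for p in past for x in loss_by_date[p]]
        out[d] = percentile(pooled, 10)
    return out


dates = list(range(5))
loss_by_date = {i: [float(i)] for i in dates}
print(prospective_a(dates, loss_by_date, window=2, min_days=2))
```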

**Same-date empirical (last resort, retrospective only):**
```
a(date) = P10(rank_loss on date d)
```
- High correlation but partially circular (uses same-date information)
- Not deployment-ready: requires knowing today's rank_losses
- Used only at 20d where no prospective approach works

**RESULTS (actual tier selection):**
- 20d: empirical same-date (rho=0.527) -- retrospective only, no deployable a(x) exists
- 60d: Tier 2 (rho=0.317) -- per-stock, walk-forward, deployable
- 90d: prospective P10 (rho=0.516) -- PIT-safe, deployable

**Critical insight: a(x) precision matters less than expected for per-date tiers.**
When a(x) is per-date, the per-stock RANKING of e_hat(x) = g(x) - a(date) is determined entirely by g(x). The real test is Diagnostic A/D in 13.4.

**a(x) alignment diagnostic (MUST PASS before proceeding to 13.3):**
After computing a(x) at any tier, run this validation:
1. Bin dates into quintiles by a(date)
2. For EACH quintile, compute median rank_loss for ALL models (LGB, Rank Avg 2, Learned Stacking)
3. Expected: monotonically increasing rank_loss as a(date) increases
4. Compute: rho(a(date), median rank_loss across dates)
   - Target: rho > 0.3 (a captures meaningful variation in ranking difficulty)
   - Kill: rho < 0.1 (a is not measuring ranking difficulty at all -> upgrade tier)
5. Additional: compute g(date) - a(date) averaged per date, plot against realized daily RankIC
   - e_hat should spike when MODEL fails more than baseline difficulty
   - If e_hat spikes when a(date) is high AND rank_loss is high for ALL models, that's a(x) being too low (underestimating irreducible noise), not epistemic failure
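
**Sketch (illustrative):** Steps 1-4 of the diagnostic for a single model, using the standard library only. The input layout (two dicts keyed by date) and the "marginal" label for results between the target and kill thresholds are assumptions. The function needs at least `n_bins` dates and non-constant inputs, and it would be run once per model (LGB, Rank Avg 2, Learned Stacking).

```python
import statistics


def average_ranks(values):
    """1-based ranks with ties sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def alignment_diagnostic(a_by_date, median_loss_by_date, n_bins=5):
    dates = sorted(a_by_date, key=a_by_date.get)       # ascending a(date)
    a = [a_by_date[d] for d in dates]
    loss = [median_loss_by_date[d] for d in dates]
    rho = spearman(a, loss)
    n = len(dates)
    bins = [loss[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]
    bin_medians = [statistics.median(b) for b in bins]
    monotonic = all(x <= y for x, y in zip(bin_medians, bin_medians[1:]))
    if rho > 0.3 and monotonic:
        verdict = "pass"
    elif rho < 0.1:
        verdict = "kill"        # a is not measuring ranking difficulty
    else:
        verdict = "marginal"
    return {"rho": rho, "bin_medians": bin_medians,
            "monotonic": monotonic, "verdict": verdict}


a_by_date = {i: i / 10 for i in range(10)}
loss_by_date = {i: 0.2 + 0.01 * i for i in range(10)}
print(alignment_diagnostic(a_by_date, loss_by_date))
```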

**Per-date vs per-stock note:**
- For Tier 0, Tier 1, and the two empirical variants, a(x) is per-DATE, not per-stock (all ~100 stocks share the same a(date))
- In those cases, all per-stock variation in e_hat comes from g(x), which IS per-stock
- This is conceptually correct: aleatoric ranking noise is a property of the market on that day; epistemic uncertainty is a property of the model's knowledge about that specific stock
- Exception: Tier 2 predicts a per-stock a(stock, date) directly (the predicted rank_loss IQR), so at horizons that use Tier 2 both g(x) and a(x) contribute per-stock variation to e_hat

**Output:** a(x) predictions for all evaluation rows, alignment diagnostic report

**Tests:**
- a(x) > 0 for all rows
- a(x) varies meaningfully across dates (not constant)
- Alignment diagnostic passes (rho > 0.3, monotonic quintiles)
- a(x) < g(x) for at least some non-trivial fraction of rows

---

### 13.3 Epistemic Signal e_hat(x) = max(0, g(x) - a(x))  ✅ COMPLETE

**Definition:** ê(x) = max(0, g(x) − a(x))

**Merge logic (tier-aware):**
- 20d (per-date a): join on `as_of_date` → broadcast to all stocks → per-stock ê ordering = g(x) ordering
- 60d (per-stock Tier 2): join on `(as_of_date, ticker)` → per-stock a(x) creates selective zero/positive split
- 90d (per-date prospective): join on `as_of_date` → broadcast to all stocks
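
**Sketch (illustrative):** The tier-aware merge reduces to choosing the join key for a(x) before applying the clip. The row layout and the two lookup dicts are hypothetical; rows with no matching baseline are skipped rather than filled.

```python
def compute_ehat(g_rows, a_per_date=None, a_per_stock=None):
    """g_rows: dicts with as_of_date, ticker, g.
    Pass a_per_date (date -> a) for per-date tiers, or
    a_per_stock ((date, ticker) -> a) for the per-stock tier."""
    out = []
    for r in g_rows:
        if a_per_stock is not None:
            a = a_per_stock.get((r["as_of_date"], r["ticker"]))
        else:
            a = a_per_date.get(r["as_of_date"])  # broadcast to every stock on the date
        if a is None:
            continue  # no baseline for this row: skip instead of guessing
        out.append({**r, "a": a, "ehat": max(0.0, r["g"] - a)})
    return out


g_rows = [
    {"as_of_date": "2024-01-02", "ticker": "AAA", "g": 0.30},
    {"as_of_date": "2024-01-02", "ticker": "BBB", "g": 0.25},
]
per_date = compute_ehat(g_rows, a_per_date={"2024-01-02": 0.10})
per_stock = compute_ehat(g_rows, a_per_stock={("2024-01-02", "AAA"): 0.35,
                                              ("2024-01-02", "BBB"): 0.20})
print([round(r["ehat"], 2) for r in per_date])
print([round(r["ehat"], 2) for r in per_stock])
```

With a shared per-date a, subtracting a constant preserves the g(x) ordering wherever the clip is inactive. With a per-stock a, some rows are clipped to zero while others stay positive, which is the selective split described for 60d.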

**Deployment labels:**
- 20d: `retrospective_decomposition` (same-date a(x) not available at prediction time)
- 60d/90d: `deployment_ready` (walk-forward / trailing data only)

**Results:**
| Horizon | ê mean | % zero | % positive | ρ(ê, rl) | Selective Δ |
|---------|--------|--------|------------|----------|-------------|
| 20d | 0.265 | 0.0% | 100% | 0.144 | +0.123 |
| 60d | 0.004 | 85.2% | 14.8% | 0.106 | +0.012 |
| 90d | 0.254 | 0.0% | 100% | 0.146 | +0.112 |

**Critical finding — perfect per-stock quintile monotonicity (ρ = 1.0) at 20d and 90d in both DEV and FINAL.** FINAL holdout shows stronger separation:
- 20d: Q5/Q1 RL ratio 1.69 (FINAL) vs 1.51 (DEV)
- 90d: Q5/Q1 RL ratio 1.88 (FINAL) vs 1.53 (DEV)

**Holdout generalizes:** ρ(ê, rl) is HIGHER in FINAL than DEV at all horizons (20d: 0.192 vs 0.142, 60d: 0.140 vs 0.103, 90d: 0.248 vs 0.138).

**Daily negative ρ at 20d explained:** ê = g − a, and same-date a dominates date-level variation (ρ(ê,a)=−0.815). On easy days (low a), ê is high but rl is low → negative daily correlation. Per-stock ordering is unaffected.

**Sanity checks: 14/14 passed.** Output: `ehat_predictions.parquet` (495,585 rows).

---

### 13.4 Diagnostics (DEV and FINAL separately) -- CRITICAL SECTION  ✅ COMPLETE

Six diagnostics + stability. Results for DEV and FINAL holdout separately.

**Diagnostic A (Disentanglement): PASS at all horizons.** After residualizing on [vol_20d, vix, mom_1m], ê still predicts rank_loss (ρ = 0.11–0.24). ê is NOT repackaged vol or VIX. FINAL is stronger than DEV.

**Diagnostic B (Directional Confidence Stratification — not the target): N/A.** ê predicts error MAGNITUDE (rank displacement), not directional accuracy (RankIC). High-ê stocks actually have slightly higher RankIC because extreme-scored stocks are correct more often but have larger errors when wrong. The quintile monotonicity (13.3, ρ=1.0) is the correct test, not IC stratification. Renamed from "FAIL" to "N/A" because this tests the wrong property.

**Diagnostic C (AUROC): 60d daily AUROC = 0.611 (PASS).** Stock-level AUROC for high-loss: 0.56–0.61. 20d daily AUROC = 0.33 (same-date a(x) artifact inverts date-level signal).

**Diagnostic D (2024 Test): WEAK.** ê is a per-stock error predictor, not a market-level regime detector. The 2024 collapse affects all stocks uniformly. Date-level throttling needs separate infrastructure.

**Diagnostic E (Baselines): ê/g(x) dominate by 3–10×.** ρ with rank_loss: ê = 0.14–0.19, g(x) = 0.16–0.22, vol_20d = 0.01–0.05, VIX = −0.02–0.05. Vol and VIX collapse in FINAL holdout; DEUP does not.

**Diagnostic F (Features): Per-prediction features dominate (40–56%).** cross_sectional_rank is #1 at all horizons. VIX/regime contribute 12–19%. g(x) is NOT an expensive regime detector.

**Stability: PASS.** All conditions positive (pre/post-2020, low/high VIX, low/high vol stocks). Min ρ = 0.031, max ρ = 0.205.

**Full portfolio-level baseline comparison (8-way including sub-model disagreement, seed ensemble, trailing RankIC, conformal width) deferred to 13.6 where sizing formula is applied.**

---

### 13.4b Expert Health H(t) — Per-Date Regime Throttle  ✅ COMPLETE

**Motivation:** Diagnostic D shows ê(x) is per-stock, not per-date. The 2024 collapse is uniform across stocks. Need a date-level throttle.

**Two complementary signals:**
- `ê(x)` = cross-sectional position-level risk control (which names are dangerous today)
- `H(t)` = expert-level regime control (whether the expert is usable today)

**Three PIT-safe signals:**
1. **H_realized:** Trailing EWMA of matured daily RankIC (lagged by horizon). Only component with genuine predictive power (ρ = 0.065–0.210).
2. **H_drift:** Feature drift + score drift + correlation spike vs trailing reference. Real-time (no lag). Mixed predictive value.
3. **H_disagree:** Cross-expert ranking disagreement (LGB vs Rank Avg 2 vs Learned Stacking). Noisy but theoretically motivated.

**Combination:** `H(t) = sigmoid(z_real − 0.3·z_drift − 0.3·z_disagree)`. G(t) = exposure multiplier ∈ [0, 1].
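
**Sketch (illustrative):** The combination formula, plus one possible form of the lagged EWMA behind H_realized. The `halflife=20` value and the exact lag convention are assumptions, as is treating the z-scores as already computed. The mapping from H(t) to the exposure multiplier G(t) is not specified in this section and is deliberately left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def expert_health(z_real, z_drift, z_disagree, w_drift=0.3, w_disagree=0.3):
    """H(t) = sigmoid(z_real - 0.3 * z_drift - 0.3 * z_disagree)."""
    return sigmoid(z_real - w_drift * z_drift - w_disagree * z_disagree)


def lagged_ewma(daily_rankic, horizon, halflife=20):
    """EWMA of RankIC that only uses values whose labels have matured.
    daily_rankic[j] is the RankIC of predictions made on day j, assumed to be
    observable from day j + horizon onward."""
    decay = 0.5 ** (1.0 / halflife)
    out, level = [], None
    for i in range(len(daily_rankic)):
        j = i - horizon                  # latest prediction date that has matured
        if j >= 0:
            x = daily_rankic[j]
            level = x if level is None else decay * level + (1.0 - decay) * x
        out.append(level)
    return out


print(round(expert_health(0.0, 0.0, 0.0), 3))    # neutral inputs
print(round(expert_health(-1.0, 1.0, 1.0), 3))   # weak realized IC, high drift and disagreement
print(lagged_ewma([0.1, 0.1, 0.1, 0.1, 0.1], horizon=2))
```

The horizon lag is what makes the realized component PIT-safe, and it is also the reason the first month of a regime break cannot be avoided.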

**Key Results:**
- **20d Crisis Throttle (THE CENTRAL VALUE PROP):** H drops from 0.50 (Mar) → 0.17 (Jun). G drops to 0.005 by May, 0.000 by June. System in full abstention mode for Apr–Jul 2024. March losses not avoided (20d lag), but April–July damage prevented.
- **90d best overall:** ρ(H, RankIC) = 0.180, AUROC = 0.60 for predicting bad days.
- **Regime separation works:** DEV mean_G = 0.37 vs FINAL mean_G = 0.15.
- **Day-level prediction weak:** ρ ≈ 0 within-period. H(t) is regime-level, not day-level.
- **H_realized dominates.** Combined AUROC barely above realized-only.

**Honest limitations:** (1) Cannot avoid first month of crisis (lag). (2) Drift/disagreement have limited marginal value. (3) 60d H is inverted due to lag.

**Tests:** 18 passing (PIT safety, leakage alignment, date indexing, monotonicity). **Total Chapter 13: 116 tests.**

**Files:** `src/uncertainty/expert_health.py`, `tests/test_expert_health.py`, orchestrator `--step 5`, outputs `expert_health_lgb_{20,60,90}d.parquet`.

---

### 13.5 Conformal Intervals (rolling, DEUP-normalized)  ✅ COMPLETE

Three nonconformity-score variants: raw, vol-normalized, DEUP-normalized. Rolling 60-day calibration, 90% nominal coverage.
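
**Sketch (illustrative):** A minimal rolling split-conformal interval with an optional normalizer, covering the three variants through `scale_key` (`None` for raw, or a hypothetical column name such as `"vol_20d"` or `"ehat"`). The row layout, the integer `day` index, and `eps` are assumptions, and the sketch assumes every calibration row's label is already known when it is used. The quadratic loop is for readability, not speed.

```python
import math


def conformal_quantile(scores, alpha=0.10):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")   # too few calibration points for this coverage level
    return sorted(scores)[k - 1]


def rolling_conformal(rows, window=60, alpha=0.10, scale_key=None, eps=1e-6):
    """rows: dicts with day (int trading-day index), y, y_hat and, if
    scale_key is set, a non-negative scale column."""
    def scale(r):
        return 1.0 if scale_key is None else r[scale_key] + eps

    out = []
    for r in rows:
        # Calibrate only on rows from the trailing window, strictly before r's day.
        calib = [abs(c["y"] - c["y_hat"]) / scale(c)
                 for c in rows if r["day"] - window <= c["day"] < r["day"]]
        if not calib:
            continue
        half_width = conformal_quantile(calib, alpha) * scale(r)
        out.append({**r, "lo": r["y_hat"] - half_width, "hi": r["y_hat"] + half_width})
    return out


rows = [{"day": i, "y": 0.01 * (i + 1), "y_hat": 0.0} for i in range(19)]
rows.append({"day": 19, "y": 0.0, "y_hat": 0.0})
last = rolling_conformal(rows)[-1]
print(last["day"], round(last["lo"], 3), round(last["hi"], 3))
```

Dividing the calibration residuals by a per-row scale and multiplying back at prediction time is what lets interval width vary across stocks. A scale near zero inflates the normalized scores, which is consistent with the 60d caveat in the results for this section.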

**Key Results:**
- **All variants achieve ~90% marginal coverage** (ECE < 0.01). Conformal guarantees hold.
- **DEUP conditional coverage spread: 0.8%** vs raw 20.2% vs vol 5.9% at 20d. **25× improvement.** Raw conformal over-covers easy stocks (98%) and under-covers hard stocks (78%); DEUP equalizes to ~90% for both. This is the Plassier et al. validation.
- **DEUP intervals are narrower** than raw (0.647 vs 0.675 at 20d) — more efficient, not wider.
- **Width ratio:** DEUP 1.57× wider for high-ê vs low-ê stocks — meaningful differentiation.
- **60d special case:** 85% zero ê → impractically wide intervals for zero-ê stocks. DEUP conformal at 60d is meaningful only for the 15% with positive ê.
- **FINAL holdout:** DEUP maintains best coverage at 20d (0.898, ECE 0.002). 90d degrades slightly (0.87).

**Cite:** "DEUP-normalized conformal prediction approximates conditional validity by scaling nonconformity scores by predicted epistemic uncertainty, following the motivation of Plassier et al. (2025)."

**Tests:** 21 passing (PIT safety, coverage, widths, conditional coverage, sparse ê, edge cases). **Total Chapter 13: 137 tests.**

**Files:** `src/uncertainty/conformal_intervals.py`, `tests/test_conformal_intervals.py`, orchestrator `--step 6`, outputs `conformal_predictions.parquet`.

---

### 13.6 DEUP-Sized Shadow Portfolio + Global Regime Evaluation  ✅ COMPLETE

Economic test: does uncertainty-informed position sizing beat alternatives? Plus regime-level conclusion: does G(t) determine expert usability?

**Sizing Variants (c calibrated on DEV only, median w ≈ 0.7):**
- A) Vol-sized: `sized_score = score × min(1, c_vol / sqrt(vol_20d + ε))`
- B) DEUP-sized: `sized_score = score × min(1, c_deup / sqrt(unc + ε))` (20d/90d: g(x), 60d: ê(x))
- C) Health-only: `return × G(t)` (uniform date-level throttle)
- D) Combined: `DEUP-sized return × G(t)`
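
**Sketch (illustrative):** Variants A and B share one sizing function and differ only in the uncertainty input. The calibration rule below (scale `c` so the median DEV weight hits the target) is one assumed way to obtain median w ≈ 0.7; the `eps` value is a placeholder. Variants C and D then multiply the resulting return by G(t).

```python
import math
import statistics


def calibrate_c(dev_unc, target_median_w=0.7, eps=1e-8):
    """Choose c on DEV data so that median(c / sqrt(unc + eps)) is about the target."""
    return target_median_w * statistics.median(math.sqrt(u + eps) for u in dev_unc)


def size_weight(unc, c, eps=1e-8):
    """w = min(1, c / sqrt(unc + eps)); never levers a score above its raw value."""
    return min(1.0, c / math.sqrt(unc + eps))


def sized_score(score, unc, c, eps=1e-8):
    return score * size_weight(unc, c, eps)


dev_unc = [0.01, 0.04, 0.09]           # e.g. vol_20d or g(x) on DEV rows
c = calibrate_c(dev_unc)
weights = [size_weight(u, c) for u in dev_unc]
print(round(c, 4), [round(w, 3) for w in weights])
print(round(sized_score(0.05, 0.04, c), 4))
```

Because the weight is capped at 1, low-uncertainty names keep their full score and only high-uncertainty names are scaled down.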

**Key Portfolio Results (20d):**
- **Vol-sized (A) wins:** FINAL Sharpe 1.68 vs 1.37 (raw) vs 1.35 (DEUP).
- **DEUP per-stock sizing ≈ raw:** g(x) penalizes extreme scores — the model's strongest signals.
- **Health throttle:** Crisis MaxDD −17.5% vs −44.1% for raw (60% reduction).
- **Optimal: G(t) as binary abstention gate, not continuous scaling.**

**Regime-Trust Classifier (THE REGIME-LEVEL ANSWER):**
- **H(t) AUROC = 0.721** (FINAL: 0.750). VIX: 0.449 (worse than random!).
- **Bucket monotonicity ρ = 1.0:** Low-G → −0.011 IC, 51% bad. High-G → +0.153 IC, 12% bad.
- **Confusion matrix (G ≥ 0.2):** Precision 80%, Recall 64%, Abstention 47%.
- **One-liner for Q&A:** DEUP does not improve per-name sizing, but it decisively improves whether to deploy the model at all via regime trust.
- **Aggregated DEUP ≠ regime signal** (AUROC ≈ 0.50). Regime trust from realized efficacy only.

**Success criteria:** Regime trust AUROC > 0.65 (**PASS**: 0.72). Bucket monotonicity > 0.5 (**PASS**: 1.0). DEUP per-stock > vol (**FAIL** — honest finding).

**Tests:** 17 passing. **Total Chapter 13: 154 tests.**

**Files:** `src/uncertainty/deup_portfolio.py`, `tests/test_deup_portfolio.py`, orchestrator `--step 7`, outputs `chapter13_6_*.json/parquet`.

---

### 13.7 DEUP on Rank Avg 2 (optional but recommended)

Rank Avg 2 held up better in holdout (20d FINAL RankIC: 0.031 vs 0.010).
- Repeat 13.1-13.4 with Rank Avg 2 eval_rows (already populated in 13.0)
- Compare e_hat properties: does a more robust base model produce different e_hat?
- Compare portfolio: DEUP-sizing on Rank Avg 2 vs DEUP-sizing on LGB
- Decision gate: if Rank Avg 2 + DEUP beats LGB + DEUP on FINAL, adopt as primary

---

### 13.8 Freeze & Documentation

- Replace disagreement proxy in `AIStockForecasterExpert.epistemic_uncertainty()` with real DEUP e_hat
- Implement `conformal_interval()` with rolling calibration
- Save frozen e_hat predictions to `evaluation_outputs/chapter13/`
- Save conformal interval calibration parameters
- Save g(x) models and feature importances
- Document all results in `documentation/CHAPTER_13.md`:
  - Loss definition analysis (why rank displacement, not MAE)
  - Bayes risk approximation: "The decomposition e_hat = g(x) - a(x) requires a defensible estimate of irreducible ranking noise a(x). We evaluated four approximations of increasing sophistication: raw cross-sectional dispersion, factor-residual dispersion, heteroscedastic per-stock quantile regression, and posterior-predictive simulation. [Tier X] was selected based on an alignment diagnostic confirming that a(x) monotonically predicts rank_loss across all models, consistent with it capturing market-level ranking difficulty rather than model-specific failure."
  - Variance decomposition framing (connect to literature): "The standard predictive variance decomposition Var[y|x] = E[sigma^2(x)] + Var[mu(x)] separates aleatoric noise from epistemic uncertainty in return space. We adapt this decomposition to ranking space: g(x) estimates total expected rank displacement (analogous to total predictive variance), a(x) estimates irreducible ranking noise (analogous to E[sigma^2]), and e_hat(x) = g(x) - a(x) captures excess model failure (analogous to Var[mu]). Unlike heteroscedastic neural networks that learn sigma^2(x) via negative log-likelihood, our aleatoric estimate is derived from cross-sectional return dispersion -- a market-observable quantity that requires no model fitting -- making it robust to misspecification of the noise model." This connects to Robinson et al., Wong et al., and evidential regression literature while making clear why our approach differs (ranking loss, not return prediction; market-observable aleatoric, not learned).
  - Full diagnostic tables (DEV + FINAL)
  - 8-way UQ baseline comparison table
  - Portfolio comparison (ALL + DEV + FINAL)
  - The 2024 regime test
  - g(x) feature importance analysis and interpretation
  - Kill criteria assessment
  - MC Dropout / deep ensemble exclusion justification
  - Honest discussion of what worked and what didn't
  - **Theoretical framing / citation block:**
    - "Our approach instantiates the pointwise risk decomposition of Kotelevskii & Panov (ICLR 2025), where Total Risk = Bayes Risk + Excess Risk. Since our base model (LightGBM) does not admit tractable Bayesian posterior inference, we follow the DEUP approach (Lahlou et al., 2023) to estimate excess risk directly from walk-forward residuals. The aleatoric component (Bayes risk) is approximated via cross-sectional return dispersion, which captures irreducible ranking difficulty -- the analog of -G(eta) in the proper scoring rule framework adapted to ranking loss."
    - Cite Kotelevskii & Panov, "From Risk to Uncertainty" (ICLR 2025) -- theoretical framework for risk decomposition
    - Cite Kotelevskii et al., "Nonparametric UQ for Single Deterministic NN" (NeurIPS 2022) -- density-based epistemic uncertainty, justifies knn_distance feature
    - Cite Plassier et al., "Probabilistic Conformal Prediction" (ICLR 2025) -- conditional coverage motivation
    - Cite Fishkov et al., "UQ for Regression using Proper Scoring Rules" (2025) -- extends framework to regression; our ranking problem sits between classification and regression
    - Frame thesis explicitly: "Our primary model is tree-based (LightGBM), which precludes standard Bayesian UQ methods (MC Dropout, variational inference). Our sub-models include neural networks (FinText/Chronos, FinBERT), but these produce near-zero standalone signal (Ch11), making their Bayesian uncertainty estimates uninformative for the primary ranking task. We therefore adopt the DEUP approach (Lahlou et al., 2023), which instantiates the pointwise risk decomposition of Kotelevskii & Panov (ICLR 2025) without requiring Bayesian posterior inference. We compare against sub-model disagreement -- an approximation of Expected Pairwise Bregman Divergence (EPBD) -- and LGB seed ensemble variance -- an approximation of Bregman Information -- as baselines (Diagnostic E). For Bayesian UQ via MC Dropout on the neural sub-models, see Chapter 17."
  - **What NOT to implement** (and why): Full Bayesian posterior estimation, energy-based uncertainty, and the 9 risk approximation variants from Kotelevskii & Panov are designed for neural network ensembles with multiple forward passes. LightGBM doesn't support this. The seed ensemble baseline (#7 in Diagnostic E) captures the one applicable variant.
- Update `ROADMAP.md` and `outline.ipynb`

---

### Success Criteria (adapted from UQ reference §12)

| Metric | Target | What It Proves |
|--------|--------|----------------|
| rho(e_hat, vol \| features) | ~ 0 | e_hat is NOT repackaged volatility |
| rho(e_hat, rank_loss \| features) | > 0 | e_hat captures real ranking-failure signal |
| Low-e_hat tercile RankIC | > full-set | Model knows when to trust itself |
| AUROC (e_hat predicts bad days) | > 0.60 | e_hat identifies regime failures |
| e_hat spikes in Mar-Jul 2024 | Yes | DEUP detects holdout regime shift |
| ECE (conformal) | < 0.05 | Intervals well-calibrated |
| Coverage (90% nominal) | 85-95% | Rolling conformal maintains validity |
| a(x) vs rank_loss correlation | rho > 0.3 | Aleatoric baseline captures real ranking difficulty |
| e_hat-sized Sharpe | > ALL baselines | Economic value beyond any alternative |

### Kill Criteria (from UQ reference §14.7)

| Condition | Interpretation | Action |
|--------|-----------|--------|
| rho(e_hat, vol \| features) > 0.5 | e_hat is just volatility | DEUP failed |
| Selective risk ~ full risk | e_hat has no power | DEUP failed |
| Coverage < 70% or > 99% | Conformal broken | Fix calibration |
| e_hat-sized <= vol AND <= trailing RankIC | No value vs simple methods | DEUP failed |

Even if DEUP disappoints, the infrastructure (residual archive, conformal intervals, expert interface, baseline comparison framework) is useful regardless. The 8-way comparison table is publishable independently of DEUP's success.

### Excluded methods (with thesis justification)
- **MC Dropout:** inapplicable to tree-based models (LightGBM has no dropout layers)
- **Deep ensembles (NN):** requires rebuilding the primary model as a neural network -- different model, not different UQ
- **BNNs:** same -- requires model rebuild
- **NGBoost:** different model family (natural gradient boosting outputs full distributions). Tests a different model, not a different UQ method on the same model. Potential Ch17 comparison if time allows.
- **cSGLB (Amazon cyclical MCMC):** interesting but complex; no public LightGBM integration available
- **Heteroscedastic likelihood / evidential regression:** requires neural network architectures with learned sigma^2(x) outputs. Our primary model is LightGBM (tree-based), which does not natively support heteroscedastic likelihood training. DEUP's advantage is model-agnosticism -- it produces uncertainty estimates for any base model given only held-out residuals. We compare against LGB quantile regression (Diagnostic E, baseline #6) as the closest tree-compatible probabilistic alternative. If a reviewer asks "why not heteroscedastic likelihood?", this is the answer.

### Existing infrastructure to reuse
- `src/models/residual_archive.py` -- ResidualArchive + AIStockForecasterExpert (Ch11.4)
- `src/models/fusion_scorer.py` -- `compute_disagreement()` for sub-model uncertainty (Ch11.4)
- `data/regime_context.parquet` -- 201K rows x 16 features (Ch12.4)
- `scripts/run_chapter12_heuristics.py` -- shadow portfolio / portfolio metrics code
- `src/evaluation/baselines.py` -- LightGBM training utilities
- All eval_rows (LGB 591K, Rank Avg 2 844K, Vol-Sized 591K) from prior chapters


## 14) Monitoring & Research Ops

- Prediction logging with timestamps
- Matured-label scoring
- Feature and performance drift detection

Alerts:
- RankIC decay
- Calibration breakdown
- Ranking instability

---

### 14.x Monitoring KPIs (Signal + Shadow Portfolio)

Once the Shadow Portfolio is defined (Chapter 11), add monitoring hooks for professional KPIs:
- RankIC drift (rolling)
- Cost survival drift
- Shadow portfolio Sharpe/IR drift
- Drawdown alarms (rolling windows)
- Turnover / cost drag spikes

## 15) Outputs & Interfaces

- Ranked stock lists
- Per-stock explanation summaries
- Batch scoring interface
- Full traceability of inputs and decisions

## 16) Global Research Acceptance Criteria

A model is considered **valid** if:

- Median walk-forward RankIC exceeds baseline by ≥ 0.02
- Net-of-cost performance positive in ≥ 70% of folds
- Top-10 ranking churn < 30% month-over-month
- Performance degrades gracefully under regime shifts
- No PIT or survivorship violations detected

**Institutional-grade add-on (evaluation-only):**
- Shadow portfolio sanity: Sharpe/IR is meaningfully > 0 and not driven by 1–2 months

(Report this via the Chapter 11 Shadow Portfolio mapping; do not optimize to it.)

---

### 16.1 Risk Attribution / Factor Decomposition (acceptance gate)

**Purpose:** Prove the signal isn't just repackaged factor exposure.

Chapter 12 answers *"when does the model fail?"* (regime slicing).
This section answers *"where does the alpha come from?"* (return attribution).
These are complementary but distinct — regime analysis decomposes performance
by market conditions; risk attribution decomposes returns by factor exposures.

**Existing infrastructure:**
- `src/features/neutralization.py` (Ch5): cross-sectional *feature*-level neutralization
  ("Is this score just a sector/beta proxy?")
- This section adds *portfolio*-level return attribution
  ("Are our portfolio returns just market/size/value/momentum exposures?")

**Implementation:**
1. Load Fama-French 5-factor daily returns (Ken French data library or `pandas-datareader`)
   - Factors: Mkt-RF, SMB, HML, RMW, CMA (+ RF for excess returns)
2. Align factor returns to the shadow portfolio return dates (non-overlapping monthly from Ch11/12), aggregating the daily factor returns to the same periods
3. OLS regression: `R_portfolio - RF = α + β₁·MktRF + β₂·SMB + β₃·HML + β₄·RMW + β₅·CMA + ε`
4. Report: alpha intercept, factor betas, t-statistics, R²
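
The regression in step 3 and the report in step 4 can be sketched with plain NumPy. This is an illustrative version under stated assumptions: inputs are already aligned arrays, the function name is hypothetical, and the standard errors are classical OLS ones (no heteroskedasticity or autocorrelation correction).

```python
import numpy as np

def factor_regression(excess_returns, factors):
    """OLS of portfolio excess returns on factor returns, with an intercept.

    excess_returns: shape (T,), i.e. R_portfolio - RF
    factors: shape (T, K), columns such as Mkt-RF, SMB, HML, RMW, CMA
    """
    y = np.asarray(excess_returns, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(factors, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = len(y) - X.shape[1]                      # observations minus parameters
    sigma2 = resid @ resid / dof                   # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)          # classical OLS covariance
    t_stats = coef / np.sqrt(np.diag(cov))         # t_stats[0] belongs to alpha
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return {"alpha": coef[0], "betas": coef[1:], "t_stats": t_stats, "r2": r2}
```

With only a few dozen monthly observations the t-statistic on alpha is sensitive to the error model, so a robust covariance estimator may be worth considering before applying the t-stat > 2 gate.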

**Acceptance criteria:**
- Alpha intercept is **positive and statistically significant** (t-stat > 2)
- Factor loadings documented (some momentum exposure is acceptable, but alpha must survive)
- R² documented (a low R² indicates that the factors explain little of the portfolio's return variance, which is consistent with idiosyncratic alpha)

**Scope:** ~1 day implementation. Run on LGB baseline + vol-sized shadow portfolio returns.


## 17) Bayesian UQ Extensions & Model Comparisons

**Goal:** Extend the DEUP-based UQ framework with Bayesian uncertainty estimation on the neural sub-models (FinText, FinBERT) and compare against the tree-based DEUP approach from Chapter 13. This chapter is the "what if we could do Bayesian inference?" comparison.

**Prerequisite:** Chapter 13 DEUP must be complete with all diagnostics.

**Context:** The Kotelevskii & Panov (ICLR 2025) risk decomposition framework offers 9 approximation variants for Bayesian risk estimation, most of which require multiple forward passes through a neural network. Chapter 13 uses DEUP because the primary model (LightGBM) doesn't support Bayesian inference. This chapter tests whether Bayesian approaches on the available NNs add value.

---

### 17.1 MC Dropout on FinBERT Sentiment

- Run FinBERT inference 10-20 times per stock per date with dropout enabled
- Compute variance of sentiment scores across passes = epistemic uncertainty of the sentiment model
- Evaluate: does FinBERT dropout variance predict LGB ranking failures?
- Compare: FinBERT dropout variance vs DEUP e_hat as sizing signal
- Cost: significant (up to 20x inference per fold x 109 folds), but FinBERT is small enough to keep this tractable
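
The pass-and-aggregate step can be sketched independently of the model. In the example below, `make_toy_dropout_model` is a hypothetical NumPy stand-in for a network with dropout left active at inference; it is not FinBERT and only serves to make the sketch runnable. `mc_dropout_stats` is likewise an illustrative name.

```python
import numpy as np

def mc_dropout_stats(stochastic_predict, inputs, n_passes=20):
    """Run a stochastic predictor n_passes times; return per-item mean and variance."""
    samples = np.stack([stochastic_predict(inputs) for _ in range(n_passes)])
    # Variance across passes is the epistemic-uncertainty proxy described above.
    return samples.mean(axis=0), samples.var(axis=0)

def make_toy_dropout_model(weights, drop_prob=0.1, seed=0):
    """Hypothetical stand-in for a model whose dropout stays on at inference time."""
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)

    def predict(x):
        # Inverted dropout: zero inputs at random and rescale the survivors.
        mask = rng.random(x.shape) >= drop_prob
        return (x * mask / (1.0 - drop_prob)) @ w

    return predict
```

For example, `mc_dropout_stats(make_toy_dropout_model([0.5, -0.2, 0.1], drop_prob=0.2), np.ones((4, 3)))` returns one mean and one variance per row. With a real network, the predictor would wrap a forward pass in which only the dropout layers remain stochastic.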

### 17.2 MC Dropout on FinText/Chronos

- Run Chronos inference with dropout enabled (multiple forward passes)
- Compute distribution of return forecasts per stock per date
- Variance across passes = epistemic uncertainty of the time series model
- Cost: expensive (Chronos is 46M params), may need to subsample folds

### 17.3 Bayesian Risk Estimates (Kotelevskii & Panov Framework)

Use the MC Dropout samples to compute the formal Bayesian risk decomposition:
- R_Bayes (aleatoric) via ensemble averaging of per-pass predictions
- R_Exc (epistemic) via Bregman divergence between ensemble mean and individual passes
- Compare against DEUP's simpler g(x) - a(x) decomposition
- Frame: "When Bayesian inference IS tractable (NN sub-models), does it outperform DEUP?"

### 17.4 Seed Ensemble as Bregman Information

- The 5-LGB-seed ensemble from Ch13 Diagnostic E baseline #7 already enables this
- Frame explicitly as R_Exc^(1,2) = Bregman Information, which under squared-error loss equals the variance of predictions across seed models (Kotelevskii & Panov, 2025)
- Compare: seed ensemble Bregman Information vs DEUP e_hat vs MC Dropout variance
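
The link between Bregman Information and variance can be checked numerically for the squared-error case. The sketch below assumes the generator phi(x) = x², for which the Bregman divergence reduces to a squared difference; the function name is hypothetical, and other losses would need a different generator.

```python
import numpy as np

def bregman_information_sq(preds):
    """Mean Bregman divergence from each member to the ensemble mean, phi(x) = x**2.

    preds: shape (n_members, n_stocks), e.g. one row per LGB seed.
    """
    p = np.asarray(preds, dtype=float)
    mean = p.mean(axis=0)
    # D_phi(a, b) = phi(a) - phi(b) - phi'(b) * (a - b), with phi(x) = x**2
    divergence = p**2 - mean**2 - 2.0 * mean * (p - mean)
    return divergence.mean(axis=0)
```

Algebraically each term collapses to (a - b)², so the result matches `preds.var(axis=0)` (population variance, `ddof=0`) up to floating-point error.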

### 17.5 NGBoost Comparison (optional)

- Natural Gradient Boosting outputs full probability distributions (not just point estimates)
- A different model family, not just a different UQ method on the same model
- Tests whether a natively probabilistic tree model outperforms DEUP on a deterministic tree model
- ~4 hours implementation

### 17.6 Comparison Table & Documentation

Produce the definitive UQ comparison table:

| Method | Model | Type | Sharpe | Max DD | AUROC |
|--------|-------|------|--------|--------|-------|
| DEUP e_hat (Ch13) | LGB | Excess risk | ? | ? | ? |
| Vol-sizing (Ch12) | Any | Heuristic | 2.73 | -18.1% | - |
| Sub-model disagreement (EPBD) | LGB+FinText+Sent | Ensemble | ? | ? | ? |
| Seed ensemble (Bregman Info) | LGB x5 | Ensemble | ? | ? | ? |
| MC Dropout FinBERT | FinBERT | Bayesian | ? | ? | ? |
| MC Dropout Chronos | Chronos | Bayesian | ? | ? | ? |
| NGBoost | NGBoost | Probabilistic | ? | ? | ? |
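
For the AUROC column, one possible computation is sketched below. It assumes AUROC measures how well an uncertainty score separates ranking failures from non-failures, which is an interpretation of the table rather than something stated explicitly; how a "failure" is labeled is left to the caller, and the function name is hypothetical.

```python
import numpy as np

def auroc(uncertainty, is_failure):
    """Probability that a random failure has higher uncertainty than a random non-failure."""
    u = np.asarray(uncertainty, dtype=float)
    y = np.asarray(is_failure, dtype=bool)
    pos, neg = u[y], u[~y]
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # undefined when only one class is present
    # Pairwise comparison; ties count as half a win (Mann-Whitney U statistic).
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))
```

The pairwise matrix uses memory proportional to failures times non-failures, so for several hundred thousand eval rows a rank-based formulation or per-fold evaluation would be more practical.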

**Thesis framing:** "We compare three paradigms of uncertainty quantification: (1) DEUP-based excess risk estimation for tree models, (2) ensemble disagreement as EPBD/Bregman Information approximations, and (3) Bayesian posterior sampling via MC Dropout on neural sub-models. This provides the first comprehensive comparison of modern UQ methods on a real financial ranking system with point-in-time safe evaluation."

**Note:** The primary finding may be that DEUP on a strong base model (LGB) outperforms Bayesian UQ on weak sub-models (FinText, Sentiment), which is itself an interesting result -- model strength matters more than UQ sophistication.

### Estimated effort
- 17.1 MC Dropout FinBERT: 4-6 hours (inference + evaluation)
- 17.2 MC Dropout Chronos: 6-8 hours (expensive inference)
- 17.3 Bayesian risk estimates: 2-3 hours (computation on existing samples)
- 17.4 Seed ensemble framing: 1 hour (already computed in Ch13)
- 17.5 NGBoost: 4 hours (new model family)
- 17.6 Documentation: 2-3 hours
- **Total: ~19-25 hours**
