# AI Stock Forecaster  
**(FMP + Kronos + FinText-TSFM | Signal-Only, Point-in-Time Safe)**

---

## 0) Project Explanation & Philosophy

### What this project is

This project builds a **decision-support forecasting model** that answers one core question:

> **Which AI stocks are most attractive to buy today, on a risk-adjusted basis, over the next 20 / 60 / 90 trading days?**

The system outputs **ranked stock recommendations and return distributions**, not trades.  
Its purpose is to generate **credible alpha signals** that survive realistic financial constraints.

The design explicitly accounts for:
- non-stationary market behavior,
- weak and noisy financial signals,
- transaction costs and liquidity effects,
- and strict point-in-time (PIT) correctness.

---

### What this project is NOT

This project does **not**:
- place trades,
- connect to brokers,
- optimize execution,
- or manage live capital.

Any portfolio-related logic exists **only to validate signal realism**, not to implement trading.

---

### Core modeling philosophy

1. **Ranking beats regression**  
   Relative ordering of stocks is more stable and economically useful than exact price prediction.

2. **Point-in-time correctness is non-negotiable**  
   Any signal unavailable at time *T* must not influence predictions at time *T*.

3. **Economic validity > statistical fit**  
   Signals must survive transaction costs, turnover, and regime shifts.

4. **Multiple weak signals > single strong model**  
   Combine complementary views:
   - price dynamics (Kronos),
   - return structure (FinText-TSFM),
   - fundamentals and context (tabular models).

---

## 1) System Outputs (Signal-Only)

At each rebalance date **T**, for each stock and horizon (20 / 60 / 90 trading days):

### Per-stock outputs
- **Expected excess return** vs benchmark (QQQ default; XLK/SMH optional)
- **Return distribution** (5th / 50th / 95th percentiles)
- **Alpha ranking score** (cross-sectional)
- **Confidence score** (calibrated uncertainty)
- **Key drivers** (feature blocks influencing the rank)

### Cross-sectional outputs
- Ranked list: **Top buys / neutral / avoid**
- Optional confidence buckets (high vs low confidence)

---

## 2) Scope & Validation Philosophy (Signal-Only)

### Scope
- The system produces **signals**, not trades.
- No execution or order placement logic is implemented.

### Why portfolio concepts still appear
Portfolio concepts (turnover, costs, constraints) are used **only for evaluation realism**, to answer:
> *Would these signals remain economically meaningful if followed by an investor?*

### Optional realism check
- Paper trading (e.g., Alpaca paper) may be used **post-hoc** to validate:
  - timestamp integrity,
  - universe construction,
  - signal stability.
- Paper trading results are **never** used for training or model selection.

---

## 3) Data & Point-in-Time Infrastructure (FMP-First)

### 3.1 Data sources
- **Market**: Daily OHLCV, splits, dividends
- **Fundamentals**: Income, balance sheet, cash flow (quarterly)
- **Metadata**: Sector, industry, shares outstanding, market cap
- **Events**: Earnings dates with announcement time
- **Benchmarks**: QQQ (default), optional XLK / SMH
- **Regime proxies**: VIX, market breadth, rate proxies

---

### 3.2 Point-in-time (PIT) rules

Each datapoint stores:
- `value`
- `observed_at` (first public release timestamp)
- `effective_from`
- `source`

Rules:
- Fundamentals are **as-reported**, never restated historically
- Forward-fill allowed **only after** `observed_at`
- No feature may use information released after the cutoff time
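
The rules above can be illustrated with a small as-of lookup. This is a minimal sketch, not the project's storage layer: `PITRecord` and `latest_known` are hypothetical names, and the reading of `effective_from` as the start of the period a value describes is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class PITRecord:
    """One datapoint carrying the four fields listed above (hypothetical class)."""
    value: float
    observed_at: datetime     # first public release timestamp
    effective_from: datetime  # assumed: start of the period the value describes
    source: str


def latest_known(records: list[PITRecord], cutoff: datetime) -> Optional[PITRecord]:
    """Return the most recently released record visible at `cutoff`, or None."""
    visible = [r for r in records if r.observed_at <= cutoff]
    return max(visible, key=lambda r: r.observed_at, default=None)


records = [
    PITRecord(1.10, datetime(2024, 2, 1, 16, 5), datetime(2023, 12, 31), "fmp"),
    PITRecord(1.25, datetime(2024, 5, 2, 16, 5), datetime(2024, 3, 31), "fmp"),
]
# The second figure was released on 2024-05-02, so on 2024-04-15 only the first is visible.
known = latest_known(records, datetime(2024, 4, 15, 16, 0))
```

Forward-filling from `latest_known` is safe by construction, because a value can only be carried forward from its `observed_at` onward.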

---

### 3.3 Daily cutoff policy (anti-lookahead)

- Fixed cutoff time (e.g., 4:00pm ET)
- Features for date *T* may only use data with timestamps ≤ cutoff(T)
- Earnings handling distinguishes pre-market vs after-close announcements
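
As a rough illustration of the cutoff policy, the check below treats timestamps as naive datetimes already expressed in ET (an assumption). `usable_on` is a hypothetical helper, and the 4:00pm value is the example cutoff from the text.

```python
from datetime import date, datetime, time

CUTOFF = time(16, 0)  # example cutoff from the text: 4:00pm ET


def usable_on(observed_at: datetime, as_of: date) -> bool:
    """True if a datapoint stamped `observed_at` (ET) may feed features for `as_of`."""
    return observed_at <= datetime.combine(as_of, CUTOFF)


bmo = datetime(2024, 5, 2, 7, 30)  # pre-market earnings announcement
amc = datetime(2024, 5, 2, 16, 5)  # after-close earnings announcement
```

Under this rule a pre-market announcement is usable on the same date, while an after-close announcement first becomes usable on the next date.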

---

### 3.4 Data audits & bias detection

Automated checks:
- PIT violation scanner
- Survivorship reconstruction audit
- Corporate action sanity checks
- Missingness and outlier detection

**Success criteria**
- < 0.1% PIT violations
- Universe reproducible for any historical date
- All datasets auditable and replayable

---

## 4) Survivorship-Safe Dynamic Universe

### 4.1 Universe construction (critical)

At each rebalance date **T**:
- Start with all U.S. equities meeting liquidity and price thresholds
- Filter by AI-relevant sector / industry tags
- Select **Top N by market cap as-of T**
- Persist constituents with timestamp

Hardcoded “today’s winners” are explicitly disallowed.
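
The construction steps could look roughly like the sketch below. It is not the `UniverseBuilder` implementation: the function name, column names (`price`, `adv_usd`, `industry`, `market_cap`, `stable_id`), and thresholds are all assumptions chosen for illustration.

```python
import pandas as pd


def build_universe(snapshot: pd.DataFrame, as_of: str, top_n: int,
                   min_price: float, min_adv: float, ai_industries: set) -> pd.DataFrame:
    """Top-N by market cap among liquid, AI-tagged names, using only as-of-T values."""
    eligible = snapshot[
        (snapshot["price"] >= min_price)
        & (snapshot["adv_usd"] >= min_adv)
        & snapshot["industry"].isin(ai_industries)
    ]
    chosen = eligible.nlargest(top_n, "market_cap").copy()
    chosen["as_of"] = pd.Timestamp(as_of)  # persist constituents with their date
    return chosen[["as_of", "stable_id", "market_cap"]].reset_index(drop=True)


# Toy as-of-T snapshot; B fails the price filter and D fails the liquidity filter.
snapshot = pd.DataFrame({
    "stable_id": ["A", "B", "C", "D"],
    "price": [50.0, 3.0, 120.0, 80.0],
    "adv_usd": [5e7, 9e7, 2e7, 1e5],
    "industry": ["Semiconductors", "Semiconductors", "Software", "Software"],
    "market_cap": [2e11, 9e11, 5e10, 7e10],
})
universe = build_universe(snapshot, "2020-01-31", top_n=2, min_price=5.0,
                          min_adv=1e6, ai_industries={"Semiconductors", "Software"})
```

The key point is that `snapshot` must itself be built from values known on the rebalance date, so the ranking by market cap never sees later prices.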

---

### 4.2 Delistings & mergers
- Delisted stocks remain in historical universes where data exists
- Missing data is explicitly modeled rather than silently dropped

**Success criteria**
- Constituents vary meaningfully through time
- Backtests include both winners and failures

---

## 5) Feature Engineering (Bias-Safe)

### 5.0 Readiness Checklist & Implementation Plan

#### Infrastructure Available (from Chapters 3-4) ✅
| Component | Module | What It Provides |
|-----------|--------|------------------|
| Prices | `FMPClient.get_historical_prices()` | Split-adjusted OHLCV with `observed_at` |
| Fundamentals | `FMPClient.get_income_statement()` etc. | With `fillingDate` for PIT |
| Volume/ADV | `DuckDBPITStore.get_avg_volume()` | Computed from OHLCV |
| Events | `EventStore` | EARNINGS, FILING, SENTIMENT with PIT |
| Earnings | `AlphaVantageClient` + `ExpectationsClient` | BMO/AMC timing, surprises |
| Regime/VIX | `FMPClient.get_index_historical()` | SPY, VIX for regime detection |
| Universe | `UniverseBuilder` | FULL survivorship via Polygon |
| ID Mapping | `SecurityMaster` | Stable IDs, ticker changes |
| Calendar | `TradingCalendarImpl` | NYSE holidays, cutoffs |
| Caching | All clients | `data/cache/*` directories |

#### API Keys Available ✅
- `FMP_KEYS` - Prices, fundamentals, profiles (free tier: 250/day)
- `POLYGON_KEYS` - Symbol master, universe (free tier: 5/min)
- `ALPHAVANTAGE_KEYS` - Earnings calendar (free tier: 25/day)

---

#### Chapter 5 TODO List

**5.1 Targets (Labels)**
- [x] Implement forward excess return calculation vs QQQ benchmark
- [x] Create label generator for 20/60/90 trading day horizons
- [x] Ensure labels are strictly PIT-safe (no future leakage)

**5.2 Price & Volume Features**
- [x] Momentum features (1m, 3m, 6m, 12m returns)
- [x] Volatility (realized vol, vol-of-vol)
- [x] Drawdown (max drawdown, current vs high)
- [x] Relative strength vs universe median
- [x] Beta vs benchmark (rolling window)
- [x] ADV and volatility-adjusted ADV

**5.3 Fundamental Features (Relative)**
- [x] P/E vs own 3-year history (z-score)
- [x] P/S vs sector median
- [x] Margins vs sector peers
- [x] Revenue/earnings growth vs sector
- [x] All ratios rank-transformed cross-sectionally

**5.4 Event & Calendar Features** ✅
- [x] Days to next earnings
- [x] Days since last earnings
- [x] Post-earnings drift window indicator (PEAD 63 days)
- [x] Surprise magnitude (last N quarters)
- [x] Surprise streak and cross-sectional z-score
- [x] Filing recency (days since last 10-Q/10-K)

**5.5 Regime & Macro Features** ✅
- [x] VIX level and percentile (2-year window)
- [x] VIX regime classification (low/normal/elevated/high)
- [x] Market trend regime (bull/bear/neutral)
- [x] Sector rotation indicators (tech vs defensives)
- [x] All features timestamped with cutoff enforcement

**5.6 Missingness Masks** ✅
- [x] Create explicit "known at time T" indicators
- [x] Missingness as first-class feature (not just imputation)
- [x] Track data coverage statistics by category
- [x] Generate coverage reports

**5.7 Feature Hygiene & Redundancy** ✅
- [x] Cross-sectional z-score/rank standardization
- [x] Rolling Spearman correlation matrix
- [x] Feature clustering (identify blocks)
- [x] VIF diagnostics (tabular features)
- [x] Rolling IC stability checks
- [x] Sign consistency analysis

**5.8 Feature Neutralization (Diagnostics)** ✅
- [x] Sector-neutral IC computation
- [x] Beta-neutral IC computation
- [x] Sector+Beta neutral IC computation
- [x] Delta (Δ) reporting for interpretation

**Testing & Validation**
- [x] Unit tests for each feature block (5.1-5.8 all have tests)
- [ ] PIT violation scanner on all features
- [x] Univariate IC ≥ 0.03 check for strong signals (IC tools available)
- [x] IC stability across rolling windows (FeatureHygiene.compute_ic_stability)
- [x] Feature coverage > 95% (MissingnessTracker.compute_coverage_stats)

---

#### Rate Limit Strategy
1. Cache universe snapshots by rebalance date (Polygon: 5/min)
2. Batch FMP requests where possible (profiles, quotes)
3. Use Alpha Vantage sparingly (25/day limit)
4. Store computed features in DuckDB for reuse

---

### 5.1 Targets
- Forward **excess returns** vs benchmark
- Horizons: 20 / 60 / 90 trading days
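
A minimal sketch of the label calculation, assuming both series are adjusted closes indexed by the same trading days. `forward_excess_return` is a hypothetical name, not the project's label generator.

```python
import pandas as pd


def forward_excess_return(stock: pd.Series, bench: pd.Series, horizon: int) -> pd.Series:
    """Stock minus benchmark simple return over the next `horizon` trading days."""
    stock_fwd = stock.shift(-horizon) / stock - 1.0
    bench_fwd = bench.shift(-horizon) / bench - 1.0
    # The last `horizon` rows are NaN: those labels have not matured yet.
    return stock_fwd - bench_fwd


stock = pd.Series([100.0, 110.0, 121.0])
bench = pd.Series([100.0, 105.0, 110.25])
labels = forward_excess_return(stock, bench, horizon=1)
```

Because the label at row *T* reads prices after *T*, it belongs only on the target side and must not be joined into features for the same date.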

---

### 5.2 Price & volume features ✅ COMPLETE
**Implemented in `src/features/price_features.py`**

| Feature | Description |
|---------|-------------|
| `mom_1m/3m/6m/12m` | Returns over 21/63/126/252 trading days |
| `vol_20d/60d` | Annualized volatility |
| `vol_of_vol` | Volatility of rolling volatility |
| `max_drawdown_60d` | Maximum drawdown |
| `rel_strength_1m/3m` | Z-score vs universe |
| `beta_252d` | Beta vs QQQ benchmark |
| `adv_20d/60d` | Average daily dollar volume |
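
Two of the table's features can be sketched as follows. The function names are hypothetical and the 252-day annualization factor is an assumption; the actual code in `src/features/price_features.py` may differ.

```python
import numpy as np
import pandas as pd

TRADING_DAYS_PER_YEAR = 252  # assumed annualization factor


def momentum(close: pd.Series, window: int) -> pd.Series:
    """Trailing simple return over `window` trading days (21 for mom_1m)."""
    return close / close.shift(window) - 1.0


def annualized_vol(close: pd.Series, window: int) -> pd.Series:
    """Rolling standard deviation of daily returns, scaled to one year."""
    daily_ret = close.pct_change()
    return daily_ret.rolling(window).std() * np.sqrt(TRADING_DAYS_PER_YEAR)


# Toy path growing exactly 1% per day, so daily returns are constant.
close = pd.Series(100.0 * 1.01 ** np.arange(30))
mom_1m = momentum(close, 21)
vol_20d = annualized_vol(close, 20)
```

Both functions look backward only, so a value at row *T* depends on closes up to and including *T*.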

---

### 5.3 Fundamentals (relative, normalized) ✅ COMPLETE
**Implemented in `src/features/fundamental_features.py`**

Raw ratios are avoided — all features are RELATIVE:
- `pe_zscore_3y`: P/E vs own 3-year history
- `pe_vs_sector`: P/E relative to sector median
- `ps_vs_sector`: P/S relative to sector median
- `gross_margin_vs_sector`: Margins vs sector
- `revenue_growth_vs_sector`: Growth vs sector peers
- `roe_zscore`, `roa_zscore`: Quality metrics z-scored
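
One way a sector-relative feature and its cross-sectional rank could be computed is sketched below. Expressing `ps_vs_sector` as a ratio to the sector median minus one is an assumption, as are the column names and the `add_relative_features` helper.

```python
import pandas as pd


def add_relative_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add P/S relative to the same-date sector median, plus its percentile rank."""
    out = df.copy()
    sector_median = out.groupby(["date", "sector"])["ps"].transform("median")
    out["ps_vs_sector"] = out["ps"] / sector_median - 1.0
    # Rank within each date so values are comparable across time.
    out["ps_vs_sector_rank"] = out.groupby("date")["ps_vs_sector"].rank(pct=True)
    return out


df = pd.DataFrame({
    "date": ["2024-01-31"] * 3,
    "sector": ["Tech"] * 3,
    "ps": [2.0, 4.0, 6.0],
})
relative = add_relative_features(df)
```

Grouping by date before taking medians and ranks keeps each cross-section self-contained, so no statistic borrows information from other dates.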

---

### Time-Decay Sample Weighting (Training Policy) ✅
**Implemented in `src/features/time_decay.py`**

**Why time decay matters for AI stocks:**
- AI business models and the "AI regime" (2020+) differ from earlier eras
- Market microstructure evolves (HFT, retail flow)
- Many AI stocks didn't exist 15+ years ago — that's OK
- Recent observations are more relevant for forward predictions

**Recommended half-lives:**
| Horizon | Half-Life | Weight at 6y | Weight at 9y |
|---------|-----------|--------------|--------------|
| 20d     | 2.5 years | ~19%         | ~8%          |
| 60d     | 3.5 years | ~30%         | ~17%         |
| 90d     | 4.5 years | ~40%         | ~25%         |

Weights assume plain exponential decay, `0.5 ** (age_years / half_life_years)`.

**Key rules:**
1. Apply during training (Chapters 6 and 11), NOT feature computation
2. Per-row weights, normalized per date for cross-sectional ranking
3. Use survivorship-safe universe — young stocks get fewer rows but high-weight rows
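
The weighting can be sketched as below, assuming plain exponential decay on calendar age. Splitting each date's weight evenly across that date's rows is only one possible reading of "normalized per date", and `decay_weight` / `row_weights` are hypothetical names rather than the API of `src/features/time_decay.py`.

```python
from __future__ import annotations

from collections import Counter
from datetime import date, timedelta

HALF_LIFE_YEARS = {20: 2.5, 60: 3.5, 90: 4.5}  # horizon (trading days) -> half-life


def decay_weight(row_date: date, as_of: date, horizon: int) -> float:
    """Exponential decay: the weight halves for every half-life of age."""
    age_years = (as_of - row_date).days / 365.25
    return 0.5 ** (age_years / HALF_LIFE_YEARS[horizon])


def row_weights(row_dates: list[date], as_of: date, horizon: int) -> list[float]:
    """Per-row weights; each date's decay weight is split evenly across its rows."""
    rows_per_date = Counter(row_dates)
    return [decay_weight(d, as_of, horizon) / rows_per_date[d] for d in row_dates]


as_of = date(2024, 12, 31)
old = as_of - timedelta(days=913)  # roughly one 2.5-year half-life ago
weights = row_weights([as_of, as_of, old], as_of, horizon=20)
```

With this normalization a date contributes the same total weight whether it has 10 or 100 stocks, so years with a larger universe do not dominate the ranking loss.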

**30-year data note:** FMP Premium provides 30 years, but effective sample naturally concentrates in last 10-15 years for AI stocks.

---

### 5.4 Events & calendars ✅ COMPLETE
**Implemented in `src/features/event_features.py`**

| Feature | Description |
|---------|-------------|
| `days_to_earnings` | Days until next expected earnings |
| `days_since_earnings` | Days since last report |
| `in_pead_window` | Post-earnings drift window (63 days) |
| `last_surprise_pct` | Most recent surprise % |
| `avg_surprise_4q` | Rolling 4Q average |
| `surprise_streak` | Consecutive beats/misses |
| `surprise_zscore` | Cross-sectional z-score |
| `days_since_10k/10q` | Filing recency |
| `reports_bmo` | Typical announcement timing |

---

### 5.5 Regime & macro ✅ COMPLETE
**Implemented in `src/features/regime_features.py`**

| Feature | Description |
|---------|-------------|
| `vix_level`, `vix_percentile` | VIX level and 2-year percentile |
| `vix_regime` | low/normal/elevated/high |
| `market_return_5d/21d/63d` | SPY returns at various windows |
| `market_regime` | bull/bear/neutral (MA-based) |
| `above_ma_50`, `above_ma_200` | Price vs moving averages |
| `tech_vs_staples/utilities` | Sector rotation signals |

**Key:** Market-level features, common to all stocks.

---

### 5.6 Availability & missingness masks ✅ COMPLETE
**Implemented in `src/features/missingness.py`**

| Feature | Description |
|---------|-------------|
| `coverage_pct` | Overall feature coverage (0-1) |
| `{category}_coverage` | Per-category availability |
| `has_{type}_data` | Boolean availability flags |
| `is_new_stock` | < 1 year of history |

**Key Philosophy:** Missingness is a SIGNAL, not just noise.

---

### 5.7 Feature Hygiene & Redundancy Control ✅ COMPLETE
**Implemented in `src/features/hygiene.py`**

| Component | Description |
|-----------|-------------|
| Cross-sectional standardization | z-score or rank-transform within date |
| Rolling Spearman correlation | Correlation matrix computation |
| Feature clustering | Hierarchical clustering to identify blocks |
| VIF diagnostics | Variance Inflation Factor (diagnostic, not filter) |
| IC stability analysis | Rolling IC with sign consistency tracking |

> **Principle**: A feature with IC 0.04 once and −0.01 later is worse than IC 0.02 stable forever.

**Usage:**
```python
from src.features.hygiene import FeatureHygiene

hygiene = FeatureHygiene()
blocks = hygiene.identify_feature_blocks(features_df)
ic_results = hygiene.compute_ic_stability(features_df, labels_df)
report = hygiene.generate_hygiene_report(features_df)
```

**Tests:** 9/9 passed in `tests/test_hygiene.py`

---

### 5.8 Feature Neutralization ✅ COMPLETE
**Implemented in `src/features/neutralization.py`**

**Purpose:** For diagnostics ONLY (not training). Reveals WHERE alpha comes from.

| Component | Description |
|-----------|-------------|
| Sector-neutral IC | IC after removing sector effects |
| Beta-neutral IC | IC after removing market beta |
| Sector+Beta neutral IC | IC after removing both factors |
| Delta (Δ) reporting | neutral_IC - raw_IC for interpretation |

**Interpretation:**
- Large negative Δ_sector → feature was mostly sector rotation
- Large negative Δ_beta → feature was mostly market exposure
- Small Δ → alpha is genuinely stock-specific
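
To make the Δ interpretation concrete, here is a small sketch for a single date's cross-section. Demeaning both the feature and the label within each sector is one common way to neutralize and is an assumption here; the function and column names are hypothetical and not taken from `src/features/neutralization.py`.

```python
import pandas as pd


def rank_ic(x: pd.Series, y: pd.Series) -> float:
    """Spearman correlation, computed as Pearson correlation of ranks."""
    return float(x.rank().corr(y.rank()))


def sector_neutral_delta(df: pd.DataFrame, feature: str, label: str):
    """Return (raw IC, sector-neutral IC, delta) for one date's cross-section."""
    raw = rank_ic(df[feature], df[label])
    feat_resid = df[feature] - df.groupby("sector")[feature].transform("mean")
    label_resid = df[label] - df.groupby("sector")[label].transform("mean")
    neutral = rank_ic(feat_resid, label_resid)
    return raw, neutral, neutral - raw


# Toy data: the feature separates the two sectors but is wrong within each sector.
df = pd.DataFrame({
    "sector": ["A", "A", "A", "B", "B", "B"],
    "feature": [10.0, 11.0, 9.0, 0.0, 1.0, -1.0],
    "label": [5.0, 4.0, 6.0, -5.0, -6.0, -4.0],
})
raw, neutral, delta = sector_neutral_delta(df, "feature", "label")
```

In this toy case the raw IC is positive only because one sector outperformed the other, so the large negative delta flags the signal as sector rotation rather than stock selection.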

**Tests:** 9/9 passed in `tests/test_neutralization.py`

---

**Feature success criteria**
- > 95% completeness (post-masking)
- Strong univariate signals show IC ≳ 0.03
- No feature introduces PIT violations
- **Stability**: IC sign consistent across ≥70% of rolling windows
- **Redundancy understood**: Feature blocks documented, correlation matrix computed

---

## 6) Evaluation Framework (Core Credibility Layer)

> **CRITICAL PHILOSOPHY**: "You've crossed the line where bad evaluation can ruin a good system."
>
> **Approach**: Be conservative. Let results look "boring" if they are. Resist the urge to tweak features/models early. If signals survive Chapter 6 as-is, Chapters 7-11 will feel almost easy.

### 6.0 Prerequisites Check ✅
- **Labels**: v2 total return (dividends), mature-aware, PIT-safe ✅
- **Features**: 5.1-5.8 complete, stable, interpretable, auditable ✅
- **Missingness**: Explicit, not dropped ✅
- **Regime**: Visible but not leaked ✅
- **Alpha attribution**: Neutralization working ✅
- **PIT discipline**: Scanner enforced, 0 CRITICAL violations ✅

---

### 6.1 Walk-Forward Engine
**Expanding window** (not rolling):
- Grows forward, never shrinks
- Preserves long-term signal stability
- Avoids artificial lookback dependency

**Rebalance frequency:**
- Monthly or quarterly
- Aligns with institutional constraints
- Balances signal refresh vs turnover

**Universe snapshots:**
- Uses `stable_id` snapshots from Chapter 4 (survivorship-safe)
- Respects `label_matured_at` timestamps (PIT-safe)
- No forward-looking information leakage

**Time-decay sample weighting** (training only):
- From `src/features/time_decay.py`
- Horizon-specific half-lives: 2.5y (20d), 3.5y (60d), 4.5y (90d)
- Per-date normalization for cross-sectional ranking

---

### 6.2 Label Hygiene
**Enforce maturity rule:**
- `label_matured_at <= asof` strictly enforced
- No label is used before it matures

**Horizon-aware purging:**
- Remove overlapping labels across training/validation folds
- Prevents leakage from correlated label windows

**Embargo = max horizon:**
- Gap between train and validation = 90 trading days
- Conservative cushion for all horizons (20/60/90)
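
The expanding window and the embargo can be combined in a fold generator along these lines. This is a sketch over integer trading-day indices; `expanding_folds` and its parameters are hypothetical, and only the 90-day embargo comes from the text.

```python
from __future__ import annotations

EMBARGO = 90  # trading days, equal to the maximum label horizon


def expanding_folds(n_days: int, first_val_start: int, val_len: int,
                    embargo: int = EMBARGO) -> list[tuple[range, range]]:
    """Return (train, validation) index ranges over trading days 0..n_days-1.

    Training always starts at day 0 (expanding window) and stops `embargo`
    days before validation begins, so every training label has matured
    before the first validation date.
    """
    folds = []
    val_start = first_val_start
    while val_start + val_len <= n_days:
        train_end = val_start - embargo  # exclusive upper bound
        if train_end > 0:
            folds.append((range(0, train_end), range(val_start, val_start + val_len)))
        val_start += val_len
    return folds


folds = expanding_folds(n_days=1000, first_val_start=500, val_len=250)
```

The last training day plus the 90-day horizon always lands before the first validation day, which is the purging condition expressed in index form.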

**PIT violations:**
- Scanner runs pre-commit and in CI
- Zero tolerance for CRITICAL/HIGH violations

---

### 6.3 Metrics (Ranking-First)

**Primary Metric: RankIC**
- Spearman correlation of predicted ranks vs actual excess returns
- More stable than Pearson IC (robust to outliers)
- Directly measures ranking quality
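
A per-date RankIC can be sketched as follows, computing Spearman correlation as the Pearson correlation of ranks. The column names (`date`, `score`, `excess_ret`) and the function name are assumptions for illustration.

```python
import pandas as pd


def rank_ic_by_date(df: pd.DataFrame, score_col: str = "score",
                    label_col: str = "excess_ret") -> pd.Series:
    """Spearman correlation between scores and realized excess returns, per date."""
    ics = {}
    for day, group in df.groupby("date"):
        ics[day] = group[score_col].rank().corr(group[label_col].rank())
    return pd.Series(ics, name="rank_ic")


df = pd.DataFrame({
    "date": ["2024-01-31"] * 3 + ["2024-02-29"] * 3,
    "score": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "excess_ret": [0.01, 0.02, 0.03, 0.03, 0.02, 0.01],
})
ic = rank_ic_by_date(df)  # first date ranks perfectly, second is fully reversed
```

Summarizing the resulting series with its median rather than its mean keeps a single extreme date from driving the headline number.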

**IC by Regime:**
- VIX low/high quartiles
- Bull/bear markets (SPY 200-day MA)
- Sector rotation periods

**Top-Bottom Quintile Spread:**
- Return of top 20% - return of bottom 20%
- Measures practical exploitability
- Less sensitive than full-sample IC to noise in the middle of the ranking

**Hit Rate (Top-K):**
- % of top 10 picks that beat benchmark
- Simple, interpretable
- Aligns with concentrated portfolio construction
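
The two metrics above could be computed for one date roughly as follows. The function names are hypothetical, and "beat benchmark" is taken to mean a strictly positive excess return.

```python
import pandas as pd


def quintile_spread(scores: pd.Series, returns: pd.Series) -> float:
    """Mean return of the top 20% by score minus that of the bottom 20%."""
    order = scores.sort_values().index
    k = max(len(order) // 5, 1)
    return float(returns.loc[order[-k:]].mean() - returns.loc[order[:k]].mean())


def top_k_hit_rate(scores: pd.Series, excess_returns: pd.Series, k: int = 10) -> float:
    """Share of the k highest-scored names with positive excess return."""
    picks = scores.nlargest(k).index
    return float((excess_returns.loc[picks] > 0).mean())


tickers = [f"S{i}" for i in range(10)]
scores = pd.Series([float(i) for i in range(1, 11)], index=tickers)
excess = pd.Series([-0.04, -0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.04, 0.05],
                   index=tickers)
spread = quintile_spread(scores, excess)
hit_rate = top_k_hit_rate(scores, excess, k=3)
```

Both metrics ignore the middle of the ranking, which is why they track what a concentrated buy list would actually experience.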

**NOT USED:**
- MSE, MAE (we're ranking, not forecasting exact returns)
- Price accuracy (irrelevant for cross-sectional ranking)

---

### 6.4 Cost Realism (Diagnostic)

**Base costs:**
- 20 bps round-trip (conservative for liquid large caps)
- Higher for small/mid caps if needed

**ADV-scaled slippage:**
- Function of position size / average daily dollar volume
- Penalizes illiquid names
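
A simple diagnostic cost model might look like the sketch below. The linear impact term and its coefficient (`impact_bps_per_pct_adv`) are purely illustrative assumptions; only the 20 bps base cost comes from the text.

```python
def round_trip_cost_bps(position_usd: float, adv_usd: float,
                        base_bps: float = 20.0,
                        impact_bps_per_pct_adv: float = 10.0) -> float:
    """Base round-trip cost plus slippage that grows with participation in ADV."""
    participation_pct = 100.0 * position_usd / adv_usd
    return base_bps + impact_bps_per_pct_adv * participation_pct


def net_alpha_bps(gross_alpha_bps: float, position_usd: float, adv_usd: float) -> float:
    """Gross alpha minus the estimated round-trip cost, in basis points."""
    return gross_alpha_bps - round_trip_cost_bps(position_usd, adv_usd)


# A $1M position in a name trading $100M per day is 1% of ADV.
cost = round_trip_cost_bps(position_usd=1e6, adv_usd=1e8)
```

Re-running the same calculation with a few different impact coefficients gives the slippage sensitivity that the diagnostic asks for, without turning costs into an optimization target.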

**Question: "Does alpha survive?"**
- This is diagnostic, NOT optimization
- If alpha vanishes post-cost → reject signal
- If survives → document slippage sensitivity

**NOT cost-optimization:**
- No portfolio optimizer (yet)
- No transaction cost minimization tricks
- Pure diagnostic: is signal real?

---

### 6.5 Stability Reports

**IC decay plots:**
- How does IC degrade over time within a fold?
- Stable signals show flat or slow decay
- Rapid decay → overfitting or regime shift

**Regime-conditional performance:**
- IC in VIX high vs low
- IC in bull vs bear
- IC across sectors

**Churn diagnostics:**
- Top-10 ranking turnover month-over-month
- High churn (>50%) → unstable, costly
- Target: <30% for exploitability
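
Top-K churn can be measured as the share of names replaced between two consecutive rebalances. This is a minimal sketch and `top_k_churn` is a hypothetical helper.

```python
def top_k_churn(prev_ranked: list, curr_ranked: list, k: int = 10) -> float:
    """Fraction of the previous top-k that is no longer in the current top-k."""
    prev_top, curr_top = set(prev_ranked[:k]), set(curr_ranked[:k])
    return 1.0 - len(prev_top & curr_top) / k


prev_month = list("ABCDEFGHIJKL")   # ranked best to worst
curr_month = list("ABCDEFGHKLIJ")   # K and L displaced I and J from the top 10
churn = top_k_churn(prev_month, curr_month, k=10)
```

Here two of ten names changed, giving 20% churn, which would sit inside the <30% target.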

**Feature stability:**
- Use 5.7 IC stability metrics
- Sign consistency across folds
- Block-level stability (not individual features)

---

**Acceptance Criteria:**
- ✅ Median walk-forward RankIC > baseline by ≥ 0.02
- ✅ Net-of-cost performance positive in ≥ 70% of folds
- ✅ Top-10 ranking churn < 30% month-over-month
- ✅ Performance degrades gracefully under regime shifts
- ✅ NO PIT violations (enforced by scanner)

**Guardrails:**
- ❌ NO new features mid-evaluation
- ❌ NO retraining models to "fix" bad folds
- ❌ NO cherry-picking good time periods
- ❌ NO optimizing to costs (diagnostic only)
- ❌ NO hiding negative results

**Success = Boring Results That Don't Break**
- Median IC of 0.03-0.05 is GOOD
- Stable across regimes is EXCELLENT
- Survives costs is SUFFICIENT

---



## 7) Baseline Models (Models to Beat)

1. Naive (random / benchmark mean)
2. Factor baselines (momentum, low-vol, quality)
3. Tabular ML (LightGBM / CatBoost)

**Baseline gates**
- Factor IC > 0.02
- ML IC > 0.05
- TSFM models must beat tuned ML baseline on **median OOS IC**

---

## 8) Kronos Module (Price Dynamics)

- Input: OHLCV sequences
- Rolling / RevIN-style normalization
- Outputs: embeddings and horizon-aware signals
- Fine-tuning via walk-forward only

**Kronos success criteria**
- Zero-shot IC measured
- Fine-tuning improves IC by ≥ 0.01
- Stable behavior across price level shifts

---

## 9) FinText-TSFM Module (Return Structure)

- Input: historical excess returns
- Year-specific checkpoints to reduce pretraining leakage
- Outputs: return distributions and embeddings

**FinText success criteria**
- Adds independent signal (low correlation with Kronos)
- Improves fusion IC consistently across regimes

---

## 10) NLP Sentiment (Separate)

- Finance-specific NLP model (news / transcripts)
- Strict cutoff-time enforcement

Sentiment is optional and never required.

---

## 11) Fusion Model (Ranking-First)

- Gated fusion of:
  - Kronos embeddings
  - FinText-TSFM embeddings
  - Tabular context features

### Training
- **Time-decay sample weighting** (from `src/features/time_decay.py`)
- Per-date normalization for cross-sectional ranking loss

### Objectives
- Primary: pairwise / listwise ranking loss
- Secondary: distribution calibration loss
- No pure MSE price regression

### Ablation gates
- Feature blocks removed if unstable
- Fusion must beat best single model

---

## 12) Regime-Aware Ensembling

- Components: Fusion, ML baseline, simple factor
- Regime detector (volatility / trend)
- Smooth, regularized ensemble weights

**Success criteria**
- Ensemble improves median IC
- Reduces variance across regimes

---


## 13) Calibration & Confidence

- Quantile calibration
- Confidence stratification

**Success criteria**
- Quantile coverage error < 5%
- High-confidence bucket materially outperforms
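
One way to read the coverage criterion is as the gap between the nominal and empirical coverage of the 5th to 95th percentile interval. That interpretation is an assumption, as is the `coverage_error` helper below.

```python
from __future__ import annotations

NOMINAL_COVERAGE = 0.90  # the 5th to 95th percentile band should hold ~90% of outcomes


def coverage_error(realized: list[float], q05: list[float], q95: list[float]) -> float:
    """Absolute gap between empirical and nominal coverage of the [q05, q95] band."""
    inside = sum(lo <= r <= hi for r, lo, hi in zip(realized, q05, q95))
    return abs(inside / len(realized) - NOMINAL_COVERAGE)


realized = [-0.30, -0.05, 0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.15]
q05 = [-0.10] * 10
q95 = [0.20] * 10
err = coverage_error(realized, q05, q95)  # 9 of 10 outcomes fall inside the band
```

Only matured labels should enter `realized`, so the check stays point-in-time safe like the rest of the evaluation.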

---

## 14) Monitoring & Research Ops

- Prediction logging with timestamps
- Matured-label scoring
- Feature and performance drift detection

Alerts:
- RankIC decay
- Calibration breakdown
- Ranking instability

---

## 15) Outputs & Interfaces

- Ranked stock lists
- Per-stock explanation summaries
- Batch scoring interface
- Full traceability of inputs and decisions

---

## 16) Global Research Acceptance Criteria

A model is considered **valid** if:

- Median walk-forward RankIC exceeds baseline by ≥ 0.02
- Net-of-cost performance positive in ≥ 70% of folds
- Top-10 ranking churn < 30% month-over-month
- Performance degrades gracefully under regime shifts
- No PIT or survivorship violations detected
