# Week 14 — Capstone: End-to-End ML Trading Strategy

**Course:** ML for Quantitative Finance  
**Due:** Final submission

---

## Objective

Build a **complete, production-quality** ML trading strategy that combines at least two techniques from the course.

## Grading Rubric

| Component | Points | Description |
|-----------|--------|-------------|
| Data Pipeline | 10 | Clean data acquisition, bias handling |
| Feature Engineering | 15 | ≥15 features across ≥3 categories |
| Model & Training | 20 | Proper financial CV (purged or expanding) |
| Labeling | 10 | Triple-barrier or justified alternative |
| Portfolio Construction | 10 | Signal → weights |
| Backtest | 15 | Walk-forward, realistic costs (≥5 bps/side) |
| Evaluation | 10 | Full tear sheet + deflated Sharpe |
| Analysis | 10 | Honest limitations, capacity, improvements |

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

## Part 1: Data Pipeline (10 pts)

1. Download data for ≥50 liquid US equities (2010-2024)
2. Handle missing data, stock splits, survivorship bias
3. Document your data cleaning decisions
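
One way to keep the cleaning decisions in steps 2 and 3 explicit is to collect them in a single function whose parameters are the decisions themselves. The sketch below is only an illustration, not a reference solution: `clean_prices`, `max_missing`, and `ffill_limit` are hypothetical names, the input is assumed to be a dates × tickers frame of split- and dividend-adjusted closes, and synthetic prices stand in for a real download.

```python
import numpy as np
import pandas as pd


def clean_prices(prices, max_missing=0.20, ffill_limit=5):
    """Drop sparse tickers, fill short gaps, and compute returns.

    prices: DataFrame of adjusted closes (rows = dates, columns = tickers).
    max_missing and ffill_limit are illustrative choices, not course requirements.
    """
    # Fraction of missing observations per ticker over the full sample
    missing_frac = prices.isna().mean()
    keep = missing_frac[missing_frac <= max_missing].index

    # Forward-fill only short gaps so long outages are not silently papered over
    cleaned = prices[keep].ffill(limit=ffill_limit)

    # Daily simple returns; fill_method=None avoids an implicit second fill
    daily = cleaned.pct_change(fill_method=None)

    # Monthly returns from the last available price in each calendar month
    month_end = cleaned.groupby(cleaned.index.to_period("M")).last()
    monthly = month_end.pct_change(fill_method=None)
    return cleaned, daily, monthly


# Synthetic stand-in for downloaded data (assumption: three made-up tickers)
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=250)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(250, 3)), axis=0)),
    index=dates,
    columns=["AAA", "BBB", "BAD"],
)
prices.iloc[::2, 2] = np.nan    # "BAD" is missing on half of the days
prices.iloc[10:12, 0] = np.nan  # short two-day gap in "AAA"

cleaned, daily_ret, monthly_ret = clean_prices(prices)
print(cleaned.columns.tolist())
print(monthly_ret.tail(3))
```

Code of this kind does not address survivorship bias: a universe built from today's index members excludes delisted names, so that choice still needs to be discussed in the write-up.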

In [None]:
# TODO: Define universe (≥50 tickers)
# TODO: Download price + volume data
# TODO: Clean data (ffill, drop tickers with >20% missing)
# TODO: Compute daily and monthly returns

## Part 2: Feature Engineering (15 pts)

Build ≥15 features spanning at least 3 of these categories:
- **Momentum:** 1m, 3m, 6m, 12m-skip-1
- **Reversal:** Short-term mean reversion
- **Volatility:** Rolling vol, vol ratio, vol-of-vol
- **Volume:** Volume ratio, dollar volume rank
- **Technical:** MA ratios, RSI, Bollinger band position
- **Fundamental:** (if using yfinance .info or other sources)
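
A minimal sketch of three of these features and the cross-sectional rank transform follows. It assumes a dates × tickers frame of daily adjusted closes, 21 trading days per month, and 252 per year; the function names (`build_features`, `cross_sectional_rank`, `to_panel`) and feature names are hypothetical, and the prices are synthetic.

```python
import numpy as np
import pandas as pd


def build_features(prices):
    """Return a dict of dates x tickers feature frames (three examples only)."""
    ret = prices.pct_change(fill_method=None)
    return {
        # 12-month momentum skipping the most recent month
        "mom_12_1": prices.shift(21) / prices.shift(252) - 1,
        # Short-term reversal: negative of the last month's return
        "rev_1m": -(prices / prices.shift(21) - 1),
        # Rolling 60-day volatility of daily returns
        "vol_60d": ret.rolling(60).std(),
    }


def cross_sectional_rank(feature):
    """Rank each date's cross-section into (-0.5, 0.5]."""
    return feature.rank(axis=1, pct=True) - 0.5


def to_panel(feats):
    """Stack ranked features into a (date, ticker)-indexed panel."""
    cols = {name: cross_sectional_rank(f).unstack() for name, f in feats.items()}
    panel = pd.DataFrame(cols).dropna()
    panel.index.names = ["ticker", "date"]
    return panel.swaplevel().sort_index()


# Synthetic prices: 300 business days, 6 made-up tickers
rng = np.random.default_rng(0)
dates = pd.bdate_range("2015-01-01", periods=300)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(300, 6)), axis=0)),
    index=dates,
    columns=[f"T{i}" for i in range(6)],
)

panel = to_panel(build_features(prices))
print(panel.describe().T)
```

Ranking within each date removes scale differences between features and limits the influence of outliers, at the cost of discarding the magnitude of each feature.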

In [None]:
# TODO: Compute ≥15 features
# TODO: Rank-transform features cross-sectionally
# TODO: Build panel dataset (dates × stocks × features)
# TODO: Print feature summary statistics

## Part 3: Labeling (10 pts)

Choose ONE:
- **Option A:** Triple-barrier labeling (from Week 6)
- **Option B:** Forward 1-month returns (justify why this is acceptable for your strategy)

If using triple-barrier, compute daily volatility and set barriers relative to it.
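
The Week 6 implementation is not reproduced here, so the following is a simplified sketch of the idea for a single price series. It assumes symmetric-style barriers at `pt` and `sl` multiples of a daily volatility estimate, a vertical barrier after `horizon` trading days, and a label of 0 when only the vertical barrier is hit (some variants use the sign of the return instead). The function name, parameters, and the EWM span of 20 are illustrative choices.

```python
import numpy as np
import pandas as pd


def triple_barrier_labels(close, vol, pt=2.0, sl=2.0, horizon=10):
    """Label each day +1 (upper barrier first), -1 (lower first), 0 (timeout).

    close: price Series; vol: daily volatility Series on the same index.
    The last `horizon` observations stay NaN because their path is incomplete.
    """
    labels = pd.Series(np.nan, index=close.index)
    values = close.to_numpy()
    for i in range(len(values) - horizon):
        sigma = vol.iloc[i]
        if not np.isfinite(sigma) or sigma <= 0:
            continue  # no usable volatility estimate yet
        # Cumulative return path over the next `horizon` days
        path = values[i + 1 : i + 1 + horizon] / values[i] - 1.0
        up = np.flatnonzero(path >= pt * sigma)
        down = np.flatnonzero(path <= -sl * sigma)
        first_up = up[0] if up.size else horizon
        first_down = down[0] if down.size else horizon
        if first_up == horizon and first_down == horizon:
            labels.iloc[i] = 0
        elif first_up < first_down:
            labels.iloc[i] = 1
        else:
            labels.iloc[i] = -1
    return labels


# Synthetic example: random-walk prices, EWM volatility of daily returns
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))
daily_vol = close.pct_change().ewm(span=20).std()

labels = triple_barrier_labels(close, daily_vol)
print(labels.value_counts(normalize=True).sort_index())
```

The printed shares give the label distribution; a heavy skew toward one class is worth reporting alongside the barrier widths that produced it.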

In [None]:
# TODO: Implement your chosen labeling method
# TODO: Report label distribution (balanced? skewed?)
# TODO: Justify your choice

## Part 4: Model Training (20 pts)

1. Train at least one ML model (XGBoost, LightGBM, or RF recommended)
2. Use proper financial cross-validation:
   - Expanding window OR purged k-fold (from Week 6)
3. Optional: Tune hyperparameters with Optuna (Week 5)
4. Report in-sample vs out-of-sample IC
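
As a rough illustration of items 2 and 4, the sketch below generates purged k-fold splits over time-ordered periods and computes a rank information coefficient. It is a simplified stand-in rather than the Week 6 code: `purged_kfold`, `rank_ic`, `purge`, and `embargo` are hypothetical names, the splitter works on positions of unique dates (not individual stock-date rows), and the IC is taken to be the Spearman correlation between predictions and realized returns.

```python
import numpy as np
import pandas as pd


def purged_kfold(n_periods, n_splits=5, purge=1, embargo=1):
    """Yield (train_idx, test_idx) over time-ordered period positions.

    Training periods within `purge` before the test block and within
    `embargo` after it are dropped, so labels that overlap the test
    window do not leak into training.
    """
    indices = np.arange(n_periods)
    for test_idx in np.array_split(indices, n_splits):
        start, stop = test_idx[0], test_idx[-1]
        train_mask = (indices < start - purge) | (indices > stop + embargo)
        yield indices[train_mask], test_idx


def rank_ic(pred, realized):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    return pred.rank().corr(realized.rank())


# Example: 60 monthly periods, 5 folds, one-month purge and embargo
for train_idx, test_idx in purged_kfold(60, n_splits=5, purge=1, embargo=1):
    print(f"test {test_idx[0]:2d}-{test_idx[-1]:2d}  train size {len(train_idx)}")

pred = pd.Series([0.1, 0.3, 0.2, 0.5])
realized = pd.Series([0.01, 0.02, 0.03, 0.04])
print(round(rank_ic(pred, realized), 3))
```

A reasonable rule of thumb is to set `purge` to at least the label horizon measured in periods, so that no training label is computed from prices inside the test block.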

In [None]:
# TODO: Split data (train: 2012-2017, validation: 2018-2019, test: 2020-2024)
# TODO: Train model with expanding window or purged k-fold
# TODO: Optional: Optuna tuning on validation set
# TODO: Report IC metrics per period

## Part 5: Portfolio Construction & Backtest (25 pts)

1. Translate model signals to portfolio weights
   - Simple: rank-based quintile long-short
   - Advanced: signal-weighted or optimized
2. Walk-forward backtest with monthly rebalancing
3. Apply realistic transaction costs (≥5 bps per side)
4. Record monthly returns
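
The simple variant of steps 1 and 3 can be sketched as follows. The assumptions are stated in the code: equal weights within the top and bottom quintiles, a dollar-neutral book with 100% long and 100% short exposure, costs charged on one-way turnover at `cost_bps` per side, and no adjustment for weight drift between rebalances. `quintile_long_short` and `net_return` are hypothetical helper names.

```python
import numpy as np
import pandas as pd


def quintile_long_short(signal):
    """Equal-weight long the top quintile and short the bottom quintile.

    Assumes at least five names with a valid signal on the date.
    """
    pct = signal.rank(pct=True)
    long_leg = pct > 0.8
    short_leg = pct <= 0.2
    weights = pd.Series(0.0, index=signal.index)
    weights[long_leg] = 1.0 / long_leg.sum()
    weights[short_leg] = -1.0 / short_leg.sum()
    return weights


def net_return(weights, prev_weights, asset_returns, cost_bps=5.0):
    """Portfolio return for one period after proportional trading costs."""
    idx = weights.index.union(prev_weights.index)
    w = weights.reindex(idx, fill_value=0.0)
    w_prev = prev_weights.reindex(idx, fill_value=0.0)
    r = asset_returns.reindex(idx).fillna(0.0)
    turnover = (w - w_prev).abs().sum()  # sum of absolute weight changes
    gross = (w * r).sum()
    return gross - turnover * cost_bps / 1e4


# Example with ten made-up tickers
tickers = list("ABCDEFGHIJ")
signal = pd.Series(np.arange(10, dtype=float), index=tickers)
weights = quintile_long_short(signal)
flat_book = pd.Series(0.0, index=tickers)
returns = pd.Series(np.linspace(-0.02, 0.02, 10), index=tickers)

print(weights)
print(net_return(weights, flat_book, returns, cost_bps=5.0))
```

Starting from an empty book, the first rebalance trades the full gross exposure of 2.0, so at 5 bps per side the first month carries a cost of 10 bps.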

In [None]:
# TODO: Walk-forward loop
#   For each month from 2018 onward:
#     - Train on expanding window up to previous month
#     - Predict cross-section for current month
#     - Form long-short portfolio
#     - Apply transaction costs
#     - Record return
# TODO: Compute cumulative returns
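
The expanding-window loop outlined in the cell above might look roughly like the skeleton below. It is a sketch under several assumptions: the panel is indexed by (`date`, `ticker`) with monthly periods, each row's label is the return over the following month (so all rows dated before month `t` have realized labels by the time month `t` is predicted), and an ordinary least-squares fit stands in for the actual model. `walk_forward_predictions` and `min_train_months` are hypothetical names.

```python
import numpy as np
import pandas as pd


def walk_forward_predictions(panel, feature_cols, label_col, start, min_train_months=24):
    """Refit on all months before t, then predict the cross-section at t."""
    all_dates = panel.index.get_level_values("date")
    dates = all_dates.unique().sort_values()
    preds = []
    for t in dates[dates >= start]:
        train = panel[all_dates < t]  # expanding window, strictly before t
        if train.index.get_level_values("date").nunique() < min_train_months:
            continue
        test = panel.xs(t, level="date", drop_level=False)

        # Placeholder model: linear regression with an intercept (assumption;
        # a tree-based model would be swapped in here)
        X = np.column_stack([np.ones(len(train)), train[feature_cols].to_numpy()])
        beta, *_ = np.linalg.lstsq(X, train[label_col].to_numpy(), rcond=None)
        X_test = np.column_stack([np.ones(len(test)), test[feature_cols].to_numpy()])
        preds.append(pd.Series(X_test @ beta, index=test.index))
    return pd.concat(preds)


# Synthetic panel: 36 months x 20 made-up tickers, one informative feature
rng = np.random.default_rng(1)
months = pd.period_range("2015-01", periods=36, freq="M")
tickers = [f"T{i:02d}" for i in range(20)]
index = pd.MultiIndex.from_product([months, tickers], names=["date", "ticker"])
panel = pd.DataFrame({"f1": rng.normal(size=len(index))}, index=index)
panel["fwd_ret"] = 0.5 * panel["f1"] + 0.1 * rng.normal(size=len(index))

start = pd.Period("2017-01", freq="M")
preds = walk_forward_predictions(panel, ["f1"], "fwd_ret", start)
print(preds.groupby(level="date").size().head())
```

The predictions for each month can then be turned into weights, charged transaction costs, and recorded, one month at a time, without ever using data dated on or after the prediction month for fitting.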

## Part 6: Evaluation (10 pts)

1. Full performance tear sheet
2. Deflated Sharpe ratio (how many strategies did you try?)
3. Rolling Sharpe ratio (is performance stable or declining?)
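
For reference, a compact sketch of three of these calculations is given below. It assumes monthly returns, a zero risk-free rate, and the deflated Sharpe ratio in the form proposed by Bailey and López de Prado, where the benchmark Sharpe is the expected maximum across `n_trials` strategies whose per-period Sharpe ratios have standard deviation `sr_trials_std`. Both of those inputs have to come from an honest record of what was tried; the function names and the synthetic return series are illustrative.

```python
from statistics import NormalDist

import numpy as np
import pandas as pd

EULER_GAMMA = 0.5772156649015329
STD_NORMAL = NormalDist()


def annualized_sharpe(returns, periods_per_year=12):
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    return float(np.sqrt(periods_per_year) * returns.mean() / returns.std(ddof=1))


def max_drawdown(returns):
    """Worst peak-to-trough loss, counting initial capital of 1.0 as a peak."""
    wealth = (1 + returns).cumprod()
    peak = wealth.cummax().clip(lower=1.0)
    return float((wealth / peak - 1).min())


def deflated_sharpe(returns, n_trials, sr_trials_std):
    """Probability that the true Sharpe exceeds the expected best of n_trials."""
    sr = returns.mean() / returns.std(ddof=1)  # per-period, not annualized
    n_obs = len(returns)
    skew = returns.skew()
    kurt = returns.kurt() + 3.0  # pandas reports excess kurtosis
    if n_trials > 1:
        # Expected maximum Sharpe under the null of no skill
        sr0 = sr_trials_std * (
            (1 - EULER_GAMMA) * STD_NORMAL.inv_cdf(1 - 1 / n_trials)
            + EULER_GAMMA * STD_NORMAL.inv_cdf(1 - 1 / (n_trials * np.e))
        )
    else:
        sr0 = 0.0  # single trial: reduces to the probabilistic Sharpe ratio
    denom = np.sqrt(1 - skew * sr + (kurt - 1) / 4 * sr**2)
    return STD_NORMAL.cdf(float((sr - sr0) * np.sqrt(n_obs - 1) / denom))


# Synthetic monthly returns (assumption: 10 years, made-up mean and volatility)
rng = np.random.default_rng(0)
monthly = pd.Series(rng.normal(0.01, 0.03, 120))

print("Sharpe:", round(annualized_sharpe(monthly), 2))
print("Max drawdown:", round(max_drawdown(monthly), 3))
print("DSR, 1 trial:", round(deflated_sharpe(monthly, 1, 0.1), 3))
print("DSR, 100 trials:", round(deflated_sharpe(monthly, 100, 0.1), 3))
```

Holding the return series fixed, raising `n_trials` lowers the deflated Sharpe ratio, which is why the count of strategies and parameter settings tried matters for the conclusion.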

In [None]:
# TODO: Compute all metrics (Sharpe, Sortino, Calmar, Max DD, VaR, CVaR, etc.)
# TODO: Compute deflated Sharpe ratio (be honest about n_trials!)
# TODO: Plot cumulative returns + drawdown
# TODO: Plot rolling 12-month Sharpe
# TODO: Compare to SPY buy-and-hold

## Part 7: Analysis & Discussion (10 pts)

Write a brief (1-2 paragraphs per point) analysis covering:

1. **Source of return:** Why should this strategy work? What economic mechanism drives the alpha?
2. **Limitations:** What could go wrong? Regime changes, crowding, capacity?
3. **Overfitting risk:** How many strategies/parameters did you try? Is the DSR convincing?
4. **Capacity:** How much capital could this strategy manage? (Hint: compare position sizes with average daily volume, ADV.)
5. **Improvements:** What would you do with more time/data/compute?
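
For the capacity question, a back-of-envelope calculation is often enough. The sketch below assumes that no position should exceed a fixed fraction of a stock's average daily dollar volume; the 5% participation cap, the function name `capacity_estimate`, and the example figures are all illustrative assumptions.

```python
import pandas as pd


def capacity_estimate(weights, adv_dollars, max_participation=0.05):
    """Largest capital such that every position stays within the ADV cap.

    weights: portfolio weights as fractions of capital.
    adv_dollars: average daily dollar volume per ticker.
    Returns (capital, binding_ticker).
    """
    abs_w = weights.abs()
    abs_w = abs_w[abs_w > 0]
    # capital * |w_i| <= max_participation * ADV_i for every name i
    limits = max_participation * adv_dollars.reindex(abs_w.index) / abs_w
    return float(limits.min()), limits.idxmin()


# Made-up example: three positions with different liquidity
weights = pd.Series({"A": 0.5, "B": 0.5, "C": -1.0})
adv = pd.Series({"A": 100e6, "B": 20e6, "C": 50e6})

capital, binding = capacity_estimate(weights, adv)
print(f"Capacity about ${capital:,.0f}, limited by {binding}")
```

This sizes each full position against a single day's volume, which is conservative; spreading trades over several days, or constraining only the monthly turnover rather than the whole position, would give a larger figure.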

In [None]:
# TODO: Write your analysis in markdown cells below
# This section is as important as the code — be honest and thorough