# Week 5 — Seminar: Gradient Boosting for Finance

**Course:** ML for Quantitative Finance  
**Type:** Seminar (90 min)

---

## Setup

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from scipy import stats
import xgboost as xgb
import lightgbm as lgb
import shap
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

In [None]:
# Reuse Week 4 feature pipeline
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'JPM', 'JNJ', 'XOM', 'PG', 'V', 'UNH',
           'HD', 'MA', 'PFE', 'COST', 'NKE', 'BAC', 'NVDA', 'META', 'LLY', 'ABBV',
           'CVX', 'WMT', 'CRM', 'AVGO', 'TXN', 'QCOM', 'HON', 'GS', 'MS', 'BLK']

data = yf.download(tickers, start='2010-01-01', end='2024-12-31', progress=False)
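# dropna() keeps only dates on which every ticker has a price, so the usable
# sample starts around January 2013, when ABBV (the latest listing here) began trading
prices = data['Close'].ffill().dropna()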
volume = data['Volume'].ffill().dropna()
returns_daily = prices.pct_change()
# Month-end resampling; pandas >= 2.2 deprecates the 'M' alias in favor of 'ME'
monthly_prices = prices.resample('M').last()
monthly_returns = monthly_prices.pct_change()

## Exercise 1: XGBoost vs. LightGBM vs. RF (25 min)

Using the feature matrix from Week 4:
1. Train RF, XGBoost, LightGBM with default parameters
2. Compare the information coefficient (IC) and compute time
3. Which model wins on IC, and at what cost in compute time?
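
One way to score each model in step 2 is a monthly cross-sectional IC. The sketch below assumes IC means the Spearman rank correlation between predictions and next-period returns, computed per date and then averaged. The helper `monthly_ic`, the column names `pred` and `fwd_ret`, and the synthetic panel are illustrative assumptions, not part of the Week 4 pipeline.

```python
import numpy as np
import pandas as pd
from scipy import stats


def monthly_ic(panel: pd.DataFrame, pred_col: str = "pred",
               target_col: str = "fwd_ret") -> pd.Series:
    """Cross-sectional Spearman rank correlation for each date in the panel."""
    def _ic(group: pd.DataFrame) -> float:
        rho, _ = stats.spearmanr(group[pred_col], group[target_col])
        return rho
    return panel.groupby(level="date").apply(_ic)


# Synthetic (date, ticker) panel: 24 months x 30 hypothetical stocks
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=24, freq="MS")
names = [f"S{i:02d}" for i in range(30)]
index = pd.MultiIndex.from_product([dates, names], names=["date", "ticker"])

signal = rng.standard_normal(len(index))
noise = rng.standard_normal(len(index))
# The realized return contains a weak copy of the signal plus noise
panel = pd.DataFrame({"pred": signal, "fwd_ret": 0.05 * signal + noise}, index=index)

ic = monthly_ic(panel)
ic_tstat = ic.mean() / ic.std(ddof=1) * np.sqrt(len(ic))
print(f"mean IC = {ic.mean():.3f}, t-stat = {ic_tstat:.2f}")
```

For the compute-time column of the comparison table, wrapping each `fit` and `predict` call in `time.perf_counter()` readings is usually enough at this data size.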

In [None]:
# TODO: Build feature panel (reuse Week 4 code)
# TODO: Train all 3 models with expanding window
# TODO: Compare IC, training time, and prediction time
# TODO: Create comparison table

## Exercise 2: Optuna Tuning (25 min)

Tune XGBoost with Optuna using time-series CV (no shuffling: every validation fold must come after its training data).
1. Define the search space (max_depth, learning_rate, subsample, etc.)
2. Run 30+ trials
3. Compare tuned vs. default IC
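
The "no shuffling" requirement can be sketched with scikit-learn's `TimeSeriesSplit` applied to the unique dates rather than to the rows, so that no month ends up split between training and validation. The helper name `date_based_splits`, the one-month `gap`, and the synthetic dates are assumptions for illustration; an Optuna objective would typically average the validation IC over these folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


def date_based_splits(dates: pd.Series, n_splits: int = 5, gap: int = 1):
    """Yield (train_idx, val_idx) row positions, split on unique dates in time order.

    `gap` is the number of dates dropped between the end of training and the
    start of validation (assumed useful when targets are forward returns).
    """
    date_array = dates.to_numpy()
    unique_dates = np.unique(date_array)  # np.unique returns sorted values
    splitter = TimeSeriesSplit(n_splits=n_splits, gap=gap)
    for train_pos, val_pos in splitter.split(unique_dates):
        train_mask = np.isin(date_array, unique_dates[train_pos])
        val_mask = np.isin(date_array, unique_dates[val_pos])
        yield np.flatnonzero(train_mask), np.flatnonzero(val_mask)


# Synthetic panel layout: 36 months, 30 rows (stocks) per month
months = pd.date_range("2020-01-01", periods=36, freq="MS")
dates = pd.Series(months.repeat(30))

for fold, (train_idx, val_idx) in enumerate(date_based_splits(dates, n_splits=3)):
    train_end = dates.iloc[train_idx].max().date()
    val_start = dates.iloc[val_idx].min().date()
    print(f"fold {fold}: train ends {train_end}, validation starts {val_start}, "
          f"{len(train_idx)} train rows, {len(val_idx)} validation rows")
```

Splitting on dates instead of rows matters for panel data: a row-level split could place some stocks from a given month in training and others from the same month in validation.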

In [None]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

# TODO: Define objective function with time-series CV
# TODO: Run optimization
# TODO: Compare best tuned model vs. default parameters

## Exercise 3: SHAP Analysis (20 min)

1. Compute SHAP values for your best model
2. Create summary and dependence plots
3. Do the top features match known anomalies?

In [None]:
# TODO: Compute SHAP values
# TODO: Create summary plot (bar and beeswarm)
# TODO: Create dependence plots for top 3 features
# TODO: Do they match momentum, value, low-vol anomalies?

## Discussion (20 min)

1. Your XGBoost model has in-sample IC = 0.15 and out-of-sample (OOS) IC = 0.02. What happened?
2. Why do production quant systems retrain models on a schedule (e.g., weekly) rather than training once?
3. You find that momentum × volatility interaction is the most important SHAP feature. How would you trade this?
4. Is there a risk that SHAP analysis itself leaks future information?