# Week 5 — Tree-Based Methods & Gradient Boosting

**Course:** ML for Quantitative Finance  
**Type:** Lecture (90 min)

---

## Why This Matters

Gradient boosted trees (XGBoost, LightGBM) are among the **most widely used ML model classes**  
in production quant finance, and they feature heavily in winning solutions to tabular Kaggle competitions, including the finance ones.  
If you learn only one ML method for finance, learn this one.

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import TimeSeriesSplit
from scipy import stats
import xgboost as xgb
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

## 1. Trees → Random Forests → Gradient Boosting

**Decision Tree:** Recursive binary splits. Greedy, interpretable, high variance.

**Random Forest:** Bag many trees (bootstrap + feature subsampling). Reduces variance.

**Gradient Boosting:** Build trees sequentially, each correcting the previous one's errors.
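
To make the "each tree corrects the previous one's errors" idea concrete, here is a minimal sketch of gradient boosting for squared loss, written with NumPy only. It assumes a single feature, depth-1 trees (stumps), and synthetic data; `fit_stump` is a hypothetical helper written for this illustration, not a library function.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on a 1-D feature that minimizes squared error."""
    best = None
    for thr in np.unique(x)[:-1]:              # every candidate split point
        left = x <= thr
        pred_l, pred_r = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - pred_l) ** 2).sum() + ((residual[~left] - pred_r) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, pred_l, pred_r)
    _, thr, pred_l, pred_r = best
    return lambda z: np.where(z <= thr, pred_l, pred_r)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)        # synthetic data, not market data

learning_rate, n_rounds = 0.1, 50
pred = np.full_like(y, y.mean())               # round 0: constant prediction
mse_path = [np.mean((y - pred) ** 2)]

for _ in range(n_rounds):
    residual = y - pred                        # negative gradient of squared loss
    stump = fit_stump(x, residual)             # weak learner fit to current errors
    pred = pred + learning_rate * stump(x)     # shrunken additive update
    mse_path.append(np.mean((y - pred) ** 2))

print(f"Train MSE after 0 rounds: {mse_path[0]:.3f}, after {n_rounds} rounds: {mse_path[-1]:.3f}")
```

Training error can only go down with each round, which is exactly why boosting needs shrinkage, shallow trees, and out-of-sample validation on noisy financial data.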

### Why Trees Dominate Tabular Financial Data
- The boosting libraries (XGBoost, LightGBM, CatBoost) handle missing values natively (no imputation needed)
- Capture nonlinear interactions (momentum × volatility)
- No feature scaling required
- Built-in feature importance
- Regularization through depth limits, subsampling
- Fast to train and predict
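
Two of the claims above, interaction capture and scale invariance, can be checked on synthetic data. The sketch below assumes a hypothetical regime effect (a "momentum" feature that only matters when a "volatility" feature is above zero) and uses scikit-learn's `DecisionTreeRegressor` as a stand-in for a boosted ensemble. The column scale factors are arbitrary powers of two.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

def make_data(n):
    X = rng.normal(size=(n, 2))                       # columns: "momentum", "volatility" (synthetic)
    y = X[:, 0] * (X[:, 1] > 0) + rng.normal(0, 0.1, n)  # momentum pays off only in one regime
    return X, y

X_train, y_train = make_data(2000)
X_test, y_test = make_data(1000)

# 1) Interaction: compare out-of-sample R^2 of a linear model and a shallow tree
lin_r2 = LinearRegression().fit(X_train, y_train).score(X_test, y_test)
tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)
tree_r2 = tree.score(X_test, y_test)

# 2) Scaling: rescale each column by a different constant and refit the same tree
scale = np.array([1024.0, 8.0])
tree_scaled = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train * scale, y_train)
pred_raw = tree.predict(X_test)
pred_scaled = tree_scaled.predict(X_test * scale)

print(f"Linear R^2: {lin_r2:.2f}, tree R^2: {tree_r2:.2f}")
print("Predictions unchanged by rescaling:", np.allclose(pred_raw, pred_scaled))
```

The linear model can only recover the average slope across both regimes, while the tree splits on the regime feature first. Rescaling changes the split thresholds but not which observations fall on each side, so the predictions are the same.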

### XGBoost vs. LightGBM vs. CatBoost

| | XGBoost | LightGBM | CatBoost |
|---|---------|----------|----------|
| Tree growth | Level-wise | Leaf-wise (faster) | Symmetric |
| Speed | Fast | Fastest | Moderate |
| Categorical | Encoding (or `enable_categorical` in recent versions) | Native support | Best native support |
| Missing values | Native | Native | Native |
| When to use | Default choice | Large datasets | Heavy categorical data |

In [None]:
# Reuse feature pipeline from Week 4 (simplified)
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'JPM', 'JNJ', 'XOM', 'PG', 'V', 'UNH',
           'HD', 'MA', 'PFE', 'COST', 'NKE', 'BAC', 'NVDA', 'META', 'LLY', 'ABBV']

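data = yf.download(tickers, start='2010-01-01', end='2024-12-31', progress=False)
# Forward-fill gaps, then keep only dates on which every ticker has data.
# dropna() therefore trims the sample to start at the latest listing date in the universe.
prices = data['Close'].ffill().dropna()
volume = data['Volume'].ffill().dropna()
returns_daily = prices.pct_change()

# Monthly (month-end) prices and returns.
# pandas >= 2.2 deprecates the 'M' alias in favor of 'ME'.
monthly_prices = prices.resample('M').last()
monthly_returns = monthly_prices.pct_change()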

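# Features (all computed from data available at month-end t)
features = {}
features['mom_1m'] = monthly_prices.pct_change(1)
features['mom_3m'] = monthly_prices.pct_change(3)
# 12-month return lagged one month, i.e. measured from t-13 to t-1.
# The classic 12-1 momentum window (t-12 to t-1) would be pct_change(11).shift(1).
features['mom_12m_skip1'] = monthly_prices.pct_change(12).shift(1)
features['vol_20d'] = returns_daily.rolling(20).std().resample('M').last()  # realized volatility
features['vol_60d'] = returns_daily.rolling(60).std().resample('M').last()
# Note: 'vol_ratio' is a trading-volume ratio (5-day vs 60-day average), not a volatility ratio
features['vol_ratio'] = (volume.rolling(5).mean() / volume.rolling(60).mean()).resample('M').last()
features['ma_ratio'] = (prices / prices.rolling(50).mean()).resample('M').last()  # price vs 50-day MA

# Target: next month's return, aligned to the month-end at which the features are observed
target = monthly_returns.shift(-1)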

## 2. XGBoost for Cross-Sectional Prediction

In [None]:
# Build panel data
months = sorted(set.intersection(*[set(f.index) for f in features.values()]))
months = [m for m in months if m >= pd.Timestamp('2012-01-01') and m <= pd.Timestamp('2024-06-30')]

X_all, y_all, dates_all = [], [], []
for month in months:
    X_cs = pd.DataFrame({name: feat.loc[month] for name, feat in features.items()})
    y_cs = target.loc[month] if month in target.index else pd.Series(dtype=float)
    valid = X_cs.dropna().index.intersection(y_cs.dropna().index)
    if len(valid) > 5:
        X_all.append(X_cs.loc[valid])
        y_all.append(y_cs.loc[valid])
        dates_all.extend([month] * len(valid))

X_panel = pd.concat(X_all)
y_panel = pd.concat(y_all)
dates_panel = np.array(dates_all)

print(f"Panel: {len(X_panel)} observations, {X_panel.shape[1]} features")

In [None]:
# Walk-forward evaluation with an expanding window: each month, refit all three models on prior months only
pred_start = pd.Timestamp('2018-01-31')
ics = {'XGBoost': [], 'LightGBM': [], 'RandomForest': []}

for month in months:
    if month < pred_start:
        continue

    train_mask = dates_panel < month
    test_mask = dates_panel == month

    if test_mask.sum() < 5 or train_mask.sum() < 100:
        continue

    X_tr, y_tr = X_panel[train_mask].values, y_panel[train_mask].values
    X_te, y_te = X_panel[test_mask].values, y_panel[test_mask].values

    # XGBoost
    xgb_model = xgb.XGBRegressor(
        n_estimators=100, max_depth=3, learning_rate=0.1,
        subsample=0.8, colsample_bytree=0.8, verbosity=0
    )
    xgb_model.fit(X_tr, y_tr)
    pred_xgb = xgb_model.predict(X_te)
    ics['XGBoost'].append({'month': month, 'IC': stats.spearmanr(pred_xgb, y_te)[0]})

    # LightGBM (row subsampling only takes effect when subsample_freq > 0)
    lgb_model = lgb.LGBMRegressor(
        n_estimators=100, max_depth=3, learning_rate=0.1,
        subsample=0.8, subsample_freq=1, colsample_bytree=0.8, verbosity=-1
    )
    lgb_model.fit(X_tr, y_tr)
    pred_lgb = lgb_model.predict(X_te)
    ics['LightGBM'].append({'month': month, 'IC': stats.spearmanr(pred_lgb, y_te)[0]})

    # Random Forest
    rf_model = RandomForestRegressor(n_estimators=100, max_depth=5, n_jobs=-1)
    rf_model.fit(X_tr, y_tr)
    pred_rf = rf_model.predict(X_te)
    ics['RandomForest'].append({'month': month, 'IC': stats.spearmanr(pred_rf, y_te)[0]})

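for name in ics:
    ic_vals = np.array([x['IC'] for x in ics[name]], dtype=float)
    ic_vals = ic_vals[~np.isnan(ic_vals)]  # spearmanr returns NaN when predictions are constant
    t_stat = ic_vals.mean() / (ic_vals.std(ddof=1) / np.sqrt(len(ic_vals)))
    print(f"{name}: avg IC = {ic_vals.mean():.4f}, t-stat = {t_stat:.2f}")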
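
The evaluation metric used above is the rank information coefficient (IC): the Spearman correlation between predictions and realized returns within one month's cross-section. Its t-statistic is the mean monthly IC divided by its standard error. The sketch below works through that arithmetic on synthetic data. The panel size and the signal strength of 0.05 are assumptions chosen for illustration, and the result is cross-checked against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_months, n_stocks = 60, 20                   # hypothetical panel size

monthly_ic = []
for _ in range(n_months):
    signal = rng.normal(size=n_stocks)                        # model prediction (synthetic)
    realized = 0.05 * signal + rng.normal(size=n_stocks)      # next-month return, mostly noise
    ic, _ = stats.spearmanr(signal, realized)                 # rank correlation in this cross-section
    monthly_ic.append(ic)
monthly_ic = np.array(monthly_ic)

mean_ic = monthly_ic.mean()
std_err = monthly_ic.std(ddof=1) / np.sqrt(len(monthly_ic))   # sample std, not population std
t_stat = mean_ic / std_err

# Same number from SciPy's one-sample t-test against a mean of zero
t_check = stats.ttest_1samp(monthly_ic, 0.0).statistic
print(f"mean IC = {mean_ic:.4f}, t-stat = {t_stat:.2f}, scipy t-stat = {t_check:.2f}")
```

This treats monthly ICs as independent draws. If they are autocorrelated, the t-statistic is overstated.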

## 3. Feature Importance: SHAP

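**Gain-based importance** (built into XGBoost) is biased toward features with many possible split points (continuous or high-cardinality).  
**SHAP values** (SHapley Additive exPlanations) give consistent, per-prediction attributions that sum exactly to the model output.

SHAP decomposes each prediction: $f(x) = \phi_0 + \sum_j \phi_j$  
where $\phi_0$ is the base value (the average prediction over the background data) and $\phi_j$ is feature $j$'s contribution to this specific prediction.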
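
The additive decomposition can be verified by brute force on a small example. The sketch below computes exact Shapley values by enumerating every feature subset. It assumes a toy three-feature function in place of a fitted model and defines "feature absent" by averaging over a random background sample. It illustrates the definition only and is not the TreeSHAP algorithm that `shap.TreeExplainer` uses.

```python
import itertools
import math
import numpy as np

def model(X):
    """Toy model with an interaction term; stands in for a fitted tree ensemble."""
    return X[:, 0] * X[:, 1] + 2.0 * X[:, 2]

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))   # reference sample that defines "feature absent"
x = np.array([1.0, -0.5, 2.0])           # the observation to explain
n = x.size

def value(subset):
    """Average model output when the features in `subset` are fixed to their values in x."""
    X = background.copy()
    idx = list(subset)
    if idx:
        X[:, idx] = x[idx]
    return model(X).mean()

phi = np.zeros(n)
for j in range(n):
    others = [k for k in range(n) if k != j]
    for size in range(n):
        for S in itertools.combinations(others, size):
            # Shapley weight for a coalition of this size
            weight = math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            phi[j] += weight * (value(S + (j,)) - value(S))

phi_0 = value(())                        # base value: average prediction over the background
f_x = model(x[None, :])[0]
print("phi:", np.round(phi, 3))
print(f"phi_0 + sum(phi) = {phi_0 + phi.sum():.4f}, f(x) = {f_x:.4f}")
```

Enumeration costs $2^n$ model evaluations per feature, which is why tree-specific algorithms are used in practice.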

In [None]:
import shap

# Fit final XGBoost model on all data
xgb_final = xgb.XGBRegressor(
    n_estimators=100, max_depth=3, learning_rate=0.1,
    subsample=0.8, colsample_bytree=0.8, verbosity=0
)
xgb_final.fit(X_panel.values, y_panel.values)

# SHAP values
explainer = shap.TreeExplainer(xgb_final)
shap_values = explainer.shap_values(X_panel.values[-500:])  # last 500 observations

# Bar plot of mean |SHAP| per feature (summary_plot draws on the current figure, so no subplots are needed)
shap.summary_plot(shap_values, X_panel.iloc[-500:], plot_type='bar',
                  show=False, max_display=10)
plt.title('SHAP Feature Importance')
plt.tight_layout()
plt.show()

## 4. Hyperparameter Tuning with Optuna

**Critical:** Use time-series-aware CV — never shuffle across time.
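
For panel data, "time-series-aware" also means splitting on months rather than rows, so that one month's cross-section never lands partly in train and partly in test. Here is a minimal sketch using scikit-learn's `TimeSeriesSplit` on the unique months. The 36-month, 5-stock panel is a made-up placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical panel: 36 months x 5 stocks, one row per (month, stock)
months = np.arange("2020-01", "2023-01", dtype="datetime64[M]")
row_dates = np.repeat(months, 5)

# Split the unique months, then map each fold back to row masks
unique_months = np.unique(row_dates)
tscv = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in tscv.split(np.arange(len(unique_months))):
    train_mask = np.isin(row_dates, unique_months[train_idx])
    test_mask = np.isin(row_dates, unique_months[test_idx])
    folds.append({
        "train_end": row_dates[train_mask].max(),
        "test_start": row_dates[test_mask].min(),
        "test_end": row_dates[test_mask].max(),
        "n_train": int(train_mask.sum()),
        "n_test": int(test_mask.sum()),
    })
    print(folds[-1])
```

Each fold trains on an expanding window and tests on the block of months immediately after it. With overlapping labels you would also leave a gap between the two (the `gap` argument of `TimeSeriesSplit`).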

Key hyperparameters:
- `max_depth`: 3-6 for financial data (deeper = more overfitting)
- `learning_rate`: 0.01-0.3 (lower = more trees needed)
- `n_estimators`: 50-500 (use early stopping)
- `subsample`: 0.6-0.9 (row sampling)
- `colsample_bytree`: 0.6-0.9 (column sampling)
- `reg_alpha`, `reg_lambda`: L1 and L2 penalties on leaf weights (tuned on a log scale below)

**Aggressive regularization is needed** — financial data is extremely noisy.
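
The "use early stopping" advice for `n_estimators` can be sketched as follows: fit more trees than you expect to need, then keep the number that minimizes error on a later, held-out block. This example deliberately swaps in scikit-learn's `GradientBoostingRegressor` and its `staged_predict` method, because the early-stopping arguments in XGBoost differ across versions. The data is synthetic, with a weak signal and heavy noise.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
n = 1500
X = rng.normal(size=(n, 5))
y = 0.1 * X[:, 0] + rng.normal(size=n)        # weak signal, heavy noise (synthetic)

# Rows are assumed to be in time order: fit on the first 70%, validate on the last 30%
cut = int(n * 0.7)
X_tr, y_tr, X_va, y_va = X[:cut], y[:cut], X[cut:], y[cut:]

model = GradientBoostingRegressor(
    n_estimators=300, max_depth=3, learning_rate=0.05, subsample=0.8, random_state=0
)
model.fit(X_tr, y_tr)

# Validation MSE after 1, 2, ..., 300 trees
val_mse = np.array([np.mean((y_va - p) ** 2) for p in model.staged_predict(X_va)])
best_n = int(val_mse.argmin()) + 1

print(f"Best number of trees on the validation block: {best_n} of {len(val_mse)}")
print(f"Validation MSE at best_n: {val_mse[best_n - 1]:.4f}, at 300 trees: {val_mse[-1]:.4f}")
```

With a signal this weak, the best tree count is usually far below the maximum, which is the practical meaning of aggressive regularization.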

In [None]:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 2, 6),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'subsample': trial.suggest_float('subsample', 0.5, 0.9),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.9),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10, log=True),
    }

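    # Time-series CV: 3 expanding splits (train on everything before `split`, test on the next 2 years).
    # Caveat: the test windows span 2016-2021 and overlap the 2018+ walk-forward period above,
    # so tuned parameters should be judged on data after the last test window.
    split_points = [pd.Timestamp('2016-01-31'), pd.Timestamp('2018-01-31'), pd.Timestamp('2020-01-31')]
    test_end_points = [pd.Timestamp('2018-01-31'), pd.Timestamp('2020-01-31'), pd.Timestamp('2022-01-31')]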

    ic_scores = []
    for split, test_end in zip(split_points, test_end_points):
        tr_mask = dates_panel < split
        te_mask = (dates_panel >= split) & (dates_panel < test_end)
        if te_mask.sum() < 10:
            continue
        model = xgb.XGBRegressor(**params, verbosity=0)
        model.fit(X_panel.values[tr_mask], y_panel.values[tr_mask])
        pred = model.predict(X_panel.values[te_mask])

        # Average IC across months in test period
        test_dates = dates_panel[te_mask]
        for m in np.unique(test_dates):
            m_mask = test_dates == m
            if m_mask.sum() > 5:
                ic = stats.spearmanr(pred[m_mask], y_panel.values[te_mask][m_mask])[0]
                if not np.isnan(ic):  # skip months where predictions are constant
                    ic_scores.append(ic)

    return np.mean(ic_scores) if ic_scores else 0

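# Seed the sampler so the search is reproducible across runs
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))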
study.optimize(objective, n_trials=20)  # use 50+ in practice

print(f"\nBest IC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

## Key Takeaways

1. **XGBoost/LightGBM are the production workhorses** of quant finance.
2. **Trees capture interactions** (momentum × volatility) that linear models miss.
3. **SHAP > gain-based importance** for understanding what the model learned.
4. **Aggressive regularization** (shallow trees, row and column subsampling below 1, L1/L2 penalties) prevents overfitting.
5. **Never use shuffled k-fold** with financial data — always temporal CV.
6. **Optuna with time-series CV** is a widely used hyperparameter tuning approach.

**Next week:** Financial ML methodology — triple-barrier labeling, meta-labeling, purged k-fold CV.