# Week 6 — Financial ML Methodology: Labeling, CV, and Backtesting

**Course:** ML for Quantitative Finance  
**Type:** Lecture (90 min)

---

## Why This Matters

Sound methodology is what separates people who "do ML" from people who do ML **correctly** in finance.  
A central message of Lopez de Prado's work is that **methodology matters more than models**.

Three core problems:
1. How to **label** financial data (triple-barrier method)
2. How to **validate** without leaking information (purged k-fold)
3. How to **combine** a primary model with a sizing model (meta-labeling)

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

In [None]:
spy = yf.download('SPY', start='2010-01-01', end='2024-12-31', progress=False)
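# Some yfinance versions return MultiIndex columns (field, ticker); flatten only if present
if isinstance(spy.columns, pd.MultiIndex):
    spy.columns = spy.columns.droplevel(1)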
spy['ret'] = spy['Close'].pct_change()
spy['log_ret'] = np.log(spy['Close'] / spy['Close'].shift(1))
spy = spy.dropna()

## 1. Why Standard ML Methodology Fails in Finance

### Problem 1: K-Fold CV Leaks Information
- Financial data has serial correlation
- Labels overlap (a 10-day return at $t$ shares 9 days with the return at $t+1$)
- Shuffled k-fold puts overlapping samples in train AND test → information leakage
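
To see how strong the overlap is, the following sketch builds 10-day forward returns from synthetic i.i.d. daily returns and measures the correlation between labels at different lags. The synthetic data is an assumption made so that any correlation comes from the overlap alone, not from market dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
daily_ret = rng.normal(0.0, 0.01, size=5000)  # synthetic i.i.d. daily returns

horizon = 10
# 10-day return windows built from cumulative sums of daily returns
csum = np.concatenate([[0.0], np.cumsum(daily_ret)])
fwd = csum[horizon:] - csum[:-horizon]


def label_corr(lag):
    """Correlation between the 10-day return at t and at t + lag."""
    return np.corrcoef(fwd[:-lag], fwd[lag:])[0, 1]


corr_lag1 = label_corr(1)    # windows share 9 of 10 days
corr_lag10 = label_corr(10)  # windows share no days

print(f"lag 1:  {corr_lag1:.2f}")
print(f"lag 10: {corr_lag10:.2f}")
```

In theory the lag-1 correlation is 9/10, matching the nine shared days, while labels 10 days apart share nothing. A shuffled split scatters those near-duplicate neighbours across train and test.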

### Problem 2: Fixed-Threshold Labeling Ignores Volatility
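- Labeling by the sign of the return, or by a fixed cutoff such as $r > 0.5\%$, ignores volatility: a $0.5\%$ move is large in a low-vol regime and close to noise in a high-vol regime
- With a fixed cutoff, the share of samples in each class changes as volatility changes, so class balance drifts across market regimes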
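
A small simulation makes the regime dependence concrete. The two volatility levels (0.5% and 2% daily) and the 0.5% cutoff are illustrative assumptions, and returns are drawn from a zero-mean normal distribution rather than from market data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
fixed_cutoff = 0.005  # label +1 if r > 0.5%

regimes = {"low_vol": 0.005, "high_vol": 0.02}  # assumed daily volatilities
share_fixed, share_scaled = {}, {}
for name, sigma in regimes.items():
    r = rng.normal(0.0, sigma, size=n)
    # Fixed cutoff: same threshold regardless of regime
    share_fixed[name] = float((r > fixed_cutoff).mean())
    # Vol-scaled cutoff: threshold is 1 x daily volatility
    share_scaled[name] = float((r > 1.0 * sigma).mean())

for name in regimes:
    print(f"{name}: fixed cutoff -> {share_fixed[name]:.1%} positive, "
          f"vol-scaled cutoff -> {share_scaled[name]:.1%} positive")
```

Under these assumptions the fixed cutoff labels a much larger share of samples +1 in the high-vol regime, while the vol-scaled cutoff keeps the share roughly constant.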

### Problem 3: Backtesting ≠ Cross-Validation
- Running 1000 backtests and picking the best one → you're overfitting to the backtest
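- The deflated Sharpe ratio adjusts the reported Sharpe for the number of trials behind it (and for non-normal returns), so the best of many backtests is held to a higher bar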
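
The selection effect is easy to reproduce. The sketch below is a simulation under stated assumptions (1000 strategies, one year of daily returns, pure noise with zero true mean); it does not compute the deflated Sharpe ratio itself, only the bias that ratio is meant to correct.

```python
import numpy as np

rng = np.random.default_rng(42)
n_strategies, n_days = 1000, 252

# Every "strategy" is pure noise: zero true mean, 1% daily volatility
rets = rng.normal(0.0, 0.01, size=(n_strategies, n_days))

# Annualized Sharpe ratio of each strategy
sharpe = rets.mean(axis=1) / rets.std(axis=1, ddof=1) * np.sqrt(252)

best_sharpe = sharpe.max()
median_sharpe = np.median(sharpe)

print(f"Median Sharpe across trials: {median_sharpe:.2f}")
print(f"Best Sharpe across trials:   {best_sharpe:.2f}")
```

The median sits near zero, as it should for noise, while the best of the 1000 trials typically shows a Sharpe well above 2 despite having no skill at all.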

## 2. Triple-Barrier Labeling (Lopez de Prado Ch. 3)

Three barriers that define each trade's outcome:
1. **Profit-take barrier** (upper): price hits $+\tau_{pt}$ → label = +1
2. **Stop-loss barrier** (lower): price hits $-\tau_{sl}$ → label = -1  
3. **Time barrier** (vertical): max holding period expires → label = sign of return

The barriers scale with **daily volatility**, making labels regime-adaptive.

$$\tau_{pt} = \sigma_{daily} \times \text{multiplier}_{pt}$$
$$\tau_{sl} = \sigma_{daily} \times \text{multiplier}_{sl}$$
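
As a worked example of the two formulas, the sketch below converts the barrier widths into price levels for one entry price under two volatility levels. The helper name `barrier_levels`, the entry price of 400, and both volatility values are hypothetical and chosen only for illustration.

```python
def barrier_levels(entry_price, sigma_daily, pt_mult=2.0, sl_mult=2.0):
    """Return (upper, lower) price barriers scaled by daily volatility."""
    tau_pt = sigma_daily * pt_mult  # profit-take width as a return
    tau_sl = sigma_daily * sl_mult  # stop-loss width as a return
    upper = entry_price * (1 + tau_pt)
    lower = entry_price * (1 - tau_sl)
    return upper, lower


# Same entry price, calm regime vs. volatile regime
for sigma in (0.006, 0.012):
    upper, lower = barrier_levels(400.0, sigma)
    print(f"sigma={sigma:.3f}: profit-take at {upper:.2f}, stop-loss at {lower:.2f}")
```

Doubling the daily volatility doubles the distance to both barriers, which is what makes the labels regime-adaptive.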

In [None]:
def triple_barrier_labels(prices, vol, pt_mult=2.0, sl_mult=2.0, max_holding=10):
    """Triple-barrier labeling.
    
    Args:
        prices: Series of close prices
        vol: Series of daily volatility (e.g., 20-day rolling std of returns)
        pt_mult: profit-take multiplier × daily vol
        sl_mult: stop-loss multiplier × daily vol
        max_holding: maximum holding period in days
    Returns:
        DataFrame indexed by entry date with columns: label, ret, barrier, days
    """
    results = []

    for i in range(len(prices) - max_holding):
        entry_price = prices.iloc[i]
        entry_date = prices.index[i]
        daily_vol = vol.iloc[i]

        if np.isnan(daily_vol) or daily_vol <= 0:
            continue

        pt = entry_price * (1 + pt_mult * daily_vol)
        sl = entry_price * (1 - sl_mult * daily_vol)

        # Check each day in the holding period; the outer range guarantees
        # that i + j stays inside the price series
        for j in range(1, max_holding + 1):
            current_price = prices.iloc[i + j]

            if current_price >= pt:
                results.append({'date': entry_date, 'label': 1,
                               'ret': (current_price - entry_price) / entry_price,
                               'barrier': 'profit_take', 'days': j})
                break
            elif current_price <= sl:
                results.append({'date': entry_date, 'label': -1,
                               'ret': (current_price - entry_price) / entry_price,
                               'barrier': 'stop_loss', 'days': j})
                break
        else:
            # Time barrier hit
            end_price = prices.iloc[i + max_holding]
            ret = (end_price - entry_price) / entry_price
            results.append({'date': entry_date, 'label': np.sign(ret),
                           'ret': ret, 'barrier': 'time', 'days': max_holding})

    return pd.DataFrame(results).set_index('date')


# Apply to SPY
vol = spy['ret'].rolling(20).std()
labels = triple_barrier_labels(spy['Close'], vol, pt_mult=2.0, sl_mult=2.0, max_holding=10)

print(f"Total labels: {len(labels)}")
print(f"\nBarrier distribution:")
print(labels['barrier'].value_counts())
print(f"\nLabel distribution:")
print(labels['label'].value_counts())

In [None]:
# Compare to fixed-horizon sign labeling (sign of the 10-day forward return)
fixed_labels = np.sign(spy['Close'].pct_change(10).shift(-10)).dropna()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(labels['ret'], bins=50, color='steelblue', edgecolor='white')
axes[0].set_title(f'Triple-Barrier Returns\n+1: {(labels["label"]==1).mean():.0%}, -1: {(labels["label"]==-1).mean():.0%}')

fixed_shares = [float((fixed_labels == v).mean()) for v in (1, 0, -1)]
axes[1].bar(['+1', '0', '-1'], fixed_shares, color='salmon')
axes[1].set_title('Fixed Sign Labeling (10-day forward return)')

plt.tight_layout()
plt.show()

## 3. Meta-Labeling (Lopez de Prado Ch. 3)

**The idea:** Don't build one model. Build two:
1. **Primary model:** Decides direction (buy/sell). Can be simple (e.g., MA crossover)
2. **Meta-model:** Decides whether to *act* on the primary signal and *how much*

The meta-model predicts: "Given the primary model says BUY, will this trade be profitable?"

**Advantages:**
- Primary model doesn't need to be perfect — just needs directional skill
- Meta-model handles position sizing and risk management
- Easier to interpret and debug than a single end-to-end model
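
One way to picture the sizing role is to map the meta-model's predicted probability to a position size. The sketch below uses a simple linear mapping above a threshold; this mapping, the function name `size_positions`, and the 0.5 threshold are assumptions for illustration, not a method prescribed by the lecture.

```python
import numpy as np


def size_positions(primary_side, meta_prob, threshold=0.5):
    """Combine a primary side (+1/-1) with a meta-model probability.

    Returns a signed position in [-1, 1]: zero when the meta-model's
    probability of a profitable trade is at or below `threshold`,
    scaling linearly up to full size at probability 1.
    """
    primary_side = np.asarray(primary_side, dtype=float)
    meta_prob = np.asarray(meta_prob, dtype=float)
    size = np.clip((meta_prob - threshold) / (1.0 - threshold), 0.0, 1.0)
    return primary_side * size


sides = [1, -1, 1]          # primary model: buy, sell, buy
probs = [0.75, 0.90, 0.40]  # meta-model: P(trade is profitable)
positions = size_positions(sides, probs)
print(positions)
```

The primary model alone would take three full-size trades. With the meta-model, the third trade is skipped and the first is taken at reduced size.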

In [None]:
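# Primary model: SMA crossover (50/200)
spy['sma_50'] = spy['Close'].rolling(50).mean()
spy['sma_200'] = spy['Close'].rolling(200).mean()
# Leave the signal undefined (NaN) until the 200-day average exists;
# otherwise the warm-up period would be read as a short signal
spy['signal'] = np.where(spy['sma_50'] > spy['sma_200'], 1.0, -1.0)
spy.loc[spy['sma_200'].isna(), 'signal'] = np.nan

# Generate triple-barrier labels for meta-labeling
# Keep only dates that have both a label and a defined primary signal
meta_labels = labels.copy()
common_idx = meta_labels.index.intersection(spy.index[spy['signal'].notna()])
meta_labels = meta_labels.loc[common_idx]

# Meta-label: 1 if a trade in the direction of the signal made money, else 0.
# The barriers were set for a long position; because pt_mult == sl_mult they
# are symmetric, so the same exit applies to a short with the return sign flipped.
primary_signal = spy.loc[common_idx, 'signal']
meta_labels['meta_label'] = (meta_labels['ret'] * primary_signal > 0).astype(int)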

print(f"Meta-labels: {len(meta_labels)}")
print(f"Meta-label distribution (1=profitable trade):")
print(meta_labels['meta_label'].value_counts(normalize=True))

## 4. Purged K-Fold Cross-Validation (Lopez de Prado Ch. 7)

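Standard k-fold leaks information when labels overlap. The fix:

1. **Purge:** Remove training samples whose label windows overlap the test period
2. **Embargo:** Drop a further gap of training samples right after the test period to limit leakage from any remaining serial correlation

The `PurgedKFold` class in this section approximates both steps with a single buffer on each side of the test block, so that buffer has to cover the longest label window:

$$\text{Embargo} \geq \text{max label duration}$$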
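
The purge step can also be written directly in terms of label windows. The sketch below is one possible reading of the two steps, not the course's implementation: `purge_train_indices`, `t0` (label start times), `t1` (label end times), and an `embargo` counted in samples are all names introduced here for illustration. It assumes `t0` is sorted and the test block is contiguous.

```python
import numpy as np
import pandas as pd


def purge_train_indices(t0, t1, test_idx, embargo=0):
    """Return training positions after purging and embargoing.

    t0, t1   : label start and end times, aligned and sorted by t0
    test_idx : integer positions of the (contiguous) test samples
    embargo  : number of samples to drop right after the test window
    """
    t0 = pd.DatetimeIndex(t0)
    t1 = pd.DatetimeIndex(t1)
    test_start = t0[test_idx].min()
    test_end = t1[test_idx].max()

    # Purge: a label overlaps the test window if it starts before the
    # window ends and ends on or after the window starts
    overlaps = np.asarray((t0 <= test_end) & (t1 >= test_start))
    keep = ~overlaps
    keep[test_idx] = False

    # Embargo: drop the first `embargo` samples that start after the window
    after = np.where(np.asarray(t0 > test_end))[0]
    keep[after[:embargo]] = False
    return np.where(keep)[0]


# 100 daily samples, each label spanning 10 business days
bars = pd.bdate_range("2024-01-01", periods=110)
t0, t1 = bars[:100], bars[10:110]
test_idx = np.arange(40, 60)

train_idx = purge_train_indices(t0, t1, test_idx, embargo=5)
print(f"train samples kept: {len(train_idx)} of {100 - len(test_idx)}")
```

In this toy setup the purge removes the 10 samples before the test block and the 10 after it, since their 10-bar labels reach into the test window, and the embargo removes 5 more after that.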

In [None]:
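class PurgedKFold:
    """Purged K-Fold CV for financial data.

    embargo_days is measured in calendar days (pd.Timedelta), while label
    horizons such as max_holding count trading days. Ten trading days span
    about 14 calendar days, so size the buffer accordingly.
    """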

    def __init__(self, n_splits=5, embargo_days=10):
        self.n_splits = n_splits
        self.embargo_days = embargo_days

    def split(self, dates):
        """Generate purged train/test indices.
        
        Args:
            dates: array of datetime indices for each sample
        Yields:
            (train_indices, test_indices)
        """
        unique_dates = np.sort(np.unique(dates))
        fold_size = len(unique_dates) // self.n_splits

        for i in range(self.n_splits):
            test_start = unique_dates[i * fold_size]
            test_end = unique_dates[min((i + 1) * fold_size - 1, len(unique_dates) - 1)]

            # Buffer: widen the test window by embargo_days (calendar days) on both sides
            embargo_start = test_start - pd.Timedelta(days=self.embargo_days)
            embargo_end = test_end + pd.Timedelta(days=self.embargo_days)

            test_mask = (dates >= test_start) & (dates <= test_end)
            # Purge + embargo: drop training samples dated inside the widened window
            train_mask = (dates < embargo_start) | (dates > embargo_end)

            train_idx = np.where(train_mask)[0]
            test_idx = np.where(test_mask)[0]

            if len(train_idx) > 0 and len(test_idx) > 0:
                yield train_idx, test_idx


# Inspect the purged splits: training sets shrink because samples
# close to each test block are dropped
print("Purged K-Fold splits:")
pkf = PurgedKFold(n_splits=5, embargo_days=15)
dates_arr = labels.index.values
for fold, (train_idx, test_idx) in enumerate(pkf.split(dates_arr)):
    print(f"  Fold {fold}: train={len(train_idx)}, test={len(test_idx)}")

## 5. Demonstrating Information Leakage

In [None]:
from sklearn.model_selection import KFold

# Create feature matrix for the meta-labeling problem
# Features: daily return, 20-day rolling vol, 20-day momentum, primary signal
X_meta = spy.loc[meta_labels.index, ['ret']].copy()
X_meta['vol'] = spy.loc[meta_labels.index, 'ret'].rolling(20).std()
X_meta['mom'] = spy.loc[meta_labels.index, 'Close'].pct_change(20)
X_meta['signal'] = spy.loc[meta_labels.index, 'signal']
X_meta = X_meta.dropna()

y_meta = meta_labels.loc[X_meta.index, 'meta_label']

# Standard K-Fold (WRONG for financial data)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores_standard = []
for train_idx, test_idx in kf.split(X_meta):
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    model.fit(X_meta.values[train_idx], y_meta.values[train_idx])
    score = model.score(X_meta.values[test_idx], y_meta.values[test_idx])
    scores_standard.append(score)

# Purged K-Fold (CORRECT)
pkf = PurgedKFold(n_splits=5, embargo_days=15)
scores_purged = []
for train_idx, test_idx in pkf.split(X_meta.index.values):
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    model.fit(X_meta.values[train_idx], y_meta.values[train_idx])
    score = model.score(X_meta.values[test_idx], y_meta.values[test_idx])
    scores_purged.append(score)

print(f"Standard K-Fold accuracy: {np.mean(scores_standard):.3f} ± {np.std(scores_standard):.3f}")
print(f"Purged K-Fold accuracy:   {np.mean(scores_purged):.3f} ± {np.std(scores_purged):.3f}")
print(f"\nDifference = {np.mean(scores_standard) - np.mean(scores_purged):.3f}")
print("A positive gap is consistent with leakage: shuffled K-Fold trains on samples whose labels overlap the test set.")

## Key Takeaways

1. **Triple-barrier labeling** adapts to volatility regimes. Fixed labels don't.
2. **Meta-labeling** separates direction (primary model) from sizing (meta-model). More interpretable, more robust.
3. **Purged k-fold** removes the leakage from overlapping labels that inflates standard CV scores.
4. **The gap between standard and purged CV** is a rough estimate of how much of the standard score comes from leakage.
5. **Methodology > models.** A Ridge validated with correct CV is more trustworthy than an XGBoost validated with leaky CV.

**Next week:** PyTorch fundamentals — transitioning from sklearn to neural networks for finance.