# Week 14 — Backtesting, Evaluation & Causal Inference

**Course:** ML for Quantitative Finance  
**Type:** Lecture (90 min)

---

## Why This Matters

A bad model with honest backtesting is more valuable than a great model with flawed backtesting.  
Many published strategies fail out-of-sample because their backtests were overfitted.  
This lecture teaches you to evaluate strategies the way real quant firms do.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import yfinance as yf
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

## 1. The Anatomy of a Flawed Backtest

Common sources of backtest overfitting:

1. **Look-ahead bias:** Using future information (even subtly, e.g., survivorship-biased universes)
2. **Multiple testing:** Trying 1000 strategies and reporting the best one
3. **Unrealistic assumptions:** Zero transaction costs, unlimited liquidity, instant execution
4. **In-sample selection:** Choosing the "best" parameters using the same data you evaluate on

**Key insight (Lopez de Prado):** A Sharpe ratio of 2.0 picked as the best of 100 tried strategies can be worth less than a Sharpe of 0.8 from a single hypothesis-driven strategy.
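
To make the first item concrete, here is a minimal sketch of how a one-day look-ahead inflates a backtest on pure noise. The signal is hypothetical and chosen only for illustration: the sign of the same day's return, which is not known until that day's close. Lagging the signal by one day removes the apparent edge.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Pure-noise daily returns: zero mean, 1% daily volatility (illustrative values)
rets = pd.Series(rng.normal(0.0, 0.01, 2520))

# Hypothetical signal: sign of the return, only observable at that day's close
signal = np.sign(rets)

def ann_sharpe(r):
    """Annualized Sharpe ratio of a daily return series (NaNs dropped)."""
    r = r.dropna()
    return r.mean() / r.std() * np.sqrt(252)

# Flawed: trades on the signal during the same day that generates it
sharpe_lookahead = ann_sharpe(signal * rets)
# Honest: trades on yesterday's signal
sharpe_lagged = ann_sharpe(signal.shift(1) * rets)

print(f"Sharpe with look-ahead:    {sharpe_lookahead:.2f}")
print(f"Sharpe with lagged signal: {sharpe_lagged:.2f}")
```

Real look-ahead bugs are rarely this blatant, but the mechanism is the same: a missing `shift(1)` or a feature computed with end-of-period data.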

In [None]:
# Demo: How easy it is to find a "great" strategy by chance
np.random.seed(42)
n_days = 2520  # 10 years
n_trials = 1000

# Simulate pure noise returns (zero mean, so no strategy has a true edge)
market_returns = np.random.normal(0.0, 0.01, n_days)

# Try random strategies
prices = pd.Series((1 + market_returns).cumprod())  # same price path for every trial
best_sharpe = -np.inf
sharpes = []
for _ in range(n_trials):
    # Random MA crossover params: long when fast MA > slow MA, short when below
    fast = np.random.randint(2, 20)
    slow = np.random.randint(20, 100)
    # np.sign keeps NaN during the warm-up, so no position is taken before the slow MA exists
    signal = np.sign(prices.rolling(fast).mean() - prices.rolling(slow).mean())
    # Trade on yesterday's signal to avoid look-ahead
    strat_returns = signal.shift(1).values * market_returns
    strat_returns = strat_returns[~np.isnan(strat_returns)]
    sr = np.mean(strat_returns) / np.std(strat_returns) * np.sqrt(252)
    sharpes.append(sr)
    if sr > best_sharpe:
        best_sharpe = sr

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].hist(sharpes, bins=50, color='steelblue', edgecolor='white')
axes[0].axvline(best_sharpe, color='red', linestyle='--', label=f'Best: {best_sharpe:.2f}')
axes[0].set_title(f'Sharpe Ratios from {n_trials} Random Strategies on Noise')
axes[0].set_xlabel('Sharpe Ratio')
axes[0].legend()

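# Expected max Sharpe under the null (true Sharpe = 0 for every strategy).
# An annualized Sharpe estimated from n_days of data has standard error ≈ sqrt(252 / n_days),
# so scale the approximate expected max of N independent standard normals by it.
# Correlated trials (like the MA crossovers above) behave like fewer independent ones.
trials_range = np.arange(2, 501)  # start at 2: the approximation is -inf at N = 1
se_sharpe = np.sqrt(252 / n_days)
expected_max = se_sharpe * stats.norm.ppf(1 - 1/trials_range)  # approx E[max(Z_1,...,Z_n)]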
axes[1].plot(trials_range, expected_max, color='steelblue')
axes[1].set_xlabel('Number of Strategies Tried')
axes[1].set_ylabel('Expected Best Sharpe (under null)')
axes[1].set_title('Multiple Testing: More Tries → Higher "Best" Sharpe')

plt.tight_layout()
plt.show()
print(f"With {n_trials} random strategies on pure noise, best Sharpe = {best_sharpe:.2f}")
print(f"This is statistically meaningless — it's selection bias.")

## 2. The Deflated Sharpe Ratio

Bailey & Lopez de Prado (2014): Adjust the Sharpe ratio for:
- Number of strategies tried ($N$)
- Skewness and kurtosis of returns
- Length of the backtest

$$\text{DSR} = P\left[\hat{SR} > SR_0 \;\middle|\; \hat{\gamma}_3, \hat{\gamma}_4, T, N\right]$$

where $SR_0$ is the expected max Sharpe under the null (depends on $N$).
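
Written out for $N > 1$, following Bailey & Lopez de Prado (2014), with $\hat{SR}$ measured per period (not annualized) so that it matches the $T$ return observations. The symbols $\Phi$ (standard normal CDF), $\gamma_{EM} \approx 0.5772$ (Euler-Mascheroni constant) and $\sigma_{SR}$ (standard deviation of the Sharpe estimates across the $N$ trials) are notation introduced here:

$$SR_0 \approx \sigma_{SR}\left[(1-\gamma_{EM})\,\Phi^{-1}\!\left(1-\frac{1}{N}\right) + \gamma_{EM}\,\Phi^{-1}\!\left(1-\frac{1}{N e}\right)\right]$$

$$\text{DSR} = \Phi\!\left(\frac{\left(\hat{SR}-SR_0\right)\sqrt{T-1}}{\sqrt{1-\hat{\gamma}_3\,\hat{SR}+\frac{\hat{\gamma}_4-1}{4}\,\hat{SR}^2}}\right)$$

For a single trial, $SR_0 = 0$ and the DSR reduces to the probability that the true Sharpe ratio is positive.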

In [None]:
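def deflated_sharpe_ratio(sharpe_obs, n_trials, T, skew=0, kurtosis=3, sr_std=None):
    """Probability that the true Sharpe exceeds the best Sharpe expected from luck alone.

    Args:
        sharpe_obs: Observed Sharpe ratio per period (NOT annualized), i.e. at the
            same frequency as the T return observations
        n_trials: Number of strategies tried
        T: Number of return observations
        skew: Skewness of returns
        kurtosis: Kurtosis of returns (3 = normal)
        sr_std: Std. dev. of the per-period Sharpe estimates across trials.
            If None, the null value 1/sqrt(T - 1) is assumed (a simplification;
            Bailey & Lopez de Prado estimate it from the trials themselves)

    Returns:
        DSR probability (0 to 1)
    """
    if sr_std is None:
        sr_std = 1 / np.sqrt(T - 1)

    # Expected max Sharpe under null (Bailey & Lopez de Prado, 2014)
    euler_mascheroni = 0.5772
    if n_trials > 1:
        expected_max_z = ((1 - euler_mascheroni) * stats.norm.ppf(1 - 1 / n_trials)
                          + euler_mascheroni * stats.norm.ppf(1 - 1 / (n_trials * np.e)))
    else:
        expected_max_z = 0.0  # single trial: benchmark is a true Sharpe of zero
    sr0 = sr_std * expected_max_z

    # Standard error of the per-period Sharpe ratio, adjusted for skewness and kurtosis
    se_sr = np.sqrt((1 + 0.5 * sharpe_obs**2 - skew * sharpe_obs +
                     (kurtosis - 3) / 4 * sharpe_obs**2) / (T - 1))

    # PSR: probability that true SR > sr0
    z = (sharpe_obs - sr0) / se_sr
    dsr = stats.norm.cdf(z)
    return dsr


# Example: A strategy with annualized Sharpe 1.5 over 10 years of daily returns
print("Deflated Sharpe Ratio Analysis:")
print("="*50)
sr_daily = 1.5 / np.sqrt(252)  # de-annualize to match T (daily observations)
for n in [1, 5, 10, 50, 100, 500]:
    dsr = deflated_sharpe_ratio(sharpe_obs=sr_daily, n_trials=n, T=2520)
    print(f"  Tried {n:>3} strategies: DSR = {dsr:.3f} {'✓' if dsr > 0.95 else '✗'}")

print("\n→ The same Sharpe of 1.5 becomes less convincing with every additional")
print("  strategy you tried before finding it.")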

In [None]:
# Visualize DSR surface
sharpes_grid = np.linspace(0.5, 3.0, 50)  # annualized Sharpe ratios
trials_grid = [1, 5, 10, 50, 100, 500]

fig, ax = plt.subplots(figsize=(10, 6))
for n in trials_grid:
    # De-annualize so the Sharpe is in the same (daily) units as T
    dsrs = [deflated_sharpe_ratio(sr / np.sqrt(252), n, T=2520) for sr in sharpes_grid]
    ax.plot(sharpes_grid, dsrs, label=f'N={n} trials')

ax.axhline(0.95, color='red', linestyle='--', alpha=0.5, label='95% threshold')
ax.set_xlabel('Observed Sharpe Ratio (annualized)')
ax.set_ylabel('Deflated Sharpe Ratio (probability)')
ax.set_title('Deflated Sharpe Ratio vs. Number of Trials')
ax.legend()
plt.tight_layout()
plt.show()

## 3. Walk-Forward Optimization

The only honest way to backtest:

```
For each month t:
  1. Train model on data up to month t-1 (expanding or rolling window)
  2. Generate predictions for month t
  3. Form portfolio based on predictions
  4. Record actual returns for month t
  5. Move to month t+1
```

**Critical:** Never retrain using any information from the test period.
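
The loop above can be sketched as a small generator of train/test splits. This is an illustrative helper, not the lecture's implementation: the name `walk_forward_splits` and its parameters `min_train` and `window` are assumptions made for this example.

```python
import pandas as pd

def walk_forward_splits(periods, min_train=24, window=None):
    """Yield (train_periods, test_period) pairs in chronological order.

    periods:   sorted sequence of period labels (e.g. months)
    min_train: number of training periods required before the first test
    window:    None for an expanding window, or an int for a rolling window
    """
    periods = list(periods)
    for i in range(min_train, len(periods)):
        start = 0 if window is None else max(0, i - window)
        # Training data ends strictly before the test period
        yield periods[start:i], periods[i]

months = list(pd.period_range("2020-01", periods=30, freq="M"))
splits = list(walk_forward_splits(months, min_train=24))              # expanding
rolling = list(walk_forward_splits(months, min_train=24, window=12))  # rolling

for train, test in splits[:2]:
    print(f"train {train[0]} .. {train[-1]} ({len(train)} months) -> test {test}")
```

When the label is a forward return, the same rule applies to labels: the last training label must already be realized at the moment the test prediction is made.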

In [None]:
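# Load data for walk-forward demo
# NOTE: a hand-picked list of today's large caps is itself survivorship-biased (see Section 1).
# That is acceptable for a teaching demo, not for research.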
TICKERS = [
    'AAPL', 'MSFT', 'AMZN', 'GOOGL', 'META', 'NVDA', 'JPM', 'JNJ', 'V', 'PG',
    'UNH', 'HD', 'MA', 'DIS', 'BAC', 'XOM', 'CSCO', 'PFE', 'COST', 'ABT',
    'PEP', 'AVGO', 'CRM', 'NKE', 'CVX', 'WMT', 'MRK', 'LLY', 'ABBV', 'INTC',
    'T', 'VZ', 'QCOM', 'TXN', 'PM', 'UNP', 'NEE', 'LOW', 'BMY', 'AMGN',
]

cache_path = Path('w14_data_cache.pkl')
if cache_path.exists():
    raw = pd.read_pickle(cache_path)
else:
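    # auto_adjust=True makes 'Close' split- and dividend-adjusted regardless of the yfinance default
    raw = yf.download(TICKERS, start='2010-01-01', end='2024-12-31', auto_adjust=True, progress=True)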
    raw.to_pickle(cache_path)

prices = raw['Close'].ffill().dropna(axis=1, thresh=int(0.8 * len(raw)))
returns_daily = prices.pct_change()
monthly_prices = prices.resample('M').last()
monthly_returns = monthly_prices.pct_change()

# Features
features = {}
features['mom_1m'] = monthly_prices.pct_change(1)
features['mom_3m'] = monthly_prices.pct_change(3)
features['mom_6m'] = monthly_prices.pct_change(6)
features['mom_12m_skip1'] = monthly_prices.pct_change(12).shift(1)
features['vol_20d'] = returns_daily.rolling(20).std().resample('M').last()
features['vol_60d'] = returns_daily.rolling(60).std().resample('M').last()
target = monthly_returns.shift(-1)

print(f"Universe: {prices.shape[1]} stocks, {len(monthly_prices)} months")

In [None]:
# Walk-forward backtest
pred_start = pd.Timestamp('2016-01-31')
months = sorted(monthly_prices.index)
pred_months = [m for m in months if m >= pred_start and m in target.index]

portfolio_returns = []

for month in pred_months:
    # Build cross-section for this month
    train_months = [m for m in months if m < month and m >= pd.Timestamp('2012-01-01')]
    if len(train_months) < 24:
        continue

    # Assemble training data
    X_train, y_train = [], []
    for tm in train_months:
        X_cs = pd.DataFrame({n: f.loc[tm] for n, f in features.items() if tm in f.index})
        y_cs = target.loc[tm] if tm in target.index else pd.Series(dtype=float)
        valid = X_cs.dropna().index.intersection(y_cs.dropna().index)
        if len(valid) > 5:
            X_train.append(X_cs.loc[valid])
            y_train.append(y_cs.loc[valid])

    if not X_train:
        continue

    X_tr = pd.concat(X_train)
    y_tr = pd.concat(y_train)

    # Test data for this month
    X_te = pd.DataFrame({n: f.loc[month] for n, f in features.items() if month in f.index})
    y_te = target.loc[month] if month in target.index else pd.Series(dtype=float)
    valid_te = X_te.dropna().index.intersection(y_te.dropna().index)
    if len(valid_te) < 5:
        continue

    # Train and predict
    model = RandomForestRegressor(n_estimators=100, max_depth=4, n_jobs=-1, random_state=42)
    model.fit(X_tr.values, y_tr.values)
    pred = pd.Series(model.predict(X_te.loc[valid_te].values), index=valid_te)

    # Long-short portfolio: top quintile long, bottom quintile short
    n_stocks = len(pred) // 5
    if n_stocks < 2:
        continue
    longs = pred.nlargest(n_stocks).index
    shorts = pred.nsmallest(n_stocks).index
    actual = y_te.loc[valid_te]
    port_ret = actual.loc[longs].mean() - actual.loc[shorts].mean()

    # Apply transaction costs: 10 bps per month in total (5 bps per side × 2 sides)
    tc = 0.001
    portfolio_returns.append({'month': month, 'return': port_ret - tc})

results = pd.DataFrame(portfolio_returns).set_index('month')
print(f"Walk-forward backtest: {len(results)} months")

In [None]:
# Evaluate
cum_returns = (1 + results['return']).cumprod()
sharpe = results['return'].mean() / results['return'].std() * np.sqrt(12)
max_dd = (cum_returns / cum_returns.cummax() - 1).min()
# Sortino uses downside deviation: root-mean-square of returns below the 0% target
sortino_denom = np.sqrt((np.minimum(results['return'], 0) ** 2).mean())
sortino = results['return'].mean() / sortino_denom * np.sqrt(12) if sortino_denom > 0 else np.nan
calmar = (results['return'].mean() * 12) / abs(max_dd) if max_dd != 0 else np.nan

print("Walk-Forward Strategy Performance:")
print(f"  Annualized Return: {results['return'].mean() * 12:.1%}")
print(f"  Sharpe Ratio: {sharpe:.2f}")
print(f"  Sortino Ratio: {sortino:.2f}")
print(f"  Calmar Ratio: {calmar:.2f}")
print(f"  Max Drawdown: {max_dd:.1%}")
print(f"  Hit Rate: {(results['return'] > 0).mean():.0%}")

# Deflated Sharpe (pass the per-month Sharpe, because T counts monthly observations)
ret_vals = results['return'].values
dsr = deflated_sharpe_ratio(
    sharpe_obs=sharpe / np.sqrt(12), n_trials=1, T=len(ret_vals),
    skew=stats.skew(ret_vals), kurtosis=stats.kurtosis(ret_vals, fisher=False)
)
print(f"  DSR (1 trial): {dsr:.3f}")

fig, axes = plt.subplots(2, 1, figsize=(14, 8))
axes[0].plot(cum_returns, color='steelblue')
axes[0].set_title(f'Walk-Forward Long-Short Portfolio (Sharpe={sharpe:.2f})')
axes[0].set_ylabel('Cumulative Return')

drawdown = cum_returns / cum_returns.cummax() - 1
axes[1].fill_between(drawdown.index, drawdown.values, 0, color='salmon', alpha=0.5)
axes[1].set_title('Drawdown')
axes[1].set_ylabel('Drawdown')

plt.tight_layout()
plt.show()

## 4. Transaction Costs and Realistic Assumptions

**The cost stack:**
- Commission: ~0-2 bps (near zero for large brokers)
- Bid-ask spread: 2-10 bps for liquid large-caps
- Market impact: Depends on order size / ADV
- **Rule of thumb:** 5-10 bps per side for liquid US equities

**Turnover matters:** Costs scale linearly with turnover, so a strategy that turns over 200% per month pays 4× the costs of one that turns over 50%, and needs 4× the gross alpha just to break even.
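
The arithmetic behind the turnover statement, as a sketch. The helper name `monthly_cost_drag` and the 10 bps per-side cost are illustrative assumptions; turnover is taken to mean the fraction of the portfolio replaced each month, so every unit of turnover costs one sell and one buy.

```python
def monthly_cost_drag(turnover, cost_per_side_bps):
    """Monthly return lost to trading costs.

    turnover:          fraction of the portfolio replaced per month (2.0 = 200%)
    cost_per_side_bps: cost of a single buy or sell, in basis points
    Replacing a position takes a sell and a buy, hence the factor of 2.
    """
    return turnover * 2 * cost_per_side_bps / 10_000

low = monthly_cost_drag(turnover=0.5, cost_per_side_bps=10)   # 50% turnover
high = monthly_cost_drag(turnover=2.0, cost_per_side_bps=10)  # 200% turnover

print(f"50% turnover:  {low:.2%} per month, {low * 12:.1%} per year")
print(f"200% turnover: {high:.2%} per month, {high * 12:.1%} per year")
print(f"Cost ratio: {high / low:.0f}x")
```

Under these assumptions the high-turnover strategy has to earn several percentage points of gross return per year before it makes anything net of costs.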

In [None]:
# Impact of transaction costs
gross_returns = results['return'] + 0.001  # add back the 10bps we subtracted
tc_levels = [0, 5, 10, 20, 50]  # bps per side

print("Impact of Transaction Costs (assuming 100% monthly turnover):")
print("="*60)
for tc_bps in tc_levels:
    tc = tc_bps * 2 / 10000  # both sides
    net = gross_returns - tc
    sr = net.mean() / net.std() * np.sqrt(12)
    ann_ret = net.mean() * 12
    print(f"  {tc_bps:>2} bps/side: Sharpe = {sr:.2f}, Ann. Return = {ann_ret:.1%}")

print("\n→ Many academic strategies look great at 0 costs but die at realistic costs.")

## 5. Causal Inference in Factor Investing

**Lopez de Prado (2023): Most factors are mirages.**

The standard workflow:
1. Mine data for patterns → 2. Find correlation → 3. Publish "factor"

The correct workflow:
1. Economic theory → 2. Causal hypothesis → 3. Testable prediction → 4. Backtest

**Why this matters:** A correlation that merely happens to exist in-sample has no reason to persist out-of-sample.  
A causal mechanism is far more likely to persist because it reflects how markets actually work.
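
A toy simulation of the difference, under assumptions invented for this example: a hidden driver `z` moves both a candidate "factor" `x` and the returns `y`, while `x` itself has no causal effect on `y`. The correlation looks solid until `x` decouples from `z`, at which point it vanishes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# In-sample regime: hidden driver z moves both the "factor" x and the returns y.
# x has no causal effect on y.
z = rng.normal(size=n)
x = z + rng.normal(scale=0.5, size=n)
y = 0.5 * z + rng.normal(size=n)
corr_in_sample = np.corrcoef(x, y)[0, 1]

# Regime change: x no longer tracks z, while y is still driven by z
z_new = rng.normal(size=n)
x_new = rng.normal(scale=np.sqrt(1.25), size=n)  # same variance as x, no link to z
y_new = 0.5 * z_new + rng.normal(size=n)
corr_after = np.corrcoef(x_new, y_new)[0, 1]

print(f"Correlation while x tracks the hidden driver: {corr_in_sample:.2f}")
print(f"Correlation after x decouples from it:        {corr_after:.2f}")
```

A backtest on the first regime cannot distinguish `x` from a causal factor; only a hypothesis about the mechanism tells you what would make the relationship break.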

In [None]:
# Demo: Spurious factors
np.random.seed(42)
n = 1000
n_candidates = 200

# Generate random "factors" and random returns
returns = np.random.normal(0, 0.02, n)
factors = np.random.normal(0, 1, (n, n_candidates))

# Find the "best" factor by in-sample correlation
correlations = [np.corrcoef(factors[:, i], returns)[0, 1] for i in range(n_candidates)]
best_idx = np.argmax(np.abs(correlations))

print(f"Best 'factor' out of {n_candidates} random ones:")
print(f"  In-sample correlation: {correlations[best_idx]:.4f}")
print(f"  t-stat: {correlations[best_idx] * np.sqrt(n-2) / np.sqrt(1 - correlations[best_idx]**2):.2f}")
print(f"  Looks significant! But it's pure noise.")

# Out-of-sample test
oos_returns = np.random.normal(0, 0.02, n)
oos_corr = np.corrcoef(factors[:, best_idx], oos_returns)[0, 1]
print(f"\n  Out-of-sample correlation: {oos_corr:.4f}")
print(f"  → The 'factor' completely fails out-of-sample.")
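
To put a number on why the best of 200 candidates "looks significant", the significance threshold can be adjusted for the number of candidates tested. The sketch below uses a Bonferroni correction, which is a standard but conservative choice and an assumption of this example rather than something the lecture prescribes. The sample size and candidate count mirror the demo above.

```python
from scipy import stats

n_obs = 1000         # observations per factor, as in the demo above
n_candidates = 200   # number of factors tested
alpha = 0.05         # family-wise significance level

# Two-sided critical |t| for a single pre-specified test
t_single = stats.t.ppf(1 - alpha / 2, df=n_obs - 2)
# Bonferroni: each of the 200 tests must pass at alpha / 200
t_bonferroni = stats.t.ppf(1 - alpha / (2 * n_candidates), df=n_obs - 2)

# Chance that at least one of 200 independent noise factors clears the single-test bar
p_false_positive = 1 - (1 - alpha) ** n_candidates

print(f"Critical |t|, one test:        {t_single:.2f}")
print(f"Critical |t|, Bonferroni(200): {t_bonferroni:.2f}")
print(f"P(at least one false positive at the single-test bar): {p_false_positive:.4f}")
```

A t-stat that clears the single-test bar but not the adjusted one is what you should expect from noise when this many candidates are screened.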

## 6. Strategy Evaluation Metrics

Beyond the Sharpe ratio — a comprehensive evaluation toolkit:

In [None]:
def full_tear_sheet(returns_series, name='Strategy', rf=0.0):
    """Compute comprehensive strategy metrics."""
    r = returns_series.dropna()
    excess = r - rf / 12  # monthly rf
    cum = (1 + r).cumprod()
    dd = cum / cum.cummax() - 1

    metrics = {
        'Ann. Return': r.mean() * 12,
        'Ann. Volatility': r.std() * np.sqrt(12),
        'Sharpe': excess.mean() / r.std() * np.sqrt(12) if r.std() > 0 else np.nan,
        # Downside deviation: root-mean-square of excess returns below zero
        'Sortino': excess.mean() / np.sqrt((np.minimum(excess, 0) ** 2).mean()) * np.sqrt(12) if (excess < 0).any() else np.nan,
        'Calmar': (r.mean() * 12) / abs(dd.min()) if dd.min() != 0 else np.nan,
        'Max Drawdown': dd.min(),
        'Hit Rate': (r > 0).mean(),
        'Profit Factor': abs(r[r > 0].sum() / r[r < 0].sum()) if (r < 0).sum() > 0 else np.nan,
        'Skewness': stats.skew(r),
        'Kurtosis': stats.kurtosis(r, fisher=False),
        'VaR 5%': np.percentile(r, 5),
        'CVaR 5%': r[r <= np.percentile(r, 5)].mean(),
        'Tail Ratio': abs(np.percentile(r, 95) / np.percentile(r, 5)) if np.percentile(r, 5) != 0 else np.nan,
    }

    print(f"\n{'='*50}")
    print(f" {name} — Performance Summary")
    print(f"{'='*50}")
    for k, v in metrics.items():
        if isinstance(v, float):
            if 'Rate' in k or 'Return' in k or 'Volatility' in k or 'Drawdown' in k or 'VaR' in k or 'CVaR' in k:
                print(f"  {k:<20s}: {v:.1%}")
            else:
                print(f"  {k:<20s}: {v:.3f}")
    return metrics


metrics = full_tear_sheet(results['return'], name='Walk-Forward RF Long-Short')

## 7. What Quant Firms Look For

1. **Intellectual honesty:** Do you acknowledge limitations? Or do you oversell?
2. **Realistic assumptions:** Transaction costs, slippage, capacity
3. **Out-of-sample validation:** Walk-forward, not in-sample optimization
4. **Economic intuition:** Why should this strategy work? What is the source of return?
5. **Clean code:** Readable, reproducible, well-documented

**The hierarchy:**
```
Theory → Hypothesis → Prediction → Data → Backtest → Evaluation
NOT:
Data → Mine → Find pattern → Backtest → Publish
```

## Key Takeaways

1. **Multiple testing kills most backtests.** The Deflated Sharpe Ratio corrects for this.
2. **Walk-forward is the gold standard.** Never train on test data, never look ahead.
3. **Transaction costs are real.** A strategy with Sharpe 2.0 at 0 costs may have Sharpe 0.3 at realistic costs.
4. **Correlation ≠ causation.** Data-mined factors fail out-of-sample. Start from economic theory.
5. **Honesty > performance.** A Sharpe of 0.8 with honest methodology beats a Sharpe of 3.0 from overfitting.

**The capstone project should demonstrate all of these principles.**