---
title: "Lab 6B: Backtest Overfitting (CSCV, PBO, PSR/DSR)"
format:
  html:
    toc: true
bibliography: ../resources/reading.bib
execute:
  echo: false
  warning: false
  message: false
---

## Before You Code: The Big Picture

A central problem in quantitative finance: **your backtest looks great, but the strategy fails in live trading**. The usual cause is **overfitting**: parameters were optimized and evaluated on the same data, so the apparent "alpha" is really selection bias.

::: {.callout-note}
## The Backtest Overfitting Problem

**The Scenario:**
You test 200 trading strategies on 20 years of data. One strategy has a Sharpe ratio of 2.5—amazing! You deploy it with real money. It loses money immediately. What happened?

**The Problem:**
- With 200 tries, **one will look good by pure luck** (multiple testing)
- In-sample optimization + in-sample testing = guaranteed overfitting
- Standard k-fold cross-validation does not fix this: shuffled folds leak information across time, and the procedure ignores how many strategies were tried

**The Solution (Bailey & López de Prado):**
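1. **CSCV (Combinatorially Symmetric Cross-Validation)**: Splits the sample into time blocks and evaluates every balanced in-sample/out-of-sample combination of those blocks
2. **PBO (Probability of Backtest Overfitting)**: Quantifies selection bias
3. **PSR (Probabilistic Sharpe Ratio)**: Estimates the probability that the true Sharpe exceeds a benchmark (e.g., 0), adjusting for sample length, skewness, and kurtosis
4. **DSR (Deflated Sharpe Ratio)**: PSR with the benchmark raised to account for the number of trials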

**The Evidence:**
Harvey, Liu & Zhu (2016, RFS) argue that most claimed factor findings are likely false once multiple testing is accounted for. PBO and PSR/DSR help flag this **before** you lose real money.
:::

### What You'll Build Today

By the end of this lab, you will have:

- ✅ Understanding of why standard backtesting fails
- ✅ CSCV implementation for honest validation
- ✅ PBO calculation showing selection bias
- ✅ PSR/DSR metrics for performance significance
- ✅ Critical perspective on published trading strategies

**Time estimate:** 90-120 minutes (this is advanced material—take your time)

::: {.callout-important}
## Why This Matters for Coursework 2
Your factor replication **must** use walk-forward validation and report PBO/PSR. Otherwise, your Sharpe ratio is meaningless—it's just in-sample optimization parading as out-of-sample performance. This lab shows you how to do it right.
:::

# Objectives

- Diagnose backtest overfitting with combinatorially symmetric cross‑validation (CSCV)  
- Estimate Probability of Backtest Overfitting (PBO)  
- Quantify performance significance via Probabilistic Sharpe Ratio (PSR); discuss Deflated Sharpe Ratio (DSR)

::: callout-note
This lab follows Bailey & López de Prado's approach to selection bias: CSCV → PBO and PSR/DSR. We implement lightweight utilities and show how to compare against `mlfinlab` if available.
:::

# Setup

In [None]:
import sys, pathlib
# Ensure project root (parent of labs/) is on the Python path for `scripts/`
sys.path.append(str(pathlib.Path().resolve().parent))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scripts.overfit_metrics import (
    generate_noise_strategies,
    sharpe_ratio,
    probabilistic_sharpe_ratio,
    cscv_pbo,
)

# Optional: compare with mlfinlab if installed
try:
    from mlfinlab.backtest_statistics import deflated_sharpe_ratio as dsr_mlfinlab
except Exception:
    dsr_mlfinlab = None

np.random.seed(123)

# Part A — A garden of strategies on pure noise

We simulate `N=200` strategies with no true edge. In a finite sample, one will “win” in‑sample by chance.
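
The source of `generate_noise_strategies` is not shown in this lab. As a rough idea of what such a generator might do, the sketch below draws zero-mean returns with a common pairwise correlation `rho` from a one-factor model. The function name `make_equicorrelated_noise` and the unit-variance scaling are assumptions made for illustration, not the course implementation.

```python
import numpy as np

def make_equicorrelated_noise(T, N, rho=0.2, seed=None):
    """Hypothetical stand-in: T x N zero-mean, unit-variance returns
    whose pairwise correlation is rho (one common factor plus noise)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    rng = np.random.default_rng(seed)
    common = rng.standard_normal((T, 1))   # factor shared by all strategies
    idio = rng.standard_normal((T, N))     # strategy-specific noise
    return np.sqrt(rho) * common + np.sqrt(1.0 - rho) * idio

X_demo = make_equicorrelated_noise(T=240, N=200, rho=0.2, seed=123)

# Average off-diagonal sample correlation should sit near rho
corr = np.corrcoef(X_demo, rowvar=False)
avg_corr = (corr.sum() - 200) / (200 * 199)
print(X_demo.shape, round(float(avg_corr), 2))
```

No column has a true edge here, so any Sharpe ratio that stands out is produced by sampling noise alone.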

In [None]:
T, N = 240, 200  # 20 years of monthly returns (approx)
X = generate_noise_strategies(T=T, N=N, rho=0.2, seed=123)

# In‑sample Sharpe ratios across strategies
sr_all = np.array([sharpe_ratio(X[:, j]) for j in range(N)])
j_star = int(np.argmax(sr_all))
sr_star = sr_all[j_star]

fig, ax = plt.subplots(1,1, figsize=(7,4))
ax.hist(sr_all, bins=30, color='tab:gray', alpha=0.8)
ax.axvline(sr_star, color='r', linestyle='--', label=f'Winner SR≈{sr_star:.2f}')
ax.set_title('In‑sample Sharpe across noise strategies')
ax.legend(); plt.tight_layout(); plt.show()

Observation: Even with zero true edge, the best in‑sample Sharpe can look compelling.
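
How large should the luckiest Sharpe be? Under the simplifying assumption of $N$ independent trials that all have a true Sharpe of zero, Bailey & López de Prado approximate the expected maximum as

$$
\mathbb{E}\left[\max_{n} \widehat{SR}_n\right] \approx \sqrt{V}\left[(1-\gamma)\,\Phi^{-1}\!\left(1-\frac{1}{N}\right) + \gamma\,\Phi^{-1}\!\left(1-\frac{1}{N e}\right)\right]
$$

where $V$ is the variance of the Sharpe estimates across trials, $\gamma \approx 0.5772$ is the Euler-Mascheroni constant, and $\Phi^{-1}$ is the inverse standard normal CDF. The strategies in this lab are correlated (`rho=0.2`), so the effective number of independent trials is smaller than $N$ and the formula should be read as a rough guide rather than an exact prediction.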

# Part B — CSCV and Probability of Backtest Overfitting (PBO)

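We split the time axis into contiguous blocks. For each way of assigning half of the blocks to in‑sample and the other half to out‑of‑sample, we pick the in‑sample “champion” and record its relative rank out‑of‑sample. PBO is the fraction of splits in which the champion lands below the out‑of‑sample median, i.e. its logit rank is negative. With 10 blocks there are 252 such splits, which is why the call below subsamples them.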
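
To make the procedure concrete, here is a minimal sketch of CSCV written from the description in Bailey & López de Prado. It is not the code behind `cscv_pbo`: the names `sharpe_cols` and `cscv_pbo_sketch` are hypothetical, every split is enumerated (no subsampling), and rank `N` means best out-of-sample, which may differ from the convention used in `scripts/overfit_metrics`.

```python
from itertools import combinations
import numpy as np

def sharpe_cols(R):
    """Per-column Sharpe ratio in per-period units (not annualized)."""
    return R.mean(axis=0) / R.std(axis=0, ddof=1)

def cscv_pbo_sketch(X, n_blocks=8):
    """Illustrative CSCV on a T x N return matrix. Returns (pbo, logits)."""
    if n_blocks % 2 != 0:
        raise ValueError("n_blocks must be even")
    T, N = X.shape
    blocks = np.array_split(np.arange(T), n_blocks)  # contiguous time blocks
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        is_rows = np.concatenate([blocks[b] for b in is_ids])
        oos_rows = np.concatenate([blocks[b] for b in oos_ids])
        # Select the champion using in-sample data only
        champion = int(np.argmax(sharpe_cols(X[is_rows])))
        oos_sr = sharpe_cols(X[oos_rows])
        # OOS rank of the champion: 1 = worst, N = best
        rank = int((oos_sr < oos_sr[champion]).sum()) + 1
        omega = rank / (N + 1)                       # relative rank in (0, 1)
        logits.append(np.log(omega / (1.0 - omega)))
    logits = np.array(logits)
    # PBO: share of splits where the champion is at or below the OOS median
    return float(np.mean(logits <= 0.0)), logits

rng = np.random.default_rng(0)
pbo_demo, logits_demo = cscv_pbo_sketch(rng.standard_normal((240, 50)), n_blocks=8)
print(f"PBO on pure noise: {pbo_demo:.2f} over {len(logits_demo)} splits")
```

Because every strategy in the demo is noise, the in-sample champion has no reason to rank well out-of-sample, so its logits should scatter on both sides of zero.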

In [None]:
res = cscv_pbo(X, n_folds=10, max_splits=150)  # subsample CSCV splits for speed; increase if time allows
res.pbo, res.splits_used

In [None]:
fig, ax = plt.subplots(1,2, figsize=(10,4))
ax[0].hist(res.taus, bins=30, color='tab:blue', alpha=0.8)
ax[0].axvline(0, color='r', linestyle='--', label='tau=0'); ax[0].legend()
ax[0].set_title('Logit ranks of in‑sample champion (CSCV)')

ax[1].hist(res.oos_ranks, bins=np.arange(1, X.shape[1]+2)-0.5, color='tab:orange', alpha=0.8)
ax[1].set_title('OOS ranks of in‑sample champion')
ax[1].set_xlim(0.5, min(40.5, X.shape[1]+0.5))
plt.tight_layout(); plt.show()

Interpretation: A high PBO indicates that selecting the in‑sample “winner” is likely to disappoint out‑of‑sample.

# Part C — PSR and discussion of DSR

We compute the Probabilistic Sharpe Ratio (PSR) of the champion against a 0 benchmark. DSR additionally deflates for selection bias by using a higher benchmark Sharpe (selection threshold). If `mlfinlab` is installed, we compare against its DSR.
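
For reference, the PSR in Bailey & López de Prado is

$$
\widehat{PSR}(SR^{*}) = \Phi\!\left(\frac{(\widehat{SR} - SR^{*})\sqrt{n-1}}{\sqrt{1 - \hat{\gamma}_3\,\widehat{SR} + \frac{\hat{\gamma}_4 - 1}{4}\,\widehat{SR}^{2}}}\right)
$$

with $\hat{\gamma}_3$ the skewness and $\hat{\gamma}_4$ the (non-excess) kurtosis of returns. The sketch below implements this formula using only the standard library. The name `psr_sketch` is hypothetical, and the imported `probabilistic_sharpe_ratio` may differ in details such as annualization.

```python
from math import sqrt
from statistics import NormalDist

def psr_sketch(sr_hat, sr_benchmark, n_obs, skew=0.0, kurtosis=3.0):
    """Probabilistic Sharpe Ratio (illustrative).

    sr_hat and sr_benchmark are per-period Sharpe ratios (same frequency
    as the returns); kurtosis is non-excess (3.0 for a normal)."""
    radicand = 1.0 - skew * sr_hat + (kurtosis - 1.0) / 4.0 * sr_hat**2
    if radicand <= 0.0:
        raise ValueError("non-positive variance term; check skew/kurtosis inputs")
    z = (sr_hat - sr_benchmark) * sqrt(n_obs - 1.0) / sqrt(radicand)
    return NormalDist().cdf(z)

# A monthly Sharpe of 0.2 over 240 months, tested against two benchmarks
psr_vs_zero = psr_sketch(0.2, 0.0, 240)
psr_vs_high = psr_sketch(0.2, 0.15, 240)
print(psr_vs_zero, psr_vs_high)
```

Raising the benchmark lowers the PSR for the same observed Sharpe, which is the mechanism DSR relies on.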

In [None]:
# Champion’s in‑sample series and summary stats
x_star = X[:, j_star]
sr_hat = sharpe_ratio(x_star)
n_obs = len(x_star)

# Sample skewness and kurtosis of the champion's returns
skew = pd.Series(x_star).skew()
kurt = pd.Series(x_star).kurtosis() + 3  # pandas returns excess kurtosis

psr_0 = probabilistic_sharpe_ratio(sr_hat, 0.0, n_obs, skew=skew, kurtosis=kurt)
psr_0

Optional: compare with `mlfinlab`’s DSR (if available). Note DSR uses an elevated benchmark Sharpe that accounts for the number of trials and their correlation (see paper for details).

In [None]:
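if dsr_mlfinlab is not None:
    # Example parameters: set n_trials and corr based on your own research context.
    # The keyword names in this call are assumed; check them against the signature
    # in your installed mlfinlab version (e.g. help(dsr_mlfinlab)) before relying on it.
    n_trials = N
    corr = 0.2
    dsr_val = dsr_mlfinlab(observed_sr=sr_hat, number_of_trials=n_trials, skew=skew, kurtosis=kurt, rho=corr, length=n_obs)
    print(dsr_val)  # a bare expression inside an if-block is not displayed
else:
    print('mlfinlab not available in this environment')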
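
If `mlfinlab` is not installed, the idea behind DSR can still be sketched by hand: take the expected maximum Sharpe across zero-skill trials as the benchmark, then evaluate PSR against it. The code below follows the approximation in Bailey & López de Prado under the assumption of independent trials. The names `expected_max_sr` and `dsr_sketch` are hypothetical, and correlation between trials (which lowers the effective trial count) is deliberately ignored.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sr(sr_variance, n_trials):
    """Approximate expected maximum Sharpe across n_trials independent
    trials whose true Sharpe is zero."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    nd = NormalDist()
    return sqrt(sr_variance) * (
        (1.0 - EULER_GAMMA) * nd.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1.0 - 1.0 / (n_trials * e))
    )

def dsr_sketch(sr_hat, sr_variance, n_trials, n_obs, skew=0.0, kurtosis=3.0):
    """Deflated Sharpe Ratio (illustrative): PSR against the expected maximum.

    Sharpe ratios are per-period; kurtosis is non-excess. Returns (dsr, sr0)."""
    sr0 = expected_max_sr(sr_variance, n_trials)
    radicand = 1.0 - skew * sr_hat + (kurtosis - 1.0) / 4.0 * sr_hat**2
    z = (sr_hat - sr0) * sqrt(n_obs - 1.0) / sqrt(radicand)
    return NormalDist().cdf(z), sr0

# Assumed inputs: 200 trials, 240 monthly observations, and a cross-trial
# Sharpe variance of 1/(n_obs - 1), roughly what pure noise would give
dsr_demo, sr0_demo = dsr_sketch(sr_hat=0.25, sr_variance=1 / 239, n_trials=200, n_obs=240)
print(f"benchmark SR0 = {sr0_demo:.3f}, DSR = {dsr_demo:.3f}")
```

In practice you would pass the variance of the Sharpe ratios you actually observed across your trials (for example `sr_all.var(ddof=1)`) rather than the assumed value used here.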

## Optional — Empirical Selection Benchmark (SR*)

An intuitive (but approximate) benchmark SR* is the selection threshold you would have used to promote a strategy, e.g., the 95th percentile of candidate SRs or the top‑k cutoff used in model selection. This inflates the benchmark to reflect the search.

In [None]:
# Naive empirical SR* from the in-sample garden (use with caution)
sr_garden = sr_all  # in-sample Sharpe across candidates
sr_star_empirical = np.quantile(sr_garden, 0.95)
psr_emp = probabilistic_sharpe_ratio(sr_hat, sr_star_empirical, n_obs, skew=skew, kurtosis=kurt)
sr_star_empirical, psr_emp

::: callout-tip
Guidance: PSR answers “what is the probability that the true SR > benchmark SR*?”. DSR raises SR* to deflate for selection bias (many trials and correlation among them). When reporting results, disclose the number of trials and use CSCV/PBO to evidence robustness.
:::

# Extension — Replace noise with weak‑edge signals

Modify the simulation so a small subset of strategies has a slight positive mean. Re‑run CSCV/PBO and PSR to check whether the diagnostics respond to a genuine edge or stay close to their pure‑noise values.

In [None]:
# Example: 10 strategies have a small true edge
T, N = 240, 200
X = generate_noise_strategies(T=T, N=N, rho=0.2, seed=777)
edge_idx = np.arange(10)
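# Adds 0.05/sqrt(12) ≈ 0.0144 to each monthly return. Note that sqrt(12) is the
# volatility/Sharpe scaling; a 5% annual mean return would be 0.05 / 12 per month.
X[:, edge_idx] += 0.05 / np.sqrt(12)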

res2 = cscv_pbo(X, n_folds=10)
res2.pbo

# Deliverables

- Report the observed PBO and interpret its meaning
- Report PSR for the selected strategy; if available, compare with DSR
- Describe how your result changes when a few strategies have a genuine (small) edge

## How to Report (Template)

- Trials: We evaluated N strategies/hyper‑parameters (comment on similarity/correlation if relevant).  
- Selection: In‑sample selection metric = [Sharpe/alpha/etc.] with CSCV splits (k=10).  
- Robustness: PBO = X.XX across S splits (show logit rank histogram).  
- Significance: PSR = X.XX vs SR*=0 (skew=..., kurt=..., n=...)  
  - Optional: DSR = X.XX (assumptions: trials=N, rho=..., length=n).  
- Data: period, universe, costs/slippage, vintages/release timing.  
- Decision: [Promote/Park], rationale and next steps (e.g., live paper trading).

# References

- @bailey2015pbo — Probability of Backtest Overfitting (PBO) and CSCV  
- @lopezdeprado2014dsr — Deflated Sharpe Ratio (DSR)  
- White (2000) — Reality Check for data snooping  
- Hansen (2005) — Superior Predictive Ability (SPA) test  