# Week 3 -- Portfolio Theory, Factor Models & Risk

> *"Alpha is what's left after you account for all the risks you took. Most people discover they have none."*

In 1998, Long-Term Capital Management had $125 billion in assets, two Nobel laureates on staff, and a portfolio optimizer that said their positions were perfectly hedged. They lost $4.6 billion in four months and nearly brought down the global financial system. Eleven years later, three professors published a paper showing that a strategy as dumb as "put equal money in everything" beats the Nobel-Prize-winning portfolio optimization technique in most real-world tests. Both of these facts are true, and together they tell you something important about the gap between theory and practice in portfolio construction.

This week, we learn the theory -- Markowitz, CAPM, factor models -- understand why it's beautiful, understand why it fails, and then learn what actually works. We'll meet the risk metrics that will serve as the KPIs for every model you build for the rest of this course: Sharpe, Sortino, maximum drawdown, VaR, CVaR. We'll encounter the single most important equation for an ML engineer entering finance -- the Fundamental Law of Active Management -- which tells you whether your model's prediction accuracy is good enough to make money, *before* you ever run a backtest. And we'll discover that a "pathetic" prediction correlation of 0.05, applied across 500 stocks monthly, can produce world-class performance.

The journey this week goes like this: build the beautiful theory, watch it shatter on real data, then pick up the pieces with methods that actually survive contact with markets.

In [None]:
# ── Setup & Imports ──────────────────────────────────────────────
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import optimize, stats
from scipy.cluster.hierarchy import linkage, dendrogram, leaves_list
from IPython.display import display, Markdown
import warnings; warnings.filterwarnings('ignore', category=FutureWarning)

plt.rcParams.update({'figure.figsize': (10, 5), 'figure.dpi': 110,
                      'axes.spines.top': False, 'axes.spines.right': False})
np.random.seed(42)

We keep the plumbing minimal -- standard scientific Python plus `yfinance` for market data. Almost everything in this notebook runs with these imports; the few cells that need something extra (`pandas_datareader`, `scikit-learn`) import it where it is used. Let's start by downloading the small universe of assets we'll use throughout the lecture for demonstrations.

In [None]:
# ── Data: 10 diversified tickers, 5 years ────────────────────────
tickers = ['AAPL','MSFT','JNJ','JPM','XOM','PG','NVDA','DUK','GLD','TLT']
raw = yf.download(tickers, start='2019-01-01', end='2024-01-01',
                  auto_adjust=True, progress=False)
prices = raw['Close'][tickers].dropna()  # keep columns in the same order as `tickers`
returns = np.log(prices / prices.shift(1)).dropna()
returns.head(3)

Ten tickers spanning tech (AAPL, MSFT, NVDA), healthcare (JNJ), financials (JPM), energy (XOM), consumer staples (PG), utilities (DUK), gold (GLD), and long-term treasuries (TLT). This isn't a random grab bag -- it's deliberately diverse so we can see how diversification actually works when correlations vary. We're using log returns because they're additive over time, which matters when we start compounding portfolio returns across rebalancing periods.

Now, let's build the theory that won Harry Markowitz a Nobel Prize -- and then watch it break.

---

## 1. Mean-Variance Optimization -- The Beautiful Theory

Diversification is the only free lunch in finance. If you hold two stocks that aren't perfectly correlated, the portfolio's risk is *less* than the weighted average of the individual risks. That's not a model assumption -- it's a mathematical fact. Harry Markowitz proved it in 1952 and won the Nobel Prize. The tragedy is that the optimizer he invented to exploit this fact is so sensitive to estimation errors that it usually produces portfolios worse than equal-weighting.

Let's start with the math, which is genuinely elegant. For a portfolio with weight vector $\mathbf{w}$ across $N$ assets:

$$R_p = \mathbf{w}^T \boldsymbol{\mu}, \qquad \sigma_p^2 = \mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}$$

The key insight is in $\sigma_p^2$. If every pair of assets had correlation 1.0, portfolio volatility would just be the weighted average of individual volatilities -- no free lunch. But real correlations are below 1.0, so the cross-terms $w_i w_j \sigma_{ij}$ are smaller than they would be under perfect correlation, and the total shrinks. The lower the correlations, the bigger the diversification benefit.
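To make the role of correlation concrete, here is a minimal sketch with two hypothetical assets (20% and 30% annual volatility, equal weights). The helper name `two_asset_vol` and all inputs are illustrative assumptions, not part of the lecture's dataset.

```python
import numpy as np

def two_asset_vol(w1, sigma1, sigma2, rho):
    """Volatility of a two-asset portfolio with weights (w1, 1 - w1)."""
    w = np.array([w1, 1.0 - w1])
    cov = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                    [rho * sigma1 * sigma2, sigma2**2]])
    return float(np.sqrt(w @ cov @ w))

# Hypothetical inputs: 20% and 30% annual volatility, 50/50 weights
weighted_avg_vol = 0.5 * 0.20 + 0.5 * 0.30
for rho in (1.0, 0.5, 0.0, -0.5):
    vol = two_asset_vol(0.5, 0.20, 0.30, rho)
    print(f"rho={rho:+.1f}  portfolio vol={vol:.4f}  weighted avg={weighted_avg_vol:.4f}")
```

Only at a correlation of 1.0 does the portfolio volatility equal the weighted average of the two volatilities; at every lower correlation it comes in below.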

The optimization problem at the heart of this framework is: maximize the Sharpe ratio subject to weights summing to 1. When short positions are allowed, the solution has a closed form:

$$\mathbf{w}^* = \frac{\boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - r_f \mathbf{1})}{\mathbf{1}^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - r_f \mathbf{1})}$$

Beautiful. And completely unstable, because $\boldsymbol{\Sigma}^{-1}$ amplifies every estimation error in $\boldsymbol{\Sigma}$. Let's see both sides -- the elegance and the collapse.
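The closed form is short enough to try by hand. The sketch below uses a hypothetical three-asset universe (the function name `tangency_weights` and every number are assumptions chosen for illustration) and allows short positions, which is what the formula assumes. It then raises one expected return by a single percentage point and re-solves.

```python
import numpy as np

def tangency_weights(mu, cov, rf=0.0):
    """Closed-form tangency weights: Sigma^{-1}(mu - rf), rescaled to sum to 1."""
    excess = np.asarray(mu, dtype=float) - rf
    # Solve Sigma x = excess rather than forming the inverse explicitly
    raw = np.linalg.solve(np.asarray(cov, dtype=float), excess)
    return raw / raw.sum()

# Hypothetical annualized inputs: three assets with pairwise correlation 0.8
mu_demo = np.array([0.08, 0.10, 0.12])
vols_demo = np.array([0.15, 0.20, 0.25])
corr_demo = np.full((3, 3), 0.8)
np.fill_diagonal(corr_demo, 1.0)
cov_demo = np.outer(vols_demo, vols_demo) * corr_demo

w_base = tangency_weights(mu_demo, cov_demo, rf=0.02)
# Bump the first asset's expected return by one percentage point
w_bump = tangency_weights(mu_demo + np.array([0.01, 0.0, 0.0]), cov_demo, rf=0.02)
print("base  :", np.round(w_base, 3))
print("bumped:", np.round(w_bump, 3))
```

With these inputs the first asset's weight roughly doubles, from about 43% to about 85%, on a one-point change in a single expected return. The high pairwise correlation is what makes the solution so touchy: the assets are close substitutes, so a small edge in one attracts most of the capital.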

First, we'll compute the annualized mean returns and covariance matrix from our 10-stock sample, then plot the efficient frontier -- the curve of maximum-return portfolios for each level of risk. Watch for how the frontier bulges leftward: that bulge *is* diversification.

In [None]:
# ── Annualized stats ─────────────────────────────────────────────
mu  = returns.mean() * 252
cov = returns.cov()  * 252

# ── Monte-Carlo efficient frontier (random portfolios) ───────────
n_port, n_assets = 8000, len(tickers)
results = np.zeros((n_port, 3))
for i in range(n_port):
    w = np.random.dirichlet(np.ones(n_assets))
    results[i, 0] = w @ mu
    results[i, 1] = np.sqrt(w @ cov.values @ w)
    results[i, 2] = results[i, 0] / results[i, 1]  # Sharpe

We've generated 8,000 random long-only portfolios using a Dirichlet distribution (which guarantees weights are positive and sum to 1). Each dot below will represent one portfolio in risk-return space. The upper-left boundary of this cloud approximates the long-only efficient frontier -- the set of portfolios where you can't get more return without taking more risk.

In [None]:
fig, ax = plt.subplots()
sc = ax.scatter(results[:,1], results[:,0], c=results[:,2],
                cmap='viridis', s=4, alpha=0.5)
plt.colorbar(sc, ax=ax, label='Sharpe Ratio')
for t in tickers:
    ax.scatter(np.sqrt(cov.loc[t,t]), mu[t], marker='x',
               s=80, c='red', zorder=5)
    ax.annotate(t, (np.sqrt(cov.loc[t,t]), mu[t]), fontsize=7)
ax.set_xlabel('Annualized Volatility'); ax.set_ylabel('Annualized Return')
ax.set_title('Efficient Frontier -- 10-Stock Universe')
plt.tight_layout(); plt.show()

See the shape? Most individual assets (the red x's) sit to the right of the cloud's upper-left edge. For nearly every one of them, some blend on that edge offers higher return for the same risk, or lower risk for the same return. That's the diversification free lunch visualized. NVDA has the highest return but also the highest volatility. The efficient frontier finds blends that keep much of NVDA's return while dramatically reducing risk by mixing in less-correlated assets like TLT and GLD.

Now here's the part Markowitz wouldn't tell you at a dinner party. Let's find the optimal tangency portfolio (the one with the highest Sharpe ratio), then add a tiny amount of noise to our expected return estimates and re-optimize. If the optimizer is robust, the weights should barely change.

In [None]:
# ── Tangency portfolio via scipy ─────────────────────────────────
def neg_sharpe(w, mu, cov):
    return -(w @ mu) / np.sqrt(w @ cov @ w)

cons = {'type': 'eq', 'fun': lambda w: w.sum() - 1}
bnds = [(0, 1)] * n_assets
w0   = np.ones(n_assets) / n_assets
opt  = optimize.minimize(neg_sharpe, w0, args=(mu.values, cov.values),
                         bounds=bnds, constraints=cons, method='SLSQP')
w_opt = opt.x

We've found the tangency portfolio -- the weight vector that maximizes the Sharpe ratio subject to long-only constraints and weights summing to 1. Now let's see what happens when we perturb the expected returns by a small amount of noise. In practice, your expected return estimates come from a model, and models have estimation error. Here the noise has a standard deviation equal to 15% of the cross-sectional spread in our expected-return estimates, and even that is optimistic -- real estimation error is worse.

In [None]:
# ── Perturb expected returns, re-optimise ────────────────────────
noise = np.random.normal(0, mu.std() * 0.15, n_assets)
mu_noisy = mu.values + noise
opt2 = optimize.minimize(neg_sharpe, w0, args=(mu_noisy, cov.values),
                         bounds=bnds, constraints=cons, method='SLSQP')
w_pert = opt2.x

comp = pd.DataFrame({'Original': w_opt, 'Perturbed': w_pert,
                      'Diff': w_pert - w_opt}, index=returns.columns)  # label by actual column order
display(comp.style.format('{:.3f}').bar(subset='Diff', color=['#d65f5f','#5fba7d']))

Look at those weight swings. A tiny perturbation in expected returns -- noise well within the range of normal estimation error -- produces a *completely different* portfolio. Stocks that were at 25% allocation drop to near zero. Stocks that were ignored suddenly become the largest position. This is the fundamental instability of mean-variance optimization: $\boldsymbol{\Sigma}^{-1}$ amplifies small errors in $\boldsymbol{\mu}$ into enormous position changes.

> **Did You Know?** When asked how he invested his own retirement money, Harry Markowitz -- the inventor of mean-variance optimization, Nobel laureate -- admitted he used a simple 50/50 split between stocks and bonds. "I should have computed the efficient frontier," he said. "Instead, I split my contributions 50/50 between stocks and bonds, to minimize my future regret." The inventor of the optimal portfolio used the simplest possible approach for his own money. He knew something his optimizer didn't.

If your ML model produces expected return predictions and you feed them into a Markowitz optimizer, the optimizer will take your small prediction errors and amplify them into enormous position errors. The portfolio won't reflect your model's views -- it'll reflect your model's noise. We need a better way to go from predictions to positions. But first, we need to understand what "alpha" actually means.

---

## 2. CAPM & Factor Models -- What "Alpha" Actually Means

When a hedge fund manager tells you they earned 15% last year, the first question isn't "Is that good?" It's "How much risk did you take to get it?" If the market was up 20% and you earned 15% with a beta of 1.5, your alpha was actually $15\% - 1.5 \times 20\% = -15\%$. You *underperformed* massively. You just took a lot of market risk and the market happened to go up. Alpha is what's left after you subtract the return you would have earned just by being exposed to known risk factors. Most managers, after this subtraction, have no alpha at all.

The simplest factor model is the Capital Asset Pricing Model (CAPM):

$$E[R_i] - R_f = \beta_i (E[R_m] - R_f)$$

Beta measures sensitivity to the market. A stock with $\beta = 1.5$ goes up 1.5% on average when the market goes up 1%, and drops 1.5% on average when the market drops 1%. CAPM says the *only* risk that gets compensated is market risk. Everything else should be diversified away.

But empirically, CAPM is incomplete. Small stocks and value stocks earn more than CAPM predicts. Fama and French captured these extra risk premia:

$$R_i - R_f = \alpha_i + \beta_i^{MKT}(R_m - R_f) + \beta_i^{SMB} \cdot SMB + \beta_i^{HML} \cdot HML + \epsilon_i$$

Now $\alpha_i$ is the return after controlling for market, size, and value exposures. The more factors you control for, the harder it is to claim you have genuine skill. That's the point -- the Fama-French framework is a lie detector for investment managers.
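Before touching real data, it can help to see the regression recover known loadings. This is a sketch on simulated returns: the factor series, the "true" betas, and the noise level are all made-up assumptions, and the asset is built with zero alpha so the intercept should come out near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 2000

# Hypothetical daily factor returns standing in for market, size, and value
factors = rng.normal(0.0003, 0.01, size=(n_days, 3))
true_betas = np.array([1.5, 0.3, -0.4])

# Asset with zero true alpha: its expected excess return is pure factor exposure
asset_excess = factors @ true_betas + rng.normal(0.0, 0.005, n_days)

# OLS with an intercept column; the intercept is the alpha estimate
X_demo = np.column_stack([np.ones(n_days), factors])
coef, *_ = np.linalg.lstsq(X_demo, asset_excess, rcond=None)
alpha_hat, betas_hat = coef[0], coef[1:]

print("estimated daily alpha:", round(float(alpha_hat), 5))
print("estimated betas      :", np.round(betas_hat, 2))
```

The asset's average return looks attractive on its own because of the 1.5 market loading, yet the intercept is statistically indistinguishable from zero once the factors are in the regression.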

Let's download the Fama-French three factors from Kenneth French's data library and run regressions for a couple of our stocks. We'll pick NVDA (high-beta tech) and DUK (low-beta utility) to see the contrast. Watch what happens to "alpha" once we control for factor exposures.

In [None]:
# ── Download Fama-French 3 factors ───────────────────────────────
import pandas_datareader.data as web
ff3 = web.DataReader('F-F_Research_Data_Factors_daily',
                     'famafrench', start='2019-01-01')[0]
ff3 = ff3 / 100  # convert from percent to decimal
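if isinstance(ff3.index, pd.PeriodIndex):  # monthly files use periods; daily ones are already timestamps
    ff3.index = ff3.index.to_timestamp()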
ff3.head(3)

The Fama-French data comes directly from Kenneth French's website -- it's the gold standard for academic factor research. `Mkt-RF` is the market excess return, `SMB` captures the small-minus-big size premium, and `HML` captures the high-minus-low value premium. `RF` is the daily risk-free rate. Everything is in decimal form now (0.01 = 1%).

Now let's align our stock returns with the factors and run the regressions. The question: does NVDA have genuine alpha, or is its impressive return just compensation for being a high-beta, growth-tilted stock?

In [None]:
# ── Factor regressions for NVDA and DUK ──────────────────────────
common = returns.index.intersection(ff3.index)
ff_aligned = ff3.loc[common]
ret_aligned = returns.loc[common]

results_ff = {}
for tkr in ['NVDA', 'DUK']:
    y = ret_aligned[tkr] - ff_aligned['RF']
    X = ff_aligned[['Mkt-RF', 'SMB', 'HML']]
    X = X.assign(const=1.0)
    betas = np.linalg.lstsq(X.values, y.values, rcond=None)[0]
    resid = y.values - X.values @ betas
    dof = len(y) - X.shape[1]                     # residual degrees of freedom
    se = np.sqrt(np.diag((resid @ resid / dof) * np.linalg.inv(X.values.T @ X.values)))
    results_ff[tkr] = pd.Series(betas, index=['Mkt-RF','SMB','HML','alpha'])
    results_ff[tkr + '_tstat'] = pd.Series(betas / se, index=['Mkt-RF','SMB','HML','alpha'])

display(pd.DataFrame(results_ff).T.style.format('{:.4f}'))

Look at the contrast. NVDA has a market beta well above 1.0 -- it amplifies market moves. Its raw return over this period has been extraordinary, driven heavily by the AI boom. But much of that return comes from being a high-beta stock during a bull market. The alpha tells you what's left after stripping out the market, size, and value exposures. DUK, the boring utility, has a market beta well below 1.0. Its raw return is unremarkable, but its alpha per unit of risk taken is interesting -- utilities often earn a quiet premium precisely because they're boring and nobody wants to own them.

> **Did You Know?** From 1927 to 2007, value stocks (high book-to-market) outperformed growth stocks by about 5% per year -- the HML factor. Since 2007, the premium has essentially vanished. Fama and French themselves published a paper in 2020 acknowledging the weakening. Some argue the premium was arbitraged away -- quants exploited it until it disappeared. Others say it's cyclical and will return. Your Fama-French regressions will show it: the HML loading has become unreliable in recent data.

When we build ML models to predict stock returns in later weeks, we're implicitly searching for alpha -- the return component that known risk factors can't explain. If your model just learns to buy high-beta stocks, it hasn't learned anything useful; it's just taking market risk. The Fama-French regression tells you whether your model found genuine alpha or just rediscovered beta.

---

## 3. Risk Metrics -- The KPIs of Quantitative Finance

A strategy with a Sharpe ratio of 0.5 is mediocre. A Sharpe of 1.0 is good. A Sharpe of 2.0 is excellent. A Sharpe of 3.0 is almost certainly wrong -- either the backtest is overfitted, the costs are missing, or the data has survivorship bias. Jim Simons' Medallion Fund, the best-performing hedge fund in history, has a Sharpe ratio of about 2.5 after fees. If your homework Sharpe is higher than Simons', you have a bug, not a breakthrough.

Let's define the metrics that will serve as KPIs for every model you build in this course.

**Sharpe ratio** -- the primary metric. Risk-adjusted return:

$$SR = \frac{\mu_{\text{excess}}}{\sigma_{\text{excess}}} \times \sqrt{252}$$

**Sortino ratio** -- like Sharpe, but only penalizes downside volatility (because nobody complains about upside surprises):

$$\text{Sortino} = \frac{\mu_{\text{excess}}}{\sigma_{\text{downside}}} \times \sqrt{252}, \quad \sigma_{\text{downside}} = \sqrt{E[\min(R - R_f,\, 0)^2]}$$

**Maximum drawdown** -- the worst peak-to-trough loss. This is the number that gets people fired:

$$\text{MDD} = \max_{t} \left( \frac{\max_{s \le t} V_s - V_t}{\max_{s \le t} V_s} \right)$$

**VaR** and **CVaR** -- Value at Risk is the loss threshold at a confidence level; Conditional VaR (Expected Shortfall) is the average loss beyond that threshold. CVaR is better because it tells you the *size* of the catastrophe, not just the doorway.
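A tiny worked example makes the drawdown and tail definitions easier to check by hand. The helper names and the toy inputs below are illustrative assumptions; the VaR here is the plain historical (empirical percentile) version.

```python
import numpy as np

def max_drawdown(values):
    """Largest peak-to-trough loss, as a fraction of the running peak."""
    values = np.asarray(values, dtype=float)
    running_peak = np.maximum.accumulate(values)
    return float(((running_peak - values) / running_peak).max())

def historical_var_cvar(returns, level=0.95):
    """Historical VaR and CVaR, both reported as positive loss numbers."""
    returns = np.asarray(returns, dtype=float)
    cutoff = np.percentile(returns, 100 * (1 - level))
    tail = returns[returns <= cutoff]      # observations at or beyond the VaR threshold
    return float(-cutoff), float(-tail.mean())

# Toy portfolio value path: peaks at 120, later falls to 80
toy_values = [100, 120, 90, 110, 80, 130]
print("max drawdown:", round(max_drawdown(toy_values), 4))

# Toy daily returns: an evenly spaced grid from -5.0% to +4.9%
toy_returns = np.arange(-50, 50) / 1000
var95, cvar95 = historical_var_cvar(toy_returns)
print("VaR95:", round(var95, 4), "CVaR95:", round(cvar95, 4))
```

The toy path peaks at 120 and later touches 80, so its maximum drawdown is one third. CVaR is never smaller than VaR at the same level, because it averages the losses that sit at or past the VaR threshold.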

Let's compute all of these for SPY. But here's the real lesson: we'll compute them for different market regimes to show how dramatically they change. A Sharpe ratio computed over a calm period is a fairy tale -- you need to see what happens during the storm.

In [None]:
# ── Download SPY for risk metric demo ────────────────────────────
spy = yf.download('SPY', start='2017-01-01', end='2024-01-01',
                  auto_adjust=True, progress=False)
spy_ret = np.log(spy['Close'] / spy['Close'].shift(1)).dropna()
spy_ret = spy_ret.squeeze() if hasattr(spy_ret, 'squeeze') else spy_ret

Now we'll build a compact function that computes all five metrics at once, then apply it to three regimes: the calm pre-COVID period (2017-2019), the COVID crash (Feb-Apr 2020), and the full sample. The contrast will be stark.

In [None]:
# ── Risk metrics function ────────────────────────────────────────
def risk_metrics(r, rf_annual=0.02):
    rf = rf_annual / 252
    excess = r - rf
    sharpe  = excess.mean() / excess.std() * np.sqrt(252)
    # Downside deviation: root-mean-square of min(excess, 0), matching the formula above
    down_dev = np.sqrt((np.minimum(excess, 0) ** 2).mean())
    sortino = excess.mean() / down_dev * np.sqrt(252)
    cum     = np.exp(r.cumsum())          # r holds log returns, so compound via exp
    peak    = cum.cummax()
    mdd     = ((peak - cum) / peak).max()
    var95   = -np.percentile(r, 5)
    cvar95  = -r[r <= -var95].mean()      # average loss at or beyond the VaR threshold
    return pd.Series({'Sharpe': sharpe, 'Sortino': sortino,
                      'MaxDD': f'{mdd:.1%}', 'VaR95': f'{var95:.2%}',
                      'CVaR95': f'{cvar95:.2%}'})

That function packs five metrics into about a dozen lines. Notice we annualize Sharpe and Sortino by $\sqrt{252}$ -- the standard convention. VaR and CVaR are reported as daily figures (un-annualized) because they represent single-day tail risk. Maximum drawdown is always peak-to-trough, regardless of period length.

Now let's see how these numbers tell different stories across different regimes.

In [None]:
# ── Regime comparison ────────────────────────────────────────────
calm   = spy_ret['2017':'2019']
crisis = spy_ret['2020-02':'2020-04']
full   = spy_ret

regime_df = pd.DataFrame({
    'Calm (2017-19)':  risk_metrics(calm),
    'COVID (Feb-Apr 2020)': risk_metrics(crisis),
    'Full Sample': risk_metrics(full)
})
display(regime_df)

The numbers tell the story more clearly than any prose could. During the calm period, SPY looks wonderful -- a solid Sharpe, modest drawdown, well-behaved tails. Then COVID arrives and every metric inverts. The Sharpe turns deeply negative. The maximum drawdown explodes. The CVaR -- the average loss on the worst 5% of days -- is several times the calm-period figure. And here's the kicker: the full-sample Sharpe *averages out* the extremes, making the strategy look moderate. If you only saw the full-sample Sharpe, you'd have no idea that the ride included a 30%+ drawdown in about five weeks.

This is why you should always report maximum drawdown alongside Sharpe. A strategy with Sharpe 1.5 and max drawdown of 40% is very different from Sharpe 1.5 with max drawdown of 10%. The first one will get you fired during the drawdown, regardless of its long-run performance. The S&P 500's maximum drawdown during the 2007-2009 financial crisis was about 55% -- at most funds, a 20% drawdown triggers serious conversations, 30% gets people fired, and 50% closes the fund.

Let's visualize the drawdown path to make the pain concrete.

In [None]:
# ── Drawdown plot ────────────────────────────────────────────────
cum  = np.exp(spy_ret.cumsum())      # log returns compound via exp of the running sum
peak = cum.cummax()
dd   = (cum - peak) / peak

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
ax1.plot(cum, color='steelblue'); ax1.set_ylabel('Cumulative Return')
ax1.set_title('SPY: Growth of $1 and Underwater Chart')
ax2.fill_between(dd.index, dd, 0, color='salmon', alpha=0.7)
ax2.set_ylabel('Drawdown'); ax2.set_ylim(dd.min()*1.1, 0.02)
plt.tight_layout(); plt.show()

The underwater chart (bottom panel) is one of the most important visualizations in quantitative finance. Every dip below zero represents a period where you're losing money relative to your prior peak. The COVID drawdown is unmistakable -- a sharp plunge in early 2020 followed by a remarkably fast recovery. But notice the 2022 bear market too: a slower, grinding drawdown that took longer to recover from. Different pain profiles, both captured by the same max drawdown metric but with very different psychological impacts on real investors.

> **Did You Know?** Jim Simons' Medallion Fund, the best-performing hedge fund in history, has earned approximately 66% annual returns before fees (39% after its punishing 5-and-44 fee structure) since 1988. The estimated Sharpe ratio is around 2.5 after fees. To achieve this, they trade thousands of instruments with tiny edge per trade but enormous breadth -- the Fundamental Law of Active Management in action. The fund has been closed to outside investors since 1993. The $10 billion inside it is all employee money.

Now that we have the risk metrics framework, let's understand *why* tiny edges can compound into extraordinary performance.

---

## 4. The Fundamental Law of Active Management

Here's the single most important equation in quantitative finance for an ML engineer:

$$IR = IC \times \sqrt{BR}$$

where:
- **IC** (Information Coefficient) is the rank correlation between your model's predictions and realized returns. Think of it as your model's Spearman correlation on the ranking task.
- **BR** (Breadth) is the number of independent bets per year -- stocks $\times$ rebalancing frequency.
- **IR** (Information Ratio) is the portfolio-level Sharpe ratio of your active returns above a benchmark.

An IC of 0.05 sounds pathetic. In image classification, a 0.05 correlation between predictions and labels would get you fired. But in finance, with 500 stocks rebalanced monthly, your breadth is $BR = 500 \times 12 = 6{,}000$. So your IR is $0.05 \times \sqrt{6{,}000} \approx 3.9$. That's Jim Simons territory. A terrible-sounding prediction accuracy, applied across many stocks many times, compounds into extraordinary performance.
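The arithmetic is simple enough to wrap in two helpers, which also lets you run it backwards: given a target IR, what IC do you need? The function names are illustrative, and the calculation takes the law at face value, treating every stock-month as a fully independent bet.

```python
import math

def information_ratio(ic, n_stocks, rebalances_per_year):
    """Fundamental Law: IR = IC * sqrt(breadth), with breadth = stocks x rebalances."""
    breadth = n_stocks * rebalances_per_year
    return ic * math.sqrt(breadth)

def required_ic(target_ir, n_stocks, rebalances_per_year):
    """Invert the law: the IC needed to reach a target IR at a given breadth."""
    return target_ir / math.sqrt(n_stocks * rebalances_per_year)

print(f"IC 0.05, 500 stocks, monthly : IR = {information_ratio(0.05, 500, 12):.2f}")
print(f"IC 0.05, 50 stocks, monthly  : IR = {information_ratio(0.05, 50, 12):.2f}")
print(f"IC needed for IR 1.0 at 500 stocks monthly: {required_ic(1.0, 500, 12):.4f}")
```

The first line reproduces the figure in the text (about 3.87). The last line is the more sobering one to keep in mind: at this breadth an IC of roughly 0.013 is enough, on paper, for an IR of 1.0, which is a hint that the independence assumption deserves scrutiny.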

This is why cross-sectional models -- predicting *all* stocks simultaneously rather than one at a time -- dominate quantitative finance. They have massive breadth. Let's prove it with a simulation.

We'll generate synthetic stock signals with a known IC, build long-short portfolios, and measure whether the portfolio IR matches the Fundamental Law's prediction. Then we'll vary both IC and breadth to see how the equation works in practice. Watch how differently the two levers scale.

In [None]:
# ── Fundamental Law simulation ───────────────────────────────────
def simulate_flaw(ic, n_stocks, n_months=120):
    """Simulate long-short portfolio returns for given IC & breadth."""
    port_rets = []
    for _ in range(n_months):
        signal = np.random.randn(n_stocks)
        actual = ic * signal + np.sqrt(1 - ic**2) * np.random.randn(n_stocks)
        ranks  = signal.argsort().argsort()
        top    = ranks >= np.percentile(ranks, 80)
        bot    = ranks <= np.percentile(ranks, 20)
        port_rets.append(actual[top].mean() - actual[bot].mean())
    r = np.array(port_rets)
    return r.mean() / r.std() * np.sqrt(12)  # annualised IR

The function generates a signal that has a pre-specified correlation (IC) with realized returns, then builds a quintile long-short portfolio each month. We annualize the resulting Sharpe to get the IR. Now let's sweep across different combinations of IC and number of stocks.

In [None]:
# ── Sweep IC and breadth ─────────────────────────────────────────
ics = [0.02, 0.05, 0.10]
stocks_list = [50, 100, 200, 500]
rows = []
for ic in ics:
    for ns in stocks_list:
        ir = simulate_flaw(ic, ns, n_months=240)
        theory = ic * np.sqrt(ns * 12)
        rows.append({'IC': ic, 'N_stocks': ns,
                     'Simulated IR': round(ir, 2),
                     'Theory IR': round(theory, 2)})

flaw_df = pd.DataFrame(rows)
display(flaw_df.style.format({'IC': '{:.2f}'}))

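The theoretical IR ($IC \times \sqrt{BR}$) and the simulated IR should track each other closely, which is the Fundamental Law at work. Expect the simulated figure to sit somewhat below theory: an equal-weighted quintile portfolio does not use the signal as efficiently as the law assumes, and 240 months still leaves sampling noise. But look at the *magnitudes*. With IC = 0.05 and just 50 stocks, the theoretical IR is modest -- around 1.2. Increase to 500 stocks and the same IC gives a theoretical IR approaching 4. Now try the other lever: doubling IC from 0.05 to 0.10 with 500 stocks roughly doubles the IR. Both levers matter, but they scale differently: IR is linear in IC and grows only with the square root of breadth, so you need to 4x your stock universe to 2x your IR through breadth alone.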

The practical implication: if you're building an ML model for stock prediction, the number of stocks you trade is at least as important as how accurate your predictions are. A model with IC = 0.03 across 500 stocks monthly has an IR of about $0.03 \times \sqrt{6000} \approx 2.3$. The *same model* applied to a single stock has $IR = 0.03 \times \sqrt{12} \approx 0.10$ -- useless. This is why we'll build cross-sectional models for most of the course.

There's a catch, though. The "independent" in breadth is doing a lot of heavy lifting. If you hold 500 stocks but they're all tech names that move together, your effective breadth is far less than 500. Correlation reduces effective breadth -- and this brings us back to the covariance matrix, where everything goes wrong.

---

## 5. Covariance Estimation -- Where Everything Goes Wrong

If you have 500 stocks and 1,260 trading days (5 years), your covariance matrix has 125,250 unique entries. You're estimating 125,250 parameters from 1,260 daily observations of each stock -- about five data points per parameter. That's not statistics -- that's hallucination. The smallest eigenvalues of the sample covariance matrix are pure noise. And the Markowitz optimizer puts the most weight on exactly those smallest eigenvalues, because they correspond to "diversification opportunities" that don't actually exist. The optimizer is fitting noise and calling it alpha.

The fix comes from two directions. The first is **Ledoit-Wolf shrinkage**, which blends the noisy sample covariance $\mathbf{S}$ with a well-conditioned structured target $\mathbf{F}$:

$$\hat{\boldsymbol{\Sigma}} = \delta \mathbf{F} + (1 - \delta) \mathbf{S}$$

The target $\mathbf{F}$ is typically a scaled identity or constant-correlation matrix. It provides stability; the sample $\mathbf{S}$ provides information. The blend is better than either alone.
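The blend itself is one line of linear algebra. The sketch below shrinks toward a scaled identity with a hand-picked $\delta$ purely for illustration; the actual Ledoit-Wolf estimator chooses $\delta$ from the data. The function name and the simulated inputs are assumptions.

```python
import numpy as np

def shrink_to_identity(sample_cov, delta):
    """Blend S with F = (average variance) * I using a fixed intensity delta."""
    sample_cov = np.asarray(sample_cov, dtype=float)
    n = sample_cov.shape[0]
    target = np.eye(n) * np.trace(sample_cov) / n
    return delta * target + (1.0 - delta) * sample_cov

rng = np.random.default_rng(1)
n_assets_demo, n_obs_demo = 50, 60      # deliberately few observations per asset
sample = np.cov(rng.normal(size=(n_obs_demo, n_assets_demo)), rowvar=False)

# delta = 0.5 is chosen by hand here, not estimated
shrunk = shrink_to_identity(sample, delta=0.5)

print("condition number, sample:", round(float(np.linalg.cond(sample)), 1))
print("condition number, shrunk:", round(float(np.linalg.cond(shrunk)), 1))
```

Shrinking toward a scaled identity lifts the smallest eigenvalues and pulls down the largest while leaving total variance (the trace) unchanged, so the condition number always improves. A lower condition number means inversion amplifies estimation error less.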

The second fix is **random matrix theory**. The Marchenko-Pastur distribution tells you what eigenvalues to expect from a *purely random* covariance matrix:

$$\lambda_{\pm} = \sigma^2 \left(1 \pm \sqrt{N/T}\right)^2$$

Eigenvalues inside the $[\lambda_-, \lambda_+]$ range are indistinguishable from noise. Eigenvalues *outside* this range carry signal. For typical equity portfolios, only 5-10 eigenvalues contain genuine market structure. The other hundreds are pure noise. Think of it as PCA -- a concept you already know from ML -- but with a principled statistical cutoff for how many components to keep.
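A quick way to build intuition for the bounds is to feed them data that is noise by construction. This sketch simulates i.i.d. unit-variance returns with no structure at all (the sizes are arbitrary assumptions) and counts how many sample eigenvalues land inside the Marchenko-Pastur range.

```python
import numpy as np

rng = np.random.default_rng(2)
n_assets_mp, n_obs_mp = 200, 1000

# Pure noise: independent columns, unit variance, true covariance = identity
noise = rng.normal(size=(n_obs_mp, n_assets_mp))
eigvals_mp = np.linalg.eigvalsh(np.cov(noise, rowvar=False))

q_mp = n_assets_mp / n_obs_mp
lam_minus_mp = (1 - np.sqrt(q_mp)) ** 2     # sigma^2 = 1 here
lam_plus_mp = (1 + np.sqrt(q_mp)) ** 2
inside = np.mean((eigvals_mp >= lam_minus_mp) & (eigvals_mp <= lam_plus_mp))

print(f"MP range: [{lam_minus_mp:.3f}, {lam_plus_mp:.3f}]")
print(f"sample eigenvalue range: [{eigvals_mp.min():.3f}, {eigvals_mp.max():.3f}]")
print(f"fraction inside the MP range: {inside:.2%}")
```

Every true eigenvalue equals 1, yet the sample eigenvalues spread across the whole range. That spread is estimation noise, and it is what an optimizer mistakes for structure.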

Let's see this with our 10-stock universe. We'll compute the sample covariance eigenvalues and overlay the Marchenko-Pastur bounds. For 10 stocks and ~1,250 days, the ratio $N/T$ is small enough that most eigenvalues should be signal -- but the principle will be clear, and in the seminar you'll scale this up to 100 stocks where the noise dominates.

In [None]:
# ── Eigenvalue decomposition & Marchenko-Pastur ──────────────────
N, T = len(tickers), len(returns)
q = N / T
sigma2 = returns.var().mean()  # average variance as sigma^2
lam_plus  = sigma2 * (1 + np.sqrt(q))**2
lam_minus = sigma2 * (1 - np.sqrt(q))**2

eigenvalues = np.linalg.eigvalsh(returns.cov().values)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(range(N), sorted(eigenvalues), color='steelblue', alpha=0.8)
ax.axhline(lam_plus, color='red', ls='--', label=f'MP upper = {lam_plus:.6f}')
ax.axhline(lam_minus, color='orange', ls='--', label=f'MP lower = {lam_minus:.6f}')
ax.set_xlabel('Eigenvalue rank'); ax.set_ylabel('Eigenvalue')
ax.set_title('Covariance Eigenvalues vs Marchenko-Pastur Bounds')
ax.legend(); plt.tight_layout(); plt.show()

With only 10 assets and ~1,250 days, our ratio $N/T$ is small (about 0.008), so the Marchenko-Pastur band is narrow and most eigenvalues fall outside it -- our small universe has plenty of data per dimension. (Strictly, the bounds assume assets with equal variance, so on a raw covariance matrix treat them as a rough guide.) But look at the largest eigenvalue: it towers over the rest. That's the market factor -- the single dominant force that drives almost all stocks up or down together. The next few eigenvalues correspond roughly to sector and style factors.

Now imagine scaling to 500 stocks with the same 5-year window. $N/T$ jumps to 0.4, the MP upper bound widens dramatically, and suddenly 490 of your 500 eigenvalues are swallowed by the noise zone. Only the top 5-10 survive. The Markowitz optimizer, which treats all 500 eigenvalues as equally real, is optimizing in 490 dimensions of pure hallucination.

> **Did You Know?** Joel Bun, Jean-Philippe Bouchaud, and Marc Potters at Capital Fund Management -- a quant hedge fund managing $14 billion -- have used random matrix theory to separate signal from noise in covariance matrices. They found that for typical equity portfolios, only the top 5-10 eigenvalues contain genuine market structure. Using the full covariance matrix for optimization is, in Bouchaud's words, like fitting a model to noise in 490 out of 500 dimensions.

Let's see how Ledoit-Wolf shrinkage changes the covariance matrix and the resulting portfolio weights.

In [None]:
# ── Ledoit-Wolf shrinkage vs sample covariance ───────────────────
from sklearn.covariance import LedoitWolf
lw = LedoitWolf().fit(returns.values)
cov_lw = pd.DataFrame(lw.covariance_ * 252,
                       index=returns.columns, columns=returns.columns)

# Optimise with shrunk covariance
opt_lw = optimize.minimize(neg_sharpe, w0, args=(mu.values, cov_lw.values),
                           bounds=bnds, constraints=cons, method='SLSQP')
comp2 = pd.DataFrame({'Sample Cov': w_opt, 'Ledoit-Wolf': opt_lw.x},
                      index=returns.columns)  # label by actual column order
display(comp2.style.format('{:.3f}'))

Compare the weight vectors. With only 10 assets and ~1,250 observations, the estimated shrinkage intensity (available as `lw.shrinkage_`) is small, so expect the two allocations to be close. Where they differ, the Ledoit-Wolf weights should be the less extreme ones: shrinkage pulls the covariance matrix toward a better-conditioned structure, which tames the optimizer's tendency to concentrate bets on noise-driven "opportunities." The effect grows as $N/T$ rises.

But shrinkage is the *minimum* fix. The real solution is to avoid matrix inversion entirely. That's what brings us to Hierarchical Risk Parity.

---

## 6. Hierarchical Risk Parity (HRP) -- The Tree-Based Solution

Marcos Lopez de Prado developed HRP while managing over $13 billion at Guggenheim Partners. His insight was that portfolio construction should respect the hierarchical structure of markets -- tech stocks cluster with tech stocks, utilities cluster with utilities -- and allocate risk *within* and *between* clusters. The result is a method that never inverts the covariance matrix, avoids extreme weights, and, in his out-of-sample simulations, outperforms Markowitz.

HRP works in three steps:

**Step 1 -- Tree clustering:** Compute a distance matrix from correlations: $d_{ij} = \sqrt{\frac{1}{2}(1 - \rho_{ij})}$. Then apply hierarchical clustering.

**Step 2 -- Quasi-diagonalization:** Reorder the covariance matrix so correlated assets are adjacent. This is the dendrogram's leaf order.

**Step 3 -- Recursive bisection:** Split the reordered list of assets into two halves and treat each half as a cluster. Allocate capital between the two halves in inverse proportion to their variances:

$$w_1 = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}, \qquad w_2 = 1 - w_1$$

Here $\sigma_1^2$ and $\sigma_2^2$ are the variances of the left and right clusters, so the lower-variance cluster receives the larger share. Then recurse into each cluster. The key: we never invert $\boldsymbol{\Sigma}$. We only use it to compute distances (step 1) and cluster-level variances (step 3). These operations are far more robust to estimation error than matrix inversion.
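
A single bisection step can be sketched on a toy covariance matrix. The four-asset matrix below is invented for illustration, and `cluster_variance` is a hypothetical helper that measures each cluster on an inverse-variance-weighted portfolio of its members, which is one common convention for HRP rather than the only possible choice.

```python
import numpy as np

# Toy universe: assets 0-1 are low-vol and correlated, assets 2-3 are high-vol and correlated
vols = np.array([0.10, 0.12, 0.30, 0.35])
corr = np.array([[1.0, 0.8, 0.1, 0.1],
                 [0.8, 1.0, 0.1, 0.1],
                 [0.1, 0.1, 1.0, 0.7],
                 [0.1, 0.1, 0.7, 1.0]])
cov = np.outer(vols, vols) * corr

# Step 1 distance: 0 when rho = 1, growing toward 1 as rho falls to -1
dist = np.sqrt(0.5 * (1 - corr))

def cluster_variance(cov, idx):
    """Variance of the inverse-variance-weighted portfolio of the assets in idx."""
    sub = cov[np.ix_(idx, idx)]
    ivp = 1.0 / np.diag(sub)
    ivp /= ivp.sum()
    return float(ivp @ sub @ ivp)

# Step 3, one split: the low-vol pair on the left, the high-vol pair on the right
left, right = [0, 1], [2, 3]
var_l = cluster_variance(cov, left)
var_r = cluster_variance(cov, right)
w_left = var_r / (var_l + var_r)    # share of capital for the left cluster
w_right = 1 - w_left

print(f"left cluster variance  = {var_l:.4f}, weight = {w_left:.3f}")
print(f"right cluster variance = {var_r:.4f}, weight = {w_right:.3f}")
```

The low-volatility cluster receives most of the capital, and no matrix inverse appears anywhere in the calculation.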

Let's build HRP from scratch using scipy's hierarchical clustering. First, the dendrogram -- this will show us how our 10 assets naturally cluster. You should expect to see tech names grouping together, defensive names grouping together, and the gold/treasury pair off on their own.

In [None]:
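# ── HRP Step 1: hierarchical clustering ──────────────────────────
from scipy.spatial.distance import squareform

corr = returns.corr()
# Correlation distance; clip guards against tiny negative values from rounding
dist = np.sqrt(np.clip(0.5 * (1 - corr.to_numpy()), 0.0, None))
np.fill_diagonal(dist, 0.0)  # squareform expects an exactly-zero diagonal
link = linkage(squareform(dist, checks=False), method='ward')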

fig, ax = plt.subplots(figsize=(10, 4))
dendrogram(link, labels=tickers, ax=ax, leaf_font_size=10)
ax.set_title('Asset Hierarchy (Ward Linkage)')
ax.set_ylabel('Distance'); plt.tight_layout(); plt.show()

The dendrogram reveals the natural structure of our universe. Tech names (AAPL, MSFT, NVDA) cluster tightly -- they share common risk drivers (tech sentiment, AI narrative, growth vs. value rotation). The defensive names (JNJ, PG, DUK) form another cluster. GLD and TLT -- the non-equity safe havens -- are the furthest from everything else. This hierarchy mirrors the real economic structure of markets.

HRP uses this hierarchy to allocate risk. Instead of asking "what's the optimal weight for each stock?" (which requires inverting the covariance matrix), it asks "how much risk should go to the tech cluster vs. the defensive cluster?" and then recursively drills down within each cluster. The questions at each level are simpler and more robust than the global optimization Markowitz requires.

Now let's implement the recursive bisection to get actual portfolio weights.

In [None]:
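# ── HRP Steps 2-3: quasi-diag + recursive bisection ─────────────
order = [int(i) for i in leaves_list(link)]   # leaf order, as positions into tickers
sorted_tickers = [tickers[i] for i in order]
cov_vals = cov.values

def cluster_var(cov, idx):
    """Variance of the inverse-variance-weighted portfolio of the assets in idx."""
    sub = cov[np.ix_(idx, idx)]
    ivp = 1.0 / np.diag(sub)
    ivp /= ivp.sum()
    return ivp @ sub @ ivp

def hrp_weights(cov, order):
    # w is indexed by asset position, i.e. in the original ticker order
    w = pd.Series(1.0, index=range(len(order)))
    clusters = [order]
    while clusters:
        new_clusters = []
        for cl in clusters:
            if len(cl) <= 1:
                continue
            mid = len(cl) // 2
            l, r = cl[:mid], cl[mid:]
            lv, rv = cluster_var(cov, l), cluster_var(cov, r)
            alpha = rv / (lv + rv)       # left share is inverse to left variance
            w.loc[l] *= alpha
            w.loc[r] *= 1 - alpha
            new_clusters += [l, r]
        clusters = new_clusters
    return w

w_hrp_raw = hrp_weights(cov_vals, order)
# w_hrp_raw is in original ticker order (not leaf order), so label it with tickers
w_hrp = pd.Series(w_hrp_raw.values, index=tickers)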

We now have three sets of optimized portfolio weights: Markowitz (sample covariance), Markowitz with Ledoit-Wolf shrinkage, and HRP. Let's compare them side by side, with equal weight as the baseline. The key question is: which approach produces the most *reasonable* allocation -- one that doesn't bet the farm on a single stock?

In [None]:
# ── Compare all three weight vectors ─────────────────────────────
eq_w = np.ones(n_assets) / n_assets
all_w = pd.DataFrame({
    'Equal Wt':  pd.Series(eq_w, index=tickers),
    'Markowitz': pd.Series(w_opt, index=tickers),
    'LW-Markowitz': pd.Series(opt_lw.x, index=tickers),
    'HRP':  w_hrp.reindex(tickers)
}).fillna(0)

all_w.plot(kind='bar', figsize=(10, 4), width=0.75)
plt.title('Portfolio Weights: Four Approaches')
plt.ylabel('Weight'); plt.xticks(rotation=0)
plt.legend(loc='upper right'); plt.tight_layout(); plt.show()

The visual tells the whole story. Equal weight is the flattest bar -- boring, democratic, everybody gets 10%. Markowitz is the most extreme -- it concentrates heavily in a few assets and ignores others entirely. The Ledoit-Wolf version is somewhat less extreme but still has clear favorites. HRP falls between equal-weight and the optimizers -- it *does* differentiate between assets (assigning more to lower-volatility names and less to higher-volatility ones), but it never produces the wild concentrations that Markowitz does.

DeMiguel, Garlappi & Uppal (2009), with over 5,000 citations, evaluated 14 optimization models across seven datasets and found that none consistently beat the "boring" equal-weight portfolio out of sample. The reason: estimation error in the expected returns and the covariance matrix overwhelms any benefit from optimization. By their estimate, you need approximately 3,000 months -- 250 years of data -- for mean-variance optimization to reliably beat 1/N for a 25-asset portfolio. HRP closes much of this gap by avoiding the matrix inversion that causes Markowitz to hallucinate.
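
The mechanism is easy to reproduce in a stylized simulation. The sketch below is an illustration under strong assumptions, not a replication of the paper: it uses a minimum-variance objective (no expected returns), and it assumes a "true" world in which all 25 assets share the same volatility and pairwise correlation, so 1/N is exactly the true minimum-variance portfolio. Every number in it is invented.

```python
import numpy as np

rng = np.random.default_rng(42)
n_assets, n_months = 25, 120      # ten years of monthly data for 25 assets

# Assumed true covariance: 5% monthly vol, 0.3 correlation between every pair
true_cov = 0.05**2 * (0.7 * np.eye(n_assets) + 0.3 * np.ones((n_assets, n_assets)))

# The optimizer never sees true_cov, only a finite sample drawn from it
sample = rng.multivariate_normal(np.zeros(n_assets), true_cov, size=n_months)
S = np.cov(sample, rowvar=False)

ones = np.ones(n_assets)
w_est = np.linalg.solve(S, ones)  # proportional to S^{-1} 1
w_est /= w_est.sum()              # estimated minimum-variance weights
w_eq = ones / n_assets            # 1/N

# Judge both portfolios under the true covariance
vol_est = np.sqrt(w_est @ true_cov @ w_est)
vol_eq = np.sqrt(w_eq @ true_cov @ w_eq)
print(f"true monthly vol, estimated min-variance: {vol_est:.4f}")
print(f"true monthly vol, equal weight:           {vol_eq:.4f}")
print(f"largest |weight| in estimated portfolio:  {np.abs(w_est).max():.3f}")
```

Because the assets are identical by construction, every deviation from 1/N in `w_est` is pure estimation noise, and it can only raise the portfolio's true risk.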

In the seminar, you'll run a proper rolling out-of-sample test comparing all three methods. In the homework, you'll scale this to 100 stocks with monthly rebalancing and transaction costs. For now, HRP is the default portfolio construction method for this course.

---

## 7. Transaction Costs -- The Silent Killer

Every model you build, every backtest you run, every Sharpe ratio you report -- all of it is a fantasy until you subtract transaction costs. A strategy that turns over 100% per month at 10 basis points round-trip loses 1.2% per year to costs (12 × 0.10%). That sounds survivable, but the drag scales linearly with both turnover and cost per trade: at 100 bps round-trip, the same strategy loses 12% per year. If your gross return is 10%, your net return is -2%. You're paying market makers for the privilege of losing money.

The simplest cost model -- and a reasonable starting point for liquid US large-caps -- is the constant-bps model:

$$\text{TC}_t = c \times \sum_{i=1}^{N} |w_{i,t} - w_{i,t-1}|$$

where $c$ is the cost per dollar traded (e.g., 5 bps = 0.0005 per side, 10 bps round-trip). The sum is the **turnover** -- the total fraction of your portfolio that changes.
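
As a quick numerical check of the formula, the sketch below computes turnover and cost for a single rebalance. The weight vectors are invented for illustration and `rebalance_cost` is a hypothetical helper name; $c$ is set to the 5 bps per-side figure quoted above.

```python
import numpy as np

def rebalance_cost(w_old, w_new, cost_per_dollar):
    """Return (turnover, cost) for one rebalance under the constant-bps model."""
    turnover = np.abs(np.asarray(w_new) - np.asarray(w_old)).sum()
    return turnover, cost_per_dollar * turnover

w_old = np.array([0.50, 0.30, 0.20])
w_new = np.array([0.40, 0.30, 0.30])   # sell 10% of asset 0, buy 10% of asset 2

turnover, cost = rebalance_cost(w_old, w_new, cost_per_dollar=0.0005)
print(f"turnover = {turnover:.2f}, cost = {cost * 1e4:.1f} bps of portfolio value")
```

Moving 10% of the portfolio from one asset to another shows up as 0.20 of turnover because the sale and the purchase are both counted. The 5 bps per-side charge therefore adds up to 10 bps on the 10% that moved, or 1 bp of total portfolio value.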

From now on, every homework in this course includes transaction costs. 10 bps round-trip is our standard assumption for liquid large-cap US equities. We'll revisit this with more realistic market impact models in Week 18.

Let's see what costs do to a simple momentum strategy.

We'll construct a simple 12-1 momentum signal (past 12-month return, skipping the most recent month) for our 10-asset universe, rebalance monthly, and compute cumulative returns at three different cost levels. Watch how the zero-cost line diverges from reality as costs increase.

In [None]:
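# ── Momentum strategy with transaction costs ─────────────────────
# Summing daily returns is exact for log returns, an approximation for simple returns
monthly_ret = returns.resample('ME').sum()
# Signal dated t = sum of the 12 monthly returns ending at t-1
signal = monthly_ret.rolling(12).sum().shift(1)  # 12-1 momentum
signal = signal.dropna()

# Long top-3, short bottom-3, equal weighted within
weights_ts = signal.apply(lambda row: pd.Series(
    np.where(row.rank() >= 8, 1/3,
    np.where(row.rank() <= 3, -1/3, 0)),
    index=row.index), axis=1)

# Weights held during month t come from the signal dated t-1 (no look-ahead)
held = weights_ts.shift(1)
port_ret = (held * monthly_ret).sum(axis=1).loc[signal.index]
# Measure turnover on the held weights so each cost lands in the month it is paid;
# fillna(0) charges the initial entry into the portfolio as well
turnover = held.fillna(0).diff().abs().sum(axis=1)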

We've built a straightforward long-short momentum strategy: each month, go long the top 3 performers and short the bottom 3, equally weighted within each leg. Now let's apply three cost levels and see how the cumulative return profile degrades.

In [None]:
# ── Cumulative returns at different cost levels ──────────────────
costs_bps = [0, 10, 20]
fig, ax = plt.subplots(figsize=(10, 5))
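for c in costs_bps:
    # turnover counts buys and sells separately, so charge the per-side cost (half the round-trip)
    tc = turnover * (c / 2) / 10000
    net = port_ret - tc
    cum_net = (1 + net).cumprod()
    ax.plot(cum_net, label=f'{c} bps round-trip')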

ax.set_title('Momentum Strategy: Impact of Transaction Costs')
ax.set_ylabel('Cumulative Return ($1 invested)')
ax.legend(); ax.axhline(1, color='grey', ls=':', lw=0.8)
plt.tight_layout(); plt.show()

The gap between the zero-cost fantasy and reality is the "transaction cost illusion" -- the difference between backtest results that make you feel smart and live results that make you feel poor. At 10 bps, the strategy's edge is substantially eroded. At 20 bps, the strategy may be underwater entirely. And 20 bps is generous for anything less liquid than S&P 500 stocks.

Costs do not hit every portfolio construction method equally. The Markowitz optimizer, which produces extreme weight changes each rebalance period (because small changes in the covariance estimate produce wild swings in optimal weights), tends to have the highest turnover and therefore the highest cost burden. Equal-weight has the lowest turnover (you only trade to undo price drift). HRP falls in between. This is another reason HRP tends to beat Markowitz in practice -- lower turnover means fewer dollars donated to market makers.

> **Did You Know?** A daily-turnover strategy at 10 bps round-trip burns 25% of portfolio value per year in costs alone. For reference, the average hedge fund's gross return is about 10-15%. That's spending twice your expected revenue on transaction costs. It's not a strategy -- it's a donation to market makers. This is why holding periods matter: weekly or monthly rebalancing strategies survive; second-by-second strategies require infrastructure that costs millions per year to operate.
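
The arithmetic behind these figures is simple enough to tabulate. The sketch below assumes that 100% of the portfolio turns over at every rebalance and that costs add up linearly with no compounding; `annual_cost_drag` is a hypothetical helper, and 252, 52 and 12 periods per year are the usual approximations.

```python
def annual_cost_drag(round_trip_bps, periods_per_year, turnover_per_period=1.0):
    """Approximate yearly cost as a fraction of portfolio value (simple sum, no compounding)."""
    return round_trip_bps / 10_000 * turnover_per_period * periods_per_year

for label, periods in [("daily", 252), ("weekly", 52), ("monthly", 12)]:
    drag = annual_cost_drag(10, periods)
    print(f"{label:>8}: {drag:.1%} per year")
```

At the same 10 bps round-trip, the drag falls from roughly 25% per year with daily turnover to about 5% with weekly and about 1% with monthly turnover, which is the case for longer holding periods in one line of arithmetic.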

---

## Summary

| Concept | Key Formula | The One Thing to Remember |
|---------|------------|-------------------------|
| Mean-Variance Optimization | $\mathbf{w}^* = \frac{\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu} - r_f\mathbf{1})}{\mathbf{1}^T \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu} - r_f\mathbf{1})}$ | Beautiful theory, catastrophic in practice -- $\boldsymbol{\Sigma}^{-1}$ amplifies noise |
| CAPM / Fama-French | $R_i - R_f = \alpha + \beta^{MKT}(R_m - R_f) + \ldots$ | Alpha is what remains after subtracting known risk factor returns |
| Sharpe Ratio | $SR = \frac{\mu_{excess}}{\sigma_{excess}} \times \sqrt{252}$ | SR > 2.5 in a backtest? You have a bug, not Medallion Fund |
| Fundamental Law | $IR = IC \times \sqrt{BR}$ | Breadth matters as much as accuracy; cross-sectional > time-series |
| Ledoit-Wolf Shrinkage | $\hat{\Sigma} = \delta F + (1-\delta) S$ | Always shrink. Never use the raw sample covariance for optimization |
| Marchenko-Pastur | $\lambda_{\pm} = \sigma^2(1 \pm \sqrt{N/T})^2$ | For 500 stocks, ~490 eigenvalues are pure noise |
| HRP | Cluster $\to$ quasi-diag $\to$ recursive bisection | No matrix inversion $\to$ stable weights $\to$ beats Markowitz OOS |
| Transaction Costs | $TC = c \times \sum \lvert \Delta w_i \rvert$ | 10 bps round-trip is our standard; always report net-of-cost results |

## Looking Ahead -- Week 4: Cross-Sectional Return Prediction

You now have the complete toolkit for evaluating any strategy or ML model in finance. You know what alpha means (the return that known risk factors can't explain), why the Sharpe ratio is the primary metric (risk-adjusted, annualized, comparable across strategies), and why most "optimal" portfolios blow up (covariance estimation error amplified by matrix inversion). You've seen that a tiny prediction accuracy of IC = 0.05, applied across many stocks, can generate world-class performance -- the Fundamental Law tells you exactly when your model is good enough.

Next week, we put all of this to work. We'll build our first cross-sectional return prediction model -- the bread and butter of quantitative asset management. You'll engineer features (momentum, value, volatility, size), train Ridge and Lasso regressions, and evaluate them with the IC we introduced today and expanding-window cross-validation. For the first time in the course, you'll see a number come out of a model and know exactly what it means -- and whether it's good enough to trade.

---

### Suggested Reading

- **Grinold & Kahn, *Active Portfolio Management* (1999)** -- The source of the Fundamental Law of Active Management. Dense and mathematical, but Chapter 6 on the Fundamental Law is essential reading for anyone building ML models for finance. It tells you whether your model has any chance of making money before you ever run a backtest.

- **DeMiguel, Garlappi & Uppal (2009), "Optimal Versus Naive Diversification"** -- The paper that showed equal-weighting beats optimization. Over 5,000 citations. Short, readable, and humbling. If you read one paper this week, read this one.

- **Lopez de Prado, *Advances in Financial Machine Learning*, Chapter 16** -- HRP explained by its inventor. The motivation is compelling even if the mathematical details are dense. The key insight -- that hierarchical structure avoids matrix inversion entirely -- will change how you think about portfolio construction.