# Week 4 — Cross-Sectional Return Prediction: Linear Models

> *"An R-squared of 0.4% on 3,000 stocks, rebalanced monthly, translates to a long-short portfolio with a Sharpe ratio above 2. That's better than 99% of hedge funds. Welcome to the economics of cross-sectional prediction, where tiny edges compound across enormous breadth."*

In 2020, Shihao Gu, Bryan Kelly, and Dacheng Xiu published a paper in the *Review of Financial Studies* that ended a decades-long debate: can machine learning predict stock returns? They tested eight model classes — from ordinary least squares to deep neural networks — on 30,000 US stocks over 60 years, using 94 firm characteristics as features. The answer was yes. Neural networks achieved an out-of-sample monthly R-squared of roughly 0.4%.

Before you scoff at that number, let's do some math. The Fundamental Law of Active Management (Week 3) says your information ratio is IC times the square root of breadth. An R-squared of 0.4% on 3,000 stocks, rebalanced monthly, translates to an IC around 0.05 and annual breadth of roughly 36,000 stock-months. Taken literally, the formula gives 0.05 × √36,000 ≈ 9.5. That figure assumes 36,000 independent bets, and stock returns are far from independent, so the effective breadth is much smaller. Even after a heavy haircut, you are left with an information ratio above 2, which means a long-short portfolio Sharpe above 2. That's better than virtually every discretionary hedge fund on the planet. And the model "explains" less than half a percent of return variation.

Here's the real surprise from the Gu-Kelly-Xiu paper: the improvement from neural networks over regularized linear models was modest. Elastic Net achieved an R-squared of ~0.2%. Neural networks got ~0.4%. The gap is real but not enormous. Most of the predictive power came from the *features*, not the model architecture. The model was a second-order effect.

Everything you've learned so far has been building toward this moment. You have clean data (Week 1). You know how to make it stationary (Week 2). You understand what alpha means, how to measure it with the IC, and how the Fundamental Law turns tiny signal into portfolio performance (Week 3). Today, we put it all together. We build the cross-sectional prediction pipeline that is the bread and butter of quantitative asset management — and we start, deliberately, with the simplest models. Because if Ridge regression can produce meaningful out-of-sample signal with one hyperparameter, that tells you something profound about the structure of the problem.

In [None]:
import numpy as np
import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from IPython.display import display, Markdown

plt.rcParams.update({'figure.figsize': (10, 5), 'figure.dpi': 100,
                      'axes.spines.top': False, 'axes.spines.right': False,
                      'font.size': 11})
sns.set_palette('colorblind')

With our standard toolkit loaded, we're ready to build. We'll use `yfinance` for price and volume data, `scikit-learn` for the linear models, and `scipy` for the rank correlations that define our evaluation metric. Everything from here forward follows a clean pipeline: download data, engineer features, train models, evaluate with expanding-window cross-validation, and interpret results.

Let's start by downloading a representative universe of stocks. We're using 30 well-known large-cap and mid-cap names — not the full S&P 500 (that would take a while), but enough to illustrate every concept. In the homework, you'll scale this to 200+ stocks.

In [None]:
tickers = ['AAPL', 'MSFT', 'AMZN', 'GOOGL', 'META', 'NVDA', 'JPM', 'JNJ',
           'V', 'PG', 'UNH', 'HD', 'DIS', 'BAC', 'XOM', 'PFE', 'KO',
           'CSCO', 'PEP', 'ABT', 'MRK', 'INTC', 'WMT', 'T', 'CVX',
           'NEE', 'MDT', 'LIN', 'TXN', 'QCOM']

raw = yf.download(tickers, start='2010-01-01', end='2024-01-01',
                  auto_adjust=True, progress=False)

We now have daily price and volume data for 30 stocks spanning 14 years — roughly 3,500 trading days each. That's enough cross-sectional breadth and time-series depth to demonstrate every concept in this lecture. Notice that we're pulling well-established companies with long trading histories; in the homework, you'll also deal with stocks that have shorter histories, more missing data, and wilder behavior.

Let's extract what we need: closing prices and daily returns.

In [None]:
close = raw['Close'].dropna(how='all')
volume = raw['Volume'].dropna(how='all')
returns_daily = close.pct_change()
display(Markdown(f"**Universe:** {close.shape[1]} stocks, "
                 f"{close.shape[0]} trading days"))

---

## 1. The Cross-Sectional Prediction Problem

Here's a question that separates ML engineers from quant researchers: "Predict whether Apple stock will go up next month." An ML engineer starts thinking about LSTMs, attention mechanisms, and historical price sequences. A quant researcher says: *wrong question.*

The right question is: "Will Apple outperform the median stock next month?" This reframing — from absolute to relative prediction — is the most important conceptual shift in this entire course. It changes the loss function, the features, and the evaluation metric. And it makes the problem dramatically more tractable.

Think about why. If you're predicting Apple's absolute return, you need to forecast the macroeconomy, interest rates, consumer spending, the iPhone cycle, and whatever Tim Cook had for breakfast. If you're predicting Apple's *relative* return — whether it beats the average stock — you only need to know whether Apple-specific factors (momentum, valuation, volatility) are more favorable than average. The macro stuff cancels out because it affects all stocks similarly.

Formally, the cross-sectional prediction problem at time $t$ is:

Given features $\mathbf{x}_{i,t}$ for stocks $i = 1, \dots, N_t$, predict *excess returns*:

$$r_{i,t+1}^{e} = r_{i,t+1} - r_{m,t+1}$$

where $r_{m,t+1}$ is the market return, proxied here by the equal-weighted average return across all stocks in the universe at $t+1$. The model:

$$\hat{r}_{i,t+1}^{e} = f(\mathbf{x}_{i,t};\; \theta_t)$$

We evaluate with the **Information Coefficient** — the rank correlation between predictions and realized returns:

$$IC_t = \text{Spearman}\!\left(\hat{r}_{i,t+1}^{e},\; r_{i,t+1}^{e}\right)$$

If you're an ML engineer, this is just Spearman's $\rho$ between your predictions and labels. In finance, it has a name and a mystique, but it's the same thing. An IC of 0.05 — a rank correlation of 5% — is considered *excellent*. In image classification, 5% correlation would be garbage. In finance, it's a career.
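To make the per-date definition concrete, here is a minimal sketch of computing $IC_t$ for every cross-section in a `(date, ticker)` panel. The helper name `cross_sectional_ic`, the column names, and the synthetic data are all assumptions made for illustration; they are not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

def cross_sectional_ic(df, pred_col, target_col):
    """Per-date Spearman correlation between a prediction and a target column."""
    def _ic(group):
        # Spearman's rho equals the Pearson correlation of the ranks.
        return group[pred_col].rank().corr(group[target_col].rank())
    return df.groupby(level="date")[[pred_col, target_col]].apply(_ic)

# Synthetic toy panel: 3 dates x 5 hypothetical tickers.
rng = np.random.default_rng(0)
dates = pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"])
index = pd.MultiIndex.from_product([dates, list("ABCDE")],
                                   names=["date", "ticker"])
toy = pd.DataFrame({"target": rng.normal(0.0, 0.05, len(index))}, index=index)
toy["pred_perfect"] = toy["target"]      # same ordering as the target
toy["pred_reversed"] = -toy["target"]    # exactly opposite ordering

ic_perfect = cross_sectional_ic(toy, "pred_perfect", "target")
ic_reversed = cross_sectional_ic(toy, "pred_reversed", "target")
print(ic_perfect.mean(), ic_reversed.mean())
```

The two extreme cases bracket what a real model can do: a perfect ranking gives an IC of 1 on every date, a perfectly inverted one gives -1, and realistic signals sit very close to zero.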

Let's see what this data structure looks like in practice. The cross-sectional panel is a DataFrame indexed by (date, ticker), with columns for features and a target column for next-month excess return. This is the data format that every model in this course will consume — from the Ridge regression we build today through the neural networks of Week 7.

In [None]:
monthly_close = close.resample('ME').last()
monthly_ret = monthly_close.pct_change()
market_ret = monthly_ret.mean(axis=1)
excess_ret = monthly_ret.sub(market_ret, axis=0)

panel_example = excess_ret.stack().reset_index()
panel_example.columns = ['date', 'ticker', 'excess_return']
display(panel_example.head(10))

That stacked format — every row is one stock in one month — is the canonical shape of cross-sectional data. At each date, you have $N$ stocks, each with an excess return. When you stack these cross-sections over time, you get a panel with $(N \times T)$ rows. For 30 stocks over 14 years of monthly data, that's roughly 5,000 rows. For 3,000 stocks over 60 years (the Gu-Kelly-Xiu scale), it's over 2 million rows. The models are the same — only the scale changes.

Now let's populate this panel with the features that decades of research have shown drive the cross-section of stock returns.

---

## 2. Feature Engineering — The Factors That Move Stocks

In 1993, Narasimhan Jegadeesh and Sheridan Titman published one of the most influential papers in finance: stocks that went up in the past 12 months tend to keep going up, and stocks that went down tend to keep going down. This is the **momentum anomaly**, and it's been replicated in 40+ countries, across asset classes, and over centuries of data. The paper has over 12,000 citations. It remains the single most robust stock market anomaly ever documented.

Your ML model will rediscover momentum. It has no choice — momentum is the strongest single predictor in the cross-section. But the feature set goes deeper. Decades of academic research have identified the characteristics that predict stock returns: momentum, value, size, volatility, quality, and liquidity. These aren't arbitrary choices — each represents a *reason* why some stocks outperform others. Momentum works because investors underreact to news. Value works (sometimes) because investors overpay for exciting growth stories. The low-volatility anomaly works because institutional investors chase high-beta stocks to hit their return targets, bidding them above fair value.

> **Did You Know?** As of 2024, academic researchers have published over 400 "factors" that supposedly predict stock returns. Harvey, Liu & Zhu (2016) showed that with 400 factors tested, the standard statistical bar (t-stat > 2) is far too low — you'd expect 20 false positives by chance alone. The true threshold should be t > 3 or higher. This "factor zoo" problem means most published anomalies are probably noise. The handful that survive rigorous testing — momentum, value, quality — are the ones we'll focus on.

The standard momentum feature uses a "skip-1" convention:

$$\text{MOM}_{12,1}(i, t) = \frac{P_{i,t-1}}{P_{i,t-12}} - 1$$

Why skip the most recent month? Because the most recent month shows *reversal* — recent losers tend to bounce back, and recent winners tend to fade. Momentum and reversal are opposite effects operating at different timescales. If you include the recent month, you're mixing a positive signal (12-month momentum) with a negative one (1-month reversal). Separating them gives the model two clean features instead of one noisy one.

Volatility is also predictive — but in the *wrong* direction. High-volatility stocks tend to underperform, not outperform. This "low-volatility anomaly" violates the basic finance textbook claim that higher risk earns higher return. It persists because institutional investors are constrained to hit return targets, so they pile into high-beta stocks, bidding them above fair value.

Let's compute the momentum family first — the features that carry the most cross-sectional signal.

In [None]:
mom_1m = monthly_close.pct_change(1)
mom_3m = monthly_close.pct_change(3)
mom_6m = monthly_close.pct_change(6)
mom_12m_skip1 = monthly_close.shift(1).pct_change(11)
reversal = mom_1m.copy()

We've built the momentum features using monthly prices. Notice the `shift(1).pct_change(11)` pattern for 12-month momentum with the skip: we shift the prices by one month (to exclude the most recent month) and then compute the 11-month return. That gives us the return from month $t-12$ to month $t-1$, the standard skip-1 momentum window. The reversal feature is simply the most recent month's return, entered as a separate feature so the model can learn its negative relationship with forward returns. Be aware that, as coded, `reversal` is an exact copy of `mom_1m`, so the two columns are perfectly collinear: only the sum of their coefficients is identified, and OLS has no unique way to split it between them.
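The indexing in the skip-1 pattern is easy to get wrong by one month, so a quick check on toy data can help. The sketch below uses a made-up price series growing 1% per month and confirms that `shift(1).pct_change(11)` at month $t$ equals $P_{t-1}/P_{t-12} - 1$. Passing `fill_method=None` is a choice made here to avoid any forward-filling of gaps; the notebook cell relies on the pandas default.

```python
import numpy as np
import pandas as pd

# Toy month-end prices: 1% growth per month for 15 months (illustrative only).
prices = pd.Series(100 * 1.01 ** np.arange(15), name="price")

# Same pattern as the notebook: drop the latest month, then an 11-month return.
mom_skip1 = prices.shift(1).pct_change(11, fill_method=None)

# Manual check at the last month t: P_{t-1} / P_{t-12} - 1.
t = len(prices) - 1
manual = prices.iloc[t - 1] / prices.iloc[t - 12] - 1

print(mom_skip1.iloc[t], manual)
```

The first twelve values are missing because the feature needs twelve months of price history before it can produce a number.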

Now let's add volatility, volume, and size — the features that capture risk, liquidity, and scale.

In [None]:
vol_20d = returns_daily.rolling(20).std().resample('ME').last() * np.sqrt(252)
vol_60d = returns_daily.rolling(60).std().resample('ME').last() * np.sqrt(252)
log_price = np.log(monthly_close)
dollar_vol = (close * volume).rolling(20).mean().resample('ME').last()
log_dollar_vol = np.log(dollar_vol)

We're using log-price as a size proxy since `yfinance` doesn't provide shares outstanding directly. It's imperfect — a $500 stock isn't necessarily a larger company than a $100 stock — but it correlates with market cap reasonably well for a lecture demo. In the homework, you'll compute proper log market cap. Log dollar volume captures liquidity — stocks that trade more dollars per day are easier to enter and exit without moving the price.

Let's add two more technical features and then assemble the full matrix.

In [None]:
ma_50 = close.rolling(50).mean()
ma_200 = close.rolling(200).mean()
ma_ratio = (ma_50 / ma_200).resample('ME').last()
vol_adj_mom = mom_6m / vol_20d

The moving average ratio captures whether a stock is trending above or below its long-term average — when MA50 > MA200, the stock is in an uptrend. Volatility-adjusted momentum divides the raw 6-month return by its volatility, producing a Sharpe-like ratio that normalizes for risk. A stock that gained 20% with 10% volatility is a stronger momentum signal than one that gained 20% with 40% volatility.

Now let's stack everything into the panel format our models need.

In [None]:
features_dict = {
    'mom_1m': mom_1m, 'mom_3m': mom_3m, 'mom_6m': mom_6m,
    'mom_12m_skip1': mom_12m_skip1, 'reversal': reversal,
    'vol_20d': vol_20d, 'vol_60d': vol_60d, 'log_price': log_price,
    'log_dollar_vol': log_dollar_vol, 'ma_ratio': ma_ratio,
    'vol_adj_mom': vol_adj_mom,
}

# Stack each wide (date x ticker) frame into a long Series named after the feature.
feat_frames = [df.stack().rename(name) for name, df in features_dict.items()]

features = pd.concat(feat_frames, axis=1)
features.index.names = ['date', 'ticker']

We've assembled 11 features for 30 stocks at monthly frequency. The MultiIndex `(date, ticker)` is the standard structure for cross-sectional panels: each row is one stock in one month. Let's inspect the shape, check for missing data, and look at the feature correlations — because all three will shape how the models behave.

In [None]:
display(Markdown(f"**Panel shape:** {features.shape[0]:,} rows "
                 f"({features.index.get_level_values('ticker').nunique()} stocks "
                 f"x {features.index.get_level_values('date').nunique()} months)"))
display(Markdown("**Missing data by feature:**"))
display((features.isna().mean() * 100).round(1).to_frame('% missing'))

Notice the missing data pattern: features that require long lookback windows — like 12-month momentum and the moving average ratio (which needs 200 daily observations) — have more missing values at the start of the sample. The 20-day volatility fills in quickly, but 12-month momentum needs a full year of history before it can produce its first value. In the homework, where you'll work with 200+ stocks including some that were listed recently, the missing data problem is substantially worse. Cross-sectional median imputation — replacing NaN with the median feature value across all stocks at that date — is the standard approach at most quant funds.
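Cross-sectional median imputation as described above can be sketched in a couple of lines. The helper name `impute_cross_sectional_median` and the toy values are hypothetical; the sketch assumes a panel indexed by `(date, ticker)` like the one built in this section.

```python
import numpy as np
import pandas as pd

def impute_cross_sectional_median(features):
    """Fill NaNs with the median of the same feature across stocks on that date."""
    medians = features.groupby(level="date").transform("median")
    return features.fillna(medians)

# Toy panel: 2 dates x 3 hypothetical tickers, one missing value per date.
dates = pd.to_datetime(["2020-01-31", "2020-02-29"])
index = pd.MultiIndex.from_product([dates, ["A", "B", "C"]],
                                   names=["date", "ticker"])
toy = pd.DataFrame(
    {"mom_12m_skip1": [0.10, np.nan, 0.30, -0.05, 0.05, np.nan]}, index=index)

filled = impute_cross_sectional_median(toy)
print(filled)
```

Because the median is taken within each date, the fill value uses only information available at that date. If a feature is missing for every stock on a date, the median is itself missing and those rows stay NaN.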

These features are also *correlated* with each other, and that correlation is about to cause problems for OLS. Let's see how correlated they are.

In [None]:
corr = features.dropna().corr()
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='RdBu_r', center=0,
            vmin=-1, vmax=1, ax=ax, square=True, cbar_kws={'shrink': .7})
ax.set_title('Feature Correlation Matrix')
plt.tight_layout()

Study the blocks of high correlation. The momentum features cluster together — 1m, 3m, 6m, and 12m momentum are all measuring the same underlying tendency for winners to keep winning, just over different horizons. The volatility features are nearly identical (20-day and 60-day realized vol are computed from heavily overlapping windows). Volume correlates with size, because big companies trade more.

When OLS tries to disentangle these correlated signals, it gives one momentum feature a coefficient of +50 and another -47 — the net effect is modest, but the individual coefficients are absurd and unstable. Add one month of new data and the signs can flip entirely. This is multicollinearity in action, and it's the disease that Ridge regression was invented to cure.
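The offsetting-coefficients effect is easy to reproduce on synthetic data. The sketch below builds two nearly identical features (standing in for two overlapping momentum horizons), fits OLS and Ridge, and compares how far apart the two coefficients end up. All numbers, including the Ridge penalty of 10, are assumptions chosen to make the effect visible, not values from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)

# Two nearly identical features sharing one underlying signal.
x1 = signal + 0.01 * rng.normal(size=n)
x2 = signal + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 0.5 * signal + rng.normal(size=n)   # noisy target

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The gap between the two coefficients measures how much one offsets the other.
ols_gap = abs(ols.coef_[0] - ols.coef_[1])
ridge_gap = abs(ridge.coef_[0] - ridge.coef_[1])
print("OLS coefficients:  ", ols.coef_, "gap:", ols_gap)
print("Ridge coefficients:", ridge.coef_, "gap:", ridge_gap)
```

Both models agree on the sum of the two coefficients, which is the part the data pins down. They differ on how the sum is split, and that split is where OLS is free to wander.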

Let's assemble the full panel with our target variable — next-month excess return — and apply the standard preprocessing before we train models.

In [None]:
target = excess_ret.shift(-1).stack()
target.name = 'target'
target.index.names = ['date', 'ticker']

panel = features.join(target, how='inner').dropna()
display(Markdown(f"**Clean panel:** {len(panel):,} rows, "
                 f"{panel.shape[1] - 1} features + 1 target"))

The target is the forward excess return — what each stock earned relative to the market in the *following* month. The `shift(-1)` ensures that at date $t$, the target is the return from $t$ to $t+1$. This is the standard label for cross-sectional prediction: we use features known at $t$ to predict the return realized at $t+1$.
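A tiny example makes the alignment produced by `shift(-1)` visible. The returns below are invented, and the tickers `A` and `B` are placeholders.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-31", "2020-02-29", "2020-03-31"])
ret = pd.DataFrame({"A": [0.01, 0.02, 0.03],
                    "B": [-0.01, 0.00, 0.05]}, index=dates)

# shift(-1) pulls next month's return back onto the current row.
target = ret.shift(-1)
print(target)
```

The January row now holds February's return, which is what a model using January features should be asked to predict. The final row has no next month, so it is NaN and drops out of the clean panel.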

> **Did You Know?** Cross-sectional rank normalization — mapping each feature to the [0, 1] interval across stocks at each date — is used at virtually every quant fund and is so standard that academic papers don't even bother mentioning it. It removes outliers (a stock that rose 300% won't dominate the regression), makes features comparable (market cap and momentum are both on [0, 1]), and makes the model robust to distributional shifts over time. Think of it as the quantitative finance equivalent of batch normalization.

In [None]:
feat_cols = [c for c in panel.columns if c != 'target']

# Percentile-rank every feature within each date's cross-section.
panel[feat_cols] = panel.groupby(level='date')[feat_cols].rank(pct=True)

Each feature is now rank-normalized cross-sectionally: at every date, every stock's feature value is mapped to its percentile rank among all stocks at that date. The stock with the highest momentum gets 1.0, the lowest gets $1/N$ (about 0.03 with 30 stocks), and everything else is evenly spaced in between. This removes the influence of outliers, makes features directly comparable, and ensures the model is robust to regime changes where the *levels* of features shift but the *ranks* remain meaningful.
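Here is a small illustration of what percentile ranking does to an outlier. The values are invented: one hypothetical stock has a 300% return in January, and the ranking is computed separately for each date.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-31"] * 4 + ["2020-02-29"] * 4)
tickers = ["A", "B", "C", "D"] * 2
index = pd.MultiIndex.from_arrays([dates, tickers], names=["date", "ticker"])
toy = pd.DataFrame(
    {"mom_6m": [0.02, 3.00, -0.10, 0.05,    # B is a 300% outlier in January
                0.01, 0.02, 0.03, 0.04]},
    index=index)

# Rank within each date; pct=True divides the rank by the group size.
ranked = toy.groupby(level="date").rank(pct=True)
print(ranked)
```

After ranking, the outlier is simply the top-ranked stock at 1.0, one step of 0.25 above the next stock, regardless of how extreme its raw value was.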

Our data pipeline is complete. We have a clean panel of 11 rank-normalized features and a forward excess return target. Time to train some models.

---

## 3. Linear Models — OLS, Ridge, Lasso, and Elastic Net

In the Gu-Kelly-Xiu study, OLS — ordinary least squares, no regularization — had the *worst* out-of-sample performance of all eight model classes. Worse than random forests. Worse than neural nets. Even worse than partial least squares. OLS overfit spectacularly, fitting noise in 94 correlated features as if it were signal.

Ridge regression, with a single hyperparameter controlling the penalty, nearly doubled OLS's out-of-sample R-squared. One hyperparameter. That's the lesson: in financial prediction, where signal-to-noise is extremely low, regularization isn't optional — it's the difference between a model that works and one that hallucinates.

Here's the intuition. OLS finds the $\boldsymbol{\beta}$ that minimizes the sum of squared errors — period. If two features are correlated, OLS is free to give one a coefficient of +100 and the other -100, as long as the net prediction is good *in sample*. That's a terrible strategy when the correlations shift even slightly out of sample. Ridge adds a penalty for coefficient magnitude:

$$\mathcal{L}_{\text{Ridge}} = \underbrace{\sum_{i=1}^{N} (r_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}_{\text{fit the data}} + \underbrace{\alpha \|\boldsymbol{\beta}\|_2^2}_{\text{keep coefficients small}}$$

The analytical solution:

$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{r}$$

Compare with OLS: $\hat{\boldsymbol{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{r}$. The only difference is $\alpha\mathbf{I}$ — a small diagonal addition that stabilizes the matrix inversion. Same idea as Ledoit-Wolf shrinkage from Week 3: add structure to a noisy estimate.
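The closed-form expression can be checked numerically. The following sketch, on synthetic data, solves the Ridge normal equations with NumPy and compares the result with scikit-learn's `Ridge` with the intercept switched off, which minimizes the same objective as written above. The data, the true coefficients, and the choice of `alpha = 1.0` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
r = X @ np.array([0.5, -0.3, 0.0, 0.2, 0.0]) + rng.normal(size=200)
alpha = 1.0

# beta = (X'X + alpha I)^{-1} X'r, solved without forming the inverse.
gram = X.T @ X
beta_ridge = np.linalg.solve(gram + alpha * np.eye(X.shape[1]), X.T @ r)

# OLS is the same system with alpha = 0.
beta_ols = np.linalg.solve(gram, X.T @ r)

# scikit-learn's Ridge without an intercept should agree with the formula.
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, r).coef_

print("closed form:", beta_ridge)
print("sklearn:    ", beta_sklearn)
print("norms (ridge, ols):", np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the usual numerical practice. The Ridge coefficient vector is always shorter than the OLS one, which is the shrinkage the penalty term buys.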

Lasso replaces the L2 penalty with L1:

$$\mathcal{L}_{\text{Lasso}} = \sum_{i=1}^{N} (r_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 + \alpha \|\boldsymbol{\beta}\|_1$$

There's no closed-form solution — Lasso requires coordinate descent. But the L1 penalty does something Ridge can't: it drives coefficients to *exactly zero*, performing automatic feature selection. In a world with 400+ published factors and a signal-to-noise ratio near 0.001, knowing which features to *ignore* is almost as valuable as knowing which to use.
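The "exactly zero" property is worth seeing rather than taking on faith. This sketch fits Lasso and Ridge on synthetic data in which only two of ten features carry signal, then counts coefficients that are exactly zero. The penalty values are assumptions tuned to this toy problem and would not transfer to the return panel.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 400, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:2] = [1.0, -1.0]          # only the first two features matter
y = X @ true_beta + rng.normal(size=n)

lasso = Lasso(alpha=0.2).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero, not merely small.
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print("Lasso zeros:", n_zero_lasso, "Ridge zeros:", n_zero_ridge)
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

Ridge leaves every coefficient small but nonzero, while Lasso removes most of the noise features outright. The price is visible too: the two surviving Lasso coefficients are shrunk below their true magnitude of 1.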

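Elastic Net combines both penalties: $\alpha\bigl[\rho \|\boldsymbol{\beta}\|_1 + (1 - \rho) \|\boldsymbol{\beta}\|_2^2\bigr]$, where the mixing weight $\rho \in [0, 1]$ corresponds to `l1_ratio` in scikit-learn (whose implementation also rescales the terms by constant factors). It hedges between Ridge's smooth shrinkage and Lasso's sparse selection. Think of it as a Bayesian prior that says: "most coefficients should be small, and some should be zero."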

Let's train all four models on the same data and compare their coefficients. This is where the differences become visceral.

In [None]:
dates = panel.index.get_level_values('date').unique().sort_values()
split = dates[int(len(dates) * 0.7)]

train = panel.loc[panel.index.get_level_values('date') <= split]
test = panel.loc[panel.index.get_level_values('date') > split]

X_train, y_train = train[feat_cols].values, train['target'].values
X_test, y_test = test[feat_cols].values, test['target'].values

We've split the data at the 70% mark in time: everything up to and including that date is training, everything after is test. This is a simple temporal split, not the full expanding-window CV we'll build in Section 4, but it's enough to compare models. Notice: we split by *date*, not randomly. Shuffling would put observations from later dates into the training set, so the model would be fit on information from the future relative to some test rows. This is the first commandment of financial ML evaluation: violate it and your results are fiction.

In [None]:
models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.001, max_iter=5000),
    'ElasticNet': ElasticNet(alpha=0.001, l1_ratio=0.5, max_iter=5000),
}

for name, m in models.items():
    m.fit(X_train, y_train)

All four models are trained on the same data. Now let's look at what they learned — the coefficient vectors. This is where the story gets interesting, because the coefficients reveal *how* each model interprets the features, and the differences between OLS and the regularized models are often dramatic.

In [None]:
coef_df = pd.DataFrame({name: m.coef_ for name, m in models.items()},
                        index=feat_cols)

fig, axes = plt.subplots(1, 4, figsize=(16, 5), sharey=True)
for ax, name in zip(axes, coef_df.columns):
    coef_df[name].plot.barh(ax=ax, color='steelblue')
    ax.set_title(name, fontsize=12)
    ax.axvline(0, color='k', lw=0.5)
plt.suptitle('Coefficient Comparison Across Models', y=1.02, fontsize=14)
plt.tight_layout()

Study those coefficient bars carefully. OLS has wild swings — some features get large positive coefficients, others large negative ones, and the magnitudes are much larger than the regularized models. This is multicollinearity doing its damage: OLS is using one momentum feature to offset another, producing a fragile linear combination that works in sample but will blow up on new data.

Ridge shrinks every coefficient toward zero, and it shrinks hardest along the directions where the features are most collinear, which is exactly where OLS is least stable. The coefficients are smaller, smoother, and more stable. Momentum features (especially the 12-month skip-1 variant) tend to retain the largest positive coefficients, consistent with decades of academic evidence. Volatility features tend to be negative, consistent with the low-volatility anomaly.

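Lasso is more surgical: it zeros out several features entirely, keeping only the ones that carry the strongest independent signal. The features that survive Lasso's knife are a useful diagnostic: they tell you which characteristics the cross-section is actually rewarding right now. One caution: if the penalty is too strong for the scale of the target, Lasso zeros out *every* coefficient, the predictions become constant, and the IC is undefined, so check the bar chart before trusting any out-of-sample numbers.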

But coefficients don't tell you whether the model *works*. For that, we need out-of-sample evaluation.

In [None]:
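results = []
for name, m in models.items():
    preds = m.predict(X_test)
    # Pooled IC: one Spearman correlation over all test rows, rather than
    # the average of the per-date IC_t defined in Section 1.
    ic = spearmanr(preds, y_test).statistic
    r2_is = m.score(X_train, y_train)    # in-sample R-squared
    r2_oos = m.score(X_test, y_test)     # out-of-sample R-squared
    results.append({'Model': name, 'IC (OOS)': round(ic, 4),
                    'R-sq in-sample': round(r2_is, 5),
                    'R-sq OOS': round(r2_oos, 5)})

display(pd.DataFrame(results).set_index('Model'))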

There's the pattern that repeats in every cross-sectional study: OLS has the highest in-sample R-squared — it's the *definition* of the best fit to training data — but out of sample, it typically has the lowest IC and the worst (often negative) R-squared. The model memorized the noise and extrapolated it forward.

Ridge, with its gentle shrinkage, tends to produce the best or near-best out-of-sample IC. Lasso is competitive but often slightly lower, because it zeros out features that carry small but genuine signal. With only 30 stocks, the differences between Ridge and Lasso are within sampling error — you need hundreds of stocks and dozens of years to see a reliable difference. That's what the homework is for.

> **Did You Know?** At most quant funds, an IC of 0.05 is considered excellent. An IC of 0.10 would be world-class — and suspicious (your first reaction should be "check for look-ahead bias"). Most published factor ICs are between 0.02 and 0.06. Cliff Asness founded AQR (Applied Quantitative Research) in 1998 with \$1 billion, and the firm now manages over \$100 billion — primarily through systematic strategies based on exactly the factors we're implementing: momentum, value, quality, and low volatility. The strategy hasn't fundamentally changed in 25 years. The edge comes from discipline, scale, and continuous refinement.

---

## 4. Expanding-Window Cross-Validation

If you use standard 5-fold cross-validation on financial data, you will get a beautiful, impressive, and completely fake result.

Here's why. Imagine your training set contains data from March 2020 and your test fold contains data from December 2019. Your model has seen the future: it has already been fit on the COVID crash, and it's using that information to "predict" the past. The CV score looks great, but it's measuring your model's ability to time-travel, not to predict. In production, your model doesn't have a time machine.

The fix is expanding-window cross-validation: at each month $t$, train on *everything* from the beginning of the sample up to month $t-1$, then predict month $t$. The training window grows over time (hence "expanding"), and the model never sees future data. This is the only correct evaluation method for temporal data, and it's the standard we'll use for the remaining 14 weeks of the course.

Formally, for each month $t = T_0, T_0 + 1, \dots, T$:

$$\text{Train: } \{(\mathbf{x}_{i,s},\, r_{i,s+1})\}_{s=1}^{t-1} \qquad\qquad \text{Predict: } \hat{r}_{i,t+1} = f(\mathbf{x}_{i,t};\, \hat{\theta}_{1:t-1})$$

$$\text{Evaluate: } IC_t = \text{Spearman}(\hat{r}_{i,t+1},\, r_{i,t+1})$$

The final metric is the average IC across all out-of-sample months:

$$\overline{IC} = \frac{1}{T - T_0 + 1} \sum_{t=T_0}^{T} IC_t$$

The key parameter $T_0$ is the minimum training window — you need enough historical data to fit a stable model before you start predicting. We'll use 60 months (5 years) as the minimum, which is standard in the industry.
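The splitting logic can be separated from the model fitting. The generator below is a hypothetical helper (`expanding_window_splits` is not a library function) that yields a growing set of training dates and a single test date, under the assumption that each panel row is keyed by the date on which its features are observed. The example uses 12 months and a minimum window of 9 purely to keep the output short.

```python
import pandas as pd

def expanding_window_splits(dates, min_train=60):
    """Yield (train_dates, test_date) pairs; training always precedes testing."""
    dates = pd.DatetimeIndex(sorted(set(dates)))
    for i in range(min_train, len(dates)):
        # Train on the first i dates, test on the date that follows them.
        yield dates[:i], dates[i]

# Toy calendar: 12 monthly dates, minimum training window of 9 months.
months = pd.to_datetime([f"2020-{m:02d}-01" for m in range(1, 13)])
splits = list(expanding_window_splits(months, min_train=9))

for train_dates, test_date in splits:
    print(len(train_dates), "training months -> test", test_date.date())
```

Every training date is strictly earlier than its test date, and the window grows by one month per step. In a real panel you would select rows by membership in `train_dates` and by equality with `test_date`.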

Let's implement this from scratch and see what the rolling IC looks like for Ridge regression.

In [None]:
dates_sorted = panel.index.get_level_values('date').unique().sort_values()
min_train = 60
ic_results = []

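for i in range(min_train, len(dates_sorted) - 1):
    train_end = dates_sorted[i]
    test_date = dates_sorted[i + 1]
    # Rows dated train_end carry the return realized at test_date, which is
    # already known when we predict the return that follows test_date.
    tr = panel.loc[panel.index.get_level_values('date') <= train_end]
    te = panel.loc[panel.index.get_level_values('date') == test_date]
    if len(te) < 5:    # skip cross-sections too thin for a rank correlation
        continue
    # Refit from scratch on the expanded window, then score one month.
    ridge = Ridge(alpha=1.0).fit(tr[feat_cols], tr['target'])
    preds = ridge.predict(te[feat_cols])
    ic_val = spearmanr(preds, te['target']).statistic
    ic_results.append({'date': test_date, 'ic': ic_val})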

That loop is the backbone of cross-sectional research — the exact same structure used at every quant fund, and the one you'll reuse for every model from here through Week 18. At each month, we train Ridge on *all prior data* and predict next month's cross-sectional returns. The training window starts at 60 months and grows to the full sample. No shuffling. No leakage. No time travel.

Now let's visualize the results. What does the IC look like over time? Is it consistently positive, or does it swing wildly between positive and negative? The answer matters enormously, because a strategy built on this signal needs to survive the bad months.

In [None]:
ic_df = pd.DataFrame(ic_results).set_index('date')

fig, axes = plt.subplots(2, 1, figsize=(10, 7), sharex=True)
axes[0].bar(ic_df.index, ic_df['ic'], color='steelblue', alpha=0.7, width=20)
axes[0].axhline(ic_df['ic'].mean(), color='red', ls='--',
                label=f"Mean IC: {ic_df['ic'].mean():.3f}")
axes[0].set_ylabel('Monthly IC'); axes[0].legend()
axes[0].set_title('Rolling Out-of-Sample IC (Ridge, Expanding Window)')

axes[1].plot(ic_df.index, ic_df['ic'].cumsum(), color='darkblue', lw=2)
axes[1].set_ylabel('Cumulative IC')
axes[1].set_title('Cumulative IC Over Time')
plt.tight_layout()

Study that top panel. The monthly IC bounces around — positive in some months, negative in others. That's not a bug, it's the reality of financial prediction. Even the best quant models are wrong *almost as often as they're right*. What matters is whether the average is reliably positive and whether the cumulative IC drifts upward over time.

The bottom panel shows the cumulative IC, which should trend upward with periodic drawdowns. Those drawdowns correspond to market regimes where the linear model's signal breaks down — typically during sudden crashes, sharp sector rotations, or regime changes. During the COVID crash of March 2020, for instance, momentum signals reversed violently: stocks that had been winning suddenly became the biggest losers. A model trained on historical momentum would have been on the wrong side of that reversal.

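If the pipeline is working, the average IC across our 30-stock universe should land somewhere near the 0.02-0.05 range reported in the academic literature. With only 30 stocks and fewer than 100 out-of-sample months, though, the standard error of that average is itself around 0.02, so treat the exact figure with caution. Tiny, but economically meaningful when applied across hundreds of stocks.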
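Whether an average IC is "reliably positive" can be summarized with a t-statistic on the monthly IC series, alongside the fraction of positive months. The helper `ic_summary` and the six IC values below are made up for illustration, and the t-statistic treats monthly ICs as independent draws, which is an assumption rather than a guarantee.

```python
import numpy as np

def ic_summary(ic_series):
    """Mean, standard deviation, t-statistic, and hit rate of a monthly IC series."""
    ic = np.asarray(ic_series, dtype=float)
    ic = ic[~np.isnan(ic)]                       # drop months with no IC
    mean = ic.mean()
    std = ic.std(ddof=1)                         # sample standard deviation
    t_stat = mean / (std / np.sqrt(len(ic)))     # mean over its standard error
    hit_rate = float((ic > 0).mean())            # share of positive months
    return {"mean_ic": mean, "std_ic": std, "t_stat": t_stat, "hit_rate": hit_rate}

# Invented monthly ICs, standing in for the 'ic' column of ic_df.
toy_ic = [0.10, -0.05, 0.08, 0.02, -0.03, 0.06]
summary = ic_summary(toy_ic)
print(summary)
```

On the real results, the same function could be applied to `ic_df['ic']`. A mean IC of 0.03 with a t-statistic near 1 is not distinguishable from noise; the same mean with a t-statistic above 3 is a different story.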

Now let's see how badly standard k-fold CV misleads you. We'll evaluate the same Ridge model two ways: our honest expanding-window method and shuffled 5-fold CV. This comparison is the single most important methodological lesson in this course.

In [None]:
from sklearn.model_selection import KFold

X_all, y_all = panel[feat_cols].values, panel['target'].values
kfold_ics = []
for tr_idx, te_idx in KFold(5, shuffle=True, random_state=42).split(X_all):
    m = Ridge(alpha=1.0).fit(X_all[tr_idx], y_all[tr_idx])
    ic = spearmanr(m.predict(X_all[te_idx]), y_all[te_idx]).statistic
    kfold_ics.append(ic)

expanding_ic = ic_df['ic'].mean()
kfold_ic = np.mean(kfold_ics)

We've now run the same Ridge model through both evaluation pipelines. Let's see the numbers side by side — the gap is usually striking.

In [None]:
denom = expanding_ic if abs(expanding_ic) > 0.001 else 0.001
comparison = pd.DataFrame({
    'Method': ['5-Fold CV (shuffled)', 'Expanding-Window CV'],
    'Average IC': [round(kfold_ic, 4), round(expanding_ic, 4)],
    'Inflation Factor': [round(kfold_ic / denom, 1), 1.0]
}).set_index('Method')
display(comparison)

The shuffled 5-fold IC is typically higher than the expanding-window IC, often by a factor of 2-4x, although with only 30 stocks and a linear model the gap in any single run is noisy. Where it appears, that gap is temporal leakage: the shuffled folds mix past and future data, allowing the model to exploit any time dependence in the features and the target. A monthly return in January 2020 is correlated with returns in December 2019 and February 2020, so shuffling them into different folds hands the model a crystal ball.

Every impressive financial ML result that used shuffled CV is lying — not maliciously, but structurally. This comparison should be permanent calibration for you: whenever someone shows you a financial ML result, your first question should be "what CV method did you use?" If the answer is k-fold with shuffling, divide their reported performance by 3 and that's a *generous* estimate of the truth.

---

## 5. The Gu-Kelly-Xiu Framework — Where We Stand

Let's place our work in context. We've built a small version of the cross-sectional prediction pipeline that Gu, Kelly, and Xiu scaled to industrial proportions. Their dataset: 30,000 stocks, 720 months, 94 characteristics. Ours: 30 stocks, ~150 months, 11 features. Same methodology, different scale.

Here are the key results from the GKX paper — the benchmark that defines the field:

| Model | OOS Monthly $R^2$ | Relative Rank |
|-------|-------------------|---------------|
| OLS | ~0.08% | 8 (worst) |
| Elastic Net | ~0.20% | 6 |
| PLS / PCR | ~0.15% | 7 |
| Random Forest | ~0.22% | 4 |
| Gradient Boosted Trees | ~0.28% | 3 |
| Neural Net (1-layer) | ~0.30% | 2 |
| Neural Net (3-5 layer) | ~0.40% | 1 (best) |

These numbers look absurdly small. A monthly R-squared of 0.4% means the model explains less than half a percent of the variation in stock returns. In any other ML domain, you'd throw this model away. But in finance, you'd build a billion-dollar firm on it.

The most important finding in the paper isn't the model ranking; it's the *interaction effect*. GKX showed that the single most important nonlinear feature is the interaction between momentum and volatility. High-momentum, low-volatility stocks outperform by a wide margin. High-momentum, high-volatility stocks? Not so much, because the momentum signal is unreliable when volatility is high. Trees and neural nets capture this interaction automatically. Linear models can't, because they combine features additively: each feature gets one fixed weight regardless of the values of the others. That's the specific gap we'll close in Week 5.

Let's make the Fundamental Law connection concrete. If our Ridge model achieves an IC of 0.03 on a universe of $N$ stocks rebalanced monthly, what annualized information ratio does that imply? The answer depends entirely on breadth.

In [None]:
ic_values = [0.02, 0.03, 0.05, 0.10]
breadth_values = [30, 100, 500, 3000]

ir_table = pd.DataFrame(
    {f'BR={br}': [ic * np.sqrt(br * 12) for ic in ic_values]
     for br in breadth_values},
    index=[f'IC={ic}' for ic in ic_values])
ir_table = ir_table.round(2)
display(Markdown('**Annualized Information Ratio = IC x sqrt(BR x 12)**'))
display(ir_table)

Read this table carefully; it's the economics of the entire cross-sectional prediction business. With 30 stocks (our lecture demo), even an IC of 0.05 gives an IR of about 0.95, decent but not spectacular. With 500 stocks and the same IC, the IR jumps to roughly 3.9. With 3,000 stocks (the GKX scale), even IC = 0.02 gives an IR of about 3.8.

Breadth is the multiplier. This is why quant funds trade thousands of stocks: not because any single bet is confident, but because the law of large numbers turns a tiny edge into a reliable stream. Your Ridge model with IC = 0.03 is marginal at 30 stocks (IR ≈ 0.6). It's a decent living at 200 stocks (IR ≈ 1.5). It's a career at 3,000 stocks (IR ≈ 5.7, on paper). Same model, same signal, same IC; the only difference is how many independent bets you can make. The word *independent* carries weight: stock returns are correlated, so effective breadth is smaller than the raw count and realized IRs fall short of these figures.
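The arithmetic behind those breadth comparisons fits in two small functions. This is a sketch under the Fundamental Law's own assumption that every stock-month is an independent bet; the helper names `annualized_ir` and `stocks_needed` are made up for illustration.

```python
import math

def annualized_ir(ic, n_stocks, periods_per_year=12):
    """Fundamental Law: IR = IC * sqrt(breadth), breadth = stocks x rebalances per year."""
    return ic * math.sqrt(n_stocks * periods_per_year)

def stocks_needed(ic, target_ir, periods_per_year=12):
    """Invert the law: smallest universe size that reaches target_ir."""
    return math.ceil((target_ir / ic) ** 2 / periods_per_year)

# Same IC, three universe sizes (assumes independent bets, which real stocks are not)
for n in (30, 200, 3000):
    print(n, round(annualized_ir(0.03, n), 2))   # 0.57, 1.47, 5.69

print(stocks_needed(0.03, 1.0))                  # 93
```

Inverting the formula is often the more useful direction: with IC = 0.03 and monthly rebalancing, you need roughly 93 independent names before the implied IR reaches 1.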

---

## 6. From Predictions to Portfolios — Quantile Analysis

The acid test of a cross-sectional prediction model isn't its R-squared or even its IC. It's this: if you sort stocks into quintiles by predicted return, does the top quintile actually outperform the bottom? If the spread is monotonic — Q1 < Q2 < Q3 < Q4 < Q5 — your model has learned something real. If it's not monotonic, you might just have noise that correlates with the target by accident.

The long-short portfolio return at time $t$:

$$r_{LS,t} = \frac{1}{|Q_5|} \sum_{i \in Q_5} r_{i,t} \;-\; \frac{1}{|Q_1|} \sum_{i \in Q_1} r_{i,t}$$

where $Q_5$ is the top quintile (highest predicted returns) and $Q_1$ is the bottom. This portfolio is approximately dollar-neutral (zero net investment) and market-neutral (roughly zero beta), so its return approximates pure alpha. At every quant fund, the quintile spread chart is the standard deliverable for evaluating a signal. You don't publish an R-squared — you show the spread.
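For a single date, the formula reduces to a few lines. The sketch below uses a made-up cross-section of 10 stocks and a hypothetical helper `long_short_return`; it splits by sorted position rather than by quantile edges, which is an implementation choice for the example, not the notebook's method.

```python
import numpy as np

def long_short_return(pred, realized, n_buckets=5):
    """Equal-weighted top-bucket minus bottom-bucket return for one date."""
    pred = np.asarray(pred, dtype=float)
    realized = np.asarray(realized, dtype=float)
    order = np.argsort(pred, kind="stable")       # ascending by predicted return
    buckets = np.array_split(order, n_buckets)    # near-equal-sized groups
    bottom, top = buckets[0], buckets[-1]
    return realized[top].mean() - realized[bottom].mean()

# Toy cross-section (illustrative numbers): predictions and next-month returns
pred = np.array([0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.0])
realized = np.array([0.04, -0.02, 0.01, 0.00, 0.02, -0.01, 0.03, 0.00, 0.01, -0.03])

r_ls = long_short_return(pred, realized)
print(round(r_ls, 4))   # 0.06: top pair averages +3.5%, bottom pair averages -2.5%
```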

Let's build this using our expanding-window predictions.

In [None]:
quantile_results = []
for i in range(min_train, len(dates_sorted) - 1):
    train_end = dates_sorted[i]
    test_date = dates_sorted[i + 1]
    # Expanding window: fit on everything up to train_end, predict the next date
    tr = panel.loc[panel.index.get_level_values('date') <= train_end]
    te = panel.loc[panel.index.get_level_values('date') == test_date].copy()
    if len(te) < 10:  # too few stocks for a meaningful quintile sort
        continue
    ridge = Ridge(alpha=1.0).fit(tr[feat_cols], tr['target'])
    te['pred'] = ridge.predict(te[feat_cols])
    # Rank first so tied predictions cannot collapse bin edges; with
    # duplicates='drop' a dropped edge would no longer match the 5 labels
    te['quintile'] = pd.qcut(te['pred'].rank(method='first'), 5, labels=[1,2,3,4,5])
    for q in range(1, 6):
        mask = te['quintile'] == q
        if mask.any():
            quantile_results.append({'date': test_date, 'quintile': q,
                                     'ret': te.loc[mask, 'target'].mean()})

At each out-of-sample month, we've sorted stocks into five buckets by predicted return and recorded the average realized return for each bucket. If the model is working, quintile 5 (the stocks the model likes most) should have the highest average return, and quintile 1 (the stocks the model likes least) should have the lowest. Let's see the quintile spread chart.

In [None]:
qdf = pd.DataFrame(quantile_results)
q_means = qdf.groupby('quintile')['ret'].mean() * 100

fig, ax = plt.subplots(figsize=(7, 4))
colors = ['#d62728', '#ff7f0e', '#cccccc', '#2ca02c', '#1f77b4']
q_means.plot.bar(ax=ax, color=colors, edgecolor='black', linewidth=0.5)
ax.set_xlabel('Quintile (1=Lowest Predicted, 5=Highest)')
ax.set_ylabel('Mean Monthly Excess Return (%)')
ax.set_title('Quintile Returns: Ridge Expanding-Window Predictions')
ax.axhline(0, color='k', lw=0.5)
plt.xticks(rotation=0); plt.tight_layout()

If the quintile chart shows an upward slope from Q1 to Q5, the model is working. The spread between Q5 and Q1 (the long-short return) is the economic value of the prediction. With 30 stocks and 11 features, the spread will be modest. With a few hundred stocks and 20+ features (the homework scale), it should be more pronounced.

The key diagnostic is *monotonicity*: do the returns increase steadily from Q1 to Q5? A perfectly monotonic spread means the model's ranking is meaningful at every level. A non-monotonic spread (say, Q3 outperforms Q5) suggests the model only reliably separates winners from losers at the extremes.
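Eyeballing a bar chart works for one model; when you compare several, it helps to turn monotonicity into numbers. A minimal sketch, with a hypothetical helper `monotonicity_report` and invented quintile means:

```python
import numpy as np

def monotonicity_report(bucket_means):
    """Summarize how cleanly mean returns rise from the lowest to highest bucket."""
    m = np.asarray(bucket_means, dtype=float)
    steps = np.diff(m)                       # Q2-Q1, Q3-Q2, ...
    return {
        "strictly_increasing": bool(np.all(steps > 0)),
        "share_of_up_steps": float(np.mean(steps > 0)),
        "spread": float(m[-1] - m[0]),       # Q5 - Q1
    }

# Two illustrative profiles with the same Q5-Q1 spread
clean = monotonicity_report([0.2, 0.5, 0.7, 0.9, 1.3])
kinked = monotonicity_report([0.2, 0.5, 1.0, 0.9, 1.3])   # Q3 beats Q4
print(clean)
print(kinked)
```

Both profiles produce the same long-short spread, so the spread alone cannot tell them apart. Only the step-by-step check reveals that the second model's ranking is unreliable in the middle of the distribution.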

Now let's trace out the long-short portfolio's cumulative performance: what a trader would actually experience living with this strategy month after month.

In [None]:
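q5 = qdf[qdf['quintile'] == 5].set_index('date')['ret']
q1 = qdf[qdf['quintile'] == 1].set_index('date')['ret']
# Align on date and keep only months where both legs exist; fill_value=0
# would silently turn a missing leg into a one-sided return
ls_returns = (q5 - q1).dropna()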

fig, ax = plt.subplots(figsize=(10, 4))
(1 + ls_returns).cumprod().plot(ax=ax, lw=2, color='darkblue')
ax.axhline(1, color='k', ls='--', lw=0.5)
ax.set_ylabel('Cumulative Return')
ax.set_title('Ridge: Long Q5 / Short Q1 — Cumulative Performance')
plt.tight_layout()

That cumulative line tells the full story. If it drifts upward over time, the strategy is extracting genuine signal from the cross-section. If it's flat or declining, the model isn't working — or the universe is too small to generate reliable quintile sorts (a common issue with only 30 stocks, where each quintile contains just 6 names).

Notice the drawdowns: there will be periods where the long-short portfolio loses money for several consecutive months. Some of these drawdowns coincide with regime changes where the model's learned relationships break down; others are simply noise. A model that "explains" 0.3% of return variation is still wrong on roughly 48% of its monthly bets. The edge is statistical, not certain. This is the psychological challenge of quantitative investing: you need to keep the system running through the bad months, trusting that the long-run average IC is positive.
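Where does "roughly 48%" come from? A back-of-the-envelope sketch, assuming two simplifications that are not in the notebook: the IC is approximately the square root of the R-squared, and each bet is binary (right or wrong on direction), in which case IC = 2 x hit rate - 1.

```python
import math

def hit_rate_from_r2(r2):
    """Binary-bet approximation: IC = sqrt(R^2) and IC = 2 * hit_rate - 1."""
    ic = math.sqrt(r2)
    return 0.5 + ic / 2

hit = hit_rate_from_r2(0.003)
print(f"hit rate ~ {hit:.1%}, miss rate ~ {1 - hit:.1%}")
# hit rate ~ 52.7%, miss rate ~ 47.3%
```

A coin with a 52.7% chance of heads is barely distinguishable from a fair one over a dozen flips, which is why a losing streak of several months says very little about whether the signal has stopped working.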

Let's compute the performance metrics that a portfolio manager would demand — including what happens after transaction costs. We introduced the cost framework in Week 3: 10 basis points per side is realistic for large-cap US stocks.

In [None]:
ann_ret = ls_returns.mean() * 12
ann_vol = ls_returns.std() * np.sqrt(12)
sharpe = ann_ret / ann_vol if ann_vol > 0 else 0
cum = (1 + ls_returns).cumprod()
max_dd = (cum / cum.cummax() - 1).min()

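# 10 bps per side x 2 sides (sell + buy) = 20 bps charged at every monthly rebalance.
# This is a flat charge: it implicitly assumes the same turnover every month.
cost_per_rebal = 0.0010 * 2
net_ret = ls_returns - cost_per_rebal
sharpe_net = (net_ret.mean() * 12) / (net_ret.std() * np.sqrt(12)) if net_ret.std() > 0 else 0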

display(Markdown(
    f"| Metric | Value |\n|--------|-------|\n"
    f"| Ann. Return (gross) | {ann_ret:.2%} |\n"
    f"| Ann. Volatility | {ann_vol:.2%} |\n"
    f"| Sharpe (gross) | {sharpe:.2f} |\n"
    f"| Sharpe (net 10bps) | {sharpe_net:.2f} |\n"
    f"| Max Drawdown | {max_dd:.2%} |"))

Look at the Sharpe ratio degradation from gross to net of costs. Even at 10 basis points per side, which is realistic for large-cap US equities, the monthly rebalancing cost eats into the strategy's edge. The flat 20 bps charge used here is a simplification: in practice the cost is proportional to turnover. If the model's predictions change dramatically month to month (high turnover), you're paying more in trading costs. If the predictions are stable (low turnover), costs are manageable.

Ridge's smooth coefficients tend to produce more stable predictions than Lasso's sparse ones — another advantage of L2 regularization that doesn't show up in IC comparisons but matters for real money. This is the permanent tension in quantitative finance: signal decays with time (you want to trade frequently to capture fresh signal), but transaction costs accumulate with trading frequency (you want to trade rarely to minimize costs). Every strategy lives somewhere on this tradeoff curve.
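To see how turnover feeds into cost, here is a minimal sketch that scales the charge by how many names actually change in each leg. The helpers `one_way_turnover` and `rebalance_cost`, the tickers, and the costing convention (each replaced name is one sell plus one buy, expressed as a fraction of one leg's capital) are all assumptions for illustration, not the Week 3 framework itself.

```python
def one_way_turnover(prev_names, curr_names):
    """Fraction of an equal-weighted leg replaced between two rebalances."""
    prev_names, curr_names = set(prev_names), set(curr_names)
    if not curr_names:
        return 0.0
    return len(curr_names - prev_names) / len(curr_names)

def rebalance_cost(prev_long, curr_long, prev_short, curr_short, cost_per_side=0.0010):
    """Cost as a fraction of one leg's capital, scaled by realized turnover.

    Each replaced name is one sell plus one buy, hence the factor of 2.
    """
    turnover = (one_way_turnover(prev_long, curr_long)
                + one_way_turnover(prev_short, curr_short))
    return 2 * cost_per_side * turnover

# Hypothetical 6-name quintiles: two longs and one short are replaced this month
prev_long, curr_long = list("ABCDEF"), list("ABCDGH")
prev_short, curr_short = list("UVWXYZ"), list("UVWXYQ")

cost = rebalance_cost(prev_long, curr_long, prev_short, curr_short)
print(f"{cost * 1e4:.1f} bps this month")   # 10.0 bps this month
```

Under this convention, a flat 20 bps per month corresponds to replacing half of each leg at every rebalance. A stable signal that swaps out only a name or two pays far less; a noisy one that reshuffles both quintiles completely pays twice as much.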

---

## 7. What Linear Models Miss — A Preview of Week 5

We've built a working cross-sectional prediction pipeline with linear models. The IC is small but positive. The quintile spread is hopefully monotonic. The Fundamental Law tells us this should work at scale. So why bother with trees and neural nets?

Because linear models can't capture *interactions*. The Gu-Kelly-Xiu paper showed that the single most important nonlinear effect in stock prediction is the momentum-volatility interaction: high-momentum, low-volatility stocks massively outperform, while high-momentum, high-volatility stocks deliver unreliable returns. A linear model sees momentum and volatility as independent features and combines them additively. It can't learn that the *combination* matters more than either feature alone.

Let's see this interaction in our data. We'll split stocks by both momentum tercile and volatility tercile, then look at the average forward return in each cell. If the two effects were purely additive, moving from low to high momentum would change the return by the same amount in every volatility column. If there's an interaction, the payoff to momentum will differ from one volatility column to the next.

In [None]:
# Rank before cutting so ties cannot collapse a bin edge (which would break the 3 labels)
def to_tercile(x):
    return pd.qcut(x.rank(method='first'), 3, labels=['Low','Mid','High'])

idata = panel.copy()
idata['mom_q'] = idata.groupby(level='date')['mom_12m_skip1'].transform(to_tercile)
idata['vol_q'] = idata.groupby(level='date')['vol_20d'].transform(to_tercile)

interact_ret = idata.groupby(['mom_q','vol_q'], observed=True)['target'].mean() * 100
# Force Low/Mid/High ordering in case the labels sort alphabetically
order = ['Low', 'Mid', 'High']
pivot = interact_ret.unstack('vol_q').reindex(index=order, columns=order)
display(Markdown('**Mean Monthly Excess Return (%) by Momentum x Volatility**'))
display(pivot.round(3))

If the data reveals the interaction, you'll see that high-momentum, low-volatility stocks have substantially better returns than you'd predict by simply adding the momentum effect and the volatility effect. A linear model can't capture this — it would assign a fixed weight to momentum and a fixed weight to volatility and add them. A tree can learn that "when momentum is high AND volatility is low, the prediction should be especially positive" — because trees partition the feature space, they naturally capture interactions.

This is the specific gap we'll close in Week 5 with XGBoost and LightGBM. The improvement over Ridge in R-squared terms is typically modest (from ~0.2% to ~0.3%), but it translates to real portfolio improvement because the interaction captures a genuine economic effect: momentum is more reliable when volatility is low, because low-volatility environments are less affected by noise, sentiment, and liquidity shocks.
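The additivity limitation is easy to demonstrate on synthetic data. In the sketch below everything is an assumption made for illustration: the data-generating process (a pure momentum x volatility interaction, with an effect size far larger than anything in real returns), the closed-form `ridge_fit` standing in for scikit-learn's `Ridge`, and the absence of an intercept. It shows that a linear model sees nothing until the product term is handed to it as a feature.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge without intercept: beta = (X'X + alpha*I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
n = 20_000
mom = rng.uniform(-1, 1, n)   # stand-in for rank-normalized momentum
vol = rng.uniform(-1, 1, n)   # stand-in for rank-normalized volatility
# Synthetic target: momentum's payoff flips sign with volatility
# (pure interaction, no main effects; effect size exaggerated for visibility)
y = -1.0 * mom * vol + rng.normal(0, 1, n)

train, test = slice(0, n // 2), slice(n // 2, n)
X_add = np.column_stack([mom, vol])                 # additive features only
X_int = np.column_stack([mom, vol, mom * vol])      # plus engineered product

r2 = {}
for name, X in [("additive", X_add), ("with_interaction", X_int)]:
    beta = ridge_fit(X[train], y[train])
    r2[name] = r_squared(y[test], X[test] @ beta)
print(r2)
```

The additive model's out-of-sample R-squared sits near zero, while the version with the product column recovers most of the available signal. A tree finds that split on its own; a linear model needs you to engineer it.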

> **Did You Know?** Kenneth French at Dartmouth's Tuck School maintains a free online data library with factor returns going back to 1926 — the most-used free dataset in academic finance. Every asset pricing paper references it. The Fama-French factors (market, size, value, momentum, profitability, investment) are computed from CRSP data using a standardized methodology. Bookmark it at mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html — you'll use it repeatedly for benchmarking your models against decades of academic evidence.

---

## Key Takeaways

| Concept | What You Need to Remember |
|---------|---------------------------|
| Cross-sectional prediction | Predict *relative* returns (Apple vs. Microsoft), not absolute. The IC measures your ranking accuracy. |
| Feature engineering | Momentum, value, volatility, size, liquidity — each represents decades of research. Rank-normalize cross-sectionally at each date. |
| OLS vs. Ridge vs. Lasso | OLS overfits. Ridge shrinks all coefficients smoothly (best for correlated features). Lasso zeros out weak features (interpretable but may discard weak signal). |
| Expanding-window CV | The only correct evaluation for temporal data. Never shuffle. Train on past, predict future, period. |
| IC calibration | IC of 0.05 is *excellent*. IC of 0.10 is suspicious. Breadth (number of stocks) is the multiplier that makes small IC economically meaningful. |
| Quantile analysis | Sort stocks into quintiles by prediction. Monotonic spread = real signal. The Q5-Q1 long-short return is your signal's economic value. |
| Transaction costs | At 10 bps per side (20 bps round-trip), monthly rebalancing costs ~2.4% per year. Your Sharpe after costs is the only Sharpe that matters. |
| The GKX benchmark | Linear models: OOS $R^2$ ~0.2%. Neural nets: ~0.4%. Most prediction power comes from features, not model architecture. |

---

## Bridge to Week 5

You've just built your first cross-sectional alpha model — the bread and butter of quantitative asset management. It uses simple linear models (Ridge, Lasso), standard features (momentum, volatility, size), and rigorous expanding-window evaluation. The IC is small — maybe 0.02-0.05 — but the Fundamental Law tells us that's enough when breadth is high.

Next week, we replace the linear model with gradient-boosted trees — XGBoost and LightGBM. These models dominate production quant finance for a specific reason: they capture nonlinear interactions between features that linear models miss. The momentum-volatility interaction we glimpsed in Section 7 is just the beginning. Trees find it automatically. They also give us SHAP values — a way to see *exactly* which features and interactions your model is using, which is essential for trust (both your own trust in the model, and your PM's trust in your model). You'll discover that the improvement over Ridge is real but modest — and that a well-tuned Ridge with engineered interaction features can sometimes match a default XGBoost. The model matters, but the features and methodology matter more.

In the homework this week, you'll scale the pipeline to 200+ stocks and 20+ features. That's where the Fundamental Law kicks in: breadth amplifies small signal into reliable returns. The code you build is modular by design — swap in a tree model next week, a neural net in Week 7, and the evaluation framework stays exactly the same.

---

## Suggested Reading

- **Gu, Kelly & Xiu (2020), "Empirical Asset Pricing via Machine Learning"** — The paper that defined the field. Focus on Tables 3-5 (model comparison) and Figure 3 (feature importance and interactions). Dense but essential — every quant you'll ever work with has read it.

- **Jegadeesh & Titman (1993), "Returns to Buying Winners and Selling Losers"** — The paper that discovered momentum. Over 12,000 citations. The anomaly has been replicated in 40+ countries and remains the single most robust predictor. Read it to understand the foundation of your strongest feature.

- **Harvey, Liu & Zhu (2016), "...and the Cross-Section of Expected Returns"** — The "factor zoo" paper that showed most published factors are probably false positives. Essential context for why rigorous methodology matters — connects directly to Week 6's discussion of backtest overfitting.

- **Stefan Jansen, *Machine Learning for Algorithmic Trading*, Chapters 4-7** — The most practical treatment of alpha factor research with Python code. Working implementations of feature engineering and linear models with alphalens integration. Start here if you want production-grade code for the homework.