# Week 4 Seminar: Cross-Sectional Return Prediction with Linear Models

In the lecture, we framed the cross-sectional prediction problem and showed you the moving parts: features, regularized regression, expanding-window evaluation, and the Gu-Kelly-Xiu benchmark. Now you're going to build all of this yourself. You'll construct a real feature matrix for dozens of stocks, trace the regularization path to see exactly what Ridge and Lasso are doing under the hood, expose the information leakage trap that has fooled countless published papers, and rank individual features by their raw predictive power. By the end of this session, you'll have a working cross-sectional prediction pipeline -- and a healthy skepticism about any backtest result that doesn't specify its evaluation methodology.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import yfinance as yf
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['figure.dpi'] = 100
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

def get_close(data):
    """Extract close prices, handling yfinance MultiIndex."""
    if isinstance(data.columns, pd.MultiIndex):
        return data['Close']
    return data[['Close']]

We need data for all four exercises, so we'll download everything once. We're pulling about 50 liquid US stocks with enough history to build a meaningful cross-section. In a production system, you'd use the full S&P 500 or broader -- but 50 stocks is enough to see the patterns without waiting 20 minutes for yfinance to respond. We'll grab daily data from 2010 through mid-2024 and also download SPY to compute excess returns.

In [None]:
tickers = [
    'AAPL', 'MSFT', 'AMZN', 'GOOGL', 'META', 'NVDA', 'TSLA', 'JPM',
    'JNJ', 'V', 'PG', 'UNH', 'HD', 'MA', 'DIS', 'BAC', 'XOM',
    'PFE', 'CSCO', 'VZ', 'INTC', 'KO', 'PEP', 'ABT', 'MRK',
    'CVX', 'WMT', 'ABBV', 'COST', 'TMO', 'NKE', 'DHR', 'LLY',
    'NEE', 'MDT', 'TXN', 'UNP', 'QCOM', 'LOW', 'BMY', 'AMGN',
    'CAT', 'GS', 'BLK', 'SPGI', 'ADP', 'MMM', 'SYK', 'DE', 'CL'
]

raw = yf.download(tickers + ['SPY'], start='2010-01-01', end='2024-07-01',
                  auto_adjust=True, progress=False)
prices = get_close(raw)
volume = raw['Volume'] if isinstance(raw.columns, pd.MultiIndex) else raw[['Volume']]
prices.shape, volume.shape

That gives us roughly 14 years of daily data for 51 tickers (50 stocks plus SPY). A few tickers may have started trading after 2010 (META didn't IPO until 2012, for instance), so you'll see some NaNs in the early rows. That's fine -- it's exactly the kind of messiness you'd encounter in a real feature pipeline, and we'll handle it.

## Exercise 1: Building the Feature Matrix

**The question:** Can you construct a clean, complete cross-sectional feature matrix for 50 stocks using only free data -- and what does the resulting data actually look like?

This is the foundation of everything we'll build for the remaining 14 weeks of the course. You're going to compute ten features for each stock at each month-end, handle the inevitable missing data, cross-sectionally normalize, and produce a panel dataset ready for ML. Think of this as building your feature store -- the asset that every model downstream depends on. Get it wrong here, and every backtest, every IC calculation, and every portfolio return you compute later is garbage.

**Your tasks:**
1. Compute daily returns and monthly returns for all stocks
2. Compute the following features at each month-end: momentum (1m, 3m, 6m, 12m-skip-1), short-term reversal, realized volatility (20-day and 60-day), log dollar volume (size proxy), 50-day/200-day MA ratio, Amihud illiquidity
3. Compute the market return from SPY and create excess returns as the prediction target
4. Cross-sectionally rank-normalize all features to (0, 1] at each date
5. Examine correlations and summary statistics

In [None]:
# YOUR EXPLORATION HERE

---
### ▶ Solution

In [None]:
stock_tickers = [t for t in tickers if t != 'SPY']
stock_prices = prices[stock_tickers]
spy_prices = prices['SPY']

daily_returns = stock_prices.pct_change()
monthly_prices = stock_prices.resample('ME').last()
monthly_spy = spy_prices.resample('ME').last()
monthly_returns = monthly_prices.pct_change()
spy_monthly = monthly_spy.pct_change()
excess_returns = monthly_returns.sub(spy_monthly, axis=0)

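# Momentum features: trailing returns measured at month-end t
mom_1m = monthly_prices.pct_change(1)   # P[t] / P[t-1] - 1
mom_3m = monthly_prices.pct_change(3)   # P[t] / P[t-3] - 1
mom_6m = monthly_prices.pct_change(6)   # P[t] / P[t-6] - 1
mom_12m_skip1 = monthly_prices.shift(1).pct_change(12)  # P[t-1] / P[t-13] - 1, skips month t
reversal = monthly_returns.shift(1)     # return over month t-1, one month older than mom_1m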

# Volatility (from daily, sampled monthly)
vol_20d_monthly = (daily_returns.rolling(20).std() * np.sqrt(252)).resample('ME').last()
vol_60d_monthly = (daily_returns.rolling(60).std() * np.sqrt(252)).resample('ME').last()

Notice the skip-1 trick in 12-month momentum: we shift the price series by 1 month before computing the 12-month return, so the feature at month-end t is the return from t-13 to t-1. Jegadeesh and Titman (1993) showed that stocks that won over the past 12 months tend to keep winning, while the short-term reversal literature (Jegadeesh, 1990) found that stocks that won *last month specifically* tend to reverse. Including the most recent month contaminates momentum with reversal, muddying both signals. Skipping it is standard practice in momentum research and at quant funds. Now let's add size, trend, and liquidity features.
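Before moving on, a quick sanity check of the skip-1 alignment may help. This is a minimal sketch on a made-up price series (the prices and the integer month index are illustrative assumptions, not course data); it compares the plain 12-month return with the shifted version used in the solution cell.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end prices for one stock: 100, 101, ..., 114 (months 0..14)
prices = pd.Series(np.arange(100.0, 115.0), index=pd.RangeIndex(15, name="month"))

mom_12m = prices.pct_change(12)                  # P[t] / P[t-12] - 1
mom_12m_skip1 = prices.shift(1).pct_change(12)   # P[t-1] / P[t-13] - 1

# At the last month (t = 14) the skip-1 version uses P[13] and P[1] only,
# so the return earned during month 14 itself never enters the feature.
manual_skip1 = prices.iloc[13] / prices.iloc[1] - 1

print(mom_12m.iloc[-1], mom_12m_skip1.iloc[-1], manual_skip1)
print("months without a skip-1 value:", int(mom_12m_skip1.isna().sum()))
```

The skip-1 feature needs 13 months of history before it produces a value, which is one reason the early rows of each stock drop out of the panel.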

In [None]:
dollar_volume = stock_prices * volume[stock_tickers]
log_dollar_vol = np.log(dollar_volume.rolling(21).mean()).resample('ME').last()
ma_ratio = (stock_prices.rolling(50).mean() / stock_prices.rolling(200).mean()).resample('ME').last()
amihud_monthly = (daily_returns.abs() / dollar_volume).rolling(21).mean().resample('ME').last()

# Assemble panel DataFrame
feature_dict = {
    'mom_1m': mom_1m, 'mom_3m': mom_3m, 'mom_6m': mom_6m,
    'mom_12m_skip1': mom_12m_skip1, 'reversal': reversal,
    'vol_20d': vol_20d_monthly, 'vol_60d': vol_60d_monthly,
    'log_dollar_vol': log_dollar_vol, 'ma_ratio': ma_ratio,
    'amihud': amihud_monthly
}
panels = []
for name, df in feature_dict.items():
    s = df.stack(); s.name = name; panels.append(s)
target = excess_returns.shift(-1).stack(); target.name = 'target'
panels.append(target)
panel = pd.concat(panels, axis=1).dropna()
panel.index.names = ['date', 'ticker']
panel.shape

We should see something like (5,000-8,000 rows, 11 columns) -- each row is a (month, stock) observation with 10 features and one target. The `shift(-1)` on the target is critical: we're aligning features at time t with the *next month's* excess return. Getting this alignment wrong is the most common source of look-ahead bias in financial ML -- and once it's in your pipeline, it's nearly invisible. Your backtest will look great and your live trading will be mediocre.
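To make the alignment concrete, here is a small sketch with invented returns for two hypothetical tickers (`AAA`, `BBB`) and string month labels. It only illustrates what `shift(-1)` does to the target; it is not the seminar's panel.

```python
import pandas as pd

months = ["2021-01", "2021-02", "2021-03", "2021-04"]
excess = pd.DataFrame(
    {"AAA": [0.01, 0.02, 0.03, 0.04], "BBB": [-0.01, 0.00, 0.01, 0.02]},
    index=months,
)

# shift(-1) pulls next month's return up onto the current row
target = excess.shift(-1)

# Pair what is known at month t with what is realised in month t+1
paired = pd.DataFrame(
    {"return_t": excess["AAA"], "target_t_plus_1": target["AAA"]}
).dropna()  # the final month has no "next month" and is dropped
print(paired)
```

Each surviving row pairs information available at month-end t with the return earned over month t+1. Shifting in the other direction (`shift(1)`) would pair features with a return that had already happened.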

Now for rank normalization -- the step that separates amateur from professional feature engineering.

In [None]:
feature_cols = [c for c in panel.columns if c != 'target']

# Percentile-rank every feature within each date's cross-section
panel[feature_cols] = panel.groupby(level='date')[feature_cols].rank(pct=True)

# Correlation heatmap
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(panel[feature_cols].corr(), annot=True, fmt='.2f',
            cmap='RdBu_r', center=0, vmin=-1, vmax=1, ax=ax)
ax.set_title('Cross-Sectional Feature Correlations (After Rank Normalization)')
plt.tight_layout()
plt.show()

After rank normalization, every feature at every date is spread evenly over (0, 1]: `rank(pct=True)` assigns the smallest of n values 1/n and the largest 1. This is the trick that everyone at quant funds uses but rarely discusses openly. Raw features have wildly different scales -- log dollar volume ranges from 15 to 25, while momentum ranges from -0.5 to +1.0. More importantly, raw features have extreme outliers: a stock that tripled in a month will dominate the regression. Rank normalization removes both problems, because only the ordering within each date survives.
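A toy illustration of the outlier point, using four hypothetical tickers over two dates (all values invented for the example):

```python
import pandas as pd

raw = pd.DataFrame(
    {
        "date": ["2021-01"] * 4 + ["2021-02"] * 4,
        "ticker": list("ABCD") * 2,
        # Ticker D tripled in January (+200%): an extreme raw value
        "mom_1m": [0.01, 0.02, 0.03, 2.00, -0.05, 0.00, 0.04, 0.05],
    }
).set_index(["date", "ticker"])

# Percentile rank within each date, as in the solution cell
ranked = raw.groupby(level="date")["mom_1m"].rank(pct=True)
print(pd.concat({"raw": raw["mom_1m"], "ranked": ranked}, axis=1))
```

D's raw January value is two orders of magnitude larger than the others, yet after ranking it is simply the top of four, the same rank that a 5% return earns in February.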

Look at the correlation matrix. The momentum features with overlapping windows (1m, 3m, 6m) should be substantially correlated -- typically 0.4 to 0.7. `mom_12m_skip1` excludes the most recent month by construction, so expect it to correlate with `mom_6m` but only weakly with `mom_1m`. The two volatility measures should be highly correlated (0.8+). This multicollinearity is exactly why OLS struggles: when two features are correlated at 0.8 and the signal is weak, OLS has little information for splitting the credit between them, and the coefficients become unstable from sample to sample. This is the problem that Ridge was born to solve -- it says "I don't care which momentum variant gets the credit, just shrink them all toward zero and let the ensemble do the work."
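The instability claim can be checked with a small simulation. The sketch below is an assumption-laden toy, not the seminar data: two synthetic features with correlation 0.8, a weak true signal, and a closed-form ridge without intercept standing in for scikit-learn's `Ridge`. The sample size, noise level, and `alpha=200` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y, alpha):
    """Closed-form ridge without intercept; alpha=0 reduces to OLS."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

ols, ridge = [], []
for _ in range(500):                      # 500 independent synthetic samples
    z = rng.standard_normal(200)
    x1 = z + 0.5 * rng.standard_normal(200)
    x2 = z + 0.5 * rng.standard_normal(200)   # corr(x1, x2) = 1 / 1.25 = 0.8
    X = np.column_stack([x1, x2])
    y = 0.05 * x1 + 0.05 * x2 + rng.standard_normal(200)  # weak signal, heavy noise
    ols.append(fit(X, y, 0.0))
    ridge.append(fit(X, y, 200.0))

ols, ridge = np.array(ols), np.array(ridge)
print("OLS   coefficient std:", ols.std(axis=0))
print("Ridge coefficient std:", ridge.std(axis=0))
print("corr between the two OLS coefficients:", np.corrcoef(ols[:, 0], ols[:, 1])[0, 1])
```

Across samples the two OLS coefficients move in opposite directions (negative correlation): when one gets too much credit, the other gets too little. Ridge coefficients vary far less, at the cost of being shrunk toward zero.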

## Exercise 2: Regularization Path -- How Lambda Shapes the Cross-Section

**The question:** What does the model "see" at different regularization strengths -- and does the optimal amount of regularization stay constant over time, or does it shift with market conditions?

The lecture showed that Ridge beats OLS. But that's a binary comparison. In practice, the regularization parameter lambda (called `alpha` in scikit-learn, because naming conventions in ML are a controlled disaster) is a continuous dial from zero (pure OLS) to infinity (all coefficients crushed to zero). Somewhere on that dial is a sweet spot. You're going to trace the full regularization path and watch exactly how that dial changes what the model sees.

**Your tasks:**
1. Pick one expanding-window fold: train on all data up to Dec 2018, test on 2019
2. Sweep Ridge alpha from 0.001 to 10,000 (50 values, log scale). Record coefficients and out-of-sample IC
3. Plot the Ridge regularization path (coefficients vs log-alpha)
4. Repeat for Lasso -- note which features get zeroed out first
5. Time-stability check: repeat the Ridge sweep for each test year from 2019 to 2023 and record the best alpha in each

In [None]:
# YOUR EXPLORATION HERE

---
### ▶ Solution

In [None]:
train_mask = panel.index.get_level_values('date') <= '2018-12-31'
test_2019 = (panel.index.get_level_values('date') >= '2019-01-01') & \
            (panel.index.get_level_values('date') <= '2019-12-31')
X_train = panel.loc[train_mask, feature_cols].values
y_train = panel.loc[train_mask, 'target'].values
X_test = panel.loc[test_2019, feature_cols].values
y_test = panel.loc[test_2019, 'target'].values

alphas = np.logspace(-3, 4, 50)
ridge_coefs, ridge_ics = [], []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    ridge_coefs.append(model.coef_)
    ridge_ics.append(spearmanr(model.predict(X_test), y_test).statistic)
ridge_coefs = np.array(ridge_coefs)

We've swept 50 alpha values from near-zero (essentially OLS) to 10,000 (everything crushed flat). Each row of `ridge_coefs` records the 10 feature coefficients at that regularization strength, and `ridge_ics` records the out-of-sample Spearman IC, computed here on all 2019 observations pooled together rather than month by month. Let's visualize both -- the coefficient path and the IC curve -- side by side. This dual view is one of the most informative diagnostic plots in applied ML.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
for i, feat in enumerate(feature_cols):
    axes[0].plot(np.log10(alphas), ridge_coefs[:, i], label=feat)
axes[0].set_xlabel('log10(alpha)'); axes[0].set_ylabel('Coefficient')
axes[0].set_title('Ridge Regularization Path')
axes[0].legend(fontsize=7, loc='upper right')
axes[0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
best_idx = np.argmax(ridge_ics)
axes[1].plot(np.log10(alphas), ridge_ics, 'b-o', markersize=3)
axes[1].axvline(x=np.log10(alphas[best_idx]), color='r', linestyle='--',
                label=f'Best alpha={alphas[best_idx]:.1f}')
axes[1].set_xlabel('log10(alpha)'); axes[1].set_ylabel('OOS IC (2019)')
axes[1].set_title('Ridge: IC vs Regularization'); axes[1].legend()
plt.tight_layout(); plt.show()

Two things to notice. In the left panel, at very low alpha the coefficients are scattered wildly -- some large and positive, some large and negative. This is OLS-like behavior: with correlated features, the model assigns huge positive weight to one momentum variant and huge negative weight to another, trying to exploit tiny differences. That's overfitting. As alpha increases, all coefficients smoothly converge toward zero. The features that shrink *last* are carrying the most genuine signal.

In the right panel, the IC curve should show a clear inverted-U shape. Too little regularization: overfitting kills out-of-sample performance. Too much: the model can't predict anything useful. The peak sits where a little more shrinkage would add as much bias as it removes variance. Now let's see what Lasso does differently -- where Ridge shrinks smoothly, Lasso makes hard choices.

In [None]:
lasso_alphas = np.logspace(-6, -2, 50)
lasso_coefs, lasso_ics = [], []
for alpha in lasso_alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_train, y_train)
    lasso_coefs.append(model.coef_)
    lasso_ics.append(spearmanr(model.predict(X_test), y_test).statistic)
lasso_coefs = np.array(lasso_coefs)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
for i, feat in enumerate(feature_cols):
    axes[0].plot(np.log10(lasso_alphas), lasso_coefs[:, i], label=feat)
axes[0].set_xlabel('log10(alpha)'); axes[0].set_ylabel('Coefficient')
axes[0].set_title('Lasso Regularization Path')
axes[0].legend(fontsize=7, loc='upper left')
axes[0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
n_nonzero = (np.abs(lasso_coefs) > 1e-8).sum(axis=1)
axes[1].plot(np.log10(lasso_alphas), n_nonzero, 'g-o', markersize=3)
axes[1].set_xlabel('log10(alpha)'); axes[1].set_ylabel('Non-zero features')
axes[1].set_title('Lasso: Feature Selection via Regularization')
plt.tight_layout(); plt.show()

The Lasso path looks fundamentally different from Ridge. Instead of smooth shrinkage, features snap to exactly zero as alpha increases. The right panel shows it starkly: at low alpha, all 10 features are active; crank up regularization and features drop out one by one until only 2-3 survive. The *order* of elimination is itself a feature importance ranking -- complementary to SHAP values you'll see in Week 5.
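Why does Lasso snap coefficients to zero while Ridge only shrinks them? In the textbook special case of an orthonormal design (an assumption that does not hold for our correlated features), both estimators have closed forms: Ridge rescales each OLS coefficient, Lasso soft-thresholds it. The sketch below uses made-up OLS coefficients and the objectives `||y - Xb||^2 + lam * ||b||^2` for Ridge and `0.5 * ||y - Xb||^2 + lam * ||b||_1` for Lasso.

```python
import numpy as np

b_ols = np.array([0.30, 0.10, -0.04, 0.01])  # hypothetical OLS coefficients
lam = 0.05                                   # hypothetical penalty strength

# Ridge under an orthonormal design: proportional shrinkage, never exactly zero
ridge = b_ols / (1.0 + lam)

# Lasso under an orthonormal design: soft-thresholding, small coefficients hit zero
lasso = np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

print("ridge:", ridge, "non-zero:", np.count_nonzero(ridge))
print("lasso:", lasso, "non-zero:", np.count_nonzero(lasso))
```

Note that scikit-learn's `Lasso` divides the squared-error term by `2 * n_samples`, while `Ridge` does not, so the `alpha` grids of the two models are not on a comparable scale. That is why the Lasso sweep above runs from 1e-6 to 1e-2 rather than reusing the Ridge grid.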

Ridge tells you "all features contribute a little." Lasso tells you "these 3 features matter, the rest are noise." In the low signal-to-noise environment of stock prediction, Ridge typically produces better IC because it retains small correlated signals that Lasso discards. But Lasso gives you a clearer story about *which* features do the heavy lifting. Now for the time-stability check: does the optimal alpha stay constant across market regimes?

In [None]:
test_years = [2019, 2020, 2021, 2022, 2023]
best_alphas_ridge, best_ics_ridge = [], []
for year in test_years:
    tr = panel.index.get_level_values('date') < f'{year}-01-01'
    te = (panel.index.get_level_values('date') >= f'{year}-01-01') & \
         (panel.index.get_level_values('date') <= f'{year}-12-31')
    Xtr, ytr = panel.loc[tr, feature_cols].values, panel.loc[tr, 'target'].values
    Xte, yte = panel.loc[te, feature_cols].values, panel.loc[te, 'target'].values
    year_ics = [spearmanr(Ridge(alpha=a).fit(Xtr, ytr).predict(Xte), yte).statistic
                for a in alphas]
    best_alphas_ridge.append(alphas[np.argmax(year_ics)])
    best_ics_ridge.append(max(year_ics))

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].bar(test_years, np.log10(best_alphas_ridge), color='steelblue')
axes[0].set_xlabel('Test Year'); axes[0].set_ylabel('log10(Best Alpha)')
axes[0].set_title('Ridge: Optimal Alpha Over Time')
axes[1].bar(test_years, best_ics_ridge, color='coral')
axes[1].set_xlabel('Test Year'); axes[1].set_ylabel('Best OOS IC')
axes[1].set_title('Ridge: Best IC Achieved Per Year')
axes[1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.tight_layout(); plt.show()

Two findings that matter for production. First, the optimal alpha is *not* constant -- it fluctuates across years. You should see higher regularization preferred during volatile years (2020 with COVID, 2022 with rate hikes) and lower regularization during calm trending markets (2021). When the world is noisy, you want the model to be more conservative. The signal-to-noise ratio itself is non-stationary.

Second, the best achievable IC varies dramatically. Some years you might see IC of 0.05-0.08; others show IC near zero or negative. This is not a bug -- it's the reality of financial prediction. The cross-section is not equally predictable at all times. This is why AQR and Two Sigma re-tune their models regularly and monitor IC in real time. The regularization path isn't a one-time tuning exercise -- it's a diagnostic you run continuously.
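One caveat on the sweep above: each year's "best" alpha is chosen by looking at that same year's test IC, which is fine as a diagnostic but is not something you could do live. A hedged sketch of one alternative is to pick alpha on a validation block that precedes the test period. Everything below is illustrative: the data are synthetic, `ridge_fit`, `spearman`, and `pick_alpha` are hypothetical helpers, and the closed-form ridge on centred data is a stand-in for scikit-learn's `Ridge`.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centred data; the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ (y - y_mean))
    return coef, y_mean - x_mean @ coef

def spearman(a, b):
    """Spearman correlation via ranks; assumes no ties (continuous data)."""
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def pick_alpha(X_tr, y_tr, X_val, y_val, alphas):
    """Return the alpha with the highest validation IC; test data is never touched."""
    ics = []
    for a in alphas:
        coef, intercept = ridge_fit(X_tr, y_tr, a)
        ics.append(spearman(X_val @ coef + intercept, y_val))
    return alphas[int(np.argmax(ics))]

# Synthetic stand-in data: 10 uniform features, one weak linear signal
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (3000, 10))
y = 0.02 * X[:, 0] + 0.05 * rng.standard_normal(3000)
alphas = np.logspace(-3, 4, 50)

# Rows 0-1999 train, 2000-2499 validate, 2500+ are held back for the final test
chosen = pick_alpha(X[:2000], y[:2000], X[2000:2500], y[2000:2500], alphas)
print("alpha chosen on validation data:", chosen)
```

With the real panel, the row slices would become date masks (for example train through 2017, validate on 2018, test on 2019).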

## Exercise 3: The Leakage Trap -- 5-Fold vs. Expanding-Window vs. Purged

**The question:** How dramatically does your choice of evaluation methodology change your assessment of the same model's skill?

This is the exercise that should permanently change how you read ML papers. You're going to take the exact same Ridge model -- same features, same hyperparameters, same data -- and evaluate it three different ways. The results can differ by a factor of 2-5x. Only the two temporal evaluations are honest. The shuffled one is the kind of structural self-deception that fills journals with "impressive" results that evaporate in production.

**Your tasks:**
1. Take your Ridge model with a reasonable alpha (median of the best alphas from Exercise 2)
2. Evaluate with standard 5-fold CV (random shuffling -- the Kaggle approach)
3. Evaluate with expanding-window CV (train up to month t, predict month t+1)
4. Evaluate with expanding-window CV + 1-month purge buffer
5. Compare the ICs and plot all temporal IC series

In [None]:
# YOUR EXPLORATION HERE

---
### ▶ Solution

In [None]:
chosen_alpha = np.median(best_alphas_ridge)
X_all, y_all = panel[feature_cols].values, panel['target'].values

# Method 1: 5-fold shuffled CV (WRONG for financial data)
kfold_ics = []
for train_idx, test_idx in KFold(5, shuffle=True, random_state=42).split(X_all):
    m = Ridge(alpha=chosen_alpha).fit(X_all[train_idx], y_all[train_idx])
    kfold_ics.append(spearmanr(m.predict(X_all[test_idx]), y_all[test_idx]).statistic)
kfold_mean_ic = np.mean(kfold_ics)

# Method 2: Expanding-window CV (correct temporal ordering)
dates = panel.index.get_level_values('date').unique().sort_values()
min_train = 60
expanding_ics, expanding_dates = [], []
for i in range(min_train, len(dates) - 1):
    tr = panel.index.get_level_values('date') <= dates[i]
    te = panel.index.get_level_values('date') == dates[i + 1]
    if te.sum() < 10: continue
    m = Ridge(alpha=chosen_alpha).fit(panel.loc[tr, feature_cols], panel.loc[tr, 'target'])
    expanding_ics.append(spearmanr(m.predict(panel.loc[te, feature_cols]),
                                   panel.loc[te, 'target']).statistic)
    expanding_dates.append(dates[i + 1])

The 5-fold shuffled IC should look suspiciously good -- probably 0.05-0.10 or higher. Shuffling randomly assigns observations to folds regardless of time, so your training set might include March 2020 data while the test fold contains February 2020. The model has effectively seen the future. Adjacent observations share information through autocorrelated returns, persistent volatility regimes, and overlapping feature windows. The result is a metric measuring time-travel ability, not predictive skill.

The expanding-window IC is the honest evaluation: train on everything up to month t, predict month t+1, no future information. Now let's add the purge buffer -- the most conservative evaluation.
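Before running the purged version, note that both temporal loops follow one pattern: train on a prefix of dates, optionally drop the last few training dates, test on the next date. A minimal sketch of that pattern as a generator over date positions (`expanding_splits` is a hypothetical helper, not part of scikit-learn):

```python
def expanding_splits(n_dates, min_train, purge=0):
    """Yield (train_positions, test_position) for an expanding window.

    Training uses date positions 0..t-purge and the test date is t+1,
    so `purge` dates are left unused between the two.
    """
    for t in range(min_train + purge, n_dates - 1):
        train_end = t - purge
        yield list(range(train_end + 1)), t + 1

# Toy run: 10 dates, at least 4 training dates (positions 0..3)
plain = list(expanding_splits(10, 3, purge=0))
purged = list(expanding_splits(10, 3, purge=1))

print("first plain split: ", plain[0])    # train 0..3, test 4
print("first purged split:", purged[0])   # train 0..3, test 5 (position 4 unused)
```

With `purge=0` this reproduces the index arithmetic of Method 2, and with `purge=1` that of Method 3. In both cases every training date is strictly earlier than the test date.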

In [None]:
# Method 3: Expanding-window with 1-month purge buffer
purged_ics, purged_dates = [], []
for i in range(min_train + 1, len(dates) - 1):
    tr = panel.index.get_level_values('date') <= dates[i - 1]
    te = panel.index.get_level_values('date') == dates[i + 1]
    if te.sum() < 10 or tr.sum() < 100: continue
    m = Ridge(alpha=chosen_alpha).fit(panel.loc[tr, feature_cols], panel.loc[tr, 'target'])
    purged_ics.append(spearmanr(m.predict(panel.loc[te, feature_cols]),
                                panel.loc[te, 'target']).statistic)
    purged_dates.append(dates[i + 1])

fig, axes = plt.subplots(1, 2, figsize=(16, 6))
methods = ['5-Fold\n(Shuffled)', 'Expanding\nWindow', 'Expanding\n+ Purge']
mean_ics = [kfold_mean_ic, np.mean(expanding_ics), np.mean(purged_ics)]
colors = ['#e74c3c', '#f39c12', '#27ae60']
axes[0].bar(methods, mean_ics, color=colors)
axes[0].set_ylabel('Mean IC')
axes[0].set_title('Same Model, Three Evaluation Methods')
for j, v in enumerate(mean_ics):
    axes[0].text(j, v + 0.002, f'{v:.4f}', ha='center', fontsize=11)
axes[1].plot(expanding_dates, expanding_ics, alpha=0.7, label='Expanding')
axes[1].plot(purged_dates, purged_ics, alpha=0.7, label='Purged')
axes[1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1].set_title('Monthly IC Over Time'); axes[1].legend()
plt.tight_layout(); plt.show()

The bar chart on the left is the punchline of this entire seminar. The shuffled 5-fold IC (red) should be noticeably higher than both temporal methods -- typically 1.5x to 5x higher. The gap between the red bar and the green bar is the amount of self-deception in your evaluation. That gap is not sampling noise. It's structural information leakage.

The time series on the right shows that IC fluctuates wildly month to month -- some months +0.10, others -0.10. The model goes through regimes of predictability and unpredictability. The expanding and purged series should track each other closely, confirming the purge is a conservative adjustment rather than a wholesale methodology change.

Here's the lesson that should stick permanently: whenever someone shows you a financial ML result, your first question is "what evaluation methodology did you use?" If the answer is k-fold CV, or if they look confused by the question, the number is probably inflated by 2-5x. Every impressive backtest built on shuffled CV is a fairy tale. The honest numbers are smaller, uglier, and the only ones that matter when real capital is on the line.

## Exercise 4: IC Analysis by Feature

**The question:** Which individual features actually predict returns on their own, how strong are they, and does combining them in a linear model beat the best single feature -- or is the model just an expensive way to rediscover momentum?

Before you build complex models -- trees, neural nets, attention mechanisms -- you should understand what each input feature contributes on its own. Sometimes the best feature does 80% of the work and the other nine are passengers. Knowing which situation you're in determines whether you need a better model or better features.

**Your tasks:**
1. For each of the 10 features, compute the monthly cross-sectional Spearman IC with next-month excess returns
2. Rank features by the absolute value of their average IC
3. Compare: does the full Ridge model beat the best single feature?
4. Plot a feature IC bar chart and rolling 12-month IC of the top 3 features

In [None]:
# YOUR EXPLORATION HERE

---
### ▶ Solution

In [None]:
feature_ic_series = {feat: [] for feat in feature_cols}
ic_dates = []
for date in dates:
    subset = panel.loc[panel.index.get_level_values('date') == date]
    if len(subset) < 10: continue
    ic_dates.append(date)
    for feat in feature_cols:
        feature_ic_series[feat].append(
            spearmanr(subset[feat], subset['target']).statistic)

avg_ics = {f: np.nanmean(v) for f, v in feature_ic_series.items()}
ic_ranking = sorted(avg_ics.items(), key=lambda x: abs(x[1]), reverse=True)

fig, ax = plt.subplots(figsize=(12, 6))
feats = [x[0] for x in ic_ranking]
vals = [x[1] for x in ic_ranking]
ax.barh(feats, vals, color=['#27ae60' if v > 0 else '#e74c3c' for v in vals])
ax.set_xlabel('Average Cross-Sectional IC')
ax.set_title('Feature Predictive Power: Average IC with Next-Month Excess Returns')
ax.axvline(x=0, color='k', linestyle='-', alpha=0.3)
plt.tight_layout(); plt.show()

The ranking reveals which features carry genuine predictive power. You should see momentum features (especially `mom_12m_skip1`) near the top -- confirming Jegadeesh and Titman's 1993 finding, replicated in 40+ countries. The short-term reversal effect (last month's winners underperform this month) should show up as negative IC on the most recent returns: `mom_1m` is the return over the month immediately before the target month, and `reversal` is the month before that. And `vol_20d` should show a negative IC too: the low-volatility anomaly, where less volatile stocks outperform, is one of the most reliable effects in empirical finance.

The absolute magnitudes are small -- probably 0.02-0.06 for the strongest features. An IC of 0.03 sounds like nothing in ML terms. But remember the Fundamental Law from Week 3: IR = IC * sqrt(BR). With 50 stocks rebalanced monthly, breadth = 50 * 12 = 600 bets per year, if you treat every stock-month as an independent bet. IC of 0.03 gives IR = 0.03 * sqrt(600) = 0.73. That's a respectable information ratio, though correlated bets mean the effective breadth, and so the achievable IR, is lower in practice. AQR manages over $100 billion on signals of roughly this magnitude. Now let's check whether these signals are consistent or regime-dependent.
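Before that, the Fundamental Law arithmetic is easy to reproduce. The second calculation uses an effective breadth of 100, which is a purely hypothetical number chosen to show how sensitive the result is to the independence assumption.

```python
import math

ic = 0.03
breadth = 50 * 12                  # 50 stocks x 12 rebalances, all treated as independent bets
ir = ic * math.sqrt(breadth)
print(f"IR with breadth {breadth}: {ir:.2f}")

effective_breadth = 100            # hypothetical: correlated bets shrink the breadth
ir_effective = ic * math.sqrt(effective_breadth)
print(f"IR with breadth {effective_breadth}: {ir_effective:.2f}")
```

Because IR scales with the square root of breadth, cutting breadth by a factor of six cuts the IR by a factor of about 2.4, not six.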

In [None]:
top_3 = [x[0] for x in ic_ranking[:3]]
fig, ax = plt.subplots(figsize=(14, 5))
for feat in top_3:
    rolling = pd.Series(feature_ic_series[feat], index=ic_dates).rolling(12).mean()
    ax.plot(rolling, label=feat, alpha=0.8)
ax.axhline(y=0, color='k', linestyle='--', alpha=0.3)
ax.set_title('Rolling 12-Month Average IC: Top 3 Features')
ax.set_ylabel('Rolling IC (12-month avg)'); ax.legend()
plt.tight_layout(); plt.show()

This is where the "features are non-stationary" lesson becomes visceral. Even the best features oscillate -- sometimes positive, sometimes negative, sometimes wildly so. Momentum tends to work beautifully in trending markets but gets destroyed during sharp reversals. The "momentum crash" of March 2009 is the most famous example: stocks that had been falling for months suddenly reversed violently, and long-short momentum portfolios lost roughly 40% in a single month. That episode predates our 2010 start date, so it won't appear in the plot above.

The practical implication: any single feature is unreliable over short horizons. The value of a multi-feature model isn't dramatic IC improvement -- it's *diversification* across signals. When momentum stops working, maybe volatility or reversal picks up the slack. Let's test whether the Ridge model actually beats the best single feature.

In [None]:
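model_mean_ic = np.nanmean(expanding_ics)  # skip months where the IC is undefined
best_feat_name, best_feat_ic = ic_ranking[0]
best_feat_ic = abs(best_feat_ic)  # a negative-IC feature is used with its sign flipped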

fig, ax = plt.subplots(figsize=(8, 5))
labels = [f'Best Feature\n({best_feat_name})', 'Ridge Model\n(all 10 features)']
bars = ax.bar(labels, [best_feat_ic, model_mean_ic], color=['#3498db', '#e67e22'])
ax.set_ylabel('Mean IC')
ax.set_title('Single Feature vs. Multi-Feature Model')
for bar, val in zip(bars, [best_feat_ic, model_mean_ic]):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
            f'{val:.4f}', ha='center', fontsize=12)
plt.tight_layout(); plt.show()

The Ridge model's IC should be modestly higher than the best single feature -- but the improvement is probably smaller than you'd expect. This is the honest reality of linear models in cross-sectional prediction: when the model is a weighted average of features, it can't exceed what the individual features collectively offer. Ridge diversifies across features and stabilizes the signal, but it can't *create* signal that doesn't exist in the raw inputs. The ceiling is fundamentally low.

This is precisely why Gu, Kelly, and Xiu found that gradient-boosted trees and neural nets outperform linear models. Those models capture *interactions* between features -- for example, momentum behaving differently for high-vol versus low-vol stocks. A linear model can't represent "momentum works better when volatility is low" without you manually engineering that interaction. A tree finds it automatically by splitting on volatility first. Feature interactions of this kind are what the GKX paper credits for the gains of its best nonlinear models, and they're Week 5's story.
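A toy example of what "manually engineering that interaction" means. The data-generating process below is an assumption made up for illustration (momentum pays off only when volatility rank is below 0.5); it is not estimated from the seminar data. Plain least squares via `numpy.linalg.lstsq` stands in for the linear model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
mom = rng.uniform(0, 1, n)   # hypothetical rank-normalised momentum
vol = rng.uniform(0, 1, n)   # hypothetical rank-normalised volatility

# Assumed process: momentum matters only for low-volatility stocks
y = (mom - 0.5) * (vol < 0.5) + 0.5 * rng.standard_normal(n)

def r2(X, y):
    """In-sample R-squared of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_linear = r2(np.column_stack([mom, vol]), y)
r2_inter = r2(np.column_stack([mom, vol, mom * vol]), y)

print(f"R2, additive features only:     {r2_linear:.3f}")
print(f"R2, with mom*vol product term:  {r2_inter:.3f}")
```

The additive model can only learn momentum's average effect across all volatility levels. Adding the product term lets the slope on momentum vary with volatility, which recovers part of the conditional effect. It is still only a linear approximation to the step at 0.5, the kind of threshold a tree split can represent directly.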

But don't dismiss the linear baseline. It's interpretable, stable, fast to train, and surprisingly hard to beat. Many quant funds still use linear models as their primary signal -- not because they're ignorant of neural nets, but because the marginal improvement often doesn't justify the complexity, overfitting risk, and the fact that you can no longer explain to your risk manager what the model is doing.

## Summary

Here's what you built and discovered in this session:

- **Feature engineering is mostly data plumbing.** Computing 10 features for 50 stocks across 14 years required careful temporal alignment (the `shift(-1)` for targets, the skip-1 for momentum), missing data handling, and cross-sectional rank normalization. Getting the time alignment wrong by a single period is the difference between a genuine prediction and a look-ahead bias that inflates everything downstream.

- **The regularization path is a diagnostic tool, not just a tuning method.** The Lasso path reveals feature importance through elimination order; the Ridge path shows how aggressively the data wants to be shrunk. The optimal alpha shifts with market regimes -- higher regularization during volatile periods when signal-to-noise deteriorates.

- **Evaluation methodology is more important than model architecture.** The same Ridge model evaluated three ways produced ICs differing by 2-5x. Shuffled 5-fold CV is structurally dishonest for temporal data. Expanding-window CV with purging is the minimum honest standard. Always ask "how was this evaluated?" before trusting any financial ML result.

- **Individual feature ICs are small but real, and non-stationary.** The strongest features have average ICs of 0.02-0.06. Rolling IC plots reveal that even the most robust signals go through extended weakness. The value of combining features is diversification, not amplification.

- **Linear models provide a hard-to-beat baseline, but have a low ceiling.** The Ridge model modestly outperforms the best single feature. But the improvement is bounded by the linear assumption -- no feature interactions. Trees and neural nets lift this ceiling.

In the homework, you'll scale this pipeline to 200+ stocks, compare all four linear models (OLS, Ridge, Lasso, Elastic Net), build a long-short portfolio, and measure performance net of transaction costs. You'll discover that transaction costs change the model comparison in surprising ways -- Lasso's sparser predictions may produce lower turnover, partially offsetting its lower gross IC.