# Regularized Linear Models: Ridge, Lasso, and ElasticNet

---

## Learning Objectives

By the end of this notebook, you will be able to:

- Explain why regularization is needed (overfitting, multicollinearity, feature selection)
- Describe the difference between L1 (Lasso), L2 (Ridge), and ElasticNet penalties
- Fit Ridge, Lasso, and ElasticNet models using sklearn
- Visualize coefficient paths as the regularization strength varies
- Use cross-validation (`RidgeCV`, `LassoCV`, `ElasticNetCV`) to select the best alpha
- Apply best practices: standardize features, use pipelines

## Prerequisites

- Completed Notebooks 01-02 (Linear Regression basics and assumptions)
- Understanding of overfitting and train/test splits
- Basic familiarity with cross-validation

## Table of Contents

1. [Why Regularize?](#1-why-regularize)
2. [Ridge Regression (L2)](#2-ridge-regression-l2)
3. [Lasso Regression (L1)](#3-lasso-regression-l1)
4. [ElasticNet (L1 + L2)](#4-elasticnet-l1--l2)
5. [Comparing Coefficients: Ridge vs Lasso vs ElasticNet](#5-comparing-coefficients)
6. [Coefficient Paths as Alpha Varies](#6-coefficient-paths-as-alpha-varies)
7. [Cross-Validation for Alpha Selection](#7-cross-validation-for-alpha-selection)
8. [Best Practices](#8-best-practices)
9. [Common Mistakes](#9-common-mistakes)
10. [Exercise](#10-exercise)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import (
    LinearRegression, Ridge, Lasso, ElasticNet,
    RidgeCV, LassoCV, ElasticNetCV
)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

---

## 1. Why Regularize?

Standard linear regression, i.e. ordinary least squares (OLS), minimizes the mean squared error (MSE) cost function:

$$J = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

This can lead to problems:

- **Overfitting:** With many features (especially when features outnumber samples), the model can fit noise in the training data
- **Multicollinearity:** Correlated features cause large, unstable coefficient estimates
- **No feature selection:** OLS keeps all features, even irrelevant ones

**Regularization** adds a penalty term to the cost function that discourages large coefficients, leading to simpler, more generalizable models.
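
The multicollinearity problem can be seen numerically. The sketch below is a NumPy-only illustration on made-up data (the helper `ols_coef_std`, the sample size, and the noise level are assumptions, not part of this notebook's dataset). It refits OLS without an intercept on repeated noisy draws of `y` and compares how much the coefficients vary when the two features are unrelated versus nearly identical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

def ols_coef_std(X, n_repeats=200, noise=0.5):
    """Std of OLS coefficients over repeated noisy draws of y (true coefs = 1, 1)."""
    fits = []
    for _ in range(n_repeats):
        y = X @ np.array([1.0, 1.0]) + rng.normal(scale=noise, size=len(X))
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        fits.append(coef)
    return np.std(fits, axis=0)

x1 = rng.normal(size=n)
X_indep = np.column_stack([x1, rng.normal(size=n)])               # unrelated features
X_collin = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])  # near-duplicate features

std_indep = ols_coef_std(X_indep)
std_collin = ols_coef_std(X_collin)
print("coef std, independent features:", std_indep.round(3))
print("coef std, collinear features:  ", std_collin.round(3))
```

With near-duplicate features the data cannot tell the two coefficients apart, only their sum, so the individual estimates swing widely from one noisy sample to the next.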

In [None]:
# Generate dataset with many features (some irrelevant)
np.random.seed(42)
n_samples = 100
n_features = 20

X = np.random.randn(n_samples, n_features)

# Only first 5 features matter
true_coefs = np.zeros(n_features)
true_coefs[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]

y = X @ true_coefs + 2.0 + np.random.randn(n_samples) * 0.5

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(f"Dataset: {n_samples} samples, {n_features} features")
print("Only features 0-4 have non-zero true coefficients.")
print(f"Train: {X_train.shape[0]} samples, Test: {X_test.shape[0]} samples")

---

## 2. Ridge Regression (L2)

Ridge adds an **L2 penalty** (sum of squared coefficients) to the cost function:

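$$J_{\text{Ridge}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} w_j^2$$

Note that sklearn's `Ridge` penalizes the unscaled sum of squared errors, $\sum_{i}(y_i - \hat{y}_i)^2 + a \sum_{j} w_j^2$, so `Ridge(alpha=a)` corresponds to $\alpha = a/n$ in the formula above.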

**Key properties:**
- Shrinks all coefficients toward zero, but **never exactly to zero**
- Handles multicollinearity well
- $\alpha$ controls regularization strength (higher = more shrinkage)
- Does **not** perform feature selection

In [None]:
# Fit Ridge with different alpha values
alphas_demo = [0.001, 0.1, 1.0, 10.0, 100.0]

print(f"{'Alpha':<10} {'Train R2':<12} {'Test R2':<12} {'Coef L2 Norm':<14}")
print("-" * 48)

for alpha in alphas_demo:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train_s, y_train)
    train_r2 = ridge.score(X_train_s, y_train)
    test_r2 = ridge.score(X_test_s, y_test)
    coef_norm = np.linalg.norm(ridge.coef_)
    print(f"{alpha:<10} {train_r2:<12.4f} {test_r2:<12.4f} {coef_norm:<14.4f}")

print("\nAs alpha increases, coefficients shrink and the model becomes simpler.")

---

## 3. Lasso Regression (L1)

Lasso adds an **L1 penalty** (sum of absolute values of coefficients):

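$$J_{\text{Lasso}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |w_j|$$

Note that sklearn's `Lasso` scales the squared-error term by $\frac{1}{2n}$ instead of $\frac{1}{n}$, so `Lasso(alpha=a)` corresponds to $\alpha = 2a$ in the formula above.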

**Key properties:**
- Can shrink coefficients **exactly to zero** (automatic feature selection)
- Useful when you suspect many features are irrelevant
- May arbitrarily pick one feature from a group of correlated features
- Key parameters: `alpha`, `max_iter`
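
Why the L1 penalty produces exact zeros is easiest to see in the special case of orthonormal features, where the Lasso solution is the OLS coefficient passed through a soft-thresholding function, while the Ridge solution is the OLS coefficient multiplied by a constant factor. The sketch below compares the two; the threshold `t` stands in for a multiple of alpha (the exact constant depends on how the loss is scaled), and the coefficient values are made up.

```python
import numpy as np

def soft_threshold(z, t):
    """Lasso-style shrinkage: move z toward zero by t, clipping at exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proportional_shrink(z, t):
    """Ridge-style shrinkage: rescale z by a constant factor, never reaching zero."""
    return z / (1.0 + t)

ols_coefs = np.array([3.0, -0.5, 0.2, -2.0])
lasso_like = soft_threshold(ols_coefs, 1.0)
ridge_like = proportional_shrink(ols_coefs, 1.0)
print("soft threshold:", lasso_like)
print("proportional:  ", ridge_like)
```

Coefficients smaller in magnitude than the threshold are set to zero by soft thresholding, whereas proportional shrinkage only makes them smaller.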

In [None]:
# Fit Lasso with different alpha values
print(f"{'Alpha':<10} {'Train R2':<12} {'Test R2':<12} {'Non-zero coefs':<16}")
print("-" * 50)

for alpha in [0.001, 0.01, 0.1, 0.5, 1.0]:
    lasso = Lasso(alpha=alpha, max_iter=10000, random_state=42)
    lasso.fit(X_train_s, y_train)
    train_r2 = lasso.score(X_train_s, y_train)
    test_r2 = lasso.score(X_test_s, y_test)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"{alpha:<10} {train_r2:<12.4f} {test_r2:<12.4f} {n_nonzero:<16}")

print("\nHigher alpha -> more coefficients set to zero (feature selection).")
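
Counting non-zero coefficients says how many features were kept, but not whether they are the right ones. When the true coefficients are known, as in this synthetic setting, a small helper can compare the selected features to the true ones. The function `support_report` and the example vectors below are hypothetical and not taken from the fitted models above.

```python
import numpy as np

def support_report(true_coefs, est_coefs, tol=0.0):
    """Compare which coefficients are non-zero in the truth vs. an estimate."""
    true_support = np.abs(true_coefs) > 0
    est_support = np.abs(est_coefs) > tol
    return {
        "selected": int(est_support.sum()),
        "false_positives": int((est_support & ~true_support).sum()),  # noise features kept
        "false_negatives": int((~est_support & true_support).sum()),  # real features dropped
    }

# Hypothetical coefficient vectors for illustration only
true_w = np.array([3.0, -2.0, 0.5, 0.0, 0.0])
est_w = np.array([2.8, -1.7, 0.0, 0.0, 0.1])
report = support_report(true_w, est_w)
print(report)
```

Here the estimate keeps three features, one of which is a noise feature, and drops one real feature with a small true coefficient.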

---

## 4. ElasticNet (L1 + L2)

ElasticNet combines both penalties:

$$J_{\text{ElasticNet}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \left[ \rho \sum_{j=1}^{p} |w_j| + \frac{(1-\rho)}{2} \sum_{j=1}^{p} w_j^2 \right]$$

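sklearn's `ElasticNet` scales the squared-error term by $\frac{1}{2n}$ instead of $\frac{1}{n}$, so `alpha=a` there corresponds to $\alpha = 2a$ here. The parameter $\rho$ (`l1_ratio` in sklearn) controls the mix: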
- $\rho = 1$: pure Lasso
- $\rho = 0$: pure Ridge
- $0 < \rho < 1$: combination

**Key properties:**
- Can select features (like Lasso) while handling correlated features (like Ridge)
- Useful when features are grouped and correlated
- Key parameters: `alpha`, `l1_ratio`, `max_iter`
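
The difference in how Lasso and ElasticNet treat correlated features can be illustrated with the extreme case of two identical columns. The sketch below is a bare-bones cyclic coordinate descent written for this illustration (the function `elastic_net_cd` is an assumption, not sklearn's implementation); it minimizes the objective documented for sklearn's `ElasticNet`, with centered `y` and no intercept.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, alpha, l1_ratio, n_sweeps=500):
    """Cyclic coordinate descent for
    (1/(2n))*||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2.
    Assumes centered y and no intercept (illustrative, not optimized)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            partial_resid = y - X @ w + X[:, j] * w[j]  # residual ignoring feature j
            rho_j = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho_j, alpha * l1_ratio) / (
                col_sq[j] + alpha * (1 - l1_ratio)
            )
    return w

rng = np.random.default_rng(0)
x = rng.normal(size=200)
x = (x - x.mean()) / x.std()
X = np.column_stack([x, x])  # two identical features
y = 2.0 * x + rng.normal(scale=0.3, size=200)
y = y - y.mean()

w_lasso = elastic_net_cd(X, y, alpha=0.1, l1_ratio=1.0)
w_enet = elastic_net_cd(X, y, alpha=0.1, l1_ratio=0.5)
print("Lasso      (l1_ratio=1.0):", w_lasso.round(3))
print("ElasticNet (l1_ratio=0.5):", w_enet.round(3))
```

In this setup the pure L1 penalty leaves essentially all the weight on whichever duplicate is updated first, while the added L2 term makes the solution unique and splits the weight evenly between the two columns.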

In [None]:
# ElasticNet with different l1_ratio values
alpha_en = 0.1

print(f"ElasticNet (alpha={alpha_en}):")
print(f"{'l1_ratio':<12} {'Train R2':<12} {'Test R2':<12} {'Non-zero coefs':<16}")
print("-" * 52)

for l1_ratio in [0.1, 0.3, 0.5, 0.7, 0.9]:
    en = ElasticNet(alpha=alpha_en, l1_ratio=l1_ratio, max_iter=10000, random_state=42)
    en.fit(X_train_s, y_train)
    train_r2 = en.score(X_train_s, y_train)
    test_r2 = en.score(X_test_s, y_test)
    n_nonzero = np.sum(en.coef_ != 0)
    print(f"{l1_ratio:<12} {train_r2:<12.4f} {test_r2:<12.4f} {n_nonzero:<16}")

print("\nHigher l1_ratio -> more Lasso-like (more zero coefficients).")

---

## 5. Comparing Coefficients: Ridge vs Lasso vs ElasticNet <a id='5-comparing-coefficients'></a>

In [None]:
# Fit all models with a fixed alpha
alpha_compare = 0.1

ols = LinearRegression()
ridge = Ridge(alpha=alpha_compare)
lasso = Lasso(alpha=alpha_compare, max_iter=10000, random_state=42)
elastic = ElasticNet(alpha=alpha_compare, l1_ratio=0.5, max_iter=10000, random_state=42)

models = {"OLS": ols, "Ridge": ridge, "Lasso": lasso, "ElasticNet": elastic}
coef_dict = {}

for name, model in models.items():
    model.fit(X_train_s, y_train)
    coef_dict[name] = model.coef_

# Plot coefficient comparison
fig, ax = plt.subplots(figsize=(14, 6))

x_pos = np.arange(n_features)
width = 0.16  # five bars per feature: True, OLS, Ridge, Lasso, ElasticNet

ax.bar(x_pos - 2 * width, true_coefs, width, label="True", color="black", alpha=0.6)
ax.bar(x_pos - 1 * width, coef_dict["OLS"], width, label="OLS", color="steelblue", alpha=0.7)
ax.bar(x_pos, coef_dict["Ridge"], width, label="Ridge", color="coral", alpha=0.7)
ax.bar(x_pos + 1 * width, coef_dict["Lasso"], width, label="Lasso", color="mediumseagreen", alpha=0.7)
ax.bar(x_pos + 2 * width, coef_dict["ElasticNet"], width, label="ElasticNet", color="mediumpurple", alpha=0.7)

ax.set_xticks(x_pos)
ax.set_xticklabels([f"w{i}" for i in range(n_features)], fontsize=8)
ax.set_ylabel("Coefficient value")
ax.set_title(f"Coefficient Comparison (alpha={alpha_compare})")
ax.legend()
ax.axhline(y=0, color="gray", linestyle="-", linewidth=0.5)

plt.tight_layout()
plt.show()

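# Count the zeros instead of asserting them, so the message matches the fit
lasso_zeros = int(np.sum(coef_dict["Lasso"][5:] == 0))
print(f"Lasso sets {lasso_zeros} of the {n_features - 5} irrelevant coefficients (w5-w19) to exactly zero.")
print("Ridge shrinks all coefficients but keeps them non-zero.")
print("OLS assigns small but non-zero values even to noise features.")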

---

## 6. Coefficient Paths as Alpha Varies

In [None]:
# Lasso coefficient path
alphas = np.logspace(-4, 1, 100)
lasso_coefs = []

for a in alphas:
    lasso_temp = Lasso(alpha=a, max_iter=10000, random_state=42)
    lasso_temp.fit(X_train_s, y_train)
    lasso_coefs.append(lasso_temp.coef_)

lasso_coefs = np.array(lasso_coefs)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Lasso paths
for i in range(n_features):
    color = "steelblue" if true_coefs[i] != 0 else "lightgray"
    lw = 2 if true_coefs[i] != 0 else 0.8
    label = f"w{i} (true={true_coefs[i]:.1f})" if true_coefs[i] != 0 else None
    axes[0].plot(alphas, lasso_coefs[:, i], color=color, linewidth=lw, label=label)

axes[0].set_xscale("log")
axes[0].set_xlabel("Alpha (log scale)")
axes[0].set_ylabel("Coefficient value")
axes[0].set_title("Lasso Coefficient Paths")
axes[0].legend(loc="upper right", fontsize=8)
axes[0].axhline(y=0, color="black", linestyle="-", linewidth=0.5)

# Ridge paths for comparison.
# sklearn's Ridge does not divide the squared error by n, so it needs much
# larger alphas than Lasso before the shrinkage becomes visible.
ridge_alphas = np.logspace(-2, 5, 100)
ridge_coefs = []
for a in ridge_alphas:
    ridge_temp = Ridge(alpha=a)
    ridge_temp.fit(X_train_s, y_train)
    ridge_coefs.append(ridge_temp.coef_)

ridge_coefs = np.array(ridge_coefs)

for i in range(n_features):
    color = "coral" if true_coefs[i] != 0 else "lightgray"
    lw = 2 if true_coefs[i] != 0 else 0.8
    label = f"w{i} (true={true_coefs[i]:.1f})" if true_coefs[i] != 0 else None
    axes[1].plot(ridge_alphas, ridge_coefs[:, i], color=color, linewidth=lw, label=label)

axes[1].set_xscale("log")
axes[1].set_xlabel("Alpha (log scale)")
axes[1].set_ylabel("Coefficient value")
axes[1].set_title("Ridge Coefficient Paths")
axes[1].legend(loc="upper right", fontsize=8)
axes[1].axhline(y=0, color="black", linestyle="-", linewidth=0.5)

plt.tight_layout()
plt.show()

print("Lasso (left): Coefficients drop to exactly zero as alpha increases.")
print("Ridge (right): Coefficients shrink smoothly toward zero but never reach it.")
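
The point on the Lasso path where every coefficient reaches zero can be computed directly. For the objective $\frac{1}{2n}\|y - Xw\|^2 + \alpha\|w\|_1$ with centered `y`, all coefficients are zero once $\alpha \ge \max_j |x_j^\top y| / n$. The sketch below computes that value on made-up standardized data and builds a log-spaced grid below it; the ratio `1e-3` and the 50 grid points are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 80, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized features
y = X[:, 0] * 3.0 - X[:, 1] * 2.0 + rng.normal(scale=0.5, size=n)
yc = y - y.mean()  # intercept handled by centering

# Smallest alpha at which every Lasso coefficient is zero
alpha_max = np.max(np.abs(X.T @ yc)) / n

# Log-spaced grid running from alpha_max down to a small fraction of it
alpha_grid = np.logspace(np.log10(alpha_max), np.log10(alpha_max * 1e-3), num=50)

print(f"alpha_max = {alpha_max:.4f}")
print(f"grid from {alpha_grid[0]:.4f} down to {alpha_grid[-1]:.6f}")
```

Alphas above `alpha_max` add nothing to a Lasso path plot, since the curves are already flat at zero there.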

---

## 7. Cross-Validation for Alpha Selection

In [None]:
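# Compare models using cross-validation on the training set.
# Each model is wrapped in a Pipeline so the scaler is re-fit inside every
# fold and the validation fold never influences the scaling statistics.
model_results = {}

for name, model in models.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X_train, y_train, cv=5,
                             scoring="neg_mean_squared_error")
    rmse_scores = np.sqrt(-scores)
    pipe.fit(X_train, y_train)  # refit on the full training set for the test score
    model_results[name] = {
        "CV RMSE Mean": rmse_scores.mean(),
        "CV RMSE Std": rmse_scores.std(),
        "Test R2": pipe.score(X_test, y_test)
    }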

results_df = pd.DataFrame(model_results).T
print("Cross-Validation Comparison:")
print(results_df.round(4).to_string())

In [None]:
# Use built-in CV models to find best alpha

# RidgeCV
ridge_cv = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5)
ridge_cv.fit(X_train_s, y_train)
print("RidgeCV:")
print(f"  Best alpha: {ridge_cv.alpha_:.6f}")
print(f"  Test R2:    {ridge_cv.score(X_test_s, y_test):.4f}")

# LassoCV
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000, random_state=42)
lasso_cv.fit(X_train_s, y_train)
print("\nLassoCV:")
print(f"  Best alpha:     {lasso_cv.alpha_:.6f}")
print(f"  Test R2:        {lasso_cv.score(X_test_s, y_test):.4f}")
print(f"  Non-zero coefs: {np.sum(lasso_cv.coef_ != 0)}/{n_features}")

# ElasticNetCV
en_cv = ElasticNetCV(alphas=np.logspace(-4, 1, 50), l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9],
                     cv=5, max_iter=10000, random_state=42)
en_cv.fit(X_train_s, y_train)
print("\nElasticNetCV:")
print(f"  Best alpha:     {en_cv.alpha_:.6f}")
print(f"  Best l1_ratio:  {en_cv.l1_ratio_:.2f}")
print(f"  Test R2:        {en_cv.score(X_test_s, y_test):.4f}")
print(f"  Non-zero coefs: {np.sum(en_cv.coef_ != 0)}/{n_features}")
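
What the `*CV` estimators do conceptually can be written out by hand. The sketch below is a NumPy-only approximation of alpha selection for Ridge, not sklearn's implementation: it splits the data into k folds, standardizes with training-fold statistics only, fits closed-form ridge, and picks the alpha with the lowest mean validation MSE. The helper names and the regenerated toy data are assumptions for illustration.

```python
import numpy as np

def ridge_fit_predict(X_tr, y_tr, X_va, alpha):
    """Standardize using training-fold statistics only, fit ridge, predict on X_va."""
    mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)
    Z_tr, Z_va = (X_tr - mu) / sd, (X_va - mu) / sd
    y_mean = y_tr.mean()
    w = np.linalg.solve(Z_tr.T @ Z_tr + alpha * np.eye(Z_tr.shape[1]),
                        Z_tr.T @ (y_tr - y_mean))
    return Z_va @ w + y_mean

def cv_mse(X, y, alpha, k=5, seed=0):
    """Mean validation MSE over k shuffled folds (same folds for every alpha)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        pred = ridge_fit_predict(X[train], y[train], X[fold], alpha)
        errors.append(np.mean((y[fold] - pred) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, -1.0, 0.5]) + 2.0 + rng.normal(scale=0.5, size=100)

alpha_grid = np.logspace(-4, 4, 20)
scores = [cv_mse(X, y, a) for a in alpha_grid]
best_alpha = alpha_grid[int(np.argmin(scores))]
print(f"best alpha = {best_alpha:.4g}, CV MSE = {min(scores):.4f}")
```

Reusing the same folds for every alpha keeps the comparison fair, because differences in score then come from the penalty rather than from the split.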

---

## 8. Best Practices

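- **Always standardize features** before applying regularization: the penalty acts on coefficient magnitudes, which depend on each feature's units, so features must share a scale for the penalty to treat them equally
- **Use a Pipeline** to avoid data leakage (the scaler must be fit only on training data, including within each cross-validation fold)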
- **Use built-in CV models** (`RidgeCV`, `LassoCV`, `ElasticNetCV`) for efficient alpha selection
- **Compare against OLS baseline** to confirm regularization helps
- **Check the number of non-zero coefficients** in Lasso to understand feature selection

In [None]:
# Best practice: Pipeline with StandardScaler + regularized model
pipe_lasso = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000, random_state=42))
])

# Fit on raw (unscaled) training data — pipeline handles scaling
pipe_lasso.fit(X_train, y_train)
y_pred_pipe = pipe_lasso.predict(X_test)

print("Pipeline (StandardScaler + LassoCV):")
print(f"  Best alpha: {pipe_lasso.named_steps['lasso'].alpha_:.6f}")
print(f"  Test R2:    {r2_score(y_test, y_pred_pipe):.4f}")
print(f"  Test RMSE:  {np.sqrt(mean_squared_error(y_test, y_pred_pipe)):.4f}")
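
The reason standardization matters can be checked with a small experiment. The sketch below is a NumPy-only illustration with made-up data and a hypothetical helper `ridge_predict`: it fits closed-form ridge to the same data twice, once with the second feature in metres and once in millimetres, and measures how far the two sets of predictions drift apart with and without standardization.

```python
import numpy as np

def ridge_predict(X, y, alpha, standardize):
    """Closed-form ridge on centered data; optionally standardize the columns first."""
    Z = X - X.mean(axis=0)
    if standardize:
        Z = Z / X.std(axis=0)
    w = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ (y - y.mean()))
    return Z @ w + y.mean()

rng = np.random.default_rng(0)
X_m = rng.normal(size=(60, 2))  # second feature measured in metres
y = X_m @ np.array([1.0, 1.0]) + rng.normal(scale=0.3, size=60)
X_mm = X_m * np.array([1.0, 1000.0])  # same data, second feature in millimetres

gaps = {}
for standardize in (False, True):
    pred_m = ridge_predict(X_m, y, alpha=50.0, standardize=standardize)
    pred_mm = ridge_predict(X_mm, y, alpha=50.0, standardize=standardize)
    gaps[standardize] = float(np.max(np.abs(pred_m - pred_mm)))
    print(f"standardize={standardize!s:<5}  max prediction gap = {gaps[standardize]:.6f}")
```

Without standardization, a change of units alters how strongly the second feature is penalized and therefore changes the predictions; with standardization the two fits agree up to rounding error.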

---

## 9. Common Mistakes

| Mistake | Why It's a Problem | Fix |
|---|---|---|
| Not standardizing features | The penalty acts on raw coefficient magnitudes, so features on larger scales are penalized less than features on smaller scales | Always use `StandardScaler` before regularized models |
| Using a fixed alpha without tuning | Suboptimal regularization strength | Use `RidgeCV`, `LassoCV`, or `ElasticNetCV` |
| Ignoring `ConvergenceWarning` messages | Lasso/ElasticNet stopped before converging, so the coefficients may be inaccurate | Increase `max_iter` (e.g., 10000) and check that features are standardized |
| Scaling outside the pipeline | A scaler fit on all the data leaks test-set statistics into training; one fit before cross-validation leaks validation-fold statistics | Put `StandardScaler` inside a `Pipeline` |
| Using Lasso with highly correlated features | Lasso arbitrarily picks one, drops others | Use Ridge or ElasticNet instead |
| Forgetting to compare with OLS | You may not need regularization at all | Always benchmark against unregularized model |

---

## 10. Exercise

**Task:** Use `RidgeCV` and `LassoCV` (inside pipelines) to find the best alpha for the dataset below. Compare their test R2 scores.

Steps:
1. Use the dataset generated below (30 features, only 8 are relevant)
2. Build a Pipeline with `StandardScaler` + `RidgeCV` and another with `StandardScaler` + `LassoCV`
3. Fit both on training data, predict on test data
4. Report: best alpha, test R2, number of non-zero coefficients (for Lasso)

In [None]:
# --- Exercise Solution ---

# Generate dataset
np.random.seed(42)
n_ex, p_ex = 150, 30
X_ex = np.random.randn(n_ex, p_ex)
true_coefs_ex = np.zeros(p_ex)
true_coefs_ex[:8] = [4.0, -3.0, 2.5, -2.0, 1.5, -1.0, 0.7, -0.3]
y_ex = X_ex @ true_coefs_ex + 1.0 + np.random.randn(n_ex) * 0.8

X_ex_train, X_ex_test, y_ex_train, y_ex_test = train_test_split(
    X_ex, y_ex, test_size=0.2, random_state=42
)

# Pipeline: RidgeCV
pipe_ridge = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5))
])
pipe_ridge.fit(X_ex_train, y_ex_train)
y_pred_ridge = pipe_ridge.predict(X_ex_test)

# Pipeline: LassoCV
pipe_lasso_ex = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000, random_state=42))
])
pipe_lasso_ex.fit(X_ex_train, y_ex_train)
y_pred_lasso = pipe_lasso_ex.predict(X_ex_test)

# Results
print("RidgeCV Results:")
print(f"  Best alpha: {pipe_ridge.named_steps['ridge'].alpha_:.6f}")
print(f"  Test R2:    {r2_score(y_ex_test, y_pred_ridge):.4f}")

lasso_coefs_ex = pipe_lasso_ex.named_steps['lasso'].coef_
print("\nLassoCV Results:")
print(f"  Best alpha:     {pipe_lasso_ex.named_steps['lasso'].alpha_:.6f}")
print(f"  Test R2:        {r2_score(y_ex_test, y_pred_lasso):.4f}")
print(f"  Non-zero coefs: {np.sum(lasso_coefs_ex != 0)}/{p_ex}")
print(f"  (True non-zero: 8/{p_ex})")