# L1 and L2 Regularization: Geometric Intuition

---

## Learning Objectives

By the end of this notebook you will be able to:

- Explain **why** regularization is needed (prevent overfitting by constraining weights)
- Write the cost functions for **Ridge (L2)** and **Lasso (L1)** regularization
- Explain the **geometric intuition** behind why L1 produces sparse solutions and L2 does not
- Describe **ElasticNet** as a combination of L1 and L2
- Use sklearn to fit Ridge, Lasso, and ElasticNet and compare their coefficient behavior
- Visualize how coefficients change with the regularization strength $\alpha$

## Prerequisites

- Loss and cost functions (Notebook 01)
- Gradient descent and optimization (Notebook 02)
- Linear regression (ML200)

## Table of Contents

1. [Why Regularization?](#1)
2. [L2 Regularization (Ridge)](#2)
3. [L1 Regularization (Lasso)](#3)
4. [Geometric Intuition: Why L1 Gives Sparse Solutions](#4)
5. [ElasticNet](#5)
6. [Code: Comparing Ridge, Lasso, and ElasticNet](#6)
7. [Coefficient Paths vs Alpha](#7)
8. [Common Mistakes](#8)
9. [Exercise](#9)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch  # used for the custom legend entries in Sections 6 and 7
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

np.random.seed(42)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

<a id='1'></a>
## 1. Why Regularization?

When a model overfits, its weights tend to become **very large** -- the model contorts itself to fit noise in the training data.

**Regularization** adds a penalty term to the cost function that discourages large weights:

$$J_{\text{regularized}} = J_{\text{data}} + \lambda \cdot \text{penalty}(\mathbf{w})$$

This forces the model to find a balance between:
- Fitting the training data well (low data loss)
- Keeping the weights small (low penalty)

The result is a **simpler, more generalizable** model.

> **Note:** In sklearn, the regularization strength parameter is called `alpha` (not $\lambda$). Larger `alpha` = stronger regularization.
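
To make the trade-off concrete, here is a minimal sketch on synthetic data. The sine target, the polynomial degree, and `alpha=1.0` are illustrative assumptions, not values used elsewhere in this notebook. It fits the same high-degree polynomial features with and without an L2 penalty and compares the size of the learned weight vectors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, size=20)).reshape(-1, 1)
y = np.sin(3 * x).ravel() + rng.normal(scale=0.2, size=20)

# Degree-12 polynomial features give the model enough freedom to chase noise.
features = make_pipeline(PolynomialFeatures(degree=12, include_bias=False), StandardScaler())
X_poly = features.fit_transform(x)

ols = LinearRegression().fit(X_poly, y)
ridge = Ridge(alpha=1.0).fit(X_poly, y)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(f"||w|| without penalty: {ols_norm:.2f}")
print(f"||w|| with L2 penalty: {ridge_norm:.2f}")
```

The penalized fit should report a much smaller weight norm, which is the "keeping the weights small" side of the balance described above.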

<a id='2'></a>
## 2. L2 Regularization (Ridge)

Ridge regression adds the **sum of squared weights** to the cost function:

$$J_{\text{Ridge}} = \frac{1}{n}\sum_{i=1}^{n}(y^{(i)} - \hat{y}^{(i)})^2 + \alpha \sum_{j=1}^{p} w_j^2$$

**Key properties:**
- Shrinks all coefficients **toward zero**, but never exactly to zero
- The geometric constraint region is a **circle** (or hypersphere in higher dimensions)
- Good when you believe all features are somewhat relevant
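- Has a closed-form solution. For the cost above (no intercept, or centered data): $\mathbf{w} = (\mathbf{X}^T\mathbf{X} + n\alpha\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. sklearn's `Ridge` uses the un-averaged sum of squared errors as the data term, in which case the factor $n$ drops out.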
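
The closed form can be checked numerically. The sketch below assumes sklearn's convention, in which `Ridge` minimizes the un-averaged sum of squared errors plus $\alpha \sum_j w_j^2$, so the solution is $(\mathbf{X}^T\mathbf{X} + \alpha\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$. The synthetic data and `alpha = 1.0` are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha = 1.0

# Closed form: solve (X^T X + alpha * I) w = X^T y instead of forming the inverse.
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

# fit_intercept=False so that sklearn solves the same intercept-free problem.
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_

print("closed form:", w_closed)
print("sklearn:    ", w_sklearn)
```

Using `np.linalg.solve` rather than an explicit matrix inverse is the usual, numerically safer way to evaluate expressions of this form.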

<a id='3'></a>
## 3. L1 Regularization (Lasso)

Lasso regression adds the **sum of absolute values** of the weights:

$$J_{\text{Lasso}} = \frac{1}{n}\sum_{i=1}^{n}(y^{(i)} - \hat{y}^{(i)})^2 + \alpha \sum_{j=1}^{p} |w_j|$$

**Key properties:**
- Can shrink coefficients **exactly to zero** -- performs automatic **feature selection**
- The geometric constraint region is a **diamond** (or cross-polytope)
- Useful when you suspect many features are irrelevant
- No closed-form solution -- requires iterative optimization
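
One common iterative approach is coordinate descent with a soft-thresholding update. The sketch below is a simplified illustration, not sklearn's actual implementation. It assumes sklearn's scaling of the objective, $\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_1$, fits no intercept, and uses hypothetical helper names (`soft_threshold`, `lasso_coordinate_descent`).

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    """Shrink z toward zero by t; any |z| <= t becomes exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_sweeps=500):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with feature j's current contribution added back in.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

w_cd = lasso_coordinate_descent(X, y, alpha=0.2)
w_sk = Lasso(alpha=0.2, fit_intercept=False, tol=1e-10, max_iter=10000).fit(X, y).coef_
print("coordinate descent:", np.round(w_cd, 4))
print("sklearn Lasso:     ", np.round(w_sk, 4))
```

The `max(..., 0.0)` inside `soft_threshold` is what produces exact zeros: any coordinate whose correlation with the residual stays below `alpha` in magnitude is set to 0 rather than merely shrunk.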

<a id='4'></a>
## 4. Geometric Intuition: Why L1 Gives Sparse Solutions

The key insight is about where the **MSE contour ellipses** intersect the **constraint region**:

- **L2 (Ridge):** The constraint is a **circle**. The MSE contours are most likely to touch the circle at a point where **no coordinate is exactly zero**.
- **L1 (Lasso):** The constraint is a **diamond** with **corners on the axes**. The MSE contours are most likely to touch a corner, where **one or more coordinates are exactly zero**.

This is why L1 produces **sparse** solutions (many zero coefficients) and L2 does not.
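
The argument above can be checked numerically. In this sketch the elliptical cost, its center at (2.0, 0.5), and the unit-sized constraint regions are all illustrative assumptions. It searches the boundary of each region on a fine grid for the point with the lowest cost.

```python
import numpy as np

def cost(w1, w2, c1=2.0, c2=0.5):
    """Elliptical stand-in for the MSE surface, minimized at (c1, c2)."""
    a, b = w1 - c1, w2 - c2
    return 2 * a**2 + 3 * b**2 + 1.5 * a * b

theta = np.linspace(0, 2 * np.pi, 4000, endpoint=False)
ux, uy = np.cos(theta), np.sin(theta)

# Boundary of the L2 region: w1^2 + w2^2 = 1.
circle = np.column_stack([ux, uy])
# Boundary of the L1 region: |w1| + |w2| = 1 (each direction rescaled onto the diamond).
diamond = circle / (np.abs(ux) + np.abs(uy))[:, None]

l2_solution = circle[np.argmin(cost(circle[:, 0], circle[:, 1]))]
l1_solution = diamond[np.argmin(cost(diamond[:, 0], diamond[:, 1]))]

print("L2-constrained minimum:", np.round(l2_solution, 3))
print("L1-constrained minimum:", np.round(l1_solution, 3))
```

For this cost the L1 minimum sits on a corner of the diamond, while the L2 minimum has two non-zero coordinates. For other centers the L1 minimum can fall on an edge instead, so sparsity is likely rather than guaranteed.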

In [None]:
# --- Geometric visualization: MSE contours + L1 diamond + L2 circle ---
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Create MSE-like elliptical contours (centered off-origin to simulate unconstrained optimum)
w1_range = np.linspace(-3, 3, 300)
w2_range = np.linspace(-3, 3, 300)
W1, W2 = np.meshgrid(w1_range, w2_range)

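# Elliptical cost centered at (2.0, 0.5), the unconstrained OLS solution.
# The center is chosen so that the L1-constrained minimum really falls on the corner (1, 0);
# with a center such as (1.5, 1.0) the minimum over the diamond lies on an edge instead.
ols_w1, ols_w2 = 2.0, 0.5
Z = 2 * (W1 - ols_w1) ** 2 + 3 * (W2 - ols_w2) ** 2 + 1.5 * (W1 - ols_w1) * (W2 - ols_w2)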

# --- L2 (Ridge): circle constraint ---
ax = axes[0]
ax.contour(W1, W2, Z, levels=15, cmap='Blues', alpha=0.8)
circle = plt.Circle((0, 0), 1.0, fill=False, color='green', linewidth=3, linestyle='-', label='L2 constraint (circle)')
ax.add_patch(circle)
ax.plot(ols_w1, ols_w2, 'r*', markersize=15, label=f'OLS solution ({ols_w1}, {ols_w2})')

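# Ridge solution: minimize the cost along the unit circle (numerically, on a fine grid)
circ_t = np.linspace(0, 2 * np.pi, 2000)
circ_w1, circ_w2 = np.cos(circ_t), np.sin(circ_t)
circ_Z = 2 * (circ_w1 - ols_w1) ** 2 + 3 * (circ_w2 - ols_w2) ** 2 + 1.5 * (circ_w1 - ols_w1) * (circ_w2 - ols_w2)
ridge_idx = np.argmin(circ_Z)
ridge_w1, ridge_w2 = circ_w1[ridge_idx], circ_w2[ridge_idx]
ax.plot(ridge_w1, ridge_w2, 'go', markersize=12, label=f'Ridge solution ({ridge_w1:.2f}, {ridge_w2:.2f})')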

ax.set_xlabel('$w_1$', fontsize=14)
ax.set_ylabel('$w_2$', fontsize=14)
ax.set_title('L2 (Ridge): Circle Constraint', fontsize=14)
ax.set_xlim(-2.5, 2.5)
ax.set_ylim(-2.5, 2.5)
ax.set_aspect('equal')
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.legend(fontsize=9, loc='lower left')
ax.grid(True, alpha=0.2)

# --- L1 (Lasso): diamond constraint ---
ax = axes[1]
ax.contour(W1, W2, Z, levels=15, cmap='Blues', alpha=0.8)

# Draw diamond
diamond_size = 1.0
diamond_x = [diamond_size, 0, -diamond_size, 0, diamond_size]
diamond_y = [0, diamond_size, 0, -diamond_size, 0]
ax.plot(diamond_x, diamond_y, 'darkorange', linewidth=3, label='L1 constraint (diamond)')

ax.plot(ols_w1, ols_w2, 'r*', markersize=15, label=f'OLS solution ({ols_w1}, {ols_w2})')
# Lasso solution touches corner -- w2 = 0
ax.plot(1.0, 0.0, 'o', color='darkorange', markersize=12, label=f'Lasso solution (1.0, 0.0)')

ax.set_xlabel('$w_1$', fontsize=14)
ax.set_ylabel('$w_2$', fontsize=14)
ax.set_title('L1 (Lasso): Diamond Constraint', fontsize=14)
ax.set_xlim(-2.5, 2.5)
ax.set_ylim(-2.5, 2.5)
ax.set_aspect('equal')
ax.axhline(0, color='gray', linewidth=0.5)
ax.axvline(0, color='gray', linewidth=0.5)
ax.legend(fontsize=9, loc='lower left')
ax.grid(True, alpha=0.2)

plt.tight_layout()
plt.show()

print("Left (Ridge):  The contours touch the circle away from the axes -- both w1, w2 are non-zero.")
print("Right (Lasso): The contours touch the diamond at a corner -- w2 is exactly zero (sparsity!).")

<a id='5'></a>
## 5. ElasticNet

ElasticNet combines L1 and L2 regularization:

$$J_{\text{ElasticNet}} = \frac{1}{n}\sum_{i=1}^{n}(y^{(i)} - \hat{y}^{(i)})^2 + \alpha \left[ \rho \sum |w_j| + \frac{1-\rho}{2} \sum w_j^2 \right]$$

Where $\rho$ (`l1_ratio` in sklearn) controls the mix:

| `l1_ratio` | Behavior |
|------------|----------|
| 1.0 | Pure Lasso (L1 only) |
| 0.0 | Pure Ridge (L2 only) |
| 0.5 | Equal mix |

**When to use ElasticNet:**
- When you have correlated features (Lasso alone may arbitrarily pick one)
- When you want some feature selection (from L1) with stability (from L2)
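
As a quick check of the `l1_ratio` table, the sketch below uses synthetic data and an arbitrary `alpha = 0.1` (both assumptions for illustration). It confirms that `l1_ratio=1.0` reproduces `Lasso` and prints how many coefficients are exactly zero as the mix changes.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)
alpha = 0.1

lasso = Lasso(alpha=alpha).fit(X, y)
enet_l1 = ElasticNet(alpha=alpha, l1_ratio=1.0).fit(X, y)
print("ElasticNet(l1_ratio=1.0) matches Lasso:", np.allclose(lasso.coef_, enet_l1.coef_))

# Count exact zeros for several mixes of the L1 and L2 penalties.
for ratio in (0.1, 0.5, 0.9, 1.0):
    model = ElasticNet(alpha=alpha, l1_ratio=ratio).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"l1_ratio={ratio}: {n_zero} of {X.shape[1]} coefficients are exactly zero")
```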

<a id='6'></a>
## 6. Code: Comparing Ridge, Lasso, and ElasticNet

We will create a dataset with:
- 5 **relevant** features (with true non-zero coefficients)
- 15 **irrelevant** features (noise, true coefficient = 0)

Then we will see which methods correctly identify the irrelevant features.

In [None]:
# --- Create dataset with relevant and irrelevant features ---
np.random.seed(42)
n_samples = 200
n_relevant = 5
n_irrelevant = 15
n_features = n_relevant + n_irrelevant

# Generate features
X_all = np.random.randn(n_samples, n_features)

# True coefficients: first 5 are non-zero, rest are zero
true_coefs = np.zeros(n_features)
true_coefs[:n_relevant] = [3.0, -2.0, 1.5, -1.0, 0.5]

# Generate target
y_all = X_all @ true_coefs + np.random.randn(n_samples) * 0.5

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

# Standardize
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

feature_names = [f'Relevant_{i+1}' for i in range(n_relevant)] + \
                [f'Noise_{i+1}' for i in range(n_irrelevant)]

print(f"Dataset: {n_samples} samples, {n_features} features")
print(f"  - {n_relevant} relevant features with true coefficients: {true_coefs[:n_relevant]}")
print(f"  - {n_irrelevant} irrelevant noise features (true coefficient = 0)")

In [None]:
# --- Fit all models ---
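# The same alpha is used for all penalized models for simplicity. sklearn scales the Ridge
# objective differently from Lasso/ElasticNet, so equal alphas do not mean equal penalty strength.
alpha_val = 0.5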

models = {
    'OLS (no regularization)': LinearRegression(),
    'Ridge (L2)': Ridge(alpha=alpha_val),
    'Lasso (L1)': Lasso(alpha=alpha_val, max_iter=10000),
    'ElasticNet (L1+L2)': ElasticNet(alpha=alpha_val, l1_ratio=0.5, max_iter=10000),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train_s))
    test_mse = mean_squared_error(y_test, model.predict(X_test_s))
    n_zeros = np.sum(np.abs(model.coef_) < 1e-6)
    results[name] = {
        'coefs': model.coef_,
        'train_mse': train_mse,
        'test_mse': test_mse,
        'n_zeros': n_zeros,
    }
    print(f"{name:30s}  Train MSE: {train_mse:.4f}  Test MSE: {test_mse:.4f}  Zero coefs: {n_zeros}/{n_features}")

In [None]:
# --- Compare coefficients across models ---
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
axes = axes.ravel()

colors = ['steelblue' if i < n_relevant else 'lightcoral' for i in range(n_features)]

for ax, (name, res) in zip(axes, results.items()):
    bars = ax.bar(range(n_features), res['coefs'], color=colors, edgecolor='gray', linewidth=0.5)
    ax.axhline(0, color='black', linewidth=0.8)
    ax.set_title(f"{name}\nZero coefficients: {res['n_zeros']}/{n_features}", fontsize=12)
    ax.set_xlabel('Feature Index')
    ax.set_ylabel('Coefficient Value')
    ax.set_xticks(range(n_features))
    ax.set_xticklabels([str(i) for i in range(n_features)], fontsize=8)
    ax.grid(True, alpha=0.3, axis='y')

# Add legend to first subplot
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor='steelblue', label='Relevant features (0-4)'),
                   Patch(facecolor='lightcoral', label='Noise features (5-19)')]
axes[0].legend(handles=legend_elements, fontsize=9)

plt.tight_layout()
plt.show()

print("OLS:        Fits noise features with non-zero coefficients (overfitting risk).")
print("Ridge (L2): Shrinks all coefficients, but none are exactly zero.")
print("Lasso (L1): Drives noise feature coefficients to EXACTLY zero (feature selection!).")
print("ElasticNet: Some sparsity from L1, some shrinkage from L2.")

<a id='7'></a>
## 7. Coefficient Paths vs Alpha

How do the coefficients change as we increase the regularization strength $\alpha$?
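
Besides refitting a model for each value, sklearn provides `lasso_path`, which computes the Lasso solution along a whole grid of alphas. The sketch below uses synthetic data and a hand-built grid (both illustrative assumptions). It also computes `alpha_max`, the smallest alpha at which every coefficient is zero for the intercept-free problem, under sklearn's $\frac{1}{2n}$ scaling of the data term.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Smallest alpha for which w = 0 is optimal (no intercept is fitted by lasso_path).
alpha_max = np.max(np.abs(X.T @ y)) / len(y)

# Descending grid from just above alpha_max down to a thousandth of it.
alphas_in = np.geomspace(1.01 * alpha_max, 1e-3 * alpha_max, 50)
alphas_out, coefs, _ = lasso_path(X, y, alphas=alphas_in)

# coefs has shape (n_features, n_alphas); count active features at each alpha.
n_nonzero = (coefs != 0).sum(axis=0)
print(f"alpha_max = {alpha_max:.4f}")
print("non-zero coefficients at the largest alpha: ", n_nonzero[0])
print("non-zero coefficients at the smallest alpha:", n_nonzero[-1])
```

Reading `n_nonzero` from left to right shows features entering the model one by one as the penalty weakens.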

In [None]:
# --- Coefficient paths: Ridge vs Lasso ---
alphas = np.logspace(-3, 2, 100)

ridge_coefs = []
lasso_coefs = []

for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train_s, y_train)
    ridge_coefs.append(ridge.coef_)

    lasso = Lasso(alpha=a, max_iter=10000)
    lasso.fit(X_train_s, y_train)
    lasso_coefs.append(lasso.coef_)

ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Ridge coefficient paths
ax = axes[0]
for j in range(n_features):
    color = 'steelblue' if j < n_relevant else 'lightcoral'
    lw = 2.0 if j < n_relevant else 0.8
    alpha_line = 1.0 if j < n_relevant else 0.5
    ax.plot(alphas, ridge_coefs[:, j], color=color, linewidth=lw, alpha=alpha_line)
ax.set_xscale('log')
ax.set_xlabel('$\\alpha$ (regularization strength)', fontsize=13)
ax.set_ylabel('Coefficient value', fontsize=13)
ax.set_title('Ridge (L2): Coefficient Paths', fontsize=14)
ax.axhline(0, color='black', linewidth=0.8)
ax.grid(True, alpha=0.3)

# Lasso coefficient paths
ax = axes[1]
for j in range(n_features):
    color = 'steelblue' if j < n_relevant else 'lightcoral'
    lw = 2.0 if j < n_relevant else 0.8
    alpha_line = 1.0 if j < n_relevant else 0.5
    ax.plot(alphas, lasso_coefs[:, j], color=color, linewidth=lw, alpha=alpha_line)
ax.set_xscale('log')
ax.set_xlabel('$\\alpha$ (regularization strength)', fontsize=13)
ax.set_ylabel('Coefficient value', fontsize=13)
ax.set_title('Lasso (L1): Coefficient Paths', fontsize=14)
ax.axhline(0, color='black', linewidth=0.8)
ax.grid(True, alpha=0.3)

# Add shared legend
legend_elements = [Patch(facecolor='steelblue', label='Relevant features'),
                   Patch(facecolor='lightcoral', label='Noise features')]
axes[0].legend(handles=legend_elements, fontsize=10)
axes[1].legend(handles=legend_elements, fontsize=10)

plt.tight_layout()
plt.show()

print("Ridge: All coefficients shrink toward zero but NEVER reach exactly zero.")
print("Lasso: Noise features hit zero quickly; relevant features survive longer.")
print("       This is automatic feature selection!")

In [None]:
# --- Test MSE vs alpha for Ridge and Lasso ---
ridge_test_mse = []
lasso_test_mse = []

for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train_s, y_train)
    ridge_test_mse.append(mean_squared_error(y_test, ridge.predict(X_test_s)))

    lasso = Lasso(alpha=a, max_iter=10000)
    lasso.fit(X_train_s, y_train)
    lasso_test_mse.append(mean_squared_error(y_test, lasso.predict(X_test_s)))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(alphas, ridge_test_mse, 'o-', color='green', markersize=2, linewidth=2, label='Ridge (L2)')
ax.plot(alphas, lasso_test_mse, 's-', color='darkorange', markersize=2, linewidth=2, label='Lasso (L1)')
ax.set_xscale('log')
ax.set_xlabel('$\\alpha$ (regularization strength)', fontsize=13)
ax.set_ylabel('Test MSE', fontsize=13)
ax.set_title('Test Error vs Regularization Strength', fontsize=14)
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

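# Caution: picking alpha by test MSE, as done here for illustration, leaks test information
# and gives an optimistic error estimate. In practice, choose alpha by cross-validation on
# the training set (e.g. RidgeCV, LassoCV).
best_ridge_alpha = alphas[np.argmin(ridge_test_mse)]
best_lasso_alpha = alphas[np.argmin(lasso_test_mse)]
print(f"Best Ridge alpha: {best_ridge_alpha:.4f} (Test MSE: {min(ridge_test_mse):.4f})")
print(f"Best Lasso alpha: {best_lasso_alpha:.4f} (Test MSE: {min(lasso_test_mse):.4f})")
print("\nToo little regularization -> overfitting. Too much -> underfitting.")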

<a id='8'></a>
## 8. Common Mistakes

1. **Not standardizing features before regularization**
   - Regularization penalizes large coefficients
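   - If features have different scales, the penalty is **unfair**: a feature measured in small units needs a large coefficient to have the same effect, so it is penalized more heavily than a feature measured in large units
   - **Always** use `StandardScaler` (fit on the training data only) before Ridge, Lasso, or ElasticNet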

2. **Using too large an alpha**
   - Drives all coefficients to near-zero -- the model predicts (roughly) the mean of y
   - This is **underfitting** caused by over-regularization
   - Use cross-validation (`RidgeCV`, `LassoCV`) to find the optimal alpha

3. **Forgetting that the intercept is NOT regularized**
   - By default, sklearn does not penalize the intercept (bias term)
   - This is correct -- the intercept should be free to shift the predictions

4. **Using Lasso with highly correlated features**
   - Lasso tends to keep one feature from a correlated group and zero out the others, and which one it keeps can be somewhat arbitrary
   - Use ElasticNet instead -- the L2 component helps with correlated groups

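5. **Confusing sklearn's alpha with the textbook lambda**
   - They are the same concept under different names, but the numeric values are not always interchangeable because the data term may be scaled differently
   - sklearn's `Ridge` minimizes $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_2^2$, while `Lasso` minimizes $\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_1$
   - Larger alpha = stronger penalty = simpler model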
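
Mistakes 1 and 2 can be avoided together by putting the scaler and a cross-validated model in one pipeline. This is a minimal sketch on synthetic data; the feature scales, coefficients, and split sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three informative features on very different scales, followed by three noise features.
X = rng.normal(size=(300, 6)) * np.array([1.0, 100.0, 0.01, 1.0, 100.0, 0.01])
y = 2 * X[:, 0] + 0.03 * X[:, 1] + 150 * X[:, 2] + rng.normal(scale=0.5, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The scaler is fit on the training split only; LassoCV then picks alpha by 5-fold
# cross-validation on that split, so the test set is never used for tuning.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
model.fit(X_train, y_train)

best_alpha = model.named_steps["lassocv"].alpha_
test_r2 = model.score(X_test, y_test)
print(f"alpha chosen by CV: {best_alpha:.4f}")
print(f"test R^2:           {test_r2:.3f}")
```

Calling `model.predict` on new data applies the stored training-set scaling automatically, which removes the risk of forgetting to transform the test features.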

<a id='9'></a>
## 9. Exercise

**Task:** Use cross-validated regularized regression on the California Housing dataset.

1. Load data: `from sklearn.datasets import fetch_california_housing`
2. Standardize features using `StandardScaler`
3. Use `sklearn.linear_model.LassoCV` with `cv=5` to find the best alpha automatically
4. Use `sklearn.linear_model.RidgeCV` with a range of alphas
5. Print the best alpha and test MSE for both
6. Which features does Lasso zero out? Are those features truly unimportant?
7. Plot the coefficient comparison between OLS, Ridge (best alpha), and Lasso (best alpha)

In [None]:
# YOUR CODE HERE
# from sklearn.datasets import fetch_california_housing
# from sklearn.linear_model import LassoCV, RidgeCV
#
# data = fetch_california_housing()
# X_ex, y_ex = data.data, data.target
# feature_names_ex = data.feature_names
#
# ... standardize, fit LassoCV and RidgeCV, compare coefficients ...