Day 3: Ridge & Lasso Regression (Regularization)

Objective of the day
Learn how regularization (Ridge & Lasso) helps prevent overfitting by penalizing large coefficients.

Linear regression → fits a line/plane using all 8 features.

Ridge → same line, but shrinks all coefficients a bit to avoid overfitting.

Lasso → same line, but can shrink some coefficients to exactly 0 (ignoring less important features).

You see multiple coefficients because you’re no longer using just MedInc, but all features in the dataset.
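For reference, the three objectives can be written side by side. This is a sketch based on the forms documented for scikit-learn's `LinearRegression`, `Ridge`, and `Lasso`; the symbols ($w$ for the coefficient vector, $n$ for the number of training rows, $\alpha$ for the `alpha` argument) are notation chosen here, not taken from the notebook. The intercept is not penalized.

```latex
\begin{aligned}
\text{Linear:}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 \\
\text{Ridge (L2):}\quad & \min_{w}\ \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2 \\
\text{Lasso (L1):}\quad & \min_{w}\ \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1
\end{aligned}
```

One consequence worth keeping in mind: Ridge's loss is a sum over all training rows while Lasso's is averaged, so the same `alpha` value does not mean the same penalty strength in both models. With roughly 16,500 training rows, `alpha=10` is a fairly mild Ridge penalty, whereas `alpha=0.01` is already noticeable for Lasso.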

In [7]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import pandas as pd

housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Features + target
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: plain Linear Regression
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print("Linear R²:", r2_score(y_test, lin_reg.predict(X_test)))

# Ridge Regression (L2 Penalty)

ridge = Ridge(alpha=10)  # alpha = strength of penalty
ridge.fit(X_train, y_train)
print("Ridge R²:", r2_score(y_test, ridge.predict(X_test)))

# Lasso Regression (L1 Penalty)

lasso = Lasso(alpha=0.01, max_iter=10000)  # L1 often needs more iterations
lasso.fit(X_train, y_train)
print("Lasso R²:", r2_score(y_test, lasso.predict(X_test)))

print("Linear coefficients:", lin_reg.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)


Linear R²: 0.5757877060324514
Ridge R²: 0.5764371559180015
Lasso R²: 0.5845196673976367
Linear coefficients: [ 4.48674910e-01  9.72425752e-03 -1.23323343e-01  7.83144907e-01
 -2.02962058e-06 -3.52631849e-03 -4.19792487e-01 -4.33708065e-01]
Ridge coefficients: [ 4.47068597e-01  9.74130199e-03 -1.20293353e-01  7.66201258e-01
 -1.99123989e-06 -3.52184780e-03 -4.19720067e-01 -4.33421866e-01]
Lasso coefficients: [ 4.08895632e-01  1.03084903e-02 -4.74445353e-02  3.63345952e-01
 -3.08601321e-07 -3.35945603e-03 -4.07109936e-01 -4.14933167e-01]


📊 Exercise of the Day

Report R² for Linear, Ridge, and Lasso. Which generalizes best?

Compare coefficient sizes: do Ridge and Lasso reduce their magnitude?

Try different alpha values (0.1, 1, 10). How does performance change?

1) Lasso scores highest (R² ≈ 0.585, versus ≈ 0.576 for Linear and Ridge), but the three R² values are almost identical for this dataset and these alpha values.

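2) Mostly yes, but only slightly. Ridge (alpha=10) barely moves the coefficients. Lasso (alpha=0.01) shrinks them more, most visibly the third and fourth (AveRooms: -0.123 → -0.047, AveBedrms: 0.783 → 0.363). The second coefficient (HouseAge) actually grows a little in both models.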

3)
- Ridge with alpha=0.1 applies almost no penalty, so it is very similar to plain linear regression. R²: 0.575794
- Ridge with alpha=1 changes slightly, but not much. R²: 0.5758549
- Ridge with alpha=10 gives the highest of the three. R²: 0.576437

Test R² rises slightly at every step (0.1 → 1 → 10); it does not fall anywhere in this range.
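The alpha comparison above can be automated with a small loop. The sketch below is one possible way to do it: `ridge_alpha_sweep` is a hypothetical helper name, and the synthetic data is only a stand-in so the block runs without downloading anything. Reporting train R² next to test R² makes it easier to judge generalization, since a shrinking gap between the two is what regularization is supposed to buy.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def ridge_alpha_sweep(X_train, X_test, y_train, y_test, alphas):
    """Fit one Ridge model per alpha; return (alpha, train R², test R²) rows."""
    rows = []
    for alpha in alphas:
        model = Ridge(alpha=alpha).fit(X_train, y_train)
        train_r2 = model.score(X_train, y_train)  # .score() returns R² for regressors
        test_r2 = model.score(X_test, y_test)
        rows.append((alpha, train_r2, test_r2))
    return rows


# Synthetic stand-in data (an assumption, used only so the sketch runs offline).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
true_coef = np.array([2.0, -1.0, 0.5, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=300)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = ridge_alpha_sweep(Xtr, Xte, ytr, yte, alphas=[0.1, 1, 10, 1000])
for alpha, train_r2, test_r2 in results:
    print(f"alpha={alpha:>6}: train R2={train_r2:.4f}  test R2={test_r2:.4f}")
```

In the notebook itself, the same helper could be called as `ridge_alpha_sweep(X_train, X_test, y_train, y_test, [0.1, 1, 10, 1000])` using the split created earlier.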

🌟 Mini-Challenge

Train a Ridge model with very high alpha (1000).

What happens to the coefficients and R²?
👉 Explain in plain words why this happens.

In [9]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print("Linear R²:", r2_score(y_test, lin_reg.predict(X_test)))
print("Linear coefficients:", lin_reg.coef_)


ridge = Ridge(alpha=1000)  # alpha = strength of penalty
ridge.fit(X_train, y_train)
print("Ridge R²:", r2_score(y_test, ridge.predict(X_test)))

print("Ridge coefficients:", ridge.coef_)

Linear R²: 0.5757877060324514
Linear coefficients: [ 4.48674910e-01  9.72425752e-03 -1.23323343e-01  7.83144907e-01
 -2.02962058e-06 -3.52631849e-03 -4.19792487e-01 -4.33708065e-01]
Ridge R²: 0.5832769033874909
Ridge coefficients: [ 4.00508705e-01  1.10993594e-02 -3.02437173e-02  2.41663687e-01
  2.15610668e-06 -3.48144901e-03 -3.67788890e-01 -3.71818461e-01]


Most coefficients shrink toward zero; a couple of very small ones grow slightly or change sign. Test R² increases slightly (0.5758 → 0.5833).

✅ Mini-Challenge (α=1000)

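Most coefficients shrink noticeably toward zero (the fourth drops from 0.783 to 0.242), but not uniformly → the penalty limits their overall size, so a few tiny coefficients can still grow slightly or flip sign.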

R² changes only slightly, rising from about 0.576 to about 0.583.

Explanation: with high α, Ridge heavily penalizes large coefficients, which prevents overfitting but also limits the model’s expressiveness. Here the extra shrinkage helps a little on the test set; with a much larger α the coefficients would be pushed so close to zero that the model underfits.
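To see the shrinkage mechanism directly, the Ridge solution can be computed from its closed form, w = (XᵀX + αI)⁻¹Xᵀy. The sketch below assumes centered data and no intercept, uses made-up synthetic data, and `ridge_closed_form` is a hypothetical helper written for illustration (scikit-learn's own solvers are more careful numerically). As α grows, the αI term dominates XᵀX and the coefficient vector is pulled toward zero.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept; data assumed centered)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic, centered data (an assumption for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X -= X.mean(axis=0)
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.3, size=200)
y -= y.mean()

alphas = [0.0, 10.0, 1000.0, 100000.0]
norms = [float(np.linalg.norm(ridge_closed_form(X, y, a))) for a in alphas]
for a, norm in zip(alphas, norms):
    print(f"alpha={a:>9}: ||w|| = {norm:.4f}")
```

The overall length of the coefficient vector decreases with every increase in α, even though an individual coefficient is free to move in either direction, which matches what the α=1000 run shows.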

✨ Key Insight

Regularization doesn’t always improve R² → its real value is in preventing overfitting when you have lots of features or noisy data.

Ridge is like “weight decay” → it shrinks all coefficients toward zero but rarely makes any of them exactly zero.

Lasso is like a “feature selector” → it can zero out unhelpful features.
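The “feature selector” behaviour is easier to see on data where some features are known to be irrelevant. The following is a minimal sketch on synthetic data (an assumption; none of the Lasso coefficients in the housing run above reached exactly 0). It also standardizes the features with `StandardScaler` first, a common practice assumed here because both penalties depend on the scale of each feature. The alpha values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Six features, but only the first two actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Scale features, then fit each regularized model.
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X, y)

lasso_coef = lasso.named_steps["lasso"].coef_
ridge_coef = ridge.named_steps["ridge"].coef_

# Count coefficients that are exactly zero in each model.
n_zero_lasso = int(np.sum(lasso_coef == 0))
n_zero_ridge = int(np.sum(ridge_coef == 0))

print("Lasso coefficients:", np.round(lasso_coef, 3))
print("Ridge coefficients:", np.round(ridge_coef, 3))
print("Exact zeros -> Lasso:", n_zero_lasso, "| Ridge:", n_zero_ridge)
```

Lasso drops irrelevant features by setting their coefficients to exactly 0, while Ridge keeps every feature with a small nonzero weight.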