Week 2 – Day 7: Gradient Boosting + Mini Project
🎯 Objectives

Understand boosting vs bagging (Random Forest).

Train a Gradient Boosting model (using sklearn).

Build a mini-project: California Housing Price Predictor 🏡

1. Big idea of Gradient Boosting

Gradient Boosting = build trees sequentially.

Each new tree focuses on fixing the errors of the previous one.

Works great for tabular data, often outperforming Random Forest.

2. Step by step (intuition)

Let’s imagine you’re predicting house prices:

Start with a simple guess (e.g., the average price = baseline model).

Check errors (residuals = actual − predicted).

Train a small decision tree to predict those residuals.

Add that tree’s predictions to improve the model.

Repeat many times: each new tree focuses on what’s left unexplained.

So the final model is:

Final Prediction = Baseline + Tree1 + Tree2 + Tree3 + ... + TreeN

In practice, each tree's contribution is scaled by the learning rate before it is added.
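The steps above can be sketched from scratch in a few lines. This is a minimal illustration for squared-error loss, not scikit-learn's actual implementation. The synthetic 1-D data, the variable names, and the `predict` helper are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

learning_rate = 0.1  # shrinkage applied to every tree's contribution
n_trees = 50

# Step 1: the baseline is the mean of the targets
baseline = y.mean()
prediction = np.full_like(y, baseline)
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    residuals = y - prediction                      # Step 2: actual - predicted
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                          # Step 3: small tree fitted to the residuals
    prediction += learning_rate * tree.predict(X)   # Step 4: add the shrunken correction
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))


def predict(X_new):
    """Baseline plus the learning-rate-scaled sum of all tree predictions."""
    out = np.full(X_new.shape[0], baseline)
    for fitted_tree in trees:
        out += learning_rate * fitted_tree.predict(X_new)
    return out


print("training MSE, baseline only:", mse_history[0])
print("training MSE after", n_trees, "trees:", mse_history[-1])
```

Each pass through the loop is one "Tree k" term in the formula above. The training error shrinks as trees are added because every tree is fitted to whatever the current model still gets wrong.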

3. What “gradient” means

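Rather than correcting errors in an ad hoc way, Gradient Boosting performs gradient descent on a loss function, similar in spirit to how neural networks are trained.

At each step it computes the gradient of the loss with respect to the current predictions, which gives the direction of steepest improvement.

Each new tree is trained to approximate the negative gradient, so adding it reduces the loss step by step. For squared-error loss the negative gradient is exactly the residual (actual − predicted), which is why the steps above fit trees to residuals.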

That’s why it’s called Gradient Boosting.
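To connect the gradient to the residuals from section 2, here is a small numerical check, assuming the half squared-error loss 0.5 * (actual − predicted)². The sample values and the function name `squared_error_loss` are made up for the illustration.

```python
import numpy as np


def squared_error_loss(y_true, y_pred):
    # Half squared error; the factor 0.5 makes the gradient exactly (y_pred - y_true)
    return 0.5 * (y_true - y_pred) ** 2


y_true = np.array([3.0, 1.5, 4.2])   # hypothetical actual values
y_pred = np.array([2.5, 2.0, 4.0])   # hypothetical current predictions

# Numerical gradient of the loss with respect to each prediction (central differences)
eps = 1e-6
numeric_grad = (
    squared_error_loss(y_true, y_pred + eps) - squared_error_loss(y_true, y_pred - eps)
) / (2 * eps)

residuals = y_true - y_pred

print("negative gradient:", -numeric_grad)
print("residuals:        ", residuals)
```

For this loss the two printed arrays agree up to floating-point error, so "fit a tree to the residuals" and "fit a tree to the negative gradient" are the same step. With other losses, such as absolute error, the negative gradient is no longer the plain residual, but the procedure stays the same.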


In [6]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# Load California Housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame

X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting
gbr = GradientBoostingRegressor(
    n_estimators=200,    # number of trees (iterations)
    learning_rate=0.1,  # how much each tree contributes
    max_depth=3,        # depth of each tree (controls complexity)
    random_state=42
)
gbr.fit(X_train, y_train)

y_pred = gbr.predict(X_test)
print("R²:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))

print("Feature importances:", gbr.feature_importances_)
print(df.head())


R²: 0.8004451261281281
MSE: 0.26149849837343114
Feature importances: [0.58743113 0.03590314 0.02525868 0.00628766 0.0037467  0.12240635
 0.1030154  0.11595095]
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85

📊 Exercise of the Day

Report the R² and MSE of Gradient Boosting. Compare with your Linear Regression (~0.46 R² earlier).

Check .feature_importances_ → which features matter most for predicting house values?

Try changing n_estimators (50, 200, 500). What happens?

1) 
Gradient Boosting:
R²: 0.8004451261281281
MSE: 0.26149849837343114

Linear Regression:
MSE: 0.7091157771765548
R²: 0.45885918903846656

Gradient Boosting performs much better at predicting house prices: R² rises from about 0.46 to 0.80, and MSE falls from about 0.71 to 0.26.

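2) The most important feature by far is MedInc (median income, importance ≈ 0.59), followed by AveOccup (≈ 0.12), Longitude (≈ 0.12) and Latitude (≈ 0.10).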
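The raw importance array is easier to read once it is paired with the column names. The sketch below hard-codes the importances printed by the Gradient Boosting cell and the feature order of the California Housing frame. The variable names are illustrative.

```python
import pandas as pd

# Column order of X in the California Housing frame (target MedHouseVal dropped)
feature_names = ["MedInc", "HouseAge", "AveRooms", "AveBedrms",
                 "Population", "AveOccup", "Latitude", "Longitude"]

# Values copied from the Gradient Boosting output above
importances = [0.58743113, 0.03590314, 0.02525868, 0.00628766,
               0.0037467, 0.12240635, 0.1030154, 0.11595095]

# Label each importance with its feature and sort from most to least important
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

In the notebook itself, the same ranking should come from `pd.Series(gbr.feature_importances_, index=X.columns).sort_values(ascending=False)`, which avoids copying the numbers by hand.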

3) 
With 50 n_estimators:
R²: 0.7434686073011577
MSE: 0.3361610401932049

With 500 n_estimators:
R²: 0.8197817612242545
MSE: 0.2361595981344991

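200 estimators is the value we used above (scikit-learn's default is 100).

Performance improves as estimators are added, but with diminishing returns: going from 50 to 200 trees raises R² by about 0.057, while going from 200 to 500 adds only about 0.019. Training time also grows with the number of trees.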
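Instead of refitting the model for every value of n_estimators, one option is to fit once with the largest value and use `staged_predict`, which yields the predictions after each added tree. The sketch below uses synthetic data from `make_friedman1` as a stand-in so that it runs without downloading anything. That dataset choice is an assumption; in the notebook you would reuse X_train, X_test, y_train and y_test.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; swap in the California Housing split)
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gbr = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.1, max_depth=3, random_state=42
)
gbr.fit(X_train, y_train)

# staged_predict yields test predictions after 1, 2, ..., n_estimators trees
test_mse = np.array(
    [mean_squared_error(y_test, y_stage) for y_stage in gbr.staged_predict(X_test)]
)

for n in (50, 200, 500):
    print(f"{n:>3} trees -> test MSE {test_mse[n - 1]:.3f}")
print("lowest test MSE at", int(test_mse.argmin()) + 1, "trees")
```

Plotting `test_mse` against the number of trees usually shows a steep drop followed by a long flat tail, the same diminishing-returns pattern as in the answer above. If the curve starts rising again, the model has begun to overfit.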


🌟 Mini-Challenge

Compare Random Forest vs Gradient Boosting on this dataset.

Which one performs better? Why might boosting have an advantage?

In [8]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor

# Load California Housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame

X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("R²:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))

print("Feature importances:", rf.feature_importances_)
print(df.head())


R²: 0.8051230593157366
MSE: 0.2553684927247781
Feature importances: [0.52487148 0.05459322 0.04427185 0.02960631 0.03064978 0.13844281
 0.08893574 0.08862881]
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85

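I used RandomForestRegressor because RandomForestClassifier does not apply to this dataset: the target (MedHouseVal) is continuous, so this is a regression problem. The R² and MSE are almost identical to those of Gradient Boosting with 200 trees (R² 0.805 vs 0.800, MSE 0.255 vs 0.261), with Random Forest marginally ahead here. Gradient Boosting with 500 trees scored higher (R² 0.820). Boosting may still have an advantage because each new tree is trained to correct the errors of the trees before it, which reduces bias. A Random Forest instead averages independently grown trees, which mainly reduces variance.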
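Because the two test scores are so close, a single train/test split may not be enough to say which model is better. One hedged way to compare them is k-fold cross-validation, sketched below. The synthetic `make_friedman1` data is a stand-in so the example runs offline (an assumption; in the notebook you would pass X and y from the California Housing frame), and the `models` and `cv_scores` names are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; replace with the notebook's X and y)
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
    ),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated R²: each fold is held out once as the test set
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    cv_scores[name] = scores
    print(f"{name}: mean R² = {scores.mean():.3f} (std {scores.std():.3f})")
```

If the difference between the mean scores is smaller than their spread across folds, the two models are effectively tied on that data, and the choice can rest on training time or ease of tuning.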