# **Advanced Tree Models – Boosting Trees**

Maintainer: Zhaohu (Jonathan) Fan. Contact: psujohnny@gmail.com


Note: This lab note is a work in progress. Please let us know if you encounter bugs or issues.

1. [Boosting](#1-boosting)  
   1.1 [Boosting for Regression Trees](#11-boosting-for-regression-trees)  
   1.2 [Boosting for Classification Trees](#12-boosting-for-classification-trees)  


#### *Colab Notebook [Open in Colab](https://colab.research.google.com/drive/1Ud1aLBXB0ZHnQlvDuUmueYCjG4D4UHBU?usp=sharing)*

#### *Useful information about [Advanced Tree Models – Boosting Trees in R](https://yanyudm.github.io/Data-Mining-R/lecture/7.C_Boosting.html)*



## Lab Overview

In this lab, we cover boosting, a state-of-the-art ensemble technique within the tree-modeling framework. We use the same datasets as in the previous lab:

- **Boston Housing** dataset  
- **Credit Card Default** dataset (subsampled to **n = 12,000** observations)


In [3]:
# ============================================================
# Boosting Lab (Google Colab) — Boston Housing + Credit Default
#   - Data loading and splits (Boston 90/10, Credit 60/40)
#   - 1.1 Boosting for regression trees (Boston) using sklearn
#   - Variable importance + partial dependence (lstat, rm)
#   - Test MSE for n_estimators=10000
#   - Test MSE curve vs number of trees (100..10000 by 100)
# ============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

RANDOM_STATE = 123




### 1.1 Boosting for regression trees

In [None]:
# ============================================================
# 0) Load Boston Housing data (CMU StatLib format), as in the previous lab
# ============================================================
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

data = np.hstack([
    raw_df.values[::2, :],      # even rows: first 11 feature columns
    raw_df.values[1::2, :2]     # odd rows: remaining 2 feature columns
])
target = raw_df.values[1::2, 2] # odd rows: 3rd column is MEDV

feature_names = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"
]

boston = pd.DataFrame(data, columns=feature_names)
boston["MEDV"] = target

boston.head()
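
The StatLib file stores each observation on two physical lines, which is why the loader stacks even and odd rows. The sketch below reproduces that slicing on a made-up two-record array; the numbers and the names `raw`, `features`, and `target_toy` are illustrative only, not Boston data.

```python
import numpy as np

# Toy stand-in for raw_df.values: 2 records, each split across two rows.
# Rows 0 and 2 hold the first 11 fields; rows 1 and 3 hold the last 3
# fields, padded with NaN (as pandas does for the shorter lines).
raw = np.array([
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    [12, 13, 99] + [np.nan] * 8,
    [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31],
    [32, 33, 77] + [np.nan] * 8,
], dtype=float)

# Even rows contribute 11 features, odd rows contribute 2 more.
features = np.hstack([raw[::2, :], raw[1::2, :2]])
# The third value on each odd row plays the role of the response.
target_toy = raw[1::2, 2]

print(features.shape)  # (2, 13)
print(target_toy)      # [99. 77.]
```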


In [None]:
# ============================================================
# 1) Train/test split (90/10) for Boston, mirroring the split used in the R lab
# ============================================================
X_boston = boston.drop(columns=["MEDV"])
y_boston = boston["MEDV"]

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_boston, y_boston, test_size=0.10, random_state=RANDOM_STATE
)

print("Boston train:", X_train_b.shape, " Boston test:", X_test_b.shape)


In [None]:
# ============================================================
# 1.1 Boosting for Regression Trees (Boston)
# R(gbm) settings:
#   n.trees = 10000
#   shrinkage = 0.01
#   interaction.depth = 8
#
# sklearn analogue:
#   n_estimators = 10000
#   learning_rate = 0.01
#   max_depth = 8   (depth of individual regression trees)
# ============================================================
boston_boost = GradientBoostingRegressor(
    n_estimators=10000,
    learning_rate=0.01,
    max_depth=8,
    random_state=RANDOM_STATE
)
boston_boost.fit(X_train_b, y_train_b)

# Variable importance (gbm "relative influence" analogue)
imp = pd.Series(boston_boost.feature_importances_, index=X_train_b.columns).sort_values(ascending=False)
imp
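
To make the roles of `n_estimators` and `learning_rate` concrete, here is a minimal sketch of boosting with squared-error loss: each stage fits a small tree to the current residuals and adds a shrunken copy of it to the prediction. It uses depth-1 stumps on synthetic one-dimensional data. The helper `fit_stump` and the data are assumptions made for illustration, and this is not how `GradientBoostingRegressor` is implemented internally.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares regression stump (one split) on 1-D x for residuals r."""
    best = None
    for s in np.unique(x)[:-1]:              # candidate thresholds; right side stays non-empty
        left, right = r[x <= s], r[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lo, hi = best
    return lambda z: np.where(z <= s, lo, hi)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)   # synthetic data

learning_rate, n_trees = 0.1, 200
pred = np.full_like(y, y.mean())             # stage 0: constant prediction
train_mse = [np.mean((y - pred) ** 2)]
for _ in range(n_trees):
    stump = fit_stump(x, y - pred)           # fit the current residuals
    pred = pred + learning_rate * stump(x)   # shrunken update
    train_mse.append(np.mean((y - pred) ** 2))

print(f"train MSE: start={train_mse[0]:.4f}, end={train_mse[-1]:.4f}")
```

A smaller `learning_rate` makes each step more cautious, so more trees are typically needed to reach the same training error.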


In [None]:
# Plot variable importance
plt.figure(figsize=(8, 5))
plt.barh(imp.index[::-1], imp.values[::-1])
plt.xlabel("Relative Influence (Feature Importance)")
plt.ylabel("Predictor")
plt.title("Boosted Regression Trees: Variable Importance (Boston)")
plt.tight_layout()
plt.show()


In [None]:
# ============================================================
# Partial dependence plots (analog to plot(gbm_model, i="lstat") and i="rm")
# ============================================================
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
PartialDependenceDisplay.from_estimator(boston_boost, X_train_b, ["LSTAT"], ax=ax[0])
PartialDependenceDisplay.from_estimator(boston_boost, X_train_b, ["RM"], ax=ax[1])
ax[0].set_title("Partial Dependence: LSTAT")
ax[1].set_title("Partial Dependence: RM")
plt.tight_layout()
plt.show()
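
A partial dependence curve can be read as an average of predictions: force one feature to a grid value for every row, predict, and average. The sketch below implements that brute-force definition for a hypothetical model `toy_predict` on synthetic data. scikit-learn may use a faster tree-based method for boosted models, so treat this as an illustration of the idea, not of the library's internals.

```python
import numpy as np

def partial_dependence_1d(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # overwrite the feature for every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

# Hypothetical stand-in for a fitted model: f(x) = 2*x0 + x1
def toy_predict(X):
    return 2.0 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # synthetic predictors
grid = np.linspace(-1.0, 1.0, 5)

pd_x0 = partial_dependence_1d(toy_predict, X, feature=0, grid=grid)
print(pd_x0)
```

For this additive toy model the curve is a straight line with slope 2, shifted by the mean of the second feature.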


In [None]:
# ============================================================
# Prediction on testing sample + Test MSE (5 decimal places)
# ============================================================
boston_boost_pred_test = boston_boost.predict(X_test_b)
test_mse_10000 = mean_squared_error(y_test_b, boston_boost_pred_test)

print(f"Boosting Test MSE (n_estimators=10000): {test_mse_10000:.5f}")


In [None]:
# ============================================================
# Investigate how test error changes with different number of trees
#
# Test MSE is recorded at every 100th boosting stage using
# staged_predict, so the fitted model is reused rather than refit.
# ============================================================
ntree = list(range(100, 10001, 100))
ntree_set = set(ntree)

err = []             # Test MSE at each ntree value
trees_recorded = []  # Corresponding number of trees

for t, yhat in enumerate(boston_boost.staged_predict(X_test_b), start=1):
    if t in ntree_set:
        err.append(mean_squared_error(y_test_b, yhat))
        trees_recorded.append(t)

# Guard against an empty grid before plotting or taking the minimum
assert err, "No test MSE values were recorded; check ntree against n_estimators."

# Plot Test MSE vs number of trees
plt.figure(figsize=(8, 4))
plt.plot(trees_recorded, err, linewidth=2)
plt.xlabel("n.trees (n_estimators)")
plt.ylabel("Test MSE")
plt.title("Boosting (Boston): Test MSE vs Number of Trees")

# Horizontal dashed line at the minimum Test MSE on this grid
plt.axhline(y=min(err), linestyle="--")

plt.tight_layout()
plt.show()

best_idx = int(np.argmin(err))
print(f"Minimum Test MSE on this grid: {err[best_idx]:.5f} at n.trees = {trees_recorded[best_idx]}")

### 1.2 Boosting for classification trees

In [None]:
# ============================================================
# AdaBoost (Classification) — Credit Card Default (Google Colab)
# R(adabag::boosting) analogue in Python:
#   - sklearn.ensemble.AdaBoostClassifier with decision tree stumps
#   - Train/Test split: reuse your existing credit_train / credit_test if available
#   - ROC curve + AUC on training and testing sets
#   - Save the fitted model to disk (pickle), similar to save(.Rdata)
# ============================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_curve, roc_auc_score
import pickle

RANDOM_STATE = 123


In [None]:
# ============================================================
# 0) Load + split credit card data (ONLY run this cell if you do not already
#    have credit_train and credit_test from earlier cells)
# ============================================================
from sklearn.model_selection import train_test_split

credit_url = "https://yanyudm.github.io/Data-Mining-R/lecture/data/credit_default.csv"
credit_data = pd.read_csv(credit_url)

for col in ["SEX", "EDUCATION", "MARRIAGE"]:
    credit_data[col] = credit_data[col].astype("category")

credit_train, credit_test = train_test_split(
    credit_data, test_size=0.40, random_state=RANDOM_STATE, stratify=credit_data["default.payment.next.month"]
)

print("Credit train:", credit_train.shape, " Credit test:", credit_test.shape)


In [None]:
# ============================================================
# 1) Prepare features and target (one-hot encode categorical variables)
# ============================================================
target_col = "default.payment.next.month"

X_train = credit_train.drop(columns=[target_col])
y_train = credit_train[target_col].astype(int)

X_test = credit_test.drop(columns=[target_col])
y_test = credit_test[target_col].astype(int)

# One-hot encoding for categorical predictors (sklearn needs numeric input)
X_train_enc = pd.get_dummies(X_train, drop_first=False)
X_test_enc  = pd.get_dummies(X_test, drop_first=False)

# Align columns so train/test have identical feature sets
X_train_enc, X_test_enc = X_train_enc.align(X_test_enc, join="left", axis=1, fill_value=0)

X_train_enc.shape, X_test_enc.shape
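
The alignment step matters when a category level appears in only one of the two splits. A small sketch with made-up rows (the frames `train` and `test` are illustrative, not the credit data) shows what `align(..., join="left", fill_value=0)` does in that case:

```python
import pandas as pd

# Level "4" appears only in test; levels "2" and "3" appear only in train.
train = pd.DataFrame({"EDUCATION": pd.Categorical(["1", "2", "3"]), "AGE": [30, 41, 25]})
test = pd.DataFrame({"EDUCATION": pd.Categorical(["1", "4"]), "AGE": [52, 38]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Keep the training columns; add missing ones to test as 0 and drop extras.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

With `join="left"`, the training columns define the feature set, so a level seen only in the test split is dropped instead of becoming a column the model never saw.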


In [None]:
# ============================================================
# 2) Fit AdaBoost classifier (analog to adabag::boosting)
# Note: algorithm="SAMME.R" is not supported in recent scikit-learn
# releases, so "SAMME" is set explicitly. Newer releases may also
# warn that the `algorithm` argument itself is deprecated.
# ============================================================
base_tree = DecisionTreeClassifier(max_depth=1, random_state=RANDOM_STATE)

credit_boost = AdaBoostClassifier(
    estimator=base_tree,
    n_estimators=200,
    learning_rate=0.5,
    algorithm="SAMME",          # only option supported in recent scikit-learn
    random_state=RANDOM_STATE
)

credit_boost.fit(X_train_enc, y_train)

print("AdaBoost fitted.")
print("n_estimators:", credit_boost.n_estimators)
print("learning_rate:", credit_boost.learning_rate)
print("algorithm:", credit_boost.algorithm)
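
To show what the fitted ensemble is doing, here is a minimal sketch of classical two-class discrete AdaBoost with decision stumps, which is what SAMME reduces to in the binary case. It assumes labels coded as -1/+1 and uses synthetic data. The helpers `best_stump` and `stump_predict` are hypothetical, there is no learning-rate shrinkage, and this is not scikit-learn's implementation.

```python
import numpy as np

def stump_predict(X, j, s, sign):
    """Predict -1/+1 from a single split on feature j at threshold s."""
    return sign * np.where(X[:, j] <= s, -1, 1)

def best_stump(X, y, w):
    """Stump with the smallest weighted error: (error, feature, threshold, sign)."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            for sign in (1, -1):
                err = w[stump_predict(X, j, s, sign) != y].sum()
                if err < best[0]:
                    best = (err, j, s, sign)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # synthetic labels in {-1, +1}

w = np.full(len(y), 1 / len(y))              # start with uniform weights
score = np.zeros(len(y))                     # weighted vote of all stumps
for _ in range(40):
    err, j, s, sign = best_stump(X, y, w)
    err = np.clip(err, 1e-10, 1 - 1e-10)     # avoid log(0)
    alpha = 0.5 * np.log((1 - err) / err)    # stump weight: larger when err is small
    pred = stump_predict(X, j, s, sign)
    w = w * np.exp(-alpha * y * pred)        # up-weight misclassified points
    w = w / w.sum()
    score += alpha * pred

accuracy = np.mean(np.sign(score) == y)
print(f"training accuracy after 40 stumps: {accuracy:.3f}")
```

Each round concentrates weight on the observations the current ensemble gets wrong, so later stumps focus on the harder cases.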


In [None]:
# ============================================================
# 3) Save the trained model (Python equivalent of save(..., .Rdata))
# ============================================================
with open("credit_boost.pkl", "wb") as f:
    pickle.dump(credit_boost, f)

print("Saved model to: credit_boost.pkl")


In [None]:
# ============================================================
# 4) Training ROC + AUC (R code uses training predictions)
# ============================================================
train_prob = credit_boost.predict_proba(X_train_enc)[:, 1]
train_auc = roc_auc_score(y_train, train_prob)

fpr_tr, tpr_tr, _ = roc_curve(y_train, train_prob)

plt.figure(figsize=(6, 5))
plt.plot(fpr_tr, tpr_tr, label=f"Training ROC (AUC = {train_auc:.4f})")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("AdaBoost (Credit Default): Training ROC Curve")
plt.legend()
plt.tight_layout()
plt.show()

train_auc


In [None]:
# ============================================================
# 5) Testing ROC + AUC
# ============================================================
test_prob = credit_boost.predict_proba(X_test_enc)[:, 1]
test_auc = roc_auc_score(y_test, test_prob)

fpr_te, tpr_te, _ = roc_curve(y_test, test_prob)

plt.figure(figsize=(6, 5))
plt.plot(fpr_te, tpr_te, label=f"Testing ROC (AUC = {test_auc:.4f})")
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("AdaBoost (Credit Default): Testing ROC Curve")
plt.legend()
plt.tight_layout()
plt.show()

test_auc
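
AUC has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it by counting pairs on a four-point made-up example. The function `auc_by_pairs` is a hypothetical helper for illustration; `roc_auc_score` is the tool to use in practice.

```python
def auc_by_pairs(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_pairs(y_true, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```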

