# ML503: Boosting -- AdaBoost & Gradient Boosting

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain the boosting paradigm: sequential learners correcting predecessor errors
2. Understand AdaBoost's sample reweighting mechanism
3. Understand Gradient Boosting's residual-fitting approach
4. Train and tune `AdaBoostClassifier` and `GradientBoostingClassifier`
5. Compare boosting methods against Random Forest on the same dataset
6. Detect overfitting via training/validation error curves

## Prerequisites

- Decision tree fundamentals (Notebook 01)
- Ensemble and bagging concepts (Notebook 02)
- Understanding of bias-variance trade-off

## Table of Contents

1. [Boosting Theory](#1-boosting-theory)
2. [AdaBoost](#2-adaboost)
3. [Gradient Boosting](#3-gradient-boosting)
4. [Scikit-Learn Implementation](#4-scikit-learn-implementation)
5. [Comparison: AdaBoost vs GradientBoosting vs RandomForest](#5-comparison-adaboost-vs-gradientboosting-vs-randomforest)
6. [Overfitting Detection](#6-overfitting-detection)
7. [Common Mistakes](#7-common-mistakes)
8. [Exercises](#8-exercises)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score

plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
sns.set_style('whitegrid')
np.random.seed(42)

## 1. Boosting Theory

**Boosting** is a family of ensemble methods where models are trained **sequentially**, and each new model focuses on correcting the errors of the previous ones.

### Bagging vs Boosting

| Aspect | Bagging (e.g., RF) | Boosting (e.g., GBM) |
|--------|-------------------|---------------------|
| Training | Parallel (independent) | Sequential (dependent) |
| Reduces | Variance | Bias (and variance) |
| Base learners | Full-depth trees | Shallow trees (stumps/weak learners) |
| Overfitting risk | Low | Higher (must tune carefully) |
| Weighting | Equal vote | Weighted contributions |

### General Boosting Update

The ensemble prediction after $m$ stages:

$$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$$

where $\eta$ is the learning rate and $h_m(x)$ is the new weak learner added at stage $m$.

## 2. AdaBoost

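**Adaptive Boosting (AdaBoost)** works by reweighting samples. The formulas below assume binary labels $y_i \in \{-1, +1\}$ and weak-learner outputs $h_m(x_i) \in \{-1, +1\}$:

1. Start with equal sample weights: $w_i = 1/n$
2. Train a weak learner (e.g., decision stump) on the weighted data
3. Compute the weighted error rate: $\epsilon_m = \sum_i w_i \cdot \mathbb{1}[h_m(x_i) \neq y_i]$ (with weights summing to 1)
4. Compute the learner weight: $\alpha_m = \frac{1}{2} \ln\frac{1 - \epsilon_m}{\epsilon_m}$
5. Update sample weights -- misclassified samples ($y_i \cdot h_m(x_i) = -1$) gain weight, correctly classified ones lose weight:
   $$w_i \leftarrow w_i \cdot \exp(-\alpha_m \cdot y_i \cdot h_m(x_i))$$
6. Normalize the weights so they sum to 1, then repeat from step 2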

**Final prediction:** $F(x) = \text{sign}\left(\sum_{m=1}^{M} \alpha_m \cdot h_m(x)\right)$

**Key insight:** Misclassified samples get higher weights, so the next learner focuses more on the hard cases.
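
As a worked number: if $\epsilon_m = 0.2$, then $\alpha_m = \frac{1}{2}\ln 4 = \ln 2 \approx 0.693$, so misclassified samples have their weight doubled and correctly classified samples have it halved before normalization. The steps above can also be sketched from scratch. This is a minimal illustration, not scikit-learn's implementation; the helper names `adaboost_fit` and `adaboost_predict` and the synthetic dataset are assumptions made for this example, and labels are mapped to $\{-1, +1\}$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def adaboost_fit(X, y, n_rounds=20):
    """Discrete AdaBoost with decision stumps; y must contain only -1 and +1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                     # step 1: equal weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)        # step 2: weighted fit
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])              # step 3: weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)   # step 4: learner weight
        w = w * np.exp(-alpha * y * pred)       # step 5: reweight samples
        w = w / w.sum()                         # step 6: normalize
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)


def adaboost_predict(stumps, alphas, X):
    """Sign of the alpha-weighted vote; ties are resolved to +1."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)


# Synthetic demo data (hypothetical, not the notebook's dataset)
X_demo, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y_demo = 2 * y01 - 1                            # map {0, 1} -> {-1, +1}

stumps, alphas = adaboost_fit(X_demo, y_demo, n_rounds=20)
train_acc = np.mean(adaboost_predict(stumps, alphas, X_demo) == y_demo)
print(f"Training accuracy after 20 rounds: {train_acc:.3f}")
```

Passing `sample_weight` to the stump is one way to train on weighted data; resampling according to the weights is another.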

## 3. Gradient Boosting

**Gradient Boosting** generalizes boosting by fitting new learners to the **negative gradient of the loss function** (which, for squared error, equals the residuals).
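
To see why the negative gradient equals the residual for squared error, take the per-sample loss with the conventional factor $\frac{1}{2}$ (an assumption for convenience; without it the gradient is only scaled by 2) and differentiate with respect to the current prediction:

$$L\big(y_i, F(x_i)\big) = \tfrac{1}{2}\big(y_i - F(x_i)\big)^2 \quad\Longrightarrow\quad -\left.\frac{\partial L}{\partial F(x_i)}\right|_{F = F_{m-1}} = y_i - F_{m-1}(x_i)$$

For other losses (e.g., log-loss in classification) the negative gradient is no longer the plain residual, but the same fit-the-negative-gradient recipe applies.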

### Algorithm (simplified for regression with MSE):

1. Initialize $F_0(x) = \bar{y}$ (mean of targets)
2. For $m = 1, 2, \ldots, M$:
   - Compute residuals: $r_i = y_i - F_{m-1}(x_i)$
   - Fit a regression tree $h_m$ to the residuals
   - Update: $F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$
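
The loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the helper names `gb_fit` and `gb_predict`, the toy sine data, and the choice of depth-2 trees are all made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gb_fit(X, y, n_stages=100, eta=0.1, max_depth=2):
    """Gradient boosting for squared error: each tree fits the current residuals."""
    f0 = float(np.mean(y))                      # F_0: mean of targets
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residuals = y - pred                    # r_i = y_i - F_{m-1}(x_i)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                  # fit h_m to the residuals
        pred = pred + eta * tree.predict(X)     # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, trees


def gb_predict(f0, trees, X, eta=0.1):
    """Sum the shrunken tree outputs; eta must match the value used in gb_fit."""
    return f0 + eta * sum(t.predict(X) for t in trees)


# Toy regression data (hypothetical): noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

f0, trees = gb_fit(X_toy, y_toy, n_stages=100, eta=0.1)
mse_start = np.mean((y_toy - f0) ** 2)
mse_end = np.mean((y_toy - gb_predict(f0, trees, X_toy, eta=0.1)) ** 2)
print(f"Training MSE: {mse_start:.4f} (F_0 only) -> {mse_end:.4f} (100 stages)")
```

Training error can only go down with more stages here, which is why a held-out set is needed to judge when to stop.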

### Key Parameters

| Parameter | Description | Typical Values |
|-----------|-------------|----------------|
| `n_estimators` | Number of boosting stages | 100-1000 |
| `learning_rate` | Shrinkage per step | 0.01-0.3 |
| `max_depth` | Depth of each tree (shallow!) | 3-8 |
| `subsample` | Fraction of samples per tree | 0.5-1.0 |

**Important trade-off:** A lower `learning_rate` needs more `n_estimators` to reach the same training fit, but it often generalizes better.

## 4. Scikit-Learn Implementation

In [None]:
# Load data
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set:     {X_test.shape[0]} samples")

In [None]:
# AdaBoost
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps (`estimator` was named `base_estimator` before scikit-learn 1.2)
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X_train, y_train)

print("AdaBoost (200 stumps, lr=0.5):")
print(f"  Train accuracy: {accuracy_score(y_train, ada.predict(X_train)):.4f}")
print(f"  Test accuracy:  {accuracy_score(y_test, ada.predict(X_test)):.4f}")

In [None]:
# Gradient Boosting
gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,           # shallow trees for boosting
    subsample=0.8,
    random_state=42,
)
gb.fit(X_train, y_train)

print("Gradient Boosting (200 trees, lr=0.1, depth=3):")
print(f"  Train accuracy: {accuracy_score(y_train, gb.predict(X_train)):.4f}")
print(f"  Test accuracy:  {accuracy_score(y_test, gb.predict(X_test)):.4f}")

## 5. Comparison: AdaBoost vs GradientBoosting vs RandomForest

In [None]:
# Compare the three ensembles using 5-fold cross-validation on the full
# dataset plus a single train/test split
models = {
    'AdaBoost': AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=200, learning_rate=0.5, random_state=42
    ),
    'GradientBoosting': GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
    ),
    'RandomForest': RandomForestClassifier(
        n_estimators=200, random_state=42, n_jobs=-1
    ),
}

results = []
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    model.fit(X_train, y_train)
    results.append({
        'Model': name,
        'Train Acc': accuracy_score(y_train, model.predict(X_train)),
        'Test Acc': accuracy_score(y_test, model.predict(X_test)),
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std(),
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

In [None]:
# Visualize comparison
fig, ax = plt.subplots(figsize=(9, 5))
x_pos = np.arange(len(results_df))
width = 0.3

ax.bar(x_pos - width/2, results_df['Train Acc'], width, label='Train Acc', alpha=0.8)
ax.bar(x_pos + width/2, results_df['Test Acc'], width, label='Test Acc', alpha=0.8)
ax.set_xticks(x_pos)
ax.set_xticklabels(results_df['Model'])
ax.set_ylabel('Accuracy')
ax.set_title('Model Comparison on Breast Cancer Dataset')
ax.legend()
ax.set_ylim(0.9, 1.01)
plt.tight_layout()
plt.show()

## 6. Overfitting Detection

Boosting can overfit if too many estimators are used or the learning rate is too high. To detect this, we track the training and held-out error rates after every boosting stage with `staged_predict`.

In [None]:
# Gradient Boosting: staged prediction for train/test error curves
gb_monitor = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
)
gb_monitor.fit(X_train, y_train)

# Compute error at each stage using staged_predict
train_errors = []
test_errors = []

for y_pred_train in gb_monitor.staged_predict(X_train):
    train_errors.append(1 - accuracy_score(y_train, y_pred_train))

for y_pred_test in gb_monitor.staged_predict(X_test):
    test_errors.append(1 - accuracy_score(y_test, y_pred_test))

fig, ax = plt.subplots(figsize=(10, 5))
stages = range(1, len(train_errors) + 1)
ax.plot(stages, train_errors, label='Training Error', linewidth=2)
ax.plot(stages, test_errors, label='Test Error', linewidth=2)
ax.set_xlabel('Number of Boosting Stages')
ax.set_ylabel('Error Rate')
ax.set_title('Gradient Boosting: Training vs Test Error over Stages')
ax.legend()

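# Mark the stage with the lowest test error (first minimum).
# Picking it on the test set is for illustration only; use a validation split in practice.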
best_stage = np.argmin(test_errors) + 1
ax.axvline(x=best_stage, color='gray', linestyle='--', alpha=0.7)
ax.annotate(f'Best stage: {best_stage}', xy=(best_stage, test_errors[best_stage - 1]),
            xytext=(best_stage + 50, test_errors[best_stage - 1] + 0.01),
            arrowprops=dict(arrowstyle='->', color='gray'),
            fontsize=11)
plt.tight_layout()
plt.show()

print(f"Best test error at stage {best_stage}: {test_errors[best_stage - 1]:.4f}")
print("After this point, additional stages may overfit (test error increases).")

In [None]:
# AdaBoost staged prediction
ada_monitor = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=500,
    learning_rate=0.5,
    random_state=42,
)
ada_monitor.fit(X_train, y_train)

ada_train_errors = [
    1 - accuracy_score(y_train, y_pred)
    for y_pred in ada_monitor.staged_predict(X_train)
]
ada_test_errors = [
    1 - accuracy_score(y_test, y_pred)
    for y_pred in ada_monitor.staged_predict(X_test)
]

fig, ax = plt.subplots(figsize=(10, 5))
stages = range(1, len(ada_train_errors) + 1)
ax.plot(stages, ada_train_errors, label='Training Error', linewidth=2)
ax.plot(stages, ada_test_errors, label='Test Error', linewidth=2)
ax.set_xlabel('Number of Boosting Stages')
ax.set_ylabel('Error Rate')
ax.set_title('AdaBoost: Training vs Test Error over Stages')
ax.legend()
plt.tight_layout()
plt.show()
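
Choosing the best stage on the test set, as the Gradient Boosting plot above does, gives an optimistic estimate. A minimal sketch of the same idea with a separate validation split and log-loss follows. The variable names (`X_fit`, `X_val`, `X_hold`, `gb_val`) and the 300-stage budget are assumptions for this example, not part of the notebook's earlier cells.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True)

# Hold out a final test set, then carve a validation set out of the remainder
X_dev, X_hold, y_dev, y_hold = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42, stratify=y_all
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42, stratify=y_dev
)

gb_val = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.1, max_depth=3, random_state=42
)
gb_val.fit(X_fit, y_fit)

# Validation log-loss after each stage
val_losses = [
    log_loss(y_val, proba) for proba in gb_val.staged_predict_proba(X_val)
]
best_stage_val = int(np.argmin(val_losses)) + 1

# Report held-out accuracy at the stage chosen on the validation set
hold_preds = list(gb_val.staged_predict(X_hold))
hold_acc = np.mean(hold_preds[best_stage_val - 1] == y_hold)
print(f"Stage chosen on validation set: {best_stage_val}")
print(f"Held-out accuracy at that stage: {hold_acc:.4f}")
```

Log-loss is used here because it keeps changing after the error rate flattens, which tends to make the overfitting point easier to see.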

## 7. Common Mistakes

1. **High learning rate + many estimators**: A large learning rate (e.g., 1.0) combined with many estimators can overfit quickly. Prefer a small learning rate (0.01-0.1) and increase `n_estimators` accordingly.

2. **Not tuning `max_depth` for boosting**: Boosting typically uses **shallow trees** (often depth 3-5, rarely more than 8), not the fully grown trees common in Random Forest. Deep trees in boosting tend to overfit.

3. **Treating boosting like bagging**: In bagging, adding more trees mainly costs compute and rarely hurts generalization. In boosting, too many stages can overfit. Always monitor validation error.

4. **Ignoring the learning rate / n_estimators trade-off**: These two parameters must be tuned together. A smaller learning rate generally needs more estimators but often generalizes better.

5. **Not using early stopping**: For Gradient Boosting, use `n_iter_no_change` and `validation_fraction` to automatically stop when validation performance plateaus. This prevents overfitting and saves compute.
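
A minimal sketch of early stopping follows. The synthetic dataset, the 500-stage budget, and the variable names are assumptions for this example; `validation_fraction`, `n_iter_no_change`, and `tol` are parameters of `GradientBoostingClassifier`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data (hypothetical, not the notebook's dataset)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

gb_es = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on the number of stages
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,   # share of the training data held out internally
    n_iter_no_change=10,       # stop after 10 stages without improvement
    tol=1e-4,                  # minimum improvement that counts
    random_state=0,
)
gb_es.fit(X_syn, y_syn)

print(f"Stages requested: {gb_es.n_estimators}")
print(f"Stages actually fitted: {gb_es.n_estimators_}")
```

The validation split is taken from the data passed to `fit`, so the model trains on less data than it would without early stopping.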

## 8. Exercises

### Exercise 1: Learning Rate Experiment
Train `GradientBoostingClassifier` with learning rates of 0.01, 0.05, 0.1, 0.5, and 1.0, all with `n_estimators=300`. Plot the test error curves for each. Which learning rate gives the best final test accuracy? Which converges fastest?

### Exercise 2: AdaBoost Depth Experiment
Train `AdaBoostClassifier` with base estimators of depth 1 (stump), 2, 3, and 5. Does increasing the base learner complexity help AdaBoost? At what depth does overfitting become visible?

### Exercise 3: Early Stopping
Train a `GradientBoostingClassifier` with `n_estimators=1000` and `n_iter_no_change=10`. Compare the final number of estimators used (access `n_estimators_` after fit) with the full 1000. How much compute did early stopping save?

In [None]:
# Exercise 1 starter code
# learning_rates = [0.01, 0.05, 0.1, 0.5, 1.0]
# fig, ax = plt.subplots(figsize=(10, 6))
# for lr in learning_rates:
#     gb_exp = GradientBoostingClassifier(
#         n_estimators=300, learning_rate=lr, max_depth=3, random_state=42
#     )
#     gb_exp.fit(X_train, y_train)
#     exp_errors = [  # separate name so Section 6's test_errors is not overwritten
#         1 - accuracy_score(y_test, y_pred)
#         for y_pred in gb_exp.staged_predict(X_test)
#     ]
#     ax.plot(range(1, len(exp_errors) + 1), exp_errors, label=f'lr={lr}')
# ax.set_xlabel('Boosting Stages')
# ax.set_ylabel('Test Error')
# ax.legend()
# plt.show()