# ML504: XGBoost / HistGradientBoosting -- Practical Guide

> **Note:** The original spec mentioned "Chegg" which appears to be a typo. This notebook covers XGBoost/HistGradientBoosting-style gradient boosting.

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Recall how CART decision trees perform recursive splitting
2. Use `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor` from scikit-learn
3. Understand and tune key parameters: `max_iter`, `learning_rate`, `max_depth`, `l2_regularization`
4. Apply early stopping to prevent overfitting and save compute
5. Optionally use `xgboost.XGBClassifier` if the library is installed
6. Compare HistGBDT vs standard GradientBoosting in speed and performance

## Prerequisites

- Decision trees and CART (Notebook 01)
- Gradient Boosting fundamentals (Notebook 03)
- Familiarity with scikit-learn model APIs

## Table of Contents

1. [CART Recap](#1-cart-recap)
2. [HistGradientBoosting Overview](#2-histgradientboosting-overview)
3. [Key Parameters](#3-key-parameters)
4. [HistGradientBoosting Demo with Early Stopping](#4-histgradientboosting-demo-with-early-stopping)
5. [Optional: XGBoost](#5-optional-xgboost)
6. [Speed and Performance Comparison](#6-speed-and-performance-comparison)
7. [Practical Tips](#7-practical-tips)
8. [Common Mistakes](#8-common-mistakes)
9. [Exercises](#9-exercises)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
    HistGradientBoostingRegressor,
)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, mean_squared_error

plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
sns.set_style('whitegrid')
np.random.seed(42)

## 1. CART Recap

Before diving into advanced boosting, let us recall how a single CART decision tree works.

**CART (Classification and Regression Trees):**
- Builds a binary tree via **recursive binary splitting**
- At each node, selects the feature and threshold that minimizes a cost function
  - Classification: Gini impurity $G = 1 - \sum p_k^2$
  - Regression: MSE $= \frac{1}{n}\sum(y_i - \bar{y})^2$
- Continues splitting until a stopping criterion is met
- Predictions: majority class (classification) or mean value (regression) at each leaf
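
The split search described above can be sketched for a single feature. This is a minimal illustration, not scikit-learn's implementation: the helper names `gini` and `best_split` and the toy arrays are made up for this example, and the search is the plain exhaustive version with no optimizations.

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p_k^2 for an array of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Try every midpoint threshold on one feature; return (threshold, weighted Gini)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_thr, best_cost = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical values
        thr = (x_sorted[i] + x_sorted[i - 1]) / 2
        left, right = y_sorted[:i], y_sorted[i:]
        # Cost of a split: child impurities weighted by child size
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if cost < best_cost:
            best_thr, best_cost = thr, cost
    return best_thr, best_cost

# Hypothetical toy data: the classes separate cleanly between 3 and 10
x_toy = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_toy = np.array([0, 0, 0, 1, 1, 1])
thr, cost = best_split(x_toy, y_toy)
print(f"Best threshold: {thr}, weighted Gini: {cost:.3f}")
```

A full CART implementation would repeat this search over every feature at every node and then recurse into the two children.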

**In boosting**, each tree is typically **shallow** (depth 3-6) and fits the residuals (or pseudo-residuals) of the ensemble built so far.
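
To make the "fit the residuals" idea concrete, here is a hand-rolled boosting loop for squared-error regression. It is a sketch under simplifying assumptions (synthetic sine data, fixed depth, no validation set), and the variable names are illustrative. It is not how `HistGradientBoostingRegressor` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean minimizes squared error
pred = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - pred) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y_demo - pred                 # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)               # a shallow tree models the residuals
    pred += learning_rate * tree.predict(X_demo)  # shrunken update
    trees.append(tree)

final_mse = np.mean((y_demo - pred) ** 2)
print(f"Training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

For squared error the residuals coincide with the negative gradient of the loss. For other losses, the pseudo-residuals play the same role.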

## 2. HistGradientBoosting Overview

`HistGradientBoostingClassifier` and `HistGradientBoostingRegressor` are scikit-learn's fast gradient boosting implementations, inspired by LightGBM.

### Key advantages over standard `GradientBoostingClassifier`:

| Feature | Standard GB | HistGradientBoosting |
|---------|------------|---------------------|
| Speed | Slow on large data | Much faster (histogram-based) |
| Missing values | Requires imputation | Native support |
| Categorical features | Requires encoding | Native support (`categorical_features`) |
| Early stopping | Opt-in (set `n_iter_no_change`) | On by default for >10k samples (`early_stopping='auto'`) |
| Scalability | Slows down noticeably beyond ~10k samples | Scales to millions of samples |

### How histogram-based splitting works:
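Instead of evaluating every possible threshold for each feature, the algorithm first **bins** each continuous feature into at most 255 integer-valued bins (controlled by `max_bins`), with one extra bin reserved for missing values. Candidate splits are then evaluated only at bin boundaries, which sharply reduces the number of candidates and makes training much faster.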
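
The effect of binning on the number of candidate splits can be illustrated with NumPy alone. The quantile-based bin edges below are an assumed, simplified binning rule chosen for this sketch. scikit-learn's internal binning may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=10_000)

# Exact search: one candidate threshold between each pair of distinct sorted values
n_exact_candidates = len(np.unique(feature)) - 1

# Histogram search: 255 bins need 254 interior edges (simplified quantile rule)
n_bins = 255
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_index = np.searchsorted(edges, feature)   # integer bin id per sample
n_hist_candidates = len(edges)

print(f"Exact candidates:     {n_exact_candidates}")
print(f"Histogram candidates: {n_hist_candidates}")
print(f"Bin ids range from {bin_index.min()} to {bin_index.max()}")
```

After binning, the split search works on small integer bin ids, so its cost depends on the number of bins rather than on the number of distinct feature values.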

## 3. Key Parameters

| Parameter | Description | Default | Typical Range |
|-----------|-------------|---------|---------------|
| `max_iter` | Number of boosting iterations | 100 | 100-1000 |
| `learning_rate` | Shrinkage per step | 0.1 | 0.01-0.3 |
| `max_depth` | Max depth of each tree | None | 3-8 |
| `max_leaf_nodes` | Max leaves per tree | 31 | 15-63 |
| `min_samples_leaf` | Min samples in a leaf | 20 | 5-50 |
| `l2_regularization` | L2 penalty on leaf values | 0 | 0-10 |
| `max_bins` | Number of histogram bins | 255 | 63-255 |
| `early_stopping` | Whether to use early stopping (`'auto'` enables it only when there are more than 10,000 training samples) | `'auto'` | `True`, `False`, `'auto'` |
| `n_iter_no_change` | Iterations without improvement before stopping | 10 | 5-20 |
| `validation_fraction` | Fraction of data for early stopping | 0.1 | 0.1-0.2 |

## 4. HistGradientBoosting Demo with Early Stopping

In [None]:
# Load data
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training: {X_train.shape}, Test: {X_test.shape}")

In [None]:
# HistGradientBoosting with early stopping
hgb = HistGradientBoostingClassifier(
    max_iter=500,
    learning_rate=0.1,
    max_depth=5,
    l2_regularization=1.0,
    early_stopping=True,
    n_iter_no_change=15,
    validation_fraction=0.15,
    random_state=42,
)
hgb.fit(X_train, y_train)

print(f"Requested max_iter: {hgb.max_iter}")
print(f"Actual iterations used: {hgb.n_iter_}")
print(f"Train accuracy: {accuracy_score(y_train, hgb.predict(X_train)):.4f}")
print(f"Test accuracy:  {accuracy_score(y_test, hgb.predict(X_test)):.4f}")
print(f"\nEarly stopping saved {hgb.max_iter - hgb.n_iter_} iterations of unnecessary computation.")

In [None]:
# HistGradientBoosting for Regression demo
from sklearn.datasets import make_regression

X_reg, y_reg = make_regression(
    n_samples=1000, n_features=10, n_informative=5, noise=20, random_state=42
)
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

hgb_reg = HistGradientBoostingRegressor(
    max_iter=300,
    learning_rate=0.1,
    max_depth=5,
    early_stopping=True,
    n_iter_no_change=10,
    random_state=42,
)
hgb_reg.fit(X_reg_train, y_reg_train)

y_pred_reg = hgb_reg.predict(X_reg_test)
rmse = np.sqrt(mean_squared_error(y_reg_test, y_pred_reg))
print(f"HistGradientBoostingRegressor:")
print(f"  Iterations used: {hgb_reg.n_iter_}")
print(f"  Test RMSE: {rmse:.2f}")

## 5. Optional: XGBoost

XGBoost is a popular external library for gradient boosting. It provides additional features like:
- More regularization options (L1 + L2)
- GPU acceleration
- Built-in cross-validation

The cell below attempts to import XGBoost. If it is not installed, it will skip gracefully.

In [None]:
try:
    import xgboost as xgb
    print(f"XGBoost version: {xgb.__version__}")
    
    xgb_clf = xgb.XGBClassifier(
        n_estimators=300,
        learning_rate=0.1,
        max_depth=5,
        reg_lambda=1.0,       # L2 regularization
        reg_alpha=0.0,        # L1 regularization
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=15,
        eval_metric='logloss',
        random_state=42,
        n_jobs=-1,
    )
    
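    # Hold out part of the training data for early stopping so the test set
    # stays untouched until the final evaluation (avoids leaking test labels).
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.15, random_state=42, stratify=y_train
    )
    xgb_clf.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        verbose=False,
    )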
    
    xgb_train_acc = accuracy_score(y_train, xgb_clf.predict(X_train))
    xgb_test_acc = accuracy_score(y_test, xgb_clf.predict(X_test))
    
    print(f"\nXGBoost Results:")
    print(f"  Best iteration: {xgb_clf.best_iteration}")
    print(f"  Train accuracy: {xgb_train_acc:.4f}")
    print(f"  Test accuracy:  {xgb_test_acc:.4f}")
    
    HAS_XGBOOST = True

except ImportError:
    print("XGBoost is not installed. Skipping this section.")
    print("To install: pip install xgboost")
    print("\nThe HistGradientBoosting models above provide similar functionality")
    print("without requiring any additional packages.")
    HAS_XGBOOST = False

## 6. Speed and Performance Comparison

In [None]:
# Generate a larger dataset for meaningful speed comparison
X_large, y_large = make_classification(
    n_samples=10000, n_features=30, n_informative=15,
    n_redundant=5, random_state=42
)
X_lg_train, X_lg_test, y_lg_train, y_lg_test = train_test_split(
    X_large, y_large, test_size=0.3, random_state=42
)

print(f"Large dataset: {X_large.shape}")

In [None]:
# Standard GradientBoosting
gb_std = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=5, random_state=42
)
start = time.perf_counter()  # perf_counter is the recommended clock for timing code
gb_std.fit(X_lg_train, y_lg_train)
gb_time = time.perf_counter() - start
gb_acc = accuracy_score(y_lg_test, gb_std.predict(X_lg_test))

# HistGradientBoosting
hgb_fast = HistGradientBoostingClassifier(
    max_iter=200, learning_rate=0.1, max_depth=5,
    early_stopping=False, random_state=42
)
start = time.perf_counter()
hgb_fast.fit(X_lg_train, y_lg_train)
hgb_time = time.perf_counter() - start
hgb_acc = accuracy_score(y_lg_test, hgb_fast.predict(X_lg_test))

# Results
print(f"{'Model':<30} {'Time (s)':<12} {'Test Acc':<10}")
print(f"{'-'*52}")
print(f"{'GradientBoosting':<30} {gb_time:<12.3f} {gb_acc:<10.4f}")
print(f"{'HistGradientBoosting':<30} {hgb_time:<12.3f} {hgb_acc:<10.4f}")
print(f"\nHistGradientBoosting speedup: {gb_time / hgb_time:.1f}x faster")

In [None]:
# Visualize the comparison
models_compared = ['GradientBoosting', 'HistGradientBoosting']
times = [gb_time, hgb_time]
accs = [gb_acc, hgb_acc]

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].bar(models_compared, times, color=['steelblue', 'coral'])
axes[0].set_ylabel('Training Time (seconds)')
axes[0].set_title('Training Speed Comparison')

axes[1].bar(models_compared, accs, color=['steelblue', 'coral'])
axes[1].set_ylabel('Test Accuracy')
axes[1].set_title('Accuracy Comparison')
axes[1].set_ylim(min(accs) - 0.02, max(accs) + 0.02)

plt.suptitle('Standard vs Histogram-Based Gradient Boosting (10k samples)', fontsize=13)
plt.tight_layout()
plt.show()

## 7. Practical Tips

1. **Always use early stopping**: Set `early_stopping=True` (HistGBDT) or `early_stopping_rounds` (XGBoost). This prevents overfitting and saves compute.

2. **Tune `learning_rate` and `max_iter` together**: Start with `learning_rate=0.1` and `max_iter=500` with early stopping. If performance is good, try lowering the learning rate and increasing max_iter.

3. **Keep trees shallow**: `max_depth=3-6` is usually sufficient for boosting. Unlike Random Forest, where deep trees are the norm, deep trees in boosting tend to overfit.

4. **Use HistGradientBoosting for large datasets**: It is much faster than standard GradientBoosting and handles missing values natively.

5. **Start simple, add complexity**: Begin with default parameters, evaluate, then tune. Do not over-tune from the start.

6. **Regularization matters**: Use `l2_regularization` (HistGBDT) or `reg_lambda`/`reg_alpha` (XGBoost) to control leaf values and reduce overfitting.

## 8. Common Mistakes

1. **Not using early stopping**: Training for a fixed number of iterations without monitoring validation loss is a common source of overfitting in gradient boosting.

2. **Ignoring overfitting signs**: If training accuracy is much higher than validation accuracy, reduce `max_iter`, increase `l2_regularization`, or lower `learning_rate`.

3. **Over-tuning hyperparameters**: Gradient boosting has many knobs. Excessive grid search over all parameters leads to overfitting the validation set. Focus on `learning_rate`, `max_iter`, and `max_depth` first.

4. **Using standard GradientBoosting on large data**: For datasets with more than ~10k samples, `HistGradientBoostingClassifier` is dramatically faster with similar or better accuracy.

5. **Forgetting that boosting cannot extrapolate**: Like all tree-based methods, gradient boosting produces piecewise-constant predictions, so outside the range of the training data its output stays flat instead of following the trend.

## 9. Exercises

### Exercise 1: Early Stopping Experiment
Train `HistGradientBoostingClassifier` with `max_iter=1000` and different values of `n_iter_no_change` (5, 10, 20, 50). How does the patience parameter affect the number of iterations used and final test accuracy?

### Exercise 2: Regularization
Train models with `l2_regularization` values of 0, 0.1, 1.0, 10.0, and 100.0. Plot test accuracy vs regularization strength. At what point does regularization start to hurt performance?

### Exercise 3: HistGBDT vs XGBoost
If XGBoost is installed, compare `HistGradientBoostingClassifier` and `XGBClassifier` with similar parameters on the breast cancer dataset. Report training time, train accuracy, and test accuracy. Are the results comparable?

In [None]:
# Exercise 1 starter code
# patience_values = [5, 10, 20, 50]
# for patience in patience_values:
#     hgb_exp = HistGradientBoostingClassifier(
#         max_iter=1000, learning_rate=0.1, max_depth=5,
#         early_stopping=True, n_iter_no_change=patience, random_state=42
#     )
#     hgb_exp.fit(X_train, y_train)
#     print(f"patience={patience:2d}: iters={hgb_exp.n_iter_:3d}, "
#           f"test_acc={accuracy_score(y_test, hgb_exp.predict(X_test)):.4f}")