# ML502: Bagging, Random Forest & Extra Trees

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain ensemble methods and the bagging (bootstrap aggregating) strategy
2. Understand how Random Forest combines bagging with random feature subsets
3. Train and tune `RandomForestClassifier` with key hyperparameters
4. Interpret out-of-bag (OOB) scores and feature importances
5. Compare single decision trees, Random Forests, and Extra Trees
6. Know when Random Forest is a strong default choice

## Prerequisites

- Decision tree fundamentals (Notebook 01)
- Understanding of overfitting and variance in models
- NumPy, pandas, and matplotlib basics

## Table of Contents

1. [Ensemble Methods and Bagging Theory](#1-ensemble-methods-and-bagging-theory)
2. [Random Forest: Bagging + Random Features](#2-random-forest-bagging--random-features)
3. [Key Hyperparameters](#3-key-hyperparameters)
4. [OOB Score Explanation](#4-oob-score-explanation)
5. [Demo: Single Tree vs Random Forest](#5-demo-single-tree-vs-random-forest)
6. [Feature Importance from Random Forest](#6-feature-importance-from-random-forest)
7. [Extra Trees: Even More Randomness](#7-extra-trees-even-more-randomness)
8. [When to Use Random Forest](#8-when-to-use-random-forest)
9. [Common Mistakes](#9-common-mistakes)
10. [Exercises](#10-exercises)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    ExtraTreesClassifier,
)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report

plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
sns.set_style('whitegrid')
np.random.seed(42)

## 1. Ensemble Methods and Bagging Theory

**Ensemble methods** combine the predictions of multiple models to produce a predictor that is usually more accurate and more stable than any single one of them.

### Bagging (Bootstrap Aggregating)

Bagging reduces **variance** by:
1. Drawing $B$ bootstrap samples (random samples with replacement) from the training data
2. Training one model (usually a decision tree) on each bootstrap sample
3. Aggregating predictions: **majority vote** for classification, **mean** for regression
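
The three steps above can be sketched by hand before reaching for a library class. This is a minimal illustration, not how scikit-learn implements bagging internally; the variable names (`X_demo`, `n_trees`, `votes`, and so on) are made up for this sketch, and it assumes binary 0/1 labels so that a simple mean can act as the majority vote.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative names only; none of these variables are reused later.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

rng = np.random.default_rng(0)
n_trees = 25  # B in the text; odd, so a binary majority vote cannot tie
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for b in range(n_trees):
    # Step 1: a bootstrap sample is a set of row indices drawn with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Step 2: fit one tree on that sample
    tree = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
    votes[b] = tree.predict(X_te)

# Step 3: majority vote over the trees (labels are 0/1)
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_te).mean()
mean_single_acc = (votes == y_te).mean()

print(f"Mean single-tree accuracy: {mean_single_acc:.4f}")
print(f"Bagged accuracy:           {bagged_acc:.4f}")
```

In practice `BaggingClassifier` does the same job with less code; the manual loop is only meant to make each step visible.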

**Why it works:** Individual trees are high-variance models. Averaging many trees, each trained on a different bootstrap sample, decreases the variance while the bias stays roughly the same.

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^B T_b(x)\right) \approx \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$

where $\rho$ is the average correlation between trees and $\sigma^2$ is the variance of a single tree. The lower the correlation $\rho$, the more variance reduction we get.
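
To get a feel for the formula, the sketch below evaluates it for a few values of $\rho$ and $B$, then checks it against a small simulation. The "trees" in the simulation are synthetic Gaussian variables with a shared component, an assumption made purely so that their pairwise correlation is exactly $\rho$; the function name `ensemble_variance` is hypothetical.

```python
import numpy as np

def ensemble_variance(rho, sigma2, B):
    """Variance of the average of B equally correlated predictors."""
    return rho * sigma2 + (1 - rho) / B * sigma2

for rho in (0.9, 0.5, 0.1):
    values = [ensemble_variance(rho, 1.0, B) for B in (1, 10, 100, 1000)]
    print(f"rho={rho}: " + ", ".join(f"{v:.3f}" for v in values))

# Monte Carlo check with synthetic "trees": a shared component creates correlation rho.
rng = np.random.default_rng(0)
rho, B, n_draws = 0.5, 10, 20_000
shared = rng.normal(size=(n_draws, 1))
own = rng.normal(size=(n_draws, B))
trees = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own  # each column has variance 1
empirical = trees.mean(axis=1).var()
print(f"formula={ensemble_variance(rho, 1.0, B):.3f}, simulated={empirical:.3f}")
```

The first term $\rho\sigma^2$ does not shrink as $B$ grows, which is why adding trees alone cannot push the variance below that floor.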

## 2. Random Forest: Bagging + Random Features

Random Forest goes one step beyond bagging by also **randomizing the features** considered at each split:

1. For each tree, draw a bootstrap sample of the training data
2. At each split, randomly select `max_features` features (not all features)
3. Find the best split among those selected features
4. Aggregate predictions across all trees

This additional randomness **decorrelates the trees** (reduces $\rho$), which further reduces ensemble variance.

**Default `max_features`:**
- Classification: $\sqrt{p}$ (square root of total features)
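- Regression: $p / 3$ is the classical recommendation, but scikit-learn's `RandomForestRegressor` defaults to `max_features=1.0` (all $p$ features), so set it explicitly if you want $p / 3$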
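
As a quick check of what these settings mean in practice, the sketch below fits tiny forests on the 30-feature breast cancer data and reads back the integer each tree inferred from `max_features`. It relies on the `max_features_` attribute of fitted scikit-learn trees; the names `X_demo`, `forest`, and `resolved` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)  # p = 30 features

resolved = {}
for mf in ("sqrt", "log2", 0.5):
    forest = RandomForestClassifier(
        n_estimators=5, max_features=mf, random_state=0
    ).fit(X_demo, y_demo)
    # Each fitted tree stores the integer it inferred from max_features
    resolved[mf] = forest.estimators_[0].max_features_

print(resolved)
```

With $p = 30$, `'sqrt'` rounds $\sqrt{30} \approx 5.48$ down to 5 candidate features per split, `'log2'` gives 4, and `0.5` gives 15.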

## 3. Key Hyperparameters

| Parameter | Description | Typical Values |
|-----------|-------------|----------------|
| `n_estimators` | Number of trees in the forest | 100-1000 |
| `max_features` | Features to consider at each split | `'sqrt'`, `'log2'`, an int, or a float fraction |
| `max_depth` | Maximum depth of each tree | `None` or 10-30 |
| `min_samples_split` | Min samples to split a node | 2-10 |
| `min_samples_leaf` | Min samples in a leaf | 1-5 |
| `oob_score` | Whether to use OOB samples for validation | `True` / `False` |
| `n_jobs` | Parallel workers (-1 = all cores) | -1 |
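
One way to explore these hyperparameters together is a small cross-validated grid search. The grid below is deliberately tiny and the values are arbitrary choices for illustration, not recommendations; `X_demo`, `param_grid`, and `search` are names invented for this sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Small, arbitrary grid over two of the parameters in the table
param_grid = {
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

On a dataset this small the differences between settings are often within noise, which is consistent with Random Forest working reasonably well at its defaults.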

## 4. OOB Score Explanation

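Each bootstrap sample leaves out roughly **37%** of the data (the out-of-bag samples): a given row is missed by all $n$ draws with probability $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. The OOB score evaluates each sample using only the trees that did **not** include it in their training set.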

This provides a **free validation estimate** without needing a separate validation set, similar to leave-one-out cross-validation but much cheaper.
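
The 37% figure is easy to verify by simulation. The sketch below draws a single bootstrap sample of row indices and counts how many rows were never picked; `n_samples` and the other names are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: n_samples row indices drawn with replacement
idx = rng.integers(0, n_samples, size=n_samples)
in_bag = np.zeros(n_samples, dtype=bool)
in_bag[idx] = True
oob_fraction = 1.0 - in_bag.mean()

print(f"simulated OOB fraction: {oob_fraction:.3f}")
print(f"(1 - 1/n)^n:            {(1 - 1 / n_samples) ** n_samples:.3f}")
print(f"1/e:                    {np.exp(-1):.3f}")
```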

## 5. Demo: Single Tree vs Random Forest

In [None]:
# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

print(f"Shape: {X.shape}")
print(f"Classes: {data.target_names}")
print(f"Class distribution: {np.bincount(y)}")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

In [None]:
# Single Decision Tree
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
dt_train_acc = accuracy_score(y_train, dt.predict(X_train))
dt_test_acc = accuracy_score(y_test, dt.predict(X_test))

# Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
rf_train_acc = accuracy_score(y_train, rf.predict(X_train))
rf_test_acc = accuracy_score(y_test, rf.predict(X_test))

print("Single Decision Tree:")
print(f"  Train accuracy: {dt_train_acc:.4f}")
print(f"  Test accuracy:  {dt_test_acc:.4f}")
print()
print("Random Forest (100 trees):")
print(f"  Train accuracy: {rf_train_acc:.4f}")
print(f"  Test accuracy:  {rf_test_acc:.4f}")
print(f"  OOB score:      {rf.oob_score_:.4f}")

In [None]:
# Cross-validation comparison
dt_cv = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5, scoring='accuracy')
rf_cv = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    X, y, cv=5, scoring='accuracy'
)

print(f"Decision Tree CV: {dt_cv.mean():.4f} +/- {dt_cv.std():.4f}")
print(f"Random Forest CV: {rf_cv.mean():.4f} +/- {rf_cv.std():.4f}")

In [None]:
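# Visualize how OOB score and test accuracy change with the number of trees.
# Note: with very few trees (e.g. 1 or 5), some samples are never out-of-bag, so
# scikit-learn warns and the OOB estimate for those settings is unreliable.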
n_estimators_range = [1, 5, 10, 25, 50, 100, 200, 300, 500]
oob_scores = []
test_scores = []

for n in n_estimators_range:
    rf_temp = RandomForestClassifier(
        n_estimators=n, oob_score=True, random_state=42, n_jobs=-1
    )
    rf_temp.fit(X_train, y_train)
    oob_scores.append(rf_temp.oob_score_)
    test_scores.append(accuracy_score(y_test, rf_temp.predict(X_test)))

fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(n_estimators_range, oob_scores, 'o-', label='OOB Score', linewidth=2)
ax.plot(n_estimators_range, test_scores, 's-', label='Test Accuracy', linewidth=2)
ax.set_xlabel('Number of Trees (n_estimators)')
ax.set_ylabel('Accuracy')
ax.set_title('Random Forest: Performance vs Number of Trees')
ax.legend()
plt.tight_layout()
plt.show()

print("Performance saturates after enough trees -- more trees never hurt, but returns diminish.")

## 6. Feature Importance from Random Forest

In [None]:
# Feature importance (mean decrease in impurity)
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

# Sort by importance
indices = np.argsort(importances)[::-1]

# Show top 15 features
top_n = 15
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(
    range(top_n),
    importances[indices[:top_n]][::-1],
    xerr=std[indices[:top_n]][::-1],
    align='center',
    color='steelblue',
    alpha=0.8
)
ax.set_yticks(range(top_n))
ax.set_yticklabels(feature_names[indices[:top_n]][::-1])
ax.set_xlabel('Feature Importance (Mean Decrease in Impurity)')
ax.set_title('Top 15 Feature Importances from Random Forest')
plt.tight_layout()
plt.show()

## 7. Extra Trees: Even More Randomness

`ExtraTreesClassifier` (Extremely Randomized Trees) differs from Random Forest in two ways:

1. **No bootstrapping** -- by default (`bootstrap=False`) each tree uses the full training set
2. **Random split thresholds** -- instead of searching for the best threshold, it draws one random threshold per candidate feature and keeps the best of those random splits

This adds even more randomness, which can further reduce variance (at the cost of slightly more bias). Extra Trees is also **faster** because it skips the optimization of split thresholds.
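
The difference between the two split strategies can be illustrated on a single synthetic feature. This is a toy sketch, not scikit-learn's implementation: the helper names `gini` and `weighted_gini` are hypothetical, and the uniform draw between the feature's minimum and maximum is an assumption about how the random threshold is chosen.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of integer class labels."""
    if labels.size == 0:
        return 0.0
    p = np.bincount(labels) / labels.size
    return 1.0 - np.sum(p ** 2)

def weighted_gini(x, y, threshold):
    """Size-weighted impurity of the two children produced by x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (left.size * gini(left) + right.size * gini(right)) / y.size

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Random-Forest-style: exhaustive search over midpoints between sorted values
xs = np.unique(x)
candidates = (xs[:-1] + xs[1:]) / 2
best_threshold = min(candidates, key=lambda t: weighted_gini(x, y, t))

# Extra-Trees-style: a single uniform draw between the feature's min and max
random_threshold = rng.uniform(x.min(), x.max())

print(f"best threshold:   {best_threshold:+.3f}  "
      f"impurity={weighted_gini(x, y, best_threshold):.4f}")
print(f"random threshold: {random_threshold:+.3f}  "
      f"impurity={weighted_gini(x, y, random_threshold):.4f}")
```

The exhaustive search evaluates every candidate, while the random draw evaluates one. The random split is never purer than the searched one on the training data, which is the source of the extra bias and the speed-up described above.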

In [None]:
# Compare Random Forest vs Extra Trees
et = ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=-1)
et.fit(X_train, y_train)

et_train_acc = accuracy_score(y_train, et.predict(X_train))
et_test_acc = accuracy_score(y_test, et.predict(X_test))

et_cv = cross_val_score(
    ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    X, y, cv=5, scoring='accuracy'
)

# Summary comparison
results = pd.DataFrame({
    'Model': ['Decision Tree', 'Random Forest', 'Extra Trees'],
    'Train Acc': [dt_train_acc, rf_train_acc, et_train_acc],
    'Test Acc': [dt_test_acc, rf_test_acc, et_test_acc],
    'CV Mean': [dt_cv.mean(), rf_cv.mean(), et_cv.mean()],
    'CV Std': [dt_cv.std(), rf_cv.std(), et_cv.std()],
})
print(results.to_string(index=False))

## 8. When to Use Random Forest

Random Forest is an excellent **default model** because:

- **Robust out of the box** -- works well without extensive tuning
- **Handles mixed feature types** -- numerical and categorical (with encoding)
- **No feature scaling needed** -- tree-based splitting is invariant to monotone transformations
- **Built-in feature importance** -- useful for feature selection
- **Parallelizable** -- trees are independent, so training scales with `n_jobs=-1`
- **Resistant to overfitting** -- more trees do not increase overfitting (unlike boosting)

**Limitations:**
- Cannot extrapolate beyond training data range
- Less interpretable than a single tree
- Can be slow for very large datasets (many trees, many features)
- Often outperformed by gradient boosting on structured/tabular data

## 9. Common Mistakes

1. **Too few trees**: Using `n_estimators=10` is almost always too few. Start with 100-300; more trees only cost computation, not accuracy.

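2. **Ignoring feature importance**: Random Forest provides feature importances for free. Use them for feature selection and understanding your data, but remember that impurity-based importances are computed on the training data and can be inflated for high-cardinality features.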

3. **Overfitting with deep trees**: While Random Forest is resistant to overfitting, individual trees with no depth limit on small datasets can still memorize noise. Consider setting `max_depth` or `min_samples_leaf` for small datasets.

4. **Not using OOB score**: The OOB score is a free validation estimate. Set `oob_score=True` to monitor generalization without a separate validation split.

5. **Expecting extrapolation**: Like single decision trees, a Random Forest regressor cannot predict target values outside the range seen during training, because each prediction is an average of training targets.
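
The extrapolation limit in the last item is easy to see on synthetic data. The sketch below assumes a simple linear trend $y = 2x$ plus noise on $x \in [0, 10]$, then asks a `RandomForestRegressor` for predictions well outside that interval; all data and variable names are made up for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=200)
y_train = 2.0 * x_train + rng.normal(scale=0.5, size=200)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(x_train.reshape(-1, 1), y_train)

# The last two points lie far outside the training interval [0, 10]
x_new = np.array([[5.0], [10.0], [20.0], [50.0]])
pred = reg.predict(x_new)

for x_val, p in zip(x_new.ravel(), pred):
    print(f"x={x_val:5.1f}  prediction={p:6.2f}  true trend={2 * x_val:6.2f}")
print(f"largest training target: {y_train.max():.2f}")
```

Every prediction is an average of training targets, so it can never exceed the largest target seen in training; the predictions for $x = 20$ and $x = 50$ land in the same right-most leaves and come out identical.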

## 10. Exercises

### Exercise 1: Effect of `max_features`
Train Random Forest models with `max_features` set to `'sqrt'`, `'log2'`, `0.3`, `0.5`, and `1.0` on the breast cancer dataset. Compare their OOB scores and test accuracies. Which setting works best?

### Exercise 2: Bagging vs Random Forest
Use `BaggingClassifier` with a `DecisionTreeClassifier` base estimator (with all features at each split) and compare it to `RandomForestClassifier`. This isolates the effect of random feature selection.

### Exercise 3: Feature Selection with RF
Train a Random Forest and identify the top 10 features by importance. Re-train the model using only those 10 features. Does performance degrade significantly? What does this tell you about the other features?

In [None]:
# Exercise 1 starter code
# max_features_options = ['sqrt', 'log2', 0.3, 0.5, 1.0]
# for mf in max_features_options:
#     rf_temp = RandomForestClassifier(
#         n_estimators=200, max_features=mf, oob_score=True, random_state=42
#     )
#     rf_temp.fit(X_train, y_train)
#     print(f"max_features={mf}: OOB={rf_temp.oob_score_:.4f}, "
#           f"Test={accuracy_score(y_test, rf_temp.predict(X_test)):.4f}")