# Random Forest from Scratch - Classification

Complete implementation of Random Forest classifier with detailed explanation of **Bagging**.

**Key Concepts:**
- **Bagging** (Bootstrap Aggregating)
- **Bootstrap Sampling** (sampling with replacement)
- **Feature Randomness** (random feature subsets)
- **Ensemble Voting** (majority voting)
- **Out-of-Bag (OOB) Error** estimation
- **Variance Reduction** through averaging


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.datasets import make_classification, load_iris, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

print("=" * 80)
print("RANDOM FOREST FROM SCRATCH - WITH BAGGING EXPLAINED")
print("=" * 80)

## What is Bagging?

**Bagging = Bootstrap Aggregating**

### **Core Idea:**
Train multiple models on different **random subsets** of data, then **aggregate** their predictions.

### **Why Bagging Works:**

**Bias-Variance Tradeoff:**
$$\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

- **High Variance Models** (like deep decision trees): Overfit to training data
- **Bagging**: Reduces variance by averaging predictions
- **Result**: Lower overall error!

### **Mathematical Intuition:**

**Variance of Average:**

If we have $n$ independent models with variance $\sigma^2$:
$$\text{Var}(\text{average}) = \frac{\sigma^2}{n}$$

**With correlation** $\rho$ between models:
$$\text{Var}(\text{average}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$$

- As $n \to \infty$, the second term vanishes and the variance approaches $\rho\sigma^2$; it approaches 0 only if $\rho \to 0$ as well
- **Random Forest**: Creates diverse (low $\rho$) trees via randomness
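
To make the correlated-average formula concrete, here is a small numeric sketch. The function name `ensemble_variance` and the example values of $\sigma^2$, $\rho$ and $n$ are illustrative assumptions, not part of the notebook's implementation.

```python
def ensemble_variance(sigma2, rho, n):
    """Variance of the average of n equally correlated models.

    Implements Var(average) = rho * sigma2 + (1 - rho) / n * sigma2.
    """
    return rho * sigma2 + (1.0 - rho) / n * sigma2


sigma2 = 1.0  # illustrative per-model variance
for rho in (0.0, 0.3, 0.7):
    row = [round(ensemble_variance(sigma2, rho, n), 3) for n in (1, 10, 100, 1000)]
    print(f"rho={rho}: {row}")
```

With $\rho = 0$ the variance keeps shrinking as $\sigma^2/n$. With $\rho > 0$ it levels off near $\rho\sigma^2$, which is why decorrelating the trees matters as much as adding more of them.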

### **Bootstrap Sampling:**

Sample $n$ examples **with replacement** from dataset of size $n$:
- Each draw selects a given example with probability $1/n$
- Probability of NOT being selected in one draw: $(1 - 1/n)$
- Probability of NOT being selected in $n$ draws: $(1 - 1/n)^n \to 1/e \approx 0.368$

**Result:** Each bootstrap sample contains ~63.2% of the distinct original examples; the remaining ~36.8% of its slots are repeats, and ~36.8% of the original examples are left out.

### **Out-of-Bag (OOB) Samples:**

The ~36.8% of samples **not** in a bootstrap sample are "out-of-bag":
- Can be used as a **built-in validation set** (no separate validation split needed)
- Each tree is tested on its OOB samples
- Aggregate to get **OOB error** estimate
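
The in-bag and out-of-bag fractions can be checked empirically. The sketch below draws a single bootstrap sample with the standard library; the helper name `bootstrap_fractions`, the sample size and the seed are assumptions chosen for illustration.

```python
import math
import random


def bootstrap_fractions(n, seed=0):
    """Draw one bootstrap sample of size n; return (in-bag, out-of-bag) fractions."""
    rng = random.Random(seed)
    drawn = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
    unique_in_bag = len(set(drawn))               # distinct examples that were drawn
    return unique_in_bag / n, 1 - unique_in_bag / n


in_bag, oob = bootstrap_fractions(10_000)
print(f"in-bag unique fraction: {in_bag:.3f}")
print(f"out-of-bag fraction:    {oob:.3f}")
print(f"theory: 1 - 1/e = {1 - math.exp(-1):.3f}, 1/e = {math.exp(-1):.3f}")
```

For large $n$ the two empirical fractions should land close to the theoretical limits printed on the last line.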

## Random Forest = Bagging + Feature Randomness

**Standard Bagging:**
1. Create $B$ bootstrap samples
2. Train a decision tree on each
3. Aggregate predictions (majority vote for classification, average for regression)

**Random Forest Enhancement:**
1. Create $B$ bootstrap samples
2. Train tree on each, BUT at each split:
   - Randomly select $m$ features (out of $p$ total)
   - Find best split only among these $m$ features
3. Aggregate predictions (majority vote for classification, average for regression)

**Why the extra randomness?**
- Further **decorrelates** trees (lower $\rho$)
- Prevents one strong feature from dominating
- Better variance reduction!

**Typical choice for $m$:**
- Classification: $m = \sqrt{p}$
- Regression: $m = p/3$
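
As a rough illustration of these heuristics, the sketch below computes $m$ from $p$ and draws a fresh feature subset for a few splits. The helper names `n_split_features` and `draw_feature_subset` are hypothetical, and flooring to an integer (with a minimum of 1) is one reasonable convention rather than the only one.

```python
import math
import random


def n_split_features(p, rule="sqrt"):
    """Number of candidate features per split under a common heuristic."""
    if rule == "sqrt":      # typical for classification
        m = int(math.sqrt(p))
    elif rule == "third":   # typical for regression
        m = p // 3
    else:
        raise ValueError(f"unknown rule: {rule}")
    return max(1, m)        # always consider at least one feature


def draw_feature_subset(p, m, rng):
    """Pick m distinct feature indices out of p, without replacement."""
    return sorted(rng.sample(range(p), m))


rng = random.Random(0)
p = 20
m = n_split_features(p, "sqrt")
for split in range(3):
    # A new subset is drawn at every split, not once per tree
    print(f"split {split}: candidate features {draw_feature_subset(p, m, rng)}")
```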

## Simple Decision Tree (Base Learner)
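
The base learner scores candidate splits by the decrease in Gini impurity, $\text{Gini} = 1 - \sum_i p_i^2$. Before the full class, here is a small worked example on plain Python lists; the function names `gini` and `gini_gain` are illustrative and separate from the class methods.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(parent, left, right):
    """Impurity decrease: Gini(parent) - weighted average Gini of the children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                             # 0.5 for a 50/50 node
print(gini_gain(parent, [0, 0, 0], [1, 1, 1]))  # 0.5: a perfect split removes all impurity
print(gini_gain(parent, [0, 0, 1], [0, 1, 1]))  # much smaller: both children stay mixed
```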

In [None]:
class DecisionTreeNode:
    """
    Node in a decision tree.
    Can be either:
    - Internal node: has feature, threshold, left/right children
    - Leaf node: has class prediction
    """
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # Feature index to split on
        self.threshold = threshold  # Threshold value for split
        self.left = left           # Left child node
        self.right = right         # Right child node
        self.value = value         # Class prediction (for leaf nodes)
    
    def is_leaf(self):
        return self.value is not None


class DecisionTreeClassifierScratch:
    """
    Simple Decision Tree for classification (base learner for Random Forest).
    
    Parameters:
    -----------
    max_depth : int
        Maximum tree depth
    min_samples_split : int
        Minimum samples to split a node
    max_features : int or None
        Number of features to consider at each split (for RF randomness)
    """
    
    def __init__(self, max_depth=10, min_samples_split=2, max_features=None):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.root = None
    
    def _gini(self, y):
        """
        Calculate Gini impurity.
        Gini = 1 - Σ(p_i²)
        """
        _, counts = np.unique(y, return_counts=True)
        probabilities = counts / len(y)
        return 1 - np.sum(probabilities ** 2)
    
    def _information_gain(self, parent, left, right):
        """
        Calculate information gain from a split.
        IG = Gini(parent) - weighted_average(Gini(children))
        """
        n = len(parent)
        n_left, n_right = len(left), len(right)
        
        if n_left == 0 or n_right == 0:
            return 0
        
        gini_parent = self._gini(parent)
        gini_left = self._gini(left)
        gini_right = self._gini(right)
        
        weighted_gini = (n_left / n) * gini_left + (n_right / n) * gini_right
        return gini_parent - weighted_gini
    
    def _best_split(self, X, y, feature_indices):
        """
        Find best split among given features.
        
        KEY FOR RANDOM FOREST:
        Only considers features in feature_indices (random subset!)
        """
        best_gain = -1
        best_feature = None
        best_threshold = None
        
        for feature in feature_indices:
            X_column = X[:, feature]
            thresholds = np.unique(X_column)
            
            for threshold in thresholds:
                left_mask = X_column <= threshold
                right_mask = ~left_mask
                
                if np.sum(left_mask) < 1 or np.sum(right_mask) < 1:
                    continue
                
                gain = self._information_gain(y, y[left_mask], y[right_mask])
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        
        return best_feature, best_threshold
    
    def _build_tree(self, X, y, depth=0):
        """
        Recursively build decision tree.
        """
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        
        # Stopping criteria
        if (depth >= self.max_depth or 
            n_samples < self.min_samples_split or 
            n_classes == 1):
            # Create leaf node
            leaf_value = Counter(y).most_common(1)[0][0]
            return DecisionTreeNode(value=leaf_value)
        
        # RANDOM FOREST FEATURE RANDOMNESS:
        # Select random subset of features
        if self.max_features is None:
            feature_indices = np.arange(n_features)
        else:
            feature_indices = np.random.choice(
                n_features, 
                self.max_features, 
                replace=False
            )
        
        # Find best split among selected features
        best_feature, best_threshold = self._best_split(X, y, feature_indices)
        
        if best_feature is None:
            leaf_value = Counter(y).most_common(1)[0][0]
            return DecisionTreeNode(value=leaf_value)
        
        # Split data
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        # Recursively build children
        left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return DecisionTreeNode(
            feature=best_feature,
            threshold=best_threshold,
            left=left_child,
            right=right_child
        )
    
    def fit(self, X, y):
        """Build decision tree."""
        self.root = self._build_tree(X, y)
        return self
    
    def _traverse_tree(self, x, node):
        """
        Traverse tree to make prediction for single sample.
        """
        if node.is_leaf():
            return node.value
        
        if x[node.feature] <= node.threshold:
            return self._traverse_tree(x, node.left)
        else:
            return self._traverse_tree(x, node.right)
    
    def predict(self, X):
        """Predict for multiple samples."""
        return np.array([self._traverse_tree(x, self.root) for x in X])

print("\n✓ DecisionTreeClassifierScratch defined (base learner)")

## Random Forest Implementation with Bagging

In [None]:
class RandomForestClassifierScratch:
    """
    Random Forest Classifier from scratch using Bagging.
    
    BAGGING STEPS:
    1. Create multiple bootstrap samples (sampling with replacement)
    2. Train a tree on each bootstrap sample
    3. Aggregate predictions via majority voting
    
    RANDOM FOREST ADDITIONS:
    4. Use random feature subsets at each split
    5. Track out-of-bag (OOB) samples for validation
    
    Parameters:
    -----------
    n_estimators : int, default=100
        Number of trees in the forest
    max_depth : int, default=10
        Maximum depth of each tree
    min_samples_split : int, default=2
        Minimum samples to split a node
    max_features : str or int, default='sqrt'
        Number of features for random selection:
        - 'sqrt': sqrt(n_features)
        - 'log2': log2(n_features)
        - int: specific number
        - None: all features (no randomness, just bagging)
    bootstrap : bool, default=True
        Whether to use bootstrap sampling
    oob_score : bool, default=False
        Whether to calculate out-of-bag score
    random_state : int, default=None
        Random seed
    verbose : bool, default=False
        Print progress
    """
    
    def __init__(self, n_estimators=100, max_depth=10, min_samples_split=2,
                 max_features='sqrt', bootstrap=True, oob_score=False,
                 random_state=None, verbose=False):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.bootstrap = bootstrap
        self.oob_score = oob_score
        self.random_state = random_state
        self.verbose = verbose
        
        # Will be set during fit
        self.trees = []
        self.oob_samples = []  # OOB samples for each tree
        self.oob_score_ = None
        
    def _get_max_features(self, n_features):
        """
        Determine number of features to use at each split.
        """
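        if self.max_features == 'sqrt':
            # max(1, ...) guards against 0 features on very narrow datasets
            return max(1, int(np.sqrt(n_features)))
        elif self.max_features == 'log2':
            return max(1, int(np.log2(n_features)))
        elif isinstance(self.max_features, int):
            # Never request more features than the data has
            return min(self.max_features, n_features)
        else:
            return n_features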
    
    def _bootstrap_sample(self, X, y):
        """
        Create bootstrap sample (sampling with replacement).
        
        BAGGING CORE:
        Sample n examples WITH REPLACEMENT from dataset of size n.
        
        Returns:
        --------
        X_sample, y_sample : Bootstrap sample
        oob_indices : Indices of out-of-bag samples (~36.8% of data)
        """
        n_samples = X.shape[0]
        
        if self.bootstrap:
            # Sample WITH replacement
            indices = np.random.choice(n_samples, n_samples, replace=True)
            
            # Out-of-bag samples: not selected in bootstrap
            oob_indices = np.setdiff1d(np.arange(n_samples), indices)
        else:
            # No bootstrap: use all samples
            indices = np.arange(n_samples)
            oob_indices = np.array([], dtype=int)  # integer dtype keeps it usable as an index
        
        return X[indices], y[indices], oob_indices
    
    def fit(self, X, y):
        """
        Build Random Forest using Bagging.
        
        BAGGING ALGORITHM:
        For b = 1 to B:
          1. Create bootstrap sample Sb by sampling n with replacement
          2. Train tree Tb on Sb
        
        RANDOM FOREST ENHANCEMENT:
          2b. At each split, select random m features
        """
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        n_samples, n_features = X.shape
        max_features = self._get_max_features(n_features)
        
        if self.verbose:
            print(f"\nBuilding Random Forest:")
            print(f"  Trees: {self.n_estimators}")
            print(f"  Max features per split: {max_features}/{n_features}")
            print(f"  Bootstrap: {self.bootstrap}")
            print(f"  OOB score: {self.oob_score}\n")
        
        self.trees = []
        self.oob_samples = []
        self.oob_score_ = None  # Reset so a refit never reports a stale score
        
        # BAGGING: Train multiple trees
        for i in range(self.n_estimators):
            # Step 1: BOOTSTRAP SAMPLING (with replacement)
            X_sample, y_sample, oob_indices = self._bootstrap_sample(X, y)
            
            # Step 2: TRAIN TREE with random feature subsets
            tree = DecisionTreeClassifierScratch(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                max_features=max_features  # RF feature randomness
            )
            tree.fit(X_sample, y_sample)
            
            # Store tree and OOB samples
            self.trees.append(tree)
            self.oob_samples.append(oob_indices)
            
            if self.verbose and (i + 1) % 10 == 0:
                print(f"  Trained {i + 1}/{self.n_estimators} trees")
        
        # Calculate OOB score if requested
        if self.oob_score and self.bootstrap:
            self._calculate_oob_score(X, y)
        
        if self.verbose:
            print(f"\n✓ Random Forest training complete!")
            if self.oob_score_ is not None:
                print(f"  OOB Score: {self.oob_score_:.4f}")
        
        return self
    
    def _calculate_oob_score(self, X, y):
        """
        Calculate Out-of-Bag score.
        
        OOB ESTIMATION:
        For each sample, predict using only trees where it was OOB,
        then take a majority vote over those trees.
        This gives a validation-style score WITHOUT a separate held-out set.
        """
        n_samples = X.shape[0]
        classes = np.unique(y)
        # votes[i, k] = number of OOB trees predicting classes[k] for sample i
        votes = np.zeros((n_samples, len(classes)), dtype=int)
        
        for tree, oob_indices in zip(self.trees, self.oob_samples):
            if len(oob_indices) == 0:
                continue
            
            # Predict on this tree's OOB samples only
            predictions = tree.predict(X[oob_indices])
            
            # Map predicted labels to column positions and count one vote each
            class_positions = np.searchsorted(classes, predictions)
            np.add.at(votes, (oob_indices, class_positions), 1)
        
        # Calculate accuracy on samples that were OOB at least once
        valid_mask = votes.sum(axis=1) > 0
        if np.any(valid_mask):
            oob_predictions = classes[np.argmax(votes[valid_mask], axis=1)]
            self.oob_score_ = np.mean(oob_predictions == y[valid_mask])
    
    def predict(self, X):
        """
        Predict using majority voting (AGGREGATING step of bagging).
        
        BAGGING PREDICTION:
        For each sample x:
          1. Get prediction from each tree
          2. Return majority vote
        
        For classification: majority vote
        For regression: average
        """
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        
        # AGGREGATING: Majority voting
        # For each sample, find most common prediction across trees
        predictions = []
        for i in range(X.shape[0]):
            sample_predictions = tree_predictions[:, i]
            # Majority vote
            majority = Counter(sample_predictions).most_common(1)[0][0]
            predictions.append(majority)
        
        return np.array(predictions)
    
    def predict_proba(self, X):
        """
        Predict class probabilities.
        
        Probability = fraction of trees voting for each class.
        Columns follow the sorted class labels predicted by at least one tree.
        """
        # Get predictions from all trees: shape (n_trees, n_samples)
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        
        # Compare against the actual labels rather than assuming 0..k-1
        classes = np.unique(tree_predictions)
        
        # Fraction of trees voting for each class, one column per class
        probabilities = np.stack(
            [np.mean(tree_predictions == label, axis=0) for label in classes],
            axis=1
        )
        
        return probabilities

print("\n✓ RandomForestClassifierScratch defined")
print("\n" + "="*80)
print("BAGGING COMPONENTS IMPLEMENTED:")
print("="*80)
print("1. ✓ Bootstrap sampling (with replacement)")
print("2. ✓ Multiple tree training")
print("3. ✓ Majority voting aggregation")
print("4. ✓ Random feature subsets (RF enhancement)")
print("5. ✓ Out-of-bag (OOB) error estimation")
print("="*80)

## Example 1: Synthetic Dataset

In [None]:
# Generate dataset
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    n_classes=3,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Synthetic Dataset:")
print(f"  Training: {len(X_train)} samples")
print(f"  Test: {len(X_test)} samples")
print(f"  Features: {X.shape[1]}")
print(f"  Classes: {len(np.unique(y))}")

In [None]:
# Train Random Forest
print("\n" + "="*80)
print("TRAINING RANDOM FOREST")
print("="*80)

rf = RandomForestClassifierScratch(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',  # √20 ≈ 4 features per split
    bootstrap=True,
    oob_score=True,
    random_state=42,
    verbose=True
)

rf.fit(X_train, y_train)

In [None]:
# Evaluate
y_pred = rf.predict(X_test)
y_proba = rf.predict_proba(X_test)

test_acc = accuracy_score(y_test, y_pred)

print("\n" + "="*80)
print("RESULTS")
print("="*80)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")
print("\nNote: OOB score is like validation accuracy without needing separate val set!")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

## Demonstrate Bagging Effect

In [None]:
print("\n" + "="*80)
print("DEMONSTRATING VARIANCE REDUCTION THROUGH BAGGING")
print("="*80)

# Compare: Single tree vs Multiple trees
n_trials = 20
single_tree_accs = []
rf_accs = []

print("\nRunning multiple trials to measure variance...")

for trial in range(n_trials):
    # Single decision tree (high variance)
    single_tree = DecisionTreeClassifierScratch(
        max_depth=10,
        max_features=int(np.sqrt(X_train.shape[1]))
    )
    single_tree.fit(X_train, y_train)
    single_pred = single_tree.predict(X_test)
    single_tree_accs.append(accuracy_score(y_test, single_pred))
    
    # Random Forest (bagging reduces variance)
    rf_trial = RandomForestClassifierScratch(
        n_estimators=50,
        max_depth=10,
        max_features='sqrt',
        random_state=trial,
        verbose=False
    )
    rf_trial.fit(X_train, y_train)
    rf_pred = rf_trial.predict(X_test)
    rf_accs.append(accuracy_score(y_test, rf_pred))

# Calculate statistics
print("\n" + "="*80)
print("VARIANCE REDUCTION RESULTS")
print("="*80)
print(f"\nSingle Decision Tree:")
print(f"  Mean Accuracy: {np.mean(single_tree_accs):.4f}")
print(f"  Std Dev:       {np.std(single_tree_accs):.4f}  ← HIGH VARIANCE")
print(f"  Min:           {np.min(single_tree_accs):.4f}")
print(f"  Max:           {np.max(single_tree_accs):.4f}")

print(f"\nRandom Forest (50 trees):")
print(f"  Mean Accuracy: {np.mean(rf_accs):.4f}")
print(f"  Std Dev:       {np.std(rf_accs):.4f}  ← REDUCED VARIANCE")
print(f"  Min:           {np.min(rf_accs):.4f}")
print(f"  Max:           {np.max(rf_accs):.4f}")

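# Compare spread via standard deviation (guard against a zero baseline)
std_single, std_rf = np.std(single_tree_accs), np.std(rf_accs)
if std_single > 0:
    std_reduction = (std_single - std_rf) / std_single * 100
    print(f"\nStd Dev Reduction: {std_reduction:.1f}%")
print("\nA smaller std dev for the forest indicates more stable predictions.")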

In [None]:
# Visualize variance reduction
fig, ax = plt.subplots(figsize=(12, 6))

positions = [1, 2]
data = [single_tree_accs, rf_accs]
labels = ['Single Tree\n(High Variance)', 'Random Forest\n(Low Variance)']

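bp = ax.boxplot(data, positions=positions, widths=0.6,
                patch_artist=True, showmeans=True)
# Set tick labels separately: the `labels` keyword is deprecated in newer Matplotlib
ax.set_xticks(positions)
ax.set_xticklabels(labels)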

# Color boxes
colors = ['lightcoral', 'lightgreen']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

ax.set_ylabel('Test Accuracy', fontsize=12)
ax.set_title('Variance Reduction through Bagging', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n✓ Box plot shows Random Forest has tighter distribution (lower variance)")

## Effect of Number of Trees

In [None]:
print("\n" + "="*80)
print("EFFECT OF NUMBER OF TREES")
print("="*80)

n_trees_list = [1, 5, 10, 25, 50, 100, 200]
train_accs = []
test_accs = []
oob_scores = []

for n_trees in n_trees_list:
    print(f"\nTraining with {n_trees} trees...")
    
    rf_temp = RandomForestClassifierScratch(
        n_estimators=n_trees,
        max_depth=10,
        max_features='sqrt',
        bootstrap=True,
        oob_score=True,
        random_state=42,
        verbose=False
    )
    
    rf_temp.fit(X_train, y_train)
    
    train_pred = rf_temp.predict(X_train)
    test_pred = rf_temp.predict(X_test)
    
    train_acc = accuracy_score(y_train, train_pred)
    test_acc = accuracy_score(y_test, test_pred)
    
    train_accs.append(train_acc)
    test_accs.append(test_acc)
    oob_scores.append(rf_temp.oob_score_)
    
    print(f"  Train: {train_acc:.4f} | Test: {test_acc:.4f} | OOB: {rf_temp.oob_score_:.4f}")

In [None]:
# Plot effect of number of trees
fig, ax = plt.subplots(figsize=(12, 6))

ax.plot(n_trees_list, train_accs, marker='o', linewidth=2, label='Train Accuracy')
ax.plot(n_trees_list, test_accs, marker='s', linewidth=2, label='Test Accuracy')
ax.plot(n_trees_list, oob_scores, marker='^', linewidth=2, label='OOB Score')

ax.set_xlabel('Number of Trees', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Performance vs Number of Trees', fontsize=14, fontweight='bold')
ax.set_xscale('log')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✓ Accuracy typically improves, then plateaus, as trees are added")
print("✓ OOB score should track test accuracy reasonably closely (built-in validation)")

## Effect of max_features (Feature Randomness)

In [None]:
print("\n" + "="*80)
print("EFFECT OF FEATURE RANDOMNESS (max_features)")
print("="*80)

n_features = X_train.shape[1]
# Use a set to drop duplicates (for 20 features, log2 and sqrt both give 4)
max_features_options = sorted({
    1,
    int(np.log2(n_features)),
    int(np.sqrt(n_features)),
    n_features // 2,
    n_features
})

results_features = []

for max_feat in max_features_options:
    print(f"\nTesting max_features={max_feat}/{n_features}...")
    
    rf_feat = RandomForestClassifierScratch(
        n_estimators=100,
        max_depth=10,
        max_features=max_feat,
        random_state=42,
        verbose=False
    )
    
    rf_feat.fit(X_train, y_train)
    test_pred = rf_feat.predict(X_test)
    test_acc = accuracy_score(y_test, test_pred)
    
    results_features.append({
        'max_features': max_feat,
        'label': f'{max_feat}/{n_features}',
        'accuracy': test_acc
    })
    
    print(f"  Test Accuracy: {test_acc:.4f}")

print("\n" + "="*80)
print("KEY INSIGHT: Feature randomness decorrelates trees!")
print("  - Too few features: Individual trees are weak (higher bias)")
print("  - Too many features: Trees are correlated, so averaging removes less variance")
print("  - sqrt(n_features): Good balance (typical choice)")
print("="*80)

## Bootstrap vs No Bootstrap

In [None]:
print("\n" + "="*80)
print("BAGGING vs NO BAGGING")
print("="*80)

# With bootstrap (standard Random Forest)
print("\n[1] WITH Bootstrap (Bagging):")
rf_bootstrap = RandomForestClassifierScratch(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',
    bootstrap=True,  # ← Bagging ON
    random_state=42,
    verbose=False
)
rf_bootstrap.fit(X_train, y_train)
test_pred_boot = rf_bootstrap.predict(X_test)
acc_boot = accuracy_score(y_test, test_pred_boot)
print(f"  Test Accuracy: {acc_boot:.4f}")

# Without bootstrap (just feature randomness)
print("\n[2] WITHOUT Bootstrap (No Bagging):")
rf_no_bootstrap = RandomForestClassifierScratch(
    n_estimators=100,
    max_depth=10,
    max_features='sqrt',
    bootstrap=False,  # ← Bagging OFF
    random_state=42,
    verbose=False
)
rf_no_bootstrap.fit(X_train, y_train)
test_pred_noboot = rf_no_bootstrap.predict(X_test)
acc_noboot = accuracy_score(y_test, test_pred_noboot)
print(f"  Test Accuracy: {acc_noboot:.4f}")

print("\n" + "="*80)
print("DIFFERENCE:")
print(f"  With Bagging:    {acc_boot:.4f}")
print(f"  Without Bagging: {acc_noboot:.4f}")
print(f"  Difference:      {(acc_boot - acc_noboot):+.4f}")
print("\n✓ Bootstrap sampling adds diversity on top of feature randomness; the gain varies by dataset")
print("="*80)

## Example 2: Breast Cancer Dataset

In [None]:
# Load dataset
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

print("\nBreast Cancer Dataset:")
print(f"  Training: {len(X_train_c)} samples")
print(f"  Test: {len(X_test_c)} samples")
print(f"  Features: {X_cancer.shape[1]}")
print(f"  Classes: Malignant (0), Benign (1)")

In [None]:
# Train Random Forest
print("\n" + "="*80)
print("TRAINING ON BREAST CANCER DATA")
print("="*80)

rf_cancer = RandomForestClassifierScratch(
    n_estimators=100,
    max_depth=15,
    max_features='sqrt',
    bootstrap=True,
    oob_score=True,
    random_state=42,
    verbose=True
)

rf_cancer.fit(X_train_c, y_train_c)

# Evaluate
y_pred_c = rf_cancer.predict(X_test_c)
acc_c = accuracy_score(y_test_c, y_pred_c)

print("\n" + "="*80)
print("BREAST CANCER RESULTS")
print("="*80)
print(f"\nTest Accuracy: {acc_c:.4f}")
print(f"OOB Score: {rf_cancer.oob_score_:.4f}")

print("\nClassification Report:")
print(classification_report(y_test_c, y_pred_c, 
                          target_names=['Malignant', 'Benign']))

# Confusion Matrix
cm = confusion_matrix(y_test_c, y_pred_c)
print("\nConfusion Matrix:")
print(cm)

## Summary

### **What is Bagging?**

**Bootstrap Aggregating (Bagging):**
1. Create $B$ bootstrap samples (sampling with replacement)
2. Train a model on each sample
3. Aggregate predictions (voting for classification, averaging for regression)
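
The three steps can be sketched end to end with a deliberately tiny base learner. Everything below is an illustrative assumption: the 1-D threshold stump `fit_stump`, the helpers `bagging_fit` and `bagging_predict`, and the toy data stand in for the full tree and forest classes.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Toy base learner: best single threshold on a 1-D feature (illustrative only)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        l_label = Counter(left).most_common(1)[0][0]
        r_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != l_label for y in left) + sum(y != r_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_label, r_label)
    if best is None:  # all x identical: predict the majority class
        label = Counter(ys).most_common(1)[0][0]
        return lambda x: label
    _, t, l_label, r_label = best
    return lambda x: l_label if x <= t else r_label


def bagging_fit(xs, ys, n_models, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]       # step 1: bootstrap sample
        models.append(fit_stump([xs[i] for i in idx],     # step 2: train on it
                                [ys[i] for i in idx]))
    return models


def bagging_predict(models, x):
    votes = Counter(model(x) for model in models)         # step 3: majority vote
    return votes.most_common(1)[0][0]


xs = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys, n_models=25)
print([bagging_predict(models, x) for x in (0.2, 1.8)])
```

Individual stumps can place their threshold poorly when a bootstrap sample happens to miss the points near the class boundary; the vote across 25 of them smooths that out.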

### **Why Bagging Works:**

**Variance Reduction:**
- Individual trees: High variance (overfit to specific samples)
- Bagged ensemble: Lower variance (averaged predictions are stable)
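- Mathematical: $\text{Var}(\text{avg}) = \frac{\sigma^2}{n}$ for independent models; with pairwise correlation $\rho$, $\text{Var}(\text{avg}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$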

**Key Properties:**
- Each bootstrap sample: ~63.2% unique samples
- OOB samples: ~36.8% not in sample → free validation!
- Works best with high-variance, low-bias models (deep trees)
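
To show how OOB samples turn into a validation estimate, here is a minimal sketch of OOB accuracy by majority vote. The function `oob_accuracy` and its inputs (`oob_sets`, `tree_predictions`) are hypothetical stand-ins for a fitted forest, with hand-written predictions rather than real trees.

```python
from collections import Counter, defaultdict


def oob_accuracy(oob_sets, tree_predictions, y_true):
    """OOB accuracy by majority vote.

    oob_sets[b]         : indices that tree b did NOT see during training
    tree_predictions[b] : dict {sample index: predicted label} for tree b
    Both structures are hypothetical stand-ins for a fitted forest.
    """
    votes = defaultdict(Counter)
    for oob, preds in zip(oob_sets, tree_predictions):
        for i in oob:
            votes[i][preds[i]] += 1  # only trees that never saw sample i may vote on it

    if not votes:
        return None                  # no sample was ever out-of-bag
    correct = sum(v.most_common(1)[0][0] == y_true[i] for i, v in votes.items())
    return correct / len(votes)


y_true = [0, 1, 1, 0]
oob_sets = [{0, 1}, {1, 2}, {0, 1, 2}]  # sample 3 is never out-of-bag
tree_predictions = [
    {0: 0, 1: 0},
    {1: 0, 2: 1},
    {0: 0, 1: 1, 2: 1},
]
result = oob_accuracy(oob_sets, tree_predictions, y_true)
print(result)  # 2 of the 3 scored samples are classified correctly
```

Sample 3 is excluded from the score because every tree trained on it, so no unbiased vote exists for it.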

### **Random Forest = Bagging + Extra Randomness:**

**Enhancements:**
1. **Bootstrap sampling** (from bagging)
2. **Random feature subsets** at each split (NEW!)
   - Typically $m = \sqrt{p}$ for classification
   - Further decorrelates trees
   - Prevents strong features from dominating

### **Advantages:**
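- **Reduces variance** with little increase in bias
- **Parallel training** (trees are independent)
- **Robust to outliers** (averaging smooths them out)
- **More trees do not cause overfitting** (test error plateaus as trees are added, though individual trees can still overfit noisy data)
- **OOB error** (built-in validation estimate)
- **Feature importance** (provided by library implementations such as scikit-learn; not implemented in this notebook)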

### **Hyperparameters:**

| Parameter | Effect | Typical Value |
|-----------|--------|---------------|
| `n_estimators` | More trees → lower variance | 100-500 |
| `max_features` | Fewer → more decorrelation | sqrt(p) |
| `max_depth` | Deeper → lower bias, higher variance | 10-30 |
| `min_samples_split` | Higher → regularization | 2-10 |
| `bootstrap` | True → bagging, False → every tree sees the full training set (diversity from feature randomness only) | True |

### **When to Use Random Forest:**
- Default choice for tabular data
- Need robust, low-maintenance model
- Feature importance needed
- Have mixed feature types
- Want parallel training

### **Comparison:**

| Aspect | Single Tree | Random Forest |
|--------|-------------|---------------|
| **Variance** | High | Low |
| **Overfitting** | Prone | Resistant |
| **Training Time** | Fast | Slower (but parallel) |
| **Interpretability** | High | Lower |
| **Accuracy** | Good | Typically better |
| **Stability** | Low | High |
