# 017 - Random Forests: Bootstrap Aggregating for Robust Predictions

## 📋 Learning Objectives

By the end of this notebook, you will be able to:

1. **Understand Random Forest fundamentals** - bagging, bootstrap sampling, random feature selection
2. **Implement Random Forest from scratch** - build ensemble of decorrelated trees
3. **Use sklearn RandomForestRegressor/Classifier** - production implementations with tuning
4. **Apply to post-silicon validation** - robust yield prediction, multi-failure detection, feature ranking
5. **Interpret ensemble models** - feature importance aggregation, OOB error estimation
6. **Tune for optimal performance** - n_estimators, max_features, max_depth, min_samples_split

---

## 🎯 Why Random Forests?

### The Problem with Single Decision Trees

**Decision trees suffer from:**
- **High variance**: Small data changes → completely different trees
- **Overfitting**: Deep trees memorize training noise
- **Instability**: Not robust to outliers or sampling variation
- **Greedy splitting**: Locally optimal decisions may miss global patterns

**Example:** Train 5 trees on same data with slight variations → 5 very different structures

### Random Forest Solution: Wisdom of the Crowd

**Key insight:** Average of many diverse models is more stable than any single model.

✅ **Variance reduction**: Averaging reduces prediction variance while bias stays roughly the same
✅ **Robustness**: Outliers affect only some trees, not the entire forest
✅ **Less overfitting**: Individual trees can be deep (high variance) because averaging stabilizes the ensemble, though a forest can still overfit very noisy data
✅ **Decorrelation**: Random feature selection ensures trees are different
✅ **OOB validation**: Built-in validation estimate from out-of-bag samples (similar in spirit to cross-validation)

### Real-World Impact

**Post-Silicon:**
- **Robust yield prediction**: Stable predictions despite noisy test data
- **Multi-failure detection**: Each tree specializes in different failure modes
- **Feature importance**: Aggregate rankings reveal truly important parameters
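- **Missing data handling**: Classic CART trees can route missing values via surrogate splits; scikit-learn's forests do not implement surrogates, so impute first or rely on the native NaN support in recent scikit-learn releases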

**General AI/ML:**
- **Kaggle competitions**: Random Forests are a common, strong baseline on tabular data
- **Credit scoring**: Robust to data quality issues (outliers, noisy features)
- **Medical diagnosis**: Ensemble consensus reduces misdiagnosis risk
- **Production ML**: Minimal tuning, handles mixed data types, fast training

---

## 🔄 Random Forest Algorithm

```mermaid
graph TD
    A[Training Data N samples] --> B[Bootstrap Sample 1]
    A --> C[Bootstrap Sample 2]
    A --> D[Bootstrap Sample K]
    
    B --> E[Tree 1: Random m features at each split]
    C --> F[Tree 2: Random m features at each split]
    D --> G[Tree K: Random m features at each split]
    
    E --> H[Predictions from Tree 1]
    F --> I[Predictions from Tree 2]
    G --> J[Predictions from Tree K]
    
    H --> K[Average for Regression]
    I --> K
    J --> K
    
    H --> L[Majority Vote for Classification]
    I --> L
    J --> L
    
    K --> M[Final Prediction]
    L --> M
    
    style A fill:#e1f5ff
    style M fill:#e1ffe1
    style B fill:#fff3cd
    style C fill:#fff3cd
    style D fill:#fff3cd
```

### Key Concepts

1. **Bootstrap Aggregating (Bagging):** Sample N observations with replacement → K different datasets
2. **Random Feature Selection:** At each split, consider only random subset of m features (m ≈ √p)
3. **Decorrelation:** Bootstrap + random features → trees make different mistakes
4. **Out-of-Bag (OOB) Error:** ~37% of samples left out of each bootstrap → free validation set
5. **Averaging/Voting:** Regression = mean prediction, Classification = majority class

### Mathematical Intuition

**Why averaging reduces variance:**

If we have K independent models with variance $\sigma^2$, the ensemble variance is:

$$Var(\bar{f}) = \frac{\sigma^2}{K}$$

**With correlation ρ between models:**

$$Var(\bar{f}) = \rho \sigma^2 + \frac{1-\rho}{K} \sigma^2$$

Where:
- First term: Irreducible variance from correlation
- Second term: Reducible variance (decreases with K)
- **Goal:** Minimize ρ via random feature selection
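
As a quick numerical check of the correlated-variance formula, the sketch below draws K equicorrelated Gaussian "model outputs" and compares the empirical variance of their mean with $\rho \sigma^2 + \frac{1-\rho}{K} \sigma^2$. The Gaussian setup and the helper names `ensemble_variance` and `simulate_ensemble_variance` are illustrative assumptions, not part of the notebook's forest code.

```python
import numpy as np

def ensemble_variance(rho, sigma2, K):
    """Theoretical variance of the mean of K equicorrelated models."""
    return rho * sigma2 + (1 - rho) / K * sigma2

def simulate_ensemble_variance(rho, sigma2, K, n_draws=100_000, seed=0):
    """Empirical variance of the mean of K equicorrelated Gaussian outputs."""
    rng = np.random.default_rng(seed)
    # A shared component induces pairwise correlation rho;
    # the private component is independent across models.
    shared = rng.normal(0.0, np.sqrt(rho * sigma2), size=(n_draws, 1))
    private = rng.normal(0.0, np.sqrt((1 - rho) * sigma2), size=(n_draws, K))
    outputs = shared + private  # each column has variance sigma2
    return outputs.mean(axis=1).var()

for rho in (0.0, 0.3, 0.9):
    theory = ensemble_variance(rho, sigma2=1.0, K=50)
    empirical = simulate_ensemble_variance(rho, sigma2=1.0, K=50)
    print(f"rho={rho:.1f}  theory={theory:.4f}  simulated={empirical:.4f}")
```

With high ρ the variance barely drops below σ² no matter how many models are averaged, which is why decorrelating the trees matters as much as adding more of them.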

---

## 📐 Mathematical Foundation

### 1. Bootstrap Sampling

**Procedure:** Sample N observations from training set **with replacement**.

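**Result:** Relative to the original N observations, each bootstrap sample contains:
- ~63.2% of the observations at least once (in-bag), some of them duplicated
- ~36.8% of the observations not at all (out-of-bag)

**Proof:** The probability that observation i is NOT selected in a single draw is $1 - 1/N$. Over N independent draws, the probability that it is never selected is: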

$$P(\text{not selected}) = \left(1 - \frac{1}{N}\right)^N \approx e^{-1} \approx 0.368$$

Therefore, ~63.2% of observations appear at least once.
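
The 63.2% / 36.8% split can be checked empirically. The following is a minimal simulation sketch; the values of `N` and `n_trials` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000          # training-set size (illustrative)
n_trials = 200    # number of bootstrap samples to average over

unique_fractions = []
for _ in range(n_trials):
    sample = rng.integers(0, N, size=N)  # N draws with replacement
    unique_fractions.append(np.unique(sample).size / N)

in_bag = float(np.mean(unique_fractions))
print(f"in-bag fraction:     {in_bag:.3f}")
print(f"out-of-bag fraction: {1 - in_bag:.3f}")
print(f"1 - (1 - 1/N)^N:     {1 - (1 - 1/N)**N:.3f}")
```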

### 2. Random Feature Selection

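At each split, consider a random subset of m features:

- **Classification:** $m = \sqrt{p}$ (Breiman's recommendation and the scikit-learn default)
- **Regression:** $m = p/3$ (Breiman's recommendation; scikit-learn's `RandomForestRegressor` defaults to all p features, so set `max_features` explicitly)
- **Max decorrelation:** $m = 1$ (the split feature is chosen completely at random; Extremely Randomized Trees take a different route and also randomize the split thresholds)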

**Why?** Prevents strong features from dominating all trees → more diverse ensemble.
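
To make the effect concrete: with m candidates out of p features, any given feature is offered at only about m/p of the splits. The sketch below simulates this; `candidate_features` and the `dominant` index are hypothetical names used only for this illustration.

```python
import numpy as np

def candidate_features(p, m, rng):
    """Draw m distinct feature indices out of p for one split."""
    return rng.choice(p, size=m, replace=False)

rng = np.random.default_rng(0)
p = 16
m = int(np.sqrt(p))   # sqrt rule: 4 candidates per split
dominant = 0          # hypothetical index of one very strong feature

n_splits = 10_000
hits = sum(dominant in candidate_features(p, m, rng) for _ in range(n_splits))
share = hits / n_splits

print(f"m = {m} of p = {p}")
print(f"dominant feature offered at {share:.1%} of splits (expected m/p = {m / p:.1%})")
```

At the remaining splits the tree is forced to use other features, which is what makes the trees differ from one another.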

### 3. Prediction

**Regression:**
$$\hat{y}_{RF} = \frac{1}{K} \sum_{k=1}^{K} \hat{y}_k(x)$$

**Classification:**
$$\hat{y}_{RF} = \arg\max_c \sum_{k=1}^{K} \mathbb{1}(\hat{y}_k(x) = c)$$
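
Both aggregation rules are a few lines of NumPy. The per-tree predictions below are made-up numbers chosen only to show the mechanics.

```python
import numpy as np

# Rows = trees, columns = samples (values are invented for illustration).
reg_preds = np.array([[2.0, 5.0],
                      [3.0, 4.0],
                      [4.0, 6.0]])
rf_regression = reg_preds.mean(axis=0)  # average over trees

clf_preds = np.array([[0, 1, 2],
                      [0, 1, 1],
                      [1, 1, 2]])
n_classes = 3
# Count votes per class for each sample, then pick the most-voted class.
votes = np.stack([(clf_preds == c).sum(axis=0) for c in range(n_classes)])
rf_classification = votes.argmax(axis=0)

print(rf_regression)      # [3. 5.]
print(rf_classification)  # [0 1 2]
```

Note that scikit-learn's `RandomForestClassifier` averages the per-tree class probabilities (soft voting) rather than counting hard votes; the two rules usually agree but can differ on close calls.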

### 4. Out-of-Bag (OOB) Error

For each observation i:
1. Find all trees where i was NOT in bootstrap sample (~37% of trees)
2. Predict i using only those trees
3. Compare to true label

$$OOB\_Error = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{y}_{OOB,i})$$

**Advantage:** The OOB error is usually close to the cross-validation error, so it provides a validation estimate without holding out data (keep a separate test set for final reporting).
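
One way to see how closely OOB tracks cross-validation is to compute both on the same data. This is a sketch on a synthetic benchmark (`make_friedman1`); the dataset, sample size, and tree count are assumptions made for the illustration, and the agreement will vary on real data.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_friedman1(n_samples=500, noise=1.0, random_state=0)

# OOB estimate: a single fit, scored on the samples each tree did not see.
rf_demo = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
rf_demo.fit(X_demo, y_demo)
oob_r2_demo = rf_demo.oob_score_

# 5-fold cross-validation: five separate fits.
cv_r2_demo = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X_demo, y_demo, cv=5, scoring="r2",
).mean()

print(f"OOB R2:       {oob_r2_demo:.3f}")
print(f"5-fold CV R2: {cv_r2_demo:.3f}")
```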

### 5. Feature Importance

**Method 1: Mean Decrease in Impurity (MDI)**

$$Importance(feature\_j) = \frac{1}{K} \sum_{k=1}^{K} \sum_{t \in Tree_k} \Delta Impurity_t \cdot \mathbb{1}(split\_feature = j)$$

**Method 2: Permutation Importance**
1. Compute OOB error
2. Shuffle feature j in OOB samples
3. Compute new OOB error
4. Importance = increase in error

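**Advantage:** Permutation importance measures the effect on predictive error directly and is less biased toward high-cardinality features than MDI. Strongly correlated features can still share or mask each other's importance under either method.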
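
Both importance measures are available in scikit-learn. The sketch below compares them on synthetic data where only the first two features matter. Note one difference from the OOB-based procedure listed above: `sklearn.inspection.permutation_importance` shuffles features in whatever dataset it is given (here a held-out split), not in the OOB samples. The data-generating formula is an assumption for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_imp = rng.normal(size=(600, 5))
# Only features 0 and 1 carry signal; features 2-4 are pure noise.
y_imp = 3.0 * X_imp[:, 0] + 1.0 * X_imp[:, 1] + rng.normal(0, 0.5, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X_imp, y_imp, test_size=0.3, random_state=0)
rf_imp = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

mdi = rf_imp.feature_importances_  # mean decrease in impurity, sums to 1
perm = permutation_importance(rf_imp, X_te, y_te, n_repeats=10, random_state=0)

for j in range(X_imp.shape[1]):
    print(f"feature {j}: MDI={mdi[j]:.3f}  permutation={perm.importances_mean[j]:.3f}")
```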

---

## 🔨 Implementation from Scratch

### 📝 What's Happening in This Code?

**Purpose:** Build Random Forest using our DecisionTreeRegressorScratch from notebook 016

**Key Points:**
- **Bootstrap sampling**: Use `np.random.choice` with `replace=True` for each tree
- **Random features**: At each split, consider only random subset (max_features)
- **Parallel trees**: Each tree trained independently (can parallelize)
- **Prediction**: Average all tree predictions for final output
- **OOB tracking**: Store which samples were out-of-bag for each tree

**Why This Matters:** Understanding bootstrap + random features reveals how Random Forests achieve decorrelation and variance reduction. The same idea underlies other bagging-style ensemble methods.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from copy import deepcopy

# Reuse DecisionTreeRegressorScratch from 016
class Node:
    def __init__(self, feature_idx=None, threshold=None, left=None, right=None, value=None):
        self.feature_idx = feature_idx
        self.threshold = threshold
        self.left = left
        self.right = right
        self.value = value
    
    def is_leaf(self):
        return self.value is not None

class DecisionTreeRegressorScratch:
    def __init__(self, max_depth=5, min_samples_split=2, min_samples_leaf=1, max_features=None):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.root = None
    
    def fit(self, X, y):
        X = np.array(X)
        y = np.array(y)
        self.n_features = X.shape[1]
        if self.max_features is None:
            self.max_features = self.n_features
        self.root = self._build_tree(X, y, depth=0)
        return self
    
    def _build_tree(self, X, y, depth):
        n_samples, n_features = X.shape
        
        if (depth >= self.max_depth or n_samples < self.min_samples_split or len(np.unique(y)) == 1):
            return Node(value=np.mean(y))
        
        best_feature, best_threshold = self._find_best_split(X, y)
        
        if best_feature is None:
            return Node(value=np.mean(y))
        
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return Node(feature_idx=best_feature, threshold=best_threshold,
                    left=left_child, right=right_child)
    
    def _find_best_split(self, X, y):
        n_samples, n_features = X.shape
        
        # Random feature selection for Random Forest
        feature_indices = np.random.choice(n_features, self.max_features, replace=False)
        
        best_rss = float('inf')
        best_feature = None
        best_threshold = None
        
        for feature_idx in feature_indices:
            thresholds = np.unique(X[:, feature_idx])
            
            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                right_mask = ~left_mask
                
                if (np.sum(left_mask) < self.min_samples_leaf or 
                    np.sum(right_mask) < self.min_samples_leaf):
                    continue
                
                y_left = y[left_mask]
                y_right = y[right_mask]
                rss = self._calculate_rss(y_left, y_right)
                
                if rss < best_rss:
                    best_rss = rss
                    best_feature = feature_idx
                    best_threshold = threshold
        
        return best_feature, best_threshold
    
    def _calculate_rss(self, y_left, y_right):
        rss_left = np.sum((y_left - np.mean(y_left))**2) if len(y_left) > 0 else 0
        rss_right = np.sum((y_right - np.mean(y_right))**2) if len(y_right) > 0 else 0
        return rss_left + rss_right
    
    def predict(self, X):
        X = np.array(X)
        return np.array([self._predict_single(x, self.root) for x in X])
    
    def _predict_single(self, x, node):
        if node.is_leaf():
            return node.value
        
        if x[node.feature_idx] <= node.threshold:
            return self._predict_single(x, node.left)
        else:
            return self._predict_single(x, node.right)

print('✅ DecisionTreeRegressorScratch with max_features implemented')

In [None]:
class RandomForestRegressorScratch:
    """Random Forest from scratch using bootstrap + random features"""
    
    def __init__(self, n_estimators=100, max_depth=10, min_samples_split=2, 
                 max_features='sqrt', random_state=None):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.random_state = random_state
        self.trees = []
        self.oob_indices = []  # Track OOB samples for each tree
    
    def fit(self, X, y):
        """Train forest of trees with bootstrap sampling"""
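        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        # Reset state so that calling fit() again does not accumulate old trees
        self.trees = []
        self.oob_indices = []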
        
        X = np.array(X)
        y = np.array(y)
        n_samples, n_features = X.shape
        
        # Determine max_features (always at least 1, never more than n_features)
        if self.max_features == 'sqrt':
            max_features = max(1, int(np.sqrt(n_features)))
        elif self.max_features == 'log2':
            # int(log2(1)) == 0, which would leave no candidate features
            max_features = max(1, int(np.log2(n_features)))
        elif isinstance(self.max_features, int):
            max_features = min(self.max_features, n_features)
        else:  # Default to p/3 for regression
            max_features = max(1, n_features // 3)
        
        # Build each tree
        for i in range(self.n_estimators):
            # Bootstrap sample
            bootstrap_indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]
            
            # Track OOB indices
            oob_mask = np.ones(n_samples, dtype=bool)
            oob_mask[bootstrap_indices] = False
            self.oob_indices.append(np.where(oob_mask)[0])
            
            # Train tree
            tree = DecisionTreeRegressorScratch(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                max_features=max_features
            )
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)
        
        return self
    
    def predict(self, X):
        """Average predictions from all trees"""
        X = np.array(X)
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.mean(tree_predictions, axis=0)
    
    def compute_oob_score(self, X, y):
        """Compute out-of-bag R² score"""
        X = np.array(X)
        y = np.array(y)
        n_samples = len(y)
        oob_predictions = np.zeros(n_samples)
        oob_counts = np.zeros(n_samples)
        
        # For each tree, predict its OOB samples
        for tree, oob_idx in zip(self.trees, self.oob_indices):
            if len(oob_idx) > 0:
                oob_predictions[oob_idx] += tree.predict(X[oob_idx])
                oob_counts[oob_idx] += 1
        
        # Average OOB predictions
        mask = oob_counts > 0
        oob_predictions[mask] /= oob_counts[mask]
        
        # Compute R²
        ss_res = np.sum((y[mask] - oob_predictions[mask])**2)
        ss_tot = np.sum((y[mask] - np.mean(y[mask]))**2)
        oob_r2 = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0
        
        return oob_r2

print('✅ RandomForestRegressorScratch implemented')

---

### 📝 What's Happening in This Code?

**Purpose:** Generate noisy non-linear data to demonstrate Random Forest variance reduction

**Key Points:**
- **Non-linear function**: $y = \sin(2\pi x) + \text{noise}$ (challenging for linear models)
- **High noise**: Substantial random noise tests ensemble robustness
- **Single tree weakness**: Individual tree will overfit noise
- **Random Forest strength**: Averaging smooths out overfitting

**Why This Matters:** A sine function with noise is a convenient test case: a single deep tree produces a jagged staircase, while the Random Forest averages many staircases into a much smoother curve.

In [None]:
# Generate non-linear data with noise
np.random.seed(42)
X_train = np.random.uniform(0, 1, size=(200, 1))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + np.random.normal(0, 0.3, 200)

X_test = np.random.uniform(0, 1, size=(50, 1))
y_test = np.sin(2 * np.pi * X_test[:, 0]) + np.random.normal(0, 0.3, 50)

print('Training data generated:')
print(f'  Train samples: {len(X_train)}')
print(f'  Test samples: {len(X_test)}')
print(f'  True function: y = sin(2πx) + noise')
print(f'  Noise std: 0.3')

---

### 📝 What's Happening in This Code?

**Purpose:** Train single tree vs Random Forest to demonstrate variance reduction

**Key Points:**
- **Single tree**: max_depth=10 (intentionally deep to overfit)
- **Random Forest**: 50 trees with max_depth=10 (each overfits, but average is smooth)
- **max_features**: sqrt(1) = 1 (the data has a single feature, so every split uses it; tree diversity here comes only from bootstrap sampling)
- **OOB score**: Built-in validation without separate test set

**Why This Matters:** Demonstrates core Random Forest principle - averaging high-variance models produces low-variance ensemble (wisdom of the crowd).

In [None]:
# Train single deep tree (high variance)
single_tree = DecisionTreeRegressorScratch(max_depth=10, min_samples_split=2)
single_tree.fit(X_train, y_train)

# Train Random Forest (variance reduction via averaging)
rf_scratch = RandomForestRegressorScratch(
    n_estimators=50,
    max_depth=10,
    min_samples_split=2,
    max_features='sqrt',
    random_state=42
)
rf_scratch.fit(X_train, y_train)

# Predictions
y_train_pred_tree = single_tree.predict(X_train)
y_test_pred_tree = single_tree.predict(X_test)

y_train_pred_rf = rf_scratch.predict(X_train)
y_test_pred_rf = rf_scratch.predict(X_test)

# Metrics
train_mse_tree = np.mean((y_train - y_train_pred_tree)**2)
test_mse_tree = np.mean((y_test - y_test_pred_tree)**2)

train_mse_rf = np.mean((y_train - y_train_pred_rf)**2)
test_mse_rf = np.mean((y_test - y_test_pred_rf)**2)

# OOB score
oob_r2 = rf_scratch.compute_oob_score(X_train, y_train)

print('Performance Comparison:')
print('\nSingle Tree (depth=10):')
print(f'  Train MSE: {train_mse_tree:.4f}')
print(f'  Test MSE: {test_mse_tree:.4f}')
print(f'  Overfit ratio: {test_mse_tree / train_mse_tree:.2f}x')

print('\nRandom Forest (50 trees, depth=10):')
print(f'  Train MSE: {train_mse_rf:.4f}')
print(f'  Test MSE: {test_mse_rf:.4f}')
print(f'  OOB R²: {oob_r2:.4f}')
print(f'  Overfit ratio: {test_mse_rf / train_mse_rf:.2f}x')

print(f'\n✅ Random Forest reduces test MSE by {(1 - test_mse_rf/test_mse_tree)*100:.1f}%')

---

### 📝 What's Happening in This Code?

**Purpose:** Visualize how Random Forest smooths jagged single-tree predictions

**Key Points:**
- **True function**: Smooth sine wave (dashed black line)
- **Single tree**: Jagged staircase overfits training noise (red)
- **Random Forest**: Smoother curve that tracks the true function more closely (blue); it is still piecewise-constant, but the steps are much finer
- **Variance reduction**: Forest averages out individual tree mistakes

**Why This Matters:** Visual evidence that ensemble averaging produces smoother predictions that generalize better than a single deep tree.

In [None]:
# Visualization
X_plot = np.linspace(0, 1, 300).reshape(-1, 1)
y_true = np.sin(2 * np.pi * X_plot[:, 0])
y_pred_tree = single_tree.predict(X_plot)
y_pred_rf = rf_scratch.predict(X_plot)

plt.figure(figsize=(12, 5))

# Single tree
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, alpha=0.3, s=30, label='Training data')
plt.plot(X_plot, y_true, 'k--', linewidth=2, label='True function')
plt.plot(X_plot, y_pred_tree, 'r-', linewidth=2, label='Single tree')
plt.xlabel('X')
plt.ylabel('y')
plt.title(f'Single Tree (Test MSE: {test_mse_tree:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)

# Random Forest
plt.subplot(1, 2, 2)
plt.scatter(X_train, y_train, alpha=0.3, s=30, label='Training data')
plt.plot(X_plot, y_true, 'k--', linewidth=2, label='True function')
plt.plot(X_plot, y_pred_rf, 'b-', linewidth=2, label='Random Forest (50 trees)')
plt.xlabel('X')
plt.ylabel('y')
plt.title(f'Random Forest (Test MSE: {test_mse_rf:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('\n✅ Section summary:')
print('  - Random Forest theory (bagging, bootstrap, random features)')
print('  - From-scratch implementation with OOB scoring')
print('  - Variance reduction demonstration (single tree vs forest)')
print('  - Visual confirmation: smooth forest vs jagged tree')

---

## 🏭 Production Implementation: Sklearn RandomForestRegressor

### 📝 What's Happening in This Code?

**Purpose:** Use sklearn's optimized Random Forest with advanced features

**Key Points:**
- **sklearn.ensemble.RandomForestRegressor**: Cython-optimized implementation, typically far faster than the pure-Python from-scratch version
- **Parallel training**: n_jobs=-1 uses all CPU cores
- **OOB built-in**: oob_score=True automatically computes OOB R²
- **Feature importance**: Aggregated from all trees (mean decrease in impurity)
- **Warm start**: Can add more trees incrementally without retraining

**Why This Matters:** Production Random Forests handle massive datasets with parallel processing and provide extensive diagnostics for model interpretation.
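
The warm-start point in the list above can be sketched on its own. With `warm_start=True`, raising `n_estimators` and calling `fit` again keeps the existing trees and trains only the additional ones. The synthetic dataset here is an assumption for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_ws, y_ws = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

rf_ws = RandomForestRegressor(n_estimators=50, warm_start=True, random_state=0)
rf_ws.fit(X_ws, y_ws)
n_before = len(rf_ws.estimators_)

# Raise the target and refit: the first 50 trees are kept, 50 new ones are added.
rf_ws.set_params(n_estimators=100)
rf_ws.fit(X_ws, y_ws)
n_after = len(rf_ws.estimators_)

print(n_before, n_after)  # 50 100
```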

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Train sklearn Random Forest
rf_sklearn = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    min_samples_split=2,
    max_features='sqrt',
    oob_score=True,
    n_jobs=-1,  # Use all CPU cores
    random_state=42
)
rf_sklearn.fit(X_train, y_train)

# Predictions
y_train_pred_sklearn = rf_sklearn.predict(X_train)
y_test_pred_sklearn = rf_sklearn.predict(X_test)

# Metrics
train_mse_sklearn = mean_squared_error(y_train, y_train_pred_sklearn)
test_mse_sklearn = mean_squared_error(y_test, y_test_pred_sklearn)
test_r2_sklearn = r2_score(y_test, y_test_pred_sklearn)
oob_score_sklearn = rf_sklearn.oob_score_

print('Sklearn RandomForestRegressor (100 trees):')
print(f'  Train MSE: {train_mse_sklearn:.4f}')
print(f'  Test MSE: {test_mse_sklearn:.4f}')
print(f'  Test R²: {test_r2_sklearn:.4f}')
print(f'  OOB Score (R²): {oob_score_sklearn:.4f}')

print(f'\nComparison with from-scratch:')
print(f'  From-scratch (50 trees): MSE = {test_mse_rf:.4f}')
print(f'  Sklearn (100 trees): MSE = {test_mse_sklearn:.4f}')
print(f'  Improvement: {(test_mse_rf - test_mse_sklearn)/test_mse_rf*100:.1f}%')

---

### 📝 What's Happening in This Code?

**Purpose:** Demonstrate impact of n_estimators on performance

**Key Points:**
- **n_estimators**: More trees → lower variance, but diminishing returns
- **OOB tracking**: Monitor OOB score to detect overfitting (or lack thereof)
- **Train vs OOB vs Test**: OOB closely tracks test performance
- **Convergence**: Typically plateau around 100-500 trees

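**Why This Matters:** Adding trees does not make a Random Forest overfit more: test error typically improves and then plateaus (unlike a single tree, which overfits as depth grows). Overfitting in a forest is controlled through tree depth and leaf size, not through the number of trees.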

In [None]:
# Study impact of n_estimators
n_trees_range = [1, 5, 10, 25, 50, 100, 200]
results = []

for n_trees in n_trees_range:
    rf = RandomForestRegressor(
        n_estimators=n_trees,
        max_depth=10,
        max_features='sqrt',
        oob_score=True,
        random_state=42
    )
    rf.fit(X_train, y_train)
    
    train_mse = mean_squared_error(y_train, rf.predict(X_train))
    test_mse = mean_squared_error(y_test, rf.predict(X_test))
    oob_r2 = rf.oob_score_
    
    results.append({
        'n_trees': n_trees,
        'train_mse': train_mse,
        'test_mse': test_mse,
        'oob_r2': oob_r2
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))

# Plot convergence
plt.figure(figsize=(10, 5))
plt.plot(results_df.n_trees, results_df.train_mse, 'b-o', label='Train MSE')
plt.plot(results_df.n_trees, results_df.test_mse, 'r-s', label='Test MSE')
plt.axhline(test_mse_tree, color='gray', linestyle='--', label='Single tree baseline')
plt.xlabel('Number of Trees')
plt.ylabel('MSE')
plt.title('Random Forest Convergence')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.tight_layout()
plt.show()

print(f'\n✅ Lowest test MSE at {results_df.loc[results_df.test_mse.idxmin(), "n_trees"]:.0f} trees (differences along the plateau are mostly noise)')

---

## 🔬 Post-Silicon Application: Robust Multi-Parameter Yield Prediction

### 📝 What's Happening in This Code?

**Purpose:** Predict device yield from noisy multi-parameter test data

**Key Points:**
- **10 test parameters**: Voltage, current, frequency, temperature, power, leakage, etc.
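- **Interactions**: Power is generated as V×I×F, so voltage and current affect yield only jointly through power; frequency, leakage, noise margin, and temperature enter the yield score directly, while delay, jitter, and skew are pure distractors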
- **Noisy data**: Real test data has measurement errors and outliers
- **Random Forest advantage**: Robust to noise, captures interactions automatically

**Why This Matters:** Production test data is messy. Random Forests tolerate noise and outliers better than single trees or linear models, which makes them a good fit for real-world semiconductor yield prediction.

In [None]:
# Generate synthetic multi-parameter yield data
np.random.seed(42)
n_devices = 1000

# 10 test parameters
voltage = np.random.uniform(0.95, 1.05, n_devices)
current = np.random.uniform(0.8, 1.2, n_devices)
frequency = np.random.uniform(450, 550, n_devices)
temperature = np.random.uniform(25, 85, n_devices)
power = voltage * current * frequency / 100 + np.random.normal(0, 0.5, n_devices)
leakage = np.random.exponential(scale=10, size=n_devices)
delay = np.random.uniform(5, 15, n_devices)
noise_margin = np.random.uniform(0.1, 0.3, n_devices)
jitter = np.random.uniform(10, 50, n_devices)
skew = np.random.uniform(-5, 5, n_devices)

# Yield depends on complex interactions
# Good devices: low power, high frequency, low leakage, good margins
yield_score = (
    100 +
    10 * (frequency - 500) / 50 +  # Higher freq → higher yield
    -15 * (power - 5) / 2 +        # Higher power → lower yield
    -8 * (leakage - 10) / 10 +     # Higher leakage → lower yield
    5 * (noise_margin - 0.2) / 0.1 +  # Better margin → higher yield
    -3 * (temperature - 55) / 30 +  # Higher temp → lower yield
    np.random.normal(0, 5, n_devices)  # Measurement noise
)

# Convert to binary yield (pass if yield_score > 95, otherwise fail)
yield_binary = (yield_score > 95).astype(int)

# Create DataFrame
yield_data = pd.DataFrame({
    'Voltage': voltage,
    'Current': current,
    'Frequency': frequency,
    'Temperature': temperature,
    'Power': power,
    'Leakage': leakage,
    'Delay': delay,
    'Noise_Margin': noise_margin,
    'Jitter': jitter,
    'Skew': skew,
    'Yield': yield_binary
})

print('Multi-Parameter Yield Data:')
print(yield_data.head(10))
print(f'\nDataset: {len(yield_data)} devices')
print(f'Yield rate: {yield_binary.mean()*100:.1f}%')
print(f'Features: {yield_data.shape[1]-1} test parameters')

---

### 📝 What's Happening in This Code?

**Purpose:** Train Random Forest classifier for binary yield prediction

**Key Points:**
- **RandomForestClassifier**: For binary classification (pass/fail)
- **Class imbalance**: Yield rate ~60-80% typical (handle via class_weight if needed)
- **Feature importance**: Reveals which parameters drive yield
- **Accuracy + AUC**: Both metrics important for production deployment

**Why This Matters:** Feature importance guides test optimization - focus on parameters that strongly predict yield, reduce testing of irrelevant parameters.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Split data
X_yield = yield_data.drop('Yield', axis=1).values
y_yield = yield_data['Yield'].values
X_yield_train, X_yield_test, y_yield_train, y_yield_test = train_test_split(
    X_yield, y_yield, test_size=0.2, random_state=42, stratify=y_yield
)

# Train Random Forest classifier
rf_yield = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_split=10,
    max_features='sqrt',
    oob_score=True,
    n_jobs=-1,
    random_state=42
)
rf_yield.fit(X_yield_train, y_yield_train)

# Predictions
y_yield_pred = rf_yield.predict(X_yield_test)
y_yield_proba = rf_yield.predict_proba(X_yield_test)[:, 1]

# Metrics
accuracy = accuracy_score(y_yield_test, y_yield_pred)
auc = roc_auc_score(y_yield_test, y_yield_proba)
oob_score = rf_yield.oob_score_

print('Yield Prediction Model Performance:')
print(f'  Accuracy: {accuracy:.3f}')
print(f'  AUC-ROC: {auc:.3f}')
print(f'  OOB Score: {oob_score:.3f}')
print('\nClassification Report:')
print(classification_report(y_yield_test, y_yield_pred, target_names=['Fail', 'Pass']))

# Feature importance
feature_names = yield_data.columns[:-1]
importances = rf_yield.feature_importances_
indices = np.argsort(importances)[::-1]

print('\nFeature Importance (Top 5):')
for i in range(5):
    print(f'  {feature_names[indices[i]]}: {importances[indices[i]]:.3f}')

In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(importances)), importances[indices], color='skyblue')
plt.yticks(range(len(importances)), [feature_names[i] for i in indices])
plt.xlabel('Feature Importance')
plt.title('Random Forest Feature Importance for Yield Prediction')
plt.tight_layout()
plt.show()

print('\n✅ Feature importance reveals:')
print('  - Which parameters most strongly predict yield')
print('  - Candidates for test time reduction (skip low-importance features)')
print('  - Process optimization targets (improve high-importance parameters)')

---

## 🎯 Real-World Project Ideas

### Post-Silicon Validation Projects (4)

#### 1. **Robust Multi-Failure Mode Detection**
**Objective:** Multi-class classification of failure modes with noisy test data.

**Business Value:** Identify root causes faster (voltage fail vs current fail vs timing fail). Reduces debug time by 40-60% compared to manual analysis.

**Key Features:** All parametric tests, wafer coordinates, lot info, temperature, voltage corners

**Implementation:** RandomForestClassifier with 200-500 trees, tune max_depth (10-20), use permutation importance to identify diagnostic tests

**Success Metric:** 90% failure mode classification accuracy, <5% test time increase

---

#### 2. **Test Importance Ranking for Flow Optimization**
**Objective:** Rank tests by importance, eliminate redundant/low-value tests.

**Business Value:** Reduce test time 20-35% while maintaining yield coverage. $500K-$2M annual savings per product.

**Key Features:** Historical test results (all parameters), final yield outcomes

**Implementation:** Train RF on full test suite, use feature importance + permutation importance to rank tests, validate removal of bottom 20% on holdout data

**Success Metric:** 25% test time reduction, <0.5% yield escape rate increase

---

#### 3. **Wafer-Level Spatial Pattern Classification**
**Objective:** Classify wafer defects (edge fail, center fail, random, systematic).

**Business Value:** Automated root cause analysis for yield improvement. Edge fails → etch issue, center fails → deposition uniformity, random → particles.

**Key Features:** Die coordinates (x, y), parametric values, neighbors' results, radial distance

**Implementation:** Random Forest with spatial features, visualize decision boundaries on wafer maps

**Success Metric:** 85% pattern classification accuracy, actionable insights for process engineers

---

#### 4. **Robust Yield Prediction with Missing Data**
**Objective:** Predict final test yield from incomplete wafer test data.

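**Business Value:** Early yield prediction enables production planning. Surrogate splits for missing values are a classic CART feature that scikit-learn does not implement, so either impute first or use a scikit-learn release with native NaN support.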

**Key Features:** Wafer test parameters (10-30% missing randomly), lot characteristics

**Implementation:** RF trained on complete data, evaluate on data with artificial missingness

**Success Metric:** <5% RMSE degradation with 30% missing data vs complete data
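
A possible starting point for this project is sketched below: train on complete data, then measure RMSE as entries are knocked out at random. Median imputation via `SimpleImputer` is an assumption of this sketch (one simple option among several), and the synthetic data stands in for real wafer test parameters.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_md, y_md = make_regression(n_samples=800, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_md, y_md, test_size=0.25, random_state=0)

# The imputer learns per-feature medians from the complete training data.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(X_tr, y_tr)

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

rng = np.random.default_rng(0)
rmse_by_rate = {}
for rate in (0.0, 0.1, 0.3):
    X_missing = X_te.copy()
    # Replace a random fraction of test entries with NaN.
    X_missing[rng.random(X_missing.shape) < rate] = np.nan
    rmse_by_rate[rate] = rmse(y_te, model.predict(X_missing))
    print(f"missing rate {rate:.0%}: RMSE = {rmse_by_rate[rate]:.2f}")
```

Comparing `rmse_by_rate[0.3]` with `rmse_by_rate[0.0]` gives the degradation figure that the success metric refers to.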

---

### General AI/ML Projects (4)

#### 5. **Credit Scoring with Noisy Data**
**Objective:** Predict loan default from applicant data with missing values and outliers.

**Business Value:** Random Forests are robust to outliers and noisy features; missing values still need imputation or a library version with native NaN support. Target: a 15-20% lower default rate than simpler models.

**Key Features:** Income, credit history, employment, debt-to-income, age, location

**Success Metric:** AUC > 0.80, maintained with 20% of input values missing

---

#### 6. **Medical Diagnosis Ensemble**
**Objective:** Diagnose disease from symptoms and test results using ensemble voting.

**Business Value:** Ensemble reduces misdiagnosis risk. Feature importance reveals most diagnostic symptoms for triage.

**Key Features:** Symptoms (binary), vital signs, lab tests, demographics, medical history

**Success Metric:** 90% diagnostic accuracy, interpretable feature importance for clinicians

---

#### 7. **Customer Churn Prediction**
**Objective:** Predict which customers will churn next quarter.

**Business Value:** Target retention campaigns at high-risk customers. RF handles mixed data types (numerical usage + categorical plan type) once categoricals are encoded, and needs no feature scaling.

**Key Features:** Usage patterns, support tickets, billing amount, contract type, tenure

**Success Metric:** 75% churn prediction accuracy, 20% reduction in churn via targeted campaigns

---

#### 8. **Kaggle Competition Baseline**
**Objective:** Random Forest as strong baseline for any tabular data competition.

**Business Value:** Minimal tuning, fast training, and a competitive out-of-the-box score. Establishes performance floor for more complex models.

**Key Features:** Any tabular data (mixed types, missing values, outliers)

**Success Metric:** Achieve top 25% leaderboard with default hyperparameters

---

## ✅ Key Takeaways

### When to Use Random Forests

| **Scenario** | **Random Forests** | **Single Decision Tree** | **XGBoost** |
|-------------|-------------------|-------------------------|-------------|
| **Robustness to noise** | ✅ Excellent | ❌ Overfits | ✅ Very good |
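| **Missing data handling** | ⚠️ Implementation-dependent (native NaN support in scikit-learn ≥ 1.4) | ⚠️ Implementation-dependent (surrogate splits in classic CART) | ✅ Built-in (learned default split direction) |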
| **Training speed** | ✅ Fast (parallel) | ✅ Very fast | ⚠️ Slower |
| **Tuning difficulty** | ✅ Minimal | ⚠️ Easy to overfit | ⚠️ Many hyperparameters |
| **Interpretability** | ⚠️ Feature importance only | ✅ Full tree readable | ⚠️ Feature importance only |
| **Performance ceiling** | ✅ High | ❌ Low (high variance) | ✅ Highest |
| **Overfitting risk** | ✅ Low (averaging) | ❌ High | ⚠️ Moderate |
| **Production deployment** | ✅ Stable, reliable | ❌ Unstable | ✅ Best performance |

### Best Practices

1. **Hyperparameter tuning priorities:**
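   - **n_estimators**: 100-500 (adding trees does not increase overfitting; gains usually flatten out after a few hundred)
   - **max_features**: sqrt(p) for classification, p/3 for regression (Breiman's heuristics; scikit-learn's classifier defaults to `"sqrt"`, but its regressor defaults to all p features, so set this explicitly for regression)
   - **max_depth**: 10-30 (can be deeper than for a single tree, because averaging offsets much of the added variance)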
   - **min_samples_split**: 2-20 (higher for very noisy data)
   - **min_samples_leaf**: 1-10 (higher creates smoother predictions)

2. **Feature engineering:**
   - No scaling needed (tree-based)
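   - Categorical features must be encoded as numbers; ordinal encoding is usually sufficient for trees
   - Missing values: scikit-learn forests do not use surrogate splits; version 1.4+ accepts NaN natively, older versions need imputation first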
   - Interaction features helpful but not required (trees capture some interactions)

3. **Overfitting prevention:**
   - Use the OOB score for validation (set `oob_score=True` with bootstrap sampling enabled; a separate validation set is often unnecessary)
   - More trees → lower variance (unlike single tree)
   - Increase min_samples_split if overfitting persists
   - Consider max_features < p for decorrelation

4. **Production deployment:**
   - Use n_jobs=-1 for parallel training (utilize all CPUs)
   - Monitor feature importance drift (indicates data distribution changes)
   - `warm_start=True` allows incremental tree addition (raise `n_estimators` and call `fit` again) without retraining existing trees
   - Serialize with joblib.dump() for fast loading
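
Several of the practices above (OOB validation, parallel training, warm start, and joblib serialization) can be combined in one short sketch. The data comes from `make_regression` and the tree counts are arbitrary illustrative choices; only the scikit-learn and joblib calls themselves are standard.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(
    n_samples=1000, n_features=10, n_informative=5, noise=10.0, random_state=0
)

# oob_score=True relies on bootstrap sampling, which is enabled by default.
rf = RandomForestRegressor(
    n_estimators=50,
    oob_score=True,
    warm_start=True,   # keep existing trees when fit() is called again
    n_jobs=-1,         # train trees in parallel on all available cores
    random_state=0,
)

# Grow the forest in stages and track the OOB R^2 after each stage.
oob_r2 = {}
for n_trees in (50, 100, 200):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)  # with warm_start, only the additional trees are trained
    oob_r2[n_trees] = rf.oob_score_

for n_trees, score in oob_r2.items():
    print(f"{n_trees:>4} trees: OOB R^2 = {score:.3f}")

# Round-trip the fitted model through joblib, as one would before deployment.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rf_model.joblib")
    joblib.dump(rf, model_path)
    restored = joblib.load(model_path)

same_predictions = np.allclose(rf.predict(X[:5]), restored.predict(X[:5]))
print("Restored model matches:", same_predictions)
```

When the OOB score stops improving between stages, adding more trees mostly costs memory and prediction time. Models saved with joblib should be loaded with the same scikit-learn version that produced them.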

### Limitations

- **Memory usage**: Stores all trees (can be large for 1000+ deep trees)
- **Prediction speed**: Slower than single tree (must query all trees)
- **Extrapolation**: Cannot predict beyond training range (piecewise constant)
- **Interpretability**: Less interpretable than single tree (black box ensemble)
- **Imbalanced data**: May need class_weight='balanced' for minority class
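
The extrapolation limitation is easy to see on a toy problem. The sketch below fits a forest to a noisy linear trend on the range [0, 10] and then queries points outside it; the data and the query points are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Noisy linear trend y = 2x, observed only for x in [0, 10].
X_train = rng.uniform(0.0, 10.0, size=(500, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(scale=0.5, size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Query points beyond the training range; the true trend would give 24, 30, 40.
X_outside = np.array([[12.0], [15.0], [20.0]])
pred_outside = rf.predict(X_outside)

for x, pred in zip(X_outside[:, 0], pred_outside):
    print(f"x = {x:5.1f}  true trend = {2.0 * x:5.1f}  forest = {pred:5.2f}")
```

Every query point lands in the right-most leaf of each tree, so the forest returns the same value for all three and never exceeds the largest target seen in training. If the deployment range can drift beyond the training range, a model with a parametric trend component is a safer choice.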

### Next Steps

- **018_Gradient_Boosting.ipynb:** Sequential tree building (boosting vs bagging)
- **019_XGBoost.ipynb:** High-performance gradient boosting with regularization
- **020_LightGBM.ipynb:** Fast histogram-based boosting for large datasets
- **Advanced:** Extremely randomized trees, isolation forests, Mondrian forests

---

## 📚 References & Further Reading

**Foundational Papers:**
- Breiman (2001): Random Forests - Original paper combining bagging with random feature selection
- Breiman (1996): Bagging Predictors - Bootstrap aggregating foundation

**Sklearn Documentation:**
- RandomForestRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- RandomForestClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- User guide: https://scikit-learn.org/stable/modules/ensemble.html#forest

**Advanced Topics:**
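- Out-of-bag error estimation (a built-in estimate that behaves much like cross-validation)
- Permutation importance (measured on held-out data and free of the high-cardinality bias of impurity importance; correlated features still share credit)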
- Partial dependence plots (visualize feature effects)
- Proximity matrices (measure sample similarity)
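
As a starting point for the permutation importance topic, the sketch below compares it with the impurity-based `feature_importances_` on synthetic data that includes a near-duplicate feature and a pure-noise feature. The data-generating process is an assumption made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1500

x0 = rng.normal(size=n)
x1 = x0 + rng.normal(scale=0.1, size=n)  # near-duplicate of x0 (highly correlated)
x2 = rng.normal(size=n)                  # independent signal
x3 = rng.normal(size=n)                  # pure noise, unrelated to the target
X = np.column_stack([x0, x1, x2, x3])
y = 2.0 * x0 + 1.0 * x2 + rng.normal(scale=0.3, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importance: computed from the training data during fitting.
impurity_importance = rf.feature_importances_

# Permutation importance: drop in held-out R^2 when one column is shuffled.
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=0, n_jobs=-1
)

for i, name in enumerate(["x0", "x1 (copy of x0)", "x2", "x3 (noise)"]):
    print(
        f"{name:<16} impurity = {impurity_importance[i]:.3f}  "
        f"permutation = {result.importances_mean[i]:.3f} "
        f"+/- {result.importances_std[i]:.3f}"
    )
```

Both measures split the credit for `x0` between it and its near-duplicate, so neither method resolves correlated features on its own. Grouping correlated features before ranking them is one common workaround.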

---

**Notebook Complete!** 🎉

You now understand:
- ✅ Random Forest theory (bagging, bootstrap, random features, variance reduction)
- ✅ From-scratch implementation with OOB scoring
- ✅ Production sklearn usage (parallel training, feature importance)
- ✅ Post-silicon applications (yield prediction, failure detection, test optimization)
- ✅ General AI/ML applications (credit scoring, medical diagnosis, churn prediction)
- ✅ Feature importance for interpretability
- ✅ Hyperparameter tuning (n_estimators, max_features, max_depth)
- ✅ 8 real-world projects to practice

**Next:** `018_Gradient_Boosting.ipynb` for sequential boosting (different from bagging).