# 016 - Decision Trees: Non-Linear Modeling and Automatic Feature Selection

## 📋 Learning Objectives

By the end of this notebook, you will be able to:

1. **Understand decision tree fundamentals** - recursive partitioning, splitting criteria, tree structure
2. **Implement tree algorithms from scratch** - CART, Gini impurity, information gain, pruning
3. **Use sklearn DecisionTreeRegressor/Classifier** - production implementations with hyperparameter tuning
4. **Apply to post-silicon validation** - non-linear V-F relationships, automatic bin prediction, test flow optimization
5. **Interpret tree models** - feature importance, decision paths, visualization
6. **Prevent overfitting** - max_depth, min_samples_split, pruning strategies

---

## 🎯 Why Decision Trees?

### The Problem with Linear Models

**Linear regression assumes:**
- $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p$
- Constant effect of each feature across all ranges
- No automatic feature interactions

**Real-world problems often have:**
- **Non-linear relationships:** Device frequency can scale non-linearly with voltage (modeled as $V^2$ in this notebook's synthetic data)
- **Threshold effects:** Test fails if temperature > 85°C (discrete jump)
- **Feature interactions:** Power depends on $V \times Frequency$ (multiplicative)
- **Hierarchical decisions:** If wafer defect density > 5%, then check spatial pattern...

### Decision Tree Advantages

✅ **Non-linear by design** - captures complex relationships without feature engineering
✅ **Automatic feature interactions** - splits can combine features naturally
✅ **Interpretable** - human-readable rules (if Vdd > 1.1V, then predict high power)
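✅ **Handles mixed data** - tree algorithms can split on numerical and categorical features directly (note: scikit-learn's trees still expect categorical features to be numerically encoded)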
✅ **Feature selection** - automatically ignores irrelevant features
✅ **No scaling required** - invariant to monotonic transformations
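
The scaling claim can be checked directly. The sketch below is illustrative and not part of the notebook's pipeline: it fits two scikit-learn trees with identical settings, one on a raw feature and one on its logarithm, and compares their predictions on the training points. The synthetic data and all variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# One well-separated positive feature and a noisy non-linear target (synthetic)
x = np.arange(1, 101, dtype=float).reshape(-1, 1)
y = np.sin(x[:, 0] / 10.0) + rng.normal(0, 0.1, size=100)

# Same settings; only the feature representation differs
tree_raw = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x, y)
tree_log = DecisionTreeRegressor(max_depth=3, random_state=0).fit(np.log(x), y)

pred_raw = tree_raw.predict(x)
pred_log = tree_log.predict(np.log(x))

# A monotonic transform keeps the sample ordering, so the same partitions are found
same_partition = bool(np.allclose(pred_raw, pred_log))
print("Identical training predictions:", same_partition)
```

Predictions for new points that fall between training values can still differ slightly, because scikit-learn places each threshold at the midpoint between neighboring values in whatever space the feature is given.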

### Real-World Impact

**Post-Silicon:**
- **Test flow optimization:** Decision rules for conditional testing
- **Automatic binning:** Multi-class classification into speed bins without manual thresholds
- **Failure diagnosis:** Tree reveals failure mode hierarchy (voltage → temp → frequency)

**General AI/ML:**
- **Medical diagnosis:** Symptom-based decision paths mimic clinical reasoning
- **Credit risk:** Interpretable loan approval rules for regulatory compliance
- **Customer segmentation:** Behavioral rules for targeted marketing

---

## 🔄 Decision Tree Workflow

```mermaid
graph TD
    A[Training Data] --> B[Select Best Feature to Split]
    B --> C[Split Data into Left/Right]
    C --> D{Stopping Criteria?}
    D -->|No| B
    D -->|Yes| E[Create Leaf Node]
    E --> F[Predict: Mean for Regression, Mode for Classification]
    
    G[New Data Point] --> H[Start at Root Node]
    H --> I{Feature Value <= Threshold?}
    I -->|Yes| J[Go to Left Child]
    I -->|No| K[Go to Right Child]
    J --> L{Leaf Node?}
    K --> L
    L -->|No| I
    L -->|Yes| M[Return Prediction]
    
    style A fill:#e1f5ff
    style G fill:#ffe1e1
    style M fill:#e1ffe1
```

### Key Concepts

1. **Recursive Partitioning:** Split feature space into rectangular regions
2. **Greedy Algorithm:** Choose best split at each step (locally optimal)
3. **Impurity Reduction:** Measure how mixed a node is (Gini, entropy)
4. **Stopping Criteria:** Max depth, min samples per leaf, min impurity decrease
5. **Prediction:** Average (regression) or majority vote (classification) in leaf

---

## 📐 Mathematical Foundation

### 1. Tree Structure

A decision tree $T$ consists of:
- **Internal nodes:** $(feature\_index, threshold)$ pairs for splitting
- **Leaf nodes:** Predictions (constants)
- **Edges:** Left ($\leq$ threshold) and right ($>$ threshold)

### 2. Splitting Criteria (Regression)

**Goal:** Minimize variance within each node.

**Residual Sum of Squares (RSS):**

$$RSS = \sum_{i \in Left} (y_i - \bar{y}_{Left})^2 + \sum_{i \in Right} (y_i - \bar{y}_{Right})^2$$

Where:
- $\bar{y}_{Left} = \frac{1}{|Left|} \sum_{i \in Left} y_i$ (mean of left node)
- $\bar{y}_{Right} = \frac{1}{|Right|} \sum_{i \in Right} y_i$ (mean of right node)

**Best split:** $\arg\min_{feature, threshold} RSS$
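
As a worked example of the RSS criterion, the sketch below evaluates every candidate threshold on a six-point toy dataset. The data values and the helper name `split_rss` are assumptions chosen for illustration; the target has two plateaus so the best split is easy to verify by hand.

```python
import numpy as np

# Toy data: y is about 1 for x <= 3 and about 5 for x > 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 4.9, 5.1, 5.0])

def split_rss(x, y, threshold):
    """RSS of the left (x <= threshold) plus right (x > threshold) node."""
    left, right = y[x <= threshold], y[x > threshold]
    return float(np.sum((left - left.mean()) ** 2) +
                 np.sum((right - right.mean()) ** 2))

# The largest x is skipped because it would leave the right node empty
rss_by_threshold = {float(t): split_rss(x, y, t) for t in x[:-1]}
best_threshold = min(rss_by_threshold, key=rss_by_threshold.get)

for t, rss in rss_by_threshold.items():
    print(f"x <= {t:.0f}: RSS = {rss:.3f}")
print("Best threshold:", best_threshold)
```

The split at $x \leq 3$ leaves only the small within-plateau scatter (RSS of 0.02 on each side), while every other threshold mixes the two plateaus in one child and pays for it with a much larger RSS.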

### 3. Splitting Criteria (Classification)

**Option 1: Gini Impurity**

$$Gini(node) = 1 - \sum_{k=1}^{K} p_k^2$$

Where:
- $p_k$ = proportion of class $k$ in node
- $K$ = number of classes
- $Gini = 0$ → pure node (all same class)
- $Gini = 0.5$ → maximally mixed (binary classification, 50-50 split)

**Option 2: Entropy (Information Gain)**

$$Entropy(node) = -\sum_{k=1}^{K} p_k \log_2(p_k)$$

$$Information\_Gain = Entropy(parent) - \frac{|Left|}{|Total|} Entropy(Left) - \frac{|Right|}{|Total|} Entropy(Right)$$

**Best split:** Maximize information gain (or minimize weighted Gini)
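
The classification criteria above translate almost line for line into NumPy. This is a minimal sketch; the function names and the eight-sample example split are assumptions made for illustration.

```python
import numpy as np

def class_proportions(labels):
    """Proportion p_k of each class present in the node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    p = class_proportions(labels)  # only classes that occur, so p > 0
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# Balanced binary parent, split into two partially separated children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

parent_gini = gini(parent)
parent_entropy = entropy(parent)
gain = information_gain(parent, left, right)
print(f"Gini(parent)    = {parent_gini:.3f}")
print(f"Entropy(parent) = {parent_entropy:.3f}")
print(f"Information gain = {gain:.3f}")
```

For the balanced parent both measures are at their binary maximum (Gini 0.5, entropy 1 bit). The split moves each child to a 75/25 mix, which yields a gain of roughly 0.19 bits.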

### 4. Prediction

**Regression:**
$$\hat{y} = \frac{1}{|leaf|} \sum_{i \in leaf} y_i \quad \text{(mean of training samples in leaf)}$$

**Classification:**
$$\hat{y} = \arg\max_{k} \text{count}(class = k) \quad \text{(majority class in leaf)}$$

### 5. Overfitting and Pruning

**Problem:** Deep trees memorize training data (low bias, high variance).

**Cost-Complexity Pruning:**

$$Cost(T) = \sum_{leaves} RSS_{leaf} + \alpha |T|$$

Where:
- $|T|$ = number of leaf nodes
- $\alpha$ = complexity parameter (higher $\alpha$ → smaller tree)
- Prune subtrees whose removal lowers the cost (the added RSS is smaller than the $\alpha$ penalty saved by having fewer leaves)
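
To see the effect of $\alpha$ in practice, the sketch below uses scikit-learn's `cost_complexity_pruning_path` to list the candidate $\alpha$ values for a fully grown tree and then refits at a few of them. It assumes synthetic quadratic data like that used elsewhere in this notebook; note that scikit-learn measures node error as sample-weighted impurity rather than raw RSS, so its $\alpha$ values are on a different scale from the formula above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + 2 * X[:, 0] + rng.normal(0, 1, size=200)

# Candidate alphas at which the optimal pruned subtree changes
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)
alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative round-off

# Refit at five alphas spread across the path, from no pruning to the largest alpha
picked = alphas[np.linspace(0, len(alphas) - 1, 5).astype(int)]
leaf_counts = []
for alpha in picked:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X, y)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"alpha = {alpha:8.4f} -> {leaf_counts[-1]:3d} leaves")
```

In a real workflow the value of `ccp_alpha` would be chosen by cross-validation over these candidates rather than by inspecting leaf counts.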

---

## 🔨 Implementation from Scratch

### 📝 What's Happening in This Code?

**Purpose:** Build a regression tree using CART algorithm (Classification And Regression Trees)

**Key Points:**
- **Node class**: Stores split information (feature, threshold) or leaf prediction (value)
- **Best split search**: Try every feature and every unique value as potential threshold
- **RSS minimization**: Choose the split that minimizes the combined RSS of the left and right children
- **Recursive building**: Apply splitting recursively until stopping criteria met
- **Stopping criteria**: max_depth, min_samples_split, min_samples_leaf

**Why This Matters:** Understanding the greedy split selection reveals why trees overfit (memorize) and why ensemble methods (Random Forests, XGBoost) improve by combining multiple trees.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from typing import Optional

class Node:
    """Node in decision tree (internal or leaf)"""
    def __init__(self, feature_idx=None, threshold=None, left=None, right=None, value=None):
        self.feature_idx = feature_idx  # Feature index to split on
        self.threshold = threshold      # Threshold value for split
        self.left = left                # Left child (X[:, feature_idx] <= threshold)
        self.right = right              # Right child (X[:, feature_idx] > threshold)
        self.value = value              # Prediction value (for leaf nodes)
    
    def is_leaf(self):
        return self.value is not None

class DecisionTreeRegressorScratch:
    """CART Regression Tree from scratch"""
    
    def __init__(self, max_depth=5, min_samples_split=2, min_samples_leaf=1):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.root = None
    
    def fit(self, X, y):
        """Build tree recursively"""
        X = np.array(X)
        y = np.array(y)
        self.root = self._build_tree(X, y, depth=0)
        return self
    
    def _build_tree(self, X, y, depth):
        """Recursive tree building"""
        n_samples, n_features = X.shape
        
        # Stopping criteria
        if (depth >= self.max_depth or 
            n_samples < self.min_samples_split or
            len(np.unique(y)) == 1):
            return Node(value=np.mean(y))
        
        # Find best split
        best_feature, best_threshold = self._find_best_split(X, y)
        
        if best_feature is None:
            return Node(value=np.mean(y))
        
        # Split data
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        # Recursively build children
        left_child = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_child = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return Node(feature_idx=best_feature, threshold=best_threshold,
                    left=left_child, right=right_child)
    
    def _find_best_split(self, X, y):
        """Find feature and threshold that minimizes RSS"""
        n_samples, n_features = X.shape
        best_rss = float('inf')
        best_feature = None
        best_threshold = None
        
        # Try each feature
        for feature_idx in range(n_features):
            thresholds = np.unique(X[:, feature_idx])
            
            # Try each unique value as threshold
            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                right_mask = ~left_mask
                
                # Check min_samples_leaf
                if (np.sum(left_mask) < self.min_samples_leaf or 
                    np.sum(right_mask) < self.min_samples_leaf):
                    continue
                
                # Calculate RSS
                y_left = y[left_mask]
                y_right = y[right_mask]
                rss = self._calculate_rss(y_left, y_right)
                
                if rss < best_rss:
                    best_rss = rss
                    best_feature = feature_idx
                    best_threshold = threshold
        
        return best_feature, best_threshold
    
    def _calculate_rss(self, y_left, y_right):
        """Calculate residual sum of squares for split"""
        rss_left = np.sum((y_left - np.mean(y_left))**2) if len(y_left) > 0 else 0
        rss_right = np.sum((y_right - np.mean(y_right))**2) if len(y_right) > 0 else 0
        return rss_left + rss_right
    
    def predict(self, X):
        """Predict for each sample"""
        X = np.array(X)
        return np.array([self._predict_single(x, self.root) for x in X])
    
    def _predict_single(self, x, node):
        """Traverse tree to find prediction for single sample"""
        if node.is_leaf():
            return node.value
        
        if x[node.feature_idx] <= node.threshold:
            return self._predict_single(x, node.left)
        else:
            return self._predict_single(x, node.right)
    
    def print_tree(self, node=None, depth=0):
        """Print tree structure (for debugging)"""
        if node is None:
            node = self.root
        
        if node.is_leaf():
            print(f"{'  ' * depth}Leaf: predict {node.value:.2f}")
        else:
            print(f"{'  ' * depth}Feature {node.feature_idx} <= {node.threshold:.2f}")
            print(f"{'  ' * depth}Left:")
            self.print_tree(node.left, depth + 1)
            print(f"{'  ' * depth}Right:")
            self.print_tree(node.right, depth + 1)

print('✅ DecisionTreeRegressorScratch implemented')
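
The split search above recomputes the RSS from scratch for every threshold, which costs O(n²) per feature. One common speed-up, sketched here as an alternative and not as what the class above does, sorts the feature once and derives every candidate RSS from cumulative sums using $\sum (y_i - \bar{y})^2 = \sum y_i^2 - (\sum y_i)^2 / n$. The function name `best_split_sorted` is hypothetical.

```python
import numpy as np

def best_split_sorted(x, y, min_samples_leaf=1):
    """Best (threshold, rss) for one feature using the rule x <= threshold."""
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)

    csum, csum_sq = np.cumsum(ys), np.cumsum(ys ** 2)
    total, total_sq = csum[-1], csum_sq[-1]

    # Candidate i sends sorted samples 0..i to the left child
    n_left = np.arange(1, n)
    n_right = n - n_left
    left_sum, left_sq = csum[:-1], csum_sq[:-1]
    right_sum, right_sq = total - left_sum, total_sq - left_sq
    rss = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)

    # A threshold is only usable between two distinct feature values
    valid = ((xs[:-1] < xs[1:]) &
             (n_left >= min_samples_leaf) & (n_right >= min_samples_leaf))
    if not valid.any():
        return None, float("inf")
    rss = np.where(valid, rss, np.inf)
    i = int(np.argmin(rss))
    return float(xs[i]), float(rss[i])

# Toy check: two plateaus separated at x = 3
x_demo = np.array([4.0, 1.0, 6.0, 2.0, 5.0, 3.0])
y_demo = np.array([4.9, 1.1, 5.0, 0.9, 5.1, 1.0])
threshold, rss = best_split_sorted(x_demo, y_demo)
print(f"Best threshold: {threshold}, RSS: {rss:.3f}")
```

The sum-of-squares identity can lose precision when the targets are very large relative to their spread, so centering `y` first is a reasonable precaution.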

---

### 📝 What's Happening in This Code?

**Purpose:** Generate synthetic non-linear data to test decision tree

**Key Points:**
- **Non-linear function**: $y = x^2 + 2x + noise$ (quadratic relationship)
- **Decision tree advantage**: Can capture this without polynomial features
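- **Linear regression would fall short**: A straight line can follow the overall upward trend but not the curvature
- **Train/test data**: 200 training and 50 test samples drawn independently from the same distribution (an 80/20 ratio) for unbiased evaluation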

**Why This Matters:** Demonstrates tree's ability to model complex relationships that would require manual feature engineering in linear models.

In [None]:
# Generate non-linear data
np.random.seed(42)
X_train = np.random.uniform(-3, 3, size=(200, 1))
y_train = X_train[:, 0]**2 + 2*X_train[:, 0] + np.random.normal(0, 1, 200)

X_test = np.random.uniform(-3, 3, size=(50, 1))
y_test = X_test[:, 0]**2 + 2*X_test[:, 0] + np.random.normal(0, 1, 50)

print(f'Training samples: {len(X_train)}')
print(f'Test samples: {len(X_test)}')
print(f'True function: y = x^2 + 2x + noise')

---

### 📝 What's Happening in This Code?

**Purpose:** Train decision tree with different depths to demonstrate overfitting

**Key Points:**
- **Depth 2 (shallow)**: Underfits - can't capture full quadratic curve (high bias)
- **Depth 5 (medium)**: Good balance - captures trend without memorizing noise
- **Depth 15 (deep)**: Overfits - memorizes training noise, poor test performance (high variance)
- **MSE metric**: Lower is better, but test MSE is what matters for generalization

**Why This Matters:** Demonstrates bias-variance tradeoff - shallow trees underfit, deep trees overfit. Proper depth tuning is critical for good generalization.

In [None]:
# Train trees with different max_depth
depths = [2, 5, 15]
trees = {}
results = []

for depth in depths:
    tree = DecisionTreeRegressorScratch(max_depth=depth, min_samples_split=5)
    tree.fit(X_train, y_train)
    
    y_train_pred = tree.predict(X_train)
    y_test_pred = tree.predict(X_test)
    
    train_mse = np.mean((y_train - y_train_pred)**2)
    test_mse = np.mean((y_test - y_test_pred)**2)
    
    trees[depth] = tree
    results.append({'Depth': depth, 'Train MSE': train_mse, 'Test MSE': test_mse})
    
    print(f'\nDepth {depth}:')
    print(f'  Train MSE: {train_mse:.3f}')
    print(f'  Test MSE: {test_mse:.3f}')

results_df = pd.DataFrame(results)
print('\n' + '='*40)
print(results_df.to_string(index=False))

---

### 📝 What's Happening in This Code?

**Purpose:** Visualize how tree complexity affects fit quality

**Key Points:**
- **Depth 2**: Staircase with few steps - too simple (underfitting)
- **Depth 5**: Smooth staircase following curve - good fit
- **Depth 15**: Jagged staircase through points - memorizing noise (overfitting)
- **True function**: Smooth parabola (dashed black line)

**Why This Matters:** Visual confirmation of bias-variance tradeoff. Depth 5 balances complexity and generalization, while depth 15 creates overly complex boundaries.

In [None]:
# Visualize predictions
X_plot = np.linspace(-3, 3, 300).reshape(-1, 1)
y_true = X_plot[:, 0]**2 + 2*X_plot[:, 0]

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for idx, depth in enumerate(depths):
    ax = axes[idx]
    y_pred = trees[depth].predict(X_plot)
    
    ax.scatter(X_train, y_train, alpha=0.3, s=20, label='Training data')
    ax.plot(X_plot, y_true, 'k--', linewidth=2, label='True function')
    ax.plot(X_plot, y_pred, 'r-', linewidth=2, label=f'Tree (depth={depth})')
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.set_title(f'Depth {depth} - Test MSE: {results_df[results_df.Depth==depth]["Test MSE"].values[0]:.2f}')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('\n✅ Batch 1 Complete:')
print('  - Decision tree theory and math')
print('  - From-scratch CART implementation')
print('  - Overfitting demonstration (depth 2 vs 5 vs 15)')
print('  - Visual confirmation of bias-variance tradeoff')

---

## 🏭 Production Implementation: Sklearn DecisionTreeRegressor

### 📝 What's Happening in This Code?

**Purpose:** Compare from-scratch tree with sklearn's optimized implementation

**Key Points:**
- **sklearn.tree.DecisionTreeRegressor**: Production-ready, optimized CART implementation (compiled Cython)
- **Hyperparameters**: max_depth, min_samples_split, min_samples_leaf, max_features
- **Performance**: Typically far faster than the pure-Python version above, especially on large datasets
- **Additional features**: Feature importance, tree visualization, pruning (ccp_alpha)
- **Validation**: Similar MSE to the from-scratch tree is a sanity check; small differences are expected because sklearn places thresholds midway between neighboring feature values

**Why This Matters:** Sklearn trees are optimized for speed and include diagnostics (feature importance, visualization) essential for production ML systems.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Train sklearn tree
sklearn_tree = DecisionTreeRegressor(max_depth=5, min_samples_split=5, random_state=42)
sklearn_tree.fit(X_train, y_train)

# Predictions
y_train_pred_sklearn = sklearn_tree.predict(X_train)
y_test_pred_sklearn = sklearn_tree.predict(X_test)

# Metrics
train_mse_sklearn = mean_squared_error(y_train, y_train_pred_sklearn)
test_mse_sklearn = mean_squared_error(y_test, y_test_pred_sklearn)
test_r2_sklearn = r2_score(y_test, y_test_pred_sklearn)

print('Sklearn DecisionTreeRegressor (depth=5):')
print(f'  Train MSE: {train_mse_sklearn:.3f}')
print(f'  Test MSE: {test_mse_sklearn:.3f}')
print(f'  Test R²: {test_r2_sklearn:.3f}')

# Compare with from-scratch
scratch_test_mse = results_df[results_df.Depth == 5]['Test MSE'].values[0]
print(f'\nComparison (Test MSE):')
print(f'  From-scratch: {scratch_test_mse:.3f}')
print(f'  Sklearn: {test_mse_sklearn:.3f}')
print(f'  Difference: {abs(scratch_test_mse - test_mse_sklearn):.3f}')

---

### 📝 What's Happening in This Code?

**Purpose:** Demonstrate feature importance for interpretability

**Key Points:**
- **Feature importance**: Total reduction in the splitting criterion (squared error here, so RSS) achieved by the splits on each feature
- **Normalized**: Sums to 1.0 across all features
- **Post-silicon value**: Identifies which test parameters are most predictive
- **Test optimization**: Focus resources on important features, skip irrelevant ones

**Why This Matters:** Feature importance enables test flow optimization by revealing which parameters drive yield/performance predictions.
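
To make the definition concrete, the sketch below recomputes the importances by hand from a fitted tree's internal arrays (`tree_.impurity`, `tree_.weighted_n_node_samples`, `tree_.children_left`, `tree_.children_right`, `tree_.feature`) and compares them with `feature_importances_`. The three-feature synthetic dataset is an assumption made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Feature 1 is irrelevant by construction
y = 3 * X[:, 0] ** 2 + 2 * X[:, 2] + rng.normal(0, 0.5, size=300)

model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
t = model.tree_

manual = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf nodes have no children and contribute nothing
        continue
    # n_samples * MSE is the node's RSS, so this is the RSS removed by the split
    decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                - t.weighted_n_node_samples[left] * t.impurity[left]
                - t.weighted_n_node_samples[right] * t.impurity[right])
    manual[t.feature[node]] += decrease

manual /= manual.sum()  # normalize so the importances sum to 1
print("Manual :", np.round(manual, 3))
print("sklearn:", np.round(model.feature_importances_, 3))
```

Because the score only counts splits the tree actually made, a feature that is informative but redundant with another one can still receive an importance near zero.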

In [None]:
# Generate multi-feature data for feature importance demo
np.random.seed(42)
n_samples = 300
X_multi = np.random.randn(n_samples, 5)

# Target depends strongly on feature 0 and 2, weakly on 1, not at all on 3 and 4
y_multi = (3 * X_multi[:, 0]**2 +  # Strong non-linear
           2 * X_multi[:, 2] +      # Strong linear
           0.5 * X_multi[:, 1] +    # Weak
           np.random.normal(0, 0.5, n_samples))  # Noise
# Features 3 and 4 are irrelevant

# Train tree
tree_multi = DecisionTreeRegressor(max_depth=5, random_state=42)
tree_multi.fit(X_multi, y_multi)

# Feature importance
importances = tree_multi.feature_importances_
feature_names = ['Feature 0\n(strong non-linear)', 'Feature 1\n(weak)', 
                 'Feature 2\n(strong linear)', 'Feature 3\n(irrelevant)', 'Feature 4\n(irrelevant)']

# Plot
plt.figure(figsize=(10, 5))
plt.barh(feature_names, importances, color=['#ff6b6b', '#feca57', '#48dbfb', '#dfe6e9', '#dfe6e9'])
plt.xlabel('Feature Importance (RSS Reduction)')
plt.title('Decision Tree Feature Importance')
plt.xlim(0, 1)
for i, v in enumerate(importances):
    plt.text(v + 0.01, i, f'{v:.3f}', va='center')
plt.tight_layout()
plt.show()

print('\nFeature Importance:')
for name, imp in zip(feature_names, importances):
    print(f'  {name.replace(chr(10), " ")}: {imp:.3f}')

---

## 🔬 Post-Silicon Application: Non-Linear V-F Characterization

### 📝 What's Happening in This Code?

**Purpose:** Predict device frequency from voltage and temperature (non-linear)

**Key Points:**
- **Non-linear V-F relationship**: Frequency $\propto V^2$ (not captured by linear models)
- **Temperature effects**: Frequency decreases with temperature (mobility degradation)
- **Decision tree advantage**: Captures both effects without manual feature engineering
- **Test optimization**: Predict final test frequency from early voltage/temp measurements

**Why This Matters:** Enables early prediction of device performance, allowing test flow optimization and dynamic binning before expensive final tests.

In [None]:
# Generate synthetic V-F characterization data
np.random.seed(42)
n_devices = 500

# Features: Voltage (0.95-1.05V), Temperature (25-85C)
voltage = np.random.uniform(0.95, 1.05, n_devices)
temperature = np.random.uniform(25, 85, n_devices)

# True relationship: F = 500 * V^2 * (1 - 0.002*(T-25)) + noise
# Frequency decreases with temperature (mobility degradation)
frequency = (500 * voltage**2 * (1 - 0.002 * (temperature - 25)) + 
             np.random.normal(0, 10, n_devices))

# Create DataFrame
vf_data = pd.DataFrame({
    'Voltage': voltage,
    'Temperature': temperature,
    'Frequency': frequency
})

print('V-F Characterization Data:')
print(vf_data.head(10))
print(f'\nDataset: {len(vf_data)} devices')
print(f'Voltage range: {voltage.min():.3f}V - {voltage.max():.3f}V')
print(f'Temperature range: {temperature.min():.1f}°C - {temperature.max():.1f}°C')
print(f'Frequency range: {frequency.min():.1f}MHz - {frequency.max():.1f}MHz')

---

### 📝 What's Happening in This Code?

**Purpose:** Train decision tree on V-F data and evaluate performance

**Key Points:**
- **Train/test split**: 80% train, 20% test for unbiased evaluation
- **Hyperparameter tuning**: max_depth=6 balances complexity and generalization
- **Metrics**: RMSE measures prediction error in MHz, R² measures variance explained
- **Practical target**: RMSE < 20MHz acceptable for binning decisions (typical bin width ~50MHz)

**Why This Matters:** Demonstrates tree's ability to capture non-linear V² relationship and temperature effects automatically, without polynomial features.

In [None]:
from sklearn.model_selection import train_test_split

# Split data
X_vf = vf_data[['Voltage', 'Temperature']].values
y_vf = vf_data['Frequency'].values
X_vf_train, X_vf_test, y_vf_train, y_vf_test = train_test_split(
    X_vf, y_vf, test_size=0.2, random_state=42
)

# Train tree
vf_tree = DecisionTreeRegressor(max_depth=6, min_samples_split=10, random_state=42)
vf_tree.fit(X_vf_train, y_vf_train)

# Predictions
y_vf_train_pred = vf_tree.predict(X_vf_train)
y_vf_test_pred = vf_tree.predict(X_vf_test)

# Metrics
train_rmse = np.sqrt(mean_squared_error(y_vf_train, y_vf_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_vf_test, y_vf_test_pred))
test_r2 = r2_score(y_vf_test, y_vf_test_pred)

print('V-F Characterization Model Performance:')
print(f'  Train RMSE: {train_rmse:.2f} MHz')
print(f'  Test RMSE: {test_rmse:.2f} MHz')
print(f'  Test R²: {test_r2:.4f}')
print(f'\nFeature Importance:')
print(f'  Voltage: {vf_tree.feature_importances_[0]:.3f}')
print(f'  Temperature: {vf_tree.feature_importances_[1]:.3f}')
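# Report suitability only if the measured error actually meets the target
if test_rmse < 20:
    print('\n✅ Test RMSE < 20MHz → Model suitable for binning (typical bin width ~50MHz)')
else:
    print('\n⚠️ Test RMSE >= 20MHz → Model too coarse for binning (typical bin width ~50MHz)')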

---

### 📝 What's Happening in This Code?

**Purpose:** Visualize tree predictions vs actual V-F relationship

**Key Points:**
- **2D visualization**: Frequency vs voltage at two temperatures (25°C and 85°C)
- **True parabolic curve**: Shows F ∝ V² relationship (dashed line)
- **Tree predictions**: Staircase approximation follows true curve
- **Temperature effect**: Predictions correctly shift down at high temperature

**Why This Matters:** Visual confirmation that the tree captures non-linear voltage scaling and temperature dependence on synthetic data, a useful sanity check before applying it to real device characterization.

In [None]:
# Visualization: F vs V at two temperatures
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

temps = [25, 85]
for idx, temp in enumerate(temps):
    ax = axes[idx]
    
    # Generate voltage grid at fixed temperature
    v_grid = np.linspace(0.95, 1.05, 100)
    X_grid = np.column_stack([v_grid, np.full(100, temp)])
    
    # True relationship
    f_true = 500 * v_grid**2 * (1 - 0.002 * (temp - 25))
    
    # Tree predictions
    f_pred = vf_tree.predict(X_grid)
    
    # Plot
    mask = (vf_data.Temperature > temp - 5) & (vf_data.Temperature < temp + 5)
    ax.scatter(vf_data[mask].Voltage, vf_data[mask].Frequency, 
               alpha=0.3, s=30, label=f'Data (T≈{temp}°C)')
    ax.plot(v_grid, f_true, 'k--', linewidth=2, label='True F ∝ V²')
    ax.plot(v_grid, f_pred, 'r-', linewidth=2, label='Tree prediction')
    
    ax.set_xlabel('Voltage (V)')
    ax.set_ylabel('Frequency (MHz)')
    ax.set_title(f'V-F Curve at T = {temp}°C')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print('\n✅ Visualization confirms:')
print('  - Tree captures V² relationship (parabolic shape)')
print('  - Temperature effect correctly modeled (lower F at 85°C)')
print('  - Staircase approximation follows true curve closely')

---

## 🎯 Real-World Project Ideas

### Post-Silicon Validation Projects (4)

#### 1. **Automatic Speed Binning Classifier**
**Objective:** Multi-class classification into speed bins using early parametric tests.

**Business Value:** Eliminate manual threshold tuning and adapt to process variations automatically. Dynamic binning can improve yield relative to fixed thresholds (a 2-4% gain is the target for this project).

**Key Features:** Early voltage tests, leakage current, temperature coefficients, wafer-level spatial coordinates

**Implementation:** DecisionTreeClassifier with Gini impurity, tune max_depth (4-7), visualize decision rules for validation

**Success Metric:** 95% bin prediction accuracy, interpretable rules for test engineers
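
A possible starting point for this project is sketched below. It reuses the synthetic V-F formula from this notebook, and the bin edges (450 MHz and 490 MHz), the bin labels, and the hyperparameter values are all hypothetical choices for illustration, not recommendations for real silicon.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
n_devices = 1000
voltage = rng.uniform(0.95, 1.05, n_devices)
temperature = rng.uniform(25, 85, n_devices)
frequency = (500 * voltage ** 2 * (1 - 0.002 * (temperature - 25))
             + rng.normal(0, 10, n_devices))

# Hypothetical speed bins: 0 = slow, 1 = typical, 2 = fast
bin_edges = [450.0, 490.0]
speed_bin = np.digitize(frequency, bin_edges)

X = np.column_stack([voltage, temperature])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, speed_bin, test_size=0.2, random_state=42, stratify=speed_bin
)

clf = DecisionTreeClassifier(criterion="gini", max_depth=5,
                             min_samples_leaf=10, random_state=42)
clf.fit(X_tr, y_tr)

accuracy = clf.score(X_te, y_te)
print(f"Test accuracy: {accuracy:.3f}")

# Human-readable rules (top levels only) for review by test engineers
print(export_text(clf, feature_names=["Voltage", "Temperature"], max_depth=2))
```

Devices whose frequency lies near a bin edge are inherently ambiguous given the measurement noise, so accuracy on this toy problem is limited by the noise level as much as by the model.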

---

#### 2. **Test Flow Optimization with Decision Rules**
**Objective:** Build decision tree to determine which tests to run based on early results.

**Business Value:** Skip expensive tests for devices likely to fail, reduce test time by 20-30% while maintaining quality (miss rate < 0.1%).

**Key Features:** First 10 parametric tests (fast, low-cost), historical pass/fail patterns

**Implementation:** Classifier predicts likely failure modes, generate test flow rules from tree paths

**Success Metric:** 25% test time reduction, <0.1% escape rate (false negatives)

---

#### 3. **Non-Linear Power Prediction**
**Objective:** Predict device power from voltage, frequency, and temperature (all interact non-linearly).

**Business Value:** Early power prediction enables thermal design optimization, prevents overheating issues in production (costly rework avoided).

**Key Features:** V, F, T, and their interactions (V×F, V×T, F×T)

**Implementation:** DecisionTreeRegressor automatically captures P ∝ V²F relationship without manual feature engineering

**Success Metric:** RMSE < 5% of mean power, feature importance reveals dominant interactions

---

#### 4. **Wafer-Level Failure Mode Classification**
**Objective:** Classify failure modes (voltage, current, timing) from wafer-level spatial patterns.

**Business Value:** Root cause analysis for yield improvement, identify systematic defects (e.g., edge failures → process issue, random → particles).

**Key Features:** Die coordinates (x, y), wafer_id, parametric test results, spatial neighbors

**Implementation:** Multi-class DecisionTreeClassifier, interpret tree to extract failure signatures

**Success Metric:** 90% failure mode accuracy, decision rules guide process engineers to root cause

---

### General AI/ML Projects (4)

#### 5. **Medical Diagnosis Decision Tree**
**Objective:** Predict disease from symptoms using interpretable decision rules.

**Business Value:** Interpretable models make it easier to satisfy explainability expectations (for example, GDPR provisions on automated decision-making), and clinicians can validate and trust predictions.

**Key Features:** Symptom presence/absence, vital signs, patient demographics, medical history

**Success Metric:** 85% diagnostic accuracy, max_depth ≤ 5 for human interpretability

---

#### 6. **Customer Churn Prediction**
**Objective:** Predict which customers will churn using behavioral features.

**Business Value:** Target retention campaigns at high-risk customers, reduce churn by 15-25%. Feature importance reveals churn drivers for product improvements.

**Key Features:** Usage frequency, support tickets, billing amount, contract type, tenure

**Success Metric:** F1 score ≥ 0.80 on the churn class, ROI positive for retention campaigns

---

#### 7. **Loan Approval Classifier**
**Objective:** Automate loan approval with interpretable decision rules for regulatory compliance.

**Business Value:** Faster approvals (seconds vs. days), explainable decisions prevent discrimination lawsuits, consistent policy enforcement.

**Key Features:** Income, credit score, debt-to-income ratio, employment history, loan amount

**Success Metric:** 90% approval accuracy, decision paths auditable by regulators

---

#### 8. **Predictive Maintenance (Equipment Failure)**
**Objective:** Predict equipment failure from sensor readings (temperature, vibration, pressure).

**Business Value:** Prevent unplanned downtime (costs $50K-$500K per hour in manufacturing), optimize maintenance schedules, extend equipment life.

**Key Features:** Real-time sensor data, operating conditions, maintenance history, age

**Success Metric:** 85% failure prediction 24-48 hours in advance, <5% false alarm rate

---

## ✅ Key Takeaways

### When to Use Decision Trees

| **Scenario** | **Decision Trees** | **Linear Models** | **Ensemble (RF/XGB)** |
|-------------|-------------------|-------------------|---------------------|
| **Interpretability required** | ✅ Single tree readable | ✅ Coefficients | ❌ Black box |
| **Non-linear relationships** | ✅ Automatic | ❌ Manual features | ✅ Automatic |
| **Feature interactions** | ✅ Captured in splits | ❌ Manual | ✅ Better capture |
| **Mixed data types** | ⚠️ Native in principle; sklearn needs encoding | ❌ Encoding required | ⚠️ Depends on library |
| **High-dimensional data** | ⚠️ Overfits easily | ✅ With regularization | ✅ Robust |
| **Training speed** | ✅ Fast | ✅ Very fast | ⚠️ Slower |
| **Prediction speed** | ✅ Fast (O(depth), about O(log n) if balanced) | ✅ Very fast (O(p)) | ⚠️ Slower (many trees) |
| **Robustness to noise** | ❌ Overfits | ✅ Stable | ✅ Robust |
| **Extrapolation** | ❌ Constant at leaves | ✅ Can extrapolate | ❌ Constant |

### Best Practices

1. **Hyperparameter tuning:**
   - **max_depth**: Start with 3-7, increase if underfitting (monitor test error)
   - **min_samples_split**: 2-10 for small datasets, 20-100 for large (prevents overfitting)
   - **min_samples_leaf**: 1-5 (higher values create smoother predictions)
   - **max_features**: Consider subset (sqrt or log2) for decorrelation (useful for ensembles)
   - **ccp_alpha**: Cost-complexity pruning parameter (0.0 = no pruning; useful values depend on the target's scale, so take candidates from `cost_complexity_pruning_path`)

2. **Overfitting prevention:**
   - Use cross-validation to tune depth
   - Set reasonable min_samples_split (10-20 for noisy data)
   - Consider pruning (post-pruning with ccp_alpha)
   - If still overfitting → use Random Forest or XGBoost

3. **Feature engineering:**
   - Trees robust to scaling → no normalization needed
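   - Missing values: sklearn does not use surrogate splits; recent releases (1.3+) accept NaN natively with the default splitter, otherwise impute first
   - Categorical features → must be numerically encoded in sklearn; ordinal encoding works when the category order is meaningful, one-hot encoding is safer for unordered categories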
   - Feature interactions captured automatically

4. **Production deployment:**
   - Export the tree's rules for review and documentation (sklearn.tree.export_text)
   - Monitor feature importance drift (indicates data distribution changes)
   - A/B test against simpler baseline (logistic regression)
   - Document decision rules for stakeholder validation
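
The tuning and rule-export practices above can be combined in a few lines. This is a sketch under assumptions: the synthetic quadratic data, the grid values, and the 5-fold setting are illustrative choices, not tuned recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + 2 * X[:, 0] + rng.normal(0, 1, size=300)

# Cross-validated search over depth and leaf size
param_grid = {"max_depth": [2, 3, 4, 5, 6, 8],
              "min_samples_leaf": [1, 5, 10]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

best_params = search.best_params_
cv_mse = -search.best_score_  # the scorer returns negated MSE
print("Best parameters:", best_params)
print(f"Cross-validated MSE: {cv_mse:.3f}")

# Top levels of the selected tree as plain-text rules for stakeholder review
rules = export_text(search.best_estimator_, feature_names=["x"], max_depth=2)
print(rules)
```

Selecting hyperparameters by cross-validation on the training data keeps the held-out test set untouched for the final evaluation.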

### Limitations

- **High variance**: Single trees overfit easily (use ensembles for robustness)
- **Axis-aligned splits**: Diagonal boundaries (e.g., y = x) can only be approximated by a staircase of many splits
- **No extrapolation**: Predictions constant beyond training range (leaf values)
- **Instability**: Small data changes can produce very different trees
- **Greedy algorithm**: Locally optimal splits may miss globally better structures
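
The extrapolation limitation is easy to demonstrate. The sketch below assumes a noiseless line $y = 2x$ observed on $[0, 10]$ and asks a fitted tree for predictions at and beyond the edge of that range; the data and variable names are made up for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Training data covers x in [0, 10] only
x_fit = np.linspace(0, 10, 101).reshape(-1, 1)
y_fit = 2.0 * x_fit[:, 0]

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_fit, y_fit)

# Query at the edge of the training range and far outside it
x_outside = np.array([[10.0], [20.0], [100.0]])
pred_outside = tree.predict(x_outside)

for x_val, pred in zip(x_outside[:, 0], pred_outside):
    print(f"x = {x_val:6.1f}: tree predicts {pred:.2f}, true value {2 * x_val:.2f}")
```

Every query to the right of the training range lands in the same rightmost leaf, so the prediction stays flat at that leaf's mean while the true value keeps growing.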

### Next Steps

- **017_Random_Forests.ipynb:** Bootstrap aggregating (bagging) to reduce variance
- **018_Gradient_Boosting.ipynb:** Sequential tree building that mainly reduces bias, with regularization to control variance
- **019_XGBoost_LightGBM.ipynb:** High-performance gradient boosting libraries
- **Advanced:** Extremely randomized trees, isolation forests (anomaly detection)

---

## 📚 References & Further Reading

**Foundational Papers:**
- Breiman, Friedman, Olshen & Stone (1984): *Classification and Regression Trees* (original CART algorithm)
- Quinlan (1986): "Induction of Decision Trees" (ID3, information gain splitting); its successor C4.5 followed in 1993

**Sklearn Documentation:**
- `sklearn.tree.DecisionTreeRegressor`: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
- User guide: https://scikit-learn.org/stable/modules/tree.html
- Visualization (`plot_tree`): https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html

**Advanced Topics:**
- Cost-complexity pruning (post-pruning for generalization)
- Surrogate splits (handling missing values)
- Monotonic constraints (enforce domain knowledge)
- Tree-based feature selection

---

**Notebook Complete!** 🎉

You now understand:
- ✅ Decision tree theory (CART, RSS, Gini, entropy)
- ✅ From-scratch implementation (recursive splitting)
- ✅ Production sklearn usage (DecisionTreeRegressor/Classifier)
- ✅ Post-silicon applications (V-F characterization, binning, test flow optimization)
- ✅ General AI/ML applications (medical diagnosis, churn prediction, loan approval)
- ✅ Feature importance for interpretability
- ✅ Overfitting prevention (depth tuning, pruning)
- ✅ 8 real-world projects to practice

**Next:** `017_Random_Forests.ipynb` for ensemble methods that reduce tree variance.