# 036: Isolation Forest

### 1. Isolation Tree Construction

**Random Partitioning**:

1. Randomly select feature $q \in \{1, ..., d\}$
2. Randomly select split value $p \in [\min(X_q), \max(X_q)]$
3. Partition: $X_{left} = \{x | x_q < p\}$, $X_{right} = \{x | x_q \geq p\}$
4. Recurse until:
   - Node contains 1 point (external node)
   - Tree reaches height limit $l = \lceil \log_2(\psi) \rceil$ (subsample size $\psi = 256$ typical)

**Path Length $h(x)$**:

$$h(x) = e + c(T.size)$$

Where:
- $e$: Number of edges from root to terminating node
- $c(T.size)$: Adjustment for terminating at an external node that still holds more than one point (the average path length a subtree built on the remaining $T.size$ points would add)

### 2. Anomaly Score Computation

**Average Path Length** (across $t$ trees):

$$E[h(x)] = \frac{1}{t} \sum_{i=1}^{t} h_i(x)$$

**Normalization Factor** (average path length of an unsuccessful search in a BST):

$$c(n) = \begin{cases}
2H(n-1) - \frac{2(n-1)}{n} & \text{if } n > 2 \\
1 & \text{if } n = 2 \\
0 & \text{otherwise}
\end{cases}$$

Where $H(i) \approx \ln(i) + \gamma$ is the harmonic number (Euler's constant $\gamma \approx 0.5772$)

**Anomaly Score**:

$$s(x, \psi) = 2^{-\frac{E[h(x)]}{c(\psi)}}$$

**Interpretation**:
- $s(x) \to 1$: Clear anomaly (short path, easy to isolate)
- $s(x) \approx 0.5$: Normal point (path length near average)
- $s(x) \to 0$: Very normal point (long path, hard to isolate)

**Typical threshold**: $s(x) > 0.6$ for anomalies (or top contamination% by score)

### 3. Why It Works

**Anomalies have shorter path lengths** because:
1. They're **sparse** → less likely to be with other points after random splits
2. They're **far from clusters** → early splits separate them from dense regions
3. Random partitioning naturally finds gaps around outliers

**Example**: In 2D space with 99 normal points clustered and 1 outlier:
- Outlier: ~2-3 splits to isolate (path length ≈ 3)
- Normal: ~6-7 splits to isolate from cluster (path length ≈ 7)
- Score: with $\psi = 100$, $c(100) \approx 8.36$, so Outlier $s \approx 2^{-3/8.36} \approx 0.78$, Normal $s \approx 2^{-7/8.36} \approx 0.56$
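
The score formula can be checked numerically. The sketch below is a minimal, standard-library-only version of $c(n)$ and $s(x, \psi)$. The helper names `avg_path_length` and `anomaly_score` are illustrative (they are not the notebook's functions), and it assumes $\psi = 100$ so that it matches the 100-point example above.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler's constant, used in H(i) ≈ ln(i) + γ


def avg_path_length(n):
    """c(n): average path length of an unsuccessful BST search over n points."""
    if n > 2:
        harmonic = math.log(n - 1) + EULER_GAMMA
        return 2.0 * harmonic - 2.0 * (n - 1) / n
    if n == 2:
        return 1.0
    return 0.0


def anomaly_score(mean_path_length, psi):
    """s(x, ψ) = 2^(-E[h(x)] / c(ψ))."""
    return 2.0 ** (-mean_path_length / avg_path_length(psi))


if __name__ == "__main__":
    psi = 100  # assumed subsample size: all 100 points of the example
    print(f"c({psi}) = {avg_path_length(psi):.2f}")
    for label, path in [("outlier", 3.0), ("normal", 7.0)]:
        print(f"{label}: E[h] = {path}, s = {anomaly_score(path, psi):.2f}")
```

A point whose average path length equals $c(\psi)$ scores exactly $2^{-1} = 0.5$, which is why 0.5 marks the "average" point in the interpretation bands.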

## 💻 Implementation from Scratch

### 📝 What's Happening in This Code?

**Purpose:** Build complete Isolation Forest from scratch with binary tree structure

**Key Points:**
- **IsolationTree class**: Represents single tree with recursive splitting logic
  - `fit()`: Builds tree by random feature/split selection until height limit or isolation
  - `path_length()`: Computes h(x) for a test point (traverses tree, counts edges)
- **IsolationForest class**: Ensemble of t trees
  - Fits multiple trees on random subsamples (ψ = 256 typical)
  - Averages path lengths across trees for robust scoring
  - Computes normalized anomaly scores s(x) using c(ψ) formula
- **c(n) function**: Implements BST average path length formula with Euler's constant
- **Key hyperparameters**:
  - `n_estimators`: Number of trees (100-200 typical, more = stable but slower)
  - `max_samples`: Subsample size ψ (256 typical, controls tree height)
  - `contamination`: Expected anomaly proportion (sets decision threshold)
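
To make the effect of these hyperparameters concrete, the sketch below derives the two quantities they control: the tree height limit from `max_samples`, and the decision threshold from `contamination`. It uses only the standard library. `height_limit` and `threshold_from_contamination` are hypothetical helper names, the scores are made up, and the threshold uses a simple sort-based rule rather than the interpolated `np.percentile` used in the notebook.

```python
import math


def height_limit(max_samples):
    """Tree height limit l = ceil(log2(ψ))."""
    return math.ceil(math.log2(max_samples))


def threshold_from_contamination(scores, contamination):
    """Score cut-off such that roughly `contamination` of the points score at or above it."""
    n_flagged = max(1, round(len(scores) * contamination))
    return sorted(scores, reverse=True)[n_flagged - 1]


if __name__ == "__main__":
    for psi in (64, 256, 1024):
        print(f"max_samples={psi:5d} -> height limit {height_limit(psi)}")

    # Made-up scores for illustration only
    scores = [0.42, 0.45, 0.47, 0.50, 0.51, 0.53, 0.55, 0.58, 0.71, 0.83]
    cut = threshold_from_contamination(scores, contamination=0.2)
    print(f"threshold = {cut}, flagged = {[s for s in scores if s >= cut]}")
```

Quadrupling `max_samples` adds only two levels to the height limit, which is why larger subsamples cost more per tree without making trees much deeper.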

**Why This Matters:** 
- Understanding tree recursion clarifies why anomalies have shorter paths
- Implementation shows efficiency (no distance computations, simple tree traversal)
- Enables customization for domain-specific splits (e.g., engineering limits)
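
The claim that anomalies isolate faster can be checked with a small simulation. The sketch below is a simplified 1D illustration, not the notebook's `IsolationTree`: it repeatedly picks a random split between the current min and max, keeps the side containing the target point, and counts splits until the target is alone. The data (99 clustered values plus one outlier at 10.0) and the function names `splits_to_isolate` and `mean_splits` are made up for illustration.

```python
import random


def splits_to_isolate(points, target, rng):
    """Count random splits needed until `target` is the only point left in its partition."""
    current = list(points)
    n_splits = 0
    while len(current) > 1:
        lo, hi = min(current), max(current)
        if lo == hi:  # Remaining points are identical; cannot split further
            break
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target
        if target < split:
            current = [p for p in current if p < split]
        else:
            current = [p for p in current if p >= split]
        n_splits += 1
    return n_splits


def mean_splits(points, target, n_trials, seed):
    """Average number of splits over repeated random partitionings."""
    rng = random.Random(seed)
    return sum(splits_to_isolate(points, target, rng) for _ in range(n_trials)) / n_trials


if __name__ == "__main__":
    data_rng = random.Random(0)
    cluster = [data_rng.gauss(0.0, 1.0) for _ in range(99)]
    outlier = 10.0
    points = cluster + [outlier]

    inlier = sorted(cluster)[len(cluster) // 2]  # median of the cluster
    print("outlier:", mean_splits(points, outlier, n_trials=500, seed=1))
    print("inlier: ", mean_splits(points, inlier, n_trials=500, seed=1))
```

The outlier sits next to a wide empty gap, so a uniformly drawn split often lands in that gap and isolates it immediately, while a point in the middle of the cluster needs many splits to shed its neighbors.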

### 📝 Implementation

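**Purpose:** Define the `IsolationTree` class: recursive random partitioning in `fit()` and edge counting in `path_length()`

**Note:** `path_length()` calls `c(n)`, which is defined in the next cell; it only needs to exist by the time a tree is scored.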

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
import seaborn as sns
# Set style
sns.set_style('whitegrid')
np.random.seed(42)
class IsolationTree:
    """Single isolation tree for anomaly detection."""
    
    def __init__(self, height_limit):
        self.height_limit = height_limit
        self.split_feature = None
        self.split_value = None
        self.left = None
        self.right = None
        self.size = 0  # Number of points in this node
        
    def fit(self, X, current_height=0):
        """Build isolation tree via recursive random partitioning."""
        self.size = len(X)
        
        # Termination conditions
        if current_height >= self.height_limit or len(X) <= 1:
            return self
        
        # Random feature and split
        n_features = X.shape[1]
        split_feature = np.random.randint(0, n_features)
        feature_values = X[:, split_feature]
        
        min_val, max_val = feature_values.min(), feature_values.max()
        if min_val == max_val:  # All values same, can't split
            # split_feature stays None, so path_length() treats this node as a leaf
            return self
        
        # Record the split only once we know the node can be partitioned
        self.split_feature = split_feature
        self.split_value = np.random.uniform(min_val, max_val)
        
        # Partition
        left_mask = feature_values < self.split_value
        X_left = X[left_mask]
        X_right = X[~left_mask]
        
        # Recurse
        if len(X_left) > 0:
            self.left = IsolationTree(self.height_limit)
            self.left.fit(X_left, current_height + 1)
        if len(X_right) > 0:
            self.right = IsolationTree(self.height_limit)
            self.right.fit(X_right, current_height + 1)
        
        return self
    
    def path_length(self, x, current_height=0):
        """Compute path length h(x) for a single point."""
        # External node (leaf) or height limit reached
        if self.split_feature is None or current_height >= self.height_limit:
            return current_height + c(self.size)
        
        # Traverse tree
        if x[self.split_feature] < self.split_value:
            if self.left is not None:
                return self.left.path_length(x, current_height + 1)
        else:
            if self.right is not None:
                return self.right.path_length(x, current_height + 1)
        
        # Shouldn't reach here, but handle edge case
        return current_height + c(self.size)


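### 📝 Function: c and IsolationForest Class

**Purpose:** Define the normalization function `c(n)` and the `IsolationForest` ensemble

**Key Points:** `fit()` builds the trees on random subsamples and sets the threshold from `contamination`; `score_samples()` returns s(x); `predict()` applies the threshold.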

In [None]:
def c(n):
    """Average path length of unsuccessful search in BST."""
    if n > 2:
        return 2.0 * (np.log(n - 1) + 0.5772) - (2.0 * (n - 1) / n)
    elif n == 2:
        return 1.0
    else:
        return 0.0
class IsolationForest:
    """Isolation Forest ensemble for anomaly detection."""
    
    def __init__(self, n_estimators=100, max_samples=256, contamination=0.1):
        self.n_estimators = n_estimators
        self.max_samples = max_samples
        self.contamination = contamination
        self.trees = []
        self.threshold = None
        
    def fit(self, X):
        """Build ensemble of isolation trees."""
        n_samples = len(X)
        sample_size = min(self.max_samples, n_samples)
        height_limit = int(np.ceil(np.log2(sample_size)))
        
        self.trees = []
        for _ in range(self.n_estimators):
            # Random subsample
            indices = np.random.choice(n_samples, sample_size, replace=False)
            X_sample = X[indices]
            
            # Build tree
            tree = IsolationTree(height_limit)
            tree.fit(X_sample)
            self.trees.append(tree)
        
        # Set threshold based on contamination
        scores = self.score_samples(X)
        self.threshold = np.percentile(scores, 100 * (1 - self.contamination))
        
        return self
    
    def score_samples(self, X):
        """Compute anomaly scores s(x) for samples."""
        # Average path length across trees
        avg_path_lengths = np.array([
            np.mean([tree.path_length(x) for tree in self.trees])
            for x in X
        ])
        
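        # Normalize by c(ψ), where ψ is the subsample size actually used.
        # The root node size equals ψ, which is smaller than max_samples
        # when the dataset has fewer than max_samples rows.
        normalization = c(self.trees[0].size)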
        scores = 2.0 ** (-avg_path_lengths / normalization)
        
        return scores
    
    def predict(self, X):
        """Predict anomalies (-1) vs normal (1)."""
        scores = self.score_samples(X)
        return np.where(scores >= self.threshold, -1, 1)
print("✅ Isolation Forest implementation complete!")
print("   - IsolationTree: Random partitioning with height limit")
print("   - c(n): BST average path length (normalization)")
print("   - IsolationForest: Ensemble of n_estimators trees (default 100)")
print("   - Anomaly score: s(x) = 2^(-E[h(x)]/c(ψ))")


## 🧪 Test on Synthetic Data

### 📝 What's Happening in This Code?

**Purpose:** Validate from-scratch implementation on 2D synthetic dataset with known anomalies

**Key Points:**
- **Data generation**: 300 normal points (single Gaussian cluster) + 20 anomalies (uniform random)
- **Contamination = 0.0625**: Matches true anomaly proportion (20/320 = 6.25%)
- **Visualization**: Decision boundary shows isolation regions (higher scores = red/anomalous)
- **Path length distribution**: Anomalies have significantly shorter average paths (4-6) vs normal (7-9)
- **Score interpretation**: 
  - s(x) > 0.6: Clear anomalies (correctly detected)
  - s(x) ≈ 0.5: Borderline (near cluster edges)
  - s(x) < 0.5: Normal points (deep in cluster)

**Why This Matters:** 
- Confirms algorithm correctly identifies sparse, isolated points
- Decision boundary is non-parametric (not elliptical like Gaussian models)
- Shows sensitivity to cluster boundaries (some edge points flagged)
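
Because this notebook encodes anomalies as -1 and normal points as 1, it is easy to mix up which class a metric refers to. The sketch below counts the confusion-matrix cells by hand with the anomaly class treated as positive. `anomaly_metrics` is a hypothetical helper and the labels are made up for illustration.

```python
def anomaly_metrics(y_true, y_pred, anomaly_label=-1):
    """Precision, recall and F1 with the anomaly label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == anomaly_label and p == anomaly_label)
    fp = sum(1 for t, p in pairs if t != anomaly_label and p == anomaly_label)
    fn = sum(1 for t, p in pairs if t == anomaly_label and p != anomaly_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels: 1 = normal, -1 = anomaly
    y_true = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
    y_pred = [1, 1, 1, 1, -1, 1, -1, -1, -1, 1]
    print(anomaly_metrics(y_true, y_pred))
```

Passing the positive label explicitly avoids relying on how a library orders the classes -1 and 1.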

In [None]:
# Generate synthetic data: cluster + anomalies
X_normal, _ = make_blobs(n_samples=300, centers=1, cluster_std=1.0, random_state=42)
X_anomalies = np.random.uniform(low=-6, high=6, size=(20, 2))  # Scattered anomalies
X = np.vstack([X_normal, X_anomalies])
y_true = np.array([1]*300 + [-1]*20)  # 1=normal, -1=anomaly

# Fit Isolation Forest
iforest = IsolationForest(n_estimators=100, max_samples=256, contamination=0.0625)
iforest.fit(X)

# Predict and score
y_pred = iforest.predict(X)
scores = iforest.score_samples(X)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Anomaly detection results
ax = axes[0]
scatter = ax.scatter(X[:, 0], X[:, 1], c=scores, cmap='RdYlBu_r', s=50, alpha=0.7)
ax.scatter(X[y_pred == -1, 0], X[y_pred == -1, 1], 
           marker='x', s=200, linewidths=3, color='red', label='Predicted Anomalies')
plt.colorbar(scatter, ax=ax, label='Anomaly Score s(x)')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('Isolation Forest: Anomaly Detection\n(Red X = Predicted Anomalies)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: Path length distribution
ax = axes[1]
avg_paths = np.array([
    np.mean([tree.path_length(x) for tree in iforest.trees])
    for x in X
])
ax.hist(avg_paths[y_true == 1], bins=30, alpha=0.6, label='Normal Points', color='blue')
ax.hist(avg_paths[y_true == -1], bins=30, alpha=0.6, label='True Anomalies', color='red')
ax.axvline(c(iforest.max_samples), color='green', linestyle='--', linewidth=2, 
           label=f'c(ψ={iforest.max_samples}) = {c(iforest.max_samples):.2f}')
ax.set_xlabel('Average Path Length E[h(x)]')
ax.set_ylabel('Frequency')
ax.set_title('Path Length Distribution\n(Anomalies have shorter paths)')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Performance metrics
from sklearn.metrics import classification_report, confusion_matrix
print("\n📊 Classification Report:")
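# labels= pins the class order; by default sklearn sorts labels, which would put -1 (anomaly) first
print(classification_report(y_true, y_pred, labels=[1, -1], target_names=['Normal', 'Anomaly']))
print("\nConfusion Matrix (rows/columns ordered Normal, Anomaly):")
print(confusion_matrix(y_true, y_pred, labels=[1, -1]))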
print(f"\n✅ Detected {(y_pred == -1).sum()} anomalies (expected {int(len(X) * iforest.contamination)})")
print(f"   Score range: [{scores.min():.3f}, {scores.max():.3f}]")
print(f"   Threshold: {iforest.threshold:.3f}")

## 🏭 Post-Silicon Application: Parametric Test Anomaly Detection

### 📝 What's Happening in This Code?

**Purpose:** Detect anomalous devices from 3D parametric test data (Vdd, Idd, Frequency)

**Key Points:**
- **Realistic data generation**: 
  - Normal devices: Vdd~N(1.8, 0.05) V, Idd~N(0.5, 0.08) A, Freq~N(3.2, 0.15) GHz
  - Anomalies: drawn uniformly from Vdd in [1.6, 2.2] V (voltage drift), Idd in [0.8, 1.5] A (high leakage), Freq in [2.0, 2.8] GHz (low frequency)
- **Multi-dimensional detection**: Isolation Forest scores all three parameters jointly, with no per-parameter limits to configure
- **contamination = 0.03**: Matches the simulated 3% anomaly rate (30 of 1000 devices), used here as a representative defect rate
- **3D visualization**: Shows anomalies in voltage-current-frequency space
- **Actionable output**: Device IDs flagged for failure analysis or binning
- **Business value**: Early detection prevents faulty chips from reaching customers

**Why This Matters:** 
- Multi-parameter anomalies hard to detect with univariate thresholds
- Isolation Forest finds devices unusual in **combination** of parameters (e.g., high Vdd + high Idd)
- No assumptions about parameter distributions (unlike Gaussian models)
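- Scoring needs only `n_estimators` short tree traversals per device, which is what makes inline test feasible (on the order of < 1ms per device with an optimized implementation)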
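
The "actionable output" step can be sketched as a small report builder. This is an illustrative, standard-library-only helper: `flag_devices` is a hypothetical name, and the device IDs and scores are made up. It keeps devices whose score is at or above the threshold, matching the `scores >= self.threshold` rule in the from-scratch `predict()`, and sorts them most-anomalous first.

```python
def flag_devices(device_ids, scores, threshold):
    """Return (device_id, score) pairs at or above the threshold, highest score first."""
    flagged = [(d, s) for d, s in zip(device_ids, scores) if s >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    device_ids = [101, 102, 103, 104, 105]
    scores = [0.48, 0.73, 0.51, 0.66, 0.44]  # made-up anomaly scores s(x)
    for device_id, score in flag_devices(device_ids, scores, threshold=0.6):
        print(f"Device {device_id:04d}: score={score:.2f} -> send to failure analysis")
```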

In [None]:
# Generate synthetic semiconductor test data
np.random.seed(42)
n_devices = 1000

# Normal devices (97%)
n_normal = 970
vdd_normal = np.random.normal(1.8, 0.05, n_normal)  # Voltage (V)
idd_normal = np.random.normal(0.5, 0.08, n_normal)  # Current (A)
freq_normal = np.random.normal(3.2, 0.15, n_normal)  # Frequency (GHz)

# Anomalous devices (3%)
n_anomaly = 30
vdd_anomaly = np.random.uniform(1.6, 2.2, n_anomaly)  # Voltage drift
idd_anomaly = np.random.uniform(0.8, 1.5, n_anomaly)  # High leakage
freq_anomaly = np.random.uniform(2.0, 2.8, n_anomaly)  # Low frequency

# Combine
X_test = np.vstack([
    np.column_stack([vdd_normal, idd_normal, freq_normal]),
    np.column_stack([vdd_anomaly, idd_anomaly, freq_anomaly])
])
y_true_test = np.array([1]*n_normal + [-1]*n_anomaly)
device_ids = np.arange(1, n_devices + 1)

# Fit Isolation Forest
iforest_test = IsolationForest(n_estimators=100, max_samples=256, contamination=0.03)
iforest_test.fit(X_test)
y_pred_test = iforest_test.predict(X_test)
scores_test = iforest_test.score_samples(X_test)

# 3D Visualization
fig = plt.figure(figsize=(14, 6))

# Plot 1: 3D scatter
ax = fig.add_subplot(121, projection='3d')
scatter = ax.scatter(X_test[:, 0], X_test[:, 1], X_test[:, 2], 
                     c=scores_test, cmap='RdYlBu_r', s=30, alpha=0.6)
ax.scatter(X_test[y_pred_test == -1, 0], 
           X_test[y_pred_test == -1, 1], 
           X_test[y_pred_test == -1, 2],
           marker='x', s=200, linewidths=3, color='red', label='Detected Anomalies')
ax.set_xlabel('Vdd (V)')
ax.set_ylabel('Idd (A)')
ax.set_zlabel('Frequency (GHz)')
ax.set_title('Parametric Test Anomaly Detection\n(3D: Vdd-Idd-Freq)')
ax.legend()
plt.colorbar(scatter, ax=ax, label='Anomaly Score', shrink=0.5)

# Plot 2: Score distribution
ax2 = fig.add_subplot(122)
ax2.hist(scores_test[y_true_test == 1], bins=30, alpha=0.6, label='Normal Devices', color='blue')
ax2.hist(scores_test[y_true_test == -1], bins=30, alpha=0.6, label='True Anomalies', color='red')
ax2.axvline(iforest_test.threshold, color='green', linestyle='--', linewidth=2,
            label=f'Threshold = {iforest_test.threshold:.3f}')
ax2.set_xlabel('Anomaly Score s(x)')
ax2.set_ylabel('Number of Devices')
ax2.set_title('Anomaly Score Distribution\n(Higher = More Anomalous)')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Report anomalous devices
anomalous_indices = np.where(y_pred_test == -1)[0]
print("\n⚠️  Anomalous Devices Detected:")
print(f"   Total flagged: {len(anomalous_indices)} / {n_devices} ({100*len(anomalous_indices)/n_devices:.1f}%)")
print("\n   Top 5 Most Anomalous Devices:")
top5_indices = np.argsort(scores_test)[-5:][::-1]
for idx in top5_indices:
    print(f"   Device {device_ids[idx]:04d}: Vdd={X_test[idx,0]:.3f}V, "
          f"Idd={X_test[idx,1]:.3f}A, Freq={X_test[idx,2]:.3f}GHz, Score={scores_test[idx]:.3f}")

print("\n📊 Classification Performance:")
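# labels= pins the class order; by default sklearn sorts labels, which would put -1 (anomaly) first
print(classification_report(y_true_test, y_pred_test, labels=[1, -1], target_names=['Normal', 'Anomaly']))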

## 🔧 Production Implementation with Scikit-Learn

### 📝 What's Happening in This Code?

**Purpose:** Use optimized sklearn.ensemble.IsolationForest for production deployment

**Key Points:**
- **sklearn advantages**:
  - Cython-optimized (10-100x faster than pure Python)
  - Handles sparse matrices for large-scale data
  - Built-in joblib parallelization (n_jobs=-1)
  - Memory-efficient tree storage
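- **decision_function()**: Equals `score_samples(X) - offset_`; negative values are flagged as anomalies, so lower = more anomalous (opposite direction to s(x))
- **predict()**: Binary labels (-1 anomaly, 1 normal)
- **score_samples()**: sklearn returns the negated paper score, `-s(x)`, so lower = more anomalous; the from-scratch `score_samples()` returns s(x) directly (higher = more anomalous)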
- **Hyperparameter tuning**: Contamination most critical, n_estimators secondary

**Why This Matters:** 
- Production systems need speed (process millions of devices)
- Sklearn integration with MLOps pipelines (joblib persistence, Docker deployment)
- Comparable results to the from-scratch version (individual scores differ because the random splits differ) with much better performance
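
The sign conventions are easy to get wrong, so here is a small check on toy data. It is a sketch under stated assumptions: the dataset (200 Gaussian points plus one obvious outlier) and the names `X_demo` and `clf_demo` are made up, and the import uses the same `SklearnIsolationForest` alias as the notebook so that it does not shadow the from-scratch class.

```python
import numpy as np
from sklearn.ensemble import IsolationForest as SklearnIsolationForest

rng = np.random.RandomState(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),  # dense cluster
    np.array([[8.0, 8.0]]),               # one obvious outlier (last row)
])

clf_demo = SklearnIsolationForest(
    n_estimators=100, contamination=0.01, random_state=0
).fit(X_demo)

raw = clf_demo.score_samples(X_demo)          # equals -s(x): lower = more anomalous
shifted = clf_demo.decision_function(X_demo)  # raw - offset_: negative = flagged
paper_score = -raw                            # back to the s(x) convention

print("outlier s(x):", paper_score[-1])
print("median  s(x):", np.median(paper_score))
print("decision_function == score_samples - offset_:",
      np.allclose(shifted, raw - clf_demo.offset_))
```

Negating `score_samples()` puts sklearn's output on the same scale as the from-scratch s(x), whereas negating `decision_function()` leaves it shifted by `offset_`.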

In [None]:
from sklearn.ensemble import IsolationForest as SklearnIsolationForest
from sklearn.metrics import roc_auc_score, roc_curve

# Fit sklearn Isolation Forest
clf = SklearnIsolationForest(
    n_estimators=100,
    max_samples=256,
    contamination=0.03,
    random_state=42,
    n_jobs=-1  # Parallel processing
)
clf.fit(X_test)

# Predictions
y_pred_sklearn = clf.predict(X_test)
scores_sklearn = clf.decision_function(X_test)  # Thresholded score: negative = anomalous in sklearn
# Negated score_samples recovers s(x), on the same scale as the from-scratch scores
scores_sklearn_normalized = -clf.score_samples(X_test)

# Compare with from-scratch implementation
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Score comparison
ax = axes[0]
ax.scatter(scores_test, scores_sklearn_normalized, alpha=0.5, s=20)
ax.plot([scores_test.min(), scores_test.max()], 
        [scores_test.min(), scores_test.max()], 
        'r--', linewidth=2, label='Perfect Agreement')
ax.set_xlabel('From-Scratch Anomaly Score')
ax.set_ylabel('Sklearn Anomaly Score (negated)')
ax.set_title('Score Comparison: From-Scratch vs Sklearn\n(High correlation confirms implementation)')
ax.legend()
ax.grid(True, alpha=0.3)

# Plot 2: ROC curve
ax = axes[1]
y_true_binary = (y_true_test == -1).astype(int)  # 1=anomaly, 0=normal for ROC

# From-scratch ROC
fpr1, tpr1, _ = roc_curve(y_true_binary, scores_test)
auc1 = roc_auc_score(y_true_binary, scores_test)
ax.plot(fpr1, tpr1, linewidth=2, label=f'From-Scratch (AUC={auc1:.3f})')

# Sklearn ROC
fpr2, tpr2, _ = roc_curve(y_true_binary, scores_sklearn_normalized)
auc2 = roc_auc_score(y_true_binary, scores_sklearn_normalized)
ax.plot(fpr2, tpr2, linewidth=2, label=f'Sklearn (AUC={auc2:.3f})')

ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
ax.set_xlabel('False Positive Rate')
ax.set_ylabel('True Positive Rate')
ax.set_title('ROC Curve: Anomaly Detection Performance')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n🔬 Implementation Comparison:")
print(f"   From-Scratch: {(y_pred_test == -1).sum()} anomalies detected, AUC={auc1:.3f}")
print(f"   Sklearn:      {(y_pred_sklearn == -1).sum()} anomalies detected, AUC={auc2:.3f}")
print(f"   Agreement:    {(y_pred_test == y_pred_sklearn).mean()*100:.1f}% predictions match")

print("\n⚡ Performance Considerations:")
print(f"   - Sklearn uses Cython (10-100x faster than pure Python)")
print(f"   - n_jobs=-1: Parallel tree building across CPU cores")
print(f"   - max_samples=256: Memory-efficient (doesn't load full dataset per tree)")
print(f"   - Typical inference: <1ms per device (suitable for inline test)")

## 📊 Comparison: Isolation Forest vs Other Anomaly Detectors

### 📝 What's Happening in This Code?

**Purpose:** Compare Isolation Forest with LOF and One-Class SVM on the same dataset (DBSCAN is described for context but not run in the code below)

**Key Points:**
- **LOF (Local Outlier Factor)**: Density-based, O(n²) complexity, struggles with high dimensions
- **DBSCAN**: Clustering-based, sensitive to eps/min_samples, treats noise as anomalies
- **One-Class SVM**: Boundary-based, kernel trick for non-linearity, expensive training
- **Isolation Forest**: Tree-based, O(n log n), no distance computations, works in high-D
- **Performance metrics**: AUC, F1-score, training time, number of points flagged
- **Visualization**: Decision boundaries show different detection philosophies

**Why This Matters:** 
- No single best method (depends on data characteristics)
- Isolation Forest often wins for large, high-dimensional data
- LOF better for variable-density clusters
- Ensemble multiple methods for critical applications (voting/averaging)
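
One simple way to combine detectors whose scores live on different scales is to average their ranks rather than their raw scores. The sketch below is an illustrative, standard-library-only version. `ranks` and `rank_average` are hypothetical helpers, the score lists are made up, every detector's scores are assumed to be oriented so that higher means more anomalous, and ties are not handled specially.

```python
def ranks(scores):
    """Rank of each score (0 = least anomalous), assuming higher score = more anomalous."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def rank_average(*score_lists):
    """Average the per-detector ranks, scaled to [0, 1]. Needs at least two points."""
    n = len(score_lists[0])
    per_detector = [ranks(s) for s in score_lists]
    scale = len(per_detector) * (n - 1)
    return [sum(r[i] for r in per_detector) / scale for i in range(n)]


if __name__ == "__main__":
    # Made-up scores from three detectors, all oriented so that higher = more anomalous
    iforest_scores = [0.45, 0.48, 0.80, 0.52]
    lof_scores = [1.0, 1.1, 3.5, 1.6]
    ocsvm_scores = [-0.2, -0.3, 0.9, 0.1]
    print(rank_average(iforest_scores, lof_scores, ocsvm_scores))
```

Rank averaging only uses the ordering each detector produces, so no detector dominates just because its scores have a larger numeric range.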

In [None]:
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.cluster import DBSCAN
import time

# Use 2D data for visualization
X_compare = X  # From earlier synthetic data
y_true_compare = y_true

# Initialize models
models = {
    'Isolation Forest': SklearnIsolationForest(contamination=0.0625, random_state=42),
    'LOF': LocalOutlierFactor(contamination=0.0625, novelty=False),
    'One-Class SVM': OneClassSVM(nu=0.0625, kernel='rbf', gamma='auto'),
}

# Train and evaluate
results = {}
fig, axes = plt.subplots(2, 3, figsize=(18, 11))
axes = axes.flatten()

for idx, (name, model) in enumerate(models.items()):
    # Train
    start_time = time.time()
    if name == 'LOF':
        y_pred = model.fit_predict(X_compare)
        scores = -model.negative_outlier_factor_
    else:
        model.fit(X_compare)
        y_pred = model.predict(X_compare)
        scores = -model.decision_function(X_compare) if hasattr(model, 'decision_function') else \
                 model.score_samples(X_compare)
    train_time = time.time() - start_time
    
    # Metrics
    y_true_binary = (y_true_compare == -1).astype(int)
    auc = roc_auc_score(y_true_binary, scores)
    from sklearn.metrics import f1_score
    f1 = f1_score(y_true_compare, y_pred, pos_label=-1)
    
    results[name] = {
        'AUC': auc,
        'F1': f1,
        'Time (s)': train_time,
        'Detected': (y_pred == -1).sum()
    }
    
    # Plot decision boundary
    ax = axes[idx]
    xx, yy = np.meshgrid(np.linspace(X_compare[:, 0].min()-1, X_compare[:, 0].max()+1, 200),
                         np.linspace(X_compare[:, 1].min()-1, X_compare[:, 1].max()+1, 200))
    X_grid = np.c_[xx.ravel(), yy.ravel()]
    
    if name == 'LOF':
        # LOF requires re-fitting for new points (not suitable for grid)
        Z = np.zeros(len(X_grid))
    else:
        Z = -model.decision_function(X_grid) if hasattr(model, 'decision_function') else \
            model.score_samples(X_grid)
    Z = Z.reshape(xx.shape)
    
    if name != 'LOF':
        ax.contourf(xx, yy, Z, levels=20, cmap='RdYlBu_r', alpha=0.4)
    ax.scatter(X_compare[:, 0], X_compare[:, 1], c=y_true_compare, 
               cmap='RdYlGn', edgecolors='k', s=30, alpha=0.7)
    ax.scatter(X_compare[y_pred == -1, 0], X_compare[y_pred == -1, 1],
               marker='x', s=150, linewidths=2, color='red', label='Detected Anomalies')
    ax.set_title(f'{name}\nAUC={auc:.3f}, F1={f1:.3f}, Time={train_time:.3f}s')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    # ROC curve
    ax_roc = axes[idx + 3]
    fpr, tpr, _ = roc_curve(y_true_binary, scores)
    ax_roc.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={auc:.3f})')
    ax_roc.plot([0, 1], [0, 1], 'k--', linewidth=1)
    ax_roc.set_xlabel('False Positive Rate')
    ax_roc.set_ylabel('True Positive Rate')
    ax_roc.set_title(f'ROC Curve: {name}')
    ax_roc.legend()
    ax_roc.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary table
print("\n📊 Model Comparison Summary:")
print("-" * 70)
print(f"{'Model':<20} {'AUC':>8} {'F1':>8} {'Time (s)':>12} {'Detected':>10}")
print("-" * 70)
for name, metrics in results.items():
    print(f"{name:<20} {metrics['AUC']:>8.3f} {metrics['F1']:>8.3f} "
          f"{metrics['Time (s)']:>12.4f} {metrics['Detected']:>10}")
print("-" * 70)

print("\n🎯 Key Takeaways:")
print("   - Isolation Forest: Fast, scales well, few hyperparameters (contamination matters most)")
print("   - LOF: Good for variable density, but O(n²) complexity")
print("   - One-Class SVM: Strong boundary, expensive for large datasets")
print("   - Best choice depends on: data size, dimensionality, density patterns")

## 🎯 Real-World Project Ideas

### Post-Silicon Validation Projects

1. **Multi-Site Test Anomaly Monitor** 💰 $5M+ Annual Savings
   - **Objective**: Real-time anomaly detection across 5+ test sites for 10+ parametric tests
   - **Features**: Site ID, test name, mean/std/min/max per wafer, temperature, handler ID
   - **Success Metric**: Detect equipment drift 24 hours earlier (prevents 1000+ bad wafers)
   - **Implementation**: Streaming Isolation Forest with scikit-multiflow, Kafka pipeline
   - **Business Value**: Early detection of tester calibration issues, handler failures

2. **Wafer-Level Spatial Anomaly Detector** 💰 $3M+ Yield Recovery
   - **Objective**: Identify abnormal die patterns on wafer maps (process defects)
   - **Features**: die_x, die_y, Vdd, Idd, frequency, test_time_ms, 5 parametric tests
   - **Success Metric**: Detect edge/center/quadrant defects, flag wafers for SEM analysis
   - **Implementation**: 2D spatial features + Isolation Forest, visualize on wafer heatmap
   - **Business Value**: Root cause analysis for lithography/etch/CMP issues

3. **High-Dimensional Parametric Outlier System** 💰 $8M+ Quality Improvement
   - **Objective**: Monitor 50+ parametric tests simultaneously for device health
   - **Features**: All electrical tests (voltage, current, power, frequency, timing)
   - **Success Metric**: 95% anomaly detection rate, <5% false positive
   - **Implementation**: PCA preprocessing + Isolation Forest, explain top features
   - **Business Value**: Catch marginal devices before customer field failures

4. **Test Time Efficiency Anomaly Tracker** 💰 $2M+ Throughput Gain
   - **Objective**: Detect tests taking unusually long/short (handler/probe issues)
   - **Features**: test_name, test_time_ms, temperature, handler_id, probe_card_age
   - **Success Metric**: Reduce test time variance by 30%, increase ATE utilization 15%
   - **Implementation**: Per-test Isolation Forest models, alert on probe card wear
   - **Business Value**: Optimize test flow, maximize throughput

### General AI/ML Projects

5. **Network Intrusion Detection System** 💰 $20M+ Security Value
   - **Objective**: Real-time detection of abnormal network traffic patterns
   - **Features**: Packet size, protocol, port, connection duration, byte counts, packet flags
   - **Success Metric**: 98% attack detection, <0.1% false positive rate
   - **Implementation**: Isolation Forest on sliding window, alert on high anomaly scores
   - **Business Value**: Prevent data breaches, DDoS attacks, malware propagation

6. **Credit Card Fraud Detection Engine** 💰 $100M+ Fraud Prevention
   - **Objective**: Flag fraudulent transactions in real-time (sub-second latency)
   - **Features**: Transaction amount, merchant category, time, location, user history
   - **Success Metric**: 85% fraud detection, 1% false decline rate
   - **Implementation**: Ensemble (Isolation Forest + XGBoost), A/B test threshold
   - **Business Value**: Reduce chargebacks, customer trust, regulatory compliance

7. **Industrial IoT Sensor Anomaly Platform** 💰 $15M+ Downtime Prevention
   - **Objective**: Predictive maintenance via anomaly detection on 1000+ sensors
   - **Features**: Temperature, vibration, pressure, flow rate, power consumption
   - **Success Metric**: Predict failures 48 hours in advance, 90% accuracy
   - **Implementation**: Per-sensor Isolation Forest, federated learning across sites
   - **Business Value**: Avoid unplanned downtime, optimize maintenance schedules

8. **Healthcare Patient Deterioration Alert** 💰 $50M+ Lives Saved
   - **Objective**: Early warning system for patient vital sign abnormalities
   - **Features**: Heart rate, blood pressure, SpO2, temperature, respiratory rate
   - **Success Metric**: Detect sepsis/cardiac events 4 hours earlier, 80% sensitivity
   - **Implementation**: Streaming Isolation Forest, personalized baselines, HIPAA-compliant
   - **Business Value**: Reduce ICU transfers, mortality rates, healthcare costs

## 🔍 Key Takeaways

### ✅ When to Use Isolation Forest
- **Large datasets** (>10K samples): O(n log n) scales well
- **High dimensions** (>10 features): No distance metric curse
- **Sparse anomalies**: Few, isolated outliers (not clustered anomalies)
- **Speed critical**: Real-time inference (<1ms per sample)
- **No labeled data**: Unsupervised, only need contamination estimate

### ❌ Limitations
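- **Struggles with local and clustered anomalies**: Points that are unusual only relative to their neighborhood, or anomalies that form their own small cluster, tend to get long paths and low scores (consider LOF instead)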
- **Contamination sensitivity**: Requires decent estimate of anomaly proportion
- **No anomaly explanation**: Scores don't reveal which features caused anomaly (use SHAP/LIME)
- **Assumes anomalies are global outliers**: Not suitable for context-dependent anomalies

### 🔧 Hyperparameter Tuning Guidelines
1. **contamination** (most critical):
   - Start with domain knowledge (e.g., 3% defect rate in semiconductors)
   - Too low: Miss anomalies (high false negatives)
   - Too high: Many false positives
   - Tune via precision-recall curve on validation set

2. **n_estimators** (number of trees):
   - Default 100 usually sufficient
   - Increase to 200-500 for more stable scores (diminishing returns)
   - Monitor score variance across trees

3. **max_samples** (subsample size ψ):
   - Default 256 works well (controls tree height)
   - Larger: More global view, slower
   - Smaller: More localized, faster, may miss subtle anomalies
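
When a small labeled validation set is available, the effect of `contamination` can be explored by sweeping it and recording precision and recall at each setting. The sketch below is illustrative and standard-library only. `sweep_contamination` is a hypothetical helper, and the scores and labels are made up.

```python
def sweep_contamination(scores, is_anomaly, contaminations):
    """For each contamination value, flag the top fraction by score and report precision/recall."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_true = sum(is_anomaly)
    rows = []
    for cont in contaminations:
        n_flagged = max(1, round(len(scores) * cont))
        flagged = order[:n_flagged]
        tp = sum(1 for i in flagged if is_anomaly[i])
        precision = tp / n_flagged
        recall = tp / n_true if n_true else 0.0
        rows.append((cont, precision, recall))
    return rows


if __name__ == "__main__":
    # Made-up validation scores (higher = more anomalous) and ground truth
    scores = [0.81, 0.74, 0.66, 0.61, 0.55, 0.52, 0.50, 0.48, 0.45, 0.41]
    is_anomaly = [True, True, False, True, False, False, False, False, False, False]
    for cont, precision, recall in sweep_contamination(scores, is_anomaly, [0.1, 0.2, 0.3, 0.5]):
        print(f"contamination={cont:.1f}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising `contamination` flags more points, so recall can only stay the same or rise while precision typically falls; the sweep makes that trade-off visible before a value is fixed.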

### 🚀 Next Steps
1. **Notebook 037**: One-Class SVM for boundary-based anomaly detection
2. **Notebook 038**: AutoEncoders for deep learning-based anomaly detection
3. **Advanced topic**: Explain anomaly scores with SHAP TreeExplainer
4. **Production**: Deploy with MLflow, monitor score distribution drift

### 📚 Further Reading
- Liu et al. (2008): "Isolation Forest" - Original paper
- Isolation Forest for anomaly detection in streaming data
- Extended Isolation Forest (EIF) - improved split selection
- Combining Isolation Forest with supervised learning (semi-supervised)