# 042: Model Evaluation Metrics

## 🎯 Learning Objectives

By the end of this notebook, you will:
- **Understand** classification metrics (accuracy, precision, recall, F1, AUC-ROC)
- **Master** regression metrics (MSE, RMSE, MAE, R², MAPE)
- **Implement** confusion matrices, ROC curves, and precision-recall curves
- **Apply** business-aligned metrics for semiconductor test optimization
- **Build** automated evaluation dashboards for model monitoring

## 📚 What are Model Evaluation Metrics?

**Model evaluation metrics** quantify how well a model performs on unseen data. Choosing the right metric depends on:
- Problem type (classification vs. regression)
- Business objective (minimize cost, maximize revenue, balance trade-offs)
- Class imbalance (accuracy misleads when classes are skewed)
- Error asymmetry (false positives vs. false negatives)

**Key Classification Metrics:**
- **Accuracy**: Correct predictions / Total predictions (misleading if imbalanced)
- **Precision**: TP / (TP + FP) - "Of predicted positives, how many are correct?"
- **Recall**: TP / (TP + FN) - "Of actual positives, how many did we catch?"
- **F1-Score**: Harmonic mean of precision and recall (balanced metric)
- **AUC-ROC**: Area under receiver operating characteristic curve (threshold-independent)

**Key Regression Metrics:**
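- **MSE/RMSE**: Mean squared error and its square root (penalize large errors quadratically)
- **MAE**: Mean absolute error (less sensitive to outliers than MSE)
- **R²**: Proportion of variance explained (at most 1; negative when the model is worse than predicting the mean)
- **MAPE**: Mean absolute percentage error (scale-independent, undefined when an actual value is 0)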

## 🏭 Post-Silicon Validation Use Cases

**Automated Metrics Dashboard**
- Input: Model predictions + ground truth from validation sets
- Output: Real-time dashboard tracking 10+ metrics over time
- Value: Data-driven model selection and monitoring ($5M+ confidence)

**Cost-Optimized Decision Thresholds**
- Input: Business costs (FP = $500 yield loss, FN = $10K field return)
- Output: Optimal probability threshold maximizing profit
- Value: $10-20M annual savings through cost-aware predictions

**Model Degradation Detector**
- Input: Streaming predictions + validation labels in production
- Output: Alerts when metrics degrade >5% from baseline
- Value: $15M+ prevented revenue loss from model staleness

**Multi-Objective Trade-Off Optimizer**
- Input: Multiple competing metrics (accuracy, speed, cost)
- Output: Pareto frontier identifying optimal model configurations
- Value: $12M+ through balanced optimization (98% accuracy + 20% faster)

## 🔄 Metrics Evaluation Workflow

```mermaid
graph LR
    A[Model Predictions] --> B[Compute Metrics]
    C[Ground Truth] --> B
    B --> D{Classification?}
    D -->|Yes| E[ROC, PR, F1]
    D -->|No| F[RMSE, MAE, R²]
    E --> G[Business Alignment]
    F --> G
    G --> H[Select Best Model]
    
    style A fill:#e1f5ff
    style H fill:#e1ffe1
```

## 📊 Learning Path Context

**Prerequisites:**
- 013: Logistic Regression (classification fundamentals)
- 010: Linear Regression (regression fundamentals)

**Next Steps:**
- 043: Cross-Validation Strategies (robust evaluation)
- 044: Hyperparameter Tuning (optimization)

---

Let's master model evaluation for production ML! 🚀

## 📝 Section 1: Classification Metrics - Confusion Matrix Foundation

### The Confusion Matrix

The **confusion matrix** is the foundation for all classification metrics. It shows the counts of correct and incorrect predictions for each class.

**Binary Classification:**

```
                  Predicted
                 Positive | Negative
         ─────────────────┼─────────
Actual   Positive    TP   |   FN
         Negative    FP   |   TN
```

**Definitions:**
- **TP (True Positive):** Correctly predicted positive (model said YES, actual YES)
- **TN (True Negative):** Correctly predicted negative (model said NO, actual NO)
- **FP (False Positive):** Incorrectly predicted positive (model said YES, actual NO) - **Type I Error**
- **FN (False Negative):** Incorrectly predicted negative (model said NO, actual YES) - **Type II Error**

**Semiconductor Example:**
- **TP:** Device predicted to fail, actually fails → Caught by test ✅
- **TN:** Device predicted to pass, actually passes → Good device shipped ✅
- **FP:** Device predicted to fail, actually passes → Unnecessary scrap 💰
- **FN:** Device predicted to pass, actually fails → Test escape (shipped to customer) ⚠️
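
The four cells can be counted directly from paired labels. The following is a minimal sketch in plain Python, assuming 1 = fail (positive class) and 0 = pass; the labels are made up for illustration and `confusion_counts` is a hypothetical helper, not part of the notebook's classes.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical test results: 1 = device fails, 0 = device passes
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} (caught), TN={tn} (good shipped), FP={fp} (overkill), FN={fn} (escape)")
```

With scikit-learn, `confusion_matrix(y_true, y_pred).ravel()` returns the same counts for binary labels, but in the order `tn, fp, fn, tp`.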

### Derived Metrics from Confusion Matrix

#### 1. Accuracy

**Definition:** Proportion of correct predictions (both positive and negative).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

**When to Use:**
- ✅ Balanced datasets (roughly equal class distribution)
- ✅ Equal cost for both types of errors
- ✅ High-level performance summary

**When NOT to Use:**
- ❌ Imbalanced datasets (accuracy paradox)
- ❌ Asymmetric error costs

**Example:** If 95% of samples belong to class 1, a model that always predicts class 1 reaches 95% accuracy without learning anything about the minority class.
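
The accuracy paradox is easy to reproduce. The sketch below uses a made-up lot in which the majority class is passing devices (label 0) and a "model" that always predicts the majority class; the numbers are illustrative only.

```python
# Hypothetical lot: 95 passing devices (0) and 5 failing devices (1)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # always predict the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)  # fraction of failing devices that were flagged

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy looks strong while every failing device escapes, which is why recall (or precision) should be reported alongside accuracy on skewed data.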

#### 2. Precision (Positive Predictive Value)

**Definition:** Of all positive predictions, what fraction were actually positive?

$$\text{Precision} = \frac{TP}{TP + FP}$$

**Interpretation:** "When the model says YES, how often is it right?"

**When to Use:**
- ✅ False positives are costly (spam detection, fraud detection)
- ✅ Resources limited (can only investigate X positive predictions)
- ✅ Want to minimize false alarms

**Semiconductor Example:**
- High precision → Few good devices scrapped unnecessarily
- Precision = 0.90 → 10% of predicted-fail devices actually pass (overkill)

#### 3. Recall (Sensitivity, True Positive Rate)

**Definition:** Of all actual positives, what fraction did we correctly identify?

$$\text{Recall} = \frac{TP}{TP + FN}$$

**Interpretation:** "Of all the YES cases, how many did we catch?"

**When to Use:**
- ✅ False negatives are costly (disease detection, security threats)
- ✅ Must catch as many positives as possible
- ✅ Imbalanced datasets with rare positive class

**Semiconductor Example:**
- High recall → Few failing devices escape to customers
- Recall = 0.95 → 5% of failing devices not caught by test (test escapes)

#### 4. Specificity (True Negative Rate)

**Definition:** Of all actual negatives, what fraction did we correctly identify?

$$\text{Specificity} = \frac{TN}{TN + FP}$$

**Interpretation:** "Of all the NO cases, how many did we correctly identify?"

**Relationship:** Specificity = 1 - False Positive Rate
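
As a quick check of the definitions so far, the sketch below computes precision, recall, specificity, and the false positive rate from one set of counts. The counts are hypothetical and `rates` is an illustrative helper name.

```python
def rates(tp, tn, fp, fn):
    """Precision, recall, specificity and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, specificity, fpr

# Hypothetical counts: 1,000 devices, 100 of which truly fail
precision, recall, specificity, fpr = rates(tp=95, tn=890, fp=10, fn=5)
print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} FPR={fpr:.3f}")
```

Specificity and FPR share the denominator TN + FP, so they always sum to 1 whenever at least one actual negative exists.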

#### 5. F1 Score (Harmonic Mean of Precision and Recall)

**Definition:** Single metric combining precision and recall.

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN}$$

**Why Harmonic Mean?** Penalizes extreme imbalances (if either precision or recall is low, F1 is low).

**When to Use:**
- ✅ Need single metric for model comparison
- ✅ Want balance between precision and recall
- ✅ Imbalanced datasets

**Example:**
- Precision = 0.90, Recall = 0.10 → F1 = 0.18 (penalized for low recall)
- Precision = 0.50, Recall = 0.50 → F1 = 0.50 (balanced)
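
The two cases above can be verified in a few lines. The sketch compares the harmonic mean (F1) with the plain arithmetic mean, which would score both cases identically; `f1` is an illustrative helper.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

for p, r in [(0.90, 0.10), (0.50, 0.50)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.2f} R={r:.2f}  arithmetic mean={arithmetic:.2f}  F1={f1(p, r):.2f}")
```

Both pairs have an arithmetic mean of 0.50, but only the balanced pair keeps an F1 of 0.50; the lopsided pair drops to 0.18.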

#### 6. F-beta Score (Weighted F Score)

**Definition:** Generalized F1 score that allows weighting recall vs precision.

$$F_{\beta} = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$$

Where:
- $\beta > 1$: Favor recall (reduce false negatives)
- $\beta < 1$: Favor precision (reduce false positives)
- $\beta = 1$: Balanced (standard F1 score)

**Common Values:**
- **F0.5:** Precision matters 2x more than recall
- **F1:** Equal weight (harmonic mean)
- **F2:** Recall matters 2x more than precision

**Semiconductor Example:**
- Use F2 when test escapes (FN) cost $10K but overkill (FP) costs $100
- Use F0.5 when overkill is very expensive (high-value devices)
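
To see how $\beta$ shifts the score, the sketch below evaluates the formula for one hypothetical model whose precision is higher than its recall. The values 0.90 and 0.50 are made up for illustration.

```python
def fbeta(precision, recall, beta):
    """F-beta score from precision and recall; beta > 1 favors recall."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom

# Hypothetical model: precise but misses half of the failing devices
precision, recall = 0.90, 0.50
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {fbeta(precision, recall, beta):.3f}")
```

Because recall is the weaker of the two here, the score falls as $\beta$ grows: F0.5 is the highest and F2 the lowest. A model with the opposite profile would show the reverse ordering.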

### Precision-Recall Trade-off

**Key Insight:** Precision and recall often trade off against each other.

**Increasing Threshold:**
- Predict positive only when very confident
- ↑ Precision (fewer false positives)
- ↓ Recall (more false negatives)

**Decreasing Threshold:**
- Predict positive more liberally
- ↓ Precision (more false positives)
- ↑ Recall (fewer false negatives)

**Example:**
```
Threshold = 0.9: Precision = 0.95, Recall = 0.60 (conservative)
Threshold = 0.5: Precision = 0.85, Recall = 0.80 (balanced)
Threshold = 0.1: Precision = 0.60, Recall = 0.95 (aggressive)
```
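
The same pattern appears when sweeping a threshold over real scores. The sketch below uses ten made-up labels and scores (not the numbers in the table above), and `precision_recall_at` is a hypothetical helper.

```python
def precision_recall_at(y_true, y_prob, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and t == 1:
            tp += 1
        elif predicted_positive and t == 0:
            fp += 1
        elif not predicted_positive and t == 1:
            fn += 1
    # Convention assumed here: precision is 1.0 when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels (1 = fail) and model scores
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.92, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.15, 0.05]

for threshold in (0.9, 0.5, 0.1):
    precision, recall = precision_recall_at(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data, lowering the threshold from 0.9 to 0.1 raises recall from 0.40 to 1.00 while precision falls from 1.00 to about 0.56.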

### 📝 Implementation

**Purpose:** Core implementation with detailed code

**Key implementation details below.**

In [None]:
# Classification Metrics Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, auc, precision_recall_curve
class ClassificationMetrics:
    """
    Comprehensive classification metrics toolkit.
    
    Computes all standard metrics from confusion matrix:
    - Accuracy, Precision, Recall, Specificity
    - F1 score, F-beta scores
    - Matthews Correlation Coefficient
    - ROC-AUC (requires probability scores)
    """
    
    def __init__(self, y_true, y_pred, y_prob=None):
        """
        Initialize with true labels and predictions.
        
        Parameters:
        -----------
        y_true : array-like
            True labels (0 or 1)
        y_pred : array-like
            Predicted labels (0 or 1)
        y_prob : array-like, optional
            Predicted probabilities for positive class (for ROC-AUC)
        """
        self.y_true = np.array(y_true)
        self.y_pred = np.array(y_pred)
        self.y_prob = np.array(y_prob) if y_prob is not None else None
        
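        # Only labels 0 and 1 are supported
        if not (np.isin(self.y_true, [0, 1]).all() and np.isin(self.y_pred, [0, 1]).all()):
            raise ValueError("Only binary classification supported")
        
        # Compute confusion matrix; labels=[0, 1] keeps it 2x2 even when one
        # class is missing from both y_true and y_pred
        cm = confusion_matrix(self.y_true, self.y_pred, labels=[0, 1])
        
        # Extract counts (sklearn flattens the binary matrix as tn, fp, fn, tp)
        self.tn, self.fp, self.fn, self.tp = cm.ravel()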
        
        self.total = self.tp + self.tn + self.fp + self.fn
    
    def accuracy(self):
        """Calculate accuracy: (TP + TN) / Total"""
        return (self.tp + self.tn) / self.total
    
    def precision(self):
        """Calculate precision: TP / (TP + FP)"""
        if (self.tp + self.fp) == 0:
            return 0.0  # No positive predictions
        return self.tp / (self.tp + self.fp)
    
    def recall(self):
        """Calculate recall (sensitivity): TP / (TP + FN)"""
        if (self.tp + self.fn) == 0:
            return 0.0  # No actual positives
        return self.tp / (self.tp + self.fn)
    
    def specificity(self):
        """Calculate specificity: TN / (TN + FP)"""
        if (self.tn + self.fp) == 0:
            return 0.0  # No actual negatives
        return self.tn / (self.tn + self.fp)
    
    def f1_score(self):


        """Calculate F1 score: harmonic mean of precision and recall"""
        prec = self.precision()
        rec = self.recall()
        
        if (prec + rec) == 0:
            return 0.0
        
        return 2 * (prec * rec) / (prec + rec)
    
    def fbeta_score(self, beta=1.0):
        """
        Calculate F-beta score with custom beta.
        
        Parameters:
        -----------
        beta : float
            Weight of recall vs precision
            beta > 1: favor recall
            beta < 1: favor precision
        """
        prec = self.precision()
        rec = self.recall()
        
        if (beta**2 * prec + rec) == 0:
            return 0.0
        
        return (1 + beta**2) * (prec * rec) / (beta**2 * prec + rec)
    
    def matthews_corrcoef(self):
        """
        Calculate Matthews Correlation Coefficient (MCC).
        
        Range: [-1, 1]
        +1: Perfect prediction
        0: Random prediction
        -1: Total disagreement
        
        Good for imbalanced datasets.
        """
        # Work in floats so the four-term product cannot overflow int64 on large datasets
        tp, tn, fp, fn = (float(v) for v in (self.tp, self.tn, self.fp, self.fn))
        numerator = (tp * tn) - (fp * fn)
        denominator = np.sqrt(
            (tp + fp) * (tp + fn) * 
            (tn + fp) * (tn + fn)
        )
        
        if denominator == 0:
            return 0.0
        
        return numerator / denominator
    
    def roc_auc(self):
        """
        Calculate ROC-AUC score.
        Requires probability scores (y_prob).
        
        Range: [0, 1]
        0.5: Random classifier
        1.0: Perfect classifier
        """
        if self.y_prob is None:
            raise ValueError("y_prob required for ROC-AUC calculation")
        
        fpr, tpr, _ = roc_curve(self.y_true, self.y_prob)
        return auc(fpr, tpr)
    
    def get_confusion_matrix(self):
        """Return confusion matrix as 2x2 array"""
        return np.array([[self.tn, self.fp],
                        [self.fn, self.tp]])
    


    def summary(self):
        """
        Print comprehensive metrics summary.
        """
        print("=" * 60)
        print("CLASSIFICATION METRICS SUMMARY")
        print("=" * 60)
        
        print("\nConfusion Matrix:")
        print(f"                 Predicted")
        print(f"                 Neg    Pos")
        print(f"       ─────────────────────")
        print(f"Actual Neg      {self.tn:4d}   {self.fp:4d}")
        print(f"       Pos      {self.fn:4d}   {self.tp:4d}")
        
        print("\nBasic Metrics:")
        print(f"  Accuracy:    {self.accuracy():.4f}")
        print(f"  Precision:   {self.precision():.4f}")
        print(f"  Recall:      {self.recall():.4f}")
        print(f"  Specificity: {self.specificity():.4f}")
        
        print("\nComposite Metrics:")
        print(f"  F1 Score:    {self.f1_score():.4f}")
        print(f"  F0.5 Score:  {self.fbeta_score(0.5):.4f} (precision focus)")
        print(f"  F2 Score:    {self.fbeta_score(2.0):.4f} (recall focus)")
        print(f"  MCC:         {self.matthews_corrcoef():.4f}")
        
        if self.y_prob is not None:
            print(f"  ROC-AUC:     {self.roc_auc():.4f}")
        
        print("\nError Analysis:")
        print(f"  False Positives: {self.fp} ({self.fp/self.total*100:.1f}%)")
        print(f"  False Negatives: {self.fn} ({self.fn/self.total*100:.1f}%)")
        print(f"  Total Errors:    {self.fp + self.fn} ({(self.fp + self.fn)/self.total*100:.1f}%)")
    
    def plot_confusion_matrix(self, normalize=False):
        """
        Plot confusion matrix as heatmap.
        
        Parameters:
        -----------
        normalize : bool
            If True, show percentages instead of counts
        """
        cm = self.get_confusion_matrix()
        
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            fmt = '.2%'
            title = 'Normalized Confusion Matrix'
        else:
            fmt = 'd'
            title = 'Confusion Matrix'
        
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm, annot=True, fmt=fmt, cmap='Blues', 
                   xticklabels=['Negative', 'Positive'],
                   yticklabels=['Negative', 'Positive'],
                   cbar_kws={'label': 'Percentage' if normalize else 'Count'})
        plt.ylabel('Actual')
        plt.xlabel('Predicted')
        plt.title(title)
        plt.tight_layout()
        plt.show()
print("✅ ClassificationMetrics class implemented")
print("\nAvailable Metrics:")
print("- accuracy(): Overall correct predictions")
print("- precision(): Positive predictive value")
print("- recall(): True positive rate (sensitivity)")


print("- specificity(): True negative rate")
print("- f1_score(): Harmonic mean of precision and recall")
print("- fbeta_score(beta): Weighted F score")
print("- matthews_corrcoef(): MCC for imbalanced data")
print("- roc_auc(): Area under ROC curve")
print("- summary(): Print all metrics")
print("- plot_confusion_matrix(): Visualize confusion matrix")


## 📝 Section 2: Advanced Classification Metrics - ROC and PR Curves

### ROC Curve (Receiver Operating Characteristic)

**Definition:** Plot of True Positive Rate (Recall) vs False Positive Rate across all classification thresholds.

**Axes:**
- **X-axis:** False Positive Rate (FPR) = $\frac{FP}{FP + TN}$ = 1 - Specificity
- **Y-axis:** True Positive Rate (TPR) = $\frac{TP}{TP + FN}$ = Recall

**How It Works:**
1. Model outputs probabilities P(positive) for each sample
2. For each threshold t ∈ [0, 1]:
   - Classify as positive if P(positive) ≥ t
   - Calculate TPR and FPR at this threshold
3. Plot all (FPR, TPR) points
4. Connect points to form ROC curve
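
The four steps above can be sketched in plain Python. This is an illustrative version with made-up data; `roc_points` and `trapezoid_auc` are hypothetical helper names, and every distinct score is used as a candidate threshold.

```python
def roc_points(y_true, y_prob):
    """Return the list of (FPR, TPR) points, from the strictest threshold down."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(y_prob), reverse=True):
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels (1 = fail) and model scores
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_prob = [0.95, 0.92, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.15, 0.05]

points = roc_points(y_true, y_prob)
print(f"AUC = {trapezoid_auc(points):.2f}")
```

The curve starts at (0, 0), ends at (1, 1), and for this toy data encloses an area of 0.96.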

**Interpretation:**
- **Diagonal line (FPR = TPR):** Random classifier (AUC = 0.5)
- **Top-left corner:** Perfect classifier (TPR = 1, FPR = 0)
- **Area Under Curve (AUC):** Overall model quality

**AUC-ROC Values:**
```
AUC = 0.5:  Random guessing (coin flip)
AUC = 0.7:  Fair model
AUC = 0.8:  Good model
AUC = 0.9:  Excellent model
AUC = 1.0:  Perfect model
```

**Advantages:**
- ✅ Threshold-independent (shows performance across all thresholds)
- ✅ Insensitive to class proportions, so scores are comparable across datasets with different positive rates
- ✅ Easy to compare multiple models
- ✅ Probabilistic interpretation (AUC = probability that the model ranks a random positive higher than a random negative)

**Limitations:**
- ❌ Can look overly optimistic on highly imbalanced datasets (the large TN count keeps FPR low)
- ❌ Doesn't directly show precision
- ❌ Doesn't incorporate class distribution
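
The probabilistic interpretation of AUC can be checked by brute force: compare every positive score with every negative score and count how often the positive wins. The sketch below is illustrative (quadratic in the number of samples, so only suitable for small data), with made-up scores and a hypothetical helper name.

```python
def roc_auc_by_ranking(y_true, y_score):
    """P(random positive scores higher than random negative); ties count as 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = fail) and model scores
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.95, 0.92, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.15, 0.05]

print(f"AUC = {roc_auc_by_ranking(y_true, y_score):.2f}")
```

Here 24 of the 25 positive-negative pairs are ordered correctly, giving 0.96. The area under the ROC curve for the same data should match this value, since the two definitions are equivalent.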

### Precision-Recall Curve

**Definition:** Plot of Precision vs Recall across all classification thresholds.

**Axes:**
- **X-axis:** Recall = $\frac{TP}{TP + FN}$
- **Y-axis:** Precision = $\frac{TP}{TP + FP}$

**When to Use:**
- ✅ **Highly imbalanced datasets** (e.g., fraud detection: 0.1% fraud rate)
- ✅ When positive class is the focus
- ✅ When false positives are costly

**Interpretation:**
- **Top-right corner:** Perfect classifier (Precision = 1, Recall = 1)
- **Baseline:** Random classifier → Precision = proportion of positives
- **Trade-off:** As recall increases (more liberal threshold), precision typically decreases

**Average Precision (AP):**
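- Summarizes the PR curve as the mean of precision weighted by each increase in recall (approximately the area under the curve)
- Single-number summary
- Usually more informative than AUC-ROC when the positive class is rare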
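
Average precision can be computed as a step-wise sum: scan samples from the highest score down and, each time a true positive is found, add the precision at that rank weighted by the recall gained. The sketch below assumes there are no tied scores; the data and the helper name are illustrative.

```python
def average_precision(y_true, y_score):
    """AP = sum of (recall step) * (precision at that step); assumes no tied scores."""
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            tp += 1
            ap += (tp / rank) / n_pos  # precision at this rank, recall step of 1/n_pos
    return ap

# Hypothetical labels (1 = fail) and model scores
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
y_score = [0.95, 0.92, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.15, 0.05]

baseline = sum(y_true) / len(y_true)  # precision of a random classifier
print(f"AP = {average_precision(y_true, y_score):.3f}, baseline = {baseline:.3f}")
```

Comparing AP against the positive rate (0.5 in this toy data, around 0.02 in the defect scenario) is what makes it meaningful on imbalanced problems.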

### ROC vs PR Curve: When to Use Which?

| Characteristic | ROC Curve | PR Curve |
|----------------|-----------|----------|
| **Balanced data** | ✅ Excellent | ✅ Good |
| **Imbalanced data (1:99)** | ⚠️ Optimistic | ✅ Realistic |
| **Focus on positives** | ❌ No | ✅ Yes |
| **Interpretability** | ✅ Easy (TPR vs FPR) | ⚠️ Moderate |
| **Threshold selection** | ✅ Yes | ✅ Yes |
| **Multiple models** | ✅ Easy comparison | ✅ Easy comparison |

**Rule of Thumb:**
- **Use ROC:** Balanced datasets, care about both classes equally
- **Use PR:** Imbalanced datasets, positive class is rare and important

### Semiconductor Testing Example

**Scenario:** Defect detection with 2% failure rate (highly imbalanced)

**Two Models:**
- **Model A:** AUC-ROC = 0.95, Average Precision = 0.70
- **Model B:** AUC-ROC = 0.93, Average Precision = 0.80

**Analysis:**
- ROC suggests Model A is better (0.95 > 0.93)
- PR suggests Model B is better (0.80 > 0.70)
- **Which to trust?** PR curve (highly imbalanced, focus on detecting rare defects)

**Why the difference?**
- Model A achieves high AUC-ROC by correctly classifying many true negatives (98% of data)
- Model B sacrifices some TN but does better job on positives (defects)
- PR curve reveals Model B is better at actual defect detection

### Threshold Selection Strategies

**1. Maximize F1 Score:**
```python
# Find threshold that maximizes F1
best_threshold = threshold_at_max_f1(precisions, recalls, thresholds)
```

**2. Fixed Recall (minimize FN):**
```python
# Ensure 95% recall, optimize precision
threshold = threshold_at_recall(recalls, thresholds, target_recall=0.95)
```

**3. Fixed Precision (minimize FP):**
```python
# Ensure 90% precision, optimize recall
threshold = threshold_at_precision(precisions, thresholds, target_precision=0.90)
```

**4. Cost-Based:**
```python
# Minimize: cost_FP * FP + cost_FN * FN
threshold = threshold_min_cost(y_true, y_prob, cost_FP=100, cost_FN=10000)
```

**Semiconductor Example:**
- Cost of test escape (FN): $10,000 per device
- Cost of overkill (FP): $100 per device
- → Choose threshold that minimizes total cost: $100 × FP + $10,000 × FN
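
The strategy snippets above call helper functions that are not defined in this notebook. The sketch below shows one way two of them could look in plain Python, taking labels and scores directly; the function names, signatures, and data are assumptions made for illustration, and every observed score is tried as a candidate threshold.

```python
def counts_at(y_true, y_prob, threshold):
    """Return (tp, fp, fn) when scores >= threshold are predicted positive."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_prob):
        if p >= threshold:
            if t == 1:
                tp += 1
            else:
                fp += 1
        elif t == 1:
            fn += 1
    return tp, fp, fn

def threshold_max_f1(y_true, y_prob):
    """Return (threshold, f1) for the candidate threshold with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for threshold in sorted(set(y_prob)):
        tp, fp, fn = counts_at(y_true, y_prob, threshold)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1

def threshold_min_cost(y_true, y_prob, cost_fp, cost_fn):
    """Return (threshold, total_cost) minimizing cost_fp * FP + cost_fn * FN."""
    best_threshold, best_cost = None, float("inf")
    for threshold in sorted(set(y_prob)):
        _, fp, fn = counts_at(y_true, y_prob, threshold)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Hypothetical data: one failing device receives a low score (0.15)
y_true = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
y_prob = [0.95, 0.92, 0.80, 0.65, 0.60, 0.45, 0.40, 0.30, 0.15, 0.05]

print("max F1:  ", threshold_max_f1(y_true, y_prob))
print("min cost:", threshold_min_cost(y_true, y_prob, cost_fp=100, cost_fn=10_000))
```

On this toy data the F1 criterion picks a threshold of 0.45, while the cost criterion drops to 0.15: with escapes priced at 100 times overkill, accepting three false positives is cheaper than missing one failing device.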

### 📝 Implementation

**Purpose:** Core implementation with detailed code

**Key implementation details below.**

In [None]:
# ROC and PR Curve Implementation
class ROCPRAnalyzer:
    """
    Comprehensive ROC and Precision-Recall curve analysis.
    
    Provides:
    - ROC curve plotting with AUC
    - Precision-Recall curve with AP
    - Threshold analysis and selection
    - Cost-based optimization
    """
    
    def __init__(self, y_true, y_prob):
        """
        Initialize with true labels and predicted probabilities.
        
        Parameters:
        -----------
        y_true : array-like
            True labels (0 or 1)
        y_prob : array-like
            Predicted probabilities for positive class
        """
        self.y_true = np.array(y_true)
        self.y_prob = np.array(y_prob)
        
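        # Compute ROC curve
        self.fpr, self.tpr, self.roc_thresholds = roc_curve(self.y_true, self.y_prob)
        self.auc_roc = auc(self.fpr, self.tpr)
        
        # Compute Precision-Recall curve
        self.precision, self.recall, self.pr_thresholds = precision_recall_curve(self.y_true, self.y_prob)
        # Trapezoidal area under the PR curve: an approximation of average precision
        # that can differ slightly from sklearn.metrics.average_precision_score
        self.auc_pr = auc(self.recall, self.precision)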
    
    def plot_roc_curve(self, ax=None):
        """
        Plot ROC curve with AUC.
        """
        created_fig = ax is None  # remember whether this call owns the figure
        if created_fig:
            fig, ax = plt.subplots(figsize=(8, 6))
        
        ax.plot(self.fpr, self.tpr, linewidth=2, 
               label=f'ROC Curve (AUC = {self.auc_roc:.3f})')
        ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
        
        ax.set_xlabel('False Positive Rate', fontsize=12)
        ax.set_ylabel('True Positive Rate (Recall)', fontsize=12)
        ax.set_title('ROC Curve', fontsize=14, fontweight='bold')
        ax.legend(loc='lower right', fontsize=10)
        ax.grid(alpha=0.3)
        ax.set_xlim([0, 1])
        ax.set_ylim([0, 1])
        
        # Only finalize and show the figure when it was created here
        if created_fig:
            plt.tight_layout()
            plt.show()
    
    def plot_pr_curve(self, ax=None):
        """
        Plot Precision-Recall curve with AP.
        """
        created_fig = ax is None  # remember whether this call owns the figure
        if created_fig:
            fig, ax = plt.subplots(figsize=(8, 6))
        
        # Baseline (random classifier precision = positive rate)
        baseline_precision = np.sum(self.y_true) / len(self.y_true)
        
        ax.plot(self.recall, self.precision, linewidth=2,
               label=f'PR Curve (AP = {self.auc_pr:.3f})')
        ax.axhline(y=baseline_precision, color='k', linestyle='--', linewidth=1,
                  label=f'Baseline (Precision = {baseline_precision:.3f})')
        
        ax.set_xlabel('Recall', fontsize=12)
        ax.set_ylabel('Precision', fontsize=12)
        ax.set_title('Precision-Recall Curve', fontsize=14, fontweight='bold')
        ax.legend(loc='upper right', fontsize=10)
        ax.grid(alpha=0.3)
        ax.set_xlim([0, 1])
        ax.set_ylim([0, 1])
        
        # Only finalize and show the figure when it was created here
        if created_fig:
            plt.tight_layout()
            plt.show()
    
    def find_threshold_at_recall(self, target_recall):
        """
        Find threshold that achieves target recall.
        
        Parameters:
        -----------
        target_recall : float
            Desired recall (e.g., 0.95 for 95% recall)
        
        Returns:
        --------
        threshold, precision at that recall
        """
        # Find index where recall >= target (recall decreases as we go along curve)
        idx = np.where(self.recall >= target_recall)[0]
        
        if len(idx) == 0:
            return None, None
        
        idx = idx[-1]  # Get last index that meets requirement
        threshold = self.pr_thresholds[idx] if idx < len(self.pr_thresholds) else 0.0
        precision = self.precision[idx]
        
        return threshold, precision
    
    def find_threshold_at_precision(self, target_precision):
        """
        Find threshold that achieves target precision.
        
        Parameters:
        -----------
        target_precision : float
            Desired precision (e.g., 0.90 for 90% precision)
        
        Returns:
        --------
        threshold, recall at that precision
        """
        idx = np.where(self.precision >= target_precision)[0]
        
        if len(idx) == 0:
            return None, None
        
        idx = idx[0]  # Get first index that meets requirement
        threshold = self.pr_thresholds[idx] if idx < len(self.pr_thresholds) else 1.0
        recall = self.recall[idx]
        
        return threshold, recall
    
    def find_optimal_threshold_f1(self):
        """
        Find threshold that maximizes F1 score.
        
        Returns:
        --------


        threshold, best_f1, precision, recall
        """
        # Calculate F1 for each threshold
        f1_scores = []
        
        for i in range(len(self.pr_thresholds)):
            prec = self.precision[i]
            rec = self.recall[i]
            
            if (prec + rec) > 0:
                f1 = 2 * (prec * rec) / (prec + rec)
            else:
                f1 = 0
            
            f1_scores.append(f1)
        
        best_idx = np.argmax(f1_scores)
        best_threshold = self.pr_thresholds[best_idx]
        best_f1 = f1_scores[best_idx]
        best_precision = self.precision[best_idx]
        best_recall = self.recall[best_idx]
        
        return best_threshold, best_f1, best_precision, best_recall
    
    def find_optimal_threshold_cost(self, cost_fp, cost_fn):
        """
        Find threshold that minimizes total cost.
        
        Total Cost = cost_fp * FP + cost_fn * FN
        
        Parameters:
        -----------
        cost_fp : float
            Cost of one false positive
        cost_fn : float
            Cost of one false negative
        
        Returns:
        --------
        best_threshold, min_cost, FP, FN
        """
        n_pos = np.sum(self.y_true)
        n_neg = len(self.y_true) - n_pos
        
        costs = []
        thresholds_to_test = np.linspace(0, 1, 100)
        
        for threshold in thresholds_to_test:
            y_pred = (self.y_prob >= threshold).astype(int)
            
            # Calculate confusion matrix; labels=[0, 1] guarantees a 2x2 result,
            # so no edge case can silently report zero errors
            tn, fp, fn, tp = confusion_matrix(self.y_true, y_pred, labels=[0, 1]).ravel()
            
            total_cost = cost_fp * fp + cost_fn * fn
            costs.append((threshold, total_cost, fp, fn))
        
        # Find minimum cost
        best_result = min(costs, key=lambda x: x[1])
        
        return best_result  # (threshold, min_cost, FP, FN)
    
    def plot_threshold_analysis(self):
        """
        Plot how precision, recall, and F1 change with threshold.
        """


        # Calculate F1 scores
        f1_scores = []
        for i in range(len(self.pr_thresholds)):
            prec = self.precision[i]
            rec = self.recall[i]
            if (prec + rec) > 0:
                f1 = 2 * (prec * rec) / (prec + rec)
            else:
                f1 = 0
            f1_scores.append(f1)
        
        fig, ax = plt.subplots(figsize=(12, 6))
        
        ax.plot(self.pr_thresholds, self.precision[:-1], label='Precision', linewidth=2)
        ax.plot(self.pr_thresholds, self.recall[:-1], label='Recall', linewidth=2)
        ax.plot(self.pr_thresholds, f1_scores, label='F1 Score', linewidth=2, linestyle='--')
        
        # Mark optimal F1 threshold
        best_threshold, best_f1, _, _ = self.find_optimal_threshold_f1()
        ax.axvline(x=best_threshold, color='red', linestyle=':', linewidth=2,
                  label=f'Optimal F1 Threshold = {best_threshold:.3f}')
        
        ax.set_xlabel('Threshold', fontsize=12)
        ax.set_ylabel('Score', fontsize=12)
        ax.set_title('Metrics vs Threshold', fontsize=14, fontweight='bold')
        ax.legend(fontsize=10)
        ax.grid(alpha=0.3)
        ax.set_xlim([0, 1])
        ax.set_ylim([0, 1])
        
        plt.tight_layout()
        plt.show()
print("✅ ROCPRAnalyzer class implemented")
print("\nAvailable Methods:")
print("- plot_roc_curve(): Plot ROC curve with AUC")
print("- plot_pr_curve(): Plot Precision-Recall curve with AP")
print("- find_threshold_at_recall(): Get threshold for target recall")
print("- find_threshold_at_precision(): Get threshold for target precision")
print("- find_optimal_threshold_f1(): Maximize F1 score")
print("- find_optimal_threshold_cost(): Minimize cost (FP and FN costs)")
print("- plot_threshold_analysis(): Visualize metrics vs threshold")


## 📊 Regression Metrics Theory

When evaluating regression models (predicting continuous values), we need different metrics than classification. In **post-silicon validation**, regression metrics help evaluate:

- **Test time prediction**: Predict how long a device test will take (milliseconds)
- **Parametric yield prediction**: Predict continuous yield percentage (0-100%)
- **Power consumption models**: Predict device power draw (watts/milliwatts)
- **Temperature prediction**: Predict junction temperature under load (°C)

### Why Regression Metrics Differ from Classification

1. **Continuous outputs**: Predictions can be infinitely close to actual values
2. **Error magnitude matters**: Being off by 0.1% vs 10% has very different implications
3. **Direction of error**: Over-prediction vs under-prediction may have different costs
4. **Scale sensitivity**: Errors should be interpreted in context of the target variable's range

---

### 🔑 Key Regression Metrics

#### 1. Mean Squared Error (MSE)

**Formula:**
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

**Interpretation:**
- **Units**: Squared units of target variable (e.g., ms² for test time)
- **Range**: [0, ∞), where 0 is perfect
- **Penalty**: Heavily penalizes large errors (quadratic)
- **Use when**: Large errors are disproportionately costly

**Example:** Test time prediction with MSE = 25 ms² means the average squared error is 25 ms², which corresponds to an RMSE of 5 ms

---

#### 2. Root Mean Squared Error (RMSE)

**Formula:**
$$
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} = \sqrt{MSE}
$$

**Interpretation:**
- **Units**: Same as target variable (e.g., ms for test time)
- **Range**: [0, ∞), where 0 is perfect
- **Meaning**: Typical error magnitude in the target's own units; large errors are weighted more heavily, so RMSE ≥ MAE
- **Use when**: You need interpretable error magnitude

**Example:** Test time RMSE = 5 ms means predictions are typically off by about 5 milliseconds

---

#### 3. Mean Absolute Error (MAE)

**Formula:**
$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

**Interpretation:**
- **Units**: Same as target variable
- **Range**: [0, ∞), where 0 is perfect
- **Penalty**: Linear penalty for errors
- **Robust**: Less sensitive to outliers than MSE/RMSE
- **Use when**: All errors weighted equally, outliers present

**Example:** Yield MAE = 2% means average absolute deviation is 2 percentage points
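
To make the difference between the three error metrics concrete, here is a minimal sketch on hypothetical test-time data (the values are invented for illustration). Four predictions are off by 1 ms and one is off by 40 ms.

```python
import numpy as np

# Hypothetical test times in ms; the last prediction is a deliberate 40 ms miss.
y_true = np.array([50.0, 52.0, 48.0, 51.0, 49.0])
y_pred = np.array([51.0, 51.0, 49.0, 50.0, 89.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)      # squared units (ms^2)
rmse = np.sqrt(mse)             # back in ms
mae = np.mean(np.abs(errors))   # ms, every error weighted linearly

print(f"MSE = {mse:.1f} ms^2, RMSE = {rmse:.2f} ms, MAE = {mae:.2f} ms")
```

The single outlier contributes 1600 of the 1604 total squared error, so MSE is 320.8 ms² and RMSE is roughly 17.9 ms, while MAE is only 8.8 ms. The gap between RMSE and MAE is a quick signal that a few large errors dominate.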

---

#### 4. R-Squared (R²) / Coefficient of Determination

**Formula:**
$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}
$$

Where:
- $SS_{res}$ = Residual sum of squares (model error)
- $SS_{tot}$ = Total sum of squares (variance in data)
- $\bar{y}$ = Mean of actual values

**Interpretation:**
- **Range**: (-∞, 1], typically [0, 1]
- **Meaning**: Proportion of variance explained by the model
- **R² = 1.0**: Perfect predictions
- **R² = 0.0**: Model no better than predicting mean
- **R² < 0**: Model worse than predicting mean
- **Use when**: Comparing models on same dataset, need variance explanation

**Example:** R² = 0.85 means model explains 85% of variance in test time
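
The three reference points of R² can be checked directly from the formula. The sketch below uses a small helper named `r2` (a name chosen here for illustration) and a toy target vector.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 20.0, 30.0, 40.0])

r2_perfect = r2(y_true, y_true)                     # exact predictions
r2_mean = r2(y_true, np.full(4, y_true.mean()))     # always predict the mean
r2_reversed = r2(y_true, y_true[::-1])              # systematically wrong

print(r2_perfect, r2_mean, r2_reversed)
```

Perfect predictions give 1.0, the mean predictor gives 0.0, and the reversed predictions give -3.0 (SS_res = 2000 against SS_tot = 500), which shows how a model can score below zero.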

---

#### 5. Mean Absolute Percentage Error (MAPE)

**Formula:**
$$
MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|
$$

**Interpretation:**
- **Units**: Percentage (%)
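- **Range**: [0, ∞), where 0 is perfect
- **Scale-independent**: Can compare across different target ranges
- **Use when**: Target is strictly positive and well away from zero, and relative error is what matters
- **Limitation**: Undefined when $y_i = 0$ and unstable when $y_i$ is near zero; favors models that under-predict, because over-predictions can incur unbounded percentage errors while under-predictions of a positive target are capped at 100% (for non-negative predictions)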

**Example:** MAPE = 5% means predictions are off by 5% on average
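
A short sketch of MAPE's behavior, using invented power values in mW. The helper `mape` is written for this example and raises on zero targets; that is one possible design choice, not the only one.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error in %, undefined for zero targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when y_true contains zeros")
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# The same 10 mW absolute error is a very different relative error.
mape_small_target = mape([20.0], [30.0])     # 10 / 20
mape_large_target = mape([200.0], [210.0])   # 10 / 200

# Asymmetry: under-prediction is capped at 100%, over-prediction is not.
mape_under = mape([100.0], [0.0])
mape_over = mape([100.0], [300.0])

print(mape_small_target, mape_large_target, mape_under, mape_over)
```

The first pair evaluates to 50% and 5%, and the second pair to 100% and 200%. The second pair is why optimizing MAPE tends to pull predictions downward.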

---

#### 6. Max Error

**Formula:**
$$
\text{Max Error} = \max_{i=1}^{n} |y_i - \hat{y}_i|
$$

**Interpretation:**
- **Units**: Same as target variable
- **Range**: [0, ∞)
- **Meaning**: Worst-case prediction error
- **Use when**: Need to guarantee maximum acceptable error

**Example:** Max Error = 50 ms means worst test time prediction was off by 50 ms

---

### 🎯 Decision Guide: Which Regression Metric to Use?

```mermaid
graph TD
    A[Need to evaluate regression model] --> B{Need interpretable<br/>error magnitude?}
    B -->|Yes| C{Outliers<br/>present?}
    B -->|No| D{Compare models<br/>on same data?}
    
    C -->|Yes| E[Use MAE<br/>Robust to outliers]
    C -->|No| F[Use RMSE<br/>Penalizes large errors]
    
    D -->|Yes| G[Use R²<br/>Variance explained]
    D -->|No| H{Scale-independent<br/>comparison needed?}
    
    H -->|Yes| I[Use MAPE<br/>Percentage error]
    H -->|No| J{Worst-case<br/>guarantee needed?}
    
    J -->|Yes| K[Use Max Error<br/>Safety critical]
    J -->|No| L[Use MSE<br/>Loss function]
```

### Post-Silicon Validation Context

| **Use Case** | **Recommended Metrics** | **Why** |
|--------------|------------------------|---------|
| **Test time prediction** | RMSE + MAE + Max Error | Interpretable (ms), robust (MAE), worst-case (Max) |
| **Yield prediction** | RMSE + R² + MAPE | Interpretable (%), variance explained, scale-independent |
| **Power consumption** | MAE + MAPE | Robust to outliers, percentage useful for efficiency |
| **Spatial modeling** | RMSE + R² | Penalize large spatial errors, variance explanation |
| **Model comparison** | R² + RMSE | Standardized comparison, interpretable error |

### Common Pitfalls

1. **Using MAPE with zeros**: Undefined when actual = 0 (e.g., zero power consumption)
2. **Ignoring scale**: MSE = 100 is good for large values, terrible for small values
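3. **Over-relying on R²**: R² depends on the target's variance. It can be high despite large absolute errors when the target spans a wide range, and low despite small errors when the target barely varies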
4. **Comparing MSE across datasets**: Only valid for same target variable
5. **Forgetting residuals**: Always visualize residuals to check assumptions
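
The interaction between scale and R² can be checked numerically. In this sketch (toy numbers, chosen for illustration) the same ±1 errors are applied to a target that barely varies and to one that spans a wide range.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

errors = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

# Narrow target: values between 49 and 51.
y_narrow = np.array([49.0, 50.0, 51.0, 49.0, 50.0, 51.0])
# Wide target: values between 10 and 110.
y_wide = np.array([10.0, 30.0, 50.0, 70.0, 90.0, 110.0])

rmse_narrow, r2_narrow = rmse(y_narrow, y_narrow + errors), r2(y_narrow, y_narrow + errors)
rmse_wide, r2_wide = rmse(y_wide, y_wide + errors), r2(y_wide, y_wide + errors)

print(f"narrow: RMSE = {rmse_narrow:.2f}, R^2 = {r2_narrow:.4f}")
print(f"wide:   RMSE = {rmse_wide:.2f}, R^2 = {r2_wide:.4f}")
```

Both cases have an RMSE of exactly 1, yet R² is -0.5 for the narrow target and above 0.999 for the wide one. Reporting R² together with an absolute error metric avoids misreading either number.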

### 📝 Implementation

**Purpose:** Core implementation with detailed code

**Key implementation details below.**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple, Optional


class RegressionMetrics:
    """
    Comprehensive regression metrics calculator for model evaluation.
    
    Computes key metrics for continuous target predictions:
    - Mean Squared Error (MSE)
    - Root Mean Squared Error (RMSE)  
    - Mean Absolute Error (MAE)
    - R-Squared (R²)
    - Mean Absolute Percentage Error (MAPE)
    - Max Error
    
    Also provides residual analysis and visualization methods.
    """
    
    def __init__(self, y_true: np.ndarray, y_pred: np.ndarray):
        """
        Initialize with true and predicted values.
        
        Args:
            y_true: Actual target values (n_samples,)
            y_pred: Predicted target values (n_samples,)
        """
        self.y_true = np.array(y_true).flatten()
        self.y_pred = np.array(y_pred).flatten()
        
        if len(self.y_true) != len(self.y_pred):
            raise ValueError("y_true and y_pred must have same length")
        
        # Compute residuals (errors)
        self.residuals = self.y_true - self.y_pred
        self.n_samples = len(self.y_true)
        self.y_mean = np.mean(self.y_true)
    
    def mse(self) -> float:
        """
        Compute Mean Squared Error.
        
        MSE = (1/n) * Σ(y_i - ŷ_i)²
        
        Returns:
            Mean squared error (float)
        """
        return np.mean(self.residuals ** 2)
    
    def rmse(self) -> float:
        """
        Compute Root Mean Squared Error.
        
        RMSE = √(MSE)
        Same units as target variable.
        
        Returns:
            Root mean squared error (float)
        """
        return np.sqrt(self.mse())
    
    def mae(self) -> float:
        """
        Compute Mean Absolute Error.
        
        MAE = (1/n) * Σ|y_i - ŷ_i|
        Robust to outliers compared to MSE.
        
        Returns:
            Mean absolute error (float)
        """
        return np.mean(np.abs(self.residuals))
    
    def r2_score(self) -> float:
        """
        Compute R-Squared (coefficient of determination).
        
        R² = 1 - (SS_res / SS_tot)
        where SS_res = Σ(y_i - ŷ_i)²  [residual sum of squares]
              SS_tot = Σ(y_i - ȳ)²    [total sum of squares]
        
        Range: (-∞, 1], where 1 is perfect, 0 is mean predictor
        
        Returns:
            R-squared score (float)
        """
        ss_res = np.sum(self.residuals ** 2)
        ss_tot = np.sum((self.y_true - self.y_mean) ** 2)
        
        # Handle edge case: constant target
        if ss_tot == 0:
            return 1.0 if ss_res == 0 else 0.0
        
        return 1 - (ss_res / ss_tot)
    
    def mape(self) -> float:
        """
        Compute Mean Absolute Percentage Error.
        
        MAPE = (100/n) * Σ|y_i - ŷ_i| / |y_i|
        Scale-independent metric (percentage).
        
        Returns:
            Mean absolute percentage error in % (float)
            Returns np.inf if any y_true is zero
        """
        # Check for zeros in y_true
        if np.any(self.y_true == 0):
            print("Warning: MAPE is undefined when actual values contain zeros")
            return np.inf
        
        return 100 * np.mean(np.abs(self.residuals / self.y_true))
    
    def max_error(self) -> float:
        """
        Compute maximum absolute error (worst-case).
        
        Max Error = max|y_i - ŷ_i|
        
        Returns:
            Maximum absolute error (float)
        """
        return np.max(np.abs(self.residuals))
    
    def adjusted_r2(self, n_features: int) -> float:
        """
        Compute Adjusted R-Squared (penalizes for number of features).
        
        Adjusted R² = 1 - [(1-R²) * (n-1) / (n-p-1)]
        where n = number of samples, p = number of features
        
        Args:
            n_features: Number of features used in model
            
        Returns:
            Adjusted R-squared score (float)
        """
        r2 = self.r2_score()
        n = self.n_samples
        p = n_features
        
        if n <= p + 1:
            return np.nan  # Not enough samples
        
        return 1 - ((1 - r2) * (n - 1) / (n - p - 1))
    
    def get_residual_stats(self) -> Dict[str, float]:
        """
        Get comprehensive residual statistics.
        
        Returns:
            Dictionary with residual statistics
        """
        return {
            'mean_residual': np.mean(self.residuals),
            'std_residual': np.std(self.residuals),
            'min_residual': np.min(self.residuals),
            'max_residual': np.max(self.residuals),
            'median_residual': np.median(self.residuals),
            'q25_residual': np.percentile(self.residuals, 25),
            'q75_residual': np.percentile(self.residuals, 75)
        }
    
    def summary(self) -> None:
        """
        Print comprehensive metrics summary.
        """
        print("="*60)
        print("REGRESSION METRICS SUMMARY")
        print("="*60)
        print(f"Number of samples: {self.n_samples}")
        print(f"Target mean: {self.y_mean:.4f}")
        print(f"Target std: {np.std(self.y_true):.4f}")
        print(f"Target range: [{np.min(self.y_true):.4f}, {np.max(self.y_true):.4f}]")
        print("\n" + "-"*60)
        print("PRIMARY METRICS")
        print("-"*60)
        print(f"MSE:        {self.mse():.6f}")
        print(f"RMSE:       {self.rmse():.6f}")
        print(f"MAE:        {self.mae():.6f}")
        print(f"R²:         {self.r2_score():.6f}")
        print(f"MAPE:       {self.mape():.4f}%")
        print(f"Max Error:  {self.max_error():.6f}")
        
        print("\n" + "-"*60)
        print("RESIDUAL STATISTICS")
        print("-"*60)
        stats = self.get_residual_stats()
        for key, value in stats.items():
            print(f"{key:20s}: {value:10.6f}")
        print("="*60)
    
    def plot_predictions(self, title: str = "Predicted vs Actual", 
                        figsize: Tuple[int, int] = (12, 4)) -> None:
        """
        Visualize predictions and residuals.
        
        Creates 3 subplots:
        1. Predicted vs Actual (scatter with perfect prediction line)
        2. Residuals vs Predicted (check for patterns)
        3. Residual distribution (check normality)
        
        Args:
            title: Plot title prefix
            figsize: Figure size (width, height)
        """
        fig, axes = plt.subplots(1, 3, figsize=figsize)
        
        # Plot 1: Predicted vs Actual
        axes[0].scatter(self.y_true, self.y_pred, alpha=0.6, s=30, edgecolors='k', linewidth=0.5)
        
        # Perfect prediction line
        min_val = min(np.min(self.y_true), np.min(self.y_pred))
        max_val = max(np.max(self.y_true), np.max(self.y_pred))
        axes[0].plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')
        
        axes[0].set_xlabel('Actual Values', fontsize=11, fontweight='bold')
        axes[0].set_ylabel('Predicted Values', fontsize=11, fontweight='bold')
        axes[0].set_title(f'{title}\nR² = {self.r2_score():.4f}', fontsize=12, fontweight='bold')
        axes[0].legend()
        axes[0].grid(alpha=0.3)
        
        # Plot 2: Residuals vs Predicted
        axes[1].scatter(self.y_pred, self.residuals, alpha=0.6, s=30, edgecolors='k', linewidth=0.5)
        axes[1].axhline(y=0, color='r', linestyle='--', lw=2, label='Zero Residual')
        
        axes[1].set_xlabel('Predicted Values', fontsize=11, fontweight='bold')
        axes[1].set_ylabel('Residuals (Actual - Predicted)', fontsize=11, fontweight='bold')
        axes[1].set_title(f'Residual Plot\nMAE = {self.mae():.4f}', fontsize=12, fontweight='bold')
        axes[1].legend()
        axes[1].grid(alpha=0.3)
        
        # Plot 3: Residual Distribution
        axes[2].hist(self.residuals, bins=30, alpha=0.7, edgecolor='black', color='skyblue')
        axes[2].axvline(x=0, color='r', linestyle='--', lw=2, label='Zero')
        axes[2].axvline(x=np.mean(self.residuals), color='orange', linestyle='--', lw=2, 
                       label=f'Mean = {np.mean(self.residuals):.4f}')
        
        axes[2].set_xlabel('Residuals', fontsize=11, fontweight='bold')
        axes[2].set_ylabel('Frequency', fontsize=11, fontweight='bold')
        axes[2].set_title(f'Residual Distribution\nRMSE = {self.rmse():.4f}', 
                         fontsize=12, fontweight='bold')
        axes[2].legend()
        axes[2].grid(alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.show()
    
    def plot_error_distribution(self, figsize: Tuple[int, int] = (10, 5)) -> None:
        """
        Visualize error distribution with box plot and violin plot.
        
        Args:
            figsize: Figure size (width, height)
        """
        fig, axes = plt.subplots(1, 2, figsize=figsize)
        
        # Box plot
        bp = axes[0].boxplot(self.residuals, vert=True, patch_artist=True)
        bp['boxes'][0].set_facecolor('lightblue')
        bp['medians'][0].set_color('red')
        bp['medians'][0].set_linewidth(2)
        
        axes[0].set_ylabel('Residuals', fontsize=11, fontweight='bold')
        axes[0].set_title('Residual Box Plot', fontsize=12, fontweight='bold')
        axes[0].grid(alpha=0.3, axis='y')
        axes[0].axhline(y=0, color='red', linestyle='--', lw=1.5)
        
        # Violin plot
        parts = axes[1].violinplot([self.residuals], vert=True, showmeans=True, showmedians=True)
        for pc in parts['bodies']:
            pc.set_facecolor('lightgreen')
            pc.set_alpha(0.7)
        
        axes[1].set_ylabel('Residuals', fontsize=11, fontweight='bold')
        axes[1].set_title('Residual Violin Plot', fontsize=12, fontweight='bold')
        axes[1].grid(alpha=0.3, axis='y')
        axes[1].axhline(y=0, color='red', linestyle='--', lw=1.5)

        plt.tight_layout()
        plt.show()


# Example usage demonstration
if __name__ == "__main__":
    # Generate synthetic test time data (post-silicon validation example)
    np.random.seed(42)
    n_samples = 100
    
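    # Actual test times (ms); a 10 ms spread keeps simulated times well above zero,
    # so no test time is negative and MAPE stays well defined
    y_true = 50 + 10 * np.random.randn(n_samples)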
    
    # Predicted test times (with some error)
    y_pred = y_true + 5 * np.random.randn(n_samples)
    
    # Create metrics object
    metrics = RegressionMetrics(y_true, y_pred)
    
    # Print summary
    metrics.summary()
    
    # Visualize
    metrics.plot_predictions(title="Test Time Prediction")
    metrics.plot_error_distribution()


## 🔧 Cost-Sensitive Evaluation and Custom Metrics

In production environments, especially **post-silicon validation**, different types of errors often have **different costs**. A false negative (missing a defect) might cost $10M in field failures, while a false positive (unnecessary retest) might cost only $50K.

### Cost Matrices in Classification

For binary classification, we can define a **cost matrix**:

|                    | **Predicted Negative** | **Predicted Positive** |
|--------------------|------------------------|------------------------|
| **Actual Negative** | TN (Cost = 0)         | FP (Cost = C_FP)      |
| **Actual Positive** | FN (Cost = C_FN)      | TP (Cost = 0)         |

**Total Cost:**
$$
\text{Total Cost} = C_{FP} \times FP + C_{FN} \times FN
$$

**Semiconductor Example:**
- Device fails but test passes (FN): **$10,000,000** per escaped defect
- Device passes but test fails (FP): **$50,000** per false alarm (retest cost)
- Cost ratio: FN is 200× more expensive than FP

**Optimal threshold:** Not necessarily 0.5! We should minimize total cost:
$$
t^* = \arg\min_t \left[ C_{FP} \times FP(t) + C_{FN} \times FN(t) \right]
$$
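
As a worked illustration of the threshold rule, the sketch below sweeps a handful of thresholds over eight hypothetical devices, using the costs from the semiconductor example. The labels, probabilities and threshold grid are invented for this example.

```python
import numpy as np

C_FP, C_FN = 50_000, 10_000_000  # costs per false alarm / escaped defect

# Hypothetical labels (1 = defective) and predicted defect probabilities.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90])

thresholds = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
costs = []
for t in thresholds:
    y_pred = (y_prob >= t).astype(int)
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    costs.append(C_FP * fp + C_FN * fn)

best = int(np.argmin(costs))
for t, c in zip(thresholds, costs):
    print(f"t = {t:.1f}: total cost = ${c:,}")
print(f"Cost-optimal threshold on this grid: {thresholds[best]:.1f}")
```

On this toy data the default threshold of 0.5 misses the defective device scored 0.40 and costs $10,050,000, while 0.4 catches it at the price of one false alarm ($50,000). With a 200:1 cost ratio, the optimum sits below 0.5.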

---

### Custom Metrics for Business Objectives

Beyond standard metrics, you may need **domain-specific metrics**:

#### 1. **Weighted Accuracy** (imbalanced classes with known costs)

$$
\text{Weighted Acc} = \frac{w_0 \times TN + w_1 \times TP}{w_0 \times (TN + FP) + w_1 \times (TP + FN)}
$$

Where $w_0$, $w_1$ are class weights (inversely proportional to class frequency).
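
A minimal sketch of the formula with invented confusion-matrix counts. With inverse-frequency weights, the expression reduces to the mean of the true negative rate and the true positive rate, which is commonly called balanced accuracy.

```python
def weighted_accuracy(tn, fp, fn, tp, w0, w1):
    """Weighted accuracy with class weights w0 (negative) and w1 (positive)."""
    return (w0 * tn + w1 * tp) / (w0 * (tn + fp) + w1 * (tp + fn))

# Hypothetical counts: 950 good devices, 50 defective ones.
tn, fp, fn, tp = 940, 10, 30, 20

plain_acc = (tn + tp) / (tn + fp + fn + tp)

# Weights inversely proportional to class frequency.
w0 = 1.0 / (tn + fp)
w1 = 1.0 / (tp + fn)
weighted_acc = weighted_accuracy(tn, fp, fn, tp, w0, w1)

print(f"Plain accuracy:    {plain_acc:.3f}")
print(f"Weighted accuracy: {weighted_acc:.3f}")
```

Plain accuracy is 0.96 even though only 20 of 50 defects are caught. The weighted version drops to about 0.695 because the minority class now counts as much as the majority.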

#### 2. **Top-K Accuracy** (information retrieval, recommendations)

$$
\text{Top-K Acc} = \frac{\text{number of correct predictions in top } K}{n}
$$

Useful when you only act on top K predictions (e.g., investigate top 10 suspected defects).
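
One common reading of this metric, assumed here, counts a sample as correct when its true class is among the K highest-scored classes. The sketch implements that reading with NumPy on an invented 4-sample, 3-class score matrix; `top_k_accuracy` is a helper defined for this example.

```python
import numpy as np

def top_k_accuracy(y_true, scores, k):
    """Fraction of samples whose true class is among the k top-scored classes."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores
    hits = (top_k == y_true[:, None]).any(axis=1)
    return hits.mean()

y_true = np.array([0, 1, 2, 2])
scores = np.array([
    [0.6, 0.3, 0.1],   # top-1 correct
    [0.5, 0.4, 0.1],   # true class ranked second
    [0.2, 0.3, 0.5],   # top-1 correct
    [0.5, 0.3, 0.2],   # true class ranked last
])

top1 = top_k_accuracy(y_true, scores, k=1)
top2 = top_k_accuracy(y_true, scores, k=2)
print(f"Top-1 = {top1:.2f}, Top-2 = {top2:.2f}")
```

Top-1 accuracy is 0.50 and Top-2 accuracy is 0.75 on this data. Scikit-learn ships `sklearn.metrics.top_k_accuracy_score` for the same purpose.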

#### 3. **Expected Calibration Error (ECE)** (probability calibration)

$$
ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| acc(B_m) - conf(B_m) \right|
$$

Where:
- Predictions binned into M bins by confidence
- $B_m$ = samples in bin m
- $acc(B_m)$ = accuracy in bin m
- $conf(B_m)$ = average confidence in bin m

**Interpretation:** ECE measures how well predicted probabilities match actual outcomes. ECE = 0 means perfectly calibrated.
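
The ECE sum can be sketched for binary classifiers as follows. This version assumes a 0.5 decision threshold, takes confidence as the probability of the predicted class, and uses equal-width bins; other binning schemes exist and give somewhat different values.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE for binary predictions, binned by confidence in the predicted class."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    y_pred = (y_prob >= 0.5).astype(int)
    confidence = np.where(y_pred == 1, y_prob, 1.0 - y_prob)
    correct = (y_pred == y_true).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1])   # bin index in 0 .. n_bins - 1

    ece = 0.0
    for m in range(n_bins):
        in_bin = bin_ids == m
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap               # |B_m| / n times the gap
    return ece

y_prob = np.full(10, 0.9)                            # model always says "90% positive"
ece_calibrated = expected_calibration_error([1] * 9 + [0], y_prob)
ece_overconfident = expected_calibration_error([1] * 6 + [0] * 4, y_prob)
print(f"{ece_calibrated:.3f} {ece_overconfident:.3f}")
```

When 9 of 10 such predictions are right, ECE is essentially 0. When only 6 of 10 are right, the model is overconfident by 0.3 and ECE reports exactly that gap.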

#### 4. **Production Throughput Metric** (semiconductor specific)

$$
\text{Throughput} = \frac{\text{Tested Devices}}{\text{Total Test Time}} \times (1 - \text{False Positive Rate})
$$

Accounts for both speed and accuracy. High FPR reduces effective throughput due to retests.

---

### Multi-Class and Multi-Label Considerations

#### Multi-Class Classification (one label per sample)

**Macro-average:** Compute metric for each class, then average
$$
\text{Macro-F1} = \frac{1}{K} \sum_{k=1}^{K} F1_k
$$

**Micro-average:** Aggregate TP, FP, FN across all classes
$$
\text{Micro-F1} = \frac{2 \times \text{Total TP}}{2 \times \text{Total TP} + \text{Total FP} + \text{Total FN}}
$$

**Weighted-average:** Weight by class support (number of samples)
$$
\text{Weighted-F1} = \sum_{k=1}^{K} \frac{n_k}{n} F1_k
$$

**When to use:**
- **Macro**: All classes equally important (rare classes matter)
- **Micro**: Large classes more important (overall accuracy)
- **Weighted**: Balance by class size (common in imbalanced datasets)
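
The three averaging modes are available through the `average` parameter of scikit-learn's `f1_score`. The sketch below uses an invented imbalanced two-class example to show how far they can diverge.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels: 8 samples of class 0, 2 of class 1; one class-1 sample is missed.
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.array([0] * 8 + [0, 1])

macro_f1 = f1_score(y_true, y_pred, average="macro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"Macro = {macro_f1:.3f}, Micro = {micro_f1:.3f}, Weighted = {weighted_f1:.3f}")
```

Per-class F1 is 16/17 for class 0 and 2/3 for class 1, so macro-F1 is about 0.804, weighted-F1 about 0.886, and micro-F1 is 0.9. For single-label problems, micro-F1 equals plain accuracy, which is why it hides the weak minority class.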

#### Multi-Label Classification (multiple labels per sample)

Each sample can have multiple labels (e.g., device has both "high power" and "low frequency" defects).

**Hamming Loss:**
$$
\text{Hamming} = \frac{1}{n \times L} \sum_{i=1}^{n} \sum_{j=1}^{L} \mathbb{1}(y_{ij} \neq \hat{y}_{ij})
$$

**Subset Accuracy (exact match):**
$$
\text{Subset Acc} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i = \hat{y}_i)
$$

Only counts prediction as correct if ALL labels match exactly.
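
Both multi-label metrics follow directly from an elementwise comparison of the label matrices. A minimal NumPy sketch, with an invented 4-device, 3-label example:

```python
import numpy as np

# Rows = devices, columns = labels (e.g., high power, low frequency, leakage).
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],    # one extra label
                   [1, 0, 0],    # one missed label
                   [0, 0, 1]])

mismatch = Y_true != Y_pred
hamming_loss = mismatch.mean()                    # wrong labels / (n * L)
subset_accuracy = (~mismatch).all(axis=1).mean()  # rows with every label correct

print(f"Hamming loss = {hamming_loss:.4f}, subset accuracy = {subset_accuracy:.2f}")
```

Only 2 of 12 individual labels are wrong (Hamming loss of about 0.167), but half of the devices have at least one wrong label, so subset accuracy is 0.5. Subset accuracy is the much stricter of the two.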

---

### Cross-Validation and Evaluation Strategy

#### Why Cross-Validation?

Single train/test split can be **misleading**:
- Result depends on specific split
- May not represent true performance
- Small datasets: high variance in estimates

**K-Fold Cross-Validation:**
1. Split data into K equal folds
2. For each fold k:
   - Train on K-1 folds
   - Validate on fold k
3. Average metrics across K folds

**Stratified K-Fold:** Maintains class distribution in each fold (important for imbalanced data)

**Typical K values:**
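- K=5: Faster (5 fits); each model trains on 80% of the data, so the estimate is slightly pessimistic
- K=10: Standard choice; each model trains on 90% of the data, reducing that bias at roughly twice the compute
- K=n (LOO): Maximum data usage, but n model fits and often a high-variance estimate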
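
A sketch of stratified 5-fold cross-validation with scikit-learn. The synthetic dataset, the logistic regression model and the ROC-AUC scoring are placeholders chosen for illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for pass/fail data (roughly 10% positives).
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9, 0.1],
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One score per fold; report the mean and the spread, not a single number.
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"ROC-AUC per fold: {np.round(scores, 3)}")
print(f"Mean = {scores.mean():.3f}, std = {scores.std(ddof=1):.3f}")

# Stratification keeps each validation fold's positive rate close to the overall rate.
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print(f"Overall positive rate: {y.mean():.3f}, per fold: {np.round(fold_rates, 3)}")
```

The standard deviation across folds gives a rough sense of how much a single train/test split could have misled you.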

#### Time Series Considerations

**DO NOT use random K-Fold for time series!** This causes **data leakage** (future predicting past).

**Use Time Series Split instead:**
```
Fold 1: Train [1:100]   → Test [101:120]
Fold 2: Train [1:120]   → Test [121:140]
Fold 3: Train [1:140]   → Test [141:160]
...
```

**Post-silicon example:** Test data from Week 10 should NOT be used to train model evaluated on Week 5 data.
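
The expanding-window scheme shown above can be reproduced with scikit-learn's `TimeSeriesSplit`. The sketch assumes 160 samples already sorted by time and uses 0-based indices, so sample 1 in the diagram is index 0 here.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 160 time-ordered samples (index 0 is the oldest measurement).
X = np.arange(160).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3, test_size=20)
folds = list(tscv.split(X))

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train [{train_idx[0]}..{train_idx[-1]}] "
          f"-> test [{test_idx[0]}..{test_idx[-1]}]")
```

Every training window ends before its test window begins, so no future measurement leaks into training.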

---

### Statistical Significance Testing

Is Model A **truly better** than Model B, or just lucky on this dataset?

#### McNemar's Test (paired classification models)

Tests if two models have **significantly different error rates**.

**Contingency table:**

|              | Model B Correct | Model B Wrong |
|--------------|----------------|---------------|
| **Model A Correct** | a              | b             |
| **Model A Wrong**   | c              | d             |

**Test statistic:**
$$
\chi^2 = \frac{(b - c)^2}{b + c}
$$

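Under the null hypothesis (both models have the same error rate), the statistic approximately follows a $\chi^2$ distribution with 1 degree of freedom, provided $b + c$ is reasonably large (a common rule of thumb is at least 25). For smaller counts, use an exact binomial test on $b$ and $c$.

**Interpretation:** If p-value < 0.05, the two models' error rates differ significantly at the 5% level.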
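
A minimal sketch of the test using SciPy's chi-square survival function. The helper `mcnemar_test` and the per-sample correctness vectors are made up for this example, and the statistic is the uncorrected form given above (no continuity correction).

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(correct_a, correct_b):
    """Return (statistic, p_value) from per-sample correctness of two models."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    b = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
    c = int(np.sum(~correct_a & correct_b))   # A wrong, B right
    if b + c == 0:
        return 0.0, 1.0                       # the models never disagree
    stat = (b - c) ** 2 / (b + c)
    return stat, float(chi2.sf(stat, df=1))

# Hypothetical results on 100 samples: b = 25, c = 10, both right on 60, both wrong on 5.
correct_a = np.array([True] * 25 + [False] * 10 + [True] * 60 + [False] * 5)
correct_b = np.array([False] * 25 + [True] * 10 + [True] * 60 + [False] * 5)

stat, p_value = mcnemar_test(correct_a, correct_b)
print(f"chi2 = {stat:.3f}, p = {p_value:.4f}")
```

Only the 35 samples on which the models disagree enter the statistic, which is 225/35 ≈ 6.43 here. The 65 samples where both models agree carry no information about which one is better.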

#### Paired t-test (multiple datasets/folds)

Compare mean performance across K folds.

$$
t = \frac{\bar{d}}{s_d / \sqrt{K}}
$$

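Where $\bar{d}$ = mean of the per-fold differences and $s_d$ = their sample standard deviation; compare $t$ against a t-distribution with $K - 1$ degrees of freedom. Because cross-validation folds share training data, the differences are not fully independent, so this test tends to be optimistic.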
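
The statistic can be computed by hand and cross-checked against `scipy.stats.ttest_rel`. The per-fold scores below are invented for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical accuracy of two models on the same 5 folds.
scores_a = np.array([0.91, 0.89, 0.92, 0.90, 0.93])
scores_b = np.array([0.88, 0.87, 0.90, 0.89, 0.90])

d = scores_a - scores_b
K = len(d)
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(K))   # formula from the text

result = ttest_rel(scores_a, scores_b)               # same test via SciPy
t_scipy, p_value = result.statistic, result.pvalue

print(f"t (manual) = {t_manual:.3f}, t (scipy) = {t_scipy:.3f}, p = {p_value:.4f}")
```

Model A wins on every fold by 1 to 3 points, giving t ≈ 5.88 and a p-value below 0.05 on 4 degrees of freedom. The consistency of the difference across folds matters as much as its size.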

---

### Production Monitoring Metrics

Once deployed, monitor these metrics **continuously**:

1. **Prediction Distribution Drift**: Has distribution of predicted probabilities changed?
2. **Feature Drift**: Have input features changed (mean, std, range)?
3. **Performance Degradation**: Are metrics declining over time?
4. **Calibration Drift**: Are predicted probabilities still calibrated?
5. **Latency and Throughput**: Is model still meeting SLAs?

**Alerting thresholds:**
- Accuracy drops > 5% from baseline
- Feature means shift > 2 standard deviations
- 95th percentile latency exceeds SLA

**Post-silicon example:** If test yield predictions suddenly drop in accuracy, it may indicate:
- Process change in manufacturing
- New device architecture
- Test equipment calibration drift
- Model staleness (needs retraining)
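
The feature-drift alert ("feature means shift > 2 standard deviations") could be sketched as below. The helper name `mean_shift_in_std`, the two features and the data are hypothetical, and the check assumes no baseline feature is constant.

```python
import numpy as np

def mean_shift_in_std(baseline, current):
    """Per-feature shift of the current mean, in units of baseline std."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    base_std = baseline.std(axis=0, ddof=1)          # assumed non-zero
    return np.abs(current.mean(axis=0) - baseline.mean(axis=0)) / base_std

# Hypothetical features: supply voltage (V) and junction temperature (deg C).
baseline = np.column_stack([np.linspace(0.95, 1.05, 200),
                            np.linspace(20.0, 30.0, 200)])
current = baseline.copy()
current[:, 1] += 10.0                                # temperature drifts up by 10 deg C

shifts = mean_shift_in_std(baseline, current)
alerts = shifts > 2.0                                # threshold from the list above
print(f"Shift (in std): {np.round(shifts, 2)}, alert: {alerts}")
```

Only the temperature feature trips the alert. In practice the threshold and the choice of statistic (mean shift, population stability index, a two-sample test) would be tuned to the process being monitored.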

### 📝 Implementation

**Purpose:** Core implementation with detailed code

**Key implementation details below.**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Dict, List, Tuple
from sklearn.metrics import confusion_matrix


class CostSensitiveEvaluator:
    """
    Cost-sensitive evaluation for classification models.
    
    Allows custom cost matrices for FP and FN errors,
    finds optimal thresholds to minimize total cost,
    and performs multi-class evaluation with different averaging strategies.
    """
    
    def __init__(self, y_true: np.ndarray, y_prob: np.ndarray, 
                 cost_fp: float = 1.0, cost_fn: float = 1.0):
        """
        Initialize with true labels and predicted probabilities.
        
        Args:
            y_true: True binary labels (0/1)
            y_prob: Predicted probabilities for positive class
            cost_fp: Cost of false positive error
            cost_fn: Cost of false negative error
        """
        self.y_true = np.array(y_true)
        self.y_prob = np.array(y_prob)
        self.cost_fp = cost_fp
        self.cost_fn = cost_fn
        
        if len(self.y_true) != len(self.y_prob):
            raise ValueError("y_true and y_prob must have same length")
    
    def compute_cost(self, threshold: float) -> Tuple[float, int, int]:
        """
        Compute total cost at given threshold.
        
        Args:
            threshold: Classification threshold (0-1)
            
        Returns:
            (total_cost, num_fp, num_fn)
        """
        y_pred = (self.y_prob >= threshold).astype(int)
        
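        # Compute confusion matrix; fixing the labels guarantees a 2x2 result
        # even if only one class is present in y_true and y_pred
        tn, fp, fn, tp = confusion_matrix(self.y_true, y_pred, labels=[0, 1]).ravel()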
        
        # Compute cost
        total_cost = self.cost_fp * fp + self.cost_fn * fn
        
        return total_cost, fp, fn
    
    def find_optimal_threshold(self, thresholds: np.ndarray = None) -> Dict:
        """
        Find threshold that minimizes total cost.
        
        Args:
            thresholds: Array of thresholds to evaluate (default: 100 points)
            
        Returns:
            Dictionary with optimal threshold and costs
        """
        if thresholds is None:
            thresholds = np.linspace(0, 1, 100)
        
        costs = []
        fps = []
        fns = []
        
        for t in thresholds:
            cost, fp, fn = self.compute_cost(t)
            costs.append(cost)
            fps.append(fp)
            fns.append(fn)
        
        # Find minimum cost
        optimal_idx = np.argmin(costs)
        optimal_threshold = thresholds[optimal_idx]
        
        return {
            'optimal_threshold': optimal_threshold,
            'min_cost': costs[optimal_idx],
            'fp_at_optimal': fps[optimal_idx],
            'fn_at_optimal': fns[optimal_idx],
            'all_thresholds': thresholds,
            'all_costs': np.array(costs),
            'all_fps': np.array(fps),
            'all_fns': np.array(fns)
        }
    
    def plot_cost_analysis(self, result: Dict = None, figsize: Tuple[int, int] = (14, 5)):
        """
        Visualize cost vs threshold and error breakdown.
        
        Args:
            result: Result from find_optimal_threshold() (or compute if None)
            figsize: Figure size
        """
        if result is None:
            result = self.find_optimal_threshold()
        
        fig, axes = plt.subplots(1, 3, figsize=figsize)
        
        thresholds = result['all_thresholds']
        costs = result['all_costs']
        fps = result['all_fps']
        fns = result['all_fns']
        optimal_t = result['optimal_threshold']
        
        # Plot 1: Total Cost vs Threshold
        axes[0].plot(thresholds, costs, 'b-', lw=2, label='Total Cost')
        axes[0].axvline(optimal_t, color='r', linestyle='--', lw=2, 
                       label=f'Optimal = {optimal_t:.3f}')
        axes[0].scatter([optimal_t], [result['min_cost']], color='r', s=100, zorder=5)
        
        axes[0].set_xlabel('Threshold', fontsize=11, fontweight='bold')
        axes[0].set_ylabel('Total Cost', fontsize=11, fontweight='bold')
        axes[0].set_title(f'Cost Optimization\nMin Cost = {result["min_cost"]:.2f}', 
                         fontsize=12, fontweight='bold')
        axes[0].legend()
        axes[0].grid(alpha=0.3)
        
        # Plot 2: FP and FN counts vs Threshold
        axes[1].plot(thresholds, fps, 'orange', lw=2, label=f'FP (cost={self.cost_fp})')
        axes[1].plot(thresholds, fns, 'purple', lw=2, label=f'FN (cost={self.cost_fn})')
        axes[1].axvline(optimal_t, color='r', linestyle='--', lw=2, label='Optimal')
        
        axes[1].set_xlabel('Threshold', fontsize=11, fontweight='bold')
        axes[1].set_ylabel('Error Count', fontsize=11, fontweight='bold')
        axes[1].set_title('FP vs FN Trade-off', fontsize=12, fontweight='bold')
        axes[1].legend()
        axes[1].grid(alpha=0.3)
        
        # Plot 3: Cost Breakdown at Optimal Threshold
        fp_cost = self.cost_fp * result['fp_at_optimal']
        fn_cost = self.cost_fn * result['fn_at_optimal']
        
        labels = ['FP Cost', 'FN Cost']
        costs_breakdown = [fp_cost, fn_cost]
        colors = ['orange', 'purple']
        
        axes[2].bar(labels, costs_breakdown, color=colors, alpha=0.7, edgecolor='black')
        
        for i, (label, cost) in enumerate(zip(labels, costs_breakdown)):
            axes[2].text(i, cost + max(costs_breakdown) * 0.02, f'${cost:.2f}', 
                        ha='center', fontweight='bold')
        
        axes[2].set_ylabel('Cost', fontsize=11, fontweight='bold')
        axes[2].set_title(f'Cost Breakdown at t={optimal_t:.3f}', 
                         fontsize=12, fontweight='bold')
        axes[2].grid(alpha=0.3, axis='y')
        
        plt.tight_layout()
        plt.show()
    
    def compare_thresholds(self, thresholds: List[float]) -> None:
        """
        Compare performance at multiple thresholds.
        
        Args:
            thresholds: List of thresholds to compare
        """
        print("="*80)
        print("THRESHOLD COMPARISON")
        print("="*80)
        print(f"Cost FP: ${self.cost_fp:,.2f}  |  Cost FN: ${self.cost_fn:,.2f}")
        print("-"*80)
        print(f"{'Threshold':<12} {'FP':<8} {'FN':<8} {'FP Cost':<12} {'FN Cost':<12} {'Total Cost':<12}")
        print("-"*80)
        
        for t in thresholds:
            total_cost, fp, fn = self.compute_cost(t)
            fp_cost = self.cost_fp * fp
            fn_cost = self.cost_fn * fn
            
            print(f"{t:<12.3f} {fp:<8d} {fn:<8d} ${fp_cost:<11,.2f} ${fn_cost:<11,.2f} ${total_cost:<11,.2f}")
        
        print("="*80)


### 📝 Class: MultiClassEvaluator

**Purpose:** Continue implementation

**Key implementation details below.**

In [None]:
class MultiClassEvaluator:
    """
    Multi-class classification evaluation with macro/micro/weighted averaging.
    """
    
    def __init__(self, y_true: np.ndarray, y_pred: np.ndarray):
        """
        Initialize with true and predicted labels.
        
        Args:
            y_true: True labels (n_samples,)
            y_pred: Predicted labels (n_samples,)
        """
        self.y_true = np.array(y_true)
        self.y_pred = np.array(y_pred)
        self.classes = np.unique(np.concatenate([self.y_true, self.y_pred]))
        self.n_classes = len(self.classes)
    
    def compute_per_class_metrics(self) -> Dict:
        """
        Compute precision, recall, F1 for each class (one-vs-rest).
        
        Returns:
            Dictionary with per-class metrics
        """
        metrics = {}
        
        for cls in self.classes:
            # One-vs-rest binary classification
            y_true_binary = (self.y_true == cls).astype(int)
            y_pred_binary = (self.y_pred == cls).astype(int)
            
            # Confusion matrix for this class
            tn, fp, fn, tp = confusion_matrix(y_true_binary, y_pred_binary, labels=[0, 1]).ravel()
            
            # Compute metrics
            precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
            recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
            support = tp + fn  # Actual number of samples in this class
            
            metrics[cls] = {
                'precision': precision,
                'recall': recall,
                'f1': f1,
                'support': support
            }
        
        return metrics
    
    def macro_average(self, metric: str = 'f1') -> float:
        """
        Compute macro-average (unweighted mean across classes).
        
        Args:
            metric: 'precision', 'recall', or 'f1'
            
        Returns:
            Macro-averaged metric
        """
        per_class = self.compute_per_class_metrics()
        values = [per_class[cls][metric] for cls in self.classes]
        return np.mean(values)
    
    def micro_average(self) -> Dict[str, float]:
        """
        Compute micro-average (aggregate TP/FP/FN across all classes).
        
        Returns:
            Dictionary with micro precision, recall, F1
        """
        total_tp = 0
        total_fp = 0
        total_fn = 0
        
        for cls in self.classes:
            y_true_binary = (self.y_true == cls).astype(int)
            y_pred_binary = (self.y_pred == cls).astype(int)
            
            tn, fp, fn, tp = confusion_matrix(y_true_binary, y_pred_binary, labels=[0, 1]).ravel()
            
            total_tp += tp
            total_fp += fp
            total_fn += fn
        
        precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) > 0 else 0.0
        recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        
        return {
            'precision': precision,
            'recall': recall,
            'f1': f1
        }
    
    def weighted_average(self, metric: str = 'f1') -> float:
        """
        Compute weighted-average (weighted by class support).
        
        Args:
            metric: 'precision', 'recall', or 'f1'
            
        Returns:
            Weighted-averaged metric
        """
        per_class = self.compute_per_class_metrics()
        
        weighted_sum = 0.0
        total_support = 0
        
        for cls in self.classes:
            weighted_sum += per_class[cls][metric] * per_class[cls]['support']
            total_support += per_class[cls]['support']
        
        return weighted_sum / total_support if total_support > 0 else 0.0
    
    def summary(self) -> None:
        """
        Print comprehensive multi-class evaluation summary.
        """
        per_class = self.compute_per_class_metrics()
        micro = self.micro_average()
        
        print("="*80)
        print("MULTI-CLASS CLASSIFICATION SUMMARY")
        print("="*80)
        print(f"Number of classes: {self.n_classes}")
        print(f"Total samples: {len(self.y_true)}")
        
        print("\n" + "-"*80)
        print("PER-CLASS METRICS")
        print("-"*80)
        print(f"{'Class':<10} {'Precision':<12} {'Recall':<12} {'F1-Score':<12} {'Support':<10}")
        print("-"*80)
        
        for cls in self.classes:
            m = per_class[cls]
            print(f"{cls:<10} {m['precision']:<12.4f} {m['recall']:<12.4f} {m['f1']:<12.4f} {m['support']:<10d}")
        
        print("\n" + "-"*80)
        print("AVERAGED METRICS")
        print("-"*80)
        print(f"{'Average Type':<20} {'Precision':<12} {'Recall':<12} {'F1-Score':<12}")
        print("-"*80)
        print(f"{'Macro':<20} {self.macro_average('precision'):<12.4f} "
              f"{self.macro_average('recall'):<12.4f} {self.macro_average('f1'):<12.4f}")
        print(f"{'Micro':<20} {micro['precision']:<12.4f} "
              f"{micro['recall']:<12.4f} {micro['f1']:<12.4f}")
        print(f"{'Weighted':<20} {self.weighted_average('precision'):<12.4f} "
              f"{self.weighted_average('recall'):<12.4f} {self.weighted_average('f1'):<12.4f}")
        print("="*80)
# Example usage demonstration
if __name__ == "__main__":
    np.random.seed(42)
    
    # Example 1: Cost-sensitive binary classification (semiconductor defect detection)
    print("EXAMPLE 1: Cost-Sensitive Defect Detection")
    print("-"*80)
    
    n_samples = 500
    y_true = np.random.randint(0, 2, n_samples)  # 0 = pass, 1 = fail
    y_prob = np.clip(y_true + 0.3 * np.random.randn(n_samples), 0, 1)  # Add noise
    
    # FN (missed defect) costs $10M, FP (false alarm) costs $50K
    evaluator = CostSensitiveEvaluator(y_true, y_prob, cost_fp=50_000, cost_fn=10_000_000)
    
    result = evaluator.find_optimal_threshold()
    print(f"Optimal threshold: {result['optimal_threshold']:.4f}")
    print(f"Minimum total cost: ${result['min_cost']:,.2f}")
    print(f"FP at optimal: {result['fp_at_optimal']}")
    print(f"FN at optimal: {result['fn_at_optimal']}")
    
    evaluator.plot_cost_analysis(result)
    
    # Compare with standard threshold
    evaluator.compare_thresholds([0.3, 0.5, 0.7, result['optimal_threshold']])
    
    # Example 2: Multi-class evaluation (device bin classification)
    print("\n\nEXAMPLE 2: Multi-Class Bin Classification")
    print("-"*80)
    
    # 4 bins: BIN1 (premium), BIN2 (standard), BIN3 (low-grade), BIN4 (reject)
    y_true_mc = np.random.choice(['BIN1', 'BIN2', 'BIN3', 'BIN4'], size=400, 
                                 p=[0.3, 0.4, 0.2, 0.1])
    
    # Reassign 50 predictions at random to simulate errors
    # (a random bin can match the true one, so fewer than 50 end up wrong)
    y_pred_mc = y_true_mc.copy()
    error_idx = np.random.choice(len(y_pred_mc), size=50, replace=False)
    y_pred_mc[error_idx] = np.random.choice(['BIN1', 'BIN2', 'BIN3', 'BIN4'], size=50)
    
    mc_evaluator = MultiClassEvaluator(y_true_mc, y_pred_mc)
    mc_evaluator.summary()
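
One property of the averaged rows is worth checking by hand. In a single-label multi-class problem, every misclassification is one FP for the predicted class and one FN for the true class, so micro-averaged precision, recall, and F1 all collapse to plain accuracy. The sketch below verifies this on toy labels with NumPy only; `micro_prf` and the toy arrays are illustrative names and are not part of `MultiClassEvaluator`.

```python
import numpy as np

def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, F1 from pooled one-vs-rest counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = fp = fn = 0
    for cls in np.unique(np.concatenate([y_true, y_pred])):
        tp += int(np.sum((y_true == cls) & (y_pred == cls)))
        fp += int(np.sum((y_true != cls) & (y_pred == cls)))
        fn += int(np.sum((y_true == cls) & (y_pred != cls)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical): 4 of 6 devices are binned correctly
toy_true = np.array(['BIN1', 'BIN1', 'BIN2', 'BIN2', 'BIN3', 'BIN4'])
toy_pred = np.array(['BIN1', 'BIN2', 'BIN2', 'BIN2', 'BIN3', 'BIN1'])
accuracy = float(np.mean(toy_true == toy_pred))

print("micro P/R/F1:", micro_prf(toy_true, toy_pred))
print("accuracy:    ", accuracy)
```

Macro averaging, by contrast, gives each bin equal weight, so it is the row that moves most when a rare bin such as BIN4 is predicted poorly.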


## 📈 Complete Example: Semiconductor Yield Prediction Evaluation

Let's build a **complete evaluation pipeline** for a realistic post-silicon validation scenario, using synthetic data:

**Scenario:**
- **Objective**: Predict device pass/fail based on parametric test data
- **Dataset**: 1000 devices from 10 wafers with spatial (die_x, die_y) and electrical features (VDD, IDD, Freq, Temp)
- **Challenge**: Imbalanced (fails are the minority class, roughly 10-15% of devices in the synthetic data), high cost asymmetry (FN = $5M, FP = $100K)
- **Goal**: Find optimal decision threshold and evaluate with comprehensive metrics

**Business Context:**
- Each missed defect (FN) that escapes to field costs **$5,000,000** in recalls, warranty, brand damage
- Each false alarm (FP) costs **$100,000** in unnecessary retest and analysis
- Cost ratio: FN is 50× more expensive than FP
- Need to balance yield protection with manufacturing efficiency
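
These two cost figures already imply a break-even probability. Assuming the model's probabilities are calibrated and correct decisions cost nothing (both are assumptions, not something this notebook establishes), flagging a device costs (1 - p) × C_FP in expectation and passing it costs p × C_FN, so flagging is the cheaper action once p ≥ C_FP / (C_FP + C_FN). A minimal sketch with a hypothetical helper name:

```python
def break_even_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which flagging a device has the lower expected cost.

    Assumes calibrated probabilities and zero cost for correct decisions.
    """
    return cost_fp / (cost_fp + cost_fn)

threshold = break_even_threshold(cost_fp=100_000, cost_fn=5_000_000)
print(f"Break-even threshold: {threshold:.4f}")
```

With a 50:1 cost ratio the break-even point is about 0.02, far below the default 0.5. An empirical threshold search on uncalibrated scores can land elsewhere, so this value is best read as a reference point rather than a replacement for the search.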

### Evaluation Strategy

```mermaid
graph LR
    A[Train Model] --> B[Get Predictions]
    B --> C[Classification Metrics]
    B --> D[ROC/PR Curves]
    B --> E[Regression Metrics<br/>for yield %]
    C --> F[Cost Analysis]
    D --> F
    E --> F
    F --> G[Optimal Threshold]
    G --> H[Production<br/>Deployment]
```

We'll evaluate using:
1. **Classification metrics** (confusion matrix, precision, recall, F1)
2. **ROC and PR curves** (threshold-independent performance)
3. **Cost-sensitive analysis** (find optimal threshold for business objectives)
4. **Regression metrics** (for continuous targets; demonstrated in the test time prediction example that follows)
5. **Statistical validation** (cross-validation, confidence intervals)
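
For the confidence-interval part of item 5, one common option is a percentile bootstrap over test devices. The sketch below is an illustration under stated assumptions, not the notebook's method: `bootstrap_ci` and `recall_score_binary` are hypothetical helpers, the labels are toy data, and resamples that contain no failing device are skipped.

```python
import numpy as np

def recall_score_binary(y_true, y_pred):
    """Recall on the positive (fail = 1) class; NaN if there are no positives."""
    positives = y_true == 1
    return float(np.mean(y_pred[positives] == 1)) if positives.any() else float('nan')

def bootstrap_ci(y_true, y_pred, metric_fn, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric_fn(y_true, y_pred)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample devices with replacement
        value = metric_fn(y_true[idx], y_pred[idx])
        if not np.isnan(value):                 # skip resamples without positives
            stats.append(value)
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

# Toy labels: 20 fails among 200 devices, 14 of them caught
demo_true = np.array([1] * 20 + [0] * 180)
demo_pred = np.array([1] * 14 + [0] * 6 + [0] * 180)

point = recall_score_binary(demo_true, demo_pred)
low, high = bootstrap_ci(demo_true, demo_pred, recall_score_binary)
print(f"Recall = {point:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With only 20 failing devices the interval is wide, which is typical for recall on a minority class and a useful reminder not to over-read small differences between models.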

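### 📝 Implementation

**Purpose:** Generate the synthetic wafer dataset (Step 1), then train a Random Forest classifier and collect its predictions on a held-out test set (Step 2).

**Key details:** The fail probability rises toward the wafer edge and for out-of-range VDD, IDD, and temperature. The train/test split is stratified so both sets keep the same fail rate.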

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# Set random seed for reproducibility
np.random.seed(42)
print("="*80)
print("COMPLETE EVALUATION EXAMPLE: SEMICONDUCTOR YIELD PREDICTION")
print("="*80)
# ============================================================================
# STEP 1: Generate Synthetic Dataset
# ============================================================================
print("\n[STEP 1] Generating synthetic semiconductor test data...")
n_devices = 1000
n_wafers = 10
devices_per_wafer = n_devices // n_wafers
# Generate spatial coordinates
wafer_ids = np.repeat(range(n_wafers), devices_per_wafer)
die_x = np.random.uniform(0, 10, n_devices)  # 0-10 mm
die_y = np.random.uniform(0, 10, n_devices)  # 0-10 mm
# Calculate radial distance from wafer center (5, 5)
radial_distance = np.sqrt((die_x - 5)**2 + (die_y - 5)**2)
# Generate electrical parameters
VDD = np.random.normal(1.8, 0.05, n_devices)  # Voltage (V)
IDD = np.random.normal(50, 5, n_devices)      # Current (mA)
Freq = np.random.normal(2000, 100, n_devices) # Frequency (MHz)
Temp = np.random.normal(85, 5, n_devices)     # Temperature (°C)
# Create target: devices near edge more likely to fail (radial effect)
# Also devices with extreme electrical parameters fail more
fail_prob = 0.05 + 0.15 * (radial_distance / radial_distance.max())
fail_prob += 0.1 * (np.abs(VDD - 1.8) > 0.1).astype(float)
fail_prob += 0.1 * (IDD > 60).astype(float)
fail_prob += 0.05 * (Temp > 90).astype(float)
y_actual = (np.random.random(n_devices) < fail_prob).astype(int)
# Create DataFrame
df = pd.DataFrame({
    'wafer_id': wafer_ids,
    'die_x': die_x,
    'die_y': die_y,
    'radial_distance': radial_distance,
    'VDD': VDD,
    'IDD': IDD,
    'Freq': Freq,
    'Temp': Temp,
    'fail': y_actual
})
print(f"Dataset created: {n_devices} devices from {n_wafers} wafers")
print(f"Class distribution: {(1-y_actual.mean())*100:.1f}% pass, {y_actual.mean()*100:.1f}% fail")
print(f"Imbalance ratio: {(1-y_actual.mean())/y_actual.mean():.1f}:1")
# ============================================================================
# STEP 2: Train Model and Get Predictions
# ============================================================================
print("\n[STEP 2] Training Random Forest classifier...")
# Prepare features
feature_cols = ['radial_distance', 'VDD', 'IDD', 'Freq', 'Temp']
X = df[feature_cols].values
y = df['fail'].values
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                      random_state=42, stratify=y)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train model
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train_scaled, y_train)
# Get predictions
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]  # Probability of class 1 (fail)
print(f"Model trained on {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")
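
With the split in place, it helps to know what a trivial model would score before reading any model metric. The sketch below uses hypothetical toy labels rather than the notebook's `y_test` and scores an always-pass predictor: its accuracy equals the pass rate while its recall on fails is zero, which is why accuracy alone is a poor target on imbalanced data. `always_pass_baseline` is an illustrative name.

```python
import numpy as np

def always_pass_baseline(y_true):
    """Accuracy and fail-recall of a model that predicts 'pass' (0) for every device."""
    y_true = np.asarray(y_true)
    y_pred = np.zeros_like(y_true)
    accuracy = float(np.mean(y_pred == y_true))
    n_fail = int(np.sum(y_true == 1))
    caught = int(np.sum((y_pred == 1) & (y_true == 1)))
    recall = caught / n_fail if n_fail else 0.0
    return accuracy, recall

toy_labels = np.array([0] * 90 + [1] * 10)   # illustrative 10% fail rate
base_acc, base_recall = always_pass_baseline(toy_labels)
print(f"Baseline accuracy: {base_acc:.2f}, recall on fails: {base_recall:.2f}")
```

Any model worth deploying should beat this baseline on recall and cost, not just match it on accuracy.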


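### 📝 Implementation Part 2

**Purpose:** Compute confusion-matrix metrics at the default 0.5 threshold (Step 3), then plot the ROC and precision-recall curves (Step 4).

**Key details:** The dashed baseline in the precision-recall plot is the test-set fail rate, which is the precision a random classifier would achieve.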

In [None]:
# ============================================================================
# STEP 3: Classification Metrics Evaluation
# ============================================================================
print("\n[STEP 3] Computing classification metrics...")
# Use a simplified, self-contained version of our ClassificationMetrics class
from sklearn.metrics import confusion_matrix as sk_confusion_matrix
class ClassificationMetricsSimple:
    def __init__(self, y_true, y_pred, y_prob=None):
        self.y_true = np.array(y_true)
        self.y_pred = np.array(y_pred)
        self.y_prob = y_prob
        
        # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted
        cm = sk_confusion_matrix(self.y_true, self.y_pred, labels=[0, 1])
        self.tn, self.fp, self.fn, self.tp = cm.ravel()
    
    def accuracy(self):
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)
    
    def precision(self):
        return self.tp / (self.tp + self.fp) if (self.tp + self.fp) > 0 else 0.0
    
    def recall(self):
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) > 0 else 0.0
    
    def f1_score(self):
        p = self.precision()
        r = self.recall()
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    
    def specificity(self):
        return self.tn / (self.tn + self.fp) if (self.tn + self.fp) > 0 else 0.0
metrics = ClassificationMetricsSimple(y_test, y_pred, y_prob)
print(f"\nConfusion Matrix:")
print(f"                Predicted: Pass  |  Predicted: Fail")
print(f"Actual: Pass    TN = {metrics.tn:<8d} |  FP = {metrics.fp:<8d}")
print(f"Actual: Fail    FN = {metrics.fn:<8d} |  TP = {metrics.tp:<8d}")
print(f"\nClassification Metrics (threshold = 0.5):")
print(f"  Accuracy:    {metrics.accuracy():.4f}")
print(f"  Precision:   {metrics.precision():.4f}")
print(f"  Recall:      {metrics.recall():.4f}")
print(f"  F1-Score:    {metrics.f1_score():.4f}")
print(f"  Specificity: {metrics.specificity():.4f}")
# ============================================================================
# STEP 4: ROC and PR Curve Analysis
# ============================================================================
print("\n[STEP 4] Analyzing ROC and PR curves...")
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
# ROC curve
fpr, tpr, roc_thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
# PR curve
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)
avg_precision = average_precision_score(y_test, y_prob)
print(f"ROC AUC: {roc_auc:.4f}")
print(f"Average Precision (PR AUC): {avg_precision:.4f}")
# Plot ROC and PR curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# ROC curve
axes[0].plot(fpr, tpr, 'b-', lw=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
axes[0].plot([0, 1], [0, 1], 'r--', lw=2, label='Random Classifier')
axes[0].set_xlabel('False Positive Rate', fontsize=11, fontweight='bold')
axes[0].set_ylabel('True Positive Rate', fontsize=11, fontweight='bold')
axes[0].set_title('ROC Curve - Yield Prediction', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
# PR curve
baseline = y_test.mean()
axes[1].plot(recall, precision, 'b-', lw=2, label=f'PR Curve (AP = {avg_precision:.4f})')
axes[1].axhline(baseline, color='r', linestyle='--', lw=2, label=f'Baseline ({baseline:.4f})')
axes[1].set_xlabel('Recall', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Precision', fontsize=11, fontweight='bold')
axes[1].set_title('Precision-Recall Curve', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
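
ROC AUC also has a direct ranking interpretation: it is the probability that a randomly chosen failing device receives a higher score than a randomly chosen passing one, with ties counted as one half. The sketch below computes it that way on toy data; `pairwise_auc` is a hypothetical helper, and the all-pairs comparison is only practical for small test sets.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Share of (fail, pass) pairs where the fail scores higher; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every fail against every pass
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])
toy_auc = pairwise_auc(toy_true, toy_scores)
print(f"Pairwise AUC: {toy_auc:.2f}")
```

Three of the four fail/pass pairs are ordered correctly here, giving 0.75. Read this way, an AUC near 0.5 means the scores barely separate fails from passes, whatever the accuracy looks like.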


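### 📝 Implementation Part 3

**Purpose:** Sweep the decision threshold, price each false positive and false negative, and pick the threshold with the lowest total cost (Step 5).

**Key details:** Total cost is `COST_FP * FP + COST_FN * FN`; correct decisions are treated as free.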

In [None]:
# ============================================================================
# STEP 5: Cost-Sensitive Analysis
# ============================================================================
print("\n[STEP 5] Finding optimal threshold with cost analysis...")
# Cost parameters
COST_FP = 100_000   # $100K per false alarm (unnecessary retest)
COST_FN = 5_000_000 # $5M per missed defect (field escape)
print(f"Cost structure:")
print(f"  False Positive (unnecessary retest): ${COST_FP:,}")
print(f"  False Negative (missed defect):      ${COST_FN:,}")
print(f"  Cost ratio (FN/FP): {COST_FN/COST_FP:.0f}:1")
# Find optimal threshold
thresholds_to_test = np.linspace(0, 1, 101)  # step 0.01, so 0.3, 0.5, and 0.7 fall on the grid
costs = []
fps = []
fns = []
for t in thresholds_to_test:
    y_pred_t = (y_prob >= t).astype(int)
    cm = sk_confusion_matrix(y_test, y_pred_t, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    
    total_cost = COST_FP * fp + COST_FN * fn
    costs.append(total_cost)
    fps.append(fp)
    fns.append(fn)
optimal_idx = np.argmin(costs)
optimal_threshold = thresholds_to_test[optimal_idx]
min_cost = costs[optimal_idx]
print(f"\nOptimal threshold: {optimal_threshold:.4f}")
print(f"Minimum total cost: ${min_cost:,.2f}")
print(f"FP at optimal: {fps[optimal_idx]}")
print(f"FN at optimal: {fns[optimal_idx]}")
# Compare standard vs optimal threshold
print(f"\nComparison:")
print(f"{'Threshold':<12} {'FP':<8} {'FN':<8} {'Total Cost':<15}")
print("-"*50)
for t in [0.3, 0.5, 0.7, optimal_threshold]:
    idx = np.argmin(np.abs(thresholds_to_test - t))
    print(f"{t:<12.3f} {fps[idx]:<8d} {fns[idx]:<8d} ${costs[idx]:>13,.2f}")
# Visualize cost analysis
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Cost vs threshold
axes[0].plot(thresholds_to_test, costs, 'b-', lw=2)
axes[0].axvline(optimal_threshold, color='r', linestyle='--', lw=2, 
               label=f'Optimal = {optimal_threshold:.3f}')
axes[0].scatter([optimal_threshold], [min_cost], color='r', s=100, zorder=5)
axes[0].set_xlabel('Threshold', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Total Cost ($)', fontsize=11, fontweight='bold')
axes[0].set_title(f'Cost Optimization\nMin Cost = ${min_cost:,.0f}', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)
axes[0].ticklabel_format(style='plain', axis='y')
# FP vs FN
axes[1].plot(thresholds_to_test, fps, 'orange', lw=2, label='False Positives')
axes[1].plot(thresholds_to_test, fns, 'purple', lw=2, label='False Negatives')
axes[1].axvline(optimal_threshold, color='r', linestyle='--', lw=2, label='Optimal')
axes[1].set_xlabel('Threshold', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Error Count', fontsize=11, fontweight='bold')
axes[1].set_title('FP vs FN Trade-off', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("\n" + "="*80)
print("EVALUATION COMPLETE")
print("="*80)
cost_at_default = COST_FP * metrics.fp + COST_FN * metrics.fn  # total cost at threshold 0.5
print(f"\n✅ Key Findings:")
print(f"   • Model achieves ROC AUC = {roc_auc:.4f} (0.5 = random ranking, 1.0 = perfect)")
print(f"   • At standard threshold (0.5): {metrics.fp} false alarms + {metrics.fn} missed defects = ${cost_at_default:,.0f} total cost")
print(f"   • At optimal threshold ({optimal_threshold:.3f}): {fps[optimal_idx]} false alarms + {fns[optimal_idx]} missed defects = ${min_cost:,.0f} total cost")
print(f"   • Cost savings vs. standard threshold: ${cost_at_default - min_cost:,.0f}")
print(f"\n📊 Recommendation: Use threshold = {optimal_threshold:.4f} for production deployment")
print("="*80)
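
The search above evaluates evenly spaced thresholds. Because FP and FN counts only change when the threshold crosses one of the predicted probabilities, an alternative is to test exactly those values. The following is a sketch of that idea rather than the notebook's method; `min_cost_threshold` is a hypothetical helper and the inputs are toy data.

```python
import numpy as np

def min_cost_threshold(y_true, y_prob, cost_fp, cost_fn):
    """Return (threshold, fp, fn, cost) minimising cost_fp * FP + cost_fn * FN."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    # Candidates: each distinct score, plus +inf (flag nothing)
    candidates = np.append(np.unique(y_prob), np.inf)
    pred = y_prob[None, :] >= candidates[:, None]      # shape (n_candidates, n_samples)
    fp = np.sum(pred & (y_true == 0), axis=1)
    fn = np.sum(~pred & (y_true == 1), axis=1)
    cost = cost_fp * fp + cost_fn * fn
    best = int(np.argmin(cost))                        # first minimum = lowest threshold
    return float(candidates[best]), int(fp[best]), int(fn[best]), float(cost[best])

toy_true = np.array([0, 0, 1, 0, 1])
toy_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.9])
best_t, best_fp, best_fn, best_cost = min_cost_threshold(toy_true, toy_prob, cost_fp=1, cost_fn=50)
print(best_t, best_fp, best_fn, best_cost)
```

On the toy data the cheapest choice flags everything scoring at least 0.3, accepting one false alarm to avoid any missed defect. The comparison matrix grows with the number of distinct scores, so for large test sets a sorted cumulative-sum formulation would be more memory-efficient.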


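### 📝 Implementation

**Purpose:** Start the regression example: generate a synthetic test time dataset and fit Linear Regression and Random Forest regressors to it.

**Key details:** Test time is generated as a linear function of the features plus Gaussian noise, so a linear model is well specified for this data.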

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
# Set random seed
np.random.seed(42)
print("="*80)
print("REGRESSION EXAMPLE: TEST TIME PREDICTION")
print("="*80)
# ============================================================================
# Generate Synthetic Test Time Data
# ============================================================================
print("\n[STEP 1] Generating test time prediction dataset...")
n_devices = 800
# Features affecting test time
n_test_points = np.random.randint(10, 50, n_devices)  # Number of test points
complexity = np.random.uniform(0, 1, n_devices)       # Test complexity (0-1)
freq = np.random.normal(2000, 200, n_devices)         # Operating frequency (MHz)
temp = np.random.normal(85, 10, n_devices)            # Test temperature (°C)
# True test time model (with some realistic relationships)
# Higher test points → longer time
# Higher complexity → longer time
# Higher frequency → shorter time (faster execution)
# Temperature has minimal effect
test_time_actual = (
    5.0 +                                    # Base time
    0.5 * n_test_points +                   # Linear with test points
    20.0 * complexity +                      # Complexity effect
    -0.003 * freq +                          # Frequency effect (negative slope)
    0.1 * temp +                             # Temperature effect
    np.random.normal(0, 2, n_devices)       # Random noise
)
# Ensure positive test times
test_time_actual = np.maximum(test_time_actual, 1.0)
# Create DataFrame
df_test_time = pd.DataFrame({
    'n_test_points': n_test_points,
    'complexity': complexity,
    'freq': freq,
    'temp': temp,
    'test_time_ms': test_time_actual
})
print(f"Dataset: {n_devices} devices")
print(f"Test time range: [{test_time_actual.min():.2f}, {test_time_actual.max():.2f}] ms")
print(f"Test time mean: {test_time_actual.mean():.2f} ms")
print(f"Test time std: {test_time_actual.std():.2f} ms")
# ============================================================================
# Train Regression Model
# ============================================================================
print("\n[STEP 2] Training regression models...")
# Prepare data
X_reg = df_test_time[['n_test_points', 'complexity', 'freq', 'temp']].values
y_reg = df_test_time['test_time_ms'].values
# Split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)
# Scale features
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)
# Train Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_reg_scaled, y_train_reg)
y_pred_lr = lr_model.predict(X_test_reg_scaled)
# Train Random Forest
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)
rf_model.fit(X_train_reg_scaled, y_train_reg)
y_pred_rf = rf_model.predict(X_test_reg_scaled)
print(f"Models trained on {len(X_train_reg)} samples")
print(f"Test set: {len(X_test_reg)} samples")


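### 📝 Implementation Part 2

**Purpose:** Compute MSE, RMSE, MAE, R², MAPE, and max error for both models (Step 3), then plot predicted-vs-actual and residual diagnostics (Step 4).

**Key details:** R² is the only metric in the comparison where higher is better; all others are errors, so lower wins.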

In [None]:
# ============================================================================
# Compute Regression Metrics
# ============================================================================
print("\n[STEP 3] Computing regression metrics...")
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
def compute_all_metrics(y_true, y_pred, model_name):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    # MAPE, computed only where y_true is nonzero to avoid division by zero
    nonzero = y_true != 0
    mape = np.mean(np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])) * 100
    
    # Max error
    max_err = np.max(np.abs(y_true - y_pred))
    
    return {
        'model': model_name,
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R¬≤': r2,
        'MAPE': mape,
        'Max Error': max_err
    }
# Compute metrics for both models
lr_metrics = compute_all_metrics(y_test_reg, y_pred_lr, 'Linear Regression')
rf_metrics = compute_all_metrics(y_test_reg, y_pred_rf, 'Random Forest')
# Display comparison
print("\n" + "="*80)
print("MODEL COMPARISON")
print("="*80)
print(f"{'Metric':<15} {'Linear Regression':<20} {'Random Forest':<20} {'Winner':<10}")
print("-"*80)
metrics_to_compare = ['MSE', 'RMSE', 'MAE', 'R¬≤', 'MAPE', 'Max Error']
for metric in metrics_to_compare:
    lr_val = lr_metrics[metric]
    rf_val = rf_metrics[metric]
    
    # Higher is better for R²; lower is better for every other metric
    if metric == 'R¬≤':
        winner = 'Linear Reg' if lr_val > rf_val else 'Random Forest'
    else:
        winner = 'Linear Reg' if lr_val < rf_val else 'Random Forest'
    
    print(f"{metric:<15} {lr_val:<20.4f} {rf_val:<20.4f} {winner:<10}")
print("="*80)
# ============================================================================
# Visualize Regression Performance
# ============================================================================
print("\n[STEP 4] Visualizing predictions...")
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
# Linear Regression visualizations
# Predicted vs Actual
axes[0, 0].scatter(y_test_reg, y_pred_lr, alpha=0.6, s=30, edgecolors='k', linewidth=0.5)
min_val = min(y_test_reg.min(), y_pred_lr.min())
max_val = max(y_test_reg.max(), y_pred_lr.max())
axes[0, 0].plot([min_val, max_val], [min_val, max_val], 'r--', lw=2)
axes[0, 0].set_xlabel('Actual Test Time (ms)', fontsize=10, fontweight='bold')
axes[0, 0].set_ylabel('Predicted Test Time (ms)', fontsize=10, fontweight='bold')
axes[0, 0].set_title(f'Linear Regression\nR² = {lr_metrics["R¬≤"]:.4f}', fontsize=11, fontweight='bold')
axes[0, 0].grid(alpha=0.3)
# Residuals
residuals_lr = y_test_reg - y_pred_lr
axes[0, 1].scatter(y_pred_lr, residuals_lr, alpha=0.6, s=30, edgecolors='k', linewidth=0.5)
axes[0, 1].axhline(0, color='r', linestyle='--', lw=2)
axes[0, 1].set_xlabel('Predicted Test Time (ms)', fontsize=10, fontweight='bold')
axes[0, 1].set_ylabel('Residuals (ms)', fontsize=10, fontweight='bold')
axes[0, 1].set_title(f'Residual Plot\nMAE = {lr_metrics["MAE"]:.4f} ms', fontsize=11, fontweight='bold')
axes[0, 1].grid(alpha=0.3)
# Residual distribution
axes[0, 2].hist(residuals_lr, bins=25, alpha=0.7, edgecolor='black', color='skyblue')
axes[0, 2].axvline(0, color='r', linestyle='--', lw=2)
axes[0, 2].axvline(np.mean(residuals_lr), color='orange', linestyle='--', lw=2)
axes[0, 2].set_xlabel('Residuals (ms)', fontsize=10, fontweight='bold')
axes[0, 2].set_ylabel('Frequency', fontsize=10, fontweight='bold')
axes[0, 2].set_title(f'Residual Distribution\nRMSE = {lr_metrics["RMSE"]:.4f} ms', fontsize=11, fontweight='bold')
axes[0, 2].grid(alpha=0.3, axis='y')
# Random Forest visualizations
# Predicted vs Actual
axes[1, 0].scatter(y_test_reg, y_pred_rf, alpha=0.6, s=30, edgecolors='k', linewidth=0.5, color='green')
axes[1, 0].plot([min_val, max_val], [min_val, max_val], 'r--', lw=2)
axes[1, 0].set_xlabel('Actual Test Time (ms)', fontsize=10, fontweight='bold')
axes[1, 0].set_ylabel('Predicted Test Time (ms)', fontsize=10, fontweight='bold')
axes[1, 0].set_title(f'Random Forest\nR² = {rf_metrics["R¬≤"]:.4f}', fontsize=11, fontweight='bold')
axes[1, 0].grid(alpha=0.3)
# Residuals
residuals_rf = y_test_reg - y_pred_rf
axes[1, 1].scatter(y_pred_rf, residuals_rf, alpha=0.6, s=30, edgecolors='k', linewidth=0.5, color='green')
axes[1, 1].axhline(0, color='r', linestyle='--', lw=2)
axes[1, 1].set_xlabel('Predicted Test Time (ms)', fontsize=10, fontweight='bold')
axes[1, 1].set_ylabel('Residuals (ms)', fontsize=10, fontweight='bold')
axes[1, 1].set_title(f'Residual Plot\nMAE = {rf_metrics["MAE"]:.4f} ms', fontsize=11, fontweight='bold')
axes[1, 1].grid(alpha=0.3)
# Residual distribution
axes[1, 2].hist(residuals_rf, bins=25, alpha=0.7, edgecolor='black', color='lightgreen')
axes[1, 2].axvline(0, color='r', linestyle='--', lw=2)
axes[1, 2].axvline(np.mean(residuals_rf), color='orange', linestyle='--', lw=2)
axes[1, 2].set_xlabel('Residuals (ms)', fontsize=10, fontweight='bold')
axes[1, 2].set_ylabel('Frequency', fontsize=10, fontweight='bold')
axes[1, 2].set_title(f'Residual Distribution\nRMSE = {rf_metrics["RMSE"]:.4f} ms', fontsize=11, fontweight='bold')
axes[1, 2].grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
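
To make the relationship between these metrics concrete, the sketch below computes them from their definitions on four toy values where a single prediction is badly off. `regression_report` is a hypothetical helper, not the notebook's `compute_all_metrics`, and it stores R² under the plain key `'R2'`.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """MSE, RMSE, MAE, and R2 computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = float(np.mean(err ** 2))
    ss_res = float(np.sum(err ** 2))                       # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    return {
        'MSE': mse,
        'RMSE': mse ** 0.5,
        'MAE': float(np.mean(np.abs(err))),
        'R2': 1.0 - ss_res / ss_tot,
    }

# Three perfect predictions and one miss of 4 units
report = regression_report([1, 2, 3, 4], [1, 2, 3, 8])
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

The single large miss pushes RMSE (2.0) to twice the MAE (1.0), and R² comes out negative (-2.2) because the predictions are worse than always predicting the mean. R² is therefore not confined to the 0 to 1 range on held-out data.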


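### 📝 Implementation Part 3

**Purpose:** Translate the error in predicted average test time into throughput (devices/hour) and an annual dollar figure (Step 5).

**Key details:** The $10 per device value and 24/7 operation are illustrative assumptions.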

In [None]:
# ============================================================================
# Business Impact Analysis
# ============================================================================
print("\n[STEP 5] Business impact analysis...")
# Assume manufacturing throughput goal: 1000 devices/hour
# Test time directly affects throughput
avg_test_time_actual = y_test_reg.mean()
avg_test_time_pred_lr = y_pred_lr.mean()
avg_test_time_pred_rf = y_pred_rf.mean()
print(f"\nThroughput Analysis:")
print(f"  Actual average test time: {avg_test_time_actual:.2f} ms")
print(f"  Linear Regression prediction: {avg_test_time_pred_lr:.2f} ms (error: {abs(avg_test_time_pred_lr - avg_test_time_actual):.2f} ms)")
print(f"  Random Forest prediction: {avg_test_time_pred_rf:.2f} ms (error: {abs(avg_test_time_pred_rf - avg_test_time_actual):.2f} ms)")
# Throughput in devices/hour
throughput_actual = 3600 * 1000 / avg_test_time_actual  # 3600 sec/hr, 1000 ms/sec
throughput_pred_lr = 3600 * 1000 / avg_test_time_pred_lr
throughput_pred_rf = 3600 * 1000 / avg_test_time_pred_rf
print(f"\n  Throughput (devices/hour):")
print(f"    Actual: {throughput_actual:.0f}")
print(f"    LR predicted: {throughput_pred_lr:.0f}")
print(f"    RF predicted: {throughput_pred_rf:.0f}")
# Cost of prediction error (assume $10 per device, 24/7 operation)
cost_per_device = 10
hours_per_year = 8760  # 24 * 365
error_cost_lr = abs(throughput_pred_lr - throughput_actual) * hours_per_year * cost_per_device
error_cost_rf = abs(throughput_pred_rf - throughput_actual) * hours_per_year * cost_per_device
print(f"\n  Annual revenue impact of prediction error:")
print(f"    Linear Regression: ${error_cost_lr:,.0f}")
print(f"    Random Forest: ${error_cost_rf:,.0f}")
print("\n" + "="*80)
print("REGRESSION EVALUATION COMPLETE")
print("="*80)
# Pick the winner from the measured R² instead of assuming one
# (the synthetic target is linear in the features, so either model may win)
best_m, other_m = (rf_metrics, lr_metrics) if rf_metrics['R¬≤'] > lr_metrics['R¬≤'] else (lr_metrics, rf_metrics)
print(f"\n✅ Key Findings:")
print(f"   • {best_m['model']} outperforms {other_m['model']} (R² = {best_m['R¬≤']:.4f} vs {other_m['R¬≤']:.4f})")
print(f"   • Average prediction error: {rf_metrics['MAE']:.2f} ms (RF) vs {lr_metrics['MAE']:.2f} ms (LR)")
print(f"   • Max error: {rf_metrics['Max Error']:.2f} ms (RF) vs {lr_metrics['Max Error']:.2f} ms (LR)")
print(f"   • Difference in annual error cost between the two models: ${abs(error_cost_lr - error_cost_rf):,.0f}")
print(f"\n📊 Recommendation: Deploy {best_m['model']} for production test time prediction")
print("="*80)


## 🚀 Real-World Project Ideas

### Post-Silicon Validation Projects

#### Project 1: Adaptive Test Threshold Optimizer
**Objective**: Build system that dynamically adjusts test thresholds based on cost-sensitive evaluation to minimize total manufacturing cost while maintaining quality targets.

**Business Impact**: **$15M annual savings** (typical large semiconductor fab)

**Key Features**:
- Real-time cost matrix updates based on field failure data
- Multi-objective optimization (minimize cost, maximize yield, meet quality targets)
- Threshold adaptation per product family and test stage
- A/B testing framework for threshold changes
- Dashboard showing cost breakdown (FP vs FN) and savings

**Evaluation Metrics**:
- Total cost (FP cost + FN cost)
- Cost per device
- Yield rate vs quality escapes
- Return on investment (ROI)

**Techniques**:
- Cost-sensitive learning
- ROC/PR curve analysis
- Bayesian optimization for threshold tuning
- Monte Carlo simulation for cost estimation
- Multi-armed bandit for A/B testing

---

#### Project 2: Multi-Stage Test Flow Evaluator
**Objective**: Evaluate and optimize multi-stage test flows (wafer sort → final test → system test) with cascading decision thresholds and cumulative cost analysis.

**Business Impact**: **$25M savings** (reduce redundant tests, optimize flow)

**Key Features**:
- Stage-wise metric computation (each test stage)
- Cumulative confusion matrices across stages
- Test coverage analysis (which defects caught at which stage)
- Flow optimization (skip unnecessary stages for low-risk devices)
- Cost-benefit analysis per test stage

**Evaluation Metrics**:
- Stage-wise precision/recall/F1
- Cumulative FP/FN across all stages
- Test time per stage vs value added
- Defect escape rate per stage
- Total test cost per device

**Techniques**:
- Multi-class classification evaluation
- Sequential decision analysis
- Markov chains for flow modeling
- Dynamic programming for optimal paths
- Cost-benefit analysis
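
As a starting point for the defect escape metric, the sketch below combines stage-wise recalls into a cumulative escape rate. It assumes each stage misses defects independently of the others, which real test flows often violate, and the stage names and recall values are hypothetical.

```python
from math import prod

def cumulative_escape_rate(stage_recalls):
    """Fraction of defective devices that pass every stage.

    Assumes stages miss defects independently of one another.
    """
    return prod(1.0 - recall for recall in stage_recalls)

# Hypothetical per-stage recall (share of remaining defects each stage catches)
stage_recalls = {'wafer_sort': 0.90, 'final_test': 0.80, 'system_test': 0.50}
escape = cumulative_escape_rate(stage_recalls.values())
print(f"Defect escape rate after all stages: {escape:.4f}")
```

Under these numbers about 1% of defective devices escape all three stages. If the same defects tend to slip through every stage, the true escape rate is higher than this product suggests.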

---

#### Project 3: Spatial Yield Prediction Evaluator
**Objective**: Evaluate models that predict yield based on spatial wafer patterns (die location, wafer map analysis) with specialized spatial metrics.

**Business Impact**: **$20M savings** (early detection of spatial defects, process improvements)

**Key Features**:
- Spatial autocorrelation metrics (Moran's I, Geary's C)
- Wafer map visualization with prediction overlay
- Zone-based evaluation (edge, center, quadrants)
- Spatial clustering detection (defect hotspots)
- Process-aware metrics (batch, lot, fab)

**Evaluation Metrics**:
- Standard classification metrics
- Spatial autocorrelation of errors
- Zone-wise precision/recall
- Hotspot detection rate
- False alarm rate per wafer region

**Techniques**:
- Spatial statistics
- Geospatial analysis
- Cluster detection algorithms
- Image-based evaluation
- Process control charts
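
For the spatial autocorrelation of errors, a minimal sketch of Moran's I on a rectangular die grid is shown below. It assumes binary rook (4-neighbour) weights and a full grid with no missing dies, which is a simplification of a real round wafer map; `morans_i_rook` is a hypothetical helper.

```python
import numpy as np

def morans_i_rook(grid):
    """Moran's I on a rectangular grid with binary rook (4-neighbour) weights."""
    z = np.asarray(grid, dtype=float)
    z = z - z.mean()
    # Sum of products over each undirected neighbour pair (horizontal, then vertical)
    pair_sum = np.sum(z[:, :-1] * z[:, 1:]) + np.sum(z[:-1, :] * z[1:, :])
    n_pairs = z[:, :-1].size + z[:-1, :].size
    # I = (N / W) * sum_ij w_ij z_i z_j / sum_i z_i^2 with W = 2 * n_pairs;
    # the factor 2 from counting each pair in both directions cancels.
    return float(z.size * pair_sum / (n_pairs * np.sum(z ** 2)))

checkerboard = np.indices((4, 4)).sum(axis=0) % 2   # errors alternate die to die
half_split = np.tile([1, 1, 0, 0], (4, 1))          # errors clustered on one side

print(f"Checkerboard: {morans_i_rook(checkerboard):.3f}")
print(f"Half split:   {morans_i_rook(half_split):.3f}")
```

Values near -1 indicate alternating patterns, values near 0 indicate no spatial structure, and positive values indicate clustering, which for prediction errors would suggest the model is missing a spatial effect. A constant grid has zero variance and makes the statistic undefined.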

---

#### Project 4: Reliability Prediction Confidence Calibration
**Objective**: Build calibrated confidence estimators for device reliability predictions (predicted failure probability matches actual failure rate).

**Business Impact**: **$30M savings** (accurate warranty reserves, targeted reliability improvements)

**Key Features**:
- Expected Calibration Error (ECE) computation
- Reliability calibration curves
- Confidence interval estimation for MTBF (Mean Time Between Failures)
- Uncertainty quantification for predictions
- Risk-based binning (high confidence → ship, low confidence → additional test)

**Evaluation Metrics**:
- Expected Calibration Error (ECE)
- Brier score
- Confidence interval coverage
- Calibration curve analysis
- Risk-adjusted accuracy

**Techniques**:
- Probability calibration (Platt scaling, isotonic regression)
- Conformal prediction
- Bayesian uncertainty quantification
- Bootstrap confidence intervals
- Risk-based decision making
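
The two headline calibration metrics can be sketched in a few lines. The version of ECE below is one common binary variant (equal-width bins over the predicted failure probability, comparing mean prediction to observed failure rate); other definitions bin on the top-class confidence instead. Function names and the toy inputs are illustrative.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE = sum over bins of (n_b / N) * |observed rate - mean predicted prob|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

toy_true = np.array([0, 0, 1, 1])
toy_prob = np.array([0.1, 0.1, 0.9, 0.9])
print(f"Brier score: {brier_score(toy_true, toy_prob):.3f}")
print(f"ECE:         {expected_calibration_error(toy_true, toy_prob):.3f}")
```

Here the predictions rank perfectly but are slightly under-confident: devices given 0.9 always fail and devices given 0.1 never do, so each occupied bin has a gap of 0.1 and the ECE is 0.1.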

---

### General AI/ML Projects

#### Project 5: Healthcare Risk Stratification System
**Objective**: Evaluate multi-class risk stratification models (low/medium/high/critical risk) for patient outcomes with cost-sensitive evaluation (higher penalty for underestimating high-risk patients).

**Business Impact**: **$100M savings** (hospital system), improved patient outcomes

**Key Features**:
- Multi-class macro/micro/weighted metrics
- Cost matrix with severity-based penalties
- Calibration for risk probabilities
- Subgroup fairness analysis (demographic parity)
- Temporal evaluation (prediction window analysis)

**Evaluation Metrics**:
- Macro/Micro/Weighted F1
- Cost-sensitive accuracy
- Expected Calibration Error
- Fairness metrics (equalized odds, demographic parity)
- C-statistic (concordance index)

**Techniques**:
- Multi-class evaluation
- Cost-sensitive learning
- Fairness-aware ML
- Survival analysis
- Calibration methods

---

#### Project 6: Financial Fraud Detection Pipeline
**Objective**: Build comprehensive evaluation pipeline for real-time fraud detection with extreme imbalance (0.1% fraud rate), cost asymmetry, and concept drift monitoring.

**Business Impact**: **$500M prevented losses** (large financial institution)

**Key Features**:
- Imbalanced classification evaluation (PR curves, F-beta)
- Cost-sensitive thresholds (missed fraud >> false alarm)
- Real-time monitoring dashboard (prediction drift, feature drift)
- A/B testing framework for model updates
- Explainability metrics (feature importance, SHAP consistency)

**Evaluation Metrics**:
- Precision-Recall AUC (PR-AUC)
- F2-score (favor recall)
- Cost-weighted accuracy
- Prediction drift (KL divergence, PSI)
- Model latency (p50, p95, p99)

**Techniques**:
- Imbalanced learning evaluation
- Cost-sensitive classification
- Concept drift detection
- Online learning evaluation
- Real-time monitoring
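
Prediction drift via the Population Stability Index (PSI) can be sketched as follows. This assumes quantile bins taken from the reference scores and a small `eps` floor to avoid `log(0)`; both are common choices rather than a fixed standard, and `population_stability_index` is a hypothetical helper name.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (cur - ref) * ln(cur / ref), using bin fractions."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior edges from reference quantiles: each bin holds ~1/n_bins of the reference
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.digitize(reference, inner_edges), minlength=n_bins)
    cur_counts = np.bincount(np.digitize(current, inner_edges), minlength=n_bins)
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_frac = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Synthetic model scores (assumption): one stable batch, one shifted batch
rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 5000)
same_dist = rng.normal(0.0, 1.0, 5000)
shifted = rng.normal(1.0, 1.0, 5000)

psi_stable = population_stability_index(train_scores, same_dist)
psi_shifted = population_stability_index(train_scores, shifted)
print(f"stable PSI = {psi_stable:.4f}, shifted PSI = {psi_shifted:.4f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift; these cut-offs are conventions and should be validated against your own data before being used for alerting.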

---

#### Project 7: E-commerce Recommendation Evaluator
**Objective**: Evaluate recommendation systems with beyond-accuracy metrics (diversity, novelty, serendipity, coverage) and multi-stakeholder objectives (user satisfaction, revenue, inventory turnover).

**Business Impact**: **$200M revenue increase** (large e-commerce platform)

**Key Features**:
- Ranking metrics (NDCG, MAP, MRR, Hit Rate@K)
- Diversity metrics (intra-list diversity, coverage)
- Business metrics (revenue, conversion, click-through rate)
- User satisfaction metrics (engagement time, return rate)
- A/B test evaluation framework

**Evaluation Metrics**:
- Precision@K, Recall@K, F1@K
- Normalized Discounted Cumulative Gain (NDCG)
- Mean Average Precision (MAP)
- Coverage (catalog coverage, user coverage)
- Diversity (Gini coefficient, entropy)
- Revenue per user

**Techniques**:
- Ranking evaluation
- Multi-objective optimization
- A/B testing
- Causal inference
- Business metric alignment
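
To make the ranking metrics concrete, here is a minimal NDCG@K sketch. It assumes linear gain (`rel / log2(rank + 1)`); an exponential-gain variant (`2**rel - 1`) is also common. It further assumes the relevance list contains every candidate item, so the ideal ordering is just the sorted list.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance grades in the order the recommender ranked the items
print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0 (already in ideal order)
print(ndcg_at_k([0, 1], k=2))        # the only relevant item is ranked second
```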

---

#### Project 8: NLP Model Evaluation Suite
**Objective**: Build comprehensive evaluation suite for NLP models (classification, NER, QA, summarization) with task-specific metrics, human evaluation correlation, and fairness analysis.

**Business Impact**: **$50M cost savings** (improved customer service automation)

**Key Features**:
- Task-specific metrics (F1 for NER, BLEU/ROUGE for summarization, Exact Match for QA)
- Human evaluation alignment (correlation with expert ratings)
- Bias detection (gender, race, age bias in predictions)
- Multilingual evaluation (cross-lingual consistency)
- Production monitoring (response quality drift)

**Evaluation Metrics**:
- Token-level F1 (NER)
- Exact Match, F1 (QA)
- BLEU, ROUGE, METEOR (summarization)
- Perplexity (language models)
- Bias scores (demographic parity, equalized odds)
- Human correlation (Spearman, Pearson)

**Techniques**:
- Token-level evaluation
- N-gram matching
- Semantic similarity (BERTScore)
- Fairness metrics
- Human-in-the-loop evaluation
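
Exact Match and token-level F1 for QA can be sketched as below. The normalization here (lowercasing and whitespace tokenization) is a simplifying assumption; benchmark scripts typically also strip punctuation and articles.

```python
from collections import Counter

def normalize(text):
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall, counting duplicate tokens."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Wafer Failed", "wafer failed"))  # 1.0
print(token_f1("the wafer failed", "wafer failed"))  # precision 2/3, recall 1
```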

---

### 🎯 Project Selection Guide

| **Domain** | **Best Project** | **Why** |
|------------|-----------------|---------|
| **Semiconductor** | Adaptive Test Threshold Optimizer | Immediate ROI, clear cost savings, production-ready |
| **Manufacturing** | Multi-Stage Test Flow Evaluator | Optimizes entire process, high impact |
| **Healthcare** | Risk Stratification System | Life-saving impact, regulatory compliance |
| **Finance** | Fraud Detection Pipeline | Massive ROI, real-time requirements |
| **E-commerce** | Recommendation Evaluator | Direct revenue impact, user satisfaction |
| **NLP/AI** | NLP Evaluation Suite | Foundation for AI products, fairness critical |

### Success Criteria

For each project, define clear success metrics:

1. **Technical**: Accuracy improvement, metric optimization, latency reduction
2. **Business**: Cost savings, revenue increase, efficiency gains
3. **Operational**: Deployment speed, maintenance cost, scalability
4. **Stakeholder**: User satisfaction, compliance, risk reduction

### Implementation Roadmap

1. **Phase 1 (Weeks 1-2)**: Baseline evaluation, metric selection, cost matrix definition
2. **Phase 2 (Weeks 3-4)**: Custom metrics implementation, visualization, automation
3. **Phase 3 (Weeks 5-6)**: Production integration, monitoring, A/B testing
4. **Phase 4 (Weeks 7-8)**: Optimization, documentation, handoff

## 🎓 Key Takeaways and Best Practices

### 📝 Core Principles

```mermaid
graph TD
    A[Model Evaluation] --> B[Choose Right Metrics]
    A --> C[Understand Business Context]
    A --> D[Consider Data Characteristics]
    
    B --> B1[Classification vs Regression]
    B --> B2[Balanced vs Imbalanced]
    B --> B3[Cost Asymmetry]
    
    C --> C1[Define Success Criteria]
    C --> C2[Identify Stakeholders]
    C --> C3[Quantify Costs/Benefits]
    
    D --> D1[Data Distribution]
    D --> D2[Feature Quality]
    D --> D3[Temporal Patterns]
    
    B1 --> E[Comprehensive Evaluation]
    B2 --> E
    B3 --> E
    C1 --> E
    C2 --> E
    C3 --> E
    D1 --> E
    D2 --> E
    D3 --> E
    
    E --> F[Production Deployment]
```

---

### 🔑 When to Use Each Metric

#### Classification Metrics Decision Table

| **Scenario** | **Primary Metric** | **Secondary Metrics** | **Why** |
|-------------|-------------------|---------------------|---------|
| **Balanced classes, equal costs** | Accuracy, F1 | Precision, Recall | Simple, interpretable |
| **Imbalanced classes** | PR-AUC, F1 | Precision@K, Recall@K | Focuses on minority class |
| **Cost asymmetry (FN >> FP)** | Cost-sensitive threshold | Recall, F-beta (β>1) | Minimize expensive errors |
| **Cost asymmetry (FP >> FN)** | Cost-sensitive threshold | Precision, F-beta (β<1) | Minimize false alarms |
| **Need probability calibration** | ECE, Brier score | Calibration curves | Reliable confidence estimates |
| **Multi-class, balanced** | Macro-F1 | Confusion matrix | All classes equally important |
| **Multi-class, imbalanced** | Weighted-F1 | Micro-F1, Per-class metrics | Weight by class size |
| **Ranking/Retrieval** | NDCG, MAP | Precision@K, Recall@K | Order matters |
| **Threshold-independent** | ROC-AUC, PR-AUC | Full curves | Compare models globally |
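
The F-beta rows in the table can be illustrated with a short sketch. The helper `f_beta` is written out by hand here for clarity; it follows the standard definition, in which beta above 1 weights recall more heavily and beta below 1 weights precision more heavily.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A model with high recall but modest precision
precision, recall = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(precision, recall, beta):.3f}")
```

Because recall exceeds precision in this example, the score rises as beta grows, which is why F2 suits cases where missed positives are the expensive error.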

#### Regression Metrics Decision Table

| **Scenario** | **Primary Metric** | **Secondary Metrics** | **Why** |
|-------------|-------------------|---------------------|---------|
| **General regression** | RMSE | MAE, R² | Interpretable, standard |
| **Outliers present** | MAE | Median Absolute Error | Robust to extremes |
| **Comparing models** | R² | RMSE, MAE | Variance explained |
| **Scale-independent comparison** | MAPE | R² | Percentage error |
| **Penalize large errors** | RMSE | MSE | Quadratic penalty |
| **Worst-case guarantee** | Max Error | RMSE, MAE | Safety-critical applications |
| **Multiple features, regularization** | Adjusted R² | R², Cross-val RMSE | Penalizes overfitting |
| **Business impact** | Custom cost function | RMSE, MAE | Align with objectives |
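
Two rows of the table lend themselves to a quick numeric check: how a single outlier affects MAE versus RMSE, and how adjusted R² discounts R² for the number of features. The data below is made up for illustration, and the helper names are hypothetical.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def adjusted_r2(r2, n_samples, n_features):
    """1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

y_true = [10.0] * 5
clean = [11.0, 9.0, 11.0, 9.0, 11.0]     # every error has magnitude 1
outlier = [11.0, 9.0, 11.0, 9.0, 20.0]   # one error of magnitude 10

print(mae(y_true, clean), rmse(y_true, clean))      # 1.0 1.0
print(mae(y_true, outlier), rmse(y_true, outlier))  # RMSE grows faster than MAE
print(adjusted_r2(0.9, n_samples=101, n_features=10))
```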

---

### ⚠️ Common Pitfalls and How to Avoid Them

#### Pitfall 1: Optimizing the Wrong Metric
**Problem**: Using accuracy on imbalanced data (99% class 0 → 99% accuracy by always predicting 0)

**Solution**: 
- Always check class distribution first
- Use PR-AUC or F1 for imbalanced data
- Define business-relevant metrics (cost, revenue, risk)
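
The scenario above is easy to reproduce with a few lines. This sketch uses a made-up label vector with 1% positives and a trivial model that always predicts the majority class.

```python
# 1% positive class, matching the "99% class 0" scenario
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)  # trivial model: always predict the majority class

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")  # accuracy=0.99, recall=0.00
```

Accuracy looks excellent while the model catches none of the positives, which is why recall, F1, or PR-AUC should accompany accuracy on skewed data.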

#### Pitfall 2: Ignoring Cost Asymmetry
**Problem**: Using default threshold (0.5) when FN costs $10M and FP costs $50K

**Solution**:
- Define cost matrix explicitly
- Find the cost-minimizing threshold: `t* = argmin_t (C_FP × FP(t) + C_FN × FN(t))`, where `FP(t)` and `FN(t)` are the error counts at threshold `t`
- Monitor costs in production, not just accuracy
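
The threshold search can be sketched as a simple grid scan. The tiny label and score arrays are made up for illustration, the costs reuse the figures from the problem statement ($10M per FN, $50K per FP), and `optimal_threshold` is a hypothetical helper.

```python
import numpy as np

def optimal_threshold(y_true, y_prob, cost_fp, cost_fn, thresholds=None):
    """Return (threshold, total_cost) minimizing cost_fp * FP + cost_fn * FN."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    costs = []
    for t in thresholds:
        pred = y_prob >= t  # predict positive at or above the threshold
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))  # first threshold reaching the minimum cost
    return float(thresholds[best]), float(costs[best])

# Made-up validation labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.055, 0.425, 0.375, 0.85]

t_star, cost = optimal_threshold(y_true, y_prob, cost_fp=50_000, cost_fn=10_000_000)
print(f"t* = {t_star:.2f}, total cost = ${cost:,.0f}")
```

With false negatives this expensive, the scan settles on a threshold well below 0.5: it accepts one false alarm in order to catch both positives.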

#### Pitfall 3: Forgetting Cross-Validation
**Problem**: Single train/test split → results depend on specific split

**Solution**:
- Use K-Fold CV (K=5 or 10) to estimate mean and variance
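- Report the spread across folds (`mean ± std`), or a confidence interval derived from it, rather than a single number
- Use stratified splits for imbalanced data
- For time series, use a time series split so that training data always precedes test data (no data leakage)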
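
A minimal sketch of stratified K-fold evaluation with scikit-learn follows. The synthetic dataset, the logistic regression model, and the choice of F1 as the scoring metric are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced dataset (assumption): roughly 10% positives
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stratification keeps the class ratio approximately constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")

print(f"F1 = {scores.mean():.3f} ± {scores.std(ddof=1):.3f} over {len(scores)} folds")
```

For temporal data, `sklearn.model_selection.TimeSeriesSplit` can be passed as `cv` instead, so that each test fold comes after its training data.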

#### Pitfall 4: Not Checking Calibration
**Problem**: The model reports 80% confidence but is correct only 60% of the time

**Solution**:
- Compute Expected Calibration Error (ECE)
- Plot calibration curves
- Apply calibration methods (Platt scaling, isotonic regression)
- Monitor calibration drift in production
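
ECE can be computed by hand in a few lines. This sketch uses the binary form, comparing the mean predicted probability of the positive class with the observed positive rate in equal-width bins. The bin count and binning scheme are assumptions; other variants exist (equal-mass bins, top-label confidence for multi-class).

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average of |mean predicted prob - observed positive rate| per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # bin index in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# The pitfall's scenario: predicted 80%, observed 60%
y_prob = [0.8] * 10
y_true = [1] * 6 + [0] * 4
print(round(expected_calibration_error(y_true, y_prob), 3))  # 0.2
```

For plotting, `sklearn.calibration.calibration_curve` returns the per-bin observed rate and mean prediction that this function aggregates.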

#### Pitfall 5: Comparing Across Different Datasets
**Problem**: "Model A has MSE=10 on dataset X, Model B has MSE=5 on dataset Y → B is better"

**Solution**:
- Only compare models on the **same dataset**
- Use scale-independent metrics (R², MAPE) for cross-dataset insights
- Report metrics relative to baseline (% improvement)

#### Pitfall 6: Overfitting to Validation Set
**Problem**: Tuning hyperparameters to maximize validation metric → overfitting

**Solution**:
- Use 3-way split: Train / Validation / Test
- Report final metrics on held-out test set (never used for tuning)
- Use nested cross-validation for rigorous evaluation
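
Nested cross-validation can be sketched with scikit-learn by wrapping a `GridSearchCV` (inner loop, tuning) inside `cross_val_score` (outer loop, evaluation). The dataset, model, and parameter grid below are placeholders chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data (assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=inner_cv,
    scoring="f1",
)

# Each outer fold re-runs the full search on its own training portion,
# so the outer test data never influences the chosen hyperparameters.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="f1")
print(f"Nested CV F1 = {nested_scores.mean():.3f} ± {nested_scores.std(ddof=1):.3f}")
```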

#### Pitfall 7: Ignoring Production Constraints
**Problem**: Model has great metrics but 10-second latency (SLA is 100ms)

**Solution**:
- Define SLAs upfront: Latency (p50, p95, p99), throughput, memory
- Monitor production metrics: Prediction drift, feature drift, performance degradation
- Set alerting thresholds (e.g., accuracy drops > 5%)
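
Latency percentiles can be measured with a small timing harness. In this sketch `dummy_predict` is a hypothetical stand-in for a real model's predict call, and the 100 ms SLA is taken from the problem statement above.

```python
import time
import numpy as np

def measure_latency_ms(predict_fn, inputs, warmup=5):
    """Time each call to predict_fn and return p50/p95/p99 in milliseconds."""
    for x in inputs[:warmup]:  # warm-up calls are not timed
        predict_fn(x)
    samples = []
    for x in inputs:
        start = time.perf_counter()
        predict_fn(x)
        samples.append((time.perf_counter() - start) * 1000.0)
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    return {"p50": float(p50), "p95": float(p95), "p99": float(p99)}

def dummy_predict(x):
    """Hypothetical placeholder for model inference."""
    return sum(v * v for v in x) > 1.0

inputs = [[0.1 * i, 0.2, 0.3] for i in range(200)]
latency = measure_latency_ms(dummy_predict, inputs)

sla_ms = 100.0
meets_sla = latency["p99"] <= sla_ms
print(latency, "meets SLA:", meets_sla)
```

Checking the SLA against p99 rather than the mean matters because tail latency is what users and downstream systems experience under load.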

---

### 🏭 Production Evaluation Checklist

Before deploying to production, validate:

- [ ] **Metrics aligned with business objectives** (not just technical metrics)
- [ ] **Cost asymmetry considered** (optimal threshold found)
- [ ] **Cross-validation performed** (results stable across folds)
- [ ] **Calibration checked** (predicted probabilities reliable)
- [ ] **Fairness evaluated** (no bias across demographics/segments)
- [ ] **Production constraints met** (latency, throughput, memory)
- [ ] **Monitoring implemented** (drift detection, alerting)
- [ ] **A/B testing plan** (how to validate in production)
- [ ] **Rollback criteria defined** (when to revert to old model)
- [ ] **Documentation complete** (metrics, thresholds, assumptions)

---

### 🔧 Tools and Libraries

#### Essential Python Libraries

```python
# Metrics computation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    mean_squared_error, mean_absolute_error, r2_score,
    confusion_matrix, classification_report
)

# Calibration
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Advanced evaluation
from scipy.stats import ttest_rel  # Paired t-test
from mlxtend.evaluate import mcnemar  # McNemar's test
```

#### Specialized Tools

- **scikit-learn**: Standard metrics, cross-validation, calibration
- **imbalanced-learn**: Resampling techniques and metrics for imbalanced data
- **fairlearn**: Fairness metrics and bias detection
- **alibi**: Model explainability and confidence
- **evidently**: Production monitoring and drift detection
- **neptune/mlflow**: Experiment tracking and metric logging
- **wandb**: Real-time monitoring and visualization

---

### 📊 Metric Reporting Template

When reporting model performance, include:

1. **Dataset characteristics**: Size, class distribution, feature count, temporal range
2. **Evaluation strategy**: K-Fold CV (K=?), stratified?, time series split?
3. **Primary metric with uncertainty**: `F1 = 0.85 ± 0.03` (mean ± std across folds)
4. **Confusion matrix**: TP, TN, FP, FN (absolute counts, not just percentages)
5. **Cost analysis**: Total cost at optimal threshold vs baseline
6. **Calibration**: ECE, calibration curve
7. **Production metrics**: Latency (p50/p95/p99), throughput, memory
8. **Comparison to baseline**: `15% improvement in F1 over previous model`
9. **Statistical significance**: `p-value < 0.01 (paired t-test)`
10. **Business impact**: `Estimated $5M annual savings from reduced false negatives`
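
Items 8 and 9 of the template can be produced with a paired t-test on per-fold scores, sketched below. The fold scores are made-up illustrative numbers, not results from this notebook. Note also that scores from overlapping CV folds are not fully independent, so the p-value should be read as indicative rather than exact.

```python
import numpy as np
from scipy.stats import ttest_rel

# Illustrative per-fold F1 scores (made up) for two models on the SAME folds
new_model = np.array([0.85, 0.87, 0.84, 0.86, 0.88])
baseline = np.array([0.80, 0.83, 0.79, 0.82, 0.82])

result = ttest_rel(new_model, baseline)  # paired: fold i is compared with fold i
improvement = (new_model.mean() - baseline.mean()) / baseline.mean() * 100

print(f"Improvement over baseline: {improvement:.1f}%")
print(f"Paired t-test: t = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```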

---

### 🚀 Next Steps

After mastering evaluation metrics:

1. **043_Cross_Validation_Strategies.ipynb**: Rigorous model validation techniques
2. **044_Hyperparameter_Tuning.ipynb**: Optimize models using evaluation metrics
3. **045_Model_Interpretability.ipynb**: Explain predictions and build trust
4. **046_Production_ML_Systems.ipynb**: Deploy and monitor models in production

---

### 📚 Additional Resources

**Books**:
- *Evaluating Machine Learning Models* by Alice Zheng (O'Reilly)
- *The Hundred-Page Machine Learning Book* by Andriy Burkov (Chapter on Evaluation)

**Papers**:
- "The Relationship Between Precision-Recall and ROC Curves" (ICML 2006)
- "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods" (Platt, 1999)
- "A Survey of Predictive Modelling under Imbalanced Distributions" (AIRE, 2015)

**Online**:
- sklearn documentation on metrics: https://scikit-learn.org/stable/modules/model_evaluation.html
- Google ML Crash Course: Classification Metrics
- fast.ai: Practical Deep Learning - Model Evaluation

---

### 🎯 Remember

> **"You can't improve what you don't measure, but measuring the wrong thing is worse than not measuring at all."**

Always start by asking:
1. **What business objective am I optimizing for?**
2. **What are the costs of different types of errors?**
3. **How will this model be used in production?**

Choose metrics that **align with the answers** to these questions, not just those that are easy to compute.

**Congratulations!** You now have comprehensive knowledge of model evaluation metrics. Apply these principles to build models that deliver **real business value**. 🎉