# Day 7: Model Evaluation and Metrics

**Time:** 3 hours

**Mathematical Prerequisites:**
- Probability theory (conditional probability, Bayes theorem)
- Statistics (hypothesis testing, confidence intervals)
- Linear algebra (for multi-class metrics)
- Information theory basics (for some metrics)

---

## Objectives

Accuracy alone is **insufficient** for evaluating ML models. Today we explore:
1. Comprehensive classification metrics (precision, recall, F1, etc.)
2. Multi-class evaluation strategies
3. Visual evaluation tools (ROC curves, PR curves, calibration plots)
4. Statistical model comparison
5. Systematic error analysis

**Goal:** Build a complete evaluation framework for any classification task

---

## Part 1: Theory - Why Accuracy is Not Enough

### 1.1 The Accuracy Paradox

Consider a cancer detection problem:
- 1% of patients have cancer
- Model predicts "no cancer" for everyone
- **Accuracy: 99%!** But **completely useless**

**Problem:** Accuracy treats all errors equally and ignores class imbalance.
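
The paradox is easy to reproduce in a few lines. The sketch below uses a made-up population of 1,000 patients (an assumption chosen only to match the 1% prevalence above) and a classifier that always answers "no cancer":

```python
# Hypothetical screening population: 1,000 patients, 1% prevalence.
y_true = [1] * 10 + [0] * 990
# Degenerate classifier: always predicts "no cancer" (label 0).
y_pred = [0] * len(y_true)

n_correct = sum(t == p for t, p in zip(y_true, y_pred))
n_sick = sum(y_true)
n_sick_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

accuracy = n_correct / len(y_true)
detection_rate = n_sick_found / n_sick  # fraction of actual cancer cases detected

print(f"accuracy={accuracy:.2f}, detection rate={detection_rate:.2f}")
```

The accuracy is 0.99 while not a single cancer case is detected, which is why the metrics in the following sections look at each error type separately.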

### 1.2 Confusion Matrix

For binary classification:

```
                Predicted
              Pos      Neg
Actual Pos    TP       FN
       Neg    FP       TN
```

- **True Positive (TP):** Correctly predicted positive
- **True Negative (TN):** Correctly predicted negative
- **False Positive (FP):** Type I error (predicted positive, actually negative)
- **False Negative (FN):** Type II error (predicted negative, actually positive)

### 1.3 Key Metrics

**Accuracy:**
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

**Precision (Positive Predictive Value):**
$$
\text{Precision} = \frac{TP}{TP + FP} = P(\text{actual pos} | \text{predicted pos})
$$
*"Of all positive predictions, how many were correct?"*

**Recall (Sensitivity, True Positive Rate):**
$$
\text{Recall} = \frac{TP}{TP + FN} = P(\text{predicted pos} | \text{actual pos})
$$
*"Of all actual positives, how many did we find?"*

**Specificity (True Negative Rate):**
$$
\text{Specificity} = \frac{TN}{TN + FP}
$$

**F1 Score (Harmonic Mean of Precision and Recall):**
$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}
$$

**F-Beta Score (Weighted Harmonic Mean):**
$$
F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}
$$
- $\beta > 1$: Favor recall (e.g., cancer detection)
- $\beta < 1$: Favor precision (e.g., spam detection)
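
As a quick check of the formulas above, the following sketch computes every metric directly from the four confusion-matrix counts. The helper name `binary_metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are illustrative assumptions, not part of any library API.

```python
def binary_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute the Section 1.3 metrics from raw confusion-matrix counts."""
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
    }

# Hypothetical counts for illustration.
m_f1 = binary_metrics(tp=80, fp=20, fn=40, tn=860)            # beta = 1 gives F1
m_f2 = binary_metrics(tp=80, fp=20, fn=40, tn=860, beta=2.0)  # favors recall

print({k: round(v, 4) for k, v in m_f1.items()})
print("F2:", round(m_f2["f_beta"], 4))
```

Because recall (0.667) is lower than precision (0.8) in this example, the recall-weighted F2 comes out below F1.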

### 1.4 Multi-Class Averaging

For $C$ classes:

**Macro Average:**
$$
\text{Macro-F1} = \frac{1}{C} \sum_{i=1}^C F1_i
$$
*Treats all classes equally, so rare classes count as much as frequent ones (useful when minority-class performance matters)*

**Micro Average:**
$$
\text{Micro-F1} = \frac{2 \sum_{i=1}^C TP_i}{2 \sum_{i=1}^C TP_i + \sum_{i=1}^C FP_i + \sum_{i=1}^C FN_i}
$$
*Treats all samples equally (dominated by frequent classes). In single-label multi-class problems, micro-F1 equals accuracy.*

**Weighted Average:**
$$
\text{Weighted-F1} = \sum_{i=1}^C \frac{n_i}{N} F1_i
$$
where $n_i$ is the number of samples in class $i$.
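
The three averages can be computed by hand from a confusion matrix. The sketch below uses a made-up 3-class matrix (an assumption for illustration) with rows as true classes and columns as predicted classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [90,  5,  5],
    [10, 30, 10],
    [ 2,  3, 15],
])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp   # predicted as class i but actually another class
fn = cm.sum(axis=1) - tp   # actually class i but predicted as another class
support = cm.sum(axis=1)   # n_i: number of true samples per class

f1_per_class = 2 * tp / (2 * tp + fp + fn)
macro_f1 = f1_per_class.mean()
micro_f1 = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
weighted_f1 = np.sum(support / support.sum() * f1_per_class)
accuracy = tp.sum() / cm.sum()

print("per-class F1:", np.round(f1_per_class, 4))
print(f"macro={macro_f1:.4f}, micro={micro_f1:.4f}, weighted={weighted_f1:.4f}")
print(f"accuracy={accuracy:.4f}")
```

In this example the smallest class has the lowest F1, so macro-F1 is lower than weighted-F1, and micro-F1 coincides with accuracy because every misclassification counts once as a false positive and once as a false negative.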

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_curve, auc, roc_auc_score,
    precision_recall_curve, average_precision_score,
    cohen_kappa_score, matthews_corrcoef,
    balanced_accuracy_score
)
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.calibration import calibration_curve
from scipy import stats
from itertools import cycle

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
from torch.utils.data import DataLoader

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

np.random.seed(42)
torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Part 2: Load Models and Data

We'll use CIFAR-10 and evaluate multiple models from Day 6.

In [None]:
# CIFAR-10 classes
classes = ['airplane', 'automobile', 'bird', 'cat', 'deer', 
           'dog', 'frog', 'horse', 'ship', 'truck']

# Transforms
transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Load dataset
test_dataset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform
)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=2)

print(f"Test samples: {len(test_dataset)}")
print(f"Number of classes: {len(classes)}")

### Train Multiple Models for Comparison

We'll train 3 different models to compare their evaluation metrics.

In [None]:
def train_model(model_name, num_epochs=5):
    """Quick training function for demonstration."""
    print(f"\nTraining {model_name}...")
    
    # Load train data
    train_dataset = torchvision.datasets.CIFAR10(
        root='./data', train=True, download=True, transform=transform
    )
    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)
    
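    # Create model. The `weights=` argument replaces the deprecated
    # `pretrained=True` (torchvision >= 0.13); IMAGENET1K_V1 matches the old default.
    if model_name == 'ResNet18':
        model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, 10)
    elif model_name == 'ResNet34':
        model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
        model.fc = nn.Linear(model.fc.in_features, 10)
    elif model_name == 'MobileNetV2':
        model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
        model.classifier[1] = nn.Linear(model.classifier[1].in_features, 10)
    else:
        raise ValueError(f"Unknown model name: {model_name}")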
    
    model = model.to(device)
    
    # Freeze backbone (feature extraction for speed)
    for param in model.parameters():
        param.requires_grad = False
    if model_name in ['ResNet18', 'ResNet34']:
        for param in model.fc.parameters():
            param.requires_grad = True
    else:
        for param in model.classifier.parameters():
            param.requires_grad = True
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=0.001)
    
    # Training loop
    model.train()
    for epoch in range(num_epochs):
        running_loss = 0.0
        correct = 0
        total = 0
        
        for i, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
            
            if i % 200 == 199:
                print(f'Epoch {epoch+1}, Batch {i+1}: Loss={running_loss/200:.3f}, Acc={100.*correct/total:.2f}%')
                running_loss = 0.0
    
    return model

# Train models
model1 = train_model('ResNet18', num_epochs=5)
model2 = train_model('ResNet34', num_epochs=5)
model3 = train_model('MobileNetV2', num_epochs=5)

models_dict = {
    'ResNet18': model1,
    'ResNet34': model2,
    'MobileNetV2': model3
}

### Get Predictions and Probabilities

In [None]:
def get_predictions_and_probs(model, data_loader):
    """Get predictions and class probabilities."""
    model.eval()
    all_preds = []
    all_labels = []
    all_probs = []
    
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs = inputs.to(device)
            outputs = model(inputs)
            probs = torch.softmax(outputs, dim=1)
            _, predicted = outputs.max(1)
            
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.numpy())
            all_probs.extend(probs.cpu().numpy())
    
    return np.array(all_preds), np.array(all_labels), np.array(all_probs)

# Get predictions for all models
predictions_dict = {}
probabilities_dict = {}

for name, model in models_dict.items():
    print(f"Getting predictions for {name}...")
    preds, labels, probs = get_predictions_and_probs(model, test_loader)
    predictions_dict[name] = preds
    probabilities_dict[name] = probs

# Store ground truth labels (same for all models)
true_labels = labels

## Part 3: Comprehensive Metrics Analysis

### 3.1 Basic Metrics for All Models

In [None]:
def calculate_all_metrics(y_true, y_pred):
    """Calculate comprehensive classification metrics."""
    metrics = {
        'Accuracy': accuracy_score(y_true, y_pred),
        'Balanced Accuracy': balanced_accuracy_score(y_true, y_pred),
        'Macro Precision': precision_score(y_true, y_pred, average='macro'),
        'Macro Recall': recall_score(y_true, y_pred, average='macro'),
        'Macro F1': f1_score(y_true, y_pred, average='macro'),
        'Weighted Precision': precision_score(y_true, y_pred, average='weighted'),
        'Weighted Recall': recall_score(y_true, y_pred, average='weighted'),
        'Weighted F1': f1_score(y_true, y_pred, average='weighted'),
        'Cohen Kappa': cohen_kappa_score(y_true, y_pred),
        'Matthews Corr Coef': matthews_corrcoef(y_true, y_pred)
    }
    return metrics

# Calculate metrics for all models
all_metrics = {}
for name, preds in predictions_dict.items():
    all_metrics[name] = calculate_all_metrics(true_labels, preds)

# Display as DataFrame
df_metrics = pd.DataFrame(all_metrics).T
print("\n" + "="*80)
print("COMPREHENSIVE METRICS COMPARISON")
print("="*80)
print(df_metrics.to_string())
print("="*80)

### 3.2 Understanding Different Metrics

**Balanced Accuracy:**
$$
\text{Balanced Acc} = \frac{1}{C} \sum_{i=1}^C \frac{TP_i}{TP_i + FN_i}
$$
Average of per-class recall. Better than accuracy for imbalanced data.

**Cohen's Kappa:**
$$
\kappa = \frac{p_o - p_e}{1 - p_e}
$$
where $p_o$ is observed agreement and $p_e$ is expected agreement by chance.
- $\kappa = 1$: Perfect agreement
- $\kappa = 0$: Agreement no better than chance
- $\kappa < 0$: Agreement worse than chance

**Matthews Correlation Coefficient (MCC):**
$$
MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}
$$
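Balanced metric even for imbalanced classes. Range: [-1, 1]. The formula above is the binary form; scikit-learn's `matthews_corrcoef` also implements the multi-class generalization, which is what the 10-class code in this notebook computes.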
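
To make the two formulas concrete, here is a small worked example on binary counts. The counts are made up for illustration, and the layout follows the confusion matrix in Section 1.2.

```python
import math

# Hypothetical binary confusion counts (layout as in Section 1.2).
tp, fn, fp, tn = 40, 10, 5, 45
n = tp + fn + fp + tn

# Cohen's kappa: observed agreement vs. agreement expected by chance.
p_o = (tp + tn) / n
# Chance agreement: P(actual pos) * P(predicted pos) + P(actual neg) * P(predicted neg).
p_e = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
kappa = (p_o - p_e) / (1 - p_e)

# Matthews correlation coefficient (binary form).
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={p_o:.3f}, kappa={kappa:.3f}, MCC={mcc:.3f}")
```

Accuracy is 0.85 here, but half of that agreement is expected by chance ($p_e = 0.5$), so kappa is 0.70, and MCC lands close to it.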

In [None]:
# Visualize metrics comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Select key metrics to visualize
key_metrics = ['Accuracy', 'Macro F1', 'Weighted F1', 'Balanced Accuracy']

for idx, metric in enumerate(key_metrics):
    ax = axes[idx // 2, idx % 2]
    values = [all_metrics[name][metric] for name in models_dict.keys()]
    bars = ax.bar(models_dict.keys(), values, alpha=0.7, edgecolor='black')
    ax.set_ylabel('Score', fontsize=12)
    ax.set_title(metric, fontsize=14, fontweight='bold')
    ax.set_ylim([0, 1])
    ax.grid(True, alpha=0.3, axis='y')
    
    # Add value labels
    for bar, val in zip(bars, values):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{val:.4f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.suptitle('Model Comparison: Key Metrics', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

## Part 4: Per-Class Analysis

### 4.1 Detailed Classification Report

In [None]:
# Classification report for best model
best_model_name = max(all_metrics.keys(), key=lambda k: all_metrics[k]['Accuracy'])
best_preds = predictions_dict[best_model_name]

print(f"\nDetailed Classification Report for {best_model_name}:")
print("="*80)
print(classification_report(true_labels, best_preds, target_names=classes, digits=4))
print("="*80)

### 4.2 Per-Class Metrics Visualization

In [None]:
# Calculate per-class metrics
precision_per_class = precision_score(true_labels, best_preds, average=None)
recall_per_class = recall_score(true_labels, best_preds, average=None)
f1_per_class = f1_score(true_labels, best_preds, average=None)

# Create DataFrame
df_per_class = pd.DataFrame({
    'Class': classes,
    'Precision': precision_per_class,
    'Recall': recall_per_class,
    'F1-Score': f1_per_class
})

print("\nPer-Class Metrics:")
print(df_per_class.to_string(index=False))

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

# Grouped bar chart
x = np.arange(len(classes))
width = 0.25

ax1.bar(x - width, precision_per_class, width, label='Precision', alpha=0.8)
ax1.bar(x, recall_per_class, width, label='Recall', alpha=0.8)
ax1.bar(x + width, f1_per_class, width, label='F1-Score', alpha=0.8)

ax1.set_xlabel('Class', fontsize=12)
ax1.set_ylabel('Score', fontsize=12)
ax1.set_title(f'Per-Class Metrics - {best_model_name}', fontsize=14, fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(classes, rotation=45, ha='right')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')
ax1.set_ylim([0, 1])

# Heatmap
metrics_array = np.array([precision_per_class, recall_per_class, f1_per_class])
sns.heatmap(metrics_array, annot=True, fmt='.3f', cmap='YlGnBu',
            xticklabels=classes, yticklabels=['Precision', 'Recall', 'F1'],
            ax=ax2, cbar_kws={'label': 'Score'})
ax2.set_title(f'Metrics Heatmap - {best_model_name}', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Identify best and worst performing classes
best_class_idx = f1_per_class.argmax()
worst_class_idx = f1_per_class.argmin()

print(f"\nBest performing class: {classes[best_class_idx]} (F1={f1_per_class[best_class_idx]:.4f})")
print(f"Worst performing class: {classes[worst_class_idx]} (F1={f1_per_class[worst_class_idx]:.4f})")

## Part 5: Confusion Matrix Analysis

### 5.1 Confusion Matrices for All Models

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

for idx, (name, preds) in enumerate(predictions_dict.items()):
    cm = confusion_matrix(true_labels, preds)
    
    # Normalize by row (true labels)
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues',
                xticklabels=classes, yticklabels=classes,
                ax=axes[idx], cbar_kws={'label': 'Proportion'})
    axes[idx].set_xlabel('Predicted Label', fontsize=11)
    axes[idx].set_ylabel('True Label', fontsize=11)
    axes[idx].set_title(f'{name}\nAccuracy: {all_metrics[name]["Accuracy"]:.4f}', 
                        fontsize=12, fontweight='bold')

plt.suptitle('Normalized Confusion Matrices (Row-Normalized)', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

### 5.2 Common Misclassification Patterns

In [None]:
# Analyze confusion patterns for best model
cm = confusion_matrix(true_labels, best_preds)
cm_no_diag = cm.copy()
np.fill_diagonal(cm_no_diag, 0)

# Find top 10 confusion pairs
print(f"\nTop 10 Most Common Misclassifications ({best_model_name}):\n")
print(f"{'Rank':<5} {'True Class':<15} {'Predicted As':<15} {'Count':<8} {'% of True Class'}")
print("="*70)

flat_indices = np.argsort(cm_no_diag.ravel())[::-1]
for i, flat_idx in enumerate(flat_indices[:10]):
    true_idx, pred_idx = np.unravel_index(flat_idx, cm.shape)
    count = cm_no_diag[true_idx, pred_idx]
    total_in_class = cm[true_idx, :].sum()
    percentage = 100 * count / total_in_class
    print(f"{i+1:<5} {classes[true_idx]:<15} {classes[pred_idx]:<15} {count:<8} {percentage:.2f}%")

## Part 6: ROC Curves and AUC

### 6.1 Theory: ROC and AUC

**ROC (Receiver Operating Characteristic) Curve:**
- Plot of True Positive Rate vs False Positive Rate at various thresholds
- TPR = Recall = $\frac{TP}{TP+FN}$
- FPR = $\frac{FP}{FP+TN}$

**AUC (Area Under Curve):**
- AUC = 1.0: Perfect classifier
- AUC = 0.5: Random classifier
- AUC < 0.5: Worse than random

**Multi-class ROC:**
- One-vs-Rest (OvR): Each class vs all others
- Macro-average: Average of per-class AUC
- Micro-average: Aggregate all classes
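
One way to build intuition for these AUC values: in the binary case, AUC equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (ties count as one half). The sketch below computes AUC that way by brute force; the helper name `auc_by_pairs` and the toy data are illustrative assumptions, and the quadratic loop is only suitable for small inputs.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Each correctly ordered pair counts 1, each tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 2 negatives, 2 positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = auc_by_pairs(y_true, scores)
print(f"AUC = {auc_value:.2f}")

# A constant score ties every pair, which gives the "random" value of 0.5.
auc_constant = auc_by_pairs(y_true, [0.5, 0.5, 0.5, 0.5])
print(f"AUC (constant scores) = {auc_constant:.2f}")
```

Three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75.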

### 6.2 ROC Curves for Multi-Class

In [None]:
def plot_multiclass_roc(y_true, y_probs, class_names, model_name):
    """Plot ROC curves for multi-class classification."""
    n_classes = len(class_names)
    
    # Binarize labels for OvR
    from sklearn.preprocessing import label_binarize
    y_true_bin = label_binarize(y_true, classes=range(n_classes))
    
    # Compute ROC curve and AUC for each class
    fpr = dict()
    tpr = dict()
    roc_auc = dict()
    
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_true_bin[:, i], y_probs[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])
    
    # Compute micro-average ROC curve and AUC
    fpr["micro"], tpr["micro"], _ = roc_curve(y_true_bin.ravel(), y_probs.ravel())
    roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
    
    # Compute macro-average ROC curve and AUC
    all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
    mean_tpr = np.zeros_like(all_fpr)
    for i in range(n_classes):
        mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])
    mean_tpr /= n_classes
    fpr["macro"] = all_fpr
    tpr["macro"] = mean_tpr
    roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
    
    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
    
    # Plot 1: Per-class ROC curves
    colors = cycle(plt.cm.tab10.colors)
    for i, color in zip(range(n_classes), colors):
        ax1.plot(fpr[i], tpr[i], color=color, lw=2,
                label=f'{class_names[i]} (AUC={roc_auc[i]:.3f})')
    
    ax1.plot([0, 1], [0, 1], 'k--', lw=2, label='Random (AUC=0.5)')
    ax1.set_xlim([0.0, 1.0])
    ax1.set_ylim([0.0, 1.05])
    ax1.set_xlabel('False Positive Rate', fontsize=12)
    ax1.set_ylabel('True Positive Rate', fontsize=12)
    ax1.set_title(f'Per-Class ROC Curves - {model_name}', fontsize=14, fontweight='bold')
    ax1.legend(loc='lower right', fontsize=9)
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Micro and Macro average
    ax2.plot(fpr["micro"], tpr["micro"],
            label=f'Micro-average (AUC={roc_auc["micro"]:.3f})',
            color='deeppink', linestyle=':', linewidth=4)
    ax2.plot(fpr["macro"], tpr["macro"],
            label=f'Macro-average (AUC={roc_auc["macro"]:.3f})',
            color='navy', linestyle=':', linewidth=4)
    ax2.plot([0, 1], [0, 1], 'k--', lw=2, label='Random (AUC=0.5)')
    ax2.set_xlim([0.0, 1.0])
    ax2.set_ylim([0.0, 1.05])
    ax2.set_xlabel('False Positive Rate', fontsize=12)
    ax2.set_ylabel('True Positive Rate', fontsize=12)
    ax2.set_title(f'Average ROC Curves - {model_name}', fontsize=14, fontweight='bold')
    ax2.legend(loc='lower right', fontsize=11)
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return roc_auc

# Plot ROC for best model
best_probs = probabilities_dict[best_model_name]
roc_auc_scores = plot_multiclass_roc(true_labels, best_probs, classes, best_model_name)

print(f"\nMacro-average AUC: {roc_auc_scores['macro']:.4f}")
print(f"Micro-average AUC: {roc_auc_scores['micro']:.4f}")

## Part 7: Precision-Recall Curves

### 7.1 Theory: Precision-Recall vs ROC

**When to use PR curves instead of ROC:**
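- The dataset is imbalanced (few positive samples)
- Performance on the positive class matters most
- ROC can look overly optimistic on imbalanced data, because FPR stays small whenever true negatives vastly outnumber false positives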

**Average Precision (AP):**
$$
AP = \sum_n (R_n - R_{n-1}) P_n
$$
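where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold. This step-wise sum approximates the area under the PR curve.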
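
The sum can be evaluated directly by walking down the ranked predictions. The following is a minimal sketch that assumes all scores are distinct (tied scores would need to be grouped into a single threshold); the function name and toy data are illustrative.

```python
def average_precision(y_true, scores):
    """AP = sum over ranks of (R_n - R_{n-1}) * P_n, assuming distinct scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    prev_recall = 0.0
    for rank, i in enumerate(order, start=1):
        tp += y_true[i]
        precision = tp / rank          # P_n: precision among the top-`rank` items
        recall = tp / n_pos            # R_n: recall among the top-`rank` items
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap_value = average_precision(y_true, scores)
print(f"AP = {ap_value:.4f}")
```

Only ranks where a positive is found change the recall, so AP is effectively the mean of the precision values at those ranks: here (1.0 + 2/3) / 2.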

In [None]:
def plot_precision_recall_curves(y_true, y_probs, class_names, model_name):
    """Plot Precision-Recall curves for multi-class."""
    from sklearn.preprocessing import label_binarize
    n_classes = len(class_names)
    y_true_bin = label_binarize(y_true, classes=range(n_classes))
    
    # Compute PR curve for each class
    precision = dict()
    recall = dict()
    avg_precision = dict()
    
    for i in range(n_classes):
        precision[i], recall[i], _ = precision_recall_curve(y_true_bin[:, i], y_probs[:, i])
        avg_precision[i] = average_precision_score(y_true_bin[:, i], y_probs[:, i])
    
    # Micro-average
    precision["micro"], recall["micro"], _ = precision_recall_curve(
        y_true_bin.ravel(), y_probs.ravel()
    )
    avg_precision["micro"] = average_precision_score(y_true_bin, y_probs, average="micro")
    
    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
    
    # Per-class curves
    colors = cycle(plt.cm.tab10.colors)
    for i, color in zip(range(n_classes), colors):
        ax1.plot(recall[i], precision[i], color=color, lw=2,
                label=f'{class_names[i]} (AP={avg_precision[i]:.3f})')
    
    ax1.set_xlim([0.0, 1.0])
    ax1.set_ylim([0.0, 1.05])
    ax1.set_xlabel('Recall', fontsize=12)
    ax1.set_ylabel('Precision', fontsize=12)
    ax1.set_title(f'Per-Class Precision-Recall Curves - {model_name}', 
                 fontsize=14, fontweight='bold')
    ax1.legend(loc='lower left', fontsize=9)
    ax1.grid(True, alpha=0.3)
    
    # Micro-average
    ax2.plot(recall["micro"], precision["micro"],
            label=f'Micro-average (AP={avg_precision["micro"]:.3f})',
            color='deeppink', linestyle=':', linewidth=4)
    ax2.set_xlim([0.0, 1.0])
    ax2.set_ylim([0.0, 1.05])
    ax2.set_xlabel('Recall', fontsize=12)
    ax2.set_ylabel('Precision', fontsize=12)
    ax2.set_title(f'Micro-Average Precision-Recall - {model_name}', 
                 fontsize=14, fontweight='bold')
    ax2.legend(loc='lower left', fontsize=11)
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return avg_precision

# Plot PR curves
ap_scores = plot_precision_recall_curves(true_labels, best_probs, classes, best_model_name)

print(f"\nMean Average Precision (mAP): {np.mean([ap_scores[i] for i in range(len(classes))]):.4f}")
print(f"Micro-average AP: {ap_scores['micro']:.4f}")

## Part 8: Calibration Analysis

### 8.1 Theory: Probability Calibration

A well-calibrated model:
- If it predicts 70% probability, the event should occur 70% of the time

**Calibration Plot:**
- X-axis: Predicted probability
- Y-axis: Observed frequency
- Perfect calibration: y = x line

**Expected Calibration Error (ECE):**
$$
ECE = \sum_{m=1}^M \frac{|B_m|}{N} |\text{acc}(B_m) - \text{conf}(B_m)|
$$
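where $B_m$ is the set of predictions whose confidence falls in bin $m$, $N$ is the total number of predictions, $\text{acc}(B_m)$ is the accuracy within the bin, and $\text{conf}(B_m)$ is the mean confidence within the bin.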
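
A minimal NumPy sketch of the ECE formula follows. The function name, the equal-width binning, and the toy inputs are assumptions made for illustration; other binning schemes (for example equal-mass bins) are also used in practice.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    inner_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.searchsorted(inner_edges, confidences)  # values in 0 .. n_bins-1
    ece = 0.0
    for m in np.unique(bin_ids):          # empty bins contribute nothing
        in_bin = bin_ids == m
        acc = correct[in_bin].mean()      # acc(B_m)
        conf = confidences[in_bin].mean() # conf(B_m)
        ece += in_bin.mean() * abs(acc - conf)  # |B_m| / N weighting
    return float(ece)

# Toy example: four confident predictions (3 correct), two uncertain ones (1 correct).
conf = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
hit = [1, 1, 1, 0, 1, 0]
ece_value = expected_calibration_error(conf, hit)
print(f"ECE = {ece_value:.4f}")
```

The high-confidence bin has a gap of 0.20 and holds 4 of the 6 predictions; the other bin has a gap of 0.05 and holds 2. The weighted result is 0.15, whereas an unweighted mean of the two gaps would give 0.125, so the bin weights matter whenever confidences are unevenly distributed.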

In [None]:
def plot_calibration_curve(y_true, y_probs, model_name, n_bins=10):
    """Plot calibration curve for multi-class model."""
    # Get predicted probabilities for predicted class
    y_pred = np.argmax(y_probs, axis=1)
    y_conf = np.max(y_probs, axis=1)
    
    # Check if prediction is correct
    y_correct = (y_pred == y_true).astype(int)
    
    # Compute calibration curve
    prob_true, prob_pred = calibration_curve(y_correct, y_conf, n_bins=n_bins, strategy='uniform')
    
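    # Compute ECE as the bin-size-weighted mean gap, matching the formula above
    # (an unweighted mean over bins would ignore how many predictions each bin holds)
    bin_ids = np.searchsorted(np.linspace(0.0, 1.0, n_bins + 1)[1:-1], y_conf)
    ece = sum(
        np.mean(bin_ids == m) * abs(y_correct[bin_ids == m].mean() - y_conf[bin_ids == m].mean())
        for m in np.unique(bin_ids)
    )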
    
    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Calibration curve
    ax1.plot([0, 1], [0, 1], 'k--', lw=2, label='Perfect Calibration')
    ax1.plot(prob_pred, prob_true, 's-', lw=2, markersize=10, 
            label=f'{model_name}\nECE={ece:.4f}')
    ax1.set_xlabel('Mean Predicted Confidence', fontsize=12)
    ax1.set_ylabel('Accuracy (Fraction Correct)', fontsize=12)
    ax1.set_title('Calibration Curve', fontsize=14, fontweight='bold')
    ax1.legend(loc='lower right', fontsize=11)
    ax1.grid(True, alpha=0.3)
    ax1.set_xlim([0, 1])
    ax1.set_ylim([0, 1])
    
    # Confidence histogram
    ax2.hist(y_conf, bins=n_bins, edgecolor='black', alpha=0.7)
    ax2.set_xlabel('Predicted Confidence', fontsize=12)
    ax2.set_ylabel('Count', fontsize=12)
    ax2.set_title('Distribution of Predicted Confidences', fontsize=14, fontweight='bold')
    ax2.grid(True, alpha=0.3, axis='y')
    
    plt.suptitle(f'Model Calibration Analysis - {model_name}', 
                fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()
    
    return ece

# Plot calibration for all models
ece_scores = {}
for name, probs in probabilities_dict.items():
    print(f"\nCalibration for {name}:")
    ece = plot_calibration_curve(true_labels, probs, name)
    ece_scores[name] = ece
    print(f"Expected Calibration Error: {ece:.4f}")

## Part 9: Statistical Model Comparison

### 9.1 McNemar's Test

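Tests whether two models evaluated on the same test set have significantly different error rates. Only the discordant pairs ($b$ and $c$ below), where exactly one of the two models is correct, carry information about the difference.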

**Contingency Table:**
```
              Model 2 Correct  Model 2 Wrong
Model 1 Correct      a              b
Model 1 Wrong        c              d
```

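**Test Statistic:**
$$
\chi^2 = \frac{(b - c)^2}{b + c}
$$
Under the null hypothesis this follows a chi-square distribution with 1 degree of freedom. The code below uses the continuity-corrected variant (`correction=True`), $\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$.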

**Null Hypothesis:** Models have same error rate

**Decision:** If p < 0.05, reject the null hypothesis: the difference in error rates is statistically significant at the 5% level
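
The statistic and its p-value can be computed with the standard library alone, which helps when checking the library output by hand. This is a sketch under two assumptions: the discordant counts `b=25`, `c=10` are made up, and the p-value uses the identity that a chi-square variable with 1 degree of freedom is a squared standard normal, so $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$.

```python
import math

def mcnemar_chi2(b, c, correction=True):
    """McNemar chi-square statistic and p-value from the discordant counts."""
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    # The optional continuity correction subtracts 1 from |b - c| before squaring.
    num = (abs(b - c) - 1) ** 2 if correction else (b - c) ** 2
    stat = num / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: model 1 alone correct on 25 samples, model 2 alone on 10.
stat, p_value = mcnemar_chi2(b=25, c=10)
stat_raw, p_raw = mcnemar_chi2(b=25, c=10, correction=False)
print(f"corrected:   chi2={stat:.3f}, p={p_value:.4f}")
print(f"uncorrected: chi2={stat_raw:.3f}, p={p_raw:.4f}")
```

The corrected statistic is 196/35 = 5.6, giving a p-value a little under 0.02. The chi-square approximation is usually considered unreliable when $b + c$ is small, in which case an exact binomial test is the common alternative.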

In [None]:
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_test(y_true, y_pred1, y_pred2, model1_name, model2_name):
    """Perform McNemar's test between two models."""
    # Create contingency table
    correct1 = (y_pred1 == y_true)
    correct2 = (y_pred2 == y_true)
    
    a = np.sum(correct1 & correct2)   # Both correct
    b = np.sum(correct1 & ~correct2)  # Model 1 correct, Model 2 wrong
    c = np.sum(~correct1 & correct2)  # Model 1 wrong, Model 2 correct
    d = np.sum(~correct1 & ~correct2) # Both wrong
    
    contingency_table = np.array([[a, b], [c, d]])
    
    # McNemar's test
    result = mcnemar(contingency_table, exact=False, correction=True)
    
    return {
        'contingency_table': contingency_table,
        'statistic': result.statistic,
        'p_value': result.pvalue
    }

# Compare all pairs of models
model_names = list(predictions_dict.keys())
print("\n" + "="*80)
print("McNemar's Test for Pairwise Model Comparison")
print("="*80)

for i in range(len(model_names)):
    for j in range(i+1, len(model_names)):
        name1, name2 = model_names[i], model_names[j]
        result = mcnemar_test(true_labels, predictions_dict[name1], 
                             predictions_dict[name2], name1, name2)
        
        print(f"\n{name1} vs {name2}:")
        print(f"Contingency Table:")
        print(f"                {name2} Correct  {name2} Wrong")
        print(f"{name1} Correct  {result['contingency_table'][0,0]:>12}  {result['contingency_table'][0,1]:>12}")
        print(f"{name1} Wrong    {result['contingency_table'][1,0]:>12}  {result['contingency_table'][1,1]:>12}")
        print(f"\nChi-square statistic: {result['statistic']:.4f}")
        print(f"p-value: {result['p_value']:.4f}")
        
        if result['p_value'] < 0.05:
            print("✓ Significant difference (p < 0.05)")
        else:
            print("✗ No significant difference (p >= 0.05)")

print("\n" + "="*80)

### 9.2 Confidence Intervals for Accuracy

Using the Wilson score interval (more reliable than the normal approximation when accuracy is close to 0 or 1, or when the sample is small):

$$
CI = \frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
$$

where $\hat{p}$ is observed accuracy, $n$ is sample size, $z$ is z-score (1.96 for 95% CI).

In [None]:
from scipy.stats import norm

def wilson_confidence_interval(n_correct, n_total, confidence=0.95):
    """Compute Wilson score confidence interval."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    p_hat = n_correct / n_total
    
    denominator = 1 + z**2 / n_total
    center = p_hat + z**2 / (2 * n_total)
    spread = z * np.sqrt(p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2))
    
    lower = (center - spread) / denominator
    upper = (center + spread) / denominator
    
    return lower, upper

# Compute confidence intervals for all models
n_total = len(true_labels)
ci_data = []

for name, preds in predictions_dict.items():
    n_correct = np.sum(preds == true_labels)
    accuracy = n_correct / n_total
    lower, upper = wilson_confidence_interval(n_correct, n_total)
    
    ci_data.append({
        'Model': name,
        'Accuracy': accuracy,
        'CI_Lower': lower,
        'CI_Upper': upper,
        'CI_Width': upper - lower
    })

df_ci = pd.DataFrame(ci_data)

print("\n95% Confidence Intervals for Accuracy (Wilson Score):")
print("="*70)
for _, row in df_ci.iterrows():
    print(f"{row['Model']:<15}: {row['Accuracy']:.4f} [{row['CI_Lower']:.4f}, {row['CI_Upper']:.4f}] (width={row['CI_Width']:.4f})")
print("="*70)

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))

y_pos = np.arange(len(df_ci))
# xerr must have shape (2, n_models): row 0 = lower errors, row 1 = upper errors
errors = np.array([df_ci['Accuracy'].values - df_ci['CI_Lower'].values,
                   df_ci['CI_Upper'].values - df_ci['Accuracy'].values])

ax.errorbar(df_ci['Accuracy'], y_pos, xerr=errors, fmt='o', markersize=10,
           capsize=5, capthick=2, elinewidth=2)
ax.set_yticks(y_pos)
ax.set_yticklabels(df_ci['Model'])
ax.set_xlabel('Accuracy', fontsize=12)
ax.set_title('Model Accuracy with 95% Confidence Intervals (Wilson Score)', 
            fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')
ax.set_xlim([df_ci['CI_Lower'].min() - 0.01, df_ci['CI_Upper'].max() + 0.01])

plt.tight_layout()
plt.show()

## Part 10: Comprehensive Evaluation Dashboard

Create a final summary dashboard with all key metrics.

In [None]:
# Create comprehensive summary
summary_data = []

for name in models_dict.keys():
    preds = predictions_dict[name]
    probs = probabilities_dict[name]
    
    # Basic metrics
    acc = accuracy_score(true_labels, preds)
    
    # Multi-class metrics
    macro_f1 = f1_score(true_labels, preds, average='macro')
    weighted_f1 = f1_score(true_labels, preds, average='weighted')
    
    # AUC (one-vs-rest)
    try:
        macro_auc = roc_auc_score(true_labels, probs, average='macro', multi_class='ovr')
    except ValueError:
        # e.g. a class is missing from true_labels; report as unavailable rather than 0
        macro_auc = np.nan
    
    # Cohen's Kappa
    kappa = cohen_kappa_score(true_labels, preds)
    
    # ECE
    ece = ece_scores[name]
    
    summary_data.append({
        'Model': name,
        'Accuracy': acc,
        'Macro-F1': macro_f1,
        'Weighted-F1': weighted_f1,
        'Macro-AUC': macro_auc,
        'Cohen-Kappa': kappa,
        'ECE': ece
    })

df_summary = pd.DataFrame(summary_data)

print("\n" + "="*100)
print("COMPREHENSIVE EVALUATION DASHBOARD")
print("="*100)
print(df_summary.to_string(index=False))
print("="*100)

# Radar chart for comparison
from math import pi

fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))

# Metrics to plot (ECE is left out: it is a lower-is-better error, unlike the others)
metrics = ['Accuracy', 'Macro-F1', 'Weighted-F1', 'Macro-AUC', 'Cohen-Kappa']
num_vars = len(metrics)

# Compute angle for each axis
angles = [n / float(num_vars) * 2 * pi for n in range(num_vars)]
angles += angles[:1]

# Plot for each model
for idx, row in df_summary.iterrows():
    values = [row[m] for m in metrics]
    values += values[:1]
    
    ax.plot(angles, values, 'o-', linewidth=2, label=row['Model'])
    ax.fill(angles, values, alpha=0.15)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics, size=11)
ax.set_ylim(0, 1)
ax.set_title('Model Comparison Radar Chart', size=14, fontweight='bold', pad=20)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
ax.grid(True)

plt.tight_layout()
plt.show()

## Part 11: Key Takeaways and Best Practices

### 11.1 Metric Selection Guidelines

**Use Case → Metric Mapping:**

| Use Case | Primary Metric | Why |
|----------|----------------|-----|
| Balanced classes | Accuracy | Simple, interpretable |
| Imbalanced classes | Macro-F1, Balanced Accuracy | Treats all classes equally |
| Cost-sensitive (FP costly) | Precision | Minimize false positives |
| Cost-sensitive (FN costly) | Recall | Minimize false negatives |
| Ranking quality | AUC-ROC | Threshold-independent |
| Imbalanced ranking | AUC-PR | Focus on positive class |
| Probability calibration | ECE, Brier Score | Need well-calibrated probabilities |
| Multi-label | Hamming Loss, Subset Accuracy | Per-label error rate vs. strict exact-match accuracy |

### 11.2 Common Pitfalls

1. **Using only accuracy on imbalanced data**
   - Solution: Use F1, balanced accuracy, or class weights

2. **Ignoring confidence calibration**
   - Solution: Plot calibration curves, apply calibration methods

3. **Not using confidence intervals**
   - Solution: Always report CIs, especially for small datasets

4. **Comparing models without statistical tests**
   - Solution: Use McNemar's test, paired t-tests, etc.

5. **Forgetting about computational cost**
   - Solution: Include training time, inference time in evaluation

### 11.3 Mathematical Insights

**Precision-Recall Trade-off:**
$$
\text{Precision} \uparrow \implies \text{Recall} \downarrow
$$
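This is a tendency rather than a strict implication: raising the classification threshold never increases recall and usually, though not always, increases precision. F1 summarizes the trade-off at a single threshold.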
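
The effect of the threshold is easy to see on a handful of scored examples. The data below is made up for illustration, and treating precision as 1.0 when nothing is predicted positive is a convention assumed here.

```python
# Hypothetical scores from a binary classifier, with ground-truth labels.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]

def precision_recall_at(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

results = {thr: precision_recall_at(thr) for thr in (0.15, 0.50, 0.80)}
for thr, (p, r) in results.items():
    print(f"threshold={thr:.2f}: precision={p:.3f}, recall={r:.3f}")
```

As the threshold rises from 0.15 to 0.80, recall falls from 1.0 to 0.5 while precision climbs from 4/7 to 1.0.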

**Why Harmonic Mean for F1?**
- Arithmetic mean: $(P + R)/2$ - too lenient
- Geometric mean: $\sqrt{PR}$ - still too lenient
- Harmonic mean: $\frac{2PR}{P+R}$ - penalizes imbalance

**Example:** P=0.9, R=0.1
- Arithmetic: 0.5 (misleading!)
- Geometric: 0.3
- Harmonic: 0.18 (more realistic)
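
The three means for the P=0.9, R=0.1 example can be checked in a few lines; the second pair of values is an extra illustrative case showing that the means agree when precision and recall are equal.

```python
import math

def three_means(p, r):
    """Arithmetic, geometric, and harmonic mean of precision and recall."""
    arithmetic = (p + r) / 2
    geometric = math.sqrt(p * r)
    harmonic = 2 * p * r / (p + r)
    return arithmetic, geometric, harmonic

skewed = three_means(0.9, 0.1)     # very unbalanced precision and recall
balanced = three_means(0.5, 0.5)   # same arithmetic mean, but balanced

print("P=0.9, R=0.1:", [round(v, 2) for v in skewed])
print("P=0.5, R=0.5:", [round(v, 2) for v in balanced])
```

Both pairs have the same arithmetic mean of 0.5, but only the harmonic mean drops sharply (to 0.18) when one of the two values is poor.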

### 11.4 Reporting Checklist

When reporting model performance, always include:

✅ **Basic Metrics:**
- Accuracy (with confidence interval)
- Macro and weighted F1
- Per-class precision/recall

✅ **Visual Analysis:**
- Confusion matrix (normalized)
- ROC curves (if applicable)
- Calibration plot

✅ **Statistical Tests:**
- Confidence intervals
- Comparison with baseline (McNemar's test)

✅ **Error Analysis:**
- Most common misclassifications
- Failure modes
- Examples of errors

✅ **Computational Cost:**
- Training time
- Inference time
- Model size

---

## Summary

Congratulations! You've built a comprehensive model evaluation framework. You now understand:

✅ Why accuracy is insufficient (accuracy paradox)  
✅ Comprehensive classification metrics (precision, recall, F1, etc.)  
✅ Multi-class averaging strategies (macro, micro, weighted)  
✅ ROC curves and AUC interpretation  
✅ Precision-Recall curves for imbalanced data  
✅ Probability calibration analysis  
✅ Statistical model comparison (McNemar's test)  
✅ Confidence intervals for metrics  
✅ Systematic error analysis  
✅ Best practices for metric selection and reporting  

**Key Insight:** No single metric tells the whole story. Always use multiple complementary metrics and statistical tests.

**Time spent:** ~3 hours

**Next:** Day 8 - Hyperparameter Tuning and Experimentation