# Chapter 4: Evaluation Metrics (The Unit Tests of ML)

Accuracy can lie. This notebook shows you why, and introduces the metrics that tell the truth.

We'll cover:
1. **Confusion Matrix** — The Code Coverage Report
2. **Precision & Recall** — Strict Typing vs Test Coverage
3. **F1-Score** — The Balanced SLA
4. **ROC Curve** — The Load Test
5. **Cross-Validation** — The CI Pipeline

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 11

## Setup: Train Our Zoo of Models (from Chapters 2 & 3)

Let's start by training the same three models from Chapter 3 on the digits dataset.

In [None]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the digits dataset (same as Chapter 2)
digits = load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train our three models
models = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    acc = accuracy_score(y_test, predictions[name])
    print(f"{name}: {acc:.2%} accuracy")

print("\nAll three look great! But accuracy only tells part of the story...")

---
## 1. The Confusion Matrix (The Code Coverage Report)

A confusion matrix shows you exactly WHERE the model is wrong. It's like a code coverage report — not just "80% pass rate" but "these specific lines are uncovered."
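
Before using scikit-learn's version, it can help to see how little is going on under the hood. The sketch below builds a confusion matrix by hand for a made-up 3-class problem; the labels and the helper name `tiny_confusion_matrix` are hypothetical and exist only for illustration.

```python
import numpy as np

def tiny_confusion_matrix(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs: rows are true labels, columns are predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels for a 3-class problem (not taken from the digits data)
y_true_toy = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 2, 2, 0, 2]

cm_toy = tiny_confusion_matrix(y_true_toy, y_pred_toy, n_classes=3)
print(cm_toy)

# Accuracy is the diagonal (correct predictions) divided by the total
print(f"Accuracy: {np.trace(cm_toy) / cm_toy.sum():.2%}")
```

Each off-diagonal cell names a specific mistake: `cm_toy[1, 2]` counts the samples whose true class is 1 but were predicted as 2.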

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Use the Random Forest predictions (swap the key to inspect another model)
cm = confusion_matrix(y_test, predictions['Random Forest'])

fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(cm, display_labels=digits.target_names)
disp.plot(cmap='Blues', ax=ax, colorbar=True)
ax.set_title('Confusion Matrix: Random Forest on Digits', fontsize=14, fontweight='bold')
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_ylabel('True Label', fontsize=12)
plt.tight_layout()
plt.show()

print("Diagonal = correct predictions. Off-diagonal = mistakes.")
print("Look for the largest off-diagonal numbers — those are the digits the model confuses.")

### Comparing Confusion Matrices Across Models
Let's see how each model's errors differ.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, (name, preds) in zip(axes, predictions.items()):
    cm = confusion_matrix(y_test, preds)
    disp = ConfusionMatrixDisplay(cm, display_labels=digits.target_names)
    disp.plot(cmap='Blues', ax=ax, colorbar=False)
    acc = accuracy_score(y_test, preds)
    ax.set_title(f'{name}\nAccuracy: {acc:.1%}', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

print("Notice: Decision Tree has more off-diagonal noise. Random Forest is cleaner.")

---
## 2. Precision & Recall (Strict Typing vs Test Coverage)

- **Precision**: "When I predict class X, how often am I right?" (Strict type checking)
- **Recall**: "Of all actual class X samples, how many did I find?" (Test coverage)

The trade-off is like **latency vs throughput**: for a given model, pushing one up usually pulls the other down, so you rarely get to maximize both.
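
To make the trade-off concrete, here is a minimal sketch that computes precision and recall at several decision thresholds. The scores and labels are invented for illustration, and `precision_recall_at` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

# Hypothetical scores from a binary classifier (1 = positive class)
y_true_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95])

def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when every score >= threshold is predicted positive."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))   # predicted positive, actually positive
    fp = np.sum(y_pred & (y_true == 0))   # predicted positive, actually negative
    fn = np.sum(~y_pred & (y_true == 1))  # predicted negative, actually positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.9):
    p, r = precision_recall_at(threshold, y_true_toy, scores_toy)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold makes the model pickier: precision climbs while recall drops.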

In [None]:
from sklearn.metrics import classification_report, precision_score, recall_score

# Detailed report for Random Forest
print("Random Forest — Classification Report:")
print("=" * 55)
print(classification_report(y_test, predictions['Random Forest'], target_names=[str(d) for d in digits.target_names]))

In [None]:
from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, support = precision_recall_fscore_support(
    y_test, predictions['Random Forest'], average=None
)

x = np.arange(10)
width = 0.35

fig, ax = plt.subplots(figsize=(12, 6))
bars1 = ax.bar(x - width/2, precision, width, label='Precision', color='#4ECDC4', edgecolor='white')
bars2 = ax.bar(x + width/2, recall, width, label='Recall', color='#FF6B6B', edgecolor='white')

ax.set_xlabel('Digit Class', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Precision vs Recall Per Digit (Random Forest)', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([str(d) for d in digits.target_names])
ax.set_ylim(0.8, 1.05)
ax.legend(fontsize=11)
ax.axhline(y=1.0, color='gray', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.show()

worst_precision = digits.target_names[np.argmin(precision)]
worst_recall = digits.target_names[np.argmin(recall)]
print(f"Lowest Precision: digit {worst_precision} (other digits are most often mislabeled as this one)")
print(f"Lowest Recall: digit {worst_recall} — model misses the most of these")

---
## 3. F1-Score (The Balanced SLA)

F1 is the harmonic mean of Precision and Recall. Like a balanced SLA — you need BOTH uptime AND response time.
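
As a quick sketch of why the harmonic mean matters, the snippet below compares it with a plain average. The (precision, recall) pairs are illustrative values, not measurements from any model, and `f1_from` is a hypothetical helper.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (treated as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative (precision, recall) pairs
for p, r in [(0.9, 0.9), (1.0, 0.1), (0.5, 0.0)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f} R={r:.1f}  arithmetic mean={arithmetic:.2f}  F1={f1_from(p, r):.2f}")
```

A plain average would score the lopsided `(1.0, 0.1)` pair at 0.55; the harmonic mean drags it down toward the weaker of the two numbers.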

### When Accuracy Lies
Let's prove it. We'll create an imbalanced dataset and show how accuracy is misleading.
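
The arithmetic behind the claim fits in a few lines. This sketch assumes a hypothetical test set of 950 negatives and 50 positives and a model that always predicts "negative"; no real data or model is involved.

```python
# Hypothetical test set: 950 negatives, 50 positives (a 95/5 split)
n_negative, n_positive = 950, 50

# A model that always predicts "negative" gets every negative right
# and every positive wrong.
tp, fp = 0, 0
tn, fn = n_negative, n_positive

accuracy_toy = (tp + tn) / (tp + tn + fp + fn)
recall_toy = tp / (tp + fn)
# Precision is 0/0 here; a common convention is to report it as 0
precision_toy = tp / (tp + fp) if (tp + fp) > 0 else 0.0

print(f"Accuracy:  {accuracy_toy:.1%}")
print(f"Recall:    {recall_toy:.1%}")
print(f"Precision: {precision_toy:.1%}")
```

The model never finds a single positive, yet its accuracy looks respectable because the majority class dominates the count.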

In [None]:
from sklearn.metrics import f1_score
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

# Create a VERY imbalanced dataset: 95% class 0, 5% class 1
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=20, n_classes=2,
    weights=[0.95, 0.05], random_state=42, flip_y=0
)

# stratify keeps the 95/5 class ratio the same in the train and test sets
X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

print("Class distribution in test set:")
print(f"  Class 0 (majority): {np.sum(y_test_imb == 0)} samples ({np.mean(y_test_imb == 0):.1%})")
print(f"  Class 1 (minority): {np.sum(y_test_imb == 1)} samples ({np.mean(y_test_imb == 1):.1%})")

In [None]:
# Train models on imbalanced data
imb_models = {
    'Always Guess\nMajority Class': DummyClassifier(strategy='most_frequent'),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

imb_results = {}
for name, model in imb_models.items():
    model.fit(X_train_imb, y_train_imb)
    preds = model.predict(X_test_imb)
    acc = accuracy_score(y_test_imb, preds)
    # F1 for the minority (positive) class. Weighted averaging would let the
    # majority class dominate the score and hide the failure.
    f1 = f1_score(y_test_imb, preds, zero_division=0)
    imb_results[name] = {'accuracy': acc, 'f1': f1}
    label = name.replace('\n', ' ')  # the plot labels contain a line break
    print(f"{label}: Accuracy={acc:.2%}, Minority-class F1={f1:.3f}")

print("\nNotice: The 'dumb' model has high accuracy, but its minority-class F1 is 0!")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

names = list(imb_results.keys())
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

# Accuracy
acc_vals = [imb_results[n]['accuracy'] for n in names]
bars = axes[0].bar(range(len(names)), acc_vals, color=colors, edgecolor='white')
axes[0].set_xticks(range(len(names)))
axes[0].set_xticklabels(names, fontsize=9)
axes[0].set_ylim(0.7, 1.05)
axes[0].set_title('Accuracy (Misleading!)', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Score')
for bar, val in zip(bars, acc_vals):
    axes[0].text(bar.get_x() + bar.get_width()/2, val + 0.01, f'{val:.1%}',
                ha='center', fontsize=11, fontweight='bold')

# F1 (minority class); the axis starts at 0 so a score of 0 stays visible
f1_vals = [imb_results[n]['f1'] for n in names]
bars = axes[1].bar(range(len(names)), f1_vals, color=colors, edgecolor='white')
axes[1].set_xticks(range(len(names)))
axes[1].set_xticklabels(names, fontsize=9)
axes[1].set_ylim(0, 1.05)
axes[1].set_title('F1-Score (The Truth)', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Score')
for bar, val in zip(bars, f1_vals):
    axes[1].text(bar.get_x() + bar.get_width()/2, val + 0.01, f'{val:.3f}',
                ha='center', fontsize=11, fontweight='bold')

plt.suptitle('Imbalanced Data: Accuracy Lies, F1 Tells the Truth',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

---
## 4. The ROC Curve (The Load Test)

The ROC curve shows performance at EVERY possible decision threshold—like a load test that measures your API at 10, 100, 1000, and 10000 requests/sec.

**AUC** (Area Under the Curve): 1.0 = perfect, 0.5 = random coin flip.
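
To see where the points on a ROC curve come from, the sketch below computes (FPR, TPR) at a few thresholds by hand, then computes AUC by counting correctly ranked pairs. The labels, scores, and the `roc_point` helper are hypothetical and only meant to illustrate the idea.

```python
import numpy as np

# Hypothetical labels and scores (illustrative only)
y_true_toy = np.array([0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])

def roc_point(threshold, y_true, scores):
    """(FPR, TPR) when every score >= threshold is predicted positive."""
    y_pred = scores >= threshold
    tpr = np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1)
    fpr = np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0)
    return fpr, tpr

# Each threshold contributes one point to the curve
for threshold in (0.2, 0.5, 0.8):
    fpr_t, tpr_t = roc_point(threshold, y_true_toy, scores_toy)
    print(f"threshold={threshold:.1f}  FPR={fpr_t:.2f}  TPR={tpr_t:.2f}")

# AUC equals the fraction of (positive, negative) pairs in which the
# positive sample has the higher score (ties count as half).
pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]
auc_toy = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
print(f"AUC by pair counting: {auc_toy:.3f}")
```

The pair-counting view explains the 0.5 baseline: a model that scores at random ranks a positive above a negative about half the time.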

In [None]:
from sklearn.metrics import roc_curve, auc

# A single ROC curve is simplest to read for binary classification, so reuse the imbalanced dataset
roc_models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'KNN': KNeighborsClassifier(n_neighbors=3)
}

fig, ax = plt.subplots(figsize=(8, 8))
colors = ['#4ECDC4', '#45B7D1', '#FF6B6B']

for (name, model), color in zip(roc_models.items(), colors):
    model.fit(X_train_imb, y_train_imb)
    
    if hasattr(model, 'predict_proba'):
        y_scores = model.predict_proba(X_test_imb)[:, 1]
    else:
        y_scores = model.predict(X_test_imb)
    
    fpr, tpr, _ = roc_curve(y_test_imb, y_scores)
    roc_auc = auc(fpr, tpr)
    
    ax.plot(fpr, tpr, color=color, linewidth=2.5,
            label=f'{name} (AUC = {roc_auc:.3f})')

# Random baseline
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random (AUC = 0.500)')

ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves: Performance Across All Thresholds', fontsize=14, fontweight='bold')
ax.legend(fontsize=11, loc='lower right')
ax.set_xlim([-0.02, 1.02])
ax.set_ylim([-0.02, 1.02])

plt.tight_layout()
plt.show()

print("Curves closer to the top-left corner are better. The dashed line = random guessing.")
print("AUC is the chance that a random positive is scored above a random negative, so it summarizes ranking quality across all thresholds.")

---
## 5. Cross-Validation (The CI Pipeline)

A single train/test split = testing on your laptop. Cross-validation = running CI on 5 different machines.

**K-Fold Cross-Validation**: Split data into K parts. Train on K-1, test on the remaining 1. Repeat K times.
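
The splitting procedure can be sketched in plain NumPy. The helper `k_fold_indices` below is a hypothetical illustration of a shuffled K-fold split; it is not how scikit-learn implements it, and `cross_val_score` with an integer `cv` uses stratified folds for classifiers, which this sketch does not reproduce.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # K roughly equal parts
    for i in range(k):
        test_idx = folds[i]
        # Train on the other K-1 parts
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"Fold {fold}: train on {len(train_idx)} samples, test on {sorted(test_idx.tolist())}")
```

Every sample is used for testing exactly once and for training K-1 times, which is why the averaged score depends less on one lucky or unlucky split.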

In [None]:
from sklearn.model_selection import cross_val_score

# Run 5-fold cross-validation on the digits dataset
cv_models = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Decision Tree': DecisionTreeClassifier(max_depth=10, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

cv_results = {}
for name, model in cv_models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_results[name] = scores
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}  (scores: {[f'{s:.3f}' for s in scores]})")

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

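bp = ax.boxplot(
    [cv_results[name] for name in cv_results],
    patch_artist=True,
    widths=0.5
)
# Label the boxes on the axis; boxplot's `labels` argument is deprecated in newer Matplotlib
ax.set_xticklabels(list(cv_results.keys()))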

colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

# Overlay individual fold scores
for i, (name, scores) in enumerate(cv_results.items()):
    x = np.random.normal(i + 1, 0.04, size=len(scores))
    ax.scatter(x, scores, alpha=0.8, color='black', s=40, zorder=3)

ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('5-Fold Cross-Validation: Consistency Across Data Splits', fontsize=14, fontweight='bold')
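# Let the lower limit follow the data so no model's box is cut off
ax.set_ylim(min(s.min() for s in cv_results.values()) - 0.02, 1.01)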

plt.tight_layout()
plt.show()

print("Tighter boxplot = more consistent model.")
print("Cross-validation gives a more reliable estimate than a single train/test split.")

---
## Summary: Your Metric Cheat Sheet

| Metric | SWE Analogy | When to Use |
|--------|-------------|-------------|
| **Accuracy** | Pass/Fail rate | Balanced classes, quick sanity check |
| **Precision** | Strict typing | Cost of false positives is high (spam filter) |
| **Recall** | Test coverage | Cost of false negatives is high (disease detection) |
| **F1-Score** | Balanced SLA | Imbalanced classes, need both P and R |
| **ROC-AUC** | Load test | Comparing models, tuning thresholds |
| **Cross-Validation** | CI pipeline | Reliable performance estimate |

---
## Challenge

1. Look at the confusion matrix for the **Decision Tree**. Which pair of digits does it confuse the most?
2. Change the imbalanced dataset from a 95/5 split to a 99/1 split. How does this affect the "dumb" model's accuracy vs its F1?
3. Try `cross_val_score` with `scoring='f1_weighted'` instead of `'accuracy'`. Do the rankings change?

**Next Chapter**: We've been using raw data and default settings. Next, we'll explore **Data Preprocessing & Feature Engineering** — the data pipeline of ML.