# Classification Metrics: Confusion Matrix, ROC, and PR Curves

## Learning Objectives

By the end of this notebook, you will be able to:

1. Construct and interpret a confusion matrix (TP, TN, FP, FN)
2. Compute accuracy, precision, recall, specificity, and F1 score
3. Plot and interpret ROC curves and compute AUC
4. Plot and interpret Precision-Recall curves and PR-AUC
5. Explain when accuracy is misleading and which metric to use instead

## Prerequisites

- Logistic regression basics (Notebook 01)
- Binary classification concepts
- Python, NumPy, Matplotlib fundamentals

## Table of Contents

1. [Confusion Matrix](#1)
2. [Core Metrics](#2)
3. [Computing Metrics with sklearn](#3)
4. [ROC Curve and AUC](#4)
5. [Precision-Recall Curve and PR-AUC](#5)
6. [When Accuracy is Misleading](#6)
7. [Common Mistakes](#7)
8. [Exercise](#8)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score,
    f1_score, roc_curve, roc_auc_score, precision_recall_curve,
    average_precision_score, classification_report
)

np.random.seed(42)
sns.set_style("whitegrid")
%matplotlib inline

### Prepare Data: Breast Cancer Dataset

In [None]:
# Load and prepare breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Train logistic regression
model = LogisticRegression(max_iter=500, random_state=42)
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)
y_proba = model.predict_proba(X_test_s)[:, 1]

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {data.target_names} (0=malignant, 1=benign)")
print(f"Class distribution (test): {np.bincount(y_test)}")

<a id='1'></a>
## 1. Confusion Matrix

A confusion matrix summarizes prediction results for a binary classifier:

```
                    Predicted
                 Negative  Positive
Actual Negative [   TN   |   FP   ]
Actual Positive [   FN   |   TP   ]
```

- **TP (True Positive)**: Correctly predicted positive
- **TN (True Negative)**: Correctly predicted negative
- **FP (False Positive)**: Incorrectly predicted positive (Type I error)
- **FN (False Negative)**: Incorrectly predicted negative (Type II error)
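
As a quick sanity check on these four definitions, the counts can be reproduced by hand with boolean masks. The sketch below uses a small set of made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not taken from the breast cancer data) and compares the result with the layout returned by `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical, for illustration only): 1 = positive, 0 = negative
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Count each cell of the matrix with boolean masks
tp_toy = int(np.sum((y_true_toy == 1) & (y_pred_toy == 1)))
tn_toy = int(np.sum((y_true_toy == 0) & (y_pred_toy == 0)))
fp_toy = int(np.sum((y_true_toy == 0) & (y_pred_toy == 1)))
fn_toy = int(np.sum((y_true_toy == 1) & (y_pred_toy == 0)))

print(f"TP={tp_toy}  TN={tn_toy}  FP={fp_toy}  FN={fn_toy}")

# For labels [0, 1], sklearn lays the matrix out as [[TN, FP], [FN, TP]]
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)
```

The four counts always sum to the number of samples, which is a cheap check when computing them manually.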

In [None]:
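# Compute and display confusion matrix
# Rows and columns follow label order: 0 (malignant) first, 1 (benign) second,
# so the "positive" class in this notebook is benign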
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"TN={tn}  FP={fp}")
print(f"FN={fn}  TP={tp}")

# Plot confusion matrix as a heatmap
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Pred Malignant", "Pred Benign"],
            yticklabels=["Actual Malignant", "Actual Benign"])
plt.ylabel("Actual", fontsize=12)
plt.xlabel("Predicted", fontsize=12)
plt.title("Confusion Matrix", fontsize=14)
plt.show()

<a id='2'></a>
## 2. Core Metrics

### Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The fraction of all predictions that are correct. It can be misleading when classes are imbalanced, because a model can score well by always predicting the majority class.

### Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

Of all predicted positives, how many are actually positive? High precision = few false alarms.

### Recall (Sensitivity / True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

Of all actual positives, how many did we catch? High recall = few missed positives.

### Specificity (True Negative Rate)

$$\text{Specificity} = \frac{TN}{TN + FP}$$

Of all actual negatives, how many did we correctly identify?
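
To my knowledge, scikit-learn has no dedicated specificity function. One common workaround, sketched here on hypothetical toy labels, is to treat the negative class as the class of interest and call `recall_score` with `pos_label=0`; this should agree with the confusion-matrix formula.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (hypothetical): five negatives, three positives
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_toy = np.array([0, 1, 0, 0, 1, 0, 1, 0])

# Specificity from the confusion matrix: TN / (TN + FP)
tn_t, fp_t, fn_t, tp_t = confusion_matrix(y_true_toy, y_pred_toy).ravel()
spec_from_cm = tn_t / (tn_t + fp_t)

# Specificity as "recall of the negative class"
spec_from_recall = recall_score(y_true_toy, y_pred_toy, pos_label=0)

print(f"Specificity (confusion matrix): {spec_from_cm:.2f}")
print(f"Specificity (recall, pos_label=0): {spec_from_recall:.2f}")
```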

### F1 Score

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$

Harmonic mean of precision and recall. Useful when you need a single metric that balances both.
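
The harmonic mean matters most when precision and recall disagree. The sketch below compares it with the arithmetic mean for a few hand-picked precision/recall pairs; `f1_from` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def f1_from(p, r):
    """Harmonic mean of precision p and recall r (defined as 0 if both are 0)."""
    return 0.0 if (p + r) == 0 else 2 * p * r / (p + r)

# Hand-picked (precision, recall) pairs: balanced, moderately and strongly unbalanced
for p, r in [(0.9, 0.9), (0.9, 0.5), (0.9, 0.1)]:
    arithmetic = (p + r) / 2
    print(f"P={p:.1f}  R={r:.1f}  arithmetic mean={arithmetic:.2f}  F1={f1_from(p, r):.2f}")
```

With precision 0.9 and recall 0.1 the arithmetic mean is 0.50 while F1 is 0.18, so F1 is pulled toward the weaker of the two values.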

In [None]:
# Compute all metrics manually from confusion matrix values
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
specificity_manual = tn / (tn + fp)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print("Manual calculations from confusion matrix:")
print(f"  Accuracy:    {accuracy_manual:.4f}")
print(f"  Precision:   {precision_manual:.4f}")
print(f"  Recall:      {recall_manual:.4f}")
print(f"  Specificity: {specificity_manual:.4f}")
print(f"  F1 Score:    {f1_manual:.4f}")

<a id='3'></a>
## 3. Computing Metrics with sklearn

In [None]:
# Verify with sklearn
# precision_score, recall_score, and f1_score default to pos_label=1 (benign here)
print("sklearn metrics (should match manual):")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred):.4f}")
print(f"  Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"  F1 Score:  {f1_score(y_test, y_pred):.4f}")

print("\nFull Classification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

<a id='4'></a>
## 4. ROC Curve and AUC

The **Receiver Operating Characteristic (ROC)** curve plots:
- **x-axis**: False Positive Rate (FPR) $= \frac{FP}{FP + TN} = 1 - \text{Specificity}$
- **y-axis**: True Positive Rate (TPR) $= \text{Recall} = \frac{TP}{TP + FN}$

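The curve is generated by sweeping the classification threshold across the range of predicted scores. A threshold of 1 (almost nothing predicted positive) gives the point near (0, 0); a threshold of 0 (everything predicted positive) gives (1, 1).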
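
To make the threshold sweep concrete, here is a minimal sketch that computes one (FPR, TPR) point per threshold on hypothetical toy scores. The helper `tpr_fpr_at` is an illustration only and assumes a sample is predicted positive when its score is greater than or equal to the threshold.

```python
import numpy as np

# Toy labels and scores (hypothetical, for illustration only)
y_true_toy = np.array([0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.1, 0.3, 0.35, 0.6, 0.7, 0.9])

def tpr_fpr_at(threshold, y_true, scores):
    """Return (TPR, FPR) when predicting positive for scores >= threshold."""
    pred_pos = scores >= threshold
    tp_count = np.sum(pred_pos & (y_true == 1))
    fp_count = np.sum(pred_pos & (y_true == 0))
    tpr_val = tp_count / np.sum(y_true == 1)
    fpr_val = fp_count / np.sum(y_true == 0)
    return float(tpr_val), float(fpr_val)

# Each threshold contributes one point on the ROC curve
for t in [0.0, 0.5, 1.0]:
    tpr_t, fpr_t = tpr_fpr_at(t, y_true_toy, scores_toy)
    print(f"threshold={t:.1f}  FPR={fpr_t:.2f}  TPR={tpr_t:.2f}")
```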

**AUC (Area Under the ROC Curve)**:
- AUC = 1.0: perfect classifier
- AUC = 0.5: random guessing (diagonal line)
- AUC < 0.5: worse than random (the scores rank negatives above positives more often than not)
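
AUC can also be read as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below checks this on synthetic scores (the data and variable names are assumptions made for the illustration) by comparing every positive/negative pair directly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic data (hypothetical): positives score higher on average
rng = np.random.default_rng(0)
y_toy = np.array([0] * 100 + [1] * 100)
scores_toy = rng.normal(loc=y_toy.astype(float), scale=1.0)

pos_scores = scores_toy[y_toy == 1]
neg_scores = scores_toy[y_toy == 0]

# Compare every positive with every negative; ties count as half a win
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

print(f"Pairwise estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:     {roc_auc_score(y_toy, scores_toy):.4f}")
```

The pairwise comparison is quadratic in the number of samples, so it is only practical as a check on small data.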

In [None]:
# Compute ROC curve
fpr, tpr, roc_thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b-", linewidth=2, label=f"Logistic Regression (AUC = {auc_score:.4f})")
plt.plot([0, 1], [0, 1], "k--", linewidth=1, label="Random (AUC = 0.5)")
plt.fill_between(fpr, tpr, alpha=0.1, color="blue")
plt.xlabel("False Positive Rate (FPR)", fontsize=12)
plt.ylabel("True Positive Rate (TPR / Recall)", fontsize=12)
plt.title("ROC Curve", fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

print(f"ROC AUC = {auc_score:.4f}")
print("Interpretation: the probability that a randomly chosen positive")
print("instance is ranked higher than a randomly chosen negative instance.")

<a id='5'></a>
## 5. Precision-Recall Curve and PR-AUC

The **Precision-Recall (PR)** curve plots:
- **x-axis**: Recall
- **y-axis**: Precision

**Why PR curves?** On imbalanced datasets the ROC curve can look overly optimistic: true negatives are abundant, so the FPR denominator ($FP + TN$) is large and FPR stays small even when false positives outnumber true positives. Precision and recall do not use TN at all, so the PR curve is more informative when positives are rare.
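
The gap between the two views can be illustrated on synthetic data. The sketch below assumes a 1% positive rate and Gaussian scores with a moderate separation between classes; these numbers are arbitrary choices for the illustration, not properties of any real dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic imbalanced data (hypothetical): 1% positives
rng = np.random.default_rng(0)
n_neg, n_pos = 19800, 200
y_rare = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Negatives ~ N(0, 1), positives ~ N(1.5, 1): a moderately good scorer
scores_rare = np.concatenate([
    rng.normal(0.0, 1.0, n_neg),
    rng.normal(1.5, 1.0, n_pos),
])

roc_rare = roc_auc_score(y_rare, scores_rare)
ap_rare = average_precision_score(y_rare, scores_rare)

print(f"Prevalence:        {y_rare.mean():.3f}")
print(f"ROC-AUC:           {roc_rare:.3f}")
print(f"Average precision: {ap_rare:.3f}")
```

Under these assumptions the ROC-AUC comes out high while average precision stays far lower, which is the pattern the paragraph above describes.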

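**PR-AUC (Average Precision)**: a single-number summary of the PR curve. scikit-learn's `average_precision_score` computes it as a step-wise weighted mean of precisions, $\text{AP} = \sum_n (R_n - R_{n-1}) P_n$, rather than by trapezoidal interpolation. Higher is better, and a random classifier scores roughly the positive-class prevalence.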
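
Average precision can be reproduced from the points returned by `precision_recall_curve` as a step-wise sum over recall increments. The sketch below does this on synthetic scores (the data are hypothetical) and also prints the trapezoidal area from `sklearn.metrics.auc`, which interpolates linearly between points and so generally gives a slightly different number.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Synthetic data (hypothetical): 20% positives with noisy scores
rng = np.random.default_rng(1)
y_toy = np.array([0] * 80 + [1] * 20)
scores_toy = rng.normal(loc=y_toy.astype(float), scale=1.0)

prec_toy, rec_toy, _ = precision_recall_curve(y_toy, scores_toy)

# Step-wise sum: AP = sum_n (R_n - R_{n-1}) * P_n
# Recall is returned in decreasing order, hence the leading minus sign
ap_manual = -np.sum(np.diff(rec_toy) * prec_toy[:-1])
ap_sklearn = average_precision_score(y_toy, scores_toy)

# Trapezoidal area under the same points (linear interpolation)
trapz_area = auc(rec_toy, prec_toy)

print(f"Step-wise sum:           {ap_manual:.4f}")
print(f"average_precision_score: {ap_sklearn:.4f}")
print(f"Trapezoidal area:        {trapz_area:.4f}")
```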

In [None]:
# Compute PR curve
precision_vals, recall_vals, pr_thresholds = precision_recall_curve(y_test, y_proba)
pr_auc = average_precision_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(recall_vals, precision_vals, "r-", linewidth=2,
         label=f"Logistic Regression (PR-AUC = {pr_auc:.4f})")
# Baseline: fraction of positives
baseline = y_test.mean()
plt.axhline(y=baseline, color="k", linestyle="--", linewidth=1,
            label=f"Baseline (prevalence = {baseline:.3f})")
plt.fill_between(recall_vals, precision_vals, alpha=0.1, color="red")
plt.xlabel("Recall", fontsize=12)
plt.ylabel("Precision", fontsize=12)
plt.title("Precision-Recall Curve", fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim([0, 1])
plt.ylim([0, 1.05])
plt.show()

print(f"Average Precision (PR-AUC) = {pr_auc:.4f}")

In [None]:
# Side-by-side comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# ROC
axes[0].plot(fpr, tpr, "b-", linewidth=2, label=f"AUC = {auc_score:.4f}")
axes[0].plot([0, 1], [0, 1], "k--", linewidth=1)
axes[0].set_xlabel("FPR", fontsize=12)
axes[0].set_ylabel("TPR", fontsize=12)
axes[0].set_title("ROC Curve", fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# PR
axes[1].plot(recall_vals, precision_vals, "r-", linewidth=2,
             label=f"PR-AUC = {pr_auc:.4f}")
axes[1].axhline(y=baseline, color="k", linestyle="--", linewidth=1)
axes[1].set_xlabel("Recall", fontsize=12)
axes[1].set_ylabel("Precision", fontsize=12)
axes[1].set_title("Precision-Recall Curve", fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

<a id='6'></a>
## 6. When Accuracy is Misleading

Consider a fraud detection dataset with 99% legitimate transactions and 1% fraud. A model that always predicts "legitimate" achieves **99% accuracy** but catches **zero fraud** (recall = 0).

In [None]:
# Demonstrate the accuracy trap with imbalanced data
np.random.seed(42)
n_samples = 1000
y_imb = np.array([0] * 990 + [1] * 10)  # 99% class 0, 1% class 1

# "Model" that always predicts majority class
y_pred_naive = np.zeros(n_samples, dtype=int)

print("Naive model (always predicts class 0):")
print(f"  Accuracy:  {accuracy_score(y_imb, y_pred_naive):.4f}  (looks great!)")
print(f"  Precision: {precision_score(y_imb, y_pred_naive, zero_division=0):.4f}  (undefined, reported as 0)")
print(f"  Recall:    {recall_score(y_imb, y_pred_naive):.4f}  (catches nothing!)")
print(f"  F1 Score:  {f1_score(y_imb, y_pred_naive, zero_division=0):.4f}  (reveals the problem)")
print()
print("Takeaway: for imbalanced data, ALWAYS look beyond accuracy.")
print("Use F1, PR-AUC, or the metric most relevant to your business problem.")

### Which Metric to Use?

| Scenario | Recommended Metric | Reason |
|----------|-------------------|--------|
| Balanced classes | Accuracy or F1 | Both are reliable |
| Imbalanced classes | F1, PR-AUC | Accuracy is misleading |
| Cost of FP is high (spam filter) | Precision | Minimize false alarms |
| Cost of FN is high (cancer detection) | Recall | Minimize missed cases |
| Ranking quality across all thresholds | ROC-AUC | Threshold-independent; measures ranking, not calibration |

<a id='7'></a>
## 7. Common Mistakes

1. **Relying solely on accuracy**: Always compute precision, recall, and F1 -- especially for imbalanced data.

2. **Confusing precision and recall**: Precision answers "of my positive predictions, how many are right?" Recall answers "of all actual positives, how many did I find?"

3. **Using ROC-AUC for highly imbalanced data**: ROC-AUC can look good even when the model's precision on the rare positive class is poor. Report PR-AUC as well, or instead.

4. **Ignoring the threshold**: Accuracy, precision, recall, specificity, and F1 all depend on the chosen classification threshold; only the curve summaries (ROC-AUC and PR-AUC) do not. The default of 0.5 is not always optimal.

5. **Comparing AUC across different datasets**: AUC values are only comparable on the same test set.
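
The effect of the threshold (mistake 4) is easy to see on a small example. The sketch below uses hypothetical toy probabilities and assumes a sample is labeled positive when its probability is at least the threshold; `results` is just a dictionary used to collect the numbers.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and predicted probabilities (hypothetical, for illustration only)
y_toy = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
proba_toy = np.array([0.05, 0.2, 0.35, 0.45, 0.4, 0.6, 0.55, 0.65, 0.8, 0.95])

results = {}
for thr in [0.3, 0.5, 0.7]:
    # Turn probabilities into hard labels at this threshold
    pred_thr = (proba_toy >= thr).astype(int)
    p_thr = precision_score(y_toy, pred_thr, zero_division=0)
    r_thr = recall_score(y_toy, pred_thr)
    results[thr] = (p_thr, r_thr)
    print(f"threshold={thr:.1f}  precision={p_thr:.3f}  recall={r_thr:.3f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves precision from 0.625 to 1.0 while recall falls from 1.0 to 0.4, so the choice of threshold should follow the relative cost of false positives and false negatives.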

<a id='8'></a>
## 8. Exercise: Compute Metrics on a Different Dataset

**Task**: Use `sklearn.datasets.make_classification` to generate a binary dataset and compute all metrics.

1. Generate 500 samples, 10 features, `weights=[0.7, 0.3]`, `random_state=42`.
2. Split 70/30 with stratification.
3. Scale with `StandardScaler`, fit `LogisticRegression`.
4. Print the classification report.
5. Plot both the ROC curve and PR curve side by side.

In [None]:
# Reference solution (try the exercise yourself before reading on)
# -----------------------------------------------------------------
from sklearn.datasets import make_classification

# Step 1: Generate data
X_ex, y_ex = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_redundant=2,
    weights=[0.7, 0.3], random_state=42
)
print(f"Class distribution: {np.bincount(y_ex)}")

# Step 2: Split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_ex, y_ex, test_size=0.3, random_state=42, stratify=y_ex
)

# Step 3: Scale and fit
sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)
X_te_s = sc.transform(X_te)

clf = LogisticRegression(max_iter=300, random_state=42)
clf.fit(X_tr_s, y_tr)

y_p = clf.predict(X_te_s)
y_prob = clf.predict_proba(X_te_s)[:, 1]

# Step 4: Classification report
print("\nClassification Report:")
print(classification_report(y_te, y_p))

# Step 5: Plot ROC and PR curves
fpr_ex, tpr_ex, _ = roc_curve(y_te, y_prob)
auc_ex = roc_auc_score(y_te, y_prob)
prec_ex, rec_ex, _ = precision_recall_curve(y_te, y_prob)
pr_auc_ex = average_precision_score(y_te, y_prob)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(fpr_ex, tpr_ex, "b-", linewidth=2, label=f"AUC = {auc_ex:.4f}")
axes[0].plot([0, 1], [0, 1], "k--")
axes[0].set_xlabel("FPR")
axes[0].set_ylabel("TPR")
axes[0].set_title("ROC Curve")
axes[0].legend()

axes[1].plot(rec_ex, prec_ex, "r-", linewidth=2, label=f"PR-AUC = {pr_auc_ex:.4f}")
axes[1].set_xlabel("Recall")
axes[1].set_ylabel("Precision")
axes[1].set_title("Precision-Recall Curve")
axes[1].legend()

plt.tight_layout()
plt.show()