# OPTIONAL: Probability Calibration and Reliability

**Module**: ML700 Advanced Topics (Optional)  
**Notebook**: 01 - Probability Calibration and Reliability  
**Status**: OPTIONAL - This notebook covers advanced material beyond the core curriculum.

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain what calibrated probabilities are and why they matter for decision-making
2. Read and interpret reliability diagrams (calibration curves)
3. Compute the Brier score to assess the quality of predicted probabilities
4. Apply Platt scaling and isotonic regression to calibrate classifier outputs
5. Use `CalibratedClassifierCV` from scikit-learn to calibrate models

## Prerequisites

- Understanding of classification (Logistic Regression, Random Forests)
- Familiarity with `predict_proba` in scikit-learn
- Basic knowledge of train/test splitting and cross-validation
- Modules ML300 (Logistic Regression) and ML500 (Trees/Ensembles)

## Table of Contents

1. [What Are Calibrated Probabilities?](#1.-What-Are-Calibrated-Probabilities?)
2. [Why Calibration Matters](#2.-Why-Calibration-Matters)
3. [Reliability Diagrams (Calibration Curves)](#3.-Reliability-Diagrams)
4. [Brier Score](#4.-Brier-Score)
5. [Calibration Methods](#5.-Calibration-Methods)
6. [Hands-On: Comparing Calibration on Breast Cancer Data](#6.-Hands-On:-Comparing-Calibration-on-Breast-Cancer-Data)
7. [When Calibration Matters Most](#7.-When-Calibration-Matters-Most)
8. [Common Mistakes](#8.-Common-Mistakes)
9. [Summary](#9.-Summary)

---

## 1. What Are Calibrated Probabilities?

A classifier is **well-calibrated** if, among all samples it assigns a predicted probability of $p$,
the true fraction of positives is approximately $p$.

For example, if a model says "there is a 70% chance of cancer" for 100 patients,
roughly 70 of those patients should actually have cancer.
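
The definition can be checked numerically. The sketch below is a simulation, not a real model: it assumes a hypothetical classifier whose predictions are perfectly calibrated by construction (each label is drawn with exactly the predicted probability) and then measures the positive rate among samples predicted close to 0.70. The names `p_pred`, `y_true`, and `near_070` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated predictions from a hypothetical, perfectly calibrated model:
# each label is a Bernoulli draw with the predicted probability.
p_pred = rng.uniform(0.0, 1.0, size=200_000)
y_true = (rng.uniform(0.0, 1.0, size=p_pred.size) < p_pred).astype(int)

# Among samples predicted close to 0.70, about 70% should be positive.
near_070 = np.abs(p_pred - 0.7) < 0.02
observed_rate = y_true[near_070].mean()

print(f"Samples with prediction close to 0.70: {near_070.sum()}")
print(f"Observed positive rate: {observed_rate:.3f}")
```

For a miscalibrated model, the observed rate in such a slice would drift away from the predicted value.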

**Key insight**: Many classifiers output scores that look like probabilities (values between 0 and 1)
but are NOT actually calibrated probabilities. Random Forests, SVMs, and Naive Bayes are common offenders.

## 2. Why Calibration Matters

Calibration matters when you use predicted probabilities for **decision-making**, not just ranking:

- **Medical diagnosis**: "80% chance of malignancy" must actually mean 80%
- **Risk scoring**: Insurance pricing, credit scoring
- **Threshold selection**: A cost-based cutoff (for example, "act when the risk exceeds 20%") is only meaningful if the probabilities are trustworthy
- **Combining models**: Averaging probabilities from different models works best when they are on the same, calibrated scale

If you only care about **ranking** (which sample is more likely positive?), calibration matters less.
But if you care about the **actual probability values**, calibration is essential.
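
The difference between ranking and probability values can be illustrated with a small simulation. This sketch assumes synthetic, calibrated probabilities and applies a strictly increasing distortion (squaring) to them. The ordering of samples is unchanged, so ROC AUC stays the same, while the Brier score gets worse. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)

# Calibrated by construction: labels are drawn with the stated probability.
p_calibrated = rng.uniform(0.0, 1.0, size=20_000)
y_true = (rng.uniform(size=p_calibrated.size) < p_calibrated).astype(int)

# Squaring is strictly increasing on [0, 1]: same ranking, different values.
p_distorted = p_calibrated ** 2

auc_calibrated = roc_auc_score(y_true, p_calibrated)
auc_distorted = roc_auc_score(y_true, p_distorted)
brier_calibrated = brier_score_loss(y_true, p_calibrated)
brier_distorted = brier_score_loss(y_true, p_distorted)

print(f"AUC:   calibrated={auc_calibrated:.4f}  distorted={auc_distorted:.4f}")
print(f"Brier: calibrated={brier_calibrated:.4f}  distorted={brier_distorted:.4f}")
```

A ranking metric alone would not reveal that the distorted values are no longer usable as probabilities.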

## 3. Reliability Diagrams

A **reliability diagram** (calibration curve) plots:
- **X-axis**: Mean predicted probability in each bin
- **Y-axis**: Fraction of positives (true frequency) in each bin

A **perfectly calibrated** classifier lies on the **diagonal line** (y = x).

- Points **above** the diagonal: model is **under-confident** (actual rate > predicted)
- Points **below** the diagonal: model is **over-confident** (actual rate < predicted)
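
The points of a reliability diagram can be computed by hand. The following sketch uses NumPy only and assumes a hypothetical miscalibrated model whose true positive rate is the square of its prediction, so every bin should fall below the diagonal. The bin count and variable names are illustrative choices; `calibration_curve` in scikit-learn performs a similar binning.

```python
import numpy as np

rng = np.random.default_rng(1)
p_pred = rng.uniform(0.0, 1.0, size=50_000)

# Hypothetical miscalibrated model: the true positive rate is p_pred**2,
# which is lower than the predicted probability everywhere in (0, 1).
y_true = (rng.uniform(size=p_pred.size) < p_pred ** 2).astype(int)

n_bins = 5
edges = np.linspace(0.0, 1.0, n_bins + 1)
# Map each prediction to a bin index in 0..n_bins-1 using the interior edges.
bin_ids = np.digitize(p_pred, edges[1:-1])

# X-axis: mean predicted probability per bin. Y-axis: fraction of positives.
mean_predicted = np.array([p_pred[bin_ids == b].mean() for b in range(n_bins)])
fraction_pos = np.array([y_true[bin_ids == b].mean() for b in range(n_bins)])

for x, y in zip(mean_predicted, fraction_pos):
    side = "below" if y < x else "above"
    print(f"predicted={x:.2f}  observed={y:.2f}  -> {side} the diagonal")
```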

## 4. Brier Score

The **Brier score** measures the mean squared error between predicted probabilities and actual outcomes:

$$BS = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)^2$$

Where:
- $\hat{p}_i$ is the predicted probability of the positive class for sample $i$
- $y_i \in \{0, 1\}$ is the true label

**Interpretation**:
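- The Brier score ranges from 0 (perfect) to 1 (worst); lower is better
- It reflects both calibration and discrimination (how well the model separates the classes), so read it alongside a reliability diagram
- Reference baseline: always predicting the class prevalence $\pi$ scores $\pi(1-\pi)$; a useful model should score lower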
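
A small worked example makes the formula concrete. The five labels and predictions below are made up for illustration. The sketch computes the score by hand, checks it against `brier_score_loss`, and compares it with the baseline of always predicting the class prevalence.

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Toy data, invented for illustration only.
y_true = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.8, 0.3])

# Mean squared difference between predicted probability and outcome:
# (0.01 + 0.04 + 0.16 + 0.04 + 0.09) / 5 = 0.068
brier_manual = np.mean((p_hat - y_true) ** 2)
brier_sklearn = brier_score_loss(y_true, p_hat)

# Baseline: predict the prevalence (0.6) for every sample: 0.6 * 0.4 = 0.24
prevalence = y_true.mean()
brier_baseline = np.mean((prevalence - y_true) ** 2)

print(f"Brier (manual):   {brier_manual:.3f}")
print(f"Brier (sklearn):  {brier_sklearn:.3f}")
print(f"Brier (baseline): {brier_baseline:.3f}")
```

Here the toy predictions score well below the prevalence baseline, which is the minimum one would expect from a useful model.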

## 5. Calibration Methods

### Platt Scaling (Sigmoid)
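Fits a logistic regression on the classifier's output scores $f$, learning two parameters $A$ and $B$ (with $A$ typically negative, so that higher scores map to higher probabilities):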
$$P(y=1|f) = \frac{1}{1 + \exp(Af + B)}$$
- Works well when the distortion is sigmoid-shaped
- Needs fewer samples (only 2 parameters)
- Common for SVMs and boosting methods
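
The idea can be sketched with a one-feature logistic regression. This is a simplified illustration, not scikit-learn's internal implementation: it assumes synthetic scores with a sigmoid-shaped relationship to the true probability, uses a large `C` to make the fit nearly unregularized, and omits the target smoothing from Platt's original method. Because scikit-learn parameterizes the model as $1/(1+\exp(-(wf+b)))$, the Platt parameters are $A=-w$ and $B=-b$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic raw scores; the true probability is a sigmoid of the score.
scores = rng.normal(0.0, 2.0, size=5_000)
true_prob = 1.0 / (1.0 + np.exp(-(1.5 * scores - 0.5)))
y = (rng.uniform(size=scores.size) < true_prob).astype(int)

# Fit the two-parameter calibrator on the scores (ideally held-out scores).
platt = LogisticRegression(C=1e6)  # large C: nearly unregularized
platt.fit(scores.reshape(-1, 1), y)

# Convert to the (A, B) parameterization used in the formula above.
A = -platt.coef_[0, 0]
B = -platt.intercept_[0]
calibrated = 1.0 / (1.0 + np.exp(A * scores + B))

print(f"A = {A:.2f}, B = {B:.2f}")
```

In practice the scores would come from a fitted classifier evaluated on data it was not trained on.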

### Isotonic Regression
Fits a non-parametric, non-decreasing function to map scores to probabilities.
- More flexible than Platt scaling
- Needs more data (can overfit with small datasets)
- Works when the distortion is not sigmoid-shaped
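
A minimal sketch of the same idea with `IsotonicRegression` from scikit-learn follows. It assumes synthetic scores whose true positive rate is the square of the score, a distortion that is not sigmoid-shaped. The fitted mapping is a non-decreasing step function that should roughly recover that relationship. Names such as `raw_scores` and `grid` are illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic scores in [0, 1]; the true positive rate is raw_scores**2.
raw_scores = rng.uniform(0.0, 1.0, size=50_000)
y = (rng.uniform(size=raw_scores.size) < raw_scores ** 2).astype(int)

# Fit a non-decreasing map from score to probability, clipped to [0, 1].
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, y)

# Inspect the learned mapping on a few score values.
grid = np.linspace(0.05, 0.95, 10)
mapped = iso.predict(grid)
for s, m in zip(grid, mapped):
    print(f"score={s:.2f} -> calibrated={m:.2f} (true rate {s ** 2:.2f})")
```

With only a few hundred calibration samples the step function becomes much coarser, which is where the overfitting risk comes from.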

## 6. Hands-On: Comparing Calibration on Breast Cancer Data

Let us train Logistic Regression and Random Forest, compare their calibration curves,
and then apply calibration methods.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Load data
X, y = load_breast_cancer(return_X_y=True)

# Split: train (60%), calibration (20%), test (20%)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_cal, y_train, y_cal = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42, stratify=y_train_full
)

print(f"Train: {X_train.shape[0]}, Calibration: {X_cal.shape[0]}, Test: {X_test.shape[0]}")

In [None]:
# Train both models
lr = LogisticRegression(max_iter=5000, random_state=42)
lr.fit(X_train, y_train)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Get predicted probabilities on test set
lr_probs = lr.predict_proba(X_test)[:, 1]
rf_probs = rf.predict_proba(X_test)[:, 1]

print(f"Logistic Regression Brier Score: {brier_score_loss(y_test, lr_probs):.4f}")
print(f"Random Forest Brier Score:       {brier_score_loss(y_test, rf_probs):.4f}")

In [None]:
# Plot calibration curves BEFORE calibration
fig, ax = plt.subplots(1, 1, figsize=(7, 6))

# Perfect calibration reference
ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')

for name, probs in [("Logistic Regression", lr_probs), ("Random Forest", rf_probs)]:
    fraction_pos, mean_predicted = calibration_curve(y_test, probs, n_bins=8)
    ax.plot(mean_predicted, fraction_pos, 's-', label=name)

ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.set_title('Calibration Curves (Before Calibration)')
ax.legend(loc='lower right')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.tight_layout()
plt.show()

print("Notice: Logistic Regression is typically closer to the diagonal.")
print("Random Forests tend to push probabilities away from 0 and 1 (S-shaped distortion).")

### Applying Calibration with CalibratedClassifierCV

We will calibrate the Random Forest using both Platt scaling (sigmoid) and isotonic regression.

In [None]:
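# Calibrate the already-fitted Random Forest using Platt scaling (sigmoid).
# cv='prefit' tells scikit-learn not to refit rf; X_cal is used only to fit the calibrator.
# Note: scikit-learn 1.6+ deprecates cv='prefit' in favor of wrapping the model
# in sklearn.frozen.FrozenEstimator.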
rf_sigmoid = CalibratedClassifierCV(rf, method='sigmoid', cv='prefit')
rf_sigmoid.fit(X_cal, y_cal)
rf_sigmoid_probs = rf_sigmoid.predict_proba(X_test)[:, 1]

# Calibrate Random Forest using Isotonic regression
rf_isotonic = CalibratedClassifierCV(rf, method='isotonic', cv='prefit')
rf_isotonic.fit(X_cal, y_cal)
rf_isotonic_probs = rf_isotonic.predict_proba(X_test)[:, 1]

# Brier scores
print("Brier Scores (lower is better):")
print(f"  RF (uncalibrated):         {brier_score_loss(y_test, rf_probs):.4f}")
print(f"  RF + Platt (sigmoid):      {brier_score_loss(y_test, rf_sigmoid_probs):.4f}")
print(f"  RF + Isotonic:             {brier_score_loss(y_test, rf_isotonic_probs):.4f}")
print(f"  Logistic Regression:       {brier_score_loss(y_test, lr_probs):.4f}")

In [None]:
# Plot calibration curves AFTER calibration
fig, ax = plt.subplots(1, 1, figsize=(7, 6))

ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')

models_and_probs = [
    ("Logistic Regression", lr_probs),
    ("RF (uncalibrated)", rf_probs),
    ("RF + Platt (sigmoid)", rf_sigmoid_probs),
    ("RF + Isotonic", rf_isotonic_probs),
]

for name, probs in models_and_probs:
    fraction_pos, mean_predicted = calibration_curve(y_test, probs, n_bins=8)
    ax.plot(mean_predicted, fraction_pos, 's-', label=name)

ax.set_xlabel('Mean Predicted Probability')
ax.set_ylabel('Fraction of Positives')
ax.set_title('Calibration Curves (After Calibration)')
ax.legend(loc='lower right')
ax.set_xlim([0, 1])
ax.set_ylim([0, 1])
plt.tight_layout()
plt.show()

## 7. When Calibration Matters Most

| Use Case | Calibration Needed? | Why |
|----------|-------------------|-----|
| Medical diagnosis | **Yes** | Decisions based on probability thresholds |
| Risk scoring (insurance, credit) | **Yes** | Probabilities directly used in pricing |
| Ranking items (search, recommendation) | Less critical | Only relative ordering matters |
| Binary classification with fixed threshold | Less critical | Only which side of the threshold a score falls on matters |
| Combining multiple models | **Yes** | Probabilities from different models must be comparable |

## 8. Common Mistakes

1. **Ignoring calibration entirely**: Treating `predict_proba` output as true probabilities without checking
2. **Assuming good ranking implies good calibration**: High accuracy or AUC says nothing about whether the probability values are trustworthy; if you use the values themselves (not just the ordering), check calibration
3. **Calibrating on the training set**: Always calibrate on a held-out set or use cross-validation
4. **Using isotonic regression with small datasets**: Isotonic regression can overfit; prefer Platt scaling with limited data
5. **Forgetting to re-calibrate after retraining**: If you retrain the base model, the calibration must be redone
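
For mistake 3, cross-validation is an alternative to a separate calibration split. The sketch below assumes the same breast cancer data and Random Forest settings as the hands-on section; `base_rf`, `rf_cv`, and `cv_probs` are illustrative names. With an integer `cv`, `CalibratedClassifierCV` fits a clone of the base model on each training fold and fits the calibrator on the corresponding held-out fold, so no sample is used for both steps within a fold.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The base model is passed unfitted; fitting happens inside each CV fold.
base_rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_cv = CalibratedClassifierCV(base_rf, method="sigmoid", cv=5)
rf_cv.fit(X_train, y_train)

# By default, predictions average the five (model, calibrator) pairs.
cv_probs = rf_cv.predict_proba(X_test)[:, 1]
cv_brier = brier_score_loss(y_test, cv_probs)
print(f"RF + Platt (5-fold CV) Brier score: {cv_brier:.4f}")
```

This uses all training data for both fitting and calibration, which can help when the dataset is too small to set aside a dedicated calibration split.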

## 9. Summary

- **Calibrated probabilities** mean predicted confidence matches actual frequency of outcomes
- **Reliability diagrams** visualize calibration; perfect calibration = diagonal line
- **Brier score** ($BS = \frac{1}{n}\sum_{i=1}^{n}(\hat{p}_i - y_i)^2$) measures the quality of predicted probabilities, including calibration (lower = better)
- **Platt scaling** (sigmoid) fits a logistic curve to calibrate scores (good for small data)
- **Isotonic regression** is more flexible but needs more data
- Logistic Regression is typically better calibrated out of the box than Random Forests, but always verify on held-out data
- Use `CalibratedClassifierCV` from scikit-learn for easy calibration
- Calibration is essential when probability values drive decisions (medicine, finance)