# Threshold Tuning and Probability Calibration

## Learning Objectives

By the end of this notebook, you will be able to:

1. Explain why the default threshold of 0.5 is not always optimal
2. Frame threshold selection as a business cost problem
3. Plot precision, recall, and F1 as functions of the threshold
4. Find the optimal threshold for different objectives (max F1, target recall)
5. Understand what calibrated probabilities mean
6. Create and interpret reliability diagrams (calibration curves)
7. Apply Platt scaling and isotonic calibration with `CalibratedClassifierCV`

## Prerequisites

- Logistic regression (Notebook 01)
- Classification metrics: precision, recall, F1, ROC, PR (Notebook 02)
- Python, NumPy, Matplotlib fundamentals

## Table of Contents

1. [Why Tune the Threshold?](#1)
2. [Business Cost Framing](#2)
3. [Metrics vs Threshold Plots](#3)
4. [Finding the Optimal Threshold](#4)
5. [Probability Calibration](#5)
6. [Reliability Diagrams](#6)
7. [Platt Scaling and Isotonic Calibration](#7)
8. [Common Mistakes](#8)
9. [Exercise](#9)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    precision_recall_curve, roc_curve, classification_report
)
from sklearn.calibration import calibration_curve, CalibratedClassifierCV

np.random.seed(42)
sns.set_style("whitegrid")
%matplotlib inline

In [None]:
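# Prepare data
# Note: in this dataset target 0 = malignant and 1 = benign, so the
# "positive" class (label 1) in the metrics below is the benign class.
data = load_breast_cancer()
X, y = data.data, data.target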

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(max_iter=500, random_state=42)
model.fit(X_train_s, y_train)

y_proba = model.predict_proba(X_test_s)[:, 1]
print(f"Samples: train={len(y_train)}, test={len(y_test)}")
print(f"Test class distribution: {np.bincount(y_test)}")

<a id='1'></a>
## 1. Why Tune the Threshold?

By default, most binary classifiers effectively apply a threshold of 0.5 when you call `predict`:
- If $P(y=1|x) \geq 0.5$, predict class 1
- If $P(y=1|x) < 0.5$, predict class 0

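But 0.5 is only a natural choice when both error types cost the same and the predicted probabilities are reliable. The optimal threshold depends on:
- **Class imbalance**: if positives are rare, a lower threshold catches more of them, at the cost of more false positives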
- **Business costs**: is a false positive or false negative more expensive?
- **Target metric**: do you want to maximize F1, precision, or recall?

In [None]:
# Show how different thresholds produce different predictions
thresholds_demo = [0.3, 0.5, 0.7, 0.9]

print(f"{'Threshold':<12} {'Precision':<12} {'Recall':<12} {'F1':<12} {'Predicted+':>12}")
print("-" * 60)
for t in thresholds_demo:
    y_pred_t = (y_proba >= t).astype(int)
    p = precision_score(y_test, y_pred_t, zero_division=0)
    r = recall_score(y_test, y_pred_t, zero_division=0)
    f = f1_score(y_test, y_pred_t, zero_division=0)
    n_pos = y_pred_t.sum()
    print(f"{t:<12.1f} {p:<12.4f} {r:<12.4f} {f:<12.4f} {n_pos:>12}")

print("\nLower threshold => more positives predicted => recall rises, precision usually falls")
print("Higher threshold => fewer positives predicted => recall falls, precision usually rises")

<a id='2'></a>
## 2. Business Cost Framing

Choosing a threshold is fundamentally a **business decision**:

| Scenario | Cost of FP | Cost of FN | Strategy |
|----------|-----------|-----------|----------|
| **Spam filter** | High (lose important email) | Low (a spam message reaches the inbox) | Raise threshold (favor precision) |
| **Fraud detection** | Low (flag legitimate transaction) | Very high (miss fraud) | Lower threshold (favor recall) |
| **Cancer screening** | Moderate (unnecessary biopsy) | Very high (miss cancer) | Lower threshold (favor recall) |
| **Product recommendation** | Very low (irrelevant suggestion) | Low (missed sale) | Use default or optimize for engagement |

The key question: **What costs more -- a false positive or a false negative?**
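
One way to make this question concrete is to assign a cost to each error type and sweep the threshold for the lowest total cost. The sketch below is illustrative only: the costs (`COST_FP = 1`, `COST_FN = 10`), the synthetic data, and the helper `total_cost` are assumptions, not values from this notebook. For well-calibrated probabilities, predicting positive is cheaper in expectation whenever $p \cdot C_{FN} > (1 - p) \cdot C_{FP}$, which gives the theoretical threshold $t^* = C_{FP} / (C_{FP} + C_{FN})$; it is printed for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical costs: a missed positive costs 10x as much as a false alarm.
COST_FP = 1.0
COST_FN = 10.0


def total_cost(y_true, y_score, threshold, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total misclassification cost when predicting 1 for scores >= threshold."""
    y_hat = (y_score >= threshold).astype(int)
    n_fp = np.sum((y_hat == 1) & (y_true == 0))
    n_fn = np.sum((y_hat == 0) & (y_true == 1))
    return cost_fp * n_fp + cost_fn * n_fn


# Synthetic data, used only to keep the sketch self-contained.
X_c, y_c = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_c_tr, X_c_val, y_c_tr, y_c_val = train_test_split(
    X_c, y_c, test_size=0.5, stratify=y_c, random_state=0
)
cost_model = LogisticRegression(max_iter=1000).fit(X_c_tr, y_c_tr)
p_val = cost_model.predict_proba(X_c_val)[:, 1]

# Sweep thresholds on the validation split and keep the cheapest one.
grid = np.linspace(0.0, 1.0, 101)
costs = np.array([total_cost(y_c_val, p_val, t) for t in grid])
best_t = grid[np.argmin(costs)]
theory_t = COST_FP / (COST_FP + COST_FN)

print(f"Empirical cost-minimizing threshold: {best_t:.2f}")
print(f"Theoretical threshold C_FP / (C_FP + C_FN): {theory_t:.3f}")
print(f"Cost at 0.5: {total_cost(y_c_val, p_val, 0.5):.0f}, cost at best: {costs.min():.0f}")
```

The empirical and theoretical thresholds need not agree exactly: the formula assumes calibrated probabilities, and the sweep is subject to sampling noise on a finite validation set.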

<a id='3'></a>
## 3. Metrics vs Threshold Plots

In [None]:
# Plot precision, recall, and F1 as functions of threshold
thresholds = np.linspace(0.0, 1.0, 101)  # 0.00, 0.01, ..., 1.00 without float-step drift
precisions = []
recalls = []
f1_scores = []

for t in thresholds:
    y_pred_t = (y_proba >= t).astype(int)
    precisions.append(precision_score(y_test, y_pred_t, zero_division=0))
    recalls.append(recall_score(y_test, y_pred_t, zero_division=0))
    f1_scores.append(f1_score(y_test, y_pred_t, zero_division=0))

plt.figure(figsize=(10, 6))
plt.plot(thresholds, precisions, "b-", linewidth=2, label="Precision")
plt.plot(thresholds, recalls, "r-", linewidth=2, label="Recall")
plt.plot(thresholds, f1_scores, "g-", linewidth=2, label="F1 Score")
plt.axvline(x=0.5, color="gray", linestyle="--", alpha=0.7, label="Default threshold (0.5)")
plt.xlabel("Threshold", fontsize=12)
plt.ylabel("Score", fontsize=12)
plt.title("Precision, Recall, and F1 vs Classification Threshold", fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim([-0.05, 1.05])
plt.show()

print("As the threshold increases, recall can only fall or stay flat; precision usually rises, but not strictly monotonically.")
print("F1 peaks at the point that best balances precision and recall.")

<a id='4'></a>
## 4. Finding the Optimal Threshold
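
The grid sweep used in this notebook checks 101 evenly spaced thresholds, which can miss the exact optimum. As an alternative sketch, `precision_recall_curve` evaluates every distinct predicted score as a candidate threshold. The synthetic data and the variable names below are assumptions made to keep the block self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (illustration only).
X_p, y_p = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_p_tr, X_p_val, y_p_tr, y_p_val = train_test_split(
    X_p, y_p, test_size=0.5, stratify=y_p, random_state=0
)
pr_model = LogisticRegression(max_iter=1000).fit(X_p_tr, y_p_tr)
scores_val = pr_model.predict_proba(X_p_val)[:, 1]

# precision/recall have one more entry than thresholds; the final
# (precision=1, recall=0) point has no threshold, so drop it.
prec, rec, thr = precision_recall_curve(y_p_val, scores_val)
prec, rec = prec[:-1], rec[:-1]

# F1 at every candidate threshold, guarding against 0/0.
denom = prec + rec
f1_all = np.divide(2 * prec * rec, denom, out=np.zeros_like(denom), where=denom > 0)

i_best = int(np.argmax(f1_all))
thr_best = thr[i_best]
print(f"Best F1 = {f1_all[i_best]:.4f} at threshold = {thr_best:.4f}")

# Cross-check: thresholding with >= reproduces the same F1.
f1_check = f1_score(y_p_val, (scores_val >= thr_best).astype(int))
print(f"F1 recomputed from hard predictions: {f1_check:.4f}")
```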

In [None]:
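# Objective 1: Maximize F1 score
# Caution: the sweep above was computed on the test set to keep the demo short.
# For a real model, choose the threshold on a validation set (see Common Mistakes, item 3).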
best_f1_idx = np.argmax(f1_scores)
best_f1_threshold = thresholds[best_f1_idx]
best_f1 = f1_scores[best_f1_idx]

print("=== Objective: Maximize F1 ===")
print(f"  Optimal threshold: {best_f1_threshold:.2f}")
print(f"  F1 at optimal:     {best_f1:.4f}")
print(f"  Precision:         {precisions[best_f1_idx]:.4f}")
print(f"  Recall:            {recalls[best_f1_idx]:.4f}")

# Objective 2: Target recall >= 0.95 with best precision
target_recall = 0.95
valid_indices = [i for i, r in enumerate(recalls) if r >= target_recall]
if valid_indices:
    # Among thresholds with recall >= target, pick the highest one (typically the best precision)
    best_idx = max(valid_indices, key=lambda i: thresholds[i])
    print(f"\n=== Objective: Recall >= {target_recall} ===")
    print(f"  Optimal threshold: {thresholds[best_idx]:.2f}")
    print(f"  Precision:         {precisions[best_idx]:.4f}")
    print(f"  Recall:            {recalls[best_idx]:.4f}")
    print(f"  F1:                {f1_scores[best_idx]:.4f}")

In [None]:
# Visualize threshold selection
plt.figure(figsize=(10, 6))
plt.plot(thresholds, f1_scores, "g-", linewidth=2, label="F1 Score")
plt.axvline(x=best_f1_threshold, color="green", linestyle="--", alpha=0.7,
            label=f"Best F1 threshold = {best_f1_threshold:.2f}")
plt.axvline(x=0.5, color="gray", linestyle=":", alpha=0.7, label="Default (0.5)")
plt.scatter([best_f1_threshold], [best_f1], color="green", s=100, zorder=5)
plt.xlabel("Threshold", fontsize=12)
plt.ylabel("F1 Score", fontsize=12)
plt.title("Finding Optimal Threshold for Maximum F1", fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

<a id='5'></a>
## 5. Probability Calibration

A classifier is **well-calibrated** if its predicted probabilities match observed frequencies:
- Among all samples where the model predicts $P = 0.8$, approximately 80% should actually be positive.
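
This definition can be checked directly by grouping predictions into bins and comparing each bin's mean predicted probability with its observed positive rate. The sketch below uses simulated scores, which is an assumption for illustration: labels are drawn so that the first score vector is calibrated by construction, and a second vector is pushed toward 0 and 1 to mimic an overconfident model. The helper `binned_calibration` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim = 20000
p_true = rng.uniform(0.01, 0.99, size=n_sim)  # true P(y=1) per sample
y_sim = rng.binomial(1, p_true)               # labels drawn from those probabilities

p_calibrated = p_true                         # calibrated by construction
# Tripling the logit pushes scores toward 0 and 1 (overconfidence).
logit = np.log(p_true / (1 - p_true))
p_overconf = 1 / (1 + np.exp(-3 * logit))


def binned_calibration(y_true, y_prob, n_bins=10):
    """Return rows of (mean predicted, observed positive rate, count) per non-empty bin."""
    edges = np.linspace(0, 1, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])  # bin index in 0 .. n_bins-1
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((y_prob[mask].mean(), y_true[mask].mean(), mask.sum()))
    return np.array(rows)


for name, probs in [("calibrated", p_calibrated), ("overconfident", p_overconf)]:
    table = binned_calibration(y_sim, probs)
    max_gap = np.abs(table[:, 0] - table[:, 1]).max()
    print(f"{name:<14} largest |predicted - observed| over bins: {max_gap:.3f}")
```

The calibrated scores should show only small gaps caused by sampling noise, while the overconfident scores show clearly larger ones.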

**Why does calibration matter?**
- If you use probabilities for decision-making (e.g., "treat if risk > 30%"), they must be accurate.
- If you only care about ranking (AUC), calibration is less critical.

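Logistic regression is often reasonably well-calibrated out of the box because it is trained by minimizing log loss, which rewards accurate probabilities; strong regularization or a misspecified model can still distort them. Other models (e.g., Random Forest, SVM) often need post-hoc calibration.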
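
The difference between ranking quality and calibration can be illustrated with a monotone distortion of the scores: the ordering, and therefore ROC AUC, is unchanged, while the Brier score (mean squared error of the probabilities, lower is better) gets worse. The simulated data below is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(1)
p_good = rng.uniform(0.01, 0.99, size=10000)  # calibrated by construction
y_rank = rng.binomial(1, p_good)

# Cubing is strictly increasing on (0, 1): same ranking, wrong probabilities.
p_distorted = p_good ** 3

auc_good = roc_auc_score(y_rank, p_good)
auc_dist = roc_auc_score(y_rank, p_distorted)
brier_good = brier_score_loss(y_rank, p_good)
brier_dist = brier_score_loss(y_rank, p_distorted)

print(f"AUC   original: {auc_good:.4f}   distorted: {auc_dist:.4f}")
print(f"Brier original: {brier_good:.4f}   distorted: {brier_dist:.4f}")
```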

<a id='6'></a>
## 6. Reliability Diagrams (Calibration Curves)

In [None]:
# Create a reliability diagram
prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10)

plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, "bo-", linewidth=2, markersize=8,
         label="Logistic Regression")
plt.plot([0, 1], [0, 1], "k--", linewidth=1, label="Perfectly calibrated")
plt.xlabel("Mean Predicted Probability", fontsize=12)
plt.ylabel("Fraction of Positives (Observed)", fontsize=12)
plt.title("Reliability Diagram (Calibration Curve)", fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.show()

print("If the curve closely follows the diagonal, the model is well-calibrated.")
print("Points above the diagonal: the model under-predicts (observed frequency > predicted probability).")
print("Points below the diagonal: the model over-predicts (observed frequency < predicted probability).")

In [None]:
# Also show the distribution of predicted probabilities
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Calibration curve
axes[0].plot(prob_pred, prob_true, "bo-", linewidth=2, markersize=8)
axes[0].plot([0, 1], [0, 1], "k--")
axes[0].set_xlabel("Mean Predicted Probability")
axes[0].set_ylabel("Fraction of Positives")
axes[0].set_title("Calibration Curve")
axes[0].grid(True, alpha=0.3)

# Histogram of predicted probabilities
axes[1].hist(y_proba[y_test == 0], bins=20, alpha=0.6, label="Class 0", color="blue")
axes[1].hist(y_proba[y_test == 1], bins=20, alpha=0.6, label="Class 1", color="red")
axes[1].set_xlabel("Predicted Probability")
axes[1].set_ylabel("Count")
axes[1].set_title("Distribution of Predicted Probabilities")
axes[1].legend()

plt.tight_layout()
plt.show()

<a id='7'></a>
## 7. Platt Scaling and Isotonic Calibration

Two common post-hoc calibration methods:

### Platt Scaling (Sigmoid)
- Fits a sigmoid function to map raw scores to calibrated probabilities.
- Works well with small datasets. Assumes the calibration curve is S-shaped.
- Use: `CalibratedClassifierCV(method='sigmoid')`

### Isotonic Regression
- Fits a non-parametric, monotone increasing function.
- More flexible but needs more data to avoid overfitting.
- Use: `CalibratedClassifierCV(method='isotonic')`

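`CalibratedClassifierCV` uses cross-validation internally: for each fold, the base model is fit on the training part and the calibrator is fit on the held-out part, so the calibrator is never trained on predictions for data the model has already seen.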
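
To make the two methods concrete, the sketch below fits each calibrator by hand on a held-out calibration split: a one-feature logistic regression on the model's raw scores approximates Platt scaling, and `IsotonicRegression` gives the isotonic variant. This is a simplified illustration, not a reproduction of what `CalibratedClassifierCV` does internally (which adds cross-validation and uses its own sigmoid fitting routine). The synthetic data and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Three disjoint splits: fit the model, fit the calibrator, evaluate.
X_d, y_d = make_classification(n_samples=3000, n_features=20, random_state=0)
X_fit, X_rest, y_fit, y_rest = train_test_split(
    X_d, y_d, test_size=0.5, stratify=y_d, random_state=0
)
X_cal, X_eval, y_cal, y_eval = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

base = GradientBoostingClassifier(random_state=0).fit(X_fit, y_fit)
s_cal = base.decision_function(X_cal)    # raw scores on the calibration split
s_eval = base.decision_function(X_eval)  # raw scores on the evaluation split

# Platt-style: sigmoid(a * score + b), fit as a 1-D logistic regression.
# A large C makes the default L2 penalty negligible.
platt = LogisticRegression(C=1e6, max_iter=1000).fit(s_cal.reshape(-1, 1), y_cal)
p_platt = platt.predict_proba(s_eval.reshape(-1, 1))[:, 1]

# Isotonic: non-parametric, non-decreasing step function of the score.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(s_cal, y_cal)
p_iso = iso.predict(s_eval)

p_raw = base.predict_proba(X_eval)[:, 1]
for name, probs in [("raw", p_raw), ("platt", p_platt), ("isotonic", p_iso)]:
    print(f"{name:<9} Brier score = {brier_score_loss(y_eval, probs):.4f}")
```

Whether either calibrator improves the Brier score depends on the data and the base model, so the printed values should be read as a comparison on this particular split only.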

In [None]:
# Compare uncalibrated vs calibrated models
# Use a model that benefits more from calibration for demonstration
from sklearn.ensemble import GradientBoostingClassifier

# Train a GBM (often less calibrated than logistic regression)
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
gbm.fit(X_train_s, y_train)
y_proba_gbm = gbm.predict_proba(X_test_s)[:, 1]

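# Calibrate with Platt scaling (sigmoid)
# With cv=5, CalibratedClassifierCV clones `gbm` and refits it on each fold;
# the fitted `gbm` above serves only as the uncalibrated baseline.
gbm_sigmoid = CalibratedClassifierCV(gbm, method="sigmoid", cv=5)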
gbm_sigmoid.fit(X_train_s, y_train)
y_proba_sigmoid = gbm_sigmoid.predict_proba(X_test_s)[:, 1]

# Calibrate with Isotonic regression
gbm_isotonic = CalibratedClassifierCV(gbm, method="isotonic", cv=5)
gbm_isotonic.fit(X_train_s, y_train)
y_proba_isotonic = gbm_isotonic.predict_proba(X_test_s)[:, 1]

# Plot all three calibration curves
fig, ax = plt.subplots(figsize=(8, 6))

for name, proba, color in [
    ("GBM (uncalibrated)", y_proba_gbm, "blue"),
    ("GBM + Platt (sigmoid)", y_proba_sigmoid, "green"),
    ("GBM + Isotonic", y_proba_isotonic, "orange"),
    ("Logistic Regression", y_proba, "purple"),
]:
    prob_true_i, prob_pred_i = calibration_curve(y_test, proba, n_bins=10)
    ax.plot(prob_pred_i, prob_true_i, "o-", label=name, color=color, linewidth=2)

ax.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
ax.set_xlabel("Mean Predicted Probability", fontsize=12)
ax.set_ylabel("Fraction of Positives", fontsize=12)
ax.set_title("Calibration Comparison", fontsize=14)
ax.legend(fontsize=10, loc="lower right")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Logistic regression is typically well-calibrated out of the box.")
print("Tree-based models (GBM, RF) often benefit from post-hoc calibration.")

<a id='8'></a>
## 8. Common Mistakes

1. **Using threshold 0.5 blindly**: The default threshold is often not optimal, especially with imbalanced classes or unequal error costs. Always analyze your precision-recall tradeoff and business constraints.

2. **Trusting uncalibrated probabilities**: Many models (SVMs, Random Forests, boosted trees) produce scores that are not true probabilities. Always check the calibration curve before using predicted probabilities in downstream decisions.

3. **Tuning the threshold on the test set**: Select the threshold on a validation set, not the test set. Otherwise you leak information and get optimistic estimates.

4. **Ignoring class imbalance when setting the threshold**: With 99% negatives, threshold 0.5 may predict almost everything as negative.

5. **Calibrating with too little data**: Isotonic calibration in particular needs enough samples. With small datasets, prefer Platt scaling.
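
Mistake 3 can be avoided with a three-way split: fit on the training set, choose the threshold on a validation set, and report metrics once on the untouched test set. A minimal sketch, assuming synthetic data and a hypothetical helper `pick_f1_threshold`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

THRESHOLD_GRID = np.linspace(0.05, 0.95, 91)


def pick_f1_threshold(y_true, y_score, grid=THRESHOLD_GRID):
    """Return the grid threshold with the highest F1 on the given data."""
    f1s = [f1_score(y_true, (y_score >= t).astype(int), zero_division=0) for t in grid]
    return float(grid[int(np.argmax(f1s))])


# Synthetic imbalanced data (illustration only): 60% train, 20% validation, 20% test.
X_s, y_s = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_tmp, X_hold, y_tmp, y_hold = train_test_split(
    X_s, y_s, test_size=0.2, stratify=y_s, random_state=0
)
X_fit3, X_val3, y_fit3, y_val3 = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0
)

clf_s = LogisticRegression(max_iter=1000).fit(X_fit3, y_fit3)

# The threshold is chosen using validation data only.
t_val = pick_f1_threshold(y_val3, clf_s.predict_proba(X_val3)[:, 1])

# The test set is touched once, for the final report.
p_hold = clf_s.predict_proba(X_hold)[:, 1]
f1_default = f1_score(y_hold, (p_hold >= 0.5).astype(int), zero_division=0)
f1_tuned = f1_score(y_hold, (p_hold >= t_val).astype(int), zero_division=0)

print(f"Threshold chosen on validation: {t_val:.2f}")
print(f"Test F1 at 0.5: {f1_default:.4f} | Test F1 at tuned threshold: {f1_tuned:.4f}")
```

The tuned threshold is not guaranteed to beat 0.5 on the test set; the point of the split is that the reported number is an honest estimate either way.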

<a id='9'></a>
## 9. Exercise

**Task**: Work with a synthetic imbalanced dataset to practice threshold tuning.

1. Generate data: `make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=2, weights=[0.9, 0.1], random_state=42)`.
2. Split, scale, and fit a logistic regression model.
3. Plot precision, recall, and F1 vs threshold.
4. Find the threshold that maximizes F1.
5. Compare the classification report at threshold=0.5 vs your optimal threshold.

In [None]:
# Reference solution (try the task yourself first)
# -------------------------------------------------

# Step 1: Generate imbalanced data
X_ex, y_ex = make_classification(
    n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
    weights=[0.9, 0.1], random_state=42
)
print(f"Class distribution: {np.bincount(y_ex)}")

# Step 2: Split, scale, fit
X_tr, X_te, y_tr, y_te = train_test_split(X_ex, y_ex, test_size=0.3,
                                           random_state=42, stratify=y_ex)
sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)
X_te_s = sc.transform(X_te)

clf = LogisticRegression(max_iter=300, random_state=42)
clf.fit(X_tr_s, y_tr)
y_prob_ex = clf.predict_proba(X_te_s)[:, 1]

# Step 3: Plot metrics vs threshold
ts = np.linspace(0.0, 1.0, 101)  # 0.00, 0.01, ..., 1.00
p_list, r_list, f_list = [], [], []
for t in ts:
    yp = (y_prob_ex >= t).astype(int)
    p_list.append(precision_score(y_te, yp, zero_division=0))
    r_list.append(recall_score(y_te, yp, zero_division=0))
    f_list.append(f1_score(y_te, yp, zero_division=0))

plt.figure(figsize=(10, 6))
plt.plot(ts, p_list, "b-", linewidth=2, label="Precision")
plt.plot(ts, r_list, "r-", linewidth=2, label="Recall")
plt.plot(ts, f_list, "g-", linewidth=2, label="F1")
plt.axvline(x=0.5, color="gray", linestyle="--", label="Default (0.5)")
plt.xlabel("Threshold")
plt.ylabel("Score")
plt.title("Metrics vs Threshold (Imbalanced Data)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Step 4: Find optimal F1 threshold
best_idx = np.argmax(f_list)
opt_t = ts[best_idx]
print(f"\nOptimal F1 threshold: {opt_t:.2f} (F1={f_list[best_idx]:.4f})")

# Step 5: Compare reports
print("\n=== Classification Report at threshold=0.5 ===")
print(classification_report(y_te, (y_prob_ex >= 0.5).astype(int)))

print(f"=== Classification Report at threshold={opt_t:.2f} ===")
print(classification_report(y_te, (y_prob_ex >= opt_t).astype(int)))