# ROC Curve (Receiver Operating Characteristic)

The ROC curve visualizes the tradeoff between:

- **True Positive Rate (TPR / recall / sensitivity)**
- **False Positive Rate (FPR = 1 - specificity)**

as we sweep a decision threshold over a model's **scores** (probabilities, logits, or any ranking score).

---

## Learning goals

By the end you should be able to:

- define TPR/FPR from the confusion matrix
- compute ROC points by threshold-sweeping
- implement `roc_curve` and AUC from scratch (NumPy)
- pick an operating threshold with ROC constraints (e.g. "FPR ≤ 5%")
- use AUC-ROC to pick a hyperparameter for logistic regression

## Quick import (scikit-learn)

```python
from sklearn.metrics import roc_curve, roc_auc_score
```


In [None]:
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

# Reproducibility: record versions (nice to have in a knowledge base)
import sys
import plotly

print("python:", sys.version.split()[0])
print("numpy :", np.__version__)
print("plotly:", plotly.__version__)

try:
    import sklearn

    print("sklearn:", sklearn.__version__)
except Exception as e:
    print("sklearn: not available ->", repr(e))


## 1) From scores to decisions

Let $y_i \in \{0,1\}$ be the true label and $s_i \in \mathbb{R}$ be a score where **larger means more likely positive**.

A threshold $t$ turns scores into hard predictions:

$$
\hat{y}_i(t) = \mathbb{1}[s_i \ge t]
$$

This creates the confusion-matrix counts:

$$
\begin{aligned}
\mathrm{TP}(t) &= \sum_i \mathbb{1}[y_i=1 \land s_i \ge t] \\
\mathrm{FP}(t) &= \sum_i \mathbb{1}[y_i=0 \land s_i \ge t] \\
\mathrm{TN}(t) &= \sum_i \mathbb{1}[y_i=0 \land s_i < t] \\
\mathrm{FN}(t) &= \sum_i \mathbb{1}[y_i=1 \land s_i < t]
\end{aligned}
$$

Two key rates (both in $[0,1]$):

$$
\mathrm{TPR}(t) = \frac{\mathrm{TP}(t)}{\mathrm{TP}(t)+\mathrm{FN}(t)}
\qquad
\mathrm{FPR}(t) = \frac{\mathrm{FP}(t)}{\mathrm{FP}(t)+\mathrm{TN}(t)}
$$

- $\mathrm{TPR}(t)$ is **sensitivity / recall**: $P(\hat{y}=1\mid y=1)$
- $\mathrm{FPR}(t)$ is $1-\text{specificity}$: $P(\hat{y}=1\mid y=0)$

**ROC curve**: the set of points $(\mathrm{FPR}(t),\mathrm{TPR}(t))$ as we sweep $t$ from $+\infty$ down to $-\infty$.

- At $t=+\infty$: predict everything negative → $(0,0)$
- At $t=-\infty$: predict everything positive → $(1,1)$
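As a hand-checkable illustration of these definitions, the sketch below evaluates $(\mathrm{FPR}(t), \mathrm{TPR}(t))$ at a few thresholds on six made-up samples. The data and the helper name `rates_at` are hypothetical and exist only for this example.

```python
import numpy as np

# Hypothetical toy data: 3 positives and 3 negatives.
y_toy = np.array([1, 1, 0, 1, 0, 0])
s_toy = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])


def rates_at(y, s, t):
    """Return (FPR, TPR) for the decision rule y_hat = 1[s >= t]."""
    y_hat = s >= t
    tpr = np.sum(y_hat & (y == 1)) / np.sum(y == 1)
    fpr = np.sum(y_hat & (y == 0)) / np.sum(y == 0)
    return float(fpr), float(tpr)


# Sweep from the strictest threshold to the most permissive one.
for t in [np.inf, 0.75, 0.5, 0.2, -np.inf]:
    fpr_t, tpr_t = rates_at(y_toy, s_toy, t)
    print(f"t={t}: FPR={fpr_t:.3f}, TPR={tpr_t:.3f}")
```

At $t=0.5$ the three highest scores are predicted positive: two are true positives and one is a false positive, so $\mathrm{TPR}=2/3$ and $\mathrm{FPR}=1/3$.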


In [None]:
# Toy scores: positives tend to have higher scores, but overlap with negatives.
n_pos, n_neg = 250, 350
scores_pos = rng.normal(loc=1.2, scale=1.0, size=n_pos)
scores_neg = rng.normal(loc=0.0, scale=1.0, size=n_neg)

y_true = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score = np.r_[scores_pos, scores_neg]

threshold_example = 0.5

fig = go.Figure()
fig.add_histogram(
    x=scores_neg,
    name="y=0 (negative)",
    nbinsx=50,
    opacity=0.6,
    histnorm="probability density",
)
fig.add_histogram(
    x=scores_pos,
    name="y=1 (positive)",
    nbinsx=50,
    opacity=0.6,
    histnorm="probability density",
)
fig.add_vline(x=threshold_example, line_width=2, line_dash="dash", line_color="black")
fig.update_layout(
    barmode="overlay",
    title="Toy scores: two overlapping distributions (a threshold splits predictions)",
    xaxis_title="score s (higher ⇒ more positive)",
    yaxis_title="density",
)
fig.show()


In [None]:
def confusion_counts(y_true, y_score, threshold):
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)

    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return int(tp), int(fp), int(tn), int(fn)


tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold_example)
tpr_example = tp / (tp + fn)
fpr_example = fp / (fp + tn)

print(f"threshold t = {threshold_example:.2f}")
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"TPR={tpr_example:.3f}, FPR={fpr_example:.3f}")


## 2) Computing the ROC curve (NumPy)

A naive ROC implementation tries many thresholds and recomputes the confusion matrix each time.
That works, but it costs $O(nk)$ for $k$ thresholds, which grows to $O(n^2)$ when every unique score is used as a threshold.

A standard efficient approach:

1. Sort samples by score $s$ (descending).
2. Sweep the threshold from high to low.
3. Each time the threshold crosses a score value, every sample with that score flips from predicted negative → positive.
4. Cumulative sums give $\mathrm{TP}(t)$ and $\mathrm{FP}(t)$ at each **unique** score.

This is $O(n\log n)$ for the sort, then $O(n)$ for the sweep.
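For comparison, the naive approach mentioned at the start of this section might look like the sketch below. The name `roc_points_naive` is hypothetical, and using every unique score as a candidate threshold is an assumption made here, not the only possible choice.

```python
import numpy as np


def roc_points_naive(y_true, y_score):
    """Naive O(n * k) ROC: recount TP and FP from scratch at each of k thresholds."""
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)

    # Candidate thresholds: +inf (predict nothing), then each unique score, descending.
    thresholds = np.r_[np.inf, np.unique(y_score)[::-1]]

    fpr = np.empty(thresholds.size)
    tpr = np.empty(thresholds.size)
    for k, t in enumerate(thresholds):
        y_pred = y_score >= t  # full pass over the data for every threshold
        tpr[k] = np.sum(y_pred & (y_true == 1)) / n_pos
        fpr[k] = np.sum(y_pred & (y_true == 0)) / n_neg
    return fpr, tpr, thresholds


fpr_nv, tpr_nv, thr_nv = roc_points_naive(
    [1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
)
print(np.c_[thr_nv, fpr_nv, tpr_nv])
```

The sorted sweep should visit the same points, but it replaces the inner recount with a single cumulative sum.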


In [None]:
def roc_curve_numpy(y_true, y_score):
    """Compute ROC curve points (FPR, TPR) and thresholds.

    Parameters
    ----------
    y_true : array-like, shape (n_samples,)
        Binary labels {0,1}.
    y_score : array-like, shape (n_samples,)
        Scores where higher means more likely positive.

    Returns
    -------
    fpr : ndarray
        False positive rates (non-decreasing).
    tpr : ndarray
        True positive rates (non-decreasing).
    thresholds : ndarray
        Thresholds in descending score order, starting with +inf.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    if y_true.shape != y_score.shape:
        raise ValueError("y_true and y_score must have the same shape.")

    y_true = y_true.astype(int)
    unique = np.unique(y_true)
    if unique.size != 2 or not np.array_equal(unique, [0, 1]):
        raise ValueError(f"y_true must contain both 0 and 1 labels; got {unique}.")

    # Sort by decreasing score (stable sort => deterministic tie handling)
    order = np.argsort(-y_score, kind="mergesort")
    y_true_sorted = y_true[order]
    y_score_sorted = y_score[order]

    pos = y_true_sorted == 1
    neg = ~pos

    n_pos = pos.sum()
    n_neg = neg.sum()

    tps = np.cumsum(pos)
    fps = np.cumsum(neg)

    # ROC points only change when the threshold passes a distinct score value.
    distinct_score_indices = np.where(np.diff(y_score_sorted) != 0)[0]
    threshold_indices = np.r_[distinct_score_indices, y_true_sorted.size - 1]

    thresholds = y_score_sorted[threshold_indices]
    tpr = tps[threshold_indices] / n_pos
    fpr = fps[threshold_indices] / n_neg

    # Add the (0,0) start point at threshold = +inf
    thresholds = np.r_[np.inf, thresholds]
    tpr = np.r_[0.0, tpr]
    fpr = np.r_[0.0, fpr]

    return fpr, tpr, thresholds


def auc_trapz(x, y):
    """Area under a curve via the trapezoidal rule.

    Assumes x is sorted in non-decreasing order.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or y.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1D arrays of the same length.")
    if np.any(np.diff(x) < 0):
        raise ValueError("x must be sorted in non-decreasing order.")
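    # np.trapz is deprecated since NumPy 2.0 in favor of np.trapezoid; support both.
    trapezoid = getattr(np, "trapezoid", None) or np.trapz
    return float(trapezoid(y, x))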


In [None]:
fpr, tpr, thresholds = roc_curve_numpy(y_true, y_score)
auc_roc = auc_trapz(fpr, tpr)

print("AUC (NumPy trapezoid):", auc_roc)

try:
    from sklearn.metrics import roc_auc_score as sk_roc_auc_score, roc_curve as sk_roc_curve

    fpr_sk, tpr_sk, thr_sk = sk_roc_curve(y_true, y_score)
    auc_sk = sk_roc_auc_score(y_true, y_score)
    auc_sk_trapz = auc_trapz(fpr_sk, tpr_sk)

    print("sklearn roc_curve points:", fpr_sk.size)
    print("AUC (sklearn roc_auc_score):", auc_sk)
    print("AUC (sklearn roc_curve + trapz):", auc_sk_trapz)
    print("AUC abs diff (ours vs sklearn):", abs(auc_roc - auc_sk))
except Exception as e:
    print("sklearn check skipped ->", repr(e))


In [None]:
hover_text = [
    "t=+inf" if not np.isfinite(t) else f"t={t:.3f}" for t in thresholds
]

fig = go.Figure()
fig.add_scatter(
    x=fpr,
    y=tpr,
    mode="lines+markers",
    name=f"ROC (AUC={auc_roc:.3f})",
    text=hover_text,
    hovertemplate="FPR=%{x:.3f}<br>TPR=%{y:.3f}<br>%{text}<extra></extra>",
)
fig.add_scatter(
    x=[0, 1],
    y=[0, 1],
    mode="lines",
    name="random (AUC=0.5)",
    line=dict(dash="dash", color="gray"),
)
fig.add_scatter(
    x=[fpr_example],
    y=[tpr_example],
    mode="markers",
    name=f"example t={threshold_example:.2f}",
    marker=dict(size=11, symbol="star", color="black"),
    hovertemplate="FPR=%{x:.3f}<br>TPR=%{y:.3f}<br>example threshold<extra></extra>",
)

fig.update_layout(
    title="ROC curve: sweep threshold to trade FPR vs TPR",
    xaxis_title="False Positive Rate (FPR)",
    yaxis_title="True Positive Rate (TPR)",
    xaxis=dict(range=[0, 1], constrain="domain"),
    yaxis=dict(range=[0, 1], scaleanchor="x", scaleratio=1),
)
fig.show()


In [None]:
# How TPR/FPR evolve as the threshold moves (note: thresholds are in descending score order).
fig = go.Figure()
fig.add_scatter(x=thresholds[1:], y=tpr[1:], mode="lines+markers", name="TPR")
fig.add_scatter(x=thresholds[1:], y=fpr[1:], mode="lines+markers", name="FPR")
fig.add_vline(x=threshold_example, line_width=2, line_dash="dash", line_color="black")
fig.update_layout(
    title="TPR and FPR as functions of the decision threshold",
    xaxis_title="threshold t (higher ⇒ stricter)",
    yaxis_title="rate",
)
fig.update_xaxes(autorange="reversed")
fig.show()


## 3) AUC: a single-number summary (and what it means)

The ROC curve is a whole *family* of operating points.
A common summary is the **Area Under the ROC Curve (AUC-ROC)**.

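A helpful interpretation (ranking view): let $s^+$ be the score of a randomly drawn positive and $s^-$ the score of a randomly drawn negative. Then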

$$
\mathrm{AUC} = P(s^+ > s^-) + \frac{1}{2} P(s^+ = s^-)
$$

So AUC measures how well the model **ranks** positives above negatives. It says nothing about calibration: any strictly increasing transform of the scores leaves the AUC unchanged.

AUC is convenient, but it can hide important details (for example, you might only care about very low FPR).
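The ranking view can be checked directly by comparing every positive score with every negative score. The sketch below is a brute-force illustration under that reading; `auc_pairwise` is a hypothetical helper, and its $O(n_+ n_-)$ memory use makes it suitable only for small inputs.

```python
import numpy as np


def auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score, dtype=float)
    s_pos = y_score[y_true == 1]
    s_neg = y_score[y_true == 0]

    # Entry (i, j) is s_pos[i] - s_neg[j]: positive means the pair is ranked correctly.
    diff = s_pos[:, None] - s_neg[None, :]
    wins = np.sum(diff > 0)
    ties = np.sum(diff == 0)
    return float((wins + 0.5 * ties) / diff.size)


# Hypothetical toy data with one tie between a positive and a negative (both 0.7).
y_demo = [1, 1, 0, 1, 0, 0]
s_demo = [0.9, 0.7, 0.7, 0.4, 0.3, 0.1]

print(auc_pairwise(y_demo, s_demo))
print(auc_pairwise(y_demo, np.array(s_demo) ** 3))  # monotone transform, same ranking
```

Of the 9 positive/negative pairs here, 7 are ranked correctly and 1 is tied, so both calls give $7.5/9 \approx 0.833$. Cubing the scores changes their values but not their order, which is why the second result matches the first.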


## 4) Using ROC/AUC to tune a classifier (logistic regression from scratch)

ROC curves need **scores**.
Logistic regression produces a probability score $\hat{p}(y=1\mid x)$, so it's a natural match.

Important nuance:

- We usually **fit** logistic regression by minimizing **log loss** (it is smooth and differentiable).
- We often **select** hyperparameters (e.g., regularization strength) by maximizing a metric like **AUC-ROC** on a validation set.
- We then choose an **operating threshold** based on business constraints, often using the ROC curve.
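One way to see why log loss rather than AUC serves as the training objective: AUC depends only on the ordering of the scores, so it stays flat under changes to the weights that preserve that ordering, while log loss responds to every change. The sketch below uses made-up 1D data and a one-parameter model $p = \sigma(wx)$; the helper names `log_loss_demo` and `auc_demo` are hypothetical.

```python
import numpy as np


def log_loss_demo(y, p, eps=1e-12):
    """Average binary log loss."""
    return float(-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))


def auc_demo(y, s):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    d = s[y == 1][:, None] - s[y == 0][None, :]
    return float((np.sum(d > 0) + 0.5 * np.sum(d == 0)) / d.size)


# Hypothetical data: one positive (x = -0.5) sits below one negative (x = 0.5).
x_demo = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y_demo = np.array([0, 0, 1, 0, 1, 1])

losses, aucs = [], []
for w in [0.5, 1.0, 2.0]:
    p = 1.0 / (1.0 + np.exp(-w * x_demo))  # scaling w > 0 keeps the ranking fixed
    losses.append(log_loss_demo(y_demo, p))
    aucs.append(auc_demo(y_demo, p))
    print(f"w={w}: log loss={losses[-1]:.4f}, AUC={aucs[-1]:.4f}")
```

The AUC is the same for all three weights because the ranking never changes, whereas the log loss differs for each. A flat metric gives gradient descent no direction to move in, which is one reason AUC is used for model selection rather than for fitting.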


In [None]:
# Synthetic 2D binary classification dataset (labels drawn from a noisy logistic model)
n = 1200
X = rng.normal(size=(n, 2))

w_true = np.array([1.5, -2.0])
b_true = 0.2

logits = X @ w_true + b_true + rng.normal(0, 0.8, size=n)
p = 1.0 / (1.0 + np.exp(-logits))
y = rng.binomial(1, p).astype(int)

# Train/validation split
perm = rng.permutation(n)
n_train = int(0.7 * n)
train_idx = perm[:n_train]
val_idx = perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# Standardize using training statistics
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-12

X_train_s = (X_train - mu) / sigma
X_val_s = (X_val - mu) / sigma


def add_intercept(X):
    return np.c_[np.ones(X.shape[0]), X]


Xb_train = add_intercept(X_train_s)
Xb_val = add_intercept(X_val_s)

fig = px.scatter(
    x=X_val_s[:, 0],
    y=X_val_s[:, 1],
    color=y_val.astype(str),
    title="Validation split (standardized features)",
    labels={"x": "x1 (standardized)", "y": "x2 (standardized)", "color": "y"},
    opacity=0.7,
)
fig.show()


In [None]:
def sigmoid(z):
    """Numerically stable sigmoid."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def fit_logreg_gd(X, y, l2=0.0, lr=0.2, n_iter=2500):
    """Logistic regression via batch gradient descent (L2 on weights, not intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    w = np.zeros(X.shape[1], dtype=float)
    history = np.empty(n_iter, dtype=float)
    eps = 1e-12

    for i in range(n_iter):
        z = X @ w
        p = sigmoid(z)

        # Regularized log loss (average)
        data_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
        reg_loss = 0.5 * l2 * np.sum(w[1:] ** 2)
        loss = data_loss + reg_loss

        grad = (X.T @ (p - y)) / X.shape[0]
        grad[1:] += l2 * w[1:]

        w -= lr * grad
        history[i] = loss

    return w, history


In [None]:
# Hyperparameter tuning: pick L2 strength that maximizes validation AUC-ROC
l2_grid = np.logspace(-4, 1, 10)

weights_by_l2 = {}
auc_by_l2 = []

for l2 in l2_grid:
    w, _ = fit_logreg_gd(Xb_train, y_train, l2=l2, lr=0.2, n_iter=2500)
    weights_by_l2[float(l2)] = w

    scores_val = sigmoid(Xb_val @ w)
    fpr_v, tpr_v, _ = roc_curve_numpy(y_val, scores_val)
    auc_v = auc_trapz(fpr_v, tpr_v)
    auc_by_l2.append(float(auc_v))

auc_by_l2 = np.array(auc_by_l2)
best_i = int(np.argmax(auc_by_l2))
best_l2 = float(l2_grid[best_i])
best_auc = float(auc_by_l2[best_i])

print(f"best l2: {best_l2:g}")
print(f"best validation AUC: {best_auc:.4f}")

fig = go.Figure()
fig.add_scatter(
    x=l2_grid,
    y=auc_by_l2,
    mode="lines+markers",
    name="validation AUC",
)
fig.add_scatter(
    x=[best_l2],
    y=[best_auc],
    mode="markers",
    marker=dict(size=12, symbol="star", color="black"),
    name="best",
)
fig.update_xaxes(type="log", title="L2 regularization strength (λ)")
fig.update_yaxes(title="Validation AUC-ROC", range=[0, 1])
fig.update_layout(title="Model selection: choose λ that maximizes AUC-ROC")
fig.show()


In [None]:
# Refit the best model to visualize training loss
best_w, loss_hist = fit_logreg_gd(Xb_train, y_train, l2=best_l2, lr=0.2, n_iter=2500)

fig = px.line(
    y=loss_hist,
    title=f"Training loss (regularized log loss), best λ={best_l2:g}",
    labels={"x": "iteration", "y": "loss"},
)
fig.show()


In [None]:
# Compare ROC curves for a few regularization settings
candidates = [float(l2_grid[0]), best_l2, float(l2_grid[-1])]

fig = go.Figure()
for l2 in candidates:
    w = best_w if l2 == best_l2 else weights_by_l2[l2]
    scores_val = sigmoid(Xb_val @ w)
    fpr_v, tpr_v, _ = roc_curve_numpy(y_val, scores_val)
    auc_v = auc_trapz(fpr_v, tpr_v)
    fig.add_scatter(
        x=fpr_v,
        y=tpr_v,
        mode="lines",
        name=f"λ={l2:g} (AUC={auc_v:.3f})",
    )

fig.add_scatter(
    x=[0, 1],
    y=[0, 1],
    mode="lines",
    name="random (AUC=0.5)",
    line=dict(dash="dash", color="gray"),
)
fig.update_layout(
    title="ROC curves on validation set (different regularization strengths)",
    xaxis_title="False Positive Rate (FPR)",
    yaxis_title="True Positive Rate (TPR)",
    xaxis=dict(range=[0, 1], constrain="domain"),
    yaxis=dict(range=[0, 1], scaleanchor="x", scaleratio=1),
)
fig.show()


In [None]:
# Choosing an operating threshold from the ROC curve
# Example constraint: keep FPR <= 5% while maximizing TPR.
scores_val = sigmoid(Xb_val @ best_w)
fpr_v, tpr_v, thr_v = roc_curve_numpy(y_val, scores_val)

target_fpr = 0.05
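# Index 0 is the (0, 0) point at t=+inf, so the feasible set is never empty.
feasible = np.where(fpr_v <= target_fpr)[0]

# Among feasible points, take the highest TPR (np.argmax returns the first index on ties).
chosen_i = int(feasible[np.argmax(tpr_v[feasible])]) if feasible.size else 0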
chosen_thr = float(thr_v[chosen_i])

tp, fp, tn, fn = confusion_counts(y_val, scores_val, chosen_thr)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"target FPR <= {target_fpr:.2f}")
print(f"chosen threshold: {chosen_thr:.4f}")
print(f"FPR={fpr_v[chosen_i]:.4f}, TPR={tpr_v[chosen_i]:.4f}")
print(f"TP={tp}, FP={fp}, TN={tn}, FN={fn}")
print(f"precision={precision:.4f}, recall={recall:.4f}")

default_thr = 0.5
tp2, fp2, tn2, fn2 = confusion_counts(y_val, scores_val, default_thr)
tpr2 = tp2 / (tp2 + fn2)
fpr2 = fp2 / (fp2 + tn2)

fig = go.Figure()
fig.add_scatter(
    x=fpr_v,
    y=tpr_v,
    mode="lines",
    name="ROC (best model)",
)
fig.add_scatter(
    x=[fpr_v[chosen_i]],
    y=[tpr_v[chosen_i]],
    mode="markers",
    marker=dict(size=12, symbol="star", color="black"),
    name=f"chosen (FPR≤{target_fpr:.2f})",
    hovertemplate="FPR=%{x:.3f}<br>TPR=%{y:.3f}<br>chosen threshold<extra></extra>",
)
fig.add_scatter(
    x=[fpr2],
    y=[tpr2],
    mode="markers",
    marker=dict(size=10, symbol="circle", color="gray"),
    name="default t=0.5",
    hovertemplate="FPR=%{x:.3f}<br>TPR=%{y:.3f}<br>t=0.5<extra></extra>",
)
fig.add_vline(x=target_fpr, line_width=1, line_dash="dash", line_color="gray")
fig.update_layout(
    title="Picking a threshold from ROC constraints",
    xaxis_title="False Positive Rate (FPR)",
    yaxis_title="True Positive Rate (TPR)",
    xaxis=dict(range=[0, 1], constrain="domain"),
    yaxis=dict(range=[0, 1], scaleanchor="x", scaleratio=1),
)
fig.show()


## Pros, cons, and when ROC is a good choice

### Pros

- **Threshold-free view**: shows the entire tradeoff curve instead of committing to one $t$.
- **Ranking-focused**: works naturally with scores (and AUC has a clean ranking interpretation).
- **Model comparison**: curves make it easy to compare classifiers across operating points.
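- **Less sensitive to class imbalance than accuracy**: TPR and FPR are each computed within a single class, so the curve does not shift when the class ratio changes (as long as the per-class score distributions stay the same).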

### Cons / pitfalls

- **Can look overly optimistic on very imbalanced problems**: a small FPR can still mean many false positives in absolute count.
- **AUC can hide the region you actually care about** (e.g., only FPR < 1%). Consider partial AUC or zooming.
- **Not about calibration**: a perfectly calibrated model and a poorly calibrated model can have the same ROC/AUC.
- **Needs scores**: if you pass hard labels, the curve collapses to a single operating point joined to $(0,0)$ and $(1,1)$.
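To make the imbalance pitfall concrete, here is a small arithmetic sketch. All counts and rates are made-up illustration values, not results from this notebook.

```python
# Hypothetical screening scenario: 100 positives among 1,000,000 negatives.
n_pos_demo, n_neg_demo = 100, 1_000_000

# An operating point that looks strong on a ROC plot.
tpr_demo, fpr_demo = 0.90, 0.01

tp_demo = tpr_demo * n_pos_demo  # expected true positives
fp_demo = fpr_demo * n_neg_demo  # expected false positives
precision_demo = tp_demo / (tp_demo + fp_demo)

print(f"TP={tp_demo:.0f}, FP={fp_demo:.0f}, precision={precision_demo:.4f}")
```

With these numbers, a 1% FPR still produces about 10,000 false positives against 90 true positives, so fewer than 1 in 100 flagged cases is a real positive. The ROC curve alone does not show this, because neither axis depends on the class ratio.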

### Good use cases

- You care about **ranking** (retrieval, screening, triage) and want to choose a threshold later.
- You have a constraint like **"FPR must be below X"** and want the best achievable TPR.
- Comparing multiple models when operating conditions or cost ratios are not yet fixed.


## Exercises

1. Implement a **sample-weighted** ROC curve (each point has weight $w_i$).
2. Show a case where two models have similar AUC but very different performance for **FPR < 1%**.
3. Create an extremely imbalanced dataset and compare ROC vs **precision-recall** curves (why might PR be more informative?).
4. Derive the ranking interpretation of AUC from scratch.


## References

- scikit-learn: `roc_curve`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
- scikit-learn: `roc_auc_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
- Tom Fawcett (2006), *An introduction to ROC analysis*: https://doi.org/10.1016/j.patrec.2005.10.010
- Wikipedia: Receiver operating characteristic: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
