# roc_auc_score (ROC AUC)

`roc_auc_score` computes the **area under the ROC curve**. It evaluates how well a model **ranks** positive examples above negative examples, using *scores* (probabilities or decision values), not hard class labels.

**You will learn**
- How thresholds produce points on the ROC curve (TPR vs FPR)
- Two equivalent AUC formulas: trapezoid area and Mann–Whitney (rank) view
- A NumPy implementation of `roc_curve` + `roc_auc_score` (tie-safe)
- How to optimize for AUC with a differentiable pairwise surrogate (NumPy)

## Quick import

```python
from sklearn.metrics import roc_auc_score
```

## Prerequisites

- Binary classification labels (0/1)
- Confusion matrix terms: TP / FP / TN / FN
- Basic probability and calculus


In [None]:
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_auc_score as skl_roc_auc_score
from sklearn.metrics import roc_curve as skl_roc_curve

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)

SEED = 42
rng = np.random.default_rng(SEED)


## 1) From scores to TPR/FPR (one threshold)

Assume:
- true labels: $y_i \in \{0,1\}$
- model scores (higher = more positive): $s_i \in \mathbb{R}$
- threshold: $\tau$

We predict positive if:

$$
\hat{y}_i(\tau)=\mathbb{1}[s_i \ge \tau]
$$

From the confusion matrix at threshold $\tau$:

$$
\mathrm{TPR}(\tau)=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}, \qquad
\mathrm{FPR}(\tau)=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}
$$

- TPR = **recall / sensitivity**
- FPR = **1 - specificity**


In [None]:
def confusion_at_threshold(y_true, y_score, threshold, pos_label=1):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if y_true.shape[0] != y_score.shape[0]:
        raise ValueError("y_true and y_score must have the same length.")

    pos = y_true == pos_label
    pred_pos = y_score >= threshold

    tp = np.sum(pos & pred_pos)
    fp = np.sum(~pos & pred_pos)
    fn = np.sum(pos & ~pred_pos)
    tn = np.sum(~pos & ~pred_pos)
    return tp, fp, tn, fn


def tpr_fpr_from_confusion(tp, fp, tn, fn):
    tpr = tp / (tp + fn) if (tp + fn) > 0 else np.nan
    fpr = fp / (fp + tn) if (fp + tn) > 0 else np.nan
    return tpr, fpr


In [None]:
y_true_small = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0])
y_score_small = np.array([0.95, 0.90, 0.80, 0.75, 0.60, 0.55, 0.52, 0.50, 0.40, 0.35, 0.30, 0.10])

threshold = 0.50

tp, fp, tn, fn = confusion_at_threshold(y_true_small, y_score_small, threshold=threshold)
tpr, fpr = tpr_fpr_from_confusion(tp, fp, tn, fn)

df_small = pd.DataFrame({"y_true": y_true_small, "score": y_score_small})
df_small = df_small.sort_values("score", ascending=False).reset_index(drop=True)
df_small[f"y_pred(score \u2265 {threshold:.2f})"] = (df_small["score"] >= threshold).astype(int)

df_small, {"TP": tp, "FP": fp, "TN": tn, "FN": fn, "TPR": tpr, "FPR": fpr}


## 2) ROC curve: sweep the threshold

The ROC curve plots $(\mathrm{FPR}(\tau), \mathrm{TPR}(\tau))$ as we move the threshold $\tau$ from **very strict** to **very lenient**:

- $\tau = +\infty$ ⇒ predict nothing positive ⇒ (FPR,TPR) = (0,0)
- $\tau$ decreases ⇒ more predicted positives ⇒ move up/right
- $\tau = -\infty$ ⇒ predict everything positive ⇒ (1,1)

A random ranking gives the diagonal line $\mathrm{TPR} = \mathrm{FPR}$.
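
As a quick sanity check of the diagonal claim, the sketch below scores examples with pure noise, so the scores carry no information about the labels. The sample size, the seed and the `*_demo` names are illustrative assumptions rather than part of the toy example used in this notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng_demo = np.random.default_rng(0)  # illustrative seed
y_demo = rng_demo.integers(0, 2, size=20_000)
s_demo = rng_demo.normal(size=20_000)  # scores drawn independently of the labels

fpr_demo, tpr_demo, _ = roc_curve(y_demo, s_demo)
max_gap = np.max(np.abs(tpr_demo - fpr_demo))  # largest vertical distance to the diagonal
auc_demo = roc_auc_score(y_demo, s_demo)

print(f"AUC={auc_demo:.3f}, max |TPR - FPR|={max_gap:.3f}")
```

With finite samples the curve wobbles around the diagonal rather than lying exactly on it; the gap should shrink as the sample grows.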


In [None]:
def roc_curve_bruteforce(y_true, y_score, pos_label=1):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    thresholds = np.r_[np.inf, np.sort(np.unique(y_score))[::-1]]
    fpr = []
    tpr = []

    for thr in thresholds:
        tp, fp, tn, fn = confusion_at_threshold(y_true, y_score, thr, pos_label=pos_label)
        tpr_i, fpr_i = tpr_fpr_from_confusion(tp, fp, tn, fn)
        fpr.append(fpr_i)
        tpr.append(tpr_i)

    return np.asarray(fpr), np.asarray(tpr), thresholds


fpr_b, tpr_b, thr_b = roc_curve_bruteforce(y_true_small, y_score_small)
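# np.trapz was renamed to np.trapezoid in NumPy 2.0; fall back to the old name on older versions.
auc_b = (getattr(np, "trapezoid", None) or np.trapz)(tpr_b, fpr_b)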

df_roc_small = pd.DataFrame({"threshold": thr_b, "fpr": fpr_b, "tpr": tpr_b})
df_roc_small


In [None]:
point_labels = ["inf" if np.isinf(t) else f"{t:.2f}" for t in thr_b]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("ROC curve (toy example)", "TPR/FPR vs threshold"),
)

fig.add_trace(
    go.Scatter(
        x=fpr_b,
        y=tpr_b,
        mode="lines+markers",
        name=f"ROC (AUC={auc_b:.3f})",
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        line=dict(dash="dash", color="black"),
        name="random",
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(
        x=fpr_b,
        y=tpr_b,
        mode="markers+text",
        text=point_labels,
        textposition="top center",
        marker=dict(size=8),
        name="thresholds",
    ),
    row=1,
    col=1,
)

mask = np.isfinite(thr_b)
fig.add_trace(
    go.Scatter(
        x=thr_b[mask],
        y=tpr_b[mask],
        mode="lines+markers",
        name="TPR",
    ),
    row=1,
    col=2,
)
fig.add_trace(
    go.Scatter(
        x=thr_b[mask],
        y=fpr_b[mask],
        mode="lines+markers",
        name="FPR",
    ),
    row=1,
    col=2,
)

fig.update_xaxes(title_text="FPR", range=[0, 1], row=1, col=1)
fig.update_yaxes(title_text="TPR", range=[0, 1], row=1, col=1)
fig.update_xaxes(title_text="threshold τ", autorange="reversed", row=1, col=2)
fig.update_yaxes(title_text="rate", range=[0, 1], row=1, col=2)

fig.update_layout(width=950, height=430)
fig.show()


## 3) AUC: "area" and "probability of correct ranking"

The **ROC AUC** is the area under the ROC curve:

$$
\mathrm{AUC} = \int_0^1 \mathrm{TPR}(u)\,du
$$

where we integrate TPR as a function of FPR.

A powerful equivalent view (binary case) is:

$$
\mathrm{AUC} = \mathbb{P}(s^+ > s^-) + \frac{1}{2}\mathbb{P}(s^+ = s^-)
$$

where $s^+$ is the score of a random positive example and $s^-$ is the score of a random negative example.

So AUC is a **ranking metric**:
- any strictly increasing transform of the score (e.g. logits → probabilities) leaves AUC unchanged; a strictly decreasing transform reverses the ranking and maps AUC to $1-\mathrm{AUC}$
- AUC = 0.5 means random ranking, AUC = 1.0 means perfect ranking
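
A small numerical check of the invariance claim, with one caveat made explicit: the transform has to be strictly *increasing*. The data below is synthetic (two Gaussians with arbitrary means), and the `*_demo` names are assumptions made for this sketch only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng_demo = np.random.default_rng(0)  # illustrative seed
y_demo = np.r_[np.ones(200, dtype=int), np.zeros(300, dtype=int)]
s_demo = np.r_[rng_demo.normal(1.0, 1.0, 200), rng_demo.normal(0.0, 1.0, 300)]

auc_raw = roc_auc_score(y_demo, s_demo)
auc_sigmoid = roc_auc_score(y_demo, 1 / (1 + np.exp(-s_demo)))  # strictly increasing
auc_affine = roc_auc_score(y_demo, 3.0 * s_demo + 7.0)          # strictly increasing
auc_negated = roc_auc_score(y_demo, -s_demo)                    # strictly decreasing

print(auc_raw, auc_sigmoid, auc_affine, auc_negated)
```

The first three values should agree, while negating the scores reverses every pairwise comparison and so gives `1 - auc_raw` (there are no tied scores here).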


In [None]:
pos_scores = y_score_small[y_true_small == 1]
neg_scores = y_score_small[y_true_small == 0]

auc_pairwise = (pos_scores[:, None] > neg_scores[None, :]).mean() + 0.5 * (
    pos_scores[:, None] == neg_scores[None, :]
).mean()

auc_pairwise, auc_b


In [None]:
n_pairs = 500
pos_s = rng.choice(pos_scores, size=n_pairs, replace=True)
neg_s = rng.choice(neg_scores, size=n_pairs, replace=True)

df_pairs = pd.DataFrame({"neg_score": neg_s, "pos_score": pos_s})

min_s = float(min(df_pairs["neg_score"].min(), df_pairs["pos_score"].min()))
max_s = float(max(df_pairs["neg_score"].max(), df_pairs["pos_score"].max()))

fig = px.scatter(
    df_pairs,
    x="neg_score",
    y="pos_score",
    opacity=0.55,
    title=(
        "Random positive/negative score pairs (above diagonal = correct ranking)" f"<br>AUC ≈ {auc_pairwise:.3f}"
    ),
)
fig.add_shape(
    type="line",
    x0=min_s,
    y0=min_s,
    x1=max_s,
    y1=max_s,
    line=dict(color="black", dash="dash"),
)
fig.update_xaxes(title="negative score s⁻")
fig.update_yaxes(title="positive score s⁺")
fig.update_layout(width=650, height=520)
fig.show()


## 4) NumPy implementation (ROC curve + ROC AUC)

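A direct implementation that recomputes the confusion matrix at every distinct threshold costs $O(n)$ per threshold, so it is $O(n^2)$ in the worst case (all scores distinct).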

A faster approach:

1. Sort examples by score (descending)
2. Sweep the threshold from high to low
3. Track cumulative TP and FP counts
4. Record a ROC point only when the score changes (tie handling)

This is $O(n \log n)$ due to sorting.
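
Step 4 is what makes the sweep tie-safe. The sketch below uses a hypothetical helper, `auc_from_sweep` (not one of this notebook's functions), to compare two variants on a tiny example with one tied positive/negative pair: recording a point after every sample versus recording one only at the end of each tied group.

```python
import numpy as np


def auc_from_sweep(y, s, collapse_ties):
    """Sorted-sweep AUC; `collapse_ties` toggles step 4 (illustrative helper)."""
    order = np.argsort(-s, kind="mergesort")  # stable sort, descending score
    y_sorted, s_sorted = y[order], s[order]
    tps = np.cumsum(y_sorted)
    fps = np.cumsum(1 - y_sorted)
    if collapse_ties:
        # keep only the last index of each group of equal scores
        keep = np.r_[np.diff(s_sorted) != 0, True]
        tps, fps = tps[keep], fps[keep]
    tpr = np.r_[0, tps] / y.sum()
    fpr = np.r_[0, fps] / (len(y) - y.sum())
    # trapezoid rule written out explicitly
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))


s_tie = np.array([0.9, 0.5, 0.5, 0.1])
y_neg_first = np.array([1, 0, 1, 0])  # tied pair stored as (negative, positive)
y_pos_first = np.array([1, 1, 0, 0])  # same pair stored as (positive, negative)

for y_tie in (y_neg_first, y_pos_first):
    print(auc_from_sweep(y_tie, s_tie, collapse_ties=False),
          auc_from_sweep(y_tie, s_tie, collapse_ties=True))
```

Without collapsing, the result depends on the arbitrary storage order of the tied samples (0.75 versus 1.0 here). With collapsing, both orders give 0.875, which is the pairwise value with half credit for the tie: $(3 + 0.5)/4$.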


In [None]:
def roc_curve_np(y_true, y_score, pos_label=1):
    """Compute ROC curve points (FPR, TPR) for binary classification.

    Parameters
    ----------
    y_true : array-like of shape (n_samples,)
        Binary labels. Anything equal to `pos_label` is treated as positive.
    y_score : array-like of shape (n_samples,)
        Scores where larger means "more positive".
    pos_label : label (default=1)
        Which label is considered positive.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if y_true.shape[0] != y_score.shape[0]:
        raise ValueError("y_true and y_score must have the same length.")

    pos = y_true == pos_label
    n_pos = int(pos.sum())
    n_neg = int((~pos).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("roc_curve is undefined with only one class present in y_true.")

    order = np.argsort(-y_score, kind="mergesort")
    y_score_sorted = y_score[order]
    y_pos_sorted = pos[order].astype(int)

    distinct_value_indices = np.where(np.diff(y_score_sorted))[0]
    threshold_idxs = np.r_[distinct_value_indices, y_pos_sorted.size - 1]

    tps = np.cumsum(y_pos_sorted)[threshold_idxs]
    fps = 1 + threshold_idxs - tps

    # Prepend the point at threshold +inf: (FPR,TPR) = (0,0)
    tps = np.r_[0, tps]
    fps = np.r_[0, fps]
    thresholds = np.r_[np.inf, y_score_sorted[threshold_idxs]]

    fpr = fps / n_neg
    tpr = tps / n_pos
    return fpr, tpr, thresholds


def roc_auc_score_np(y_true, y_score, pos_label=1):
    fpr, tpr, _ = roc_curve_np(y_true, y_score, pos_label=pos_label)
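    # np.trapz was renamed to np.trapezoid in NumPy 2.0; fall back to the old name on older versions.
    return float((getattr(np, "trapezoid", None) or np.trapz)(tpr, fpr))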


def rankdata_average_ties(x):
    """Ranks starting at 1, using average ranks for ties (NumPy-only)."""
    x = np.asarray(x)
    order = np.argsort(x, kind="mergesort")
    x_sorted = x[order]

    ranks_sorted = np.empty_like(x_sorted, dtype=float)

    n = len(x_sorted)
    i = 0
    rank = 1
    while i < n:
        j = i + 1
        while j < n and x_sorted[j] == x_sorted[i]:
            j += 1

        # ranks for i..j-1 are rank..rank+(j-i)-1
        avg_rank = 0.5 * (rank + (rank + (j - i) - 1))
        ranks_sorted[i:j] = avg_rank

        rank += j - i
        i = j

    ranks = np.empty_like(ranks_sorted)
    ranks[order] = ranks_sorted
    return ranks


def roc_auc_score_mann_whitney_np(y_true, y_score, pos_label=1):
    """AUC via Mann–Whitney U / Wilcoxon rank-sum (tie-safe)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if y_true.shape[0] != y_score.shape[0]:
        raise ValueError("y_true and y_score must have the same length.")

    pos = y_true == pos_label
    n_pos = int(pos.sum())
    n_neg = int((~pos).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("roc_auc_score is undefined with only one class present in y_true.")

    ranks = rankdata_average_ties(y_score)
    sum_ranks_pos = ranks[pos].sum()
    u = sum_ranks_pos - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))


In [None]:
y_true = rng.integers(0, 2, size=300)
y_score = rng.normal(size=300)

auc_np = roc_auc_score_np(y_true, y_score)
auc_mw = roc_auc_score_mann_whitney_np(y_true, y_score)
auc_skl = skl_roc_auc_score(y_true, y_score)

auc_np, auc_mw, auc_skl


In [None]:
# Our curve matches sklearn when drop_intermediate=False (sklearn defaults to drop_intermediate=True)
fpr_np, tpr_np, thr_np = roc_curve_np(y_true, y_score)
fpr_skl, tpr_skl, thr_skl = skl_roc_curve(y_true, y_score, drop_intermediate=False)

(
    np.allclose(fpr_np, fpr_skl),
    np.allclose(tpr_np, tpr_skl),
    np.allclose(thr_np, thr_skl),
    len(fpr_np),
    len(skl_roc_curve(y_true, y_score)[0]),
)


In [None]:
# AUC is invariant to strictly increasing transforms of the scores (sigmoid is one)
auc_logits = roc_auc_score_np(y_true, y_score)
auc_prob = roc_auc_score_np(y_true, 1 / (1 + np.exp(-y_score)))

auc_logits, auc_prob


## 5) Visual intuition: distributions → thresholds → ROC points

Below we draw score distributions for each class and place a few thresholds. Each threshold maps to a point on the ROC curve.


In [None]:
n_pos, n_neg = 250, 750
scores_pos = rng.normal(loc=1.2, scale=1.0, size=n_pos)
scores_neg = rng.normal(loc=0.0, scale=1.0, size=n_neg)

y_true_big = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score_big = np.r_[scores_pos, scores_neg]

perm = rng.permutation(len(y_true_big))
y_true_big = y_true_big[perm]
y_score_big = y_score_big[perm]

fpr, tpr, thresholds = roc_curve_np(y_true_big, y_score_big)
auc_val = roc_auc_score_np(y_true_big, y_score_big)

thresholds_demo = np.quantile(y_score_big, [0.9, 0.5, 0.1])
colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Score distributions", f"ROC curve (AUC={auc_val:.3f})"),
)

fig.add_trace(
    go.Histogram(
        x=y_score_big[y_true_big == 0],
        name="negative",
        opacity=0.6,
        nbinsx=40,
        marker_color="gray",
    ),
    row=1,
    col=1,
)
fig.add_trace(
    go.Histogram(
        x=y_score_big[y_true_big == 1],
        name="positive",
        opacity=0.6,
        nbinsx=40,
        marker_color="crimson",
    ),
    row=1,
    col=1,
)

for thr, c in zip(thresholds_demo, colors):
    fig.add_vline(x=float(thr), line_dash="dash", line_color=c, row=1, col=1)

fig.add_trace(go.Scatter(x=fpr, y=tpr, mode="lines", name="ROC"), row=1, col=2)
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        line=dict(dash="dash", color="black"),
        name="random",
    ),
    row=1,
    col=2,
)

for thr, c in zip(thresholds_demo, colors):
    tp, fp, tn, fn = confusion_at_threshold(y_true_big, y_score_big, threshold=float(thr))
    tpr_thr, fpr_thr = tpr_fpr_from_confusion(tp, fp, tn, fn)
    fig.add_trace(
        go.Scatter(
            x=[fpr_thr],
            y=[tpr_thr],
            mode="markers",
            marker=dict(size=10, color=c),
            name=f"τ={thr:.2f}",
        ),
        row=1,
        col=2,
    )

fig.update_layout(barmode="overlay", width=950, height=430)
fig.update_xaxes(title_text="score", row=1, col=1)
fig.update_yaxes(title_text="count", row=1, col=1)
fig.update_xaxes(title_text="FPR", range=[0, 1], row=1, col=2)
fig.update_yaxes(title_text="TPR", range=[0, 1], row=1, col=2)

fig.show()


## 6) Class imbalance: ROC AUC is prevalence-invariant (PR AUC is not)

ROC uses rates (TPR/FPR), so duplicating every negative example (same scores) leaves the curve and AUC unchanged.

Precision–recall metrics do change with prevalence, so PR AUC is often preferred for extreme imbalance.


In [None]:
# Duplicate negatives 10x (same scores) to change prevalence
y_true_imbal = np.r_[y_true_big[y_true_big == 1], np.repeat(y_true_big[y_true_big == 0], 10)]
y_score_imbal = np.r_[y_score_big[y_true_big == 1], np.repeat(y_score_big[y_true_big == 0], 10)]

auc_orig = roc_auc_score_np(y_true_big, y_score_big)
auc_imbal = roc_auc_score_np(y_true_imbal, y_score_imbal)

ap_orig = average_precision_score(y_true_big, y_score_big)
ap_imbal = average_precision_score(y_true_imbal, y_score_imbal)

auc_orig, auc_imbal, ap_orig, ap_imbal


In [None]:
fpr_o, tpr_o, _ = roc_curve_np(y_true_big, y_score_big)
fpr_i, tpr_i, _ = roc_curve_np(y_true_imbal, y_score_imbal)

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr_o, y=tpr_o, mode="lines", name=f"original (AUC={auc_orig:.3f})"))
fig.add_trace(go.Scatter(x=fpr_i, y=tpr_i, mode="lines", name=f"negatives ×10 (AUC={auc_imbal:.3f})"))
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        line=dict(dash="dash", color="black"),
        showlegend=False,
    )
)
fig.update_layout(
    title="ROC curves overlap under prevalence shift",
    xaxis_title="FPR",
    yaxis_title="TPR",
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
    width=720,
    height=450,
)
fig.show()


## 7) Practical usage (scikit-learn)

Key points:
- Pass **scores**, not hard labels.
  - `predict_proba(X)[:, 1]` (probabilities)
  - `decision_function(X)` (raw scores / logits)
- Any strictly increasing transform of the scores gives the same AUC.
- For multiclass you must choose `multi_class="ovr"` or `"ovo"` and an averaging strategy.

Docs: `sklearn.metrics.roc_auc_score`.
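
For the multiclass bullet, a minimal sketch of the call is shown below. The dataset, the model and the `*_mc` names are assumptions chosen only to have something to score; the point is that `roc_auc_score` expects the full `(n_samples, n_classes)` probability matrix plus an explicit `multi_class` choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem (illustrative settings)
X_mc, y_mc = make_classification(
    n_samples=1500,
    n_features=10,
    n_informative=6,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_mc, y_mc, test_size=0.3, stratify=y_mc, random_state=0
)

# Rows of predict_proba sum to 1, which the multiclass AUC requires
proba = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).predict_proba(X_te)

auc_ovr_macro = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
auc_ovo_macro = roc_auc_score(y_te, proba, multi_class="ovo", average="macro")

print(auc_ovr_macro, auc_ovo_macro)
```

One-vs-rest averages the binary AUC of each class against all others, while one-vs-one averages over pairs of classes, so the two numbers generally differ slightly.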


In [None]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    weights=[0.85, 0.15],
    class_sep=1.0,
    random_state=SEED,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=SEED
)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)

score_logit = clf.decision_function(X_test)
score_proba = clf.predict_proba(X_test)[:, 1]
score_label = clf.predict(X_test)

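# Logits and probabilities rank the examples identically → identical AUC.
# Hard labels collapse the ranking to a single threshold, so the third value is typically lower.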
skl_roc_auc_score(y_test, score_logit), skl_roc_auc_score(y_test, score_proba), skl_roc_auc_score(y_test, score_label)


## 8) Optimizing for ROC AUC (NumPy)

For binary labels, AUC can be written as an average over all positive–negative pairs:

$$
\mathrm{AUC}(s) = \frac{1}{|P||N|} \sum_{i\in P}\sum_{j\in N} \Big(\mathbb{1}[s_i > s_j] + \tfrac{1}{2}\mathbb{1}[s_i = s_j]\Big)
$$

This depends on **pairwise orderings** (rankings), which makes it:
- non-decomposable over single examples
- non-differentiable because of the indicator

A common workaround is to optimize a **smooth pairwise surrogate**. For a linear scoring model $s(x)=w^\top x$ one choice is the pairwise logistic loss:

$$
L(w)=\frac{1}{|P||N|}\sum_{i\in P}\sum_{j\in N} \log\big(1+\exp\big(-(s_i - s_j)\big)\big)
$$

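Minimizing $L$ pushes $s_i$ above $s_j$ for each positive $i$ and negative $j$, which tends to increase AUC (it is a surrogate, so an improvement in $L$ does not guarantee an improvement in AUC).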

In practice we sample pairs (SGD) instead of enumerating all $|P||N|$ pairs.
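
Differentiating the surrogate term by term gives $\nabla_w L = -\frac{1}{|P||N|}\sum_{i\in P}\sum_{j\in N} \sigma\big(-(s_i - s_j)\big)\,(x_i - x_j)$, where $\sigma$ is the logistic sigmoid. The sketch below evaluates the loss and this gradient over all pairs and compares the gradient with central finite differences. The function names and the small synthetic data are assumptions made for this check.

```python
import numpy as np


def pairwise_logistic_loss(w, X_pos, X_neg):
    # d[i, j] = w^T x_i - w^T x_j for every positive i and negative j
    d = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]
    return np.logaddexp(0.0, -d).mean()  # numerically stable log(1 + exp(-d))


def pairwise_logistic_grad(w, X_pos, X_neg):
    d = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]
    weight = 1.0 / (1.0 + np.exp(d))  # sigmoid(-d), one weight per pair
    # sum_ij weight_ij (x_i - x_j), computed without materializing all pair differences
    g = weight.sum(axis=1) @ X_pos - weight.sum(axis=0) @ X_neg
    return -g / weight.size


rng_demo = np.random.default_rng(0)  # illustrative seed
X_pos_demo = rng_demo.normal(1.0, 1.0, size=(30, 3))
X_neg_demo = rng_demo.normal(0.0, 1.0, size=(50, 3))
w_demo = rng_demo.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (pairwise_logistic_loss(w_demo + eps * e, X_pos_demo, X_neg_demo)
     - pairwise_logistic_loss(w_demo - eps * e, X_pos_demo, X_neg_demo)) / (2 * eps)
    for e in np.eye(3)
])
max_err = np.max(np.abs(num_grad - pairwise_logistic_grad(w_demo, X_pos_demo, X_neg_demo)))
print(max_err)
```

A mini-batch of sampled pairs yields an unbiased estimate of this same gradient, which is why sampling is a reasonable substitute for full enumeration.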


In [None]:
def sigmoid(z):
    z = np.asarray(z)
    z = np.clip(z, -40, 40)
    return 1 / (1 + np.exp(-z))


def add_bias(X):
    X = np.asarray(X)
    return np.c_[np.ones(X.shape[0]), X]


def make_gaussian_binary(n_pos=250, n_neg=1250, seed=0):
    rng_local = np.random.default_rng(seed)
    mean_pos = np.array([1.5, 1.5])
    mean_neg = np.array([0.0, 0.0])
    cov = np.array([[1.0, 0.3], [0.3, 1.0]])

    X_pos = rng_local.multivariate_normal(mean_pos, cov, size=n_pos)
    X_neg = rng_local.multivariate_normal(mean_neg, cov, size=n_neg)

    X = np.vstack([X_pos, X_neg])
    y = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]

    perm = rng_local.permutation(len(y))
    return X[perm], y[perm]


def train_logistic_logloss_gd(X, y, lr=0.2, steps=2000, l2=1e-3, log_every=50):
    Xb = add_bias(X)
    y = y.astype(float)

    w = np.zeros(Xb.shape[1])
    hist = []

    for step in range(steps + 1):
        scores = Xb @ w
        p = sigmoid(scores)

        grad = (Xb.T @ (p - y)) / len(y)
        reg_grad = l2 * np.r_[0.0, w[1:]]  # don't regularize bias
        w -= lr * (grad + reg_grad)

        if step % log_every == 0:
            logloss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)).mean()
            auc = roc_auc_score_np(y.astype(int), scores)
            hist.append({"step": step, "logloss": logloss, "train_auc": auc})

    return w, pd.DataFrame(hist)


def train_auc_pairwise_sgd(
    X, y, lr=0.2, steps=4000, batch_pairs=512, l2=1e-3, log_every=50, seed=0
):
    rng_local = np.random.default_rng(seed)
    Xb = add_bias(X)
    y = y.astype(int)

    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if len(pos_idx) == 0 or len(neg_idx) == 0:
        raise ValueError("Need both classes to optimize AUC.")

    w = np.zeros(Xb.shape[1])
    hist = []

    for step in range(steps + 1):
        i = rng_local.choice(pos_idx, size=batch_pairs, replace=True)
        j = rng_local.choice(neg_idx, size=batch_pairs, replace=True)

        delta = Xb[i] - Xb[j]  # x_i - x_j
        d = delta @ w  # (w^T x_i) - (w^T x_j)

        # loss = log(1 + exp(-d))
        # dloss/dd = -sigmoid(-d)
        grad = -(sigmoid(-d)[:, None] * delta).mean(axis=0)

        reg_grad = l2 * np.r_[0.0, w[1:]]
        w -= lr * (grad + reg_grad)

        if step % log_every == 0:
            scores = Xb @ w
            auc = roc_auc_score_np(y, scores)
            pair_loss = np.logaddexp(0.0, -d).mean()  # stable log(1 + exp(-d)), no overflow for large |d|
            hist.append({"step": step, "pair_loss": pair_loss, "train_auc": auc})

    return w, pd.DataFrame(hist)


In [None]:
X, y = make_gaussian_binary(seed=SEED)

# manual random split (not stratified; with this many samples the class ratio stays close in both parts)
idx = rng.permutation(len(y))
n_train = int(0.7 * len(y))
train_idx, test_idx = idx[:n_train], idx[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

w_ce, hist_ce = train_logistic_logloss_gd(X_train, y_train, lr=0.3, steps=2000, log_every=50)
w_auc, hist_auc = train_auc_pairwise_sgd(
    X_train, y_train, lr=0.3, steps=3000, batch_pairs=1024, log_every=50, seed=SEED
)

scores_ce_test = add_bias(X_test) @ w_ce
scores_auc_test = add_bias(X_test) @ w_auc

auc_ce_test = roc_auc_score_np(y_test, scores_ce_test)
auc_auc_test = roc_auc_score_np(y_test, scores_auc_test)

auc_ce_test, auc_auc_test


In [None]:
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=hist_ce["step"],
        y=hist_ce["train_auc"],
        mode="lines",
        name="log-loss GD (train AUC)",
    )
)
fig.add_trace(
    go.Scatter(
        x=hist_auc["step"],
        y=hist_auc["train_auc"],
        mode="lines",
        name="pairwise AUC surrogate (train AUC)",
    )
)
fig.update_layout(
    title="Training AUC over iterations",
    xaxis_title="step",
    yaxis_title="ROC AUC",
    yaxis=dict(range=[0, 1]),
    width=760,
    height=430,
)
fig.show()


In [None]:
fpr_ce, tpr_ce, _ = roc_curve_np(y_test, scores_ce_test)
fpr_auc, tpr_auc, _ = roc_curve_np(y_test, scores_auc_test)

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=fpr_ce,
        y=tpr_ce,
        mode="lines",
        name=f"log-loss GD (test AUC={auc_ce_test:.3f})",
    )
)
fig.add_trace(
    go.Scatter(
        x=fpr_auc,
        y=tpr_auc,
        mode="lines",
        name=f"AUC surrogate (test AUC={auc_auc_test:.3f})",
    )
)
fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        line=dict(dash="dash", color="black"),
        showlegend=False,
    )
)
fig.update_layout(
    title="Test ROC curves",
    xaxis_title="FPR",
    yaxis_title="TPR",
    xaxis=dict(range=[0, 1]),
    yaxis=dict(range=[0, 1]),
    width=760,
    height=450,
)
fig.show()


## Pros / cons / when to use

**Pros**
- Threshold-free: summarizes performance across all thresholds
- Ranking-focused: $\mathbb{P}(s^+ > s^-)$ interpretation is often intuitive
- Invariant to strictly increasing score transforms (logits vs probabilities)
- Less sensitive to class imbalance than accuracy (uses normalized rates)

**Cons / pitfalls**
- Not about calibration: probabilities can be badly calibrated and still yield high AUC
- Weights all FPR regions equally; if you care about tiny FPR, consider **partial AUC**
- For extreme imbalance, PR AUC can be more informative than ROC AUC
- Undefined if `y_true` contains only one class; multiclass requires design choices (`ovr`/`ovo`, averaging)
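
For the partial-AUC bullet, scikit-learn exposes a `max_fpr` parameter on `roc_auc_score`. For binary problems it returns a standardized partial AUC over $[0, \text{max\_fpr}]$ (McClish correction), rescaled so that 0.5 still corresponds to a random ranking. The synthetic data and `*_demo` names below are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng_demo = np.random.default_rng(0)  # illustrative seed
y_demo = np.r_[np.ones(500, dtype=int), np.zeros(5000, dtype=int)]
s_demo = np.r_[rng_demo.normal(1.0, 1.0, 500), rng_demo.normal(0.0, 1.0, 5000)]

auc_full = roc_auc_score(y_demo, s_demo)
auc_low_fpr = roc_auc_score(y_demo, s_demo, max_fpr=0.1)  # standardized partial AUC

print(auc_full, auc_low_fpr)
```

The two numbers answer different questions: the first averages over all operating points, the second only over those with at most 10% false positives.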

**Good for**
- Model comparison when you care about ranking / screening
- Imbalanced classification when you want a threshold-independent ranking metric

**Less good for**
- Picking a single operating threshold under asymmetric costs
- Measuring probability quality (use log-loss, Brier score, calibration curves)
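
The calibration point can be illustrated with a short sketch: distort well-calibrated probabilities with a strictly increasing map and compare AUC with the Brier score. The data-generating process (labels drawn from known probabilities) and the `p ** 4` distortion are assumptions chosen for this demonstration.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng_demo = np.random.default_rng(0)  # illustrative seed
p_true = rng_demo.uniform(0.05, 0.95, size=5000)             # calibrated by construction
y_demo = (rng_demo.uniform(size=5000) < p_true).astype(int)  # labels drawn from p_true
p_distorted = p_true ** 4                                    # same ranking, wrong probabilities

auc_cal = roc_auc_score(y_demo, p_true)
auc_dist = roc_auc_score(y_demo, p_distorted)
brier_cal = brier_score_loss(y_demo, p_true)
brier_dist = brier_score_loss(y_demo, p_distorted)

print(auc_cal, auc_dist, brier_cal, brier_dist)
```

AUC cannot tell the two score vectors apart, whereas the Brier score penalizes the distorted probabilities.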


## Exercises

1) Implement **partial AUC** for a max FPR (e.g. integrate only over $\mathrm{FPR}\in[0,0.1]$).

2) Extend `roc_curve_np` to support **sample weights**.

3) Show numerically that AUC is unchanged by any strictly increasing transform (try `np.tanh`, `np.exp`, `sigmoid`).

4) Multiclass: compute one-vs-rest AUC for each class and compare macro vs weighted averages.


## References

- scikit-learn `roc_auc_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
- scikit-learn ROC user guide: https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc
- T. Fawcett (2006), *An introduction to ROC analysis*
- Hanley & McNeil (1982), *The meaning and use of the area under a ROC curve*
