# zero_one_loss (classification error / 1 - accuracy)

`zero_one_loss` is the simplest classification loss: it counts how often the predicted label differs from the true label.

It is a great **evaluation** metric for "did we get the label right?", but a poor **training** objective for gradient-based optimization: it is piecewise constant, so its gradient is zero almost everywhere and undefined at the jumps.

## Learning goals
- write the binary and multiclass definitions in clean notation
- understand the link to accuracy and (for binary) the confusion matrix
- implement `zero_one_loss` in NumPy (with optional `sample_weight`)
- build intuition via threshold and parameter-surface plots (Plotly)
- see how 0-1 loss is used for selection/optimization in practice (threshold tuning)

## Quick import

```python
from sklearn.metrics import zero_one_loss
```

## Table of contents
1. Definition and notation
2. Intuition: thresholds and decision rules (plots)
3. NumPy implementation + sanity checks
4. Using 0-1 loss for selection/optimization
5. Pros, cons, pitfalls

## References (quick)
- scikit-learn docs: https://scikit-learn.org/stable/api/sklearn.metrics.html
- ESL (Hastie, Tibshirani, Friedman): "The Elements of Statistical Learning" (classification + empirical risk)


In [None]:
import os

import numpy as np
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score, zero_one_loss as sk_zero_one_loss
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)


## 1) Definition and notation

Assume we have $n$ examples.

- True label: $y_i$
- Predicted label: $\hat{y}_i$

### Per-example 0-1 loss

$$
\ell_i = \mathbb{1}[\hat{y}_i \ne y_i]
$$

### Aggregate (count vs mean)

Unnormalized (count of mistakes):

$$
L_{\text{count}} = \sum_{i=1}^n \mathbb{1}[\hat{y}_i \ne y_i]
$$

Normalized (fraction of mistakes):

$$
L = \frac{1}{n} \sum_{i=1}^n \mathbb{1}[\hat{y}_i \ne y_i]
$$

This is exactly:

$$
L = 1 - \text{accuracy}.
$$
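
As a quick numerical check of the count, the mean, and the accuracy identity, here is a minimal sketch. The six-example label arrays are made up for illustration and are not tied to any dataset.

```python
import numpy as np

# Illustrative labels (made up for this example)
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])

mistakes = (y_true != y_pred)                 # per-example 0-1 loss
loss_count = int(mistakes.sum())              # unnormalized: number of mistakes
loss_mean = float(mistakes.mean())            # normalized: fraction of mistakes
accuracy = float((y_true == y_pred).mean())   # fraction of correct labels

print(loss_count, loss_mean, accuracy)
print("loss + accuracy =", loss_mean + accuracy)
```

Every example is either right or wrong, so the two fractions always sum to 1.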

### Sample-weighted version

Given weights $w_i \ge 0$ (e.g. importance weights, class weights), the normalized weighted 0-1 loss is:

$$
L_w = \frac{\sum_{i=1}^n w_i\,\mathbb{1}[\hat{y}_i \ne y_i]}{\sum_{i=1}^n w_i}.
$$
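
A small sketch of the weighted formula, assuming made-up labels and a weight vector in which one example counts five times as much as the others. `np.average` with `weights=` computes the same ratio of sums.

```python
import numpy as np

# Illustrative labels and weights (made up for this example)
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])
w = np.array([1.0, 1.0, 5.0, 1.0, 1.0, 1.0])   # third example counts 5x

mistakes = (y_true != y_pred).astype(float)

loss_unweighted = float(mistakes.mean())                    # 2 / 6
loss_weighted = float(np.sum(w * mistakes) / np.sum(w))     # (5 + 1) / 10
loss_weighted_avg = float(np.average(mistakes, weights=w))  # same quantity

print(loss_unweighted, loss_weighted, loss_weighted_avg)
```

The heavily weighted example is one of the two mistakes, so the weighted loss is much larger than the unweighted one.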

### Multiclass and multilabel

- Multiclass ($K$ classes): $y_i \in \{0,\dots,K-1\}$ and the same formula applies.
- Multilabel: $y_i$ is a binary indicator vector (one entry per label). scikit-learn's `zero_one_loss` uses **subset 0-1 loss**:

$$
\ell_i = \mathbb{1}[\hat{\mathbf{y}}_i \ne \mathbf{y}_i]
$$

i.e. the whole label vector must match exactly. (This is often stricter than what you want; see pitfalls.)
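
To make the "whole vector must match" rule concrete, here is a minimal NumPy sketch on a made-up 3-example, 3-label indicator matrix. The array names are illustrative only.

```python
import numpy as np

# Illustrative multilabel indicator matrices (rows = examples, cols = labels)
Y_true = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1]])

wrong = (Y_true != Y_pred)                # per-label mismatches
labels_wrong_per_row = wrong.sum(axis=1)  # how many labels are wrong per example
row_wrong = wrong.any(axis=1)             # subset rule: any mismatch -> example is wrong
subset_loss = float(row_wrong.mean())

print(labels_wrong_per_row, row_wrong, subset_loss)
```

Rows 2 and 3 each have a single wrong label out of three, yet both count as full mistakes.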

### Bayes optimal decision rule (why argmax probability is optimal)

Let $p_k(x) = P(Y=k\mid X=x)$ be the true class probabilities (a model's outputs are estimates of these). The classifier that minimizes the **expected** 0-1 loss is:

$$
\hat{y}(x) = \arg\max_k\ p_k(x).
$$
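
The reason is that predicting class $k$ at $x$ is wrong with probability $1 - p_k(x)$, so the smallest expected loss comes from the largest $p_k(x)$. A minimal sketch, assuming the rows of a made-up matrix `P` are the true conditional probabilities:

```python
import numpy as np

# Made-up class probabilities p_k(x) for 3 inputs and K = 3 classes
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
])

expected_loss = 1.0 - P                  # expected 0-1 loss of predicting class k
y_hat = P.argmax(axis=1)                 # argmax-probability rule
best_k = expected_loss.argmin(axis=1)    # class with the smallest expected loss
min_risk = expected_loss.min(axis=1)     # expected loss achieved at each input

print(y_hat, best_k, min_risk)
```

The two index vectors agree row by row, which is the argmax rule restated.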

Binary case with $\eta(x)=P(Y=1\mid X=x)$ and equal misclassification costs:

$$
\hat{y}(x)=\mathbb{1}[\eta(x)\ge 1/2].
$$

With costs $c_{01}$ (false positive) and $c_{10}$ (false negative), the optimal threshold becomes:

$$
\hat{y}(x)=\mathbb{1}\Big[\eta(x)\ge \frac{c_{01}}{c_{01}+c_{10}}\Big].
$$
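
A numerical sanity check of this threshold, as a sketch with made-up costs (`c01 = 1`, `c10 = 3`). It compares the expected cost of each decision on a grid of $\eta$ values chosen to avoid the exact tie point, and confirms that the cheaper decision coincides with the threshold rule.

```python
import numpy as np

c01, c10 = 1.0, 3.0                   # made-up costs: false positive = 1, false negative = 3
eta = (np.arange(100) + 0.5) / 100    # grid of P(Y=1 | x) values: 0.005, 0.015, ..., 0.995

cost_predict_1 = (1.0 - eta) * c01    # predicting 1 is wrong only when Y = 0
cost_predict_0 = eta * c10            # predicting 0 is wrong only when Y = 1

best_by_cost = cost_predict_1 < cost_predict_0   # True where predicting 1 is cheaper
t = c01 / (c01 + c10)                            # threshold from the formula
best_by_threshold = eta >= t

print("threshold:", t)
print("rules agree:", bool(np.array_equal(best_by_cost, best_by_threshold)))
```

With false negatives three times as costly, the threshold drops to 0.25: the rule predicts the positive class even when it is fairly unlikely.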


In [None]:
def sigmoid(z):
    z = np.asarray(z, dtype=float)
    # np.where evaluates both branches, so use exp(-|z|), which cannot overflow
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


def zero_one_loss_np(y_true, y_pred, *, normalize=True, sample_weight=None):
    """NumPy implementation of scikit-learn's zero_one_loss.

    - If y is 1D: counts elementwise mismatches.
    - If y is 2D (multilabel / multioutput): uses subset 0-1 loss (row must match exactly).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true {y_true.shape} vs y_pred {y_pred.shape}")
    if y_true.ndim not in (1, 2):
        raise ValueError("y_true must be 1D or 2D")

    if y_true.ndim == 1:
        incorrect = (y_true != y_pred)
    else:
        incorrect = np.any(y_true != y_pred, axis=1)

    incorrect = incorrect.astype(float)
    n = incorrect.shape[0]

    if sample_weight is None:
        total = float(incorrect.sum())
        return total / n if normalize else total

    w = np.asarray(sample_weight, dtype=float)
    if w.ndim != 1 or w.shape[0] != n:
        raise ValueError(f"sample_weight must be shape (n,), got {w.shape}")

    total = float(np.sum(w * incorrect))
    if not normalize:
        return total

    w_sum = float(w.sum())
    if w_sum == 0:
        return 0.0
    return total / w_sum


def predict_labels_from_proba(p, *, threshold=0.5):
    """Convert probabilities to hard labels.

    - Binary: p is (n,) or (n,2) (assumes column 1 is P(y=1)).
    - Multiclass: p is (n,K) -> argmax.
    """
    p = np.asarray(p, dtype=float)
    if p.ndim == 1:
        return (p >= threshold).astype(int)
    if p.ndim == 2 and p.shape[1] == 2:
        return (p[:, 1] >= threshold).astype(int)
    if p.ndim == 2:
        return np.argmax(p, axis=1)
    raise ValueError(f"p must be 1D or 2D, got shape {p.shape}")


def zero_one_loss_from_proba(
    y_true,
    p,
    *,
    threshold=0.5,
    normalize=True,
    sample_weight=None,
):
    y_pred = predict_labels_from_proba(p, threshold=threshold)
    return zero_one_loss_np(y_true, y_pred, normalize=normalize, sample_weight=sample_weight)


def log_loss_binary(y_true, p, *, sample_weight=None, eps=1e-15):
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)
    p = np.clip(p, eps, 1 - eps)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    if sample_weight is None:
        return float(per_sample.mean())
    w = np.asarray(sample_weight, dtype=float)
    w_sum = float(w.sum())
    if w_sum == 0:
        return 0.0
    return float(np.sum(w * per_sample) / w_sum)


def best_threshold_zero_one(y_true, p, *, sample_weight=None, normalize=True):
    """Find an exact minimizer over thresholds t in [0, 1] (binary, rule: p>=t -> 1).

    The predictions only change when t crosses a value in p, so evaluating t over unique p values
    (plus the endpoints 0 and 1) is enough to find the exact optimum.
    """
    y_true = np.asarray(y_true)
    p = np.asarray(p, dtype=float)
    if y_true.shape != p.shape or p.ndim != 1:
        raise ValueError("y_true and p must be 1D arrays of the same shape")

    if sample_weight is None:
        w = np.ones_like(p, dtype=float)
    else:
        w = np.asarray(sample_weight, dtype=float)
        if w.shape != p.shape:
            raise ValueError("sample_weight must have the same shape as p")

    order = np.argsort(p)
    p_s = p[order]
    y_s = y_true[order]
    w_s = w[order]

    w_pos = w_s * (y_s == 1)
    w_neg = w_s * (y_s == 0)
    cum_pos = np.cumsum(w_pos)
    cum_neg = np.cumsum(w_neg)
    total_neg = float(cum_neg[-1])

    uniq = np.unique(p_s)
    thresholds = np.unique(np.concatenate(([0.0], uniq, [1.0])))
    start = np.searchsorted(p_s, thresholds, side="left")
    before = start - 1
    pos_below = np.where(before >= 0, cum_pos[before], 0.0)
    neg_below = np.where(before >= 0, cum_neg[before], 0.0)

    misclassified = pos_below + (total_neg - neg_below)
    if normalize:
        denom = float(w_s.sum())
        losses = misclassified / denom if denom > 0 else np.zeros_like(misclassified)
    else:
        losses = misclassified

    best_j = int(np.argmin(losses))
    return float(thresholds[best_j]), float(losses[best_j])


def standardize_fit_transform(X):
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std


def standardize_transform(X, mean, std):
    X = np.asarray(X, dtype=float)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std


def fit_logistic_regression_gd(
    X_train,
    y_train,
    X_val=None,
    y_val=None,
    *,
    lr=0.2,
    n_steps=300,
    l2=0.0,
    threshold=0.5,
):
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=int)

    n, d = X_train.shape
    w = np.zeros(d, dtype=float)
    b = 0.0

    hist = {
        "step": [],
        "train_log_loss": [],
        "train_zero_one": [],
        "val_log_loss": [],
        "val_zero_one": [],
    }

    for step in range(n_steps + 1):
        z_train = X_train @ w + b
        p_train = sigmoid(z_train)

        hist["step"].append(step)
        hist["train_log_loss"].append(log_loss_binary(y_train, p_train))
        hist["train_zero_one"].append(zero_one_loss_from_proba(y_train, p_train, threshold=threshold))

        if X_val is not None and y_val is not None:
            z_val = np.asarray(X_val, dtype=float) @ w + b
            p_val = sigmoid(z_val)
            hist["val_log_loss"].append(log_loss_binary(y_val, p_val))
            hist["val_zero_one"].append(zero_one_loss_from_proba(y_val, p_val, threshold=threshold))
        else:
            hist["val_log_loss"].append(np.nan)
            hist["val_zero_one"].append(np.nan)

        if step == n_steps:
            break

        # gradient of mean log loss (plus optional L2 penalty)
        grad = p_train - y_train
        grad_w = (X_train.T @ grad) / n + l2 * w
        grad_b = float(grad.mean())

        w -= lr * grad_w
        b -= lr * grad_b

    return w, b, hist


## 2) Intuition: thresholds and decision rules (plots)

0-1 loss depends only on **hard labels**.

In binary classification, many models output a score or probability $\hat{p}(y=1\mid x)$.
To turn that into a label we pick a threshold $t$:

$$
\hat{y}(t) = \mathbb{1}[\hat{p} \ge t].
$$

As you vary $t$, the predictions only change when $t$ crosses one of the predicted probabilities.
So the empirical 0-1 loss as a function of $t$ is a **step function** (flat most of the time, then jumps).

This is a key reason 0-1 loss is not used as a smooth training objective: small parameter changes often produce *no* change in 0-1 loss until a point flips sides.
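
A tiny sketch of the step behaviour, using six made-up scores and labels. The helper `loss_at` is hypothetical and defined only for this illustration.

```python
import numpy as np

# Made-up labels and predicted probabilities (sorted for readability)
y = np.array([0, 0, 1, 0, 1, 1])
p_hat = np.array([0.10, 0.35, 0.40, 0.60, 0.70, 0.90])

def loss_at(t):
    """Mean 0-1 loss of the rule p_hat >= t -> 1 (illustrative helper)."""
    return float(np.mean((p_hat >= t).astype(int) != y))

# Any two thresholds between neighbouring scores give identical predictions
print(loss_at(0.45), loss_at(0.55))

# The loss changes only when t crosses one of the scores
for t in [0.05, 0.20, 0.38, 0.50, 0.65, 0.80, 0.95]:
    print(f"t={t:.2f}  loss={loss_at(t):.3f}")
```

With $n$ scores there are at most $n+1$ distinct labelings, so the curve has at most $n+1$ flat pieces.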


In [None]:
n = 250
x = rng.normal(size=n)
p_true = sigmoid(1.5 * x - 0.3)
y = rng.binomial(1, p_true)

# Pretend these are predicted probabilities from an imperfect model
p_hat = np.clip(p_true + 0.15 * rng.normal(size=n), 1e-3, 1 - 1e-3)

thresholds = np.linspace(0.0, 1.0, 601)
losses = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t) for t in thresholds])
acc = 1.0 - losses

t_best, _ = best_threshold_zero_one(y, p_hat)

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(
    go.Scatter(
        x=thresholds,
        y=losses,
        name="zero-one loss",
        mode="lines",
        line_shape="hv",
    ),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(
        x=thresholds,
        y=acc,
        name="accuracy (1 - loss)",
        mode="lines",
        line_shape="hv",
    ),
    secondary_y=True,
)

fig.add_vline(x=0.5, line_dash="dash", line_color="gray", opacity=0.7)
fig.add_vline(x=t_best, line_dash="dot", line_color="crimson")

fig.update_xaxes(title_text="threshold t")
fig.update_yaxes(title_text="0-1 loss", secondary_y=False, range=[0, 1])
fig.update_yaxes(title_text="accuracy", secondary_y=True, range=[0, 1])

fig.update_layout(
    title=f"0-1 loss is a step function of the threshold (one optimal t ≈ {t_best:.3f})",
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="left", x=0),
)
fig

## 3) NumPy implementation: sanity checks

A key property: 0-1 loss is **insensitive to confidence**.

- predicting 0.51 vs 0.99 for the positive class gives the same 0-1 outcome (as long as the thresholded label is the same)
- but a probabilistic loss like log loss will strongly prefer 0.99 over 0.51 when the true label is 1

Let's verify that our NumPy version matches scikit-learn and highlight the "confidence blindness".


In [None]:
y_true = np.array([1, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1])

print("numpy (mean):", zero_one_loss_np(y_true, y_pred))
print("sklearn (mean):", sk_zero_one_loss(y_true, y_pred))
print("1 - accuracy_score:", 1 - accuracy_score(y_true, y_pred))
print("numpy (count):", zero_one_loss_np(y_true, y_pred, normalize=False))
print("sklearn (count):", sk_zero_one_loss(y_true, y_pred, normalize=False))

w = np.array([1, 1, 5, 1, 1, 1], dtype=float)
print("\nweighted numpy (mean):", zero_one_loss_np(y_true, y_pred, sample_weight=w))
print("weighted sklearn (mean):", sk_zero_one_loss(y_true, y_pred, sample_weight=w))

# multilabel / multioutput: subset 0-1 loss (row must match exactly)
y_true_ml = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
y_pred_ml = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1]])
print("\nmultilabel numpy:", zero_one_loss_np(y_true_ml, y_pred_ml))
print("multilabel sklearn:", sk_zero_one_loss(y_true_ml, y_pred_ml))

# confidence blindness: same hard predictions, different probabilities
y_true = np.array([1, 1, 1, 0, 0])
p_soft = np.array([0.51, 0.55, 0.52, 0.49, 0.45])
p_confident = np.array([0.99, 0.90, 0.80, 0.20, 0.01])

print("\n0-1 loss (soft):", zero_one_loss_from_proba(y_true, p_soft))
print("0-1 loss (confident):", zero_one_loss_from_proba(y_true, p_confident))
print("log loss (soft):", log_loss_binary(y_true, p_soft))
print("log loss (confident):", log_loss_binary(y_true, p_confident))


## 4) Using 0-1 loss for selection/optimization

Because 0-1 loss is a **step function** in the threshold (and in the model parameters), it is typically used as a **selection criterion** rather than a differentiable training objective.

A very common and practical "optimization" task is threshold tuning:

$$
t^* \in \arg\min_{t\in[0,1]}\ L\bigl(y,\ \mathbb{1}[\hat{p}\ge t]\bigr).
$$

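This is tractable because it is a **1D search**: $n$ predicted probabilities induce at most $n+1$ distinct labelings, so a grid search approximates the optimum and a search over the unique predicted probabilities finds it exactly.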

If you care more about one class (asymmetric costs), you can encode that with `sample_weight` (or with an explicit cost-sensitive threshold rule).
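
One way to see how `sample_weight` encodes asymmetric costs: weight each example by the cost of misclassifying it, and the weighted mistake count equals the total cost. A minimal sketch with made-up labels and costs (`c01`, `c10` follow the notation of Section 1):

```python
import numpy as np

# Made-up labels and hard predictions
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 1, 0])
c01, c10 = 1.0, 3.0   # made-up costs: false positive = 1, false negative = 3

# A mistake on a positive is a false negative, on a negative a false positive
w = np.where(y_true == 1, c10, c01)
wrong = (y_true != y_pred)
weighted_count = float(np.sum(w * wrong))   # unnormalized weighted 0-1 loss

fn = int(np.sum((y_true == 1) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
total_cost = c10 * fn + c01 * fp

print(weighted_count, total_cost)
```

Minimizing the weighted 0-1 loss over the threshold is therefore the same as minimizing total misclassification cost on that sample.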


In [None]:
# Grid search threshold (approximate)
thresholds = np.linspace(0.0, 1.0, 2001)
losses_grid = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t) for t in thresholds])
min_loss_grid = float(losses_grid.min())
min_idx = np.where(losses_grid == min_loss_grid)[0]
t_grid_low = float(thresholds[int(min_idx[0])])
t_grid_high = float(thresholds[int(min_idx[-1])])

# Exact threshold search (evaluate unique p_hat values)
t_exact, loss_exact = best_threshold_zero_one(y, p_hat)

print(f"grid-search min loss: {min_loss_grid:.4f} (t in [{t_grid_low:.4f}, {t_grid_high:.4f}])")
print(f"exact-search min loss: {loss_exact:.4f} (one optimal t={t_exact:.4f})")

# Weighted: make mistakes on positives 3x more costly
w_pos = np.where(y == 1, 3.0, 1.0)
t_w, loss_w = best_threshold_zero_one(y, p_hat, sample_weight=w_pos)
print(f"weighted best t: {t_w:.4f} (loss={loss_w:.4f})")

losses_unweighted = losses_grid  # already computed above on the same threshold grid
losses_weighted = np.array([zero_one_loss_from_proba(y, p_hat, threshold=t, sample_weight=w_pos) for t in thresholds])

fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=losses_unweighted, mode="lines", line_shape="hv", name="unweighted"))
fig.add_trace(go.Scatter(x=thresholds, y=losses_weighted, mode="lines", line_shape="hv", name="weighted (pos×3)"))
fig.add_vline(x=t_exact, line_dash="dot", line_color="black")
fig.add_vline(x=t_w, line_dash="dot", line_color="crimson")
fig.update_layout(title="Threshold tuning for 0-1 loss (unweighted vs weighted)")
fig.update_xaxes(title_text="threshold t")
fig.update_yaxes(title_text="0-1 loss", range=[0, 1])
fig

### 4.1 Why 0-1 loss is hard to optimize directly (and what we do instead)

If a classifier depends on parameters $\theta$ (e.g. linear model weights), the empirical 0-1 loss is:

$$
L(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[\hat{y}(x_i;\theta) \ne y_i].
$$

This function is:
- **discontinuous / non-differentiable** (jumps when a point flips sides)
- typically **non-convex** and full of plateaus
- hard to minimize exactly for most hypothesis classes

So in practice we train with a **surrogate loss** that is smooth and easier to optimize (e.g. log loss / cross-entropy for logistic regression), and then evaluate with 0-1 loss.

The plots below compare the loss landscapes for a simple 1D logistic model.
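
Before looking at the surfaces, a small numerical sketch of the same point. It uses synthetic data and a hypothetical helper `both_losses`, with the bias fixed at 0 so the decision boundary stays at $x = 0$ for any $w > 0$. Nudging $w$ then changes confidence but flips no labels.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + 0.5 * rng.normal(size=200) > 0).astype(int)   # synthetic noisy labels

def both_losses(w, b=0.0):
    """Return (0-1 loss, log loss) of the 1D logistic model sigmoid(w * x + b)."""
    z = w * x + b
    y_hat = (z >= 0).astype(int)                        # same rule as sigmoid(z) >= 0.5
    p = np.clip(1.0 / (1.0 + np.exp(-z)), 1e-12, 1 - 1e-12)
    zero_one = float(np.mean(y_hat != y))
    log_loss = float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
    return zero_one, log_loss

h = 1e-4
z01_a, ll_a = both_losses(1.0)
z01_b, ll_b = both_losses(1.0 + h)

print("0-1 loss slope:", (z01_b - z01_a) / h)   # exactly 0: no label flipped
print("log loss slope:", (ll_b - ll_a) / h)     # nonzero: a usable descent direction
```

The finite-difference slope of the 0-1 loss is exactly zero here, so a gradient step would not move, while the log loss still reports which way to go.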


In [None]:
n = 120
x = rng.normal(size=n)
x = (x - x.mean()) / x.std()

p_true = sigmoid(2.0 * x - 0.4)
y = rng.binomial(1, p_true)

w_grid = np.linspace(-6, 6, 151)
b_grid = np.linspace(-6, 6, 151)

Z = x[:, None, None] * w_grid[None, None, :] + b_grid[None, :, None]
P = sigmoid(Z)

y_pred = (P >= 0.5).astype(int)
loss01 = (y[:, None, None] != y_pred).mean(axis=0)

eps = 1e-12
P_clip = np.clip(P, eps, 1 - eps)
losslog = -(y[:, None, None] * np.log(P_clip) + (1 - y[:, None, None]) * np.log(1 - P_clip)).mean(axis=0)

# Gradient descent on log loss (same simple 1D model)
w = 0.0
b = 0.0
lr = 0.8
w_path = [w]
b_path = [b]
for _ in range(40):
    z = w * x + b
    p = sigmoid(z)
    grad = p - y
    grad_w = float(np.mean(grad * x))
    grad_b = float(np.mean(grad))
    w -= lr * grad_w
    b -= lr * grad_b
    w_path.append(w)
    b_path.append(b)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("0-1 loss (threshold=0.5)", "log loss (smooth surrogate)"),
    horizontal_spacing=0.12,
)

fig.add_trace(
    go.Heatmap(x=w_grid, y=b_grid, z=loss01, zmin=0, zmax=1, colorbar=dict(title="0-1")),
    row=1,
    col=1,
)
fig.add_trace(
    go.Heatmap(x=w_grid, y=b_grid, z=losslog, colorbar=dict(title="log")),
    row=1,
    col=2,
)

fig.add_trace(go.Scatter(x=w_path, y=b_path, mode="lines+markers", name="GD path"), row=1, col=1)
fig.add_trace(go.Scatter(x=w_path, y=b_path, mode="lines+markers", showlegend=False), row=1, col=2)

fig.update_xaxes(title_text="w", row=1, col=1)
fig.update_xaxes(title_text="w", row=1, col=2)
fig.update_yaxes(title_text="b", row=1, col=1)
fig.update_yaxes(title_text="b", row=1, col=2)
fig.update_layout(title="0-1 loss is piecewise-constant; log loss provides a smooth optimization landscape")
fig

### 4.2 Example: train logistic regression (from scratch), evaluate 0-1 loss

We'll fit a simple logistic regression model by minimizing **log loss** with gradient descent, while tracking **0-1 loss** on train/validation.

Model:

$$
\hat{p}_i = \sigma(x_i^\top w + b),\qquad \sigma(z)=\frac{1}{1+e^{-z}}.
$$

Training objective (mean log loss):

$$
J(w,b) = -\frac{1}{n}\sum_{i=1}^n \Big(y_i\log\hat{p}_i + (1-y_i)\log(1-\hat{p}_i)\Big).
$$

Then we compute 0-1 loss by thresholding $\hat{p}$ at $t=0.5$ (and optionally tuning $t$ on validation).
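
The helper `fit_logistic_regression_gd` defined earlier relies on the analytic gradient of $J$, namely $\nabla_w J = \frac{1}{n}X^\top(\hat{p}-y)$ and $\partial J/\partial b = \frac{1}{n}\sum_i(\hat{p}_i-y_i)$ (without the L2 term). A minimal sketch that checks these expressions against central finite differences on random data; the data, the point `(w, b)`, and the step `h` are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))          # random features (illustrative)
y = rng.integers(0, 2, size=50)       # random binary labels (illustrative)
w = np.array([0.3, -0.2])
b = 0.1

def J(w, b):
    """Mean log loss of the logistic model at (w, b)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Analytic gradient
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
grad_w = X.T @ (p - y) / len(y)
grad_b = float(np.mean(p - y))

# Central finite differences
h = 1e-6
num_grad_w = np.array([(J(w + h * e, b) - J(w - h * e, b)) / (2 * h) for e in np.eye(2)])
num_grad_b = (J(w, b + h) - J(w, b - h)) / (2 * h)

print(grad_w, num_grad_w)
print(grad_b, num_grad_b)
```

The two estimates should agree to several decimal places; a mismatch usually points at a sign or normalization error in the gradient.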


In [None]:
X, y = make_blobs(
    n_samples=900,
    centers=2,
    n_features=2,
    cluster_std=2.2,
    random_state=0,
)

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0,
    stratify=y,
)

X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)

w, b, hist = fit_logistic_regression_gd(
    X_train_s,
    y_train,
    X_val=X_val_s,
    y_val=y_val,
    lr=0.2,
    n_steps=250,
    l2=0.01,
    threshold=0.5,
)

p_val = sigmoid(X_val_s @ w + b)
val_loss_05 = zero_one_loss_from_proba(y_val, p_val, threshold=0.5)
t_best, val_loss_best = best_threshold_zero_one(y_val, p_val)

print(f"val 0-1 loss @ t=0.5: {val_loss_05:.4f}")
print(f"best val threshold: {t_best:.4f} (val 0-1 loss={val_loss_best:.4f})")

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace(go.Scatter(x=hist["step"], y=hist["train_log_loss"], name="train log loss"), secondary_y=False)
fig.add_trace(go.Scatter(x=hist["step"], y=hist["val_log_loss"], name="val log loss"), secondary_y=False)

fig.add_trace(
    go.Scatter(x=hist["step"], y=hist["train_zero_one"], name="train 0-1 loss", line_shape="hv"),
    secondary_y=True,
)
fig.add_trace(
    go.Scatter(x=hist["step"], y=hist["val_zero_one"], name="val 0-1 loss", line_shape="hv"),
    secondary_y=True,
)

fig.update_xaxes(title_text="gradient descent step")
fig.update_yaxes(title_text="log loss", secondary_y=False)
fig.update_yaxes(title_text="0-1 loss", secondary_y=True, range=[0, 1])
fig.update_layout(title="Training with log loss; tracking 0-1 loss (step-like)")
fig.show()

# Decision boundary visualization
x0_min, x0_max = X_train_s[:, 0].min() - 0.8, X_train_s[:, 0].max() + 0.8
x1_min, x1_max = X_train_s[:, 1].min() - 0.8, X_train_s[:, 1].max() + 0.8

x0 = np.linspace(x0_min, x0_max, 220)
x1 = np.linspace(x1_min, x1_max, 220)
xx0, xx1 = np.meshgrid(x0, x1)
grid = np.c_[xx0.ravel(), xx1.ravel()]

prob_grid = sigmoid(grid @ w + b).reshape(xx0.shape)

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=x0,
        y=x1,
        z=prob_grid,
        contours=dict(start=0.0, end=1.0, size=0.1),
        colorscale="RdBu",
        opacity=0.85,
        colorbar=dict(title="P(y=1)"),
        showscale=True,
    )
)

fig.add_trace(
    go.Scatter(
        x=X_train_s[:, 0],
        y=X_train_s[:, 1],
        mode="markers",
        marker=dict(color=y_train, colorscale="Viridis", opacity=0.9, line=dict(width=0.2, color="black")),
        name="train points",
    )
)

fig.update_layout(title="Logistic regression probabilities (0-1 loss comes from thresholding)")
fig.update_xaxes(title_text="x0 (standardized)")
fig.update_yaxes(title_text="x1 (standardized)")
fig

## 5) Pros, cons, and when to use 0-1 loss

### Pros
- **Highly interpretable**: "error rate" (or # mistakes)
- **Threshold/decision-rule focused**: directly measures what many applications care about (correct label)
- **Works for multiclass** with no extra machinery
- **Aligns with the Bayes classifier** under equal misclassification costs (argmax posterior)

### Cons
- **Piecewise constant** (discontinuous at the jumps, zero gradient everywhere else) → not suitable as a gradient-based training loss
- **Ignores confidence and calibration**: 0.51 and 0.99 are treated the same after thresholding
- **Can be misleading under class imbalance** (a majority-class classifier can look good)
- **Depends on the decision rule** (threshold choice, argmax ties, cost-sensitive adjustments)
- **Multilabel subset 0-1 is very strict** (one wrong label makes the whole example wrong)

### When it's a good choice
- Reporting final performance when **all errors are equally costly**
- Comparing classifiers after you have a clear, fixed **threshold / decision policy**
- Hyperparameter selection when you truly care about **accuracy/error rate** (using a validation set)


## Common pitfalls + diagnostics

- **Class imbalance**: 0-1 loss/accuracy may hide poor minority performance. Also inspect the confusion matrix; consider balanced accuracy, F1, PR AUC.
- **Wrong threshold**: if your positive class is rare or costs are asymmetric, $t=0.5$ is often not optimal; tune $t$ or use cost-sensitive decision rules.
- **Multilabel strictness**: subset 0-1 can be too harsh; consider Hamming loss, Jaccard score, or per-label F1.
- **Probability quality not measured**: two models can have the same 0-1 loss but very different calibration; also report log loss / Brier score if probabilities matter.
- **Test-set threshold tuning**: choose thresholds/hyperparameters on validation (or via cross-validation), not on the test set.
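
The class-imbalance pitfall is easy to reproduce. A minimal sketch with made-up labels (95 negatives, 5 positives) and a classifier that always predicts the majority class:

```python
import numpy as np

# Made-up imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)          # always predict the majority class

zero_one = float(np.mean(y_true != y_pred))

# Per-class recall exposes what the aggregate error rate hides
recall_pos = float(np.mean(y_pred[y_true == 1] == 1))
recall_neg = float(np.mean(y_pred[y_true == 0] == 0))
balanced_acc = 0.5 * (recall_pos + recall_neg)

print(f"0-1 loss: {zero_one:.2f}")
print(f"recall (pos): {recall_pos:.2f}, recall (neg): {recall_neg:.2f}")
print(f"balanced accuracy: {balanced_acc:.2f}")
```

A 5% error rate looks strong, yet the classifier never finds a single positive; balanced accuracy sits at chance level.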


## Exercises

1) Prove that normalized 0-1 loss is exactly $1-\text{accuracy}$.
2) Derive the cost-sensitive threshold $\eta(x)\ge \frac{c_{01}}{c_{01}+c_{10}}$ from expected cost minimization.
3) Construct two classifiers with the same 0-1 loss but very different log loss. When would you prefer each?
4) Extend `best_threshold_zero_one` to return *all* thresholds achieving the minimum.
5) For multilabel data, compare subset 0-1 loss vs Hamming loss on a synthetic example and interpret the difference.


## References

- scikit-learn `zero_one_loss`: https://scikit-learn.org/stable/api/generated/sklearn.metrics.zero_one_loss.html
- scikit-learn `accuracy_score`: https://scikit-learn.org/stable/api/generated/sklearn.metrics.accuracy_score.html
- Hastie, Tibshirani, Friedman: *The Elements of Statistical Learning*, Ch. 2 (overview of supervised learning, including statistical decision theory and the Bayes classifier), Ch. 4 (linear methods for classification)
