# hamming_loss (bitwise error rate for multilabel classification)

`hamming_loss` measures the fraction of labels that are **wrong**.

- For standard (single-label) classification it reduces to the **misclassification rate**.
- For multilabel classification it averages mistakes across the `(sample, label)` grid: *what fraction of the bits did we get wrong?*

## Learning goals
- write the multiclass and multilabel formulas (with clear notation)
- build intuition with plots (what counts as an error)
- implement Hamming loss from scratch in NumPy (including `sample_weight`)
- see how Hamming loss interacts with probability thresholds in multilabel logistic regression
- know pros/cons and when to prefer other metrics

## Quick import

```python
from sklearn.metrics import hamming_loss
```

## Table of contents
1. Definitions and notation
2. Intuition (plots)
3. NumPy implementation + sanity checks
4. Using Hamming loss for threshold tuning (multilabel logistic regression)
5. Pros, cons, pitfalls


In [None]:
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import hamming_loss as sk_hamming_loss
from sklearn.model_selection import train_test_split

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.random.seed(0)
np.set_printoptions(precision=3, suppress=True)


## 1) Definitions and notation

Assume we have $n$ samples.

### Single-label classification (binary or multiclass)

- True label: $y_i \in \{0,1,\dots,K-1\}$
- Predicted label: $\hat{y}_i \in \{0,1,\dots,K-1\}$

With one label per sample, the Hamming loss is the fraction of samples whose **predicted label is wrong**:

$$
\operatorname{HL}(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[y_i \neq \hat{y}_i]
$$

For single-label classification this is exactly the **misclassification rate**, i.e. `1 - accuracy`, which is also what `sklearn.metrics.zero_one_loss` returns with its default `normalize=True`.
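
As a quick check of the single-label case, the sketch below compares three ways of computing the same number. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss, zero_one_loss

# Illustrative labels (not from the notebook): 5 samples, 3 classes.
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 1])

hl = hamming_loss(y_true, y_pred)            # fraction of mismatched samples
zol = zero_one_loss(y_true, y_pred)          # normalize=True by default
err = 1.0 - accuracy_score(y_true, y_pred)   # misclassification rate

print(f'hamming_loss : {hl:.3f}')
print(f'zero_one_loss: {zol:.3f}')
print(f'1 - accuracy : {err:.3f}')
```

Two of the five predictions differ from the truth, so all three values come out as 0.4.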

### Multilabel classification (label indicator matrix)

- True labels: $Y \in \{0,1\}^{n\times L}$ (each row can have multiple 1s)
- Predictions: $\hat{Y} \in \{0,1\}^{n\times L}$

Hamming loss counts mismatches over all `(sample, label)` decisions:

$$
\operatorname{HL}(Y,\hat{Y})
= \frac{1}{nL}\sum_{i=1}^n\sum_{\ell=1}^L \mathbf{1}[Y_{i\ell} \neq \hat{Y}_{i\ell}]
$$

Equivalently, it is the average **Hamming distance per sample**, normalized by $L$.
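
The equivalence with per-sample Hamming distance can be sketched on a tiny hand-made example (the matrices below are illustrative, chosen so the counts are easy to verify by eye):

```python
import numpy as np

# Illustrative 3 x 4 indicator matrices.
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 1, 1, 1]])

n, L = Y_true.shape

# Hamming distance of each row: number of differing bits.
hamming_dist = (Y_true != Y_pred).sum(axis=1)

hl_from_dist = hamming_dist.mean() / L     # average distance, normalized by L
hl_direct = (Y_true != Y_pred).mean()      # mean over all n * L entries

print('per-sample Hamming distance:', hamming_dist)
print('HL via distances:', hl_from_dist)
print('HL via grid mean:', hl_direct)
```

The rows have 1, 0, and 2 wrong bits, giving 3 errors over 12 decisions, i.e. a Hamming loss of 0.25 either way.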

### Relationship to micro-accuracy

If you treat each `(sample, label)` as a binary decision, then:

$$
\text{micro-accuracy} = \frac{TP + TN}{nL}
\quad\Rightarrow\quad
\operatorname{HL} = 1 - \text{micro-accuracy}
$$
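
A numerical sanity check of this identity, using random indicator matrices (the shapes and seed are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.metrics import hamming_loss

rng = np.random.default_rng(0)
Y_true = rng.integers(0, 2, size=(50, 5))
Y_pred = rng.integers(0, 2, size=(50, 5))

# Confusion counts pooled over every (sample, label) entry.
tp = np.sum((Y_true == 1) & (Y_pred == 1))
tn = np.sum((Y_true == 0) & (Y_pred == 0))
fp = np.sum((Y_true == 0) & (Y_pred == 1))
fn = np.sum((Y_true == 1) & (Y_pred == 0))

micro_acc = (tp + tn) / Y_true.size
hl = hamming_loss(Y_true, Y_pred)

print('1 - micro-accuracy:', 1.0 - micro_acc)
print('(FP + FN) / (nL)  :', (fp + fn) / Y_true.size)
print('hamming_loss      :', hl)
```

Because every entry is exactly one of TP, TN, FP, or FN, the wrong entries are the FP and FN ones, which is why the three printed values agree.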

### Contrast with subset accuracy (exact match)

Subset accuracy (a.k.a. *exact match ratio*) for multilabel requires getting **all labels** correct for a sample:

$$
\text{subset-accuracy}(Y,\hat{Y}) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[Y_{i,:} = \hat{Y}_{i,:}]
$$

Hamming loss is more forgiving: getting 1 label wrong out of 20 is a small penalty, while subset accuracy would count the whole sample as wrong.
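
To make the contrast concrete, here is a small sketch with made-up data: 10 samples, 20 labels, and exactly one flipped label per sample. It relies on `sklearn.metrics.accuracy_score`, which computes subset accuracy when given multilabel indicator matrices.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

n, L = 10, 20
Y_true = np.zeros((n, L), dtype=int)
Y_true[:, :5] = 1                      # arbitrary ground truth

Y_pred = Y_true.copy()
rows = np.arange(n)
Y_pred[rows, rows] = 1 - Y_pred[rows, rows]   # flip label i in sample i

hl = hamming_loss(Y_true, Y_pred)
subset_acc = accuracy_score(Y_true, Y_pred)   # exact-match ratio for multilabel

print(f'Hamming loss    : {hl:.3f}')
print(f'Subset accuracy : {subset_acc:.3f}')
```

Every sample has one wrong bit out of 20, so the Hamming loss is 0.05 while the subset accuracy is 0.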


## 2) Intuition (plots)

Think of each label decision as a **bit**.

- `0` means perfect predictions.
- `0.25` means 25% of all bits are wrong.

Below we visualize `Y_true`, `Y_pred`, and the mismatch matrix `(Y_true != Y_pred)`.


In [None]:
Y_true = np.array(
    [
        [1, 0, 0, 1, 0, 1],
        [0, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 1, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 1, 1, 0, 0],
        [0, 1, 0, 0, 1, 1],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
    ],
    dtype=int,
)

Y_pred = np.array(
    [
        [1, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 1, 0],
        [0, 0, 0, 1, 1, 0],
        [1, 0, 0, 1, 0, 0],
        [0, 1, 0, 1, 0, 1],
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
    ],
    dtype=int,
)

mismatch = (Y_true != Y_pred).astype(int)

hl_manual = float(mismatch.mean())
hl_sklearn = float(sk_hamming_loss(Y_true, Y_pred))

subset_acc = float(np.mean(np.all(Y_true == Y_pred, axis=1)))

print(f'Hamming loss (manual) : {hl_manual:.3f}')
print(f'Hamming loss (sklearn): {hl_sklearn:.3f}')
print(f'Subset accuracy       : {subset_acc:.3f}')


In [None]:
n_samples, n_labels = Y_true.shape
x_labels = [f'label_{j}' for j in range(n_labels)]
y_labels = [f'sample_{i}' for i in range(n_samples)]

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=['Y_true', 'Y_pred', 'Mismatch (1 = wrong)'],
)

fig.add_trace(
    go.Heatmap(
        z=Y_true,
        x=x_labels,
        y=y_labels,
        colorscale='Blues',
        zmin=0,
        zmax=1,
        showscale=False,
    ),
    row=1,
    col=1,
)

fig.add_trace(
    go.Heatmap(
        z=Y_pred,
        x=x_labels,
        y=y_labels,
        colorscale='Greens',
        zmin=0,
        zmax=1,
        showscale=False,
    ),
    row=1,
    col=2,
)

fig.add_trace(
    go.Heatmap(
        z=mismatch,
        x=x_labels,
        y=y_labels,
        colorscale=[[0, '#ffffff'], [1, '#d62728']],
        zmin=0,
        zmax=1,
        showscale=False,
    ),
    row=1,
    col=3,
)

fig.update_layout(
    title=f'Hamming loss = {hl_manual:.3f} (fraction of wrong bits)',
    height=420,
)
fig

In [None]:
per_sample = mismatch.mean(axis=1)
per_label = mismatch.mean(axis=0)

fig1 = px.bar(
    x=[f'sample_{i}' for i in range(n_samples)],
    y=per_sample,
    title='Per-sample contribution: fraction of wrong labels',
    labels={'x': 'sample', 'y': 'wrong-label fraction'},
)
fig1.add_hline(y=hl_manual, line_dash='dash', annotation_text='global HL')
fig1.update_yaxes(range=[0, 1])
fig1.show()

fig2 = px.bar(
    x=x_labels,
    y=per_label,
    title='Per-label error rate',
    labels={'x': 'label', 'y': 'error rate'},
)
fig2.add_hline(y=hl_manual, line_dash='dash', annotation_text='global HL')
fig2.update_yaxes(range=[0, 1])
fig2.show()


### A common pitfall: multiclass as one-hot vs integer labels

For multiclass problems you often have **one true class** per sample.

- If you pass integer labels (`shape = (n,)`), Hamming loss is the misclassification rate.
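- If you convert to one-hot (`shape = (n, K)`), a single wrong prediction creates **two bit errors** (one FN + one FP) out of $K$ bits, so the value becomes $2/K$ times the misclassification rate (assuming each prediction is itself one-hot).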

Below we compare the two representations.


In [None]:
y_true_mc = np.array([0, 1, 2, 2, 1, 0])
y_pred_mc = np.array([0, 2, 2, 1, 1, 0])

hl_int = float(sk_hamming_loss(y_true_mc, y_pred_mc))

K = 3
Y_true_oh = np.eye(K, dtype=int)[y_true_mc]
Y_pred_oh = np.eye(K, dtype=int)[y_pred_mc]

hl_onehot = float(sk_hamming_loss(Y_true_oh, Y_pred_oh))

print(f'Hamming loss with integer labels: {hl_int:.3f}')
print(f'Hamming loss with one-hot labels: {hl_onehot:.3f}  (note the scaling)')


## 3) NumPy implementation + sanity checks

A from-scratch implementation is straightforward once you remember the definition: **count mismatches and average**.

`sample_weight` in `sklearn.metrics.hamming_loss` applies at the **sample** level:

- compute per-sample Hamming loss (mean mismatches across labels)
- take a weighted average across samples
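
A tiny worked example of these two steps, with made-up labels and weights, before the general implementation:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Illustrative data: sample 0 has 1 of 4 labels wrong, sample 1 has 3 of 4.
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1]])
Y_pred = np.array([[1, 0, 1, 1],
                   [1, 0, 1, 1]])
w = np.array([3.0, 1.0])   # sample 0 counts three times as much

per_sample = (Y_true != Y_pred).mean(axis=1)        # [0.25, 0.75]
hl_weighted = np.average(per_sample, weights=w)     # (3*0.25 + 1*0.75) / 4
hl_unweighted = per_sample.mean()

print('per-sample HL :', per_sample)
print('unweighted HL :', hl_unweighted)
print('weighted HL   :', hl_weighted)
print('sklearn       :', hamming_loss(Y_true, Y_pred, sample_weight=w))
```

Up-weighting the better-predicted sample pulls the loss from 0.5 down to 0.375.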


In [None]:
def hamming_loss_np(y_true, y_pred, *, sample_weight=None) -> float:
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f'shape mismatch: y_true {y_true.shape} vs y_pred {y_pred.shape}')

    if y_true.ndim == 1:
        mismatches = (y_true != y_pred).astype(float)
        if sample_weight is None:
            return float(mismatches.mean())

        w = np.asarray(sample_weight, dtype=float)
        if w.shape != (y_true.shape[0],):
            raise ValueError(f'sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}')
        return float(np.average(mismatches, weights=w))

    if y_true.ndim == 2:
        mismatches = (y_true != y_pred).astype(float)
        per_sample = mismatches.mean(axis=1)

        if sample_weight is None:
            return float(per_sample.mean())

        w = np.asarray(sample_weight, dtype=float)
        if w.shape != (y_true.shape[0],):
            raise ValueError(f'sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}')
        return float(np.average(per_sample, weights=w))

    raise ValueError('y_true and y_pred must be 1D (single-label) or 2D (multilabel)')


In [None]:
# Sanity checks vs scikit-learn
rng = np.random.default_rng(0)

# 1) single-label (multiclass)
y_true_1d = rng.integers(0, 4, size=200)
y_pred_1d = rng.integers(0, 4, size=200)

print(
    '1D close?',
    np.allclose(
        hamming_loss_np(y_true_1d, y_pred_1d),
        sk_hamming_loss(y_true_1d, y_pred_1d),
    ),
)

# 2) multilabel indicator
Y_true_2d = rng.integers(0, 2, size=(120, 7))
Y_pred_2d = rng.integers(0, 2, size=(120, 7))

print(
    '2D close?',
    np.allclose(
        hamming_loss_np(Y_true_2d, Y_pred_2d),
        sk_hamming_loss(Y_true_2d, Y_pred_2d),
    ),
)

# 3) sample weights
w = rng.random(size=Y_true_2d.shape[0])

hl_np_w = hamming_loss_np(Y_true_2d, Y_pred_2d, sample_weight=w)
hl_sk_w = float(sk_hamming_loss(Y_true_2d, Y_pred_2d, sample_weight=w))

print('weighted close?', np.allclose(hl_np_w, hl_sk_w))
print('weighted value:', hl_np_w)


## 4) Using Hamming loss for threshold tuning (multilabel logistic regression)

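Hamming loss is defined on **hard predictions** (`0/1`), so as a function of the model parameters it is piecewise constant: its gradient is zero almost everywhere, and it **cannot be minimized directly by gradient descent**.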

A common pattern is:

1. Train a probabilistic model (e.g. multilabel logistic regression) by minimizing a differentiable surrogate (binary cross-entropy / `log_loss`).
2. Convert probabilities to hard labels with a threshold $t$.
3. Choose $t$ (or per-label thresholds) to minimize Hamming loss on a validation set.

### Model

For $L$ labels, we use independent sigmoids:

$$
Z = XW + b,\quad P = \sigma(Z)
$$

Prediction with a threshold $t$:

$$
\hat{Y}_{i\ell} = \mathbf{1}[P_{i\ell} \ge t]
$$

We will train with average binary cross-entropy (from logits):

$$
J(W,b) = \frac{1}{nL}\sum_{i=1}^n\sum_{\ell=1}^L \Big(\operatorname{softplus}(Z_{i\ell}) - Y_{i\ell} Z_{i\ell}\Big)
$$
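
The logit form follows from $\log \sigma(z) = z - \operatorname{softplus}(z)$ and $\log(1 - \sigma(z)) = -\operatorname{softplus}(z)$, which turn $-y\log p - (1-y)\log(1-p)$ into $\operatorname{softplus}(z) - yz$. The sketch below checks this numerically on random values; the helper names are local to this example, and `np.logaddexp(0, z)` is used here as one stable way to compute softplus.

```python
import numpy as np


def sigmoid_demo(z):
    # Plain sigmoid; adequate for the moderate logits used in this check.
    return 1.0 / (1.0 + np.exp(-z))


def softplus_demo(z):
    # log(1 + exp(z)) computed without overflow.
    return np.logaddexp(0.0, z)


rng = np.random.default_rng(0)
Z = rng.normal(scale=3.0, size=(4, 3))   # illustrative logits
Y = rng.integers(0, 2, size=(4, 3))      # illustrative 0/1 targets

P = sigmoid_demo(Z)
bce_from_probs = -np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P))
bce_from_logits_demo = np.mean(softplus_demo(Z) - Y * Z)

print('BCE from probabilities:', bce_from_probs)
print('BCE from logits       :', bce_from_logits_demo)
```

Working from logits avoids taking `log` of probabilities that have rounded to exactly 0 or 1.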

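Then we will *tune* $t$ to minimize Hamming loss. If the predicted probabilities were perfectly calibrated, $t = 0.5$ would already minimize the expected Hamming loss, so any gain from tuning reflects miscalibration or finite-sample noise.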


In [None]:
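def sigmoid(z):
    # Stable sigmoid: only exponentiates non-positive numbers, so no overflow
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))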


def softplus(z):
    # Stable softplus: log(1 + exp(z))
    z = np.asarray(z, dtype=float)
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0.0)


def bce_from_logits(Y, Z) -> float:
    Y = np.asarray(Y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    return float(np.mean(softplus(Z) - Y * Z))


def standardize_fit_transform(X):
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std


def standardize_transform(X, mean, std):
    X = np.asarray(X, dtype=float)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std


def fit_multilabel_logreg_gd(
    X_train,
    Y_train,
    X_val=None,
    Y_val=None,
    *,
    lr=0.8,
    n_steps=400,
    l2=0.0,
    threshold=0.5,
):
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=float)
    n_samples, n_features = X_train.shape
    n_labels = Y_train.shape[1]

    W = np.zeros((n_features, n_labels))
    b = np.zeros(n_labels)

    history = {
        'step': [],
        'train_bce': [],
        'train_hl': [],
        'val_bce': [],
        'val_hl': [],
    }

    for step in range(n_steps):
        Z = X_train @ W + b
        P = sigmoid(Z)

        # Record metrics for the current (pre-update) parameters so that the
        # train and validation curves at a given step describe the same W, b.
        history['step'].append(step)
        history['train_bce'].append(bce_from_logits(Y_train, Z))
        history['train_hl'].append(
            hamming_loss_np(Y_train.astype(int), (P >= threshold).astype(int))
        )

        if X_val is not None and Y_val is not None:
            Z_val = X_val @ W + b
            P_val = sigmoid(Z_val)
            history['val_bce'].append(bce_from_logits(Y_val, Z_val))
            history['val_hl'].append(
                hamming_loss_np(Y_val.astype(int), (P_val >= threshold).astype(int))
            )
        else:
            history['val_bce'].append(None)
            history['val_hl'].append(None)

        # dJ/dZ = (P - Y) / (n_samples * n_labels) when J is the mean over all entries
        G = (P - Y_train) / (n_samples * n_labels)
        grad_W = X_train.T @ G + l2 * W
        grad_b = G.sum(axis=0)

        # Gradient descent update
        W -= lr * grad_W
        b -= lr * grad_b

    return W, b, history


In [None]:
# Synthetic multilabel dataset
rng = np.random.default_rng(1)

n_samples = 1600
n_features = 8
n_labels = 6

X = rng.normal(size=(n_samples, n_features))

W_true = rng.normal(scale=1.2, size=(n_features, n_labels))
# Make some labels rarer than others by shifting biases
b_true = np.linspace(-2.0, 0.5, n_labels)

Z_true = X @ W_true + b_true
P_true = sigmoid(Z_true)
Y = (rng.random(size=P_true.shape) < P_true).astype(int)

X_train, X_val, Y_train, Y_val = train_test_split(
    X,
    Y,
    test_size=0.3,
    random_state=0,
)

X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)

W, b, hist = fit_multilabel_logreg_gd(
    X_train_s,
    Y_train,
    X_val=X_val_s,
    Y_val=Y_val,
    lr=0.9,
    n_steps=300,
    l2=0.0,
    threshold=0.5,
)

fig = make_subplots(specs=[[{'secondary_y': True}]])
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_bce'], name='train BCE'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_bce'], name='val BCE'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_hl'], name='train Hamming loss'), secondary_y=True)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_hl'], name='val Hamming loss'), secondary_y=True)

fig.update_xaxes(title_text='gradient descent step')
fig.update_yaxes(title_text='binary cross-entropy (lower is better)', secondary_y=False)
fig.update_yaxes(title_text='Hamming loss (lower is better)', secondary_y=True, range=[0, 1])
fig.update_layout(title='Train with BCE, monitor Hamming loss at threshold=0.5', height=480)
fig

In [None]:
# Tune the probability threshold to minimize validation Hamming loss
Z_val = X_val_s @ W + b
P_val = sigmoid(Z_val)

thresholds = np.linspace(0.05, 0.95, 91)
hl_vals = []
for t in thresholds:
    Y_hat_val = (P_val >= t).astype(int)
    hl_vals.append(hamming_loss_np(Y_val, Y_hat_val))

hl_vals = np.array(hl_vals)
best_idx = int(np.argmin(hl_vals))
best_t = float(thresholds[best_idx])

t05_idx = int(np.where(np.isclose(thresholds, 0.5))[0][0])
hl_at_05 = float(hl_vals[t05_idx])
hl_best = float(hl_vals[best_idx])

print(f'Validation HL at t=0.50: {hl_at_05:.4f}')
print(f'Best threshold t*:      {best_t:.2f}')
print(f'Validation HL at t*:    {hl_best:.4f}')


In [None]:
fig = px.line(
    x=thresholds,
    y=hl_vals,
    title='Validation Hamming loss vs threshold',
    labels={'x': 'threshold t', 'y': 'Hamming loss'},
)
fig.add_vline(x=0.5, line_dash='dash', line_color='gray', annotation_text='t=0.5')
fig.add_vline(x=best_t, line_dash='dash', line_color='green', annotation_text='best t*')
fig.update_yaxes(range=[0, 1])
fig

In [None]:
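# Optional: per-label threshold tuning (can reduce HL, e.g. when calibration differs across labels).
# On this validation set it can never do worse than the single shared t*, because each label
# searches the same grid. The thresholds are chosen and scored on the same data, so the
# reported HL is optimistic; confirm any gain on a held-out test set.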
per_label_thresholds = np.zeros(n_labels)

for j in range(n_labels):
    errs = []
    for t in thresholds:
        pred_j = (P_val[:, j] >= t).astype(int)
        errs.append(float(np.mean(pred_j != Y_val[:, j])))
    per_label_thresholds[j] = thresholds[int(np.argmin(errs))]

Y_hat_per_label = (P_val >= per_label_thresholds).astype(int)
hl_per_label = hamming_loss_np(Y_val, Y_hat_per_label)

print('Per-label thresholds:', np.round(per_label_thresholds, 2))
print('Validation HL (single t*) :', hl_best)
print('Validation HL (per-label) :', hl_per_label)

fig = px.bar(
    x=[f'label_{j}' for j in range(n_labels)],
    y=per_label_thresholds,
    title='Per-label thresholds that minimize per-label error',
    labels={'x': 'label', 'y': 'best threshold'},
)
fig.update_yaxes(range=[0, 1])
fig

## 5) Pros, cons, pitfalls

### Pros
- **Simple and interpretable**: “fraction of wrong labels.”
- **Works naturally for multilabel**: does not require perfect set matches.
- **Label-wise averaging**: each label decision contributes equally (micro view over all bits).
- **Comparable across models** when the label space is fixed (same $L$).

### Cons / caveats
- **Can look deceptively good on sparse multilabel problems**: if most labels are 0, predicting all zeros yields many true negatives and a low Hamming loss.
- **Does not capture set quality**: predicting a wrong combination can still have a small Hamming loss if only a few bits differ.
- **Not differentiable**: not suitable as a direct gradient-based training objective; use a surrogate loss and treat Hamming loss as an evaluation metric.
- **Representation matters for multiclass**: integer labels vs one-hot produce different scales.

### Common pitfalls
- Passing **probabilities** instead of hard labels (threshold them first).
- Using one-hot for multiclass and interpreting the value as misclassification rate.
- Relying on Hamming loss alone with heavy class imbalance; complement with per-label precision/recall/F1, Jaccard score, or subset accuracy.
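
The imbalance caveat can be sketched with a synthetic sparse problem. The label density of 5%, the shapes, and the seed are arbitrary assumptions for illustration; the "model" here simply predicts all zeros.

```python
import numpy as np
from sklearn.metrics import hamming_loss, recall_score

rng = np.random.default_rng(0)

# Sparse ground truth: each of 20 labels is active about 5% of the time.
Y_true = (rng.random(size=(500, 20)) < 0.05).astype(int)
Y_zero = np.zeros_like(Y_true)          # trivial all-negative predictor

hl = hamming_loss(Y_true, Y_zero)       # equals the fraction of positive entries
rec = recall_score(Y_true, Y_zero, average='micro', zero_division=0)

print(f'label density : {Y_true.mean():.3f}')
print(f'Hamming loss  : {hl:.3f}')
print(f'micro recall  : {rec:.3f}')
```

The all-zeros predictor scores a Hamming loss equal to the label density while recovering none of the positive labels, which is why a recall-sensitive metric is worth reporting alongside it.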

### Where it’s a good fit
- Multilabel tagging where each label decision matters roughly equally (e.g. topic tags, attribute prediction).
- Problems where you want a single number that reflects “average per-label error rate,” not strict exact matches.


## Exercises
- Show algebraically that for multilabel indicators, `Hamming loss = 1 - micro-accuracy`.
- Construct a sparse multilabel dataset where predicting all zeros achieves a low Hamming loss but terrible recall.
- Implement a per-label F1 score and compare its behavior to Hamming loss under imbalance.

## References
- scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html
- Hamming distance (background): https://en.wikipedia.org/wiki/Hamming_distance
