# `accuracy_score` (classification accuracy)

Accuracy is the simplest classification metric: it’s the **fraction of samples you got exactly right**.

**You will learn:**
- the math definition (binary, multiclass, multilabel)
- a from-scratch NumPy implementation
- how **decision thresholds** change accuracy
- how to use accuracy when training a simple **logistic regression** model


In [None]:
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score as sk_accuracy_score
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)


## Prerequisites

- You know what **labels** $y \in \{0,1\}$ (binary) or $y \in \{1,\dots,K\}$ (multiclass) are.
- You’re comfortable with the idea that a classifier may output either:
  - **hard predictions** $\hat{y}$ (a class label), or
  - **scores/probabilities** (then you still need a rule to turn them into $\hat{y}$).

### Notation

- True labels: $y_1,\dots,y_n$
- Predicted labels: $\hat{y}_1,\dots,\hat{y}_n$
- Indicator function: $\mathbf{1}[\text{statement}]$ equals 1 if the statement is true, else 0.


## Definition

### Generic (binary or multiclass)

Accuracy is the **average of “correct?” indicators**:

$$
\operatorname{Acc}(y,\hat{y}) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[y_i = \hat{y}_i]
$$

With per-sample weights $w_i \ge 0$:

$$
\operatorname{Acc}_w(y,\hat{y}) = \frac{\sum_{i=1}^n w_i\,\mathbf{1}[y_i = \hat{y}_i]}{\sum_{i=1}^n w_i}
$$
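
To make the weighted formula concrete, here is a minimal sketch on four hand-picked samples. The labels and the `weights` values are made up for illustration; they are not taken from any dataset in this notebook.

```python
import numpy as np

# Toy labels and hypothetical per-sample weights (illustrative values only)
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 0, 0, 1])
weights = np.array([1.0, 1.0, 4.0, 2.0])

# Per-sample "correct?" indicators: [1, 1, 0, 1]
correct = (y_true == y_pred).astype(float)

# Unweighted: 3 correct out of 4
acc_unweighted = float(correct.mean())

# Weighted: (1 + 1 + 0 + 2) / (1 + 1 + 4 + 2) = 4 / 8
acc_weighted = float(np.sum(weights * correct) / np.sum(weights))

print(acc_unweighted)  # 0.75
print(acc_weighted)    # 0.5
```

The single mistake falls on the heaviest sample, so the weighted accuracy drops well below the unweighted one.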

### Binary (via confusion matrix)

With TP, TN, FP and FN denoting the counts of true positives, true negatives, false positives and false negatives:

$$
\operatorname{Acc} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{TN} + \mathrm{FN}}
$$
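
As a quick check that the confusion-matrix form agrees with the generic definition, the sketch below counts the four cells on a small made-up example and compares both results. The arrays are illustrative, and class 1 is assumed to be the positive class.

```python
import numpy as np

# Toy binary labels (illustrative); class 1 is treated as "positive"
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

# Accuracy from the four counts vs. the mean of "correct?" indicators
acc_from_counts = (tp + tn) / (tp + fp + tn + fn)
acc_direct = float(np.mean(y_true == y_pred))

print(tp, tn, fp, fn)               # 3 3 1 1
print(acc_from_counts, acc_direct)  # 0.75 0.75
```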

### Relation to 0–1 loss

Define the 0–1 loss per sample:

$$
\ell_{0/1}(y_i,\hat{y}_i) = \mathbf{1}[y_i \ne \hat{y}_i]
$$

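Because $\mathbf{1}[y_i = \hat{y}_i] = 1 - \mathbf{1}[y_i \ne \hat{y}_i]$, accuracy is one minus the mean 0–1 loss (the error rate):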

$$
\operatorname{Acc} = 1 - \frac{1}{n} \sum_{i=1}^n \ell_{0/1}(y_i,\hat{y}_i)
$$

### Multilabel (subset accuracy)

If each sample has a **vector** of labels $\mathbf{y}_i \in \{0,1\}^L$, scikit-learn’s `accuracy_score` uses **subset accuracy**:

$$
\operatorname{Acc}_{\text{subset}} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[\mathbf{y}_i = \hat{\mathbf{y}}_i]
$$

A sample counts as correct only if **all labels match exactly**.
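
The "all labels must match" rule can be sketched on a tiny example with three labels per sample. The arrays below are made up for illustration.

```python
import numpy as np

# Toy multilabel indicator arrays: 4 samples, 3 labels each (illustrative)
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 1],   # all labels match
                   [0, 1, 1],   # one label wrong
                   [1, 1, 0],   # all labels match
                   [1, 0, 0]])  # two labels wrong

# A row counts as correct only if every label in that row matches
row_correct = np.all(Y_true == Y_pred, axis=1)
subset_acc = float(np.mean(row_correct))

print(row_correct.tolist())  # [True, False, True, False]
print(subset_acc)            # 0.5
```

Note that the row with one wrong label and the row with two wrong labels are penalized identically.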


## Intuition: accuracy is an average of 0/1 per-sample outcomes

Each sample contributes either:
- 1 (correct prediction)
- 0 (incorrect prediction)

Accuracy is just the mean of that vector.


In [None]:
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

correct = (y_true == y_pred).astype(int)
acc = correct.mean()

print('per-sample correct:', correct)
print('accuracy:', acc)
print('sklearn accuracy:', sk_accuracy_score(y_true, y_pred))

fig = go.Figure()
fig.add_trace(
    go.Bar(
        x=np.arange(len(correct)),
        y=correct,
        marker_color=["#2ca02c" if c == 1 else "#d62728" for c in correct],
        name="correct (1) / wrong (0)",
    )
)
fig.add_hline(
    y=acc,
    line_dash="dash",
    line_color="black",
    annotation_text=f"accuracy = {acc:.2f}",
    annotation_position="top left",
)
fig.update_layout(
    title="Accuracy = mean of per-sample correctness",
    xaxis_title="sample index",
    yaxis_title="correct?",
    yaxis=dict(tickmode="array", tickvals=[0, 1]),
)
fig.show()


## From-scratch NumPy implementation


In [None]:
def accuracy_score_np(y_true, y_pred, *, sample_weight=None, normalize=True):
    '''Compute accuracy (and multilabel subset accuracy) using NumPy.

    Parameters
    ----------
    y_true, y_pred:
        1D arrays (n_samples,) for standard classification, or
        2D arrays (n_samples, n_labels) for multilabel subset accuracy.

    sample_weight:
        Optional array (n_samples,) of non-negative weights.

    normalize:
        If True, return a fraction in [0, 1]. If False, return the (weighted) count.
    '''

    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true {y_true.shape} vs y_pred {y_pred.shape}")

    if y_true.ndim == 1:
        correct = (y_true == y_pred)
    elif y_true.ndim == 2:
        # multilabel subset accuracy: all labels must match per sample
        correct = np.all(y_true == y_pred, axis=1)
    else:
        raise ValueError(f"expected 1D or 2D arrays, got ndim={y_true.ndim}")

    if sample_weight is None:
        if normalize:
            return float(np.mean(correct))
        return float(np.sum(correct))

    w = np.asarray(sample_weight)
    if w.ndim != 1 or w.shape[0] != correct.shape[0]:
        raise ValueError(f"sample_weight must be shape (n_samples,), got {w.shape}")

    correct_f = correct.astype(float)
    if normalize:
        return float(np.average(correct_f, weights=w))
    return float(np.sum(w * correct_f))


def predict_labels_from_proba(p, threshold=0.5):
    '''Turn probabilities into hard labels using a threshold.'''

    p = np.asarray(p)
    return (p >= threshold).astype(int)


In [None]:
# Sanity checks vs scikit-learn

# 1) Binary / multiclass (1D)
y_true = rng.integers(0, 3, size=200)
y_pred = rng.integers(0, 3, size=200)
w = rng.uniform(0.1, 2.0, size=200)

for normalize in [True, False]:
    ours = accuracy_score_np(y_true, y_pred, sample_weight=w, normalize=normalize)
    theirs = sk_accuracy_score(y_true, y_pred, sample_weight=w, normalize=normalize)
    print(normalize, ours, theirs, 'diff', abs(ours - theirs))

# 2) Multilabel subset accuracy (2D)
y_true_ml = rng.integers(0, 2, size=(50, 4))
y_pred_ml = rng.integers(0, 2, size=(50, 4))
print('multilabel:', accuracy_score_np(y_true_ml, y_pred_ml), sk_accuracy_score(y_true_ml, y_pred_ml))


## Accuracy depends on the decision threshold

Many binary classifiers output a **probability** $p_i = P(y_i=1\mid x_i)$.
To turn that into a predicted label, you choose a threshold $t$:

$$
\hat{y}_i(t) = \mathbf{1}[p_i \ge t]
$$

So accuracy is really a function of $t$:

$$
\operatorname{Acc}(t) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[y_i = \hat{y}_i(t)]
$$

**Key property:** $\operatorname{Acc}(t)$ is a **step function** of $t$: it is constant between consecutive distinct predicted probabilities and can only change when $t$ crosses one of the $p_i$.
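
One consequence of the step-function property is that you only need to evaluate accuracy at the distinct predicted probabilities, plus one threshold above the maximum for the "predict all 0" case. The sketch below illustrates this on six made-up probabilities; `accuracy_at` is a helper defined here for the example only.

```python
import numpy as np

# Toy labels and predicted probabilities (illustrative values)
y_true = np.array([0, 0, 1, 0, 1, 1])
p = np.array([0.10, 0.35, 0.40, 0.60, 0.70, 0.90])

def accuracy_at(t):
    """Accuracy when predicting 1 whenever p >= t."""
    return float(np.mean((p >= t).astype(int) == y_true))

# Candidate thresholds: every distinct probability, plus +inf (predict all 0)
candidates = np.append(np.unique(p), np.inf)
accs = np.array([accuracy_at(t) for t in candidates])

best_t = float(candidates[int(np.argmax(accs))])
best_acc = float(accs.max())  # 5/6 on this toy data, first reached at t = 0.40

# Between two consecutive probabilities the accuracy does not move:
# any t in (0.40, 0.60] gives the same predictions as t = 0.60.
same_on_plateau = accuracy_at(0.41) == accuracy_at(0.59) == accuracy_at(0.60)

print(best_t, best_acc, same_on_plateau)
```

A dense grid such as `np.linspace(0, 1, 401)` works too; it simply evaluates many thresholds that fall on the same plateau.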


In [None]:
# Synthetic probabilities where threshold choice matters
n = 200

y_true = rng.integers(0, 2, size=n)

# Create probabilities correlated with y_true, but noisy
logit = (y_true * 2 - 1) * 1.2 + rng.normal(0, 1.0, size=n)
p = 1 / (1 + np.exp(-logit))

thresholds = np.linspace(0, 1, 401)
accs = np.array([accuracy_score_np(y_true, predict_labels_from_proba(p, t)) for t in thresholds])

best_idx = int(np.argmax(accs))
best_t = float(thresholds[best_idx])

print('accuracy @ t=0.50:', accuracy_score_np(y_true, predict_labels_from_proba(p, 0.5)))
print('best threshold:', best_t)
print('best accuracy:', float(accs[best_idx]))

fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=accs, mode='lines', name='accuracy(t)'))
fig.add_vline(x=0.5, line_dash='dash', line_color='gray', annotation_text='0.5', annotation_position='top')
fig.add_vline(x=best_t, line_dash='dash', line_color='black', annotation_text=f'best={best_t:.2f}', annotation_position='top')
fig.update_layout(
    title='Accuracy as a function of the decision threshold',
    xaxis_title='threshold t',
    yaxis_title='accuracy',
    yaxis=dict(range=[0, 1]),
)
fig.show()

fig = px.histogram(
    x=p,
    color=y_true.astype(str),
    nbins=30,
    barmode='overlay',
    opacity=0.6,
    title='Predicted probability distribution by true class',
    labels={'x': 'predicted probability p', 'color': 'true label'},
)
fig.add_vline(x=best_t, line_dash='dash', line_color='black')
fig.add_vline(x=0.5, line_dash='dash', line_color='gray')
fig.show()


## A classic pitfall: accuracy on imbalanced data (“accuracy paradox”)

If one class dominates, a model can achieve high accuracy by **always predicting the majority class**.

Example: 95% negatives, 5% positives.
- Predicting “negative” for everyone gives **95% accuracy**.
- But it completely fails to detect the positives.

This is why it’s good practice to always look at the **confusion matrix** (and consider metrics like recall/precision/F1 or balanced accuracy).
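
To see how balanced accuracy reacts to the same baseline, here is a minimal NumPy sketch that computes it as the mean of per-class recall. The 95/5 split is constructed by hand to mirror the example above; scikit-learn's `balanced_accuracy_score` is defined as the same average of per-class recall.

```python
import numpy as np

# 95 negatives, 5 positives, and a baseline that always predicts the majority class
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_true == y_pred))

# Recall per class: among samples whose true class is c, the fraction predicted as c
recalls = [float(np.mean(y_pred[y_true == c] == c)) for c in np.unique(y_true)]
balanced_acc = float(np.mean(recalls))

print(accuracy)      # 0.95
print(recalls)       # [1.0, 0.0]
print(balanced_acc)  # 0.5
```

Plain accuracy rewards the baseline, while balanced accuracy puts it at 0.5, the value any constant predictor gets on a binary problem.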


In [None]:
# Imbalanced example: 95% of class 0
n = 200
n_pos = int(0.05 * n)

y_true = np.array([1] * n_pos + [0] * (n - n_pos))
rng.shuffle(y_true)

y_pred_all0 = np.zeros_like(y_true)

acc = accuracy_score_np(y_true, y_pred_all0)
print('majority-class baseline accuracy:', acc)

# Confusion matrix counts (binary)
TN = int(np.sum((y_true == 0) & (y_pred_all0 == 0)))
FP = int(np.sum((y_true == 0) & (y_pred_all0 == 1)))
FN = int(np.sum((y_true == 1) & (y_pred_all0 == 0)))
TP = int(np.sum((y_true == 1) & (y_pred_all0 == 1)))

cm = np.array([[TN, FP], [FN, TP]])

fig = go.Figure(
    data=go.Heatmap(
        z=cm,
        x=['pred 0', 'pred 1'],
        y=['true 0', 'true 1'],
        text=cm,
        texttemplate='%{text}',
        colorscale='Blues',
        showscale=False,
    )
)
fig.update_layout(title='Confusion matrix for the majority-class baseline')
fig.show()

fig = px.bar(
    x=['class 0', 'class 1'],
    y=[int(np.sum(y_true == 0)), int(np.sum(y_true == 1))],
    title='Class imbalance in the data',
    labels={'x': 'class', 'y': 'count'},
)
fig.show()


## Using accuracy during optimization: logistic regression (NumPy)

### Why we usually *don’t* optimize accuracy directly

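Accuracy corresponds to the **0–1 loss**, which is **piecewise constant** in the model parameters: a small parameter change usually flips no predictions, so the gradient is zero almost everywhere and undefined where a prediction flips.
Gradient-based methods (like gradient descent) need an objective with informative gradients, so we typically optimize a smooth surrogate such as **log loss**.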

Even if you train with log loss, accuracy is still useful for:
- monitoring training (does performance improve?)
- comparing models
- choosing a **decision threshold** on a validation set
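
The sketch below illustrates why accuracy gives gradient descent nothing to work with. It nudges the weights of a toy linear model (no intercept, hand-picked data and weights, all illustrative) and compares how accuracy and log loss respond.

```python
import numpy as np

# Tiny hand-made dataset and a fixed weight vector (illustrative values)
X = np.array([[2.0, 1.0], [1.0, -0.5], [-1.0, 0.5],
              [-2.0, -1.0], [0.5, 2.0], [-0.5, -2.0]])
y = np.array([1, 1, 0, 0, 1, 0])
w = np.array([1.0, 0.2])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def accuracy(w):
    return float(np.mean((sigmoid(X @ w) >= 0.5).astype(int) == y))

def log_loss(w):
    p = sigmoid(X @ w)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Small step along the first weight
eps = 1e-3
w_nudged = w + eps * np.array([1.0, 0.0])

acc_change = accuracy(w_nudged) - accuracy(w)    # no prediction flips, so exactly 0.0
loss_change = log_loss(w_nudged) - log_loss(w)   # small but nonzero (negative here)

print(acc_change, loss_change)
```

Log loss tells the optimizer that this step was a slight improvement; accuracy reports no change at all.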


In [None]:
# Synthetic 2D dataset (two overlapping Gaussians)

n0, n1 = 260, 140
X0 = rng.normal(loc=(-1.0, -1.0), scale=(1.1, 1.1), size=(n0, 2))
X1 = rng.normal(loc=(1.2, 1.0), scale=(1.3, 1.0), size=(n1, 2))

X = np.vstack([X0, X1])
y = np.array([0] * n0 + [1] * n1)

# Shuffle and split
idx = rng.permutation(len(y))
X, y = X[idx], y[idx]

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.35, random_state=42, stratify=y
)

fig = px.scatter(
    x=X_train[:, 0],
    y=X_train[:, 1],
    color=y_train.astype(str),
    title='Training data (2D) — overlap makes errors unavoidable',
    labels={'x': 'x1', 'y': 'x2', 'color': 'class'},
)
fig.show()


In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def add_intercept(X):
    return np.c_[np.ones((X.shape[0], 1)), X]


def log_loss_np(y_true, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)
    y_true = y_true.astype(float)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


def fit_logreg_gd(X, y, *, lr=0.2, n_iters=400, l2=0.0, verbose=False):
    '''Logistic regression with batch gradient descent (binary).'''

    Xb = add_intercept(X)
    y = y.astype(float)

    w = np.zeros(Xb.shape[1])

    history = {
        'iter': [],
        'train_loss': [],
        'train_acc@0.5': [],
    }

    for it in range(1, n_iters + 1):
        z = Xb @ w
        p = sigmoid(z)

        # Gradient of average log loss + L2 regularization (excluding intercept)
        grad = (Xb.T @ (p - y)) / Xb.shape[0]
        grad[1:] += l2 * w[1:]

        w -= lr * grad

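        if it % 5 == 0 or it == 1:
            # p was computed before this iteration's update, so the logged
            # metrics describe the weights at the start of iteration `it`.
            y_pred = (p >= 0.5).astype(int)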
            history['iter'].append(it)
            history['train_loss'].append(log_loss_np(y, p))
            history['train_acc@0.5'].append(accuracy_score_np(y.astype(int), y_pred))

            if verbose:
                print(it, history['train_loss'][-1], history['train_acc@0.5'][-1])

    return w, history


w, hist = fit_logreg_gd(X_train, y_train, lr=0.15, n_iters=500, l2=0.01)

# Evaluate on validation
p_val = sigmoid(add_intercept(X_val) @ w)
acc_val_05 = accuracy_score_np(y_val, predict_labels_from_proba(p_val, 0.5))
print('validation accuracy @ 0.5:', acc_val_05)

fig = go.Figure()
fig.add_trace(go.Scatter(x=hist['iter'], y=hist['train_loss'], mode='lines', name='train log loss'))
fig.update_layout(title='Training objective (log loss) decreases smoothly', xaxis_title='iteration', yaxis_title='log loss')
fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=hist['iter'], y=hist['train_acc@0.5'], mode='lines', name='train accuracy @ 0.5'))
fig.update_layout(title='Accuracy during training (often changes in jumps)', xaxis_title='iteration', yaxis_title='accuracy', yaxis=dict(range=[0, 1]))
fig.show()


In [None]:
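# Pick a decision threshold that maximizes validation accuracy.
# The accuracy at best_t is an optimistic estimate, because the threshold is tuned
# on the same validation set; confirm it on separate held-out data.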

thresholds = np.linspace(0, 1, 401)
accs_val = np.array([accuracy_score_np(y_val, predict_labels_from_proba(p_val, t)) for t in thresholds])

best_idx = int(np.argmax(accs_val))
best_t = float(thresholds[best_idx])

print('best threshold:', best_t)
print('validation accuracy @ best_t:', float(accs_val[best_idx]))

fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=accs_val, mode='lines', name='val accuracy(t)'))
fig.add_vline(x=0.5, line_dash='dash', line_color='gray', annotation_text='0.5', annotation_position='top')
fig.add_vline(x=best_t, line_dash='dash', line_color='black', annotation_text=f'best={best_t:.2f}', annotation_position='top')
fig.update_layout(title='Validation accuracy vs threshold', xaxis_title='threshold', yaxis_title='accuracy', yaxis=dict(range=[0, 1]))
fig.show()


In [None]:
# Decision boundary visualization (threshold = best_t)

x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1

x1_grid = np.linspace(x1_min, x1_max, 200)
x2_grid = np.linspace(x2_min, x2_max, 200)
xx, yy = np.meshgrid(x1_grid, x2_grid)
grid = np.c_[xx.ravel(), yy.ravel()]

p_grid = sigmoid(add_intercept(grid) @ w).reshape(xx.shape)

fig = go.Figure()

# Background probability field
fig.add_trace(
    go.Contour(
        x=x1_grid,
        y=x2_grid,
        z=p_grid,
        colorscale='RdBu',
        reversescale=True,
        opacity=0.6,
        contours=dict(showlines=False),
        colorbar=dict(title='P(y=1)'),
        name='P(y=1)',
    )
)

# Decision boundary: p = best_t
fig.add_trace(
    go.Contour(
        x=x1_grid,
        y=x2_grid,
        z=p_grid,
        showscale=False,
        contours=dict(start=best_t, end=best_t, size=1, coloring='lines'),
        line=dict(color='black', width=3),
        name='boundary',
    )
)

fig.add_trace(
    go.Scatter(
        x=X_val[:, 0],
        y=X_val[:, 1],
        mode='markers',
        marker=dict(size=7, color=y_val, colorscale='Viridis', line=dict(width=0.5, color='white')),
        name='validation points',
    )
)

fig.update_layout(
    title=f'Decision boundary for threshold t={best_t:.2f}',
    xaxis_title='x1',
    yaxis_title='x2',
)
fig.show()


## Practical usage (scikit-learn)

For most workflows you’ll use scikit-learn:

```python
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
```

Notes:
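- For multiclass, pass one class label per sample (integers or strings), not probabilities or one-hot vectors; a 2D 0/1 indicator array is treated as multilabel.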
- For multilabel, `accuracy_score` computes **subset accuracy** (all labels must match).
- Use `sample_weight=` if some samples should count more than others.
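
A short sketch of the `normalize` and `sample_weight` options, using made-up string labels and illustrative weights:

```python
from sklearn.metrics import accuracy_score

# Toy string labels (illustrative)
y_true = ["cat", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "cat", "bird"]

# Fraction of correct predictions: 3 out of 4
acc = accuracy_score(y_true, y_pred)

# Count of correct predictions instead of a fraction
n_correct = accuracy_score(y_true, y_pred, normalize=False)

# Weighted accuracy: the misclassified sample carries weight 2,
# so the result is (1 + 1 + 0 + 1) / (1 + 1 + 2 + 1) = 3 / 5
acc_weighted = accuracy_score(y_true, y_pred, sample_weight=[1, 1, 2, 1])

print(acc, n_correct, acc_weighted)
```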


In [None]:
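# Compare with scikit-learn's LogisticRegression.
# It applies L2 regularization by default (C=1.0) and uses a different solver,
# so its coefficients need not match the gradient-descent fit exactly.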

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)

y_pred_val = clf.predict(X_val)
print('sklearn LogisticRegression val accuracy:', sk_accuracy_score(y_val, y_pred_val))


## Pros / Cons / When to use

### Pros
- **Simple and interpretable**: “percent correct”.
- Works well when **classes are balanced** and **error costs are similar**.
- Useful as a quick baseline and sanity check.

### Cons
- **Misleading on imbalanced datasets** (majority-class baseline can look “great”).
- Hides *which* mistakes you make (use a confusion matrix / per-class metrics).
- For probabilistic models it is **threshold-dependent**.
- Hard to optimize directly with gradient methods (0–1 loss is non-smooth).
- For multilabel, subset accuracy can be **too strict**.

### Good fits
- Balanced multiclass problems (top-1 correctness matters).
- Settings where false positives and false negatives have comparable cost.

### Consider alternatives when
- Classes are imbalanced: `balanced_accuracy_score`, precision/recall/F1, PR-AUC.
- You care about ranking/threshold tradeoffs: ROC curves, PR curves.
- You need probability quality: log loss, Brier score.
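
To illustrate the last point, the sketch below compares two made-up sets of predicted probabilities that lead to the same hard predictions at threshold 0.5, and therefore the same accuracy, but differ in probability quality. The helper functions and the names `p_confident` and `p_hesitant` are defined here for the example only.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0])

# Two hypothetical models: same side of 0.5 on every sample, different confidence
p_confident = np.array([0.95, 0.90, 0.10, 0.60])
p_hesitant = np.array([0.55, 0.60, 0.45, 0.60])

def accuracy(p, t=0.5):
    return float(np.mean((p >= t).astype(int) == y_true))

def log_loss(p):
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def brier(p):
    return float(np.mean((p - y_true) ** 2))

for name, p in [("confident", p_confident), ("hesitant", p_hesitant)]:
    print(name, accuracy(p), round(log_loss(p), 4), round(brier(p), 4))
```

Both models score 0.75 accuracy, but the log loss and Brier score are clearly lower for the model that is confident on the samples it gets right.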


## Exercises

1. Build a classifier that outputs probabilities and show how the **best threshold** shifts when class imbalance increases.
2. Construct two models with identical accuracy but very different confusion matrices. Which one would you deploy for a medical screening task?
3. For multilabel data, compare subset accuracy with per-label accuracy (Hamming accuracy).


## References

- scikit-learn `accuracy_score`: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
- scikit-learn classification metrics user guide: https://scikit-learn.org/stable/modules/model_evaluation.html
