# log_loss (cross-entropy / negative log-likelihood)

`log_loss` measures how well **predicted probabilities** match the true labels.
It is the standard objective for logistic regression / softmax classifiers, and a common evaluation metric for probabilistic models.

## Learning goals
- understand the binary and multiclass formulas (with notation)
- build intuition for why confident mistakes are punished heavily
- implement numerically stable log loss in NumPy (from probabilities and from logits)
- see how minimizing log loss trains logistic regression via gradient descent
- know when log loss is the right metric (and when it is not)

## Quick import

```python
from sklearn.metrics import log_loss
```

## Table of contents
1. Definitions and notation
2. Intuition (plots)
3. NumPy implementation: quick sanity checks (binary + multiclass)
4. Using log loss to optimize logistic regression
5. Pros, cons, pitfalls


In [None]:
import numpy as np

import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_blobs
from sklearn.metrics import log_loss as sk_log_loss
from sklearn.model_selection import train_test_split

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(0)


## 1) Definitions and notation

Assume we have $n$ examples.

### Binary classification

- True label: $y_i \in \{0,1\}$
- Predicted probability of the positive class: $p_i = P(y_i=1 \mid x_i)$

Per-example log loss (Bernoulli negative log-likelihood) is:

$$
\ell_i = -\Big(y_i \log(p_i) + (1-y_i)\log(1-p_i)\Big)
$$

Average log loss (with optional sample weights $w_i$, the mean becomes $\sum_i w_i \ell_i / \sum_i w_i$):

$$
L = \frac{1}{n}\sum_{i=1}^n \ell_i
$$
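As a small worked example of the two formulas above, the sketch below computes the per-example losses by hand and then averages them. The labels, probabilities, and weights are made up for illustration, and the weighted mean is normalized by the sum of the weights, which matches the `_weighted_mean` helper used later in this notebook.

```python
import numpy as np

# Hypothetical labels and predicted P(y=1 | x), for illustration only
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])

# Per-example Bernoulli negative log-likelihood
per_example = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Plain average over the n examples
mean_loss = per_example.mean()

# Weighted average with hypothetical weights (the third example counts double)
w = np.array([1.0, 1.0, 2.0])
weighted_loss = np.sum(w * per_example) / np.sum(w)

print("per-example:", per_example)
print("mean:", mean_loss, "weighted:", weighted_loss)
```

The third example has the largest loss, so doubling its weight pulls the weighted mean above the plain mean.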

### Multiclass classification ($K$ classes)

- True label: $y_i \in \{0,1,\dots,K-1\}$
- Predicted probabilities: $p_{ik} = P(y_i=k \mid x_i)$ with $\sum_{k=0}^{K-1} p_{ik}=1$

If we write one-hot targets $y_{ik} \in \{0,1\}$, then:

$$
L = -\frac{1}{n}\sum_{i=1}^n \sum_{k=0}^{K-1} y_{ik}\log(p_{ik})
$$

Equivalently (using integer labels):

$$
L = -\frac{1}{n}\sum_{i=1}^n \log\big(p_{i, y_i}\big)
$$
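The equivalence of the one-hot and integer-label forms can be checked numerically. This is a minimal sketch with made-up probabilities; `Y` is the one-hot matrix built from the integer labels.

```python
import numpy as np

# Hypothetical integer labels and class probabilities (rows sum to 1)
y = np.array([0, 2, 1])
P = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
])

# One-hot form: sum over classes of y_ik * log(p_ik)
Y = np.eye(P.shape[1])[y]
loss_one_hot = -np.mean(np.sum(Y * np.log(P), axis=1))

# Integer-label form: pick the probability of the true class directly
loss_indexed = -np.mean(np.log(P[np.arange(len(y)), y]))

print(loss_one_hot, loss_indexed)
```

Only the true-class column contributes to the one-hot sum, which is why indexing gives the same number with less work.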

### Why the log?

If a model assigns probability $p_{i,y_i}$ to the true class, the likelihood of the dataset is:

$$
\prod_{i=1}^n p_{i,y_i}
$$

Taking `-log` turns a product into a sum:

$$
-\log\Big(\prod_{i=1}^n p_{i,y_i}\Big) = -\sum_{i=1}^n \log(p_{i,y_i})
$$

Dividing by $n$ does not change the minimizer, so minimizing log loss is the same as **maximizing likelihood**.

### From logits (numerical stability)

Sometimes models output **logits** (real-valued scores) instead of probabilities.

Binary: logit $z_i = w^\top x_i + b$, $p_i = \sigma(z_i)$ where $\sigma$ is the sigmoid.

A stable per-sample loss is:

$$
\ell_i = \operatorname{softplus}(z_i) - y_i z_i,\quad \operatorname{softplus}(z)=\log(1+e^z)
$$
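To see why the softplus form matters, the sketch below compares it with the naive route (logits to probabilities to logs) on a few hand-picked logits, including extreme ones. `np.logaddexp(0, z)` is used as a stable softplus; the specific logits and labels are illustrative assumptions.

```python
import numpy as np

# Hand-picked logits, including extreme values that are confidently wrong
z = np.array([-800.0, -2.0, 0.0, 3.0, 800.0])
y = np.array([1, 0, 1, 1, 0])

# Naive route: sigmoid saturates to exactly 0 or 1, so the log blows up
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    p = 1.0 / (1.0 + np.exp(-z))
    naive = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Stable route: softplus(z) - y * z, with softplus(z) = log(1 + e^z)
stable = np.logaddexp(0.0, z) - y * z

print("naive: ", naive)
print("stable:", stable)
```

For moderate logits the two routes agree; at the extremes the naive route returns `inf` while the stable route returns a large but finite loss.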

Multiclass: logits $z_{ik}$, softmax probabilities $p_{ik} = \frac{e^{z_{ik}}}{\sum_j e^{z_{ij}}}$.

Stable loss:

$$
\ell_i = \log\Big(\sum_{k=0}^{K-1} e^{z_{ik}}\Big) - z_{i,y_i}
$$
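The log-sum-exp term is usually computed by first subtracting each row's maximum logit, which leaves the result unchanged but keeps the exponentials in range. A minimal sketch with made-up logits:

```python
import numpy as np

# Hypothetical logits: the first row would overflow a naive exp()
Z = np.array([
    [1000.0, 1001.0, 999.0],
    [-1.0, 0.5, 2.0],
])
y = np.array([1, 2])

# Shift by the row maximum, exponentiate, then add the maximum back
row_max = Z.max(axis=1, keepdims=True)
lse = np.log(np.exp(Z - row_max).sum(axis=1)) + row_max[:, 0]

# Per-sample loss: logsumexp(z_i) - z_{i, y_i}
losses = lse - Z[np.arange(len(y)), y]

print(losses)
```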

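When the loss is computed from probabilities rather than logits, we also **clip probabilities** to $[\varepsilon, 1-\varepsilon]$ for a small $\varepsilon$ to avoid $\log(0) = -\infty$ (which would make the loss $+\infty$).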


In [None]:
def sigmoid(z):
    z = np.asarray(z, dtype=float)
    # np.where evaluates both branches, so exponentiate only -|z| to avoid overflow warnings.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


def softplus(z):
    z = np.asarray(z, dtype=float)
    return np.logaddexp(0.0, z)


def logsumexp(a, axis=None, keepdims=False):
    a = np.asarray(a, dtype=float)
    a_max = np.max(a, axis=axis, keepdims=True)
    out = np.log(np.sum(np.exp(a - a_max), axis=axis, keepdims=True)) + a_max
    if keepdims:
        return out
    if axis is None:
        return out.squeeze()
    return np.squeeze(out, axis=axis)


def log_softmax(z, axis=1):
    z = np.asarray(z, dtype=float)
    return z - logsumexp(z, axis=axis, keepdims=True)


def _weighted_mean(values, sample_weight=None):
    values = np.asarray(values, dtype=float)
    if sample_weight is None:
        return float(np.mean(values))

    w = np.asarray(sample_weight, dtype=float)
    if w.shape != values.shape:
        raise ValueError('sample_weight must have the same shape as values')
    w_sum = np.sum(w)
    if w_sum <= 0:
        raise ValueError('sample_weight must sum to a positive number')
    return float(np.sum(w * values) / w_sum)


def log_loss_binary(y_true, y_prob, *, eps=1e-15, sample_weight=None):
    """Binary log loss from probabilities.

    Parameters
    - y_true: shape (n,), values in {0,1}
    - y_prob: shape (n,), predicted P(y=1|x)
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    if y_true.shape != y_prob.shape:
        raise ValueError('y_true and y_prob must have the same shape')

    if np.any((y_true != 0) & (y_true != 1)):
        raise ValueError('y_true must contain only 0/1 labels')

    p = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return _weighted_mean(losses, sample_weight=sample_weight)


def log_loss_multiclass(y_true, y_prob, *, eps=1e-15, sample_weight=None):
    """Multiclass log loss from probabilities.

    Parameters
    - y_true: shape (n,), integer labels in {0,1,...,K-1}
    - y_prob: shape (n,K), predicted class probabilities (rows should sum to 1)
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)

    if y_prob.ndim != 2:
        raise ValueError('y_prob must be a 2D array of shape (n_samples, n_classes)')

    n_samples, n_classes = y_prob.shape
    if y_true.shape != (n_samples,):
        raise ValueError('y_true must have shape (n_samples,)')

    if np.any((y_true < 0) | (y_true >= n_classes)):
        raise ValueError('y_true contains labels outside [0, n_classes)')

    p = np.clip(y_prob, eps, 1.0 - eps)
    p = p / p.sum(axis=1, keepdims=True)
    losses = -np.log(p[np.arange(n_samples), y_true])
    return _weighted_mean(losses, sample_weight=sample_weight)


def log_loss_binary_from_logits(y_true, logits, *, sample_weight=None):
    """Binary log loss from logits: softplus(z) - y*z."""
    y_true = np.asarray(y_true)
    logits = np.asarray(logits, dtype=float)

    if y_true.shape != logits.shape:
        raise ValueError('y_true and logits must have the same shape')
    if np.any((y_true != 0) & (y_true != 1)):
        raise ValueError('y_true must contain only 0/1 labels')

    losses = softplus(logits) - y_true * logits
    return _weighted_mean(losses, sample_weight=sample_weight)


def log_loss_multiclass_from_logits(y_true, logits, *, sample_weight=None):
    """Multiclass log loss from logits: -log softmax(true_class)."""
    y_true = np.asarray(y_true)
    logits = np.asarray(logits, dtype=float)

    if logits.ndim != 2:
        raise ValueError('logits must be a 2D array of shape (n_samples, n_classes)')

    n_samples, n_classes = logits.shape
    if y_true.shape != (n_samples,):
        raise ValueError('y_true must have shape (n_samples,)')
    if np.any((y_true < 0) | (y_true >= n_classes)):
        raise ValueError('y_true contains labels outside [0, n_classes)')

    log_probs = log_softmax(logits, axis=1)
    losses = -log_probs[np.arange(n_samples), y_true]
    return _weighted_mean(losses, sample_weight=sample_weight)


## 2) Intuition (plots)

For binary classification:

- if the true label is 1, the loss is $-\log(p)$
- if the true label is 0, the loss is $-\log(1-p)$

So **being confidently wrong** is punished heavily (the loss goes to $+\infty$ as the predicted probability goes to 0 for the true class).

A key property: log loss is a **strictly proper scoring rule**.
If the true label is Bernoulli with positive rate $q$, then the *expected* loss of predicting $p$ is:

$$
\mathbb{E}[\ell(p)] = -q\log(p) - (1-q)\log(1-p)
$$

This is the **cross-entropy** $H(q,p)$ and it is minimized at $p=q$.
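The claim that the expected loss is minimized at $p=q$ can be checked on a grid. The sketch below assumes $q=0.7$ and also compares the minimum value with the entropy of a Bernoulli($q$) variable, which is the lowest expected log loss any prediction can reach.

```python
import numpy as np

q = 0.7  # assumed true positive rate

# Candidate predictions on a grid with step 0.001
p = np.linspace(0.001, 0.999, 999)
expected = -(q * np.log(p) + (1 - q) * np.log(1 - p))

# Grid point with the smallest expected loss
p_best = p[np.argmin(expected)]

# Entropy of Bernoulli(q): the expected loss when predicting p = q
entropy_q = -(q * np.log(q) + (1 - q) * np.log(1 - q))

print("argmin p:", p_best)
print("min expected loss:", expected.min(), "entropy H(q):", entropy_q)
```

Any prediction other than $p=q$ pays an extra cost on top of the entropy; that gap is the KL divergence between the true and predicted distributions.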


In [None]:
eps = 1e-6
p = np.linspace(eps, 1 - eps, 800)

loss_y1 = -np.log(p)
loss_y0 = -np.log(1 - p)

q = 0.7
expected_loss = -(q * np.log(p) + (1 - q) * np.log(1 - p))

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        'Per-sample log loss as a function of predicted probability',
        'Expected log loss when the true positive rate is q',
    ),
)

fig.add_trace(go.Scatter(x=p, y=loss_y1, name='y=1: -log(p)'), row=1, col=1)
fig.add_trace(go.Scatter(x=p, y=loss_y0, name='y=0: -log(1-p)'), row=1, col=1)
fig.update_xaxes(title_text='predicted probability p', row=1, col=1)
fig.update_yaxes(title_text='loss', row=1, col=1)

fig.add_trace(go.Scatter(x=p, y=expected_loss, name='E[loss]'), row=1, col=2)
fig.add_vline(x=q, line_width=2, line_dash='dash', line_color='black', row=1, col=2)
fig.add_annotation(
    x=q,
    y=float(expected_loss[np.argmin(np.abs(p - q))]),
    text='minimum at p=q',
    showarrow=True,
    arrowhead=2,
    ax=40,
    ay=-30,
    row=1,
    col=2,
)
fig.update_xaxes(title_text='predicted probability p', row=1, col=2)
fig.update_yaxes(title_text='expected loss', row=1, col=2)

fig.update_layout(height=420, legend=dict(orientation='h', yanchor='bottom', y=1.02))
fig.show()


## 3) NumPy implementation: quick sanity checks

A small example showing why log loss is sensitive to a single confident mistake.


In [None]:
y_true = np.array([1, 1, 1, 0, 0, 0])

p_good = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
# Same as p_good except index 2: a true positive predicted with p=0.01
p_one_confident_mistake = np.array([0.9, 0.8, 0.01, 0.3, 0.2, 0.1])

print('mean log loss (good):', log_loss_binary(y_true, p_good))
print('mean log loss (one confident mistake):', log_loss_binary(y_true, p_one_confident_mistake))
print('sklearn check:', sk_log_loss(y_true, p_good))

eps = 1e-15
losses_good = -(y_true * np.log(np.clip(p_good, eps, 1 - eps)) + (1 - y_true) * np.log(1 - np.clip(p_good, eps, 1 - eps)))
losses_bad = -(y_true * np.log(np.clip(p_one_confident_mistake, eps, 1 - eps)) + (1 - y_true) * np.log(1 - np.clip(p_one_confident_mistake, eps, 1 - eps)))

fig = go.Figure()
fig.add_trace(go.Bar(x=np.arange(len(y_true)), y=losses_good, name='per-sample loss (good)'))
fig.add_trace(go.Bar(x=np.arange(len(y_true)), y=losses_bad, name='per-sample loss (one confident mistake)'))
fig.update_layout(
    barmode='group',
    title='A single confident mistake can dominate mean log loss',
    xaxis_title='sample index',
    yaxis_title='per-sample loss',
)
fig.show()

p_baseline = np.full_like(y_true, y_true.mean(), dtype=float)
print('baseline (predict base rate p=mean(y)):', log_loss_binary(y_true, p_baseline))


### Multiclass example

For multiclass problems you pass a probability matrix of shape `(n_samples, n_classes)`.
The loss for each sample is simply `-log(probability_assigned_to_the_true_class)`.


In [None]:
y_true_mc = np.array([0, 2, 1, 2])
P = np.array(
    [
        [0.7, 0.2, 0.1],
        [0.1, 0.2, 0.7],
        [0.2, 0.6, 0.2],
        [0.05, 0.05, 0.9],
    ]
)

print('multiclass log loss (numpy):', log_loss_multiclass(y_true_mc, P))
print('multiclass log loss (sklearn):', sk_log_loss(y_true_mc, P, labels=[0, 1, 2]))

# log-loss from logits should match log-loss from probabilities
Z = np.log(P)
print('multiclass log loss from logits (numpy):', log_loss_multiclass_from_logits(y_true_mc, Z))


## 4) Using log loss to optimize logistic regression (NumPy)

Binary logistic regression models:

$$
z_i = w^\top x_i + b,\quad p_i = \sigma(z_i)
$$

and minimizes the average log loss:

$$
J(w,b) = \frac{1}{n}\sum_{i=1}^n \Big(-y_i\log(p_i) - (1-y_i)\log(1-p_i)\Big)
$$

A very useful fact for optimization is that the derivative w.r.t. the logit is:

$$
\frac{\partial \ell_i}{\partial z_i} = p_i - y_i
$$

So the gradients are:

$$
\nabla_w J = \frac{1}{n} X^\top (p - y),\quad \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^n (p_i - y_i)
$$
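A finite-difference check is a quick way to gain confidence in these gradient formulas. The sketch below uses random data and arbitrary parameter values (all assumptions for illustration) and compares the analytic gradients with central differences of the mean loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data and parameters, for the check only
X = rng.normal(size=(50, 3))
y = (rng.uniform(size=50) < 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

def mean_loss(w, b):
    """Mean log loss computed from logits: softplus(z) - y * z."""
    z = X @ w + b
    return np.mean(np.logaddexp(0.0, z) - y * z)

# Analytic gradients: X^T (p - y) / n and mean(p - y)
p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
grad_w = X.T @ (p - y) / len(y)
grad_b = np.mean(p - y)

# Central finite differences, one coordinate at a time
h = 1e-6
num_grad_w = np.array([
    (mean_loss(w + h * e, b) - mean_loss(w - h * e, b)) / (2 * h)
    for e in np.eye(3)
])
num_grad_b = (mean_loss(w, b + h) - mean_loss(w, b - h)) / (2 * h)

print("max |analytic - numeric| for w:", np.max(np.abs(grad_w - num_grad_w)))
print("|analytic - numeric| for b:", abs(grad_b - num_grad_b))
```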

Below is a simple gradient descent optimizer that learns `w` and `b` by directly minimizing log loss.


In [None]:
def standardize_fit_transform(X):
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std


def standardize_transform(X, mean, std):
    X = np.asarray(X, dtype=float)
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std


def fit_logistic_regression_gd(
    X_train,
    y_train,
    X_val=None,
    y_val=None,
    *,
    lr=0.2,
    n_steps=300,
    l2=0.0,
):
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    n_samples, n_features = X_train.shape

    w = np.zeros(n_features)
    b = 0.0

    history = {
        'step': [],
        'train_loss': [],
        'train_acc': [],
        'val_loss': [],
        'val_acc': [],
    }

    for step in range(n_steps):
        logits = X_train @ w + b
        p = sigmoid(logits)

        # Record metrics for the current parameters (before this step's update)
        # so that the train and validation curves describe the same model.
        loss = log_loss_binary(y_train, p)
        pred = (p >= 0.5).astype(int)
        acc = float(np.mean(pred == y_train))

        history['step'].append(step)
        history['train_loss'].append(loss)
        history['train_acc'].append(acc)

        if X_val is not None and y_val is not None:
            logits_val = X_val @ w + b
            p_val = sigmoid(logits_val)
            history['val_loss'].append(log_loss_binary(y_val, p_val))
            history['val_acc'].append(float(np.mean((p_val >= 0.5).astype(int) == y_val)))

        # Gradient of the mean log loss, plus the gradient of an (l2 / 2) * ||w||^2 penalty.
        # The logged loss above does not include the penalty term.
        grad_w = (X_train.T @ (p - y_train)) / n_samples + l2 * w
        grad_b = float(np.mean(p - y_train))

        w -= lr * grad_w
        b -= lr * grad_b

    return w, b, history


In [None]:
X, y = make_blobs(
    n_samples=800,
    centers=2,
    n_features=2,
    cluster_std=2.2,
    random_state=0,
)

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0,
    stratify=y,
)

X_train_s, mean, std = standardize_fit_transform(X_train)
X_val_s = standardize_transform(X_val, mean, std)

w, b, hist = fit_logistic_regression_gd(
    X_train_s,
    y_train,
    X_val=X_val_s,
    y_val=y_val,
    lr=0.2,
    n_steps=250,
)

fig = make_subplots(specs=[[{'secondary_y': True}]])
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_loss'], name='train log loss'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_loss'], name='val log loss'), secondary_y=False)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['train_acc'], name='train accuracy'), secondary_y=True)
fig.add_trace(go.Scatter(x=hist['step'], y=hist['val_acc'], name='val accuracy'), secondary_y=True)

fig.update_xaxes(title_text='gradient descent step')
fig.update_yaxes(title_text='log loss (lower is better)', secondary_y=False)
fig.update_yaxes(title_text='accuracy', range=[0, 1], secondary_y=True)
fig.update_layout(title='Minimizing log loss trains logistic regression', height=420)
fig.show()

print('final train loss:', hist['train_loss'][-1])
print('final val loss:', hist['val_loss'][-1])


In [None]:
x0_min, x0_max = X_train_s[:, 0].min() - 0.8, X_train_s[:, 0].max() + 0.8
x1_min, x1_max = X_train_s[:, 1].min() - 0.8, X_train_s[:, 1].max() + 0.8

x0 = np.linspace(x0_min, x0_max, 220)
x1 = np.linspace(x1_min, x1_max, 220)
xx0, xx1 = np.meshgrid(x0, x1)
grid = np.c_[xx0.ravel(), xx1.ravel()]

prob_grid = sigmoid(grid @ w + b).reshape(xx0.shape)

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=x0,
        y=x1,
        z=prob_grid,
        contours=dict(start=0.0, end=1.0, size=0.1),
        colorscale='RdBu',
        opacity=0.85,
        colorbar=dict(title='P(y=1)'),
        name='P(y=1)',
    )
)
fig.add_trace(
    go.Contour(
        x=x0,
        y=x1,
        z=prob_grid,
        contours=dict(start=0.5, end=0.5, size=0.5),
        showscale=False,
        line=dict(color='black', width=3),
        hoverinfo='skip',
        name='decision boundary (p=0.5)',
    )
)

fig.add_trace(
    go.Scatter(
        x=X_train_s[:, 0],
        y=X_train_s[:, 1],
        mode='markers',
        name='train',
        marker=dict(
            size=6,
            color=y_train,
            cmin=0,
            cmax=1,
            colorscale=[[0, '#1f77b4'], [1, '#d62728']],
            line=dict(width=0.5, color='black'),
        ),
    )
)

fig.add_trace(
    go.Scatter(
        x=X_val_s[:, 0],
        y=X_val_s[:, 1],
        mode='markers',
        name='val',
        marker=dict(
            size=8,
            symbol='x',
            color=y_val,
            cmin=0,
            cmax=1,
            colorscale=[[0, '#1f77b4'], [1, '#d62728']],
            line=dict(width=1.0, color='black'),
        ),
    )
)

fig.update_layout(
    title='Decision boundary after minimizing log loss (standardized feature space)',
    xaxis_title='feature 1 (standardized)',
    yaxis_title='feature 2 (standardized)',
    height=520,
)
fig.show()


## 5) Pros, cons, pitfalls

### Pros
- **Uses probabilities**: rewards calibrated predictions, not just correct hard labels.
- **Strictly proper scoring rule**: in expectation, you minimize it by predicting the true conditional probabilities.
- **Differentiable**: works naturally as a training objective (logistic regression, neural nets, softmax models).
- **Works for multiclass**: via categorical cross-entropy.

### Cons / caveats
- **Harder to interpret** than accuracy (units are nats if using natural logs).
- **Unbounded above**: a few confidently wrong predictions can dominate the mean.
- **Sensitive to label noise**: mislabeled points can produce very large losses if the model is confident.
- **Needs good probability estimates**: models that only rank well (high AUC) can still have poor log loss.
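The last caveat can be illustrated with synthetic data. In the sketch below, two sets of probabilities are strictly increasing functions of the same score, so they rank examples identically, yet the overconfident one has a much worse log loss. The data-generating process, the `pairwise_auc` helper, and the scale factor of 5 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary labels and a noisy but informative score
y = rng.integers(0, 2, size=2000)
score = y + rng.normal(scale=1.0, size=y.size)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_log_loss(y, p, eps=1e-15):
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def pairwise_auc(y, s):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as wrong."""
    pos, neg = s[y == 1], s[y == 0]
    return float(np.mean(pos[:, None] > neg[None, :]))

# For this setup the true log-odds are (score - 0.5), so this one is well calibrated
p_moderate = sigmoid(score - 0.5)
# Same ordering, but pushed toward 0 and 1
p_extreme = sigmoid(5.0 * (score - 0.5))

auc_moderate, auc_extreme = pairwise_auc(y, p_moderate), pairwise_auc(y, p_extreme)
ll_moderate, ll_extreme = binary_log_loss(y, p_moderate), binary_log_loss(y, p_extreme)

print("AUC:     ", auc_moderate, auc_extreme)
print("log loss:", ll_moderate, ll_extreme)
```

Ranking metrics cannot see the difference between the two models; log loss penalizes the overconfident one.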

### Common pitfalls
- Passing **hard class labels** instead of probabilities (log loss expects probabilities).
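- For multiclass, the **columns** of the probability matrix must follow the label order (sorted labels in scikit-learn), and each row must sum to 1.
- With scikit-learn, if `y_true` does not contain every class (for example, only one class appears in a small batch), pass `labels=[...]` to define the full label set.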
- Not clipping probabilities leads to `log(0)` and infinite loss; use a small $\varepsilon$.
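The first and last pitfalls interact: hard labels are probabilities of exactly 0 or 1, so every wrong prediction hits the clipping floor. The sketch below uses a small NumPy log loss with `eps=1e-15` (mirroring the `log_loss_binary` default in this notebook) and made-up predictions that give the same decisions at a 0.5 threshold.

```python
import numpy as np

def binary_log_loss(y, p, eps=1e-15):
    # Clip so that log(0) never occurs; eps controls the worst-case penalty
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

y_true = np.array([1, 0, 1, 0])

# Hard labels with one wrong prediction (index 2)
hard_labels = np.array([1, 0, 0, 0])
# Probabilities that lead to the same decisions at threshold 0.5
soft_probs = np.array([0.9, 0.1, 0.4, 0.1])

loss_hard = binary_log_loss(y_true, hard_labels)
loss_soft = binary_log_loss(y_true, soft_probs)

print("hard labels:", loss_hard)
print("probabilities:", loss_soft)
print("penalty for one clipped mistake:", -np.log(1e-15))
```

With hard labels the result is driven almost entirely by the number of mistakes times $-\log(\varepsilon)$, so it mostly reflects the choice of $\varepsilon$ rather than the quality of the model.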

### Where it is a good fit
- When you care about **probability quality**: risk estimation, triage systems, cost-sensitive decisions.
- When you want an evaluation metric that matches the training objective for probabilistic classifiers.
- When comparing calibrated models (often alongside calibration curves / Brier score).


## Exercises
- Derive $\partial \ell / \partial z = p - y$ for the binary case.
- Implement multiclass gradient descent for softmax regression using `log_loss_multiclass_from_logits`.
- Compare log loss and accuracy on an imbalanced dataset; notice that accuracy can look good even with poor probabilities.


## References
- scikit-learn: `sklearn.metrics.log_loss` https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
- Cross-entropy and negative log-likelihood (NLL): https://en.wikipedia.org/wiki/Cross_entropy
- Proper scoring rules: https://en.wikipedia.org/wiki/Scoring_rule
