# hinge_loss

Hinge loss is a **margin-based** loss for classification. It’s the standard convex surrogate behind the (soft-margin) **Support Vector Machine (SVM)**.

This notebook:
- defines **binary** and **multiclass** hinge loss with consistent notation
- builds intuition with Plotly plots
- implements the loss (and a useful subgradient) from scratch in NumPy
- uses hinge loss to optimize a simple **linear classifier** (primal SVM-style)

## Quick import

```python
from sklearn.metrics import hinge_loss
```

> Important: `hinge_loss` expects **decision scores** (real-valued margins), not probabilities.


In [None]:
import os
from dataclasses import dataclass

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

from sklearn.datasets import make_blobs
from sklearn.metrics import hinge_loss as skl_hinge_loss
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)


## 1) Binary hinge loss (definition)

Binary classification with a **real-valued score**:

- label: $y \in \{-1, +1\}$
- model score: $s = f(x) \in \mathbb{R}$
- prediction: $\hat{y} = \mathrm{sign}(s)$

The key quantity is the **(signed) margin**:

$$
m = y\,s.
$$

- If $m > 0$, the example is classified correctly.
- Larger $m$ means “more confident” (further from the decision boundary).

The **hinge loss** is:

$$
\ell(y, s) = \max(0, 1 - y s) = \max(0, 1 - m).
$$

Average hinge loss over a dataset:

$$
L = \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i s_i).
$$

### Relationship to 0–1 loss

The 0–1 loss is $\mathbb{1}[m \le 0]$ (wrong sign).

Hinge loss is a **convex upper bound**:

$$
\mathbb{1}[m \le 0] \;\le\; \max(0, 1 - m).
$$

So minimizing hinge loss tends to reduce classification errors while also encouraging a **margin** ($m \ge 1$ gives zero loss).
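
As a quick numeric check of the definitions above, the following sketch evaluates the margin, the hinge loss, and the 0–1 loss on four made-up (label, score) pairs. The values are illustrative only and are not taken from any dataset in this notebook.

```python
import numpy as np

# Illustrative labels in {-1, +1} and raw decision scores (made-up values).
y_toy = np.array([+1.0, -1.0, +1.0, -1.0])
s_toy = np.array([2.0, 0.5, 0.3, -1.0])

m_toy = y_toy * s_toy                      # signed margins: [2.0, -0.5, 0.3, 1.0]
hinge_toy = np.maximum(0.0, 1.0 - m_toy)   # per-example hinge loss
zero_one_toy = (m_toy <= 0).astype(float)  # per-example 0-1 loss

print("margins :", m_toy)
print("hinge   :", hinge_toy)
print("0-1     :", zero_one_toy)
print("mean hinge:", hinge_toy.mean())

# The hinge loss upper-bounds the 0-1 loss example by example.
upper_bound_holds = bool(np.all(zero_one_toy <= hinge_toy))
print("hinge >= 0-1 everywhere:", upper_bound_holds)
```

The third example is classified correctly ($m = 0.3 > 0$) yet still pays a loss of about 0.7, while the fourth sits exactly on the margin ($m = 1$) and pays nothing.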


In [None]:
m = np.linspace(-3, 3, 600)

loss_01 = (m <= 0).astype(float)
loss_hinge = np.maximum(0.0, 1.0 - m)
loss_sq_hinge = np.maximum(0.0, 1.0 - m) ** 2

fig = go.Figure()
fig.add_trace(go.Scatter(x=m, y=loss_01, name="0-1 loss  𝟙[m≤0]", line=dict(dash="dash")))
fig.add_trace(go.Scatter(x=m, y=loss_hinge, name="hinge  max(0, 1-m)", line=dict(width=3)))
fig.add_trace(go.Scatter(x=m, y=loss_sq_hinge, name="squared hinge (variant)", line=dict(dash="dot")))

fig.add_vline(x=0, line_dash="dot", line_color="gray")
fig.add_vline(x=1, line_dash="dot", line_color="gray")

fig.update_layout(
    title="Loss as a function of margin  m = y·score",
    xaxis_title="margin m",
    yaxis_title="loss",
    legend_title="",
)
fig.show()


## 2) Intuition: which points are penalized?

Because $\ell(m)=\max(0, 1-m)$:

- **Misclassified** points ($m \le 0$) get loss $\ge 1$.
- **Correct but too close** to the boundary ($0 < m < 1$) still get *some* loss.
- **Confident** points ($m \ge 1$) get **zero** loss.

This is why hinge-based models often end up depending heavily on a subset of points (those with $m \le 1$), commonly called **support vectors** in the SVM context.
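
One way to see this dependence on a subset of points is to perturb individual margins and watch the mean loss. The sketch below uses four made-up margins and a hypothetical helper `mean_hinge_from_margins`; it is an illustration of the three cases above, not part of the notebook's implementation.

```python
import numpy as np


def mean_hinge_from_margins(margins):
    """Mean of max(0, 1 - m) over an array of signed margins (illustrative helper)."""
    margins = np.asarray(margins, dtype=float)
    return float(np.mean(np.maximum(0.0, 1.0 - margins)))


# Made-up margins: misclassified, inside the margin, exactly on it, confident.
margins_toy = np.array([-0.5, 0.4, 1.0, 2.5])
base = mean_hinge_from_margins(margins_toy)

# Push the already-confident point even further from the boundary.
pushed_confident = margins_toy.copy()
pushed_confident[3] = 10.0

# Improve the point that is correct but inside the margin.
pushed_violator = margins_toy.copy()
pushed_violator[1] = 0.9

print("base loss              :", base)
print("confident point moved  :", mean_hinge_from_margins(pushed_confident))
print("in-margin point moved  :", mean_hinge_from_margins(pushed_violator))
```

Moving the confident point leaves the loss untouched, whereas moving the in-margin point lowers it, so only the points with $m < 1$ pull on the solution.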


In [None]:
m_samples = np.linspace(-2.5, 2.5, 60)
loss_samples = np.maximum(0.0, 1.0 - m_samples)

category = np.where(
    m_samples <= 0,
    "misclassified (m ≤ 0)",
    np.where(m_samples < 1, "correct but within margin (0 < m < 1)", "confident (m ≥ 1)"),
)

fig = px.scatter(
    x=m_samples,
    y=loss_samples,
    color=category,
    title="Only points with margin m < 1 contribute to hinge loss",
)
fig.add_vline(x=0, line_dash="dot", line_color="gray")
fig.add_vline(x=1, line_dash="dot", line_color="gray")
fig.update_layout(xaxis_title="margin m", yaxis_title="hinge loss")
fig.show()


## 3) Multiclass hinge loss (Crammer–Singer)

For $K$ classes, assume a score vector:

$$
s(x) \in \mathbb{R}^K, \quad s_k(x) = \text{score for class } k.
$$

If the true class is $y \in \{0,\dots,K-1\}$, the multiclass hinge loss is:

$$
\ell(y, s) = \max\big(0, 1 + \max_{j \ne y} s_j - s_y\big).
$$

It enforces a **margin** between the true class score and the best competing score:

$$
s_y \ge \max_{j \ne y} s_j + 1 \quad \Rightarrow \quad \ell = 0.
$$

This is the formulation used by `sklearn.metrics.hinge_loss` when `pred_decision` is shaped `(n_samples, n_classes)`.
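
As a worked instance of the formula, take $K = 3$ and two illustrative score vectors (the numbers are chosen here only to make the arithmetic easy to follow):

$$
s = (2.0,\ 0.0,\ -1.0),\quad y = 0: \qquad \ell = \max\big(0,\ 1 + 0.0 - 2.0\big) = 0,
$$

$$
s = (0.1,\ 0.2,\ 0.0),\quad y = 1: \qquad \ell = \max\big(0,\ 1 + 0.1 - 0.2\big) = 0.9.
$$

The second example is classified correctly (class 1 has the highest score) but still incurs a loss, because the true score beats the runner-up by only 0.1 instead of the required 1.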


In [None]:
def _as_1d_float(x: np.ndarray) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    if x.ndim != 1:
        raise ValueError(f"Expected a 1D array, got shape={x.shape}")
    return x


def binary_hinge_loss(
    y_true: np.ndarray,
    scores: np.ndarray,
    *,
    margin: float = 1.0,
    sample_weight: np.ndarray | None = None,
    reduction: str = "mean",
) -> float:
    """Binary hinge loss: mean_i max(0, margin - y_i * score_i).

    Accepts labels in {0,1} or {-1,+1}. `scores` are raw decision scores.
    """

    y = _as_1d_float(y_true)
    s = _as_1d_float(scores)
    if y.shape[0] != s.shape[0]:
        raise ValueError(
            f"y_true and scores must match in length, got {y.shape[0]} vs {s.shape[0]}"
        )

    uniques = set(np.unique(y).tolist())
    if uniques.issubset({0.0, 1.0}):
        y = np.where(y == 0.0, -1.0, 1.0)
    elif not uniques.issubset({-1.0, 1.0}):
        raise ValueError(
            f"For binary hinge loss, y_true must be in {{0,1}} or {{-1,1}}, got {sorted(uniques)}"
        )

    loss = np.maximum(0.0, margin - y * s)

    if sample_weight is not None:
        w = _as_1d_float(sample_weight)
        if w.shape[0] != loss.shape[0]:
            raise ValueError("sample_weight must have the same length as y_true")
        if reduction == "mean":
            return float(np.sum(w * loss) / np.sum(w))
        if reduction == "sum":
            return float(np.sum(w * loss))
        raise ValueError("reduction must be 'mean' or 'sum'")

    if reduction == "mean":
        return float(np.mean(loss))
    if reduction == "sum":
        return float(np.sum(loss))
    raise ValueError("reduction must be 'mean' or 'sum'")


def multiclass_hinge_loss(
    y_true: np.ndarray,
    scores: np.ndarray,
    *,
    margin: float = 1.0,
    sample_weight: np.ndarray | None = None,
    reduction: str = "mean",
) -> float:
    """Multiclass hinge loss (Crammer–Singer): mean_i max(0, margin + max_{j!=y} s_ij - s_i,y).

    `y_true` are integer class labels in [0, K-1]. `scores` has shape (n, K).
    """

    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    if y.ndim != 1:
        raise ValueError(f"y_true must be 1D, got shape={y.shape}")
    if s.ndim != 2:
        raise ValueError(f"scores must be 2D, got shape={s.shape}")
    n, k = s.shape
    if y.shape[0] != n:
        raise ValueError("y_true and scores must match in n_samples")

    y = y.astype(int)
    if y.min() < 0 or y.max() >= k:
        raise ValueError(f"y_true values must be in [0, {k-1}]")

    true_scores = s[np.arange(n), y]

    s_other = s.copy()
    s_other[np.arange(n), y] = -np.inf
    max_other = np.max(s_other, axis=1)

    loss = np.maximum(0.0, margin + max_other - true_scores)

    if sample_weight is not None:
        w = _as_1d_float(sample_weight)
        if w.shape[0] != loss.shape[0]:
            raise ValueError("sample_weight must have the same length as y_true")
        if reduction == "mean":
            return float(np.sum(w * loss) / np.sum(w))
        if reduction == "sum":
            return float(np.sum(w * loss))
        raise ValueError("reduction must be 'mean' or 'sum'")

    if reduction == "mean":
        return float(np.mean(loss))
    if reduction == "sum":
        return float(np.sum(loss))
    raise ValueError("reduction must be 'mean' or 'sum'")


In [None]:
# --- Binary: compare against sklearn.metrics.hinge_loss ---

y_true_01 = np.array([0, 1, 0, 1])
score = np.array([-0.2, 0.5, 0.3, 1.2])

skl = skl_hinge_loss(y_true_01, score)
ours = binary_hinge_loss(y_true_01, score)
print("binary | sklearn:", skl)
print("binary | numpy :", ours)

# --- Multiclass: compare against sklearn.metrics.hinge_loss ---

y_true_mc = np.array([0, 1, 2])
scores_mc = np.array(
    [
        [2.0, 0.0, -1.0],
        [0.1, 0.2, 0.0],
        [-1.0, 0.0, 3.0],
    ]
)

skl_mc = skl_hinge_loss(y_true_mc, scores_mc)
ours_mc = multiclass_hinge_loss(y_true_mc, scores_mc)
print("multiclass | sklearn:", skl_mc)
print("multiclass | numpy :", ours_mc)


## 4) Using hinge loss to optimize a linear classifier (soft-margin SVM style)

A common choice is a linear score function:

$$
s_i = f(x_i) = w^T x_i + b.
$$

A soft-margin (primal) SVM objective is:

$$
J(w,b) = \frac{1}{2}\lVert w \rVert^2 + C\,\frac{1}{n}\sum_{i=1}^n \max\big(0, 1 - y_i(w^T x_i + b)\big).
$$

- The $\tfrac12\lVert w \rVert^2$ term is **L2 regularization** (prefers a wider margin).
- $C>0$ trades off margin size vs hinge penalties.

### Subgradient (what we need for gradient descent)

The hinge part is **not differentiable** at $m_i = 1$.
But it’s convex, so we can use a **subgradient**.

Let $m_i = y_i(w^T x_i + b)$ and define the “violators”:

$$
\mathcal{V} = \{i : m_i < 1\}.
$$

A convenient subgradient (taking the contribution of any point exactly at the kink $m_i = 1$ to be zero, which is a valid choice) is:

$$
\nabla_w J = w - \frac{C}{n}\sum_{i\in\mathcal{V}} y_i x_i,
\qquad
\nabla_b J = - \frac{C}{n}\sum_{i\in\mathcal{V}} y_i.
$$

We’ll implement full-batch subgradient descent below.
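
Before relying on these formulas, it can help to compare them against a central finite difference of $J$ at a random point. Away from the kinks ($m_i \ne 1$ for every $i$) the objective is differentiable, so the two should agree closely. The sketch below is self-contained; the names `objective_demo`, `subgradient_demo`, and the random data are assumptions made for this check only.

```python
import numpy as np


def objective_demo(w, b, X, y, C):
    """J(w, b) = 0.5 * ||w||^2 + C * mean hinge, as defined above."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * float(w @ w) + C * float(np.mean(hinge))


def subgradient_demo(w, b, X, y, C):
    """Subgradient from the formulas above, using the violator set m_i < 1."""
    n = X.shape[0]
    viol = y * (X @ w + b) < 1.0
    grad_w = w - (C / n) * (X[viol].T @ y[viol])
    grad_b = -(C / n) * float(np.sum(y[viol]))
    return grad_w, grad_b


# Random test point and data (illustrative values only).
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(50, 3))
y_demo = np.where(rng_demo.random(50) < 0.5, -1.0, 1.0)
w_demo = rng_demo.normal(size=3)
b_demo, C_demo, eps = 0.1, 2.0, 1e-6

grad_w, grad_b = subgradient_demo(w_demo, b_demo, X_demo, y_demo, C_demo)

# Central finite differences, one coordinate at a time.
num_w = np.zeros_like(w_demo)
for j in range(w_demo.size):
    e = np.zeros_like(w_demo)
    e[j] = eps
    num_w[j] = (
        objective_demo(w_demo + e, b_demo, X_demo, y_demo, C_demo)
        - objective_demo(w_demo - e, b_demo, X_demo, y_demo, C_demo)
    ) / (2 * eps)
num_b = (
    objective_demo(w_demo, b_demo + eps, X_demo, y_demo, C_demo)
    - objective_demo(w_demo, b_demo - eps, X_demo, y_demo, C_demo)
) / (2 * eps)

max_err = max(float(np.max(np.abs(num_w - grad_w))), abs(num_b - grad_b))
print("max abs difference:", max_err)
```

If a margin happened to lie within `eps` of 1, the finite difference would straddle a kink and the comparison could fail, which is expected behavior for a non-smooth objective rather than a bug in the formula.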


In [None]:
@dataclass
class LinearSVMHistory:
    objective: list[float]
    mean_hinge: list[float]
    accuracy: list[float]


def linear_svm_objective(
    w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray, *, C: float = 1.0
) -> tuple[float, float]:
    scores = X @ w + b
    hinge = np.maximum(0.0, 1.0 - y * scores)
    obj = 0.5 * float(w @ w) + C * float(np.mean(hinge))
    return obj, float(np.mean(hinge))


def linear_svm_subgrad(
    w: np.ndarray, b: float, X: np.ndarray, y: np.ndarray, *, C: float = 1.0
) -> tuple[np.ndarray, float]:
    n = X.shape[0]
    scores = X @ w + b
    margins = y * scores
    viol = margins < 1.0

    grad_w = w.copy()
    grad_b = 0.0

    if np.any(viol):
        grad_w -= (C / n) * (X[viol].T @ y[viol])
        grad_b = -(C / n) * float(np.sum(y[viol]))

    return grad_w, grad_b


def train_linear_svm_subgradient_descent(
    X: np.ndarray,
    y: np.ndarray,
    *,
    C: float = 1.0,
    lr: float = 0.2,
    n_epochs: int = 200,
    seed: int = 42,
) -> tuple[np.ndarray, float, LinearSVMHistory]:
    """Train a linear classifier with L2 + hinge using full-batch subgradient descent."""

    rng_local = np.random.default_rng(seed)
    w = rng_local.normal(scale=0.01, size=X.shape[1])
    b = 0.0

    hist = LinearSVMHistory(objective=[], mean_hinge=[], accuracy=[])

    for _ in range(n_epochs):
        obj, mean_hinge = linear_svm_objective(w, b, X, y, C=C)
        scores = X @ w + b
        y_pred = np.where(scores >= 0.0, 1.0, -1.0)
        acc = float(np.mean(y_pred == y))

        hist.objective.append(obj)
        hist.mean_hinge.append(mean_hinge)
        hist.accuracy.append(acc)

        grad_w, grad_b = linear_svm_subgrad(w, b, X, y, C=C)
        w = w - lr * grad_w
        b = b - lr * grad_b

    return w, b, hist


# --- Make a simple dataset ---
X_raw, y01 = make_blobs(n_samples=250, centers=2, cluster_std=1.8, random_state=42)
y_pm1 = np.where(y01 == 0, -1.0, 1.0)

scaler = StandardScaler()
X = scaler.fit_transform(X_raw)

w, b, hist = train_linear_svm_subgradient_descent(X, y_pm1, C=2.0, lr=0.15, n_epochs=220)

print("final objective:", hist.objective[-1])
print("final mean hinge:", hist.mean_hinge[-1])
print("final accuracy :", hist.accuracy[-1])


In [None]:
from plotly.subplots import make_subplots

epochs = np.arange(len(hist.objective))

fig = make_subplots(
    rows=3,
    cols=1,
    shared_xaxes=True,
    vertical_spacing=0.06,
    subplot_titles=(
        "Objective  (0.5||w||^2 + C·mean_hinge)",
        "Mean hinge loss",
        "Accuracy",
    ),
)

fig.add_trace(go.Scatter(x=epochs, y=hist.objective, name="objective"), row=1, col=1)
fig.add_trace(go.Scatter(x=epochs, y=hist.mean_hinge, name="mean hinge"), row=2, col=1)
fig.add_trace(go.Scatter(x=epochs, y=hist.accuracy, name="accuracy"), row=3, col=1)

fig.update_yaxes(title_text="value", row=1, col=1)
fig.update_yaxes(title_text="value", row=2, col=1)
fig.update_yaxes(title_text="", row=3, col=1, range=[0, 1.02])
fig.update_xaxes(title_text="epoch", row=3, col=1)

fig.update_layout(height=700, title="Training curves (full-batch subgradient descent)")
fig.show()


In [None]:
# Visualize decision boundary + margin band in 2D

x1_min, x1_max = float(X[:, 0].min() - 1.0), float(X[:, 0].max() + 1.0)
x2_min, x2_max = float(X[:, 1].min() - 1.0), float(X[:, 1].max() + 1.0)

xs = np.linspace(x1_min, x1_max, 200)

w0, w1 = float(w[0]), float(w[1])


def boundary_line(level: float) -> tuple[np.ndarray, np.ndarray]:
    """Return points (x1, x2) satisfying w0*x1 + w1*x2 + b = level."""
    if abs(w1) > 1e-10:
        x1 = xs
        x2 = (level - b - w0 * x1) / w1
        return x1, x2

    # Vertical line fallback
    x1 = np.full_like(xs, (level - b) / w0)
    x2 = np.linspace(x2_min, x2_max, xs.shape[0])
    return x1, x2


margins = y_pm1 * (X @ w + b)
support = margins <= 1.0 + 1e-12

fig = go.Figure()

# points by class
for cls, color in [(-1.0, "#1f77b4"), (1.0, "#d62728")]:
    mask = y_pm1 == cls
    fig.add_trace(
        go.Scatter(
            x=X[mask, 0],
            y=X[mask, 1],
            mode="markers",
            name=f"y={int(cls)}",
            marker=dict(size=8, color=color, line=dict(width=0)),
        )
    )

# highlight support vectors
fig.add_trace(
    go.Scatter(
        x=X[support, 0],
        y=X[support, 1],
        mode="markers",
        name="support (m ≤ 1)",
        marker=dict(size=14, color="rgba(0,0,0,0)", line=dict(width=2, color="black")),
    )
)

# decision boundary and margins
for level, name, dash, width, color in [
    (0.0, "decision f(x)=0", "solid", 3, "black"),
    (1.0, "+margin f(x)=+1", "dash", 2, "gray"),
    (-1.0, "-margin f(x)=-1", "dash", 2, "gray"),
]:
    x1, x2 = boundary_line(level)
    fig.add_trace(
        go.Scatter(
            x=x1,
            y=x2,
            mode="lines",
            name=name,
            line=dict(dash=dash, width=width, color=color),
        )
    )

fig.update_layout(
    title="Learned linear classifier with hinge loss (support vectors highlighted)",
    xaxis_title="x1 (scaled)",
    yaxis_title="x2 (scaled)",
)
fig.update_xaxes(range=[x1_min, x1_max])
fig.update_yaxes(range=[x2_min, x2_max])
fig.show()


## 5) The role of `C` (regularization trade-off)

In the objective

$$
\tfrac12\lVert w\rVert^2 + C\,\text{mean hinge},
$$

- **small `C`**: regularization dominates → wider margin, more tolerance for violations
- **large `C`**: hinge penalties dominate → tries harder to fit training points (narrower margin)

Below we train three models with different `C` values and compare the resulting decision boundaries.
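
To put a number on "wider" and "narrower": the band between the lines $f(x) = -1$ and $f(x) = +1$ has geometric width $2 / \lVert w \rVert$. The sketch below trains a small subgradient-descent model for three values of `C` on synthetic two-class data and reports that width. The helper `train_demo`, the data, and the step size are assumptions for illustration, and fixed-step subgradient descent only approximates the minimizer.

```python
import numpy as np


def train_demo(X, y, C, lr=0.05, n_epochs=500):
    """Full-batch subgradient descent on 0.5*||w||^2 + C*mean hinge (illustrative)."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        viol = y * (X @ w + b) < 1.0
        grad_w = w - (C / n) * (X[viol].T @ y[viol])
        grad_b = -(C / n) * float(np.sum(y[viol]))
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b


# Two overlapping Gaussian blobs centred at (-1, -1) and (+1, +1) (synthetic data).
rng_demo = np.random.default_rng(0)
n_per_class = 100
X_demo = np.vstack(
    [
        rng_demo.normal(loc=-1.0, size=(n_per_class, 2)),
        rng_demo.normal(loc=+1.0, size=(n_per_class, 2)),
    ]
)
y_demo = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])

widths = {}
for C_val in [0.1, 1.0, 10.0]:
    w_demo, b_demo = train_demo(X_demo, y_demo, C_val)
    w_norm = float(np.linalg.norm(w_demo))
    widths[C_val] = 2.0 / w_norm  # geometric width of the margin band
    print(f"C={C_val:>5}: ||w||={w_norm:.3f}  margin width={widths[C_val]:.3f}")
```

On this data the smallest `C` should give the widest band and the largest `C` a much narrower one, matching the two bullets above.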


In [None]:
from plotly.subplots import make_subplots

Cs = [0.2, 2.0, 20.0]
models: list[tuple[float, np.ndarray, float]] = []

for C in Cs:
    w_c, b_c, _ = train_linear_svm_subgradient_descent(
        X, y_pm1, C=C, lr=0.15, n_epochs=220, seed=42
    )
    models.append((C, w_c, b_c))

fig = make_subplots(rows=1, cols=len(Cs), subplot_titles=[f"C={C}" for C in Cs])

for col, (C, w_c, b_c) in enumerate(models, start=1):
    # data
    for cls, color in [(-1.0, "#1f77b4"), (1.0, "#d62728")]:
        mask = y_pm1 == cls
        fig.add_trace(
            go.Scatter(
                x=X[mask, 0],
                y=X[mask, 1],
                mode="markers",
                marker=dict(size=6, color=color),
                showlegend=(col == 1),
                name=f"y={int(cls)}",
            ),
            row=1,
            col=col,
        )

    # boundary (only f(x)=0 to keep it readable)
    w0, w1 = float(w_c[0]), float(w_c[1])
    if abs(w1) > 1e-10:
        x1 = xs
        x2 = (0.0 - b_c - w0 * x1) / w1
    else:
        x1 = np.full_like(xs, (0.0 - b_c) / w0)
        x2 = np.linspace(x2_min, x2_max, xs.shape[0])

    fig.add_trace(
        go.Scatter(
            x=x1,
            y=x2,
            mode="lines",
            line=dict(width=3, color="black"),
            showlegend=False,
            name="boundary",
        ),
        row=1,
        col=col,
    )

fig.update_layout(
    height=420,
    title="Effect of C on the learned decision boundary",
)
for col in range(1, len(Cs) + 1):
    fig.update_xaxes(title_text="x1", range=[x1_min, x1_max], row=1, col=col)
    fig.update_yaxes(title_text="x2", range=[x2_min, x2_max], row=1, col=col)

fig.show()


## 6) Practical usage: `sklearn.metrics.hinge_loss`

`sklearn.metrics.hinge_loss(y_true, pred_decision, ...)` expects:

- **binary**: `pred_decision.shape == (n_samples,)` (a real-valued decision score)
- **multiclass**: `pred_decision.shape == (n_samples, n_classes)` (one score per class)

A common workflow:

1) train a classifier that exposes `decision_function`
2) compute `pred_decision = model.decision_function(X)`
3) evaluate with `hinge_loss(y_true, pred_decision)`
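
As a sketch of these three steps in the multiclass case, the example below fits `LinearSVC` on synthetic three-class blobs, scores it with `hinge_loss`, and cross-checks the result against a direct NumPy evaluation of the Crammer–Singer formula from section 3. The dataset and the `_mc` variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import hinge_loss
from sklearn.svm import LinearSVC

# Synthetic 3-class data; labels are 0, 1, 2 so they index score columns directly.
X_mc, y_mc = make_blobs(n_samples=150, centers=3, random_state=0)

# 1) train a classifier that exposes decision_function
clf_mc = LinearSVC(C=1.0, max_iter=10_000, random_state=0).fit(X_mc, y_mc)

# 2) one raw score per class: shape (n_samples, n_classes)
dec_mc = clf_mc.decision_function(X_mc)

# 3) evaluate with hinge_loss
loss_mc = hinge_loss(y_mc, dec_mc)

# Cross-check: mean of max(0, 1 + max_{j != y} s_j - s_y)
n_mc = dec_mc.shape[0]
true_score = dec_mc[np.arange(n_mc), y_mc]
others = dec_mc.copy()
others[np.arange(n_mc), y_mc] = -np.inf
manual_mc = float(np.mean(np.maximum(0.0, 1.0 + others.max(axis=1) - true_score)))

print("decision scores shape:", dec_mc.shape)
print("sklearn hinge_loss   :", loss_mc)
print("manual Crammer-Singer:", manual_mc)
```

Note that `LinearSVC` trains one-vs-rest classifiers by default, so the Crammer–Singer hinge is used here purely as an evaluation metric on the resulting scores.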

Below we fit `LinearSVC` and compare `sklearn`’s hinge loss to our NumPy implementation.


In [None]:
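# Note: LinearSVC trains with loss="squared_hinge" by default; the metric below
# evaluates the plain hinge loss on its decision scores.
clf = LinearSVC(C=2.0, dual=True, random_state=42)
clf.fit(X, y01)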

dec = clf.decision_function(X)

skl = skl_hinge_loss(y01, dec)
ours = binary_hinge_loss(np.where(y01 == 0, -1.0, 1.0), dec)

print("sklearn hinge_loss:", skl)
print("numpy  hinge_loss:", ours)


## 7) Pros, cons, and when to use hinge loss

### Pros

- **Convex** (for linear models): every local minimum is a global minimum, so optimization is well-behaved.
- **Margin-aware**: doesn’t just separate classes; encourages a safety buffer.
- **Sparse dependence on data** (SVM view): only points with $m \le 1$ influence the solution.
- Often strong performance for **high-dimensional** classification (e.g., text with bag-of-words / TF-IDF).

### Cons

- **Non-smooth** at $m=1$ (requires subgradients or a smoothed variant).
- Produces **uncalibrated scores** (unlike logistic loss, it’s not a log-likelihood).
- Not ideal when you need **probabilities** or well-calibrated uncertainty.
- Can be sensitive to **label noise** near the boundary (like most margin-based methods).

### Good use cases

- Binary or multiclass classification when you care about **large margins**.
- Linear classification on large, sparse feature spaces (classic SVM territory).
- As a surrogate for the 0–1 loss when you need a convex objective.


## 8) Common pitfalls and diagnostics

- **Use decision scores**: hinge loss needs raw scores (e.g., `decision_function`), not probabilities.
- **Label encoding**: math is cleanest with $y\in\{-1,+1\}$; many libraries accept `{0,1}` but be explicit.
- **Feature scaling**: for linear models with L2 regularization, scaling can strongly affect the margin and the effective regularization.
- **Class imbalance**: hinge loss itself doesn’t fix imbalance; consider class weights or re-sampling.
- **Interpretation**: a lower hinge loss generally means larger margins, but it’s not a calibrated probability of correctness.
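
The first pitfall is easy to reproduce. In the sketch below, a set of made-up decision scores achieves zero hinge loss, but passing the same scores through a sigmoid (used here only as a stand-in for "probabilities") yields a large loss. With values in $(0, 1)$, a negative example has margin $-p < 0$ and therefore always pays more than 1.

```python
import numpy as np


def mean_hinge_pm1(y_pm1, scores):
    """Mean hinge loss for labels in {-1, +1} (illustrative helper)."""
    return float(np.mean(np.maximum(0.0, 1.0 - y_pm1 * scores)))


y_pm = np.array([-1.0, -1.0, 1.0, 1.0])
raw_scores = np.array([-2.0, -1.5, 1.2, 3.0])  # every margin is >= 1
probs = 1.0 / (1.0 + np.exp(-raw_scores))      # squashed into (0, 1)

loss_scores = mean_hinge_pm1(y_pm, raw_scores)
loss_probs = mean_hinge_pm1(y_pm, probs)

print("hinge on raw scores   :", loss_scores)
print("hinge on probabilities:", loss_probs)
```

Both inputs rank and classify the four examples identically at their natural thresholds; only the raw scores give a hinge loss that reflects the margins.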


## Exercises

1) Implement **squared hinge loss** and compare optimization behavior (smoother gradients).
2) Add **L1 regularization** and see how it changes sparsity in `w`.
3) Compare hinge vs logistic loss on the same dataset: decision boundary, calibration, and outliers.
4) Implement **SGD** (mini-batches) for the hinge objective and compare convergence.

## References

- scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hinge_loss.html
- Vapnik, *The Nature of Statistical Learning Theory*
- Cortes & Vapnik (1995), *Support-vector networks*
