# Cohen's Kappa (`cohen_kappa_score`)

Cohen’s kappa ($\kappa$) measures **agreement beyond chance** between two labelers.
In ML, it’s commonly used to evaluate a classifier against ground truth, especially under **class imbalance**.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}
$$

- $p_o$ = observed agreement (the fraction of samples on the diagonal of the confusion matrix; for classifier-vs-truth it equals **accuracy**)
- $p_e$ = expected agreement under independent marginals (chance agreement given the label prevalences)

Range: $\kappa\in[-1, 1]$ (1 = perfect agreement, 0 = chance-level agreement, $<0$ = worse than chance). The lower bound $-1$ is reached only when $p_o=0$ and $p_e=\tfrac{1}{2}$.


## Goals

- Build intuition for why $\kappa$ differs from accuracy
- Derive $\kappa$ from the confusion matrix (binary + multiclass)
- Implement $\kappa$ from scratch in NumPy (including **weighted $\kappa$** for ordinal labels)
- Use $\kappa$ in a simple optimization loop (threshold selection for a logistic regression model)

## Prerequisites

- Confusion matrix basics
- Basic probability (marginals, independence)


In [None]:
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)


## 1) Why $\kappa$ (and not just accuracy)?

Accuracy counts raw agreement.
If the data are imbalanced, a trivial classifier can look “good” by mostly predicting the majority class.

$\kappa$ adjusts for the agreement you would expect **by chance**, given the label frequencies.
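
As a quick hand calculation (a worked sketch using the definitions of $p_o$ and $p_e$ from the introduction), take 950 negatives, 50 positives, and a classifier that always predicts 0. The true label frequencies are $(0.95, 0.05)$ and the predicted frequencies are $(1, 0)$, so:

$$
p_o = \frac{950}{1000} = 0.95, \qquad
p_e = 0.95 \cdot 1 + 0.05 \cdot 0 = 0.95, \qquad
\kappa = \frac{0.95 - 0.95}{1 - 0.95} = 0
$$

All of the 95% accuracy is agreement that the marginals alone would produce, so nothing is left over for $\kappa$ to reward.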


In [None]:
# A trivial “majority class” classifier
y_true = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
kappa = cohen_kappa_score(y_true, y_pred)

print(f"accuracy = {accuracy:.3f}")
print(f"kappa    = {kappa:.3f}")


## 2) The math (confusion matrix → $\kappa$)

Let there be $K$ classes and a confusion matrix $C\in\mathbb{N}^{K\times K}$ where
$C_{ij}$ counts samples with **true** class $i$ predicted as class $j$.
Let $N = \sum_{i,j} C_{ij}$.

**Observed agreement** (diagonal mass):
$$
p_o = \frac{1}{N}\sum_{i=1}^K C_{ii}
$$

Define the marginal class frequencies (as proportions):
$$
r_i = \frac{1}{N}\sum_{j=1}^K C_{ij} \quad (\text{true prevalence}), \qquad
c_i = \frac{1}{N}\sum_{j=1}^K C_{ji} \quad (\text{predicted prevalence}).
$$

If true and predicted labels were **independent** but kept the same marginals, the expected agreement is:
$$
p_e = \sum_{i=1}^K r_i\,c_i
$$

Finally:
$$
\kappa = \frac{p_o - p_e}{1 - p_e}
$$

Notes:
- $\kappa=1$ when $p_o=1$.
- $\kappa=0$ when $p_o=p_e$ (no better than chance under the marginals).
- $\kappa<0$ when agreement is worse than chance.
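
For the binary case ($K=2$) the formulas above reduce to a closed form in the four confusion-matrix cells: $\kappa = \frac{2(TP\cdot TN - FP\cdot FN)}{(TP+FP)(FP+TN) + (TP+FN)(FN+TN)}$. The sketch below checks that identity on one made-up table; the helper names `kappa_binary_general` and `kappa_binary_closed_form` are hypothetical and not part of scikit-learn. Both functions assume $p_e < 1$ (otherwise the ratio is undefined).

```python
def kappa_binary_general(tp: int, fp: int, fn: int, tn: int) -> float:
    """Kappa via p_o and p_e, following the general definitions."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n
    r_pos = (tp + fn) / n  # true prevalence of the positive class
    c_pos = (tp + fp) / n  # predicted prevalence of the positive class
    p_e = r_pos * c_pos + (1 - r_pos) * (1 - c_pos)
    return (p_o - p_e) / (1 - p_e)


def kappa_binary_closed_form(tp: int, fp: int, fn: int, tn: int) -> float:
    """Same quantity, written directly in terms of the four cells."""
    num = 2 * (tp * tn - fp * fn)
    den = (tp + fp) * (fp + tn) + (tp + fn) * (fn + tn)
    return num / den


# Illustrative counts (made up): p_o = 0.7, p_e = 0.5
print(kappa_binary_general(tp=20, fp=5, fn=10, tn=15))
print(kappa_binary_closed_form(tp=20, fp=5, fn=10, tn=15))
```

Both calls should agree up to floating-point rounding (about 0.4 for these counts). The numerator makes the sign of $\kappa$ easy to read: it is positive exactly when $TP\cdot TN > FP\cdot FN$.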


In [None]:
def confusion_matrix_numpy(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    labels: np.ndarray | None = None,
) -> tuple[np.ndarray, np.ndarray]:
    """Confusion matrix C where C[i, j] counts true=labels[i], pred=labels[j]."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape.")

    if labels is None:
        labels = np.unique(np.concatenate([y_true, y_pred]))
    labels = np.asarray(labels)

    label_to_index = {label: i for i, label in enumerate(labels.tolist())}
    true_idx = np.array([label_to_index[label] for label in y_true], dtype=int)
    pred_idx = np.array([label_to_index[label] for label in y_pred], dtype=int)

    k = labels.size
    cm = np.zeros((k, k), dtype=int)
    np.add.at(cm, (true_idx, pred_idx), 1)
    return cm, labels


def kappa_components_from_cm(cm: np.ndarray) -> tuple[float, float]:
    """Return (p_o, p_e) computed from a confusion matrix."""
    cm = np.asarray(cm)
    n = cm.sum()
    if n == 0:
        raise ValueError("Confusion matrix has zero total count.")

    p_o = float(np.trace(cm) / n)
    row = cm.sum(axis=1) / n
    col = cm.sum(axis=0) / n
    p_e = float(row @ col)
    return p_o, p_e


def kappa_weight_matrix(n_classes: int, weights: str | np.ndarray | None) -> np.ndarray:
    """Disagreement weights W (0 on diagonal)."""
    if n_classes < 1:
        raise ValueError("n_classes must be >= 1")

    if weights is None:
        w = np.ones((n_classes, n_classes), dtype=float)
        np.fill_diagonal(w, 0.0)
        return w

    if isinstance(weights, str):
        if weights not in {"linear", "quadratic"}:
            raise ValueError("weights must be None, 'linear', 'quadratic', or a (K,K) array.")

        if n_classes == 1:
            return np.zeros((1, 1), dtype=float)

        idx = np.arange(n_classes)
        i, j = np.meshgrid(idx, idx, indexing="ij")
        if weights == "linear":
            return (np.abs(i - j) / (n_classes - 1)).astype(float)
        return (((i - j) ** 2) / ((n_classes - 1) ** 2)).astype(float)

    w = np.asarray(weights, dtype=float)
    if w.shape != (n_classes, n_classes):
        raise ValueError(f"custom weights must have shape ({n_classes},{n_classes})")
    return w


def cohen_kappa_score_numpy(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    labels: np.ndarray | None = None,
    weights: str | np.ndarray | None = None,
) -> float:
    """Cohen's kappa from scratch (unweighted or weighted)."""
    cm, labels = confusion_matrix_numpy(y_true, y_pred, labels=labels)
    n = cm.sum()
    if n == 0:
        raise ValueError("No samples.")

    o = cm / n
    row = o.sum(axis=1)
    col = o.sum(axis=0)
    e = np.outer(row, col)

    w = kappa_weight_matrix(labels.size, weights)
    denom = float((w * e).sum())
    num = float((w * o).sum())

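    # Degenerate case: expected disagreement is zero, i.e. both marginals put
    # all their mass on one shared class. Agreement is then perfect, so we
    # return 1.0 by convention instead of dividing by zero.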
    if np.isclose(denom, 0.0):
        return 1.0

    return 1.0 - (num / denom)


In [None]:
# Sanity check: match scikit-learn
y_true = rng.integers(0, 4, size=1_000)
y_pred = rng.integers(0, 4, size=1_000)

k_np = cohen_kappa_score_numpy(y_true, y_pred)
k_sk = cohen_kappa_score(y_true, y_pred)
print("unweighted", k_np, k_sk)
assert np.isclose(k_np, k_sk)

k_np_lin = cohen_kappa_score_numpy(y_true, y_pred, weights="linear")
k_sk_lin = cohen_kappa_score(y_true, y_pred, weights="linear")
print("linear   ", k_np_lin, k_sk_lin)
assert np.isclose(k_np_lin, k_sk_lin)

k_np_quad = cohen_kappa_score_numpy(y_true, y_pred, weights="quadratic")
k_sk_quad = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print("quadratic", k_np_quad, k_sk_quad)
assert np.isclose(k_np_quad, k_sk_quad)


## 3) Visual intuition: observed vs expected agreement

$\kappa$ depends on both:
- how much mass is on the diagonal (observed agreement)
- how concentrated the **marginals** are (chance agreement)

Let’s look at a small multiclass example.


In [None]:
def plot_confusion_matrix(cm: np.ndarray, labels: np.ndarray, title: str) -> go.Figure:
    fig = px.imshow(
        cm,
        x=[str(l) for l in labels],
        y=[str(l) for l in labels],
        text_auto=True,
        color_continuous_scale="Blues",
        title=title,
        labels={"x": "pred", "y": "true", "color": "count"},
    )
    fig.update_yaxes(autorange="reversed")
    fig.update_layout(coloraxis_showscale=False)
    return fig


def plot_true_vs_pred_marginals(cm: np.ndarray, labels: np.ndarray, title: str) -> go.Figure:
    n = cm.sum()
    true_marg = cm.sum(axis=1) / n
    pred_marg = cm.sum(axis=0) / n

    fig = go.Figure()
    fig.add_trace(go.Bar(x=[str(l) for l in labels], y=true_marg, name="true", opacity=0.8))
    fig.add_trace(go.Bar(x=[str(l) for l in labels], y=pred_marg, name="pred", opacity=0.8))
    fig.update_layout(
        barmode="group",
        title=title,
        xaxis_title="class",
        yaxis_title="proportion",
    )
    fig.update_yaxes(range=[0, 1])
    return fig


In [None]:
labels = np.array([0, 1, 2])
y_true = np.r_[np.zeros(200, dtype=int), np.ones(150, dtype=int), np.full(50, 2, dtype=int)]

y_pred = y_true.copy()
flip_idx = rng.choice(y_true.size, size=80, replace=False)
y_pred[flip_idx] = (y_true[flip_idx] + rng.integers(1, 3, size=flip_idx.size)) % 3

cm, labels_used = confusion_matrix_numpy(y_true, y_pred, labels=labels)
p_o, p_e = kappa_components_from_cm(cm)
kappa = cohen_kappa_score_numpy(y_true, y_pred, labels=labels)

print(f"p_o (observed) = {p_o:.3f}")
print(f"p_e (expected) = {p_e:.3f}")
print(f"kappa          = {kappa:.3f}")

plot_confusion_matrix(cm, labels_used, title="Confusion matrix (example)")


In [None]:
plot_true_vs_pred_marginals(cm, labels_used, title="True vs predicted marginals")


## 4) How prevalence changes $\kappa$

Even if two classifiers have the same accuracy, $\kappa$ can differ depending on class prevalence because $p_e$ changes with the marginals.

Below we simulate a binary classifier with a fixed error rate: each true label is flipped independently with probability $q$.
Accuracy stays at $1-q$ (up to sampling noise), but $\kappa$ falls as the positive class prevalence moves toward 0 or 1, because $p_e$ rises.
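
The same effect can be written down without simulation. Under the flip model, the predicted prevalence is $\pi(1-q) + (1-\pi)q$, which gives $p_e$ directly. The sketch below is a population-level approximation (no sampling noise); `kappa_symmetric_noise` is a hypothetical helper name.

```python
import numpy as np


def kappa_symmetric_noise(prevalence, q: float) -> np.ndarray:
    """Population kappa when each true label is flipped independently with prob q."""
    pi = np.asarray(prevalence, dtype=float)
    p_o = 1.0 - q                                # accuracy does not depend on pi
    pred_prev = pi * (1.0 - q) + (1.0 - pi) * q  # P(y_hat = 1)
    p_e = pi * pred_prev + (1.0 - pi) * (1.0 - pred_prev)
    return (p_o - p_e) / (1.0 - p_e)


print(kappa_symmetric_noise([0.02, 0.5, 0.98], q=0.10))
```

For $q=0.10$ this gives $\kappa = 0.8$ at $\pi = 0.5$ and roughly $0.24$ at $\pi = 0.02$ or $0.98$, even though accuracy is 0.9 in all three cases.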


In [None]:
def simulate_symmetric_noise(prevalences: np.ndarray, q: float, n: int = 50_000) -> tuple[np.ndarray, np.ndarray]:
    accs: list[float] = []
    kappas: list[float] = []
    for pi in prevalences:
        y = (rng.random(n) < pi).astype(int)
        flip = rng.random(n) < q
        y_hat = y.copy()
        y_hat[flip] = 1 - y_hat[flip]
        accs.append(float((y == y_hat).mean()))
        kappas.append(cohen_kappa_score_numpy(y, y_hat))
    return np.array(accs), np.array(kappas)


prevalences = np.linspace(0.02, 0.98, 60)
q = 0.10
accs, kappas = simulate_symmetric_noise(prevalences, q=q)

fig = go.Figure()
fig.add_trace(go.Scatter(x=prevalences, y=accs, mode="lines", name="accuracy"))
fig.add_trace(go.Scatter(x=prevalences, y=kappas, mode="lines", name="kappa"))
fig.update_layout(
    title=f"Fixed error rate q={q:.2f}: accuracy vs kappa across prevalence",
    xaxis_title="positive class prevalence π = P(y=1)",
    yaxis_title="metric value",
)
fig.update_yaxes(range=[-0.1, 1.0])
fig


In [None]:
# Constant predictor (always 0): accuracy can be high when class 0 dominates, but kappa stays at 0
prevalences = np.linspace(0.02, 0.98, 60)
n = 50_000

accs: list[float] = []
kappas: list[float] = []
for pi in prevalences:
    y = (rng.random(n) < pi).astype(int)
    y_hat = np.zeros_like(y)
    accs.append(float((y == y_hat).mean()))
    kappas.append(cohen_kappa_score_numpy(y, y_hat))

fig = go.Figure()
fig.add_trace(go.Scatter(x=prevalences, y=accs, mode="lines", name="accuracy"))
fig.add_trace(go.Scatter(x=prevalences, y=kappas, mode="lines", name="kappa"))
fig.update_layout(
    title="Always-predict-0 baseline across prevalence",
    xaxis_title="positive class prevalence π = P(y=1)",
    yaxis_title="metric value",
)
fig.update_yaxes(range=[-0.1, 1.0])
fig


## 5) Weighted $\kappa$ (ordinal labels)

If classes are **ordered** (e.g., 1–5 star ratings), “off by 1” is usually less severe than “off by 4”.
Weighted kappa replaces exact agreement with a **weighted disagreement**.

Let $O_{ij}=C_{ij}/N$ be the observed joint distribution and $E_{ij}=r_i c_j$ the expected joint distribution.
Choose disagreement weights $w_{ij}$ with $w_{ii}=0$.

$$
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}}
$$

Common choices:
- linear: $w_{ij}=\frac{|i-j|}{K-1}$
- quadratic: $w_{ij}=\frac{(i-j)^2}{(K-1)^2}$

These only make sense when the class order is meaningful.
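
A hand-sized example makes the effect of the weights concrete. The two 3-class confusion matrices below are made up for illustration: both have the same diagonal and the same uniform marginals, but one places its errors between adjacent classes and the other between the extreme classes. `weighted_kappa_from_cm` is a hypothetical helper that applies the $\kappa_w$ formula above directly to a confusion matrix.

```python
import numpy as np


def weighted_kappa_from_cm(cm: np.ndarray, scheme: str) -> float:
    """kappa_w = 1 - sum(w * O) / sum(w * E) for a square confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    k = cm.shape[0]
    idx = np.arange(k)
    dist = np.abs(idx[:, None] - idx[None, :])
    if scheme == "unweighted":
        w = (dist > 0).astype(float)  # every disagreement costs 1
    elif scheme == "linear":
        w = dist / (k - 1)
    elif scheme == "quadratic":
        w = (dist / (k - 1)) ** 2
    else:
        raise ValueError("scheme must be 'unweighted', 'linear', or 'quadratic'")
    o = cm / cm.sum()                           # observed joint distribution
    e = np.outer(o.sum(axis=1), o.sum(axis=0))  # expected under independence
    return float(1.0 - (w * o).sum() / (w * e).sum())


cm_near = np.array([[8, 2, 0], [2, 6, 2], [0, 2, 8]])   # errors are off by 1
cm_far = np.array([[6, 0, 4], [0, 10, 0], [4, 0, 6]])   # errors are off by 2

for scheme in ("unweighted", "linear", "quadratic"):
    near = weighted_kappa_from_cm(cm_near, scheme)
    far = weighted_kappa_from_cm(cm_far, scheme)
    print(f"{scheme:>10}: near={near:.2f}  far={far:.2f}")
```

Unweighted $\kappa$ cannot tell the two matrices apart (0.60 for both). Linear weights give 0.70 vs 0.40 and quadratic weights give 0.80 vs 0.20, so the far errors are penalized progressively harder.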


In [None]:
def plot_weight_matrix(w: np.ndarray, labels: np.ndarray, title: str) -> go.Figure:
    fig = px.imshow(
        w,
        x=[str(l) for l in labels],
        y=[str(l) for l in labels],
        text_auto=".2f",
        color_continuous_scale="Reds",
        title=title,
        labels={"x": "pred", "y": "true", "color": "disagreement"},
    )
    fig.update_yaxes(autorange="reversed")
    return fig


k = 5
labels = np.arange(k)
y_true = rng.integers(0, k, size=2_000)
noise = rng.choice([-2, -1, 0, 1, 2], size=y_true.size, p=[0.05, 0.15, 0.60, 0.15, 0.05])
y_pred = np.clip(y_true + noise, 0, k - 1)

k_unw = cohen_kappa_score_numpy(y_true, y_pred)
k_lin = cohen_kappa_score_numpy(y_true, y_pred, weights="linear")
k_quad = cohen_kappa_score_numpy(y_true, y_pred, weights="quadratic")

print(f"unweighted kappa = {k_unw:.3f}")
print(f"linear kappa     = {k_lin:.3f}")
print(f"quadratic kappa  = {k_quad:.3f}")

cm, labels_used = confusion_matrix_numpy(y_true, y_pred, labels=labels)
plot_confusion_matrix(cm, labels_used, title="Ordinal example: confusion matrix")


In [None]:
w_lin = kappa_weight_matrix(k, "linear")
w_quad = kappa_weight_matrix(k, "quadratic")

fig = make_subplots(rows=1, cols=2, subplot_titles=["Linear weights", "Quadratic weights"])

fig.add_trace(
    go.Heatmap(z=w_lin, x=labels.astype(str), y=labels.astype(str), colorscale="Reds", showscale=False),
    row=1,
    col=1,
)
fig.add_trace(
    go.Heatmap(z=w_quad, x=labels.astype(str), y=labels.astype(str), colorscale="Reds", showscale=False),
    row=1,
    col=2,
)

fig.update_yaxes(autorange="reversed")
fig.update_layout(title="Disagreement weights (larger = more severe)")
fig


## 6) Using $\kappa$ for optimization (threshold tuning)

$\kappa$ is computed from **discrete** predictions, so it is not differentiable with respect to model parameters.
That means you usually *don’t* train a model by directly gradient-descenting on $\kappa$.
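
A tiny example shows what this looks like in practice: as a function of the decision threshold, $\kappa$ is piecewise constant and only jumps when a prediction flips. The scores below are made up, and `binary_kappa` is a hypothetical minimal helper that assumes $p_e < 1$.

```python
import numpy as np


def binary_kappa(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted kappa for 0/1 labels (assumes p_e < 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    p_o = float(np.mean(y_true == y_pred))
    r_pos, c_pos = float(y_true.mean()), float(y_pred.mean())
    p_e = r_pos * c_pos + (1.0 - r_pos) * (1.0 - c_pos)
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels and scores, made up for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.1, 0.3, 0.7, 0.9, 0.6, 0.4])

for t in (0.45, 0.50, 0.55, 0.65):
    y_hat = (proba >= t).astype(int)
    print(f"t={t:.2f}  kappa={binary_kappa(y_true, y_hat):.3f}")
```

The first three thresholds produce identical predictions and therefore identical $\kappa$ (about 0.333); at $t=0.65$ one prediction flips and $\kappa$ jumps to about 0.667. Small changes to the scores or the threshold usually leave $\kappa$ unchanged, so its gradient is zero almost everywhere.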

A common pattern is:
1. Train a probabilistic model with a differentiable objective (e.g., log loss)
2. Pick a decision threshold on a validation set to maximize $\kappa$

Below: logistic regression trained from scratch with gradient descent, then a threshold sweep to maximize $\kappa$.


In [None]:
def sigmoid(z: np.ndarray) -> np.ndarray:
    z = np.asarray(z)
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def standardize_fit(X: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0.0, 1.0, std)
    return (X - mean) / std, mean, std


def standardize_transform(X: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    return (X - mean) / std


def add_intercept(X: np.ndarray) -> np.ndarray:
    return np.c_[np.ones((X.shape[0], 1)), X]


def log_loss_from_logits(y: np.ndarray, logits: np.ndarray) -> float:
    # Stable binary cross-entropy: mean(log(1+exp(z)) - y*z)
    return float(np.mean(np.logaddexp(0.0, logits) - y * logits))


def fit_logistic_regression_gd(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 0.2,
    n_iters: int = 2_000,
    l2: float = 0.0,
    record_every: int = 20,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Return (weights, steps, losses). L2 excludes the intercept term."""
    X_i = add_intercept(X)
    y = y.astype(float)
    w = np.zeros(X_i.shape[1], dtype=float)

    steps: list[int] = []
    losses: list[float] = []

    for step in range(1, n_iters + 1):
        logits = X_i @ w
        p = sigmoid(logits)

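        grad = (X_i.T @ (p - y)) / y.size
        grad[1:] += l2 * w[1:]

        # Record the objective before the update so the data term (from logits)
        # and the L2 term are evaluated at the same weights.
        if step % record_every == 0 or step == 1:
            steps.append(step)
            losses.append(log_loss_from_logits(y, logits) + 0.5 * l2 * float(w[1:] @ w[1:]))

        w -= lr * grad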

    return w, np.array(steps), np.array(losses)


def predict_proba_logreg(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    X_i = add_intercept(X)
    return sigmoid(X_i @ w)


In [None]:
# Synthetic, imbalanced (80/20) binary classification problem
n0, n1 = 2_000, 500
mean0 = np.array([-1.2, -0.2])
mean1 = np.array([1.0, 0.9])
cov = np.array([[1.0, 0.45], [0.45, 1.0]])

X0 = rng.multivariate_normal(mean0, cov, size=n0)
X1 = rng.multivariate_normal(mean1, cov, size=n1)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n0, dtype=int), np.ones(n1, dtype=int)]

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

# Standardize based on train only
X_train_s, mean, std = standardize_fit(X_train)
X_val_s = standardize_transform(X_val, mean, std)
X_test_s = standardize_transform(X_test, mean, std)

w, steps, losses = fit_logistic_regression_gd(X_train_s, y_train, lr=0.2, n_iters=2_000, l2=0.1)

fig = px.line(x=steps, y=losses, title="Logistic regression (GD): training objective")
fig.update_layout(xaxis_title="iteration", yaxis_title="log loss + L2")
fig


In [None]:
def sweep_thresholds(y_true: np.ndarray, proba: np.ndarray, thresholds: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    kappas = np.empty_like(thresholds, dtype=float)
    accs = np.empty_like(thresholds, dtype=float)
    for i, t in enumerate(thresholds):
        y_hat = (proba >= t).astype(int)
        kappas[i] = cohen_kappa_score_numpy(y_true, y_hat)
        accs[i] = float((y_true == y_hat).mean())
    return kappas, accs


proba_val = predict_proba_logreg(X_val_s, w)
thresholds = np.linspace(0.01, 0.99, 99)
kappas, accs = sweep_thresholds(y_val, proba_val, thresholds)

best_idx = int(np.argmax(kappas))
best_t = float(thresholds[best_idx])

print(f"best threshold (val) = {best_t:.2f}")
print(f"best kappa (val)     = {kappas[best_idx]:.3f}")
print(f"accuracy @ best_t    = {accs[best_idx]:.3f}")

fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=kappas, mode="lines", name="kappa"))
fig.add_trace(go.Scatter(x=thresholds, y=accs, mode="lines", name="accuracy", opacity=0.7))
fig.add_vline(x=best_t, line_dash="dash", line_color="black")
fig.update_layout(
    title="Validation sweep: kappa and accuracy vs threshold",
    xaxis_title="threshold",
    yaxis_title="metric value",
)
fig


In [None]:
def eval_at_threshold(y_true: np.ndarray, proba: np.ndarray, t: float) -> dict[str, float]:
    y_hat = (proba >= t).astype(int)
    cm, _ = confusion_matrix_numpy(y_true, y_hat, labels=np.array([0, 1]))
    return {
        "threshold": float(t),
        "accuracy": float((y_true == y_hat).mean()),
        "kappa": cohen_kappa_score_numpy(y_true, y_hat),
        "tn": float(cm[0, 0]),
        "fp": float(cm[0, 1]),
        "fn": float(cm[1, 0]),
        "tp": float(cm[1, 1]),
    }


proba_test = predict_proba_logreg(X_test_s, w)

m_default = eval_at_threshold(y_test, proba_test, t=0.5)
m_best = eval_at_threshold(y_test, proba_test, t=best_t)

print("test @ t=0.50:", m_default)
print("test @ best_t:", m_best)

cm_default, _ = confusion_matrix_numpy(y_test, (proba_test >= 0.5).astype(int), labels=np.array([0, 1]))
cm_best, _ = confusion_matrix_numpy(y_test, (proba_test >= best_t).astype(int), labels=np.array([0, 1]))

fig = make_subplots(rows=1, cols=2, subplot_titles=["t = 0.50", f"t = {best_t:.2f}"])
fig.add_trace(go.Heatmap(z=cm_default, x=["0", "1"], y=["0", "1"], colorscale="Blues", showscale=False), row=1, col=1)
fig.add_trace(go.Heatmap(z=cm_best, x=["0", "1"], y=["0", "1"], colorscale="Blues", showscale=False), row=1, col=2)
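# update_layout(xaxis_title=...) would label only the first subplot, so set
# the axis titles through update_xaxes / update_yaxes to cover both panels.
fig.update_xaxes(title_text="pred")
fig.update_yaxes(title_text="true", autorange="reversed")
fig.update_layout(title="Test confusion matrices: default vs kappa-optimized threshold")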
fig


## 7) Pros, cons, and good use cases

**Pros**
- Chance-corrected agreement: unlike accuracy, it subtracts the agreement that the marginals alone would produce
- Works for binary and multiclass classification
- Weighted variants handle **ordinal** labels naturally

**Cons / caveats**
- Sensitive to class prevalence and prediction bias (the “prevalence paradox”): the same accuracy can yield different $\kappa$
- Not defined for probabilistic outputs; you must pick a threshold (or argmax) first
- Not differentiable → usually not a direct training loss (use it for model selection / threshold tuning)
- Can be unstable on small datasets (marginals estimated with high variance)

**Good use cases**
- Inter-annotator reliability (two humans labeling the same items)
- Evaluating classifiers on imbalanced data, where a baseline that merely matches the label prevalence already scores high accuracy
- Ordinal classification (use weighted $\kappa$)
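
The small-sample caveat listed above can be checked with a quick simulation. The sketch below repeatedly draws a binary dataset with 10% positives and a 10% label-flip rate (both values are arbitrary choices for illustration) and reports the spread of $\kappa$ across repeats. `binary_kappa` and `kappa_std` are hypothetical helper names.

```python
import numpy as np


def binary_kappa(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Unweighted kappa for 0/1 labels; NaN if p_e == 1 (undefined)."""
    p_o = float(np.mean(y_true == y_pred))
    r_pos, c_pos = float(y_true.mean()), float(y_pred.mean())
    p_e = r_pos * c_pos + (1.0 - r_pos) * (1.0 - c_pos)
    if np.isclose(p_e, 1.0):
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


def kappa_std(n: int, prevalence: float = 0.1, q: float = 0.1,
              n_repeats: int = 500, seed: int = 0) -> float:
    """Standard deviation of kappa over repeated samples of size n."""
    rng = np.random.default_rng(seed)
    values = np.empty(n_repeats)
    for r in range(n_repeats):
        y = (rng.random(n) < prevalence).astype(int)
        flip = rng.random(n) < q
        y_hat = np.where(flip, 1 - y, y)
        values[r] = binary_kappa(y, y_hat)
    return float(np.nanstd(values))  # ignore the rare undefined draws


for n in (50, 500, 5_000):
    print(f"n={n:>5}  std of kappa = {kappa_std(n):.3f}")
```

The spread shrinks as $n$ grows, so a single $\kappa$ value computed on a few dozen samples should be read with a wide margin of error.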


## Exercises

1. Construct a dataset where accuracy increases but $\kappa$ decreases (hint: shift the marginals).
2. For the threshold-sweep demo, optimize for accuracy instead of $\kappa$. How do the chosen thresholds differ?
3. Create your own custom weight matrix for ordinal labels (e.g., asymmetric penalties) and compute $\kappa_w$.


## References

- scikit-learn API: `sklearn.metrics.cohen_kappa_score`
- Cohen, J. (1960). *A coefficient of agreement for nominal scales*.
- Weighted kappa: commonly used for ordinal agreement (e.g., clinical ratings).
