# matthews_corrcoef (Matthews Correlation Coefficient, MCC)

The **Matthews correlation coefficient (MCC)** is a single-number summary of a classifier’s **confusion matrix**.
It can be interpreted as the **Pearson correlation** between true and predicted labels, so it naturally lives in $[-1, 1]$:

- $+1$: perfect predictions
- $0$: no better than random (no correlation)
- $-1$: perfectly wrong (systematic inversion)

MCC is especially useful when classes are **imbalanced**, because it uses **all four** confusion-matrix entries (TP, TN, FP, FN).

---

## Learning goals
- derive MCC from the confusion matrix and from Pearson correlation
- implement MCC from scratch in NumPy (binary + multiclass)
- build intuition with Plotly visuals (imbalance + thresholding)
- use MCC to **select a decision threshold** / tune a simple model

## Table of contents
1. Confusion matrix recap
2. Binary MCC: definition + correlation view
3. Multiclass MCC
4. NumPy implementation (from scratch)
5. Intuition plots (TPR/TNR surface + imbalance trap)
6. Using MCC for optimization: threshold tuning for logistic regression
7. Pros, cons, and when to use MCC
8. Exercises + references


In [None]:
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.datasets import make_classification
from sklearn.metrics import matthews_corrcoef as sk_matthews_corrcoef
from sklearn.model_selection import train_test_split

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)


## 1) Confusion matrix recap

For **binary classification**, assume the positive class is labeled $1$ and the negative class is labeled $0$.

A confusion matrix counts outcomes:

|                | predicted $1$ | predicted $0$ |
|---             |---:           |---:           |
| **true $1$**   | TP            | FN            |
| **true $0$**   | FP            | TN            |

With total sample size:

$$
N = \text{TP} + \text{TN} + \text{FP} + \text{FN}.
$$

Useful rates:

- **TPR / recall / sensitivity**: $\text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}}$
- **TNR / specificity**: $\text{TNR} = \frac{\text{TN}}{\text{TN}+\text{FP}}$

MCC can only be high when **both** TPR and TNR are high; a classifier that gets one class right by sacrificing the other is penalized.
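As a quick illustration of the two rates, here is a minimal sketch that computes them from raw counts. The counts are made up for the example, the helper name `rates_from_counts` is hypothetical, and returning `0.0` when a class is absent is an assumed convention rather than part of the definitions above.

```python
def rates_from_counts(tp, tn, fp, fn):
    """Return (TPR, TNR) for one binary confusion matrix."""
    # Assumed convention: a rate is 0.0 when its class has no samples.
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0  # recall / sensitivity
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0.0  # specificity
    return tpr, tnr


# Illustrative counts: 10 true positives-class samples, 90 negative-class samples
tpr, tnr = rates_from_counts(tp=8, tn=81, fp=9, fn=2)
print(f"TPR={tpr:.2f}, TNR={tnr:.2f}")  # prints "TPR=0.80, TNR=0.90"
```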


## 2) Binary MCC: definition + correlation view

### 2.1 Definition (confusion-matrix form)

The (binary) Matthews correlation coefficient is

$$
\mathrm{MCC} =
\frac{\text{TP}\,\text{TN} - \text{FP}\,\text{FN}}
{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}.
$$

- The **numerator** rewards agreement (TP·TN) and penalizes disagreement (FP·FN).
- The **denominator** normalizes to keep the score in $[-1, 1]$.

If the denominator is $0$ (e.g. constant predictions, or all labels are the same), MCC is mathematically undefined.
In practice (and in scikit-learn), it is returned as **0.0**.
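To make the formula concrete, the sketch below evaluates it on three hand-picked confusion matrices: a reasonable classifier, a constant "always negative" classifier (zero denominator), and a fully inverted one. The counts and the helper name `mcc_binary` are illustrative assumptions.

```python
import math


def mcc_binary(tp, tn, fp, fn):
    """Binary MCC from counts; returns 0.0 when the denominator is zero."""
    num = tp * tn - fp * fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else num / denom


good = mcc_binary(tp=8, tn=81, fp=9, fn=2)       # mostly right
constant = mcc_binary(tp=0, tn=90, fp=0, fn=10)  # always predicts 0
inverted = mcc_binary(tp=0, tn=0, fp=90, fn=10)  # every prediction flipped

print(f"good={good:.3f}, constant={constant:.3f}, inverted={inverted:.3f}")
```

The three values land in the positive range, at exactly 0, and at exactly $-1$, matching the interpretation of the scale given in the introduction.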

### 2.2 MCC = Pearson correlation for 0/1 labels

Let $Y, \hat Y \in \{0,1\}$ be the true and predicted labels. The Pearson correlation is

$$
\rho(Y, \hat Y) = \frac{\mathrm{Cov}(Y, \hat Y)}{\sqrt{\mathrm{Var}(Y)\,\mathrm{Var}(\hat Y)}}.
$$

Using the contingency table above:

- $\mathbb{E}[Y] = \frac{\text{TP}+\text{FN}}{N}$
- $\mathbb{E}[\hat Y] = \frac{\text{TP}+\text{FP}}{N}$
- $\mathbb{E}[Y\hat Y] = \frac{\text{TP}}{N}$

So

$$
\mathrm{Cov}(Y,\hat Y)
= \mathbb{E}[Y\hat Y] - \mathbb{E}[Y]\,\mathbb{E}[\hat Y]
= \frac{\text{TP}\,\text{TN} - \text{FP}\,\text{FN}}{N^2}.
$$

And

$$
\mathrm{Var}(Y) = \frac{(\text{TP}+\text{FN})(\text{TN}+\text{FP})}{N^2},
\quad
\mathrm{Var}(\hat Y) = \frac{(\text{TP}+\text{FP})(\text{TN}+\text{FN})}{N^2}.
$$

Plugging these into $\rho$ yields the MCC formula.
This is why MCC is also known as the **phi coefficient** (correlation for two binary variables).
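The identity can also be checked numerically. The following sketch compares the confusion-matrix formula with `np.corrcoef` applied directly to the 0/1 label vectors; the small label arrays are illustrative.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# Confusion-matrix counts (positive class = 1)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

mcc = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Pearson correlation of the raw 0/1 vectors (off-diagonal entry of the 2x2 matrix)
pearson = np.corrcoef(y_true, y_pred)[0, 1]

print(f"MCC={mcc:.4f}, Pearson={pearson:.4f}")
```

Note that `np.corrcoef` returns `nan` when either vector is constant, which is the same degenerate case where the MCC denominator is zero.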


## 3) Multiclass MCC

MCC has a natural multiclass extension based on the full $K\times K$ confusion matrix.

Let $C\in\mathbb{N}^{K\times K}$ with entries:

$$
C_{ij} = \#\{n : y^{(n)}=i, \; \hat y^{(n)}=j\}.
$$

Define:

- $s = \sum_{i,j} C_{ij}$ (total)
- $c = \sum_k C_{kk}$ (correct / trace)
- $t_k = \sum_j C_{k j}$ (true count per class; row sums)
- $p_k = \sum_i C_{i k}$ (predicted count per class; column sums)

Then the multiclass MCC is:

$$
\mathrm{MCC} =
\frac{c\,s - \sum_k t_k p_k}
{\sqrt{\left(s^2 - \sum_k p_k^2\right)\left(s^2 - \sum_k t_k^2\right)}}.
$$

It reduces to the binary formula when $K=2$, and can be viewed as a correlation between one-hot encodings of $y$ and $\hat y$.
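The reduction to the binary case can be sanity-checked on a small matrix. This is a minimal sketch: the helper name `mcc_from_confusion` and the example matrices are assumptions made for illustration, with rows as true classes and columns as predicted classes.

```python
import numpy as np


def mcc_from_confusion(cm):
    """Multiclass MCC from a KxK confusion matrix (rows=true, cols=pred)."""
    cm = np.asarray(cm, dtype=float)
    t = cm.sum(axis=1)  # true count per class (row sums)
    p = cm.sum(axis=0)  # predicted count per class (column sums)
    s = cm.sum()        # total number of samples
    c = np.trace(cm)    # correctly classified samples
    num = c * s - t @ p
    denom = np.sqrt((s**2 - p @ p) * (s**2 - t @ t))
    return 0.0 if denom == 0 else float(num / denom)


# K=2 with label order [0, 1], so the layout is [[TN, FP], [FN, TP]]
cm2 = np.array([[4, 1],
                [1, 2]])
tn, fp, fn, tp = (int(v) for v in cm2.ravel())
binary = (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
multiclass = mcc_from_confusion(cm2)

# A perfect 3-class prediction has a diagonal confusion matrix
perfect = mcc_from_confusion(np.diag([3, 4, 5]))

print(f"binary={binary:.4f}, KxK={multiclass:.4f}, perfect 3-class={perfect:.1f}")
```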


## 4) NumPy implementation (from scratch)

We’ll implement:

- a simple confusion matrix builder
- MCC for binary and multiclass using the $K\times K$ formula
- (optionally) the binary closed-form as a sanity check


In [None]:
def confusion_matrix_np(y_true, y_pred, labels=None):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")

    if labels is None:
        labels = np.unique(np.concatenate([y_true, y_pred]))
    else:
        labels = np.asarray(labels)

    label_to_index = {label: i for i, label in enumerate(labels.tolist())}

    true_idx = np.fromiter((label_to_index.get(v, -1) for v in y_true), dtype=int, count=y_true.size)
    pred_idx = np.fromiter((label_to_index.get(v, -1) for v in y_pred), dtype=int, count=y_pred.size)

    if (true_idx < 0).any() or (pred_idx < 0).any():
        raise ValueError("labels must contain all values appearing in y_true and y_pred")

    k = labels.size
    cm = np.zeros((k, k), dtype=int)
    np.add.at(cm, (true_idx, pred_idx), 1)
    return cm, labels


def mcc_from_counts(tp, tn, fp, fn):
    tp = np.asarray(tp, dtype=float)
    tn = np.asarray(tn, dtype=float)
    fp = np.asarray(fp, dtype=float)
    fn = np.asarray(fn, dtype=float)

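    num = tp * tn - fp * fn
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    # np.where evaluates both branches, so divide under errstate to avoid
    # RuntimeWarnings for 0/0; those entries are replaced by 0.0 below.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = num / denom
    return np.where(denom == 0, 0.0, ratio)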


def matthews_corrcoef_np(y_true, y_pred, labels=None) -> float:
    cm, _ = confusion_matrix_np(y_true, y_pred, labels=labels)

    t_sum = cm.sum(axis=1, dtype=float)  # true per class
    p_sum = cm.sum(axis=0, dtype=float)  # predicted per class

    s = float(cm.sum())
    c = float(np.trace(cm))

    num = c * s - float(np.dot(t_sum, p_sum))
    denom = np.sqrt((s**2 - float(np.dot(p_sum, p_sum))) * (s**2 - float(np.dot(t_sum, t_sum))))

    return 0.0 if denom == 0.0 else num / denom


def confusion_counts_binary(y_true, y_pred, positive_label=1):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    true_pos = y_true == positive_label
    pred_pos = y_pred == positive_label

    tp = int(np.sum(true_pos & pred_pos))
    tn = int(np.sum(~true_pos & ~pred_pos))
    fp = int(np.sum(~true_pos & pred_pos))
    fn = int(np.sum(true_pos & ~pred_pos))
    return tp, tn, fp, fn


In [None]:
# Quick sanity check vs scikit-learn

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])

cm, labels = confusion_matrix_np(y_true, y_pred)
tp, tn, fp, fn = confusion_counts_binary(y_true, y_pred, positive_label=1)

print("labels:", labels)
print("confusion matrix (rows=true, cols=pred):\n", cm)
print("TP,TN,FP,FN:", tp, tn, fp, fn)

print("MCC (scratch, KxK):", matthews_corrcoef_np(y_true, y_pred))
print("MCC (scratch, binary counts):", float(mcc_from_counts(tp, tn, fp, fn)))
print("MCC (sklearn):", sk_matthews_corrcoef(y_true, y_pred))


### 4.1 Multiclass sanity check

MCC supports multiclass via the confusion-matrix generalization. We’ll verify our NumPy implementation against scikit-learn on a simple 3-class example.


In [None]:
# Multiclass sanity check (K=3)

y_true_mc = rng.integers(0, 3, size=500)

y_pred_mc = y_true_mc.copy()
noise = rng.random(size=y_true_mc.size) < 0.25
# if noisy: replace with a random label in {0,1,2}
y_pred_mc[noise] = rng.integers(0, 3, size=int(noise.sum()))

mcc_mc = matthews_corrcoef_np(y_true_mc, y_pred_mc)

cm_mc, labels_mc = confusion_matrix_np(y_true_mc, y_pred_mc)

fig = px.imshow(
    cm_mc,
    x=[f"pred {l}" for l in labels_mc],
    y=[f"true {l}" for l in labels_mc],
    text_auto=True,
    color_continuous_scale="Blues",
)
fig.update_layout(title=f"Multiclass confusion matrix (MCC={mcc_mc:.3f})")
fig.show()

print("MCC multiclass (scratch):", mcc_mc)
print("MCC multiclass (sklearn):", sk_matthews_corrcoef(y_true_mc, y_pred_mc))


## 5) Intuition plots

### 5.1 MCC as a function of TPR and TNR

If we fix the **class prevalence** $\pi = P(Y=1)$ and imagine a classifier with some $(\text{TPR},\text{TNR})$, the expected confusion counts (for large $N$) are:

$$
\text{TP} = N\,\pi\,\text{TPR},\quad
\text{FN} = N\,\pi\,(1-\text{TPR}),\quad
\text{TN} = N\,(1-\pi)\,\text{TNR},\quad
\text{FP} = N\,(1-\pi)\,(1-\text{TNR}).
$$

Plotting MCC over $(\text{TPR},\text{TNR})$ shows how **both** kinds of mistakes affect the score.
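Before plotting the full surface, a single point can be evaluated numerically. The sketch below plugs the expected counts into the binary formula with $N=1$ (the scale cancels) and compares the same $(\text{TPR},\text{TNR})$ pair at two prevalences; the helper name `expected_mcc` and the chosen values are illustrative assumptions.

```python
import numpy as np


def expected_mcc(pi, tpr, tnr):
    """MCC implied by prevalence pi and rates (TPR, TNR), using N = 1."""
    tp = pi * tpr
    fn = pi * (1 - tpr)
    tn = (1 - pi) * tnr
    fp = (1 - pi) * (1 - tnr)
    num = tp * tn - fp * fn
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else float(num / denom)


balanced = expected_mcc(pi=0.50, tpr=0.9, tnr=0.9)
rare = expected_mcc(pi=0.01, tpr=0.9, tnr=0.9)

print(f"pi=0.50 -> MCC={balanced:.3f}")
print(f"pi=0.01 -> MCC={rare:.3f}")
```

With identical rates, the rare-positive case scores much lower: the few true positives are outnumbered by false positives drawn from the large negative class, and MCC reflects that.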


In [None]:
def plot_mcc_surface(pi: float, grid_steps: int = 101, title: str | None = None):
    t = np.linspace(0.0, 1.0, grid_steps)
    tpr, tnr = np.meshgrid(t, t, indexing="xy")

    n = 1.0  # scale cancels out in MCC
    tp = n * pi * tpr
    fn = n * pi * (1 - tpr)
    tn = n * (1 - pi) * tnr
    fp = n * (1 - pi) * (1 - tnr)

    z = mcc_from_counts(tp, tn, fp, fn)

    fig = px.imshow(
        z,
        x=t,
        y=t,
        origin="lower",
        aspect="auto",
        zmin=-1,
        zmax=1,
        color_continuous_scale="RdBu",
        labels={"x": "TPR (recall)", "y": "TNR (specificity)", "color": "MCC"},
    )

    fig.add_trace(
        go.Scatter(x=t, y=t, mode="lines", name="TPR = TNR", line=dict(color="black", dash="dash"))
    )

    fig.update_layout(
        title=title or f"MCC surface over (TPR, TNR) with prevalence π={pi:.2f}",
        coloraxis_colorbar=dict(title="MCC"),
    )

    return fig


fig = plot_mcc_surface(pi=0.10)
fig.show()


### 5.2 The “accuracy trap” under imbalance

Consider a dataset where the positive class is rare. A trivial classifier that predicts **always negative** can achieve very high **accuracy**, even though it’s useless.

MCC exposes this: constant predictions make the denominator zero, so MCC is **0** (by the convention from section 2.1).
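A concrete instance of the trap, as a minimal sketch on made-up labels (10 positives among 1000 samples, with every sample predicted negative):

```python
import numpy as np

# Illustrative 1% prevalence dataset and an "always negative" classifier
y_true = np.array([1] * 10 + [0] * 990)
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_true == y_pred))

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
# No positive predictions -> (TP + FP) = 0 -> zero denominator -> MCC := 0
mcc = 0.0 if denom == 0 else float((tp * tn - fp * fn) / denom)

print(f"accuracy={accuracy:.2f}, MCC={mcc:.2f}")  # prints "accuracy=0.99, MCC=0.00"
```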


In [None]:
prevalence = np.linspace(0.001, 0.999, 300)

acc_always_negative = 1.0 - prevalence
mcc_always_negative = np.zeros_like(prevalence)
balanced_acc_always_negative = np.full_like(prevalence, 0.5)

fig = go.Figure()
fig.add_trace(go.Scatter(x=prevalence, y=acc_always_negative, name="accuracy (always predict 0)"))
fig.add_trace(go.Scatter(x=prevalence, y=balanced_acc_always_negative, name="balanced accuracy (always 0)", line=dict(dash="dash")))
fig.add_trace(go.Scatter(x=prevalence, y=mcc_always_negative, name="MCC (always 0)", line=dict(dash="dot")))

fig.update_layout(
    title="Imbalance demo: accuracy can look great while MCC stays 0",
    xaxis_title="Positive prevalence π = P(Y=1)",
    yaxis_title="Metric value",
    yaxis=dict(range=[-0.05, 1.05]),
)
fig.show()


## 6) Using MCC for optimization: threshold tuning for logistic regression

MCC is **not differentiable** with respect to model parameters: it is computed from hard (thresholded) labels, so it is piecewise constant and its gradient is zero wherever it exists. We therefore typically:

1) train a probabilistic model with a smooth loss (e.g. **log-loss**)
2) choose a **decision threshold** (or hyperparameters) that maximizes MCC on a validation set
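Step 2 can be sketched on a toy validation set before turning to a real model. The scores and labels below are invented for illustration, and the helper name `mcc_at_threshold` is hypothetical; the point is only that the MCC-maximizing threshold need not be 0.5.

```python
import numpy as np

# Toy validation labels and predicted probabilities (illustrative values)
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
p_val = np.array([0.03, 0.08, 0.12, 0.22, 0.31, 0.36, 0.42, 0.47, 0.71, 0.88])


def mcc_at_threshold(y_true, p, t):
    """Binarize scores at threshold t and return the binary MCC (0.0 if undefined)."""
    y_pred = (p >= t).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else float((tp * tn - fp * fn) / denom)


thresholds = np.linspace(0.0, 1.0, 21)
scores = np.array([mcc_at_threshold(y_val, p_val, t) for t in thresholds])

best_t = float(thresholds[int(np.argmax(scores))])
mcc_default = mcc_at_threshold(y_val, p_val, 0.5)

print(f"best threshold={best_t:.2f}, MCC={scores.max():.3f}")
print(f"default threshold=0.50, MCC={mcc_default:.3f}")
```

On this toy data the best threshold falls below 0.5, because lowering it recovers a positive at the cost of a single false positive.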

Below is a minimal **from-scratch** logistic regression and an MCC-based threshold selection.


In [None]:
def add_intercept(X: np.ndarray) -> np.ndarray:
    X = np.asarray(X, dtype=float)
    return np.c_[np.ones((X.shape[0], 1)), X]


def sigmoid(z):
    z = np.asarray(z, dtype=float)

    out = np.empty_like(z, dtype=float)
    pos = z >= 0

    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)

    return out


def binary_log_loss(y_true, p, eps: float = 1e-15) -> float:
    y_true = np.asarray(y_true, dtype=float)
    p = np.asarray(p, dtype=float)

    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def standardize_fit(X_train: np.ndarray):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)
    return mu, sigma


def standardize_apply(X: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    return (X - mu) / sigma


def fit_logistic_regression_gd(
    X: np.ndarray,
    y: np.ndarray,
    lr: float = 0.1,
    n_iter: int = 2000,
    l2: float = 0.0,
):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    n, d = X.shape
    w = np.zeros(d)

    losses = np.empty(n_iter)

    for i in range(n_iter):
        p = sigmoid(X @ w)

        # average log-loss + L2 (skip intercept)
        losses[i] = binary_log_loss(y, p) + 0.5 * l2 * float(np.dot(w[1:], w[1:]))

        grad = (X.T @ (p - y)) / n
        grad[1:] += l2 * w[1:]

        w -= lr * grad

    return w, losses


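def safe_div(num, denom):
    num = np.asarray(num, dtype=float)
    denom = np.asarray(denom, dtype=float)
    # np.where evaluates both branches, so silence the divide-by-zero warnings;
    # entries with denom == 0 are overwritten with 0.0.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = num / denom
    return np.where(denom == 0, 0.0, ratio)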


def binary_metrics_from_counts(tp, tn, fp, fn):
    tp = np.asarray(tp, dtype=float)
    tn = np.asarray(tn, dtype=float)
    fp = np.asarray(fp, dtype=float)
    fn = np.asarray(fn, dtype=float)

    acc = safe_div(tp + tn, tp + tn + fp + fn)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)

    tpr = recall
    tnr = safe_div(tn, tn + fp)
    bal_acc = 0.5 * (tpr + tnr)

    mcc = mcc_from_counts(tp, tn, fp, fn)

    return {
        "mcc": mcc,
        "accuracy": acc,
        "balanced_accuracy": bal_acc,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


In [None]:
# Synthetic, imbalanced dataset
X, y = make_classification(
    n_samples=4000,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_clusters_per_class=2,
    weights=[0.90, 0.10],
    class_sep=1.2,
    flip_y=0.02,
    random_state=7,
)

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=7)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=7)

mu, sigma = standardize_fit(X_train)
X_train_s = standardize_apply(X_train, mu, sigma)
X_val_s = standardize_apply(X_val, mu, sigma)
X_test_s = standardize_apply(X_test, mu, sigma)

X_train_i = add_intercept(X_train_s)
X_val_i = add_intercept(X_val_s)
X_test_i = add_intercept(X_test_s)

w, losses = fit_logistic_regression_gd(X_train_i, y_train, lr=0.15, n_iter=2500, l2=1e-2)

fig = go.Figure()
fig.add_trace(go.Scatter(x=np.arange(losses.size), y=losses, name="train loss"))
fig.update_layout(title="From-scratch logistic regression (GD): training loss", xaxis_title="iteration", yaxis_title="loss")
fig.show()


In [None]:
# Threshold sweep on validation set
p_val = sigmoid(X_val_i @ w)

thresholds = np.linspace(0.0, 1.0, 401)

y_val_bool = y_val.astype(bool)
pred_pos = p_val[:, None] >= thresholds[None, :]

tp = np.sum(pred_pos & y_val_bool[:, None], axis=0)
fp = np.sum(pred_pos & ~y_val_bool[:, None], axis=0)
fn = np.sum(~pred_pos & y_val_bool[:, None], axis=0)
tn = np.sum(~pred_pos & ~y_val_bool[:, None], axis=0)

metrics = binary_metrics_from_counts(tp, tn, fp, fn)

best_idx = int(np.argmax(metrics["mcc"]))
best_t = float(thresholds[best_idx])

best_acc_idx = int(np.argmax(metrics["accuracy"]))
best_acc_t = float(thresholds[best_acc_idx])

best_t, best_acc_t


In [None]:
# Plot how metrics change with the decision threshold

fig = go.Figure()

for name, values in metrics.items():
    fig.add_trace(go.Scatter(x=thresholds, y=values, name=name))

fig.add_vline(x=0.5, line_dash="dot", line_color="gray", annotation_text="t=0.5")
fig.add_vline(
    x=best_acc_t,
    line_dash="dash",
    line_color="gray",
    annotation_text=f"best accuracy t={best_acc_t:.3f}",
)
fig.add_vline(
    x=best_t,
    line_dash="dash",
    line_color="black",
    annotation_text=f"best MCC t={best_t:.3f}",
)

fig.update_layout(
    title="Validation curves vs threshold",
    xaxis_title="threshold",
    yaxis_title="metric value",
    yaxis=dict(range=[-0.05, 1.05]),
)
fig.show()


In [None]:
# Evaluate on the test set and compare thresholds
p_test = sigmoid(X_test_i @ w)


def metrics_at_threshold(t: float):
    y_pred = (p_test >= t).astype(int)
    tp, tn, fp, fn = confusion_counts_binary(y_test, y_pred, positive_label=1)
    m = binary_metrics_from_counts(tp, tn, fp, fn)
    return {k: float(v) for k, v in m.items()}, y_pred


for t in [0.5, best_acc_t, best_t]:
    m, _ = metrics_at_threshold(t)
    print(
        f"t={t:.3f} | MCC={m['mcc']:.3f} | acc={m['accuracy']:.3f} | bal_acc={m['balanced_accuracy']:.3f} | F1={m['f1']:.3f}"
    )

# Confusion matrix for the MCC-optimal threshold
m_best, y_test_pred = metrics_at_threshold(best_t)

cm_test, _ = confusion_matrix_np(y_test, y_test_pred, labels=np.array([0, 1]))

fig = px.imshow(
    cm_test,
    x=["pred 0", "pred 1"],
    y=["true 0", "true 1"],
    text_auto=True,
    color_continuous_scale="Blues",
)
fig.update_layout(title=f"Test confusion matrix (threshold={best_t:.3f}, MCC={m_best['mcc']:.3f})")
fig.show()

print("Test MCC (scratch):", m_best["mcc"])
print("Test MCC (sklearn):", sk_matthews_corrcoef(y_test, y_test_pred))


## 7) Pros, cons, and when to use MCC

### Pros
- **Uses all of TP/TN/FP/FN** (unlike precision and recall, which both ignore TN)
- **Robust under class imbalance** (unlike accuracy)
- **Symmetric**: swapping positive/negative labels does not change the value
- **Single interpretable scale** ($[-1,1]$) with a correlation meaning
- **Works for multiclass** via the confusion-matrix generalization
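The symmetry property is easy to verify. The sketch below computes MCC and F1 twice on the same illustrative predictions, once treating label 1 as positive and once treating label 0 as positive; the helper names are hypothetical.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])


def counts(y_true, y_pred, positive_label):
    """Return (TP, TN, FP, FN) for the chosen positive label."""
    true_pos = y_true == positive_label
    pred_pos = y_pred == positive_label
    tp = int(np.sum(true_pos & pred_pos))
    tn = int(np.sum(~true_pos & ~pred_pos))
    fp = int(np.sum(~true_pos & pred_pos))
    fn = int(np.sum(true_pos & ~pred_pos))
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else float((tp * tn - fp * fn) / denom)


def f1(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn  # TN does not appear in F1
    return 0.0 if denom == 0 else 2 * tp / denom


results = {}
for positive_label in (1, 0):
    c = counts(y_true, y_pred, positive_label)
    results[positive_label] = (mcc(*c), f1(*c))
    print(f"positive={positive_label}: MCC={results[positive_label][0]:.3f}, "
          f"F1={results[positive_label][1]:.3f}")
```

Swapping the positive label exchanges TP with TN and FP with FN, which leaves the MCC formula unchanged but changes F1.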

### Cons / caveats
- Can be **undefined** when predictions (or labels) are constant; commonly returned as **0** by convention
- **Non-differentiable** w.r.t. model parameters → not a direct gradient-descent loss
- Threshold-dependent for probabilistic models; you often need **threshold tuning**
- Can be **noisy/unstable** with very small sample sizes or extremely rare classes
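The small-sample caveat can be illustrated with a quick Monte Carlo sketch. It simulates a classifier with fixed error rates at 5% prevalence and compares the spread of MCC across repeated draws of 50 versus 5000 samples. All parameter values, the seed, and the helper names are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def mcc_binary(y_true, y_pred):
    tp = float(np.sum((y_true == 1) & (y_pred == 1)))
    tn = float(np.sum((y_true == 0) & (y_pred == 0)))
    fp = float(np.sum((y_true == 0) & (y_pred == 1)))
    fn = float(np.sum((y_true == 1) & (y_pred == 0)))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else float((tp * tn - fp * fn) / denom)


def simulate_mcc(n, pi=0.05, tpr=0.8, tnr=0.8, n_repeats=500):
    """Draw n_repeats datasets of size n and return the MCC of each."""
    scores = np.empty(n_repeats)
    for r in range(n_repeats):
        y_true = (rng.random(n) < pi).astype(int)
        # Each sample is classified correctly with probability TPR or TNR
        correct = rng.random(n) < np.where(y_true == 1, tpr, tnr)
        y_pred = np.where(correct, y_true, 1 - y_true)
        scores[r] = mcc_binary(y_true, y_pred)
    return scores


small = simulate_mcc(n=50)
large = simulate_mcc(n=5000)

print(f"n=50:   mean={small.mean():.3f}, std={small.std():.3f}")
print(f"n=5000: mean={large.mean():.3f}, std={large.std():.3f}")
```

With only a handful of positives per draw, the small-sample estimates scatter widely (and are sometimes exactly 0 because no positives were sampled), while the large-sample estimates concentrate around a single value.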

### When MCC shines
- **Imbalanced binary classification** where both error types matter (FP and FN)
- Model selection and threshold tuning when you want a single score that “respects” the full confusion matrix
- Domains with strong imbalance where accuracy is misleading (bioinformatics, fraud detection, anomaly detection)


## 8) Exercises + references

### Exercises
1) Compute MCC by hand for a few confusion matrices and interpret the sign.
2) Implement a multiclass demo: generate $K=3$ labels, perturb predictions, and verify your MCC matches scikit-learn.
3) On the logistic regression demo above:
   - compare the threshold that maximizes **accuracy** vs **MCC**
   - try a more imbalanced dataset (e.g. 99/1) and re-run the threshold sweep
4) Implement cross-validated model selection where the chosen hyperparameter maximizes validation MCC.

### References
- Matthews, B. W. (1975). *Comparison of the predicted and observed secondary structure of T4 phage lysozyme.* Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2), 442-451.
- scikit-learn docs: `sklearn.metrics.matthews_corrcoef`
- The phi coefficient (binary correlation) and its relationship to MCC
