# Homogeneity Score (`homogeneity_score`)

Homogeneity is an **external clustering metric**: it scores how *pure* each predicted cluster is with respect to **ground-truth** class labels.

> Intuition: *If I open a cluster, do I mostly see one class?*

- Perfectly pure clusters → score = **1.0**
- Completely mixed clusters (clusters don’t help predict the class) → score ≈ **0.0**

---

## Learning goals

By the end you should be able to:

- explain homogeneity in terms of **entropy**
- compute it from a **contingency matrix** (class × cluster counts)
- implement `homogeneity_score` from scratch in NumPy
- visualize what increases / decreases the score
- use it to **tune** a simple clustering algorithm (with caveats)

---

## Quick import

```python
from sklearn.metrics import homogeneity_score
```
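
As a minimal usage sketch (the label arrays here are made up for illustration), the two extremes described above look like this:

```python
from sklearn.metrics import homogeneity_score

# Illustrative labels: two ground-truth classes with three points each.
y_true = [0, 0, 0, 1, 1, 1]

# Every cluster holds a single class, so the clustering is perfectly pure.
# The cluster ids differ from the class ids; only the grouping matters.
pure = homogeneity_score(y_true, [1, 1, 1, 0, 0, 0])

# One cluster swallows everything, so knowing the cluster says nothing about the class.
mixed = homogeneity_score(y_true, [0, 0, 0, 0, 0, 0])

print("pure clustering: ", pure)   # close to 1.0
print("single cluster:  ", mixed)  # close to 0.0
```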

---

## Table of contents

1. Intuition: purity vs completeness
2. The math: entropy & conditional entropy
3. NumPy implementation (from scratch)
4. Worked toy example + plots
5. How mixing affects homogeneity
6. Pitfall: over-segmentation can reach 1.0
7. Using homogeneity to tune k-means (grid search)
8. Pros/cons + when to use


In [None]:
import os

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    completeness_score as sk_completeness_score,
    homogeneity_score as sk_homogeneity_score,
    v_measure_score as sk_v_measure_score,
)

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(7)


## 1) Intuition: purity vs completeness

Homogeneity cares about **purity inside each predicted cluster**.

- If a cluster contains multiple ground-truth classes, it’s *impure* → homogeneity goes down.
- If a ground-truth class gets split across many clusters, homogeneity **does not** complain.

That second point is why homogeneity is often paired with **completeness**:

- **Homogeneity**: *each cluster contains only members of a single class*.
- **Completeness**: *all members of a given class are assigned to the same cluster*.

Their harmonic mean is the **V-measure**, which summarizes both in a single number.

A key property: homogeneity is **label-permutation invariant**. If you relabel clusters (e.g., swap cluster `0` and `1`), the score doesn’t change.
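
The contrast between the two metrics, and the permutation invariance, can be checked on a tiny hand-made example. This is only a sketch; the label arrays are illustrative and not taken from any dataset.

```python
from sklearn.metrics import completeness_score, homogeneity_score

# Illustrative labels: two classes with four points each.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]

split = [0, 0, 1, 1, 2, 2, 3, 3]      # each class split into two pure clusters
merged = [0, 0, 0, 0, 0, 0, 0, 0]     # both classes lumped into one cluster
relabeled = [3, 3, 2, 2, 1, 1, 0, 0]  # same partition as `split`, cluster ids permuted

h_split = homogeneity_score(y_true, split)
c_split = completeness_score(y_true, split)
h_merged = homogeneity_score(y_true, merged)
c_merged = completeness_score(y_true, merged)
h_relabeled = homogeneity_score(y_true, relabeled)

print("split:     h =", h_split, " c =", c_split)    # pure but incomplete
print("merged:    h =", h_merged, " c =", c_merged)  # complete but impure
print("relabeled: h =", h_relabeled)                 # same as split
```

Splitting a class leaves homogeneity untouched and only lowers completeness; merging classes does the opposite.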


## 2) The math: entropy & conditional entropy

We have:

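- ground-truth class labels: $c \in \{1,\dots,C\}$
- predicted cluster labels: $k \in \{1,\dots,K\}$

By a common abuse of notation, $C$ and $K$ denote both the number of classes / clusters and the corresponding random variables (the class and the cluster of a randomly drawn sample).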

### 2.1 Contingency matrix

Let the contingency matrix $N \in \mathbb{N}^{C\times K}$ count co-occurrences of the true class $y_i$ and the predicted cluster $\hat y_i$ over all samples $i$:

$$
N_{c,k} = \#\{i: y_i = c, \; \hat y_i = k\}.
$$

Define totals:

- $n = \sum_{c,k} N_{c,k}$
- class counts: $n_c = \sum_k N_{c,k}$
- cluster counts: $n_k = \sum_c N_{c,k}$
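
To make the notation concrete, here is a small sketch that builds $N$ and the totals with scikit-learn's `contingency_matrix` helper. The labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Illustrative labels: classes may be strings, clusters are integers.
y_true = np.array(["a", "a", "a", "b", "b", "c"])
y_pred = np.array([0, 0, 1, 1, 1, 2])

N = contingency_matrix(y_true, y_pred)  # rows: classes, columns: clusters

n = N.sum()          # total number of samples
n_c = N.sum(axis=1)  # class counts (row sums)
n_k = N.sum(axis=0)  # cluster counts (column sums)

print(N)
print("n =", n, "| n_c =", n_c, "| n_k =", n_k)
```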

### 2.2 Entropy

The entropy of the class variable is

$$
H(C) = -\sum_{c=1}^C p(c)\,\log p(c),
\qquad p(c)=\frac{n_c}{n}.
$$

### 2.3 Conditional entropy

The conditional entropy of classes *given clusters* is

$$
H(C\mid K)
= \sum_{k=1}^K p(k)\,H(C\mid K=k)
= -\sum_{k=1}^K\sum_{c=1}^C p(c,k)\,\log p(c\mid k),
$$

where, with the convention $0 \log 0 = 0$ for empty cells,

$$
p(k)=\frac{n_k}{n},\quad
p(c,k)=\frac{N_{c,k}}{n},\quad
p(c\mid k)=\frac{N_{c,k}}{n_k}.
$$
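
As a worked check on a hand-made contingency matrix (an illustrative example, not data from this notebook), the two forms of $H(C\mid K)$ can be evaluated side by side with plain NumPy:

```python
import numpy as np

# Illustrative contingency matrix: rows = classes, columns = clusters.
N = np.array([[2, 1, 0],
              [0, 2, 0],
              [0, 0, 1]], dtype=float)
n = N.sum()

# H(C) from the class marginals (all classes are non-empty here).
p_c = N.sum(axis=1) / n
H_C = float(-np.sum(p_c * np.log(p_c)))

# Form 1: cluster-weighted average of the per-cluster class entropies.
p_k = N.sum(axis=0) / n
H_per_cluster = []
for k in range(N.shape[1]):
    col = N[:, k]
    p = col[col > 0] / col.sum()  # drop empty cells (0 log 0 = 0)
    H_per_cluster.append(-np.sum(p * np.log(p)))
H_weighted = float(np.sum(p_k * np.array(H_per_cluster)))

# Form 2: double sum over the non-empty cells.
mask = N > 0
p_ck = N / n
p_c_given_k = N / N.sum(axis=0, keepdims=True)
H_double = float(-np.sum(p_ck[mask] * np.log(p_c_given_k[mask])))

print("H(C)             =", H_C)
print("H(C|K), weighted =", H_weighted)
print("H(C|K), double   =", H_double)
```

Only the middle cluster is mixed, so it alone contributes to $H(C\mid K)$, scaled by its share $p(k)=3/6$ of the data.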

### 2.4 Homogeneity score

Homogeneity is defined as

$$
h = 1 - \frac{H(C\mid K)}{H(C)}.
$$

Edge case: if $H(C)=0$ (all points belong to one class), homogeneity is defined as **1.0**.

Interpretation:

- $H(C\mid K)=0$ ⇒ each cluster determines the class perfectly ⇒ **$h=1$**
- $H(C\mid K)=H(C)$ ⇒ clusters tell you nothing about the class ⇒ **$h=0$**

Note: the log base cancels in the ratio, so any base gives the same score; the code in this notebook uses the natural log.

A nice identity (using mutual information $I(C;K)$):

$$
h = \frac{I(C;K)}{H(C)}.
$$

So homogeneity is the **fraction of class entropy explained by the clustering**.
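
The identity follows from $I(C;K) = H(C) - H(C\mid K)$. A quick numerical check, sketched here on synthetic labels (the noise model and sizes are arbitrary choices for illustration), uses scikit-learn's `mutual_info_score`, which reports mutual information in nats:

```python
import numpy as np
from sklearn.metrics import homogeneity_score, mutual_info_score

rng_demo = np.random.default_rng(0)

# Synthetic labels: 3 classes; clusters are a noisy copy of the classes.
y_true = rng_demo.integers(0, 3, size=300)
noisy = rng_demo.random(300) < 0.3
y_pred = np.where(noisy, rng_demo.integers(0, 5, size=300), y_true)

# H(C) in nats from the empirical class frequencies.
_, counts = np.unique(y_true, return_counts=True)
p = counts / counts.sum()
H_C = float(-np.sum(p * np.log(p)))

I_CK = mutual_info_score(y_true, y_pred)  # I(C;K) in nats
ratio = I_CK / H_C
h = homogeneity_score(y_true, y_pred)

print("I(C;K) / H(C):", ratio)
print("homogeneity:  ", h)
```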


## 3) NumPy implementation (from scratch)

We’ll implement:

- a contingency matrix builder (any label types)
- entropy + conditional entropy from counts
- `homogeneity_score` using the definition above


In [None]:
def encode_labels(y):
    '''Map arbitrary labels to integer ids 0..(m-1).'''
    y = np.asarray(y)
    classes, y_idx = np.unique(y, return_inverse=True)
    return classes, y_idx


def contingency_matrix_np(y_true, y_pred):
    '''Contingency matrix N with N[c,k] = count(true=c, pred=k).'''
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")

    true_labels, true_idx = encode_labels(y_true)
    pred_labels, pred_idx = encode_labels(y_pred)

    n_classes = true_labels.size
    n_clusters = pred_labels.size

    N = np.zeros((n_classes, n_clusters), dtype=int)
    np.add.at(N, (true_idx, pred_idx), 1)

    return N, true_labels, pred_labels


def entropy_from_counts(counts: np.ndarray) -> float:
    '''Shannon entropy of a discrete distribution given counts.'''
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total <= 0:
        return 0.0

    p = counts[counts > 0] / total
    return float(-(p * np.log(p)).sum())


def conditional_entropy_C_given_K_from_contingency(N: np.ndarray) -> float:
    '''Compute H(C|K) from contingency matrix N (classes x clusters).'''
    N = np.asarray(N, dtype=float)
    n = N.sum()
    if n <= 0:
        return 0.0

    n_k = N.sum(axis=0, keepdims=True)  # (1, K)

    # H(C|K) = - sum_{c,k} p(c,k) log p(c|k)
    with np.errstate(divide="ignore", invalid="ignore"):
        p_ck = N / n
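        # out= is needed with where=: otherwise skipped entries are left uninitialized
        p_c_given_k = np.divide(N, n_k, out=np.zeros_like(N), where=n_k > 0)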
        terms = np.where(N > 0, p_ck * np.log(p_c_given_k), 0.0)

    return float(-terms.sum())


def homogeneity_score_np(y_true, y_pred) -> float:
    '''Homogeneity score in [0,1]. Matches sklearn's definition.'''
    N, _, _ = contingency_matrix_np(y_true, y_pred)

    H_C = entropy_from_counts(N.sum(axis=1))
    if H_C == 0.0:
        return 1.0

    H_C_given_K = conditional_entropy_C_given_K_from_contingency(N)
    h = 1.0 - H_C_given_K / H_C

    # Numerical safety
    return float(np.clip(h, 0.0, 1.0))


In [None]:
# Quick sanity check vs scikit-learn

y_true = rng.integers(0, 4, size=500)
y_pred = rng.integers(0, 7, size=500)

h_np = homogeneity_score_np(y_true, y_pred)
h_sk = sk_homogeneity_score(y_true, y_pred)

print("homogeneity (numpy): ", h_np)
print("homogeneity (sklearn):", h_sk)
print("abs diff:", abs(h_np - h_sk))

# Edge case: one true class -> defined as 1.0
print("one-class edge case:", homogeneity_score_np(np.zeros(20), rng.integers(0, 3, size=20)))


## 4) Worked toy example + plots

Let’s build a small example and look at:

- the contingency matrix
- per-cluster class proportions
- per-cluster class entropy (how “mixed” each cluster is)


In [None]:
y_true_toy = np.array([
    "A", "A", "A", "A", "A",
    "B", "B", "B", "B",
    "C", "C", "C", "C",
])

# Clusters are somewhat mixed:
# - cluster 0: pure A
# - cluster 1: mix of A and B
# - cluster 2: pure C
# - cluster 3: mix of B and C
y_pred_toy = np.array([
    0, 0, 0, 1, 1,
    1, 1, 3, 3,
    2, 2, 2, 3,
])

N_toy, classes_toy, clusters_toy = contingency_matrix_np(y_true_toy, y_pred_toy)

h_toy = homogeneity_score_np(y_true_toy, y_pred_toy)

print("classes:", classes_toy)
print("clusters:", clusters_toy)
print("contingency N (rows=class, cols=cluster):")
print(N_toy)
print("homogeneity:", h_toy)

fig = px.imshow(
    N_toy,
    x=[f"cluster {k}" for k in clusters_toy],
    y=[f"class {c}" for c in classes_toy],
    text_auto=True,
    color_continuous_scale="Blues",
    title=f"Toy contingency matrix (homogeneity={h_toy:.3f})",
    labels={"x": "predicted cluster", "y": "true class", "color": "count"},
)
fig.update_layout(coloraxis_showscale=False)
fig.show()


In [None]:
# Per-cluster class proportions and per-cluster entropy

cluster_sizes = N_toy.sum(axis=0)
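# out= keeps columns of empty clusters at 0 instead of leaving them uninitialized
proportions = np.divide(
    N_toy,
    cluster_sizes,
    out=np.zeros(N_toy.shape, dtype=float),
    where=cluster_sizes > 0,
)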

cluster_entropies = np.array([entropy_from_counts(N_toy[:, k]) for k in range(N_toy.shape[1])])

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Class proportions within each cluster", "Entropy within each cluster"),
)

# stacked bars (proportions)
for i, c in enumerate(classes_toy):
    fig.add_trace(
        go.Bar(
            x=[f"cluster {k}" for k in clusters_toy],
            y=proportions[i],
            name=f"class {c}",
        ),
        row=1,
        col=1,
    )

fig.update_yaxes(title_text="proportion", range=[0, 1], row=1, col=1)
fig.update_xaxes(title_text="cluster", row=1, col=1)

# entropies
fig.add_trace(
    go.Bar(
        x=[f"cluster {k}" for k in clusters_toy],
        y=cluster_entropies,
        name="entropy",
        marker_color="gray",
    ),
    row=1,
    col=2,
)

fig.update_yaxes(title_text="H(C | K=k)", row=1, col=2)
fig.update_xaxes(title_text="cluster", row=1, col=2)

fig.update_layout(barmode="stack", title_text="What makes homogeneity go up/down")
fig.show()


## 5) How mixing affects homogeneity

Consider a **binary** problem with two equally common classes.

We’ll create cluster labels by copying the true labels and then **flipping** each one independently with probability $\varepsilon$, so roughly a fraction $\varepsilon$ of them ends up flipped.

- $\varepsilon = 0$ ⇒ perfectly pure clusters ⇒ homogeneity = 1
- larger $\varepsilon$ ⇒ more mixing inside clusters ⇒ homogeneity drops
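
For this setup a closed form is available, under the assumptions of perfectly balanced classes, independent flips, and a large sample: each cluster then contains a fraction $\varepsilon$ of the "wrong" class, so $H(C\mid K)$ equals the binary entropy $H_b(\varepsilon)$ and $H(C)=\log 2$, giving $h \approx 1 - H_b(\varepsilon)/\log 2$. The sketch below evaluates that expression; the function names are illustrative and an empirical curve on finite data will fluctuate around it.

```python
import numpy as np


def binary_entropy(eps):
    """Binary entropy H_b(eps) in nats, with 0 log 0 treated as 0."""
    eps = np.asarray(eps, dtype=float)
    out = np.zeros_like(eps)
    inside = (eps > 0) & (eps < 1)
    e = eps[inside]
    out[inside] = -(e * np.log(e) + (1 - e) * np.log(1 - e))
    return out


def expected_homogeneity(eps):
    """Large-sample homogeneity for balanced binary classes with flip probability eps."""
    return 1.0 - binary_entropy(eps) / np.log(2)


eps_values = np.array([0.0, 0.1, 0.25, 0.5])
h_theory = expected_homogeneity(eps_values)

for e, h in zip(eps_values, h_theory):
    print(f"eps={e:.2f} -> expected homogeneity ~ {h:.3f}")
```

Note how steeply the score falls: flipping only 10% of the labels already removes almost half of the homogeneity.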


In [None]:
def flip_fraction(y, eps: float, rng: np.random.Generator) -> np.ndarray:
    '''Flip each binary (0/1) label independently with probability eps.'''
    y = np.asarray(y, dtype=int)
    if not (0.0 <= eps <= 1.0):
        raise ValueError("eps must be in [0,1]")

    y_pred = y.copy()
    flip = rng.random(size=y.size) < eps
    y_pred[flip] = 1 - y_pred[flip]
    return y_pred


n = 2000
# perfectly balanced classes
true_bin = np.r_[np.zeros(n // 2, dtype=int), np.ones(n // 2, dtype=int)]
rng.shuffle(true_bin)

eps_grid = np.linspace(0.0, 0.5, 51)
h_values = []

for eps in eps_grid:
    pred_bin = flip_fraction(true_bin, eps=float(eps), rng=rng)
    h_values.append(homogeneity_score_np(true_bin, pred_bin))

fig = go.Figure()
fig.add_trace(go.Scatter(x=eps_grid, y=h_values, mode="lines+markers", name="homogeneity"))
fig.update_layout(
    title="Homogeneity vs label mixing (binary flip noise)",
    xaxis_title="flip fraction ε",
    yaxis_title="homogeneity",
    yaxis_range=[0, 1.02],
)
fig.show()


## 6) Pitfall: over-segmentation can reach 1.0

Homogeneity ignores whether a class is split across many clusters.

If each class is divided into multiple *sub-clusters* (all pure), homogeneity stays **1.0**, even though the clustering is often less useful.

We’ll demonstrate this by taking $C=3$ classes and splitting each class into $m$ pure clusters.

We’ll also show **completeness** and **V-measure** for contrast.
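
The most extreme version of this pitfall puts every point in its own cluster. A minimal sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

# Illustrative labels: 3 classes with 4 points each.
y_true_demo = np.repeat([0, 1, 2], 4)

# Degenerate clustering: one singleton cluster per point.
y_singletons = np.arange(y_true_demo.size)

h_single = homogeneity_score(y_true_demo, y_singletons)
c_single = completeness_score(y_true_demo, y_singletons)
v_single = v_measure_score(y_true_demo, y_singletons)

print("homogeneity: ", h_single)  # singleton clusters are trivially pure
print("completeness:", c_single)  # every class is scattered over 4 clusters
print("v-measure:   ", v_single)
```

A clustering that carries no grouping information at all still gets a perfect homogeneity score, which is why the score should not be read in isolation.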


In [None]:
C = 3
n_per_class = 400

y_true = np.repeat(np.arange(C), n_per_class)
rng.shuffle(y_true)


def split_each_class_into_m_clusters(y_true, m: int, rng: np.random.Generator) -> np.ndarray:
    '''Randomly split each class into up to m sub-clusters that contain only that class.'''
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.empty_like(y_true)

    for c in range(np.max(y_true) + 1):
        idx = np.where(y_true == c)[0]
        sub = rng.integers(0, m, size=idx.size)
        y_pred[idx] = c * m + sub

    return y_pred


m_grid = np.arange(1, 21)

h_list = []
comp_list = []
v_list = []

for m in m_grid:
    y_pred = split_each_class_into_m_clusters(y_true, m=int(m), rng=rng)
    h_list.append(homogeneity_score_np(y_true, y_pred))
    comp_list.append(sk_completeness_score(y_true, y_pred))
    v_list.append(sk_v_measure_score(y_true, y_pred))

fig = go.Figure()
fig.add_trace(go.Scatter(x=m_grid, y=h_list, mode="lines+markers", name="homogeneity"))
fig.add_trace(go.Scatter(x=m_grid, y=comp_list, mode="lines+markers", name="completeness"))
fig.add_trace(go.Scatter(x=m_grid, y=v_list, mode="lines+markers", name="v-measure"))

fig.update_layout(
    title="Over-segmentation: splitting each class into m pure clusters",
    xaxis_title="m (clusters per true class)",
    yaxis_title="score",
    yaxis_range=[0, 1.02],
)
fig.show()


## 7) Using homogeneity to tune k-means (grid search)

Homogeneity is **not differentiable** w.r.t. model parameters (it depends on discrete assignments), so you normally use it for:

- comparing clustering algorithms
- selecting hyperparameters (like number of clusters $k$)

Below is a tiny **NumPy k-means** implementation and a grid search over $k$.

We’ll see an important behavior:

- as $k$ increases, homogeneity often increases (sometimes monotonically)

So *optimizing for homogeneity alone* tends to push toward larger $k$ unless you constrain $k$ or pair it with completeness / V-measure.
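
One way to see the bias in isolation is a sketch on deliberately well-separated synthetic blobs, using scikit-learn's `KMeans`. The centers, spread, and range of $k$ are arbitrary choices for illustration. Homogeneity stops discriminating once $k$ reaches the true number of classes, while V-measure still peaks there.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import homogeneity_score, v_measure_score

# Illustrative data: three tight blobs that are far apart.
X_demo, y_demo = make_blobs(
    n_samples=300,
    centers=[[-10.0, -10.0], [0.0, 0.0], [10.0, 10.0]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = {
        "h": homogeneity_score(y_demo, labels),
        "v": v_measure_score(y_demo, labels),
    }
    print(f"k={k}: homogeneity={scores[k]['h']:.3f}  v-measure={scores[k]['v']:.3f}")

# V-measure penalizes the extra splits, so it can single out one value of k.
best_k_by_v = max(scores, key=lambda k: scores[k]["v"])
print("best k by v-measure:", best_k_by_v)
```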


In [None]:
def kmeans_fit_predict_np(X: np.ndarray, k: int, n_iters: int = 50, seed: int = 0):
    '''Simple k-means (Lloyd) implementation. Returns labels and centroids.'''
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    if not (1 <= k <= n):
        raise ValueError("k must be in [1, n]")

    rng_local = np.random.default_rng(seed)

    # init: choose k random points as centroids
    centroids = X[rng_local.choice(n, size=k, replace=False)].copy()

    labels = np.full(n, -1, dtype=int)

    for _ in range(n_iters):
        # squared distances to each centroid (n, k)
        d2 = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
        new_labels = np.argmin(d2, axis=1)

        if np.array_equal(new_labels, labels):
            break

        labels = new_labels

        # update step
        for j in range(k):
            mask = labels == j
            if np.any(mask):
                centroids[j] = X[mask].mean(axis=0)
            else:
                # empty cluster: re-seed to a random point
                centroids[j] = X[rng_local.integers(0, n)]

    return labels, centroids


In [None]:
# Dataset with known classes (so we can compute external metrics)

X, y_true = make_blobs(
    n_samples=1500,
    centers=3,
    n_features=2,
    cluster_std=[1.0, 1.2, 0.9],
    random_state=3,
)

fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=y_true.astype(str),
    title="Ground-truth classes (for evaluation)",
    labels={"x": "x1", "y": "x2", "color": "true class"},
)
fig.show()


In [None]:
# Compare different k-means clusterings visually

def plot_clustering(X, labels, title: str):
    fig = px.scatter(
        x=X[:, 0],
        y=X[:, 1],
        color=labels.astype(str),
        title=title,
        labels={"x": "x1", "y": "x2", "color": "cluster"},
    )
    fig.show()


for k in [2, 3, 6]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    y_pred = km.fit_predict(X)

    h = homogeneity_score_np(y_true, y_pred)
    plot_clustering(X, y_pred, title=f"KMeans k={k} (homogeneity={h:.3f})")


In [None]:
# Grid search over k and random seeds (using the NumPy k-means above)

k_values = np.arange(2, 13)
seeds = np.arange(0, 15)

rows = []

for k in k_values:
    for seed in seeds:
        labels, _ = kmeans_fit_predict_np(X, k=int(k), n_iters=80, seed=int(seed))
        rows.append(
            {
                "k": int(k),
                "seed": int(seed),
                "homogeneity": homogeneity_score_np(y_true, labels),
                "completeness": sk_completeness_score(y_true, labels),
                "v_measure": sk_v_measure_score(y_true, labels),
            }
        )

# best seed per k (by homogeneity)
best_by_k = {}
for r in rows:
    k = r["k"]
    if (k not in best_by_k) or (r["homogeneity"] > best_by_k[k]["homogeneity"]):
        best_by_k[k] = r

best_rows = [best_by_k[k] for k in k_values]

best_k_by_h = max(best_rows, key=lambda r: r["homogeneity"])["k"]
print("best k by homogeneity:", best_k_by_h)

fig = go.Figure()

# scatter all runs
fig.add_trace(
    go.Scatter(
        x=[r["k"] for r in rows],
        y=[r["homogeneity"] for r in rows],
        mode="markers",
        name="homogeneity (all seeds)",
        marker=dict(size=6, opacity=0.35),
    )
)

# line: best homogeneity per k
fig.add_trace(
    go.Scatter(
        x=[r["k"] for r in best_rows],
        y=[r["homogeneity"] for r in best_rows],
        mode="lines+markers",
        name="best homogeneity per k",
    )
)

# lines: completeness and v-measure for the same best runs
fig.add_trace(
    go.Scatter(
        x=[r["k"] for r in best_rows],
        y=[r["completeness"] for r in best_rows],
        mode="lines+markers",
        name="completeness (same best-by-h runs)",
    )
)

fig.add_trace(
    go.Scatter(
        x=[r["k"] for r in best_rows],
        y=[r["v_measure"] for r in best_rows],
        mode="lines+markers",
        name="v-measure (same best-by-h runs)",
    )
)

fig.add_vline(
    x=best_k_by_h,
    line_dash="dash",
    line_color="gray",
    annotation_text=f"best k by homogeneity: {best_k_by_h}",
)

fig.update_layout(
    title="Selecting k by homogeneity (watch the over-segmentation bias)",
    xaxis_title="k",
    yaxis_title="score",
    yaxis_range=[0, 1.02],
)

fig.show()


## 8) Pros/cons + when to use

### Pros

- **Interpretable**: it measures cluster purity, which is what many real use cases care about
- **Bounded in [0, 1]** and **label-permutation invariant**
- Works for **multiclass** and **imbalanced** class distributions
- Information-theoretic: connects to **entropy** and **mutual information**

### Cons / pitfalls

- Requires **ground-truth labels** (so it’s not usable for truly unsupervised evaluation)
- **Ignores completeness** → can be **artificially high** with many clusters (over-segmentation)
- Not a smooth/differentiable objective (used for evaluation / selection, not gradient training)
- Can hide small impure clusters: each cluster’s entropy is weighted by its size, so a tiny but badly mixed cluster barely moves the score

### Good use cases

- Benchmarking clustering when you have a gold standard (topics, categories, known segments)
- Situations where mixing classes inside a cluster is especially harmful (you need “clean buckets”)
- As part of **V-measure** (homogeneity + completeness) or alongside other external metrics (ARI, AMI)
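
A small sketch of why reporting ARI or AMI alongside homogeneity can be informative: on purely random cluster labels (synthetic and unrelated to the classes by construction; the sizes are arbitrary), homogeneity is pulled up simply by having many small clusters, while the chance-adjusted scores stay near zero.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    homogeneity_score,
)

rng_demo = np.random.default_rng(0)

# Synthetic labels: 4 classes and 50 random clusters, drawn independently.
y_true_rand = rng_demo.integers(0, 4, size=200)
y_pred_rand = rng_demo.integers(0, 50, size=200)

h_rand = homogeneity_score(y_true_rand, y_pred_rand)
ami_rand = adjusted_mutual_info_score(y_true_rand, y_pred_rand)
ari_rand = adjusted_rand_score(y_true_rand, y_pred_rand)

print("homogeneity:", h_rand)
print("AMI:        ", ami_rand)
print("ARI:        ", ari_rand)
```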


## References

- Rosenberg, A., & Hirschberg, J. (2007). *V-measure: A conditional entropy-based external cluster evaluation measure.*
- scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html
- Related metrics: `completeness_score`, `v_measure_score`, `adjusted_rand_score`, `adjusted_mutual_info_score`

## Exercises

1. Create a clustering with **high homogeneity but low completeness**. Verify with plots.
2. Modify the toy example so one small cluster is very impure. How much does homogeneity change?
3. Implement `completeness_score` from scratch and reproduce V-measure.
