# Seminar Notebook 2.2: Clustering

**LSE MY459: Computational Text Analysis and Large Language Models** (WT 2026)

**Ryan Hübert**

This notebook covers the vector space approach and $k$-means clustering.

## Directory management

We begin with some directory management to specify the file path to the folder on your computer where you wish to store data for this notebook.

In [None]:
import os
sdir = os.path.join(os.path.expanduser("~"), "LSE-MY459-WT26", "SeminarWeek04") # or whatever path you want
os.makedirs(sdir, exist_ok=True)  # creates the folder (and any missing parent folders) if needed

### Loading the DFM

We need to load the DFM we created in the last notebook. We start by reading the sparse array object we saved as an `.npz` file:

In [None]:
from scipy import sparse

sparse_dfm_file = os.path.join(sdir, 'guardian-dfm.npz')
if os.path.exists(sparse_dfm_file):
    dfm = sparse.load_npz(sparse_dfm_file)
else:
    raise FileNotFoundError("You must create the DFM using the previous notebook before proceeding!")

dfm.shape

Next, let's load the list of features (the vocabulary). Remember that the vocabulary is not stored with the sparse array data, so we saved it separately:

In [None]:
features_file = os.path.join(sdir, 'guardian-dfm-features.txt')
with open(features_file, mode="r") as f:  # the file is closed automatically after reading
    vocabulary = f.read().split("\n")

## Calculating distance and similarity

Before we look at $k$-means clustering, let's examine how to calculate distance and similarity between documents. First, we can calculate the Euclidean and Manhattan distances between documents using the formulas from lecture. Let's calculate these two distance metrics between document 0 and document 1.

In [None]:
import numpy as np

ed = np.sqrt(((dfm[0] - dfm[1]).power(2)).sum())
print(ed)

md = np.abs(dfm[0] - dfm[1]).sum()
print(md)
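To see what these two lines compute, here is a minimal sketch on two made-up documents over a four-word vocabulary. The counts and the names `toy`, `toy_ed` and `toy_md` are invented for illustration and are not part of the Guardian data.

```python
import numpy as np
from scipy import sparse

# Two hypothetical documents over a four-word vocabulary (made-up counts)
toy = sparse.csr_matrix(np.array([[2, 0, 1, 3],
                                  [0, 1, 1, 1]]))

diff = toy[0] - toy[1]                 # element-wise differences: (2, -1, 0, 2)

toy_ed = np.sqrt(diff.power(2).sum())  # Euclidean: sqrt(4 + 1 + 0 + 4)
toy_md = np.abs(diff).sum()            # Manhattan: 2 + 1 + 0 + 2

print(toy_ed, toy_md)
```

Euclidean distance squares each difference before summing, so large differences on a single word count for more than they do under Manhattan distance.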

There is a convenient function available in `sklearn` for calculating distance. This function allows you to choose which metric you want to use, and it allows you to calculate distance between all documents (returning a matrix of pairwise distances). Let's calculate Euclidean and Manhattan distance between the first five documents. Note that Manhattan distance is called `cityblock` in `sklearn`.

In [None]:
from sklearn.metrics import pairwise_distances

edist = pairwise_distances(dfm[0:5], metric="euclidean")
print(edist)

mdist = pairwise_distances(dfm[0:5], metric="cityblock")
print(mdist)
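As a sanity check on how to read these matrices, the sketch below runs `pairwise_distances` on three made-up documents. The toy counts are an assumption for illustration only. Entry $(i, j)$ is the distance between documents $i$ and $j$, so the matrix is symmetric with zeros on the diagonal.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances

# Three hypothetical documents over a four-word vocabulary (made-up counts)
toy_dfm = sparse.csr_matrix(np.array([[2, 0, 1, 3],
                                      [0, 1, 1, 1],
                                      [1, 1, 0, 0]]))

toy_edist = pairwise_distances(toy_dfm, metric="euclidean")
toy_mdist = pairwise_distances(toy_dfm, metric="cityblock")

print(toy_edist)
print(toy_mdist)

# Each matrix is symmetric, and every document is at distance 0 from itself
print(np.allclose(toy_edist, toy_edist.T), np.allclose(np.diag(toy_edist), 0))
```

In `sklearn`, `metric="manhattan"` is accepted as another name for the same metric.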

We can also calculate the cosine similarity between two documents. For example, let's look at document 0 and 1. As we can see, they are not very similar:

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

cs = cosine_similarity(dfm[0], dfm[1])  # cosine similarity (a 1x1 array)
print(cs)
print(np.arccos(cs))                    # radians between documents
print(np.degrees(np.arccos(cs)))        # degrees between documents
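The same quantity can be computed by hand as the dot product of the two vectors divided by the product of their magnitudes. The sketch below does this for two made-up dense vectors (the values and the names `a`, `b` and `cs_manual` are assumptions for illustration) and compares the result with `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two hypothetical document vectors (made-up counts)
a = np.array([[2.0, 0.0, 1.0, 3.0]])
b = np.array([[0.0, 1.0, 1.0, 1.0]])

dot = (a * b).sum()              # numerator: dot product of the two vectors
norm_a = np.sqrt((a ** 2).sum()) # magnitude (L2 norm) of a
norm_b = np.sqrt((b ** 2).sum()) # magnitude (L2 norm) of b

cs_manual = dot / (norm_a * norm_b)
cs_sklearn = cosine_similarity(a, b)[0, 0]

print(cs_manual, cs_sklearn)
print(np.degrees(np.arccos(cs_manual)))  # angle between the vectors in degrees
```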

## $k$-means clustering

First, we will weight the DFM using TF-IDF weighting. Note that, by default, `TfidfTransformer` normalises each row of the DFM so that all document vectors have the same magnitude. The default is the L2 norm, which means each row vector is divided by its Euclidean length, so every non-empty document ends up with length 1. This is exactly what we did when we computed cosine similarity in the week 3 lecture. This normalisation removes differences due purely to document length, allowing clustering to focus on differences in word composition rather than scale.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
dfm_tfidf = transformer.fit_transform(dfm)
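The effect of the default L2 normalisation can be checked on a small example. In the sketch below, the count matrix `toy_counts` is made up for illustration: document 1 is simply document 0 repeated ten times, so the two differ only in length.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: document 1 is document 0 repeated ten times
toy_counts = sparse.csr_matrix(np.array([[4, 0, 2, 0],
                                         [40, 0, 20, 0],
                                         [0, 3, 1, 5]]))

toy_tfidf = TfidfTransformer().fit_transform(toy_counts)

# Euclidean length of each row after weighting and normalisation
row_norms = np.asarray(np.sqrt(toy_tfidf.multiply(toy_tfidf).sum(axis=1))).ravel()
print(row_norms)

# Documents 0 and 1 have the same word composition, so their rows coincide
print(np.allclose(toy_tfidf[0].toarray(), toy_tfidf[1].toarray()))
```

Because every row has length 1, the Euclidean distance used by $k$-means depends only on the angle between documents, not on how long they are.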

Next, we will "set up" our $k$-means clustering exercise. For now, let's try to find 10 clusters and see what we get.

In [None]:
from sklearn.cluster import KMeans

K = 10
kmeans = KMeans(n_clusters=K, random_state=42) 
labels = kmeans.fit_predict(dfm_tfidf)

What objects can we extract from this? We are interested in each document $i$'s cluster assignment $\widehat{\boldsymbol{\pi}}_i$, as well as each cluster $k$'s "word usage" as represented by the centroid $\widehat{\boldsymbol{\mu}}_k$. Where can we extract those quantities from the `kmeans` and/or `labels` objects? 

### Cluster assignment 

This gives you the cluster assignments for each document:

In [None]:
cluster_assignments = labels
print(cluster_assignments)

For example, we see that document 0 is in cluster 3. This means: $\widehat{\boldsymbol{\pi}}_0 = (0,0,0,1,0,0,0,0,0,0)$, with the 1 in the fourth position. (Remember: Python uses zero-indexing, so the clusters are labelled 0 to 9.) Let's now look at the distribution of documents across all clusters.

In [None]:
import numpy as np
import pandas as pd

cf = pd.Series(labels).value_counts()
cf = pd.concat([cf, cf / cf.sum()], axis = 1, keys=["doc_count", "doc_prop"])
cf = cf.sort_index()
cf
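If you want the cluster assignments as one-hot vectors $\widehat{\boldsymbol{\pi}}_i$ rather than integer labels, one possible approach is to index the rows of an identity matrix. This is a sketch on made-up labels: `toy_labels`, `toy_K` and `pi_hat` are hypothetical names, and the labels are not from the Guardian data.

```python
import numpy as np

# Hypothetical cluster labels for five documents and four clusters
toy_labels = np.array([3, 0, 2, 3, 1])
toy_K = 4

# Row k of the identity matrix is the one-hot vector for cluster k
pi_hat = np.eye(toy_K, dtype=int)[toy_labels]
print(pi_hat)

# Column sums give the number of documents in each cluster
print(pi_hat.sum(axis=0))
```

With zero-indexing, a document labelled 3 has its 1 in the fourth position of its row.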

### Cluster centroid feature use

The following gives you a $K \times J$ matrix (in our case $10 \times 6236$) of cluster centroids, $\widehat{\boldsymbol{\mu}}$. Each row is a specific cluster $k$'s "average document", which we can interpret as representing the cluster's prototypical word usage.

In [None]:
mu = kmeans.cluster_centers_
print(mu)
print(mu.shape)
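To illustrate the "average document" interpretation, the sketch below fits $k$-means to a small, well-separated toy dataset and compares each fitted centroid with the mean of the rows assigned to that cluster. The data and the names `toy_X`, `toy_km` and `manual_centroids` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two tight, well-separated groups of three points each
toy_X = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                  [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_assign = toy_km.fit_predict(toy_X)

# Average the rows assigned to each cluster by hand
manual_centroids = np.vstack([toy_X[toy_assign == k].mean(axis=0) for k in range(2)])

print(manual_centroids)
print(toy_km.cluster_centers_)
print(np.allclose(manual_centroids, toy_km.cluster_centers_))
```

On data this clean the two should agree. On real data, tiny discrepancies are possible if the algorithm stops on its tolerance threshold rather than when the assignments stop changing.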

We can look at a specific cluster's centroid by extracting a row of this matrix, such as cluster 0 (the "first" cluster):

In [None]:
mu[0]

For each cluster, we can use the cluster's row in `mu` to find the top words of that cluster, that is, the words with the largest weights in the cluster's centroid. Consider cluster 0. First, let's figure out which of the elements of $\widehat{\boldsymbol{\mu}}_0$ represent the 6 most used words in this cluster's centroid.

In [None]:
# How many "top words" do we want?
num_top_feats = 6

# Convert a row of mu to a Series object 
tf = pd.Series(mu[0]) 
# Get the top features (along with indexes)
tf = tf.nlargest(num_top_feats)
print(tf)

Now let's get the _tokens_ that correspond to these `mu[0]` values, and then bind them as a column to `tf`.

In [None]:
top_words = pd.Series([vocabulary[x] for x in tf.index], index=tf.index)
tf = pd.concat([tf, top_words], axis=1, keys=["mu0_j", "feature"])  # the index is j; "feature" holds the token
tf

Of course, we can do this for each of the clusters to get a general sense for what they are about:

In [None]:
tf = pd.DataFrame(mu) 
tf = tf.apply(pd.Series.nlargest, n=num_top_feats, axis=1)
tf = tf.reset_index().melt(id_vars="index", var_name="j", value_name="mu_kj").rename(columns={"index": "cluster"})
tf = tf.dropna(subset=["mu_kj"])
tf = tf.sort_values(["cluster", "mu_kj"], ascending=[True, False])
tf = tf.reset_index(drop=True)
tf["feature"] = [vocabulary[x] for x in tf["j"]]

top_words = tf.groupby("cluster")["feature"].apply(lambda s: ", ".join(s.astype(str)))

for i,r in top_words.items():
    print(f"Cluster {i} top words: {r}")
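The same top-words lookup can also be written more compactly with `np.argsort`. The following is an alternative sketch on a made-up centroid matrix and vocabulary, not the approach used above: `toy_vocab`, `toy_mu` and `n_top` are hypothetical names and values.

```python
import numpy as np

# Hypothetical vocabulary and 2 x 5 centroid matrix (made-up weights)
toy_vocab = ["tax", "goal", "vote", "match", "budget"]
toy_mu = np.array([[0.50, 0.01, 0.30, 0.02, 0.40],
                   [0.02, 0.60, 0.01, 0.45, 0.03]])
n_top = 3

# Sort each row in descending order and keep the first n_top column indexes
top_idx = np.argsort(-toy_mu, axis=1)[:, :n_top]

# Map column indexes back to tokens
toy_top_words = [[toy_vocab[j] for j in row] for row in top_idx]

for k, words in enumerate(toy_top_words):
    print(f"Cluster {k} top words: {', '.join(words)}")
```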

### Calculating clusters' discriminating words

We can also calculate the discriminating words of each cluster using `sklearn`'s function `chi2`, which calculates Pearson's $\chi^2$ statistic from lecture 3. Let's start by doing it for one cluster to see the basic process.

In [None]:
from sklearn.feature_selection import chi2

target = cluster_assignments == 0  # cluster 0 versus all other clusters
scores, pvals = chi2(dfm, target)  # chi2 against null hypothesis (for all features at once)

# Now let's format nicely
disc_words = pd.DataFrame({"cluster": 0, "feature": vocabulary, "chi2" : scores, "pval" : pvals})
disc_words = disc_words.sort_values("chi2", ascending=False)
print(disc_words)
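To make the statistic less of a black box, here is a minimal sketch that reproduces `chi2` by hand on made-up data. For each feature, it compares the observed total count inside and outside the cluster with the count expected if the feature were spread across the two groups in proportion to their share of documents. The toy matrix and all names (`toy_X`, `toy_target`, `manual_chi2`) are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical counts: 4 documents x 3 features; the first two documents are "in the cluster"
toy_X = np.array([[3, 0, 1],
                  [2, 0, 1],
                  [0, 4, 1],
                  [0, 3, 1]])
toy_target = np.array([True, True, False, False])

# Indicator columns: (outside the cluster, inside the cluster)
Y = np.column_stack([~toy_target, toy_target]).astype(float)

observed = Y.T @ toy_X                  # total count of each feature in each group
group_share = Y.mean(axis=0)            # share of documents in each group
feature_total = toy_X.sum(axis=0)       # total count of each feature
expected = np.outer(group_share, feature_total)

manual_chi2 = ((observed - expected) ** 2 / expected).sum(axis=0)
sk_chi2, sk_pvals = chi2(toy_X, toy_target)

print(manual_chi2)
print(sk_chi2)
```

The third feature appears equally often in both groups, so its statistic is zero: it does nothing to distinguish the cluster from the rest.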

Now, let's plot a bar chart depicting the top 10 most discriminating words for cluster 0.

In [None]:
import matplotlib.pyplot as plt

top = disc_words.nlargest(10, "chi2").sort_values("chi2")

plt.figure(figsize=(5, 3))
plt.barh(top["feature"], top["chi2"])
plt.xlabel("Chi-square statistic")
plt.ylabel("")
plt.title("Most indicative features of cluster 0")
plt.tight_layout()
plt.show()

Now, we do it for all clusters.

In [None]:
frames = []  # one small data frame per cluster

for cluster in range(K):
    target = cluster_assignments == cluster  # this cluster versus all other clusters
    scores, pvals = chi2(dfm, target)
    disc_words = pd.DataFrame({"cluster": cluster, "feature": vocabulary, "chi2" : scores, "pval" : pvals})
    # Keep the top features by chi2, then drop any that are not significant at 5%
    disc_words = disc_words.sort_values("chi2", ascending=False).iloc[0:num_top_feats,:]
    disc_words = disc_words.loc[disc_words["pval"] < 0.05,:]
    frames.append(disc_words)

dwf = pd.concat(frames, axis = 0)  # stack the per-cluster results once

disc_words = dwf.groupby("cluster")["feature"].apply(lambda s: ", ".join(s.astype(str)))

for i,r in disc_words.items():
    print(f"Cluster {i} most discriminating words: {r}")