# Seminar Notebook 2.2: Clustering

**LSE MY459: Computational Text Analysis and Large Language Models** (WT 2026)

**Ryan Hübert**

This notebook covers the vector space approach and $k$-means clustering.

## Directory management

We begin with some directory management to specify the file path to the folder on your computer where you wish to store data for this notebook.

In [1]:
import os
sdir = os.path.join(os.path.expanduser("~"), "LSE-MY459-WT26", "SeminarWeek04") # or whatever path you want
# makedirs also creates the parent folder if it does not exist yet
os.makedirs(sdir, exist_ok=True)

### Loading the DFM

We need to load the DFM we created in the last notebook. We start by reading the sparse array object we saved as an `.npz` file:

In [2]:
from scipy import sparse

sparse_dfm_file = os.path.join(sdir, 'guardian-dfm.npz')
if os.path.exists(sparse_dfm_file):
    dfm = sparse.load_npz(sparse_dfm_file)
else:
    raise FileNotFoundError("You must create the DFM using the previous notebook before proceeding!")

dfm.shape

(1959, 6236)

Next, let's load the list of features (the vocabulary). Remember that the vocabulary is not stored with the sparse array data:

In [3]:
features_file = os.path.join(sdir, 'guardian-dfm-features.txt')
# read the vocabulary (one feature per line) and close the file afterwards
with open(features_file, mode="r") as f:
    vocabulary = f.read().split("\n")

## Calculating distance and similarity

Before we look at $k$-means clustering, let's examine how to calculate distance and similarity between documents. First, we can calculate the Euclidean and Manhattan distances between documents using the formula from lecture. Let's calculate these two distance metrics between document 0 and document 1.
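
For reference, these are the standard definitions of the two distances between documents $a$ and $b$. The symbol $W_{aj}$, the count of feature $j$ in document $a$ with $J$ features in total, is notation assumed here and may differ from the lecture slides:

$$
d_{\text{Euclidean}}(a, b) = \sqrt{\sum_{j=1}^{J} \left(W_{aj} - W_{bj}\right)^2},
\qquad
d_{\text{Manhattan}}(a, b) = \sum_{j=1}^{J} \left|W_{aj} - W_{bj}\right|
$$

Note that the square root in the Euclidean distance is taken after summing the squared differences. Taking it element by element before summing would reproduce the Manhattan distance instead.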

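In [5]:
import numpy as np

diff = dfm[0] - dfm[1]

# Euclidean distance: sum the squared differences first, then take the square root
ed = np.sqrt(diff.power(2).sum())

# Manhattan distance: sum the absolute differences
md = np.abs(diff).sum()

print(ed)
print(md)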


There is a convenient function available in `sklearn` for calculating distance. This function allows you to choose which metric you want to use, and it allows you to calculate distance between all documents (returning a matrix of pairwise distances). Let's calculate Euclidean and Manhattan distance between the first two documents. Note that Manhattan distance is available under the name `cityblock` in `sklearn`.

In [None]:
from sklearn.metrics import pairwise_distances

pairwise_distances(dfm[0:2], metric="euclidean")  # not displayed: a notebook cell shows only its last expression
pairwise_distances(dfm[0:2], metric="cityblock")  # this is the Manhattan distance

array([[  0., 255.],
       [255.,   0.]])

We can also calculate the cosine similarity between two documents. For example, let's look at document 0 and 1. As we can see, they are not very similar:

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(dfm[0], dfm[1])

array([[0.01239644]])
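
To make the calculation behind `cosine_similarity` concrete, here is a minimal sketch that computes it by hand and compares the result with `sklearn`. The matrix `toy_dfm` is a hypothetical two-document example, not the Guardian DFM.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy DFM: 2 documents x 4 features (not the Guardian data)
toy_dfm = sparse.csr_matrix(np.array([[2.0, 0.0, 1.0, 0.0],
                                      [0.0, 3.0, 1.0, 0.0]]))

a = toy_dfm[0].toarray().ravel()
b = toy_dfm[1].toarray().ravel()

# Cosine similarity: dot product divided by the product of the vector magnitudes
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
from_sklearn = cosine_similarity(toy_dfm[0], toy_dfm[1])[0, 0]

print(manual, from_sklearn)
```

The two documents share only one feature, so the similarity is small, just as for documents 0 and 1 above.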

## k-means Clustering

First, we will weight the DFM using TF-IDF weighting. Note that, by default, `TfidfTransformer` normalises every row of the DFM so that all document vectors have the same magnitude. The default is the L2 norm, which means each row is divided by its vector magnitude, so every non-empty row has length 1. This is exactly what we did when we computed cosine similarity in the week 3 lecture. This normalisation removes differences due purely to document length, allowing clustering to focus on differences in word composition rather than scale.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
dfm_tfidf = transformer.fit_transform(dfm)
print(dfm_tfidf)

# normalised?

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 231987 stored elements and shape (1959, 6236)>
  Coords	Values
  (0, 39)	0.05256121464142173
  (0, 100)	0.16583917900223952
  (0, 119)	0.1392357025124698
  (0, 144)	0.15547711052686594
  (0, 165)	0.04251758740448491
  (0, 176)	0.05420018460240871
  (0, 346)	0.04634991464746871
  (0, 427)	0.2088535537687047
  (0, 446)	0.059869919567261336
  (0, 613)	0.05334994004923864
  (0, 648)	0.04987690007060491
  (0, 677)	0.051825703508955316
  (0, 709)	0.06742934699399131
  (0, 768)	0.11609607265961656
  (0, 891)	0.036356895150812286
  (0, 926)	0.03960974894494444
  (0, 1048)	0.06560746375653825
  (0, 1054)	0.05334994004923864
  (0, 1239)	0.06479924603171734
  (0, 1367)	0.036530468286103866
  (0, 1406)	0.06742934699399131
  (0, 1418)	0.07235867345844732
  (0, 1443)	0.04075021455486167
  (0, 1534)	0.04565503325965483
  (0, 1567)	0.060377264460157294
  :	:
  (1958, 4894)	0.03557551795133242
  (1958, 4954)	0.07518884265745372
  (1958, 5019
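
As a quick check of the normalisation described above, the following sketch verifies on a hypothetical toy matrix (`toy_dfm`, not the Guardian DFM) that the rows have unit length after the default TF-IDF transform, and that two documents differing only in length end up with identical vectors. It also checks the standard identity linking Euclidean distance and cosine similarity for unit-length vectors.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy DFM of raw counts: 3 documents x 4 features
toy_dfm = sparse.csr_matrix(np.array([[4.0, 0.0, 2.0, 0.0],
                                      [2.0, 0.0, 1.0, 0.0],   # half the counts of document 0
                                      [0.0, 3.0, 0.0, 1.0]]))

toy_tfidf = TfidfTransformer().fit_transform(toy_dfm)  # default norm="l2"

# Every row should have magnitude 1 after L2 normalisation
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)).ravel())
print(row_norms)

# Documents 0 and 1 differ only in length, so their normalised rows coincide
same_direction = np.allclose(toy_tfidf[0].toarray(), toy_tfidf[1].toarray())
print(same_direction)

# For unit-length rows: squared Euclidean distance = 2 - 2 * cosine similarity
d2 = pairwise_distances(toy_tfidf, metric="euclidean") ** 2
cos = cosine_similarity(toy_tfidf)
print(np.allclose(d2, 2 - 2 * cos))
```

The last check suggests why the normalisation matters for $k$-means: on unit-length vectors, ranking documents by Euclidean distance gives the same ordering as ranking them by cosine similarity.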

Next, we will "set up" our $k$-means clustering exercise. For now, let's try to find 10 clusters and see what we get.

In [None]:
from sklearn.cluster import KMeans

k = 10  # needs careful thought: aim for coherent clusters
kmeans = KMeans(n_clusters=k, random_state=42)
labels = kmeans.fit_predict(dfm_tfidf)
labels

array([3, 5, 9, ..., 2, 5, 9], shape=(1959,), dtype=int32)

What objects can we extract from this? We are interested in each document $i$'s cluster assignment $\widehat{\boldsymbol{\pi}}_i$, as well as each cluster $k$'s "word usage" as represented by the centroid $\widehat{\boldsymbol{\mu}}_k$. Where can we extract those quantities from the `kmeans` and/or `labels` objects? 

### Cluster assignment 

This gives you the cluster assignments for each document:

In [16]:
cluster_assignments = labels
print(cluster_assignments)

[3 5 9 ... 2 5 9]


For example, we see that document 0 is in cluster 3. This means: $\widehat{\boldsymbol{\pi}}_0 = (0,0,0,1,0,0,0,0,0,0)$. (Remember: Python uses zero-indexing, so cluster 3 is the fourth element.) Let's now look at the distribution of documents across all clusters.

In [17]:
from collections import Counter
Counter(labels)

Counter({np.int32(5): 701,
         np.int32(9): 342,
         np.int32(4): 221,
         np.int32(3): 161,
         np.int32(6): 111,
         np.int32(2): 109,
         np.int32(8): 95,
         np.int32(1): 94,
         np.int32(7): 74,
         np.int32(0): 51})
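
To make the link between the integer labels and the indicator vectors $\widehat{\boldsymbol{\pi}}_i$ explicit, here is a minimal sketch that builds the indicator vectors from a label array. The names `toy_labels`, `K` and `pi_hat` are hypothetical, and the labels are made up for illustration.

```python
import numpy as np
from collections import Counter

# Hypothetical labels for 6 documents and K = 4 clusters (not the notebook's output)
toy_labels = np.array([3, 1, 0, 3, 2, 3])
K = 4

# Indexing the identity matrix by the labels gives one indicator row per document
pi_hat = np.eye(K, dtype=int)[toy_labels]
print(pi_hat[0])  # document 0 is in cluster 3, so the 1 sits in the fourth position

# Column sums count the documents in each cluster, which matches Counter
print(pi_hat.sum(axis=0))
print(Counter(toy_labels.tolist()))
```

With the notebook's objects, the same idea would be `np.eye(k, dtype=int)[labels]`.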

### Cluster centroid feature use

The following gives you a $K \times J$ matrix (in our case $10 \times 6236$) of cluster centroids, $\widehat{\boldsymbol{\mu}}$. Each row is a specific cluster $k$'s "average document", which we can interpret as representing the cluster's prototypical word usage.

In [19]:
mu = kmeans.cluster_centers_

print(mu)
print(mu.shape)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.00068003 ... 0.         0.         0.        ]
 ...
 [0.         0.00039359 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.00213422 0.         ... 0.00296132 0.         0.        ]]
(10, 6236)
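
The "average document" interpretation can be checked directly. The sketch below fits $k$-means on hypothetical toy data (`X`, two well-separated groups) and recomputes each centroid as the mean of the rows assigned to that cluster. It assumes the algorithm has fully converged, which is the case for data this simple.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: 6 "documents" x 2 features forming two well-separated groups
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
              [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(X)
toy_mu = toy_kmeans.cluster_centers_

# Recompute each centroid as the mean of the rows assigned to that cluster
manual_mu = np.vstack([X[toy_labels == c].mean(axis=0) for c in range(2)])
print(np.allclose(toy_mu, manual_mu))
```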


We can look at a specific cluster's centroid by extracting a row of this matrix, such as cluster 0 (the "first" cluster):

In [20]:
mu[0]

array([0., 0., 0., ..., 0., 0., 0.], shape=(6236,))

For each cluster, we can use the cluster's row in `mu` to find the top words of that cluster, meaning the features with the largest values in the cluster's centroid. Consider cluster 5. First, let's figure out which elements of $\widehat{\boldsymbol{\mu}}_5$ correspond to the 10 features with the largest centroid values.

In [35]:
num_top = 10
cluster = 5

import pandas as pd
tf = pd.Series(mu[cluster])
tf = tf.nlargest(num_top)
tf


1704    0.017439
5977    0.014408
4922    0.014037
2316    0.013395
3826    0.013312
4301    0.012764
32      0.012309
359     0.011724
4354    0.011707
297     0.011084
dtype: float64

Now let's get the _tokens_ that correspond to these `mu[cluster]` values by looking up each index of `tf` in `vocabulary`.

In [36]:
[vocabulary[x] for x in tf.index]

['drug',
 'violenc',
 'sexual',
 'girl',
 'obama',
 'prison',
 'abus',
 'australian',
 'protest',
 'arrest']

Of course, we can do this for each of the clusters to get a general sense of what they are about.
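
One way to repeat the lookup for every cluster is sketched below. The helper `top_terms` and the inputs `toy_mu` and `toy_vocabulary` are hypothetical stand-ins for `kmeans.cluster_centers_` and `vocabulary`. The sketch assumes the vocabulary has exactly one entry per column of the centroid matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `vocabulary` and `kmeans.cluster_centers_`
toy_vocabulary = ["drug", "polic", "vote", "elect", "goal", "match"]
toy_mu = np.array([[0.30, 0.25, 0.01, 0.00, 0.02, 0.00],
                   [0.00, 0.02, 0.40, 0.35, 0.00, 0.01],
                   [0.01, 0.00, 0.00, 0.02, 0.45, 0.30]])

def top_terms(mu, vocabulary, num_top):
    """Map each cluster to its num_top largest centroid weights, indexed by token."""
    out = {}
    for c, row in enumerate(mu):
        weights = pd.Series(row, index=vocabulary)  # token -> centroid weight
        out[c] = weights.nlargest(num_top)
    return out

toy_top = top_terms(toy_mu, toy_vocabulary, num_top=2)
for c, terms in toy_top.items():
    print(c, list(terms.index))
```

With the notebook's objects, the call would be `top_terms(mu, vocabulary, 10)`, provided `len(vocabulary) == mu.shape[1]`.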

### Calculating clusters' discriminating words

We can also calculate the discriminating words of each cluster using `sklearn`'s function `chi2`, which calculates Pearson's $\chi^2$ statistic from lecture 3. Let's start by doing it for one cluster to see the basic process.
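
The notebook's own code for this step is not shown here, so the following is only a sketch of one way to do it, using hypothetical toy data (`toy_dfm`, `toy_labels`, `toy_vocabulary`). The idea is to turn the cluster labels into a binary "in this cluster or not" variable and pass it to `chi2` together with the DFM of counts.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import chi2

# Hypothetical toy DFM (6 documents x 4 features) and cluster labels
toy_vocabulary = ["drug", "polic", "vote", "elect"]
toy_dfm = sparse.csr_matrix(np.array([[5, 3, 0, 1],
                                      [4, 4, 1, 0],
                                      [6, 2, 0, 0],
                                      [0, 1, 5, 4],
                                      [1, 0, 4, 6],
                                      [0, 1, 6, 3]], dtype=float))
toy_labels = np.array([0, 0, 0, 1, 1, 1])

cluster = 0
in_cluster = (toy_labels == cluster).astype(int)  # 1 if the document is in the cluster, else 0

# chi2 returns one statistic and one p-value per feature
stats, pvals = chi2(toy_dfm, in_cluster)

# Rank features from most to least discriminating for this cluster-vs-rest split
order = np.argsort(stats)[::-1]
for j in order:
    print(toy_vocabulary[j], round(float(stats[j]), 2))
```

The statistic is unsigned. A word that is unusually rare in the cluster scores as highly as one that is unusually common, so it is worth checking the direction before calling a word characteristic of the cluster. In this toy example, "vote" scores as highly as "drug" even though it is mostly absent from cluster 0.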

Now, let's plot a bar chart depicting the top 10 most discriminating words for cluster 0.
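
A horizontal bar chart is one convenient way to display such a ranking. The sketch below uses `matplotlib` with hypothetical, made-up statistics (`toy_stats`) rather than results from the Guardian DFM, and plots the top 3 instead of the top 10 to keep the example small.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical chi2 statistics for one cluster (illustrative values only)
toy_stats = pd.Series({"drug": 12.25, "vote": 12.25, "elect": 10.29, "polic": 4.45})

num_top = 3
# Sort ascending so that the largest bar is drawn at the top of the chart
top = toy_stats.nlargest(num_top).sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(top.index, top.values)
ax.set_xlabel("chi2 statistic")
ax.set_title("Most discriminating words (toy cluster 0)")
fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show()
```

With real output, `toy_stats` would be replaced by a series of the `chi2` statistics indexed by the vocabulary, and `num_top` set to 10.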

Now, we do it for all clusters.
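
A loop over the cluster labels extends the single-cluster calculation to every cluster. As before, this is a sketch on hypothetical toy data (`toy_dfm`, `toy_labels`, `toy_vocabulary`), not the notebook's original code.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import chi2

# Hypothetical toy DFM (6 documents x 6 features) with three clusters
toy_vocabulary = ["drug", "polic", "vote", "elect", "goal", "match"]
toy_dfm = sparse.csr_matrix(np.array([[5, 3, 0, 1, 0, 0],
                                      [4, 4, 1, 0, 0, 1],
                                      [0, 1, 5, 4, 0, 0],
                                      [1, 0, 4, 6, 1, 0],
                                      [0, 0, 1, 0, 6, 4],
                                      [0, 1, 0, 0, 5, 5]], dtype=float))
toy_labels = np.array([0, 0, 1, 1, 2, 2])

num_top = 2
discriminating = {}
for c in np.unique(toy_labels):
    # Compare documents in cluster c against all other documents
    in_cluster = (toy_labels == c).astype(int)
    stats, _ = chi2(toy_dfm, in_cluster)
    top_idx = np.argsort(stats)[::-1][:num_top]
    discriminating[int(c)] = [toy_vocabulary[j] for j in top_idx]

for c, words in discriminating.items():
    print(c, words)
```

With the notebook's objects, `toy_dfm`, `toy_labels` and `toy_vocabulary` would be replaced by `dfm`, `labels` and `vocabulary`, and `num_top` by 10.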