# Seminar Notebook 2.4: Latent Dirichlet Allocation (LDA)

**LSE MY459: Computational Text Analysis and Large Language Models** (WT 2026)

**Ryan Hübert**

This notebook covers Latent Dirichlet Allocation.

## Directory management

We begin with some directory management to specify the file path to the folder on your computer where you wish to store data for this notebook.

In [1]:
import os
sdir = os.path.join(os.path.expanduser("~"), "LSE-MY459-WT26", "SeminarWeek04") # or whatever path you want
# makedirs also creates any missing parent folders and does nothing if sdir already exists
os.makedirs(sdir, exist_ok=True)

### Loading the DFM

We need to load the DFM we created in the last notebook. We start by reading the sparse array object we saved as an `.npz` file:

In [2]:
from scipy import sparse
import pandas as pd

sparse_dfm_file = os.path.join(sdir, 'guardian-dfm.npz')
if os.path.exists(sparse_dfm_file):
    dfm = sparse.load_npz(sparse_dfm_file)
else:
    raise FileNotFoundError("You must create the DFM using the previous notebook before proceeding!")

dfm.shape

(1959, 6236)

Next, let's load the list of features (the vocabulary). Recall that the vocabulary is not stored with the sparse array data:

In [3]:
features_file = os.path.join(sdir, 'guardian-dfm-features.txt')
# Use a context manager so the file is closed after reading
with open(features_file, mode="r", encoding="utf-8") as f:
    vocabulary = f.read().split("\n")
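
Because the vocabulary lives in a separate file, it is worth confirming that it lines up with the columns of the DFM. The sketch below shows one way to check this on a toy round trip. The toy matrix, toy vocabulary and temporary file names are all hypothetical, and we assume one feature per line in the text file.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Toy stand-ins (hypothetical): a 3-document, 4-feature DFM and its vocabulary
toy_dfm = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                      [0, 3, 0, 1],
                                      [1, 1, 0, 0]]))
toy_vocabulary = ["climat", "water", "energi", "food"]

with tempfile.TemporaryDirectory() as tmp:
    npz_path = os.path.join(tmp, "toy-dfm.npz")
    txt_path = os.path.join(tmp, "toy-dfm-features.txt")

    # Save the sparse array and the vocabulary separately
    sparse.save_npz(npz_path, toy_dfm)
    with open(txt_path, mode="w", encoding="utf-8") as f:
        f.write("\n".join(toy_vocabulary) + "\n")  # file ends with a newline

    # Load both back; splitlines() ignores the trailing newline
    loaded_dfm = sparse.load_npz(npz_path)
    with open(txt_path, mode="r", encoding="utf-8") as f:
        loaded_vocabulary = f.read().splitlines()

# The number of DFM columns should equal the number of features
assert loaded_dfm.shape[1] == len(loaded_vocabulary)
print(loaded_dfm.shape, len(loaded_vocabulary))
```

In this notebook, the analogous check would compare `dfm.shape[1]` with `len(vocabulary)`. If the features file ends with a newline, `split("\n")` returns one extra empty string, so a mismatch of exactly one is a hint to look there first.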

## Latent Dirichlet Allocation (LDA)

We will run LDA on our corpus of news articles. We'll estimate 10 topics.

In [4]:
from sklearn.decomposition import LatentDirichletAllocation

K = 10
lda = LatentDirichletAllocation(n_components=K, random_state=6541)
lda = lda.fit(dfm)
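
To get a feel for what the fitted object contains before we inspect the real one, here is a minimal sketch on a small random count matrix. The toy data and all `toy_` names are hypothetical. It also illustrates why we fix `random_state`: refitting with the same seed should reproduce the same estimates.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy DFM: 20 documents, 30 features, Poisson-distributed counts
rng = np.random.default_rng(0)
toy_counts = rng.poisson(lam=1.0, size=(20, 30))
toy_counts[:, 0] += 1  # make sure no document is empty

toy_K = 3
toy_lda = LatentDirichletAllocation(n_components=toy_K, random_state=6541).fit(toy_counts)
toy_pi = toy_lda.transform(toy_counts)

# One row per topic and one column per feature
print(toy_lda.components_.shape)
# One row per document and one column per topic
print(toy_pi.shape)

# Refit with the same seed and compare the estimates
toy_lda_again = LatentDirichletAllocation(n_components=toy_K, random_state=6541).fit(toy_counts)
print(np.allclose(toy_lda.components_, toy_lda_again.components_))
```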

As was the case for $k$-means clustering, we are interested in two sets of estimates: $\widehat{\boldsymbol{\pi}}_i$ for each document $i$, and $\widehat{\boldsymbol{\mu}}_k$ for each topic $k$ (the analogue of a cluster). In the context of topic modelling, however, they have different interpretations:

- $\widehat{\boldsymbol{\pi}}_i$ gives the proportion of document $i$ that corresponds to each topic
- $\widehat{\boldsymbol{\mu}}_k$ gives the word use for topic $k$, that is, the probability of each feature in the vocabulary under that topic
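
For reference, these two parameters enter the standard LDA data generating process roughly as follows. The symbols $\boldsymbol{\alpha}$ and $\boldsymbol{\eta}$ (the Dirichlet priors), $z_{im}$ (the topic assigned to token $m$ of document $i$) and $w_{im}$ (the feature observed at that position) are notation we assume here for illustration.

$$
\begin{aligned}
\boldsymbol{\pi}_i &\sim \text{Dirichlet}(\boldsymbol{\alpha}) && \text{for each document } i \\
\boldsymbol{\mu}_k &\sim \text{Dirichlet}(\boldsymbol{\eta}) && \text{for each topic } k \\
z_{im} \mid \boldsymbol{\pi}_i &\sim \text{Categorical}(\boldsymbol{\pi}_i) && \text{for each token } m \text{ in document } i \\
w_{im} \mid z_{im} = k &\sim \text{Categorical}(\boldsymbol{\mu}_k)
\end{aligned}
$$

Each $\boldsymbol{\pi}_i$ is therefore a probability vector of length $K$, and each $\boldsymbol{\mu}_k$ is a probability vector of length $J$.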

How do we extract these quantities from the `lda` object?

### Topic assignment proportions 

We can extract each document's topic proportions as follows.

In [5]:
pi = lda.transform(dfm)
pi

array([[4.09087177e-01, 6.45250311e-04, 6.45298651e-04, ...,
        3.45914315e-01, 6.45304265e-04, 6.45290131e-04],
       [9.61627750e-04, 9.69995121e-01, 9.61787655e-04, ...,
        9.61712693e-04, 2.23112091e-02, 9.61727355e-04],
       [5.80502988e-01, 5.29241921e-04, 5.29182746e-04, ...,
        3.27037817e-01, 5.29211071e-04, 5.29226236e-04],
       ...,
       [4.16779291e-04, 4.16730470e-04, 4.16777639e-04, ...,
        4.16710758e-04, 4.16792021e-04, 4.16730949e-04],
       [7.75462038e-02, 4.15030516e-04, 4.15106098e-04, ...,
        3.05189661e-01, 4.15089298e-04, 4.15082475e-04],
       [7.41013871e-04, 7.40904474e-04, 2.44583503e-02, ...,
        5.34036348e-01, 7.40923013e-04, 7.40898063e-04]], shape=(1959, 10))

For example, we see that document 0 has the following proportions:

In [6]:
pi[0]

array([0.40908718, 0.00064525, 0.0006453 , 0.24048126, 0.0006454 ,
       0.00064535, 0.00064536, 0.34591431, 0.0006453 , 0.00064529])

The document is about 41% topic 0, 35% topic 7, and 24% topic 3. Every other topic accounts for less than 0.1% of the document.
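
A quick way to read a row like this is to sort the topics by their proportions. The sketch below does so for document 0, using the values copied from the output above (the variable names are our own).

```python
import numpy as np

# Topic proportions for document 0, copied from the output above
pi_0 = np.array([0.40908718, 0.00064525, 0.0006453, 0.24048126, 0.0006454,
                 0.00064535, 0.00064536, 0.34591431, 0.0006453, 0.00064529])

# The proportions form a probability vector, so they should sum to (about) 1
print(round(pi_0.sum(), 6))

# Rank topics from largest to smallest proportion and show the top three
ranked_topics = np.argsort(pi_0)[::-1]
for k in ranked_topics[:3]:
    print(f"topic {k}: {pi_0[k]:.1%}")
```

For document 0, the three largest topics are 0, 7 and 3, in that order. Applied to the full matrix, `pi.argmax(axis=1)` would give each document's single largest topic.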

### Topic feature probabilities (word use)

Next, we need to examine what these topics are actually about. We begin by extracting a $K \times J$ matrix (in our case $10 \times 6236$), where each row gives a topic's $\widehat{\boldsymbol{\mu}}_k$. In the DGP for LDA, this parameter gives the probability that each feature in the vocabulary is drawn when a token is assigned to that topic. The following code extracts this $\widehat{\boldsymbol{\mu}}$ matrix:

In [7]:
mu = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
mu

array([[2.84452025e-06, 6.49441088e-06, 2.84452025e-06, ...,
        2.84478316e-06, 2.84452025e-06, 2.84452025e-06],
       [3.74060841e-06, 4.41948993e-05, 3.77800874e-04, ...,
        3.74138134e-06, 1.90964808e-04, 3.74132293e-06],
       [3.32516647e-06, 1.52858306e-04, 3.32538213e-06, ...,
        3.32528809e-06, 1.66295166e-04, 3.32516647e-06],
       ...,
       [2.11580965e-06, 7.44144449e-05, 2.11587431e-06, ...,
        3.47730874e-06, 2.11580965e-06, 2.11580965e-06],
       [1.93600198e-06, 2.59673184e-04, 1.93600198e-06, ...,
        1.93625416e-06, 1.13141448e-05, 1.93600198e-06],
       [6.13496955e-04, 2.18338150e-06, 2.18326321e-06, ...,
        2.18327132e-06, 2.18556113e-06, 2.42341799e-04]], shape=(10, 6236))
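
Why do we divide by the row sums? According to the scikit-learn documentation, `components_` can be read as pseudo-counts of how often each feature was assigned to each topic, and these only become a distribution over features once each row is normalised. The sketch below shows the same operation on a small, hypothetical pseudo-count matrix.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics over 4 features
toy_components = np.array([[8.0, 1.0, 0.5, 0.5],
                           [0.2, 0.2, 3.0, 0.6]])

# Divide each row by its own total; keepdims=True keeps the sums as a
# column vector so that broadcasting works row by row
toy_mu = toy_components / toy_components.sum(axis=1, keepdims=True)

print(toy_mu)
# After normalising, every row is a probability vector
print(toy_mu.sum(axis=1))
```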

Let's look at a specific topic's word usage by extracting a row of this matrix, such as topic 0 (the "first" topic):

In [8]:
mu[0]

array([2.84452025e-06, 6.49441088e-06, 2.84452025e-06, ...,
       2.84478316e-06, 2.84452025e-06, 2.84452025e-06], shape=(6236,))
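
To make the role of such a row in the DGP concrete, the following sketch simulates a short document: each token first draws a topic from the document's topic proportions, then draws a feature from that topic's row of $\boldsymbol{\mu}$. All values here (the toy vocabulary, `toy_mu`, `toy_pi_i`, the seed and the document length) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(459)

toy_vocab = np.array(["climat", "water", "arrest", "murder", "clinton", "sander"])
# Hypothetical topic-feature probabilities (each row sums to 1)
toy_mu = np.array([[0.45, 0.45, 0.025, 0.025, 0.025, 0.025],
                   [0.025, 0.025, 0.45, 0.45, 0.025, 0.025],
                   [0.025, 0.025, 0.025, 0.025, 0.45, 0.45]])
# Hypothetical topic proportions for a single document
toy_pi_i = np.array([0.7, 0.2, 0.1])

num_tokens = 12
# Step 1: draw a topic for every token position from the document's proportions
z = rng.choice(len(toy_pi_i), size=num_tokens, p=toy_pi_i)
# Step 2: draw each token from the row of mu belonging to its assigned topic
tokens = [str(rng.choice(toy_vocab, p=toy_mu[k])) for k in z]

print(z.tolist())
print(" ".join(tokens))
```

Because this toy document puts most of its weight on topic 0, we would expect most simulated tokens to come from that topic's high-probability features.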

For each topic, we can use the topic's row in `mu` to find the top words of that topic, meaning the features with the highest probabilities under it. Consider topic 0. First, let's figure out which elements of $\widehat{\boldsymbol{\mu}}_0$ correspond to the 6 most probable features for this topic.

In [9]:
# How many "top words" do we want?
num_top_feats = 6

# Convert a row of mu to a Series object 
tf = pd.Series(mu[0]) 
# Get the top features (along with indexes)
tf = tf.nlargest(num_top_feats)
print(tf)

1013    0.012126
6047    0.010213
1836    0.009952
2154    0.009389
2277    0.005840
4625    0.005070
dtype: float64
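
The index values printed above are feature positions (columns of the DFM), not words. One option, sketched below with a hypothetical toy vocabulary and toy probabilities, is to label the Series with the vocabulary before calling `nlargest`, so that the words appear directly.

```python
import pandas as pd

# Hypothetical toy vocabulary and topic-feature probabilities for one topic
toy_vocabulary = ["climat", "water", "energi", "food", "gas", "retail"]
toy_mu_0 = [0.05, 0.30, 0.10, 0.35, 0.15, 0.05]

# Use the vocabulary as the index so each probability is labelled by its feature
toy_tf = pd.Series(toy_mu_0, index=toy_vocabulary).nlargest(3)
print(toy_tf)
```

With the real objects, this would correspond to `pd.Series(mu[0], index=vocabulary)`, assuming `vocabulary` has exactly one entry per column of `mu`.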


We want to know roughly what each of the 10 topics is about. To find out, we can look at the top features for each topic $k$, meaning the features with the highest probabilities in $\widehat{\boldsymbol{\mu}}_k$. This is the same procedure we used for $k$-means clustering.

In [10]:
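# One row per topic, one column per feature
tf = pd.DataFrame(mu)
# Keep each topic's largest values; all other cells become NaN
tf = tf.apply(pd.Series.nlargest, n=num_top_feats, axis=1)
# Reshape to long format: one row per (topic, feature index j) pair
tf = tf.reset_index().melt(id_vars="index", var_name="j", value_name="mu_kj").rename(columns={"index": "topic"})
# Drop the NaN cells, i.e. features outside a topic's top list
tf = tf.dropna(subset=["mu_kj"])
# Order features within each topic from most to least probable
tf = tf.sort_values(["topic", "mu_kj"], ascending=[True, False])
tf = tf.reset_index(drop=True)
# Look up each feature index in the vocabulary
tf["feature"] = [vocabulary[x] for x in tf["j"]]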

top_words = tf.groupby("topic")["feature"].apply(lambda s: ", ".join(s.astype(str)))

for i,r in top_words.items():
    print(f"Topic {i} top words: {r}")

Topic 0 top words: climat, water, energi, food, gas, retail
Topic 1 top words: violenc, arrest, assault, shoot, sentenc, murder
Topic 2 top words: clinton, sander, gun, hillari, berni, deleg
Topic 3 top words: australian, labor, turnbul, coalit, malcolm, shorten
Topic 4 top words: oil, investor, quarter, analyst, stock, china
Topic 5 top words: corbyn, jeremi, shadow, leadership, resign, de
Topic 6 top words: johnson, game, appl, osborn, googl, tori
Topic 7 top words: doctor, drug, hospit, medic, prison, patient
Topic 8 top words: cruz, obama, rubio, clinton, sander, ted
Topic 9 top words: refuge, syria, syrian, brussel, isi, turkey
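
The same top-word lists can also be obtained without reshaping a data frame. The sketch below is an alternative using `np.argsort`, shown on a hypothetical toy `mu` matrix and toy vocabulary rather than the fitted model.

```python
import numpy as np

# Hypothetical toy vocabulary and topic-feature probabilities (rows sum to 1)
toy_vocabulary = np.array(["climat", "water", "arrest", "murder", "clinton", "sander"])
toy_mu = np.array([[0.40, 0.35, 0.10, 0.05, 0.06, 0.04],
                   [0.05, 0.05, 0.50, 0.30, 0.04, 0.06],
                   [0.02, 0.08, 0.05, 0.05, 0.45, 0.35]])

num_top = 2
# argsort sorts in ascending order, so sort the negated matrix to get
# each topic's feature indexes from most to least probable
top_idx = np.argsort(-toy_mu, axis=1)[:, :num_top]

for k, idx in enumerate(top_idx):
    print(f"Topic {k} top words: {', '.join(toy_vocabulary[idx])}")
```

This relies on the vocabulary being a NumPy array so that it can be indexed with an array of positions. With the list used in this notebook, `np.array(vocabulary)` would be needed first.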


## Reading documents

Of course, if you want to really understand these topics, you will need to read a selection of documents corresponding to each one of the topics. Let's look at the five documents that have the highest proportion of tokens assigned to a topic. First, let's identify the top five documents for each topic. 

In [11]:
num_top_docs = 5
zf = pd.DataFrame(pi) 
zf = zf.apply(pd.Series.nlargest, n=num_top_docs, axis=0)
zf

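| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.998056 |
| 34 | NaN | 0.995477 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.997221 | NaN | NaN |
| 248 | 0.997196 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 268 | NaN | NaN | NaN | NaN | 0.997105 | NaN | NaN | NaN | NaN | NaN |
| 309 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.995521 | NaN | NaN |
| 350 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.99768 | NaN |
| 362 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.99925 | NaN |
| 379 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.999398 |
| 401 | NaN | NaN | NaN | 0.99647 | NaN | NaN | NaN | NaN | NaN | NaN |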
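
Note that this table is ordered by document index, so it does not show how the documents rank within a topic. If the ranking matters, one option is to call `nlargest` on each topic's column separately, as in this sketch on a hypothetical toy `pi` matrix.

```python
import pandas as pd

# Hypothetical topic proportions: 5 documents (rows), 2 topics (columns)
toy_pi = pd.DataFrame([[0.90, 0.10],
                       [0.20, 0.80],
                       [0.95, 0.05],
                       [0.40, 0.60],
                       [0.10, 0.90]])

toy_num_top_docs = 2
# For each topic, list the document indexes from highest to lowest proportion
top_docs = {k: toy_pi[k].nlargest(toy_num_top_docs).index.tolist()
            for k in toy_pi.columns}
print(top_docs)
```

Each list holds one topic's document indexes in descending order of proportion, so its first entry is the document most dominated by that topic.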


Next, we will load the corpus documents and look at the top five documents for topic 0.

In [12]:
from pprint import pprint

topic = 0
topic_idx = zf.iloc[:,topic].dropna().index

corpus_file = os.path.join(sdir, 'guardian-corpus.csv')
corpus = pd.read_csv(corpus_file)

for i,r in corpus.iloc[topic_idx,:].iterrows():
    print(r["datetime"])
    pprint(r["texts"][0:210], width=80)
    print("\n")

2016-02-03 11:02:00
('Human emissions of heat-trapping gases like carbon dioxide are causing the '
 'Earth to warm. We know this, and we have known this would happen for over '
 '100 years. But scientists want to know how fast the Earth is ')


2016-02-24 21:15:00
('Right now, roughly a kilometre below the surface of an ocean near you, a '
 'yellow cylinder about the size of a golf bag is taking measurements of the '
 'temperature and saltiness of the water. | Every couple of days')


2016-05-23 22:10:00
("Sir Philip Green's retail business was warned before the sale of BHS that "
 'Dominic Chappell, the man who led the buyout, had been declared bankrupt and '
 'lacked experience in the retail industry. | Paul Budge, the')


2016-05-23 22:30:00
("Sir Philip Green's retail business was warned before the sale of BHS that "
 'Dominic Chappell, the man who led the buyout, had been declared bankrupt and '
 'lacked experience in the retail industry. | Paul Budge, the')


2016-05-24 17:20:00


You could, of course, do the same thing to review clusters as well!