# Seminar Notebook 2.4: Latent Dirichlet Allocation (LDA)

**LSE MY459: Computational Text Analysis and Large Language Models** (WT 2026)

**Ryan Hübert**

This notebook covers Latent Dirichlet Allocation.

## Directory management

We begin with some directory management to specify the file path to the folder on your computer where you wish to store data for this notebook.

In [1]:
import os
sdir = os.path.join(os.path.expanduser("~"), "LSE-MY459-WT26", "SeminarWeek04") # or whatever path you want
os.makedirs(sdir, exist_ok=True) # also creates any missing parent folders

### Loading the DFM

We need to load the DFM we created in the last notebook. We start by reading the sparse array object we saved as an `.npz` file:

In [2]:
from scipy import sparse
import pandas as pd

sparse_dfm_file = os.path.join(sdir, 'guardian-dfm.npz')
if os.path.exists(sparse_dfm_file):
    dfm = sparse.load_npz(sparse_dfm_file)
else:
    raise FileNotFoundError("You must create the DFM using the previous notebook before proceeding!")

dfm.shape

(1959, 6236)

Next, let's load the list of features (the vocabulary). Remember that the vocabulary is not included with the sparse array data:

In [3]:
features_file = os.path.join(sdir, 'guardian-dfm-features.txt')
# Use a context manager so the file is closed after reading
with open(features_file, mode="r") as f:
    vocabulary = f.read().split("\n")
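
Because the vocabulary is stored separately from the sparse array, it can be worth confirming that the two line up before fitting a model. The following is a minimal sketch of such a check, using a hypothetical helper `check_vocabulary` and toy inputs rather than the Guardian data:

```python
def check_vocabulary(n_columns, vocabulary):
    """Return the vocabulary with blank entries removed, or raise an error
    if the number of features does not match the number of DFM columns."""
    # A trailing newline in the features file would leave an empty string
    cleaned = [feat for feat in vocabulary if feat != ""]
    if len(cleaned) != n_columns:
        raise ValueError(
            f"DFM has {n_columns} columns but vocabulary has {len(cleaned)} features"
        )
    return cleaned

# Toy example: three DFM columns, and a features file that ended with a newline
toy_vocabulary = "clinton\nsander\ngun\n".split("\n")
print(check_vocabulary(3, toy_vocabulary))
```

In this notebook, the equivalent call would be `check_vocabulary(dfm.shape[1], vocabulary)`.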

## Latent Dirichlet Allocation (LDA)

We will run LDA on our corpus of news articles. We'll estimate 10 topics.

In [4]:
K = 10

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=K, random_state=6541)
lda = lda.fit(dfm)

As was the case for $k$-means clustering, we are interested in each document $i$'s $\widehat{\boldsymbol{\pi}}_i$, as well as each topic $k$'s $\widehat{\boldsymbol{\mu}}_k$. However, in the context of topic modelling, they have different interpretations:

- $\widehat{\boldsymbol{\pi}}_i$ gives the proportion of document $i$ that corresponds to each topic
- $\widehat{\boldsymbol{\mu}}_k$ gives the probability of each feature under topic $k$ (the topic's word use)
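
To make these two objects concrete, here is a minimal sketch of how a single document is generated under LDA. Everything in it is a toy assumption: the four-word vocabulary, the two rows of `toy_mu`, the prior `toy_alpha` and the seed are made up for illustration, and `toy_mu` is held fixed rather than drawn from its own Dirichlet prior as in the full model.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

toy_vocab = ["vote", "senat", "goal", "match"]  # toy vocabulary (J = 4)

# toy_mu is K x J: each row is one topic's probability distribution over features
toy_mu = np.array([
    [0.45, 0.45, 0.05, 0.05],  # a politics-like topic
    [0.05, 0.05, 0.45, 0.45],  # a sport-like topic
])
toy_alpha = np.array([0.5, 0.5])  # assumed Dirichlet prior on topic proportions

def simulate_document(n_tokens):
    # pi_i: this document's proportions over the K topics
    pi_i = rng.dirichlet(toy_alpha)
    # Each token first gets a topic drawn from pi_i ...
    topics = rng.choice(len(pi_i), size=n_tokens, p=pi_i)
    # ... and then a feature drawn from that topic's row of toy_mu
    tokens = [toy_vocab[rng.choice(len(toy_vocab), p=toy_mu[k])] for k in topics]
    return pi_i, tokens

toy_pi_i, toy_tokens = simulate_document(n_tokens=8)
print(toy_pi_i.round(2), toy_tokens)
```

Fitting LDA runs this logic in reverse: from the observed tokens alone, it estimates a $\widehat{\boldsymbol{\pi}}_i$ for every document and a $\widehat{\boldsymbol{\mu}}_k$ for every topic.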

How can we extract these quantities from the `lda` object?

### Topic assignment proportions 

We can extract each document's topic proportions as follows.

In [5]:
pi = lda.transform(dfm)
pi

array([[4.09087177e-01, 6.45250311e-04, 6.45298651e-04, ...,
        3.45914315e-01, 6.45304265e-04, 6.45290131e-04],
       [9.61627750e-04, 9.69995121e-01, 9.61787655e-04, ...,
        9.61712693e-04, 2.23112091e-02, 9.61727355e-04],
       [5.80502988e-01, 5.29241921e-04, 5.29182746e-04, ...,
        3.27037817e-01, 5.29211071e-04, 5.29226236e-04],
       ...,
       [4.16779291e-04, 4.16730470e-04, 4.16777639e-04, ...,
        4.16710758e-04, 4.16792021e-04, 4.16730949e-04],
       [7.75462038e-02, 4.15030516e-04, 4.15106098e-04, ...,
        3.05189661e-01, 4.15089298e-04, 4.15082475e-04],
       [7.41013871e-04, 7.40904474e-04, 2.44583503e-02, ...,
        5.34036348e-01, 7.40923013e-04, 7.40898063e-04]], shape=(1959, 10))

For example, we see that document 0 has the following proportions:

In [None]:
pi[0]

array([0.40908718, 0.00064525, 0.0006453 , 0.24048126, 0.0006454 ,
       0.00064535, 0.00064536, 0.34591431, 0.0006453 , 0.00064529])

The document is about 41% topic 0, 35% topic 7 and 24% topic 3, with each of the remaining topics contributing well under 1%.
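
As a quick check on this reading, the sketch below copies document 0's proportions from the output above, confirms that they sum to one, and ranks the topics by their share of the document. The variable names `pi_0` and `ranked` are introduced here for illustration only.

```python
import numpy as np

# Topic proportions for document 0, copied from the output above
pi_0 = np.array([0.40908718, 0.00064525, 0.0006453, 0.24048126, 0.0006454,
                 0.00064535, 0.00064536, 0.34591431, 0.0006453, 0.00064529])

# The proportions form a probability distribution, so they sum to one
print(np.isclose(pi_0.sum(), 1.0))

# Topics ordered from the largest to the smallest share of the document
ranked = np.argsort(pi_0)[::-1]
print(ranked[:3])
```

For document 0 the three largest shares belong to topics 0, 7 and 3. Applied to the full matrix, `pi.argmax(axis=1)` would give the single most prominent topic for every document.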

### Topic feature probabilities (word use)

Next, we need to examine what these topics are actually about. We begin by extracting a $K \times J$ matrix (in our case $10 \times 6236$), where each row gives a topic's $\widehat{\boldsymbol{\mu}}_k$. In the DGP for LDA, this parameter gives the probability that each feature in the vocabulary is chosen when a token is assigned to that topic. The following code extracts this $\widehat{\boldsymbol{\mu}}$ matrix:

In [None]:
mu = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
mu

array([[2.84452025e-06, 6.49441088e-06, 2.84452025e-06, ...,
        2.84478316e-06, 2.84452025e-06, 2.84452025e-06],
       [3.74060841e-06, 4.41948993e-05, 3.77800874e-04, ...,
        3.74138134e-06, 1.90964808e-04, 3.74132293e-06],
       [3.32516647e-06, 1.52858306e-04, 3.32538213e-06, ...,
        3.32528809e-06, 1.66295166e-04, 3.32516647e-06],
       ...,
       [2.11580965e-06, 7.44144449e-05, 2.11587431e-06, ...,
        3.47730874e-06, 2.11580965e-06, 2.11580965e-06],
       [1.93600198e-06, 2.59673184e-04, 1.93600198e-06, ...,
        1.93625416e-06, 1.13141448e-05, 1.93600198e-06],
       [6.13496955e-04, 2.18338150e-06, 2.18326321e-06, ...,
        2.18327132e-06, 2.18556113e-06, 2.42341799e-04]], shape=(10, 6236))
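
The division by the row sums is needed because, according to the scikit-learn documentation, `components_` holds unnormalised pseudo-counts rather than probabilities. A toy sketch with made-up numbers (`toy_components` is not from the fitted model) shows what the normalisation does:

```python
import numpy as np

# Made-up pseudo-counts for K = 2 topics and J = 3 features
toy_components = np.array([
    [8.0, 1.0, 1.0],
    [2.0, 2.0, 4.0],
])

# keepdims=True keeps the row sums as a (2, 1) column, so that each row
# is divided by its own total
toy_mu = toy_components / toy_components.sum(axis=1, keepdims=True)

print(toy_mu)
print(toy_mu.sum(axis=1))  # each row is now a probability distribution
```

After normalisation, each row can be read as a probability distribution over the vocabulary, which is why the entries of `mu` above are so small: each row's probability is spread across 6236 features.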

Let's look at a specific topic's word usage by extracting a row of this matrix, such as topic 0 (the "first" topic):

In [8]:
mu[0]

array([2.84452025e-06, 6.49441088e-06, 2.84452025e-06, ...,
       2.84478316e-06, 2.84452025e-06, 2.84452025e-06], shape=(6236,))

For each topic, we can use the topic's row in `mu` to find the top words of that topic, that is, the features with the highest probabilities. Consider topic 2. First, let's figure out which elements of $\widehat{\boldsymbol{\mu}}_2$ correspond to the 6 most probable features in this topic.

In [15]:
num_top_feats = 6
topic = 2

tf = pd.Series(mu[topic])
tf = tf.nlargest(num_top_feats)
tf

1018    0.047418
4796    0.025840
2424    0.017290
2569    0.012245
509     0.007605
1444    0.006627
dtype: float64

We want to know roughly what each of the 10 topics is about. The index values above are positions in the vocabulary, so we can look up the top features for a topic $k$, that is, the features with the highest probabilities in $\widehat{\boldsymbol{\mu}}_k$. This is the same approach we used for $k$-means clustering.

In [16]:
[vocabulary[x] for x in tf.index]

['clinton', 'sander', 'gun', 'hillari', 'berni', 'deleg']
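
The same lookup can be repeated for every topic at once. The following is a minimal sketch, assuming a hypothetical helper `top_features` and a toy matrix and vocabulary rather than the fitted model:

```python
import numpy as np

def top_features(mu, vocabulary, n=6):
    """Return, for each topic (row of mu), its n highest-probability features."""
    # argsort sorts ascending, so negate mu to put the largest probabilities first
    order = np.argsort(-mu, axis=1)[:, :n]
    return [[vocabulary[j] for j in row] for row in order]

# Toy inputs: K = 2 topics over a 5-feature vocabulary
toy_vocab = ["clinton", "sander", "senat", "vote", "gun"]
toy_mu = np.array([
    [0.40, 0.30, 0.05, 0.15, 0.10],
    [0.05, 0.05, 0.50, 0.30, 0.10],
])

for k, feats in enumerate(top_features(toy_mu, toy_vocab, n=3)):
    print(k, feats)
```

With the objects in this notebook, the call would be `top_features(mu, vocabulary, n=num_top_feats)`, which returns one list of features per topic.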

## Reading documents

Of course, if you really want to understand these topics, you will need to read a selection of documents corresponding to each of them. Let's look at the five documents that have the highest proportion of tokens assigned to a topic. First, let's identify the top five documents for each topic.

In [None]:
num_top_docs = 5

zf = pd.DataFrame(pi) # use pi (document-topic proportions) here, not mu (topic-feature probabilities)
zf = zf.apply(pd.Series.nlargest,n=num_top_docs, axis=0)
zf

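|     | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|-----|---|---|---|---|---|---|---|---|---|---|
| 9   | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.998056 |
| 34  | NaN | 0.995477 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.997221 | NaN | NaN |
| 248 | 0.997196 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 268 | NaN | NaN | NaN | NaN | 0.997105 | NaN | NaN | NaN | NaN | NaN |
| 309 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.995521 | NaN | NaN |
| 350 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.99768 | NaN |
| 362 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.99925 | NaN |
| 379 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.999398 |
| 401 | NaN | NaN | NaN | 0.99647 | NaN | NaN | NaN | NaN | NaN | NaN |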
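
The table above is mostly empty because each row is a document that ranks in the top five for at least one topic, and it only has a value in the columns where it does so. As an alternative, the sketch below returns the top document indices for each topic directly. It assumes a hypothetical helper `top_documents` and toy proportions, not the fitted `pi`:

```python
import numpy as np

def top_documents(pi, n=5):
    """Map each topic (column of pi) to the row indices of its n top documents."""
    # Negate pi so that argsort puts the largest proportions first
    order = np.argsort(-pi, axis=0)[:n, :]
    return {k: order[:, k].tolist() for k in range(pi.shape[1])}

# Toy proportions for 4 documents and K = 2 topics
toy_pi = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.05, 0.95],
])
print(top_documents(toy_pi, n=2))
```

Here documents 0 and 2 are the top two for topic 0, and documents 3 and 1 are the top two for topic 1. With the fitted model, `top_documents(pi, n=num_top_docs)` would give five row indices per topic.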


Next, we will load the corpus documents and look at the top five documents in topic 3.

In [None]:
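from pprint import pprint

topic = 3
# Row indices of the documents with the highest proportions for this topic
topic_ids = zf.iloc[:, topic].dropna().index

corpus_file = os.path.join(sdir, 'guardian-corpus.csv')
corpus = pd.read_csv(corpus_file)

# Print the timestamp and the first 200 characters of each top document
for i, r in corpus.iloc[topic_ids, :].iterrows():
    print(r["datetime"])
    pprint(r["texts"][0:200])

# Patterns emerge from the excerpts, but a human still has to summarise what the topic is about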

2016-02-22 09:16:00
("Labor will oppose the Coalition's proposed changes to the Senate voting "
 'system as the government moves to rush them through with the support of the '
 'Greens and Senator Nick Xenophon, before an election')
2016-04-19 00:51:00
('block-time published-time 12.51am BST | As well as rejecting the ABCC, the '
 'senate last night passed legislation abolishing the truckies tribunal. The '
 'passage of the repeal bill follows a campaign in r')
2016-04-19 05:10:00
("More of course, but that's the chunky bits. | block-time updated-timeUpdated "
 'at 4.12am BST | block-time published-time 3.45am BST | I will stand still '
 "shortly and summarise all this, don't fret. | But")
2016-05-18 05:12:00
('Senate voting reform is not entirely a "he said, she said" story, despite '
 'being so often presented that way. | The changes to Senate voting about to '
 'be introduced, and almost certainly passed, will ch')
2016-05-18 05:13:00
("As this election year gets under way, we'll go

You could, of course, do the same thing to review clusters as well!