# Seminar Notebook 2.4: Latent Dirichlet Allocation (LDA)

**LSE MY459: Computational Text Analysis and Large Language Models** (WT 2026)

**Ryan Hübert**

This notebook covers Latent Dirichlet Allocation.

## Directory management

We begin with some directory management to specify the file path to the folder on your computer where you wish to store data for this notebook.

In [None]:
import os
sdir = os.path.join(os.path.expanduser("~"), "LSE-MY459-WT26", "SeminarWeek04") # or whatever path you want
# makedirs also creates the parent folder if it does not exist yet
os.makedirs(sdir, exist_ok=True)

### Loading the DFM

We need to load the DFM we created in the last notebook. We start by reading the sparse array object we saved as an `.npz` file:

In [None]:
from scipy import sparse
import pandas as pd

sparse_dfm_file = os.path.join(sdir, 'guardian-dfm.npz')
if os.path.exists(sparse_dfm_file):
    dfm = sparse.load_npz(sparse_dfm_file)
else:
    raise FileNotFoundError("You must create the DFM using the previous notebook before proceeding!")

dfm.shape

Next, let's load the list of features (the vocabulary). Remember that the vocabulary is not stored with the sparse array data, so we read it from its own file:

In [None]:
features_file = os.path.join(sdir, 'guardian-dfm-features.txt')
# Use a context manager so the file is closed after reading
with open(features_file, mode="r") as f:
    vocabulary = f.read().split("\n")
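
Because the vocabulary lives in a separate file, it may be worth confirming that it lines up with the columns of the DFM before going further. The following is a minimal sketch of such a check. The helper name `align_vocabulary` is hypothetical (it does not come from the previous notebook), and the sketch assumes the only likely discrepancy is an empty string left behind by a trailing newline in the features file.

```python
def align_vocabulary(vocabulary, n_columns):
    """Drop trailing empty entries, then check one feature per DFM column.

    Hypothetical helper, for illustration only.
    """
    cleaned = list(vocabulary)
    # A file ending in "\n" yields an empty final element after split("\n")
    while cleaned and cleaned[-1] == "":
        cleaned.pop()
    if len(cleaned) != n_columns:
        raise ValueError(
            f"Vocabulary has {len(cleaned)} features but the DFM has {n_columns} columns"
        )
    return cleaned

# Toy example: three features plus a stray empty string from a trailing newline
toy_vocabulary = "tax\nvote\ngoal\n".split("\n")
print(align_vocabulary(toy_vocabulary, n_columns=3))
```

In this notebook, the equivalent call would be `vocabulary = align_vocabulary(vocabulary, dfm.shape[1])`.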

## Latent Dirichlet Allocation (LDA)

We will run LDA on our corpus of news articles. We'll estimate 10 topics.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

K = 10
lda = LatentDirichletAllocation(n_components=K, random_state=6541)
lda = lda.fit(dfm)

As was the case for $k$-means clustering, we are interested in each document $i$'s $\widehat{\boldsymbol{\pi}}_i$, as well as each topic $k$'s $\widehat{\boldsymbol{\mu}}_k$. However, in the context of topic modelling, they have different interpretations:

- $\widehat{\boldsymbol{\pi}}_i$ gives the proportion of document $i$ that corresponds to each topic
- $\widehat{\boldsymbol{\mu}}_k$ gives the word use for a topic $k$
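
To make these two quantities concrete, here is a rough simulation of how LDA assumes a single document is generated. It is a sketch under stated assumptions rather than what `sklearn` does internally: the hyperparameter names `alpha` and `eta`, their values, and the toy sizes are all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

K, J = 3, 8      # toy numbers of topics and vocabulary features
n_tokens = 50    # toy length of one document

# Assumed symmetric Dirichlet hyperparameters (illustrative values only)
alpha = np.full(K, 0.5)
eta = np.full(J, 0.5)

# Each topic k has a distribution over features, mu_k (each row sums to 1)
mu = rng.dirichlet(eta, size=K)

# Document i has topic proportions pi_i (sums to 1)
pi_i = rng.dirichlet(alpha)

# For each token: draw a topic from pi_i, then a feature from that topic's mu_k
z = rng.choice(K, size=n_tokens, p=pi_i)
w = np.array([rng.choice(J, p=mu[k]) for k in z])

# The document's row in a DFM is the count of each feature
dfm_row = np.bincount(w, minlength=J)
print(pi_i.round(2))
print(dfm_row)
```

Estimation runs this logic in reverse: given only rows like `dfm_row`, LDA tries to recover plausible values of $\boldsymbol{\pi}_i$ and $\boldsymbol{\mu}_k$.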

Where can we extract these important items from the `lda` object?

### Topic assignment proportions 

We can extract each document's topic proportions as follows.

In [None]:
pi = lda.transform(dfm)
pi

For example, we see that document 0 has the following proportions:

In [None]:
pi[0]

The document is 40% about topic 0, 24% about topic 3, and so on.
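
Each row of `pi` is a probability distribution over topics, so it should sum to one, and the largest entry identifies the document's most prevalent topic. The sketch below shows this on a small hypothetical array `toy_pi` (the numbers are invented, not output from the model above).

```python
import numpy as np

# Hypothetical topic proportions for 3 documents and 4 topics
toy_pi = np.array([
    [0.40, 0.10, 0.26, 0.24],
    [0.05, 0.80, 0.10, 0.05],
    [0.25, 0.25, 0.10, 0.40],
])

# Each row is a probability distribution over topics
print(toy_pi.sum(axis=1))

# The most prevalent topic in each document, and its share of the document
dominant_topic = toy_pi.argmax(axis=1)
dominant_share = toy_pi.max(axis=1)
print(dominant_topic)
print(dominant_share)
```

The same two calls, `pi.argmax(axis=1)` and `pi.max(axis=1)`, can be applied to the real `pi` array.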

### Topic feature probabilities (word use)

Next, we need to examine what these topics are actually about. We begin by extracting a $K \times J$ matrix (in our case $10 \times 6236$), where each row gives a topic's $\widehat{\boldsymbol{\mu}}_k$. In the DGP for LDA, this parameter controls the probabilities that each token in the vocabulary will be chosen when a token is assigned to that topic. The following code extracts this $\widehat{\boldsymbol{\mu}}$ matrix:

In [None]:
mu = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
mu
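
The division by row sums is needed because `lda.components_` stores unnormalised topic-feature weights (pseudo-counts), not probabilities. The following toy sketch, using an invented array `toy_components`, shows what the normalisation does and why `keepdims=True` matters.

```python
import numpy as np

# Hypothetical pseudo-counts for 2 topics and 4 features
toy_components = np.array([
    [8.0, 1.0, 0.5, 0.5],
    [1.0, 1.0, 6.0, 2.0],
])

# keepdims=True keeps the row sums as a (2, 1) column, so that
# broadcasting divides each row by its own total
row_totals = toy_components.sum(axis=1, keepdims=True)
toy_mu = toy_components / row_totals

print(row_totals.shape)
print(toy_mu)
print(toy_mu.sum(axis=1))
```

After normalisation, each row can be read as a probability distribution over the vocabulary.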

Let's look at a specific topic's word usage by extracting a row of this matrix, such as topic 0 (the "first" topic):

In [None]:
mu[0]

For each topic, we can use the topic's row in `mu` to find the top words of that topic, that is, the features with the highest probabilities under that topic. Consider topic 0. First, let's figure out which elements of $\widehat{\boldsymbol{\mu}}_0$ correspond to the 6 most probable words in this topic.

In [None]:
# How many "top words" do we want?
num_top_feats = 6

# Convert a row of mu to a Series object 
tf = pd.Series(mu[0]) 
# Get the top features (along with indexes)
tf = tf.nlargest(num_top_feats)
print(tf)
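
The output above reports feature indexes rather than words. One possible shortcut, sketched here with an invented vocabulary and probabilities, is to index the Series by the vocabulary so that `nlargest` returns the words directly. This assumes the vocabulary list has exactly one entry per column of `mu`.

```python
import pandas as pd

# Hypothetical vocabulary and one topic's feature probabilities
toy_vocabulary = ["tax", "vote", "goal", "match", "bank", "film"]
toy_mu_row = [0.05, 0.10, 0.35, 0.30, 0.05, 0.15]

# Indexing the Series by the vocabulary labels each probability with its feature
toy_tf = pd.Series(toy_mu_row, index=toy_vocabulary)
print(toy_tf.nlargest(3))
```

With the real objects, and under the same length assumption, this would be `pd.Series(mu[0], index=vocabulary).nlargest(num_top_feats)`.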

We want to know roughly what each of the 10 topics is about. So we can look at the top features for each topic $k$, that is, the features with the highest probabilities in $\widehat{\boldsymbol{\mu}}_k$. This is the same procedure we used for $k$-means clustering.

In [None]:
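# One row per topic, one column per feature
tf = pd.DataFrame(mu)
# Keep the top features in each row (all other entries become NaN)
tf = tf.apply(pd.Series.nlargest, n=num_top_feats, axis=1)
# Reshape to long format: one row per (topic, feature index j) pair
tf = tf.reset_index().melt(id_vars="index", var_name="j", value_name="mu_kj").rename(columns={"index": "topic"})
# Drop the NaN entries, i.e. features outside a topic's top list
tf = tf.dropna(subset=["mu_kj"])
# Order features within each topic from most to least probable
tf = tf.sort_values(["topic", "mu_kj"], ascending=[True, False])
tf = tf.reset_index(drop=True)
# Look up the word for each feature index
tf["feature"] = [vocabulary[int(x)] for x in tf["j"]]

# Collapse to one comma-separated string of top words per topic
top_words = tf.groupby("topic")["feature"].apply(lambda s: ", ".join(s.astype(str)))

for i, r in top_words.items():
    print(f"Topic {i} top words: {r}")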

## Reading documents

Of course, if you want to really understand these topics, you will need to read a selection of documents corresponding to each one of the topics. Let's look at the five documents that have the highest proportion of tokens assigned to a topic. First, let's identify the top five documents for each topic. 

In [None]:
num_top_docs = 5
zf = pd.DataFrame(pi) 
zf = zf.apply(pd.Series.nlargest, n=num_top_docs, axis=0)
zf
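
Applying `nlargest` column by column returns a table whose rows are the union of every topic's top documents, with `NaN` wherever a document is not in a given topic's top five. If you would rather have the document indexes listed in descending order of proportion, one alternative is `np.argsort`. The sketch below uses a small invented array `toy_pi` and is only one of several ways to do this.

```python
import numpy as np

# Hypothetical topic proportions for 5 documents and 2 topics
toy_pi = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.5, 0.5],
])
n_top = 2

# Sort document indexes by descending proportion within each topic (column),
# then keep the first n_top rows: column k lists the top documents for topic k
order = np.argsort(-toy_pi, axis=0)
top_docs = order[:n_top, :]
print(top_docs)
```

Here column 0 of `top_docs` holds the documents with the largest share of topic 0, ordered from highest to lowest.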

Next, we will load the corpus documents and look at the top five documents in topic 0.

In [None]:
from pprint import pprint

topic = 0
topic_idx = zf.iloc[:,topic].dropna().index

corpus_file = os.path.join(sdir, 'guardian-corpus.csv')
corpus = pd.read_csv(corpus_file)

for i,r in corpus.iloc[topic_idx,:].iterrows():
    print(r["datetime"])
    pprint(r["texts"][0:210], width=80)
    print("\n")

You could, of course, do the same thing to review clusters as well!