Given a bag-of-words matrix as input, LDA decomposes it into two new matrices:
• A document-to-topic matrix
• A word-to-topic matrix
LDA decomposes the bag-of-words matrix in such a way that if we multiply those two matrices together, we will be able to reproduce the input, the bag-of-words matrix, with the lowest possible error. 
In practice, we are interested in those topics that LDA found in the bag-of-words matrix. The only 
downside may be that we must define the number of topics beforehand—the number of topics is a 
hyperparameter of LDA that has to be specified manually.

- Sebastian Raschka
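
As a rough illustration of the decomposition described in the quote, the sketch below multiplies a made-up document-to-topic matrix by a made-up topic-to-word matrix. The sizes and variable names (`doc_topic`, `topic_word`, `doc_word`) are assumptions for illustration only, and the matrices are random rather than fitted by LDA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, vocab_size = 4, 2, 6  # toy sizes, chosen arbitrarily

# Each row is a probability distribution (non-negative, sums to 1).
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)       # shape (4, 2)
topic_word = rng.dirichlet(np.ones(vocab_size), size=n_topics)  # shape (2, 6)

# The product gives, for each document, a distribution over the vocabulary.
# Scaled by document length, this is what approximates the bag-of-words counts.
doc_word = doc_topic @ topic_word                                # shape (4, 6)

print(doc_word.shape)
print(doc_word.sum(axis=1))  # each row sums to 1 (up to floating-point error)
```

Because each row of `doc_word` is a weighted average of topic-word distributions, it is itself a valid distribution over words.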

# Statistical Foundation of Latent Dirichlet Allocation

## Model Setup

**Variables:**
- $w_{d,n}$ = n-th word in document d
- $z_{d,n}$ = topic assignment for word $w_{d,n}$
- $\boldsymbol{\theta}_d$ = topic distribution for document d (K-dimensional)
- $\boldsymbol{\phi}_k$ = word distribution for topic k (V-dimensional)
- $\boldsymbol{\alpha}, \boldsymbol{\beta}$ = Dirichlet hyperparameters

## Generative Process

For each topic $k$:
$$\boldsymbol{\phi}_k \sim \text{Dirichlet}(\boldsymbol{\beta})$$

For each document $d$:
$$\boldsymbol{\theta}_d \sim \text{Dirichlet}(\boldsymbol{\alpha})$$

For each word position $n$ in document $d$:
$$z_{d,n} \sim \text{Multinomial}(\boldsymbol{\theta}_d)$$
$$w_{d,n} \sim \text{Multinomial}(\boldsymbol{\phi}_{z_{d,n}})$$
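
The generative process above can be sketched in NumPy as follows. This is a minimal sketch, assuming symmetric scalar hyperparameters and a fixed document length; the function name `generate_corpus` and all default values are hypothetical choices for illustration.

```python
import numpy as np

def generate_corpus(n_docs=5, n_topics=3, vocab_size=8, doc_len=20,
                    alpha=0.5, beta=0.1, seed=0):
    """Sample a toy corpus from the LDA generative process (symmetric priors)."""
    rng = np.random.default_rng(seed)
    # phi_k ~ Dirichlet(beta): one word distribution per topic, shape (K, V)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # theta_d ~ Dirichlet(alpha): one topic distribution per document, shape (D, K)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs, topics = [], []
    for d in range(n_docs):
        # z_{d,n} ~ Multinomial(theta_d): a topic for each word position
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # w_{d,n} ~ Multinomial(phi_{z_{d,n}}): a word drawn from the assigned topic
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])
        docs.append(w)
        topics.append(z)
    return docs, topics, theta, phi

docs, topics, theta, phi = generate_corpus()
print(docs[0])    # word ids of the first synthetic document
print(topics[0])  # the topic that generated each of those words
```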

## Joint Distribution

$$P(\mathbf{w},\mathbf{z},\boldsymbol{\Theta},\boldsymbol{\Phi}|\boldsymbol{\alpha},\boldsymbol{\beta}) = \prod_{k=1}^K \text{Dir}(\boldsymbol{\phi}_k|\boldsymbol{\beta}) \prod_{d=1}^D \text{Dir}(\boldsymbol{\theta}_d|\boldsymbol{\alpha}) \prod_{d=1}^D \prod_{n=1}^{N_d} \theta_{d,z_{d,n}} \phi_{z_{d,n},w_{d,n}}$$

## Inference Problem

We want the posterior:
$$P(\mathbf{z},\boldsymbol{\Theta},\boldsymbol{\Phi}|\mathbf{w},\boldsymbol{\alpha},\boldsymbol{\beta}) = \frac{P(\mathbf{w},\mathbf{z},\boldsymbol{\Theta},\boldsymbol{\Phi}|\boldsymbol{\alpha},\boldsymbol{\beta})}{P(\mathbf{w}|\boldsymbol{\alpha},\boldsymbol{\beta})}$$

The denominator is intractable, requiring approximation methods.
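
To see where the difficulty comes from, the denominator can be written out by marginalizing the joint distribution over $\mathbf{z}$, $\boldsymbol{\Theta}$, and $\boldsymbol{\Phi}$ (a sketch that follows directly from the joint distribution above, with the sum over each $z_{d,n}$ carried out per word):

$$P(\mathbf{w}|\boldsymbol{\alpha},\boldsymbol{\beta}) = \int \int \prod_{k=1}^K \text{Dir}(\boldsymbol{\phi}_k|\boldsymbol{\beta}) \prod_{d=1}^D \text{Dir}(\boldsymbol{\theta}_d|\boldsymbol{\alpha}) \prod_{d=1}^D \prod_{n=1}^{N_d} \left( \sum_{k=1}^K \theta_{d,k}\, \phi_{k,w_{d,n}} \right) d\boldsymbol{\Theta}\, d\boldsymbol{\Phi}$$

The product over words of sums over topics couples $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$, so the integral has no closed form, and expanding the product instead yields $K^{\sum_d N_d}$ terms.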

## Collapsed Gibbs Sampling

By integrating out $\boldsymbol{\Theta}$ and $\boldsymbol{\Phi}$, we sample topic assignments directly. The conditional probability for assigning topic $k$ to word $w_{d,n}$ is:

$$P(z_{d,n}=k | \mathbf{z}_{-d,n}, \mathbf{w}, \boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \frac{n_{d,k}^{-d,n} + \alpha_k}{\sum_{k'=1}^K (n_{d,k'}^{-d,n} + \alpha_{k'})} \cdot \frac{n_{k,w_{d,n}}^{-d,n} + \beta_{w_{d,n}}}{\sum_{w'=1}^V (n_{k,w'}^{-d,n} + \beta_{w'})}$$

Where:
- $n_{d,k}^{-d,n}$ = count of topic $k$ in document $d$ (excluding current word)
- $n_{k,w}^{-d,n}$ = count of word $w$ in topic $k$ (excluding current word)

## Parameter Estimation

After sampling, estimate parameters:

**Document-topic distributions:**
$$\hat{\theta}_{d,k} = \frac{n_{d,k} + \alpha_k}{\sum_{k'=1}^K (n_{d,k'} + \alpha_{k'})}$$

**Topic-word distributions:**
$$\hat{\phi}_{k,w} = \frac{n_{k,w} + \beta_w}{\sum_{w'=1}^V (n_{k,w'} + \beta_{w'})}$$

## Algorithm

1. **Initialize:** Randomly assign topics to all words
2. **For each iteration:**
   - For each word $w_{d,n}$:
     - Remove current topic assignment
     - Sample new topic using conditional probability above
     - Update count matrices
3. **Estimate:** Compute $\hat{\boldsymbol{\theta}}$ and $\hat{\boldsymbol{\phi}}$ from final counts
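
The steps above can be sketched as a small NumPy implementation. This is an illustrative sketch, not an optimized sampler: it assumes symmetric scalar priors (`alpha`, `beta`), documents given as lists of integer word ids, and a hypothetical function name `collapsed_gibbs_lda`. The document-side denominator of the conditional is dropped because it does not depend on $k$.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric scalar priors."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # topic counts per document
    n_kw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_k = np.zeros(n_topics)                 # total words per topic

    # Step 1: randomly assign a topic to every word and fill the count matrices.
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    # Step 2: resample each assignment from its conditional distribution.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # Remove the current assignment from the counts.
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional over topics for this word.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the new assignment and update the counts.
                z[d][n] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    # Step 3: estimate theta and phi from the final counts.
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Toy usage: three tiny documents over a four-word vocabulary.
toy_docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta_hat, phi_hat = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=4)
print(theta_hat.round(2))
print(phi_hat.round(2))
```

Using only the final sample keeps the sketch short; averaging estimates over several samples after burn-in is a common alternative.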

## Model Evaluation

**Perplexity:**
$$\text{Perplexity} = \exp\left(-\frac{\sum_{d} \sum_{n} \log P(w_{d,n}|\mathbf{w}_{train})}{\sum_{d} N_d}\right)$$

**Topic Coherence:** Measures semantic similarity of top words within topics:
$$C(k) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \log \frac{P(w_i^{(k)}, w_j^{(k)}) + \epsilon}{P(w_i^{(k)}) P(w_j^{(k)})}$$
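
Both metrics can be sketched in a few lines of NumPy. The helper names `perplexity` and `topic_coherence` are hypothetical, and the coherence sketch assumes that word probabilities are estimated as document frequencies from a binary document-term matrix and that every top word appears in at least one document.

```python
import numpy as np

def perplexity(log_probs):
    """Perplexity from per-word log-probabilities log P(w_{d,n} | w_train)."""
    log_probs = np.asarray(log_probs, dtype=float)
    return np.exp(-log_probs.sum() / log_probs.size)

def topic_coherence(X_binary, top_word_ids, eps=1e-12):
    """PMI-style coherence C(k) over all pairs of a topic's top words.

    X_binary: (n_docs, vocab_size) array, 1 if the word occurs in the document.
    Probabilities are document frequencies (an assumption of this sketch).
    """
    score = 0.0
    for i in range(len(top_word_ids) - 1):
        for j in range(i + 1, len(top_word_ids)):
            wi, wj = top_word_ids[i], top_word_ids[j]
            p_i = X_binary[:, wi].mean()
            p_j = X_binary[:, wj].mean()
            p_ij = (X_binary[:, wi] * X_binary[:, wj]).mean()
            score += np.log((p_ij + eps) / (p_i * p_j))
    return score

# A model that assigns probability 1/4 to every held-out word has perplexity 4.
uniform_ppl = perplexity(np.log(np.full(10, 0.25)))

# Words 0 and 1 always co-occur; word 2 never appears alongside them.
X_toy = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 0, 1]])
coherent = topic_coherence(X_toy, [0, 1])
incoherent = topic_coherence(X_toy, [0, 2])
print(uniform_ppl, coherent, incoherent)
```

Lower perplexity and higher coherence are better; in the toy matrix, the pair of words that always co-occur scores higher than the pair that never does.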

## Hyperparameters

- **$\alpha_k < 1$:** Sparse document-topic distributions (documents focus on fewer topics)
- **$\beta_w < 1$:** Sparse topic-word distributions (topics use fewer distinctive words)
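
The effect of the concentration parameter can be checked by sampling from a symmetric Dirichlet directly. This is a small sketch with arbitrarily chosen values (0.1 and 10.0, with 10 topics); it compares how much probability mass the largest component receives on average.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10  # number of topics (arbitrary for this illustration)

# 1,000 draws of a document-topic distribution under each setting.
sparse_draws = rng.dirichlet(np.full(K, 0.1), size=1000)   # alpha < 1
dense_draws = rng.dirichlet(np.full(K, 10.0), size=1000)   # alpha > 1

# Average probability mass on the single most probable topic.
sparse_peak = sparse_draws.max(axis=1).mean()
dense_peak = dense_draws.max(axis=1).mean()
print(f"alpha=0.1:  mean largest component = {sparse_peak:.2f}")
print(f"alpha=10.0: mean largest component = {dense_peak:.2f}")
```

With a small concentration value, most of the mass lands on one or two topics per draw, while a large value spreads it close to uniformly.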

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


df = pd.read_csv('movie_data.csv', encoding='utf-8')
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 781.4+ KB
None


In [3]:
count = CountVectorizer(stop_words='english',
                        max_df=0.1,         # ignore words that appear in more than 10% of documents (arbitrary choice)
                        max_features=5000)  # keep only the 5,000 most frequently occurring words (arbitrary choice)

X = count.fit_transform(df['review'].values)
y = df['sentiment'].values

In [4]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,          # number of topics to learn; training cost grows with this value
                                random_state=1,
                                learning_method='batch')  # use all available training data in each update; 'online' uses mini-batches instead

X_topics = lda.fit_transform(X)

In [9]:
print(lda.components_.shape)
print(X_topics.shape)

feature_names = count.get_feature_names_out()

n_top_words = 5
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}: {', '.join(feature_names[i] for i in topic.argsort()[:-n_top_words - 1: -1])}")

(10, 5000)
(50000, 10)
Topic 0: script, worst, poor, minutes, production
Topic 1: original, series, episode, worst, stupid
Topic 2: family, book, kids, children, school
Topic 3: horror, ending, scary, original, suspense
Topic 4: music, beautiful, performance, excellent, wonderful
Topic 5: woman, wife, father, mother, husband
Topic 6: war, series, american, documentary, game
Topic 7: john, role, played, plays, michael
Topic 8: guy, action, effects, dead, looks
Topic 9: comedy, laugh, humor, jokes, fun


Looks like the topics are:
0) Bad movies
1) Sequels, series, shows
2) Kids movies
3) Horror movies
4) Art movies
5) Family movies
6) War movies
7) Unclear - possibly actors?
8) Action movies
9) Comedy movies

Let's test out the war movies topic to see what falls out!

In [12]:
war = X_topics[:, 6].argsort()[::-1]
for idx, movie_idx in enumerate(war[:5]):
    print(f"War movie #{idx}:")
    print(df['review'][movie_idx][:300], '...')

War movie #0:
The first 2 parts seek to reduce to absurdity the rise of wasteful wars and rule by nationalist barbarians. The 3rd part speculates that progress and exploration toward the moon and beyond is the key to ensuring a meaningful use of human talents and resources. It has speeches that some viewers dismi ...
War movie #1:
There is an episode of The Simpsons which has a joke news report referring to an army training base as a "Killbot Factory". Here the comment is simply part of a throwaway joke, but what Patricia Foulkrod's documentary does is show us, scarily, that it is not that far from the truth. After World War  ...
War movie #2:
One of the best documentaries released in recent years. Some points...<br /><br />1. Hugo Chavez was elected Venezuela's president in 1998, his support largely coming from the poorer regions of Venezuela.<br /><br />2. In 2002, a coup briefly deposed Chavez. At the time, Irish filmmakers Kim Bartley ...
War movie #3:
There are no reasons of takin

Sounds like we got war-based movies, or something in the apocalypse category, which can easily bleed into the war topic!