# Topic Modeling - Latent Dirichlet Allocation


In this notebook we introduce Latent Dirichlet Allocation (LDA) as a technique for performing Topic Modeling on textual data.



## Motivation


Suppose we want to **discover topics** underlying the following set of sentences:

1. Feynman teaches Physics.
2. Physics is cool!
3. Fellini made great movies.
4. Ross theatre hosts independent movies.
5. The movie infinity is about Physics and Feynman.

By inspection we can see that the first two sentences might belong to the topic "Physics", the third and fourth sentences could be about the topic "Movie", and the last sentence could be a mixture of both Physics and Movie.

    How do we discover these topics automatically?


## Topic Modeling

We use an unsupervised machine learning technique called **Topic modeling** that is capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize a set of documents. A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents.

Topic models are a great way to automatically explore and structure a large set of documents. They group or cluster documents based on the words that occur in them. As documents on similar topics tend to use a similar sub-vocabulary, the resulting clusters of documents can be interpreted as discussing different "topics".


## Latent Dirichlet Allocation (LDA) 

LDA is an example of a probabilistic topic model which is used in Natural Language Processing (NLP). It is a **generative** statistical model that allows **sets of observations to be explained by unobserved groups that explain why some parts of the data are similar**. 

For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics.

LDA is based on two hypotheses: 

- Distributional hypothesis: Similar topics make use of similar words.
- Statistical mixture hypothesis: Documents talk about several topics. 

The purpose of LDA is to map each document in our corpus to a set of topics which covers a good deal of the words in the document.


As an example let's see what LDA might produce on the corpus of 5 sentences (given above). 


|            | Topic A | Topic B |       
| :---:|:---:|:---:|
| Sentence 1 |   100%  |         |
| Sentence 2 |   100%  |         |
| Sentence 3 |         |  100%   |
| Sentence 4 |         |  100%   |
| Sentence 5 |    60%  |  40%    |



|         |  movie | Ross theatre | independent | hosts | Fellini | Physics | Feynman | teaches | 
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Topic A |   15%  |       0%     |      0%     |   0%  |    0%   |    42%  |    28%  |    15%  |
| Topic B |   28%  |      15%     |     15%     |  15%  |   15%   |    0%   |    0%   |    0%   |



Based on the above observation we might interpret Topic A as "Physics" and Topic B as "Movie".

Now let's see how LDA makes this discovery, i.e., to produce interpretable document representations which can be used to discover the topics or structure in a collection of unlabeled documents.


First, we **vectorize the text** using a Bag-of-Words model to represent the different documents. Then, LDA uses these representations to find the structure in the document collection.


## Bag-of-Words

Traditionally, text documents are represented in NLP as a Bag-of-Words. This means that each document is represented as a **fixed-length vector** with length equal to the vocabulary size. Each dimension of this vector corresponds to the count or occurrence of a word in a document. 
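As a rough illustration, the five example sentences can be turned into count vectors with the standard library alone. The `tokenize` and `bag_of_words` helpers below are hypothetical names introduced for this sketch, and the tokenization (lowercasing and stripping punctuation) is an assumption rather than a prescribed preprocessing step.

```python
from collections import Counter

# The five example sentences from the Motivation section.
docs = [
    "Feynman teaches Physics.",
    "Physics is cool!",
    "Fellini made great movies.",
    "Ross theatre hosts independent movies.",
    "The movie infinity is about Physics and Feynman.",
]

def tokenize(text):
    """Lowercase each whitespace-separated token and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?").lower() for tok in text.split())
    return [tok for tok in tokens if tok]

# Vocabulary: every distinct token, sorted so that the column order is stable.
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
word_index = {word: j for j, word in enumerate(vocab)}

def bag_of_words(text):
    """Return a count vector whose length equals the vocabulary size."""
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocab]

# One fixed-length row per document: this is the document-term matrix.
doc_term_matrix = [bag_of_words(doc) for doc in docs]
```

Every row has the same length regardless of how long the sentence is, and word order is discarded, which is exactly what "bag" refers to.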




## How Does LDA Work?

        Main idea: Each document can be described by a distribution of topics and each topic can be described by a distribution of words.

LDA is a **generative process** that assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.

Let's see how the generative process works. The goal is to create the documents by sampling words.

Assume that we have $N$ documents, that the vocabulary of those documents contains $M$ distinct words, and that the number of topics we want to discover is $K$.

For generating these documents we use two dice. Using these two dice we select topics and words, respectively.
- K-sided die: with probability $\Theta_{td}$ select topic t in document d
- M-sided die: with probability $\Phi_{wt}$ select word w in topic t 

1. Select a document $d$ to generate words for.
2. Roll the K-sided die to select a topic $t$ for document $d$.
3. Roll the M-sided die to select a word $w$ from topic $t$.

By repeating steps 2 and 3 we generate the words for document $d$. We then repeat this process for the other documents.



<img src="https://cse.unl.edu/~hasan/Pics/LDA-GenerativeProcess.png" width=600, height=200>


As an illustration, consider a document that contains the sentence "The movie infinity is about Physics and Feynman". Let's say that it's a mixture of two topics: Physics (40%) and Movie (60%).

If we roll a 2-sided die (K = 2 for two topics) for topic selection many times, 60% of the time we will generate the topic Movie and 40% of the time we will generate the topic Physics.

For these two topics let's say that we have the following distribution of words.

|         |  movie | Ross theatre | independent | hosts | Fellini | Physics | Feynman | teaches | 
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Physics |   15%  |       0%     |      0%     |   0%  |    0%   |    42%  |    28%  |    15%  |
|  Movie  |   28%  |      15%     |     15%     |  15%  |   15%   |    0%   |    0%   |    0%   |

After we select the topic Movie, we roll an 8-sided die (M = 8 for eight words) to select a word. The probability of generating the word "movie" is 28%, the probability of generating the word "Fellini" is 15%, etc.

Thus, if we repeatedly roll the two dice for each document, we will be able to generate a word distribution for the documents that is very close to the actual distribution of words in those documents. However, the order of the generated words will differ from the word order in the actual documents.
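The two-dice process can be simulated in a few lines. This is a minimal sketch using the illustrative numbers above, not a fitted model: `generate_document` is a hypothetical helper, and the seed and sample size are arbitrary. Note that the Movie row as tabulated sums to 88%, so the sketch treats the table entries as relative weights, which `random.choices` normalizes internally.

```python
import random

vocab = ["movie", "Ross theatre", "independent", "hosts",
         "Fellini", "Physics", "Feynman", "teaches"]

# Word weights per topic, copied from the table above (M-sided die).
topic_word_weights = {
    "Physics": [0.15, 0.00, 0.00, 0.00, 0.00, 0.42, 0.28, 0.15],
    "Movie":   [0.28, 0.15, 0.15, 0.15, 0.15, 0.00, 0.00, 0.00],
}

# Topic weights for the example document (K-sided die).
doc_topic_weights = {"Physics": 0.4, "Movie": 0.6}

def generate_document(num_words, rng):
    """Generate (topic, word) pairs by rolling the topic die, then the word die."""
    topics = list(doc_topic_weights)
    topic_probs = [doc_topic_weights[t] for t in topics]
    pairs = []
    for _ in range(num_words):
        topic = rng.choices(topics, weights=topic_probs)[0]
        word = rng.choices(vocab, weights=topic_word_weights[topic])[0]
        pairs.append((topic, word))
    return pairs

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
sample = generate_document(10_000, rng)
movie_share = sum(1 for topic, _ in sample if topic == "Movie") / len(sample)
```

With enough rolls, `movie_share` settles close to the 60% weight of the Movie topic, while the generated words appear in no meaningful order.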



## Why is the Knowledge of LDA as a Generative Process useful?

Knowledge about the generative model of LDA is useful because given a dataset of documents, LDA **backtracks** this generative process and tries to figure out **what topics would create those documents in the first place**. LDA finds two distributions: topics in the documents, and words in the topics.

LDA can be viewed as a matrix factorization technique (explained later). In vector space, any corpus (collection of documents) can be represented as a document-term matrix. The matrix in Table 1 shows a corpus of N documents $D_1, D_2, D_3, ..., D_N$ and a vocabulary of M words $W_1, W_2, W_3, ..., W_M$. The value in cell $(i, j)$ gives the frequency count of word $W_j$ in document $D_i$.

<img src="https://cse.unl.edu/~hasan/Pics/Tables-LDA.png" width=300, height=100>

LDA converts this Document-Term Matrix (Table 1) into two lower dimensional matrices: 

- Document-Topics matrix (Table 2): ($N \times K$) Provides the count of times each topic is assigned to each document
- Topic-Terms matrix (Table 3): ($K \times M$) Provides the count of times each word is assigned to each topic

Here, $N$ is the number of documents, $K$ is the number of topics and $M$ is the vocabulary size.

LDA starts by assigning every term/word to a random topic.

Notice that these two matrices (Tables 2 and 3) already provide topic-term and document-topic distributions.

However, these distributions need to be improved, which is the main aim of LDA. LDA makes use of **sampling techniques** in order to update these matrices.
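The random starting point can be sketched as follows. The helper name `random_initialization`, the tiny tokenized corpus, and the seed are assumptions made for illustration; the function simply assigns each word a random topic and tallies the two count matrices described above.

```python
import random

def random_initialization(docs, vocab, num_topics, rng):
    """Assign every word a random topic and build the two count matrices."""
    word_index = {word: j for j, word in enumerate(vocab)}
    # One topic id per word position in each document.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    doc_topic = [[0] * num_topics for _ in docs]               # N x K
    topic_term = [[0] * len(vocab) for _ in range(num_topics)]  # K x M
    for d, doc in enumerate(docs):
        for word, topic in zip(doc, assignments[d]):
            doc_topic[d][topic] += 1
            topic_term[topic][word_index[word]] += 1
    return assignments, doc_topic, topic_term

# Hypothetical pre-tokenized corpus based on the example sentences.
docs = [
    ["feynman", "teaches", "physics"],
    ["physics", "is", "cool"],
    ["fellini", "made", "great", "movies"],
]
vocab = sorted({word for doc in docs for word in doc})

assignments, doc_topic, topic_term = random_initialization(
    docs, vocab, num_topics=2, rng=random.Random(0)
)
```

Each row of `doc_topic` sums to the length of its document, and the entries of `topic_term` sum to the total number of words in the corpus, since every word occurrence is counted exactly once in each matrix.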



## Update Topic-Word Assignment:


LDA assumes that every word in each document comes from a topic and the topic is selected from a per-document distribution over topics. 

Thus, we have two matrices to store the probabilities for topic-selection and word-selection.

- $\Theta_{td} = P(t | d)$: probability distribution of topic t in document d (plus some smoothing)

- $\Phi_{wt} = P(w | t)$: probability distribution of word w in topic t (plus some smoothing)



### Smoothing Term: Dirichlet Distribution

Note that a smoothing term ensures every topic has a nonzero chance of being chosen in any document and that every word has a nonzero chance of being chosen for any topic.

On a technical note, these two probability vectors are given a prior called the **Dirichlet Distribution**. The Dirichlet distribution is a continuous distribution over probability vectors; it is the conjugate prior of the multinomial distribution, not a multinomial itself. It is controlled by a parameter $\alpha$ with one positive entry per outcome (topic or word), which determines how the probability mass tends to be spread across those outcomes. The set of all possible probability vectors forms a simplex. For three outcomes (e.g., 3 topics, K = 3), the simplex is a triangular 2D plane.



To better understand the Dirichlet distribution in the context of LDA, let's consider the calculation of $\theta_{td}$. Assume that there are 3 topics that we want to discover. Thus, there would be 3 parameters for the Dirichlet distribution: $\alpha_1$, $\alpha_2$ and $\alpha_3$. 


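Each topic is given a positive $\alpha$ value. The $\alpha$ values themselves do not need to sum to 1; it is each sampled topic distribution that sums to 1. Below we assume that all topics share the same value $\alpha$.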

- Low $\alpha$ values ($< 1$): Most of the topic distribution samples are in the corners (near the topics). 

- Very low $\alpha$ values: We might end up sampling (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), or (0.0, 0.0, 1.0). This would mean that a document would only ever have one topic.

- $\alpha = 1$: The probabilities for the topics are uniformly distributed. We could equally likely end up with a sample favoring only one topic, a sample that gives an even mixture of all the topics, or something in between.

- $\alpha > 1$: The samples start to congregate in the center of the triangle. This means that as $\alpha$ gets bigger, samples will more likely be uniform, i.e., represent an even mixture of all the topics.
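The effect of $\alpha$ can be checked empirically. The sketch below assumes a symmetric Dirichlet (the same $\alpha$ for all three topics) and draws samples by normalizing independent Gamma draws, a standard construction; `sample_dirichlet` and `mean_largest_component` are hypothetical helper names, and the sample size and seed are arbitrary.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [g / total for g in draws]

def mean_largest_component(alpha, num_topics=3, num_samples=2000, seed=0):
    """Average size of the dominant topic across many sampled topic mixtures."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        total += max(sample_dirichlet([alpha] * num_topics, rng))
    return total / num_samples

for alpha in (0.1, 1.0, 10.0):
    # A value near 1 means samples sit in a corner of the triangle;
    # a value near 1/3 means samples sit near the center.
    print(alpha, round(mean_largest_component(alpha), 3))
```

The printed averages shrink as $\alpha$ grows: small $\alpha$ yields mixtures dominated by a single topic, and large $\alpha$ yields nearly even mixtures.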




##### Let's get back to the topic-word assignment problem. 

The current topic-word assignment (the matrix in Table 3) is updated by assigning a new topic with a probability (or topic weight) given by the product of two probabilities, one from each matrix: $\Phi_{wt}$ and $\Theta_{td}$.

Note that for each word, we will get a vector of probabilities that will explain how likely this word belongs to each of the topics. 


The intuitive justification for this product is that given a word and its document, the likelihood of any topic choice depends on both:
- How likely that topic is for the document ($\Theta_{td}$)
- How likely that word is for the topic ($\Phi_{wt}$)



Thus, the probability of a word given document, i.e., $p(w|d)$:


$p(w | d) = \sum_{t \in K}p(w | t, d) p(t | d)$

Assume $K$ is the total number of topics.


$p(w | d) = \sum_{t \in K}p(w | t) p(t | d)$ [Using conditional independence: $p(w | t, d) = p(w | t)$ ]

$p(w | d) = \sum_{t \in K}\Phi_{wt} \Theta_{td}$

This probability explains **why a word $w$ is present in a document $d$**: the document $d$ is a mixture of topics, and each topic is a mixture of words. 

- The word $w$ is present in the document $d$ because there is a likelihood that the word $w$ belongs to the topic $t$ and that the document $d$ has a likelihood to contain the topic $t$. 


The probability $p(w | d)$ for all words and documents can be written as a matrix product of $\Phi_{wt}$ and $\Theta_{td}$ as follows.



<img src="https://cse.unl.edu/~hasan/Pics/LDA-Matrix1.png" width=600, height=300>


<img src="https://cse.unl.edu/~hasan/Pics/LDA-Matrix2.png" width=600, height=300>


Observe that we decompose the matrix of word probabilities in documents ($p(w | d)$) into two matrices: the distribution of topics in a document ($\Theta_{td}$) and the distribution of words in a topic ($\Phi_{wt}$). Thus, we can think of LDA as similar to **matrix factorization or SVD (Singular Value Decomposition)**.
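To make the factorization concrete, the sketch below multiplies the two example tables from the start of this notebook. The helper name `word_given_document` is hypothetical. Because the Topic B row as tabulated sums to 88%, each row is normalized first so that it forms a proper distribution; this normalization is an assumption of the sketch.

```python
vocab = ["movie", "Ross theatre", "independent", "hosts",
         "Fellini", "Physics", "Feynman", "teaches"]

# Topic-word weights from the example table, one row per topic.
topic_word_weights = [
    [0.15, 0.00, 0.00, 0.00, 0.00, 0.42, 0.28, 0.15],  # Topic A ("Physics")
    [0.28, 0.15, 0.15, 0.15, 0.15, 0.00, 0.00, 0.00],  # Topic B ("Movie")
]
# Phi: normalize each row so that it sums to 1.
phi = [[w / sum(row) for w in row] for row in topic_word_weights]

# Theta: document-topic proportions from the example table, one row per sentence.
theta = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.6, 0.4],
]

def word_given_document(theta, phi):
    """Compute p(w | d) = sum over topics t of p(t | d) * p(w | t) for all d, w."""
    num_topics = len(phi)
    num_words = len(phi[0])
    return [
        [sum(theta_d[t] * phi[t][w] for t in range(num_topics))
         for w in range(num_words)]
        for theta_d in theta
    ]

p_w_given_d = word_given_document(theta, phi)
```

The result is an $N \times M$ matrix whose rows each sum to 1, built entirely from an $N \times K$ matrix and a $K \times M$ matrix.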


### How do we learn the weights of these two matrices?


To start with, we randomly assign weights to both the matrices and assume that our data is generated as per the following steps:

1. Randomly choose a topic from the distribution of topics in a document based on their assigned weights. 

2. Next, based on the distribution of words for the chosen topic, select a word at random and put it in the document.

3. Repeat steps 1 and 2 for every word position in the document.



We try to **maximize the likelihood** of our data given these two matrices. In this process, if our guess of the weights is wrong, then the actual data that we observe will be very unlikely under our assumed weights and data generating process. 
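A small numerical check of this idea, under made-up weights: the `log_likelihood` helper and the toy matrices below are hypothetical and exist only to show that weights matching the data score higher than mismatched ones.

```python
import math

def log_likelihood(docs, theta, phi):
    """Sum of log p(w | d) over every word in every document.

    docs  : list of documents, each a list of word indices
    theta : theta[d][t] = p(topic t | document d)
    phi   : phi[t][w]   = p(word w | topic t)
    """
    total = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            p = sum(theta[d][t] * phi[t][w] for t in range(len(phi)))
            total += math.log(p)
    return total

# Toy data: document 0 uses only words 0 and 1, document 1 only words 2 and 3.
docs = [[0, 1, 0, 1], [2, 3, 2, 3]]

phi = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0 prefers words 0 and 1
    [0.05, 0.05, 0.45, 0.45],  # topic 1 prefers words 2 and 3
]
good_theta = [[0.9, 0.1], [0.1, 0.9]]  # each document leans on the matching topic
bad_theta = [[0.1, 0.9], [0.9, 0.1]]   # each document leans on the wrong topic

good_score = log_likelihood(docs, good_theta, phi)
bad_score = log_likelihood(docs, bad_theta, phi)
```

Here `good_score` is larger than `bad_score`: the observed words are far less probable when the documents are paired with the wrong topics.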


To identify the correct weights, we use an algorithm called **Gibbs sampling**. 

Let’s discuss what Gibbs sampling is and how it works in LDA. 

        You may want to look at the notebook on Gibbs sampling for further understanding.


## Gibbs Sampling

Gibbs sampling is an algorithm for **successively sampling conditional distributions of variables**, whose distribution over states converges to the true distribution in the long run. 

In LDA we apply Gibbs sampling as follows.

First, we assume that we know $\Theta_{td}$ and $\Phi_{wt}$ matrices. 

Then, we slowly change these matrices and get to an answer that maximizes the likelihood of the data that we have. 

We do this on a word-by-word basis by changing the topic assignment of one word at a time. 

    We assume that we don’t know the topic assignment of the given word. But we know the assignment of all other words in the text. Using this information, we try to infer what topic will be assigned to this word.


Mathematically we try to find conditional probability distribution of a single word’s topic assignment conditioned on the rest of the topic assignments. 

The conditional probability for a single word w in document d that belongs to topic k:


$p(z_{d,n} = k | \vec{z}_{-d, n}, \vec{w}, \alpha, \lambda) \propto \frac{n_{d, k} + \alpha_k}{\sum_{i}^{K}(n_{d, i} + \alpha_i)}\frac{v_{k, w_{d,n}} + \lambda_{w_{d,n}}}{\sum_{i}(v_{k, i} + \lambda_i)}$



Here:
- $n_{d, k}$: Number of times document d uses topic k

- $v_{k, w}$: Number of times topic k uses the given word

- $\alpha_k$: Dirichlet parameter for document to topic distribution

- $\lambda_w$: Dirichlet parameter for topic to word distribution


There are two parts of this equation. 

The first part tells us how much each topic is present in a document, and the second part tells us how much each topic likes a word. 

Note that for each word, we will get a vector of probabilities that will explain how likely this word belongs to each of the topics. We sample a value from this probability distribution to update the topic of a word. 

In the above equation, it can be seen that the Dirichlet parameters also act as smoothing parameters when $n_{d, k}$ or $v_{k, w}$ is zero. This means that there will still be some chance that the word will choose that topic going forward.
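The conditional above can be evaluated directly from the count tables. In this sketch, `topic_conditional` is a hypothetical helper, the counts are made up, and the counts are assumed to already exclude the word being resampled. The product of the two fractions is normalized over the topics at the end so that the result is a proper distribution to sample from.

```python
def topic_conditional(doc_topic_counts, topic_word_counts, word, alpha, lam):
    """Distribution over topics for one word, given all other assignments.

    doc_topic_counts  : n_{d,k} for the current document (current word removed)
    topic_word_counts : v_{k,w}, one row of word counts per topic
    word              : index of the current word
    alpha, lam        : Dirichlet parameters (one per topic, one per word)
    """
    num_topics = len(doc_topic_counts)
    doc_norm = sum(doc_topic_counts[i] + alpha[i] for i in range(num_topics))
    weights = []
    for k in range(num_topics):
        # How much document d uses topic k.
        doc_part = (doc_topic_counts[k] + alpha[k]) / doc_norm
        # How much topic k likes this word.
        word_norm = sum(v + l for v, l in zip(topic_word_counts[k], lam))
        word_part = (topic_word_counts[k][word] + lam[word]) / word_norm
        weights.append(doc_part * word_part)
    total = sum(weights)
    return [w / total for w in weights]

# Made-up counts: the document has so far used only topic 0,
# but the word in question (index 2) has only ever been assigned to topic 1.
probs = topic_conditional(
    doc_topic_counts=[2, 0],
    topic_word_counts=[[3, 1, 0], [0, 0, 4]],
    word=2,
    alpha=[0.1, 0.1],
    lam=[0.01, 0.01, 0.01],
)
```

Both entries of `probs` are strictly positive even though each topic has a zero count on one side, which is the smoothing effect of the Dirichlet parameters.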


In short, the model assumes that all the existing topic-word assignments except the current word are correct. We iterate this process and after a number of iterations, a steady state is achieved where the document topic and topic term distributions are fairly good. This is the convergence point of LDA.

The LDA algorithm is implemented by a **variant** of Gibbs sampling known as **Collapsed Gibbs sampling**.



### Collapsed Gibbs Sampler

A collapsed Gibbs sampler integrates out (marginalizes over) one or more variables when sampling for some other variable. 

For example, imagine that a model consists of three variables A, B, and C. A simple Gibbs sampler would sample from p(A | B,C), then p(B | A,C), then p(C | A,B). 

A collapsed Gibbs sampler might replace the sampling step for A with a sample taken from the marginal distribution p(A | C), with variable B integrated out in this case. Alternatively, variable B could be collapsed out entirely, alternately sampling from p(A | C) and p(C | A) and not sampling over B at all. 

The distribution over a variable A that arises when collapsing a parent variable B is called a compound distribution; sampling from this distribution is generally tractable when B is the conjugate prior for A, particularly when A and B are members of the exponential family. 


In LDA, it is quite common to collapse out the Dirichlet distributions that are typically used as prior distributions over the categorical variables. The result of this collapsing introduces dependencies among all the categorical variables dependent on a given Dirichlet prior, and the joint distribution of these variables after collapsing is a Dirichlet-multinomial distribution. The conditional distribution of a given categorical variable in this distribution, conditioned on the others, assumes an extremely simple form that makes Gibbs sampling even easier than if the collapsing had not been done. 



## Description of the LDA Algorithm


Choose some fixed number of K topics to discover. 

Go through each document, and randomly assign each word in the document to one of the K topics.

This random assignment gives both topic representations of all the documents and word distributions of all the topics (Table 2 and 3,  $\Theta_{td}$ and $\Phi_{wt}$ matrices).

Then, LDA updates these matrices as follows.

For each document d:
- Go through each word w in d:
    - For each topic t, compute two probabilities: $\theta_{td}$ and $\phi_{wt}$ 

    - Reassign w a new topic, where we choose topic t with probability proportional to the topic weight $p(t | w, d) \propto \phi_{wt} * \theta_{td}$ (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). 
    - In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.


After repeating the previous step a large number of times, we will eventually reach a roughly steady state where our assignments are pretty good. 

We use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
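The whole procedure fits in a short function. The following is a toy sketch under stated assumptions, not a production implementation: `lda_collapsed_gibbs` is a hypothetical name, the priors are symmetric scalars `alpha` and `lam`, the iteration count is fixed rather than tested for convergence, and the final estimates keep the smoothing terms instead of using raw proportions. The document-side denominator is omitted from the weights because it is the same for every topic.

```python
import random

def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, lam=0.01, num_iters=200, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word indices."""
    rng = random.Random(seed)
    topics = range(num_topics)

    # Random initial topic for every word position, plus the count tables.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_dk = [[0] * num_topics for _ in docs]    # document-topic counts
    v_kw = [[0] * vocab_size for _ in topics]  # topic-word counts
    v_k = [0] * num_topics                     # total words per topic
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            v_kw[k][w] += 1
            v_k[k] += 1

    for _ in range(num_iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                # Remove the current word's assignment from the counts.
                k_old = z[d][n]
                n_dk[d][k_old] -= 1
                v_kw[k_old][w] -= 1
                v_k[k_old] -= 1

                # Topic weight: (document uses topic) * (topic likes word).
                weights = [
                    (n_dk[d][k] + alpha)
                    * (v_kw[k][w] + lam) / (v_k[k] + vocab_size * lam)
                    for k in topics
                ]
                k_new = rng.choices(topics, weights=weights)[0]

                # Record the resampled assignment.
                z[d][n] = k_new
                n_dk[d][k_new] += 1
                v_kw[k_new][w] += 1
                v_k[k_new] += 1

    # Smoothed estimates of the two distributions from the final counts.
    theta = [[(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha)
              for k in topics] for d, doc in enumerate(docs)]
    phi = [[(v_kw[k][w] + lam) / (v_k[k] + vocab_size * lam)
            for w in range(vocab_size)] for k in topics]
    return theta, phi

# Hypothetical word ids: 0=feynman 1=physics 2=teaches 3=movie 4=fellini 5=theatre
toy_docs = [[0, 2, 1], [1, 1, 0], [4, 3, 3], [5, 3, 4], [3, 1, 0, 3]]
theta, phi = lda_collapsed_gibbs(toy_docs, num_topics=2, vocab_size=6)
```

Each row of `theta` is a document's topic mixture and each row of `phi` is a topic's word distribution. Which topic index ends up meaning "Physics" or "Movie" is arbitrary and depends on the random initialization.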

## Why do we use LDA?

If we view the number of topics as a number of clusters and the probabilities as the proportion of cluster membership, then using LDA is a way of **soft-clustering** our documents and the words in them.

Thus, unlike K-Means that only does hard-clustering, LDA allows for "fuzzy" memberships. This provides a more nuanced way of recommending similar items, finding duplicates, or discovering user profiles/personas.


## Applications of LDA

- To shrink a large corpus of text data to some keywords (or sequence of keywords using N-gram).

- To reduce the task of clustering or searching a huge number of documents (may be huge in size too) to clustering or searching the keywords (topics). This helps in reducing the number of resources required for searching and retrieving information.

- As an initial step for summarization of a large collection of text data.

- To automatically tag new incoming text data by using the topics learned.