Latent Dirichlet Allocation (LDA) is a statistical model used for topic modeling, where each document is assumed to be a mixture of topics and each word is attributed to one of these topics. In LDA, Gibbs Sampling is a technique used for probabilistic inference to estimate the hidden structure of topics within the documents.
Gibbs Sampling in LDA works by iteratively updating the assignment of words to topics. At each iteration, a word's topic assignment is re-sampled based on the current state of the model. This process is repeated many times until convergence.
Here's how Gibbs Sampling works in the context of LDA:
-
Initialization: Start by randomly assigning each word in each document to one of the topics.
-
Iterative Process:
- For each word in each document:
- Calculate the probability of the word belonging to each topic, considering the current state of the model.
- Sample a new topic assignment for the word based on these probabilities.
- For each word in each document:
-
Repeat: Repeat the above step for many iterations until the model converges to a stable state.
-
Estimation: After convergence, the distribution of topics in each document and the distribution of words in each topic are estimated based on the sampled assignments.
Gibbs Sampling allows LDA to iteratively refine its estimates of topics and their distributions in the documents. It works well even when the exact mathematical solution is complex or intractable.
In summary, Gibbs Sampling in LDA is a powerful technique for estimating the latent structure of topics within documents by iteratively updating the assignments of words to topics based on the observed data.
In this implementation of LDA leveraging Gibbs Sampling, we'll be sampling words of 50 Wikipedia pages under the Category-'History', containing 14351 unique words and distributing the words into 10 topics. We'll be running 5 and 100 iterations of the sampling to distribute the words into topics.
To run the code:
- Clone the repository
git clone https://github.com/sima-sta14/lda.git
- Install the dependencies
pip install numpy matplotlib seaborn wikipediaapi nltk
- Run the script
python3 gibbs.py