# Latent Dirichlet Allocation (LDA)

#### Author information
- **Name:** Jaeseong Choe

- **Email address:** 21900759@handong.ac.kr

- **GitHub:** https://github.com/sorrychoe

- **LinkedIn:** https://www.linkedin.com/in/jaeseong-choe-048639250/

- **Personal Webpage:** https://jaeseongchoe.vercel.app/

## Part 1. Brief background of methodology

### Overview

- **Latent Dirichlet Allocation (LDA)** is a generative probabilistic model for collections of discrete data such as text corpora.

- It assumes that each document is generated from a mixture of topics, where each topic is characterized by a distribution over words.

### Situation Before LDA

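- Before LDA, text mining mainly relied on frequency-based representations such as TF-IDF and on clustering analysis, which do not model the latent semantics of documents. Earlier latent-variable methods such as LSI and pLSI captured some of this structure, but pLSI does not define a generative model at the level of documents.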

### Why LDA Was Introduced

- LDA was introduced in the paper *"Latent Dirichlet Allocation"* by Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003).

- LDA was developed to capture the hidden thematic structure within large corpora of documents.

- It offers a generative probabilistic approach that assumes documents are a mixture of topics, with each topic being a distribution over words.

### Use Cases

- LDA is widely used in document classification, information retrieval, and topic discovery in large text datasets.

## Part 2. Key concept of methodology

### Key Concept
- LDA models each document as a mixture of topics, and each topic as a probability distribution over the words of the vocabulary.

- The generative process involves sampling a topic distribution for each document and word distributions for each topic.

### Generative Process

The LDA generative process can be described as follows:

1. **Document Length**:
   - For each document $d$, choose the number of words $N \sim \text{Poisson}(\xi)$.

2. **Topic Distribution**:
   - Draw a topic distribution $\theta_d$ from a Dirichlet prior:
   $$
   \theta_d \sim \text{Dir}(\alpha)
   $$
   where $\alpha$ is the hyperparameter for the Dirichlet distribution.

3. **Word Generation**:
   - For each word $w_{dn}$ in document $d$:
     - Choose a topic $z_{dn}$ from $\theta_d$:
     $$
     z_{dn} \sim \text{Multinomial}(\theta_d)
     $$
     - Choose a word $w_{dn}$ from the word distribution $\beta_{z_{dn}}$ conditioned on the topic:
     $$
     w_{dn} \sim \text{Multinomial}(\beta_{z_{dn}})
     $$
     where $\beta_{z_{dn}}$ is the topic-specific distribution over the vocabulary.

![LDA_Graphic](./img/LDA_Graphic.png)
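
The three generative steps above can be sketched with NumPy. This is a minimal illustration on toy sizes, not part of the original notebook: the function name `generate_document`, the values of `K`, `V`, `alpha`, and `xi`, and the randomly drawn `beta` are all assumptions chosen for the example.

```python
import numpy as np

def generate_document(alpha, beta, xi, rng):
    """Sample one document following the three generative steps of LDA."""
    n_topics, vocab_size = beta.shape
    # Step 1: document length N ~ Poisson(xi)
    n_words = rng.poisson(xi)
    # Step 2: topic distribution theta_d ~ Dir(alpha)
    theta = rng.dirichlet(alpha)
    # Step 3: for each position, draw a topic z_dn, then a word w_dn from beta[z_dn]
    topics = rng.choice(n_topics, size=n_words, p=theta)
    words = np.array([rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int)
    return theta, topics, words

rng = np.random.default_rng(0)
K, V = 3, 8                                    # toy number of topics and vocabulary size (assumed)
alpha = np.full(K, 0.5)                        # symmetric Dirichlet prior (assumed value)
beta = rng.dirichlet(np.full(V, 0.5), size=K)  # toy topic-word distributions, shape (K, V)

theta, topics, words = generate_document(alpha, beta, xi=20, rng=rng)
print("theta:", np.round(theta, 3))
print("topics:", topics)
print("words:", words)
```

Each entry of `words` is a vocabulary index, and `topics` holds the latent assignment $z_{dn}$ that produced it. In real data only `words` is observed.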

### Mathematical Representation

- **Dirichlet Distribution**: 
  $\theta$, the distribution of topics for a document, is drawn from a Dirichlet distribution $\text{Dir}(\alpha)$. The Dirichlet distribution has the following probability density function (pdf):
  
  $$
  p(\theta | \alpha) = \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}
  $$

  where $\theta$ is a $K$-dimensional vector, each $\theta_i \geq 0$, and $\sum_{i=1}^{K} \theta_i = 1$. $\alpha$ is a $K$-dimensional vector with $\alpha_i > 0$, and $\Gamma(\cdot)$ is the Gamma function.

- **Joint Distribution**:
  The joint distribution of the topic mixture $\theta$, the topics $z$, and the words $w$ is given by:

  $$
  p(\theta, z, w | \alpha, \beta) = p(\theta | \alpha) \prod_{n=1}^{N} p(z_n | \theta) p(w_n | z_n, \beta)
  $$

  where $\theta$ is the topic distribution for the document, $z_n$ is the topic for the $n$-th word, and $w_n$ is the word itself.

- **Marginal Likelihood**:
  By integrating out the topic distribution $\theta$ and summing over the topic assignments $z$, the marginal probability of a document is:

  $$
  p(w | \alpha, \beta) = \int p(\theta | \alpha) \prod_{n=1}^{N} \sum_{z_n} p(z_n | \theta) p(w_n | z_n, \beta) d\theta
  $$

  This expression involves an intractable integral, which requires approximation methods like variational inference or Gibbs sampling for practical computation.
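
As a rough numerical illustration of the formulas above, the sketch below evaluates the Dirichlet log-density and estimates the marginal likelihood of one toy document by simple Monte Carlo: draw many $\theta$ from $\text{Dir}(\alpha)$ and average the resulting likelihoods. The function names, the toy `beta`, and the sample size are assumptions for the example. Plain Monte Carlo is used here only because the document is tiny; it is not one of the inference methods discussed in this notebook.

```python
import math
import numpy as np

def dirichlet_log_pdf(theta, alpha):
    """Log of p(theta | alpha), using log-gamma for numerical stability."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def log_marginal_mc(words, alpha, beta, n_samples=5000, seed=0):
    """Monte Carlo estimate of log p(w | alpha, beta) for one document."""
    rng = np.random.default_rng(seed)
    thetas = rng.dirichlet(alpha, size=n_samples)   # (S, K) draws from p(theta | alpha)
    # Summing over z_n gives p(w_n | theta, beta) = sum_k theta_k * beta[k, w_n]
    word_probs = thetas @ beta[:, words]            # (S, N)
    log_lik = np.log(word_probs).sum(axis=1)        # log prod_n p(w_n | theta, beta)
    # Average the likelihoods in log space (log-mean-exp) to avoid underflow
    peak = log_lik.max()
    return peak + np.log(np.mean(np.exp(log_lik - peak)))

# alpha = (1, 1, 1) gives a uniform density of Gamma(3) = 2 on the simplex
uniform_density = math.exp(dirichlet_log_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
print(uniform_density)

beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.2, 0.7]])   # toy topics over a 3-word vocabulary (assumed)
alpha = np.array([0.5, 0.5])
words = np.array([0, 0, 2, 1])       # one toy document as word ids
print(log_marginal_mc(words, alpha, beta))
```

The number of samples needed for a stable estimate grows quickly with document length, which is one practical reason to prefer variational inference or Gibbs sampling.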

### Inference

Inference in LDA is approximated using methods like **variational inference** and **Gibbs sampling**.

- **Variational Inference:** Variational methods introduce a factorized distribution $q(\theta, z)$ to approximate the posterior distribution $p(\theta, z | w)$:

    $$
    q(\theta, z | \gamma, \phi) = q(\theta | \gamma) \prod_{n=1}^{N} q(z_n | \phi_n)
    $$

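    where $\gamma$ (the Dirichlet parameter of $q(\theta | \gamma)$) and $\phi$ (the multinomial parameters of $q(z_n | \phi_n)$) are variational parameters. The update equations are:

    $$
    \phi_{ni} \propto \beta_{i w_n} \exp(\mathbb{E}_q[\log \theta_i])
    $$

    $$
    \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}
    $$

    where $\mathbb{E}_q[\log \theta_i] = \Psi(\gamma_i) - \Psi(\sum_{j=1}^{K} \gamma_j)$ and $\Psi$ is the digamma function.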

- **Gibbs Sampling:** Collapsed Gibbs sampling places a symmetric Dirichlet prior with hyperparameter $\eta$ on each topic-word distribution and integrates out $\theta$ and $\beta$. The conditional distribution for the topic assignment $z_{dn}$ is then:

    $$
    P(z_{dn} = k | z_{-dn}, w, \alpha, \eta) \propto (\alpha_k + N_{dk}^{-dn}) \cdot \frac{\eta + N_{k w_{dn}}^{-dn}}{\sum_{v} (\eta + N_{kv}^{-dn})}
    $$

    where $N_{dk}^{-dn}$ is the number of words in document $d$ assigned to topic $k$, and $N_{kv}^{-dn}$ is the number of times word $v$ is assigned to topic $k$ across the corpus, both counted excluding the current position $(d, n)$.
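
Both inference schemes can be sketched in a few lines of NumPy on toy data. This is an illustrative sketch under stated assumptions, not the implementation used by tomotopy: the function names are hypothetical, the priors are symmetric scalars (the topic-word prior is called `eta` here, matching the name of tomotopy's parameter), and the small `digamma` helper is an approximation standing in for `scipy.special.digamma` so that the example depends only on NumPy.

```python
import numpy as np

def digamma(x):
    """Approximate digamma via recurrence plus an asymptotic series."""
    x = np.atleast_1d(np.asarray(x, dtype=float)).copy()
    out = np.zeros_like(x)
    while np.any(x < 6.0):            # shift small arguments: psi(x) = psi(x + 1) - 1/x
        small = x < 6.0
        out[small] -= 1.0 / x[small]
        x[small] += 1.0
    inv2 = 1.0 / (x * x)
    return out + np.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def variational_e_step(words, alpha, beta, n_iter=50):
    """Iterate the phi and gamma updates for a single document."""
    n_topics = len(alpha)
    gamma = alpha + len(words) / n_topics
    for _ in range(n_iter):
        # E_q[log theta_i] = digamma(gamma_i) - digamma(sum_j gamma_j)
        e_log_theta = digamma(gamma) - digamma(gamma.sum())
        phi = beta[:, words].T * np.exp(e_log_theta)   # shape (N, K)
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

def collapsed_gibbs(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_iter=100, seed=0):
    """Collapsed Gibbs sampling with symmetric priors alpha and eta."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # words in document d assigned to topic k
    n_kv = np.zeros((n_topics, vocab_size))  # times word v is assigned to topic k
    n_k = np.zeros(n_topics)                 # total words assigned to topic k
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]

    def update(d, w, k, step):
        n_dk[d, k] += step
        n_kv[k, w] += step
        n_k[k] += step

    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            update(d, w, z[d][n], +1)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                update(d, w, z[d][n], -1)    # remove the current token: the "-dn" counts
                # sum_v (eta + N_kv) equals vocab_size * eta + N_k
                p = (alpha + n_dk[d]) * (eta + n_kv[:, w]) / (vocab_size * eta + n_k)
                z[d][n] = rng.choice(n_topics, p=p / p.sum())
                update(d, w, z[d][n], +1)
    return z, n_dk, n_kv

beta = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.2, 0.7]])                      # toy topics (assumed)
gamma, phi = variational_e_step(np.array([0, 0, 0, 2]), np.array([0.5, 0.5]), beta)
print("gamma:", np.round(gamma, 3))

docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2], [0, 1, 2, 3]]    # toy corpus of word ids (assumed)
z, n_dk, n_kv = collapsed_gibbs(docs, n_topics=2, vocab_size=4)
print(n_dk)
```

The variational step treats $\beta$ as given and updates only the per-document parameters, whereas the Gibbs sampler never represents $\theta$ or $\beta$ explicitly and works entirely with count tables.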


### Key Assumptions

- **Bag-of-Words**: The order of words in a document is ignored, meaning that only the frequency of words matters, not their position.

- **Exchangeability**: Words within a document and documents within a corpus are assumed to be exchangeable, meaning the probability of a sequence of words (or documents) remains unchanged under permutation.
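
A small check makes both assumptions concrete: reordering the words of a document changes neither its word counts nor its likelihood under a fixed topic mixture. The toy `theta`, `beta`, and word ids below are assumed values for illustration only.

```python
import math
from collections import Counter

def doc_likelihood(tokens, theta, beta):
    """p(w | theta, beta) = prod_n sum_k theta_k * beta[k][w_n]."""
    return math.prod(sum(t * b[w] for t, b in zip(theta, beta)) for w in tokens)

theta = [0.6, 0.4]                  # toy topic mixture (assumed)
beta = [[0.4, 0.3, 0.2, 0.1],       # toy topic-word distributions (assumed)
        [0.1, 0.2, 0.3, 0.4]]

doc = [0, 2, 1, 0, 3]               # a document as word ids
permuted = [3, 0, 0, 1, 2]          # the same words in a different order

# Same bag of words, and the same likelihood up to floating-point rounding
print(Counter(doc) == Counter(permuted))
print(math.isclose(doc_likelihood(doc, theta, beta),
                   doc_likelihood(permuted, theta, beta)))
```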

### Strength

- LDA captures hidden topics in a large corpus and provides a probabilistic framework for topic discovery.

## Part 3. Example

### Before the Sample Code
- The gensim library is commonly used for topic modeling, but this example uses tomotopy.
- tomotopy was chosen because it is fast and closely reproduces the mathematical formulation described in the paper.
- Although tomotopy does not yet have as many users as gensim, it is emerging as an alternative, so it is used for topic modeling here.

### Precautions

- If you re-execute the code, the results may differ slightly.

- Because the random seed and the training settings are fixed, the number and content of the topics should not change significantly, but the index assigned to each topic may change between runs.

In [1]:
# import libraries
import numpy as np # for data preprocessing
import pandas as pd # for loading Excel data
import pyBigKinds as pbk # for preprocessing news text data
import tomotopy as tp # for topic modeling
import pyLDAvis # for visualizing the LDA result

# suppress warning messages
import warnings
warnings.filterwarnings("ignore")

In [2]:
def ldamodel(df: pd.DataFrame, k: int):
    """Build and train an LDA model with k topics."""

    # extract the keyword list of each document
    words = pbk.keyword_parser(pbk.keyword_list(df))
    model = tp.LDAModel(min_cf=5, rm_top=10, k=k, seed=42)

    # add each document to the model
    for doc_words in words:
        model.add_doc(words=doc_words)

    # initialize the model without training so that corpus statistics are available
    model.train(0)

    # print the number of documents, used vocabulary entries, and total words
    print('Num docs:{}, Num Vocabs:{}, Total Words:{}'.format(
        len(model.docs), len(model.used_vocabs), model.num_words
    ))

    # train the model
    model.train(2000, show_progress=True)
    return model

In [3]:
def find_proper_k(df: pd.DataFrame, start: int, end: int):
    """Find the k with the highest average coherence for hyperparameter tuning."""

    words = pbk.keyword_parser(pbk.keyword_list(df))

    # best candidate found so far
    proper_k = start
    best_coherence = None

    for i in range(start, end + 1):
        # model setting
        mdl = tp.LDAModel(min_cf=5, rm_top=10, k=i, seed=42)

        for doc_words in words:
            mdl.add_doc(words=doc_words)

        # pre-train the model to check the coherence score
        mdl.train(100)

        # get the coherence score
        coh = tp.coherence.Coherence(mdl, coherence='c_v')

        # average coherence over all topics
        average_coherence = coh.get_score()

        # coherence per topic
        coherence_per_topic = [coh.get_score(topic_id=t) for t in range(mdl.k)]

        # print the scores
        print('==== Coherence : k = {} ===='.format(i))
        print("\n")
        print('Average: {}'.format(average_coherence))
        print("\n")
        print('Per Topic:{}'.format(coherence_per_topic))
        print("\n")
        print("\n")

        # keep the k with the highest average coherence (ties keep the smaller k)
        if best_coherence is None or average_coherence > best_coherence:
            proper_k = i
            best_coherence = average_coherence
    return proper_k

In [4]:
def get_ldavis(mdl):
    """preprocessing for LDA Topic data visualization"""
    
    # get the values
    topic_term_dists = np.stack([mdl.get_topic_word_dist(k) for k in range(mdl.k)])
    doc_topic_dists = np.stack([doc.get_topic_dist() for doc in mdl.docs])
    doc_lengths = np.array([len(doc.words) for doc in mdl.docs])
    vocab = list(mdl.used_vocabs)
    term_frequency = mdl.used_vocab_freq

    # prepare the dataset
    prepared_data = pyLDAvis.prepare(
        topic_term_dists, 
        doc_topic_dists, 
        doc_lengths, 
        vocab, 
        term_frequency,
        start_index=0, 
        sort_topics=False 
    )

    # save result (the view/ directory must already exist)
    pyLDAvis.save_html(prepared_data, 'view/ldavis.html')
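
Before calling `pyLDAvis.prepare`, it can help to check that the five inputs agree with each other. The sketch below assumes the layout used in `get_ldavis`: a topic-term matrix of shape (topics, terms), a document-topic matrix of shape (documents, topics), and rows that each sum to 1. The helper `check_ldavis_inputs` is hypothetical and is not part of pyLDAvis or tomotopy; the arrays at the bottom are toy data.

```python
import numpy as np

def check_ldavis_inputs(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency):
    """Hypothetical helper: raise ValueError if the inputs are inconsistent."""
    n_topics, n_terms = topic_term_dists.shape
    n_docs = doc_topic_dists.shape[0]
    if doc_topic_dists.shape[1] != n_topics:
        raise ValueError("doc_topic_dists must have one column per topic")
    if len(doc_lengths) != n_docs:
        raise ValueError("doc_lengths must have one entry per document")
    if len(vocab) != n_terms or len(term_frequency) != n_terms:
        raise ValueError("vocab and term_frequency must have one entry per term")
    if not np.allclose(topic_term_dists.sum(axis=1), 1.0):
        raise ValueError("each row of topic_term_dists must sum to 1")
    if not np.allclose(doc_topic_dists.sum(axis=1), 1.0):
        raise ValueError("each row of doc_topic_dists must sum to 1")
    return n_topics, n_terms, n_docs

# toy data: 2 topics, 3 terms, 4 documents (assumed values)
rng = np.random.default_rng(0)
topic_term = rng.dirichlet(np.ones(3), size=2)
doc_topic = rng.dirichlet(np.ones(2), size=4)
shapes = check_ldavis_inputs(topic_term, doc_topic, [10, 12, 8, 9], ["a", "b", "c"], [5, 20, 14])
print(shapes)
```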

In [5]:
# data load
# The data consists of news articles about Handong University
# reported in major Korean daily newspapers from January 1995 to September 2024.
df = pd.read_excel("data/NewsResult_19950101-20240930.xlsx", engine="openpyxl")

# find the proper k value
proper_k = find_proper_k(df, 3, 10)

==== Coherence : k = 3 ====


Average: 0.741083973646164


Per Topic:[0.5999453455209732, 0.8651391208171845, 0.7581674546003342]




==== Coherence : k = 4 ====


Average: 0.7011181615293025


Per Topic:[0.6224485903978347, 0.766037181019783, 0.8679994404315948, 0.5479874342679978]




==== Coherence : k = 5 ====


Average: 0.7732469771802425


Per Topic:[0.7342068195343018, 0.8780243456363678, 0.7895594149827957, 0.8406326591968536, 0.6238116465508938]




==== Coherence : k = 6 ====


Average: 0.825429646174113


Per Topic:[0.7725077867507935, 0.7968347370624542, 0.761194908618927, 0.925168490409851, 0.8171466290950775, 0.8797253251075745]




==== Coherence : k = 7 ====


Average: 0.8127426283700127


Per Topic:[0.7746312499046326, 0.776884400844574, 0.9046697795391083, 0.7563041687011719, 0.8720169305801392, 0.8372242093086243, 0.7674676597118377]




==== Coherence : k = 8 ====


Average: 0.7869673725217581


Per Topic:[0.7796162962913513, 0.7850739479064941, 0.7353005141019822, 
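
The selection rule in `find_proper_k` is an argmax over the average coherence. A minimal sketch using only the averages visible in the output above, rounded to four decimals (the values for k = 9 and k = 10 are cut off in the output and are not included):

```python
# average c_v coherence per k, copied from the printed output and rounded
average_coherence = {
    3: 0.7411,
    4: 0.7011,
    5: 0.7732,
    6: 0.8254,
    7: 0.8127,
    8: 0.7870,
}

# pick the k with the highest average coherence
best_k = max(average_coherence, key=average_coherence.get)
print(best_k)  # 6
```

This agrees with the `k: 6` reported in the model summary.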

In [6]:
# Model setting with K
mdl = ldamodel(df, proper_k)

Num docs:8051, Num Vocabs:27099, Total Words:1657565


Iteration: 100%|███████████| 2000/2000 [00:24<00:00, 81.70it/s, LLPW: -8.391166]


In [7]:
# get summary
mdl.summary()

# save result
get_ldavis(mdl)

<Basic Info>
| LDAModel (current version: 0.12.7)
| 8051 docs, 1657565 words
| Total Vocabs: 128176, Used Vocabs: 27099
| Entropy of words: 8.65140
| Entropy of term-weighted words: 8.65140
| Removed Vocabs: 미국 대학 북한 교수 한동대 한국 대통령 정부 교육 중국
|
<Training Info>
| Iterations: 2000, Burn-in steps: 0
| Optimization Interval: 10
| Log-likelihood per word: -8.39061
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 5 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 10 (the number of top words to be removed)
| k: 6 (the number of topics between 1 ~ 32767)
| alpha: [0.1] (hyperparameter of Dirichlet distribution for document-topic, given as a single `float` in case of symmetric prior and as a list with length `k` of `float` in case of asymmetric prior.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 42 (random seed)
| trained in version 0.12.7
|
<Parameters>
| alpha (Dirichlet prior on the per-document topic dist

In [8]:
# show result
from IPython.display import HTML

HTML(filename="view/ldavis.html")

### The result of Topic Modeling 

- Topic #0: Topic related to the university entrance examination system.

- Topic #1: Topic related to Handong University's projects in collaboration with the local community.

- Topic #2: Topic related to professors in politics and government affairs.

- Topic #3: Topic related to North Korea and inter-Korean relations.

- Topic #4: Topic related to religious activities and the Christian identity of Handong University.

- Topic #5: Topic related to international relations and global affairs.