# Correlated Topic Model (CTM)

#### Author information

- **Name:** Jaeseong Choe

- **email address:** 21900759@handong.ac.kr

- **GitHub:** https://github.com/sorrychoe

- **LinkedIn:** https://www.linkedin.com/in/jaeseong-choe-048639250/

- **Personal Webpage:** https://jaeseongchoe.vercel.app/

## Part 1. Brief background of methodology

### Overview
- Correlated Topic Model (CTM) extends LDA to allow for correlations between topics. 

- It models the correlations among topics by using a logistic normal distribution for the topic proportions.

### Situation Before CTM

- LDA assumes that topics are independent, which may not be realistic in many cases where topics are correlated.

### Why CTM Was Introduced

- CTM was introduced in the paper *"A Correlated Topic Model of Science"* by Blei, D. M., & Lafferty, J. D. (2007).

- CTM extends LDA by modeling correlations among topics using a logistic normal distribution.

### Use Cases

- CTM is beneficial in applications where topics are expected to co-occur or have inherent relationships.

## Part 2. Key concept of methodology

### Key Concept

- CTM uses a logistic normal distribution to model correlations between topics.

### Generative Process
CTM uses a logistic normal distribution to capture topic correlations, improving upon LDA's independence assumption. The generative process is as follows:

1. **Topic Proportions**:
   - For each document, draw a vector $\eta$ from a multivariate normal distribution (combined with the transformation that follows, this yields logistic normal topic proportions):
   
   $$
   \eta \sim \mathcal{N}(\mu, \Sigma)
   $$
   
   - Map the normal vector $\eta$ to the topic proportions $\theta$ with the softmax transformation:
   
   $$
   \theta_i = \frac{\exp(\eta_i)}{\sum_{j=1}^K \exp(\eta_j)}
   $$

2. **Document Generation**:
   - For each word in a document:
     - Draw a topic assignment $z_n$ from the multinomial distribution over topics:
     
     $$
     z_n \sim \text{Multinomial}(\theta)
     $$
     
     - Draw a word $w_n$ from the topic-specific word distribution:
     
     $$
     w_n \sim \text{Multinomial}(\beta_{z_n})
     $$

This model allows topic proportions to be correlated through the covariance matrix $\Sigma$, unlike LDA, whose Dirichlet prior treats the components of the topic proportions as nearly independent.

![CTM_Graphic](./img/CTM_Graphic.png)
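
The generative process above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not the implementation used later in this document: the vocabulary `vocab`, the topic-word matrix `beta`, the mean `mu`, and the lower-triangular factor `chol` (chosen so that $\Sigma = LL^\top$) are all made-up toy values.

```python
import math
import random

random.seed(0)

# Toy setup (all values are illustrative assumptions).
vocab = ["north", "korea", "exam", "student", "church"]
K = 3                 # number of topics
mu = [0.0, 0.0, 0.0]  # mean of eta
# Lower-triangular factor L with Sigma = L L^T.
# Topics 0 and 1 are given a positive correlation of 0.9.
chol = [
    [1.0, 0.0, 0.0],
    [0.9, math.sqrt(1 - 0.9 ** 2), 0.0],
    [0.0, 0.0, 1.0],
]
# beta[k] is the word distribution of topic k (each row sums to 1).
beta = [
    [0.45, 0.45, 0.04, 0.03, 0.03],
    [0.30, 0.30, 0.10, 0.10, 0.20],
    [0.02, 0.03, 0.45, 0.45, 0.05],
]


def softmax(eta):
    """Map eta onto the simplex (subtracting the max for numerical stability)."""
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def generate_document(n_words):
    """Sample one document following the CTM generative process."""
    # Step 1: eta ~ N(mu, Sigma), drawn as mu + L @ standard normal noise.
    noise = [random.gauss(0.0, 1.0) for _ in range(K)]
    eta = [mu[i] + sum(chol[i][j] * noise[j] for j in range(K)) for i in range(K)]
    theta = softmax(eta)
    # Step 2: for each word, draw a topic z_n, then a word w_n from beta[z_n].
    words = []
    for _ in range(n_words):
        z_n = random.choices(range(K), weights=theta)[0]
        w_n = random.choices(vocab, weights=beta[z_n])[0]
        words.append(w_n)
    return theta, words


theta, words = generate_document(n_words=10)
print("theta:", [round(t, 3) for t in theta])
print("words:", words)
```

Because topics 0 and 1 share most of their noise, documents in this toy setup that put weight on one of them tend to put weight on the other as well.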

### Mathematical Representation
- **Logistic Normal Distribution**: Topic proportions are drawn from a logistic normal distribution, which is a transformation of a multivariate normal:
  
  $$
  \eta \sim \mathcal{N}(\mu, \Sigma)
  $$
  
  The logistic transformation maps the natural parameters $\eta$ to the simplex, producing the topic proportions $\theta$.

- **Natural Parameterization**: The natural parameterization of a K-dimensional multinomial distribution is:
  
  $$
  p(z | \eta) = \exp(\eta^\top z - a(\eta))
  $$
  
  where $a(\eta) = \log \left( \sum_{i=1}^K \exp(\eta_i) \right)$ is the log normalizer.
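
As a quick numerical check of this parameterization, the sketch below evaluates $\exp(\eta^\top z - a(\eta))$ for each one-hot $z$ and compares the result with the softmax of $\eta$. The vector `eta` is an arbitrary example value, not something taken from the model trained later.

```python
import math

eta = [0.5, -1.0, 2.0]  # arbitrary example natural parameters


def log_normalizer(eta):
    """a(eta) = log(sum_i exp(eta_i)), computed in a numerically stable way."""
    m = max(eta)
    return m + math.log(sum(math.exp(e - m) for e in eta))


a = log_normalizer(eta)

# For a one-hot z that selects topic i, eta^T z is simply eta_i.
probs = [math.exp(eta_i - a) for eta_i in eta]

# Direct softmax for comparison.
denom = sum(math.exp(e) for e in eta)
softmax = [math.exp(e) / denom for e in eta]

print([round(p, 4) for p in probs])
```

The two lists agree, which is why the logistic transformation of $\eta$ gives exactly the multinomial parameters $\theta$.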

### Inference
Posterior inference in CTM is challenging due to the non-conjugacy of the logistic normal and multinomial distributions. CTM uses **mean-field variational inference** to approximate the posterior.

- **Variational Inference**: The variational distribution is factorized for the latent variables $\eta$ and $z$:
  
  $$
  q(\eta, z) = \prod_{i=1}^K q(\eta_i | \lambda_i, \nu_i^2) \prod_{n=1}^N q(z_n | \phi_n)
  $$

  Here, $\lambda_i$ and $\nu_i^2$ are the variational parameters for the Gaussian distribution of $\eta$, and $\phi_n$ is the variational parameter for the multinomial distribution of $z_n$.

- **Optimization**: The variational parameters are optimized using **coordinate ascent**:
  1. Update $\phi_n$, the variational parameter for $z_n$. For each topic $i$, with $w_n$ denoting the word observed at position $n$:
  
     $$
     \phi_{n,i} \propto \exp(\lambda_i + \log \beta_{i, w_n})
     $$
     
  2. Update $\lambda$ and $\nu^2$, the variational parameters for $\eta$, using gradient ascent.
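
A minimal sketch of the $\phi$ update for a single document is given here, using toy values for $\lambda$ and $\beta$ (both are assumptions for illustration). Following Blei & Lafferty (2007), each topic's probability of the observed word, $\beta_{i, w_n}$, is weighted by $\exp(\lambda_i)$ and the result is normalized over topics.

```python
import math

# Toy values (assumptions): K = 3 topics, V = 4 vocabulary entries.
lam = [0.2, -0.5, 1.0]         # variational means lambda_i of eta
beta = [
    [0.70, 0.10, 0.10, 0.10],  # beta[i][v] = p(word v | topic i)
    [0.10, 0.70, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
doc = [0, 1, 1, 3]             # word ids w_n of one document


def update_phi(lam, beta, doc):
    """Return phi[n][i] proportional to exp(lambda_i) * beta[i][w_n]."""
    phi = []
    for w_n in doc:
        # Work in log space, then normalize over topics.
        log_scores = [lam[i] + math.log(beta[i][w_n]) for i in range(len(lam))]
        m = max(log_scores)
        scores = [math.exp(s - m) for s in log_scores]
        total = sum(scores)
        phi.append([s / total for s in scores])
    return phi


phi = update_phi(lam, beta, doc)
for n, row in enumerate(phi):
    print(n, [round(p, 3) for p in row])
```

In the full algorithm this step alternates with the updates of $\lambda$ and $\nu^2$ until the variational objective stops improving; the sketch covers only the $\phi$ step.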

### Strength

- CTM captures dependencies between topics, which LDA does not.

## Part 3. Example

### Before the Sample Code
- The gensim library is the usual choice for topic modeling, but this example uses tomotopy.
- tomotopy trains quickly, and its implementation closely follows the formulas given in the paper.
- Although it does not have many users yet, it is emerging as an alternative to gensim, which is why it was chosen for this example.

### Precautions

- If you re-run the code, the results may differ slightly.

- Because the seed and the training settings are fixed, the number and content of the topics should not change significantly, but the numbering of the topics (which index is assigned to which topic) may change.

In [1]:
# import libraries
import numpy as np # for data preprocessing
import pandas as pd # for loading Excel data
import pyBigKinds as pbk # for preprocessing BigKinds text data
import tomotopy as tp # for topic modeling
from pyvis.network import Network # for visualizing the CTM result

# suppress warning messages
import warnings
warnings.filterwarnings("ignore")

In [2]:
def ctmmodel(df: pd.DataFrame, k: int):
    """Build and train a CTM with k topics."""

    words = pbk.keyword_parser(pbk.keyword_list(df))
    model = tp.CTModel(min_cf=5, rm_top=10, k=k, seed=42)

    # add each document's word list to the model
    # (the loop variable no longer shadows the parameter `k`)
    for doc_words in words:
        model.add_doc(words=doc_words)

    # train(0) only prepares the model so that the corpus statistics below are available
    model.train(0)

    # Since the corpus contains several thousand documents,
    # a smaller `num_beta_sample` should not noticeably hurt accuracy.
    model.num_beta_sample = 5
    print('Num docs:{}, Num Vocabs:{}, Total Words:{}'.format(
        len(model.docs), len(model.used_vocabs), model.num_words
    ))
    print('Removed Top words: ', *model.removed_top_words)

    # train the model
    model.train(2000, show_progress=True)

    return model

In [3]:
def find_proper_k(df: pd.DataFrame, start: int, end: int):
    """Return the k in [start, end] with the highest average c_v coherence."""

    words = pbk.keyword_parser(pbk.keyword_list(df))

    proper_k = start
    best_coherence = None

    for i in range(start, end + 1):
        # model setting
        # note: this search uses IDF term weighting, while ctmmodel() uses the default weighting
        mdl = tp.CTModel(tw=tp.TermWeight.IDF, min_cf=5, rm_top=10, k=i, seed=42)

        for doc_words in words:
            mdl.add_doc(words=doc_words)

        # briefly pre-train the model to check the coherence score
        mdl.train(100)

        # get the coherence score
        coh = tp.coherence.Coherence(mdl, coherence='c_v')

        # average coherence over all topics
        average_coherence = coh.get_score()

        # coherence per topic
        coherence_per_topic = [coh.get_score(topic_id=t) for t in range(mdl.k)]

        # print it out
        print('==== Coherence : k = {} ===='.format(i))
        print("\n")
        print('Average: {}'.format(average_coherence))
        print("\n")
        print('Per Topic:{}'.format(coherence_per_topic))
        print("\n")
        print("\n")

        # keep the k with the highest average coherence (ties keep the smaller k)
        if best_coherence is None or average_coherence > best_coherence:
            proper_k = i
            best_coherence = average_coherence
    return proper_k

In [4]:
def get_ctm_network(mdl):
    """Visualize the CTM topic correlations as a network."""

    g = Network(width=800, height=800, font_color="#333")

    # threshold: keep only the top 10% of the off-diagonal correlations
    correl = mdl.get_correlations().reshape([-1])
    correl.sort()
    n_top = mdl.k * (mdl.k - 1) // 10
    threshold = correl[-mdl.k - n_top]  # step past the k diagonal entries

    for k in range(mdl.k):
        label = "#{}".format(k)
        title = ' '.join(word for word, _ in mdl.get_topic_words(k, top_n=6))
        print('Topic', label, title)
        g.add_node(k, label=label, title=title, shape='ellipse')
        # compare topic k with every earlier topic; range(k) rather than range(k - 1),
        # so that the adjacent pair (k, k - 1) is not skipped
        for other, correlation in zip(range(k), mdl.get_correlations(k)):
            if correlation < threshold:
                continue
            g.add_edge(k, other, value=float(correlation), title='{:.2f}'.format(correlation))

    g.barnes_hut(gravity=-1000, spring_length=20)
    g.show_buttons()
    # the `view/` directory must already exist
    g.show("view/topic_network.html", notebook=False)
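
To make the edge threshold used in `get_ctm_network` easier to follow, here is a plain-Python sketch on a hypothetical 5-topic correlation matrix (the numbers are invented for illustration). After flattening and sorting, the matrix ends with the `k` diagonal entries, which are all 1.0, so indexing at `-k - n_top` steps past them and lands on the smallest of the top `k * (k - 1) // 10` off-diagonal values.

```python
# Hypothetical symmetric correlation matrix for k = 5 topics (invented values).
corr = [
    [1.0, 0.1, 0.6, -0.2, 0.0],
    [0.1, 1.0, 0.3, 0.2, -0.1],
    [0.6, 0.3, 1.0, 0.4, 0.05],
    [-0.2, 0.2, 0.4, 1.0, -0.3],
    [0.0, -0.1, 0.05, -0.3, 1.0],
]
k = len(corr)

# Flatten and sort, as the notebook function does with the model's matrix.
flat = sorted(value for row in corr for value in row)
n_top = k * (k - 1) // 10     # 10% of the k * (k - 1) off-diagonal entries
threshold = flat[-k - n_top]  # step past the k diagonal entries (all 1.0)

# Keep each unordered topic pair whose correlation reaches the threshold.
edges = [
    (i, j, corr[i][j])
    for i in range(k)
    for j in range(i)
    if corr[i][j] >= threshold
]
print("threshold:", threshold)
print("edges:", edges)
```

Each pair appears twice in the symmetric matrix, so with `k = 5` the two selected entries correspond to a single edge, the most strongly correlated pair of topics.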

In [5]:
# data load
# The data consists of news articles about Handong University
# published in major Korean daily newspapers from January 1995 to September 2024.
df = pd.read_excel("data/NewsResult_19950101-20240930.xlsx", engine="openpyxl")

# search k = 3..10 for the value with the highest average coherence
proper_k = find_proper_k(df, 3, 10)

==== Coherence : k = 3 ====


Average: 0.5469152451958507


Per Topic:[0.5734525308012962, 0.5608294982928783, 0.5064637064933777]




==== Coherence : k = 4 ====


Average: 0.443054432189092


Per Topic:[0.514208522439003, 0.47966777831315993, 0.4169865742325783, 0.36135485377162696]




==== Coherence : k = 5 ====


Average: 0.5091650255769491


Per Topic:[0.4947906039655209, 0.5389226507395506, 0.5601890429854393, 0.4538996122777462, 0.4980232179164886]




==== Coherence : k = 6 ====


Average: 0.5469936619823178


Per Topic:[0.6473274737596512, 0.4905784908682108, 0.6215087890625, 0.6285855807363987, 0.4551302820444107, 0.4388313554227352]




==== Coherence : k = 7 ====


Average: 0.5035159835991051


Per Topic:[0.4621864646673203, 0.5850352458655834, 0.49095904380083083, 0.397606080584228, 0.5844570457935333, 0.50725132599473, 0.4971166784875095]




==== Coherence : k = 8 ====


Average: 0.6506026376504451


Per Topic:[0.9076479494571685, 0.7992440696805716, 0.6307563237845898,

In [6]:
# Model setting with K
mdl = ctmmodel(df, proper_k)

Num docs:8051, Num Vocabs:27099, Total Words:1657565
Removed Top words:  미국 대학 북한 교수 한동대 한국 대통령 정부 교육 중국


Iteration: 100%|███████████| 2000/2000 [01:55<00:00, 17.34it/s, LLPW: -6.978222]


In [7]:
# get summary
mdl.summary()

# build and save the topic correlation network
get_ctm_network(mdl)

<Basic Info>
| CTModel (current version: 0.12.7)
| 8051 docs, 1657565 words
| Total Vocabs: 128176, Used Vocabs: 27099
| Entropy of words: 8.65140
| Entropy of term-weighted words: 8.65140
| Removed Vocabs: 미국 대학 북한 교수 한동대 한국 대통령 정부 교육 중국
|
<Training Info>
| Iterations: 2000, Burn-in steps: 0
| Optimization Interval: 2
| Log-likelihood per word: -6.98059
|
<Initial Parameters>
| tw: TermWeight.ONE
| min_cf: 5 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 10 (the number of top words to be removed)
| k: 10 (the number of topics between 1 ~ 32767)
| smoothing_alpha: [0.1] (small smoothing value for preventing topic counts to be zero, given as a single `float` in case of symmetric and as a list with length `k` of `float` in case of asymmetric.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 42 (random seed)
| trained in version 0.12.7
|
<Parameters>
| prior_mean (Prior mean of Logit-normal for the per-docu

### The result of Topic Modeling 

- Topic #0: Topic related to North Korea and Inter-Korean Relations

- Topic #1: Topic related to International Relations and Global Affairs

- Topic #2: Topic related to the university entrance examination system

- Topic #3: Topic related to the Professors in Politics and Government Affairs

- Topic #4: Topic related to the Promotion of Handong University

- Topic #5: Topic related to Religious Activities and Christian Identity of Handong University

- Topic #6: Topic related to the desire of members of Handong University for unification

- Topic #7: Topic related to the social activities of students and professors of Handong University

- Topic #8: Topic related to Handong University's business collaborating with the local community

- Topic #9: Topic related to Glocal University 30 

### Why are the results of LDA and CTM different?

- LDA draws both the document-topic and the topic-word distributions from Dirichlet distributions.

- CTM instead draws the document-topic distribution from a logistic normal distribution (a multivariate normal passed through the softmax), whose covariance matrix captures topic correlations.

- Unlike LDA, CTM allows for the modeling of dependencies between topics, meaning that the appearance of one topic might increase the likelihood of other correlated topics appearing within the same document.
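
The difference can be seen in a small simulation. The sketch below, in plain Python with made-up parameters, draws topic proportions from a symmetric Dirichlet and from a logistic normal whose first two components are strongly positively correlated, then compares the sample correlation between $\theta_0$ and $\theta_1$. Under a Dirichlet the covariance between any two proportions is always negative, whereas the logistic normal can produce a positive one.

```python
import math
import random

random.seed(0)
N = 20000  # number of simulated documents


def sample_dirichlet(alpha):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


def sample_logistic_normal(mu, chol):
    """Draw eta = mu + L @ noise (with Sigma = L L^T), then apply the softmax."""
    k = len(mu)
    noise = [random.gauss(0.0, 1.0) for _ in range(k)]
    eta = [mu[i] + sum(chol[i][j] * noise[j] for j in range(k)) for i in range(k)]
    m = max(eta)
    exps = [math.exp(e - m) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Assumed toy parameters: corr(eta_0, eta_1) = 0.95, eta_2 independent of both.
chol = [
    [1.0, 0.0, 0.0],
    [0.95, math.sqrt(1 - 0.95 ** 2), 0.0],
    [0.0, 0.0, 1.0],
]
dir_draws = [sample_dirichlet([1.0, 1.0, 1.0]) for _ in range(N)]
ln_draws = [sample_logistic_normal([0.0, 0.0, 0.0], chol) for _ in range(N)]

dirichlet_corr = pearson([d[0] for d in dir_draws], [d[1] for d in dir_draws])
logistic_corr = pearson([d[0] for d in ln_draws], [d[1] for d in ln_draws])
print("Dirichlet corr(theta_0, theta_1):      ", round(dirichlet_corr, 2))
print("Logistic normal corr(theta_0, theta_1):", round(logistic_corr, 2))
```

For the symmetric three-component Dirichlet used here the theoretical correlation is $-0.5$, and no choice of Dirichlet parameters makes it positive. That is the restriction the logistic normal prior removes.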

In [8]:
from IPython.display import HTML

HTML(filename="view/topic_network.html")

### Result interpretation

- Unlike LDA, CTM estimates the correlations between topics as part of the model.

- In this result, Topics 1 and 3 show a strong positive correlation, because both contain international relations content.

- Furthermore, Topics 2 and 4 also show a strong positive correlation, because both include words related to university entrance.

- Topics 2 and 9, and Topics 1 and 6, are also positively correlated, because some of the words in these topics tend to be used together.