# Correlated Topic Model (CTM)

#### Author information

- **Name:** Jaeseong Choe

- **email address:** 21900759@handong.ac.kr

- **GitHub:** https://github.com/sorrychoe

- **Linkedin:** https://www.linkedin.com/in/jaeseong-choe-048639250/

- **Personal Webpage:** https://jaeseongchoe.vercel.app/

## Part 1. Brief background of methodology

### Overview
- The Correlated Topic Model (CTM) extends Latent Dirichlet Allocation (LDA) to allow for correlations between topics.

- It does so by drawing each document's topic proportions from a logistic normal distribution rather than from a Dirichlet distribution.

### Situation Before CTM

- LDA draws topic proportions from a Dirichlet distribution, which treats topics as nearly independent. This is unrealistic for the many corpora in which topics are correlated.
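
The limitation can be made concrete. As a standard property of the Dirichlet distribution (stated here for reference, it does not appear in the text above), if $\theta \sim \text{Dirichlet}(\alpha)$ with $\alpha_0 = \sum_k \alpha_k$, the covariance between two distinct topic proportions is:

$$ \text{Cov}(\theta_i, \theta_j) = -\frac{\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)} < 0, \quad i \neq j $$

The sign is always negative and the size is fixed by $\alpha$ alone, so a Dirichlet prior cannot express that two topics tend to appear together.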

### Why CTM Was Introduced

- CTM was introduced by Blei, D. M., & Lafferty, J. D. (2007) in the paper *"A Correlated Topic Model of Science."*

- It addresses the limitation above by replacing LDA's Dirichlet prior with a logistic normal distribution, whose covariance matrix models correlations among topics.

### Use Cases

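- CTM is useful when topics are expected to co-occur or to be inherently related. For example, Blei and Lafferty note that a document about genetics is more likely to also be about disease than about X-ray astronomy.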

## Part 2. Key concept of methodology

### Key Concept

- CTM uses a logistic normal distribution to model correlations between topics.

### Mathematical Formulation

In CTM, the topic proportions for document $d$, denoted as $\theta_d$, are generated using a logistic normal distribution:

$$ \eta_d \sim \mathcal{N}(\mu, \Sigma) $$
$$ \theta_d = \text{softmax}(\eta_d) $$

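Here:
- $\eta_d$ is a latent vector drawn from a multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
- $\text{softmax}$ maps $\eta_d$ onto the probability simplex, $\theta_{dk} = \exp(\eta_{dk}) / \sum_{j} \exp(\eta_{dj})$, so the entries of $\theta_d$ are positive and sum to 1.
- $\Sigma$ captures the correlations between topics: a positive off-diagonal entry $\Sigma_{ij}$ means that topics $i$ and $j$ tend to be prominent in the same documents.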
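
The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the tomotopy workflow used later. The values of `mu` and `sigma` are hypothetical and chosen so that topics 0 and 1 are positively correlated while topic 2 is independent of both.

```python
import numpy as np

def softmax(eta):
    """Map real-valued vectors to the probability simplex (row-wise)."""
    shifted = eta - eta.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sample_topic_proportions(mu, sigma, n_docs, rng):
    """Draw eta_d ~ N(mu, Sigma) for n_docs documents and return (eta, theta)."""
    eta = rng.multivariate_normal(mu, sigma, size=n_docs)
    return eta, softmax(eta)

# Hypothetical parameters for three topics
mu = np.zeros(3)
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
eta, theta = sample_topic_proportions(mu, sigma, n_docs=5000, rng=rng)

print(theta[:3].round(3))                        # each row is one document's topic proportions
print(np.corrcoef(eta, rowvar=False).round(2))   # empirical correlations of eta across documents
```

With enough samples, the empirical correlation matrix of `eta` should approach the correlations implied by `sigma`, which is the structure CTM learns from data.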

### Generative Process

- For each document $d$:
   - Draw the latent variable $\eta_d \sim \mathcal{N}(\mu, \Sigma)$.
   - Transform $\eta_d$ to $\theta_d = \text{softmax}(\eta_d)$.
   - For each word $w_{dn}$:
     - Draw a topic $z_{dn} \sim \text{Multinomial}(\theta_d)$.
     - Draw a word $w_{dn} \sim \text{Multinomial}(\phi_{z_{dn}})$, where $\phi_k$ is the word distribution of topic $k$.
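
The generative process above can be simulated directly. The following is a minimal sketch under stated assumptions: the sizes, `mu`, `sigma`, and the topic-word matrix `phi` are all hypothetical, and `phi` is drawn from a Dirichlet distribution only to obtain valid word distributions.

```python
import numpy as np

def generate_document(mu, sigma, phi, n_words, rng):
    """Generate one document (as word ids) following the CTM generative process."""
    eta = rng.multivariate_normal(mu, sigma)           # eta_d ~ N(mu, Sigma)
    theta = np.exp(eta - eta.max())
    theta /= theta.sum()                               # theta_d = softmax(eta_d)
    n_topics, vocab_size = phi.shape
    z = rng.choice(n_topics, size=n_words, p=theta)    # z_dn ~ Multinomial(theta_d)
    w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])  # w_dn ~ Multinomial(phi_{z_dn})
    return theta, z, w

rng = np.random.default_rng(1)
n_topics, vocab_size = 3, 8                            # hypothetical sizes
mu = np.zeros(n_topics)
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
# Hypothetical topic-word distributions: one row per topic, each row sums to 1
phi = rng.dirichlet(np.full(vocab_size, 0.5), size=n_topics)

theta, z, w = generate_document(mu, sigma, phi, n_words=20, rng=rng)
print("theta:", theta.round(3))
print("topics:", z)
print("words: ", w)
```

Training a CTM inverts this process: given only the observed words, it estimates $\mu$, $\Sigma$, and the topic-word distributions.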

### Key Assumptions

- The topics are correlated, which is captured by the covariance matrix $\Sigma$.

- The logistic normal distribution allows for modeling dependencies between topics.
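
The entries of $\Sigma$ are easier to read once they are rescaled to correlations. As a standard identity (added here for illustration), the correlation between topics $i$ and $j$ on the $\eta$ scale is:

$$ \rho_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii} \, \Sigma_{jj}}} $$

Roughly, a value of $\rho_{ij}$ near $1$ means the two topics tend to be prominent in the same documents, a value near $-1$ means one tends to appear where the other does not, and a value near $0$ means that knowing one says little about the other.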

### Strength

- CTM captures dependencies between topics, which LDA does not.

## Part 3. Example

### Before the Sample Code
- Topic modeling in Python is commonly done with the gensim library, but this example uses tomotopy.
- tomotopy is fast, and its implementation closely reproduces the mathematical formulation given in the paper.
- Although it does not yet have many users, it is emerging as an alternative to gensim, which is why it was chosen for this example.

In [1]:
# import libraries
import numpy as np # for data preprocessing
import pandas as pd # for loading Excel data
import pyBigKinds as pbk # for preprocessing BigKinds text data
import tomotopy as tp # for topic modeling
from pyvis.network import Network # for visualizing the CTM result

# suppress warning messages
import warnings
warnings.filterwarnings("ignore")

In [2]:
def ctmmodel(df: pd.DataFrame, k: int):
    """Build and train a CTM with k topics."""

    # parsed keywords; each element is passed to the model as one document
    words = pbk.keyword_parser(pbk.keyword_list(df))
    model = tp.CTModel(tw=tp.TermWeight.IDF, min_cf=5, rm_top=10, k=k, seed=42)

    # add the documents (the loop variable no longer shadows the parameter `k`)
    for doc_words in words:
        model.add_doc(words=doc_words)

    # train(0) prepares the model without running any sampling iterations
    model.train(0)

    # Since the corpus contains thousands of documents,
    # a smaller `num_beta_sample` value should not make the result inaccurate.
    model.num_beta_sample = 5
    print('Num docs:{}, Num Vocabs:{}, Total Words:{}'.format(
        len(model.docs), len(model.used_vocabs), model.num_words
    ))
    print('Removed Top words: ', *model.removed_top_words)

    # train the model
    model.train(2000, show_progress=True)

    return model

In [3]:
def find_proper_k(df: pd.DataFrame, start: int, end: int):
    """Find the k in [start, end] with the highest average coherence (hyperparameter tuning)."""

    words = pbk.keyword_parser(pbk.keyword_list(df))

    for i in range(start, end + 1):
        # model setting
        mdl = tp.CTModel(tw=tp.TermWeight.IDF, min_cf=5, rm_top=10, k=i, seed=42)

        for doc_words in words:
            mdl.add_doc(words=doc_words)

        # pre-train the model briefly to check the coherence score
        mdl.train(100)

        # get the coherence score
        coh = tp.coherence.Coherence(mdl, coherence='c_v')

        # average coherence over all topics
        average_coherence = coh.get_score()
        # initial value setup
        if i == start:
            proper_k = start
            tmp = average_coherence

        # get coherence per topic
        coherence_per_topic = [coh.get_score(topic_id=t) for t in range(mdl.k)]

        # print it out
        print('==== Coherence : k = {} ===='.format(i))
        print("\n")
        print('Average: {}'.format(average_coherence))
        print("\n")
        print('Per Topic:{}'.format(coherence_per_topic))
        print("\n")
        print("\n")

        # update k when the average coherence improves
        if tmp < average_coherence:
            proper_k = i
            tmp = average_coherence
    return proper_k
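
The selection rule inside `find_proper_k` is a running maximum over the average coherence scores. The sketch below isolates that rule so it can be checked without training any model. The function name `pick_best_k` and the scores are hypothetical and are not taken from the run shown in this notebook.

```python
def pick_best_k(scores):
    """Return the k with the highest score; the smallest k wins ties."""
    best_k, best_score = None, float("-inf")
    for k in sorted(scores):
        # strict comparison: a later k must beat the current best, not just match it
        if scores[k] > best_score:
            best_k, best_score = k, scores[k]
    return best_k

# Hypothetical average coherence per candidate k
hypothetical_scores = {3: 0.49, 4: 0.48, 5: 0.52, 6: 0.51}
print(pick_best_k(hypothetical_scores))  # 5
```

Because each candidate is trained for only 100 iterations, the scores are a rough guide for choosing `k` rather than a final measure of model quality.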

In [4]:
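def get_ctm_network(mdl):
    """Visualize the CTM topic correlations as a network."""

    g = Network(width=800, height=800, font_color="#333")

    # flatten all pairwise correlations and sort them in ascending order
    correl = mdl.get_correlations().reshape([-1])
    correl.sort()

    # threshold: skip the k largest entries (assumed to be the diagonal
    # self-correlations), then step down by a tenth of the k * (k - 1) off-diagonal entries
    top_tenth = mdl.k * (mdl.k - 1) // 10
    top_tenth = correl[-mdl.k - top_tenth]

    for k in range(mdl.k):
        label = "#{}".format(k)
        title = ' '.join(word for word, _ in mdl.get_topic_words(k, top_n=6))
        print('Topic', label, title)
        g.add_node(k, label=label, title=title, shape='ellipse')
        # as written, topic k is compared with topics 0 .. k-2
        for l, correlation in zip(range(k - 1), mdl.get_correlations(k)):
            # draw an edge only for correlations at or above the threshold
            if correlation < top_tenth:
                continue
            g.add_edge(k, l, value=float(correlation), title='{:.02}'.format(correlation))

    # layout, interactive controls, and HTML export
    g.barnes_hut(gravity=-1000, spring_length=20)
    g.show_buttons()
    g.show("view/topic_network.html", notebook=False)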
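
The thresholding idea in `get_ctm_network` (keep only the strongest correlations) can also be written against a plain correlation matrix. The sketch below is an alternative formulation, not the function above: it visits each unordered topic pair exactly once via the upper triangle. The function name `strongest_pairs` and the matrix `corr` are hypothetical. In practice `corr` could be the result of `mdl.get_correlations()`, assuming it is a square NumPy array as the `reshape` call above suggests.

```python
import numpy as np

def strongest_pairs(corr, fraction=0.1):
    """Return (i, j, corr_ij) for the most correlated topic pairs, strongest first."""
    k = corr.shape[0]
    rows, cols = np.triu_indices(k, k=1)        # every unordered pair i < j exactly once
    values = corr[rows, cols]
    n_keep = max(1, int(len(values) * fraction))
    order = np.argsort(values)[::-1][:n_keep]   # positions of the largest correlations
    return [(int(rows[o]), int(cols[o]), float(values[o])) for o in order]

# Hypothetical 4-topic correlation matrix
corr = np.array([[ 1.0, 0.1,  0.6, -0.2],
                 [ 0.1, 1.0,  0.0,  0.3],
                 [ 0.6, 0.0,  1.0, -0.4],
                 [-0.2, 0.3, -0.4,  1.0]])

print(strongest_pairs(corr, fraction=0.2))  # [(0, 2, 0.6)]
```

Each returned tuple corresponds to one edge that would be drawn in the network.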

In [5]:
# data load
# The data consists of news articles about Handong University
# reported in major Korean daily newspapers from January 1995 to September 2024.
df = pd.read_excel("data/NewsResult_19950101-20240930.xlsx", engine="openpyxl")

# proper k value
proper_k = find_proper_k(df, 3, 10)

==== Coherence : k = 3 ====


Average: 0.4946146271501979


Per Topic:[0.40830569695681335, 0.6744196325540542, 0.4011185519397259]




==== Coherence : k = 4 ====


Average: 0.48340339050628245


Per Topic:[0.4383124604821205, 0.3884956747293472, 0.6228600367903709, 0.4839453900232911]




==== Coherence : k = 5 ====


Average: 0.515775553882122


Per Topic:[0.484657421708107, 0.5774677455425262, 0.5402069807052612, 0.44625533521175387, 0.5302902862429619]




==== Coherence : k = 6 ====


Average: 0.5084231146300833


Per Topic:[0.562285591289401, 0.6126032531261444, 0.4598937191069126, 0.5458455741405487, 0.4348564878106117, 0.43505406230688093]




==== Coherence : k = 7 ====


Average: 0.595595714636147


Per Topic:[0.8640081822872162, 0.6303434371948242, 0.6054688286036253, 0.6235938429832458, 0.5479496046900749, 0.4956549558788538, 0.4021511508151889]




==== Coherence : k = 8 ====


Average: 0.611864210292697


Per Topic:[0.7569834172725678, 0.6845397315919399, 0.5804375298321

In [6]:
# Model setting with K
mdl = ctmmodel(df, proper_k)

Num docs:8051, Num Vocabs:27099, Total Words:1657565
Removed Top words:  미국 대학 북한 교수 한동대 한국 대통령 정부 교육 중국


Iteration: 100%|███████████| 2000/2000 [02:03<00:00, 16.19it/s, LLPW: -7.269598]


In [7]:
# get summary
mdl.summary()

# build the topic correlation network and save it as HTML
get_ctm_network(mdl)

<Basic Info>
| CTModel (current version: 0.12.7)
| 8051 docs, 1657565 words
| Total Vocabs: 128176, Used Vocabs: 27099
| Entropy of words: 8.65140
| Entropy of term-weighted words: 9.28121
| Removed Vocabs: 미국 대학 북한 교수 한동대 한국 대통령 정부 교육 중국
|
<Training Info>
| Iterations: 2000, Burn-in steps: 0
| Optimization Interval: 2
| Log-likelihood per word: -7.27183
|
<Initial Parameters>
| tw: TermWeight.IDF
| min_cf: 5 (minimum collection frequency of words)
| min_df: 0 (minimum document frequency of words)
| rm_top: 10 (the number of top words to be removed)
| k: 10 (the number of topics between 1 ~ 32767)
| smoothing_alpha: [0.1] (small smoothing value for preventing topic counts to be zero, given as a single `float` in case of symmetric and as a list with length `k` of `float` in case of asymmetric.)
| eta: 0.01 (hyperparameter of Dirichlet distribution for topic-word)
| seed: 42 (random seed)
| trained in version 0.12.7
|
<Parameters>
| prior_mean (Prior mean of Logit-normal for the per-docu

### Topic Interpretations
#### Topic #0 (163,308 tokens)
##### Theme: Constitutional Discussions and International Agreements
- Keywords such as "헌법", "논의", "지소미아", and "중요" point to constitutional debates and the significance of international agreements such as GSOMIA (General Security of Military Information Agreement), with a focus on the legal and diplomatic discussions around these issues.

#### Topic #1 (157,991 tokens)
##### Theme: Missile Strategy and US-Korea Relations
- Keywords like "미사일", "전략", "한미", and "행정부" point toward discussions about missile defense strategy, and the evaluation of military strategies in the context of the US-Korea alliance, with a focus on government and administrative assessments.

#### Topic #2 (145,356 tokens)
##### Theme: College Admissions and Sanctions
- Keywords "수시2", "제재", "인권", "미군", and "정상회담" indicate a mix of topics, including college admissions policies, international sanctions, human rights issues, and possibly military involvement in global summits.

#### Topic #3 (170,480 tokens)
##### Theme: Diplomatic Negotiations with North Korea
- Keywords "트럼프", "협상", "회담", "외교", and "남북" point to diplomacy and negotiations among North Korea, South Korea, and the US, particularly during the Trump administration.

#### Topic #4 (151,028 tokens)
##### Theme: College Admissions and Natural Disasters
- Keywords "전형", "지진", "모집", "서울대", and "선발" relate to university admissions processes, perhaps in the context of disruptions caused by natural disasters such as earthquakes.

#### Topic #5 (166,345 tokens)
##### Theme: The Handong Spirit Reaching the World
- Keywords such as "생각", "기도", "변화", "해결", and "의미" refer to the spiritual values that Handong University students bring to the world.

#### Topic #6 (164,771 tokens)
##### Theme: Student Support and Business Ventures
- Keywords "지원", "학생", "서울", "사업", and "산업" indicate discussions on student support systems, possibly in connection with business ventures and industrial development, especially in metropolitan areas like Seoul.

#### Topic #7 (180,101 tokens)
##### Theme: Handong University as a Christian university
- Keywords "교회", "목사", "하나님", "총장", and "대표" indicate a topic that gathers articles reflecting Handong University's Christian identity.

#### Topic #8 (173,648 tokens)
##### Theme: Promoting Handong University
- Keywords such as "학교", "포항", "기독교", "학생들", and "기업" are words often used to present Handong University to the outside world, so this topic appears to contain articles promoting the university.

#### Topic #9 (184,537 tokens)
##### Theme: Globalization of Handong University
- Keywords like "지역", "세계", "참여", "진행", and "국제" suggest a topic that mainly contains articles on globalization, a direction pursued by Handong University.

In [8]:
from IPython.display import HTML

HTML(filename="view/topic_network.html")

### Result Interpretation

- Unlike traditional topic models such as LDA, CTM estimates the correlations between topics as part of the model.

- In this result, Topics 0 and 3 show a positive correlation, likely because both contain political content.

- Furthermore, Topics 0 and 2 are also correlated, likely because Topic 2 includes words related to summits and the U.S. military.

- There was no noticeable correlation between the other topics.