# Topics over Time (ToT)

#### Author information

- **Name:** Jaeseong Choe

- **Email address:** 21900759@handong.ac.kr

- **GitHub:** https://github.com/sorrychoe

- **LinkedIn:** https://www.linkedin.com/in/jaeseong-choe-048639250/

- **Personal Webpage:** https://jaeseongchoe.vercel.app/

## Part 1. Brief background of methodology

### Overview

- **BERTopic is a topic modeling technique that leverages transformer-based embeddings and a class-based TF-IDF (c-TF-IDF) approach** to generate dense clusters of documents, resulting in interpretable topics with meaningful descriptions.

### Situation Before ToT

- Traditional topic modeling methods like Latent Dirichlet Allocation (LDA) often struggled with capturing the semantic nuances of documents, especially when dealing with short texts or documents with overlapping topics.

### Why ToT Was Introduced

- BERTopic was introduced in the paper "BERTopic: Neural topic modeling with a class-based TF-IDF procedure" by Maarten Grootendorst (2022).

- It addresses the limitations of traditional models by incorporating transformer-based embeddings, which capture contextual information, and by using c-TF-IDF to produce more coherent topic representations.

### Use Cases

- BERTopic is applicable in various domains, including analyzing customer reviews, extracting themes from research articles, monitoring social media trends, and exploring historical document collections.

## Part 2. Key concept of methodology

### Key Concept

- BERTopic combines transformer-based embeddings with clustering algorithms and a class-based TF-IDF approach to identify and represent topics within a corpus. 
  

### Generative Process

The methodology involves four main steps:

**1. Document Embedding**

- Each document is converted into a vector representation using a pre-trained transformer model, such as BERT. 

**2. Dimensionality Reduction**

- The high-dimensional embeddings are reduced to a lower-dimensional space using techniques like UMAP to preserve semantic relationships. 

**3. Clustering**

- The reduced embeddings are clustered using algorithms like HDBSCAN to group semantically similar documents. 

**4. Topic Representation**

- For each cluster, a c-TF-IDF approach is applied to extract representative words, forming the topic description.

![ToT_Graphic](./img/ToT_Graphic.png)

### Mathematical Representation

While BERTopic does not rely on a generative probabilistic model like LDA, its core components can be mathematically described as follows:

1. **Document Embedding**

- Each document $d$ is transformed into an embedding vector $e_d$ using a transformer model:

$$ e_d = \text{TransformerModel}(d) $$


2. **Dimensionality Reduction**

- The embedding vector $e_d$ is projected into a lower-dimensional space $u_d$ using UMAP:

$$ u_d = \text{UMAP}(e_d) $$

3. **Clustering**

- The reduced embedding vectors $\{u_d\}$ are clustered into $K$ clusters using HDBSCAN:

$$ \text{Cluster} = \text{HDBSCAN}(\{u_d\}) $$

4. **Class-based TF-IDF (c-TF-IDF)**

- For each cluster $k$, concatenate all of its documents into a single class document $D_k$. Let $tf_{t,k}$ be the frequency of term $t$ in $D_k$, $tf_t$ the frequency of $t$ across all clusters, and $A$ the average number of words per cluster. The c-TF-IDF score for term $t$ in cluster $k$, as defined in the BERTopic paper, is:

$$ W_{t,k} = tf_{t,k} \times \log\left(1 + \frac{A}{tf_t}\right) $$
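
To make the weighting concrete, the following is a minimal sketch (not BERTopic's own code) that scores terms with $tf_{t,k} \cdot \log(1 + A / tf_t)$, the form given in the BERTopic paper, where $tf_{t,k}$ is the count of term $t$ in cluster $k$, $tf_t$ is its count over all clusters, and $A$ is the average number of words per cluster. The function name `c_tf_idf` and the toy clusters are assumptions made for illustration, and the library's `ClassTfidfTransformer` may differ in details such as normalization.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score terms per cluster with tf_{t,k} * log(1 + A / tf_t).

    class_docs maps a cluster id to the token list obtained by
    concatenating all documents assigned to that cluster.
    """
    # Term frequency of each term inside each cluster
    tf = {k: Counter(tokens) for k, tokens in class_docs.items()}

    # Frequency of each term across all clusters
    total_tf = Counter()
    for counts in tf.values():
        total_tf.update(counts)

    # A: average number of words per cluster
    avg_words = sum(len(tokens) for tokens in class_docs.values()) / len(class_docs)

    return {
        k: {t: f * math.log(1 + avg_words / total_tf[t]) for t, f in counts.items()}
        for k, counts in tf.items()
    }


# Toy clusters (hypothetical data): "exam" appears in both clusters
class_docs = {
    0: ["admission", "exam", "exam"],
    1: ["church", "exam"],
}
scores = c_tf_idf(class_docs)
for cluster, terms in scores.items():
    ranked = sorted(terms, key=terms.get, reverse=True)
    print(cluster, ranked)
```

In the toy data, "exam" occurs in both clusters, so it is weighted below the cluster-specific terms "admission" and "church", even though it is the most frequent term in cluster 0.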

### Strength

- BERTopic effectively captures semantic relationships in documents through transformer embeddings and provides coherent topic representations using c-TF-IDF. Its modular design allows for flexibility in choosing embedding models, dimensionality reduction techniques, and clustering algorithms, making it adaptable to various datasets and applications.

## Part 3. Example

### Precautions

- Re-running the code may give slightly different results, because the UMAP step is stochastic when no random seed is fixed.

- The overall number and content of the topics should not change much between runs, but the numbers assigned to the topics can change.

In [1]:
# Import libraries
import pandas as pd  # load the Excel data
import pyBigKinds as pbk  # preprocess the news data
from sklearn.feature_extraction.text import CountVectorizer  # vectorize the text data
from konlpy.tag import Mecab  # tokenize Korean text
from bertopic.vectorizers import ClassTfidfTransformer  # compute c-TF-IDF values
from bertopic import BERTopic  # the BERTopic model
from hdbscan import HDBSCAN  # clustering model used to tune BERTopic
from umap import UMAP  # dimensionality reduction model used to tune BERTopic

# Suppress warning messages
import warnings
warnings.filterwarnings("ignore")

In [2]:
def list_to_str(words: list) -> list:
    """Join each document's token list into one space-separated string (in place)."""
    for i, tokens in enumerate(words):
        words[i] = " ".join(tokens)
    return words

In [3]:
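class CustomTokenizer:
    """Korean tokenizer that wraps a KoNLPy tagger for use with CountVectorizer."""
    def __init__(self, tagger):
        self.tagger = tagger
    def __call__(self, sent):
        # Cap the input length so that very long documents stay manageable
        sent = sent[:1000000]
        # Split the text into morphemes
        word_tokens = self.tagger.morphs(sent)
        # Drop single-character tokens
        result = [word for word in word_tokens if len(word) > 1]
        return result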
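
The tokenizer only needs an object with a `morphs` method, so its filtering can be tried without installing Mecab. The sketch below repeats the class and swaps in a hypothetical `WhitespaceTagger` (an assumption for illustration, not part of KoNLPy) that splits on whitespace instead of morphemes.

```python
class CustomTokenizer:
    """Same tokenizer as in the cell above, repeated so the sketch is self-contained."""
    def __init__(self, tagger):
        self.tagger = tagger

    def __call__(self, sent):
        sent = sent[:1000000]
        word_tokens = self.tagger.morphs(sent)
        return [word for word in word_tokens if len(word) > 1]


class WhitespaceTagger:
    """Hypothetical stand-in for Mecab: splits on whitespace, not morphemes."""
    def morphs(self, text):
        return text.split()


tokenizer = CustomTokenizer(WhitespaceTagger())
tokens = tokenizer("a topic model of news")
print(tokens)  # ['topic', 'model', 'of', 'news']
```

The single-character token "a" is dropped, while every token of two or more characters is kept.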

In [11]:
# Load the data: news articles about Handong University published in
# major Korean daily newspapers from January 1995 to September 2024.
df = pd.read_excel("data/NewsResult_19950101-20240930.xlsx", engine="openpyxl")

# Add the timestamp for ToT: "일자" (date) -> "시점" (time point, here the year)
df["시점"] = round(df["일자"] / 10000, 0).astype(int)
df = df.sort_values("시점")
df.reset_index(drop=True, inplace=True)

# Text preprocessing
words = pbk.keyword_parser(pbk.keyword_list(df))
words = list_to_str(words)
timestamp = df["시점"].tolist()

# Define the tokenizer and vectorizer
custom_tokenizer = CustomTokenizer(Mecab())
vectorizer = CountVectorizer(tokenizer=custom_tokenizer, max_features=3000)
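
The timestamp step in the cell above turns each value of the `일자` (date) column into a year by dividing by 10000 and rounding. A minimal sketch of that step, assuming the dates are stored as `YYYYMMDD` integers (the helper name `year_from_yyyymmdd` and the sample dates are hypothetical). Under that assumption, integer division gives the same year as the rounding, because the month and day digits never reach 5000.

```python
def year_from_yyyymmdd(date_value: int) -> int:
    """Return the year of a date stored as a YYYYMMDD integer."""
    return int(date_value) // 10000


# Hypothetical sample dates within the period covered by the data file
dates = [19950101, 20091231, 20240930]
years = [year_from_yyyymmdd(d) for d in dates]
print(years)  # [1995, 2009, 2024]

# Rounding, as used in the cell above, agrees for valid YYYYMMDD values
assert years == [int(round(d / 10000, 0)) for d in dates]
```

Yearly bins from 1995 to 2024 give 30 distinct timestamps, which is consistent with the `30it` progress output printed by `topics_over_time` later in this example.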

In [13]:
# Set up the c-TF-IDF, UMAP, and HDBSCAN components used to tune the BERTopic model
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.1, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20)

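# Define the topic model used for the topics-over-time analysis.
# Note: umap_model (defined above) is not passed here, so BERTopic falls back to its
# default UMAP settings; add umap_model=umap_model to use the configuration above.
# Because a custom hdbscan_model is supplied, its min_cluster_size controls the
# minimum cluster size rather than min_topic_size.
topic_model = BERTopic(
    embedding_model="sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens",
    vectorizer_model=vectorizer,
    hdbscan_model=hdbscan_model,
    ctfidf_model=ctfidf_model,
    top_n_words=20,
    min_topic_size=20,
    verbose=True
)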

In [14]:
# fit the data
topics, probs = topic_model.fit_transform(words)

2024-11-19 16:37:55,665 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/252 [00:00<?, ?it/s]

2024-11-19 16:40:45,066 - BERTopic - Embedding - Completed ✓
2024-11-19 16:40:45,069 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-19 16:40:51,280 - BERTopic - Dimensionality - Completed ✓
2024-11-19 16:40:51,284 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-19 16:40:51,435 - BERTopic - Cluster - Completed ✓
2024-11-19 16:40:51,446 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-19 16:40:55,584 - BERTopic - Representation - Completed ✓


In [15]:
# show topic dataframe
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 2878 | `-1_수시_ac_www_정시` | [수시, ac, www, 정시, kr, 모집, 전형, 수능, 총장, 선발, 논술, ... | [학년도 입정 특징 분할모집 112개 한국대학교육협의회 25일 정시모집 학년도 대입... |
| 1 | 0 | 624 | `0_바이오_ai_하나님_창조` | [바이오, ai, 하나님, 창조, 성경, 인공지능, 예수, 인간, 세계, 진화, 과... | [세계 AI 가짜 주간경향 딥페이크 Deep Fake 생각 정도 미국 대통령 안심 ... |
| 2 | 1 | 391 | `1_지정_선정_통합_10` | [지정, 선정, 통합, 10, 건대, 혁신, 100, 계획, 도립, 1000, 예비... | [자격증 봉사 활동 주부 전형 대학 사건 연평도 도발 입학 서해 대입 특례 추진 서... |
| 3 | 2 | 328 | `2_20_예비_지정_외국어` | [20, 예비, 지정, 외국어, 탐구, 수리, 선교사, 대학생, 번역, 신청, 세대... | [G2시대 동북아질서 재편 천안함 연평도 정면충돌 남북 한반도 평화 안정 건설적 대... |
| 4 | 3 | 249 | `3_트럼프_회담_정상_북미` | [트럼프, 회담, 정상, 북미, 대통령, 협상, 국무, 도널드, 행정부, 바이든, ... | [입장 대미 수세 승부수 시진핑 방북 향배 주목 북한 국빈 방문 시진핑 중국 국가 ... |
| ... | ... | ... | ... | ... | ... |
| 56 | 55 | 24 | `55_sma_무급_분담금_휴직` | [sma, 무급, 분담금, 휴직, 방위비, 타결, 근로자, 미군, 주한, 분담, 협... | [주한 미군 한국인 근로자 초유 무급 휴직 현실 조율 방위비 분담금 협상 막바지 대... |
| 57 | 56 | 22 | `56_동문_정당_파병_투표` | [동문, 정당, 파병, 투표, 통과, 법안, 보험, 미디어, 선거, 승리, 영광, ... | [총선 총선 자문 위원 제언 투표용지 탄환 후보 13 경향신문 총선 자문 위원 선거... |
| 58 | 57 | 21 | `57_cbmc_ccm_채플_콘서트` | [cbmc, ccm, 채플, 콘서트, 찬양, 음악, 사역, 가수, 장로, 회원, 실... | [찬양 사역자 8월 15일 대전 컨퍼런스 CCM 크리스천 대중음악 찬양 대중속 강좌... |
| 59 | 58 | 21 | `58_김영길_공로_선출_몽골` | [김영길, 공로, 선출, 몽골, 회장, 총회, 법학, 입법, 이사, 정기, 석방, ... | [이성환 교수 입법 학회 회장 사단 법인 한국 입법 학회 회장 이성환 국민대 법학과... |


In [16]:
# Show a 2D visualization of the topics, similar to LDAvis
topic_model.visualize_topics()

In [17]:
# To visualize the hierarchy of topics
topic_model.visualize_hierarchy()

In [18]:
# Visualize the similarity matrix between topics
topic_model.visualize_heatmap()

In [19]:
# To visualize the ranking of terms per topic
topic_model.visualize_barchart(top_n_topics=20)

In [20]:
# Visualize how term scores decline with term rank in each topic
topic_model.visualize_term_rank()

In [23]:
# get the ToT result
topics_over_time = topic_model.topics_over_time(words, timestamp)

30it [00:04,  6.03it/s]


In [24]:
# show dataframe of ToT result
topics_over_time.head(20)

| | Topic | Words | Frequency | Timestamp |
|---|---|---|---|---|
| 0 | -1 | 경쟁, 본고사, 학과, 입시, 미달 | 45 | 1995 |
| 1 | 0 | 창조, 목회, 자립, 과학, 학습 | 11 | 1995 |
| 2 | 1 | 순결, 대학가, 학부, 학과, 개방 | 2 | 1995 |
| 3 | 2 | 본고사, 특차, 미달, 정원, 과목 | 6 | 1995 |
| 4 | 4 | 선린, 실력, 과학자, 설립, 하나님 | 4 | 1995 |
| 5 | 6 | 학점, 장벽, 은행, 교육, 개혁 | 2 | 1995 |
| 6 | 7 | 내신, 배정, 응시, 거주, 고사 | 1 | 1995 |
| 7 | 8 | 동성애자, 동성애, 본고사, 순결, 특차 | 4 | 1995 |
| 8 | 10 | 모금, 사랑, 재활, 남편, 요즘 | 1 | 1995 |
| 9 | 12 | 합격자, 등록, 예비, 엔진, 합격 | 3 | 1995 |
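
The `Frequency` column in the table above can be read as the number of documents assigned to a topic within one timestamp. The following is a minimal sketch of that counting step only, using a hypothetical helper `topic_frequency_over_time` and made-up assignments. It does not reproduce the `Words` column, which BERTopic derives from a separate c-TF-IDF computation per timestamp.

```python
from collections import Counter


def topic_frequency_over_time(topics, timestamps):
    """Count documents per (timestamp, topic) pair.

    topics and timestamps are aligned lists with one entry per document.
    """
    counts = Counter(zip(timestamps, topics))
    return [
        {"Topic": topic, "Frequency": n, "Timestamp": ts}
        for (ts, topic), n in sorted(counts.items())
    ]


# Made-up topic assignments and years for six documents
topics = [-1, 0, 0, 2, -1, 0]
timestamps = [1995, 1995, 1995, 1995, 1996, 1996]
rows = topic_frequency_over_time(topics, timestamps)
for row in rows:
    print(row)
```

Topics that have no documents in a given year produce no row, which would explain why the topic numbers in the table skip values within 1995.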


In [25]:
# Show a line plot of the topics-over-time result
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=30)

### Result interpretation

- In BERTopic, the number of topics is not set in advance; it follows from the clusters that HDBSCAN finds. Therefore, the measures commonly used to choose the number of topics in traditional topic modeling (perplexity, coherence) are not computed separately here.

- Unlike the DTM model, BERTopic attaches a timestamp to each document and tracks how each topic's representation changes across timestamps. This makes it possible to analyze continuous changes over time. For this reason, BERTopic captures topics relatively accurately even in rapidly changing data (e.g., news, newspaper editorials, SNS).
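
The first point can be checked directly from the topic assignments. The sketch below uses a hypothetical helper `summarize_clusters` and made-up labels to count the topics produced by the clustering step. It assumes that the label -1 marks documents HDBSCAN treats as outliers, as with Topic -1 in the topic table above.

```python
from collections import Counter


def summarize_clusters(labels):
    """Count topics and outliers in a list of cluster labels (-1 = outlier)."""
    counts = Counter(labels)
    n_outliers = counts.pop(-1, 0)  # remove the outlier label before counting topics
    return {
        "n_topics": len(counts),
        "n_outliers": n_outliers,
        "outlier_share": n_outliers / len(labels),
    }


# Made-up labels for eight documents
labels = [-1, 0, 0, 1, 2, 2, -1, 1]
summary = summarize_clusters(labels)
print(summary)  # {'n_topics': 3, 'n_outliers': 2, 'outlier_share': 0.25}
```

With the fitted model, the same function could be applied to the `topics` list returned by `fit_transform`.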