# Topics over Time (ToT)

#### Author information

- **Name:** Jaeseong Choe

- **email address:** 21900759@handong.ac.kr

- **GitHub:** https://github.com/sorrychoe

- **Linkedin:** https://www.linkedin.com/in/jaeseong-choe-048639250/

- **Personal Webpage:** https://jaeseongchoe.vercel.app/

## Part 1. Brief background of methodology

### Overview

- **BERTopic is a topic modeling technique that leverages transformer-based embeddings and a class-based TF-IDF (c-TF-IDF) approach** to generate dense clusters of documents, resulting in interpretable topics with meaningful descriptions.

### Situation Before ToT

- Traditional topic modeling methods like Latent Dirichlet Allocation (LDA) often struggled with capturing the semantic nuances of documents, especially when dealing with short texts or documents with overlapping topics.

### Why ToT Was Introduced

- BERTopic was introduced in the paper "BERTopic: Neural topic modeling with a class-based TF-IDF procedure" by Maarten Grootendorst (2022).

- BERTopic was introduced to address the limitations of traditional models by incorporating transformer-based embeddings, which capture contextual information, and by using c-TF-IDF to produce more coherent topic representations.

### Use Cases

- BERTopic is applicable in various domains, including analyzing customer reviews, extracting themes from research articles, monitoring social media trends, and exploring historical document collections.

## Part 2. Key concept of methodology

### Key Concept

- BERTopic combines transformer-based embeddings with clustering algorithms and a class-based TF-IDF approach to identify and represent topics within a corpus. 
  

### Generative Process

The methodology involves four main steps:

**1. Document Embedding**

- Each document is converted into a vector representation using a pre-trained transformer model, such as BERT. 

**2. Dimensionality Reduction**

- The high-dimensional embeddings are reduced to a lower-dimensional space using techniques like UMAP to preserve semantic relationships. 

**3. Clustering**

- The reduced embeddings are clustered using algorithms like HDBSCAN to group semantically similar documents. 

**4. Topic Representation**

- For each cluster, a c-TF-IDF approach is applied to extract representative words, forming the topic description.

![ToT_Graphic](./img/bertopic_Graphic.png)
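
The topic representation step can be illustrated with a small, standard-library sketch of class-based TF-IDF. It follows the weighting described in the BERTopic paper (term frequency within a cluster, scaled by `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` is the frequency of term `t` across all clusters). This is a simplified illustration, not the library's implementation; the function names `c_tf_idf` and `top_words` and the toy clusters are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster. `clusters` maps a cluster id to a list of tokenized documents."""
    # Merge all documents of a cluster into one "class document" and count its terms.
    class_counts = {
        cid: Counter(token for doc in docs for token in doc)
        for cid, docs in clusters.items()
    }
    # Frequency of each term across all clusters.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(class_counts)

    scores = {}
    for cid, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[cid] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_words(scores, cid, k=3):
    """Return the k highest-scoring terms of a cluster (ties broken alphabetically)."""
    ranked = sorted(scores[cid].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy data (hypothetical): two clusters of already tokenized documents.
toy_clusters = {
    0: [["admission", "exam", "quota"], ["admission", "exam", "university"]],
    1: [["robot", "ai", "university"], ["ai", "vaccine", "robot"]],
}
scores = c_tf_idf(toy_clusters)
print(top_words(scores, 0, k=2))
```

In this toy example, "university" appears in both clusters, so it receives a lower weight than a term that is specific to one cluster, which is the behavior that makes the extracted words descriptive of a topic.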

### Strength

- BERTopic effectively captures semantic relationships in documents through transformer embeddings and provides coherent topic representations using c-TF-IDF. Its modular design allows for flexibility in choosing embedding models, dimensionality reduction techniques, and clustering algorithms, making it adaptable to various datasets and applications.

## Part 3. Example

### Precautions

- Results may differ slightly each time the code is re-executed, because UMAP's dimensionality reduction is stochastic unless its random seed is fixed.

- The number and content of the topics should not differ significantly between runs, but the index assigned to each topic can change.

In [1]:
# import libraries
import pandas as pd # for loading Excel data
import pyBigKinds as pbk # for preprocessing news data
from sklearn.feature_extraction.text import CountVectorizer # for vectorizing text data
from konlpy.tag import Mecab # for tokenizing Korean words
from bertopic.vectorizers import ClassTfidfTransformer # for computing c-TF-IDF values
from bertopic import BERTopic # for loading the BERTopic model
from hdbscan import HDBSCAN # for tuning the BERTopic model
from umap import UMAP # for tuning the BERTopic model

# suppress warning messages
import warnings
warnings.filterwarnings("ignore")

In [2]:
def list_to_str(words: list) -> list:
    """Join each document's keyword list into one space-separated string (modifies the list in place)."""
    for i, keywords in enumerate(words):
        words[i] = " ".join(keywords)
    return words

In [3]:
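class CustomTokenizer:
    """Korean tokenizer that wraps a KoNLPy-style tagger for use with CountVectorizer."""
    def __init__(self, tagger):
        self.tagger = tagger
    def __call__(self, sent):
        sent = sent[:1000000]  # cap the input at 1,000,000 characters
        word_tokens = self.tagger.morphs(sent)  # split the text into morphemes
        result = [word for word in word_tokens if len(word) > 1]  # drop single-character tokens
        return result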
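
The filtering behavior of the tokenizer can be checked without installing Mecab. The sketch below repeats the class and replaces the tagger with `WhitespaceTagger`, a hypothetical stand-in that only splits on spaces; a real morphological analyzer would segment the sentence differently.

```python
class CustomTokenizer:
    """Same tokenizer as in the notebook: tokenize, then drop single-character tokens."""
    def __init__(self, tagger):
        self.tagger = tagger

    def __call__(self, sent):
        sent = sent[:1000000]
        word_tokens = self.tagger.morphs(sent)
        return [word for word in word_tokens if len(word) > 1]


class WhitespaceTagger:
    """Hypothetical stand-in for Mecab: exposes morphs() but only splits on whitespace."""
    def morphs(self, text):
        return text.split()


tokenizer = CustomTokenizer(WhitespaceTagger())
# The sentence is pre-split by hand so that particles such as "는" and "에" are separate tokens.
tokens = tokenizer("한동대 는 포항 에 있다")
print(tokens)
```

Single-character tokens, which in Korean are often particles, are removed, so only the longer content-bearing tokens reach the `CountVectorizer`.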

In [4]:
# Load the data.
# The dataset covers news about Handong University reported in major
# Korean daily newspapers from January 1995 to September 2024.
df = pd.read_excel("data/NewsResult_19950101-20240930.xlsx", engine="openpyxl")

# Add a yearly timestamp for topics over time.
# `일자` ("date") is treated as a YYYYMMDD number, so dividing by 10000 leaves the year;
# the result is stored in `시점` ("time point").
df["시점"] = (round(df["일자"]/10000,0)).astype(int)
df = df.sort_values("시점")
df.reset_index(drop=True, inplace=True)

# Text preprocessing
words = pbk.keyword_parser(pbk.keyword_list(df))
words = list_to_str(words)
timestamp = df["시점"].tolist()

# Define the tokenizer and vectorizer
custom_tokenizer = CustomTokenizer(Mecab())
vectorizer = CountVectorizer(tokenizer=custom_tokenizer, max_features=3000)
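
The yearly timestamp can be reproduced on plain integers. The sketch below assumes, as the notebook code implies, that each date is a number of the form YYYYMMDD; `year_bucket` is a hypothetical helper name. Because the month-and-day part is at most 1231, dividing by 10000 and rounding never rounds up to the next year, so integer division gives the same result.

```python
def year_bucket(yyyymmdd: int) -> int:
    """Return the year of a date stored as a YYYYMMDD integer."""
    return yyyymmdd // 10000


dates = [19950101, 19951231, 20240930]  # toy values in the assumed YYYYMMDD format
years = [year_bucket(d) for d in dates]
print(years)

# The rounding approach used in the notebook yields the same years.
rounded = [int(round(d / 10000, 0)) for d in dates]
print(rounded == years)
```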

In [5]:
# Configure the c-TF-IDF, UMAP, and HDBSCAN components used to tune the BERTopic model
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.1, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_samples=10, gen_min_span_tree=True, prediction_data=False, min_cluster_size=20)

# Define the BERTopic model used for the topics-over-time analysis
topic_model = BERTopic(
    embedding_model="sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens", 
    umap_model=umap_model,  # use the seeded UMAP configured above
    vectorizer_model=vectorizer,
    hdbscan_model=hdbscan_model,
    ctfidf_model=ctfidf_model,
    top_n_words=20,
    min_topic_size=20,
    verbose=True
)

In [6]:
# fit the data
topics, probs = topic_model.fit_transform(words)

2024-11-19 16:50:27,376 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/252 [00:00<?, ?it/s]

2024-11-19 16:53:11,223 - BERTopic - Embedding - Completed ✓
2024-11-19 16:53:11,225 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
2024-11-19 16:53:23,754 - BERTopic - Dimensionality - Completed ✓
2024-11-19 16:53:23,755 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-19 16:53:23,889 - BERTopic - Cluster - Completed ✓
2024-11-19 16:53:23,894 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-19 16:53:27,424 - BERTopic - Representation - Completed ✓


In [7]:
# show topic dataframe
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 3414 | -1_수시_ac_www_정시 | [수시, ac, www, 정시, kr, 히딩크, 모집, 전형, 수능, 총장, 경쟁,... | [수능 학생부 농어촌 실업계 특기자 대상 선발 대학수학능력시험 학교생활 기록부 성적... |
| 1 | 0 | 336 | 0_20_예비_지정_외국어 | [20, 예비, 지정, 외국어, 탐구, 수리, 대학생, 선교사, 과목, 신청, 번역... | [G2시대 동북아질서 재편 천안함 연평도 정면충돌 남북 한반도 평화 안정 건설적 대... |
| 2 | 1 | 225 | 1_ai_로봇_백신_인공지능 | [ai, 로봇, 백신, 인공지능, 진화, 유전자, 유네스코, 인간, 창조, 창업, ... | [네트워크화 시대 효율적 진보 어른들 텔레비전 낭비 과거 어른 어른 세대 휴대전화 ... |
| 3 | 2 | 190 | 2_30_피지_학과_인원 | [30, 피지, 학과, 인원, 공학, 제외, 학부, 기증, 학년도, 특수, 마을, ... | [한동대학교 신앙 학문 조화 지구촌 인재 양성 요람 변화 갱신 요구 한국교회 기독교... |
| 4 | 3 | 154 | 3_퀴어_누리_동성애_동성애자 | [퀴어, 누리, 동성애, 동성애자, 행사, 본부, 고등학교, 시민, 회장, 연합회,... | [성문화 캠퍼스 질서 억압 허구 페미니즘 동성애 대학 화두 금기 터널 대학가 화두 ... |
| ... | ... | ... | ... | ... | ... |
| 64 | 63 | 22 | 63_지선_전신_사고_모교 | [지선, 전신, 사고, 모교, 화상, 이지선, 55, 치료, 운전자, 수술, 재활,... | [지선 사랑해 이지선 이대 교수 컴백 사고 23년 모교로 전신 화상 아픔 희망 전도... |
| 65 | 64 | 21 | 64_사학_석사_학사_31 | [사학, 석사, 학사, 31, 역임, 도박, 친서, 박사, 역사, 학회, 포스코, ... | [포항시 협력 글로컬 대학 육성 거버넌스 공식 출범 경북 포항시 글로컬대학 지정 지... |
| 66 | 65 | 21 | 65_통치_중립_주립_주임 | [통치, 중립, 주립, 주임, 일자리, 편입, 글로벌, 가스, 도청, 도시, 동성,... | [미국대학 진학 가능 편입과정 대세 텍사스주립대 글로벌 프론티어 기숙 프로그램 미국... |
| 67 | 66 | 21 | 66_대학교_내진_합격자_서명 | [대학교, 내진, 합격자, 서명, 75, 시국, 등록금, 반값, 지진, 등록, 사이... | [전문가들 정면돌파 장기전 대비 정면 돌파 이례적 나흘간 마라톤 전원 회의 집권 육... |
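
In the output above, topic `-1` is the label BERTopic gives to documents that HDBSCAN leaves unclustered (outliers), so it is not a topic in its own right. A minimal sketch for summarizing a list of topic assignments follows; `summarize_topics` is a hypothetical helper and the toy assignments are made up for illustration, not taken from the notebook's data.

```python
from collections import Counter


def summarize_topics(topics):
    """Count real topics and outliers in a list of per-document topic ids."""
    counts = Counter(topics)
    n_outliers = counts.pop(-1, 0)  # -1 marks documents that belong to no cluster
    return {
        "n_topics": len(counts),
        "n_outliers": n_outliers,
        "outlier_share": n_outliers / len(topics),
    }


toy_topics = [-1, 0, 0, 1, -1, 2, 1, 0]  # toy assignments, one entry per document
summary = summarize_topics(toy_topics)
print(summary)
```

Applied to the `topics` list returned by `fit_transform`, the same helper would report how many documents were left unassigned, which is worth checking before interpreting the topics.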


In [8]:
# show 2d visualized topic like LDAvis
BERTopic_vis = topic_model.visualize_topics()
BERTopic_vis

In [9]:
# To visualize the hierarchy of topics
hierachy_vis = topic_model.visualize_hierarchy()
hierachy_vis

In [10]:
# To visualize the similarity between topics as a heatmap
heatmap_vis = topic_model.visualize_heatmap()
heatmap_vis

In [11]:
# To visualize the ranking of terms per topic
barchart_vis = topic_model.visualize_barchart(top_n_topics=20)
barchart_vis

In [12]:
# To visualize how term scores decline with term rank in each topic
rank_vis = topic_model.visualize_term_rank()
rank_vis

In [13]:
# get the ToT result
topics_over_time = topic_model.topics_over_time(words, timestamp)

30it [00:05,  5.66it/s]
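
`topics_over_time` keeps the topics fitted on the whole corpus and derives, for each timestamp, a frequency and a set of representative words per topic. The sketch below reproduces only the frequency part on toy data, to show what the `Frequency` column counts; `topic_frequency_over_time` is a hypothetical helper, not a BERTopic function, and the per-timestamp word extraction is omitted.

```python
from collections import Counter


def topic_frequency_over_time(topics, timestamps):
    """Count how many documents each topic has at each timestamp."""
    counts = Counter(zip(timestamps, topics))
    return [
        {"Timestamp": ts, "Topic": topic, "Frequency": n}
        for (ts, topic), n in sorted(counts.items())
    ]


# Toy data: one topic id and one year per document.
toy_topics = [0, 0, 1, -1, 1, 0]
toy_years = [1995, 1995, 1995, 1996, 1996, 1996]

rows = topic_frequency_over_time(toy_topics, toy_years)
for row in rows:
    print(row)
```

Topics that have no documents in a given year simply produce no row for that year, which is consistent with the gaps in topic numbers visible in the notebook's output for 1995.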


In [14]:
# show dataframe of ToT result
topics_over_time.head(20)

| | Topic | Words | Frequency | Timestamp |
|---|---|---|---|---|
| 0 | -1 | 경쟁, 본고사, 학과, 입시, 마감 | 53 | 1995 |
| 1 | 0 | 본고사, 특차, 미달, 과목, 정원 | 6 | 1995 |
| 2 | 1 | 창조, 윤리, 박사, 수준, 지식 | 3 | 1995 |
| 3 | 2 | 선린, 실력, 설립, 과학자, 대학가 | 4 | 1995 |
| 4 | 3 | 동성애자, 동성애, 본고사, 순결, 특차 | 4 | 1995 |
| 5 | 4 | 학점, 장벽, 은행, 교육, 기관 | 2 | 1995 |
| 6 | 5 | 내신, 배정, 응시, 거주, 전문대 | 1 | 1995 |
| 7 | 6 | 순결, 대학가, 개방, 가치관, 공론 | 1 | 1995 |
| 8 | 8 | 대사관, 환영, 참석자, 한국과학기술원, 주최 | 1 | 1995 |
| 9 | 10 | 제한, 25, 명문, 표방, 상위 | 6 | 1995 |


In [15]:
# show a line plot of the ToT result
tot_vis = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=30)
tot_vis

In [16]:
# save the result
BERTopic_vis.write_html("view/bertopic_vis.html")
hierachy_vis.write_html("view/hierachy_vis.html")
heatmap_vis.write_html("view/heatmap_vis.html")
barchart_vis.write_html("view/barchart_vis.html")
rank_vis.write_html("view/rank_vis.html")
tot_vis.write_html("view/tot_vis.html")

### Result interpretation

- In BERTopic, the number of topics is not chosen in advance; it follows from the clusters that HDBSCAN finds. The metrics commonly used to select the number of topics in conventional topic modeling, such as perplexity and coherence, are therefore not computed separately.

- Unlike the DTM model, BERTopic attaches the time variable to topics that have already been fitted on the whole corpus, recomputing each topic's representation per timestamp. This makes it possible to analyze how topics fluctuate over time. For this reason, BERTopic is characterized as capturing topics relatively accurately even in rapidly changing data (e.g., news, newspaper editorials, and social media).