## BERTopic

- Comments: first filter out emoji and HTML source, then segment words with CKIP
- Transcripts: strip timecodes, then segment words with CKIP
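The cleaning steps listed above are not shown in this notebook. A minimal sketch of what they might look like follows; the regex patterns and the function names `clean_comment` and `clean_transcript` are assumptions for illustration, not the code that produced `ckip_comments.csv`.

```python
import html
import re

# Assumed patterns; the notebook's actual cleaning rules are not shown.
TAG_RE = re.compile(r"<[^>]+>")  # HTML tags such as <br> or <a href="...">
TIMECODE_RE = re.compile(
    r"(?<!\d)\d{1,2}:\d{2}(?::\d{2})?(?:[.,]\d{1,3})?(?!\d)"  # 1:23, 01:02:03, 00:01:02.500
)
EMOJI_RE = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def _collapse_spaces(text):
    """Replace any run of whitespace with one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_comment(text):
    """Decode HTML entities, then drop tags and emoji from a comment."""
    text = html.unescape(text)
    text = TAG_RE.sub(" ", text)
    text = EMOJI_RE.sub("", text)
    return _collapse_spaces(text)


def clean_transcript(text):
    """Drop timecodes from a transcript line."""
    return _collapse_spaces(TIMECODE_RE.sub(" ", text))


cleaned_comment = clean_comment("我入鏡了😂<br>請大笑 &amp; 留言")
cleaned_line = clean_transcript("00:01:02.500 大家好 1:23 歡迎收看")
```

The emoji ranges cover the common blocks only; a dedicated emoji library would be more thorough if stray symbols still reach the segmenter.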

In [None]:
# pip install pandas
# !pip3.9 install bertopic
# !pip3.9 install hdbscan

In [15]:
import pandas as pd
from bertopic import BERTopic
from transformers.pipelines import pipeline

from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer



### Run on comments

In [5]:
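# Load the CKIP-segmented comments

comments_df = pd.read_csv('comments/ckip_comments.csv', encoding='utf-8')
# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
data = comments_df[['video_title', 'cleaned_text', 'ws', 'published_at', 'author_name', 'like_count', 'comment_type']].copy()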
data.head(3)



Unnamed: 0,video_title,cleaned_text,ws,published_at,author_name,like_count,comment_type
0,【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅,我入鏡了,"['我', '入鏡', '了']",2023-10-30T15:40:22Z,@rayduenglish,1142,top_comment
1,【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅,Ya 果然來留言了,"['Ya ', '果然', '來', '留言', '了', ' ']",2023-10-30T15:42:12Z,@TheLian8,15,reply
2,【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅,請大笑,"['請', '大笑']",2023-10-31T07:05:46Z,@teresayeh3049,18,reply


In [None]:
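# The 'ws' column holds each CKIP token list as a string, e.g. "['我', '入鏡', '了']".
# Strip the brackets and quotes, then join the tokens with single spaces.
data['ws_clean'] = data["ws"].apply(
    lambda x: " ".join(str(x).replace("[", "").replace("]", "").replace("'", "").split(", ")) if pd.notnull(x) else ""
)
print(data[:3])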

                     video_title cleaned_text  \
0  【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅         我入鏡了   
1  【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅   Ya 果然來留言了    
2  【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅          請大笑   

                                   ws          published_at     author_name  \
0                    ['我', '入鏡', '了']  2023-10-30T15:40:22Z   @rayduenglish   
1  ['Ya ', '果然', '來', '留言', '了', ' ']  2023-10-30T15:42:12Z       @TheLian8   
2                         ['請', '大笑']  2023-10-31T07:05:46Z  @teresayeh3049   

  like_count comment_type         ws_clean  
0       1142  top_comment           我 入鏡 了  
1         15        reply  Ya  果然 來 留言 了    
2         18        reply             請 大笑  
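The bracket-and-quote stripping above keeps whitespace-only tokens (note the doubled and trailing spaces in row 1) and would also mangle any token that itself contains a quote, bracket, or comma. A more defensive variant, sketched here under the assumption that every `ws` cell is a Python-style list literal, parses the string with `ast.literal_eval`; the helper name `join_tokens` is hypothetical.

```python
import ast


def join_tokens(ws_cell):
    """Turn a stringified token list into one space-separated string."""
    if not isinstance(ws_cell, str):  # NaN and other non-string cells
        return ""
    try:
        tokens = ast.literal_eval(ws_cell)
    except (ValueError, SyntaxError):  # cell is not a valid literal
        return ""
    if not isinstance(tokens, (list, tuple)):
        return ""
    # Trim each token and drop the ones that are empty after trimming.
    return " ".join(t.strip() for t in tokens if isinstance(t, str) and t.strip())


joined = join_tokens("['Ya ', '果然', '來', '留言', '了', ' ']")
```

It would be applied the same way as the lambda: `data['ws_clean'] = data['ws'].apply(join_tokens)`.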


In [None]:
from hdbscan import HDBSCAN

# The documents are already segmented by CKIP, so tokens are recovered by splitting on spaces.
vectorizer_model = CountVectorizer(
    tokenizer=lambda x: x.split(" "),  # splitting on spaces is enough
)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = KeyBERTInspired()
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=10, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True)
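Splitting on a single space produces empty-string tokens wherever `ws_clean` contains consecutive or trailing spaces, as in row 1 of the earlier output. One possible alternative, shown as a sketch, is a named tokenizer that splits on any run of whitespace; the function name `whitespace_tokenizer` is hypothetical.

```python
def whitespace_tokenizer(text):
    """Split pre-segmented text on any run of whitespace, dropping empty tokens."""
    return text.split()


tokens = whitespace_tokenizer("Ya  果然 來 留言 了  ")
naive_tokens = "Ya  果然".split(" ")  # the single-space split keeps an empty token
```

It would be passed as `CountVectorizer(tokenizer=whitespace_tokenizer, token_pattern=None)`. Setting `token_pattern=None` should avoid scikit-learn's warning about an unused pattern, and a named function, unlike a lambda, can be pickled, which matters if the fitted model is saved that way.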


docs = data["ws_clean"].tolist()
print(len(docs))

153460


In [None]:
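# Note: ctfidf_model, representation_model and hdbscan_model defined above are not passed here,
# so BERTopic uses its defaults for those components.
topic_model = BERTopic(
    language="chinese (traditional)",  # Traditional Chinese; not used once embedding_model is given
    embedding_model="distiluse-base-multilingual-cased-v1",  # model that turns text into vectors
    vectorizer_model=vectorizer_model,  # the custom CountVectorizer defined above
    calculate_probabilities=True,       # compute the probability of each topic per document
    verbose=True                        # log each stage as it runs
)
topics, probs = topic_model.fit_transform(docs)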

2025-05-12 20:59:48,152 - BERTopic - Embedding - Transforming documents to embeddings.
Batches: 100%|██████████| 4796/4796 [07:35<00:00, 10.54it/s]

2025-05-12 21:08:34,860 - BERTopic - Embedding - Completed ✓
2025-05-12 21:08:34,860 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-12 21:09:56,167 - BERTopic - Dimensionality - Completed ✓
2025-05-12 21:09:56,169 - BERTopic - Cluster - Start clustering the reduced embeddings
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set t
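The `huggingface/tokenizers` message at the end of the log is a warning rather than an error. A common way to silence it, assuming it is acceptable to turn off tokenizer parallelism for this run, is to set the `TOKENIZERS_PARALLELISM` environment variable before the embedding model is loaded:

```python
import os

# Must run before transformers / sentence-transformers tokenizers are first used,
# ideally in the first cell of the notebook.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```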

### View result

In [None]:
topic_model.get_topic(0)

In [None]:
freq = topic_model.get_topic_info()
freq.head(16)

In [None]:
doc_info = topic_model.get_document_info(docs)
doc_info.query("Topic==1")

In [None]:
all_topics = topic_model.get_topics()
df_all_topics = pd.DataFrame(all_topics)
df_all_topics
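`get_topics()` returns a mapping from topic id to a list of `(word, score)` pairs, so passing it straight to `pd.DataFrame` yields one column per topic with tuples in the cells. A long, one-row-per-word layout is often easier to filter and sort. The sketch below builds such rows in plain Python; the helper name `topics_to_rows` and the example scores are made up for illustration.

```python
def topics_to_rows(all_topics):
    """Flatten {topic_id: [(word, score), ...]} into one dict per (topic, word)."""
    rows = []
    for topic_id, words in all_topics.items():
        for rank, (word, score) in enumerate(words, start=1):
            rows.append({"topic": topic_id, "rank": rank, "word": word, "score": score})
    return rows


# Hypothetical data in the same shape that get_topics() returns.
example_topics = {
    -1: [("了", 0.02), ("的", 0.01)],
    0: [("入鏡", 0.31), ("留言", 0.12)],
}
rows = topics_to_rows(example_topics)
```

With the fitted model this would be used as `pd.DataFrame(topics_to_rows(topic_model.get_topics()))`.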

In [None]:
# Visualize topics

topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart(top_n_topics=10, n_words=10, topics=range(10))