## BERTopic

- Comments: first filter out emoji and HTML source, then segment words with CKIP
- Transcripts: strip timecodes, then segment words with CKIP
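The cleaning steps listed above could look roughly like the sketch below. The regular expressions, the timecode format, and the function names `clean_comment` and `clean_transcript` are assumptions for illustration, not the notebook's actual preprocessing; CKIP word segmentation would run on the cleaned text afterwards.

```python
import html
import re

# Assumed patterns; adjust them to the actual data.
HTML_TAG = re.compile(r"<[^>]+>")
# Common emoji blocks plus the variation selector and zero-width joiner (not exhaustive).
EMOJI = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]+"
)
# Timecodes such as 1:23, 01:23 or 00:01:23, optionally wrapped in brackets (assumed format).
TIMECODE = re.compile(r"\[?(?<!\d)\d{1,2}:\d{2}(?::\d{2})?(?!\d)\]?")
WHITESPACE = re.compile(r"\s+")


def clean_comment(text: str) -> str:
    """Remove HTML tags, decode HTML entities, and drop emoji from a raw comment."""
    text = HTML_TAG.sub(" ", text)
    text = html.unescape(text)
    text = EMOJI.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


def clean_transcript(text: str) -> str:
    """Remove timecodes from a transcript line."""
    text = TIMECODE.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


print(clean_comment('好笑<br>😂😂 &amp; <a href="x">連結</a>'))
print(clean_transcript("[00:01:23] 大家好 0:15 歡迎收看"))
```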

In [None]:
# pip install pandas
# !pip3.9 install bertopic
# !pip3.9 install hdbscan

In [1]:
import pandas as pd
from bertopic import BERTopic
from transformers.pipelines import pipeline

from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer



### Run on comments

In [2]:
# load ckip data

comments_df = pd.read_csv('comments/ckip_comments.csv', encoding='utf-8')
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
data = comments_df[['video_title', 'cleaned_text', 'ws', 'published_at', 'author_name', 'like_count', 'comment_type']].copy()
data.head(3)



Unnamed: 0,video_title,cleaned_text,ws,published_at,author_name,like_count,comment_type
0,【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅,我入鏡了,"['我', '入鏡', '了']",2023-10-30T15:40:22Z,@rayduenglish,1142,top_comment
1,【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅,Ya 果然來留言了,"['Ya ', '果然', '來', '留言', '了', ' ']",2023-10-30T15:42:12Z,@TheLian8,15,reply
2,【#賀瓏夜夜秀】10/28 新聞亂報 EP1｜藍白兩情相悅,請大笑,"['請', '大笑']",2023-10-31T07:05:46Z,@teresayeh3049,18,reply


In [4]:
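# `ws` stores the CKIP tokens as a stringified list, e.g. "['我', '入鏡', '了']".
# Strip the brackets and quotes, then join the tokens with spaces.
data['ws_clean'] = data["ws"].apply(
    lambda x: " ".join(str(x).replace("[", "").replace("]", "").replace("'", "").split(", ")) if pd.notnull(x) else ""
)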
# print(data[:3])
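The string replacement above leaves stray spaces when a token is blank or padded (for example `'Ya '` or `' '` in the sample rows), and it would mangle any token that itself contains a quote or bracket. A possible alternative, shown here as a sketch only, parses the stringified list with `ast.literal_eval`. The helper name `join_tokens` is hypothetical, and the sketch assumes every non-empty `ws` cell is a valid Python list literal.

```python
import ast


def join_tokens(ws):
    """Parse a stringified token list and join the non-blank tokens with spaces."""
    if not isinstance(ws, str) or not ws.strip():
        return ""  # missing value (e.g. NaN) or empty cell
    tokens = ast.literal_eval(ws)  # "['我', '入鏡', '了']" -> ['我', '入鏡', '了']
    return " ".join(t.strip() for t in tokens if t.strip())


print(join_tokens("['Ya ', '果然', '來', '留言', '了', ' ']"))
```

With pandas this would be applied as `data["ws"].apply(join_tokens)`.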

In [5]:
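from hdbscan import HDBSCAN

# The documents are already segmented by CKIP, so splitting on whitespace is enough.
vectorizer_model = CountVectorizer(
    tokenizer=lambda x: x.split(),  # split() also drops empty tokens caused by repeated spaces
    token_pattern=None,  # not used when a custom tokenizer is given
)
# NOTE: the three models below are defined but not passed to BERTopic in the loop that follows;
# only vectorizer_model is.
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
representation_model = KeyBERTInspired()
hdbscan_model = HDBSCAN(min_cluster_size=20, metric='euclidean',
                        cluster_selection_method='eom', prediction_data=True, min_samples=10)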

In [6]:
# Group comments by video title
# video 31 skipped

for i, (title, group) in enumerate(data.groupby('video_title'), 1):
    # resume from video 32
    if i < 32:
        # print(f"Skipping video_{i} ({title})")
        continue

    docs = group["ws_clean"].tolist()

    if len(docs) > 20000:
        print(f"Skipping video_{i} ({title}) with {len(docs)} comments")
        continue

    print(f"Processing video_{i} ({title}) with {len(docs)} comments")
    topic_model = BERTopic(
        language="chinese (traditional)",
        embedding_model="distiluse-base-multilingual-cased-v1",
        vectorizer_model=vectorizer_model,
        calculate_probabilities=True,
        verbose=True
    )
    topics, probs = topic_model.fit_transform(docs)

    # Save only the topic assignments and the topic keywords
    doc_info = topic_model.get_document_info(docs)
    doc_info.to_csv(f"video_{i}_topic_assignments.csv", index=False, encoding="utf-8")
    topic_info = topic_model.get_topic_info()
    topic_info.to_csv(f"video_{i}_topic_keywords.csv", index=False, encoding="utf-8")
    print(f"Saved results for video_{i}")

2025-05-12 23:09:25,277 - BERTopic - Embedding - Transforming documents to embeddings.


Processing video_32 (【#賀瓏夜夜秀】趙少康 戰鬥藍的老大另有其人) with 9441 comments


Batches: 100%|██████████| 296/296 [00:31<00:00,  9.26it/s]
2025-05-12 23:10:01,588 - BERTopic - Embedding - Completed ✓
2025-05-12 23:10:01,589 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-12 23:10:14,229 - BERTopic - Dimensionality - Completed ✓
2025-05-12 23:10:14,230 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-12 23:10:21,892 - BERTopic - Cluster - Completed ✓
2025-05-12 23:10:21,896 - BERTopic - Representation - Fine-tuning topics using representation models.

Saved results for video_32
Processing video_33 (【#賀瓏夜夜秀】鄭運鵬 兒子的爸爸從小志向是脫口秀主持人) with 3885 comments


Batches: 100%|██████████| 122/122 [00:10<00:00, 11.40it/s]
2025-05-12 23:10:35,854 - BERTopic - Embedding - Completed ✓
2025-05-12 23:10:35,855 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-12 23:10:46,378 - BERTopic - Dimensionality - Completed ✓
2025-05-12 23:10:46,378 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-12 23:10:47,007 - BERTopic - Cluster - Completed ✓
2025-05-12 23:10:47,009 - BERTopic - Representation - Fine-tuning topics using representation models.

Saved results for video_33
Processing video_34 (【#賀瓏夜夜秀】高嘉瑜 唱歌是為了認知作戰) with 1689 comments


Batches: 100%|██████████| 53/53 [00:04<00:00, 11.96it/s]
2025-05-12 23:10:54,385 - BERTopic - Embedding - Completed ✓
2025-05-12 23:10:54,385 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-12 23:10:56,667 - BERTopic - Dimensionality - Completed ✓
2025-05-12 23:10:56,667 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-12 23:10:56,743 - BERTopic - Cluster - Completed ✓
2025-05-12 23:10:56,745 - BERTopic - Representation - Fine-tuning topics using representation models.

Saved results for video_34
Processing video_35 (【#賀瓏夜夜秀】黃國昌 對柯醫師愛的咆哮｜@KC-Huang) with 4621 comments


Batches: 100%|██████████| 145/145 [00:13<00:00, 10.63it/s]
2025-05-12 23:11:13,294 - BERTopic - Embedding - Completed ✓
2025-05-12 23:11:13,294 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-12 23:11:15,542 - BERTopic - Dimensionality - Completed ✓
2025-05-12 23:11:15,543 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-12 23:11:16,363 - BERTopic - Cluster - Completed ✓
2025-05-12 23:11:16,365 - BERTopic - Representation - Fine-tuning topics using representation models.

Saved results for video_35
Processing video_36 (【#賀瓏夜夜秀】黃瀞瑩 偶爾會覺得失言其實也蠻到位) with 1521 comments


Batches: 100%|██████████| 48/48 [00:04<00:00, 10.86it/s]
2025-05-12 23:11:24,140 - BERTopic - Embedding - Completed ✓
2025-05-12 23:11:24,140 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-12 23:11:26,110 - BERTopic - Dimensionality - Completed ✓
2025-05-12 23:11:26,111 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-12 23:11:26,178 - BERTopic - Cluster - Completed ✓
2025-05-12 23:11:26,179 - BERTopic - Representation - Fine-tuning topics using representation models.

Saved results for video_36
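The loop above resumes from a hard-coded index (`i < 32`). One possible alternative, sketched here under the assumption that the two CSV files per video are the only record of a finished run, is to skip any video whose outputs already exist. The helper name `already_done` and the `out_dir` parameter are hypothetical; the file names follow the pattern used in the loop.

```python
from pathlib import Path


def already_done(i, out_dir="."):
    """Return True if both output CSVs for video i already exist in out_dir."""
    out_dir = Path(out_dir)
    return all(
        (out_dir / f"video_{i}_{kind}.csv").exists()
        for kind in ("topic_assignments", "topic_keywords")
    )
```

Inside the loop, `if already_done(i): continue` would then replace the index check, so an interrupted run could be restarted without editing the code.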


### View result

In [None]:
topic_model.get_topic(0)

In [None]:
freq = topic_model.get_topic_info()
freq.head(16)

In [None]:
doc_info = topic_model.get_document_info(docs)
doc_info.query("Topic==1")

In [None]:
all_topics = topic_model.get_topics()
df_all_topics = pd.DataFrame(all_topics)
df_all_topics
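Passing the topics dictionary straight to `pd.DataFrame` gives one column per topic, with `(word, score)` tuples as cell values. A long format is often easier to filter and export. The sketch below assumes `get_topics()` returns a dict that maps each topic id to a list of `(word, score)` pairs; the helper `topics_to_rows` is hypothetical and the example data is made up, not model output.

```python
def topics_to_rows(all_topics):
    """Flatten {topic_id: [(word, score), ...]} into one dict per (topic, word) pair."""
    return [
        {"topic": topic_id, "rank": rank, "word": word, "score": score}
        for topic_id, words in all_topics.items()
        for rank, (word, score) in enumerate(words, 1)
    ]


# Made-up example data for illustration only.
example = {0: [("笑", 0.12), ("主持", 0.08)], 1: [("選舉", 0.20)]}
rows = topics_to_rows(example)
print(rows[0])
```

`pd.DataFrame(topics_to_rows(all_topics))` would then give one row per topic keyword.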

In [None]:
# Visualize topics

topic_model.visualize_topics()

In [None]:
topic_model.visualize_barchart(top_n_topics=10, n_words=10, topics=range(10))