# 📚 BERTopic Labeling Comparison Framework

This notebook compares different methods for generating topic labels after clustering YouTube comments using BERTopic.

We will compare three labeling methods:
- BERTopic built-in labeling (c-TF-IDF based)
- KeyBERT keyword extraction
- Gemini (LLM) based labeling

We then:
- Compute pairwise Jaccard similarity between the label sets
- Visualize and interpret the results

In [None]:

%pip uninstall -y keras keras-nightly keras-preprocessing tf-keras tensorflow
# tokenizers is left unpinned: transformers 4.36.2 requires tokenizers>=0.14,<0.19,
# so pinning tokenizers==0.13.3 makes the install unresolvable (ResolutionImpossible).
%pip install keras==2.11.0 tf-keras transformers==4.36.2 sentence-transformers==2.2.2 bertopic==0.15.0 keybert==0.7.0 scikit-learn pandas matplotlib
# itertools is part of the Python standard library and cannot be installed from PyPI.
%pip install google-generativeai seaborn numpy

Note: you may need to restart the kernel to use updated packages.

In [133]:
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from keybert import KeyBERT
import matplotlib.pyplot as plt
import google.generativeai as genai
import seaborn as sns
import numpy as np

In [134]:
# Step 1: Load CSV
# Forward slashes avoid invalid escape sequences such as "\d" and "\y" and also work on Windows
df = pd.read_csv("../data/youtube_comments/jack_vs_calley_1000.csv")
texts = df["text"].dropna().astype(str).tolist()



In [135]:
# --- Step 2: BERTopic Clustering ---
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(texts)

In [136]:
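import os

# Read the API key from the environment instead of hardcoding it in the notebook.
# GEMINI_API_KEY is an assumed variable name; set it before starting the kernel.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])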

gemini_model = genai.GenerativeModel("gemini-2.0-flash-lite")

In [137]:
# --- Step 3: Custom Labelers ---
class TopicLabeler:
    def __init__(self, texts, topics):
        self.texts = texts
        self.topics = topics

    def label_with_bertopic(self, topic_model):
        return {topic: [word for word, _ in topic_model.get_topic(topic)] for topic in set(self.topics) if topic != -1}

    def label_with_keybert(self, embedding_model, top_n=5):
        kw_model = KeyBERT(model=embedding_model)
        labels = {}
        for topic in set(self.topics):
            if topic == -1:
                continue
            docs_in_topic = [text for text, t in zip(self.texts, self.topics) if t == topic]
            keywords = kw_model.extract_keywords(" ".join(docs_in_topic), top_n=top_n, stop_words='english')
            labels[topic] = [kw[0] for kw in keywords]
        return labels
    
    def label_with_gemini(self, model, max_words=5):
        labels = {}
        for topic in set(self.topics):
            if topic == -1:
                continue

            # Collect topic texts
            docs_in_topic = [text for text, t in zip(self.texts, self.topics) if t == topic]

            # Skip small topics
            if len(docs_in_topic) < 3:
                continue

            # Limit number of comments
            docs_in_topic = docs_in_topic[:5]

            # Limit each comment length (max 300 characters per comment)
            docs_in_topic = [text[:300] for text in docs_in_topic]

            # Prepare the prompt text
            docs_text = "\n".join(docs_in_topic)

            prompt = f"""
            You are given a group of YouTube comments that share a common topic.
            Provide up to {max_words} keywords or short phrases that best summarize the main topic of these comments.
            Comments:
            {docs_text}
            Return the keywords separated by commas only.
            """

            chat = model.start_chat()
            response = chat.send_message(prompt)
            keywords = response.text.strip().split(',')

            # Drop empty entries left by trailing commas or blank output
            labels[topic] = [kw.strip() for kw in keywords if kw.strip()]

        return labels

## Apply Labelers and Generate Topic Labels

In this step, we will generate the actual **labels** (keywords) for each topic identified by BERTopic.

Each labeler takes the set of documents (comments) in each topic and outputs a list of keywords that are supposed to best describe the topic.

### The Labelers:

1. **BERTopic Built-in Labeler**  
   - This is the default topic representation provided by BERTopic.
   - It uses class-based TF-IDF (c-TF-IDF) to rank the words that are most characteristic of each topic cluster.
   - Advantage: Simple, fast, and needs no extra model.
   - Limitation: Without stop-word removal (for example, a `CountVectorizer(stop_words="english")` passed as `vectorizer_model`), frequent function words such as "the" and "and" can dominate the labels.

2. **KeyBERT-based Labeler**  
   - Uses the `KeyBERT` library to extract keywords by computing the cosine similarity between candidate word embeddings and the embedding of the topic's concatenated documents.
   - Advantage: Leverages semantic information, so it can surface less frequent but semantically important keywords.
   - Limitation: May produce more specific labels that are harder to generalize.

3. **Gemini-based Labeler (LLM)**  
   - Uses Google's Gemini to generate keywords by prompting the LLM with a sample of the topic's documents (here, the first 5 comments, each truncated to 300 characters).
   - Advantage: Capable of generating human-like, context-aware labels that may capture abstract concepts.
   - Limitation: Slower and costlier (one API call per topic), sensitive to prompt design, and topics with fewer than 3 comments are skipped.

---

### What Are We Comparing?

We aim to compare:
- How similar the labels produced by each model are.
- Whether the models consistently agree on the most important words for a topic.
- Which method generates more meaningful, diverse, and human-friendly topic descriptions.

We will do this by:
1. Generating labels with all three models.
2. Computing **Jaccard Similarity** between every pair of models.
3. Visualizing and analyzing the similarity results.
4. Manually exploring topics and their generated labels.

This step is crucial to understand which labeling method is most suitable for our task.
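
For reference, the Jaccard similarity between two keyword sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the helper name `jaccard` and the toy labels are illustrative assumptions, not part of the notebook's data.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| between two keyword collections."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Two empty label sets have nothing to compare; return 0 instead of dividing by zero.
    return len(set_a & set_b) / len(union) if union else 0.0

# Toy labels for one topic (illustrative, not taken from the dataset).
labels_a = ["vaccine", "doctor", "health", "podcast"]
labels_b = ["doctor", "podcast", "pharma"]

print(jaccard(labels_a, labels_b))  # 2 shared keywords / 5 distinct keywords = 0.4
```

A score of 1 means both labelers produced exactly the same keywords, and 0 means they share none.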

In [140]:
# --- Step 4: Labeling ---
labeler = TopicLabeler(texts, topics)
bertopic_labels = labeler.label_with_bertopic(topic_model)
keybert_labels = labeler.label_with_keybert(embedding_model)
gemini_labels = labeler.label_with_gemini(gemini_model)


## Reflection on Similarity Analysis

While we initially focused on comparing the labelers using **Jaccard Similarity** between keyword sets, this approach alone did not provide sufficient insights into the actual performance or quality of the labelers.

### Why?

- Jaccard Similarity only measures **overlap** between the generated keywords, but it does not tell us:
    - How many topics were successfully labeled.
    - How diverse or informative the labels were.
    - Whether the models tend to generate long or short labels.
    - The uniqueness and variability across topics.

- In our case, we observed:
    - Very low Jaccard scores across most model pairs.
    - Inconsistent patterns that did not lead to clear conclusions.
    - For example, Gemini consistently showed low overlap, but this did not necessarily mean it produced poor labels.

---

## The Motivation for Model Metrics

To overcome the limitations of relying on Jaccard Similarity alone, we introduced additional metrics via the `model_metrics()` function:

### Added Metrics:
- **Effective Coverage**: how many topics received a sufficient number of keywords.
- **Average Label Length**: whether the generated labels are short and clear or long and verbose.
- **Unique Keywords**: how many distinct keywords the model generates across all topics.

These metrics allow us to:
1. Better characterize each labeler.
2. Identify trends beyond simple keyword overlap.
3. Make a more informed decision when selecting a labeler for our task.

> In real-world applications, a single similarity score is rarely enough.  
> Multiple complementary metrics are needed to properly evaluate and select models.
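
The body of `model_metrics()` is not shown in this notebook, so the following is only a sketch of how the three metrics could be computed. The signature, the `min_keywords` threshold, and the choice to measure label length as the number of keywords per topic are all assumptions.

```python
def model_metrics(labels, n_topics, min_keywords=3):
    """Summarize one labeler's output (hypothetical reconstruction).

    labels: dict mapping topic id -> list of keywords.
    n_topics: total number of non-outlier topics found by the clustering.
    min_keywords: assumed threshold for a topic to count as covered.
    """
    keyword_lists = list(labels.values())
    # Topics that received at least `min_keywords` keywords.
    covered = sum(1 for kws in keyword_lists if len(kws) >= min_keywords)
    total_keywords = sum(len(kws) for kws in keyword_lists)
    # Case-insensitive set of every keyword the labeler produced.
    unique = {kw.lower() for kws in keyword_lists for kw in kws}
    return {
        "effective_coverage": covered / n_topics if n_topics else 0.0,
        "avg_label_length": total_keywords / len(keyword_lists) if keyword_lists else 0.0,
        "unique_keywords": len(unique),
    }

# Toy input: 3 labeled topics out of 4 (illustrative only).
toy_labels = {
    0: ["doctor", "health", "podcast"],
    1: ["pharma"],
    2: ["Doctor", "ozempic", "diet", "sugar"],
}
metrics = model_metrics(toy_labels, n_topics=4)
print(metrics)
```

With the real data it could be called once per labeler, for example `model_metrics(gemini_labels, n_topics=len(set(topics) - {-1}))`.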


In [142]:
labels_dict = {"BERTopic": bertopic_labels, "KeyBERT": keybert_labels, "Gemini": gemini_labels}

## Limitation: The Challenge of Quantifying Labeling Quality

While we computed several numerical metrics such as Jaccard Similarity, Coverage, Average Label Length, and Unique Keywords, it is important to acknowledge that **measuring the quality of topic labels is inherently challenging**.

Why?
- These metrics capture certain aspects like overlap, variety, and quantity, but they do not fully capture:
    - The relevance of the labels.
    - The interpretability and usefulness of the labels for humans.
    - The semantic adequacy of the labels.

In practice, selecting the most suitable labeling approach often requires **human judgment**, as numerical metrics alone may not reflect how well the labels truly describe the topics.

Therefore, this analysis should be seen as a **preliminary quantitative evaluation**, which ideally should be complemented with a **qualitative (manual) inspection** of selected topics.


In [None]:
import random

def show_topic_full(topic_id, labels_dict):
    print(f"=== Topic {topic_id} ===\n")
    

    for model_name, model_labels in labels_dict.items():
        labels = model_labels.get(topic_id, [])
        print(f"--- {model_name} Labels ---")
        print(", ".join(labels) if labels else "No labels")
        print()
    

    print(f"--- All Texts in Topic {topic_id} ---")
    texts_in_topic = [text for text, t in zip(texts, topics) if t == topic_id]
    
    if not texts_in_topic:
        print("No texts found for this topic.")
    else:
        for i, text in enumerate(texts_in_topic, 1):
            print(f"{i}. {text}")


random_topic = random.choice(list(set(topics) - {-1}))
show_topic_full(random_topic, labels_dict)


=== Topic 8 ===

--- BERTopic Labels ---
the, and, to, of, dr, in, is, he, this, who

--- KeyBERT Labels ---
calley, calleys, calleyi, healthcare, medicine

--- Gemini Labels ---
Calley, Big Pharma, Ozempic, Dr. Jack, podcast

--- All Texts in Topic 8 ---
1. Calley how many more babies and parents have to die while youre busy with niceties? Thats the issue
2. I do credit Calle and his sister for opening the dark can of worms about the BigPharma and "Health Care" agenda over Ozempic, for one thing.
3. This podcast has to be one of the most spectacular events that ever happened in the world. All three persons and the host reached an epic monumental milestone moment at [158:22] minutes into the show. The fact that Calley was able to gather the grit to stay on with the challenging confrontation before him and have Jack and Mary segue into posturing the Jack Medical Bukele act into our governmental framework leveraging the use of state law, and avoiding the clash of Federal Law, is the best

In [151]:
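def compute_cluster_purity(labels_dict):
    """Average, per model, of (count of the most repeated keyword) / (number of keywords).

    Note: this is not cluster purity in the usual sense, which requires
    ground-truth classes. For a list of distinct keywords the score is simply
    1 / len(keywords), so it mostly reflects label length.
    """
    purities = []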

    for model_name, model_labels in labels_dict.items():
        model_purities = []

        for topic_id in model_labels.keys():
            keywords = model_labels[topic_id]
            if len(keywords) == 0:
                purity = 0
            else:
                keyword_counts = pd.Series(keywords).value_counts()
                purity = keyword_counts.max() / len(keywords)
            model_purities.append(purity)
        
        avg_purity = np.mean(model_purities)
        purities.append({
            "Model": model_name,
            "Average Purity": round(avg_purity, 3)
        })
    
    return pd.DataFrame(purities)

purity_df = compute_cluster_purity(labels_dict)
display(purity_df)

| | Model | Average Purity |
|---|---|---|
| 0 | BERTopic | 0.1 |
| 1 | KeyBERT | 0.2 |
| 2 | Gemini | 0.2 |
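
The values above are easier to interpret once the formula in `compute_cluster_purity` is applied to a single topic. The sketch below recomputes the same ratio for the Topic 8 labels printed earlier; `top_keyword_share` is a hypothetical helper name. Because every keyword in a label list is distinct, the score reduces to 1 / (number of keywords), so 0.1 versus 0.2 most likely reflects BERTopic returning 10 words per topic and the other labelers 5, rather than a difference in label quality.

```python
from collections import Counter

def top_keyword_share(keywords):
    """Count of the most frequent keyword divided by the list length."""
    if not keywords:
        return 0.0
    return max(Counter(keywords).values()) / len(keywords)

# Topic 8 labels as printed by show_topic_full above.
bertopic_topic8 = ["the", "and", "to", "of", "dr", "in", "is", "he", "this", "who"]
keybert_topic8 = ["calley", "calleys", "calleyi", "healthcare", "medicine"]

print(top_keyword_share(bertopic_topic8))  # 0.1 (10 distinct keywords)
print(top_keyword_share(keybert_topic8))   # 0.2 (5 distinct keywords)
```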


In [152]:
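def compute_label_stability(labels_dict):
    """Average Jaccard overlap between each pair of labelers over their shared topics.

    Note: this measures agreement between models, not stability across runs,
    and keywords are compared as exact, case-sensitive strings.
    """
    rows = []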

    models = list(labels_dict.keys())

    for i in range(len(models)):
        for j in range(i+1, len(models)):
            model1 = models[i]
            model2 = models[j]

            common_topics = set(labels_dict[model1].keys()) & set(labels_dict[model2].keys())
            if not common_topics:
                continue

            jaccard_scores = []

            for topic in common_topics:
                l1 = set(labels_dict[model1][topic])
                l2 = set(labels_dict[model2][topic])
                score = len(l1 & l2) / len(l1 | l2) if l1 | l2 else 0
                jaccard_scores.append(score)

            avg_stability = np.mean(jaccard_scores)
            rows.append({
                "Model 1": model1,
                "Model 2": model2,
                "Stability (Avg Jaccard)": round(avg_stability, 3)
            })

    return pd.DataFrame(rows)

# --- Run ---
stability_df = compute_label_stability(labels_dict)
print("=== Label Stability ===")
display(stability_df)


=== Label Stability ===


| | Model 1 | Model 2 | Stability (Avg Jaccard) |
|---|---|---|---|
| 0 | BERTopic | KeyBERT | 0.092 |
| 1 | BERTopic | Gemini | 0.009 |
| 2 | KeyBERT | Gemini | 0.045 |
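
Part of Gemini's low overlap may come from string formatting rather than content, because the comparison treats keywords as exact strings. The sketch below uses the Topic 8 labels printed earlier to show the effect of lowercasing before comparing; `jaccard_normalized` is a hypothetical helper, not a function from the notebook.

```python
def jaccard_normalized(a, b):
    """Jaccard similarity after trimming and lowercasing each keyword."""
    set_a = {kw.strip().lower() for kw in a}
    set_b = {kw.strip().lower() for kw in b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Topic 8 labels as printed by show_topic_full above.
keybert_topic8 = ["calley", "calleys", "calleyi", "healthcare", "medicine"]
gemini_topic8 = ["Calley", "Big Pharma", "Ozempic", "Dr. Jack", "podcast"]

# Exact-match Jaccard, as computed in compute_label_stability.
raw = len(set(keybert_topic8) & set(gemini_topic8)) / len(set(keybert_topic8) | set(gemini_topic8))
normalized = jaccard_normalized(keybert_topic8, gemini_topic8)

print(raw)                   # 0.0: "calley" and "Calley" are different strings
print(round(normalized, 3))  # 0.111: 1 shared keyword / 9 distinct keywords
```

Multi-word phrases such as "Big Pharma" still cannot match single-word keywords, so normalization only partly closes the gap.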


In [158]:
def final_model_ranking(purity_df, stability_df):
    # Compute the average Stability for each model
    stabilities = []
    for model in purity_df["Model"]:
        model_stabilities = []
        for _, row in stability_df.iterrows():
            if row["Model 1"] == model or row["Model 2"] == model:
                model_stabilities.append(row["Stability (Avg Jaccard)"])
        avg_stability = np.mean(model_stabilities) if model_stabilities else 0
        stabilities.append(round(avg_stability, 3))

    # Add the column to a copy so the caller's purity_df is not modified
    ranking = purity_df.copy()
    ranking["Stability"] = stabilities

    # --- Weights (adjust as needed) ---
    w1 = 0.5  # Purity
    w2 = 0.5  # Stability

    # Compute the final score
    ranking["Final Score"] = (
        w1 * ranking["Average Purity"] +
        w2 * ranking["Stability"]
    )

    return ranking.sort_values("Final Score", ascending=False)

# --- Run ---
ranking_df = final_model_ranking(purity_df, stability_df)
print("=== Final Model Ranking ===")
display(ranking_df)


=== Final Model Ranking ===


| | Model | Average Purity | Stability | Final Score |
|---|---|---|---|---|
| 1 | KeyBERT | 0.2 | 0.068 | 0.134 |
| 2 | Gemini | 0.2 | 0.027 | 0.1135 |
| 0 | BERTopic | 0.1 | 0.05 | 0.075 |
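
The ranking depends on the chosen weights. As a rough sensitivity check, the sketch below recomputes the final score from the values in the table above under two weightings; the `rank` helper and the second weighting are illustrative assumptions.

```python
# Values copied from the ranking table above.
scores = {
    "KeyBERT": {"purity": 0.2, "stability": 0.068},
    "Gemini": {"purity": 0.2, "stability": 0.027},
    "BERTopic": {"purity": 0.1, "stability": 0.05},
}

def rank(scores, w_purity, w_stability):
    """Return model names sorted by the weighted score, best first."""
    final = {
        model: w_purity * s["purity"] + w_stability * s["stability"]
        for model, s in scores.items()
    }
    return sorted(final, key=final.get, reverse=True)

print(rank(scores, 0.5, 0.5))  # ['KeyBERT', 'Gemini', 'BERTopic']
print(rank(scores, 0.0, 1.0))  # ['KeyBERT', 'BERTopic', 'Gemini']
```

With equal weights Gemini ranks above BERTopic only because of the purity term, and the order flips when stability alone is used.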
