# Topic Modelling using BERTopic
In this notebook, BERTopic was used to perform topic modeling on a subset of the Reddit Climate Change dataset, sourced from https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset. To select an appropriate embedding model, we consulted the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard), prioritizing models that balance performance and efficiency. After constructing and fitting a custom BERTopic pipeline, the Groq API was integrated to generate more descriptive and human-readable topic labels, enhancing the interpretability of the resulting topics.

**GCP Cluster specifications used**:
gcloud dataproc clusters create st446-cluster-gp2 \
  --enable-component-gateway \
  --public-ip-address \
  --region europe-west1 \
  --master-machine-type n2-standard-16 \
  --master-boot-disk-size 100 \
  --num-workers 2 \
  --worker-machine-type n2-standard-2 \
  --worker-boot-disk-size 200 \
  --image-version 2.2-debian12 \
  --optional-components JUPYTER \
  --metadata 'PIP_PACKAGES=sklearn nltk pandas numpy' \
  --project st446wt2025

## Notebook configurations and data loading

In [1]:
# Install required packages
!pip install bertopic sentence-transformers transformers umap-learn hdbscan
#!pip install --upgrade bertopic
!pip install gensim
!pip install bertopic[spacy]
!python -m spacy download en_core_web_sm
!pip install groq
!pip install 'huggingface_hub[hf_xet]'

Collecting bertopic
  Downloading bertopic-0.17.0-py3-none-any.whl.metadata (23 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting transformers
  Downloading transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting umap-learn
  Downloading umap_learn-0.5.7-py3-none-any.whl.metadata (21 kB)
Collecting hdbscan
  Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting plotly>=4.7.0 (from bertopic)
  Downloading plotly-6.0.1-py3-none-any.whl.metadata (6.7 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Downloading torch-2.7.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (29 kB)
Collecting huggingface-hub>=0.20.0 (from sentence-transformers)
  Downloading huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.1-cp39-abi3-manylinux_2_17_x86_64.manylinux2014

Downloading bertopic-0.17.0-py3-none-any.whl (150 kB)
Downloading sentence_transformers-4.1.0-py3-none-any.whl (345 kB)
Downloading transformers-4.51.3-py3-none-any.whl (10.4 MB)
Downloading umap_learn-0.5.7-py3-none-any.whl (88 kB)
Downloading hdbscan-0.8.40-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.6 MB)
Downloading hugg

Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.1.0-py3-none-any.whl.metadata (24 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Downloading wrapt-1.17.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.4 kB)
Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
Downloading smart_open-7.1.0-py3-none-any.whl (61 kB)
Downloading wrapt-1.17.2-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (83 kB)
Install

Collecting blis<1.4.0,>=1.3.0 (from thinc<8.4.0,>=8.3.4->spacy>=3.0.1->bertopic[spacy])
  Downloading blis-1.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting confection<1.0.0,>=0.0.1 (from thinc<8.4.0,>=8.3.4->spacy>=3.0.1->bertopic[spacy])
  Downloading confection-0.1.5-py3-none-any.whl.metadata (19 kB)
INFO: pip is looking at multiple versions of thinc to determine which version is compatible with other requirements. This could take a while.
Collecting thinc<8.4.0,>=8.3.4 (from spacy>=3.0.1->bertopic[spacy])
  Downloading thinc-8.3.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting blis<1.3.0,>=1.2.0 (from thinc<8.4.0,>=8.3.4->spacy>=3.0.1->bertopic[spacy])
  Downloading blis-1.2.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting shellingham>=1.3.0 (from typer<1.0.0,>=0.3.0->spacy>=3.0.1->bertopic[spacy])
  Downloading shellingham-1.5.4-py2.py3-none-any.whl.metadata

Downloading spacy-3.8.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (30.6 MB)
Downloading catalogue-2.0.10-py3-none-any.whl (17 kB)
Downloading cymem-2.0.11-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (218 kB)
Downloading langcodes-3.5.0-py3-none-any.whl (182 kB)
Downloading murmurhash-1.0.12-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (134 kB)
Downloading preshed-3.0.9-cp311-cp311-manyli

  Downloading hf_xet-1.1.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (494 bytes)
Downloading hf_xet-1.1.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53.6 MB)
Installing collected packages: hf-xet
Successfully installed hf-xet-1.1.0

In [7]:
# Download the Kaggle dataset zip
!curl -L -o climate.zip \
    "https://www.kaggle.com/api/v1/datasets/download/pavellexyr/the-reddit-climate-change-dataset"

# Unzip it
!unzip -o climate.zip

# Remove any old copy in HDFS and put the comments file there
!hadoop fs -rm -f /the-reddit-climate-change-dataset-comments.csv
!hadoop fs -put the-reddit-climate-change-dataset-comments.csv /

# Verify upload
!hadoop fs -ls /

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 1536M  100 1536M    0     0  42.4M      0  0:00:36  0:00:36 --:--:-- 42.8M
Archive:  climate.zip
  inflating: the-reddit-climate-change-dataset-comments.csv  
  inflating: the-reddit-climate-change-dataset-posts.csv  
Deleted /the-reddit-climate-change-dataset-comments.csv
Found 4 items
-rw-r--r--   2 root hadoop 4111000325 2025-05-03 11:31 /the-reddit-climate-change-dataset-comments.csv
drwxrwxrwt   - hdfs hadoop          0 2025-05-03 11:17 /tmp
drwxrwxrwt   - hdfs hadoop          0 2025-05-03 11:18 /user
drwxrwxrwt   - hdfs hadoop          0 2025-05-03 11:17 /var


In [2]:
# Import libraries used in this notebook
import zipfile
import sys
import os
import re
import hashlib
from datetime import datetime
import time
import numpy as np
import pandas as pd
import string
import spacy
import groq
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
from sentence_transformers import SentenceTransformer, models
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import CoherenceModel
from gensim.corpora import Dictionary
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, PartOfSpeech, OpenAI
from huggingface_hub import HfFileSystem
from collections import Counter
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [3]:
# Set HDFS path where file is saved
comments_path = "hdfs://st446-cluster-gp2-m:8020/the-reddit-climate-change-dataset-comments.csv"

In [4]:
# Define the schema to read the comments file

schema = StructType([
    StructField("type",           StringType(), True),
    StructField("id",             StringType(), True),
    StructField("subreddit.id",   StringType(), True),
    StructField("subreddit.name", StringType(), True),
    StructField("subreddit.nsfw", StringType(), True),
    StructField("created_utc",    StringType(), True),
    StructField("permalink",      StringType(), True),
    StructField("body",           StringType(), True),
    StructField("sentiment",      DoubleType(), True),
    StructField("score",          IntegerType(),True)
])

df = spark.read \
    .option("header", "true") \
    .option("multiLine", "true") \
    .option("escape", "\"") \
    .schema(schema) \
    .csv(comments_path)

df.printSchema()
df.show(5)

root
 |-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- subreddit.id: string (nullable = true)
 |-- subreddit.name: string (nullable = true)
 |-- subreddit.nsfw: string (nullable = true)
 |-- created_utc: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- body: string (nullable = true)
 |-- sentiment: double (nullable = true)
 |-- score: integer (nullable = true)



                                                                                

+-------+-------+------------+--------------+--------------+-----------+--------------------+--------------------+---------+-----+
|   type|     id|subreddit.id|subreddit.name|subreddit.nsfw|created_utc|           permalink|                body|sentiment|score|
+-------+-------+------------+--------------+--------------+-----------+--------------------+--------------------+---------+-----+
|comment|imlddn9|       2qh3l|          news|         false| 1661990368|https://old.reddi...|Yeah but what the...|   0.5719|    2|
|comment|imldbeh|       2qn7b|          ohio|         false| 1661990340|https://old.reddi...|Any comparison of...|  -0.9877|    2|
|comment|imldado|       2qhma|    newzealand|         false| 1661990327|https://old.reddi...|I'm honestly wait...|  -0.1143|    1|
|comment|imld6cb|       2qi09|    sacramento|         false| 1661990278|https://old.reddi...|Not just Sacramen...|      0.0|    4|
|comment|imld0kj|       2qh1i|     askreddit|         false| 1661990206|https://old

## BERTopic pipeline

BERTopic works with different submodels that can be changed and tuned:
1. **Embedding models**: Taking into account our limited computational power and the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard), we chose the embedding model 'all-MiniLM-L6-v2'. Bigger models such as 'BAAI/bge-base-en-v1.5' might yield better results, but we were unable to run them on a reasonable subset of comments: our cluster's master node is an n2-standard-16 machine, and we have no access to a GPU because we are using a free-tier Google Cloud account.

For the additional submodels in BERTopic, we followed the best practices from the official documentation: https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html#additional-representations

2. **Representation Models**: For representation models, we used two models side by side: KeyBERTInspired and Part of Speech (POS). Because they are passed to BERTopic as a dictionary, each model's keywords are stored as a separate, named representation of every topic alongside the default c-TF-IDF keywords, rather than being merged into a single keyword list.
3. **Vectorizer Models**: As BERTopic does not perform any preprocessing of the documents (e.g. tokenization, stopword removal, lemmatization), CountVectorizer is applied after documents are assigned to topics to remove stopwords, ignore infrequent words and increase the n-gram range.
4. **Dimensionality Reduction Models**: UMAP is used by default in BERTopic to reduce the dimensionality of the embeddings. To make the results reproducible, we specify the model explicitly and set a random state to control its stochastic behaviour.
5. **Cluster Model**: The cluster model is HDBSCAN by default. Its `min_cluster_size` parameter indirectly controls the number of topics that will be created. We set it to 50 to avoid the creation of too many small clusters.
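To illustrate how keywords from several representation models can be inspected separately, here is a minimal sketch. It assumes a BERTopic version that exposes the `topic_aspects_` attribute (a dictionary keyed by representation name, then by topic id). The helper `aspect_keywords` and the toy keyword lists are hypothetical and only stand in for a fitted model.

```python
from types import SimpleNamespace


def aspect_keywords(topic_model, aspect, topic_id, top_n=5):
    """Return the top_n keywords stored for one named representation of one topic."""
    words_scores = topic_model.topic_aspects_[aspect][topic_id]
    return [word for word, _ in words_scores[:top_n]]


# Stand-in for a fitted model (assumption): the words and scores are invented.
toy_model = SimpleNamespace(
    topic_aspects_={
        "KeyBERT": {0: [("nuclear power", 0.61), ("reactors", 0.55), ("energy", 0.40)]},
        "POS": {0: [("nuclear", 0.09), ("power", 0.08), ("plants", 0.05)]},
    }
)

for aspect in ("KeyBERT", "POS"):
    print(aspect, aspect_keywords(toy_model, aspect, topic_id=0, top_n=2))
```

With a real fitted model, the same helper could be called as `aspect_keywords(topic_model, "KeyBERT", 0)`, where the aspect names match the keys of the representation dictionary.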

The following metrics were used to evaluate model performance: 
**(I) Topic Modeling Evaluation Metrics**, including C\_V, U\_Mass and C\_NPMI coherence, topic size imbalance and topic diversity, and **(II) Metrics for Distributed Computing**, including runtime and dataset size.
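
For orientation, two of the coherence measures can be written down in their commonly cited form. This is a sketch of the usual definitions, not a description of gensim's exact implementation, which may differ in details such as segmentation, windowing and the smoothing constant $\epsilon$. Here $w_1, \dots, w_N$ are the top words of a topic, $D(\cdot)$ counts the documents containing the given words, and $P(\cdot)$ is the corresponding estimated probability.

$$
C_{\mathrm{UMass}} = \frac{2}{N(N-1)} \sum_{i=2}^{N} \sum_{j=1}^{i-1} \log \frac{D(w_i, w_j) + \epsilon}{D(w_j)}
$$

$$
\mathrm{NPMI}(w_i, w_j) = \frac{\log \dfrac{P(w_i, w_j) + \epsilon}{P(w_i)\,P(w_j)}}{-\log\bigl(P(w_i, w_j) + \epsilon\bigr)}
$$

C\_NPMI averages NPMI over word pairs and lies between -1 and 1, with values near 0 indicating that the top words co-occur about as often as chance would predict. U\_Mass is typically negative, and values closer to 0 indicate higher coherence.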


We tried different hyperparameter configurations for UMAP and HDBSCAN on a subset of 10,000 comments and then chose the best-performing combination, based on the evaluation metrics, to be run on a subset of 100,000 comments.
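
A search over configurations of this kind could be organised roughly as follows. This is a minimal sketch: the search space values, the helper names and the `dummy_evaluate` scorer are all hypothetical and are not the settings or scores from this notebook. A real run would fit BERTopic with each configuration and return the coherence and diversity metrics instead.

```python
from itertools import product

# Hypothetical search space (assumption): illustrative values only.
search_space = {
    "n_neighbors": [10, 15, 30],
    "n_components": [5, 10],
    "min_cluster_size": [50, 100, 150],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def select_best(results, metric="c_v"):
    """Return the (config, metrics) pair with the highest value of `metric`."""
    return max(results, key=lambda item: item[1][metric])


def dummy_evaluate(config):
    """Placeholder scorer so the sketch runs end to end; it does not fit a model."""
    penalty = abs(config["n_neighbors"] - 15) + abs(config["min_cluster_size"] - 100) / 50
    return {"c_v": 1.0 / (1.0 + penalty)}


results = [(config, dummy_evaluate(config)) for config in iter_configs(search_space)]
best_config, best_metrics = select_best(results)
print(f"{len(results)} configurations evaluated, best: {best_config}")
```

Selecting on a single metric is a simplification. Because coherence, diversity and size imbalance can pull in different directions, a weighted score or a manual comparison of the metrics table may be more appropriate.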

In [30]:
# Select a subset of comments from the dataframe
documents = df.select("body").limit(10000).rdd.flatMap(lambda x: x).collect()

# Keep only comments with at least 5 whitespace-separated tokens (null bodies are skipped)
documents_clean = [doc for doc in documents if doc and len(doc.split()) >= 5]

In [31]:
# Pre-calculate embeddings to avoid recalculating each time
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
MiniLM_embeddings = embedding_model.encode(documents_clean, show_progress_bar=True)

Batches:   0%|          | 0/307 [00:00<?, ?it/s]

In [5]:
# Define function to compute coherence score
def compute_coherence_score(documents, topic_model, top_n_words=10):
    tokenized_documents = [doc.split() for doc in documents]
    dictionary = Dictionary(tokenized_documents)
    corpus = [dictionary.doc2bow(text) for text in tokenized_documents]

    topic_words = []
    for topic_id, topic in topic_model.get_topics().items():
        if topic_id == -1:
            continue
        words = [word for word, _ in topic[:top_n_words]]
        topic_words.append(words)

    coherence_model = CoherenceModel(
        topics=topic_words,
        texts=tokenized_documents,
        dictionary=dictionary,
        coherence='c_v'
    )
    coherence_score = coherence_model.get_coherence()
    print(f"Topic Coherence (C_V): {coherence_score:.4f}")
    return coherence_score

# Other Coherence metrics
def calculate_umass_npmi(documents, topic_model):
    # Step 1: Tokenize your documents
    tokenized_documents = [doc.split() for doc in documents]

    # Step 2: Create a dictionary and corpus
    dictionary = Dictionary(tokenized_documents)
    corpus = [dictionary.doc2bow(text) for text in tokenized_documents]

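    # Step 3: Extract topics from BERTopic
    # Note: unlike compute_coherence_score, this loop keeps the outlier topic (-1)
    # and uses every keyword returned for a topic
    topic_words = []
    for topic in topic_model.get_topics().values():
        topic_words.append([word for word, _ in topic])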
        
    # Calculate UMASS coherence
    BERTopic_umass = CoherenceModel(
        topics=topic_words,
        texts=tokenized_documents,
        dictionary=dictionary,
        coherence="u_mass"
    ).get_coherence()

    print(f"BERTopic U_Mass coherence  = {BERTopic_umass:.4f}")
    
    # Calculate C_NPMI coherence
    BERTopic_npmi = CoherenceModel(
        topics=topic_words,
        texts=tokenized_documents,
        dictionary=dictionary,
        coherence="c_npmi"
    ).get_coherence()

    print(f"BERTopic C_NPMI coherence  = {BERTopic_npmi:.4f}")
    return BERTopic_umass, BERTopic_npmi

# Define function to calculate Topic Diversity
def calculate_topic_diversity(topic_model, top_n_words=10):
    # 1. Pull top words per topic
    topics = topic_model.get_topics()
    
    top_words_per_topic = []
    for topic_id, words_scores in topics.items():
        # Skip outlier topic (-1)
        if topic_id == -1:
            continue
        words = [word for word, _ in words_scores[:top_n_words]]
        top_words_per_topic.append(words)
    
    # 2. Flatten and count uniques
    all_top_words = [word for topic in top_words_per_topic for word in topic]
    unique_words = set(all_top_words)

    # 3. Compute diversity
    diversity = len(unique_words) / len(all_top_words)

    print(f"BERTopic diversity = {diversity:.4f}  "
          f"({len(unique_words)} unique of {len(all_top_words)} total words)")
    
    return diversity, len(unique_words), len(all_top_words)

# Define function to calculate topic size imbalance
def calculate_topic_size_imbalance(topics):
    # Remove noise topic (-1)
    filtered_topics = [topic for topic in topics if topic != -1]

    # Count documents per topic
    topic_counts = Counter(filtered_topics)

    if len(topic_counts) <= 1:
        print("Not enough topics to compute imbalance.")
        return None

    max_size = max(topic_counts.values())
    min_size = min(topic_counts.values())
    imbalance = max_size / min_size

    print(f"BERTopic Topic size imbalance (max/min): {imbalance:.2f}")
    return imbalance

# Define pipeline functions to get topic info and calculate coherence score for different embeddings

def pipeline(embeddings):
    # Set seed to avoid randomness in UMAP dimensionality reduction
    umap_model = UMAP(n_neighbors=15, n_components=10, min_dist=0.0, metric='cosine', random_state=42)
    
    # Use HDBSCAN model to control the number of topics
    hdbscan_model = HDBSCAN(min_cluster_size=50, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
    
    # Preprocess the topic representations after documents are assigned to topics to not influence the clustering process
    vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
    
    # Add representations
    # KeyBERT
    keybert_model = KeyBERTInspired()

    # Part-of-Speech
    nlp = spacy.load("en_core_web_sm")
    pos_model = PartOfSpeech(nlp)


    # Ensemble representation model
    representation_model = {
        "KeyBERT": keybert_model,
        "POS": pos_model
    }
    
    topic_model = BERTopic(

      # Pipeline models
      embedding_model=embedding_model,
      umap_model=umap_model,
      hdbscan_model=hdbscan_model,
      vectorizer_model=vectorizer_model,
      representation_model=representation_model,

      # Hyperparameters
      top_n_words=10,
      verbose=True
    )

    # Train model
    topics, probs = topic_model.fit_transform(documents_clean, embeddings)
    
    # Compute metrics
    cv_final = compute_coherence_score(documents=documents_clean, topic_model=topic_model)
    umass_final, npmi_final = calculate_umass_npmi(documents=documents_clean, topic_model=topic_model)
    topic_diversity_final = calculate_topic_diversity(topic_model=topic_model, top_n_words=10)
    topic_size_final = calculate_topic_size_imbalance(topics)
    
    return topic_model

In [36]:
# Apply pipeline to 'all-MiniLM-L6-v2' embedding
topic_model = pipeline(MiniLM_embeddings)

# Print number of topics that BERTopic created
num_topics = len([t for t in topic_model.get_topics().keys() if t != -1])
print(f"Number of topics (excluding noise): {num_topics}")

# Get topic info DataFrame
topic_info = topic_model.get_topic_info()

# Exclude the noise topic (-1) and get top 5 by count
top_topics = topic_info[topic_info.Topic != -1].head(5)

# Loop through top 5 topics
for _, row in top_topics.iterrows():
    topic_id = row["Topic"]
    topic_name = row["Name"]
    keywords = topic_model.get_topic(topic_id)
    
    print(f"{topic_name}:")
    print(", ".join([word for word, _ in keywords]))
    print()

2025-05-03 12:33:24,615 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-03 12:33:43,719 - BERTopic - Dimensionality - Completed ✓
2025-05-03 12:33:43,720 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-03 12:33:44,419 - BERTopic - Cluster - Completed ✓
2025-05-03 12:33:44,423 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-05-03 12:34:04,883 - BERTopic - Representation - Completed ✓


Topic Coherence (C_V): 0.5134
BERTopic U_Mass coherence  = -3.7624
BERTopic C_NPMI coherence  = -0.0140
BERTopic diversity = 0.6571  (184 unique of 280 total words)
BERTopic Topic size imbalance (max/min): 8.22
Number of topics (excluding noise): 28
0_party_republicans_democrats_biden:
party, republicans, democrats, biden, people, republican, change, like, rights, right

1_nuclear_power_energy_nuclear power:
nuclear, power, energy, nuclear power, climate, germany, change, climate change, plants, gas

2_florida_insurance_fraud_climate change:
florida, insurance, fraud, climate change, insurance companies, climate, change, hurricane, state, hurricanes

3_cars_car_evs_ev:
cars, car, evs, ev, electric, vehicles, climate, change, energy, solar

4_pakistan_https_die_floods:
pakistan, https, die, floods, india, com, climate, www, https www, country



As the results show, the model identified 28 topics from the Reddit climate dataset (excluding noise). The topic coherence score (C_V = 0.5134) suggests that the topics are moderately interpretable, while the U_Mass and C_NPMI scores (-3.7624 and -0.0140) indicate weaker word co-occurrence, which might be due to the informal and varied language used on Reddit. The topic diversity score of 0.6571 (184 unique out of 280 words) shows a fair amount of variation in topic descriptors, though some overlap between topics remains. The topic size imbalance of 8.22 means that the largest topic contains about eight times as many comments as the smallest, but the distribution is still manageable. Looking at the top topics, the model seems to have picked up on relevant themes such as political discussions, nuclear energy, insurance issues in Florida, electric vehicles, and climate-related events in Pakistan, showing that it can capture a broad range of climate-related conversations.
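
To make the diversity and imbalance figures concrete, the following sketch recomputes both quantities on a tiny invented example and then checks the reported diversity against the counts printed above. The helper names and toy inputs are hypothetical. The formulas mirror `calculate_topic_diversity` and `calculate_topic_size_imbalance` defined earlier.

```python
from collections import Counter


def topic_diversity(top_words_per_topic):
    """Fraction of unique words among all top words across topics."""
    all_words = [word for words in top_words_per_topic for word in words]
    return len(set(all_words)) / len(all_words)


def size_imbalance(assignments):
    """Ratio of the largest to the smallest topic size, ignoring outliers (-1)."""
    counts = Counter(topic for topic in assignments if topic != -1)
    return max(counts.values()) / min(counts.values())


# Toy inputs (assumption): invented for illustration.
toy_topics = [
    ["nuclear", "power", "energy"],
    ["cars", "electric", "energy"],
]
toy_assignments = [0, 0, 0, 0, 1, 1, -1]

print(topic_diversity(toy_topics))      # 5 unique words out of 6
print(size_imbalance(toy_assignments))  # topic 0 has 4 documents, topic 1 has 2

# The reported diversity follows from the counts in the output above:
# 28 topics x 10 top words = 280 words, of which 184 are unique.
print(round(184 / 280, 4))
```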

## Run BERTopic with MiniLM embeddings on 100,000 Comments

In [6]:
# Select a subset of comments from the dataframe
documents = df.select("body").limit(100000).rdd.flatMap(lambda x: x).collect()

# Keep only comments with at least 5 whitespace-separated tokens (null bodies are skipped)
documents_clean = [doc for doc in documents if doc and len(doc.split()) >= 5]

# Pre-calculate embeddings to avoid recalculating each time
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
MiniLM_embeddings = embedding_model.encode(documents_clean, show_progress_bar=True)

                                                                                

Batches:   0%|          | 0/3077 [00:00<?, ?it/s]

In [7]:
# Calculate size of data used
total_size_bytes = sum(sys.getsizeof(doc) for doc in documents_clean)
size_gb = total_size_bytes / (1024 ** 3)
print(f"Approx size: {size_gb:.2f} GB")

Approx size: 0.09 GB
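
Note that `sys.getsizeof` reports the in-memory size of each Python `str` object, including per-object overhead, so the figure above is not the same as the size of the raw text. The sketch below compares it with the UTF-8 byte count on a few invented strings. The helper names and toy documents are hypothetical.

```python
import sys


def python_object_size(docs):
    """Total size of the str objects as reported by sys.getsizeof."""
    return sum(sys.getsizeof(doc) for doc in docs)


def utf8_size(docs):
    """Total number of bytes needed to store the texts as UTF-8."""
    return sum(len(doc.encode("utf-8")) for doc in docs)


# Toy documents (assumption): the last one contains two non-ASCII characters.
toy_docs = ["climate change", "nuclear power", "\u00e9lectricit\u00e9"]

print(f"In-memory size: {python_object_size(toy_docs)} bytes")
print(f"UTF-8 size:     {utf8_size(toy_docs)} bytes")
```

For short comments the per-object overhead can dominate, so the UTF-8 count is usually the closer proxy for how much data is read from disk.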


In [8]:
%%time
# Apply pipeline to 'all-MiniLM-L6-v2' embedding 
topic_model = pipeline(MiniLM_embeddings)

2025-05-03 14:29:22,472 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-05-03 14:31:44,419 - BERTopic - Dimensionality - Completed ✓
2025-05-03 14:31:44,423 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-05-03 14:31:55,151 - BERTopic - Cluster - Completed ✓
2025-05-03 14:31:55,167 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-05-03 14:33:34,827 - BERTopic - Representation - Completed ✓


Topic Coherence (C_V): 0.5463
BERTopic U_Mass coherence  = -6.1638
BERTopic C_NPMI coherence  = 0.0223
BERTopic diversity = 0.7381  (1299 unique of 1760 total words)
BERTopic Topic size imbalance (max/min): 50.30
CPU times: user 7min 40s, sys: 1min 21s, total: 9min 1s
Wall time: 5min 32s


In [9]:
# Print number of topics that BERTopic created
num_topics = len([t for t in topic_model.get_topics().keys() if t != -1])
print(f"Number of topics (excluding noise): {num_topics}")

# Get topic info DataFrame
topic_info = topic_model.get_topic_info()

# Exclude the noise topic (-1) and get top 5 by count
top_topics = topic_info[topic_info.Topic != -1].head(5)

# Loop through top 5 topics
for _, row in top_topics.iterrows():
    topic_id = row["Topic"]
    topic_name = row["Name"]
    keywords = topic_model.get_topic(topic_id)
    
    print(f"{topic_name}:")
    print(", ".join([word for word, _ in keywords]))
    print()

Number of topics (excluding noise): 176
0_climate change_change_climate_change climate:
climate change, change, climate, change climate, global warming, warming, global, warming climate, change global, solve climate

1_peterson_climate change_climate_change:
peterson, climate change, climate, change, jordan, guy, just, like, jordan peterson, doesn

2_meat_animal_vegan_animals:
meat, animal, vegan, animals, eat, eating, agriculture, veganism, food, diet

3_kids_children_child_kid:
kids, children, child, kid, having, having kids, want, life, world, don

4_labor_greens_australia_party:
labor, greens, australia, party, government, labour, uk, election, brexit, gt



When scaling the BERTopic model to 100,000 Reddit comments, the number of identified topics increased substantially from 28 to 176, reflecting the broader thematic range present in the larger dataset. The C_V coherence score improved slightly to 0.5463, and topic diversity rose to 0.7381 (1299 unique of 1760 words), suggesting that the model captured a wide variety of distinct concepts. However, the topic size imbalance increased drastically to 50.30, implying that a few dominant topics absorbed a disproportionate number of documents, likely due to clustering challenges at this scale. U_Mass (-6.1638) and C_NPMI (0.0223) coherence remained relatively low, consistent with the 10,000-comment run and possibly reflecting the noisy, informal nature of Reddit language. Nevertheless, the top-ranked topics remain interpretable and thematically focused, covering general climate change discourse, public figures such as Jordan Peterson, diet and agriculture, family planning, and Australian and UK party politics, indicating that the model remains effective at surfacing diverse climate-related narratives even at higher volumes.

## BERTopic with 'all-MiniLM-L6-v2' embeddings and LLM (Groq llama3-70b-8192) representation model
As the output above shows, the topic labels assigned by the BERTopic pipeline are simply a concatenation of each topic's keywords, including repetitions of the same terms. To improve human interpretability and produce cleaner topic labels, we prompt llama3-70b-8192 through the Groq API to generate topic labels.

In [29]:
# Read the Groq API key from the environment instead of hard-coding it in the notebook
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
client = groq.Groq(api_key=GROQ_API_KEY)

# Define prompt template
prompt_template = """
I have a topic that contains the following documents:
{documents}

The topic is described by the following keywords: {keywords}

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""

# Request a topic label from Groq and extract it from the response
def get_groq_label(documents, keywords):
    prompt = prompt_template.format(
        documents=documents,
        keywords=", ".join(keywords)
    )
    response = client.chat.completions.create(
        model="llama3-70b-8192",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    raw_text = response.choices[0].message.content.strip()
    match = re.search(r"topic:\s*(.+)", raw_text, re.IGNORECASE)
    return match.group(1).strip() if match else raw_text

# Get top 10 topics (excluding noise)
topic_info = topic_model.get_topic_info()
top_10 = topic_info[topic_info.Topic != -1].head(10)

# Generate and print labels
for _, row in top_10.iterrows():
    topic_id = row["Topic"]
    keywords = [word for word, _ in topic_model.get_topic(topic_id)[:10]]
    # The joined keywords stand in for the documents in the prompt
    label = get_groq_label([" ".join(keywords)], keywords)

    print(f"Topic {topic_id}: {label}")
    print("Keywords:", ", ".join(keywords))
    print()


Topic 0: Climate Change Global Impact
Keywords: climate change, change, climate, change climate, global warming, warming, global, warming climate, change global, solve climate

Topic 1: Jordan Peterson on Climate
Keywords: peterson, climate change, climate, change, jordan, guy, just, like, jordan peterson, doesn

Topic 2: Veganism and Animal Agriculture
Keywords: meat, animal, vegan, animals, eat, eating, agriculture, veganism, food, diet

Topic 3: Having Children in Life
Keywords: kids, children, child, kid, having, having kids, want, life, world, don

Topic 4: Australian Labour Politics Brexit
Keywords: labor, greens, australia, party, government, labour, uk, election, brexit, gt

Topic 5: COVID-19 Pandemic and Vaccines
Keywords: covid, vaccine, vaccines, diseases, virus, pandemic, deaths, people, disease, mask

Topic 6: Christianity and Religious Beliefs
Keywords: god, religion, bible, religious, christian, christians, church, jesus, nbsp, amp nbsp

Topic 7: Urban Transportation Mod

As the output shows, Groq generates highly interpretable topic labels that align well with the underlying keywords. This makes it a valuable extension to the BERTopic pipeline, enhancing the overall clarity and usability of the resulting topic model.