<img src="../../img/backdrop-wh.png" alt="Drawing" style="width: 300px;"/>

# Topic Modeling With BERTopic

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Use BERTopic to group r/AmItheAsshole submissions by the issues people write about.
* Interpret and visualize the topics found by the model.
* Try out ways to improve or simplify the model when topics overlap too much or don’t make sense.
* Practice naming and describing topics, and using them to organize or classify new text.
</div>

### Icons Used in This Notebook
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>
💭 **Reflection**: Reflecting on ethical implications, biases, and social impact in data science.<br>

### Sections
1. [Topic Modeling with BERTopic](#topic)
2. [Explore Extracted Topics](#explore)
3. [Reducing Overlap](#reduce)
4. [Finding Representative Posts](#repr)


<a id='topic'></a>

# Topic Modeling with BERTopic

In this optional notebook, we explore **BERTopic**, a topic modeling tool that uses BERT embeddings and [c-TF-IDF](https://maartengr.github.io/BERTopic/api/ctfidf.html) to extract topics. It is a more modern approach than the one we explore in our other notebook, and it can be interesting to compare the outputs of the two approaches!

## About BERT Embeddings
BERT embeddings are vector representations of text created by the BERT model, a large transformer-based neural network trained to understand language in context. Unlike traditional word embeddings (like word2vec), BERT creates contextual embeddings: the same word can have different vector representations depending on its surrounding words. BERT processes entire sentences at once, using its attention mechanism to capture meaning and relationships between words.

BERTopic works as follows:

- **Understands text using BERT:**  
Instead of just counting words, BERTopic uses a language model (BERT or similar) to turn each document into a “vector”—a list of numbers that captures its meaning and context. This makes it better at grouping together texts that talk about similar things, even if they use different words.

- **Clusters similar documents:**  
BERTopic then looks for clusters (groups) of documents that are close to each other in this vector space. It uses an algorithm called **HDBSCAN** that decides, based on the data, how many clusters (topics) make sense. This means you don’t have to guess the number of topics in advance.

- **Finds key words for each topic:**  
For every cluster it finds, BERTopic looks for the most important words that make this group unique compared to the rest of your data. These words help you quickly understand what each topic is about.

- **Visualizes topics and documents:**  
BERTopic comes with interactive tools to show you how your topics relate to each other, how common each topic is, and where your documents fit in.
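
To make the keyword step more concrete, here is a minimal pure-Python sketch of the class-based TF-IDF idea (c-TF-IDF). All documents in a cluster are treated as one long document, and a word scores highly when it is frequent in that cluster but rare across the data as a whole. The function name `class_tfidf` and the toy clusters are made up for illustration, and the formula is a simplified reading of the one described in the BERTopic documentation, not the library's actual code.

```python
import math
from collections import Counter


def class_tfidf(clusters):
    """Score words per cluster. `clusters` maps a cluster id to a list of tokenized documents."""
    # Treat each cluster as one long document and count its words
    counts = {c: Counter(tok for doc in docs for tok in doc) for c, docs in clusters.items()}

    # Frequency of each word across all clusters
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)

    # Average number of words per cluster
    avg_words = sum(sum(cnt.values()) for cnt in counts.values()) / len(counts)

    scores = {}
    for c, cnt in counts.items():
        n_words = sum(cnt.values())
        # Relative frequency in the cluster, weighted down for words common everywhere
        scores[c] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in cnt.items()
        }
    return scores


# Hypothetical, already-tokenized toy data
toy_clusters = {
    0: [["dog", "park", "said"], ["dog", "leash", "said"]],
    1: [["tip", "bill", "said"], ["tip", "waiter", "said"]],
}

scores = class_tfidf(toy_clusters)
top_words = {c: max(s, key=s.get) for c, s in scores.items()}
print(top_words)  # {0: 'dog', 1: 'tip'}
```

Note how "said" appears just as often as "dog" in cluster 0 but scores lower, because it is equally common in the other cluster.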

### Note on package installation
The cell below contains a commented-out `%pip install` line for BERTopic. If you’re working on DataHub, the package may not be installed yet.

- If you are running this notebook on **DataHub** and get an error about a missing package, **uncomment and run** the `%pip install ...` line below, then restart your kernel.
- If you are working **locally** (on your own computer), you should already have all required packages installed via your Conda environment (see the ***"Local Python and Jupyter Setup"*** page on bCourses). Only use the `pip install` line if you see an `ImportError` and know what you’re doing.

In [5]:
# restart kernel after running
#%pip install bertopic 

Collecting bertopic
  Using cached bertopic-0.17.0-py3-none-any.whl.metadata (23 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Using cached hdbscan-0.8.40-cp310-cp310-macosx_11_0_arm64.whl
Using cached bertopic-0.17.0-py3-none-any.whl (150 kB)
Installing collected packages: hdbscan, bertopic
Successfully installed bertopic-0.17.0 hdbscan-0.8.40
Note: you may need to restart the kernel to use updated packages.


**Note: this notebook might not work on your local machine depending on your system architecture. It does work on DataHub, however.**

## Loading the Data

In [36]:
import gdown

gdown.download("https://drive.google.com/uc?id=1Glac4spXraWRcC_loxu1Cu4Bw-szS2ou", "../../data/aita_pp.csv", quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1Glac4spXraWRcC_loxu1Cu4Bw-szS2ou
To: /Users/tomvannuenen/Library/CloudStorage/Dropbox/GitHub/DEV/DIGHUM160/data/aita_pp.csv
100%|███████████████████████████████████████| 56.6M/56.6M [00:00<00:00, 115MB/s]


'../../data/aita_pp.csv'

In [62]:
import pandas as pd

# Load your preprocessed AITA CSV (change the filename if needed)
df = pd.read_csv('../../data/aita_pp.csv')

In [63]:
# Collect the preprocessed text column as a list of documents
docs = df['pp_text'].tolist()

## Build and Fit BERTopic Model
We'll use default settings first. This may take a few minutes.

In [64]:
from bertopic import BERTopic

# Use only a subset for demo to avoid memory errors
docs_sample = docs[:1000]

topic_model = BERTopic(verbose=True)
topics, probabilities = topic_model.fit_transform(docs_sample)

2025-06-03 17:19:56,659 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/32 [00:00<?, ?it/s]

2025-06-03 17:20:10,733 - BERTopic - Embedding - Completed ✓
2025-06-03 17:20:10,736 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-06-03 17:20:13,581 - BERTopic - Dimensionality - Completed ✓
2025-06-03 17:20:13,585 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-06-03 17:20:13,642 - BERTopic - Cluster - Completed ✓
2025-06-03 17:20:13,653 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-06-03 17:20:13,977 - BERTopic - Representation - Completed ✓


<a id='explore'></a>

# Explore Extracted Topics
View topic frequencies and the top words per topic.

In [65]:
topic_info = topic_model.get_topic_info()
topic_info.head(10)

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,457,-1_like_said_told_time,"[like, said, told, time, know, got, want, thin...",[hi throw_away account obvious_reasons typing ...
1,0,143,0_said_told_like_friends,"[said, told, like, friends, night, sex, asked,...",[ex girlfriend anna dating months relationship...
2,1,59,1_wife_daughter_kids_son,"[wife, daughter, kids, son, mother, told, life...",[brother family close got married years_ago ye...
3,2,50,2_family_brother_sister_mother,"[family, brother, sister, mother, mom, told, d...",[like said original_post](https://www.reddit.c...
4,3,43,3_pay_money_kids_wife,"[pay, money, kids, wife, work, home, school, s...",[suspicion current karmic disposition affair l...
5,4,39,4_pregnant_baby_pregnancy_abortion,"[pregnant, baby, pregnancy, abortion, want, ch...",[start know sounds bad try explain 21(m girlfr...
6,5,33,5_card_tip_pay_money,"[card, tip, pay, money, manager, said, left, b...",[dining upscale restaurant bill came $ 49.82 p...
7,6,32,6_seat_seats_people_like,"[seat, seats, people, like, flight, said, wait...",[earlier week went library finish papers worki...
8,7,28,7_teacher_class_students_school,"[teacher, class, students, school, told, kids,...",[teach private school affiliated religion allo...
9,8,20,8_food_eating_eat_plate,"[food, eating, eat, plate, like, eats, boyfrie...",[broken ex boyfriend(22 pickiest eater mean ab...


💡 **Tip**: Topic -1 in BERTopic is a “catch-all” for documents that don’t fit into any meaningful cluster.

This works as follows:
- HDBSCAN (the clustering algorithm) automatically labels “noise” or “outlier” documents with -1.
- These are typically posts that are too unique, too generic, or just don’t belong to any clear topic group.
- Including topic -1 in your list of topics will show a “topic” that’s not really coherent, and the top words for -1 are usually either very generic or meaningless.
- Most users ignore topic -1 when reviewing topics and top words, focusing only on the numbered topics (0, 1, 2, …).
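
In the run shown above, 457 of the 1,000 sampled posts landed in topic -1, so it is worth checking how large the outlier group is before interpreting the rest. The sketch below shows one way to do that with plain Python; `example_topics` is a hypothetical stand-in for the `topics` list returned by `fit_transform`.

```python
from collections import Counter

# Hypothetical topic assignments, shaped like the `topics` list from fit_transform
example_topics = [-1, 0, 0, 2, -1, 1, 0, -1]

counts = Counter(example_topics)

# Share of documents that HDBSCAN labeled as noise
outlier_share = counts[-1] / len(example_topics)

# Keep only the real topics, largest first
real_topic_counts = [(t, n) for t, n in counts.most_common() if t != -1]

print(f"Outliers: {outlier_share:.1%}")
print(real_topic_counts)
```

A large outlier share is not necessarily a problem, but it does mean the numbered topics describe only part of the corpus.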

In [66]:
# How many rows does topic_info have? (one per topic, including the outlier topic -1)
topic_info.shape

(16, 5)

In [67]:
# Show top words for topic 0
topic_model.get_topic(0)

[('said', 0.02659525132995789),
 ('told', 0.025904404696707732),
 ('like', 0.0253026324200215),
 ('friends', 0.02283089603724534),
 ('night', 0.019084676288042533),
 ('sex', 0.01885220048284822),
 ('asked', 0.018232663735160484),
 ('relationship', 0.017590021208154085),
 ('guy', 0.01746002423606322),
 ('know', 0.017203406830453263)]

## Intertopic Distance Map
The `visualize_topics` method shows topics and their similarity in an interactive plot. We're also saving it to disk so it can be embedded on a website.

In [68]:
import os

os.makedirs("outputs_lesson", exist_ok=True)  # create the output folder if it is missing
fig = topic_model.visualize_topics()
fig.write_html("outputs_lesson/bertopic_topics.html")
fig.show()

<a id='reduce'></a>

# Reducing Overlap

If your intertopic distance map shows lots of overlapping bubbles, your model may have produced **too many fine-grained topics**. This is common with large or complex datasets! Similar documents can get split into clusters that aren't really distinct.

To make your topics broader and reduce overlap, you can **merge similar topics** using the `.reduce_topics()` method in BERTopic.
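
As a rough intuition for what "similar topics" means here, the toy sketch below represents each topic as a small vector and finds the pair with the highest cosine similarity, which would be the most natural candidates to merge. The vectors, topic labels, and helper names are invented for illustration; BERTopic's own merging logic works on its learned topic representations and is more involved than this.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar_pair(topic_vectors):
    """Return the two topic labels whose vectors are closest in direction."""
    labels = sorted(topic_vectors)
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    return max(pairs, key=lambda p: cosine(topic_vectors[p[0]], topic_vectors[p[1]]))


# Hypothetical topic vectors (real ones have many more dimensions)
toy_vectors = {
    "food": [0.9, 0.1, 0.0],
    "vegan": [0.8, 0.2, 0.1],
    "dogs": [0.0, 0.1, 0.9],
}

pair = most_similar_pair(toy_vectors)
print(pair)  # ('food', 'vegan')
```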

### How to Reduce the Number of Topics

Use the following code to merge topics until only your desired number remain (e.g., 10):

In [69]:
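# Reduce the number of topics (set to the number you want)
target_num_topics = 10  # change this as needed
topic_model.reduce_topics(docs_sample, nr_topics=target_num_topics)
topics = topic_model.topics_  # refresh the assignments so they match the merged topics

# Pair each original (unprocessed) post with its topic assignment
doc_info = topic_model.get_document_info(df['selftext'][:1000].tolist())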

# Re-visualize the intertopic distance map
fig = topic_model.visualize_topics()

fig.write_html(f"outputs_lesson/bertopic_topics_{target_num_topics}.html")
fig.show()

2025-06-03 17:20:15,330 - BERTopic - Topic reduction - Reducing number of topics
2025-06-03 17:20:15,339 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-06-03 17:20:15,459 - BERTopic - Representation - Completed ✓
2025-06-03 17:20:15,462 - BERTopic - Topic reduction - Reduced number of topics from 16 to 10


As you can tell, we no longer have so many overlapping topics. That's good!

<a id='repr'></a>

# Finding Representative Posts

We can use BERTopic to find representative posts for a topic. So if you're interested in exploring posts from a particular topic, this is how you do that.

Let's first look at the top words for each topic:

In [47]:
for topic_num in topic_model.get_topic_info().Topic:
    if topic_num == -1:
        continue  # skip the outlier topic if you want
    top_words = [word for word, _ in topic_model.get_topic(topic_num)]
    print(f"Topic {topic_num}: {', '.join(top_words)}")

Topic 0: said, told, like, friends, know, got, feel, asked, want, going
Topic 1: family, kids, wife, told, sister, daughter, like, parents, want, brother
Topic 2: baby, pregnant, want, pregnancy, child, said, abortion, wife, told, think
Topic 3: like, food, people, eat, said, seat, eating, asked, guy, waitress
Topic 4: teacher, kids, class, students, school, told, said, daughter, got, yard
Topic 5: tip, pay, card, money, manager, said, bill, left, shop, creditcard
Topic 6: vegan, meat, food, vegetarian, eat, daughter, said, eatmeat, dishes, diet
Topic 7: gay, friends, like, straight, people, group, claire, word, bisexual, said
Topic 8: dog, dogs, park, said, walking, like, told, people, swimming, leave


Let's say I'm interested in topic 2.

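**⚠️ Warning:** If you run this code again, your topics might look different because UMAP, the dimensionality reduction step, is stochastic. To get reproducible results, you can pass BERTopic a UMAP model with a fixed `random_state` through its `umap_model` argument.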

In [54]:
topic_num = 2  # or whatever topic you're interested in

# Get indices of documents in that topic
indices = [i for i, t in enumerate(topics) if t == topic_num]

# Show original texts from df
for i in indices[:3]:  # show up to 3 examples
    print(f"Example {i+1} for topic {topic_num}:\n", df['selftext'].iloc[i], "\n")

Example 39 for topic 2:
 Me and my girlfriend have been in a sexually active relationship for the past year and half, about 2 months ago she promised me that if I stopped using a condom, she would go on the pill, and if she does conceive by accident, she would 100% abort as soon as possible, however this morning she revealed to me that she's a bit over a month pregnant, and long story short she doesn't to abort.

&#x200B;

Obviously I was extremeeeely pissed off, as I trusted her to abort, that's the only reason I stopped using a condom, and I really do not want a child right now, it'll really fuck me over, so I gave her an ultimatum, either have an abortion this week or I'll leave and neither she, her family or the kid (if she keeps it) will ever see me again. Am I the asshole or is she? 

Example 40 for topic 2:
 My long time girlfriend told me recently that she is interested in being the surrogate for a child for a friend of hers who is gay and wants a child of his own.

I respect h

## Grab all posts from a certain topic (for further processing)

Once you’ve trained a BERTopic model, each document is assigned a dominant topic. You can use this assignment to extract all posts that belong to a specific topic for closer analysis, visualization, or downstream tasks.

For instance, let's create a new dataframe with all posts that have topic 2 as the dominant topic.

In [58]:
# Get indices of posts with topic 2
topic_num = 2
topic_2_indices = [i for i, t in enumerate(topics) if t == topic_num]

df_topic_2 = df.iloc[topic_2_indices].copy()
df_topic_2['dominant_topic'] = topic_num

In [60]:
df_topic_2[:5]

Unnamed: 0,idint,idstr,created,self,nsfw,author,title,url,selftext,score,subreddit,distinguish,textlen,num_comments,flair_text,flair_css_class,augmented_at,augmented_count,pp_text,dominant_topic
38,595325362,t3_9ufvzm,1541441116,1.0,0.0,Bumbarass2001,AITA For telling my girlfriend to get an abort...,,Me and my girlfriend have been in a sexually a...,671.0,AmItheAsshole,,774.0,705.0,Everyone Sucks,ass,,,girlfriend sexually_active relationship past y...,2
39,595912032,t3_9usgo0,1541538722,1.0,0.0,MrTaylor11,AITA for telling my girlfriend that if she wan...,,My long time girlfriend told me recently that ...,17247.0,AmItheAsshole,,423.0,2355.0,Not the A-hole,not,,,long time girlfriend told recently interested ...,2
53,601928856,t3_9ydfa0,1542602086,1.0,0.0,Thinkcali,AITA for asking my faithful girlfriend for a p...,,Why do we take people at their word for the la...,201.0,AmItheAsshole,,697.0,386.0,,,,,people word largest decision life raise child ...,2
65,606566007,t3_a14tbr,1543398570,1.0,0.0,Outlashed,AITA for telling my GF to get an abort?,,"Reference: I'm 22, she's 23. - Europe.\n\nWe s...",6961.0,AmItheAsshole,,2888.0,2491.0,Not the A-hole,not,,,reference europe started dating mid august tol...,2
66,607003408,t3_a1e6ts,1543466883,1.0,0.0,waterglass1986,AITA for abandoning my pregnant girlfriend,,My girlfriend of two years got pregnant a few ...,11760.0,AmItheAsshole,,794.0,2203.0,Not the A-hole,not,,,girlfriend years got pregnant months_ago condo...,2


## 💭 Reflection: Why Extracting Topic-Specific Posts Matters

You can now save this DataFrame as a CSV file and run any of the analyses we are considering on it, to get better insight into this specific topic. Some things to consider:

- **Do close reading at scale:** Instead of reading all posts blindly, you can target a subset organized around a common discourse or theme like family conflict, workplace dynamics, or gendered expectations.
- **Investigate interpretive boundaries:** Some posts may straddle multiple themes or fall into unexpected clusters—these are good entry points for discussing ambiguity, interpretive friction, and the limits of topic modeling.
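
One possible way to surface such boundary cases is to look at posts that were assigned to a topic with low confidence. The sketch below assumes you have a list of topic assignments and a matching list with one probability per post, which is what `fit_transform` returns under BERTopic's default settings. The function name, the example values, and the 0.5 threshold are all arbitrary choices for illustration.

```python
def borderline_posts(topics, probabilities, threshold=0.5):
    """Return indices of posts assigned to a real topic with low confidence."""
    return [
        i
        for i, (t, p) in enumerate(zip(topics, probabilities))
        if t != -1 and p < threshold  # skip outliers, keep weak assignments
    ]


# Hypothetical values, shaped like the outputs of fit_transform
example_topics = [2, 2, -1, 0, 2]
example_probs = [0.95, 0.41, 0.0, 0.88, 0.30]

print(borderline_posts(example_topics, example_probs))  # [1, 4]
```

The returned indices can then be used to pull up the original posts and read them closely.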
