# Topic Modeling on Bluesky Data Using BERTopic

**Yilin Xu**  
M.A. in Computational Social Science | University of Chicago  
Email: yilinxu1@uchicago.edu | [LinkedIn](https://www.linkedin.com/in/yilin-xu-367826202/) | [GitHub](https://github.com/yilinx-10)

**Motivation:**  
We want to assign our Reddit post data to topic categories in order to explore whether topic category is related to shared active users. To do so, we first need to understand which topics are commonly discussed under the broad theme we are interested in, the LA wildfire; here we identify those topics from Bluesky posts. We use the BERTopic model because it combines document embeddings, dimensionality reduction, hierarchical clustering, and class-based TF-IDF to extract coherent topics from a large set of documents.

**Preparation:**
We import the packages:

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import json
import pprint

### Load Bluesky Data

In [25]:
data = pd.read_json("bluesky_posts_0131.json",lines=True,orient='columns')

In [26]:
data[:10]

| | author | text | timestamp | likes | reposts | comment |
|---|---|---|---|---|---|---|
| 0 | worldbytenews.bsky.social | Renowned Dallas burger spot hosting fundraiser... | 2025-01-31T12:31:07Z | 0 | 0 | [] |
| 1 | worldbytenews.bsky.social | Renowned Dallas burger spot hosting fundraiser... | 2025-01-31T12:20:41Z | 0 | 0 | [] |
| 2 | smcclain2110.bsky.social | I feel like when we were a more cohesive socie... | 2025-01-31T12:14:58.402Z | 1 | 0 | [] |
| 3 | rufatto.bsky.social | Ontem rolou o Fire Aid em beneficio das vitima... | 2025-01-31T11:47:13.570Z | 2 | 0 | [fui direto ver o Dawes com Stephen Stills e G... |
| 4 | thecapitalist.bsky.social | California has been devastated by fires, trans... | 2025-01-31T11:34:54.638Z | 0 | 0 | [] |
| 5 | greendayitaly.bsky.social | Billie Joe Armstrong of Green Day poses with D... | 2025-01-31T11:14:30.018Z | 3 | 0 | [] |
| 6 | elpaischile.bsky.social | Durante 2025 se firmará un acuerdo de colabora... | 2025-01-31T10:53:57.509Z | 2 | 0 | [] |
| 7 | greendayitaly.bsky.social | Green Day performing with Billie Eilish during... | 2025-01-31T09:44:07.267Z | 16 | 6 | [] |
| 8 | emoryro.bsky.social | No notes. Okay, one note. | 2025-01-31T09:33:36.536Z | 0 | 0 | [] |
| 9 | aptronym.bsky.social | Chubb Ltd. is estimating that it will need to ... | 2025-01-31T09:22:33.129Z | 0 | 0 | [] |


Since Bluesky resembles Twitter in its post length restrictions, we keep only posts longer than 30 characters, with exact duplicates removed, as our input documents for BERTopic.

In [27]:
docs = data[data['text'].str.len() > 30].text.unique()
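
As a rough illustration of what this filter keeps, the sketch below reproduces the same two steps (length threshold, then exact-duplicate removal) in plain Python. The helper name `select_docs` is made up for this example; the first sample string mirrors row 8 of the preview above, and the remaining samples are hypothetical.

```python
def select_docs(texts, min_chars=30):
    """Keep texts longer than min_chars, dropping exact duplicates (first-seen order)."""
    long_enough = [t for t in texts if isinstance(t, str) and len(t) > min_chars]
    # dict.fromkeys preserves insertion order, similar to pandas' unique()
    return list(dict.fromkeys(long_enough))


sample = [
    "No notes. Okay, one note.",                                 # 25 characters: dropped
    "Fundraiser for wildfire relief is happening this weekend",  # kept
    "Fundraiser for wildfire relief is happening this weekend",  # exact duplicate: dropped
    None,                                                        # missing text: dropped
]
docs_demo = select_docs(sample)
print(docs_demo)
```

Short replies such as "No notes. Okay, one note." carry little topical signal on their own, which is the reason for a length threshold; the value 30 is the author's choice and could be tuned.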

### Load Packages for BERTopic

In [28]:
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

stop_words = stopwords.words('english')


We extend the stop word list with the words below. They are excluded because they name places or refer to the event as a whole (the LA wildfire), so they carry little meaning for detecting key themes and topics.

In [29]:
stop_words.extend(['california wildfires', 'california', 'los angeles', 'california fire', 'california fires', 'california wildfire',
                   'fire', 'wildfire', 'fires', 'wildfires', 'ca', 'los', 'angeles', 'la', 'san', 'francisco', 'sf', 'state',
                   'states', 'losangeles', 'santa', 'monterey', 'altadena', 'pasadena', 'eaton', 'palisades', 'cal', 'cali'])
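
One point worth checking: scikit-learn's `CountVectorizer` compares stop words against individual tokens, so multi-word entries such as 'los angeles' never match on their own (the separate 'los' and 'angeles' entries do the work), and a misspelled entry would silently fail to filter the intended word. The sketch below imitates that behavior in plain Python. The regex is meant to mirror the vectorizer's default token pattern, and the function name `unigram_filter`, the sample sentence, and the misspelling are all hypothetical.

```python
import re

# Mirrors scikit-learn's default token pattern (tokens of 2+ word characters)
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def unigram_filter(text, stop_words):
    """Lowercase, tokenize, and drop tokens that appear in stop_words."""
    stop_set = set(stop_words)
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in stop_set]


text = "Los Angeles crews battle Altadena blaze"

# A multi-word entry alone removes nothing, because no single token equals it
multiword_only = unigram_filter(text, ["los angeles"])
# Single-word entries match; the deliberately misspelled 'altadenna' does not
single_words = unigram_filter(text, ["los", "angeles", "altadenna"])

print(multiword_only)
print(single_words)
```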

### Implement Basic BERTopic Model

We define the models used for each step of BERTopic. 

In [30]:
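# Sentence embedding model that turns each post into a dense vector
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
# Reduce the embeddings to 3 dimensions; random_state makes the result reproducible
umap_model = UMAP(n_neighbors=20, n_components=3, min_dist=0, random_state=88888888)
# Density-based clustering; documents that fit no cluster are labeled -1 (outliers)
hdbscan_model = HDBSCAN(min_cluster_size=40, min_samples=20,
                        gen_min_span_tree=True,
                        prediction_data=True)
# Unigram bag-of-words counts, with the extended stop word list removed
vectorizer_model = CountVectorizer(ngram_range=(1, 1), stop_words=stop_words)
# Re-rank candidate topic words by embedding similarity to representative documents
representation_model = KeyBERTInspired()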
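
To make the class-based TF-IDF step more concrete, here is a minimal sketch of the scoring idea as described in the BERTopic documentation: each cluster is treated as one large document, and a word's score is its frequency in the cluster times `log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the word's frequency across all clusters. This is a simplification (no normalization, no KeyBERTInspired re-ranking), and the function name `c_tf_idf` and the toy clusters are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(class_docs):
    """Score words per class with tf * log(1 + A / f), a simplified c-TF-IDF.

    class_docs maps a cluster label to the list of tokens from all documents
    in that cluster, i.e. each cluster is treated as one big document.
    """
    tf = {label: Counter(tokens) for label, tokens in class_docs.items()}
    # f: how often each word occurs across all classes
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per class
    avg_words = sum(total.values()) / len(tf)
    return {
        label: {w: n * math.log(1 + avg_words / total[w]) for w, n in counts.items()}
        for label, counts in tf.items()
    }


# Hypothetical, already-tokenized clusters (not taken from the dataset)
clusters = {
    "aid": "fundraiser concert donate relief fundraiser".split(),
    "weather": "winds rain forecast relief winds".split(),
}
scores = c_tf_idf(clusters)
for label, words in scores.items():
    top = sorted(words, key=words.get, reverse=True)[:2]
    print(label, top)
```

Words that are frequent inside one cluster but rare elsewhere score highest, while a word shared by both clusters ("relief" here) is down-weighted.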

In [31]:
from bertopic import BERTopic

model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model,
    top_n_words=5,
    language='english',
    calculate_probabilities=True,
    verbose=True
)
topics, probs = model.fit_transform(docs)

2025-03-10 14:07:33,415 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/136 [00:00<?, ?it/s]

2025-03-10 14:08:16,364 - BERTopic - Embedding - Completed ✓
2025-03-10 14:08:16,365 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-03-10 14:08:25,960 - BERTopic - Dimensionality - Completed ✓
2025-03-10 14:08:25,962 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-03-10 14:08:26,244 - BERTopic - Cluster - Completed ✓
2025-03-10 14:08:26,248 - BERTopic - Representation - Extracting topics from clusters using representation models.
2025-03-10 14:08:30,871 - BERTopic - Representation - Completed ✓


In [32]:
model.visualize_topics()

The visualization suggests around 5 large groups of topics, with some overlap between topics within each group. We can get more detailed information about each topic:

In [23]:
freq = model.get_topic_info()
freq

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 1464 | -1_firefighters_federal_burned_disaster | [firefighters, federal, burned, disaster, vict... | [\n🌐 California prepares for possible new fire... |
| 1 | 0 | 1283 | 0_evacuation_evacuations_evacuate_firefighters | [evacuation, evacuations, evacuate, firefighte... | [World’s largest battery plant on fire in Cent... |
| 2 | 1 | 742 | 1_policy_donald_water_trump | [policy, donald, water, trump, fema, disaster,... | [Trump targets California water policy as he p... |
| 3 | 2 | 273 | 2_law_homeowners_insurance_flammable | [law, homeowners, insurance, flammable, resist... | [California is years behind in implementing a ... |
| 4 | 3 | 254 | 3_climate_florida_southern_snow | [climate, florida, southern, snow, snowing, so... | [; last week, California was on fire. this wee... |
| 5 | 4 | 160 | 4_vegetationfire_sbcfire_vegetation_firewx | [vegetationfire, sbcfire, vegetation, firewx, ... | [🔥New Wildfire: Vegetation Fire / 2000 Block o... |
| 6 | 5 | 149 | 5_fireaidla_fireaid_fundraising_fundraiser | [fireaidla, fireaid, fundraising, fundraiser, ... | [Each vote will help us raise funds for the Ca... |

This output lists only six topics plus the outlier row (-1). Judging by its cell number, In [23], the cell appears to have been re-run after the merge step in the "Merge Topics" section, so the counts correspond to the merged model.


We can also visualize the similarity matrix between all topics as a heatmap. Lighter colors indicate lower similarity.

In [14]:
model.visualize_heatmap()
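
The heatmap is built from pairwise cosine similarities between topic vectors. As a hedged illustration of the idea (not BERTopic's internal code), the sketch below computes such a matrix for three made-up topic vectors; the labels and numbers are hypothetical and chosen only to show that related topics score close to 1 and unrelated ones close to 0.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional topic vectors, for illustration only
topic_vectors = {
    "evacuation": [0.9, 0.1, 0.0],
    "fire_update": [0.8, 0.3, 0.1],
    "fundraising": [0.0, 0.2, 0.9],
}
labels = list(topic_vectors)
matrix = [
    [cosine_similarity(topic_vectors[a], topic_vectors[b]) for b in labels]
    for a in labels
]
for label, row in zip(labels, matrix):
    print(f"{label:12s}", [round(x, 2) for x in row])
```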

BERTopic also allows us to plot the hierarchy of topics. We can see there are around 6 basic clusters of topics marked by color. 

In [15]:
model.visualize_hierarchy()

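We can print out the top 10 representative words for the topics the model generated. Because row 0 of `freq` holds the outlier topic (-1), `range(23)` covers topic -1 and topics 0 to 21.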

In [16]:
for i in range(23):
    print(freq['Representation'][i])

['firefighters', 'federal', 'disaster', 'burned', 'news', 'victims', 'area', 'winds', 'climate', 'homes']
['fema', 'trump', 'disaster', 'trumps', 'donald', 'federal', 'aid', 'relief', 'hurricane', 'politics']
['mosslandingfire', 'battery', 'batteries', 'burning', 'flames', 'renewable', 'largest', 'tesla', 'blaze', 'lithium']
['winds', 'rain', 'rains', 'weather', 'wind', 'storm', 'southern', 'runoff', 'forecast', 'coast']
['hughesfire', 'firefighters', 'burning', 'lake', 'burned', 'burns', 'ventura', 'blaze', 'castaic', 'evacuation']
['fireproof', 'law', 'homeowners', 'resistant', 'regulations', 'legislation', 'homes', 'policies', 'property', 'houses']
['donald', 'trump', 'policy', 'policies', 'water', 'drought', 'pacific', 'reservoirs', 'politics', 'targets']
['burning', 'southern', 'gulf', 'smoke', 'lives', 'losing', 'us', 'lost', 'news', 'lose']
['firefighters', 'evacuations', 'arson', 'evacuation', 'erupt', 'evacuate', 'diego', 'erupts', 'burning', 'brush']
['vegetationfire', 'lilac

### Merge Topics

We do not want to use 23 topics to label our Reddit post data (~160). Hence, we refer to the hierarchical clustering of topics and merge them into a total of 6 topics.

In [17]:
topics_to_merge = [[13, 4], 
                   [11, 16],
                   [8, 19],
                   [20, 6, 17],
                   [5, 22, 18, 15, 12, 0],
                   [2, 21, 3, 9, 10, 7, 1, 14]]
model.merge_topics(docs, topics_to_merge)
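
Before merging, it can help to confirm that the groups are disjoint and together cover every topic. The following is a small sanity check in plain Python; it assumes the pre-merge topic IDs run from 0 to 22 (the outlier topic -1 is left out of the merge).

```python
from itertools import chain

topics_to_merge = [[13, 4],
                   [11, 16],
                   [8, 19],
                   [20, 6, 17],
                   [5, 22, 18, 15, 12, 0],
                   [2, 21, 3, 9, 10, 7, 1, 14]]

n_topics = 23  # assumed: topics 0..22 before merging
flat = list(chain.from_iterable(topics_to_merge))

# Topic IDs listed in more than one place, and IDs not listed at all
duplicates = sorted({t for t in flat if flat.count(t) > 1})
missing = sorted(set(range(n_topics)) - set(flat))

print("groups:", len(topics_to_merge))
print("duplicates:", duplicates)
print("missing:", missing)
```

Both lists come back empty for the grouping used here, so each of the 23 topics lands in exactly one of the 6 merged topics.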

In [18]:
freq = model.get_topic_info()
freq.head(10)

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 1464 | -1_firefighters_federal_burned_disaster | [firefighters, federal, burned, disaster, vict... | [\n🌐 California prepares for possible new fire... |
| 1 | 0 | 1283 | 0_evacuation_evacuations_evacuate_firefighters | [evacuation, evacuations, evacuate, firefighte... | [World’s largest battery plant on fire in Cent... |
| 2 | 1 | 742 | 1_policy_donald_water_trump | [policy, donald, water, trump, fema, disaster,... | [Trump targets California water policy as he p... |
| 3 | 2 | 273 | 2_law_homeowners_insurance_flammable | [law, homeowners, insurance, flammable, resist... | [California is years behind in implementing a ... |
| 4 | 3 | 254 | 3_climate_florida_southern_snow | [climate, florida, southern, snow, snowing, so... | [; last week, California was on fire. this wee... |
| 5 | 4 | 160 | 4_vegetationfire_sbcfire_vegetation_firewx | [vegetationfire, sbcfire, vegetation, firewx, ... | [🔥New Wildfire: Vegetation Fire / 2000 Block o... |
| 6 | 5 | 149 | 5_fireaidla_fireaid_fundraising_fundraiser | [fireaidla, fireaid, fundraising, fundraiser, ... | [Each vote will help us raise funds for the Ca... |
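
Topic -1 collects the documents that HDBSCAN could not assign to any cluster, so its share is a useful quick diagnostic. A short worked example using the counts copied from the table above:

```python
# Topic sizes copied from the table above (topic ID -> document count)
topic_counts = {-1: 1464, 0: 1283, 1: 742, 2: 273, 3: 254, 4: 160, 5: 149}

total_docs = sum(topic_counts.values())
outlier_share = topic_counts[-1] / total_docs

print(f"documents: {total_docs}")
print(f"outlier share: {outlier_share:.1%}")
for topic, count in topic_counts.items():
    if topic != -1:
        print(f"topic {topic}: {count / total_docs:.1%} of all documents")
```

Roughly a third of the documents remain unassigned, which is worth keeping in mind when interpreting the six merged topics.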


In [19]:
for i in range(6):
    print(freq['Representation'][i])

['firefighters', 'federal', 'burned', 'disaster', 'victims', 'org', 'news', 'governor', 'politics', 'aid']
['evacuation', 'evacuations', 'evacuate', 'firefighters', 'battery', 'burning', 'batteries', 'hughesfire', 'plants', 'lithium']
['policy', 'donald', 'water', 'trump', 'fema', 'disaster', 'politics', 'governor', 'targets', 'hurricane']
['law', 'homeowners', 'insurance', 'flammable', 'resistant', 'coverage', 'burning', 'homes', 'burns', 'property']
['climate', 'florida', 'southern', 'snow', 'snowing', 'south', 'texas', 'gulf', 'louisiana', 'freezing']
['vegetationfire', 'sbcfire', 'vegetation', 'firewx', 'rancho', 'canyon', 'bernardino', 'forestry', 'valley', 'area']
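
Note that `freq` is indexed by row, and row 0 holds the outlier topic -1, so `range(6)` prints topics -1 through 4 and leaves out topic 5 (the fundraising topic). A minimal sketch of an ID-based loop follows, using a plain list of (topic, words) pairs as a stand-in for the `freq` DataFrame; the shortened word lists are copied from the table above.

```python
# Stand-in for freq: (Topic, first words of Representation), in table order
rows = [
    (-1, ["firefighters", "federal", "burned"]),
    (0, ["evacuation", "evacuations", "evacuate"]),
    (1, ["policy", "donald", "water"]),
    (2, ["law", "homeowners", "insurance"]),
    (3, ["climate", "florida", "southern"]),
    (4, ["vegetationfire", "sbcfire", "vegetation"]),
    (5, ["fireaidla", "fireaid", "fundraising"]),
]

# Positional loop: the first 6 rows include the outlier row and miss topic 5
by_position = [topic for topic, _ in rows[:6]]

# ID-based loop: skip the outlier topic explicitly and keep every real topic
by_topic_id = {topic: words for topic, words in rows if topic != -1}

print(by_position)
for topic, words in by_topic_id.items():
    print(topic, words)
```

With the real DataFrame, an equivalent filter would be `freq[freq['Topic'] != -1]`.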


In [20]:
model.visualize_barchart()

The merged topics show distinct themes: water/electricity, politics, private, climate change, fire update (info), and celebrity fundraising.

In [21]:
model.visualize_heatmap()

In [22]:
model.visualize_hierarchy()

### Conclusion & Insights

Based on our BERTopic results, we decided to assign our Reddit post data to 7 topics: **private, politics, water/electricity policy, fire update (info), celebrity, fundraising, climate change**

### Reflection & Reference

No AI support was used for the code in this document.

BERTopic Documentation: https://maartengr.github.io/BERTopic/algorithm/algorithm.html