# Finding key topics in product reviews using BERTopic

My goal is to use BERTopic to discover and label common topics discussed in reviews for an Otterbox phone case. I want to integrate this into my application, which takes a CSV of reviews and returns product aspects and customer sentiment towards them. An unsupervised model is necessary here because the app must discover topics automatically, without predefined labels.

### Importing and pre-processing

Below are the steps to get 100 Otterbox phone case reviews and process them into a list of sentences to be fed into the BERTopic model.

In [1]:
import json 
import numpy as np
import pandas as pd

In [2]:
with open('Cell_Phones_and_Accessories_5.json', 'r') as file:
    data = [json.loads(line) for line in file]

norm_data = pd.json_normalize(data)
df = pd.DataFrame(norm_data)

In [3]:
df_filtered = df[df["asin"] == "B005SUHPO6"].reset_index()
df_filtered_reviews = df_filtered["reviewText"]

In [4]:
# One-time setup: sent_tokenize needs NLTK's Punkt sentence tokenizer data.
#import nltk
#nltk.download()

In [5]:
from nltk.tokenize import sent_tokenize
sentences = [sent_tokenize(review) for review in df_filtered_reviews]
sentences = [sentence for doc in sentences for sentence in doc]
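
Flattening the sentences this way drops the link between a sentence and the review it came from, which the app may need later when attaching sentiment to a review. Below is a minimal sketch of one way to keep that link. The helper names are hypothetical, and `naive_split` is only a regex stand-in so the example runs without NLTK data; in the notebook you would pass `sent_tokenize` as `split` instead.

```python
import re

def naive_split(text):
    """Crude stand-in for nltk's sent_tokenize: split after ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def sentences_with_review_ids(reviews, split=naive_split):
    """Return (review_index, sentence) pairs, one per sentence."""
    return [
        (review_id, sentence)
        for review_id, review in enumerate(reviews)
        for sentence in split(review)
    ]

# Made-up reviews, purely for illustration
rows = sentences_with_review_ids(["Great case. Love the color!", "Too bulky."])
for review_id, sentence in rows:
    print(review_id, sentence)
```

The sentence list fed to BERTopic would then be `[sentence for _, sentence in rows]`, with the review indices kept alongside in the same order.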

### Precalculating embeddings

First I run the review sentences through an embedding model so that I don't have to recalculate the embeddings each time I run BERTopic later.

In [6]:
from sentence_transformers import SentenceTransformer

# Pre-calculate embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(sentences, show_progress_bar=True)

Batches: 100%|██████████| 114/114 [00:15<00:00,  7.44it/s]
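
The embeddings above only live in the notebook's memory, so they are lost when the kernel restarts. A minimal sketch of caching them on disk with NumPy follows. The function name, the file name and the `fake_encode` stand-in are assumptions for illustration; in the notebook, `compute_fn` would wrap `embedding_model.encode(sentences)`.

```python
from pathlib import Path
import tempfile

import numpy as np

def load_or_compute_embeddings(cache_path, compute_fn):
    """Load embeddings from cache_path if it exists, else compute and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return np.load(cache_path)
    embeddings = np.asarray(compute_fn())
    np.save(cache_path, embeddings)
    return embeddings

# Stand-in for the real encoder so the sketch runs without a model download
calls = []
def fake_encode():
    calls.append(1)
    return np.ones((3, 4), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.npy"
    first = load_or_compute_embeddings(path, fake_encode)   # computes and saves
    second = load_or_compute_embeddings(path, fake_encode)  # loads from disk
```

The cache has to be deleted by hand whenever the sentences or the embedding model change, since nothing in this sketch checks for that.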


### Reducing dimensionality

To combat the curse of dimensionality, I reduce the dimensions of the embedding vectors before clustering. UMAP is BERTopic's default dimensionality reduction algorithm, so let's use that. Note that I've set *random_state* for reproducibility.

In [7]:
from umap import UMAP

umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=23)

### Choosing the clustering model and parameters

I want BERTopic to use HDBSCAN as its clustering model. It is a density-based clustering algorithm that can label points as outliers rather than forcing them into a cluster. I will try to control the number of topics using *min_cluster_size*.

In [65]:
from hdbscan import HDBSCAN

hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

### Topic representations - count vectorisation

Below I have the vectoriser model, which improves the default BERTopic representations by keeping filler words (*stop_words*) and infrequent words (*min_df*) out of the topic representations, and by allowing two-word phrases (*ngram_range*).

In [60]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", min_df=2, ngram_range=(1, 2))
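
To make the three parameters concrete, here is a toy re-implementation in plain Python. It is not scikit-learn's code, only a sketch of the same idea: drop stop words, build 1- and 2-grams from what is left, and keep terms that occur in at least `min_df` documents. The function name, the tiny stop list and the example sentences are made up for illustration.

```python
import re
from collections import Counter

def toy_vocabulary(docs, stop_words, min_df=2, ngram_range=(1, 2)):
    """Return the sorted terms that appear in at least min_df documents."""
    doc_freq = Counter()
    low, high = ngram_range
    for doc in docs:
        # Lower-case, keep tokens of 2+ characters, then drop stop words
        tokens = [t for t in re.findall(r"\b\w\w+\b", doc.lower())
                  if t not in stop_words]
        terms = set()
        for n in range(low, high + 1):
            for i in range(len(tokens) - n + 1):
                terms.add(" ".join(tokens[i:i + n]))
        doc_freq.update(terms)  # each term counted once per document
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = ["I love the color", "Love the color of the case", "The case is bulky"]
stop_words = {"the", "of", "is"}  # hypothetical mini stop list
vocab = toy_vocabulary(docs, stop_words)
print(vocab)  # ['case', 'color', 'love', 'love color']
```

"bulky" is dropped because it occurs in only one document, and "love color" survives as a bigram because the stop word between the two tokens is removed before n-grams are built.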

### Topic representations - alternative representations

Default representations look like "1_adversarial_attacks_attack_robustness", but I want the topics to be more informative and readable for the user. One way to do this is BERTopic's OpenAI representation model, which asks an OpenAI chat model to write a label for each topic.

In [15]:
import openai
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech

# KeyBERT
keybert_model = KeyBERTInspired()

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# OpenAI chat model (gpt-4o-mini)
prompt = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""
# The client reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI()
openai_model = OpenAI(client, model="gpt-4o-mini", exponential_backoff=True, chat=True, prompt=prompt)

# All representation models
representation_model = {
    "KeyBERT": keybert_model,
    #"OpenAI": openai_model,  # Uncomment if you will use OpenAI
    "MMR": mmr_model,
    "POS": pos_model
}
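
The prompt above relies on two placeholders, `[DOCUMENTS]` and `[KEYWORDS]`, and asks for a reply in the form `topic: <topic label>`. BERTopic handles the substitution and the parsing internally; the sketch below only illustrates the idea in plain Python. The helper names, the bullet formatting of the documents, the example sentence and the example reply are all assumptions, not BERTopic's actual implementation or real model output.

```python
PROMPT = """
I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short but highly descriptive topic label of at most 5 words. Make sure it is in the following format:
topic: <topic label>
"""

def fill_prompt(template, documents, keywords):
    """Substitute representative documents and keywords into the template."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    return (template
            .replace("[DOCUMENTS]", doc_block)
            .replace("[KEYWORDS]", ", ".join(keywords)))

def parse_label(reply):
    """Strip an optional leading 'topic:' prefix from the model's reply."""
    text = reply.strip()
    if text.lower().startswith("topic:"):
        text = text[len("topic:"):]
    return text.strip()

filled = fill_prompt(
    PROMPT,
    ["The rubber flap covers the charging port."],  # made-up sentence
    ["flap", "charging", "port"],
)
label = parse_label("topic: Charging port flap")  # made-up reply
print(label)  # Charging port flap
```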

### Training

Now it's time to bring everything together in the final model, making sure to use the pre-calculated embeddings from earlier. Note that *top_n_words* controls the number of words extracted per topic, and *verbose* makes the model log which stage of training it is at.

In [66]:
from bertopic import BERTopic

topic_model = BERTopic(

  # Pipeline models
  embedding_model=embedding_model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=10,
  verbose=True
)

topics, probs = topic_model.fit_transform(sentences, embeddings)

2025-08-26 01:27:11,646 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2025-08-26 01:27:26,603 - BERTopic - Dimensionality - Completed ✓
2025-08-26 01:27:26,617 - BERTopic - Cluster - Start clustering the reduced embeddings
2025-08-26 01:27:26,719 - BERTopic - Cluster - Completed ✓
2025-08-26 01:27:26,730 - BERTopic - Representation - Fine-tuning topics using representation models.
2025-08-26 01:27:31,052 - BERTopic - Representation - Completed ✓


In [67]:
topic_model.get_topic_info()

| Topic | Count | Name | Representation | KeyBERT | MMR | POS | Representative_Docs |
|---|---|---|---|---|---|---|---|
| -1 | 824 | -1_case_phone_iphone_screen | [case, phone, iphone, screen, like, protection... | [case iphone, screen protector, phone case, ca... | [iphone, screen, plastic, cover, bulky, pocket... | [case, phone, iphone, screen, protection, plas... | [The rubber outline is amazing and makes the p... |
| 0 | 657 | 0_product_price_great_recommend | [product, price, great, recommend, buy, amazon... | [great product, product great, price great, gr... | [price, purchase, worth, quality, great produc... | [product, price, great, amazon, purchase, good... | [I buy almost everything on Amazon, but this t... |
| 1 | 481 | 1_otterbox_defender_otter_otterbox defender | [otterbox, defender, otter, otterbox defender,... | [otterbox defender, phone otterbox, iphone ott... | [otterbox, otterbox defender, otter box, defen... | [otterbox, defender, otter, box, case, series,... | [This is my fourth Otterbox Defender., Otterbo... |
| 2 | 265 | 2_dropped_phone_times_dropped phone | [dropped, phone, times, dropped phone, drop, d... | [phone dropped, dropped phone, drops phone, dr... | [dropped phone, drops, damage, drop phone, ve ... | [phone, times, drop, drops, case, iphone, dama... | [My son is 15, and has dropped his phone a few... |
| 3 | 189 | 3_color_colors_pink_grey | [color, colors, pink, grey, love, black, love ... | [love color, love colors, color great, color l... | [color, colors, pink, love color, blue, camo, ... | [color, colors, pink, grey, black, blue, diffe... | [I love the color., Love the color., I love th... |
| 4 | 180 | 4_case_love case_case great_love | [case, love case, case great, love, case case,... | [great case, loves case, love case, case great... | [love case, case great, cases, case love, reco... | [case, love, great, good, cases, good case, gr... | [You will not be disapointed I love this case.... |
| 5 | 130 | 5_flap_charging_port_button | [flap, charging, port, button, buttons, open, ... | [charging port, charging, charger, rubber flap... | [buttons, home button, charging port, headphon... | [flap, port, button, buttons, open, home, jack... | [The rubber flap cover for the charging port i... |
| 6 | 126 | 6_rubber_silicone_plastic_stretched | [rubber, silicone, plastic, stretched, shell, ... | [case rubber, outer rubber, rubber outer, rubb... | [silicone, stretched, hard plastic, plastic ca... | [rubber, silicone, plastic, shell, outer, outs... | [The outer rubber dose get stretched out a lot... |
| 7 | 117 | 7_phone_protects_protection_protect | [phone, protects, protection, protect, protect... | [protects phone, phone protects, phone protect... | [protects, protects phone, protection phone, p... | [phone, protection, great, iphone, great prote... | [It is great protection for the phone., Protec... |
| 8 | 108 | 8_screen_screen protector_protector_built | [screen, screen protector, protector, built, t... | [screen protector, protector screen, screen pr... | [screen protector, touch screen, protector scr... | [screen, protector, touch, clear, air, plastic... | [The phone reacted as though there was no scre... |
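
Topic -1 is where BERTopic puts the sentences that HDBSCAN treats as outliers, and with 824 sentences it is the largest row in the output above. A quick way to track how much of the data ends up unassigned while tuning *min_cluster_size* is to compute the outlier share from the `topics` list returned by `fit_transform`. This is a minimal sketch with a hypothetical helper name and a made-up `topics` list.

```python
from collections import Counter

def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    counts = Counter(topics)
    return counts[-1] / len(topics)

example_topics = [-1, 0, 0, 1]  # made-up assignments
share = outlier_share(example_topics)
print(f"{share:.0%} of sentences are outliers")
```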


In [73]:
# Print the top keyword of each topic (topics 0 to 16)
for i in range(17):
    print(topic_model.get_topic(i)[0][0])

product
otterbox
dropped
color
case
flap
rubber
phone
screen
holster
case
protection
bulky
yes
4s
phone case
bulky
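
The loop above hardcodes the number of topics, which will change whenever the clustering parameters do. Below is a sketch of the same idea without the hardcoded range, assuming `topic_model.get_topics()` returns a dict that maps each topic ID to a list of `(word, score)` pairs. The helper name is hypothetical, and the example dict is a made-up stand-in so the code runs without a fitted model.

```python
def top_keywords(topic_words, n=1, skip_outliers=True):
    """Return {topic_id: [top n words]} from a topic -> [(word, score)] mapping."""
    result = {}
    for topic_id in sorted(topic_words):
        if skip_outliers and topic_id == -1:
            continue  # -1 is the outlier topic
        result[topic_id] = [word for word, _score in topic_words[topic_id][:n]]
    return result

# Stand-in for topic_model.get_topics(); the scores are invented
example_topic_words = {
    -1: [("case", 0.05), ("phone", 0.04)],
    0: [("product", 0.09), ("price", 0.07)],
    3: [("color", 0.12), ("colors", 0.08)],
}
labels = top_keywords(example_topic_words, n=2)
for topic_id, words in labels.items():
    print(topic_id, ", ".join(words))
```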


### Result

BERTopic has successfully modelled the topics discussed in the reviews, but I'm not convinced this is good enough for aspect discovery in my app. Of the top keywords printed above, only a few ("color", "flap", "rubber", "holster", "bulky", "protection") would be useful as aspects, and I don't know how I could automate picking aspects out of the rest. The only route I see from here is to get the LLM topic representations working, but I would first need to set up my own local LLM for that (I could pay for OpenAI credits, but I don't want to).
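
One crude option, not something tested in this notebook, would be to filter the top keywords against a hand-written list of terms that describe the product itself rather than an aspect of it. The sketch below applies that idea to the keywords printed earlier. The function name and the `GENERIC_TERMS` list are assumptions, and a hand-written list is only partly automatic, since it would have to be rewritten per product category.

```python
def candidate_aspects(keywords, generic_terms):
    """Drop generic terms and duplicates, keeping first-seen order."""
    seen = set()
    aspects = []
    for word in keywords:
        key = word.lower()
        if key in generic_terms or key in seen:
            continue
        seen.add(key)
        aspects.append(word)
    return aspects

# Top keyword per topic, copied from the output above
top_words = [
    "product", "otterbox", "dropped", "color", "case", "flap", "rubber",
    "phone", "screen", "holster", "case", "protection", "bulky", "yes",
    "4s", "phone case", "bulky",
]
# Hypothetical hand-picked list of product-generic terms
GENERIC_TERMS = {"product", "otterbox", "case", "phone", "phone case", "yes", "4s"}

aspects = candidate_aspects(top_words, GENERIC_TERMS)
print(aspects)
# ['dropped', 'color', 'flap', 'rubber', 'screen', 'holster', 'protection', 'bulky']
```

With this list the filter keeps "dropped" and "screen" in addition to the six keywords singled out above, so it is looser than the manual selection.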