<a href="https://colab.research.google.com/github/torkelfaa/streamlit-exercise-torkelfaa/blob/main/Copy_of_(In_Class)_Topic_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling: An Application to Customer Reviews for Guell Appliances


**Topic modeling** helps us automatically discover abstract 'topics' that occur in a collection of customer reviews. By identifying these topics, we can quickly understand the main themes and concerns expressed by customers, even across a very large dataset, without having to read every single review. This provides valuable insights into product strengths, weaknesses, and areas for improvement.



To begin, we load the dataset that accompanies the case *Guell Appliances: A Refrigerator's World We're Just Living In*.

The data is provided as an [Excel file](https://www.dropbox.com/scl/fi/dtvaatpn9pn0d655w117f/W33839-XLS-ENG.xlsx?rlkey=6b9q39xtqa2qsk0kr59drjv0a&st=o4pi7xvr&dl=0). Upload it to the Colab session before loading the data.

In [None]:

import pandas as pd

df_review = pd.read_excel("W33839-XLS-ENG.xlsx")

df_review.head()


Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5,True,"08 22, 2013",A34A1UP40713F8,B00009W3I4,{'Style:': ' Dryer Vent'},James. Backus,I like this as a vent as well as something tha...,Great product,1377129600,,
1,5,True,"02 8, 2016",A1AHW6I678O6F2,B00009W3PA,{'Size:': ' 6-Foot'},kevin.,good item,Five Stars,1454889600,,
2,5,True,"08 5, 2015",A8R48NKTGCJDQ,B00009W3PA,{'Size:': ' 6-Foot'},CDBrannom,Fit my new LG dryer perfectly.,Five Stars,1438732800,,
3,5,True,"04 24, 2015",AR3OHHHW01A8E,B00009W3PA,{'Size:': ' 6-Foot'},Calvin E Reames,Good value for electric dryers,Perfect size,1429833600,,
4,5,True,"03 21, 2015",A2CIEGHZ7L1WWR,B00009W3PA,{'Size:': ' 6-Foot'},albert j. kong,Price and delivery was excellent.,Five Stars,1426896000,,


## Preprocessing the data

We preprocess the data as follows.

* Replace missing values in *summary* and *reviewText* with empty strings.

* Combine *summary* and *reviewText* into a single *text* field for each review.

* Drop rows whose combined text is empty.

* Keep only two columns: *overall* and *text*.

In [None]:
# Combine summary + full review into a single text field for embedding
df_review["summary"] = df_review["summary"].fillna("")
df_review["reviewText"] = df_review["reviewText"].fillna("")

df_review["text"] = (df_review["summary"].str.strip() + ". " + df_review["reviewText"].str.strip()).str.strip()

# Drop rows with completely empty text (if any)
df_review = df_review[df_review["text"].str.len() > 0].reset_index(drop=True)



df = df_review[["overall", "text"]]

df.head()

Unnamed: 0,overall,text
0,5,Great product. I like this as a vent as well a...
1,5,Five Stars. good item
2,5,Five Stars. Fit my new LG dryer perfectly.
3,5,Perfect size. Good value for electric dryers
4,5,Five Stars. Price and delivery was excellent.


Next, we will use the `sentence_transformers` library to convert the review text to **embeddings** - numerical representations (vectors) of objects, such as words, sentences, or even entire documents.

`sentence_transformers` is a Python library that provides pre-trained models, predominantly based on neural network transformer architectures (like BERT, RoBERTa, etc.) to create embeddings.

* The embeddings are designed to capture the semantic meaning of the input, allowing for tasks such as semantic search, clustering, and topic modeling where the underlying meaning of text is crucial for comparison and grouping.
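To make "comparison of meaning" concrete, the sketch below measures how similar two embeddings are using cosine similarity, a common choice for sentence embeddings. The three short vectors and the review texts in the comments are made up for illustration; real embeddings from the model are much longer.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings" (illustration only, not model output)
review_a = np.array([0.9, 0.1, 0.0])   # e.g. "Fits my dryer perfectly"
review_b = np.array([0.8, 0.2, 0.1])   # e.g. "Great fit for the dryer"
review_c = np.array([0.0, 0.1, 0.9])   # e.g. "Delivery was late"

sim_ab = cosine_similarity(review_a, review_b)
sim_ac = cosine_similarity(review_a, review_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Reviews with similar meaning should score closer to 1 than reviews about unrelated things, which is what clustering relies on.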

In [None]:
import numpy as np

from sentence_transformers import SentenceTransformer

# Choose a BERT-based sentence embedding model.
# 'all-MiniLM-L6-v2' is fast and strong for semantic similarity.

# If you want a more "classic" BERT variant, you can use:
# model_name = "bert-base-nli-mean-tokens"

model_name = "all-MiniLM-L6-v2"

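# Construct the model; use the GPU if one is available, otherwise fall back to the CPU
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer(model_name, device=device)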


We have chosen a pretrained, compact **BERT**-style model (all-MiniLM-L6-v2) to encode the reviews.

Next, we use the model to perform the encoding. Note that the argument `batch_size=64` means that the model will work on 64 reviews at a time.

In [None]:
# convert the column df["text"] to a list of string values
texts = df["text"].tolist()

# Convert all review texts into embeddings

embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

# Convert the embeddings into a NumPy array and check its shape
embeddings = np.array(embeddings)

embeddings.shape


(2277, 384)

We can verify that the resulting `embeddings` is a 2277 × 384 array (matrix), which corresponds to 2277 reviews in total.

Each embedding (corresponding to one review) has length 384.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,374,375,376,377,378,379,380,381,382,383
0,-0.021984,0.142611,0.021897,-0.054377,0.083711,0.014592,0.031684,-0.084621,0.030868,-0.035361,...,0.000533,-0.008881,-0.050851,-0.004747,0.108595,0.029495,-0.009316,0.095347,-0.092788,0.069899
1,-0.079158,0.058241,0.025512,0.043002,-0.017920,0.003665,0.101567,-0.105377,-0.044793,0.026178,...,-0.020654,-0.022265,0.015125,-0.036644,0.028416,0.004939,0.037460,-0.018146,-0.095302,0.111979
2,-0.088759,0.050775,0.108125,0.029299,0.094130,-0.029431,0.016128,0.035411,0.013270,-0.044423,...,0.016416,-0.007110,-0.092946,0.075405,-0.013038,-0.020567,0.041939,0.018115,-0.024790,0.054995
3,-0.021935,0.106255,0.002337,0.005074,0.018063,-0.048186,0.024361,0.113092,-0.018913,-0.015027,...,0.001111,-0.025832,-0.135232,0.049809,-0.091676,-0.032544,-0.034676,0.018297,-0.137226,0.017384
4,-0.099258,0.026704,0.013967,0.051910,-0.049580,-0.008456,0.023832,-0.050291,-0.045295,0.019630,...,-0.002183,-0.074513,-0.033940,-0.063914,0.062077,0.091529,0.016560,-0.088765,-0.058670,0.058713
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2272,-0.005556,0.011517,0.098065,-0.044247,0.107281,-0.032996,-0.060089,-0.018822,0.007528,0.029243,...,0.028568,0.025036,-0.010467,-0.008214,0.047035,0.026774,-0.043685,-0.000342,-0.044330,0.005289
2273,0.006274,-0.031963,0.096308,-0.034146,0.119715,-0.006158,-0.047833,-0.034590,0.033860,0.033100,...,-0.001305,-0.012792,-0.037750,-0.027048,0.041138,0.046097,-0.066223,-0.040907,-0.024678,0.066062
2274,-0.085265,-0.010924,-0.004756,0.066206,-0.022475,0.014536,0.095250,-0.105810,-0.033118,0.002457,...,0.033207,-0.034409,-0.004848,-0.040822,-0.011260,0.004579,0.050128,-0.012685,-0.001150,0.099158
2275,-0.110751,0.019467,0.098165,0.047912,0.063065,-0.030968,0.007862,-0.070621,-0.078685,0.002615,...,-0.011513,-0.003598,-0.052874,-0.020434,-0.004446,0.087290,0.078147,-0.078152,-0.005804,0.011091


In [None]:
# View the embeddings as a DataFrame for easier inspection

pd.DataFrame(embeddings)

## Clustering the embeddings

We now use **k-means** clustering to cluster all the embeddings/reviews into $k$ clusters.

K-means clustering is an unsupervised machine learning algorithm that partitions a dataset into a given number ($k$) of clusters.
* Each cluster contains data records that are similar to one another.

* The algorithm iteratively assigns each data point to the cluster whose centroid is closest, then recomputes each centroid as the mean of the points assigned to it.

In the following example, we divide all embeddings/reviews into 5 clusters.

In [None]:
from sklearn.cluster import KMeans

# Create a 5-means model
# n_init=10 means that the algorithm will run 10 times with different initial centroids,
## and then choose the best result of the 10 runs

five_means_model = KMeans(n_clusters=5, random_state=42, n_init=10)

# The fit_predict() function will first fit the model to data (the embeddings),
## and then "predict" the cluster label for each embedding;
## for example, the first embedding belongs to cluster 0, the second one belongs to cluster 3, ...

cluster_labels = five_means_model.fit_predict(embeddings)

# After fitting, the model's inertia_ attribute holds the sum of squared distances
## from each point to its assigned cluster centroid;
## a smaller inertia indicates more compact clusters

five_means_model.inertia_


Let's try k = 6, 7, 8, ..., 15 to find a suitable number of clusters.

In [None]:
possible_ks = range(6, 16)  # you can adjust this range of candidate k values
inertias = [] # The list of inertias corresponding to the different k values

for k in possible_ks:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(embeddings)

    inertias.append(kmeans.inertia_)

We will plot the inertia values vs. k values using `plotly.express`.

In [None]:
import plotly.express as px

fig = px.line(x=list(possible_ks), y=inertias, title='Elbow Method for Optimal k', labels={'x':'Number of clusters (k)', 'y':'Inertia'}, markers=True)
fig.show()

The **elbow method** is a heuristic used to determine the optimal number of clusters (k) for k-means clustering.

* Look for a point in the plot where the rate of decrease in inertia sharply changes, forming an 'elbow' shape.

  * Before the elbow, the inertia decreases significantly with each increase in 'k', meaning adding more clusters greatly improves the fit.

  * After the elbow, the decrease becomes much smaller, indicating diminishing returns for adding more clusters.
  
* The 'k' value at this elbow point is often considered the optimal number of clusters.
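Reading the elbow off a plot is a judgment call. As a rough cross-check, one possible heuristic (an assumption of this sketch, not part of the original analysis) is to pick the k with the largest discrete second difference of the inertia curve, assuming the candidate k values are evenly spaced. The inertia values below are made up for illustration.

```python
import numpy as np

def elbow_by_second_difference(ks, inertias):
    """Return the k where the inertia curve bends most sharply.

    Heuristic: the discrete second difference
        inertia[i-1] - 2*inertia[i] + inertia[i+1]
    is largest where a steep decrease turns into a shallow one.
    """
    ks = list(ks)
    inertias = np.asarray(inertias, dtype=float)
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    second_diff = inertias[:-2] - 2 * inertias[1:-1] + inertias[2:]
    # second_diff[i] corresponds to ks[i + 1] (endpoints have no second difference)
    return ks[int(np.argmax(second_diff)) + 1]

# Made-up inertia values for illustration only (not results from the review data)
example_ks = range(2, 9)
example_inertias = [1000, 700, 450, 250, 230, 215, 205]
example_elbow = elbow_by_second_difference(example_ks, example_inertias)
print(example_elbow)
```

With the notebook's own variables, the call would be `elbow_by_second_difference(possible_ks, inertias)`. Treat the result as a suggestion to compare against the plot, since real inertia curves are often smooth and have no single sharp bend.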

Based on the elbow plot, we set the number of clusters to k=12 and rerun the clustering.





In [None]:
# After inspecting plots, set your chosen k here:
k_chosen = 12

# Create a 12-means model
# n_init=10 means that the algorithm will run 10 times with different initial centroids,
## and then choose the best result of the 10 runs

k_means_model = KMeans(n_clusters=k_chosen, random_state=42, n_init=10)

# The fit_predict() function will first fit the model to data (the embeddings),
## and then "predict" the cluster label for each embedding;
## for example, the first embedding belongs to cluster 0, the second one belongs to cluster 3, ...

cluster_labels = k_means_model.fit_predict(embeddings)



After the clustering, we add a new column "cluster" to the DataFrame `df` of the review texts.

In [None]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
# (df was created as a slice of df_review)
df = df.assign(cluster=cluster_labels)

df.head()



Unnamed: 0,overall,text,cluster
0,5,Great product. I like this as a vent as well a...,11
1,5,Five Stars. good item,9
2,5,Five Stars. Fit my new LG dryer perfectly.,9
3,5,Perfect size. Good value for electric dryers,11
4,5,Five Stars. Price and delivery was excellent.,9


## Examining reviews in a specific cluster

As an example, we will examine cluster 5.

Let's extract the rows of reviews corresponding to cluster 5.

In [None]:
cluster_id = 5

df_c5 = df[df["cluster"] == cluster_id]

df_c5.head()



Unnamed: 0,overall,text,cluster
42,3,Does what it needs to. At first this contrapti...,5
47,4,Works great but be careful!. I used this today...,5
81,3,Does what it needs to. At first this contrapti...,5
86,4,Works great but be careful!. I used this today...,5
120,3,Does what it needs to. At first this contrapti...,5


## Using Generative AI to identify the topic(s) of a cluster

Since reviews in the same cluster are similar to each other, there should be a common theme in those reviews.

In what follows, we use Gemini to identify the theme.

We will use the generative AI tool built into Colab (`google.colab.ai`). Although one typically needs an API key to call a generative AI model from code, Google allows light use of the Gemini API in Colab without one.




Let's combine all reviews in Cluster 5 into one long string, in the format of "Review text 1. | Review text 2. | ... | Review text m".

To do that, we join the strings in `df_c5["text"]` using the separator " | ".

In [None]:
# Combine all reviews into a single string, using " | " to separate the reviews

all_reviews_c5 = " | ".join(df_c5['text'].values)

In [None]:
from google.colab import ai

# See what models you have access to
ai.list_models()

['google/gemini-2.5-flash', 'google/gemini-2.5-flash-lite']

Next, we write the prompt that asks the model to analyze the reviews in Cluster 5.

In [None]:
prompt = f"""
You are analyzing customer reviews for a product.\n

Here are {len(df_c5)} reviews from this cluster, separated by the symbol "|" for different customers:\n
{all_reviews_c5}\n

Tasks:
1. Give this cluster a concise topic name (3–6 words).
2. Write a 1–2 sentence description of what customers are talking about.
3. Group the keywords into 3–5 sub-themes (for example: "shipping issues", "product quality", etc.), and list the keywords under each theme.
4. If you notice redundant or similar keywords, merge them or mention that they refer to the same idea.

Return your answer in a clear, readable Markdown format.
"""

prompt



We now send the prompt to the model to get the result.

In [None]:
response = ai.generate_text(prompt)

In [None]:
response



Here's an analysis of the provided customer reviews:

### 1. Concise Topic Name
Dryer Vent Cleaning Kit Insights

### 2. Description
Customers generally find this dryer vent cleaning kit effective for removing lint and obstructions, but they emphasize the importance of following instructions carefully due to operational risks such as rods breaking or detaching. They also note that while effective, the setup can be inconvenient for regular use, and express some reservations about the product's overall build quality.

### 3. Sub-themes and Keywords

**1. Effectiveness & Results**
*   Works great
*   Cleans out lint (removed significant amounts)
*   Removes obstructions (e.g., birds nest)
*   Achieved desired outcome / Project turned out well
*   Useful for lint catcher cleaning

**2. Usage & Experience**
*   Initial confusion / Confusing contraption
*   Directions / Instructions (read multiple times)
*   Lack of initial confidence / Felt scared
*   "Fun" to use (once understood)
*   Helpful DVD
*   Inconvenient setup (assembling/taping rods each time)
*   Benefit of keeping assembled (for convenience)

**3. Operational Risks & Warnings**
*   Requirement for careful operation / "Be careful"
*   Correct drill direction (crucial to avoid issues)
*   Risk of losing rods in vent
*   Brush head detaching/unscrewing (if drill is in reverse)
*   Rod breakage (due to excessive drill torque)
*   Importance of drill clutch setting (to prevent damage)

**4. Product Quality & Durability**
*   "Decent" build quality
*   Questioned toughness / Lack of confidence in long-term durability
*   Desire for "very well built" instead of just "decent"

### 4. Redundant/Similar Keywords
*   "Works great," "Does what it needs to," and "Project turned out really well" all convey the **Effectiveness** of the product.
*   "Confusing," "read directions 3 times," "not fully confident," and "scared me a little" relate to the **Initial Learning Curve** or **Ease of Understanding**.
*   "Warnings," "be careful," "drill into reverse," "brush head will detach/unscrew," "rods may break," and "lose rods" are all examples of **Operational Risks or Required Cautions**.
*   "Seems decent" and "would be more confident if it seemed very well built" both refer to the user's perception of the **Product's Construction Quality**.

Paste the content of `response` into a text cell to better view the result.

You may apply the same analysis to the remaining clusters.
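One way to repeat the analysis for every cluster is to build one prompt per cluster and send each to the model in turn. The helper below is a sketch under assumptions: `build_cluster_prompts` and `max_reviews` are hypothetical names, the prompt is a shortened version of the one above, and the toy lists stand in for `df["text"]` and `df["cluster"]`.

```python
from collections import defaultdict

def build_cluster_prompts(texts, labels, max_reviews=200):
    """Return {cluster_id: prompt} with one prompt per cluster.

    `max_reviews` is a hypothetical cap to keep prompts short; adjust or remove it.
    """
    grouped = defaultdict(list)
    for text, label in zip(texts, labels):
        grouped[int(label)].append(text)

    prompts = {}
    for cluster_id, reviews in sorted(grouped.items()):
        sample = reviews[:max_reviews]
        joined = " | ".join(sample)
        prompts[cluster_id] = (
            "You are analyzing customer reviews for a product.\n\n"
            f'Here are {len(sample)} reviews from one cluster, separated by "|":\n'
            f"{joined}\n\n"
            "Tasks:\n"
            "1. Give this cluster a concise topic name (3-6 words).\n"
            "2. Write a 1-2 sentence description of what customers are talking about.\n"
        )
    return prompts

# Toy data standing in for df["text"] and df["cluster"]
toy_texts = ["Fits my dryer", "Arrived late", "Perfect fit"]
toy_labels = [0, 1, 0]
prompts = build_cluster_prompts(toy_texts, toy_labels)
print(prompts[0])

# In Colab, each prompt could then be sent to the model, for example:
# from google.colab import ai
# responses = {cid: ai.generate_text(p) for cid, p in prompts.items()}
```

With the notebook's data, the call would be `build_cluster_prompts(df["text"], df["cluster"])`. Since the keyless Gemini access in Colab is meant for light use, it may be worth sending the prompts one at a time rather than all at once.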