### Amazon Reviews Analysis with Unsupervised Learning

In this notebook, we explore how unsupervised learning methods can be applied to real-world text data,
specifically Amazon product reviews. Companies rely heavily on customer feedback to improve products,
understand user experience, and identify trends. However, reviews are unstructured free text,
which makes them challenging to analyze at scale without appropriate methods.

We will implement multiple approaches for text representation, clustering, and topic modeling, and 
evaluate their effectiveness. Our goals are to:

- Transform reviews into numerical representations (Bag-of-Words, TF-IDF, Word2Vec)
- Group reviews into clusters based on similarity
- Extract latent themes (topics) from reviews using topic modeling
- Reflect on potential biases and propose improvements

These techniques are widely used in industry, including for recommender systems, 
customer support analysis, and product quality monitoring.


### Step 1: Load Amazon Reviews Data

In [1]:
import pandas as pd

# Load dataset
# Ensure the dataset has at least one column with review text.

amazon = pd.read_csv(r'C:\Users\sarah\Desktop\BC\Fall 2025\MESA8414 Applied AI and Machine Learning\Assignment\Week 6 assignment\amazon-alexa.csv')   ###FILL IN BLANK: dataset path
amazon.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,2018-07-31 00:00:00,Charcoal Fabric,Love my Echo!,1
1,5,2018-07-31 00:00:00,Charcoal Fabric,Loved it!,1
2,4,2018-07-31 00:00:00,Walnut Finish,"""Sometimes while playing a game, you can answe...",1
3,5,2018-07-31 00:00:00,Charcoal Fabric,"""I have had a lot of fun with this thing. My 4...",1
4,5,2018-07-31 00:00:00,Charcoal Fabric,Music,1


### Step 2: Text Representation with Bag-of-Words and TF-IDF

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

text_col = "verified_reviews"

def normalize_text(x):
    if isinstance(x, list):
        return " ".join(map(str, x))
    if pd.isna(x):
        return ""
    return str(x)

amazon[text_col] = amazon[text_col].apply(normalize_text)
amazon_clean = amazon[amazon[text_col].str.strip().ne("")].copy()

# Bag-of-Words
vectorizer = CountVectorizer(max_features=5000)   ###FILL IN BLANK
X_bow = vectorizer.fit_transform(amazon['verified_reviews'])   ###FILL IN BLANK

# TF-IDF
tfidf = TfidfVectorizer(max_features=5000)   ###FILL IN BLANK
X_tfidf = tfidf.fit_transform(amazon['verified_reviews'])   ###FILL IN BLANK

print('BoW shape:', X_bow.shape)
print('TF-IDF shape:', X_tfidf.shape)

BoW shape: (3150, 4044)
TF-IDF shape: (3150, 4044)


Using the above analysis, elaborate on:
- What kind of information BoW captures and what it misses: The BoW shape of (3150, 4044) means that the 3,150 reviews are represented over a vocabulary of 4,044 unique words, which is below the `max_features=5000` cap. BoW captures how often each word appears in a review, but because it only counts frequencies, it misses the semantic relationships between words. It also ignores word order and multi-word expressions such as "New York".
  
- Why TF-IDF can sometimes be more effective: TF-IDF can be more effective when the task depends on distinguishing documents from one another. It down-weights words that appear in many documents, such as "the", "a", or "this", and gives more weight to rarer words that are more likely to carry actual information and meaning.
  
- How dimensionality (features) impacts representation: A larger number of features can support a more nuanced and detailed analysis, but high dimensionality is computationally demanding.
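
To make the TF-IDF weighting described above concrete, here is a minimal hand-rolled sketch. It assumes the smoothed IDF formula that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, and it skips the L2 normalization that `TfidfVectorizer` also applies. The three toy reviews and the helper names (`doc_freq`, `smoothed_idf`, `tfidf_weights`) are hypothetical and are not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical toy reviews (not taken from the dataset)
docs = [
    "the sound is great",
    "the setup is easy",
    "the alarm is loud",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def smoothed_idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_weights(tokens):
    # Raw term count multiplied by the smoothed IDF (no normalization)
    counts = Counter(tokens)
    return {t: c * smoothed_idf(t) for t, c in counts.items()}

weights = tfidf_weights(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {w:.3f}")
```

In this toy corpus, "the" and "is" occur in every review and receive the minimum weight, while "sound" and "great" occur in only one review and are weighted higher.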

### Step 3: Word Embeddings with Word2Vec

In [3]:
from gensim.models import Word2Vec

# Tokenization
amazon['tokens'] = amazon['verified_reviews'].astype(str).apply(lambda x: x.split())


# Train Word2Vec
model = Word2Vec(sentences=amazon['tokens'], vector_size=100, window=5, min_count=2, workers=4)

# Similar words
model.wv.most_similar('good')

[('speaker', 0.9995638132095337),
 ('The', 0.9995536804199219),
 ('great', 0.9994460940361023),
 ('is', 0.9994190335273743),
 ('This', 0.9993958473205566),
 ('a', 0.9993539452552795),
 ('great.', 0.9993239045143127),
 ('sound', 0.9993112683296204),
 ('sounds', 0.9992988109588623),
 ('well', 0.9992895126342773)]

Using the above analysis, elaborate on:
- Do the similar words make intuitive sense? Some of the similar words make intuitive sense, but others do not. For example, "great" and "well" are plausible neighbors of "good", and "speaker", "sound", and "sounds" are related to one another through the product context. However, "The", "is", "This", and "a" carry little meaning in and of themselves. They probably rank highly because they frequently appear near "good" in the reviews, so they do not provide useful insights. Tokens such as "The" and "great." also show that splitting on whitespace keeps capitalization and punctuation.
  
- What advantages embeddings provide compared to BoW/TF-IDF: Embeddings can capture contextual similarity between words. This is a major advantage over BoW/TF-IDF, which only use word frequencies.

  
- Situations where embeddings might fail: Words that were not seen during training get no embedding. Therefore, if the training dataset is small, or if the trained model is applied to documents that introduce new vocabulary from another domain, embeddings may fail.
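
The capitalized and punctuated neighbors in the output above ('The', 'great.') suggest that a slightly stricter tokenizer might help before training Word2Vec. The following is only a sketch: `simple_tokenize` and `split_known` are hypothetical helpers, the regular expression is one possible choice, and the small `vocab` set stands in for a trained model's vocabulary (in gensim 4 this would be something like the keys of `model.wv.key_to_index`).

```python
import re

def simple_tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", str(text).lower())

def split_known(tokens, vocab):
    """Separate tokens into those in the vocabulary and those that are not."""
    known = [t for t in tokens if t in vocab]
    unknown = [t for t in tokens if t not in vocab]
    return known, unknown

# Hypothetical review text and vocabulary
tokens = simple_tokenize("This is great. The sound is GREAT!")
print(tokens)

vocab = {"this", "is", "great", "the"}  # stand-in for a trained vocabulary
known, unknown = split_known(tokens, vocab)
print("known:", known)
print("out of vocabulary:", unknown)
```

With this tokenizer, "The" and "the" share one vector, as do "great." and "great", and out-of-vocabulary words can be detected before they are looked up in the model.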

### Step 4: Clustering Reviews

In [4]:
from sklearn.cluster import KMeans, AgglomerativeClustering
import numpy as np

# Choose representation
X = X_tfidf.toarray()   ###FILL IN BLANK if using another representation

# KMeans clustering
kmeans = KMeans(n_clusters=10, random_state=42)   ###FILL IN BLANK
kmeans_labels = kmeans.fit_predict(X)

print("=== KMeans Results ===")
print("Cluster labels:", kmeans_labels)
print("Cluster counts:", np.bincount(kmeans_labels))
print()

# Hierarchical clustering
agg = AgglomerativeClustering(n_clusters=10)   ###FILL IN BLANK
agg_labels = agg.fit_predict(X)

print("=== Agglomerative Results ===")
print("Cluster labels:", agg_labels)
print("Cluster counts:", np.bincount(agg_labels))

=== KMeans Results ===
Cluster labels: [3 1 9 ... 9 9 7]
Cluster counts: [ 33 433 126 260 510 164  94 720 113 697]

=== Agglomerative Results ===
Cluster labels: [0 0 5 ... 5 5 2]
Cluster counts: [1643  219  244   54   28  768   35   38   24   97]


In [5]:
import numpy as np
import pandas as pd

# 1) Top terms per KMeans cluster 
feature_names = tfidf.get_feature_names_out()
def top_terms_for_centroid(centroid, n=10):
    idx = np.argsort(centroid)[::-1][:n]
    return list(zip(feature_names[idx], centroid[idx]))

print("=== KMeans: Top Terms per Cluster ===")
for c in range(kmeans.n_clusters):
    print(f"\nCluster {c} (n={np.sum(kmeans_labels==c)}):")
    print([t for t,_ in top_terms_for_centroid(kmeans.cluster_centers_[c], n=10)])

# 2) Top terms per Agglomerative cluster 
X_dense = X
print("\n=== Agglomerative: Top Terms per Cluster ===")
for c in range(agg.n_clusters):
    mask = (agg_labels == c)
    centroid = X_dense[mask].mean(axis=0)
    centroid = np.asarray(centroid).ravel()
    print(f"\nCluster {c} (n={mask.sum()}):")
    print([t for t,_ in top_terms_for_centroid(centroid, n=10)])

=== KMeans: Top Terms per Cluster ===

Cluster 0 (n=33):
['expected', 'as', 'more', 'does', 'and', 'than', 'what', 'works', 'arrived', 'everything']

Cluster 1 (n=433):
['it', 'was', 'my', 'to', 'and', 'for', 'the', 'but', 'works', 'not']

Cluster 2 (n=126):
['like', 'it', 'new', 'just', 'really', 'works', 'other', 'charm', 'the', 'far']

Cluster 3 (n=260):
['love', 'it', 'my', 'echo', 'this', 'alexa', 'everything', 'so', 'the', 'absolutely']

Cluster 4 (n=510):
['the', 'and', 'we', 'echo', 'is', 'to', 'of', 'with', 'for', 'in']

Cluster 5 (n=164):
['easy', 'set', 'up', 'to', 'use', 'and', 'very', 'setup', 'it', 'was']

Cluster 6 (n=94):
['love', 'it', 'great', 'still', 'kids', 'amazing', 'said', 'enough', 'say', 'well']

Cluster 7 (n=720):
['good', 'works', 'great', 'very', 'sound', 'and', 'as', 'nice', 'product', 'awesome']

Cluster 8 (n=113):
['great', 'product', 'works', 'sound', 'it', 'price', 'use', 'and', 'device', 'buy']

Cluster 9 (n=697):
['to', 'the', 'it', 'and', 'my', 'is'

Using the above analysis, elaborate on:
- What kinds of reviews are grouped together? To see what kinds of reviews are grouped together, I looked at the top terms in each cluster. Some clusters do a fairly good job of grouping semantically similar words. For example, K-means cluster 7 contains words such as 'good', 'works', 'great', 'nice', and 'awesome'. However, other clusters, such as K-means cluster 4, contain mostly words like 'the', 'and', 'is', and 'to' that do not carry meaning of their own. The same is true for agglomerative clustering.
  
- Differences between KMeans and Hierarchical clustering results: K-means produces several medium-to-large clusters (the largest have 720 and 697 reviews). Hierarchical clustering produces one very large cluster (1,643 reviews), one mid-sized cluster (768 reviews), and several much smaller ones.
  
- How cluster interpretability depends on chosen representation: Cluster interpretability depends on the chosen representation. With BoW/TF-IDF, each dimension corresponds to a word, so the highest-weighted terms in a cluster can be read directly as a rough topic label. With word embeddings, although the context is preserved, it may be harder to interpret clusters directly by just looking at the raw vectors.

### Step 5: Topic Modeling (LDA and BERTopic)

In [6]:
from sklearn.decomposition import LatentDirichletAllocation

# LDA
lda = LatentDirichletAllocation(n_components=10, random_state=42)  
lda.fit(X_tfidf)

# Display top words per topic
for idx, topic in enumerate(lda.components_):
    print(f"Topic {idx}", [tfidf.get_feature_names_out()[i] for i in topic.argsort()[-10:]])

Topic 0 ['far', 'alexa', 'you', 'like', 'so', 'and', 'the', 'love', 'everything', 'it']
Topic 1 ['loves', 'really', 'this', 'spot', 'great', 'like', 'my', 'echo', 'it', 'love']
Topic 2 ['are', 'echo', 'enjoy', 'new', 'and', 'family', 'the', 'to', 'we', 'my']
Topic 3 ['volume', 'issue', 'champ', 'very', 'excited', 'clear', 'friendly', 'user', 'low', 'fantastic']
Topic 4 ['it', 'great', 'stick', 'works', 'excellent', 'very', 'sound', 'speaker', 'perfect', 'good']
Topic 5 ['have', 'echo', 'love', 'is', 'for', 'my', 'and', 'it', 'to', 'the']
Topic 6 ['second', 'purchased', 'was', 'third', 'problems', 'gift', 'working', 'fast', 'no', 'ease']
Topic 7 ['now', 'beyond', 'twice', 'listens', 'device', 'friend', 'rocks', 'alexa', 'phenomenal', 'awesome']
Topic 8 ['my', 'it', 'of', 'great', 'is', 'quality', 'fun', 'sound', 'amazing', 'the']
Topic 9 ['very', 'and', 'use', 'to', 'up', 'set', 'works', 'product', 'easy', 'great']


In [7]:
!pip install bertopic



In [8]:
# BERTopic (optional)
from bertopic import BERTopic
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(amazon['verified_reviews'])
topic_model.get_topic_info()


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,879,-1_the_to_and_it,"[the, to, and, it, is, my, have, for, not, with]","[""Much easier to use than the dot. It picks up..."
1,0,124,0_stick_fire_tv_cable,"[stick, fire, tv, cable, firestick, watch, net...","[I love my fire stick, We all just love the TV..."
2,1,112,1_alexa_she_we_or,"[alexa, she, we, or, our, her, of, off, that, to]","[""I purchased this for my mother who is having..."
3,2,91,2_music_play_being_listen,"[music, play, being, listen, it, all, audioboo...",[I like it alot. I connected it so I can play ...
4,3,69,3_clock_alarm_night_at,"[clock, alarm, night, at, for, as, this, the, ...","[""This was given to my 7 year at the time as a..."
...,...,...,...,...,...
76,75,13,75_works_great_fine_awesome,"[works, great, fine, awesome, any, other, one,...","[Works great!, Works great!, Works great!]"
77,76,13,76_plus_pleasure_addition_echo,"[plus, pleasure, addition, echo, facts, leisur...","[Great addition to my Echo Plus!, Love my new ..."
78,77,13,77_tv_smart_manufacturerslg_grandsons,"[tv, smart, manufacturerslg, grandsons, linksp...","[""I’m having trouble connecting my tv to it, b..."
79,78,12,78_quality_thiswont_expert_sound,"[quality, thiswont, expert, sound, massive, ne...",[Love it. It works great. Alexa still has so...


Using the above analysis, elaborate on:
- Which method (LDA vs BERTopic) seems more coherent: The topics from LDA contain many stopwords like 'it', 'is', and 'the'. BERTopic still places 879 reviews in a stopword-heavy outlier topic (-1), but its other topics feature words that are more semantically coherent, for example 'clock', 'alarm', and 'night'.
  
- Practical insights companies can gain from extracted topics: Companies can gain many practical insights from extracted topics, especially those from BERTopic, such as which features customers talk about (Fire TV Stick, music playback, alarm clock). The Representative_Docs column is particularly useful because it gives more contextualized information for each topic.

  
- Challenges of topic modeling with short reviews: Short reviews contain only a small number of informative words and a large share of stopwords such as "I" and "The". Also, simple and short reviews like "Love it" can dominate the bag of words without providing much information or context.

### Step 6: Reflection

- What biases may exist in Amazon reviews datasets? Amazon reviews datasets can suffer from selection bias. Customers who are strongly motivated to write reviews are more likely to be represented in the dataset, so strongly positive or negative reviews may dominate over more neutral ones. Moreover, newer products may not have had enough time to receive customer reviews, so the dataset may give more weight to products that have been on the market for a long time.
  
- How can text representation choices (BoW vs TF-IDF vs embeddings) impact downstream clustering? With BoW, stopwords or generic praise such as "Love it" may dominate and therefore show up as the defining terms of clusters. TF-IDF may favor longer reviews, since they are likely to introduce more unique words, which TF-IDF weights more heavily. With embeddings, there can be a domain mismatch between pretrained models and datasets from other domains, which can lead to irrelevant groupings.


- Suggest high-level strategies to improve topic discovery (e.g., domain-specific embeddings, metadata, dimensionality reduction). Domain-specific embeddings trained on text from the target domain can help improve the quality of the embeddings. Including n-grams in TF-IDF can help preserve local context. Metadata integration can also help, for example by using review ratings to split the dataset into positive and negative reviews before modeling, or by using product categories to divide up the dataset. Principal component analysis may be used for dimensionality reduction, which can keep the dataset at a manageable size while keeping similar vectors close together.