# Amazon Review Topic Exploration

Goal:
- Discover emotional and product-attribute topics
- Unsupervised: Sentence embeddings → UMAP (5D) → HDBSCAN → UMAP (2D plot) → Cluster reading

Dataset:
- ~20k Amazon reviews
- We will start with a 3k subset for fast iteration

Workflow at a high level:
1. 20k reviews
2. Sentence embeddings (semantic meaning)
3. HDBSCAN (density-based clustering, auto #clusters)
4. UMAP (2D visualization)
5. Read + label clusters (human-in-the-loop)

Workflow at a detail level:
1. Data Loading & Sampling
2. Text Cleaning
3. Embeddings (SentenceTransformer)
4. Dimensionality Reduction (UMAP)
5. Clustering (HDBSCAN)
6. Cluster Inspection & Manual Labeling
7. Auto-Labeling Unknown Clusters   👈 YOU ARE HERE
8. Final Cluster Labels
9. Visualization
10. Export Results




In [4]:
# Cell 2: Imports

# Data handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# NLP & embeddings
from sentence_transformers import SentenceTransformer

# Clustering & dimensionality reduction
import umap
import hdbscan
from sklearn.feature_extraction.text import TfidfVectorizer



In [None]:
# show_cluster_examples

def show_cluster_examples(cluster_id, n=10):
    examples = (
        df_work[df_work["cluster"] == cluster_id]
        .sample(min(n, (df_work["cluster"] == cluster_id).sum()), random_state=42)
    )

    for i, row in examples.iterrows():
        print("-" * 80)

        rating = row["rating"] if "rating" in row else "NA"
        verified = row["verified_purchase"] if "verified_purchase" in row else "NA"

        print(f"Rating: {rating} | Verified: {verified}")
        print(row["doc"][:500])



In [6]:
# Label clusters efficiently


def top_terms(cluster_id, n=10):
    texts = df_work[df_work["cluster"] == cluster_id]["doc"]

    if len(texts) == 0:
        return []

    tfidf = TfidfVectorizer(
        stop_words="english",
        max_features=1000,
        ngram_range=(1, 2)
    )

    X = tfidf.fit_transform(texts)
    scores = X.mean(axis=0).A1
    terms = tfidf.get_feature_names_out()

    top = sorted(
        zip(terms, scores),
        key=lambda x: x[1],
        reverse=True
    )[:n]

    return top

In [None]:
# label cluster

def label_cluster(cluster_id, n_examples=10, n_terms=8):
    print(f"\n=== Cluster {cluster_id} ===\n")
    show_cluster_examples(cluster_id, n_examples)
    print("\nTop terms:")
    for term, score in top_terms(cluster_id, n_terms):
        print(term)


In [None]:
# Cell 3: Load full data
df = pd.read_csv("../data/reviews.csv")
print("Full dataset shape:", df.shape)
df.head()


In [9]:
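# Cell 4: Create working subset (~3k) stratified by rating
df_work = (
    df.dropna(subset=["text"])  # remove rows with missing text
      .groupby("rating", group_keys=False)
      .apply(
          lambda x: x.sample(n=min(len(x), 600), random_state=42),
          include_groups=False  # avoids future warning, but drops "rating" from the result
      )
)

# Restore the rating column by index so later cells can display it
df_work["rating"] = df.loc[df_work.index, "rating"]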

print("Working set size:", len(df_work))



Working set size: 3000


In [None]:
# Cell 5: Minimal cleaning & create 'doc' column

# Ensure text is string
df_work["text"] = df_work["text"].astype(str)

# Combine title + text for better semantic embeddings
df_work["doc"] = df_work["title"].fillna("") + ". " + df_work["text"]

# Quick sanity check
df_work[["title", "text", "doc"]].head(3)
#display(df_work.head(3))


In [11]:
# Cell 6: Optional sanity check

# Make sure 'doc' column exists
assert "doc" in df_work.columns, "'doc' column is missing; run the cleaning cell first"

In [12]:
# Cell 7: Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

In [1]:
# Cell 8: Generate embeddings

embeddings = model.encode(
    df_work["doc"].tolist(),
    batch_size=64,
    show_progress_bar=True
)

print("Embeddings shape:", embeddings.shape)

# Convert to a DataFrame for easy saving
embeddings_df = pd.DataFrame(embeddings)
embeddings_df['user_id'] = df_work['user_id'].values  # keep an ID to match reviews

# Save Embeddings to NPZ (compressed, faster to load)
np.savez_compressed("amazon_embeddings_3k.npz", embeddings=embeddings, user_id=df_work['user_id'].values)
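Reloading the archive later avoids re-encoding the reviews. The following is a minimal sketch of the round trip with stand-in data; the ids, the temporary path, and the demo file name are illustrative assumptions (384 is the output dimension of all-MiniLM-L6-v2).

```python
import os
import tempfile

import numpy as np

# Stand-in data: 4 fake embeddings and matching ids (not real review data).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 384)).astype(np.float32)
user_ids = np.array(["u1", "u2", "u3", "u4"])

path = os.path.join(tempfile.mkdtemp(), "embeddings_demo.npz")
np.savez_compressed(path, embeddings=embeddings, user_id=user_ids)

# np.load returns an archive; index it by the keyword names used when saving.
# Note: if the ids were saved as an object-dtype array, np.load would need
# allow_pickle=True to read them back.
with np.load(path) as archive:
    loaded_embeddings = archive["embeddings"]
    loaded_ids = archive["user_id"]

# Row order is the only link between the two arrays, so check the lengths agree.
assert loaded_embeddings.shape[0] == loaded_ids.shape[0]
print(loaded_embeddings.shape, loaded_ids[:2])
```

Because the arrays are matched by position only, the reloaded embeddings should be used with a `df_work` that has the same rows in the same order.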



In [14]:
# Cell 9: UMAP (for clustering, 5D)
umap_reducer = umap.UMAP(
    n_neighbors=20,
    n_components=5,
    min_dist=0.0,
    metric="cosine",
    random_state=42
)

embeddings_umap = umap_reducer.fit_transform(embeddings)

print("UMAP-reduced shape:", embeddings_umap.shape)





UMAP-reduced shape: (3000, 5)


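Why this cell exists

- HDBSCAN works much better in a reduced space, because density estimates become unreliable in the raw 384-dimensional embedding space
- 5 dimensions preserve topic structure while removing noise
- This projection is for clustering only, not for visualization yet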

In [15]:
# Cell 10: HDBSCAN clustering

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    metric="euclidean",
    cluster_selection_method="eom"
)

clusters = clusterer.fit_predict(embeddings_umap)
df_work["cluster"] = clusters


Why these parameters

- min_cluster_size=15: stable, interpretable topics
- min_samples=5: moderate noise tolerance
- Cluster -1 = noise (expected)

In [16]:
# Cell 11: Cluster size overview

df_work["cluster"].value_counts().head(10)

cluster
-1     918
 24    215
 26    155
 2      86
 5      76
 23     75
 36     73
 7      68
 29     58
 45     56
Name: count, dtype: int64

What you're looking for:

- Several clusters with dozens to hundreds of reviews
- Some -1 noise (10–30% is normal; this run has 918 of 3,000, about 31%)
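The noise share can be computed directly rather than read off the counts. This is a minimal sketch, assuming cluster labels are plain integers with -1 for noise; `noise_summary` is a hypothetical helper and not part of the notebook.

```python
from collections import Counter

def noise_summary(labels, noise_label=-1):
    """Return (number of real clusters, fraction of points labelled noise)."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0
    n_noise = counts.get(noise_label, 0)
    # Every distinct label except the noise label counts as a cluster.
    n_clusters = len(counts) - (1 if noise_label in counts else 0)
    return n_clusters, n_noise / total

# Toy labels for illustration; in the notebook this would be df_work["cluster"].
n_clusters, noise_share = noise_summary([-1, 0, 0, 1, 1, 1, -1, 2])
print(n_clusters, round(noise_share, 2))
```

A noise share well above the expected range may suggest lowering `min_samples` or `min_cluster_size`, though that trade-off depends on the data.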

In [None]:
show_cluster_examples(5)


Final Model Configuration (Frozen)

Embedding: all-MiniLM-L6-v2
UMAP:
  n_neighbors=20
  n_components=5
  min_dist=0.0
  metric=cosine

HDBSCAN:
  min_cluster_size=15
  min_samples=5
  metric=euclidean


In [None]:
label_cluster(46)



In [24]:
# summary_utils.py (can go in src/)

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def get_top_terms(texts, n_terms=10):
    """Return top TF-IDF terms from a list of texts."""
    if len(texts) == 0:
        return []
    
    tfidf = TfidfVectorizer(stop_words="english", max_features=1000, ngram_range=(1,2))
    X = tfidf.fit_transform(texts)
    scores = X.mean(axis=0).A1
    terms = tfidf.get_feature_names_out()
    top = sorted(zip(terms, scores), key=lambda x: x[1], reverse=True)[:n_terms]
    return [term for term, score in top]

def generate_cluster_summary(df, cluster_labels, n_terms=10, n_examples=3):
    """
    Generate a summary table for clusters.
    
    df: pandas DataFrame with at least ['cluster', 'doc', 'verified_purchase']
    cluster_labels: dict {cluster_id: human_label}
    n_terms: number of top TF-IDF terms to extract
    n_examples: number of review examples per cluster
    """
    summary_data = []

    for cluster_id in sorted(df["cluster"].unique()):
        cluster_df = df[df["cluster"] == cluster_id]
        if cluster_df.empty:
            continue

        n_reviews = len(cluster_df)
        verified_ratio = cluster_df["verified_purchase"].mean() if "verified_purchase" in cluster_df.columns else None
        label = cluster_labels.get(cluster_id, "Unknown")
        terms = get_top_terms(cluster_df["doc"], n_terms)
        examples = cluster_df["doc"].head(n_examples).tolist()

        summary_data.append({
            "Cluster ID": cluster_id,
            "Label": label,
            "Top Terms": ", ".join(terms),
            "Num Reviews": n_reviews,
            "Verified Ratio": round(verified_ratio, 2) if verified_ratio is not None else "NA",
            "Examples": "\n---\n".join(examples)
        })

    summary_df = pd.DataFrame(summary_data)
    return summary_df

def save_summary(df_summary, filepath="cluster_summary.csv"):
    """Save the cluster summary DataFrame to CSV."""
    df_summary.to_csv(filepath, index=False)
    print(f"Summary saved to {filepath}")



# Define your cluster labels
cluster_labels = {
    2: "Fit / Authenticity / Defects",
    5: "Socks / Fit & Quality",
    7: "Clothing / Fit & Quality",
    23: "Shoes / Durability & Quality Issues",
    24: "Positive / Satisfaction / Quality",
    26: "Work / Durability / Fit",
    29: "Shoes / Comfort & Satisfaction",
    36: "Shoes / Fit & Width Issues",
    45: "Socks / Fit & Comfort",
}

# Generate the summary
summary_df = generate_cluster_summary(df_work, cluster_labels)

# Display in notebook (a bare expression only renders when it is the last line of a cell)
display(summary_df)

# Save to CSV
save_summary(summary_df, "amazon_review_cluster_summary.csv")


Summary saved to amazon_review_cluster_summary.csv


In [None]:
# This code does NOT redo clustering.
# 👉 It only turns clusters into a 2D picture so humans can understand them.

# "UMAP maps high-dimensional embeddings into a new 2-D coordinate system.
# The two dimensions are learned by the algorithm and only represent relative similarity, not interpretable features."

# -----------------------------
# 1. Imports
# -----------------------------
import pandas as pd
import umap
import matplotlib.pyplot as plt
import seaborn as sns
from sentence_transformers import SentenceTransformer

# -----------------------------
# 2. Sanity checks (optional but recommended)
# -----------------------------
print("\nLoading df_work -----------------------")
display(df_work.head(3))

required_cols = {"doc", "cluster"}
missing = required_cols - set(df_work.columns)
if missing:
    raise ValueError(f"df_work is missing columns: {missing}")

# -----------------------------
# 3. Compute embeddings from the same 'doc' text that was clustered
# -----------------------------
model = SentenceTransformer("all-MiniLM-L6-v2")

embeddings = model.encode(
    df_work["doc"].tolist(),
    batch_size=64,
    show_progress_bar=True
)

# -----------------------------
# 4. Run UMAP
# -----------------------------
reducer = umap.UMAP(
    n_components=2,
    random_state=42,
    n_neighbors=15,
    min_dist=0.1
)

umap_embeddings = reducer.fit_transform(embeddings)

df_work["umap_x"] = umap_embeddings[:, 0]
df_work["umap_y"] = umap_embeddings[:, 1]

# -----------------------------
# 5. Map human-readable cluster labels
# -----------------------------
cluster_labels = {
    2: "Fit / Authenticity / Defects",
    5: "Socks / Fit & Quality",
    7: "Clothing / Fit & Quality",
    23: "Shoes / Durability & Quality",
    24: "Positive / Satisfaction / Quality",
    26: "Work / Durability / Fit",
    29: "Shoes / Comfort & Satisfaction",
    36: "Shoes / Fit & Width Issues",
    45: "Socks / Fit & Comfort"
}

df_work["label"] = (
    df_work["cluster"]
      .map(cluster_labels)
      .fillna("Unknown")
)

 
print ("\nAfter adding the label to df_work -----------------------")
display(df_work.head(3))

# -----------------------------
# 6. Plot clusters (human-readable)
# -----------------------------
plt.figure(figsize=(12, 9))
sns.scatterplot(
    data=df_work,
    x="umap_x",
    y="umap_y",
    hue="label",
    palette="tab20",
    s=50,
    alpha=0.8,
    legend="full"
)

plt.title("UMAP Visualization of Amazon Review Clusters", fontsize=16)
plt.xlabel("UMAP 1")
plt.ylabel("UMAP 2")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()


## 7. Auto-label Remaining Clusters


In [None]:
# what columns are currently in the df_work
df_work.columns

In [28]:
# STEP 1: Identify clusters that still need labels
unknown_clusters = (
    df_work[df_work["label"] == "Unknown"]["cluster"]
    .value_counts()
    .index
    .tolist()
)

len(unknown_clusters), unknown_clusters[:10]


(44, [-1, 9, 3, 37, 50, 51, 12, 4, 48, 33])

In [29]:
# STEP 2: Prepare TF-IDF keywords per cluster
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer(
    max_features=8000,
    stop_words="english",
    ngram_range=(1, 2)
)

tfidf = vectorizer.fit_transform(df_work["doc"])
terms = np.array(vectorizer.get_feature_names_out())


In [30]:
# STEP 3: Function to get top keywords for a cluster

def top_terms_for_cluster(cluster_id, n=10):
    # Use a NumPy boolean mask; a pandas Series does not index the sparse matrix reliably
    idx = (df_work["cluster"] == cluster_id).to_numpy()
    cluster_tfidf = tfidf[idx].mean(axis=0)
    scores = np.asarray(cluster_tfidf).flatten()
    top_idx = scores.argsort()[::-1][:n]
    return terms[top_idx]


In [31]:
# STEP 4: Function to sample example reviews

def sample_reviews(cluster_id, n=3):
    return (
        df_work[df_work["cluster"] == cluster_id]["doc"]
        .sample(n=min(n, sum(df_work["cluster"] == cluster_id)), random_state=42)
        .tolist()
    )


In [32]:
# STEP 5: Auto-label generator (rule-based)

def suggest_label(keywords):
    keywords = " ".join(keywords)

    if any(k in keywords for k in ["size", "fit", "small", "large", "tight", "wide"]):
        return "Fit & Sizing Issues"

    if any(k in keywords for k in ["quality", "cheap", "broke", "durable", "poor"]):
        return "Quality & Durability"

    if any(k in keywords for k in ["comfortable", "comfort", "wear", "light"]):
        return "Comfort & Wearability"

    if any(k in keywords for k in ["fake", "authentic", "real", "counterfeit"]):
        return "Authenticity Issues"

    if any(k in keywords for k in ["price", "value", "worth"]):
        return "Price & Value"

    if any(k in keywords for k in ["love", "great", "perfect", "recommend"]):
        return "Positive Satisfaction"

    return "Miscellaneous / Other"
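The rules in `suggest_label` test substrings of the joined keyword string, so "real" also matches "really" and "light" also matches "slightly". One possible alternative, sketched here under the assumption that whole-word matching is wanted, compares tokens instead. `suggest_label_tokens` and its abbreviated `RULES` table are hypothetical and did not produce the outputs shown in this notebook.

```python
import re

# Abbreviated, illustrative rule table in priority order (first match wins).
RULES = [
    ("Fit & Sizing Issues", {"size", "fit", "small", "large", "tight", "wide"}),
    ("Quality & Durability", {"quality", "cheap", "broke", "durable", "poor"}),
    ("Authenticity Issues", {"fake", "authentic", "real", "counterfeit"}),
]

def suggest_label_tokens(keywords, rules=RULES, default="Miscellaneous / Other"):
    """Label a cluster by whole-word matches against its top keywords."""
    tokens = set()
    for kw in keywords:
        # Split bigrams such as "really nice" into individual lowercase words.
        tokens.update(re.findall(r"[a-z']+", str(kw).lower()))
    for label, trigger_words in rules:
        if tokens & trigger_words:
            return label
    return default

print(suggest_label_tokens(["really nice", "gift"]))   # no whole-word trigger present
print(suggest_label_tokens(["runs small", "order"]))   # "small" is a trigger word
```

Whole-word matching is stricter, so inflected forms such as "sizes" or "fits" would need to be added to the trigger sets explicitly.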


In [37]:
# STEP 6: Generate label suggestions for Unknown clusters

def top_terms_for_cluster(cluster_id, n=10):
    idx = (df_work["cluster"] == cluster_id).values  # convert to NumPy so it can index the sparse matrix
    cluster_tfidf = tfidf[idx].mean(axis=0)
    scores = np.asarray(cluster_tfidf).flatten()
    top_idx = scores.argsort()[::-1][:n]
    return terms[top_idx]

auto_labels = {}

for cid in unknown_clusters:
    keywords = top_terms_for_cluster(cid, n=12)
    label = suggest_label(keywords)
    auto_labels[cid] = label

list(auto_labels.items())[:5]


[(-1, 'Fit & Sizing Issues'),
 (9, 'Miscellaneous / Other'),
 (3, 'Quality & Durability'),
 (37, 'Fit & Sizing Issues'),
 (50, 'Quality & Durability')]

In [None]:
# STEP 7: Review suggestions (CRITICAL STEP)

for cid, label in list(auto_labels.items())[:10]:
    print(f"\n=== Cluster {cid} → Suggested: {label} ===")
    print("Top terms:", top_terms_for_cluster(cid))
    for r in sample_reviews(cid):
        print("-", r[:200])


🧠 This is where you judge correctness. For each cluster:

- Keep the label
- Edit it
- Or mark the cluster for manual review
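One way to record those decisions is a per-cluster precedence: a reviewed label wins over an automatic suggestion, and anything flagged stays visibly unresolved. This is a minimal sketch; `resolve_label`, `needs_review`, and the placeholder strings are hypothetical names and not part of the notebook.

```python
def resolve_label(cluster_id, reviewed, suggested, needs_review=frozenset()):
    """Pick the final label for one cluster.

    reviewed:     dict of labels a human has confirmed or edited
    suggested:    dict of rule-based suggestions
    needs_review: cluster ids flagged for a closer manual look
    """
    if cluster_id in reviewed:
        return reviewed[cluster_id]
    if cluster_id in needs_review:
        return "Needs manual review"
    return suggested.get(cluster_id, "Unknown")

# Illustrative inputs using a few cluster ids from this notebook.
reviewed = {9: "Sweat Protection / Undershirt Performance"}
suggested = {9: "Miscellaneous / Other", 3: "Quality & Durability", 37: "Fit & Sizing Issues"}

for cid in (9, 3, 37, 99):
    print(cid, resolve_label(cid, reviewed, suggested, needs_review={37}))
```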

In [47]:
refined_labels = {
    -1: "Other / Mixed / Noise",
    9: "Sweat Protection / Undershirt Performance",
    3: "Shoe Trees / Sizing & Quality",
    37: "Shoes / Run Small / Size Accuracy",
    50: "Socks / Positive Quality & Comfort",
    51: "Socks / Value, Warmth & Wool Quality",
    12: "Comfort & Cushioning (Positive)",
    4: "Generic Positive Reviews",
    48: "Warmth & Softness (Cold Weather Wear)",
    33: "Socks / Durability Failures (Holes)"
}

# Apply manual refined labels first
df_work['label'] = df_work['cluster'].map(refined_labels).fillna(df_work['label'])


In [48]:
# STEP 8: Fill in the remaining Unknowns with auto_labels

for cid, label in auto_labels.items():
    df_work.loc[
        (df_work["cluster"] == cid) & (df_work["label"] == "Unknown"),
        "label"
    ] = label


In [49]:
# STEP 9: Check remaining Unknowns
# You should now have far fewer "Unknown"


df_work["label"].value_counts()

label
Other / Mixed / Noise                        918
Fit & Sizing Issues                          379
Positive / Satisfaction / Quality            215
Comfort & Wearability                        159
Work / Durability / Fit                      155
Quality & Durability                         103
Miscellaneous / Other                         95
Fit / Authenticity / Defects                  86
Socks / Fit & Quality                         76
Shoes / Durability & Quality                  75
Shoes / Fit & Width Issues                    73
Clothing / Fit & Quality                      68
Positive Satisfaction                         64
Shoes / Comfort & Satisfaction                58
Socks / Fit & Comfort                         56
Sweat Protection / Undershirt Performance     56
Shoe Trees / Sizing & Quality                 54
Shoes / Run Small / Size Accuracy             52
Comfort & Cushioning (Positive)               42
Socks / Value, Warmth & Wool Quality          42
Socks / Positi

In [50]:
# STEP 10: Save final labeled dataset

df_work.to_csv("amazon_reviews_with_final_labels.csv", index=False)
