
In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score
import pickle


In [None]:
# Load preprocessed reviews
data = pd.read_csv('amazon_reviews.csv')
print(data.head())

      Reviewer ID Product Purchased Customer Name  \
0  A3SBTW3WS4IQSN        B007WTAJTO           NaN   
1  A18K1ODH1I2MVB        B007WTAJTO          0mie   
2  A2FII3I2MBMUIA        B007WTAJTO           1K3   
3   A3H99DFEG68SR        B007WTAJTO           1m2   
4  A375ZM4U047O79        B007WTAJTO  2&amp;1/2Men   

                                         Review Text  Rating  
0                                         No issues.       4  
1  Purchased this for my device, it worked as adv...       5  
2  it works as expected. I should have sprung for...       4  
3  This think has worked out great.Had a diff. br...       5  
4  Bought it with Retail Packaging, arrived legit...       5  


In [None]:
data['Review Text'] = data['Review Text'].fillna('')
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')
X = vectorizer.fit_transform(data['Review Text'])

In [None]:
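# Try different cluster numbers and record both metrics
# (silhouette: higher is better; Davies-Bouldin: lower is better)
scores = {}
X_dense = X.toarray()  # dense copy for davies_bouldin_score; convert once, not per K
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42)
    labels = km.fit_predict(X)
    sil = silhouette_score(X, labels)
    db = davies_bouldin_score(X_dense, labels)
    scores[k] = (sil, db)
    print(f"K={k} → Silhouette={sil:.3f}, DB={db:.3f}")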


K=2 → Silhouette=0.004, DB=11.867
K=3 → Silhouette=0.004, DB=11.011
K=4 → Silhouette=0.005, DB=10.248
K=5 → Silhouette=0.004, DB=9.796
K=6 → Silhouette=0.006, DB=9.748
K=7 → Silhouette=0.005, DB=9.395


In [None]:
best_k = max(scores, key=lambda k: scores[k][0])  # highest silhouette
print("Best K:", best_k)

best_kmeans = KMeans(n_clusters=best_k, random_state=42)
labels = best_kmeans.fit_predict(X)
data['cluster'] = labels


Best K: 6


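By silhouette, KMeans performed best at K=6 (Silhouette = 0.006, DB Index = 9.748). All silhouette values are close to 0, which indicates heavily overlapping clusters, and the DB Index was actually lowest at K=7 (9.395), so the preference for K=6 is weak.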
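
The two metrics in the K sweep do not point to the same K, which is worth checking before trusting the selection. Below is a minimal sketch that uses the scores copied from the printed output; the name `printed_scores` is only illustrative and is not part of the notebook.

```python
# Scores copied from the printed K sweep: k -> (silhouette, davies_bouldin)
printed_scores = {
    2: (0.004, 11.867),
    3: (0.004, 11.011),
    4: (0.005, 10.248),
    5: (0.004, 9.796),
    6: (0.006, 9.748),
    7: (0.005, 9.395),
}

# Silhouette: higher is better. Davies-Bouldin: lower is better.
best_by_silhouette = max(printed_scores, key=lambda k: printed_scores[k][0])
best_by_db = min(printed_scores, key=lambda k: printed_scores[k][1])

# Range of silhouette values across all K, to judge how decisive the choice is
sil_values = [sil for sil, _ in printed_scores.values()]
sil_spread = max(sil_values) - min(sil_values)

print("Best K by silhouette:", best_by_silhouette)
print("Best K by Davies-Bouldin:", best_by_db)
print(f"Silhouette spread across K: {sil_spread:.3f}")
```

With a silhouette spread this small, the ranking of K values could plausibly change with a different `random_state`, so the chosen K should be read as tentative.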

In [None]:
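# DBSCAN with the default Euclidean metric; label -1 marks noise points
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X)
print("DBSCAN noise fraction:", np.mean(labels_db == -1))
mask = labels_db != -1  # score only the points that were assigned to a cluster
if len(set(labels_db[mask])) > 1:
    print("DBSCAN Silhouette:", silhouette_score(X[mask], labels_db[mask]))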


DBSCAN failed to form clear clusters: with `eps=0.5` and `min_samples=5` on sparse, high-dimensional TF-IDF vectors, most reviews were marked as noise.
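
One way to see why `eps=0.5` is so strict here: `TfidfVectorizer` L2-normalizes each row by default, and for two unit vectors the Euclidean distance d and cosine similarity c satisfy d^2 = 2 - 2c. The sketch below converts a DBSCAN radius into the cosine similarity it implies. The helper name `eps_to_cosine_similarity` is hypothetical, and the relation assumes both vectors have unit length (empty reviews, which become all-zero rows, are an exception).

```python
import math


def eps_to_cosine_similarity(eps):
    """Minimum cosine similarity for two unit vectors within Euclidean distance eps.

    Uses ||a - b||^2 = 2 - 2 * cos(a, b), which holds only when ||a|| = ||b|| = 1.
    """
    return 1.0 - (eps ** 2) / 2.0


# Sanity check with two 2-D unit vectors separated by a known angle
angle = math.radians(30)
a = (1.0, 0.0)
b = (math.cos(angle), math.sin(angle))
euclidean = math.dist(a, b)
cosine = a[0] * b[0] + a[1] * b[1]
assert math.isclose(eps_to_cosine_similarity(euclidean), cosine)

print(eps_to_cosine_similarity(0.5))  # 0.875
```

Under that assumption, two reviews count as neighbors only if their cosine similarity is at least 0.875, and each core point needs `min_samples` such neighbors (itself included), which short reviews with little shared vocabulary rarely reach.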

In [None]:
# Reuse best_k from KMeans so both methods are compared at the same cluster count
agg = AgglomerativeClustering(n_clusters=best_k)
labels_agg = agg.fit_predict(X.toarray())
print("Agglomerative Silhouette:", silhouette_score(X, labels_agg))


Agglomerative Silhouette: -0.001776528548745903


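Agglomerative clustering gave a slightly negative Silhouette (about -0.002), indicating poor separation: on average, reviews are no closer to their own cluster than to the nearest other one.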
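
To make the sign of the silhouette concrete, the toy sketch below computes the coefficient s = (b - a) / max(a, b) by hand, where a is a point's mean distance to its own cluster and b is its mean distance to the nearest other cluster. The 1-D data and the function name `silhouette_1d` are made up for illustration and are not taken from the reviews.

```python
def silhouette_1d(points, labels):
    """Mean silhouette coefficient for 1-D points using absolute distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other points in the same cluster
        same = [abs(p - q) for j, (q, lab_q) in enumerate(zip(points, labels))
                if lab_q == lab and j != i]
        if not same:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(abs(p - q) for q, lab_q in zip(points, labels) if lab_q == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_1d(points, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette_1d(points, [0, 1, 0, 1])   # labels cut across the groups
print(f"well separated: {good:.3f}, mixed: {bad:.3f}")
```

When the labels follow the two groups the score is close to 1; when they cut across the groups it turns negative, which is the regime the agglomerative result falls into, only much closer to zero.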

In [None]:
with open("best_kmeans_model.pkl", "wb") as f:
    pickle.dump(best_kmeans, f)

with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

data.to_csv("clustered_reviews.csv", index=False)


Overall, KMeans scored highest of the three methods, but all scores were close to zero, suggesting that the TF-IDF representation of this dataset is difficult to cluster and could benefit from more advanced embeddings.