<a href="https://colab.research.google.com/github/shrimp0000/Data-Science-Project/blob/main/cell_phone_review_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load Data

In [None]:
import os
import json
import gzip
import pandas as pd
from urllib.request import urlopen

In [None]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
file = drive.CreateFile({'id':'1RyEko0NG-dm7t3qQNCxWmmiJL6hHU8bR'})
file.GetContentFile('Cell_Phones_and_Accessories_5.json.gz')

In [None]:
### load the review data

data = []
with gzip.open('Cell_Phones_and_Accessories_5.json.gz') as f:
    for l in f:
        data.append(json.loads(l.strip()))
    
# total length of the list; this equals the total number of reviews, not products
print(len(data))

# first row of the list
print(data[0])

1128437
{'overall': 5.0, 'verified': True, 'reviewTime': '08 4, 2014', 'reviewerID': 'A24E3SXTC62LJI', 'asin': '7508492919', 'style': {'Color:': ' Bling'}, 'reviewerName': 'Claudia Valdivia', 'reviewText': 'Looks even better in person. Be careful to not drop your phone so often because the rhinestones will fall off (duh). More of a decorative case than it is protective, but I will say that it fits perfectly and securely on my phone. Overall, very pleased with this purchase.', 'summary': "Can't stop won't stop looking at it", 'unixReviewTime': 1407110400}


In [None]:
# convert list into pandas dataframe

df = pd.DataFrame.from_dict(data)

print(len(df))

1128437


We use real Amazon review data. Because the full dataset is too large for Google Colab, we use the Cell Phones and Accessories subset of the original data, retrieved from https://nijianmo.github.io/amazon/index.html
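If even this subset is too large for the available memory, one option is to stop parsing after a fixed number of lines. The following is a minimal sketch, not part of the original notebook: `load_json_lines`, `max_records`, and the demo file are hypothetical names introduced for illustration.

```python
import gzip
import json
import os
import tempfile
from itertools import islice

def load_json_lines(path, max_records=None):
    """Parse a gzipped JSON-lines file, reading at most max_records lines (all lines if None)."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops reading early, so the rest of the file is never parsed
        return [json.loads(line) for line in islice(f, max_records) if line.strip()]

# Build a tiny demo file (made-up records) so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_reviews.json.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"overall": 5.0, "reviewText": f"review {i}"}) + "\n")

subset = load_json_lines(demo_path, max_records=3)
print(len(subset))
```

On the real file, the same helper could be pointed at `Cell_Phones_and_Accessories_5.json.gz`. Note that taking the first lines is not a random sample, because the file appears to be grouped by product.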

In [None]:
df.head()

| | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | True | 08 4, 2014 | A24E3SXTC62LJI | 7508492919 | {'Color:': ' Bling'} | Claudia Valdivia | Looks even better in person. Be careful to not... | Can't stop won't stop looking at it | 1407110400 | | |
| 1 | 5.0 | True | 02 12, 2014 | A269FLZCB4GIPV | 7508492919 | | sarah ponce | When you don't want to spend a whole lot of ca... | 1 | 1392163200 | | |
| 2 | 3.0 | True | 02 8, 2014 | AB6CHQWHZW4TV | 7508492919 | | Kai | so the case came on time, i love the design. I... | Its okay | 1391817600 | | |
| 3 | 2.0 | True | 02 4, 2014 | A1M117A53LEI8 | 7508492919 | | Sharon Williams | DON'T CARE FOR IT. GAVE IT AS A GIFT AND THEY... | CASE | 1391472000 | | |
| 4 | 4.0 | True | 02 3, 2014 | A272DUT8M88ZS8 | 7508492919 | | Bella Rodriguez | I liked it because it was cute, but the studs ... | Cute! | 1391385600 | | |


In [None]:
df.dropna(subset=['reviewText'],inplace=True)

In [None]:
print("total number of reviews:")
len(df)

total number of reviews:


1127672

In [None]:
df.reset_index(inplace=True, drop=True)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1127672 entries, 0 to 1127671
Data columns (total 12 columns):
 #   Column          Non-Null Count    Dtype  
---  ------          --------------    -----  
 0   overall         1127672 non-null  float64
 1   verified        1127672 non-null  bool   
 2   reviewTime      1127672 non-null  object 
 3   reviewerID      1127672 non-null  object 
 4   asin            1127672 non-null  object 
 5   style           604807 non-null   object 
 6   reviewerName    1127538 non-null  object 
 7   reviewText      1127672 non-null  object 
 8   summary         1127206 non-null  object 
 9   unixReviewTime  1127672 non-null  int64  
 10  vote            92001 non-null    object 
 11  image           27013 non-null    object 
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 95.7+ MB


Let's check the first 200 reviews first to get a general idea of the review texts:

In [None]:
# .loc slices include the end label, so :199 selects exactly the first 200 reviews
data = df.loc[:199, 'reviewText'].tolist()

In [None]:
data

['Looks even better in person. Be careful to not drop your phone so often because the rhinestones will fall off (duh). More of a decorative case than it is protective, but I will say that it fits perfectly and securely on my phone. Overall, very pleased with this purchase.',
 "When you don't want to spend a whole lot of cash but want a great deal...this is the shop to buy from!",
 "so the case came on time, i love the design. I'm actually missing 2 studs but nothing too noticeable the studding is almost a bit sloppy around the bow, but once again not too noticeable. I haven't put in my phone yet so this is just what I've notice so far",
 "DON'T CARE FOR IT.  GAVE IT AS A GIFT AND THEY WERE OKAY WITH IT.  JUST NOT WHAT I EXPECTED.",
 'I liked it because it was cute, but the studs fall off easily and to protect a phone this would not be recommended. Buy if you just like it for looks.',
 "The product looked exactly like the picture and it was very nice. However only days later it fell apa

# Tokenization of review texts

In [None]:
import numpy as np
import pandas as pd
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
stopwords = nltk.corpus.stopwords.words('english')
stopwords.append("'s")
stopwords.append("'m")
stopwords.append("\n")
stopwords.append("phone")

We add the word `phone` to our stopword list. This dataset contains reviews of cell phones and accessories, so we expect the word to appear very often without carrying useful information: we already know the users are talking about phones.
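One way to sanity-check this choice is to measure how many reviews mention the word at all. The sketch below is a simplified stand-in that uses a plain regular expression instead of the NLTK tokenizer. `document_frequency` is a hypothetical helper, and the five strings are short excerpts of the first five sample reviews shown earlier.

```python
import re

def document_frequency(texts, term):
    """Fraction of texts that contain `term` as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    hits = sum(1 for text in texts if pattern.search(text.lower()))
    return hits / len(texts)

# Excerpts from the first five reviews in the sample output.
excerpts = [
    "it fits perfectly and securely on my phone",
    "this is the shop to buy from!",
    "I haven't put in my phone yet",
    "JUST NOT WHAT I EXPECTED.",
    "to protect a phone this would not be recommended",
]

phone_df = document_frequency(excerpts, "phone")
print(phone_df)
```

A frequent word is not removed automatically here: the vectorizer's `max_df=0.99` setting only drops terms that appear in more than 99% of reviews, which is why `phone` is added to the stopword list by hand.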

# Stemming

In [None]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def tokenization_and_stemming(text):
    tokens = []
    for word in nltk.word_tokenize(text):
        if word.lower() not in stopwords:
            tokens.append(word.lower())

    filtered_tokens = []
    
    for token in tokens:
        if token.isalpha():
            filtered_tokens.append(token)
            
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

# TF-IDF matrix

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_model = TfidfVectorizer(max_df=0.99, max_features=1000,
                              min_df=0.01, stop_words='english',
                              use_idf=True, tokenizer=tokenization_and_stemming,
                              ngram_range=(1,1))

review_text = df['reviewText'].tolist()

tfidf_matrix = tfidf_model.fit_transform(review_text)

tfidf_matrix.shape



(1127672, 363)

In [None]:
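# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tf_selected_words = tfidf_model.get_feature_names_out().tolist()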
tf_selected_words



['abl',
 'absolut',
 'access',
 'actual',
 'ad',
 'add',
 'advertis',
 'air',
 'allow',
 'alreadi',
 'alway',
 'amaz',
 'amazon',
 'android',
 'anoth',
 'anyon',
 'anyth',
 'app',
 'appear',
 'appl',
 'appli',
 'area',
 'arriv',
 'attach',
 'away',
 'awesom',
 'bad',
 'band',
 'base',
 'batteri',
 'beauti',
 'belt',
 'best',
 'better',
 'big',
 'bit',
 'black',
 'blue',
 'bluetooth',
 'bought',
 'box',
 'brand',
 'break',
 'broke',
 'bubbl',
 'built',
 'bulk',
 'bulki',
 'button',
 'buy',
 'ca',
 'cabl',
 'came',
 'camera',
 'car',
 'card',
 'care',
 'carri',
 'case',
 'caus',
 'cell',
 'chang',
 'charg',
 'charger',
 'cheap',
 'clean',
 'clear',
 'clip',
 'close',
 'color',
 'come',
 'comfort',
 'compani',
 'compar',
 'complaint',
 'complet',
 'connect',
 'contact',
 'cool',
 'cord',
 'corner',
 'cost',
 'coupl',
 'cover',
 'crack',
 'custom',
 'cut',
 'cute',
 'damag',
 'daughter',
 'day',
 'deal',
 'decent',
 'decid',
 'definit',
 'describ',
 'design',
 'devic',
 'differ',
 'difficu

# Clustering

We will use K-means and latent Dirichlet allocation (LDA) to generate clusters.

**Kmeans**

In [None]:
from sklearn.cluster import KMeans

In [None]:
# k-means with 10 clusters

num_clusters = 10

km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

In [None]:
product = { 'review': df.reviewText, 'cluster': clusters}
frame = pd.DataFrame(product, columns = ['review', 'cluster'])

In [None]:
print ("Number of reviews in each cluster:")
frame['cluster'].value_counts().to_frame()

Number of reviews in each cluster:


| | cluster |
|---|---|
| 8 | 454536 |
| 2 | 200077 |
| 4 | 141419 |
| 9 | 82473 |
| 6 | 72183 |
| 7 | 52115 |
| 1 | 51366 |
| 0 | 38003 |
| 3 | 24371 |
| 5 | 11129 |


The cluster sizes are highly imbalanced. Cluster 8 contains 454536 reviews (about 40% of the total), 7 of the 10 clusters have fewer than 100000 reviews each, and the smallest cluster has only 11129. This raises concerns about how useful the result is. For example, the large clusters may contain several distinct groups that K-means with 10 clusters failed to separate.
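One way to probe whether a large cluster hides several groups is to run K-means again on its members only. The sketch below shows the idea on synthetic 2-D points rather than the review data. The blobs, the choice of two sub-clusters, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (made up): three tight blobs, two of which lie close together.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(3.0, 0.0), scale=0.3, size=(50, 2))
blob_c = rng.normal(loc=(20.0, 0.0), scale=0.3, size=(50, 2))
X = np.vstack([blob_a, blob_b, blob_c])

# First pass with too few clusters: the two nearby blobs are merged into one.
first_pass = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sizes = np.bincount(first_pass.labels_)
largest = int(sizes.argmax())

# Second pass: re-cluster only the rows assigned to the largest cluster.
members = np.flatnonzero(first_pass.labels_ == largest)
second_pass = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[members])
sub_sizes = np.bincount(second_pass.labels_)

print("first pass sizes:", sorted(sizes.tolist()))
print("sub-cluster sizes of the largest cluster:", sorted(sub_sizes.tolist()))
```

On the notebook's data, the analogous step would presumably select rows with `np.flatnonzero(km.labels_ == 8)` and fit a second `KMeans` on `tfidf_matrix[members]`. Whether the resulting sub-clusters are meaningful would still need to be judged from their top terms.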

In [None]:
km.cluster_centers_

array([[1.39694511e-03, 7.84006877e-04, 9.05308973e-04, ...,
        3.35899280e-03, 1.50780007e-03, 1.74531994e-03],
       [1.29651544e-03, 8.97719176e-03, 4.12937209e-04, ...,
        2.51809260e-03, 9.21475726e-04, 3.51688547e-03],
       [5.30651648e-03, 6.28347184e-03, 7.91618876e-03, ...,
        7.42280823e-03, 4.43396317e-03, 8.79781863e-03],
       ...,
       [1.59413628e-04, 4.32626011e-05, 2.15169879e-04, ...,
        9.39400224e-04, 2.86508784e-04, 1.21288123e-03],
       [5.61777381e-03, 3.49492837e-03, 2.60583040e-03, ...,
        8.92002571e-03, 4.69305991e-03, 8.69089805e-03],
       [6.34663086e-03, 5.00698707e-03, 1.76337049e-03, ...,
        7.32195124e-03, 3.63824858e-03, 7.80318699e-03]])

In [None]:
km.cluster_centers_.shape

(10, 363)

In [None]:
print ("<Document clustering result by K-means>")

# sort each centroid's weights in decreasing order; the loop below keeps the top 11 terms
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

Cluster_keywords_summary = {}
for i in range(num_clusters):
    print ("Cluster " + str(i) + " words:", end='')
    Cluster_keywords_summary[i] = []
    for ind in order_centroids[i, :11]:
        Cluster_keywords_summary[i].append(tf_selected_words[ind])
        print (tf_selected_words[ind] + ",", end='')
    print ()

<Document clustering result by K-means>
Cluster 0 words:nice,case,fit,look,product,protect,good,great,realli,price,love,
Cluster 1 words:love,case,great,color,daughter,perfect,fit,bought,wife,product,work,
Cluster 2 words:case,protect,like,fit,look,great,drop,love,good,use,iphon,
Cluster 3 words:excel,product,qualiti,recommend,thank,good,price,seller,case,great,fit,
Cluster 4 words:charg,work,charger,batteri,cabl,use,devic,usb,great,fast,time,
Cluster 5 words:ok,work,price,case,product,everyth,fit,good,qualiti,use,like,
Cluster 6 words:great,work,product,price,case,fit,look,protect,qualiti,item,buy,
Cluster 7 words:good,product,qualiti,price,work,case,fit,protect,look,far,item,
Cluster 8 words:use,fit,perfect,like,product,great,good,work,look,easi,time,
Cluster 9 words:screen,protector,glass,instal,protect,bubbl,easi,great,edg,use,case,


From the clusters, we can generalize certain information.

1. 8 of the 10 clusters contain the word `case`. Checking the 200 sample reviews confirms that it refers to phone cases. All of these clusters also contain positive words such as `nice`, `great`, or `good`. This suggests that phone cases are a popular and generally well-received category, so offering more phone cases may be profitable.

2. Cluster 1 contains the words `daughter`, `wife`, and `bought`, which suggests that these customers bought the case for a family member. We could show this group more advertisements for gift-friendly items around holidays and expect them to be more likely to purchase.

3. Cluster 4 consists mostly of words related to charging, such as `charger` and `batteri`, and the associated attitude word is `great`. This suggests that charging accessories may be profitable products.

4. Interestingly, `iphon` appears in one cluster, but no other phone model appears (such as galaxy, which we saw in the sample reviews), and it is associated with positive words. This may hint that the iPhone is preferred over other brands, but we need further experiments, such as hypothesis testing, to see whether the effect is real.
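Observations like these can be checked mechanically against the printed keyword lists. The sketch below copies the 11 top terms per cluster from the K-means output above. `clusters_containing` is a hypothetical helper introduced for illustration.

```python
# Top terms per cluster, copied from the K-means output above.
cluster_keywords = {
    0: "nice,case,fit,look,product,protect,good,great,realli,price,love",
    1: "love,case,great,color,daughter,perfect,fit,bought,wife,product,work",
    2: "case,protect,like,fit,look,great,drop,love,good,use,iphon",
    3: "excel,product,qualiti,recommend,thank,good,price,seller,case,great,fit",
    4: "charg,work,charger,batteri,cabl,use,devic,usb,great,fast,time",
    5: "ok,work,price,case,product,everyth,fit,good,qualiti,use,like",
    6: "great,work,product,price,case,fit,look,protect,qualiti,item,buy",
    7: "good,product,qualiti,price,work,case,fit,protect,look,far,item",
    8: "use,fit,perfect,like,product,great,good,work,look,easi,time",
    9: "screen,protector,glass,instal,protect,bubbl,easi,great,edg,use,case",
}

def clusters_containing(word):
    """Return the sorted ids of clusters whose top terms include the exact stem `word`."""
    return sorted(i for i, terms in cluster_keywords.items() if word in terms.split(","))

for stem in ["case", "daughter", "charger", "iphon"]:
    print(stem, clusters_containing(stem))
```

Because the terms are stems, lookups must use the stemmed form (`iphon`, `batteri`) rather than the original word.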

**LDA**

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
lda = LatentDirichletAllocation(n_components=10)

In [None]:
lda_output = lda.fit_transform(tfidf_matrix)
lda_output.shape

(1127672, 10)

In [None]:
topic_names = ["Topic" + str(i) for i in range(lda.n_components)]

doc_names = ["Doc" + str(i) for i in range(len(review_text))]

df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topic_names, index=doc_names)

# get dominant topic for each document
topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['topic'] = topic

df_document_topic.head(10)

Unnamed: 0,Topic0,Topic1,Topic2,Topic3,Topic4,Topic5,Topic6,Topic7,Topic8,Topic9,topic
Doc0,0.02,0.02,0.02,0.02,0.02,0.02,0.81,0.02,0.02,0.02,6
Doc1,0.71,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0
Doc2,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.75,0.03,8
Doc3,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.7,9
Doc4,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.76,9
Doc5,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.78,9
Doc6,0.83,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0
Doc7,0.4,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.43,9
Doc8,0.77,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0
Doc9,0.79,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0.02,0


In [None]:
df_document_topic['topic'].value_counts().to_frame()

| | topic |
|---|---|
| 0 | 172075 |
| 6 | 160330 |
| 5 | 145874 |
| 3 | 132675 |
| 8 | 117852 |
| 4 | 111004 |
| 9 | 101116 |
| 7 | 65976 |
| 2 | 65361 |
| 1 | 55409 |


With LDA, the groups are more balanced: the largest topic has 172075 reviews and the smallest has 55409.
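The difference in balance can be quantified from the two `value_counts()` outputs. The sketch below uses two simple measures chosen here for illustration, not taken from the original analysis: the ratio of the largest to the smallest group, and the entropy of the size distribution normalized so that 1.0 means perfectly equal groups.

```python
import math

# Group sizes copied from the two value_counts() outputs in this notebook.
kmeans_counts = [454536, 200077, 141419, 82473, 72183, 52115, 51366, 38003, 24371, 11129]
lda_counts = [172075, 160330, 145874, 132675, 117852, 111004, 101116, 65976, 65361, 55409]

def imbalance_ratio(counts):
    """Size of the largest group divided by the size of the smallest."""
    return max(counts) / min(counts)

def normalized_entropy(counts):
    """Shannon entropy of the group shares, scaled to [0, 1] by log(number of groups)."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))

for name, counts in [("K-means", kmeans_counts), ("LDA", lda_counts)]:
    print(f"{name}: max/min = {imbalance_ratio(counts):.1f}, "
          f"normalized entropy = {normalized_entropy(counts):.3f}")
```

Both measures describe only the group sizes. They say nothing about whether the groups are coherent, so they complement rather than replace inspecting the top words.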

In [None]:
def print_topic_words(tfidf_model, lda_model, n_words):
    # get_feature_names() was removed in scikit-learn 1.2; the replacement already returns a NumPy array
    words = tfidf_model.get_feature_names_out()
    topic_words = []

    for topic_words_weights in lda_model.components_:
        top_words = topic_words_weights.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words

In [None]:
# n = 10
topic_keywords = print_topic_words(tfidf_model=tfidf_model, lda_model=lda, n_words=10)        

df_topic_words = pd.DataFrame(topic_keywords)
df_topic_words.columns = ['Word '+str(i) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ['Topic '+str(i) for i in range(df_topic_words.shape[0])]
df_topic_words



| | Word 0 | Word 1 | Word 2 | Word 3 | Word 4 | Word 5 | Word 6 | Word 7 | Word 8 | Word 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Topic 0 | use | case | month | work | broke | time | day | money | batteri | year |
| Topic 1 | excel | ok | describ | fast | product | ship | recommend | exact | seller | great |
| Topic 2 | perfect | fit | thank | awesom | great | item | work | job | case | galaxi |
| Topic 3 | great | love | case | work | product | protect | color | look | bought | daughter |
| Topic 4 | nice | case | mount | use | hold | review | product | car | work | magnet |
| Topic 5 | charg | charger | work | cabl | batteri | use | devic | usb | cord | great |
| Topic 6 | case | protect | fit | button | like | drop | cover | feel | look | hard |
| Topic 7 | good | product | qualiti | expect | price | work | best | great | far | case |
| Topic 8 | screen | protector | instal | easi | glass | bubbl | appli | edg | protect | scratch |
| Topic 9 | like | case | look | cheap | fit | clip | advertis | watch | band | realli |


We can see that `case` is still a high-frequency word, and it mostly appears alongside positive words. However, topic 0 also contains the words `broke` and `batteri` and has no words that express an attitude. This topic deserves further exploration, for example by using the topics to predict ratings and checking how this topic relates to the prediction.
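A lighter first step than a predictive model is to compare the average star rating across dominant topics. The sketch below runs on a handful of made-up rows so that it is self-contained. The values are hypothetical and say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy rows (made up for illustration, not taken from the dataset).
toy = pd.DataFrame({
    "topic":   [0, 0, 3, 3, 5],           # dominant topic per review
    "overall": [1.0, 2.0, 5.0, 4.0, 5.0],  # star rating per review
})

# Mean rating and number of reviews for each dominant topic.
rating_by_topic = toy.groupby("topic")["overall"].agg(["mean", "count"])
print(rating_by_topic)
```

On the notebook's data, the same aggregation could presumably be applied to `df.assign(topic=topic)`, since `topic` was computed from `review_text` in the same row order as `df`.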

Also, topic 3 resembles the K-means cluster of customers who bought gifts for family members, and topic 5 resembles the charging-related K-means cluster. Because both methods recover these groups, we are more confident in these findings.

Finally, topic 2 contains `galaxi` alongside positive words, but no topic contains `iphon`. This casts further doubt on our earlier assumption that the iPhone has a stronger market. To investigate phone brands further, we should conduct dedicated experiments.
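One possible form for such an experiment is a permutation test on the ratings of reviews that mention each brand. The sketch below is one option among many. `permutation_test` is a hypothetical helper, and the two rating lists are made-up placeholders, not values from the dataset.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in mean between groups a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled ratings and split them into two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical ratings (made up): reviews mentioning each brand.
iphone_ratings = [5, 4, 5, 3, 5, 4, 5, 2]
galaxy_ratings = [4, 5, 3, 5, 4, 4, 5, 3]

p_value = permutation_test(iphone_ratings, galaxy_ratings)
print(f"p-value: {p_value:.3f}")
```

With the real data, the two groups would first have to be defined, for instance by filtering `reviewText` for each brand name, and that choice affects what the test can actually show.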