# Amazon Review Analysis - Document Clustering and Topic Modelling



In this project, we automatically analyze the underlying structure of a set of documents and inspect the clustering results using natural language processing methods and tools.

The dataset is a set of Amazon product reviews for watches (`watch_reviews.tsv`).

Scope:

- Text Processing (NLTK)
    - Tokenizing: splitting a text into individual words or sequences of words based on delimiters
    - Stop words: remove stop words such as "the" and "a" from the result set, since they convey little meaning
    - Stemming: counting inflected forms of a word together, which reduces the number of unique vocabulary items we need to track


- Feature Engineering
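    - Term Frequency-Inverse Document Frequency (TF-IDF)
            TF: count of word A in document B
            IDF: in its simplest form, 1 / (number of documents in which word A appears); in practice a logarithm is applied, e.g. log(N / document frequency), where N is the total number of documents
            TF-IDF = TF * IDF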
    


- Model Training
    - K-means clustering
    - Latent Dirichlet Allocation(LDA)
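
The TF-IDF weighting in the scope list can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the common logarithmic form `idf = log(N / df)`; the toy corpus and the helper name `tf_idf` are made up for the example, and scikit-learn's `TfidfVectorizer`, used later, applies a smoothed variant and normalizes each row.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already tokenized); not taken from the review data.
docs = [
    ["love", "watch", "love"],
    ["watch", "band", "broke"],
    ["watch", "great", "gift"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "watch" appears in every document, so its weight is zero, while "love" is both frequent in the first document and rare elsewhere, so it gets the largest weight.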


# Part 1: Set Up the Environment and Load Data

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import nltk
import gensim
import re
import os

from sklearn import decomposition
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/wenxianfei/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/wenxianfei/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [7]:
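# Load data into a DataFrame, skipping malformed rows
# (pandas >= 1.3 replaces error_bad_lines=False with on_bad_lines='skip' or 'warn')
df = pd.read_csv('watch_reviews.tsv',sep='\t',header = 0,error_bad_lines = False)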

b'Skipping line 8704: expected 15 fields, saw 22\nSkipping line 16933: expected 15 fields, saw 22\nSkipping line 23726: expected 15 fields, saw 22\n'
b'Skipping line 85637: expected 15 fields, saw 22\n'
b'Skipping line 132136: expected 15 fields, saw 22\nSkipping line 158070: expected 15 fields, saw 22\nSkipping line 166007: expected 15 fields, saw 22\nSkipping line 171877: expected 15 fields, saw 22\nSkipping line 177756: expected 15 fields, saw 22\nSkipping line 181773: expected 15 fields, saw 22\nSkipping line 191085: expected 15 fields, saw 22\nSkipping line 196273: expected 15 fields, saw 22\nSkipping line 196331: expected 15 fields, saw 22\n'
b'Skipping line 197000: expected 15 fields, saw 22\nSkipping line 197011: expected 15 fields, saw 22\nSkipping line 197432: expected 15 fields, saw 22\nSkipping line 208016: expected 15 fields, saw 22\nSkipping line 214110: expected 15 fields, saw 22\nSkipping line 244328: expected 15 fields, saw 22\nSkipping line 248519: expected 15 fields,

In [9]:
df.head(3)

Unnamed: 0,marketplace,customer_id,review_id,product_id,product_parent,product_title,product_category,star_rating,helpful_votes,total_votes,vine,verified_purchase,review_headline,review_body,review_date
0,US,3653882,R3O9SGZBVQBV76,B00FALQ1ZC,937001370,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",Watches,5,0,0,N,Y,Five Stars,Absolutely love this watch! Get compliments al...,2015-08-31
1,US,14661224,RKH8BNC3L5DLF,B00D3RGO20,484010722,Kenneth Cole New York Women's KC4944 Automatic...,Watches,5,0,0,N,Y,I love thiswatch it keeps time wonderfully,I love this watch it keeps time wonderfully.,2015-08-31
2,US,27324930,R2HLE8WKZSU3NL,B00DKYC7TK,361166390,Ritche 22mm Black Stainless Steel Bracelet Wat...,Watches,2,1,1,N,Y,Two Stars,Scratches,2015-08-31


In [10]:
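# Remove rows with a missing review body
# (dropna on the column alone, df.review_body.dropna(inplace=True), does not reliably drop rows from df)
df.dropna(subset = ['review_body'], inplace = True)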

In [11]:
data = df.loc[:1000,'review_body'].tolist()

# Part 2: Tokenizing and Stemming

In [12]:
# Use NLTK's English stop words.
stopwords = nltk.corpus.stopwords.words('english')

print("We use " + str(len(stopwords)) + " stop words from the NLTK library")
print(stopwords[:10])

We use 179 stop words from the NLTK library
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [23]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

# tokenization and stemming
def tokenization_stemming(text):
    # Split into sentences, then words; drop stop words, then lowercase.
    # Note: the stop-word check runs before lowercasing, so capitalized
    # stop words such as "I" are kept (see 'i' in the example output below).
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)
              if word not in stopwords]

    # keep only tokens that contain at least one letter (drops numeric tokens and raw punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    # stemming
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

# tokenization without stemming (same filtering as above)
def tokenization(text):
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)
              if word not in stopwords]

    # keep only tokens that contain at least one letter (drops numeric tokens and raw punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    return filtered_tokens
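
Because the functions above compare each word against the stop-word list before lowercasing it, a capitalized stop word such as "I" is not removed. A minimal sketch of the alternative order (lowercase first, then filter) is shown below. It is not the notebook's method: the regex tokenizer stands in for `nltk.word_tokenize`, and `STOPWORDS` and `simple_tokens` are hypothetical names with a tiny stop-word set chosen only for illustration.

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's list.
STOPWORDS = {"i", "the", "a", "it", "this", "every", "all"}

def simple_tokens(text, stopwords=STOPWORDS):
    """Lowercase first, then drop stop words and tokens without letters."""
    # A simple regex tokenizer stands in for nltk.word_tokenize here.
    raw = re.findall(r"[A-Za-z0-9']+", text)
    lowered = [tok.lower() for tok in raw]
    return [tok for tok in lowered
            if tok not in stopwords and re.search("[a-z]", tok)]

print(simple_tokens("I wear it every day, 24 7!"))
```

Switching the order in the notebook would change the vocabulary slightly, so the outputs shown in later cells would no longer match exactly.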

In [24]:
# Example: tokenize and stem the first review
# (this is stemming, not lemmatization, so results such as 'absolut' are not dictionary words)
tokenization_stemming(data[0])

['absolut',
 'love',
 'watch',
 'get',
 'compliment',
 'almost',
 'everi',
 'time',
 'i',
 'wear',
 'dainti']

In [25]:
# tokenize and stem all the documents
# also tokenize all the documents without stemming
# the goal is to create a mapping from stemmed words to original tokenized words for result interpretation.
docs_stemmed = []
docs_tokenized = []
for i in data:
    tokenized_and_stemmed_results = tokenization_stemming(i)
    docs_stemmed.extend(tokenized_and_stemmed_results)
    
    tokenized_results = tokenization(i)
    docs_tokenized.extend(tokenized_results)

In [28]:
# create a mapping from stemmed words to original words
# (the two lists are aligned; if a stem has several original forms, the last one seen wins)
vocab_frame_dict = dict(zip(docs_stemmed, docs_tokenized))
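
The mapping above keeps whichever original form of a stem happens to appear last. If a more stable label is wanted for interpretation, one option is to keep the most frequent original form instead. The sketch below assumes two aligned lists shaped like `docs_stemmed` and `docs_tokenized`; the helper name `most_common_original` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def most_common_original(stems, originals):
    """Map each stem to the original token it most often came from."""
    counts = defaultdict(Counter)
    for stem, original in zip(stems, originals):
        counts[stem][original] += 1
    return {stem: c.most_common(1)[0][0] for stem, c in counts.items()}

# Hypothetical aligned lists, shaped like docs_stemmed and docs_tokenized.
stems = ["love", "love", "love", "watch"]
originals = ["love", "love", "loved", "watch"]
print(most_common_original(stems, originals))
```

With this sample, the last-seen rule would label the stem "love" as "loved", while the frequency rule labels it "love".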

# Part 3: TF-IDF

In [30]:
# Define vectorizer parameters
# TfidfVectorizer builds the TF-IDF matrix for us
# max_df: ignore terms that appear in more than this fraction of documents
# min_df: ignore terms that appear in fewer than this fraction of documents
# max_features: maximum number of terms to keep
# use_idf: if False, only term frequency is used
# stop_words: built-in stop-word list
# tokenizer: how to tokenize each document
# ngram_range: (min_value, max_value), e.g. (1,3) means the result will include 1-grams, 2-grams and 3-grams
tfidf_model = TfidfVectorizer(max_df = 0.99,max_features = 1000,
                             min_df = 0.01, stop_words = 'english',
                             use_idf = True , tokenizer = tokenization_stemming,
                              ngram_range = (1,1))
tfidf_matrix = tfidf_model.fit_transform(data) # fit the vectorizer to the reviews

print('In total, there are ' + str(tfidf_matrix.shape[0]) + \
      ' reviews and ' + str(tfidf_matrix.shape[1]) + ' terms.')

In total, there are 1000 reviews and 245 terms.


In [31]:
tfidf_model.get_params()

{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 0.99,
 'max_features': 1000,
 'min_df': 0.01,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': 'english',
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': <function __main__.tokenization_stemming(text)>,
 'use_idf': True,
 'vocabulary': None}
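
The parameters above show `smooth_idf=True`, `sublinear_tf=False` and `norm='l2'`. Under those settings, scikit-learn's documentation describes the weighting as follows (a summary added for reference, not output from this run):

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t), \qquad
v_d \leftarrow \frac{v_d}{\lVert v_d \rVert_2}
```

Here $n$ is the number of documents, $\mathrm{df}(t)$ is the number of documents that contain term $t$, $\mathrm{tf}(t, d)$ is the raw count of $t$ in document $d$, and $v_d$ is the row of TF-IDF weights for document $d$.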

Save the terms identified by TF-IDF

In [32]:
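# terms kept by the vectorizer
# (scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2)
tf_selected_words = tfidf_model.get_feature_names()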

In [33]:
tf_selected_words

["'m",
 "'s",
 'abl',
 'absolut',
 'accur',
 'actual',
 'adjust',
 'alarm',
 'alreadi',
 'alway',
 'amaz',
 'amazon',
 'anoth',
 'arm',
 'arriv',
 'automat',
 'awesom',
 'bad',
 'band',
 'batteri',
 'beauti',
 'best',
 'better',
 'big',
 'bit',
 'black',
 'blue',
 'bought',
 'box',
 'br',
 'bracelet',
 'brand',
 'break',
 'bright',
 'broke',
 'button',
 'buy',
 'ca',
 'came',
 'case',
 'casio',
 'chang',
 'cheap',
 'clasp',
 'classi',
 'clock',
 'color',
 'come',
 'comfort',
 'compliment',
 'cool',
 'cost',
 'crown',
 'crystal',
 'dark',
 'date',
 'daughter',
 'day',
 'deal',
 'definit',
 'deliveri',
 'design',
 'dial',
 'differ',
 'difficult',
 'disappoint',
 'display',
 'dress',
 'durabl',
 'easi',
 'easili',
 'end',
 'everi',
 'everyday',
 'everyth',
 'exact',
 'excel',
 'expect',
 'expens',
 'face',
 'fair',
 'far',
 'fast',
 'featur',
 'feel',
 'fell',
 'fine',
 'finish',
 'fit',
 'function',
 'gave',
 'gift',
 'gold',
 'good',
 'got',
 'great',
 'hand',
 'happi',
 'hard',
 'heavi

In [34]:
# Calculate pairwise document similarity (cosine similarity between TF-IDF rows)
from sklearn.metrics.pairwise import cosine_similarity
cos_matrix = cosine_similarity(tfidf_matrix)
print(cos_matrix)

[[1.         0.42500259 0.         ... 0.         0.0400497  0.        ]
 [0.42500259 1.         0.         ... 0.         0.09423401 0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.0400497  0.09423401 0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
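
The third row of the matrix above has a 0 on the diagonal, which is what we would expect if that review has no terms left in the vocabulary, so its TF-IDF vector is all zeros. A minimal plain-Python sketch of cosine similarity with that convention follows; the vectors are hypothetical, and returning 0.0 for an all-zero vector is an assumption chosen to mirror the output shown.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed here for empty documents
    return dot / (norm_u * norm_v)

# Hypothetical TF-IDF rows: the third one has no retained terms.
rows = [[0.6, 0.8, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
matrix = [[round(cosine(a, b), 3) for b in rows] for a in rows]
for line in matrix:
    print(line)
```

Documents with an all-zero vector carry no information for clustering, so it can be worth counting them before fitting a model.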


# Part 4: K-means clustering

In [36]:
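# K-means clustering
from sklearn.cluster import KMeans

# number of clusters
num_clusters = 5

# fit K-means on the TF-IDF matrix (no random_state is set, so cluster labels can change between runs)
km = KMeans(n_clusters = num_clusters)
km.fit(tfidf_matrix)

# cluster label for each review
clusters = km.labels_.tolist()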

# 4.1 Analyze K-means Result

In [38]:
# Build a DataFrame that pairs each review's product title with its cluster label
# (the 'review' column holds product titles, not review text)
product = {'review' : df[:1000].product_title,'cluster':clusters}
frame = pd.DataFrame(product,columns = ['review','cluster'])

In [39]:
frame.head(10)

Unnamed: 0,review,cluster
0,"Invicta Women's 15150 ""Angel"" 18k Yellow Gold ...",0
1,Kenneth Cole New York Women's KC4944 Automatic...,0
2,Ritche 22mm Black Stainless Steel Bracelet Wat...,3
3,Citizen Men's BM8180-03E Eco-Drive Stainless S...,3
4,Orient ER27009B Men's Symphony Automatic Stain...,3
5,Casio Men's GW-9400BJ-1JF G-Shock Master of G ...,0
6,Fossil Women's ES3851 Urban Traveler Multifunc...,0
7,INFANTRY Mens Night Vision Analog Quartz Wrist...,3
8,G-Shock Men's Grey Sport Watch,3
9,Heiden Quad Watch Winder in Black Leather,3


In [41]:
print('Number of reviews included in each cluster:')
frame['cluster'].value_counts().to_frame()

Number of reviews included in each cluster:


Unnamed: 0,cluster
3,710
0,112
1,74
4,68
2,36

In [42]:
print('<Document clustering result by K-means>')

order_centroids = km.cluster_centers_.argsort()[:,::-1]

Cluster_keywords_summary = {}
for i in range(num_clusters):
    print('Clusters ' + str(i) + ' words:' , end = '')
    Cluster_keywords_summary[i] = []
    for ind in order_centroids[i,:6]:# replace 6 with n words per cluster
        Cluster_keywords_summary[i].append(vocab_frame_dict[tf_selected_words[ind]])
        print(vocab_frame_dict[tf_selected_words[ind]] + ',', end = '')
    print()
    
    cluster_reviews = frame[frame.cluster == i].review.tolist()
    print('Cluster ' + str(i) + ' reviews (' + str(len(cluster_reviews)) + ' reviews): ' )
    print(', '.join(cluster_reviews))
    print()


<Document clustering result by K-means>
Clusters 0 words:loved,watch,wife,looks,husband,beautiful,
Cluster 0 reviews (112 reviews): 
Invicta Women's 15150 "Angel" 18k Yellow Gold Ion-Plated Stainless Steel and Brown Leather Watch, Kenneth Cole New York Women's KC4944 Automatic Silver Automatic Mesh Bracelet Analog Watch, Casio Men's GW-9400BJ-1JF G-Shock Master of G Rangeman Digital Solar Black Carbon Fiber Insert Watch, Fossil Women's ES3851 Urban Traveler Multifunction Stainless Steel Watch - Rose, Domire Fashion Accessories Trial Order New Quartz Fashion Weave Wrap Around Leather Bracelet Lady Woman Butterfly Wrist Watch, Batman Kids' BAT4072 Black Rubber Batman Logo Strap Watch, Timex Easy Reader Day-Date Leather Strap Watch, Casio F108WH Water Resistant Digital Blue Resin Strap Watch, Stuhrling Original Women's 956.02 Symphony Gold-Tone Watch with Brown Genuine Leather Band, Seiko Men's SNKK27 Seiko 5 Stainless Steel Automatic Watch, Swiss Legend Women's 11044D-01 Neptune Black Di
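
The keyword lists above come from sorting each centroid's term weights in descending order, which is what `km.cluster_centers_.argsort()[:, ::-1]` does for all clusters at once. A plain-Python sketch of the same idea for a single centroid is given below; the helper name `top_terms`, the vocabulary and the weights are hypothetical.

```python
def top_terms(centroid, terms, n=3):
    """Return the n terms with the largest centroid weights, largest first."""
    order = sorted(range(len(centroid)), key=lambda j: centroid[j], reverse=True)
    return [terms[j] for j in order[:n]]

# Hypothetical vocabulary and centroid weights, for illustration only.
terms = ["band", "love", "watch", "work"]
centroid = [0.05, 0.40, 0.30, 0.10]
print(top_terms(centroid, terms))
```

A term with a high centroid weight has a high average TF-IDF value across the reviews in that cluster, which is why these terms serve as a short description of the cluster.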

# Part 5: Topic Modelling - LDA

In [43]:
# Use LDA for clustering
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 5, learning_method = 'online')

In [52]:
from sklearn.feature_extraction.text import CountVectorizer
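# LDA works on integer term counts, so we use CountVectorizer here instead of TF-IDF weights
# (the variable names keep the tfidf_ prefix, but the matrix holds raw counts)
tfidf_model_lda = CountVectorizer(max_df = 0.99,max_features = 500,
                                 min_df = 0.01, stop_words = 'english',
                                 tokenizer = tokenization_stemming,ngram_range = (1,1))

tfidf_matrix_lda = tfidf_model_lda.fit_transform(data) # fit the vectorizer to the reviews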

print('In total, there are ' + str(tfidf_matrix_lda.shape[0]) + \
    ' reviews and '+str(tfidf_matrix_lda.shape[1]) + ' terms.')

In total, there are 1000 reviews and 245 terms.


In [53]:
# document topic matrix for tfidf_matrix_lda
lda_output = lda.fit_transform(tfidf_matrix_lda)
print(lda_output.shape)
print(lda_output)

(1000, 5)
[[0.18537023 0.02517623 0.02530421 0.02554169 0.73860764]
 [0.05121229 0.05062284 0.05138098 0.05223708 0.79454681]
 [0.2        0.2        0.2        0.2        0.2       ]
 ...
 [0.10004276 0.1001286  0.59970654 0.10011636 0.10000574]
 [0.79810382 0.05071981 0.05024557 0.05040557 0.05052523]
 [0.03449244 0.86386655 0.03362075 0.03392305 0.0340972 ]]


In [54]:
# topic-word matrix: one row per topic, one column per term
topic_word = lda.components_
print(topic_word.shape)

(5, 245)


In [56]:
# column names
topic_names = ['Topic ' + str(i) for i in range(lda.n_components)]

# index names
doc_names = ['Doc' + str(i) for i in range(len(data))]

df_document_topic = pd.DataFrame(np.round(lda_output,2),columns = topic_names,
                                index = doc_names)

# get dominant topic for each document

topic = np.argmax(df_document_topic.values,axis = 1)
df_document_topic['topic'] = topic

df_document_topic.head(10)


Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,topic
Doc0,0.19,0.03,0.03,0.03,0.74,4
Doc1,0.05,0.05,0.05,0.05,0.79,4
Doc2,0.2,0.2,0.2,0.2,0.2,0
Doc3,0.03,0.03,0.88,0.03,0.03,2
Doc4,0.01,0.22,0.11,0.49,0.17,3
Doc5,0.43,0.04,0.04,0.04,0.44,4
Doc6,0.03,0.03,0.03,0.03,0.88,4
Doc7,0.03,0.03,0.03,0.88,0.03,3
Doc8,0.25,0.01,0.05,0.01,0.68,4
Doc9,0.02,0.02,0.19,0.64,0.13,3
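
In the table above, Doc2 has the same probability for every topic, yet `np.argmax` still assigns it to topic 0 because it returns the first maximum. One way to flag such rows is to treat a flat distribution as "no dominant topic". The sketch below is only an illustration: the helper name `dominant_topic` and the tolerance `tol` are hypothetical choices.

```python
def dominant_topic(probs, tol=1e-6):
    """Return the index of the largest probability, or None if the row is flat."""
    if max(probs) - min(probs) <= tol:
        return None  # no topic stands out, e.g. a document with no retained terms
    return max(range(len(probs)), key=lambda k: probs[k])

# Rows copied from the table above (Doc0 and Doc2).
print(dominant_topic([0.19, 0.03, 0.03, 0.03, 0.74]))
print(dominant_topic([0.2, 0.2, 0.2, 0.2, 0.2]))
```

Rows flagged this way could be excluded from the per-topic counts so that topic 0 is not inflated by documents the model knows nothing about.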


In [58]:
df_document_topic['topic'].value_counts().to_frame()

Unnamed: 0,topic
4,256
0,202
2,191
1,186
3,165


In [63]:
# topic-word matrix: one row per topic, one column per term
print(lda.components_)
df_topic_words = pd.DataFrame(lda.components_)

# label columns with terms and rows with topic names
# (use get_feature_names_out() on scikit-learn >= 1.0)
df_topic_words.columns = tfidf_model_lda.get_feature_names()
df_topic_words.index = topic_names

df_topic_words.head()

[[ 0.21310499 21.02293093  6.19701564 ...  3.31185565  4.37536561
   0.21833922]
 [23.61492085 95.49601809  9.52932782 ...  1.65062046 30.75504234
   4.50509686]
 [ 0.20815625 17.68141069  0.20147555 ...  2.82244017  0.2332786
  13.31744609]
 [11.33189015 56.47838443  0.2410831  ...  5.40667556 20.00059801
  36.85629874]
 [12.78863453 41.8297772   0.21003646 ...  4.71294438 15.73807426
   0.63150832]]


Unnamed: 0,'m,'s,abl,absolut,accur,actual,adjust,alarm,alreadi,alway,...,weight,went,wife,wind,wish,work,worn,worth,wrist,year
Topic 0,0.213105,21.022931,6.197016,0.226711,0.211029,10.107001,0.212059,0.201363,5.767082,0.207454,...,1.267843,0.555455,0.205533,0.209271,9.820612,6.419062,0.20322,3.311856,4.375366,0.218339
Topic 1,23.614921,95.496018,9.529328,0.206814,0.207285,1.956452,22.501119,8.833674,0.224118,2.591758,...,13.269811,0.230301,4.768875,2.087903,0.212688,1.355261,2.598732,1.65062,30.755042,4.505097
Topic 2,0.208156,17.681411,0.201476,0.21855,0.270105,0.204248,0.21242,1.588216,4.228661,0.969309,...,0.203763,4.164661,0.203631,0.201602,0.203816,100.812547,0.2017,2.82244,0.233279,13.317446
Topic 3,11.33189,56.478384,0.241083,1.359566,12.148022,0.226488,1.379112,6.719053,3.325834,9.690513,...,0.211486,6.528039,0.206949,11.602365,1.668571,21.344628,0.22021,5.406676,20.000598,36.856299
Topic 4,12.788635,41.829777,0.210036,15.425642,0.211483,3.968851,0.216227,0.211477,0.236068,0.297145,...,0.207924,2.28014,15.202582,0.203563,0.208356,1.806369,10.164524,4.712944,15.738074,0.631508
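
The values in `lda.components_` are not probabilities. scikit-learn describes them as pseudo-counts, and dividing each row by its sum gives a word distribution per topic. A plain-Python sketch of that normalization follows; the small matrix and the helper name `normalize_rows` are hypothetical.

```python
def normalize_rows(matrix):
    """Divide each row by its sum so that every row adds up to 1."""
    return [[value / sum(row) for value in row] for row in matrix]

# Hypothetical 2-topic, 3-term pseudo-count matrix shaped like lda.components_.
components = [[1.0, 3.0, 4.0], [5.0, 5.0, 0.0]]
topic_word_probs = normalize_rows(components)
print(topic_word_probs)
```

Normalizing does not change the ranking of words within a topic, so the top-word lists stay the same; it only makes the weights comparable across topics.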


In [65]:
def print_topic_words(tfidf_model,lda_model,n_words):
    # Despite its name, this returns the top n_words terms per topic rather than printing them.
    words = np.array(tfidf_model.get_feature_names())
    topic_words = []
    # each row of components_ holds one topic's weight for every term
    for topic_words_weight in lda_model.components_:
        top_words = topic_words_weight.argsort()[::-1][:n_words]
        topic_words.append(words.take(top_words))
    return topic_words

topic_keywords = print_topic_words(tfidf_model = tfidf_model_lda,lda_model = lda, n_words = 15)

df_topic_words = pd.DataFrame(topic_keywords)
df_topic_words.columns = ['Word' + str(i) for i in range(df_topic_words.shape[1])]
df_topic_words.index = ['Topic' + str(i) for i in range(df_topic_words.shape[0])]
df_topic_words


Unnamed: 0,Word0,Word1,Word2,Word3,Word4,Word5,Word6,Word7,Word8,Word9,Word10,Word11,Word12,Word13,Word14
Topic0,watch,beauti,excel,great,got,bought,price,lot,exact,look,qualiti,purchas,gift,awesom,love
Topic1,watch,like,'s,look,size,band,n't,big,cheap,feel,dial,use,wrist,make,small
Topic2,good,work,nice,watch,time,batteri,product,price,qualiti,week,broke,day,set,look,use
Topic3,watch,br,band,time,n't,'s,day,hand,face,look,second,year,read,need,water
Topic4,watch,love,great,look,color,n't,nice,perfect,'s,wear,strap,realli,light,blue,buy
