# Amazon Reviews Sentiment Analysis

______________________

## Part 1: EDA, Cleaning

### Riche Ngo

### Amazon Reviews for Electronics

Data from [link](https://nijianmo.github.io/amazon/index.html).

The full dataset contains 233.1 million reviews (the 2014 release contained 142.8 million).  
It covers reviews written between May 1996 and Oct 2018.  

We will only be looking at 2018 reviews.
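
The CSV loaded later in this notebook already contains only 2018 reviews. As a rough sketch of how such a subset could be cut from a larger table, the Unix timestamp can be converted to a datetime and filtered by year. The toy rows and the `review_date` column name are assumptions made for illustration; only `overall` and `unixReviewTime` come from the dataset description.

```python
import pandas as pd

# Toy stand-in for a larger reviews table (values chosen for illustration).
toy = pd.DataFrame({
    "overall": [5.0, 3.0, 4.0],
    "unixReviewTime": [1517011200, 1483228800, 1538352000],
})

# unixReviewTime is in seconds since the epoch; convert it to a datetime.
toy["review_date"] = pd.to_datetime(toy["unixReviewTime"], unit="s")

# Keep only the rows whose review date falls in 2018.
reviews_2018 = toy[toy["review_date"].dt.year == 2018]
print(reviews_2018)
```

The middle row has a 2017 timestamp, so it is the only one removed.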

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

from bs4 import BeautifulSoup   

## Reviews Data

### Import data

In this notebook, we are exploring "small" subsets of the original data for experimentation.

K-cores (i.e., dense subsets): the data has been reduced to its k-core, so that every remaining user and every remaining item has at least k reviews.

In this notebook, K=5.
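
The 5-core file used here was downloaded already reduced, so the notebook does not compute it. For intuition only, a k-core can be obtained by repeatedly dropping users and items with fewer than k reviews until nothing changes. The function name `k_core` and the toy data below are hypothetical, and this is a sketch rather than the procedure the dataset authors used.

```python
import pandas as pd

def k_core(df, k, user_col="reviewerID", item_col="asin"):
    """Drop users and items with fewer than k reviews until the table is stable."""
    while True:
        # Number of reviews per user / per item, aligned to each row.
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= k) & (item_counts >= k)
        if keep.all():
            return df
        # Removing rows can push other users/items below k, so loop again.
        df = df[keep]

toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "asin":       ["i1", "i2", "i1", "i2", "i1", "i3"],
})
core = k_core(toy, k=2)
print(core)
```

In the toy data, item `i3` has a single review and is dropped first; that leaves user `u3` with one review, so `u3` is dropped on the next pass. This is why a single filtering pass is not enough.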

Contents of the data:  
* reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
* asin - ID of the product, e.g. 0000013714
* reviewerName - name of the reviewer
* vote - helpful votes of the review
* style - a dictionary of the product metadata, e.g., "Format" is "Hardcover"
* reviewText - text of the review
* overall - rating of the product
* summary - summary of the review
* unixReviewTime - time of the review (unix time)
* reviewTime - time of the review (raw)
* image - images that users post after they have received the product

In [2]:
# Read csv
df = pd.read_csv('../datasets/amazon_reviews_electronics/Electronics_5_2018.csv')

In [3]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,vote,image
0,5.0,True,"01 27, 2018",A1XSPKZ8HHSBX2,073530498X,{'Format:': ' Spiral-bound'},Problematic1963,I made a photo album for a senior friend who w...,great buy,1517011200,,
1,5.0,True,"04 1, 2018",A3G5NNV6T6JA8J,106171327X,,Tazman32,"Great addition to our new Galaxy S9's which, b...",Great addition to our new Galaxy S9's which,1522540800,,
2,5.0,True,"03 30, 2018",AFML7PYI3LERI,106171327X,,Brian D. Carrico,Perfect !,Five Stars,1522368000,,
3,4.0,True,"03 30, 2018",A1G0HYMR02WM2W,106171327X,,Cici Ciconia,As described.,Four Stars,1522368000,,
4,5.0,True,"03 27, 2018",A1T8B3I8KRS3W0,106171327X,,AJ,Great little card made my device better,Five Stars,1522108800,,


In [4]:
df.shape

(377430, 12)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 377430 entries, 0 to 377429
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   overall         377430 non-null  float64
 1   verified        377430 non-null  bool   
 2   reviewTime      377430 non-null  object 
 3   reviewerID      377430 non-null  object 
 4   asin            377430 non-null  object 
 5   style           240764 non-null  object 
 6   reviewerName    377356 non-null  object 
 7   reviewText      377257 non-null  object 
 8   summary         377300 non-null  object 
 9   unixReviewTime  377430 non-null  int64  
 10  vote            12133 non-null   float64
 11  image           10509 non-null   object 
dtypes: bool(1), float64(2), int64(1), object(8)
memory usage: 32.0+ MB


In [6]:
# Checks for people talking about movies in reviews
# for i in range(5):
#     print(df[df['reviewText'].str.contains('movie') == True]['reviewText'].values[i])
#     print()

### Missing Values

In [13]:
df.isnull().sum()

overall                0
verified               0
reviewTime             0
reviewerID             0
asin                   0
style             136666
reviewerName          74
reviewText           173
summary              130
unixReviewTime         0
vote              365297
image             366921
dtype: int64

The columns with the most missing values are ones we do not need: `vote` and `image` are missing in about 97% of rows, and `style` in about 36%.

Some reviewer names are also missing, but we can drop that column and identify reviewers by `reviewerID` instead.

We therefore drop 'verified', 'style', 'reviewerName', 'vote' and 'image'.
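
The percentages above come from dividing the null counts by the number of rows. A small sketch of the same calculation on toy data follows; the 0.5 cut-off and the variable names are assumptions for illustration, since the notebook selects its columns by hand.

```python
import numpy as np
import pandas as pd

# Toy frame with one mostly-empty column, loosely mirroring `vote`.
toy = pd.DataFrame({
    "overall": [5.0, 4.0, 3.0, 5.0],
    "reviewText": ["good", None, "ok", "great"],
    "vote": [np.nan, np.nan, np.nan, 2.0],
})

# Share of missing values per column (isnull() gives booleans, mean() their fraction).
null_share = toy.isnull().mean()

# Hypothetical threshold: flag columns that are more than half empty.
threshold = 0.5
sparse_cols = null_share[null_share > threshold].index.tolist()

trimmed = toy.drop(columns=sparse_cols)
print(null_share)
print(sparse_cols)
```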

In [14]:
# Extract the columns of concern
reviews = df[['overall', 'reviewTime', 'reviewerID', 'asin', 'reviewText', 'summary', 'unixReviewTime']].copy()

In [15]:
# Drop rows with a missing reviewText or summary (the only selected columns with nulls)
reviews.dropna(inplace=True)

In [16]:
reviews.shape

(377164, 7)

In [18]:
# Check null values
reviews.isnull().sum()

overall           0
reviewTime        0
reviewerID        0
asin              0
reviewText        0
summary           0
unixReviewTime    0
dtype: int64

In [19]:
# Save cleaned dataset as a pickle file
with open('../datasets/amazon_reviews_electronics/Electronics_reviews_2018.pkl', 'wb') as outfile:
    pickle.dump(reviews, outfile)

### Load `.pkl` file from here

In [None]:
# Use this to load previously saved .pkl file
# with open('../datasets/amazon_reviews_electronics/Electronics_reviews_2018.pkl', 'rb') as infile:
#     reviews = pickle.load(infile)

## Products Meta Data

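Contents of the metadata, as listed in the dataset documentation (some column names in the file loaded below differ, e.g. `category`, `rank`, `also_buy`, `also_view`):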

* asin - ID of the product, e.g. 0000031852
* title - name of the product
* feature - bullet-point format features of the product
* description - description of the product
* price - price in US dollars (at time of crawl)
* image - url of the product image
* related - related products (also bought, also viewed, bought together, buy after viewing)
* salesRank - sales rank information
* brand - brand name
* categories - list of categories the product belongs to
* tech1 - the first technical detail table of the product
* tech2 - the second technical detail table of the product
* similar - similar product table

### Import data

In [20]:
m_df = pd.read_csv('../datasets/amazon_reviews_electronics/meta_Electronics_2018.csv')

In [21]:
m_df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,image,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,details
0,"['Electronics', 'Accessories &amp; Supplies', ...",,['The CLIKR-5 UR5U-8780L remote control is des...,,CLIKR-5 Time Warner Cable Remote Control UR5U-...,"['B06VTZK822', 'B00OSI6O7S']",['https://images-na.ssl-images-amazon.com/imag...,,URC,['Instruction manual included'],"['>#4,971 in Electronics &gt; Accessories &amp...","['B06VTZK822', 'B00OSI6O7S', 'B00KUL8O0W', 'B0...",All Electronics,"class=""a-bordered a-horizontal-stripes a-spa...","January 31, 2013",,0511189877,
1,"['Electronics', 'GPS, Finders & Accessories', ...",,['**** Shipped by 2-3 DAY UNITED STATES PRIORI...,,Rand McNally 528881469 7-inch Intelliroute TND...,[],['https://images-na.ssl-images-amazon.com/imag...,,Rand McNally,"['Extra large 7-inch high-definition screen', ...",['>#60 in Electronics > GPS & Navigation > Veh...,"['B00RVGXZBM', 'B00N58RZ34', 'B07FKR7VZ4', 'B0...",All Electronics,,"April 15, 2010",,0528881469,
2,"['Electronics', 'Computers &amp; Accessories',...","class=""a-keyvalue prodDetTable"" role=""present...","['Nook HD protective stand cover slim, smart a...",,Nook Hd + 9-Inch Groovy Protective Stand Cover...,"['B00E9IKYKK', '1400699169', 'B00E9ISXPS', 'B0...",['https://images-na.ssl-images-amazon.com/imag...,"class=""a-keyvalue prodDetTable"" role=""present...",Nook,"['Custom designed for NOOK', 'Horizontal Viewi...","['>#24,362 in Computers &amp; Accessories &gt;...","['B010CHS6PG', 'B00B8EG632', 'B01FY2BAYS', 'B0...",Computers,"class=""a-bordered a-horizontal-stripes a-spa...","April 20, 2013",,0594450268,
3,"['Electronics', 'eBook Readers &amp; Accessori...","class=""a-keyvalue prodDetTable"" role=""present...",['Original Barnes &amp; Noble Nook Color or Ta...,,Barnes &amp; Noble Nook Color Tablet USB Cable...,[],['https://images-na.ssl-images-amazon.com/imag...,,Barnes &amp; Noble,['<span>\n BUY MORE AND SAVE! Purchase ...,['>#24 in Electronics &gt; eBook Readers &amp;...,"['B01C31MQA0', 'B00AZRHYKW', 'B01MG3HKUX', 'B0...",Portable Audio &amp; Accessories,,"August 1, 2014",$6.04,0594459451,
4,"['Electronics', 'Camera &amp; Photo', 'Lightin...",,"['', '']",,Vintage Camera Photo Album,"['1441310533', '1935414763', '1441317422', 'B0...",[],,Visit Amazon's Galison Page,[],"3,298 in Books (","['1441310533', 'B01JAVJZ46', 'B00LJWNO7Y', 'B0...",Books,,,$9.99,073530498X,


In [22]:
m_df.shape

(58305, 18)

In [23]:
m_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58305 entries, 0 to 58304
Data columns (total 18 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   category      58305 non-null  object
 1   tech1         15525 non-null  object
 2   description   58305 non-null  object
 3   fit           16 non-null     object
 4   title         58305 non-null  object
 5   also_buy      58305 non-null  object
 6   image         58305 non-null  object
 7   tech2         5171 non-null   object
 8   brand         58212 non-null  object
 9   feature       58305 non-null  object
 10  rank          58305 non-null  object
 11  also_view     58305 non-null  object
 12  main_cat      58180 non-null  object
 13  similar_item  42995 non-null  object
 14  date          49378 non-null  object
 15  price         42609 non-null  object
 16  asin          58305 non-null  object
 17  details       58283 non-null  object
dtypes: object(18)
memory usage: 8.0+ MB


### Missing Values

In [24]:
m_df.isnull().sum()

category            0
tech1           42780
description         0
fit             58289
title               0
also_buy            0
image               0
tech2           53134
brand              93
feature             0
rank                0
also_view           0
main_cat          125
similar_item    15310
date             8927
price           15696
asin                0
details            22
dtype: int64

We want to use the titles and main categories of the items.
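
Several titles and categories in the preview above still carry HTML entities such as `&amp;`. This notebook does not clean them, but one possible way to decode them is the standard-library `html.unescape`, sketched here. The toy frame is an assumption, with strings modelled on the preview rather than copied from the file.

```python
import html
import pandas as pd

# Toy metadata with HTML-escaped ampersands, modelled on the preview above.
toy_meta = pd.DataFrame({
    "title": ["Barnes &amp; Noble USB Cable", "Vintage Camera Photo Album"],
    "main_cat": ["Portable Audio &amp; Accessories", "Books"],
})

# Decode HTML entities in both text columns; missing values are left untouched.
for col in ["title", "main_cat"]:
    toy_meta[col] = toy_meta[col].map(html.unescape, na_action="ignore")

print(toy_meta)
```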

### Main Category

In [25]:
m_df['main_cat'].value_counts()

Computers                                                                                                                                                                    18678
Camera & Photo                                                                                                                                                               10285
All Electronics                                                                                                                                                               8441
Home Audio & Theater                                                                                                                                                          8426
Cell Phones & Accessories                                                                                                                                                     4699
Car Electronics                                                                                          

Some `main_cat` entries contain an image tag (starting with `<img`) instead of a category name. We remove those rows.

In [26]:
# Count the entries whose main_cat starts with an <img> tag
m_df['main_cat'].str.startswith('<img').sum()

101

In [43]:
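# Filter out the rows whose main_cat starts with an <img> tag.
# Note: comparing with == False also drops the rows where main_cat is missing (NaN).
m_df = m_df.loc[m_df['main_cat'].str.startswith('<img')==False]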

In [44]:
m_df['main_cat'].str.startswith('<img').sum()

0
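
One side effect of the filter above is worth spelling out. For an object-dtype column, `str.startswith` returns NaN for missing values, and `NaN == False` evaluates to False, so rows with a missing `main_cat` are removed together with the `<img` rows. The counts are consistent with this: 58305 - 101 - 125 = 58079. The toy example below illustrates the behaviour; the `na=False` variant is shown only as an alternative that would keep those rows, not as what the notebook does. Newer pandas string dtypes may treat missing values differently.

```python
import numpy as np
import pandas as pd

# Object dtype on purpose, matching what df.info() reports for main_cat.
toy = pd.DataFrame({
    "main_cat": pd.Series(["Computers", "<img src='x.png'>", np.nan], dtype=object)
})

is_img = toy["main_cat"].str.startswith("<img")  # False, True, NaN

# Same pattern as the notebook: the NaN row is dropped along with the <img row.
dropped_nan = toy[is_img == False]

# Alternative: treat missing values as "not an image" so they are kept.
kept_nan = toy[~toy["main_cat"].str.startswith("<img", na=False)]

print(len(dropped_nan), len(kept_nan))
```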

In [62]:
# Extract the columns of concern
categories = m_df[['title', 'main_cat', 'asin']].copy()

In [63]:
categories.shape

(58079, 3)

In [64]:
# Save cleaned dataset as a pickle file
with open('../datasets/amazon_reviews_electronics/Electronics_categories_2018.pkl', 'wb') as outfile:
    pickle.dump(categories, outfile)

## Combine Meta Data with Reviews

In [65]:
# Left-join the product title and main category onto each review by asin
combined_df = reviews.merge(categories, how='left', on='asin')

In [66]:
combined_df.shape

(385515, 9)
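
The merged table has 385,515 rows, 8,351 more than the 377,164 reviews going in. A left merge can only add rows when a key appears more than once on the right-hand side, so some `asin` values are presumably duplicated in `categories`. The sketch below reproduces the effect on toy data and shows one possible guard: de-duplicating on `asin` and passing `validate='many_to_one'`. Keeping the first duplicate is an assumption, not a choice made in this notebook.

```python
import pandas as pd

toy_reviews = pd.DataFrame({"asin": ["A1", "A1", "B2"], "overall": [5.0, 4.0, 3.0]})

# The metadata lists product A1 twice, as duplicated rows in a crawl might.
toy_meta = pd.DataFrame({
    "asin": ["A1", "A1", "B2"],
    "title": ["Card", "Card", "Cable"],
})

# Each A1 review matches both A1 metadata rows, so 3 reviews become 5 rows.
inflated = toy_reviews.merge(toy_meta, how="left", on="asin")

# Keep one metadata row per asin (assumption: the first one is good enough).
deduped_meta = toy_meta.drop_duplicates(subset="asin", keep="first")

# validate raises a MergeError if the right-hand keys are not unique.
clean = toy_reviews.merge(deduped_meta, how="left", on="asin", validate="many_to_one")

print(len(inflated), len(clean))
```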

In [67]:
combined_df.isnull().sum()

overall              0
reviewTime           0
reviewerID           0
asin                 0
reviewText           0
summary              0
unixReviewTime       0
title             1266
main_cat          1266
dtype: int64
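
The 1,266 rows with no `title` or `main_cat` are reviews whose `asin` found no match in `categories`, which includes products removed by the `main_cat` filter earlier. As a sketch, `merge` can label such rows directly through its `indicator` parameter; the toy frames below are assumptions for illustration.

```python
import pandas as pd

toy_reviews = pd.DataFrame({"asin": ["A1", "B2", "C3"], "overall": [5.0, 4.0, 2.0]})
toy_meta = pd.DataFrame({"asin": ["A1", "B2"], "main_cat": ["Computers", "Books"]})

# indicator=True adds a `_merge` column: 'both' for matches, 'left_only' otherwise.
merged = toy_reviews.merge(toy_meta, how="left", on="asin", indicator=True)

# Reviews whose product has no metadata row.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```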

In [69]:
combined_df.head(15)

Unnamed: 0,overall,reviewTime,reviewerID,asin,reviewText,summary,unixReviewTime,title,main_cat
0,5.0,"01 27, 2018",A1XSPKZ8HHSBX2,073530498X,I made a photo album for a senior friend who w...,great buy,1517011200,Vintage Camera Photo Album,Books
1,5.0,"04 1, 2018",A3G5NNV6T6JA8J,106171327X,"Great addition to our new Galaxy S9's which, b...",Great addition to our new Galaxy S9's which,1522540800,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers
2,5.0,"03 30, 2018",AFML7PYI3LERI,106171327X,Perfect !,Five Stars,1522368000,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers
3,4.0,"03 30, 2018",A1G0HYMR02WM2W,106171327X,As described.,Four Stars,1522368000,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers
4,5.0,"03 27, 2018",A1T8B3I8KRS3W0,106171327X,Great little card made my device better,Five Stars,1522108800,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers
5,5.0,"03 26, 2018",A3J18CDIQTMSY9,106171327X,Item as described. Fast shipping. A++,Five Stars,1522022400,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers
6,5.0,"03 10, 2018",A2Q85T9VJ6S1DL,106171327X,Great little card. Great storage and fast ship...,Great little card. Great storage,1520640000,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers
7,4.0,"03 9, 2018",A2L4NKP6PBIOFB,106171327X,Whatever i put on this memory it stayed unlike...,Four Stars,1520553600,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers
8,5.0,"03 7, 2018",A3ASS2MFJZC7XJ,106171327X,Sandisk is my favorite because of the company ...,Five Stars,1520380800,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers
9,5.0,"03 5, 2018",A29QUPMRMIQHLT,106171327X,Great brand name. I have never had a problem w...,Five Stars,1520208000,Sandisk SDSDQUA-064G-A11 Professional Ultra 64...,Computers


In [71]:
# Save combined dataset as a pickle file
with open('../datasets/amazon_reviews_electronics/Electronics_merged_2018.pkl', 'wb') as outfile:
    pickle.dump(combined_df, outfile)