# Amazon Fine Food Reviews Analysis

Dataset source : https://www.kaggle.com/snap/amazon-fine-food-reviews

### Task: Determine whether a review is positive or negative and build a machine learning model around it.

Data includes:

Reviews from Oct 1999 - Oct 2012

- 568,454 reviews

- 256,059 users

- 74,258 products

- 260 users with > 50 reviews

Attribute Information :

- id

- product id

- userid

- profile name

- helpfulness numerator - number of users who found the review helpful (if 2,500 people said yes, it is 2,500)

- helpfulness denominator - number of users who indicated whether they found the review helpful or not (if another 100 people said no, it is 2,500 + 100)

- score

- time
 
- summary

- review text
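
As a small illustration of how the two helpfulness attributes relate, the sketch below computes the share of voters who found a review helpful. The function name `helpfulness_ratio` is hypothetical, and returning `None` when nobody voted is an assumption made here, not something the dataset prescribes.

```python
def helpfulness_ratio(numerator, denominator):
    """Fraction of voters who found a review helpful, or None if nobody voted."""
    if denominator == 0:
        return None
    return numerator / denominator

# Example from the attribute description: 2500 "yes" votes and 100 "no" votes.
yes_votes, no_votes = 2500, 100
ratio = helpfulness_ratio(yes_votes, yes_votes + no_votes)
print(round(ratio, 3))
```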

### Objective : 

Given a review, determine whether it is positive (rating 4 or 5) or negative (rating 1 or 2).

[Q] How to determine if a review is positive or negative?

[ANS] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.
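
The labelling rule can be written down as a tiny function. This is only a sketch: `label_from_score` is a hypothetical name, and returning `None` for neutral reviews is one possible way to mark rows that should be dropped.

```python
def label_from_score(score):
    """Map a 1-5 star rating to a polarity label; 3 is treated as neutral."""
    if score in (4, 5):
        return "positive"
    if score in (1, 2):
        return "negative"
    return None  # neutral (score 3) reviews are ignored

labels = [label_from_score(s) for s in [1, 2, 3, 4, 5]]
print(labels)
```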

# Import Libraries

In [16]:
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
import string
import seaborn as sns

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc

import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

from tqdm import tqdm
import os

# Load the dataset
The dataset is available in two forms:

1. .csv file
2. SQLite database

We use the SQLite database because it is easier to query and visualize the data efficiently.
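
To show why querying is convenient, here is a minimal sketch using an in-memory SQLite database. The table layout is reduced to three columns and the rows are made up; only the `Score != 3` filter mirrors what this notebook does.

```python
import sqlite3

# In-memory stand-in for the real database; the rows are made up.
toy_con = sqlite3.connect(":memory:")
toy_con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Summary TEXT)")
toy_con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(1, 5, "great"), (2, 3, "okay"), (3, 1, "bad"), (4, 4, "good")],
)

# Drop neutral (Score = 3) reviews at query time.
rows = toy_con.execute(
    "SELECT Id, Score FROM Reviews WHERE Score != 3 ORDER BY Id"
).fetchall()
print(rows)
toy_con.close()
```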

In [1]:


# using the SQLite Table to read data.

con = sqlite3.connect('data/database.sqlite')

# filtering only positive and negative reviews i.e
# not taking into consideration those reviews with score = 3

filtered_data = pd.read_sql_query("""
SELECT * 
FROM Reviews
WHERE Score != 3
""", con)

# Give reviews with Score > 3 a positive rating, and reviews with Score < 3 a negative rating

def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

# Replace the numeric score with its label: below 3 becomes 'negative', above 3 becomes 'positive'

actualScore = filtered_data['Score']
positivenegative = actualScore.map(partition)
filtered_data['Score'] = positivenegative

In [2]:
filtered_data.shape
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


# Exploratory Data Analysis

## 1. Data Cleaning : Deduplication
It is observed (as shown in the table below) that the reviews data has many duplicate entries. Hence it is necessary to remove duplicates in order to get unbiased results from the analysis.

In [3]:
display = pd.read_sql_query(
"""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId='A395BORC6FGVXV'
ORDER BY ProductId
""", con)
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,544173,B000U9WZ54,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
1,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
2,136304,B002Y7526Y,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...


After running the above query for many different values of UserId, we found that the same user often has multiple reviews with identical values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis, product IDs B002Y7526Y, B000UA0QIQ and B000U9WZ54 turned out to belong to the same product in different flavours. To check this we used www.amazon.com/dp/_your_product_id.

Deduplication ensures that there is only one representative for each product. Deduplicating without sorting first could leave different representatives for the same data.

In [4]:
# Sorting data according to ProductId in ascending order
sorted_data = filtered_data.sort_values('ProductId', axis = 0, ascending=True, inplace= False)

In [5]:
# Deduplication of entries
# A list (rather than a set) is the documented form for `subset`
final = sorted_data.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Summary", "Text"], keep='first', inplace=False)
final.shape

(365333, 10)

Here we passed keep='first'. Because the data was sorted by ProductId, the row with the smallest ProductId in each group of duplicates is kept and the others are removed.
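
The sort-then-keep-first idea can be sketched in plain Python, without pandas. The helper `dedupe_keep_first` is hypothetical, and the duplicate key is reduced to `UserId` and `Time` for brevity; the three rows reuse the Ids and ProductIds shown in the table above.

```python
def dedupe_keep_first(rows, sort_key, dup_keys):
    """Sort rows, then keep only the first row seen for each duplicate key."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r[sort_key]):
        key = tuple(row[k] for k in dup_keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

toy_rows = [
    {"Id": 136304, "ProductId": "B002Y7526Y", "UserId": "A395BORC6FGVXV", "Time": 1307923200},
    {"Id": 4, "ProductId": "B000UA0QIQ", "UserId": "A395BORC6FGVXV", "Time": 1307923200},
    {"Id": 544173, "ProductId": "B000U9WZ54", "UserId": "A395BORC6FGVXV", "Time": 1307923200},
]
kept_rows = dedupe_keep_first(toy_rows, "ProductId", ["UserId", "Time"])
print([r["ProductId"] for r in kept_rows])
```

Sorting first makes the surviving row deterministic: whichever order the rows arrive in, the same ProductId is kept.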

In [6]:
# checking what percentage of the data still remains
(final['Id'].size * 1.0)/(filtered_data['Id'].size * 1.0 ) * 100

69.4795117665182

**Observation:** It was also seen that two rows have a HelpfulnessNumerator greater than the HelpfulnessDenominator, which is not possible in practice, so these rows are removed as well. In the query output below, Id 64422 shows the problem (numerator 3, denominator 1).

In [7]:
display = pd.read_sql_query(
"""
SELECT *
FROM Reviews
WHERE Score != 3 AND Id IN (544173, 64422)
ORDER BY ProductId 
""", con)
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,544173,B000U9WZ54,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...


In [8]:
final = final[final.HelpfulnessNumerator <= final.HelpfulnessDenominator]
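
The same consistency check could also be run directly in SQL to list every offending row instead of looking up Ids by hand. The sketch below uses an in-memory table as a stand-in for the real database: two rows copy the helpfulness values shown in the output above, and the row with Id 999 is made up.

```python
import sqlite3

check_con = sqlite3.connect(":memory:")
check_con.execute(
    "CREATE TABLE Reviews (Id INTEGER, HelpfulnessNumerator INTEGER, "
    "HelpfulnessDenominator INTEGER)"
)
# Two rows copied from the output above plus one made-up row (Id 999).
check_con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?)",
    [(64422, 3, 1), (544173, 3, 3), (999, 0, 2)],
)
bad_ids = [
    row[0]
    for row in check_con.execute(
        "SELECT Id FROM Reviews "
        "WHERE HelpfulnessNumerator > HelpfulnessDenominator ORDER BY Id"
    )
]
print(bad_ids)
check_con.close()
```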

In [9]:
print(final.shape)

# How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(365331, 10)


positive    307967
negative     57364
Name: Score, dtype: int64
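
The counts above imply a noticeably imbalanced dataset. As a quick worked calculation (the numbers are copied from the output above; the variable names are illustrative only):

```python
counts = {"positive": 307967, "negative": 57364}  # copied from the output above
total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

Roughly 84% of the remaining reviews are positive, which is worth keeping in mind when choosing evaluation metrics later on.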

In [10]:
# checking for duplicates
final.duplicated().sum()

0

In [11]:
final.isna().sum()

Id                        0
ProductId                 0
UserId                    0
ProfileName               0
HelpfulnessNumerator      0
HelpfulnessDenominator    0
Score                     0
Time                      0
Summary                   0
Text                      0
dtype: int64

In [13]:
final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 365331 entries, 138706 to 302474
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      365331 non-null  int64 
 1   ProductId               365331 non-null  object
 2   UserId                  365331 non-null  object
 3   ProfileName             365331 non-null  object
 4   HelpfulnessNumerator    365331 non-null  int64 
 5   HelpfulnessDenominator  365331 non-null  int64 
 6   Score                   365331 non-null  object
 7   Time                    365331 non-null  int64 
 8   Summary                 365331 non-null  object
 9   Text                    365331 non-null  object
dtypes: int64(4), object(6)
memory usage: 30.7+ MB


In [17]:
final.describe()

Unnamed: 0,Id,HelpfulnessNumerator,HelpfulnessDenominator,Time
count,365331.0,365331.0,365331.0,365331.0
mean,282791.663516,1.73922,2.188024,1296127000.0
std,164586.489158,6.721836,7.347362,48650880.0
min,1.0,0.0,0.0,939340800.0
25%,140723.5,0.0,0.0,1270858000.0
50%,278958.0,0.0,1.0,1311379000.0
75%,428530.5,2.0,2.0,1332893000.0
max,568454.0,866.0,878.0,1351210000.0


In [93]:
# Note: to_csv also writes the DataFrame index, so the reloaded frame gains one extra column
final.to_csv("reviews_clean_file")

In [94]:
final = pd.read_csv("reviews_clean_file")

In [95]:
final.shape

(365331, 11)

## 2. Text Preprocessing

Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build the prediction model.

Hence, in the preprocessing phase we do the following, in the order below:

1. Begin by removing the HTML tags
2. Remove any punctuation or a limited set of special characters like , or . or # etc.
3. Check if the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (as it was researched that there are no 2-letter adjectives)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)

After this, we collect the words used to describe positive and negative reviews.
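
The steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `clean_review`, `toy_stop_words` and the no-op `stem` hook are hypothetical names, the tag stripping uses a simple regular expression rather than BeautifulSoup, and a real Snowball stemmer would have to be passed in as `stem`.

```python
import re

def clean_review(text, stop_words, stem=lambda word: word):
    """Apply the listed steps in order; `stem` defaults to a no-op placeholder."""
    text = re.sub(r"<[^>]+>", " ", text)          # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. drop punctuation and special characters
    kept = []
    for word in text.split():
        if not word.isalpha():                    # 3. letters only, no alphanumerics
            continue
        if len(word) <= 2:                        # 4. longer than two characters
            continue
        word = word.lower()                       # 5. lowercase
        if word in stop_words:                    # 6. remove stopwords
            continue
        kept.append(stem(word))                   # 7. stemming hook
    return " ".join(kept)

toy_stop_words = {"the", "and", "this"}
cleaned = clean_review("This is the BEST taffy!<br />Bought 2 bags, and loved it.", toy_stop_words)
print(cleaned)
```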

In [96]:
# printing some random reviews

sent_0 = final['Text'].values[0]
print(sent_0)
print("__"*50)

sent_1000 = final['Text'].values[125]
print(sent_1000)
print("__"*50)

sent_1500 = final['Text'].values[14810] #row number
print(sent_1500)
print("__"*50)

sent_4900 = final['Text'].values[14900]
print(sent_4900)
print("__"*50)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
____________________________________________________________________________________________________
Though they might be a bit pricey for just the average dog owner like myself (as opposed to say someone in a profession working with dogs), they're great treats.<br />Perfect size for the quick little snack on the run.<br />And yeah... my dog loves these. At 13, she's getting pretty finicky, and these will gather her full attention.<br />Recommended.
____________________________________________________________________________________________________
Hey if my 18 year old cat will eat it, and I don't have to drive to P

In [97]:
# remove urls from text python : https://stackoverflow.com/a/40823105/4084039

sent_0 = re.sub(r"http\S+", "", sent_0)
sent_1000 = re.sub(r"http\S+", "", sent_1000)
sent_1500 = re.sub(r"http\S+", "", sent_1500)
sent_4900 = re.sub(r"http\S+", "", sent_4900)

print(sent_0)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college


In [98]:
# https://stackoverflow.com/questions/16206380/python-beautifulsoup-how-to-remove-all-tags-from-an-element
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, "html.parser")
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1000, "html.parser")
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1500, "html.parser")
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_4900, "html.parser")
text = soup.get_text()
print(text)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college
Though they might be a bit pricey for just the average dog owner like myself (as opposed to say someone in a profession working with dogs), they're great treats.Perfect size for the quick little snack on the run.And yeah... my dog loves these. At 13, she's getting pretty finicky, and these will gather her full attention.Recommended.
Hey if my 18 year old cat will eat it, and I don't have to drive to Petsmart, then I'm a fan!Note: if your cat is a nibbler not a wolfer, give smaller portions several times a day. At first I was dumping out half a can and she would leave it and it dries out quickly. Then she wouldn't tou

In [99]:
# https://stackoverflow.com/a/47091490/4084039
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [100]:
sent_4900 = decontracted(sent_4900)
print(sent_4900)
print("="*50)

I have been purchasing Cosmic catnip and scratching pads for as long as I have been a cat owner, and I have never been disappointed.  When I saw the cute, colorful packaging and the variety of flavors available for these treats, I could not resist.<br /><br />Sadly, my cats will not even acknowledge these as food.  They will not even smell them, let alone taste them.  Maybe I just got a bad batch, but my cats will usually eat anything that is offered to them at least once.  Even one of my dogs, who does anything to sneak a bite of the kitty food, was not too eager to eat one of these Philly Cheesesteak treats.<br /><br />The treats arrived somewhat hard, but it is obvious that they are the type that should be soft.  I do not know if this was Amazon is error for selling a stale product, or if the cold weather had something to do with it (it is getting down to the 20 is at night where I live).<br /><br />For nearly the same price as these treats, buy your kitties a tub of Cosmic catnip..
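
Two properties of this rule-based expansion are worth seeing in isolation: the specific rules must run before the generic `n't` rule, and the `'s` rule also rewrites possessives (which is why the output above contains "Amazon is error"). The sketch below uses a hypothetical `expand` helper with a reduced rule list, not the full `decontracted` function.

```python
import re

def expand(phrase, rules):
    """Apply (pattern, replacement) rules in order."""
    for pattern, replacement in rules:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase

specific_first = [(r"won't", "will not"), (r"n't", " not"), (r"'s", " is")]
generic_only = [(r"n't", " not"), (r"'s", " is")]

with_specific = expand("I won't buy it", specific_first)
without_specific = expand("I won't buy it", generic_only)
possessive = expand("Amazon's error", specific_first)
print(with_specific)
print(without_specific)
print(possessive)
```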

In [101]:
# remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
sent_0 = re.sub(r"\S*\d\S*", "", sent_0).strip()
print(sent_0)

this witty little book makes my son laugh at loud. i recite it in the car as we're driving along and he always can sing the refrain. he's learned about whales, India, drooping roses:  i love all the new words this book  introduces and the silliness of it all.  this is a classic book i am  willing to bet my son will STILL be able to recite from memory when he is  in college


In [102]:
# remove special characters: https://stackoverflow.com/a/5843547/4084039
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ', sent_1500)
print(sent_1500)

Hey if my 18 year old cat will eat it and I don t have to drive to Petsmart then I m a fan br br Note if your cat is a nibbler not a wolfer give smaller portions several times a day At first I was dumping out half a can and she would leave it and it dries out quickly Then she wouldn t touch it A sprinkle of water and 10 seconds in the microwave did the trick but now I give her 1 4 can 4 times a day 
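
To make the effect of the two regular expressions easier to see, here is a small sketch on a made-up sentence. It chains the digit-token removal with a letters-only filter (`[^A-Za-z]+`), which is slightly stricter than the `[^A-Za-z0-9]+` pattern used in the cell above.

```python
import re

sample = "I give her 1/4 can at 10am!"
# Remove every whitespace-delimited token that contains a digit.
no_digit_tokens = re.sub(r"\S*\d\S*", "", sample).strip()
# Collapse every run of non-letters into a single space.
letters_only = re.sub(r"[^A-Za-z]+", " ", no_digit_tokens).strip()
print(repr(no_digit_tokens))
print(repr(letters_only))
```

Note that removing tokens leaves the surrounding spaces behind; the second substitution collapses them.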


In [103]:
# https://gist.github.com/sebleier/554280
# 'no', 'nor' and 'not' are deliberately left out of the stop word list so that negations survive
# <br /><br /> ==> after the above steps we are left with "br br",
# so 'br' is included in the stop word list
# (tags written as <br/> instead of <br /> would already have been removed in the 1st step)
# note: this name shadows the `stopwords` module imported from nltk.corpus earlier

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])
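
The reason for leaving negations out of the stop word list can be illustrated with a toy example. The stop word set and the helper `drop_stopwords` below are hypothetical stand-ins, far smaller than the list above.

```python
toy_stopwords = {"the", "is", "this", "a", "br"}  # tiny stand-in; note "not" is absent

def drop_stopwords(sentence, stop_words):
    """Lowercase each token and drop the ones found in stop_words."""
    return " ".join(w.lower() for w in sentence.split() if w.lower() not in stop_words)

kept_negation = drop_stopwords("This is not a good taffy br", toy_stopwords)
lost_negation = drop_stopwords("This is not a good taffy br", toy_stopwords | {"not"})
print(kept_negation)
print(lost_negation)
```

With "not" treated as a stop word, a negative sentence collapses into what looks like a positive one.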

In [108]:
# Combining all the above statements
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'html.parser').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())

100%|█████████████████████████████████| 365331/365331 [00:52<00:00, 6993.71it/s]


In [109]:
# preprocessing for review summary
# Combining all the above statements
from tqdm import tqdm
preprocessed_summary = []
# tqdm is for printing the status bar
# read the Summary column (not Text); fillna('') guards against summaries
# that may have been read back from the CSV as NaN
for sentance in tqdm(final['Summary'].fillna('').values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'html.parser').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    # https://gist.github.com/sebleier/554280
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_summary.append(sentance.strip())

100%|█████████████████████████████████| 365331/365331 [00:52<00:00, 7017.54it/s]


# Featurization 

## 1. BAG OF WORDS

In [110]:
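#BoW
count_vect = CountVectorizer() # in scikit-learn
count_vect.fit(preprocessed_reviews)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("Some feature names", count_vect.get_feature_names_out()[:10].tolist())
print("="*50)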

Some feature names ['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaaaaaa', 'aaaaaaaaaaaa', 'aaaaaaaaaaaaa', 'aaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaa']




In [111]:
final_counts = count_vect.transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_counts))
print("the shape of out text BOW vectorizer ",final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse.csr.csr_matrix'>
the shape of out text BOW vectorizer  (365331, 117474)
the number of unique words  117474
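
To make the shape of the matrix above concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is not how `CountVectorizer` is implemented: `bag_of_words` is a hypothetical helper that splits on whitespace and returns dense lists, whereas scikit-learn applies its own tokenizer and returns a sparse matrix.

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one dense count vector per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    index = {token: i for i, token in enumerate(vocabulary)}
    vectors = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokens).items():
            row[index[token]] = count
        vectors.append(row)
    return vocabulary, vectors

toy_docs = ["great taffy great price", "not great taffy"]
vocab, vectors = bag_of_words(toy_docs)
print(vocab)
print(vectors)
```

Each document becomes one row and each unique word one column, which is why the real matrix has 365331 rows and 117474 columns.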
