# Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews <br>

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br>
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.




In [None]:
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

import os
import pickle
import re
import sqlite3
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from gensim.models import KeyedVectors, Word2Vec
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import auc, confusion_matrix, roc_curve
from tqdm import tqdm

In [8]:
from google.colab import files

uploaded = files.upload()


Saving database.sqlite to database.sqlite


In [10]:
# using the SQLite Table to read data.
con = sqlite3.connect('database.sqlite')
#filtering only positive and negative reviews i.e.
# not taking into consideration those reviews with Score=3
# SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000, will give top 500000 data points
# you can change the number to any other number based on your computing power

# filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 500000""", con)
# for tsne assignment you can take 5k data points

filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 LIMIT 5000""", con)

# Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.
def partition(x):
    if x < 3:
        return 0
    return 1

# mapping reviews with score less than 3 to negative (0) and the rest to positive (1)
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition)
filtered_data['Score'] = positiveNegative
print("Number of data points in our data", filtered_data.shape)
filtered_data.head(3)

Number of data points in our data (5000, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,1,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,0,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,1,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...


In [11]:
display = pd.read_sql_query("""
SELECT UserId, ProductId, ProfileName, Time, Score, Text, COUNT(*)
FROM Reviews
GROUP BY UserId
HAVING COUNT(*)>1
""", con)

In [12]:
print(display.shape)
display.head()

(80668, 7)


Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
0,#oc-R115TNMSPFT9I7,B005ZBZLT4,Breyton,1331510400,2,Overall its just OK when considering the price...,2
1,#oc-R11D9D7SHXIJB9,B005HG9ESG,"Louis E. Emory ""hoppy""",1342396800,5,"My wife has recurring extreme muscle spasms, u...",3
2,#oc-R11DNU2NBKQ23Z,B005ZBZLT4,Kim Cieszykowski,1348531200,1,This coffee is horrible and unfortunately not ...,2
3,#oc-R11O5J5ZVQE25C,B005HG9ESG,Penguin Chick,1346889600,5,This will be the bottle that you grab from the...,3
4,#oc-R12KPBODL2B5ZD,B007OSBEV0,Christopher P. Presta,1348617600,1,I didnt like this coffee. Instead of telling y...,2


In [13]:
display[display['UserId']=='AZY10LLTJ71NX']

Unnamed: 0,UserId,ProductId,ProfileName,Time,Score,Text,COUNT(*)
80638,AZY10LLTJ71NX,B001ATMQK2,"undertheshrine ""undertheshrine""",1296691200,5,I bought this 6 pack because for the price tha...,5


In [14]:
display['COUNT(*)'].sum()

393063

#  Exploratory Data Analysis

## [2] Data Cleaning: Deduplication

It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data.

In [15]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId='AR5J8UI46CURR'
ORDER BY ProductID
""", con)
display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above, the same user has multiple reviews with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that <br>
<br>
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br>
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on<br>

It was inferred after analysis that reviews with the same parameters other than ProductId belonged to the same product, differing only in flavour or quantity. Hence, to reduce redundancy, it was decided to eliminate rows having the same parameters.<br>

The method used was to first sort the data by ProductId, then keep only the first of the similar reviews and delete the others. For example, in the table above only the review for ProductId=B000HDL1RQ remains. This ensures that there is only one representative for each product; deduplication without sorting could leave different representatives for the same product.
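The sort-then-keep-first idea can be sketched in plain Python on a few toy rows. The user names, review text and the `B000AAAAAA` product are made up for illustration; the notebook itself does the same thing with pandas `sort_values` and `drop_duplicates`.

```python
from operator import itemgetter

# Toy rows: one user posted the same review under three product IDs (illustrative values).
rows = [
    {"ProductId": "B000HDOPYC", "UserId": "USER_A", "ProfileName": "A", "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B000HDL1RQ", "UserId": "USER_A", "ProfileName": "A", "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B000HDOPZG", "UserId": "USER_A", "ProfileName": "A", "Time": 1199577600, "Text": "DELICIOUS WAFERS"},
    {"ProductId": "B000AAAAAA", "UserId": "USER_B", "ProfileName": "B", "Time": 1200000000, "Text": "Too sweet"},
]

def deduplicate(rows, key_fields=("UserId", "ProfileName", "Time", "Text")):
    """Sort by ProductId, then keep the first row seen for each key."""
    seen = set()
    kept = []
    for row in sorted(rows, key=itemgetter("ProductId")):
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

kept = deduplicate(rows)
print([row["ProductId"] for row in kept])  # ['B000AAAAAA', 'B000HDL1RQ']
```

Because the rows are sorted first, the surviving representative is always the one with the smallest ProductId, regardless of the order in which the rows were read.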

In [16]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [17]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset=["UserId","ProfileName","Time","Text"], keep='first', inplace=False)
final.shape

(4986, 10)

In [18]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

99.72

<b>Observation:-</b> It was also seen that in the two rows given below the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible; hence these two rows are also removed from the calculations.

In [19]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND (Id=44737 OR Id=64422)
ORDER BY ProductID
""", con)

display.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [20]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

In [21]:
#Before starting the next phase of preprocessing let's see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(4986, 10)


1    4178
0     808
Name: Score, dtype: int64

# [3].  Text Preprocessing.

In the preprocessing phase we do the following, in this order:

1. Begin by removing the HTML tags
2. Remove any punctuation or a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (as no two-letter adjectives were found)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to work better than Porter stemming)<br>
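The seven steps above can be sketched as one small function. This is a minimal illustration rather than the notebook's actual pipeline: the tiny `STOPWORDS` set and the `clean_review` name are assumptions, and stemming is left as an optional callable so that a stemmer such as NLTK's `SnowballStemmer("english").stem` could be passed in.

```python
import re

# Small illustrative stop-word list; the notebook uses a much longer one.
STOPWORDS = {"the", "is", "a", "and", "this", "it", "br"}

def clean_review(text, stopwords=STOPWORDS, stem=None):
    """Apply the listed steps to one review and return the cleaned tokens."""
    text = re.sub(r"<[^>]+>", " ", text)          # 1. strip HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # 2. drop punctuation / special characters
    tokens = []
    for word in text.split():
        if not word.isalpha():                    # 3. letters only (digits make it alphanumeric)
            continue
        if len(word) <= 2:                        # 4. keep words longer than 2 characters
            continue
        word = word.lower()                       # 5. lowercase
        if word in stopwords:                     # 6. remove stopwords
            continue
        tokens.append(stem(word) if stem else word)  # 7. optional stemming
    return tokens

print(clean_review("This is <br />the BEST tea, 100% great!!"))  # ['best', 'tea', 'great']
```

Passing `stem=None` skips the last step, which makes it easy to compare cleaned text with and without stemming.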



In [22]:
# printing some random reviews
sent_0 = final['Text'].values[0]
print(sent_0)
print("="*50)

sent_1000 = final['Text'].values[1000]
print(sent_1000)
print("="*50)

sent_1500 = final['Text'].values[1500]
print(sent_1500)
print("="*50)

sent_4900 = final['Text'].values[4900]
print(sent_4900)
print("="*50)

Why is this $[...] when the same product is available for $[...] here?<br />http://www.amazon.com/VICTOR-FLY-MAGNET-BAIT-REFILL/dp/B00004RBDY<br /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
I recently tried this flavor/brand and was surprised at how delicious these chips are.  The best thing was that there were a lot of "brown" chips in the bsg (my favorite), so I bought some more through amazon and shared with family and friends.  I am a little disappointed that there are not, so far, very many brown chips in these bags, but the flavor is still very good.  I like them better than the yogurt and green onion flavor because they do not seem to be as salty, and the onion flavor is better.  If you haven't eaten Kettle chips before, I recommend that you try a bag before buying bulk.  They are thicker and crunchier than Lays but just as fresh out of the bag.
Wow.  So far, two two-star reviews.  One obviously had no 

In [23]:
# remove urls from text python: https://stackoverflow.com/a/40823105/4084039
sent_0 = re.sub(r"http\S+", "", sent_0)
sent_1000 = re.sub(r"http\S+", "", sent_1000)
sent_1500 = re.sub(r"http\S+", "", sent_1500)
sent_4900 = re.sub(r"http\S+", "", sent_4900)

print(sent_0)

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


In [24]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(sent_0, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1000, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_1500, 'lxml')
text = soup.get_text()
print(text)
print("="*50)

soup = BeautifulSoup(sent_4900, 'lxml')
text = soup.get_text()
print(text)

Why is this $[...] when the same product is available for $[...] here? />The Victor M380 and M502 traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.
I recently tried this flavor/brand and was surprised at how delicious these chips are.  The best thing was that there were a lot of "brown" chips in the bsg (my favorite), so I bought some more through amazon and shared with family and friends.  I am a little disappointed that there are not, so far, very many brown chips in these bags, but the flavor is still very good.  I like them better than the yogurt and green onion flavor because they do not seem to be as salty, and the onion flavor is better.  If you haven't eaten Kettle chips before, I recommend that you try a bag before buying bulk.  They are thicker and crunchier than Lays but just as fresh out of the bag.
Wow.  So far, two two-star reviews.  One obviously had no idea what they were ordering; the other wants crispy cookies.  Hey, I'm sorry; b

In [25]:
import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

In [26]:
sent_1500 = decontracted(sent_1500)
print(sent_1500)
print("="*50)

Wow.  So far, two two-star reviews.  One obviously had no idea what they were ordering; the other wants crispy cookies.  Hey, I am sorry; but these reviews do nobody any good beyond reminding us to look  before ordering.<br /><br />These are chocolate-oatmeal cookies.  If you do not like that combination, do not order this type of cookie.  I find the combo quite nice, really.  The oatmeal sort of "calms" the rich chocolate flavor and gives the cookie sort of a coconut-type consistency.  Now let is also remember that tastes differ; so, I have given my opinion.<br /><br />Then, these are soft, chewy cookies -- as advertised.  They are not "crispy" cookies, or the blurb would say "crispy," rather than "chewy."  I happen to like raw cookie dough; however, I do not see where these taste like raw cookie dough.  Both are soft, however, so is this the confusion?  And, yes, they stick together.  Soft cookies tend to do that.  They are not individually wrapped, which would add to the cost.  Oh y

In [27]:
#remove words with numbers python
sent_0 = re.sub(r"\S*\d\S*", "", sent_0).strip()
print(sent_0)

Why is this $[...] when the same product is available for $[...] here?<br /> /><br />The Victor  and  traps are unreal, of course -- total fly genocide. Pretty stinky, but only right nearby.


In [28]:
#remove special characters
sent_1500 = re.sub('[^A-Za-z0-9]+', ' ', sent_1500)
print(sent_1500)

Wow So far two two star reviews One obviously had no idea what they were ordering the other wants crispy cookies Hey I am sorry but these reviews do nobody any good beyond reminding us to look before ordering br br These are chocolate oatmeal cookies If you do not like that combination do not order this type of cookie I find the combo quite nice really The oatmeal sort of calms the rich chocolate flavor and gives the cookie sort of a coconut type consistency Now let is also remember that tastes differ so I have given my opinion br br Then these are soft chewy cookies as advertised They are not crispy cookies or the blurb would say crispy rather than chewy I happen to like raw cookie dough however I do not see where these taste like raw cookie dough Both are soft however so is this the confusion And yes they stick together Soft cookies tend to do that They are not individually wrapped which would add to the cost Oh yeah chocolate chip cookies tend to be somewhat sweet br br So if you wa

In [29]:
# we are removing the words 'no', 'nor', 'not' from the stop words list
# <br /><br /> ==> after the above steps, we are getting "br br"
# so we are including "br" in the stop words list
# if we had <br/> instead of <br />, these tags would have been removed in the 1st step

stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [30]:
# Combining all the above statements
from tqdm import tqdm
preprocessed_reviews = []
# tqdm is for printing the status bar
for sentance in tqdm(final['Text'].values):
    sentance = re.sub(r"http\S+", "", sentance)
    sentance = BeautifulSoup(sentance, 'lxml').get_text()
    sentance = decontracted(sentance)
    sentance = re.sub(r"\S*\d\S*", "", sentance).strip()
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)
    sentance = ' '.join(e.lower() for e in sentance.split() if e.lower() not in stopwords)
    preprocessed_reviews.append(sentance.strip())

100%|██████████| 4986/4986 [00:02<00:00, 1990.46it/s]


In [31]:
preprocessed_reviews[1500]

'wow far two two star reviews one obviously no idea ordering wants crispy cookies hey sorry reviews nobody good beyond reminding us look ordering chocolate oatmeal cookies not like combination not order type cookie find combo quite nice really oatmeal sort calms rich chocolate flavor gives cookie sort coconut type consistency let also remember tastes differ given opinion soft chewy cookies advertised not crispy cookies blurb would say crispy rather chewy happen like raw cookie dough however not see taste like raw cookie dough soft however confusion yes stick together soft cookies tend not individually wrapped would add cost oh yeah chocolate chip cookies tend somewhat sweet want something hard crisp suggest nabiso ginger snaps want cookie soft chewy tastes like combination chocolate oatmeal give try place second order'

# [4] Featurization

In [34]:
# Initialize CountVectorizer
count_vect = CountVectorizer()

# Fit the Vectorizer
count_vect.fit(preprocessed_reviews)

# Print Some Feature Names
print("some feature names ", count_vect.get_feature_names_out()[:10])
print('='*50)

# Transform Text to Bag-of-Words
final_counts = count_vect.transform(preprocessed_reviews)

# Print Information about the Resulting Sparse Matrix
print("the type of count vectorizer ", type(final_counts))
print("the shape of our text BOW vectorizer ", final_counts.get_shape())
print("the number of unique words ", final_counts.get_shape()[1])


some feature names  ['aa' 'aahhhs' 'aback' 'abandon' 'abates' 'abbott' 'abby' 'abdominal'
 'abiding' 'ability']
the type of count vectorizer  <class 'scipy.sparse._csr.csr_matrix'>
the shape of our text BOW vectorizer  (4986, 12997)
the number of unique words  12997


## [4.2] Bi-Grams and n-Grams.

In [35]:
#bi-gram, tri-gram and n-gram

#removing stop words like "not" should be avoided before building n-grams
# count_vect = CountVectorizer(ngram_range=(1,2))
count_vect = CountVectorizer(ngram_range=(1,2), min_df=10, max_features=5000)
final_bigram_counts = count_vect.fit_transform(preprocessed_reviews)
print("the type of count vectorizer ",type(final_bigram_counts))
print("the shape of our text BOW vectorizer ",final_bigram_counts.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_bigram_counts.get_shape()[1])

the type of count vectorizer  <class 'scipy.sparse._csr.csr_matrix'>
the shape of our text BOW vectorizer  (4986, 3144)
the number of unique words including both unigrams and bigrams  3144
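
The comment in the cell above warns against removing words like "not" before building n-grams. A minimal sketch of why, using a hypothetical `ngrams` helper on a three-word toy review:

```python
def ngrams(tokens, n):
    """Return the space-joined n-grams of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "taste not good".split()
without_not = [token for token in review if token != "not"]

print(ngrams(review, 2))       # ['taste not', 'not good']
print(ngrams(without_not, 2))  # ['taste good']
```

With "not" removed, the only bigram left is "taste good", which reads as the opposite of what the reviewer wrote.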


## [4.3] TF-IDF

In [37]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2), min_df=10)
tf_idf_vect.fit(preprocessed_reviews)
print("some sample features(unique words in the corpus)", tf_idf_vect.get_feature_names_out()[0:10])
print('='*50)

final_tf_idf = tf_idf_vect.transform(preprocessed_reviews)
print("the type of TF-IDF vectorizer ", type(final_tf_idf))
print("the shape of the TF-IDF matrix ", final_tf_idf.get_shape())
print("the number of unique words including both unigrams and bigrams ", final_tf_idf.get_shape()[1])


some sample features(unique words in the corpus) ['ability' 'able' 'able find' 'able get' 'absolute' 'absolutely'
 'absolutely delicious' 'absolutely love' 'absolutely no' 'according']
the type of TF-IDF vectorizer  <class 'scipy.sparse._csr.csr_matrix'>
the shape of the TF-IDF matrix  (4986, 3144)
the number of unique words including both unigrams and bigrams  3144


## [4.4] Word2Vec

In [38]:
# split every preprocessed review into a list of words
list_of_sentance = [sentance.split() for sentance in preprocessed_reviews]

In [42]:
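# Import necessary libraries
from gensim.models import Word2Vec

# Flag controlling whether the Word2Vec model is trained on our own reviews
want_to_train_w2v = True

if want_to_train_w2v: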
    # min_count = 5 considers only words that occurred at least 5 times
    w2v_model = Word2Vec(sentences=list_of_sentance, min_count=5, vector_size=50, workers=4)
    print(w2v_model.wv.most_similar('great'))
    print('=' * 50)
    print(w2v_model.wv.most_similar('worst'))


[('excellent', 0.9805138111114502), ('think', 0.9794772267341614), ('overall', 0.9785645008087158), ('want', 0.9782611131668091), ('alternative', 0.9767622351646423), ('wanting', 0.9765724539756775), ('looking', 0.9762692451477051), ('snack', 0.9762449860572815), ('lunches', 0.9755992889404297), ('anything', 0.9755110740661621)]
[('remember', 0.9981744289398193), ('level', 0.997930109500885), ('body', 0.997905433177948), ('experience', 0.9978353977203369), ('perhaps', 0.99782794713974), ('turned', 0.9978261590003967), ('night', 0.997800350189209), ('normal', 0.9977465867996216), ('terrible', 0.9977110624313354), ('american', 0.9976968169212341)]


In [44]:
w2v_words = list(w2v_model.wv.key_to_index.keys())
print("number of words that occurred minimum 5 times ", len(w2v_words))
print("sample words ", w2v_words[:50])


number of words that occurred minimum 5 times  3817
sample words  ['not', 'like', 'good', 'great', 'taste', 'one', 'product', 'would', 'flavor', 'love', 'coffee', 'food', 'chips', 'tea', 'no', 'really', 'get', 'best', 'much', 'amazon', 'use', 'time', 'buy', 'also', 'tried', 'little', 'find', 'make', 'price', 'better', 'bag', 'try', 'even', 'mix', 'well', 'chocolate', 'hot', 'eat', 'free', 'water', 'dog', 'first', 'made', 'could', 'found', 'used', 'bought', 'box', 'sugar', 'cup']


## [4.4.1] Converting text into vectors using wAvg W2V, TFIDF-W2V

#### [4.4.1.1] Avg W2v


In [45]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []  # the avg-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance):  # for each review/sentence
    sent_vec = np.zeros(50)  # zero vector of length 50 (the word-vector size); change to 300 if you use google's w2v
    cnt_words = 0  # num of words with a valid vector in the sentence/review
    for word in sent:  # for each word in a review/sentence
        if word in w2v_model.wv.key_to_index:  # dict lookup instead of scanning the w2v_words list
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

100%|██████████| 4986/4986 [00:04<00:00, 1066.51it/s]

4986
50





#### [4.4.1.2] TFIDF weighted W2v

In [47]:
# Create a TfidfVectorizer model
model = TfidfVectorizer()

# Fit the model on the preprocessed reviews
model.fit(preprocessed_reviews)

# Create a dictionary with word as a key and IDF as a value
dictionary = dict(zip(model.get_feature_names_out(), list(model.idf_)))


In [49]:
# TF-IDF weighted Word2Vec
tfidf_feat = model.get_feature_names_out()  # tfidf words/col-names

tfidf_sent_vectors = []  # the tfidf-w2v for each sentence/review is stored in this list
for sent in tqdm(list_of_sentance):  # for each review/sentence
    sent_vec = np.zeros(50)  # zero vector with the same length as the word vectors
    weight_sum = 0  # sum of the tf-idf weights of the words with a valid vector
    for word in sent:  # for each word in a review/sentence
        # dictionary maps every tfidf feature to its IDF, so a dict lookup replaces the slow array scan
        if word in w2v_model.wv.key_to_index and word in dictionary:
            vec = w2v_model.wv[word]
            # tf-idf = IDF of the word * its term frequency within this review
            tf_idf = dictionary[word] * (sent.count(word) / len(sent))
            sent_vec += (vec * tf_idf)
            weight_sum += tf_idf
    if weight_sum != 0:
        sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)


100%|██████████| 4986/4986 [01:13<00:00, 67.56it/s] 
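
To make the weighting in the loop above concrete, here is a small pure-Python sketch on toy data. The three documents and the 2-dimensional word vectors are invented for illustration, and the IDF uses the smoothed form ln((1 + n) / (1 + df)) + 1, which is the default in scikit-learn's TF-IDF classes.

```python
import math

docs = [["good", "tea"], ["good", "coffee"], ["bad", "tea", "tea"]]

# Hypothetical 2-dimensional word vectors standing in for Word2Vec output.
vectors = {"good": [1.0, 0.0], "bad": [-1.0, 0.0], "tea": [0.0, 1.0], "coffee": [0.0, -1.0]}

def smoothed_idf(docs):
    """IDF per word: ln((1 + n_docs) / (1 + document_frequency)) + 1."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for word in set(doc):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in doc_freq.items()}

def tfidf_weighted_vector(doc, idf, vectors, dim=2):
    """Weighted mean of word vectors with weight = idf * term frequency, summed per token."""
    total = [0.0] * dim
    weight_sum = 0.0
    for word in doc:
        if word in vectors and word in idf:
            weight = idf[word] * doc.count(word) / len(doc)
            total = [t + weight * v for t, v in zip(total, vectors[word])]
            weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total

idf = smoothed_idf(docs)
print(tfidf_weighted_vector(docs[0], idf, vectors))
```

In the first toy document both words appear in two of the three documents, so they get equal weight and the result is the midpoint of their vectors. A rarer word such as "bad" has a larger IDF and would pull the review vector more strongly towards its own direction.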


# Machine Learning Models
## 1. Logistic Regression


In [50]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Prepare features and labels
X = sent_vectors
y = final['Score']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Logistic Regression model
lr_model = LogisticRegression()

# Train the model
lr_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lr_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_result = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print evaluation metrics
print(f"Accuracy: {accuracy}")
print("\nClassification Report:\n", classification_report_result)
print("\nConfusion Matrix:\n", conf_matrix)


Accuracy: 0.843687374749499

Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       156
           1       0.84      1.00      0.92       842

    accuracy                           0.84       998
   macro avg       0.42      0.50      0.46       998
weighted avg       0.71      0.84      0.77       998


Confusion Matrix:
 [[  0 156]
 [  0 842]]
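
The confusion matrix above shows that every test review was predicted as positive, so the accuracy only reflects the share of positive reviews. As a rough check, the arithmetic can be redone by hand; the `summarize` helper is a hypothetical name used only for this sketch.

```python
def summarize(conf_matrix):
    """Accuracy, per-class recall and balanced accuracy from a 2x2 confusion matrix.

    Rows are true classes and columns are predicted classes, as in scikit-learn.
    """
    (tn, fp), (fn, tp) = conf_matrix
    total = tn + fp + fn + tp
    recall_neg = tn / (tn + fp)
    recall_pos = tp / (fn + tp)
    return {
        "accuracy": (tn + tp) / total,
        "recall_0": recall_neg,
        "recall_1": recall_pos,
        "balanced_accuracy": (recall_neg + recall_pos) / 2,
    }

baseline = summarize([[0, 156], [0, 842]])
print(baseline)
```

Accuracy comes out as 842 / 998, while recall for class 0 is zero and the balanced accuracy is 0.5, which matches the macro-average recall in the report. On an imbalanced dataset like this one, per-class recall is therefore a more informative number than plain accuracy.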


In [51]:
# Initialize Logistic Regression model with class_weight
lr_model = LogisticRegression(class_weight='balanced')

# Train the model
lr_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = lr_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_result = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print updated evaluation metrics
print(f"Updated Accuracy: {accuracy}")
print("\nUpdated Classification Report:\n", classification_report_result)
print("\nUpdated Confusion Matrix:\n", conf_matrix)


Updated Accuracy: 0.6533066132264529

Updated Classification Report:
               precision    recall  f1-score   support

           0       0.27      0.73      0.40       156
           1       0.93      0.64      0.76       842

    accuracy                           0.65       998
   macro avg       0.60      0.68      0.58       998
weighted avg       0.83      0.65      0.70       998


Updated Confusion Matrix:
 [[114  42]
 [304 538]]


In [52]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Initialize Logistic Regression model
lr_model = LogisticRegression(class_weight='balanced')

# Perform Grid Search Cross-Validation
grid_search = GridSearchCV(lr_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_

# Train the model with the best hyperparameters
best_lr_model = LogisticRegression(class_weight='balanced', C=best_params['C'])
best_lr_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = best_lr_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_report_result = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print updated evaluation metrics
print(f"Updated Accuracy after Hyperparameter Tuning: {accuracy}")
print("\nUpdated Classification Report:\n", classification_report_result)
print("\nUpdated Confusion Matrix:\n", conf_matrix)
print("\nBest Hyperparameters:", best_params)


Updated Accuracy after Hyperparameter Tuning: 0.7294589178356713

Updated Classification Report:
               precision    recall  f1-score   support

           0       0.34      0.77      0.47       156
           1       0.94      0.72      0.82       842

    accuracy                           0.73       998
   macro avg       0.64      0.75      0.64       998
weighted avg       0.85      0.73      0.76       998


Updated Confusion Matrix:
 [[120  36]
 [234 608]]

Best Hyperparameters: {'C': 1000}


In [53]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest model
rf_model = RandomForestClassifier(class_weight='balanced', n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = rf_model.predict(X_test)

# Evaluate the Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
classification_report_rf = classification_report(y_test, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)

# Print evaluation metrics for Random Forest
print(f"Random Forest Accuracy: {accuracy_rf}")
print("\nRandom Forest Classification Report:\n", classification_report_rf)
print("\nRandom Forest Confusion Matrix:\n", conf_matrix_rf)


Random Forest Accuracy: 0.8486973947895792

Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.05      0.10       156
           1       0.85      1.00      0.92       842

    accuracy                           0.85       998
   macro avg       0.79      0.52      0.51       998
weighted avg       0.83      0.85      0.79       998


Random Forest Confusion Matrix:
 [[  8 148]
 [  3 839]]


# Hyperparameter Tuning

In [54]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20, 30]
}

# Initialize Random Forest model
rf_model_tuned = RandomForestClassifier(class_weight='balanced', random_state=42)

# Perform Grid Search Cross-Validation
grid_search_rf = GridSearchCV(rf_model_tuned, param_grid_rf, cv=5, scoring='accuracy')
grid_search_rf.fit(X_train, y_train)

# Get the best hyperparameters
best_params_rf = grid_search_rf.best_params_

# Train the tuned Random Forest model
best_rf_model = RandomForestClassifier(class_weight='balanced', random_state=42, **best_params_rf)
best_rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf_tuned = best_rf_model.predict(X_test)

# Evaluate the tuned Random Forest model
accuracy_rf_tuned = accuracy_score(y_test, y_pred_rf_tuned)
classification_report_rf_tuned = classification_report(y_test, y_pred_rf_tuned)
conf_matrix_rf_tuned = confusion_matrix(y_test, y_pred_rf_tuned)

# Print evaluation metrics for tuned Random Forest
print(f"Tuned Random Forest Accuracy: {accuracy_rf_tuned}")
print("\nTuned Random Forest Classification Report:\n", classification_report_rf_tuned)
print("\nTuned Random Forest Confusion Matrix:\n", conf_matrix_rf_tuned)
print("\nBest Hyperparameters for Random Forest:", best_params_rf)


Tuned Random Forest Accuracy: 0.845691382765531

Tuned Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.55      0.08      0.13       156
           1       0.85      0.99      0.92       842

    accuracy                           0.85       998
   macro avg       0.70      0.53      0.53       998
weighted avg       0.80      0.85      0.79       998


Tuned Random Forest Confusion Matrix:
 [[ 12 144]
 [ 10 832]]

Best Hyperparameters for Random Forest: {'max_depth': 20, 'n_estimators': 100}
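
The grid search above selects parameters by plain accuracy, which on this imbalanced test set (156 negative vs. 842 positive reviews) mostly rewards predicting the positive class. A minimal sketch of one alternative follows: scoring the search with `balanced_accuracy` and reusing the already refitted `best_estimator_`. The data here is a synthetic stand-in created with `make_classification` (an assumption, not the notebook's Word2Vec features), and the names `X_demo`, `y_demo`, `search`, and `best_model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the notebook's features (assumption): ~16% negatives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[0.16, 0.84], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 10]},
    cv=5,
    scoring="balanced_accuracy",  # mean of per-class recall instead of raw accuracy
)
search.fit(X_tr, y_tr)

# With refit=True (the default), GridSearchCV retrains on all of X_tr using the
# best parameters, so best_estimator_ can be used without a second fit.
best_model = search.best_estimator_
print("Best parameters:", search.best_params_)
print("Test balanced accuracy:",
      balanced_accuracy_score(y_te, best_model.predict(X_te)))
```

Whether balanced accuracy, macro F1, or another metric is the right target depends on how costly a missed negative review is; the point of the sketch is only that the `scoring` argument controls what the search optimizes.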


# 3. Using SVM Model

In [55]:
from sklearn.svm import SVC

# Initialize Support Vector Machine model with a linear kernel
svm_model = SVC(kernel='linear', class_weight='balanced', random_state=42)

# Train the SVM model
svm_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_svm = svm_model.predict(X_test)

# Evaluate the SVM model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
classification_report_svm = classification_report(y_test, y_pred_svm)
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)

# Print evaluation metrics for SVM
print(f"SVM Accuracy: {accuracy_svm}")
print("\nSVM Classification Report:\n", classification_report_svm)
print("\nSVM Confusion Matrix:\n", conf_matrix_svm)


SVM Accuracy: 0.6012024048096193

SVM Classification Report:
               precision    recall  f1-score   support

           0       0.25      0.80      0.39       156
           1       0.94      0.56      0.70       842

    accuracy                           0.60       998
   macro avg       0.60      0.68      0.55       998
weighted avg       0.83      0.60      0.65       998


SVM Confusion Matrix:
 [[125  31]
 [367 475]]


# Hyperparameter tuning

In [56]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for SVM with RBF kernel
param_grid_svm_rbf = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1]
}

# Initialize SVM model with RBF kernel
svm_model_rbf = SVC(kernel='rbf', class_weight='balanced', random_state=42)

# Perform Grid Search Cross-Validation
grid_search_svm_rbf = GridSearchCV(svm_model_rbf, param_grid_svm_rbf, cv=5, scoring='accuracy')
grid_search_svm_rbf.fit(X_train, y_train)

# Get the best hyperparameters
best_params_svm_rbf = grid_search_svm_rbf.best_params_

# Train the tuned SVM model with RBF kernel
best_svm_model_rbf = SVC(kernel='rbf', class_weight='balanced', random_state=42, **best_params_svm_rbf)
best_svm_model_rbf.fit(X_train, y_train)

# Make predictions on the test set
y_pred_svm_rbf = best_svm_model_rbf.predict(X_test)

# Evaluate the tuned SVM model with RBF kernel
accuracy_svm_rbf = accuracy_score(y_test, y_pred_svm_rbf)
classification_report_svm_rbf = classification_report(y_test, y_pred_svm_rbf)
conf_matrix_svm_rbf = confusion_matrix(y_test, y_pred_svm_rbf)

# Print evaluation metrics for tuned SVM with RBF kernel
print(f"Tuned SVM (RBF Kernel) Accuracy: {accuracy_svm_rbf}")
print("\nTuned SVM (RBF Kernel) Classification Report:\n", classification_report_svm_rbf)
print("\nTuned SVM (RBF Kernel) Confusion Matrix:\n", conf_matrix_svm_rbf)
print("\nBest Hyperparameters for SVM (RBF Kernel):", best_params_svm_rbf)



Tuned SVM (RBF Kernel) Accuracy: 0.7224448897795591

Tuned SVM (RBF Kernel) Classification Report:
               precision    recall  f1-score   support

           0       0.34      0.79      0.47       156
           1       0.95      0.71      0.81       842

    accuracy                           0.72       998
   macro avg       0.64      0.75      0.64       998
weighted avg       0.85      0.72      0.76       998


Tuned SVM (RBF Kernel) Confusion Matrix:
 [[123  33]
 [244 598]]

Best Hyperparameters for SVM (RBF Kernel): {'C': 10, 'gamma': 1}
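
Both selected values (`C=10`, `gamma=1`) are the largest entries in their grids, so the search may not have bracketed the optimum. A hedged sketch of a wider, log-spaced search is shown below on synthetic data (an assumption; it does not use the notebook's features). Standardizing inside a `Pipeline` is also an assumption, since the notebook does not show whether the features were scaled. The names `pipeline`, `param_grid`, and `search` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's features (assumption).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.16, 0.84], random_state=42
)

# Scaling lives inside the pipeline so each CV fold is scaled on its own
# training portion only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf", class_weight="balanced")),
])

param_grid = {
    "svc__C": np.logspace(-1, 3, 5),      # 0.1, 1, 10, 100, 1000
    "svc__gamma": np.logspace(-3, 1, 5),  # 0.001, 0.01, 0.1, 1, 10
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="balanced_accuracy")
search.fit(X_demo, y_demo)

best = search.best_params_
print("Best parameters:", best)

# Flag any selected value that sits on the boundary of its grid.
for name, grid in param_grid.items():
    if best[name] in (grid.min(), grid.max()):
        print(f"{name}={best[name]} is on the grid edge; consider extending the range")
```

If a selected value still lands on a boundary, extending the range in that direction is usually cheaper than refining the grid elsewhere.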


# Random Forest classifier and ensemble methods:

In [57]:
from sklearn.ensemble import RandomForestClassifier

# Initialize Random Forest model
rf_model = RandomForestClassifier(random_state=42, class_weight='balanced')

# Define the hyperparameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Perform Grid Search Cross-Validation for Random Forest
grid_search_rf = GridSearchCV(rf_model, param_grid_rf, cv=5, scoring='accuracy')
grid_search_rf.fit(X_train, y_train)

# Get the best hyperparameters
best_params_rf = grid_search_rf.best_params_

# Train the tuned Random Forest model
best_rf_model = RandomForestClassifier(random_state=42, class_weight='balanced', **best_params_rf)
best_rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_rf = best_rf_model.predict(X_test)

# Evaluate the tuned Random Forest model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
classification_report_rf = classification_report(y_test, y_pred_rf)
conf_matrix_rf = confusion_matrix(y_test, y_pred_rf)

# Print evaluation metrics for tuned Random Forest
print(f"Tuned Random Forest Accuracy: {accuracy_rf}")
print("\nTuned Random Forest Classification Report:\n", classification_report_rf)
print("\nTuned Random Forest Confusion Matrix:\n", conf_matrix_rf)
print("\nBest Hyperparameters for Random Forest:", best_params_rf)


Tuned Random Forest Accuracy: 0.8496993987975952

Tuned Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.69      0.07      0.13       156
           1       0.85      0.99      0.92       842

    accuracy                           0.85       998
   macro avg       0.77      0.53      0.52       998
weighted avg       0.83      0.85      0.79       998


Tuned Random Forest Confusion Matrix:
 [[ 11 145]
 [  5 837]]

Best Hyperparameters for Random Forest: {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 100}


# Gradient Boosting:


In [58]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# X and y are assumed to be the feature matrix and target labels built earlier in the notebook

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the Gradient Boosting model (fixed seed for reproducibility)
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

# Make predictions
y_pred = gb_model.predict(X_test)

# Evaluate the model
print("Gradient Boosting Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Gradient Boosting Accuracy: 0.845691382765531
Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.05      0.09       156
           1       0.85      0.99      0.92       842

    accuracy                           0.85       998
   macro avg       0.71      0.52      0.50       998
weighted avg       0.81      0.85      0.79       998

Confusion Matrix:
 [[  8 148]
 [  6 836]]
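
Unlike the Random Forest and SVM cells, the Gradient Boosting model above is fit without any class weighting, and `GradientBoostingClassifier` has no `class_weight` parameter. One possible workaround, sketched here on synthetic data (an assumption, not the notebook's features), is to pass balanced per-sample weights to `fit`. The variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic stand-in for the notebook's features (assumption): ~16% negatives.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, weights=[0.16, 0.84], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# "balanced" gives each sample a weight inversely proportional to its class
# frequency, so both classes carry the same total weight during training.
weights = compute_sample_weight(class_weight="balanced", y=y_tr)

gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_tr, y_tr, sample_weight=weights)

print(classification_report(y_te, gb.predict(X_te)))
```

This typically raises minority-class recall at some cost in overall accuracy, which is the same trade-off visible in the SVM results above.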


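Here's a concise summary of test-set accuracy from the experiments above, rounded to two decimals. For reference, 842 of the 998 test reviews used in the cells above are positive, so always predicting the positive class would already score 0.84: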

1. **Logistic Regression with Word2Vec:**
   - Initial Accuracy: 0.84
   - After Hyperparameter Tuning: 0.73
   - Further Tuning: 0.73

2. **Random Forest with Word2Vec:**
   - Initial Accuracy: 0.85
   - After Hyperparameter Tuning: 0.85

3. **SVM with Word2Vec:**
   - Initial Accuracy (linear kernel): 0.60
   - After Hyperparameter Tuning (RBF kernel): 0.72

4. **Gradient Boosting with Word2Vec:**
   - Initial Accuracy: 0.85
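
Accuracy hides how differently these models treat the minority (negative) class. The sketch below recomputes accuracy, negative-class recall, and balanced accuracy from the confusion matrices printed in the cells above. The matrices are copied verbatim from those outputs; the model labels and the helper `summarize` are illustrative names, and Logistic Regression is omitted because its confusion matrix is not shown in this section.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows = true class, columns = predicted class; class 0 = negative).
confusion_matrices = {
    "Random Forest (initial)": [[8, 148], [3, 839]],
    "Random Forest (tuned, first grid)": [[12, 144], [10, 832]],
    "Random Forest (tuned, second grid)": [[11, 145], [5, 837]],
    "SVM linear (initial)": [[125, 31], [367, 475]],
    "SVM RBF (tuned)": [[123, 33], [244, 598]],
    "Gradient Boosting": [[8, 148], [6, 836]],
}


def summarize(matrix):
    """Return (accuracy, negative-class recall, balanced accuracy)."""
    m = np.asarray(matrix)
    recalls = np.diag(m) / m.sum(axis=1)  # per-class recall
    accuracy = np.trace(m) / m.sum()
    return accuracy, recalls[0], recalls.mean()


results = {name: summarize(m) for name, m in confusion_matrices.items()}

print(f"{'Model':<38}{'Accuracy':>10}{'Neg. recall':>13}{'Balanced acc.':>15}")
for name, (acc, neg_recall, bal_acc) in results.items():
    print(f"{name:<38}{acc:>10.3f}{neg_recall:>13.3f}{bal_acc:>15.3f}")

# Accuracy of always predicting the majority (positive) class.
print(f"{'Majority-class baseline':<38}{842 / 998:>10.3f}{0.0:>13.3f}{0.5:>15.3f}")
```

On these matrices the tree-based models reach a balanced accuracy of roughly 0.52 to 0.53, while the tuned RBF SVM reaches about 0.75 despite its lower plain accuracy. Which model is preferable depends on whether missing negative reviews or mislabeling positive ones is the more costly error.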

