# [7] Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.




## [7.1] Loading the data

The dataset is available in two forms
1. .csv file
2. SQLite Database

To load the data, we use the SQLite database, as it makes it easier to query and visualise the data efficiently.
<br> 

Here, as we only want the overall sentiment of the reviews (positive or negative), we purposely ignore all scores equal to 3. If the score is above 3, the review is labelled "positive". Otherwise, it is labelled "negative".

In [8]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer



# using the SQLite Table to read data.
sql_file = sqlite3.connect('database.sqlite') 



#filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", sql_file) 




# Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.
def polarity_r(x):
    if x < 3:
        return 'negative'
    return 'positive'

#mapping scores above 3 to 'positive' and scores below 3 to 'negative'
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(polarity_r) 
filtered_data['Score'] = positiveNegative

In [17]:
filtered_data.shape #looking at the number of attributes and size of the data
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


#  Exploratory Data Analysis

## [7.1.2] Data Cleaning: Deduplication

It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data.  Following is an example:

In [2]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", sql_file)
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above, the same user has multiple reviews with identical values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text. On further analysis it was found that:<br>
<br> 
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br> 
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on<br>

It was inferred after analysis that reviews with the same parameters other than ProductId belonged to the same product, just in a different flavour or quantity. Hence, to reduce redundancy, it was decided to eliminate the rows having the same parameters.<br>

The method used was to first sort the data according to ProductId and then keep only the first of each group of similar product reviews, deleting the others. For example, in the table above only the review for ProductId=B000HDL1RQ remains. This ensures that there is only one representative for each product; deduplicating without sorting could leave different representatives for the same product.
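The sort-then-keep-first rule can be illustrated on a few toy rows. This is a minimal plain-Python sketch rather than the notebook's pandas code: the helper name `keep_first_duplicates` and the toy rows are hypothetical, while the four key columns are the ones passed to `drop_duplicates` in the notebook.

```python
def keep_first_duplicates(rows, sort_key, dedup_keys):
    """Sort rows, then keep only the first row for each combination of dedup_keys."""
    seen = set()
    kept = []
    for row in sorted(rows, key=lambda r: r[sort_key]):
        key = tuple(row[k] for k in dedup_keys)
        if key not in seen:  # the first occurrence of this review wins
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical toy rows: one review posted under three product variants, plus one distinct review.
rows = [
    {"ProductId": "P3", "UserId": "U1", "ProfileName": "A", "Time": 100, "Text": "delicious wafers"},
    {"ProductId": "P1", "UserId": "U1", "ProfileName": "A", "Time": 100, "Text": "delicious wafers"},
    {"ProductId": "P2", "UserId": "U1", "ProfileName": "A", "Time": 100, "Text": "delicious wafers"},
    {"ProductId": "P2", "UserId": "U2", "ProfileName": "B", "Time": 200, "Text": "too sweet"},
]

deduped = keep_first_duplicates(rows, "ProductId", ["UserId", "ProfileName", "Time", "Text"])
print([(r["ProductId"], r["UserId"]) for r in deduped])
```

Because the rows are sorted before the scan, the surviving copy of the repeated review is always the one with the smallest `ProductId`, whatever order the rows arrived in.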

In [60]:
#Sorting data by Time in ascending order (this cell sorts on Time rather than ProductId)
sorted_data=filtered_data.sort_values('Time', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [61]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

(364173, 10)

In [5]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

69.25890143662969

<b>Observation:-</b> It was also seen that in the two rows given below the value of HelpfulnessNumerator is greater than HelpfulnessDenominator, which is not practically possible; hence these two rows are also removed from the calculations.

In [6]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND (Id=44737 OR Id=64422)
ORDER BY ProductID
""", sql_file)
display


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [62]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
#Y = final['labels']

In [5]:
#Before starting the next phase of preprocessing, let's see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(364171, 10)


positive    307061
negative     57110
Name: Score, dtype: int64

## [7.2.3] Text Preprocessing: Stemming, stop-word removal.

Now that we have finished deduplication, our data requires some preprocessing before we go further with the analysis and build the prediction model.

Hence, in the preprocessing phase we do the following, in the order below:

1. Begin by removing the HTML tags
2. Remove any punctuation or a limited set of special characters such as , or . or #
3. Check that the word is made up of English letters and is not alphanumeric
4. Check that the length of the word is greater than 2 (as it was found that there are no two-letter adjectives)
5. Convert the word to lowercase
6. Remove stopwords
7. Finally, apply Snowball stemming to the word (it was observed to be better than Porter stemming)<br>

After this, we collect the words used to describe positive and negative reviews.
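The seven steps can be sketched as one small function. This is a simplified illustration under stated assumptions, not the notebook's implementation: `TOY_STOP_WORDS` is a hypothetical stand-in for NLTK's English stop-word list, and the `stem` argument defaults to the identity function so that the sketch runs without NLTK (passing `nltk.stem.SnowballStemmer('english').stem` would correspond to step 7).

```python
import re

# Hypothetical, tiny stop-word set for illustration; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"the", "and", "was", "this", "are", "for"}


def preprocess(text, stop_words=TOY_STOP_WORDS, stem=lambda word: word):
    """Apply the cleaning steps in order and return the cleaned text as one string."""
    text = re.sub(r"<.*?>", " ", text)       # 1. strip HTML tags
    text = re.sub(r"[?!'\"#]", "", text)     # 2a. delete these characters outright
    text = re.sub(r"[.,)(|/]", " ", text)    # 2b. replace these characters with a space
    cleaned = []
    for word in text.split():
        if not word.isalpha() or len(word) <= 2:  # 3-4. letters only, longer than 2
            continue
        word = word.lower()                       # 5. lowercase
        if word in stop_words:                    # 6. drop stop words
            continue
        cleaned.append(stem(word))                # 7. stem (identity by default here)
    return " ".join(cleaned)


print(preprocess("The taffy was GREAT!<br />Soft, chewy and 100% worth it."))
```

Keeping the stemmer as a parameter makes it easy to compare Snowball and Porter stemming on the same input without touching the other steps.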

In [63]:
# find sentences containing HTML tags
import re
for i, sent in enumerate(final['Text'].values):
    if re.search('<.*?>', sent):
        print(i)
        print(sent)
        break

        

8
What happens when you say his name three times? Michael Keaten stars in this comedy about two couples that live in an old two story house.  While coming back from a supply store, the couple suddenly get caught inside of a  &quot;broken-up&quot; bridge and then just before they start to tumble down  into the lake, a board catches them.  But just when they've got their hopes  up, and small dog steps on the board and the car starts to slide off the  bridge and into the lake waters.  A few minutes later...<p>They find  themselves back into their home, they find that somehow somehad light the  fireplace, as if done by magic.  From then on, they find a weird-looking  dead guy known as Bettlejuice.  The only way they can get him for help is  to call him by his name three times and he will appear at their survice.  But they soon wish that they have never called his name, because  Bettlejuice was once a troublemaker but he is the only one who can save  them, on the account that they said his 

In [9]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_word = set(stopwords.words('english')) #set of stopwords
sno_stem = nltk.stem.SnowballStemmer('english') #initialising the snowball stemmer

def cleanhtml(sentence): #function to clean the word of any html-tags
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
def cleanpunc(sentence): #function to clean the word of any punctuation or special characters
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    return  cleaned
print(stop_word)
print('************************************')
print(sno_stem.stem('tasty'))

{"she's", 'a', 'against', 'what', 'such', 'but', 've', 'then', 'ma', 'd', 'shan', "should've", 'had', 'up', 'with', 'very', 'below', 'his', 'himself', 'hasn', "won't", 'most', 'did', 'don', 'during', 'out', 'in', 'off', 'he', 'too', "didn't", 'm', 'who', "shan't", 'there', 'for', 'was', 'myself', 'it', 'o', 'themselves', 'are', 'mustn', 'doing', 'were', 'or', 'been', 'under', 'some', 'isn', 'more', 'do', 'and', "aren't", 'this', 'yourself', 'our', 'over', "couldn't", "you'd", 'is', 'hadn', 'those', 'y', "it's", 'where', 'll', 'any', "haven't", 'won', 'you', 'being', 'she', 'couldn', "mustn't", 'them', "that'll", 'no', 'how', "mightn't", 'about', 'its', 'down', 'weren', 's', 'theirs', 'yourselves', "wouldn't", "weren't", 'hers', 'through', 'here', 'have', 'than', 'not', 'when', 'until', 'that', 'haven', 'needn', 'from', 't', "needn't", 'will', 'whom', 'now', 'shouldn', 'to', 'again', "you've", 'by', "you're", 'me', 'the', 'further', 'on', 'be', 'they', 'each', 'before', 'itself', 'after

In [65]:
#Code for implementing step-by-step the checks mentioned in the pre-processing phase
# this code takes a while to run as it needs to process roughly 364k reviews.
i=0
str1=' '
final_string=[]
all_positive_words=[] # store words from +ve reviews here
all_negative_words=[] # store words from -ve reviews here.
s=''
for sent in final['Text'].values:
    filtered_sentence=[]
    #print(sent);
    sent=cleanhtml(sent) # remove HTMl tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if cleaned_words.isalpha() and len(cleaned_words) > 2:
                if(cleaned_words.lower() not in stop_word):
                    s=(sno_stem.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if (final['Score'].values)[i] == 'positive': 
                        all_positive_words.append(s) #list of all words used to describe positive reviews
                    if(final['Score'].values)[i] == 'negative':
                        all_negative_words.append(s) #list of all words used to describe negative reviews
                else:
                    continue
            else:
                continue 
    #print(filtered_sentence)
    str1 = b" ".join(filtered_sentence) #final string of cleaned words
    #print("***********************************************************************")
    
    final_string.append(str1)
    i+=1

In [66]:
final['CleanedText']=final_string #adding a column of CleanedText which displays the data after pre-processing of the review 

In [13]:
final.head(3) #below the processed review can be seen in the CleanedText Column 


# store the final table in an SQLite table for future use.
sql_file2 = sqlite3.connect('final.sqlite')
c=sql_file2.cursor()
sql_file2.text_factory = str
final.to_sql('Reviews', sql_file2, schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)

## Separating data into Train, Cross-validation, Test

In [10]:
sql_file3 = sqlite3.connect('final.sqlite')
final = pd.read_sql_query('SELECT * FROM Reviews',sql_file3)
from sklearn.model_selection import train_test_split
def values(x):
    # encode the label: positive -> 1, negative -> 0
    if x == 'positive':
        return 1
    return 0
X = final['Text'].values  # the raw Text column is vectorised here, not CleanedText
label = final['Score']
Y = label.map(values)

## Test data split ##
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
 


# [7.2.2] Bag of Words (BoW)

In [14]:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(X_train)
X_final = count_vect.transform(X_test)

### finding value of alpha using grid search

In [15]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
l1_values = [{'alpha': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]}]
model = GridSearchCV(MultinomialNB(), l1_values, scoring = 'f1', cv=10)
model.fit(final_counts[0:20000], Y_train[0:20000])
print('best alpha: ',model.best_estimator_)

best alpha:  MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
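To see what the grid search is tuning, recall that multinomial Naive Bayes estimates each word likelihood with additive smoothing, (count + alpha) / (class total + alpha * vocabulary size). The sketch below is a plain-Python illustration of that formula only; the function name and the counts are hypothetical and are not taken from the review data.

```python
import math


def smoothed_log_prob(word_count, class_total, vocab_size, alpha):
    """log P(word | class) with additive (Lidstone) smoothing."""
    return math.log((word_count + alpha) / (class_total + alpha * vocab_size))


# Hypothetical counts: a class with 1,000 word occurrences over a 5,000-word vocabulary.
for alpha in (1e-5, 1e-2, 1.0, 1e2):
    unseen = smoothed_log_prob(0, 1000, 5000, alpha)
    frequent = smoothed_log_prob(50, 1000, 5000, alpha)
    print(f"alpha={alpha:g}  unseen={unseen:.2f}  frequent={frequent:.2f}")
```

A very small alpha penalises unseen words heavily, while a very large alpha pushes all words towards the same likelihood, which is why the search spans several orders of magnitude.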


### Naive bayes on bow

In [23]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
NB = MultinomialNB(alpha=0.01)
NB.fit(final_counts[0:90000],Y_train[0:90000])
pred = NB.predict(X_final[0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  89.37
f1_score 0.8940417239060247
precision 0.8943999540399429
recall 0.8937
confusion matrix [[1048  519]
 [ 544 7889]]


In [46]:
##################    Top features for positive class            ########################
feature_names = count_vect.get_feature_names()
feature_names = np.array(feature_names)
# rows of feature_log_prob_ follow NB.classes_; stacking with the names turns the values into strings
feature_positive = np.vstack((np.absolute(NB.feature_log_prob_[0]),feature_names))
feature_positive = pd.DataFrame(data=feature_positive.T,columns=('coefficients','words'))
main = feature_positive.sort_values('coefficients', axis=0,ascending=False ,inplace=False, kind='quicksort')
print('\t     top 20 features')
print(main[0:20])

	     top 20 features
            coefficients      words
46343  9.951696722458527     humans
57948  9.951696722458527     minute
39627  9.951696722458527   friendly
72226  9.951696722458527     recent
28198  9.951696722458527      dairy
49688  9.951696722458527   japanese
52461  9.951696722458527      latte
25758  9.951696722458527    consume
25768  9.951696722458527  consuming
87805  9.951696722458527      title
23286  9.951696722458527      cider
83230  9.951696722458527     stored
73598  9.951696722458527   replaced
26429  9.951696722458527  correctly
72653  9.951696722458527    reduced
17966  9.951696722458527    bottled
44943  9.951696722458527       herb
57929  9.951696722458527      mints
66108  9.951696722458527  perfectly
93064  9.951696722458527      vomit


In [50]:
############################   Top features for negative class   #####################
feature_negative = np.vstack((np.absolute(NB.feature_log_prob_[1]),feature_names))
feature_negative = pd.DataFrame(data=feature_negative.T,columns=('coefficients','words'))
main_negative = feature_negative.sort_values('coefficients', axis=0,ascending=False ,inplace=False, kind='quicksort')
print('\t     top 20 features')
print(main_negative[0:20])

	     top 20 features
            coefficients     words
5496   9.987005259212236     ahead
31925  9.987005259212236    dollar
77317  9.987005259212236   seafood
38952  9.987005259212236   forever
80319  9.987005259212236  smoothie
80064  9.987005259212236    slowly
52729  9.987005259212236   leaving
38966  9.987005259212236    forget
8280   9.987005259212236     aside
32563  9.987005259212236     drive
86608  9.987005259212236     terms
81222  9.987005259212236    sounds
40532  9.987005259212236    garden
89393  9.987005259212236       tsp
42659  9.987005259212236    greasy
85405  9.987005259212236   tabasco
56609  9.987005259212236     meats
43504  9.987005259212236       guy
16040  9.987005259212236     below
71647  9.969616540777753    rarely


## [7.2.4] Bi-Grams and n-Grams.

**Motivation**

Now that we have our list of words describing positive and negative reviews, let's analyse them.<br>

We begin analysis by getting the frequency distribution of the words as shown below

In [None]:
freq_dist_positive=nltk.FreqDist(all_positive_words)
freq_dist_negative=nltk.FreqDist(all_negative_words)
print("Most Common Positive Words : ",freq_dist_positive.most_common(20))
print("Most Common Negative Words : ",freq_dist_negative.most_common(20))

<b>Observation:-</b> From the above it can be seen that the most common positive and negative words overlap; e.g. 'like' could appear as 'not like'. <br>
So it is a good idea to consider pairs of consecutive words (bi-grams) or a sequence of n consecutive words (n-grams).
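A short plain-Python sketch shows what unigrams plus bi-grams look like for one sentence. The function `ngrams` is a hypothetical helper written for illustration; its `n_min` and `n_max` arguments play the role of `ngram_range=(1,2)` in `CountVectorizer`, though the library's own tokenisation differs in detail.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(ngrams("did not like the taste".split()))
```

The bi-gram "not like" now appears as its own feature, so the model can tell it apart from a bare "like".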

In [5]:
#bi-gram, tri-gram and n-gram

#removing stop words like "not" should be avoided before building n-grams
count_vect = CountVectorizer(ngram_range=(1,2) ) #in scikit-learn
final_bigram_counts = count_vect.fit_transform(X_train)
X_test_grams = count_vect.transform(X_test)

In [17]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
l1_values = [{'alpha': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]}]
model = GridSearchCV(MultinomialNB(), l1_values, scoring = 'f1', cv=10)
model.fit(final_bigram_counts[0:20000], Y_train[0:20000])
print(model.best_estimator_)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [22]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
NB = MultinomialNB(alpha=0.01)
NB.fit(final_bigram_counts,Y_train)
pred = NB.predict(X_test_grams[0:5000])
f1score = f1_score(Y_test[0:5000],pred,average='weighted')
precision = precision_score(Y_test[0:5000],pred,average='weighted')
recall = recall_score(Y_test[0:5000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:5000],pred)
accuracy = accuracy_score(Y_test[0:5000],pred)
print('accuracy score',accuracy*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy score 92.54
f1_score 0.9224462947298251
precision 0.9219232493990281
recall 0.9254
confusion matrix [[ 516  253]
 [ 120 4111]]


In [28]:
feature_names = count_vect.get_feature_names()
feature_names = np.array(feature_names)
feature_positive = np.vstack((np.absolute(NB.feature_log_prob_[0]),feature_names))
feature_positive = pd.DataFrame(data=feature_positive.T,columns=('coefficients','words'))
main = feature_positive.sort_values('coefficients', axis=0,ascending=False ,inplace=False, kind='quicksort')
print('\t     top 20 features for positive class')
print(main[0:20])

	     top 20 features for positive class
              coefficients       words
295307   9.997864948192731   bottle of
1074194  9.997864948192731     it took
1533094  9.997864948192731   pieces of
2288476  9.997864948192731   with them
581063   9.997864948192731     despite
1734879  9.997864948192731  same thing
1070959  9.994469474488035     it like
935826   9.994469474488035    have any
933434   9.994469474488035     has the
1514623  9.994469474488035  people who
759256   9.994469474488035    find the
115904    9.99108549102122    and didn
949055    9.99108549102122   healthier
2312848   9.99108549102122   years ago
2050896   9.99108549102122   these for
2017408   9.99108549102122    the cups
1074604   9.99108549102122     it very
632083    9.99108549102122       drank
167239   9.987712920288859     as gift
1014275  9.987712920288859      in any


In [30]:
feature_names = count_vect.get_feature_names()
feature_names = np.array(feature_names)
feature_negative = np.vstack((np.absolute(NB.feature_log_prob_[1]),feature_names))
feature_negative = pd.DataFrame(data=feature_negative.T,columns=('coefficients','words'))
main = feature_negative.sort_values('coefficients', axis=0,ascending=False ,inplace=False, kind='quicksort')
print('\t     top 20 features for negative class')
print(main[0:20])

	     top 20 features for negative class
              coefficients            words
2020550  9.999090887001028       the future
1602370  9.997687387847826       product to
1575617  9.997687387847826          premium
877032    9.99698637625944          good to
1684951   9.99698637625944      replacement
2010577  9.995585825614189       that these
1561174  9.995585825614189         possible
1971706  9.995585825614189             tart
876100   9.994886285183705     good product
351542   9.994886285183705        buy these
82025     9.99418723376797           all my
1973966  9.992790595249314       taste good
1238933  9.992093006784252            me of
2053952  9.991395904609645       these were
1346094  9.990699288047978          nice to
898800   9.990003156423148             grey
1241675  9.989307509060467             mean
2238264  9.989307509060467             weak
2065451  9.989307509060467     this company
228013   9.988612345286652  be disappointed


# [7.2.5] TF-IDF

In [11]:

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(X_train)
X_cv_tfidf = tf_idf_vect.transform(X_test)
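As a rough illustration of the weights `TfidfVectorizer` produces with its default settings, the sketch below uses raw term counts, the smoothed inverse document frequency ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each document vector. It is a simplified plain-Python version: the function `tfidf`, the whitespace tokenisation and the toy documents are assumptions made for the example, and it expects non-empty documents.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["great taffy great price", "stale taffy", "great coffee"]
for vec in tfidf(docs):
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In the second toy document, "stale" receives a larger weight than "taffy" because it occurs in fewer documents.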

### finding best value of alpha

In [36]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
l1_values = [{'alpha': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]}]
model = GridSearchCV(MultinomialNB(), l1_values, scoring = 'f1', cv=10)
model.fit(final_tf_idf[0:20000], Y_train[0:20000])
print('best value of alpha using tf_idf',model.best_estimator_)

best value of alpha using tf_idf MultinomialNB(alpha=0.001, class_prior=None, fit_prior=True)


In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
# fitted with alpha=0.01; the grid search above reported alpha=0.001 as best
NB = MultinomialNB(alpha=0.01)
NB.fit(final_tf_idf,Y_train)
pred = NB.predict(X_cv_tfidf[0:5000])
f1score = f1_score(Y_test[0:5000],pred,average='weighted')
precision = precision_score(Y_test[0:5000],pred,average='weighted')
recall = recall_score(Y_test[0:5000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:5000],pred)
accuracy = accuracy_score(Y_test[0:5000],pred)
print('accuracy score',accuracy*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy score 91.67999999999999
f1_score 0.9084386604280144
precision 0.9137799959304203
recall 0.9168
confusion matrix [[ 415  354]
 [  62 4169]]


In [18]:
features = tf_idf_vect.get_feature_names()
features = np.array(features)
feature_positive = np.vstack((np.absolute(NB.feature_log_prob_[0]),features))
features_positive = pd.DataFrame(data=feature_positive.T,columns=('coefficients','words'))
main = features_positive.sort_values('coefficients',axis=0,ascending=False,inplace=False,kind='quicksort')
print('\t   top 20 features')
print(main[0:20])

	   top 20 features
              coefficients         words
2016828   9.99922676388742  the contents
1409648  9.998969422957899          oily
1377298   9.99882893185357     nutrition
2016486  9.998538850389009     the color
2026132  9.998257539793439   the noodles
2055454  9.998223156495728     they just
128334   9.997851413424486     and tried
935799   9.996733301292956       have an
2040307  9.996555862177264       them at
1059709  9.995859539330256       is more
347024   9.995412883208783       but won
2013893  9.995284862655792     the beans
1060081   9.99527504860016        is one
1468206  9.994850390130802        out to
7808      9.99356558734386            14
2018969  9.993467759185092       the end
1539267  9.992626605433697        placed
1974133  9.992406823001057      taste in
609828   9.992032463939315  dissapointed
1104667  9.991708593707097          kept


In [14]:
features = tf_idf_vect.get_feature_names()
features = np.array(features)
feature_negative = np.vstack((np.absolute(NB.feature_log_prob_[1]),features))
features_negative = pd.DataFrame(data=feature_negative.T,columns=('coefficients','words'))
main = features_negative.sort_values('coefficients',axis=0,ascending=False,inplace=False,kind='quicksort')
print('\t   top 20 features')
print(main[0:20])

	   top 20 features
              coefficients                 words
523525    9.99962458156395               creamer
2191395  9.999032818178492               various
2304243  9.998936168580505            would like
796939   9.998787540371666             for great
1984071  9.998466384640704                tea in
1152827  9.998428193501052             light and
1545604  9.998375767244857  pleasantly surprised
1069406  9.997023658339375               it even
1030531  9.996761088331343          individually
1784061  9.996407951480236               shampoo
117539   9.996013893814409            and flavor
1073948  9.995571517510374              it taste
585546   9.995569031225696              diabetic
795015   9.995514490558492         for christmas
1116917  9.994788934879839             know that
511803   9.994395350124986           couldn find
1843300   9.99399832946013                so the
1317690  9.993026540175222             my mother
2285196  9.992737987369235               with me


In [None]:
 # source: https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

top_tfidf = top_tfidf_feats(final_tf_idf[1,:].toarray()[0],features,25)

## Conclusion and summary

The following step-wise approach is used for each type of vectorization:
- Finding the right alpha using grid search
- Applying Naive bayes on the data using best alpha
- Finding best features for positive class 
- Finding best features for negative class
- Calculating required performance metric over test data

#### Performance table on test data

| Metric       | Bow   | Bigrams | tf_idf                         |
|--------------|-------|---------|--------------------------------|
| alpha        | 0.01  | 0.01    | 0.01 (grid search best: 0.001) |
| precision    | 0.894 | 0.922   | 0.914                          |
| recall       | 0.894 | 0.925   | 0.917                          |
| f1_score     | 0.894 | 0.922   | 0.908                          |
| accuracy (%) | 89.37 | 92.54   | 91.68                          |


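### Confusion matrix

Rows are the actual classes and columns are the predicted classes (0 = negative, 1 = positive).

#### Bow

|                 | predicted negative | predicted positive |
|-----------------|--------------------|--------------------|
| actual negative | 1048               | 519                |
| actual positive | 544                | 7889               |

#### Bigrams

|                 | predicted negative | predicted positive |
|-----------------|--------------------|--------------------|
| actual negative | 516                | 253                |
| actual positive | 120                | 4111               |

#### tf_idf

|                 | predicted negative | predicted positive |
|-----------------|--------------------|--------------------|
| actual negative | 415                | 354                |
| actual positive | 62                 | 4169               |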
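The summary metrics can be re-derived from a confusion matrix, which is a convenient consistency check. The sketch below is a plain-Python illustration with a hypothetical function name; it assumes rows are actual classes and columns are predicted classes, and it weights each class by its support, as `average='weighted'` does in scikit-learn.

```python
def weighted_metrics(matrix):
    """matrix[i][j] = number of samples with actual class i predicted as class j."""
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n_classes)) / total
    precision = recall = f1 = 0.0
    for i in range(n_classes):
        support = sum(matrix[i])                                 # actual members of class i
        predicted = sum(matrix[r][i] for r in range(n_classes))  # predictions of class i
        p = matrix[i][i] / predicted if predicted else 0.0
        r = matrix[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


bow = [[1048, 519], [544, 7889]]  # Bow confusion matrix reported in the notebook
print(weighted_metrics(bow))
```

With support weighting, the weighted recall always equals the accuracy, which is why those two figures coincide in each printed result.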