# Amazon Fine Food Reviews Naive Bayes


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unqiue identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).



 In Naive bayes we assume conditional independency as our key assumption.This is a 2 class classification problem 
 where class 1 resemble the review as positive and class 0 as the negative review

In [1]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer




con = sqlite3.connect('./amazon-fine-food-reviews/database.sqlite') 
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", con)

In [3]:
s1= filtered_data.loc[filtered_data["Score"]>=4].sample(n=20000,random_state=1)
print(s1.shape)

s2= filtered_data.loc[filtered_data["Score"]<=2].sample(n=20000,random_state=127)
print(s2.shape)

(20000, 10)
(20000, 10)


In [4]:
data=s1
data=data.append(s2)
data.shape

(40000, 10)

In [5]:
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'

actualScore = filtered_data['Score']
positiveNegative = actualScore.map(partition) 
data['Score'] = positiveNegative

In [6]:
sorted_data=data.sort_values('ProductId', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

(36108, 10)

In [7]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
final=final.drop_duplicates(subset={"UserId","ProfileName","Time"},keep='first',inplace=False)
final.shape

(35554, 10)

In [8]:
final['Score'].value_counts()

positive    18914
negative    16640
Name: Score, dtype: int64

## DATA pre-processing 

In [9]:
import re
i=0;
for sent in final['Text'].values:
    if (len(re.findall('<.*?>', sent))):
        print(i)
        print(sent)
        break;
    i += 1;    


1
Summary:  A young boy describes the usefulness of chicken soup with rice for each month of the year.<br /><br />Evaluation:  With Sendak's creative repetitious and rhythmic words, children will enjoy and learn to read the story of a boy who loves chicken soup with rice!  Through Sendak's catchy story, children will also learn the months of the year, as well as what seasons go with what month! They learn to identify ice-skating and snowmen in the winter; strong wind in March; birds and flowers in the spring; swimming and hot temperatures in the summer; and finally different holidays throughout the year. Such as Halloween in October, and Christmas in December.<br /><br />Sendak's simple three colored crayon-like drawings are a perfect addition to his educational and entertaining story.<br /><br />A great activity that you can do with this book is to have children draw their own illustrations for each month of the year.  Afterwards you can bind the pages together so the children can cre

In [10]:
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop = set(stopwords.words('english'))
sno = nltk.stem.SnowballStemmer('english') 

def cleanhtml(sentence): 
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
def cleanpunc(sentence): 
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    return  cleaned
print(stop)
print(sno.stem('tasty'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\santosh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
{'this', 's', "couldn't", 'shan', 'hadn', 'that', 'whom', 'my', 'itself', 'mustn', 'does', 'yourselves', 'her', 'having', 'shouldn', 'these', 'mightn', 'such', 'no', 'of', 'you', 'which', "won't", 'it', "needn't", "doesn't", 'after', 'in', "isn't", 'was', 'between', 'why', "should've", 'had', 'their', 'some', 'being', "mightn't", 'him', 'few', 'are', 'o', 'not', 'but', 'herself', 'haven', 'just', 'for', 'am', 'most', "haven't", 'how', 'is', 'they', 'from', 'ours', 'on', 'when', 't', 'can', 're', 'both', 'didn', 'those', 'while', 'nor', 'then', 'has', "you're", 'wouldn', 'to', 'as', 'aren', 'and', 'so', 'hers', 'each', 'doesn', 'himself', 'with', 'yourself', "hasn't", "weren't", 'off', 'i', 'we', 'than', 'any', 'until', 'theirs', "you'd", 'should', 'where', 'needn', 'there', 'couldn', 'through', 'during', "it's", 'only', 'over'

In [11]:
i=0
str1=' '
final_string=[]
all_positive_words=[] 
all_negative_words=[] 
s=''
for sent in final['Text'].values:
    filtered_sentence=[]
    
    sent=cleanhtml(sent) 
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if((cleaned_words.isalpha()) & (len(cleaned_words)>2)):    
                if(cleaned_words.lower() not in stop):
                    s=(sno.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if (final['Score'].values)[i] == 'positive': 
                        all_positive_words.append(s) 
                    if(final['Score'].values)[i] == 'negative':
                        all_negative_words.append(s) 
                else:
                    continue
            else:
                continue 

    str1 = b" ".join(filtered_sentence)
    
    final_string.append(str1)
    i+=1

In [12]:
final['CleanedText']=final_string
final.head(3) 


conn = sqlite3.connect('final.sqlite')
c=conn.cursor()
conn.text_factory = str
final.to_sql('Reviews', conn, flavor=None, schema=None, if_exists='replace', index=True, index_label=None, chunksize=None,
             dtype=None)
final.shape

(35554, 11)

## Time based sorting

In [13]:
final=final.sort_values('Time')

In [14]:
x= np.array(final.iloc[:, 0:10])

In [15]:
y= np.array(final['Score'])


## bag of words-Naive Bayes

In [16]:
count_vect = CountVectorizer()
final_bow = count_vect.fit_transform(x[:,9])


In [17]:
final_bow.get_shape()

(35554, 38046)

In [42]:
y[y=='positive']=1
y[y=="negative"]=0
y=y.astype('int')

In [18]:
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import cross_val_score
from collections import Counter
from sklearn.metrics import accuracy_score
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB



In [44]:
X_1, X_test, y_1, y_test = cross_validation.train_test_split(final_bow, y, test_size=0.3)

X_tr, X_cv, y_tr, y_cv = cross_validation.train_test_split(X_1, y_1, test_size=0.3)


In [84]:
from sklearn.naive_bayes import MultinomialNB
n=0
while(n<7):
    a=[0.001,0.01,0.1,0,1,10,100]
    gnb = MultinomialNB(alpha=a[n])
    y_pred = gnb.fit(X_tr,y_tr)
    pred = gnb.predict(X_cv)
    acc = accuracy_score(y_cv, pred, normalize=True) * float(100)
    print('\nCV accuracy for alpha=%f is %d%%' %(float(a[n]),acc))    
    n=n+1


CV accuracy for alpha=0.001000 is 81%

CV accuracy for alpha=0.010000 is 83%

CV accuracy for alpha=0.100000 is 84%

CV accuracy for alpha=0.000000 is 78%

CV accuracy for alpha=1.000000 is 85%

CV accuracy for alpha=10.000000 is 85%

CV accuracy for alpha=100.000000 is 81%


  'setting alpha = %.1e' % _ALPHA_MIN)


In [92]:
gnb = MultinomialNB(alpha=10)
gnb.fit(X_tr,y_tr)
pred = gnb.predict(X_test)
pre=precision_score(y_test,pred, average='macro')* float(100)
acc = accuracy_score(y_test, pred, normalize=True) * float(100)
re=recall_score(y_test, pred, average='macro')*float(100)
f1=f1_score(y_test, pred, average='micro')* float(100)
print("accuracy is",acc)
print("precision score is ",pre)
print("recall score is ",re)
print("f1 score is ",f1)

accuracy is 85.94731414643293
precision score is  86.27476642465228
recall score is  85.7305382168692
f1 score is  85.94731414643293


In [108]:
confusion_matrix(y_test,pred)

array([[4120,  979],
       [ 520, 5048]], dtype=int64)

## observation:
- the optimal alpha value is 1 and 10 which has 85% accuracy on CV.
- with aplha = 10 on test data the metrics are as follows
  - accuracy is 85.94%
  - precision score is 86.27%
  - recall score is 85.73%
  - f1 score is 85.94%
- confusion matrix values:
  - TN is 4120
  - FN is 979
  - FP is 520
  - TP is 5048

## feature importance:

In [128]:
acc=[]
for i in range(38046):
    X_1, X_test, y_1, y_test = cross_validation.train_test_split(final_bow[:,i], y, test_size=0.3)
    X_tr, X_cv, y_tr, y_cv = cross_validation.train_test_split(X_1, y_1, test_size=0.3)
    gnb = MultinomialNB()
    y_pred = gnb.fit(X_tr,y_tr)
    pred = gnb.predict(X_cv)
    gnb = MultinomialNB(alpha=10)
    gnb.fit(X_tr,y_tr)
    pred = gnb.predict(X_test)
    acc = accuracy_score(y_test, pred, normalize=True) * float(100)
    

In [143]:
acc=np.array(acc)
print(acc.max())

53.12646479797506


## tfidf NaiveBayes 

In [109]:
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(x[:,9])
final_tf_idf.get_shape()

(35554, 660649)

In [110]:
X_2, X_test1, y_2, y_test1 = cross_validation.train_test_split(final_tf_idf, y, test_size=0.3)

X_tr1, X_cv1, y_tr1, y_cv1 = cross_validation.train_test_split(X_2, y_2, test_size=0.3)


In [113]:

n=0
while(n<9):
    a=[0.0001,0.001,0.01,0.1,0,1,10,100,1000]
    gnb = MultinomialNB(alpha=a[n])
    y_pred1 = gnb.fit(X_tr1,y_tr1)
    pred1 = gnb.predict(X_cv1)
    acc1 = accuracy_score(y_cv1, pred1, normalize=True) * float(100)
    print('\nCV accuracy for alpha=%f is %d%%' %(float(a[n]),acc1))    
    n=n+1


CV accuracy for alpha=0.000100 is 83%

CV accuracy for alpha=0.001000 is 85%

CV accuracy for alpha=0.010000 is 87%

CV accuracy for alpha=0.100000 is 89%


  'setting alpha = %.1e' % _ALPHA_MIN)



CV accuracy for alpha=0.000000 is 78%

CV accuracy for alpha=1.000000 is 85%

CV accuracy for alpha=10.000000 is 73%

CV accuracy for alpha=100.000000 is 53%

CV accuracy for alpha=1000.000000 is 52%


In [117]:
gnb = MultinomialNB(alpha=0.1)
gnb.fit(X_tr1,y_tr1)
pred1 = gnb.predict(X_test1)
pre2=precision_score(y_test1,pred1)* float(100)
acc2 = accuracy_score(y_test1, pred1, normalize=True) * float(100)
re2=recall_score(y_test1, pred1)*float(100)
f12=f1_score(y_test1, pred1)* float(100)
print("accuracy is",acc2)
print("precision score is ",pre2)
print("recall score is ",re2)
print("f1 score is ",f12)

accuracy is 88.95659510640293
precision score is  87.91485082132083
recall score is  91.98526832690284
f1 score is  89.90401097017484


In [118]:
confusion_matrix(y_test1,pred1)

array([[4244,  721],
       [ 457, 5245]], dtype=int64)

## Observation:
- the optimal alpha value is 0.1 on cv data with accuracy of 89%.
- the metrics for test data are as follows
  - accuracy is 88.9%
  - precison score is 87.91%
  - recall score is 91.98%
  - f1 score is 89.90%
- the confusion matrix values are :
  - TN: 4244
  - FN: 721
  - FP: 457
  - TP: 5245

# CONCLUSION: 

- Here we find that the feature with highest importance is giving us 53% accuracy
- we also find that in both the bag of words and tfidf naive bayes models the accuracy on test data is 85% and 88%
 respectively.
- as we know from the confusion matrix the true positive and true negative values must be high.
- the same high TN and TP we observe in the models outcomes.
- the values from the bag of words NB are:
  - accuracy is 85.94%
  - precision score is 86.27%
  - recall score is 85.73%
  - f1 score is 85.94%
- confusion matrix values:
  - TN is 4120
  - FN is 979
  - FP is 520
  - TP is 5048
- the values from tfidf are:
  - accuracy is 88.9%
  - precison score is 87.91%
  - recall score is 91.98%
  - f1 score is 89.90%
- the confusion matrix values are :
  - TN: 4244
  - FN: 721
  - FP: 457
  - TP: 5245
- these values are taken from the test data analysis.
- I took 40000 data points and I could clearly see that it works better than KNN algorithm.
- I can conclude that Naive Bayes is better than KNN in case text data classification.(As we use reviews here)