# Project 02: Sentiment Analysis on the Web
# 1. Experiment Objective

This dataset consists of reviews of fine foods from Amazon. The data span more than 10 years and include all ~500,000 reviews up to October 2012. Each review includes product and user information, a rating, and a plain-text review. The dataset also includes reviews from all other Amazon categories.



Because this dataset is offered publicly as a competition dataset, using it here does not go against copyright rules.



The **goal** of this analysis is to study food reviews and predict whether a review is positive or negative.

# 2. Data Collection

The dataset was downloaded from https://www.kaggle.com/snap/amazon-fine-food-reviews?select=Reviews.csv

Contents

Reviews.csv: Pulled from the corresponding SQLite table named Reviews in database.sqlite  
database.sqlite: Contains the table 'Reviews'

Data includes:

Reviews from Oct 1999 - Oct 2012  
568,454 reviews  
256,059 users  
74,258 products  
260 users with > 50 reviews  
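
The summary counts above can be checked with pandas once `Reviews.csv` is loaded. The sketch below uses a tiny hand-made frame as a stand-in for the real file, so the numbers it prints are illustrative only. The column names `UserId` and `ProductId` follow the dataset; the rows and the `threshold` value are made up.

```python
import pandas as pd

# Tiny stand-in for Reviews.csv (hypothetical rows, real column names).
toy = pd.DataFrame({
    "UserId": ["u1", "u1", "u2", "u3", "u1"],
    "ProductId": ["p1", "p2", "p1", "p3", "p3"],
})

n_reviews = len(toy)                       # number of reviews
n_users = toy["UserId"].nunique()          # distinct reviewers
n_products = toy["ProductId"].nunique()    # distinct products

# Users with more than `threshold` reviews (the dataset summary uses 50).
threshold = 2
reviews_per_user = toy["UserId"].value_counts()
n_active_users = int((reviews_per_user > threshold).sum())

print(n_reviews, n_users, n_products, n_active_users)
```

Running the same four expressions on the full `df` should reproduce the figures listed above if the file matches the Kaggle description.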

# 3. Data Preprocessing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.read_csv('./amazon_food_review/Reviews.csv')

print(df.shape)

(568454, 10)


In [3]:
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


Id: Row ID  
ProductId: ID of the product  
UserId: ID of the user  
ProfileName: Profile name of the user  
HelpfulnessNumerator: Number of users who found the review helpful  
HelpfulnessDenominator: Number of users who indicated whether they found the review helpful  
Score: Rating between 1 and 5  
Time: Timestamp of the review (Unix time, in seconds)  
Summary: Brief summary of the review  
Text: Text of the review
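
The `Time` values shown by `df.head()` are Unix timestamps in seconds, so they can be turned into calendar dates with `pd.to_datetime`. A minimal sketch using the first two values from the output above:

```python
import pandas as pd

# First two Time values from df.head()
times = pd.Series([1303862400, 1346976000])

# unit="s" tells pandas the integers are seconds since 1970-01-01 (UTC).
dates = pd.to_datetime(times, unit="s")
date_strings = dates.dt.strftime("%Y-%m-%d").tolist()
print(date_strings)
```

On the full frame the same call would be `pd.to_datetime(df['Time'], unit='s')`.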

In [9]:
df['helpful_percent'] = np.where(df['HelpfulnessDenominator'] > 0, df['HelpfulnessNumerator'] / df['HelpfulnessDenominator'], -1)
df['upvote_percent'] = pd.cut(df['helpful_percent'], bins = [-1, 0, 0.2, 0.4, 0.6, 0.8, 1.0], labels = ['empty', '0-20%', '20-40%', '40-60%', '60-80%', '80-100%'], include_lowest = True)
df.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,helpful_percent,upvote_percent
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...,1.0,80-100%
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,-1.0,empty
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...,1.0,80-100%
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...,1.0,80-100%
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...,-1.0,empty
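
One detail of the binning above is worth checking: the first bin runs from -1 to 0 inclusive, so a review with votes but a helpfulness ratio of exactly 0 receives the same `empty` label as a review with no votes at all. The sketch below shows this on four hand-picked ratios and then tries alternative bin edges. The alternative edges are an assumption for illustration, not the notebook's choice.

```python
import pandas as pd

ratios = pd.Series([-1.0, 0.0, 0.1, 1.0])  # -1.0 marks "no votes"
labels = ["empty", "0-20%", "20-40%", "40-60%", "60-80%", "80-100%"]

# Bins as in the notebook: 0.0 shares the first bin with the -1 sentinel.
original = pd.cut(ratios, bins=[-1, 0, 0.2, 0.4, 0.6, 0.8, 1.0],
                  labels=labels, include_lowest=True)

# Alternative edges (hypothetical): the sentinel gets a bin of its own,
# and 0.0 falls into the '0-20%' bin.
adjusted = pd.cut(ratios, bins=[-1.5, -0.5, 0.2, 0.4, 0.6, 0.8, 1.0],
                  labels=labels)

original_labels = original.astype(str).tolist()
adjusted_labels = adjusted.astype(str).tolist()
print(original_labels)
print(adjusted_labels)
```

Which behaviour is preferable depends on whether "no votes" and "all votes unhelpful" should be treated as the same group.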


In [4]:
# Remove score-3 reviews, which are neutral,
# and separate the remaining reviews into two classes (1 = positive, 0 = negative)

df = df[df['Score'] != 3]
print(df.shape)
df.head()

(525814, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [5]:
X = df['Text']
y_map = {1:0, 2:0, 4:1, 5:1}
y = df['Score'].map(y_map)
print(len(X))
print(len(y))

525814
525814
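
Because accuracy is used as the metric later on, it can help to look at the class balance after the mapping: the share of the larger class is the accuracy a classifier would reach by always predicting that class. A minimal sketch on made-up scores (the real proportions are not shown in this notebook):

```python
import pandas as pd

scores = pd.Series([5, 1, 4, 2, 5, 5, 4, 3])   # hypothetical scores
scores = scores[scores != 3]                    # drop neutral reviews
y_demo = scores.map({1: 0, 2: 0, 4: 1, 5: 1})   # same mapping as y_map

class_share = y_demo.value_counts(normalize=True)
majority_baseline = float(class_share.max())
print(class_share.to_dict(), majority_baseline)
```

On the real data, `y.value_counts(normalize=True)` gives the corresponding baseline.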


In [6]:
reviews = np.array(X)
sentiment = np.array(y)
dataset = pd.DataFrame({'review': reviews, 'sentiment': list(sentiment)}, columns=['review', 'sentiment'])

In [7]:
dataset.to_csv('food_data.csv',index=False, encoding='utf-8')
dataset

Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
1,Product arrived labeled as Jumbo Salted Peanut...,0
2,This is a confection that has been around a fe...,1
3,If you are looking for the secret ingredient i...,0
4,Great taffy at a great price. There was a wid...,1
...,...,...
525809,Great for sesame chicken..this is a good if no...,1
525810,I'm disappointed with the flavor. The chocolat...,0
525811,"These stars are small, so you can give 10-15 o...",1
525812,These are the BEST treats for training and rew...,1


In [8]:
df = pd.read_csv('food_data.csv', encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,I have bought several of the Vitality canned d...,1
1,Product arrived labeled as Jumbo Salted Peanut...,0
2,This is a confection that has been around a fe...,1


## Cleaning text data

In [9]:
import re
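def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons (without the '-' nose)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text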
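
To see what the cleaning step does, the function can be applied to a short string containing an HTML tag, punctuation and emoticons. The sample string is made up; the function body repeats the one defined above so the snippet runs on its own.

```python
import re

def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)                      # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',  # keep emoticons
                           text)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

cleaned = preprocessor("</a>This :) is :( a test :-)!")
print(repr(cleaned))
```

The tag and punctuation disappear, the words are lowercased, and the emoticons are moved to the end of the string.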

In [10]:
df['review'] = df['review'].apply(preprocessor)

In [11]:
df

Unnamed: 0,review,sentiment
0,i have bought several of the vitality canned d...,1
1,product arrived labeled as jumbo salted peanut...,0
2,this is a confection that has been around a fe...,1
3,if you are looking for the secret ingredient i...,0
4,great taffy at a great price there was a wide ...,1
...,...,...
525809,great for sesame chicken this is a good if not...,1
525810,i m disappointed with the flavor the chocolate...,0
525811,these stars are small so you can give 10 15 of...,1
525812,these are the best treats for training and rew...,1


## Processing documents into tokens

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [13]:
from sklearn import svm
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
from nltk import ngrams
from itertools import chain

In [14]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()


def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

# 4. Model Optimization and Serialization

In [16]:
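# df1 is kept from the original notebook but is not used by the split below.
df1 = df[:50000]
# Train on the first 25,000 reviews and test on all remaining ones.
# .iloc excludes the stop position, so row 25000 is only in the test set
# (the label-based df.loc[:25000] slice would also include it in training).
X_train = df.iloc[:25000]['review'].values
y_train = df.iloc[:25000]['sentiment'].values
X_test = df.iloc[25000:]['review'].values
y_test = df.iloc[25000:]['sentiment'].values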
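
When a split is taken by slicing, it is worth confirming that no row lands in both sets. Label-based `.loc` slices include both endpoints, whereas position-based `.iloc` slices exclude the stop position. A small sketch on a hypothetical 10-row frame:

```python
import pandas as pd

df_demo = pd.DataFrame({"review": [f"r{i}" for i in range(10)]})

# Label-based slicing: both df.loc[:5] and df.loc[5:] contain row 5.
loc_train = df_demo.loc[:5, "review"]
loc_test = df_demo.loc[5:, "review"]
overlap_loc = set(loc_train.index) & set(loc_test.index)

# Position-based slicing: the stop position is excluded, so no overlap.
iloc_train = df_demo.iloc[:5]["review"]
iloc_test = df_demo.iloc[5:]["review"]
overlap_iloc = set(iloc_train.index) & set(iloc_test.index)

print(overlap_loc, overlap_iloc)
```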

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

In [18]:
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

In [19]:
stop = stopwords.words('english')

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2']},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2']},
              ]

# loss='log' selects logistic regression; scikit-learn 1.1 and later name this loss 'log_loss'.
sgd_tfidf = Pipeline([('vect', tfidf),
                     ('clf', SGDClassifier(max_iter=1000, loss='log', random_state=1))])

gs_sgd_tfidf = GridSearchCV(sgd_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=2,
                           n_jobs=-1)

In [20]:
gs_sgd_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed: 11.1min
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed: 23.6min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=False,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        n

In [21]:
print('Best parameter set: %s ' % gs_sgd_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_sgd_tfidf.best_score_)

Best parameter set: {'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__norm': None, 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7fd0aee7cc20>, 'vect__use_idf': False} 
CV Accuracy: 0.898


In [22]:
clf = gs_sgd_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.914


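According to the results above, the best configuration uses the l2 penalty with the plain whitespace tokenizer, no stop-word removal, raw term counts (use_idf=False) and no normalization (norm=None).
The test accuracy is 0.914, which is slightly higher than the cross-validation accuracy of 0.898.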

In [24]:
'''
def text_fit(X, y, model,clf_model,coef_show=1):
    
    X_c = model.fit_transform(X)
    print('features: {}'.format(X_c.shape[1]))
    X_train, X_test, y_train, y_test = train_test_split(X_c, y, random_state=0)
    print('train records: {}'.format(X_train.shape[0]))
    print('test records: {}'.format(X_test.shape[0]))
    clf = clf_model.fit(X_train, y_train)
    acc = clf.score(X_test, y_test)
    print ('Model Accuracy: {}'.format(acc))
    
    if coef_show == 1: 
        w = model.get_feature_names()
        coef = clf.coef_.tolist()[0]
        coeff_df = pd.DataFrame({'Word' : w, 'Coefficient' : coef})
        coeff_df = coeff_df.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
        print('')
        print('Top 20 positive')
        print(coeff_df.head(20).to_string(index=False))
        print('')
        print('Top 20 negative')        
        print(coeff_df.tail(20).to_string(index=False))

tfidf = TfidfVectorizer(stop_words = 'english')
text_fit(X_small, y_small, tfidf, LogisticRegression())

features: 18566
train records: 7500
test records: 2500
Model Accuracy: 0.8868

Top 20 positive
      Word  Coefficient
     great     5.787696
      best     4.163253
      love     3.990870
      good     3.405225
 delicious     3.304234
   perfect     3.141750
      nice     3.011186
 excellent     2.855048
  favorite     2.650813
     loves     2.555413
 wonderful     2.510425
    smooth     2.428402
     happy     2.285978
    highly     2.252826
   amazing     2.058148
     snack     1.968183
    stores     1.958974
     tasty     1.942438
       use     1.935599
   pleased     1.929478

Top 20 negative
          Word  Coefficient
         stale    -2.008231
    ingredient    -2.051222
      received    -2.072664
         taste    -2.114831
        return    -2.123710
           did    -2.155057
          didn    -2.167791
          away    -2.193213
       grounds    -2.329415
         awful    -2.330517
         money    -2.425792
      terrible    -2.448788
         worst    -2

In [None]:
X = df['review']
y = df['sentiment']

X = tfidf.fit_transform(X)

clf = SGDClassifier(max_iter=1000, loss='log', random_state=1)

clf = clf.fit(X, y)

In [35]:
import os

dest = os.path.join('foodreviewclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)

In [36]:
import re
import pickle

# Use context managers so each file is flushed and closed after writing.
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)
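
A pickled linear classifier can only score inputs that have the same number of feature columns it was trained on, so the vectorizer used at prediction time has to match the one used for training. Here the classifier is fitted on the output of `tfidf.fit_transform`, while the web app's `vectorizer.py` builds a `HashingVectorizer` with 16384 features, so the two are worth reconciling before deployment. One way to keep them in sync, sketched here as an assumption rather than the notebook's method, is to train and pickle a single `Pipeline`. The toy texts are made up, `LogisticRegression` stands in for the `SGDClassifier`, and `n_features=2**14` mirrors the value in `vectorizer.py`.

```python
import pickle

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set (1 = positive, 0 = negative).
texts = ["great taffy at a great price",
         "love this delicious snack",
         "stale and awful taste",
         "terrible product waste of money"]
labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together, so their shapes always agree.
pipe = Pipeline([
    ("vect", HashingVectorizer(n_features=2**14)),
    ("clf", LogisticRegression()),
])
pipe.fit(texts, labels)

# Round-trip through pickle, as the web app would do at load time.
blob = pickle.dumps(pipe, protocol=4)
restored = pickle.loads(blob)

new_reviews = ["great snack", "awful stale taste"]
n_expected = restored.named_steps["clf"].coef_.shape[1]
predictions = restored.predict(new_reviews)
print(n_expected, predictions)
```

Because `HashingVectorizer` is stateless, the alternative of keeping a separate `vectorizer.py` also works, provided the classifier is trained on features produced by that same vectorizer.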

# 5. Website Creation

In [37]:
%%writefile foodreviewclassifier/vectorizer.py
from sklearn.feature_extraction.text import HashingVectorizer
import os
import re
import pickle
cur_dir = os.path.dirname(__file__)
with open(os.path.join(cur_dir, 'pkl_objects', 'stopwords.pkl'), 'rb') as f:
    stop = pickle.load(f)

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    # Search the original-case text: after lowercasing, ':D' and ':P' would never match.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    text = re.sub(r'[\W]+', ' ', text.lower()) \
                   + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=16384,
                         preprocessor=None,
                         tokenizer=tokenizer)

Overwriting foodreviewclassifier/vectorizer.py


In [27]:
os.chdir('foodreviewclassifier')
os.getcwd()

'/home/jovyan/projects/project02/foodreviewclassifier'

In [32]:
os.chdir('/home/jovyan/projects/project02/')

In [44]:
import sqlite3

conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()

c.execute('DROP TABLE IF EXISTS review_db')
c.execute('CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)')

conn.commit()
conn.close()

In [45]:
conn = sqlite3.connect('reviews.sqlite')
c = conn.cursor()

c.execute("SELECT * FROM review_db WHERE date BETWEEN '2019-01-01 10:10:10' AND DATETIME('now')")
results = c.fetchall()

conn.close()
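
The table is empty right after it is created, so the query above returns no rows until reviews are stored. The sketch below shows how one entry might be inserted and read back. It uses an in-memory database so it does not touch `reviews.sqlite`, and the example review text and label are made up.

```python
import sqlite3

# In-memory database for illustration; the app itself uses 'reviews.sqlite'.
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE review_db (review TEXT, sentiment INTEGER, date TEXT)")

# Parameterized INSERT: the ? placeholders keep user text out of the SQL string.
example_review = "great taffy at a great price"
c.execute(
    "INSERT INTO review_db (review, sentiment, date) "
    "VALUES (?, ?, DATETIME('now'))",
    (example_review, 1),
)
conn.commit()

# Same date filter as in the notebook query.
c.execute(
    "SELECT review, sentiment FROM review_db "
    "WHERE date BETWEEN '2019-01-01 10:10:10' AND DATETIME('now')"
)
rows = c.fetchall()
conn.close()
print(rows)
```

Storing the date as text in `YYYY-MM-DD HH:MM:SS` form is what makes the `BETWEEN` comparison work, since these strings sort in chronological order.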