# Sentiment-Analysis-Amazon-fine-foods

    Loading the data

In [1]:
import pandas as pd
df = pd.read_csv('data/Reviews.csv')
df.shape

(568454, 10)

In [2]:
df.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

## Dropping duplicates

In [3]:
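# Compare rows on every column except Id and ProductId; a list keeps the
# subset ordered, which newer pandas versions require
imp_cols = [c for c in df.columns if c not in ('Id', 'ProductId')]
df = df.drop_duplicates(subset=imp_cols)
print('Dimension after eliminating duplicates', df.shape)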

Dimension after eliminating duplicates (396309, 10)


***
    1. Dropping 3-star reviews
    2. Sorting by timestamp
    3. Extracting the review text and summary and concatenating them
    4. Labelling scores below 3 as negative (0) and above 3 as positive (1)

In [4]:
df = df[df.Score != 3]
df = df.sort_values(by='Time')
# Concatenate review text and summary into one string per review
X = [str(text)+' '+str(summary) for text, summary in zip(df.Text, df.Summary)]
# Label: 1 for positive (score > 3), 0 for negative (score < 3)
Y = [1 if score > 3 else 0 for score in df.Score]
del df
len(X), len(Y)

(366402, 366402)
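
As a small illustration of the four steps above, the sketch below runs the same filtering, sorting, concatenation, and labelling on a made-up five-row frame. The column names match `Reviews.csv`; every value is invented for the example.

```python
import pandas as pd

# Toy frame with the same column names as Reviews.csv (values are made up).
toy = pd.DataFrame({
    'Score':   [5, 3, 1, 4, 2],
    'Time':    [30, 10, 20, 50, 40],
    'Summary': ['Great', 'Okay', 'Awful', 'Nice', 'Bad'],
    'Text':    ['Loved it', 'It is fine', 'Never again', 'Tasty', 'Stale'],
})

# Drop 3-star rows, then order the remaining reviews from oldest to newest.
toy = toy[toy.Score != 3].sort_values(by='Time')

# One string per review: text followed by summary.
texts = (toy.Text.astype(str) + ' ' + toy.Summary.astype(str)).tolist()

# 1 for scores above 3, 0 for scores below 3.
labels = (toy.Score > 3).astype(int).tolist()

print(texts)
print(labels)
```

The 3-star row disappears, and the labels follow the time order of the remaining rows rather than their original order.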

## Cleaning the data
* Removing HTML tags
* Converting to lower case
* Tokenizing and removing stop words and punctuation marks
* Removing non-alphanumeric tokens

In [5]:
import re
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize
X = [re.sub('<[^>]*>', '',i.lower()) for i in X] # lower-cases the text and removes HTML tags
# Including punctuation marks and other characters in the stop words;
# a set keeps the membership test in the loop below fast
stop_word = set(stopwords.words('english')+\
['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
for j in range(len(X)):
    X[j] = [i for i in wordpunct_tokenize(X[j]) if (i not in stop_word) and (i.isalnum())]
print(X[5])

['one', 'movie', 'movie', 'collection', 'filled', 'comedy', 'action', 'whatever', 'else', 'want', 'call', 'great']
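
The cleaning steps can be tried on a single string without downloading any NLTK data. The sketch below is a standard-library approximation, not the notebook's code: `clean` and `TOY_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and the regular expression `\w+|[^\w\s]+` is assumed to mirror what `wordpunct_tokenize` does.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook uses
# NLTK's full English list plus punctuation marks.
TOY_STOP_WORDS = {'this', 'is', 'a', 'the', 'and', 'it'}

def clean(text, stop_words=TOY_STOP_WORDS):
    # Lower-case the text and strip anything that looks like an HTML tag.
    text = re.sub(r'<[^>]*>', '', text.lower())
    # Split into runs of word characters and runs of punctuation.
    tokens = re.findall(r'\w+|[^\w\s]+', text)
    # Keep alphanumeric tokens that are not stop words.
    return [t for t in tokens if t not in stop_words and t.isalnum()]

print(clean('This is a <b>GREAT</b> snack, and it arrived fast!'))
```

One thing to watch for: replacing a tag with the empty string joins the words on either side of it (for example around `<br />`). Substituting a space would avoid that, but would likely change the vocabulary sizes reported in this notebook.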


* Stemming to normalize the words

In [6]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()
for j in range(len(X)):
    X[j] = [ps.stem(i) for i in X[j]]

***

    Storing the cleaned data to avoid rerunning the above operations in the future

In [7]:
import pickle
with open('clean-data-XY.pkl','wb') as fp:
    pickle.dump((X,Y),fp) # the with block closes the file on exit
del X, Y

    Loading the cleaned data

In [2]:
import pickle
with open('clean-data-XY.pkl','rb') as fp:
    X,Y = pickle.load(fp) # the with block closes the file on exit
len(X), len(Y)

(366402, 366402)

In [3]:
X = [' '.join(i) for i in X]
X[0]

'witti littl book make son laugh loud recit car drive along alway sing refrain learn whale india droop rose love new word book introduc silli classic book will bet son still abl recit memori colleg everi book educ'

# Baseline models:
* Throughout the models, precision and recall are my metrics
* Feature used: Bag of words
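
Since precision and recall are reported per class throughout, it may help to see how they are computed. The sketch below uses a hypothetical helper, `precision_recall`, on made-up labels; scikit-learn's `precision_recall_fscore_support` returns the same quantities as four arrays (precision, recall, F-score, support), each ordered by class label, so the first entry is the negative class (0) and the second is the positive class (1).

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: four positive reviews and two negative ones.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]

for label in (0, 1):
    print(label, precision_recall(y_true, y_pred, label))
```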

### Count based Bag-of-words

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
model = CountVectorizer(min_df=3,binary=False) # min_df=3 drops terms that appear in fewer than 3 documents
l = int(0.7*len(X))
model.fit(X[:l])
BOW_tr = model.transform(X[:l])
BOW_ts = model.transform(X[l:])
BOW_tr.shape, BOW_ts.shape

((256481, 27224), (109921, 27224))
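
Because the reviews were sorted by `Time` earlier, `X[:l]` holds the earliest 70% of reviews and `X[l:]` the latest 30%, and the vectorizer is fitted on the earlier part only. A minimal sketch of that split, with `chronological_split` as a hypothetical helper name and a made-up list standing in for `X`:

```python
def chronological_split(items, train_fraction=0.7):
    """Split a time-ordered list into an earlier and a later part."""
    cut = int(train_fraction * len(items))
    return items[:cut], items[cut:]

reviews = [f'review {i}' for i in range(10)]  # stand-in for the time-sorted X
train, test = chronological_split(reviews)
print(len(train), len(test))   # 7 3

# The same arithmetic on the real corpus size gives the training row count.
print(int(0.7 * 366402))       # 256481
```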

### Occurrence based Bag-of-words

In [5]:
model = CountVectorizer(min_df=3,binary=True)
model.fit(X[:l])
BBOW_tr = model.transform(X[:l])
BBOW_ts = model.transform(X[l:])
BBOW_tr.shape, BBOW_ts.shape

((256481, 27224), (109921, 27224))
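
The two matrices have the same shape; they differ only in the values they store. The sketch below shows the difference on two made-up documents (`min_df` is left at its default here because the toy corpus is too small for `min_df=3`).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['good good food', 'bad food']  # made-up documents

counts = CountVectorizer(binary=False).fit(docs)
binary = CountVectorizer(binary=True).fit(docs)

# Column order of the matrices, recovered from the fitted vocabulary.
vocab = sorted(counts.vocabulary_, key=counts.vocabulary_.get)
print(vocab)

count_matrix = counts.transform(docs).toarray()
binary_matrix = binary.transform(docs).toarray()
print(count_matrix)   # 'good' appears twice in the first document
print(binary_matrix)  # the same cell is clipped to 1
```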

## 1. Multinomial Naive Bayes
    Doing Grid-search on hyperparameter alpha

In [11]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(MultinomialNB(),{'alpha':[0.5,0.25,0.125,1,2,3,4]})
clf.fit(BOW_tr,Y[:l])
hX = clf.predict(BOW_ts)

In [14]:
from sklearn.metrics import precision_recall_fscore_support, roc_curve
precision_recall_fscore_support(Y[l:],hX)

(array([ 0.7340398 ,  0.94389242]),
 array([ 0.73643813,  0.94323636]),
 array([ 0.73523701,  0.94356427]),
 array([19282, 90639]))

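The positive class has high precision and recall (about 0.94 each), but the negative class is noticeably lower (about 0.73 to 0.74). The test set is imbalanced: 19,282 negative reviews against 90,639 positive ones.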
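
The support counts in the output above give some context for these numbers. As a rough reference point (an illustration, not something the notebook computes), the sketch below works out what a classifier that always predicts "positive" would score on this test set.

```python
support_neg, support_pos = 19282, 90639  # test-set support reported above

total = support_neg + support_pos
pos_share = support_pos / total
print(f'positive share of test set: {pos_share:.3f}')

# An always-positive classifier labels every review as positive, so:
always_pos_precision = support_pos / total   # equals the positive share
always_pos_recall = support_pos / support_pos  # finds every positive review
always_neg_recall = 0 / support_neg          # finds no negative review
print(always_pos_precision, always_pos_recall, always_neg_recall)
```

Positive-class scores therefore start from a high floor, and the negative-class scores are the more informative ones when comparing models.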

## 2. Bernoulli Naive Bayes
    Doing grid search on hyperparameter alpha

In [16]:
from sklearn.naive_bayes import BernoulliNB
clf = GridSearchCV(BernoulliNB(),{'alpha':[0.5,0.25,0.125,1,2,3,4]})
clf.fit(BBOW_tr,Y[:l])
hX = clf.predict(BBOW_ts)

In [17]:
precision_recall_fscore_support(Y[l:],hX)

(array([ 0.69271864,  0.9411627 ]),
 array([ 0.72627321,  0.93146438]),
 array([ 0.70909919,  0.93628843]),
 array([19282, 90639]))

These metrics are lower than those of MultinomialNB, which was expected

## 3. Logistic Regression

    L1 regularizer with gridsearch on hyperparameter C

In [19]:
from sklearn.linear_model import LogisticRegression
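# liblinear supports the L1 penalty; the default lbfgs solver in newer scikit-learn versions does not
clf = GridSearchCV(LogisticRegression(penalty='l1',solver='liblinear'),{'C':[0.25,0.5,0.75,1,2,3,4]})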
clf.fit(BBOW_tr,Y[:l])
hX = clf.predict(BBOW_ts)

In [20]:
precision_recall_fscore_support(Y[l:],hX)

(array([ 0.85021199,  0.9438503 ]),
 array([ 0.72798465,  0.97271594]),
 array([ 0.78436522,  0.95806574]),
 array([19282, 90639]))

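Compared with Naive Bayes, negative-class precision improves markedly (about 0.85 versus 0.69 to 0.73) and positive-class recall rises to about 0.97, while negative-class recall stays at about 0.73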

    L2 regularizer with gridsearch on hyperparameter C

In [21]:
clf = GridSearchCV(LogisticRegression(n_jobs=-1,penalty='l2'),{'C':[0.25,0.5,0.75,1,2,3,4]})
clf.fit(BBOW_tr,Y[:l])
hX = clf.predict(BBOW_ts)

In [22]:
precision_recall_fscore_support(Y[l:],hX)

(array([ 0.85201603,  0.94381133]),
 array([ 0.72767348,  0.97311312]),
 array([ 0.78495105,  0.95823827]),
 array([19282, 90639]))

The metrics are almost equal to those of LR with the L1 regularizer
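
The grid searches above leave `scoring` unset, so `GridSearchCV` falls back on the estimator's own `score` method, which is accuracy for these classifiers. Since precision and recall are the metrics of interest here, one option is to select the hyperparameter on a metric closer to them. The sketch below is an assumption about how that could look, not what the notebook ran: it uses random made-up count features and the `'f1_macro'` scorer, which averages the F-score of both classes equally.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Made-up count features and labels, only to make the example runnable.
rng = np.random.RandomState(0)
X_toy = rng.randint(0, 5, size=(60, 8))
y_toy = (X_toy[:, 0] > X_toy[:, 1]).astype(int)

alphas = [0.125, 0.25, 0.5, 1, 2, 3, 4]
search = GridSearchCV(
    MultinomialNB(),
    {'alpha': alphas},
    scoring='f1_macro',  # weigh the negative and positive class equally
    cv=3,
)
search.fit(X_toy, y_toy)

# The selected hyperparameter and its mean cross-validated score.
print(search.best_params_, round(search.best_score_, 3))
```

Inspecting `best_params_` also shows whether the best value sits at the edge of the grid, in which case a wider range may be worth trying.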

# Feature engineering
* tf-idf
* W2V
* tf-idf based W2V

### Tf-idf:

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
model = TfidfVectorizer(min_df=3,strip_accents='unicode')
model.fit(X[:l])
tfidf_tr = model.transform(X[:l])
tfidf_ts = model.transform(X[l:])
tfidf_tr.shape, tfidf_ts.shape

((256481, 27224), (109921, 27224))
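
The feature count matches the bag-of-words matrices; what changes is the weighting. With its default settings, `TfidfVectorizer` smooths the inverse document frequency as `ln((1 + n) / (1 + df)) + 1` and then scales each row to unit L2 norm. The sketch below checks the idf part by hand on three made-up documents.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['good food', 'bad food', 'good good snack']  # made-up documents
vec = TfidfVectorizer().fit(docs)

n_docs = len(docs)
idf_by_hand = {}
for term, col in sorted(vec.vocabulary_.items()):
    # Document frequency: number of documents that contain the term.
    df = sum(term in d.split() for d in docs)
    idf_by_hand[term] = math.log((1 + n_docs) / (1 + df)) + 1
    print(term, df, round(idf_by_hand[term], 4), round(float(vec.idf_[col]), 4))
```

Terms that occur in fewer documents ('bad', 'snack') get a larger weight than terms that occur in most of them ('food', 'good').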

## 1. Multinomial Naive Bayes
    Doing Grid-search on hyperparameter alpha

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(MultinomialNB(),{'alpha':[0.5,0.25,0.125,1,2,3,4]})
clf.fit(tfidf_tr,Y[:l])
hX = clf.predict(tfidf_ts)

In [15]:
from sklearn.metrics import precision_recall_fscore_support, roc_curve
precision_recall_fscore_support(Y[l:],hX)

(array([ 0.91717347,  0.87430957]),
 array([ 0.32849289,  0.99368925]),
 array([ 0.48373301,  0.93018476]),
 array([19282, 90639]))

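Compared with MultinomialNB on the count-based bag-of-words, tf-idf raises precision for the negative class (about 0.73 to 0.92) and lowers it for the positive class (about 0.94 to 0.87). Recall drops drastically for the negative class (about 0.74 to 0.33) and rises for the positive class (about 0.94 to 0.99).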
        
    Checking for overfitting

In [16]:
hX = clf.predict(tfidf_tr)
precision_recall_fscore_support(Y[:l],hX)

(array([ 0.92300557,  0.89850822]),
 array([ 0.36248666,  0.994671  ]),
 array([ 0.52054334,  0.94414734]),
 array([ 38429, 218052]))
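
One way to read this check is to put the train and test scores for the negative class side by side. The sketch below copies the numbers from the two outputs above and prints the gap for each metric.

```python
# Negative-class scores copied from the test and train outputs above.
test_scores = {'precision': 0.91717347, 'recall': 0.32849289, 'f1': 0.48373301}
train_scores = {'precision': 0.92300557, 'recall': 0.36248666, 'f1': 0.52054334}

gaps = {name: train_scores[name] - test_scores[name] for name in test_scores}
for name, gap in gaps.items():
    print(f'{name}: train - test = {gap:.3f}')
```

The gaps are small, and negative-class recall is low on the training data as well, which suggests the weak recall is not mainly an overfitting effect.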