# SVM On Amazon Fine Food Reviews 

## [ CONTENTS ] 

1. About the dataset<br>
2. Objective<br>
3. Loading the data<br>
4. Data Preprocessing <br>
5. Function Definitions<br>
6. Bag of Words (BoW)<br>
    6.1 Bi-Grams & N-Grams<br>
7. TF-IDF<br>
8. Word2Vec<br>
9. Avg W2V & TFIDF-W2V<br>
    9.1 TF-IDF weighted W2V<br>
10. Conclusion<br>

## 1. About the dataset
1. Title: Amazon Fine Food Reviews. Link: https://www.kaggle.com/snap/amazon-fine-food-reviews
2. Relevant Information: This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. It also includes reviews from all other Amazon categories.
3. Data includes:
    * Number of reviews: 568,454<br>
    * Number of users: 256,059<br>
    * Number of products: 74,258<br>
    * Timespan: Oct 1999 - Oct 2012<br>
    * Number of Attributes/Columns in data: 10 
4. Attribute Information: 
    * Id
    * ProductId - unique identifier for the product
    * UserId - unique identifier for the user
    * ProfileName
    * HelpfulnessNumerator - number of users who found the review helpful
    * HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
    * Score - rating between 1 and 5
    * Time - timestamp for the review
    * Summary - brief summary of the review
    * Text - text of the review

## 2. Objective:
For a given Amazon review, classify it as "Positive" (rating of 4 or 5) or "Negative" (rating of 1 or 2).<br>
<br>
Here I use a Support Vector Machine (SVM) to classify reviews as 'positive' or 'negative'. To convert review text into numerical features, I use bag of words, TF-IDF, average Word2Vec, and TF-IDF weighted Word2Vec.
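The overall flow (text to numerical features, then features to an SVM) can be sketched on a handful of made-up reviews. This is only an illustration: the toy sentences, the labels, and the `C` and `gamma` values are assumptions and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Toy reviews and labels, made up for illustration (1 = positive, 0 = negative)
reviews = [
    "great taste love it",
    "awful stale terrible",
    "love this great product",
    "terrible taste awful smell",
]
labels = [1, 0, 1, 0]

# Bag of words: each review becomes a sparse vector of word counts
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(reviews)

# RBF-kernel SVM; C and gamma are arbitrary placeholder values
model = SVC(kernel="rbf", C=1.0, gamma=0.1)
model.fit(features, labels)

# New text must be transformed with the already-fitted vectorizer
new_features = vectorizer.transform(["great product", "awful smell"])
predictions = model.predict(new_features)
print(predictions)
```

The same pattern applies to the other featurizations: only the step that produces `features` changes, while the SVM step stays the same.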

In [1]:
# loading required libraries 
import numpy as np
import pandas as pd 
import matplotlib 
import sqlite3
import string
import scipy 
import nltk
import time
import seaborn as sns 
from scipy import stats
from matplotlib import pyplot as plt 

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# sklearn.grid_search and sklearn.cross_validation were removed in scikit-learn 0.20;
# sklearn.model_selection now provides the same tools
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, roc_auc_score, auc
from sklearn.metrics import accuracy_score

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn import model_selection as cross_validation  # alias for calls written against the old module name

import warnings 
warnings.filterwarnings('ignore')



In [22]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD

#Standardizing the data
def standardizer(data):
    stnd_scaler = StandardScaler(with_mean=False)
    stnd_matx = stnd_scaler.fit_transform(data)
    return stnd_matx

#Applying dimensionality reduction 
def truncated_svd(data):
    svd = TruncatedSVD(n_components = 1000, random_state = 0)
    svd_val = svd.fit_transform(data)
    print(np.sum(svd.explained_variance_ratio_))
    return svd_val
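To see what the two helpers do, here is a minimal sketch on a small random sparse matrix. The matrix and the choice of 5 components are assumptions made for illustration; the notebook itself uses 1000 components on the real feature matrix.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Random sparse matrix standing in for a document-term matrix (made-up data)
toy_counts = sparse.random(50, 20, density=0.2, format="csr", random_state=0)

# with_mean=False scales each column without centering it,
# so zero entries stay zero and the matrix stays sparse
scaled = StandardScaler(with_mean=False).fit_transform(toy_counts)

# Project onto a few components; 5 is an arbitrary choice for this toy input
svd = TruncatedSVD(n_components=5, random_state=0)
reduced = svd.fit_transform(scaled)

# Fraction of the total variance kept by the retained components
retained = np.sum(svd.explained_variance_ratio_)
print(scaled.shape, reduced.shape, retained)
```

The printed fraction is the same quantity that `truncated_svd` reports, so a low value signals that much of the variance was discarded.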

## 3. Loading the data

In [3]:
#Loading the data
connect = sqlite3.connect('final_data.sqlite')

# Reading every row; reviews with rating 3 were already removed from this pre-processed table
data = pd.read_sql_query("""
SELECT *
FROM Reviews
""", connect)

The pre-processed data is loaded with sqlite3. Reviews with a score of 3 were removed in an earlier step, so the dataset contains none. Scores greater than 3 are labelled 'positive' and scores less than 3 are labelled 'negative'.
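The labelling rule described here can be sketched as follows. The small DataFrame is made up, and this is only an assumption about how the earlier pre-processing step might have looked; that step is not part of this notebook.

```python
import pandas as pd

# Made-up ratings covering every possible score
raw = pd.DataFrame({"Score": [1, 2, 3, 4, 5]})

# Drop neutral reviews, then label the rest by which side of 3 they fall on
filtered = raw[raw["Score"] != 3].copy()
filtered["Score"] = filtered["Score"].map(
    lambda s: "positive" if s > 3 else "negative"
)
print(filtered["Score"].tolist())
```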

In [4]:
print(data.shape)
data.head()

(364171, 12)


Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
0,138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,b'witti littl book make son laugh loud recit c...
1,138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",b'grew read sendak book watch realli rosi movi...
2,138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,b'fun way children learn month year learn poem...
3,138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,b'great littl book read nice rhythm well good ...
4,138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,b'book poetri month year goe month cute littl ...


The loaded data is imbalanced, and an SVM is sensitive to class imbalance as well as to a mismatch between the class distributions of the train set and the test set. It is therefore a good idea to upsample or downsample the data to balance the two classes. Here, I downsample the majority class.

In [5]:
data.Score.value_counts()

positive    307061
negative     57110
Name: Score, dtype: int64

In [6]:
from sklearn.utils import resample

df_majority = data[data.Score=='positive']
df_minority = data[data.Score=='negative']
 
# downsampling majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    
                                 n_samples=57110,  
                                 random_state=1)   
 
# combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

In [7]:
df_downsampled.shape

(114220, 12)

In [8]:
# sorting the data according to the time-stamp
sorted_data = df_downsampled.sort_values('Time', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
sorted_data.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
423,417838,451855,B00004CXX9,AJH6LUC1UT1ON,The Phantom of the Opera,0,0,positive,946857600,FANTASTIC!,Beetlejuice is an excellent and funny movie. K...,b'beetlejuic excel funni movi keaton hilari wa...
245,346116,374422,B00004CI84,A1048CYU0OV4O8,Judy L. Eans,2,2,positive,947376000,GREAT,THIS IS ONE MOVIE THAT SHOULD BE IN YOUR MOVIE...,b'one movi movi collect fill comedi action wha...
308,346041,374343,B00004CI84,A1B2IZU1JLZA6,Wes,19,23,negative,948240000,WARNING: CLAMSHELL EDITION IS EDITED TV VERSION,"I, myself always enjoyed this movie, it's very...",b'alway enjoy movi funni entertain didnt hesit...
241,1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,positive,961718400,Great Product,This was a really good idea and the final prod...,b'realli good idea final product outstand use ...
296,346102,374408,B00004CI84,A1GB1Q193DNFGR,Bruce Lee Pullen,5,5,positive,970531200,Fabulous Comedic Fanasy Directed by a Master,Beetlejuice is an awe-inspiring wonderfully am...,b'beetlejuic wonder amus comed romp explor inc...


In [9]:
def partition(x):
    if x == 'positive':
        return 1
    return 0

#Preparing the filtered data
actualScore = sorted_data['Score']
positiveNegative = actualScore.map(partition) 
sorted_data['Score'] = positiveNegative

In [10]:
sorted_data.head()

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,CleanedText
423,417838,451855,B00004CXX9,AJH6LUC1UT1ON,The Phantom of the Opera,0,0,1,946857600,FANTASTIC!,Beetlejuice is an excellent and funny movie. K...,b'beetlejuic excel funni movi keaton hilari wa...
245,346116,374422,B00004CI84,A1048CYU0OV4O8,Judy L. Eans,2,2,1,947376000,GREAT,THIS IS ONE MOVIE THAT SHOULD BE IN YOUR MOVIE...,b'one movi movi collect fill comedi action wha...
308,346041,374343,B00004CI84,A1B2IZU1JLZA6,Wes,19,23,0,948240000,WARNING: CLAMSHELL EDITION IS EDITED TV VERSION,"I, myself always enjoyed this movie, it's very...",b'alway enjoy movi funni entertain didnt hesit...
241,1146,1245,B00002Z754,A29Z5PI9BW2PU3,Robbie,7,7,1,961718400,Great Product,This was a really good idea and the final prod...,b'realli good idea final product outstand use ...
296,346102,374408,B00004CI84,A1GB1Q193DNFGR,Bruce Lee Pullen,5,5,1,970531200,Fabulous Comedic Fanasy Directed by a Master,Beetlejuice is an awe-inspiring wonderfully am...,b'beetlejuic wonder amus comed romp explor inc...


In [11]:
# Series.reshape is no longer available in pandas, so reshape the underlying NumPy array
score = sorted_data['Score'].values.reshape(-1, 1)

## 5. Function Definitions

### [A.] Data Splitting 

In [12]:
# splitting the data
def data_split(data, score):
    # train data 70% and test data 30%
    train_x, test_x, train_y, test_y = train_test_split(data, score, test_size=0.3, random_state=0)    
    return train_x, test_x, train_y, test_y
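Since the reviews are sorted by `Time` before splitting, it is worth noting that `train_test_split` shuffles rows by default. If a chronological split is wanted (an assumption on my part; the function above uses a random split), passing `shuffle=False` keeps the row order. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples whose row order stands in for timestamp order
features = np.arange(10).reshape(-1, 1)
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# shuffle=False keeps the row order: the earliest rows form the training set
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.3, shuffle=False
)
print(train_x.ravel(), test_x.ravel())
```

With this setting every test row comes later in the ordering than every training row, which mimics predicting on future reviews.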

### [B.] Support Vector Classifier 

In [13]:
# applying Support Vector Classifier 
def support_vect_classifier(data, score):
    train_x, test_x, train_y, test_y = data_split(data, score)
    cv_err = []
    train_err = []
    gamma = [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1.0,5.0,10.0,50.0,100.0,500.0,1000.0,5000.0,10000.0]
    c_val = [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1.0,5.0,10.0,50.0,100.0,500.0,1000.0,5000.0,10000.0]
    # applying 3-Fold cross validation (random_state is only valid together with shuffle=True)
    Kfold = KFold(n_splits=3, shuffle=False)
    for train_data, cv_data in Kfold.split(train_x):
        for g in gamma:
            for c in c_val:
                # degree is ignored by the RBF kernel, so it is not set
                svc_model = SVC(kernel = 'rbf', C = c, gamma = g, random_state = 36)
                svc_model.fit(train_x[train_data], train_y[train_data].ravel())
                train_err.append(1 - (svc_model.score(train_x[train_data], train_y[train_data])))
                cv_err.append(1 - (svc_model.score(train_x[cv_data], train_y[cv_data])))
    # errors are ordered fold by fold, then by gamma, then by C
    return train_err, cv_err
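The function returns flat lists, filled in the order of its loops: fold, then gamma, then C. A minimal sketch of how such a list could be reshaped into a grid and used to pick the best pair is given below. The shortened grids and the random error values are stand-ins I made up so the sketch runs quickly; they are not results.

```python
import numpy as np

# Shortened illustrative grids (the notebook uses 17 values for each)
gamma = [0.01, 0.1, 1.0]
c_val = [0.1, 1.0, 10.0, 100.0]
n_folds = 3

# Stand-in for the flat cv_err list: random made-up values, not real errors
rng = np.random.default_rng(0)
cv_err = rng.uniform(0.1, 0.5, size=n_folds * len(gamma) * len(c_val)).tolist()

# Same order as the nested loops: fold -> gamma -> C
grid = np.asarray(cv_err).reshape(n_folds, len(gamma), len(c_val))

# Average over folds, then locate the (gamma, C) cell with the lowest error
mean_err = grid.mean(axis=0)
g_idx, c_idx = np.unravel_index(mean_err.argmin(), mean_err.shape)
best_gamma, best_c = gamma[g_idx], c_val[c_idx]
print(best_gamma, best_c, mean_err[g_idx, c_idx])
```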

### [C.] Error Curve 

In [14]:
# comparing error between cv and train data
def error_comparision(cv_err, train_err):
    sns.set()
    c_val = [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1.0,5.0,10.0,50.0,100.0,500.0,1000.0,5000.0,10000.0]
    gamma = [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1.0,5.0,10.0,50.0,100.0,500.0,1000.0,5000.0,10000.0]
    # support_vect_classifier appends one error per (fold, gamma, C) in that loop order,
    # so the flat lists reshape to a (fold, gamma, C) grid
    cv_err = np.asarray(cv_err).reshape(3, len(gamma), len(c_val))
    train_err = np.asarray(train_err).reshape(3, len(gamma), len(c_val))
    # error against C, averaged over gamma so that each fold gives a single curve
    plt.figure(figsize=(9,12))
    for i in range(3):
        plt.subplot(3,1,i+1)
        plt.plot(c_val, cv_err[i].mean(axis=0),label = 'cv_error')
        plt.plot(c_val, train_err[i].mean(axis=0),label = 'train_error')
        plt.xscale('log')
        plt.xlabel('C-Values')
        plt.ylabel('Error Values')
        plt.legend()
        plt.title('CV & TRAIN-ERR for Fold '+str(i+1))
    # error against gamma, averaged over C
    plt.figure(figsize=(9,12))
    for i in range(3):
        plt.subplot(3,1,i+1)
        plt.plot(gamma, cv_err[i].mean(axis=1),label = 'cv_error')
        plt.plot(gamma, train_err[i].mean(axis=1),label = 'train_error')
        plt.xscale('log')
        plt.xlabel('Gamma-Values')
        plt.ylabel('Error Values')
        plt.legend()
        plt.title('CV & TRAIN-ERR for Fold '+str(i+1))

### [D.] Accuracy Metrics 

In [15]:
# test accuracy and ROC plot
def final_test_acc(data,score,l,best_c,best_gamma,name):
    train_x, test_x, train_y, test_y = data_split(data, score)
    test_y = test_y.ravel()
    svc_model = SVC(kernel = 'rbf', C = best_c, gamma = best_gamma, random_state = 36)
    svc_model.fit(train_x, train_y.ravel())
    pred = svc_model.predict(test_x)
    acc = accuracy_score(test_y, pred, normalize=True) * float(100)
    print("\nTest accuracy for C = '{0}' and gamma = '{1}' is '{2}'".format(best_c, best_gamma, acc))
    
    # SVC offers predict_proba only when built with probability=True,
    # so the ROC curve is drawn from decision_function scores instead
    y_score = svc_model.decision_function(test_x)
    fpr, tpr, thresholds = roc_curve(test_y, y_score)
    sns.set()
    plt.figure(figsize=(8,5))
    plt.plot([0,1],[0,1],'k--')
    plt.plot(fpr, tpr, label='SVM (RBF kernel)')
    plt.xlabel('False-Positive Rate')
    plt.ylabel('True-Positive Rate')
    plt.title('SVM ROC curve for '+name)
    plt.legend()
    plt.show()
    
    print('Area under the ROC curve is ', roc_auc_score(test_y, y_score))
    conf_matx = confusion_matrix(test_y,pred)
    print('\nConfusion Matrix :\n', conf_matx)
    # np.float was removed from NumPy; the built-in float does the same job
    norm_conf_matx = conf_matx / conf_matx.astype(float).sum(axis=1).reshape(2,1)
    print('\nNormalized Confusion Matrix :\n', norm_conf_matx)
    
    plt.figure(figsize=(8,5))
    plot = sns.heatmap(norm_conf_matx, annot=True, xticklabels=['Negative Review', 'Positive Review'], yticklabels=['Negative Review','Positive Review'])
    plot.set_yticklabels(plot.get_yticklabels(), rotation = 0, fontsize = 10)
    plt.title('Confusion Matrix Heatmap', fontsize=18)
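An `SVC` exposes `predict_proba` only when it is constructed with `probability=True`. For a ROC curve the signed scores from `decision_function` are enough, because `roc_curve` only needs a ranking of the samples. A minimal sketch on synthetic data follows; the data set and the `C` and `gamma` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data standing in for the review features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = SVC(kernel="rbf", C=1.0, gamma=0.1, random_state=36)
model.fit(train_x, train_y)

# Signed distance to the decision surface; larger means more confidently class 1
scores = model.decision_function(test_x)
fpr, tpr, thresholds = roc_curve(test_y, scores)
auc_value = roc_auc_score(test_y, scores)
print(auc_value)
```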

### [E.] Grid Search 

In [16]:
# applying grid search to find best C and gamma
def grid_search_cv(data, score):
    
    train_x, test_x, train_y, test_y = data_split(data, score)
    parameters = [{'C': [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1.0,5.0,10.0,50.0,100.0,500.0,1000.0,5000.0,10000.0],
                   'gamma': [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1.0,5.0,10.0,50.0,100.0,500.0,1000.0,5000.0,10000.0]}]
            
    model = GridSearchCV(SVC(), parameters, scoring = 'f1', cv=3, n_jobs = 6)
    model.fit(train_x, train_y.reshape(train_x.shape[0],))

    print(model.best_estimator_)
    print(model.score(test_x, test_y))
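A minimal sketch of the same grid-search idea on synthetic data, with a much smaller grid so that it runs in seconds. The data set and the grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic two-class data standing in for the review features
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Every (C, gamma) combination is evaluated with 3-fold cross validation
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=3)
search.fit(X, y)

# best_params_ holds the winning combination, best_score_ its mean CV F1
print(search.best_params_, search.best_score_)
```

The cost grows with the product of the grid sizes: the 17 x 17 grid above means 289 combinations, each fitted once per fold.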

### [F.] Random Search 

In [17]:
# applying random search to find best C and gamma
def random_search_cv(data, score, l):
    
    train_x, test_x, train_y, test_y = data_split(data, score)
    # scipy.stats.norm(10) is a normal distribution with mean 10 and standard deviation 1,
    # so the sampled C and gamma values stay close to 10
    parameters={'C': scipy.stats.norm(10), 'gamma': scipy.stats.norm(10)}
    model = RandomizedSearchCV(SVC(), parameters, scoring = 'f1', cv=3)
    model.fit(train_x, train_y.ravel())

    print(model.best_estimator_)
    print(model.score(test_x, test_y))
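Because useful `C` and `gamma` values usually span several orders of magnitude, one option is to sample them from a log-uniform distribution instead. This is an alternative I am suggesting, not what the function above does. A minimal sketch on synthetic data, where the bounds and `n_iter` are arbitrary assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic two-class data standing in for the review features
X, y = make_classification(n_samples=150, n_features=8, random_state=0)

# Log-uniform sampling gives each order of magnitude an equal chance
param_dist = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)}
search = RandomizedSearchCV(
    SVC(kernel="rbf"), param_dist, n_iter=8, scoring="f1", cv=3, random_state=0
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```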

## 6. Bag of Words (BoW)

In [23]:
# Applying Bag of Words to the cleaned text
# In sklearn, BoW is implemented by CountVectorizer
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(sorted_data['CleanedText'].values)

In [24]:
final_counts.shape

(114220, 41258)

In [25]:
# column standardization 
final_counts = standardizer(final_counts)

In [26]:
# dimensionality reduction with truncated SVD (prints the fraction of variance retained)
final_counts = truncated_svd(final_counts)

0.1666574334332846


### [A.] Support Vector Classifier

In [None]:
# applying svc and 3-fold cross validation
train_bow, cv_bow = support_vect_classifier(final_counts, score)