# [7] Amazon Fine Food Reviews Analysis


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unqiue identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be cosnidered a positive review. A review of 1 or 2 could be considered negative. A review of 3 is nuetral and ignored. This is an approximate and proxy way of determining the polarity (positivity/negativity) of a review.




## [7.1] Loading the data

The dataset is available in two forms
1. .csv file
2. SQLite Database

In order to load the data, We have used the SQLITE dataset as it easier to query the data and visualise the data efficiently.
<br> 

Here as we only want to get the global sentiment of the recommendations (positive or negative), we will purposefully ignore all Scores equal to 3. If the score id above 3, then the recommendation wil be set to "positive". Otherwise, it will be set to "negative".

In [1]:
%matplotlib inline

import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import cross_val_score
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer



# using the SQLite Table to read data.
sql_file = sqlite3.connect('database.sqlite') 



#filtering only positive and negative reviews i.e. 
# not taking into consideration those reviews with Score=3
filtered_data = pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3
""", sql_file) 




# Give reviews with Score>3 a positive rating, and reviews with a score<3 a negative rating.
def polarity_r(x):
    if x < 3:
        return 'negative'
    return 'positive'

#changing reviews with score less than 3 to be positive and vice-versa
actualScore = filtered_data['Score']
positiveNegative = actualScore.map(polarity_r) 
filtered_data['Score'] = positiveNegative



In [2]:
filtered_data.shape #looking at the number of attributes and size of the data
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


#  Exploratory Data Analysis

## [7.1.2] Data Cleaning: Deduplication

It is observed (as shown in the table below) that the reviews data had many duplicate entries. Hence it was necessary to remove duplicates in order to get unbiased results for the analysis of the data.  Following is an example:

In [2]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND UserId="AR5J8UI46CURR"
ORDER BY ProductID
""", sql_file)
display

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,78445,B000HDL1RQ,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
1,138317,B000HDOPYC,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
2,138277,B000HDOPYM,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
3,73791,B000HDOPZG,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...
4,155049,B000PAQ75C,AR5J8UI46CURR,Geetha Krishnan,2,2,5,1199577600,LOACKER QUADRATINI VANILLA WAFERS,DELICIOUS WAFERS. I FIND THAT EUROPEAN WAFERS ...


As can be seen above the same user has multiple reviews of the with the same values for HelpfulnessNumerator, HelpfulnessDenominator, Score, Time, Summary and Text  and on doing analysis it was found that <br>
<br> 
ProductId=B000HDOPZG was Loacker Quadratini Vanilla Wafer Cookies, 8.82-Ounce Packages (Pack of 8)<br>
<br> 
ProductId=B000HDL1RQ was Loacker Quadratini Lemon Wafer Cookies, 8.82-Ounce Packages (Pack of 8) and so on<br>

It was inferred after analysis that reviews with same parameters other than ProductId belonged to the same product just having different flavour or quantity. Hence in order to reduce redundancy it was decided to eliminate the rows having same parameters.<br>

The method used for the same was that we first sort the data according to ProductId and then just keep the first similar product review and delelte the others. for eg. in the above just the review for ProductId=B000HDL1RQ remains. This method ensures that there is only one representative for each product and deduplication without sorting would lead to possibility of different representatives still existing for the same product.

In [2]:
#Sorting data according to ProductId in ascending order
sorted_data=filtered_data.sort_values('Time', axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

In [3]:
#Deduplication of entries
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"}, keep='first', inplace=False)
final.shape

(364173, 10)

In [5]:
#Checking to see how much % of data still remains
(final['Id'].size*1.0)/(filtered_data['Id'].size*1.0)*100

69.25890143662969

<b>Observation:-</b> It was also seen that in two rows given below the value of HelpfulnessNumerator is greater than HelpfulnessDenominator which is not practically possible hence these two rows too are removed from calcualtions

In [6]:
display= pd.read_sql_query("""
SELECT *
FROM Reviews
WHERE Score != 3 AND Id=44737 OR Id=64422
ORDER BY ProductID
""", sql_file)
display


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...
1,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...


In [13]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]
#Y = final['labels']

In [5]:
#Before starting the next phase of preprocessing lets see the number of entries left
print(final.shape)

#How many positive and negative reviews are present in our dataset?
final['Score'].value_counts()

(364171, 10)


positive    307061
negative     57110
Name: Score, dtype: int64

## 7.2.3  Text Preprocessing: Stemming, stop-word removal.

Now that we have finished deduplication our data requires some preprocessing before we go on further with analysis and making the prediction model.

Hence in the Preprocessing phase we do the following in the order below:-

1. Begin by removing the html tags
2. Remove any punctuations or limited set of special characters like , or . or # etc.
3. Check if the word is made up of english letters and is not alpha-numeric
4. Check to see if the length of the word is greater than 2 (as it was researched that there is no adjective in 2-letters)
5. Convert the word to lowercase
6. Remove Stopwords
7. Finally Snowball Stemming the word (it was obsereved to be better than Porter Stemming)<br>

After which we collect the words used to describe positive and negative reviews

In [5]:
# find sentences containing HTML tags
import re
i=0;
for sent in final['Text'].values:
    if (len(re.findall('<.*?>', sent))):
        print(i)
        print(sent)
        break;
    i += 1;    

        

8
What happens when you say his name three times? Michael Keaten stars in this comedy about two couples that live in an old two story house.  While coming back from a supply store, the couple suddenly get caught inside of a  &quot;broken-up&quot; bridge and then just before they start to tumble down  into the lake, a board catches them.  But just when they've got their hopes  up, and small dog steps on the board and the car starts to slide off the  bridge and into the lake waters.  A few minutes later...<p>They find  themselves back into their home, they find that somehow somehad light the  fireplace, as if done by magic.  From then on, they find a weird-looking  dead guy known as Bettlejuice.  The only way they can get him for help is  to call him by his name three times and he will appear at their survice.  But they soon wish that they have never called his name, because  Bettlejuice was once a troublemaker but he is the only one who can save  them, on the account that they said his 

In [2]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_word = set(stopwords.words('english')) #set of stopwords
sno_stem = nltk.stem.SnowballStemmer('english') #initialising the snowball stemmer

def cleanhtml(sentence): #function to clean the word of any html-tags
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
def cleanpunc(sentence): #function to clean the word of any punctuation or special characters
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    return  cleaned
print(stop_word)
print('************************************')
print(sno_stem.stem('tasty'))

{"isn't", 'its', 't', 'because', "you're", "don't", 'y', "you'll", 'mustn', 'herself', 'me', 'to', 'same', 'for', 'ourselves', 'more', 'how', 'my', 'now', 'under', 'it', 'before', 'and', 'am', 'not', 'whom', 'didn', 'in', "she's", "doesn't", "mightn't", 'do', 'during', 'myself', 'no', 'such', 'an', 'had', 'who', 'a', 'yours', 'when', 'why', 'by', 'are', 'doesn', 'what', 'again', 'further', 'below', 'ain', 're', 'ma', 'through', 'did', 'own', 'won', "should've", 'mightn', 'will', 'i', 'so', 'about', 'your', 'their', 'all', 'other', 'she', 'them', 'while', 'o', 'until', 've', 'hasn', "haven't", 'needn', 'been', 'some', 'we', 'with', 'll', "mustn't", "didn't", 'as', 'themselves', 'our', 'her', "won't", 'theirs', 'was', 'once', 'aren', 'isn', 's', 'into', 'they', 'yourself', 'can', 'those', 'ours', 'have', 'than', 'is', 'here', 'each', 'should', "shouldn't", "aren't", 'at', "weren't", 'be', 'haven', 'on', 'any', 'above', 'that', 'but', 'nor', 'few', 'from', 'wouldn', 'itself', 'then', 'wer

In [14]:
#Code for implementing step-by-step the checks mentioned in the pre-processing phase
# this code takes a while to run as it needs to run on 500k sentences.
i=0
str1=' '
final_string=[]
all_positive_words=[] # store words from +ve reviews here
all_negative_words=[] # store words from -ve reviews here.
s=''
for sent in final['Text'].values:
    filtered_sentence=[]
    #print(sent);
    sent=cleanhtml(sent) # remove HTMl tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if((cleaned_words.isalpha()) & (len(cleaned_words)>2)):    
                if(cleaned_words.lower() not in stop_word):
                    s=(sno_stem.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if (final['Score'].values)[i] == 'positive': 
                        all_positive_words.append(s) #list of all words used to describe positive reviews
                    if(final['Score'].values)[i] == 'negative':
                        all_negative_words.append(s) #list of all words used to describe negative reviews reviews
                else:
                    continue
            else:
                continue 
    str1 = b" ".join(filtered_sentence) #final string of cleaned words
    
    final_string.append(str1)
    i+=1

In [8]:
final['CleanedText']=final_string #adding a column of CleanedText which displays the data after pre-processing of the review 

In [13]:
# store final table into an SQlLite table for future.
sql_file2 = sqlite3.connect('final.sqlite')
c=sql_file2.cursor()
sql_file2.text_factory = str
final.to_sql('Reviews', sql_file2, flavor=None, schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)

## Separating data into Train ,Cross-validation,Test

In [3]:
con = sqlite3.connect('final.sqlite')
final = pd.read_sql_query('SELECT * FROM REVIEWS',con)
from sklearn.cross_validation import train_test_split
from sklearn import cross_validation
def values(x):
    if x == 'positive':
        return 1
    return 0
X = final['Text'].values
label = final['Score']
Y = label.map(values)

## Test data split ##
X_1, X_test, Y_1, Y_test = cross_validation.train_test_split(X, Y, test_size=0.3, random_state=0)
X_train, X_cv, Y_train, Y_cv = cross_validation.train_test_split(X_1, Y_1, test_size=0.3, random_state=0)


# [7.2.2] Bag of Words (BoW)

In [4]:
count_vect = CountVectorizer()
final_counts = count_vect.fit_transform(X_train)
X_final = count_vect.transform(X_test)

### Finding the best gamma and C using grid and random search

In [7]:
####################               Using gird search           ##############################
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
l1_values = [{'gamma': [10**-2,10**2]},
              {'C': [10**-2,10**2]}]
model = GridSearchCV(SVC(), l1_values, scoring = 'f1', cv=10)
model.fit(final_counts[0:20000], Y_train[0:20000])
print('best alpha: ',model.best_estimator_)

best alpha:  SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [7]:
#####################               Randomized Search           #############################
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
l1_values = {'gamma': [10**-5,10**-4,10**-3,10**4],
              'C': [10**-5,10**-4,10**-3,10**4]}
model = RandomizedSearchCV(SVC(), l1_values, scoring = 'f1', cv=10)
model.fit(final_counts[0:20000], Y_train[0:20000])
print('best alpha: ',model.best_estimator_)

best alpha:  SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


### SVM on Bow

In [None]:
###############################        SVC on test data ###########################################
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=1e-05,C=10000,kernel='rbf')
SVM.fit(final_counts[0:90000],Y_train[0:90000])
pred = SVM.predict(X_final[0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  91.82000000000001
f1_score 0.9162867260441633
precision 0.9152601700901014
recall 0.9182
confusion matrix [[1074  493]
 [ 325 8108]]


In [10]:
#################################        SGD on test data ######################################
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SGDClassifier(alpha=1e-05,n_jobs=-1)
SVM.fit(final_counts[0:90000],Y_train[0:90000])
pred = SVM.predict(X_final[0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)



accuracy  90.79
f1_score 0.9048578580997312
precision 0.9034705394612436
recall 0.9079
confusion matrix [[ 991  576]
 [ 345 8088]]


In [6]:
##########################                SVC on training data   ##############################
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=1e-05,C=10000,kernel='rbf')
SVM.fit(final_counts[0:90000],Y_train[0:90000])
pred = SVM.predict(final_counts[0:90000])
f1score = f1_score(Y_train[0:90000],pred,average='weighted')
precision = precision_score(Y_train[0:90000],pred,average='weighted')
recall = recall_score(Y_train[0:90000],pred,average='weighted')
con_matrix = confusion_matrix(Y_train[0:90000],pred)
acc = accuracy_score(Y_train[0:90000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  96.82555555555555
f1_score 0.9675228469720494
precision 0.9677987163657515
recall 0.9682555555555555
confusion matrix [[11738  2162]
 [  695 75405]]


In [9]:
###########################               SGD Classifier #####################################
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SGDClassifier(alpha=1e-05,n_jobs=-1)
SVM.fit(final_counts[0:90000],Y_train[0:90000])
pred = SVM.predict(final_counts[0:90000])
f1score = f1_score(Y_train[0:90000],pred,average='weighted')
precision = precision_score(Y_train[0:90000],pred,average='weighted')
recall = recall_score(Y_train[0:90000],pred,average='weighted')
con_matrix = confusion_matrix(Y_train[0:90000],pred)
acc = accuracy_score(Y_train[0:90000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)



accuracy  92.77
f1_score 0.9219134391851195
precision 0.925595836984165
recall 0.9277
confusion matrix [[ 8445  5455]
 [ 1052 75048]]


## [7.2.4] Bi-Grams and n-Grams.

**Motivation**

Now that we have our list of words describing positive and negative reviews lets analyse them.<br>

We begin analysis by getting the frequency distribution of the words as shown below

In [None]:
freq_dist_positive=nltk.FreqDist(all_positive_words)
freq_dist_negative=nltk.FreqDist(all_negative_words)
print("Most Common Positive Words : ",freq_dist_positive.most_common(20))
print("Most Common Negative Words : ",freq_dist_negative.most_common(20))

<b>Observation:-</b> From the above it can be seen that the most common positive and the negative words overlap for eg. 'like' could be used as 'not like' etc. <br>
So, it is a good idea to consider pairs of consequent words (bi-grams) or q sequnce of n consecutive words (n-grams)

In [11]:
#bi-gram, tri-gram and n-gram

#removing stop words like "not" should be avoided before building n-grams
count_vect_bigram = CountVectorizer(ngram_range=(1,2) ) #in scikit-learn
final_bigram_counts = count_vect_bigram.fit_transform(X_train)
X_test_grams = count_vect_bigram.transform(X_test)

### Finding best gamma and C Using grid and random

In [5]:
####################               Using gird search           ##############################
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
l1_values = [{'gamma': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]},
              {'C': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]}]
model = GridSearchCV(SVC(), l1_values, scoring = 'f1', cv=10)
model.fit(final_bigrams_counts[0:20000], Y_train[0:20000])
print(model.best_estimator_)

SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [5]:
#####################               Randomized Search           #############################
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
l1_values = {'gamma': [10**-5,10**4],
              'C': [10**-5,10**4]}
model = RandomizedSearchCV(SVC(), l1_values, scoring = 'f1', cv=10,n_iter=4)
model.fit(final_bigram_counts[0:20000], Y_train[0:20000])
print(model.best_estimator_)

SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


### Svm on bigrams

In [9]:
################      SVC on test data   #########################
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=1e-05,C=10000,kernel='rbf')
SVM.fit(final_bigram_counts[0:40000],Y_train[0:40000])
pred = SVM.predict(X_test_grams[0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  92.94
f1_score 0.9280810325575636
precision 0.9273595805953546
recall 0.9294
confusion matrix [[1146  421]
 [ 285 8148]]


In [13]:
###########################           SGD Classifier ########################333
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SGDClassifier(alpha=1e-05,n_jobs=-1)
SVM.fit(final_bigram_counts[0:90000],Y_train[0:90000])
pred = SVM.predict(X_test_grams[0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('SGD accuracy ',acc*100)
print('SGD f1_score',f1score)
print('SGD precision',precision)
print('SGD recall',recall)
print('SGD confusion matrix',con_matrix)



SGD accuracy  92.63
SGD f1_score 0.92648049902905
SGD precision 0.9266717700577701
SGD recall 0.9263
SGD confusion matrix [[1208  359]
 [ 378 8055]]


In [None]:

from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SGDClassifier(gamma=1e-05,C=10000,kernel='rbf')
SVM.fit(final_bigram_counts[0:90000],Y_train[0:90000])
pred = SVM.predict(final_bigram_counts[0:10000])
f1score = f1_score(Y_train[0:10000],pred,average='weighted')
precision = precision_score(Y_train[0:10000],pred,average='weighted')
recall = recall_score(Y_train[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_train[0:10000],pred)
acc = accuracy_score(Y_train[0:10000],pred)
print('SGD accuracy ',acc*100)
print('SGD f1_score',f1score)
print('SGD precision',precision)
print('SGD recall',recall)
print('SGD confusion matrix',con_matrix)

SGD accuracy  93.33
SGD f1_score 0.9325002985493813
SGD precision 0.9319545093825452
SGD recall 0.9333
SGD confusion matrix [[1189  378]
 [ 289 8144]]


In [15]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=1e-05,C=10000,kernel='rbf')
SVM.fit(final_bigram_counts[0:90000],Y_train[0:90000])
pred = SVM.predict(final_bigram_counts[0:90000])
f1score = f1_score(Y_train[0:90000],pred,average='weighted')
precision = precision_score(Y_train[0:90000],pred,average='weighted')
recall = recall_score(Y_train[0:90000],pred,average='weighted')
con_matrix = confusion_matrix(Y_train[0:90000],pred)
acc = accuracy_score(Y_train[0:90000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)



accuracy  97.68333333333334
f1_score 0.9765231139844274
precision 0.9765686738866491
recall 0.9768333333333333
confusion matrix [[12420  1480]
 [  605 75495]]


# [7.2.5] TF-IDF

In [4]:

tf_idf_vect = TfidfVectorizer(ngram_range=(1,2))
final_tf_idf = tf_idf_vect.fit_transform(X_train)
X_test_tfidf = tf_idf_vect.transform(X_test)

In [21]:
X_cv_tfidf = tf_idf_vect.transform(X_cv)
#X_test_tfidf = tf_idf_vect.transform(X_test)

### Finding best gamma using grid and random search

In [5]:
####################               Using gird search           ##############################
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
l1_values = [{'gamma': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]},
              {'C': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]}]
model = GridSearchCV(SVC(), l1_values, scoring = 'f1', cv=10)
model.fit(final_tf_idf[0:20000], Y_train[0:20000])
print(model.best_estimator_)

SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [5]:
#####################               Randomized Search           #############################
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
l1_values = {'gamma': [10**-5,10**4],
              'C': [10**-5,10**4]}
model = RandomizedSearchCV(SVC(), l1_values, scoring = 'f1', cv=10,n_iter=4)
model.fit(X-test_tfidf[0:20000], Y_train[0:20000])
print(model.best_estimator_)

SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


### svm on tf-idf

In [12]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=1e-05,C=10000,kernel='rbf')
SVM.fit(final_tf_idf[0:40000],Y_train[0:40000])
pred = SVM.predict(X_test_tfidf [0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  88.09
f1_score 0.8493329896702129
precision 0.8920350609012568
recall 0.8809
confusion matrix [[ 387 1180]
 [  11 8422]]


In [6]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SGDClassifier(alpha=1e-05)
SVM.fit(final_tf_idf[0:40000],Y_train[0:40000])
pred = SVM.predict(X_test_tfidf [0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)



accuracy  93.26
f1_score 0.9301494056805685
precision 0.9298999215902599
recall 0.9326
confusion matrix [[1104  463]
 [ 211 8222]]


In [9]:
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=1e-05,C=10000,kernel='rbf')
SVM.fit(final_tf_idf[0:40000],Y_train[0:40000])
pred = SVM.predict(final_tf_idf[0:10000])
f1score = f1_score(Y_train[0:10000],pred,average='weighted')
precision = precision_score(Y_train[0:10000],pred,average='weighted')
recall = recall_score(Y_train[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_train[0:10000],pred)
acc = accuracy_score(Y_train[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  89.77000000000001
f1_score 0.8759420770458186
precision 0.9081137072045787
recall 0.8977
confusion matrix [[ 543 1020]
 [   3 8434]]


In [10]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SGDClassifier(alpha=1e-05)
SVM.fit(final_tf_idf[0:40000],Y_train[0:40000])
pred = SVM.predict(final_tf_idf [0:10000])
f1score = f1_score(Y_train[0:10000],pred,average='weighted')
precision = precision_score(Y_train[0:10000],pred,average='weighted')
recall = recall_score(Y_train[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_train[0:10000],pred)
acc = accuracy_score(Y_train[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)



accuracy  99.96000000000001
f1_score 0.9995998956613259
precision 0.9995999429754733
recall 0.9996
confusion matrix [[1560    3]
 [   1 8436]]


In [14]:
feature_names_tfidf = tf_idf_vect.get_feature_names()
feature_names_tfidf = np.array(feature_names_tfidf)
feature_tfidf = np.vstack((np.absolute(SVM.coef_),feature_names_tfidf))
feature_tfidf = pd.DataFrame(data=feature_tfidf.T,columns=('coefficients','words'))
main = feature_tfidf.sort_values('coefficients', axis=0, ascending=False, inplace=False, kind='quicksort')
print('\t     top 20 features')
print(main[0:20])

	     top 20 features
                   coefficients            words
894600    9.998718059239454e-05      great place
1059332   9.490954053588738e-05          is make
2217748    9.44062235782976e-06        want good
2301035    9.06061843651392e-05      world would
2030716   8.976364449882943e-05   the shortbread
2125323   8.864421555149713e-05        treat not
773219    8.587941012333129e-05      flavor made
145939    8.448238761167035e-05  appetizing this
1710075   8.250099246826404e-05         roast so
2245506   8.073773304429928e-05         well had
1143908   8.018239957906577e-05       lemon with
589024    7.893976273912822e-05          didn it
1951614   7.422312551686386e-06   susceptible to
1951611   7.368336134948334e-06      susceptible
892289       7.2328131004027885            great
421296    6.105003301536146e-05       chicken so
1588531  6.0534245404522055e-05   private seller
569427       5.6750416791155835        delicious
2176486   5.188357873445765e-05         use goo

In [None]:
# source: https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

top_tfidf = top_tfidf_feats(final_tf_idf[1,:].toarray()[0],features,25)

In [None]:
top_tfidf


# [7.2.6] Word2Vec

In [None]:
# Using Google News Word2Vectors
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

# in this project we are using a pretrained model by google
# its 3.3G file, once you load this into your memory 
# it occupies ~9Gb, so please do this step only if you have >12G of ram
# we will provide a pickle file wich contains a dict , 
# and it contains all our courpus words as keys and  model[word] as values
# To use this code-snippet, download "GoogleNews-vectors-negative300.bin" 
# from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
# it's 1.9GB in size.


model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)



In [15]:
import gensim
i=0
list_of_sent=[]
for sent in X_train:
    filtered_sentence=[]
    sent=cleanhtml(sent)
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if(cleaned_words.isalpha()):    
                filtered_sentence.append(cleaned_words.lower())
            else:
                continue 
    list_of_sent.append(filtered_sentence)



In [17]:
i =0
list_of_xtest =[]
for sent in X_test:
    filtered_Xtest = []
    sent = cleanhtml(sent)
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if(cleaned_words.isalpha()):
                filtered_Xtest.append(cleaned_words.lower())
            else:
                continue
    list_of_xtest.append(filtered_Xtest)

In [18]:
w2v_model=gensim.models.Word2Vec(list_of_sent,min_count=5,size=50, workers=4)    


# [7.2.7] Avg W2V, TFIDF-W2V

In [19]:

sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in list_of_sent: # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
        except:
            pass
    sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

  del sys.path[0]


254919
50


In [20]:
sent_vectors_xtest = []; # the avg-w2v for each sentence/review is stored in this list
for sent in list_of_xtest: # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
        except:
            pass
    sent_vec /= cnt_words
    sent_vectors_xtest.append(sent_vec)

### Finding best gamma and C using grid and random search

In [20]:
####################               Using gird search           ##############################
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
l1_values = [{'gamma': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]},
              {'C': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]}]
model = GridSearchCV(SVC(), l1_values, scoring = 'f1', cv=10)
model.fit(final_counts[0:20000], Y_train[0:20000])
print(model.best_estimator_)

best alpha:  SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [20]:
#####################               Randomized Search           #############################
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
l1_values = {'gamma': [10**-5,10**-4,10**-3,10**4],
              'C': [10**-5,10**-4,10**-3,10**4]}
model = RandomizedSearchCV(SVC(), l1_values, scoring = 'f1', cv=10)
model.fit(sent_vectors[0:20000], Y_train[0:20000])
print('best alpha: ',model.best_estimator_)

best alpha:  SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [27]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV
l1_values = [{'alpha': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]}]
model = GridSearchCV(SGDClassifier(), l1_values, scoring = 'f1', cv=10)
model.fit(sent_vectors[0:90000], Y_train[0:90000])
print(model.best_estimator_)

SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)


### Svm on avgword2vec

In [22]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=0.001,C=10000,kernel='rbf')
SVM.fit(sent_vectors[0:90000],Y_train[0:90000])
pred = SVM.predict(sent_vectors_xtest[0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  89.03
f1_score 0.8803801272497364
precision 0.8805537398962955
recall 0.8903
confusion matrix [[ 744  823]
 [ 274 8159]]


In [28]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SGDClassifier(alpha=0.001)
SVM.fit(sent_vectors[0:90000],Y_train[0:90000])
pred = SVM.predict(sent_vectors_xtest[0:10000])
f1score = f1_score(Y_test[0:10000],pred,average='weighted')
precision = precision_score(Y_test[0:10000],pred,average='weighted')
recall = recall_score(Y_test[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:10000],pred)
acc = accuracy_score(Y_test[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  87.83
f1_score 0.8702492286820532
precision 0.8675017916265151
recall 0.8783
confusion matrix [[ 746  821]
 [ 396 8037]]


In [30]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=0.001,C=10000,kernel='rbf')
SVM.fit(sent_vectors[0:90000],Y_train[0:90000])
pred = SVM.predict(sent_vectors[0:10000])
f1score = f1_score(Y_train[0:10000],pred,average='weighted')
precision = precision_score(Y_train[0:10000],pred,average='weighted')
recall = recall_score(Y_train[0:10000],pred,average='weighted')
con_matrix = confusion_matrix(Y_train[0:10000],pred)
acc = accuracy_score(Y_train[0:10000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  90.52
f1_score 0.8978010724759111
precision 0.8986812558200663
recall 0.9052
confusion matrix [[ 846  717]
 [ 231 8206]]


In [29]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SGDClassifier(alpha=0.001)
SVM.fit(sent_vectors[0:90000],Y_train[0:90000])
pred = SVM.predict(sent_vectors[0:90000])
f1score = f1_score(Y_train[0:90000],pred,average='weighted')
precision = precision_score(Y_train[0:90000],pred,average='weighted')
recall = recall_score(Y_train[0:90000],pred,average='weighted')
con_matrix = confusion_matrix(Y_train[0:90000],pred)
acc = accuracy_score(Y_train[0:90000],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  88.72777777777779
f1_score 0.879144584203561
precision 0.8773495422075803
recall 0.8872777777777778
confusion matrix [[ 6806  7094]
 [ 3051 73049]]


## KNN on Tf-IDF Word2vec

In [23]:
# TF-IDF weighted Word2Vec

tfidf_feat = tf_idf_vect.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in list_of_sent[0:2000]: # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            # obtain the tf_idfidf of a word in a sentence/review
            tfidf = final_tf_idf[row, tfidf_feat.index(word)]
            sent_vec += (vec * tfidf)
            weight_sum += tfidf
        except:
            pass
    sent_vec /= weight_sum
    tfidf_sent_vectors.append(sent_vec)
    row += 1
    

    

In [24]:
tfidf_sent_vectors[0]

array([ 0.37549071,  0.89660279,  1.85424839,  0.29227973, -0.32008218,
       -0.50843662, -1.29595636,  1.00737415,  1.41863542,  0.03797864,
        0.94282051,  1.35295428, -1.01863877, -0.1711508 , -1.1014169 ,
        0.22838491,  1.1461651 ,  0.29160659,  1.51856809,  0.96807693,
        1.37684546, -0.01373881, -2.57144704, -0.90532853, -0.25323081,
       -0.23164053,  1.56751758, -0.27761396,  0.63327613, -2.12590607,
       -0.59075763,  0.43809102,  0.66852525, -1.34568059, -0.99967248,
       -0.01928877, -0.94167368, -0.45722977, -0.57009571,  0.36639322,
       -1.17853452,  0.29265655,  1.94985111, -0.19006218,  0.15773333,
       -1.55608967,  0.33606658, -0.41338954,  1.10937412, -0.13977818])

In [25]:
test_tfidf = tf_idf_vect.fit_transform(X_test)

In [26]:
tfidf_feat = tf_idf_vect.get_feature_names() # tfidf words/col-names
# final_tf_idf is the sparse matrix with row= sentence, col=word and cell_val = tfidf

tfidf_sent_vectors_xtest = []; # the tfidf-w2v for each sentence/review is stored in this list
row=0;
for sent in list_of_Xcv[0:100]: # for each review/sentence
    sent_vec = np.zeros(50) # as word vectors are of zero length
    weight_sum =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            # obtain the tf_idfidf of a word in a sentence/review
            tfidf = test_tfidf[row, tfidf_feat.index(word)]
            sent_vec += (vec * tfidf)
            weight_sum += tfidf
        except:
            pass
    sent_vec /= weight_sum
    tfidf_sent_vectors_xtest.append(sent_vec)
    row += 1
    

### Finding best gamma and c using grid and random search

In [28]:
####################               Using gird search           ##############################
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
l1_values = [{'gamma': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]},
              {'C': [10**-5,10**-4,10**-3,10**-2, 10**0, 10**2,10**-3,10**4]}]
model = GridSearchCV(SVC(), l1_values, scoring = 'f1', cv=10)
model.fit(tfidf_sent_vectors[0:2000], Y_train[0:2000])
print(model.best_estimator_)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [29]:
#####################               Randomized Search           #############################
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
l1_values = {'gamma': [10**-5,10**-4,10**-3,10**4],
              'C': [10**-5,10**-4,10**2,10**-3,10**4]}
model = RandomizedSearchCV(SVC(), l1_values, scoring = 'f1', cv=10)
model.fit(tfidf_sent_vectors[0:2000], Y_train[0:2000])
print('best alpha: ',model.best_estimator_)

best alpha:  SVC(C=10000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


### Svm on tf_idf word2vec

In [30]:
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
SVM = SVC(gamma=1e-05,C=1,kernel='rbf')
SVM.fit(tfidf_sent_vectors[0:2000],Y_train[0:2000])
pred = SVM.predict(tfidf_sent_vectors_xtest[0:100])
f1score = f1_score(Y_test[0:100],pred,average='weighted')
precision = precision_score(Y_test[0:100],pred,average='weighted')
recall = recall_score(Y_test[0:100],pred,average='weighted')
con_matrix = confusion_matrix(Y_test[0:100],pred)
acc = accuracy_score(Y_test[0:100],pred)
print('accuracy ',acc*100)
print('f1_score',f1score)
print('precision',precision)
print('recall',recall)
print('confusion matrix',con_matrix)

accuracy  85.0
f1_score 0.781081081081081
precision 0.7225
recall 0.85
confusion matrix [[ 0 15]
 [ 0 85]]


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


## Performance table

|      Algorithms                       | Bow       | Bigrams  | tf_idf  | avgword2vec   | tfidf_word2vec  |
|---------------------------------------|-----------|----------|---------|---------------|-----------------|
|        gamma                          |  1e-05    |  1e-05   |1e-05    |    0.001      |    1e-05        |
|---------------------------------------|-----------|----------|---------|---------------|-----------------|
|           c                           |    10000  |  10000   |10000    |     1000      |     10000       |
|---------------------------------------|-----------|----------|---------|---------------|-----------------|
|       precision                       |  0.915    |   0.927  |  0.89   |    0.881      |      0.722      |
|---------------------------------------|-----------|----------|---------|---------------|-----------------|
|       recall                          |  0.918    |   0.924  |  0.886  |    0.890      |      0.85       |
|---------------------------------------|-----------|----------|---------|---------------|-----------------|
|       f1_score                        |  0.0.916  |   0.928  |  0.849  |    0.884      |      0.781      |
|---------------------------------------|-----------|----------|---------|---------------|-----------------|
|       accuracy                        |   91.8    |   92.94  |  88.09  |    89.03      |       85.0      |
|---------------------------------------|-----------|----------|---------|---------------|-----------------|


## Confusion matrix

### Bow

|107 |493 |
|----|----|
|325 |8108|

### Bi-grams

|1146|421 |
|----|----|
|285 |8148|

### tf-idf

|387 |1180|
|----|----|
|11  |8472|

### Avg word2vec

|744 |823 |
|----|----|
|274 |8159|

### Avg word2vec

|0   |15  |
|----|----|
|0   |85  |