# GBDT and RF on Amazon dataset

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review

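#### Objective:
Apply GridSearchCV to find the number of base learners for Random Forest (RF), and the number of base learners, the depth of each tree and the learning rate for Gradient Boosted Decision Trees (GBDT).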

In [3]:
import sqlite3
import pandas as pd
import numpy as np
import nltk
import string
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from nltk.stem.porter import PorterStemmer
from sklearn import cross_validation
from sklearn.model_selection import train_test_split

# Loading the data


The dataset is available in two forms:

1. .csv file
2. SQLite database

To load the data, we used the SQLite database, as it is easier to query and visualise the data efficiently.
Since we only want the overall sentiment of each review (positive or negative), we purposefully ignore all reviews with a Score equal to 3. If the Score is above 3, the review is labelled "positive". Otherwise, it is labelled "negative".

We also sort the data by time, so that it can be sliced on a time basis.
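The loading step described above could look roughly like the sketch below. It builds a tiny in-memory SQLite table so that it runs on its own; the table name `Reviews`, the toy rows and the helper `label_from_score` are assumptions for illustration and are not taken from the original notebook.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the real database (assumption: a table named Reviews
# with Id, Score, Time and Text columns, as in the attribute list).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Reviews (Id INTEGER, Score INTEGER, Time INTEGER, Text TEXT)")
con.executemany(
    "INSERT INTO Reviews VALUES (?, ?, ?, ?)",
    [
        (1, 5, 300, "great taste"),
        (2, 3, 100, "average"),
        (3, 1, 200, "stale and bland"),
        (4, 4, 50, "would buy again"),
    ],
)

# Skip neutral reviews (Score == 3) directly in the query.
reviews = pd.read_sql_query("SELECT * FROM Reviews WHERE Score != 3", con)
con.close()

def label_from_score(score):
    """Hypothetical helper: scores above 3 are positive, the rest negative."""
    return "positive" if score > 3 else "negative"

reviews["Sentiment"] = reviews["Score"].map(label_from_score)

# Sort by timestamp so that later slicing respects time order.
reviews = reviews.sort_values("Time").reset_index(drop=True)
print(reviews[["Id", "Score", "Time", "Sentiment"]])
```

Filtering in the SQL query keeps the neutral rows out of memory altogether, which matters more on the full table than on this toy one.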

## Loading Preprocessed Data

I preprocessed the data separately for 250k reviews and stored the result in cleanedreviews.csv.

In [4]:
df=pd.read_csv('cleanedreviews.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 150524 | 0006641040 | ACITT7DI6IDDL | shari zychinski | 0 | 0 | 1 | 939340800 | EVERY book is educational | witty little book makes son laugh loud recite ... |
| 1 | 1 | 150501 | 0006641040 | AJ46FKXOVC7NR | Nicholas A Mesiano | 2 | 2 | 1 | 940809600 | This whole series is great way to spend time w... | remember seeing show aired television years ag... |
| 2 | 2 | 451856 | B00004CXX9 | AIUWLEQ1ADEG5 | Elizabeth Medina | 0 | 0 | 1 | 944092800 | Entertainingl Funny! | beetlejuice well written movie everything exce... |
| 3 | 3 | 230285 | B00004RYGX | A344SMIA5JECGM | Vincent P. Ross | 1 | 2 | 1 | 944438400 | A modern day fairy tale | twist rumplestiskin captured film starring mic... |
| 4 | 4 | 374359 | B00004CI84 | A344SMIA5JECGM | Vincent P. Ross | 1 | 2 | 1 | 944438400 | A modern day fairy tale | twist rumplestiskin captured film starring mic... |


## Splitting the data into train and test

In [6]:
# array of labels: positive reviews = 1, negative reviews = -1
y=np.array(df['Score'])

In [8]:
# split the data set into train and test
# train_test_split is imported from sklearn.model_selection above (sklearn.cross_validation is deprecated)
x_train,x_test,y_train,y_test=train_test_split(df['Text'],y,test_size=0.3,random_state=0)
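The split in the cell above shuffles the reviews at random. Since the data is described as sorted by time, a time-ordered split is one alternative worth considering. The sketch below is a minimal illustration on toy data; `time_based_split` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def time_based_split(frame, time_col="Time", test_size=0.3):
    """Hypothetical helper: oldest rows go to train, newest rows to test."""
    ordered = frame.sort_values(time_col).reset_index(drop=True)
    n_train = int(round(len(ordered) * (1 - test_size)))
    return ordered.iloc[:n_train], ordered.iloc[n_train:]

# Toy reviews with made-up timestamps and labels (assumption for illustration).
toy = pd.DataFrame({
    "Time": [5, 1, 9, 3, 7, 2, 8, 4, 6, 10],
    "Text": ["review %d" % i for i in range(10)],
    "Score": [1, -1, 1, 1, -1, 1, 1, -1, 1, 1],
})

train_part, test_part = time_based_split(toy)
print(len(train_part), len(test_part))
```

With this kind of split, every test review is newer than every training review, which mimics scoring future reviews with a model trained on past ones.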

# Word2Vec

In [10]:
# Using Google News Word2Vectors
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
import pickle

In [12]:
# Train your own Word2Vec model using your own text corpus
import gensim

In [13]:
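# Word2Vec expects tokenised sentences, so split each cleaned review into words first
list_of_sent=[str(review).split() for review in x_train]
w2v_model=gensim.models.Word2Vec(list_of_sent,min_count=5,size=50, workers=4)  # gensim>=4 renames size to vector_size
type(w2v_model)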

gensim.models.word2vec.Word2Vec

## AVGW2V

In [14]:
# average Word2Vec
# compute the average word2vec vector for each review
sent_vectors = []  # the avg-w2v for each sentence/review is stored in this list
for sent in list_of_sent:  # for each review/sentence
    sent_vec = np.zeros(50)  # start from a zero vector with the same length as the word vectors (50)
    cnt_words = 0  # num of words with a valid vector in the sentence/review
    for word in sent:  # for each word in a review/sentence
        try:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
        except KeyError:
            # word is not in the Word2Vec vocabulary (e.g. rarer than min_count)
            pass
    if cnt_words != 0:  # avoid dividing by zero when no word has a vector
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))

7276
50
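To make the averaging step concrete, here is a small self-contained sketch that uses a plain dictionary of made-up 3-dimensional vectors in place of the trained model. The names `toy_vectors` and `average_vector` are illustrative assumptions.

```python
import numpy as np

# Made-up word vectors standing in for w2v_model.wv (assumption for illustration).
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "tea": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Unknown words are skipped, so only "good" and "tea" contribute here.
print(average_vector(["good", "tea", "unseenword"], toy_vectors, dim=3))
# A review with no known words falls back to the zero vector.
print(average_vector(["unseenword"], toy_vectors, dim=3))
```

Returning a zero vector for reviews without any known word avoids the division by zero that a plain sum-and-divide would hit.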


In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

In [17]:
from sklearn.ensemble import RandomForestClassifier

In [18]:
#parameters for GridSearch
tuned_param=[{'n_estimators':[10,15,20,25,30,35,40]}]
              

## Random Forests

In [27]:
rf=RandomForestClassifier()
rf_model=GridSearchCV(rf,tuned_param,scoring='f1',cv=5)
rf_model.fit(sent_vectors,y_train)  # fit on the avg-w2v features, not on the raw review text
rf_model.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=35, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [28]:
# optimal no. of base learners
rf_model.best_params_

{'n_estimators': 35}

In [31]:
# best mean cross-validated F1 score in % (scoring='f1', so this is not accuracy)
rf_model.best_score_*100

94.05761115965035
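The notebook reports only the cross-validated score. As a rough sketch of how the tuned model could also be checked on the held-out split, the example below runs the same kind of search on synthetic data from `make_classification`. The data, the smaller grid and the variable names are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the review features (assumption for illustration).
X, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_demo, test_size=0.3, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 20, 30]},
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

# best_score_ is the mean cross-validated F1 of the best setting.
print("best params:", search.best_params_)
print("mean CV F1:", search.best_score_)

# GridSearchCV refits the best setting on all training data by default,
# so predict() can be used directly on the held-out split.
test_f1 = f1_score(y_te, search.predict(X_te))
print("test F1:", test_f1)
```

Comparing the cross-validated score with the held-out score gives a quick sense of whether the search has overfitted to the training folds.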

## Gradient Boosting Decision Trees

In [19]:
tuned_param2=[{'n_estimators':[10,15,20,25,30,35,40],'max_depth':[10**1,10**2,10**3],'learning_rate':[0.1,0.01,0.001]}]

In [20]:
from sklearn.ensemble import GradientBoostingClassifier

In [41]:
gbdt=GradientBoostingClassifier()
gbdt_model=GridSearchCV(gbdt,tuned_param2,scoring='f1',cv=5)
gbdt_model.fit(sent_vectors,y_train)  # fit on the avg-w2v features, not on the raw review text
gbdt_model.best_estimator_

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.01, loss='deviance', max_depth=10,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=10,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [42]:
# optimal learning rate, depth, no. of base learners
gbdt_model.best_params_

{'learning_rate': 0.01, 'max_depth': 10, 'n_estimators': 10}

In [43]:
# best mean cross-validated F1 score in % (scoring='f1', so this is not accuracy)
gbdt_model.best_score_*100

94.00624049033199

# BagofWords

In [21]:
count_vect=CountVectorizer()
vect=count_vect.fit_transform(df['Text'].values)  # sub_data is not defined in this notebook; y was built from df

## Splitting the data into train and test

In [22]:
x2_train,x2_test,y2_train,y2_test=train_test_split(vect,y,test_size=0.3,random_state=0)
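The cells above build the vocabulary from all reviews and split afterwards. A common alternative, shown here only as a hedged sketch on a toy corpus, is to split the raw text first and learn the vocabulary from the training part alone. The corpus, labels and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus and labels (assumptions for illustration).
texts = ["good tea", "bad coffee", "good coffee", "bad tea", "great tea", "awful coffee"]
labels = [1, -1, 1, -1, 1, -1]

txt_tr, txt_te, lab_tr, lab_te = train_test_split(
    texts, labels, test_size=0.33, random_state=0
)

# Learn the vocabulary from the training reviews only, then reuse it for test.
bow = CountVectorizer()
bow_tr = bow.fit_transform(txt_tr)
bow_te = bow.transform(txt_te)

# Both matrices are sparse and share the same columns (one per vocabulary word).
print(bow_tr.shape, bow_te.shape)
```

Words that appear only in the test reviews are simply ignored by `transform`, so the test set cannot influence the features the model is trained on.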

## Random Forests

In [23]:
rf2=RandomForestClassifier()
rf2_model=GridSearchCV(rf2,tuned_param,scoring='f1',cv=5)
rf2_model.fit(x2_train,y2_train)
rf2_model.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [24]:
#optimal no.of base learners
rf2_model.best_params_

{'n_estimators': 20}

In [26]:
# best mean cross-validated F1 score in % (scoring='f1', so this is not accuracy)
rf2_model.best_score_*100

94.27150049086715

In [27]:
gbdt2=GradientBoostingClassifier()
gbdt2_model=GridSearchCV(gbdt2,tuned_param2,scoring='f1',cv=5)
gbdt2_model.fit(x2_train,y2_train)
gbdt2_model.best_estimator_

KeyboardInterrupt: 
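The interrupted cell above does not say why the run was stopped, but the size of the search is one plausible factor. The sketch below simply counts how many models `tuned_param2` asks `GridSearchCV` to fit with `cv=5`. It uses only the standard library and repeats the grid from the notebook; the variable names are illustrative.

```python
from itertools import product

# Same grid as tuned_param2 in the notebook.
grid = {
    "n_estimators": [10, 15, 20, 25, 30, 35, 40],
    "max_depth": [10**1, 10**2, 10**3],
    "learning_rate": [0.1, 0.01, 0.001],
}
cv_folds = 5

# Every combination of the three parameters is tried on every fold.
combinations = list(product(*grid.values()))
total_fits = len(combinations) * cv_folds

print(len(combinations), "parameter combinations")
print(total_fits, "model fits across", cv_folds, "folds")
```

That comes to 63 combinations and 315 fits before the final refit, each on a high-dimensional bag-of-words matrix, so a smaller grid or a randomized search may be more practical here.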

# Conclusion


### AVG-W2V:

##### Random forests:
1. Optimal no. of base learners = 35
2. Best mean CV F1 score (%) = 94.05761115965035

##### GBDT:
1. Optimal no. of base learners = 10
2. Optimal depth for each tree = 10
3. Optimal learning rate = 0.01
4. Best mean CV F1 score (%) = 94.00624049033199



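### BagOfWords:

##### Random forests:
1. Optimal no. of base learners = 20
2. Best mean CV F1 score (%) = 94.27150049086715

##### GBDT:
The grid search was interrupted (KeyboardInterrupt), so no result is available.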
