# Exercise 5: Tree-Based Methods (Decision Tree, Random Forest, GBM & XGBoost)

Use the following tree-based methods to predict the overall scores and labels (positive/negative) of Amazon reviews in the Patio, Lawn and Garden category: <br>
1) Decision Trees <br>
2) Random Forest <br>
3) GBM <br>
4) XGBoost <br>

(Note: Reviews with an overall score <= 4 are treated as negative, and those with an overall score > 4 are treated as positive.)

The dataset can be obtained from http://jmcauley.ucsd.edu/data/amazon/ (direct link: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Patio_Lawn_and_Garden_5.json.gz).

Citation: R. He, J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.

In [40]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from pylab import *
import nltk
import warnings
warnings.filterwarnings('ignore')

In [3]:
data_patio_lawn_garden = pd.read_json('data_ch3/reviews_Patio_Lawn_and_Garden_5.json', lines = True)
data_patio_lawn_garden.head()

Unnamed: 0,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,B00002N674,"[4, 4]",4,Good USA company that stands behind their prod...,"06 21, 2011",A1JZFGZEZVWQPY,"Carter H ""1amazonreviewer@gmail . com""",Great Hoses,1308614400
1,B00002N674,"[0, 0]",5,This is a high quality 8 ply hose. I have had ...,"06 9, 2014",A32JCI4AK2JTTG,"Darryl Bennett ""Fuzzy342""",Gilmour 10-58050 8-ply Flexogen Hose 5/8-Inch ...,1402272000
2,B00002N674,"[2, 3]",4,It's probably one of the best hoses I've ever ...,"05 5, 2012",A3N0P5AAMP6XD2,H B,Very satisfied!,1336176000
3,B00002N674,"[0, 0]",5,I probably should have bought something a bit ...,"07 15, 2013",A2QK7UNJ857YG,Jason,Very high quality,1373846400
4,B00002N674,"[1, 1]",5,I bought three of these 5/8-inch Flexogen hose...,"08 5, 2013",AS0CYBAN6EM06,jimmy,Good Hoses,1375660800


In [4]:
data_patio_lawn_garden['overall'].value_counts()

5    7037
4    3384
3    1659
2     673
1     519
Name: overall, dtype: int64

In [5]:
lemmatizer = WordNetLemmatizer()

In [6]:
data_patio_lawn_garden['cleaned_review_text'] = data_patio_lawn_garden['reviewText'].apply(\
lambda x : ' '.join([lemmatizer.lemmatize(word.lower()) \
    for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))]))
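
The regular expression in the cell above replaces every run of punctuation or underscores with a single space before tokenizing. The following is a minimal sketch of that step on its own, using only the standard library. `strip_punctuation` is a hypothetical helper name, and `str.split()` stands in for `word_tokenize`; lemmatization is skipped, so plurals such as "hoses" are left unchanged here.

```python
import re

def strip_punctuation(text):
    """Replace each run of punctuation or underscores with one space."""
    return re.sub(r'([^\s\w]|_)+', ' ', str(text))

sample = "It's probably one of the best hoses I've ever"
# str.split() is a stand-in for word_tokenize; no lemmatization is applied.
cleaned = ' '.join(strip_punctuation(sample).lower().split())
print(cleaned)

# Slashes and hyphens are also replaced, so fractions split into separate tokens.
fraction = ' '.join(strip_punctuation("5/8-inch Flexogen").lower().split())
print(fraction)
```

Apostrophes become spaces, which is why the cleaned reviews contain fragments such as "it s" and "i ve".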

In [7]:
data_patio_lawn_garden[['cleaned_review_text', 'reviewText', 'overall']].head()

Unnamed: 0,cleaned_review_text,reviewText,overall
0,good usa company that stand behind their produ...,Good USA company that stands behind their prod...,4
1,this is a high quality 8 ply hose i have had g...,This is a high quality 8 ply hose. I have had ...,5
2,it s probably one of the best hose i ve ever h...,It's probably one of the best hoses I've ever ...,4
3,i probably should have bought something a bit ...,I probably should have bought something a bit ...,5
4,i bought three of these 5 8 inch flexogen hose...,I bought three of these 5/8-inch Flexogen hose...,5


In [8]:
tfidf_model = TfidfVectorizer(max_features=500)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(data_patio_lawn_garden['cleaned_review_text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

Unnamed: 0,10,20,34,8217,able,about,actually,add,after,again,...,work,worked,working,worth,would,yard,year,yet,you,your
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.120568,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.161561,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.116566,0.0,0.216988,0.0,0.049357
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.064347,0.0,0.0,0.070857,0.0,...,0.0,0.0,0.0,0.0,0.0,0.083019,0.0,0.0,0.0,0.0
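
To make the numbers in the TF-IDF table easier to interpret, here is a small standard-library sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three toy documents and the helper name `tfidf_row` are made up for illustration, and the `max_features` cut-off is not modelled.

```python
import math
import re
from collections import Counter

# Toy corpus (illustrative only, not taken from the dataset)
docs = ["good hose good company", "this hose leaks", "good value"]

# Keep tokens of two or more word characters, mirroring the default token pattern
tokenized = [re.findall(r'\b\w\w+\b', d.lower()) for d in docs]

# Columns are the vocabulary terms in sorted order
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for row in matrix:
    print([round(v, 3) for v in row])
```

Because single-character tokens are dropped by the token pattern, fragments such as "s" and "i" left over from cleaning do not appear as columns.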


In [13]:
# Treat reviews with overall score <= 4 as negative (encoded as 0)
# and reviews with overall score > 4 as positive (encoded as 1)

data_patio_lawn_garden['target'] = data_patio_lawn_garden['overall'].apply(lambda x : 0 if x<=4 else 1)
data_patio_lawn_garden['target'].value_counts()

1    7037
0    6235
Name: target, dtype: int64

## Decision Tree for Classification

In [15]:
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(tfidf_df, data_patio_lawn_garden['target'])
data_patio_lawn_garden['predicted_labels_dtc'] = dtc.predict(tfidf_df)

In [16]:
pd.crosstab(data_patio_lawn_garden['target'], data_patio_lawn_garden['predicted_labels_dtc'])

target \ predicted_labels_dtc,0,1
0,6227,8
1,1,7036


## Decision Tree for Regression

In [18]:
from sklearn import tree
dtr = tree.DecisionTreeRegressor()
dtr = dtr.fit(tfidf_df, data_patio_lawn_garden['overall'])
data_patio_lawn_garden['predicted_values_dtr'] = dtr.predict(tfidf_df)
data_patio_lawn_garden[['predicted_values_dtr', 'overall']].head(10)

Unnamed: 0,predicted_values_dtr,overall
0,4.0,4
1,5.0,5
2,4.0,4
3,5.0,5
4,5.0,5
5,5.0,5
6,5.0,5
7,5.0,5
8,5.0,5
9,4.0,4


## Defining a generic function for classifier models

In [27]:
def clf_model(model_type, X_train, y):
    # Fit the classifier and predict on the same data it was trained on
    model = model_type.fit(X_train, y)
    predicted_labels = model.predict(X_train)
    return predicted_labels

## Random Forest Classifier

In [28]:
from sklearn.ensemble import RandomForestClassifier 
rfc = RandomForestClassifier(n_estimators=20,max_depth=4,max_features='sqrt',random_state=1)
data_patio_lawn_garden['predicted_labels_rfc'] = clf_model(rfc, tfidf_df, data_patio_lawn_garden['target'])
pd.crosstab(data_patio_lawn_garden['target'], data_patio_lawn_garden['predicted_labels_rfc'])

target \ predicted_labels_rfc,0,1
0,3302,2933
1,1557,5480


## Gradient Boosting Machine Classifier

In [29]:
from sklearn.ensemble import GradientBoostingClassifier 
gbc = GradientBoostingClassifier(n_estimators=2,max_depth=3,max_features='sqrt',random_state=1)
data_patio_lawn_garden['predicted_labels_gbc'] = clf_model(gbc, tfidf_df, data_patio_lawn_garden['target'])
pd.crosstab(data_patio_lawn_garden['target'], data_patio_lawn_garden['predicted_labels_gbc'])

target \ predicted_labels_gbc,0,1
0,101,6134
1,26,7011


## XGBoost Classifier

In [33]:
#!pip install xgboost

In [34]:
from xgboost import XGBClassifier
xgb_clf=XGBClassifier(n_estimators=20,learning_rate=0.03,max_depth=5,subsample=0.6,colsample_bytree= 0.6,reg_alpha= 10,seed=42)
data_patio_lawn_garden['predicted_labels_xgbc'] = clf_model(xgb_clf, tfidf_df, data_patio_lawn_garden['target'])
pd.crosstab(data_patio_lawn_garden['target'], data_patio_lawn_garden['predicted_labels_xgbc'])


target \ predicted_labels_xgbc,0,1
0,4222,2013
1,2088,4949
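
The four crosstabs are easier to compare once they are reduced to a few summary numbers. The sketch below computes accuracy, precision and recall for the positive class from the counts shown in the outputs (rows are the true `target`, columns are the predicted label). `confusion_metrics` is a hypothetical helper, not part of the notebook. All predictions were made on the same rows the models were fitted on, so these figures describe fit to the training data rather than performance on unseen reviews.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Return (accuracy, precision, recall) for the positive class."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# (tn, fp, fn, tp) copied from the crosstab outputs above
crosstabs = {
    'decision tree': (6227, 8, 1, 7036),
    'random forest': (3302, 2933, 1557, 5480),
    'gradient boosting': (101, 6134, 26, 7011),
    'xgboost': (4222, 2013, 2088, 4949),
}

results = {name: confusion_metrics(*counts) for name, counts in crosstabs.items()}
for name, (accuracy, precision, recall) in results.items():
    print(f"{name:18s} accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The unconstrained decision tree is almost perfect on the training rows, which is what a fully grown tree tends to do. The gradient boosting model, with only two estimators, predicts the positive class for nearly every review.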


## Defining a generic function for regression models

In [35]:
def reg_model(model_type, X_train, y):
    # Fit the regressor and predict on the same data it was trained on
    model = model_type.fit(X_train, y)
    predicted_values = model.predict(X_train)
    return predicted_values

## Random Forest Regressor

In [36]:
from sklearn.ensemble import RandomForestRegressor 
rfg = RandomForestRegressor(n_estimators=20,max_depth=4,max_features='sqrt',random_state=1)
data_patio_lawn_garden['predicted_values_rfg'] = reg_model(rfg, tfidf_df, data_patio_lawn_garden['overall'])
data_patio_lawn_garden[['overall', 'predicted_values_rfg']].head(10)

Unnamed: 0,overall,predicted_values_rfg
0,4,4.236717
1,5,4.341767
2,4,4.219413
3,5,4.134852
4,5,4.147218
5,5,4.252751
6,5,4.190971
7,5,4.251688
8,5,4.25161
9,4,4.262498


## Gradient Boosting Machine Regressor

In [37]:
from sklearn.ensemble import GradientBoostingRegressor 
gbr = GradientBoostingRegressor(n_estimators=20,max_depth=4,max_features='sqrt',random_state=1)
data_patio_lawn_garden['predicted_values_gbr'] = reg_model(gbr, tfidf_df, data_patio_lawn_garden['overall'])
data_patio_lawn_garden[['overall', 'predicted_values_gbr']].head(10)

Unnamed: 0,overall,predicted_values_gbr
0,4,4.354611
1,5,4.441782
2,4,4.329691
3,5,4.060094
4,5,4.145767
5,5,4.162901
6,5,4.227398
7,5,4.146231
8,5,4.269629
9,4,4.13646


## XGBoost Regressor

In [38]:
from xgboost import XGBRegressor 
xgbr = XGBRegressor(n_estimators=20,learning_rate=0.03,max_depth=5,subsample=0.6,colsample_bytree= 0.6,reg_alpha= 10,seed=42)
data_patio_lawn_garden['predicted_values_xgbr'] = reg_model(xgbr, tfidf_df, data_patio_lawn_garden['overall'])
data_patio_lawn_garden[['overall', 'predicted_values_xgbr']].head(10)

Unnamed: 0,overall,predicted_values_xgbr
0,4,2.22069
1,5,2.318871
2,4,2.235872
3,5,2.108017
4,5,2.108017
5,5,2.129134
6,5,2.193431
7,5,2.186514
8,5,2.178304
9,4,2.128875
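
The XGBoost predictions cluster around 2.2 even though most ratings are 4 or 5. One plausible explanation, offered here as an assumption rather than something verified against the fitted model, is shrinkage: with `learning_rate=0.03` and only 20 trees, the model moves just part of the way from its starting prediction toward the data. The sketch below assumes a starting prediction (`base_score`) of 0.5, which was the default in XGBoost releases before 2.0, and idealizes each tree as predicting the mean residual exactly. Regularization and subsampling are ignored.

```python
# Rating counts taken from the value_counts() output earlier in the notebook
rating_counts = {5: 7037, 4: 3384, 3: 1659, 2: 673, 1: 519}
n_reviews = sum(rating_counts.values())
mean_rating = sum(r * c for r, c in rating_counts.items()) / n_reviews

base_score = 0.5      # assumption: default starting prediction in older XGBoost
learning_rate = 0.03  # as passed to XGBRegressor above
n_estimators = 20     # as passed to XGBRegressor above

# Idealized boosting: each round closes a learning_rate fraction of the remaining gap
prediction = base_score
for _ in range(n_estimators):
    prediction += learning_rate * (mean_rating - prediction)

# Equivalent closed form for the same idealized process
closed_form = mean_rating - (mean_rating - base_score) * (1 - learning_rate) ** n_estimators

print(f"mean rating: {mean_rating:.3f}")
print(f"idealized prediction after {n_estimators} rounds: {prediction:.3f}")
```

Under these assumptions the idealized prediction lands close to the values in the table. If that reading is right, increasing `n_estimators` or `learning_rate`, or setting `base_score` near the mean rating, should bring the predictions closer to the observed scores.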


Reference / Citation for the dataset: http://jmcauley.ucsd.edu/data/amazon/ <br>
http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Patio_Lawn_and_Garden_5.json.gz <br>
R. He, J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.