### Prediction Task

The dataset for this section comes from the exploratory data analysis section.  The goal of the prediction task is to predict review sentiment, positive or negative, from the text of the review alone.  In the previous section, feature engineering produced the Bayes average rating feature and the price range feature.  The Bayes average rating combines a restaurant's star rating with its number of reviews.  However, these and other restaurant attributes are not used in classification; the reasons are given in the next section, Feature Selection.  Each model outputs the probability that a review is positive as well as the predicted class, positive or negative.  

Below are the first five rows of the dataframe from the previous section.  More wrangling is needed: the text reviews must be converted into word vectors, and the decision to drop restaurant attributes from the dataset must be justified.  Word2Vec is used to create the vector representations for every classification method.  In the deep neural network, the Word2Vec vectors are used as the weights of an embedding layer, while for the ensemble methods the word vectors of each review are averaged over the number of words in the review.  The methods used in this report are a Random Forest model, a Gradient Boosted Classifier, a Voting Classifier, and a Deep Neural Network.  

In [135]:
import pandas as pd
lv_reviews_ml=pd.read_csv('lv_reviews_ml')  #read in dataframe with text review and restaurant attributes

lv_reviews_ml.head(5)

Unnamed: 0.1,Unnamed: 0,attributes,business_id,categories,city,is_open,latitude,longitude,name,postal_code,review_count,business_stars,cool,funny,review_stars,text,useful,bayes_avg,price_range,sentiment
0,0,"{'RestaurantsReservations': 'False', 'GoodForK...",3BCsAgo_1i4xMuTyLKMLRQ,"Food Delivery Services, Juice Bars & Smoothies...",Las Vegas,1.0,36.075876,-115.181726,SkinnyFATS,89118,2206.0,4.5,0,0,5.0,We stopped on a whim for a fast lunch before h...,0,4.40222,2,1
1,1,"{'BikeParking': 'True', 'RestaurantsReservatio...",XvOzizKafffkMuk-tIdNcQ,"Restaurants, American (Traditional), Breakfast...",Las Vegas,1.0,36.029381,-115.115198,Squeeze In,89123,272.0,4.0,0,0,5.0,"Amazing food, great portions and a very creati...",0,3.780229,2,1
2,2,"{'RestaurantsPriceRange2': '2', 'RestaurantsAt...",H_FkQxifSo13B2FWjhp36g,"Italian, Restaurants, Pizza",Las Vegas,0.0,36.122952,-115.168515,Otto Enoteca Pizzeria,89109,684.0,3.5,1,1,4.0,"Yup, same great quality as NYC. The salads ar...",1,3.511204,2,1
3,3,"{'BikeParking': 'True', 'RestaurantsTakeOut': ...",S599hCA4kJJO3_b6SRFKoA,"Seafood, Restaurants, Mexican, Breakfast & Brunch",Las Vegas,1.0,36.271357,-115.26685,Michoacan,89149,474.0,3.5,0,0,3.0,as a recent resident of Las Vegas it was imper...,0,3.51445,2,0
4,4,"{'Caters': 'True', 'WheelchairAccessible': 'Tr...",ygaOvp0PLBYaYeN9cZAlGg,"Soup, Vietnamese, Barbeque, Restaurants",Las Vegas,1.0,36.126567,-115.213231,Viet Noodle Bar,89146,984.0,4.0,1,0,5.0,Excellent rib eye pho. Nice interior. Good ser...,0,3.906963,2,1



### Feature Selection

Not all of the available features, including the engineered ones, were used as inputs to the classifiers, for several reasons.  First, the goal of this project is to predict sentiment from the review text alone so that the approach can be applied to any review website, not just Yelp.  If features that describe restaurant attributes are used in the prediction, the models will not necessarily perform as well in another domain with different business attributes.  Second, attributes other than the review text, such as the review count or the number of people who rate a review as cool, funny, or useful, can change daily.  It is not practical to keep updating the dataframe of reviews and reclassifying every review each time these attributes change.  The following attributes were dropped: is_open, review_count, business_stars, cool, funny, useful, bayes_avg, and price_range.  

A Word2Vec model was used to convert the text reviews into vectors.  Each word in the Yelp reviews is represented by a 300-dimensional vector.  According to the shape of the Word2Vec model's weight matrix, 5715 words (stop words excluded) have a vector representation.  For the deep neural network, every review word found in the Word2Vec vocabulary was mapped to its vector, and these vectors were used as the weights of an embedding layer.  In the other models, the information contained in the order of the words is lost because the Word2Vec vectors of each review were averaged.  
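To illustrate why averaging discards word order, here is a minimal sketch using made-up 3-dimensional vectors. The `toy_vectors` dictionary and the `average_vector` helper are hypothetical and are not taken from the notebook's Word2Vec model. Two reviews containing the same words in a different order end up with the same averaged representation.

```python
# Toy word vectors; the values are invented for illustration only.
toy_vectors = {
    "food": [0.25, 0.5, 0.0],
    "not": [-0.5, 0.25, 0.125],
    "good": [1.0, 0.5, 0.25],
}


def average_vector(words, vectors):
    """Average the vectors of the words that are in the vocabulary."""
    known = [vectors[w] for w in words if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim  # no known words: return the zero vector
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


review_a = ["food", "not", "good"]
review_b = ["good", "food", "not"]  # same words, different order

avg_a = average_vector(review_a, toy_vectors)
avg_b = average_vector(review_b, toy_vectors)
print(avg_a == avg_b)  # the two reviews are indistinguishable after averaging
```

An embedding layer followed by a recurrent layer, in contrast, receives one vector per word position, so the two reviews would be presented to the network as different sequences.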

A deep neural network with the same architecture as the text-only model was also trained on multiple features, including restaurant attributes.  Metrics such as accuracy, F1-score, and AUC were slightly lower when the restaurant attributes were discarded.  If restaurant attributes are added to the model in the future, inputs and outputs need to be created for the categorical and numerical variables: an embedding layer would be used for each categorical variable and a single dense layer for the numerical variables.  Afterwards, these branches can be combined with an architecture similar to the one used later for the text reviews alone.  

In [4]:
from gensim.models import Word2Vec
model = Word2Vec.load("300features_40minwords_10context")  #load word2vec model

In [5]:
type(model)  #type of model

gensim.models.word2vec.Word2Vec

In [6]:
model.wv.syn0.shape  #shape of the weight matrix: (vocabulary size, vector size)

(5715, 300)

### Classification Metrics

The following metrics were used to compare the performance of the baseline classifier and other classifiers against one another.  

**-Classification Accuracy**: Fraction of reviews correctly classified.


$$accuracy = \frac{tp + tn}{tp+tn+fp+fn}$$


where 

tp is the number of true positives,<br>
tn is the number of true negatives,<br>
fp is the number of false positives,<br>
and fn is the number of false negatives.<br>

**-False Positive Rate**: Fraction of negative reviews classified as positive. 

$$FPR = \frac{fp}{fp+tn}$$

**-False Negative Rate**: Fraction of positive reviews classified as negative.  

$$FNR = \frac{fn}{fn+tp}$$


**-Balanced Error Rate**: Average of the false negative rate and false positive rate. 

$$ BER = \frac{FPR+FNR}{2} $$
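As a worked check of the four formulas above, the sketch below computes them from raw confusion-matrix counts. The helper name `rates_from_counts` is an assumption made for illustration. The counts are read off the deep neural network confusion matrix reported later in this document (6032 true positives, 2742 true negatives, 676 false positives, 550 false negatives).

```python
def rates_from_counts(tp, tn, fp, fn):
    """Return accuracy, FPR, FNR and BER from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of actual negatives predicted positive
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # share of actual positives predicted negative
    ber = (fpr + fnr) / 2
    return accuracy, fpr, fnr, ber


accuracy, fpr, fnr, ber = rates_from_counts(tp=6032, tn=2742, fp=676, fn=550)
print(round(accuracy, 5), round(fpr, 5), round(fnr, 5), round(ber, 5))
```

Rounded to five decimals, the values agree with the metrics printed for the deep neural network.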


**-F1 Score**: Harmonic mean of precision and recall.


- Precision is the ratio of true positives to reviews that were predicted to be positive.

- Recall is the ratio of true positives to reviews that were actually positive.  

$$ F_{1}\ \text{Score} = \frac{2 \cdot precision \cdot recall}{precision+recall} $$
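The F1 formula can be checked by hand with a short sketch. The function name `precision_recall_f1` is hypothetical. The counts are again those of the deep neural network confusion matrix reported later in this document.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0  # avoid dividing by zero
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=6032, fp=676, fn=550)
print(round(f1, 5))
```

Because the formula is symmetric in false positives and false negatives, swapping the roles of the predicted and actual labels leaves F1 unchanged, although precision and recall themselves trade places.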

**-AUC**: Area under the ROC curve (true positive rate vs. false positive rate).  AUC is the probability that a randomly drawn positive review is assigned a higher predicted probability of being positive than a randomly drawn negative review.  
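The probabilistic reading of AUC can be made concrete by comparing every positive/negative pair directly. This is only a sketch on invented scores; `auc_by_pairs` is a hypothetical helper, and ties are counted as half a win, which is the usual convention.

```python
def auc_by_pairs(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive review scored higher: correct ranking
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))


toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2]  # invented predicted probabilities
toy_labels = [1, 1, 1, 0, 0]
toy_auc = auc_by_pairs(toy_scores, toy_labels)  # 5 of the 6 pairs are ranked correctly

# A classifier that gives every review the same score ties on every pair.
constant_auc = auc_by_pairs([1.0, 1.0, 1.0, 1.0], [1, 0, 1, 0])
print(toy_auc, constant_auc)
```

The constant-score case shows why a classifier that assigns the same probability to every review has an AUC of 0.5.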

 


### Models

#### Benchmark Classifier

A benchmark classifier serves as the reference against which all the other classification methods are compared.

-ZeroR (Baseline): The ZeroR classifier used as a baseline simply classifies every review as the majority class, which in this case is positive.
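As a rough illustration of what ZeroR does, the sketch below implements it from scratch. The `ZeroR` class is hypothetical and is not the scikit-learn estimator used in this notebook. The label counts mirror the class totals of this dataset (33,175 positive and 16,825 negative reviews).

```python
from collections import Counter


class ZeroR:
    """Predict the most frequent training label for every sample."""

    def fit(self, labels):
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_samples):
        return [self.majority_] * n_samples


toy_labels = [1] * 33175 + [0] * 16825
zero_r = ZeroR().fit(toy_labels)
toy_predictions = zero_r.predict(len(toy_labels))

# Accuracy equals the share of the majority class.
zero_r_accuracy = sum(p == y for p, y in zip(toy_predictions, toy_labels)) / len(toy_labels)
print(zero_r.majority_, zero_r_accuracy)
```

Any model worth keeping has to beat this accuracy, which a classifier earns without looking at the review text at all.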

In [137]:
from sklearn.dummy import DummyClassifier  #classifier that labels all samples as majority class


X_edit=lv_reviews_ml["text"]  #review text (ignored by the dummy classifier)
baseline=DummyClassifier(strategy="most_frequent")  #labels all as majority class
y_test=lv_reviews_ml["sentiment"] #target labels for the full dataset
baseline.fit(X_edit,y_test)
y_pred_baseline=baseline.predict(X_edit)  #predicted values
y_pred_prob_baseline=baseline.predict_proba(X_edit)[:,1]  #predicted probabilities of the positive class

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

In [140]:
sum(y_pred_prob_baseline)

50000.0

In [138]:
y_pred = y_pred_baseline
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))  #Accuracy
print('F1-score: ' +  str(round(f1_score(y_pred, y_test), 5)))  #F1-Score 

TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

false_positive_rate = FP / float(TN + FP)
false_negative_rate = FN/ float(FN + TP)
balanced_error_rate = (false_negative_rate+false_positive_rate)/2

print('False Positive Rate: ' + str(round(false_positive_rate, 5)))  #FPR
print('False Negative Rate: ' + str(round(false_negative_rate, 5)))   #FNR
print('Balanced Error Rate: ' + str(round(balanced_error_rate, 5)) + '\n') #BER

print("AUC of ROC Curve: " + "{}".format(roc_auc_score(y_test, y_pred_prob_baseline))) #AUC


print('Confusion matrix:')
confusion_matrix(y_pred, y_test)  #Confusion Matrix (rows: predicted class, columns: actual class)


Accuracy: 0.6635
F1-score: 0.79772
False Positive Rate: 1.0
False Negative Rate: 0.0
Balanced Error Rate: 0.5

AUC of ROC Curve: 0.5
Confusion matrix:


array([[    0,     0],
       [16825, 33175]])

#### Deep Neural Network

A deep neural network stacks multiple layers and learns the linear and nonlinear transformations that map the input to the prediction while minimizing the prediction error.  A sequential model is advantageous for text reviews because it takes the order and position of the words into account when predicting sentiment.  The word vectors in a review do not need to be averaged or concatenated before being fed into the network because it processes the review as a sequence of words.  Neural networks have many hyperparameters that can be adjusted to improve prediction metrics, such as the number of layers, the number of units per layer, the types of activation functions, and the number of training epochs.  

In [16]:
import string
import re
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

review_lines = []
lines=lv_reviews_ml['text'].values.tolist()  #list of review texts (one string per review)
stops=set(stopwords.words("english"))  #English stop words, built once outside the loop

for line in lines:
    line = re.sub("[^a-zA-Z]", " ", line)  #only keep alphabetical characters 
    words=line.lower().split()  #lowercase words and split them on whitespace
    words=[w for w in words if w not in stops]  #keep only words that are not stop words
    review_lines.append(words)  #append list of words for each review to review_lines list
    

In [17]:
embeddings_index={}
with open('yelp_embedding_word2vec.txt', encoding="utf-8") as f:  #file is closed automatically
    for line in f:
        values = line.split()
        word = values[0]  #the word is in the first column of each line
        coefs = np.asarray(values[1:], dtype='float32')  #remaining columns are the word vector
        embeddings_index[word] = coefs  #create mapping of all words to vectors

In [18]:
import tensorflow
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences


tokenizer_obj = Tokenizer()  #tokenizer object to convert words into numbers
tokenizer_obj.fit_on_texts(review_lines)  #fit the tokenizer on all of the reviews
sequences = tokenizer_obj.texts_to_sequences(review_lines)  #convert text to numbers

max_length=max([len(s.split()) for s in lines]) #longest review in words (1022), measured on the raw text before cleaning

word_index=tokenizer_obj.word_index #all of the words that were tokenized
print('Found %s unique tokens.' % len(word_index))

review_pad = pad_sequences(sequences, maxlen=max_length)  #change length of every review to max_length
sentiment = lv_reviews_ml['sentiment'].values #convert column to array
print('Shape of review tensor: ', review_pad.shape) 
print('Shape of sentiment tensor: ', sentiment.shape)


Found 47518 unique tokens.
Shape of review tensor:  (50000, 1022)
Shape of sentiment tensor:  (50000,)
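The tokenizing and padding step above can be sketched without Keras. The helpers `build_word_index` and `pad_left` are hypothetical stand-ins that mimic the general behaviour (frequency-ordered integer ids starting at 1, zeros added at the front); they are not the library implementation.

```python
from collections import Counter


def build_word_index(tokenized_reviews):
    """Map each word to an integer id, most frequent word first; 0 is kept for padding."""
    counts = Counter(word for review in tokenized_reviews for word in review)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def pad_left(sequence, max_length, pad_value=0):
    """Keep the last max_length ids and fill the front with pad_value (max_length >= 1)."""
    trimmed = sequence[-max_length:]
    return [pad_value] * (max_length - len(trimmed)) + trimmed


toy_reviews = [["great", "food", "great", "service"], ["slow", "service"]]
toy_word_index = build_word_index(toy_reviews)
toy_sequences = [[toy_word_index[w] for w in review] for review in toy_reviews]
toy_padded = [pad_left(seq, max_length=4) for seq in toy_sequences]
print(toy_padded)
```

Padding every review to the length of the longest one is what produces the `(50000, 1022)` tensor, even though the average cleaned review is far shorter.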


In [19]:
num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, 300)) #EMBEDDING_DIM=300

for word, i in word_index.items():  #words that were tokenized
    if i > num_words:
        continue
    embedding_vector = embeddings_index.get(word)  #create embedding matrix from word2vec model
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

print(num_words)

47519


In [20]:
total=[]
for i in sequences:
    total.append(len(i))   #number of tokens in each review
np.mean(total)  #mean number of tokens per review

56.15596

In [24]:
from keras.models import Sequential
from keras.layers import Dense , Input , LSTM , Embedding, Dropout , Activation, GRU, Flatten, Concatenate
from keras.layers.embeddings import Embedding
from keras.models import Model
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.initializers import Constant

model = Sequential()  #initialize sequential model
embedding_layer=Embedding(num_words, 300, 
                          embeddings_initializer=Constant(embedding_matrix),  #word2vec embedding
                         input_length=1022)  #output_dim=300 and input_length=max_length=1022

model.add(embedding_layer)  #convert word indices to vectors

model.add(Bidirectional(LSTM(32, return_sequences=True)))  #read the sequence forwards and backwards with long short-term memory units
model.add(GlobalMaxPool1D())  #take the maximum of each feature across the sequence
model.add(Dense(20, activation="relu"))  #fully connected layer with relu activation function
model.add(Dropout(0.05)) #randomly drop 5% of inputs during training
model.add(Dense(1, activation='sigmoid')) #output layer: probability that the review is positive


batch_size=100
epochs=3

model.compile(loss='binary_crossentropy', optimizer="adam", metrics=['accuracy'])

VALIDATION_SPLIT=0.2  # split for validation count
indices=np.arange(review_pad.shape[0])
np.random.shuffle(indices)   #randomize indices
review_pad = review_pad[indices]
sentiment = sentiment[indices]

num_validation_samples = int(VALIDATION_SPLIT*review_pad.shape[0])

X_train_pad=review_pad[:-num_validation_samples-10000]   #TRAINING
y_train=sentiment[:-num_validation_samples-10000]

X_val_pad=review_pad[-num_validation_samples-10000:-10000] #VALIDATION
y_val=sentiment[-num_validation_samples-10000:-10000]

X_test_pad=review_pad[-num_validation_samples: ] #TEST
y_test=sentiment[-num_validation_samples:]

print('Shape of X_train_pad_tensor:', X_train_pad.shape)
print('Shape of y_train tensor: ', y_train.shape)

print('Shape of X_val_pad tensor: ', X_val_pad.shape)
print('Shape of y_val tensor: ', y_val.shape)

print('Shape of X_test_pad tensor: ', X_test_pad.shape)
print('Shape of y_test tensor: ', y_test.shape)

Shape of X_train_pad_tensor: (30000, 1022)
Shape of y_train tensor:  (30000,)
Shape of X_val_pad tensor:  (10000, 1022)
Shape of y_val tensor:  (10000,)
Shape of X_test_pad tensor:  (10000, 1022)
Shape of y_test tensor:  (10000,)
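The slicing above depends on negative offsets that happen to line up for 50,000 reviews. A more explicit version of the same idea is sketched below; `split_indices` is a hypothetical helper, and the sizes are taken from the shapes printed above.

```python
import random


def split_indices(n_samples, n_val, n_test, seed=0):
    """Shuffle the row indices and cut them into train, validation and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = n_samples - n_val - n_test
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(50000, n_val=10000, n_test=10000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Writing the boundaries in terms of `n_train`, `n_val`, and `n_test` keeps the three parts disjoint even if the dataset size or the split proportions change.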


In [25]:
print('Train...')
model.fit(X_train_pad, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_val_pad, y_val), verbose=2)

Train...
Train on 30000 samples, validate on 10000 samples
Epoch 1/3
 - 1228s - loss: 0.3486 - acc: 0.8446 - val_loss: 0.2700 - val_acc: 0.8873
Epoch 2/3
 - 1217s - loss: 0.2062 - acc: 0.9199 - val_loss: 0.2730 - val_acc: 0.8887
Epoch 3/3
 - 1265s - loss: 0.1368 - acc: 0.9507 - val_loss: 0.3064 - val_acc: 0.8784


<keras.callbacks.History at 0x1a3e772dd8>

In [36]:
prediction = model.predict(X_test_pad)   #predicted probabilities from the neural network



In [37]:
y_pred = (prediction > 0.5)
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))
print('F1-score: ' +  str(round(f1_score(y_pred, y_test), 5)))

TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

false_positive_rate = FP / float(TN + FP)
false_negative_rate = FN/ float(FN + TP)
balanced_error_rate = (false_negative_rate+false_positive_rate)/2

print('False Positive Rate: ' + str(round(false_positive_rate, 5)))
print('False Negative Rate: ' + str(round(false_negative_rate, 5)))
print('Balanced Error Rate: ' + str(round(balanced_error_rate, 5)) + '\n')

print("AUC of ROC Curve: " + "{}".format(roc_auc_score(y_test, prediction)))


print('Confusion matrix:')
confusion_matrix(y_pred, y_test)  #rows: predicted class, columns: actual class



Accuracy: 0.8774
F1-score: 0.90775
False Positive Rate: 0.19778
False Negative Rate: 0.08356
Balanced Error Rate: 0.14067

AUC of ROC Curve: 0.9408858876959147
Confusion matrix:


array([[2742,  550],
       [ 676, 6032]])

#### Ensemble Methods Preparation

In [95]:
from sklearn.model_selection import train_test_split

model = Word2Vec.load("300features_40minwords_10context")  #load word2vec model
X=lv_reviews_ml["text"]
y=lv_reviews_ml["sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
#split predictors and target variable with a 0.8/0.2 split

def makeFeatureVec(words, model, num_features):
    #Function to average all of the word vectors in a given review
    #Initialize an empty numpy array
    featureVec=np.zeros((num_features), dtype="float32")
    nwords=0
    #number of in-vocabulary words per review
    index2word_set=set(model.wv.index2word)  #set of words in model's vocabulary
    for word in words:
        if word in index2word_set:
            nwords=nwords+1
            featureVec=np.add(featureVec, model.wv[word])  #add vectors per review together

    if nwords != 0:
        featureVec=np.divide(featureVec,nwords)  #divide featureVec by number of words counted in review
    return featureVec

def getAvgFeatureVecs(reviews, model, num_features):
    counter=0
    reviewFeatureVecs=np.zeros((len(reviews), num_features), dtype="float32")
    for review in reviews:
        if counter % 5000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))
        reviewFeatureVecs[counter]=makeFeatureVec(review, model, num_features) #create numpy array where each row is average vector for each review
        counter=counter+1
    return reviewFeatureVecs

print('Create average feature vecs for train reviews')


clean_train_reviews=[]
stops=set(stopwords.words("english")) #english stop words, built once outside the loop
for line in X_train:
    line = re.sub("[^a-zA-Z]", " ", line) #keep alphabetical characters
    words=line.lower().split() #lowercase and split words
    words=[w for w in words if w not in stops] #keep non stop words
    clean_train_reviews.append(words)  #append all words in reviews of train set

trainDataVecs=getAvgFeatureVecs(clean_train_reviews, model, num_features=300)


print('Create average feature vecs for test reviews')
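clean_test_reviews= []
stops=set(stopwords.words("english")) #english stop words, built once outside the loop
for line in X_test:
    line = re.sub("[^a-zA-Z]", " ", line) #keep alphabetical characters
    words=line.lower().split() #lowercase and split words
    words=[w for w in words if w not in stops] #keep non stop words
    clean_test_reviews.append(words) #append all words in reviews of test set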
    

    
testDataVecs=getAvgFeatureVecs(clean_test_reviews, model, num_features=300)
    

Create average feature vecs for train reviews
Review 0 of 40000
Review 5000 of 40000
Review 10000 of 40000
Review 15000 of 40000
Review 20000 of 40000
Review 25000 of 40000
Review 30000 of 40000
Review 35000 of 40000
Create average feature vecs for test reviews
Review 0 of 10000
Review 5000 of 10000


#### Random Forest

The random forest classifier is an ensemble method that combines many decision trees, each trained on a bootstrap sample of the training data (bootstrap aggregating, or bagging). Bagging means that samples are drawn at random with replacement, which reduces the variance of the ensemble relative to any single tree. In addition, random forests consider only a random subset of the predictors when constructing a tree, so that the trees differ more from one another. The random forest result is the combination of the trees built from bagged samples and subsetted predictors. Compared to a single decision tree, a random forest improves accuracy by assigning each sample the class that receives the majority of votes across the many trees.
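The three ingredients described above (bootstrap samples, random predictor subsets, and majority voting) can be sketched in a few lines. All helper names are hypothetical, the tree votes are invented, and no actual trees are trained; the choice of 17 predictors is an assumption, roughly the square root of the 300 Word2Vec dimensions.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement (bagging)."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct predictor indices for one tree."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(votes):
    """Return the class predicted by most trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
toy_rows = list(range(10))
sample = bootstrap_sample(toy_rows, rng)  # some rows repeat, others are left out
features = random_feature_subset(n_features=300, k=17, rng=rng)

tree_votes = [1, 0, 1, 1, 0]  # invented votes of five trees for one review
forest_prediction = majority_vote(tree_votes)
print(sample, features, forest_prediction)
```

Because every tree sees a different sample and a different set of predictors, the errors of individual trees are less correlated, which is what makes the vote more reliable than any single tree.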

In [None]:
#Fit a random forest model to the training data

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

steps=[('randoforest', RandomForestClassifier(verbose=2))]  #steps of pipeline
pipeline=Pipeline(steps) #creation of pipeline

parameters_rf = [{   #parameters
    'randoforest__n_estimators':      [250],
    'randoforest__criterion':         ['gini'],
    'randoforest__max_features':      ['auto'],
    'randoforest__class_weight':      ['balanced']
    #'randoforest__min_samples_leaf':  list(range(2, 8))
}]
print("Fitting a random forest to labeled training data...")





clf_rf = GridSearchCV(pipeline, parameters_rf, cv=5)  #GridSearch object

clf_rf.fit(trainDataVecs, y_train) #fit 

y_pred_rf=clf_rf.predict(testDataVecs) #predict
y_pred_prob_rf = clf_rf.predict_proba(testDataVecs)[:,1] # Compute predicted probabilities: y_pred_prob



In [104]:
y_pred = y_pred_rf
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))
print('F1-score: ' +  str(round(f1_score(y_pred, y_test), 5)))

TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

false_positive_rate = FP / float(TN + FP)
false_negative_rate = FN/ float(FN + TP)
balanced_error_rate = (false_negative_rate+false_positive_rate)/2

print('False Positive Rate: ' + str(round(false_positive_rate, 5)))
print('False Negative Rate: ' + str(round(false_negative_rate, 5)))
print('Balanced Error Rate: ' + str(round(balanced_error_rate, 5)) + '\n')

print("AUC of ROC Curve: " + "{}".format(roc_auc_score(y_test, y_pred_prob_rf)))


print('Confusion matrix:')
confusion_matrix(y_pred, y_test)  #rows: predicted class, columns: actual class



Accuracy: 0.8354
F1-score: 0.88356
False Positive Rate: 0.37325
False Negative Rate: 0.05878
Balanced Error Rate: 0.21602

AUC of ROC Curve: 0.9111844858919391
Confusion matrix:


array([[2109,  390],
       [1256, 6245]])

#### Gradient Boosted Classifier

A gradient boosted classifier combines a large number of trees that are built sequentially.  Each new tree is fit to the errors (the gradient of the loss) left by the trees built before it.  Each tree's contribution is scaled by a learning rate and added to the ensemble, so the model learns slowly and classification accuracy improves step by step.  An advantage of gradient boosted trees over a single decision tree, which may overfit to one part of the data, is that boosting concentrates on what the current ensemble predicts worst and improves over time.  A disadvantage is that gradient boosted trees are more prone to overfitting than random forests.
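The sequential idea can be sketched with one-split trees (stumps) on a single invented feature. To keep the example short it uses squared error, so each stump is fit to plain residuals; this is a simplification and not the loss used by scikit-learn's classifier. The names `fit_stump`, `boost`, and `mse` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Add stumps one at a time, each fit to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean label
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0, 0, 0, 1, 0, 1, 1, 1]  # invented labels
for rounds in (0, 1, 10):
    print(rounds, round(mse(boost(toy_x, toy_y, rounds), toy_x, toy_y), 4))
```

The training error falls as stumps are added, which also hints at the overfitting risk: with enough rounds the ensemble can chase noise in the training labels.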

In [None]:
#Fit a gradient boosted classifier to the training data

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

steps=[('gradientboost', GradientBoostingClassifier(verbose=2))]  #steps of pipeline
pipeline=Pipeline(steps) 

parameters_grad = [{  #parameters
    'gradientboost__n_estimators':      [250],
    'gradientboost__max_features':      ['auto'],
    #'gradientboost__min_samples_leaf':  list(range(2, 7)),
    #'gradientboost__loss' :             ['deviance', 'exponential'],
    #'gradientboost__learning_rate':     [0.025, 0.05, 0.075, 0.1],
}]


print("Fitting a gradient boosted classifier to labeled training data...")


clf_gb = GridSearchCV(pipeline, parameters_grad, cv=5)  #Gridsearch object

clf_gb.fit(trainDataVecs, y_train)  #fit 

y_pred_gb=clf_gb.predict(testDataVecs)  #predict
y_pred_prob_gb = clf_gb.predict_proba(testDataVecs)[:,1] # Compute predicted probabilities: y_pred_prob



In [106]:
y_pred = y_pred_gb
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))
print('F1-score: ' +  str(round(f1_score(y_pred, y_test), 5)))

TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

false_positive_rate = FP / float(TN + FP)
false_negative_rate = FN/ float(FN + TP)
balanced_error_rate = (false_negative_rate+false_positive_rate)/2

print('False Positive Rate: ' + str(round(false_positive_rate, 5)))
print('False Negative Rate: ' + str(round(false_negative_rate, 5)))
print('Balanced Error Rate: ' + str(round(balanced_error_rate, 5)) + '\n')

print("AUC of ROC Curve: " + "{}".format(roc_auc_score(y_test, y_pred_prob_gb)))


print('Confusion matrix:')
confusion_matrix(y_pred, y_test)  #rows: predicted class, columns: actual class



Accuracy: 0.8602
F1-score: 0.8971
False Positive Rate: 0.25468
False Negative Rate: 0.08154
Balanced Error Rate: 0.16811

AUC of ROC Curve: 0.9260917441054518
Confusion matrix:


array([[2508,  541],
       [ 857, 6094]])

#### Voting Classifier

A voting classifier combines multiple classifiers, in this case a random forest and a gradient boosting classifier.  With soft voting, the predicted probabilities of each sample being positive are averaged across the estimators, and in a binary problem the sample is labeled positive if the average probability is above 50%.  An advantage of a voting classifier is that one estimator can compensate for the prediction errors of another, provided the estimators use different methods of prediction.  
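Soft voting reduces to averaging probabilities and thresholding, as the minimal sketch below shows. The `soft_vote` helper and the probabilities are invented for illustration; they are not outputs of the models in this notebook.

```python
def soft_vote(probabilities_per_model, threshold=0.5):
    """Average the positive-class probabilities across models, then threshold."""
    n_models = len(probabilities_per_model)
    averaged = [sum(column) / n_models for column in zip(*probabilities_per_model)]
    labels = [1 if p > threshold else 0 for p in averaged]
    return averaged, labels


rf_probs = [0.875, 0.375, 0.25]  # invented random forest probabilities for three reviews
gb_probs = [0.75, 0.875, 0.5]    # invented gradient boosting probabilities

averaged, labels = soft_vote([rf_probs, gb_probs])
print(averaged, labels)
```

For the second review the random forest alone would predict negative (0.375), but the more confident gradient boosting estimate pulls the average above 0.5, which is the error-masking effect described above.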

In [None]:
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier, 
                              VotingClassifier)

clf1 = RandomForestClassifier(n_estimators=250, criterion='gini', max_features='auto', class_weight='balanced', verbose=2)  #estimator 1- rf
clf2 = GradientBoostingClassifier(n_estimators=250,  max_features='auto',  verbose=2)  #estimator 2- gb

steps = [('voting', VotingClassifier(estimators=[('randoforest', clf1), ('gradientboost', clf2)], voting='soft'))]
#steps for pipeline, used soft voting

pipeline=Pipeline(steps)
params={}

clf_vote = GridSearchCV(pipeline, param_grid=params, cv=5) #Gridsearch object

clf_vote.fit(trainDataVecs, y_train)  #fit

y_pred_vote=clf_vote.predict(testDataVecs) #predict
y_pred_prob_vote = clf_vote.predict_proba(testDataVecs)[:,1] # Compute predicted probabilities: y_pred_prob

In [129]:
y_pred = y_pred_vote
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score

print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))
print('F1-score: ' +  str(round(f1_score(y_pred, y_test), 5)))

TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

false_positive_rate = FP / float(TN + FP)
false_negative_rate = FN/ float(FN + TP)
balanced_error_rate = (false_negative_rate+false_positive_rate)/2

print('False Positive Rate: ' + str(round(false_positive_rate, 5)))
print('False Negative Rate: ' + str(round(false_negative_rate, 5)))
print('Balanced Error Rate: ' + str(round(balanced_error_rate, 5)) + '\n')

print("AUC of ROC Curve: " + "{}".format(roc_auc_score(y_test, y_pred_prob_vote)))


print('Confusion matrix:')
confusion_matrix(y_pred, y_test)  #rows: predicted class, columns: actual class



Accuracy: 0.8563
F1-score: 0.8956
False Positive Rate: 0.28707
False Negative Rate: 0.07099
Balanced Error Rate: 0.17903

AUC of ROC Curve: 0.9236227802716694
Confusion matrix:


array([[2399,  471],
       [ 966, 6164]])

### Results

In [139]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

print('\033[1m' + 'Classification Metrics for ML Models \n')

d1 = {'Model': ['ZeroR Baseline'], 'Accuracy': [0.6635], 'FPR': [1.0000], 'FNR': [0.0000], 'BER': [0.50000], 'F1-Score': [0.7977], 'AUC': [0.50000]}

d2 = {'Model': ['Random Forest', 'Gradient Boosted Tree', 'Voting Classifier', 'Deep Neural Network'], 
                 'Accuracy': [0.8354, 0.8602, 0.8563, 0.8774], 
                 'FPR': [0.3733, 0.2547, 0.2871, 0.1978], 
                 'FNR': [0.0588, 0.0815, 0.0710, 0.0836],
                 'BER': [0.2160, 0.1681, 0.1790, 0.1407],                         
                 'F1-Score': [0.8836, 0.8971, 0.8956, 0.9078],
                 'AUC': [0.9112, 0.9261, 0.9236, 0.9409]
    }
df1=pd.DataFrame(data=d1)
df2=pd.DataFrame(data=d2)


df1
df2



Classification Metrics for ML Models 

| | Model | Accuracy | FPR | FNR | BER | F1-Score | AUC |
|---|---|---|---|---|---|---|---|
| 0 | ZeroR Baseline | 0.6635 | 1.0 | 0.0 | 0.5 | 0.7977 | 0.5 |

| | Model | Accuracy | FPR | FNR | BER | F1-Score | AUC |
|---|---|---|---|---|---|---|---|
| 0 | Random Forest | 0.8354 | 0.3733 | 0.0588 | 0.2160 | 0.8836 | 0.9112 |
| 1 | Gradient Boosted Tree | 0.8602 | 0.2547 | 0.0815 | 0.1681 | 0.8971 | 0.9261 |
| 2 | Voting Classifier | 0.8563 | 0.2871 | 0.0710 | 0.1790 | 0.8956 | 0.9236 |
| 3 | Deep Neural Network | 0.8774 | 0.1978 | 0.0836 | 0.1407 | 0.9078 | 0.9409 |
