# Overview

In this notebook, I explore classifiers other than Logistic Regression, specifically Random Forests, to improve the performance of our predictions.

In [1]:
import numpy as np
import pandas as pd

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
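# Sklearn imports
# Note: Imputer was removed in scikit-learn 0.22; on newer versions use
# sklearn.impute.SimpleImputer (which has no `axis` argument) instead.
from sklearn.preprocessing import Imputer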
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, StratifiedKFold, KFold
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score
from sklearn.utils import resample

In [4]:
import helperFunctions

In [5]:
# Setting the random state for later use
random_state = 565

## Load datasets

In [6]:
X_train, y_train = helperFunctions.load_clean_encode('training.csv', delimiter=';')

In [7]:
X_valid, y_valid = helperFunctions.load_clean_encode('validation.csv', delimiter=';')


Make sure that the train and validation sets have the same columns.

In [8]:
X_train, X_valid = helperFunctions.equalizeColumns(X_train, X_valid)
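
The body of `helperFunctions.equalizeColumns` is not shown in this notebook. As a rough sketch of what such a step could look like, the hypothetical `equalize_columns_sketch` below reindexes the validation frame onto the training columns, filling columns that are missing from validation with 0 and dropping columns that only exist in validation. The function name, the fill value, and the toy frames are all assumptions made for illustration; the real helper may behave differently (for example, by keeping the union of columns).

```python
import pandas as pd

def equalize_columns_sketch(train, valid):
    """Hypothetical stand-in for helperFunctions.equalizeColumns (assumed behaviour)."""
    # Reindex validation onto the training columns:
    #  - columns missing from validation are created and filled with 0
    #  - columns that exist only in validation are dropped
    valid_aligned = valid.reindex(columns=train.columns, fill_value=0)
    return train, valid_aligned

# Toy one-hot encoded frames (made-up data, not the notebook's datasets)
train_toy = pd.DataFrame({'age': [30, 41], 'job_admin': [1, 0], 'job_tech': [0, 1]})
valid_toy = pd.DataFrame({'age': [25], 'job_tech': [1], 'job_other': [1]})

train_eq, valid_eq = equalize_columns_sketch(train_toy, valid_toy)
print(list(valid_eq.columns))
```

Aligning on the training columns matters because the fitted pipeline expects the same features, in the same order, at prediction time.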

## 0 - Random Forests

In [9]:
rfPipe0 = Pipeline(steps = [
    ('imputer', Imputer(strategy='mean', axis=0)),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=random_state)),
])

In [10]:
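# shuffle=False makes the folds deterministic, so random_state is not needed
# (recent scikit-learn versions raise an error if it is set without shuffling)
scores = cross_val_score(estimator=rfPipe0, X=X_train, y=y_train, n_jobs=-1, scoring='accuracy', verbose=10,
                         cv=StratifiedKFold(n_splits=5, shuffle=False))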
print('CV Accuracy scores: %s' % scores)
print('CV score mean: %.2f' % np.mean(scores))

[Parallel(n_jobs=-1)]: Done   2 out of   5 | elapsed:    2.2s remaining:    3.4s
[Parallel(n_jobs=-1)]: Done   3 out of   5 | elapsed:    3.0s remaining:    2.0s


CV Accuracy scores: [ 0.97638889  0.97916667  0.98194444  0.97635605  0.97771588]
CV score mean: 0.98


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.7s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    3.7s finished


In [11]:
rfPipe0 = rfPipe0.fit(X=X_train, y=y_train)
accuracy_score(y_pred=rfPipe0.predict(X=X_valid), y_true=y_valid)

0.80000000000000004

__Try to adjust the parameters to reduce overfitting and also account for unbalanced classes with weight adjustments__

In [12]:
rfPipe0.set_params(**{ 
               'clf__n_estimators': 100, 
               'clf__max_depth': None,
               'clf__min_samples_leaf': 20,
               'clf__class_weight': 'balanced'
              })
rfPipe0 = rfPipe0.fit(X=X_train, y=y_train)
accuracy_score(y_pred=rfPipe0.predict(X=X_valid), y_true=y_valid)

0.84102564102564104
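
To make the `class_weight='balanced'` setting more concrete: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula with NumPy on a made-up label vector; the helper name `balanced_class_weights` and the toy labels are assumptions for illustration only.

```python
import numpy as np

def balanced_class_weights(y):
    """Reproduce the 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels: 8 majority samples (0) and 2 minority samples (1)
y_toy = [0] * 8 + [1] * 2
weights_toy = balanced_class_weights(y_toy)
print(weights_toy)  # {0: 0.625, 1: 2.5}
```

The rarer class gets the larger weight, so misclassifying a minority sample costs more during tree construction.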

## 1 - Random Forests - Oversampling Minority Class

In [13]:
# First resample the minority class to get the same number of samples as the majority class
X_upsample, y_upsample = resample(X_train[y_train == 1], y_train[y_train == 1], 
                                  replace=True, n_samples=X_train[y_train == 0].shape[0])

# Now concatenate the majority set with the upsampled minority set
xBal_Ovr = pd.concat([X_train[y_train==0], X_upsample], axis=0)
yBal_Ovr = pd.concat([y_train[y_train==0], y_upsample], axis=0)

In [14]:
yBal_Ovr.value_counts()

1    3328
0    3328
Name: classLabel, dtype: int64

In [15]:
rfPipe2 = Pipeline(steps = [
    ('imputer', Imputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=random_state)),
])

__Training CV__

In [16]:
scores = cross_val_score(estimator=rfPipe2, X=xBal_Ovr, y=yBal_Ovr, n_jobs=-1, scoring='accuracy', cv=10)
print('CV scores: %s' % scores)
print('CV score mean: %.2f' % np.mean(scores))

CV scores: [ 1.          0.9984985   0.9984985   1.          1.          1.          0.9984985
  1.          1.          0.99698795]
CV score mean: 1.00


__Validation Score__

In [17]:
rfPipe2 = rfPipe2.fit(X=xBal_Ovr, y=yBal_Ovr)
accuracy_score(y_pred=rfPipe2.predict(X=X_valid), y_true=y_valid)

0.76923076923076927
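
One thing that may contribute to the gap between the CV mean of 1.00 and the validation accuracy of 0.77 is that the oversampling was done before cross-validation, so copies of the same minority row can end up in both the training and the test part of a fold. A minimal sketch of the alternative, upsampling inside each training fold only, is shown below. It runs on synthetic data from `make_classification`; the function name `cv_with_fold_oversampling`, the data, and the forest settings are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

def cv_with_fold_oversampling(X, y, n_splits=5, seed=0):
    """Upsample the minority class (label 1) inside each training fold only."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        minority = y_tr == 1
        # Draw minority rows with replacement until they match the majority count
        X_up, y_up = resample(X_tr[minority], y_tr[minority], replace=True,
                              n_samples=int((~minority).sum()), random_state=seed)
        X_bal = np.vstack([X_tr[~minority], X_up])
        y_bal = np.concatenate([y_tr[~minority], y_up])
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        clf.fit(X_bal, y_bal)
        # The held-out fold is left untouched, so no duplicated rows leak into it
        scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return np.array(scores)

# Synthetic imbalanced data (roughly 90% / 10%), not the notebook's dataset
X_toy, y_toy = make_classification(n_samples=400, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)
fold_scores = cv_with_fold_oversampling(X_toy, y_toy)
print('Fold scores:', fold_scores)
print('Mean: %.2f' % fold_scores.mean())
```

Scores obtained this way are usually a more honest estimate of how the model behaves on unseen data, although a difference between the training and validation distributions could still explain part of the gap.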

__GridSearch Hyperparameter Tuning__

In [18]:
param_grid = [{ 
               'clf__n_estimators': [1, 20, 50, 100, 150, 200], 
               'clf__max_depth': [None, 10, 50, 100],
               'clf__min_samples_leaf': [1, 10, 20, 50],
              }]

# Using a predefined function for gridSearch in helperFunctions
helperFunctions.gridSearch(rfPipe2, param_grid, xBal_Ovr, yBal_Ovr, scoring='neg_log_loss', cv=5)

Best score: -0.011
Best parameters set:
	clf__max_depth: None
	clf__min_samples_leaf: 1
	clf__n_estimators: 100


Grid scores:
-0.368 (+/-0.211) for {'clf__max_depth': None, 'clf__min_samples_leaf': 1, 'clf__n_estimators': 1}
-0.012 (+/-0.004) for {'clf__max_depth': None, 'clf__min_samples_leaf': 1, 'clf__n_estimators': 20}
-0.011 (+/-0.002) for {'clf__max_depth': None, 'clf__min_samples_leaf': 1, 'clf__n_estimators': 50}
-0.011 (+/-0.002) for {'clf__max_depth': None, 'clf__min_samples_leaf': 1, 'clf__n_estimators': 100}
-0.011 (+/-0.002) for {'clf__max_depth': None, 'clf__min_samples_leaf': 1, 'clf__n_estimators': 150}
-0.011 (+/-0.002) for {'clf__max_depth': None, 'clf__min_samples_leaf': 1, 'clf__n_estimators': 200}
-0.302 (+/-0.081) for {'clf__max_depth': None, 'clf__min_samples_leaf': 10, 'clf__n_estimators': 1}
-0.096 (+/-0.011) for {'clf__max_depth': None, 'clf__min_samples_leaf': 10, 'clf__n_estimators': 20}
-0.091 (+/-0.009) for {'clf__max_depth': None, 'clf__min_samples_leaf'
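
The implementation of `helperFunctions.gridSearch` is not included here. Judging only from the printed output above, it probably wraps scikit-learn's `GridSearchCV` and prints the best score, the best parameters, and the mean and spread for every grid point. The sketch below is an assumed reconstruction: the name `grid_search_sketch`, the use of one standard deviation for the `+/-` figure, and the toy data are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def grid_search_sketch(pipe, param_grid, X, y, scoring='neg_log_loss', cv=5):
    """Hypothetical stand-in for helperFunctions.gridSearch."""
    search = GridSearchCV(estimator=pipe, param_grid=param_grid, scoring=scoring, cv=cv)
    search.fit(X, y)
    print('Best score: %.3f' % search.best_score_)
    print('Best parameters set:')
    for name in sorted(search.best_params_):
        print('\t%s: %r' % (name, search.best_params_[name]))
    print('Grid scores:')
    results = search.cv_results_
    for mean, std, params in zip(results['mean_test_score'],
                                 results['std_test_score'],
                                 results['params']):
        # Assumption: the spread shown is one standard deviation across folds
        print('%.3f (+/-%.3f) for %r' % (mean, std, params))
    return search

# Small synthetic problem and grid so the sketch runs quickly
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
pipe_toy = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),
])
grid_toy = {'clf__n_estimators': [5, 20], 'clf__min_samples_leaf': [1, 10]}
search_toy = grid_search_sketch(pipe_toy, grid_toy, X_toy, y_toy, cv=3)
```

Because the scoring is `neg_log_loss`, scores are negative and values closer to zero are better, which is why -0.011 ranks as the best score above.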

__Final Validation__

The GridSearch above selected the following as the best parameters:
```
	clf__max_depth: None
	clf__min_samples_leaf: 1
	clf__n_estimators: 100
```
However, the best validation performance was actually obtained with the parameters below. I went through the GridSearch results and chose more conservative parameters to reduce overfitting.

In [19]:
rfPipe2.set_params(**{ 
               'clf__n_estimators': 200, 
               'clf__max_depth': 10,
               'clf__min_samples_leaf': 50,
              })
rfPipe2 = rfPipe2.fit(X=xBal_Ovr, y=yBal_Ovr)
accuracy_score(y_pred=rfPipe2.predict(X=X_valid), y_true=y_valid)

0.83589743589743593
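
Accuracy alone can hide how the model does on the minority class. `confusion_matrix` and `f1_score` are imported at the top of the notebook but not used, so here is a short sketch of how they could complement `accuracy_score`. The label vectors are made up for illustration and are not results from this dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy labels: 6 negatives and 4 positives, with 3 mistakes in the predictions
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)
cm = confusion_matrix(y_true_toy, y_pred_toy)  # rows: true class, columns: predicted class
f1 = f1_score(y_true_toy, y_pred_toy)          # F1 for the positive class (label 1)

print('Accuracy: %.2f' % acc)
print('Confusion matrix:\n%s' % cm)
print('F1 (class 1): %.2f' % f1)
```

On the notebook's data the same calls would take `y_valid` and `rfPipe2.predict(X_valid)` as arguments.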

# Summary

The best performance with the RandomForests classifier was about 84% accuracy on the validation dataset. This was achieved after adjusting the parameters to reduce overfitting.

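RandomForests does not appear to be strongly affected by class imbalance here: after tuning, class weighting (84.1%) and oversampling the minority class (83.6%) gave nearly the same validation accuracy in the two rounds shown above.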

