### Random Forest Exercise

------------------

In [1]:
# import pandas
import pandas as pd

In [2]:
# list for column headers
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# load data
df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)

Spend some time exploring the dataset:
- head
- shape

In [3]:
df.head()

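| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |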


In [4]:
df.shape

(768, 9)
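Beyond `head` and `shape`, a few more checks can be useful. The sketch below is only an illustration: it retypes the five rows printed above so that it runs offline, and the name `sample` is hypothetical. Treating zeros in columns such as `skin` or `test` as possible missing measurements is an assumption worth checking against the dataset description, not something this notebook states.

```python
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# The five rows shown by df.head() above, typed in by hand (hypothetical stand-in for df).
sample = pd.DataFrame(
    [
        [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
        [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
        [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
        [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
        [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
    ],
    columns=names,
)

# Summary statistics for every column.
print(sample.describe())

# How many rows fall into each target class.
class_counts = sample['class'].value_counts()
print(class_counts)

# Number of zero entries per feature (assumption: zeros may encode missing values).
zero_counts = (sample.drop(columns='class') == 0).sum()
print(zero_counts)
```

On the full data, the same three calls would be applied to `df` instead of `sample`.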

* create X and y (the goal is to predict the **class** column from the other variables)

In [7]:
X = df.drop(columns='class')
y = df['class']

* split the dataset into a training set and a test set

In [10]:
from sklearn.model_selection import train_test_split

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
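The split above is not seeded, so the numbers further down will differ from run to run. A minimal sketch of a reproducible, stratified split follows. It uses synthetic data from `make_classification` as a stand-in for the diabetes table, and the `_demo` names, the class weights, and the seed values are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the exercise data (768 rows, 8 features).
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)

# random_state fixes the shuffle; stratify keeps class proportions similar in both parts.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

print(X_train_demo.shape, X_test_demo.shape)
print(y_train_demo.mean(), y_test_demo.mean())
```

With 768 rows and `test_size=0.3`, the test set holds 231 rows, which matches the totals of the confusion matrices printed later in this notebook.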

------------------------
#### Part 1: Setting up the Random Forest Classifier
* import RandomForestClassifier from sklearn. It is worth spending some time on the documentation of this classifier to get familiar with the available parameters.

In [12]:
from sklearn.ensemble import RandomForestClassifier

* create model

In [13]:
clf = RandomForestClassifier()

* fit the model to the training set using the default parameters

In [14]:
clf.fit(X_train, y_train)

RandomForestClassifier()

* predict the labels for X_test

In [15]:
y_pred = clf.predict(X_test)

* import the metrics module from sklearn (it provides roc_auc_score and confusion_matrix)

In [16]:
from sklearn import metrics

* print confusion matrix

In [19]:
print(metrics.confusion_matrix(y_test, y_pred))

[[120  28]
 [ 32  51]]


* print AUC (computed here from the predicted labels)

In [20]:
print(metrics.roc_auc_score(y_test, y_pred))

0.712634321068056
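Because `roc_auc_score` receives hard 0/1 predictions here, the ROC curve has a single operating point and the result equals the balanced accuracy: (51/83 + 120/148) / 2 ≈ 0.7126. AUC is more commonly computed from class probabilities. The sketch below shows both variants on synthetic data; the `_demo` names and the seeds are illustrative assumptions, so its scores are not comparable to the ones in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the exercise data.
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

clf_demo = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# AUC from hard labels: identical to balanced accuracy for a binary problem.
auc_labels = roc_auc_score(y_te, clf_demo.predict(X_te))
bal_acc = balanced_accuracy_score(y_te, clf_demo.predict(X_te))

# AUC from the predicted probability of the positive class (column 1).
auc_proba = roc_auc_score(y_te, clf_demo.predict_proba(X_te)[:, 1])

# Reproduce the value printed above from its confusion matrix.
from_matrix = (51 / 83 + 120 / 148) / 2

print(auc_labels, bal_acc, auc_proba, from_matrix)
```

In the notebook itself, the probability-based variant would be `metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])`.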


----------------------------------
#### Part 2: Using a Grid Search
- import GridSearchCV from sklearn

In [21]:
from sklearn.model_selection import GridSearchCV

* create the grid (optimize the number of trees and the maximum depth of each tree)

In [96]:
clf = RandomForestClassifier()

In [97]:
params = {
    'n_estimators': [450, 500, 550],
    'max_depth': [7, 8, 9, 10, 11]
}

In [98]:
model_to_fit = GridSearchCV(estimator=clf, param_grid=params, n_jobs=-1)

* fit training data with grid search

In [99]:
model_to_fit.fit(X_train, y_train)

GridSearchCV(estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'max_depth': [7, 8, 9, 10, 11],
                         'n_estimators': [450, 500, 550]})

In [100]:
best_model = model_to_fit.best_estimator_

* print confusion matrix with the best model

In [101]:
best_model

RandomForestClassifier(max_depth=10, n_estimators=500)
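For a classifier, `GridSearchCV` ranks candidates by cross-validated accuracy unless `scoring` is set. Since this exercise evaluates AUC, one option is to search on AUC directly and inspect `best_params_` and `best_score_`. The following is a sketch on synthetic data with a deliberately small grid so it runs quickly; the grid values, `cv=5`, the seeds, and the `_demo` names are illustrative assumptions, not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

grid_demo = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [20, 40], 'max_depth': [3, 5]},
    scoring='roc_auc',  # rank candidates by cross-validated AUC instead of accuracy
    cv=5,               # 5-fold cross-validation on the data passed to fit
)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)  # the winning parameter combination
print(grid_demo.best_score_)   # its mean cross-validated AUC
```

`best_score_` is measured only on the data passed to `fit`, so the held-out test set still gives the final, independent check.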

In [102]:
y_pred = best_model.predict(X_test)

In [103]:
print(metrics.confusion_matrix(y_test, y_pred))

[[123  25]
 [ 30  53]]


* print AUC with the best model

In [104]:
metrics.roc_auc_score(y_test, y_pred)

0.7348176489742755

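- is the tuned model better than the default one?

> Yes, slightly. On this test split the AUC rises from 0.713 to 0.735, and the number of misclassified samples drops from 60 to 55. Because neither the split nor the forests are seeded, a gain this small may partly reflect run-to-run randomness.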