# Random Forest Implementation
---
### By: Tyler Trzecki

This notebook explores the random forest algorithm and applies it to a classification problem.

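The random forest algorithm is used for classification and regression. It is an ensemble learning method composed of multiple decision trees, each trained on a bootstrap sample of the data and considering a random subset of features at each split. For classification, each tree makes its own prediction and the forest combines them, by majority vote or by averaging the trees' predicted class probabilities, to produce the final prediction.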
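
As a minimal sketch of how the trees are combined, the snippet below fits a small forest on synthetic data and checks that the forest's class probabilities equal the mean of the individual trees' probabilities, which is how scikit-learn's `RandomForestClassifier` aggregates them. The data from `make_classification` and the names `X_demo`, `y_demo`, and `forest` are assumptions for illustration only, not part of this notebook's analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, used only for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Class probabilities from each tree: shape (n_trees, n_samples, n_classes)
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own probabilities
mean_probs = tree_probs.mean(axis=0)
forest_probs = forest.predict_proba(X_demo)
print(np.allclose(mean_probs, forest_probs))
```

The predicted class is then the one with the highest averaged probability.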

There are many different models available for classification. Logistic regression is one of the most common for binary outcomes. Other methods include support vector machines (“SVMs”), naive Bayes, and k-nearest neighbors. Random forests tend to shine in scenarios where a model has a large number of features that individually have weak predictive power but much stronger power collectively.[[1]](http://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics)

We will apply the random forest algorithm to the Pima Indians Diabetes database.[[2]](https://www.kaggle.com/uciml/pima-indians-diabetes-database)

In [2]:
import pandas as pd

# list for column headers
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# open file with pd.read_csv
df = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv", names=names)
print(df.shape)

# print head of data set
df.head()

(768, 9)


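|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|------|------|------|------|------|------|-------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |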


In [4]:
X = df.drop('class', axis=1)
y = df['class']

In [5]:
from sklearn.model_selection import train_test_split

# implementing train-test-split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=66)

In [8]:
from sklearn.ensemble import RandomForestClassifier

# random forest model creation
rfc = RandomForestClassifier()
rfc.fit(X_train,y_train)

# predictions
rfc_predict = rfc.predict(X_test)

In [9]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

In [11]:
rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring='roc_auc')

In [12]:
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())

=== Confusion Matrix ===
[[145  31]
 [ 33  45]]


=== Classification Report ===
              precision    recall  f1-score   support

           0       0.81      0.82      0.82       176
           1       0.59      0.58      0.58        78

    accuracy                           0.75       254
   macro avg       0.70      0.70      0.70       254
weighted avg       0.75      0.75      0.75       254



=== All AUC Scores ===
[0.77814815 0.82444444 0.79111111 0.73074074 0.79111111 0.86814815
 0.87851852 0.89555556 0.80961538 0.84269231]


=== Mean AUC Score ===
Mean AUC Score - Random Forest:  0.8210085470085471
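
The class `1` row of the classification report can be recomputed by hand from the confusion matrix printed above. In scikit-learn's layout, rows are true classes and columns are predicted classes, so for the positive class the counts are TN = 145, FP = 31, FN = 33, and TP = 45. A short sketch of the arithmetic (the variable names are illustrative):

```python
# Counts copied from the confusion matrix above (rows = true, columns = predicted)
tn, fp = 145, 31
fn, tp = 33, 45

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Rounded to two decimals, these match the report: 0.59, 0.58, and 0.58 for class `1`, and 0.75 overall accuracy.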


The confusion matrix shows how many test points were classified correctly and incorrectly for each class, which gives you the counts of false positives and false negatives. The classification report summarizes precision, recall, and F1-score for each class, along with the overall accuracy. The ROC curve plots the true positive rate against the false positive rate at various classification thresholds. The `roc_auc` scoring used in the cross-validation reports the area under the ROC curve for each fold.
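
To illustrate how the ROC curve and its area relate, here is a small sketch on synthetic data. The dataset and the names `X_demo`, `model`, and `scores` are assumptions for illustration, not the diabetes data used in this notebook. The curve is built from predicted probabilities for the positive class rather than from hard class labels, and integrating it with `auc` should give the same value as `roc_auc_score`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, used only for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Probability of the positive class; thresholds are swept over these scores
scores = model.predict_proba(X_te)[:, 1]

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_te, scores)

# Area under the curve, computed two ways
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_te, scores)
print(area_from_curve, area_direct)
```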

### Tuning The Model
---

For a detailed description of hyperparameter tuning see William Koehrsen's article, ["Hyperparameter Tuning the Random Forest in Python"](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74).

In [14]:
import numpy as np

In [15]:
from sklearn.model_selection import RandomizedSearchCV

# number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# number of features at every split
# ('auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt')
max_features = ['sqrt', 'log2']

# max depth
max_depth = [int(x) for x in np.linspace(100, 500, num = 11)]
max_depth.append(None)

# create random grid
random_grid = {
 'n_estimators': n_estimators,
 'max_features': max_features,
 'max_depth': max_depth
 }

# Random search of parameters: 100 sampled combinations, 3-fold cross-validation each
rfc_random = RandomizedSearchCV(
    estimator=rfc,
    param_distributions=random_grid,
    n_iter=100,
    cv=3,
    verbose=2,
    random_state=42,
    n_jobs=-1,
)

# Fit the model
rfc_random.fit(X_train, y_train)

# print results
print(rfc_random.best_params_)

Fitting 3 folds for each of 100 candidates, totalling 300 fits
{'n_estimators': 2000, 'max_features': 'sqrt', 'max_depth': 220}
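
Rather than copying the best parameters into a new model by hand, a fitted `RandomizedSearchCV` also exposes `best_score_` and, with the default `refit=True`, a `best_estimator_` that has already been refit on all the data passed to `fit`. The sketch below shows this on synthetic data with a deliberately tiny grid so it runs quickly. The data, the grid values, and the choice of `scoring='roc_auc'` are assumptions for illustration and differ from the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption, used only for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid so the sketch runs in a few seconds
small_grid = {
    'n_estimators': [10, 20, 30],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [3, 5, None],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=5,
    cv=3,
    scoring='roc_auc',
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # the winning parameter combination
print(search.best_score_)    # its mean cross-validated score

# Already refit on all of X_demo with the best parameters
tuned_model = search.best_estimator_
```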


In [17]:
rfc = RandomForestClassifier(n_estimators=2000, max_depth=220, max_features='sqrt')
rfc.fit(X_train,y_train)

rfc_predict = rfc.predict(X_test)

rfc_cv_score = cross_val_score(rfc, X, y, cv=10, scoring='roc_auc')

print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_predict))
print('\n')
print("=== All AUC Scores ===")
print(rfc_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_cv_score.mean())

=== Confusion Matrix ===
[[150  26]
 [ 32  46]]


=== Classification Report ===
              precision    recall  f1-score   support

           0       0.82      0.85      0.84       176
           1       0.64      0.59      0.61        78

    accuracy                           0.77       254
   macro avg       0.73      0.72      0.73       254
weighted avg       0.77      0.77      0.77       254



=== All AUC Scores ===
[0.77555556 0.83481481 0.83185185 0.7362963  0.81111111 0.86333333
 0.86740741 0.91       0.81269231 0.86      ]


=== Mean AUC Score ===
Mean AUC Score - Random Forest:  0.8303062678062677
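
One way to put the two runs side by side is to compare the change in mean AUC with the spread across folds. The sketch below uses only the AUC scores printed in the two outputs above; the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold AUC scores copied from the outputs above
baseline = [0.77814815, 0.82444444, 0.79111111, 0.73074074, 0.79111111,
            0.86814815, 0.87851852, 0.89555556, 0.80961538, 0.84269231]
tuned = [0.77555556, 0.83481481, 0.83185185, 0.7362963, 0.81111111,
         0.86333333, 0.86740741, 0.91, 0.81269231, 0.86]

# Change in mean AUC after tuning, and how much the folds vary among themselves
gain = mean(tuned) - mean(baseline)
print(f"mean AUC gain:        {gain:.4f}")
print(f"fold std (baseline):  {stdev(baseline):.4f}")
print(f"fold std (tuned):     {stdev(tuned):.4f}")
```

The gain in mean AUC is about 0.009, which is smaller than the fold-to-fold standard deviation, so the improvement from tuning is best read as suggestive rather than conclusive.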


For more on random forest algorithm implementation read Jake VanderPlas' [In-Depth: Decision Trees and Random Forests](https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html). 