In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.cluster import KMeans

In [2]:
df = pd.read_csv('../../datasets/heart.csv')

In [10]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


#### Preparing the training and test set

In [3]:
# Hold out 20% of the rows for testing; 'target' is the label column
x_train, x_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=10)

### Random Forest Classification

Definition

Random forests are ensembles of decision trees. A single deep decision tree will most likely overfit the data, but the errors of many trees trained on different samples partly cancel out, so the ensemble usually makes a better estimator than any one of its members.

A random forest fits a number of decision tree classifiers on various subsamples of the dataset and averages their predictions to improve predictive accuracy and control overfitting.
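
To make the "fit on subsamples, then average" idea concrete, here is a minimal hand-rolled sketch of bagging decision trees. It is not how `RandomForestClassifier` is implemented internally, only an illustration of the principle. The synthetic data from `make_classification`, the tree count of 50, and all variable names are assumptions chosen for the example; the heart dataset is not used so that the cell runs on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the heart dataset used above).
X, y = make_classification(n_samples=600, n_features=13, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

rng = np.random.RandomState(0)
n_trees = 50
proba_sum = np.zeros((len(X_te), 2))
first_tree_pred = None

for i in range(n_trees):
    # Bootstrap: draw training rows with replacement.
    idx = rng.randint(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" makes each split consider a random feature subset.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    if first_tree_pred is None:
        first_tree_pred = tree.predict(X_te)
    # Accumulate the per-class probabilities predicted by each tree.
    proba_sum += tree.predict_proba(X_te)

# Average the probabilities and pick the more likely class (labels are 0/1).
ensemble_pred = (proba_sum / n_trees).argmax(axis=1)

single_tree_acc = (first_tree_pred == y_te).mean()
ensemble_acc = (ensemble_pred == y_te).mean()
print(f"single tree: {single_tree_acc:.3f}  ensemble of {n_trees}: {ensemble_acc:.3f}")
```

On most runs the averaged ensemble should score at least as well as the single unpruned tree, although that is not guaranteed for every dataset or seed.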

In [43]:
# Vanilla Random Forest Classifier
model = RandomForestClassifier(max_depth=5)
model.fit(x_train, y_train)
y_predict = model.predict(x_test)
accuracy_score(y_test, y_predict)

0.80327868852459017

Hyperparameter Intuition:
1. How does the number of trees help reduce variance, given that the primary aim of this ensemble method is to reduce variance?
2. How do you estimate a good tree depth to achieve better model accuracy?

In [39]:
# Shallower trees (max_depth=3) and only 5 trees in the forest
rfc_trees_5 = RandomForestClassifier(max_depth=3, n_estimators=5)
rfc_trees_5.fit(x_train, y_train)

y_predict = rfc_trees_5.predict(x_test)
accuracy_score(y_test, y_predict)

0.77049180327868849

The key point here is that deeper trees reduce bias, while more trees reduce variance.
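
One rough way to see the bias/variance claim is to refit the forest under several random seeds and look at the mean and spread of test accuracy for different tree counts and depths. The sketch below does that on synthetic data; the dataset, the helper name `accuracy_spread`, and the grid of values are all assumptions made for illustration. Note that the spread across seeds only captures the randomness of the forest itself, not the variance from drawing a different training set, so treat it as a proxy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the heart dataset used above).
X, y = make_classification(n_samples=600, n_features=13, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)


def accuracy_spread(n_estimators, max_depth, seeds=range(10)):
    """Return (mean, std) of test accuracy over several forest seeds."""
    scores = []
    for seed in seeds:
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth,
                                       random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return float(np.mean(scores)), float(np.std(scores))


results = {}
for n_estimators in (1, 10, 100):
    for max_depth in (1, 5):
        mean, std = accuracy_spread(n_estimators, max_depth)
        results[(n_estimators, max_depth)] = (mean, std)
        print(f"trees={n_estimators:>3} depth={max_depth}: "
              f"mean={mean:.3f} std={std:.3f}")
```

If the intuition holds on this data, the standard deviation should tend to shrink as the number of trees grows, while the mean should tend to rise as depth increases from 1 to 5.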

You can choose both values by cross-validating against a range of candidate settings.
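
Cross-validating over tree depth and tree count could look something like the sketch below, which uses scikit-learn's `GridSearchCV`. The candidate values in `param_grid`, the 5-fold setting, and the synthetic data are assumptions picked for illustration, not recommendations for the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the heart dataset used above).
X, y = make_classification(n_samples=600, n_features=13, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Hypothetical candidate values for the two hyperparameters discussed above.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "n_estimators": [10, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=10),
    param_grid,
    cv=5,                # 5-fold cross-validation on the training set only
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("mean CV accuracy of best params:", round(search.best_score_, 3))
# The held-out test set is touched once, after the search has finished.
print("test accuracy:", round(search.score(X_te, y_te), 3))
```

Keeping the test set out of the search means the final accuracy is not biased by the hyperparameter selection.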

In [13]:
y_pred = KMeans(n_clusters=2, random_state=170).fit_predict(x_train)

In [15]:
y_pred

array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1], dtype=int32)

References:

1. https://www.kaggle.com/general/4092
2. https://github.com/glouppe/phd-thesis
3. https://stats.stackexchange.com/questions/53240/practical-questions-on-tuning-random-forests