## Why Random Forests?

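- Few hyperparameters to tune.
- Good predictive performance, often with little tuning.
- No need to standardize the training data, because tree splits depend only on the ordering of feature values.
- Built-in validation: the out-of-bag samples give an error estimate similar to cross-validation.
- Feature importance can be quantified.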
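
As a minimal sketch of the last two points, the snippet below fits a forest on synthetic data (an assumption made for illustration, not the Titanic data used later) and reads the out-of-bag score and the per-feature importances from scikit-learn's `RandomForestClassifier`. The sample size, feature counts, and `random_state` values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (illustration only): 500 rows, 6 features,
# of which 3 carry signal and 3 are noise.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# oob_score=True scores each row using only the trees that did not
# see that row in their bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

print("Out-of-bag accuracy:", forest.oob_score_)

# Impurity-based importances, one per feature; they sum to 1.
for i, importance in enumerate(forest.feature_importances_):
    print("feature %d: %.3f" % (i, importance))
```

The out-of-bag score comes for free with training, so no separate validation split is needed to get a first estimate of generalization accuracy.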

## What are they?

Let's start with a decision tree. Here is an example: http://thiscommutes.com/static/images/machine_learning_trees_bagging_boosting_pt_3a_1.png

Decision trees are constructed from training data. The learning algorithm searches for the features, and the thresholds on those features, that best split the data. A random forest combines the predictions of many such trees to make a decision.
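
The combining step can be sketched as a majority vote. The votes below are made up for illustration: each row holds one hypothetical tree's predictions for four passengers. Note that scikit-learn's random forest averages the trees' predicted class probabilities rather than counting hard votes, so this is a simplified picture.

```python
import numpy as np

# Hypothetical predictions from 5 trees (rows) for 4 passengers (columns);
# 1 = survived, 0 = did not survive. These values are made up.
tree_votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])

# Fraction of trees voting "survived" for each passenger.
vote_share = tree_votes.mean(axis=0)

# Majority vote: predict 1 when more than half of the trees say 1.
forest_prediction = (vote_share > 0.5).astype(int)

print(vote_share)         # [0.8 0.2 0.6 0.2]
print(forest_prediction)  # [1 0 1 0]
```

Individual trees disagree on the third and fourth passengers, but the vote settles on a single answer for each.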

In [6]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

In [33]:
df = pd.read_csv('data/titanic/train.csv', header=0)

In [34]:
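# Encode sex as an integer: female -> 0, male -> 1
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
# Fill missing ages with the median age (avoids chained-indexing assignment)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Simple engineered features
df['FamilySize'] = df['SibSp'] + df['Parch']
df['Age*Pclass'] = df['Age'] * df['Pclass']
# Drop columns that are not used as model inputs
df = df.drop(['Name', 'PassengerId', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)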

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
Survived      891 non-null int64
Pclass        891 non-null int64
Age           891 non-null float64
SibSp         891 non-null int64
Parch         891 non-null int64
Fare          891 non-null float64
Gender        891 non-null int64
FamilySize    891 non-null int64
Age*Pclass    891 non-null float64
dtypes: float64(3), int64(6)
memory usage: 62.7 KB


In [36]:
mask = np.random.rand(len(df)) < 0.8
train = df[mask].values
test = df[~mask].values
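
Because the mask is drawn without a seed, the split (and therefore the accuracy reported at the end) changes from run to run. A minimal sketch of a reproducible version of the same idea, using NumPy's seeded generator, is shown here. The seed value 42 is an arbitrary assumption, and `n_rows` stands in for `len(df)`.

```python
import numpy as np

n_rows = 891  # number of rows reported by df.info() for the Titanic training file

# Seeded generator: the same seed always yields the same mask.
rng = np.random.default_rng(42)  # 42 is an arbitrary choice
mask = rng.random(n_rows) < 0.8

# mask marks training rows and ~mask marks test rows;
# together they cover every row exactly once.
n_train = int(mask.sum())
n_test = int((~mask).sum())
print(n_train, n_test)
```

The mask is then used exactly as before (`df[mask].values` and `df[~mask].values`). scikit-learn's `train_test_split` with a `random_state` argument is another common way to get a repeatable split.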

In [37]:
from sklearn.ensemble import RandomForestClassifier 

In [52]:
forest = RandomForestClassifier(n_estimators=100)

In [39]:
# Column 0 is Survived (the label); the remaining columns are the features
forest = forest.fit(train[:, 1:], train[:, 0])

In [41]:
output = forest.predict(test[:, 1:])

In [42]:
output

array([ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0.,
        1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,
        0.,  1.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  1.,
        1.,  1.,  1.,  0.,  0.,  1.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,
        1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,
        0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,
        1.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  1.,
        0.,  1.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,
        1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,
        0.,  0.,  0.,  1.,  0.,  0.,  1.,  1.,  1.,  1.,  0.,  0

In [50]:
# Accuracy: fraction of test rows where the prediction matches the true label
(output == test[:, 0]).mean()

0.8191489361702128
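
An accuracy figure is easier to interpret next to a trivial baseline, such as always predicting the most common class. The sketch below shows both computations on made-up labels and predictions (not the Titanic results above); the names `y_true` and `y_pred` are hypothetical stand-ins for `test[:, 0]` and `output`.

```python
import numpy as np

# Made-up labels and predictions for eight passengers (illustration only).
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0, 0, 0])

# Accuracy: fraction of predictions that match the true labels.
accuracy = (y_pred == y_true).mean()

# Baseline: always predict the most common class in y_true.
majority_class = np.bincount(y_true).argmax()
baseline = (y_true == majority_class).mean()

print(accuracy, baseline)  # 0.75 0.625
```

A model is only useful to the extent that its accuracy exceeds this baseline.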