# Random Forests

This notebook takes a deep dive into how Random Forests work.

Kaggle's most popular competition is [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic). In this challenge, you'll need to predict which passengers survived the tragedy. Bill Howe also uses the Titanic dataset in his lecture on [decision trees](https://class.coursera.org/datasci-001/lecture/143) on Coursera.

Please download the Titanic training set from Kaggle. You might need to create an account first.
- [train.csv](https://www.kaggle.com/c/titanic/data?train.csv)

In [1]:
path_to_repo = '/Users/ruben/repo/personal/ga/DAT-23-NYC/'
path_to_downloads = '/Users/ruben/Downloads/'

In [2]:
import sys
import pandas as pd
from patsy import dmatrix
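try:
    from sklearn.model_selection import train_test_split  # scikit-learn >= 0.18
except ImportError:
    from sklearn.cross_validation import train_test_split  # older releases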
sys.path.append(path_to_repo)
from random_forest import RandomForest, DecisionTree

In [3]:
data = pd.read_csv(path_to_downloads + 'train.csv')
print len(data), 'rows'
data.head(2)

891 rows


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |


We will **not** do a proper analysis of this dataset, nor will we extract interesting features. We will just use these data to demonstrate our home-made random forest algorithm. We leave it up to you to be creative with the model and features.

In [4]:
data['Cabin_letter'] = data.Cabin.str[0]  # We'll create one extra feature
features = ['Pclass',  'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin_letter', 'Embarked']
X = dmatrix(" + ".join(features), data=data.fillna(0), return_type='dataframe')
y = data.Survived

### One Perfect Tree

I have already coded up our home-made `DecisionTree` and `RandomForest` algorithms, which you can find in the [course repo](./random_forest.py).

We can train a single tree on all features and keep splitting until every leaf is as pure as it gets. This gives near-perfect predictions on the training set but poor ones on the test set, because such a tree greatly overfits.

In [5]:
model = DecisionTree(max_features=None).fit(X, y)
print (model.predict(X) == y).mean()

0.986531986532


In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
model.fit(X_train, y_train)
print (model.predict(X_test) == y_test).mean()

0.766816143498
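
The two cells above score the tree by accuracy, first on the rows it was trained on and then on a held-out split. As a rough illustration of what that evaluation involves, here is a pure-Python sketch. The helper names `holdout_split` and `accuracy` are made up for this example, the 25% test fraction mirrors scikit-learn's default for `train_test_split`, and the code uses Python 3 syntax, unlike the notebook cells.

```python
import math
import random


def holdout_split(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(math.ceil(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]


def accuracy(predicted, actual):
    """Fraction of predictions that equal the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


train_idx, test_idx = holdout_split(891)
print(len(train_idx), len(test_idx))          # 668 223
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
```

Because the test rows play no part in fitting, accuracy on them is a fairer estimate of how the tree handles unseen passengers.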


I added a fun method that prints the entire decision tree. The number in parentheses, e.g. `(+0.123)`, is the information gain of that split, i.e. the gain in purity.

In [7]:
print model.to_str()

if [Sex[T.male]] > 0.000000:  (+0.231890)
  if [Fare] > 26.250000:  (+0.049287)
    if [SibSp] > 2.000000:  (+0.058861)
      if [Age] > 3.000000:  (+0.079448)
        0
      else:
        if [Age] > 2.000000:  (+0.591673)
          1
        else:
          0
    else:
      if [Fare] > 26.387500:  (+0.036855)
        if [Age] > 35.000000:  (+0.044422)
          if [Cabin_letter[T.A]] > 0.000000:  (+0.052288)
            if [Age] > 36.000000:  (+0.198117)
              if [Age] > 49.000000:  (+0.251629)
                if [Fare] > 30.000000:  (+0.311278)
                  if [Fare] > 34.654200:  (+0.918296)
                    1
                  else:
                    0
                else:
                  1
              else:
                1
            else:
              0
          else:
            if [Age] > 60.000000:  (+0.051437)
              0
            else:
              if [Age] > 47.000000:  (+0.080347)
                if [Age] > 52.000000:  (+0.257831)
    

We can easily count the number of nodes in the tree:

In [8]:
model.count_nodes()

277
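
The gains printed next to each split in the tree above are consistent with entropy-based information gain: the entropy of the parent node minus the size-weighted entropy of its two children. The sketch below is an assumption about that formula, not a copy of the code in `random_forest.py`, and the function names are chosen for illustration. It uses Python 3 syntax.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Three samples with a 2:1 class mix, separated perfectly by the split.
print(round(information_gain([1, 0, 0], [1], [0, 0]), 6))        # 0.918296
# Four balanced samples where only one child ends up pure.
print(round(information_gain([1, 0, 0, 1], [1, 0, 0], [1]), 6))  # 0.311278
```

Both values also appear in the printout above, which is what suggests the tree measures purity with entropy in bits.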

### Random Forest

In our search for the best split, we could limit the number of features we consider. This forces each tree to grow in a different direction, which generally makes an individual tree worse. But by 'bagging' many of these weak models, we create a powerful ensemble. The idea is that the noise of the bad models cancels out, while the good ones preserve their power.
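
The cancelling-out argument can be checked with a little arithmetic. The sketch below is an illustration, not part of the course code: it assumes every model is correct with the same probability and that the models err independently, which real trees trained on the same data do not. `majority_vote_accuracy` is a hypothetical helper name, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """Probability that more than half of n independent voters are right."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct ** k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Accuracy of the vote as the number of 60%-accurate models grows.
for n in (1, 11, 51):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more reliable as models are added, as long as each model is better than a coin flip. Correlated errors shrink the benefit, which is why a forest tries to make its trees differ.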

Let's repeat the above tree model, but with fewer features per split. By default, `max_features` is smaller than `n_features`, which creates a 'worse' model.
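
To make the idea of restricting features concrete, here is a sketch of how a split might pick its candidate features. It is a guess at the mechanism, not the code from `random_forest.py`: the helper name `candidate_features` and the square-root default are assumptions, while `max_features=None` meaning "use every feature" follows the single tree trained earlier. It uses Python 3 syntax.

```python
import math
import random


def candidate_features(n_features, max_features="sqrt", rng=random):
    """Pick the feature indices that one split is allowed to consider."""
    if max_features is None:
        k = n_features  # consider every feature
    elif max_features == "sqrt":
        k = max(1, int(math.sqrt(n_features)))  # assumed default rule
    else:
        k = min(int(max_features), n_features)
    # Sample without replacement so no feature is listed twice.
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
for _ in range(3):
    # Each call stands for one split and draws a fresh subset of 4 out of 16.
    print(candidate_features(16, rng=rng))
```

Drawing a new subset at every split, not once per tree, means even a strong feature is sometimes unavailable, so other features get a chance to shape the tree.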

In [9]:
model = DecisionTree().fit(X, y)  
print "In-sample accuracy    ", (model.predict(X) == y).mean()
model.fit(X_train, y_train)
print "Out-of-sample accuracy", (model.predict(X_test) == y_test).mean()

In-sample accuracy     0.881032547699
Out-of-sample accuracy 0.77130044843


This tree may already perform slightly better on the out-of-sample set than our earlier model. But we won't stop there: let's plant an entire forest.

In [10]:
model = RandomForest(n_estimators=50).fit(X, y)  
print "In-sample accuracy    ", (model.predict(X) == y).mean()
model.fit(X_train, y_train)
print "Out-of-sample accuracy", (model.predict(X_test) == y_test).mean()

In-sample accuracy     0.892255892256
Out-of-sample accuracy 0.816143497758


Indeed, the forest does much better on the out-of-sample set. Note that our home-made implementation is very slow.
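
For readers who want the gist of bagging without opening the source, here is a compact sketch. It assumes standard bootstrap aggregation (each model is fit on rows drawn with replacement, and the forest predicts by majority vote). Whether the home-made `RandomForest` works exactly this way is not shown in this notebook. All function names are hypothetical, and `fit_majority_stub` is a trivial stand-in for a real tree learner. It uses Python 3 syntax.

```python
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def fit_forest(rows, labels, fit_tree, n_estimators=50, seed=0):
    """Fit n_estimators models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        idx = bootstrap_indices(len(rows), rng)
        forest.append(fit_tree([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    """Return the class predicted by most of the fitted models."""
    votes = Counter(model(row) for model in forest)
    return votes.most_common(1)[0][0]


def fit_majority_stub(rows, labels):
    """Hypothetical stand-in for a tree: always predicts the majority label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority


forest = fit_forest([[0], [1], [2], [3]], [0, 0, 0, 1], fit_majority_stub,
                    n_estimators=25)
print(predict_forest(forest, [1]))
```

Swapping `fit_majority_stub` for a function that grows a real tree on random feature subsets would turn this skeleton into a random forest.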

### Benchmark

Let's briefly check how `sklearn`'s model performs.

In [11]:
from sklearn.ensemble import RandomForestClassifier

In [12]:
model = RandomForestClassifier(n_estimators=50).fit(X, y)  
print "In-sample accuracy    ", (model.predict(X) == y).mean()
model.fit(X_train, y_train)
print "Out-of-sample accuracy", (model.predict(X_test) == y_test).mean()

In-sample accuracy     0.986531986532
Out-of-sample accuracy 0.811659192825


- Definitely faster.
- Definitely more accurate on the training set (0.987 vs. 0.892).
- Not necessarily more accurate on the test set (0.812 vs. 0.816 here).

### Further Studying

- Please refer to the [source code](./random_forest.py) if you'd like to figure out exactly how a random forest works.
- I added a few links to papers to the [course repo](..) if you'd like to read more.