# Random Forest demo
This is a demo of my random forest algorithm.  The data set is very small, so the predictions will not be very good, but it is enough to demonstrate how to use the algorithm.

In [1]:
import pandas as pd
import numpy as np
import sys
import random

## Demo data
I've included a small set of demo data: not enough to produce a good model, but enough to demonstrate the algorithm.  The data comes from a physical source, so the labels should be predictable, but I've anonymized it to obscure its meaning.  There are 500 entries, each with 40 features and one label.  About 80% of the labels are zeros and 20% are ones.
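Because roughly 80% of the labels are zeros, a model that always predicts zero would already score about 0.80 accuracy, which is a useful yardstick for the accuracies reported later. Here is a minimal sketch of that baseline, using a synthetic stand-in label array rather than the real demo data:

```python
import numpy as np

# Synthetic stand-in for the label column: 400 zeros and 100 ones,
# matching the roughly 80/20 split described above (not the real data).
labels = np.array([0] * 400 + [1] * 100)

# Fraction of rows in each class.
class_fractions = np.bincount(labels) / len(labels)

# Accuracy of a "model" that always predicts the most common class.
baseline_accuracy = class_fractions.max()
print(class_fractions, baseline_accuracy)
```

On the real data, `data['label'].value_counts(normalize=True)` should give the same kind of class fractions.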

In [2]:
data = pd.read_pickle("demo_data.pkl")

In [4]:
#Data includes 500 entries, 40 feature columns, and 1 label column
print(data.shape)

(500, 41)


In [3]:
data.head()

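| | feature 0 | feature 1 | feature 2 | feature 3 | feature 4 | feature 5 | feature 6 | feature 7 | feature 8 | feature 9 | ... | feature 31 | feature 32 | feature 33 | feature 34 | feature 35 | feature 36 | feature 37 | feature 38 | feature 39 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1860.0 | 42.0 | 3 | 15 | 139 | 6146.0 | 8.60839 | 0.0 | -4.927546 | -4.117362 | ... | 15 | 0 | 0 | 0 | 0 | 3 | 51 | 6 | 14.0 | 1.0 |
| 1 | 4300.0 | 33.0 | 12 | 14 | 76 | 7140.0 | 11.2603 | 4.496 | -1.24 | -9.209714 | ... | 4 | 0 | 0 | 0 | 0 | 2 | 6 | 4 | 43.0 | 0.0 |
| 2 | 4273.0 | 320.0 | 12 | 13 | 84 | 8650.0 | 8.29802 | 1.524 | -0.812174 | -6.677698 | ... | 3 | 0 | 0 | 0 | 0 | 1 | 5 | 5 | 19.0 | 0.0 |
| 3 | 3675.0 | 48.0 | 5 | 3 | 187 | 6110.0 | 5.42586 | 0.0 | -8.939261 | -4.523 | ... | 17 | 9 | 0 | 0 | 14 | 0 | 71 | 4 | 47.0 | 0.0 |
| 4 | 887.0 | 22.0 | 10 | 15 | 119 | 8908.0 | 9.7886 | 0.0 | -5.546325 | -4.650889 | ... | 15 | 0 | 0 | 0 | 0 | 3 | 33 | 4 | 76.0 | 0.0 |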


In [11]:
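#Separate the data into training data (450 rows) and test data (50 rows).
#No seed is set, so the split is different on every run.
subset = np.random.choice(range(500),50,replace=False)
test_data = data.iloc[subset]
#iloc takes row positions but drop takes index labels, so convert positions to labels first
train_data = data.drop(index=data.index[subset])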
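The split above is unseeded, so the accuracies in this demo will vary from run to run. Here is a hedged sketch of a reproducible variant, run on a stand-in DataFrame. The names `demo`, `test_part` and `train_part` and the seed value are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same number of rows as the demo data (hypothetical values).
demo = pd.DataFrame({"feature 0": np.arange(500.0), "label": np.zeros(500)})

rng = np.random.default_rng(0)  # arbitrary seed, chosen only for reproducibility
test_positions = rng.choice(len(demo), size=50, replace=False)

test_part = demo.iloc[test_positions]
# Drop by index label so the two pieces never overlap, whatever the index looks like.
train_part = demo.drop(index=demo.index[test_positions])

print(train_part.shape, test_part.shape)
```

Rerunning with the same seed selects the same 50 test rows, which makes comparisons between runs easier to interpret.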

## Single trees
Users are not expected to grow single trees (a single tree overfits very quickly), but it is instructive to see how one is grown.

In [15]:
#load decisionTrees module
if 'decisionTrees' in sys.modules:
    del sys.modules['decisionTrees']
from decisionTrees import *

In [60]:
#initialize the tree, telling it which columns are features, and which one is the label
single_tree = decision_tree(train_data,range(40),40)

In [61]:
#Watch the tree grow for 5 iterations (each iteration splits one leaf).
single_tree.verbose = True
single_tree.grow_tree(train_data,5,0)

Leaf 0 produces children using column 1.
incorrect guesses in this branch: 107 before; 105 after
Leaf 0 produces children using column 26.
incorrect guesses in this branch: 102 before; 101 after
Leaf 2 produces children using column 31.
incorrect guesses in this branch: 96 before; 95 after
Leaf 2 produces children using column 18 (chosen at random).
incorrect guesses in this branch: 84 before; 84 after
Leaf 2 produces children using column 29 (chosen at random).
incorrect guesses in this branch: 81 before; 81 after


An explanation:

At each iteration, the tree determines which leaf makes the most incorrect predictions and tells that leaf to produce new children.  The leaf greedily selects the column that maximizes correct predictions and splits on that column.  If no column improves the predictions, the leaf chooses a column at random.  (As you can see, the first five iterations bring little improvement.)
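To illustrate the greedy step, here is a rough sketch of how a leaf could pick the column that leaves the fewest incorrect guesses. It is not the code in `decisionTrees`: the threshold-based split, the midpoint candidates, and the names `leaf_errors` and `best_split` are all assumptions made for this example, and the random-column fallback is left out.

```python
import numpy as np

def leaf_errors(labels):
    """Errors made by a leaf that predicts its majority label (binary 0/1)."""
    ones = int(labels.sum())
    return min(ones, len(labels) - ones)

def best_split(features, labels):
    """Greedy search over columns and thresholds for the fewest errors.

    Returns (column, threshold, errors_after); column is None when no
    split beats the unsplit leaf.
    """
    best = (None, None, leaf_errors(labels))
    for col in range(features.shape[1]):
        values = np.unique(features[:, col])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for threshold in (values[:-1] + values[1:]) / 2:
            left = features[:, col] <= threshold
            errors = leaf_errors(labels[left]) + leaf_errors(labels[~left])
            if errors < best[2]:
                best = (col, threshold, errors)
    return best

# Tiny hypothetical example: column 1 separates the labels, column 0 does not.
X = np.array([[1.0, 0.2], [2.0, 0.9], [1.0, 0.8], [2.0, 0.1]])
y = np.array([0, 1, 1, 0])
print(best_split(X, y))
```

In this toy example the search settles on column 1, since splitting there leaves no incorrect guesses while splitting on column 0 leaves two.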

In [62]:
#You can also print out the tree structure.
#Each leaf is labeled "0" or "1" depending on how it classifies data points that reach that leaf of the tree.
#As you can see, the tree does not try to be balanced.
print(single_tree.root.summarize())

#And you can evaluate how well the decision tree makes predictions (not very well)
print('Accuracy on training data: %.2f' % single_tree.evaluate(train_data))
print('Accuracy on test data: %.2f' % single_tree.evaluate(test_data))

[[1,[[[0,0],0],1]],1]
Accuracy on training data: 0.77
Accuracy on test data: 0.78


In [65]:
#Show what happens after 50 more iterations
single_tree.verbose = False
single_tree.grow_tree(train_data,50,0)
print(single_tree.root.summarize())
print('Accuracy on training data: %.2f' % single_tree.evaluate(train_data))
print('Accuracy on test data: %.2f' % single_tree.evaluate(test_data))

[[[1,0],[[[[[0,1],[[[[[[[[[0,1],0],0],1],[0,1]],[1,[[0,1],[0,0]]]],1],[1,0]],[[[[[[[[1,0],[1,[0,0]]],[1,0]],0],1],[[0,1],0]],1],[1,0]]]],[[0,1],[[0,[1,0]],[[[[[0,[1,0]],1],[0,[0,1]]],1],1]]]],[[0,0],0]],[0,[0,1]]]],[0,1]]
Accuracy on training data: 0.92
Accuracy on test data: 0.74


## Random forest testing
Now let's show how a whole forest is grown.  Each tree is given a random subset of rows (selected with replacement) and a random subset of feature columns.
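The per-tree sampling can be sketched with NumPy as follows. This only illustrates the idea: drawing as many rows as the training set contains, and drawing feature columns without repeats, are assumptions rather than documented behavior of `random_forest`.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for a repeatable illustration

n_rows, n_features, features_per_tree = 450, 40, 5

# Rows: sampled with replacement, so some rows repeat and others are left out.
row_sample = rng.integers(0, n_rows, size=n_rows)

# Feature columns: a subset without repeats (an assumption of this sketch).
feature_sample = rng.choice(n_features, size=features_per_tree, replace=False)

print(len(np.unique(row_sample)), "distinct rows;", "features:", sorted(feature_sample))
```

Because every tree sees a different sample of rows and columns, the trees make different mistakes, which is what lets the forest do better than any single overfit tree.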

In [88]:
#See the readme for details on these arguments
#This forest will grow trees, each given 5 features and 50 iterations, and keep on growing trees until 1 minute has passed.
forest = random_forest(range(40),40,50,0,5,1)

In [89]:
#train the forest, and then show how well it performs on the test data
forest.train_valid(train_data,test_data)

Forest planted in 1.02 minutes.  14 trees grown with 50 iterations and 5 features.
Accuracy: 0.880


In [90]:
#We can also print out the confusion matrix
print(forest.confusion(test_data))

[[0.14 0.06]
 [0.06 0.74]]
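The entries are fractions of the test rows, so they sum to 1, and the diagonal sums to the accuracy (0.14 + 0.74 = 0.88, matching the accuracy printed above). Here is a small sketch of how such a normalized matrix can be computed. The `[actual, predicted]` orientation and the function name `normalized_confusion` are assumptions of this example and may not match the layout that `forest.confusion` uses.

```python
import numpy as np

def normalized_confusion(actual, predicted, n_classes=2):
    """Matrix whose [a, p] entry is the fraction of rows with actual label a
    and predicted label p (this orientation is an assumption of the sketch)."""
    matrix = np.zeros((n_classes, n_classes))
    for a, p in zip(actual, predicted):
        matrix[a, p] += 1
    return matrix / len(actual)

# Hypothetical labels for ten rows.
actual = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
predicted = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

cm = normalized_confusion(actual, predicted)
accuracy = cm.trace()  # the diagonal holds the correctly classified fractions
print(cm, accuracy)
```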


In [91]:
#or we can just print out the predicted labels
pred_labels = forest.predict_labels(test_data)
for i in range(10):
    print("Row %i: Actual label: %i, Predicted label: %i" % (i, test_data['label'].iloc[i], pred_labels[i]))

Row 0: Actual label: 0, Predicted label: 0
Row 1: Actual label: 0, Predicted label: 0
Row 2: Actual label: 0, Predicted label: 0
Row 3: Actual label: 0, Predicted label: 0
Row 4: Actual label: 0, Predicted label: 0
Row 5: Actual label: 0, Predicted label: 0
Row 6: Actual label: 0, Predicted label: 0
Row 7: Actual label: 0, Predicted label: 0
Row 8: Actual label: 1, Predicted label: 1
Row 9: Actual label: 0, Predicted label: 0


In [94]:
#You can also retrieve information about any particular tree
print(forest.forest[0].evaluate(test_data))

0.78


## Hyperparameter tuning
While I did not write a framework for tuning hyperparameters, most of the tools are already there!  In particular, the forest will train for a fixed amount of time, so that you can figure out what hyperparameters are most effective given your available computing resources.

Here I use a simple search to determine whether it's better to grow more trees, or to have more iterations per tree.
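The fixed-time idea can be sketched as a loop that keeps growing trees until a deadline passes. The actual implementation may differ. The names `grow_until` and `fake_tree` are hypothetical, and `fake_tree` is only a timing stand-in, not real training.

```python
import time

def grow_until(budget_seconds, grow_one_tree):
    """Call grow_one_tree() repeatedly until the time budget is used up.

    A tree that starts before the deadline is allowed to finish, so the
    total runtime can slightly exceed the budget.
    """
    trees = []
    start = time.monotonic()
    while time.monotonic() - start < budget_seconds:
        trees.append(grow_one_tree())
    return trees, time.monotonic() - start

# Hypothetical stand-in for real training: each "tree" takes about 10 ms.
def fake_tree():
    time.sleep(0.01)
    return "tree"

trees, runtime = grow_until(0.05, fake_tree)
print(len(trees), "trees in", round(runtime, 3), "seconds")
```

With a fixed budget, cheaper trees mean more trees, so a setting that makes each tree more expensive has to earn its cost in accuracy. That is the trade-off explored in the search here.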

In [95]:
#note: this takes about 4 minutes to run
forest_grid = []
sim_results = pd.DataFrame(np.zeros((4,9)),columns=[
        'Iterations','Features','Trees','Runtime','accuracy','cmatrix_00','cmatrix_01','cmatrix_10','cmatrix_11'])
for i in range(4):
    forest = random_forest(range(40),40,i*20+20,0,5,1)
    forest_grid.append(forest)
    forest.train(train_data)
    sim_results.iloc[i,0] = forest.num_iter
    sim_results.iloc[i,1] = forest.num_features
    sim_results.iloc[i,2] = forest.num_trees
    sim_results.iloc[i,3] = forest.runtime
    c_matrix = forest.confusion(test_data)
    sim_results.iloc[i,4] = c_matrix.trace()
    sim_results.iloc[i,5] = c_matrix[0,0]
    sim_results.iloc[i,6] = c_matrix[0,1]
    sim_results.iloc[i,7] = c_matrix[1,0]
    sim_results.iloc[i,8] = c_matrix[1,1]

Forest planted in 1.01 minutes.  27 trees grown with 20 iterations and 5 features.
Forest planted in 1.06 minutes.  17 trees grown with 40 iterations and 5 features.
Forest planted in 1.03 minutes.  12 trees grown with 60 iterations and 5 features.
Forest planted in 1.04 minutes.  10 trees grown with 80 iterations and 5 features.


In [97]:
sim_results

| | Iterations | Features | Trees | Runtime | accuracy | cmatrix_00 | cmatrix_01 | cmatrix_10 | cmatrix_11 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.0 | 5.0 | 27.0 | 1.012335 | 0.76 | 0.06 | 0.14 | 0.1 | 0.7 |
| 1 | 40.0 | 5.0 | 17.0 | 1.058252 | 0.76 | 0.06 | 0.14 | 0.1 | 0.7 |
| 2 | 60.0 | 5.0 | 12.0 | 1.027268 | 0.84 | 0.08 | 0.12 | 0.04 | 0.76 |
| 3 | 80.0 | 5.0 | 10.0 | 1.041018 | 0.76 | 0.06 | 0.14 | 0.1 | 0.7 |


As you can see, the forest that performed best was the one that grew each tree with 60 iterations.  (Of course, this result should not be taken too seriously, given the very small test data set and short runtime.)
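To make the caveat concrete, the sketch below picks the best row from the results and expresses the accuracy gap in test rows. The DataFrame is rebuilt by hand from the values shown in the table, and the names `results`, `best` and `rows_apart` are illustrative.

```python
import pandas as pd

# Values copied from the sim_results table above.
results = pd.DataFrame({
    "Iterations": [20, 40, 60, 80],
    "Trees": [27, 17, 12, 10],
    "accuracy": [0.76, 0.76, 0.84, 0.76],
})

# Row with the highest accuracy.
best = results.loc[results["accuracy"].idxmax()]

# With 50 test rows, a single row is worth 1/50 = 0.02 accuracy.
n_test = 50
rows_apart = round((0.84 - 0.76) * n_test)
print(int(best["Iterations"]), rows_apart)
```

The winning forest differs from the others by only four test rows, which is well within what a different random split could produce.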