
# <div style="text-align: right"> Random Forest from scratch. </div>

---

<div style="text-align: right"> Geoff Counihan - Oct 6, 2017 </div>

### Notes

---

I inherited from my decision tree class to create a random forest classifier. There are a few modifications:

    1. Sample with replacement
    2. Create, predict, and average multiple tree predictions
    3. Modify the split function to split on n_feat random features
    
__Additions__: Entropy, Test Cases

In [18]:
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import operator
from scratch import decision_tree, accuracy

In [6]:
Xy = pd.read_csv('./sonar.all-data.csv',header=None)
Xy[60] = Xy[60].map({'R':0,'M':1})
X = np.array(Xy.iloc[:,:-1])
y = np.array(Xy.iloc[:,-1])
Xy = np.array(Xy)

### Data manipulations

Random forests are powerful because they ensemble a large number of weak learners. To make each tree differ from the others, two manipulations are performed.

---

__Sample with replacement__ - create a fictitious dataset (a bootstrap sample) drawn with replacement from the original data

In [7]:
def samp(Xy,ratio):
    n = int(np.round(len(Xy) * ratio))
    idx = np.random.randint(Xy.shape[0],size=n)
    return Xy[idx,:]
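
To see what sampling with replacement does in practice, here is a minimal sketch that repeats the `samp` logic on a toy array. The toy data and the fixed seed are assumptions for illustration only. With `ratio=1` the sample has as many rows as the original, but only around 63% of the original rows are expected to appear in it; the rest are duplicates.

```python
import numpy as np

def samp(Xy, ratio):
    # draw row indices uniformly at random, with replacement
    n = int(np.round(len(Xy) * ratio))
    idx = np.random.randint(Xy.shape[0], size=n)
    return Xy[idx, :]

np.random.seed(0)                        # assumed seed, for repeatability
toy = np.arange(2000).reshape(1000, 2)   # 1000 rows, first column is unique per row
boot = samp(toy, ratio=1)

# fraction of the original rows that made it into the bootstrap sample
unique_frac = len(np.unique(boot[:, 0])) / len(toy)
print(boot.shape, round(unique_frac, 2))
```

The rows left out of a given sample are what each tree never sees, which is part of why the trees end up different from one another.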

__Best split from a subset of features__ - each split in a random forest tree is chosen from a random subset of features, n_feats, rather than from all of them. This encourages very different tree structures within the forest. In this case I've hardcoded the number of features in the subset to sqrt(total features).
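
As a small sketch of the feature subsetting described above: the sonar data has 60 feature columns, so the square-root rule gives 7 candidate features per split. The seeded `RandomState` is an assumption added here so the draw is repeatable; the class itself calls `np.random.choice` directly.

```python
import numpy as np

n_total = 60                       # feature columns in the sonar data
n_feat = int(np.sqrt(n_total))     # int(7.74...) == 7

rng = np.random.RandomState(0)     # assumed seed, for repeatability
# replace=False guarantees the chosen column indices are distinct
subset = rng.choice(n_total, n_feat, replace=False)
print(n_feat, sorted(subset))
```

A fresh subset is drawn every time a split is searched, so two nodes in the same tree generally consider different features.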

### Create class.

---

In this case, I've inherited from the class decision_tree, my decision tree built from scratch, and modified functions as needed.

__Modified fit__ - calculates n_feat and builds a list of trained trees of length num_trees

__Modified predict__ - collects the predictions from each of the num_trees trees, stacks them together, and then averages and rounds them to a single output per sample. For 0/1 labels this amounts to a majority vote.
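
The averaging step can be sketched on its own with a small, hypothetical vote matrix (three trees, four samples). One detail worth knowing: `np.rint` rounds halves to the nearest even number, so with an even number of trees an exact tie (mean of 0.5) resolves to class 0.

```python
import numpy as np

# hypothetical votes: one row per tree, one column per sample
y_preds = np.array([[1, 0, 1, 0],
                    [1, 0, 0, 1],
                    [0, 0, 1, 1]])

# mean of 0/1 votes is the share of trees voting 1; rounding gives the majority
avg = np.rint(y_preds.mean(axis=0))
print(avg)            # [1. 0. 1. 1.]

# a 50/50 tie has mean 0.5, which np.rint rounds down to 0.0
tie = np.rint(0.5)
print(tie)            # 0.0
```

This only works as a vote because the labels are 0 and 1; more than two classes would need a per-class count instead of a mean.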

In [8]:
class random_forest(decision_tree):
    def __init__(self, num_trees, max_depth=2, min_num_split=30, sample_ratio=1):
        self.max_depth = max_depth
        self.min_num_sample = min_num_split
        self.num_trees = num_trees
        self.ratio = sample_ratio
        
    def build_tree(self, Xy):
        '''Build one tree: find the root split, then grow it recursively.
        
        '''
        self.root = self.best_split(Xy)
        # split_branch (inherited) receives the root dict itself, not a copy,
        # so the children it attaches presumably show up in self.root
        self.split_branch(self.root, 1)
        return self.root
    
    def best_split(self, Xy):
        classes = np.unique(Xy[:,-1])
        best_feat = 999
        best_val = 999
        best_score = 999
        best_groups = None
        # candidate features for this split: n_feat distinct column indices
        n_feats = np.random.choice(Xy.shape[1]-1, self.n_feat, replace=False)
        for feat in n_feats:
            # try every observed value of the feature as a threshold
            for i in Xy:
                groups = self.split(feat, i[feat], Xy)
                gini = self.gini_score(groups, classes)
                if gini < best_score:
                    best_feat = feat
                    best_val = i[feat]
                    best_score = gini
                    best_groups = groups
        output = {}
        output['feat'] = best_feat
        output['val'] = best_val
        output['groups'] = best_groups
        return output
    
    def samp(self, Xy, ratio=1):
        n = int(np.round(len(Xy) * ratio))
        idx = np.random.randint(Xy.shape[0],size=n)
        return Xy[idx,:]
        
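    def fit(self, X, y):
        '''Train num_trees trees, each on its own bootstrap sample.
        
        '''
        self.X = X
        self.y = y
        self.Xy = np.column_stack((X, y))

        # number of candidate features per split: sqrt(total features)
        self.n_feat = int(np.sqrt(X.shape[1]))
        
        # each tree gets its own sample, sized by sample_ratio
        self.trees = [self.build_tree(self.samp(self.Xy, self.ratio)) for _ in range(self.num_trees)]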
        
    def predict(self, X_test):
        # one row of predictions per tree, one column per test sample
        self.y_preds = np.array([]).reshape(0,X_test.shape[0])
        for root in self.trees:
            y_pred = np.array([])
            for i in X_test:
                y_pred = np.append(y_pred,self.predict_sample(root,i))
            self.y_preds = np.vstack((self.y_preds,y_pred))
        # average the 0/1 votes across trees and round to a class label
        self.avg_preds = np.rint(self.y_preds.mean(axis=0))
        return self.avg_preds
        
        

### Test.

---

In [9]:
from sklearn.model_selection import train_test_split

Xy = pd.read_csv('./sonar.all-data.csv',header=None)
Xy[60] = Xy[60].map({'R':0,'M':1})
X = np.array(Xy.iloc[:,:-1])
y = np.array(Xy.iloc[:,-1])
Xy = np.array(Xy)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=35)

In [10]:
rf = random_forest(num_trees=3)
rf.fit(X,y)
rf.predict(X_test)

array([ 1.,  0.,  1.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  1.,  0.,  1.,
        1.,  0.,  1.,  1.,  0.,  1.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,
        1.,  0.,  1.,  1.,  1.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,
        1.,  1.,  0.,  1.,  1.,  0.,  0.,  0.,  1.,  1.,  0.,  0.,  0.])

In [11]:
rf.trees

[{'feat': 46,
  'left': {'feat': 8, 'left': 0.0, 'right': 1.0, 'val': 0.12520000000000001},
  'right': {'feat': 13, 'left': 1.0, 'right': 0.0, 'val': 0.43980000000000002},
  'val': 0.12520000000000001},
 {'feat': 9,
  'left': {'feat': 46, 'left': 0.0, 'right': 1.0, 'val': 0.16339999999999999},
  'right': {'feat': 32, 'left': 0.0, 'right': 1.0, 'val': 0.19839999999999999},
  'val': 0.16309999999999999},
 {'feat': 8,
  'left': {'feat': 14, 'left': 0.0, 'right': 1.0, 'val': 0.32319999999999999},
  'right': {'feat': 15, 'left': 1.0, 'right': 0.0, 'val': 0.65920000000000001},
  'val': 0.1037}]

### Compare performance

---

In [23]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree

In [24]:
ens = RandomForestClassifier(n_estimators=50, max_depth=3, min_samples_split=30)
ens.fit(X_train,y_train)
sk_rf_pred = ens.predict(X_test)

In [27]:
clf = DecisionTreeClassifier(max_depth=2,min_samples_split=30)
clf.fit(X_train,y_train)
sk_dt_pred = clf.predict(X_test)

In [28]:
dt = decision_tree(max_depth=3,min_num_split=30)
dt.fit(X_train,y_train)
dt_pred = dt.predict(X_test)

In [22]:
rf = random_forest(max_depth=3,num_trees=50,min_num_split=30)
rf.fit(X_train,y_train)
rf_pred = rf.predict(X_test)

### Accuracy differences

---

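I'm unclear on why my results differ from sklearn's and will need to look deeper. One caveat: the sklearn decision tree above uses max_depth=2 while mine uses max_depth=3, so that comparison is not like for like.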
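
When reading the scores in this section, it helps to keep the test set size in mind. The sketch below assumes `accuracy` is simply the fraction of matching predictions (the real helper lives in `scratch` and is not shown here, so `accuracy_sketch` is a hypothetical stand-in). With 52 test samples, the two accuracy values reported here correspond to 34 and 32 correct predictions, so the gap is two samples.

```python
import numpy as np

def accuracy_sketch(y_pred, y_true):
    """Fraction of predictions that match the labels (assumed definition)."""
    return float(np.mean(np.asarray(y_pred) == np.asarray(y_true)))

n_test = 52                       # length of the prediction arrays above
step = 1 / n_test                 # accuracy change from a single sample
hi, lo = 34 / n_test, 32 / n_test
print(round(hi, 4), round(lo, 4), round(step, 4))   # 0.6538 0.6154 0.0192
```

Because the forest is randomized, a difference this small could plausibly flip between runs; averaging over several seeds or splits would give a steadier comparison.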

In [19]:
accuracy(dt_pred,y_test)

0.6538461538461539

In [20]:
accuracy(rf_pred,y_test)

0.6153846153846154

In [29]:
accuracy(sk_dt_pred,y_test)

0.6153846153846154

In [25]:
accuracy(sk_rf_pred,y_test)

0.6538461538461539

In [30]:
list(zip(rf_pred,dt_pred,sk_dt_pred,sk_rf_pred,y_test))

[(1.0, 1.0, 1, 1, 0),
 (0.0, 0.0, 0, 0, 0),
 (1.0, 0.0, 0, 1, 0),
 (1.0, 1.0, 1, 1, 1),
 (1.0, 1.0, 1, 1, 1),
 (0.0, 0.0, 0, 0, 0),
 (1.0, 1.0, 1, 1, 0),
 (1.0, 0.0, 1, 1, 0),
 (1.0, 1.0, 1, 1, 0),
 (0.0, 0.0, 0, 0, 0),
 (1.0, 0.0, 0, 1, 1),
 (1.0, 1.0, 1, 1, 0),
 (1.0, 1.0, 1, 1, 1),
 (1.0, 1.0, 1, 1, 1),
 (0.0, 0.0, 0, 0, 0),
 (1.0, 1.0, 1, 1, 1),
 (1.0, 1.0, 1, 1, 1),
 (0.0, 0.0, 0, 0, 0),
 (0.0, 0.0, 0, 0, 1),
 (1.0, 1.0, 0, 1, 1),
 (1.0, 1.0, 1, 1, 1),
 (1.0, 1.0, 1, 1, 1),
 (1.0, 1.0, 1, 1, 0),
 (1.0, 1.0, 1, 1, 1),
 (0.0, 0.0, 0, 0, 0),
 (0.0, 0.0, 0, 0, 0),
 (0.0, 0.0, 0, 0, 0),
 (1.0, 0.0, 0, 0, 1),
 (1.0, 0.0, 0, 1, 1),
 (0.0, 0.0, 0, 0, 1),
 (1.0, 0.0, 0, 1, 0),
 (1.0, 1.0, 1, 1, 1),
 (1.0, 1.0, 1, 1, 1),
 (0.0, 1.0, 1, 1, 0),
 (1.0, 1.0, 1, 1, 0),
 (0.0, 0.0, 0, 0, 0),
 (1.0, 0.0, 0, 1, 1),
 (1.0, 1.0, 1, 1, 1),
 (0.0, 0.0, 0, 0, 0),
 (1.0, 1.0, 1, 1, 0),
 (1.0, 1.0, 1, 1, 0),
 (1.0, 1.0, 1, 1, 0),
 (1.0, 1.0, 1, 1, 1),
 (1.0, 1.0, 1, 0, 1),
 (0.0, 0.0, 0, 0, 0),
 (0.0, 0.0