 # Table of Contents
<div class="toc" style="margin-top: 1em;"><ul class="toc-item" id="toc-level0"><li><span><a href="#What-is-Random-Forest" data-toc-modified-id="What-is-Random-Forest-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>What is Random Forest</a></span></li><li><span><a href="#The-algorithm" data-toc-modified-id="The-algorithm-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The algorithm</a></span></li><li><span><a href="#Data-Prep" data-toc-modified-id="Data-Prep-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Prep</a></span></li><li><span><a href="#Decision-Tree-and-Random-Forest" data-toc-modified-id="Decision-Tree-and-Random-Forest-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Decision Tree and Random Forest</a></span><ul class="toc-item"><li><span><a href="#Decision-tree-basics" data-toc-modified-id="Decision-tree-basics-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Decision tree basics</a></span></li><li><span><a href="#Implementation-of-Tree" data-toc-modified-id="Implementation-of-Tree-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Implementation of Tree</a></span></li></ul></li></ul></div>

## What is Random Forest

>Decision trees can suffer from high variance which makes their results fragile to the specific training data used.

>Building multiple models from samples of your training data, called bagging, can reduce this variance, but the trees are highly correlated.

>Random Forest is an extension of bagging that in addition to building trees based on multiple samples of your training data, it also constrains the features that can be used to build the trees, forcing trees to be different. This, in turn, can give a lift in performance.

## The algorithm

>Decision trees involve the greedy selection of the best split point from the dataset at each step.

>This algorithm makes decision trees susceptible to high variance if they are not pruned. This high variance can be harnessed and reduced by creating multiple trees with different samples of the training dataset (different views of the problem) and combining their predictions. This approach is called bootstrap aggregation or bagging for short.

>A limitation of bagging is that the same greedy algorithm is used to create each tree, meaning that it is likely that the same or very similar split points will be chosen in each tree making the different trees very similar (trees will be correlated). This, in turn, makes their predictions similar, mitigating the variance originally sought.

>We can force the decision trees to be different by limiting the features (rows) that the greedy algorithm can evaluate at each split point when creating the tree. This is called the Random Forest algorithm.

>Like bagging, multiple samples of the training dataset are taken and a different tree trained on each. The difference is that at each point a split is made in the data and added to the tree, only a fixed subset of attributes can be considered.

>For classification problems,  the number of attributes to be considered for the split is limited to the square root of the number of input features.

>The result of this one small change are trees that are more different from each other (uncorrelated) resulting predictions that are more diverse and a combined prediction that often has better performance that single tree or bagging alone.

## Data Prep

Sample data used is the sonar dataset.

In [5]:
%mkdir -p data/research

In [6]:
import urllib.request as request
file_path = 'data/research/sonar.all-data.csv'
d_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
request.urlretrieve(d_url, file_path)

('data/research/sonar.all-data.csv', <http.client.HTTPMessage at 0x1031df9b0>)

In [7]:
import pandas as pd
df = pd.read_csv(file_path, header=None)

In [8]:
df.shape

(208, 61)

## Decision Tree and Random Forest

### Decision tree basics

In a decision tree, split points are chosed by finding the feature and the value of that feature which results in lowerst cost.

For classification problem, this cost is usually evaluated by a cost function called Gini index. Gini index calculates the purity of the group of data created by the split point.

A tree node is pure (`gini = 0`) if all instances it applies to belong to the same class.

*Gini Impurity* is measured as 
$$
G_i = 1 - \sum_{k=1}^n p_{i,k}^2
$$

where $p_{i,k}$ refers to the ratio of class $k$  instances among the whole input instances in the $i^{th}$ node.

For example, assume there is a node with 54 input instances, 0 of them belong to class A, 49 of them belong to class B, and 5 of them belong to class C. Then the gini score is $1 - (0/54)^2 - (49/54)^2 - (5/54)^2 \approx 0.168$

In our case, we only want a binary classifier outputing `relevant (1)` or `irrelavent (0)`. So if a node perfectly separated the input into one class(leaf), the *gini impurity* will be 0.

Another measure will be *Entropy*:
$$H_i = - \sum_{k=1 \mid p_{i,k} \neq 0}^n p_{i,k}log(p_{i,k})$$
Note *Entropy* is more expensive as it uses $log$.

> - Gini is intended for continuous attributes, and Entropy for attributes that occur in classes
- Gini is to minimize misclassification
- Entropy is for exploratory analysis
- Entropy may be a little slower to compute

General Implementation of both:

In [9]:
def calc_shannon_entropy(self, left, right):
        left_sum = sum(left.values())
        right_sum = sum(right.values())
        if 0 in left.values():
            left_entropy = 0
        else:
            left_entropy = sum([-(i/left_sum)*np.log2(i/left_sum) for i in left.values()])

        if 0 in right.values():
            right_entropy = 0
        else:
            right_entropy = sum([-(i/right_sum)*np.log2(i/right_sum) for i in right.values()])
        entropy = (left_entropy*left_sum + right_entropy*right_sum) / (left_sum + right_sum)
        return entropy

In [10]:
def cal_gini_index(data):
    pass

### Implementation of Tree

In [11]:
import math
import numpy as np
import random

In [12]:
f = 5
label_ind=60
x = 3
print(df[f][x], df[label_ind][x])
print(df)

0.0368 R
         0       1       2       3       4       5       6       7       8   \
0    0.0200  0.0371  0.0428  0.0207  0.0954  0.0986  0.1539  0.1601  0.3109   
1    0.0453  0.0523  0.0843  0.0689  0.1183  0.2583  0.2156  0.3481  0.3337   
2    0.0262  0.0582  0.1099  0.1083  0.0974  0.2280  0.2431  0.3771  0.5598   
3    0.0100  0.0171  0.0623  0.0205  0.0205  0.0368  0.1098  0.1276  0.0598   
4    0.0762  0.0666  0.0481  0.0394  0.0590  0.0649  0.1209  0.2467  0.3564   
5    0.0286  0.0453  0.0277  0.0174  0.0384  0.0990  0.1201  0.1833  0.2105   
6    0.0317  0.0956  0.1321  0.1408  0.1674  0.1710  0.0731  0.1401  0.2083   
7    0.0519  0.0548  0.0842  0.0319  0.1158  0.0922  0.1027  0.0613  0.1465   
8    0.0223  0.0375  0.0484  0.0475  0.0647  0.0591  0.0753  0.0098  0.0684   
9    0.0164  0.0173  0.0347  0.0070  0.0187  0.0671  0.1056  0.0697  0.0962   
10   0.0039  0.0063  0.0152  0.0336  0.0310  0.0284  0.0396  0.0272  0.0323   
11   0.0123  0.0309  0.0169  0.0313  0.0358

In [25]:
class AlreadyFitException(Exception):
    pass
class NoBreakpointsException(Exception):
    pass

In [56]:
'''
A dummy version of tree nodes
'''
class Node:
      
    def __init__(self, data, rows, features, depth, max_depth):
        self.left = None
        self.right = None
        self.data = data
        self.rows = rows
        self.features = features
        self.label_index = 60
        self.labels = ['R', 'M']
        self.spliting_feature_val = None
        self.id = '%030x' % random.randrange(16**30)
        self.depth = depth
        self.max_depth = max_depth
        self.min_feature = None
        self.min_break_point = None
        self.min_gini = None

    
    def calc_shannon_entropy(self):
        raw_val = 0
        for label in self.labels:
            members = self.data.loc[self.data[self.label_index] == label]
            if len(members) <= 0: continue
            filtered = [x for x in members.index.values if x in self.rows]
            intermediate = len(filtered)/len(self.rows)
            raw_val += -intermediate*np.log2(intermediate)
        return raw_val
    
    def calc_gini_index(self):
        raw_val = 1
        members = [self.data[self.label_index][x] for x in self.rows]
        for label in self.labels:
#             members = self.data.loc[self.data[self.label_index] == label]
            #maybe do as a for loop?
            filtered = [x for x in members if x == label]
#             filtered = members
            raw_val -= (len(filtered)/len(self.rows))**2
        return raw_val
    
        
    '''
    calculate info gain from gini/entropy
    '''
    def cal_info_gain():
        pass
    
    def find_break_points(self, df, feature):
        breaks = []
        for i in range(len(df)-1):
            row = df[i:i+1]
            next_row = df[i+1:i+2]
#             print(row[self.label_index])
            if row[self.label_index].values[0] != next_row[self.label_index].values[0]:
                breaks.append(next_row[feature].values[0]) #float precision issue, care
        return breaks
    
        
    '''
    Choose the best feature to split at this point
    i.e. low gini/entropy, high infoGain
    '''
    
    def split(self):
        #are we a leaf node?
        if len(self.rows) == 0:
            raise ValueError('The node has no document feed, no more splitting')
        elif self.calc_gini_index() == 0:
            raise ValueError('The node is pure, no more splitting')
        elif self.depth == self.max_depth:
            raise ValueError('The node has reached max recursion depth, no more splitting')
        elif len(self.features) == 0:
            raise ValueError('There are no more features to split on.')
            
        #we are not a leaf node.
        min_gini, min_feature, min_break_point, left_members, right_members = 2, -999, -999, [], []
        bp_len_sum = 0
        for feature in self.features:
#             print('parsing')
            to_parse = [(self.data[feature][x],self.data[self.label_index][x]) for x in self.rows]
            to_parse = pd.DataFrame(to_parse, columns=(feature,self.label_index), index=self.rows)
#             print(to_parse)
            to_parse.sort_values(feature, inplace=True)
#             print(to_parse)
#             to_parse = self.data[[feature, self.label_index]]
#             to_parse = to_parse.loc[to_parse.index.isin (self.rows)]
#             to_parse.sort_values(feature, inplace=True)
#             print(to_parse)
            break_points = self.find_break_points(to_parse, feature)
#             print(break_points)
            bp_len_sum += len(break_points)

            best_gini_this_feature, best_breakpoint_this_feature = self.find_best_breakpoint(to_parse.values[:,0], to_parse.values[:,1])
            if best_gini_this_feature < min_gini:
                left_members = to_parse.loc[to_parse[feature] < best_breakpoint_this_feature].index.values
                right_members = to_parse.loc[to_parse[feature] >= best_breakpoint_this_feature].index.values
                min_gini, min_break_point, min_feature = best_gini_this_feature, best_breakpoint_this_feature, feature
        if bp_len_sum == 0:
            print(to_parse)
            print(self.calc_gini_index())
        #Node(self.data, , [x for x in self.features if x != feature], self.depth+1, self.max_depth)
        self.left = Node(self.data, left_members, [x for x in self.features if x != min_feature], self.depth+1, self.max_depth)
        self.right = Node(self.data, right_members, [x for x in self.features if x != min_feature], self.depth+1, self.max_depth)
        self.min_feature, self.min_break_point, self.min_gini = min_feature, min_break_point, min_gini
        try:
            if self.left is None:
                print(self.min_feature,self.min_break_point,self.min_gini)
            self.left.split()
        except ValueError: # probably need a customized error class
            pass
        try:
            self.right.split()
        except ValueError:
            pass
    
    '''
    A faster way to find the best breakpoint. 
    Note that we're assuming that the node we're splitting isn't pure.
    
    input:
    values - an arraylike of the values associated with each element
    classes - a arraylike of the labels for each element, in the same order as values
    
    returns:
    (min_gini, min_break_point)
    '''
    def find_best_breakpoint(self, values, classes):
        if len(values) != len(classes):
            raise ValueError("Values and classes must be the same length.")
        best_gini = 2
        best_ind = -1
        #class member values
        left_members = {}
        right_members = {}
        
        #everything starts on the right
        for i in range(len(values)):
            try:
                right_members[classes[i]] += 1
            except KeyError:
                right_members[classes[i]] = 1
                
        #compare different breakpoints
        for i in range(len(values)-1):
            #add ith value to the left (we're considering splitting after i)
            try:
                left_members[classes[i]] += 1
            except KeyError:
                left_members[classes[i]] = 1
            
            #remove ith value from the right 
            right_members[classes[i]] -= 1
            
            #if i and i+1 aren't the same class, consider splitting here
            if classes[i] != classes[i+1]:
                left_gini = Node.calc_gini_from_props(left_members)
                right_gini = Node.calc_gini_from_props(right_members)
                curr_gini = Node.aggregate_gini(left_gini, right_gini, i+1, len(values)-(i+1))
                if best_gini > curr_gini:
                    best_gini = curr_gini
                    best_ind = i+1 #if we're less than the breakpoint, we're put in one bucket, and geq is in the other bucket
        #return the best value
        return (best_gini, values[best_ind])
            
    '''
    Calculates the gini index from a dictionary of proportions
    
    input:
    members - a dict from string label to int count of members
    
    returns - the gini index for a node containing these members
    '''        
    def calc_gini_from_props(members):
        answer = 1
        total = 0
        for label in members.keys():
            total += members[label]
        for label in members.keys():
            answer -= (members[label]/total)**2
        return answer
    
    '''
    Calculates the aggregate gini index from two child nodes.
    
    input:
    score1 - the gini score for one child
    score2 - the gini score for the other child
    num1 - the number of members of the first child
    num2 - the number of members of the second child
    
    returns - a weighted average of the two scores
    '''
    def aggregate_gini(score1, score2, num1, num2):
        return (score1*num1 + score2*num2)/(num1 + num2)
    
    def __str__(self):
#         if self.left and self.right:
        children = [x.id for x in (self.left, self.right)] if self.left and self.right else []
        return "[{ID}, {Gini}, {Size}, {Feature}, {BP}, {Children}]".format(ID=self.id, 
                                                            Gini = self.calc_gini_index(),
                                                            Size = len(self.rows),
                                                            Feature=self.min_feature, 
                                                            BP=self.min_break_point,
                                                           Children=children)
#         else:
#             "[{ID}, (Children=None)]".format(ID=self.id)
    
    def get_proportions(self, target_label):
        members = [self.data[self.label_index][x] for x in self.rows]
        filtered = [x for x in members if x == target_label]
#         members = self.data.loc[self.data[self.label_index] == target_label]
#         filtered = [x for x in members.index.values if x in self.rows]
        raw_val = (len(filtered)/len(self.rows))
        return raw_val
        

In [68]:
ls = [x for x in range(60)]
dummy = Node(df, [1,2,2,7,4,5,200,150, 175, 175, 130], ls, 0, 2)
dummy.split()
# print(dummy.left.split())

In [69]:
nodes = [dummy]
while(len(nodes) > 0):
    new_nodes = []
    level_str = ''
    for node in nodes:
        level_str += str(node) + "\n"
        if node.left:
            new_nodes.append(node.left)
        if node.right:
            new_nodes.append(node.right)
    print(level_str+"\n--------------------------------------------------", end='\n')
    nodes = new_nodes

[4576a9d8bb6c52b66d3921f7edc879, 0.375, 20, 40, 0.1151, ['b8d33d76d47e31638fb41faef81075', '51c1b4bccd81cb3cfc035cf537d39a']]

--------------------------------------------------
[b8d33d76d47e31638fb41faef81075, 0.31999999999999984, 5, 2, 0.0347, ['7b0156ed9c5645c8a94d8fe1625568', '110be92be7cbe80337780f120c6028']]
[51c1b4bccd81cb3cfc035cf537d39a, 0.12444444444444439, 15, 2, 0.0152, ['05e6533a9bb4a6198ea7ceba8ff755', '7ac7d45615cdff06082edff36c06f8']]

--------------------------------------------------
[7b0156ed9c5645c8a94d8fe1625568, 0.0, 4, None, None, []]
[110be92be7cbe80337780f120c6028, 0.0, 1, None, None, []]
[05e6533a9bb4a6198ea7ceba8ff755, 0.0, 1, None, None, []]
[7ac7d45615cdff06082edff36c06f8, 0.0, 14, None, None, []]

--------------------------------------------------


In [71]:
'''
A dummy implementation of decision trees
'''
class Tree:
    
    '''
    params:
    train_data - training data to trainthe tree
    depth - max recursion depth of the tree
    benchmark - benchmark for geni/entropy
    '''
    def __init__(self, data, depth, benchmark, rows, features): #should we include data here
        self.depth = depth
        self.rows = rows
        self.features = features
        self.data = data
        self.benchmark = benchmark
        self.head = Node(data, rows, features, 0, depth)
        self.oob_error = -1
        
    '''
    Recursively split until geni/entropy benchmark met or max_depth reached
    '''
    def fit(self):
        #think about behavior of pure nodes more
        try:
            self.head.split()
        except ValueError: #change this to whatever node.split() throws
            print('Head is a pure node.')
    '''
    params: 
    test_data - test data to run the prediction on
    
    return: 
    outputs confidence/probability of each category
    '''
    def predict(self, test_data):
#         assuming input data is a dataframe right now
        cur_node = self.head
        while (cur_node.left and cur_node.right):
            if (test_data[cur_node.min_feature].values[0] < cur_node.min_break_point):
                cur_node = cur_node.left
            else:
                cur_node = cur_node.right
        
#         here, cur_node should be the leaf
        r_confidence = cur_node.get_proportions('R')
        m_confidence = cur_node.get_proportions('M')
        
        return (r_confidence, m_confidence)
    
    '''
    params: 
    more_data - more training data to update the tree
    
    return: 
    Null or we can say something like which nodes are changed
    '''
    def update(more_data):
        #decide whether to call alg 3 or alg 4
        #call the relevant one
        pass
    
    '''
    return:
    The number of ignored data pieces that we get incorrect (n) divided by the number of rows we ignored (l)
    That is, n/l
    '''
    def calc_oob_error(self):
        #complement of rows
        test_data = self.data.loc[~self.data.index.isin(self.rows)]
        complement = set(range(self.data.shape[1])) - set(self.rows)
        #predict each of those (TODO: update this once we have batch training)
        num_incorrect = 0
        for row in complement:
            case = self.data.loc[[row]]
            prediction = self.predict(case)
            if prediction[0] > prediction[1]:
                num_incorrect += 1 if case[60].values[0] == 'M' else 0
            else:
                num_incorrect += 1 if case[60].values[0] == 'R' else 0
        return num_incorrect / len(test_data)
        #calculate incorrect / total
        
    
    '''
    Maybe we can use pickle for this
    '''
    def store_tree(file_path):
        pass
    
    def load_tree(file_path):
        pass
    
    '''
    String representation
    '''
    def __str__(self):
        string = ''
        string += str(sorted(self.features))
        string += '\n'
        nodes = [self.head]
        while(len(nodes) > 0):
            new_nodes = []
            level_str = ''
            for node in nodes:
                level_str += str(node) + "\n"
                if node.left:
                    new_nodes.append(node.left)
                if node.right:
                    new_nodes.append(node.right)
            string += level_str+"\n--------------------------------------------------\n"
            nodes = new_nodes
        return string

In [72]:

ls = [x for x in range(60)]
# dummy = Node(df, range(df.shape[0]-20), ls, 0, 2)



In [73]:
def cross_val(df, tries):
    for i in range(tries):
        shuffle = df.sample(frac=1)
        tree = Tree(df, 3, None, range(shuffle.shape[0]-20), ls)
        tree.fit()
        score = 0
        for i in range(188, 208):
            actual = shuffle[i:i+1][60].values[0]
            p = tree.predict(shuffle[i:i+1])
            if p[0] > p[1]:
        #         print('R/{}'.format(actual))
                if 'R' == actual: 
                    score+=1
            else:
        #         print('M/{}'.format(actual))
                if 'M' == actual: 
                    score+=1
        print(score/(208-188))

In [75]:
cross_val(df, 5)

0.9
0.95
0.95
0.85
0.85


In [76]:
'''
Dummy Version of Random Forest
'''
class RNF: 
    '''
    params:
    train_data - training data to trainthe tree
    n_trees - number of trees to setup
    tree_depth - max recursive
    random_seed - seed for random gen
    n_max_features - max num of features to pass to each tree
    n_max_input - max num of input to pass to each tree
    '''
    def __init__(self, train_data, n_trees, tree_depth, random_seed, n_max_features, n_max_input):
        self.trees = []
        self.train_data = train_data
        self.n_trees = n_trees
        self.tree_depth = tree_depth
        self.n_max_features = n_max_features
        self.n_max_input = n_max_input
#         self.features = [()] #list of tuples like (tree, emails, features)
        random.seed(random_seed)
    
        np.random.seed(random_seed)
        pass
    
    '''
    Randomly select features and emails from the train_data 
    '''
    def random_select(self, train_data):
        selected_rows = np.random.choice(self.train_data.shape[0], self.n_max_input)
        selected_features = np.random.choice(self.train_data.shape[1] - 1, self.n_max_features, replace=False)
        return (selected_rows, selected_features)
        
    '''
    pass randomly selected emails and features to each tree
    '''
    def fit(self):
        if len(self.trees) != 0:
            raise AlreadyFitException('This forest has already been fit to the data')
        for i in range(self.n_trees):
            selected = self.random_select(self.train_data)
#             self, train_data, depth, benchmark, rows, features
            self.trees.append(Tree(self.train_data, self.tree_depth, 0, selected[0], selected[1]))
        for tree in self.trees:
            tree.fit()
    
    '''
    calculate a proba from output of each tree's prediction
    should ouput two arrays: probas and classfication
    '''
    def some_majority_count_metric(self, scores):
        return np.mean(scores, axis=0)
    
    def predict(self, test_data):
        scores = [tree.predict(test_data) for tree in self.trees]
        probas = self.some_majority_count_metric(scores)
        classes = 'R' if probas[0] > probas[1] else 'M'
        return self.some_majority_count_metric(scores), classes
    
    '''
    params: 
    more_data - more training data to update the forest
    
    return: 
    Null or we can say something like which trees are changed
    '''
    def update(more_data):
        #add more_data to the end of self.train_data
        
        #calc oob error for each tree
        
        #calc threshold
        
        #for each tree in trees:
        #if oob < thresh
            #alg 3 (trash the tree and build a new one)
        #else alg 4
        pass
    
    '''
    Maybe we can use pickle for this
    '''
    def store_rnf(file_path):
        pass
    
    def load_rnf(file_path):
        pass

In [77]:
a = RNF(df[:188], 10, 2, 42, 10, 10)

In [78]:
a.fit()

In [79]:
for tree in a.trees:
    print(tree)

[4, 5, 9, 12, 16, 22, 30, 40, 48, 59]
[477183392456de3eb13b9046685257, 0.5, 10, 48, 0.0588, ['961b768d5288f1142c3fe860e7a113', '6273a693cd59bf5c941cf0dc98d2c1']]

--------------------------------------------------
[961b768d5288f1142c3fe860e7a113, 0.2777777777777777, 6, 59, 0.0092, ['e8256547294739614ff3d719db3ad0', '5347655d65a441d58842dea2bc372f']]
[6273a693cd59bf5c941cf0dc98d2c1, 0.0, 4, None, None, []]

--------------------------------------------------
[e8256547294739614ff3d719db3ad0, 0.0, 5, None, None, []]
[5347655d65a441d58842dea2bc372f, 0.0, 1, None, None, []]

--------------------------------------------------

[3, 5, 12, 21, 26, 30, 33, 41, 42, 54]
[2ff8d207a0ca6e0822e8f36c031199, 0.31999999999999984, 10, 41, 0.4008, ['7d57dbbaa80dd488bd64072bcfbe01', '8a36996123fdf77656af7229d4beef']]

--------------------------------------------------
[7d57dbbaa80dd488bd64072bcfbe01, 0.0, 7, None, None, []]
[8a36996123fdf77656af7229d4beef, 0.4444444444444444, 3, 54, 0.0031, ['a60862af42e12f

In [112]:
a.predict(df[200:201])

(array([ 0.,  1.]), 'M')

In [113]:
print(df[200:201])

         0       1       2       3       4       5       6       7      8   \
200  0.0131  0.0387  0.0329  0.0078  0.0721  0.1341  0.1626  0.1902  0.261   

         9  ...     51      52      53      54      55     56      57      58  \
200  0.3193 ...  0.015  0.0076  0.0032  0.0037  0.0071  0.004  0.0009  0.0015   

         59  60  
200  0.0085   M  

[1 rows x 61 columns]


In [146]:
for tree in a.trees:
    print(tree.calc_oob_error())
    print(sorted(list(map(lambda x : df.loc[[x]][60].values[0], tree.rows))))
    print('-----')

0.09497206703910614
['M', 'M', 'M', 'M', 'M', 'R', 'R', 'R', 'R', 'R']
-----
0.05056179775280899
['M', 'M', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R']
-----
0.08379888268156424
['M', 'M', 'M', 'M', 'R', 'R', 'R', 'R', 'R', 'R']
-----
0.1452513966480447
['M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'R', 'R']
-----
0.20786516853932585
['M', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'R', 'R']
-----
0.1452513966480447
['M', 'M', 'M', 'M', 'R', 'R', 'R', 'R', 'R', 'R']
-----
0.2247191011235955
['M', 'M', 'M', 'M', 'M', 'M', 'M', 'R', 'R', 'R']
-----
0.07303370786516854
['M', 'M', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'R']
-----
0.028089887640449437
['M', 'M', 'M', 'M', 'R', 'R', 'R', 'R', 'R', 'R']
-----
0.09550561797752809
['M', 'M', 'M', 'M', 'R', 'R', 'R', 'R', 'R', 'R']
-----


In [127]:
a.trees[0].calc_oob_error()

0.09497206703910614

In [139]:
list(map(lambda x : df.loc[[x]][60].values[0], a.trees[0].rows))

['M', 'M', 'R', 'R', 'M', 'R', 'R', 'M', 'M', 'R']

In [82]:
def cross_val_rnf(df, tries):
    for i in range(tries):
        shuffle = df.sample(frac=1)
        forest = RNF(df[0:188], 50, 4, random.randint(1, 999), 40, 80)
        forest.fit()
        score = 0
        for i in range(188, 208):
            actual = shuffle[i:i+1][60].values[0]
            p = forest.predict(shuffle[i:i+1])[1]
            if p == actual:
                score += 1
#             if p[0] > p[1]:
#         #         print('R/{}'.format(actual))
#                 if 'R' == actual: 
#                     score+=1
#             else:
#         #         print('M/{}'.format(actual))
#                 if 'M' == actual: 
#                     score+=1
        print(score/(208-188))

In [None]:
cross_val_rnf(df, 10)

1.0
