 # Table of Contents
<div class="toc" style="margin-top: 1em;"><ul class="toc-item" id="toc-level0"><li><span><a href="#What-is-Random-Forest" data-toc-modified-id="What-is-Random-Forest-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>What is Random Forest</a></span></li><li><span><a href="#The-algorithm" data-toc-modified-id="The-algorithm-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>The algorithm</a></span></li><li><span><a href="#Data-Prep" data-toc-modified-id="Data-Prep-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Prep</a></span></li><li><span><a href="#Decision-Tree-and-Random-Forest" data-toc-modified-id="Decision-Tree-and-Random-Forest-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Decision Tree and Random Forest</a></span><ul class="toc-item"><li><span><a href="#Decision-tree-basics" data-toc-modified-id="Decision-tree-basics-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Decision tree basics</a></span></li><li><span><a href="#Implementation-of-Tree" data-toc-modified-id="Implementation-of-Tree-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Implementation of Tree</a></span></li></ul></li></ul></div>

## What is Random Forest

>Decision trees can suffer from high variance which makes their results fragile to the specific training data used.

>Building multiple models from samples of your training data, called bagging, can reduce this variance, but the trees are highly correlated.

>Random Forest is an extension of bagging: in addition to building trees on multiple samples of your training data, it also constrains the features that can be used to build the trees, forcing the trees to be different. This, in turn, can give a lift in performance.

## The algorithm

>Decision trees involve the greedy selection of the best split point from the dataset at each step.

>This algorithm makes decision trees susceptible to high variance if they are not pruned. This high variance can be harnessed and reduced by creating multiple trees with different samples of the training dataset (different views of the problem) and combining their predictions. This approach is called bootstrap aggregation or bagging for short.
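
As a rough illustration of the sampling step, a bootstrap sample draws as many rows as the training set holds, with replacement, so some rows repeat and others are left out. The helper name `bootstrap_sample`, the 10 stand-in row ids and the fixed seed are assumptions made for this sketch only.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows, with replacement."""
    return [rng.choice(rows) for _ in range(len(rows))]

rng = random.Random(0)      # seed chosen only to make the sketch repeatable
rows = list(range(10))      # stand-in for 10 training row ids
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    # each sample keeps the original size but usually has fewer distinct rows
    print(sorted(sample), "distinct:", len(set(sample)))
```

Each sample would then train its own tree, and the trees' predictions are combined afterwards.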

>A limitation of bagging is that the same greedy algorithm is used to create each tree, so the same or very similar split points are likely to be chosen in each tree, making the different trees very similar (the trees will be correlated). This, in turn, makes their predictions similar, which limits how much of the variance the combination can remove.

>We can force the decision trees to be different by limiting the features (columns) that the greedy algorithm can evaluate at each split point when creating the tree. This is called the Random Forest algorithm.

>Like bagging, multiple samples of the training dataset are taken and a different tree trained on each. The difference is that at each point a split is made in the data and added to the tree, only a fixed subset of attributes can be considered.

>For classification problems,  the number of attributes to be considered for the split is limited to the square root of the number of input features.
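
A minimal sketch of that feature restriction, assuming 60 input features (the width of the sonar data used later in this notebook) and rounding the square root down. The helper name `random_feature_subset` is hypothetical.

```python
import math
import random

def random_feature_subset(n_features, rng):
    """Pick floor(sqrt(n_features)) distinct feature indices for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)                    # fixed seed, for repeatability only
subset = random_feature_subset(60, rng)   # sqrt(60) is about 7.7, so 7 features
print(len(subset), subset)
```

A fresh subset would be drawn at every split point, not once per tree.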

>The result of this one small change is trees that differ more from each other (less correlated), which gives predictions that are more diverse and a combined prediction that often performs better than a single tree or bagging alone.

## Data Prep

The sample data is the UCI sonar dataset: 208 rows, each with 60 numeric features and a class label (`R` or `M`) in the last column.

In [3]:
%mkdir -p data/research

In [4]:
import urllib.request
file_path = 'data/research/sonar.all-data.csv'
d_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
urllib.request.urlretrieve(d_url, file_path)

('data/research/sonar.all-data.csv', <http.client.HTTPMessage at 0x10bd8f6d8>)

In [5]:
import pandas as pd
df = pd.read_csv(file_path, header=None)

In [103]:
df.shape

(208, 61)

## Decision Tree and Random Forest

### Decision tree basics

In a decision tree, split points are chosen by finding the feature, and the value of that feature, that results in the lowest cost.

For classification problems, this cost is usually evaluated with the Gini index, which measures the purity of the groups of data created by the split point.

A tree node is pure (`gini = 0`) if all instances it applies to belong to the same class.

*Gini Impurity* is measured as 
$$
G_i = 1 - \sum_{k=1}^n p_{i,k}^2
$$

where $p_{i,k}$ is the fraction of the instances in the $i^{th}$ node that belong to class $k$, and $n$ is the number of classes.

For example, take a node with 54 input instances: 0 belong to class A, 49 to class B, and 5 to class C. The gini score is then $1 - (0/54)^2 - (49/54)^2 - (5/54)^2 \approx 0.168$
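
The same arithmetic can be checked in a few lines of Python. This is only a sketch; `gini_from_counts` is a hypothetical helper that takes the number of instances per class.

```python
def gini_from_counts(counts):
    """Gini impurity of a node, given the number of instances per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

node_gini = gini_from_counts([0, 49, 5])   # classes A, B, C
print(round(node_gini, 3))                 # 0.168
```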

In our case, we only want a binary classifier outputting `relevant (1)` or `irrelevant (0)`. If a split leaves a node that holds a single class (a leaf), its *gini impurity* is 0.

Another measure is *Entropy*:
$$H_i = - \sum_{k=1 \mid p_{i,k} \neq 0}^n p_{i,k}\log(p_{i,k})$$
Note that *Entropy* is slightly more expensive to compute because it uses $\log$.

> - Gini is intended for continuous attributes, and Entropy for attributes that occur in classes
> - Gini is to minimize misclassification
> - Entropy is for exploratory analysis
> - Entropy may be a little slower to compute

General Implementation of both:

In [7]:
def calc_shannon_entropy(data):
    pass

In [8]:
def cal_gini_index(data):
    pass
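
One possible way to fill in the two stubs, assuming `data` is a plain sequence of class labels and that the natural logarithm is wanted (as with `math.log` in the `Node` class below). The helper `class_ratios` is hypothetical.

```python
import math
from collections import Counter

def class_ratios(labels):
    """Share of each class that actually occurs in labels."""
    counts = Counter(labels)
    return [n / len(labels) for n in counts.values()]

def calc_shannon_entropy(labels):
    # classes with no members never appear in the Counter, so log(0) cannot occur
    return sum(-p * math.log(p) for p in class_ratios(labels))

def cal_gini_index(labels):
    return 1.0 - sum(p ** 2 for p in class_ratios(labels))

mixed = ['R'] * 5 + ['M'] * 5   # evenly mixed node
pure = ['M'] * 4                # pure node
print(cal_gini_index(mixed), calc_shannon_entropy(mixed))
print(cal_gini_index(pure), calc_shannon_entropy(pure))
```

Both measures are 0 for the pure node and reach their maximum for the evenly mixed one.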

### Implementation of Tree

In [35]:
import math

In [122]:
class Node:
    '''
    A dummy version of tree nodes.
    data     - the full DataFrame, shared by every node
    rows     - index labels of the rows that reach this node
    features - columns this node may still split on
    '''

    def __init__(self, data, rows, features):
        self.left = None
        self.right = None
        self.data = data
        self.rows = rows
        self.features = features
        self.label_index = 60      # column that holds the class label
        self.labels = ['R', 'M']

    def label_ratio(self, label):
        # fraction of this node's rows that carry the given label
        rows = set(self.rows)
        members = self.data.index[self.data[self.label_index] == label]
        return sum(1 for x in members if x in rows) / len(self.rows)

    def calc_shannon_entropy(self):
        raw_val = 0
        for label in self.labels:
            ratio = self.label_ratio(label)
            if ratio == 0: continue    # class absent from this node, log(0) is undefined
            raw_val += -ratio * math.log(ratio)
        return raw_val

    def calc_gini_index(self):
        raw_val = 1
        for label in self.labels:
            raw_val -= self.label_ratio(label) ** 2
        return raw_val

    def cal_info_gain(self):
        '''
        calculate info gain from gini/entropy
        '''
        pass

    def find_break_points(self, df, feature):
        # df must be sorted by `feature`; a candidate break point is the feature
        # value of every row whose label differs from the row before it
        labels = df[self.label_index].values
        values = df[feature].values
        # float precision issue, care
        return [values[i + 1] for i in range(len(df) - 1) if labels[i] != labels[i + 1]]

    def split(self):
        '''
        Choose the best feature to split at this point
        i.e. low gini/entropy, high infoGain
        '''
        min_gini, min_feature, min_break_point, new_left, new_right = 1, -999, -999, None, None
        for feature in self.features:
            to_parse = self.data[[feature, self.label_index]]
            to_parse = to_parse.loc[to_parse.index.isin(self.rows)]
            to_parse = to_parse.sort_values(feature)   # not in place: to_parse is a slice of self.data
            remaining = [x for x in self.features if x != feature]
            for break_point in self.find_break_points(to_parse, feature):
                # children share self.data and only carry their own rows and features
                left = Node(self.data, to_parse.loc[to_parse[feature] < break_point].index.values, remaining)
                right = Node(self.data, to_parse.loc[to_parse[feature] >= break_point].index.values, remaining)
                if len(left.rows) == 0:
                    continue   # tied feature values can leave the left side empty
                # NOTE: plain sum of the child impurities, not weighted by child size
                total_gini = left.calc_gini_index() + right.calc_gini_index()
                if total_gini < min_gini:
                    min_gini, min_break_point, min_feature, new_left, new_right = total_gini, break_point, feature, left, right
        self.left = new_left
        self.right = new_right
        return (min_gini, min_break_point, min_feature)

In [123]:
ls = [x for x in range(60)]
dummy = Node(df, range(df.shape[0]), ls)
# dummy = Node(df, [3,5,6,7], range(60))
print(dummy.split())
print(dummy.left.split())

(0.49509851975296526, 0.036799999999999999, 8)
(1, -999, -999)


In [125]:
dummy.left.rows

array([22, 64, 67, 65, 50, 10])

In [128]:
print(dummy.right.split())

(0.49120158267388586, 0.088599999999999998, 16)


In [129]:
dummy.right.left.rows

array([ 8, 94, 25, 40,  3, 92])
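
The `split` method adds the two child impurities without weighting them, which may explain why both splits above peel off only 6 rows: a tiny pure child contributes 0 no matter how small it is. CART-style trees usually weight each child's impurity by its share of the rows. The sketch below contrasts the two costs on made-up labels; `gini` and `weighted_gini` are hypothetical helpers, not part of the `Node` class.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of one group of class labels."""
    return 1.0 - sum((n / len(labels)) ** 2 for n in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split, each child weighted by its share of the rows."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

# split A: a tiny pure child next to a large child that is still mixed
tiny_left, big_right = ['R'] * 2, ['R'] * 8 + ['M'] * 10
# split B: two equally sized children that are each mostly one class
even_left, even_right = ['R'] * 7 + ['M'] * 3, ['R'] * 3 + ['M'] * 7

unweighted_a = gini(tiny_left) + gini(big_right)
unweighted_b = gini(even_left) + gini(even_right)
print("unweighted:", unweighted_a, unweighted_b)   # the plain sum prefers split A
print("weighted:  ", weighted_gini(tiny_left, big_right),
      weighted_gini(even_left, even_right))        # the weighted cost prefers split B
```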

In [10]:
class Tree:
    '''
    A dummy implementation of decision trees
    '''

    def __init__(self, train_data, depth, benchmark):  # should we include data here
        '''
        params:
        train_data - training data to train the tree
        depth - max recursion depth of the tree
        benchmark - benchmark for gini/entropy
        '''
        self.depth = depth
        self.benchmark = benchmark

    def fit(self, train_data):
        '''
        Recursively split until gini/entropy benchmark met or max depth reached
        '''
        pass

    def predict(self, test_data):
        '''
        params:
        test_data - test data to run the prediction on

        return:
        outputs confidence/probability of each category
        '''
        pass

    def update(self, more_data):
        '''
        params:
        more_data - more training data to update the tree

        return:
        None or we can say something like which nodes are changed
        '''
        pass

    def store_tree(self, file_path):
        '''
        Maybe we can use pickle for this
        '''
        pass

    def load_tree(self, file_path):
        pass
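
The recursion that `fit` and `predict` describe could look roughly like the sketch below. It is not the `Tree` class itself: it works on plain lists, tries every distinct feature value as a threshold, scores splits with a size-weighted gini, and stops only on purity or depth (no `benchmark`). All of these choices, and the function names, are assumptions.

```python
from collections import Counter

def gini(labels):
    return 1.0 - sum((n / len(labels)) ** 2 for n in Counter(labels).values())

def best_split(X, y):
    """Return (cost, feature, threshold) of the lowest weighted-gini split, or None."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [label for row, label in zip(X, y) if row[feature] < threshold]
            right = [label for row, label in zip(X, y) if row[feature] >= threshold]
            if not left or not right:
                continue
            cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or cost < best[0]:
                best = (cost, feature, threshold)
    return best

def fit(X, y, depth):
    """Split recursively until the node is pure, depth runs out, or no split exists."""
    split = best_split(X, y) if depth > 0 and gini(y) > 0 else None
    if split is None:
        return Counter(y).most_common(1)[0][0]   # leaf: majority class
    _, feature, threshold = split
    left = [(row, label) for row, label in zip(X, y) if row[feature] < threshold]
    right = [(row, label) for row, label in zip(X, y) if row[feature] >= threshold]
    return {
        'feature': feature,
        'threshold': threshold,
        'left': fit([r for r, _ in left], [l for _, l in left], depth - 1),
        'right': fit([r for r, _ in right], [l for _, l in right], depth - 1),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while isinstance(tree, dict):
        branch = 'left' if row[tree['feature']] < tree['threshold'] else 'right'
        tree = tree[branch]
    return tree

X = [[0.1, 0.9], [0.2, 0.8], [0.8, 0.3], [0.9, 0.1]]   # toy data, two features
y = ['R', 'R', 'M', 'M']
tree = fit(X, y, depth=2)
predictions = [predict(tree, row) for row in X]
print(tree)
print(predictions)
```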

In [11]:
import random

class RNF:
    '''
    Dummy Version of Random Forest
    '''

    def __init__(self, train_data, n_trees, tree_depth, random_seed, n_max_features, n_max_input):
        '''
        params:
        train_data - training data to train the trees
        n_trees - number of trees to set up
        tree_depth - max recursion depth of each tree
        random_seed - seed for random gen
        n_max_features - max num of features to pass to each tree
        n_max_input - max num of input to pass to each tree
        '''
        self.train_data = train_data
        self.n_max_features = n_max_features
        self.n_max_input = n_max_input
        self.rng = random.Random(random_seed)
        # benchmark is not a forest parameter yet, so None is only a placeholder
        self.trees = [Tree(train_data, tree_depth, None) for _ in range(n_trees)]
        self.features = []  # list of tuples like (tree, emails, features)

    def random_select(self, train_data):
        '''
        Randomly select features and emails from the train_data
        '''
        pass

    def fit(self):
        '''
        pass randomly selected emails and features to each tree
        '''
        for tree in self.trees:
            tree.fit(self.random_select(self.train_data))

    def some_majority_count_metric(self, scores):
        '''
        calculate a proba from output of each tree's prediction
        should output two arrays: probas and classification
        '''
        pass

    def predict(self, test_data):
        scores = [tree.predict(test_data) for tree in self.trees]
        return self.some_majority_count_metric(scores)

    def update(self, more_data):
        '''
        params:
        more_data - more training data to update the forest

        return:
        None or we can say something like which trees are changed
        '''
        pass

    def store_rnf(self, file_path):
        '''
        Maybe we can use pickle for this
        '''
        pass

    def load_rnf(self, file_path):
        pass
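
For the combining step, one simple option is a per-sample majority vote, with the vote shares doubling as rough probabilities. The sketch below assumes each tree returns one hard label per sample, which is simpler than the confidence output the `Tree.predict` docstring describes. The name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """
    tree_predictions: one list of predicted labels per tree, all the same length.
    Returns (probas, classification): for each sample, the share of trees voting
    for each label, and the label with the most votes.
    """
    probas, classification = [], []
    for votes in zip(*tree_predictions):   # all votes for one sample
        counts = Counter(votes)
        probas.append({label: n / len(votes) for label, n in counts.items()})
        classification.append(counts.most_common(1)[0][0])
    return probas, classification

scores = [
    ['R', 'M', 'M'],   # tree 1, three samples
    ['R', 'R', 'M'],   # tree 2
    ['M', 'R', 'M'],   # tree 3
]
probas, classification = majority_vote(scores)
print(classification)
print(probas)
```

Ties go to whichever label was seen first, so an odd number of trees keeps a two-class vote unambiguous.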