# Introduction to Random Forest

## Introduction
Random forests (also called random decision forests) are an ensemble method that constructs multiple decision trees at training time. For a classification problem, the output of a random forest is usually the mode of the classes predicted by the individual trees; for a regression problem, it is the average of their predictions. One major advantage of random forests is that they correct for the tendency of individual decision trees to overfit their training data. (Reference: https://en.wikipedia.org/wiki/Random_forest)

In this tutorial we first introduce entropy and mutual information, which are prerequisites for decision trees. We then implement a decision tree class that grows recursively on training data, using a splitting method called ID3, which we explain later. Finally, we use the decision tree class to construct a random forest.

## Entropy and mutual information
### Entropy
The entropy of a distribution is the expected amount of information we get when we observe an outcome drawn from the distribution. It measures the uncertainty of the distribution. But how can we measure how much information we get when we observe an outcome? An intuitive answer is that the less likely the outcome, the more information we get. To be specific, we have the following definition (after [Abramson 63]):

Let $E$ be some event which occurs with probability $P(E)$. If we are told that $E$ has occurred, then we say that we have received

$Info(E) = \log_2\frac{1}{P(E)}$

bits of information.
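
As a quick worked example of this definition (the two events are chosen for illustration and are not from the original text), take a fair coin landing heads, with $P(E) = 1/2$, and a fair six-sided die showing a six, with $P(E) = 1/6$:

$Info(heads) = \log_2\frac{1}{1/2} = 1$ bit

$Info(six) = \log_2\frac{1}{1/6} = \log_2 6 \approx 2.585$ bits

The rarer event carries more information, which matches the intuition above.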

Complete the following function to calculate the bits of information we received when we are told that some event with probability $p$ occurs.


In [240]:
import math
def Info(p):
    """given the probability of some event, return the bits of information we receive if it occurs
        
    Args:
        p(float): the probability that the event occurs
        
    Return:
        (float): the bits of information we receive
    """
    
    return math.log(1/p, 2)

The entropy of a distribution $D$, denoted by $H(D)$, is simply the expected amount of information we get when we get a possible outcome from the distribution. It is given by the following equation:

$H(D) = \sum_{E \in D} P(E)\,Info(E)$

Complete the following function to calculate the entropy of a discrete distribution:

In [241]:
def H(p):
    """given a discrete probability distribution, return its entropy
    Args:
        p(list of float): the list of probabilities that each event in the distribution occurs with
        
    Return:
        (float): the entropy of this distribution
    """
    
    entropy_sum = 0
    for event in p:
        if event > 0:
            entropy_sum += event * Info(event)
    return entropy_sum

In [242]:
# Simple examples to test your code:
# The entropy of a fair coin should be 1.0
print H([0.5, 0.5])
# The entropy of a fair die should be about 2.585
print H([1.0/6] * 6)

1.0
2.58496250072
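
To see why entropy is read as a measure of uncertainty, the following self-contained sketch compares three two-outcome distributions. The function name `entropy` and the example distributions are illustrative choices, not part of the tutorial's required interface; the formula is the same as in `H` above.

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with zero probability are skipped: they contribute nothing to the sum.
    return sum(p * math.log(1.0 / p, 2) for p in probs if p > 0)

certain = entropy([1.0])       # the outcome is known in advance
biased = entropy([0.9, 0.1])   # a heavily biased coin
fair = entropy([0.5, 0.5])     # a fair coin

print("certain: %.3f  biased: %.3f  fair: %.3f" % (certain, biased, fair))
```

The entropy is zero when there is nothing to learn, and it grows as the distribution gets closer to uniform.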


### Mutual Information
To illustrate what mutual information is, let's look at an example first.
Suppose the two variables $gosports$ and $weather$ have the following joint distribution:


|               | weather(sunny) | weather(cloudy) | weather(rainy) |
|---------------|----------------|-----------------|----------------|
| gosports(yes) | 0.3            | 0.2             | 0.1            |
| gosports(no)  | 0.1            | 0.1             | 0.2            |

In [243]:
# Calculate the entropy of gosports(should be 0.97)
entropy_sports = H([0.3 + 0.2 + 0.1, 0.1 + 0.1 + 0.2])
print entropy_sports

# Calculate the entropy of gosports conditioned on weather = sunny, cloudy, and rainy (should be 0.81, 0.92, 0.92 respectively)
entropy_sports_sunny = H([0.3 / 0.4, 0.1 / 0.4])
print entropy_sports_sunny
entropy_sports_cloudy = H([0.2 / 0.3, 0.1 / 0.3])
print entropy_sports_cloudy
entropy_sports_rainy = H([0.1 / 0.3, 0.2 / 0.3])
print entropy_sports_rainy

# Calculate the expected entropy of gosports if we are told weather (should be 0.88)
entropy_sports_weather =  entropy_sports_sunny * 0.4 + entropy_sports_cloudy * 0.3 + entropy_sports_rainy * 0.3
print entropy_sports_weather

# Calculate the expected reduced entropy of gosports if we know weather(should be 0.095)
entropy_reduced = entropy_sports - entropy_sports_weather
print entropy_reduced

0.970950594455
0.811278124459
0.918295834054
0.918295834054
0.875488750216
0.0954618442383


From the example, we see that once we are told the weather, the entropy (uncertainty) of gosports is reduced by 0.095 bits on average. We call the expected reduction in the entropy of a variable $X$ from learning a variable $Y$ the *mutual information* between $X$ and $Y$, denoted by $I(X, Y)$.

Formally, mutual information can be calculated as:

$I(X, Y) = H(X) - H(X|Y)$

where $H(X|Y)$ is shorthand for $E_Y[H(X|Y = y)]$, the expected entropy of $X$ after observing $Y$.
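
For a discrete $Y$ the expectation is a weighted sum over the values of $Y$. Plugging in the rounded numbers from the weather example above gives the same conditional entropy that the cell computed:

$H(X|Y) = \sum_{y} P(Y = y)\,H(X|Y = y)$

$H(gosports|weather) \approx 0.4 \times 0.811 + 0.3 \times 0.918 + 0.3 \times 0.918 \approx 0.875$

$I(gosports, weather) \approx 0.971 - 0.875 \approx 0.095$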

Complete the following function to calculate mutual information:

In [244]:
def I(joint_dist):
    """given the joint distribution of two variables, calculate the mutual information between them
    Args:
        joint_dist(list of list of float): the joint distribution between two variables, for example, 
        for the example above joint_dist = [[0.3, 0.2, 0.1], [0.1, 0.1, 0.2]]
        
    Return:
        (float) the mutual information between these two variables:
    """
    m = len(joint_dist)
    if m <= 0:
        return -1
    n = len(joint_dist[0])
    probs = []
    for i in range(m):
        probs.append(sum(joint_dist[i]))
    H_total = H(probs)
    H_reduced = 0.0
    for j in range(n):
        total_prob = 0.0
        probs = []
        for i in range(m):
            total_prob += joint_dist[i][j]
            probs.append(joint_dist[i][j])
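        if total_prob <= 0:
            # a value of the second variable that never occurs contributes nothing
            continue
        for i in range(m):
            probs[i] /= total_prob
        H_reduced += total_prob * H(probs)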
    return H_total - H_reduced
            
            

In [245]:
# result should be 0.09546
print I([[0.3, 0.2, 0.1], [0.1, 0.1, 0.2]])

#result should also be 0.09546
print I([[0.3, 0.1], [0.2, 0.1], [0.1, 0.2]])

0.0954618442383
0.0954618442383
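
The two calls above agree because mutual information is symmetric: $I(X, Y) = I(Y, X)$. One way to see this is the equivalent formula $I(X, Y) = \sum_{x, y} P(x, y) \log_2 \frac{P(x, y)}{P(x)P(y)}$, which treats both variables the same way. The sketch below implements that formula as a cross-check; the function name `mutual_information` is an illustrative choice, and this is an alternative to the `I` function above, not a replacement for it.

```python
import math

def mutual_information(joint):
    """Mutual information in bits, computed as sum of p(x,y) * log2(p(x,y) / (p(x) * p(y)))."""
    row_marginals = [sum(row) for row in joint]          # P(x)
    col_marginals = [sum(col) for col in zip(*joint)]    # P(y)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # cells with zero probability contribute nothing
                total += p_xy * math.log(p_xy / (row_marginals[i] * col_marginals[j]), 2)
    return total

joint = [[0.3, 0.2, 0.1], [0.1, 0.1, 0.2]]
transposed = [list(col) for col in zip(*joint)]

print("%.5f" % mutual_information(joint))
print("%.5f" % mutual_information(transposed))
```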


## Decision Tree
Consider the following dataset. Each row represents a single training example.

| outlook | humidity | wind | play sports? |
|--------|--------|--------|--------|
| overcast | high | strong | yes|
| overcast | normal | weak | yes |
| overcast | high | weak | yes |
| overcast | normal | strong | yes |
| sunny | high | strong | no |
| sunny | normal | weak | yes |
| rain | high | strong | no |
| rain | normal | weak | yes|

Based on the training data above, we want to predict whether play sports is yes or no from outlook, humidity, and wind. After a close look at the dataset, we find that if the outlook is overcast, then play sports is yes. Otherwise, the answer depends on other attributes: if outlook is sunny, play sports = yes iff humidity is normal; if outlook is rain, play sports = yes iff wind is weak.

We can express the if-else process above as a tree, as follows:

                                               outlook
                                              /   |   \
                                             /    |    \
                                        sunny overcast rain
                                          |       |     |
                                     humidity    yes   wind
                                     /     \          /    \
                                   high   normal   strong  weak
                                    |       |        |      |
                                    no     yes      no      yes
This is a simple example of a decision tree.
Please note that the tree above is not the only tree consistent with our training dataset.
In general, at each node of a decision tree we look at one of the attributes and partition our dataset according to the values taken by this attribute. When a partition contains only one kind of label, the process stops for that partition. Otherwise we select a new attribute and partition the data recursively until all data in the same partition have the same label, or some other stopping condition is reached.
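
The tree drawn above is just a nested if-else. The following sketch writes it out by hand and checks it against the eight training rows; the function name `play_sports` is an illustrative choice, and the string values are taken directly from the table.

```python
def play_sports(outlook, humidity, wind):
    """Hand-written version of the tree drawn above."""
    if outlook == 'overcast':
        return 'yes'
    if outlook == 'sunny':
        return 'yes' if humidity == 'normal' else 'no'
    # remaining case: outlook == 'rain'
    return 'yes' if wind == 'weak' else 'no'

# the training data from the table: (outlook, humidity, wind, play sports?)
data = [
    ('overcast', 'high', 'strong', 'yes'),
    ('overcast', 'normal', 'weak', 'yes'),
    ('overcast', 'high', 'weak', 'yes'),
    ('overcast', 'normal', 'strong', 'yes'),
    ('sunny', 'high', 'strong', 'no'),
    ('sunny', 'normal', 'weak', 'yes'),
    ('rain', 'high', 'strong', 'no'),
    ('rain', 'normal', 'weak', 'yes'),
]

# the tree is consistent with the data if it reproduces every label
consistent = all(play_sports(o, h, w) == label for (o, h, w, label) in data)
print(consistent)
```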

However, the process above leaves a central problem unsolved: when we need to split our dataset, which attribute should we use? One common approach, used by the ID3 algorithm, is to select the attribute that most reduces the uncertainty (entropy) of the labels in the training data. In other words:

1. View each attribute as a distribution. For example, in the dataset above, there are 4 overcast, 2 sunny, and 2 rain. Then the distribution of outlook is [0.5, 0.25, 0.25].

2. Select the attribute (distribution) that has the highest mutual information with the label (which can also be viewed as a distribution).

3. Use the selected attribute to partition the current dataset. Recurse if necessary.
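
To make the three steps concrete, the following sketch computes the mutual information between each attribute and the label for the eight-row dataset above. It assumes the integer encoding 0/1/2 for overcast/sunny/rain, 0/1 for high/normal, 0/1 for strong/weak, and 0/1 for yes/no. The helper names `label_entropy` and `information_gain` are illustrative, not part of the tutorial's required interface. On this particular dataset all three attributes come out at about 0.311 bits, so it is a tie-breaking rule (here, smallest index) that selects outlook.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the empirical distribution of a list of labels."""
    n = len(labels)
    return sum((float(c) / n) * math.log(float(n) / c, 2) for c in Counter(labels).values())

def information_gain(X, Y, attr_index):
    """H(Y) - H(Y | attribute), i.e. the mutual information between attribute and label."""
    groups = {}
    for row, label in zip(X, Y):
        groups.setdefault(row[attr_index], []).append(label)
    conditional = sum(float(len(g)) / len(Y) * label_entropy(g) for g in groups.values())
    return label_entropy(Y) - conditional

X = [[0, 0, 0], [0, 1, 1], [0, 0, 1], [0, 1, 0],
     [1, 0, 0], [1, 1, 1], [2, 0, 0], [2, 1, 1]]
Y = [0, 0, 0, 0, 1, 0, 1, 0]

gains = [information_gain(X, Y, j) for j in range(3)]

# pick the best attribute; only a strict improvement replaces the current best,
# so ties keep the smaller index
best = 0
for j in range(1, len(gains)):
    if gains[j] > gains[best] + 1e-12:
        best = j

print(["%.3f" % g for g in gains])
print(best)
```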

Complete the following DecisionTreeNode class:

In [246]:
class DecisionTreeNode():
    
    def __init__(self, X, Y, used_attr_num = 0):
        """
        return the tree node on dataset (X, Y). Build its children recursively.
        pick the attribute which has maximum mutual information with our label Y, and use that attribute to split 
        data into several classes, each class corresponding to one possible value of that attribute.
        The stopping condition is that all samples in a node have the same label.
        If there are multiple attributes with the same mutual information, pick the one with the smallest index
        
        args:
            X(list of list of integer): X[i][j] is the value of the jth attribute of the ith training data
            For any given attribute, we encode the set of possible values it can take as 0, 1, 2, ...
            
            Y(list of integer): Y[i] is the label of the ith training data. We encode the set of possible values
            as 0, 1, 2, ...
            
            For example, for the outlook/humidity/wind/play sports dataset above:
            X = [[0, 0, 0],
                [0, 1, 1],
                [0, 0, 1],
                [0, 1, 0],
                [1, 0, 0],
                [1, 1, 1],
                [2, 0, 0],
                [2, 1, 1]]
            Y = [0, 0, 0, 0, 1, 0, 1, 0]
            constraint: (X, Y) should represent at least one sample
            
            used_attr_num(integer): number of used attributes: if all of the attributes have been used but the dataset
            still contains more than one label (which means there are conflicting data), predict the majority label and
            stop recursion
            
        recommended members:
            self.label (None or integer): if the node is a leaf, it has a non-None label indicating its label
            self.attr_index (integer, defined when self.label is None): the index of the splitting attribute
            for this node
            self.child (dict, mapping from attribute value to child node): used to find the next child if this node
            is not a leaf
        """
        if len(set(Y)) <= 1:
            self.label = Y[0];
            return
        if used_attr_num >= len(X[0]):
            #attrs has been used up
            self.label = max(set(Y), key=Y.count)
            return
        self.label = None
        #select a proper attribute as the split attribute for this Node
        sample_num = len(X)
        attr_num = len(X[0])
        max_mutual_info = 0
        opt_attr_index = -1
        for attr_index in range(attr_num):
            d = {}
            for sample_index in range(sample_num):
                attr = X[sample_index][attr_index]
                label = Y[sample_index]
                if not attr in d:
                    d[attr] = {}
                if not label in d[attr]:
                    d[attr][label] = 0
                d[attr][label] += 1
            joint_dist = []
            for attr in d:
                row_dist = []
                for label in set(Y):
                    if not label in d[attr]:
                        row_dist.append(0.0)
                    else:
                        row_dist.append(float(d[attr][label]) / sample_num)
                joint_dist.append(row_dist)
            mutual_info = I(joint_dist)
            if mutual_info > max_mutual_info:
                max_mutual_info = mutual_info
                opt_attr_index = attr_index
        #opt_attr_index is selected, split child based on the index
        self.attr_index = opt_attr_index
        self.child = {}
        attrToData = {}
        for sample_index in range(sample_num):
            attr = X[sample_index][self.attr_index]
            if not attr in attrToData:
                attrToData[attr] = [[],[]]; #map attr to [X, Y] pair for child
            attrToData[attr][0].append(X[sample_index])
            attrToData[attr][1].append(Y[sample_index])
            
        for attr in attrToData:
            self.child[attr] = DecisionTreeNode(attrToData[attr][0], attrToData[attr][1], used_attr_num + 1)
            
    def printNode(self, depth = 0, attr_names = None, attr_values = None, output_names = None):
        """
        print the tree recursively, mainly used for debug
        args:
            depth: (int) the depth of current node, depth of root is zero
            attr_names: a list of string, denoting the name for each attribute
            attr_values: a list of list of string, attr_values[i][j] is the name of the ith attribute, jth value
            output_names: a list of string, denoting the name for each kind of output
        """
        if self.label is not None:
            print '\t' * depth,
            if output_names is None:
                print 'label = ' + str(self.label)
            else:
                print 'label = ' + output_names[self.label]
        else:
            for attr in self.child:
                print '\t' * depth,
                if attr_names is None or attr_values is None:
                    print 'attr[' + str(self.attr_index) + '] = ' + str(attr) + ':'
                else:
                    print attr_names[self.attr_index] + ' = ' + attr_values[self.attr_index][attr] + ':'
                self.child[attr].printNode(depth + 1, attr_names, attr_values, output_names)
                
    def predict(self, x):
        """
        given an input sample x, predict its label according to the tree.
        implement it recursively
        if some value of the splitting attribute has never been seen by the node, return None
        args:
            x (list of integers) an input sample:
        return:
            label(int) the predicted label of this input
        """
        if self.label is not None:
            return self.label
        try:
            attr = x[self.attr_index]
            return self.child[attr].predict(x)
        except KeyError:
            return None
        

### Construct Decision Tree Node
In this subsection you will implement the constructor of the tree node. Please follow these specifications:
1. If all samples in the dataset for the current node have the same label, stop recursion and use that label as the label for this tree node.
2. If the dataset for the current node has at least two kinds of labels, split the dataset using some attribute (feature). Choose the attribute that has the maximum mutual information with the current labels. If multiple attributes have the same mutual information, choose the attribute with the smallest index. This greedy choice does not always minimize the depth of the tree, but generally speaking it works well. Finding a tree of minimum depth is NP-hard, so no polynomial-time algorithm can guarantee the optimal result unless P = NP.

After implementing the construction function, please test your code using the following simple test case:

In [247]:
# TestCode for node construction       
Y = [0, 0, 0, 0, 1, 0, 1, 0]
X = [[0, 0, 0],[0, 1, 1],[0, 0, 1],[0, 1, 0],[1, 0, 0],[1, 1, 1],[2, 0, 0],[2, 1, 1]]
node = DecisionTreeNode(X, Y)
print node.attr_index
print node.child[0].label
print node.child[1].attr_index
print node.child[2].attr_index
# The code above should produce the following results:
# 0
# 0
# 1
# 1

0
0
1
1


### Print Tree Node
Implement the printNode(depth = 0) function to print the tree structure recursively. This function can help you get an intuitive feel for the decision tree.
We do not have strict requirements for the implementation. However, your function should clearly indicate the label, the splitting attribute, and the value of the splitting attribute. We recommend using indentation to represent the structure of the tree. Please run the following test code after implementation:


In [248]:
# TestCode for printNode()      
node.printNode()

#The code above should give the following tree structure:
# 
#  attr[0] = 0:
#         label = 0
#  attr[0] = 1:
#         attr[1] = 0:
#                 label = 1
#         attr[1] = 1:
#                 label = 0
#  attr[0] = 2:
#         attr[1] = 0:
#                 label = 1
#         attr[1] = 1:
#                 label = 0
# 

 attr[0] = 0:
	label = 0
 attr[0] = 1:
	attr[1] = 0:
		label = 1
	attr[1] = 1:
		label = 0
 attr[0] = 2:
	attr[1] = 0:
		label = 1
	attr[1] = 1:
		label = 0


### Predict Label for Test Data
Now that you have implemented the construction of the tree node, the next step is to predict labels for test data. You should do it recursively. The recommended stopping condition is that self.label is not None for the current node.
Please run the following test code after implementation:

In [249]:
# Test code for predict()
print node.predict([2,0,1]) # output should be 1
print node.predict([0,0,0]) # output should be 0

1
0


## Random Forest
We train our random forest using a general technique called bootstrap aggregating, or bagging. Given a training set $X = x_1, ..., x_n$ with labels $Y = y_1, ..., y_n$, we repeatedly ($B$ times) select a random sample of the training set with replacement and train a decision tree on that sample (reference: https://en.wikipedia.org/wiki/Random_forest).

Algorithm:

For $b = 1, ..., B$:
1. Sample, with replacement, n training examples from $X$, $Y$; call these $X_b$, $Y_b$.
2. Train a decision or regression tree $f_b$ on $X_b$, $Y_b$.
  
After training, predictions on an unseen sample $x$ can be made by averaging the predictions from all the individual regression trees on $x$:

$
f(x) = {\frac {1}{B}}\sum _{b=1}^{B} f_b(x)
$

or by taking the majority vote in the case of decision trees.
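
The two building blocks of the algorithm above, drawing a bootstrap sample and combining predictions by majority vote, can be sketched independently of any tree implementation. The names `bootstrap_sample` and `majority_vote` and the tiny dataset are illustrative assumptions. Treating `None` as an abstention follows this tutorial's convention that a tree returns `None` when it meets an attribute value it has never seen.

```python
import random
from collections import Counter

def bootstrap_sample(X, Y, n, rng=random):
    """Draw n (x, y) pairs uniformly at random from (X, Y), with replacement."""
    indices = [rng.randrange(len(X)) for _ in range(n)]
    return [X[i] for i in indices], [Y[i] for i in indices]

def majority_vote(predictions):
    """Most common prediction, ignoring abstentions (None); None if every tree abstains."""
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 0, 1, 1]

# n may exceed len(X) because the same sample can be drawn more than once
X_b, Y_b = bootstrap_sample(X, Y, 6, rng)
print(len(X_b))
print(majority_vote([0, 1, 1, None]))
```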

Implement the constructor and predict() of RandomForest in the following class:

In [250]:
import random
class RandomForest:
    def __init__(self, X, Y, B, n):
        """
        Construct a Random Forest using DecisionTreeNode.
        args:
            X(list of list of integer): X[i][j] is the value of the jth attribute of the ith training data
            For any given attribute, we encode the set of possible values it can take as 0, 1, 2, ...

            Y(list of integer): Y[i] is the label of the ith training data. We encode the set of possible values
            as 0, 1, 2, ...
            
            B(integer): number of decision trees this forest has
            
            n(integer): number of samples used to train each decision tree. Samples are drawn from X, Y uniformly
            with replacement
            
        recommended member:
            roots (list of DecisionTreeNode): a list of decision tree roots. len(roots) = B
        """
        self.roots = []
        sample_num = len(X)
        for i in range(B):
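            # NOTE: shuffling the indices and taking the first n draws n distinct samples,
            # i.e. WITHOUT replacement, and requires n <= len(X). For a true bootstrap sample
            # (with replacement), draw each index with random.randrange(sample_num) instead.
            index = range(sample_num)
            random.shuffle(index)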
            X_b = []
            Y_b = []
            for j in range(n):
                X_b.append(X[index[j]])
                Y_b.append(Y[index[j]])
            self.roots.append(DecisionTreeNode(X_b, Y_b))
            
    def predict(self, x):
        """
        Predict the label of input x using majority voting. If multiple labels receive the same number of votes,
        return any one of them
        Note: since each tree is trained on a random set of samples, it is possible that a decision tree has
        never seen some of the values of some features. In this case the tree's predict() returns None, and you
        should discard the vote of this tree
        If the votes of all trees have been discarded, you should return None
        args:
            x (list of integers) an input sample:
        return:
            (int) the predicted label of this input, by majority voting over all trees in the forest
        """
        cnt = {}
        for root in self.roots:
            pred = root.predict(x)
            if pred is not None:
                if pred in cnt:
                    cnt[pred] += 1
                else:
                    cnt[pred] = 1
        opt_pred = None
        max_cnt = 0
        for pred in cnt:
            if cnt[pred] > max_cnt:
                max_cnt = cnt[pred]
                opt_pred = pred
        return opt_pred
        

In [251]:
# Test code for random forest:
randomForest = RandomForest(X, Y, 4, 4)
print randomForest.predict([2,0,0]) # 1 with prob around 0.6, 0 with prob around 0.4
print randomForest.predict([1,1,1]) # 1 with prob around 0.1, 0 with prob around 0.9

# Write your code here to test their probabilities


0
0


## Play with Real Data!
Hopefully you have now implemented the RandomForest class. You can try it on some real data instead of the artificial play sports examples.

We use a dataset from http://archive.ics.uci.edu/ml/datasets/Car+Evaluation. In this dataset, we want to predict the acceptability (unacc, acc, good, vgood) of a car based on the following properties:
1. buying price: vhigh, high, med, low. 
2. maintaining price: vhigh, high, med, low. 
3. number of doors: 2, 3, 4, 5more. 
4. positions for person: 2, 4, more. 
5. size of luggage boot : small, med, big. 
6. safety: low, med, high. 

First we need to load the data and construct our training and test datasets:
(There is some dirty work here; I recommend that we not ask students to work on this part.)

In [252]:
# Dirty work start, you can ignore it
# load data, convert data to our representation, and split data into training and testing data
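train_sample_num = 1200
# the with-block closes the file once all lines have been read
with open('car_acceptability.csv') as f:
    lines = f.readlines()
X = []
Y = []
convert_dict = [{}, {}, {}, {}, {}, {}, {}]
for line in lines:
    # strip the trailing newline so it does not become part of the label name
    attrs = line.strip().split(',')
    if len(attrs) < 7:
        continue  # skip blank or malformed lines
    row = []
    for i in range(7):
        attr = attrs[i]
        if not attr in convert_dict[i]:
            # encode each new value as the next unused integer for this column
            convert_dict[i][attr] = len(convert_dict[i])
        row.append(convert_dict[i][attr])
    X.append(row[:6])
    Y.append(row[6])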

sample_num = len(X)

index = range(sample_num)
random.shuffle(index)
Xtrain = []
Xtest = []
Ytrain = []
Ytest = []
for i in range(sample_num):
    if i < train_sample_num:
        Xtrain.append(X[index[i]])
        Ytrain.append(Y[index[i]])
    else:
        Xtest.append(X[index[i]])
        Ytest.append(Y[index[i]])
# construct attr_names, attr_values, and output_names to make the printNode more readable
# not something important
attr_names = ['buy price', 'maintain price', 'num of doors', 'num of person', 'size of lug', 'safety']
output_names = [0] * (max(convert_dict[6].values()) + 1)
for key in convert_dict[6]:
    output_names[convert_dict[6][key]] = key
attr_values = [0] * 6
for i in range(6):
    attr_values[i] = [0] * (max(convert_dict[i].values()) + 1)
    for key in convert_dict[i]:
        attr_values[i][convert_dict[i][key]] = key
# Dirty work end, attention please
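
# The encoding idea used in the cell above can be seen on a tiny in-memory example.
# This is only a sketch: `encode_column` and the sample values are illustrative and
# are not read from the CSV file.

```python
def encode_column(values):
    """Map each distinct value to 0, 1, 2, ... in order of first appearance."""
    mapping = {}
    encoded = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # next unused integer code
        encoded.append(mapping[v])
    return encoded, mapping

encoded, mapping = encode_column(['vhigh', 'high', 'vhigh', 'med', 'low', 'high'])

# invert the mapping so that names[code] gives back the original string,
# which is what printNode needs to print readable values
names = [None] * len(mapping)
for value, code in mapping.items():
    names[code] = value

print(encoded)
print(names)
```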

### Play with Decision Tree on Real Data
In this part you should use $Xtrain$ and $Ytrain$ to train your decision tree. After training, print the tree and test its accuracy on the test data:

In [253]:
# Train your tree and print it here
root = DecisionTreeNode(Xtrain, Ytrain)
#root.printNode(0, attr_names, attr_values, output_names)

# The beginning of your tree should look like:
# safety = low:
#     label = unacc

#      safety = med:
#         num of person = 2:
#             label = unacc

#         num of person = 4:
#                 buy price = vhigh:
#                         maintain price = vhigh:
#                                 label = unacc

#                         maintain price = high:
#                                 label = unacc

#                         maintain price = med:
#                                 size of lug = small:
#                                         label = unacc

# test your tree on Xtest and Ytest here:
pred = []
cnt = 0
for i in range(len(Xtest)):
    if Ytest[i] == root.predict(Xtest[i]):
        cnt += 1;
print float(cnt) / len(Ytest)
# our implementation has accuracy between 0.85-0.9

0.907196969697
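
The counting loop in the cell above is repeated for every model we evaluate, so it could be factored into a small helper. The sketch below is one way to do that. The names `accuracy` and `AlwaysZero` are hypothetical; `AlwaysZero` is only a stand-in model so that the example runs without the car dataset. Any object with a `predict(x)` method, such as a DecisionTreeNode or a RandomForest, could be passed instead.

```python
def accuracy(model, X, Y):
    """Fraction of samples in (X, Y) whose label the model predicts correctly.

    A prediction of None (for example, an unseen attribute value) never equals
    a label, so it counts as a wrong prediction.
    """
    correct = sum(1 for x, y in zip(X, Y) if model.predict(x) == y)
    return float(correct) / len(Y)

class AlwaysZero:
    """Stand-in model that predicts label 0 for every input."""
    def predict(self, x):
        return 0

score = accuracy(AlwaysZero(), [[0], [1], [2], [3]], [0, 0, 1, 0])
print(score)
```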


### Play with Random Forest on Real Data
In this part you should use $Xtrain$ and $Ytrain$ to train your random forest. After training, test its accuracy on the test data:

In [254]:
# Train your forest here
forest = RandomForest(Xtrain, Ytrain, 100, 200)

# Test your forest on Xtest and Ytest here:
pred = []
cnt = 0
for i in range(len(Xtest)):
    if Ytest[i] == forest.predict(Xtest[i]):
        cnt += 1;
print float(cnt) / len(Ytest)
# With B = 100 and n = 1000 (rather than the n = 200 used above), our implementation has accuracy around 0.92.
# Compare this with the accuracy of a single decision tree to see the improvement a random forest can give

0.886363636364


### How accuracy changes with number of samples
We have noticed that the accuracy changes as $n$ changes. In this last section you should plot how the accuracy changes with $n$ while the other parameters are fixed. To get a relatively stable result, we recommend repeating the experiment at least 10 times for each value of $n$ and taking the average.

In [None]:
# Write your code here:
import matplotlib.pyplot as plt
repeat = 10
ns = range(50, 1250, 50)
res = []
B = 100
for n in ns:
    total = 0.0
    for r in range(repeat):
        forest = RandomForest(Xtrain, Ytrain, B, n)
        pred = []
        cnt = 0
        for i in range(len(Xtest)):
            if Ytest[i] == forest.predict(Xtest[i]):
                cnt += 1;
        total += float(cnt) / len(Ytest)
    res.append(total / repeat)
plt.plot(ns, res)
plt.xlabel('number of samples')
plt.ylabel('accuracy')
plt.show()


Our implementation shows that as $n$ increases, the accuracy first increases and then decreases.

This result is quite reasonable. As your final task, think about why it happens.

<img src="files/example.png">
