## Student #1 ID:

## Student #2 ID:

# Exercise 2: Decision Trees

In this assignment you will implement a Decision Tree algorithm as learned in class.

## Read the following instructions carefully:

1. This jupyter notebook contains all the step by step instructions needed for this exercise.
1. Write **efficient vectorized** code whenever possible. Some calculations in this exercise take several minutes when implemented efficiently, and might take much longer otherwise. Unnecessary loops will result in point deduction.
1. You are responsible for the correctness of your code and should add as many tests as you see fit. Those tests will not be graded nor checked.
1. You are free to add code and markdown cells as you see fit.
1. Write your functions in this jupyter notebook only. Do not create external python modules and import from them.
1. You are allowed to use functions and methods from the [Python Standard Library](https://docs.python.org/3/library/) and [numpy](https://www.numpy.org/devdocs/reference/) only, unless otherwise mentioned.
1. Your code must run without errors. During the environment setup, you were given a specific version of Python to install (`Python >= 3.6, numpy >= 1.14`).
1. Answers to qualitative questions should be written in **markdown cells (with $\LaTeX$ support)**.
1. Submit this jupyter notebook only, using your ID as the filename. Do not use ZIP or RAR. For example, your submission should look like this: `123456789.ipynb` if you worked by yourself or `123456789_987654321.ipynb` if you worked in pairs.

## In this exercise you will perform the following:
1. Practice OOP in python.
2. Implement two impurity measures: Gini and Entropy.
3. Construct a decision tree algorithm.
4. Prune the tree to achieve better results.
5. Visualize your results.

In [61]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# make matplotlib figures appear inline in the notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Make the notebook automatically reload external python modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Warmup - OOP in python

Our decision tree will be implemented using a dedicated python class. Take a minute and practice your object oriented skills. Create a tree with some nodes and make sure you understand how objects in python work.

In [62]:
class Node(object):
    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, node):
        self.children.append(node)

In [162]:
n = Node(5)
p = Node(6)
q = Node(7)
n.add_child(p)
n.add_child(q)
n.children

[<__main__.Node at 0x1de06ada550>, <__main__.Node at 0x1de06adac10>]
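
To check your understanding of object references, one option is to give the node a readable `__repr__` and walk the tree recursively. The sketch below is only an illustration: `LabeledNode` and `walk` are hypothetical names that are not part of the exercise template.

```python
class LabeledNode:
    """Same idea as Node above, plus a readable representation."""

    def __init__(self, data):
        self.data = data
        self.children = []

    def add_child(self, node):
        self.children.append(node)

    def __repr__(self):
        return "LabeledNode({!r})".format(self.data)


def walk(node, depth=0):
    """Yield (depth, node) pairs in depth-first, pre-order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


root = LabeledNode(5)
left, right = LabeledNode(6), LabeledNode(7)
root.add_child(left)
root.add_child(right)
left.add_child(LabeledNode(8))

print(root.children)                          # [LabeledNode(6), LabeledNode(7)]
print([(d, n.data) for d, n in walk(root)])   # [(0, 5), (1, 6), (2, 8), (1, 7)]
```

Because `children` stores references, adding a grandchild to `left` after it was attached to `root` is still visible when walking from `root`.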

## Data preprocessing

We will use the breast cancer dataset that is available as a part of sklearn. In this example, our dataset will be a single matrix with the **labels on the last column**. Notice that you are not allowed to use additional functions from sklearn.

In [165]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

# load dataset
X, y = datasets.load_breast_cancer(return_X_y = True)
X = np.column_stack([X,y]) # the last column holds the labels
# split dataset
X_train, X_test = train_test_split(X, random_state=99)

print("Training dataset shape: ", X_train.shape)
print("Testing dataset shape: ", X_test.shape)

print(X_train[0:4])

Training dataset shape:  (426, 31)
Testing dataset shape:  (143, 31)
[[1.200e+01 2.823e+01 7.677e+01 4.425e+02 8.437e-02 6.450e-02 4.055e-02
  1.945e-02 1.615e-01 6.104e-02 1.912e-01 1.705e+00 1.516e+00 1.386e+01
  7.334e-03 2.589e-02 2.941e-02 9.166e-03 1.745e-02 4.302e-03 1.309e+01
  3.788e+01 8.507e+01 5.237e+02 1.208e-01 1.856e-01 1.811e-01 7.116e-02
  2.447e-01 8.194e-02 1.000e+00]
 [1.157e+01 1.904e+01 7.420e+01 4.097e+02 8.546e-02 7.722e-02 5.485e-02
  1.428e-02 2.031e-01 6.267e-02 2.864e-01 1.440e+00 2.206e+00 2.030e+01
  7.278e-03 2.047e-02 4.447e-02 8.799e-03 1.868e-02 3.339e-03 1.307e+01
  2.698e+01 8.643e+01 5.205e+02 1.249e-01 1.937e-01 2.560e-01 6.664e-02
  3.035e-01 8.284e-02 1.000e+00]
 [1.646e+01 2.011e+01 1.093e+02 8.329e+02 9.831e-02 1.556e-01 1.793e-01
  8.866e-02 1.794e-01 6.323e-02 3.037e-01 1.284e+00 2.482e+00 3.159e+01
  6.627e-03 4.094e-02 5.371e-02 1.813e-02 1.682e-02 4.584e-03 1.779e+01
  2.845e+01 1.235e+02 9.812e+02 1.415e-01 4.667e-01 5.862e-01 2.035e-01
 

## Impurity Measures (10 points)

Impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it was randomly labeled according to the distribution of labels in the subset. Implement the functions `calc_gini` (5 points) and `calc_entropy` (5 points). You are encouraged to test your implementation.
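
As a quick sanity check for the two measures, the sketch below computes them directly from a label vector using the standard definitions, Gini $= 1 - \sum_i p_i^2$ and entropy $= \sum_i p_i \log_2 (1/p_i)$. The helper names `class_probabilities`, `gini_from_labels` and `entropy_from_labels` are illustrative and not part of the exercise template.

```python
import numpy as np


def class_probabilities(labels):
    """Relative frequency of each distinct label."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()


def gini_from_labels(labels):
    p = class_probabilities(labels)
    return 1.0 - np.sum(p ** 2)


def entropy_from_labels(labels):
    p = class_probabilities(labels)
    # every p is > 0 because np.unique only reports labels that occur
    return np.sum(p * np.log2(1.0 / p))


balanced = np.array([0, 0, 1, 1])
pure = np.array([1, 1, 1, 1])
print(gini_from_labels(balanced), entropy_from_labels(balanced))  # 0.5 1.0
print(gini_from_labels(pure), entropy_from_labels(pure))          # 0.0 0.0
```

A perfectly balanced binary set gives the maximum of both measures, and a pure set gives zero for both, which makes these two inputs convenient test cases for `calc_gini` and `calc_entropy`.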

In [65]:
def calc_gini(data):
    """
    Calculate gini impurity measure of a dataset.
 
    Input:
    - data: any dataset where the last column holds the labels.
 
    Returns the gini impurity.    
    """
    gini = 0
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    
    # count how many samples carry each label (vectorized, no Python loop)
    _, counts = np.unique(data[:, -1], return_counts=True)
    # relative frequency of each label
    probs = counts / data.shape[0]
    gini = 1 - np.sum(probs ** 2)
    
        
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return gini
    

In [106]:
a = np.array([[1,2,6],[4,5,6],[7,8,6],[2,2,6]])
if (a[:,-1] == a[:,-1][0]).all():
    print("y")
else:
    print("n")
b = np.array([[1,2]])
if b.all():
    print("yes")
else:
    print("no")

y
yes


In [66]:
def calc_entropy(data):
    """
    Calculate the entropy of a dataset.

    Input:
    - data: any dataset where the last column holds the labels.

    Returns the entropy of the dataset.    
    """
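    entropy = 0.0
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    # relative frequency of each label (vectorized)
    _, counts = np.unique(data[:, -1], return_counts=True)
    probs = counts / counts.sum()
    # np.unique only reports labels that occur, so every probability is > 0
    entropy = -np.sum(probs * np.log2(probs))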
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return entropy


## Building a Decision Tree (50 points)

Use a Python class to construct the decision tree. Your class should support the following functionality:

1. Initiating a node for a decision tree. You will need to use several class methods and class attributes and you are free to use them as you see fit. We recommend that every node will hold the **feature** and **value** used for the split and the **children** of that node. In addition, it might be a good idea to store the **prediction** in that node, the **height** of the tree for that node and whether or not that node is a **leaf** in the tree.
2. Your code should support both Gini and Entropy as impurity measures. 
3. The provided data includes continuous data. For this exercise, create at most a **single split** for each node of the tree (your tree will be binary). Determine the threshold for splitting by checking all possible features and the values available for splitting. When considering the values, take the average of each consecutive pair. For example, for the values [1,2,3,4,5] you should test possible splits on the values [1.5, 2.5, 3.5, 4.5]. 
4. After you complete building the class for a decision node in the tree, complete the function `build_tree`. This function takes as input the training dataset and the impurity measure. Then, it initializes a root for the decision tree and constructs the tree according to the procedure you saw in class.
5. Once you are finished, construct two trees: one with Gini as an impurity measure and the other using Entropy.
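
The threshold search in item 3 can be sketched as follows for a single feature column. This is a minimal illustration under the assumption that the quality of a split is the parent impurity minus the size-weighted impurity of the two children; the names `candidate_thresholds`, `gini` and `split_gain` are hypothetical.

```python
import numpy as np


def candidate_thresholds(column):
    """Midpoints between consecutive distinct values of a feature column."""
    values = np.unique(column)  # sorted and de-duplicated
    return (values[:-1] + values[1:]) / 2


def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def split_gain(column, labels, threshold, impurity=gini):
    """Parent impurity minus the size-weighted impurity of both children."""
    left = column <= threshold
    n, n_left = len(labels), left.sum()
    children = (n_left * impurity(labels[left])
                + (n - n_left) * impurity(labels[~left])) / n
    return impurity(labels) - children


column = np.array([1, 2, 3, 4, 5])
labels = np.array([0, 0, 0, 1, 1])

thresholds = candidate_thresholds(column)
print(thresholds)                          # [1.5 2.5 3.5 4.5]

gains = [split_gain(column, labels, t) for t in thresholds]
print(thresholds[int(np.argmax(gains))])   # 3.5
```

The threshold 3.5 separates the two classes perfectly, so both children are pure and the gain equals the parent impurity.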

In [139]:
class DecisionNode:
    '''
    This class will hold everything you need to construct a node in a DT. You are required to 
    support basic functionality as previously described. It is highly recommended that you  
    first read and understand the entire exercise before diving into this class.
    You are allowed to change the structure of this class as you see fit.
    '''
 
    def __init__(self, data,height):
        # you should take more arguments as inputs when initiating a new node
        self.data = data
        self.children = []
        self.height = height
        self.leaf = True
   
    def set_feature(self, feature, value):
        self.feature = feature
        self.value = value
        
    def set_pred(self, prediction):
        self.prediction = prediction
        
    def add_child(self, node):
        self.children.append(node)
        self.leaf = False
        
        
    def check_split(self, feature, value):
        # this function divides the data according to a specific feature index and value
        # you should use this function while testing for the optimal split
        column = self.data[:, feature]
        left_data = self.data[column <= value]
        right_data = self.data[column > value]
        
        return left_data, right_data
    
    def split(self, impurity_measure):
        # this function goes over all possible features and values and finds
        # the optimal split according to the impurity measure. Note: you can
        # send a function as an argument
        best_gain = 0
        best_feature = 0
        best_value = 0
        total = self.data.shape[0]
        parent_impurity = impurity_measure(self.data)
        # the last column holds the labels, so it is not a candidate feature
        for idx in range(self.data.shape[1] - 1):
            values = np.unique(self.data[:, idx])  # sorted distinct values
            avg_values = (values[:-1] + values[1:]) / 2  # midpoints of consecutive pairs
            for value in avg_values:
                left_data, right_data = self.check_split(idx, value)
                # gain: parent impurity minus the size-weighted impurity of the children
                weighted = (len(left_data) * impurity_measure(left_data)
                            + len(right_data) * impurity_measure(right_data)) / total
                gain = parent_impurity - weighted
                if gain > best_gain:
                    best_gain = gain
                    best_feature = idx
                    best_value = value
        return best_feature, best_value, best_gain
        
            

In [92]:
children = []
children.append(1)
children.append(2)
#children[1]
#children.pop(0)

while children:
    print(children.pop(0))

1
2


In [161]:
def build_tree(data, impurity):
    """
    Build a tree using the given impurity measure and training dataset. 
    You are required to fully grow the tree until all leaves are pure. 
 
    Input:
    - data: the training dataset.
    - impurity: the chosen impurity measure. Notice that you can send a function
                as an argument in python.
 
    Output: the root node of the tree.
    """
    root = None
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
     
    height = 0
    q = []
    #create root node with all samples:
    root = DecisionNode( data,height)
    q.append(root)
    while q:
        #get next node
        node = q.pop(0)
        # store the majority label so that every node can act as a leaf (needed for pruning)
        labels, counts = np.unique(node.data[:, -1], return_counts=True)
        node.set_pred(labels[np.argmax(counts)])
        # stop if the training examples in this node are perfectly classified
        if len(labels) == 1:
            continue
        best_feature, best_value, gain = node.split(impurity)
        # stop if no split reduces the impurity (e.g. identical rows with different labels)
        if gain <= 0:
            continue
        node.set_feature(best_feature, best_value)
        left_data, right_data = node.check_split(best_feature, best_value)
        # left child first, so children[0] holds feature <= value
        for child_data in (left_data, right_data):
            child = DecisionNode(child_data, node.height + 1)
            node.add_child(child)
            q.append(child)
    
    
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return root


In [173]:
# Python supports passing functions as arguments to other functions.
tree_gini = build_tree(data=X_train[0:20,], impurity=calc_gini) 
tree_entropy = build_tree(data=X_train[0:20,], impurity=calc_entropy)


## Tree evaluation (10 points)

Complete the functions `predict` and `calc_accuracy`.

After building both trees using the training set (using Gini and Entropy as impurity measures), you should calculate the accuracy on the test set and print the measure that gave you the best test accuracy. For the rest of the exercise, use that impurity measure. (10 points)
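
Once the predictions for a dataset have been collected, the accuracy itself is a single vectorized comparison. The sketch below assumes a hypothetical `predictions` array produced by some classifier and a hypothetical helper `accuracy_percent`; it is not tied to the tree classes in this notebook.

```python
import numpy as np


def accuracy_percent(predictions, labels):
    """Share of predictions that match the true labels, in percent."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    return 100.0 * np.mean(predictions == labels)


labels = np.array([1, 0, 1, 1])
predictions = np.array([1, 0, 0, 1])
print(accuracy_percent(predictions, labels))  # 75.0
```

Note that the denominator is the number of rows in the dataset, not the number of columns.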

In [158]:
def predict(node, instance):
    """
    Predict a given instance using the decision tree
 
    Input:
    - node: the root of the decision tree (or of a subtree).
    - instance: a row vector from the dataset. 
 
    Output: the prediction of the instance.
    """
    pred = None
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    if (node.leaf):
        pred = node.prediction
    else:
        feature = node.feature
        value = node.value
        if instance[feature] > value:
            pred = predict(node.children[1],instance) #right sub tree
        else:
            pred = predict(node.children[0],instance) #left sub tree
            
    
    
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return pred

In [159]:
def calc_accuracy(node, dataset):
    """
    Calculate the accuracy of the decision tree on a given dataset
 
    Input:
    - node: a node in the decision tree.
    - dataset: the dataset on which the accuracy is evaluated
 
    Output: the accuracy of the decision tree on the given dataset (%).
    """
    accuracy = 0 # (TP + TN) / total
    
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
    for instance in dataset:
        pred = predict(node, instance)
        if pred == instance[-1]:
            accuracy +=1
            
    # divide by the number of rows (instances), not columns, and report a percentage
    accuracy = 100 * accuracy / dataset.shape[0]
    
    
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return accuracy

In [160]:
pred = predict(tree_gini, X_train[0])  # predict expects a single row, not the whole dataset
acc = calc_accuracy(tree_gini, X_train)
print(pred)
print(acc)

## Print the tree (10 points)

Complete the function `print_tree`. Your code should do something like this (10 points):
```
[X0 <= 1],
  [X1 <= 2]
    [X2 <= 3], 
       leaf: [{1.0: 10}]
       leaf: [{0.0: 10}]
    [X4 <= 5], 
       leaf: [{1.0: 5}]
       leaf: [{0.0: 10}]
   leaf: [{1.0: 50}]
```

In [None]:
def print_tree(node):
    """
    Prints the tree similar to the example above.
    As long as the print is clear, any printing scheme will be fine
    
    Input:
    - node: a node in the decision tree.
 
    Output: This function has no return value.
    """
    
    ###########################################################################
    # TODO: Implement the function.                                           #
    ###########################################################################
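    indent = "  " * node.height  # deeper nodes are indented further
    if node.leaf:
        # count how many training samples of each label reached this leaf
        labels, counts = np.unique(node.data[:, -1], return_counts=True)
        summary = {float(l): int(c) for l, c in zip(labels, counts)}
        print("{}leaf: [{}]".format(indent, summary))
    else:
        # print the split of this node
        print("{}[X{} <= {}],".format(indent, node.feature, node.value))
        # print the sub trees
        for child in node.children:
            print_tree(child)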
    
    
    
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return

print_tree(tree_gini)


## Post pruning (20 points)

Construct a decision tree and perform post pruning: For each leaf in the tree, calculate the test accuracy of the tree assuming no split occurred on the parent of that leaf and find the best such parent (in the sense that not splitting on that parent results in the best testing accuracy among possible parents). Make that parent into a leaf and repeat this process until you are left with the root. On a single plot, draw the training and testing accuracy as a function of the number of internal nodes in the tree. Explain and visualize the results and print your tree (20 points).
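
The pruning loop described above can be sketched on a toy tree as follows. This is a minimal illustration, not the required solution: `ToyNode`, `toy_predict`, `toy_accuracy`, `leaf_parents`, `count_internal` and `prune_once` are hypothetical names, the toy data is made up, and the sketch assumes every node stores the majority label of its training data so that it can act as a leaf once its children are removed.

```python
import numpy as np


class ToyNode:
    """Hypothetical minimal node: a split (feature, value) or a leaf."""

    def __init__(self, prediction, feature=None, value=None, children=()):
        self.prediction = prediction  # assumed: majority label at this node
        self.feature, self.value = feature, value
        self.children = list(children)

    @property
    def leaf(self):
        return not self.children


def toy_predict(node, row):
    while not node.leaf:
        # children[0] handles feature <= value, children[1] handles feature > value
        node = node.children[int(row[node.feature] > node.value)]
    return node.prediction


def toy_accuracy(root, data):
    return float(np.mean([toy_predict(root, row) == row[-1] for row in data]))


def leaf_parents(node):
    """All internal nodes that have at least one leaf child."""
    if node.leaf:
        return []
    found = [node] if any(c.leaf for c in node.children) else []
    for child in node.children:
        found += leaf_parents(child)
    return found


def count_internal(node):
    return 0 if node.leaf else 1 + sum(count_internal(c) for c in node.children)


def prune_once(root, data):
    """Turn the leaf parent whose removal gives the best accuracy into a leaf."""
    best_node, best_acc = None, -1.0
    for node in leaf_parents(root):
        saved, node.children = node.children, []  # temporarily undo the split
        acc = toy_accuracy(root, data)
        node.children = saved                     # restore it
        if acc > best_acc:
            best_node, best_acc = node, acc
    best_node.children = []
    return best_acc


# Toy tree on one feature: X0 <= 2.5 -> 0, otherwise (X0 <= 4.5 -> 1, otherwise 0)
inner = ToyNode(1, feature=0, value=4.5, children=[ToyNode(1), ToyNode(0)])
root = ToyNode(0, feature=0, value=2.5, children=[ToyNode(0), inner])
data = np.array([[1, 0], [2, 0], [3, 1], [4, 1], [5, 1]])  # last column: label

history = [(count_internal(root), toy_accuracy(root, data))]
while not root.leaf:
    acc = prune_once(root, data)
    history.append((count_internal(root), acc))
print(history)
```

Each entry of `history` pairs the number of internal nodes with the accuracy at that stage, which is the kind of series the requested plot needs (one such series for the training set and one for the test set).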

In [None]:
###########################################################################
# TODO: Implement the function.                                           #
###########################################################################
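def prune(node, dataset):
    # Base case: a leaf has no split to undo, so just report its accuracy.
    if node.leaf:
        return calc_accuracy(node, dataset)
    # The search over candidate parents is not implemented yet.
    raise NotImplementedError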


###########################################################################
#                             END OF YOUR CODE                            #
###########################################################################