In [1]:
import numpy as np
import math
import matplotlib.pyplot as plt
import pandas as pd
import common as cm

# Part 1: Information Gain

Important note: this exercise uses the Pandas (data manipulation and analysis) and Graphviz (graph drawing) libraries.

This exercise consists of 3 parts. Complete the first part to get a mark of 3.0, the first two parts to get 4.0, and all three parts to get 5.0.

1.1) There are 10 objects (data) characterized by 5 binary attributes:

In [2]:
attributeNames = ["attr 1", "attr 2", "attr 3", "attr 4", "attr 5"]

data = pd.DataFrame(
    [
        [1, 0, 1, 1, 1],
        [1, 1, 0, 0, 1],
        [0, 1, 1, 1, 1],
        [1, 0, 1, 0, 1],
        [1, 0, 0, 1, 1],
        [0, 0, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 0, 0, 1, 1],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 1, 1],
    ],
    columns=attributeNames,
)

1.2) Each object is assigned to either class "0" or class "1". The assignments are as follows (cl):

In [3]:
data["cl"] = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]

Hint: how to read data (columns) in Pandas:

In [4]:
print(data["cl"])
print(list(data["cl"]))
print(set(data["cl"]))
print(data["attr 1"])

0    1
1    1
2    1
3    0
4    0
5    1
6    1
7    1
8    0
9    0
Name: cl, dtype: int64
[1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
{0, 1}
0    1
1    1
2    0
3    1
4    1
5    0
6    1
7    1
8    0
9    0
Name: attr 1, dtype: int64


1.3) Complete the function below for calculating entropy: $H(CL) = - \sum_{y \in CL} p(y) \log_2 p(y)$. It should return the entropy of an input vector CL. Assume that $\log_2(0)$ is equal to 0.

In [5]:
def getEntropy(cl):
    # Entropy of a binary (0/1) class vector.
    cl_list = list(cl)
    prob_1 = sum(1 for elem_cl in cl_list if elem_cl == 1) / len(cl_list)
    prob_0 = 1 - prob_1
    entropy = 0
    # Skip zero-probability terms, i.e. treat 0 * log2(0) as 0.
    if prob_1 != 0:
        entropy -= prob_1 * math.log2(prob_1)
    if prob_0 != 0:
        entropy -= prob_0 * math.log2(prob_0)
    return entropy
        


1.4 ) Calculate the entropy for the CL vector:

In [6]:
### TODO print()
getEntropy(data["cl"])

0.9709505944546686
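As a cross-check on the value above, here is a minimal sketch of the same formula written for any number of classes rather than only 0/1. The name `entropy_multiclass` is hypothetical and is not part of the exercise or of common.py.

```python
import math
from collections import Counter


def entropy_multiclass(labels):
    """H(CL) = -sum_y p(y) * log2(p(y)) over the classes that actually occur."""
    labels = list(labels)
    n = len(labels)
    # Classes with zero count never appear in the Counter, so no log2(0) is evaluated.
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())


# Class vector from step 1.2: six objects in class 1, four in class 0.
cl = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
print(round(entropy_multiclass(cl), 4))  # 0.971
```

With p(1) = 0.6 and p(0) = 0.4 this is -(0.6 log2 0.6 + 0.4 log2 0.4), which agrees with the result of getEntropy.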

1.5) Complete the function below for calculating conditional entropy: $H(CL|X) = - \sum_{x \in X} \sum_{y \in CL} p(x,y) \log_2 \frac{p(x,y)}{p(x)}$. Assume that $\log_2(0)$ is equal to 0 and that, if $p(x) = 0$, $\frac{p(x,y)}{p(x)}$ is equal to 0 as well.

In [7]:
def getConditionalEntropy(dt, cl, attr):
    # Weighted sum of the class entropies in the attr == 0 and attr == 1 subsets.
    total = dt[cl].shape[0]
    result = 0
    for attr_value in (0, 1):
        subset = dt.loc[dt[attr] == attr_value, cl]
        if subset.shape[0] > 0:  # an empty subset contributes nothing
            result += getEntropy(subset) * (subset.shape[0] / total)
    return result

1.6) Calculate conditional entropies for the given attributes.

In [8]:
print(getConditionalEntropy(data,"cl", "attr 1"))
print(getConditionalEntropy(data,"cl", "attr 5"))

0.9509775004326937
0.9709505944546686
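The two values above can also be reproduced term by term from the joint distribution, following the double sum in the definition. This is only a sketch: the function name `conditional_entropy_from_joint` is an assumption, and it works on plain lists instead of a DataFrame.

```python
import math
from collections import Counter


def conditional_entropy_from_joint(xs, ys):
    """H(Y|X) = -sum_{x,y} p(x,y) * log2(p(x,y) / p(x)), from observed counts."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    marginal = Counter(xs)        # counts of each x
    # p(x,y) / p(x) simplifies to count(x, y) / count(x); unseen pairs are skipped.
    return -sum(
        (count / n) * math.log2(count / marginal[x])
        for (x, _), count in joint.items()
    )


attr_1 = [1, 1, 0, 1, 1, 0, 1, 1, 0, 0]  # "attr 1" column from step 1.1
attr_5 = [1] * 10                        # "attr 5" column: the same value for every object
cl = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]

print(round(conditional_entropy_from_joint(attr_1, cl), 4))  # 0.951
print(round(conditional_entropy_from_joint(attr_5, cl), 4))  # 0.971
```

Because "attr 5" is constant, conditioning on it leaves the class distribution unchanged, so its conditional entropy equals H(CL).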


1.7) Which entropy is lower, and why?

1.8) Complete the function below for calculating information gain:

In [9]:
def getInformationGain(dt, cl, attr):
    # Information gain = H(CL) - H(CL | attr).
    return getEntropy(dt[cl]) - getConditionalEntropy(dt, cl, attr)

In [10]:
print(getInformationGain(data,"cl", "attr 1"))
print(getInformationGain(data,"cl", "attr 5"))

0.01997309402197489
0.0


# Part 2: ID3 algorithm

A decision tree consists of decision nodes and leaves. Nodes split data while leaves classify objects. Consider the class "Node" provided below. It has 4 fields:
- attr - attribute ID (use the names in the attributeNames vector)
- left - left branch, i.e., a reference to another node
- right - right branch, i.e., a reference to another node
- value - a decision. If value is None, then the node is not a leaf. If value is not None, then the node is considered a leaf.

The __call__ method returns the decision if the node is a leaf (i.e., when value is not None).
Otherwise, it calls either the left or the right branch on the input object, based on the attribute value (0 -> left child; 1 -> right child). In this way, we can traverse the decision tree in order to find the final decision.

In [11]:
class Node:
    def __init__(self, attr, left, right, value):
        self.attr = attr
        self.left = left
        self.right = right
        self.value = value

    def __call__(self, obj):
        if self.value is None:
            if obj[self.attr] == 0:
                return self.left(obj)
            else:
                return self.right(obj)
        else:
            return self.value
        
### EXAMPLE
def example(obj):
    root = Node(0, None, None, None)
    lChildren = Node(1, None, None, None)
    rChildren = Node(None, None, None, 2)
    root.left = lChildren
    root.right = rChildren
    llChildren = Node(None, None, None, 3)
    rrChildren = Node(None, None, None, 4)
    lChildren.left = llChildren
    lChildren.right = rrChildren
    print(root(obj))
    
example([0, 0])
example([0, 1])
example([1, 0])
example([1, 1])

3
4
2
2
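The recursive __call__ can equally be read as a loop that keeps descending until it reaches a leaf. The sketch below re-declares a minimal Node with the same four fields and adds a hypothetical helper `classify` (not part of the exercise) to show that traversal explicitly.

```python
class Node:
    # Same four fields as the Node class above; __call__ is left out on purpose.
    def __init__(self, attr, left, right, value):
        self.attr = attr
        self.left = left
        self.right = right
        self.value = value


def classify(node, obj):
    """Walk from `node` down to a leaf and return that leaf's decision."""
    while node.value is None:  # still at a decision node
        node = node.left if obj[node.attr] == 0 else node.right
    return node.value


# Same shape as the example tree: the root tests position 0,
# its left child tests position 1, and its right child is a leaf.
left_child = Node(1, Node(None, None, None, 3), Node(None, None, None, 4), None)
root = Node(0, left_child, Node(None, None, None, 2), None)

for obj in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(obj, "->", classify(root, obj))
```

The loop visits the same nodes as the recursive version, so it yields 3, 4, 2, 2 for the four inputs.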


2.1) Create an initial root. Set the value (decision) to 1. 

In [12]:
root = Node(0, None, None, 1)
print(root([0,0]))

1


2.2) Use the getErrorRate method from the common.py auxiliary file to calculate the error rate. The decision is made based on the majority rule. In case of a tie, the method takes 0 as the default class.

In [13]:
### TODO
cm.getErrorRate(root,data)

0.4
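The implementation of getErrorRate in common.py is not shown here, so the following is only a sketch of what an error rate measures: the fraction of objects whose predicted class differs from the true class. The name `error_rate_sketch` and the two example predictors are assumptions made for illustration.

```python
def error_rate_sketch(predict, rows, labels):
    """Fraction of rows for which predict(row) differs from the true label."""
    mistakes = sum(1 for row, label in zip(rows, labels) if predict(row) != label)
    return mistakes / len(labels)


# Objects and class labels from steps 1.1 and 1.2.
rows = [
    [1, 0, 1, 1, 1], [1, 1, 0, 0, 1], [0, 1, 1, 1, 1], [1, 0, 1, 0, 1], [1, 0, 0, 1, 1],
    [0, 0, 1, 1, 1], [1, 1, 1, 1, 1], [1, 0, 0, 1, 1], [0, 1, 0, 0, 1], [0, 0, 0, 1, 1],
]
labels = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]


def always_one(row):
    # A single leaf that always answers 1, like the initial root.
    return 1


def follow_attr_3(row):
    # A one-split tree: predict 0 when "attr 3" is 0, otherwise 1.
    return row[2]


print(error_rate_sketch(always_one, rows, labels))     # 0.4
print(error_rate_sketch(follow_attr_3, rows, labels))  # 0.3
```

A constant answer of 1 is wrong for the four objects of class 0, which gives 4/10.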

2.3) Use the printGraph method (see the common.py file) to draw the decision tree and save it in a png file.

In [14]:
### TODO
cm.printGraph(root,data)

2.4) Calculate information gain for all attributes.

In [15]:
def printInformationGain(data):
    for attribute_name in attributeNames:
        print(attribute_name)
        print(getInformationGain(data,"cl",attribute_name))
        
printInformationGain(data)

attr 1
0.01997309402197489
attr 2
0.0464393446710154
attr 3
0.12451124978365313
attr 4
0.0912774462416801
attr 5
0.0
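Choosing the split attribute amounts to taking the attribute with the largest gain. A minimal sketch using the values printed above (the dictionary `gains` is introduced only for this illustration):

```python
# Information gains printed above, keyed by attribute name.
gains = {
    "attr 1": 0.01997309402197489,
    "attr 2": 0.0464393446710154,
    "attr 3": 0.12451124978365313,
    "attr 4": 0.0912774462416801,
    "attr 5": 0.0,
}

# max() with key=gains.get returns the key whose value is largest;
# if several attributes tie, the first one in insertion order wins.
best_attribute = max(gains, key=gains.get)
print(best_attribute)  # attr 3
```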


2.5) Choose the best attribute to split the data. Construct two new nodes: one for $x_i$ = 0 decision and the second for $x_i$ = 1; connect them with the root (left and right branch). Remember to update the root. 

In [16]:
### TODO
root = Node("attr 3", None, None, None)
child_l1_0 = Node(None, None, None, 0)
child_l1_1 = Node(None, None, None, 1)
root.left = child_l1_0
root.right = child_l1_1

2.6) Print the graph and calculate the error rate. What happened to the error rate?

In [17]:
### TODO
cm.printGraph(root,data,fileName="obr.jpg")
cm.getErrorRate(root,data)

0.30000000000000004

2.7) Split the 'data' (table) based on the selected attribute, i.e., create two new tables.

In [18]:
### TODO
left_data = data.loc[data["attr 3"]==0]
right_data = data.loc[data["attr 3"]==1]
print(left_data)
print(right_data)

   attr 1  attr 2  attr 3  attr 4  attr 5  cl
1       1       1       0       0       1   1
4       1       0       0       1       1   0
7       1       0       0       1       1   1
8       0       1       0       0       1   0
9       0       0       0       1       1   0
   attr 1  attr 2  attr 3  attr 4  attr 5  cl
0       1       0       1       1       1   1
2       0       1       1       1       1   1
3       1       0       1       0       1   0
5       0       0       1       1       1   1
6       1       1       1       1       1   1
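Since the same split is repeated at every node, it can be wrapped in a small helper. This is a sketch under the assumption that the attribute is binary; `split_by_attribute` and the `toy` table are hypothetical names, and the table holds only the "attr 3" and "cl" columns of the data.

```python
import pandas as pd


def split_by_attribute(dt, attr):
    """Return (rows where attr == 0, rows where attr == 1) for a binary attribute."""
    is_one = dt[attr] == 1
    return dt.loc[~is_one], dt.loc[is_one]


toy = pd.DataFrame(
    {
        "attr 3": [1, 0, 1, 1, 0, 1, 1, 0, 0, 0],
        "cl": [1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
    }
)

left_part, right_part = split_by_attribute(toy, "attr 3")
print(list(left_part.index))   # [1, 4, 7, 8, 9]
print(list(right_part.index))  # [0, 2, 3, 5, 6]
```

The original row labels are preserved, which makes it easy to see which objects ended up in each branch.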


2.8) Let us start with the left node. Firstly, calculate information gain for this node.

In [19]:
### TODO
printInformationGain(left_data)

attr 1
0.4199730940219749
attr 2
0.01997309402197489
attr 3
0.0
attr 4
0.01997309402197489
attr 5
0.0


2.9) Choose the best attribute to split the data and then update the decision tree.

In [20]:
### TODO
root.left=Node('attr 1',None,None,None)
root.left.left = Node(None, None, None, 0)
root.left.right = Node(None, None, None, 1)

2.10) Print the graph and calculate the error rate. What happened to the error rate?

In [21]:
cm.printGraph(root,data)
cm.getErrorRate(root,data)

0.19999999999999996

2.11) Split data (remember that we split left_data, not data).

In [22]:
### TODO
left_left_data = left_data.loc[left_data["attr 1"]==0]
left_right_data = left_data.loc[left_data["attr 1"]==1]
print(left_left_data)
print(left_right_data)

   attr 1  attr 2  attr 3  attr 4  attr 5  cl
8       0       1       0       0       1   0
9       0       0       0       1       1   0
   attr 1  attr 2  attr 3  attr 4  attr 5  cl
1       1       1       0       0       1   1
4       1       0       0       1       1   0
7       1       0       0       1       1   1


2.12) Repeat the whole process for the right node.

In [23]:
# TODO compute the information gain
printInformationGain(right_data)

attr 1
0.17095059445466865
attr 2
0.17095059445466865
attr 3
0.0
attr 4
0.7219280948873623
attr 5
0.0


In [24]:
# TODO update the decision tree
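# "attr 4" has the highest gain on right_data; refer to it by name, as for the other nodes
root.right = Node("attr 4", None, None, None)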
root.right.left = Node(None, None, None, 0)
root.right.right = Node(None, None, None, 1)

In [25]:
# TODO print the decision tree and calculate the error rate
cm.printGraph(root)
cm.getErrorRate(root,data)

0.09999999999999998

In [26]:
# TODO split the data (right_data)
right_left_data = right_data.loc[right_data["attr 4"]==0]
right_right_data = right_data.loc[right_data["attr 4"]==1]
print(right_left_data)
print(right_right_data)

   attr 1  attr 2  attr 3  attr 4  attr 5  cl
3       1       0       1       0       1   0
   attr 1  attr 2  attr 3  attr 4  attr 5  cl
0       1       0       1       1       1   1
2       0       1       1       1       1   1
5       0       0       1       1       1   1
6       1       1       1       1       1   1


2.13) Let's consider the left-left node. Calculate information gain for it.

In [27]:
# TODO
printInformationGain(left_left_data)
# every gain is 0 because the entropy of this subset is already 0 (both objects belong to class 0)

attr 1
0.0
attr 2
0.0
attr 3
0.0
attr 4
0.0
attr 5
0.0


2.14) Will adding a new node to the tree improve its effectiveness? Why? Why not?

2.15) Calculate information gain for the left-right node.

In [28]:
printInformationGain(left_right_data)

attr 1
0.0
attr 2
0.2516291673878229
attr 3
0.0
attr 4
0.2516291673878229
attr 5
0.0


In [29]:
### Select the attribute and update the tree
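# "attr 2" and "attr 4" tie for the highest gain; take "attr 2" and refer to it by name
root.left.right = Node("attr 2", None, None, None)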
root.left.right.left = Node(None, None, None, 0)
root.left.right.right = Node(None, None, None, 1)

In [30]:
### Print the decision tree and compute the error rate
cm.printGraph(root)
cm.getErrorRate(root,data)

0.09999999999999998

2.16) What happened to the error rate? Is it necessary to keep these two newly added leaves?

2.17) Finish creating the right side of the tree

In [31]:
### TODO
# Nothing to do: the right subtree was completed in step 2.12 and both of its leaves are pure.

# Part 3: automated construction of decision trees

3.1) Complete the following function for the automated construction of decision trees, so that it returns a decision tree for the given data and attribute list. Note that this is a recursive method, i.e., it calls itself.

In [32]:
def getInformationGanes(dt, attribute_names):
    # Information gain of each attribute, in the order given by attribute_names.
    return [getInformationGain(dt, "cl", name) for name in attribute_names]

getInformationGanes(data, attributeNames)

[0.01997309402197489,
 0.0464393446710154,
 0.12451124978365313,
 0.0912774462416801,
 0.0]

In [40]:
max_depth = 4

def createTree(data, attribute_names, depth=0):
    data = data.reset_index(drop=True)
    gains = getInformationGanes(data, attribute_names)
    max_gain = max(gains)
    # Stop when no attribute is informative or the depth limit is reached;
    # the leaf predicts the majority class (a tie goes to class 1 here).
    if max_gain == 0 or depth == max_depth:
        return Node(None, None, None, 1 if data.loc[data["cl"] == 1].shape[0] * 2 >= data.shape[0] else 0)
    # Use the attribute's actual name instead of rebuilding it as "attr <n>".
    best_attr = attribute_names[gains.index(max_gain)]
    node = Node(best_attr, None, None, None)
    node.left = createTree(data.loc[data[best_attr] == 0], attribute_names, depth + 1)
    node.right = createTree(data.loc[data[best_attr] == 1], attribute_names, depth + 1)
    return node
root =createTree(data,attributeNames)
cm.printGraph(root)
cm.getErrorRate(root,data)

0.09999999999999998
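To check how deep a constructed tree actually is, independently of max_depth, one can measure it recursively. The sketch below uses a hypothetical helper `tree_depth` and, as test input, a tree with the same shape as the one assembled by hand in Part 2; the Node class is re-declared with only its constructor so that the snippet runs on its own.

```python
class Node:
    # Same four fields as the Node class from Part 2.
    def __init__(self, attr, left, right, value):
        self.attr = attr
        self.left = left
        self.right = right
        self.value = value


def leaf(value):
    return Node(None, None, None, value)


def tree_depth(node):
    """Number of edges on the longest path from `node` to a leaf."""
    if node.value is not None:  # a leaf has depth 0
        return 0
    return 1 + max(tree_depth(node.left), tree_depth(node.right))


# Shape of the hand-built tree: "attr 3" at the root, "attr 1" then "attr 2"
# on the left side, and "attr 4" on the right side.
hand_built = Node(
    "attr 3",
    Node("attr 1", leaf(0), Node("attr 2", leaf(0), leaf(1), None), None),
    Node("attr 4", leaf(0), leaf(1), None),
    None,
)

print(tree_depth(hand_built))  # 3
```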

3.2) Build a decision tree for the training dataset in the common.py auxiliary file, for different values of max_depth. Calculate and compare the error rates for the training and validation datasets.

In [41]:
max_depth = 10

In [42]:
### TODO
for i in range(10):
    max_depth = i
    print("max depth: "+str(max_depth))
    root =createTree(cm.getTrainingDataSet()[1],cm.getTrainingDataSet()[0])
    cm.printGraph(root)
    print(cm.getErrorRate(root,cm.getTrainingDataSet()[1]))
    print(cm.getErrorRate(root,cm.getValidationDataSet()[1]))

max depth: 0
0.4
0.5
max depth: 1
0.35
0.5
max depth: 2
0.30000000000000004
0.5
max depth: 3
0.25
0.4
max depth: 4
0.25
0.5
max depth: 5
0.19999999999999996
0.30000000000000004
max depth: 6
0.19999999999999996
0.30000000000000004
max depth: 7
0.19999999999999996
0.30000000000000004
max depth: 8
0.19999999999999996
0.30000000000000004
max depth: 9
0.19999999999999996
0.30000000000000004
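The printed pairs are easier to compare side by side. The sketch below copies the error rates from the output above into two lists (it does not call common.py) and reports the smallest max_depth that reaches the lowest validation error.

```python
# Error rates copied from the output above, indexed by max_depth 0..9.
train_error = [0.4, 0.35, 0.30000000000000004, 0.25, 0.25] + [0.19999999999999996] * 5
validation_error = [0.5, 0.5, 0.5, 0.4, 0.5] + [0.30000000000000004] * 5

depths = range(len(train_error))
for d in depths:
    gap = validation_error[d] - train_error[d]
    print(f"max_depth={d}  train={train_error[d]:.2f}  "
          f"validation={validation_error[d]:.2f}  gap={gap:.2f}")

# min() returns the first depth that attains the minimum validation error.
best_depth = min(depths, key=lambda d: validation_error[d])
print("smallest max_depth with the lowest validation error:", best_depth)
```

On these numbers the validation error is never below the training error, and both stop changing from max_depth = 5 onward.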


In [36]:
### TODO

3.3) Consider only the training data set and answer the following questions:
* What is the maximum depth of the tree (consider only the training data set)?
* The tree building process should stop when there is no improvement in error rate (why?). Check for which value of "max_depth" there is no improvement in error rate.

In [37]:
for i in range(10):
    max_depth = i
    root =createTree(cm.getTrainingDataSet()[1],cm.getTrainingDataSet()[0])
    print(cm.getErrorRate(root,cm.getTrainingDataSet()[1]))
    
    ### TODO

0.4
0.35
0.30000000000000004
0.25
0.25
0.19999999999999996
0.19999999999999996
0.19999999999999996
0.19999999999999996
0.19999999999999996


The training error stops improving at max_depth = 5 (error rate 0.2), so the maximum useful depth is 5.
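One subtlety in the training errors above: the error does not improve between max_depth = 3 and max_depth = 4, yet it drops again at max_depth = 5. The sketch below, which uses the printed values and hypothetical variable names, separates the first depth that brings no improvement from the last depth that still does.

```python
# Training error rates copied from the output above, indexed by max_depth 0..9.
train_error = [0.4, 0.35, 0.30000000000000004, 0.25, 0.25] + [0.19999999999999996] * 5

eps = 1e-9  # tolerance for floating-point noise in the printed values

# First max_depth whose error is not lower than the error at the previous depth.
first_stall = next(
    d for d in range(1, len(train_error))
    if train_error[d] >= train_error[d - 1] - eps
)

# Last max_depth whose error is strictly lower than the error at the previous depth.
last_improvement = max(
    d for d in range(1, len(train_error))
    if train_error[d] < train_error[d - 1] - eps
)

print(first_stall)       # 4
print(last_improvement)  # 5
```

A rule that stops at the first stall would keep the depth-3 tree (error 0.25) and miss the improvement available at depth 5, so the choice of stopping rule matters on this data.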