# Decision Tree Exercise
A) Find the data on people here [1]. The goal is to decide whether someone buys a computer. Derive the best decision tree by hand using Shannon entropy, at least for the first split.

B) Compare your tree with the tree produced by scikit-learn, as in the earlier Python example. Why are they different? Print the tree with Graphviz (this can easily be done with WebGraphViz [2]).

## A. Derive decision tree by hand
### The data
![data](data.gif)

### Step 1: Calculate entropy on the target
![step1](step1.png)
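
As a cross-check for the hand calculation, the Shannon entropy $H = -\sum_i p_i \log_2 p_i$ can be computed in a few lines of Python. This is a minimal sketch, assuming the ten `Buys_computer` values printed in Part B (with 1 read as "buys" and 0 as "does not buy"); the helper name `entropy` is illustrative, and the figure above may be based on a different number of rows.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


# Buys_computer column of the ten rows listed in Part B (assumed: 1 = yes, 0 = no)
buys_computer = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]

target_entropy = entropy(buys_computer)
print(f"H(Buys_computer) = {target_entropy:.3f} bits")
```

With six positive and four negative labels the result is about 0.971 bits; a perfectly balanced target would give 1 bit and a pure one 0 bits.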

### Step 2: Calculate Information Gain on each branch
#### Entropy using the frequency table of two attributes and Information Gain on "Age"
![age](age.png)

#### Information Gain on all branches "Age", "Income", "Student" and "CreditRating"
![all](all.png)
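
The information gain of an attribute is the entropy of the target minus the weighted entropy of the groups obtained by splitting on that attribute. The sketch below applies this to all four attributes, assuming the ten numerically encoded rows printed in Part B; the names `ROWS`, `entropy`, and `information_gain` are illustrative, and the values may differ from the figure if it uses a different number of rows.

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ["Age", "Income", "Student", "Credit_rating"]

# Ten rows from Part B: (Age, Income, Student, Credit_rating, Buys_computer)
ROWS = [
    (0, 2, 0, 0, 0), (0, 2, 0, 1, 0), (1, 2, 0, 0, 1), (2, 1, 0, 0, 1),
    (2, 0, 1, 0, 1), (2, 0, 1, 1, 0), (1, 0, 1, 1, 1), (0, 1, 0, 0, 0),
    (0, 0, 1, 0, 1), (2, 1, 1, 0, 1),
]


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Target entropy minus the weighted entropy after splitting on `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[-1])  # collect target values per category
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:14s} {gain:.3f}")
```

On these ten rows "Age" has the largest gain (about 0.322), followed by "Student" (about 0.125).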

### Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.

#### The branch with the largest information gain is "Age"
![age1](age1.png)

#### Sort the data by "Age"
![sort](sort.png)

### Step 4a: A branch with entropy of 0 is a leaf node.
![4a](4a.png)

### Step 4b: A branch with entropy more than 0 needs further splitting.
![step4b](4b.png)
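
The leaf test from steps 4a and 4b can be checked per branch: compute the entropy of the target inside each "Age" group and see whether it is 0. This is a minimal sketch, assuming the `Age` and `Buys_computer` columns of the ten rows printed in Part B and the encoding given there; `AGE_AND_TARGET` and `status` are illustrative names.

```python
from collections import Counter, defaultdict
from math import log2

# (Age, Buys_computer) pairs of the ten rows listed in Part B
AGE_AND_TARGET = [(0, 0), (0, 0), (1, 1), (2, 1), (2, 1),
                  (2, 0), (1, 1), (0, 0), (0, 1), (2, 1)]
AGE_LABELS = {0: "<=30", 1: "31-40", 2: ">40"}  # encoding used in Part B


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


# Group the target values by Age branch
branches = defaultdict(list)
for age, buys in AGE_AND_TARGET:
    branches[age].append(buys)

status = {}
for age in sorted(branches):
    h = entropy(branches[age])
    status[AGE_LABELS[age]] = "leaf" if h == 0 else "split further"
    print(f"Age {AGE_LABELS[age]:5s} entropy = {h:.3f} -> {status[AGE_LABELS[age]]}")
```

On these rows only the "31-40" branch is pure, so it becomes a leaf, while the other two branches need another split.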

### Step 5: The algorithm is run recursively on the non-leaf branches, until all data is classified.
![tree](tree.png)
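
Steps 1 to 5 together form a recursive procedure in the style of ID3. The following is a hedged sketch of that recursion, not a reference implementation: it assumes the ten encoded rows printed in Part B, represents the tree as nested dictionaries, and uses illustrative helper names (`split`, `gain`, `id3`).

```python
from collections import Counter, defaultdict
from math import log2

ATTRIBUTES = ["Age", "Income", "Student", "Credit_rating"]

# Ten rows from Part B: (Age, Income, Student, Credit_rating, Buys_computer)
ROWS = [
    (0, 2, 0, 0, 0), (0, 2, 0, 1, 0), (1, 2, 0, 0, 1), (2, 1, 0, 0, 1),
    (2, 0, 1, 0, 1), (2, 0, 1, 1, 0), (1, 0, 1, 1, 1), (0, 1, 0, 0, 0),
    (0, 0, 1, 0, 1), (2, 1, 1, 0, 1),
]


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split(rows, column):
    """Group rows by the value they have in `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row)
    return groups


def gain(rows, column):
    """Information gain of a multiway split on `column`."""
    remainder = sum(len(g) / len(rows) * entropy([r[-1] for r in g])
                    for g in split(rows, column).values())
    return entropy([r[-1] for r in rows]) - remainder


def id3(rows, columns):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [r[-1] for r in rows]
    # Leaf: only one class left (entropy 0) or no attributes left (majority class)
    if len(set(labels)) == 1 or not columns:
        return Counter(labels).most_common(1)[0][0]
    best = max(columns, key=lambda c: gain(rows, c))  # ties: first column wins
    rest = [c for c in columns if c != best]
    return {ATTRIBUTES[best]: {value: id3(group, rest)
                               for value, group in split(rows, best).items()}}


tree = id3(ROWS, list(range(len(ATTRIBUTES))))
print(tree)
```

Ties between equally good attributes are broken by column order here, so a sub-branch may end up with a different but equally informative attribute than the hand-drawn tree.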

## B. Decision tree with scikit-learn

```python
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# As scikit-learn's DecisionTreeClassifier does not support categorical data,
# the Age values are transformed to '<=30'=0  '31-40'=1  '>40'=2
# the Income values are transformed to 'Low'=0  'Medium'=1  'High'=2
# the Student values are transformed to 'No'=0  'Yes'=1
# the Credit_rating values are transformed to 'Fair'=0  'Excellent'=1

# Creating a dataframe from input file
input_file = "buy_computer_data_numeric.csv"
df = pd.read_csv(input_file, header=0)
df
```

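| Unnamed: 0 | Age | Income | Student | Credit_rating | Buys_computer |
|---|---|---|---|---|---|
| 0 | 0 | 2 | 0 | 0 | 0 |
| 1 | 0 | 2 | 0 | 1 | 0 |
| 2 | 1 | 2 | 0 | 0 | 1 |
| 3 | 2 | 1 | 0 | 0 | 1 |
| 4 | 2 | 0 | 1 | 0 | 1 |
| 5 | 2 | 0 | 1 | 1 | 0 |
| 6 | 1 | 0 | 1 | 1 | 1 |
| 7 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 1 | 0 | 1 |
| 9 | 2 | 1 | 1 | 0 | 1 |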


```python
from sklearn.tree import export_graphviz

features = ['Age', 'Income', 'Student', 'Credit_rating']

# Create decision tree with criterion=entropy & max_depth=2
tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=2)
tree_clf.fit(df[features], df['Buys_computer'])

# Export the decision tree with criterion=entropy & max_depth=2
export_graphviz(
    tree_clf,
    out_file="computer_tree_entropy_d2.dot",
    # Name the fitted columns explicitly: df.columns[0:4] would start
    # with the index column 'Unnamed: 0' and shift every label by one
    feature_names=features,
    # class_names must follow ascending class order (assuming 0 = No, 1 = Yes)
    class_names=["Buys_computer=No", "Buys_computer=Yes"],
    rounded=True,
    filled=True
)
```

#### Decision tree with criterion=entropy & max_depth=2
![tree_d2](tree_d2.png)

#### The difference
The tree differs from the manually created tree because scikit-learn treats the encoded values as numbers and only makes binary splits of the form `feature <= threshold`, whereas the manual tree has one branch per category (for example, three branches for "Age").
Moreover, `max_depth=2` also makes the tree simpler than the manually created tree.
For a better comparison, we create another tree with `max_depth=5`.
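
To see why numeric encoding changes the splits, the information gain of every binary threshold split can be compared with the multiway gain from Part A. The sketch below mimics the idea of threshold splits in plain Python and is not scikit-learn's actual implementation; it assumes the ten rows printed above, takes midpoints between consecutive distinct values as candidate thresholds, and uses illustrative names (`threshold_gain`, `candidates`).

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Age", "Income", "Student", "Credit_rating"]

# Ten rows from the dataframe above: (Age, Income, Student, Credit_rating, Buys_computer)
ROWS = [
    (0, 2, 0, 0, 0), (0, 2, 0, 1, 0), (1, 2, 0, 0, 1), (2, 1, 0, 0, 1),
    (2, 0, 1, 0, 1), (2, 0, 1, 1, 0), (1, 0, 1, 1, 1), (0, 1, 0, 0, 0),
    (0, 0, 1, 0, 1), (2, 1, 1, 0, 1),
]


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def threshold_gain(rows, column, threshold):
    """Information gain of the binary split `value <= threshold`."""
    left = [r[-1] for r in rows if r[column] <= threshold]
    right = [r[-1] for r in rows if r[column] > threshold]
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(rows)
    return entropy([r[-1] for r in rows]) - remainder


# Candidate thresholds: midpoints between consecutive distinct values (assumption)
candidates = {}
for col, name in enumerate(ATTRIBUTES):
    values = sorted({r[col] for r in ROWS})
    for low, high in zip(values, values[1:]):
        t = (low + high) / 2
        candidates[(name, t)] = threshold_gain(ROWS, col, t)

best = max(candidates, key=candidates.get)
for (name, t), g in sorted(candidates.items(), key=lambda kv: -kv[1]):
    print(f"{name} <= {t}: gain = {g:.3f}")
```

On these rows the best binary split is `Age <= 0.5` with a gain of about 0.256, lower than the gain of about 0.322 for the three-way split on "Age", because the binary split merges the `31-40` and `>40` groups into one branch.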

```python
from sklearn.tree import export_graphviz

features = ['Age', 'Income', 'Student', 'Credit_rating']

# Create decision tree with criterion=entropy & max_depth=5
tree_clf = DecisionTreeClassifier(criterion='entropy', max_depth=5)
tree_clf.fit(df[features], df['Buys_computer'])

# Export the decision tree with criterion=entropy & max_depth=5
export_graphviz(
    tree_clf,
    out_file="computer_tree_entropy_d5.dot",
    # Name the fitted columns explicitly: df.columns[0:4] would start
    # with the index column 'Unnamed: 0' and shift every label by one
    feature_names=features,
    # class_names must follow ascending class order (assuming 0 = No, 1 = Yes)
    class_names=["Buys_computer=No", "Buys_computer=Yes"],
    rounded=True,
    filled=True
)
```

#### Decision tree with criterion=entropy & max_depth=5
![tree_d5](tree_d5.png)

With `max_depth=5`, the decision tree is now more similar to the tree created in part A of the exercise. However, the splits are still different. For features with only two values, `Yes=1` and `No=0`, the split `<=0.5` can be read as follows: `<=0.5` means No and `>0.5` means Yes.