# Decision Tree Exercise
A) Use the data on people given here [1]. The goal is to decide whether someone buys a computer. Derive the best decision tree by hand using Shannon entropy, at least for the first split.

B) Compare your tree with the tree that scikit-learn derives, as in the earlier Python example. Why are they different? Print the tree with Graphviz (easily done with WebGraphViz [2]).

## A. Derive decision tree by hand
### The data
![data](data.gif)

### Step 1: Calculate the entropy of the target
![step1](step1.png)
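
As a cross-check for the hand calculation, the entropy of the target can be sketched in a few lines of Python. The labels are taken from the `Buys_computer` column of the 10-row numeric table shown in part B, assuming 1 means "yes" and 0 means "no"; that table may not be identical to the one in the image. The function name `entropy` is illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    # Sum of p * log2(1/p) over the classes that actually occur.
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


# Buys_computer column of the numeric table in part B (assumed: 1 = yes, 0 = no).
buys_computer = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]

print(f"H(Buys_computer) = {entropy(buys_computer):.4f} bits")
```

With 6 "yes" and 4 "no" labels this evaluates to about 0.971 bits; a column with only one class gives 0.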

### Step 2: Calculate the information gain for each attribute
#### Entropy using the frequency table of two attributes and Information Gain on "Age"
![age](age.png)

#### Information gain for all attributes: "Age", "Income", "Student" and "CreditRating"
![all](all.png)
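
The same comparison can be sketched in Python: the information gain of an attribute is the entropy of the target minus the weighted entropy of the subsets obtained by splitting on that attribute. The sketch below uses the 10-row numeric table from part B (which may differ from the table in the image, so the numbers need not match the figures); `information_gain`, `COLUMNS` and `DATA` are illustrative names.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["Age", "Income", "Student", "Credit_rating", "Buys_computer"]
# Rows of the numeric table in part B (index column omitted).
DATA = [
    (0, 2, 0, 0, 0), (0, 2, 0, 1, 0), (1, 2, 0, 0, 1), (2, 1, 0, 0, 1),
    (2, 0, 1, 0, 1), (2, 0, 1, 1, 0), (1, 0, 1, 1, 1), (0, 1, 0, 0, 0),
    (0, 0, 1, 0, 1), (2, 1, 1, 0, 1),
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after the split."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


gains = {a: information_gain(rows, a, "Buys_computer") for a in COLUMNS[:-1]}
for attribute, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{attribute:14s} {gain:.4f}")
```

On this table "Age" has the largest gain (about 0.32 bits), followed by "Student", "Income" and "Credit_rating".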

### Step 3: Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.

#### The attribute with the largest information gain is "Age"
![age1](age1.png)

#### Sort the data by "Age"
![sort](sort.png)

### Step 4a: A branch with entropy of 0 is a leaf node.
![leaf](leaf.png)

### Step 4b: A branch with entropy more than 0 needs further splitting.
![step4b](step4b.png)

### Step 5: The algorithm is run recursively on the non-leaf branches, until all data is classified.
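
Steps 3 to 5 together can be sketched as a small recursive function in the style of ID3. This is a minimal illustration under stated assumptions, not the exact procedure from the images: it uses the 10-row numeric table from part B, creates one branch per attribute value, and all names (`build_tree`, `partition`, `classify`) are hypothetical.

```python
from collections import Counter
from math import log2

COLUMNS = ["Age", "Income", "Student", "Credit_rating", "Buys_computer"]
# Rows of the numeric table in part B (index column omitted).
DATA = [
    (0, 2, 0, 0, 0), (0, 2, 0, 1, 0), (1, 2, 0, 0, 1), (2, 1, 0, 0, 1),
    (2, 0, 1, 0, 1), (2, 0, 1, 1, 0), (1, 0, 1, 1, 1), (0, 1, 0, 0, 0),
    (0, 0, 1, 0, 1), (2, 1, 1, 0, 1),
]
ROWS = [dict(zip(COLUMNS, values)) for values in DATA]


def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def partition(rows, attribute):
    """Group rows by their value of `attribute` (one group per branch)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups


def information_gain(rows, attribute, target):
    remainder = sum(
        len(subset) / len(rows) * entropy([row[target] for row in subset])
        for subset in partition(rows, attribute).values()
    )
    return entropy([row[target] for row in rows]) - remainder


def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Step 4a: entropy 0 (a single class) gives a leaf. If no attributes are
    # left, fall back to the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 3: pick the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # Steps 4b and 5: recurse into every branch of the chosen attribute.
    return {best: {value: build_tree(subset, remaining, target)
                   for value, subset in partition(rows, best).items()}}


def classify(tree, row):
    """Follow the branches of a nested-dict tree down to a leaf label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][row[attribute]]
    return tree


tree = build_tree(ROWS, COLUMNS[:-1], "Buys_computer")
print(tree)
```

Ties between attributes with equal gain are broken by list order in this sketch, so a subtree may use a different but equally good attribute than a hand derivation.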


## B. Decision Tree with scikit-learn

```python
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Create a dataframe from the input file
input_file = "buy_computer_data_numeric.csv"
df = pd.read_csv(input_file, header=0)
df
```

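| Unnamed: 0 | Age | Income | Student | Credit_rating | Buys_computer |
|---|---|---|---|---|---|
| 0 | 0 | 2 | 0 | 0 | 0 |
| 1 | 0 | 2 | 0 | 1 | 0 |
| 2 | 1 | 2 | 0 | 0 | 1 |
| 3 | 2 | 1 | 0 | 0 | 1 |
| 4 | 2 | 0 | 1 | 0 | 1 |
| 5 | 2 | 0 | 1 | 1 | 0 |
| 6 | 1 | 0 | 1 | 1 | 1 |
| 7 | 0 | 1 | 0 | 0 | 0 |
| 8 | 0 | 0 | 1 | 0 | 1 |
| 9 | 2 | 1 | 1 | 0 | 1 |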


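```python
# One-hot encode categorical variables.
# Note: get_dummies only expands string/categorical columns, so this step has an
# effect only on the categorical (non-numeric) version of the data, which is what
# the column names in the output below reflect. The classifier is fitted on the
# numeric df directly, not on one_hot_data.
one_hot_data = pd.get_dummies(df[['Age','Income','Student','Credit_rating']], drop_first=True)
print(one_hot_data.columns)
```

```text
Index(['Age_<=30', 'Age_>40', 'Income_Low', 'Income_Medium', 'Student_Yes',
       'Credit_rating_Fair'],
      dtype='object')
```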


```python
# Fit a depth-2 tree on the numerically encoded features
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(df[['Age','Income','Student','Credit_rating']], df['Buys_computer'])
```

```text
DecisionTreeClassifier(max_depth=2)
```

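```python
from sklearn.tree import export_graphviz

# Use the same feature list the classifier was fitted on; df.columns[0:4]
# would start with the "Unnamed: 0" index column and mislabel the nodes.
feature_cols = ['Age', 'Income', 'Student', 'Credit_rating']

export_graphviz(
    tree_clf,
    out_file="computer_tree_depth2.dot",
    feature_names=feature_cols,
    # class_names must be in ascending class order: 0 (assumed "No"), then 1 (assumed "Yes")
    class_names=["Buys_computer=No", "Buys_computer=Yes"],
    rounded=True,
    filled=True
)
```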
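
One likely reason the two trees differ: scikit-learn's `DecisionTreeClassifier` builds binary trees with threshold tests (`feature <= value`) on numeric features and uses Gini impurity by default, whereas the hand derivation uses entropy with one branch per attribute value. The sketch below does not call scikit-learn; it only mimics that kind of split search for the root node on the table above. Taking midpoints between observed values as candidate thresholds is an assumption of this sketch, and `best_binary_split` is a hypothetical helper.

```python
COLUMNS = ["Age", "Income", "Student", "Credit_rating", "Buys_computer"]
# Rows of the numeric table above (index column omitted).
DATA = [
    (0, 2, 0, 0, 0), (0, 2, 0, 1, 0), (1, 2, 0, 0, 1), (2, 1, 0, 0, 1),
    (2, 0, 1, 0, 1), (2, 0, 1, 1, 0), (1, 0, 1, 1, 1), (0, 1, 0, 0, 0),
    (0, 0, 1, 0, 1), (2, 1, 1, 0, 1),
]
ROWS = [dict(zip(COLUMNS, values)) for values in DATA]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))


def best_binary_split(rows, features, target):
    """Return (weighted Gini, feature, threshold) of the best binary split."""
    best = None
    for feature in features:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds: midpoints between neighbouring observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r[target] for r in rows if r[feature] <= threshold]
            right = [r[target] for r in rows if r[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


score, feature, threshold = best_binary_split(ROWS, COLUMNS[:-1], "Buys_computer")
print(f"root split: {feature} <= {threshold} (weighted Gini {score:.4f})")
```

On this table the search selects `Age <= 0.5`, a two-way split that separates the first age group from the other two, rather than the three-way split on "Age" from part A. Passing `criterion="entropy"` to `DecisionTreeClassifier` changes the impurity measure, but the splits remain binary.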