## Decision Tree Model

Decision trees are used for classification and for predicting class probabilities. They are often easier to interpret than a regression model.
We are going to import a small dummy data set to illustrate this model.

In [1]:
# import the things we need first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn

In [2]:
# read in the provided csv file; change the path passed to read_csv() if the file lives elsewhere
df = pd.read_csv('Decision_Tree_bankloan.csv')
df # show the dataframe

Unnamed: 0,Age,Has_job,Own_house,Outcome
0,young,True,True,No
1,young,True,False,Yes
2,young,False,True,Yes
3,young,False,False,No
4,old,True,True,Yes
5,old,True,False,No
6,old,False,True,Yes
7,old,False,False,Yes
8,old,False,False,Yes
9,old,False,False,Yes


### Preprocess the data
In this sample data set, `Outcome` is the label we want to predict from `Age`, `Has_job` and `Own_house`.

Since the model can only be built from numeric values,
we are going to map every column to numeric codes first

In [4]:
# change the type of these features to `category` for mapping in the next step
df['Age'] = df['Age'].astype('category')
df['Has_job'] = df['Has_job'].astype('category') 
df['Own_house'] = df['Own_house'].astype('category')
df['Outcome'] = df['Outcome'].astype('category') 

# use .cat.codes on `category` type to map all literals to numeric values
df['Age'] = df['Age'].cat.codes
df['Has_job'] = df['Has_job'].cat.codes
df['Own_house'] = df['Own_house'].cat.codes
df['Outcome'] = df['Outcome'].cat.codes

df.head()

Unnamed: 0,Age,Has_job,Own_house,Outcome
0,1,1,1,0
1,1,1,0,1
2,1,0,1,1
3,1,0,0,0
4,0,1,1,1


After the conversion we have

Age column:
    old: 0
    young: 1

Has_job and Own_house columns:
    False: 0
    True: 1

Outcome column:
    No: 0
    Yes: 1
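
To double-check which label each numeric code stands for, the category list can be inspected before the column is overwritten with its codes. The following is a minimal sketch on a toy `Age` column (re-typed here for illustration, not read from the csv file); the name `code_to_label` is made up for this example.

```python
import pandas as pd

# Toy column with the same two labels as the `Age` feature
age = pd.Series(['young', 'young', 'old', 'old'], dtype='category')

# String categories are sorted alphabetically;
# a value's code is the position of its label in that sorted list
code_to_label = dict(enumerate(age.cat.categories))

print(code_to_label)
print(age.cat.codes.tolist())
```

Running the same `dict(enumerate(...))` line on each column before calling `.cat.codes` gives a lookup table that can be kept next to the model.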

In [None]:
# check if the categories are balanced
df['Outcome'].value_counts()

In the ten rows shown above, `Yes` appears 7 times and `No` appears 3 times, so this dataset is not perfectly balanced
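
Class balance is often easier to judge as proportions than as raw counts. A minimal sketch, assuming the ten `Outcome` values from the table at the top of this section (re-typed here so the block runs on its own):

```python
import pandas as pd

# Outcome labels copied from the ten rows displayed earlier
outcome = pd.Series(
    ['No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes'],
    name='Outcome',
)

counts = outcome.value_counts()                # absolute number of rows per label
shares = outcome.value_counts(normalize=True)  # fraction of rows per label

print(counts)
print(shares.round(2))
```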

In [None]:
# extract data and target from our dataframe
data = df[['Age', 'Has_job', 'Own_house']] # this is like the independent variable: x we have in linear regression

target = df['Outcome'] # this is like the dependent variable: y
data

In [None]:
target

We have the data prepared. We can do train test split now.

In [None]:
# import train_test_split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data, target, random_state = 42)
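
To see what the split does, it helps to check the sizes of the two parts. The sketch below re-types the ten encoded rows so it runs on its own. The `test_size=0.3` and `stratify=y` arguments in the second call are not used in this notebook; they are shown only as an option for keeping class proportions similar in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# The ten encoded rows (old=0/young=1, False=0/True=1, No=0/Yes=1)
X = pd.DataFrame({
    'Age':       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'Has_job':   [1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
    'Own_house': [1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
})
y = pd.Series([0, 1, 1, 0, 1, 0, 1, 1, 1, 1], name='Outcome')

# With no test_size given, 25% of the rows (rounded up) are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
print(len(X_tr), len(X_te))

# Optional: a stratified split keeps the Yes/No ratio similar in train and test
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(sorted(y_te_s.tolist()))
```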

In [None]:
# import decision tree model from sklearn
from sklearn.tree import DecisionTreeClassifier

# instantiate a decision tree model. All parameters can be omitted to use the defaults.
# for details, see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
dt = DecisionTreeClassifier() 
dt.fit(x_train, y_train) # train our model
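
The defaults can also be overridden. As a minimal sketch (the parameter values here are illustrative choices, not settings used elsewhere in this notebook), `max_depth` limits how many levels the tree may grow and `random_state` makes tie-breaking between equally good splits reproducible:

```python
from sklearn.tree import DecisionTreeClassifier

# The ten encoded rows as [Age, Has_job, Own_house], re-typed for a self-contained example
X = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 1],
     [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
y = [0, 1, 1, 0, 1, 0, 1, 1, 1, 1]

# A tree restricted to a single split (a "stump")
shallow = DecisionTreeClassifier(criterion='gini', max_depth=1, random_state=0).fit(X, y)

# A tree grown with the default settings (no depth limit)
full = DecisionTreeClassifier(random_state=0).fit(X, y)

print(shallow.get_depth(), full.get_depth())
```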

In [None]:
x_train

In [None]:
y_train

In [None]:
y_pred = dt.predict(x_test) # let the model predict the test data

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

In [None]:
print(y_pred) # labels predicted by the model
print(y_test) # true labels


Comparing the predicted labels with the true labels shows how many of them match. The accuracy score is the fraction of test samples whose predicted label equals the true label:

$$ accuracy\_score = \frac{number\_of\_matches}{number\_of\_samples} $$

For example, if the number of matches is 1 and there are 4 samples in total, the accuracy_score is $1/4 = 0.25$
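
The formula can be checked by hand. The sketch below uses made-up labels, chosen so that exactly 1 of 4 predictions matches, and compares the manual result with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for illustration only (not the notebook's actual test set)
y_true = np.array([1, 0, 1, 1])
y_hat = np.array([0, 1, 1, 0])

matches = int((y_true == y_hat).sum())  # number of positions where both agree
accuracy = matches / len(y_true)        # matches divided by number of samples

print(matches, accuracy)
print(accuracy_score(y_true, y_hat))
```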

In [None]:
# we can use the model to predict any new data
# the feature order is [Age, Has_job, Own_house]

print(dt.predict([[1, 0, 1]])) # young, no job, owns a house
print(dt.predict([[1, 0, 0]])) # young, no job, does not own a house

In [None]:
x_new = [[1, 0, 1]] # a new feature row, not labels, so we avoid reusing the name y_test
y_pred = dt.predict(x_new)
print(x_new)
print(y_pred)

### Visualize the Decision Tree

We can use `graphviz` to see what the decision tree looks like.

First, run this in a terminal, inside the conda environment this notebook uses
```
conda install python-graphviz
```
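
If installing Graphviz is not an option, scikit-learn's `export_text` prints the same tree structure as plain text. A minimal sketch, fitting a separate tree (named `clf` here to keep it apart from `dt`) on the ten encoded rows re-typed below:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# The ten encoded rows, re-typed so this block runs on its own
X = pd.DataFrame({
    'Age':       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    'Has_job':   [1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
    'Own_house': [1, 0, 1, 0, 1, 0, 1, 0, 0, 0],
})
y = [0, 1, 1, 0, 1, 0, 1, 1, 1, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each indented line is one split; leaves end with the predicted class
rules = export_text(clf, feature_names=list(X.columns))
print(rules)
```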

In [None]:
# show the decision tree model
# import graphviz and sklearn.tree first
from sklearn import tree
import graphviz
from graphviz import Source

Source(tree.export_graphviz(dt, out_file=None, class_names=True, feature_names= x_train.columns)) # display the tree, with no output file

In [None]:
from sklearn import tree
import graphviz
from graphviz import Source

Source(tree.export_graphviz(dt, out_file=None, class_names=['No', 'Yes'], feature_names= x_train.columns)) # display the tree, with no output file

- The first row is the test the tree uses to split a node into child nodes. For example, if the root shows `Age <= 0.5`, any data with `Age` <= 0.5
    goes to the left child and any data with `Age` > 0.5 goes to the right child.
- The second row is the Gini score, which measures how mixed the group is. The best scenario is gini = 0, which means all data
    in this group are from the same class. With two classes, gini = 0.5 means half of the group is from one class and half from the other.
- The third row is how many samples fall in this group.
- The fourth row is an array with the number of samples of each class in this group, e.g. value = [1, 11] means
    one sample of class 0 and eleven samples of class 1 are in this group.
- The fifth row gives the class that most samples in this group belong to.
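
The Gini score of a node can be recomputed from its `value` array as one minus the sum of squared class fractions. A minimal sketch (the helper name `gini` is made up for this example):

```python
def gini(counts):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([4, 0]))              # pure node: all samples in one class
print(gini([3, 3]))              # evenly mixed two-class node
print(round(gini([1, 11]), 3))   # mostly one class
```

A pure node scores 0, an evenly mixed two-class node scores 0.5, and a node with value = [1, 11] falls in between at about 0.153.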
