# Decision Tree in Python

## Introduction


* Author: thuanle@hcmut.edu.vn

* Content:
  - Practice using the DecisionTree classifier
  - Learn how to split a dataset into training and test sets
  - Evaluate a model with the accuracy score
* Major steps:
  - Load data from CSV and split it into a training set and a test set
  - Train a DecisionTree classifier model
  - Evaluate the results
  - Visualize the model

## Configuration

In [15]:
dataset_filename = "bill_authentication.csv"

## Grab the data

Since our file is in CSV format, we use the `read_csv` function from pandas to load it. Execute the following script to do so:

In [16]:
import pandas as pd  
dataset = pd.read_csv(dataset_filename)
dataset.shape

(1372, 5)

In [17]:
# Overview of the dataset
# List the column names (keys) in the dataset
dataset.keys()

Index(['Variance', 'Skewness', 'Curtosis', 'Entropy', 'Class'], dtype='object')

In [18]:
dataset.head()  

| | Variance | Skewness | Curtosis | Entropy | Class |
|---|---|---|---|---|---|
| 0 | 3.6216 | 8.6661 | -2.8073 | -0.44699 | 0 |
| 1 | 4.5459 | 8.1674 | -2.4586 | -1.4621 | 0 |
| 2 | 3.866 | -2.6383 | 1.9242 | 0.10645 | 0 |
| 3 | 3.4566 | 9.5228 | -4.0112 | -3.5944 | 0 |
| 4 | 0.32924 | -4.4552 | 4.5718 | -0.9888 | 0 |
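
Before modelling, it can help to check column types, missing values, and class balance. The sketch below is only illustrative: it rebuilds a tiny DataFrame from the five rows shown above through `io.StringIO` (an assumption made so the snippet runs without the CSV file, and the name `sample` is hypothetical). On the real data, the same methods would be called on `dataset`.

```python
import io

import pandas as pd

# Stand-in for the real file: the five rows printed by dataset.head() above.
sample_csv = io.StringIO(
    "Variance,Skewness,Curtosis,Entropy,Class\n"
    "3.6216,8.6661,-2.8073,-0.44699,0\n"
    "4.5459,8.1674,-2.4586,-1.4621,0\n"
    "3.866,-2.6383,1.9242,0.10645,0\n"
    "3.4566,9.5228,-4.0112,-3.5944,0\n"
    "0.32924,-4.4552,4.5718,-0.9888,0\n"
)
sample = pd.read_csv(sample_csv)

missing_per_column = sample.isna().sum()        # number of missing values per column
class_counts = sample["Class"].value_counts()   # number of rows per label

print(sample.dtypes)
print(missing_per_column)
print(class_counts)
```

These five rows all belong to class 0, so they say nothing about the balance of the full dataset; run `dataset["Class"].value_counts()` for that.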


## Prepare train data and test set

The label is stored in the **Class** column, so we divide the dataset into attributes and labels.

In [19]:
X = dataset.drop('Class', axis=1)  
y = dataset['Class']  

Here the X variable contains all the columns from the dataset, except the **Class** column, which is the label.

The y variable contains the values from the **Class** column. 

The X variable is our attribute set and y variable contains corresponding labels.

### Dividing our data into training and test sets. 

We split the dataset into two parts: a training set and a test set. Here, 20% of the data goes to the test set and the remaining 80% is used for training.

In [20]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(X_train.shape, y_train.size)
print(X_test.shape, y_test.shape)

(1097, 4) 1097
(275, 4) (275,)
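
Because `train_test_split` shuffles the rows randomly, calling it without a seed gives a different split (and a different accuracy) on every run. A minimal sketch of a repeatable, class-balanced split follows. It assumes synthetic data with the same shape as the dataset, and the names `X_demo`, `y_demo` and the seed value 42 are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the dataset (1372 rows, 4 features).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(1372, 4)),
    columns=["Variance", "Skewness", "Curtosis", "Entropy"],
)
y_demo = pd.Series((rng.random(1372) < 0.45).astype(int), name="Class")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.20,
    random_state=42,   # any fixed integer makes the split repeatable
    stratify=y_demo,   # keep the class ratio similar in both subsets
)

print(X_tr.shape, X_te.shape)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))  # share of class 1 in each subset
```

With `stratify`, the proportion of each class in the training and test sets stays close to the proportion in the full data, which matters most when the classes are imbalanced.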


## Training with Decision Tree

Now, let's build a **Decision Tree** model.

### a) criterion='gini'

In [21]:
from sklearn.tree import DecisionTreeClassifier  
dt = DecisionTreeClassifier(criterion='gini')  
dt.fit(X_train, y_train)  

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
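
The `criterion` parameter selects the impurity measure the tree tries to reduce at each split: Gini impurity here, entropy in section b). As a rough illustration, the sketch below computes both by hand from class counts. The helper names `gini` and `entropy` are hypothetical, and the counts `[604, 493]` and `[94, 431]` are taken from the first two nodes of the exported tree shown later in this notebook.

```python
from math import log2


def gini(counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


root_counts = [604, 493]  # class counts at the root node of the exported tree

print(round(gini(root_counts), 3))     # 0.495, the value in the root node label
print(round(gini([94, 431]), 3))       # 0.294, the value in node 1
print(round(entropy(root_counts), 3))  # entropy of the same root node, in bits
print(gini([0, 258]))                  # 0.0: a pure leaf has no impurity
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why the two criteria often lead to similar trees.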

In [22]:
y_pred_dt = dt.predict(X_test)  
y_pred_dt

array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

##### Evaluating the Algorithm



In [23]:
dt_score = dt.score(X_test, y_test)
print(f"Decision Tree classifier accuracy score is {dt_score}")

Decision Tree classifier accuracy score is 0.9854545454545455
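
For a classifier, `score` returns the accuracy, that is, the fraction of test samples predicted correctly. The sketch below checks this against `accuracy_score` and adds a confusion matrix to show where the errors fall. It assumes synthetic data from `make_classification` in place of the CSV file, and the names `clf`, `X_demo`, `y_demo` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with four features, standing in for the dataset.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

acc_from_score = clf.score(X_te, y_te)        # accuracy computed by the model
acc_from_preds = accuracy_score(y_te, y_hat)  # same value, from the predictions
cm = confusion_matrix(y_te, y_hat)            # rows: true class, columns: predicted class

print(acc_from_score, acc_from_preds)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the number of test samples equals the accuracy.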


### b) criterion='entropy'

In [24]:
from sklearn.tree import DecisionTreeClassifier  
dt2 = DecisionTreeClassifier(criterion='entropy')  
dt2.fit(X_train, y_train)  

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [25]:
y_pred_dt2 = dt2.predict(X_test)  
y_pred_dt2

array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

##### Evaluating the Algorithm



In [26]:
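# Score the entropy model (dt2); calling dt.score here would only repeat the gini result
dt2_score = dt2.score(X_test, y_test)
print(f"Decision Tree (entropy) classifier accuracy score is {dt2_score}")

The output originally recorded for this cell, 0.9854545454545455, came from scoring `dt` (the gini model), so it does not show the entropy model's accuracy. Rerun the cell to obtain the score for `dt2`.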
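
One way to compare the two criteria without juggling several model variables is to train them in a loop on the same split and keep each score under its own key. This is a sketch under assumptions: the data is synthetic (`make_classification`), and the fixed `random_state` values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset: two classes, four features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

scores = {}
for criterion in ("gini", "entropy"):
    # A fresh model per criterion, trained and scored on the same split.
    model = DecisionTreeClassifier(criterion=criterion, random_state=1)
    model.fit(X_tr, y_tr)
    scores[criterion] = model.score(X_te, y_te)

for criterion, score in scores.items():
    print(f"{criterion}: accuracy = {score:.4f}")
```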


## Extra section

### Visualize decision tree

We can visualize the Decision Tree model using the [Graphviz](https://www.graphviz.org/) tool.

Graphviz is an easy-to-use tool for drawing graphs. For example, the code

```
digraph G {Hello->World}
```

generates the following graph:

![digraph](https://graphviz.gitlab.io/_pages/Gallery/directed/hello.png)

* More examples: https://www.graphviz.org/gallery/
* You can play around at http://www.webgraphviz.com/ or https://dreampuf.github.io/GraphvizOnline/

So let's generate the graph.

In [27]:
from sklearn.tree import export_graphviz
dot_data = export_graphviz(dt, out_file=None)
print(dot_data)

digraph Tree {
node [shape=box] ;
0 [label="X[0] <= 0.32\ngini = 0.495\nsamples = 1097\nvalue = [604, 493]"] ;
1 [label="X[1] <= 7.764\ngini = 0.294\nsamples = 525\nvalue = [94, 431]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="X[0] <= -0.459\ngini = 0.122\nsamples = 445\nvalue = [29, 416]"] ;
1 -> 2 ;
3 [label="X[2] <= 6.219\ngini = 0.061\nsamples = 379\nvalue = [12, 367]"] ;
2 -> 3 ;
4 [label="X[1] <= 7.293\ngini = 0.008\nsamples = 262\nvalue = [1, 261]"] ;
3 -> 4 ;
5 [label="gini = 0.0\nsamples = 258\nvalue = [0, 258]"] ;
4 -> 5 ;
6 [label="X[3] <= -2.185\ngini = 0.375\nsamples = 4\nvalue = [1, 3]"] ;
4 -> 6 ;
7 [label="gini = 0.0\nsamples = 3\nvalue = [0, 3]"] ;
6 -> 7 ;
8 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
6 -> 8 ;
9 [label="X[1] <= -4.675\ngini = 0.17\nsamples = 117\nvalue = [11, 106]"] ;
3 -> 9 ;
10 [label="gini = 0.0\nsamples = 105\nvalue = [0, 105]"] ;
9 -> 10 ;
11 [label="X[2] <= 6.615\ngini = 0.153\nsamples = 12\nvalue = [11, 1

Copy the code above and paste it into an online Graphviz service such as http://viz-js.com/ to see the result.
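
The node labels above refer to features by position (`X[0]`, `X[1]`, ...). `export_graphviz` also accepts `feature_names` and `class_names`, which can make the graph easier to read. The sketch below assumes a small tree trained on synthetic data, with the column names of this dataset passed in by hand; `small_tree` and `dot_named` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

feature_names = ["Variance", "Skewness", "Curtosis", "Entropy"]

# Synthetic stand-in for the dataset: two classes, four features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

dot_named = export_graphviz(
    small_tree,
    out_file=None,                # return the DOT text instead of writing a file
    feature_names=feature_names,  # replaces X[0], X[1], ... in the node labels
    class_names=["0", "1"],       # adds the majority class to each node
    filled=True,                  # colours each node by its majority class
)
print(dot_named)
```

Limiting `max_depth` here only keeps the printed graph short; it is not needed for the export itself.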

### Jupyter does it all

To display the graph automatically inside Jupyter, we need to:
* Install the Graphviz binary
* Call Graphviz from inside Jupyter and grab the result.

**Note**: Google Colab does not let us install the Graphviz binary, so we can't make this work with Google Colab.
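
Where the Graphviz binary is not available, a plain-text view of the tree is one possible fallback. The sketch below uses scikit-learn's `export_text`, which needs no external tools. It assumes a small tree trained on synthetic data, and `small_tree` and `rules` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Variance", "Skewness", "Curtosis", "Entropy"]

# Synthetic stand-in for the dataset: two classes, four features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Indented if/else rules, one line per node, with the predicted class at each leaf.
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```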


In [28]:
"""
Install the library that calls the Graphviz binary
"""

# pip3 install graphviz

'\nInstall the library that calls the Graphviz binary\n'

In [33]:
import graphviz

# Source wraps existing DOT text. Digraph(dot_data) would instead use the whole
# DOT string as the *name* of a new, empty graph.
graph = graphviz.Source(dot_data)
# graph  # as the last expression in a cell, this renders the tree inline
graph.source


'digraph Tree {\nnode [shape=box] ;\n0 [label="X[0] <= 0.32\\ngini = 0.495\\nsamples = 1097\\nvalue = [604, 493]"] ;\n1 [label="X[1] <= 7.764\\ngini = 0.294\\nsamples = 525\\nvalue = [94, 431]"] ;\n0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;\n2 [label="X[0] <= -0.459\\ngini = 0.122\\nsamples = 445\\nvalue = [29, 416]"] ;\n1 -> 2 ;\n3 [label="X[2] <= 6.219\\ngini = 0.061\\nsamples = 379\\nvalue = [12, 367]"] ;\n2 -> 3 ;\n4 [label="X[1] <= 7.293\\ngini = 0.008\\nsamples = 262\\nvalue = [1, 261]"] ;\n3 -> 4 ;\n5 [label="gini = 0.0\\nsamples = 258\\nvalue = [0, 258]"] ;\n4 -> 5 ;\n6 [label="X[3] <= -2.185\\ngini = 0.375\\nsamples = 4\\nvalue = [1, 3]"] ;\n4 -> 6 ;\n7 [label="gini = 0.0\\nsamples = 3\\nvalue = [0, 3]"] ;\n6 -> 7 ;\n8 [label="gini = 0.0\\nsamples = 1\\nvalue = [1, 0]"] ;\n6 -> 8 ;\n9 [label="X[1] <= -4.675\\ngini = 0.17\\nsamples = 117\\nvalue = [11, 106]"] ;\n3 -> 9 ;\n10 [label="gini = 0.0\\nsamples =