# Decision Tree trong Python

## Introduction


* Author: Vo Minh Tung

* Content:
  - Practice the DecisionTree classifier
  - Learn more about split the dataset into train and test.
  - Evaluation by Accuracy score
* Major steps:
  - Load data from CSV, split into trainning set and test set
  - Buidling Decision tree model.
  - Evaluating results
  - Training DecisionTree classifier model
  - Evaluating results
  - Visualize models

## Configuration

In [1]:
dataset_filename = "bitcoin_int.csv"
#dataset_filename = "bitcoin_draw.csv"
#dataset_filename = "bitcoin_continuous.csv"


## Grab the data

Since our file is in CSV format, we will use panda's read_csv method to read our CSV data file. Execute the following script to do so:

In [2]:
import pandas as pd
#from sklearn import datasets

dataset = pd.read_csv(dataset_filename)
#dataset = datasets.load_iris()
dataset.shape

(150, 6)

In [3]:
# Overview about dataset
dataset.keys()

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [4]:
dataset.head()  

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


## Prepare train data and test set

The label is describe in **Class** column. So that we devide the dataset into attributes and labels

In [7]:
#bitcoin
#X = dataset.drop(['Direction','WeekYear'], axis=1)
#y = dataset['Direction']  

#Iris
X = dataset.drop(['Id','Species'], axis=1)  
y = dataset['Species']

Here the X variable contains all the columns from the dataset, except the **Class** column, which is the label.

The y variable contains the values from the **Class** column. 

The X variable is our attribute set and y variable contains corresponding labels.

### Dividing our data into training and test sets. 

So, we split the test into 2 sets: training set and testing set. We use to split up 20% of the data in to the test set and 80% for training.

In [8]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(X_train.shape, y_train.size)
print(X_test.shape, y_test.shape)

(120, 4) 120
(30, 4) (30,)


## Training with Decision Tree

Now, let's build a **Decision Tree** model

### a) criterion='gini'

In [9]:
from sklearn.tree import DecisionTreeClassifier  
dt = DecisionTreeClassifier(criterion='gini')  
dt.fit(X_train, y_train)  

In [8]:
y_pred_dt = dt.predict(X_test)  
y_pred_dt

array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1,
       0, 1], dtype=int64)

##### Evaluating the Algorithm



In [10]:
dt_score = dt.score(X_test, y_test)
print(f"Decision Tree classifier accuracy score is {dt_score}")

Decision Tree classifier accuracy score is 0.9333333333333333


### b) criterion='entropy'

In [10]:
from sklearn.tree import DecisionTreeClassifier  
dt2 = DecisionTreeClassifier(criterion='entropy')  
dt2.fit(X_train, y_train)  

In [11]:
y_pred_dt = dt2.predict(X_test)  
y_pred_dt

array([1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1], dtype=int64)

##### Evaluating the Algorithm



In [12]:
dt_score = dt.score(X_test, y_test)
print(f"Decision Tree classifier accuracy score is {dt_score}")

Decision Tree classifier accuracy score is 0.5588235294117647


## Extra section

### Visualize decision tree

We can visualize the Decision Tree model using the [Graphviz](https://www.graphviz.org/) tool.

Graphviz is an easy tool for drawing graph. For example: the code

```
digraph G {Hello->World}
```

will generate the following graph

![digraph](https://graphviz.gitlab.io/_pages/Gallery/directed/hello.png)

* More example https://www.graphviz.org/gallery/
* You can play around at http://www.webgraphviz.com/ or https://dreampuf.github.io/GraphvizOnline/

So let's generate the graph.

In [13]:
from sklearn.tree import export_graphviz
dot_data = export_graphviz(dt, out_file=None)
print(dot_data)

digraph Tree {
node [shape=box, fontname="helvetica"] ;
edge [fontname="helvetica"] ;
0 [label="x[5] <= -0.001\ngini = 0.5\nsamples = 268\nvalue = [131, 137]"] ;
1 [label="x[3] <= 0.016\ngini = 0.488\nsamples = 135\nvalue = [78.0, 57.0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="x[0] <= -0.01\ngini = 0.449\nsamples = 94\nvalue = [62.0, 32.0]"] ;
1 -> 2 ;
3 [label="x[6] <= -0.109\ngini = 0.204\nsamples = 26\nvalue = [23, 3]"] ;
2 -> 3 ;
4 [label="gini = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
3 -> 4 ;
5 [label="x[5] <= -0.103\ngini = 0.147\nsamples = 25\nvalue = [23, 2]"] ;
3 -> 5 ;
6 [label="x[1] <= -0.067\ngini = 0.5\nsamples = 2\nvalue = [1, 1]"] ;
5 -> 6 ;
7 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
6 -> 7 ;
8 [label="gini = 0.0\nsamples = 1\nvalue = [0, 1]"] ;
6 -> 8 ;
9 [label="x[6] <= 0.001\ngini = 0.083\nsamples = 23\nvalue = [22, 1]"] ;
5 -> 9 ;
10 [label="x[6] <= -0.0\ngini = 0.219\nsamples = 8\nvalue = [7, 1]"] ;
9 -> 10 ;
11 [label="gi

Copy the code above, paste it to online graphviz service like http://viz-js.com/ and see the result.

### Jypiter does it all

If we want to display the graph automatically inside Jupiter, we'll need
* Install Graphviz binary
* Call the Graphviz inside Jupiter and grab the result.

**Note**: Google Colab does not let us install Graphviz binary, so that we cant make with work with Google Colab.


In [20]:
"""
Install lib that call Graphviz binary
"""

!pip install graphviz



In [1]:
import graphviz 
graph = graphviz.Source(dot_data)
graph

NameError: name 'dot_data' is not defined