# Decision Tree in Python

## Introduction


* Author: thuanle@hcmut.edu.vn

* Content:
  - Practice the DecisionTree classifier
  - Learn how to split the dataset into training and test sets
  - Evaluate the model by accuracy score
* Major steps:
  - Load data from CSV and split it into a training set and a test set
  - Train the DecisionTree classifier model
  - Evaluate the results
  - Visualize the model

## Configuration

In [4]:
dataset_filename = "bitcoin_int.csv"
#dataset_filename = "bitcoin_draw.csv"
#dataset_filename = "bitcoin_continuous.csv"

## Grab the data

Since our file is in CSV format, we use the pandas `read_csv` function to load it. Execute the following script to do so:

In [5]:
import pandas as pd  
dataset = pd.read_csv(dataset_filename)
dataset.shape

(336, 9)

In [6]:
# Overview about dataset
dataset.keys()

Index(['WeekYear', 'CN', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'Direction'], dtype='object')

In [7]:
dataset.head()  

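|   | WeekYear | CN | T2 | T3 | T4 | T5 | T6 | T7 | Direction |
|---|----------|----|----|----|----|----|----|----|-----------|
| 0 | 2017-45 | 3 | -5 | -1 | 7 | -4 | -6 | -7 | 0 |
| 1 | 2017-46 | -3 | 5 | 2 | 10 | 7 | 2 | -1 | 1 |
| 2 | 2017-47 | 3 | 4 | 1 | -2 | 0 | 0 | 6 | 1 |
| 3 | 2017-48 | 6 | 4 | 4 | 4 | -9 | 11 | 5 | 1 |
| 4 | 2017-49 | 7 | -4 | 4 | 8 | 20 | -2 | -10 | 1 |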
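
To see what `read_csv` infers without needing the original file, the five rows shown above can be parsed from an in-memory string. This is a minimal sketch that assumes the real file has the same nine columns; `sample_csv` and `sample` are illustrative names that do not appear in the notebook.

```python
import io

import pandas as pd

# The five rows printed by dataset.head(), embedded as CSV text.
sample_csv = """WeekYear,CN,T2,T3,T4,T5,T6,T7,Direction
2017-45,3,-5,-1,7,-4,-6,-7,0
2017-46,-3,5,2,10,7,2,-1,1
2017-47,3,4,1,-2,0,0,6,1
2017-48,6,4,4,4,-9,11,5,1
2017-49,7,-4,4,8,20,-2,-10,1
"""

sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)                        # (rows, columns)
print(sample.dtypes)                       # WeekYear is text; the other columns are integers
print(sample["Direction"].value_counts())  # number of rows per class
```

Checking the class counts on the full dataset in the same way helps when interpreting the accuracy scores later on.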


## Prepare train data and test set

The label is stored in the **Direction** column, so we divide the dataset into attributes and labels.

In [13]:
X = dataset.drop(['Direction','WeekYear'], axis=1)  
y = dataset['Direction']  

Here the X variable contains all the columns from the dataset except **Direction**, which is the label, and **WeekYear**, which is dropped as well.

The y variable contains the values from the **Direction** column.

X is our attribute set and y contains the corresponding labels.
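
One likely reason for dropping `WeekYear` is that it holds text such as `2017-45`, which a scikit-learn tree cannot use directly. The sketch below shows an alternative way to build the attribute set by keeping only numeric columns; the three rows are copied from the `head()` output, and `sample`, `X_sample`, `y_sample` and `label_column` are hypothetical names used for illustration.

```python
import pandas as pd

# Three rows copied from the head() output above; illustration only.
sample = pd.DataFrame({
    "WeekYear": ["2017-45", "2017-46", "2017-47"],
    "CN": [3, -3, 3],
    "T2": [-5, 5, 4],
    "Direction": [0, 1, 1],
})

label_column = "Direction"

# Attributes: every numeric column except the label.
X_sample = sample.drop(columns=[label_column]).select_dtypes(include="number")
# Labels: the Direction column on its own.
y_sample = sample[label_column]

print(list(X_sample.columns))  # WeekYear is excluded because it is text
print(y_sample.tolist())
```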

### Dividing our data into training and test sets. 

We split the dataset into two parts: a training set and a test set. `train_test_split` puts 20% of the data into the test set and keeps the remaining 80% for training.

In [14]:
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(X_train.shape, y_train.size)
print(X_test.shape, y_test.shape)

(268, 7) 268
(68, 7) (68,)
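
Because `train_test_split` shuffles the rows randomly, every run of the cell above produces a different split and therefore a different accuracy. The sketch below shows one way to make the split repeatable, on synthetic data with the same shape as the notebook's. The seed value `42` and the use of `stratify` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's data: 336 rows, 7 attributes.
rng = np.random.default_rng(0)
X_demo = rng.integers(-10, 11, size=(336, 7))
y_demo = rng.integers(0, 2, size=336)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```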


## Training with Decision Tree

Now, let's build a **Decision Tree** model
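
The two variants that follow differ only in the `criterion` used to measure how mixed the classes are inside a node. As a rough sketch of what each measure computes from the per-class sample counts of a node (the function names `gini` and `entropy` here are our own helpers, not scikit-learn calls):

```python
import math


def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits of a node, given its per-class sample counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


print(gini([5, 5]), entropy([5, 5]))    # evenly mixed node: highest impurity
print(gini([10, 0]), entropy([10, 0]))  # pure node: zero impurity

# The root node in the exported tree later in this notebook has value = [618, 479]
# and reports gini = 0.492; the helper reproduces that figure.
print(round(gini([618, 479]), 3))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so in practice the two criteria often produce similar trees.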

### a) criterion='gini'

In [15]:
from sklearn.tree import DecisionTreeClassifier  
dt = DecisionTreeClassifier(criterion='gini')  
dt.fit(X_train, y_train)  

In [16]:
y_pred_dt = dt.predict(X_test)  
y_pred_dt

array([0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       0, 1], dtype=int64)

##### Evaluating the Algorithm



In [17]:
dt_score = dt.score(X_test, y_test)
print(f"Decision Tree classifier accuracy score is {dt_score}")

Decision Tree classifier accuracy score is 0.38235294117647056
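
For a classifier, `score` returns the mean accuracy, which is the same number `accuracy_score` gives when applied to the true labels and the predictions. The sketch below uses made-up labels (`y_true` and `y_hat` are illustrative, not taken from the dataset) and also prints a confusion matrix, which shows where the errors fall.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, for illustration only.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_hat = [0, 1, 0, 0, 1, 1, 1, 1]

acc = accuracy_score(y_true, y_hat)   # fraction of predictions that match
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print(acc)
print(cm)
```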


### b) criterion='entropy'

In [18]:
from sklearn.tree import DecisionTreeClassifier  
dt2 = DecisionTreeClassifier(criterion='entropy')  
dt2.fit(X_train, y_train)  

In [19]:
y_pred_dt = dt2.predict(X_test)  
y_pred_dt

array([0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1,
       0, 1], dtype=int64)

##### Evaluating the Algorithm



In [20]:
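dt2_score = dt2.score(X_test, y_test)  # score the entropy model (dt2), not dt
print(f"Decision Tree classifier accuracy score is {dt2_score}")

The score recorded for this cell, 0.38235294117647056, was computed with `dt`, which is why it was identical to the gini result; scoring `dt2` gives the entropy model's own accuracy.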
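
Since both variants use the same split, the comparison can also be written as a single loop, which makes it harder to score the wrong model by accident. This is a minimal sketch on synthetic data from `make_classification`, not on the Bitcoin CSV; the names `X_demo`, `y_demo` and `scores` and the seed `0` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook's CSV is not needed to run this.
X_demo, y_demo = make_classification(n_samples=336, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

scores = {}
for criterion in ("gini", "entropy"):
    model = DecisionTreeClassifier(criterion=criterion, random_state=0)
    model.fit(X_tr, y_tr)
    scores[criterion] = model.score(X_te, y_te)  # score the model that was just fitted

for criterion, score in scores.items():
    print(f"{criterion}: {score:.3f}")
```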


## Extra section

### Visualize decision tree

We can visualize the Decision Tree model using the [Graphviz](https://www.graphviz.org/) tool.

Graphviz is a simple tool for drawing graphs. For example, the code

```
digraph G {Hello->World}
```

will generate the following graph

![digraph](https://graphviz.gitlab.io/_pages/Gallery/directed/hello.png)

* More examples: https://www.graphviz.org/gallery/
* You can play around at http://www.webgraphviz.com/ or https://dreampuf.github.io/GraphvizOnline/

So let's generate the graph.

In [15]:
from sklearn.tree import export_graphviz
dot_data = export_graphviz(dt, out_file=None)
print(dot_data)

digraph Tree {
node [shape=box, fontname="helvetica"] ;
edge [fontname="helvetica"] ;
0 [label="x[0] <= 0.32\ngini = 0.492\nsamples = 1097\nvalue = [618, 479]"] ;
1 [label="x[1] <= 7.565\ngini = 0.311\nsamples = 513\nvalue = [99, 414]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="x[1] <= 5.865\ngini = 0.118\nsamples = 427\nvalue = [27, 400]"] ;
1 -> 2 ;
3 [label="x[2] <= 6.216\ngini = 0.086\nsamples = 402\nvalue = [18, 384]"] ;
2 -> 3 ;
4 [label="x[2] <= 4.878\ngini = 0.007\nsamples = 285\nvalue = [1, 284]"] ;
3 -> 4 ;
5 [label="gini = 0.0\nsamples = 257\nvalue = [0, 257]"] ;
4 -> 5 ;
6 [label="x[0] <= -0.484\ngini = 0.069\nsamples = 28\nvalue = [1, 27]"] ;
4 -> 6 ;
7 [label="gini = 0.0\nsamples = 27\nvalue = [0, 27]"] ;
6 -> 7 ;
8 [label="gini = 0.0\nsamples = 1\nvalue = [1, 0]"] ;
6 -> 8 ;
9 [label="x[1] <= -3.08\ngini = 0.248\nsamples = 117\nvalue = [17, 100]"] ;
3 -> 9 ;
10 [label="x[0] <= -0.357\ngini = 0.02\nsamples = 101\nvalue = [1, 100]"] ;
9 -> 1

Copy the DOT code above, paste it into an online Graphviz service such as http://viz-js.com/, and view the result.
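
The node labels above refer to attributes by position (`x[0]`, `x[1]`, ...). `export_graphviz` accepts `feature_names` and `class_names` so that the labels are easier to read. The sketch below uses a tiny made-up dataset; the class names `down` and `up` are hypothetical, since the notebook does not say what the two `Direction` values mean.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Tiny made-up dataset with two named attributes.
X_demo = [[0, 1], [1, 1], [2, 0], [3, 0]]
y_demo = [0, 0, 1, 1]
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

dot_named = export_graphviz(
    demo_tree,
    out_file=None,               # return the DOT source as a string
    feature_names=["CN", "T2"],  # replaces x[0], x[1] in node labels
    class_names=["down", "up"],  # hypothetical names for classes 0 and 1
    filled=True,                 # colour nodes by majority class
    rounded=True,                # draw boxes with rounded corners
)
print(dot_named)
```

With the notebook's own model, passing `feature_names=list(X.columns)` would label each split with the column it tests.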

### Jupyter does it all

If we want to display the graph automatically inside Jupyter, we need to:
* Install the Graphviz binary
* Call Graphviz from inside Jupyter and grab the result

**Note**: Google Colab does not let us install the Graphviz binary, so this approach does not work on Google Colab.


In [18]:
"""
Install the library that calls the Graphviz binary
"""

!pip install graphviz



In [1]:
# dot_data is created by the export_graphviz cell above;
# rerun that cell first if the kernel has been restarted.
import graphviz
graph = graphviz.Source(dot_data)
graph
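
When the Graphviz binary is not available, scikit-learn's `export_text` prints the same tree as indented text, with no extra installation. This is a sketch on a tiny made-up dataset rather than the notebook's model; `demo_tree` and `rules` are illustrative names.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset with two named attributes.
X_demo = [[0, 1], [1, 1], [2, 0], [3, 0]]
y_demo = [0, 0, 1, 1]
demo_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# One line per split or leaf; deeper nodes are indented further.
rules = export_text(demo_tree, feature_names=["CN", "T2"])
print(rules)
```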