# Decision Tree Exercise

## Use Decision Tree to classify the Iris dataset
- dataset: 'Iris.csv'


In [1]:
import pandas as pd

In [6]:
dat = pd.read_csv('Iris.csv')
dat.shape

(150, 6)

## 1) Data Exploration

In [7]:
dat.head()

|  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [8]:
dat.describe()

|  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 75.5 | 5.843333 | 3.054 | 3.758667 | 1.198667 |
| std | 43.445368 | 0.828066 | 0.433594 | 1.76442 | 0.763161 |
| min | 1.0 | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 38.25 | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 75.5 | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 112.75 | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 150.0 | 7.9 | 4.4 | 6.9 | 2.5 |


In [12]:
dat.groupby('Species').size()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64

The dataset is balanced across the class labels: each species has 50 samples.

## 2) Prepare train dataset and test dataset

In [13]:
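# Take every column except the label as features (X) and 'Species' as the label (y).
# Note: 'Id' stays in X here. It is a row identifier, not a measurement, and the
# tree printed at the end of this notebook splits on it (X[0] <= 100.5).
X = dat.drop('Species', axis=1)
y = dat['Species']
X.head()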

|  | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 |
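The feature table above still includes `Id`. Since an identifier carries no botanical information, one option is to drop it together with the label. The sketch below shows this on a hypothetical three-row stand-in for `Iris.csv` (the `sample`, `X_no_id`, and `y_sample` names are illustrative and not part of the original notebook).

```python
import pandas as pd

# Hypothetical three-row stand-in for Iris.csv with the same column names.
sample = pd.DataFrame({
    "Id": [1, 51, 101],
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "SepalWidthCm": [3.5, 3.2, 3.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Drop the label and the row identifier so only the four measurements remain.
X_no_id = sample.drop(columns=["Id", "Species"])
y_sample = sample["Species"]

print(list(X_no_id.columns))
```

Applying the same `drop(columns=["Id", "Species"])` call to `dat` would change the column indices and the results reported in the rest of this notebook, so the original outputs are left as they were produced.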


In [18]:
# Split the dataset into train and test sets (80% / 20%)
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
print(f"The number of individuals in TRAIN dataset: {X_train.shape[0]}")
print(f"The number of individuals in TEST dataset :  {X_test.shape[0]}")

The number of individuals in TRAIN dataset: 120
The number of individuals in TEST dataset :  30
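The split above is random, so the rows in each set change from run to run. A minimal sketch of a reproducible, class-balanced split follows. It assumes scikit-learn's bundled Iris data (`load_iris`) as a stand-in for `Iris.csv`, and the `random_state=42` value is an arbitrary choice, not something taken from the original notebook.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for Iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions
# identical in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(Counter(y_te.tolist()))
```

With 50 samples per class and a 20% test size, stratification places 10 samples of each species in the test set.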


In [20]:
# Check for missing values in either split
pd.isnull(X_train).any() | pd.isnull(X_test).any()

Id               False
SepalLengthCm    False
SepalWidthCm     False
PetalLengthCm    False
PetalWidthCm     False
dtype: bool

The dataset has no missing values.

## 3) Train the Decision Tree

### a) criterion='gini'

Building the Decision Tree with the **gini index**
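For reference, the Gini impurity of a node with class proportions p_k is 1 - sum(p_k^2). The short sketch below (the `gini_impurity` helper is illustrative, not a scikit-learn function) recomputes the impurity from the class counts that appear in the tree printed at the end of this notebook.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


# Class counts taken from the exported tree: root node, the mixed node, a pure leaf.
print(round(gini_impurity([42, 41, 37]), 3))  # 0.666
print(round(gini_impurity([0, 41, 37]), 3))   # 0.499
print(gini_impurity([42, 0, 0]))              # 0.0
```

A pure node has impurity 0, and three roughly equal classes give a value close to the maximum of 2/3.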

In [21]:
# scikit-learn provides DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier  
dt = DecisionTreeClassifier(criterion='gini')  
dt.fit(X_train, y_train)  

DecisionTreeClassifier()

#### Predict the test data

In [22]:
y_pred_dt = dt.predict(X_test)  
print(y_pred_dt)

['Iris-versicolor' 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica' 'Iris-setosa' 'Iris-virginica'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa'
 'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-virginica'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-setosa' 'Iris-virginica' 'Iris-setosa'
 'Iris-virginica' 'Iris-virginica']


#### Evaluating the Algorithm

In [23]:
dt_score = dt.score(X_test, y_test)
print(f"Decision Tree classifier accuracy score is {dt_score}")

Decision Tree classifier accuracy score is 1.0
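A single 30-sample test set gives a noisy estimate, and a perfect score deserves a second look. One common check is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses scikit-learn's bundled Iris data (four measurement columns, no `Id`) rather than `Iris.csv`, and `random_state=0` and `cv=5` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: bundled Iris data (measurements only) stands in for Iris.csv.
X_cv, y_cv = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(criterion="gini", random_state=0)

# For a classifier and an integer cv, the folds are stratified by class.
scores = cross_val_score(clf, X_cv, y_cv, cv=5)

print(f"fold accuracies: {scores}")
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

The mean and spread across folds give a more stable picture of accuracy than one train/test split.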


### b) criterion='entropy'

Building the Decision Tree with the **entropy** criterion
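The entropy criterion scores a node by its Shannon entropy, -sum(p_k * log2(p_k)) over the class proportions p_k. The helper below is an illustrative sketch (the `entropy` name is not a scikit-learn function) showing the two extremes: a perfectly balanced node and a pure node.

```python
import math


def entropy(class_counts):
    """Shannon entropy (in bits) of a node, given the number of samples per class."""
    total = sum(class_counts)
    # Skip empty classes: the term p * log2(p) tends to 0 as p tends to 0.
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


# Three equally frequent classes give the maximum, log2(3).
print(round(entropy([50, 50, 50]), 3))  # 1.585
```

A pure node such as `[42, 0, 0]` has zero entropy, which is why both criteria stop splitting once a leaf contains a single class.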

In [24]:
from sklearn.tree import DecisionTreeClassifier  
dt2 = DecisionTreeClassifier(criterion='entropy')  
dt2.fit(X_train, y_train)  

DecisionTreeClassifier(criterion='entropy')

In [25]:
# Predict with the entropy tree (dt2), not the gini tree (dt)
y_pred_dt2 = dt2.predict(X_test)
print(y_pred_dt2)

['Iris-versicolor' 'Iris-setosa' 'Iris-versicolor' 'Iris-versicolor'
 'Iris-versicolor' 'Iris-virginica' 'Iris-setosa' 'Iris-virginica'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-setosa' 'Iris-setosa' 'Iris-versicolor' 'Iris-setosa'
 'Iris-virginica' 'Iris-virginica' 'Iris-setosa' 'Iris-virginica'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-versicolor'
 'Iris-virginica' 'Iris-setosa' 'Iris-virginica' 'Iris-setosa'
 'Iris-virginica' 'Iris-virginica']


#### Evaluating the Algorithm

In [26]:
# Score the entropy tree (dt2), not the gini tree (dt)
dt2_score = dt2.score(X_test, y_test)
print(f"Decision Tree classifier accuracy score is {dt2_score}")

Decision Tree classifier accuracy score is 1.0


### Visualize decision tree

In [27]:
from sklearn.tree import export_graphviz
dot_data = export_graphviz(dt, out_file=None)
print(dot_data)

digraph Tree {
node [shape=box] ;
0 [label="X[4] <= 0.8\ngini = 0.666\nsamples = 120\nvalue = [42, 41, 37]"] ;
1 [label="gini = 0.0\nsamples = 42\nvalue = [42, 0, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="X[0] <= 100.5\ngini = 0.499\nsamples = 78\nvalue = [0, 41, 37]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="gini = 0.0\nsamples = 41\nvalue = [0, 41, 0]"] ;
2 -> 3 ;
4 [label="gini = 0.0\nsamples = 37\nvalue = [0, 0, 37]"] ;
2 -> 4 ;
}


<p align="center">
<img src="https://raw.githubusercontent.com/tquangsdh20/w2_decision_tree/master/.github/tree.svg">
</p>
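The DOT output above labels features by position (`X[0]`, `X[4]`), which is hard to read. As a lighter alternative that needs no Graphviz install, `sklearn.tree.export_text` prints the rules with feature names. The sketch below assumes scikit-learn's bundled Iris data instead of `Iris.csv`, so its columns and thresholds differ from the tree shown above; `max_depth=2` and `random_state=0` are arbitrary choices to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: bundled Iris data (four measurement columns) stands in for Iris.csv.
iris = load_iris()

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else rules with readable feature names.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

`export_graphviz` accepts `feature_names` and `class_names` arguments as well, if the graphical form is preferred.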