# Coding a decision tree

We will use the scikit-learn library to build a decision tree classifier on the iris dataset. The dataset contains 150 examples from 3 classes of the iris plant, each described by the following attributes:

- sepal length
- sepal width
- petal length
- petal width
- class: Iris Setosa, Iris Versicolour, Iris Virginica

The task is to predict the class of an iris plant from its four measurements. The data is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/iris).

In [5]:
#Importing required libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

The scikit-learn datasets module already includes the iris dataset, so you can either download the data from the source or load it directly with 'load_iris'.

In [6]:
#Loading the iris data
data = load_iris()
print('Classes to predict: ', data.target_names)

Classes to predict:  ['setosa' 'versicolor' 'virginica']


In [10]:
#Extracting data attributes
X = data.data
#Extracting target/class labels
y = data.target

print('Number of examples in the data:', X.shape[0])

#First four rows in the variable 'X'
X[:4]

Number of examples in the data: 150


array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2]])

In [13]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
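
To make the attributes and labels easier to inspect, they can be combined into a pandas DataFrame. The snippet below is a minimal sketch; the column name 'species' is our own choice for illustration and is not part of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# Put the four measurements in a table with named columns.
df = pd.DataFrame(data.data, columns=data.feature_names)

# Map the integer labels to class names ('species' is a hypothetical column name).
df['species'] = data.target_names[data.target]

# Show the first rows and the number of examples per class.
print(df.head())
print(df['species'].value_counts())
```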

Now that we have extracted the data attributes and corresponding labels, we will split them into train and test datasets. For this purpose, we will use scikit-learn's 'train_test_split' function, which takes the attributes and labels as inputs and returns the train and test sets.

In [14]:
#Using the train_test_split to create train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=47, test_size=0.25)
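
As a quick sanity check on the split, the sketch below prints the size of each set and the number of examples per class. The 'stratify' variation at the end is an optional extra that this tutorial does not use; it is shown only as an assumption about how one might keep the class proportions equal in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same settings as in the tutorial: 25% of the 150 examples are held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=47, test_size=0.25)
print('Train shape:', X_train.shape)
print('Test shape:', X_test.shape)

# Number of examples per class in each split.
print('Train class counts:', np.bincount(y_train))
print('Test class counts:', np.bincount(y_test))

# Optional variation (not used in the tutorial): preserve class proportions.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X, y, random_state=47, test_size=0.25, stratify=y)
print('Stratified test class counts:', np.bincount(y_te_s))
```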

Since this is a classification problem, we will use the DecisionTreeClassifier class from the sklearn library. We set 'criterion' to 'entropy', so that candidate splits are scored by information gain, that is, by how much they reduce the entropy of the class labels.

In [21]:
#Creating the decision tree classifier with entropy as the split criterion.
#(DecisionTreeClassifier was already imported in the first cell.)
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
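
To see what the 'entropy' criterion measures, here is a small hand-rolled sketch of entropy and information gain for a binary split. The helper names 'entropy' and 'information_gain' are our own and the toy labels are made up; this illustrates the quantity only and is not scikit-learn's internal implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy labels: three examples of class 0 and three of class 1.
parent = np.array([0, 0, 0, 1, 1, 1])

# A split that separates the classes perfectly removes all uncertainty.
gain_perfect = information_gain(parent, parent[:3], parent[3:])

# A split whose children have the same class mix as the parent gains nothing.
gain_useless = information_gain(parent, np.array([0, 1]), np.array([0, 0, 1, 1]))

print('Gain of perfect split:', gain_perfect)
print('Gain of uninformative split:', gain_useless)
```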

Next, we will fit the classifier on the train attributes and labels.

In [22]:
#Training the decision tree classifier. 
clf.fit(X_train, y_train)
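
Once the classifier is fitted, the learned tree can be inspected. The sketch below repeats the earlier steps so that it runs on its own, then prints the tree's depth, its number of leaves, a text rendering of the decision rules, and the importance assigned to each attribute. The exact tree may differ between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=47, test_size=0.25)

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)

# Size of the fitted tree.
print('Depth:', clf.get_depth())
print('Leaves:', clf.get_n_leaves())

# Text rendering of the learned decision rules.
rules = export_text(clf, feature_names=list(data.feature_names))
print(rules)

# Relative contribution of each attribute to the splits (sums to 1).
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f'{name}: {importance:.3f}')
```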

Now, we will use the trained classifier to predict the labels of the test attributes.

In [23]:
#Predicting labels on the test set.
y_pred = clf.predict(X_test)
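
The predictions are integer class indices. As a small sketch, they can be mapped back to class names through 'target_names', and 'predict_proba' shows the class proportions in the leaf that each test example falls into. The setup is repeated here so that the snippet is self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=47, test_size=0.25)
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)

# Integer class indices predicted for the test set.
y_pred = clf.predict(X_test)

# Map the indices back to readable class names.
predicted_names = data.target_names[y_pred]
print(predicted_names[:5])

# Per-class probabilities: one row per test example, one column per class.
proba = clf.predict_proba(X_test)
print(proba[:5])
```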

We will now evaluate the predicted classes using some metrics. For this case, we will use 'accuracy_score' to calculate the accuracy of the predicted labels.

In [24]:
#Importing the accuracy metric from sklearn.metrics library

from sklearn.metrics import accuracy_score
print('Accuracy Score on train data: ', accuracy_score(y_true=y_train, y_pred=clf.predict(X_train)))
print('Accuracy Score on test data: ', accuracy_score(y_true=y_test, y_pred=y_pred))

Accuracy Score on train data:  1.0
Accuracy Score on test data:  0.9736842105263158
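
Accuracy is a single number, so it does not show which classes are confused with each other. As a hedged extension of the evaluation above, the sketch below prints a confusion matrix and a per-class report for the test set; the exact numbers may vary with the scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=47, test_size=0.25)
clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Precision, recall and F1 score for each class.
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Accuracy is the share of the confusion matrix that lies on the diagonal.
print('Accuracy from confusion matrix:', cm.trace() / cm.sum())
```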


$SplitInformation(S, A) = - \sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$
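
The formula above can be evaluated directly from the sizes of the subsets $S_1, \dots, S_c$ that attribute $A$ partitions $S$ into. The sketch below is a minimal illustration; the function name 'split_information' and the example sizes are assumptions made for this example.

```python
import numpy as np

def split_information(subset_sizes):
    """Split information (in bits) of a partition, given the subset sizes |S_i|."""
    sizes = np.asarray(subset_sizes, dtype=float)
    sizes = sizes[sizes > 0]  # empty subsets contribute nothing to the sum
    fractions = sizes / sizes.sum()
    return float(-np.sum(fractions * np.log2(fractions)))

# Two equal subsets, four equal subsets, and one very uneven partition.
si_two = split_information([50, 50])
si_four = split_information([25, 25, 25, 25])
si_uneven = split_information([98, 1, 1])

print('Two equal subsets:', si_two)
print('Four equal subsets:', si_four)
print('Uneven partition:', si_uneven)
```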