# Decision Tree Classification with Python and Scikit-Learn


In this project, I build a decision tree classifier in Python using scikit-learn. I build two models, one with the Gini index as the split criterion and one with entropy.
I have used the Car Evaluation Data Set for this project, downloaded from the UCI Machine Learning Repository website.

# Table of Contents

1- The problem statement

2- Dataset description

3- Import libraries

4- Import dataset

5- Exploratory data analysis

6- Data cleaning

7- Separating target variable and input variables

8- Splitting the data into train and test set

9- Feature engineering

10- Decision Tree classifier with criterion gini-index

11- Decision Tree classifier with criterion entropy

12- Confusion matrix

13- Classification report

14- Results and conclusion

# Problem Statement

The goal is to choose an appropriate model, fit it, and use it to predict the target class of each car, i.e. its acceptability rating stored in the `class` column.

# Data Description:

I have used the Car Evaluation Data Set from the UCI Machine Learning Repository. The data set can be found at the following URL:

http://archive.ics.uci.edu/ml/datasets/Car+Evaluation

The Car Evaluation Database was derived from a simple hierarchical decision model originally developed for an expert system for decision making. It contains six input attributes (buying, maint, doors, persons, lug_boot, safety) and one target variable, class.

# Import Libraries

In [89]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

# Import Dataset:

In [90]:
car_data=pd.read_csv('car_data.csv',header=None)

# Exploratory Data Analysis:

Now I will explore the data to gain useful insights before fitting the decision tree model.

In [91]:
# View the dimensions of car_data
car_data.shape

(1728, 7)

car_data has 1728 rows (instances) and 7 columns (variables).

In [92]:
# View the first five rows of dataset
car_data.head()

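|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |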


As we can see, car_data does not have proper column names. The columns are labelled 0, 1, 2 and so on, so we cannot tell what they represent. That is why I will rename the columns of car_data according to the dataset description.

In [93]:
# Rename the columns according to the dataset description

column_names=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
car_data.columns=column_names


In [94]:
#Again checking the dataset
car_data.head()

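|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |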


Now I can easily understand what each column indicates.

Next, I will get a summary of the data.

In [95]:
car_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   buying    1728 non-null   object
 1   maint     1728 non-null   object
 2   doors     1728 non-null   object
 3   persons   1728 non-null   object
 4   lug_boot  1728 non-null   object
 5   safety    1728 non-null   object
 6   class     1728 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB


We can see that all the columns in car_data have the object data type; there is not a single continuous variable.

In [96]:
# Frequency distribution of the categorical variables

for x in car_data.columns:
    print(car_data[x].value_counts())

vhigh    432
high     432
med      432
low      432
Name: buying, dtype: int64
vhigh    432
high     432
med      432
low      432
Name: maint, dtype: int64
2        432
3        432
4        432
5more    432
Name: doors, dtype: int64
2       576
4       576
more    576
Name: persons, dtype: int64
small    576
med      576
big      576
Name: lug_boot, dtype: int64
low     576
med     576
high    576
Name: safety, dtype: int64
unacc    1210
acc       384
good       69
vgood      65
Name: class, dtype: int64


# Summary of Variables

- There are 1728 rows and 7 columns in the dataset.

- 'class' is the target variable.

- The predictors are the categorical variables [buying, maint, doors, persons, lug_boot, safety].
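The equal frequencies printed above are what one would expect if the dataset listed every combination of attribute values exactly once. The following is a minimal sketch of that arithmetic, using only the level names from the `value_counts()` output. The names `attribute_levels`, `n_combinations` and `expected_counts` are illustrative and not part of the notebook.

```python
from math import prod

# Levels of each input attribute, copied from the value_counts() output above.
attribute_levels = {
    "buying": ["vhigh", "high", "med", "low"],
    "maint": ["vhigh", "high", "med", "low"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# Number of distinct combinations of attribute values.
n_combinations = prod(len(levels) for levels in attribute_levels.values())
print(n_combinations)

# If every combination occurs once, each level of an attribute with k levels
# appears n_combinations / k times.
expected_counts = {
    name: n_combinations // len(levels)
    for name, levels in attribute_levels.items()
}
print(expected_counts)
```

The product is 4 x 4 x 4 x 3 x 3 x 3 = 1728, which equals the number of rows, and the per-level counts (432 and 576) match the printed frequencies. This is consistent with a full factorial design, although the check itself does not prove that no combination is duplicated.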

# Data Cleaning 

Checking for missing values:

In [97]:
car_data.isnull().sum()

buying      0
maint       0
doors       0
persons     0
lug_boot    0
safety      0
class       0
dtype: int64

There are no missing values in car_data.

# Separating target variable and input variables

In [98]:
# Input (independent) features

X=car_data.drop('class',axis=1)

# Target (dependent) variable

y=car_data['class']


# Splitting the data into train and test set

In [99]:
# Split X and y into training and test sets

# Import the splitting helper
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)

In [100]:
# Info about training and testing set:
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)


(1157, 6) (1157,)
(571, 6) (571,)
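The shapes above follow directly from `test_size=0.33`. As far as I know, when `test_size` is a float, scikit-learn rounds the number of test rows up and gives the remaining rows to the training set. The sketch below reproduces that arithmetic in plain Python; it is an illustration of the sizes only and does not perform the shuffled split itself.

```python
import math

n_samples = 1728   # rows in car_data
test_size = 0.33   # fraction passed to train_test_split

# Assumed rule: test rows are rounded up, training rows are the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the sketch gives 1157 training rows and 571 test rows, which agrees with the shapes printed by the notebook.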


# Feature Engineering

Feature Engineering is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.

First, I will check the data types of variables again.

In [101]:
car_data.dtypes

buying      object
maint       object
doors       object
persons     object
lug_boot    object
safety      object
class       object
dtype: object

Encoding categorical variables:

In [102]:
X_train.head()

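|   | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 48 | vhigh | vhigh | 3 | more | med | low |
| 468 | high | vhigh | 3 | 4 | small | low |
| 155 | vhigh | high | 3 | more | small | high |
| 1721 | low | low | 5more | more | small | high |
| 1208 | med | low | 2 | more | small | high |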


We can see that all the variables are categorical and ordinal in nature.
scikit-learn provides `LabelEncoder`, but it is intended for encoding target labels rather than ordinal input features, so I will use the `OrdinalEncoder` from the `category_encoders` package to encode these variables.

In [163]:
import category_encoders as ce


In [105]:
encoder = ce.OrdinalEncoder(cols=['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety'])


X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [164]:
X_train.head()

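|   | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 48 | 1 | 1 | 1 | 1 | 1 | 1 |
| 468 | 2 | 1 | 1 | 2 | 2 | 1 |
| 155 | 1 | 2 | 1 | 1 | 2 | 2 |
| 1721 | 3 | 3 | 2 | 1 | 2 | 2 |
| 1208 | 4 | 3 | 3 | 1 | 2 | 2 |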


In [111]:
X_test.head()

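|   | buying | maint | doors | persons | lug_boot | safety |
|---|---|---|---|---|---|---|
| 599 | 2 | 2 | 4 | 3 | 1 | 2 |
| 1201 | 4 | 3 | 3 | 2 | 1 | 3 |
| 628 | 2 | 2 | 2 | 3 | 3 | 3 |
| 1498 | 3 | 2 | 2 | 2 | 1 | 3 |
| 1263 | 4 | 3 | 4 | 1 | 1 | 1 |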
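In the encoded training rows, `buying` maps vhigh to 1, high to 2, low to 3 and med to 4, which suggests the integers were assigned in order of first appearance rather than by rank. If rank-preserving codes are wanted, one option is an explicit mapping. The sketch below uses plain dictionaries; the rank orders in `ORDERED_LEVELS` are my assumption based on the level names, and `ORDINAL_MAPS` and `encode_row` are hypothetical helpers, not part of the notebook.

```python
# Assumed rank order for each attribute, lowest to highest.
ORDERED_LEVELS = {
    "buying": ["low", "med", "high", "vhigh"],
    "maint": ["low", "med", "high", "vhigh"],
    "doors": ["2", "3", "4", "5more"],
    "persons": ["2", "4", "more"],
    "lug_boot": ["small", "med", "big"],
    "safety": ["low", "med", "high"],
}

# One {level: integer code} dictionary per attribute, with codes starting at 1.
ORDINAL_MAPS = {
    name: {level: code for code, level in enumerate(levels, start=1)}
    for name, levels in ORDERED_LEVELS.items()
}


def encode_row(row):
    """Encode one record (a dict of attribute -> level) as integer ranks."""
    return {name: ORDINAL_MAPS[name][str(value)] for name, value in row.items()}


# Row 48 of X_train as shown before encoding.
row_48 = {
    "buying": "vhigh", "maint": "vhigh", "doors": "3",
    "persons": "more", "lug_boot": "med", "safety": "low",
}
print(encode_row(row_48))
```

With pandas, the same dictionaries could be applied column by column with `Series.map`. The `OrdinalEncoder` in `category_encoders` also takes a `mapping` argument for this purpose; its documentation describes the exact format.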


We now have our dataset ready for model building


# Decision Tree Classifier With criterion Gini index

In [114]:
# import library
from sklearn.tree import DecisionTreeClassifier

In [117]:
# instantiate the DecisionTreeClassifier model with criterion gini index

gini_tree=DecisionTreeClassifier(criterion='gini',random_state=0,max_depth=3)

# Fit the model on the training set:

gini_tree.fit(X_train,y_train)
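For intuition about what `criterion='gini'` measures, the Gini impurity of a node is 1 minus the sum of squared class proportions. The sketch below computes it for the class counts of the whole dataset, taken from the frequency table earlier. The tree itself is fitted on the training split, whose exact counts are not shown, so the number is only indicative. `gini_impurity` is an illustrative helper, not a scikit-learn function.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k ** 2) for a node with the given class counts."""
    total = sum(class_counts.values())
    return 1.0 - sum((count / total) ** 2 for count in class_counts.values())


# Class counts of the full dataset, from the value_counts() output.
full_dataset_counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}
print(round(gini_impurity(full_dataset_counts), 4))

# A node that contains a single class is pure and has impurity 0.
print(gini_impurity({"unacc": 10}))
```

At each split the tree chooses the feature and threshold that most reduce the weighted impurity of the child nodes relative to the parent.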

Predict the test set results using the Gini index

Now I will use the fitted model to predict the target class of the test set.

In [125]:
gini_y_test_pred=gini_tree.predict(X_test)

Checking the accuracy score of decision tree model of criterion gini index

In [130]:
#import accuracy_score function from sklearn library

from sklearn.metrics import accuracy_score

print('Accuracy of test set prediction with Gini Criterion is :',accuracy_score(y_test,gini_y_test_pred))


Accuracy of test set prediction with Gini Criterion is : 0.8021015761821366


Here y_test holds the true target values of the test set, and gini_y_test_pred holds the values predicted by the decision tree model trained with the Gini index.

# Check for Overfitting or Underfitting 


Overfitting: Overfitting happens when a model learns the details and noise in the training data to such an extent that it negatively impacts its performance on unseen or test data.

Underfitting: Underfitting occurs when a model is too simple to capture the underlying structure of the training data, resulting in poor performance on both the training and test data.


Hence I will compare the accuracy of the training-set and test-set predictions.

In [133]:
# Predicting the values of training_set
gini_y_train_pred=gini_tree.predict(X_train)

# Finding Accuracy:
print('Accuracy of train set with Gini Criterion is :',accuracy_score(y_train,gini_y_train_pred))

Accuracy of train set with Gini Criterion is : 0.7865168539325843


In [134]:
print('Accuracy of train set',accuracy_score(y_train,gini_y_train_pred))
print('Accuracy of test set',accuracy_score(y_test,gini_y_test_pred))

Accuracy of train set 0.7865168539325843
Accuracy of test set 0.8021015761821366


As we can see, there is no sign of overfitting: the accuracy is about 0.79 on the training set and 0.80 on the test set, so there is no significant gap between the two.
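The comparison can be written as a small check on the train/test gap. This is only a sketch: `accuracy_gap` is a hypothetical helper and `TOLERANCE` is an arbitrary illustrative threshold, not a standard value.

```python
def accuracy_gap(train_accuracy, test_accuracy):
    """Return train minus test accuracy; a large positive value hints at overfitting."""
    return train_accuracy - test_accuracy


# Accuracies reported above for the Gini tree.
train_acc = 0.7865168539325843
test_acc = 0.8021015761821366

gap = accuracy_gap(train_acc, test_acc)
print(f"gap = {gap:+.4f}")

# Arbitrary illustrative threshold for calling the gap "large".
TOLERANCE = 0.05
verdict = "possible overfitting" if gap > TOLERANCE else "no large train/test gap"
print(verdict)
```

Here the gap is slightly negative, meaning the test accuracy is a little higher than the training accuracy. A small gap says nothing about whether the accuracy is high in absolute terms, so it is worth reading together with the per-class results.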

# Decision Tree Classifier with criterion entropy

In [142]:
# Instantiate the decision tree model with the entropy criterion

entropy_tree=DecisionTreeClassifier(criterion='entropy',max_depth=3,random_state=42)

# Fitting The model:

entropy_tree.fit(X_train,y_train)
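With `criterion='entropy'`, node impurity is measured by the Shannon entropy of the class proportions, in bits. The sketch below evaluates it on the class counts of the whole dataset from the frequency table earlier. The fitted tree uses the training split, so the value is indicative only, and `entropy` is an illustrative helper rather than a scikit-learn function.

```python
import math


def entropy(class_counts):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)) over classes with p_k > 0."""
    total = sum(class_counts.values())
    probabilities = [c / total for c in class_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probabilities)


# Class counts of the full dataset, from the value_counts() output.
full_dataset_counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}
print(round(entropy(full_dataset_counts), 4))

# Four equally likely classes give the maximum for four classes, 2 bits.
print(entropy({"a": 1, "b": 1, "c": 1, "d": 1}))
```

Because unacc dominates, the entropy of the class distribution is well below the 2-bit maximum for four classes. Entropy and Gini impurity often rank candidate splits similarly, which is one plausible reason the two models in this project give the same predictions.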


Accuracy for Testing set:

In [143]:
# Predict the values of the test set and compute the accuracy:

entropy_y_test_pred=entropy_tree.predict(X_test)
print('Accuracy of testing set using the entropy as criterion:',accuracy_score(y_test,entropy_y_test_pred))

Accuracy of testing set using the entropy as criterion: 0.8021015761821366


Accuracy for Training set:

In [141]:
# Predicting the values of training set and finding the accuracy :

entropy_y_train_pred=entropy_tree.predict(X_train)
print('Accuracy of training set using the entropy as criterion:',accuracy_score(y_train,entropy_y_train_pred))

Accuracy of training set using the entropy as criterion: 0.7865168539325843


Checking for Overfitting and Underfitting

In [145]:
print('Accuracy of train set',accuracy_score(y_train,entropy_y_train_pred))
print('Accuracy of test set',accuracy_score(y_test,entropy_y_test_pred))

Accuracy of train set 0.7865168539325843
Accuracy of test set 0.8021015761821366


# Model Evaluation

Model evaluation is the process of assessing the performance and effectiveness of a machine learning model
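One simple reference point for any accuracy figure is the score of a model that always predicts the most frequent class. The sketch below computes that baseline from the class counts of the whole dataset shown in the frequency table earlier. The variable names are illustrative, and the baseline on the test split alone may differ slightly.

```python
# Class counts of the full dataset, from the value_counts() output.
class_counts = {"unacc": 1210, "acc": 384, "good": 69, "vgood": 65}

# A majority-class "model" always predicts the most frequent class.
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / sum(class_counts.values())
print(majority_class, round(baseline_accuracy, 4))

# Test accuracy reported above for both trees.
model_test_accuracy = 0.8021015761821366
improvement = model_test_accuracy - baseline_accuracy
print(round(improvement, 4))
```

Always predicting unacc would already be right about 70% of the time, so the trees' test accuracy of about 0.80 is an improvement of roughly ten percentage points over that baseline.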


# Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It allows visualization of the performance of an algorithm by showing the counts of true positive, true negative, false positive, and false negative predictions made by the model.

Here's a breakdown of the components of a confusion matrix:

True Positive (TP): The model correctly predicts the positive class.

True Negative (TN): The model correctly predicts the negative class.

False Positive (FP): Also known as a Type I error, the model incorrectly predicts the positive class when it is actually negative.

False Negative (FN): Also known as a Type II error, the model incorrectly predicts the negative class when it is actually positive.

These four outcomes are summarized in a confusion matrix given below.

In [149]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test,entropy_y_test_pred)

array([[ 73,   0,  56,   0],
       [ 20,   0,   0,   0],
       [ 12,   0, 385,   0],
       [ 25,   0,   0,   0]], dtype=int64)
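The matrix above has four classes, so the TP/TN/FP/FN terms apply to each class in turn, treating it as "positive" and the other three as "negative". The sketch below derives those counts from the printed matrix. It assumes rows are true classes, columns are predicted classes, and the label order is alphabetical (acc, good, unacc, vgood), which is scikit-learn's default when no `labels` argument is given. The name `per_class` is illustrative.

```python
labels = ["acc", "good", "unacc", "vgood"]  # assumed (sorted) label order
matrix = [
    [73, 0, 56, 0],    # true acc
    [20, 0, 0, 0],     # true good
    [12, 0, 385, 0],   # true unacc
    [25, 0, 0, 0],     # true vgood
]

total = sum(sum(row) for row in matrix)
per_class = {}
for k, label in enumerate(labels):
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # true class k, predicted as another class
    fp = sum(row[k] for row in matrix) - tp   # predicted k, actually another class
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    per_class[label] = {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
                        "precision": precision, "recall": recall}
    print(label, per_class[label])

# Overall accuracy is the sum of the diagonal divided by the number of test rows.
accuracy = sum(matrix[k][k] for k in range(len(labels))) / total
print(round(accuracy, 4))
```

The columns for good and vgood are all zero, so the model never predicts either class. All 45 cars that are actually good or vgood are predicted as acc.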

# Classification Report

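A classification report is another way to evaluate the performance of a classification model. It displays the precision, recall, F1 score and support for each class. Precision is the fraction of predictions of a class that are correct, recall is the fraction of actual members of a class that the model finds, the F1 score is the harmonic mean of precision and recall, and support is the number of true instances of the class in the test set.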

We can print a classification report as follows:-

In [159]:
from sklearn.metrics import classification_report
print(classification_report(y_test,entropy_y_test_pred))


              precision    recall  f1-score   support

         acc       0.56      0.57      0.56       129
        good       0.00      0.00      0.00        20
       unacc       0.87      0.97      0.92       397
       vgood       0.00      0.00      0.00        25

    accuracy                           0.80       571
   macro avg       0.36      0.38      0.37       571
weighted avg       0.73      0.80      0.77       571
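The F1 scores and the two average rows can be reproduced by hand from three numbers per class: true positives, number of predictions, and support. The sketch below uses counts read from the confusion matrix printed earlier and treats undefined ratios as 0, which matches the 0.00 entries in the report. `safe_div` and `rows` are illustrative names.

```python
# (true positives, times predicted, support) per class, from the confusion matrix.
counts = {
    "acc": (73, 130, 129),
    "good": (0, 0, 20),
    "unacc": (385, 441, 397),
    "vgood": (0, 0, 25),
}


def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the ratio is undefined."""
    return numerator / denominator if denominator else 0.0


rows = {}
for label, (tp, predicted, support) in counts.items():
    precision = safe_div(tp, predicted)
    recall = safe_div(tp, support)
    f1 = safe_div(2 * precision * recall, precision + recall)
    rows[label] = (precision, recall, f1, support)

# Macro average: plain mean over classes. Weighted average: mean weighted by support.
macro_f1 = sum(r[2] for r in rows.values()) / len(rows)
total_support = sum(r[3] for r in rows.values())
weighted_f1 = sum(r[2] * r[3] for r in rows.values()) / total_support

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The macro average gives each class equal weight, so the two classes that are never predicted pull it down to 0.37, while the weighted average of 0.77 is dominated by unacc. scikit-learn emits a warning when precision is undefined for a class with no predictions; `classification_report` has a `zero_division` parameter to control that behaviour.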





# Results and Conclusions

1 - In this project, I built decision tree classifier models to predict the evaluation class of a car. I built two models, one with entropy as the criterion and one with the Gini index.

2 - The accuracy of both models is the same.

3 - The model with entropy as the criterion has an accuracy of 0.8021 on the test set and 0.7865 on the training set, with no sign of overfitting.

4 - The model with the Gini index as the criterion has an accuracy of 0.8021 on the test set and 0.7865 on the training set, with no sign of overfitting.

5 - Both models give exactly the same training-set and test-set accuracy scores. This may be because the dataset is small and both trees are limited to a depth of 3.

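6 - The confusion matrix and classification report show that the model performs well on the majority class unacc (recall 0.97) and moderately on acc (recall 0.57), but it never predicts good or vgood, so the overall accuracy of 0.80 overstates its performance on the minority classes.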

