[Decision Tree Terminology](#1) 

[CART - Classification and Regression Trees](#2)

[Entropy](#3)

[Information Gain](#4)

[Gini index](#5)

[Pruning](#6)

[Importing Libraries and Dataset](#7)

[Exploratory data analysis ](#8)

[Separating Features and Target](#9)

[Splitting Dataset to training and test data](#10)

[Decision Tree Creation](#11)

[Creation of Decision Tree using Gini Index](#12)

[Creation of Decision Tree using Entropy](#13)

[Confusion Matrix](#14)

[Conclusion](#15)

<a id="1"></a> <br>
# 1. Decision Tree Terminology

Decision tree learning is a predictive modeling approach. It is used to address classification problems in statistics, data mining, and machine learning.

A decision tree has a tree-like structure drawn upside down, with the root at the top, and represents a sequence of decisions. It can handle high-dimensional data and often achieves good accuracy.

The topmost node is called the root node and has no incoming edges. An internal node represents a test on an attribute; it has exactly one incoming edge and two or more outgoing edges, and each branch represents an outcome of the test. A terminal node, or leaf node, holds a class label; it has exactly one incoming edge and no outgoing edges.

<a id="2"></a> <br>
# 2. CART - Classification and Regression Trees

Decision trees are commonly built with CART (Classification And Regression Trees). CART is simple to understand, interpret, and visualize, and it requires little effort for data preparation. It also performs feature selection implicitly. Regression trees are used when the target variable is numerical. The value at a terminal node is the mean of the training responses that fall in that region, so any unseen observation that falls in the region is predicted with that mean. Classification trees are used when the target variable is categorical. The value at a terminal node is the mode of the training responses in that region, so any unseen observation that falls in the region is predicted with that mode.
Although CART is simple and has clear advantages, it can overfit if the data is not handled properly. It can also be unstable: a small variation in the data can produce a very different tree.
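
The difference between the two kinds of leaf prediction can be illustrated with a minimal sketch. The leaf contents below are made-up values, not taken from the mushroom dataset, and the variable names are hypothetical.

```python
from statistics import mean, mode

# Hypothetical training responses that ended up in one leaf (terminal node).
regression_leaf = [3.0, 4.0, 5.0, 8.0]
classification_leaf = ["edible", "poisonous", "edible", "edible"]

# Regression tree: the leaf predicts the mean of its training responses.
regression_prediction = mean(regression_leaf)

# Classification tree: the leaf predicts the most frequent class (the mode).
classification_prediction = mode(classification_leaf)

print(regression_prediction)      # 5.0
print(classification_prediction)  # edible
```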

While growing a tree, the following points need to be considered:

* Which features to choose
* Conditions for splitting
* When to stop
* Pruning


The choice of split heavily affects the accuracy of the tree, and the split criteria differ between regression and classification trees. For classification, entropy/information gain or the Gini index can be used to choose the best split. Entropy and information gain go hand in hand.


For a dataset with several features, the information gain of each feature must be known in order to decide which feature becomes the root node, which becomes the next decision node, and so on. The feature with the maximum information gain is chosen as the root node. Information gain is computed from entropy, so entropy has to be calculated first.



<a id="3"></a> <br>
# 3. Entropy 

Entropy is a measure of disorder or impurity in a dataset. In a decision tree, mixed data is split based on the feature values associated with each data point. With each split, the data in the child nodes becomes more homogeneous, which decreases the entropy. Some nodes may still contain a mix of classes, and their entropy stays high. The higher the entropy, the harder it is to draw any conclusion. A leaf node that contains a single class has maximum purity, that is, zero entropy.
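
As a rough illustration, entropy can be computed with the standard Shannon formula, H = -sum(p_i * log2(p_i)), where p_i is the proportion of class i in the node. The function name `entropy` and the toy labels are assumptions made for this sketch; scikit-learn computes the criterion internally.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1 / p) is the same as -p * log2(p), written to avoid a -0.0 result
    return sum((n / total) * log2(total / n) for n in counts.values())

pure = ["e"] * 8                # one class only
mixed = ["e"] * 4 + ["p"] * 4   # two classes in equal proportion

print(entropy(pure))   # 0.0
print(entropy(mixed))  # 1.0
```

A pure node has zero entropy, and a two-class node split evenly has the maximum entropy of 1 bit.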



<a id="4"></a> <br>
# 4. Information Gain

Information gain measures the expected reduction in entropy. Entropy measures the impurity in the data, and information gain measures how much a split reduces that impurity. The feature with the maximum information gain, that is, the one whose split leaves the least impurity, is chosen as the root node.

Information gain is used to decide which feature to split on at each step in building the tree. Creating sub-nodes increases homogeneity, that is, it decreases the entropy of these nodes. The more homogeneous the child nodes are, the larger the reduction in impurity from the split. Information gain is therefore the impurity reduction achieved by a split; for regression trees the analogous quantity is the reduction in variance.

The information gain of a split is calculated as the entropy of the parent node minus the weighted average of the entropies of the child nodes.
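
The calculation can be sketched as follows, weighting each child by its share of the parent's samples. The helper names `entropy` and `information_gain` and the toy splits are hypothetical and only illustrate the formula.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average of the child entropies."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy labels: e = edible, p = poisonous.
parent = ["e", "e", "e", "e", "p", "p", "p", "p"]
perfect_split = [["e", "e", "e", "e"], ["p", "p", "p", "p"]]
useless_split = [["e", "e", "p", "p"], ["e", "e", "p", "p"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves the class mix unchanged gains nothing.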

<a id="5"></a> <br>
# 5. Gini index

The Gini index can also be used to choose splits. The tree chooses the split that minimizes the Gini impurity of the resulting nodes. A higher Gini index indicates higher impurity. The terms Gini index and Gini impurity are used interchangeably here. The Gini index tends to favor large partitions and is very simple to implement. In CART it is used with binary splits only. For a categorical variable, each split therefore has two outcomes, which can be read as "success" or "failure".
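
A minimal sketch of the usual definition, Gini = 1 - sum(p_i^2), where p_i is the proportion of class i in the node. The function name `gini_impurity` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_impurity(["e"] * 8))              # 0.0
print(gini_impurity(["e"] * 4 + ["p"] * 4))  # 0.5
```

For two classes the impurity ranges from 0 (pure node) to 0.5 (even mix).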

<a id="6"></a> <br>
# 6. Pruning

When a tree is fully grown, it is likely to overfit the data because of noise or outliers, which in turn leads to poor accuracy on unseen data. Pruning addresses this.
Pruning is the process of removing redundant comparisons or subtrees. It reduces unnecessary comparisons and can improve performance on unseen data. Pruned trees are smaller, less complex, and easier to understand. There are two approaches: in pre-pruning, splitting is halted early at a particular node, whereas in post-pruning, subtrees are removed from the fully grown tree. A subtree is pruned at a node by removing the branches below that node and replacing them with a leaf node.
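
In scikit-learn, one way to approximate the two approaches is with growth limits such as `max_depth` (pre-pruning) and cost-complexity pruning via `ccp_alpha` (post-pruning). The sketch below uses synthetic data and an arbitrarily chosen alpha purely for illustration; in practice alpha would usually be selected with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data, used only so the sketch is self-contained.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=0
)

# Unpruned tree: grown until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Pre-pruning: stop growth early with limits on depth and leaf size.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X_demo, y_demo)

# Post-pruning: compute the cost-complexity path of the full tree,
# then refit with one of its alphas so that weak subtrees are collapsed.
path = full_tree.cost_complexity_pruning_path(X_demo, y_demo)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-range value
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_demo, y_demo)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```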

<a id="7"></a> <br>
# 7. Importing Libraries and Dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Importing libraries matplotlib and seaborn

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

Importing data set

In [None]:
dataset = pd.read_csv('../input/mushroom-classification/mushrooms.csv')

<a id="8"></a> <br>
# 8. Exploratory data analysis 

In [None]:
#To see the first five rows of the dataset we can use dataset.head()
dataset.head()

In [None]:
dataset['class'].unique()

The class column is the target. It has two classes, which describe whether a mushroom is poisonous or edible: p means poisonous and e means edible.

In [None]:
# Check for null values and inspect the column dtypes
dataset.info()

All the features are categorical and there is no missing value.

In [None]:
#To find number of rows and column
dataset.shape

In [None]:
sns.histplot(dataset['class'])

<a id="9"></a> <br>
# 9. Separating Features and Target

The target is in the class column, so X holds every column except class and y holds the class column.

In [None]:
X = dataset.drop(['class'],axis=1)
y = dataset['class']

As all the values in the dataset are categorical, they must be encoded as numbers before training.
X can be encoded using pandas dummy variables and y using LabelEncoder.

Dummy encoding creates a separate column for each unique value of a column, whereas LabelEncoder encodes target labels with values between 0 and n_classes-1. LabelEncoder should be used to encode the target values, i.e. y, and not the input X.

In [None]:
X = pd.get_dummies(X)
X.head()

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(y)
print(y)

The y variable is encoded as follows:

* Poisonous = p -> 1
* Edible = e -> 0
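
The mapping follows from LabelEncoder sorting the labels before assigning codes, which can be checked through its `classes_` attribute. The sketch below uses a tiny made-up frame with hypothetical `toy_*` names so that it does not overwrite the notebook's `X`, `y`, or `encoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with mushroom-style single-letter codes.
toy = pd.DataFrame({"cap-color": ["n", "y", "n"], "class": ["p", "e", "e"]})

# get_dummies creates one column per distinct value of each feature column.
toy_X = pd.get_dummies(toy.drop(columns=["class"]))
print(list(toy_X.columns))  # ['cap-color_n', 'cap-color_y']

# LabelEncoder sorts the labels, so 'e' -> 0 and 'p' -> 1.
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy["class"])
mapping = {str(label): int(code)
           for code, label in enumerate(toy_encoder.classes_)}
print(mapping)         # {'e': 0, 'p': 1}
print(toy_y.tolist())  # [1, 0, 0]
```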

<a id="10"></a> <br>
# 10. Splitting Dataset to training and test data

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
X_train.shape , X_test.shape

In [None]:
y_train.shape , y_test.shape

<a id="11"></a> <br>
# 11. Decision Tree Creation

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [None]:
from sklearn.metrics import accuracy_score

<a id="12"></a> <br>
# 12. Creation of Decision Tree using Gini Index

In [None]:
#Using the Decision Tree Classifier with splitting criterion as Gini impurity, the maximum depth of the tree is 3.
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)


# fit the model
clf_gini.fit(X_train, y_train)

In [None]:
#Plot the tree
plt.figure(figsize=(12,8))

# The model is already fitted above, so it can be plotted without refitting
tree.plot_tree(clf_gini)

In [None]:
#Predict the values 
y_pred_gini = clf_gini.predict(X_test)

Overfitting occurs when accuracy on the training set is high while accuracy on the test set is much lower. It is a very common problem with decision trees.

In [None]:
#Predict on the training set for accuracy comparison
y_pred_train_gini = clf_gini.predict(X_train)

y_pred_train_gini

In [None]:
#Determine the accuracy score
print('Model accuracy score with criterion gini index: {0:0.4f}'. format(accuracy_score(y_test, y_pred_gini)))
#Accuracy Score for training set
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_gini)))

<a id="13"></a> <br>
# 13. Creation of Decision Tree using Entropy

In [None]:
clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)


# fit the model
clf_en.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(12,8))
# The model is already fitted above, so it can be plotted without refitting
tree.plot_tree(clf_en)

In [None]:
#Predict the values 
y_pred_en = clf_en.predict(X_test)

In [None]:
#Predict on the training set for accuracy comparison
y_pred_train_en = clf_en.predict(X_train)

In [None]:
print('Model accuracy score with criterion entropy: {0:0.4f}'. format(accuracy_score(y_test, y_pred_en)))
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train_en)))

In [None]:
print('Training set score: {:.4f}'.format(clf_en.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(clf_en.score(X_test, y_test)))

<a id="14"></a> <br>
# 14. Confusion Matrix 

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import  f1_score

In [None]:
cm = confusion_matrix(y_test, y_pred_en)

print('Confusion matrix\n\n', cm)

In [None]:
f, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(cm, annot=True, linewidths=0.5, linecolor="red", fmt='.0f', ax=ax)
# Save before plt.show(); saving afterwards can write out an empty figure
plt.savefig('ConfusionMatrix.png')
plt.show()

In [None]:
print(classification_report(y_test, y_pred_en))

In [None]:
# Use a new name so the imported f1_score function is not overwritten
f1 = f1_score(y_test, y_pred_en)
print("F1 Score:", f1)
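
The scores reported above can also be derived by hand from a 2x2 confusion matrix. The sketch below uses made-up counts (not the notebook's results) arranged in scikit-learn's layout, where rows are true classes and columns are predicted classes, with label 0 first. The name `cm_demo` is hypothetical.

```python
# Hypothetical 2x2 confusion matrix: [[TN, FP], [FN, TP]]
cm_demo = [[80, 5],
           [10, 70]]
(tn, fp), (fn, tp) = cm_demo

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.933 recall=0.875 f1=0.903 accuracy=0.909
```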

<a id="15"></a> <br>
# 15. Conclusion

The decision tree classifiers built with the Gini index and with entropy both show only a very small difference between test-set accuracy and training-set accuracy, so there is no sign of overfitting.