# Decision Tree Algorithm

* The decision tree algorithm is one of the most widely used supervised machine learning methods for classification and regression.
* Decision trees are often referred to as classification and regression trees (CART).
* The data is split repeatedly, one node at a time, according to predetermined rules until a final result is reached.
* Each split separates the data into more homogeneous groups, and splitting continues until the groups cannot usefully be divided further.
* A decision tree is a non-parametric approach and does not depend on any assumptions about the probability distribution of the data.

* The model has a tree structure, as shown below:
![Decision-Trees-modified-1.png](attachment:Decision-Trees-modified-1.png)
source: https://www.xoriant.com/blog/decision-trees-for-classification-a-machine-learning-algorithm

**The main components of a decision tree are:**

- **Root node**: The topmost node of a decision tree.

- **Decision node / internal node**: A node where the data is split on the value of an attribute.

- **Leaf node**: A terminal node that holds a final outcome.
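
To make the three node types concrete, here is a minimal sketch that fits a small tree and counts its nodes. It assumes scikit-learn's built-in iris data as a stand-in (not the dataset used later), and the names `demo_tree` and `structure` are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a small tree on a built-in dataset (illustrative choice only)
X_demo, y_demo = load_iris(return_X_y=True)
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

structure = demo_tree.tree_
# In scikit-learn's internal arrays, a node without a left child (-1) is a leaf
is_leaf = structure.children_left == -1

n_nodes = structure.node_count
n_leaves = int(is_leaf.sum())
n_decision = n_nodes - n_leaves  # the root (node 0) plus any internal decision nodes

print(f"total nodes: {n_nodes}")
print(f"decision nodes (including the root): {n_decision}")
print(f"leaf nodes: {n_leaves}")
```

Because every decision node in scikit-learn's trees has exactly two children, the number of leaves is always one more than the number of decision nodes.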

## Working of a Decision Tree Algorithm

Building a decision tree involves three steps:

**Step 1. Splitting** – The data set is divided into subgroups. As the figure below shows, the split can be based on a variety of factors, such as class, height, and gender.
![DT_split.png](attachment:DT_split.png)
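
A split can be sketched with a few lines of pandas. The data below is hypothetical (the `students` frame, its values, and the 5.5 threshold are made up for illustration); it compares a split on height with a split on gender.

```python
import pandas as pd

# Hypothetical toy data; the columns echo the factors mentioned in the text
students = pd.DataFrame({
    "gender": ["M", "F", "M", "F", "M", "F"],
    "height": [5.9, 5.2, 5.6, 5.8, 5.3, 5.1],
    "plays_cricket": [1, 0, 1, 1, 0, 0],
})

# Candidate split 1: height below vs. at or above a threshold
threshold = 5.5
left = students[students["height"] < threshold]
right = students[students["height"] >= threshold]
print("height < 5.5:", left["plays_cricket"].value_counts().to_dict())
print("height >= 5.5:", right["plays_cricket"].value_counts().to_dict())

# Candidate split 2: by gender; show the share of positives in each subgroup
print(students.groupby("gender")["plays_cricket"].mean())
```

In this toy data the height split produces two subgroups that each contain a single label, whereas the gender split leaves both subgroups mixed, so a tree-building algorithm would prefer the height split.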


**Step 2. Pruning** – The depth of the tree is limited by removing branches.
![pruning.png](attachment:pruning.png)

Pruning is further divided into two types:

* **Pre-Pruning** – The tree stops growing at a node when the association between the attributes and the class at that node is not statistically significant.

* **Post-Pruning** – The fully grown tree is evaluated on held-out data, and the branches that result from overfitting noise in the training set are cut back based on that performance.
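
The sketch below contrasts an unpruned tree with pre-pruned and post-pruned ones in scikit-learn. Note the substitutions: scikit-learn pre-prunes through hyperparameters such as `max_depth` and `min_samples_leaf` rather than a significance test, and it post-prunes with minimal cost-complexity pruning (`ccp_alpha`) rather than the held-out evaluation described above. The synthetic data and the choice of `alpha` are arbitrary assumptions for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise (illustrative only)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, flip_y=0.1, random_state=0)

# Unpruned tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)

# Pre-pruning: stop growth early with depth and leaf-size limits
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_syn, y_syn)

# Post-pruning: cost-complexity pruning removes weak branches of the grown tree
path = full_tree.cost_complexity_pruning_path(X_syn, y_syn)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)  # arbitrary mid-range value
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_syn, y_syn)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: {model.tree_.node_count} nodes, depth {model.get_depth()}")
```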

**Step 3. Tree Selection** – The aim of this step is to find the smallest tree that fits the data.
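
One possible way to pick the smallest adequate tree is sketched below: fit one tree per pruning strength, score each on a validation split, and keep the smallest tree whose accuracy is close to the best. The synthetic data, the 0.01 `tolerance`, and the selection rule itself are assumptions for illustration, not a prescribed procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=600, n_features=8, flip_y=0.1, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

# Candidate trees: one per pruning strength along the cost-complexity path
alphas = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
candidates = []
for a in alphas:
    a = max(float(a), 0.0)  # guard against tiny negative values from floating-point error
    model = DecisionTreeClassifier(ccp_alpha=a, random_state=1).fit(X_tr, y_tr)
    candidates.append((model.tree_.node_count, model.score(X_val, y_val), a))

best_score = max(score for _, score, _ in candidates)
tolerance = 0.01  # hypothetical: accept trees within one accuracy point of the best

# Tuples compare by node count first, so min() returns the smallest acceptable tree
n_nodes, score, chosen_alpha = min(c for c in candidates if c[1] >= best_score - tolerance)
print(f"chosen tree: {n_nodes} nodes, validation accuracy {score:.3f}, ccp_alpha {chosen_alpha:.5f}")
```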

### Illustration of Constructing a Decision Tree

**Entropy and Information Gain**

* The entropy of a data set measures how disordered (mixed) it is.
* Entropy is used in decision trees because the main goal of a tree is to organise the data by grouping similar data points into the same category.
* In the image below we start with an initial dataset and apply a decision tree algorithm to group related data points into a single category.
* After the split, most of the red circles fall into one group and most of the blue crosses fall into the other. A decision tree thus categorises data points by their attributes, and each split can be based on a variety of criteria.

Assume we have a set of N items that fall into two categories. We describe the set by the ratio of items carrying each label:
![math_DT1.png](attachment:math_DT1.png)

The entropy of our set is given by the following equation:
![math_DT2.png](attachment:math_DT2.png)

Graph for the given formula:
![Entropy.png](attachment:Entropy.png)
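
As a rough numerical companion to the formulas, the sketch below assumes the standard two-class Shannon entropy, E = -p·log2(p) - q·log2(q) with q = 1 - p, and defines information gain as the parent's entropy minus the size-weighted entropy of the subsets produced by a split. The function names and the example counts are illustrative.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a two-class set where a fraction p belongs to one class."""
    if p in (0.0, 1.0):
        return 0.0  # a pure set has no disorder; this also avoids log(0)
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)

def information_gain(parent_pos: int, parent_n: int, children: list) -> float:
    """Entropy reduction from splitting a parent set into child subsets.

    children is a list of (positives, size) pairs, one per subset.
    """
    parent_entropy = binary_entropy(parent_pos / parent_n)
    weighted_child_entropy = sum(
        (size / parent_n) * binary_entropy(pos / size) for pos, size in children
    )
    return parent_entropy - weighted_child_entropy

print(binary_entropy(0.5))  # 1.0, an evenly mixed set
print(binary_entropy(1.0))  # 0.0, a pure set
# Splitting 10 items (5 positive) into subsets with 4/5 and 1/5 positives
print(information_gain(5, 10, [(4, 5), (1, 5)]))
```

Entropy peaks at 1 bit when the two classes are evenly mixed and falls to 0 for a pure set, which matches the shape of the graph above. A tree trained with `criterion="entropy"` chooses, at each node, the split with the highest information gain.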

**Advantages**:

* Easy to understand and build.
* Applicable to both regression and classification.
* Often gives good results.
* Handles large data sets efficiently.
* Handles training data well with little effort.

**Disadvantages**:

**Instability**: A decision tree works well when the data is precise and accurate, but a slight change in the input can change the tree drastically.

**Complexity**: Too many observations and features increase the complexity of the tree by increasing the number of branches.

**Costs**: Building a good tree requires sound statistical knowledge, which adds to the cost.

In [None]:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Importing Decision Tree Classifier
from sklearn.model_selection import train_test_split # Importing train_test_split function
from sklearn import metrics #Importing scikit-learn metrics module for accuracy calculation
from sklearn import tree

In [None]:
# Import the dataset
pima = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")
pima.head()

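| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |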


#### Proper EDA and feature engineering should be done before training the model

In [None]:
# Split the dataset into features (independent variables) and target (dependent variable)
X = pima.iloc[:, :-1]
y = pima.iloc[:, -1]
print(X.head())
print(y.head())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  
0                     0.627   50  
1                     0.351   31  
2                     0.672   32  
3                     0.167   21  
4                     2.288   33  
0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64


In [None]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=102) # 75% training and 25% test

In [None]:
# Create the decision tree classifier object
classification = DecisionTreeClassifier()

# Train the decision tree classifier
classification = classification.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = classification.predict(X_test)

In [None]:
print(f"Decision tree training set accuracy: {format(classification.score(X_train, y_train), '.4f')} ")
print(f"Decision tree testing set accuracy: {format(classification.score(X_test, y_test), '.4f')} ")

Decision tree training set accuracy: 1.0000 
Decision tree testing set accuracy: 0.7240 


In [None]:
# Model performance: print precision, recall and F1-score for each class
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.73      0.78       131
           1       0.55      0.70      0.62        61

    accuracy                           0.72       192
   macro avg       0.70      0.72      0.70       192
weighted avg       0.75      0.72      0.73       192
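
The precision and recall figures in a report like this come from the counts in a confusion matrix. The sketch below shows the relationship on hypothetical labels (`y_true_demo` and `y_pred_demo` are made up and are not the predictions above).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}")
```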



In [None]:
# Create a decision tree classifier that splits on entropy and is limited to depth 3
classification = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train the decision tree classifier
classification = classification.fit(X_train, y_train)

# Predict the response for the test dataset
y_pred = classification.predict(X_test)

In [None]:
print(f"Decision tree training set accuracy: {format(classification.score(X_train, y_train), '.4f')} ")
print(f"Decision tree testing set accuracy: {format(classification.score(X_test, y_test), '.4f')} ")

Decision tree training set accuracy: 0.7639 
Decision tree testing set accuracy: 0.8021 


In [None]:
# Model performance: print precision, recall and F1-score for each class
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.84      0.85       131
           1       0.68      0.72      0.70        61

    accuracy                           0.80       192
   macro avg       0.77      0.78      0.78       192
weighted avg       0.81      0.80      0.80       192
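
A shallow tree is small enough to read as a set of rules. The sketch below uses `export_text` from `sklearn.tree` to print them; it assumes the built-in iris data as a self-contained stand-in, and the name `small_tree` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Built-in dataset used as a stand-in so the example runs on its own
iris = load_iris()
small_tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
small_tree.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else rules, one line per node
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is a decision node with its threshold, and each `class:` line is a leaf. The same call should work on the classifier fitted above if it is given the column names of `X` as `feature_names`.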

