## Decision Tree

A decision tree is a tree-shaped diagram used to determine a course of action. Each branch of the tree represents a possible decision, occurrence, or reaction.

* `Entropy` - A measure of the randomness or unpredictability in the dataset
* `Information Gain` - The decrease in entropy after the dataset is split
* `Root Node` - The topmost node, representing the whole dataset
* `Decision Node` - A node with two or more branches
* `Leaf Node` - A node that carries the final classification or decision
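
To make the two measures above concrete, here is a minimal sketch that computes them for small label lists. It assumes Shannon entropy in bits and a size-weighted average over the child nodes; the function names and the labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1 / p) summed over the classes, with p = n / total
    return sum((n / total) * log2(total / n) for n in counts.values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels: a 50/50 mix is maximally unpredictable for two classes.
parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]
useless_split = [["yes", "no"], ["yes", "no"]]

print(entropy(parent))                          # 1.0
print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A split that separates the classes completely removes all of the entropy, while a split that leaves each child as mixed as the parent gains nothing.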

**How does a decision tree work?**
> Problem Statement - Classify different types of fruit based on their features

1. Frame the conditions for splitting the data so that the information gain is the highest
   * E.g., conditions can be based on features such as `color` or `diameter`
   * Split the dataset based on the chosen condition
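
The step above can be sketched as a search over candidate conditions that keeps the one with the highest information gain. The fruit rows, the candidate conditions, and the helper names below are all hypothetical; this is a rough illustration of the idea, not how any particular library implements it.

```python
from collections import Counter
from math import log2

# Hypothetical toy data: (color, diameter, fruit label).
fruits = [
    ("green", 3, "apple"),
    ("yellow", 3, "apple"),
    ("red", 1, "grape"),
    ("red", 1, "grape"),
    ("yellow", 3, "lemon"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def gain(rows, condition):
    """Information gain from splitting rows into condition-true and condition-false."""
    true_rows = [r for r in rows if condition(r)]
    false_rows = [r for r in rows if not condition(r)]
    if not true_rows or not false_rows:
        return 0.0  # the condition does not actually split the data
    parent = entropy([r[2] for r in rows])
    children = sum(
        len(part) / len(rows) * entropy([r[2] for r in part])
        for part in (true_rows, false_rows)
    )
    return parent - children


# Candidate conditions on the two features mentioned above.
candidates = {
    "diameter >= 3": lambda r: r[1] >= 3,
    "color == 'yellow'": lambda r: r[0] == "yellow",
    "color == 'green'": lambda r: r[0] == "green",
}

gains = {name: gain(fruits, cond) for name, cond in candidates.items()}
best = max(gains, key=gains.get)
print("Best first split:", best)
```

On this toy data the diameter condition wins because it isolates the grapes into a pure group. A full tree builder would then repeat the same search on each resulting subset.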

### Use Case - Loan Repayment Prediction

The goal is to predict whether a customer will repay a loan, using the decision tree algorithm.

In [5]:
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Loading data
# Raw string so the backslashes in the Windows path are not read as escape sequences
loan_data = pd.read_csv(r'E:\ML Algorithms\Decision_Tree_ Dataset.csv', sep=',')

In [6]:
print('Dataset Length:', len(loan_data))
print('Dataset Shape:', loan_data.shape)

Dataset Length: 1000
Dataset Shape: (1000, 6)


In [7]:
loan_data.head()

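|   | 1 | 2 | 3 | 4 | sum | Unnamed: 5 |
|---|---|---|---|---|---|---|
| 0 | 201 | 10018 | 250 | 3046 | 13515 | yes |
| 1 | 205 | 10016 | 395 | 3044 | 13660 | yes |
| 2 | 257 | 10129 | 109 | 3251 | 13746 | yes |
| 3 | 246 | 10064 | 324 | 3137 | 13771 | yes |
| 4 | 117 | 10115 | 496 | 3094 | 13822 | yes |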


In [11]:
columns = ['Initial Payment', 'Last Payment', 'Credit Score', 'House Number', 'Result']
# 'sum' is the row-wise total of the four feature columns, so drop it
loan_data.drop('sum', axis=1, inplace=True)

In [15]:
loan_data.columns = columns

In [17]:
loan_data.head()

|   | Initial Payment | Last Payment | Credit Score | House Number | Result |
|---|---|---|---|---|---|
| 0 | 201 | 10018 | 250 | 3046 | yes |
| 1 | 205 | 10016 | 395 | 3044 | yes |
| 2 | 257 | 10129 | 109 | 3251 | yes |
| 3 | 246 | 10064 | 324 | 3137 | yes |
| 4 | 117 | 10115 | 496 | 3094 | yes |


In [18]:
# Splitting data into X & y
X = loan_data.values[:, 0:4]
y = loan_data.values[:, 4]

# Splitting into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

In [20]:
# Training with entropy as the split criterion
# max_depth=3: the tree can be at most 3 levels deep
# min_samples_leaf=5: every leaf must contain at least 5 training samples
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=100, max_depth=3, min_samples_leaf=5)
clf_entropy.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=3, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=100, splitter='best')
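
To see what `max_depth` and `min_samples_leaf` constrain, the sketch below fits a classifier with the same settings and inspects the fitted tree. The data is synthetic (an assumption for illustration, not the loan dataset), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the loan dataset (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.where(X[:, 0] + X[:, 1] > 0, "yes", "No")

clf = DecisionTreeClassifier(
    criterion="entropy", random_state=100, max_depth=3, min_samples_leaf=5
).fit(X, y)

tree = clf.tree_
is_leaf = tree.children_left == -1         # leaf nodes have no children
leaf_sizes = tree.n_node_samples[is_leaf]  # training samples that reach each leaf

print("Depth:", clf.get_depth())
print("Number of leaves:", clf.get_n_leaves())
print("Smallest leaf size:", leaf_sizes.min())
```

The depth should never exceed 3, and no leaf should hold fewer than 5 training samples. Note that `min_samples_leaf` limits how small a leaf can be, not how many leaves the tree has.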

In [21]:
y_preds_en = clf_entropy.predict(X_test)
y_preds_en

array(['yes', 'yes', 'No', 'yes', 'No', 'yes', 'yes', 'No', 'No', 'No',
       'No', 'No', 'yes', 'No', 'No', 'No', 'yes', 'No', 'yes', 'No',
       'No', 'yes', 'No', 'yes', 'yes', 'No', 'No', 'yes', 'No', 'No',
       'No', 'yes', 'yes', 'yes', 'yes', 'No', 'No', 'No', 'yes', 'No',
       'yes', 'yes', 'yes', 'yes', 'No', 'yes', 'No', 'yes', 'No', 'No',
       'yes', 'No', 'yes', 'yes', 'yes', 'yes', 'No', 'yes', 'No', 'yes',
       'yes', 'No', 'No', 'yes', 'No', 'yes', 'yes', 'yes', 'No', 'yes',
       'No', 'No', 'No', 'yes', 'No', 'yes', 'yes', 'No', 'yes', 'No',
       'No', 'No', 'No', 'yes', 'No', 'yes', 'No', 'yes', 'yes', 'No',
       'yes', 'yes', 'yes', 'yes', 'yes', 'No', 'yes', 'yes', 'yes',
       'yes', 'No', 'No', 'yes', 'yes', 'No', 'yes', 'yes', 'yes', 'No',
       'No', 'yes', 'yes', 'yes', 'No', 'No', 'yes', 'yes', 'yes', 'No',
       'No', 'No', 'No', 'yes', 'yes', 'No', 'yes', 'yes', 'yes', 'No',
       'No', 'yes', 'yes', 'No', 'yes', 'yes', 'yes', 'No', 'yes',

In [23]:
print('Accuracy is:', accuracy_score(y_test, y_preds_en))

Accuracy is: 0.93
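
Accuracy here is the fraction of test rows whose predicted label matches the true label. With `test_size=0.2` on 1000 rows the test set has 200 rows, so 0.93 corresponds to 186 correct predictions. A minimal sketch of the same calculation, using a hypothetical helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels, not taken from the loan dataset.
y_true = ["yes", "No", "yes", "No"]
y_pred = ["yes", "No", "No", "No"]
print(accuracy(y_true, y_pred))  # 0.75
```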
