# Decision Tree using Entropy and Information Gain
The Python module is available [here](https://github.com/ryan-kp-miller/Machine-Learning-Algorithms/tree/master/DecisionTree).

## Decision Tree Module and Dependencies

In [3]:
from DecisionTree.DecisionTree import DecisionTree

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from time import time

### Reading in and Splitting the Data

In [6]:
#reading in the Pima Indians Diabetes dataset
data = pd.read_csv('../Data/diabetes.csv')

#splitting data into features and target variables
target = np.array(data.iloc[:,-1]).reshape((-1,1))
features = data.iloc[:,:-1]

#scaling the features to mean 0 and unit variance
ss = StandardScaler()
features = ss.fit_transform(np.array(features))

#adding intercept column to features (a constant column; it carries no information for tree splits)
features = np.append(features,np.ones((features.shape[0],1)),axis=1)

#splitting the data into train and test sets
X_train,X_test,y_train,y_test = train_test_split(features,target,test_size=0.2,random_state=0)

### Using Information Gain for Deciding on Decision Tree Split Values
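Information Gain = Entropy before Split - Weighted Average Entropy after Split  
Entropy = $ -\sum_{i=1}^{c} p_i \log_2(p_i) $  
where $c$ is the number of classes and $p_i$ is the proportion of observations in the node that belong to class $i$. The entropy after the split is the average of the child nodes' entropies, weighted by the fraction of observations that falls into each child.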
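
The definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the `DecisionTree` module: the function names `entropy`, `information_gain`, and `best_threshold` are hypothetical, and the sketch assumes a single numeric feature, binary splits of the form `feature <= threshold`, and candidate thresholds taken at the midpoints between sorted unique values.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # sum(p * log2(1/p)) is the same quantity as -sum(p * log2(p))
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(labels, left_mask):
    """Parent entropy minus the size-weighted entropy of the two child nodes."""
    labels = np.asarray(labels).ravel()
    left, right = labels[left_mask], labels[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - children

def best_threshold(feature, labels):
    """Return (threshold, gain) for the candidate split with the highest gain."""
    feature = np.asarray(feature, dtype=float)
    values = np.unique(feature)  # sorted unique values
    best = (None, 0.0)
    for t in (values[:-1] + values[1:]) / 2:  # midpoints between neighbours
        gain = information_gain(labels, feature <= t)
        if gain > best[1]:
            best = (float(t), gain)
    return best

# toy example: the classes separate perfectly at 2.5
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0, 0, 1, 1])
print(best_threshold(x, y))  # (2.5, 1.0)
```

In the toy example the parent node has entropy 1 bit (two equally likely classes) and both children are pure, so the split at 2.5 recovers the full 1 bit of information gain.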

### Comparing Performance to Sklearn's DecisionTreeClassifier
The test accuracy of my Decision Tree classifier is comparable to Sklearn's implementation (75.32% versus 72.08% on this train/test split), and unlike Sklearn's version, it can handle categorical data without prior preprocessing. The main downside is runtime: about 0.33 seconds versus 0.006 seconds for Sklearn in this run, roughly 50 times slower.

In [5]:
#sklearn
start = time()
dtc = DecisionTreeClassifier(criterion="entropy",random_state=0,max_depth=25)
dtc.fit(X_train,y_train.ravel())  #sklearn expects a 1-D target array
end = time()
print("Sklearn's DecisionTreeClassifier Test Accuracy:",np.round(100*dtc.score(X_test,y_test),2),'%')
print("Sklearn's DecisionTreeClassifier Runtime:",np.round(end-start,6),'seconds')

Sklearn's DecisionTreeClassifier Test Accuracy: 72.08 %
Sklearn's DecisionTreeClassifier Runtime: 0.005984 seconds


In [4]:
#self-made
start = time()
dt = DecisionTree()
dt.learn(X_train,y_train,max_depth=25)
end = time()
print("Self-Made DecisionTree Classifier Test Accuracy:",np.round(100*dt.score(X_test,y_test),2),'%')
print("Self-Made DecisionTree Classifier Runtime:",np.round(end-start,6),'seconds')

Self-Made DecisionTree Classifier Test Accuracy: 75.32 %
Self-Made DecisionTree Classifier Runtime: 0.325629 seconds
