**Steps**

This guide follows these steps:

Step 1 - Loading the required libraries and modules.

Step 2 - Reading the data and performing basic data checks.

Step 3 - Creating arrays for the features and the response variable.

Step 4 - Creating the training and test datasets.

Step 5 - Building, predicting, and evaluating the decision tree model.

**Step 1 - Loading the Required Libraries**

In [1]:
# Import required libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import sklearn
from sklearn import tree

# Import necessary modules
from sklearn.model_selection import train_test_split

**Step 2 - Reading the Dataset and Performing Basic Data Checks**

In [None]:
# Reading Train Dataset from Local Drive
from google.colab import files
uploaded = files.upload()

In [2]:
diabetes = pd.read_csv("diabetes_datatset.csv") 

In [7]:
diabetes.tail(3)

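| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |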


In [10]:
print(diabetes.shape)
diabetes.describe().transpose()

(768, 9)


| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | 768.0 | 3.845052 | 3.369578 | 0.0 | 1.0 | 3.0 | 6.0 | 17.0 |
| Glucose | 768.0 | 120.894531 | 31.972618 | 0.0 | 99.0 | 117.0 | 140.25 | 199.0 |
| BloodPressure | 768.0 | 69.105469 | 19.355807 | 0.0 | 62.0 | 72.0 | 80.0 | 122.0 |
| SkinThickness | 768.0 | 20.536458 | 15.952218 | 0.0 | 0.0 | 23.0 | 32.0 | 99.0 |
| Insulin | 768.0 | 79.799479 | 115.244002 | 0.0 | 0.0 | 30.5 | 127.25 | 846.0 |
| BMI | 768.0 | 31.992578 | 7.88416 | 0.0 | 27.3 | 32.0 | 36.6 | 67.1 |
| DiabetesPedigreeFunction | 768.0 | 0.471876 | 0.331329 | 0.078 | 0.24375 | 0.3725 | 0.62625 | 2.42 |
| Age | 768.0 | 33.240885 | 11.760232 | 21.0 | 24.0 | 29.0 | 41.0 | 81.0 |
| Diabetes | 768.0 | 0.348958 | 0.476951 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [17]:
# Count the missing values in each column.
# A notebook cell displays only its last expression, so keep this call last.
diabetes.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Diabetes                    0
dtype: int64
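
A null count of zero does not by itself mean the data are complete. The summary table above reports a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, which is not a plausible measurement for a patient, so zeros may be standing in for missing values. The following is a minimal sketch of how one might count them. It uses a small made-up DataFrame (the rows are hypothetical, not taken from the dataset), and the name `suspect_columns` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from the dataset
sample = pd.DataFrame({
    "Glucose": [121, 0, 93],
    "BloodPressure": [72, 60, 0],
    "Insulin": [112, 0, 0],
    "BMI": [26.2, 30.1, 0.0],
})

# Columns where a value of zero is assumed to mean "not recorded"
suspect_columns = ["Glucose", "BloodPressure", "Insulin", "BMI"]

# Count the zeros in each suspect column
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)

# Optionally mark those zeros as missing so isnull() can see them
cleaned = sample.copy()
cleaned[suspect_columns] = cleaned[suspect_columns].replace(0, np.nan)
print(cleaned.isnull().sum())
```

Whether to impute, drop, or keep such rows is a modelling decision that this guide does not make; the sketch only shows how to surface them.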

**Step 3 - Creating Arrays for the Features and the Response Variable**

The first line of code stores the name of the target variable, 'Diabetes', in a list called 'target_column'.

The second line builds the list of all the features, excluding the target variable, while the third line normalizes the predictors by dividing each column by its maximum value.

The fourth line displays the summary of the normalized data. Because every predictor is non-negative, all the independent variables are now scaled between 0 and 1. The target variable remains unchanged.

In [None]:
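target_column = ['Diabetes']
# A list comprehension keeps the original column order; a set difference does not guarantee any order
predictors = [col for col in diabetes.columns if col not in target_column]

# Scale each predictor by its column maximum
diabetes[predictors] = diabetes[predictors]/diabetes[predictors].max()
diabetes.describe().transpose()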
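
To make the effect of the scaling step concrete, here is a small sketch on made-up values (the `toy` DataFrame and its numbers are hypothetical). Each predictor is divided by its own column maximum, so the largest value in every column becomes 1, and the smallest becomes min/max, which is 0 only if the column contains a 0.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    "Glucose": [100.0, 150.0, 200.0],
    "Age": [21.0, 42.0, 84.0],
    "Diabetes": [0, 1, 1],
})

target_column = ["Diabetes"]
predictors = [col for col in toy.columns if col not in target_column]

# Divide each predictor by its column maximum; the target is left untouched
toy[predictors] = toy[predictors] / toy[predictors].max()
print(toy)
```

Decision trees split on thresholds, so rescaling a feature this way should not change which splits are chosen. The step is harmless here and becomes useful if the same data are later fed to a scale-sensitive model.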

**Step 4 - Creating the Training and Test Datasets**

The first two lines of code below create arrays of the independent (X) and dependent (y) variables, respectively. The third line splits the data into training and test datasets, holding out 30% of the rows for testing and fixing random_state so that the split is reproducible. The last two lines print the shapes of the training and test data.



In [None]:
X = diabetes[predictors].values
y = diabetes[target_column].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape)
print(X_test.shape)
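
As a sanity check on the shapes to expect, the sketch below repeats the split on random placeholder arrays with the same dimensions as the dataset (768 rows, 8 predictors). The arrays `X_demo` and `y_demo` are made up for illustration. The `stratify` argument is not used in this guide; it is shown here as an optional variation that keeps the class proportions similar in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the diabetes dataset (values are random)
rng = np.random.default_rng(0)
X_demo = rng.random((768, 8))
y_demo = rng.integers(0, 2, size=768)

# Same 70/30 split as in the guide, with optional stratification on the labels
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40, stratify=y_demo
)

print(X_tr.shape, X_te.shape)      # row counts of the two subsets
print(y_tr.mean(), y_te.mean())    # share of positive labels in each subset
```

With 768 rows and test_size=0.30, the split should come out as 537 training rows and 231 test rows.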

**Step 5 - Building, Predicting, and Evaluating the Decision Tree**


In [None]:
DT = tree.DecisionTreeClassifier(criterion='entropy', min_samples_split=4, max_depth=5)
DT.fit(X_train,y_train)


predict_train = DT.predict(X_train)
predict_test = DT.predict(X_test)

For reference, the constructor signature listed in the scikit-learn documentation (the exact parameters may vary between versions):

`class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)`

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
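
The model above is built with criterion='entropy'. As a rough illustration of what that criterion measures, the sketch below computes Shannon entropy and the information gain of a split by hand. The helper names `entropy` and `information_gain` are assumptions introduced here, not scikit-learn functions, and the code is a simplified view rather than the library's internal implementation. The label counts (500 negative, 268 positive) follow from the summary table, where the mean of Diabetes is about 0.349 over 768 rows.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Class balance implied by the summary table: 268 positives out of 768
labels = np.array([0] * 500 + [1] * 268)
print("Entropy of the full label set:", round(entropy(labels), 3))

# A perfectly separating split on a tiny hypothetical node
parent = np.array([0, 0, 1, 1])
print("Gain of a perfect split:", information_gain(parent, parent[:2], parent[2:]))
```

A split is attractive when it produces children that are purer than the parent, which shows up as a larger information gain.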

Once the predictions are generated, we can evaluate the performance of the model.

Because this is a classification problem, we first import the required metrics from sklearn.metrics, which is done in the first line of code below.

The second, third, and fourth lines print the accuracy score, the confusion matrix, and the classification report on the training data.

In [None]:
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix
print("Accuracy Score of DT on Training Dataset - ", accuracy_score(y_train,predict_train))
print("\nConfusion Matrix of DT on Training Dataset - \n", confusion_matrix(y_train,predict_train))
print("\nClassification Report of DT on Training Dataset - \n", classification_report(y_train,predict_train))
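
To show how the numbers in these reports relate to each other, the sketch below derives accuracy, precision, and recall from a confusion matrix on a tiny made-up example (the label arrays are hypothetical). In scikit-learn's layout, rows are the actual classes and columns are the predicted classes, so a binary matrix unpacks as TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted positives that are truly positive
recall = tp / (tp + fn)           # share of actual positives that were found

print(cm)
print(accuracy, precision, recall)
```

For a medical screening task, recall on the positive class is often the figure to watch, since a false negative means a missed case.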

In [None]:
# Plotting libraries for the confusion matrix heatmaps
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
plt.figure(figsize=(6,6))
# fmt='d' prints the cell counts as integers instead of scientific notation
sns.heatmap(data=confusion_matrix(y_train,predict_train), linewidths=.5, annot=True, fmt='d', square=True, cmap='Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix of Train Dataset')

In [None]:
print("Accuracy Score of DT on Test Dataset - ", accuracy_score(y_test,predict_test))
print("\nConfusion Matrix of DT on Test Dataset - \n", confusion_matrix(y_test,predict_test))
print("\nClassification Report of DT on Test Dataset - \n", classification_report(y_test,predict_test))
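
Comparing the training and test scores is a simple way to spot overfitting: a large gap suggests the tree has memorized the training data. The sketch below illustrates this on synthetic data from make_classification (not the diabetes dataset), trying a few values of max_depth. The depth values and the `scores` dictionary are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same shape as the diabetes dataset
X_demo, y_demo = make_classification(n_samples=768, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=40
)

scores = {}
for depth in (2, 5, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for this depth
    scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(depth, scores[depth])
```

An unrestricted tree typically reaches perfect training accuracy while its test accuracy lags behind, which is the motivation for limiting max_depth as the model in this guide does.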

In [None]:
plt.figure(figsize=(6,6))
sns.heatmap(data=confusion_matrix(y_test,predict_test),linewidths=.5, annot=True,square = True,  cmap = 'Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.title('Confusion Matrix of Test Dataset')

In [None]:

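# Once trained, you can plot the tree with the plot_tree function:
plt.figure(figsize=(20, 10))  # a larger canvas keeps the node labels readable
tree.plot_tree(DT, feature_names=predictors, filled=True)
plt.savefig('out.pdf')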
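
When the plotted tree is hard to read, a text dump of the rules can be a useful alternative. The sketch below uses sklearn.tree.export_text on a small tree fitted to synthetic data. The data, the feature names `f0` to `f3`, and the `demo_tree` variable are all made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

demo_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Hypothetical feature names; with the diabetes data you would pass the predictor names
feature_names = ["f0", "f1", "f2", "f3"]
rules = export_text(demo_tree, feature_names=feature_names)
print(rules)
```

Each indented line is one split condition, and each leaf line reports the predicted class.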