For our analysis, we have chosen a dataset from the medical sciences that lets us predict whether or not a patient has diabetes, based on the variables it captures. The data was sourced from the National Institute of Diabetes and Digestive and Kidney Diseases and includes predictor variables such as a patient’s BMI, pregnancy details, insulin level and age. Let’s dig right into solving this problem using a decision tree algorithm for classification.

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier #Import decision tree classifier
from sklearn.model_selection import train_test_split #import train_test_split function
from sklearn import metrics #import scikit learn metrics for accuracy calculation
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

In [3]:
pima = pd.read_csv('pima-indians-diabetes.csv', header = None, names = col_names)
pima

Unnamed: 0,pregnant,glucose,bp,skin,insulin,bmi,pedigree,age,label
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
...,...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.340,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1


After loading the data, we review its structure and variables, then decide which column is the target (the dependent variable) and which columns are the features (the independent variables).
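Before picking features, it can help to look at the shape, column types and class balance of the data. The snippet below is only a sketch: instead of the full `pima` DataFrame it uses a stand-in called `sample`, built from the first five rows shown in the output above, so the names `sample_rows`, `sample`, `label_counts` and `zero_counts` are ours and not part of the original notebook.

```python
import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# First five rows copied from the printed output above (a stand-in for the full file).
sample_rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
]
sample = pd.DataFrame(sample_rows, columns=col_names)

print(sample.shape)   # number of rows and columns
print(sample.dtypes)  # data type of each column

# How many patients fall into each class of the target column.
label_counts = sample['label'].value_counts()
print(label_counts)

# Count zeros in the measurement columns.
zero_counts = (sample[['glucose', 'bp', 'skin', 'insulin', 'bmi']] == 0).sum()
print(zero_counts)
```

The same calls should work unchanged on the full `pima` DataFrame. Zeros in columns such as `skin` and `insulin` deserve a second look, since they may stand for missing measurements rather than true values.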

In [4]:
#split the dataset into feature and target variables
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
x = pima[feature_cols] #Features
y = pima.label #Target variable

In [5]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.3, random_state=100) #70% Training and 30% Testing.

In [12]:
#create decision tree classifier object
clf = DecisionTreeClassifier(criterion = 'gini', random_state = 100, max_depth = None, min_samples_leaf = 1)
clf.fit(x_train, y_train)

DecisionTreeClassifier(random_state=100)
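The classifier above is created with `criterion = 'gini'`, which means candidate splits are scored by Gini impurity: 1 minus the sum of squared class proportions in a node. To make that concrete, here is a small hand-rolled sketch of the calculation. The helper names `gini_impurity` and `weighted_gini` and the toy label lists are ours for illustration; this is not scikit-learn's internal code.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by the share of samples in each child."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Toy node with three patients of each class.
parent = [0, 0, 0, 1, 1, 1]
print(gini_impurity(parent))                    # 0.5

# A split that separates the classes perfectly.
print(weighted_gini([0, 0, 0], [1, 1, 1]))      # 0.0

# A split that leaves both children mixed.
mixed_split = weighted_gini([0, 0, 1], [0, 1, 1])
print(round(mixed_split, 3))                    # 0.444
```

A lower weighted impurity means purer child nodes, so between these two toy splits the tree would prefer the first one.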

In [13]:
y_pred = clf.predict(x_test)
y_pred

array([0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0], dtype=int64)

In [14]:
print('Accuracy', metrics.accuracy_score(y_test, y_pred))

Accuracy 0.6406926406926406
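Accuracy is simply the share of test rows the model labels correctly, and it is easier to judge next to a naive baseline that always predicts the most common class. The sketch below shows both calculations on made-up labels. The helpers `accuracy` and `majority_baseline` and the lists `toy_true` and `toy_pred` are illustrative and not from the notebook.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Made-up labels: six negatives and four positives.
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

print(accuracy(toy_true, toy_pred))   # 0.6
print(majority_baseline(toy_true))    # 0.6
```

In the notebook, the same comparison could be made with `accuracy(list(y_test), list(y_pred))` and `majority_baseline(list(y_test))`. If the two numbers are close, the tree is adding little over guessing the majority class, and settings such as `max_depth` may be worth revisiting.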


In [15]:
from sklearn.tree import export_graphviz
import pydotplus
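#write the fitted tree to a Graphviz .dot file, labelling nodes with the feature names
export_graphviz(clf, out_file = 'clf_tree.dot', feature_names = feature_cols)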