Medical records of 270 patients have been provided in the file.

1) Find the variable importances using a decision tree classifier that predicts heart disease<br>
2-a) Train a decision tree model to predict heart disease using only the top 5 most important variables. Use the entire data set for training<br>
2-b) What is the accuracy of the model with 5-fold cross-validation?

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier

In [2]:
# Load data
heart_disease = pd.read_excel('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Tree-Based-Models-main/04_heart_disease.xlsx', sheet_name='data')
heart_disease.head()

Unnamed: 0,age,sex,chest_pain_type,BP,cholestrol,bloodsugarlevel,ECG_result,Max_heart_rate,Angina,oldpeak,slopepeak,major_vessels,thal,disease
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,1
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,0
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,1
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,0
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,0


In [3]:
# Fit the model
X = heart_disease.drop('disease', axis=1)
y = heart_disease['disease']

clf = DecisionTreeClassifier(criterion='gini')
clf = clf.fit(X,y)

In [4]:
# Feature importance
feature_imp = pd.Series(clf.feature_importances_, index=X.columns)
feature_imp.sort_values(ascending=False,inplace=True)
feature_imp

thal               0.279628
major_vessels      0.154112
cholestrol         0.087120
chest_pain_type    0.081865
age                0.074420
oldpeak            0.061400
BP                 0.054969
Max_heart_rate     0.053225
sex                0.046375
Angina             0.045635
slopepeak          0.035500
ECG_result         0.025750
bloodsugarlevel    0.000000
dtype: float64

In [5]:
# Top 5 features
top_5 = list(feature_imp.index[:5])
top_5

['thal', 'major_vessels', 'cholestrol', 'chest_pain_type', 'age']

In [6]:
# Train the model using only top 5 features
X = heart_disease[top_5]
y = heart_disease['disease']

clf2 = DecisionTreeClassifier(criterion='gini')
clf2 = clf2.fit(X,y)

In [7]:
# Accuracy (in percent) on the same data the model was trained on
clf2.score(X,y)*100

100.0

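Accuracy on the entire training set, without cross-validation, is 100%. The tree is grown with no limit on depth or split size, so it can memorize the training rows. This score measures fit to the training data rather than performance on unseen patients, and it is a sign of overfitting.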
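The gap between training accuracy and held-out accuracy can be checked directly with `cross_val_score`. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for the heart disease table (an assumption, since the Excel file is not bundled here), and the names `X_demo`, `y_demo`, and `tree` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data (assumption: 270 rows, 5 features).
X_demo, y_demo = make_classification(
    n_samples=270, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)

# Unconstrained tree, as in the cell above (seeded here for repeatability).
tree = DecisionTreeClassifier(criterion="gini", random_state=0)

# Accuracy on the same rows the tree was fitted on.
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy on held-out folds: each fold is scored by a tree that never saw it.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(tree, X_demo, y_demo, cv=folds, scoring="accuracy")
cv_acc = cv_scores.mean()

print(f"Training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

On the real data, the same two calls could be run with `X` and `y` from the cell above in place of `X_demo` and `y_demo`.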

In [8]:
# With cross validation
clf3 = DecisionTreeClassifier(random_state=42)

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Create parameter grid
params = {'criterion': ['gini'],'max_depth' : [2, 3, 5, 10, 20], 'min_samples_split' : [5, 10, 15, 20, 25, 30, 35, 40]}

# Create 5 fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create gridsearch object
gs = GridSearchCV(estimator=clf3, cv=folds, param_grid=params)

# Fit the model
gs.fit(X,y)

# Print best score
print('Best CV Score:', np.round(gs.best_score_*100, 2))

Best CV Score: 84.44


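With 5-fold cross-validation, the best parameter combination reaches a mean accuracy of about 84% on the held-out folds, compared with 100% on the training data. Note that `gs.best_score_` is a validation score, not a training score. Cross-validation does not itself reduce overfitting: it gives a more realistic estimate of how the model performs on unseen data, while the limits on `max_depth` and `min_samples_split` explored by the grid search are what constrain the tree. Because the top 5 features were chosen using all 270 records, this estimate is still likely to be somewhat optimistic.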
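One way to keep feature selection from seeing the validation folds is to put it inside a `Pipeline`, so that the top 5 features are re-selected on each training fold. The following is a minimal sketch under stated assumptions, not the method used in this notebook: the data is synthetic (`make_classification` with 13 features standing in for the 13 predictors), the grid is a shortened version of the one above, and the step names `select` and `tree` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data (assumption: 270 rows, 13 features).
X_demo, y_demo = make_classification(
    n_samples=270, n_features=13, n_informative=5, n_redundant=2,
    flip_y=0.1, random_state=0,
)

pipe = Pipeline([
    # Keep the 5 features with the highest tree-based importance.
    # threshold=-np.inf disables the importance cutoff so that only
    # max_features decides how many features are kept.
    ("select", SelectFromModel(
        DecisionTreeClassifier(random_state=0),
        threshold=-np.inf, max_features=5,
    )),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "tree__max_depth": [2, 3, 5, 10],
    "tree__min_samples_split": [5, 10, 20, 40],
}

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid=param_grid, cv=folds, scoring="accuracy")
search.fit(X_demo, y_demo)

n_selected = int(search.best_estimator_.named_steps["select"].get_support().sum())
print("Best parameters:", search.best_params_)
print("Mean CV accuracy (%):", np.round(search.best_score_ * 100, 2))
print("Features kept:", n_selected)
```

Even with this setup, `best_score_` is the score of the winning grid point and can still be slightly optimistic. A separate hold-out set or nested cross-validation would give a cleaner estimate.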