# Coronary Artery Disease (CAD)

Coronary artery disease is the narrowing or blockage of the coronary arteries. This condition is usually caused by atherosclerosis. Atherosclerosis is the build-up of cholesterol and fatty deposits (called plaques) inside the arteries.

This notebook analyses the CAD dataset and aims to predict the disease from the 13 features given in the dataset.
It is divided into 3 parts:
1. EDA
2. Feature Selection
3. Model building

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display_html
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.gridspec as gridspec
from sklearn import model_selection
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score


blue_red = ['#74a09e','#86c1b2','#98e2c6','#f3c969','#f2a553', '#d96548', '#c14953']
sns.palplot(sns.color_palette(blue_red))

# Set Style
sns.set_style("whitegrid")
sns.despine(left=True, bottom=True)

In [None]:
df = pd.read_csv('../input/coronary-artery-disease/Coronary_artery.csv')
print('Features:{}'.format(df.columns.tolist()))

# Dataset:
The dataset contains 13 independent features and 1 dependent feature (class)

In [None]:
df.head(5)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
print(df.shape)
print(df.isnull().values.any())

In [None]:
df2 = pd.read_csv('../input/coronary-artery-disease/data.csv')
df2.info()

Treatment of missing values, which are marked with a question mark (`?`) in this file:
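
As a minimal sketch of the same idea on a made-up frame (the `toy` data and the final numeric cast are assumptions for illustration, not part of the original notebook), the placeholder can be turned into `NaN`, the incomplete rows dropped, and the remaining text columns converted to numbers:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; '?' marks a missing value.
toy = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "ca": ["0", "3", "?", "1"],
    "thal": ["6", "?", "3", "7"],
})

# Turn the placeholder into NaN, then drop every row that contains one.
clean = toy.replace("?", np.nan).dropna(axis=0)

# Columns that held '?' were read as text; cast them back to numbers.
clean = clean.apply(pd.to_numeric)

print(clean)
print(clean.dtypes)
```

Only the two fully populated rows survive, and every column ends up with a numeric dtype.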

In [None]:
# Mask every '?' entry (it becomes NaN), then drop the rows that contain one
df2 = df2[~df2.isin(['?'])]
df2 = df2.dropna(axis=0)
df2.info()

In [None]:
df2.head()

In [None]:
correlation_mat = df2.corr()
plt.figure(figsize=(20,20))
sns.heatmap(correlation_mat, annot = True)
plt.show()

In [None]:
sns.pairplot(df2, height = 1.5, palette = 'rocket')

In [None]:
black_red = [
    '#1A1A1D', '#4E4E50', '#C5C6C7', '#6F2232', '#950740', '#C3073F'
]

# Exploratory Data Analysis

The count plots below show the distribution of disease stage (0, 1, 2, 3 and 4) with respect to four categorical features: gender, chest pain type, resting ECG results and defect type. Patients with asymptomatic chest pain have a higher chance of suffering from the disease, and patients with typical anginal pain have the lowest. Patients whose ECG report shows left ventricular hypertrophy also have a higher chance of disease.
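
The counts behind such plots can also be read as a table. The sketch below uses a small made-up frame (the values in `toy` are hypothetical, not taken from the dataset) to cross-tabulate chest pain type against disease stage and to compute the share of patients with any stage of disease per chest pain type:

```python
import pandas as pd

# Hypothetical toy records: chest pain type (1-4) and disease stage (0-4).
toy = pd.DataFrame({
    "cp":    [4, 4, 4, 3, 3, 2, 1, 4],
    "class": [1, 2, 0, 0, 0, 0, 0, 3],
})

# Rows: chest pain type; columns: disease stage; cells: patient counts.
counts = pd.crosstab(toy["cp"], toy["class"])

# Share of patients per chest pain type with any stage of disease (class > 0).
diseased_share = (toy["class"] > 0).groupby(toy["cp"]).mean()

print(counts)
print(diseased_share)
```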


In [None]:
fig = plt.figure(constrained_layout = True, figsize = (25,12))

#create grid

grid = gridspec.GridSpec(ncols = 4, nrows = 2, figure = fig)

ax1 = fig.add_subplot(grid[0, :2])
ax1.set_title('Gender Distribution')


sns.countplot(x = df['sex'],
             alpha = 0.9,
             hue = df['class'],
             ax = ax1,
             palette = 'rocket',
             order=df['sex'].value_counts().index)

ax1.legend()
plt.xticks(fontsize = 14)

ax2 = fig.add_subplot(grid[0, 2:])
ax2.set_title('Chest Pain Distribution')
sns.countplot(x = df['cp'],
             alpha = 0.9,
             hue = df['class'],
             ax = ax2, 
             palette = 'rocket',
             order=df['cp'].value_counts().index)
ax2.legend()
plt.xticks( fontsize = 14)

ax3 = fig.add_subplot(grid[1, :2])
ax3.set_title('Resting Electrocardiographic Results Distribution')
sns.countplot(x = df['restecg'],
             alpha = 0.9,
             hue = df['class'],
             ax = ax3, 
             palette = 'rocket',
             order=df['restecg'].value_counts().index)
ax3.legend()
plt.xticks(fontsize = 14)

ax4 = fig.add_subplot(grid[1, 2:])
ax4.set_title('Defect Type Distribution')
sns.countplot(x = df['thal'],
             alpha = 0.9,
             hue = df['class'],
             ax = ax4, 
             palette = 'rocket',
             order=df['thal'].value_counts().index)
ax4.legend()
plt.xticks(fontsize = 14)
plt.show()

Almost all patients are between 40 and 70 years of age, with the largest group aged 55 to 58. All patients have a cholesterol level below 300 mg/dL, most commonly around 220 mg/dL. The majority have a resting blood pressure (`trestbps`) between 120 and 140 mm Hg. The distribution of maximum heart rate achieved shows that most patients reach a maximum heart rate between 150 and 160.


In [None]:
fig = plt.figure(constrained_layout = True, figsize = (15,9))

#create grid

grid = gridspec.GridSpec(ncols = 1, nrows = 4, figure = fig)
ax1 = fig.add_subplot(grid[0, :])

sns.distplot(df.age, ax = ax1, color = blue_red[1])
ax1.set_title('Age Distribution')

ax2 = fig.add_subplot(grid[1, :])
sns.distplot(df.chol, ax = ax2, color = blue_red[2])
ax2.set_title('Cholesterol Distribution')


ax3 = fig.add_subplot(grid[2, :])
sns.distplot(df.trestbps, ax = ax3, color = blue_red[3])
ax3.set_title('Resting Blood Pressure Distribution')

ax4 = fig.add_subplot(grid[3, :])
sns.distplot(df.thalach, ax = ax4, color = blue_red[4])
ax4.set_title('Maximum Heart Rate Distribution')

The sunburst chart shows the distribution of disease stages with respect to gender (male or female) and chest pain type (typical angina, atypical angina, non-anginal pain and asymptomatic pain).

In [None]:
import plotly.express as px
fig = px.sunburst(data_frame = df,
                 path = [ 'sex','class','cp'],
                 color = 'class',
                 maxdepth = -1,
                 title = 'Sunburst Chart Sex > Class > Chest Pain Type')
fig.update_traces(textinfo = 'label+percent parent')
fig.update_layout(margin=dict(t=0, l=0, r=0, b=0))
fig.show()

In [None]:
print(df2.sex.unique())
print(df2.cp.unique())
print(df2.fbs.unique())
print(df2.restecg.unique())
print(df2.exang.unique())
print(df2.slope.unique())
print(df2.ca.unique())
print(df2.thal.unique())

# Feature Selection

The features were selected based on the p-values obtained from a GLM (Generalised Linear Model). Only features with a p-value < 0.05 were kept for further analysis and model building. On this criterion, 7 features (ca, cp, restecg, thalach, oldpeak, slope and thal) were selected for further processing and for building the models that predict the disease.
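
The filtering step itself can be sketched on synthetic data. Everything in the block below (the `toy` frame, the column names `signal` and `noise`, the coefficients) is made up for illustration; it only shows how p-values can be read from a fitted statsmodels GLM and thresholded at 0.05:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: 'signal' drives the label, 'noise' is unrelated to it.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
toy["label"] = 2.0 * toy["signal"] + rng.normal(scale=0.5, size=n)

# Fit a GLM (statsmodels uses a Gaussian family when none is given).
result = smf.glm(formula="label ~ signal + noise", data=toy).fit()

# Keep the terms whose p-value is below the 0.05 threshold.
pvalues = result.pvalues.drop("Intercept")
selected = pvalues[pvalues < 0.05].index.tolist()

print(pvalues)
print(selected)
```

A term that is unrelated to the label can still fall below 0.05 by chance, so the threshold is a screening rule rather than a guarantee.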


In [None]:
df_new = df2.rename(columns={'class': 'label'})
formula = 'label ~ age+sex+cp+trestbps+chol+fbs+restecg+thalach+exang+oldpeak+slope+ca+thal'
result = smf.glm(formula = formula, data=df_new).fit()
print(result.summary())

Of these 7 features, the categorical ones were then one-hot encoded: a binary (dummy) column was created for each category of each feature. After this step every categorical feature is represented by binary columns, so the models do not treat the category codes as ordered numeric values.
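
A small sketch of what the encoding does, using a made-up frame (the values in `toy` are hypothetical): the categorical column `cp` is expanded into one indicator column per category, while the continuous column `thalach` is left as it is.

```python
import pandas as pd

# Hypothetical rows: chest pain type (categorical) and max heart rate (numeric).
toy = pd.DataFrame({
    "cp": [1, 4, 3, 4],
    "thalach": [150, 108, 172, 129],
})

# One indicator column per category of 'cp'; 'thalach' passes through unchanged.
encoded = pd.get_dummies(toy, columns=["cp"])

print(encoded.columns.tolist())
print(encoded)
```

Each row has exactly one of the `cp_*` indicators set.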

In [None]:
X = df_new[['cp', 'restecg', 'thalach', 'oldpeak', 'slope', 'ca', 'thal']]
Y = df_new['label']
X = pd.get_dummies(X, columns=['cp', 'restecg', 'slope', 'ca', 'thal'])
X.head()

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=0.20, random_state=42)

# Decision Tree
Decision Trees are a type of Supervised Machine Learning in which the data is recursively split according to the value of a chosen feature.
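
To make the idea of a split concrete, here is a minimal sketch of how one split on one feature can be scored with Gini impurity (the default criterion of scikit-learn's `DecisionTreeClassifier`). The helper names `gini` and `split_impurity` and the toy arrays are assumptions for illustration; the library's own implementation is considerably more elaborate.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum of squared class shares."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def split_impurity(x, y, threshold):
    """Size-weighted Gini impurity of the two groups x <= t and x > t."""
    left, right = y[x <= threshold], y[x > threshold]
    n = len(y)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy feature and labels: low values are class 0, high values are class 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Candidate thresholds are the midpoints between consecutive sorted values.
candidates = (x[:-1] + x[1:]) / 2
best = min(candidates, key=lambda t: split_impurity(x, y, t))

print(best, split_impurity(x, y, best))
```

A tree repeats this search inside each resulting group until a stopping rule such as `max_depth` is reached.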

In [None]:
# Fit a decision tree, then evaluate it on the training set and on the test set
tree_model = DecisionTreeClassifier(max_depth = 25)
tree_model.fit(X_train, y_train)
y_pred_train_tree = tree_model.predict(X_train)
tree_confusion = metrics.confusion_matrix(y_train, y_pred_train_tree)
print('Confusion Matrix for Train:\n{}'.format(tree_confusion))
acc=accuracy_score(y_train, y_pred_train_tree) 
print('Train case accuracy is :'+ format(acc))
print('\n')
y_pred_test_tree = tree_model.predict(X_test)
tree_confusion = metrics.confusion_matrix(y_test, y_pred_test_tree)
print('Confusion Matrix for Test:\n{}'.format(tree_confusion))
acc= accuracy_score(y_test, y_pred_test_tree)
print('Test case accuracy is :'+ format(acc))
print('\n')
print(classification_report(y_test, y_pred_test_tree))

In [None]:
y_pred_prob_tree = tree_model.predict_proba(X_test)
import scikitplot as skplt
skplt.metrics.plot_roc(y_test, y_pred_prob_tree, figsize = (10, 10))
plt.show()

# Random Forest

Random forest is a flexible, easy to use machine learning algorithm that produces, even without hyper-parameter tuning, a great result most of the time. It is also one of the most used algorithms, because of its simplicity and diversity (it can be used for both classification and regression tasks). 
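
A random forest is an ensemble of decision trees. As a minimal sketch on synthetic data (the data from `make_classification` and the variable names are assumptions for illustration), the block below checks that the class probabilities of a scikit-learn forest equal the average of the probabilities of its individual trees:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=15, random_state=0).fit(X, y)

# Average the per-tree class probabilities by hand.
tree_probas = np.mean([tree.predict_proba(X) for tree in forest.estimators_],
                      axis=0)

# Compare with what the forest itself reports.
matches = np.allclose(tree_probas, forest.predict_proba(X))
print(matches)
```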


In [None]:
from sklearn.preprocessing import StandardScaler
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


In [None]:
from sklearn.ensemble import RandomForestClassifier
# Fitting Random Forest Classification to the Training set
classifier = RandomForestClassifier(n_estimators = 15, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)

y_pred_train = classifier.predict(X_train)

# Making the Confusion Matrix
random_confusion = metrics.confusion_matrix(y_train, y_pred_train)
print('Confusion Matrix for train:\n{}'.format(random_confusion))
acc=accuracy_score(y_train, y_pred_train)
print('Train case accuracy is :'+ format(acc))
print('\n')
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
random_confusion = metrics.confusion_matrix(y_test, y_pred)
print('Confusion Matrix for test:\n{}'.format(random_confusion))
acc=accuracy_score(y_test, y_pred)
print('Test case accuracy is :'+ format(acc))
print('\n')
print(classification_report(y_test, y_pred))

In [None]:
y_pred_prob_rf = classifier.predict_proba(X_test)
skplt.metrics.plot_roc(y_test, y_pred_prob_rf, figsize = (10, 10))
plt.show()

# SVM

“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems, although it is mostly used for classification. In the SVM algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the classes.
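
The separating hyperplane can be illustrated with a linear kernel on a tiny, made-up 2-D data set (the points and labels below are hypothetical). The sign of `decision_function` says on which side of the hyperplane a new point falls:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters in 2-D.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Signed score relative to the hyperplane: negative -> class 0, positive -> class 1.
new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
scores = svm.decision_function(new_points)
predictions = svm.predict(new_points)

print(predictions, scores)
print(svm.support_vectors_)  # the training points that define the margin
```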


In [None]:
from sklearn.svm import SVC 
# Note: despite the variable name, a polynomial kernel is used here
svm_model_linear = SVC(decision_function_shape='ovo',kernel = 'poly', C = 1, probability = True).fit(X_train, y_train) 
# kernel='poly' gave the best result among the kernels tried

svm_predictions_train = svm_model_linear.predict(X_train) 
cm1 = metrics.confusion_matrix(y_train, svm_predictions_train) 
acc1=accuracy_score(y_train, svm_predictions_train)
print('Confusion Matrix for train:\n{}'.format(cm1))
print('Train case accuracy is :'+ format(acc1))
print('\n')
svm_predictions = svm_model_linear.predict(X_test) 
# creating a confusion matrix  and finding accuracy
cm = metrics.confusion_matrix(y_test, svm_predictions) 
acc=accuracy_score(y_test, svm_predictions)
print('Confusion Matrix for test:\n{}'.format(cm))
print('Test case accuracy is :'+ format(acc))
print('\n')
print(classification_report(y_test, svm_predictions))

In [None]:
y_pred_prob_svm = svm_model_linear.predict_proba(X_test)
skplt.metrics.plot_roc(y_test, y_pred_prob_svm, figsize = (10, 10))
plt.show()

# KNN

K-Nearest Neighbour (K-NN) is one of the simplest supervised Machine Learning algorithms. It assumes that similar cases lie close together, so a new case is assigned to the category that is most common among the stored cases most similar to it. The algorithm stores all the available data and classifies each new data point by its similarity to that data, which means a new data point can be assigned to a suitable category as soon as it appears.
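
The voting rule can be sketched from scratch in a few lines. The function `knn_predict` and the toy points are assumptions for illustration (Euclidean distance, simple majority vote); scikit-learn's `KNeighborsClassifier` offers more options and faster neighbour search.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                  # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two made-up clusters with labels 0 and 1.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.2, 1.4])))
print(knn_predict(X_train, y_train, np.array([6.4, 6.3])))
```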


In [None]:
# training a KNN classifier 
from sklearn.neighbors import KNeighborsClassifier 
knn = KNeighborsClassifier(n_neighbors = 7).fit(X_train, y_train) 
knn_predictions_train = knn.predict(X_train) 
cm1 = metrics.confusion_matrix(y_train, knn_predictions_train) 
acc1=accuracy_score(y_train, knn_predictions_train)
print('Confusion Matrix for train:\n{}'.format(cm1))
print('Train case accuracy is :'+ format(acc1))
print('\n')
knn_predictions = knn.predict(X_test) 
# creating a confusion matrix  and finding accuracy
cm = metrics.confusion_matrix(y_test, knn_predictions) 
acc=accuracy_score(y_test, knn_predictions)
print('Confusion Matrix for test:\n{}'.format(cm))
print('Test case accuracy is :'+ format(acc))
print('\n')
print(classification_report(y_test, knn_predictions))

In [None]:
y_pred_prob_knn = knn.predict_proba(X_test)
skplt.metrics.plot_roc(y_test, y_pred_prob_knn, figsize = (10, 10))
plt.show()

# Logistic Regression

Logistic regression measures the relationship between a categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.
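
As a small worked example of the logistic function, the sketch below turns a linear score into a probability. The intercept, the coefficients and the input vector are made-up numbers, not values fitted on this dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model: an intercept and one coefficient per feature.
intercept = -4.0
coef = np.array([0.05, 1.2])

# Hypothetical patient: first feature 60, second feature 1.
x = np.array([60.0, 1.0])

z = intercept + coef @ x   # linear score: -4 + 0.05*60 + 1.2*1
p = sigmoid(z)             # estimated probability of the positive class
predicted = int(p >= 0.5)  # classify with a 0.5 threshold

print(z, p, predicted)
```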


In [None]:
from sklearn.linear_model import LogisticRegression

# instantiate model
logreg = LogisticRegression()

# fit model
logreg.fit(X_train, y_train)
y_pred_train = logreg.predict(X_train)
y_pred_class = logreg.predict(X_test)
confusion1 = metrics.confusion_matrix(y_train, y_pred_train)
acc=accuracy_score(y_train, y_pred_train)
confusion = metrics.confusion_matrix(y_test, y_pred_class)
acc1=accuracy_score(y_test, y_pred_class)
print(confusion1)
print('Train case accuracy is :'+ format(acc))
print('\n')
print(confusion)
print('Test case accuracy is :'+ format(acc1))
print('\n')
print(classification_report(y_test, y_pred_class))

In [None]:
y_pred_prob_log = logreg.predict_proba(X_test)
#plt.figure(figsize = (10, 10))
skplt.metrics.plot_roc(y_test, y_pred_prob_log, figsize = (10, 10))
plt.show()

# Neural Net

Architecturally, an artificial neural network is modelled using layers of artificial neurons, or computational units able to receive input and apply an activation function along with a threshold to determine if messages are passed along.
In a simple model, the first layer is the input layer, followed by one hidden layer, and lastly by an output layer. Each layer can contain one or more neurons.
Models can be made more complex, with greater capacity for abstraction and problem solving, by increasing the number of hidden layers, the number of neurons in any given layer, and/or the number of paths between neurons.
Model architecture and tuning are therefore major components of ANN techniques, in addition to the actual learning algorithms themselves. All of these characteristics of an ANN can have a significant impact on the performance of the model.
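
The layer structure described above can be sketched as a single forward pass in NumPy. The layer sizes, the random weights and the input batch are assumptions for illustration; this is not the Keras model trained in this notebook and it performs no learning.

```python
import numpy as np

def relu(z):
    """ReLU activation: passes positive values, blocks negative ones."""
    return np.maximum(z, 0.0)

def softmax(z):
    """Row-wise softmax; subtracting the row maximum keeps exp() stable."""
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 13, 8, 5

# Randomly initialised weights and zero biases for two layers.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

# A batch of 4 made-up input rows.
X = rng.normal(size=(4, n_features))

hidden = relu(X @ W1 + b1)          # input layer -> hidden layer
probs = softmax(hidden @ W2 + b2)   # hidden layer -> output probabilities

print(probs.shape)
print(probs.sum(axis=1))
```

Each output row is a probability distribution over the 5 classes; training adjusts `W1`, `b1`, `W2` and `b2` so that these probabilities match the labels.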


In [None]:
X= df2[['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach','exang', 'oldpeak', 'slope', 'ca', 'thal']]
y= df2['class']
X = np.asarray(X).astype('float32')
y = np.asarray(y).astype('float32')
# Split the dataset using an 80:20 split
X_train1, X_test1, y_train1, y_test1 = model_selection.train_test_split(X, y, test_size=0.20, random_state=0)

# Check the shape of each variable; remember that X must be a matrix and y a vector
X_train1.shape, y_train1.shape, X_test1.shape, y_test1.shape

In [None]:
from keras.utils import to_categorical

Y_train1 = to_categorical(y_train1, num_classes=None)
Y_test1 = to_categorical(y_test1, num_classes=None)
print(Y_train1.shape)
print(Y_train1[:10])

In [None]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam
model = Sequential()
model.add(Dense(16, input_dim=13, kernel_initializer='normal', activation='relu'))
model.add(Dense(10, kernel_initializer='normal', activation='relu'))
model.add(Dense(8, kernel_initializer='normal', activation='relu'))
model.add(Dense(5, activation='softmax'))
#compiling model
adam = Adam(learning_rate = 0.001)
model.compile(loss="categorical_crossentropy", optimizer=adam, metrics=["accuracy"])

In [None]:
estop = EarlyStopping(patience=10, mode='min', min_delta=0.001, monitor='val_loss')

history = model.fit(X_train1, Y_train1, validation_data=(X_test1, Y_test1), epochs=100, batch_size=10, verbose = 1, callbacks = [estop])

In [None]:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('MODEL ACCURACY')
plt.ylabel('Accuracy')
plt.xlabel('No. of epochs')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:
probas = model.predict(X_test1, batch_size=10)  # softmax outputs are the class probabilities
skplt.metrics.plot_roc(y_test1, probas)

# Binary NN

Here the target is collapsed to two classes, 0 (no disease) and 1 (disease present at any stage), and the network outputs the probability that the disease is present. This model gives the best accuracy for predicting the presence of disease.
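
A minimal sketch of the two steps involved, collapsing the stage labels and thresholding a sigmoid output. The `stages` array and the `sigmoid_outputs` values are made up for illustration and are not results from the model.

```python
import numpy as np

# Hypothetical stage labels (0 = no disease, 1-4 = increasing severity).
stages = np.array([0, 2, 0, 1, 4, 3, 0])

# Collapse to binary: any stage above 0 counts as disease present.
binary = (stages > 0).astype(int)

# Made-up sigmoid outputs, one probability of disease per patient.
sigmoid_outputs = np.array([0.08, 0.91, 0.40, 0.62, 0.77, 0.35, 0.12])

# Predict 1 when the probability reaches 0.5, otherwise 0.
predicted = (sigmoid_outputs >= 0.5).astype(int)
accuracy = (predicted == binary).mean()

print(binary)
print(predicted)
print(accuracy)
```

An explicit comparison is used here instead of `np.round` because NumPy rounds exact halves to the nearest even value, so `np.round(0.5)` gives 0.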

In [None]:
Y_train_binary = y_train1.copy()
Y_test_binary = y_test1.copy()

Y_train_binary[Y_train_binary > 0] = 1
Y_test_binary[Y_test_binary > 0] = 1

print(Y_train_binary[:20])

In [None]:
model2 = Sequential()
model2.add(Dense(8, input_dim=13, kernel_initializer='normal', activation='relu'))
model2.add(Dense(4, kernel_initializer='normal', activation='relu'))
model2.add(Dense(1, activation='sigmoid'))
#compiling model
adam = Adam(learning_rate = 0.001)
model2.compile(loss="binary_crossentropy", optimizer=adam, metrics=["accuracy"])

In [None]:
history2 = model2.fit(X_train1, Y_train_binary, validation_data=(X_test1, Y_test_binary), epochs=100, batch_size=10, verbose = 1, callbacks = [estop])

In [None]:
plt.plot(history2.history['accuracy'])
plt.plot(history2.history['val_accuracy'])
plt.title('MODEL ACCURACY')
plt.ylabel('Accuracy')
plt.xlabel('No. of epochs')
plt.legend(['train', 'val'], loc='upper left')
plt.show()

In [None]:


categorical_pred = np.argmax(model.predict(X_test1), axis=1)

print('Results for Categorical Model')
print(accuracy_score(y_test1, categorical_pred))
print(classification_report(y_test1, categorical_pred))

In [None]:
binary_pred = np.round(model2.predict(X_test1)).astype(int)

print('Results for Binary Model')
print(accuracy_score(Y_test_binary, binary_pred))
print(classification_report(Y_test_binary, binary_pred))