## Predicting Heart Disease with Classification Machine Learning Algorithms

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt 
import seaborn as sns

In [None]:
df = pd.read_csv('../input/heart-disease-uci/heart.csv')
df.head()

Let's look at the descriptions of different features:
1. age (#)
2. (sex) 1 = male, 0 = female (Binary)
3. (cp) chest pain type (4 values, Ordinal): Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
4. (trestbps) resting blood pressure (#)
5. (chol) serum cholesterol in mg/dl (#)
6. (fbs) fasting blood sugar > 120 mg/dl (Binary) (1 = true; 0 = false)
7. (restecg) resting electrocardiography results (values 0, 1, 2)
8. (thalach) maximum heart rate achieved (#)
9. (exang) exercise-induced angina (Binary) (1 = yes; 0 = no)
10. (oldpeak) ST depression induced by exercise relative to rest (#)
11. (slope) slope of the peak exercise ST segment (Ordinal) (Value 1: upsloping, Value 2: flat, Value 3: downsloping)
12. (ca) number of major vessels colored by fluoroscopy (0–3, Ordinal)
13. (thal) thallium stress test result (Ordinal): 3 = normal; 6 = fixed defect; 7 = reversible defect

In [None]:
df.describe()

Our dataset has three types of features:
1. Continuous (#): quantitative data that can be measured: age, trestbps, chol, thalach, oldpeak
2. Ordinal: categorical data that has an order to it (0, 1, 2, 3, etc.): cp, restecg, slope, ca, thal
3. Binary: data that can take only two possible states (0 and 1): sex, fbs, exang
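
The grouping above can also be checked programmatically. The following is a minimal sketch, not part of the original analysis: `group_features` and its `max_levels` threshold are hypothetical, and the rows are made up to mimic three of the columns. It sorts columns by their number of distinct values, which is only a rough heuristic.

```python
import pandas as pd

def group_features(frame, max_levels=5):
    """Split columns into binary, ordinal and continuous by distinct-value count.

    `max_levels` is an assumed cut-off, not a rule taken from the dataset.
    """
    groups = {"binary": [], "ordinal": [], "continuous": []}
    for col in frame.columns:
        n_levels = frame[col].nunique()
        if n_levels <= 2:
            groups["binary"].append(col)
        elif n_levels <= max_levels:
            groups["ordinal"].append(col)
        else:
            groups["continuous"].append(col)
    return groups

# Made-up rows that mimic three of the columns described above
toy = pd.DataFrame({
    "sex": [1, 0, 1, 1, 0, 1, 0, 1],
    "cp": [0, 1, 2, 3, 0, 2, 1, 3],
    "chol": [233, 250, 204, 236, 354, 192, 294, 263],
})
feature_groups = group_features(toy)
print(feature_groups)
```

On the real data, the result should still be compared against the feature descriptions, since a continuous column with few distinct values would be misclassified by this heuristic.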

In [None]:
df.dtypes

## Numerical Features

### Distribution

In [None]:
plt.hist(df['age'], bins = [20,30,40,50,60,70,80], edgecolor = 'black')
plt.title('Age')

In [None]:
plt.hist(df['trestbps'], bins = [90,100,110,120,130,140,150,160,170,180,190,200], edgecolor = 'black')
plt.title('Resting Blood Pressure')

In [None]:
plt.hist(df['chol'], bins = 7, edgecolor = 'black')
plt.title('Cholesterol')

In [None]:
plt.hist(df['thalach'], bins = [70,80,90,100,110,120,130,140,150,160,170,180,190,200], edgecolor = 'black')
plt.title('Max Heart Rate')

In [None]:
plt.hist(df['oldpeak'], bins = 5, edgecolor = 'black')
plt.title('ST Depression')

### Relationship between different numerical features

In [None]:
plt.scatter(df['age'],df['trestbps'], s=30, c = '#b6eb7a', edgecolor = 'green', linewidth = 1, alpha = 0.8)
plt.xlabel('Age')
plt.ylabel('Resting Blood Pressure')
plt.title('Age vs RBP')

In [None]:
plt.scatter(df['age'],df['chol'], s=30, c = '#9bdeac', edgecolor = 'green', linewidth = 1, alpha = 0.8)
plt.xlabel('Age')
plt.ylabel('Cholesterol')
plt.title('Age vs Cholesterol')

In [None]:
plt.scatter(df['age'],df['thalach'], s=30, c = '#b6eb7a', edgecolor = 'green', linewidth = 1, alpha = 0.8)
plt.xlabel('Age')
plt.ylabel('Max Heart Rate')
plt.title('Age vs Max Heart Rate')

In [None]:
plt.scatter(df['age'],df['oldpeak'], s=30, c = '#a8df65', edgecolor = 'green', linewidth = 1, alpha = 0.8)
plt.xlabel('Age')
plt.ylabel('ST depression')
plt.title('Age vs ST Depression')

There doesn't seem to be much of a relationship between Age and the other numerical features.
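
A visual impression like this can be backed up with a number. The sketch below, assuming a hypothetical helper `corr_with` and made-up values, computes the Pearson correlation of every other column with one anchor column. Values near 0 would support the reading of the scatter plots, while values near 1 or -1 would indicate a strong linear relationship.

```python
import pandas as pd

def corr_with(frame, anchor):
    """Pearson correlation of every other column with the `anchor` column."""
    return frame.corr()[anchor].drop(anchor)

# Made-up values chosen so the expected correlations are obvious
toy = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "rises_with_age": [1.0, 2.0, 3.0, 4.0],
    "falls_with_age": [8.0, 6.0, 4.0, 2.0],
})
age_corr = corr_with(toy, "age")
print(age_corr)
```

On the notebook's data, the equivalent call would be something like `corr_with(df[['age', 'trestbps', 'chol', 'thalach', 'oldpeak']], 'age')`.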

### KDE Plots

In [None]:
sns.jointplot(x=df['chol'], y=df['trestbps'], data=df, kind="kde")

In [None]:
sns.jointplot(x=df['thalach'], y=df['chol'], data=df)

In [None]:
f, ax = plt.subplots(figsize=(6, 6))
# seaborn >= 0.11 expects the variables as keywords (x=, y=)
sns.kdeplot(x=df['chol'], y=df['oldpeak'], ax=ax)
sns.rugplot(x=df['chol'], color="g", ax=ax)
sns.rugplot(y=df['oldpeak'], ax=ax);

In [None]:
plt.scatter(df['trestbps'],df['thalach'], s=30, c = '#e2979c', edgecolor = 'red', linewidth = 1, alpha = 0.9)
plt.xlabel('Resting Blood Pressure')
plt.ylabel('Max Heart Rate')
plt.title('RBP vs Max Heart Rate')

In [None]:
plt.scatter(df['trestbps'],df['oldpeak'], s=30, c = '#c70039', edgecolor = 'red', linewidth = 1, alpha = 0.8)
plt.xlabel('Resting Blood Pressure')
plt.ylabel('ST Depression')
plt.title('RBP vs ST Depression')

In [None]:
plt.scatter(df['thalach'],df['oldpeak'], s=40, c = '#ffbd69', edgecolor = 'orange', linewidth = 1, alpha = 1)
plt.xlabel('Max Heart Rate')
plt.ylabel('ST Depression')
plt.title('Max Heart Rate vs ST Depression')

Overall, none of the numerical features closely tracks another, which is good for our model because each one contributes separate information.

### Analysing numerical features w.r.t Target

In [None]:
X = df[['age','trestbps','chol','thalach','oldpeak']]
y = df['target']

In [None]:
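sns.countplot(x=y)
# Look the counts up by label (0 = no disease, 1 = disease) instead of relying on sort order
counts = y.value_counts()
print('Number of Patients not diagnosed with Heart Disease:', counts[0])
print('Number of Patients diagnosed with Heart Disease:', counts[1])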

The two target classes are reasonably balanced.<br>
Now, let's standardize the numerical features and reshape the data into long format, so that the target stays as the identifier and each feature name and its value become one row.

In [None]:
data = X
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std.iloc[:,:]], axis=1)
data = pd.melt(data, id_vars = 'target', var_name = 'features',
                value_name = 'value')
data.head()
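
To make the reshaping step easier to follow, here is a small sketch of the same standardize-then-melt pattern on three made-up rows. The column names match the dataset, but the values and the `toy` and `long_df` names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "target": [1, 0, 1],
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
})
features = toy[["age", "chol"]]
# z-score each column: subtract its mean, divide by its sample standard deviation
features_std = (features - features.mean()) / features.std()
# Wide to long: one row per (patient, feature) pair, with target kept as identifier
long_df = pd.melt(
    pd.concat([toy["target"], features_std], axis=1),
    id_vars="target", var_name="features", value_name="value",
)
print(long_df)
```

Three patients and two features give six rows, which is the shape the violin, box and swarm plots expect for `x='features'`, `y='value'` and `hue='target'`.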

In [None]:
plt.figure(figsize = (8,8))
sns.violinplot(x = 'features', y = 'value', hue = 'target', data = data, split = True, inner = 'quart', color='g')

In [None]:
plt.figure(figsize=(8,8))
sns.boxplot(x = 'features' , y='value', hue='target', data = data, color = 'r')

In [None]:
plt.figure(figsize = (8,8))
sns.swarmplot(x = 'features', y = 'value', hue = 'target', data = data)

thalach and oldpeak, and to some extent age, are good indicators of the target variable.

## Categorical Features

In [None]:
sns.countplot(x="sex", data=df,hue='target')
male, fm = df['sex'].value_counts()
print('Number of Female Patients:', fm)
print('Number of Male Patients:', male)

Relative to their numbers, a higher proportion of female patients than male patients had heart disease.
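
Counts and proportions can tell different stories when the groups differ in size. The sketch below uses made-up rows (not the notebook's data) to show how `pd.crosstab` with `normalize='index'` turns counts into per-group rates.

```python
import pandas as pd

# Made-up rows: 4 patients with sex=0 and 10 patients with sex=1
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})
counts = pd.crosstab(toy["sex"], toy["target"])                      # raw counts
rates = pd.crosstab(toy["sex"], toy["target"], normalize="index")    # row-wise shares
print(counts)
print(rates)
```

In this toy example the larger group has more positive cases in absolute terms (4 versus 3) but a lower rate (0.40 versus 0.75), which is why a count plot and a point plot of the mean can point in different directions.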

In [None]:
sns.countplot(x="fbs", data=df,hue='target')
fbsno, fbsyes = df['fbs'].value_counts()
print('Fasting Blood Sugar > 120 :', fbsyes)
print('Fasting Blood Sugar <= 120:', fbsno)

In [None]:
sns.countplot(x="exang", data=df,hue='target')
no, yes = df['exang'].value_counts()
print('Exercise Induced Angina Yes: ', yes)
print('Exercise Induced Angina No:', no)

Most patients with the disease had no exercise-induced angina.

In [None]:
sns.catplot(x='exang',y='target',data=df,kind='point', hue = 'sex', color = '#e7305b')

As seen previously, more patients with the disease had no exercise-induced angina. The plot also confirms that a higher share of female patients had the disease.

In [None]:
sns.catplot(x='fbs',y='target',data=df,kind='point', hue = 'sex', color = '#436f8a')

There are more disease-free females with FBS < 120 as compared to males in that category

In [None]:
sns.catplot(x='fbs',y='target',data=df,kind='point', hue = 'exang', color = '#79d70f')

The majority of people with heart disease had FBS > 120 and no exercise-induced angina.

In [None]:
sns.catplot(x = 'target',y='oldpeak',data=df,kind='violin',hue='sex', palette=sns.color_palette(['#ffdcb4', '#c060a1']))

We can see that the overall shape and distribution for negative and positive patients differ vastly. <br>Positive patients exhibit a lower median ST depression level, and most of their data lies between 0 and 2, while negative patients lie mostly between 1 and 3. In addition, we don't see many differences between male and female target outcomes.

In [None]:
sns.catplot(x = 'target',y='thalach',data=df,kind='box',hue='exang', palette=sns.color_palette(['#162447', '#74d4c0']))

Positive patients exhibit a higher median Max Heart Rate, while negative patients have lower levels.

In [None]:
sns.countplot(x="cp", data=df,hue='target')

In [None]:
sns.countplot(x="restecg", data=df,hue='target')

In [None]:
sns.countplot(x="slope", data=df,hue='target')

In [None]:
sns.countplot(x="ca", data=df,hue='target')

In [None]:
sns.countplot(x="thal", data=df,hue='target')

In [None]:
sns.catplot(x='cp',y='target',data=df,kind='point', color = 'g')

Here we see that patients with no or low chest pain (0) very rarely have the disease, which makes sense, since a greater amount of chest pain suggests a greater chance of heart disease.

In [None]:
sns.catplot(x='restecg',y='target',data=df,kind='point', color = 'm' )

People with a high resting ECG value (~2) show a lower chance of having the disease.

In [None]:
sns.catplot(x='slope',y='target',data=df,kind='point', color = '#ffa5b0')

More positive patients had a peak exercise ST segment slope equal to 2.

In [None]:
sns.catplot(x='ca',y='target',data=df,kind='point',  color = '#1b6ca8')

### Correlation

In [None]:
plt.figure( figsize = (10,10))
sns.heatmap(df.corr(), annot = True)

<p>
There is a moderate positive correlation between the target variable and 'cp', 'thalach' and 'slope'. As a person's chest pain type, max. heart rate or peak exercise ST segment slope increases, their chances of having heart disease also increase.
</p>
<p>
On the other hand, there is a moderate negative correlation between the target variable and 'exang' (exercise-induced angina), 'oldpeak' (ST depression induced by exercise relative to rest), 'ca' (number of major vessels) and 'thal' (thallium stress test result).
</p>

### Apply categorical encoding

In [None]:
a = pd.get_dummies(df['cp'], prefix = "cp")
b = pd.get_dummies(df['thal'], prefix = "thal")
c = pd.get_dummies(df['slope'], prefix = "slope")

In [None]:
df = pd.concat([df, a, b, c], axis = 1)
df = df.drop(columns = ['cp', 'thal', 'slope'])
df.head()
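
The same encoding can be done in a single call. This is a sketch on a made-up frame, offered as an alternative rather than a correction: passing `columns=` to `pd.get_dummies` encodes the listed columns and drops the originals, so the separate `concat` and `drop` steps are not needed.

```python
import pandas as pd

# Made-up rows with one categorical and one numerical column
toy = pd.DataFrame({"cp": [0, 2, 1, 2], "oldpeak": [2.3, 0.0, 1.4, 0.8]})
# columns= encodes the listed columns and removes the originals in one step
encoded = pd.get_dummies(toy, columns=["cp"], prefix="cp")
print(encoded.columns.tolist())
```

Each row ends up with exactly one of `cp_0`, `cp_1`, `cp_2` set, while `oldpeak` is left untouched.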

## Feature Selection and Data Preparation

In [None]:
X = df.drop(columns = ['chol','fbs','age','sex','trestbps','restecg','target'])
y = df['target']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
print('Training set shape: ', X_train.shape, y_train.shape)
print('Testing set shape: ', X_test.shape, y_test.shape)

In [None]:
# Function definition for fitting data
def model_fit(model,X, y,test):
    model.fit(X,y)
    y_pred = model.predict(test)
    return y_pred

In [None]:
# Function for calculating accuracy
from sklearn.metrics import accuracy_score
def accuracy(Y, y):
    return accuracy_score(Y,y)

In [None]:
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training split only, then reuse its statistics for the test split
scaler = StandardScaler().fit(X_train.astype(float))
X_train = scaler.transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))
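
A common convention is to learn the scaling statistics from the training split alone and apply them unchanged to the test split, so that both splits are on the same scale and no test information influences preprocessing. A minimal sketch with made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up splits: three training rows and one test row
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler().fit(train)    # mean and std come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)    # reuses the training mean and std
print(scaler.mean_)
print(test_scaled)
```

The test row is expressed in units of the training distribution, so a test value equal to the training mean maps to 0.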

In [None]:
model_accuracy = {}
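
A dictionary like this can also be filled in a loop, which avoids repeating the fit and score steps for every model. The sketch below uses synthetic data from `make_classification` and three default-configured models; `candidates` and `demo_accuracy` are illustrative names, and the scores it prints say nothing about the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the heart disease dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

candidates = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Naive Bayes": GaussianNB(),
}
demo_accuracy = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # score() returns mean accuracy for classifiers
    demo_accuracy[name] = model.score(X_te, y_te)
print(demo_accuracy)
```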

## Classification Models

### K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn

In [None]:
y_knn = model_fit(knn, X_train, y_train, X_test)
knn_acc = accuracy(y_test, y_knn)

In [None]:
print('Test accuracy: ', knn_acc)

In [None]:
x = [0]
mean_acc = np.zeros(20)
mean_acc_train = np.zeros(20)
for i in range(1,21):
    #Train Model and Predict  
    knn = KNeighborsClassifier(n_neighbors = i).fit(X_train,y_train)
    yhat= knn.predict(X_test)
    yhat2 = knn.predict(X_train)
    mean_acc[i-1] = accuracy_score(y_test, yhat)
    mean_acc_train[i-1] = accuracy_score(y_train, yhat2)
    x.append(i)

In [None]:
plt.figure(figsize = (8,6))
plt.plot(np.arange(1,21), mean_acc, label = 'Test')
plt.plot(np.arange(1,21), mean_acc_train, label = 'Train')
plt.title('Test vs Train')
plt.xticks(np.arange(min(x), max(x)+1, 1.0))
plt.legend()

In [None]:
y_knn = model_fit(KNeighborsClassifier(n_neighbors = 13), X_train, y_train, X_test)
model_accuracy['KNN'] = accuracy(y_test, y_knn)

In [None]:
X = StandardScaler().fit(X).transform(X.astype(float))

In [None]:
from sklearn.model_selection import cross_val_score
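# Cross-validate the chosen k = 13 (the loop above leaves `knn` at n_neighbors = 20)
scores = cross_val_score(KNeighborsClassifier(n_neighbors = 13), X , y, cv = 5)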
scores

In [None]:
scores.mean()
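
One way to choose the number of neighbours without consulting the test set is to cross-validate a pipeline, so the scaler is refitted inside every fold. This is a sketch on synthetic data under that assumption, not the procedure used in this notebook; the parameter name `kneighborsclassifier__n_neighbors` follows the lowercase step names that `make_pipeline` generates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the heart disease dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Scaling happens inside each CV fold, so validation rows never shape the scaler
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 21))}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(best_k, search.best_score_)
```

The held-out test set then stays untouched until the final evaluation of the selected model.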

### Support Vector Machine

In [None]:
from sklearn import svm
clf2 = svm.SVC(C=1, kernel = 'rbf', gamma = 'auto')
clf2

In [None]:
y_svm = model_fit(clf2, X_train, y_train, X_test)
y2 = model_fit(clf2, X_train, y_train, X_train)
svm_acc = accuracy(y_test, y_svm)
svm2 = accuracy(y_train, y2)
model_accuracy['SVM'] = svm_acc
print('Train accuracy: ', svm2)
print('Test accuracy: ', svm_acc)

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf2, X , y, cv = 5)
scores

In [None]:
scores.mean()

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
clf3 = DecisionTreeClassifier(random_state=42,criterion = 'entropy', max_depth = 3)
clf3

In [None]:
y_tree = model_fit(clf3,X_train,y_train, X_test)
y_tree2 = model_fit(clf3,X_train,y_train, X_train)

In [None]:
print("Train score: ", accuracy(y_train,y_tree2)," Test score: ",accuracy(y_test,y_tree))

In [None]:
scores = cross_val_score(clf3, X , y, cv = 5)
scores

In [None]:
scores.mean()

In [None]:
model_accuracy['Decision Tree'] = accuracy(y_test,y_tree)

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf4 = RandomForestClassifier(random_state = 42, max_depth = 4, criterion = 'entropy')
clf4

In [None]:
y_rf = model_fit(clf4,X_train,y_train, X_test)
y_rf2 = model_fit(clf4,X_train,y_train, X_train)
print("Train score: ", accuracy(y_train,y_rf2)," Test score: ",accuracy(y_test,y_rf))

In [None]:
scores = cross_val_score(clf4, X , y, cv = 5)
scores

In [None]:
scores.mean()

In [None]:
model_accuracy['Random Forest'] = accuracy(y_test,y_rf)

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
clf5 = LogisticRegression(C = 0.1, solver = 'newton-cg')
clf5

In [None]:
y_lr = model_fit(clf5,X_train,y_train, X_test)
y_lr2 = model_fit(clf5,X_train,y_train, X_train)
print("Train score: ", accuracy(y_train,y_lr2)," Test score: ",accuracy(y_test,y_lr))

In [None]:
scores = cross_val_score(clf5, X , y, cv = 5)
scores

In [None]:
scores.mean()

In [None]:
model_accuracy['Logistic Regression'] = accuracy(y_test,y_lr)

### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
clf6 = GaussianNB()
clf6

In [None]:
y_nb = model_fit(clf6,X_train,y_train, X_test)
y_nb2 = model_fit(clf6,X_train,y_train, X_train)
print("Train score: ", accuracy(y_train,y_nb2)," Test score: ",accuracy(y_test,y_nb))

In [None]:
scores = cross_val_score(clf6, X , y, cv = 5)
scores

In [None]:
scores.mean()

In [None]:
model_accuracy['Naive Bayes'] = accuracy(y_test,y_nb)

### XGBoost

In [None]:
from xgboost import XGBClassifier
clf7 = XGBClassifier(random_state=42, max_depth = 3, learning_rate = 0.01, n_estimators = 200)
clf7

In [None]:
y_xg = model_fit(clf7,X_train,y_train, X_test)
y_xg2 = model_fit(clf7,X_train,y_train, X_train)
print("Train score: ", accuracy(y_train,y_xg2)," Test score: ",accuracy(y_test,y_xg))

In [None]:
scores = cross_val_score(clf7, X , y, cv = 5)
scores

In [None]:
scores.mean()

In [None]:
model_accuracy['XGBoost'] = accuracy(y_test,y_xg)

## Comparing Models

In [None]:
plt.figure(figsize=(15,8))
plt.bar(model_accuracy.keys(),model_accuracy.values(), color = ['#87dfd6','#a6dcef','#ddf3f5','#111d5e','#111d5e','#a6dcef','#40bad5'])
plt.ylabel("Accuracy")
plt.xlabel("Classification Algorithm")
plt.show()

We see that all models perform relatively well, but we get the highest accuracy, around <strong>85.5%</strong>, with <strong><em>Random Forest</em></strong> and <strong><em>Logistic Regression</em></strong>.
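
Accuracy treats both kinds of mistakes equally, whereas in a medical setting a missed case (false negative) may matter more than a false alarm. As a possible complement, the sketch below computes a confusion matrix, precision and recall on made-up labels. It does not use predictions from the models above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions (1 = disease, 0 = no disease)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)  # tp / (tp + fp)
rec = recall_score(y_true, y_pred)      # tp / (tp + fn)
print(cm)
print(acc, prec, rec)
```

With the notebook's variables, the same calls would take `y_test` and a model's predictions such as `y_rf` or `y_lr`.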

I'm a beginner in ML and have recently started analysing different datasets, so I would very much appreciate any sort of feedback on this notebook. Thanks! 