# Predicting the presence of heart disease using ML classification models.

The dataset has 14 columns (13 features plus the target) and 303 rows, so there are 303 examples in total. The column names are all abbreviated, and I always find it easier to work with data I am familiar with, so let's first look at what each column represents.

The heart disease dataset contains the following features:
1. Age of the person 
2. Sex or the gender of the person 
3. Type of chest pain, represented by 4 values (0, 1, 2 and 3)
4. Resting blood pressure
5. Serum cholesterol, a combined measurement that includes HDL and LDL (high- and low-density lipoproteins). HDL is often described as good cholesterol and is associated with a lower risk of heart disease, whereas LDL is seen as bad cholesterol and is associated with a higher risk of heart disease and increased plaque formation in the blood vessels and arteries.
6. Fasting blood sugar, a diabetes indicator that is considered a risk factor when it is above 120 mg/dl.
7. Resting electrocardiographic results, which measure the electrical activity of the heart. An ECG can reveal irregular heart rhythms, abnormally slow heart rhythms, evidence of an evolving or acute heart attack, etc.
8. Maximum heart rate achieved, i.e. the highest number of beats per minute recorded for the person. This is a measured value; (220 - age of the person) is only a common rule of thumb for estimating a person's maximum heart rate.
9. Exercise-induced angina (AP), a common concern among cardiac patients. Angina is usually stable but can be triggered by physical activity, especially in cold conditions.
10. Oldpeak, the ST depression induced by exercise relative to rest. ST depression occurs when the J point is displaced below the baseline. Not all ST depressions represent an emergency condition.
11. The slope of the peak exercise ST segment
12. The number of major vessels
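13. Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect (these are the codes documented for the original UCI data; the CSV used here may encode them differently, so check the values in the column)
14. Target, which tells us whether the person has heart disease (1) or not (0).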
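Since the column names are abbreviated, it can help to keep the list above next to the names themselves. The mapping below is a small sketch of that idea. The column names are an assumption based on the usual layout of this CSV (most of them appear in the plotting code later on), so verify them against df.columns once the data is loaded.

```python
# Assumed column names for heart.csv, paired with the descriptions from the
# list above. Verify against df.columns after loading the data.
COLUMN_DESCRIPTIONS = {
    "age": "age of the person",
    "sex": "sex of the person",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol",
    "fbs": "fasting blood sugar above 120 mg/dl",
    "restecg": "resting electrocardiographic results",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise-induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels",
    "thal": "normal / fixed defect / reversible defect",
    "target": "heart disease present (1) or not (0)",
}

# Print a compact legend that can be kept next to the plots.
for name, description in COLUMN_DESCRIPTIONS.items():
    print(f"{name:>8}: {description}")
```

Once df is loaded, comparing set(COLUMN_DESCRIPTIONS) with set(df.columns) is a quick way to confirm the assumed names.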

Import the necessary modules, then read and display the data.

In [None]:
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Models, preprocessing and metrics (unused imports removed)
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

In [None]:
df=pd.read_csv('../input/heart-disease-uci/heart.csv')
df.head()

The shape of a dataset is simply rows x columns, so let's check the dimensions of ours. Let's also check for any NA or null values.

In [None]:
df.shape

In [None]:
df.isna().sum()

# Data visualization using matplotlib and seaborn

The heatmap shows how the variables in our dataset relate to each other, using both a color scheme and the numerical correlation coefficients. Values close to 1 indicate a strong positive correlation, values close to -1 indicate a strong negative (inverse) correlation, and values near 0 indicate little or no linear relationship between the two variables. df.corr() computes the pairwise correlation of all columns in the dataframe.

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(),annot=True,linewidths=0.2,cmap='coolwarm')
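The heatmap can be hard to scan when all we want is the relationship with the target. A minimal sketch of ranking features by their correlation with the target is shown below. It runs on a small synthetic frame (the column names pos_feature, neg_feature and noise are made up for illustration), not on the heart disease data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataframe (illustrative only).
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "pos_feature": target + rng.normal(0, 0.5, size=n),   # rises with target
    "neg_feature": -target + rng.normal(0, 0.5, size=n),  # falls with target
    "noise": rng.normal(0, 1, size=n),                    # unrelated to target
    "target": target,
})

# Take the 'target' column of the correlation matrix, drop the trivial
# self-correlation, and sort from most negative to most positive.
corr_with_target = toy.corr()["target"].drop("target").sort_values()
print(corr_with_target)
```

On the real data the same line would be df.corr()['target'].drop('target').sort_values(). Features at either end of the sorted list are the most strongly related to the target, while those near 0 carry little linear signal.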

The 8 count plots below show, for each categorical variable, how many people fall into each category, split by the target variable (0 or 1).

In [None]:
plt.figure(figsize=(20,15))
plt.subplot(4,4,1)
sns.countplot(data=df,x='sex',hue='target',palette='Set2')
plt.subplot(4,4,2)
sns.countplot(data=df,x='cp',hue='target',palette='Set2')
plt.subplot(4,4,3)
sns.countplot(data=df,x='fbs',hue='target',palette='Set2')
plt.subplot(4,4,4)
sns.countplot(data=df,x='restecg',hue='target',palette='Set2')
plt.subplot(4,4,5)
sns.countplot(data=df,x='exang',hue='target',palette='Set1')
plt.subplot(4,4,6)
sns.countplot(data=df,x='slope',hue='target',palette='Set1')
plt.subplot(4,4,7)
sns.countplot(data=df,x='ca',hue='target',palette='Set1')
plt.subplot(4,4,8)
sns.countplot(data=df,x='thal',hue='target',palette='Set1')

The 4 plots below are histograms that show the range and distribution of values that the continuous variables possess.

In [None]:
plt.figure(figsize=(20,15))
plt.subplot(4,4,1)
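# sns.distplot is deprecated; histplot with kde=True draws a comparable
# histogram with a density curve on top.
sns.histplot(df['age'],bins=30,kde=True,stat='density')
plt.subplot(4,4,2)
sns.histplot(df['trestbps'],bins=40,color='red',kde=True,stat='density')
plt.subplot(4,4,3)
sns.histplot(df['chol'],bins=50,color='green',kde=True,stat='density')
plt.subplot(4,4,4)
sns.histplot(df['oldpeak'],bins=30,color='purple',kde=True,stat='density')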

The following 4 plots show the relationship between 3 variables: the target, gender, and one of the 4 continuous numerical variables. Each is a horizontal box plot in which the green box indicates no heart disease and the orange box indicates the presence of heart disease. The x-axis shows age, resting blood pressure, cholesterol or oldpeak (on a log scale for the first three), and the y-axis shows the target (0 or 1). Each figure is split into 2 rows, one per gender.

In [None]:
g = sns.catplot(x="age", y="target", row="sex",kind="box", orient="h", height=1.5, aspect=4,data=df,palette='Set2')
g.set(xscale='log')

In [None]:
g = sns.catplot(x="trestbps", y="target", row="sex",kind="box", orient="h", height=1.5, aspect=4,data=df,palette='Set2')
g.set(xscale='log')

In [None]:
g = sns.catplot(x="chol", y="target", row="sex",kind="box", orient="h", height=1.5, aspect=4,data=df,palette='Set2')
g.set(xscale='log')

In [None]:
g = sns.catplot(x="oldpeak", y="target", row="sex",kind="box", orient="h", height=1.5, aspect=4,data=df,palette='Set2')

# Data manipulation

A few categorical variables take more than two values (ranges such as 0-3 or 0-4), and feeding these codes to a model as plain numbers implies an ordering that may not exist. So let's one-hot encode them: each category becomes its own column of 1's and 0's. We do this with pandas by specifying which columns to encode (drop_first=True drops the first category of each, since it is implied when all the others are 0), and we end up with a dataframe in which the original columns are replaced by the encoded ones.

In [None]:
d1=pd.get_dummies(df['cp'],drop_first=True,prefix='cp')
d2=pd.get_dummies(df['thal'],drop_first=True,prefix='thal')
d3=pd.get_dummies(df['slope'],drop_first=True,prefix='slope')
df=pd.concat([df,d1,d2,d3],axis=1)
df.drop(['cp','thal','slope'],axis=1,inplace=True)
df.head()
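To see what drop_first=True does, here is a small sketch on a toy column (the values are made up for illustration). It compares the full encoding with the reduced one.

```python
import pandas as pd

# Toy column with the same 0-3 coding as the chest pain type.
toy = pd.DataFrame({"cp": [0, 1, 2, 3, 1]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy["cp"], prefix="cp")

# drop_first=True removes the first category (cp_0). A row with all
# remaining columns equal to 0 therefore means "category 0".
reduced = pd.get_dummies(toy["cp"], prefix="cp", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

Dropping one column per variable loses no information and avoids perfectly collinear columns, which mostly matters for linear models such as logistic regression.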

# Creating an extra feature

Let's perform some very simple feature engineering using the age column. It is well known that adults over the age of 60 are more likely to suffer from heart disease than younger adults. So we create a separate column that flags the entries in which the person is 60 years or older, assigning 0 to those below 60 and 1 to those aged 60 or above. We name the column 'seniors', meaning senior citizens.

In [None]:
df['age'].min()

In [None]:
df['age'].max()

In [None]:
df['seniors'] = (df['age'] >= 60).astype(int)  # 1 if aged 60 or older, else 0

In [None]:
df.head()

# Train-test split

Now that our dataframe is ready, let's split the data into training and testing sets.

In [None]:
X=df.drop('target',axis=1)
y=df['target']

In [None]:
xtrain,xtest,ytrain,ytest=train_test_split(X,y,test_size=0.2,random_state=42)
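With only about 60 rows in the test set, the share of positive cases can drift between the two splits. One optional variation, not used in this notebook, is to pass stratify=y so that both splits keep the class proportions of the full data. The sketch below demonstrates this on toy arrays (X_toy and y_toy are made-up names and values).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 30% positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

Applying this to the real split would change which rows land in the test set, so the accuracy figures reported later would no longer match exactly.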

# Scaling the data

In [None]:
scale=StandardScaler()
xtrain=scale.fit_transform(xtrain)
xtest=scale.transform(xtest)
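The scaler above is fitted on the training set only, and the test set is then transformed with the training mean and standard deviation, so no information from the test rows leaks into preprocessing. The sketch below checks on random toy data (not the heart dataset) that this is the same as computing (x - mean) / std by hand with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy train/test matrices with two features (illustrative values only).
rng = np.random.default_rng(0)
train = rng.normal(loc=50, scale=10, size=(80, 2))
test = rng.normal(loc=50, scale=10, size=(20, 2))

# Fit on the training data only, then reuse those statistics for the test data.
scaler = StandardScaler().fit(train)
scaled_test = scaler.transform(test)

# Manual standardization using the TRAINING mean and (population) std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(np.allclose(scaled_test, manual))
```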

Let us create a list called scores so that we can finally compare the performances of 5 different classification models based on their accuracy score.

In [None]:
scores=[]

# Classification

1. Logistic Regression

In [None]:
clf1=LogisticRegression()
clf1.fit(xtrain,ytrain)
pred1=clf1.predict(xtest)
s1=accuracy_score(ytest,pred1)
scores.append(s1*100)
print(s1*100)

2. Random Forest

In [None]:
clf2=RandomForestClassifier(max_depth=2,random_state=0)
clf2.fit(xtrain,ytrain)
pred2=clf2.predict(xtest)
s2=accuracy_score(ytest,pred2)
scores.append(s2*100)
print(s2*100)

3. K nearest neighbors

In [None]:
clf3=KNeighborsClassifier()
clf3.fit(xtrain,ytrain)
pred3=clf3.predict(xtest)
s3=accuracy_score(ytest,pred3)
scores.append(s3*100)
print(s3*100)

4. Support Vector Machine

In [None]:
clf4=svm.SVC(kernel='rbf',C=1)
clf4.fit(xtrain,ytrain)
pred4=clf4.predict(xtest)
s4=accuracy_score(ytest,pred4)
scores.append(s4*100)
print(s4*100)

5. Decision Tree

In [None]:
clf5=DecisionTreeClassifier(max_depth=3,random_state=0)
clf5.fit(xtrain,ytrain)
pred5=clf5.predict(xtest)
s5=accuracy_score(ytest,pred5)
scores.append(s5*100)
print(s5*100)
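The five cells above repeat the same fit, predict and score steps. As a possible alternative layout, the sketch below runs the same five model configurations in a loop. It uses synthetic data from make_classification so that it is self-contained, which means its numbers say nothing about the heart disease data. The names models and demo_scores are made up for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real features).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Scale using training statistics only, as in the notebook.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# Same model settings as the cells above.
models = {
    "LogisticRegression": LogisticRegression(),
    "RandomForest": RandomForestClassifier(max_depth=2, random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(kernel="rbf", C=1),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Fit each model and record its test accuracy as a percentage.
demo_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    demo_scores[name] = accuracy_score(y_te, model.predict(X_te)) * 100

for name, score in demo_scores.items():
    print(f"{name:>20}: {score:.2f}")
```

Keeping the scores in a dictionary keyed by model name also removes the need to maintain a separate list of names in the right order.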

Viewing the scores list, we can see how different models perform.

In [None]:
print(scores)

In [None]:
names=['LogisticRegression','RandomForest','KNN','SVM','Decision Tree']
classifier=pd.Series(data=scores,index=names)
print(classifier)

In [None]:
plt.figure(figsize=(10,7))
classifier.sort_index().plot.bar()

We can see that we get the highest accuracy, 90.16%, with the logistic regression classifier. So let's check its classification report and confusion matrix to better understand the predictions.

In [None]:
print(confusion_matrix(ytest,pred1))

Our classifier has predicted 27+28 = 55 outcomes correctly and 2+4 = 6 outcomes wrongly, out of 61 test examples. That is 55/61, which matches the 90.16% accuracy.
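For a binary problem, scikit-learn lays out the confusion matrix as [[TN, FP], [FN, TP]], with rows for the true class and columns for the predicted class. The sketch below unpacks the four counts on a tiny hand-made example (y_true and y_pred are invented labels), then repeats the accuracy arithmetic for the counts quoted above.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] into four scalars.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
toy_accuracy = (tn + tp) / (tn + fp + fn + tp)
print(tn, fp, fn, tp, toy_accuracy)

# Accuracy from the counts reported in the text: 27+28 correct, 2+4 wrong.
reported_accuracy = (27 + 28) / (27 + 28 + 2 + 4)
print(round(reported_accuracy * 100, 2))
```

The same unpacking applied to confusion_matrix(ytest,pred1) tells you which of the two error counts are false positives and which are false negatives, a distinction that matters in a medical setting.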

In [None]:
print(classification_report(ytest,pred1))
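A single split with 61 test rows gives a fairly noisy accuracy estimate. As an optional follow-up, not part of the original analysis, the sketch below shows 5-fold cross-validation with the scaler and the logistic regression wrapped in a pipeline, so that scaling is re-fitted inside each fold. It uses synthetic data, so the printed numbers are not results for the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data of roughly the same size as the heart dataset.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# The pipeline fits the scaler on the training part of each fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# One accuracy value per fold.
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores)
print("mean:", cv_scores.mean(), "std:", cv_scores.std())
```

On the real data this would be run on X and y before scaling, since the pipeline takes care of the scaling step.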
