# Prediction Of Heart Disease
### By: Rahul Kulkarni

### Scope and Problem Statement:
In this kernel we will try to predict the likelihood of a person having heart disease from factors such as age, gender, and blood pressure. We will use several classification models for the prediction and go through data cleaning, EDA, model fitting, and evaluation.

This prediction is useful in the healthcare industry, as an accurate prediction can help doctors treat a patient early.

### Description of the dataset:
1. **age**: age of the patient
2. **sex**: 1 = male and 0 = female
3. **cp(chest pain type)**: 1= typical angina, 2= atypical angina, 3= non-anginal pain, 4= asymptomatic
4. **trestbps(resting blood pressure)**: resting blood pressure (in mm Hg on admission to the hospital)
5. **chol(serum cholesterol in mg/dl)**
1. **fbs(fasting blood sugar > 120 mg/dl)**: 1 = true, 0 = false
1. **restecg(resting electrocardiographic results)**: 0= normal, 1= having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2= showing probable or definite left ventricular hypertrophy by Estes' criteria
1. **thalach(maximum heart rate achieved)**
1. **exang(exercise induced angina)**: 1 = yes, 0 = no
1. **oldpeak**: ST depression induced by exercise relative to rest
1. **slope(the slope of the peak exercise ST segment)**: 1= upsloping, 2= flat, 3= downsloping
1. **ca(number of major vessels (0-3) colored by fluoroscopy)**
1. **thal**: 3 = normal; 6 = fixed defect; 7 = reversible defect
1. **target**: 0= < 50% diameter narrowing, 1= > 50% diameter narrowing

# Data Cleaning

In [None]:
import pandas as pd
import numpy as np
heart_data = pd.read_csv('../input/heart-disease-uci/heart.csv')
heart_data.head()

In [None]:
heart_data.shape

The dataset is not very large.

Let's make sure that the data types for the features are according to the description.

In [None]:
heart_data.dtypes

Let's check for missing values.

In [None]:
heart_data.isnull().sum()

There are no missing values.

There are certain column names that bother me so I will rename them.

In [None]:
heart_data.rename(columns={'trestbps':'restbp','thalach':'maxhr','ca':'nmv'},inplace=True)

# Exploratory Data Analysis

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

Let's look at the statistical properties of the features.

In [None]:
heart_data.describe()

The features such as 'restbp','chol' and 'maxhr' have large ranges compared to other features. This might cause bias towards these features during modelling. We will handle this later.

Let's have a closer look at the features and how they affect the target variable.

In [None]:
sns.boxplot(x='target',y='age',data=heart_data)

The box plot doesn't give us good information, so binning ages would be a better option.

In [None]:
heart_data['age']  = pd.cut(heart_data['age'],bins=[29,39,49,59,69,79],labels=[1,2,3,4,5],include_lowest=True)
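
To make the binning explicit, here is a small sketch on a few made-up ages (the values in `ages` are illustrative, not rows from the dataset). With `include_lowest=True` the first interval also includes its left edge, so an age of 29 lands in bin 1; every other interval is open on the left and closed on the right.

```python
import pandas as pd

# Illustrative ages only; these are not rows from heart.csv.
ages = pd.Series([29, 39, 40, 55, 69, 77])

# Same bin edges and labels as in the cell above.
age_bins = pd.cut(
    ages,
    bins=[29, 39, 49, 59, 69, 79],
    labels=[1, 2, 3, 4, 5],
    include_lowest=True,
)

# 29 and 39 share bin 1, 40 starts bin 2, and so on.
print(age_bins.astype(int).tolist())
```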

In [None]:
age = heart_data.groupby(['age','target'])['target'].count().unstack()
age['per'] = round((age[1]/(age[0]+age[1]))*100,2)
age

According to the data, middle-aged patients have a higher chance of having heart problems.

In [None]:
sex = heart_data.groupby(['sex','target'])['target'].count().unstack()
sex['per'] = round((sex[1]/(sex[0]+sex[1]))*100,2)
sex

In this dataset, females have a higher chance of having heart problems.
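
The count-then-divide pattern used above recurs for every categorical feature. Because 'target' is coded 0/1, its group mean is the share of positive cases, so the same percentage can be sketched with a single `groupby(...).mean()`. The helper name `disease_rate` and the toy frame below are illustrative assumptions, not part of the original kernel.

```python
import pandas as pd

def disease_rate(df, feature, target="target"):
    """Percentage of rows with target == 1 within each level of `feature`."""
    # The mean of a 0/1 column equals the fraction of ones.
    return (df.groupby(feature)[target].mean() * 100).round(2)

# Toy data for illustration only; not taken from heart.csv.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 0, 0, 0],
})

rates = disease_rate(toy, "sex")
print(rates)
```

On the real data this would be called as `disease_rate(heart_data, 'sex')`, and likewise for 'cp', 'fbs' and the other categorical columns.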

Let's do further analysis by grouping data by age and sex.

In [None]:
age_sex = heart_data.groupby(['age','sex','target'])['target'].count().unstack()
age_sex['per'] = round((age_sex[1]/(age_sex[0]+age_sex[1]))*100,2)
age_sex

We can club age and sex for further analysis, as this has given us interesting information.

In [None]:
cp = heart_data.groupby(['cp','target'])['target'].count().unstack()
cp['per'] = round((cp[1]/(cp[0]+cp[1]))*100,2)
cp

As expected, if a person has chest pain, they have a higher chance of heart problems.

In [None]:
sns.boxplot(x='target',y='restbp',data=heart_data)

Both the target classes have the same median. The IQR of '0' label is larger than '1'. There are some outliers, such as restbp of 200. This blood pressure would mean abnormal heart behaviour. 

In [None]:
sns.boxplot(x='target',y='chol',data=heart_data)

Both labels have similar ranges. As expected, very high cholesterol would result in heart disease; this is visible from the outliers.

In [None]:
fbs = heart_data.groupby(['fbs','target'])['target'].count().unstack()
fbs['per'] = round((fbs[1]/(fbs[0]+fbs[1]))*100,2)
fbs

Nothing conclusive can be obtained from fasting blood sugar.

In [None]:
restecg = heart_data.groupby(['restecg','target'])['target'].count().unstack()
restecg['per'] = round((restecg[1]/(restecg[0]+restecg[1]))*100,2)
restecg

An ECG result of 1 has a higher chance of heart disease.

In [None]:
sns.boxplot(x='target',y='maxhr',data=heart_data)

As expected, a higher heart rate results in a higher chance of heart disease.

In [None]:
ex = heart_data.groupby(['exang','target'])['target'].count().unstack()
ex['per'] = round((ex[1]/(ex[0]+ex[1]))*100,2)
ex

It is surprising to see that no angina results in higher chance of heart disease and angina results in less chance of heart disease.

In [None]:
sns.boxplot(x='target',y='oldpeak',data=heart_data)

Very low depression results in higher chance of heart disease. A higher ST depression is seen in normal patients.

In [None]:
sl = heart_data.groupby(['slope','target'])['target'].count().unstack()
sl['per'] = round((sl[1]/(sl[0]+sl[1]))*100,2)
sl

A flat slope results in less chance of heart disease and down slope has a higher chance of disease.

In [None]:
nmv = heart_data.groupby(['nmv','target'])['target'].count().unstack()
nmv['per'] = round((nmv[1]/(nmv[0]+nmv[1]))*100,2)
nmv

Very low and very high numbers of vessels result in a higher chance of heart disease, but mid-range values have a lower chance.

In [None]:
thal = heart_data.groupby(['thal','target'])['target'].count().unstack()
thal['per'] = round((thal[1]/(thal[0]+thal[1]))*100,2)
thal

As expected irreversible defects have a higher chance of disease and reversible defects have a lower chance.

Let's create a new variable using age and sex.

In [None]:
heart_data['age']=heart_data['age'].astype('int')
heart_data['age_sex'] = heart_data['age']*heart_data['sex']

Let's check the correlation between features.

In [None]:
c = heart_data.corr()
plt.figure(figsize=(20,6))
sns.heatmap(c,annot=True,fmt='f')

We can observe that 'cp' and 'maxhr' have a decent positive correlation with our target variable. Features such as 'exang','oldpeak','nmv','thal' and 'age_sex' have a negative correlation.

# Model Fitting

Here we select the features used for prediction, keeping those that have a decent correlation with the target variable. We also use MinMaxScaler to rescale 'restbp' and 'maxhr' to the range 0 to 1 to avoid bias towards them. We use the training set to find the optimum parameters for each model and KFold cross-validation to check performance. The evaluation metric is the F1 score. Since we need to reduce false negatives, the F1 score is a better metric than accuracy because it combines precision and recall.
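
As a reminder of what the metric measures, the sketch below computes accuracy, precision, recall and F1 by hand from confusion-matrix counts. The counts are made up for illustration. Accuracy stays high here, while recall, and therefore F1, exposes the missed positive cases.

```python
# Made-up counts for illustration: 100 patients, 20 of them with disease.
tp, fn = 10, 10   # diseased patients found / missed
fp, tn = 2, 78    # healthy patients flagged / cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```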

In [None]:
from sklearn.model_selection import train_test_split as tts
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score as fs
from statistics import mean
from sklearn.preprocessing import MinMaxScaler as MMS
heart_data['mms_restbp'] = MMS().fit_transform(heart_data[['restbp']])
heart_data['mms_maxhr'] = MMS().fit_transform(heart_data[['maxhr']])
x = heart_data[['age_sex','cp','mms_restbp','restecg','mms_maxhr','exang','oldpeak','slope','nmv','thal']]
y = heart_data['target']
xtrain,xtest,ytrain,ytest = tts(x,y,test_size=0.2,random_state=100)
kf = KFold(n_splits=5,shuffle=True,random_state=100)
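
One caveat with the cell above is that `MinMaxScaler` is fitted on the full dataset before the split, so the held-out rows influence the scaling. A possible alternative, sketched here on synthetic data from `make_classification` (an assumption, since `heart.csv` is not loaded in this snippet), wraps the scaler and the model in a pipeline so that the scaler is refitted on the training folds only. The names `X_demo`, `y_demo`, `pipe` and `fold_f1` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the heart data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=100)

# The scaler is fitted inside each training fold, never on the validation fold.
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=8))
cv = KFold(n_splits=5, shuffle=True, random_state=100)
fold_f1 = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="f1")

print(fold_f1.mean())
```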

### K Nearest Neighbors

We need to find the optimum number of neighbors for our model. Hence we will compare the F1 scores obtained for various values for our neighbors.

In [None]:
from sklearn.neighbors import KNeighborsClassifier as KNC
neighbors,scores_train,scores_test = [i for i in range(1,20)],[],[]
for n in neighbors:
    score_train,score_test = [],[]
    for train,test in kf.split(xtrain):
        xtr,xtt = xtrain.iloc[train],xtrain.iloc[test]
        ytr,ytt = ytrain.iloc[train],ytrain.iloc[test]
        knc = KNC(n_neighbors=n,weights='distance') #Giving weight to the distance of neighbors
        knc.fit(xtr,ytr)
        yhat_train = knc.predict(xtr)
        yhat_test = knc.predict(xtt)
        score_train.append(round(fs(ytr,yhat_train),2))
        score_test.append(round(fs(ytt,yhat_test),2))
    scores_train.append(mean(score_train))
    scores_test.append(mean(score_test))
sns.lineplot(x=neighbors,y=scores_train,color='r')
sns.lineplot(x=neighbors,y=scores_test,color='b')
plt.legend(('Train','Test'))

The value of optimum neighbors is 8.
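
The manual loop above can also be expressed with scikit-learn's `GridSearchCV`, which runs the same K-fold search and reports the best setting. This is a sketch on synthetic data (`make_classification` stands in for `xtrain` and `ytrain`), so the selected `n_neighbors` will not necessarily match the value found above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for xtrain/ytrain (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=100)

search = GridSearchCV(
    KNeighborsClassifier(weights="distance"),
    param_grid={"n_neighbors": list(range(1, 20))},
    scoring="f1",
    cv=KFold(n_splits=5, shuffle=True, random_state=100),
)
search.fit(X_demo, y_demo)

# Best neighbor count and its mean cross-validated F1 score.
print(search.best_params_, round(search.best_score_, 2))
```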

### Decision Tree

We need to find the optimum depth for our model.

In [None]:
from sklearn.tree import DecisionTreeClassifier as DTC
depths,scores_train,scores_test = [i for i in range(3,20)],[],[]
for d in depths:
    score_train,score_test = [],[]
    for train,test in kf.split(xtrain):
        xtr,xtt = xtrain.iloc[train],xtrain.iloc[test]
        ytr,ytt = ytrain.iloc[train],ytrain.iloc[test]
        dtc = DTC(max_depth=d)
        dtc.fit(xtr,ytr)
        yhat_train = dtc.predict(xtr)
        yhat_test = dtc.predict(xtt)
        score_train.append(round(fs(ytr,yhat_train),2))
        score_test.append(round(fs(ytt,yhat_test),2))
    scores_train.append(mean(score_train))
    scores_test.append(mean(score_test))
sns.lineplot(x=depths,y=scores_train,color='r')
sns.lineplot(x=depths,y=scores_test,color='b')

Depth value of 3 has the highest F1 score but I am certain that this value would underfit the data. Therefore I believe choosing a value of 6 would be better.
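
To see how depth trades underfitting against overfitting, a sketch like the following compares training and validation F1 at a few depths. It uses synthetic data and a single hold-out split, both assumptions made for brevity, so the numbers will differ from the plot above. An unrestricted tree (`max_depth=None`) typically fits the training data perfectly, which is the overfitting end of the range.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=100)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)

for depth in (1, 3, 6, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=100).fit(X_tr, y_tr)
    train_f1 = f1_score(y_tr, tree.predict(X_tr))
    val_f1 = f1_score(y_val, tree.predict(X_val))
    print(depth, tree.get_depth(), round(train_f1, 2), round(val_f1, 2))
```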

### Logistic Regression

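We need to find the optimum value of the regularization factor for our model. Increasing the regularization strength penalizes large weight coefficients. The goal is to prevent the model from picking up peculiarities or noise in the training data, or imagining a pattern where there is none. Note that the parameter C tuned below is the inverse of the regularization strength, so smaller values mean stronger regularization.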

In [None]:
from sklearn.linear_model import LogisticRegression as LR
C,scores_train,scores_test = [0.001,0.005,0.01,0.05,0.1,0.5],[],[]
for c in C:
    score_train,score_test = [],[]
    for train,test in kf.split(xtrain):
        xtr,xtt = xtrain.iloc[train],xtrain.iloc[test]
        ytr,ytt = ytrain.iloc[train],ytrain.iloc[test]
        lr = LR(C=c,max_iter=1000)  # C is Inverse of regularization strength
        lr.fit(xtr,ytr)
        yhat_train = lr.predict(xtr)
        yhat_test = lr.predict(xtt)
        score_train.append(round(fs(ytr,yhat_train),2))
        score_test.append(round(fs(ytt,yhat_test),2))
    scores_train.append(mean(score_train))
    scores_test.append(mean(score_test))
sns.lineplot(x=C,y=scores_train,color='r')
sns.lineplot(x=C,y=scores_test,color='b')

The optimum regularization factor was found to be 0.05.
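
The effect of C on the weights can be checked directly. The sketch below, on synthetic data (an assumption, as above), fits the model at three values of C and records the L2 norm of the coefficients; the norm shrinks as C decreases, since C is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the heart data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=100)

coef_norms = {}
for c in [0.001, 0.05, 0.5]:
    model = LogisticRegression(C=c, max_iter=1000).fit(X_demo, y_demo)
    # L2 norm of the learned weights; smaller C means stronger regularization.
    coef_norms[c] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```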

### Gaussian Naive Bayes

This model assumes that each feature is normally distributed within each class and applies Bayes' theorem under the assumption that the features are independent given the class. We do not tune any parameters for this model.

In [None]:
from sklearn.naive_bayes import GaussianNB as GNB
score_train,score_test = [],[]
for train,test in kf.split(xtrain):
    xtr,xtt = xtrain.iloc[train],xtrain.iloc[test]
    ytr,ytt = ytrain.iloc[train],ytrain.iloc[test]
    gnb = GNB()
    gnb.fit(xtr,ytr)
    yhat_train = gnb.predict(xtr)
    yhat_test = gnb.predict(xtt)
    score_train.append(round(fs(ytr,yhat_train),2))
    score_test.append(round(fs(ytt,yhat_test),2))
print(mean(score_train),mean(score_test))

### Support Vector Machine

Similar to the logistic regression model, we will try to find the optimum value for the regularization factor.

In [None]:
from sklearn.svm import SVC
C,scores_train,scores_test = [0.001,0.005,0.01,0.05,0.1,0.5],[],[]
for c in C:
    score_train,score_test = [],[]
    for train,test in kf.split(xtrain):
        xtr,xtt = xtrain.iloc[train],xtrain.iloc[test]
        ytr,ytt = ytrain.iloc[train],ytrain.iloc[test]
        svc = SVC(C=c,kernel='linear')
        svc.fit(xtr,ytr)
        yhat_train = svc.predict(xtr)
        yhat_test = svc.predict(xtt)
        score_train.append(round(fs(ytr,yhat_train),2))
        score_test.append(round(fs(ytt,yhat_test),2))
    scores_train.append(mean(score_train))
    scores_test.append(mean(score_test))
sns.lineplot(x=C,y=scores_train,color='r')
sns.lineplot(x=C,y=scores_test,color='b')

The optimum value of C is 0.1.

### Random Forest Classifier

This model builds many decision trees and combines their predictions by averaging. We need to find the optimum number of trees for this model. We will use the maximum depth chosen for the DTC model.

In [None]:
from sklearn.ensemble import RandomForestClassifier as RFC
estimators,scores_train,scores_test = [100,150,200,250,300,350,400,450,500],[],[]
for e in estimators:
    score_train,score_test = [],[]
    for train,test in kf.split(xtrain):
        xtr,xtt = xtrain.iloc[train],xtrain.iloc[test]
        ytr,ytt = ytrain.iloc[train],ytrain.iloc[test]
        rfc = RFC(n_estimators=e,max_depth=6)
        rfc.fit(xtr,ytr)
        yhat_train = rfc.predict(xtr)
        yhat_test = rfc.predict(xtt)
        score_train.append(round(fs(ytr,yhat_train),2))
        score_test.append(round(fs(ytt,yhat_test),2))
    scores_train.append(mean(score_train))
    scores_test.append(mean(score_test))
sns.lineplot(x=estimators,y=scores_train,color='r')
sns.lineplot(x=estimators,y=scores_test,color='b')

The optimum number of trees was found to be 250.
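
To illustrate the averaging, the sketch below checks that scikit-learn's forest probabilities equal the mean of the probabilities predicted by its individual trees. It runs on synthetic data with a smaller forest of 50 trees, both assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the heart data (illustrative only).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=100)

forest = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=100)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.mean([t.predict_proba(X_demo) for t in forest.estimators_], axis=0)

# Compare the hand-made average with the forest's own probabilities.
print(np.allclose(tree_probas, forest.predict_proba(X_demo)))
```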

# Evaluation

Now we will test our models on the test data and examine their performance with the help of a confusion matrix.

In [None]:
from sklearn.metrics import accuracy_score as acs
from sklearn.metrics import confusion_matrix
models,f1score,acscore = ['KNN','DTC','LR','GNB','SVM','RFC'],[],[]

### K Nearest Neighbors

In [None]:
knn = KNC(n_neighbors=8,weights='distance')
knn.fit(xtrain,ytrain)
yhat_eval = knn.predict(xtest)
f1 = round(fs(ytest,yhat_eval),2)
ac = round(acs(ytest,yhat_eval),2)
f1score.append(f1)
acscore.append(ac)
print(f1,ac)

In [None]:
cfm = confusion_matrix(ytest,yhat_eval)
sns.heatmap(cfm,annot=True)
plt.title('Confusion Matrix KNN')
plt.xlabel('Predicted Value')
plt.ylabel('True Label')
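
To read the heatmap numerically, the four cells can be unpacked from the matrix. For binary labels, scikit-learn orders rows and columns as 0 then 1, so `ravel()` yields TN, FP, FN, TP. The labels below are made up for illustration; with the real data, `ytest` and `yhat_eval` would take their place.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = disease, 0 = no disease).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# Rows are true labels, columns are predictions, both ordered 0 then 1.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall is the share of diseased patients the model actually caught.
recall = tp / (tp + fn)

print(tn, fp, fn, tp, recall)
```

The false negatives (`fn`) are the cell to watch here, since they correspond to patients with disease that the model missed.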

### Decision Tree Classifier

In [None]:
dtc = DTC(max_depth=6)
dtc.fit(xtrain,ytrain)
yhat_eval = dtc.predict(xtest)
f1 = round(fs(ytest,yhat_eval),2)
ac = round(acs(ytest,yhat_eval),2)
f1score.append(f1)
acscore.append(ac)
print(f1,ac)

In [None]:
cfm = confusion_matrix(ytest,yhat_eval)
sns.heatmap(cfm,annot=True)
plt.title('Confusion Matrix DTC')
plt.xlabel('Predicted Value')
plt.ylabel('True Label')

### Logistic Regression

In [None]:
lr = LR(C=0.05,max_iter=1000)  # same max_iter as used during tuning
lr.fit(xtrain,ytrain)
yhat_eval = lr.predict(xtest)
f1 = round(fs(ytest,yhat_eval),2)
ac = round(acs(ytest,yhat_eval),2)
f1score.append(f1)
acscore.append(ac)
print(f1,ac)

In [None]:
cfm = confusion_matrix(ytest,yhat_eval)
sns.heatmap(cfm,annot=True)
plt.title('Confusion Matrix LR')
plt.xlabel('Predicted Value')
plt.ylabel('True Label')

### Gaussian Naive Bayes

In [None]:
gnb = GNB()
gnb.fit(xtrain,ytrain)
yhat_eval = gnb.predict(xtest)
f1 = round(fs(ytest,yhat_eval),2)
ac = round(acs(ytest,yhat_eval),2)
f1score.append(f1)
acscore.append(ac)
print(f1,ac)

In [None]:
cfm = confusion_matrix(ytest,yhat_eval)
sns.heatmap(cfm,annot=True)
plt.title('Confusion Matrix GNB')
plt.xlabel('Predicted Value')
plt.ylabel('True Label')

### Support Vector Machine

In [None]:
svc = SVC(kernel='linear',C=0.1)
svc.fit(xtrain,ytrain)
yhat_eval = svc.predict(xtest)
f1 = round(fs(ytest,yhat_eval),2)
ac = round(acs(ytest,yhat_eval),2)
f1score.append(f1)
acscore.append(ac)
print(f1,ac)

In [None]:
cfm = confusion_matrix(ytest,yhat_eval)
sns.heatmap(cfm,annot=True)
plt.title('Confusion Matrix SVC')
plt.xlabel('Predicted Value')
plt.ylabel('True Label')

### Random Forest Classifier

In [None]:
rfc = RFC(n_estimators=250,max_depth=6)
rfc.fit(xtrain,ytrain)
yhat_eval = rfc.predict(xtest)
f1 = round(fs(ytest,yhat_eval),2)
ac = round(acs(ytest,yhat_eval),2)
f1score.append(f1)
acscore.append(ac)
print(f1,ac)

In [None]:
cfm = confusion_matrix(ytest,yhat_eval)
sns.heatmap(cfm,annot=True)
plt.title('Confusion Matrix RFC')
plt.xlabel('Predicted Value')
plt.ylabel('True Label')

We will analyse the metrics for various models and select the best one.

In [None]:
Results = pd.DataFrame({'F1 Score':f1score,'Accuracy Score':acscore},index=models)
Results.sort_values(by=['F1 Score','Accuracy Score'],ascending=False)

Therefore, from the above table it is clear that Logistic Regression was the best model, as it produced the highest F1 and accuracy scores.