![Heart - Disease](https://www.deccanherald.com/sites/dh/files/styles/article_detail/public/article_images/2019/11/20/heart-attack-1574189524.jpg)

Heart disease is common and often fatal. Machine learning algorithms have been applied to many problems in the healthcare sector. In this kernel, we apply some common machine learning algorithms to build a reasonably good model for predicting heart disease.

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px  # the standalone plotly_express package is deprecated in favour of plotly.express
import plotly.graph_objects as go
import plotly.io as pio
from plotly.offline import plot, iplot,init_notebook_mode
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

init_notebook_mode()
pio.templates.default = 'plotly_white'

In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')
df.head()

# Data Description

Dataset Features
* age - age in years
* sex - (1 = male; 0 = female)
* cp - chest pain type
* trestbps - resting blood pressure (in mm Hg on admission to the hospital)
* chol - serum cholesterol in mg/dl
* fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg - resting electrocardiographic results
* thalach - maximum heart rate achieved
* exang - exercise induced angina (1 = yes; 0 = no)
* oldpeak - ST depression induced by exercise relative to rest
* slope - the slope of the peak exercise ST segment
* ca - number of major vessels (0-3) colored by fluoroscopy
* thal - 3 = normal; 6 = fixed defect; 7 = reversible defect
* target - whether the patient has heart disease (1 = yes; 0 = no)

# EDA

In [None]:
df.info()

# Convert these columns to categorical variables so that they are easier to visualize later.
df['target'] = df['target'].astype('category')
df['slope'] = df['slope'].astype('category')
df['fbs'] = df['fbs'].astype('category')
df['ca'] = df['ca'].astype('category')
df['thal'] = df['thal'].astype('category')
df['exang'] = df['exang'].astype('category')

In [None]:
df.describe()

The output of `df.info()` shows that the dataset has no null values.
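
The absence of nulls can also be checked directly by counting missing values per column. The sketch below uses a small hypothetical frame (`toy`) in place of `heart.csv`, with one missing value planted on purpose, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for heart.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233.0, 250.0, np.nan, 236.0],  # one missing value planted on purpose
    "target": [1, 1, 0, 0],
})

# Count missing values per column; all zeros would mean no nulls.
null_counts = toy.isnull().sum()
print(null_counts)

# Single yes/no answer for the whole frame.
has_nulls = bool(null_counts.sum() > 0)
print("Any nulls:", has_nulls)
```

Running the same two calls on `df` should report zero for every column if the dataset is indeed complete.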

In [None]:
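target = df.target.value_counts(normalize = True)*100
# Map class codes to labels explicitly rather than relying on the order returned by value_counts
label_map = {1: 'Has Disease', 0: 'Does not have Disease'}
trace1 = go.Bar(
    x = [label_map[t] for t in target.index],
    y = target.values, 
    text = target.values, 
    textposition = 'auto',
    texttemplate = "%{y:.2f} %"
)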
fig = go.Figure(data = [trace1])
fig.update_layout(title_text = '<b>Target Distribution</b>',
                 xaxis_title="Target",
                yaxis_title="Percentage")
fig.show()

We see that the dataset contains a greater number of samples with Heart Disease. 

In [None]:
traces = []
for sex,data in df.groupby('sex'):
    if sex == 1:
        name = 'Male'
    else:
        name = 'Female'
    target = data['target'].value_counts(normalize = True)*100
    trace = go.Bar(
        x = target.index,
        y = target.values,
        text = target.values,
        textposition = 'auto',
        name = name,
        texttemplate = "%{y:.2f} %"
    )
    traces.append(trace)
fig = go.Figure(data = traces)
fig.update_layout(title = '<b>Distribution of target based on sex</b>',
    xaxis_title="Target",
    yaxis_title="Percentage",
    legend_title="Sex"
)
iplot(fig)

**In this dataset, females have heart disease at a much higher rate than males.** 

75% of the females in the dataset have heart disease, while 55% of the males do not.
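
The per-sex percentages can also be read from a row-normalised cross-tabulation instead of from the chart. This is a minimal sketch on hypothetical toy data (`toy` is not the real dataset, so the percentages it prints are not the ones quoted above).

```python
import pandas as pd

# Hypothetical toy data; not the real heart.csv values.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0],
})

# Row-normalised cross-tabulation: each row (one sex) sums to 100 %.
pct = pd.crosstab(toy["sex"], toy["target"], normalize="index") * 100

# Replace the 0/1 codes with readable labels (1 = male, 0 = female).
pct.index = pct.index.map({0: "Female", 1: "Male"})
print(pct.round(2))
```

With `normalize="index"` each row is a conditional distribution of `target` given `sex`, which is the quantity the bar chart displays.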

In [None]:
traces = []
for target,data in df.groupby('target'):
    if target == 1:
        name = 'Has Disease'
    else:
        name = 'Does not have Disease'
    age = data['age'].value_counts()
    trace = go.Bar(
        x = age.index,
        y = age.values,
        name = name
    )
    traces.append(trace)
fig = go.Figure(data = traces)
fig.update_layout(title = 'Distribution of target based on age',
    xaxis_title="Age",
    yaxis_title="Counts",
    legend_title="Target",
    legend = dict(x = 0)
)
iplot(fig)

In [None]:
print(f"The average age of People without Heart Disease is {df[df['target'] == 0]['age'].mean()}")
print(f"The average age of People with Heart Disease is {df[df['target'] == 1]['age'].mean()}")

While we might expect heart disease to be more prevalent among the elderly, in this dataset the average age of people without heart disease is higher than that of people with heart disease.
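
To judge whether such a gap in average age is large relative to the spread within each group, it can help to look at the standard deviation and group size alongside the mean. The sketch below does this with a single `groupby` on hypothetical toy data (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data; not the real heart.csv values.
toy = pd.DataFrame({
    "age":    [45, 50, 55, 60, 58, 62, 66],
    "target": [1, 1, 1, 1, 0, 0, 0],
})

# Mean, standard deviation and group size of age for each target class.
summary = toy.groupby("target")["age"].agg(["mean", "std", "count"])
print(summary)
```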

In [None]:
fig = px.scatter(df,x = 'age', y = 'thalach',trendline = 'ols', marginal_y = 'violin',color = 'target')
fig.update_traces(marker = dict(size = 8, ))
fig.update_layout(title = '<b>Distribution of heartrate based on age in people with and without heart disease</b>',
    xaxis_title="Age",
    yaxis_title="Heart Rate",
    legend_title="Target",
    legend = dict(x = 0)
)
iplot(fig)

In [None]:
fig = px.scatter(df,x = 'age', y = 'trestbps', color = 'target', trendline = 'ols', marginal_y = 'violin')
fig.update_traces(marker = dict(size = 10, ))
fig.update_layout(title = '<b>Distribution of Resting Blood Pressure based on age in people with and without heart disease</b>',
    xaxis_title="Age",
    yaxis_title="Blood Pressure",
    legend_title="Target",
    legend = dict(x = 0)   
)
iplot(fig)

Although the $R^2$ values are low, the OLS trendline is steeper for people with heart disease. In this dataset, resting blood pressure therefore rises faster with age among people with heart disease than among those without.
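
The slope comparison can be made numeric by fitting a straight line per group. The sketch below uses `np.polyfit` as a stand-in for the OLS trendline drawn by Plotly, on hypothetical toy data (`toy` is an assumption, so its slopes are not those of the real dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the real heart.csv values.
toy = pd.DataFrame({
    "age":      [40, 50, 60, 70, 40, 50, 60, 70],
    "trestbps": [120, 125, 130, 135, 118, 128, 138, 148],
    "target":   [0, 0, 0, 0, 1, 1, 1, 1],
})

# Fit a degree-1 polynomial (a straight line) to each target group.
slopes = {}
for label, group in toy.groupby("target"):
    slope, intercept = np.polyfit(
        group["age"].to_numpy(dtype=float),
        group["trestbps"].to_numpy(dtype=float),
        deg=1,
    )
    slopes[int(label)] = float(slope)
    print(f"target={label}: slope={slope:.2f} mm Hg per year")
```

A larger slope means blood pressure increases more per additional year of age within that group.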

In [None]:
traces = []
for slope,data in df.groupby('slope'):
    target = data['target'].value_counts(normalize = True)*100
    trace = go.Bar(
        x = target.index,
        y = target.values,
        text = target.values,
        textposition = 'auto',
        name = slope,
        texttemplate = "%{y:.2f} %"
    )
    traces.append(trace)
fig = go.Figure(data = traces)
fig.update_layout(title = 'Distribution of target based on Slope of the Peak Exercise ST Segment',
    xaxis_title="Target",
    yaxis_title="Percentage",
    legend_title="Slope"
)
iplot(fig)

In [None]:
traces = []
for fbs,data in df.groupby('fbs'):
    if fbs == 1:
        name = 'Fasting Blood Sugar > 120 mg/dl'
    else:
        name = 'Fasting Blood Sugar <= 120 mg/dl'
    target = data['target'].value_counts(normalize = True)*100
    trace = go.Bar(
        x = target.index,
        y = target.values,
        text = target.values,
        textposition = 'auto',
        name = name,
        texttemplate = "%{y:.2f} %"
    )
    traces.append(trace)
fig = go.Figure(data = traces)
fig.update_layout(title = 'Distribution of target based on Fasting Blood Sugar',
    xaxis_title="Target",
    yaxis_title="Percentage",
    legend_title="Fasting Blood Sugar"
)
iplot(fig)

In [None]:
traces = []
for ca,data in df.groupby('ca'):
    # Multiply by 100 so the values match the "%" suffix in texttemplate
    target = data['target'].value_counts(normalize = True)*100
    trace = go.Bar(
        x = target.index,
        y = target.values,
        text = target.values,
        textposition = 'auto',
        name = ca,
        texttemplate = "%{y:.2f} %"
    )
    traces.append(trace)
fig = go.Figure(data = traces)
fig.update_layout(title = 'Distribution of target based on Number of Major Vessels',
    xaxis_title="Target",
    yaxis_title="Percentage",
    legend_title="Number of Major Vessels"
)
iplot(fig)

In [None]:
traces = []
for exang,data in df.groupby('exang'):
    if exang == 1:
        name = 'Yes'
    else:
        name = 'No'
    target = data['target'].value_counts(normalize = True)*100
    trace = go.Bar(
        x = target.index,
        y = target.values,
        text = target.values,
        textposition = 'auto',
        name = name,
        texttemplate = "%{y:.2f} %"
    )
    traces.append(trace)
fig = go.Figure(data = traces)
fig.update_layout(title = 'Distribution of target based on Exercise Induced Angina',
    xaxis_title="Target",
    yaxis_title="Percentage",
    legend_title="Exercise Induced Angina"
)
iplot(fig)

In [None]:
traces = []
for thal,data in df.groupby('thal'):
    target = data['target'].value_counts(normalize = True)*100
    trace = go.Bar(
        x = target.index,
        y = target.values,
        text = target.values,
        textposition = 'auto',
        name = thal,
        texttemplate = "%{y:.2f} %"
    )
    traces.append(trace)
fig = go.Figure(data = traces)
fig.update_layout(title = 'Distribution of target based on Thal',
    xaxis_title="Target",
    yaxis_title="Percentage",
    legend_title="Thal"
)
iplot(fig)

# Preprocessing

In [None]:
df = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')

In [None]:
df['age'] = pd.cut(df['age'],bins=[0,47,61,100],labels=['Adult','Aging','Old'])
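
To make the binning above concrete: by default `pd.cut` uses right-closed intervals, so these bins are (0, 47], (47, 61] and (61, 100]. The sketch below applies the same bins and labels to a few hypothetical ages chosen to sit on the boundaries.

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([29, 47, 48, 61, 62, 77])

# Same bins and labels as in the notebook; intervals are right-closed by default.
groups = pd.cut(ages, bins=[0, 47, 61, 100], labels=["Adult", "Aging", "Old"])

print(pd.DataFrame({"age": ages, "group": groups}))
```

An age of exactly 47 falls in `Adult` and an age of exactly 61 falls in `Aging`, because each upper edge belongs to the lower bin.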

In [None]:
categorical_features = ['age','cp', 'fbs', 'exang', 'slope', 'ca', 'thal']
for feature in categorical_features:
    encoder = LabelEncoder()
    df[feature] = encoder.fit_transform(df[feature])

In [None]:
continuous_features = ['trestbps', 'chol','restecg','thalach','oldpeak']
scaler = StandardScaler()
df[continuous_features] = scaler.fit_transform(df[continuous_features])

In [None]:
y = df.target.values
X = df.drop(['target'], axis = 1)

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
stratifiedSplit = StratifiedShuffleSplit(n_splits=1, test_size = 0.1, random_state = 0)
for train_idx, test_idx in stratifiedSplit.split(X, y):
    x_train, x_test = X.iloc[train_idx,], X.iloc[test_idx,]
    y_train, y_test = y[train_idx], y[test_idx]
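
Note that the preprocessing above fits `StandardScaler` on the full dataset before the split, so statistics from the test rows influence the training features. One way to avoid this is to split first and let a pipeline fit the scaler on the training rows only. This is a sketch on synthetic data (`X_toy` and `y_toy` are hypothetical stand-ins, not the heart dataset).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two continuous features; not the real heart data.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[130.0, 240.0], scale=[15.0, 50.0], size=(200, 2))
y_toy = (X_toy[:, 0] + 0.1 * X_toy[:, 1] > 154).astype(int)

# Split first, keeping the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=0
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when it transforms the test rows.
model = make_pipeline(StandardScaler(), LogisticRegression(random_state=0))
model.fit(X_tr, y_tr)

print("Test accuracy: {:.2f}".format(model.score(X_te, y_te) * 100))
```

Because the scaler lives inside the pipeline, calling `model.predict` on new data applies the training-set mean and variance automatically.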

# Training and Evaluating Models

In [None]:
log_reg = LogisticRegression(random_state=0,class_weight='balanced')
log_reg.fit(x_train, y_train)
from sklearn.metrics import accuracy_score
train_acc = accuracy_score(y_train, log_reg.predict(x_train))
test_acc = accuracy_score(y_test, log_reg.predict(x_test))
print('-'*25)
print('Training Accuracy is {:.2f}'.format(train_acc*100))
print('-'*25)
print('-'*25)
print('Testing Accuracy is {:.2f}'.format(test_acc*100))
print('-'*25)

In [None]:
conf = confusion_matrix(y_test, log_reg.predict(x_test))
fig = px.imshow(conf)
fig.update_layout(
    title = 'Logistic Regression Confusion Matrix',
    xaxis_title = 'Predicted Label',
    yaxis_title = 'True Label'
)
iplot(fig)
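
A confusion-matrix heatmap is easier to interpret alongside the rates derived from it. The sketch below unpacks a 2x2 matrix into its four cells and computes sensitivity and specificity. The label arrays are hypothetical and are not predictions from the models in this kernel.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# Rows are true labels, columns are predicted labels, both ordered [0, 1].
conf = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = conf.ravel()

sensitivity = tp / (tp + fn)  # share of diseased cases that were caught
specificity = tn / (tn + fp)  # share of healthy cases correctly cleared

print(conf)
print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")
```

In a screening setting, false negatives (missed disease) are often the costlier error, so sensitivity is worth reporting alongside accuracy.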

In [None]:
max_acc = 0.0
neighbours = 0
for i in range(1,10):
    knn = KNeighborsClassifier(n_neighbors=i,p=1)
    knn.fit(x_train, y_train)
    test_acc = accuracy_score(y_test, knn.predict(x_test))
    if(test_acc>max_acc):
        max_acc = test_acc
        neighbours = i
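# Refit with the same distance metric (p=1) used in the search, so the reported accuracy belongs to this model
knn = KNeighborsClassifier(n_neighbors=neighbours, p=1)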
knn.fit(x_train, y_train)
train_acc = accuracy_score(y_train, knn.predict(x_train))
print('-'*25)
print('Training Accuracy is {:.2f} with {} neighbours'.format(train_acc*100, neighbours))
print('-'*25)
print('-'*25)
print('Maximum Testing Accuracy is {:.2f} with {} neighbours'.format(max_acc*100, neighbours))
print('-'*25)
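
The loop above picks `n_neighbors` by test accuracy, so the reported maximum is likely to be optimistic: the test set has effectively been used for tuning. A possible alternative is to choose `n_neighbors` by cross-validation on the training data and touch the test set only once at the end. This is a sketch on synthetic data (`X_toy`, `y_toy` and the 5-fold setting are assumptions, not part of the original kernel).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart dataset.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=0
)

# Search k = 1..9 with Manhattan distance (p=1), scoring each k by
# 5-fold cross-validated accuracy on the training split only.
search = GridSearchCV(
    KNeighborsClassifier(p=1),
    param_grid={"n_neighbors": list(range(1, 10))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_acc = search.score(X_te, y_te)  # the test set is used once, after tuning
print(f"Best k: {best_k}, CV accuracy: {search.best_score_:.2f}, test accuracy: {test_acc:.2f}")
```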

In [None]:
conf = confusion_matrix(y_test, knn.predict(x_test))
fig = px.imshow(conf)
fig.update_layout(
    title = 'KNN Classifier Confusion Matrix',
    xaxis_title = 'Predicted Label',
    yaxis_title = 'True Label'
)
iplot(fig)

In [None]:
rf = RandomForestClassifier(n_estimators=5,min_samples_split=15,random_state = 0, class_weight='balanced_subsample')
rf.fit(x_train, y_train)
test_acc = accuracy_score(y_test, rf.predict(x_test))
train_acc = accuracy_score(y_train, rf.predict(x_train))
print('-'*25)
print('Training Accuracy is {:.2f}'.format(train_acc*100))
print('-'*25)
print('-'*25)
print('Testing Accuracy is {:.2f}'.format(test_acc*100))
print('-'*25)

In [None]:
conf = confusion_matrix(y_test, rf.predict(x_test))
fig = px.imshow(conf)
fig.update_layout(
    title = 'Random Forest Classifier Confusion Matrix',
    xaxis_title = 'Predicted Label',
    yaxis_title = 'True Label'
)
iplot(fig)

## We decide to use KNN Classifier as the final model.
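
With a test set of roughly 10% of a small dataset, accuracy differences between models can come down to a handful of samples. One way to put the three models on a more even footing is to compare their mean cross-validated accuracy. The sketch below does so on synthetic data: `X_toy`, `y_toy`, the 5 folds and `n_neighbors=5` are assumptions, and the other hyperparameters mirror the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the heart dataset.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(random_state=0, class_weight="balanced"),
    "KNN": KNeighborsClassifier(n_neighbors=5, p=1),  # n_neighbors=5 is an assumed value
    "Random Forest": RandomForestClassifier(
        n_estimators=5, min_samples_split=15, random_state=0,
        class_weight="balanced_subsample",
    ),
}

# Mean 5-fold cross-validated accuracy for each model.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_scores[name] = float(scores.mean())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```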