**HEART DISEASE PREDICTION**

The dataset is publicly available on Kaggle and comes from an ongoing cardiovascular study of residents of Framingham, Massachusetts.

The goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD).

In [None]:
#importing necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import imblearn

**Exploratory Data Analysis**

In [None]:
#Reading data
data = pd.read_csv("../input/heart-disease-prediction-using-logistic-regression/framingham.csv")
data.head()

In [None]:
#Pie chart of data
data['TenYearCHD'].value_counts().plot.pie(explode=[0,0.2],autopct='%1.1f%%',shadow=True)

About 15.2% of people have a 10-year risk of future coronary heart disease. This also tells us that the classes are imbalanced, which we will handle later.
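
To see why the imbalance matters, here is a small sketch. With roughly 15% positives, a model that always predicts "no CHD" is already right about 85% of the time, so accuracy alone says little about how well the minority class is detected. The label counts below are hypothetical and chosen only to mirror the 15.2% share in the pie chart.

```python
import pandas as pd

# Hypothetical labels mirroring the 15.2% positive share seen in the pie chart
labels = pd.Series([0] * 848 + [1] * 152, name="TenYearCHD")

# Share of each class
class_share = labels.value_counts(normalize=True)
print(class_share)

# Accuracy of a baseline that always predicts the majority class (0)
baseline_accuracy = (labels == 0).mean()
print(f"Always-negative baseline accuracy: {baseline_accuracy:.3f}")
```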

In [None]:
#Plotting a count plot of age, split by TenYearCHD
plt.figure(figsize=(12, 6))
#x is passed by keyword because recent seaborn versions no longer accept it positionally
sns.countplot(x='age', hue='TenYearCHD', data=data)

**Insight**: Risk seems to increase with age.

In [None]:
plt.figure(figsize=(15, 12))

#One count plot per categorical feature, split by TenYearCHD
cat_cols = ['male', 'education', 'currentSmoker', 'BPMeds', 'prevalentStroke', 'prevalentHyp', 'diabetes']
for i, col in enumerate(cat_cols, start=1):
    plt.subplot(3, 3, i)
    #x is passed by keyword because recent seaborn versions no longer accept it positionally
    sns.countplot(x=col, hue='TenYearCHD', data=data)

plt.show()

**Insight**:
1. Risk seems slightly higher for males.
2. Risk appears lower at higher education levels, possibly because more educated people know more about how to take care of themselves.
3. Risk looks about the same for smokers and non-smokers.
4. People taking blood pressure medication (BPMeds) have a higher risk.
5. People who have had a stroke are at greater risk.
6. People who are hypertensive are at greater risk.
7. People with diabetes are at a higher risk.

In [None]:
plt.figure(figsize=(15, 12))

#One boxplot per continuous feature, grouped by TenYearCHD
box_cols = ['totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
for i, col in enumerate(box_cols, start=1):
    plt.subplot(3, 3, i)
    #x and y are passed by keyword because recent seaborn versions no longer accept them positionally
    sns.boxplot(x='TenYearCHD', y=col, data=data, palette='viridis')

plt.show()

**Insight**:
1. People with risk of CHD seem to have slightly elevated cholesterol levels.
2. People with risk of CHD seem to have elevated systolic blood pressure (sysBP).
3. People with risk of CHD seem to have elevated diastolic blood pressure (diaBP).
4. People with risk of CHD seem to have slightly elevated BMI.
5. People with risk of CHD seem to have a slightly elevated heart rate.
6. People with risk of CHD seem to have slightly elevated glucose levels.

In [None]:
#Checking if data is missing
data.info()
print("------------------------")
data.isnull().sum()

Some columns contain missing values.

In [None]:
#Calculating the Missing Values % contribution in data
data_null = data.isna().mean().round(4) * 100
data_null.sort_values(ascending=False).head()

In [None]:
#A dendrogram is a diagram that shows the hierarchical relationship between features
#Here we look at a dendrogram of the missing data, which groups columns whose values tend to be missing together
import missingno as msno
msno.dendrogram(data)

Handling missing data

In [None]:
#Imputing the missing data with the mean of each column
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

#fit_transform returns a plain array, so the column names and index are restored here
data_new = pd.DataFrame(imputer.fit_transform(data), columns=data.columns, index=data.index)

In [None]:
data_new.isnull().sum()
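
One caveat of mean imputation is that it is applied to every column, including categorical ones such as education, where the mean is usually not a valid category. As a minimal sketch of one possible alternative (not what this notebook does), categorical columns can be filled with the most frequent value and continuous columns with the median. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: one categorical column and one continuous column, each with a gap
toy = pd.DataFrame({
    "education": [1, 2, 2, np.nan, 4],
    "glucose": [80, np.nan, 90, 100, 110],
})

# Mean imputation fills education with 2.25, which is not one of the category codes
mean_filled = toy.fillna(toy.mean())

# Alternative: most frequent value for the categorical column, median for the continuous one
mixed_filled = toy.copy()
mixed_filled["education"] = toy["education"].fillna(toy["education"].mode()[0])
mixed_filled["glucose"] = toy["glucose"].fillna(toy["glucose"].median())

print(mean_filled)
print(mixed_filled)
```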

Now let's check for outliers.

In [None]:
#Plotting boxplot of features to find outliers
plt.figure(figsize=(15, 8))

outlier_cols = ['age', 'cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
for i, col in enumerate(outlier_cols, start=1):
    plt.subplot(3, 3, i)
    #x is passed by keyword because recent seaborn versions no longer accept it positionally
    sns.boxplot(x=data_new[col], color='yellow')

plt.show()

We will handle these outliers using the Z-score.

In [None]:
"""
Z-score:
This score tells us whether a data value is greater or smaller than the mean and how many standard deviations away from the mean it is.
If the absolute Z-score of a data point is more than 3, the data point is quite different from the other data points. Such a data point can be an outlier.
"""

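from scipy import stats
z = np.abs(stats.zscore(data_new))
threshold = 3
print(np.where(z > threshold)) # The first array contains the row positions and the second array the respective column positions

#Removing outliers: keep only the rows where every column is within the threshold
#.copy() gives an independent DataFrame so later column assignments do not trigger pandas warnings
data_new = data_new[(z < threshold).all(axis=1)].copy()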
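
A side effect worth knowing: the filter above is applied to every column, including the binary ones. For a 0/1 column with positive rate p, the Z-score of a positive row is sqrt((1 - p) / p), which exceeds 3 whenever p is below 10%, so every row carrying a rare flag is dropped as an "outlier". The sketch below shows this on made-up data and compares it with filtering on the continuous column only; restricting the filter is one possible variation, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up data: one extreme sysBP value (last row) and one rare positive flag (first row)
toy = pd.DataFrame({
    "sysBP": np.r_[np.linspace(100, 140, 19), 300.0],
    "prevalentStroke": [1] + [0] * 19,
})

# Z-score by hand: (value - mean) / population standard deviation
z_toy = ((toy - toy.mean()) / toy.std(ddof=0)).abs()

# Filtering on every column drops the extreme sysBP row and the row with the rare flag
kept_all = toy[(z_toy < 3).all(axis=1)]

# Filtering on the continuous column only drops just the extreme sysBP row
kept_continuous = toy[z_toy["sysBP"] < 3]

print(len(toy), len(kept_all), len(kept_continuous))
```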

In [None]:
#Plotting the boxplots again to check the effect of removing outliers
plt.figure(figsize=(15, 8))

outlier_cols = ['age', 'cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
for i, col in enumerate(outlier_cols, start=1):
    plt.subplot(3, 3, i)
    #x is passed by keyword because recent seaborn versions no longer accept it positionally
    sns.boxplot(x=data_new[col], color='yellow')

plt.show()

In [None]:
#Looking at some of the properties such as mean and std of each feature
data_new.describe()

Resampling data

In [None]:
#Viewing class distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='TenYearCHD', data=data_new)
plt.title('Class Distributions')

We can see that the data is imbalanced. This could lead to poor precision and recall on the minority class. We will use SMOTE and RandomUnderSampler to resample the data.

Before that, we will scale the following features:

age, cigsPerDay, totChol, sysBP, diaBP, BMI, heartRate and glucose

In [None]:
#We will use RobustScaler for scaling as it is less sensitive to outliers
from sklearn.preprocessing import RobustScaler
sc = RobustScaler()

#RobustScaler centres and scales each column independently, so all columns can be transformed in one call
scale_cols = ['age', 'cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
data_new[scale_cols] = sc.fit_transform(data_new[scale_cols])

data_new.head()
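
With its default settings, RobustScaler subtracts the median of a column and divides by its interquartile range (75th minus 25th percentile). The sketch below reproduces that calculation by hand on made-up values and puts standard scaling next to it for comparison. Because the median and the quartiles barely move when one extreme value is added, the ordinary values keep a sensible spread.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up values with one outlier

median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])

# Robust scaling: centre on the median, divide by the interquartile range
robust_scaled = (values - median) / (q3 - q1)

# Standard scaling for comparison: centre on the mean, divide by the standard deviation
standard_scaled = (values - values.mean()) / values.std()

print(robust_scaled)    # [-1.  -0.5  0.   0.5 48.5]
print(standard_scaled)
```

In the standard-scaled output the four ordinary values are squeezed close together, because the single outlier inflates both the mean and the standard deviation.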

In [None]:
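#Separating the features (all columns except the last) from the target (TenYearCHD, the last column)
X = data_new.iloc[:,:-1].values
y = data_new.iloc[:, -1].values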

In [None]:
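#Oversampling the minority class with SMOTE, followed by random undersampling of the majority class
#With default settings SMOTE already balances the classes, so the undersampler leaves the class counts unchanged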
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

over = SMOTE()
under = RandomUnderSampler()
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)

X, y = pipeline.fit_resample(X, y)
print(X.shape, y.shape)

#Reshaping y into a column vector
y = y.reshape(len(y), 1)
print(X.shape, y.shape)

Data successfully resampled
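
The core idea of SMOTE is interpolation: each synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. The function below is a simplified illustration of that idea in plain NumPy, not imblearn's implementation. The name `smote_like_samples` and its parameters are hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every sample
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate: x_new = x + lam * (neighbour - x), with lam drawn from [0, 1)
    lam = rng.random((n_new, 1))
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])

# Made-up minority samples in two dimensions
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_samples(X_min, n_new=10, k=3)
print(synthetic.shape)
```

Because every new point lies between two existing minority samples, the synthetic data stays inside the region the minority class already occupies.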

In [None]:
#Viewing class distribution after resampling
df_temp = {'TenYearCHD' : y[:,0]}
df = pd.DataFrame(df_temp)

plt.figure(figsize=(6, 4))
sns.countplot(x='TenYearCHD', data=df)
plt.title('Class Distributions after resampling')

Now the dataset is balanced

In [None]:
#Splitting data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
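
Note that the resampling above happens before the split, so synthetic points derived from rows that end up in the test set can also land in the training set, which tends to make the test scores look better than they would on new patients. A common alternative is to split first and resample only the training portion. The sketch below shows that order on made-up data. To stay dependency-light it uses random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE; with imblearn, the SMOTE call on the training portion would go in the same place. All `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Made-up imbalanced data: 170 negatives, 30 positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 170 + [1] * 30)

# 1. Split first; stratify keeps the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=0
)

# 2. Resample the training part only (random oversampling as a stand-in for SMOTE)
majority = X_tr[y_tr == 0]
minority = X_tr[y_tr == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.array([0] * len(majority) + [1] * len(minority_up))

# The test part keeps its original, imbalanced class ratio
print(X_tr_bal.shape, np.bincount(y_tr_bal), np.bincount(y_te))
```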

**Model Training**

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
#Flattening y_train back to 1-D, the shape the classifier expects
y_train = y_train.reshape(len(y_train))
classifier.fit(X_train, y_train)


In [None]:
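#Evaluating the model on the held-out test set
from sklearn.metrics import classification_report
y_pred = classifier.predict(X_test)
#y_test is a column vector, so it is flattened to match the shape of y_pred
print(classification_report(y_test.ravel(), y_pred))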
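
The report lists precision, recall and F1-score for each class. As a reminder of what these numbers mean, the sketch below computes them by hand for the positive class on a small made-up set of labels and predictions.

```python
import numpy as np

# Made-up true labels and predictions
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = np.sum((y_true_demo == 1) & (y_pred_demo == 1))  # true positives
fp = np.sum((y_true_demo == 0) & (y_pred_demo == 1))  # false positives
fn = np.sum((y_true_demo == 1) & (y_pred_demo == 0))  # false negatives

# Precision: of the predicted positives, how many are truly positive
precision = tp / (tp + fp)
# Recall: of the true positives, how many were found
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```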