<a id="4"></a><h1 style='background:#33cccc; border:3; color:white'><center> Machine Learning For Heart Failure Prediction: Boosting </center></h1>

<center><img 
src="https://media.giphy.com/media/woNmVBgHCS0xO/giphy.gif" width="900" height="900"></img></center>

<br>


<a id="1"></a><h1 style='background:#ccffcc
; border:0; color:black'><center> Table of contents </center></h1>

1. [Introduction](#1)
2. [Data cleaning, exploration and preprocessing](#2)
3. [Model building](#3)
4. [Improving the accuracy of the model](#5)
5. [Prediction](#6)

<a id="1"></a><h1 style='background:#ccffcc
; border:0; color:black'><center> Introduction </center></h1>

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.
Heart failure is a common event caused by CVDs.
Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

### Variables in the dataset:

1. Age: Age of the patient in years
2. Anaemia: Whether the patient's haemoglobin was below the normal range
3. Creatinine_phosphokinase: The level of creatine phosphokinase in the blood in mcg/L
4. Diabetes: Whether the patient was diabetic
5. Ejection_fraction: A measurement of how much blood the left ventricle pumps out with each contraction
6. High_blood_pressure: Whether the patient had hypertension
7. Platelets: Platelet count in the blood in kiloplatelets/mL
8. Serum_creatinine: The level of serum creatinine in the blood in mg/dL
9. Serum_sodium: The level of serum sodium in the blood in mEq/L
10. Sex: The sex of the patient
11. Smoking: Whether the patient smokes or has ever smoked
12. Time: Length of the patient's follow-up period in days
13. Death_event: Whether the patient died during the follow-up period


<a id="2"></a><h1 style='background:#ccffcc
; border:0; color:black'><center> Data cleaning, exploration and preprocessing  </center></h1>

In [None]:
import os
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# List the input files available in the Kaggle environment
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Load the dataset
data = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
data.head()

The dataset contains the records of 299 patients. Its features include:

- Age: age of the patient (years)
- Anaemia: decrease of red blood cells or haemoglobin (Boolean)
- High Blood Pressure: if the patient has hypertension (Boolean)
- Creatinine Phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- Diabetes: if the patient has diabetes (Boolean)
- Ejection Fraction: percentage of blood leaving the heart at each contraction (percentage)
- Platelets: platelets in the blood (kiloplatelets/mL)
- Sex: woman or man (binary)
- Serum Creatinine: level of serum creatinine in the blood (mg/dL)
- Serum Sodium: level of serum sodium in the blood (mEq/L)

In [None]:
# Features
data.columns

In [None]:
# Count the missing values in each column
data.isna().sum()

In [None]:
# Separate the target (DEATH_EVENT) from the input features
x = data.drop(labels='DEATH_EVENT', axis=1)
y = data['DEATH_EVENT']

In [None]:
x.head()

In [None]:
# Correlation between the variables in the study
# (Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it)
data.corr().style.background_gradient(cmap='Spectral').format(precision=2)
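
A full correlation matrix can be hard to scan, so it sometimes helps to rank the features by the strength of their correlation with the target. The sketch below shows one way to do that. The toy values are made up for illustration and are not taken from the dataset; only the column names mirror it.

```python
import pandas as pd

# Toy data (made-up values, not the real dataset)
toy = pd.DataFrame({
    "ejection_fraction": [60, 55, 50, 30, 25, 20],
    "serum_creatinine": [0.9, 1.0, 1.1, 1.8, 2.0, 2.5],
    "platelets": [250, 300, 200, 260, 310, 210],
    "DEATH_EVENT": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of every feature with the target
target_corr = toy.corr()["DEATH_EVENT"].drop("DEATH_EVENT")

# Order by absolute strength while keeping the sign in the output
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

On the real data the same two lines would be applied to `data` instead of `toy`.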

In [None]:
# Distribution of patient age
plt.figure(figsize=(15,5))
plt.hist(data['age'],bins = 50,edgecolor = 'black')
plt.xlabel('Age range')
plt.ylabel('Frequency')
plt.title('Age Distribution Graph')

The histogram above shows how often each age occurs in the data. Most of the patients in the study are between 40 and 80 years old.

In [None]:
plt.figure(figsize=(15,5))
sns.countplot(x='age', hue='DEATH_EVENT', data=data)

The figure above shows the number of deaths and survivals at each age.

In [None]:
#Prevalence of outcome event
sns.set_theme(context='poster')
plt.figure(figsize=(10,7))
plt.title('Disease status \n (Survived (0), Death (1))', fontsize=20)
cols= ["#ffff99","#ff6600"]
sns.countplot(x= data["DEATH_EVENT"], palette= cols)
plt.show()

### Violin and swarm plots of the non-binary features

In [None]:
feature = ["age","creatinine_phosphokinase","ejection_fraction","platelets","serum_creatinine","serum_sodium", "time"]
for i in feature:
    plt.figure(figsize=(8,8))
    # Draw the violins first so the swarm points stay visible on top of them
    sns.violinplot(x=data["DEATH_EVENT"], y=data[i], palette=cols)
    sns.swarmplot(x=data["DEATH_EVENT"], y=data[i], color="black", alpha=0.5)
    plt.show()

In [None]:
# Standardize the features of the dataset
col_names = list(x.columns)
s_scaler = preprocessing.StandardScaler()
X_df = s_scaler.fit_transform(x)
X_df = pd.DataFrame(X_df, columns=col_names)
X_df.describe().T
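
Standard scaling replaces each value x with z = (x - mean) / std, computed per column. The sketch below checks that formula against `StandardScaler` on a small made-up matrix; the two columns only loosely mimic age and platelet count and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values) with columns on very different scales
X_toy = np.array([[50.0, 150000.0],
                  [60.0, 250000.0],
                  [70.0, 350000.0],
                  [80.0, 450000.0]])

# Manual z-score with the population standard deviation (ddof=0),
# which is the convention StandardScaler follows
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the `describe()` output above shows comparable ranges for all features.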

In [None]:
#Examining the scaled features
sns.set_style("whitegrid")

plt.figure(figsize=(20,15))
plt.title('Examining the scaled features (of columns)', color="y",fontsize=30)
#colours =["#774571","#b398af","#f1f1f1" ,"#afcdc7", "#6daa9f"]
sns.violinplot(data = X_df,palette = 'Set2')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

<a id="3"></a><h1 style='background:#ccffcc; border:0; color:black'><center> Model building  </center></h1>

### Standard Scaler

Because the features differ widely in magnitude, we scale them with `StandardScaler`. We then split the data into training and test sets and fit an XGBoost classifier (`XGBClassifier`) on the training data.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaled_data=scaler.fit_transform(x)

In [None]:
from sklearn.model_selection import train_test_split
train_x,test_x,train_y,test_y=train_test_split(scaled_data,y,test_size=0.3,random_state=42)
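
The cells above fit the scaler on the full dataset before splitting, so the test rows contribute to the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data; the `*_demo` names, the random values and the `stratify` argument are assumptions for illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (not the real dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=100.0, scale=25.0, size=(200, 4))
y_demo = rng.integers(0, 2, size=200)

# Split first; stratify keeps the class ratio similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it on the test rows
demo_scaler = StandardScaler().fit(X_train)
X_train_scaled = demo_scaler.transform(X_train)
X_test_scaled = demo_scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Tree-based models such as XGBoost are largely insensitive to feature scaling, so the difference is usually small here, but the split-then-scale order carries over safely to models where scaling matters.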

In [None]:
# fit model on training data
model = XGBClassifier(objective='binary:logistic')
model.fit(train_x, train_y)

Next we check the training accuracy and the initial test accuracy, in that order. After that we will try to make the model more accurate.

### Accuracy

In [None]:
# checking training accuracy
y_pred = model.predict(train_x)  # predict() already returns class labels (0 or 1)
accuracy = accuracy_score(train_y, y_pred)
accuracy

In [None]:
# checking initial test accuracy
y_pred = model.predict(test_x)  # predict() already returns class labels (0 or 1)
accuracy = accuracy_score(test_y, y_pred)
accuracy
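
Accuracy alone can hide how a model treats each class, especially when one outcome is rarer than the other. The sketch below shows how a confusion matrix, precision and recall could complement it. The labels are hand-made for illustration and are not model output from this notebook.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hand-made labels for illustration (1 = death event, 0 = survived)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
prec = precision_score(y_true_demo, y_pred_demo)  # share of predicted deaths that were real
rec = recall_score(y_true_demo, y_pred_demo)      # share of real deaths that were caught

print(cm)
print(acc, prec, rec)
```

In this toy case the accuracy is 0.7, yet only half of the death events are detected (recall 0.5), which is the kind of gap a single accuracy figure does not reveal.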

In [None]:
test_x[0]

<a id="5"></a><h1 style='background:#ccffcc; border:0; color:black'><center> Improving the accuracy of the model  </center></h1>

### To find the best parameters

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Keys must match the XGBClassifier parameter names exactly (no stray spaces)
param_grid = {
    'learning_rate': [1, 0.5, 0.1, 0.001],
    'max_depth': [3, 5, 15, 30],
    'n_estimators': [10, 70, 150, 300]
}

In [None]:
grid= GridSearchCV(XGBClassifier(objective='binary:logistic'),param_grid, verbose=4)

In [None]:
grid.fit(train_x,train_y)

In [None]:
# Find the parameters giving the maximum accuracy
grid.best_params_
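
Besides `best_params_`, a fitted `GridSearchCV` also exposes `best_score_` and the full `cv_results_` table, which shows how close the runner-up combinations were. The sketch below illustrates this on synthetic data. It uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier` so that it runs without XGBoost installed; the smaller grid and `cv=3` are also assumptions chosen to keep it fast.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (not the heart failure dataset)
X_gs, y_gs = make_classification(n_samples=200, n_features=8, random_state=0)

demo_grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid={
        "learning_rate": [0.1, 0.5],
        "max_depth": [2, 3],
        "n_estimators": [10, 50],
    },
    scoring="accuracy",
    cv=3,
)
demo_grid.fit(X_gs, y_gs)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_grid.cv_results_)
top = results.sort_values("rank_test_score")[
    ["params", "mean_test_score", "std_test_score"]
].head(3)

print(demo_grid.best_params_, round(demo_grid.best_score_, 3))
print(top)
```

Note that `best_score_` is the mean cross-validated score on the training folds, so the held-out test set should still be used for the final comparison.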

The grid search returns the best parameter combination, which we use to train a new model.

In [None]:
# Create a new model using the best parameters found by the grid search
new_model=XGBClassifier(learning_rate= 1, max_depth= 5, n_estimators= 10)
new_model.fit(train_x, train_y)

### Improved Accuracy

In [None]:
# Test accuracy of the tuned model
y_pred_new = new_model.predict(test_x)  # predict() already returns class labels (0 or 1)
accuracy_new = accuracy_score(test_y, y_pred_new)
accuracy_new

In [None]:
# Save the tuned model to disk
filename = 'xgboost_model.pickle'
with open(filename, 'wb') as f:
    pickle.dump(new_model, f)

# Load it back to confirm that the saved file can be read
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [None]:
# Save the scaler object as well; new inputs must be scaled the same way before prediction
filename_scaler = 'scaler_model.pickle'
with open(filename_scaler, 'wb') as f:
    pickle.dump(scaler, f)

with open(filename_scaler, 'rb') as f:
    scaler_model = pickle.load(f)
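
A quick way to gain confidence in a saved object is to check that the restored copy behaves like the original. The sketch below does this round trip for a small scaler; the temporary file name and the toy values are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made-up values) and a fitted scaler to round-trip
X_small = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
fitted = StandardScaler().fit(X_small)

# Write to a temporary directory so no file is left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scaler_demo.pickle")
    with open(path, "wb") as f:
        pickle.dump(fitted, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored scaler should transform data exactly like the original
same = np.allclose(fitted.transform(X_small), restored.transform(X_small))
print(same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.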

<a id="6"></a><h1 style='background:#ccffcc
; border:0; color:black'><center> Prediction </center></h1>

In [None]:
# Try a prediction on one sample; the values must follow the feature column order used for training
d=scaler_model.transform([[65.0, 0, 146, 0, 20, 0, 162000.00, 1.3, 129, 1, 1, 7]])
pred=loaded_model.predict(d)
print('This data belongs to class :',pred[0])
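
Passing a bare list of numbers makes it easy to put a value in the wrong position. One option is to label the sample with the feature names first. The sketch below does this for the sample used above, assuming the column names and order of the dataset's CSV file.

```python
import pandas as pd

# Feature names in training order (assumed to match the CSV columns)
feature_names = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
]
values = [65.0, 0, 146, 0, 20, 0, 162000.00, 1.3, 129, 1, 1, 7]

# One-row DataFrame, so each value is visibly tied to its feature
sample = pd.DataFrame([dict(zip(feature_names, values))])
print(sample.T)
```

Because the scaler in this notebook was fitted on a DataFrame, `scaler_model.transform(sample)` could then be called with matching column names instead of an unlabeled list.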

<center><img 
src="https://i.pinimg.com/originals/b3/70/5c/b3705cc2edf8f527789e6e2be29f6267.gif" width="900" height="900"></img></center>

<br>