# HEART FAILURE PREDICTION


In [None]:
# Data cleaning and EDA
import pandas as pd
import numpy as np
import math
import json
import matplotlib.pyplot as plt
import seaborn as sns

## Introduction

The heart is among the most important organs, and predicting heart failure early is vital for any patient with cardiovascular diseases (CVDs). In this context, electronic health records (EHRs, also called medical records) are a useful source of information on which several screening studies have drawn. However, healthcare data comes from many sources and has many attributes, which makes it difficult to handle manually. Many models, ranging from standard statistical techniques for modest datasets to machine learning models for large-scale datasets, have been developed and applied to identify risk at the early stages of heart failure.

This project is mainly concerned with standard statistical techniques, as they are sufficient for a dataset of 299 patients, and addresses 3 main problems: (1) analyze how health indicators change with the gender and age of the patients, (2) analyze which factors contribute to the mortality rate of a patient, and (3) determine whether, from the given data, we can build a model that effectively predicts a patient's probability of death. In the following parts, we give a quick review of the related work on this data set and discuss the results.


## I. Exploratory Data Analysis

In this section, some basic investigation about the data is performed to gain deeper insight about the data set. Then, more analysis is conducted to answer our research question.

In [None]:
data = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

In [None]:
data.head()

### 1. Overview on data set

The dataset contains the medical records of 299 heart failure patients: 194 men and 105 women. The patients' ages vary from 40 to 95 years old. The dataset keeps track of clinical, body, and lifestyle information of the patients, which are called features in this research. There are 13 features in the dataset, 12 of which are considered to be potential contributors to the mortality of patients with Cardiovascular Heart Disease (CHD): age, anaemia, Creatinine-Phosphokinase (CPK), diabetes, Ejection Fraction (EF), High Blood Pressure (HBP), platelets, serum creatinine, serum sodium, sex, smoking and follow-up time. The remaining feature, death event, is the target. Some of the features are binary (anaemia, diabetes, HBP, sex, smoking and death event) and are treated as categorical features. The others take continuous or count values and are treated as numerical features.


In [None]:
data.info()

There are 13 rows in the above table representing 13 clinical features in the patients’ profile: 12 predictor features and one target feature (death event). No feature has null values and every column is of type float or integer, which means the data is clean and ready to be analysed.

In [None]:
categorical_features = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking', 'DEATH_EVENT']
numerical_features = ['age', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium', 'time']
health_features = ['anaemia', 'diabetes', 'high_blood_pressure', 'creatinine_phosphokinase', 'ejection_fraction', 'platelets', 'serum_creatinine', 'serum_sodium']

In [None]:
data[numerical_features].describe()

The table above reports descriptive statistics for the numerical features of the dataset. The count, mean, standard deviation, minimum, maximum and quartiles of each numerical feature are computed over the full sample (all patients). These statistics can then be used to build the box plot of each feature.


##### Age

In [None]:
sns.histplot(x=data.age)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.show()

The age distribution is shown in the histogram above. All of the patients are between 40 and 95 years old, with most of them between 40 and 72. The 56-to-62 age range has the highest count, with more than 50 patients. From the age of 75 onwards, the number of patients falls as age increases. The last three bins contain the fewest patients (under 10 each).


In [None]:
sns.displot(data=data, x="age", hue="DEATH_EVENT", multiple="stack", kind='kde')
plt.title('Distribution of Age and Death event')
plt.xlabel('Age')
plt.show() 

The distribution of age by death event is shown in the stacked density plot above. The area between the two curves represents the density of surviving patients with respect to their age, while the area between the lower curve and the horizontal axis represents the density of deceased patients. In detail, the survival area grows from the age of 45 to 75 and shrinks after 75. The survival density peaks above 0.03 between 60 and 70 years old. Patients aged 80 or above have the thinnest survival band, with a survival density below 0.01 compared to a death density of 0.005.


##### Ejection fraction

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
sns.boxplot(x=data.ejection_fraction, ax=ax)
plt.title('Box plot of ejection fraction')
plt.xlabel('Ejection fraction')
plt.show()

The box plot of ejection fraction is shown in the figure above. The median (the line inside the box) is 38%, and Q1 and Q3 are 30% and 45% respectively. The whiskers extend from about 10% to about 65%.


In [None]:
sns.displot(data=data, x="ejection_fraction", hue="DEATH_EVENT", multiple="stack", kind='kde')
plt.title('Distribution of ejection fraction and Death event')
plt.xlabel('ejection_fraction')
plt.show()

The distribution of ejection fraction by death event is shown above. The area between the two curves represents the survival density, while the area between the lower curve and the horizontal axis is the death density. The survival density peaks above 0.04 at an ejection fraction of around 40%, but drops significantly when the ejection fraction is either below 20% or above 70%.

##### Creatinine phosphokinase

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
sns.boxplot(x=data.creatinine_phosphokinase, ax=ax)
plt.title('Box plot of creatinine phosphokinase')
plt.xlabel('Creatinine phosphokinase')
plt.show()

There are many outliers in the creatinine phosphokinase feature. We can consider eliminating them during the feature engineering process.
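One common way to flag such outliers is the 1.5 × IQR rule that box plots use for their whiskers. The snippet below is a minimal sketch on made-up values, not the notebook's data; the array `values` and the 1.5 multiplier are assumptions chosen for illustration.

```python
import numpy as np

# Toy values standing in for a skewed feature such as creatinine_phosphokinase.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences used by a standard box plot.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True where a value falls outside the fences.
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])
```

Whether the flagged rows should be dropped, capped or kept is a modelling decision; extreme laboratory values can be clinically meaningful.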

In [None]:
fig, ax = plt.subplots(figsize=(20,10))
sns.histplot(x=data.creatinine_phosphokinase, ax=ax)
plt.title('Distribution of creatinine phosphokinase')
plt.xlabel('creatinine phosphokinase')

plt.show()

The distribution of creatinine phosphokinase is shown in the histogram above. The level of the CPK enzyme in the blood mostly lies between 0 and just under 3000 mcg/L.


In [None]:
sns.displot(data=data, x="creatinine_phosphokinase", hue="DEATH_EVENT", multiple="stack", kind='kde')
plt.title('Distribution of creatinine phosphokinase and Death event')
plt.xlabel('creatinine_phosphokinase')
plt.show()

The plot above shows the distribution of creatinine phosphokinase by death event. The survival density is largest when the level of the CPK enzyme in the blood is between 0 and 1000 mcg/L and decreases when the level is higher than 2000 mcg/L.


##### Platelets

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
sns.boxplot(x=data.platelets, ax=ax)
plt.title('Boxplot of platelets')
plt.xlabel('Platelets')
plt.show()

The box plot of platelets shows that the median platelet count across all patients is about 260 kiloplatelets/mL. The Q1 and Q3 values are 220 kiloplatelets/mL and 300 kiloplatelets/mL, respectively. There are many outliers in the box plot.


In [None]:
sns.displot(data=data, x="platelets", hue="DEATH_EVENT", multiple="stack", kind='kde')
plt.title('Distribution of platelets and Death event')
plt.xlabel('platelets')
plt.show()

The distribution of platelets by death event is shown in the density plot above. The survival area, which lies between the two curves, grows when the platelet count is between 200 kiloplatelets/mL and 350 kiloplatelets/mL and shrinks outside that range.


##### Serum creatinine

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
sns.boxplot(x=data.serum_creatinine, ax=ax)
plt.title('Box plot of serum creatinine')
plt.xlabel('Serum creatinine')
plt.show()

In [None]:
sns.displot(data=data, x="serum_creatinine", hue="DEATH_EVENT", multiple="stack", kind='kde')
plt.title('Distribution of serum_creatinine and Death event')
plt.xlabel('serum_creatinine')
plt.show()

The distribution of serum creatinine by death event is shown in the density plot above. The survival area lying between the two curves grows when the level of creatinine in the blood is between 1 mg/dL and 1.8 mg/dL. However, the survival density decreases dramatically when the level of creatinine is either below 1 mg/dL or above 2 mg/dL.


##### Serum sodium

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
sns.boxplot(x=data.serum_sodium, ax=ax)
plt.title('Box plot of serum sodium')
plt.xlabel('Serum sodium')
plt.show()

In [None]:
sns.displot(data=data, x="serum_sodium", hue="DEATH_EVENT", multiple="stack", kind='kde')
plt.title('Distribution of serum_sodium and Death event')
plt.xlabel('serum_sodium')
plt.show()

In the box plot, the lower whisker ends at about 125 mEq/L and the maximum value is about 150 mEq/L. The Q1 value is 134 mEq/L (Q3 can be read from the `describe()` table above) and the median is 137 mEq/L. There are many outliers to the left.


##### Time

In [None]:
fig, ax = plt.subplots(figsize=(20,5))
sns.boxplot(x=data.time, ax=ax)
plt.title('Box plot of time')
plt.show()

The average follow-up time is 130 days. Follow-up times range from 4 to 285 days.


#### Categorical features

##### Anaemia

In [None]:
data['anaemia'].value_counts().plot(kind='bar')
plt.xlabel("anaemia", labelpad=14)
plt.ylabel("Count of people", labelpad=14)
plt.title('Number of patients with and without anaemia')

More than 160 patients did not have anaemia. The number of patients with anaemia was about 40 lower, at just over 120.


In [None]:
pd.crosstab(data.anaemia  ,data.DEATH_EVENT).plot(kind='bar')
plt.title('Mortality rate correlating to anaemia')
plt.xlabel('Anaemia')
plt.ylabel('Count of people')
plt.show()

The death counts by anaemia status are shown in the bar chart above. The first two bars show the counts for the patients who did not have anaemia and the other two bars show the counts for those who did. Among those without anaemia, about 120 patients survived and 50 died. Among those with anaemia, about 80 survived and 40 died.
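Because the two groups differ in size, raw counts are easier to compare once they are turned into a per-group mortality rate. The snippet below is a minimal sketch on a tiny made-up table (the `toy` frame is hypothetical and only reuses the notebook's column names); the mean of a 0/1 column within each group is the share of deaths in that group.

```python
import pandas as pd

# Hypothetical toy data: 4 patients without anaemia, 4 with.
toy = pd.DataFrame({
    "anaemia":     [0, 0, 0, 0, 1, 1, 1, 1],
    "DEATH_EVENT": [0, 0, 0, 1, 0, 0, 1, 1],
})

# Mean of the binary target per group = mortality rate of that group.
mortality_rate = toy.groupby("anaemia")["DEATH_EVENT"].mean()
print(mortality_rate)
```

The same pattern should apply to the other binary features (diabetes, high blood pressure, sex, smoking) by changing the grouping column.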


##### Diabetes

In [None]:
data['diabetes'].value_counts().plot(kind='bar')
plt.xlabel("diabetes", labelpad=14)
plt.ylabel("Count of people", labelpad=14)
plt.title('Number of patients with and without diabetes')

In [None]:
pd.crosstab(data.diabetes  ,data.DEATH_EVENT).plot(kind='bar')
plt.title('Mortality rate correlating to diabetes')
plt.xlabel('Diabetes')
plt.ylabel('Count of people')
plt.show()

Among the patients who did not have diabetes, nearly 120 survived and around 55 died. Among those who did, over 80 survived and fewer than 50 died.


##### High blood pressure

In [None]:
data['high_blood_pressure'].value_counts().plot(kind='bar')
plt.xlabel("high_blood_pressure", labelpad=14)
plt.ylabel("Count of people", labelpad=14)
plt.title('The number of people with or without high_blood_pressure')

In [None]:
pd.crosstab(data.high_blood_pressure  ,data.DEATH_EVENT).plot(kind='bar')
plt.title('Mortality rate correlating to high_blood_pressure')
plt.xlabel('high_blood_pressure')
plt.ylabel('Count of people')
plt.show()

Among those who had hypertension, 40 patients died, compared to 65 who survived. Among the patients without hypertension, 55 died, compared to nearly 140 who survived.


##### Gender

In [None]:
data['sex'].value_counts().plot(kind='bar')
plt.xlabel("gender", labelpad=14)
plt.ylabel("Count of people", labelpad=14)
plt.title('Number of male and female')

##### Smoking

In [None]:
data['smoking'].value_counts().plot(kind='bar')
plt.xlabel("Smoke", labelpad=14)
plt.ylabel("Count of people", labelpad=14)
plt.title('Number of smokers and non-smokers')

In [None]:
pd.crosstab(data.smoking  ,data.DEATH_EVENT).plot(kind='bar')
plt.title('Mortality rate correlating to smoking')
plt.ylabel('Count of people')
plt.show()

Among smokers, 30 patients died and 60 survived. Among non-smokers, over 60 died and nearly 140 survived.


##### Death event

In [None]:
data['DEATH_EVENT'].value_counts().plot(kind='bar')
plt.xlabel("DEATH_EVENT", labelpad=14)
plt.ylabel("Count of people", labelpad=14)
plt.title('Number of surviving (0) and deceased (1) patients')

In [None]:
data['DEATH_EVENT'].value_counts()


There were around 200 surviving patients and nearly 100 deceased patients. We can see that there is an imbalance between the two values of the target of the data set (death event).


In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(data.corr(method='pearson'), annot=True)

#### 2. How the patients’ age and sex affect their health indicators



To analyze the effects of age and sex on other health indicators, we performed an ANOVA analysis of each indicator with age and sex as factors. The ANOVA tests yield the information needed to determine whether the null hypothesis (the factors have no effect on the indicator) is rejected for each indicator.

The ANOVA tests are based on Ordinary Least Squares regression (OLS regression) between the factors and each dependent variable. The regression method estimates the parameters (slopes and intercept) by minimizing the sum of squared differences between the observed data and the predicted values. The method is chosen because the models are simple (2 factors and 1 dependent variable for each model).
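To make the least-squares idea concrete, the sketch below fits an intercept and two slopes with NumPy on made-up, noise-free data. The arrays `age`, `sex` and `y` are hypothetical and are not drawn from the dataset; the notebook itself uses `statsmodels` for the actual fits.

```python
import numpy as np

# Hypothetical factors and a response built from known parameters.
age = np.array([40.0, 50.0, 60.0, 70.0, 80.0, 90.0])
sex = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 0.0])
y = 2.0 + 0.5 * age - 3.0 * sex

# Design matrix: a column of ones for the intercept, then the two factors.
X = np.column_stack([np.ones_like(age), age, sex])

# lstsq returns the beta that minimizes ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope_age, slope_sex = beta
print(intercept, slope_age, slope_sex)
```

With noise-free data the fit recovers the parameters used to build `y`; with real data the estimates carry uncertainty, which is what the F and t statistics quantify.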

Critical values:
Given data size = 299, degrees of freedom = 2, significance level = 0.05, we have
<ul>
<li>F critical = 3.026</li>
<li>t critical = 1.968</li>
</ul>
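These critical values can be reproduced with `scipy.stats`. The sketch below assumes the residual degrees of freedom are n − k − 1 = 296 (299 observations, 2 factors, 1 intercept) and that the t test is two-sided; both are assumptions about how the values above were obtained.

```python
from scipy import stats

n_obs = 299          # number of patients
n_factors = 2        # age and sex
alpha = 0.05         # significance level
df_resid = n_obs - n_factors - 1   # assumed residual degrees of freedom

# Upper-tail critical value of the F distribution.
f_crit = stats.f.ppf(1 - alpha, n_factors, df_resid)

# Two-sided critical value of the t distribution.
t_crit = stats.t.ppf(1 - alpha / 2, df_resid)

print(round(f_crit, 3), round(t_crit, 3))
```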

In [None]:
from sklearn import linear_model
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [None]:
# rsquared: get r squared
# fvalue: get f value
# f_pvalue: get p value
# params: get coefficient
# tvalues: get t-statistic

test_results = []
coef_results = []
t_results = []
f_pass_results = []
t_pass_results = []
f_crit = 3.026
t_crit = 1.968

for feature in health_features:
    results = smf.ols(feature + ' ~ age + sex', data=data).fit()
    
    test = [round(results.rsquared, 3), round(results.fvalue, 3), round(results.f_pvalue, 3)]
    coef = [round(results.params['age'], 4), round(results.params['sex'], 4)]
    t = [round(results.tvalues['age'], 3), round(results.tvalues['sex'], 3)]
    f_pass = ["Pass" if round(results.fvalue, 3) <= f_crit else "Fail"]
    t_pass = ["Pass" if (t[0] <= t_crit) and (t[0] >= t_crit * -1) else "Fail",
              "Pass" if (t[1] <= t_crit) and (t[1] >= t_crit * -1) else "Fail"]
    
    test_results.append(test)
    coef_results.append(coef)
    t_results.append(t)
    f_pass_results.append(f_pass)
    t_pass_results.append(t_pass)
    
    print('\033[1m ANOVA of age and sex with', feature,'\033[0m')
    print(results.summary())
    print('\n')

From the ANOVA results, the most important statistics are selected. These include:
<ul>
<li>Coefficients: show the slope of each factor</li>
<li>F and p-value: determine whether the null hypothesis is rejected for the model as a whole</li>
<li>t-statistics: determine whether the null hypothesis is rejected for each individual factor</li>    
<li>r-squared: measures how much of the variation in the data is explained by the regression model</li>
</ul>
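The statistics listed above can also be computed by hand, which may help when reading the `statsmodels` summaries. The following is a minimal sketch on synthetic data (the variables `x1`, `x2` and the chosen coefficients are made up); it uses the textbook OLS formulas rather than the library's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n).astype(float)
# Synthetic response: x1 has a real effect, x2 has none.
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
k = X.shape[1] - 1                       # number of factors, intercept excluded
beta = np.linalg.solve(X.T @ X, X.T @ y)  # OLS coefficients

resid = y - X @ beta
ss_res = resid @ resid                   # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()     # total sum of squares

r_squared = 1.0 - ss_res / ss_tot
f_stat = ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))

# Standard errors and t-statistics of the coefficients.
sigma2 = ss_res / (n - k - 1)
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / std_err

print(round(r_squared, 3), round(f_stat, 3), np.round(t_stats, 3))
```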

In [None]:
fig, ax = plt.subplots(figsize = (20,5)) 
ax.set_axis_off() 
table = ax.table( 
    cellText = test_results,  
    rowLabels = health_features,  
    colLabels = ['r squared', 'F', 'p'],
    colWidths = [0.1] * 3,
    loc = 'center') 
table.set_fontsize(14)
table.scale(2, 2)
   
ax.set_title('r squared, F, and p value of both age and sex in regards to each health indicator (indicators are dependent)', fontweight ="bold") 
   
plt.show() 

In [None]:
fig, ax = plt.subplots(figsize = (20,5)) 
ax.set_axis_off() 
table = ax.table( 
    cellText = coef_results,  
    rowLabels = health_features,  
    colLabels = ['age', 'sex'],
    colWidths = [0.1] * 2,
    loc = 'center') 
table.set_fontsize(14)
table.scale(2, 2)
   
ax.set_title('Coefficients of age and sex in regards to each health indicator (indicators are dependent)', fontweight ="bold")
   
plt.show() 

In [None]:
fig, ax = plt.subplots(figsize = (20,5)) 
ax.set_axis_off() 
table = ax.table( 
    cellText = t_results,  
    rowLabels = health_features,  
    colLabels = ['age', 'sex'],
    colWidths = [0.1] * 2,
    loc = 'center') 
table.set_fontsize(14)
table.scale(2, 2)
   
ax.set_title('t-statistics of age and sex in regards to each health indicator (indicators are dependent)', fontweight ="bold") 
   
plt.show() 

##### Results analysis

The null hypothesis result for each indicator is shown as below:

In [None]:
fig, ax = plt.subplots(figsize = (20,5)) 
ax.set_axis_off() 
table = ax.table( 
    cellText = f_pass_results,  
    rowLabels = health_features,  
    colLabels = ['F test result'],
    colWidths = [0.1],
    loc = 'center') 
table.set_fontsize(14)
table.scale(2, 2)
   
ax.set_title('Null hypothesis result of age and sex in regards to each health indicator (F critical = 3.026)', fontweight ="bold") 
   
plt.show() 

From this, we fail to reject the null hypothesis for age and sex when the dependent variable is
<ul>
<li>Anaemia</li>
<li>Creatinine phosphokinase</li>
<li>Platelets</li>    
<li>Serum sodium</li>
</ul>
This means that the two factors together show no significant effect on these indicators, while the remaining indicators are affected.

The t test result of age and sex in regards to each health indicator is shown as below:

In [None]:
fig, ax = plt.subplots(figsize = (20,5)) 
ax.set_axis_off() 
table = ax.table( 
    cellText = t_pass_results,  
    rowLabels = health_features,  
    colLabels = ['age', 'sex'],
    colWidths = [0.1] * 2,
    loc = 'center') 
table.set_fontsize(14)
table.scale(2, 2)
   
ax.set_title('t test result of age and sex in regards to each health indicator (t critical = 1.968)', fontweight ="bold") 
   
plt.show() 

The results show how age and sex individually affect the other indicators. If a factor passes the t test against an indicator, the factor has no significant effect on it; if it fails, the factor does affect the indicator.

From the results, we can confirm that:
<ol>
<li>Age individually affects:</li>
    <ul>
        <li>Serum creatinine</li>
    </ul>
<li>Sex individually affects:</li>
    <ul>
        <li>Diabetes</li>
        <li>Ejection fraction</li>
        <li>Platelets</li>
    </ul>
</ol>
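The pass/fail rule used for the t test in this section can be written as a small helper. This is a sketch of the rule only; the function name `t_test_verdict` and the example t-statistics are hypothetical and are not values taken from the tables above.

```python
def t_test_verdict(t_stat, t_crit=1.968):
    """Return "Pass" when |t| <= t_crit (no significant effect), else "Fail"."""
    return "Pass" if abs(t_stat) <= t_crit else "Fail"


# Hypothetical t-statistics for illustration.
print(t_test_verdict(0.42))
print(t_test_verdict(-3.10))
```

Using `abs` covers both tails at once, which is equivalent to checking that the statistic lies between `-t_crit` and `t_crit`.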

#### 3. Do health indices affect the mortality rate of patients?



##### ANOVA results

For this task, we perform ANOVA analysis on DEATH_EVENT, with all other variables as factors.

Critical values: Given data size = 299, degrees of freedom = 12, significance level = 0.05, we have:
<ul>
<li>F critical = 1.786</li>
<li>t critical = 1.968</li>
</ul>

In [None]:
# rsquared: get r squared
# fvalue: get f value
# f_pvalue: get p value
# params: get coefficient
# tvalues: get t-statistic

test_results = []
coef_and_t_results = []
t_pass_results = []
t_crit = 1.968
index = 1

results = smf.ols('DEATH_EVENT ~ age + anaemia + creatinine_phosphokinase + diabetes + ejection_fraction + high_blood_pressure' +
                  ' + platelets + serum_creatinine + serum_sodium + sex + smoking + time', data=data).fit()

test_values = [round(results.rsquared, 3), round(results.fvalue, 3), round(results.f_pvalue, 3)]
test_results.append(test_values)
while index < len(results.params):
    coef_and_t = [round(results.params.iloc[index], 4), round(results.tvalues.iloc[index], 3)]
    t_pass = ["Pass" if abs(round(results.tvalues.iloc[index], 3)) <= t_crit else "Fail"]
    
    coef_and_t_results.append(coef_and_t)
    t_pass_results.append(t_pass)
    index += 1

print(results.summary())
print('\n')

In [None]:
fig, ax = plt.subplots(figsize = (20,3)) 
ax.set_axis_off() 
table = ax.table( 
    cellText = test_results,    
    colLabels = ['r squared', 'F', 'p'],
    colWidths = [0.1] * 3,
    loc = 'center') 
table.set_fontsize(14)
table.scale(2, 2)
   
ax.set_title('r squared, F, and p value of all health indicators in regards to DEATH_EVENT (indicators are independent)', fontweight ="bold") 
   
plt.show() 

In [None]:
fig, ax = plt.subplots(figsize = (20,7)) 
ax.set_axis_off() 
table = ax.table( 
    cellText = coef_and_t_results,  
    rowLabels = data.columns[:12],  
    colLabels = ['coefficient', 't'],
    colWidths = [0.1] * 2,
    loc = 'center') 
table.set_fontsize(14)
table.scale(2, 2)
   
ax.set_title('Coefficient and t-statistic of each health indicator in regards to DEATH_EVENT (indicators are independent)', fontweight ="bold")
   
plt.show() 

##### Results analysis  
With an F value of 17.036 against an F critical value of 1.786, we reject the null hypothesis. This means that the health indicators, taken together, affect the mortality rate.

The t test result of each indicator against DEATH_EVENT is shown as below:

In [None]:
fig, ax = plt.subplots(figsize = (20,7)) 
ax.set_axis_off() 
table = ax.table( 
    cellText = t_pass_results,  
    rowLabels = data.columns[:12],  
    colLabels = ['t test result'],
    colWidths = [0.1],
    loc = 'center') 
table.set_fontsize(14)
table.scale(2, 2)
   
ax.set_title('t test result of each health indicator in regards to DEATH_EVENT (t critical = 1.968)', fontweight ="bold") 
   
plt.show()

From this, we notice that only 4 indicators fail the t test against DEATH_EVENT, which are:
<ul>
<li>Age</li>
<li>Ejection fraction</li>
<li>Serum creatinine</li>    
<li>Time</li>
</ul>
This means that only these factors individually affect the mortality rate.

## Predictive analysis

The exploratory data analysis above gave us a solid understanding of the data set, so the next step is predictive analysis. The aim of this stage is to build a classification model that can accurately predict the death event of a given patient. The next sections cover feature engineering, hyperparameter tuning, data modelling and model evaluation.

In [None]:
# Import ML libraries
from sklearn.metrics import classification_report, f1_score, accuracy_score, recall_score, precision_score
try:
    # Legacy plotting helpers; removed in scikit-learn 1.2.
    from sklearn.metrics import plot_confusion_matrix, plot_roc_curve
except ImportError:
    # Newer releases provide the Display classes instead.
    from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
import lightgbm as lgb


### Feature engineering

As stated above, in this section we modify the data set before using it to train the models. The steps are:
* split the data
* oversample the training set using the SMOTE technique  

Note that we also tried removing outliers from the data set, but this did not increase the accuracy of the models.

In [None]:
# Split the data into train and test set 
y = data['DEATH_EVENT']
X = data.drop(['DEATH_EVENT'], axis=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

The target of this data set is imbalanced, which can bias the model and affect our final predictions. We therefore balance the target by oversampling with the imblearn package. We apply the SMOTE technique, which first selects a minority-class instance at random and finds its k nearest minority-class neighbours. It then creates a synthetic sample by picking one of those neighbours and interpolating between the two points in feature space.
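The interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration, not the imblearn implementation: the function `smote_sketch`, its parameters and the toy minority points are all hypothetical.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                     # random minority sample
        dist = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all others
        neighbours = np.argsort(dist)[1:k + 1]           # k nearest, skipping itself
        j = rng.choice(neighbours)                       # pick one neighbour
        gap = rng.random()                               # position along the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)


# Hypothetical minority-class points with two features.
X_minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]])
new_rows = smote_sketch(X_minority, n_new=5)
print(new_rows)
```

Each synthetic row lies on the line segment between two real minority samples, so the new points stay inside the region already occupied by the minority class.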


In [None]:
# Oversample the training data only, to correct the class imbalance
from imblearn.over_sampling import SMOTE

df = pd.concat([X_train, y_train], axis=1)

# Named `smote` rather than `sm` to avoid shadowing the statsmodels alias imported earlier.
smote = SMOTE(sampling_strategy='minority', random_state=7)

# Resample to generate the synthetic minority-class rows.
oversampled_trainX, oversampled_trainY = smote.fit_resample(df.drop('DEATH_EVENT', axis=1), df['DEATH_EVENT'])
oversampled_train = pd.concat([pd.DataFrame(oversampled_trainY), pd.DataFrame(oversampled_trainX)], axis=1)
oversampled_train['DEATH_EVENT'].value_counts()

In [None]:
X_train = oversampled_trainX.copy()
y_train = oversampled_trainY.copy()

### Evaluation Framework

In this project, we use the accuracy score, F1 score and ROC curve to evaluate model performance. Accuracy shows the overall share of correct predictions, while the F1 score combines precision and recall and helps reveal problems caused by the imbalanced data set. In addition, the ROC curve is used to measure the trade-off between the true positive rate and the false positive rate.
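As a reminder of how these scores relate to each other, the sketch below derives them from the four cells of a binary confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1


# Hypothetical counts: 15 true positives, 5 false positives,
# 10 false negatives and 30 true negatives.
acc, prec, rec, f1 = binary_metrics(tp=15, fp=5, fn=10, tn=30)
print(acc, prec, rec, round(f1, 4))
```

In this example accuracy and precision are both 0.75 while recall is only 0.6, which is the kind of gap that accuracy alone would hide on an imbalanced target.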

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

def evaluate_model_performance(clf, X, y):
    '''Evaluate a fitted classification model on a data set.
    INPUT:
    clf - Fitted model object
    X - Feature matrix to evaluate on (e.g. the validation set)
    y - Expected model output vector
    OUTPUT:
    clf_accuracy: Model accuracy
    clf_f1_score: Model F1-score
    clf_recall_score: Model recall score
    clf_precision_score: Model precision score
    '''
    y_pred = clf.predict(X)
    clf_accuracy = accuracy_score(y, y_pred)
    clf_f1_score = f1_score(y, y_pred)
    clf_recall_score = recall_score(y, y_pred, average='binary')
    clf_precision_score = precision_score(y, y_pred, average='binary')

    # plot_confusion_matrix / plot_roc_curve were removed in scikit-learn 1.2;
    # the Display.from_estimator class methods replace them.
    ConfusionMatrixDisplay.from_estimator(clf, X, y,
                                          cmap=plt.cm.plasma,
                                          normalize='true')
    plt.grid(False)
    plt.title("Confusion matrix")

    RocCurveDisplay.from_estimator(clf, X, y)
    plt.title("ROC Curve and AUC score")

    return clf_accuracy, clf_f1_score, clf_recall_score, clf_precision_score

In [None]:
model_score = []

### Feature selection

Before training the models and performing hyperparameter tuning, we train a baseline `RandomForest` model to observe the importance of each feature.

In [None]:
%%time

rf =  RandomForestClassifier(n_jobs=4)

rf.fit(X_train, y_train)

acc, f1, recall, precision = evaluate_model_performance(rf, X_val, y_val)
print(acc)

In [None]:
# Feature Selection

feat_importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(feat_importances.sort_values(ascending=True))
feat_importances.nlargest(12).plot(kind='barh')
plt.show()

From the graph above we can see that `anaemia`, `high_blood_pressure` and `diabetes` have the least impact on the target, while `serum_creatinine`, `time`, `ejection_fraction` and `age` are the most important features, which is similar to the ANOVA analysis above.


In [None]:
selected_features = ['time', 'age', 'ejection_fraction', 'serum_creatinine', "serum_sodium"]

### Data modelling

In this section we use `GridSearchCV` to perform hyperparameter tuning with cross-validation in order to obtain each model with its best configuration. We train 5 different models, from simple models to high-performance ensemble models, and compare the results to select the best model for predicting the survival of a patient.
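A grid search evaluates every combination of the listed hyperparameter values, once per cross-validation fold, so the cost grows quickly with the size of the grid. The sketch below enumerates the combinations of the logistic-regression grid used in this notebook with `itertools.product`; it only counts candidates and does not call scikit-learn.

```python
from itertools import product

# Same values as the logistic-regression grid in this notebook.
param_grid = {
    "penalty": ["l2"],
    "solver": ["lbfgs", "liblinear"],
    "C": [0.01, 0.1, 10, 100],
    "max_iter": [5000, 10000, 20000],
}

# One dict per combination of values, in the order the keys are listed.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

n_folds = 5
n_fits = len(candidates) * n_folds   # model fits needed by the cross-validated search
print(len(candidates), n_fits)
```

With `refit=True`, the search fits the best candidate one more time on the full training data after the cross-validated fits.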

1. Gaussian Naive Bayes

Gaussian Naive Bayes is a basic classification model based on Bayes' theorem. It computes the probability of each class given the observed features by combining the prior probability of the class with the likelihood of the features, which are assumed to be conditionally independent. The variant applied in this project assumes that each feature follows a Gaussian distribution within each class.
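The idea can be sketched from scratch in NumPy: estimate a prior, a mean and a variance per class, then pick the class with the highest log posterior. This is an illustrative sketch on made-up points, not scikit-learn's `GaussianNB`; the helper names `fit_gnb` and `predict_gnb` and the small variance floor `1e-9` are assumptions.

```python
import numpy as np


def fit_gnb(X, y):
    """Per class: prior probability, feature means, feature variances."""
    return {c: (np.mean(y == c), X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
            for c in np.unique(y)}


def predict_gnb(model, X):
    """Return the class with the highest log prior + Gaussian log-likelihood."""
    classes = list(model)
    scores = []
    for c in classes:
        prior, mean, var = model[c]
        # Sum of per-feature Gaussian log densities (naive independence assumption).
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(scores, axis=0)]


# Hypothetical, well-separated toy data.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred = predict_gnb(fit_gnb(X_toy, y_toy), X_toy)
print(pred)
```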

In [None]:
%%time

nb_clf =  GaussianNB()

nb_clf.fit(X_train, y_train)  

print('Validation accuracy: ', nb_clf.score(X_val, y_val))

acc, f1, recall, precision = evaluate_model_performance(nb_clf, X_val, y_val)
model_score.append(['Gaussian Naive Bayes', acc, f1, recall, precision])

Despite its simplicity, the model performs well, with 80% accuracy and an AUC of 0.85. Note that, in the confusion matrix, the false positive value is higher than the true positive value, which is likely due to the imbalance of the original data set.

2. Logistic regression

Logistic regression is a linear approach to classification problems; its major difference from linear regression is that it passes the linear combination of the features through the sigmoid function to produce a probability.
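The role of the sigmoid can be shown with a short sketch: a weighted sum of the features is mapped to a value between 0 and 1 that is read as the probability of the positive class. The weights, bias and feature values below are hypothetical, not coefficients fitted in this notebook.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical weights: the linear score is 0.5*1.0 - 0.25*2.0 = 0.
p = predict_proba(features=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(p)
```

A linear score of zero sits exactly on the decision boundary, so the predicted probability is one half; larger positive scores push it towards 1 and negative scores towards 0.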

In [None]:
%%time

model =  LogisticRegression()

parameters = {
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear'],
    'C': [ 0.01, 0.1, 10, 100],
    'max_iter': [5000, 10000, 20000]
}

log_reg = GridSearchCV(model, parameters, refit=True, verbose=1, cv = 5, n_jobs = 4)
log_reg.fit(X_train, y_train)

print('Best Score: ', log_reg.best_score_*100, '\nBest Parameters: ', log_reg.best_params_)

acc, f1, recall, precision = evaluate_model_performance(log_reg, X_val, y_val)
model_score.append(['Logistic regression', acc, f1, recall, precision])

The model also performs well, with high accuracy and AUC scores. Its false positive value is also better than that of the naive Bayes model.

3. AdaboostClassifier

The AdaBoost classifier is a high-performance boosting ensemble: it trains a sequence of weak learners, combines them by weighted majority vote and uses the result to make predictions.
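The weighted majority vote can be illustrated with the textbook binary AdaBoost weight, alpha = 0.5 · ln((1 − error) / error), where labels are −1 and +1. This is a sketch of the classic formulation under that assumption; scikit-learn's `AdaBoostClassifier` uses its own variant, and the predictions and error rates below are made up.

```python
import math


def learner_weight(error):
    """Textbook AdaBoost weight: accurate learners get a larger say."""
    return 0.5 * math.log((1.0 - error) / error)


def weighted_vote(predictions, errors):
    """Combine -1/+1 predictions using each learner's weight."""
    score = sum(learner_weight(e) * p for p, e in zip(predictions, errors))
    return 1 if score >= 0 else -1


# Hypothetical case: one accurate learner says +1, two weak ones say -1.
final = weighted_vote(predictions=[+1, -1, -1], errors=[0.1, 0.4, 0.4])
print(final)
```

Here the single accurate learner outweighs the two weaker ones, so the ensemble predicts +1 even though a simple majority would predict −1.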

In [None]:

%%time

model = AdaBoostClassifier()

parameters = {
    'n_estimators': [200, 300, 500, 600, 800],
    'learning_rate':[0.001, 0.1, 0.2, 0.5]
}

ada_clf = GridSearchCV(model, parameters, refit=True, verbose=1, cv = 5, n_jobs = 4)
ada_clf.fit(X_train, y_train)

print('Best Score: ', ada_clf.best_score_*100, '\nBest Parameters: ', ada_clf.best_params_)

acc, f1, recall, precision = evaluate_model_performance(ada_clf, X_val, y_val)
model_score.append(['Adaboost classifier', acc, f1, recall, precision])

The model performs well. Its AUC score is higher, which indicates that the model separates the two classes well.

4. Random forest

Random forest is an ensemble algorithm that trains each decision tree on a random bootstrap sample of the data set and takes a vote across the trees to make the final prediction.

In [None]:
%%time
model =  RandomForestClassifier()

max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

parameters = {'n_estimators': [100, 200, 300],
               'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt' for classifiers and was removed in scikit-learn 1.3
               'max_depth': max_depth,
               'min_samples_split': [2, 5],
               'min_samples_leaf': [1, 2] }

rf_clf= GridSearchCV(model, parameters, refit=True, verbose=1, cv = 5, n_jobs = 4)
rf_clf.fit(X_train, y_train)

print('Best Score: ', rf_clf.best_score_*100, '\nBest Parameters: ', rf_clf.best_params_)

acc, f1, recall, precision = evaluate_model_performance(rf_clf, X_val, y_val)
model_score.append(['Random Forest', acc, f1, recall, precision])

The model has about 90% cross-validated accuracy on the training set and a good AUC score on the validation set.

5. LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is widely regarded as a more efficient alternative to XGBoost and can perform very well on large data sets.

In [None]:
%%time

model = lgb.LGBMClassifier()
# Create parameters to search
gridParams = {
    'learning_rate': [0.01, 0.1, 0.001],
    'max_depth': [5, 10, 15, -1],  # -1 means no depth limit in LightGBM
    'min_data_in_leaf': [30, 50, 100],
    'boosting_type': ['gbdt', 'dart']
    }

# To view the default model params:
model.get_params().keys()

# Create the grid
lgb_clf = GridSearchCV(model, gridParams,
                    verbose=1,
                    cv = 5,
                    n_jobs = 4)
# Run the grid
lgb_clf.fit(X_train, y_train)

# Print the best parameters found
print(lgb_clf.best_params_)
print(lgb_clf.best_score_)

acc, f1, recall, precision = evaluate_model_performance(lgb_clf, X_val, y_val)
model_score.append(['LightGBM', acc, f1, recall, precision])

The performance is quite similar to that of the random forest model, except that its AUC score is slightly higher.

### Model Evaluation

After the training and hyperparameter tuning phase, each model is used with its best parameters to predict the held-out validation data set (`X_val`, `y_val`).

In [None]:
# Display each model's scores on the held-out data set
scores = pd.DataFrame(model_score, columns =['Model', 'Accuracy Score', 'F1 Score', 'Recall score', 'Precision score'])
scores
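
For reference, the four score columns relate to the confusion-matrix counts as sketched below. The labels are made up for illustration, and `binary_metrics` is a hypothetical helper; `evaluate_model_performance` is defined earlier in the notebook and may compute these values differently (for example via scikit-learn).

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, f1, recall, precision) for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1, recall, precision

# Made-up example: 4 positive cases and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

acc, f1, recall, precision = binary_metrics(y_true, y_pred)
print(acc, f1, recall, precision)  # 0.8 0.75 0.75 0.75
```

With imbalanced classes, accuracy can stay high while recall on the positive class is poor, which is why F1, recall and precision are reported alongside it.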

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(20,10))
fig.suptitle('Model performance comparison')

sns.barplot(data=scores, x='Model', y='Accuracy Score',  ax=axes[0][0])
axes[0][0].set_title('Accuracy Score')

sns.barplot(data=scores, x='Model', y='F1 Score',  ax=axes[0][1])
axes[0][1].set_title('F1 Score')

sns.barplot(data=scores, x='Model', y='Recall score',  ax=axes[1][0])
axes[1][0].set_title('Recall Score')

sns.barplot(data=scores, x='Model', y='Precision score',  ax=axes[1][1])
axes[1][1].set_title('Precision Score')

From the table and the bar chart, we can see that `Gaussian Naive Bayes`, `Logistic Regression` and `LightGBM` are the models with the highest accuracy and F1 scores. It may seem strange that the simplest models produce the best results. However, the data set contains only about 300 records, which is small, so the more complex models do not have enough data to fit reliably and perform worse. `LightGBM` is an exception: its accuracy is slightly lower than that of `Naive Bayes`, but its AUC score is higher.
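
One way to see why about 300 records is small: an accuracy measured on a held-out set is itself a noisy estimate. The sketch below uses the normal approximation to the binomial standard error. The held-out size of 60 (roughly 20% of 299) and the accuracy of 0.80 are assumptions for illustration only, since the actual split is defined earlier in the notebook.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n_samples held-out cases."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return accuracy - z * std_err, accuracy + z * std_err

n_val = 60        # assumed held-out size, not taken from the notebook
observed = 0.80   # assumed accuracy, for illustration

low, high = accuracy_interval(observed, n_val)
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the interval spans roughly ten percentage points on either side, so differences of a few points between models in the table should be read with caution.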

#### Data Modelling with feature selection

We retrain `LightGBM`, and additionally fit `KNN` and `MLP` classifiers, using only the selected features: `serum_creatinine`, `serum_sodium`, `time`, `ejection_fraction` and `age`.

In [None]:
X_train_selected = X_train[selected_features]
X_val_selected = X_val[selected_features]
X_train_selected

In [None]:
%%time

model = lgb.LGBMClassifier()
# Create parameters to search
gridParams = {
    'learning_rate': [0.01, 0.1, 0.001],
    'max_depth': [5, 10, 15, None],
    'min_data_in_leaf': [30, 50, 100],
    'boosting_type': ['gbdt', 'dart']
    }

# To view the default model params:
model.get_params().keys()

# Create the grid (named lgb_fs_clf so that the `lgb` module is not shadowed)
lgb_fs_clf = GridSearchCV(model, gridParams,
                    verbose=1,
                    cv = 5,
                    n_jobs = 4)
# Run the grid
lgb_fs_clf.fit(X_train_selected, y_train)

# Print the best parameters found
print(lgb_fs_clf.best_params_)
print(lgb_fs_clf.best_score_)

acc, f1, recall, precision = evaluate_model_performance(lgb_fs_clf, X_val_selected, y_val)

print("Accuracy:", acc*100)
print("F1 score:", f1)
print("Recall:", recall)
print("Precision:", precision)
model_score.append(['LGBM_classifier with feature selection', acc, f1, recall, precision])


In [None]:
from sklearn.neighbors import KNeighborsClassifier

model =  KNeighborsClassifier()

parameters = { 
    'n_neighbors':[1, 3, 5, 10, 15],
    'weights':['uniform', 'distance'],
    'metric':['euclidean', 'manhattan', 'minkowski'] 
}

knn_clf= GridSearchCV(model, parameters, refit=True, verbose=1, cv = 10, n_jobs = 4)
knn_clf.fit(X_train_selected, y_train)

print('Best Score: ', knn_clf.best_score_*100, '\nBest Parameters: ', knn_clf.best_params_)
acc, f1, recall, precision = evaluate_model_performance(knn_clf, X_val_selected, y_val)

print("Accuracy:", acc*100)
print("F1 score:", f1)
print("Recall:", recall)
print("Precision:", precision)
model_score.append(['KNN with feature selection', acc, f1, recall, precision])

In [None]:
from sklearn.neural_network import MLPClassifier

model =  MLPClassifier()

parameters = { 
    'activation': ['relu', 'logistic', 'tanh'],
    'hidden_layer_sizes':[20, 50, 100],
    'solver': ['adam', 'lbfgs'],
    'batch_size' :[10, 20, 30],
    'learning_rate_init': [0.0001, 0.001, 0.1],
    'max_iter': [1000, 2000]
}

mlp_clf= GridSearchCV(model, parameters, refit=True, verbose=3, cv = 5, n_jobs = 4)
mlp_clf.fit(X_train_selected, y_train)

# Report the MLP search results (the original printed knn_clf by mistake)
print('Best Score: ', mlp_clf.best_score_*100, '\nBest Parameters: ', mlp_clf.best_params_)
acc, f1, recall, precision = evaluate_model_performance(mlp_clf, X_val_selected, y_val)

print("Accuracy:", acc*100)
print("F1 score:", f1)
print("Recall:", recall)
print("Precision:", precision)
model_score.append(['MLP with feature selection', acc, f1, recall, precision])

In [None]:
scores = pd.DataFrame(model_score, columns =['Model', 'Accuracy Score', 'F1 Score', 'Recall score', 'Precision score'])
scores

From the results, we can see that feature selection improves the accuracy of the LightGBM model to 80% and its F1 score to 0.739, the highest among all models. The AUC score also improves slightly. Hence, feature selection can improve model performance and help the model generalize. As a result, it is reasonable to conclude that LightGBM is the most suitable model for this problem; however, Gaussian Naive Bayes and Logistic Regression should also be considered because of their performance and simplicity.

## Conclusion

In conclusion, in this project our team applied several data analysis and machine learning techniques to gain useful insight from the data set and answer the research questions. In the exploratory data analysis part, each feature was inspected to gain a basic understanding of it. We then performed analysis of variance (ANOVA) to evaluate the impact of the patient's age and sex on the health indicators and the effect of each health indicator on the mortality rate. We found that time, age, ejection fraction and serum creatinine are the major factors contributing to the patient's chance of survival. Finally, we built and analyzed different models to predict the patient's chance of survival. The results show that LightGBM and Gaussian Naive Bayes perform best, with 80% accuracy and a 0.74 F1 score. In the future, further work is needed to gather data from more patients and to add features in order to improve the accuracy of the model.