<center><h1> <u>Heart failure prediction</u></h1></center>
<img src="https://images.pexels.com/photos/6765583/pexels-photo-6765583.jpeg?auto=compress&cs=tinysrgb&dpr=2&h=650&w=940" width="50%">
<center><a href="https://www.pexels.com/photo/flower-petals-scattered-around-decorative-heart-6765583/">Photo by Michelle Leman from Pexels</a></center>

## Contents
- [The problem and The data](#section1)
    - [Understanding the problem](#subsection1)
    - [About the dataset](#subsection2)
- [Exploratory data analysis](#section2)
- [Feature Engineering](#section3)
- [Feature Selection](#section4)
- [Modeling and Evaluation](#section5)
    - [Decision tree Classifier](#tree)
    - [Logistic Regression](#logistic)
    - [Random Forest Classifier](#forest)
- [Final Model](#final)

In [None]:
# Importing python modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgbm

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import FunctionTransformer

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.precision', 2)  # full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.float_format', lambda x: '%.2f' % x)
plt.style.use('ggplot')

<a id="section1"></a>
# The problem and The data
<a id="subsection1"></a>
## Understanding the problem
<b>Let's understand the problem that we are going to solve.<br></b>
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide.<br>

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management.<br>
Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.<br>
<b>We can build a machine learning model for predicting mortality caused by Heart Failure using other health factors of the patient.</b><br>
In ML terminology, this is a <b style="color:green;"> Supervised Learning Binary Classification problem.</b>

<a id="subsection2"></a>
## About the dataset
This dataset contains 12 features that can be used to predict mortality by heart failure.<br>
<b> age </b> : Age of the patient <br>
<b> anaemia </b> : 0 = NO, 1 = YES  <br>
<b> creatinine_phosphokinase </b> : level of the creatine phosphokinase (CPK) enzyme in the blood <br>
<b> diabetes </b> : 0 = NO, 1 = YES <br>
<b> ejection_fraction </b> : The measurement of the percentage of blood leaving the heart each time it contracts. <br>
<b> high_blood_pressure </b> : 0 = NO, 1 = YES <br>
<b> platelets </b> : Count of platelets <br>
<b> serum_creatinine </b> : serum creatinine level <br>
<b> serum_sodium </b> : serum sodium level in the blood<br>
<b> sex </b> :  0 = FEMALE, 1 = MALE<br>
<b> smoking </b> : 0 = NO, 1 = YES <br>
<b> time </b> : follow-up period in days. If the patient died, it is the number of days from the start of follow-up until death; if the patient survived, it is the number of days the patient was followed.<br>
<b> DEATH_EVENT </b> : 0 = NO, 1 = YES (target) <br>

### Loading the data into memory

In [None]:
data = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
print("Data loaded successfully!!")
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns in the data.")

<a id="section2"></a>
# Exploratory Data Analysis



In [None]:
# random sample of data
data.sample(5)

In [None]:
# statistical summary
data.describe()

In [None]:
# Null values
# isna().mean() is a fraction, so scale by 100 to match the "%" label
(data.isna().mean() * 100).to_frame(name="% of null values")

In [None]:
# Unique values
data.nunique().to_frame(name="# of unique values")

### Variable Separation
Separating the features based on their data type.

In [None]:
features = ['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
            'ejection_fraction', 'high_blood_pressure', 'platelets',
            'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']

continuous_features = ['age','creatinine_phosphokinase','ejection_fraction',
                       'platelets','serum_creatinine','serum_sodium','time']

discrete_features = ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking']

target = 'DEATH_EVENT'

### Target distribution

In [None]:
fig, ax = plt.subplots(figsize=(8,5))
sns.countplot(x=data[target], ax=ax)
ax.set_xlabel(target, fontsize=13, fontweight='bold')
for patch in ax.patches:
    height = patch.get_height()
    width = patch.get_width()
    new_width = width * 0.4
    patch.set_width(new_width)
    x = patch.get_x()
    patch.set_x(x + (width - new_width) / 2)
    ax.text(x=x + width/2, y=height, s=height, ha='center', va='bottom')
plt.tight_layout()

### Distribution of continuous features

In [None]:
fig, axes = plt.subplots(4,2, figsize=(15,20))
axes = np.ravel(axes)
for i, col in enumerate(continuous_features):
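    # histplot replaces the deprecated distplot; kde + density mimic its default look
    sns.histplot(data[col], ax=axes[i], bins=30, kde=True, stat='density', color='blue')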
    axes[i].set_title(f" Distribution of {col}")
plt.tight_layout()

### Distribution of discrete features

In [None]:
## source: https://stackoverflow.com/questions/64946868/on-changing-the-bar-width-of-a-countplot-the-relative-position-of-the-bars-get
disc_data = data[discrete_features].astype('category')

fig, axes = plt.subplots(3,2, figsize=(13,15))
axes=np.ravel(axes)
for i, col in enumerate(discrete_features):
    sns.countplot(x=disc_data[col], ax=axes[i])
    axes[i].set_title(col, fontsize=13, fontweight='bold')
    for patch, label in zip(axes[i].patches, ["NO", "YES"]):
        height = patch.get_height()
        width = patch.get_width()
        new_width = width * 0.4
        patch.set_width(new_width)
        patch.set_label(label)
        x = patch.get_x()
        patch.set_x(x + (width - new_width) / 2)
        axes[i].text(x=x + width/2, y=height, s=height, ha='center', va='bottom')
            
    axes[i].legend(loc='lower right')
    axes[i].margins(y=0.1)
plt.tight_layout()
plt.show()

### Continuous features Vs Target (Box plot)

In [None]:
fig, axes = plt.subplots(4,2, figsize=(13,15))
axes=np.ravel(axes)

for i,col in enumerate(continuous_features):
    sns.boxplot(x = data[target].astype('category'), y = col, data=data, ax=axes[i])
    axes[i].set_ylabel(col, fontweight='bold')
    axes[i].set_xlabel(target, fontweight='bold')
    axes[i].set_title(f'{col} vs target', fontsize=14)
    
plt.tight_layout()

### Discrete features distribution w.r.t Target

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(15, 15))
axes = [ax for axes_row in axes for ax in axes_row]

for i, col in enumerate(discrete_features):
    fltr = data[target] == 0
    # rename_axis + reset_index(name=...) yields columns [col, 'count'] on both old and new pandas
    vc_a = data[fltr][col].value_counts().rename_axis(col).reset_index(name='count')

    vc_b = data[~fltr][col].value_counts().rename_axis(col).reset_index(name='count')

    vc_a[target] = 0
    vc_b[target] = 1

    df = pd.concat([vc_a, vc_b]).reset_index(drop = True)

    sns.barplot(x = col, y = 'count', data = df , hue=target, ax=axes[i])
    axes[i].set_title(col, fontweight='bold')
plt.tight_layout()

### Discrete features Vs Target

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(15,15))
axes = [ax for axes_row in axes for ax in axes_row]
for i, c in enumerate(discrete_features):
    df = data[[c,target]].groupby(c).mean().reset_index()
    sns.barplot(x=df[c], y=df[target], ax=axes[i])  # keyword arguments are required in recent seaborn
    for patch in axes[i].patches:
        height = patch.get_height()
        width = patch.get_width()
        new_width = width * 0.4
        patch.set_width(new_width)
        x = patch.get_x()
        patch.set_x(x + (width - new_width) / 2)
    axes[i].set_ylabel('mean of target', fontsize=14)
    axes[i].set_xlabel(c, fontsize=14, fontweight='bold')
    
plt.tight_layout()
plt.show()

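Observation: for most discrete features the mean of the target barely changes between the two categories, so they are unlikely to help much in predicting the target.<br>
high_blood_pressure and anaemia appear to be the most useful of them.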
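
One way to put a number on this visual impression is a chi-square test of independence between each binary feature and the target. The sketch below uses `scipy.stats.chi2_contingency` on a small synthetic frame. The helper `chi2_pvalue`, the column names, and the data are illustrative assumptions and not part of the original analysis; on the real data one would pass `data`, a name from `discrete_features`, and `target`.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_pvalue(frame, feature, target):
    """Return the chi-square p-value for independence of two categorical columns."""
    table = pd.crosstab(frame[feature], frame[target])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

# Synthetic stand-in data (an assumption, for illustration only).
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "related": np.where(rng.random(n) < 0.8, y, 1 - y),  # agrees with y about 80% of the time
    "unrelated": rng.integers(0, 2, size=n),               # generated independently of y
    "target": y,
})

for col in ["related", "unrelated"]:
    print(col, "p-value:", chi2_pvalue(demo, col, "target"))
```

A small p-value suggests the feature and the target are not independent; a large one is consistent with the feature carrying little information on its own.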

### Correlation of features with target

In [None]:
corr_mat = data.corr()[target].sort_values(ascending=False).to_frame()
plt.figure(figsize=(2,8))
sns.heatmap(corr_mat, cmap='Blues', cbar=False, annot=True)
plt.show()

<a id="section3"></a>
# Feature Engineering
Before transforming the features, it is better to split the data into train and test sets so that nothing learned from the test rows can influence the transformations.
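
The log transform used in this notebook is stateless, so the order has no numerical effect here. The habit matters for transformers that learn statistics from the data. As a minimal sketch, assuming a `StandardScaler` (which this notebook does not use) and a synthetic feature, the scaler is fitted on the training rows only and then reused on the test rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single feature (an assumption, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
X_tr, X_te = train_test_split(X, test_size=0.2, random_state=1)

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training statistics

print("mean learned from train:", scaler.mean_[0])
print("train mean after scaling:", X_tr_scaled.mean())
print("test mean after scaling:", X_te_scaled.mean())
```

The training mean after scaling is zero up to floating-point error, while the test mean is only close to zero, which is expected because the test rows played no part in fitting.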

### Train test split
<b>Training : </b>80% of data<br>
<b>Testing : </b>20% of data

In [None]:
train, test = train_test_split(data, test_size=0.2, random_state=1, stratify=data[target])

### Log transformation

In [None]:
transformer = FunctionTransformer(np.log)

train[continuous_features] = transformer.fit_transform(train[continuous_features])
test[continuous_features] = transformer.transform(test[continuous_features])
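
The usual motivation for a log transform is to reduce right skew in strictly positive features. The sketch below illustrates the effect on synthetic log-normal values; the data and variable names are assumptions, not the notebook's columns. Note that `np.log` needs strictly positive inputs; `np.log1p` is a common alternative when zeros are possible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed, strictly positive values (an assumption, for illustration only).
raw = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5000))
logged = np.log(raw)

print("skew before log:", raw.skew())
print("skew after log :", logged.skew())
```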

### Preprocessed data

In [None]:
X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

print("Train set : ", train.shape)
print("Test set : ", test.shape)

<a id="section4"></a>
# Feature Selection
<b>Feature selection using Random forest</b> comes under the category of Embedded methods. Embedded methods combine the qualities of filter and wrapper methods: the selection happens inside an algorithm that has its own built-in measure of feature importance. Commonly cited benefits of embedded methods are:
- They tend to be accurate.
- They tend to generalize well.
- They are interpretable.

<a href="https://towardsdatascience.com/feature-selection-using-random-forest-26d7b747597f">Reference blog</a>

In [None]:
selector = SelectFromModel(
    
    RandomForestClassifier(n_estimators = 100,
                           random_state=1),
    threshold='median')

selector.fit(X_train, y_train)

selected_feat= X_train.columns[(selector.get_support())].tolist()
print("Best features : ",selected_feat)
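
To make the `threshold='median'` setting concrete, the sketch below checks on synthetic data that `SelectFromModel` keeps exactly the features whose importance is at or above the median importance. The dataset from `make_classification` and the `demo_` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=0)

demo_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median")
demo_selector.fit(X_demo, y_demo)

# Reproduce the selection by hand from the fitted forest's importances.
importances = demo_selector.estimator_.feature_importances_
manual_mask = importances >= np.median(importances)

print("threshold used by selector:", demo_selector.threshold_)
print("selected by selector:", demo_selector.get_support())
print("selected manually   :", manual_mask)
```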

### Importance of all features

In [None]:
importance = pd.Series(
    selector.estimator_.feature_importances_.ravel(),
    features).to_frame(name="feature importance") \
.sort_values('feature importance', ascending=False)
importance

### Final data

In [None]:
X_train = X_train[selected_feat]
X_test = X_test[selected_feat]

<a id="section5"></a>
# Modeling and Evaluation

In [None]:
results = {"model":[], "CV f1-score":[]}

In [None]:
# Baseline model
def base_model(clf):
    clf.fit(X_train, y_train)
    train_preds = clf.predict(X_train)
    test_preds = clf.predict(X_test)
    print("Train f1 Score :", f1_score(y_train, train_preds))
    print("Test f1 Score :", f1_score(y_test, test_preds))  
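
For reference, the F1 score is the harmonic mean of precision and recall for the positive class. The toy labels below are made up for illustration; they show that computing it by hand agrees with `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels (made up): 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
manual_f1 = 2 * precision * recall / (precision + recall)

print("manual F1 :", manual_f1)                  # 0.75
print("sklearn F1:", f1_score(y_true, y_pred))   # 0.75
```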

In [None]:
'''### K - Fold Cross validation ###
Step 1: Divide the dataset into k groups, or "folds", of roughly equal size.
        StratifiedKFold keeps the class ratio similar in every fold.
Step 2: Choose one of the folds to be the holdout set. Fit the model on the remaining k-1 folds.
Step 3: Calculate the F1-score on the observations in the fold that was held out.
Step 4: Repeat this process k times, using a different fold each time as the holdout set.
Step 5: Average the k F1-scores to get the overall cross-validated F1-score.'''
# The function below implements stratified K-Fold cross validation and also
# returns the out-of-fold (oof) probabilities for the positive class.

def run_kfold(model, X_train, y_train, N_SPLITS = 10):
    f1_list = []
    oofs = np.zeros(len(X_train))
    folds = StratifiedKFold(n_splits=N_SPLITS)
    for i, (trn_idx, val_idx) in enumerate(folds.split(X_train, y_train)):
        
        print(f'\n------------- Fold {i + 1} -------------')
        X_trn, y_trn = X_train.iloc[trn_idx], y_train.iloc[trn_idx]
        X_val, y_val = X_train.iloc[val_idx], y_train.iloc[val_idx]
        
        model.fit(X_trn, y_trn)
        # Instead of directly predicting the classes we will obtain the probability of positive class.
        preds_val = model.predict_proba(X_val)[:,1]
        
        fold_f1 = f1_score(y_val, preds_val.round())
        f1_list.append(fold_f1)
        
        print(f'\nf1 score for validation set is {fold_f1}') 
        
        oofs[val_idx] = preds_val
        
    print(f'\n----------------------------------')
    mean_f1 = sum(f1_list)/N_SPLITS
    print("\nMean validation f1 score :", mean_f1)
    
    oofs_score = f1_score(y_train, oofs.round())
    print(f'\nF1 score for oofs is {oofs_score}')
    return oofs, mean_f1
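
As a cross-check on a hand-written loop like the one above, scikit-learn's `cross_val_score` can compute the same per-fold scores when it is given the same splitter and metric. The sketch below uses synthetic data and a plain `LogisticRegression`, both assumptions for illustration, and compares a manual loop with the library call.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data (an assumption, for illustration only).
X_cv, y_cv = make_classification(n_samples=300, n_features=6, random_state=0)
cv = StratifiedKFold(n_splits=5)  # no shuffling, so the splits are deterministic
model = LogisticRegression(max_iter=1000)

# Manual loop: fit on k-1 folds, score the held-out fold.
manual_scores = []
for trn_idx, val_idx in cv.split(X_cv, y_cv):
    model.fit(X_cv[trn_idx], y_cv[trn_idx])
    manual_scores.append(f1_score(y_cv[val_idx], model.predict(X_cv[val_idx])))

# Library shortcut with the same splitter and metric.
library_scores = cross_val_score(LogisticRegression(max_iter=1000),
                                 X_cv, y_cv, cv=cv, scoring="f1")

print("manual mean F1 :", np.mean(manual_scores))
print("library mean F1:", library_scores.mean())
```

The manual loop remains useful when the out-of-fold probabilities themselves are needed, which `cross_val_score` does not return.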

<a id="tree"></a>
## Decision tree

### Base model

In [None]:
tree = DecisionTreeClassifier(random_state=1)
base_model(tree)

### Hyperparameter tuning

In [None]:
params = {
    'max_depth': [4, 6, 8, 10, 12, 14, 16, 20],
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [5, 10, 20, 30, 40, 50],
    'max_features': [0.2, 0.4, 0.6, 0.8, 1.0],  # floats are fractions of the features; an int 1 would mean a single feature
    'max_leaf_nodes': [8, 16, 32, 64, 128,256],
    'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2},
                     {0: 1, 1: 3}, {0: 1, 1: 4}]
}

clf = RandomizedSearchCV(DecisionTreeClassifier(random_state=1),
                         params,
                         scoring='f1',
                         verbose=1,
                         random_state=1,
                         cv=5,
                         n_iter=50)

search = clf.fit(X_train, y_train)

print("\nBest f1-score:",search.best_score_)
print("\nBest params:",search.best_params_)

### K Fold - Cross validation

In [None]:
clf = DecisionTreeClassifier(random_state = 1,
                             **search.best_params_)
oofs, mean_f1 = run_kfold(clf, X_train, y_train, N_SPLITS=5)
results['model'].append("Decision Tree")
results['CV f1-score'].append(mean_f1)

<a id="logistic"></a>
## Logistic Regression

### Base model

In [None]:
log = LogisticRegression(random_state=1)
base_model(log)

### Hyperparameter tuning

In [None]:
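# Only valid penalty/solver pairs are listed; invalid pairs (e.g. 'l1' with 'lbfgs') fail to fit.
common = {
    'C': [0.0001, 0.001, 0.1, 1, 10, 100, 1000],
    'fit_intercept': [True, False],
    'class_weight': ['balanced', None]
}
params = [
    {'penalty': ['l2'], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], **common},
    {'penalty': ['l1'], 'solver': ['liblinear', 'saga'], **common},
    # 'elasticnet' needs the saga solver and an l1_ratio; the values below are an assumed grid
    {'penalty': ['elasticnet'], 'solver': ['saga'], 'l1_ratio': [0.25, 0.5, 0.75], **common}
]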

clf = RandomizedSearchCV(LogisticRegression(random_state=1),
                         params,
                         scoring='f1',
                         verbose=1,
                         random_state=1,
                         cv=5,
                         n_iter=50)

search = clf.fit(X_train, y_train)

print("\nBest f1-score:",search.best_score_)
print("\nBest params:",search.best_params_)

### K Fold - Cross validation

In [None]:
clf = LogisticRegression(random_state = 1,
                         **search.best_params_)

oofs, mean_f1 = run_kfold(clf, X_train, y_train, N_SPLITS=5)

results['model'].append("Logistic regression")
results['CV f1-score'].append(mean_f1)

<a id="forest"></a>
## Random Forest Classifier

### Base model

In [None]:
forest = RandomForestClassifier(random_state=1)
base_model(forest)

### Hyperparameter tuning

In [None]:
params = {'bootstrap': [True, False],
         'max_depth': [5,10, 20, 30, 50,None],
         'max_features': ['sqrt'],  # 'auto' was an alias of 'sqrt' for classifiers and has been removed from scikit-learn
         'min_samples_leaf': [1, 2, 4],
         'min_samples_split': [2, 5, 10],
         'class_weight': [{0: 1, 1: 1}, {0: 1, 1: 2}, {0: 1, 1: 3}],
         'n_estimators': [50, 100, 200, 300, 500]}

clf = RandomizedSearchCV(RandomForestClassifier(random_state=1),
                         params,
                         scoring='f1',
                         verbose=1,
                         random_state=1,
                         cv=5,
                         n_iter=50)

search = clf.fit(X_train, y_train)

print("\nBest f1-score:",search.best_score_)
print("\nBest params:",search.best_params_)

### K fold - Cross validation

In [None]:
clf = RandomForestClassifier(random_state = 1,
                         **search.best_params_)

oofs, mean_f1 = run_kfold(clf, X_train, y_train, N_SPLITS=5)

results['model'].append("Random Forest")
results['CV f1-score'].append(mean_f1)

<a id="final"></a>
# Final Model


In [None]:
pd.DataFrame(results)

<b>Random Forest performs better, so let's evaluate it on the test set.</b>

In [None]:
params = {'n_estimators': 100,
          'min_samples_split': 5,
          'min_samples_leaf': 4,
          'max_features': 'sqrt',  # equivalent to the removed 'auto' alias
          'max_depth': 30, 
          'class_weight': 
          {0: 1, 1: 2}, 
          'bootstrap': True}

final_model = RandomForestClassifier(random_state=1,
                                     **params
                                    )
final_model.fit(X_train, y_train)

train_preds = final_model.predict(X_train)
test_preds = final_model.predict(X_test)

print("Train f1 Score :", f1_score(y_train, train_preds))
print("Test f1 Score :", f1_score(y_test, test_preds))  

### Classification report

In [None]:
print(classification_report(y_test, test_preds))

### Confusion matrix

In [None]:
cm = confusion_matrix(y_test,test_preds,normalize='true')
plt.figure(figsize=(5,5))
sns.heatmap(cm, annot=True, cmap='Blues', cbar=False,fmt='.2f')
plt.show()
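
With `normalize='true'`, each row of the confusion matrix is divided by the number of true samples in that class, so every row sums to 1 and the bottom-right cell equals the recall of the positive class. The toy labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (made up): 5 negatives and 5 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1, 0, 0]

cm_counts = confusion_matrix(y_true, y_pred)                   # [[4, 1], [2, 3]]
cm_rows = confusion_matrix(y_true, y_pred, normalize="true")   # [[0.8, 0.2], [0.4, 0.6]]

print(cm_counts)
print(cm_rows)
print("row sums:", cm_rows.sum(axis=1))
print("recall of class 1:", recall_score(y_true, y_pred))      # 0.6, same as cm_rows[1, 1]
```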

### Thank you..!!
--- &nbsp;  Ashok kumar