# Heart <font color="red">Failure</font> Clinical Records
reached 99% precision! What did it cost? Recall

![](https://www.lifespan.io/wp-content/uploads/2017/10/shutterstock_488843971.jpg)

# Problem Description
- To create a model that predicts the likelihood of a patient dying due to heart failure.
- This is a binary classification problem since the target (Death Event) consists of two classes, True or False.

In [None]:
!pip install seaborn --upgrade
import pandas as pd # data manipulation
import numpy as np # linear algebra

# data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from statsmodels.graphics.gofplots import qqplot # normality check
import plotly.express as px
from sklearn.tree import plot_tree # decision tree 

# data preprocessing
from imblearn.over_sampling import SMOTE # deal with imbalance data
from sklearn.preprocessing import MinMaxScaler, PowerTransformer # scale data

# classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression # linear classification
from sklearn.svm import LinearSVC, SVC # support vector machines
from sklearn.tree import DecisionTreeClassifier # tree based
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier,\
AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb

# model evaluation and selection
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import classification_report  # plot_roc_curve (unused here) was removed in scikit-learn 1.2
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, accuracy_score, roc_auc_score

# <a id=toc>Table of contents</a>
1. [Data Exploration](#eda)
2. [Data Preparation](#data_prep)
3. [Data Modelling and Hyperparameter Tuning](#model)
4. [Model Evaluation](#eval)
5. [Predictions on Test Data](#predict)
6. [Conclusion](#conclude)

# <a id=eda>1. Data Exploration</a>
[Back to index](#toc)

## Dataset and feature description

In [None]:
df = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
df.head()

**1. Age:** age of patient (in years)

**2. Anaemia:** Decrease of red blood cells or hemoglobin

**3. High blood pressure:** If a patient has hypertension

**4. Creatinine phosphokinase:** Level of the CPK enzyme in the blood (mcg/L)

**5. Diabetes:** If the patient has diabetes

**6. Ejection fraction:** Percentage of blood leaving the heart at each contraction

**7. Sex:** Woman or man

**8. Platelets:** Platelets in the blood (kiloplatelets/mL)

**9. Serum creatinine:** Level of creatinine in the blood (mg/dL)

**10. Serum sodium:** Level of sodium in the blood (mEq/L)

**11. Smoking:** If the patient smokes

**12. Time:** Follow-up period (in days)

**13. (target) death event:** If the patient died during the follow-up period

## Data shape

In [None]:
df.shape

## Data types

In [None]:
df.info()

In [None]:
numeric = ['age', 'creatinine_phosphokinase', 
           'ejection_fraction', 'platelets', 
           'serum_creatinine', 'time']
categorical = ['anaemia', 'diabetes', 'high_blood_pressure', 
               'sex', 'smoking']

## Fix age data type

In [None]:
df.age = df.age.astype('int64')

## Missing Values
There are no missing values

In [None]:
df.isnull().sum()

## EDA

### Target
- The target class or label is imbalanced

In [None]:
target_count = df.DEATH_EVENT.value_counts()
death_color = ['navy', 'crimson']
with plt.style.context('ggplot'):
    plt.figure(figsize=(6, 5))
    sns.countplot(data=df, x='DEATH_EVENT', palette=death_color)
    for name , val in zip(target_count.index, target_count.values):
        plt.text(name, val/2, f'{round(val/sum(target_count)*100, 2)}%\n({val})', ha='center',
                color='white', fontdict={'fontsize':13})
    plt.xticks(ticks=target_count.index, labels=['False', 'True'])
    plt.yticks(np.arange(0, 230, 25))
    plt.show()

### Distribution of Numeric Features
- The features `creatinine_phosphokinase` and `serum_creatinine` are extremely right-skewed (positively skewed).

In [None]:
colors = sns.color_palette("tab10")
with plt.style.context('bmh'):
    plt.figure(figsize=(10, 10))
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    for i, (col, name) in enumerate(zip(colors, numeric)):
        plt.subplot(3, 3, i+1)
        sns.histplot(data=df, x=name, color=col)
    plt.suptitle('Histograms of Numeric features', fontsize=15)

In [None]:
fig, axes = plt.subplots(6, 2, figsize=(10, 20))
plt.subplots_adjust(hspace=0.4)
axes = axes.ravel()
for i, name, col in zip(np.arange(0, 14, 2), numeric, colors):
    sns.boxplot(data=df, x=name, ax=axes[i], y='DEATH_EVENT', 
                orient='h', palette=death_color, showfliers=True)
    sns.boxplot(data=df, x=name, ax=axes[i+1], y='DEATH_EVENT', 
                orient='h', palette=death_color, showfliers=False)
plt.suptitle('Boxplot of Numeric features with respect to the target class\n(with and without outliers)', 
             fontsize=15)
plt.show()

### Distribution of Categorical Features w.r.t. target class

In [None]:
colors = sns.color_palette("tab10")
with plt.style.context('bmh'):
    plt.figure(figsize=(12, 15))
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    for i, (col, name) in enumerate(zip(colors, categorical)):
        plt.subplot(3, 2, i+1)
        sns.countplot(data=df, x=name, hue='DEATH_EVENT')

# <a id=data_prep>2. Data Preparation</a>
[Back to index](#toc)

## Separate features and target class

In [None]:
X = df.iloc[:, :-1]
y = df['DEATH_EVENT']

In [None]:
print(X.shape)
print(y.shape)

## Fix Class Imbalance using SMOTE
SMOTE is an oversampling technique that generates synthetic samples for the minority class, in our case the 1's (deaths).
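
To make the idea concrete, here is a minimal sketch of the interpolation step behind a synthetic sample, written in plain Python so it runs without the notebook's dependencies. The helper `smote_like_sample` and the two toy points are hypothetical. imblearn's `SMOTE` chooses the neighbour among the k nearest minority samples, whereas this sketch is simply handed one.

```python
import random

def smote_like_sample(point, neighbour, rng):
    """Return a synthetic point on the segment between `point` and `neighbour`."""
    gap = rng.random()  # uniform in [0, 1): how far to move towards the neighbour
    return [p + gap * (n - p) for p, n in zip(point, neighbour)]

rng = random.Random(2021)
minority_a = [1.0, 10.0]   # hypothetical minority-class sample
minority_b = [3.0, 20.0]   # hypothetical nearby minority-class sample
synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)
```

The synthetic point always lies between the two real samples, which is why SMOTE adds variety compared with simply duplicating minority rows.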

In [None]:
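# n_jobs is omitted: it is deprecated/removed in recent imbalanced-learn releases
smote = SMOTE(random_state=2021, k_neighbors=5)
# fit_resample fits and resamples in one step, so a separate fit() call is not needed
X_smote, y_smote = smote.fit_resample(X, y)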

In [None]:
print(X_smote.shape)
print(y_smote.shape)

## Data Transformation
During EDA, the histograms of a few numeric features indicated skewness. Some features, such as `creatinine_phosphokinase` and `serum_creatinine`, were extremely skewed. Skewed features like these can be made more Gaussian-like using power transforms or log transforms. For example:

**1. creatinine_phosphokinase:** a log transformation can bring the data closer to normality. Here the log transform removes or reduces the skewness because the original data is approximately log-normally distributed.

* The QQ plot below shows the effect of the log transformation on `creatinine_phosphokinase`. A QQ plot (quantile-quantile plot) plots the sample quantiles against the quantiles of a normal (Gaussian) distribution, so normally distributed data appears as a straight line. In other words, a perfectly normal (standardized) sample would exactly follow a line with slope = 1 and intercept = 0.
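
As a rough illustration of what the QQ plot measures, the sketch below builds the QQ points by hand for a synthetic right-skewed sample and reports how far they stray from the 45-degree line before and after a log transform. The helpers `qq_points` and `max_gap` and the log-normal toy data are assumptions for illustration; the notebook itself relies on `statsmodels`' `qqplot`.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def qq_points(sample):
    """Pair each standardized, sorted sample value with a standard-normal quantile."""
    mu, sigma = mean(sample), pstdev(sample)
    ordered = sorted((x - mu) / sigma for x in sample)
    n = len(ordered)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, ordered

def max_gap(sample):
    """Largest vertical distance between the QQ points and the 45-degree line."""
    theoretical, ordered = qq_points(sample)
    return max(abs(t - o) for t, o in zip(theoretical, ordered))

rng = random.Random(2021)
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]  # synthetic, right-skewed
logged = [math.log10(x) for x in skewed]

print(f"max gap before log: {max_gap(skewed):.3f}")
print(f"max gap after log:  {max_gap(logged):.3f}")
```

A smaller gap means the points hug the line more closely, which is the visual impression the "after transformation" panel is meant to give.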

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
qqplot(df.creatinine_phosphokinase, fit=True, line='45', ax=ax[0])
ax[0].set_title('before transformation')
qqplot(np.log10(df.creatinine_phosphokinase), fit=True, line='45', ax=ax[1])
ax[1].set_title('after transformation')
plt.suptitle('q-q plot for creatinine_phosphokinase', fontweight='bold')
plt.show()

**2. serum_creatinine:** a reciprocal transform (p = -1). This transformation has a radical effect because it reverses the order among values of the same sign: larger values become smaller, and vice versa.
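
A tiny check of the order reversal, using made-up positive values rather than the real `serum_creatinine` column:

```python
values = [0.5, 1.0, 2.0, 4.0]            # hypothetical positive measurements, ascending
reciprocals = [v ** -1 for v in values]   # power transform with p = -1

print(reciprocals)  # [2.0, 1.0, 0.5, 0.25]
```

The input is ascending and the output is descending, so the long right tail of large values is squeezed towards zero.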

In [None]:
p = -1
fig, ax = plt.subplots(1, 2, figsize=(10, 5))

qqplot(df.serum_creatinine, fit=True, line='45', ax=ax[0])
ax[0].set_title('before transformation')

qqplot(df.serum_creatinine**p, fit=True, line='45', ax=ax[1])
ax[1].set_title('after transformation')

plt.suptitle('q-q plot for serum_creatinine', fontweight='bold')
plt.show()

The `PowerTransformer` in scikit-learn provides two methods to make a distribution more Gaussian-like:
- Box-Cox
- Yeo-Johnson

Both methods search for the value of p (as in the example above) that makes the distribution closest to normal. Yeo-Johnson extends Box-Cox so that it also handles zero and negative values.
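
For reference, here is a sketch of the Yeo-Johnson formula for a single value, where `lam` plays the role of p. The function name `yeo_johnson` is hypothetical. scikit-learn's `PowerTransformer` additionally estimates `lam` for each feature by maximum likelihood and standardizes the result by default, neither of which is shown here.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value `y` for exponent `lam`."""
    if y >= 0:
        if lam == 0:
            return math.log(y + 1)
        return ((y + 1) ** lam - 1) / lam
    # negative inputs use a mirrored power with exponent (2 - lam)
    if lam == 2:
        return -math.log(-y + 1)
    return -(((-y + 1) ** (2 - lam)) - 1) / (2 - lam)

for value in (-2.0, 0.0, 3.0):
    print(value, yeo_johnson(value, 1.0), yeo_johnson(value, 0.0))
```

With `lam = 1` the transform leaves every value unchanged, and with `lam = 0` non-negative values get a log-like compression, which is the behaviour wanted for right-skewed features.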

In [None]:
pt = PowerTransformer(method='yeo-johnson')
X_pt = pt.fit_transform(X_smote)

## Normalise Data
Finally, normalise the data using the min-max scaler, which scales each feature to the 0-1 range. Scaling is required for ML algorithms such as SVM, logistic regression and kNN, which are sensitive to feature scale and outliers (applicable to both classification and regression problems).
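
The scaling itself is a one-line formula, `(x - min) / (max - min)`, applied per feature. A minimal sketch with hypothetical values (the helper `min_max_scale` is illustrative and assumes the values are not all equal):

```python
def min_max_scale(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ejection_fraction = [20, 35, 50, 80]      # hypothetical values, in percent
print(min_max_scale(ejection_fraction))   # [0.0, 0.25, 0.5, 1.0]
```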

In [None]:
mm = MinMaxScaler()
X_scaled = mm.fit_transform(X_pt)

## Distribution of features after transformation and scaling

In [None]:
pd.DataFrame(X_scaled, columns=X.columns).hist(figsize=(10, 10))
plt.show()

## Feature Selection using Random Forest

In [None]:
rf = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1, 
                            class_weight='balanced', random_state=2021)
rf.fit(X_scaled, y_smote)

In [None]:
feature_imp = pd.DataFrame(np.round(rf.feature_importances_*100, 2), index=X.columns, columns=['importance%'])
feature_imp = feature_imp.sort_values(by='importance%', ascending=False)
feature_imp.plot(kind='barh', figsize=(8, 5))
plt.xlabel('percentage')
plt.show()

In [None]:
imp_features = feature_imp.index[:3]
imp_features

In [None]:
X_selected = pd.DataFrame(X_scaled, columns=X.columns)[imp_features]

In [None]:
X_selected

# <a id=model>3. Data Modelling and Hyperparameter Tuning</a>
[Back to index](#toc)

The goal is now to separate the two classes, as shown in the figure below.

`Note` All the classifiers have been tuned to maximize the F1 score instead of accuracy. The F1 score is the harmonic mean of recall and precision, so it favors classifiers with similar precision and recall. I could have achieved high recall or high precision, but unfortunately we cannot have it both ways: increasing precision reduces recall, and vice versa.
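
A quick sketch of why the harmonic mean rewards balance. The two precision/recall pairs are made up and share the same arithmetic mean of 0.80; `f1_from` is a hypothetical helper, not the scikit-learn function.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_from(0.80, 0.80)
lopsided = f1_from(0.99, 0.61)   # very high precision bought with low recall

print(f"balanced: {balanced:.4f}")   # balanced: 0.8000
print(f"lopsided: {lopsided:.4f}")   # lopsided: 0.7549
```

The lopsided classifier scores lower even though its precision is nearly perfect, which is the trade-off the title of this notebook alludes to.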

In [None]:
model_data = X_selected.copy()  # copy so X_selected itself is not modified
model_data['target'] = y_smote

In [None]:
model_data

In [None]:
px.scatter_3d(model_data, x='time', y='serum_creatinine', z='ejection_fraction', color='target')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(model_data.drop(['target'], axis=1), 
                                                    model_data['target'], 
                                                    test_size=0.25, 
                                                    random_state=2021, 
                                                    stratify=model_data['target']
                                                   )
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

## 3.1 k-Nearest Neighbour Classifier
Let's start with a simple, lazy learner in which an object is classified by a plurality vote of its neighbors.
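
The voting rule fits in a few lines of plain Python. This is only a sketch with made-up 2D points and a hypothetical `knn_predict` helper; `KNeighborsClassifier` does the same thing with far more efficient neighbour search.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label `query` by plurality vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.8)]
labels = [0, 0, 0, 1, 1]
print(knn_predict(points, labels, query=(0.2, 0.2), k=3))    # 0
print(knn_predict(points, labels, query=(0.85, 0.85), k=3))  # 1
```

"Lazy" refers to the fact that nothing is learned up front: all the work happens at prediction time, when distances to the stored training points are computed.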

In [None]:
knn_clf = KNeighborsClassifier(n_jobs=-1)

### Hyperparameter tuning

In [None]:
knn_params = {
    'n_neighbors': np.arange(2, 12)
}
knn_cv = GridSearchCV(knn_clf, knn_params, scoring='f1', n_jobs=-1, cv=10)
knn_cv.fit(X_train, y_train)

### Optimum value to maximize performance
The highest F1 score is achieved by kNN using 9 neighbours.

In [None]:
knn_cv.best_params_

### Predictions on training data using cross val predict
cross_val_predict() performs K-fold cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold. This means that you get a clean prediction for each instance in the training set (“clean” meaning that the prediction is made by a model that never saw the data during training).
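
The mechanics can be sketched without scikit-learn. Everything below is illustrative: the helper names, the toy one-feature data and the threshold "model" are assumptions, and the folds are interleaved by index, whereas `cross_val_predict` uses stratified folds for classifiers.

```python
def out_of_fold_predictions(X, y, fit, predict, n_folds):
    """Predict every sample with a model trained on the other folds only."""
    n = len(X)
    predictions = [None] * n
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, X[i])  # model never saw sample i
    return predictions

def fit_threshold(X, y):
    """Toy model: a cut-off halfway between the two class means."""
    mean0 = sum(x for x, t in zip(X, y) if t == 0) / y.count(0)
    mean1 = sum(x for x, t in zip(X, y) if t == 1) / y.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

X_toy = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0]
y_toy = [0, 1, 0, 1, 0, 1]
oof = out_of_fold_predictions(X_toy, y_toy, fit_threshold, predict_threshold, n_folds=3)
print(oof)  # [0, 1, 0, 1, 0, 1]
```

Every entry of `oof` comes from a model that was fit without that sample, so comparing `oof` with the true labels gives an honest estimate even though only training data is used.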

In [None]:
knn_train_pred = cross_val_predict(knn_cv, X_train, y_train, cv=10, n_jobs=-1)

### Classification Report of training data

In [None]:
print(classification_report(y_train, knn_train_pred, digits=4, target_names=['not gonna die', 'will die']))

`NOTE` A similar approach is used for all the classifiers below.

## 3.2 Logistic Regression
Logistic regression uses the logistic (sigmoid) function, the inverse of the logit, to compute the probability of the outcome, in our case the target class `Death Event`.
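
In code, the probability is a weighted sum of the features passed through the sigmoid. The weights, bias and patient values below are hypothetical numbers chosen for illustration, not coefficients fitted in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def death_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

weights = [-3.0, 2.0, -1.5]   # hypothetical coefficients
bias = 0.5                    # hypothetical intercept
patient = [0.2, 0.9, 0.3]     # hypothetical scaled feature values
p = death_probability(patient, weights, bias)
print(f"{p:.3f}")  # 0.777
```

A probability above 0.5 is predicted as class 1 by default; moving that threshold is one way to trade precision against recall.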

In [None]:
lr_clf = LogisticRegression(class_weight='balanced', random_state=2021, n_jobs=-1)

### Hyper-parameter tuning using Grid Search

In [None]:
lr_params = {
    'penalty': ['l2'],  # the default lbfgs solver supports only l2; 'l1'/'elasticnet' need solver='saga'
    'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000], 
    #'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}
lr_cv = GridSearchCV(lr_clf, lr_params, scoring='f1', cv=10, n_jobs=-1)
lr_cv.fit(X_train, y_train)

### Best parameter value to achieve the highest F1 score

In [None]:
lr_cv.best_params_

### Predictions on training data using cross val predict

In [None]:
lr_train_pred = cross_val_predict(lr_cv, X_train, y_train, cv=10, n_jobs=-1)

###  Classification report

In [None]:
print(classification_report(y_train, lr_train_pred, digits=4, target_names=['not gonna die', 'will die']))

## 3.3 Support Vector Machine (SVM)

### Linear SVM classification
As seen from the 3D plot of the data, the classes do not look separable with a hard-margin SVM because there is some noise in both classes. The linear SVM is therefore fit with a soft margin, controlled by `C`.

In [None]:
lin_svm_clf = SVC(kernel='linear', class_weight='balanced', random_state=2021)

In [None]:
params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], # inverse regularization strength (smaller C = softer margin)
}
lin_svm_cv = GridSearchCV(lin_svm_clf, params, scoring='f1', n_jobs=-1, cv=10)
lin_svm_cv.fit(X_train, y_train)

In [None]:
lin_svm_cv.best_params_

In [None]:
lin_svm_train_pred = cross_val_predict(lin_svm_cv, X_train, y_train, cv=10, n_jobs=-1)

In [None]:
print(classification_report(y_train, lin_svm_train_pred, digits=4, target_names=['not gonna die', 'will die']))

### Non-Linear SVM Classification (RBF kernel)
Using the RBF kernel, again with a soft margin.

In [None]:
rbf_svm = SVC(kernel='rbf', class_weight='balanced')

In [None]:
params = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], # inverse regularization strength (smaller C = softer margin)
}
rbf_svm_cv = GridSearchCV(rbf_svm, params, scoring='f1', n_jobs=-1, cv=10)
rbf_svm_cv.fit(X_train, y_train)

In [None]:
rbf_svm_cv.best_params_

In [None]:
rbf_svm_train_pred = cross_val_predict(rbf_svm_cv, X_train, y_train, cv=10, n_jobs=-1)

In [None]:
print(classification_report(y_train, rbf_svm_train_pred, digits=4, target_names=['not gonna die', 'will die']))

## 3.4 Decision Tree
**Pros**
- Requires very little data preparation and doesn't require feature scaling or centering.
- Simple to understand and interpret.

**Cons**
- Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. It is very likely that the model will not generalize well. One way to limit this problem is to use Principal Component Analysis, which often results in a better orientation of the training data.
- The main issue with Decision Trees is that they are very sensitive to small variations in the training data. Random Forests can limit this instability by averaging predictions over many trees.

In [None]:
dt_clf = DecisionTreeClassifier(class_weight='balanced', random_state=2021)

In [None]:
params = {
    'criterion': ['gini', 'entropy'], 
    'max_depth': np.arange(2, 22, 2), # depth of tree
    'min_samples_split': [2, 3, 4], # min. no. of samples a node must have before it splits 
    'min_samples_leaf': [1, 2, 3, 4] # min. no. of samples a leaf node must have
}
dt_cv = GridSearchCV(dt_clf, params, scoring='f1', n_jobs=-1, cv=10)
dt_cv.fit(X_train, y_train)

In [None]:
dt_cv.best_params_

In [None]:
dt_train_pred = cross_val_predict(dt_cv, X_train, y_train, cv=10, n_jobs=-1)

### DT Visualized

In [None]:
best_dt_clf = DecisionTreeClassifier(class_weight='balanced', random_state=2021, 
                                    max_depth=4, criterion='entropy', min_samples_split=2, 
                                     min_samples_leaf= 1)

In [None]:
best_dt_clf.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(15, 6))
plot_tree(best_dt_clf, filled=True, 
          feature_names=list(X_train.columns)  # selected features, in training-column order
         )
plt.show()

## 3.5 Random Forest
- It is an ensemble (group of predictors) of decision trees.
- It has nearly all the hyperparameters of the decision tree, plus ensemble ones such as `n_estimators`.

In [None]:
rf = RandomForestClassifier(n_jobs=-1, random_state=2021, class_weight='balanced')

In [None]:
params = {
    #'n_estimators': [100, 200, 300], 
    'max_depth': np.arange(2, 22, 1), 
    #'min_samples_split': [2, 3, 4], 
    #'min_samples_leaf': [1, 2, 3, 4], 
    'criterion': ['gini', 'entropy']
}
rf_cv = RandomizedSearchCV(rf, params, scoring='f1', n_jobs=-1, cv=10, random_state=2021, n_iter=20)
rf_cv.fit(X_train, y_train)

In [None]:
rf_cv.best_params_

In [None]:
rf_train_pred = cross_val_predict(rf_cv, X_train, y_train, cv=10, n_jobs=-1, verbose=1)

In [None]:
print(classification_report(y_train, rf_train_pred, digits=4, target_names=['not gonna die', 'will die']))

# <a id=eval>4. Model Evaluation</a>
[Back to index](#toc)

Random Forest outperformed all the other classifiers in accuracy, precision, recall, F1 score and AUC on the cross-validated training predictions.

In [None]:
models = ['kNN', 'Logistic Regression', 'Linear SVM', 'Non-Linear SVM', 
          'Decision Tree', 'Random Forest']
model_colors = sns.color_palette("Dark2")
accuracy = []
recall = []
precision = []
f1 = []
auc = []
predictions = [knn_train_pred, lr_train_pred, lin_svm_train_pred, 
               rbf_svm_train_pred, dt_train_pred, rf_train_pred]

for model_pred in predictions:
    accuracy.append(accuracy_score(y_train, model_pred))
    precision.append(precision_score(y_train, model_pred))
    recall.append(recall_score(y_train, model_pred))
    f1.append(f1_score(y_train, model_pred))
    # note: AUC is computed from hard class labels here, so it equals balanced accuracy
    auc.append(roc_auc_score(y_train, model_pred))

In [None]:
with plt.style.context('ggplot'):
    plt.figure(figsize=(10, 5))
    plt.bar(models, accuracy, color=model_colors)
    for m, a in zip(models, accuracy):
        plt.text(m, a+0.01 , f'{round(a*100, 3)}%', ha='center')
    plt.xlabel('Models')
    plt.ylabel('Accuracy percentage (%)')
    plt.title('Model comparison on training data using Accuracy')
    plt.show()

In [None]:
with plt.style.context('ggplot'):
    plt.figure(figsize=(10, 5))
    plt.bar(models, recall, color=model_colors)
    for m, a in zip(models, recall):
        plt.text(m, a+0.01 , f'{round(a*100, 3)}%', ha='center')
    plt.xlabel('Models')
    plt.ylabel('Recall percentage (%)')
    plt.title('Model comparison on training data using Recall')
    plt.show()

In [None]:
with plt.style.context('ggplot'):
    plt.figure(figsize=(10, 5))
    plt.bar(models, precision, color=model_colors)
    for m, a in zip(models, precision):
        plt.text(m, a+0.01 , f'{round(a*100, 3)}%', ha='center')
    plt.xlabel('Models')
    plt.ylabel('Precision percentage (%)')
    plt.title('Model comparison on training data using Precision')
    plt.show()

In [None]:
with plt.style.context('ggplot'):
    plt.figure(figsize=(10, 5))
    plt.bar(models, f1, color=model_colors)
    for m, a in zip(models, f1):
        plt.text(m, a+0.01 , f'{round(a*100, 3)}%', ha='center')
    plt.xlabel('Models')
    plt.ylabel('F1 percentage (%)')
    plt.title('Model comparison on training data using F1 score')
    plt.show()

In [None]:
with plt.style.context('ggplot'):
    plt.figure(figsize=(10, 5))
    plt.bar(models, auc, color=model_colors)
    for m, a in zip(models, auc):
        plt.text(m, a+0.01 , round(a, 3), ha='center')
    plt.xlabel('Models')
    plt.ylabel('AUC')
    plt.title('Model comparison on training data using AUC')
    plt.show()

# <a id=predict>5. Predictions on Test Data</a>
[Back to index](#toc)

In [None]:
best_models = [knn_cv, lr_cv, lin_svm_cv, rbf_svm_cv, dt_cv, rf_cv]
for name, model in zip(models, best_models):
    best_predictions = model.predict(X_test)
    print(name.upper())
    print(classification_report(y_test, best_predictions))
    print("-------------------------------------------------------------")

## Model comparison on test data using ROC Curve
- The ROC curve plots the true positive rate (another name for recall) against the false positive rate (FPR).
- Once again there is a trade-off: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
- One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC equal to 1, whereas a purely random classifier has a ROC AUC equal to 0.5.
- Random Forest has the highest AUC.
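
As a sketch of how the curve and its area arise, the snippet below sweeps a threshold over continuous scores and integrates with the trapezoid rule. The helpers `roc_points` and `auc_trapezoid` and the four toy samples are assumptions for illustration. One practical point it shows: hard 0/1 predictions allow only one useful threshold, so the resulting "curve" has a single interior point.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_toy = [0, 0, 1, 1]
score_toy = [0.1, 0.4, 0.35, 0.8]          # e.g. predicted probabilities of class 1
curve = roc_points(y_toy, score_toy)
print(curve)                 # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc_trapezoid(curve))  # 0.75

hard_labels = [0, 0, 0, 1]                 # the same scores cut at 0.5
print(len(roc_points(y_toy, hard_labels)))  # 3
```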

In [None]:
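best_models = [knn_cv, lr_cv, lin_svm_cv, rbf_svm_cv, dt_cv, rf_cv]
with plt.style.context('ggplot'):
    plt.figure(figsize=(10, 5))
    for name, model in zip(models, best_models):
        # use continuous scores so the curve covers more than one threshold;
        # SVC (without probability=True) only offers decision_function
        if hasattr(model, 'predict_proba'):
            test_scores = model.predict_proba(X_test)[:, 1]
        else:
            test_scores = model.decision_function(X_test)
        fpr, tpr, thresholds = roc_curve(y_test, test_scores)
        plt.plot(fpr, tpr, linewidth=2, label=name)
    plt.plot([0, 1], [0, 1], 'k--')  # purely random classifier
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate (Recall)')
    plt.legend()
    plt.show()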

# <a id=conclude>6. Conclusion</a>
[Back to index](#toc)

- Instead of all 12 features, only 3 (`time`, `serum_creatinine` and `ejection_fraction`) are sufficient to model the data. Using the top 7 or all of the features resulted in overfitting.
- Random Forest has the best performance compared to the other models on both train and test data.