# Heart Failure Analysis

## Background on the Dataset

The data were collected at the Institute of Cardiology and Allied Hospital, Faisalabad, Pakistan, during April-December 2015 (Ahmad T, Munir A, Bhatti SH, Aftab M, Raza MA (2017) Survival analysis of heart failure patients: A case study. PLoS ONE 12(7): e0181001. https://doi.org/10.1371/journal.pone.0181001).

Chicco & Jurman, 2020, found that a machine learning model using only the ejection fraction and serum creatinine from this dataset was enough to predict death during the follow up period.  (Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5).

The dataset consists of 299 patients who were 40 years old or more, and who had left ventricular systolic dysfunction falling into NYHA classes 3 and 4.  The features are not causes of heart failure, but symptoms of it.

The normal level of serum creatinine is taken to be 1.5, and any measurement greater than 1.5 is an indicator of renal dysfunction (Ahmad et al., 2017).

Ejection fraction is the percentage of blood pumped out of the heart during a single contraction.  Normal values range from 50-75%, and a value < 40% indicates heart failure (Chicco & Jurman, 2020).

Each patient has 1 follow up time, which can vary from 4 days to 285 days after their appointment.  

A description of each feature can be found here: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data/discussion/193109.

## Identifying the Problem to be Solved

Although it seems reasonable to assume patients in this dataset died from heart failure, there is no way to be certain.  There is also no way of knowing if patients died shortly after their follow up.  The follow up time varies between patients.  A patient who was ok for their follow up 30 days later could have died on day 31, but they would be labeled as death_event = 0 in this dataset.  For this reason, and given the small size of the dataset, I expect there will be a large number of outliers.

Since the follow up times are all different, I think it is misguided to say that death can be predicted from this dataset, despite Chicco & Jurman, 2020 having success in this effort.  If 2 patients have nearly identical feature vectors, but patient A was followed up on day 10 and was ok, but patient B was followed up on day 190 and had died, then the model would be confused because the labels would be different.  In one case, the model is essentially learning to predict whether a patient died within 10 days, and in the other, it is learning to predict whether a patient died within 190.  The only way it would make sense to predict death would be if all of the follow up times were exactly the same.  
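To make the labeling problem concrete, here is a minimal sketch of what a fixed-horizon label would look like. It assumes that `time` is the day of death for patients with death_event = 1 and the last day a patient was known to be alive otherwise. The helper `label_at_horizon`, the toy rows, and the 90-day horizon are all hypothetical and are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def label_at_horizon(time, death_event, horizon):
    """Return 1.0 (died by the horizon), 0.0 (known alive at the horizon), or NaN (unknown)."""
    time = np.asarray(time, dtype=float)
    death_event = np.asarray(death_event, dtype=int)
    labels = np.full(time.shape, np.nan)
    # anyone observed at or beyond the horizon was alive when the horizon was reached...
    labels[time >= horizon] = 0.0
    # ...unless they died on or before the horizon
    labels[(death_event == 1) & (time <= horizon)] = 1.0
    # patients last seen alive before the horizon stay NaN: their outcome is unknown
    return labels


# hypothetical toy rows; the first two mirror patient A and patient B from the text
toy = pd.DataFrame({
    "time": [10, 190, 30, 250],
    "death_event": [0, 1, 0, 0],
})
toy["died_within_90d"] = label_at_horizon(toy["time"], toy["death_event"], horizon=90)
print(toy)
```

Under these assumptions, patient A (alive at day 10) has no usable 90-day label, while patient B (died on day 190) counts as alive at day 90. The raw death_event column hides both facts.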

So instead of using this data to predict death, it seems to make more sense to cluster it into 2 groups, and identify the feature differences between these groups.  This may help determine how different symptoms measure the severity of heart failure.  The death_event variable can be used as a guide to separate the clusters, but it will not be treated as ground truth.

**TLDR: I will cluster the data by severe & less severe heart failure, then examine how the attributes vary between these groups.**

## Data Ingestion

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
import plotly.express as px
from scipy.spatial.distance import cdist
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [None]:
d = pd.read_csv("/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv")
d.info()

In [None]:
d.rename(columns={"DEATH_EVENT": "death_event"}, inplace=True)
d.isna().sum()

## Exploratory Data Analysis

There are no null values in the data.  Let's see what the univariate distributions look like.

Things to look for/verify:
1. The patients are all supposed to be >= 40 years old.  Are they?
2. Are there any univariate outliers or unusual categories?
3. Are there any rare levels of categorical features?
4. Do any variables have skewed distributions that may need to be corrected?
5. Is there a class imbalance in the target?  This is less important for my goal, but it would be important for anyone trying to fit a classifier.

In [None]:
for col in d.columns.tolist():
    sns.histplot(d[col], kde=False)  # distplot is deprecated in recent seaborn releases
    plt.title(f"Distribution of {col}")
    plt.show()

Age is distributed as expected.

There are some skews, but serum_creatinine and creatinine_phosphokinase are the worst offenders.  I plan to log transform these.
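As a quick illustration of why a log transform helps with right-skewed features, the sketch below compares sample skewness before and after the transform. It uses synthetic log-normal data as a stand-in for the real columns, so the numbers say nothing about this dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic, strictly positive, right-skewed values (a stand-in for a lab measurement)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

# the values are strictly positive here, so a plain log is safe
logged = np.log(skewed)

skew_before = skewed.skew()
skew_after = logged.skew()
print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

A skewness near 0 after the transform suggests the distribution is roughly symmetric, which makes standardization and distance calculations less sensitive to the long right tail.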

The only categorical features are the binary ones, and there are no rare categories.

ejection_fraction has some unusual values.

The target variable (death_event) is imbalanced toward 0.

The time variable (follow up days) is interesting.  Is there any evidence that patients who were followed up with sooner had worse cases of heart failure?  Might they have died sooner, on average?  I'll explore these questions in a moment.  First, let's see what is going on with ejection_fraction.

In [None]:
# what happens to people whose hearts can only pump out 25% of their blood?
d[d['ejection_fraction'] <= 25].groupby("death_event")["death_event"].count()

In [None]:
# build both filters on the full frame so the boolean masks share the same index
low_ef = d['ejection_fraction'] <= 25
dead = d[low_ef & (d.death_event == 1)]
not_dead = d[low_ef & (d.death_event == 0)]
sns.histplot(dead.time, kde=True, stat="density", label="dead", color="Crimson")
sns.histplot(not_dead.time, kde=True, stat="density", label="not dead", color="Green")
plt.legend()
plt.title("Follow Up Times for Patients with Low Ejection Fractions")
plt.show()

So low ejection fractions seem to correspond to more severe cases of heart failure.

Let's take a look at all of the features, split by death_event.

In [None]:
# apply log transformations
original_vars = d[['serum_creatinine', 'creatinine_phosphokinase']].copy()
d['log_serum_creatinine'] = np.log(d['serum_creatinine']+1e-9)
d['log_creatinine_phosphokinase'] = np.log(d['creatinine_phosphokinase']+1e-9)

# standardize the non-binary features
features_to_scale = d.columns[d.dtypes == 'float'].tolist() + ['ejection_fraction', 'serum_sodium', 'time']
scaler = StandardScaler()
d[features_to_scale] = scaler.fit_transform(d[features_to_scale])

# remove the original values
d.drop(original_vars.columns.tolist(), axis=1, inplace=True)

# plot distributions by death_event
for col in d.columns:
    fig = px.histogram(
        d, x=col, color="death_event", 
        marginal="violin", hover_data=d.columns,
        title = f"Distribution of {col} vs death_event", 
        labels={col: col},  # labels expects a dict mapping column names to display names
        template="plotly_dark",
        color_discrete_map={0: "Green", 1: "Crimson"}
    )
    fig.show()

These plots are looking for 2 things:
1. Do the distributions of any of these variables vary by the target?
2. Do any of the outliers vary by the target?

ejection_fraction is lower for those who died.  Outliers in this feature follow that pattern too: high end outliers lived (higher ejection fraction) and low end outliers died (lower ejection fraction).

log_serum_creatinine is slightly lower, on average, for those who lived, while those who died have a broader range that extends to larger values.

ejection_fraction and serum_creatinine were found to be the most useful features by Chicco & Jurman, 2020.  So these patterns align with what they found.

## Clustering

I want to divide the patients into 2 groups, where one consists of more severe heart failure, and the other is less severe.  The death_event variable can serve as a guide for this clustering, because patients who died likely had more severe heart failure.  By using it as a guide to divide the data, I can create 2 exemplars: 1 sample that is characteristic of patients who lived and another sample that is characteristic of patients who died.  These exemplars can serve as cluster centroids, and I can then find the pairwise Euclidean distances between each centroid and each observation.  This is a lot like k-means, but rather than assigning clusters, I will keep the distances as new features. 
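The idea of exemplar distances can be sketched on a tiny toy frame before applying it to the real data. The column names `x1` and `x2` and the values below are hypothetical; only the pattern (group means as exemplars, Euclidean distances as new features) reflects the approach described above.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# hypothetical toy data: two features and a binary label
toy = pd.DataFrame({
    "x1": [0.0, 0.2, 1.0, 1.2],
    "x2": [0.0, 0.1, 1.0, 0.9],
    "death_event": [0, 0, 1, 1],
})
features = ["x1", "x2"]

# one exemplar (mean feature vector) per label; rows are indexed by label 0 and 1
centroids = toy.groupby("death_event")[features].mean()

# distance from every observation to every exemplar, shape (n_rows, 2)
dists = cdist(toy[features].to_numpy(), centroids.to_numpy(), metric="euclidean")

# keep the distances as features instead of hard cluster assignments
toy["dist_from_not_dead"] = dists[:, 0]
toy["dist_from_dead"] = dists[:, 1]
print(toy)
```

Keeping both distances, rather than only the nearest label, preserves how borderline each observation is.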

In [None]:
# split the dataset by death_event and create exemplars by averaging the features in each partition
# the result is a single synthetic observation (a pandas Series) that is characteristic of dead/not dead
# averaging smooths out the effect of outliers, which would otherwise trip up a classifier
# note that the data has already been standardized, so univariate outliers will have less of an impact on the mean
# death_event is constant within each partition, so its mean is simply the label (1 or 0)
dead = d.loc[d.death_event == 1].mean(axis=0)
not_dead = d.loc[d.death_event == 0].mean(axis=0)

In [None]:
dead

In [None]:
not_dead

In [None]:
# calculate the pairwise distance to each exemplar
# and add these distances as new features
# cdist returns an (n, 1) array here, so flatten it before assigning it to a column
d['dist_from_dead'] = cdist(d.drop("death_event", axis=1), dead.drop("death_event").values.reshape(1, -1), 'euclidean').ravel()
d['dist_from_not_dead'] = cdist(d.drop(["death_event", "dist_from_dead"], axis=1), not_dead.drop("death_event").values.reshape(1, -1), 'euclidean').ravel()
d.head(20)

A simple classifier could be built around these 2 features.  If dist_from_dead < dist_from_not_dead, then the classifier could predict 1 for death_event.  Otherwise it could predict 0.  If this simple rule were used, how would it do?
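This rule is the same idea as a nearest-centroid classifier. The sketch below checks that equivalence on synthetic data, using scikit-learn's `NearestCentroid` with its default Euclidean metric. The data and variable names are illustrative assumptions, not the notebook's.

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)

# synthetic two-class data standing in for the standardized features
X = np.vstack([
    rng.normal(0.0, 1.0, size=(60, 3)),
    rng.normal(1.5, 1.0, size=(40, 3)),
])
y = np.array([0] * 60 + [1] * 40)

# hand-rolled rule: predict 1 when the class-1 exemplar is the closer one
c0 = X[y == 0].mean(axis=0)
c1 = X[y == 1].mean(axis=0)
d0 = np.linalg.norm(X - c0, axis=1)
d1 = np.linalg.norm(X - c1, axis=1)
rule_preds = np.where(d1 < d0, 1, 0)

# the same rule via scikit-learn
nc_preds = NearestCentroid().fit(X, y).predict(X)

print("agreement with NearestCentroid:", (rule_preds == nc_preds).mean())
print("resubstitution accuracy:", (rule_preds == y).mean())
```

Because the exemplars are computed from the same rows that are then scored, the accuracy is a resubstitution estimate and is likely to be optimistic compared with held-out data.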

In [None]:
preds = np.where(d.dist_from_dead < d.dist_from_not_dead, 1, 0)
print(
    "Accuracy if this were a classifier:", 
    round((sum(preds==d.death_event)/len(d)) * 100, 2),
    "%"
)
# confusion matrix
pd.crosstab(d.death_event, preds)

This simple rule makes a decent classifier.  But again, since the meaning of "predicting death" for this dataset is ambiguous due to the different follow up times, my goal is to explore the feature differences between the groups. 

Let's first explore the outliers: the patients who were more similar to the dead exemplar but who lived, and the patients who were more similar to the not_dead exemplar but who still died.  According to the contingency table, there are 33 and 21 of those, respectively.

In [None]:
# inspect patients who were closer to death but did not die
d[(d['dist_from_dead'] < d['dist_from_not_dead']) & (d['death_event'] == 0)]

In [None]:
# inspect patients who were closer to not_dead but who still died
d[(d['dist_from_not_dead'] < d['dist_from_dead']) & (d['death_event'] == 1)]

## Fitting a Classifier, Just for Kicks and Giggles

Since I used death_event to guide the partitioning used to create the exemplars, let's just see if a classifier could find better boundaries.  We can look at the accuracy and confusion matrix to see how a classifier might compare.  

In [None]:
dsub = d[["death_event", "time", "ejection_fraction", "log_serum_creatinine"]].copy()

x_train, x_test, y_train, y_test = train_test_split(
    dsub.drop("death_event", axis=1),
    dsub["death_event"],
    test_size=0.5,
    random_state=14,
    shuffle=True,
    stratify=dsub["death_event"],
)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

knn = KNeighborsClassifier(
    n_neighbors=5, 
    algorithm='auto', 
    p=2, 
    metric='minkowski'
)
knn.fit(x_train, y_train)
knn_preds = knn.predict(x_test)

svc = SVC(
    C=1.0,
    kernel='rbf',
    gamma='scale',
    random_state=14
)
svc.fit(x_train, y_train)
svc_preds = svc.predict(x_test)

print(
    "Accuracy of KNN, k=5:", 
    round((sum(knn_preds==y_test)/len(y_test)) * 100, 2),
    "%"
)
# confusion matrix
print(pd.crosstab(y_test, knn_preds), "\n")
print(
    "Accuracy of SVC:", 
    round((sum(svc_preds==y_test)/len(y_test)) * 100, 2),
    "%"
)
# confusion matrix
print(pd.crosstab(y_test, svc_preds))

So the classifiers did not do much better than the simple rule.  In fact, tweaking the hyperparameters, feature selection, and train/test split produces values that are both higher and lower than 81% accuracy, showing that all 3 methods are comparable.
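One way to quantify how much accuracy moves with the split is repeated stratified cross-validation. The sketch below runs it on synthetic data generated with `make_classification`. The sample size matches the dataset, but the features, class weights, and seed are arbitrary illustrative choices, so the printed numbers are not results for the heart failure data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in: 299 rows, 3 informative features, imbalanced classes
X, y = make_classification(
    n_samples=299,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    weights=[0.68, 0.32],
    random_state=14,
)

# scaling lives inside the pipeline so it is refit on each training fold
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5 folds repeated 10 times gives 50 accuracy estimates
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=14)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy: {scores.mean():.3f}")
print(f"std of accuracy: {scores.std():.3f}")
print(f"min/max: {scores.min():.3f} / {scores.max():.3f}")
```

The spread between the minimum and maximum scores gives a sense of how far a single train/test split can drift from the average.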

Let's return to the primary objective though: examining the feature differences between the groups.

## Factor Analysis

Now I want to see how each feature varies by cluster.  Some feature values should stand out as being more characteristic of severe heart failure.

In [None]:
d['cluster'] = preds
dsub = d.drop(["death_event", "dist_from_dead", "dist_from_not_dead"], axis=1).copy()

fig = px.parallel_coordinates(
    dsub.drop(["anaemia", "diabetes", "sex", "log_creatinine_phosphokinase"], axis=1), 
    color="cluster", 
    title="Parallel Coordinates by Cluster",
    color_continuous_scale=px.colors.diverging.Tealrose,
    color_continuous_midpoint=0.5
)
fig.show()

The few features that stand out are time, age (although age is not really a symptom of heart failure), serum creatinine, and ejection fraction.

Let's do some PCA and look at the factor loadings.  The features that are not symptoms of heart failure will be removed.

In [None]:
dsub = d.drop(["death_event", "dist_from_dead", "dist_from_not_dead", "age", "sex", "smoking", "cluster"], axis=1).copy()

pca = PCA(n_components=None, whiten=False)
pca_dim = pca.fit_transform(dsub.values)
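# components_ has shape (n_components, n_features), so scale each row (component)
# by the square root of that component's explained variance
loadings = pca.components_ * np.sqrt(pca.explained_variance_)[:, np.newaxis]
for factor in range(loadings.shape[0]):
    ldf = pd.DataFrame({
        "feature": dsub.columns.to_list(),
        "loading": loadings[factor]
    })
    print(f"Principal Component {factor + 1}")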
    print(ldf)
    print("\n")

In [None]:
# plot the first 2 principal components, colored by cluster assignment (preds)
plt.figure(figsize=(10, 5))
plt.xlabel("Latent Variable 1 (explains most variance)")
plt.ylabel("Latent Variable 2 (explains 2nd most variance)")
plt.title("PCA 2-Dimension Plot Colored by Cluster")
plt.scatter(pca_dim[:, 0], pca_dim[:, 1], c=preds)
plt.colorbar()
plt.show()

The first principal component, which explains the most variance, has the largest loadings from ejection_fraction and serum_sodium.  They are both negative, meaning they impact the component in the same direction.  log_serum_creatinine has a large positive loading, meaning it impacts the component in the opposite direction of ejection_fraction and serum_sodium.

The second principal component has large loadings from log_creatinine_phosphokinase, time, and ejection_fraction.  

The third principal component appears to chiefly capture platelets, as its loading far outweighs the rest.

So given the loadings from the first 3 principal components, it seems the most important factors are ejection_fraction, serum_creatinine, serum_sodium, creatinine_phosphokinase, and time.  All that is left is to determine how these relate to the severity of heart failure.
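For reference, here is a minimal sketch of how loadings can be computed and sanity-checked with scikit-learn. It uses hypothetical toy columns (`feat_a`, `feat_b`, `feat_c`) rather than the real features. Since `components_` has shape (n_components, n_features), transposing it puts one component in each column, and each column is then scaled by the square root of its own explained variance.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# hypothetical toy data: feat_a and feat_b are strongly anti-correlated, feat_c is noise
base = rng.normal(size=200)
toy = pd.DataFrame({
    "feat_a": base + 0.3 * rng.normal(size=200),
    "feat_b": -base + 0.3 * rng.normal(size=200),
    "feat_c": rng.normal(size=200),
})

pca = PCA(n_components=None).fit(toy.values)

# rows are features, columns are components
loadings = pd.DataFrame(
    pca.components_.T * np.sqrt(pca.explained_variance_),
    index=toy.columns,
    columns=[f"PC{i + 1}" for i in range(pca.n_components_)],
)
print(loadings.round(3))
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# sanity check: squared loadings of each component sum to its explained variance
check = np.allclose((loadings ** 2).sum(axis=0), pca.explained_variance_)
print("squared loadings sum to explained variance:", check)
```

In this toy example the two anti-correlated columns load on the first component with opposite signs, which is the same kind of pattern used above to read the direction of each feature's contribution.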

Looking back at Ahmad et al., 2017, a serum sodium level < 135 indicates hyponatremia, a condition in which excess water causes the body's cells to swell.  High creatinine levels are a signal for renal (kidney) failure.  **So a combination of a low ejection fraction, low sodium, high creatinine, and high creatinine phosphokinase seems problematic, as this combination indicates the most severe heart failure.  It also makes sense that a patient with severe heart failure would have an earlier follow up appointment.**