<img src="https://www.vippng.com/png/full/234-2341597_heart-anatomical-drawing-vintage-old-heart-drawing.png" width="200">

# Introduction

The purpose of this notebook is to introduce a Bayesian generalized linear model approach aiming at *i)* predicting heart failure risk and *ii)* understanding which factors contribute the most, using the Heart Failure Prediction dataset. This dataset comprises a total of 299 observations and 13 clinical covariates, including well-known risk factors.

My choice for a backend framework in this analysis was PyMC, although I would like to try Tensorflow Probability. For a practical introduction to Bayesian modelling using PyMC I recommend the book [Bayesian Methods for Hackers](https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/); for a theoretical introduction, I suggest instead the book [Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/), originally written along R examples but recently accompanied by the PyMC3 equivalent. Finally, for an in-depth introduction based on R with a behavioural ecology use-case you can also refer to my [own blog post](https://poissonisfish.com/2019/05/01/bayesian-models-in-r/) from two years ago.

I hope this use-case will convince you to consider Bayesian inference in tackling small modelling problems.

In [None]:
# Imports
import os
import numpy as np
import pandas as pd
import seaborn as sns
import pymc3 as pm
from scipy import stats
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix

# Import dataset
data = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

# Set constants and seed
NUM_SAMPLES, NUM_VARS = data.shape
SEED = 999
np.random.seed(SEED)

# Preview
data.head()

# Pre-processing

Starting off, each of the quantitative variables present in the dataset will be transformed and standardized to approximate a standard normal distribution, *i.e.* $X \sim \mathcal{N}(0,1)$. This procedure brings about many advantages in the scope of a Bayesian analysis, such as

- Prior distributions of the model coefficients can be shared
- Inference becomes straightforward, *e.g.* discarding input variables defaults to the respective sample averages

To this end I decided to apply a Box-Cox transformation followed by mean-centering and unit-variance scaling to the quantitative variables, which I will henceforth designate as normalized.
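
To make the two steps concrete, here is a minimal sketch in plain Python. It assumes a fixed, hand-picked $\lambda$ for illustration, whereas `stats.boxcox` estimates $\lambda$ from the data; the helper names and the input values are hypothetical and not taken from the dataset.

```python
import math
import statistics


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a fixed lambda."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def standardize(values):
    """Center to zero mean and scale to unit variance (population SD)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical right-skewed measurements, not taken from the dataset
raw = [0.5, 0.8, 1.0, 1.2, 1.9, 9.4]
normalized = standardize(box_cox(raw, lam=0))
print([round(v, 2) for v in normalized])
```

With $\lambda = 0$ the transform reduces to a plain logarithm, which is often enough to tame a long right tail before standardizing.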

Regarding the binary variables that constitute the rest of the variable set, we will instead check for near-zero variance - skewed binary variables can seriously hurt model inference and performance.

In [None]:
# Make sure there are no missing values
assert not data.isnull().values.any()

# Index binary variables 
is_binary = data.isin([0,1]).all().values

# Split variable types and keep record of var names
X_bin = data.iloc[:, is_binary]
X_quant = data.iloc[:, ~is_binary]

# Store var names as lists
bin_names = data.columns[is_binary].tolist()
quant_names = data.columns[~is_binary].tolist()

# Box-Cox transform of quant variables, then standardize
X_quant = X_quant.apply(lambda x: stats.boxcox(x)[0])
X_quant = StandardScaler().fit_transform(X_quant)

# Look into proportions in binary variables
print('Proportions in binary variables:\n', X_bin.mean())

As seen from above, the mean of all binary variables falls in the range $[0.3, 0.7]$ which given the sample size of $N = 299$ should present no problem.

# Exploratory data analysis

Following pre-processing we can set to investigate the relationships among the variables. I will do this separately for binary and quantitative features, but a joint exploratory analysis is definitely possible and recommended.

## Binary variables

Any interdependencies among binary variables can be assessed from a [Cramér's V](https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V) correlation matrix. Note that I disable Yates' continuity correction for the $\chi^2$ statistic below by passing `correction=False`, so that the self-correlation of each variable is exactly one. You can see for yourself how little difference it makes if enabled.

In [None]:
# NOTE: This function should only take binary variables
def cramers_v(x, y):
    # Confusion matrix
    conf_matrix = pd.crosstab(x, y)
    # Compute chi-squared w/o Yates' continuity correction
    chi2 = stats.chi2_contingency(conf_matrix, correction=False)[0]
    n = sum(conf_matrix.sum())
    # With binary variables we have V = np.sqrt(chi2 / n)
    return np.sqrt(chi2 / n)

# Create and populate correlation matrix
corr_bin = np.eye(X_bin.shape[1])
for r in range(X_bin.shape[1]):
    for c in range(X_bin.shape[1]):
        corr_bin[r,c] = cramers_v(X_bin.iloc[:,r], X_bin.iloc[:,c])

# Plot correlation matrix
sns.set(rc={'figure.figsize':(9, 7)})
sns.heatmap(corr_bin, vmin=0, vmax=1, annot=True,
            cbar_kws={'label': "Cramer's V"},
            xticklabels=bin_names, yticklabels=bin_names)

What clearly stands out from the correlation matrix above is the intriguing association between smoking and sex. On closer look, we have 96% of non-smokers among women (101/105) but only 53% among men (102/194). Here too, you can see this for yourself using a contingency table, *e.g.* `pd.crosstab(X_bin['smoking'], X_bin['sex'])`. This dependence between the two factors would require special handling upon modelling, by introducing interaction effects or other strategies. However, I will refrain from using interaction effects or applying any further treatment.
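
As a sanity check on that association, the statistic can be recomputed by hand from the four counts quoted above. This is only a sketch in plain Python; `cramers_v_2x2` is a hypothetical helper written for this illustration, separate from the `cramers_v` function used earlier.

```python
import math


def cramers_v_2x2(table):
    """Cramér's V for a 2x2 table of counts, without Yates' correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / n)


# Rows: women, men; columns: non-smokers, smokers (counts quoted above)
sex_by_smoking = [[101, 4], [102, 92]]
v = cramers_v_2x2(sex_by_smoking)
print(round(v, 3))
```

The result lands at roughly 0.45, a moderate association on the 0 to 1 scale of the statistic.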

## Quantitative variables

In a similar vein we proceed to examine any existing interdependencies among the normalized quantitative variables. Here I would argue we do not need to compute and examine similarity metrics, and can directly visualize the bivariate distributions over all such features. The following figure will additionally distinguish smokers from non-smokers using different colors.

In [None]:
# Create dataframe to enable pairplot
data_pairplot = pd.DataFrame(X_quant, columns=quant_names)
data_pairplot['smoking'] = X_bin.smoking

# Pairplot (quantitative variables)
sns.pairplot(data=data_pairplot, hue='smoking')

The normalized quantitative variables seem rather uncorrelated, which is good news for MCMC sampling. As for smoking status, there seem to be few - if any - associated differences overall.

## Principal Component Analysis (PCA)

To conclude this exploratory phase I suggest performing a principal component analysis (PCA) of the normalized quantitative variables and visualizing the scores over the first two principal components (PCs). This should both provide us with an overview of the sample composition, *e.g.* presence of outliers, and help assess whether the projections carrying the most variance relate to heart failure incidence.

In [None]:
# Fit PCA
pca_model = PCA(n_components=2).fit(X_quant)
explained_var = pca_model.explained_variance_ratio_*100
scores = pd.DataFrame(pca_model.transform(X_quant), columns=['PC1','PC2'])
scores['DEATH_EVENT'] = X_bin.DEATH_EVENT

# PC1-2 scatterplot
sns.set(rc={'figure.figsize':(7, 7)})
sns.scatterplot(x='PC1', y='PC2', data=scores,
               hue='DEATH_EVENT')

plt.xlim(-4, 4)
plt.ylim(-4, 4)
plt.axhline(0, linestyle='--', color='black', alpha=.25)
plt.axvline(0, linestyle='--', color='black', alpha=.25)
plt.xlabel('PC1 ({exp_var:.2f}%)'.format(exp_var=explained_var[0]))
plt.ylabel('PC2 ({exp_var:.2f}%)'.format(exp_var=explained_var[1]))

Using the quantitative variables alone there is a noticeable separation between deceased and survivors along PC1. We can also observe that these two PCs capture approximately 40% of the total variation contained in this feature set. On the whole, the Box-Cox transformation seems to preserve predictive information.

# Bayesian model

Time for setting up the Bayesian model. This part covers *i)* model structure and priors, *ii)* the Markov chain Monte Carlo (MCMC) sampling and *iii)* inference using the samples from the posterior distribution.

## Model structure and priors

There are myriad ways we can model heart failure risk using this dataset, so how exactly does one know which model works best? Unfortunately there is no clear answer, but Bayesian model comparison helps with identifying good models and hypotheses. Here, tools such as the widely applicable information criterion (WAIC) and the deviance information criterion (DIC) compare goodness-of-fit across models. By looking into how much knowledge the different models capture and the number of parameters they use (*i.e.* degrees of freedom), these criteria highlight those that explain the most with the fewest parameters. Model selection is a topic in itself and would deserve a lot more discussion than I intend to cover.
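
For the curious, the trade-off between fit and complexity behind WAIC can be sketched in a few lines. The snippet below assumes a matrix of pointwise log-likelihoods (one row per posterior draw, one column per observation); the function name and the toy input are hypothetical, and nothing here is computed from the model in this notebook.

```python
import math
import statistics


def waic(log_lik):
    """WAIC on the deviance scale; log_lik[s][i] is the log-likelihood
    of observation i under posterior draw s."""
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0    # log pointwise predictive density (goodness-of-fit)
    p_waic = 0.0  # effective number of parameters (complexity penalty)
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        # Log of the average likelihood, computed stably via log-sum-exp
        top = max(column)
        lppd += top + math.log(sum(math.exp(v - top) for v in column) / n_draws)
        # Variance of the log-likelihood across posterior draws
        p_waic += statistics.variance(column)
    return -2 * (lppd - p_waic)


# Toy input: 3 draws, 4 observations, each with likelihood 0.5
toy_log_lik = [[math.log(0.5)] * 4 for _ in range(3)]
print(round(waic(toy_log_lik), 3))
```

Lower values are better: a model is rewarded for a high predictive density and penalized when its pointwise log-likelihood varies a lot across posterior draws.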

Instead, I will fit a generalized linear model using all features, quantitative and binary alike, to predict the probability of the target variable $Y$, a.k.a. `DEATH_EVENT`. This being a binary variable, the most appropriate distribution to model it is a Bernoulli parameterized by $P(Death|X)$. Our job is to build a linear model of the logit expression,

$$\log\left(\frac{P(Death|X)}{1-P(Death|X)}\right) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

extract the probability using a logistic transformation,

$$P(Death|X) = \frac{1}{1+e^{-(\alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p)}}$$

link it to our observed $Y$,

$$Y \sim \mathcal{B}(P(Death|X))$$

and finally let the Bayesian backend work out, for the intercept $\alpha$ and all $p = 12$ coefficients (jointly denoted $\beta$ from here on), the famous relationship

$$P(\beta|data) = \frac{P(data|\beta)\,P(\beta)}{P(data)} \Leftrightarrow Posterior = \frac{Likelihood \times Prior}{Constant}$$

The prior distributions of the coefficients $\beta$ will each have the form $\beta \sim \mathcal{N}(0, 5^2)$. Centering them at $\mu = 0$ conveys our ignorance regarding their magnitude and sign, while $\sigma = 5$ provides sufficient spread to capture regions of high likelihood.
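
To get a feel for the logistic link and for what $\sigma = 5$ means on the probability scale, consider the following plain-Python sketch. The coefficient and predictor values are hypothetical and chosen for illustration only; they are not estimates from this notebook.

```python
import math


def logistic(z):
    """Map a value on the logit scale to a probability."""
    return 1 / (1 + math.exp(-z))


def death_probability(alpha, betas, x):
    """P(Death|X) for one patient given an intercept and coefficients."""
    linear_predictor = alpha + sum(b * xi for b, xi in zip(betas, x))
    return logistic(linear_predictor)


# Hypothetical intercept, coefficients and normalized predictors
alpha = -1.0
betas = [0.8, -0.5]
patient = [1.0, 2.0]
print(round(death_probability(alpha, betas, patient), 3))

# A coefficient two prior standard deviations from zero (2 * 5 = 10),
# applied to a predictor at +1, spans nearly the whole probability range
print(logistic(-10), logistic(10))
```

In other words, the $\mathcal{N}(0, 5^2)$ prior is permissive: it rules out very little before seeing the data.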

In [None]:
# Pop DEATH_EVENT from X_bin
Y = X_bin.pop('DEATH_EVENT')

# Combine X and binary_vars along with vector of ones to accommodate intercept
X = np.concatenate([np.ones((NUM_SAMPLES, 1)), X_bin.to_numpy(), X_quant], axis=1)

with pm.Model() as model:
    # Intercept and coefficients
    beta = pm.Normal('beta', mu=0, sigma=5, shape=X.shape[1])
    logit = pm.math.dot(X, beta)
    # Logistic link (numerically stable sigmoid)
    p = pm.math.sigmoid(logit)
    # Return loglik of Y
    obs = pm.Bernoulli('obs', p, observed=Y)

## Sampling

Now that the model is defined, we can set up the sampling procedure using MCMC. In the present case I will choose the Hamiltonian method with maximum a posteriori (MAP) estimation for starting values. The number of both burn-in samples and chains can also be specified. With the setting below we will end up with 10,000 posterior draws from each of two chains, totalling 20,000. Finally, a look into the sampling trace will help diagnose convergence issues.
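
If the Hamiltonian method sounds opaque, the toy sampler below may help. It is a sketch in plain Python that targets a one-dimensional standard normal rather than our posterior, and it is in no way PyMC's implementation; the function name, step size and number of leapfrog steps are all assumptions made for this illustration.

```python
import math
import random


def hmc_standard_normal(n_draws, step_size=0.2, n_steps=20, seed=0):
    """Toy Hamiltonian Monte Carlo sampler targeting a 1-D standard normal."""
    rng = random.Random(seed)

    def potential(q):
        # Negative log-density of N(0, 1), up to an additive constant
        return 0.5 * q * q

    def grad_potential(q):
        return q

    q = 0.0
    draws = []
    for _ in range(n_draws):
        p = rng.gauss(0.0, 1.0)  # resample the auxiliary momentum
        q_new, p_new = q, p
        # Leapfrog integration of the Hamiltonian dynamics
        p_new -= 0.5 * step_size * grad_potential(q_new)
        for step in range(n_steps):
            q_new += step_size * p_new
            if step < n_steps - 1:
                p_new -= step_size * grad_potential(q_new)
        p_new -= 0.5 * step_size * grad_potential(q_new)
        # Metropolis correction for the integration error
        energy_old = potential(q) + 0.5 * p * p
        energy_new = potential(q_new) + 0.5 * p_new * p_new
        if rng.random() < math.exp(min(0.0, energy_old - energy_new)):
            q = q_new
        draws.append(q)
    return draws


draws = hmc_standard_normal(5000)
mean = sum(draws) / len(draws)
variance = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(round(mean, 2), round(variance, 2))
```

The gradient of the log-density steers each proposal along a trajectory, which is why Hamiltonian methods tend to explore the posterior with far fewer rejected moves than a random walk.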

In [None]:
# Sample
with model:
    start = pm.find_MAP()
    opt = pm.HamiltonianMC(beta)
    trace = pm.sample(10000, opt, start=start,
                      return_inferencedata=True, random_seed=SEED)
    
# Trace plot
pm.traceplot(trace)

From the trace plot above we can see the chains converged really quickly - I suspect largely because of the MAP pre-estimation. In the next section we will take a deep-dive into these 20,000 posterior samples and make sense of them.
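
Visual inspection can be complemented with a numeric check. The sketch below implements the classic (non-split) Gelman-Rubin statistic in plain Python on simulated chains; the helper name and the simulated values are hypothetical. To my knowledge ArviZ summaries report a more refined rank-normalized variant, so treat this as a simplified illustration of the idea.

```python
import random
import statistics


def r_hat(chains):
    """Classic Gelman-Rubin statistic for chains of equal length."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # Average within-chain variance
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # Between-chain variance divided by the chain length
    between_over_n = statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


rng = random.Random(1)
chain_a = [rng.gauss(0, 1) for _ in range(1000)]
chain_b = [rng.gauss(0, 1) for _ in range(1000)]
shifted = [v + 3 for v in chain_b]  # a chain stuck in a different region

print(round(r_hat([chain_a, chain_b]), 3))  # close to 1: chains agree
print(round(r_hat([chain_a, shifted]), 3))  # well above 1: chains disagree
```

Values close to 1 indicate that the chains are sampling the same distribution; values clearly above 1 call for longer runs or a reparameterized model.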

## Inference

In my perspective the beauty of the Bayesian framework lies in how uncertainty propagates throughout the model. Such models survey a comprehensive range of parameter values and combinations that may, with a certain probability, lead to the observed response variable. This is in contrast to traditional machine and deep learning approaches that rely on single point estimates instead.

In what follows we will leverage this uncertainty over unknown parameters to assess the impact of each predictor on heart failure risk. To this end we can plot the posterior distribution of each coefficient plus the intercept stored under `beta`.

In [None]:
# Extract posterior samples of beta
trace_beta = trace.posterior['beta'].values.reshape(-1, NUM_VARS)
beta_names = ['intercept'] + bin_names[:-1] + quant_names

# Violin plot with seaborn
sns.set(rc={'figure.figsize':(6, 12)})
ax = sns.violinplot(data=trace_beta, orient='h')
ax.set(xlabel=r'$\beta$')
ax.set_yticklabels(beta_names)
plt.axvline(0., linestyle='--', color='black', alpha=.25)

First, you will note that the posterior of the intercept $\alpha$ (also within `beta`) is shifted towards negative values. This should not be surprising since the deceased proportion in the sample is 32.1%, as shown at the pre-processing stage. In simple terms, in the absence of any evidence from $X$ we have $P(Death) = \frac{1}{1+e^{-\alpha}}$ and hence, with negative values of $\alpha$, $P(Death) < 0.5$ close to the sample mean.
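
As a rough check on this reading, and assuming for a moment an intercept-only model (a simplification, not the model fitted here), the intercept that reproduces the sample death rate of 32.1% would be

$$\alpha = \log\left(\frac{0.321}{1 - 0.321}\right) \approx -0.75$$

In the full model the intercept describes a patient with all binary variables at zero and all normalized quantitative variables at their mean, so its posterior need not sit exactly at this value.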

The strongest effect on heart failure risk comes from `time`, the patient follow-up period. The model is quite certain that the longer the follow-up period, the more likely a patient survives. Keeping this variable is kind of cheating, don't you agree? After all, the length of the follow-up is only known once the outcome has been observed.

Some other strong effects are those from serum creatinine levels and ejection fraction. Large levels of serum creatinine and small ejection fractions seem to associate with heart failure. Regarding these two, the dataset publication title is self-evident: *Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone*.

Also noteworthy are the differences in spread over the different posterior distributions (ignore the intercept, as it tends to be wider and hence less certain). Recalling the strange correlation between `smoking` and `sex` mentioned earlier, we can make better sense of the large spread in the corresponding posteriors, *i.e.* a large posterior variance denotes greater uncertainty from the model. The negative effect of `sex` suggests men are on average more protected against heart failure compared to women - does this make sense? On the other hand, the moderately positive effect of `smoking` suggests a general increase in risk - I was frankly expecting a much larger effect.
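
Spread and sign can also be read off numerically rather than by eye. The sketch below summarizes a set of posterior draws for a single coefficient; the draws are simulated from a hypothetical normal distribution purely for illustration and do not come from the trace in this notebook.

```python
import random
import statistics


def summarize(draws):
    """Posterior mean, equal-tailed 95% interval and P(coefficient < 0)."""
    cuts = statistics.quantiles(draws, n=40)  # cut points every 2.5%
    return {
        'mean': statistics.fmean(draws),
        'lower': cuts[0],    # 2.5th percentile
        'upper': cuts[-1],   # 97.5th percentile
        'prob_negative': sum(d < 0 for d in draws) / len(draws),
    }


# Simulated draws standing in for one column of the posterior samples
rng = random.Random(0)
draws = [rng.gauss(-1.0, 0.5) for _ in range(4000)]
summary = summarize(draws)
print({key: round(value, 3) for key, value in summary.items()})
```

A wide interval straddling zero signals that the data say little about the direction of an effect, whereas a narrow one far from zero signals a confident estimate.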

## Predictions

We shall conclude this analysis with model predictions, but before that look into the posterior heart failure probabilities over all observations.

In [None]:
# Compute logit and probabilities from the posterior samples
post_logit = np.matmul(X, trace_beta.T)
post_p = 1 / (1 + np.exp(-post_logit))

# Plot
sns.set(rc={'figure.figsize':(12, 4)})
plt.hist(post_p[Y == 1,:].flatten(), alpha=.25, label='Died', bins=25)
plt.hist(post_p[Y == 0,:].flatten(), alpha=.25, label='Survived', bins=25)
plt.legend()
plt.xlabel(r'$P(Death|X)$')
plt.ylabel('Frequency')

We can see there is a satisfying separation between deceased and survivors, albeit some cases are grossly misrepresented in either group. 

So, how can we determine what level of risk or probability warrants medical assistance? Prediction accuracy alone can be misleading since the majority of the observations comprises survivors. Instead, sensitivity is more appropriate, as it covers more potential occurrences at the comparatively inexpensive cost of misdiagnosing healthy individuals that do not require assistance.

As such, I suggest predicting heart failure for all individuals that have a 1% chance of their risk being over 50%. This is a conservative choice that should boost sensitivity even if decreasing specificity or accuracy. Going one step further, we can compare that 1% threshold to various others.
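
Spelled out for a single patient, the rule reads as follows. This is a plain-Python sketch with made-up posterior draws; `flag_for_assistance` and its arguments are hypothetical names introduced for this illustration.

```python
def flag_for_assistance(prob_draws, cutoff=0.01, risk_level=0.5):
    """Flag a patient when the share of posterior draws with
    P(Death|X) above risk_level exceeds the cutoff."""
    share_high_risk = sum(p > risk_level for p in prob_draws) / len(prob_draws)
    return share_high_risk > cutoff


# Hypothetical posterior draws of P(Death|X) for two patients
patient_a = [0.10, 0.20, 0.15, 0.55, 0.30]  # 1 of 5 draws above 0.5
patient_b = [0.05, 0.10, 0.08, 0.12, 0.07]  # no draw above 0.5

print(flag_for_assistance(patient_a))               # flagged at the 1% cutoff
print(flag_for_assistance(patient_a, cutoff=0.25))  # not flagged at 25%
print(flag_for_assistance(patient_b))               # not flagged
```

The lower the cutoff, the less posterior evidence is needed to raise a flag, which is what pushes sensitivity up.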

In [None]:
# Flag observation w/ posterior prob > .01 that P(Death) > 0.5
acc, sens, spec = [], [], []
for p in (.01,.25,.5,.75,.99):
    pred = np.mean(post_p > .5, axis=1) > p
    tn, fp, fn, tp = confusion_matrix(Y, pred.astype(np.int16)).ravel()
    acc.append((tp + tn) / (tn + fp + fn + tp))
    sens.append(tp / (tp + fn))
    spec.append(tn / (tn + fp))
    
results = {'accuracy': acc, 'sensitivity': sens, 'specificity': spec}
results_df = pd.DataFrame(results, index=(.01,.25,.5,.75,.99))
results_df.index.name = 'cutoff'
results_df

Indeed, the 1% threshold yielded the highest sensitivity (87.5%) along with a moderate specificity (72.9%). As demonstrated above, larger thresholds reverse the trade-off, thereby decreasing sensitivity and increasing specificity. Accuracy can also be shown to be optimal between the two most extreme choices for threshold.


# Conclusion

Much more could be done throughout this analysis. To list a few recommended steps:

- Use a hierarchical model introducing a prior for the $\beta$ prior's $\mu$ and a separate prior for $\alpha$
- Resolve the dependence between `sex` and `smoking`
- Produce ROC curve to better understand the sensitivity-specificity trade-off
- Experiment with Cox regression for survival analysis
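
Regarding the ROC curve, a bare-bones version takes little code. The sketch below uses plain Python and hypothetical outcomes and predicted probabilities, not results from this notebook; in practice a library routine would be the more convenient choice.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs over all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical outcomes (1 = died) and predicted probabilities
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 3))
```

Each point on the curve corresponds to one choice of threshold, so the whole sensitivity-specificity trade-off becomes visible at once.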

All in all, being able to anticipate 87.5% of the occurrences should be deemed a success. I hope this Bayesian approach made sense and that the underlying strengths compel you to try it out in different modelling problems.

Of course, I look forward to your remarks, suggestions and questions. Greetings and stay healthy!