# 1. Probabilistic Machine Learning: A Different Approach
I'm very grateful to have entered this competition, as it drove me to learn new things I wouldn't have otherwise.
As I wrote in other notebooks, my first objective was to learn how to implement AutoEncoders and discover why generative models are so hyped.
After reading papers and tutorials, experimenting, and analyzing the (poor) results I achieved, I started asking questions.
Why isn't the model learning a good representation of the latent space? Btw, what the heck is a latent space? Why are generative models so cool? How do they actually work?

The search for these (and many more) answers drove me to Variational AutoEncoders (whose implementation I will share here later), but most importantly, drove me to [Probabilistic Machine Learning](https://www.nature.com/articles/nature14541), [Bayesian Inference](https://en.wikipedia.org/wiki/Bayesian_inference), and the very cool field of [Probabilistic Programming](https://en.wikipedia.org/wiki/Probabilistic_programming). As I started to dive deep into papers and books about those subjects, I realized that the OSIC Pulmonary Fibrosis problem screams for Probabilistic Machine Learning. Mindlessly applying Deep Learning to solve this problem reminded me of an oldie but goldie: **if all you have is a hammer, everything looks like a nail**.

I'm so excited about my "discovery" of this new field, that I could go on and on writing about what I learned over the past month.
But instead, I will leave some nice references that helped me gain insight into **discriminative vs generative machine learning**, **deterministic vs stochastic algorithms**, and **Bayesian vs frequentist approaches**, and then dive into a demonstration (there are many, many more great readings; these are just some to get started, at varied levels of depth):
- [Probabilistic machine learning and artificial intelligence](https://www.nature.com/articles/nature14541)
- [Automating Inference, Learning, and Design using Probabilistic Programming](https://www.robots.ox.ac.uk/~twgr/assets/pdf/rainforth2017thesis.pdf)
- [Probabilistic Programming and Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers)
- [Probabilistic Models of Cognition](http://probmods.org/)
- [Machine Learning: A Probabilistic Perspective](https://doc.lagout.org/science/Artificial%20Intelligence/Machine%20learning/Machine%20Learning_%20A%20Probabilistic%20Perspective%20%5BMurphy%202012-08-24%5D.pdf)

My idea is to break the problem into 3 sub-problems:
1. Define and train a simple probabilistic model to make inferences about FVC using only tabular data (baby steps)
2. Define and train an advanced probabilistic model (a Variational AutoEncoder) to learn latent features from the CT scans
3. Combine the 2 models together

## 1.1. Tools
IMHO the mathematical tools needed are no more advanced than the tools used in Deep Learning. Unfortunately, most of the books and papers are overloaded with hardcore math that may be discouraging at first. But there are good exceptions (some cited above, like the excellent open books Probabilistic Programming and Bayesian Methods for Hackers, or Probabilistic Models of Cognition), and I think everyone can learn.

In terms of Probabilistic Programming Languages, there are several options. I'd say the obvious choices are these 3:
<img src="https://i.ibb.co/10BYkX6/Untitled-2.jpg" alt="drawing" width="600"/>

As a PyTorch user (I always found TF too verbose for my taste), my natural choice was [Pyro](http://pyro.ai/). However, to my surprise, I discovered that I'd have to install it. PyMC3 and TFP work out-of-the-box in Kaggle Kernels, but Pyro doesn't. As this competition does not allow internet access, I will use PyMC3. But most importantly, **Kaggle please add Pyro to kernels**!

## 1.2. Statistical Modeling Process
In the course [Bayesian Statistics: Techniques and Models](https://www.coursera.org/learn/mcmc-bayesian-statistics), Prof. Matthew Heiner summarizes the modeling process in 8 steps:
1. Understand the problem
2. Plan and collect data
3. Explore the data
4. Postulate the model
5. Fit the model
6. Check the model
7. Iterate
8. Use the model

We understand the problem well enough, and the data is already collected. So, let's do a quick exploration of the tabular data and then postulate a model.

# 2. Exploring the data
"In this competition, you’ll predict a patient’s severity of decline in lung function based on a CT scan of their lungs. You’ll determine lung function based on output from a spirometer, which measures the volume of air inhaled and exhaled. The challenge is to use machine learning techniques to make a prediction with the image, metadata, and baseline FVC as input."

In this first notebook, we will use only tabular data. Let's see this **decline in lung function** for 3 different patients.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
train = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/train.csv')
test = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')

In [None]:
def chart(patient_id, ax):
    data = train[train['Patient'] == patient_id]
    x = data['Weeks']
    y = data['FVC']
    ax.set_title(patient_id)
    # Pass x and y by keyword: recent seaborn versions reject positional data arguments
    ax = sns.regplot(x=x, y=y, ax=ax, ci=None, line_kws={'color':'red'})
    

f, axes = plt.subplots(1, 3, figsize=(15, 5))
chart('ID00007637202177411956430', axes[0])
chart('ID00009637202177434476278', axes[1])
chart('ID00010637202177584971671', axes[2])

The decline in lung capacity is very clear. We can see, though, that it is very different from patient to patient.
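
To put a number on "very different", one option is to fit an independent least-squares line per patient and compare the slopes. The snippet below is only a sketch: `patient_slopes` is a hypothetical helper, and the toy frame reuses the column names of `train.csv` with made-up values.

```python
import numpy as np
import pandas as pd

def patient_slopes(df):
    """Fit one least-squares line per patient and return slope and intercept."""
    rows = []
    for patient, group in df.groupby('Patient'):
        # polyfit with deg=1 returns [slope, intercept]
        slope, intercept = np.polyfit(group['Weeks'], group['FVC'], deg=1)
        rows.append({'Patient': patient, 'slope': slope, 'intercept': intercept})
    return pd.DataFrame(rows)

# Toy data with the same column names as train.csv (values are made up)
toy = pd.DataFrame({
    'Patient': ['A'] * 4 + ['B'] * 4,
    'Weeks':   [0, 10, 20, 30, 0, 10, 20, 30],
    'FVC':     [3000, 2950, 2900, 2850, 2500, 2300, 2100, 1900],
})
slopes = patient_slopes(toy)
print(slopes)
```

Calling `patient_slopes(train)` should give one row per patient. With only a handful of measurements each, these independent fits are noisy, which is part of the motivation for the hierarchical approach.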

"Many large data sets are in fact large collections of small data sets. For example, in areas such as **personalized medicine** and recommendation systems, there might be a large amount of data, but there is still a relatively **small amount of data for each patient** or client, respectively. To **customize predictions for each person** it becomes necessary to **build a model for each person** — with its inherent **uncertainties** — and to couple these models together in a hierarchy so that **information can be borrowed from other similar people**. We call this the **personalization of models**, and it is naturally implemented using **hierarchical Bayesian approaches** (...)." (Ghahramani, 2015)

This is exactly what we will try here. Moving on.

# 3. Postulate the model
Time to be creative. There are a multitude of ways we could model this tabular dataset. Some of the tools we could use:
- Hidden Markov Models
- Gaussian Processes
- Variational Auto Encoders

As I am learning, I will first try the simplest possible model: a **linear regression**.
However, we will make it a little more sophisticated. Here are our assumptions:
- Every patient has unique linear regression parameters ($\alpha$ and $\beta$). So, by inferring the right parameters, we will be able to predict the line(s) for each patient, and thus predict their FVC in any week
- However, these parameters are not completely independent. There is an underlying model that governs them for all patients.
- Both $\alpha$ and $\beta$ are normally distributed with different means and variances
- These means and variances are functions of the baseline measure (baseline week, FVC and Percent), and patient's age, sex and smoking status
- In the next notebook, we will go even further, by assuming the parameters are also a function of latent variables learned from the CT scans. But that will come later. Baby steps :)

Our model is represented by the following Bayesian Network:

<img src="https://i.ibb.co/DCxKbdT/Asset-1-2x-100.jpg" alt="drawing" width="600"/>

Let me explain the logic behind this model:
- $FVC_{ij}$ is the observed variable we are interested in. At any week $j$, $-12 \leq j \leq 133$, the FVC of patient $i$ is presumed to be normally distributed with mean $\alpha_i + \beta_i j$ and standard deviation $\sigma_i$ (the confidence asked for by the competition)
- $\alpha_i$, the intercept of the decline function for each patient $i$, logically is a function of $FVC_i^b$ (the baseline measurement for patient $i$) and $w_i^b$ (the week when the baseline FVC was measured). We assume it is normally distributed with mean $FVC_i^b + w_i^b \beta^{int}$ and variance $(\sigma^{int})^2$
- $\beta_i$, the slope of the decline function for each patient $i$, logically is a function of $A_i$ (patient's age), sex and smoking status. We assume it is normally distributed with mean $\alpha^s + A_i \beta_c^s$ and variance $(\sigma^s)^2$. We consider 6 different $\beta_c^s$: for women who currently smoke, men who currently smoke, women ex-smokers, men ex-smokers, women who never smoked and men who never smoked.
- For now, to simplify, we leave the Percent random variable out. We will include it in a second version.
- Finally, we know nothing about the hyperparameters $\beta^{int}$, $\alpha^s$, $\beta_c^s$, $\sigma_i$, $\sigma^{int}$ and $\sigma^s$. We will give the first 3 normal priors, and the last 3 half-normal priors.

Mathematically, the model specification is
$$
FVC_{ij} \sim \mathcal{N}(\alpha_i + j \beta_i, \sigma_i) \\
\sigma_i \sim |\mathcal{N}(0, 200)| \\
\alpha_i \sim \mathcal{N}(FVC_i^b + w_i^b \beta^{int}, \sigma^{int}) \\
\beta_i \sim \mathcal{N}(\alpha^s + A_i \beta_c^s, \sigma^s)\\
\beta^{int} \sim \mathcal{N}(0, 100) \\
\sigma^{int} \sim |\mathcal{N}(0, 100)| \\
\beta_c^s \sim \mathcal{N}(0, 100) \\
\alpha^s \sim \mathcal{N}(0, 100) \\
\sigma^s \sim |\mathcal{N}(0, 100)|
$$
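
Because the specification is generative, it can be read top-down as a recipe for simulating data. Here is a minimal NumPy sketch of that reading for a single patient. The function `simulate_patient` and every numeric value passed to it are hypothetical, chosen only to illustrate the flow from hyperparameters to $\alpha_i$, $\beta_i$ and finally $FVC_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_patient(fvc_base, week_base, age, weeks,
                     beta_int, sigma_int, alpha_s, beta_cs, sigma_s, sigma):
    """Draw one FVC trajectory from the model, given fixed hyperparameters."""
    # Intercept: centered on the baseline FVC, shifted by the baseline week
    alpha = rng.normal(fvc_base + week_base * beta_int, sigma_int)
    # Slope: depends on age through the class-specific coefficient beta_cs
    beta = rng.normal(alpha_s + age * beta_cs, sigma_s)
    # Observations: a straight line plus Gaussian noise
    return rng.normal(alpha + beta * weeks, sigma)

weeks = np.arange(-12, 134)
# All hyperparameter values below are made up for illustration
fvc = simulate_patient(fvc_base=2500, week_base=0, age=70, weeks=weeks,
                       beta_int=-5.0, sigma_int=50.0,
                       alpha_s=-2.0, beta_cs=-0.03, sigma_s=1.0, sigma=100.0)
print(fvc[:5])
```

Inference runs this recipe in reverse: given the observed FVC values, it finds the parameter values that plausibly generated them.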

In [None]:
# Kaggle, please add Pyro/PyTorch support!
import pymc3 as pm
import theano
import arviz as az
from sklearn import preprocessing

## 3.1. Simple data prep

In [None]:
# Very simple pre-processing: adding patient class
def patient_class(row):
    if row['Sex'] == 'Male':
        if row['SmokingStatus'] == 'Currently smokes':
            return 0
        elif row['SmokingStatus'] == 'Ex-smoker':
            return 1
        elif row['SmokingStatus'] == 'Never smoked':
            return 2
    else:
        if row['SmokingStatus'] == 'Currently smokes':
            return 3
        elif row['SmokingStatus'] == 'Ex-smoker':
            return 4
        elif row['SmokingStatus'] == 'Never smoked':
            return 5

train['Class'] = train.apply(patient_class, axis=1)

In [None]:
# Very simple pre-processing: adding FVC and week baselines
aux = train[['Patient', 'Weeks']].groupby('Patient')\
    .min().reset_index()
aux = pd.merge(aux, train[['Patient', 'Weeks', 'FVC']], how='left', 
               on=['Patient', 'Weeks'])
aux = aux.groupby('Patient').mean().reset_index()
aux['Weeks'] = aux['Weeks'].astype(int)
aux['FVC'] = aux['FVC'].astype(int)
train = pd.merge(train, aux, how='left', on='Patient', suffixes=('', '_base'))

In [None]:
# Very simple pre-processing: creating patient indexes
le = preprocessing.LabelEncoder()
train['PatientID'] = le.fit_transform(train['Patient'])

patients = train[['Patient', 'PatientID', 'Age', 'Class', 'Weeks_base', 'FVC_base']].drop_duplicates()
fvc_data = train[['Patient', 'PatientID', 'Weeks', 'FVC']]

patients.head()

In [None]:
fvc_data.head()

## 3.2. Modeling in PyMC3
Probabilistic Programming Languages are very very very cool :)

In [None]:
FVC_b = patients['FVC_base'].values
w_b = patients['Weeks_base'].values
age = patients['Age'].values
patient_class = patients['Class'].values

t = fvc_data['Weeks'].values
FVC_obs = fvc_data['FVC'].values
patient_id = fvc_data['PatientID'].values

with pm.Model() as hierarchical_model:
    # Hyperpriors for Alpha
    beta_int = pm.Normal('beta_int', 0, sigma=100)
    sigma_int = pm.HalfNormal('sigma_int', 100)
    
    # Alpha
    mu_alpha = FVC_b + beta_int * w_b
    alpha = pm.Normal('alpha', mu=mu_alpha, sigma=sigma_int, 
                      shape=train['Patient'].nunique())
    
    # Hyperpriors for Beta
    sigma_s = pm.HalfNormal('sigma_s', 100)
    alpha_s = pm.Normal('alpha_s', 0, sigma=100)
    beta_cs = pm.Normal('beta_cs', 0, sigma=100, shape=6)
    
    # Beta
    mu_beta = alpha_s + age * beta_cs[patient_class]
    beta = pm.Normal('beta', mu=mu_beta, sigma=sigma_s,
                     shape=train['Patient'].nunique())
    
    # Observation noise: one standard deviation shared by all patients
    sigma = pm.HalfNormal('sigma', 200)
    
    # Model estimate
    FVC_est = alpha[patient_id] + beta[patient_id] * t
    
    # Data likelihood
    FVC_like = pm.Normal('FVC_like', mu=FVC_est,
                          sigma=sigma, observed=FVC_obs)

# 4. Fit the model
Just press the inference button (TM)! :)

In [None]:
# Inference button (TM)!
with hierarchical_model:
    trace = pm.sample(2000, tune=2000, target_accept=.9)

We just sampled 4000 different models that explain the data! Very cool! :)

# 5. Check the model
Let's see the generative model we've created.

In [None]:
with hierarchical_model:
    pm.traceplot(trace);

Very cool!!! Looks like our model learned personalized alphas and betas for each patient!

## 5.1. Checking some patients
PyMC3 comes with a very powerful visualization tool called [ArviZ](https://arviz-devs.github.io/arviz/index.html). However, I haven't figured out how to use it yet... Let's use Matplotlib and Seaborn.

In [None]:
def chart(patient_id, ax):
    data = train[train['Patient'] == patient_id]
    x = data['Weeks']
    y = data['FVC']
    ax.set_title(patient_id)
    # Pass x and y by keyword: recent seaborn versions reject positional data arguments
    ax = sns.regplot(x=x, y=y, ax=ax, ci=None, line_kws={'color':'red'})
    
    x2 = np.arange(-12, 133, step=0.1)
    
    pid = patients[patients['Patient'] == patient_id]['PatientID'].values[0]
    for sample in range(100):
        alpha = trace['alpha'][sample, pid]
        beta = trace['beta'][sample, pid]
        sigma = trace['sigma'][sample]
        y2 = alpha + beta * x2
        ax.plot(x2, y2, linewidth=0.1, color='green')
        y2 = alpha + beta * x2 + sigma
        ax.plot(x2, y2, linewidth=0.1, color='yellow')
        y2 = alpha + beta * x2 - sigma
        ax.plot(x2, y2, linewidth=0.1, color='yellow')

f, axes = plt.subplots(1, 3, figsize=(15, 5))
chart('ID00007637202177411956430', axes[0])
chart('ID00009637202177434476278', axes[1])
chart('ID00010637202177584971671', axes[2])

Here I plotted 100 out of the 4000 personalized models each patient has! In green we can see the fitted regression lines, and in yellow the same lines shifted by plus and minus one standard deviation. Let's ensemble all that!
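
As a rough picture of what "ensembling" means here, the sketch below collapses many sampled lines into one predictive mean and one standard deviation per week. It does not use the real trace: `alpha_draws`, `beta_draws` and `sigma_draws` are synthetic stand-ins for `trace['alpha'][:, pid]`, `trace['beta'][:, pid]` and `trace['sigma']`, with made-up values.

```python
import numpy as np

rng = np.random.default_rng(42)
n_draws = 4000

# Synthetic stand-ins for one patient's posterior draws (values are made up)
alpha_draws = rng.normal(2800, 40, size=n_draws)
beta_draws = rng.normal(-4, 1, size=n_draws)
sigma_draws = np.abs(rng.normal(150, 10, size=n_draws))

weeks = np.arange(-12, 134)

# One regression line per draw: shape (n_draws, n_weeks)
mu = alpha_draws[:, None] + beta_draws[:, None] * weeks[None, :]
# Add observation noise so the spread reflects both parameter and data uncertainty
fvc_draws = rng.normal(mu, sigma_draws[:, None])

# Ensemble: summarize across draws for each week
fvc_mean = fvc_draws.mean(axis=0)
fvc_std = fvc_draws.std(axis=0)
print(fvc_mean[:3], fvc_std[:3])
```

Note how the spread grows for weeks far from zero, because uncertainty in the slope is multiplied by the week.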

# 6. (Iterate and) Use the model
Let's use our generative model now! Iteration will be left as an exercise to the reader :)

## 6.1. Simple data prep

In [None]:
# Very simple pre-processing: adding patient class
# (redefined here because the name patient_class was rebound to an array in section 3.2)
def patient_class(row):
    if row['Sex'] == 'Male':
        if row['SmokingStatus'] == 'Currently smokes':
            return 0
        elif row['SmokingStatus'] == 'Ex-smoker':
            return 1
        elif row['SmokingStatus'] == 'Never smoked':
            return 2
    else:
        if row['SmokingStatus'] == 'Currently smokes':
            return 3
        elif row['SmokingStatus'] == 'Ex-smoker':
            return 4
        elif row['SmokingStatus'] == 'Never smoked':
            return 5

test['Class'] = test.apply(patient_class, axis=1)
test = test.rename(columns={'FVC': 'FVC_base', 'Weeks': 'Weeks_base'})
test.head()

In [None]:
# prepare submission dataset
submission = []
for i, patient in enumerate(test['Patient'].unique()):
    df = pd.DataFrame(columns=['Patient', 'Weeks', 'FVC'])
    df['Weeks'] = np.arange(-12, 134)
    df['Patient'] = patient
    df['PatientID'] = i
    df['FVC'] = 0
    submission.append(df)
    
submission = pd.concat(submission).reset_index(drop=True)
submission.head()

## 6.2. Posterior prediction
There are 2 ways of generating predictions on unseen held-out data using PyMC3. The first involves using `theano.shared` variables. It's pretty straightforward: 4-5 lines of code and we are done. I tried that, and although it worked perfectly while I was running the notebook, when I submitted, the Kaggle server complained with a **Submission CSV Not Found** error message.

Motivated by that, we used a 2nd approach, which works! It's a little longer than 4-5 lines of code, but way more educational. The idea is outlined by PyMC3 developers in [this answer from Luciano Paz](https://discourse.pymc.io/t/how-do-we-predict-on-new-unseen-groups-in-a-hierarchical-model-in-pymc3/2571). To predict FVCs on hold-out data, we will **create a 2nd model, using as priors the distributions for the parameters learned on the 1st model**. It's the Bayesian spirit/philosophy: we keep constantly updating our models as we see more data. :)
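
In practice, "using the learned distributions as priors" means summarizing each parameter's posterior draws by a mean and a standard deviation, and plugging those into a new prior. A minimal sketch of that summary step follows. Here `summarize_posterior` is a hypothetical helper and `fake_trace` is synthetic data standing in for the real `trace`.

```python
import numpy as np

def summarize_posterior(samples):
    """Return {name: (mean, std)} computed over the draw axis of each array."""
    return {name: (np.mean(draws, axis=0), np.std(draws, axis=0))
            for name, draws in samples.items()}

rng = np.random.default_rng(1)
# Synthetic stand-in for the trace: 4000 draws per parameter (values are made up)
fake_trace = {
    'beta_int': rng.normal(-3.0, 0.5, size=4000),
    'beta_cs': rng.normal(0.0, 0.1, size=(4000, 6)),
}
priors = summarize_posterior(fake_trace)
for name, (mean, std) in priors.items():
    print(name, mean, std)
```

This keeps only each parameter's marginal mean and spread, so any correlation between parameters in the posterior is discarded. It is an approximation, but a simple one.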

In [None]:
FVC_b = test['FVC_base'].values
w_b = test['Weeks_base'].values
age = test['Age'].values
patient_class = test['Class'].values
t = submission['Weeks'].values
patient_id = submission['PatientID'].values
            
with pm.Model() as new_model:
    # Hyperpriors for Alpha
    beta_int = pm.Normal('beta_int', 
                         trace['beta_int'].mean(), 
                         sigma=trace['beta_int'].std())
    sigma_int = pm.TruncatedNormal('sigma_int', 
                                   trace['sigma_int'].mean(),
                                   sigma=trace['sigma_int'].std(),
                                   lower=0)
    
    # Alpha
    mu_alpha = FVC_b + beta_int * w_b
    alpha = pm.Normal('alpha', mu=mu_alpha, sigma=sigma_int, 
                      shape=test['Patient'].nunique())
    
    # Hyperpriors for Beta
    sigma_s = pm.TruncatedNormal('sigma_s', 
                                 trace['sigma_s'].mean(),
                                 sigma=trace['sigma_s'].std(),
                                 lower=0)
    alpha_s = pm.Normal('alpha_s', 
                        trace['alpha_s'].mean(), 
                        sigma=trace['alpha_s'].std())
    cov = np.zeros((6, 6))
    np.fill_diagonal(cov, trace['beta_cs'].var(axis=0))
    beta_cs = pm.MvNormal('beta_cs',
                          mu=trace['beta_cs'].mean(axis=0),
                          cov=cov,
                          shape=6)
    
    # Beta
    mu_beta = alpha_s + age * beta_cs[patient_class]
    beta = pm.Normal('beta', mu=mu_beta, sigma=sigma_s,
                     shape=test['Patient'].nunique())
    
    # Observation noise: one standard deviation shared by all patients
    sigma = pm.TruncatedNormal('sigma', 
                               trace['sigma'].mean(),
                               sigma=trace['sigma'].std(),
                               lower=0)
    
    # Model estimate
    FVC_est = pm.Normal('FVC_est', mu=alpha[patient_id] + beta[patient_id] * t, 
                        sigma=sigma,
                        shape=submission.shape[0])

In [None]:
with new_model:
    trace2 = pm.sample(2000, tune=2000, target_accept=.9)
    
trace2['FVC_est'].shape

There we go! 4000 predictions for each point! Before we merge the predictions and submit, let's take a look at the learned parameters for the 5 test patients:

In [None]:
with new_model:
    pm.traceplot(trace2);

We can clearly see very different $\alpha$'s and $\beta$'s for each patient, with varied levels of uncertainty! That's exactly what we wanted!

## 6.3. Estimating competition metric
Finally, before submitting, let's estimate the competition metric:

In [None]:
preds = pd.DataFrame(data=trace2['FVC_est'].T)
submission = pd.merge(submission, preds, left_index=True, 
                      right_index=True)
submission['Patient_Week'] = submission['Patient'] + '_' \
    + submission['Weeks'].astype(str)
submission = submission.drop(columns=['FVC', 'PatientID'])

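# Summarize only the posterior draw columns; Patient and Weeks are not draws
draws = submission.drop(columns=['Patient', 'Weeks', 'Patient_Week'])
FVC = draws.mean(axis=1)
confidence = draws.std(axis=1)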
submission['FVC'] = FVC
submission['Confidence'] = confidence
submission = submission[['Patient', 'Weeks', 'Patient_Week', 
                         'FVC', 'Confidence']]

In [None]:
temp = pd.merge(train[['Patient', 'Weeks', 'FVC']], 
                submission.drop(columns=['Patient_Week']),
                on=['Patient', 'Weeks'], how='left', 
                suffixes=['', '_pred'])
temp = temp.dropna()
temp = temp.groupby('Patient')

# The metric only uses the last 3 measurements, the most uncertain
temp = temp.tail(3)

In [None]:
sigma_clipped = temp['Confidence'].apply(lambda s: max(s, 70))
delta = temp.apply(lambda row: min([abs(row['FVC'] - row['FVC_pred']), 1000]), axis=1)
metric = -np.sqrt(2) * delta / sigma_clipped - np.log(np.sqrt(2) * sigma_clipped)
metric.mean()
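
The same computation can be wrapped in a small vectorized function, which makes it easier to reuse when iterating on the model. This is a sketch that mirrors the cell above (confidence clipped at 70 from below, absolute error clipped at 1000 from above). The name `laplace_log_likelihood` and the example numbers are my own.

```python
import numpy as np

def laplace_log_likelihood(fvc_true, fvc_pred, confidence):
    """Mean clipped Laplace log likelihood; higher (closer to zero) is better."""
    fvc_true = np.asarray(fvc_true, dtype=float)
    fvc_pred = np.asarray(fvc_pred, dtype=float)
    confidence = np.asarray(confidence, dtype=float)

    # Confidence below 70 is raised to 70; error above 1000 is capped at 1000
    sigma_clipped = np.maximum(confidence, 70)
    delta = np.minimum(np.abs(fvc_true - fvc_pred), 1000)

    scores = (-np.sqrt(2) * delta / sigma_clipped
              - np.log(np.sqrt(2) * sigma_clipped))
    return scores.mean()

# Made-up example: one prediction off by 100, one exact but overconfident
score = laplace_log_likelihood([2000, 2500], [2100, 2500], [100, 50])
print(score)
```

The second term penalizes large confidence values, so the best score comes from a confidence that roughly matches the actual error.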

## 6.4. Generating final predictions

In [None]:
submission = submission[['Patient_Week', 'FVC', 'Confidence']]
submission.to_csv('submission.csv', index=False)
submission.head()