# Why might we want to go Bayesian here?

Particularly with very little data, when you need uncertainty quantification and when you need to optimize a metric that is not your model's loss function/likelihood, a Bayesian + decision-theoretic approach is very attractive.

This notebook follows a rather classical Bayesian approach:
* I define a sensible model in terms of a likelihood function (in this case a linear decline model with both intercept and slope influenced by patient-specific random effects and some covariates/features).
* I chose to model log-transformed Percent (=FVC percent predicted of normal).
    * Using Percent is logical, because it already accounts for some covariates like sex, height, and age, rather than modeling FVC directly (you can always get a predicted FVC from the predicted Percent by using the ratio of FVC to Percent at baseline).
    * A log-transformation seemed like a good idea, because residuals should be more normal that way and heteroskedasticity is less of a concern.
* One nice idea, as others have pointed out before, is to have patient-specific random effects on the intercept and the slope.
    * This has the nice property of sensibly reflecting the correlation structure in the data: you just do not learn as much from multiple observations from the same patient as from multiple independent observations from separate patients. The posterior uncertainty from the Bayesian model nicely reflects this.
    * Another reason why I thought this was a very good idea is that the baseline measurement contains measurement error, and we account for this more clearly by having both the baseline observation as a covariate and a patient-specific random effect. I tried experiments where I made the random effects of the intercept and the slope correlated, but I did not seem to gain much from that.
    * For those more used to neural networks, you can think of these random effects as 2D embeddings for a patient that have very specific meanings (i.e. the first one is about where the patient lies on average relative to the population of patients and the second one describes whether the patient has an above- or below-average slope over time).
    * You might think, can we not do what you did here with a neural network? That certainly occurred to me, but at least my attempts produced worse results (more my problem than that of the neural networks).
* I specified some reasonable prior distributions based on what we know a priori (e.g. we have an approximate idea of the expected decline over time - I work in respiratory drug development, so I had at least some idea of what the priors should be). I experimented with tweaking the prior parameters using cross-validation, but I failed to beat the priors I specified based on my knowledge.
* Then, we let the MCMC sampler do its magic and give us MCMC samples from the posterior implied by what we defined above.
* Then we take our posterior belief about the truth, as represented by the MCMC samples from the posterior distribution, and perform a decision analysis to optimize the metric we care about, i.e. the competition metric.
* Also note that there are limits beyond which you never want to go with the competition metric due to the way it gets clipped.
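
The clipping mentioned in the last point can be made concrete with a small sketch of the competition metric, written here as I understand it from the optimization code further down in this notebook (a Laplace log-likelihood with the absolute error clipped at 1000 and the confidence clipped from below at 70). The function name `laplace_log_likelihood` is only chosen for illustration.

```python
import math

def laplace_log_likelihood(fvc_true, fvc_pred, confidence):
    """Score for one prediction; higher (closer to zero) is better."""
    sigma_clipped = max(confidence, 70.0)            # a confidence below 70 is treated as 70
    delta = min(abs(fvc_true - fvc_pred), 1000.0)    # errors beyond 1000 are treated as 1000
    return -math.sqrt(2.0) * delta / sigma_clipped - math.log(math.sqrt(2.0) * sigma_clipped)

# A confidence below 70 scores exactly like a confidence of 70 ...
print(laplace_log_likelihood(2500, 2500, 10), laplace_log_likelihood(2500, 2500, 70))
# ... and once the error is clipped at 1000, a confidence of about 1414 beats any larger one.
print(laplace_log_likelihood(2500, 4500, 1414), laplace_log_likelihood(2500, 4500, 3000))
```

This is why there is no point in ever submitting a confidence below 70 or above roughly 1414.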

I had worked on this notebook even before seeing the [nice notebook](https://www.kaggle.com/carlossouza/bayesian-experiments) by [Carlos Souza](https://www.kaggle.com/carlossouza), who used [PyMC3](https://docs.pymc.io/) and has some very nice diagrams of the model structure (definitely worth visiting and looking at). The main differences from that notebook are that I model log-transformed Percent (rather than FVC directly) and try to let additional covariates/features affect the intercept and slope for each patient. The other, less important, difference is that I am using [pystan](https://pystan.readthedocs.io/en/latest/) (you could do the same thing in R with rstan - see the [Stan homepage](https://mc-stan.org/) for more options and information on the Stan project), but both packages use the NUTS MCMC sampler. I used [pystan](https://pystan.readthedocs.io/en/latest/) simply because I know Stan pretty well and felt more like I knew exactly what I was doing.

This went into our team's final submission as part of a bunch of models that we stacked, but I thought I'd share it, since there are a couple of nice ideas here. I left out the cross-validation here (we used a 5-fold split that never split a patient across folds and stacked on the last 3 out-of-fold observations per patient) and just provide code for doing a submission, in part because this code takes pretty long to run. In my day-to-day work, I would try to speed this up with within-chain parallelization, but within a Kaggle notebook we do not have the necessary CPUs.
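
A patient-level split of the kind described above could look roughly like the sketch below. This is not our actual cross-validation code (which is left out of this notebook); the helper names `assign_patient_folds` and `last_k_per_patient` and the shuffled round-robin assignment are assumptions made purely for illustration.

```python
import random
from collections import defaultdict

def assign_patient_folds(patient_ids, n_folds=5, seed=0):
    """Map each distinct patient to exactly one fold, so no patient is split across folds."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    return {p: i % n_folds for i, p in enumerate(patients)}

def last_k_per_patient(records, k=3):
    """Keep the weeks of the k latest (patient, week) records of each patient."""
    by_patient = defaultdict(list)
    for patient, week in records:
        by_patient[patient].append(week)
    return {p: sorted(weeks)[-k:] for p, weeks in by_patient.items()}

# Toy (patient, week) records
records = [("A", 0), ("A", 5), ("A", 9), ("A", 20), ("B", 1), ("B", 7), ("C", 3)]
folds = assign_patient_folds([p for p, _ in records], n_folds=2)
print(folds)
print(last_k_per_patient(records))
```

Because the fold is a property of the patient rather than of the record, all observations of a patient end up on the same side of every split.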

# Read the training and test data

In [None]:
import numpy as np
import pandas as pd
import os
import os.path
import pystan
from scipy.optimize import minimize

train = pd.read_csv("../input/osic-pulmonary-fibrosis-progression/train.csv")
test = pd.read_csv("../input/osic-pulmonary-fibrosis-progression/test.csv")

if len(test)==5:
    test['Patient'] = test['Patient'] + "_t" # In the development setting the test patients also occur in train, so we rename them to treat them as separate patients

# np.int was removed from NumPy (1.24+); the built-in int does the same job here
test2 = test[['Patient', 'Age', 'Sex', 'SmokingStatus']].merge(pd.DataFrame([(Patient, int(Weeks)) for Weeks in range(-12,134) for Patient in set(test['Patient'])],
                        columns=['Patient', 'Weeks']), 
           on="Patient", how="left")
test2['fold'] = 0

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
train = pd.concat([train, test]).reset_index(drop=True)
train['fold'] = 1

train = pd.concat([train, test2]).reset_index(drop=True)


# Obtain baseline values & number patients
I.e. the first FVC and Percent values available for each patient, which we have even for patients in the test set.

In [None]:
base = (train.loc[train[train['fold']==1].groupby('Patient')['Weeks'].idxmin()]
    .groupby('Patient')
    .agg(
        base = ("FVC", lambda x: np.mean(x)),
        logbase = ("FVC", lambda x: np.mean(np.log(x))),
        basepercent = ("Percent", lambda x: np.mean(x)),        
        logbasepercent = ("Percent", lambda x: np.mean(np.log(x))),
        basewk = ("Weeks", lambda x: np.mean(x))
    )
    .filter(items=['Patient', 'base', 'logbase', 'basewk', 'basepercent', 'logbasepercent'])
    .reset_index())

train2 = (train
  .merge(base, on='Patient', how='left')
  .assign(sexm = lambda x: np.select([x.Sex.eq('Male'), x.Sex.eq('Female')], [1, 0]),
          relweek = lambda x: x.Weeks-x.basewk,
          logpercent = lambda x: np.log(x.Percent),
          smoker = lambda x: np.select( [x.SmokingStatus.eq("Never smoked"),
                                        x.SmokingStatus.eq("Ex-smoker"),
                                        x.SmokingStatus.eq("Currently smokes")],
                                      [0, 1, 2])))

train2.Patient = pd.Categorical(train2.Patient)
train2['patno'] = train2.Patient.cat.codes
train2['denom'] = 100*train2['base']/train2['basepercent']
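
The `denom` column is what lets us move between the Percent scale and the FVC scale: it is the FVC that would correspond to 100% of predicted for that patient, derived from the baseline record. A minimal sketch of the conversion, with a hypothetical helper name `percent_to_fvc` and made-up numbers:

```python
def percent_to_fvc(pred_percent, base_fvc, base_percent):
    """Convert a predicted Percent into an FVC using the baseline FVC/Percent ratio."""
    denom = 100.0 * base_fvc / base_percent   # FVC corresponding to 100% of predicted
    return pred_percent * denom / 100.0

# Made-up patient: baseline FVC of 2400 at 60% of predicted
print(percent_to_fvc(60.0, 2400.0, 60.0))   # the baseline Percent maps back to the baseline FVC
print(percent_to_fvc(55.0, 2400.0, 60.0))   # a lower predicted Percent gives a lower FVC
```

This assumes that the covariates behind "percent predicted" (sex, height, age) change little over the follow-up period, so that one ratio per patient is good enough.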

# Generate training (and test/validation) data in the format pystan wants

In [None]:
# Function to generate training and validation data in the format pystan wants it
def get_train_data(fold, hyperparams):
    train2tmp = (train2.copy())[ (train2['fold']!=fold) ]
    train_dat = {'records': len(train2tmp),
            'patients': len(np.unique(train2['patno'])), # We want the actual number of patients so that we immediately create random effects even for out-of-fold patients
            'patno': train2tmp['patno']+1,            
            'denom': train2tmp['denom'],
            'fvc': train2tmp['FVC'],
            'base': train2tmp['base'],
            'basenormal': 1.0*(train2tmp['basepercent']>=80),
            'basepercent': train2tmp['basepercent']*(train2tmp['basepercent']<80.0) + (train2tmp['basepercent']>=80)*(80.0 + (train2tmp['basepercent']-80.0)*2.0/3.0),
            'logpercent': train2tmp['logpercent'],
            'logbasepercent': train2tmp['logbasepercent'],
            'relweek': train2tmp['relweek'],
            'Weeks': train2tmp['Weeks'],
            'smoker': train2tmp['smoker'],             
            'basewk': train2tmp['basewk'],
            'hyperparams': hyperparams}
    train2tmp2 = (train2.copy())[ (train2['fold']==fold) ]
    test_dat = {'records': len(train2tmp2),
                'Patient': train2tmp2['Patient'],
                'Weeks': train2tmp2['Weeks'],
            'patients': len(np.unique(train2['patno'])),
            'patno': train2tmp2['patno']+1,            
            'denom': train2tmp2['denom'],
            'fvc': train2tmp2['FVC'],
            'base': train2tmp2['base'],
            'basenormal': 1.0*(train2tmp2['basepercent']>=80),
            'basepercent': train2tmp2['basepercent']*(train2tmp2['basepercent']<80.0) + (train2tmp2['basepercent']>=80)*(80.0 + (train2tmp2['basepercent']-80.0)*2.0/3.0),
            'logpercent': train2tmp2['logpercent'],
            'logbasepercent': train2tmp2['logbasepercent'],
            'relweek': train2tmp2['relweek'],            
            'Weeks': train2tmp2['Weeks'],            
            'smoker': train2tmp2['smoker'],             
            'basewk': train2tmp2['basewk'],
            'hyperparams': hyperparams}
    return train_dat, test_dat
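
One detail that is easy to miss in `get_train_data`: the `basepercent` covariate is passed through a piecewise-linear transform that leaves values below 80 unchanged and keeps only 2/3 of the part above 80. The sketch below spells out the same expression for a single value (the function name `compress_basepercent` is only for illustration).

```python
def compress_basepercent(basepercent):
    """Identity below 80; above 80 only 2/3 of the excess over 80 is kept."""
    if basepercent < 80.0:
        return basepercent
    return 80.0 + (basepercent - 80.0) * 2.0 / 3.0

for value in (50.0, 80.0, 95.0, 140.0):
    print(value, compress_basepercent(value))
```

The transform is continuous at 80, so it only dampens the influence of very high baseline values rather than introducing a jump.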

# How do we get point predictions and "confidence" from posterior MCMC samples?
This is a classic Bayesian approach: you take your posterior belief about the truth, as represented by the MCMC samples from the posterior distribution, and then you perform a decision analysis to optimize the metric you care about, i.e. the competition metric. Also note that there are limits beyond which you never want to go with the competition metric due to the way it gets clipped.

In [None]:
# Functions to obtain the point prediction and confidence from posterior predictive distribution samples

# Negative mean competition metric over the samples for prediction x[0] and confidence x[1]
def fct(x, samples):
    confidence = np.maximum(70.0, x[1])  # the metric clips the confidence from below at 70
    return - np.mean( - np.sqrt(2.0)*np.minimum( np.absolute( x[0] - samples ), 1000.0)/confidence - np.log(np.sqrt(2.0)*confidence) )

# The metric clips the absolute error at 1000, for which the best confidence is sqrt(2)*1000 = 1414, so a confidence of >1414 is just losing on the metric.
# For an absolute error of 70 the best confidence is 99 and for 35 it is about 50, but the metric clips the confidence at 70, so let's set the lower limit to 70.
def minimize_metric(samples):
    res = minimize(fct, x0=np.array( [np.maximum(101,np.minimum(8999,np.mean(samples))), 
                                      np.minimum(1413, np.maximum(71,np.std(samples)))]), 
                   args=samples, bounds = ((100, 9000), (70, 1414))).x
    return res[0], res[1]
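
To build some intuition for what `minimize_metric` returns: if we ignore the clipping, the expected metric is maximized by taking the median of the samples as the point prediction and sqrt(2) times the mean absolute deviation around it as the confidence (set the derivative of -sqrt(2)*E|x - y|/s - log(sqrt(2)*s) with respect to s to zero). The sketch below computes this unclipped solution in plain Python on simulated draws; the function names `unclipped_optimum` and `expected_metric` are assumptions for illustration, and the numerical optimizer above remains the thing to use when the clipping matters.

```python
import math
import random
import statistics

def unclipped_optimum(samples):
    """Median and sqrt(2) * mean absolute deviation: the optimum when no clipping applies."""
    point = statistics.median(samples)
    mean_abs_dev = sum(abs(s - point) for s in samples) / len(samples)
    return point, math.sqrt(2.0) * mean_abs_dev

def expected_metric(point, confidence, samples):
    """Mean unclipped metric over the samples (higher is better)."""
    mean_abs_err = sum(abs(s - point) for s in samples) / len(samples)
    return -math.sqrt(2.0) * mean_abs_err / confidence - math.log(math.sqrt(2.0) * confidence)

rng = random.Random(1)
samples = [rng.gauss(2500.0, 200.0) for _ in range(5000)]   # stand-in for posterior predictive draws
point, confidence = unclipped_optimum(samples)
print(point, confidence)
```

Note that the best confidence is not the standard deviation of the samples; for roughly normal samples it comes out a bit larger, which is one reason to optimize the metric directly instead of just reporting the posterior SD.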


# Specifying the Stan model

This specifies the Stan model. For more information on the Stan modeling language, see the [Stan homepage](https://mc-stan.org/) and the documentation you can find there. It's a basic linear random effects model.
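
Written out as a formula, the likelihood in the Stan code reads as follows. The notation is introduced here for readability and is not used elsewhere in the notebook: $i$ indexes patients and $j$ records, $t_{ij}$ is `relweek`, $w_{i}$ is `basewk`, $P_{i}$ is the baseline Percent, $\tilde{P}_{i}$ is the transformed `basepercent` covariate, $s_i$ is the `smoker` code (0, 1 or 2), $a_i$ and $b_i$ are `patre` and `patslopere`, and $\tau_a$ and $\tau_b$ are `patre_sigma` and `patslopere_sigma`. The normal distributions are parameterized by their standard deviation, as in Stan.

```latex
\begin{aligned}
\log(\mathrm{Percent}_{ij}) &\sim \mathcal{N}(m_{ij},\ \sigma) \\
m_{ij} &= \mu_1 + \tau_a a_i + \beta_1 \log(P_i) + \beta_2 \frac{w_i}{52} + \beta_3 \tilde{P}_i \\
&\quad + \frac{t_{ij}}{52} \left( \mu_2 + \tau_b b_i + \beta_4 \mathbf{1}[s_i \ge 1] + \beta_5 \mathbf{1}[s_i = 2] + \beta_6 \frac{w_i}{52} + \beta_7 \mathbf{1}[P_i \ge 80] \right) \\
a_i &\sim \mathcal{N}(0, 1), \qquad b_i \sim \mathcal{N}(0, 1)
\end{aligned}
```

The first line of $m_{ij}$ is the patient-specific intercept and the bracket in the second line is the patient-specific slope per year.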

In [None]:
stan_code = """
data {
    int<lower=0> records; // Total number of records
    int<lower=0> patients; // Total number of patients
    int patno[records]; // Patient number for each record
    
    real logpercent[records]; // Log-Percent for each record
    
    int relweek[records]; // Week relative to first FVC for each record
    real logbasepercent[records]; // baseline (first record) for the log-Percent that we are modeling
    real basepercent[records]; // baseline Percent
    real basenormal[records]; // Is the baseline observation normal (Percent>= 80%)
    real smoker[records]; // Smoking status (0 = never smoked, 1 = ex-smoker, 2 = currently smokes)
    real basewk[records]; // What week was the baseline observation taken in?
    
    real hyperparams[24]; // Hyperparameters that specify the priors - we hand these over this way rather than hard-coding them so that we do not need to re-compile the model each time we want to tweak them.
}

parameters {
    real mu[2]; // Population-level intercept and slope
    real beta[7]; // coefficients for covariates
    
    real<lower=0> sigma; // residual error SD
    
    real patre[patients]; // Patient random effects on intercept
    real<lower=0> patre_sigma; // SD for those
    real patslopere[patients]; // Patient random effects on slope
    real<lower=0> patslopere_sigma; // SD for those
}

model {
    // Specifying priors
    mu[1] ~ normal(hyperparams[1], hyperparams[2]);
    mu[2] ~ normal(hyperparams[3], hyperparams[4]);
    
    sigma ~ normal(hyperparams[5], hyperparams[6]);
        
    beta[1] ~ normal( hyperparams[7], hyperparams[8]);    
    beta[2] ~ normal( hyperparams[9], hyperparams[10]);        
    beta[3] ~ normal( hyperparams[11], hyperparams[12]);    
    beta[4] ~ normal( hyperparams[13], hyperparams[14]);
    beta[5] ~ normal( hyperparams[15], hyperparams[16]);
    beta[6] ~ normal( hyperparams[17], hyperparams[18]);
    beta[7] ~ normal( hyperparams[19], hyperparams[20]);
            
    // It turned out that this particular way of parameterizing sampled better
    // than specifying patre_sigma ~ normal(0,1); and then multiplying by SD and adding mean later,
    // which can often be a crucial reparameterizing for getting better MCMC sampling (but not here).
    patre_sigma ~ normal(hyperparams[21], hyperparams[22]);
    patslopere_sigma ~ normal(hyperparams[23], hyperparams[24]);
    
    patre ~ std_normal();
    patslopere ~ std_normal();

  // Specifying the likelihood
  { 
  real tmpmeans[records]; 
  for (r in 1:records){
      tmpmeans[r] = mu[1] + patre[patno[r]]*patre_sigma + 
                    beta[1]*logbasepercent[r]  + beta[2] * basewk[r]/52.0 + beta[3]*basepercent[r] +
                    relweek[r]/52.0 * 
                        ( mu[2] + patslopere[patno[r]]*patslopere_sigma + 
                          beta[4]*(smoker[r]>=1) + beta[5]*(smoker[r]==2) + beta[6]*basewk[r]/52.0 + beta[7]*basenormal[r]);
    }
  logpercent ~ normal(tmpmeans, sigma);
  }
  
}
"""

sm = pystan.StanModel(model_code=stan_code) # Compile Stan model

# Fitting the Stan model

You can still see the leftovers from our cross-validation here, but we just fit to one fold (= training data) and predict the other (= test data).
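
The 24 entries of `hyperparams` are consumed by the Stan `model` block as consecutive (mean, SD) pairs: two pairs for `mu`, one for `sigma`, seven for `beta`, and one each for `patre_sigma` and `patslopere_sigma`. If the numbered keys are hard to keep track of, one option is to write the priors by name and flatten them in that order. The sketch below does this with the same values as used in this notebook; the `priors` list and `flatten_priors` are hypothetical helpers, not part of pystan.

```python
# (mean, SD) of the normal prior for each parameter, in the order the Stan model reads them.
# For the three SD parameters the normal prior is truncated at zero by their <lower=0> constraint.
priors = [
    ("mu[1]", 2.65, 0.125),            # population intercept
    ("mu[2]", -0.12, 0.05),            # population slope per year
    ("sigma", 0.06, 0.05),             # residual SD
    ("beta[1]", 0.25, 0.5),            # log baseline Percent on the intercept
    ("beta[2]", 0.0, 0.2),             # baseline week on the intercept
    ("beta[3]", 0.03, 0.1),            # (transformed) baseline Percent on the intercept
    ("beta[4]", 0.0, 0.1),             # ex- or current smoker on the slope
    ("beta[5]", 0.0, 0.1),             # current smoker on the slope
    ("beta[6]", 0.0, 0.05),            # baseline week on the slope
    ("beta[7]", 0.0, 0.05),            # normal baseline (Percent >= 80) on the slope
    ("patre_sigma", 0.1, 0.25),        # SD of the random intercepts
    ("patslopere_sigma", 0.1, 0.25),   # SD of the random slopes
]

def flatten_priors(priors):
    """Flatten named (mean, SD) pairs into the flat list layout the Stan model expects."""
    flat = []
    for _name, mean, sd in priors:
        flat.extend([mean, sd])
    return flat

hyperparams = flatten_priors(priors)
print(len(hyperparams))
```

The resulting list could be passed as the `hyperparams` argument of `get_train_data` in place of the list built from the numbered dictionary keys.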

In [None]:
iter_no = 2500
half_iter_no = 1250
#iter_no = 200 # Low number for testing, if you just want to iterate
#half_iter_no = 100
num_of_chains = 4

total_number_of_patients = len(set(train2['Patient'])) # Should be the number of patients in the training and validation/test sets combined (e.g. for development this is 176; for submitting it is higher, but unknown).
    # The baseline observation of every patient always gets included in training to learn the random effect of that patient - you can think of it as a super-simple embedding.

param = {'hyperparam1': 2.65, 'hyperparam2': 0.125, 'hyperparam3': -0.12, 'hyperparam4': 0.05, 'hyperparam5': 0.06, 'hyperparam6': 0.05, 'hyperparam7': 0.25, 
         'hyperparam8': 0.5, 'hyperparam9': 0, 'hyperparam10': 0.2, 'hyperparam11': 0.03, 'hyperparam12': 0.1, 'hyperparam13': 0, 'hyperparam14': 0.1, 'hyperparam15': 0, 
         'hyperparam16': 0.1, 'hyperparam17': 0, 'hyperparam18': 0.05, 'hyperparam19': 0, 'hyperparam20': 0.05, 'hyperparam21': 0.1, 'hyperparam22': 0.25, 'hyperparam23': 0.1, 'hyperparam24': 0.25}

for fold in range(1):
    
    # Get the data for the fold
    
    print("Fold: " + str(fold))    
    # Hand over hyperparam1 ... hyperparam24 in order
    train_dat, test_dat = get_train_data(fold=fold,
                                         hyperparams=[param['hyperparam' + str(i)] for i in range(1, 25)])
        
    # Fit the Stan model
    
    fit = sm.sampling(data=train_dat, iter=iter_no, chains=num_of_chains, n_jobs=-1, control = {'adapt_delta': 0.99})
    
    print( fit.stansummary(pars=['mu', 'beta', 'sigma', 'patre_sigma', 'patslopere_sigma']) )
    
    # Get the posterior samples
    
    sigma = fit.extract(permuted=False, pars="sigma")['sigma'].reshape(half_iter_no*num_of_chains)
    mu = fit.extract(permuted=False, pars="mu")['mu'].reshape(half_iter_no*num_of_chains, 2)
    beta = fit.extract(permuted=False, pars="beta")['beta'].reshape(half_iter_no*num_of_chains, 7)
    
    pat_sigma = fit.extract(permuted=False, pars="patre_sigma")['patre_sigma'].reshape(half_iter_no*num_of_chains)
    slope_sigma = fit.extract(permuted=False, pars="patslopere_sigma")['patslopere_sigma'].reshape(half_iter_no*num_of_chains)
    
    patre = fit.extract(permuted=False, pars="patre")['patre'].reshape(half_iter_no*num_of_chains, total_number_of_patients)
    patslopere = fit.extract(permuted=False, pars="patslopere")['patslopere'].reshape(half_iter_no*num_of_chains, total_number_of_patients)
    
    # Create out-of-fold (or here test set) predictions
    
    foldresults = {'fold': fold,                   
                   'true': test_dat['fvc'].copy().values,
                   'relweek': test_dat['relweek'].copy().values,
                   'patno': test_dat['patno'].copy().values,
                   'Patient': test_dat['Patient'].copy().values,
                   'Weeks': test_dat['Weeks'].copy().values,
                   'pred':  np.zeros(shape=len(test_dat['fvc'].values)),
                   'pred_sd': np.zeros(shape=len(test_dat['fvc'].values))
                  }

    # Looping through all the records in the validation or test set
    # I wish I had implemented a better way of doing this than just spelling out the regression equation (apologies).
    # (clearly, you can re-write this with a design matrix for fixed and random effects and then write it as a matrix multiplication)
    for r in range(test_dat['records']):      

        avgpred = mu[:, 0] + beta[:,0]*test_dat['logbasepercent'].values[r] + beta[:,1]*test_dat['basewk'].values[r]/52.0 + \
            beta[:,2]*test_dat['basepercent'].values[r] + patre[:, (foldresults['patno'][r]-1) ] * pat_sigma + \
            test_dat['relweek'].values[r]/52.0 * ( mu[:, 1] + patslopere[:, (foldresults['patno'][r]-1) ]*slope_sigma + \
                                                  beta[:,3]*(test_dat['smoker'].values[r]>=1) + beta[:,4]*(test_dat['smoker'].values[r]==2) + \
                                                  beta[:,5]*test_dat['basewk'].values[r]/52.0 +  beta[:,6]*test_dat['basenormal'].values[r] )
        
        samples = np.exp( np.repeat(avgpred, 100) + np.random.normal(loc=0, scale=1, size=100*len(sigma)) * np.repeat(sigma, 100) ) * test_dat['denom'].values[r]/100.0        
        foldresults['pred'][r], foldresults['pred_sd'][r] = minimize_metric(samples)        
        
    if fold==0:
        results = pd.DataFrame(foldresults)
    else:
        results = pd.concat([results, pd.DataFrame(foldresults)]) # DataFrame.append was removed in pandas 2.0
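
The line that builds `samples` in the loop above is where the posterior draws become a posterior predictive distribution on the FVC scale: every posterior draw of the mean gets 100 residual-noise draws, is exponentiated back from the log scale, and is multiplied by `denom`/100. Here is the same idea for a single record in plain Python, with made-up numbers and a hypothetical function name `posterior_predictive_fvc`:

```python
import math
import random

def posterior_predictive_fvc(mean_log_percent, sigma, denom, draws_per_sample=100, seed=0):
    """For each posterior draw (mean, sigma), simulate noisy log-Percent values and map them to FVC."""
    rng = random.Random(seed)
    fvc_draws = []
    for m, s in zip(mean_log_percent, sigma):
        for _ in range(draws_per_sample):
            log_percent = rng.gauss(m, s)                             # add residual noise on the log scale
            fvc_draws.append(math.exp(log_percent) * denom / 100.0)   # back to Percent, then to FVC
    return fvc_draws

# Three made-up posterior draws for one record of a patient with denom = 4000
mean_log_percent = [math.log(60.0), math.log(58.0), math.log(61.0)]
sigma = [0.05, 0.06, 0.055]
draws = posterior_predictive_fvc(mean_log_percent, sigma, denom=4000.0)
print(len(draws), min(draws), max(draws))
```

Including the residual noise matters because the competition metric is evaluated against observed FVC values, not against the underlying mean.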


# Create submission

In [None]:
results['Patient_Week'] = results['Patient'].astype(str) + '_' + results['Weeks'].astype(str)
results['FVC'] = results['pred']
results['Confidence'] = results['pred_sd']

In [None]:
results[['Patient_Week', 'FVC', 'Confidence']].to_csv('submission.csv', index=False)