# Reparameterizing the Weibull Accelerated Failure Time Model

In [None]:
%matplotlib inline
import pymc3 as pm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.datasets
import patsy
import theano.tensor as tt

plt.style.use('seaborn-darkgrid')
print('Running on PyMC3 v{}'.format(pm.__version__))

## Dataset

The [previous example notebook on Bayesian parametric survival analysis](https://docs.pymc.io/notebooks/bayes_param_survival.html) introduced two different accelerated failure time (AFT) models: Weibull and log-logistic. In this notebook, we will explore three different parameterizations of the Weibull AFT model.

The data set we'll use is the `flchain` R data set, which comes from a medical study investigating the effect of serum free light chain (FLC) on lifespan. Read the full documentation of the data by running:

`print(statsmodels.datasets.get_rdataset(package='survival', dataname='flchain').__doc__)`.

Here, we are interested in seeing what effect, if any, the `age` and `sex` covariates have on the survival times of subjects.

In [None]:
# Fetch and clean data
data = (statsmodels.datasets
                   .get_rdataset(package='survival', dataname='flchain')
                   .data
                   .sample(500)  # Limit ourselves to 500 observations
                   .reset_index(drop=True))

In [None]:
data.head()

In [None]:
# Get matrices of covariates from dataframe using patsy
_, X_df = patsy.dmatrices("death ~ age + sex", data, return_type='dataframe')
X_df = X_df.iloc[:, X_df.columns != 'Intercept']

# `X` contains covariates, `y` contains survival times, and
# `censored` contains whether or not the survival time has been censored
X = X_df.values
y = data.futime.values
censored = ~data['death'].values.astype(bool)

## Using `pm.Potential`

We have a unique problem when modelling censored data. Strictly speaking, we don't observe a survival time for a censored subject: we only know that the subject survived at least until the censoring time. How can we include this information in our model?

One way to do this is by making use of `pm.Potential`. The [PyMC2 docs](https://pymc-devs.github.io/pymc/modelbuilding.html#the-potential-class) explain its use very well. Essentially, declaring `pm.Potential('x', logp)` will add `logp` to the log-likelihood of the model. For a censored observation, the natural choice of `logp` is the log of the survival function (the log complementary CDF) evaluated at the censoring time.
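To make the role of the potential concrete, here is a minimal plain-Python sketch (not PyMC3 code) of the kind of log-likelihood the models below assemble: uncensored times contribute the Weibull log-density, and censored times contribute the log-probability of surviving past the censoring time. The function names and the toy numbers are illustrative assumptions, not part of the original analysis.

```python
import math

def weibull_logpdf(t, alpha, beta):
    """Log-density of a Weibull with shape alpha and scale beta."""
    return (math.log(alpha) - math.log(beta)
            + (alpha - 1.0) * (math.log(t) - math.log(beta))
            - (t / beta) ** alpha)

def weibull_logsf(t, alpha, beta):
    """Log survival function: log P(T > t)."""
    return -(t / beta) ** alpha

def censored_loglik(times, censored, alpha, beta):
    """Sum log-density over observed times and log-survival over censored ones."""
    total = 0.0
    for t, is_censored in zip(times, censored):
        if is_censored:
            total += weibull_logsf(t, alpha, beta)
        else:
            total += weibull_logpdf(t, alpha, beta)
    return total

# Hypothetical toy data: two observed deaths and one censored follow-up time
times = [120.0, 340.0, 500.0]
censored = [False, False, True]
loglik = censored_loglik(times, censored, alpha=1.0, beta=300.0)
```

In the PyMC3 models, the observed-data term plays the part of `weibull_logpdf`, and the `pm.Potential` term plays the part of `weibull_logsf`.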

## Model 1

[(Source)](https://discourse.pymc.io/t/weibull-survival-regression-aft/1107/3)

In [None]:
def weibull_lccdf(x, alpha, beta):
    ''' Log complementary cdf of Weibull distribution. '''
    return -(x / beta)**alpha
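As a sanity check on the formula above, the following standard-library sketch compares `exp(weibull_lccdf(...))` with the fraction of simulated Weibull draws that exceed a threshold. The parameter values, seed, and sample size are arbitrary assumptions chosen for illustration.

```python
import math
import random

def weibull_lccdf(x, alpha, beta):
    """Log complementary CDF of a Weibull with shape alpha and scale beta."""
    return -(x / beta) ** alpha

rng = random.Random(0)
alpha, beta, x = 1.5, 200.0, 150.0

# random.weibullvariate takes (scale, shape), the reverse of the order used above
draws = [rng.weibullvariate(beta, alpha) for _ in range(100_000)]
empirical_sf = sum(d > x for d in draws) / len(draws)
analytic_sf = math.exp(weibull_lccdf(x, alpha, beta))
```

The two survival probabilities should agree up to Monte Carlo error.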

In [None]:
with pm.Model() as model_1:
    alpha_sd = 10.0

    mu = pm.Normal('mu', mu=0, sd=100)
    alpha_raw = pm.Normal('a0', mu=0, sd=0.1)
    alpha = pm.Deterministic('alpha', tt.exp(alpha_sd * alpha_raw))
    beta = pm.Deterministic('beta', tt.exp(mu / alpha))
    
    y_obs = pm.Weibull('y_obs', alpha=alpha, beta=beta, observed=y[~censored])
    y_cens = pm.Potential('y_cens', weibull_lccdf(y[censored], alpha, beta))
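In `model_1`, `a0` has a prior standard deviation of 0.1 and is multiplied by `alpha_sd = 10` before exponentiation, so the implied prior on `log(alpha)` is approximately standard normal. The sketch below simulates that transform with the standard library only; the seed and number of draws are arbitrary assumptions, and it does not reproduce the sampler.

```python
import math
import random
import statistics

rng = random.Random(1)
alpha_sd = 10.0

# Draw a0 ~ Normal(0, 0.1) and apply the same transform as model_1
a0_draws = [rng.gauss(0.0, 0.1) for _ in range(50_000)]
log_alpha = [alpha_sd * a0 for a0 in a0_draws]
alpha_draws = [math.exp(v) for v in log_alpha]

log_alpha_sd = statistics.pstdev(log_alpha)    # close to 1
alpha_median = statistics.median(alpha_draws)  # close to exp(0) = 1
```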

In [None]:
with model_1:
    # Increase tune and target_accept, and change init, to avoid divergences
    trace_1 = pm.sample(draws=1000, tune=1000,
                        nuts_kwargs={'target_accept': 0.9},
                        init='adapt_diag')

In [None]:
pm.traceplot(trace_1);

In [None]:
pm.summary(trace_1).round(2)

## Model 2

[(Source)](https://github.com/stan-dev/example-models/blob/master/bugs_examples/vol1/kidney/kidney.stan)

In [None]:
with pm.Model() as model_2:
    alpha = pm.Normal('alpha', mu=0, sd=10)
    r = pm.Gamma('r', alpha=1, beta=0.001, testval=0.25)
    beta = pm.Deterministic('beta', tt.exp(-alpha / r))

    y_obs = pm.Weibull('y_obs', alpha=r, beta=beta, observed=y[~censored])
    y_cens = pm.Potential('y_cens', weibull_lccdf(y[censored], r, beta))
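With `beta = exp(-alpha / r)`, the term `(t / beta)**r` equals `exp(alpha) * t**r`, so `exp(alpha)` acts like a rate multiplying `t**r`. Comparing with `model_1`, where `beta = exp(mu / alpha)`, the two scale definitions coincide when `model_1`'s `mu` equals minus `model_2`'s `alpha` (for the same shape). The following sketch checks both identities numerically; the values are arbitrary and not estimates from the data.

```python
import math

def weibull_lccdf(x, alpha, beta):
    """Log complementary CDF of a Weibull with shape alpha and scale beta."""
    return -(x / beta) ** alpha

# Arbitrary illustrative values, not estimates from the data
r, alpha_2, t = 1.3, -7.0, 250.0

# model_2: scale = exp(-alpha / r)
beta_2 = math.exp(-alpha_2 / r)
lccdf_scale_form = weibull_lccdf(t, r, beta_2)

# Equivalent rate form: -(exp(alpha) * t**r)
lccdf_rate_form = -math.exp(alpha_2) * t ** r

# model_1: scale = exp(mu / shape); matches model_2 when mu = -alpha_2
mu_1 = -alpha_2
beta_1 = math.exp(mu_1 / r)
lccdf_model_1 = weibull_lccdf(t, r, beta_1)
```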

In [None]:
with model_2:
    # Increase tune and target_accept to avoid divergences
    trace_2 = pm.sample(draws=1000, tune=1000,
                        nuts_kwargs={'target_accept': 0.9})

In [None]:
pm.traceplot(trace_2);

In [None]:
pm.summary(trace_2).round(2)

## Model 3

[(Source)](http://austinrochford.com/posts/2017-10-02-bayes-param-survival.html)

In [None]:
logtime = np.log(y)

def gumbel_sf(y, mu, sigma):
    ''' Gumbel survival function. '''
    return 1.0 - tt.exp(-tt.exp(-(y - mu) / sigma))
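Since `pm.Potential` adds its argument to the log-probability, the quantity of interest for censored observations is the log of this survival function. The sketch below is a plain-Python illustration (not Theano code) of one way to compute that log with less round-off, using `expm1`; the helper name `gumbel_logsf` and the test values are assumptions made for this example.

```python
import math

def gumbel_sf(y, mu, sigma):
    """Gumbel survival function, in the same form as the notebook's version."""
    return 1.0 - math.exp(-math.exp(-(y - mu) / sigma))

def gumbel_logsf(y, mu, sigma):
    """Log of the same survival function, using expm1 to limit round-off."""
    z = (y - mu) / sigma
    # 1 - exp(-u) == -expm1(-u), accurate even when u = exp(-z) is tiny
    return math.log(-math.expm1(-math.exp(-z)))

# Arbitrary illustrative values
naive = math.log(gumbel_sf(6.0, 5.0, 0.5))
stable = gumbel_logsf(6.0, 5.0, 0.5)

# Far in the right tail the naive form would round to log(0); this one does not
far_tail = gumbel_logsf(30.0, 5.0, 0.5)
```

In a Theano model the analogous building blocks would be `tt.log` and `tt.expm1`.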

In [None]:
with pm.Model() as model_3:
    s = pm.HalfNormal('s', tau=5.0)
    gamma = pm.Normal('gamma', mu=0, sd=5)

    y_obs = pm.Gumbel('y_obs', mu=gamma, beta=s, observed=logtime[~censored])
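    # pm.Potential adds its argument to the log-probability,
    # so pass the log of the survival function
    y_cens = pm.Potential('y_cens',
                          tt.log(gumbel_sf(y=logtime[censored], mu=gamma, sigma=s)))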

In [None]:
with model_3:
    trace_3 = pm.sample(draws=1000, tune=1000,
                        init='adapt_diag')

In [None]:
pm.traceplot(trace_3);

In [None]:
pm.summary(trace_3).round(2)

## Authors

- Originally researched by [Junpeng Lao](https://junpenglao.xyz/) on Apr 21, 2018. See original code [here](https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/65447fdb431c78b15fbeaef51b8c059f46c9e8d6/PyMC3QnA/discourse_1107.ipynb).
- Authored and ported to Jupyter notebook by [George Ho](https://eigenfoo.xyz/) on Jul 15, 2018.