# Posterior predictive checks {#sec-posterior-predictive-checks}

[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/singer_transcript_counts.csv)

<hr>

In [1]:
#| code-fold: true

# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

In [2]:
import numpy as np
import scipy.stats as st
import polars as pl

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
import bokeh.plotting
bokeh.io.output_notebook()

<hr>

In @sec-parameter-estimation-with-mcmc-1, we performed parameter estimation using MCMC. We were thrilled to get parameter estimates! But how do we know if our model is a reasonable approximation of the true data generation process?

One obvious approach is to consider parameter values suggested by our posterior and then use those to parametrize the likelihood to generate new data sets from the model. We can then check to see if the observed data are consistent with the data sets generated by the posterior-parametrized model. This approach is referred to as **posterior predictive checks**. The procedure is remarkably similar to prior predictive checks, with one difference (highlighted in bold below).

1. Draw parameter values out of the **posterior**.
2. Use those parameter values in the likelihood to generate a new data set.
3. Store the result and repeat.
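
In symbols, the three steps above amount to sampling from the posterior predictive distribution. The notation here is an assumption on our part: $\theta$ denotes the parameters, $y$ the measured data, $\tilde{y}$ a newly generated data set, $f(\tilde{y}\mid\theta)$ the likelihood, and $g(\theta \mid y)$ the posterior.

$$\begin{align}
f(\tilde{y} \mid y) = \int \mathrm{d}\theta\, f(\tilde{y} \mid \theta)\, g(\theta \mid y).
\end{align}
$$

Drawing $\theta$ from $g(\theta \mid y)$ and then $\tilde{y}$ from $f(\tilde{y} \mid \theta)$ yields a draw of $\tilde{y}$ from $f(\tilde{y} \mid y)$, so the integral never needs to be evaluated explicitly.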

Conveniently, we get samples of parameter values out of the posterior from Markov chain Monte Carlo. Once we have the generated data sets, we can compare them to the measured data. This helps answer the question: Could this generative model actually produce the observed data? If the answer is yes, the generative model is not ruled out by the data (though it still may be a bad model). If the answer is no, then the generative model cannot fully describe the process by which the data were generated.

Part of the art of performing a posterior predictive check (or a prior predictive check for that matter) is choosing good summaries of the measured data that can be clearly and quantitatively visualized. Which summaries you choose to plot is up to you, and is often not a trivial choice; as [Michael Betancourt says](https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html#21_domain_expertise_consistency), "Constructing an interpretable yet informative summary statistic is very much a fine art." For univariate measurements, the ECDF is a good summary. You may need to choose particular summaries that are best for your modeling task at hand.

Let us proceed with posterior predictive checks on our inferences from @sec-parameter-estimation-with-mcmc-1. As a reminder, the generative model, given by @eq-bursty-model, is below.

$$\begin{align}
&\log_{10} \alpha \sim \text{Norm}(0, 1),\\[1em]
&\log_{10} b \sim \text{Norm}(2, 1),\\[1em]
&\beta = 1/b,\\[1em]
&n_i \sim \text{NegBinom}(\alpha, \beta) \;\forall i.
\end{align}
$$

## Getting posterior predictive samples using Stan

We could always get samples out of the posterior and then use Numpy to generate posterior predictive samples. However, it is more convenient to use Stan to do it. As such, we augment our Stan code with posterior predictive checks in the `generated quantities` block. Our updated Stan code is

```stan
data {
  int<lower=0> N;
  array[N] int<lower=0> n;
}


parameters {
  real log10_alpha;
  real log10_b;
}


transformed parameters {
  real alpha = 10^log10_alpha;
  real b = 10^log10_b;
  real beta_ = 1.0 / b;
}


model {
  // Priors
  log10_alpha ~ normal(0, 1);
  log10_b ~ normal(2, 1);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}


generated quantities {
  array[N] int<lower=0> n_ppc;

  // With scalar arguments, neg_binomial_rng returns a single draw
  for (i in 1:N) {
    n_ppc[i] = neg_binomial_rng(alpha, beta_);
  }
}
```
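
For comparison, the NumPy route mentioned above might look like the following sketch. The arrays `alpha_draws` and `b_draws` are synthetic stand-ins for posterior draws (in practice they would come from the MCMC samples), and `draw_ppc` is a hypothetical helper name. Note that NumPy parametrizes the Negative Binomial with a success probability $p$; matching the $(\alpha, \beta)$ parametrization used here requires $p = \beta/(1+\beta)$.

```python
import numpy as np


def draw_ppc(alpha, b, N, rng):
    """Draw one data set of N counts per (alpha, b) pair.

    Returns an integer array of shape (len(alpha), N).
    """
    alpha = np.asarray(alpha, dtype=float)
    beta = 1.0 / np.asarray(b, dtype=float)

    # NumPy uses (n, p) with mean n(1-p)/p; p = beta/(1+beta) gives mean alpha/beta
    p = beta / (1.0 + beta)

    # Column vectors broadcast so that each row uses its own parameters
    return rng.negative_binomial(
        alpha[:, np.newaxis], p[:, np.newaxis], size=(len(alpha), N)
    )


rng = np.random.default_rng(3252)

# Synthetic stand-ins for posterior draws, NOT output of the model above
alpha_draws = rng.uniform(1.0, 2.0, size=200)
b_draws = rng.uniform(10.0, 20.0, size=200)

n_ppc_np = draw_ppc(alpha_draws, b_draws, N=279, rng=rng)
```

Each row of `n_ppc_np` is one generated data set. Doing the draws in Stan instead keeps them alongside the parameter samples, which is the route we take in what follows.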

Let's grab our samples, including the posterior predictive checks!

In [3]:
# Load in as dataframe
df = pl.read_csv(os.path.join(data_path, "singer_transcript_counts.csv"), comment_prefix="#")

# Construct data dict, making sure data are ints
data = dict(N=len(df), n=df["Rest"].to_numpy())

with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file='smfish_with_ppc.stan')
    samples = sm.sample(data=data)



When we convert the samples into an ArviZ `InferenceData` instance, we should specify that the variable `n_ppc` is a posterior predictive variable.

In [4]:
samples = az.from_cmdstanpy(samples, posterior_predictive='n_ppc')

Now, let's look at the posterior predictive checks. Note that the `n_ppc` variable has an entire data set of $N$ mRNA counts *for each posterior sample of* $\alpha$ and $b$. This can be verified by looking at its shape.

In [5]:
samples.posterior_predictive.n_ppc.shape

(4, 1000, 279)

Indeed, for four chains, we have 1000 samples each, with each sample containing 279 mRNA counts. To perform a graphical posterior predictive check, we can plot all 4000 ECDFs of the posterior-sampled mRNA copy numbers. For each point along the $n$-axis, we compute percentiles of the ECDF values, and this gives us intervals within which we might expect data sets to lie. We can then overlay the measured data and compare.
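
The percentile calculation just described can be sketched as follows. This is a minimal illustration under stated assumptions: `ecdf_envelope` is a hypothetical helper, the percentiles are chosen to give 68% and 95% intervals around the median, and the input is a synthetic array rather than our actual posterior predictive samples. It is not necessarily how any plotting library computes its envelopes.

```python
import numpy as np


def ecdf_envelope(ppc, ptiles=(2.5, 16, 50, 84, 97.5)):
    """Percentiles of ECDF values across predictive data sets.

    ppc : array of shape (n_samples, N), one data set per row.
    Returns the grid of n values and an array of shape
    (len(ptiles), len(grid)).
    """
    ppc = np.asarray(ppc)
    n_grid = np.arange(ppc.min(), ppc.max() + 1)

    # ECDF of each data set at each grid point: the fraction of
    # counts in that data set that are <= n
    sorted_ppc = np.sort(ppc, axis=1)
    ecdfs = np.array(
        [np.searchsorted(row, n_grid, side="right") for row in sorted_ppc]
    ) / ppc.shape[1]

    # Percentiles across data sets, separately at each grid point
    return n_grid, np.percentile(ecdfs, ptiles, axis=0)


# Synthetic stand-in for posterior predictive samples
rng = np.random.default_rng(0)
fake_ppc = rng.negative_binomial(2, 0.1, size=(400, 279))

n_grid, env = ecdf_envelope(fake_ppc)
```

Rows 0 and 4 of `env` bound the 95% region, rows 1 and 3 the 68% region, and row 2 is the median ECDF.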

The `bebi103.viz.predictive_ecdf()` function does this for us. It expects input having `n_samples` rows and `N` columns, where `N` is the number of data points and `n_samples` is the total number of posterior predictive data sets we generated. Because we sampled with four chains, the posterior predictive array is three-dimensional. The first index is the chain, the second is the draw, and the third is the data point. The samples are stored as an xarray `DataArray`, which we can reshape using its `stack()` method. We will collapse the `chain` and `draw` indexes into a single `sample` index. We also want to be sure to specify the ordering of the indexes; samples should go first, followed by the index of the data point. We can do this using the `transpose()` method of the `DataArray`, which lets us specify the ordering of the indexes. We can then make the predictive ECDF plot, passing in our measured data using the `data` keyword argument.
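
To make the index bookkeeping concrete, here is a sketch of the same collapse in plain NumPy, using a made-up array in place of the real posterior predictive samples. Reshaping a `(chain, draw, N)` array to `(chain * draw, N)` keeps each data set as one row, which is the layout described above. We are assuming, without having checked, that the rows come out in the same order as with the xarray calls; for the predictive ECDF the row order does not matter in any case.

```python
import numpy as np

n_chains, n_draws, N = 4, 1000, 279

# Made-up counts standing in for posterior predictive samples
rng = np.random.default_rng(0)
ppc_3d = rng.poisson(20, size=(n_chains, n_draws, N))

# Collapse the chain and draw axes into a single sample axis
ppc_2d = ppc_3d.reshape(n_chains * n_draws, N)

# The data set from chain c, draw d lands in row c * n_draws + d
c, d = 2, 17
assert np.array_equal(ppc_2d[c * n_draws + d], ppc_3d[c, d])
```

The xarray calls in the next cell do this bookkeeping by dimension name instead of by position.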

In [6]:
n_ppc = (
    samples.posterior_predictive['n_ppc']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "n_ppc_dim_0")
)

bokeh.io.show(
    bebi103.viz.predictive_ecdf(n_ppc, data=df['Rest'])
)

The dark line is the median posterior-parametrized ECDF, and the shaded regions contain 68% and 95% of the samples. The data seem consistent with the model. This can be seen more clearly by plotting the *difference* between each ECDF and the median ECDF, which is accomplished using the `diff='ecdf'` keyword argument.

In [7]:
bokeh.io.show(
    bebi103.viz.predictive_ecdf(n_ppc, data=df['Rest'], diff='ecdf')
)

We see more clearly that the observed data set is consistent with data sets generated by the model.

## How about Rex1?

Let us now do the same analysis with the Rex1 gene. If you recall the EDA (@sec-eda-smfish), the Rex1 counts exhibited bimodality, so a Negative Binomial model may not be our best option.

In [8]:
# Construct data dict for Rex1
data_rex1 = dict(N=len(df), n=df["Rex1"].to_numpy())

# Grab samples
with bebi103.stan.disable_logging():
    samples = az.from_cmdstanpy(
        sm.sample(data=data_rex1), 
        posterior_predictive='n_ppc'
    )

# Plot posterior predictive check
n_ppc = (
    samples.posterior_predictive['n_ppc']
    .stack({"sample": ("chain", "draw")})
    .transpose("sample", "n_ppc_dim_0")
)

bokeh.io.show(
    bebi103.viz.predictive_ecdf(n_ppc, data=df['Rex1'])
)



This already looks bad. Let's look at the difference of the ECDFs to get more clarity.

In [9]:
bokeh.io.show(
    bebi103.viz.predictive_ecdf(n_ppc, data=df['Rex1'], diff='ecdf')
)

In [10]:
bebi103.stan.clean_cmdstan()

## Computing environment

In [11]:
%load_ext watermark
%watermark -v -p numpy,scipy,polars,cmdstanpy,arviz,bokeh,iqplot,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())

Python implementation: CPython
Python version       : 3.13.5
IPython version      : 9.4.0

numpy     : 2.2.6
scipy     : 1.16.0
polars    : 1.31.0
cmdstanpy : 1.2.5
arviz     : 0.22.0
bokeh     : 3.7.3
iqplot    : 0.3.7
bebi103   : 0.1.28
jupyterlab: 4.4.5

cmdstan   : 2.36.0
