# Parameter estimation with Markov chain Monte Carlo {#sec-parameter-estimation-with-mcmc-1}

[Data set download](https://s3.amazonaws.com/bebi103.caltech.edu/data/singer_transcript_counts.csv)

<hr>

In [1]:
#| code-fold: true

# Colab setup ------------------
import os, shutil, sys, subprocess, urllib.request
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade polars iqplot colorcet datashader bebi103 arviz cmdstanpy watermark"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    from cmdstanpy.install_cmdstan import latest_version
    cmdstan_version = latest_version()
    cmdstan_url = f"https://github.com/stan-dev/cmdstan/releases/download/v{cmdstan_version}/"
    fname = f"colab-cmdstan-{cmdstan_version}.tgz"
    urllib.request.urlretrieve(cmdstan_url + fname, fname)
    shutil.unpack_archive(fname)
    os.environ["CMDSTAN"] = f"./cmdstan-{cmdstan_version}"
    data_path = "https://s3.amazonaws.com/bebi103.caltech.edu/data/"
else:
    data_path = "../data/"
# ------------------------------

In [2]:
import numpy as np
import scipy.stats as st
import polars as pl

import cmdstanpy
import arviz as az

import iqplot
import bebi103

import bokeh.io
import bokeh.layouts
import bokeh.plotting
bokeh.io.output_notebook()

<hr>

In this lesson, we will learn how to use **Markov chain Monte Carlo** (MCMC) to do parameter estimation. To get the basic idea behind MCMC, imagine for a moment that we can draw samples out of the posterior distribution. This means that the probability of drawing a given set of parameter values is proportional to the posterior probability of that set of values. If we drew many, many such samples, we could reconstruct the posterior from the samples, e.g., by making histograms. That's a big thing to imagine: that we can draw properly weighted samples. But it turns out that we can! That is what MCMC allows us to do.

We discussed some theory behind this seemingly miraculous capability in lecture. For this lesson, we will just use the fact that we can do the sampling to learn about posterior distributions in the context of parameter estimation.
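To make the idea of reconstructing a distribution from samples concrete, here is a minimal sketch. It does not use MCMC at all; it assumes a stand-in "posterior" (a standard Normal, chosen only because its PDF is known and NumPy can sample it directly) and checks that a normalized histogram of the samples approximates that PDF. The seed, sample size, and binning are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Stand-in "posterior": a standard Normal, used only because its PDF is known
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# A histogram normalized to integrate to one approximates the PDF
density, edges = np.histogram(samples, bins=50, range=(-4, 4), density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Analytical PDF evaluated at the bin centers
pdf = np.exp(-(centers**2) / 2) / np.sqrt(2 * np.pi)

# Largest discrepancy between the histogram and the true PDF
max_abs_error = np.max(np.abs(density - pdf))
print(f"Largest discrepancy between histogram and PDF: {max_abs_error:.4f}")
```

With MCMC, the only change in this picture is where the samples come from: a Markov chain produces them instead of a direct random number generator.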

## The data set

The data come from the [Elowitz lab](http://elowitz.caltech.edu/), published in Singer et al., Dynamic Heterogeneity and DNA Methylation in Embryonic Stem Cells, *Molec. Cell*, **55**, 319-331, 2014, available [here](https://doi.org/10.1016/j.molcel.2014.06.029). In the following paragraphs, I repeat the description of the data set and EDA from last term:

>In this paper, the authors investigated cell populations of embryonic stem cells using RNA single molecule fluorescence in situ hybridization (smFISH), a technique that enables them to count the number of mRNA transcripts in a cell for a given gene.  They were able to measure four different genes in the same cells. So, for one experiment, they get the counts of four different genes in a collection of cells.  
>
>The authors focused on genes that code for pluripotency-associated regulators to study cell differentiation. Indeed, differing gene expression levels are a hallmark of differentiated cells.  The authors do not just look at counts in a given cell at a given time.  The *temporal* nature of gene expression is also important.  While the authors do not directly look at temporal data using smFISH (since the technique requires fixing the cells), they did look at time lapse fluorescence movies of other regulators.  We will not focus on these experiments here, but will discuss how the distribution of mRNA counts acquired via smFISH can serve to provide some insight about the dynamics of gene expression.
>
>The data set we are analyzing now comes from an experiment where smFISH was performed in 279 cells for the genes *rex1*, *rest*, *nanog*, and *prdm14*.  The data set may be downloaded at [https://s3.amazonaws.com/bebi103.caltech.edu/data/singer_transcript_counts.csv](https://s3.amazonaws.com/bebi103.caltech.edu/data/singer_transcript_counts.csv).

## ECDFs of mRNA counts {#sec-eda-smfish}

We will do a quick EDA to get a feel for the data set by generating ECDFs for the mRNA counts for each of the four genes. 

In [3]:
df = pl.read_csv(os.path.join(data_path, "singer_transcript_counts.csv"), comment_prefix="#")

genes = ["Nanog", "Prdm14", "Rest", "Rex1"]

plots = [
    iqplot.ecdf(
        data=df[gene],
        q=gene,
        x_axis_label="mRNA count",
        title=gene,
        frame_height=150,
        frame_width=200,
    )
    for gene in genes
]

bokeh.io.show(
    bokeh.layouts.column(bokeh.layouts.row(*plots[:2]), bokeh.layouts.row(*plots[2:]))
)

Note the difference in the $x$-axis scales. Clearly, *prdm14* has far fewer mRNA copies than the other genes. The presence of two inflection points in the *rex1* ECDF implies bimodality.

## Building a generative model

We can model the transcript counts, which result from bursty gene expression, as being Negative Binomially distributed. (The details behind this model are a bit nuanced, and you can read about them [here](https://biocircuits.github.io/chapters/16_bursty.html).) For a given gene, the likelihood for the counts is

$$
\begin{align}
n_i \mid \alpha, b \sim \text{NegBinom}(\alpha, 1/b) \;\forall i,
\end{align}
$${#eq-nbinom-likelihood}

where $\alpha$ is the **burst frequency** (higher $\alpha$ means gene expression comes on more frequently) and $b$ is the **burst size**, i.e., the typical number of transcripts made per burst. We have therefore identified the two parameters we need to estimate, $\alpha$ and $b$.

Because the Negative Binomial distribution is often parametrized in terms of $\alpha$ and $\beta= 1/b$, we can alternatively state our likelihood as

$$
\begin{align}
&\beta = 1/b,\\[1em]
&n_i \mid \alpha, \beta \sim \text{NegBinom}(\alpha, \beta)\;\; \forall i.
\end{align}
$${#eq-nbinom-likelihood-beta}
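As a quick check of the parametrization, the sketch below draws Negative Binomial counts with NumPy and compares the sample mean and variance to $\alpha/\beta = \alpha b$ and $\alpha(1+\beta)/\beta^2 = \alpha b (1 + b)$. The values of `alpha` and `b` are hypothetical and chosen only for illustration. NumPy uses an $(n, p)$ parametrization, so the sketch sets $n = \alpha$ and $p = \beta/(1+\beta)$.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Hypothetical parameter values, for illustration only
alpha = 2.0  # burst frequency
b = 10.0     # burst size
beta = 1.0 / b

# NumPy parametrizes the Negative Binomial by (n, p); choosing n = alpha and
# p = beta / (1 + beta) matches the (alpha, beta) parametrization
p = beta / (1.0 + beta)
counts = rng.negative_binomial(alpha, p, size=200_000)

mean_theory = alpha / beta                   # equals alpha * b
var_theory = alpha * (1.0 + beta) / beta**2  # equals alpha * b * (1 + b)

print(f"mean:     sampled {counts.mean():.2f}, expected {mean_theory:.2f}")
print(f"variance: sampled {counts.var():.2f}, expected {var_theory:.2f}")
```

The mean count is the product $\alpha b$, so data that pin down the mean constrain the product of the two parameters more tightly than either one alone.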

Given that we have a Negative Binomial likelihood, we are left to specify priors for the burst size $b$ and the burst frequency $\alpha$.

### Priors for burst size and inter-burst time

We will apply the [bet-the-farm technique](../02/choice_of_prior.html#the-bet-the-farm-method-of-specifying-weakly-informative-priors) to get our priors for the burst size and inter-burst times. I would expect the time between bursts to be longer than a second, since it takes time for the transcriptional machinery to assemble. I would expect it to be shorter than a few hours, since an organism would need to adapt its gene expression based on environmental changes on that time scale or faster. The time between bursts needs to be in units of RNA lifetimes, and bacterial RNA lifetimes are of order minutes, so the dimensionless inter-burst time ranges from about $10^{-2}$ to $10^2$. Because $\alpha$ is the burst frequency in units of inverse RNA lifetimes, it is the reciprocal of this dimensionless time and spans the same range on a logarithmic scale. So, the range of values of $\alpha$ is $10^{-2}$ to $10^2$, leading to a prior of 

$$\begin{align}
\log_{10} \alpha \sim \text{Norm}(0, 1).
\end{align}
$$

I would expect the burst size to depend on promoter strength and/or strength of transcriptional activators. I could imagine anywhere from a few to a few thousand transcripts per burst, giving a range of $10^0$ to $10^4$, and a prior of

$$\begin{align}
\log_{10} b \sim \text{Norm}(2, 1).
\end{align}
$$
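The two priors can be checked against the stated ranges with a short calculation. This sketch assumes the ranges are meant as roughly two standard deviations on either side of the mean on the $\log_{10}$ scale, which is what the numbers above imply. The helper `normal_cdf` is defined here for illustration and uses only the standard library.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a Normal distribution, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# log10(alpha) ~ Norm(0, 1): prior mass for alpha between 1e-2 and 1e2
mass_alpha = normal_cdf(2.0, 0.0, 1.0) - normal_cdf(-2.0, 0.0, 1.0)

# log10(b) ~ Norm(2, 1): prior mass for b between 1e0 and 1e4
mass_b = normal_cdf(4.0, 2.0, 1.0) - normal_cdf(0.0, 2.0, 1.0)

print(f"Prior mass for alpha in [1e-2, 1e2]: {mass_alpha:.4f}")
print(f"Prior mass for b in [1e0, 1e4]:      {mass_b:.4f}")
```

Both priors place about 95% of their mass inside the ranges argued for above, with the remainder split between the two tails.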

We then have the following model.

$$
\begin{align}
&\log_{10} \alpha \sim \text{Norm}(0, 1),\\[1em]
&\log_{10} b \sim \text{Norm}(2, 1),\\[1em]
&\beta = 1/b,\\[1em]
&n_i \sim \text{NegBinom}(\alpha, \beta) \;\forall i.
\end{align}
$${#eq-bursty-model}
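Because the model is generative, it can be run forward to produce synthetic data sets. The following is a minimal sketch of that, drawing parameters from the priors and then counts from the likelihood with NumPy rather than Stan. The number of prior draws and the seed are arbitrary choices, and the $(n, p)$ conversion assumes NumPy's Negative Binomial parametrization with $n = \alpha$ and $p = \beta/(1+\beta)$.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

N = 279         # number of cells, as in the data set
n_draws = 1000  # number of draws from the priors (arbitrary choice)

# Draw parameters from the priors
log10_alpha = rng.normal(0.0, 1.0, size=n_draws)
log10_b = rng.normal(2.0, 1.0, size=n_draws)
alpha = 10.0**log10_alpha
b = 10.0**log10_b
beta = 1.0 / b

# Convert to NumPy's (n, p) parametrization: n = alpha, p = beta / (1 + beta)
p = beta / (1.0 + beta)

# One synthetic data set of N counts per prior draw
n_sim = rng.negative_binomial(alpha[:, None], p[:, None], size=(n_draws, N))

print("Shape of simulated counts:", n_sim.shape)
print("Median of per-data-set mean counts:", np.median(n_sim.mean(axis=1)))
```

Each row of `n_sim` is one data set the model considers plausible before seeing any measurements.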

## Sampling the posterior

To draw samples out of the posterior, we need to use some new Stan syntax. Here is the Stan code we will use with some notes about Stan syntax.

```stan
data {
  int<lower=0> N;
  array[N] int<lower=0> n;
}


parameters {
  real log10_alpha;
  real log10_b;
}


transformed parameters {
  real alpha = 10^log10_alpha;
  real b = 10^log10_b;
  real beta_ = 1.0 / b;
}


model {
  // Priors
  log10_alpha ~ normal(0, 1);
  log10_b ~ normal(2, 1);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}
```

- Note that the raise-to-power operator is `^`, not `**` as in Python.
- The `data` block contains the counts $n$ of the mRNA transcripts. There are $N$ cells that are measured. Most `data` blocks look like this: an integer specifies the size of the data set, and then the data set itself is given as an array. The declaration `array[N] int` says that `n` is an array of length `N` whose entries are integers. We specified a lower bound on the data using the `<lower=0>` syntax (bounds can be placed on parameters in the same way).
- The `parameters` block tells us what the parameters of the posterior are. In this case, we wish to sample out of the posterior $g(\alpha, b \mid \mathbf{n})$, where $\mathbf{n}$ is the set of transcript counts for the gene. So, the two parameters are $\alpha$ and $b$. However, since the priors were more easily defined in terms of logarithms, we specify $\log_{10} \alpha$ and $\log_{10} b$ as the parameters.
- The `transformed parameters` block allows you to do any transformation of the parameters you are sampling for convenience. In this case, Stan's Negative Binomial distribution is parametrized by $\beta = 1/b$, so we transform `b` to `beta_`. Notice that I have called this variable `beta_` and not `beta`. I did this because `beta` is one of Stan's distributions, and you should avoid naming a variable after a word that is already in the Stan language. The other transformations we need to make convert the logarithms to the actual parameter values.
- Finally, the `model` block is where the model is specified. The syntax of the model block is almost identical to that of the hand-written model. 
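Compiling requires the Stan program to live in a file. Here is a minimal sketch, assuming the program above should be saved as `smfish.stan` in the working directory (the file name used in the compile step). Writing it from Python with `pathlib` is just one convenient option, not necessarily how the file was created for these notes. The blank lines are kept so that line numbers in any messages from Stan match the listing above.

```python
from pathlib import Path

# The Stan program from above, reproduced verbatim (blank lines included)
stan_code = """data {
  int<lower=0> N;
  array[N] int<lower=0> n;
}


parameters {
  real log10_alpha;
  real log10_b;
}


transformed parameters {
  real alpha = 10^log10_alpha;
  real b = 10^log10_b;
  real beta_ = 1.0 / b;
}


model {
  // Priors
  log10_alpha ~ normal(0, 1);
  log10_b ~ normal(2, 1);

  // Likelihood
  n ~ neg_binomial(alpha, beta_);
}
"""

# Write the program to disk so it can be compiled
stan_path = Path("smfish.stan")
stan_path.write_text(stan_code)

print(f"Wrote {len(stan_code.splitlines())} lines to {stan_path}")
```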

Now that we have specified our model, we can compile it.

In [4]:
sm = cmdstanpy.CmdStanModel(stan_file='smfish.stan')

With our compiled model, we just need to specify the data and let Stan's sampler do the work! When using CmdStanPy, the data have to be passed in as a dictionary with keys corresponding to the variable names declared in the `data` block of the Stan program and values as NumPy arrays with the appropriate data type. For this calculation, we will use the data set for the *rest* gene.

In [5]:
# Construct data dict, making sure data are ints
data = dict(N=len(df), n=df["Rest"].to_numpy())

# Sample using Stan
samples = sm.sample(    
    data=data,
    chains=4,
    iter_sampling=1000,
)

# Convert to ArviZ InferenceData instance
samples = az.from_cmdstanpy(posterior=samples)

12:15:48 - cmdstanpy - INFO - CmdStan start processing


chain 1 |          | 00:00 Status

chain 2 |          | 00:00 Status

chain 3 |          | 00:00 Status

chain 4 |          | 00:00 Status

12:15:48 - cmdstanpy - INFO - CmdStan done processing.
Exception: neg_binomial_lpmf: Shape parameter is inf, but must be positive finite! (in 'smfish.stan', line 26, column 2 to column 33)




We got lots of warnings! In particular, the warnings say that some of the parameters fed into the Negative Binomial distribution are invalid, being either zero or infinite. These warnings arise during Stan's warm-up phase, while it is assessing optimal settings for sampling, and should not be of concern. It is generally a bad idea to silence warnings, but if you are sure that the warnings the sampler will throw are of no concern, you can silence logging using the `bebi103.stan.disable_logging` context manager. In most notebooks in these notes, to avoid clutter for pedagogical purposes, we will disable the warnings.

In [6]:
with bebi103.stan.disable_logging():
    samples = sm.sample(    
        data=data,
        chains=4,
        iter_sampling=1000,
    )

# Convert to ArviZ InferenceData instance
samples = az.from_cmdstanpy(posterior=samples)

chain 1 |          | 00:00 Status

chain 2 |          | 00:00 Status

chain 3 |          | 00:00 Status

chain 4 |          | 00:00 Status


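To see how zero or infinite values can reach the Negative Binomial distribution in the first place, consider what happens when the sampler tries an extreme value of an unconstrained parameter during warm-up. The sketch below mimics the `transformed parameters` block in double precision with NumPy. The values of `log10_alpha_big` and `log10_b_small` are hypothetical, picked only to be far outside the representable range, and this illustrates the mechanism rather than reproducing what Stan actually proposed.

```python
import numpy as np

# Hypothetical extreme values of the unconstrained parameters that a sampler
# might try while it is still tuning its step size
log10_alpha_big = 400.0
log10_b_small = -400.0

# Suppress NumPy's floating point warnings so only the results are shown
with np.errstate(over="ignore", under="ignore", divide="ignore"):
    alpha = np.power(10.0, log10_alpha_big)  # overflows to inf
    b = np.power(10.0, log10_b_small)        # underflows to 0.0
    beta_ = np.float64(1.0) / b              # division by zero gives inf

print("alpha =", alpha)
print("b     =", b)
print("beta_ =", beta_)
```

Stan rejects such proposals and carries on, which is why these messages are harmless when they are confined to warm-up.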
Now, let's take a quick look at the samples.

In [7]:
samples.posterior

As we have already seen, the samples are indexed by chain and draw. Parameters represented in the `parameters` and `transformed parameters` blocks are reported.

## Plots of the samples

There are many ways of looking at the samples. In this case, since we have two parameters of interest, the burst frequency and burst size, we can plot the samples as a scatter plot to get the approximate density.

In [8]:
def plot_scatter(samples, title=None):
    p = bokeh.plotting.figure(
        frame_width=200,
        frame_height=200,
        x_axis_label='α',
        y_axis_label='b',
        title=title,
    )
    
    p.scatter(
        samples.posterior['alpha'].values.ravel(), 
        samples.posterior['b'].values.ravel(),
        size=2,
        alpha=0.2,
    )

    return p

bokeh.io.show(plot_scatter(samples))

We see very strong correlation between $\alpha$ and $b$. This does not necessarily mean that they depend on each other. Rather, it means that *our degree of belief about their values* depends on both in a correlated way. The measurements we made cannot effectively separate the effects of $\alpha$ and $b$ on the transcript counts.
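The strength of such a correlation can be quantified with a correlation coefficient computed directly from the draws. The sketch below uses synthetic stand-in draws, not output of the Stan model. They are built on the assumption that the product $\alpha b$ (the mean count) is tightly constrained while $\alpha$ alone is not, which is one plausible way a strong correlation like this can arise. The numbers 4.0 and 75.0 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Synthetic stand-ins for posterior draws (NOT output of the Stan model):
# the product alpha * b is pinned down tightly, alpha alone is not
alpha = rng.lognormal(mean=np.log(4.0), sigma=0.1, size=4000)
mean_count = rng.normal(loc=75.0, scale=1.5, size=4000)
b = mean_count / alpha

# Pearson correlation coefficient between the two sets of draws
corr = np.corrcoef(alpha, b)[0, 1]

print(f"Correlation between alpha and b draws: {corr:.3f}")
```

With real output, the same calculation would use `samples.posterior["alpha"].values.ravel()` and `samples.posterior["b"].values.ravel()` in place of the synthetic arrays.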

## Marginalizing the posterior

We can also plot the marginalized posterior distributions. Remember that the marginalized distributions properly take into account the effects of the other variable, including the strong correlation I just mentioned. To obtain the marginalized distribution, we simply ignore the samples of the parameters we are marginalizing out. It is convenient to look at the marginalized distributions as ECDFs.

In [9]:
plots = [
    iqplot.ecdf(
        samples.posterior[param].values.ravel(),
        q=param,
        frame_height=200,
        frame_width=250,
    )
    for param in ["alpha", "b"]
]

bokeh.io.show(bokeh.layouts.row(*plots))
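Marginal samples can also be summarized numerically. The sketch below shows how an ECDF and a central 95% interval can be computed by hand from the draws of a single parameter. It uses a synthetic array with the same `(chain, draw)` layout as `samples.posterior["alpha"].values`. The draws themselves are made up for illustration and do not come from the Stan model.

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# Synthetic stand-in with shape (chains, draws); NOT draws from the Stan model
alpha_draws = rng.lognormal(mean=np.log(4.0), sigma=0.1, size=(4, 1000))

# Marginalizing amounts to using this parameter's draws and ignoring the rest
alpha_flat = alpha_draws.ravel()

# ECDF: sorted values against the fraction of draws at or below each value
x = np.sort(alpha_flat)
y = np.arange(1, len(x) + 1) / len(x)

# Median and central 95% interval of the marginal distribution
lower, median, upper = np.percentile(alpha_flat, [2.5, 50, 97.5])

print(f"median = {median:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
```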

Alternatively, we can visualize the marginalized posterior PDFs as histograms. Because we have such a large number of samples, binning bias from histograms is less of a concern.

In [10]:
plots = [
    iqplot.histogram(
        samples.posterior[param].values.ravel(),
        q=param,
        rug=False,
        frame_height=200,
        frame_width=250,
    )
    for param in ["alpha", "b"]
]

bokeh.io.show(bokeh.layouts.row(*plots))

## Corner plots {#sec-corner-plots}

We now have a two-dimensional posterior distribution, with our two parameters being $\alpha$ and $b$. We can combine a plot of the samples from the full posterior with the histograms (or ECDFs) of samples from the marginal distributions in a **corner plot**, available via `bebi103.viz.corner()`.

In [11]:
bokeh.io.show(
    bebi103.viz.corner(
        samples, parameters=[('alpha', 'α'), 'b']
    )
)

Corner plots generalize to dimensions beyond two. The off-diagonal plots are of samples from marginal distributions where two parameters remain and the diagonals are plots from univariate marginal distributions.

In [12]:
bebi103.stan.clean_cmdstan()

## Computing environment

In [13]:
%load_ext watermark
%watermark -v -p numpy,scipy,polars,cmdstanpy,arviz,bokeh,iqplot,bebi103,jupyterlab
print("cmdstan   :", bebi103.stan.cmdstan_version())

Python implementation: CPython
Python version       : 3.13.5
IPython version      : 9.4.0

numpy     : 2.2.6
scipy     : 1.16.0
polars    : 1.31.0
cmdstanpy : 1.2.5
arviz     : 0.22.0
bokeh     : 3.7.3
iqplot    : 0.3.7
bebi103   : 0.1.28
jupyterlab: 4.4.5

cmdstan   : 2.36.0
