## Gelman-Rubin Test
The Gelman-Rubin test compares m chains started from very different initial conditions. The idea behind the test is that chains have reached stationarity once they have forgotten their initial conditions, i.e. when results from the separate chains are indistinguishable.

Specifically, the test compares the (average) variance within each chain to the estimated variance of the stationary distribution. The latter estimate converges to the true variance as the number of samples grows, so asymptotically GR checks whether the variance of the individual chains matches the variance of the stationary distribution, as best we know it from pooling all the chains.



#### The test
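Let's assume that we have m chains, each of length n. Writing $\theta_{ij}$ for the $i$th sample of the $j$th chain and $\mu_{\theta_j}$ for the mean of that chain, the sample variance of the $j$th chain is: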

$$ s_j^2 =  \frac{1}{n-1} \sum_i (\theta_{ij} - \mu_{\theta_j})^2$$

Let $w$ be the average within-chain variance, i.e. the mean of the $s_j^2$:

$$w = \frac{1}{m} \sum_j s_j^2$$

Note that we expect the $s_j^2$ to all be equal asymptotically as $n \to \infty$, since by then the chains have reached stationarity.
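As a rough illustration of the two formulas above, here is a minimal NumPy sketch. The `chains` array is made-up toy data (3 chains of 4 draws each), and the layout of one chain per row is an assumption of this example, not something the test requires.

```python
import numpy as np

# Hypothetical toy data: m = 3 chains (rows), each of length n = 4 (columns).
chains = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 0.0, 3.0, 3.0],
])

# Sample variance of each chain, with the 1/(n-1) normalization (ddof=1).
s2 = chains.var(axis=1, ddof=1)

# w is the mean of the per-chain sample variances.
w = s2.mean()

print("per-chain variances:", s2)
print("w:", w)
```

Here the three chains have quite different variances, which is what we would expect from chains this short.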

Let $\mu$ be the mean of the chain means:

$$\mu = \frac{1}{m} \sum_j \mu_{\theta_j}$$

The between-chain variance can then be written as:

$$B = \frac{n}{m-1}\sum_j (\mu_{\theta_j} - \mu)^2$$

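Basically, we build a dataset of m means (one from each chain) and measure the variance in that dataset. (We scale by n because the mean of n draws varies less than the draws themselves: for independent samples the variance of a chain mean is roughly $1/n$ times the variance within the chain, so multiplying by n puts $B$ on the same scale as $w$.)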
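The between-chain variance can be sketched the same way. The toy `chains` array below is hypothetical illustration data with one chain per row, and the code follows the formula for $B$ term by term.

```python
import numpy as np

# Hypothetical toy data: m = 3 chains (rows), each of length n = 4 (columns).
chains = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 0.0, 3.0, 3.0],
])
m, n = chains.shape

# One mean per chain, then the mean of those chain means.
chain_means = chains.mean(axis=1)
mu = chain_means.mean()

# Variance of the dataset of m means, scaled by n.
B = n / (m - 1) * np.sum((chain_means - mu) ** 2)

print("chain means:", chain_means)
print("B:", B)
```

Equivalently, `B` is `n * chain_means.var(ddof=1)`, which makes the "variance of a dataset of m means" reading explicit.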

We use the weighted average of the within and between chain variances to estimate the variance of the stationary distribution (and Gelman and Rubin do math to show this is a good estimator):

$$\hat{Var}(\theta) = (1 - \frac{1}{n})w + \frac{1}{n} B$$

Picking this estimator apart: for large n, we approach stationarity and the average within-chain variance is a fine estimate of the stationary distribution's variance. For small n, we want to look at the variance among the chains' means, since the chains should all be in widely spaced portions of the target distribution (and probably too widely spaced if they haven't had time to creep towards higher-probability regions). As a result, $\hat{Var}(\theta)$ overestimates the variance for small n, but is unbiased as $n \to \infty$.
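To make the weighting concrete, the following sketch evaluates the estimator on a small, made-up `chains` array (one chain per row, an assumption of this example). It is only a worked instance of the formula above, not a reference implementation.

```python
import numpy as np

# Hypothetical toy data: m = 3 chains (rows), each of length n = 4 (columns).
chains = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [0.0, 0.0, 3.0, 3.0],
])
m, n = chains.shape

# Average within-chain variance.
w = chains.var(axis=1, ddof=1).mean()

# Between-chain variance: variance of the chain means, scaled by n.
B = n * chains.mean(axis=1).var(ddof=1)

# Weighted average of the two; the weight on B shrinks as n grows.
var_hat = (1 - 1 / n) * w + B / n

print("w:", w, "B:", B, "var_hat:", var_hat)
```

With these made-up numbers the chain means disagree noticeably, so the between-chain term pushes `var_hat` above `w`.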

Let's define $\hat{R}$ as the square root of the ratio of the estimated variance of the stationary distribution to the actually observed within-chain variance:

$$\hat{R} = \sqrt{\frac{\hat{Var}(\theta)}{w}}$$

Stationarity would imply that this value is 1: the within-chain variance matches the estimated variance of the stationary distribution. If the chains aren't stationary yet, $\hat{Var}(\theta)$ overestimates and $\hat{R}$ is above 1. Values of $\hat{R}$ below 1.1 or 1.2 are consistent with convergence, and values above are a sign of trouble.
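Putting the pieces together, here is a minimal sketch of the statistic as defined in this section. The function name `gelman_rubin` and the simulated chains are assumptions made for illustration. The "chains" are just independent normal draws rather than real MCMC output, and library implementations may differ in detail from this plain formula.

```python
import numpy as np

def gelman_rubin(chains):
    """R-hat for an (m, n) array: m chains (rows) of n draws each."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    w = chains.var(axis=1, ddof=1).mean()      # average within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    var_hat = (1 - 1 / n) * w + B / n          # pooled variance estimate
    return np.sqrt(var_hat / w)

rng = np.random.default_rng(0)

# Four simulated chains that all sample the same N(0, 1) target.
mixed = rng.normal(0.0, 1.0, size=(4, 1000))

# The same draws shifted so each chain sits around a different location,
# mimicking chains that have not forgotten their starting points.
offsets = np.array([[-3.0], [-1.0], [1.0], [3.0]])
stuck = mixed + offsets

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
print("R-hat, well-mixed chains:", r_mixed)
print("R-hat, stuck chains:", r_stuck)
```

The well-mixed chains should give a value close to 1, while the shifted chains should land far above the 1.1 to 1.2 range.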

#### Code
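This test is available in pymc3 as `pymc3.diagnostics.gelman_rubin`. In more recent releases the diagnostic is provided through ArviZ as `arviz.rhat`, which computes an updated version of the statistic.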