## Autocorrelation and Effective Sample Size

## TL;DR:
There's a formula for how many independent samples our set of MCMC samples is worth. It sums the autocorrelation over all possible lags, so more autocorrelation means our samples are less informative.

#### Correlation
When we draw from a Markov chain, our draws are correlated to a greater or lesser degree.

If our next draw were completely independent of our current draw, we would have iid draws from the stationary distribution and the usual statistical theory would apply. The N we get to use in the central limit theorem and the law of large numbers would be the number of samples we drew.

However, in MCMC our draws are not independent. If we know that the chain is currently at state x, we can predict, better than random, where the next draw will be (it'll probably be close to x). In the iid case we'd have no ability to predict where the new draw will be. So in an information-theoretic sense, an iid draw is harder to predict, therefore more surprising, therefore more informative than an MCMC draw.

#### Effective Sample Size
We can operationalize "the MCMC draws are less informative" by measuring the correlation among successive draws. With iid samples the correlation is zero; with MCMC samples it is typically positive, especially if the chain has certain flaws.
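As a minimal sketch of that measurement, the snippet below compares the lag-1 sample autocorrelation of iid normal draws with that of a toy correlated chain. The AR(1) process stands in for an MCMC chain purely for illustration; `autocorr`, `ar1_chain`, and the value `phi=0.9` are assumptions of this example, not part of any sampler.

```python
import random

def autocorr(xs, lag):
    """Sample autocorrelation of the sequence xs at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    var = sum(c * c for c in centered) / n
    cov = sum(a * b for a, b in zip(centered, centered[lag:])) / n
    return cov / var

def ar1_chain(n, phi, rng):
    """Toy stand-in for an MCMC chain: x_t = phi * x_{t-1} + noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

rng = random.Random(0)  # fixed seed so the run is repeatable
iid_draws = [rng.gauss(0.0, 1.0) for _ in range(5000)]
chain_draws = ar1_chain(5000, phi=0.9, rng=rng)

rho_iid = autocorr(iid_draws, 1)
rho_chain = autocorr(chain_draws, 1)
print(f"lag-1 autocorrelation, iid draws:   {rho_iid:.3f}")
print(f"lag-1 autocorrelation, AR(1) chain: {rho_chain:.3f}")
```

The iid value should sit near zero, while the chain's value should sit near `phi`.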

Effective sample size distills "less informative" down to "as informative as N iid samples". The exact derivation requires time series theory, which we do not have the time to go into here. Instead we shall just state the result:

$$n_{eff} = \frac{n_{MCMC}}{1 + 2 \sum_{\Delta t = 1}^{\infty}\rho_{\Delta t}}$$

$n_{MCMC}$ is the raw number of MCMC samples, and $\rho_{\Delta t}$ is the autocorrelation at lag $\Delta t$. So overall, the formula sums the autocorrelation over all positive lags and reduces the sample size accordingly. If the autocorrelation decays to 0 relatively quickly, the ratio $n_{eff} / n_{MCMC}$ stays close to 1.
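The formula can be turned into a rough estimator. In practice the infinite sum has to be cut off somewhere, because sample autocorrelations at large lags are mostly noise. The sketch below assumes one simple rule, stopping at the first non-positive sample autocorrelation; that rule, the `max_lag` cap, and the AR(1) test chain are choices made for this example, not the estimator used by any particular library.

```python
import random

def effective_sample_size(xs, max_lag=500):
    """Estimate n_eff = n / (1 + 2 * sum_k rho_k).

    The sum is truncated at the first non-positive sample
    autocorrelation (an assumed, deliberately simple rule).
    """
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    var = sum(c * c for c in centered) / n
    total = 0.0
    for lag in range(1, min(max_lag, n - 1) + 1):
        cov = sum(a * b for a, b in zip(centered, centered[lag:])) / n
        rho = cov / var
        if rho <= 0.0:
            break
        total += rho
    return n / (1.0 + 2.0 * total)

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 10_000

# iid draws: the estimate should be close to n.
iid_draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
ess_iid = effective_sample_size(iid_draws)

# AR(1) chain with phi = 0.9 as a toy stand-in for MCMC output.
phi, x, chain = 0.9, 0.0, []
for _ in range(n):
    x = phi * x + rng.gauss(0.0, 1.0)
    chain.append(x)
ess_chain = effective_sample_size(chain)

print(f"ESS of {n} iid draws:   {ess_iid:.0f}")
print(f"ESS of {n} AR(1) draws: {ess_chain:.0f}")
```

For an AR(1) process the autocorrelation is $\rho_{\Delta t} = \phi^{\Delta t}$, so the sum is $\phi / (1 - \phi)$ and the formula gives $n_{eff} = n_{MCMC}(1 - \phi)/(1 + \phi)$. With $\phi = 0.9$ that is $n_{MCMC}/19$, which gives a reference value to compare the noisy estimate against.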

The above formula considers only autocorrelations, one sampled variable at a time. Correlations between sampled variables are ignored (for instance, correlations among the lab-level means in a hierarchical model). Some modern extensions define an ESS measure that factors in the entire correlation matrix at each lag, instead of just (one element of) the diagonal.

#### Thinning
Thinning reduces autocorrelation by keeping only every k-th sample from the chain, for instance every 100th. In terms of effective sample size, this is inefficient. The ESS of the whole chain will generally be greater than the ESS of the thinned chain (the thinned chain has lower autocorrelation, but far fewer samples). Thinning is motivated by 1) saving the memory needed to store the samples and 2) making the samples look more like iid samples so that classical results apply directly. If samples are hard to get, you can spare the memory, and you can live with a more complicated analysis, consider not thinning your chain.
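To make the trade-off concrete, here is a small worked calculation under a strong simplifying assumption: the chain behaves like an AR(1) process, so its autocorrelation at lag $\Delta t$ is $\phi^{\Delta t}$ and the ESS formula reduces to $n(1 - \phi)/(1 + \phi)$. Keeping every k-th sample leaves `n // k` draws whose lag-1 autocorrelation is $\phi^k$. The function name and the numbers `n = 10_000`, `phi = 0.9` are illustrative choices.

```python
def ar1_ess_after_thinning(n, phi, k):
    """Theoretical ESS of an AR(1) chain of length n after keeping
    every k-th sample (k = 1 means no thinning).

    Assumes autocorrelation phi**lag, so the thinned chain is again
    AR(1) with coefficient phi**k.
    """
    n_kept = n // k
    phi_thinned = phi ** k
    return n_kept * (1.0 - phi_thinned) / (1.0 + phi_thinned)

n, phi = 10_000, 0.9
thinning_factors = [1, 2, 5, 10, 100]
ess_values = [ar1_ess_after_thinning(n, phi, k) for k in thinning_factors]

for k, ess in zip(thinning_factors, ess_values):
    print(f"keep every {k:>3}th sample: {n // k:>6} draws kept, ESS = {ess:7.1f}")
```

Under this assumption the ESS can only go down as k grows: mild thinning costs little, while aggressive thinning leaves nearly independent draws but far fewer of them.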


#### Causes of Autocorrelation
Large autocorrelation can arise from a) strong correlations among parameters (which can be measured), b) step sizes that are too small to explore the distribution well, which leads to poor mixing, or c) unidentifiability in the model, where two parameters carry the same information.
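The step-size effect can be seen in a toy experiment. The sketch below runs a random-walk Metropolis sampler on a standard normal target with two proposal scales and compares the lag-1 autocorrelation of the resulting chains. The target, the two step sizes (0.1 and 2.4), and the helper names are all assumptions of this example.

```python
import math
import random

def metropolis_normal(n, step, rng):
    """Random-walk Metropolis on a standard normal target.

    Proposals are x + Normal(0, step); log density is -x^2 / 2.
    """
    x, out = 0.0, []
    for _ in range(n):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = 0.5 * (x * x - proposal * proposal)
        # Accept with probability min(1, exp(log_ratio)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        out.append(x)
    return out

def lag1_autocorr(xs):
    """Sample autocorrelation of xs at lag 1."""
    n = len(xs)
    mean = sum(xs) / n
    centered = [x - mean for x in xs]
    var = sum(c * c for c in centered) / n
    cov = sum(a * b for a, b in zip(centered, centered[1:])) / n
    return cov / var

rng = random.Random(2)  # fixed seed so the run is repeatable
small_step_chain = metropolis_normal(5000, step=0.1, rng=rng)
larger_step_chain = metropolis_normal(5000, step=2.4, rng=rng)

rho_small = lag1_autocorr(small_step_chain)
rho_larger = lag1_autocorr(larger_step_chain)
print(f"lag-1 autocorrelation, step 0.1: {rho_small:.3f}")
print(f"lag-1 autocorrelation, step 2.4: {rho_larger:.3f}")
```

With the tiny step nearly every proposal is accepted, but each move is so short that consecutive draws are almost identical, so the autocorrelation stays close to 1.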

The general observation is that problems in sampling often cause strong autocorrelation, but autocorrelation by itself does not mean that our sampling is wrong. It is still something we should always investigate, and we should only use our samples once we are convinced that the autocorrelation is benign.