## Introduction to Monte Carlo Methods

### References

[Beginning Bayesian Statistics](https://pubs.er.usgs.gov/publication/70204463)

[Hamiltonian Monte Carlo in Python](https://colindcarroll.com/2019/04/11/hamiltonian-monte-carlo-from-scratch/)

[Betancourt HMC - Best introduction to HMC](https://www.youtube.com/watch?v=VnNdhsm0rJQ)

[NUTS paper](http://arxiv.org/abs/1111.4246)

[HMC Tuning by Colin Carroll](https://colcarroll.github.io/hmc_tuning_talk/)


### Building blocks

#### Proposal distribution
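An easy-to-sample distribution $q$, such as a Gaussian, from which the next candidate value $x_{i+1}$ is drawn given the current value $x_{i}$:

$q(x_{i+1} \mid x_{i}) = N(\mu, \sigma)$

In the Metropolis example later in these notes, the mean $\mu$ is set to the current value $x_{i}$.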
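As a minimal sketch of drawing from such a proposal, the snippet below samples candidates from a normal distribution centred on the current value. The function name `propose`, the default `sigma=1.0`, and the starting value 7.5 are illustrative assumptions, not part of the notes.

```python
import random

def propose(current, sigma=1.0, rng=random):
    """Draw a candidate from a normal distribution centred on `current`."""
    return rng.gauss(current, sigma)

rng = random.Random(0)  # seeded generator so the draws are repeatable
candidates = [propose(7.5, sigma=1.0, rng=rng) for _ in range(5)]
```

A larger `sigma` proposes bolder jumps away from the current value, while a smaller one keeps candidates close to it.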

#### Foundation of Bayesian Inference

1. Obtain the data and inspect it for a high-level understanding of the distribution of the data and the outliers
2. Define a reasonable prior for the data based on (1) and your understanding of the problem
3. Choose a likelihood (a distribution for the data given the parameters) and evaluate the likelihood of the observed data under it
4. Obtain the posterior distribution from (2) and (3) by applying Bayes' theorem
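The four steps can be sketched with a simple grid approximation. Everything specific here is an assumption made for illustration: the observed counts, the Gamma prior parameters `alpha=2.0` and `beta=0.25`, the Poisson likelihood (which anticipates the example used later in these notes), and the grid of candidate values.

```python
import math

data = [6, 9, 7, 8, 10]  # step 1: hypothetical observed event counts

def log_gamma_prior(mu, alpha=2.0, beta=0.25):
    """Step 2: log density of a Gamma(alpha, beta) prior (rate form)."""
    return (alpha * math.log(beta) + (alpha - 1) * math.log(mu)
            - beta * mu - math.lgamma(alpha))

def log_poisson_likelihood(mu, counts):
    """Step 3: log likelihood of the counts under a Poisson with rate mu."""
    return sum(-mu + x * math.log(mu) - math.lgamma(x + 1) for x in counts)

# Step 4: Bayes' theorem on a grid of candidate values in (0, 30].
grid = [0.05 * i for i in range(1, 601)]
log_post = [log_gamma_prior(m) + log_poisson_likelihood(m, data) for m in grid]
peak = max(log_post)                       # subtract the peak for stability
weights = [math.exp(lp - peak) for lp in log_post]
total = sum(weights)                       # plays the role of the denominator
posterior = [w / total for w in weights]   # normalised posterior weights
posterior_mean = sum(m * p for m, p in zip(grid, posterior))
```

A grid like this only works for one or two parameters, which is one motivation for the sampling methods described in the rest of these notes.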

### The Metropolis Algorithm

We start by modeling a count of discrete events with the Poisson distribution shown below.

$f(x) = e^{-\mu} \mu^x / x!$

The mean rate is represented by $\mu$, and $x$ is a non-negative integer that represents the number of events observed. Recall from the discussion of the binomial distribution that it models the probability of a number of successes out of $n$ trials. The Poisson distribution is a limiting case of the binomial distribution, used when the number of trials is large and the probability of success in each trial is small.

If our observed data has a Poisson likelihood distribution, using a Gamma prior for $\mu$ results in a Gamma posterior distribution. 
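Because the Gamma prior is conjugate to the Poisson likelihood, the posterior parameters follow from a simple update: with the rate parameterisation, $\alpha$ grows by the sum of the counts and $\beta$ grows by the number of observations. The sketch below assumes hypothetical prior parameters and counts; the function name is illustrative.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Return the Gamma posterior parameters after observing Poisson counts."""
    return alpha + sum(counts), beta + len(counts)

# Hypothetical prior Gamma(2, 0.25) and hypothetical observed counts.
alpha_post, beta_post = gamma_poisson_update(2.0, 0.25, [6, 9, 7, 8, 10])
posterior_mean = alpha_post / beta_post  # the mean of Gamma(alpha, beta) is alpha / beta
```

Having this closed form makes the Gamma-Poisson model a convenient test case: a sampler's output can be checked against the exact posterior.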

#### Outline of the Metropolis algorithm
*What do we want to compute?*

An estimate of the posterior distribution of a parameter $\theta$

*What do we have available?*

Observed data

*How do we do it?*

1. Start with an initial parameter value (a); any plausible value will do
2. Draw a second parameter value (b) from a proposal distribution
3. Compute the likelihood of the data under each of the two parameter values
4. Compute the prior probability density of each of the two parameter values
5. Compute the unnormalized posterior probability density of each parameter value by multiplying its likelihood from (3) by its prior density from (4)
6. Select (a) or (b) using an acceptance rule based on the two posterior densities, and save the selected value as the new (a)
7. Repeat steps (2) to (6) until a large number of values have been saved (often around 5000, but this really depends on the problem)
8. Estimate the distribution of the parameter $\theta$ by plotting a histogram of the values of (a) saved in step (6)
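The outline above can be sketched as a short sampler. This is an illustration under stated assumptions, not a reference implementation: it assumes a Gamma prior with hypothetical parameters `alpha=2.0`, `beta=0.25`, a Poisson likelihood, hypothetical counts, and the standard Metropolis rule of accepting (b) with probability $\min(1, \text{posterior}(b) / \text{posterior}(a))$. It works with log densities to avoid numerical underflow.

```python
import math
import random

def log_posterior(mu, counts, alpha=2.0, beta=0.25):
    """Unnormalised log posterior: log Gamma prior + log Poisson likelihood."""
    if mu <= 0:
        return -math.inf  # the rate must be positive
    log_prior = (alpha - 1) * math.log(mu) - beta * mu
    log_lik = sum(x * math.log(mu) - mu for x in counts)
    return log_prior + log_lik

def metropolis(counts, start=7.5, sigma=1.0, n_draws=5000, seed=0):
    rng = random.Random(seed)
    current = start                                   # step 1: value (a)
    current_lp = log_posterior(current, counts)
    samples = []
    for _ in range(n_draws):                          # step 7: repeat
        proposal = rng.gauss(current, sigma)          # step 2: value (b)
        proposal_lp = log_posterior(proposal, counts) # steps 3 to 5
        # Step 6: accept (b) with probability min(1, post(b) / post(a)).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)                       # save the new (a)
    return samples

samples = metropolis([6, 9, 7, 8, 10])  # hypothetical observed counts
```

A histogram of `samples` then gives the estimate described in step (8).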

#### The details

1. Propose a single plausible value for our parameter $\theta$, which in this example is the Poisson rate $\mu$. This is (a) from the previous section, also called the current value. Let us assume that this is 7.5 for now.

2. Compute the prior probability density of getting 7.5. We stated earlier in our example that we have a Gamma prior distribution. 

$\text{Gamma}(x = 7.5; \alpha, \beta) = \beta^{\alpha} x^{\alpha - 1} e^{-\beta x} / \Gamma(\alpha) = \beta^{\alpha} \, 7.5^{\alpha - 1} e^{-7.5 \beta} / \Gamma(\alpha)$

3. Compute the likelihood of the data given the parameter value of 7.5. The likelihood in our example is a Poisson distribution; for several independent observations, multiply the values obtained for each observed $x$.

$\text{Poisson}(x; \mu = 7.5) = e^{-\mu} \mu^x / x! = e^{-7.5} \, 7.5^x / x!$

4. Compute the posterior density from (2) and (3). We skip the denominator of Bayes' theorem here, since it is a constant and we are only going to compare posterior densities with each other.

Posterior density $\propto$ prior $\times$ likelihood

5. Propose a second value for $\mu$ which is drawn from a distribution called a proposal distribution. For the Metropolis algorithm, it has to be a symmetrical distribution. We will use a normal distribution for this example and set the mean of this proposal distribution to be the current value of $\mu$. The standard deviation is a hyperparameter called the tuning parameter. Let us assume that we draw a value of 8.5.
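To make the comparison between the current value 7.5 and the proposed value 8.5 concrete, the sketch below computes the unnormalized posterior density of both and applies the standard Metropolis acceptance rule: accept the proposal with probability $\min(1, r)$, where $r$ is the ratio of the proposed to the current posterior density. The observed counts and the Gamma parameters are hypothetical, chosen only for illustration.

```python
import math

counts = [6, 9, 7, 8, 10]   # hypothetical observed event counts
alpha, beta = 2.0, 0.25     # hypothetical Gamma prior parameters

def log_prior(mu):
    """Log Gamma(alpha, beta) density at mu (rate parameterisation)."""
    return (alpha * math.log(beta) + (alpha - 1) * math.log(mu)
            - beta * mu - math.lgamma(alpha))

def log_likelihood(mu):
    """Log Poisson likelihood of all counts at rate mu."""
    return sum(-mu + x * math.log(mu) - math.lgamma(x + 1) for x in counts)

def log_posterior(mu):
    """Log of prior * likelihood; the constant denominator is skipped."""
    return log_prior(mu) + log_likelihood(mu)

current, proposed = 7.5, 8.5
# Ratio of posterior densities, computed on the log scale for stability.
ratio = math.exp(log_posterior(proposed) - log_posterior(current))
accept_probability = min(1.0, ratio)
```

With these assumed numbers the proposed value has a slightly lower posterior density than the current one, so it is accepted with a probability below 1 rather than always. A proposal with a higher posterior density would always be accepted.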





#### Traceplot 

A traceplot shows the sequence of accepted values plotted against the draw number. If a proposed value is rejected, the current value is repeated. A long flat stretch therefore indicates that many proposed values in a row are being rejected, which is a sign that something is askew with the distribution or the sampling process, for example a proposal standard deviation that is too large.
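Flat stretches can also be checked numerically. The two helpers below are hypothetical names introduced for illustration; they compute the fraction of draws where the chain moved and the length of the longest run of repeated values in a trace. The example trace is made up.

```python
def acceptance_rate(trace):
    """Fraction of draws where the chain moved to a new value."""
    moves = sum(1 for prev, cur in zip(trace, trace[1:]) if cur != prev)
    return moves / (len(trace) - 1)

def longest_flat_run(trace):
    """Length of the longest stretch of identical consecutive values."""
    longest = run = 1
    for prev, cur in zip(trace, trace[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest

trace = [7.5, 7.5, 8.1, 8.1, 8.1, 7.9, 8.4]  # made-up trace of saved values
rate = acceptance_rate(trace)
flat = longest_flat_run(trace)
```

A very low acceptance rate or a very long flat run suggests revisiting the proposal distribution before trusting the samples.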


#### Building the Inferred Distribution

Collect the current value saved at each step and build a frequency distribution (histogram) from these values.

#### Representing the Inferred Distribution

* Compute the mean values of the saved parameters
* Compute the standard deviation and variance of the saved parameters
* Compute the minimum and maximum values of the saved parameters
* Compute the quantiles of the saved parameters
* If the inferred distribution is known to have a certain form, express it as the parameters of that canonical distribution.
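These summaries can be sketched with Python's standard `statistics` module. The function name `summarize`, the choice of the 2.5%, 50%, and 97.5% quantiles, and the example draws are assumptions made for illustration.

```python
import statistics

def summarize(samples):
    """Summary statistics of the saved parameter values."""
    # n=40 yields cut points every 2.5%: index 0 is 2.5%, 19 is 50%, 38 is 97.5%.
    cuts = statistics.quantiles(samples, n=40, method="inclusive")
    return {
        "mean": statistics.mean(samples),
        "sd": statistics.stdev(samples),
        "variance": statistics.variance(samples),
        "min": min(samples),
        "max": max(samples),
        "q2.5": cuts[0],
        "median": cuts[19],
        "q97.5": cuts[38],
    }

summary = summarize([7.9, 8.4, 8.4, 7.1, 9.0, 8.2])  # made-up saved draws
```

The interval between the 2.5% and 97.5% quantiles is a common way to report where most of the inferred distribution lies.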


#### Notes about the Metropolis algorithm

* The proposal distribution has to be symmetric; this condition is relaxed in the Metropolis-Hastings algorithm. A normal distribution is commonly used.

* The choice of a prior distribution influences the performance of this algorithm.

* Tuning: the standard deviation of the proposal distribution is a hyperparameter, referred to as the tuning parameter. It needs to be tuned so that the acceptance rate is close to a chosen target value.
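One simple way to tune the proposal standard deviation is to run short pilot chains and nudge it up when too many proposals are accepted and down when too few are. The sketch below is only one possible scheme. The target rate of 0.44 is an assumption (a figure often quoted for one-dimensional random-walk proposals), and the log target, which matches a Gamma-Poisson posterior with hypothetical numbers, is made up for illustration.

```python
import math
import random

def log_target(mu):
    """Hypothetical unnormalised log posterior for a Poisson rate."""
    return 41 * math.log(mu) - 5.25 * mu if mu > 0 else -math.inf

def pilot_acceptance(log_target, start, sigma, n_draws, rng):
    """Run a short Metropolis chain and return its acceptance rate."""
    current, current_lp = start, log_target(start)
    accepted = 0
    for _ in range(n_draws):
        proposal = rng.gauss(current, sigma)
        lp = log_target(proposal)
        if rng.random() < math.exp(min(0.0, lp - current_lp)):
            current, current_lp = proposal, lp
            accepted += 1
    return accepted / n_draws

def tune_sigma(log_target, start, sigma=1.0, target=0.44,
               rounds=30, n_draws=500, seed=0):
    """Scale sigma up when acceptance is above target, down when below."""
    rng = random.Random(seed)
    for _ in range(rounds):
        rate = pilot_acceptance(log_target, start, sigma, n_draws, rng)
        sigma *= math.exp(rate - target)
    return sigma

tuned_sigma = tune_sigma(log_target, start=7.5, sigma=0.05)
```

Draws produced while tuning are normally discarded, and the tuned value is then held fixed for the actual sampling run.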


### Hamiltonian Monte Carlo (also called Hybrid Monte Carlo)

HMC is based on the solution of differential equations known as Hamilton's equations. These differential equations depend on the probability distribution we are trying to learn. We navigate this distribution by moving along trajectories, where each step is defined by a position and the momentum at that position. Computing these trajectories can be very expensive, and the goal is to keep this computational cost low.

HMC is based on the notion of conservation of energy. When the sampler is far from the center of the probability mass, it has high potential energy but low kinetic energy; when it is closer to the center of the probability mass, it has high kinetic energy but low potential energy.
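The trade-off between potential and kinetic energy can be illustrated with the leapfrog integrator, which is commonly used to follow Hamilton's equations in HMC. The sketch below assumes a one-dimensional standard normal target, so the potential energy is $q^2 / 2$ and the kinetic energy is $p^2 / 2$. The starting point, step size, and number of steps are arbitrary choices for illustration, and this covers only the trajectory, not a full HMC sampler.

```python
def leapfrog(q, p, grad_potential, step_size, n_steps):
    """Follow Hamilton's equations for n_steps leapfrog steps."""
    p = p - 0.5 * step_size * grad_potential(q)   # opening half step for momentum
    for _ in range(n_steps - 1):
        q = q + step_size * p                     # full step for position
        p = p - step_size * grad_potential(q)     # full step for momentum
    q = q + step_size * p                         # last full step for position
    p = p - 0.5 * step_size * grad_potential(q)   # closing half step for momentum
    return q, p

def potential(q):
    return 0.5 * q * q   # negative log density of a standard normal, up to a constant

def grad_potential(q):
    return q             # derivative of the potential

def kinetic(p):
    return 0.5 * p * p

# Start away from the centre with no momentum: high potential, zero kinetic energy.
q0, p0 = 2.0, 0.0
q1, p1 = leapfrog(q0, p0, grad_potential, step_size=0.01, n_steps=157)
energy_start = potential(q0) + kinetic(p0)
energy_end = potential(q1) + kinetic(p1)
```

At the end of this trajectory the position is close to the centre of the distribution, so most of the potential energy has been converted into kinetic energy while the total stays nearly constant.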