
# Notes

## Bayes' theorem
Bayesian data analysis produces full probability distributions as predictions, whereas frequentist methods typically produce point estimates. A distribution expresses the uncertainty of a prediction far better than a single point estimate can. All Bayesian methods rest on one fundamental result of probability theory, Bayes' theorem:<br>
$$ 
P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)}
$$
Here the left-hand side is the posterior, and the right-hand side is the likelihood times the prior, divided by the marginal likelihood (a normalisation term). Bayes' rule therefore combines the likelihood of observing B given A with the prior information about A to form the posterior, which is the probability of A given B (the data). In the likelihood the data, in this case B, is held fixed, so the likelihood is a function of A: it tells us how probable the observed B is under each value of A. The prior can be non-informative (for example uniform), weakly informative (some knowledge such as boundaries or previous observations, without certainty about how well it applies to the situation at hand) or informative. The posterior can often be seen as a compromise between the data and the prior knowledge about the parameter under investigation.
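
As a small numeric illustration of the formula, the sketch below applies Bayes' theorem to a binary event A. The function name `posterior` and all the numbers (a 1% prior, a 95% chance of B when A holds, a 5% chance of B when it does not) are made up for the example and do not come from any data set.

```python
def posterior(prior_a, lik_b_given_a, lik_b_given_not_a):
    """Return P(A | B) for a binary event A using Bayes' theorem."""
    # Marginal likelihood P(B): sum over the two cases A and not-A.
    evidence = lik_b_given_a * prior_a + lik_b_given_not_a * (1.0 - prior_a)
    # Likelihood times prior, divided by the normalisation term.
    return lik_b_given_a * prior_a / evidence


# Hypothetical numbers: P(A) = 0.01, P(B | A) = 0.95, P(B | not A) = 0.05.
p_a_given_b = posterior(prior_a=0.01, lik_b_given_a=0.95, lik_b_given_not_a=0.05)
print(round(p_a_given_b, 3))  # 0.0095 / 0.059, roughly 0.161
```

Even though B is much more likely under A than under not-A, the small prior keeps the posterior well below one half, which shows the compromise between prior and data.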

In Bayesian data analysis the equation is usually written as:<br>
$$ 
p(\theta \mid y) = \frac{p(y\mid \theta)p(\theta)}{p(y)}
$$
where theta is the parameter we want to learn about and y is the observed data.<br>

Bayesian computation is largely about expectations, since we are often interested in the expectation of some function f of theta given the data:
$$
E_{p(\theta\mid y)}[f(\theta)] = \int f(\theta)p(\theta\mid y)d\theta
$$
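
For a one-dimensional theta the integral can be approximated on a grid. The sketch below assumes a binomial likelihood with a uniform prior on theta and invented data (7 successes in 10 trials); the function name `grid_posterior_mean` and the grid size are illustrative choices. Under these assumptions the exact posterior is Beta(8, 4) with mean 8/12, so the result can be checked by hand.

```python
def grid_posterior_mean(y, n, f=lambda theta: theta, grid_size=1000):
    """Approximate E[f(theta) | y] for a binomial likelihood and a uniform prior."""
    # Midpoints of equally wide cells covering the interval (0, 1).
    grid = [(i + 0.5) / grid_size for i in range(grid_size)]
    # Unnormalised posterior: likelihood times a flat prior (constant 1).
    unnorm = [theta**y * (1.0 - theta) ** (n - y) for theta in grid]
    # Dividing by the sum plays the role of the normalisation term p(y).
    total = sum(unnorm)
    weights = [u / total for u in unnorm]
    # Weighted sum approximates the integral of f(theta) p(theta | y).
    return sum(f(theta) * w for theta, w in zip(grid, weights))


# Hypothetical data: 7 successes in 10 trials.
mean_theta = grid_posterior_mean(y=7, n=10)
print(mean_theta)  # should be close to the exact posterior mean 8/12
```

Grid approximation only scales to a handful of parameters, which is one reason simulation-based methods are used for larger models.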

## Markov Chain Monte Carlo (MCMC)
Posterior distributions are often hard to compute exactly, so one can use Markov chain Monte Carlo methods to approximate them. The hard part is the normalisation term p(y), which requires integrating the likelihood times the prior over all possible values of theta:
$$
\int p(y\mid\theta)p(\theta)d\theta
$$

With Monte Carlo methods one can simulate draws from the target distribution (often the posterior), and these draws can be treated like observations. The "Markov chain" in Markov chain Monte Carlo (MCMC) refers to the fact that in a Markov chain the probability of each state depends only on the previous state. The draws in MCMC are therefore dependent. In MCMC the transition rule of the chain is constructed so that the chain spends most of its time where most of the posterior mass is. The key is that the approximate distributions improve at each step of the simulation, converging towards the target distribution. This is also why we usually discard the initial steps of the chain, the so-called warm-up.<br>
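
To make the idea concrete, here is a minimal random-walk Metropolis sketch (a special case of the Metropolis-Hastings algorithm). It assumes a binomial likelihood with a uniform prior and invented data (7 successes in 10 trials); the function names, the step size and the seed are illustrative choices, not tuned recommendations. Note that only the unnormalised posterior is needed, because p(y) cancels in the acceptance ratio.

```python
import math
import random


def log_unnormalised_posterior(theta, y=7, n=10):
    """Log of likelihood times a flat prior; the constant p(y) is left out."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior mass outside (0, 1)
    return y * math.log(theta) + (n - y) * math.log(1.0 - theta)


def metropolis(n_draws, warmup, start, step=0.2, seed=0):
    """Random-walk Metropolis sampler; returns the draws after warm-up."""
    rng = random.Random(seed)
    theta = start
    log_p = log_unnormalised_posterior(theta)
    draws = []
    for i in range(warmup + n_draws):
        # Propose a new state that depends only on the current state.
        proposal = theta + rng.gauss(0.0, step)
        log_p_new = log_unnormalised_posterior(proposal)
        # Accept with probability min(1, p(proposal) / p(theta)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            theta, log_p = proposal, log_p_new
        if i >= warmup:
            draws.append(theta)  # a rejected proposal repeats the old state
    return draws


draws = metropolis(n_draws=20000, warmup=1000, start=0.1)
posterior_mean = sum(draws) / len(draws)
print(posterior_mean)  # should be near the exact posterior mean 8/12
```

The draws are dependent, so 20000 draws carry less information than 20000 independent samples would.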

When using MCMC one should:
1. Use several chains to make convergence diagnostics easier (chains should mix).
2. Start chains from different starting points.
3. Use R_hat as a convergence diagnostic (valid if the target distribution has finite variance)<br>
    a) It compares the within-chain and between-chain variances<br>
    b) It should be close to 1 (R_hat < 1.01 is usually considered ok).
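
The comparison of within-chain and between-chain variances can be sketched as follows. This is the basic, non-split form of R_hat for chains of equal length; current software typically uses refined split and rank-normalised variants, so treat this as an illustration only. The simulated "chains" are independent normal draws generated purely for the demonstration.

```python
import math
import random
import statistics


def r_hat(chains):
    """Basic (non-split) potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average of the within-chain sample variances.
    w = statistics.fmean([statistics.variance(chain) for chain in chains])
    # B: between-chain variance, scaled by the chain length.
    b = n * statistics.variance(chain_means)
    # A weighted mix of W and B estimates the marginal posterior variance.
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)


rng = random.Random(1)
# Four chains exploring the same distribution: R_hat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Two chains stuck around 0 and two around 3: R_hat should be clearly above 1.
stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 3.0, 3.0)]
print(r_hat(mixed), r_hat(stuck))
```

Starting the chains from different points matters here: if all chains start in the same place and get stuck in the same region, the between-chain variance stays small and R_hat can look fine even though the chains have not converged.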

MCMC methods:
- Gibbs sampling
- Metropolis-Hastings algorithm
- Hamiltonian Monte Carlo

## Hierarchical model

## Poisson and Gamma distributions

### Poisson distribution



### Gamma distribution

$$
y \sim \mathrm{Gamma}(\alpha, \beta)
$$
$$
p(y\mid\alpha,\beta)=\frac{\beta^{\alpha}y^{\alpha-1}e^{-\beta y}}{\Gamma(\alpha)}
$$
$$
\Gamma(\alpha)=\int_{0}^{\infty}x^{\alpha-1}e^{-x}dx
$$
The Gamma distribution is quite useful in Bayesian methods because it can serve as a prior for the rate parameter lambda of the Poisson distribution. There are several reasons for this. First, the Gamma distribution has support only on positive values, and lambda is positive by definition. Second, the Gamma distribution is the conjugate prior of the Poisson likelihood, so the posterior is again a Gamma distribution, which makes the calculation easier. The Gamma distribution is also closely related to the exponential distribution: with alpha = 1 it reduces to an exponential distribution with rate beta.
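
Conjugacy means the posterior can be written down without any integration. With a Gamma(alpha, beta) prior on lambda and n independent Poisson counts, multiplying the prior by the likelihood gives a posterior proportional to lambda^(alpha + sum(y) - 1) * exp(-(beta + n) * lambda), which is a Gamma(alpha + sum(y), beta + n) density. The sketch below applies this update; the function name `gamma_poisson_update`, the prior parameters and the counts are all invented for the example.

```python
def gamma_poisson_update(alpha, beta, counts):
    """Return the posterior Gamma parameters after observing Poisson counts."""
    # Shape grows by the total count, rate grows by the number of observations.
    return alpha + sum(counts), beta + len(counts)


counts = [3, 5, 4, 6, 2]  # hypothetical observed counts
alpha_post, beta_post = gamma_poisson_update(alpha=2.0, beta=1.0, counts=counts)
# The mean of a Gamma(alpha, beta) distribution is alpha / beta.
posterior_mean = alpha_post / beta_post
print(alpha_post, beta_post, posterior_mean)
```

Here the prior mean is 2 and the sample mean is 4, and the posterior mean of 22/6 lies between them, closer to the data because five observations outweigh this fairly weak prior.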