# Bayesian Methods 

## Why is Bayesian inference controversial?

1. Bayesians treat the parameters as random, so a credible interval is a valid probability statement about the parameter. This interpretation is appealing, but the credible interval for the parameters of interest depends not only on the likelihood but also on the prior, which is usually hard to specify.

2. When the likelihood and prior are complicated, inference has to rely on MCMC sampling, which can be very slow in many real-world cases.

3. The biggest controversy about Bayesian inference is that you must quantify your prior knowledge about the question at hand. This makes it possible to influence your results through the choice of prior, either accidentally or on purpose.

This is a genuine concern, and any Bayesian analysis worth its salt must demonstrate that the chosen priors are not unduly influencing the final results.
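
One common way to make such a demonstration is a prior sensitivity check: refit the model under several plausible priors and compare the posteriors. The sketch below assumes a binomial likelihood with conjugate Beta priors; the data (7 successes in 20 trials) and the three priors are hypothetical and chosen only for illustration.

```python
# Hypothetical data: 7 successes in 20 Bernoulli trials (illustrative assumption).
successes, trials = 7, 20

# Candidate Beta(a, b) priors to compare; these choices are illustrative only.
priors = {
    "uniform Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "sceptical Beta(2, 8)": (2.0, 8.0),
}


def posterior_mean(a, b, successes, trials):
    """Posterior mean of theta for a Beta(a, b) prior and a binomial likelihood."""
    # Conjugacy: the posterior is Beta(a + successes, b + trials - successes).
    return (a + successes) / (a + b + trials)


posterior_means = {
    name: posterior_mean(a, b, successes, trials) for name, (a, b) in priors.items()
}

for name, mean in posterior_means.items():
    print(f"{name}: posterior mean = {mean:.3f}")

# A small spread across priors suggests the data, not the prior, drive the result.
spread = max(posterior_means.values()) - min(posterior_means.values())
print(f"spread across priors = {spread:.3f}")
```

If the spread is large relative to the posterior uncertainty, the conclusions are prior-driven and more data or a better justified prior is needed.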






# [Bayesian Inference](http://pages.stat.wisc.edu/~larget/stat302/bayes.pdf)
  

###  Prior and Posterior Distributions, Likelihood

Before seeing data, the prior distribution of an unknown parameter $\theta$ is described by a
probability density (or mass function, if discrete) $f(\theta)$. The Bayesian approach connects
data and parameter through the likelihood function, $f(x \mid \theta)$. Viewed as a function of $\theta$, with $x$ fixed at the observed data,
it is called the likelihood. Parameter values where the likelihood is high are those that have
a high probability of producing the observed data. In the maximum likelihood approach to
statistics, the best estimate is the value $\hat{\theta}$ that maximizes the likelihood (and, equivalently, the log-likelihood)
function.

All Bayesian inference is based on evaluation of the posterior distribution:
$$f(\theta \mid x) = \dfrac{f(x \mid \theta)f(\theta)}{f(x)}$$
where $f(x) = \int f(x \mid \theta)f(\theta) \, d\theta$ is the marginal likelihood, that is, the probability (or density)
of observing the data $x$ averaged across the entire parameter space.
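
As a rough numerical illustration of this formula, the sketch below evaluates likelihood times prior on a grid of $\theta$ values and normalizes by a trapezoidal approximation of $f(x)$. The binomial likelihood, the uniform prior, the data (7 successes in 20 trials), and the function names are assumptions made for the example.

```python
import math


def trapezoid(values, step):
    """Trapezoidal rule for equally spaced function values."""
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))


def grid_posterior(successes, trials, prior_pdf, grid_size=1001):
    """Approximate f(theta | x) on a grid over [0, 1] for a binomial likelihood."""
    step = 1.0 / (grid_size - 1)
    thetas = [i * step for i in range(grid_size)]
    # Numerator of Bayes' rule: f(x | theta) * f(theta) at every grid point.
    unnormalised = [
        math.comb(trials, successes)
        * t**successes
        * (1.0 - t) ** (trials - successes)
        * prior_pdf(t)
        for t in thetas
    ]
    # Denominator: the marginal likelihood f(x), integrated over theta.
    marginal = trapezoid(unnormalised, step)
    posterior = [u / marginal for u in unnormalised]
    return thetas, posterior, marginal


# Hypothetical data and a uniform prior on [0, 1].
thetas, posterior, marginal = grid_posterior(7, 20, prior_pdf=lambda t: 1.0)
step = thetas[1] - thetas[0]
post_mean = trapezoid([t * p for t, p in zip(thetas, posterior)], step)

# For a uniform prior the exact marginal likelihood is 1 / (trials + 1).
print(f"f(x) ~ {marginal:.6f}, posterior mean ~ {post_mean:.4f}")
```

Grid evaluation is only practical for one or two parameters, which is one reason MCMC is used for larger models.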


# [Hierarchical Bayes and Empirical Bayes](https://www2.isye.gatech.edu/~brani/isyebayes/bank/handout8.pdf)

Hierarchical Bayes and empirical Bayes share the same goals but differ considerably in how
these goals are achieved.

Both methods are concerned with specifying the distribution at the prior level: hierarchical Bayes
does so through Bayesian inference involving additional levels of hierarchy (hyperpriors and hyperparameters),
while empirical Bayes uses the data more directly.

### Hierarchical Bayesian Analysis

Hierarchical Bayesian analysis is a convenient representation of a Bayesian model, in particular of the prior
$\pi$, via a conditional hierarchy of so-called hyperpriors $\pi_1, \dots, \pi_{n+1}$,
$$ \pi(\theta) = \int \pi_1(\theta \mid \theta_1) \, \pi_2(\theta_1 \mid \theta_2) \dots \pi_n(\theta_{n-1} \mid \theta_n) \, \pi_{n+1}(\theta_n) \, d\theta_1 \, d\theta_2 \dots d\theta_n$$

Operationally, the model
$$[x \mid \theta] \sim f(x \mid \theta)$$
$$[\theta \mid \theta_1] \sim \pi_1(\theta \mid \theta_1)$$
$$\vdots$$
$$[\theta_{n-1} \mid \theta_n] \sim \pi_n(\theta_{n-1} \mid \theta_n)$$
$$[\theta_n] \sim \pi_{n+1}(\theta_n)$$
is equivalent to the model
$$[x \mid \theta] \sim f(x \mid \theta), \qquad [\theta] \sim \pi(\theta)$$
as far as inference on $\theta$ is concerned.
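
The equivalence can be checked numerically on a small discrete example. The sketch below is a toy two-level hierarchy in which $\theta$ takes three values and the hyperparameter $\theta_1$ takes two; all probabilities, the binomial likelihood, and the data are hypothetical. It computes the posterior of $\theta$ once from the collapsed prior $\pi(\theta)$ and once from the full joint posterior of $(\theta, \theta_1)$.

```python
import math

# Toy two-level hierarchy (all numbers are illustrative assumptions).
thetas = [0.2, 0.5, 0.8]                      # support of theta
hyperprior = {"low": 0.6, "high": 0.4}        # pi_2(theta_1)
conditional_prior = {                         # pi_1(theta | theta_1)
    "low": [0.7, 0.2, 0.1],
    "high": [0.1, 0.3, 0.6],
}
x, n = 3, 10                                  # hypothetical data: 3 successes in 10 trials


def likelihood(x, n, theta):
    """Binomial likelihood f(x | theta)."""
    return math.comb(n, x) * theta**x * (1.0 - theta) ** (n - x)


def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]


# Route 1: collapse the hierarchy into a single prior pi(theta), then apply Bayes' rule.
collapsed_prior = [
    sum(hyperprior[h] * conditional_prior[h][i] for h in hyperprior)
    for i in range(len(thetas))
]
posterior_collapsed = normalise(
    [likelihood(x, n, t) * p for t, p in zip(thetas, collapsed_prior)]
)

# Route 2: form the joint posterior of (theta, theta_1), then sum out theta_1.
joint = {
    (i, h): likelihood(x, n, thetas[i]) * conditional_prior[h][i] * hyperprior[h]
    for i in range(len(thetas))
    for h in hyperprior
}
total = sum(joint.values())
posterior_hierarchical = [
    sum(joint[(i, h)] for h in hyperprior) / total for i in range(len(thetas))
]

for t, p1, p2 in zip(thetas, posterior_collapsed, posterior_hierarchical):
    print(f"theta = {t}: collapsed {p1:.6f}, hierarchical {p2:.6f}")
```

Both routes print the same posterior probabilities, up to floating-point rounding.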

Notice that in the hierarchy of data, parameters and hyperparameters,
$X \to \theta \to \theta_1 \to \dots \to \theta_n$, the data $X$ and the hyperparameters $\theta_i$ are independent given $\theta$.

That means
$$[X \mid \theta, \theta_1, \dots] \overset{\mathrm{d}}{=} [X \mid \theta], \qquad [\theta_i \mid \theta, X] \overset{\mathrm{d}}{=} [\theta_i \mid \theta].$$
The joint distribution, which by definition is
$$[X, \theta, \theta_1, \dots, \theta_n] = [X \mid \theta, \theta_1, \dots, \theta_n] \, [\theta \mid \theta_1, \dots, \theta_n] \, [\theta_1 \mid \theta_2, \dots, \theta_n] \dots [\theta_{n-1} \mid \theta_n] \, [\theta_n],$$
therefore simplifies to
$$[X, \theta, \theta_1, \dots, \theta_n] = [X \mid \theta] \, [\theta \mid \theta_1] \, [\theta_1 \mid \theta_2] \dots [\theta_{n-1} \mid \theta_n] \, [\theta_n].$$
Thus, to fully specify the model, only the “neighbouring” conditionals and
the “closure” distribution $[\theta_n]$ are needed.
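
The factorization also says how to simulate from the model: draw from the closure distribution first, then from each neighbouring conditional in turn. A minimal sketch, assuming a two-level hierarchy in which every level is normal with unit variance (a choice made here purely for illustration):

```python
import random


def sample_hierarchy(rng):
    """Draw (x, theta, theta_1), using only neighbouring conditionals."""
    theta_1 = rng.gauss(0.0, 1.0)    # closure distribution [theta_1]
    theta = rng.gauss(theta_1, 1.0)  # [theta | theta_1]
    x = rng.gauss(theta, 1.0)        # [x | theta]; theta_1 is not needed once theta is known
    return x, theta, theta_1


rng = random.Random(0)
draws = [sample_hierarchy(rng) for _ in range(20000)]

xs = [d[0] for d in draws]
mean_x = sum(xs) / len(xs)
var_x = sum((v - mean_x) ** 2 for v in xs) / (len(xs) - 1)

# Under these assumptions the marginal of x is normal with mean 0 and variance 1 + 1 + 1 = 3.
print(f"sample mean of x = {mean_x:.3f}, sample variance of x = {var_x:.3f}")
```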

Reasons for decomposing the prior into a hierarchy include:

- Modeling requirements may lead to a hierarchy in the prior.
- Robustness and objectivity (let the data speak about the hyperparameters).
- Computational issues (utilizing hidden mixtures, mixture priors, missing data, MCMC format).

Sometimes it is not computationally feasible to carry out the analysis by reducing the sequence of hyperpriors to a single prior.

Rather, Bayes' rule is obtained (by using [Fubini’s theorem](http://ru.math.wikia.com/wiki/%D0%A2%D0%B5%D0%BE%D1%80%D0%B5%D0%BC%D0%B0_%D0%A2%D0%BE%D0%BD%D0%B5%D0%BB%D0%BB%D0%B8_%E2%80%94_%D0%A4%D1%83%D0%B1%D0%B8%D0%BD%D0%B8)) as a repeated integral with respect to more
convenient conditional distributions.
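
As a worked illustration for a two-stage hierarchy with a single hyperparameter $\theta_1$, the posterior can be written as a repeated integral. The marginal $m(x \mid \theta_1)$ is notation introduced here for the example.

$$\pi(\theta \mid x) = \int \pi(\theta \mid x, \theta_1) \, \pi(\theta_1 \mid x) \, d\theta_1,$$
$$\pi(\theta \mid x, \theta_1) = \dfrac{f(x \mid \theta) \, \pi_1(\theta \mid \theta_1)}{m(x \mid \theta_1)}, \qquad m(x \mid \theta_1) = \int f(x \mid \theta) \, \pi_1(\theta \mid \theta_1) \, d\theta,$$
$$\pi(\theta_1 \mid x) = \dfrac{m(x \mid \theta_1) \, \pi_2(\theta_1)}{\int m(x \mid \theta_1') \, \pi_2(\theta_1') \, d\theta_1'}.$$

Each factor involves only one level of the hierarchy at a time, which is convenient when $\pi_1$ is conjugate to the likelihood so that $\pi(\theta \mid x, \theta_1)$ and $m(x \mid \theta_1)$ are available in closed form.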

### Empirical Bayes

Empirical Bayes is an approach to inference in which the observations are used to select the prior, usually
via the marginal distribution. Once the prior is specified, the inference proceeds in the standard Bayesian
fashion. Using the data to estimate the prior and then again for the inference is criticized by
subjectivists, who consider prior information to be exogenous to the observations. The
repeated use of data is also perilous, since it can lead to underestimating modeling errors: any data will
tend to agree with a model that used the same data to specify some of its features.

Example:
1. Find the marginal distribution of $X_i$ in experiment $i$.
2. Estimate $\alpha$ and $\beta$ in the marginal.
3. Express $P(0 \leq \theta_{I+1} \leq 0.2 \mid X_{I+1} = 0)$ in terms of incomplete Beta functions with the estimated hyperparameters $\hat{\alpha}, \hat{\beta}$.
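
The three steps can be sketched as follows. The model behind this example is not spelled out in these notes, so the sketch assumes $X_i \mid \theta_i \sim \mathrm{Bin}(n, \theta_i)$ with $\theta_i \sim \mathrm{Beta}(\alpha, \beta)$, which makes the marginal Beta-Binomial. The historical counts, the use of moment matching to estimate $\alpha$ and $\beta$, and the function names are likewise assumptions.

```python
import math


def _continued_fraction(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete Beta function (modified Lentz method)."""
    tiny = 1e-300

    def guard(v):
        return v if abs(v) > tiny else tiny

    c = 1.0
    d = 1.0 / guard(1.0 - (a + b) * x / (a + 1.0))
    h = d
    for m in range(1, max_iter + 1):
        # Even-numbered term of the fraction.
        num = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        d = 1.0 / guard(1.0 + num * d)
        c = guard(1.0 + num / c)
        h *= d * c
        # Odd-numbered term of the fraction.
        num = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        d = 1.0 / guard(1.0 + num * d)
        c = guard(1.0 + num / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularised incomplete Beta function: P(theta <= x) for theta ~ Beta(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _continued_fraction(a, b, x) / a
    return 1.0 - front * _continued_fraction(b, a, 1.0 - x) / b


def moment_estimates(counts, n):
    """Step 2: match the mean and variance of X_i / n to the Beta-Binomial marginal."""
    props = [x / n for x in counts]
    mu = sum(props) / len(props)
    s2 = sum((p - mu) ** 2 for p in props) / (len(props) - 1)
    # Var(X/n) = mu (1 - mu) / n * (1 + (n - 1) rho), with rho = 1 / (alpha + beta + 1).
    rho = (n * s2 / (mu * (1.0 - mu)) - 1.0) / (n - 1)
    if not 0.0 < rho < 1.0:
        raise ValueError("moment estimates undefined: data show no overdispersion")
    total = 1.0 / rho - 1.0
    return mu * total, (1.0 - mu) * total


# Hypothetical historical data: successes out of n = 20 trials in I = 10 past experiments.
counts = [0, 2, 1, 3, 5, 1, 0, 4, 2, 2]
n = 20
alpha_hat, beta_hat = moment_estimates(counts, n)

# Step 3: a new experiment with X = 0 gives the posterior Beta(alpha_hat, beta_hat + n).
prob = reg_inc_beta(alpha_hat, beta_hat + n, 0.2)
print(f"alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}")
print(f"P(0 <= theta <= 0.2 | X = 0) ~ {prob:.4f}")
```

Maximum likelihood on the marginal is a common alternative to moment matching in step 2. Where SciPy is available, `scipy.special.betainc` can replace the hand-written incomplete Beta function.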

Nonparametric empirical Bayes uses the historical data to estimate the
marginal distribution in a nonparametric fashion. The estimated marginal distributions are then plugged into the
formal Bayes rule.

###  ML II

The idea is to mimic maximum likelihood estimation at the marginal level: select the prior $\pi$ that maximizes the marginal $m_{\pi}(x) = \int f(x \mid \theta) \pi(\theta) \, d\theta$,
given the data.
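
In practice the maximization is usually carried out over a parametric family of priors. The sketch below restricts $\pi$ to Beta($\alpha$, $\beta$) priors for binomial data, so that $m_{\pi}(x)$ is the Beta-Binomial marginal, and maximizes it by a plain grid search. The family, the data, the grid, and the function names are all assumptions made for illustration.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal(counts, n, alpha, beta):
    """Log of m_pi(x) for independent Beta-Binomial counts under a Beta(alpha, beta) prior."""
    return sum(
        math.log(math.comb(n, x))
        + log_beta(alpha + x, beta + n - x)
        - log_beta(alpha, beta)
        for x in counts
    )


def ml2_grid(counts, n, grid):
    """Return (alpha, beta, log marginal) maximising the marginal over a square grid."""
    best = max(
        (log_marginal(counts, n, a, b), a, b) for a in grid for b in grid
    )
    return best[1], best[2], best[0]


# Hypothetical data: successes out of n = 20 trials in 10 experiments.
counts = [0, 2, 1, 3, 5, 1, 0, 4, 2, 2]
n = 20
grid = [0.5 * k for k in range(1, 121)]  # 0.5, 1.0, ..., 60.0

alpha_hat, beta_hat, best_ll = ml2_grid(counts, n, grid)
print(f"ML-II prior: Beta({alpha_hat}, {beta_hat}), log marginal = {best_ll:.3f}")
# If the maximiser sits on the edge of the grid, widen the grid before trusting it.
```

A grid search keeps the example dependency-free; a numerical optimizer would be the usual choice for finer resolution.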

Off-topic notes:

- This ratio is easily numerically evaluated, see the Mathematica notebook (p. 3).
- Berger (1985), Section 4.6, pages 180–195, contains an excellent account of hierarchical models with detailed proofs.
- The result (see the MATHEMATICA program jeremy.nb on the web site)
- Marginal posterior, marginal distribution
- James-Stein estimator