# [Hierarchical Bayes](https://www2.isye.gatech.edu/~brani/isyebayes/bank/handout8.pdf)

### Hierarchical Bayesian Analysis

Hierarchical Bayesian analysis is a convenient representation of a Bayesian model, in particular of the prior
$\pi$, via a conditional hierarchy of so-called hyperpriors $\pi_1, \dots, \pi_{n+1}$:
$$ \pi(\theta) = \int \pi_1(\theta|\theta_1) \pi_2(\theta_1|\theta_2) \dots \pi_n(\theta_{n-1}|\theta_n)\pi_{n+1}(\theta_n) \, d\theta_1 d\theta_2 \dots d\theta_n$$

Operationally, the model
$$[x|\theta] \sim f(x|\theta)$$ $$[\theta|\theta_1] \sim \pi_1(\theta|\theta_1)$$ $$\vdots$$ $$[\theta_{n-1}|\theta_n] \sim \pi_n(\theta_{n-1}|\theta_n)$$ $$[\theta_n] \sim \pi_{n+1}(\theta_n)$$
is equivalent to the model
$$[x|\theta] \sim f(x|\theta), \quad [\theta] \sim \pi(\theta)$$
as far as inference on $\theta$ is concerned.
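
As a small illustration of this equivalence (an assumed example, not part of the original handout), take $n = 1$ with normal stages, where $\mu$, $\tau^2$ and $\omega^2$ are hypothetical known constants and $N(\cdot|m, v)$ denotes the normal density with mean $m$ and variance $v$:
$$[\theta|\theta_1] \sim N(\theta_1, \tau^2), \quad [\theta_1] \sim N(\mu, \omega^2).$$
Integrating out $\theta_1$ collapses the hierarchy to a single normal prior:
$$\pi(\theta) = \int N(\theta|\theta_1, \tau^2) \, N(\theta_1|\mu, \omega^2) \, d\theta_1 = N(\theta|\mu, \tau^2 + \omega^2),$$
so inference on $\theta$ under this two-stage model coincides with inference under the prior $N(\mu, \tau^2 + \omega^2)$.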

Notice that in the hierarchy of data, parameters and hyperparameters, $X \to \theta \to \theta_1 \to \dots \to \theta_n$,
the data $X$ and the hyperparameters $\theta_i$ are conditionally independent given $\theta$.

That means
$$[X|\theta, \theta_1, \dots, \theta_n] \overset{\mathrm{d}}{=} [X|\theta], \quad [\theta_i|\theta, X] \overset{\mathrm{d}}{=} [\theta_i|\theta].$$
More generally, each level depends on the higher levels only through its immediate neighbour, so the joint distribution, which by the chain rule is
$$[X, \theta, \theta_1, \dots, \theta_n] = [X|\theta, \theta_1, \dots, \theta_n] [\theta|\theta_1, \dots , \theta_n] [\theta_1|\theta_2, \dots, \theta_n] \dots [\theta_{n-1}|\theta_n] [\theta_n],$$
simplifies to
$$[X, \theta, \theta_1, \dots, \theta_n] = [X|\theta][\theta|\theta_1] [\theta_1|\theta_2] \dots [\theta_{n-1}|\theta_n] [\theta_n].$$
Thus, to fully specify the model, only the “neighbouring” conditionals and
the “closure” distribution $[\theta_n]$ are needed.

Hierarchical Bayes and empirical Bayes are related by their goals, but quite different in how
these goals are achieved. The attribute "hierarchical" refers mostly to the modeling strategy, while "empirical"
refers to the methodology. Both methods are concerned with specifying the distribution at the prior level: hierarchical Bayes does so via Bayesian inference involving additional levels of hierarchy (hyperpriors and hyperparameters),
while empirical Bayes uses the data more directly.

Why, then, decompose the prior?
- Modeling requirements may lead to a hierarchy in the prior, for example in Bayesian models for meta-analysis;
- The prior information may be separated into a structural part and a subjective/noninformative part
at a higher level of the hierarchy;
- Robustness and objectivity: “let the data talk about the hyperparameters”;
- Computational issues (utilizing hidden mixtures, mixture priors, missing data, MCMC format).

Suppose the hierarchical model is given as $[X|\theta] \sim f(x|\theta)$, $[\theta|\theta_1] \sim \pi_1(\theta|\theta_1)$, and $[\theta_1] \sim \pi_2(\theta_1)$. Then
the posterior distribution can be written as
$$\pi(\theta|x) = \int_{\Theta_1} \pi(\theta|x, \theta_1)\pi(\theta_1|x) \, d\theta_1$$

The densities under the integral are $\pi(\theta|x, \theta_1) = \frac{f(x|\theta)\pi_1(\theta|\theta_1)}{m_1(x|\theta_1)}$ and $\pi(\theta_1|x) = \frac{m_1(x|\theta_1)\pi_2(\theta_1)}{m(x)}$, where
$m_1(x|\theta_1) = \int_\Theta f(x|\theta)\pi_1(\theta|\theta_1) \, d\theta$ is the marginal likelihood and $m(x) = \int_{\Theta_1} m_1(x|\theta_1)\pi_2(\theta_1) \, d\theta_1$
is the marginal distribution of the data.
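
The decomposition above can be checked numerically. The following is a minimal sketch under an assumed normal model that is not part of the original notes: $[X|\theta] \sim N(\theta, \sigma^2)$, $[\theta|\theta_1] \sim N(\theta_1, \tau^2)$, $[\theta_1] \sim N(\mu, \omega^2)$, with hypothetical constants `SIGMA2`, `TAU2`, `OMEGA2`, `MU`. It integrates $\pi(\theta|x, \theta_1)\pi(\theta_1|x)$ over a grid of $\theta_1$ values and compares the result with the posterior obtained from the collapsed prior $N(\mu, \tau^2 + \omega^2)$.

```python
import math

# Hypothetical constants for an assumed normal hierarchy (illustration only):
#   x | theta ~ N(theta, SIGMA2), theta | theta1 ~ N(theta1, TAU2), theta1 ~ N(MU, OMEGA2)
SIGMA2, TAU2, OMEGA2, MU = 1.0, 2.0, 3.0, 0.5


def normal_pdf(z, mean, var):
    """Density of N(mean, var) evaluated at z."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_via_hierarchy(theta, x, lo=-20.0, hi=20.0, n=4000):
    """pi(theta|x) as the integral of pi(theta|x,theta1) * pi(theta1|x) (midpoint rule)."""
    step = (hi - lo) / n
    numerator = 0.0  # integral of pi(theta|x,theta1) * m_1(x|theta1) * pi_2(theta1)
    m_x = 0.0        # m(x) = integral of m_1(x|theta1) * pi_2(theta1)
    for i in range(n):
        t1 = lo + (i + 0.5) * step
        # m_1(x|theta1): closed form for the normal-normal pair
        m1 = normal_pdf(x, t1, SIGMA2 + TAU2)
        # pi(theta|x,theta1) = f(x|theta) * pi_1(theta|theta1) / m_1(x|theta1)
        cond = normal_pdf(x, theta, SIGMA2) * normal_pdf(theta, t1, TAU2) / m1
        # unnormalized pi(theta1|x) = m_1(x|theta1) * pi_2(theta1)
        weight = m1 * normal_pdf(t1, MU, OMEGA2)
        numerator += cond * weight * step
        m_x += weight * step
    return numerator / m_x


def posterior_direct(theta, x):
    """Posterior density under the collapsed prior theta ~ N(MU, TAU2 + OMEGA2)."""
    prior_var = TAU2 + OMEGA2
    post_var = SIGMA2 * prior_var / (SIGMA2 + prior_var)
    post_mean = (prior_var * x + SIGMA2 * MU) / (SIGMA2 + prior_var)
    return normal_pdf(theta, post_mean, post_var)


# The two values should agree up to numerical integration error.
print(posterior_via_hierarchy(0.8, 1.2), posterior_direct(0.8, 1.2))
```

The closed form for `m1` is specific to the normal-normal pair; for other models $m_1(x|\theta_1)$ would itself have to be computed by integration over $\theta$.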

Now, for any function $h$ of the parameter,
$$E^{\theta|x}h(\theta) = E^{\theta_1|x}\left[E^{\theta|\theta_1,x}h(\theta)\right].$$
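
To see the iterated expectation at work, here is a sketch with $h(\theta) = \theta$ in an assumed normal model (the constants $\sigma^2$, $\tau^2$, $\omega^2$, $\mu$ are hypothetical and taken as known): $[X|\theta] \sim N(\theta, \sigma^2)$, $[\theta|\theta_1] \sim N(\theta_1, \tau^2)$, $[\theta_1] \sim N(\mu, \omega^2)$. The inner expectation is the usual normal-normal posterior mean,
$$E^{\theta|\theta_1,x}\theta = \frac{\tau^2 x + \sigma^2 \theta_1}{\sigma^2 + \tau^2},$$
and, since the marginal likelihood of $x$ given $\theta_1$ is the $N(\theta_1, \sigma^2 + \tau^2)$ density,
$$E^{\theta_1|x}\theta_1 = \frac{\omega^2 x + (\sigma^2 + \tau^2)\mu}{\sigma^2 + \tau^2 + \omega^2}.$$
Substituting the second expression into the first gives
$$E^{\theta|x}\theta = \frac{\tau^2 x + \sigma^2 E^{\theta_1|x}\theta_1}{\sigma^2 + \tau^2} = \frac{(\tau^2 + \omega^2) x + \sigma^2 \mu}{\sigma^2 + \tau^2 + \omega^2},$$
which is the posterior mean one would obtain directly from the single prior $[\theta] \sim N(\mu, \tau^2 + \omega^2)$.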


Sometimes it is not computationally feasible to carry out the analysis by reducing the sequence of hyperpriors to a single prior.
Rather, the Bayes rule is obtained (by using [Fubini’s theorem](http://ru.math.wikia.com/wiki/%D0%A2%D0%B5%D0%BE%D1%80%D0%B5%D0%BC%D0%B0_%D0%A2%D0%BE%D0%BD%D0%B5%D0%BB%D0%BB%D0%B8_%E2%80%94_%D0%A4%D1%83%D0%B1%D0%B8%D0%BD%D0%B8)) as a repeated integral with respect to more
convenient conditional distributions.

## Hierarchical and empirical Bayesian methods

- https://www.ics.uci.edu/~sternh/courses/225/slides2new.pdf
- https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/hierarchical-models.pdf

For simulation in [hierarchical models](http://www.stats.ox.ac.uk/~reinert/stattheory/chapter1107.pdf), where $\beta$ denotes the hyperparameter: we first simulate $\beta$ from its distribution
and then, given $\beta$, simulate $\theta$ from its conditional distribution. We hope that the distribution of
$\beta$ is easy to simulate from, and that the conditional distribution of $\theta$
given $\beta$ is easy to simulate from as well. This approach is particularly useful for
MCMC (Markov chain Monte Carlo) methods.
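
The two-step simulation described here can be sketched as follows. This is a minimal illustration under assumptions that are not in the source: both levels are taken to be normal, $\beta \sim N(\mu, \omega^2)$ and $\theta|\beta \sim N(\beta, \tau^2)$, and the function name `sample_hierarchy` and its parameters `mu`, `omega`, `tau` are hypothetical.

```python
import random
import statistics


def sample_hierarchy(n_draws, mu=0.0, omega=1.0, tau=2.0, seed=0):
    """Draw (beta, theta) pairs top-down from an assumed normal hierarchy.

    omega and tau are standard deviations (illustrative values only).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        beta = rng.gauss(mu, omega)   # step 1: simulate the hyperparameter beta
        theta = rng.gauss(beta, tau)  # step 2: simulate theta given beta
        draws.append((beta, theta))
    return draws


draws = sample_hierarchy(20000)
thetas = [theta for _, theta in draws]

# Under these assumptions the marginal of theta is N(mu, omega^2 + tau^2),
# so the sample mean and variance should be close to 0 and 5.
print(statistics.fmean(thetas), statistics.pvariance(thetas))
```

Keeping only the `theta` component of each pair yields draws from the marginal prior of $\theta$ without ever evaluating the integral over $\beta$.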

http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf

https://pdfs.semanticscholar.org/bc59/3aae11890468c5d6c1a7610ddd928236ffa7.pdf