## [Empirical Bayes method](https://en.wikipedia.org/wiki/Empirical_Bayes_method)

Empirical Bayes uses the data to set the hyperparameters of the prior. Performing Bayesian inference with this prior then gets you a sort of shrinkage and can be viewed as an approximation to a hierarchical Bayesian model.


EB (empirical Bayes) ignores the uncertainty in the hyper-parameters, whereas HBM (a hierarchical Bayesian model) attempts to include it in the analysis. HBM is a good idea when there is little data and hence significant uncertainty in the hyper-parameters, which must be accounted for. On the other hand, for large datasets EB becomes more attractive, as it is generally less computationally expensive and the volume of data often means the results are much less sensitive to the hyper-parameter settings.



Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood, represents one approach for setting hyperparameters.

###  [James-Stein estimators](http://www.stats.ox.ac.uk/~reinert/stattheory/chapter1107.pdf)


Assume that $x|θ ∼ N(θ, I_p)$ and $θ_i ∼ N (0, τ^2)$ independently; then $p(x|τ^2) = N (0,(1 + τ^2)I_p)$, and
the posterior for $θ$ given the data is
$$θ|x ∼ N (τ^2/(1 + τ^2) x, τ^2/(1 + τ^2) I_p)$$

Under quadratic loss, the Bayes estimator $δ(x)$ of $θ$ is the posterior
mean $τ^2/(1 + τ^2) x$

In the empirical Bayes approach, we would use the m.l.e. for $τ^2$, namely $\hat{τ}^2 = (||x||^2/p − 1)^+$,
and the empirical Bayes estimator is the estimated posterior mean,
$$δ^{EB}(x) = \hat{τ}^2/(1 + \hat{τ}^2)x = (1 − p/||x||^2)^+ x$$
which is the truncated James-Stein estimator. It can be shown to
outperform the estimator $δ(x) = x$.

Alternatively, the best unbiased estimator of $1/(1 + τ^2)$ is $(p−2)/||x||^2$,
giving
$$δ^{EB}(x) = (1 − (p−2)/|| x ||^2) x$$
This is the James-Stein estimator. It can be shown that, for $p ≥ 3$ and under a
quadratic loss function, the James-Stein estimator has a risk function
that is uniformly better than that of $δ(x) = x$.
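
The risk comparison can be checked numerically. The following is a minimal Monte Carlo sketch, assuming $x ∼ N(θ, I_p)$ with a hypothetical $θ$ and $p = 10$; the function names are illustrative, and the positive-part variant here uses the constant $p − 2$ rather than the $p$ that arises from the m.l.e.

```python
import numpy as np

def james_stein(x, positive_part=False):
    """Shrink each row of x toward 0 by the factor 1 - (p - 2) / ||x||^2."""
    p = x.shape[-1]
    factor = 1.0 - (p - 2) / np.sum(x**2, axis=-1, keepdims=True)
    if positive_part:
        factor = np.maximum(factor, 0.0)  # truncate negative factors at 0
    return factor * x

def monte_carlo_risk(theta, n_rep=20000, seed=0):
    """Estimate E||delta(X) - theta||^2 for X ~ N(theta, I_p) (assumed model)."""
    rng = np.random.default_rng(seed)
    x = theta + rng.standard_normal((n_rep, theta.size))
    estimates = {
        "mle": x,
        "js": james_stein(x),
        "js_plus": james_stein(x, positive_part=True),
    }
    return {name: np.mean(np.sum((est - theta) ** 2, axis=1))
            for name, est in estimates.items()}

# Hypothetical true mean vector with p = 10 coordinates.
risks = monte_carlo_risk(theta=np.full(10, 0.5))
```

In this setup the estimated risk of $δ(x) = x$ should be close to $p$, with both shrinkage estimators below it; the gap narrows as $||θ||$ grows.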

>Note: both estimators tend to "shrink" towards 0. It is now known
to be a very general phenomenon that when comparing three or more
populations, the sample mean is not the best estimator. "Shrinkage"
estimators are an active area of research.


##### [James Stein Estimator and its EB Justification. ](https://www2.isye.gatech.edu/~brani/isyebayes/bank/handout8.pdf)

Consider the estimation of $θ$ in a model $X ∼ MVN_p(θ, I)$
under squared error loss $L(θ, a) = \sum_i (θ_i − a_i)^2$.

For $p = 1$ and $2$, estimator $\hat{θ} = X$ is admissible (as unique minimax), i.e., no estimator has uniformly
better risk. However, for $p ≥ 3$ $X$ is neither unique minimax nor admissible. A better estimator is
$$δ_{JS}(X) = (1 − (p − 2)/(\sum^p_{i=1} X^2_i) ) X$$
known as the James-Stein estimator.


The empirical Bayes justification for $δ_{JS}(X)$ is provided next.

Suppose that $θ$ has a prior distribution
$θ ∼ MVN (0, τ^2 I)$,
where the hyperparameter $τ^2$
is not known and will be estimated from the sample, $X$ in this case.

The Bayes rule under squared error loss is
$$δ_B(X) = τ^2/(1 + τ^2)X = (1 - 1/(1 + τ^2))X$$

The method-of-moments estimator of $1/(1+τ^2)$
is $(p-2)/(\sum^p_{i=1} X^2_i)$,
which yields an empirical Bayes estimator
$$δ_{EB}(X) = (1 -(p − 2)/(\sum^p_{i=1} X^2_i)) X$$
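
A short check of why $(p-2)/\sum_i X_i^2$ targets $1/(1+τ^2)$. Integrating $θ$ out of the model and prior above gives the marginal $X ∼ MVN_p(0, (1+τ^2)I)$; the only other ingredient is the standard fact that $E[1/\chi^2_p] = 1/(p-2)$ for $p ≥ 3$:

$$\frac{\sum_{i=1}^p X_i^2}{1+τ^2} ∼ \chi^2_p \quad\Longrightarrow\quad E\left[\frac{p-2}{\sum_{i=1}^p X_i^2}\right] = \frac{p-2}{1+τ^2}\, E\left[\frac{1}{\chi^2_p}\right] = \frac{1}{1+τ^2}$$

So the shrinkage factor is estimated without bias under the marginal distribution, which is the sense in which the moments are matched here.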

### [ Hierarchical vs Empirical Bayesian methods](http://www.stats.ox.ac.uk/~reinert/stattheory/chapter1107.pdf)

Hierarchical Bayes: $θ|β ∼ π_1(θ|β)$, where $β ∼ π_2(β)$,
so that $π(θ) = \int π_1(θ|β)π_2(β)dβ$.



For simulation in hierarchical models, we first simulate $β$ from $π_2$;
then, given $β$, we simulate $θ$ from $π_1(\cdot|β)$. We hope that the distribution of
$β$ is easy to simulate, and also that the conditional distribution of $θ$
given $β$ is easy to simulate. This approach is particularly useful for
MCMC (Markov chain Monte Carlo) methods.
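
The two-step simulation can be sketched as follows. The specific choices $π_2 = N(0, 1)$ and $π_1(\cdot|β) = N(β, 1)$ are assumptions made only for illustration; with them, the marginal $π(θ)$ is $N(0, 2)$, which gives a simple check.

```python
import numpy as np

def sample_hierarchical(n, seed=0):
    """Draw n pairs (beta, theta): first beta, then theta given beta."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(loc=0.0, scale=1.0, size=n)  # beta ~ pi_2 = N(0, 1) (assumed)
    theta = rng.normal(loc=beta, scale=1.0)        # theta | beta ~ pi_1 = N(beta, 1) (assumed)
    return beta, theta

beta, theta = sample_hierarchical(100_000)
# Discarding beta leaves draws from the marginal prior pi(theta).
marginal_variance = theta.var()
```

The sample variance of `theta` should be near 2, matching the integral $π(θ) = \int π_1(θ|β)π_2(β)dβ$ for this particular pair of distributions.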


Empirical Bayes:
$p(x|λ) = \int f(x|θ)π(θ|λ)dθ$

Rather than specifying $λ$, we estimate $λ$ by $\hat{λ}$, for example by frequentist methods, based on the marginal $p(x|λ)$. The resulting $π(θ|x, \hat{λ})$
is called a pseudo-posterior.


The empirical Bayes approach

- is neither fully Bayesian nor fully frequentist;
- depends on $\hat{λ}$: different choices of $\hat{λ}$ lead to different procedures;
- leads asymptotically to a coherent Bayesian analysis if $\hat{λ}$ is consistent;
- often outperforms classical estimators in empirical terms.
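
As an illustration of the pseudo-posterior, here is a minimal parametric sketch. It assumes $x_i|θ_i ∼ N(θ_i, 1)$ and $θ_i ∼ N(μ, τ^2)$, so that $λ = (μ, τ^2)$ and the marginal is $x_i ∼ N(μ, 1 + τ^2)$; the model, the simulated data and the function name are assumptions, not taken from the source notes.

```python
import numpy as np

def eb_normal_means(x):
    """Pseudo-posterior means for x_i | theta_i ~ N(theta_i, 1), theta_i ~ N(mu, tau2)."""
    mu_hat = x.mean()                                        # marginal MLE of mu
    tau2_hat = max(np.mean((x - mu_hat) ** 2) - 1.0, 0.0)    # marginal MLE of tau2, truncated at 0
    shrink = tau2_hat / (1.0 + tau2_hat)                     # weight on the observation
    return mu_hat + shrink * (x - mu_hat), (mu_hat, tau2_hat)

rng = np.random.default_rng(1)
theta = rng.normal(3.0, 0.5, size=500)    # hypothetical true means
x = theta + rng.standard_normal(500)      # one noisy observation per mean

theta_eb, (mu_hat, tau2_hat) = eb_normal_means(x)
mse_raw = np.mean((x - theta) ** 2)       # classical estimator: x itself
mse_eb = np.mean((theta_eb - theta) ** 2)
```

Plugging $\hat{λ}$ into the posterior mean shrinks each $x_i$ toward $\hat{μ}$; in this simulated setting the shrunken estimates should have a noticeably smaller mean squared error than the raw observations.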

#### [Empirical and Hierarchical Bayes](https://www.cs.ubc.ca/~schmidtm/Courses/540-W16/L19.pdf)

> Why the beta distribution? It’s a flexible distribution that includes the uniform as a special case ($α = 1$ and $β = 1$).

- Likelihood $p(x|θ)$. Probability of seeing data given parameters.
- Prior $p(θ|α, β)$.
Belief that parameters are correct before we’ve seen data.
- Posterior $p(θ|x, α, β)$.
Probability that parameters are correct after we’ve seen data.
- Posterior predictive $p(\hat{x}|x, α, β)$.
Probability of new data given old, integrating over parameters, 
tells us which prediction is most likely given data and prior.
- Marginal likelihood (also called evidence) $p(x|α, β)$.
Probability of seeing data given hyper-parameters.

Hyper-parameters $α$ and $β$ are like “pseudo-counts” in our mind before we flip the coin.
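
For coin flips with a beta prior, all of the quantities listed above have closed forms. The sketch below assumes a Bernoulli likelihood and a $Beta(α, β)$ prior; the function names and the example counts are illustrative.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_bernoulli(heads, tails, alpha=1.0, beta=1.0):
    # Posterior is Beta(alpha + heads, beta + tails): counts add to pseudo-counts.
    post_alpha, post_beta = alpha + heads, beta + tails
    # Posterior predictive probability that the next flip is a head.
    predictive_head = post_alpha / (post_alpha + post_beta)
    # Marginal likelihood of one particular sequence with these counts.
    log_evidence = log_beta(post_alpha, post_beta) - log_beta(alpha, beta)
    return (post_alpha, post_beta), predictive_head, exp(log_evidence)

# Three heads and one tail under the uniform prior alpha = beta = 1.
posterior, p_head, evidence = beta_bernoulli(heads=3, tails=1)
```

With the uniform prior this gives a $Beta(4, 2)$ posterior, a predictive probability of $2/3$ for a head, and a marginal likelihood of $0.05$ for the observed sequence.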

##### Learning the Prior from Data:

_Can we use the data to set the hyper-parameters?_

- In theory: No!
    - It would not be a “prior”. 
- In practice: Yes!
    - Approach 1: use a validation set or cross-validation as before.
    - Approach 2: optimize the marginal likelihood,
$$ p(y|X, λ) = \int_w p(y|X, w)p(w|λ)dw$$

Also called type II maximum likelihood or evidence maximization or empirical Bayes.
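
A minimal sketch of Approach 2 for linear regression, assuming $y|X, w ∼ N(Xw, σ^2 I)$ with known $σ^2$ and $w ∼ N(0, λ^{-1} I)$, so that the integral has the closed form $y|X, λ ∼ N(0, σ^2 I + λ^{-1} XX^T)$. The model, the simulated data and the grid search are assumptions chosen for illustration; in practice a gradient-based optimizer could replace the grid.

```python
import numpy as np

def log_marginal_likelihood(X, y, lam, sigma2=1.0):
    """log p(y | X, lam) for y | w ~ N(Xw, sigma2 I) and w ~ N(0, I / lam)."""
    n = X.shape[0]
    cov = sigma2 * np.eye(n) + (X @ X.T) / lam   # marginal covariance of y
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def type2_ml(X, y, grid, sigma2=1.0):
    """Pick the lam on the grid with the largest marginal likelihood."""
    scores = np.array([log_marginal_likelihood(X, y, lam, sigma2) for lam in grid])
    return grid[int(np.argmax(scores))], scores

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_true = rng.normal(scale=2.0, size=5)           # hypothetical weights
y = X @ w_true + rng.standard_normal(50)

grid = np.logspace(-3, 3, 61)
lam_hat, scores = type2_ml(X, y, grid)
```

Unlike cross-validation, this uses all of the data at once and needs no held-out set, at the cost of relying on the assumed prior family.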

##### Overview of Bayesian Variable Selection

[Empirical Bayes vs. fully Bayes variable selection](http://www-stat.wharton.upenn.edu/~edgeorge/Research_papers/CG%20JSPI%202008.pdf)

- If we fix $λ$ and use L1-regularization (Bayesian lasso), the posterior is not sparse.
The probability that a variable is exactly 0 is zero.
L1-regularization only leads to sparsity because the MAP point estimate is sparse.
- Type II maximum likelihood leads to sparsity in the posterior because the variance
goes to zero.
    - Weird fact: it yields sparse solutions (automatic relevance determination).
We can send $λ_j → ∞$, concentrating the posterior for $w_j$ at 0.
This is L2-regularization, but empirical Bayes naturally encourages sparsity.
(It is non-convex and the theory is not well understood, but recent work shows that it
never performs worse than L1-regularization, and there exist cases where it does better.)
- We can encourage sparsity in Bayesian models using a spike and slab prior:
    - A mixture of a Dirac delta function at 0 and another prior with non-zero variance. It places non-zero posterior weight at exactly 0. The posterior is still non-sparse, but it answers the question “what is the probability that a variable is non-zero?”
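
For a single coefficient, the spike and slab question has a closed-form answer. The sketch below assumes one observation $x|w ∼ N(w, 1)$ and the prior $w ∼ (1 − q)\,δ_0 + q\,N(0, v)$; the symbols $q$ and $v$, the default values and the function names are illustrative assumptions.

```python
from math import exp, pi, sqrt

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return exp(-0.5 * x * x / var) / sqrt(2 * pi * var)

def inclusion_probability(x, prior_inclusion=0.5, slab_var=10.0):
    """P(w != 0 | x) for x | w ~ N(w, 1), w ~ (1 - q) delta_0 + q N(0, slab_var)."""
    slab = prior_inclusion * normal_pdf(x, 1.0 + slab_var)   # marginal of x under the slab
    spike = (1.0 - prior_inclusion) * normal_pdf(x, 1.0)     # marginal of x when w = 0
    return slab / (slab + spike)

p_small = inclusion_probability(0.0)   # observation at 0 favours the spike
p_large = inclusion_probability(5.0)   # observation far from 0 favours the slab
```

An observation near 0 pulls the inclusion probability below its prior value, while an observation several standard deviations away pushes it close to 1.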
    


##### Bayesian Model Selection and Averaging

Bayesian model selection (“type II MAP”): maximize hyper-parameter posterior,
which further takes us away from overfitting (thus allowing more complex models).


Bayesian model averaging considers the posterior over hyper-parameters.
We could also maximize the marginal likelihood of $γ$, the parameters of the prior on the hyper-parameters (“type III ML”).
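
The three objectives can be written side by side. The notation is an assumption carried over from the marginal likelihood $p(y|X, λ)$ used earlier in these notes, with $γ$ taken to be the parameters of a prior $p(λ|γ)$ on the hyper-parameters:

$$\begin{aligned}
\text{type II ML:}\quad & \hat{λ} = \arg\max_λ\, p(y|X, λ) \\
\text{type II MAP:}\quad & \hat{λ} = \arg\max_λ\, p(y|X, λ)\,p(λ|γ) \\
\text{type III ML:}\quad & \hat{γ} = \arg\max_γ \int p(y|X, λ)\,p(λ|γ)\,dλ
\end{aligned}$$

Each step moves the point estimate one level up the hierarchy and integrates out everything below it.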

- Posterior predictive lets us directly model what we want given hyper-parameters.
- Marginal likelihood is the probability of seeing data given hyper-parameters.
- Empirical Bayes optimizes this to set hyper-parameters:
    - Allows tuning a large number of hyper-parameters.
    - Bayesian Occam’s razor: naturally encourages sparsity and simplicity.
- Hierarchical Bayes goes even more Bayesian with prior on hyper-parameters.
    - Leads to Bayesian model selection and Bayesian model averaging.



#### [Hierarchical Bayes and Empirical Bayes.](https://www2.isye.gatech.edu/~brani/isyebayes/bank/handout8.pdf)

##### Empirical Bayes. ML II Method

Empirical Bayes has several formulations. The original formulation of empirical Bayes assumes that past values
of $X_i$ and the corresponding parameters $θ_i$ are known to the statistician, who then, on the basis of the current observation
$X_{n+1}$, tries to make inference on the unobserved $θ_{n+1}$. Of course, the parameters $θ_i$ are seldom known. However,
it may be assumed that the past (and current) $θ$’s are realizations from the same unknown prior distribution.

Empirical Bayes is an approach to inference in which the observations are used to select the prior, usually
via the marginal distribution. Once the prior is specified, the inference proceeds in a standard Bayesian
fashion. The use of data to estimate the prior, in addition to its subsequent use for inference, is
criticized by subjectivists, who consider the prior information exogenous to the observations. The
repeated use of data is also loaded with perils, since it can underestimate modeling errors: any data set is going
to agree with a model that used the same data to specify some of its features.

__ML II__

The idea is to mimic maximum likelihood estimation at the marginal level: select a prior $\pi$ that maximizes the marginal $m_{\pi}(x) = \int f(x|θ)\pi(θ)dθ$,
given the data.

Offtop:


- This ratio is easily numerically evaluated, see mathematica notebook. (p.3)
- Berger (1985), Section 4.6 pages 180–195 contains an excellent account on hierarchical models with detailed proofs.
- The result (see MATHEMATICA program jeremy.nb on the web site)
- Marginal posterior,  marginal distribution
- James Stein Estimator

#### [ Hierarchical and Empirical Bayes Analyses](https://www.stat.unipd.it/sites/default/files/bayesian-mod4.pdf)

__Many statistical applications involve multiple parameters that can be regarded as related
or connected in some way by the structure of the problem, implying that a joint probability
model for these parameters should reflect the dependence among them__. For example, in a
study of the effectiveness of cardiac treatments, with the patients in hospital $i$ having survival
probability $\theta_i$, it might be reasonable to expect that estimates of the $\theta_i$'s, which represent a sample of hospitals, should be related to each other. This is achieved in a natural way
if we use a prior distribution in which the $\theta_i$'s are viewed as a sample from a common
population distribution.

A key feature of such applications is that the observed data $y_{ij}$
on the $j$th unit in the $i$th group can be used to estimate aspects of the population distribution of the $\theta_i$'s, even though the $\theta_i$'s are not themselves observed. It is natural to model
such a problem hierarchically, with observable outcomes modelled conditionally on certain
parameters, which themselves are modelled through a probability distribution depending on
additional parameters, known as hyperparameters. Such hierarchical thinking helps
in understanding multiparameter problems and also plays an important role in developing computational strategies.

__Perhaps even more important in practice is that nonhierarchical models are usually
unsuitable for hierarchical data; with few parameters they generally cannot fit large data
sets accurately, whereas with many parameters they tend to overfit such data in the sense
of producing models that fit the existing data well but lead to inferior predictions for future data.__ In contrast, hierarchical models can have enough parameters to fit the data well
while using a population distribution to structure some dependence into the parameters,
thereby avoiding problems of overfitting.



Hierarchical modelling is useful in both frequentist and Bayesian analyses. Normal
mixed linear models and analysis of overdispersion can be viewed as applications of hierarchical modelling in frequentist analysis. 



As described in Berger, an EB scenario is one in which known relationships or
structures of the coordinates of a parameter vector, say $\theta$, allow use of the
data to estimate some features of the distribution $p_1(\theta|\lambda)$, called a prior distribution. For
example, one may have reason to believe that the $\theta_i$ are i.i.d. with joint pdf $p_1(\theta|\lambda)$, where
$p_1$ is structurally known except for the hyperparameters $\lambda$. A parametric empirical
Bayes (PEB) procedure is one where $\lambda$ is estimated from the marginal distribution of the
observations.


Closely related to the EB procedure is the HB procedure, which models the prior distribution in stages. In the first stage (conditional on $\lambda$), the $\theta_i$ are i.i.d. with a
prior $p_1(\cdot|\lambda)$. In the second stage, a prior distribution, say $p_2(\lambda)$, often noninformative and
improper, is assigned to $\lambda$. This is an example of a two-stage prior.


It is apparent that both the EB and the HB procedures
recognize the uncertainty in the prior information, but whereas the HB procedure models this uncertainty by assigning a distribution (often noninformative
or improper) to the hyperparameters, the EB procedure attempts to estimate the unknown hyperparameters, typically by some classical method such as the method of moments or the method
of maximum likelihood, and uses the resulting estimated posteriors of the parameters for inferential purposes.

It turns out that the two methods can quite often lead to comparable results, especially in the context of point estimation.
However, when it comes to measuring
the standard errors associated with these estimates, as needed for interval estimation, the HB
method has a clear edge over a naive EB method. By a naive EB method we mean an
EB method based on the estimated posterior distribution that does not account, fully or in
part, for the uncertainty involved in estimating the hyperparameters. Whereas there are no
clear-cut measures of standard errors associated with EB point estimates, the same is not
true of HB estimates. To be precise, if one estimates the parameter of interest by its
posterior mean, then a very natural estimate of the accuracy associated with this estimate
is its posterior variance. Estimates of the standard errors associated with EB point estimates usually need an ingenious approximation, whereas the posterior variances, though
often complicated, can be found exactly.


Thus a naive EB procedure ignores the term $V[E(\theta|y, \xi)|y]$, which amounts to ignoring the
uncertainty involved in estimating the prior mean $\xi$ when estimating the posterior variance.
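
The ignored term comes from the law of total variance. As a sketch in the notation above, with $\hat{\xi}$ denoting the EB estimate of $\xi$ (a symbol introduced here only for illustration):

$$V(\theta|y) = E[V(\theta|y,\xi)|y] + V[E(\theta|y,\xi)|y]$$

A naive EB analysis reports $V(\theta|y,\hat{\xi})$, which approximates only the first term, so its variance estimate is too small whenever the second term is not negligible.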



First, while the HB procedure and the non-naive
EB procedure tend to produce similar inference, the naive EB estimates are quite different,
and the estimated posterior variances associated with them are too small. Second, in the
posterior variance of the HB estimates, the second term, namely the variance of the conditional
expectation, contributes significantly, and its omission, as is often done in the
naive EB solution, will lead to substantial underestimation of the true measure of accuracy.
 We can also consider
the situation in which the $\theta_i$ arise from a regression model
$$\theta_i=X^T_i\beta + e_i$$

It is deemed plausible for $\theta_i$ to be linearly increasing with time:
$$\theta_i = \beta_1 + \beta_2 i + e_i$$


Raw data estimates $y_i$ and the regression estimates $\hat{\theta}_{R_i}$ can be
considered two extremes: the $y_i$ would be the natural estimates if no relationships among the
$\theta_i$ were suspected, while the $\hat{\theta}_{R_i}$ would be the estimates under the specific lower-dimensional
regression model. The EB and HB estimates are, of course, averages of these two extremes; note that the EB estimates are slightly closer to the regression estimates.
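
The averaging between the two extremes can be made explicit. The sketch below assumes $y_i|\theta_i ∼ N(\theta_i, \sigma^2)$ with known $\sigma^2$ and $e_i ∼ N(0, \tau^2)$, estimates $\beta$ by least squares and $\tau^2$ by a truncated moment estimate, and plugs both into the posterior mean. The data are simulated and the function name is hypothetical; this is a naive EB version, not the procedure of the source.

```python
import numpy as np

def eb_regression_shrinkage(y, X, sigma2=1.0):
    """Naive EB estimates of theta_i = X_i^T beta + e_i from y_i ~ N(theta_i, sigma2)."""
    n, q = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat                                # regression estimates
    resid_var = np.sum((y - fitted) ** 2) / (n - q)      # estimates sigma2 + tau2
    tau2_hat = max(resid_var - sigma2, 0.0)              # moment estimate, truncated at 0
    weight = sigma2 / (sigma2 + tau2_hat)                # weight on the regression fit
    return weight * fitted + (1.0 - weight) * y, fitted, weight

rng = np.random.default_rng(0)
i = np.arange(1, 21)
theta = 1.0 + 0.5 * i + rng.normal(0.0, 1.0, i.size)     # hypothetical linear trend plus e_i
y = theta + rng.normal(0.0, 1.0, i.size)                 # raw estimates with sigma2 = 1
X = np.column_stack([np.ones(i.size), i])

eb, fitted, weight = eb_regression_shrinkage(y, X)
```

A weight near 1 reproduces the regression estimates and a weight near 0 reproduces the raw $y_i$; every EB estimate lies between the two.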

Similarly, the raw
data variances and the variances from the regression estimates can be considered
to be two extremes: the additional structure of the regression model yields much smaller
variances (only valid if the model is actually correct, of course). Again, the EB and HB
variances are an average of these extremes and are reasonably similar. Note that the HB
variances tend to be smaller than the EB variances for middle $i$ but larger otherwise; they
thus mimic the pattern of the regression variances more closely than do the EB variances.

Advantages of HB:

- We criticized EB
because of a failure to consider hyperparameter estimation error; EB theory does not by
itself indicate how to incorporate the hyperparameter estimation error in the analysis. On
the other hand, HB analysis incorporates such errors automatically and is hence generally
the more reasonable of the approaches. Sophisticated EB procedures, such as that of Morris,
are usually developed by trying to approximate the HB answer.

- Another advantage of the HB approach is that, with only slight additional difficulty, one can incorporate actual subjective prior information at the second stage.

- A third advantage of the HB approach is that it easily produces a greater wealth of
information. For instance, the posterior covariances (the off-diagonal entries of the posterior variance-covariance matrix) are
substantial, and knowledge of them would be important for a variety of statistical analyses.
These covariances are easily calculable in HB analysis but would require work to derive in
a sophisticated EB fashion.

- From a calculational perspective, the comparison is something of a toss-up. Standard
EB theory requires the solution of likelihood equations, while the HB approach requires numerical integration. Solving likelihood equations is probably somewhat easier, particularly when the needed numerical integration is high-dimensional, but
numerical issues are never clear-cut (e.g., one has to worry about uniqueness of solutions to the likelihood equations).

- In conclusion, it appears that the HB approach is the superior methodology for general
application. When $p$ is large, of course, there will be little difference between the two
approaches, and whichever is more convenient can then be employed.


### [Empirical Hierarchical Bayes Estimation](https://link.springer.com/chapter/10.1007/978-1-4612-2944-5_8)

It is well known that the James-Stein estimates of mean values of several populations can be derived as empirical Bayes
estimates assuming a common prior distribution for all the
mean values. But the superiority of such estimates over the
usual unbiased estimates diminishes as the variability of the
true mean values between populations increases. 
__In such cases
it is suggested that the populations may be split into two or
more homogeneous groups and the James-Stein procedure applied to the mean values in each group separately.__ In this paper, we introduce a hierarchical prior distribution by considering the mean values within a group to have a common prior
with some hyperparameters which are different from group to
group. The hyperparameters in different groups are themselves
considered to have a common prior distribution possibly with
hyper-hyperparameters. Under some conditions on variabilities
between and within groups, it is shown that the empirical Bayes
estimates derived from a two stage prior distribution on the
mean values are better than those obtained by applying the
James-Stein procedure to all the mean values in one step or to
the mean values in individual groups separately.

Consider a situation where we have independent samples drawn from
a number of populations which can be grouped into a smaller number of
clusters such that the populations within a cluster are more homogeneous
than those between.

The advantage of the James-Stein estimates is lost if the parameters
under estimation have a large variation. In such a case it may be profitable
to consider some natural classification of the parameters into two or more
groups and apply the James-Stein procedure separately for the parameters
in each group.
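
A small simulation can illustrate the point. This sketch uses the James-Stein variant that shrinks toward the sample mean (with constant $k − 3$ for $k$ means), which is an assumption made here so that clusters far from 0 can still benefit; the two clusters, their sizes and their spreads are hypothetical, and unit-variance normal observations are assumed.

```python
import numpy as np

def js_toward_mean(x):
    """Shrink each row of x toward that row's own mean (needs at least 4 columns)."""
    k = x.shape[-1]
    center = x.mean(axis=-1, keepdims=True)
    s = np.sum((x - center) ** 2, axis=-1, keepdims=True)
    return center + (1.0 - (k - 3) / s) * (x - center)

rng = np.random.default_rng(0)
# Two hypothetical clusters of 10 means each: tight within, far apart between.
theta = np.concatenate([rng.normal(0.0, 0.5, 10), rng.normal(10.0, 0.5, 10)])
x = theta + rng.standard_normal((5000, theta.size))   # 5000 replicated data sets

pooled = js_toward_mean(x)                             # one step over all 20 means
grouped = np.concatenate(
    [js_toward_mean(x[:, :10]), js_toward_mean(x[:, 10:])], axis=1
)                                                      # separately within each cluster

def risk(est):
    """Monte Carlo estimate of the total squared error risk."""
    return np.mean(np.sum((est - theta) ** 2, axis=1))

risk_raw, risk_pooled, risk_grouped = risk(x), risk(pooled), risk(grouped)
```

With widely separated clusters the pooled estimator shrinks very little and gains almost nothing over the raw means, while the grouped version shrinks strongly within each cluster.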

A natural choice for the hierarchical
prior is as follows:
1. $\mu_{i1}, ... , \mu_{ik}$ (i.e., the parameters of interest in the $i$-th cluster) are
iid with a common probability density $p(\cdot|\lambda_i,\eta)$, $i = 1, ... , p$, depending
on a varying parameter $\lambda_i$ and a common parameter $\eta$.
2. The cluster parameters $\lambda_1, ... , \lambda_p$ are iid with a common probability
density $p(\cdot | k)$ depending on a parameter $k$.

We consider two cases, one when the parameters are known,
and another when some or all parameters are unknown but the unknowns
are estimable from given data.

We consider two types of risk functions, conditional on $\theta$ (items 1 and 2) and averaged over a prior (items 3 and 4):
1. Mean dispersion error (MDE) in estimation
$$MDE (\hat{\theta}) = E[(\hat{\theta} - \theta)(\hat{\theta} - \theta)' |\theta]$$
2. Compound mean square error (CMSE)
$$ CMSE (\hat{\theta}) = \operatorname{tr} MDE (\hat{\theta}) = \sum_i E[(\hat{\theta}_i - \theta_i)^2 | \theta ] $$
3. Bayes MDE
$$BMDE (\hat{\theta}) = E [MDE (\hat{\theta})] $$
where the expectation is taken over a specified prior distribution of $\theta$
4. Bayes CMSE
$$ BCMSE (\hat{\theta}) = E[ CMSE (\hat{\theta})]$$
with respect to a specified prior for $\theta$.
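
The first two risk functions can be approximated by simulation for any estimator. The sketch below assumes the sampling model $X ∼ N(\theta, I)$, which is not stated in this section and is used only for illustration; `mde_and_cmse` and the example $\theta$ are hypothetical.

```python
import numpy as np

def mde_and_cmse(estimator, theta, n_rep=20000, seed=0):
    """Monte Carlo MDE matrix and CMSE (its trace) for X ~ N(theta, I) (assumed)."""
    rng = np.random.default_rng(seed)
    x = theta + rng.standard_normal((n_rep, theta.size))
    err = estimator(x) - theta          # one error vector per replicate
    mde = err.T @ err / n_rep           # average outer product of the errors
    return mde, np.trace(mde)

# For the raw estimator the MDE should be close to the identity matrix.
mde, cmse = mde_and_cmse(lambda x: x, theta=np.array([1.0, -2.0, 0.5]))
```

Averaging the outputs over draws of $\theta$ from a prior would give the corresponding Bayes versions.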




#### Empirical Hierarchical Bayes Method

Let us suppose that the parameters under estimation can be grouped in
a natural way into clusters, and denote the parameters in the $i$-th cluster by $\mu_{i1}, ... , \mu_{ik}$.

A hierarchical prior is a natural one in the situations we consider:
1. $\mu_{i1}, ... , \mu_{ik}$ are iid random variables having a common prior distribution
$$N_1(\lambda_i,\sigma^2_1), i=1, ... ,p$$
2. $\lambda_1, ... , \lambda_p$ are iid random variables having a common prior distribution
$$N_1(k,\sigma^2_2)$$

Using the observations and the above model for
the prior distribution of the parameters, we estimate the parameters $\mu_{ij}$ by

1. applying the J-S procedure separately on the parameters in each cluster,
2. considering the parameters in all clusters together and applying the
J-S procedure in a single step, and
3. the method described based on the hierarchical priors

and compare their relative efficiencies.