## [Empirical Bayes method](https://en.wikipedia.org/wiki/Empirical_Bayes_method)

Empirical Bayes uses the data to set the hyperparameters of the prior. Bayesian inference with this fitted prior shrinks the individual estimates towards a common value, and can be viewed as an approximation to a hierarchical Bayesian model.


Empirical Bayes (EB) ignores the uncertainty in the hyper-parameters, whereas a hierarchical Bayesian model (HBM) attempts to include it in the analysis. An HBM is a good idea when there is little data and hence significant uncertainty in the hyper-parameters, which must be accounted for. For large datasets, on the other hand, EB becomes more attractive: it is generally less computationally expensive, and the volume of data often means the results are much less sensitive to the hyper-parameter settings.



Empirical Bayes methods are procedures for statistical inference in which the prior distribution is estimated from the data. This approach stands in contrast to standard Bayesian methods, for which the prior distribution is fixed before any data are observed. Despite this difference in perspective, empirical Bayes may be viewed as an approximation to a fully Bayesian treatment of a hierarchical model wherein the parameters at the highest level of the hierarchy are set to their most likely values, instead of being integrated out. Empirical Bayes, also known as maximum marginal likelihood, represents one approach for setting hyperparameters.

###  [James-Stein estimators](http://www.stats.ox.ac.uk/~reinert/stattheory/chapter1107.pdf)


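Assume that $x|θ ∼ N(θ, I_p)$ and that $θ_i ∼ N(0, τ^2)$ independently for $i = 1, \dots, p$. Then the marginal distribution of the data is $p(x|τ^2) = N(0,(1 + τ^2)I_p)$, and
the posterior for $θ$ given the data is
$$θ|x ∼ N\left(\frac{τ^2}{1 + τ^2} x, \frac{τ^2}{1 + τ^2} I_p\right)$$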

Under quadratic loss, the Bayes estimator $δ(x)$ of $θ$ is the posterior
mean $τ^2/(1 + τ^2) x$.
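
A short derivation of this posterior mean, sketched for a single coordinate and assuming the likelihood $x_i|θ_i ∼ N(θ_i, 1)$ (the sampling model implied by the marginal variance $1 + τ^2$), completes the square in $θ_i$:

$$
\begin{aligned}
p(θ_i|x_i) &\propto \exp\left(-\tfrac{1}{2}(x_i - θ_i)^2\right)\exp\left(-\frac{θ_i^2}{2τ^2}\right) \\
&\propto \exp\left(-\frac{1 + τ^2}{2τ^2}θ_i^2 + x_i θ_i\right) \\
&\propto \exp\left(-\frac{1 + τ^2}{2τ^2}\left(θ_i - \frac{τ^2}{1 + τ^2}x_i\right)^2\right)
\end{aligned}
$$

So each coordinate has posterior mean $τ^2 x_i/(1 + τ^2)$ and posterior variance $τ^2/(1 + τ^2)$: the observation is shrunk towards the prior mean 0 by the factor $τ^2/(1 + τ^2)$.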

In the empirical Bayes approach, we would use the m.l.e. of $τ^2$ under the marginal distribution $p(x|τ^2)$, which is $\hat{τ}^2 = (||x||^2/p - 1)^+$. The empirical Bayes estimator is the estimated posterior mean,
$$δ^{EB}(x) = \frac{\hat{τ}^2}{1 + \hat{τ}^2} x = \left(1 - \frac{p}{||x||^2}\right)^+ x,$$
which is a truncated James-Stein estimator. It can be shown to
outperform the estimator $δ(x) = x$.
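
The truncated estimator can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: it assumes a single observation vector $x ∼ N(θ, I_p)$ with unit noise variance, and the function name `eb_truncated_estimate` and the example vector are made up for the purpose.

```python
import numpy as np

def eb_truncated_estimate(x):
    """Empirical Bayes estimate of theta for x ~ N(theta, I_p), theta_i ~ N(0, tau^2)."""
    x = np.asarray(x, dtype=float)
    p = x.size
    sq_norm = float(x @ x)
    # Marginally x ~ N(0, (1 + tau^2) I_p), so the m.l.e. of 1 + tau^2 is ||x||^2 / p;
    # tau^2 cannot be negative, hence the truncation at zero.
    tau2_hat = max(sq_norm / p - 1.0, 0.0)
    shrink = tau2_hat / (1.0 + tau2_hat)  # equals (1 - p / ||x||^2)^+
    return shrink * x, tau2_hat

x = np.array([3.0, -1.0, 2.0, 0.5, -2.5])  # made-up observation
theta_hat, tau2_hat = eb_truncated_estimate(x)
print(tau2_hat, theta_hat)
```

When $||x||^2 ≤ p$ the estimated prior variance is zero and the estimate collapses to the origin, which is what the positive part in the formula expresses.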

Alternatively, the best unbiased estimator of $1/(1 + τ^2)$ is $(p - 2)/||x||^2$,
giving
$$δ^{EB}(x) = \left(1 - \frac{p - 2}{||x||^2}\right) x$$
This is the James-Stein estimator. It can be shown that, under
quadratic loss and for $p ≥ 3$, the James-Stein estimator has a risk function
that is uniformly better than that of $δ(x) = x$.
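
The risk comparison can be checked numerically. The sketch below is a Monte Carlo illustration, not a proof: the dimension, the true mean vector `theta`, the seed, and the number of trials are arbitrary choices, and the estimator is the James-Stein form $(1 - (p - 2)/||x||^2)x$.

```python
import numpy as np

def james_stein(x):
    """James-Stein estimate for each row of x, each row being one draw of x ~ N(theta, I_p)."""
    p = x.shape[-1]
    sq_norm = np.sum(x**2, axis=-1, keepdims=True)
    return (1.0 - (p - 2) / sq_norm) * x

rng = np.random.default_rng(0)       # fixed seed, chosen arbitrarily
p, n_trials = 10, 20_000             # illustrative values
theta = np.full(p, 0.5)              # hypothetical true mean vector

x = theta + rng.standard_normal((n_trials, p))
# Average squared-error loss over the simulated draws approximates the risk.
risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))
print(f"risk of x: {risk_mle:.2f}, risk of James-Stein: {risk_js:.2f}")
```

The risk of $δ(x) = x$ is $p$ whatever $θ$ is; the gain from shrinkage is largest when $θ$ is close to the origin and fades as $||θ||$ grows.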

>Note: both estimators tend to "shrink" towards 0. It is now known
to be a very general phenomenon that when comparing three or more
populations, the sample mean is not the best estimator. "Shrinkage"
estimators are an active area of research.


### [Hierarchical vs Empirical Bayesian methods](http://www.stats.ox.ac.uk/~reinert/stattheory/chapter1107.pdf)

Hierarchical Bayes: $θ|β ∼ π_1(θ|β)$, where $β ∼ π_2(β)$,
so that $π(θ) = \int π_1(θ|β)π_2(β)dβ$.



To simulate from a hierarchical model, we first simulate $β$ from $π_2$
and then, given $β$, simulate $θ$ from $π_1(θ|β)$. We hope that the distribution of
$β$ is easy to simulate from, and also that the conditional distribution of $θ$
given $β$ is easy to simulate from. This approach is particularly useful for
MCMC (Markov chain Monte Carlo) methods.
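
The two-step simulation can be sketched as follows. The particular distributions are assumptions made for illustration only: $β$ is taken to be a precision with a Gamma hyper-prior (shape `a`, rate `b`), and $θ|β ∼ N(0, 1/β)$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n = 200_000
a, b = 3.0, 2.0                  # hypothetical shape and rate of the Gamma hyper-prior

# Step 1: simulate the hyper-parameter beta (here a precision) from pi_2.
# NumPy parameterizes the Gamma distribution by shape and scale = 1 / rate.
beta = rng.gamma(shape=a, scale=1.0 / b, size=n)
# Step 2: given beta, simulate theta from pi_1(theta | beta) = N(0, 1 / beta).
theta = rng.normal(loc=0.0, scale=1.0 / np.sqrt(beta))

# The theta draws are samples from the marginal prior pi(theta); for this choice
# its variance is E[1 / beta] = b / (a - 1).
print(theta.var(), b / (a - 1))
```

Neither step needs the integral $\int π_1(θ|β)π_2(β)dβ$ in closed form, which is the practical appeal of the construction.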


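Empirical Bayes: the prior $π(θ|λ)$ depends on a hyper-parameter $λ$, and the marginal distribution of the data is
$p(x|λ) = \int f(x|θ)π(θ|λ)dθ$

Rather than specifying $λ$, we estimate it by $\hat{λ}$ based on $p(x|λ)$, for example by frequentist methods. The resulting $π(θ|x, \hat{λ})$
is called a pseudo-posterior.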


The empirical Bayes approach
- is neither fully Bayesian nor fully frequentist;
- depends on $\hat{λ}$: different choices of $\hat{λ}$ lead to different procedures;
- leads asymptotically to a coherent Bayesian analysis if $\hat{λ}$ is consistent;
- often outperforms classical estimators in empirical terms.
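
As one concrete instance of estimating $λ$ from the marginal distribution, the sketch below uses a Gamma-Poisson model, which is an assumption chosen for illustration and does not come from the notes: $x_i|θ_i ∼ \text{Poisson}(θ_i)$ and $θ_i ∼ \text{Gamma}(a, b)$ with rate $b$, so $λ = (a, b)$. The frequentist step is a method-of-moments fit to the marginal, and the function name and counts are hypothetical.

```python
import numpy as np

def gamma_poisson_eb(x):
    """Pseudo-posterior means for x_i | theta_i ~ Poisson(theta_i), theta_i ~ Gamma(a, rate=b)."""
    x = np.asarray(x, dtype=float)
    m, v = x.mean(), x.var(ddof=1)
    if v <= m:
        raise ValueError("no overdispersion: moment estimates of (a, b) do not exist")
    # Marginally E[x] = a / b and Var[x] = a / b + a / b^2; match both moments.
    b_hat = m / (v - m)
    a_hat = m * b_hat
    # Conjugacy: theta_i | x_i, (a, b) ~ Gamma(a + x_i, rate = b + 1).
    post_mean = (a_hat + x) / (b_hat + 1.0)
    return a_hat, b_hat, post_mean

counts = np.array([0, 1, 1, 2, 3, 5, 8, 12])  # made-up counts for illustration
a_hat, b_hat, post_mean = gamma_poisson_eb(counts)
print(a_hat, b_hat, post_mean)
```

Each pseudo-posterior mean is a weighted average of the observed count and the overall mean, so the estimates are pulled towards the centre of the data. A different choice of $\hat{λ}$, such as the marginal m.l.e., would give a different amount of shrinkage.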

#### [Empirical and Hierarchical Bayes](https://www.cs.ubc.ca/~schmidtm/Courses/540-W16/L19.pdf)

> Why the beta distribution? It’s a flexible distribution that includes the uniform distribution as a special case (when $α = 1$ and $β = 1$).

- Likelihood $p(x|θ)$: probability of seeing the data given the parameters.
- Prior $p(θ|α, β)$: belief that the parameters are correct before we’ve seen data.
- Posterior $p(θ|x, α, β)$: probability that the parameters are correct after we’ve seen data.
- Posterior predictive $p(\hat{x}|x, α, β)$: probability of new data given old, integrating over the parameters; tells us which prediction is most likely given the data and prior.
- Marginal likelihood (also called evidence) $p(x|α, β)$: probability of seeing the data given the hyper-parameters.

Hyper-parameters $α$ and $β$ are like “pseudo-counts” in our mind before we flip the coin.
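
For coin flips with a beta prior, the posterior, posterior predictive and marginal likelihood all have closed forms. The sketch below assumes a Bernoulli likelihood with `n1` heads and `n0` tails; the function names and the example counts are illustrative.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_bernoulli(n1, n0, alpha, beta):
    """Summaries for n1 heads and n0 tails under a Beta(alpha, beta) prior."""
    # The pseudo-counts alpha and beta are simply added to the observed counts.
    post = (alpha + n1, beta + n0)
    # Posterior predictive probability that the next flip is heads.
    pred_heads = post[0] / (post[0] + post[1])
    # Log marginal likelihood of one particular sequence with n1 heads and n0 tails.
    log_evidence = log_beta(*post) - log_beta(alpha, beta)
    return post, pred_heads, log_evidence

post, pred_heads, log_ev = beta_bernoulli(n1=3, n0=1, alpha=1.0, beta=1.0)
print(post, pred_heads, exp(log_ev))
```

Working with `lgamma` rather than the Gamma function itself keeps the evidence computation stable when the counts are large.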

##### Learning the Prior from Data:

_Can we use the data to set the hyper-parameters?_

- In theory: No!
    - It would not be a “prior”. 
- In practice: Yes!
    - Approach 1: use a validation set or cross-validation as before.
    - Approach 2: optimize the marginal likelihood,
$$ p(y|X, λ) = \int_w p(y|X, w)p(w|λ)dw$$

This is also called type II maximum likelihood, evidence maximization, or empirical Bayes.
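
For a linear-Gaussian model the integral over $w$ is available in closed form, so the marginal likelihood can be evaluated and maximized directly. The sketch below rests on assumptions not stated in the notes: $y = Xw + ε$ with known noise variance `sigma2`, prior $w ∼ N(0, λ^{-1}I)$, synthetic data, and a simple grid search standing in for a proper optimizer.

```python
import numpy as np

def log_evidence(X, y, lam, sigma2=1.0):
    """log p(y | X, lam) for y = X w + noise, w ~ N(0, I / lam), noise ~ N(0, sigma2 I)."""
    n = X.shape[0]
    # Integrating out w gives y ~ N(0, sigma2 I + X X^T / lam).
    C = sigma2 * np.eye(n) + (X @ X.T) / lam
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)                    # fixed seed, chosen arbitrarily
X = rng.standard_normal((50, 5))                  # synthetic design matrix
w_true = rng.normal(scale=0.5, size=5)            # weights drawn with prior precision 4
y = X @ w_true + rng.standard_normal(50)

grid = np.logspace(-3, 3, 61)                     # candidate values of lam
scores = np.array([log_evidence(X, y, lam) for lam in grid])
lam_hat = grid[np.argmax(scores)]
print(f"lam maximizing the evidence on the grid: {lam_hat:.3g}")
```

Unlike cross-validation, this uses all of the data at once and needs no held-out set, which is what makes it practical to tune many hyper-parameters in this way.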

##### Overview of Bayesian Variable Selection

[Empirical Bayes vs. fully Bayes variable selection](http://www-stat.wharton.upenn.edu/~edgeorge/Research_papers/CG%20JSPI%202008.pdf)

- If we fix $λ$ and use L1-regularization (Bayesian lasso), the posterior is not sparse.
The probability that a variable is exactly 0 is zero.
L1-regularization only leads to sparsity because the MAP point estimate is sparse.
- Type II maximum likelihood leads to sparsity in the posterior because the variance
goes to zero.
    - Weird fact: it yields sparse solutions (automatic relevance determination).
We can send $λ_j → ∞$, concentrating the posterior for $w_j$ at 0.
This is L2-regularization, but empirical Bayes naturally encourages sparsity.
(It is non-convex and the theory is not well understood, but recent work shows that it
never performs worse than L1-regularization, and there exist cases where it does better.)
- We can encourage sparsity in Bayesian models using a spike and slab prior:
    - A mixture of a Dirac delta function at 0 and another prior with non-zero variance. It places non-zero posterior weight at exactly 0. The posterior is still non-sparse, but it answers the question “what is the probability that a variable is non-zero?”
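
The spike and slab posterior weight can be computed exactly in the simplest case of a single coefficient. The sketch below assumes one observation $x ∼ N(w, σ^2)$ and the prior $w ∼ (1 - π)δ_0 + π N(0, s^2)$; the parameter names and default values are hypothetical.

```python
import numpy as np

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

def inclusion_probability(x, prior_inclusion=0.5, slab_var=3.0, noise_var=1.0):
    """P(w != 0 | x) for x ~ N(w, noise_var) under a spike and slab prior on w."""
    # If w = 0 (spike), then x is pure noise.
    spike = (1 - prior_inclusion) * normal_pdf(x, noise_var)
    # If w comes from the slab, then marginally x ~ N(0, noise_var + slab_var).
    slab = prior_inclusion * normal_pdf(x, noise_var + slab_var)
    return slab / (spike + slab)

for x in (0.0, 1.0, 3.0):
    print(x, inclusion_probability(x))
```

An observation near zero lowers the probability that the variable is non-zero below its prior value, while a large observation pushes it towards 1.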
    


##### Bayesian Model Selection and Averaging

Bayesian model selection (“type II MAP”): maximize the hyper-parameter posterior,
which takes us further away from overfitting (thus allowing more complex models).

Bayesian model averaging considers the posterior over hyper-parameters.
We could also maximize the marginal likelihood of the hyper-prior’s parameters $γ$ (“type III ML”).
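
The difference between selecting and averaging can be illustrated on a small discrete set of hyper-parameter settings. Everything specific in the sketch below is an assumption: a coin-flip model with a beta prior, three candidate $(α, β)$ pairs, a uniform hyper-prior over them, and made-up counts.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

n1, n0 = 7, 3                                         # made-up data: 7 heads, 3 tails
candidates = [(1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]   # hypothetical (alpha, beta) settings
hyper_prior = [1 / 3, 1 / 3, 1 / 3]                   # uniform hyper-prior over them

# Marginal likelihood p(x | alpha, beta) for each candidate.
evidence = [exp(log_beta(a + n1, b + n0) - log_beta(a, b)) for a, b in candidates]
# Hyper-parameter posterior, proportional to evidence times hyper-prior.
unnorm = [e * h for e, h in zip(evidence, hyper_prior)]
weights = [u / sum(unnorm) for u in unnorm]

# Posterior predictive p(heads | x, alpha, beta) under each candidate.
pred = [(a + n1) / (a + b + n1 + n0) for a, b in candidates]

best = max(range(len(candidates)), key=lambda k: weights[k])
pred_selected = pred[best]                                   # model selection (type II MAP)
pred_averaged = sum(w * q for w, q in zip(weights, pred))    # model averaging
print(candidates[best], pred_selected, pred_averaged)
```

Selection commits to the single most probable setting, whereas averaging weights every setting's prediction by its posterior probability, so uncertainty about the hyper-parameters carries through to the prediction.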

- The posterior predictive lets us directly model what we want given the hyper-parameters.
- The marginal likelihood is the probability of seeing the data given the hyper-parameters.
- Empirical Bayes optimizes this to set the hyper-parameters:
    - Allows tuning a large number of hyper-parameters.
    - Bayesian Occam’s razor: naturally encourages sparsity and simplicity.
- Hierarchical Bayes goes even more Bayesian with a prior on the hyper-parameters.
    - Leads to Bayesian model selection and Bayesian model averaging.

