__[Shrinkage and Regularized Regression](https://jrnold.github.io/bayesian_notes/shrinkage-and-regularized-regression.html)__

# Shrinkage


Shrinkage is implicit in Bayesian inference and penalized likelihood inference, and explicit in James–Stein-type inference. In contrast, simple types of maximum-likelihood and least-squares estimation procedures do not include shrinkage effects, although they can be used within shrinkage estimation schemes.

####  [Stein's paradox](http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf)


> The best guess about the future is usually obtained by computing the average of past events. Stein's paradox defines circumstances in which there are estimators better than the arithmetic average.



Stein's paradox, in decision theory and estimation theory, is the phenomenon that when three or more parameters are estimated simultaneously, there exist combined estimators more accurate on average (that is, having lower expected mean squared error) than any method that handles the parameters separately.


An intuitive explanation is that optimizing for the mean-squared error of a combined estimator is not the same as optimizing for the errors of separate estimators of the individual parameters. In practical terms, if the combined error is in fact of interest, then a combined estimator should be used, even if the underlying parameters are independent. On the other hand, if one is instead interested in estimating an individual parameter, then using a combined estimator does not help and is in fact worse.
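The claim about combined error can be checked with a small simulation. The sketch below is illustrative only: it assumes one observation per parameter with known unit variance, uses the positive-part James-Stein estimator shrinking toward the origin, and the parameter vector `theta` is made up for the example.

```python
import random

def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein estimate, shrinking x toward the origin."""
    p = len(x)
    norm2 = sum(v * v for v in x)
    # Shrinkage factor, clipped at zero so the sign of x is never flipped.
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / norm2)
    return [factor * v for v in x]

def total_sq_error(estimate, truth):
    """Sum of squared errors over all coordinates (the combined loss)."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

def simulate(theta, n_rep=5000, seed=0):
    """Average combined loss of the raw observations and of James-Stein."""
    rng = random.Random(seed)
    mle_err = js_err = 0.0
    for _ in range(n_rep):
        x = [rng.gauss(t, 1.0) for t in theta]  # one noisy observation each
        mle_err += total_sq_error(x, theta)
        js_err += total_sq_error(james_stein(x), theta)
    return mle_err / n_rep, js_err / n_rep

# Hypothetical true means, chosen only for illustration.
theta = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, -2.0, 0.3, 0.8]
mle_risk, js_risk = simulate(theta)
print(f"separate estimates: {mle_risk:.2f}   James-Stein: {js_risk:.2f}")
```

The comparison is on the summed squared error across all coordinates; for any single coordinate the shrunken estimate can be worse, which is the point made in the paragraph above.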

#### [Implications of Stein’s Paradox for Environmental Standard Compliance Assessment](https://pubs.acs.org/doi/pdf/10.1021/acs.est.5b00656)

The implications of Stein’s paradox stirred considerable debate in statistical circles when the concept was first introduced in the 1950s. 

The paradox arises when we are interested in estimating the means of several variables simultaneously. In this situation, the best estimator for an individual mean, the sample average, is no longer the best. 

Rather, a shrinkage estimator, which shrinks individual sample averages toward the overall average, is shown to have improved overall accuracy.

Although controversial at the time, the concept of shrinking toward the overall average is now widely accepted as good practice for improving statistical stability and reducing error, not only in simple estimation problems but also in complicated modeling problems.

In this essay, we introduce Stein’s paradox and its **modern generalization, the Bayesian hierarchical model**. 
A Bayesian hierarchical model can improve overall estimation accuracy, thereby improving our confidence in the assessment results, especially for standard compliance assessment of waters with **small sample sizes.**
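Shrinking toward the overall average can be sketched in a few lines. This is not a full hierarchical Bayesian fit; it is the simpler empirical-Bayes (James-Stein type) estimator that shrinks toward the grand mean, under the assumption that every site mean has the same known sampling variance `se2`. The site means are hypothetical numbers, not data from the essay.

```python
from statistics import mean

def shrink_to_grand_mean(sample_means, se2):
    """Shrink each group mean toward the grand mean.

    se2 is the (assumed known and common) sampling variance of a group mean.
    Returns the shrinkage factor and the shrunken estimates.
    """
    k = len(sample_means)
    if k < 4:
        raise ValueError("this estimator needs at least four groups")
    grand = mean(sample_means)
    spread = sum((m - grand) ** 2 for m in sample_means)
    # Factor near 1 keeps the raw means; factor near 0 pools everything.
    factor = max(0.0, 1.0 - (k - 3) * se2 / spread)
    return factor, [grand + factor * (m - grand) for m in sample_means]

# Hypothetical site averages, each based on a small sample.
site_means = [4.2, 6.1, 5.0, 7.3, 3.9, 5.5]
shrink_factor, shrunk = shrink_to_grand_mean(site_means, se2=1.0)
print(round(shrink_factor, 3), [round(s, 2) for s in shrunk])
```

The noisier the individual means are relative to their spread (larger `se2`), the smaller the factor and the stronger the pull toward the overall average, which is why the effect matters most for small sample sizes.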

[Connection between Stein's paradox, ridge regression, and random effects in mixed models](https://stats.stackexchange.com/questions/122062/unified-view-on-shrinkage-what-is-the-relation-if-any-between-steins-paradox)

#### Effects of priors
  

Reasonable priors: uninformative (constant prior); for scale parameters in $[0, \infty)$, uniform in the log of the parameter (Jeffreys' prior).
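As a short worked step, "uniform in the log of the parameter" gives the familiar $1/\sigma$ form by a change of variables. The symbol $\sigma$ for a generic scale parameter is notation assumed here, not taken from the source:

$$
p(\log \sigma) \propto 1
\quad\Longrightarrow\quad
p(\sigma) = p(\log \sigma)\,\left|\frac{d \log \sigma}{d\sigma}\right| \propto \frac{1}{\sigma},
\qquad \sigma \in (0, \infty).
$$

This density does not integrate to a finite value, so it is an improper prior and the resulting posterior should be checked for propriety.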
  


![image.png](attachment:image.png)

# [Bayesian shrinkage](https://arxiv.org/pdf/1212.6088.pdf)

Interestingly, we demonstrate that most commonly used shrinkage priors, including
the Bayesian Lasso, are suboptimal in high-dimensional settings. A new class of Dirichlet-Laplace
(DL) priors is proposed; these priors are optimal and lead to efficient posterior computation exploiting
results from normalized random measure theory. Finite sample performance of Dirichlet-Laplace
priors relative to alternatives is assessed in simulations.

[Can we create a prior that 1) puts most of the posterior mass at zero for small signals and 2) leaves large signals unshrunk?](http://www.jarad.me/courses/stat615/slides/Hierarchical/Hierarchical1.pdf)

Point-mass prior with t-distribution 
![Screenshot%20from%202018-11-20%2011-19-51.png](attachment:Screenshot%20from%202018-11-20%2011-19-51.png)
![Screenshot%20from%202018-11-20%2011-20-01.png](attachment:Screenshot%20from%202018-11-20%2011-20-01.png)


- Heavy tails allow the likelihood to easily overwhelm the prior.
- A peak allows “complete” shrinkage.
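The two bullets can be illustrated numerically. The sketch below assumes a single observation $y \sim N(\theta, 1)$ and a prior that mixes a point mass at zero with a Student-t slab; the mixing weight `w = 0.5` and `df = 3` are arbitrary choices for illustration, not values from the slides. Integrals are approximated on a plain grid.

```python
import math

def normal_pdf(x, mu=0.0, sd=1.0):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def student_t_pdf(x, df=3.0):
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def posterior_summary(y, w=0.5, df=3.0, half_width=60.0, step=0.01):
    """Posterior P(theta = 0 | y) and E[theta | y] under the mixture prior."""
    n = int(2 * half_width / step)
    grid = [-half_width + i * step for i in range(n + 1)]
    # Likelihood times slab density at each grid point.
    joint = [normal_pdf(y, mu=t) * student_t_pdf(t, df) for t in grid]
    slab_evidence = sum(joint) * step
    slab_first_moment = sum(t * j for t, j in zip(grid, joint)) * step
    evidence = w * normal_pdf(y) + (1 - w) * slab_evidence
    prob_zero = w * normal_pdf(y) / evidence
    post_mean = (1 - w) * slab_first_moment / evidence
    return prob_zero, post_mean

prob_zero_small, post_mean_small = posterior_summary(0.5)  # small signal
prob_zero_large, post_mean_large = posterior_summary(8.0)  # large signal
print(f"y=0.5: P(zero)={prob_zero_small:.3f}, mean={post_mean_small:.3f}")
print(f"y=8.0: P(zero)={prob_zero_large:.2e}, mean={post_mean_large:.3f}")
```

For the small observation much of the posterior mass sits on the point mass and the posterior mean is pulled strongly toward zero; for the large observation the heavy-tailed slab lets the likelihood dominate, so the posterior mean stays close to the observation.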

High-dimensional data have become commonplace in broad application areas, and there is an exponentially
increasing literature on statistical and computational methods for big data. In such
settings, it is well known that classical methods such as maximum likelihood estimation break
down, motivating a rich variety of alternatives based on penalization and thresholding.

Most penalization
approaches produce a point estimate of a high-dimensional coefficient vector, which
has a Bayesian interpretation as corresponding to the mode of a posterior distribution obtained
under a shrinkage prior. 

For example, the wildly popular **Lasso/L1 regularization approach to regression**
is equivalent to maximum a posteriori (MAP) estimation under a Gaussian linear
regression model having a double exponential (Laplace) prior on the coefficients.
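The equivalence is easiest to see with a single coefficient. Assuming one observation $y \sim N(\beta, 1)$ and a Laplace prior $p(\beta) \propto \exp(-\lambda |\beta|)$, the negative log posterior is $\tfrac{1}{2}(y - \beta)^2 + \lambda|\beta|$ up to a constant, which is the Lasso objective, and its minimizer is the soft-thresholding rule. The sketch below checks this against a brute-force grid search; the function names and test values are illustrative.

```python
import math

def soft_threshold(y, lam):
    """Closed-form Lasso solution for a single coefficient."""
    return math.copysign(max(abs(y) - lam, 0.0), y)

def neg_log_posterior(beta, y, lam):
    """Gaussian likelihood (unit variance) plus Laplace prior, constants dropped."""
    return 0.5 * (y - beta) ** 2 + lam * abs(beta)

def grid_map(y, lam, lo=-10.0, hi=10.0, step=0.001):
    """MAP estimate found by brute force over a fine grid."""
    n = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n + 1)]
    return min(candidates, key=lambda b: neg_log_posterior(b, y, lam))

for y in (3.0, 0.4, -2.5):
    print(y, soft_threshold(y, 1.0), round(grid_map(y, 1.0), 3))
```

Observations smaller in magnitude than $\lambda$ are mapped exactly to zero, which is the sparsity property of the Lasso; note that this is a statement about the posterior mode only, not about the rest of the posterior.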

However, in many applications, it is crucial to
be able to obtain a realistic characterization of uncertainty in the parameters, in functionals of
the parameters and in predictions. Usual frequentist approaches to characterize uncertainty, such
as constructing asymptotic confidence regions or using the **bootstrap**, can break down in high-dimensional
settings. For example, in regression when the number of subjects n is much less than
the number of predictors p, one **cannot naively appeal to asymptotic normality, and resampling
from the data may not provide an adequate characterization of uncertainty**.

Given that most shrinkage estimators correspond to the mode of a Bayesian posterior, we can use the whole posterior distribution to provide a probabilistic measure
of uncertainty. In addition to providing a characterization of uncertainty, taking a Bayesian perspective has
distinct advantages in terms of tuning parameter choice, allowing key penalty parameters to be
marginalized over the posterior distribution **instead of relying on cross-validation.** Also, by inducing
penalties through shrinkage priors, important new classes of penalties can be discovered that
may outperform usual $L_q$-type choices.

### Shrinkage 
- [Scalable MCMC for Bayes Shrinkage Priors](http://stanford.edu/~pauloo/talk/scalable_mcmc_bayes_shrinkage/slides/scalable_mcmc_bayes_shrinkage.pdf)
- [Machine learning, shrinkage estimation, and economic theory](https://maxkasy.github.io/home/files/slides/habilitationsvortrag-slides-kasy.pdf)
- [James–Stein Estimation and Ridge Regression](http://statweb.stanford.edu/~ckirby/brad/other/CASI_Chap7_Nov2014.pdf)
- [Effect and shrinkage estimation in meta-analyses of two studies](http://www.biometrische-gesellschaft.de/fileadmin/AG_Daten/BayesMethodik/workshops_etc/2016-12_Mainz/Roever2016-slides.pdf)
- [Shrinkage estimation](http://strimmerlab.org/courses/2005-06/seminar/slides/christoph-2x4.pdf)
- [Understanding shrinkage and how to circumvent it](http://monolix.lixoft.com/faq/understanding-shrinkage-circumvent/)