### Bayesian Interpretation of Regularization

Estimating the mode of the posterior distribution is called *maximum a posteriori* (MAP) estimation. That is,

$$\theta_{\text{MAP}}=\underset{\theta}{\text{argmax}} p(\theta|x,y)$$

Compare this to the *maximum likelihood estimation* (MLE) we have seen previously:

$$\theta_{\text{MLE}}=\underset{\theta}{\text{argmax}} p(y|x,\theta)$$

**(a)**

By the definition of conditional probability, we have:

$$p(\theta|x,y)=\frac{p(x,y,\theta)}{p(x,y)}=\frac{p(y|x,\theta)p(x,\theta)}{p(x,y)}=\frac{p(y|x,\theta)p(\theta|x)p(x)}{p(x,y)}$$

Assume that $p(\theta|x)=p(\theta)$. Then $\theta_{\mathrm{MAP}} = \arg \max_\theta p(y|x,\theta) \ p(\theta)$.

Proof:

\begin{align*}
\theta_{\mathrm{MAP}} & = \arg \max_\theta p(\theta \ \vert \ x, y) \\
                      & = \arg \max_\theta \frac{p(x ,y,\theta)}{p(x,y)} \\
                      & = \arg \max_\theta \frac{p(y \ \vert \ x, \theta) \ p(\theta \ \vert \ x) \ p(x)}{p(x, y)} \\
                      & = \arg \max_\theta \frac{p(y \ \vert \ x, \theta) \ p(\theta) \ p(x)}{p(x, y)} \\
                      & = \arg \max_\theta p(y \ \vert \ x, \theta) \ p(\theta)
\end{align*}

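The fourth equality uses the assumption $p(\theta|x)=p(\theta)$, and the last one drops the factor $p(x)/p(x,y)$, which does not depend on $\theta$.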

**(b)**

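Since $\theta \sim \mathcal{N} (0, \eta^2 I)$, the prior is a Gaussian density with mean $\mu = 0$ and covariance $\Sigma = \eta^2 I$ in $d$ dimensions. Dropping terms that do not depend on $\theta$,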

\begin{align*}
\theta_{\mathrm{MAP}} & = \arg \max_\theta p(y \ \vert \ x, \theta) \ p(\theta) \\
                      & = \arg \min_\theta - \log p(y \ \vert \ x, \theta) - \log p(\theta) \\
                      & = \arg \min_\theta - \log p(y \ \vert \ x, \theta) - \log \Big[ \frac{1}{(2 \pi)^{d / 2} \vert \Sigma \vert^{1/2}} \exp \big( -\frac{1}{2} (\theta - \mu)^T \Sigma^{-1} (\theta - \mu) \big) \Big] \\
                      & = \arg \min_\theta - \log p(y \ \vert \ x, \theta) + \frac{1}{2} \theta^T \Sigma^{-1} \theta \\
                      & = \arg \min_\theta - \log p(y \ \vert \ x, \theta) + \frac{1}{2 \eta^2} \Vert \theta \Vert_2^2 \\
                      & = \arg \min_\theta - \log p(y \ \vert \ x, \theta) + \lambda \Vert \theta \Vert_2^2
\end{align*}

where $\lambda = 1 / (2 \eta^2)$.
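The identity in (b) can be checked numerically. The sketch below is only an illustration, not part of the original solution: the helper name `neg_log_gaussian_prior` and the test values are assumptions. It evaluates $-\log p(\theta)$ for an isotropic Gaussian prior and confirms that the difference between two parameter vectors equals $\lambda (\Vert \theta_a \Vert_2^2 - \Vert \theta_b \Vert_2^2)$, so the normalizing constant plays no role in the argmin.

```python
import numpy as np

def neg_log_gaussian_prior(theta, eta):
    """-log N(theta; 0, eta^2 I), written out term by term (hypothetical helper)."""
    d = theta.size
    # Normalizing constant: (d/2) * log(2*pi) + (1/2) * log|Sigma| with Sigma = eta^2 I.
    const = 0.5 * d * np.log(2.0 * np.pi * eta ** 2)
    # Quadratic term: (1/2) * theta^T Sigma^{-1} theta = ||theta||^2 / (2 eta^2).
    return const + theta @ theta / (2.0 * eta ** 2)

eta = 0.7                       # assumed prior standard deviation
lam = 1.0 / (2.0 * eta ** 2)    # lambda from part (b)

rng = np.random.default_rng(0)
theta_a = rng.normal(size=4)
theta_b = rng.normal(size=4)

# The constant cancels in the difference, leaving only the L2 penalty.
lhs = neg_log_gaussian_prior(theta_a, eta) - neg_log_gaussian_prior(theta_b, eta)
rhs = lam * (theta_a @ theta_a - theta_b @ theta_b)
print(lhs, rhs)
```

The two printed values should agree up to floating-point error, which is what allows the constant to be dropped in the derivation.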

**(c)**

Our model for the whole training set can be written as $\vec{y}=X\theta+\vec{\epsilon}$ where $\vec{\epsilon} \sim \mathcal{N}(0,\sigma^2I)$. Then, $\vec{y}|X,\theta \sim \mathcal{N}(X\theta, \sigma^2I)$. We first write out the likelihood over the $m$ training examples, and then combine it with the result from (b):

\begin{align*}
\theta_{\mathrm{MLE}} & = \arg \max_\theta \prod_{i=1}^m p(y^{(i)}|x^{(i)},\theta) \\
                      & = \arg \max_\theta \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}\text{exp}\{-\frac{1}{2\sigma^2}(y^{(i)}-\theta^T x^{(i)})^2\} \\
                      & = \arg \max_\theta \frac{1}{(2\pi)^{m/2}\sigma^m}\text{exp}\{-\frac{1}{2\sigma^2}\sum_{i=1}^m(y^{(i)}-\theta^T x^{(i)})^2\} \\
                      & = \arg \max_\theta \frac{1}{(2\pi)^{m/2}\sigma^m}\text{exp}\{-\frac{1}{2\sigma^2}\parallel X\theta-\vec y \parallel_2^2\}
\end{align*}

$$\text{log}p(\vec{y}|X,\theta)=-\frac{m}{2}\text{log}(2\pi)-m\text{log} \sigma -\frac{1}{2\sigma^2}\parallel X\theta-\vec y \parallel_2^2$$

\begin{align*}
\theta_{\mathrm{MAP}} & = \arg \min_\theta - \text{log}p(y|x,\theta)+ \lambda \Vert \theta \Vert_2^2 \\
                      & = \arg \min_\theta \frac{1}{2 \sigma^2} (\vec{y} - X \theta)^T (\vec{y} - X \theta) + \frac{1}{2 \eta^2} \Vert \theta \Vert_2^2 \\
                      & = \arg \min_\theta J(\theta)
\end{align*}

By solving

\begin{align*}
\nabla_\theta J(\theta) & = \nabla_\theta \big( \frac{1}{2 \sigma^2} (\vec{y} - X \theta)^T (\vec{y} - X \theta) + \frac{1}{2 \eta^2} \Vert \theta \Vert_2^2 \big) \\
                        & = \frac{1}{2 \sigma^2} \nabla_\theta (\theta^T X^T X \theta - 2 \vec{y}^T X \theta + \frac{\sigma^2}{\eta^2} \theta^T \theta) \\
                        & = \frac{1}{\sigma^2} (X^T X \theta - X^T \vec{y} + \frac{\sigma^2}{\eta^2} \theta) \\
                        & = 0
\end{align*}

we obtain

$$\theta_{\mathrm{MAP}} = (X^T X + \frac{\sigma^2}{\eta^2} I)^{-1} X^T \vec{y}$$
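As a sanity check on the closed form, the following minimal sketch computes $\theta_{\mathrm{MAP}}$ on synthetic data and evaluates $\nabla_\theta J$ at that point. The function names (`map_gaussian_prior`, `grad_J`), the data, and the values of $\sigma$ and $\eta$ are assumptions made for illustration. The linear system is solved directly rather than forming the inverse, which is a common numerical choice and not something the derivation requires.

```python
import numpy as np

def map_gaussian_prior(X, y, sigma, eta):
    """Closed-form MAP estimate for theta ~ N(0, eta^2 I), noise ~ N(0, sigma^2)."""
    n = X.shape[1]
    ridge = sigma ** 2 / eta ** 2
    # Solve (X^T X + (sigma^2 / eta^2) I) theta = X^T y.
    return np.linalg.solve(X.T @ X + ridge * np.eye(n), X.T @ y)

def grad_J(theta, X, y, sigma, eta):
    """Gradient of J from part (c): (X^T X theta - X^T y) / sigma^2 + theta / eta^2."""
    return (X.T @ X @ theta - X.T @ y) / sigma ** 2 + theta / eta ** 2

# Synthetic data (assumed values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
sigma, eta = 0.3, 1.0
y = X @ theta_true + sigma * rng.normal(size=50)

theta_map = map_gaussian_prior(X, y, sigma, eta)
theta_mle = np.linalg.lstsq(X, y, rcond=None)[0]  # unregularized least squares

print("gradient at theta_MAP:", grad_J(theta_map, X, y, sigma, eta))
print("||theta_MAP|| =", np.linalg.norm(theta_map))
print("||theta_MLE|| =", np.linalg.norm(theta_mle))
```

The gradient should vanish up to rounding, and the MAP estimate should have a norm no larger than the least-squares estimate, reflecting the pull of the prior toward zero.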

**(d)**

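Assume $\theta \in \mathbb{R}^n$. Given that each component $\theta_i \sim \mathcal{L} (0, b)$ independently (that is, $\theta \sim \mathcal{L} (0, bI)$) and $y = \theta^T x + \epsilon$ where $\epsilon \sim \mathcal{N} (0, \sigma^2)$, we have

$$p(\theta)=\prod_{i=1}^n \frac{1}{2b}\text{exp}\{-\frac{1}{b} \vert \theta_i \vert \}=\frac{1}{(2b)^n}\text{exp}\{-\frac{1}{b} \Vert \theta \Vert_1 \}$$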

$$\text{log}p(\theta)=-n\text{log}(2b)-\frac{1}{b}\Vert \theta \Vert_1$$

\begin{align*}
\theta_{\mathrm{MAP}} & = \arg \min_\theta \frac{1}{2 \sigma^2} \Vert X\theta -\vec{y} \Vert_2^2 -\text{log}p(\theta) \\
                      & = \arg \min_\theta \frac{1}{2 \sigma^2} \Vert X \theta - \vec{y} \Vert_2^2 + \frac{1}{b} \Vert \theta \Vert_1 \\
                      & = \arg \min_\theta \Vert X \theta - \vec{y} \Vert_2^2 + \frac{2 \sigma^2}{b} \Vert \theta \Vert_1
\end{align*}

Therefore,

$$J(\theta)=\Vert X\theta-\vec y \Vert_2^2 +\gamma \Vert \theta \Vert_1$$

$$\theta_{\mathrm{MAP}}=\arg \min_\theta J(\theta)$$

$$\gamma=\frac{2\sigma^2}{b}$$
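Unlike part (c), this objective has no closed-form minimizer, because $\Vert \theta \Vert_1$ is not differentiable at zero. One standard way to minimize it numerically is proximal gradient descent (ISTA), sketched below. This is a swapped-in technique for illustration and is not part of the original solution. The function names, the synthetic data, and the values of $\sigma$ and $b$ are all assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(theta, X, y, gamma):
    """J(theta) = ||X theta - y||_2^2 + gamma * ||theta||_1."""
    r = X @ theta - y
    return r @ r + gamma * np.abs(theta).sum()

def lasso_ista(X, y, gamma, n_iter=500):
    """Minimize J(theta) with proximal gradient steps (ISTA), starting from zero."""
    theta = np.zeros(X.shape[1])
    # The gradient of ||X theta - y||^2 is 2 X^T (X theta - y); its Lipschitz
    # constant is 2 * (largest singular value of X)^2, so 1/L is a safe step.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ theta - y)
        theta = soft_threshold(theta - step * grad, step * gamma)
    return theta

# Synthetic data with a sparse ground truth (assumed values).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
theta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
sigma, b = 0.5, 0.05
y = X @ theta_true + sigma * rng.normal(size=40)

gamma = 2.0 * sigma ** 2 / b          # gamma from part (d)
theta_hat = lasso_ista(X, y, gamma)
print("theta_hat:", theta_hat)
print("J(theta_hat):", lasso_objective(theta_hat, X, y, gamma))

# A penalty above 2 * max|X^T y| keeps every coordinate at exactly zero.
gamma_big = 2.0 * np.max(np.abs(X.T @ y)) + 1.0
theta_zero = lasso_ista(X, y, gamma_big)
print("theta with large gamma:", theta_zero)
```

The soft-thresholding step is what sets coordinates to exactly zero. A smaller $b$ (a more concentrated Laplace prior) gives a larger $\gamma$ and therefore more zeros.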

**Remark:** Linear regression with $L_2$ regularization is commonly called *ridge regression*, and linear regression with $L_1$ regularization is commonly called *lasso regression*. These regularizers can be applied to any generalized linear model just as above (by replacing $\text{log}p(y|x,\theta)$ with the appropriate family likelihood). Regularization techniques of this type are also called *weight decay* and *shrinkage*. The Gaussian and Laplace priors encourage the parameter values to be closer to their mean (i.e., zero), which results in the shrinkage effect.

**Remark:** Lasso regression (i.e., $L_1$ regularization) is known to result in sparse parameters, where most of the parameter values are zero and only some are non-zero.