# PS2-3 Bayesian Interpretation of Regularization

### (a) Relation between MAP and MLE.

By the definition of conditional probability and the chain rule, we obtain

\begin{align*}
\theta_{\text{MAP}}&=\arg\max_\theta p(\theta|x,y)\\
&=\arg\max_\theta\frac{p(\theta,x,y)}{p(x,y)}\\
&=\arg\max_\theta\frac{p(y|x,\theta)p(\theta|x)p(x)}{p(x,y)}
\end{align*}

$p(x)$ and $p(x,y)$ do not depend on $\theta$, so they do not affect the $\arg\max$. We also assume $p(\theta|x)=p(\theta)$, thus

\begin{align*}
\theta_{\text{MAP}}&=\arg\max_\theta p(y|x,\theta)p(\theta)
\end{align*}

while

\begin{align*}
\theta_{\text{MLE}}&=\arg\max_\theta p(y|x;\theta)
\end{align*}
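The difference between the two estimators can be seen numerically. The following is a minimal sketch, not part of the original solution: it assumes a scalar $\theta$, a Gaussian likelihood with noise scale `sigma`, and a zero-mean Gaussian prior with scale `eta` (all illustrative choices), and it finds both estimates by brute-force search over a grid.

```python
import numpy as np

# Synthetic 1-D data; the true parameter 2.0 and noise scale 1.0 are assumptions.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 2.0 * x + rng.normal(scale=1.0, size=20)

sigma, eta = 1.0, 0.5                   # assumed noise and prior scales
grid = np.linspace(-5.0, 5.0, 10001)    # candidate values of theta

# log p(y | x, theta) for every candidate theta, up to an additive constant
residuals = y[None, :] - grid[:, None] * x[None, :]
log_lik = -0.5 * np.sum(residuals**2, axis=1) / sigma**2

# log p(theta) for a zero-mean Gaussian prior, up to an additive constant
log_prior = -0.5 * grid**2 / eta**2

theta_mle = grid[np.argmax(log_lik)]              # maximizes the likelihood only
theta_map = grid[np.argmax(log_lik + log_prior)]  # maximizes likelihood times prior

print("theta_MLE:", theta_mle)
print("theta_MAP:", theta_map)
```

Under these assumptions the MAP estimate is pulled toward zero relative to the MLE, because the prior term penalizes large $|\theta|$.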

### (b) MAP estimation with zero-mean Gaussian prior is equivalent to MLE with $L_2$ regularization.

First, since $\log$ is strictly increasing, maximizing the posterior is equivalent to minimizing its negative logarithm:

\begin{align*}
\theta_{\text{MAP}}&=\arg\max_\theta p(y|x,\theta)p(\theta)\\
&=\arg\min_\theta -\log p(y|x,\theta) - \log p(\theta)
\end{align*}

With the prior $\theta\sim\mathcal{N}(0, \eta^2I)$, i.e. $\Sigma=\eta^2I$ and $\theta\in\mathbb{R}^n$, the density is

\begin{align*}
p(\theta) &= \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}\theta^T\Sigma^{-1}\theta\right)\\
&= \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{\theta^T\theta}{2\eta^2}\right)\\
&= \frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{||\theta||_2^2}{2\eta^2}\right)
\end{align*}

Thus, dropping the normalizing constant, which does not depend on $\theta$,

\begin{align*}
\theta_{\text{MAP}}&=\arg\min_\theta \left[-\log p(y|x,\theta) - \left(\log\frac{1}{(2\pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}-\frac{||\theta||_2^2}{2\eta^2}\right)\right]\\
&=\arg\min_\theta \left[-\log p(y|x,\theta) +\frac{||\theta||_2^2}{2\eta^2}\right]
\end{align*}

Therefore, MAP estimation with a zero-mean Gaussian prior over $\theta$ is equivalent to MLE with $L_2$ regularization, i.e. minimizing $-\log p(y|x,\theta)+\lambda||\theta||_2^2$, with $\lambda=\frac{1}{2\eta^2}$.
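The claim that the negative log-prior and the $L_2$ penalty differ only by a constant can be checked numerically. This is a small sketch under assumed values (`eta = 0.7`, dimension 3); the function names are illustrative and not taken from the problem set.

```python
import numpy as np

def neg_log_gaussian_prior(theta, eta):
    """-log p(theta) for theta ~ N(0, eta^2 I), from the general Gaussian density."""
    n = theta.shape[0]
    Sigma = eta**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Sigma)            # log |Sigma|
    quad = theta @ np.linalg.solve(Sigma, theta)    # theta^T Sigma^{-1} theta
    return 0.5 * quad + 0.5 * logdet + 0.5 * n * np.log(2 * np.pi)

def l2_penalty(theta, eta):
    """The regularizer lambda * ||theta||_2^2 with lambda = 1 / (2 eta^2)."""
    return theta @ theta / (2 * eta**2)

rng = np.random.default_rng(0)
eta = 0.7                              # assumed prior scale
thetas = rng.normal(size=(5, 3))       # five arbitrary parameter vectors

# The difference should be the same constant for every theta.
offsets = np.array([neg_log_gaussian_prior(t, eta) - l2_penalty(t, eta) for t in thetas])
print(offsets)
```

Since the offset is identical for every $\theta$, it cannot change the minimizer, which is why it is dropped in the derivation.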

### (c) Closed form solution for $\theta_{\text{MAP}}$ with zero-mean Gaussian prior in linear regression.

From (b) we know

\begin{align*}
\theta_{\text{MAP}}=\arg\min_\theta \left(-\log p(y|x,\theta) +\frac{||\theta||_2^2}{2\eta^2}\right)
\end{align*}

where, assuming the training examples are independent given $\theta$,

\begin{align*}
p(y|x,\theta)=\prod_{i=1}^mp(y^{(i)}|x^{(i)},\theta)
\end{align*}

Since $y^{(i)} = \theta^Tx^{(i)} + \epsilon^{(i)}$, where $\epsilon^{(i)}\sim\mathcal{N}(0,\sigma^2)$ and $\theta^Tx^{(i)}$ is a fixed scalar given $x^{(i)}$ and $\theta$, shifting a Gaussian by a constant gives $y^{(i)}|x^{(i)},\theta \sim \mathcal{N}(\theta^Tx^{(i)}, \sigma^2)$. Therefore,

\begin{align*}
\theta_{\text{MAP}}&=\arg\min_\theta -\sum_{i=1}^m\log\left[\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\right\}\right]+\frac{||\theta||_2^2}{2\eta^2}\\
&=\arg\min_\theta \sum_{i=1}^m\left[\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2} \right] + \frac{||\theta||_2^2}{2\eta^2}
\end{align*}

Next, write this in matrix form, where $X$ is the design matrix whose $i$-th row is $(x^{(i)})^T$ and $\vec{y}$ stacks the $y^{(i)}$:

\begin{align*}
\theta_{\text{MAP}}&=\arg\min_\theta \frac{1}{2\sigma^2}\left(\vec{y}-X\theta\right)^T\left(\vec{y}-X\theta\right) + \frac{\theta^T\theta}{2\eta^2}
\end{align*}

Therefore, the objective function can be written as

\begin{align*}
J(\theta) &= \frac{1}{2\sigma^2}\left(\vec{y}-X\theta\right)^T\left(\vec{y}-X\theta\right) + \frac{\theta^T\theta}{2\eta^2}
\end{align*}

The gradient of $J(\theta)$ w.r.t. $\theta$ is

\begin{align*}
\nabla_\theta J(\theta) &= \frac{1}{\sigma^2}(-X^T)\left(\vec{y}-X\theta\right) + \frac{1}{\eta^2}\theta\\
&= -\frac{1}{\sigma^2}X^T\vec{y} + \frac{1}{\sigma^2}X^TX\theta + \frac{1}{\eta^2}\theta
\end{align*}

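Setting the gradient to zero and multiplying through by $\sigma^2$ gives $\left(X^TX + \frac{\sigma^2}{\eta^2}I\right)\theta = X^T\vec{y}$. Since $X^TX$ is positive semidefinite and $\frac{\sigma^2}{\eta^2}>0$, the matrix on the left is invertible, which yields the closed form solution: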

\begin{align*}
\theta_{\text{MAP}} = \left(X^TX + \frac{\sigma^2}{\eta^2}I\right)^{-1}X^T\vec{y}
\end{align*}
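The closed form can be sketched in NumPy as follows. The data, the values of `sigma` and `eta`, and the function names are assumptions made for illustration; the sketch solves the linear system directly rather than forming the inverse, and then checks that the gradient derived above vanishes at the solution.

```python
import numpy as np

def map_linear_regression(X, y, sigma, eta):
    """Solve (X^T X + (sigma^2 / eta^2) I) theta = X^T y for theta."""
    n = X.shape[1]
    A = X.T @ X + (sigma**2 / eta**2) * np.eye(n)
    return np.linalg.solve(A, X.T @ y)

def grad_J(theta, X, y, sigma, eta):
    """Gradient of J(theta) = ||y - X theta||^2 / (2 sigma^2) + ||theta||^2 / (2 eta^2)."""
    return -(X.T @ (y - X @ theta)) / sigma**2 + theta / eta**2

# Synthetic data; theta_true, sigma and eta are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
sigma, eta = 0.3, 1.0
y = X @ theta_true + rng.normal(scale=sigma, size=50)

theta_map = map_linear_regression(X, y, sigma, eta)
print("theta_MAP:", theta_map)
print("gradient norm at theta_MAP:", np.linalg.norm(grad_J(theta_map, X, y, sigma, eta)))
```

Using `np.linalg.solve` instead of an explicit inverse is a numerical design choice; both correspond to the same formula.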

### (d) MAP estimation with zero-mean Laplace prior is equivalent to MLE with $L_1$ regularization.

From (b) we know

\begin{align*}
\theta_{\text{MAP}}=\arg\min_\theta \left(-\log p(y|x,\theta) - \log p(\theta)\right)
\end{align*}

Here, we assume $\theta\sim\mathcal{L}(0, bI)$, i.e. the $n$ components of $\theta$ are i.i.d. Laplace with zero mean and scale $b$, so the joint density is

\begin{align*}
p(\theta) = \frac{1}{(2b)^n}\exp\left(-\frac{||\theta||_1}{b}\right)
\end{align*}

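Substituting this density and dropping the term $-\log \frac{1}{(2b)^n}$, which does not depend on $\theta$, we get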

\begin{align*}
\theta_{\text{MAP}}&=\arg\min_\theta \left(-\log p(y|x,\theta) - \log \frac{1}{(2b)^n} + \frac{||\theta||_1}{b}\right)\\
&=\arg\min_\theta \left(-\log p(y|x,\theta) + \frac{||\theta||_1}{b}\right)
\end{align*}

Therefore, MAP estimation with a zero-mean Laplace prior over $\theta$ is equivalent to MLE with $L_1$ regularization, with regularization weight $\frac{1}{b}$.
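As with the Gaussian case, the relation between the negative log-prior and the $L_1$ penalty can be checked numerically. This is a minimal sketch with an assumed scale `b = 0.8` and dimension 4; the function names are illustrative.

```python
import numpy as np

def neg_log_laplace_prior(theta, b):
    """-log p(theta) for i.i.d. components theta_i ~ Laplace(0, b)."""
    per_coord_density = np.exp(-np.abs(theta) / b) / (2 * b)
    return -np.sum(np.log(per_coord_density))

def l1_penalty(theta, b):
    """The regularizer ||theta||_1 / b."""
    return np.sum(np.abs(theta)) / b

rng = np.random.default_rng(0)
b = 0.8                                # assumed Laplace scale
thetas = rng.normal(size=(5, 4))       # five arbitrary parameter vectors

# The difference should equal n * log(2b) for every theta.
offsets = np.array([neg_log_laplace_prior(t, b) - l1_penalty(t, b) for t in thetas])
print(offsets)
```

The offset is the constant $n\log(2b)$, so removing it leaves the minimizer unchanged.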

Next, consider the linear regression model from (c). With the Gaussian likelihood derived there, the objective function is

\begin{align*}
J(\theta) &= \frac{1}{2\sigma^2}\left(\vec{y}-X\theta\right)^T\left(\vec{y}-X\theta\right) + \frac{||\theta||_1}{b}
\end{align*}

Multiplying the objective by the positive constant $2\sigma^2$ does not change its minimizer, thus
\begin{align*}
\theta_{\text{MAP}}&=\arg\min_\theta J(\theta)\\
&= \arg\min_\theta \left(\vec{y}-X\theta\right)^T\left(\vec{y}-X\theta\right) + \frac{2\sigma^2}{b}||\theta||_1\\
&= \arg\min_\theta ||X\theta-\vec{y}||_2^2 + \gamma ||\theta||_1
\end{align*}

Therefore, up to a positive scaling that leaves the minimizer unchanged, we can take $J(\theta)=||X\theta-\vec{y}||_2^2 + \gamma ||\theta||_1$, where $\gamma = \frac{2\sigma^2}{b}$.
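The rescaling step can be verified numerically: the two forms of the objective should differ by exactly the factor $2\sigma^2$ at every $\theta$, and therefore share a minimizer. The sketch below assumes arbitrary synthetic data and illustrative values of `sigma` and `b`; the function names are hypothetical.

```python
import numpy as np

def J_map(theta, X, y, sigma, b):
    """Negative log-posterior (up to a constant) with a Laplace prior."""
    r = y - X @ theta
    return r @ r / (2 * sigma**2) + np.sum(np.abs(theta)) / b

def J_lasso(theta, X, y, gamma):
    """Rescaled objective ||X theta - y||_2^2 + gamma * ||theta||_1."""
    r = X @ theta - y
    return r @ r + gamma * np.sum(np.abs(theta))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
sigma, b = 0.5, 2.0                    # assumed noise scale and Laplace scale
gamma = 2 * sigma**2 / b

# Evaluate both objectives at several arbitrary parameter vectors.
thetas = rng.normal(size=(6, 4))
ratios = np.array([J_lasso(t, X, y, gamma) / J_map(t, X, y, sigma, b) for t in thetas])
print(ratios)
```

Every ratio equals $2\sigma^2$, which confirms that the two objectives are positive multiples of each other.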