# Bayesian Regression

In [section 3](3_Linear_regression_MLE.ipynb), we estimated $\beta$ by maximizing the likelihood of the target variable given the data, $P(y|X,\beta)$. This gave us the normal equation, which is a point estimate of the OLS coefficients. But what if, instead of a point estimate, we are interested in estimating their distribution, i.e. $P(\beta|X,y)$? If we apply Bayes' rule to this, we get

$$
\overbrace{P(\beta|X,y)}^{\text{posterior}} \propto \overbrace{P(y|X,\beta)}^{\text{likelihood}} \overbrace{\;\;\; P(\beta) \;\;\;}^{\text{prior}}
$$

From our earlier discussions, we already know that the likelihood is

\begin{equation*}
    P(y|X,\beta) = \Bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\Bigg)^n \exp \bigg( -\frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - \beta^{T} x_i)^2 \bigg)
\end{equation*}

For the prior, if we assume that all $d$ coefficients are independent and drawn from the same Gaussian distribution with zero mean and constant variance $\nu^2$, we have

\begin{equation*}
    P(\beta) = \Bigg(\frac{1}{\sqrt{2\pi\nu^2}}\Bigg)^d \exp \bigg( -\frac{1}{2\nu^2}\sum^d_{j=1}(\beta_j)^2 \bigg)
\end{equation*}

Combining the two, we get

\begin{equation*}
    P(\beta|X,y) \propto \Bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\Bigg)^n \exp \bigg( -\frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - \beta^{T} x_i)^2 \bigg) \Bigg(\frac{1}{\sqrt{2\pi\nu^2}}\Bigg)^d \exp \bigg( -\frac{1}{2\nu^2}\sum^d_{j=1}(\beta_j)^2 \bigg)
\end{equation*}

and we need to maximize the posterior

$$
    \underset{\beta}{max} \quad P(\beta|X,y)
$$

that is,

$$
    \underset{\beta}{max} \quad \Bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\Bigg)^n \exp \bigg( -\frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - \beta^{T} x_i)^2 \bigg) \Bigg(\frac{1}{\sqrt{2\pi\nu^2}}\Bigg)^d \exp \bigg( -\frac{1}{2\nu^2}\sum^d_{j=1}(\beta_j)^2 \bigg)
$$


Since the variance of the coefficients is assumed to be constant, and keeping all other assumptions from section 3, the normalizing constants do not depend on $\beta$ and can be dropped. This simplifies the problem to


\begin{equation*}
    \underset{\beta}{max} \quad \exp \bigg( -\frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - \beta^{T} x_i)^2 -\frac{1}{2\nu^2}\sum^d_{j=1}(\beta_j)^2 \bigg)
\end{equation*}

Taking the log and converting this to a minimization, we get

\begin{align*}
    & \underset{\beta}{min} \quad \frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - \beta^{T} x_i)^2 + \frac{1}{2\nu^2}\sum^d_{j=1}(\beta_j)^2 \\
    & \underset{\beta}{min} \quad \frac{n}{2\sigma^2}\Bigg(\frac{1}{n}\sum^n_{i=1}(y_i - \beta^{T} x_i)^2 + \frac{\sigma^2}{n\nu^2} \sum^d_{j=1}(\beta_j)^2\Bigg) \\
    & \underset{\beta}{min} \quad \frac{1}{n}\sum^n_{i=1}(y_i - \beta^{T} x_i)^2 + \lambda \sum^d_{j=1}(\beta_j)^2, \qquad \lambda = \frac{\sigma^2}{n\nu^2} \\
    & \underset{\beta}{min} \quad MSE(y, \hat{y}) + \lambda \sum^d_{j=1}(\beta_j)^2
\end{align*}

In matrix form

$$ \underset{\beta}{min} \quad \frac{1}{n}(y - X\beta)^T(y - X\beta) + \lambda \beta^T\beta $$

As you may remember from an earlier section, this is the same as **Ridge Regression**.
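
The equivalence can be checked numerically. The following is a minimal sketch, assuming synthetic data and known values for $\sigma$ and $\nu$ (the data, the seed, and all variable names are illustrative and not from the earlier sections). It computes the ridge solution with the penalty weight implied by the Gaussian prior, $\lambda = \sigma^2 / (n\nu^2)$, and verifies that the gradient of the negative log posterior vanishes there.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: n samples, d features.
n, d = 200, 3
sigma, nu = 0.5, 2.0            # noise std and prior std, assumed known
beta_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Penalty weight implied by the Gaussian prior: MSE + lam * sum(beta_j^2)
lam = sigma**2 / (n * nu**2)

# Closed-form minimizer of (1/n)||y - X beta||^2 + lam * ||beta||^2
beta_map = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

def neg_log_posterior(beta):
    """Negative log posterior, up to an additive constant."""
    resid = y - X @ beta
    return resid @ resid / (2 * sigma**2) + beta @ beta / (2 * nu**2)

# Gradient of the negative log posterior; it should vanish at the MAP estimate.
grad = -X.T @ (y - X @ beta_map) / sigma**2 + beta_map / nu**2

# Plain OLS for comparison (no prior).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

print("MAP / ridge:", beta_map)
print("OLS        :", beta_ols)
print("max |grad| :", np.abs(grad).max())
```

A tighter prior (smaller $\nu$) increases $\lambda$ and pulls the MAP estimate further toward zero, while $\nu \to \infty$ recovers OLS.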

Similarly, if instead of a Gaussian we assume that all coefficients are independent and drawn from the same Laplace distribution with zero mean and constant scale (diversity) $b$

\begin{equation*}
    P(\beta) = \Bigg(\frac{1}{2b}\Bigg)^d \exp \bigg( -\frac{1}{b}\sum^d_{j=1}|\beta_j| \bigg)
\end{equation*}

the combined minimization problem becomes

\begin{equation*}
    \underset{\beta}{min} \quad MSE(y, \hat{y}) + \lambda \sum^d_{j=1}|\beta_j|
\end{equation*}

As you may remember from an earlier section, this is the same as **LASSO Regression**.
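
Unlike the ridge case, this problem has no closed-form solution. Repeating the same steps as for the Gaussian prior (take the negative log, multiply by $2\sigma^2/n$) gives $\lambda = 2\sigma^2/(nb)$. The sketch below finds the MAP estimate with proximal gradient descent (ISTA), which is one of several possible solvers and is chosen here only because it fits in a few lines. The data, seed, and function names are illustrative assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |.|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_objective(X, y, beta, lam):
    """MSE(y, X beta) + lam * sum_j |beta_j|."""
    resid = y - X @ beta
    return resid @ resid / len(y) + lam * np.abs(beta).sum()

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimize the LASSO objective by proximal gradient descent (ISTA)."""
    n, d = X.shape
    # Step size 1/L, where L = (2/n) * sigma_max(X)^2 is the Lipschitz
    # constant of the gradient of the MSE term.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = (-2.0 / n) * (X.T @ (y - X @ beta))
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

rng = np.random.default_rng(1)

# Hypothetical synthetic data with a sparse true coefficient vector.
n, d = 200, 5
sigma, b = 0.5, 0.05            # noise std and Laplace scale, assumed known
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=sigma, size=n)

lam = 2 * sigma**2 / (n * b)    # penalty weight implied by the Laplace prior
beta_lasso = lasso_ista(X, y, lam)

print("lambda:", lam)
print("LASSO / MAP estimate:", beta_lasso)
```

A smaller scale $b$ (a prior concentrated more sharply at zero) gives a larger $\lambda$, and the soft-thresholding step then sets more coefficients to exactly zero.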

## Multiple possibilities with Bayesian

The Bayesian prior is a very powerful idea: it not only lets us restrict the coefficients to a certain distribution, it also opens the way for further control and further analysis.


- **Analysis:** In the first scenario we assumed the coefficients to have constant variance. Instead, we could have assumed a distribution for this variance and modeled its relationship with the coefficients:

$$
P(\beta|X,y) \propto P(y|X,\beta)\:P(\beta|\sigma^2)\:P(\sigma^2)
$$

Using this, we can do more than OLS. Since our base parameters now include the parameters of the variance, we can also model the variance. This can help us compare OLS with Bayesian linear regression on an increasing number of data points. OLS has no notion of uncertainty about its point estimate of $\beta$. Bayesian linear regression does; and, being regularized by its prior, it requires more data to become more certain about the inferred $\beta$.
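
The comparison on an increasing number of data points can be sketched as follows. This is a simplified illustration, not the hierarchical model above: it assumes $\sigma$ and $\nu$ are known constants, in which case the posterior of $\beta$ is Gaussian with covariance $(X^TX/\sigma^2 + I/\nu^2)^{-1}$. The data, seed, and the helper `gaussian_posterior` are hypothetical.

```python
import numpy as np

def gaussian_posterior(X, y, sigma, nu):
    """Posterior mean and covariance of beta for a N(0, nu^2 I) prior
    and Gaussian noise with known standard deviation sigma."""
    d = X.shape[1]
    precision = X.T @ X / sigma**2 + np.eye(d) / nu**2
    cov = np.linalg.inv(precision)
    mean = cov @ X.T @ y / sigma**2
    return mean, cov

rng = np.random.default_rng(2)

# Hypothetical synthetic data; smaller data sets are prefixes of the larger ones.
sigma, nu = 1.0, 1.0
beta_true = np.array([1.0, -0.5])
X_all = rng.normal(size=(1000, 2))
y_all = X_all @ beta_true + rng.normal(scale=sigma, size=1000)

posterior_std = {}
for n in (5, 50, 500):
    X, y = X_all[:n], y_all[:n]
    mean, cov = gaussian_posterior(X, y, sigma, nu)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # point estimate only
    posterior_std[n] = np.sqrt(np.diag(cov))          # per-coefficient uncertainty
    print(f"n={n:4d}  OLS={beta_ols}  post. mean={mean}  post. std={posterior_std[n]}")
```

OLS returns a single vector at every sample size, while the posterior standard deviation starts below the prior value $\nu$ and shrinks as more rows are added.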

- **Incorporate domain knowledge in the model:** If we know what kind of relationship exists between the independent variables and the target variable, we can bring that knowledge in while modeling the problem.

1. If we know that the independent variables can never have an inverse relationship with the target variable, it might make more sense to assume a prior with only positive support.
2. If the relationship is more a matter of counts, a Poisson distribution may be helpful.
3. Similarly, if we have lots of outliers and want to accommodate them, a heavy-tailed distribution such as the Cauchy may be the better choice.
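
As an illustration of the first case, here is a minimal sketch assuming a half-normal prior (a zero-mean Gaussian truncated to $\beta_j \ge 0$), which is only one possible choice of positive prior. Its MAP estimate is the ridge objective restricted to nonnegative coefficients, solved here with projected gradient descent. The data, seed, and function names are illustrative.

```python
import numpy as np

def map_nonnegative(X, y, sigma, nu, n_iter=5000):
    """MAP estimate under a half-normal prior: the ridge objective
    restricted to beta >= 0, solved by projected gradient descent."""
    d = X.shape[1]
    A = X.T @ X / sigma**2 + np.eye(d) / nu**2    # Hessian of the objective
    c = X.T @ y / sigma**2
    step = 1.0 / np.linalg.eigvalsh(A).max()      # 1 / Lipschitz constant
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = A @ beta - c
        beta = np.maximum(beta - step * grad, 0.0)  # project onto beta >= 0
    return beta

def objective(X, y, beta, sigma, nu):
    """Negative log posterior, up to an additive constant."""
    resid = y - X @ beta
    return resid @ resid / (2 * sigma**2) + beta @ beta / (2 * nu**2)

rng = np.random.default_rng(3)

# Hypothetical synthetic data with nonnegative true coefficients.
n, d = 100, 3
sigma, nu = 1.0, 2.0
beta_true = np.array([1.0, 0.0, 0.3])
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# Unconstrained Gaussian-prior (ridge) solution for comparison.
beta_ridge = np.linalg.solve(X.T @ X + (sigma / nu) ** 2 * np.eye(d), X.T @ y)
beta_pos = map_nonnegative(X, y, sigma, nu)

print("unconstrained (Gaussian prior) :", beta_ridge)
print("nonnegative (half-normal prior):", beta_pos)
```

Whenever the unconstrained estimate has a negative component, the positive prior changes the answer; otherwise the two estimates coincide.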