## Linear Regression Model

#### Linear Regression Model
<hr>
* Given data set: $\large{\mathcal{D} = \{ x_{i}, y_{i} \}_{i=1}^N}$
<br>
<br>
$$y_i =  \sum_{m=0}^{M-1} w_m x^m_i + \mathcal{E}_i \quad \leftarrow \text{(M-1)-degree polynomial function fitting}$$
<br>
<center>
<img src='lr.png' width=500/>
</center>

#### Linear Regression Model
<hr>
* Given data set: $\large{\mathcal{D} = \{ x_{i}, y_{i} \}_{i=1}^N}$
<br>
<br>
$$y_i =  W^{\top}X_i + \mathcal{E}_i, \quad \text{where} $$

<br>
$$W = (w_0, w_1, \dots, w_{M-1}) \in \mathbb{R}^{M}\quad \text{and}\quad X_i = (x^0_i, x^1_i, \dots, x^{M-1}_i) \in \mathbb{R}^{M}$$
<br>
<center>
<img src='lr.png' width=500/>
</center>
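
The vector form can be sketched in NumPy by stacking the rows $X_i$ into a design matrix. This is only an illustration: the helper name `design_matrix`, the inputs and the weights are hypothetical and not part of the original material.

```python
import numpy as np

def design_matrix(x, M):
    """Return the N x M matrix whose i-th row is X_i = (x_i^0, ..., x_i^{M-1})."""
    x = np.asarray(x, dtype=float)
    # increasing=True orders the columns as x^0, x^1, ..., x^{M-1}
    return np.vander(x, N=M, increasing=True)

x = np.array([0.0, 1.0, 2.0])        # hypothetical inputs x_i
X = design_matrix(x, M=3)            # rows: (1, x_i, x_i^2)
W = np.array([1.0, -2.0, 0.5])       # hypothetical weights w_0, w_1, w_2
y_noiseless = X @ W                  # W^T X_i for every i, without the noise term
```

With this layout, a degree-$(M-1)$ polynomial fit becomes an ordinary linear model in the $M$ columns of `X`.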

### Bayesian linear regression model
<hr>
$$y_i =  W^{\top}X_i + \mathcal{E}_i, \quad \text{where}\quad  \mathcal{E}_i \sim \mathcal{N}(0,\sigma^2)$$

* Model distribution (univariate Gaussian)

$$p(y_i | X_i, W,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp{\Big( -\frac{(y_i - W^{\top}X_i)^2}{2\sigma^2}\Big)}$$
<br>

* Prior distribution (multivariate Gaussian)
<br>
$$p(W | C) = \mathcal{N}_{M}(W | 0,C) = \frac{1}{\sqrt{(2\pi)^{M}|C|}}\exp{\Big( -\frac{1}{2}W^{\top}C^{-1}W\Big)},\quad \text{where}\quad C = \beta^{-1}\mathbb{I}_{M}$$
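
As a rough sketch of the generative story, one can draw weights from the prior and then draw observations from the model distribution. It assumes $W$ has $M$ components $(w_0, \dots, w_{M-1})$, so $C$ is $M \times M$. The sizes, `beta`, `sigma2` and the seed are hypothetical values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, only for reproducibility

N, M = 50, 4                         # hypothetical sample size and number of weights
beta, sigma2 = 2.0, 0.1              # hypothetical prior precision and noise variance

x = rng.uniform(-1.0, 1.0, size=N)
X = np.vander(x, N=M, increasing=True)           # rows X_i = (x_i^0, ..., x_i^{M-1})

C = np.eye(M) / beta                             # C = beta^{-1} I
W = rng.multivariate_normal(np.zeros(M), C)      # W ~ N(0, C)
eps = rng.normal(0.0, np.sqrt(sigma2), size=N)   # E_i ~ N(0, sigma^2)
Y = X @ W + eps                                  # y_i = W^T X_i + E_i

def log_model_density(y_i, X_i, W, sigma2):
    """log p(y_i | X_i, W, sigma^2) for the univariate Gaussian model."""
    resid = y_i - W @ X_i
    return -0.5 * np.log(2.0 * np.pi * sigma2) - resid**2 / (2.0 * sigma2)

log_p_first = log_model_density(Y[0], X[0], W, sigma2)
```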

### Bayesian linear regression graphical model
<hr>

* Graphical notation
<br>
<img src='B_Lr.png' width=800>

* Likelihood of the data:
<br>
$$\large{p(\{y_n\}_{n=1}^N | \{X_n\}_{n=1}^N, W, \sigma^2) \propto \exp{\bigg(-\frac{1}{2\sigma^2}\sum_{n=1}^N \Big( y_{n} - W^{\top}X_n\Big)^2 \bigg)}}$$

* Matrix form:
<br>
$$\large{p(\mathcal{D}|W,\sigma^2) \propto \exp{\Big( -\frac{|| Y - XW ||^2_{Fro}}{2\sigma^2}\Big)}}$$
<br>
* Where
$$\large{ Y \in \mathbb{R}^N, \quad X \in \mathbb{R}^{N \times M},\quad W \in \mathbb{R}^{M}}$$

$\textbf{Note:}\quad ||X||_{Fro} = \sqrt{tr(X^{\top}X)}$
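
A quick numerical check that the per-sample sum and the matrix form give the same exponent, and that the Frobenius-norm identity in the note holds. The data here is random and purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, sigma2 = 20, 3, 0.25           # hypothetical sizes and noise variance
X = rng.normal(size=(N, M))          # row n plays the role of X_n
W = rng.normal(size=M)
Y = rng.normal(size=N)

# Exponent written as a sum over samples
exponent_sum = -sum((Y[n] - W @ X[n]) ** 2 for n in range(N)) / (2.0 * sigma2)

# Exponent in matrix form: squared norm of the residual vector Y - XW
resid = Y - X @ W
exponent_mat = -np.linalg.norm(resid) ** 2 / (2.0 * sigma2)

# Identity from the note, ||A||_Fro = sqrt(tr(A^T A)), checked on X
fro_direct = np.linalg.norm(X, "fro")
fro_trace = np.sqrt(np.trace(X.T @ X))
```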

### Linear regression can be performed analytically
<hr>

* Posterior
$$\large{
p(W | \mathcal{D}, \sigma^2, C) = \mathcal{N}(W | \hat{W}, \hat{\Sigma})\ \propto \ \exp{\bigg( -\frac{(W - \hat{W})^{\top}\hat\Sigma^{-1}(W - \hat{W})}{2}\bigg)},
}$$

<br>
* Where:
<br>
$$\boxed{\large{\hat{W} = \frac{\hat{\Sigma} X^{\top}Y}{\sigma^2}, \quad \hat\Sigma = \sigma^2\Big( X^{\top}X + \sigma^2C^{-1}\Big)^{-1} }}$$
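
The boxed posterior can be computed in a few lines. This is a minimal sketch on synthetic data: the function name `posterior`, the true weights, and the values of `sigma2` and `C` are all assumptions made for the example.

```python
import numpy as np

def posterior(X, Y, sigma2, C):
    """Posterior mean W_hat and covariance Sigma_hat of the weights."""
    A = X.T @ X + sigma2 * np.linalg.inv(C)      # X^T X + sigma^2 C^{-1}
    Sigma_hat = sigma2 * np.linalg.inv(A)
    W_hat = Sigma_hat @ X.T @ Y / sigma2
    return W_hat, Sigma_hat

rng = np.random.default_rng(2)
N, M, sigma2 = 200, 3, 0.05                      # hypothetical sizes and noise variance
x = rng.uniform(-1.0, 1.0, size=N)
X = np.vander(x, N=M, increasing=True)
W_true = np.array([0.5, -1.0, 2.0])              # hypothetical ground-truth weights
Y = X @ W_true + rng.normal(0.0, np.sqrt(sigma2), size=N)

C = 10.0 * np.eye(M)                             # broad isotropic prior (assumption)
W_hat, Sigma_hat = posterior(X, Y, sigma2, C)
```

Note that $\hat{W}$ equals the ridge-regression solution $(X^{\top}X + \sigma^2C^{-1})^{-1}X^{\top}Y$, since the factors of $\sigma^2$ cancel.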

### Predictive linear regression 
<hr>
* Predictive distribution for a new sample pair $\large{(x^{*},y^{*})}$
<br><br>
$$\large{
p(y^* | x^*, \mathcal{D}, \sigma^2, C) = \mathcal{N}(y^{*} | \hat{y}, \hat\sigma^2_y)\ \propto \ \exp{\bigg( -\frac{(y^* - \hat{y})^2}{2\hat{\sigma}_y^2} \bigg)} 
}$$

$\newcommand{\xstr}{x^*}$
$\newcommand{\xTstr}{x^{*^{\top}}}$

* Where ($x^*$ denotes the feature vector of the new input, built in the same way as $X_i$):

$$\boxed{\large{
\hat{y} = \frac{ \xTstr \Big(X^{\top}X + \xstr\xTstr + \sigma^2C^{-1}\Big)^{-1}X^{\top}Y  }{ 1 - \xTstr \Big(X^{\top}X + \xstr\xTstr + \sigma^2C^{-1}\Big)^{-1}\xstr}
}}$$


$$\boxed{\large{
\hat{\sigma}^2_y = \frac{\sigma^2}{1 - \xTstr\Big(X^{\top}X + \xstr\xTstr + \sigma^2C^{-1} \Big)^{-1} \xstr}
}}$$
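
By the Sherman–Morrison identity, the two boxed expressions reduce to $\hat{y} = x^{*\top}\hat{W}$ and $\hat{\sigma}^2_y = \sigma^2 + x^{*\top}\hat{\Sigma}x^{*}$. The sketch below evaluates both forms so they can be compared numerically. The function names and the random data are hypothetical, and `x_star` is assumed to be the $M$-dimensional feature vector of the new input.

```python
import numpy as np

def predictive_boxed(x_star, X, Y, sigma2, C):
    """Predictive mean and variance, evaluated literally from the boxed formulas."""
    A_star = X.T @ X + np.outer(x_star, x_star) + sigma2 * np.linalg.inv(C)
    A_inv = np.linalg.inv(A_star)
    denom = 1.0 - x_star @ A_inv @ x_star
    y_hat = (x_star @ A_inv @ X.T @ Y) / denom
    var_hat = sigma2 / denom
    return y_hat, var_hat

def predictive_posterior(x_star, X, Y, sigma2, C):
    """Same quantities written through the posterior mean and covariance."""
    Sigma_hat = sigma2 * np.linalg.inv(X.T @ X + sigma2 * np.linalg.inv(C))
    W_hat = Sigma_hat @ X.T @ Y / sigma2
    return x_star @ W_hat, sigma2 + x_star @ Sigma_hat @ x_star

rng = np.random.default_rng(3)
N, M, sigma2 = 40, 3, 0.2                        # hypothetical sizes and noise variance
X = rng.normal(size=(N, M))
Y = rng.normal(size=N)
C = np.diag([1.0, 0.5, 2.0])                     # hypothetical prior covariance
x_star = np.array([1.0, 0.3, 0.09])              # features (1, x, x^2) at x = 0.3

y_boxed, var_boxed = predictive_boxed(x_star, X, Y, sigma2, C)
y_post, var_post = predictive_posterior(x_star, X, Y, sigma2, C)
```

The second form makes the structure of the predictive variance explicit: observation noise plus the uncertainty about $W$ projected onto $x^*$.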

### Marginal for linear regression observations
<hr>

* Marginal

$$\large{
p(Y | X, \sigma^2, C) = \frac{\exp{\bigg( -\frac{||Y||^2 - Y^{\top}X\Big(X^{\top}X + \sigma^2C^{-1}\Big)^{-1}X^{\top}Y}{2\sigma^2}\bigg)}}{\sqrt{(2\pi\sigma^2)^N}\,\text{det}\Big[\sigma^{-2}CX^{\top}X + \mathbb{I}_{M}\Big]^{\frac{1}{2}}}
}$$


### Bayesian free energy (robust marginal)

<br>
* The Bayesian free energy is the negative log of the marginal likelihood
<br><br>
$$\large{
\begin{aligned}
2F^{\text{Bayes}} &= -2\ln p(Y | X, \sigma^2, C) \\ &= N \ln{(2\pi\sigma^2)} + \ln\text{det}\Big[\sigma^{-2}CX^{\top}X + \mathbb{I}_{M}  \Big] + \frac{||Y||^2 - Y^{\top}X\Big(X^{\top}X + \sigma^2C^{-1}\Big)^{-1}X^{\top}Y}{\sigma^2}
\end{aligned}
}$$
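
A minimal sketch for evaluating $2F^{\text{Bayes}}$ numerically. Integrating out $W$ gives $Y \sim \mathcal{N}(0,\ \sigma^2\mathbb{I}_N + XCX^{\top})$, so the free energy can be computed directly from this $N \times N$ Gaussian and cross-checked against an $M \times M$ computation. The function name and the random data are hypothetical.

```python
import numpy as np

def two_free_energy(X, Y, sigma2, C):
    """2F = -2 ln p(Y | X, sigma^2, C), using Y ~ N(0, sigma^2 I_N + X C X^T)."""
    N = X.shape[0]
    K = sigma2 * np.eye(N) + X @ C @ X.T         # marginal covariance of Y
    _, logdet = np.linalg.slogdet(K)             # log-determinant without overflow
    quad = Y @ np.linalg.solve(K, Y)             # Y^T K^{-1} Y
    return N * np.log(2.0 * np.pi) + logdet + quad

rng = np.random.default_rng(4)
N, M, sigma2 = 30, 3, 0.5                        # hypothetical sizes and noise variance
X = rng.normal(size=(N, M))
Y = rng.normal(size=N)
C = np.diag([1.0, 0.5, 2.0])                     # hypothetical prior covariance
F2 = two_free_energy(X, Y, sigma2, C)

# Cross-check with M x M quantities only:
# Woodbury identity for the quadratic form ...
A = X.T @ X + sigma2 * np.linalg.inv(C)
quad_small = (Y @ Y - Y @ X @ np.linalg.solve(A, X.T @ Y)) / sigma2
# ... and det(K) = (sigma^2)^N det(I_M + C X^T X / sigma^2) for the determinant
logdet_small = N * np.log(sigma2) + np.linalg.slogdet(np.eye(M) + C @ X.T @ X / sigma2)[1]
F2_small = N * np.log(2.0 * np.pi) + logdet_small + quad_small
```

The $M \times M$ route is the cheaper one when $N \gg M$, which is the usual situation in polynomial fitting.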

### Bayesian model selection
<hr>
<center>
<img src='Bsel.png' width=1000>
</center>
<br>
* Probability that the _true_ model degree is $\large{M-1}$
<br><br>
$$\large{
p(M | X, Y) \propto p(Y| X,M)p(M)
}$$


<br>
* Marginal likelihood as a model selection criterion (MAP under $p(M) \propto 1$)

$$\boxed{\large{\hat{M} = \text{argmin}_{M} 2 F^{\text{Bayes}}}}$$
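
The selection rule can be sketched as a loop over candidate values of $M$, scoring each by its free energy. The setup is an assumption made for illustration: synthetic data from a quadratic, a known noise variance, an isotropic prior $C = \beta^{-1}\mathbb{I}_M$ with a hypothetical `beta`, and a helper `two_free_energy` that evaluates the Gaussian marginal $Y \sim \mathcal{N}(0,\ \sigma^2\mathbb{I}_N + XCX^{\top})$.

```python
import numpy as np

def two_free_energy(X, Y, sigma2, C):
    """2F = -2 ln p(Y | X, sigma^2, C), using Y ~ N(0, sigma^2 I_N + X C X^T)."""
    N = X.shape[0]
    K = sigma2 * np.eye(N) + X @ C @ X.T
    _, logdet = np.linalg.slogdet(K)
    return N * np.log(2.0 * np.pi) + logdet + Y @ np.linalg.solve(K, Y)

rng = np.random.default_rng(5)
N, sigma2, beta = 60, 0.01, 1.0                  # hypothetical settings
x = rng.uniform(-1.0, 1.0, size=N)
# Synthetic data from a degree-2 polynomial, i.e. three weights
Y = 1.0 - 2.0 * x + 3.0 * x**2 + rng.normal(0.0, np.sqrt(sigma2), size=N)

scores = {}
for M in range(1, 8):                            # candidate numbers of weights
    X = np.vander(x, N=M, increasing=True)       # features x^0 ... x^{M-1}
    scores[M] = two_free_energy(X, Y, sigma2, np.eye(M) / beta)

M_hat = min(scores, key=scores.get)              # argmin of 2F over the candidates
```

Models that are too small pay through the data-fit term, while models that are too large pay through the log-determinant term, which is how the marginal likelihood trades fit against complexity.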

### Empirical Bayesian learning

* Model selection = hyperparameter optimization

$$\boxed{\large{(\hat{C}, \hat{\sigma}^2) = \text{argmin}_{C,\sigma^2} 2F^{\text{Bayes}}(\sigma^2,C)}}$$
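
One simple way to carry out this minimization is a grid search. This is a sketch under assumptions, not a prescribed optimizer: it restricts $C$ to the isotropic form $\beta^{-1}\mathbb{I}_M$, and the grids, the synthetic data and the helper `two_free_energy` are all hypothetical.

```python
import numpy as np

def two_free_energy(X, Y, sigma2, C):
    """2F = -2 ln p(Y | X, sigma^2, C), using Y ~ N(0, sigma^2 I_N + X C X^T)."""
    N = X.shape[0]
    K = sigma2 * np.eye(N) + X @ C @ X.T
    _, logdet = np.linalg.slogdet(K)
    return N * np.log(2.0 * np.pi) + logdet + Y @ np.linalg.solve(K, Y)

rng = np.random.default_rng(6)
N, M, sigma2_true = 100, 3, 0.05                 # hypothetical settings
x = rng.uniform(-1.0, 1.0, size=N)
X = np.vander(x, N=M, increasing=True)
Y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(0.0, np.sqrt(sigma2_true), size=N)

betas = np.logspace(-3, 3, 25)                   # candidate prior precisions
sigma2s = np.logspace(-4, 1, 26)                 # candidate noise variances

# Evaluate 2F on the whole grid and keep the minimizer
F2_best, beta_hat, sigma2_hat = min(
    ((two_free_energy(X, Y, s2, np.eye(M) / b), b, s2)
     for b in betas for s2 in sigma2s),
    key=lambda t: t[0],
)
C_hat = np.eye(M) / beta_hat
```

A gradient-based or fixed-point optimizer would scale better than a grid, but the grid keeps the link to the boxed objective explicit.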



* Sparse Bayesian Learning

$$\large{C = \text{diag}(c_1^2, \dots, c_M^2)}$$

$$\large{c_m^2 \rightarrow +0\quad (\text{and hence}\ w_m^2 = 0 )\  \text{if}\ x^m\ \text{is not useful to explain } Y}$$
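
The pruning effect can be illustrated with a small experiment. The sketch below is an assumption rather than the method of these notes: it uses MacKay-style fixed-point re-estimation of the $c_m^2$ (a common choice for automatic relevance determination), keeps $\sigma^2$ fixed and known, and runs on synthetic data in which only the first feature is relevant.

```python
import numpy as np

def ard_fixed_point(X, Y, sigma2, n_iter=200, floor=1e-10):
    """Re-estimate the diagonal prior variances c_m^2 by fixed-point iteration."""
    M = X.shape[1]
    c2 = np.ones(M)                                  # initial c_m^2
    for _ in range(n_iter):
        # Posterior for the current C = diag(c2)
        Sigma_hat = sigma2 * np.linalg.inv(X.T @ X + sigma2 * np.diag(1.0 / c2))
        W_hat = Sigma_hat @ X.T @ Y / sigma2
        # gamma_m = 1 - Sigma_mm / c_m^2 measures how well the data determine w_m
        gamma = 1.0 - np.diag(Sigma_hat) / c2
        # Update c_m^2 = w_m^2 / gamma_m; the floor only avoids division by zero
        c2 = np.maximum(W_hat**2 / np.maximum(gamma, floor), floor)
    return c2, W_hat

rng = np.random.default_rng(7)
N, sigma2 = 200, 0.25                                # hypothetical settings
X = rng.normal(size=(N, 2))                          # two candidate features
Y = 2.0 * X[:, 0] + rng.normal(0.0, np.sqrt(sigma2), size=N)   # only feature 0 matters

c2, W_hat = ard_fixed_point(X, Y, sigma2)
```

In this setup the prior variance of the irrelevant feature is driven towards zero, which in turn shrinks its posterior weight towards zero.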