# Univariate linear regression

Let us consider a probabilistic model 

\begin{align*}
  y_i=ax_i+b+\varepsilon_i,\qquad \varepsilon_i\sim\mathcal{N}(0,\sigma)
\end{align*}

for a fixed set of observations $x_1,\ldots,x_n$, where $\sigma$ denotes the standard deviation of the noise $\varepsilon_i$.

## I. Probabilistic model and likelihood 

First observe

\begin{align*}
y_i=ax_i+b+\varepsilon_i \qquad \Longleftrightarrow \qquad \varepsilon_i= y_i-ax_i-b
\end{align*}

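As the coefficient of $y_i$ in this expression is one and the remaining terms are constant for given $x_i$, $a$ and $b$, the change of variables does not rescale the density, so the densities are equal: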

\begin{align*}
p[y_i|x_i, a,b,\sigma]&=p[\varepsilon_i= y_i-ax_i-b|a,b,\sigma]
\end{align*}

Thus
\begin{align*}
p[y_i|x_i, a,b,\sigma]&=
\frac{1}{\sqrt{2\pi} \sigma}\cdot\exp\left(-\frac{(y_i-ax_i-b)^2}{2\sigma^2}\right)
\end{align*}

As the $\varepsilon_i$ are independent, we get

\begin{align*}
  p[\mathbf{y}|\mathbf{x},a,b, \sigma]&=\prod_{i=1}^n p[y_i|x_i, a, b, \sigma]\\
  &=\prod_{i=1}^n\frac{1}{\sqrt{2\pi} \sigma}\cdot\exp\left(-\frac{(y_i-ax_i-b)^2}{2\sigma^2}\right)
\end{align*}
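
The product formula above can be evaluated directly. The following is a minimal sketch in plain Python. The function names `normal_density` and `likelihood` and the toy data are illustrative assumptions, not part of the model definition, and $\sigma$ is treated as the standard deviation, as in the density above.

```python
import math

def normal_density(eps, sigma):
    """Density of a zero-mean normal distribution with standard deviation sigma."""
    return math.exp(-eps ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def likelihood(xs, ys, a, b, sigma):
    """Joint density p[y | x, a, b, sigma] as a product over independent observations."""
    result = 1.0
    for x, y in zip(xs, ys):
        # Each factor is the noise density evaluated at the residual y - a*x - b.
        result *= normal_density(y - a * x - b, sigma)
    return result

# Toy data (made up for illustration), roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

lik_close = likelihood(xs, ys, a=2.0, b=1.0, sigma=0.5)  # line close to the data
lik_far = likelihood(xs, ys, a=0.0, b=1.0, sigma=0.5)    # line far from the data
print(lik_close > lik_far)
```

A line that passes close to the data receives a much larger likelihood than one that does not. For large $n$ the product of many small factors underflows in floating point, which is one practical reason to work with the logarithm of the likelihood.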

## II. Maximum likelihood principle

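Taking the logarithm of the product turns it into a sum, which gives the log-likelihood of the data

\begin{align*}
\log p[\mathbf{y}|\mathbf{x},a,b,\sigma]&= \mathrm{const}-n\log(\sigma)-\sum_{i=1}^n\frac{(y_i-ax_i-b)^2}{2\sigma^2}
\end{align*}

where $\mathrm{const}=-\frac{n}{2}\log(2\pi)$ does not depend on $a$, $b$ or $\sigma$. We can now apply the maximum likelihood principle, i.e., pick the model that maximises the likelihood of the data.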
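
As a sketch of how the log-likelihood can be computed in practice, the function below implements the formula term by term, with the constant written out as $-\frac{n}{2}\log(2\pi)$. The name `log_likelihood` and the toy data are assumptions made for illustration only.

```python
import math

def log_likelihood(xs, ys, a, b, sigma):
    """Log of p[y | x, a, b, sigma] for the model y_i = a*x_i + b + eps_i."""
    n = len(xs)
    squared_errors = sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))
    const = -0.5 * n * math.log(2 * math.pi)  # does not depend on a, b or sigma
    return const - n * math.log(sigma) - squared_errors / (2 * sigma ** 2)

# Toy data (made up for illustration), roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

ll_close = log_likelihood(xs, ys, a=2.0, b=1.0, sigma=0.5)
ll_far = log_likelihood(xs, ys, a=0.0, b=1.0, sigma=0.5)
print(ll_close > ll_far)
```

Under the maximum likelihood principle, the parameters with the larger log-likelihood are preferred, so here the line $y=2x+1$ wins over $y=1$.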


**Bayesian thinking and internal consistency:** In most cases, one can attach a non-informative prior to all models, choose the model with the highest posterior probability, and obtain the same model.
Applying the maximum likelihood principle directly is preferable, because the definition of a non-informative prior over models depends on the parametrisation, whereas the maximum likelihood principle is independent of the model parametrisation.

If we fix $\sigma$, then the maximisation task is equivalent to minimising

\begin{align*}
F(a,b)=\frac{1}{n}\cdot\sum_{i=1}^n(y_i-ax_i-b)^2
\end{align*}

In other words, we need to minimise the mean squared error to get the optimal $a$ and $b$.
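
The minimisation of $F(a,b)$ can be sketched as follows. The closed-form expressions used in `fit_line` are the standard least-squares solution obtained by setting both partial derivatives of $F$ to zero; they are not derived in the text above. The function names and the toy data are illustrative assumptions.

```python
def mean_squared_error(xs, ys, a, b):
    """F(a, b): mean of the squared residuals y_i - a*x_i - b."""
    n = len(xs)
    return sum((y - a * x - b) ** 2 for x, y in zip(xs, ys)) / n

def fit_line(xs, ys):
    """Closed-form least-squares minimiser of F(a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx  # requires at least two distinct x values
    b = mean_y - a * mean_x
    return a, b

# Toy data (made up for illustration), roughly following y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]

a_hat, b_hat = fit_line(xs, ys)
best = mean_squared_error(xs, ys, a_hat, b_hat)
print(a_hat, b_hat, best)
```

Because $\sigma$ only rescales the sum of squares in the log-likelihood, the same $a$ and $b$ maximise the likelihood for every fixed $\sigma$.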


If we fix $a$ and $b$, then the maximisation task is equivalent to minimising

\begin{align*}
F(\sigma)= n\log \sigma+\frac{1}{2\sigma^2}\cdot\sum_{i=1}^n(y_i-\hat{y}_i)^2
\end{align*}

where $\hat{y}_i=ax_i+b$ is the prediction. By equating the first derivative to zero we get

\begin{align*}
\frac{\partial F(\sigma)}{\partial \sigma}= \frac{n}{\sigma}- \frac{1}{\sigma^3}\cdot\sum_{i=1}^n(y_i-\hat{y}_i)^2=0
\end{align*}

which implies

\begin{align*}
\sigma^2= \frac{1}{n}\cdot\sum_{i=1}^n(y_i-\hat{y}_i)^2
\end{align*}

The latter is equivalent to fitting a normal distribution with zero mean to the residuals $y_i-\hat{y}_i$.
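
The estimate of $\sigma$ can be checked numerically. The sketch below computes the root of the mean squared residual and evaluates $F(\sigma)$ at nearby values to confirm that the estimate is a local minimum. The names `sigma_mle` and `objective`, the fixed coefficients and the toy data are assumptions made for illustration.

```python
import math

def sigma_mle(ys, predictions):
    """Maximum likelihood estimate of sigma: root of the mean squared residual."""
    n = len(ys)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / n)

def objective(sigma, ys, predictions):
    """F(sigma) = n*log(sigma) + (sum of squared residuals) / (2*sigma^2)."""
    n = len(ys)
    rss = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return n * math.log(sigma) + rss / (2 * sigma ** 2)

# Toy data (made up for illustration) and fixed coefficients a and b.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
a, b = 2.0, 1.0
predictions = [a * x + b for x in xs]

sigma_hat = sigma_mle(ys, predictions)
print(sigma_hat)
print(objective(sigma_hat, ys, predictions) < objective(1.1 * sigma_hat, ys, predictions))
```

Note that this estimate divides by $n$, as in the formula above, rather than by a degrees-of-freedom corrected count.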
