## Probabilistic Interpretation for Linear Regression

Why is the least-squares cost function $J(\beta)$ a good choice for regression problems?

$$J(\beta) = \frac{1}{2}\sum_{i=1}^n(y_i - \beta^Tx_i)^2$$

Can it be shown to be a natural choice for the regression problem, perhaps under certain assumptions? Suppose the true relationship between the variables is given by the equation:

$$y_i = \beta^Tx_i + \epsilon_i$$

Here $\epsilon_i$ is a catch-all error term for whatever the simple model misses. Let us assume that the $\epsilon_i$ are independently and identically distributed (IID) according to a Gaussian distribution, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, with density

$$p(\epsilon_i) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\epsilon_i^2}{2\sigma^2}\right)$$

Since $\epsilon_i = y_i - \beta^Tx_i$, this implies that the distribution of $y_i$ given $x_i$, parameterized by $\beta$, is given by

$$p(y_i | x_i; \beta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}\right)$$
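As a minimal sketch, the conditional density above can be evaluated directly in Python. The function name `gaussian_conditional_density` and the numeric values for `beta`, `x`, and `sigma` are hypothetical, chosen only to illustrate the formula.

```python
import math

def gaussian_conditional_density(y, x, beta, sigma):
    """Evaluate p(y | x; beta) under y = beta^T x + eps, eps ~ N(0, sigma^2)."""
    mean = sum(b * xj for b, xj in zip(beta, x))  # beta^T x
    residual = y - mean
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return norm * math.exp(-residual ** 2 / (2.0 * sigma ** 2))

# Hypothetical values, for illustration only.
beta = [1.0, 2.0]
x = [1.0, 0.5]   # the first entry plays the role of an intercept feature
sigma = 1.0

# Density when y equals beta^T x (the peak of the Gaussian).
p_at_mean = gaussian_conditional_density(2.0, x, beta, sigma)
# Density when y is one standard deviation away from beta^T x.
p_off_mean = gaussian_conditional_density(3.0, x, beta, sigma)
```

The density is largest when $y_i$ equals $\beta^Tx_i$ and decays as the squared residual grows, which is what later ties the likelihood to the squared-error cost.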

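Viewed as a function of $\beta$ with the data $X$ and $y$ held fixed, this quantity defines the __likelihood__ function $L(\beta) = L(\beta; X, y)$. Because the $\epsilon_i$ are IID, the likelihood of the whole data set factorizes into a product: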

\begin{align}
L(\beta) & = \prod_{i = 1}^{n} p(y_i | x_i; \beta) \\
& = \prod_{i = 1}^{n} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}\right)
\end{align}

Given the above relationship between the variables, how do we choose $\beta$? We want to maximize the agreement of the model with the observed data, which in this case means selecting the parameters that maximize the likelihood function. Equivalently, we can maximize the __log likelihood__ $\ell(\beta)$, a typical choice that is more convenient to work with and does not alter the underlying optimization problem, because the logarithm is a strictly increasing function.

\begin{align}
\ell(\beta) & = \log L(\beta) \\
& = \log \prod_{i = 1}^{n} \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}\right) \\
& = \sum_{i = 1}^{n} \log \left[\frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}\right)\right] \\
& = \sum_{i = 1}^{n} \log \frac{1}{\sqrt{2\pi}\sigma} + \sum_{i = 1}^{n} \log \exp\left(-\frac{(y_i - \beta^Tx_i)^2}{2\sigma^2}\right) \\
& = n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i = 1}^{n}(y_i - \beta^Tx_i)^2
\end{align}
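The last line of this derivation can be checked numerically. The sketch below computes the log likelihood twice: once by summing $\log p(y_i | x_i; \beta)$ term by term, and once from the closed form, written as $n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2}J(\beta)$. The function names and the small data set are assumptions made for illustration, not part of the original derivation.

```python
import math

def squared_error_cost(beta, X, y):
    """J(beta) = 1/2 * sum_i (y_i - beta^T x_i)^2."""
    return 0.5 * sum(
        (yi - sum(b * xj for b, xj in zip(beta, xi))) ** 2
        for xi, yi in zip(X, y)
    )

def log_likelihood_direct(beta, X, y, sigma):
    """Sum log p(y_i | x_i; beta) over the data, one term at a time."""
    total = 0.0
    for xi, yi in zip(X, y):
        residual = yi - sum(b * xj for b, xj in zip(beta, xi))
        density = math.exp(-residual ** 2 / (2 * sigma ** 2)) / (
            math.sqrt(2 * math.pi) * sigma
        )
        total += math.log(density)
    return total

def log_likelihood_closed_form(beta, X, y, sigma):
    """Final line of the derivation, expressed through J(beta)."""
    n = len(y)
    constant = n * math.log(1.0 / (math.sqrt(2 * math.pi) * sigma))
    return constant - squared_error_cost(beta, X, y) / sigma ** 2

# Hypothetical data: each row of X is (intercept feature, one input).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0.9, 3.1, 4.8, 7.2]
beta = [1.0, 2.0]
sigma = 0.5

direct = log_likelihood_direct(beta, X, y, sigma)
closed = log_likelihood_closed_form(beta, X, y, sigma)
```

The two values should agree up to floating-point rounding, since they are the same quantity written in two ways.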

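Maximizing $\ell(\beta)$ therefore works out to be the same as minimizing our original $J(\beta)$: the first term does not depend on $\beta$, and the second term equals $-\frac{1}{\sigma^2}J(\beta)$ with $\sigma^2 > 0$. Note that the resulting choice of $\beta$ does not depend on the value of $\sigma^2$.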
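A small numerical experiment can illustrate this equivalence. The sketch below assumes a one-dimensional model without an intercept, $y_i = \beta x_i + \epsilon_i$, for which the least-squares solution has the closed form $\hat\beta = \sum_i x_i y_i / \sum_i x_i^2$. It then searches a grid of candidate values for the one with the highest log likelihood, for several values of $\sigma$. The data, the grid, and the names used are hypothetical.

```python
import math

def log_likelihood(beta, xs, ys, sigma):
    """Gaussian log likelihood for the scalar model y = beta * x + eps."""
    n = len(ys)
    sse = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
    return n * math.log(1.0 / (math.sqrt(2 * math.pi) * sigma)) - sse / (2 * sigma ** 2)

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Least-squares estimate: the minimizer of J(beta) in the scalar case.
beta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Candidate values centered on beta_ls; k = 0 reproduces beta_ls itself.
candidates = [beta_ls + k * 0.01 for k in range(-50, 51)]

# For each sigma, pick the candidate with the highest log likelihood.
maximizers = {
    sigma: max(candidates, key=lambda b: log_likelihood(b, xs, ys, sigma))
    for sigma in (0.5, 1.0, 2.0)
}
```

For every value of `sigma` tried, the grid maximizer of the log likelihood coincides with `beta_ls`; changing `sigma` rescales the log likelihood but does not move its maximum.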