## Linear Regression
Linear regression is the workhorse algorithm that's used in many sciences, social and natural. As before, we want to put a "best" line through a cloud of points. This time, though, we'll build a full likelihood model of the regression problem.

## Linear Regression as a Likelihood model
In order to extend linear regression to a likelihood model, we need to specify the probability of witnessing a particular dataset.

Note that all forms of regression consider the X data as fixed and already given, with no stochasticity. Everything is done GIVEN those X data, and we make no statements about how those X data came to be. As a result, OLS doesn't care one bit whether the X data are a random sample of a population or specifically chosen to be the 10 tallest people in the world (though this does impact how well the model will generalize to other groups).

Since the X data are given, all we need to specify is the probability of seeing a particular Y value, given an X value. Because we want to keep things simple, why not decide that the Y values are normally distributed around the line we're fitting? Brilliant!

The diagram below illustrates the probabilistic interpretation of linear regression. We show a point $(x_i, y_i)$ and the corresponding prediction for $x_i$ using the line, written $\hat{y}_i$ ("y hat", so named because the y has a hat over it). The illustration shows how the $y_i$ do not all fall at the mean of the distribution, right on the line, but instead spread out vertically around that point following a normal distribution.

![](images/linregmle.png)

To be fully precise, we also specify that the variance of the normal distribution is the same for all values of X, and that all deviations from the line are iid draws from the appropriate normal distribution.

We'll see below how these assumptions let us write out the probability of seeing any particular set of Y values, given our fixed X values.

##  Linear Regression MLE
Putting the above into math, it says that each $y_i$ is Gaussian distributed with mean $\v{w}\cdot\v{x_i}$ (i.e. the y value predicted by the regression line for input $\v{x_i}$) and variance $\sigma^2$:

$$ y_i \sim N(\v{w}\cdot\v{x_i}, \sigma^2) .$$
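
To make the model concrete, here is a minimal sketch that generates data from it. The sample size, the "true" weights `w_true`, and the noise level `sigma_true` are made-up values chosen only for illustration, and the intercept is handled by adding a column of ones to the design matrix (an assumption about how $\v{x_i}$ is set up).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Assumed values for illustration only (not taken from the text)
n = 200
w_true = np.array([1.0, 2.0])   # [intercept, slope]
sigma_true = 0.5

# X is treated as fixed and given; the column of ones carries the intercept
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])

# y_i ~ N(w . x_i, sigma^2): the value on the line plus iid Gaussian noise
y_hat = X @ w_true
y = y_hat + rng.normal(loc=0.0, scale=sigma_true, size=n)

# The deviations from the line should average near 0 with spread near sigma_true
residuals = y - y_hat
print(residuals.mean(), residuals.std())
```

Note that the same `sigma_true` is used at every x, which is exactly the constant-variance assumption stated earlier.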

We can then write the likelihood:

$$\cal{L} = p(\v{y} | \v{x}, \v{w}, \sigma) = \prod_i p(y_i | \v{x}_i, \v{w}, \sigma) = \prod_i \frac{1}{\sigma\sqrt{2\pi}} e^{-(y_i - \v{w}\cdot\v{x}_i)^2 / 2\sigma^2}$$

This likelihood (or rather, its negative) plays the role of the loss function we saw in OLS. But now, instead of telling us a line is bad because the vertical distances are big, it tells us a line is bad because the observed y values are unlikely to have come from Gaussian distributions centered on that line.
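
As a rough illustration of the likelihood acting as a loss, the sketch below scores two candidate lines on simulated data. The helper `neg_log_likelihood` is a hypothetical name, and the data and candidate weights are invented for the example. It sums log densities instead of multiplying raw densities, since a product of many small numbers underflows in floating point.

```python
import numpy as np

def neg_log_likelihood(w, sigma, X, y):
    """Negative log of the product of Gaussian densities, one per data point."""
    resid = y - X @ w
    log_density = -np.log(sigma * np.sqrt(2.0 * np.pi)) - resid**2 / (2.0 * sigma**2)
    return -np.sum(log_density)

# Simulated data (assumed values, for illustration only)
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 100)
X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=x.size)

# A line close to the data versus a line with the wrong slope
nll_good = neg_log_likelihood(np.array([1.0, 2.0]), 0.5, X, y)
nll_bad = neg_log_likelihood(np.array([1.0, 1.0]), 0.5, X, y)
print(nll_good, nll_bad)
```

The line with the wrong slope gets a much larger loss, because under that line the observed y values would sit far out in the tails of their Gaussians.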

Now we apply MLE, or equivalently loss minimization, to find the best values of $\v{w}$ and $\sigma$.

### Fitting via MLE / Minimizing the Likelihood Loss
The log likelihood $\ell$ is given by:

$$\ell = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_i (y_i - \v{w}\cdot\v{x}_i)^2 .$$
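
For readers who want the intermediate step: taking the log turns the product over data points into a sum, and each term splits into a normalization piece and a squared-residual piece. Here $n$ is the number of data points.

$$\ell = \log \cal{L} = \sum_i \left[ -\log\left(\sigma\sqrt{2\pi}\right) - \frac{(y_i - \v{w}\cdot\v{x}_i)^2}{2\sigma^2} \right]$$

Since $\log(\sigma\sqrt{2\pi}) = \frac{1}{2}\log(2\pi\sigma^2)$ and that term is the same for all $n$ points, summing it gives the $-\frac{n}{2}\log(2\pi\sigma^2)$ in the expression above.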

If you differentiate this with respect to $\v{w}$ and $\sigma$ and set the derivatives to zero, you get the MLE values of the parameter estimates:

$$\v{w}_{MLE} = (\v{X}^T\v{X})^{-1} \v{X}^T\v{y}, $$

where $\v{X}$ is the design matrix created by stacking rows $\v{x}_i$, and

$$\sigma^2_{MLE} = \frac{1}{n} \sum_i (y_i - \v{w}_{MLE}\cdot\v{x}_i)^2 . $$
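
A sketch of where these come from, written in terms of the log likelihood $\ell$. Setting the gradient with respect to $\v{w}$ to zero gives the normal equations:

$$\frac{\partial \ell}{\partial \v{w}} = \frac{1}{\sigma^2} \sum_i (y_i - \v{w}\cdot\v{x}_i)\,\v{x}_i = \frac{1}{\sigma^2} \v{X}^T(\v{y} - \v{X}\v{w}) = 0 \quad\Longrightarrow\quad \v{X}^T\v{X}\,\v{w} = \v{X}^T\v{y} .$$

Note that $\sigma$ drops out of this condition, so the best line does not depend on the noise level. Setting the derivative with respect to $\sigma$ to zero gives:

$$\frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_i (y_i - \v{w}\cdot\v{x}_i)^2 = 0 \quad\Longrightarrow\quad \sigma^2 = \frac{1}{n} \sum_i (y_i - \v{w}\cdot\v{x}_i)^2 ,$$

evaluated at the maximizing $\v{w}$. Solving the normal equations as written assumes $\v{X}^T\v{X}$ is invertible.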

The formula for $\v{w}_{MLE}$ turns out to be exactly the one you get when minimizing the squared-error loss function, and $\sigma^2_{MLE}$ is simply the mean squared residual at that minimum.
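
A minimal NumPy sketch of these closed-form estimates, on simulated data with made-up true parameters. The intercept is handled with a column of ones (an assumption about the design matrix), and `np.linalg.lstsq` is used only as an independent least-squares check.

```python
import numpy as np

# Simulated data (assumed values, for illustration only)
rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0.0, 10.0, size=n)
X = np.column_stack([np.ones(n), x])          # design matrix, rows are x_i
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)

# w_MLE solves the normal equations (X^T X) w = X^T y
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# sigma^2_MLE is the mean squared residual (divides by n)
resid = y - X @ w_mle
sigma2_mle = np.mean(resid**2)

# Least-squares solver for comparison: it should find the same line
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mle, sigma2_mle, np.allclose(w_mle, w_ols))
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse, which is generally the more numerically stable choice. Because $\sigma^2_{MLE}$ divides by $n$, it tends to slightly underestimate the true variance in small samples.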

### Comparison to OLS
For any given dataset, the line found by OLS loss minimization is the same as the line found by the normally-distributed errors model above. They do not differ in predictive performance or by any other criterion. So, what are the actual differences?

The extra assumptions in the probability model above let us do things like characterize the sampling distribution of the line we just found (that is, what other lines might we have seen if we re-sampled the dataset from the distribution specified). Further, having a full likelihood model lets us draw confidence intervals that give a range estimate for the parameters of the model instead of point estimates. Likewise, we can make prediction intervals which estimate the range where new y values are likely to fall at each X value, factoring in both noise from the normal distribution and uncertainty in the parameters we've estimated.
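
To get a feel for what "sampling distribution of the line" means, here is a simulation sketch: hold X fixed, repeatedly draw new y values from the assumed model, and refit each time. All parameter values are invented for illustration. This is not the analytic confidence interval itself, since here we know the true parameters; it only shows how much the fitted slope varies from dataset to dataset under the model.

```python
import numpy as np

# Assumed values, for illustration only
rng = np.random.default_rng(3)
n, n_datasets = 50, 2000
w_true = np.array([1.0, 2.0])   # [intercept, slope]
sigma_true = 0.5

# X stays fixed across all simulated datasets
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])

# Matrix that maps any y vector to its fitted weights: (X^T X)^{-1} X^T
fit = np.linalg.solve(X.T @ X, X.T)    # shape (2, n)

# Draw many datasets from the model and refit each one
noise = rng.normal(scale=sigma_true, size=(n_datasets, n))
Y = X @ w_true + noise                 # shape (n_datasets, n)
W = Y @ fit.T                          # shape (n_datasets, 2), one fit per row

# Spread of the fitted slopes across the simulated datasets
slope_lo, slope_hi = np.percentile(W[:, 1], [2.5, 97.5])
print(W.mean(axis=0), W.std(axis=0), (slope_lo, slope_hi))
```

The fitted slopes cluster around the true slope, and the width of that cluster is the kind of uncertainty a confidence interval is meant to summarize when only one dataset is available.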

Of course, all of the above estimates will be varying degrees of _actively wrong_ if the assumptions about iid normal errors aren't true of the data generating process. And that's always the tradeoff: the richer the structure you put on your model, the more you can squeeze out of it, but the more you expose yourself to faulty assumptions and then misleading results.