#### Disclaimer
Odds are you're already very familiar with linear regression. This notebook isn't an introduction to Ordinary Least Squares (OLS); the goal is to compare and contrast regression in the OLS framework with regression in the probability framework.

## TL;DR
OLS picks a measure of how well a line fits the data and finds the line that fits best under that measure. In particular, it uses the sum of the squared vertical distances from the line to each training point. OLS is brutally effective and makes few assumptions, but on its own it can't say much at all about how certain it is of its estimates.


## Regression (Least Squares)
The goal of linear regression is to draw a line through a cloud of points that is 'close' in some sense to the points in the cloud.  

Bad:
![A bad line](images/ols_vbad_line.png)
Better:
![A better line](images/ols_okay_line.png)




### Loss Function
The Machine Learning way of selecting a line is to come up with a criterion for deciding which lines are better than others and then search for the best-scoring line. I.e. the ML approach is to define a loss function and minimize it.

The loss function selected in OLS (Ordinary Least Squares) scores any line as the sum of the squares of the vertical distances from the line to the points it's trying to match.

In math:
$$Loss=\sum_{i\ \in\ dataset}{(y_i-prediction_i)^2}$$

In a picture:
<table>
<tr>
<td> <img src="images/ols_vbad_line_resid.png"/> </td>
<td> <img src="images/ols_okay_line_resid.png"/> </td>
</tr>
</table>
The line on the right is better because the red lines are shorter, so the sum of their squared lengths is smaller.
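
The same comparison can be made in code. The sketch below uses made-up data (roughly `y = 2x + 1` plus noise) and a hypothetical helper `squared_loss`; neither comes from the plots above. It scores a bad line, a better line, and the line `np.polyfit` returns, which is the least-squares fit.

```python
import numpy as np

def squared_loss(slope, intercept, x, y):
    """Sum of squared vertical distances from the line to the points."""
    predictions = slope * x + intercept
    return np.sum((y - predictions) ** 2)

# Made-up data for illustration only: roughly y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=1.0, size=x.size)

bad_loss = squared_loss(-1.0, 15.0, x, y)   # a line sloping the wrong way
okay_loss = squared_loss(2.0, 1.0, x, y)    # a line close to the trend

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)
ols_loss = squared_loss(ols_slope, ols_intercept, x, y)

print("bad:", bad_loss, "okay:", okay_loss, "OLS:", ols_loss)
```

No line can score lower than the OLS line on this particular loss; that is all "best" means here.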

#### Alternatives
One could instead choose a loss function that takes the actual (not squared) length of each line and totals them. There can be lots of arguing over appropriate loss functions, touching on both their motivation and computational burden. [Squared error has a simple plug-in formula; absolute error requires iterative approximation.]
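
To make the computational difference concrete, here is a minimal sketch on made-up data with one wild outlier. The squared-error fit is a single call to `np.linalg.lstsq`. For absolute error it uses iteratively reweighted least squares (IRLS), which is one of several possible iterative schemes and is an assumption of this example, not something this notebook prescribes.

```python
import numpy as np

# Made-up data: roughly y = 2x + 1, plus one wild outlier at the right edge.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 40)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)
y[-1] += 60

X = np.column_stack([np.ones_like(x), x])  # columns: intercept, slope

# Squared error: one linear-algebra solve gives the answer.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Absolute error: no plug-in formula, so iterate.
# IRLS repeatedly solves a weighted least-squares problem with weights
# 1/|residual|, so large residuals count less than they do in OLS.
beta_lad = beta_ols.copy()
for _ in range(50):
    resid = y - X @ beta_lad
    w = 1.0 / np.maximum(np.abs(resid), 1e-6)  # clamp to avoid dividing by 0
    sw = np.sqrt(w)
    beta_lad, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

def abs_loss(beta):
    return np.sum(np.abs(y - X @ beta))

def sq_loss(beta):
    return np.sum((y - X @ beta) ** 2)

print("OLS (intercept, slope):", beta_ols)
print("Absolute-error (intercept, slope):", beta_lad)
```

Each fit wins under its own loss. The absolute-error line is pulled less by the outlier, which is one of the usual motivations for preferring it.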

Additionally, we might prefer to minimize _actual_ distance from the points to the line, rather than _vertical_ distance. The first component(s) of PCA turn out to be the line/plane that minimizes distance orthogonal to the line/plane. OLS is in some sense assuming that the X variables are known exactly and errors only occur in the Y direction.
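
A small sketch of that contrast, on made-up data and with hypothetical helper names: it fits a line through the centered data by OLS and by taking the first principal component (computed with an SVD), then scores both lines under both notions of distance.

```python
import numpy as np

# Made-up 2-D cloud for illustration only.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

# Center the data so both lines pass through the origin (the centroid).
xc, yc = x - x.mean(), y - y.mean()

# OLS slope: minimizes squared *vertical* distances.
ols_slope = np.sum(xc * yc) / np.sum(xc ** 2)

# First principal component: minimizes squared *orthogonal* distances.
_, _, vt = np.linalg.svd(np.column_stack([xc, yc]), full_matrices=False)
direction = vt[0]                     # unit vector along the first component
pca_slope = direction[1] / direction[0]

def vertical_sse(slope):
    """Sum of squared vertical distances to the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2)

def orthogonal_sse(slope):
    """Sum of squared perpendicular distances to the line y = slope * x."""
    return np.sum((yc - slope * xc) ** 2) / (1 + slope ** 2)

print("OLS slope:", ols_slope, "PCA slope:", pca_slope)
print("vertical SSE:  ", vertical_sse(ols_slope), vertical_sse(pca_slope))
print("orthogonal SSE:", orthogonal_sse(ols_slope), orthogonal_sse(pca_slope))
```

The two slopes generally differ, and each line is the best one under its own definition of distance.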

## Limitations
The arbitrary-loss-function approach is simple, but has many limitations. All the resulting line does is, given an x value, predict the y value shown on the line. If we ask the model why points don't land on the line, it would reply "because life is pain". There's no theoretical reason why the data are the way they are; OLS simply set out to give us a line whose predictions minimize the loss function on the data handed to it.

Further, the line returned by OLS is calculated from a dataset and therefore subject to whatever randomness was involved in data collection. With a different dataset we'd see a different line, but without a model of how the data came to be we can't say much about how our line might change if we re-collected the dataset.
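
The sketch below shows the line moving from one dataset to the next. It can only do so because it invents a data-collection process (the hypothetical `collect_dataset`, which draws `y = 2x + 1` plus noise), and that invented process is exactly the ingredient plain OLS does not have.

```python
import numpy as np

def collect_dataset(rng, n=30):
    """Hypothetical data-collection process: y = 2x + 1 plus noise."""
    x = rng.uniform(0, 10, size=n)
    y = 2 * x + 1 + rng.normal(scale=3.0, size=n)
    return x, y

rng = np.random.default_rng(3)
slopes = []
for _ in range(5):
    x, y = collect_dataset(rng)                  # "re-collect" the data
    slope, intercept = np.polyfit(x, y, deg=1)   # refit the OLS line
    slopes.append(slope)

print(slopes)  # five datasets, five different fitted slopes
```

With real data we only ever get one dataset, so there is no way to run this loop without first assuming how the data were generated.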

In the next notebook, we'll extend OLS to a full probability model. The extra assumptions will allow all sorts of additional statements about the accuracy of the model. 