In [1]:
from __future__ import print_function

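from traitlets.config.manager import BaseJSONConfigManager

# Machine-specific nbconfig directory; adjust for your own environment.
path = '/Users/jmk/anaconda2/envs/data601/etc/jupyter/nbconfig'
cm = BaseJSONConfigManager(config_dir=path)
# Persist the slideshow ('livereveal') settings used by this notebook.
cm.update('livereveal', {
    'theme': 'night',
    'scroll': True,
    # 'transition': 'zoom',
    'start_slideshow_at': 'selected',
})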

{'scroll': True, 'start_slideshow_at': 'selected', 'theme': 'night'}

# Linear Regression
Simple linear regression lives up to its name: it is a very straightforward approach for predicting a quantitative response $Y$ on the basis of a single predictor variable, $X$. 

Mathematically we can write this as: $Y \approx B_0 + B_1 X$

That is, $Y$ is "approximately modeled as" a linear function of $X$.

$B_0$ and $B_1$ represent the intercept and slope of the line, respectively.  (Recall $y = mx + b$, with $B_1$ playing the role of $m$ and $B_0$ the role of $b$.)  Note:  $B_0$ is also referred to as the _bias_ term.

These are called the _coefficients_ or _parameters_ of the model.  They are the _unknowns_ we want to estimate during the training process.

Once we know these values, we can _predict_ $y$ for new values of $x$.
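As a minimal sketch of what "predict" means here, assume the coefficients have already been estimated. The values of `b0` and `b1` below are hypothetical, picked only for illustration, and `predict` is a helper name introduced for this example.

```python
from __future__ import print_function


def predict(x, b0, b1):
    """Return the model's prediction B_0 + B_1 * x for a single input x."""
    return b0 + b1 * x


# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 2.0, 0.5

# Inputs the model has not seen before.
new_xs = [0.0, 4.0, 10.0]
predictions = [predict(x, b0, b1) for x in new_xs]
print(predictions)  # [2.0, 4.0, 7.0]
```

Prediction is just evaluating the line; all of the work is in finding good values for the two coefficients.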

# How Does Linear Regression Work?

We want to obtain values of the coefficients so that the linear model "fits the data well".  We intuitively define this as the line that follows the "shape" of the training data.  

More formally, it is the line whose predictions come as close as possible to the observed values across all of the training samples.

To do this, we need a way to:

* Define "closely" so that we can measure it
* Define a _search procedure_ that will let us explore the parameter space to find the "best" solution

# Defining Closeness

The most common way to define "close" is the _least squares_ criterion.  Let $\hat{y}_i = \hat{B}_0 + \hat{B}_1 x_i$ be the prediction of the output $y$ for a given input $x_i$, where the hats mark estimated values.

We'll define the _residual_ as the difference between the "right" answer, $y_i$, and what the current model gives us.  That is, $e_i = y_i - \hat{y}_i$.  This is the vertical distance (in $y$) between the right answer and the line, as shown below.

![linear-regression-residuals.png](linear-regression-residuals.png)

We define the _residual sum of squares (RSS)_ as the sum of the squares of those residuals.  That is: $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $n$ is the number of training samples.

The "closest" match is the one that provides the minimum value of _RSS_.
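The RSS for any candidate line can be computed directly from the definition. The sketch below uses a tiny made-up data set that lies exactly on $y = 1 + 2x$ (an assumption for illustration, so the numbers are easy to check by hand); `rss` is a helper name introduced here.

```python
from __future__ import print_function


def rss(xs, ys, b0, b1):
    """Residual sum of squares of the line b0 + b1 * x on the data (xs, ys)."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - (b0 + b1 * x)  # e_i = y_i - y_hat_i
        total += residual ** 2
    return total


# Made-up training data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

print(rss(xs, ys, 1.0, 2.0))  # 0.0: this line passes through every point
print(rss(xs, ys, 0.0, 2.0))  # 4.0: every prediction is off by 1
```

The second line has the right slope but the wrong intercept, so each of the four residuals is 1 and the RSS is 4. Real data is noisy, so the minimum RSS is normally greater than zero.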

![linear-regression-optimization.png](linear-regression-optimization.png)

This is what we call an _optimization problem_.  The goal is to find the values of the parameters that minimize the RSS "cost" of the solution.  That is, of all the possible solutions, we want to find the best one, and we have to go looking for it in the search space above.

Note that every point on that mesh surface represents a pair of parameters.

That is, each point there represents a possible hypothesis, and the height of the surface at that point is the cost (RSS) of that hypothesis.
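To make the "every point is a hypothesis" idea concrete, here is a rough sketch that evaluates RSS on a coarse grid of parameter pairs and picks the cheapest one. The data and grid ranges are assumptions chosen for illustration, and a brute-force grid is only a way to look at the surface, not a practical fitting method.

```python
from __future__ import print_function


def rss(xs, ys, b0, b1):
    """Residual sum of squares of the line b0 + b1 * x on the data (xs, ys)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up training data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# Hypothetical grid: intercepts from -2.0 to 4.0, slopes from 0.0 to 4.0.
b0_grid = [i * 0.5 for i in range(-4, 9)]
b1_grid = [i * 0.5 for i in range(0, 9)]

# One entry per hypothesis: (intercept, slope) -> cost.
surface = {(b0, b1): rss(xs, ys, b0, b1) for b0 in b0_grid for b1 in b1_grid}

best = min(surface, key=surface.get)
print(best, surface[best])  # (1.0, 2.0) 0.0
```

The dictionary `surface` is a sampled version of the mesh: its keys are the hypotheses and its values are their heights.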

Because RSS is a sum of squared terms, it's a _continuous_, differentiable function of the parameters (read:  smoothly curved surface), which means we can use derivatives to calculate the _gradient_ and move in the steepest downhill direction towards the bottom of the surface.  Because the cost function is _convex_ in this case, any minimum we reach this way is guaranteed to be the global minimum.
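One possible search procedure is plain gradient descent. Differentiating RSS gives $\partial RSS / \partial B_0 = -2\sum e_i$ and $\partial RSS / \partial B_1 = -2\sum e_i x_i$, and each step moves the parameters a little way against that gradient. The sketch below is a minimal version under stated assumptions: the learning rate, step count, starting point, and data are all hypothetical choices that happen to work for this toy example.

```python
from __future__ import print_function


def rss_gradient(xs, ys, b0, b1):
    """Partial derivatives of RSS with respect to b0 and b1."""
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    d_b0 = -2.0 * sum(residuals)
    d_b1 = -2.0 * sum(e * x for e, x in zip(residuals, xs))
    return d_b0, d_b1


def gradient_descent(xs, ys, learning_rate=0.01, steps=5000):
    """Repeatedly step downhill on the RSS surface, starting from (0, 0)."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        d_b0, d_b1 = rss_gradient(xs, ys, b0, b1)
        b0 -= learning_rate * d_b0  # move against the gradient
        b1 -= learning_rate * d_b1
    return b0, b1


# Made-up training data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

b0_hat, b1_hat = gradient_descent(xs, ys)
print(b0_hat, b1_hat)
```

On this data the estimates should land very close to an intercept of 1 and a slope of 2. A learning rate that is too large makes the steps overshoot and diverge, so in practice it has to be tuned to the scale of the data.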

Note that this doesn't mean it's the _best_ possible solution; it's just the best that this _model_ (a straight line) can do.
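A small example of that caveat: fit a line to data that is actually curved. The sketch below uses the standard closed-form least-squares formulas for one predictor (slope = $S_{xy}/S_{xx}$, intercept = $\bar{y} - \text{slope} \cdot \bar{x}$), which are not covered in the notes above and are swapped in here only to get the minimizing line quickly. The data is made up, generated from $y = x^2$.

```python
from __future__ import print_function


def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    return b0, b1


def rss(xs, ys, b0, b1):
    """Residual sum of squares of the line b0 + b1 * x on the data (xs, ys)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data from a curve, y = x^2, not from a line.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
print(b0, b1, rss(xs, ys, b0, b1))  # 2.0 0.0 14.0
```

The minimizing line is flat at $y = 2$ with an RSS of 14. No other straight line does better on this data, yet it misses the curve entirely; a lower cost would require a different model, not a better search.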