### Introduction: Just a bit better

In the last section, we saw our process for improving regression lines.  We started out with some data.  Then we used a simple regression line of the form $\overline{y} = \overline{m}x + \overline{b}$ to predict an output, given an input.  Finally, we measured the accuracy of our regression line by calculating the differences between the outputs predicted by the regression line and the actual values.

![regression-scatter.png](./regression-scatter.png)

We quantify the accuracy of the regression line by squaring all of the errors (to eliminate negative values) and adding these squares together, to get our residual sum of squares (RSS).  Armed with a number that describes the line's accuracy (or goodness of fit), we iteratively try new regression lines by adjusting our y-intercept value, $b$, or slope value, $m$, and then assessing goodness of fit.  By finding the values $m$ and $b$ that minimize the RSS, we can find our "best fit line".  
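
As a minimal sketch of the RSS calculation described above, the function below squares each error and adds the squares together. The data points and the function name `residual_sum_of_squares` are made up for illustration; they are not the data behind the plots in this lesson.

```python
def residual_sum_of_squares(m, b, xs, ys):
    """Return the RSS of the line y = m*x + b against the observed points."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for this example: the points lie exactly on y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

perfect_fit = residual_sum_of_squares(2, 1, xs, ys)  # every error is 0
shifted_fit = residual_sum_of_squares(2, 2, xs, ys)  # every error is -1, squared and summed over 4 points
print(perfect_fit, shifted_fit)
```

Moving the y-intercept away from its best value raises the RSS, which is exactly the signal we use to compare one candidate line against another.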

In our cost function below, you can see the sequential values of $b$ and the related RSS values (given a specific value $m$).

```python
import plotly
from plotly.offline import init_notebook_mode, iplot
# plotting helpers come from the `graph` module used throughout this lesson
from graph import m_b_trace, trace_values, plot, build_layout
init_notebook_mode(connected=True)

# candidate y-intercept values and the RSS measured for each one
b_values = list(range(70, 150, 10))
rss = [10852, 9690, 9128, 9166, 9804, 11042, 12880, 15318]

layout = build_layout(options = {'title': 'RSS with changes to y-intercept', 'xaxis': {'title': 'y-intercept value'}, 'yaxis': {'title': 'RSS'}})
cost_curve_trace = trace_values(b_values, rss, mode="line")
plot([cost_curve_trace], layout)
```

> The bottom of the blue curve displays the $b$ value that produces the lowest RSS.

### Things are not so simple

Now at this point, our problem of finding the minimum may seem simple.  For example, why not simply try **all** of the different values for a y-intercept, and find the value where RSS is the lowest?

Well, we want to choose an approach that will continue to work as we change more variables of our regression line.  And as we change more variables, our cost curve only gets more complicated.

So far, we have only changed one variable at a time while watching how the RSS changes.  But if we change both the slope and the y-intercept, plotting how the RSS changes looks like the following.

![](./gradientdescent.png)

So as you can see, finding the RSS for each value becomes harder once we are able to change two different variables, the slope and the y-intercept.  And in the future we'll be able to change more than just those two.

So because we will want to vary multiple variables in our regression lines, we will need to rule out some approaches that are more computationally expensive, or simply not possible.

* We **cannot** simply use the derivative -- whatever that is -- to find the minimum.  Using that approach will be impossible in many scenarios as our regression lines become more complicated.
* We **cannot** alter all of the variables of our regression line across all points and calculate the result.  It will take too much time as we have more variables to alter.
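
To get a feel for why trying every combination takes too much time, here is a rough sketch that counts how many RSS calculations an exhaustive search would need. The figure of 100 candidate values per variable and the helper name `grid_evaluations` are assumptions chosen only for illustration.

```python
def grid_evaluations(candidates_per_variable, n_variables):
    """Number of RSS calculations needed to try every combination of values."""
    return candidates_per_variable ** n_variables

# Hypothetical setup: 100 candidate values for each variable we are allowed to change
for n_variables in (1, 2, 3, 5):
    print(n_variables, grid_evaluations(100, n_variables))
```

With one variable we need 100 calculations, with two we need 10,000, and the count keeps multiplying by 100 for every variable we add.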

Altering our regression line and calculating the RSS, though, is on the right track.

Remember in the last lesson, we evaluated our regression line by changing our y-intercept by 10 to see if our changing $b$ value produced a higher or lower RSS.  

| b | residual sum of squares |
| ------------- |:-------------:|
| 140 | 24131 |
| 130 | 21497 |
| 120 | 19864 |
| 110 | 19230 |
| 100 | 19597 |
| 90 | 20963 |
| 80 | 23330 |
| 70 | 26696 |
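
As a small sketch of this "change $b$ by 10 and compare" approach, the snippet below stores the table's values in a dictionary (the name `rss_by_b` is just an illustrative choice) and picks the $b$ with the lowest RSS among the values we happened to try.

```python
# RSS values copied from the table above, keyed by the y-intercept b
rss_by_b = {
    140: 24131, 130: 21497, 120: 19864, 110: 19230,
    100: 19597, 90: 20963, 80: 23330, 70: 26696,
}

# the tried value of b with the smallest RSS
best_b = min(rss_by_b, key=rss_by_b.get)
print(best_b, rss_by_b[best_b])
```

Note that this only tells us the best value among the ones we sampled.  The true minimum may sit somewhere between the sampled points, and we only find out how good a value is after computing its RSS.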

What we want to do, however, is change our variables more carefully so that **we know** our changes will move us towards reducing our RSS.  That is what we want to pursue.

### Our approach

So we don't want to adjust the y-intercept value or another variable and then just hope that the RSS decreased.  Doing so is like trying to fly a plane by just sitting down and pressing buttons.

We want an approach where, with every change, we can rest assured that we're moving in the right direction.  We also want to know how much of a **change** to make to our regression line to minimize RSS.

> Let's call each of these changes a **step**, and the size of the change our **step size**. 

So our new task is to choose step sizes that get us to our minimum RSS quickly and without overshooting the mark.

![](https://bossip.files.wordpress.com/2014/11/aden-and-cree-580x435.jpg)

### The slope of the cost curve tells us our step size

Believe it or not, we can determine how large our step size should be just by looking at the slope of our cost function.

Imagine yourself standing on our cost curve.  Even with your eyes closed, you could tell simply *by the way you were tilting* whether to walk forwards or backwards to approach the bottom of the cost curve.  

![](./skateboard.png)

* If the slope tilts downwards, then we should walk forward to approach the minimum.  
* And if the slope tilts upwards, then we should walk backwards to approach the minimum.  
* And the steeper the tilt, the further away we are from our cost curve's minimum, so we should take a larger step.  
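
The three rules above can be sketched as a tiny function.  The name `describe_step` and the labels "forward" and "backward" are illustrative choices, where "forward" means increasing the y-intercept; this is a sketch of the intuition rather than the final update rule.

```python
def describe_step(slope):
    """Return the direction to walk and the steepness of the tilt at a point."""
    if slope < 0:
        direction = "forward"    # curve tilts downwards: increase b
    elif slope > 0:
        direction = "backward"   # curve tilts upwards: decrease b
    else:
        direction = "stay"       # flat: we are at the bottom
    return direction, abs(slope)

print(describe_step(-146.17))  # steep downward tilt: walk forward with a large step
print(describe_step(3.8))      # gentle upward tilt: walk backward with a small step
```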

So by looking at the tilt of a cost curve at a given point, we can discover the direction of our next step and how large of a step to take.  The beauty of this is that as our regression lines become more complicated, we do not need to plot all of the values of our regression line.  We can see the next variation of the regression line to try simply by looking at the slope.

To see this, let's zoom in on our cost function and look at just one part of it.  Looking at our zoomed-in cost function below, do you see a way that we can get a sense of how to alter our y-intercept next, and how much we should alter this value?

```python
import plotly
from plotly.offline import init_notebook_mode, iplot
# plotting helpers come from the `graph` module used throughout this lesson
from graph import m_b_trace, trace_values, plot, build_layout
init_notebook_mode(connected=True)
layout = build_layout(options = {'title': 'RSS with changes to y-intercept', 'xaxis': {'title': 'y-intercept value'}, 'yaxis': {'title': 'RSS'}})

# keep only the first three points to zoom in on the left side of the cost curve
b_values = list(range(70, 150, 10)[:3])
rss = [10852, 9690, 9128, 9166, 9804, 11042, 12880, 15318][:3]
cost_curve_trace = trace_values(b_values, rss, mode="line")
plot([cost_curve_trace], layout)
```

### Stepping according to the slope

![](./cost-chart-slope.png)

We can follow our technique with more precision by adding some numbers to our slope.  The slope of the curve at any given point is equal to the slope of the tangent line at that point.  By tangent line, we mean the line that just barely touches the curve at that point.  In the above graph, the orange, green, and red lines are tangent to our cost curve at the points where $b$ equals 70, 85, and 90 respectively.  The slopes of our tangent lines, and therefore the slopes of the cost curves at those points, are labeled above.  
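
If we only have the plotted points, one rough way to put a number on the tilt is to take the slope of the straight line between neighbouring points.  The sketch below does this with the `b_values` and `rss` lists from the plotting cells.  Keep in mind these are approximations over each 10-unit interval, not the tangent slopes labeled in the graph.

```python
b_values = list(range(70, 150, 10))
rss = [10852, 9690, 9128, 9166, 9804, 11042, 12880, 15318]

# change in RSS divided by change in b, for each pair of neighbouring points
approx_slopes = [
    (rss[i + 1] - rss[i]) / (b_values[i + 1] - b_values[i])
    for i in range(len(rss) - 1)
]
print(approx_slopes)
```

The slopes start out steeply negative, shrink towards zero, and then turn positive, which matches the picture of walking down one side of the curve and up the other.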

The whole point of looking at the slope is that it tells us the size and direction of our next step, and thus how to change our value of $b$.  Let's see how this works.

We use the following procedure for finding the ideal $b$: 
1.  Randomly choose a value of $b$, and then 
2.  Update $b$ with the formula $b_{t+1} = (-.1) * slope_{b = b_t} + b_t$.

All that formula says is to choose a value of $b$, and then move it by a small number, $-.1$ times the slope of the tangent line at that point.  This way, the larger the slope, the larger the step.  The negative sign means we will always move in the opposite direction of the slope (that is, when the tangent line points downwards, we move our y-intercept forwards). 

Let's see an example.  We randomly choose our value of $b$ to equal 70.  Then:

* $b_{t=0} = 70$
* $b_{t=1} = (-.1) * -146.17 + 70 = 14.617 + 70 = 84.617 \approx 85$
* $b_{t=2} = (-.1) * -58.51 + 85 = 5.851 + 85 = 90.851$
* $b_{t=3} = (-.1) * -21.07 + 90.851 = 2.107 + 90.851 = 92.958$
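
The same arithmetic can be sketched in a few lines of Python.  The function name `update_b` is an illustrative choice, and the slopes are the ones quoted in the example, taken at the rounded points labeled in the graph.

```python
def update_b(b, slope, learning_rate=0.1):
    """Move b against the slope, scaled down by the learning rate."""
    return b - learning_rate * slope

step_1 = update_b(70, -146.17)      # start at b = 70
step_2 = update_b(85, -58.51)       # continue from roughly b = 85
step_3 = update_b(90.851, -21.07)   # continue from b = 90.851
print(round(step_1, 3), round(step_2, 3), round(step_3, 3))
```

Each update is smaller than the one before it, because the slope gets flatter as $b$ gets closer to the bottom of the curve.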

> Notice that we don't update our values of $b$ by just adding or subtracting the slope at that point.  Doing so would be too drastic.  Instead we multiply the slope by a fraction, in this case $.1$, which is called a **learning rate**.  This way, the steeper the slope of the tangent line, the more we change $b$, but we still make sure we are not changing our regression lines too drastically.  
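
To see why using the full slope would be too drastic, here is a sketch on a made-up cost curve, $(b - 95)^2$, whose minimum sits at $b = 95$.  This curve, the constant `MINIMUM_B`, and the function names are assumptions for illustration only; they are not the cost curve plotted in this lesson.

```python
MINIMUM_B = 95  # hypothetical location of the minimum

def slope_at(b):
    """Slope of the made-up cost curve (b - 95)**2 at the point b."""
    return 2 * (b - MINIMUM_B)

def update_b(b, learning_rate):
    """Move b against the slope, scaled by the learning rate."""
    return b - learning_rate * slope_at(b)

careful = update_b(70, learning_rate=0.1)  # 70 - 0.1 * (-50)
drastic = update_b(70, learning_rate=1)    # 70 - 1 * (-50)
print(careful, drastic)
```

With a learning rate of $.1$ we move from 70 to 75, a little closer to the minimum.  With the full slope we jump from 70 to 120, landing just as far from 95 as we started, only on the other side.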

This technique is pretty magical.  By looking at the tangent line at each point, we are no longer changing our $b$ value and just hoping that it has the correct impact on our RSS.  This is because, for one, the slope of the tangent line points us in the right direction.  And as you can see above, our technique properly adjusts the amount to change the $b$ value by without even knowing where the ideal $b$ value is.  When our $b$ was far away from the ideal $b$ value, our formula increased our $b$ by more than 14, and by the third step we were only updating our $b$ value by about 2, as we approached the $b$ that minimizes our RSS.  
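
Putting the pieces together, the sketch below repeats the update until the steps become tiny.  It again assumes a made-up cost curve, $(b - 95)^2$, so that the slope can be written down directly; the names `slope_at` and `descend`, the tolerance, and the step limit are all illustrative assumptions.

```python
def slope_at(b):
    """Slope of the hypothetical cost curve (b - 95)**2 at the point b."""
    return 2 * (b - 95)

def descend(b, learning_rate=0.1, tolerance=0.01, max_steps=1000):
    """Repeatedly step against the slope; return every value of b visited."""
    history = [b]
    for _ in range(max_steps):
        step = -learning_rate * slope_at(b)
        if abs(step) < tolerance:   # steps are now tiny, so we are near the bottom
            break
        b += step
        history.append(b)
    return history

history = descend(70)
step_sizes = [abs(later - earlier) for earlier, later in zip(history, history[1:])]
print(history[-1], step_sizes[:3])
```

Starting from 70, the first step is the largest, and every later step is smaller than the one before it as $b$ settles close to 95.  We never had to know where the minimum was in order to get there.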

### Summary

We started this section by saying that we wanted a technique to find a $b$ value that would minimize our RSS, given a value of $m$.  We did not want to simply try all of the values of $b$, as doing so would be inefficient.  Instead, we went with the approach of gradient descent, where we try variations of regression lines by iteratively changing our $b$ variable and assessing our RSS to see if we are making progress.

In this lesson, we focused on how to know in which direction to alter a given variable, $m$ or $b$, as well as a technique for determining the size of the change to one of our variables.  We used the line tangent to our cost curve at a given point to indicate the direction and size of the update to $b$.  The further away we are from the ideal $b$ value, the steeper the curve and thus the larger the step we would want to take.  Appropriately, our tangent line slope would have us take a larger step.  And the closer we are to the ideal $b$ value, the flatter the tangent line to the curve, and the smaller a step we would take. 