## Improving Regression Lines

### Introduction

In the last lesson, we derived the functions that help us descend along our cost functions efficiently.  Remember that this technique is not so different from what we saw when using the derivative to tell us our next step size and direction in two dimensions.  

![](./tangent-lines.png)

Remember that we used the slope of the tangent line at each point to tell us how large a step to take next.  Now, with our cost curve being a function of two changing variables, $m$ and $b$, things appear more complicated.  

![](./gradientdescent.png)

But really it's just the same approach.  Just as we can use the derivative of a function like $x^2$ to calculate the slope at a given value of $x$ on the graph, and thus how far to move next, here we use the partial derivatives with respect to the slope and with respect to the y-intercept to calculate the amount to move next in each direction, and thus to steer us towards our minimum.   
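To make the one-variable analogy concrete, here is a minimal sketch of descending $x^2$ using its derivative $2x$. The function names, the starting point of 3.0, and the step scale of 0.1 are assumptions chosen for illustration, not values from the lesson.

```python
def slope_at(x):
    # the derivative of f(x) = x**2 is 2*x
    return 2 * x

def descend(x, rate=0.1, steps=5):
    # rate and steps are illustrative assumptions
    path = [x]
    for _ in range(steps):
        # move against the slope; a steeper slope means a larger step
        x = x - rate * slope_at(x)
        path.append(x)
    return path

path = descend(3.0)
```

Each entry in `path` is closer to the minimum at $x = 0$ than the one before it, and the steps shrink as the slope flattens.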

![](./maps-pointer.png)

### Reviewing our gradient descent formulas

Luckily for us, we already did the hard work of deriving these formulas.  Now we get to see the fruit of our labor.  The following formulas tell us how to update regression variables of $m$ and $b$ to approach a "best fit" line.   

* $ \frac{\partial}{\partial m}J(m,b) = -2\sum x(y - (mx + b))$  
* $ \frac{\partial}{\partial b}J(m,b) = -2\sum(y - (mx + b)) $

Now the formulas above tell us to take some dataset, with values of $x$ and $y$, and then, given a regression line with values $m$ and $b$, plug in our values to find out how to adjust $m$ and $b$.  Let's simplify these formulas conceptually a little bit.  Writing $\hat{y} = mx + b$ for the value our line predicts and $\epsilon = y - \hat{y}$ for the error at a point: 

* $ \frac{\partial}{\partial m}J(m,b) = -2\sum x(y - (mx + b)) = -2\sum x(y - \hat{y}) = -2\sum x\epsilon$  
* $ \frac{\partial}{\partial b}J(m,b) = -2\sum(y - (mx + b)) = -2\sum(y - \hat{y}) = -2\sum \epsilon$

If you step through the above lines, you can see how we can state these formulas in terms of error. As expressed above, we calculate the update to our slope by multiplying each error by the $x$ value and adding these terms up, and then multiplying by negative two.  We calculate the update to the $b$ value by adding up all of the errors at each point, and multiplying by negative two.

Let's get a sense of how this would translate into code.
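As a rough sketch, the snippet below computes both partial derivatives for a line with $m = 0$ and $b = 0$, taking the error at each point as the actual $y$ minus the predicted $y$ (the convention used by the code in this lesson). The three data points are the ones used later in this lesson; the helper names `errors` and `gradients` are made up for illustration.

```python
points = [{'x': 30, 'y': 45}, {'x': 40, 'y': 60}, {'x': 100, 'y': 150}]

def errors(m, b, points):
    # error at each point: actual y minus the y predicted by the line
    return [p['y'] - (m * p['x'] + b) for p in points]

def gradients(m, b, points):
    errs = errors(m, b, points)
    # slope: multiply each error by its x value, add up, multiply by -2
    dm = -2 * sum(p['x'] * e for p, e in zip(points, errs))
    # intercept: add up the errors, multiply by -2
    db = -2 * sum(errs)
    return dm, db

dm, db = gradients(0, 0, points)
# dm is -37500 and db is -510 for these points
```

Both values are negative, so subtracting them from $m$ and $b$ raises the line toward the data.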


### Tweaking our approach 

Ok, we are about to turn these formulas into code, but before we do, we need to make just a couple of tweaks to our approach.

The first one is obvious if we think about what these formulas are really telling us to do.  Look at the graph below, and think about what it means to change each of our $m$ and $b$ variables by at least the sum of our errors.  That would be an enormous change.  To ensure that we are not making such drastic changes, we multiply each of these partial derivatives by a learning rate, where the learning rate is something small like $.0001$.  The learning rate is represented by the Greek letter eta, so $\eta = .0001$ means the learning rate is $.0001$.

This is ok because, in a multivariable function like $J(m,b)$, we should really think of our derivatives as steering us in the correct direction, that is, as making sure we make the correct proportional changes to $m$ and $b$.  So scaling down these changes to make sure we don't move too far works fine. 
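A small sketch of that idea, assuming a starting line of $m = 0$, $b = 0$ and the three points used later in this lesson (the variable names are illustrative): scaling both partial derivatives by the same learning rate shrinks the step but leaves the proportion between the change to $m$ and the change to $b$ untouched.

```python
points = [{'x': 30, 'y': 45}, {'x': 40, 'y': 60}, {'x': 100, 'y': 150}]
m, b = 0, 0
learning_rate = .0001

# raw partial derivatives: -2 * sum(x * error) and -2 * sum(error)
errors = [p['y'] - (m * p['x'] + b) for p in points]
dm = -2 * sum(p['x'] * e for p, e in zip(points, errors))
db = -2 * sum(errors)

# unscaled, the slope would jump from 0 to 37500 in a single step
raw_m = m - dm

# scaled by the learning rate, both steps are ten thousand times smaller
scaled_m_step = learning_rate * dm
scaled_b_step = learning_rate * db

# the ratio between the two changes is the same before and after scaling
raw_ratio = dm / db
scaled_ratio = scaled_m_step / scaled_b_step
```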

![](./regression-scatter.png)

For our second tweak: because the more points we have, the larger our summed error becomes, we correct for this by multiplying our changes by $1/n$, where $n$ is the number of points.
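To see why the $1/n$ factor helps, here is a minimal sketch comparing a dataset with a copy of itself repeated twice. The `doubled` list and the helper `summed_error` are hypothetical, introduced only for this comparison.

```python
points = [{'x': 30, 'y': 45}, {'x': 40, 'y': 60}, {'x': 100, 'y': 150}]
doubled = points + points  # the same data, listed twice

def summed_error(m, b, points):
    # add up actual y minus predicted y over every point
    return sum(p['y'] - (m * p['x'] + b) for p in points)

# the summed error doubles just because there are more points
total_small = summed_error(0, 0, points)
total_large = summed_error(0, 0, doubled)

# dividing by the number of points gives the same value for both
mean_small = total_small / len(points)
mean_large = total_large / len(doubled)
```

The line fits both datasets equally well, and only the averaged error reflects that.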

Making these changes, our formula looks like the following:

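```python
# the three shows used later in this lesson, and a starting line of m = 0, b = 0
points = [{'x': 30, 'y': 45}, {'x': 40, 'y': 60}, {'x': 100, 'y': 150}]
b_current = 0
m_current = 0

learning_rate = .0001
n = len(points)
b_gradient = 0
m_gradient = 0
for point in points:
    # error: actual y minus the y predicted by the current line
    error_at_point = point['y'] - (m_current * point['x'] + b_current)
    b_gradient += -(1/n) * error_at_point
    m_gradient += -(1/n) * error_at_point * point['x']

# step against each gradient, scaled down by the learning rate
new_b = b_current - (learning_rate * b_gradient)
new_m = m_current - (learning_rate * m_gradient)
```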

So note from the code above that our b and m gradients start at zero.  In other words, we begin by assuming that we should not change b or m at all.  Then for each point, we find the error of our current regression line.  For each point, we adjust our b gradient by the error, divided by the number of points.  We adjust our m gradient by the error at a given point multiplied by the x value of that point, again divided by the number of points.  (The constant factor of two from the formulas is dropped, since the learning rate already rescales the step.)  

Finally, we update our values of b and m by the amount of their gradients.  Well, by the amount of each gradient multiplied by the learning rate.  The learning rate simply scales down the size of our update, so that we don't overshoot the mark.  Remember that as long as we move in the right direction, we will get there.
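Written out with the learning rate $\eta$ and the $1/n$ factor, and taking $\epsilon = y - (mx + b)$ as in the code, the update can be sketched as follows (the subscript $new$ is just a label used here):

* $ m_{new} = m - \eta \cdot \left(-\frac{1}{n}\sum x\epsilon\right) = m + \frac{\eta}{n}\sum x\epsilon $  
* $ b_{new} = b - \eta \cdot \left(-\frac{1}{n}\sum \epsilon\right) = b + \frac{\eta}{n}\sum \epsilon $

For example, with $m = 0$, $b = 0$, $\eta = .0001$, and the three points $(30, 45)$, $(40, 60)$, $(100, 150)$ used in this lesson, we have $\sum \epsilon = 255$ and $\sum x\epsilon = 18750$, so $b_{new} = \frac{.0001}{3} \cdot 255 = .0085$ and $m_{new} = \frac{.0001}{3} \cdot 18750 = .625$.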

So that is our gradient descent process.  Start with an initial regression line with values of m and b.  Then for each point, calculate how the regression line fares against the actual point (that is, find the error).  Update the gradient so that we will move each value in the direction that reduces the error.  And after running through all of the points, and adjusting the gradient with every point, update the overall values of b and m by their respective gradients, scaled down by a small learning rate.

Going forward, let's see our gradient descent formulas in action.  Then, we can develop some intuition as to why they work and where they came from.

### Seeing our gradient descent formulas in action

First, let's translate our step gradient process into some real live code.

```python
first_show = {'x': 30, 'y': 45}
second_show = {'x': 40, 'y': 60}
third_show = {'x': 100, 'y': 150}

updated_shows = [first_show, second_show, third_show]

def step_gradient(b_current, m_current, points):
    b_gradient = 0
    m_gradient = 0
    learning_rate = .0001
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i]['x']
        y = points[i]['y']
        # error at this point: actual y minus the y predicted by the current line
        b_gradient += -(1/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(1/N) * x * (y - ((m_current * x) + b_current))
    # step against each gradient, scaled down by the learning rate
    new_b = b_current - (learning_rate * b_gradient)
    new_m = m_current - (learning_rate * m_gradient)
    return {'b': new_b, 'm': new_m}
```

```python
b = 0
m = 0

step_gradient(b, m, updated_shows)
```

```
{'b': 0.0085, 'm': 0.6249999999999999}
```

So just looking at input and output, we begin by setting $b$ and $m$ both to 0.  Then from our `step_gradient` function, we receive new values of b and m of .0085 and .625.  Now what we need to do is take another step in the correct direction by calling our `step_gradient` function with our updated values of b and m.

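```python
updated_b = 0.0085
# m is truncated to four decimal places here, so the result differs
# slightly from feeding the exact output back in
updated_m = 0.6249
step_gradient(updated_b, updated_m, updated_shows)
```

```
{'b': 0.01345805, 'm': 0.9894768333333332}
```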

Let's do this, say, 10 times.

```python
# start from a line with b = 0 and m = 0
b = 0
m = 0
iterations = []
for i in range(10):
    # each iteration is a dictionary of the form {'b': value, 'm': value}
    iteration = step_gradient(b, m, updated_shows)
    # update b and m so the next step starts from the new line
    b = iteration['b']
    m = iteration['m']
    iterations.append(iteration)
```

```python
iterations
```

```
[{'b': 0.0085, 'm': 0.6249999999999999},
 {'b': 0.013457483333333336, 'm': 0.9895351666666665},
 {'b': 0.016348771640555558, 'm': 1.20215258815},
 {'b': 0.018034938763874835, 'm': 1.3261630333815368},
 {'b': 0.01901821141416974, 'm': 1.398492904819568},
 {'b': 0.019591516465717437, 'm': 1.4406797579467343},
 {'b': 0.019925705352372706, 'm': 1.4652855068756228},
 {'b': 0.020120428242875608, 'm': 1.4796369666804499},
 {'b': 0.02023380672219544, 'm': 1.4880075481368862},
 {'b': 0.020299740568747532, 'm': 1.4928897448417577}]
```

As you can see, our m and b values both update with each step.  Not only that, but with each step, our m and b values are updated less and less as the lines they produce have lower errors.
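One way to check this claim numerically is to record, at each step, how much $m$ changed and how large the squared error of the resulting line is. The sketch below restates the lesson's `step_gradient` in condensed form so that it runs on its own; `squared_error` and `history` are hypothetical names introduced here.

```python
points = [{'x': 30, 'y': 45}, {'x': 40, 'y': 60}, {'x': 100, 'y': 150}]

def step_gradient(b_current, m_current, points, learning_rate=.0001):
    # same update rule as the lesson's step_gradient, written with sums
    n = len(points)
    errs = [p['y'] - (m_current * p['x'] + b_current) for p in points]
    b_gradient = -(1/n) * sum(errs)
    m_gradient = -(1/n) * sum(p['x'] * e for p, e in zip(points, errs))
    return {'b': b_current - learning_rate * b_gradient,
            'm': m_current - learning_rate * m_gradient}

def squared_error(b, m, points):
    # sum of squared errors for the line y = m*x + b
    return sum((p['y'] - (m * p['x'] + b)) ** 2 for p in points)

b, m = 0, 0
history = []
for _ in range(10):
    previous_m = m
    step = step_gradient(b, m, points)
    b, m = step['b'], step['m']
    # record the size of the change to m and the error of the new line
    history.append({'m_change': m - previous_m,
                    'error': squared_error(b, m, points)})
```

In `history`, both `m_change` and `error` get smaller from one entry to the next.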

We can see this visually.  We'll write a function called `to_line` that takes the `m` and `b` values from an iteration and produces a data structure we can use as a frame in an animation. 

```python
def to_line(m, b):
    initial_x = 0
    ending_x = 100
    initial_y = m*initial_x + b
    ending_y = m*ending_x + b
    return {'data': [{'x': [initial_x, ending_x], 'y': [initial_y, ending_y]}]}

frames = list(map(lambda iteration: to_line(iteration['m'], iteration['b']),iterations))
frames[0]
```

```
{'data': [{'x': [0, 100], 'y': [0.0085, 62.508499999999984]}]}
```

Now we can see how our regression line changes, and approaches our data, with each iteration.

```python
from plotly.offline import init_notebook_mode, iplot
from IPython.display import display, HTML

init_notebook_mode(connected=True)

x_values_of_shows = list(map(lambda show: show['x'], updated_shows))
y_values_of_shows = list(map(lambda show: show['y'], updated_shows))
figure = {'data': [{'x': [0], 'y': [0]}, {'x': x_values_of_shows, 'y': y_values_of_shows, 'mode': 'markers'}],
          'layout': {'xaxis': {'range': [0, 110], 'autorange': False},
                     'yaxis': {'range': [0,160], 'autorange': False},
                     'title': 'Regression Line',
                     'updatemenus': [{'type': 'buttons',
                                      'buttons': [{'label': 'Play',
                                                   'method': 'animate',
                                                   'args': [None]}]}]
                    },
          'frames': frames}
iplot(figure)
```

```python
x_values_of_shows
```

```
[30, 40, 100]
```

### Summary