### Improving Regression Lines

In the previous section we saw how after choosing the slope and y-intercept values of a regression line, we use the root mean squared error to distill the goodness of fit into one number.  

Now we can go beyond that to find the "best fit" regression line by doing the following:
* Adjust $b$ and $m$, as these are the only things that can vary in a single-variable regression line.
* After each adjustment calculate the average squared error 
* The regression line (that is, the values of $b$ and $m$) with our smallest average squared error is our best fit line 

Let's see this technique in action.  For this example, let's imagine that our data does not include the point when x = 0. This leaves our dataset looking like the following:

In [223]:
first_show = {'x': 100, 'y': 150}
second_show = {'x': 200, 'y': 300}
third_show = {'x': 400, 'y': 700}

updated_shows = [first_show, second_show, third_show]

We again take an initial guess at slope by drawing a line between the first and last points.  And then let's just take an initial stab at $b$ by setting $b$ = 100.

In [224]:
def slope_between_two_points(first_point, second_point):
    return (second_point['y'] - first_point['y'])/(second_point['x'] - first_point['x'])

slope_between_two_points(updated_shows[0], updated_shows[2]) # 1.833

def regression_formula(x):
    return 1.83*x + 50
    # change the number 50 to different numbers, to see what happens
    
 

From there, we calculate the sum of squared errors.

In [225]:
def y(x, points):
    point_at_x = list(filter(lambda point: point['x'] == x,points))[0]
    return point_at_x['y']

def squared_error(x, movies):
    return (y(x, movies) - regression_formula(x))**2

def sum_of_squared_errors(points):
    squared_errors = list(map(lambda point: squared_error(point['x'], points), points))
    return sum(squared_errors)

sum_of_squared_errors(updated_shows) # 27069.0

27069.0

Ok, over 27,000.  Is that a good number? Who knows. Let's get a sense of this by plugging in different numbers for *b* and seeing what happens to the residual sum of squares.

| b   | residual sum of squares |
| --- |:-----------------------:|
| 100 | 53069 |
| 110 | 55989 |
| 90  | 50749 |
| 80  | 49029 |
| 70  | 47909 |
| 60  | 47389 |
| 50  | 47469 |

Notice that simply by trying different values of $b$, we can find a smaller residual sum of squares (RSS), given our value of $m$ at 1.83.  Setting $b$ to 110 produced a higher error than setting it to 100, so we tried moving in the other direction.  We kept lowering $b$ until we set $b$ = 50, at which point our error increased from its value at 60.  So we know that a value of $b$ between 50 and 60 produces the smallest RSS when $m$ = 1.83.
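The walk described above can be sketched in a few lines. This is only an illustration: `sample_points`, `residual_sum_of_squares` and `walk_b` are hypothetical names, and the data is made up so that the arithmetic is easy to check by hand, so the numbers will not match the table above.

```python
# Hypothetical data, chosen so the arithmetic is easy to follow by hand.
sample_points = [{'x': 100, 'y': 250}, {'x': 200, 'y': 470}, {'x': 400, 'y': 860}]

def residual_sum_of_squares(points, m, b):
    """Sum of squared differences between each actual y and the line m*x + b."""
    return sum((point['y'] - (m * point['x'] + b)) ** 2 for point in points)

def walk_b(points, m, b, step_size):
    """Move b by step_size in whichever direction lowers the RSS, until it stops falling."""
    current = residual_sum_of_squares(points, m, b)
    # Try stepping up first; if that does not lower the error, step down instead.
    if residual_sum_of_squares(points, m, b + step_size) < current:
        direction = step_size
    else:
        direction = -step_size
    while True:
        candidate = residual_sum_of_squares(points, m, b + direction)
        if candidate >= current:  # the error no longer descends, so we stop
            return b, current
        b, current = b + direction, candidate

best_b, best_rss = walk_b(sample_points, m=2, b=100, step_size=10)
print(best_b, best_rss)  # prints: 60 200
```

Starting at 100, the sketch first tries 110, sees the error rise, and then walks down in steps of ten until the error at 50 comes out higher than the error at 60.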

If we plot these values on a chart, we get the following:

![](./cost-curve-plot.png)

So you can see visually that, of the values we checked, $b$ = 60 gives the lowest RSS.  We start at 100 and move back and forth until we get to around 60.  The distance between checks, here ten, is called our *step size*.

This technique is called *gradient descent*.  Informally, you can think of the gradient as *a series of successive changes*, and that's what we make: we successively change the value of our y-intercept.  We *descend* because we move down along a cost curve.  When the value of our RSS no longer descends as we change our variable, we stop.

So our technique from the top of this lesson holds true: 

* Adjust $b$ and $m$, as these are the only things that can vary in a single-variable regression line.
* After each adjustment calculate the average squared error 
* The regression line (that is, the values of $b$ and $m$) with our smallest average squared error is our best fit line 

### Still, things are not so simple

Things happened to work out fairly nicely for us, but it could have been worse. 

For example, imagine that instead of checking our error value after changing our y-intercept by 10, we checked it every 80.  All of our plot points from 20 to 100 would disappear, and our graph would suggest something different.  Take a look: it skips right over our "best fit" value of 60, and suggests that if we move our $b$ value higher than 100, our cost will continue to descend.

![](./large-step.png)

You may find this exercise silly: who would check the error with a step that large?  But remember that with our linear regression, we do not just change our y-intercept; we also change our slope as we change the y-intercept.  So our calculations of residual sum of squares turn into a grid.

| b   | m = 1.83 | m = 1.9 | m = 2.0 |
| --- |:--------:|:-------:|:-------:|
| 100 | 53069 | ? | ? |
| 110 | 55989 | ? | ? |
| 90  | 50749 | ? | ? |
| 80  | 49029 | ? | ? |
| 70  | 47909 | ? | ? |
| 60  | 47389 | ? | ? |
| 50  | 47469 | ? | ? |

How much do we change our slope variable as we change our y-intercept?  If we choose too small of a number, we may never get there.  If we choose too large of a number, well we will hop right over our "best fit" value -- as the "silly" graph above showed.  So *this* is the problem that we need to solve: tell us how much to adjust $m$ and $b$ between each calculation of error, so that we can then arrive at our best fit line. 
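To make the grid idea concrete, here is a minimal sketch that fills in every cell of such a grid and picks the smallest one. The data, the candidate values and the names (`sample_points`, `residual_sum_of_squares`, `grid`) are all hypothetical, chosen only so the result is easy to verify by hand.

```python
# Hypothetical data and candidate values, for illustration only.
sample_points = [{'x': 100, 'y': 250}, {'x': 200, 'y': 470}, {'x': 400, 'y': 860}]
b_values = [50, 60, 70, 80, 90, 100, 110]
m_values = [1.9, 2.0, 2.1]

def residual_sum_of_squares(points, m, b):
    """Sum of squared differences between each actual y and the line m*x + b."""
    return sum((point['y'] - (m * point['x'] + b)) ** 2 for point in points)

# One RSS per (b, m) pair: every cell of the grid, computed by brute force.
grid = {(b, m): residual_sum_of_squares(sample_points, m, b)
        for b in b_values
        for m in m_values}

# The key whose RSS is smallest is our best candidate on this grid.
best_b, best_m = min(grid, key=grid.get)
print(best_b, best_m, grid[(best_b, best_m)])  # prints: 60 2.0 200.0
```

Brute force works for a grid this small, but the number of cells grows with every extra candidate value. It also only ever finds the best pair among the candidates we chose, which is why a rule for how far to move each parameter is more useful.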

To figure this out, we sent some all-intelligent mathematicians to go up to a mountain and when they came back down, they gave us the following formulas:

`b = b + average_error`

`m = m + average_error * sum_of_x_values`

Those are our gradient descent formulas.  Tada, as they say.

Now, would we like to understand how to use those formulas, why those formulas make sense, and where they came from?  Yes, we would.  So let's do the following.  

* First, let's explain these formulas a little more and see them in action
* Then, hoping to be one of those smart mathematicians ourselves, we'll develop some intuition for these formulas.  
* Finally, we'll show how to use some good old fashioned mathematics to derive them.

The material to come may be confusing at first.  But stick with it, we'll be attacking it from all angles.

### Seeing our gradient descent formulas in action

Ok, so what are those formulas about?  Well, remember, they tell us how much we should change our variables as we calculate how the size of our error changes.

What the first formula says is that we should change the $b$ value by the amount of our average error.  `b = b + average_error` means reassign `b` to equal the old value of `b` plus the `average_error`.

The second formula says that as we do, we should also change our m value by the average_error of our function multiplied by x.  Making these two changes simultaneously can have a large effect on our line.  To combat this, we multiply these changes by a learning rate, for example, $0.1$. 

So really each step, looks like the following:

```python
learning_rate = .1
n = size_of_data_set
b = b + learning_rate*(total_error)*(1/n)
m = m + learning_rate*(total_error*sum_of_x_values)*(1/n)
```

So think of our formulas as telling us proportions, how much should we change $m$ as we change $b$.  These are factors of how large our error was, as well as our previous values of b and m, all multiplied by a learning rate so as not to overshoot the values that minimize our cost. 

Ok, let's turn this technique into code.  Our formulas above, depend on calculations of `average_error` and the `sum_of_x_values` for our data set.

In [226]:
def error(points, x, m, b):
    # actual y - expected y
    return y(x, points) - (m*x + b)

Ok, now that we have that function, we can move on to coding our formulas.  We said that those formulas describe what happens in each step, so we wrap them in a function called `step_gradient`, and turn $b$ and $m$ into arguments, so that we can change them each time we execute our function:

In [274]:
first_show = {'x': 100, 'y': 150}
second_show = {'x': 200, 'y': 300}
third_show = {'x': 400, 'y': 700}

updated_shows = [first_show, second_show, third_show]


learning_rate = .001
n = len(updated_shows)
b = 100
m = 1
steps = []
steps.append({'b': b, 'm': m})
def step_gradient(b, m):
    m_gradient = 0
    b_gradient = 0
    for show in updated_shows:
        x = show['x'] 
        y = show['y']
        expected = (m*x + b)
        error = y - expected
        b_gradient += (error)*(2/n)
        m_gradient += (error*x)*(2/n)

    b = b - learning_rate * b_gradient
    m = m - learning_rate * m_gradient
    return {'b': b, 'm': m, 'error': error, 'b_gradient': b_gradient, 'm_gradient': m_gradient}

for i in range(10):
    current_regression = steps[-1]
    steps.append(step_gradient(current_regression['b'], current_regression['m']))
    

In [275]:
steps

[{'b': 100, 'm': 1},
 {'b': 99.9,
  'b_gradient': 99.99999999999999,
  'error': 200,
  'm': -48.99999999999999,
  'm_gradient': 49999.99999999999},
 {'b': 76.46646666666668,
  'b_gradient': 23433.533333333326,
  'error': 20200.099999999995,
  'm': -7099.046666666665,
  'm_gradient': 7050046.666666665},
 {'b': -3237.03571151111,
  'b_gradient': 3313502.1781777767,
  'error': 2840242.2001999994,
  'm': -1001166.5623155553,
  'm_gradient': 994067515.6488886},
 {'b': -470455.33886352653,
  'b_gradient': 467218303.15201545,
  'error': 400470561.9619336,
  'm': -141166232.56982532,
  'm_gradient': 140165066007.50977},
 {'b': -66348972.215459734,
  'b_gradient': 65878516876.59621,
  'error': 56466964183.269,
  'm': -19904658574.83684,
  'm_gradient': 19763492342267.016},
 {'b': -9355322339.183748,
  'b_gradient': 9288973366968.29,
  'error': 7961929779606.952,
  'm': -2806587822142.362,
  'm_gradient': 2786683163567525.0},
 {'b': -1319115016651.0645,
  'b_gradient': 1309759694311880.5,
  'err
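
The values in the output above grow with every step rather than settling down. As a sketch of a step that does descend, the version below assumes two changes to the cell above: the accumulated terms are negated, so that they are the partial derivatives of the mean squared error, and the learning rate is shrunk to 0.000001. That rate is an illustrative value, not one taken from this notebook, and the names `shows`, `mean_squared_error` and `descending_step` are hypothetical.

```python
shows = [{'x': 100, 'y': 150}, {'x': 200, 'y': 300}, {'x': 400, 'y': 700}]

def mean_squared_error(points, m, b):
    """Average of the squared differences between actual y and m*x + b."""
    return sum((p['y'] - (m * p['x'] + b)) ** 2 for p in points) / len(points)

def descending_step(points, m, b, learning_rate):
    """Take one gradient descent step on the mean squared error."""
    n = len(points)
    b_gradient = 0
    m_gradient = 0
    for p in points:
        error = p['y'] - (m * p['x'] + b)
        # Partial derivatives of the mean squared error with respect to b and m.
        b_gradient += -(2 / n) * error
        m_gradient += -(2 / n) * error * p['x']
    # Step against the gradient, scaled down by the learning rate.
    return m - learning_rate * m_gradient, b - learning_rate * b_gradient

m, b = 1, 100
history = [mean_squared_error(shows, m, b)]
for _ in range(100):
    m, b = descending_step(shows, m, b, learning_rate=0.000001)
    history.append(mean_squared_error(shows, m, b))

print(history[0], history[-1])  # the error after 100 steps is lower than at the start
```

With x-values in the hundreds, the slope term dominates: in this sketch $m$ settles within a few dozen steps, while $b$ barely moves at such a small rate, so many more steps would be needed to bring $b$ close to its best value.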

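So with our `step_gradient` function, we alter $m$ and $b$ each time and then recalculate the error.  Notice that in the output above the values grow without bound instead of settling: because `error` is `y - expected`, subtracting the accumulated gradients moves $b$ and $m$ away from the minimum, and a learning rate of .001 is large for x-values in the hundreds.  Let's repeat the step many more times and see where we get.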

In [240]:
# set our initial step with m and b values, and the corresponding error 



def generate_gradient_descent(steps):
    for i in range(10000):
        
        current_regression = steps[len(steps)-1]
        
        updated_regression = step(current_regression['b'], current_regression['m'], current_regression['total_error'])
        steps.append(updated_regression)
    return steps

In [153]:
resulting_steps = generate_gradient_descent(steps)

In [154]:
resulting_steps[998:1000]

[{'b': 998.5682740717915,
  'm': -2.2081497460219643,
  'total_error': 4.547473508864641e-13},
 {'b': 998.5682740717915,
  'm': -2.2081497460219635,
  'total_error': -1.1368683772161603e-13}]

In [45]:
from plotly.offline import init_notebook_mode, iplot
from IPython.display import display, HTML

init_notebook_mode(connected=True)

figure = {'data': [{'x': [0, 1], 'y': [0, 1]}],
          'layout': {'xaxis': {'range': [0, 5], 'autorange': False},
                     'yaxis': {'range': [0, 5], 'autorange': False},
                     'title': 'Start Title',
                     'updatemenus': [{'type': 'buttons',
                                      'buttons': [{'label': 'Play',
                                                   'method': 'animate',
                                                   'args': [None]}]}]
                    },
          'frames': [{'data': [{'x': [1, 2], 'y': [1, 2]}]},
                     {'data': [{'x': [1, 4], 'y': [1, 4]}]},
                     {'data': [{'x': [3, 4], 'y': [3, 4]}],
                      'layout': {'title': 'End Title'}}]}

iplot(figure)

### Using the fitted line

In [6]:
def regression_formula_variable(x, m, b):
    return m*x + b

Now we update our functions that calculate the error to use our new function, and to allow us to pass through the values of $b$ and $m$.

In [35]:
def squared_error_variable(point, m, b):
    y_hat = regression_formula_variable(point['x'], m, b)
    return (point['y'] - y_hat)**2

def squared_errors_variable(points, m, b):
    return list(map(lambda point: squared_error_variable(point, m, b), points))

def sum_of_squared_error_variable(points, m, b):
    return sum(squared_errors_variable(points, m, b))

b_values = list(range(10, 180, 90)) 
errors = list(map(lambda b_value: sum_of_squared_error_variable(updated_shows, 1.83, b_value), b_values))
error_chart = list(zip(b_values, errors))
error_chart

[(10, 53789.0), (100, 53069.0)]

Above is our error chart.  Note that the value at $b$ = 100 is identical to the one in the table we have above.  If we plot the b-values against the sums of squared errors generated from them, we see that the data makes a curve.

In [36]:
import plotly.offline
from plotly import graph_objs 
cost_function_trace = graph_objs.Scatter(
    x=list(map(lambda error: error[0], error_chart)),
    y=list(map(lambda error: error[1], error_chart)),
)

layout = dict(title = 'Cost Function',
              yaxis = dict(zeroline = False, title= 'Sum Squared Error'),
              xaxis = dict(zeroline = False, title= 'B value')
             )
plotly.offline.iplot(dict(data=[cost_function_trace], layout=layout))

That smiley face above is called the **cost curve**.  It shows the errors at different values of $b$.  We want to reduce the error, so we need to find the value of $b$ for which the sum of squared errors is lowest, which appears to be when $b$ is 60.  So that means that our y-intercept, when $m$ is 1.83, should be 60.

If we show the regression line side by side with the cost curve, you can see how the two relate.

> Don't stress about the below code.  It's not important -- it's just used to generate lines in our plots.  

In [12]:
def generate_regression_line(ending_x, m, b):
    y_hat = m*ending_x + b
    return {
    'type':'line',
    'x0': 0,
    'y0': b,
    'x1': ending_x,
    'y1': y_hat,
    'xref': 'x1',
    'yref': 'y1',
    'line': {
        'color': 'rgb(55, 128, 191)',
        'width': 3,
        }
    }
line = generate_regression_line(400, 1.8, 500)

def generate_cost_line(errors, b):
    return {
    'type':'line',
    'x0': b,
    'y0': 0,
    'x1': b,
    'y1': max(errors),
    'xref': 'x2',
    'yref': 'y1',
    'line': {
        'color': 'rgb(55, 128, 191)',
        'width': 3,
        }
    }


> Now the below code, still doesn't need to be understood.  But do change the value of b, and see how the plots below adjust.

In [13]:
import plotly
from plotly import graph_objs, tools
plotly.offline.init_notebook_mode(connected=True)

fig = tools.make_subplots(rows=1, cols=2)



cost_function_trace = graph_objs.Scatter(
    x=list(map(lambda error: error[0], error_chart)),
    y=list(map(lambda error: error[1], error_chart)),
)
fig.append_trace(cost_function_trace, 1, 2)

scatter_trace = graph_objs.Scatter(
    x=list(map(lambda show: show['x'], updated_shows)),
    y=list(map(lambda show: show['y'], updated_shows)),
    mode="markers"
)


##############

### CHANGE THIS VALUE OF B

b = 80
##############

cost_line = generate_cost_line(errors, b)
regression_line = generate_regression_line(400, 1.8, b)

fig.append_trace(scatter_trace, 1, 1)

fig['layout'].update(shapes=[regression_line, cost_line])
fig['layout']['yaxis1'].update(range=[0, 1000])

plotly.offline.iplot(fig)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



> As you change the value of *b*, the regression line moves up and down.  Also, as you change the value of *b* the vertical line along the cost curve shifts left or right, with the intersecting point being the value of b and corresponding sum of squared error.

### Automatically adjusting a regression line

So far we have improved our regression line by manually adjusting our estimated y-intercept and checking the sum of squared errors.  What if we wanted to use code to do this process automatically?

Well, what you can imagine us doing is adjusting our value of $b$.  Now, we don't want to think about the squared errors anymore, because a squared error does not tell us whether our estimate is too high or too low.  Instead, look at what happens when we consider the signed error, that is, the actual value minus the predicted value, without squaring.

In [41]:
def regression_formula_variable(x, m, b):
    return m*x + b

def error_variable(point, m, b):
    y_hat = regression_formula_variable(point['x'], m, b)
    return (point['y'] - y_hat)

def errors_variable(points, m, b):
    return list(map(lambda point: error_variable(point, m, b), points))

def average_error_variable(points, m, b):
    return sum(errors_variable(points, m, b))/len(points)

b_values = list(range(10, 120, 10)) # [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
errors = list(map(lambda b_value: average_error_variable(updated_shows, 1.83, b_value), b_values))
linear_error_chart = list(zip(b_values, errors))
linear_error_chart

[(10, 46.333333333333336),
 (20, 36.333333333333336),
 (30, 26.333333333333332),
 (40, 16.333333333333332),
 (50, 6.333333333333333),
 (60, -3.6666666666666665),
 (70, -13.666666666666666),
 (80, -23.666666666666668),
 (90, -33.666666666666664),
 (100, -43.666666666666664),
 (110, -53.666666666666664)]

In [42]:


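##############

### CHANGE THIS VALUE OF B

b = 40
##############

# `errors` holds the average errors computed above; `tools`, `graph_objs`
# and `scatter_trace` come from the earlier plotting cell.
cost_line = generate_cost_line(errors, b)
regression_line = generate_regression_line(400, 1.8, b)

fig = tools.make_subplots(rows=1, cols=2)
linear_error_trace = graph_objs.Scatter(
    x=list(map(lambda error: error[0], linear_error_chart)),
    y=list(map(lambda error: error[1], linear_error_chart)),
)
fig.append_trace(scatter_trace, 1, 1)
fig.append_trace(linear_error_trace, 1, 2)

fig['layout'].update(shapes=[regression_line, cost_line])
fig['layout']['yaxis1'].update(range=[0, 1000])

plotly.offline.iplot(fig)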

Now the charts above show something interesting.  Our regression line stays the same, and our error function changes to be linear.  That is, if we increase our value of *b* by some amount, the average error changes by the same amount in the opposite direction.  So now imagine we take our first guess of $b$ and set it equal to 100.  You can see from the plot of the average error on the right that this gives us an average error of -43.67.  So if changing our $b$ value by a certain amount changes our average error by the same amount, we can simply decrease our $b$ value by 43.67.

In [None]:
b = 100
error = average_error_variable(updated_shows, 1.83, b)

b = b + error
b # 56.33333333333333

So this is pretty cool, even by guessing our b value incorrectly the first time, we can simply look at the average error and make an adjustment.
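As a quick check on this idea, the sketch below applies the same adjustment to a small made-up data set and then recomputes the average error. The points and the name `average_error` are hypothetical, chosen so the numbers come out round; they are not the values from the cells above.

```python
# Hypothetical data, chosen so the averages are whole numbers.
sample_points = [{'x': 100, 'y': 250}, {'x': 200, 'y': 470}, {'x': 400, 'y': 860}]

def average_error(points, m, b):
    """Mean of (actual y - predicted y); negative means the line sits too high."""
    return sum(p['y'] - (m * p['x'] + b) for p in points) / len(points)

m, b = 2, 100
first_error = average_error(sample_points, m, b)  # -40.0: on average the line is 40 too high
b = b + first_error                               # shift the whole line down by that amount

print(b, average_error(sample_points, m, b))  # prints: 60.0 0.0
```

For a fixed slope, one adjustment is enough to bring the average error to zero, because moving $b$ shifts every prediction by exactly the same amount.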

### Adjusting the slope value

Now that we have gotten a sense for how we can adjust our y-intercept value, let's see if we can take a similar approach to adjusting our slope.  We came up with an initial guess of our slope simply by drawing a line between two of our points and using the slope of that line.  That gave us a slope of 1.83.  Now let's see if we can improve on that. 

Ok, so we adjusted our y-intercept value by seeing how a change in the y-intercept changed our mean error.  Let's see how a change in the slope changes our mean error.

In [None]:
ints = list(range(0, 30, 1))
m_values = list(map(lambda x: x/10.0, ints))
m_value_errors = list(map(lambda m_value: average_error_variable(updated_shows, m_value, 56.33), m_values))
m_linear_error_chart = list(zip(m_values, m_value_errors))
m_linear_error_chart[:5]

In [None]:
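m = 3.0
cost_line = generate_cost_line(m_value_errors, m)
regression_line = generate_regression_line(400, m, 56.33)

# `tools`, `graph_objs` and `scatter_trace` come from the earlier plotting cell.
fig = tools.make_subplots(rows=1, cols=2)
m_error_trace = graph_objs.Scatter(
    x=list(map(lambda error: error[0], m_linear_error_chart)),
    y=list(map(lambda error: error[1], m_linear_error_chart)),
)
fig.append_trace(scatter_trace, 1, 1)
fig.append_trace(m_error_trace, 1, 2)

fig['layout'].update(shapes=[regression_line, cost_line])
fig['layout']['yaxis1'].update(range=[0, 1500])

plotly.offline.iplot(fig)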

Our cost chart on the right is a little difficult to interpret, but the main thing to realize is that altering our $m$ value no longer alters our cost by something approaching an equal amount.  Now focus on the chart on the left.  What we want to consider is how points further to the right (that is, with a larger x-coordinate) influence our error as we change the slope of the line.  

Notice that when the slope is 2.0, the points with x-values of 100 and 400 both miss the mark by, say, 100 or so.  Ok, go ahead and change the slope from 2.0 to, say, 3.0.  The error at x = 100 rises by about 100, but the error at x = 400 rises by about 400.  A lot.  The takeaway is that a change in the slope of the line does not influence the error at all of the points equally.  The higher the x-value of a point, the more sensitive its error is to a changing slope.

So when we update $m$, we use `m = m + error * x`: we adjust our value of $m$ not just by the error, but also by each point's x-coordinate, because the larger the x-coordinate, the larger the change in error for a given point.  
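To see that sensitivity in numbers, the sketch below measures how far each point's error moves when the slope goes from 2.0 to 3.0 with the y-intercept held at 56.33. The helper `error_at` and the dictionary `shifts` are hypothetical names used only for this illustration.

```python
updated_shows = [{'x': 100, 'y': 150}, {'x': 200, 'y': 300}, {'x': 400, 'y': 700}]

def error_at(point, m, b):
    """Actual y minus the y predicted by the line m*x + b."""
    return point['y'] - (m * point['x'] + b)

b = 56.33
# How much each point's error changes when the slope goes from 2.0 to 3.0.
shifts = {point['x']: error_at(point, 3.0, b) - error_at(point, 2.0, b)
          for point in updated_shows}

for x, shift in shifts.items():
    print(x, round(shift, 2))
```

Each error moves by the change in slope multiplied by that point's x-value, so the point at x = 400 moves four times as far as the point at x = 100. That is why the update for $m$ weights each error by x.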

### Summary