# Week 1: Linear Model

In [1]:
import pandas as pd
import numpy as np

##  Source Data

In [2]:

points_df = pd.read_csv('xydata.csv', header=None)
points_df.columns = ['xy']
points_df = points_df['xy'].str.split(',', expand=True)
points_df.columns = ['x', 'y']
# convert to numeric
points_df = points_df.apply(pd.to_numeric)
points_df

Unnamed: 0,x,y
0,32.502345,31.707006
1,53.426804,68.777596
2,61.530358,62.562382
3,47.475640,71.546632
4,59.813208,87.230925
...,...,...
95,50.030174,81.536991
96,49.239765,72.111832
97,50.039576,85.232007
98,48.149859,66.224958


## THIS ASSUMES THE MODEL IS Y = MX + B

- Y is the target variable.
- X is the input variable. There is only one, so this is simple linear regression, not multiple linear regression.
- B is the intercept.
- M is the weight (slope) applied to X.

X and Y are both given by our dataset.
We need to find the optimal values of M and B, that is, the optimal weight and intercept.

1.  We start with some arbitrary initial values for M and B and calculate the loss / error for those initial values.


2.  For each and every data point we calculate the 'loss' or 'error' that our current values for M and B are giving us. 

3.  We average the squared errors to get a single loss value for the whole dataset (usually the mean squared error).

4.  We compare that loss against our target loss level. If it exceeds the target, we tweak the weight and intercept (you can think of the intercept as just another weight) and then recalculate the loss for the whole dataset.

5.  We continue tweaking and recalculating the loss until we hit our maximum number of iterations. Normally we would also stop as soon as we reach the desired error level, but the code in this notebook only uses the iteration cap.
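The steps above, including the early stop on a target loss, can be sketched as follows. This is only an illustration under assumptions: `fit_with_target_loss`, `mean_squared_error`, the toy data and the chosen `learning_rate` and `target_loss` are all hypothetical and are not part of the notebook's code.

```python
import numpy as np

def mean_squared_error(b, m, x, y):
    """Mean squared error of the line y = m*x + b over all points."""
    return float(np.mean((y - (m * x + b)) ** 2))

def fit_with_target_loss(x, y, learning_rate, max_iterations, target_loss):
    """Hypothetical helper: gradient descent that stops early at target_loss."""
    b, m = 0.0, 0.0  # step 1: arbitrary initial values
    steps = 0
    while steps < max_iterations:
        # steps 2-4: loss over the whole dataset, compared against the target
        if mean_squared_error(b, m, x, y) <= target_loss:
            break
        # tweak both parameters using the gradient of the mean squared error
        residual = y - (m * x + b)
        b -= learning_rate * (-2.0 * np.mean(residual))
        m -= learning_rate * (-2.0 * np.mean(x * residual))
        steps += 1
    return b, m, mean_squared_error(b, m, x, y), steps

# Toy data lying exactly on y = 2x + 1 (an assumption for the demo, not xydata.csv)
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0
b_fit, m_fit, final_loss, steps = fit_with_target_loss(
    x_demo, y_demo, learning_rate=0.05, max_iterations=10_000, target_loss=1e-6
)
print(b_fit, m_fit, final_loss, steps)
```

Whichever limit is hit first ends the loop, so the returned `steps` shows whether the target loss or the iteration cap was the reason for stopping.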


How do we tweak the values:

1.  We calculate a gradient for each factor (in this case the intercept and weight of X).

2.  We multiply the gradient by our learning rate (which controls how big a jump we take at each step).

3.  We subtract the result from our old value to get our new value.

We do the above over every data point in every iteration.
-   That is, the gradient for each factor is accumulated across all values of X, because the loss is defined over the whole dataset and not over a single point. The gradient also depends on the current values of M and B, so it has to be recalculated every time they change.
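For reference, the gradients come from differentiating the mean squared error with respect to each parameter. The derivation below uses N for the number of data points and α for the learning rate (the symbol α is just a label chosen here); the resulting formulas match what `step_gradient` accumulates point by point.

```latex
E(m, b) = \frac{1}{N} \sum_{i=1}^{N} \bigl( y_i - (m x_i + b) \bigr)^2

\frac{\partial E}{\partial m} = -\frac{2}{N} \sum_{i=1}^{N} x_i \bigl( y_i - (m x_i + b) \bigr)
\qquad
\frac{\partial E}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} \bigl( y_i - (m x_i + b) \bigr)

m \leftarrow m - \alpha \, \frac{\partial E}{\partial m}
\qquad
b \leftarrow b - \alpha \, \frac{\partial E}{\partial b}
```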



# PROGRAM STRUCTURE




Compute Error - Calculate error for all the data points for a given set of weights

Step Gradient - Calculate how much to adjust the weights by and calculate the new weights

Gradient Descent Runner - Loop over the step gradient, continuously calculating the new weights (the error itself is computed before and after the loop)

Main 
- Set Input Parameters (number of iterations, define dataset, initial weights, learning rate) 
- Kick off Gradient Descent Runner
- Print Results

In [3]:
# Source: https://github.com/mattnedrich/GradientDescentExample

# y = mx + b
# m is slope, b is y-intercept

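def compute_error_for_line_given_points(b, m, points):
    """Return the mean squared error of the line y = m*x + b over all points."""
    totalError = 0
    for i in range(len(points)):
        x = points[i, 0]  # column 0 holds x
        y = points[i, 1]  # column 1 holds y
        totalError += (y - (m * x + b)) ** 2
    return totalError / float(len(points))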

In [4]:
# Source: https://github.com/mattnedrich/GradientDescentExample
def step_gradient(b_current, m_current, points, learningRate):
    """Take one gradient descent step and return the updated [b, m]."""
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(len(points)):
        x = points[i, 0]
        y = points[i, 1]
        # partial derivatives of the mean squared error, accumulated per point
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    # move against the gradient, scaled by the learning rate
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]
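The per-point loop in `step_gradient` can also be written with NumPy array operations. The version below is a sketch, not part of the original source: `step_gradient_vectorized` and `toy_points` are hypothetical names, and the function is meant to apply the same update rule on an array whose first column is x and second column is y.

```python
import numpy as np

def step_gradient_vectorized(b_current, m_current, points, learning_rate):
    """Hypothetical vectorized counterpart of step_gradient (same update rule)."""
    x = points[:, 0]
    y = points[:, 1]
    residual = y - (m_current * x + b_current)
    # np.mean supplies the 1/N factor that the loop version applies per point
    b_gradient = -2.0 * np.mean(residual)
    m_gradient = -2.0 * np.mean(x * residual)
    new_b = b_current - learning_rate * b_gradient
    new_m = m_current - learning_rate * m_gradient
    return [new_b, new_m]

# Three made-up points on the line y = 2x, starting from b = 0, m = 0
toy_points = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
new_b, new_m = step_gradient_vectorized(0.0, 0.0, toy_points, 0.1)
print(new_b, new_m)
```

On large datasets this form avoids the Python-level loop, which is usually the slow part of each iteration.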

In [5]:
def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    points = np.asarray(points)  # convert once rather than on every iteration
    for _ in range(num_iterations):
        b, m = step_gradient(b, m, points, learning_rate)
    return [b, m]

In [6]:
def run():
    points = points_df.values
    learning_rate = 0.0001
    initial_b = 0  # Initial y-intercept guess
    initial_m = 0  # Initial slope guess
    num_iterations = 1000
    print("Starting gradient descent at b = {0}, m = {1}, error = {2}".format(initial_b, initial_m, compute_error_for_line_given_points(initial_b, initial_m, points)))
    print("Running...")
    [b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
    print("After {0} iterations b = {1}, m = {2}, error = {3}".format(num_iterations, b, m, compute_error_for_line_given_points(b, m, points)))


In [7]:
#extract data from points df to a numpy array
points = points_df.values
points[1,1]

68.77759598163891

In [8]:
run()

Starting gradient descent at b = 0, m = 0, error = 5565.107834483211
Running...
After 1000 iterations b = 0.08893651993741357, m = 1.4777440851894448, error = 112.61481011613473
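One way to sanity check a result like this is to compare it with the closed-form least-squares fit. That is a different technique from gradient descent and is used here only as a reference. With a small learning rate and a fixed number of iterations, gradient descent may stop before the parameters, the intercept in particular, have fully converged. The sketch below uses `np.polyfit` on synthetic data, so the data and the `mse` helper are assumptions and not the notebook's `xydata.csv`.

```python
import numpy as np

# Synthetic data roughly shaped like the notebook's (an assumption for the demo)
rng = np.random.default_rng(0)
x = rng.uniform(25.0, 70.0, size=100)
y = 1.5 * x + 5.0 + rng.normal(0.0, 10.0, size=100)

# Degree-1 polynomial fit: returns the slope first, then the intercept
m_ls, b_ls = np.polyfit(x, y, deg=1)

def mse(b, m):
    """Mean squared error of the line y = m*x + b on the synthetic data."""
    return float(np.mean((y - (m * x + b)) ** 2))

print("least squares: b = {0}, m = {1}, error = {2}".format(b_ls, m_ls, mse(b_ls, m_ls)))
```

To apply the same check to the notebook's data, `x` and `y` could be taken from `points[:, 0]` and `points[:, 1]`. If the gradient descent error is noticeably above the least-squares error, more iterations or a different learning rate may be needed.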
