# Gradient Descent
In this notebook we will code the Gradient Descent Algorithm from Scratch using Python and we will visualize how it behaves when given a simple learning task.

This algorithm estimate the gradient using a the training data to calculate the gradient
It needs a parameter 'a' called the learning rate

This algorithm doesn't scale well with lots of training instance since it calculate the gradient on all the data points, its variant the Stochastic Gradient Descent which estimate the gradient from 1 point is less computationally restrictive.

# Pseudocode
- initialize the parameters w randomly and select a learning rate (a)
- While a minima is not found
    - Calculate the gradient using all the data
    - w := w - a*gradient
        
For the update if we have multiple feature we need to take the partial derivative of each feature for the function we are trying to estimate.

In [14]:
# f(x) = w1 + w2*x
# we are trying to fit the best w1 and w2 we can on the dataset
# (x1,x2,...,xn) with (y1,y2,...,yn)
# we are using least squares
# We need to minimize: Sum[i=1:N](yhat_i - yi)^2
# Which tranlsate to Sum[i=1:N](w1 + w2*xi - yi)^2

# The gradient are then the following
# df(x)/d(w1) = (1/N) * (Sum[i=1:N] 2*(w1 + w2*xi - yi))
# df(x)/d(w2) = (1/N) * (Sum[i=1:N] 2*xi*(w1 + w2*xi - yi))


import numpy as np

def f(w1,w2,x):
    '''
        f: function we are trying to estimate the parameters (line)
        w1: bias
        w2: slope
        x: a point in the plane
        
        return yhat an estimate of y
    '''
    yhat = w1 + w2*x
    return yhat

def dx_w1(w1,w2,x,y):
    '''
        dx_w1: partial derivative of the weight w1 for function f
        w1: bias
        w2: slope
        x: a point in the plane
        y: the response of the point x
        
        return gradient which is the gradient at that point for this x and y for w1
    '''
    yhat = f(w1,w2,x)
    gradient = 2*(yhat - y)
    return gradient

def dx_w2(w1,w2,x,y):
    '''
        dx_w2: partial derivative of the weight w2 for function f
        w1: bias
        w2: slope
        x: a point in the plane
        y: the response of the point x
        
        return gradient which is the gradient at that point for this x and y for w2
    '''    
    yhat = f(w1,w2,x)
    gradient = 2*x*(yhat - y)
    return gradient

def gradient_w1(w1,w2,xs,ys):
    '''
        gradient_w1: estimate mean gradient over all point for w1
        w1: bias
        w2: slope
        xs: all point on the plane
        ys: all response on the plane
        
        return gradient which is the gradient at that point for all x and y for w1
    '''        
    N = len(ys)
    
    total = 0
    for x,y in zip(xs,ys):
        total = total + dx_w1(w1,w2,x,y)
    
    gradient = total/N
    return gradient

def gradient_w2(w1,w2,xs,ys):
    '''
        gradient_w2: estimate mean gradient over all point for w2
        w1: bias
        w2: slope
        xs: all point on the plane
        ys: all response on the plane
        
        return gradient which is the gradient at that point for all x and y for w2
    '''            
    N = len(ys)
    
    total = 0
    for x,y in zip(xs,ys):
        total = total + dx_w2(w1,w2,x,y)
    
    gradient = total/N
    return gradient

def gradient_descent(xs, ys, learning_rate = 0.01, max_num_iteration = 1000):
    '''
        gradient_descent: will estimate the parameters w1 and w2 (here it uses least square cost function)
        xs: all point on the plane
        ys: all response on the plane
        learning_rate: the learning rate for the step that weights update will take
        max_num_iteration: the number of iteration before we stop updating
        
        return w1 and w2 which is the bias and the slope of the formula
    '''    
    # Randomly initialize the weight w1 and w2
    w1 = np.random.uniform(0,1,1)
    w2 = np.random.uniform(0,1,1)
    
    for i in range(max_num_iteration):
        w1 = w1 - learning_rate*gradient_w1(w1,w2,xs,ys)
        w2 = w2 - learning_rate*gradient_w2(w1,w2,xs,ys)
        
        if i % 100 == 0:
            print(f"Iteration {i}")
            print(f"W1 = {w1}")
            print(f"W2 = {w2}")
    
    return (w1,w2)
        

In [15]:
# Here we have a simple line with intercept = 0 and slope = 1
xs = [1,2,3,4,5,6,7]
ys = [1,2,3,4,5,6,7]
(w1,w2) = gradient_descent(xs,ys)
print(w1,w2)

Iteration 0
W1 = [0.31242937]
W2 = [0.42653762]
Iteration 100
W1 = [0.27932483]
W2 = [0.94380455]
Iteration 200
W1 = [0.18887323]
W2 = [0.96200189]
Iteration 300
W1 = [0.12771187]
W2 = [0.97430652]
Iteration 400
W1 = [0.08635592]
W2 = [0.98262664]
Iteration 500
W1 = [0.05839195]
W2 = [0.98825252]
Iteration 600
W1 = [0.03948334]
W2 = [0.99205662]
Iteration 700
W1 = [0.02669775]
W2 = [0.99462886]
Iteration 800
W1 = [0.01805243]
W2 = [0.99636816]
Iteration 900
W1 = [0.01220665]
W2 = [0.99754423]
[0.00828622] [0.99833295]


In [16]:
# Here we have a simple line with intercept = 0 and slope = 2
xs = [1,2,3,4,5,6,7]
ys = [2,4,6,8,10,12,14]
(w1,w2) = gradient_descent(xs,ys)
print(w1,w2)

Iteration 0
W1 = [0.38928493]
W2 = [1.35313913]
Iteration 100
W1 = [0.33901613]
W2 = [1.93179567]
Iteration 200
W1 = [0.22923515]
W2 = [1.95388175]
Iteration 300
W1 = [0.1550037]
W2 = [1.96881587]
Iteration 400
W1 = [0.10481005]
W2 = [1.97891398]
Iteration 500
W1 = [0.07087022]
W2 = [1.9857421]
Iteration 600
W1 = [0.04792087]
W2 = [1.99035913]
Iteration 700
W1 = [0.03240302]
W2 = [1.99348106]
Iteration 800
W1 = [0.0219102]
W2 = [1.99559204]
Iteration 900
W1 = [0.01481519]
W2 = [1.99701943]
[0.01005698] [1.99797671]


In [17]:
# Here we have a simple line with intercept = 1 and slope = 2
xs = [1,2,3,4,5,6,7]
ys = [3,5,7,9,11,13,15]
(w1,w2) = gradient_descent(xs,ys)
print(w1,w2)

Iteration 0
W1 = [0.53138242]
W2 = [1.34595275]
Iteration 100
W1 = [0.7828876]
W2 = [2.04367936]
Iteration 200
W1 = [0.85319343]
W2 = [2.02953501]
Iteration 300
W1 = [0.90073267]
W2 = [2.01997092]
Iteration 400
W1 = [0.93287764]
W2 = [2.01350389]
Iteration 500
W1 = [0.95461336]
W2 = [2.00913103]
Iteration 600
W1 = [0.96931056]
W2 = [2.0061742]
Iteration 700
W1 = [0.97924849]
W2 = [2.00417486]
Iteration 800
W1 = [0.98596829]
W2 = [2.00282294]
Iteration 900
W1 = [0.99051207]
W2 = [2.00190881]
[0.99355932] [2.00129576]
