# Stochastic Gradient Descent
In this notebook we implement the stochastic gradient descent (SGD) algorithm from scratch in Python and visualize how it behaves on a simple learning task.

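This algorithm estimates the gradient from a randomly selected subset of the training data (here, a single example per update) instead of the full dataset.
It requires a parameter 'a', called the learning rate, which sets the size of each update step.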
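
To make the idea of "estimating the gradient from a subset" concrete, here is a minimal sketch that compares the full-batch gradient of the mean squared error with the gradient computed on one random example. The data, the starting weights, and the helper name `squared_error_gradient` are illustrative assumptions and are not part of the notebook's own code.

```python
import numpy as np

# Toy data on the line y = 1 + 2x (values chosen for illustration only).
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
ys = 1.0 + 2.0 * xs

# Arbitrary current guess for the parameters of f(x) = w1 + w2*x.
w1, w2 = 0.5, 0.5

def squared_error_gradient(w1, w2, x, y):
    """Gradient of (w1 + w2*x - y)^2 with respect to (w1, w2).

    Hypothetical helper: works for a single example or for arrays of examples.
    """
    residual = w1 + w2 * x - y
    return np.array([2.0 * residual, 2.0 * x * residual])

# Full-batch gradient: the average of the per-example gradients.
full_gradient = squared_error_gradient(w1, w2, xs, ys).mean(axis=1)

# Stochastic estimate: the gradient on one randomly chosen example.
rng = np.random.default_rng(0)
i = rng.integers(len(xs))
single_estimate = squared_error_gradient(w1, w2, xs[i], ys[i])

# Averaging many single-example estimates approaches the full gradient.
draws = rng.integers(len(xs), size=10_000)
averaged_estimate = squared_error_gradient(w1, w2, xs[draws], ys[draws]).mean(axis=1)

print("full-batch gradient:      ", full_gradient)
print("one-example estimate:     ", single_estimate)
print("average of 10000 estimates:", averaged_estimate)
```

A single estimate can be far from the full gradient, but it is correct on average, which is why many small noisy steps still move the weights in the right direction.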

# Pseudocode
- Initialize the parameters w randomly and select a learning rate (a)
- While a minimum has not been reached (or an iteration budget is not exhausted)
    - Shuffle the examples in the training set
    - for i = 1:n
        - w := w - a*gradient, where the gradient is evaluated on example i only
        
If the model has several parameters, the update needs the partial derivative of the loss with respect to each parameter, and each parameter is updated with its own partial derivative.
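
As a worked example of one update step, the sketch below applies the rule once to a line model f(x) = w1 + w2*x with squared error. The starting weights (0.5, 0.5), the example (x = 2, y = 4), and the variable names are assumptions chosen only to keep the arithmetic easy to follow by hand.

```python
# One SGD update on a single example, written out step by step.
learning_rate = 0.01          # the parameter 'a' from the pseudocode
w1, w2 = 0.5, 0.5             # illustrative starting weights
x, y = 2.0, 4.0               # illustrative training example

yhat = w1 + w2 * x            # prediction: 0.5 + 0.5*2 = 1.5
residual = yhat - y           # 1.5 - 4 = -2.5

# Partial derivatives of (yhat - y)^2, both evaluated at the current weights.
grad_w1 = 2 * residual        # -5
grad_w2 = 2 * x * residual    # -10

# Apply the update to each weight with its own partial derivative.
w1_new = w1 - learning_rate * grad_w1   # 0.5 + 0.05
w2_new = w2 - learning_rate * grad_w2   # 0.5 + 0.10

print(f"w1: {w1:.2f} -> {w1_new:.2f}")
print(f"w2: {w2:.2f} -> {w2_new:.2f}")
print(f"prediction for x={x}: {yhat:.2f} -> {w1_new + w2_new * x:.2f}")
```

After this single step the weights become 0.55 and 0.60, and the prediction for x = 2 moves from 1.5 to 1.75, which is closer to the target 4.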

In [9]:
# f(x) = w1 + w2*x
# We are trying to find the w1 and w2 that best fit the dataset
# (x1,x2,...,xN) with (y1,y2,...,yN)
# We are using least squares
# We need to minimize the mean squared error: L = (1/N) * Sum[i=1:N](yhat_i - yi)^2
# Which translates to L = (1/N) * Sum[i=1:N](w1 + w2*xi - yi)^2

# The partial derivatives of L are then the following
# dL/d(w1) = (1/N) * (Sum[i=1:N] 2*(w1 + w2*xi - yi))
# dL/d(w2) = (1/N) * (Sum[i=1:N] 2*xi*(w1 + w2*xi - yi))
# SGD replaces each average by the term for a single example (xi, yi)


import numpy as np
from numpy.random import permutation

def f(w1,w2,x):
    '''
        f: the line whose parameters we are trying to estimate
        w1: bias (intercept)
        w2: slope
        x: input value
        
        return yhat, the estimate of y at x
    '''
    yhat = w1 + w2*x
    return yhat

def dx_w1(w1,w2,x,y):
    '''
        dx_w1: partial derivative of the squared error (f(x) - y)^2 with respect to w1
        w1: bias (intercept)
        w2: slope
        x: input value
        y: the response observed for x
        
        return gradient, the partial derivative with respect to w1 for this single example (x, y)
    '''
    yhat = f(w1,w2,x)
    gradient = 2*(yhat - y)
    return gradient

def dx_w2(w1,w2,x,y):
    '''
        dx_w2: partial derivative of the squared error (f(x) - y)^2 with respect to w2
        w1: bias (intercept)
        w2: slope
        x: input value
        y: the response observed for x
        
        return gradient, the partial derivative with respect to w2 for this single example (x, y)
    '''
    yhat = f(w1,w2,x)
    gradient = 2*x*(yhat - y)
    return gradient


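def stochastic_gradient_descent(xs, ys, learning_rate = 0.01, max_num_iteration = 1000):
    '''
        Fit f(x) = w1 + w2*x to (xs, ys) with stochastic gradient descent.
        xs, ys: NumPy arrays of inputs and responses
        learning_rate: step size 'a'
        max_num_iteration: budget of single-example updates; it is checked once
            per pass over the data, so the last pass may run slightly past it
        
        return the tuple (w1, w2)
    '''
    # Randomly initialize the weights w1 and w2 in [0, 1)
    w1 = np.random.uniform(0,1,1)
    w2 = np.random.uniform(0,1,1)
    
    iteration = 0
    while iteration < max_num_iteration:
        
        # Shuffle the examples at the start of each pass over the data
        perm = permutation(len(xs))
        xr = xs[perm]
        yr = ys[perm]
        
        for x,y in zip(xr,yr):
            # Compute both gradients from the current weights before updating either one
            gradient_w1 = dx_w1(w1,w2,x,y)
            gradient_w2 = dx_w2(w1,w2,x,y)
            w1 = w1 - learning_rate*gradient_w1
            w2 = w2 - learning_rate*gradient_w2
            
            iteration = iteration + 1
        
            # Report progress every 100 updates
            if iteration % 100 == 0:
                print(f"Iteration {iteration}")
                print(f"W1 = {w1}")
                print(f"W2 = {w2}")
    
    return (w1,w2)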
        

In [10]:
# Here we have a simple line with intercept = 0 and slope = 1
xs = np.array([1,2,3,4,5,6,7])
ys = np.array([1,2,3,4,5,6,7])
(w1,w2) = stochastic_gradient_descent(xs,ys)
print(w1,w2)

Iteration 100
W1 = [0.39702059]
W2 = [0.91490876]
Iteration 200
W1 = [0.26569407]
W2 = [0.95631483]
Iteration 300
W1 = [0.17487377]
W2 = [0.96571606]
Iteration 400
W1 = [0.1136879]
W2 = [0.97297321]
Iteration 500
W1 = [0.07687914]
W2 = [0.98745924]
Iteration 600
W1 = [0.05149245]
W2 = [0.99105992]
Iteration 700
W1 = [0.03471444]
W2 = [0.99441874]
Iteration 800
W1 = [0.0227578]
W2 = [0.99522658]
Iteration 900
W1 = [0.01506752]
W2 = [0.99687251]
Iteration 1000
W1 = [0.01001122]
W2 = [0.99797009]
[0.00989219] [0.99773679]
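
One way to judge the estimates above is to compare them with the exact least-squares fit. The sketch below uses `np.polyfit` for that comparison; using it as a reference here is a suggestion and is not something the notebook itself does.

```python
import numpy as np

# Same data as the cell above: a line with intercept 0 and slope 1.
xs = np.array([1, 2, 3, 4, 5, 6, 7])
ys = np.array([1, 2, 3, 4, 5, 6, 7])

# np.polyfit returns coefficients from the highest degree to the lowest,
# so for a degree-1 fit the order is (slope, intercept).
slope, intercept = np.polyfit(xs, ys, 1)

print(f"least-squares intercept (w1) = {intercept:.6f}")
print(f"least-squares slope (w2)     = {slope:.6f}")
```

The SGD result after roughly 1000 updates is close to this exact fit but not equal to it, with the intercept converging more slowly than the slope.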


In [15]:
# Here we have a simple line with intercept = 0 and slope = 2
xs = np.array([1,2,3,4,5,6,7])
ys = np.array([2,4,6,8,10,12,14])
(w1,w2) = stochastic_gradient_descent(xs,ys)
print(w1,w2)

Iteration 100
W1 = [0.58746837]
W2 = [1.90574693]
Iteration 200
W1 = [0.38602199]
W2 = [1.91154428]
Iteration 300
W1 = [0.25818906]
W2 = [1.94993965]
Iteration 400
W1 = [0.17344539]
W2 = [1.96966564]
Iteration 500
W1 = [0.11646616]
W2 = [1.97698264]
Iteration 600
W1 = [0.07683005]
W2 = [1.98621577]
Iteration 700
W1 = [0.0513436]
W2 = [1.99114118]
Iteration 800
W1 = [0.03462707]
W2 = [1.99500284]
Iteration 900
W1 = [0.02293864]
W2 = [1.99498178]
Iteration 1000
W1 = [0.01496295]
W2 = [1.99709161]
[0.0150127] [1.99738413]


In [14]:
# Here we have a simple line with intercept = 1 and slope = 2
xs = np.array([1,2,3,4,5,6,7])
ys = np.array([3,5,7,9,11,13,15])
(w1,w2) = stochastic_gradient_descent(xs,ys)
print(w1,w2)

Iteration 100
W1 = [0.61413861]
W2 = [2.05532995]
Iteration 200
W1 = [0.74332329]
W2 = [2.04444021]
Iteration 300
W1 = [0.83207545]
W2 = [2.03739644]
Iteration 400
W1 = [0.88837911]
W2 = [2.02880182]
Iteration 500
W1 = [0.92500973]
W2 = [2.01658534]
Iteration 600
W1 = [0.94945501]
W2 = [2.00958379]
Iteration 700
W1 = [0.96698536]
W2 = [2.00823717]
Iteration 800
W1 = [0.97824596]
W2 = [2.00519469]
Iteration 900
W1 = [0.98541929]
W2 = [2.00375134]
Iteration 1000
W1 = [0.99013737]
W2 = [2.00198722]
[0.99029487] [2.00214158]
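
The printed weights show convergence indirectly. A more direct view is the mean squared error after each pass over the data. The following is a minimal, self-contained sketch under some assumptions: the names `mse` and `sgd_with_history`, the fixed seed, and the choice of 143 passes (about 1000 single-example updates on 7 points) are illustrative and not part of the notebook's own code.

```python
import numpy as np

def mse(w1, w2, xs, ys):
    """Mean squared error of the line w1 + w2*x on the dataset."""
    return float(np.mean((w1 + w2 * xs - ys) ** 2))

def sgd_with_history(xs, ys, learning_rate=0.01, num_epochs=143, seed=0):
    """Hypothetical SGD variant that records the loss after every epoch."""
    rng = np.random.default_rng(seed)      # seeded so the run is repeatable
    w1, w2 = rng.uniform(0, 1, size=2)     # random start in [0, 1)
    history = [mse(w1, w2, xs, ys)]        # loss before any update
    for _ in range(num_epochs):
        # Visit the examples in a fresh random order each epoch.
        for i in rng.permutation(len(xs)):
            residual = w1 + w2 * xs[i] - ys[i]
            # Both gradients use the weights from before this update.
            grad_w1 = 2 * residual
            grad_w2 = 2 * xs[i] * residual
            w1 -= learning_rate * grad_w1
            w2 -= learning_rate * grad_w2
        history.append(mse(w1, w2, xs, ys))
    return w1, w2, history

# Same data as the cell above: intercept 1 and slope 2.
xs = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
ys = np.array([3, 5, 7, 9, 11, 13, 15], dtype=float)

w1, w2, history = sgd_with_history(xs, ys)
for epoch in (0, 1, 10, 50, 143):
    print(f"epoch {epoch:3d}: MSE = {history[epoch]:.6f}")
print(f"w1 = {w1:.4f}, w2 = {w2:.4f}")
```

The `history` list can be passed to a plotting library to draw the loss curve; the loss typically drops sharply in the first few epochs and then decreases slowly while the intercept catches up.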
