# Stochastic Gradient Descent
In this notebook we implement the stochastic gradient descent (SGD) algorithm from scratch in Python and visualize how it behaves on a simple learning task.

This algorithm estimates the gradient using a randomly selected subset of the training data.
It needs a parameter 'a', called the learning rate, which sets the size of each update step.
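
To make the idea of a gradient estimate concrete, here is a minimal sketch in plain Python. It assumes the line model and squared-error loss used later in this notebook. The data, the starting weights, and the names `example_gradient` and `full_gradient` are made up for illustration. Each single-example gradient differs from the full gradient, but their average over the dataset equals it, which is why a randomly drawn example gives a usable (noisy) estimate.

```python
# Hypothetical toy data and weights, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w1, w2 = 0.5, 0.5

def example_gradient(w1, w2, x, y):
    """Gradient of the squared error (w1 + w2*x - y)^2 for one example."""
    error = w1 + w2 * x - y
    return (2 * error, 2 * x * error)

def full_gradient(w1, w2, xs, ys):
    """Gradient of the mean squared error over the whole dataset."""
    n = len(xs)
    grads = [example_gradient(w1, w2, x, y) for x, y in zip(xs, ys)]
    return (sum(g[0] for g in grads) / n, sum(g[1] for g in grads) / n)

# One noisy estimate per training example.
per_example = [example_gradient(w1, w2, x, y) for x, y in zip(xs, ys)]
print(per_example)  # [(-2.0, -2.0), (-5.0, -10.0), (-8.0, -24.0), (-11.0, -44.0)]

# Their average is the full gradient.
print(full_gradient(w1, w2, xs, ys))  # (-6.5, -20.0)

# One SGD step with learning rate a, using the first example as the estimate.
a = 0.01
g1, g2 = per_example[0]
w1_new, w2_new = w1 - a * g1, w2 - a * g2
```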

# Pseudocode
- Initialize the parameters w randomly and select a learning rate (a)
- While a minimum is not found
    - Shuffle the examples in the training set
    - for i = 1:n
        - w := w - a*gradient (evaluated on example i)
        
If the model has several parameters, the update needs the partial derivative of the loss with respect to each parameter, and each parameter is updated with its own partial derivative.
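
The pseudocode can be sketched directly as nested loops: an outer loop over passes through the data and an inner loop over the shuffled examples. This is a minimal plain-Python sketch, not the implementation used in the cells of this notebook. The function name `sgd_epochs`, the fixed `num_epochs` stopping rule (standing in for "while a minimum is not found"), and the `seed` argument are assumptions made for illustration.

```python
import random

def sgd_epochs(xs, ys, learning_rate=0.01, num_epochs=500, seed=0):
    """Fit y = w1 + w2*x with per-example updates, one shuffled pass per epoch."""
    rng = random.Random(seed)           # seeded so the run is repeatable
    w1, w2 = rng.random(), rng.random() # random initial parameters in [0, 1)
    indices = list(range(len(xs)))

    for _ in range(num_epochs):         # stand-in for "while a minimum is not found"
        rng.shuffle(indices)            # shuffle the examples in the training set
        for i in indices:               # for i = 1:n
            error = w1 + w2 * xs[i] - ys[i]
            # Both partial derivatives are evaluated at the same (w1, w2)
            # before either parameter is changed.
            grad_w1 = 2 * error
            grad_w2 = 2 * xs[i] * error
            w1 -= learning_rate * grad_w1
            w2 -= learning_rate * grad_w2
    return w1, w2

# Points on the line y = 1 + 2*x.
w1, w2 = sgd_epochs([1, 2, 3, 4, 5, 6, 7], [3, 5, 7, 9, 11, 13, 15])
print(w1, w2)
```

With a fixed number of epochs the run time is predictable; a tolerance on the change in the loss would be another reasonable stopping rule.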

In [1]:
# f(x) = w1 + w2*x
# We want the w1 and w2 that best fit the dataset
# (x1,x2,...,xn) with (y1,y2,...,yn)
# using least squares.
# We need to minimize: L(w1,w2) = (1/N) * Sum[i=1:N](yhat_i - yi)^2
# which translates to (1/N) * Sum[i=1:N](w1 + w2*xi - yi)^2

# The gradients of L are then the following
# dL/d(w1) = (1/N) * (Sum[i=1:N] 2*(w1 + w2*xi - yi))
# dL/d(w2) = (1/N) * (Sum[i=1:N] 2*xi*(w1 + w2*xi - yi))


import numpy as np
from numpy.random import permutation

def f(w1,w2,x):
    '''
        f: the line whose parameters we are trying to estimate
        w1: bias
        w2: slope
        x: a point in the plane
        
        return yhat an estimate of y
    '''
    yhat = w1 + w2*x
    return yhat

def dx_w1(w1,w2,x,y):
    '''
        dx_w1: partial derivative of the squared error (f(x) - y)^2 with respect to w1
        w1: bias
        w2: slope
        x: a point in the plane
        y: the response of the point x
        
        return gradient, the partial derivative with respect to w1 for this single example (x, y)
    '''
    yhat = f(w1,w2,x)
    gradient = 2*(yhat - y)
    return gradient

def dx_w2(w1,w2,x,y):
    '''
        dx_w2: partial derivative of the squared error (f(x) - y)^2 with respect to w2
        w1: bias
        w2: slope
        x: a point in the plane
        y: the response of the point x
        
        return gradient, the partial derivative with respect to w2 for this single example (x, y)
    '''    
    yhat = f(w1,w2,x)
    gradient = 2*x*(yhat - y)
    return gradient


def stochastic_gradient_descent(xs, ys, learning_rate = 0.01, max_num_iteration = 1000):
    
    # Randomly initialize the weights w1 and w2 (each is an array of shape (1,))
    w1 = np.random.uniform(0,1,1)
    w2 = np.random.uniform(0,1,1)
    
    
    iteration = 0
    while iteration < max_num_iteration:
        
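        # Shuffle the data and keep the first example, i.e. draw one
        # training example uniformly at random for this update
        perm = permutation(len(xs))
        xr = xs[perm]
        yr = ys[perm]

        x = xr[0]
        y = yr[0]
        
        # Evaluate both partial derivatives at the current weights before
        # updating, so the w2 step does not use the already-updated w1
        grad_w1 = dx_w1(w1,w2,x,y)
        grad_w2 = dx_w2(w1,w2,x,y)
        w1 = w1 - learning_rate*grad_w1
        w2 = w2 - learning_rate*grad_w2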
        
        iteration = iteration + 1
    
        if iteration % 100 == 0:
            print(f"Iteration {iteration}")
            print(f"W1 = {w1}")
            print(f"W2 = {w2}")
    
    return (w1,w2)
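
As a sanity check on the two partial derivatives, they can be compared with central finite differences of the squared error. The sketch below redeclares the loss and its gradient in plain Python so that it runs on its own. The names `squared_error`, `analytic_gradient`, and `numeric_gradient`, the step `eps`, and the test point are illustrative choices, not part of the notebook's code.

```python
def squared_error(w1, w2, x, y):
    """Squared error of the line w1 + w2*x on a single example."""
    return (w1 + w2 * x - y) ** 2

def analytic_gradient(w1, w2, x, y):
    """Same formulas as the per-example partial derivatives for w1 and w2."""
    error = w1 + w2 * x - y
    return 2 * error, 2 * x * error

def numeric_gradient(w1, w2, x, y, eps=1e-6):
    """Central finite differences in w1 and w2."""
    g1 = (squared_error(w1 + eps, w2, x, y) - squared_error(w1 - eps, w2, x, y)) / (2 * eps)
    g2 = (squared_error(w1, w2 + eps, x, y) - squared_error(w1, w2 - eps, x, y)) / (2 * eps)
    return g1, g2

# Hypothetical test point.
w1, w2, x, y = 0.3, 0.7, 3.0, 5.0
analytic = analytic_gradient(w1, w2, x, y)
numeric = numeric_gradient(w1, w2, x, y)
print(analytic)
print(numeric)
```

The two printed pairs should agree to several decimal places; a large gap would point to a mistake in the derivative formulas.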
        

In [2]:
# Here we have a simple line with intercept = 0 and slope = 1
xs = np.array([1,2,3,4,5,6,7])
ys = np.array([1,2,3,4,5,6,7])
(w1,w2) = stochastic_gradient_descent(xs,ys)
print(w1,w2)

Iteration 100
W1 = [0.71911473]
W2 = [0.86125972]
Iteration 200
W1 = [0.50948049]
W2 = [0.91828758]
Iteration 300
W1 = [0.34804108]
W2 = [0.95002387]
Iteration 400
W1 = [0.22967592]
W2 = [0.95810732]
Iteration 500
W1 = [0.15788839]
W2 = [0.97103751]
Iteration 600
W1 = [0.10624548]
W2 = [0.97184584]
Iteration 700
W1 = [0.07308874]
W2 = [0.98952196]
Iteration 800
W1 = [0.05037367]
W2 = [0.98957716]
Iteration 900
W1 = [0.03515546]
W2 = [0.99403178]
Iteration 1000
W1 = [0.02302739]
W2 = [0.99667378]
[0.02302739] [0.99667378]


In [3]:
# Here we have a simple line with intercept = 0 and slope = 2
xs = np.array([1,2,3,4,5,6,7])
ys = np.array([2,4,6,8,10,12,14])
(w1,w2) = stochastic_gradient_descent(xs,ys)
print(w1,w2)

Iteration 100
W1 = [0.55824135]
W2 = [1.84934346]
Iteration 200
W1 = [0.39205731]
W2 = [1.94304744]
Iteration 300
W1 = [0.26397308]
W2 = [1.93828506]
Iteration 400
W1 = [0.18160052]
W2 = [1.97357962]
Iteration 500
W1 = [0.12780026]
W2 = [1.97718582]
Iteration 600
W1 = [0.0863146]
W2 = [1.98756468]
Iteration 700
W1 = [0.05885023]
W2 = [1.98730806]
Iteration 800
W1 = [0.03926074]
W2 = [1.9915265]
Iteration 900
W1 = [0.02724714]
W2 = [1.99514424]
Iteration 1000
W1 = [0.01689975]
W2 = [1.99635483]
[0.01689975] [1.99635483]


In [4]:
# Here we have a simple line with intercept = 1 and slope = 2
xs = np.array([1,2,3,4,5,6,7])
ys = np.array([3,5,7,9,11,13,15])
(w1,w2) = stochastic_gradient_descent(xs,ys)
print(w1,w2)

Iteration 100
W1 = [0.84871501]
W2 = [2.02610082]
Iteration 200
W1 = [0.89861095]
W2 = [2.01855607]
Iteration 300
W1 = [0.93029524]
W2 = [2.0140809]
Iteration 400
W1 = [0.95314372]
W2 = [2.01198149]
Iteration 500
W1 = [0.96796106]
W2 = [2.00571922]
Iteration 600
W1 = [0.97970064]
W2 = [2.00411118]
Iteration 700
W1 = [0.98561327]
W2 = [2.00249569]
Iteration 800
W1 = [0.99053092]
W2 = [2.00211655]
Iteration 900
W1 = [0.99393584]
W2 = [2.00138302]
Iteration 1000
W1 = [0.99607824]
W2 = [2.00118035]
[0.99607824] [2.00118035]
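
Because the three datasets lie exactly on a line, the SGD estimates above can be compared with the closed-form least-squares solution. The following is a minimal plain-Python sketch; the helper `least_squares_line` is a hypothetical name introduced here, and it returns `(intercept, slope)` in the same order as `(w1, w2)`.

```python
def least_squares_line(xs, ys):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5, 6, 7]
print(least_squares_line(xs, [1, 2, 3, 4, 5, 6, 7]))       # (0.0, 1.0)
print(least_squares_line(xs, [2, 4, 6, 8, 10, 12, 14]))    # (0.0, 2.0)
print(least_squares_line(xs, [3, 5, 7, 9, 11, 13, 15]))    # (1.0, 2.0)
```

After 1000 iterations the SGD runs are within a few hundredths of these values, and the intercept is the slower of the two parameters to converge.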
