# Linear Regression with Gradient Descent

In this exercise, I will build a basic machine learning algorithm for regression problems: linear regression. We will compare basic (full-batch) gradient descent with stochastic gradient descent.  
On top of that, we will implement L1 and L2 regularization and see how they affect the model's performance.

In [25]:
import random
import numpy as np

Let's assume that we have x and y, where:  
- x is the feature
- y is the label

To keep things really simple at the start, we will use small NumPy arrays with 1 feature and 1 label, plus two randomly initialized weights: `w0` (the intercept) and `w1` (the slope).
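
For reference, the loop further down minimizes a half mean squared error and updates each weight with its partial derivative. Written out for $n$ examples, with $\alpha$ standing for `learning_rate` (the symbol $\alpha$ is my notation, not a name used in the code):

$$
L(w_0, w_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( w_0 + w_1 x_i - y_i \right)^2
$$

$$
\frac{\partial L}{\partial w_0} = \frac{1}{n} \sum_{i=1}^{n} \left( w_0 + w_1 x_i - y_i \right),
\qquad
\frac{\partial L}{\partial w_1} = \frac{1}{n} \sum_{i=1}^{n} \left( w_0 + w_1 x_i - y_i \right) x_i
$$

$$
w_0 \leftarrow w_0 - \alpha \frac{\partial L}{\partial w_0},
\qquad
w_1 \leftarrow w_1 - \alpha \frac{\partial L}{\partial w_1}
$$

These correspond to `delta_w0` and `delta_w1` in the code, where `error` is the term in parentheses.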

In [5]:
x = np.array([1,2,3])
y = np.array([2.1, 3.2, 4.1])

w0 = np.random.uniform(0,1)
w1 = np.random.uniform(0,1)

In [6]:
epoch = 1000
learning_rate = 0.1

for i in range(epoch):
    n = len(x)
    #get prediction value
    pred = w0 + w1*x
    error = pred - y
    loss = np.sum(error**2) / (2*n)
    
    delta_w0 = np.sum(error) / n
    delta_w1 = np.sum(error * x) / n
    
    if not i%200:
        print('iteration number {}'.format(i))
        print('loss : {}'.format(loss))
    
    w0 -= learning_rate * delta_w0
    w1 -= learning_rate * delta_w1

iteration number 0
loss : 1.8473932702576625
iteration number 200
loss : 0.0011112237663111412
iteration number 400
loss : 0.0011111120045664508
iteration number 600
loss : 0.0011111111181970085
iteration number 800
loss : 0.0011111111111673055


In [7]:
print(w0 + w1*x)
print(y)

[2.13333338 3.13333334 4.13333331]
[2.1 3.2 4.1]


The results are pretty close to our objective!
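
As a sanity check, the gradient-descent predictions can be compared with the exact least-squares line. The sketch below uses `np.polyfit` for that; the names `x_toy`, `y_toy`, `slope`, `intercept` and `closed_form_pred` are only illustrative and are not part of the notebook above.

```python
import numpy as np

# Same toy data as in the first example (renamed so x and y are not overwritten)
x_toy = np.array([1, 2, 3])
y_toy = np.array([2.1, 3.2, 4.1])

# np.polyfit returns coefficients from the highest degree to the lowest,
# so for degree 1 the order is (slope, intercept)
slope, intercept = np.polyfit(x_toy, y_toy, 1)

# Predictions of the exact least-squares line
closed_form_pred = intercept + slope * x_toy
print(slope, intercept)
print(closed_form_pred)
```

The least-squares predictions come out close to 2.133, 3.133 and 4.133, which is what gradient descent converged to.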

## Let's try the matrix version

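First, let's initialize everything that we need.  
We will represent X as a matrix with one row per example and one column per feature, plus a leading column of ones for the bias term, and Y as a vector.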

In [19]:
total_row = 100
total_feature = 5
mean = 100
std = 10

raw_x = np.vstack([np.random.normal(mean,std,total_row) for i in range(total_feature)]).T
x = np.append(np.ones((total_row,1)),raw_x,axis=1)
print(x.shape)

y = np.random.normal(mean,std,total_row)
print(y.shape)

(100, 6)
(100,)


In [23]:
w = np.random.normal(0,1,total_feature+1)
# print(w.shape)

epoch = 1000
learning_rate = 0.00001

for i in range(epoch):
    n = len(x)
    #get prediction value
    pred = np.dot(x,w)
    error = pred - y
    loss = np.sum(error**2) / (2*n)
    
    delta_w = np.dot(error,x) / n
    
    if not i%200:
        print('iteration number {}'.format(i))
        print('loss : {}'.format(loss))
    
    w -= learning_rate * delta_w

iteration number 0
loss : 41557.05513270729
iteration number 200
loss : 136.49534137179415
iteration number 400
loss : 107.15388264265215
iteration number 600
loss : 88.8465962499302
iteration number 800
loss : 77.36232552773578
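
To see that the vectorized gradient `np.dot(error, x) / n` is the same quantity as the per-weight formulas from the first section, here is a small check. The tiny dataset and the names `x_small`, `y_small`, `w_small` and `feature` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Hypothetical dataset: 4 examples, a bias column of ones plus 1 feature
feature = rng.normal(100, 10, 4)
x_small = np.column_stack([np.ones(4), feature])
y_small = rng.normal(100, 10, 4)
w_small = rng.normal(0, 1, 2)

n = len(x_small)
error = x_small @ w_small - y_small

# Vectorized gradient, same expression as in the matrix version
vector_grad = np.dot(error, x_small) / n

# Per-weight formulas from the single-feature version
delta_w0 = np.sum(error) / n
delta_w1 = np.sum(error * feature) / n

print(np.allclose(vector_grad, [delta_w0, delta_w1]))
```

The column of ones is what makes the first entry of the vectorized gradient equal to `delta_w0`.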


## Stochastic Gradient Descent

Iterating through all the examples on every update can be costly.  
With SGD, each update uses only a random sample of the examples; we call the number of sampled examples the `batch_size`.

In [24]:
sgd_w = np.random.normal(0,1,total_feature+1)
# print(sgd_w.shape)

epoch = 1000
learning_rate = 0.00001
batch_size = 32

for i in range(epoch):
    n = len(x)
    
    sample_index = random.sample(range(n), batch_size)
    sample_x = x[sample_index,:]
    sample_y = y[sample_index]
    
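    #get prediction value on the sampled mini-batch
    pred = np.dot(sample_x,sgd_w)
    error = pred - sample_y
    # note: dividing by n (100) rather than batch_size (32) scales the logged
    # loss and the gradient by batch_size/n; the output below reflects this
    loss = np.sum(error**2) / (2*n)
    
    delta_w = np.dot(error,sample_x) / n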
    
    if not i%200:
        print('iteration number {}'.format(i))
        print('loss : {}'.format(loss))
    
    sgd_w -= learning_rate * delta_w

iteration number 0
loss : 7317.981902148507
iteration number 200
loss : 55.58489541738622
iteration number 400
loss : 47.70763872008232
iteration number 600
loss : 60.835224133588056
iteration number 800
loss : 47.78195444710167
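
The cell above divides the mini-batch loss and gradient by `n`, so they are scaled by `batch_size / n`. A minimal sketch of the more usual normalization, dividing by `batch_size`, is shown below. The function names `minibatch_sgd` and `half_mse`, the zero initial weights and the synthetic `demo_x` / `demo_y` data are assumptions for illustration, not the notebook's own code.

```python
import numpy as np

def minibatch_sgd(x, y, w_init, batch_size=32, epoch=1000,
                  learning_rate=1e-5, seed=0):
    """Mini-batch SGD on the half mean squared error (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w = w_init.astype(float).copy()
    n = len(x)
    for _ in range(epoch):
        # Draw batch_size distinct row indices for this update
        idx = rng.choice(n, size=batch_size, replace=False)
        sample_x, sample_y = x[idx], y[idx]
        error = sample_x @ w - sample_y
        # Average over the mini-batch, not over the full dataset
        delta_w = sample_x.T @ error / batch_size
        w -= learning_rate * delta_w
    return w

def half_mse(x, y, w):
    """Half mean squared error over the full dataset."""
    error = x @ w - y
    return np.sum(error**2) / (2 * len(y))

# Synthetic data with the same shape and distribution as in the matrix section
data_rng = np.random.default_rng(1)
demo_x = np.hstack([np.ones((100, 1)), data_rng.normal(100, 10, (100, 5))])
demo_y = data_rng.normal(100, 10, 100)

w_start = np.zeros(6)
w_end = minibatch_sgd(demo_x, demo_y, w_start)
print(half_mse(demo_x, demo_y, w_start), half_mse(demo_x, demo_y, w_end))
```

Evaluating the loss on the full dataset, as `half_mse` does here, makes the numbers comparable with the full-batch run.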


## Regularization

Since everything went well, we will now introduce L1 and L2 regularization.  
There is some good reading here: https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261

For this part we will use a different loss: the sum of squared errors (without dividing by 2n).
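
Written out, these are the losses and gradients that the next three cells implement. Here $\lambda$ stands for `regularization_rate` and $X$, $y$, $w$ for the arrays `x`, `y` and the weight vector; the symbols are my notation for the code's variables. Since $|w_j|$ is not differentiable at zero, the L1 case uses the sign of each weight as a subgradient.

$$
L_{\text{none}}(w) = \sum_{i=1}^{n} \left( x_i^\top w - y_i \right)^2,
\qquad
\nabla L_{\text{none}}(w) = 2 X^\top (Xw - y)
$$

$$
L_{1}(w) = \sum_{i=1}^{n} \left( x_i^\top w - y_i \right)^2 + \lambda \sum_{j} |w_j|,
\qquad
\nabla L_{1}(w) = 2 X^\top (Xw - y) + \lambda \, \mathrm{sign}(w)
$$

$$
L_{2}(w) = \sum_{i=1}^{n} \left( x_i^\top w - y_i \right)^2 + \lambda \sum_{j} w_j^2,
\qquad
\nabla L_{2}(w) = 2 X^\top (Xw - y) + 2 \lambda w
$$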

In [457]:
l_w = np.random.normal(0,10,total_feature+1)
print(l_w.shape)

#TRAINING TIME
#THIS TRAINING USES NO REGULARIZATION (baseline)
epoch = 10000
learning_rate = 0.0000001
regularization_rate = 0.0000001  # unused in this baseline run

for i in range(epoch):
    n = len(x)
    #get prediction value
    pred = np.dot(x,l_w)
    error = pred - y
    loss = np.sum(error**2)
    
    delta_w = np.dot(error,2*x)
    
    if not i%1000:
        print('iteration number {}'.format(i))
        print('loss : {}'.format(loss))
    
    l_w -= learning_rate * delta_w

(6,)
iteration number 0
loss : 143755026.7819015
iteration number 1000
loss : 110253.22041986421
iteration number 2000
loss : 17135.920181149297
iteration number 3000
loss : 13601.384358595089
iteration number 4000
loss : 13401.579813652424
iteration number 5000
loss : 13387.598666296826
iteration number 6000
loss : 13386.362927453943
iteration number 7000
loss : 13386.066630217578
iteration number 8000
loss : 13385.841064640268
iteration number 9000
loss : 13385.620871860227


In [470]:
l_one_w = np.random.normal(0,10,total_feature+1)

#TRAINING TIME
#THIS TRAINING WILL USE L1 
epoch = 10000
learning_rate = 0.0000001
regularization_rate = 0.001

for i in range(epoch):
    n = len(x)
    #get prediction value
    pred = np.dot(x,l_one_w)
    error = pred - y
    loss = np.sum(error**2) + np.sum(regularization_rate * np.absolute(l_one_w))
    
    # np.sign gives the L1 subgradient per weight (+1, -1, or 0 at exactly zero)
    delta_w = np.dot(error,2*x) + (np.sign(l_one_w) * regularization_rate)
    
    if not i%1000:
        print('iteration number {}'.format(i))
        print('loss : {}'.format(loss))
    
    l_one_w -= learning_rate * delta_w

iteration number 0
loss : 418452007.38917416
iteration number 1000
loss : 178425.7736440261
iteration number 2000
loss : 22838.58720822707
iteration number 3000
loss : 13698.292235264465
iteration number 4000
loss : 13066.251622974813
iteration number 5000
loss : 13019.639466495992
iteration number 6000
loss : 13015.967203419077
iteration number 7000
loss : 13015.51601185989
iteration number 8000
loss : 13015.308307826881
iteration number 9000
loss : 13015.119070547104


In [471]:
l_two_w = np.random.normal(0,10,total_feature+1)

#TRAINING TIME
#THIS TRAINING WILL USE L2
epoch = 10000
learning_rate = 0.0000001
regularization_rate = 0.001

for i in range(epoch):
    n = len(x)
    pred = np.dot(x,l_two_w)
    error = pred - y
    loss = np.sum(error**2) + np.sum(regularization_rate * l_two_w**2)
    
    delta_w = np.dot(error,2*x) + (2*regularization_rate*l_two_w)
    
    if not i%1000:
        print('iteration number {}'.format(i))
        print('loss : {}'.format(loss))
    
    l_two_w -= learning_rate * delta_w

iteration number 0
loss : 2591491458.6033354
iteration number 1000
loss : 53645.83481158214
iteration number 2000
loss : 15363.293446707274
iteration number 3000
loss : 13107.042188666705
iteration number 4000
loss : 12948.513804834893
iteration number 5000
loss : 12936.580636088425
iteration number 6000
loss : 12935.513609425416
iteration number 7000
loss : 12935.2657647719
iteration number 8000
loss : 12935.079981168486
iteration number 9000
loss : 12934.898918806608


In [472]:
l_w

array([-3.09942194,  0.22189387,  0.16570789,  0.11135865,  0.18739065,
        0.32159271])

In [473]:
l_one_w

array([5.1150334 , 0.20889359, 0.14450967, 0.09459891, 0.17643099,
       0.30179577])

In [474]:
l_two_w

array([6.984858  , 0.20591197, 0.13968926, 0.09078336, 0.17389809,
       0.29734472])
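
The three runs above each start from different random weights, which makes their final weights hard to compare. The sketch below is one possible way to make the comparison fairer: it reuses a single initial vector for all three runs and uses a much larger regularization rate. The helper `train`, the values `shared_init` and `strong_rate = 1e4`, and the synthetic `demo_x` / `demo_y` data are all assumptions chosen for illustration, not values from the experiment above.

```python
import numpy as np

def train(x, y, w_init, penalty="none", regularization_rate=0.0,
          epoch=10000, learning_rate=1e-7):
    """Gradient descent on the sum of squared errors with an optional penalty."""
    w = w_init.astype(float).copy()
    for _ in range(epoch):
        error = x @ w - y
        grad = 2 * (x.T @ error)
        if penalty == "l1":
            # Subgradient of regularization_rate * sum(|w|)
            grad = grad + regularization_rate * np.sign(w)
        elif penalty == "l2":
            # Gradient of regularization_rate * sum(w**2)
            grad = grad + 2 * regularization_rate * w
        w -= learning_rate * grad
    return w

# Synthetic data with the same shape and distribution as in the matrix section
rng = np.random.default_rng(0)
demo_x = np.hstack([np.ones((100, 1)), rng.normal(100, 10, (100, 5))])
demo_y = rng.normal(100, 10, 100)

shared_init = np.full(6, 5.0)  # same starting point for every run (assumed)
strong_rate = 1e4              # deliberately large so the penalty is visible

w_none = train(demo_x, demo_y, shared_init)
w_l1 = train(demo_x, demo_y, shared_init, "l1", strong_rate)
w_l2 = train(demo_x, demo_y, shared_init, "l2", strong_rate)

for name, w in [("none", w_none), ("l1", w_l1), ("l2", w_l2)]:
    print(name, np.round(w, 3), "sum|w| =", round(float(np.abs(w).sum()), 3))
```

With a shared starting point, any difference in the final weights comes from the penalty rather than from the initialization.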

## Reflection

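Here I learned how to build linear regression with gradient descent.
The difference between GD and SGD is that SGD does not use the whole dataset to compute each update, so each step is cheaper while the result is similar. Note that the logged SGD loss is divided by n rather than by batch_size, so its values are not directly comparable with the full-batch loss.

About regularization: I read that it is important, but I could not see its effect in this exercise. Two things probably hide it here. The penalty is tiny next to the data loss (a rate of 0.001 against a sum of squared errors of around 13,000), and each run starts from different random weights, so the final weights are not directly comparable. Also, y is generated independently of x, so there is no real relationship for the model to overfit.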

## Reading List

- https://towardsdatascience.com/intuitions-on-l1-and-l2-regularisation-235f2db4c261