# Gradient descent

## Theory + Math
---


**learning** is about finding **w** that minimises the loss
> arg min loss(w)

Gradient descent intuition:

<img src="images/gradient_descent.gif" width="400" align="center">

Mathematically,

$$loss = (\hat y - y)^2 = (x \ast w - y)^2$$

$$w_{n+1} = w_{n} - \alpha \ast \frac{\partial loss}{\partial w} $$

Using calculus (or https://www.derivative-calculator.net) we can show that:

$$
\frac{\partial loss}{\partial w} = \frac{\partial}{\partial w} [(x \ast w - y)^2] = 2w\left(wx-y\right)
$$

Therefore, the learning equation is as follows:

$$w_{n+1} = w_{n} - \alpha \ast 2w\left(wx-y\right) $$

where $\alpha$ is the **learning rate**, a **hyperparameter** of the model.


Here is the step-by-step walkthrough of the derivation:

<img src="images/loss_function_derivative.jpg" width="400" align="center">

## Implementation
---


We will re-use the setup of our [01_supervised_learning.ipynb](01_supervised_learning.ipynb) notebook

In [1]:
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

In [13]:
# randonmly initialise weight w
w = 1.0

In [3]:
# define forward-pass funtion
def forward(x):
    return round((x * w), 2)

In [4]:
# define loss function
def loss(x, y):
    y_hat = forward(x)
    return round(((y_hat - y) * (y_hat - y)), 2)

We will define our `gradient()` function and re-write the training loop to make use of the `gradient()` function. 

In [5]:
# compute gradient
def gradient(x, y):
    return 2 * x * (x * w - y)

In [14]:
# learning using gradient descent

## before training 
print("predict (before training)", 4, forward(4))

## learning loop
alpha = 0.01
for epoch in range(100):
    for x_val, y_val in zip(x_data, y_data):
        grad = gradient(x_val, y_val)
        w = w - alpha * grad
        l = loss(x_val, y_val)
        print("\t", "grad: ", x_val, y_val, grad, l)
    print("progress: ", epoch, "w=", w, "loss=", l)

## after training
print("predict (after training)", 4, forward(4))

predict (before training) 4 4.0
	 grad:  1.0 2.0 -2.0 0.96
	 grad:  2.0 4.0 -7.84 3.24
	 grad:  3.0 6.0 -16.2288 4.93
progress:  0 w= 1.260688 loss= 4.93
	 grad:  1.0 2.0 -1.478624 0.52
	 grad:  2.0 4.0 -5.796206079999999 1.77
	 grad:  3.0 6.0 -11.998146585599997 2.69
progress:  1 w= 1.453417766656 loss= 2.69
	 grad:  1.0 2.0 -1.093164466688 0.29
	 grad:  2.0 4.0 -4.285204709416961 0.98
	 grad:  3.0 6.0 -8.87037374849311 1.46
progress:  2 w= 1.5959051959019805 loss= 1.46
	 grad:  1.0 2.0 -0.8081896081960389 0.16
	 grad:  2.0 4.0 -3.1681032641284723 0.53
	 grad:  3.0 6.0 -6.557973756745939 0.81
progress:  3 w= 1.701247862192685 loss= 0.81
	 grad:  1.0 2.0 -0.59750427561463 0.08
	 grad:  2.0 4.0 -2.3422167604093502 0.29
	 grad:  3.0 6.0 -4.848388694047353 0.44
progress:  4 w= 1.7791289594933983 loss= 0.44
	 grad:  1.0 2.0 -0.44174208101320334 0.05
	 grad:  2.0 4.0 -1.7316289575717576 0.16
	 grad:  3.0 6.0 -3.584471942173538 0.24
progress:  5 w= 1.836707389300983 loss= 0.24
	 grad:  1.0 2

## Exercise
---

Implement for non-linear mapping: 
$$ \hat{y} = x^2 \ast w_2 + x \ast w_1 + b $$
$$ loss = (\hat{y} - y)^2 $$
$$ \frac {\partial}{\partial w_1} loss = x $$
$$ \frac {\partial}{\partial w_2} loss = x^2 $$

In [1]:
x_data = [1.0, 2.0, 3.0]
y_data = [2.0, 4.0, 6.0]

In [12]:
# randonmly initialise weight w
w_1 = 1.0
w_2 = 1.0
# b = 1.0

In [7]:
# define forward-pass funtion
def forward(x):
    return ((x * x * w_2) + (x * w_1))
    # return ((x * x * w_2) + (x * w_1) + b)

In [8]:
# define loss function
def loss(x, y):
    y_hat = forward(x)
    return round(((y_hat - y) * (y_hat - y)), 2)

In [9]:
# compute gradient
def gradient_w1(x, y):
    return x

def gradient_w2(x, y):
    return x * x

In [13]:
# learning using gradient descent
threshold = 0.1

## before training 
print("predict (before training)", 4, forward(4))

## learning loop
alpha = 0.01
for epoch in range(100):
    for x_val, y_val in zip(x_data, y_data):
        grad_w1 = gradient_w1(x_val, y_val)
        grad_w2 = gradient_w2(x_val, y_val)
        w_1 = w_1 - alpha * grad_w1
        w_2 = w_2 - alpha * grad_w2
        l = loss(x_val, y_val)
        print("\t", "grad: ", x_val, y_val, grad_w1, grad_w2)
    print("progress: ", epoch, "w1=", w_1, "w2=", w_2, "loss=", l)
    if l < threshold:
        break

## after training
print("predict (after training)", 4, forward(4))

predict (before training) 4 20.0
	 grad:  1.0 2.0 1.0 1.0
	 grad:  2.0 4.0 2.0 4.0
	 grad:  3.0 6.0 3.0 9.0
progress:  0 w1= 0.94 w2= 0.86 loss= 20.79
	 grad:  1.0 2.0 1.0 1.0
	 grad:  2.0 4.0 2.0 4.0
	 grad:  3.0 6.0 3.0 9.0
progress:  1 w1= 0.8799999999999999 w2= 0.72 loss= 9.73
	 grad:  1.0 2.0 1.0 1.0
	 grad:  2.0 4.0 2.0 4.0
	 grad:  3.0 6.0 3.0 9.0
progress:  2 w1= 0.8199999999999998 w2= 0.58 loss= 2.82
	 grad:  1.0 2.0 1.0 1.0
	 grad:  2.0 4.0 2.0 4.0
	 grad:  3.0 6.0 3.0 9.0
progress:  3 w1= 0.7599999999999998 w2= 0.43999999999999995 loss= 0.06
predict (after training) 4 10.079999999999998
