## Implement the network using NumPy

### Math derivation

$h=\hat x\cdot\omega_1$

$h_{relu}=relu(h)$

$\hat y=h_{relu}\cdot\omega_2$

#### Update $\omega_2$
Loss $L=(\hat{y}-y)^2$ (in the code below, this squared error is summed over all elements of $\hat y - y$)

$\frac{\partial{L}}{\partial\omega_2} = 2(\hat{y}-y)\cdot \frac{\partial\hat{y}}{\partial\omega_2}$

since $\hat y=h_{relu}\cdot\omega_2$

$\frac{\partial\hat{y}}{\partial\omega_2}=h_{relu}$

Now $\frac{\partial{L}}{\partial\omega_2} = 2 (\hat y -y)\cdot \frac{\partial\hat y}{\partial \omega_2}= 2 (\hat y -y)\cdot h_{relu}$

To update $\omega_2$, take a gradient-descent step scaled by the learning rate $\eta$ (`learning_rate` in the code):

$\omega_2^*=\omega_2-\eta\frac{\partial{L}}{\partial\omega_2}$
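
One way to sanity-check the $\omega_2$ gradient derived above is to compare it against a central finite-difference estimate. The sketch below is only an illustration: the small sizes, the fixed seed, the step `eps`, and the helper name `loss_fn` are assumptions made for the check and are not part of the original notebook. In matrix form the gradient is written as `h_relu.T.dot(2 * (y_pred - y))`, with the loss summed over all elements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small, hypothetical sizes so the check runs instantly.
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1, w2):
    """Squared error summed over all elements of y_pred - y."""
    h_relu = np.maximum(x.dot(w1), 0)
    y_pred = h_relu.dot(w2)
    return np.square(y_pred - y).sum()


# Analytic gradient: dL/dw2 = h_relu^T . 2 (y_pred - y)
h_relu = np.maximum(x.dot(w1), 0)
y_pred = h_relu.dot(w2)
grad_w2 = h_relu.T.dot(2.0 * (y_pred - y))

# Central finite differences, perturbing one entry of w2 at a time.
eps = 1e-6
num_grad_w2 = np.zeros_like(w2)
for i in range(H):
    for j in range(D_out):
        w2_plus, w2_minus = w2.copy(), w2.copy()
        w2_plus[i, j] += eps
        w2_minus[i, j] -= eps
        num_grad_w2[i, j] = (loss_fn(w1, w2_plus) - loss_fn(w1, w2_minus)) / (2 * eps)

# Largest disagreement between the analytic and numerical gradients.
max_abs_diff = np.abs(grad_w2 - num_grad_w2).max()
print(max_abs_diff)
```

If the derivation is right, the printed difference should be tiny compared with the size of the gradient entries, since the loss is quadratic in $\omega_2$.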

#### Update $\omega_1$

$\frac{\partial L}{\partial \omega_1} = \frac{\partial L}{\partial h_{relu}}\cdot\frac{\partial h_{relu}}{\partial h}\cdot\frac{\partial h}{\partial \omega_1}$

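The first factor is $\frac{\partial L}{\partial h_{relu}}= 2(\hat y -y)\cdot\frac{\partial\hat y}{\partial h_{relu}}=2(\hat y -y)\cdot\omega_2$

The last factor is $\frac{\partial h }{\partial \omega_1}=\hat x$

As for $\frac{\partial h_{relu}}{\partial h}$, since we are using a ReLU function, it becomes a matrix whose elements are either 1 or 0. For any location with $h_{ij} < 0$, $\left(\frac{\partial h_{relu}}{\partial h}\right)_{ij}=0$. For any location with $h_{ij} \geq 0$, $\left(\frac{\partial h_{relu}}{\partial h}\right)_{ij}=1$. The value at exactly $h_{ij}=0$ is a convention, since ReLU is not differentiable there; it matches `grad_h[h < 0] = 0` in the code.

To update $\omega_1$, take a gradient-descent step scaled by the learning rate $\eta$ (`learning_rate` in the code):

$\omega_1^*=\omega_1-\eta\frac{\partial L}{\partial\omega_1}$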
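
The three factors of the chain rule for $\omega_1$ can be assembled and checked numerically. The following is a minimal sketch under assumptions: the sizes, seed, `eps`, and the names `relu_mask` and `loss_fn` are illustrative and not taken from the original notebook. The ReLU derivative appears as an element-wise 0/1 mask.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small, hypothetical sizes for a quick check.
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))


def loss_fn(w1):
    """Summed squared error as a function of w1 (w2 held fixed)."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


# Forward pass
h = x.dot(w1)
h_relu = np.maximum(h, 0)
y_pred = h_relu.dot(w2)

# dh_relu/dh as a 0/1 mask: 0 where h < 0, 1 where h >= 0
relu_mask = (h >= 0).astype(h.dtype)

# Chain rule: dL/dh_relu, then mask by the ReLU derivative, then dh/dw1
grad_h_relu = (2.0 * (y_pred - y)).dot(w2.T)
grad_h = grad_h_relu * relu_mask
grad_w1 = x.T.dot(grad_h)

# Central finite differences, perturbing one entry of w1 at a time.
eps = 1e-6
num_grad_w1 = np.zeros_like(w1)
for i in range(D_in):
    for j in range(H):
        w1_plus, w1_minus = w1.copy(), w1.copy()
        w1_plus[i, j] += eps
        w1_minus[i, j] -= eps
        num_grad_w1[i, j] = (loss_fn(w1_plus) - loss_fn(w1_minus)) / (2 * eps)

# Largest disagreement between the analytic and numerical gradients.
max_abs_diff = np.abs(grad_w1 - num_grad_w1).max()
print(max_abs_diff)
```

Multiplying by `relu_mask` has the same effect as copying the gradient and zeroing the entries where `h < 0`.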

In [13]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

## Same as above with PyTorch

In [14]:
import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2