## Implement the network using NumPy

### Math derivation

$h=\hat x\cdot\omega_1$

$h_{relu}=relu(h)$

$\hat y=h_{relu}\cdot\omega_2$
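To make the shapes concrete, here is a minimal sketch of the forward pass on a tiny hand-picked example. The values of `x`, `w1`, and `w2` and the helper name `forward` are assumptions chosen for illustration; they are not part of the network trained later in this notebook.

```python
import numpy as np

def forward(x, w1, w2):
    """Two-layer forward pass: h = x.w1, h_relu = relu(h), y_pred = h_relu.w2."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    return h, h_relu, y_pred

# Hypothetical data: 1 sample, 2 inputs, 2 hidden units, 1 output.
x = np.array([[1.0, 2.0]])             # shape (1, 2)
w1 = np.array([[0.5, -1.0],
               [1.0, 0.25]])           # shape (2, 2)
w2 = np.array([[2.0],
               [3.0]])                 # shape (2, 1)

h, h_relu, y_pred = forward(x, w1, w2)

# h      = [1*0.5 + 2*1.0, 1*(-1.0) + 2*0.25] = [2.5, -0.5]
# h_relu = [2.5, 0.0]  (the negative entry is clipped to zero)
# y_pred = 2.5*2.0 + 0.0*3.0 = 5.0
print(h.tolist(), h_relu.tolist(), y_pred.tolist())
```

Only the first hidden unit contributes to the prediction here, because relu zeroes the second one.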

#### Update $\omega_2$
The loss is the squared error $L=(\hat{y}-y)^2$ (the code below sums it over all samples and outputs).

$\frac{\partial{L}}{\partial\omega_2} = 2(\hat{y}-y)\cdot \frac{\partial\hat{y}}{\partial\omega_2}$

Since $\hat y=h_{relu}\cdot\omega_2$, we have

$\frac{\partial\hat{y}}{\partial\omega_2}=h_{relu}$

Substituting this into the chain rule gives $\frac{\partial{L}}{\partial\omega_2} = 2 (\hat y -y)\cdot \frac{\partial\hat y}{\partial \omega_2}= 2 (\hat y -y)\cdot h_{relu}$

To update $\omega_2$, take a gradient descent step with learning rate $\eta$ (`learning_rate` in the code below):

$\omega_2^*=\omega_2-\eta\frac{\partial{L}}{\partial\omega_2}$
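For a batch of samples, the scalar gradient above becomes the matrix product $h_{relu}^T\cdot 2(\hat y-y)$. The sketch below checks that form against central finite differences on small random data and then takes one descent step. The helper names, shapes, seed, and step size `eta` are illustrative assumptions.

```python
import numpy as np

def loss_fn(h_relu, w2, y):
    """Summed squared error L = sum((h_relu . w2 - y)^2)."""
    return np.square(h_relu.dot(w2) - y).sum()

def grad_w2_analytic(h_relu, w2, y):
    """Matrix form of 2 * (y_hat - y) * h_relu."""
    return h_relu.T.dot(2.0 * (h_relu.dot(w2) - y))

def grad_w2_numeric(h_relu, w2, y, eps=1e-6):
    """Central finite differences, one weight at a time (for checking only)."""
    grad = np.zeros_like(w2)
    for idx in np.ndindex(*w2.shape):
        w_plus, w_minus = w2.copy(), w2.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        grad[idx] = (loss_fn(h_relu, w_plus, y) - loss_fn(h_relu, w_minus, y)) / (2 * eps)
    return grad

# Hypothetical small problem: 4 samples, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
h_relu = np.maximum(rng.standard_normal((4, 3)), 0)
w2 = rng.standard_normal((3, 2))
y = rng.standard_normal((4, 2))

analytic = grad_w2_analytic(h_relu, w2, y)
numeric = grad_w2_numeric(h_relu, w2, y)
print("max abs difference:", np.abs(analytic - numeric).max())

# One gradient descent step with an assumed learning rate.
eta = 1e-3
loss_before = loss_fn(h_relu, w2, y)
w2_new = w2 - eta * analytic
loss_after = loss_fn(h_relu, w2_new, y)
print("loss before:", loss_before, "loss after:", loss_after)
```

The finite-difference loop is far too slow for real training; it is only useful as a sanity check on a hand-derived gradient.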

#### Update $\omega_1$

$\frac{\partial L}{\partial \omega_1} = \frac{\partial L}{\partial h_{relu}}\cdot\frac{\partial h_{relu}}{\partial h}\cdot\frac{\partial h}{\partial \omega_1}$

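The first factor is $\frac{\partial L}{\partial h_{relu}}= 2(\hat y -y)\cdot\frac{\partial\hat y}{\partial h_{relu}}=2(\hat y -y)\cdot\omega_2$

The last factor is $\frac{\partial h }{\partial \omega_1}=\hat x$

As for the middle factor $\frac{\partial h_{relu}}{\partial h}$: because relu is applied elementwise, it is a matrix whose elements are either 1 or 0. For any location with $h_{ij} < 0$, $\left(\frac{\partial h_{relu}}{\partial h}\right)_{ij}=0$. For any location with $h_{ij} \ge 0$, $\left(\frac{\partial h_{relu}}{\partial h}\right)_{ij}=1$ (at exactly $h_{ij}=0$ the derivative is undefined, and 1 is the convention the code below uses).

Finally, $\omega_1$ is updated by a gradient descent step with learning rate $\eta$ (`learning_rate` in the code below):

$\omega_1^*=\omega_1-\eta\frac{\partial L}{\partial \omega_1}$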
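The three factors of the chain rule can be sketched in NumPy as follows. The relu derivative is written as an explicit 0/1 mask, and the result is compared with central finite differences on small random data. The helper names, shapes, and seed are assumptions made for this illustration.

```python
import numpy as np

def loss_fn(x, w1, w2, y):
    """Summed squared error of the two-layer network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

def grad_w1_analytic(x, w1, w2, y):
    """Chain rule: dL/dh_relu, then the relu mask, then dh/dw1."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    grad_h_relu = (2.0 * (h_relu.dot(w2) - y)).dot(w2.T)  # dL/dh_relu
    relu_mask = (h >= 0).astype(h.dtype)                   # 1 where h >= 0, else 0
    grad_h = grad_h_relu * relu_mask                       # dL/dh
    return x.T.dot(grad_h)                                 # dL/dw1

def grad_w1_numeric(x, w1, w2, y, eps=1e-6):
    """Central finite differences, one weight at a time (for checking only)."""
    grad = np.zeros_like(w1)
    for idx in np.ndindex(*w1.shape):
        w_plus, w_minus = w1.copy(), w1.copy()
        w_plus[idx] += eps
        w_minus[idx] -= eps
        grad[idx] = (loss_fn(x, w_plus, w2, y) - loss_fn(x, w_minus, w2, y)) / (2 * eps)
    return grad

# The mask on a hand-picked row: negative -> 0, zero and positive -> 1.
h_example = np.array([[-1.0, 0.0, 2.0]])
relu_mask_example = (h_example >= 0).astype(h_example.dtype)
print(relu_mask_example.tolist())  # [[0.0, 1.0, 1.0]]

# Hypothetical small problem: 4 samples, 5 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
w1 = rng.standard_normal((5, 3))
w2 = rng.standard_normal((3, 2))
y = rng.standard_normal((4, 2))

analytic = grad_w1_analytic(x, w1, w2, y)
numeric = grad_w1_numeric(x, w1, w2, y)
print("max abs difference:", np.abs(analytic - numeric).max())
```

Multiplying by the mask is equivalent to copying the upstream gradient and zeroing it wherever `h < 0`.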

In [1]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 29227024.264679532
1 26989938.52109949
2 28271391.363715913
3 28680739.575943768
4 25329623.810732782
5 18509200.956911348
6 11344322.94277013
7 6236458.575750768
8 3404322.880933826
9 1999303.30266066
10 1312789.7592160555
11 956779.5485744701
12 751961.2335689417
13 619630.3078684796
14 524984.5396871697
15 452153.002157927
16 393585.098079461
17 345150.69886881474
18 304399.72933581757
19 269685.68391873
20 239869.35827408257
21 214110.40360320627
22 191720.47174837333
23 172165.69228627006
24 155038.31906102056
25 139965.90256812546
26 126668.18762851864
27 114893.49135892853
28 104426.75006244925
29 95093.14025096643
30 86745.91500342693
31 79267.23306387244
32 72549.13085893916
33 66502.07596606872
34 61063.498304563276
35 56143.338173190175
36 51689.58158089666
37 47649.445945494925
38 43975.96197554712
39 40640.72235745108
40 37602.445503700044
41 34828.46404253874
42 32289.78205174451
43 29963.73335882849
44 27830.028179857374
45 25868.893390049994
46 24065.088369677767
...

458 8.573881268031201e-05
459 8.213845615805286e-05
460 7.869029406603643e-05
461 7.538731861904827e-05
462 7.222256650875454e-05
463 6.919178443088723e-05
464 6.628799226165634e-05
465 6.350695047946783e-05
466 6.084266459032536e-05
467 5.829024885644659e-05
468 5.584548518197174e-05
469 5.350328553905751e-05
470 5.125970544261939e-05
471 4.911076974110219e-05
472 4.705188892293691e-05
473 4.5079649221123835e-05
474 4.319009960065717e-05
475 4.1379901140289e-05
476 3.964577799886984e-05
477 3.798455398112514e-05
478 3.6393108552793485e-05
479 3.486876365640251e-05
480 3.340836176685191e-05
481 3.200897196367004e-05
482 3.0668456698484665e-05
483 2.9384265974331764e-05
484 2.8154250891772485e-05
485 2.6975791119138642e-05
486 2.5846460141551553e-05
487 2.4764568879282767e-05
488 2.3728216522993573e-05
489 2.2735175091264145e-05
490 2.1783858583436286e-05
491 2.0872438111598936e-05
492 1.9999187424829006e-05
493 1.9162642384450105e-05
494 1.836104530173184e-05
495 1.7593151386183292e-05

## The same network with PyTorch

In [11]:
import torch


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda")  # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 41176791.64383209
1 37197638.668970406
2 32580420.836689204
3 24437651.01230657
4 15428795.321009398
5 8668568.522182766
6 4831317.338235217
7 2911125.845812292
8 1957627.013845018
9 1444489.5921075633
10 1133508.9502261158
11 922695.8083313257
12 767381.2873488674
13 646802.1414061354
14 550055.2230981077
15 470982.25789704314
16 405679.6958162985
17 351124.92887842027
18 305294.11042269133
19 266512.0268674786
20 233532.51815453696
21 205352.14127040366
22 181177.9579577078
23 160338.25294410283
24 142328.9601174069
25 126732.38028091629
26 113149.67807172437
27 101285.90938537774
28 90883.80722815206
29 81728.95433691202
30 73649.33941151379
31 66501.22233547643
32 60164.468203456985
33 54532.55434620159
34 49516.78941957936
35 45032.29437586704
36 41016.31081408805
37 37417.3630520672
38 34180.868304825155
39 31267.53730159797
40 28639.19202714725
41 26264.328612244688
42 24118.22698606772
43 22172.053450388812
44 20405.141923560586
45 18799.507895009476
46 17337.242958042916
...

467 6.689500952868077e-05
468 6.599192415773086e-05
469 6.498561204294762e-05
470 6.381656521632032e-05
471 6.274590853334264e-05
472 6.193131395771517e-05
473 6.099347364697805e-05
474 6.031373406892304e-05
475 5.922694562651796e-05
476 5.8264797117153246e-05
477 5.730204605089526e-05
478 5.659677161415355e-05
479 5.570808565599772e-05
480 5.4910825454806966e-05
481 5.410619362286595e-05
482 5.3443852646245515e-05
483 5.2498574466222636e-05
484 5.194140004050052e-05
485 5.116042011210631e-05
486 5.055030006329009e-05
487 4.957798830151905e-05
488 4.901448360417754e-05
489 4.8375495882599547e-05
490 4.7596162100309314e-05
491 4.7088399214328946e-05
492 4.63130222427971e-05
493 4.570584266493094e-05
494 4.505722549082852e-05
495 4.456305781445902e-05
496 4.388722153453317e-05
497 4.3229939778538506e-05
498 4.255034805589197e-05
499 4.221715611639476e-05
