# PyTorch Examples

These notes follow the PyTorch beginner tutorial, available [here](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#warm-up-numpy). The examples use a fully connected ReLU network, first implemented in NumPy and then in PyTorch.

## Tensors

### Warm-up: Numpy

Let's create a two-layer network in NumPy.

In [1]:
import numpy as np

# N is batch size; D_in is the input dimension
# H is the hidden layer dimension, D_out is the output dimension (logits layer)
N, D_in, H, D_out = 64, 1000, 100, 10

# Create some random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    

0 35224320.923023894
1 33846757.529069565
2 34041408.06053148
3 30405850.959074683
4 22293013.033890463
5 13387895.2140003
6 7145003.620157708
7 3827122.756219165
8 2265352.962689481
9 1524410.0187285487
10 1136876.1529271163
11 904726.1732398119
12 747114.3770108283
13 629945.4319734013
14 538170.2681294582
15 463665.71195878077
16 402211.06762233726
17 350866.4201988302
18 307536.2489418563
19 270691.2275828413
20 239226.8961031773
21 212166.80653476404
22 188787.34289280188
23 168539.00697331404
24 150919.2907156765
25 135509.15634154194
26 121972.48535157854
27 110058.63540717901
28 99537.25454193642
29 90220.276541584
30 81943.31036946893
31 74572.43929628385
32 67988.70663038365
33 62093.94145401508
34 56801.565477668526
35 52040.95406404389
36 47750.311063333866
37 43876.22733831213
38 40370.336610342565
39 37193.226432098076
40 34308.95247971355
41 31687.472123193456
42 29298.71366132462
43 27119.383321249716
44 25127.569675540562
45 23305.008814979527
46 21635.969002042224
47 

426 0.0010960295525315805
427 0.0010530906395548774
428 0.0010118376195362773
429 0.0009722144750967933
430 0.0009341391255438891
431 0.0008975547117417472
432 0.0008624053356295401
433 0.000828648343755006
434 0.0007962090592763982
435 0.0007650388579684027
436 0.0007350907339975935
437 0.000706314742240707
438 0.0006786685617478686
439 0.0006521045672331654
440 0.0006265934603702772
441 0.0006020791771576917
442 0.000578521613444691
443 0.0005558859667330707
444 0.0005341440436647123
445 0.0005132526178679644
446 0.0004931777445807446
447 0.0004738868876997815
448 0.0004553528225231134
449 0.0004375447614228398
450 0.00042043340655951235
451 0.0004039943711679116
452 0.000388198450141205
453 0.0003730213219874639
454 0.0003584394590849948
455 0.00034443399314993756
456 0.00033097561756748067
457 0.00031804018644190365
458 0.0003056096876850953
459 0.0002936669796235554
460 0.0002821918447165568
461 0.0002711662500159004
462 0.00026057110195584684
463 0.0002503916383068753
464 0.00024
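
The hand-derived gradients in the cell above can be sanity-checked against finite differences. The following is a minimal sketch and not part of the original tutorial: the helper name `loss_and_grads`, the tiny layer sizes, the fixed seed, and the choice of entry `(i, j)` are all assumptions made for illustration.

```python
import numpy as np

def loss_and_grads(x, y, w1, w2):
    """Forward and backward pass of the two-layer ReLU network (hypothetical helper)."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes (an assumption) so the check is fast
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

loss, grad_w1, grad_w2 = loss_and_grads(x, y, w1, w2)

# Central finite difference for a single entry of w1
eps = 1e-6
i, j = 2, 1
w1_plus, w1_minus = w1.copy(), w1.copy()
w1_plus[i, j] += eps
w1_minus[i, j] -= eps
loss_plus = loss_and_grads(x, y, w1_plus, w2)[0]
loss_minus = loss_and_grads(x, y, w1_minus, w2)[0]
numeric = (loss_plus - loss_minus) / (2 * eps)

# The two numbers should agree to several decimal places
print(numeric, grad_w1[i, j])
```

If the two printed values differ noticeably, the backward pass is the first place to look. A check like this can be unreliable when a hidden activation sits almost exactly at the ReLU kink at zero.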

### PyTorch: Tensors

Instead of NumPy arrays, we can use PyTorch tensors. Tensors can keep track of a computational graph and gradients (used in the Autograd section below), and tensor operations can also run on a GPU. In this example we still implement the forward and backward passes by hand.

In [2]:
import torch

dtype = torch.float
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # fall back to CPU without a GPU

# N is batch size; D_in is the input dimension
# H is the hidden layer dimension, D_out is the output dimension (logits layer)
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)    # mm = matrix multiplication
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)
    
    # Backprop to compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 37823732.0
1 37921692.0
2 39042968.0
3 34549788.0
4 24226378.0
5 13495472.0
6 6744312.0
7 3525357.75
8 2134091.5
9 1494022.125
10 1152655.75
11 937961.0625
12 785165.0625
13 667947.375
14 574058.875
15 497080.65625
16 433048.75
17 379292.46875
18 333773.5625
19 294984.03125
20 261761.484375
21 233156.96875
22 208379.203125
23 186826.734375
24 168004.015625
25 151494.25
26 136953.171875
27 124133.9921875
28 112772.9375
29 102659.796875
30 93636.5625
31 85565.828125
32 78325.015625
33 71809.5234375
34 65939.1953125
35 60633.71484375
36 55828.6484375
37 51470.328125
38 47509.57421875
39 43902.44140625
40 40616.48828125
41 37614.8515625
42 34870.484375
43 32355.951171875
44 30049.904296875
45 27932.021484375
46 25984.826171875
47 24191.927734375
48 22539.240234375
49 21015.2734375
50 19606.923828125
51 18304.896484375
52 17100.1015625
53 15984.3896484375
54 14950.056640625
55 13990.5263671875
56 13099.861328125
57 12272.4052734375
58 11503.0126953125
59 10787.2685546875
60 10121.95410156
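
The tensor version above mirrors the NumPy version operation by operation (`dot` becomes `mm`, `np.maximum(h, 0)` becomes `clamp(min=0)`). As a rough illustration, the sketch below converts NumPy arrays to tensors and compares the two forward computations. The small shapes and the CPU fallback for `device` are assumptions, not part of the original cells.

```python
import numpy as np
import torch

# Fall back to the CPU when no GPU is present (an assumption for portability)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

a_np = np.random.randn(3, 4).astype(np.float32)
b_np = np.random.randn(4, 2).astype(np.float32)

# torch.from_numpy shares memory with the NumPy array on the CPU;
# .to(device) makes a copy only when the target device is different
a = torch.from_numpy(a_np).to(device)
b = torch.from_numpy(b_np).to(device)

out_np = np.maximum(a_np.dot(b_np), 0)  # NumPy: dot + maximum
out_t = a.mm(b).clamp(min=0)            # PyTorch: mm + clamp

# Move the result back to the CPU before converting it to a NumPy array
print(np.allclose(out_np, out_t.cpu().numpy(), atol=1e-5))
```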

## Autograd

### PyTorch: Tensors and autograd

Rather than manually implementing the forward and backward passes of the network, PyTorch can automatically compute the gradients of each layer with `autograd`. This defines a computational graph where nodes are Tensors and edges are functions that produce output tensors.

In [3]:
import torch

dtype = torch.float
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # fall back to CPU without a GPU

# N is batch size; D_in is the input dimension
# H is the hidden layer dimension, D_out is the output dimension (logits layer)
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random tensors to hold input and outputs
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random tensors for weights
# Setting requires_grad=True indicates that we want to compute the gradients with respect
# to these tensors during backprop
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors. These are exactly the same
    # operations we used to compute the forward pass using Tensors, but we don't need to keep
    # references to the intermediate values since we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    # Compute and print loss using operations on Tensors.
    # Now loss is a scalar (0-dimensional) Tensor
    # loss.item() gets the scalar value held in loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass. This call computes the gradient of loss with respect
    # to all Tensors with requires_grad=True. After this call, w1.grad and w2.grad will be Tensors
    # holding the gradient of the loss with respect to w1 and w2 respectively
    loss.backward()
    
    # Manually update weights using gradient descent. Wrap in torch.no_grad() because weights have 
    # requires_grad=True, but we don't want to track this in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data. Recall that tensor.data
    # gives a tensor that shares storage with the wrapping tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Manually zero gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 30836498.0
1 23285684.0
2 22551308.0
3 24468054.0
4 26212378.0
5 25061098.0
6 20345936.0
7 13748319.0
8 8098387.0
9 4443700.0
10 2479959.5
11 1491160.5
12 995446.4375
13 731553.3125
14 577177.8125
15 476576.84375
16 404513.5625
17 349067.875
18 304386.75
19 267359.90625
20 236147.953125
21 209529.046875
22 186636.015625
23 166809.015625
24 149546.03125
25 134444.8125
26 121192.109375
27 109507.984375
28 99185.6328125
29 90020.4609375
30 81869.5859375
31 74594.1875
32 68083.9140625
33 62243.9375
34 56975.53515625
35 52230.36328125
36 47945.23046875
37 44070.89453125
38 40561.2890625
39 37378.91796875
40 34484.41796875
41 31847.208984375
42 29440.57421875
43 27241.830078125
44 25230.01171875
45 23386.33203125
46 21694.64453125
47 20140.865234375
48 18712.3359375
49 17397.794921875
50 16185.8564453125
51 15068.4091796875
52 14036.712890625
53 13083.3359375
54 12201.8291015625
55 11385.8486328125
56 10629.9931640625
57 9929.669921875
58 9279.9404296875
59 8676.6484375
60 8116.2568359375


485 8.255086868302897e-05
486 8.11875652289018e-05
487 7.982595707289875e-05
488 7.837168232072145e-05
489 7.731218647677451e-05
490 7.595416536787525e-05
491 7.464420923497528e-05
492 7.350008672801778e-05
493 7.239739352371544e-05
494 7.132980681490153e-05
495 7.003985228948295e-05
496 6.908850627951324e-05
497 6.81946039549075e-05
498 6.715304334647954e-05
499 6.604880763916299e-05
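
To see that `loss.backward()` produces the same gradients as the hand-written backward pass from the Tensors section, and that `.grad` accumulates when it is not zeroed, here is a small sketch. The reduced sizes, the use of double precision, and the helper `forward_loss` are assumptions for a quick check rather than part of the tutorial.

```python
import torch

torch.manual_seed(0)
N, D_in, H, D_out = 8, 20, 10, 3  # small sizes, an assumption for a quick check
x = torch.randn(N, D_in, dtype=torch.double)
y = torch.randn(N, D_out, dtype=torch.double)
w1 = torch.randn(D_in, H, dtype=torch.double, requires_grad=True)
w2 = torch.randn(H, D_out, dtype=torch.double, requires_grad=True)

def forward_loss():
    """Hypothetical helper: forward pass that also returns the intermediates."""
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    return h, h_relu, y_pred, (y_pred - y).pow(2).sum()

# Autograd path
h, h_relu, y_pred, loss = forward_loss()
loss.backward()
first_grad_w1 = w1.grad.clone()

# Manual path, same formulas as the hand-written backward pass
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    manual_grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h = grad_y_pred.mm(w2.t())
    grad_h[h < 0] = 0
    manual_grad_w1 = x.t().mm(grad_h)

matches = torch.allclose(w1.grad, manual_grad_w1) and torch.allclose(w2.grad, manual_grad_w2)

# A second backward pass without zeroing adds to the existing .grad
forward_loss()[3].backward()
accumulated = torch.allclose(w1.grad, 2 * first_grad_w1)

print(matches, accumulated)
```

The second check is why the training loop calls `w1.grad.zero_()` and `w2.grad.zero_()` after every update.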


### Defining new autograd functions

Each primitive autograd operator is really just two functions that operate on tensors. The `forward` function computes output tensors from input tensors. The `backward` function receives the gradient of some scalar value (typically the loss) with respect to the output tensors and computes the gradient of that same scalar with respect to the input tensors.

To define a new autograd operator, define a class that inherits from `torch.autograd.Function` and override `forward` and `backward`.

In [4]:
import torch

class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by inheriting from torch.autograd.Function
    and implementing the forward and backward passes that operate on Tensors.
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass, we receive a Tensor containing the input. We return a Tensor containing 
        the output. ctx is a context object that can be used to stash information for backward
        computation. You can cache tensors for use in the backward pass using the
        ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss with respect to
        the output. We need to compute the gradient of the loss with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input
    
dtype = torch.float
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # fall back to CPU without a GPU

# You've seen this already
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors to hold weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our function we use Function.apply. We alias this as relu
    relu = MyReLU.apply
    
    # Forward pass: compute predicted y using operations. We compute ReLU using the custom autograd op
    y_pred = relu(x.mm(w1)).mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())
    
    # Use autograd to compute the backward pass
    loss.backward()
    
    # Update the weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        
        # Zero the weight gradients
        w1.grad.zero_()
        w2.grad.zero_()

0 35940260.0
1 30257930.0
2 25051292.0
3 18784676.0
4 12625178.0
5 7883827.0
6 4850064.5
7 3094057.0
8 2102410.75
9 1527359.125
10 1173415.25
11 939114.375
12 772933.8125
13 648176.6875
14 550732.9375
15 472421.25
16 408223.4375
17 354887.96875
18 310100.03125
19 272209.8125
20 239908.140625
21 212175.9375
22 188279.265625
23 167572.0
24 149562.34375
25 133835.96875
26 120062.34375
27 107959.515625
28 97285.046875
29 87846.0
30 79473.8984375
31 72030.3984375
32 65400.32421875
33 59481.625
34 54181.72265625
35 49425.828125
36 45153.97265625
37 41313.0234375
38 37850.4453125
39 34728.859375
40 31910.04296875
41 29354.35546875
42 27033.3671875
43 24927.578125
44 23007.884765625
45 21256.25390625
46 19656.484375
47 18193.4453125
48 16853.2265625
49 15624.943359375
50 14497.779296875
51 13462.3603515625
52 12509.8193359375
53 11634.6240234375
54 10828.9130859375
55 10085.4697265625
56 9398.7880859375
57 8764.8681640625
58 8179.51611328125
59 7638.2861328125
60 7136.73974609375
61 6672.11669

497 0.00010577704961178824
498 0.00010368641960667446
499 0.00010157062206417322
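
One way to gain confidence in a custom `backward` is `torch.autograd.gradcheck`, which compares it against finite differences. The sketch below repeats `MyReLU` so that it is self-contained. The input shape, the tolerances, and the step that pushes values away from zero (to avoid the ReLU kink) are assumptions chosen for illustration.

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

torch.manual_seed(0)

# gradcheck relies on finite differences, so it needs double precision inputs.
# Replace values close to 0 so that no entry sits on the ReLU kink.
values = torch.randn(6, 5, dtype=torch.double)
values = torch.where(values.abs() < 0.1, torch.full_like(values, 0.5), values)
inp = values.clone().requires_grad_(True)

# Returns True when analytic and numeric gradients agree; raises otherwise
ok = torch.autograd.gradcheck(MyReLU.apply, (inp,), eps=1e-6, atol=1e-4)

# The forward pass should also agree with the built-in ReLU
same_forward = torch.equal(MyReLU.apply(inp), torch.relu(inp))

print(ok, same_forward)
```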


## *nn* module

PyTorch's `torch.nn` package plays a role similar to `tf.layers`: it provides Modules (layers) that make it easy to add nodes to the computational graph. The `torch.nn` package also defines useful loss functions.

In [5]:
import torch

# N is batch size; D_in is the input dimension
# H is the hidden layer dimension, D_out is the output dimension (logits layer)
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define the model as a sequence of layers. nn.Sequential is a Module which
# contains other Modules, and applies them in sequence to produce its output. Each Linear module 
# computes output from input using a linear function, and holds tensors for its weight and bias
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# The nn package also contains definitions of popular loss functions. We'll use squared error
# summed over all elements (reduction='sum'), matching the loss in the earlier examples
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects override the 
    # __call__ operator so you can call them like functions. When doing so, you pass a Tensor of
    # input data to the Module and it produces a Tensor of output data
    y_pred = model(x)
    
    # Compute and print loss. We pass Tensors containing the predicted and true values of y, and the
    # loss function returns a Tensor containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Zero the gradients before running the backward pass
    model.zero_grad()
    
    # Backward pass: Compute gradient of the loss with respect to all the learnable parameters of the
    # model. Internally, the parameters of each Module are stored in Tensors with requires_grad=True,
    # so this call will compute gradients for all learnable parameters in the model.
    loss.backward()
    
    # Update the weights using gradient descent. Each parameter is a Tensor, so we can access its
    # gradients like we did before
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 695.4727172851562
1 647.4632568359375
2 605.6074829101562
3 568.6370849609375
4 535.7569580078125
5 506.2281799316406
6 479.1685485839844
7 454.1861267089844
8 430.9997863769531
9 409.4521484375
10 389.1055908203125
11 369.8600158691406
12 351.626953125
13 334.1884460449219
14 317.3907470703125
15 301.25439453125
16 285.81976318359375
17 271.0343933105469
18 256.8603515625
19 243.24668884277344
20 230.1355438232422
21 217.56488037109375
22 205.50726318359375
23 193.9586944580078
24 182.95654296875
25 172.4736785888672
26 162.4840545654297
27 152.9626922607422
28 143.9113006591797
29 135.31980895996094
30 127.1811294555664
31 119.48153686523438
32 112.19583129882812
33 105.30443572998047
34 98.80225372314453
35 92.67794799804688
36 86.91766357421875
37 81.50071716308594
38 76.40909576416016
39 71.62726593017578
40 67.13292694091797
41 62.92298126220703
42 58.97761535644531
43 55.28721618652344
44 51.83113098144531
45 48.605712890625
46 45.59799575805664
47 42.78690719604492
48 40.1581

369 0.0005843174294568598
370 0.0005695257568731904
371 0.000555141712538898
372 0.0005411094753071666
373 0.0005274590803310275
374 0.0005141463479958475
375 0.0005011826870031655
376 0.0004885507514700294
377 0.0004762552853208035
378 0.00046427789493463933
379 0.00045259814942255616
380 0.00044122946565039456
381 0.0004301483859308064
382 0.0004193551139906049
383 0.00040884161717258394
384 0.00039859834942035377
385 0.00038861524080857635
386 0.00037888935185037553
387 0.000369413843145594
388 0.00036018481478095055
389 0.00035118445521220565
390 0.00034241838147863746
391 0.0003338838287163526
392 0.00032556080259382725
393 0.0003174443554598838
394 0.00030954802059568465
395 0.0003018462157342583
396 0.0002943386207334697
397 0.0002870265452656895
398 0.00027989631053060293
399 0.00027296843472868204
400 0.00026620234712027013
401 0.00025960616767406464
402 0.0002531732607167214
403 0.00024690551799722016
404 0.00024079448485281318
405 0.00023483936092816293
406 0.000229041717830

### optim

Rather than manually mutating the Tensors like we were doing before (with `torch.no_grad()` or `.data`), we can use the `optim` package to optimize the network for us. This contains commonly used optimization algorithms like Adam:

In [6]:
import torch

# N is batch size; D_in is the input dimension
# H is the hidden layer dimension, D_out is the output dimension (logits layer)
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define the model and loss function
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of the model for us. Here
# we will use Adam; the first argument to its constructor tells the optimizer which tensors it should
# update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model
    y_pred = model(x)
    
    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())
    
    # Before the backward pass, we use the optimizer object to zero all the gradients for the variables
    # it will update (which are the learnable weights of the model). This is because by default, gradients
    # are accumulated in buffers (i.e. not overwritten) whenever .backward() is called. See
    # torch.autograd.backward for details
    optimizer.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()
    
    # Call the step function to update the weights
    optimizer.step()

0 648.9163818359375
1 631.83837890625
2 615.2686767578125
3 599.1947631835938
4 583.6558227539062
5 568.5576782226562
6 553.960693359375
7 539.774658203125
8 526.020263671875
9 512.67529296875
10 499.69085693359375
11 486.96502685546875
12 474.568603515625
13 462.5971984863281
14 451.06988525390625
15 439.9304504394531
16 429.0756530761719
17 418.49542236328125
18 408.1410827636719
19 398.0235900878906
20 388.2340393066406
21 378.7366943359375
22 369.5377197265625
23 360.5857849121094
24 351.8377380371094
25 343.2992858886719
26 334.9955139160156
27 326.85107421875
28 318.8672180175781
29 311.10833740234375
30 303.5417785644531
31 296.1343994140625
32 288.9024353027344
33 281.8221740722656
34 274.889892578125
35 268.08544921875
36 261.44342041015625
37 254.9441680908203
38 248.6047821044922
39 242.40557861328125
40 236.3610382080078
41 230.4492950439453
42 224.66514587402344
43 218.9867706298828
44 213.42337036132812
45 207.977783203125
46 202.64634704589844
47 197.4332733154297
48 192

395 9.963319462258369e-05
396 9.448753553442657e-05
397 8.959374827099964e-05
398 8.49496791488491e-05
399 8.052535122260451e-05
400 7.632836059201509e-05
401 7.233522046590224e-05
402 6.85444101691246e-05
403 6.494325498351827e-05
404 6.152776040835306e-05
405 5.828043504152447e-05
406 5.51992779946886e-05
407 5.227480869507417e-05
408 4.949486174155027e-05
409 4.6860066504450515e-05
410 4.4361015170579776e-05
411 4.199016620987095e-05
412 3.9740189095027745e-05
413 3.760205436265096e-05
414 3.557664967956953e-05
415 3.3656557206995785e-05
416 3.183513763360679e-05
417 3.0108330975053832e-05
418 2.8471213227021508e-05
419 2.6918436560663395e-05
420 2.54488259088248e-05
421 2.4053801098489203e-05
422 2.273525024065748e-05
423 2.1486308469320647e-05
424 2.0299970856285654e-05
425 1.9180020899511874e-05
426 1.811830406950321e-05
427 1.711215008981526e-05
428 1.616040026419796e-05
429 1.525912739452906e-05
430 1.4405316505872179e-05
431 1.3597912584373262e-05
432 1.2836369933211245e-05
43
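
To connect the optimizer to the manual updates from the `nn` example, the sketch below takes one step both ways and compares the results. It uses plain `torch.optim.SGD` instead of Adam, because SGD without momentum is the optimizer that corresponds to `param -= learning_rate * param.grad`. The tiny `Linear` model, the learning rate, and the names `model_a` and `model_b` are assumptions for illustration.

```python
import copy
import torch

torch.manual_seed(0)
x = torch.randn(16, 8)
y = torch.randn(16, 2)

model_a = torch.nn.Linear(8, 2)
model_b = copy.deepcopy(model_a)  # identical starting weights
lr = 1e-2
loss_fn = torch.nn.MSELoss(reduction="sum")

# (a) Manual update, as in the nn example
loss_fn(model_a(x), y).backward()
with torch.no_grad():
    for param in model_a.parameters():
        param -= lr * param.grad

# (b) The same step through torch.optim.SGD
optimizer = torch.optim.SGD(model_b.parameters(), lr=lr)
optimizer.zero_grad()
loss_fn(model_b(x), y).backward()
optimizer.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

Adam keeps per-parameter running statistics, so its step would not match the manual update in this way.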

### Custom `nn` modules

Define Modules by extending `nn.Module` and defining a `forward` function which receives input Tensors and produces output Tensors.

In [7]:
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        Constructor. We instantiate two nn.Linear modules and assign them as member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return a Tensor of
        output data. We can use Modules defined in the constructor as well as arbitrary operations
        on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred
    
# N is batch size; D_in is the input dimension
# H is the hidden layer dimension, D_out is the output dimension (logits layer)
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters() in the SGD constructor
# will contain the learnable parameters of the two nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model
    y_pred = model(x)
    
    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())
    
    # Zero gradients, perform a backward pass, and update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 738.3618774414062
1 683.9659423828125
2 637.17822265625
3 595.8290405273438
4 559.3675537109375
5 526.5848388671875
6 496.9197082519531
7 469.9239196777344
8 445.3011169433594
9 422.724365234375
10 401.7272644042969
11 381.91571044921875
12 363.0981750488281
13 345.1686096191406
14 328.0423889160156
15 311.5852966308594
16 295.90936279296875
17 280.8794860839844
18 266.4249267578125
19 252.5376434326172
20 239.16094970703125
21 226.3318634033203
22 214.01596069335938
23 202.22817993164062
24 190.95162963867188
25 180.17105102539062
26 169.85794067382812
27 160.02088928222656
28 150.6439666748047
29 141.66647338867188
30 133.13055419921875
31 125.04481506347656
32 117.34475708007812
33 110.00707244873047
34 103.06739807128906
35 96.52518463134766
36 90.36180114746094
37 84.55819702148438
38 79.10027313232422
39 73.96839141845703
40 69.15445709228516
41 64.62519836425781
42 60.39539337158203
43 56.443382263183594
44 52.74629211425781
45 49.29364776611328
46 46.06619644165039
47 43.0554

411 4.076766344951466e-05
412 3.975603976869024e-05
413 3.877066046698019e-05
414 3.7810699723195285e-05
415 3.6874898796668276e-05
416 3.596457463572733e-05
417 3.507782093947753e-05
418 3.4215670893900096e-05
419 3.337450834806077e-05
420 3.255679257563315e-05
421 3.175890014972538e-05
422 3.0983494070824236e-05
423 3.0227294701035134e-05
424 2.94909314106917e-05
425 2.877458609873429e-05
426 2.8073916837456636e-05
427 2.7395410143071786e-05
428 2.6732435799203813e-05
429 2.6087871447089128e-05
430 2.5457964511588216e-05
431 2.484417927917093e-05
432 2.4246564862551168e-05
433 2.36644918913953e-05
434 2.309819683432579e-05
435 2.254518221889157e-05
436 2.2007057850714773e-05
437 2.1482605006895028e-05
438 2.097141623380594e-05
439 2.0472869437071495e-05
440 1.998652624024544e-05
441 1.951189733517822e-05
442 1.9050879927817732e-05
443 1.860079646576196e-05
444 1.8162558262702078e-05
445 1.7734020730131306e-05
446 1.7316211597062647e-05
447 1.6910818885662593e-05
448 1.651387719903141
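
Assigning the `nn.Linear` modules as attributes in the constructor is what makes their weights visible to `model.parameters()`, and therefore to the optimizer. A brief sketch of how one might inspect this follows. It repeats a condensed `TwoLayerNet` to stay self-contained, and the variable names `names` and `n_params` are assumptions.

```python
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))

model = TwoLayerNet(1000, 100, 10)

# Attribute assignment registers each nn.Linear, so its tensors appear here
names = [name for name, _ in model.named_parameters()]
print(names)

# Total learnable scalars: (1000*100 + 100) + (100*10 + 10)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)

# The module maps a batch of shape (N, D_in) to (N, D_out)
out = model(torch.randn(64, 1000))
print(tuple(out.shape))
```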

### Control flow and weight sharing

We can use normal Python control flow to implement loops in the graph. We can also implement weight sharing among the innermost layers by reusing the same Module multiple times when defining the forward pass.

Here's a weird model that does this:

In [8]:
import random
import torch

class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        Constructor. Construct three nn.Linear instances that will be used in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3 and reuse the 
        middle_linear module that many times to compute hidden layer representations.
        
        Since each forward pass builds a dynamic computational graph, we can use normal Python
        control flow operators like loops or conditionals when defining the forward pass.
        
        Here we also see that it is perfectly safe to reuse the same Module many times when defining
        a computational graph. This is a big improvement from Lua Torch, where each Module could
        only be used once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred
    

# N is batch size; D_in is the input dimension
# H is the hidden layer dimension, D_out is the output dimension (logits layer)
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and Optimizer. Training this strange model with vanilla SGD is tough, so
# we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model
    y_pred = model(x)
    
    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())
    
    # Zero gradients, perform backprop, and update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 642.17919921875
1 640.4161987304688
2 639.5952758789062
3 638.3236694335938
4 640.6866455078125
5 638.5653076171875
6 597.3516235351562
7 633.267333984375
8 630.6689453125
9 613.4772338867188
10 401.72271728515625
11 597.6683959960938
12 624.4301147460938
13 629.1668090820312
14 272.4549255371094
15 627.5350952148438
16 214.43411254882812
17 181.974609375
18 532.6250610351562
19 622.6021728515625
20 601.3587036132812
21 481.0809020996094
22 613.3866577148438
23 607.6112670898438
24 74.40505981445312
25 380.3463134765625
26 71.57864379882812
27 517.030517578125
28 64.57430267333984
29 546.6593017578125
30 451.6687927246094
31 54.68474578857422
32 48.39772415161133
33 217.20941162109375
34 348.9156799316406
35 409.5720520019531
36 370.2767028808594
37 147.25975036621094
38 283.3663635253906
39 125.20189666748047
40 174.8333740234375
41 108.5503158569336
42 92.45303344726562
43 204.45726013183594
44 78.78553771972656
45 101.6755142211914
46 382.2397766113281
47 220.71310424804688
48 725

387 0.5822903513908386
388 0.9397277235984802
389 0.1907849907875061
390 0.8701563477516174
391 0.6360353231430054
392 0.16386426985263824
393 0.7555074095726013
394 0.1676652431488037
395 0.6443384885787964
396 0.5858300924301147
397 0.8209527134895325
398 0.6052820086479187
399 0.587907075881958
400 0.23791156709194183
401 0.556063175201416
402 0.15046794712543488
403 0.33305996656417847
404 0.5204373598098755
405 0.15607190132141113
406 0.9653676748275757
407 0.13412877917289734
408 0.9132421612739563
409 0.08174939453601837
410 0.5029069185256958
411 0.38275811076164246
412 0.6913235783576965
413 0.06438527256250381
414 0.6021411418914795
415 0.5526212453842163
416 0.5360802412033081
417 0.07649930566549301
418 0.07528127729892731
419 0.0684557780623436
420 0.4244372248649597
421 0.35472312569618225
422 0.28547126054763794
423 0.24720723927021027
424 1.0394366979599
425 1.0330954790115356
426 0.9411152601242065
427 0.18719656765460968
428 0.15176157653331757
429 0.4362735152244568
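
The noisy loss curve above reflects the depth changing from one iteration to the next. To look at weight sharing in isolation, the sketch below uses a hypothetical variant, `FixedDepthNet`, whose number of repeats is passed in rather than sampled. The class name, the `n_repeats` argument, and the small sizes are assumptions and not part of the tutorial.

```python
import torch

class FixedDepthNet(torch.nn.Module):
    """Hypothetical variant of DynamicNet whose depth is an argument, not random."""

    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x, n_repeats):
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(n_repeats):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        return self.output_linear(h_relu)

torch.manual_seed(0)
model = FixedDepthNet(20, 10, 3)
x = torch.randn(8, 20)

# Three layers, each with a weight and a bias, regardless of how deep the forward pass is
n_tensors = len(list(model.parameters()))

# With zero repeats the middle layer never enters the graph, so it gets no gradient
model(x, n_repeats=0).sum().backward()
grad_when_unused = model.middle_linear.weight.grad

# With three repeats the single weight tensor receives the summed gradient of all three uses
model.zero_grad()
model(x, n_repeats=3).sum().backward()
grad_when_reused = model.middle_linear.weight.grad

print(n_tensors, grad_when_unused is None, tuple(grad_when_reused.shape))
```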
