# Learning PyTorch with examples

This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.

At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to a numpy array but able to run on GPUs
- Automatic differentiation for building and training neural networks

We will use a fully-connected ReLU network as our running example. The network will have a single hidden layer, and will be trained with gradient descent to fit random data by minimizing the squared Euclidean distance between the network output and the true output.
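
For reference, the network and loss can be written compactly. The notation here is introduced only for illustration: $x$ is the $N \times D_{in}$ input, $y$ the $N \times D_{out}$ target, and $W_1$, $W_2$ the two weight matrices.

$$
h = x W_1, \qquad \hat{y} = \max(h, 0)\, W_2, \qquad L = \sum_{i,j} \left(\hat{y}_{ij} - y_{ij}\right)^2
$$

Applying the chain rule gives the gradients that a hand-written backward pass has to compute, where $\odot$ is the elementwise product and $\mathbf{1}[\cdot]$ an elementwise indicator:

$$
\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y), \qquad
\frac{\partial L}{\partial W_2} = \max(h, 0)^{\top} \frac{\partial L}{\partial \hat{y}}, \qquad
\frac{\partial L}{\partial h} = \left(\frac{\partial L}{\partial \hat{y}}\, W_2^{\top}\right) \odot \mathbf{1}[h \ge 0], \qquad
\frac{\partial L}{\partial W_1} = x^{\top} \frac{\partial L}{\partial h}
$$

ReLU is not differentiable at exactly $h = 0$; the indicator above follows the convention of zeroing the gradient only where $h < 0$.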

## Tensors

### Warm-up: numpy

Before introducing PyTorch, we will first implement the network using numpy.

Numpy provides an n-dimensional array object, and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it does not know anything about computation graphs, or deep learning, or gradients. However, we can easily use numpy to fit a two-layer network to random data by manually implementing the forward and backward passes through the network using numpy operations:

In [1]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2


0 31904558.273768704
1 27458056.510597058
2 28659615.697684698
3 30509188.69895033
4 29056708.6015617
5 23050441.78139724
6 14929079.271873709
7 8303265.897260379
8 4356212.717270631
9 2386305.399995669
10 1453737.2579098148
11 998906.1420676624
12 755293.4896400817
13 607372.1299255558
14 506369.89825605566
15 431013.3547186042
16 371538.2975457551
17 322986.92522202095
18 282629.4555749806
19 248600.74934837501
20 219633.59328546416
21 194800.26539027234
22 173379.99785421294
23 154822.57512609242
24 138676.03139126397
25 124585.32017900872
26 112213.78366432292
27 101318.92697814289
28 91690.18832003303
29 83165.3884144063
30 75592.71763465772
31 68851.32105494112
32 62828.831225532165
33 57439.569229995264
34 52603.38472602863
35 48250.08313753866
36 44323.26091503976
37 40772.81077820271
38 37558.14516825001
39 34642.739114181575
40 31993.198062075007
41 29583.1473882203
42 27384.91575088087
43 25377.410635742606
44 23542.48229879047
45 21861.625825845695
46 20320.545906798485
47 

418 0.0024822605817051307
419 0.002391362331875211
420 0.002303761739939434
421 0.002219397609394441
422 0.0021381265486385545
423 0.00205990089839045
424 0.0019844624603485416
425 0.0019118219284058663
426 0.0018418413344088852
427 0.001774496972548346
428 0.001709570270696953
429 0.0016470013854836159
430 0.0015867511623980292
431 0.001528746801503434
432 0.001472830080335761
433 0.0014189713844271017
434 0.001367088765503289
435 0.0013171406533977587
436 0.0012689967799123362
437 0.0012225979721278962
438 0.0011779076481475497
439 0.001134889240342424
440 0.0010934478581681292
441 0.0010534919116137018
442 0.001014996010426055
443 0.0009779322989571467
444 0.0009422375385694027
445 0.0009078246280428596
446 0.0008746780721277431
447 0.0008427477649082121
448 0.0008120019158210154
449 0.0007823588987364049
450 0.0007538001625675178
451 0.0007262883157038735
452 0.000699813014635029
453 0.0006742796831714325
454 0.0006496782163388606
455 0.0006259766585314582
456 0.0006031601626991441
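
A hand-written backward pass like the one above is easy to get subtly wrong, and one common sanity check is to compare it against a finite-difference estimate. The following is a minimal sketch, not part of the original example: the helper names `forward_loss` and `manual_grads`, the small sizes, and the choice of entry `(i, j)` are all assumptions made for illustration.

```python
import numpy as np


def forward_loss(x, y, w1, w2):
    """Sum-of-squares loss of the two-layer ReLU network."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()


def manual_grads(x, y, w1, w2):
    """Analytic gradients, using the same formulas as the training loop."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return grad_w1, grad_w2


# Hypothetical small sizes so the check runs instantly
rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

grad_w1, grad_w2 = manual_grads(x, y, w1, w2)

# Central finite difference for a single entry of w1
eps = 1e-6
i, j = 1, 2
w1_plus, w1_minus = w1.copy(), w1.copy()
w1_plus[i, j] += eps
w1_minus[i, j] -= eps
numeric = (forward_loss(x, y, w1_plus, w2)
           - forward_loss(x, y, w1_minus, w2)) / (2 * eps)

analytic = grad_w1[i, j]
rel_err = abs(numeric - analytic) / max(abs(numeric), abs(analytic), 1e-12)
print("analytic:", analytic, "numeric:", numeric, "relative error:", rel_err)
```

A small relative error suggests the analytic formulas are consistent with the forward pass. The check can fail spuriously if a hidden activation sits almost exactly at zero, where ReLU has a kink.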

### PyTorch: Tensors

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so unfortunately numpy won’t be enough for modern deep learning.

Here we introduce the most fundamental PyTorch concept: the **Tensor**. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Behind the scenes, Tensors can keep track of a computational graph and gradients, but they’re also useful as a generic tool for scientific computing.
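
To make the similarity with numpy concrete, here is a small sketch that runs the same computation in both libraries and converts between them. The array values are arbitrary and chosen only for illustration.

```python
import numpy as np
import torch

a = np.arange(6, dtype=np.float32).reshape(2, 3)

# torch.from_numpy creates a CPU Tensor that shares memory with the array
t = torch.from_numpy(a)

# The same computation expressed in numpy and in PyTorch
np_result = np.maximum(a - 2.0, 0).dot(a.T)
torch_result = (t - 2.0).clamp(min=0).mm(t.t())

print(np_result)
print(torch_result.numpy())  # Tensor.numpy() converts a CPU Tensor back

# Because memory is shared, a change to the array is visible in the Tensor
a[0, 0] = 10.0
print(t[0, 0].item())
```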

Unlike numpy, PyTorch Tensors can also utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you simply need to place it on the right device, for example by passing a `device` argument when creating it.
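
As a minimal sketch of device placement, the snippet below picks a GPU when one is available and falls back to the CPU otherwise. The shapes are arbitrary; the fallback logic is an addition for illustration and is not used in the examples of this tutorial.

```python
import torch

# Fall back to the CPU when no CUDA device is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, 4, device=device)  # created directly on the device
w = torch.randn(4, 2).to(device)      # created on the CPU, then moved

y = x.mm(w)  # runs on whichever device holds x and w
print(y.device, y.shape)
```

Both operands of an operation must live on the same device, which is why `w` is moved before the matrix multiply.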

Here we use PyTorch Tensors to fit a two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network:

In [2]:
# -*- coding: utf-8 -*-

import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    print(t, loss)
    
    # Backprop to compute gradients of the loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

0 29010150.0
1 23270126.0
2 19758636.0
3 16258831.0
4 12497780.0
5 8976923.0
6 6144399.0
7 4125738.75
8 2785957.75
9 1931697.5
10 1389643.5
11 1040844.125
12 809240.1875
13 649302.0
14 534393.8125
15 448460.09375
16 382009.15625
17 329070.0625
18 285992.03125
19 250374.78125
20 220495.828125
21 195147.1875
22 173431.1875
23 154700.171875
24 138437.015625
25 124246.6796875
26 111817.1015625
27 100889.984375
28 91251.78125
29 82719.25
30 75140.796875
31 68390.9453125
32 62361.91796875
33 56962.62109375
34 52116.50390625
35 47761.8984375
36 43840.47265625
37 40297.58984375
38 37090.015625
39 34182.06640625
40 31542.072265625
41 29140.162109375
42 26950.765625
43 24952.154296875
44 23125.44921875
45 21454.857421875
46 19925.39453125
47 18522.673828125
48 17233.353515625
49 16047.4140625
50 14956.5537109375
51 13950.73046875
52 13023.1962890625
53 12165.7607421875
54 11372.73828125
55 10639.1982421875
56 9959.7109375
57 9329.806640625
58 8745.5263671875
59 8203.185546875
60 7699.09375
61 72

410 0.01095059048384428
411 0.01059861108660698
412 0.010264065116643906
413 0.009934909641742706
414 0.009616440162062645
415 0.009313927963376045
416 0.00902816653251648
417 0.008737113326787949
418 0.00846845656633377
419 0.008196943439543247
420 0.007946490310132504
421 0.007692595012485981
422 0.007454338483512402
423 0.007223699241876602
424 0.007003107573837042
425 0.006784866563975811
426 0.006572894286364317
427 0.006371772848069668
428 0.006174057722091675
429 0.005987757351249456
430 0.005801375024020672
431 0.005630853585898876
432 0.005453658755868673
433 0.00528839323669672
434 0.005131290294229984
435 0.004973003640770912
436 0.004824982490390539
437 0.004672589246183634
438 0.004532298073172569
439 0.004394750110805035
440 0.004266942385584116
441 0.004139735363423824
442 0.0040180878713727
443 0.003900339361280203
444 0.0037882255855947733
445 0.0036744847893714905
446 0.003563636913895607
447 0.0034590745344758034
448 0.003358706133440137
449 0.0032605056185275316
450

## Autograd

### PyTorch: Tensors and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but can quickly get very hairy for large complex networks.

Thankfully, we can use [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) to automate the computation of backward passes in neural networks. The **autograd** package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a **computational graph**; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.

This sounds complicated, but it’s pretty simple to use in practice. Each Tensor represents a node in a computational graph. If `x` is a Tensor that has `x.requires_grad=True` then, after a backward pass, `x.grad` is another Tensor holding the gradient of some scalar value with respect to `x`.
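
A tiny sketch of this behavior, using arbitrary values chosen for illustration:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

s = (x ** 2).sum()  # scalar value: 1 + 4 + 9
s.backward()        # populates x.grad with ds/dx

print(x.grad)       # d(sum x_i^2)/dx_i = 2 * x_i, so tensor([2., 4., 6.])
```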

Here we use PyTorch Tensors and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network:

In [3]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# requires_grad defaults to False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (shape ());
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 32560994.0
1 26115498.0
2 22565532.0
3 19050326.0
4 14761971.0
5 10566616.0
6 7056750.5
7 4601296.0
8 3016194.0
9 2047232.75
10 1453048.25
11 1081546.5
12 839296.5
13 673421.5625
14 553928.4375
15 464050.125
16 393943.59375
17 337856.9375
18 292030.15625
19 253986.234375
20 222025.609375
21 194929.90625
22 171778.515625
23 151906.109375
24 134740.9375
25 119873.9921875
26 106943.640625
27 95644.515625
28 85743.1875
29 77056.2109375
30 69402.1015625
31 62625.30078125
32 56611.09765625
33 51269.1328125
34 46511.31640625
35 42261.76953125
36 38455.9921875
37 35038.0546875
38 31965.470703125
39 29197.294921875
40 26701.716796875
41 24445.912109375
42 22404.341796875
43 20555.01171875
44 18876.587890625
45 17351.853515625
46 15964.0517578125
47 14699.82421875
48 13546.3955078125
49 12493.693359375
50 11531.6162109375
51 10651.3544921875
52 9844.8427734375
53 9105.5
54 8427.0849609375
55 7804.015625
56 7231.50927734375
57 6705.07763671875
58 6220.4580078125
59 5773.84033203125
60 5362.4536

467 5.2656403568107635e-05
468 5.176331615075469e-05
469 5.091911225463264e-05
470 5.030167449149303e-05
471 4.9849157221615314e-05
472 4.912182703264989e-05
473 4.8449499445268884e-05
474 4.7850815462879837e-05
475 4.725263352156617e-05
476 4.632650961866602e-05
477 4.576009450829588e-05
478 4.5024145947536454e-05
479 4.4477801566245034e-05
480 4.397309385240078e-05
481 4.3123200157424435e-05
482 4.266612450010143e-05
483 4.209293911117129e-05
484 4.173759225523099e-05
485 4.109852670808323e-05
486 4.034671655972488e-05
487 3.983325223089196e-05
488 3.919904338545166e-05
489 3.8799498724984005e-05
490 3.8093454350018874e-05
491 3.775874574785121e-05
492 3.7265617720549926e-05
493 3.685321644297801e-05
494 3.647111589089036e-05
495 3.588343315641396e-05
496 3.557813397492282e-05
497 3.522877159412019e-05
498 3.4693028283072636e-05
499 3.434105383348651e-05
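
The training loop above zeroes `w1.grad` and `w2.grad` after each update because `backward()` adds to an existing `.grad` rather than overwriting it. The sketch below illustrates this with a toy function; the name `toy_loss` and the values are hypothetical.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)


def toy_loss(w):
    # Gradient of this loss is 3 for every element of w
    return (3.0 * w).sum()


toy_loss(w).backward()
first = w.grad.clone()   # gradient after one backward pass

toy_loss(w).backward()   # second backward pass adds to the existing .grad
second = w.grad.clone()

w.grad.zero_()           # reset in place, as in the training loop
print(first, second, w.grad)
```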


### PyTorch: Defining new autograd functions

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The **forward** function computes output Tensors from input Tensors. The **backward** function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In PyTorch we can easily define our own autograd operator by defining a subclass of `torch.autograd.Function` and implementing the `forward` and `backward` functions. We can then use our new autograd operator by calling its `apply` method like a function, passing Tensors containing input data.

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:

In [4]:
# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. Tensors needed in the
        backward pass can be cached using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 23783482.0
1 18451350.0
2 16709158.0
3 16114213.0
4 15456091.0
5 13938363.0
6 11612843.0
7 8848814.0
8 6300161.5
9 4271537.0
10 2853499.5
11 1916279.75
12 1322439.5
13 945880.0
14 705120.625
15 546350.875
16 437900.90625
17 360760.4375
18 303635.375
19 259700.390625
20 224871.015625
21 196496.0
22 172919.4375
23 153129.875
24 136238.046875
25 121660.078125
26 108997.203125
27 97931.171875
28 88214.2890625
29 79656.15625
30 72074.4375
31 65336.87109375
32 59337.0390625
33 53976.4296875
34 49175.55859375
35 44861.0234375
36 40980.3828125
37 37483.78125
38 34329.4140625
39 31477.669921875
40 28896.50390625
41 26554.607421875
42 24429.662109375
43 22500.6328125
44 20741.501953125
45 19139.236328125
46 17676.37890625
47 16339.3857421875
48 15116.130859375
49 13995.5947265625
50 12967.5986328125
51 12023.513671875
52 11156.3125
53 10358.248046875
54 9623.3037109375
55 8946.4775390625
56 8322.8232421875
57 7747.30517578125
58 7217.0537109375
59 6726.939453125
60 6273.48974609375
61 5853.744

405 0.0008687405497767031
406 0.0008424748084507883
407 0.0008180683362297714
408 0.0007952051237225533
409 0.0007719768909737468
410 0.0007499484927393496
411 0.0007291445508599281
412 0.0007076282636262476
413 0.0006881162407808006
414 0.0006681828526780009
415 0.0006489718216471374
416 0.0006320816464722157
417 0.0006148168467916548
418 0.0005979146808385849
419 0.0005811958108097315
420 0.0005655891727656126
421 0.0005514373769983649
422 0.0005372525192797184
423 0.0005230553215369582
424 0.0005087594036012888
425 0.0004953014431521297
426 0.00048300440539605916
427 0.0004710286157205701
428 0.00046008580829948187
429 0.00044865437666885555
430 0.0004365773929748684
431 0.0004255292296875268
432 0.0004146906139794737
433 0.0004044650704599917
434 0.00039415867649950087
435 0.00038422984653152525
436 0.0003752456104848534
437 0.0003668786375783384
438 0.0003575477167032659
439 0.00034933650749735534
440 0.0003403648152016103
441 0.0003328577149659395
442 0.00032475919579155743
443 0
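
When writing a custom `backward`, it can help to verify it numerically. PyTorch ships `torch.autograd.gradcheck` for this purpose. The sketch below repeats the `MyReLU` definition so that it is self-contained; the input sizes, the tolerance values, and the step that nudges entries away from zero are assumptions made for illustration.

```python
import torch


class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# gradcheck compares analytic gradients with finite differences; it expects
# double-precision inputs that have requires_grad=True.
torch.manual_seed(0)
x = torch.randn(4, 5, dtype=torch.double)
# Keep entries away from 0, where ReLU is not differentiable
x = torch.where(x.abs() < 0.1, torch.full_like(x, 0.5), x)
x.requires_grad_(True)

ok = torch.autograd.gradcheck(MyReLU.apply, (x,), eps=1e-6, atol=1e-4)
print("gradcheck passed:", ok)
```

`gradcheck` raises an error describing the mismatch if the analytic and numerical gradients disagree, so a `True` result indicates agreement within the stated tolerance.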

### TensorFlow: Static Graphs

PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph, and use automatic differentiation to compute gradients. The biggest difference between the two is that the TensorFlow 1.x graph API used here builds **static** computational graphs, while PyTorch uses **dynamic** computational graphs.

In TensorFlow, we define the computational graph once and then execute the same graph over and over again, possibly feeding different input data to the graph. In PyTorch, each forward pass defines a new computational graph.

Static graphs are nice because you can optimize the graph up front; for example a framework might decide to fuse some graph operations for efficiency, or to come up with a strategy for distributing the graph across many GPUs or many machines. If you are reusing the same graph over and over, then this potentially costly up-front optimization can be amortized as the same graph is rerun over and over.

One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computation for each data point; for example a recurrent network might be unrolled for different numbers of time steps for each data point; this unrolling can be implemented as a loop. With a static graph the loop construct needs to be a part of the graph; for this reason TensorFlow provides operators such as `tf.scan` for embedding loops into the graph. With dynamic graphs the situation is simpler: since we build graphs on-the-fly for each example, we can use normal imperative flow control to perform computation that differs for each input.
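
As a sketch of the dynamic-graph side of this comparison, the PyTorch snippet below unrolls a toy recurrence for a different number of steps per example using an ordinary Python loop. The function name `unrolled`, the `tanh` recurrence, and the data are all hypothetical and chosen only for illustration.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, 3, requires_grad=True)


def unrolled(x, n_steps):
    # Ordinary Python loop: the graph is rebuilt on every call, so its depth
    # can differ from one input to the next.
    h = x
    for _ in range(n_steps):
        h = torch.tanh(h.mm(w))
    return h


# Hypothetical data: each example is unrolled for a different number of steps
examples = [(torch.randn(1, 3), 1), (torch.randn(1, 3), 4)]

loss = sum(unrolled(x, n).pow(2).sum() for x, n in examples)
loss.backward()  # gradients flow through however many steps each example used
print(w.grad.shape)
```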

To contrast with the PyTorch autograd example above, here we use TensorFlow to fit a simple two-layer net:

In [6]:
# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())
    
    # Create numpy arrays holding the actual data for the inputs x and the targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)

34042490.0
33414858.0
39484984.0
44142184.0
39686852.0
25715752.0
12407913.0
5252548.0
2518745.0
1529319.5
1116898.2
895766.9
748267.06
636615.7
546999.25
473217.78
411647.94
359747.38
315684.8
278061.0
245770.69
217930.69
193863.94
172929.69
154659.16
138651.78
124587.11
112185.164
101209.18
91478.01
82823.75
75104.37
68210.375
62038.33
56512.316
51550.08
47083.742
43057.21
39421.44
36133.543
33166.47
30475.164
28030.162
25807.793
23783.59
21938.344
20254.613
18715.963
17308.43
16018.992
14837.803
13754.505
12760.052
11846.704
11006.573
10233.787
9522.3545
8866.203
8260.07
7700.0557
7182.177
6703.044
6259.4624
5848.254
5466.907
5113.1772
4784.6743
4479.5376
4195.725
3931.7573
3686.1433
3457.3716
3244.5442
3046.2527
2861.3464
2688.717
2527.5493
2376.9158
2236.1028
2104.353
1981.1135
1865.7935
1757.7662
1656.5343
1561.625
1472.7797
1389.3734
1311.0763
1237.5592
1168.536
1103.6743
1042.6847
985.33655
931.3985
880.6627
832.88916
787.8968
745.5228
705.5926
667.96936
632.4801
599.0051
567.4

## _nn_ module

### nn

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into **layers**, some of which have **learnable parameters** which will be optimized during learning.

In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the `nn` package serves this same purpose. The `nn` package defines a set of **Modules**, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The `nn` package also defines a set of useful loss functions that are commonly used when training neural networks.
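
To see the internal state a Module holds, the sketch below builds a small `nn.Sequential` and lists its learnable parameters. The layer sizes are hypothetical and chosen only to keep the printout short.

```python
import torch

# Hypothetical small sizes, used only for illustration
model = torch.nn.Sequential(
    torch.nn.Linear(4, 3),
    torch.nn.ReLU(),
    torch.nn.Linear(3, 2),
)

# Each Linear holds a weight of shape (out_features, in_features) and a bias;
# ReLU has no learnable parameters, so it does not appear in the listing.
for name, param in model.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

out = model(torch.randn(5, 4))  # batch of 5 inputs with 4 features each
print(out.shape)
```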

In this example we use the `nn` package to implement our two-layer network:

In [7]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. With
# reduction='sum' it returns the summed (not averaged) squared error.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 686.1702880859375
1 637.1605224609375
2 594.3972778320312
3 556.7274780273438
4 522.976806640625
5 492.54046630859375
6 464.6207580566406
7 438.896484375
8 415.0803527832031
9 392.8974304199219
10 372.1849060058594
11 352.6351623535156
12 334.1941833496094
13 316.69732666015625
14 300.1248779296875
15 284.3031311035156
16 269.2139892578125
17 254.7942657470703
18 240.98353576660156
19 227.86688232421875
20 215.39654541015625
21 203.44961547851562
22 191.9687042236328
23 181.0736846923828
24 170.71607971191406
25 160.85545349121094
26 151.48504638671875
27 142.59130859375
28 134.1652374267578
29 126.210693359375
30 118.6884536743164
31 111.5748519897461
32 104.85710906982422
33 98.53960418701172
34 92.60003662109375
35 87.00420379638672
36 81.7396469116211
37 76.78860473632812
38 72.14871978759766
39 67.79293823242188
40 63.70716094970703
41 59.876365661621094
42 56.28169631958008
43 52.910770416259766
44 49.75209045410156
45 46.79035568237305
46 44.018306732177734
47 41.4205818176269

350 0.00036304580862633884
351 0.0003529275127220899
352 0.00034309274633415043
353 0.0003335464862175286
354 0.00032424545497633517
355 0.0003152234712615609
356 0.0003064509655814618
357 0.0002979366108775139
358 0.0002896562800742686
359 0.00028162181843072176
360 0.0002737968461588025
361 0.0002662156766746193
362 0.00025883325724862516
363 0.00025165939587168396
364 0.00024469656636938453
365 0.0002379190846113488
366 0.00023133496870286763
367 0.00022494270524475724
368 0.00021872378420084715
369 0.00021268836280796677
370 0.00020681452588178217
371 0.00020110781770199537
372 0.00019557401537895203
373 0.00019019194587599486
374 0.00018495651602279395
375 0.000179859678610228
376 0.00017491787730250508
377 0.00017009658040478826
378 0.00016542206867597997
379 0.00016087568656075746
380 0.00015645848179701716
381 0.0001521601661806926
382 0.00014798520714975893
383 0.00014392727462109178
384 0.00013997184578329325
385 0.00013613810006063432
386 0.00013241141277831048
387 0.0001287
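
The example above constructs `MSELoss` with `reduction='sum'`, which makes it match the hand-written sum-of-squares loss from the earlier sections. The sketch below compares the two reductions on arbitrary random data.

```python
import torch

torch.manual_seed(0)
y_pred = torch.randn(4, 3)
y = torch.randn(4, 3)

sum_loss = torch.nn.MSELoss(reduction='sum')(y_pred, y)
mean_loss = torch.nn.MSELoss(reduction='mean')(y_pred, y)
manual = (y_pred - y).pow(2).sum()  # loss as written in the autograd examples

# The summed loss equals the manual loss, and equals the mean loss scaled by
# the number of elements.
print(sum_loss.item(), manual.item(), (mean_loss * y.numel()).item())
```

Because the two reductions differ by a constant factor, switching between them rescales the gradients, so a learning rate tuned for one generally needs adjusting for the other.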

### optim

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (with `torch.no_grad()` or `.data` to avoid tracking history in autograd). This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like AdaGrad, RMSProp, Adam, etc.

The `optim` package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.

In this example we will use the `nn` package to define our model as before, but we will optimize the model using the Adam algorithm provided by the `optim` package:

In [8]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 666.195068359375
1 649.03466796875
2 632.3576049804688
3 616.0852661132812
4 600.2734985351562
5 584.9382934570312
6 569.970458984375
7 555.4872436523438
8 541.5020141601562
9 527.9505004882812
10 514.8280639648438
11 502.0240478515625
12 489.5912170410156
13 477.5303955078125
14 465.7814025878906
15 454.3542785644531
16 443.2380676269531
17 432.474365234375
18 422.0018310546875
19 411.82568359375
20 401.9736633300781
21 392.3719177246094
22 383.0407409667969
23 373.940185546875
24 365.0621032714844
25 356.438232421875
26 348.06048583984375
27 339.8863525390625
28 331.8731689453125
29 324.0473937988281
30 316.4313659667969
31 309.0072326660156
32 301.79583740234375
33 294.76177978515625
34 287.8707580566406
35 281.1431884765625
36 274.61688232421875
37 268.2266845703125
38 261.97528076171875
39 255.8778533935547
40 249.90737915039062
41 244.0784912109375
42 238.3941650390625
43 232.8209686279297
44 227.35824584960938
45 222.00559997558594
46 216.7608184814453
47 211.6373291015625
48 

363 5.706300726160407e-06
364 5.2580635383492336e-06
365 4.8443826017319225e-06
366 4.460722720978083e-06
367 4.108074790565297e-06
368 3.78235836251406e-06
369 3.481730345811229e-06
370 3.2040443329606205e-06
371 2.9484187962225405e-06
372 2.7127712201036047e-06
373 2.495088210707763e-06
374 2.2947851903154515e-06
375 2.110720060954918e-06
376 1.939527010108577e-06
377 1.7826429257183918e-06
378 1.6382737157982774e-06
379 1.5052503385959426e-06
380 1.381607830808207e-06
381 1.2692230484390166e-06
382 1.1650635087789851e-06
383 1.0691917395888595e-06
384 9.80907202574599e-07
385 8.997641884889163e-07
386 8.256425871877582e-07
387 7.570530442535528e-07
388 6.937463012945955e-07
389 6.361686359923624e-07
390 5.829764972986595e-07
391 5.342009217201849e-07
392 4.887613158643944e-07
393 4.4789021558244713e-07
394 4.098895942661329e-07
395 3.75185720713489e-07
396 3.4314788877054525e-07
397 3.13585474032152e-07
398 2.8686190489679575e-07
399 2.6229176341985294e-07
400 2.396782008418086e-07
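
The comment above `optimizer.zero_grad()` notes that gradients accumulate across calls to `.backward()`. A small sketch of that behavior follows, using a single hypothetical parameter `w` and a toy loss chosen only so that the gradient is easy to read off:

```python
import torch

# A single illustrative parameter; the loss sum(2 * w) has gradient 2 per element.
w = torch.ones(3, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(2 * w).sum().backward()
first_grad = w.grad.clone()        # tensor([2., 2., 2.])

# A second backward pass without zeroing adds to the existing gradient.
(2 * w).sum().backward()
accumulated_grad = w.grad.clone()  # tensor([4., 4., 4.])

# Clearing the gradients first gives a fresh gradient again.
optimizer.zero_grad()
(2 * w).sum().backward()
fresh_grad = w.grad.clone()        # tensor([2., 2., 2.])

print(first_grad, accumulated_grad, fresh_grad)
```

Without the call to `zero_grad()` in the training loop, each step would be computed from the sum of all gradients seen so far rather than from the current one.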


### Custom nn Modules

Sometimes you will want to specify models that are more complex than a sequence of existing Modules. In these cases you can define your own Modules by subclassing `nn.Module` and defining a `forward` method that receives input Tensors and produces output Tensors using other modules or other autograd operations on Tensors.

In this example we implement our two-layer network as a custom Module subclass:

In [9]:
# -*- coding: utf-8 -*-
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor returns the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 677.1204833984375
1 627.5800170898438
2 584.70458984375
3 546.996826171875
4 513.6954345703125
5 483.6799621582031
6 456.2716979980469
7 430.9631652832031
8 407.5250244140625
9 385.82891845703125
10 365.41064453125
11 346.1640319824219
12 328.0512390136719
13 310.8643493652344
14 294.64691162109375
15 279.2012023925781
16 264.5472717285156
17 250.5754852294922
18 237.3550567626953
19 224.77163696289062
20 212.78404235839844
21 201.35031127929688
22 190.44285583496094
23 180.0918731689453
24 170.2339630126953
25 160.8388671875
26 151.90481567382812
27 143.41217041015625
28 135.35400390625
29 127.73361206054688
30 120.50454711914062
31 113.6422348022461
32 107.15384674072266
33 101.03302764892578
34 95.2628173828125
35 89.80828094482422
36 84.65664672851562
37 79.807861328125
38 75.23103332519531
39 70.90269470214844
40 66.82320404052734
41 62.977813720703125
42 59.3514518737793
43 55.94085693359375
44 52.734619140625
45 49.70973205566406
46 46.86132049560547
47 44.18012619018555
48 41

361 0.00023079865786712617
362 0.00022398964210879058
363 0.00021738786017522216
364 0.00021097110584378242
365 0.00020475636119954288
366 0.0001987271971302107
367 0.00019288226030766964
368 0.000187215133337304
369 0.00018171391275245696
370 0.0001763798063620925
371 0.00017120325355790555
372 0.00016618016525171697
373 0.00016131580923683941
374 0.00015660123608540744
375 0.00015202366921585053
376 0.0001475870085414499
377 0.00014327923418022692
378 0.00013909704284742475
379 0.0001350359380012378
380 0.00013109784049447626
381 0.00012728168803732842
382 0.00012357428204268217
383 0.00011997651745332405
384 0.00011648832878563553
385 0.00011310844274703413
386 0.00010982291132677346
387 0.00010663005377864465
388 0.00010353975085308775
389 0.00010054553422378376
390 9.763370326254517e-05
391 9.480764128966257e-05
392 9.206365211866796e-05
393 8.940134284785017e-05
394 8.681746112415567e-05
395 8.431308378931135e-05
396 8.188108040485531e-05
397 7.952805026434362e-05
398 7.723410817
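
The reason `model.parameters()` finds the weights of both layers is that assigning an `nn.Linear` as an attribute of a Module registers it as a submodule. The sketch below lists the registered parameters of a small network with the same structure as the one above; the class name `TinyNet` and the variable `param_shapes` are hypothetical names used only for this illustration.

```python
import torch


class TinyNet(torch.nn.Module):  # hypothetical name, same structure as TwoLayerNet
    def __init__(self, d_in, h, d_out):
        super().__init__()
        # Assigning Modules as attributes registers their parameters.
        self.linear1 = torch.nn.Linear(d_in, h)
        self.linear2 = torch.nn.Linear(h, d_out)

    def forward(self, x):
        return self.linear2(self.linear1(x).clamp(min=0))


model = TinyNet(1000, 100, 10)

# named_parameters() yields (name, parameter) pairs for all registered submodules.
param_shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)
# linear1.weight (100, 1000)
# linear1.bias (100,)
# linear2.weight (10, 100)
# linear2.bias (10,)
```

Note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, which is why the first weight is `(100, 1000)` rather than `(1000, 100)`.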

### Control Flow + Weight Sharing

As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.

We can easily implement this model as a Module subclass:

In [10]:
# -*- coding: utf-8 -*-
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super().__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement over Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 661.4137573242188
1 661.4899291992188
2 659.8898315429688
3 658.6165771484375
4 655.7174072265625
5 655.4534912109375
6 671.4595947265625
7 652.1492309570312
8 645.5587158203125
9 550.0386352539062
10 491.2261657714844
11 647.6262817382812
12 639.079345703125
13 637.3631591796875
14 296.0633544921875
15 631.9777221679688
16 630.810791015625
17 645.1777954101562
18 610.9617309570312
19 643.4176025390625
20 152.5430908203125
21 562.7241821289062
22 608.9974975585938
23 636.0831909179688
24 593.8336791992188
25 628.35205078125
26 622.4215087890625
27 614.5226440429688
28 88.61373901367188
29 83.01107788085938
30 519.0211181640625
31 572.33935546875
32 363.5462341308594
33 536.1586303710938
34 58.44125747680664
35 301.97900390625
36 395.6514892578125
37 440.52801513671875
38 61.03584671020508
39 57.14142990112305
40 45.733036041259766
41 296.3965148925781
42 193.7781524658203
43 319.4304504394531
44 294.8972473144531
45 29.842966079711914
46 30.857181549072266
47 147.1555633544922
48 27.

409 0.5952229499816895
410 0.4967089891433716
411 0.49953174591064453
412 0.2950907051563263
413 0.47110655903816223
414 0.434558242559433
415 0.2858697474002838
416 0.6313093304634094
417 0.6073466539382935
418 0.39924973249435425
419 0.37869885563850403
420 0.14716769754886627
421 0.3015049695968628
422 0.1341271847486496
423 0.33286479115486145
424 0.11650985479354858
425 0.8351050615310669
426 0.27884843945503235
427 0.09797850251197815
428 0.5218493938446045
429 0.37344804406166077
430 0.6637707352638245
431 0.2616810202598572
432 0.7543941140174866
433 0.6434998512268066
434 0.3167334198951721
435 0.2891938388347626
436 0.06257819384336472
437 0.3420095145702362
438 1.9111355543136597
439 0.5174962878227234
440 0.34767377376556396
441 0.4860934019088745
442 5.729719638824463
443 0.4729049801826477
444 0.23759318888187408
445 2.9927217960357666
446 4.596949577331543
447 0.28151267766952515
448 1.720644474029541
449 1.7782400846481323
450 1.2734270095825195
451 0.07094685733318329
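
Weight sharing has a simple consequence for gradients: when the same Module is applied several times in one forward pass, autograd sums the gradient contributions from every use. The sketch below checks this on a small scale by comparing one shared layer applied twice against two separate layers holding identical weights. The layer sizes and the names `shared`, `first`, and `second` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# One layer that will be reused, and two independent copies with the same weights.
shared = torch.nn.Linear(4, 4)
first = torch.nn.Linear(4, 4)
second = torch.nn.Linear(4, 4)
first.load_state_dict(shared.state_dict())
second.load_state_dict(shared.state_dict())

x = torch.randn(8, 4)

# Weight sharing: the same Module is applied twice in one forward pass.
shared(shared(x).clamp(min=0)).sum().backward()

# Reference: the same computation with two separate (but identical) layers.
second(first(x).clamp(min=0)).sum().backward()

# The shared layer's gradient is the sum of the per-use gradients.
summed_grad = first.weight.grad + second.weight.grad
grads_match = torch.allclose(shared.weight.grad, summed_grad, atol=1e-5)
print(grads_match)
```

This is also why reusing `middle_linear` a random number of times needs no special handling: each forward pass builds a fresh graph, and `backward()` accumulates over however many uses that graph contains.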
