In [1]:
%matplotlib inline


Neural Networks
===============

Neural networks can be constructed using the ``torch.nn`` package.

Now that you have had a glimpse of ``autograd``, note that ``nn`` depends on
``autograd`` to define models and differentiate them.
An ``nn.Module`` contains layers and a method ``forward(input)`` that
returns the ``output``.

For example, look at this network that classifies digit images:

.. figure:: /_static/img/mnist.png
   :alt: convnet

   convnet

It is a simple feed-forward network. It takes the input, feeds it
through several layers one after the other, and then finally gives the
output.

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or
  weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far the output is from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
  ``weight = weight - learning_rate * gradient``
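
The update rule in the last step can be tried out without any framework. The following is a minimal sketch, assuming a hypothetical one-parameter loss ``(weight - 3) ** 2`` whose gradient is written by hand; the names ``loss_gradient`` and ``sgd_step`` are illustrative and not part of PyTorch.

```python
def loss_gradient(weight):
    # Gradient of the illustrative loss (weight - 3) ** 2 with respect to weight.
    return 2.0 * (weight - 3.0)


def sgd_step(weight, gradient, learning_rate):
    # The update rule from the list above.
    return weight - learning_rate * gradient


weight = 0.0
learning_rate = 0.1
for _ in range(50):
    weight = sgd_step(weight, loss_gradient(weight), learning_rate)

print(weight)  # close to 3.0, the minimizer of the illustrative loss
```

Each step moves the weight a fraction of the way toward the minimizer, which is the same behavior the network's weights follow with gradients supplied by ``autograd``.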

Define the network
------------------

Let’s define this network:



In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


You just have to define the ``forward`` function, and the ``backward``
function (where gradients are computed) is automatically defined for you
using ``autograd``.
You can use any of the Tensor operations in the ``forward`` function.

The learnable parameters of a model are returned by ``net.parameters()``:



In [3]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([6, 1, 5, 5])
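
The count of 10 comes from five layers that each hold a weight and a bias. The sketch below, which assumes the same layer sizes as ``Net`` but keeps them in a plain ``nn.ModuleDict`` purely for counting (the name ``layers`` is illustrative), lists each parameter by name and sums the number of learnable values.

```python
import torch.nn as nn

# Same layer sizes as Net above, held in a plain container just for counting.
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# One weight tensor and one bias tensor per layer.
for name, param in layers.named_parameters():
    print(name, tuple(param.size()))

total = sum(p.numel() for p in layers.parameters())
print(total)  # 61706 learnable values in total
```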


In [12]:
params[0]

Parameter containing:
tensor([[[[ 0.1766,  0.0626, -0.0126, -0.1485, -0.1143],
          [ 0.1079, -0.0807, -0.1443, -0.1870, -0.0684],
          [-0.0105,  0.0737, -0.0739, -0.1030, -0.1602],
          [-0.1543, -0.0784, -0.0076, -0.0313, -0.1566],
          [ 0.1963, -0.1473, -0.1078, -0.0949, -0.1037]]],


        [[[-0.1560,  0.1539,  0.0730, -0.1281,  0.0183],
          [-0.1039,  0.1409, -0.1497, -0.1801, -0.0988],
          [-0.1943, -0.0518,  0.1439, -0.0244, -0.0232],
          [-0.0017, -0.1677, -0.0404, -0.1159,  0.1965],
          [ 0.1672,  0.0719, -0.0179, -0.1904, -0.0605]]],


        [[[ 0.0002, -0.0052, -0.1303,  0.0499,  0.0766],
          [ 0.0073, -0.1399,  0.1015, -0.1289, -0.0063],
          [-0.0801,  0.0052, -0.0898,  0.1953,  0.1694],
          [ 0.0261,  0.0505, -0.0939,  0.1857, -0.1188],
          [ 0.0492,  0.0209,  0.1305,  0.1481,  0.1087]]],


        [[[ 0.0532,  0.1949,  0.0530, -0.0214, -0.0073],
          [ 0.0892,  0.0542, -0.1628, -0.0636, -0.1199

In [11]:
params[1]

Parameter containing:
tensor([ 0.1347,  0.1469, -0.1322, -0.0088,  0.0215,  0.0648])

In [13]:
params[2]

Parameter containing:
tensor(1.00000e-02 *
       [[[[-2.7557, -6.7583,  1.7948,  0.0540,  1.7929],
          [ 7.2878, -4.6002, -7.1123, -1.8802,  6.3791],
          [ 5.4438,  0.9972, -7.1706,  1.7921, -1.0726],
          [ 0.0228,  2.0864, -5.6084, -5.2926, -3.7150],
          [ 3.1693, -2.9725,  4.3696,  0.0897,  2.4423]],

         [[-1.2635,  4.0192, -1.0275,  5.4801, -2.4831],
          [-4.7110, -0.6397, -3.8306,  5.7879,  1.3434],
          [-0.1656,  4.8660,  4.8321,  3.0476,  1.3929],
          [-1.3669, -7.4818,  7.8114, -4.3556, -5.3576],
          [ 7.9164,  5.9184,  4.5088,  1.3254, -2.1218]],

         [[-3.7819, -0.8025,  0.8332, -4.4593, -7.1126],
          [ 4.1711,  3.5896, -1.4551,  0.7057,  7.2711],
          [ 6.4650,  4.9905, -0.3131,  2.8092, -2.4918],
          [ 7.2855, -5.0519, -0.7914, -6.2351, -5.2789],
          [-7.8652, -3.9655, -2.7525,  6.1050, -3.2395]],

         [[-0.5243, -0.9949, -0.6995,  6.1387,  2.3764],
          [ 6.8599,  7.3158,  1.5108,  

In [14]:
params[3]

Parameter containing:
tensor(1.00000e-02 *
       [-6.2208, -2.9289,  6.0411, -4.8247,  2.5073,  2.9058,  8.0447,
         2.4551,  5.7090,  5.2946,  3.2243,  4.1511,  2.8694, -2.4219,
         8.0693, -4.2360])

In [15]:
params[4]

Parameter containing:
tensor([[ 1.2634e-02,  2.3087e-02,  4.8764e-02,  ...,  4.3082e-02,
          3.5288e-03, -7.8769e-03],
        [-3.6116e-02,  4.0611e-02, -1.8232e-02,  ..., -1.6476e-02,
          2.5371e-02,  2.9297e-02],
        [-2.1741e-02, -3.0112e-02,  4.8850e-02,  ..., -4.4457e-02,
         -3.4977e-02, -4.0434e-02],
        ...,
        [-4.6662e-02,  2.0653e-02, -7.2814e-03,  ..., -4.9183e-02,
         -4.5102e-02,  4.2223e-02],
        [ 4.1020e-02,  1.7329e-02, -4.5680e-02,  ...,  4.9582e-02,
         -1.8222e-02,  1.7887e-02],
        [ 3.2665e-02,  3.2288e-02,  1.4311e-02,  ..., -3.8445e-02,
          3.4608e-02, -3.4968e-02]])

In [16]:
params[5]

Parameter containing:
tensor(1.00000e-02 *
       [ 0.7993, -4.5939, -0.4913,  0.3560,  0.8247, -0.4292, -4.9696,
        -0.3507, -1.6101, -1.3545,  1.2663, -4.8945,  4.5837,  3.0093,
         0.7560, -2.8490,  4.5657, -0.9518, -0.4037,  2.9403, -2.0277,
        -1.7214,  0.7739,  3.5474, -0.7361, -2.0960,  0.4570,  2.7553,
         1.0134,  4.6471,  2.7491, -0.7515,  1.5416, -4.1582, -0.4886,
        -0.2441,  3.6936,  1.5918, -4.7076,  0.9961, -3.5473, -0.9436,
         0.7128,  0.1827,  4.2426, -4.0257, -0.4251,  3.6093, -4.6249,
        -4.5452,  2.3364,  0.5373, -3.2167, -0.5343,  3.4979, -2.3814,
         4.1288,  1.4933, -2.5847,  3.1511, -2.5378,  4.7888, -2.5861,
         1.3041,  3.8966, -4.3968, -3.8630, -0.7841,  2.5598, -0.1898,
        -2.8274, -1.4709,  2.0151, -4.5158, -4.8565,  2.4150, -0.8562,
         4.7647, -4.5654, -1.8300,  0.5100,  0.3426, -4.6515, -3.1606,
         0.9423, -3.7901, -2.8506, -3.5013, -3.3183,  3.3036, -2.1743,
         2.5578, -0.5470,  4.7416,

In [17]:
params[6]

Parameter containing:
tensor(1.00000e-02 *
       [[ 1.2368, -7.3091,  5.0765,  ...,  8.7203,  3.7073, -4.2464],
        [-0.6658, -2.5002,  0.1474,  ...,  6.4851, -5.7720, -6.5738],
        [-0.1003, -4.4392, -3.9783,  ..., -0.6537,  4.4131, -1.6640],
        ...,
        [-4.8352, -2.5084, -8.2158,  ...,  4.0101,  1.5109,  7.3754],
        [-1.2128, -2.5012, -0.6475,  ..., -5.3254, -8.6097,  2.7185],
        [-8.2277,  9.0836,  0.1847,  ..., -3.4958, -5.5351,  8.2247]])

In [18]:
params[7]

Parameter containing:
tensor(1.00000e-02 *
       [ 4.4963,  4.9905,  1.2907,  5.7610,  0.9245,  3.6374,  8.3315,
         6.7015, -4.2438, -3.7145, -5.9884, -7.4573,  1.3191,  1.8337,
        -3.7005,  6.9101,  6.1748, -2.5954, -1.9803,  7.4429, -4.6292,
        -1.1431,  4.7948, -8.2267,  8.7532,  3.7656, -2.8290,  5.7917,
         7.5451, -7.1892,  0.2357, -0.6134, -2.8648, -5.6974,  6.0703,
        -6.1286,  4.2482, -3.8136,  8.6462,  6.2290, -4.7014,  4.5385,
        -4.5188, -4.6047, -6.5037, -2.2696,  3.8900,  1.8005,  6.6959,
        -3.6971, -8.9407,  5.6726, -7.6046,  6.4657,  7.7985,  5.5293,
         3.7895,  6.3729, -0.7559,  0.7541,  4.6076,  1.1845, -0.0694,
         7.3200,  1.3234,  3.7963, -8.6663,  7.4109,  0.2873, -1.3324,
        -3.6446,  6.5030, -2.7726,  5.9385,  5.6983,  1.5580, -8.2261,
         5.0245, -5.3064,  4.9673,  6.9563, -8.6351, -2.1537, -8.2455])

In [19]:
params[8]

Parameter containing:
tensor([[-0.0675,  0.0664, -0.0713,  0.1018,  0.0002,  0.0949, -0.0558,
          0.0847, -0.0109, -0.0525,  0.0738, -0.0148, -0.0176, -0.0205,
          0.0773,  0.0362, -0.0307,  0.0524, -0.0646, -0.0349,  0.0477,
         -0.0939,  0.0360, -0.0192,  0.0805,  0.0645,  0.0832, -0.0703,
         -0.0741, -0.0809, -0.0741, -0.0218,  0.0238, -0.1006, -0.0621,
          0.0374,  0.0815, -0.0241, -0.1081, -0.0560, -0.0543,  0.0824,
         -0.0510,  0.0008,  0.0686, -0.0516,  0.0730,  0.0553, -0.0949,
         -0.0169,  0.0349,  0.0472, -0.0125,  0.0589, -0.0282,  0.0806,
          0.0014, -0.0752, -0.0938, -0.0228,  0.0104, -0.0760,  0.0125,
         -0.0986,  0.0048, -0.0851,  0.0288, -0.0930, -0.0762,  0.1050,
         -0.0099, -0.0801,  0.0246, -0.0447, -0.0122,  0.0112, -0.0890,
         -0.0666,  0.0297, -0.0485,  0.0262, -0.0913,  0.0097, -0.1028],
        [ 0.0482,  0.0241, -0.0240,  0.0320, -0.0364, -0.0479, -0.0170,
         -0.0907, -0.0498,  0.0401,  0.01

In [20]:
params[9]

Parameter containing:
tensor([-0.0280,  0.0570,  0.1024, -0.0496,  0.0905,  0.0909, -0.0881,
         0.1163, -0.0809, -0.0633])

Let's try a random 32x32 input.
Note: the expected input size for this net (LeNet) is 32x32. To use this net on
the MNIST dataset, please resize the images from the dataset to 32x32.
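
The 32x32 requirement follows from the ``16 * 5 * 5`` input size of ``fc1``. As a rough check, the sketch below traces one spatial dimension through the two convolution and pooling stages; the helper names ``conv_out``, ``pool_out`` and ``flat_features`` are illustrative, and the formulas assume stride 1, no padding, and non-overlapping 2x2 pooling as in the network above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # Output length along one spatial dimension of a convolution (no dilation).
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    # Non-overlapping max pooling: the stride equals the window size.
    return size // window


def flat_features(side, channels=16):
    side = pool_out(conv_out(side, 5), 2)  # conv1 (5x5) then 2x2 pooling
    side = pool_out(conv_out(side, 5), 2)  # conv2 (5x5) then 2x2 pooling
    return channels * side * side


print(flat_features(32))  # 400, matches fc1's 16 * 5 * 5 inputs
print(flat_features(28))  # 256, does not match fc1
```

A 28x28 image would therefore reach ``fc1`` with the wrong number of features, which is why the images need resizing first.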



In [4]:
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[-0.0461, -0.0020,  0.0251, -0.0977,  0.1229,  0.2578, -0.1546,
          0.0960, -0.0769, -0.1118]])


Zero the gradient buffers of all parameters and backprop with random
gradients:



In [5]:
net.zero_grad()
out.backward(torch.randn(1, 10))

<div class="alert alert-info"><h4>Note</h4><p>``torch.nn`` only supports mini-batches: every input to the
    package must be a mini-batch of samples, not
    a single sample.

    For example, ``nn.Conv2d`` will take in a 4D Tensor of
    ``nSamples x nChannels x Height x Width``.

    If you have a single sample, just use ``input.unsqueeze(0)`` to add
    a fake batch dimension.</p></div>
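
As a small illustration of the note above, the following sketch adds a batch dimension to a single image before passing it to a convolution. The tensor names ``sample`` and ``batch`` are illustrative.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 6, 5)
sample = torch.randn(1, 32, 32)   # one image: nChannels x Height x Width
batch = sample.unsqueeze(0)       # add a batch dimension of size 1 at the front

print(batch.size())               # torch.Size([1, 1, 32, 32])
print(conv(batch).size())         # torch.Size([1, 6, 28, 28])
```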

Before proceeding further, let's recap all the classes you’ve seen so far.

**Recap:**
  -  ``torch.Tensor`` - A *multi-dimensional array* with support for autograd
     operations like ``backward()``. Also *holds the gradient* w.r.t. the
     tensor.
  -  ``nn.Module`` - Neural network module. *Convenient way of
     encapsulating parameters*, with helpers for moving them to GPU,
     exporting, loading, etc.
  -  ``nn.Parameter`` - A kind of Tensor that is *automatically
     registered as a parameter when assigned as an attribute to a*
     ``Module``.
  -  ``autograd.Function`` - Implements *forward and backward definitions
     of an autograd operation*. Every ``Tensor`` operation creates at
     least one ``Function`` node that connects to the functions that
     created a ``Tensor`` and *encodes its history*.
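
The automatic registration of ``nn.Parameter`` can be seen in a small sketch. The module ``Scale`` below is hypothetical and exists only for illustration: it holds one ``nn.Parameter`` and one plain tensor, and only the former shows up among the module's parameters.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))  # registered as a parameter
        self.offset = torch.zeros(3)               # plain tensor, not registered

    def forward(self, x):
        return x * self.weight + self.offset


module = Scale()
names = [name for name, _ in module.named_parameters()]
print(names)  # ['weight']
```

Only registered parameters are returned by ``parameters()``, so only they would be seen by an optimizer built from it.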

**At this point, we covered:**
  -  Defining a neural network
  -  Processing inputs and calling backward

**Still Left:**
  -  Computing the loss
  -  Updating the weights of the network

Loss Function
-------------
A loss function takes the (output, target) pair of inputs, and computes a
value that estimates how far away the output is from the target.

There are several different
`loss functions <http://pytorch.org/docs/nn.html#loss-functions>`_ in the
``nn`` package.
A simple one is ``nn.MSELoss``, which computes the mean-squared error
between the input and the target.

For example:



In [6]:
output = net(input)
target = torch.arange(1, 11, dtype=torch.float)  # a dummy target, for example (float, to match the output's dtype)
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(38.5802)
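
To make the definition concrete, the mean-squared error can also be computed by hand and compared with ``nn.MSELoss``. This is a minimal sketch that assumes the default averaging behavior of ``nn.MSELoss`` and uses a random tensor as a stand-in for a network output; ``by_hand`` and ``by_module`` are illustrative names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)  # stand-in for a network output
target = torch.arange(1, 11, dtype=torch.float).view(1, -1)

by_hand = ((output - target) ** 2).mean()   # average of squared differences
by_module = nn.MSELoss()(output, target)

print(torch.allclose(by_hand, by_module))   # True
```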


Now, if you follow ``loss`` in the backward direction, using its
``.grad_fn`` attribute, you will see a graph of computations that looks
like this:

::

    input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
          -> view -> linear -> relu -> linear -> relu -> linear
          -> MSELoss
          -> loss

So, when we call ``loss.backward()``, the loss is differentiated w.r.t. the
Tensors in the whole graph, and all Tensors in the graph that have
``requires_grad=True`` will have their ``.grad`` Tensor accumulated with the
gradient.

For illustration, let us follow a few steps backward:



In [7]:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # first input of Linear (ExpandBackward in the output below, not ReLU)

<MseLossBackward object at 0x7f92e4bcf828>
<AddmmBackward object at 0x7f92e4bcf978>
<ExpandBackward object at 0x7f92e4bcf828>
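
Following ``next_functions`` by hand gets tedious, so the walk can be wrapped in a small recursive helper. This is a sketch only: ``walk`` is a hypothetical helper, the tiny model is a stand-in for ``net``, and the node names it prints depend on the PyTorch version, so no output is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def walk(fn, depth=0, max_depth=4):
    # Collect (depth, node name) pairs by following next_functions.
    nodes = []
    if fn is None or depth > max_depth:
        return nodes
    nodes.append((depth, type(fn).__name__))
    for next_fn, _ in fn.next_functions:
        nodes.extend(walk(next_fn, depth + 1, max_depth))
    return nodes


linear = nn.Linear(4, 2)  # small stand-in for net
loss = F.mse_loss(F.relu(linear(torch.randn(1, 4))), torch.zeros(1, 2))

nodes = walk(loss.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

Entries of ``next_functions`` that are ``None`` correspond to inputs that do not require gradients, such as the target here, and are skipped.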


Backprop
--------
To backpropagate the error, all we have to do is call ``loss.backward()``.
You need to clear the existing gradients first, though; otherwise the new
gradients will be accumulated into the existing ones.
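
The accumulation is easy to observe on a single tensor. In this minimal sketch, ``w`` is an illustrative tensor and ``(2 * w).sum()`` is a toy function whose gradient with respect to each element of ``w`` is 2.

```python
import torch

w = torch.ones(3, requires_grad=True)

(2 * w).sum().backward()
first = w.grad.clone()    # tensor([2., 2., 2.])

(2 * w).sum().backward()
second = w.grad.clone()   # tensor([4., 4., 4.]): the new gradient was added

w.grad.zero_()            # clear the buffer in place
(2 * w).sum().backward()
third = w.grad.clone()    # tensor([2., 2., 2.])

print(first, second, third)
```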


Now we shall call ``loss.backward()``, and have a look at conv1's bias
gradients before and after the backward.



In [8]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([ 0.,  0.,  0.,  0.,  0.,  0.])
conv1.bias.grad after backward
tensor([-0.0872, -0.0191, -0.0878, -0.0355,  0.1031,  0.0175])


Now, we have seen how to use loss functions.

**Read Later:**

  The neural network package contains various modules and loss functions
  that form the building blocks of deep neural networks. A full list with
  documentation is `here <http://pytorch.org/docs/nn>`_.

**The only thing left to learn is:**

  - Updating the weights of the network

Update the weights
------------------
The simplest update rule used in practice is Stochastic Gradient
Descent (SGD):

     ``weight = weight - learning_rate * gradient``

We can implement this using simple Python code:

.. code:: python

    learning_rate = 0.01
    for f in net.parameters():
        f.data.sub_(f.grad.data * learning_rate)
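
The same update can also be written inside ``torch.no_grad()`` so that the update itself is not recorded by ``autograd``. The sketch below is one possible form, not the tutorial's own code; it uses a small ``nn.Linear`` as a stand-in for ``net`` and checks that every parameter moved by ``learning_rate * gradient``.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)  # small stand-in for net
loss = model(torch.ones(1, 3)).sum()
loss.backward()

learning_rate = 0.01
before = [p.detach().clone() for p in model.parameters()]

with torch.no_grad():  # keep the update itself out of the graph
    for p in model.parameters():
        p -= learning_rate * p.grad

checks = [
    torch.allclose(new, old - learning_rate * new.grad)
    for old, new in zip(before, model.parameters())
]
print(checks)  # [True, True]
```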

However, as you use neural networks, you will want to use various
update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.
To enable this, we built a small package, ``torch.optim``, that
implements all these methods. Using it is very simple:



In [9]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

.. Note::

      Observe how gradient buffers had to be manually set to zero using
      ``optimizer.zero_grad()``. This is because gradients are accumulated,
      as explained in the `Backprop`_ section.
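
Putting the pieces together, the steps in the cell above are usually repeated many times. The following sketch runs them for a few iterations on one fixed sample, with a small ``nn.Linear`` as a stand-in for ``net`` and made-up input and target values, and records the loss at each step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.Linear(4, 3)  # small stand-in for net
inputs = torch.randn(1, 4)
target = torch.tensor([[1.0, 2.0, 3.0]])
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

losses = []
for step in range(20):
    optimizer.zero_grad()                    # clear gradients from the last step
    loss = criterion(model(inputs), target)  # forward pass and loss
    loss.backward()                          # fill .grad for every parameter
    optimizer.step()                         # apply the SGD update
    losses.append(loss.item())

print(losses[0], losses[-1])  # the last loss is lower than the first
```

In a real training loop the inputs and targets would come from a dataset rather than a single fixed sample.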

