In [1]:
%matplotlib inline


Neural Networks
===============

Neural networks can be constructed using the ``torch.nn`` package.

Now that you have had a glimpse of ``autograd``, note that ``nn`` depends on
``autograd`` to define models and differentiate them.
An ``nn.Module`` contains layers, and a method ``forward(input)``\ that
returns the ``output``.

For example, look at this network that classifies digit images:

.. figure:: /_static/img/mnist.png
   :alt: convnet

   convnet

It is a simple feed-forward network. It takes the input, feeds it
through several layers one after the other, and then finally gives the
output.

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or
  weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far is the output from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
  ``weight = weight - learning_rate * gradient``
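
The steps above can be sketched on a toy problem. This is a minimal sketch, not the network defined later: the single ``nn.Linear`` layer, the synthetic data, and the names ``model``, ``inputs`` and ``targets`` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1. Define a network with learnable parameters (one linear layer here).
model = nn.Linear(3, 1)

# Hypothetical synthetic dataset: targets are a fixed linear function of inputs.
inputs = torch.randn(64, 3)
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5]])

learning_rate = 0.1
losses = []
for epoch in range(50):                        # 2. iterate over the dataset
    outputs = model(inputs)                    # 3. process input through the network
    loss = ((outputs - targets) ** 2).mean()   # 4. compute the loss
    model.zero_grad()                          # clear gradients from the previous step
    loss.backward()                            # 5. propagate gradients back
    with torch.no_grad():                      # 6. weight = weight - learning_rate * gradient
        for p in model.parameters():
            p -= learning_rate * p.grad
    losses.append(loss.item())

print(losses[0], losses[-1])
```

On this toy problem the last recorded loss should be smaller than the first, since every step moves the weights against the gradient.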

Define the network
------------------

Let’s define this network:



In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

def weights_init(m):
    if isinstance(m, nn.Conv2d):
        nn.init.uniform_(m.weight.data)

net = Net()
net.apply(weights_init)
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


You just have to define the ``forward`` function, and the ``backward``
function (where gradients are computed) is automatically defined for you
using ``autograd``.
You can use any of the Tensor operations in the ``forward`` function.

The learnable parameters of a model are returned by ``net.parameters()``.

Here, ``params[0]`` contains the weights of the ``conv1`` layer, ``params[1]`` contains the biases of ``conv1``, ``params[2]`` contains the weights of ``conv2``, and so forth.


In [3]:
params = list(net.parameters())
print(len(params))

for i in params:
    print(i.shape)
    
print(params)


10
torch.Size([6, 1, 5, 5])
torch.Size([6])
torch.Size([16, 6, 5, 5])
torch.Size([16])
torch.Size([120, 400])
torch.Size([120])
torch.Size([84, 120])
torch.Size([84])
torch.Size([10, 84])
torch.Size([10])
[Parameter containing:
tensor([[[[0.0205, 0.9384, 0.3466, 0.7664, 0.0063],
          [0.7058, 0.4754, 0.6675, 0.8995, 0.6439],
          [0.8649, 0.0038, 0.7128, 0.8942, 0.9093],
          [0.9220, 0.6844, 0.4362, 0.7938, 0.0239],
          [0.1146, 0.4099, 0.9849, 0.0324, 0.5438]]],


        [[[0.7051, 0.0842, 0.0021, 0.3583, 0.3168],
          [0.9908, 0.3503, 0.4393, 0.8822, 0.0797],
          [0.8535, 0.6691, 0.4817, 0.9033, 0.2888],
          [0.4153, 0.4550, 0.1055, 0.7175, 0.0537],
          [0.9289, 0.4551, 0.2094, 0.5698, 0.8031]]],


        [[[0.3295, 0.9075, 0.0347, 0.9648, 0.5102],
          [0.9384, 0.3054, 0.9173, 0.2792, 0.0808],
          [0.5655, 0.3247, 0.7561, 0.5384, 0.8735],
          [0.3973, 0.5634, 0.8398, 0.9353, 0.0523],
          [0.2315, 0.1442, 0.2364, 0
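
To see which parameter belongs to which layer without printing every value, ``named_parameters()`` can be used. The sketch below is an illustration under assumptions: ``layers`` is a hypothetical stand-in container holding the same five layers as ``Net``, so that the block runs on its own.

```python
import torch.nn as nn

# Hypothetical stand-in holding the same five layers as Net, in the same order.
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# Each layer contributes a weight tensor followed by a bias tensor.
for name, p in layers.named_parameters():
    print(name, tuple(p.shape))

# Total number of learnable scalars across all ten tensors.
total = sum(p.numel() for p in layers.parameters())
print(total)  # 61706
```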

Let's try a random 32x32 input.
Note: the expected input size for this net (LeNet) is 32x32. To use this net on
the MNIST dataset, resize the images from the dataset to 32x32.
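
The 32x32 requirement follows from the layer arithmetic: ``fc1`` expects ``16 * 5 * 5`` features, which is what two rounds of 5x5 convolution and 2x2 pooling produce from a 32x32 image. The sketch below works this out in plain Python; the helper names ``conv_out`` and ``pool_out`` are hypothetical, and the formulas assume stride 1 with no padding for the convolutions and a pooling stride equal to the window size.

```python
def conv_out(size, kernel):
    # Output size of a convolution with stride 1 and no padding.
    return size - kernel + 1


def pool_out(size, window):
    # Output size of max pooling whose stride equals its window size.
    return size // window


size = 32
size = pool_out(conv_out(size, 5), 2)   # conv1 + pool: 32 -> 28 -> 14
size = pool_out(conv_out(size, 5), 2)   # conv2 + pool: 14 -> 10 -> 5
flat_features = 16 * size * size        # 16 channels from conv2
print(size, flat_features)  # 5 400
```

Under the same assumptions, a 28x28 image would end up as 4x4 feature maps (256 features), which would not match the 400 inputs of ``fc1``.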



In [4]:
input = torch.randn(1, 1, 32, 32)
print(input)
out = net(input)
print(out)

tensor([[[[ 0.9075, -0.3262,  1.6211,  ...,  0.4470,  0.7173, -0.9660],
          [ 1.1712, -1.3710,  0.1153,  ..., -1.6503, -0.4119, -0.1123],
          [-0.2234, -1.1084, -0.3526,  ...,  0.6143,  0.7884,  1.1465],
          ...,
          [-0.6864, -0.1405,  1.6630,  ..., -0.4787,  0.2018, -1.2964],
          [ 0.9528, -0.3871,  0.3279,  ...,  1.7516,  1.2347,  0.2796],
          [ 2.4674, -0.3897,  0.3117,  ...,  1.6496, -0.8743, -1.9957]]]])
tensor([[ 10.1932,  16.7110, -26.8263,   8.5804,  25.5562, -25.3286,  47.9618,
          -0.6238, -31.0383,   6.4824]], grad_fn=<AddmmBackward>)


Zero the gradient buffers of all parameters and backpropagate with random
gradients:



In [5]:
net.zero_grad()
out.backward(torch.randn(1, 10))


In [6]:
params = list(net.parameters())
params[0].grad

tensor([[[[ 1.1890, -1.7642, -1.2407, -0.6905,  1.4818],
          [-3.0232,  0.5179,  0.0531, -1.8498, -0.0746],
          [-2.7031,  1.8405, -0.8986, -1.1680, -4.4293],
          [-3.6899, -1.5791,  1.8281,  0.8276, -1.3301],
          [-1.0217, -0.1033, -1.8131,  1.6611, -2.1874]]],


        [[[-1.5841,  0.8583,  0.3118, -0.7311, -1.8117],
          [-2.0638,  0.2793,  0.9307, -1.9093,  0.4375],
          [-3.1827, -0.7752, -1.2239, -1.7563, -1.4826],
          [-1.7069,  0.2144,  1.3982, -1.0978,  0.7146],
          [-3.2961,  0.0650,  1.5228, -2.4158, -3.0712]]],


        [[[-1.7227, -1.0544, -0.0397, -2.7508, -1.0612],
          [-1.9456, -0.0552, -0.9607, -0.9534, -1.5147],
          [ 0.0090,  1.1706, -0.7672, -0.6291, -1.0155],
          [-1.5231, -1.6951, -1.2303, -0.9434,  1.1715],
          [ 0.4837,  2.1094,  0.6831, -3.6687, -1.9772]]],


        [[[-3.3978, -1.9479,  1.2797, -1.8634,  0.7480],
          [-0.7653,  2.2482,  0.6680, -1.6831, -1.5829],
          [-2.1136,

<div class="alert alert-info"><h4>Note</h4><p>``torch.nn`` only supports mini-batches. The entire ``torch.nn``
    package only supports inputs that are a mini-batch of samples, and not
    a single sample.

    For example, ``nn.Conv2d`` will take in a 4D Tensor of
    ``nSamples x nChannels x Height x Width``.

    If you have a single sample, just use ``input.unsqueeze(0)`` to add
    a fake batch dimension.</p></div>
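
As a small illustration of the note above, the sketch below adds a batch dimension to a single sample before passing it to a convolution. The tensor ``sample`` and the standalone ``conv`` layer are assumptions made for this example.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 6, 5)

sample = torch.randn(1, 32, 32)   # one sample: nChannels x Height x Width
batch = sample.unsqueeze(0)       # nSamples x nChannels x Height x Width
out = conv(batch)

print(batch.shape)  # torch.Size([1, 1, 32, 32])
print(out.shape)    # torch.Size([1, 6, 28, 28])
```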

Before proceeding further, let's recap all the classes you’ve seen so far.

**Recap:**
  -  ``torch.Tensor`` - A *multi-dimensional array* with support for autograd
     operations like ``backward()``. Also *holds the gradient* w.r.t. the
     tensor.
  -  ``nn.Module`` - Neural network module. *Convenient way of
     encapsulating parameters*, with helpers for moving them to GPU,
     exporting, loading, etc.
  -  ``nn.Parameter`` - A kind of Tensor, that is *automatically
     registered as a parameter when assigned as an attribute to a*
     ``Module``.
  -  ``autograd.Function`` - Implements *forward and backward definitions
     of an autograd operation*. Every ``Tensor`` operation creates at
     least a single ``Function`` node that connects to the functions that
     created a ``Tensor`` and *encodes its history*.
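
The automatic registration of ``nn.Parameter`` can be checked directly. In this sketch, ``Scale`` is a hypothetical module written only for illustration: one attribute is an ``nn.Parameter`` and the other is a plain tensor.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):  # hypothetical module, not part of the tutorial
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))  # registered as a parameter
        self.offset = torch.zeros(3)               # plain tensor: not registered

    def forward(self, x):
        return x * self.weight + self.offset


m = Scale()
names = [name for name, _ in m.named_parameters()]
print(names)  # ['weight']
```

Only ``weight`` shows up, so only ``weight`` would be seen by an optimizer built from ``m.parameters()``.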

**At this point, we covered:**
  -  Defining a neural network
  -  Processing inputs and calling backward

**Still Left:**
  -  Computing the loss
  -  Updating the weights of the network

Loss Function
-------------
A loss function takes the (output, target) pair of inputs, and computes a
value that estimates how far away the output is from the target.

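There are several different
`loss functions <http://pytorch.org/docs/nn.html#loss-functions>`_ under the
nn package.
A simple loss is ``nn.MSELoss``, which computes the mean-squared error
between the input and the target. The cell below uses
``nn.BCEWithLogitsLoss`` instead and leaves ``nn.MSELoss`` commented out
as an alternative.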

For example:



In [7]:
output = net(input)
target = torch.arange(1, 11)  # a dummy target, for example
target = target.view(1, -1).float()  # make it the same shape as output
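# criterion = nn.MSELoss()
# Note: BCEWithLogitsLoss is normally used with targets in [0, 1];
# the dummy target above is only for illustration.
criterion = nn.BCEWithLogitsLoss()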

loss = criterion(output, target)
print(loss)

tensor(2.6427, grad_fn=<BinaryCrossEntropyWithLogitsBackward>)
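
For comparison, the mean-squared error described above can be checked by hand on a tiny example. The values of ``output`` and ``target`` here are made up for illustration and are unrelated to the network's output.

```python
import torch
import torch.nn as nn

# Made-up values, chosen so the result is easy to verify by hand.
output = torch.tensor([[1.0, 2.0, 3.0]])
target = torch.tensor([[1.0, 0.0, 5.0]])

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()   # (0 + 4 + 4) / 3

print(loss.item(), manual.item())
```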


Now, if you follow ``loss`` in the backward direction, using its
``.grad_fn`` attribute, you will see a graph of computations that looks
like this:

::

    input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
          -> view -> linear -> relu -> linear -> relu -> linear
          -> BCEWithLogitsLoss
          -> loss

So, when we call ``loss.backward()``, the loss is differentiated through the
whole graph, and all leaf Tensors in the graph that have ``requires_grad=True``
will have the gradient accumulated into their ``.grad`` Tensor.

For illustration, let us follow a few steps backward:



In [8]:
print(loss.grad_fn)  # BCEWithLogitsLoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # AccumulateGrad (a leaf parameter of the last Linear)

<BinaryCrossEntropyWithLogitsBackward object at 0x000001FF312D8148>
<AddmmBackward object at 0x000001FF312D8148>
<AccumulateGrad object at 0x000001FF312E9CC8>
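
The same backward walk can be automated. This is a minimal sketch on a toy graph rather than the network above; the helper ``walk`` is hypothetical, and the exact node class names that get printed may differ between PyTorch versions.

```python
import torch


def walk(fn, depth=0, names=None):
    # Recursively visit a grad_fn node and everything it depends on.
    names = [] if names is None else names
    if fn is None:  # inputs that do not require grad have no node
        return names
    names.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names


w = torch.randn(3, requires_grad=True)   # leaf tensor that requires grad
x = torch.randn(3)                       # plain input, no gradient needed
loss = (w * x).sum()

names = walk(loss.grad_fn)
```

The walk ends at an ``AccumulateGrad`` node, which is where the gradient for the leaf tensor ``w`` is written into ``w.grad``.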


Backprop
--------
To backpropagate the error, all we have to do is call ``loss.backward()``.
You need to clear the existing gradients first, though; otherwise the new
gradients will be accumulated into the existing ones.


Now we shall call ``loss.backward()``, and have a look at conv1's bias
gradients before and after the backward.



In [9]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([0.4392, 0.1056, 0.4390, 0.7510, 0.1132, 0.2902])
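
The accumulation behaviour is easy to see on a tiny tensor. In this sketch the tensor ``w`` and the loss ``sum(w ** 2)`` are assumptions chosen so that the gradient, ``2 * w``, can be verified by hand.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()      # gradient of sum(w^2) is 2w = [2, 4]

(w ** 2).sum().backward()   # no zeroing in between
second = w.grad.clone()     # accumulated: [4, 8]

w.grad.zero_()              # clear the buffer, as net.zero_grad() does for a model
(w ** 2).sum().backward()
third = w.grad.clone()      # back to [2, 4]

print(first, second, third)
```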


Now, we have seen how to use loss functions.

**Read Later:**

  The neural network package contains various modules and loss functions
  that form the building blocks of deep neural networks. A full list with
  documentation is `here <http://pytorch.org/docs/nn>`_.

**The only thing left to learn is:**

  - Updating the weights of the network

Update the weights
------------------
The simplest update rule used in practice is Stochastic Gradient
Descent (SGD):

     ``weight = weight - learning_rate * gradient``

We can implement this using simple Python code:

.. code:: python

    learning_rate = 0.01
    for f in net.parameters():
        f.data.sub_(f.grad.data * learning_rate)

However, as you use neural networks, you want to use various different
update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.
To enable this, we built a small package: ``torch.optim`` that
implements all these methods. Using it is very simple:



In [10]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)
# optimizer = optim.RMSprop(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

.. Note::

      Observe how gradient buffers had to be manually set to zero using
      ``optimizer.zero_grad()``. This is because gradients are accumulated
      as explained in `Backprop`_ section.
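
To connect the two approaches, the sketch below checks that one step of plain ``optim.SGD`` (no momentum or weight decay) gives the same weights as the manual rule ``weight = weight - learning_rate * gradient``. The small ``nn.Linear`` models, the random input ``x`` and the toy loss are assumptions made for illustration.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model_a = nn.Linear(4, 2)
model_b = copy.deepcopy(model_a)   # identical starting weights
x = torch.randn(8, 4)
learning_rate = 0.01

# Manual update: weight = weight - learning_rate * gradient
model_a(x).pow(2).mean().backward()
with torch.no_grad():
    for p in model_a.parameters():
        p.sub_(p.grad * learning_rate)

# The same step using the optimizer
optimizer = optim.SGD(model_b.parameters(), lr=learning_rate)
optimizer.zero_grad()
model_b(x).pow(2).mean().backward()
optimizer.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```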



Saving and Loading Model
------------------------

**Saving**


In [None]:
torch.save(net.state_dict(), './model1.pt')

**Loading**

In [None]:
model = Net()
model.load_state_dict(torch.load('./model1.pt'))
model.eval()
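
A round trip can be verified by comparing outputs before and after loading. This is a minimal sketch under assumptions: it uses a small ``nn.Linear`` stand-in instead of ``Net`` and writes to a temporary directory rather than ``./model1.pt``.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)   # stand-in for Net, to keep the example small
x = torch.randn(3, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pt")
    torch.save(model.state_dict(), path)         # save only the parameters

    restored = nn.Linear(4, 2)                   # must have the same architecture
    restored.load_state_dict(torch.load(path))   # copy the saved parameters in

restored.eval()   # switch layers such as dropout or batch norm to evaluation mode
with torch.no_grad():
    same = torch.equal(model(x), restored(x))
print(same)
```

Because ``state_dict()`` stores only tensors keyed by parameter name, the model class has to be constructed first and then filled in with ``load_state_dict``.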