# Neural Networks
Neural networks can be constructed using the `torch.nn` package.

Now that you have had a glimpse of `autograd`, note that `nn` depends on `autograd` to define models and differentiate them. An `nn.Module` contains layers and a method `forward(input)` that returns the `output`.

A typical training procedure for a neural network is as follows:
1. Define the neural network that has some learnable parameters (or weights)
2. Iterate over a dataset of inputs
3. Process input through the network
4. Compute the loss (how far the output is from being correct)
5. Propagate gradients back into the network's parameters
6. Update the weights of the network, typically using a simple update rule: `weight = weight - learning_rate * gradient`
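
The six steps above can be sketched end to end on a toy problem. This is a minimal illustration, not the network built later in this section: the single `nn.Linear` layer, the synthetic dataset, the learning rate, and the number of passes are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Step 1: a network with learnable parameters (a single linear layer here).
model = nn.Linear(3, 1)
criterion = nn.MSELoss()
learning_rate = 0.1  # assumed value for this toy problem

# Hypothetical toy dataset: targets follow y = 1*x1 + 2*x2 + 3*x3.
inputs = torch.randn(64, 3)
targets = inputs @ torch.tensor([[1.0], [2.0], [3.0]])

losses = []
for step in range(50):                   # Step 2: iterate (whole dataset as one batch)
    output = model(inputs)               # Step 3: process input through the network
    loss = criterion(output, targets)    # Step 4: compute the loss
    model.zero_grad()                    # clear gradients left over from the last pass
    loss.backward()                      # Step 5: propagate gradients to the parameters
    with torch.no_grad():                # Step 6: weight = weight - learning_rate * gradient
        for p in model.parameters():
            p -= learning_rate * p.grad
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The loss should shrink over the passes, because the toy targets are an exact linear function of the inputs.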

### Define the Network

In [19]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [20]:
class Network(nn.Module):
    def __init__(self):
        super(Network, self).__init__()
        # 1 input image channel, 6 output channels,
        # 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)

        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 16 channels, each a 5x5 feature map
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        # x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))

        # If the size is a square, you can specify with a single number
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1) # flatten all dimensions except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)

        return x

In [21]:
network = Network()
network

Network(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

You just have to define the `forward` function, and the `backward` function (where gradients are computed) is automatically defined for you using `autograd`. You can use any of the Tensor operations in the `forward` function.

The learnable parameters of a model are returned by `network.parameters()`.

In [22]:
params = list(network.parameters())
print(len(params))
print(params[0].size()) # conv1's .weight

10
torch.Size([6, 1, 5, 5])
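
The count of 10 comes from five layers, each contributing a weight tensor and a bias tensor. The sketch below lists them by name with `named_parameters()`. To stay self-contained it uses a stand-in `nn.ModuleDict` (the name `layers` is an assumption) holding the same five layers as `Network`, without the forward pass.

```python
import torch.nn as nn

# Stand-in holding the same five layers as Network above (no forward pass).
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# Each layer registers a weight and a bias: 5 layers x 2 = 10 tensors.
for name, param in layers.named_parameters():
    print(f"{name:14s} {tuple(param.shape)}")

# Total number of individual learnable values across all tensors.
total = sum(p.numel() for p in layers.parameters())
print("total learnable values:", total)
```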


Let's try a random 32x32 input.

Note: The expected input size of this network (LeNet) is 32x32. To use it on the **MNIST** dataset, whose images are 28x28, resize the images to 32x32.

In [23]:
input = torch.randn(1, 1, 32, 32)
out = network(input)
out

tensor([[ 0.1258,  0.0090, -0.0192, -0.0605,  0.0320,  0.1966,  0.0272,  0.0923,
         -0.0653, -0.1663]], grad_fn=<AddmmBackward0>)
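
The 32x32 requirement follows from the `16 * 5 * 5` input size of `fc1`: each 5x5 convolution (no padding) shrinks the side by 4, and each 2x2 max pool halves it, so 32 becomes 28, 14, 10, then 5. The sketch below checks this arithmetic by running only the convolutional part. The helper name `flattened_features` is an assumption introduced for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same convolution layers as in Network above.
conv1 = nn.Conv2d(1, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)

def flattened_features(side):
    """Return the number of features fc1 would receive for a side x side image."""
    x = torch.randn(1, 1, side, side)
    x = F.max_pool2d(F.relu(conv1(x)), 2)   # side -> (side - 4) // 2
    x = F.max_pool2d(F.relu(conv2(x)), 2)   # applied a second time
    return torch.flatten(x, 1).shape[1]

print(flattened_features(32))  # 16 * 5 * 5 = 400, matches fc1
print(flattened_features(28))  # 16 * 4 * 4 = 256, does not match fc1
```

A 28x28 input therefore fails at `fc1` with a shape mismatch unless the image is resized or `fc1` is redefined.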

Zero the gradient buffers of all parameters and backprop with random gradients.

In [24]:
network.zero_grad()
out.backward(torch.randn(1, 10))

### Loss Function
A loss function takes the `(output, target)` pair of inputs, and computes a value that estimates how far away the output is from the target.

There are several different **loss functions** under the `nn` package. A simple one is `nn.MSELoss`, which computes the mean squared error between the output and the target.

In [25]:
out = network(input)
target = torch.randn(10) # dummy target
target = target.view(1, -1) # same shape as the output
criterion = nn.MSELoss()

loss = criterion(out, target)
loss

tensor(1.1192, grad_fn=<MseLossBackward0>)
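
To make the definition concrete, the sketch below compares `nn.MSELoss` (with its default `reduction="mean"`) against the same quantity written out by hand. The random `output` and `target` tensors are placeholders, not values from the network above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder tensors with the same shape as the network output.
output = torch.randn(1, 10)
target = torch.randn(1, 10)

builtin = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()  # mean of squared element-wise differences

print(builtin.item(), manual.item())
print(torch.allclose(builtin, manual))
```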

If you follow `loss` in the backward direction using its `.grad_fn` attribute, you will see a graph of computations that looks like this:

```
input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> flatten -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss
```
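
A graph like this can be printed by following `grad_fn.next_functions` recursively. The sketch below does so on a much smaller model (one linear layer, a ReLU, and `MSELoss`) so that the output stays short. The helper `walk` is hypothetical, and the exact node names (for example the `0` suffix) may differ between PyTorch versions.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)
x = torch.randn(1, 4)
loss = nn.MSELoss()(torch.relu(layer(x)), torch.zeros(1, 2))

def walk(fn, depth=0, names=None):
    """Print the backward graph rooted at fn and return the visited node names."""
    names = [] if names is None else names
    if fn is not None:  # inputs that do not require gradients appear as None
        names.append(type(fn).__name__)
        print("  " * depth + names[-1])
        for next_fn, _ in fn.next_functions:
            walk(next_fn, depth + 1, names)
    return names

names = walk(loss.grad_fn)
```

The leaves of the printed tree are `AccumulateGrad` nodes, one per parameter that receives a gradient.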

When we call `loss.backward()`, the whole graph is differentiated w.r.t. the neural network parameters, and all Tensors in the graph that have `requires_grad=True` will have their `.grad` Tensor accumulated with the gradient.

In [26]:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # first input of the Linear op (here fc3's bias)

<MseLossBackward0 object at 0x12c483a60>
<AddmmBackward0 object at 0x12c4827d0>
<AccumulateGrad object at 0x12c483a60>


### Backprop
To backpropagate the error, all we have to do is call `loss.backward()`. You *need to clear the existing gradients first*, though, otherwise the new gradients will be accumulated into the existing ones.

In [27]:
network.zero_grad()  # clears the gradient buffers of all parameters

print("conv1.bias.grad before backward")
print(network.conv1.bias.grad)

loss.backward()

print("conv1.bias.grad after backward")
print(network.conv1.bias.grad)

conv1.bias.grad before backward
None
conv1.bias.grad after backward
tensor([ 0.0153, -0.0180,  0.0099,  0.0225,  0.0162,  0.0044])
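
The accumulation behaviour is easy to see on a single tensor. In this minimal sketch (the tensor `w` and the function `sum(w^2)` are illustrative choices), calling `backward()` twice without clearing doubles the stored gradient, and clearing restores the expected value.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()          # d/dw sum(w^2) = 2w

(w ** 2).sum().backward()       # no clearing in between
accumulated = w.grad.clone()    # the second gradient was added to the first

w.grad.zero_()                  # clear the buffer before the next backward pass
(w ** 2).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```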


### Updating the weights
The simplest update rule used in practice is **Stochastic Gradient Descent (SGD)**:

`weight = weight - learning_rate * gradient`

In [28]:
learning_rate = 0.01
with torch.no_grad():  # update in place without recording the operation in autograd
    for f in network.parameters():
        f.sub_(f.grad * learning_rate)

PyTorch has built-in implementations of many different optimization algorithms in the `torch.optim` package.

In [29]:
import torch.optim as optim

In [30]:
optimizer = optim.SGD(network.parameters(), lr=0.01)

optimizer.zero_grad() # zero the gradient buffers
output = network(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # does the update
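
For plain SGD (no momentum or weight decay), `optimizer.step()` should apply the same rule as the manual loop, `weight = weight - learning_rate * gradient`. The sketch below checks this on two identical copies of a small layer. The layer size, the random data, and the names `model_a` and `model_b` are assumptions made for the comparison.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

model_a = nn.Linear(4, 2)
model_b = copy.deepcopy(model_a)      # identical starting weights
x, y = torch.randn(8, 4), torch.randn(8, 2)
criterion = nn.MSELoss()
lr = 0.01

# Manual update on model_a.
criterion(model_a(x), y).backward()
with torch.no_grad():
    for p in model_a.parameters():
        p -= lr * p.grad

# The same step on model_b through torch.optim.
optimizer = optim.SGD(model_b.parameters(), lr=lr)
optimizer.zero_grad()
criterion(model_b(x), y).backward()
optimizer.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

Using `torch.optim` becomes more valuable with other optimizers, since only the optimizer constructor changes and the rest of the loop stays the same.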