# More complex networks

Here we are going to traing a significantly more complex network on the MNIST digit classification dataset and see how much improvement can be seen in the network's performance (on the test set).

We will see that as you make the network more complex, it will take significantly longer to train the network. If you continue to run these networks on a CPU (as in regular computation), soon the training time becomes prohibitive!

We will see how you can improve the situation by **training the network on a GPU**!

## Importing necessary packages

In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import torch # import the PyTorch package
from torch import nn
import torch.nn.functional as F

import torchvision # import trochvision package
from torchvision import transforms # get torchvision's transforms subpackage

import time

## Loading datasets

In [3]:
# create a composite transform that first converts images to tensors and then normalize the images
image_transform = transforms.Compose([
    transforms.ToTensor(), # converts images into Tensors
    transforms.Normalize([0.1307], [0.3081])
])

# apply the transforms at the time of dataset loading
training_set = torchvision.datasets.MNIST('./data', train=True, download=True,
                                          transform=image_transform)
test_set = torchvision.datasets.MNIST('./data', train=True, download=True,
                                          transform=image_transform)

batch_size = 512
training_loader = torch.utils.data.DataLoader(training_set, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size)

# Decently complex network

Here is a much more complex network (although still would be considered very simple from the field's standard) that uses operations like **convolution** and **drop outs**. (Covering these opartions is beyond the scope of this course but you can find tons of references on them.)

In [5]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2(x), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

## Train the network

Now let's go ahead and train this network from scratch. Here I'm using values of learning rate and number of epochs that I have discovered to work well when training this network. These adjustable values that cannot trained by the gradient descent are often referred to as **hyperparameters**.

In [4]:
net = Net()
net.train() # puts the network into the training mode

# create and initialize an optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.03, momentum=0.5)

start = time.time()
for epoch_idx in range(10):
    for batch_idx, (data, target) in enumerate(training_loader):
        # reset the gradient before the next gradient step
        optimizer.zero_grad()

        # evaluate the network output
        output = net(data)

        # compute the loss
        loss = F.nll_loss(output, target)

        # perform back propagation to compute gradients with respect to parameters!
        loss.backward()

        # perform a gradient descent step on the parameters
        optimizer.step()

        # report the loss every 100 batches
        if batch_idx % 100 == 0:
            print('Epoch {} Loss: {:.6f}'.format(epoch_idx, loss.item()))
            
duration = time.time() - start
print('Training completed in {:.2f} seconds'.format(duration))

NameError: name 'Net' is not defined

## Test the network

In [5]:
net.eval() # put network into evaluation model
test_loss = 0
correct = 0

# prevents unnecessary gradient computation during test - can lead to time and memory saving
with torch.no_grad(): 
    for data, target in test_loader:
        output = net(data)
        
        # sum up batch loss
        test_loss += F.nll_loss(output, target, size_average=False).item() 
        
        # get the index of the max log-probability
        pred = output.max(1, keepdim=True)[1] 
        
        # count number of times where max probability matches the label index
        correct += pred.eq(target.view_as(pred)).sum().item()

# divide the test loss by number of samples in the test set
test_loss /= len(test_loader.dataset)

print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))


Test set: Average loss: 0.0668, Accuracy: 58753/60000 (98%)



So this network can perform much better than our earlier networks, but it takes significantly longer to train!

For your refernce the best network performance on MNIST to date is 99.79% on the test set! You can find (a bit outdated - from 2016) classification scores on MNIST and many other popular benchmark datasets [here](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#4d4e495354).

# Speeding up your training with a GPU

You can get a significant speed up by placing the network and data on GPUs and letting computation take place there.

**WARNING** The following code will only work if you are on a machine with a properly configured GPU device.

In [6]:
net = Net()
net.to('cuda') # place the network on GPU!

net.train() # puts the network into the training mode


# create and initialize an optimizer
optimizer = torch.optim.SGD(net.parameters(), lr=0.03, momentum=0.5)

start = time.time()
for epoch_idx in range(10):
    for batch_idx, (data, target) in enumerate(training_loader):
        # reset the gradient before the next gradient step
        optimizer.zero_grad()
        
        # send each batch to GPU so it can be processed by the network that's also on the GPU
        data, target = data.to('cuda'), target.to('cuda')

        # evaluate the network output
        output = net(data)

        # compute the loss
        loss = F.nll_loss(output, target)

        # perform back propagation to compute gradients with respect to parameters!
        loss.backward()

        # perform a gradient descent step on the parameters
        optimizer.step()

        # report the loss every 100 batches
        if batch_idx % 100 == 0:
            print('Epoch {} Loss: {:.6f}'.format(epoch_idx, loss.item()))
            
duration = time.time() - start
print('Training completed in {:.2f} seconds'.format(duration))

RuntimeError: Error attempting to use dtype torch.float32 with layout torch.strided and device type CUDA.  Torch not compiled with CUDA enabled.


We also test the network on GPU.

In [7]:
net.eval() # put network into evaluation model
test_loss = 0
correct = 0

# prevents unnecessary gradient computation during test - can lead to time and memory saving
with torch.no_grad(): 
    for data, target in test_loader:
        # place batch onto GPU
        data, target = data.to('cuda'), target.to('cuda')
        
        output = net(data)
        
        # sum up batch loss
        test_loss += F.nll_loss(output, target, size_average=False).item() 
        
        # get the index of the max log-probability
        pred = output.max(1, keepdim=True)[1] 
        
        # count number of times where max probability matches the label index
        correct += pred.eq(target.view_as(pred)).sum().item()

# divide the test loss by number of samples in the test set
test_loss /= len(test_loader.dataset)

print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))


Test set: Average loss: 0.0648, Accuracy: 58811/60000 (98%)

