# Convolutional Neural Networks

The code below is slightly modified from the [official PyTorch tutorials](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) on image classifiers. 

If you are interested in building these systems on your own data, I highly recommend you start by working through the [60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) tutorial. Then, look at the [Examples Repository](https://github.com/pytorch/examples/) for some concrete examples that do various things, and finally, get very, very familiar with the [PyTorch API Documentation](https://pytorch.org/docs/stable/) page. If you write PyTorch code as often as I do, you'll have that page almost permanently open.

## Imports

We start by importing modules. Don't worry too much about these right now; they basically give us the functionality we need to put together our model. As you learn more about PyTorch, you'll learn which packages are used more often (like `torch.nn` and `torch.nn.functional`) and which are more specialized, and also what the conventions are for importing different packages (like the fact that `torch.nn.functional` is often imported as `F`). These conventions are optional, but highly recommended so people can quickly understand your code.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
import torchvision.transforms as transforms

import matplotlib.pyplot as plt
import numpy as np

## Network Architecture

Right away, we can design our network. In practice the network code would be hidden in an external script and then imported, but since this is a tutorial notebook it makes sense to put it here.


### Overview

PyTorch model classes are typically written in a similar way; they must include at least two methods:

- The `__init__()` method defines the "building blocks" of the network which you will use in the forward pass. It also allows you to pass arguments to the constructor -- for example, the number of filters you want, allowing you to change the architecture by passing in different arguments.
- The `forward()` method defines how the building blocks are assembled into a full network. The input to the forward pass is what gets passed when you call `model(data)`, and the output values are what gets returned at the end.

The `forward()` function is key; because the network is defined here using PyTorch operations, PyTorch is able to calculate the gradient automatically through backpropagation. 

Below I've written out some concise guidelines for each of the elements used in the code. Clicking on the headers will take you to the PyTorch documentation for that function.

### [nn.Conv2d](https://pytorch.org/docs/stable/nn.html#conv2d)

`torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)`

Applies a 2D convolution over an input signal composed of several input planes. The arguments used below are:

- `in_channels`: How many channels are coming into this layer?
- `out_channels`: How many channels are being sent out of this layer? (Think of this like the number of filters you want to use to represent the data at the current scale)
- `kernel_size`: How big of a kernel do you want to use to represent your data? Think of this like the "neighborhood" of your convolution.

The other arguments (`stride`, `padding`, `dilation`, `groups`, and `bias`) all maintain their default values.
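To make the effect of these arguments concrete, here is a small sketch of the output-size formula given in the `Conv2d` documentation, applied along one spatial dimension. The helper name `conv2d_output_size` is my own and is not part of PyTorch.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output height (or width) of a 2D convolution for a given input size.

    Hypothetical helper for illustration; it follows the shape formula
    in the nn.Conv2d documentation.
    """
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# A 32-pixel side with a 5x5 kernel and all other arguments at their defaults
print(conv2d_output_size(32, kernel_size=5))             # 28

# Padding of 2 on each side would keep the size unchanged for a 5x5 kernel
print(conv2d_output_size(32, kernel_size=5, padding=2))  # 32
```

The number of channels is not part of this formula: it is set directly by `out_channels`.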

### [nn.MaxPool2d](https://pytorch.org/docs/stable/nn.html#maxpool2d)

`torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)`

Downsamples the input signal through maximum pooling. This is a straightforward operation which does NOT have weights / biases trained on it. Below, the arguments are:

- `kernel_size`: The size of the window to use for performing the downsampling.
- `stride`: How much to move the filter at each step.
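
As a quick illustration (a toy sketch on a made-up 4x4 input, not part of the CIFAR-10 pipeline), the following shows that a 2x2 pooling window with a stride of 2 halves the height and width, and that the layer has no trainable parameters:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# One single-channel 4x4 "image" holding the values 0..15,
# shaped [batch, channels, height, width]
x = torch.arange(16.0).view(1, 1, 4, 4)
y = pool(x)

# Each non-overlapping 2x2 window is replaced by its maximum value
print(y.shape)
print(y.view(2, 2))

# The pooling layer has nothing to train
n_pool_params = len(list(pool.parameters()))
print(n_pool_params)
```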

### [nn.Linear](https://pytorch.org/docs/stable/nn.html#linear)

`torch.nn.Linear(in_features, out_features, bias=True)`

This is equivalent to a "traditional" (fully-connected) neural network layer:

$$ \mathbf{y} = \mathbf{x}\mathbf{W}^{T} + \mathbf{b}$$

All you really need for this layer are:

- `in_features`: The number of input neurons coming into this layer.
- `out_features`: The number of outputs from this layer -- equivalent to the number of "hidden" neurons at the layer.
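
If you want to convince yourself that the layer really computes the equation above, here is a minimal sketch that compares the layer's output with the matrix product written out by hand. The sizes (3 inputs, 2 outputs, a batch of 4) are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(in_features=3, out_features=2)
x = torch.randn(4, 3)          # a batch of 4 samples with 3 features each

y = layer(x)

# The weight matrix is stored as [out_features, in_features],
# which is why the equation uses its transpose
y_manual = x @ layer.weight.T + layer.bias

matches = torch.allclose(y, y_manual, atol=1e-6)
print(layer.weight.shape, y.shape, matches)
```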

### [nn.functional.relu](https://pytorch.org/docs/stable/nn.html#id26)

`torch.nn.functional.relu(input, inplace=False)`

Applies the rectified linear unit (ReLU) nonlinearity to the input data:

$$ \textrm{ReLU}(x) = \max{(0, x)} $$

This is the "functional" version of the `nn.ReLU` module; this means that you don't need to define it in the `__init__()` method before applying it here.
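
The two forms give the same result, as this small sketch on a hand-picked tensor shows. Which one you use is mostly a matter of style.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0])

# Functional form: call it directly, nothing to set up beforehand
out_functional = F.relu(x)

# Module form: create the layer first (normally in __init__), then call it
relu_module = nn.ReLU()
out_module = relu_module(x)

print(out_functional)   # negative values are clamped to zero
print(torch.equal(out_functional, out_module))
```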

### Tracking Data Through the Network

You need to do a bit of math to figure out the size of the data as it moves through the network. If you don't do this properly, you'll get size-mismatch errors when you try to run data through the model. 

The network below has the following architecture:

- conv1: Conv2d
    - Output channels: 6, Kernel size: 5, Stride: 1, Padding: 0, Dilation: 1
        - Change: **32x32x3** -> **28x28x6**
        - Explanation: A kernel size of 5 with no padding means that the filter will start 2 pixels "in" on each side, so the height and width are each reduced by 4
- ReLU: No change in data size
- pool: MaxPool2d
    - Kernel size: 2, Stride: 2
        - Change: **28x28x6** -> **14x14x6** 
        - Explanation: Height and width are each halved (the data is downsampled by 50% along each spatial dimension)
- conv2: Conv2d
    - Output channels: 16, Kernel size: 5, Stride: 1, Padding: 0, Dilation: 1
        - Change: **14x14x6** -> **10x10x16** 
        - Explanation: Same kernel size, padding, and stride as conv1, so again reduce height and width by 4
- ReLU: No change in data size
- pool: MaxPool2d
    - Same parameters as previous pooling layer
        - Change: **10x10x16** -> **5x5x16** 
        - Explanation: Downsampled by 50%
- fc1: Linear
    - Input features: 16 * 5 * 5 elements ("flattened" image from previous layer)
    - Output features: 120
- ReLU: No change in data size
- fc2: Linear
    - Input features: 120
    - Output features: 84
- ReLU: No change in data size
- fc3: Linear
    - Input features: 84
    - Output features: 10 (number of desired classes -- these are our output nodes)
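
Rather than trusting the arithmetic, you can check the sizes in the list above by pushing a dummy tensor through stand-alone copies of the layers. This is only a sketch: the layers here are freshly created (untrained) and the `shapes` dictionary is just a convenience for the example. Note that PyTorch reports sizes as `[batch, channels, height, width]`, whereas the list above writes them as height x width x channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer settings as in the list above
conv1 = nn.Conv2d(3, 6, 5)
pool = nn.MaxPool2d(2, 2)
conv2 = nn.Conv2d(6, 16, 5)

shapes = {}
with torch.no_grad():
    x = torch.zeros(1, 3, 32, 32)      # one dummy 32x32 RGB image
    shapes['input'] = tuple(x.shape)
    x = F.relu(conv1(x))
    shapes['conv1'] = tuple(x.shape)
    x = pool(x)
    shapes['pool1'] = tuple(x.shape)
    x = F.relu(conv2(x))
    shapes['conv2'] = tuple(x.shape)
    x = pool(x)
    shapes['pool2'] = tuple(x.shape)
    x = x.view(-1, 16 * 5 * 5)         # flatten for the first linear layer
    shapes['flatten'] = tuple(x.shape)

for name, shape in shapes.items():
    print('{:8s} {}'.format(name, shape))
```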

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        
        # Define the convolutional layers
        # Notes: 
        #  - The number of input channels to conv1 (the first convolutional layer) is the number of channels
        #    in our input data. For RGB images, this should be 3.
        #  - The number of output channels of conv1 is 6; since this data will then be passed
        #    to conv2, then the number of input channels to conv2 should also be 6.
        
        self.conv1 = nn.Conv2d(3, 6, 5)
        
        # Define the pooling layer
        self.pool = nn.MaxPool2d(2, 2)
        
        self.conv2 = nn.Conv2d(6, 16, 5)
        
        # Define the fully-connected layers
        # Notes:
        #  - You need to figure out how many input values you have to the first fully-connected layer
        #    which means you have to do the work in the previous section to figure out what size your image
        #    is going to be at this point in the network.
        #  - The number of output nodes here is up to you; whether this expands and then contracts, or just 
        #    contracts, is a matter of preference.
        #  - The final linear layer should have a number of outputs equivalent to the number of classes you want.
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        
        x = x.view(-1, 16 * 5 * 5)
        
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        
        x = self.fc3(x)
        
        return x

net = Net()

## Getting to Know Your Model (Visualizing and Understanding Model Parameters)

You can visualize the model structure easily by calling methods that work on `nn.Module` classes. See [the documentation of nn.Module](https://pytorch.org/docs/stable/nn.html#module) for a list of the valid methods you can call.

In [None]:
print('Name:\t\tParameters Size:')
print('-----\t\t----------------')
for name, param in net.named_parameters():
    print('{}\t{}'.format(name, np.array(param.size())))

We can get the total number of parameters by multiplying out the dimensions of each tensor listed above and adding the results together:

In [None]:
6*3*5*5 + 6 + 16*6*5*5 + 16 + 120*400 + 120 + 84*120 + 84 + 10*84 + 10

We can also do this in a bit more of a pythonic way, rather than simply typing in all the numbers:

In [None]:
print('Total number of trainable parameters: {}'.format(sum(p.numel() for p in net.parameters() if p.requires_grad)))

The `requires_grad` property says whether the associated parameter is "trainable" -- if it is `False`, then no gradient is computed for that parameter and the optimizer won't update it during training. We'll talk about this a bit more later when we actually train the model.
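
Here is a minimal sketch of what "freezing" parameters looks like, using a tiny stand-in model rather than the network above. The helper `count_trainable` is a name I made up for the example.

```python
import torch.nn as nn


def count_trainable(model):
    """Hypothetical helper: number of parameters with requires_grad=True."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Tiny stand-in model: (4*3 + 3) + (3*2 + 2) = 23 parameters
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))
before = count_trainable(model)

# "Freeze" the first linear layer so it is excluded from training
for p in model[0].parameters():
    p.requires_grad = False
after = count_trainable(model)

print(before, after)   # 23 8
```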

If you want, you could even go crazy and look at the exact values of each of the parameters in the model:

In [None]:
for param in [p for p in net.parameters() if p.requires_grad]:
    print(param)

## Working with Data: Datasets, Dataloaders

For this demo we're using the [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html). This is such a common dataset for demonstrating and testing CNNs that the `torchvision` package has special methods for automatically downloading and working with this dataset.

The first thing we want to do is create a PyTorch "transform" for our images -- this will convert the input images to the proper format (PyTorch Tensor), and will normalize the images as well. We will use this transform in the next section.

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
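
To see what this normalization does to the pixel values, here is a sketch of the same arithmetic written out with plain tensors (it does not call `transforms.Normalize` itself). `ToTensor` scales pixel values to the range [0, 1]; subtracting a mean of 0.5 and dividing by a standard deviation of 0.5 then maps them to [-1, 1]. The pixel values below are made up for the example.

```python
import torch

# A few example pixel intensities, as they would look after ToTensor()
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])

mean, std = 0.5, 0.5
normalized = (pixels - mean) / std     # values now lie in [-1, 1]

# Undoing the normalization: multiply by std and add the mean back,
# which for these settings is the same as x / 2 + 0.5
restored = normalized / 2 + 0.5

print(normalized)
print(restored)
```

The `img / 2 + 0.5` step is the same un-normalization used in the `imshow` helper later in this notebook.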

Next we define "datasets" and "dataloaders". These are PyTorch-specific things which take care of data input-output, as well as creating batches, randomizing the order during training, etc. They are basically iterators that allow you to "step through" a dataset -- we can iterate through them to pull out samples, either for training or testing.

If you have different types of datasets, you can load them in various ways. Check out:

- The [`torch.utils.data`](https://pytorch.org/docs/stable/data.html) documentation on the different classes and functions for defining datasets;
- The [`torchvision.datasets`](https://pytorch.org/docs/stable/torchvision/datasets.html) page for a list of pre-defined benchmark image databases;
- The [`torchvision.datasets.ImageFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#imagefolder) section for building your own image database starting with image files contained in class folders;
- The [`torchvision.datasets.DatasetFolder`](https://pytorch.org/docs/stable/torchvision/datasets.html#datasetfolder) for a format-agnostic version of `ImageFolder`; and finally
- The [Data Loading and Processing Tutorial](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) for a nice example of how to build your own dataset code, allowing you to do some neat things with multi-modal datasets (images plus landmarks plus captions?)

Note that since we have different requirements for training and testing, we have to create two different dataloaders (for example, the test loader doesn't have to shuffle the data, might require different transforms, etc.)

In [None]:
trainset = torchvision.datasets.CIFAR10(root='./', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
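
If you just want to see how a dataset and a dataloader fit together without downloading anything, here is a small sketch using random tensors in place of real images. The names `fake_images` and `fake_labels` are placeholders I made up; `TensorDataset` and `DataLoader` are the standard classes from `torch.utils.data`.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Ten random "images" shaped like CIFAR-10 samples, with random class labels
fake_images = torch.randn(10, 3, 32, 32)
fake_labels = torch.randint(0, 10, (10,))

dataset = TensorDataset(fake_images, fake_labels)
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)

# Iterating over the loader yields (images, labels) batches;
# the last batch is smaller because 10 is not a multiple of 4
batch_sizes = [images.size(0) for images, labels in loader]
print(len(dataset), len(loader), batch_sizes)   # 10 3 [4, 4, 2]
```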

Since these are image-based datasets, we want to create a small helper function for displaying an image (meaning we don't have to type the same bit of code over and over):

In [None]:
def imshow(img):
    # Un-normalize the image
    img = img / 2 + 0.5
    npimg = img.numpy()
    
    # Need to transpose the image to get the channels in the right spot
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()

To use the dataloaders, you first create an iterator that you can then use to "click" through a dataset. Each time you call Python's built-in `next()` on the iterator, it will generate a new batch of samples. (The older `dataiter.next()` spelling no longer works in recent PyTorch releases.)

In [None]:
# Create an iterator for the training dataset
dataiter = iter(trainloader)

# Call `next()` on the iterator to get a batch of images and list of corresponding class labels
images, labels = next(dataiter)

# It's a good idea to make sure we actually got some data that we can use, and what size / format it's in
print('Size of the image batch: {}'.format(images.size()))
print('Size of the labels batch: {}'.format(labels.size()))

Notice the size of the image batch: the way that PyTorch organizes image batches, the dimensions are: `[batch, channels, height, width]`. 

Often, we want to view a batch of samples at once; because this is such a common task, there is a utility function `torchvision.utils.make_grid()` which takes in a batch of images and returns a stitched grid of images consisting of all the images in a batch. 

(The grid is still stored channels-first, so we still have to do the "transpose" inside our helper `imshow` function.)

In [None]:
imshow(torchvision.utils.make_grid(images, pad_value=1))

# Print the labels as well
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))

So now we have some images and a way of accessing them, as well as a way to figure out the class of the image. Now we can actually begin training!

## Training: Loss Functions, Optimizer, and Backpropagation

Before we start training, there are two more important things we need to identify:

- The ***Loss Function***, which is responsible for penalizing our network for mistakes AND is the quantity from which the gradients for all the weights in the network are computed;
- The ***Optimization Algorithm***, which actually goes about changing the weights in response to the gradients computed from the loss.

These are separate algorithms and are defined separately in PyTorch.

### Selecting a Loss Function

We can either define our own function or we can use one provided by PyTorch. Look [here in the documentation](https://pytorch.org/docs/stable/nn.html#loss-functions) for a list of possible loss functions and what they're used for. 

The main thing for the loss functions is that the loss value they return is a tensor with a `.backward()` method, which allows you to automatically calculate the gradient for the network parameters. Actually **how** this works is beyond the scope of this notebook, but you can start by looking at the [documentation for PyTorch's Autograd mechanics](https://pytorch.org/docs/stable/notes/autograd.html). For now, it's fine to just consider it "magic."

For our purposes, we'll be using [the Cross Entropy Loss function](https://pytorch.org/docs/stable/nn.html#crossentropyloss). 

#### Pay Attention to your Loss Functions!

It is **critical** to understand what values your loss function expects and what it's comparing them to. Study the loss function API and associated examples carefully, so you know what shape your inputs and labels should be and what their values should contain. 

`CrossEntropyLoss` combines a [log softmax function, `LogSoftmax`](https://pytorch.org/docs/stable/nn.html#logsoftmax), which takes the raw output of the network and converts it to log-probabilities (whose exponentials sum to 1 across the classes), and a [negative log likelihood loss function, `NLLLoss`](https://pytorch.org/docs/stable/nn.html#nllloss). This is why we don't need to have a `Softmax` in the network definition (because our loss function takes care of that). Here, the labels are class indices (integers from 0 to 9), one per sample.

In [None]:
criterion = nn.CrossEntropyLoss()
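
To check the claim that `CrossEntropyLoss` is a log softmax followed by a negative log likelihood loss, here is a short sketch on made-up data (random "network outputs" for 4 samples and 10 classes, with hand-picked labels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)               # raw network outputs: [batch, classes]
targets = torch.tensor([3, 0, 9, 1])      # class indices, one per sample

# The combined loss, as used in this notebook
loss_ce = nn.CrossEntropyLoss()(logits, targets)

# The same thing in two explicit steps
log_probs = F.log_softmax(logits, dim=1)
loss_nll = F.nll_loss(log_probs, targets)

# Exponentiating the log-probabilities gives values that sum to 1 per sample
prob_sums = log_probs.exp().sum(dim=1)

print(loss_ce.item(), loss_nll.item())
print(prob_sums)
```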

### Selecting an Optimizer

Here, we [select an optimizer](https://pytorch.org/docs/stable/optim.html) -- this is the function responsible for updating our parameter values.

There are several algorithms implemented in `torch.optim`, but they all require two things: 

- A set of parameters that are going to be changed -- in our case, this is the list of the trainable parameters of the network, which we get by `net.parameters()`; and
- A set of "hyper-parameters" that control the behavior of the optimizer -- things like the learning rate `lr` and momentum.

You can also specify different hyper-parameters for different parts of the network, for example if you wanted the convolutional layers of the network to learn faster than the fully-connected layers. We won't do this, but sometimes you may have a good reason for tweaking these settings.
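
For reference, here is a sketch of what per-layer hyper-parameters look like, using the "parameter groups" form described in the `torch.optim` documentation. The two stand-alone layers and the learning rates are arbitrary choices for the example, not recommendations.

```python
import torch.nn as nn
import torch.optim as optim

# Two stand-alone layers standing in for different parts of a network
conv = nn.Conv2d(3, 6, 5)
fc = nn.Linear(16 * 5 * 5, 10)

# Each dictionary is one parameter group; options not given in a group
# (here, `lr` for the second group) fall back to the keyword defaults
optimizer = optim.SGD(
    [
        {'params': conv.parameters(), 'lr': 0.01},
        {'params': fc.parameters()},
    ],
    lr=0.001,
    momentum=0.9,
)

group_lrs = [group['lr'] for group in optimizer.param_groups]
print(group_lrs)   # [0.01, 0.001]
```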

#### Which Optimizer to Use?

Just like selecting nonlinearities, which optimizer to use is a matter of debate and subjectivity. The goal of the optimizer is to train the network smoothly, without over-fitting -- if you have a system that either does not learn anything, or which over-fits very quickly, it may be worth it to switch optimizers or adjust the hyper-parameters.

In our case, we will use the [Stochastic Gradient Descent](https://pytorch.org/docs/stable/optim.html#torch.optim.SGD) optimizer.

In [None]:
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

### Training Procedure

Now, we basically have to run our training algorithm in one big loop. In each epoch, the process iterates through the whole training set, one "mini-batch" of samples at a time. The main process for each batch is:

1. Get the inputs (images) and the labels for each sample in the batch
2. Zero out the parameter gradients (so we calculate them anew for each batch)
3. Run the forward pass through the network to get the current outputs
4. Calculate the loss of the network in the batch by comparing the outputs to the labels
5. Calculate the gradient of the loss with respect to the network parameters by calling `.backward()` on the loss
6. Use the optimizer to calculate the next step of weight modifications, and apply them
7. Calculate the loss for this batch, and add it to a running average of loss values

Each of these steps is highlighted in the code below.

We repeat this for each batch of samples in the dataloader; once we finish one pass through the dataloader, we have completed one epoch. We can then repeat epochs over and over until we reach some stopping point. For now, we will just run two epochs.



In [None]:
# Start the training process
for epoch in range(2):

    # Reset the loss for this epoch
    running_loss = 0.0
    
    # Iterate through the training data
    for i, data in enumerate(trainloader, 0):
        # Pull out this batch of data 
        # Recall that the trainloader was created with the option `batch_size=4`
        # Thus, each time we get `inputs` and `labels`, we get the data for 4 samples
        inputs, labels = data

        # Zero the parameter gradients
        # PyTorch adds new gradients to the ones already stored each time `.backward()` is called,
        # so without this step the gradients from earlier batches would be mixed into this batch's
        optimizer.zero_grad()

        # Calculate the forward pass for these input samples
        outputs = net(inputs)
        
        # Calculate the loss by comparing the network output to the ground-truth labels (from trainloader)
        loss = criterion(outputs, labels)
        
        # Calculate the backward pass (backpropagation) for the loss function
        loss.backward()
        
        # Calculate the optimization step, tweaking the actual parameters (weights) of the network
        optimizer.step()

        # Add the loss of this batch to the running loss
        running_loss += loss.item()
        
        # Print the loss every 2000 batches and then reset it
        if i % 2000 == 1999:    
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
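
The "zero the gradients" step in the loop above is easy to overlook, so here is a tiny sketch of what happens without it. It uses a single hand-made parameter `w` instead of a network, and resets the gradient by hand with `w.grad.zero_()`; `optimizer.zero_grad()` does the equivalent reset for every parameter the optimizer manages.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw of sum(3 * w) is 3 for each element
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added on top
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Zero the stored gradient, then run backward again
w.grad.zero_()
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```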

## Evaluate the Model on Testing Data

Let's see how well our trained model can predict the labels of testing data -- which, remember, the network has not seen during training.

### Viewing a Single Batch

First, let's pull out a single batch of testing data and print it.

In [None]:
dataiter = iter(testloader)
images, labels = next(dataiter)

# Display the images
imshow(torchvision.utils.make_grid(images))

# Print out the ground truth labels for the testing data
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

Now, let's put the data through the network -- this is a simple forward pass, and should be quick.

We also print the outputs; after the transpose, each column of the printed output is a sample, and each row is one of the class outputs. 

**Remember:** A larger network output means the network considers that class more likely, but since we haven't used a `softmax()` layer in our network, these outputs are not probabilities (they do not sum to 1, they can be greater than 1, and they can be less than 0).

In [None]:
outputs = net(images)

print('Raw network output:')
print(torch.transpose(outputs, 1, 0))

Let's get a class label by finding the row (class) of the maximum value for each sample, and then using the class index to print out the predicted class.

In [None]:
_, predicted = torch.max(outputs, 1)

print('Ground Truth:\t', '\t'.join('%5s' % classes[labels[j]] for j in range(4)))
print('Predict:\t', '\t'.join('%5s' % classes[predicted[j]]
                              for j in range(4)))
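
If you do want probabilities, you can apply a softmax to the raw outputs yourself. The sketch below uses hand-written "network outputs" for two samples and three classes (made up for the example) and shows that the softmax does not change which class has the largest value, so the predicted label is the same either way.

```python
import torch
import torch.nn.functional as F

# Made-up raw outputs: 2 samples, 3 classes
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1,  0.2, 3.0]])

# Softmax over the class dimension turns each row into probabilities
probs = F.softmax(logits, dim=1)

# torch.max along dim 1 returns (maximum values, indices of the maxima)
values, predicted = torch.max(logits, 1)
predicted_from_probs = probs.argmax(dim=1)

print(probs.sum(dim=1))                  # each row sums to 1
print(predicted, predicted_from_probs)   # same class indices
```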

### Evaluating All Testing Data

Let's iterate through the testing dataloader, pulling out our samples and performing the same operations as we just did on each batch. We'll keep track of the total number of samples as well as the correct samples to calculate our accuracy at the end.

In [None]:
correct = 0
total = 0

# This line tells PyTorch to not keep track of gradients for our testing forward passes
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

Remember, we have 10 classes, so if our classifier is randomly guessing, we would expect an accuracy of 10%.

### Calculate Class-by-Class Performance

We might be interested in seeing how the performance varies based on the target class -- maybe it's harder to distinguish planes than cats?

So here we're just going to do the same thing as we just did, except our `total` and `correct` counts will be lists where each list element holds the data for a different class.

In [None]:
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels)
        for i in range(labels.size(0)):
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1


for i in range(10):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))
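
As an aside, the per-class counts can also be computed without the inner Python loop. This is an alternative sketch, not what the cell above does: it uses `torch.bincount` on a small set of made-up labels and predictions for three classes.

```python
import torch

n_classes = 3

# Made-up ground-truth labels and predictions for 7 samples
labels    = torch.tensor([0, 1, 2, 2, 1, 0, 2])
predicted = torch.tensor([0, 2, 2, 2, 1, 1, 0])

# How many samples of each class there are
class_total = torch.bincount(labels, minlength=n_classes)

# How many of those were predicted correctly: count the labels of the
# samples where the prediction matches
class_correct = torch.bincount(labels[predicted == labels], minlength=n_classes)

class_accuracy = class_correct.float() / class_total.float()
print(class_total, class_correct, class_accuracy)
```

In a real evaluation you would add these counts up across all batches of the test loader before dividing.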

## Advanced: Visualizing Filter Outputs

Fundamentally the convolutional layers are applying learned filters to the input images. We can pull the filter outputs out of our trained network to see what they look like, which may give us an idea of what characteristics of the image the system is looking at.

In PyTorch, we can create an "intermediate" CNN by slicing into the network's children and pulling out only the first few layers. Below, the slice `[0:1]` keeps just the first child, `conv1`. 

In [None]:
# Grab the intermediate layers to see what the data looks like as it travels through the trained network
net_intermediate = nn.Sequential(*list(net.children())[0:1])
net_intermediate.eval()
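
One thing to keep in mind when slicing `children()`: the modules come back in the order they were assigned in `__init__()`, and anything applied functionally in `forward()` (like `F.relu`) is not a child at all. The sketch below shows this on a cut-down, untrained stand-in class (`TinyNet` is a name I made up) rather than the trained `net`.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Cut-down stand-in with the same first three layers as the network above."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        return self.pool(torch.relu(self.conv2(x)))


tiny = TinyNet()

# Children are listed in definition order; the ReLU calls do not appear
child_names = [name for name, _ in tiny.named_children()]
print(child_names)   # ['conv1', 'pool', 'conv2']

# Keeping only the first child gives the raw conv1 output (no ReLU, no pooling)
first_layer = nn.Sequential(*list(tiny.children())[0:1])
with torch.no_grad():
    out = first_layer(torch.zeros(1, 3, 32, 32))
print(out.shape)
```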

This intermediate network's output is now just the output of the first convolutional layer (before the ReLU and pooling are applied). We can push some testing data through the network and visualize the output of each filter.

In [None]:
dataiter = iter(testloader)
images, labels = next(dataiter)

# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))

In [None]:
image_outputs = net_intermediate(images)

# Calculate the size of the outputs to construct our matplotlib plot
nImages, nFilters, height, width = image_outputs.shape

f, ax = plt.subplots(nImages, nFilters+1, figsize=(25,12))

for idx_img in range(nImages):
    # Un-normalize the input image (as in `imshow` above) so matplotlib gets values in [0, 1]
    ax[idx_img][0].imshow(np.moveaxis((images[idx_img] / 2 + 0.5).numpy(), 0, -1))
    for idx_filter in range(nFilters):
        ax[idx_img][idx_filter+1].imshow(image_outputs[idx_img,idx_filter,:,:].detach().numpy(), cmap=plt.cm.gray)