# PyTorch Tutorial & Homework - Neural Networks
Prof. Lim Kwan Hui, with many thanks to Prof. Dorien Herremans for the initial version and Nelson Lui for the base text.

Homework questions are at the end of the tutorial.

**To edit the notebook**:

There are two ways to edit the notebook.

You can either open it in the "playground", where you can change and run cells. After closing the tab, your changes will be lost. To do so, press "File" > "Open in playground".

Alternatively, you can make a copy of this notebook to your own Google Drive account through "File" > "Save a copy in Drive..."

**Activating the GPU on Colab**:

Colab now gives you 12 hours of free GPU time (before you have to request a new node).
Simply select "GPU" in the Accelerator drop-down in Notebook Settings (either through the Edit menu or the command palette at cmd/ctrl-shift-P).

# Setting up the notebook on Colab

Let's check if we are using the GPU environment and CUDA is available:

In [None]:
# Import PyTorch and other libraries
import torch
import numpy as np
from tqdm import tqdm

print("PyTorch version:")
print(torch.__version__)
print("GPU Detected:")
print(torch.cuda.is_available())

# Define a flag for later: True when PyTorch can use a CUDA-capable GPU.
using_GPU = torch.cuda.is_available()

# Computation Graphs

A computation graph is simply a way to define a sequence of operations to go from input to model output.

You can think of the nodes in the graph as operations, and the edges as the tensors flowing in and out of them.

For example, say we wanted to build a linear regression model. This has the form $\hat y = Wx + b$.

In this equation, $x$ is our input, $W$ is a learned weight matrix, $b$ is a learned bias, and $\hat y$ is the predicted output.

As a computation graph, this looks like:

![Linear Regression Computation Graph](https://imgur.com/IcBhTjS.png)

When implementing deep learning models, you're basically designing and specifying computation graphs. It's a bit like playing with Legos in that you're stringing together a bunch of blocks (the operations) to achieve a final desired output.
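
As a rough sketch of the linear regression graph above, the snippet below builds $\hat y = Wx + b$ from plain tensors. The sizes (3 input features, 2 outputs) are arbitrary assumptions chosen for illustration. The `grad_fn` attribute on the result shows that PyTorch recorded the operation that produced it, which is the graph it later uses for backpropagation.

```python
import torch

# Hypothetical sizes for illustration: 3 input features, 2 outputs.
W = torch.randn(2, 3, requires_grad=True)  # learned weight matrix
b = torch.randn(2, requires_grad=True)     # learned bias
x = torch.randn(3)                         # input

# The two graph nodes: a matrix-vector product followed by an addition.
y_hat = W @ x + b

print(y_hat.shape)    # torch.Size([2])
print(y_hat.grad_fn)  # the operation that produced y_hat
```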

# The building blocks of deep learning models

`torch.nn` makes it easy to build neural nets by providing functions for specifying arbitrary computation graphs and abstractions for putting them all together. We'll start by covering a few classes in the `torch.nn` module that form basic building blocks of many deep learning applications.

The classes below are all callable, so you can use them with `outputs = YourDeepLearningBlock(its_inputs)`

In [None]:
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable

## Linear Layers (Affine Transforms)

A linear layer (also known as an affine transform) defines a function:

$$f(x) = Wx + b$$

This linear transform is a core part of deep learning. $W$ and $b$ are the parameters of this layer, where $W$ is a learned weight matrix and $b$ is a learned bias vector.

`nn.Linear()` takes two construction parameters: the dimensionality of the input and the dimensionality of the desired output.

In [None]:
# Create a Linear layer. Input should have 5 dimensions, output will have 3.
lin = nn.Linear(5, 3)
# Data is a matrix of shape (2, 5). Can we use the linear layer on it?
data = torch.randn(2, 5)

# Yes! Running the data matrix through the layer outputs shape (2, 3).
print(lin(data))

In [None]:
# What about a matrix of shape (2, 4, 5)?
data = torch.randn(2, 4, 5)
# This works as well! As long as the last dimension is the specified
# input dimension to the Linear layer, you're good.
# Output shape: (2, 4, 3)
print(lin(data))

In [None]:
# But (5, 2) is an incompatible shape (uncomment and run to see error)
data = torch.randn(5, 2)
# print(lin(data))

In [None]:
# But we can transpose it using t()!
# Now its shape is (2, 5) and all is fine.
print(lin(data.t()))
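
To connect the layer back to the formula $f(x) = Wx + b$, here is a small check that reproduces the output of `nn.Linear` by hand. It relies on `lin.weight` being stored with shape `(out_features, in_features)`, so a batch of row vectors is multiplied by the transpose of the weight.

```python
import torch
import torch.nn as nn

lin = nn.Linear(5, 3)
data = torch.randn(2, 5)

# The weight has shape (out_features, in_features) and the bias (out_features,).
print(lin.weight.shape)  # torch.Size([3, 5])
print(lin.bias.shape)    # torch.Size([3])

# Apply W x + b to every row of the batch by hand.
manual = data @ lin.weight.t() + lin.bias
print(torch.allclose(lin(data), manual, atol=1e-6))
```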

## Nonlinearities / Activation Functions

Since composing linear transformations gives you a linear transformation, we don't gain any representational power by just chaining `Linear` layers.

In deep learning, we add nonlinearities after our Linear transforms, which lets us build more powerful models.
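
The claim about chained linear layers can be checked numerically. The sketch below (layer sizes are arbitrary assumptions) collapses two `Linear` layers into a single weight matrix and bias, and shows that putting a ReLU between them breaks the equivalence.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 6)
lin2 = nn.Linear(6, 2)
x = torch.randn(3, 4)

# lin2(lin1(x)) = x W1^T W2^T + (W2 b1 + b2): a single affine transform.
W = lin2.weight @ lin1.weight            # shape (2, 4)
b = lin2.weight @ lin1.bias + lin2.bias  # shape (2,)
collapsed = x @ W.t() + b
same_without_relu = torch.allclose(lin2(lin1(x)), collapsed, atol=1e-5)

# With a nonlinearity in between, the single affine map no longer matches.
same_with_relu = torch.allclose(lin2(torch.relu(lin1(x))), collapsed, atol=1e-5)

print(same_without_relu, same_with_relu)
```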

PyTorch comes with a veritable zoo of nonlinearities.

In [None]:
data = torch.randn(2, 3)
print(data)

# Nonlinearities are layers too!
relu = nn.ReLU()
print(relu)
print(relu(data))

tanh = nn.Tanh()
print(tanh)
print(tanh(data))

sigmoid = nn.Sigmoid()
print(sigmoid)
print(sigmoid(data))

If you'd prefer to not create a class for the nonlinearity, you can also call it functionally as below:

In [None]:
data = torch.randn(2, 3)
print(data)

# Nonlinearities can also be used functionally, with no need to create a class!
print("ReLU:")
print(torch.relu(data))

print("tanh:")
print(torch.tanh(data))

print("Sigmoid:")
print(torch.sigmoid(data))

## Dropout

Dropout is used to regularize our models by randomly setting some outputs to 0.

This helps to prevent overfitting by encouraging the model to look beyond specific spurious patterns and find features that generalize.

**Note that we should only apply dropout during training!**

In [None]:
data = torch.randn(2, 3)
print(data)

# Create a Dropout layer and call it on input
# Here, the probability of zeroing an element is 0.5
dropout = nn.Dropout(0.5)
print(dropout)
print(dropout(data))

# Use dropout functionally. F.dropout defaults to training=True, so we pass
# training=False explicitly; the data then passes through unchanged.
print("Functional dropout, training=False")
print(F.dropout(data, 0.5, training=False))

# Set training=True, so things are dropped out
print("Functional dropout, training=True")
print(F.dropout(data, 0.5, training=True))
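
For the `nn.Dropout` module, the training/evaluation switch is controlled by `.train()` and `.eval()` rather than by an argument. The sketch below uses an all-ones input (an assumption made so the effect is easy to read) to show two things: in training mode the surviving entries are scaled by $1/(1-p)$, and in eval mode the layer is the identity.

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(0.5)
data = torch.ones(1000)

# Training mode: each entry is zeroed with probability 0.5, and the
# survivors are scaled by 1 / (1 - 0.5) = 2 to keep the expected value.
dropout.train()
train_out = dropout(data)
print(train_out.unique())

# Eval mode: dropout does nothing.
dropout.eval()
eval_out = dropout(data)
print(torch.equal(eval_out, data))
```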

# Structuring PyTorch models

At the highest level, `nn.Module` defines what most would refer to as a "model". It's a convenient way of encapsulating the trainable parameters of a model or of a component of your model, and subclassing it gives you methods for moving your model to the GPU, saving it, loading it, etc.

When you're building your own model, you're going to subclass `nn.Module`. Critically, you also need to override the `__init__()` and `forward()` functions.

*   In `__init__()`, you should take arguments that modify how the model runs (e.g. # of layers, # of hidden units, output sizes). You'll also set up most of the layers that you use in the forward pass here.
*   In `forward()`, you define the "forward pass" of your model, or the operations needed to transform input to output. **You can use any of the Tensor operations in the forward pass.**
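
As a minimal sketch of this pattern, the class below has one layer and a log-softmax output. `TinyClassifier` and its sizes are hypothetical names chosen for illustration. Layers assigned as attributes in `__init__()` are registered automatically, so their parameters show up in `named_parameters()`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):  # hypothetical example model
  def __init__(self, input_size, num_classes):
    super().__init__()
    # Layers created here are registered as submodules of the model.
    self.linear = nn.Linear(input_size, num_classes)

  def forward(self, x):
    # Any tensor operations can be used in the forward pass.
    return F.log_softmax(self.linear(x), dim=-1)

model = TinyClassifier(input_size=4, num_classes=3)
out = model(torch.randn(2, 4))  # calling the model runs forward()
param_names = [name for name, _ in model.named_parameters()]

print(out.shape)              # torch.Size([2, 3])
print(out.exp().sum(dim=-1))  # each row of probabilities sums to ~1
print(param_names)            # ['linear.weight', 'linear.bias']
```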



### Feed-forward neural net

Returning to the simple neural network we covered in the lecture, we can add some intermediate layers (called hidden layers), nonlinearities, and dropout for regularization. This is essentially a multi-layer feed-forward neural net, and its implementation as a module is outlined below:

In [None]:
class FeedForwardNN(nn.Module):
  # input_size: Dimensionality of input feature vector.
  # num_classes: The number of classes in the classification problem.
  # num_hidden: The number of hidden (intermediate) layers to use.
  # hidden_dim: The size of each of the hidden layers.
  # dropout: The proportion of units to drop out after each layer.
  def __init__(self, input_size, num_classes, num_hidden, hidden_dim, dropout):
    # Always call the superclass (nn.Module) constructor first!
    super(FeedForwardNN, self).__init__()

    # Set up the hidden layers.
    assert num_hidden > 0
    # A special ModuleList to store our hidden layers.
    self.hidden_layers = nn.ModuleList([])
    # First hidden layer maps from input_size -> hidden_dim.
    self.hidden_layers.append(nn.Linear(input_size, hidden_dim))
    # Subsequent hidden layers map from hidden_dim -> hidden_dim.
    # Note that they can map to any dimensionality --- as long as the final
    # output is a distribution over your classes!
    for i in range(num_hidden - 1):
      self.hidden_layers.append(nn.Linear(hidden_dim, hidden_dim))

    # Set up the dropout layer.
    self.dropout = nn.Dropout(dropout)

    # Set up the final transform to a distribution over classes.
    self.output_projection = nn.Linear(hidden_dim, num_classes)

    # Set up the nonlinearity to use between layers.
    self.nonlinearity = nn.ReLU()

  # Forward's sole argument is the input.
  # input is of shape (batch_size, input_size)
  def forward(self, x):
    # Apply the hidden layers, nonlinearity, and dropout.
    for hidden_layer in self.hidden_layers:
      x = hidden_layer(x)
      x = self.dropout(x)
      x = self.nonlinearity(x)

    # Output layer: project x to a distribution over classes.
    out = self.output_projection(x)

    # Log-softmax the out tensor to get a log-probability distribution
    # over classes for each example.
    out_distribution = F.log_softmax(out, dim=-1)
    return out_distribution

# Training PyTorch models: Losses and Optimizers

By now, we've learned how to construct models in PyTorch. In this section, we'll go over how to calculate your model's loss and how to optimize the parameters to minimize the loss.

## Loss Functions

Intuitively, loss functions serve to tell your model how poorly it's doing --- the purpose of training is to adjust the weights of our model to minimize the loss.

A loss function takes a true output $y$ and a model-predicted output $\hat y$ and calculates the loss. If $y = \hat y$, our model produced the correct output and thus our loss is 0. The further our predicted $\hat y$ is from the true $y$, the higher our loss.

PyTorch comes with a large collection of loss functions. The most commonly used loss for classification is negative log likelihood (`nn.NLLLoss` or the closely related `nn.CrossEntropyLoss`). The difference between the two for classification problems is that `nn.NLLLoss` expects the output to be log-softmax normalized, which is easy to do with the `nn.LogSoftmax` layer. `nn.CrossEntropyLoss`, on the other hand, automatically applies the log-softmax; you can think of it as `nn.LogSoftmax` + `nn.NLLLoss`. Which to use depends on whether you want to add the extra `nn.LogSoftmax` to your model's `forward()`.

A common loss used for regression problems is the mean squared error (`nn.MSELoss`).
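
A brief sketch of `nn.MSELoss` on made-up numbers (the predictions and targets below are assumptions for illustration) shows that, with its default settings, it matches the mean of the squared differences computed by hand.

```python
import torch
import torch.nn as nn

# Made-up regression outputs and targets.
predictions = torch.tensor([2.5, 0.0, 2.0])
targets = torch.tensor([3.0, -0.5, 2.0])

mse = nn.MSELoss()
loss = mse(predictions, targets)

# The same quantity by hand: mean of squared differences.
manual = ((predictions - targets) ** 2).mean()

print(loss)
print(torch.allclose(loss, manual))
```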

Here's a usage example of the `CrossEntropyLoss`.

In [None]:
# 3 examples, unnormalized scores over 4 classes.
model_output = torch.rand(3, 4, requires_grad = True)

# The correct labels.
targets = torch.LongTensor([1, 0, 3])

# CrossEntropyLoss
cross_entropy = nn.CrossEntropyLoss()
# Loss, averaged across all 3 batch elements.
# Can call this functionally: avg_loss = F.cross_entropy(model_output, targets)
avg_loss = cross_entropy(model_output, targets)
print("CrossEntropyLoss averaged across all 3 batch elements:")
print(avg_loss)

# Backpropagate wrt avg_loss
avg_loss.backward()
# Print out the gradients of model_output
print("Gradients of model_output")
print(model_output.grad)

And here's a snippet showing that `LogSoftmax` + `NLLLoss` is the same as `CrossEntropyLoss`.

In [None]:
nll = nn.NLLLoss()
log_softmax_model_output = F.log_softmax(model_output, dim=-1)
# Loss, averaged across all 3 batch elements.
# Can call this functionally: avg_loss = F.nll_loss(log_softmax_model_output, targets)
avg_loss = nll(log_softmax_model_output, targets)
print("Negative-Log Likelihood averaged across all 3 batch elements:")
print(avg_loss)
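
Rather than comparing the two printed values by eye, the equivalence can be checked directly. The sketch below uses its own random scores (an assumption, so that it runs on its own) and also computes the loss by hand: pick out the log-probability of the correct class for each example, negate it, and average.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.rand(3, 4)           # unnormalized scores, 3 examples, 4 classes
targets = torch.tensor([1, 0, 3])   # correct class per example

ce = F.cross_entropy(scores, targets)

log_probs = F.log_softmax(scores, dim=-1)
nll = F.nll_loss(log_probs, targets)

# By hand: negative log-probability of the correct class, averaged.
manual = -log_probs[torch.arange(3), targets].mean()

print(torch.allclose(ce, nll), torch.allclose(ce, manual))
```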

## Optimizers

Now that we can calculate the loss and backpropagate through our model (with `.backward()`), we can update the weights and try to reduce the loss!

PyTorch includes a variety of optimizers that do exactly this, from the standard SGD to more recent techniques like Adam and RMSProp.

At construction, PyTorch optimizers take the parameters to optimize. When we run an input through our model, calculate the loss, and backpropagate, the gradients are automatically stored in the `.grad` attribute of each parameter (parameters are Tensors with `requires_grad=True`). With these gradients, the optimizer can update the weights.

Optimizers live in the `torch.optim` module.

In [None]:
import torch.optim as optim
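
The optimizers mentioned above are all constructed the same way: pass in the parameters, then the hyperparameters. The sketch below uses a stand-in `nn.Linear` model, and the learning rates are arbitrary assumptions rather than recommended values.

```python
import torch.nn as nn
import torch.optim as optim

# Stand-in model, just to have some parameters to optimize.
model = nn.Linear(4, 2)

sgd = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
adam = optim.Adam(model.parameters(), lr=1e-3)
rmsprop = optim.RMSprop(model.parameters(), lr=1e-3)

# Each optimizer keeps its parameters and hyperparameters in param_groups.
print(len(adam.param_groups[0]["params"]))  # 2 (weight and bias)
print(sgd.param_groups[0]["momentum"])      # 0.9
```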

To get the parameters of our model, we can just call `.parameters()` on a `Module`. Below, we create an instance of our previously-defined feed forward neural network and get its parameters.

In [None]:
input_size = 784
num_classes = 10
num_hidden = 2
hidden_dim = 50
dropout = 0.2
ffnn_clf = FeedForwardNN(input_size, num_classes, num_hidden,
                         hidden_dim, dropout)
print(ffnn_clf)

parameters = ffnn_clf.parameters()

print("Shapes of model parameters:")
print([x.size() for x in list(parameters)])

Now, to create an optimizer for this model, we construct an optimizer class and pass it the parameters of the model. Here we use stochastic gradient descent (SGD).

In [None]:
ffnn_optim = optim.SGD(ffnn_clf.parameters(), lr=0.5)

Let's try using our optimizer to take a gradient update on our model! We'll generate a few random examples, and run them through our model (the forward pass).

In [None]:
# Make some fake data for our model.
# 5 examples in the batch, each example has 784 features.
sample_input = torch.randn(5, 784)
# Multiclass classification, 10 possible classes.
sample_labels = torch.LongTensor([0, 3, 9, 6, 2])

# Run the sample_input through ffnn_clf to get a distribution
# over our classes
sample_predictions = ffnn_clf(sample_input)
print("Predicted distribution over classes: ")
print(sample_predictions)
print("Target Labels:")
print(sample_labels)

Now let's calculate the loss of our model on these examples.

In [None]:
nll_loss = F.nll_loss(sample_predictions, sample_labels)
print("Average NLL Loss:")
print(nll_loss)

Let's print the gradients of one of the parameter matrices in our model, to ensure it's `None`. We haven't done backprop yet, so there shouldn't be any gradients.

In [None]:
print(list(ffnn_clf.parameters())[0].grad)

Now we can backpropagate with respect to the loss to calculate the gradients for the parameters of our model with `.backward()`. It's also good practice to call `optimizer.zero_grad()` before `loss.backward()`: PyTorch accumulates gradients across backward passes, so this clears any gradients left over from an earlier pass before backprop.

In [None]:
ffnn_optim.zero_grad()
nll_loss.backward()
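
To see why clearing gradients matters, here is a small sketch on a standalone tensor (not the model above). Calling `.backward()` twice without clearing adds the second gradient to the first.

```python
import torch

w = torch.ones(2, requires_grad=True)

# First backward pass: d/dw of sum(3 * w) is 3 for each element.
(w * 3).sum().backward()
first = w.grad.clone()
print(first)   # tensor([3., 3.])

# Second backward pass without clearing: the gradients add up.
(w * 3).sum().backward()
second = w.grad.clone()
print(second)  # tensor([6., 6.])

# Clearing the gradient by hand, similar in spirit to optimizer.zero_grad().
w.grad.zero_()
print(w.grad)  # tensor([0., 0.])
```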

Let's check our gradients now...

In [None]:
print(list(ffnn_clf.parameters())[0].grad)

Now that we have gradients for each of our parameters, we can update them by using `optimizer.step()`.

In [None]:
# save the old value of the parameter for comparison later
# (.detach() drops the autograd history; .clone() makes an independent copy)
old_parameter = list(ffnn_clf.parameters())[0].detach().clone()

# Make a gradient update with our optimizer
ffnn_optim.step()

new_parameter = list(ffnn_clf.parameters())[0].detach()

print("Difference between weight matrix before and after update:")
print(old_parameter - new_parameter)


If you're familiar with the SGD update rule, you know that:

$$ \theta^{t+1} = \theta^{t} - \left( \eta \cdot \nabla L \left(\theta^{t} \right) \right)$$

where $\theta^{t}$ is the weight at time $t$, $\eta$ is the learning rate, and $\nabla L(\theta^{t})$ is the gradient. Since $\eta = 0.5$, the difference between the weight matrices printed above should be exactly half of the gradient we printed earlier.
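
This relationship can be verified numerically. The sketch below uses a small stand-in layer and a made-up loss (both assumptions, so that it runs on its own) and checks that one plain SGD step with $\eta = 0.5$ changes the weights by exactly half of the gradient.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(3, 2)  # stand-in model
optimizer = optim.SGD(layer.parameters(), lr=0.5)

# A made-up scalar loss, just to get some gradients.
loss = layer(torch.randn(4, 3)).pow(2).mean()
optimizer.zero_grad()
loss.backward()

old_weight = layer.weight.detach().clone()
optimizer.step()
diff = old_weight - layer.weight.detach()

# theta_old - theta_new should equal eta * gradient.
print(torch.allclose(diff, 0.5 * layer.weight.grad, atol=1e-6))
```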

# Example: Classification on FashionMNIST

Let's use the `FeedForwardNN` model we built earlier to do a simple classification task! This example is meant to be an annotated walkthrough of how to build, train, and evaluate a model in PyTorch. We'll use the [FashionMNIST dataset](https://github.com/zalandoresearch/fashion-mnist), where we are tasked with classifying grayscale images of clothes into 10 different classes.

## Loading Data

We'll start by loading the data with `torchvision` --- knowing how to use torchvision isn't the point of this tutorial, so it's relatively unannotated.

In [None]:
!pip install torchvision==0.17 #note: you can find compatible torch/torchvision versions here: https://github.com/pytorch/vision#installation
import torchvision
from torchvision.datasets import FashionMNIST

train_dataset = FashionMNIST(root='./torchvision-data',
                             train=True,
                             transform=torchvision.transforms.ToTensor(),
                             download=True)

test_dataset = FashionMNIST(root='./torchvision-data', train=False,
                            transform=torchvision.transforms.ToTensor())

`train_dataset` and `test_dataset` are both instances of `FashionMNIST`, which subclasses PyTorch's `torch.utils.data.Dataset`. The main benefit of subclassing this abstract class is that we can use `torch.utils.data.DataLoader`s to handle batching our examples and iterating over them. We'll create `DataLoader`s for our datasets now.

In [None]:
from torch.utils.data import DataLoader

# Data-related hyperparameters
batch_size = 64

# Set up a DataLoader for the training dataset.
train_dataloader = DataLoader(
    dataset=train_dataset, batch_size=batch_size, shuffle=True)

# Set up a DataLoader for the test dataset.
test_dataloader = DataLoader(
    dataset=test_dataset, batch_size=batch_size)

Let's take a look at what's inside our datasets. `torch.utils.data.Dataset`s are indexable, so we can easily peek inside.

In [None]:
# Print the first training example
print(train_dataset[0])

From this output, we can see the dataset elements are tuples of `(data_tensor, label)`. `data_tensor` is a `FloatTensor` of shape `(1, 28, 28)` (since the image is 28x28 with a single channel), and `label` is an integer from 0 to 9 (since there are 10 classes in the data).

Let's similarly look at what the `DataLoader` produces.

In [None]:
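# Fetch just the first batch. list(train_dataloader)[0] would also work,
# but it iterates over the whole dataset first.
next(iter(train_dataloader))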

As we can see, the `DataLoader` groups examples into batches of size `batch_size` (64 in the code above). Thus, the shape of the returned image tensor is `(64, 1, 28, 28)`, since we essentially stacked `batch_size` examples together. Similarly, `labels` is now a `LongTensor` of size `batch_size`.

Note that the label for a single example was a Python `int` --- the dataloader automatically grouped them into a `LongTensor` of the appropriate size.
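
The batching behaviour can be reproduced without downloading anything. The sketch below uses a hypothetical toy dataset: a plain Python list of `(tensor, int)` pairs, which works as a map-style dataset because it supports indexing and `len()`. It shows the stacked image shape, the integer labels collated into an `int64` tensor, and a smaller final batch when the dataset size is not a multiple of the batch size.

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical toy dataset: 10 blank "images" with integer labels.
toy_dataset = [(torch.zeros(1, 28, 28), i % 10) for i in range(10)]
loader = DataLoader(toy_dataset, batch_size=4)

images, labels = next(iter(loader))
print(images.shape)  # torch.Size([4, 1, 28, 28])
print(labels.dtype)  # torch.int64

# 10 examples in batches of 4: the last batch only has 2.
batch_sizes = [len(batch_labels) for _, batch_labels in loader]
print(batch_sizes)   # [4, 4, 2]
```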

## Building our model

Now we can construct a `FeedForwardNN` instance that we'll train. Each FashionMNIST example is a `28x28` image with a single channel, so we get it as a Tensor of shape `(1, 28, 28)`.

We'll flatten out each example to a vector of size `(784,)` for compatibility with our model.
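
As a quick sketch of that flattening step on a random batch (the batch size of 64 is an assumption matching the `DataLoader` settings above), `view(-1, 784)` and `flatten(start_dim=1)` give the same result. In `view`, the `-1` asks PyTorch to infer the batch dimension.

```python
import torch

# A random stand-in for one batch of images.
batch = torch.randn(64, 1, 28, 28)

flat = batch.view(-1, 784)           # -1 is inferred as 64
flat_alt = batch.flatten(start_dim=1)  # keep dim 0, merge the rest

print(flat.shape)  # torch.Size([64, 784])
print(torch.equal(flat, flat_alt))
```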

In [None]:
# Hyperparameters of our model.
num_hidden = 2
hidden_dim = 512
dropout = 0.2

fashionmnist_ffnn_clf = FeedForwardNN(input_size=784, num_classes=10,
                                      num_hidden=num_hidden,
                                      hidden_dim=hidden_dim, dropout=dropout)
print(fashionmnist_ffnn_clf)

If we're using a GPU, we'll move the model to the GPU which should speed up training. We do this with the same `.cuda()` method we used for Tensors.

In [None]:
if using_GPU:
  fashionmnist_ffnn_clf = fashionmnist_ffnn_clf.cuda()

# Check if the Module is on GPU by checking if a parameter is on GPU
print("Model on GPU?:")
print(next(fashionmnist_ffnn_clf.parameters()).is_cuda)
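
An alternative to branching on a flag is to pick a `torch.device` once and call `.to(device)` everywhere; the same code then runs with or without a GPU. This is a sketch of that pattern on a stand-in model, not a change to the cells in this notebook.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(784, 10).to(device)       # stand-in model
inputs = torch.randn(5, 784, device=device)  # create data on the same device
outputs = model(inputs)

print(device)
print(outputs.shape)  # torch.Size([5, 10])
```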

## Construct other classes we need for training: loss and optimizer

Now, we'll set up a criterion for calculating the loss and an Optimizer for updating our parameters.

In [None]:
# Set up criterion for calculating loss
nll_criterion = nn.NLLLoss()

lr = 0.1
momentum = 0.9
# Set up an optimizer for updating the parameters of fashionmnist_ffnn_clf
ffnn_optimizer = optim.SGD(fashionmnist_ffnn_clf.parameters(),
                           lr=lr, momentum=momentum)

## Train the model!

Now, we'll implement the procedure to train the model --- this is typically called the "train loop" since we loop over our batches, performing the forward pass, calculating a loss, backpropping, and then updating our parameters. This is the bulk of the code necessary to train the model.

This block looks pretty long, but that's mostly because of the comments :)

In [None]:
# Number of epochs (passes through the dataset) to train the model for.
num_epochs = 10

# A counter for the number of gradient updates we've performed.
num_iter = 0

# Iterate `num_epochs` times.
for epoch in range(num_epochs):
  print("Starting epoch {}".format(epoch + 1))
  # Iterate over the train_dataloader, unpacking the images and labels
  for (images, labels) in train_dataloader:
    # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784), since
    # that's what our model expects. Remember that -1 does shape inference!
    reshaped_images = images.view(-1, 784)

    # No wrapping is needed here: since PyTorch 0.4, Tensors track gradients
    # themselves and the old Variable wrapper is deprecated.

    # If we're using the GPU, move reshaped_images and labels to the GPU.
    if using_GPU:
      reshaped_images = reshaped_images.cuda()
      labels = labels.cuda()

    # Run the forward pass through the model to get predicted log distribution.
    # predicted shape: (batch_size, 10) (since there are 10 classes)
    predicted = fashionmnist_ffnn_clf(reshaped_images)

    # Calculate the loss
    batch_loss = nll_criterion(predicted, labels)

    # Clear the gradients as we prepare to backprop.
    ffnn_optimizer.zero_grad()

    # Backprop (backward pass), which calculates gradients.
    batch_loss.backward()

    # Take a gradient step to update parameters.
    ffnn_optimizer.step()

    # Increment gradient update counter.
    num_iter += 1

    # Calculate test set loss and accuracy every 500 gradient updates
    # It's standard to have this as a separate evaluate function, but
    # we'll place it inline for didactic purposes.
    if num_iter % 500 == 0:
      # Set model to eval mode, which turns off dropout.
      fashionmnist_ffnn_clf.eval()
      # Counters for the num of examples we get right / total num of examples.
      num_correct = 0
      total_examples = 0
      total_test_loss = 0

      # Iterate over the test dataloader. torch.no_grad() turns off gradient
      # tracking, which we don't need for evaluation; this speeds up inference.
      # (It replaces the old Variable(..., volatile=True) idiom.)
      with torch.no_grad():
        for (test_images, test_labels) in test_dataloader:
          # Reshape images from (batch_size, 1, 28, 28) to (batch_size, 784) again
          reshaped_test_images = test_images.view(-1, 784)

          # If we're using the GPU, move tensors to the GPU.
          if using_GPU:
            reshaped_test_images = reshaped_test_images.cuda()
            test_labels = test_labels.cuda()

          # Run the forward pass to get predicted distribution.
          predicted = fashionmnist_ffnn_clf(reshaped_test_images)

          # Calculate loss for this test batch. This is averaged, so multiply
          # by the number of examples in batch to get a total.
          # .item() converts the 0-dim loss tensor to a Python float.
          total_test_loss += nll_criterion(
              predicted, test_labels).item() * test_labels.size(0)

          # Get predicted labels (argmax over the class dimension).
          _, predicted_labels = torch.max(predicted, 1)

          # Count the number of examples in this batch
          total_examples += test_labels.size(0)

          # Count the total number of correctly predicted labels.
          # predicted_labels == test_labels generates a BoolTensor that is True
          # where they match, so we can sum to get the num correct.
          num_correct += torch.sum(predicted_labels == test_labels).item()
      accuracy = 100 * num_correct / total_examples
      average_test_loss = total_test_loss / total_examples
      print("Iteration {}. Test Loss {}. Test Accuracy {}.".format(
          num_iter, average_test_loss, accuracy))
      # Set the model back to train mode, which activates dropout again.
      fashionmnist_ffnn_clf.train()
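
As the comment in the loop notes, evaluation usually lives in its own function. Below is one possible sketch of such a helper. The name `evaluate`, its signature, and the toy model and data used to exercise it are all assumptions for illustration, not part of the notebook's code. It assumes the model outputs log-probabilities, like `FeedForwardNN`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, dataloader, device):
  """Return (average NLL loss, accuracy in percent) over dataloader."""
  was_training = model.training
  model.eval()  # turn off dropout
  total_loss, num_correct, total_examples = 0.0, 0, 0
  with torch.no_grad():  # no gradients needed for evaluation
    for images, labels in dataloader:
      # Flatten each image to a vector and move the batch to the device.
      images = images.view(images.size(0), -1).to(device)
      labels = labels.to(device)
      log_probs = model(images)
      # Sum (not average) per batch so the final average is exact.
      total_loss += F.nll_loss(log_probs, labels, reduction="sum").item()
      num_correct += (log_probs.argmax(dim=1) == labels).sum().item()
      total_examples += labels.size(0)
  model.train(was_training)  # restore the previous mode
  return total_loss / total_examples, 100 * num_correct / total_examples

# Toy stand-ins so the sketch runs on its own.
device = torch.device("cpu")
toy_model = nn.Sequential(nn.Linear(784, 10), nn.LogSoftmax(dim=-1))
toy_data = TensorDataset(torch.randn(20, 1, 28, 28),
                         torch.randint(0, 10, (20,)))
avg_loss, accuracy = evaluate(toy_model, DataLoader(toy_data, batch_size=8),
                              device)
print(avg_loss, accuracy)
```

Inside the training loop, the inline block could then be replaced by a single call to this helper every 500 iterations.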

# Homework Exercises
**Due: 23rd Feb, 11:59pm**
<br>
<br>
Using the same FashionMNIST dataset, work on the tasks below. Submit your homework as either: (i) an ipynb file with your results inside; or (ii) a Python file and a separate PDF discussing your results.

(a) Develop a new feed-forward neural network that contains 3 hidden layers, with hidden layers 1, 2, 3 being of dimensions 512, 256, 128, respectively. Hidden layer 1 is the layer immediately after the input layer, while hidden layer 3 is the one just before the output layer.

(b) Experiment with three different activation functions and two different optimizers. Report your results and discuss your findings.

(c) Building upon Task b above, describe and implement two approaches to improve upon the best variation from Task b. Report your results and discuss your findings.
