## Additional tutorial on deep learning with PyTorch

This notebook is a tutorial on some common deep learning operations in PyTorch.
This tutorial is completely optional and not graded.
However, it introduces several concepts that are useful for Exercise 5.
We therefore highly recommend going through it.
Additionally, we provide skeleton functions for training deep networks and show how to use them in this tutorial.
Feel free to reuse this code for your exercise.

In [None]:
import sys
from pathlib import Path

from google.colab import drive
drive.mount('/content/drive')

In [None]:
iacv_path = 'MyDrive/BMIC/iacv/hs23_ex_wip/ex5/solution' # TODO set this

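env_path = Path('/content/drive/') / iacv_path
# Add the handout folder to the Python path. sys.path stores strings, so we
# compare against str(env_path) rather than the Path object itself.
if str(env_path) not in sys.path:
    sys.path.append(str(env_path))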

### Step 1: Define the model
For our tutorial, we will use a toy model, `ToyNet`, defined in the Python cell below. Defining a PyTorch network consists of two main steps. First, in the `__init__` function, we define the layers of the network. Second, the `forward` function defines how an input to the network is processed by these layers. Passing an input through the network layers is called a 'forward pass' through the model.

In [None]:
# Example Toy Model

import torch
from torch import nn
import torch.nn.functional as F

class ToyNet(nn.Module):

    # model initialization
    def __init__(self, input_size=(10,), num_classes=10):
        super(ToyNet, self).__init__()
        input_size = input_size[0]
        self.lin1 = nn.Linear(input_size, 100)
        self.lin2 = nn.Linear(100, 100)
        self.lin3 = nn.Linear(100, num_classes)

    # model forward pass
    def forward(self, x):
        x = self.lin1(x)
        x = F.relu(x)
        x = self.lin2(x)
        x = F.relu(x)
        x = self.lin3(x)
        return x

model = ToyNet()
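
To make the forward pass concrete, the sketch below passes a small batch of random inputs through a network. So that the snippet runs on its own, it uses a stand-in named `toy` (an assumption for illustration) that is built with `nn.Sequential` and has the same layer sizes as `ToyNet`.

```python
import torch
from torch import nn

# Stand-in with the same layer sizes as ToyNet (assumption: built with
# nn.Sequential only so that this snippet is self-contained)
toy = nn.Sequential(
    nn.Linear(10, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# A batch of 4 random input vectors with 10 features each
x = torch.randn(4, 10)

# Calling the module runs its forward function on the whole batch
logits = toy(x)
print(logits.shape)  # torch.Size([4, 10])
```

The first dimension is the batch size, and the second is the number of outputs of the last linear layer.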

### Step 2: Summarizing a Model

You can learn about a PyTorch neural network `model` using various methods. The standard way to print a summary of a model in PyTorch is `print(model)`. You can also use an external library such as `torchsummary` to obtain a more detailed summary.

In [None]:
from torchsummary import summary

print(">>>>> print(model):")
print(model)

print("\n\n>>>>> Torch-Summary:")
summary(model=model, input_size=(10,))

The output of the `print` function shows us that the `model` has three sub-modules, namely `lin1`, `lin2`, and `lin3`, each of which is a `Linear` layer.

Additionally, the output of `torchsummary` tells us the size of the model, as well as the number of trainable parameters, i.e. the parameters that are updated during training using the gradients computed by backpropagation. Currently, we see that all the parameters in the model will be trained.
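
The parameter counts can also be reproduced by hand, which is a useful sanity check. The sketch below counts the parameters of a stand-in network `toy` (an assumption: same layer sizes as `ToyNet`, built with `nn.Sequential` so that the snippet is self-contained).

```python
import torch
from torch import nn

# Stand-in with the same layer sizes as ToyNet (assumption)
toy = nn.Sequential(
    nn.Linear(10, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
)

# numel() returns the number of scalar entries in a tensor
total = sum(p.numel() for p in toy.parameters())
trainable = sum(p.numel() for p in toy.parameters() if p.requires_grad)

# (10*100 + 100) + (100*100 + 100) + (100*10 + 10) = 12210
print(total, trainable)  # 12210 12210
```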

### Step 3: Extending a model

After learning about a given model, you can change the model according to your needs. For example, you can add or remove a layer in the model, or modify an existing layer.  
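
As a small illustration of modifying an existing layer, the sketch below replaces the last layer of a network so that it predicts 5 classes instead of 10. The class `TinyNet` and its attributes `fc1` and `fc2` are hypothetical names used only for this example.

```python
import torch
from torch import nn
from copy import deepcopy

class TinyNet(nn.Module):
    """Hypothetical two-layer network used only for this illustration."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 100)
        self.fc2 = nn.Linear(100, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

net = TinyNet()

# Work on a copy so that the original network stays untouched
net5 = deepcopy(net)

# Assigning a new module to the attribute replaces the old layer.
# The new layer is randomly initialised; fc1 keeps its copied weights.
net5.fc2 = nn.Linear(100, 5)

out = net5(torch.randn(3, 10))
print(out.shape)  # torch.Size([3, 5])
```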

**Task 1:** In the following cell, your task is to extend `model` with a ReLU activation layer at the end. That is, you should construct a new network `model2`, which has the same parameters as `model`, but also has a ReLU layer at the end.

In [None]:
from torch import nn
from evaluation import check_relu_layer
from copy import deepcopy

model2 = deepcopy(model)

#######
# Your code here. Extend model2 by adding a ReLU layer at the end
#######

# Test whether your implementation is correct
check_relu_layer(model, model2)

### Step 4: Checking model parameters
As you may have seen in the lectures, a layer in a neural network (e.g. a Linear layer) can have learnable parameters. These learnable parameters for a model can be accessed using the `parameters()` function of a neural network module.

For instance, we can view the parameter blocks of the `lin1` layer as follows:

In [None]:
# Get all the parameters
params_list = list(model.lin1.parameters())

print(f'Lin1 layer has {len(params_list)} parameter blocks')
print(f'The shape of the first parameter block is {params_list[0].shape}')
print(f'The shape of the second parameter block is {params_list[1].shape}')

**Questions:** What do these two parameter blocks correspond to? Can you figure out why the shape of the first block is `(100, 10)`, while that of second is only `(100, )`?

Additionally, you can get all the parameters in the `ToyNet` by calling the `parameters()` function on the top level module.

In [None]:
all_params = list(model.parameters())
print(f'ToyNet has {len(all_params)} parameter blocks')

You can iterate through these parameter blocks in the 'pythonic' way as follows:

In [None]:
for param in model.parameters():
    print(f'Shape of parameter block is {param.shape}')

In addition to the `shape` attribute, a parameter block has many more interesting properties. In particular, one crucial attribute related to the training of the network parameters is the `requires_grad` flag, which tells `PyTorch` whether it should compute gradients for this parameter block. You may be aware that neural networks are trained using the backpropagation algorithm, in which we iteratively update each of the network parameters using the computed gradients.

We can check whether our parameter blocks require gradients or not as follows.

In [None]:
for i, param in enumerate(model.parameters()):
    print(f'Parameter block {i} requires_grad is: {param.requires_grad}')

As we can see, by default, all parameter blocks in a network require gradients. That is, all the parameters in the neural network are updated when training the network. However, you can control this property by manually setting the `requires_grad` flag to `False` in order to 'freeze' a particular layer during training. This allows you to keep certain parameters in a network fixed, while learning the other parameters.
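
The effect of the flag can be observed on two standalone layers. In the sketch below, the layers `frozen` and `trainable` are hypothetical names chosen for illustration; after a backward pass, only the layer that still requires gradients has a gradient stored.

```python
import torch
from torch import nn

# Two standalone layers (hypothetical names, for illustration only)
frozen = nn.Linear(10, 4)
trainable = nn.Linear(4, 2)

# Switch off gradient computation for every parameter block of one layer
for p in frozen.parameters():
    p.requires_grad = False

# Forward pass through both layers, reduced to a scalar so that we can
# call backward() on it
out = trainable(frozen(torch.randn(3, 10))).sum()
out.backward()

print(frozen.weight.grad)            # None: no gradient was computed
print(trainable.weight.grad.shape)   # torch.Size([2, 4])
```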

**Question:** When would you desire such a property? Can you think of a case where you want to keep certain network parameters fixed, i.e. not learn them?

**Task 2:** Your next task is to freeze the first and third layers of the `model`.

In [None]:
from evaluation import check_freezing

#################
# Your code here. You need to freeze the first and the third layers
# (lin1 and lin3). That is, the requires_grad attribute should be False for
# the parameters in these layers
#################

check_freezing(model)

### Step 5: Learning the model parameters

In the network training process, our aim is to learn a set of parameters which minimize our training loss. This is achieved by using some form of gradient-based update algorithm, e.g. stochastic gradient descent. Fortunately, PyTorch can calculate the gradients of network parameters with respect to the training loss, using its `autograd` mechanism. A simple illustration of this is provided below.

We define two variables $a$ and $b$, set to 2 and 3 respectively. We define our 'loss' as $(a-b)^2$. Our goal is to update $a$ such that the loss is minimized. We can achieve this by performing gradient descent. To do this, we first need to compute the gradient $\frac{\text{d}\,\text{loss}}{\text{d}a}$. For our simple problem, we can calculate the gradient by hand: applying the chain rule to $(a-b)^2$ gives $2(a-b)$.

In [None]:
# Define the tensors a and b. For tensor a, we set requires_grad to True, since
# we want to calculate its gradients
a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0])

# Compute loss
loss = (a-b)**2

# Compute the gradients for each input.
# This is achieved by calling the backward function.
loss.backward()

print(f'Gradient w.r.t. a is {a.grad}')
print(f'Gradient w.r.t. b is {b.grad}')

We can see that the gradient w.r.t. $a$ is -2, which is equal to $2(a-b) = 2(2 - 3) = -2$. Thus, we see that PyTorch can compute the gradients automatically. While in our simple case, we could manually obtain an expression for the gradients as $2(a-b)$, this is not feasible to compute when dealing with deep neural networks, consisting of many complex operations. PyTorch's `autograd` functionality allows us to automatically compute the gradients even in such cases. You can refer to https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html for more information about `autograd` in PyTorch.
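
The gradient computed by `autograd` can be plugged directly into a gradient descent loop. The following is a minimal sketch under assumed settings: the step size `lr = 0.1` and the 50 iterations are arbitrary choices made for illustration.

```python
import torch

a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0])

lr = 0.1  # step size (assumed value, chosen for illustration)

for step in range(50):
    loss = (a - b) ** 2
    loss.backward()

    # Update a without recording the update itself in the autograd graph
    with torch.no_grad():
        a -= lr * a.grad

    # Gradients accumulate across backward() calls, so reset them
    a.grad.zero_()

print(a.item())  # very close to 3.0, the minimizer of (a - b)^2
```

Each step moves $a$ in the direction opposite to the gradient, so $a$ approaches $b$ and the loss approaches zero.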

**Question:** Why was gradient w.r.t. $b$ set to `None` in our example?

Once we compute the gradients for each parameter, the next step is to update the parameter values. As you may have already seen in exercise 4, this is done using the [`optimizer` module in PyTorch](https://pytorch.org/docs/stable/optim.html). One can use many different types of optimizers, e.g. SGD, Adam, etc. In our example, we will use the SGD optimizer.

In [None]:
import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

The first input argument to `optim.SGD` is `model.parameters()`. Why is that? We need to tell the optimizer which parameter blocks in the model it should update. By passing `model.parameters()` as input, we tell the optimizer that it may update all the parameter blocks. Note, however, that the optimizer can only update the parameter blocks for which gradients are computed. For instance, if the `requires_grad` flag is set to `False` for certain layers, the optimizer cannot update the parameters of those layers.
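
To see how the optimizer is used, the sketch below performs a single training step on a small stand-in network whose first layer is frozen. The network `net`, the random data, and the cross-entropy loss are assumptions made for illustration. Although all parameters are passed to the optimizer, the frozen layer should remain unchanged, since no gradients are computed for it.

```python
import torch
from torch import nn
import torch.optim as optim

torch.manual_seed(0)

# Small stand-in network (assumption, for illustration only)
net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))

# Freeze the first linear layer
for p in net[0].parameters():
    p.requires_grad = False

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)

# Random inputs and class labels, standing in for a real dataset
inputs = torch.randn(8, 10)
targets = torch.randint(0, 3, (8,))

# Keep copies of the parameters to compare after the update
frozen_before = net[0].weight.detach().clone()
head_before = net[2].bias.detach().clone()

# One training step: reset gradients, forward, backward, update
optimizer.zero_grad()
loss = nn.functional.cross_entropy(net(inputs), targets)
loss.backward()
optimizer.step()

frozen_unchanged = torch.equal(net[0].weight, frozen_before)
head_changed = not torch.equal(net[2].bias, head_before)
print(frozen_unchanged, head_changed)
```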

**Task 3:** Construct an optimizer which will only update the parameters of the last linear layer, i.e. `lin3`.

In [None]:
optimizer = optim.SGD('TODO', lr=0.01, momentum=0.9)

### Step 6: Fine-tuning a network

Until now, we have only considered whether to train the full network, or to 'freeze' certain layers in the network while learning the others. However, in certain cases, we may wish to only slightly adapt certain layers, while learning others from scratch. This behaviour is controlled through the learning rate. Instead of using a single learning rate for all the parameters in the network, we can use different learning rates for different parameters. This allows us to control how much the network parameters may change compared to their initial values.

We can obtain this behaviour as follows.

In [None]:
import torch.optim as optim
optimizer = optim.SGD([
                          {'params': model.lin1.parameters(), 'lr': 0.02},
                          {'params': model.lin2.parameters(), 'lr': 0.05},
                          {'params': model.lin3.parameters(), 'lr': 0.1},
                       ], lr=0.01, momentum=0.9)

Observe that instead of passing `model.parameters()` as the first argument, we are passing a list of dictionaries. Each dictionary contains a set of parameters, as well as the learning rate to use for that particular set of parameters. In this example, the optimizer will use a learning rate of $0.02$ for layer 1 (`lin1`) parameters, $0.05$ for layer 2 (`lin2`) parameters and so on.
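
The settings that the optimizer actually uses can be inspected through its `param_groups` attribute. In the sketch below, the layers `backbone` and `head` are hypothetical names used for illustration. Note that a group which does not specify its own `'lr'` falls back to the default learning rate given as a keyword argument.

```python
import torch
from torch import nn
import torch.optim as optim

# Two standalone layers (hypothetical names, for illustration only)
backbone = nn.Linear(10, 16)
head = nn.Linear(16, 3)

optimizer = optim.SGD([
    {'params': backbone.parameters(), 'lr': 0.001},  # adapt only slightly
    {'params': head.parameters()},                   # uses the default lr
], lr=0.1, momentum=0.9)

# Each entry of param_groups is a dictionary with the group's settings
for i, group in enumerate(optimizer.param_groups):
    print(f"Group {i}: lr={group['lr']}, momentum={group['momentum']}")
```

Here the first group uses the learning rate $0.001$, the second group uses the default $0.1$, and both share the momentum of $0.9$.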

### Step 7: Saving a network
Once we have trained a neural network, we may want to save the learned model for later use. To do so, we save the learned parameters, which we can easily reload later. The function `state_dict()` returns a dictionary that maps the name of each parameter block in the network (e.g. `lin1.weight`) to the corresponding tensor of values.

In [None]:
state_dict = model.state_dict()

print('The model has the following parameter blocks:')
print(list(state_dict.keys()))
print('The shape of lin1.weight parameter is ',
      state_dict["lin1.weight"].shape)

Once we have the state dict, we can easily save it to the disk using `torch.save`.

In [None]:
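# Save the model parameters to the given path, creating the folder if needed
ckpt_dir = env_path / 'ckpt'
ckpt_dir.mkdir(parents=True, exist_ok=True)
torch.save(model.state_dict(), ckpt_dir / 'toy_model.pt')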

Later, you can load the saved network weights and assign them to your model. For example, you can construct a new instance of `ToyNet` and assign the saved parameters to it.

In [None]:
# Create a new model
model_new = ToyNet()

# Load the previously saved weights
old_weights = torch.load(f"{env_path}/ckpt/toy_model.pt")

# Set the old weights to the new model
model_new.load_state_dict(old_weights)
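
A quick way to convince yourself that saving and loading worked is to compare the outputs of the original and the restored network on the same input. The sketch below does this for a single linear layer and writes to a temporary folder instead of Google Drive. The file name `toy_linear.pt` and the layer sizes are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn

net = nn.Linear(10, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = Path(tmp_dir) / 'toy_linear.pt'  # hypothetical file name

    # Save the parameters, then load them into a freshly created layer
    torch.save(net.state_dict(), ckpt_path)
    net_restored = nn.Linear(10, 3)
    net_restored.load_state_dict(torch.load(ckpt_path))

# Both networks should now produce the same output for the same input
x = torch.randn(5, 10)
same = torch.allclose(net(x), net_restored(x))
print(same)  # True
```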


Now you know how to adapt a given model, train or fine-tune only a few layers in the network, and save and load a learned model :)

This should be sufficient to tackle Exercise 5. We further provide a small toy setup below where you can play around with the concepts you have learned, without having to wait for long training times. Additionally, we provide a number of useful functions, namely `train_model`, `evaluate_model`, and `plot_training_log`, to train your model, evaluate it, and visualize the training process, respectively. Feel free to reuse these functions to train your own networks.