# Deep Learning for Computer Vision

---

**Goethe University Frankfurt am Main**

Winter Semester 2022/23

<br>

## *Assignment 3 (Network)*

---

**Points:** 25<br>
**Due:** 16.11.2022, 10 am<br>
**Contact:** Matthias Fulde ([fulde@cs.uni-frankfurt.de](mailto:fulde@cs.uni-frankfurt.de))<br>

---

**Your Name:**

<br>

<br>

## Table of Contents

---

- [1 Loss](#1-Loss)
- [2 Optimization](#2-Optimization-(5-Points))
  - [2.1 Gradient Descent with Momentum](#2.1-Gradient-Descent-with-Momentum-(3-Points))
  - [2.2 Weight Decay](#2.2-Weight-Decay-(2-Points))
- [3 Deep Neural Network](#3-Deep-Neural-Network-(20-Points))
  - [3.1 Definition](#3.1-Definition-(5-Points))
  - [3.2 Training](#3.2-Training-(15-Points))


<br>

## Setup

---

Besides the NumPy and Matplotlib libraries, we import the definitions of the network layers and the corresponding test cases.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from modules import *
from utils import *

%load_ext autoreload
%autoreload 2

<br>

## Exercises

---

### 1 Loss

---

In this assignment we want to create and train a deep neural network for classification, so we need a loss function. We reuse the cross-entropy loss that we already implemented in the last assignment.

<br>

#### 1.1 Task

Complete the definition of the `CrossEntropyLoss` class in the `modules/loss.py` file.

Feel free to use the implementation shown in the solution for last week's assignment, but you can also copy your own solution. This exercise is not graded. If you have not implemented this loss in the previous assignment, we highly recommend that you try this yourself before using the given solution.

In the `forward` method of the class, compute the cross-entropy loss and store the result in the `out` variable that is returned from the method. Also cache the labels and the softmax probabilities to reuse them in the backward pass for gradient computation.

In the `backward` method compute the gradient of the loss with respect to the inputs. Store the gradient in the `in_grad` variable that is passed to the given model and that is returned from the method.

Use only vectorized operations.
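
As a reminder of the underlying computation, here is a minimal standalone NumPy sketch of the softmax cross-entropy loss and its gradient with respect to the scores. The function names `cross_entropy_forward` and `cross_entropy_backward` are hypothetical and not part of the `modules` package, and averaging over the batch is an assumption. The class interface in `modules/loss.py` may differ.

```python
import numpy as np

def cross_entropy_forward(scores, labels):
    """Return the mean cross-entropy loss and the softmax probabilities."""
    # Subtract the row-wise maximum for numerical stability (softmax is shift-invariant).
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    n = scores.shape[0]
    # Pick the probability of the correct class for each sample.
    loss = -np.log(probs[np.arange(n), labels]).mean()
    return loss, probs

def cross_entropy_backward(probs, labels):
    """Gradient of the mean loss with respect to the scores."""
    n = probs.shape[0]
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    return grad / n  # assumption: the loss is averaged over the batch

# Uniform scores over 10 classes give a loss of log(10).
scores = np.zeros((4, 10))
labels = np.array([0, 3, 5, 9])
loss, probs = cross_entropy_forward(scores, labels)
grad = cross_entropy_backward(probs, labels)
print(loss)  # log(10), approximately 2.3026
```

Both functions use only array operations, and the rows of the gradient sum to zero because the softmax probabilities sum to one.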

<br>

#### 1.2 Test

To test the implementation you can run the following cell.

In [None]:
CrossEntropyLoss_test()

<br>

### 2 Optimization (5 Points)

---

So far we have only used vanilla gradient descent. That is, the update rule was to scale the gradient with the learning rate and subtract the result from the parameters. In this exercise we want to extend that concept a bit by also taking the previous updates into account.

<br>

### 2.1 Gradient Descent with Momentum (3 Points)

---

One of the problems with standard gradient descent is that the gradient with respect to a parameter may change rapidly during training. These oscillations of the gradient make optimization hard. In addition, the optimization can get stuck in a flat region, where the slope is almost zero. Gradient descent with momentum is one approach to tackle these problems.

Instead of just updating the parameters with

$$
    W^{(t)} = W^{(t-1)} - \eta\nabla\mathcal{L}\left(W^{(t-1)}\right)
$$

<br>

where $\eta$ is the learning rate, we also take into account the previous updates of the parameters, scaled by a hyperparameter called momentum. Thus the update rule becomes

<br>

$$
    W^{(t)} = W^{(t-1)} - V^{(t)}
$$

with

$$
    V^{(t)} = \mu V^{(t-1)} + \eta\nabla\mathcal{L}\left(W^{(t-1)}\right)
$$

<br>

where $V$ is called velocity and $\mu$ is the momentum. The velocity is an array with the same shape as $W$ and $\nabla\mathcal{L}(W)$ and can be understood as an exponentially decaying sum of the past gradients, scaled by the learning rate, which behaves similarly to a moving average. With this update rule we get a more stable trajectory towards a minimum and we can still move, even if the gradient for the current time step becomes small.
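
The update rule can be illustrated with a small standalone NumPy sketch on a toy objective. The function `momentum_step` and the quadratic objective are assumptions made for illustration only. They do not reproduce the interface of the `SGD` class in `modules/optim.py`.

```python
import numpy as np

def momentum_step(w, v, grad, lr, mu):
    """One step of gradient descent with momentum for a single parameter array."""
    v_new = mu * v + lr * grad  # V(t) = mu * V(t-1) + eta * gradient
    w_new = w - v_new           # W(t) = W(t-1) - V(t)
    return w_new, v_new

# Toy objective L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)

# The loop runs over time steps; each update itself is vectorized.
for _ in range(200):
    w, v = momentum_step(w, v, grad=w, lr=0.1, mu=0.9)

print(np.linalg.norm(w))  # close to zero after 200 steps
```

With $\mu = 0$ the function reduces to vanilla gradient descent, which makes it easy to compare both variants on the same objective.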

<br>

#### 2.1.1 Task

Complete the definition of the `SGD` class in the `modules/optim.py` file.

In the `step` method of the optimizer, implement the update rule described above. The `learning_rate` and `momentum` are stored as attributes of the optimizer object. Each layer that is iterated over has dictionaries for parameters, gradients and velocity, where corresponding entries are referenced with the same name, given that you adhered to this convention in the previous exercises.

Your implementation should be fully vectorized, so no loops are allowed.

<br>

#### 2.1.2 Test

To test your implementation, run the following code cell.

In [None]:
SGD_test()

<br>

### 2.2 Weight Decay (2 Points)

---

In the last assignment, we used L2 regularization to regularize our linear classifier models, computing the squared Euclidean norm of the parameters. We explicitly added a regularization loss term to the data loss term to compute the final loss. However, we can also implement this in a slightly different way.

Instead of computing an explicit loss, we can apply **weight decay**, by just adding the gradient of the L2 regularization loss separately for each parameter to the gradient of the data loss. Hence, for vanilla gradient descent with weight decay we compute

<br>

$$
    W^{(t)} = W^{(t-1)} - \eta\left(\nabla\mathcal{L}\left(W^{(t-1)}\right) + \lambda W^{(t-1)}\right),
$$

<br>

where $\eta$ is the learning rate and $\lambda$ is the regularization strength. This is based on the equivalent definition of the regularization loss as

<br>

$$
    R(W) = \frac{\lambda}{2}\lVert W \rVert^2,
$$

such that

$$
    \frac{\partial}{\partial W} R(W) = \lambda W.
$$

<br>

In the same way we can compute the update when using stochastic gradient descent with momentum. In this case we compute

<br>

$$
    W^{(t)} = W^{(t-1)} - V^{(t)}
$$

with

$$
    V^{(t)} = \mu V^{(t-1)} + \eta\left(\nabla\mathcal{L}\left(W^{(t-1)}\right) + \lambda W^{(t-1)}\right).
$$
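
The following standalone NumPy sketch checks numerically that the weight decay term $\lambda W$ is the gradient of $R(W)$ and shows one update step with momentum and weight decay. The names `l2_penalty` and `sgd_step` are hypothetical helpers for illustration and do not reflect the interface of the `SGD` class.

```python
import numpy as np

def l2_penalty(w, lam):
    """Regularization loss R(W) = lambda / 2 * ||W||^2."""
    return 0.5 * lam * np.sum(w ** 2)

def sgd_step(w, v, grad, lr, mu, lam):
    """One momentum step where lam * w is added to the data gradient."""
    v_new = mu * v + lr * (grad + lam * w)
    return w - v_new, v_new

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 2))
lam, eps = 5e-4, 1e-6

# Numerical gradient of R(W) via central differences (loop only for checking).
num_grad = np.zeros_like(w)
for idx in np.ndindex(*w.shape):
    delta = np.zeros_like(w)
    delta[idx] = eps
    num_grad[idx] = (l2_penalty(w + delta, lam) - l2_penalty(w - delta, lam)) / (2 * eps)

print(np.allclose(num_grad, lam * w))

# With a zero data gradient and no momentum, the weights shrink by the factor 1 - lr * lam.
w_new, v_new = sgd_step(w, np.zeros_like(w), np.zeros_like(w), lr=0.1, mu=0.0, lam=0.5)
print(np.allclose(w_new, 0.95 * w))
```

The second check shows where the name comes from: even without any data gradient, each step pulls the weights towards zero.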

<br>

#### 2.2.1 Task

Extend the definition of the `SGD` class in the `modules/optim.py` file.

Add weight decay to the update rule of stochastic gradient descent with momentum, which you implemented in the previous exercise.

Use only vectorized operations.

<br>

#### 2.2.2 Test

To test your implementation, run the following code cell.

In [None]:
SGD_test(use_weight_decay=True)

<br>

### 3 Deep Neural Network (20 Points)

---

Now that we have all the components implemented, we can plug everything together to create a deep neural network.

Since we don't have GPU support, we're going to create a rather shallow model. Otherwise the training would take too much time. In order to get an idea of how many parameters our model will have in the end, we'll calculate them by hand after the definition.

<br>

### 3.1 Definition (5 Points)

To test our implementations, we're going to define a network with two convolutional layers, each followed by a ReLU activation function and a max pooling layer. We convert the outputs of the last pooling layer into vectors and pass them into a small fully-connected network, composed of two linear layers, the first of which has a ReLU activation function. The last linear layer has no activation function and produces the scores for the ten classes of the dataset.

For both convolutional layers we use a kernel size of 3 and set padding and stride to 1. The first conv layer has 3 input channels and 6 output channels. The second conv layer has 6 input channels and 8 output channels.

For the pooling layers we use a kernel size of 2 and a stride of the same size, so that we pool non-overlapping windows of the feature maps.

The number of output features for the first linear layer should be 32, and for the second linear layer it should be 10, matching the number of classes in the CIFAR-10 dataset that we use again in this exercise.
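
The spatial size after a convolution or pooling layer follows the standard formula $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, padding $p$, kernel size $k$ and stride $s$. The sketch below applies it to a hypothetical configuration that differs from the one defined above. The helper `conv_output_size` and all numbers in the example are assumptions for illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical example: 28x28 input, one conv layer with kernel size 5,
# no padding and stride 1, followed by 2x2 max pooling with stride 2.
size = conv_output_size(28, kernel=5)              # 24
size = conv_output_size(size, kernel=2, stride=2)  # 12

# With 4 output channels, flattening yields channels * height * width features.
channels = 4
num_features = channels * size * size
print(num_features)  # 576
```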

<br>

#### 3.1.1 Feature Size (1 Point)

The resolution of the images in the CIFAR-10 dataset is $32\times32$. Given the definitions above, compute the number of input features for the first linear layer. Write down all the steps of your computation.

##### Answer

*Write your answer here.*

<br>

#### 3.1.2 Implementation (2 Points)

Complete the definition of the `ConvNet` class below.

If you don't define the `forward` method, the inherited method from the base class will call the layers in the order in which they were added as attributes in the constructor. You don't have to define a `backward` method.

Create the network according to the above definitions.

In [None]:
class ConvNet(Module):

    def __init__(self):
        """
        Create deep neural network with two conv and two linear layers.
        """
        super().__init__()
        ############################################################
        ###                  START OF YOUR CODE                  ###
        ############################################################



        ############################################################
        ###                   END OF YOUR CODE                   ###
        ############################################################

<br>

#### 3.1.3 Capacity (2 Points)

Now we want to compute the capacity of the model, which is the number of learnable parameters. Compute the number of parameters for each layer and then sum the results to get the total number of parameters of the model.
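
As a hedged sketch of the counting rule, the helpers below compute the number of parameters of a single convolutional or linear layer, assuming that each layer has one bias per output channel or output feature. The function names and the example layer sizes are hypothetical and deliberately differ from the network defined above. Activation and pooling layers have no learnable parameters.

```python
def conv_params(in_channels, out_channels, kernel, bias=True):
    """Parameters of a conv layer: one kernel per output channel, plus optional bias."""
    return out_channels * (in_channels * kernel * kernel + int(bias))

def linear_params(in_features, out_features, bias=True):
    """Parameters of a linear layer: weight matrix plus optional bias vector."""
    return out_features * (in_features + int(bias))

# Hypothetical layers, not the ones from this exercise.
print(conv_params(1, 4, kernel=5))  # 4 * (1 * 5 * 5 + 1) = 104
print(linear_params(576, 10))       # 10 * (576 + 1) = 5770
```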

##### Answer

*Write your answer here.*

<br>

### 3.2 Training (15 Points)

---

We want to train our model again on the CIFAR-10 dataset that we already used in the previous problem sets. The function for loading and preprocessing the data expects the dataset in the `datasets` folder in the same directory as the notebook, so copy the folder before you proceed.

Let's load the data and print the shapes.

In [None]:
# Load and preprocess the CIFAR-10 dataset.
data = get_CIFAR_10_data()

# Output the shapes of the partitioned data and labels.
for name, array in data.items():
    print(f'{name} shape: {array.shape}')

<br>

#### 3.2.1 Task (12 Points)

Implement a training loop for the defined model and dataset.

In each iteration, randomly sample a minibatch of $64$ images from the development set with replacement. Compute the forward pass through the network. In order to do this, you can call the model directly. The `__call__` method dispatches to the `forward` method of the instance.

Compute the average accuracy for the training batch and store it in the predefined `train_acc` list.

The next step is to call the loss function with the model output and the ground truth labels. Again, you can call the instance directly. Store the loss in the `train_loss` list that is already defined. After that, call the `backward` method of the loss to compute the gradients of the loss with respect to the model parameters.

In order to update the parameters, call the `step` method of the optimizer.

Finally, sample a minibatch of the same size from the validation set and compute a forward pass. Again compute the loss. Store it in the `val_loss` list. Compute the average accuracy of the predictions and store the result in the `val_acc` list.

Use only vectorized operations. No further loops are allowed.
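
Sampling a minibatch with replacement and computing the batch accuracy can both be done with array operations. The following sketch uses synthetic arrays and random scores as stand-ins for the real data and the model output, so every name and value in it is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for images and labels (assumption: channels-first layout).
X = rng.normal(size=(500, 3, 32, 32))
y = rng.integers(0, 10, size=500)

batch_size = 64

# Drawing indices independently samples with replacement.
idx = rng.integers(0, X.shape[0], size=batch_size)
X_batch, y_batch = X[idx], y[idx]

# Random scores standing in for the output of a model with 10 classes.
scores = rng.normal(size=(batch_size, 10))

# Accuracy is the fraction of samples whose highest score matches the label.
accuracy = np.mean(scores.argmax(axis=1) == y_batch)
print(X_batch.shape, accuracy)
```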

<br>

Your model should at least converge, so the loss should decrease and the accuracy should increase. With the given settings, expect a slow start, but towards the end of the given number of iterations, you should see that the accuracy on the development set is well above chance, which would be $10\%$.

Since we're training only on the CPU, be prepared that training the model with the predefined settings may take a while!

<br>

#### 3.2.2 Solution

Write your solution in the marked code cell below.

In [None]:
# Create the model.
model = ConvNet()

# Create the loss function.
loss = CrossEntropyLoss(model)

# Create the optimizer.
optimizer = SGD(model, lr=1e-3, momentum=0.9, weight_decay=5e-4)

# Access development set.
X_dev = data['X_dev']
y_dev = data['y_dev']

# Access validation set.
X_val = data['X_val']
y_val = data['y_val']

In [None]:
# Lists to store training and validation loss.
train_loss = []
val_loss = []

# Lists to store the training and validation accuracy.
train_acc = []
val_acc = []

In [None]:
# Set number of iterations.
num_iter = 100

# Set number of samples per minibatch.
batch_size = 64

# Show intermediate results.
verbose = True
print_every = 10

# Train the model.
for i in range(1, 1+num_iter):
    ############################################################
    ###                  START OF YOUR CODE                  ###
    ############################################################



    ############################################################
    ###                   END OF YOUR CODE                   ###
    ############################################################
    if verbose and (i == 1 or i % print_every == 0):
        print(
            f'Iter: {i:4}  | ',
            f'Train acc: {train_acc[-1]*100:6.2f}%  | ',
            f'Val acc: {val_acc[-1]*100:6.2f}%  | ',
            f'Train loss: {train_loss[-1]:6.3f}  | ',
            f'Val loss: {val_loss[-1]:6.3f}'
        )

<br>

#### 3.2.3 Results

Let's check the best accuracy on the training and validation sets.

In [None]:
print(f'Best train acc: {np.max(train_acc)*100:6.2f}%  |  Best val acc: {np.max(val_acc)*100:6.2f}%')

Let's also plot the training and validation losses and accuracies obtained during training.

In [None]:
show_training(train_loss, val_loss, train_acc, val_acc)

<br>

#### 3.2.4 Observations (3 Points)

Briefly describe your observations when you trained the model.

##### Answer

*Write your answer here.*