# 3.2 Additional exercises

This notebook contains a wealth of additional exercises and projects for you to pick from, as well as some glossary-style explanations. Since there are quite a few and you only have 2 hours during the tutorial, just choose the ones you like the most.

<!--
- [More on cross-validation](#More-on-cross-validation)
- [Regularisation techniques](#Regularisation-techniques)
- [Momentum](#Momentum)
- [Learning rate scheduling](#Learning-rate-scheduling)
- [Batch normalisation](#Batch-normalisation)
- [Weight initialization](#Weight-initialization)
- [Gradient clipping](#Gradient-clipping)
- [Warm-up steps](#Warm-Up-steps)
- [Ensemble methods](#Ensemble-methods)
- [Monitor and visualise](#Monitor-and-visualise)
- [Group discussion](#Group-discussion)
-->

## More on cross-validation

**Exercise 1**: As mentioned in notebook 3.1, you don't really need to write the code for cross-validation yourself. Suitable methods have already been implemented, e.g., in scikit-learn. **a)** However, to make PyTorch work with scikit-learn, you would need to wrap your model using [skorch](https://skorch.readthedocs.io/en/stable/index.html). Explore skorch. **b)** Explore [Ray Tune](https://docs.ray.io/en/latest/tune/index.html).

## Regularisation techniques

You can improve your network through regularisation techniques, such as dropout or L1/L2 regularisation, which help to prevent overfitting and improve generalisation. You met dropout in the exercises of notebook 2.1. L1/L2 regularisation simply means that you add a penalty term to your current loss that discourages large parameters (for a neural network, this means that you try to keep the weights small).

**Exercise 2**: You can include L2 regularisation (what is that exactly?) by setting weight\_decay to a non-zero value in your optimiser. What exactly would you need to do in your code? What does weight\_decay represent?

In [None]:
# EXAMPLE

import torch
import torch.optim as optim

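# Placeholder model (an assumption for this example); use your own network here
model = torch.nn.Linear(10, 1)

# Define the optimizer with L2 regularization (weight_decay)
optimizer1 = optim.SGD(model.parameters(), lr=0.01, weight_decay=0.1)  # weight_decay is the regularization strength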

**Exercise 3**: You could also add your L2 regularisation manually. To see how you might do this, have a look at the following example from [Kaggle](https://www.kaggle.com/code/cheesleypringlesman/minimizing-loss-using-l1-regularization-in-pytorch) on L1 regularisation.
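As a starting point, here is a minimal sketch of a manually added L2 penalty. The model, the random data and the value of `l2_lambda` are placeholders chosen for illustration, not part of the course material.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and data (assumptions, for illustration only)
model = nn.Linear(10, 1)
inputs = torch.randn(8, 10)
targets = torch.randn(8, 1)

criterion = nn.MSELoss()
l2_lambda = 0.01  # hypothetical regularisation strength

# Ordinary data loss
data_loss = criterion(model(inputs), targets)

# L2 penalty: sum of squared entries over all parameters (weights and biases)
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

# Total loss = data loss + penalty; the gradients now include the penalty term
loss = data_loss + l2_lambda * l2_penalty
loss.backward()
```

For plain SGD, `weight_decay=wd` adds `wd * p` to the gradient of each parameter `p`, which matches a penalty of `(wd / 2) * sum(p**2)`. The `l2_lambda` used in this sketch therefore corresponds to `weight_decay = 2 * l2_lambda`.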

## Momentum

To improve convergence when using stochastic gradient descent, we can draw on the concept of momentum from physics. In momentum-based SGD, the update is influenced not only by the current gradient but also by an exponentially decaying moving average of past gradients. This way, if the optimizer has been consistently moving in a certain direction over the last few steps, it will continue to do so, building up momentum. But how much should the previous gradients contribute? You control this with a hyperparameter.
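To make the idea concrete, the sketch below applies the momentum update by hand to a toy loss and compares the result with `torch.optim.SGD`. The toy loss, the starting point and the hyperparameter values are assumptions for illustration. Note that PyTorch's SGD (with its default `dampening=0`) accumulates a decaying sum of past gradients rather than a normalised average.

```python
import torch

lr, momentum = 0.1, 0.9  # hypothetical values

# Two copies of the same parameter: one updated by hand, one by PyTorch
w_manual = torch.tensor([1.0, -2.0])
w_torch = w_manual.clone().requires_grad_(True)
optimizer = torch.optim.SGD([w_torch], lr=lr, momentum=momentum)

velocity = torch.zeros_like(w_manual)
for step in range(5):
    # Toy loss f(w) = sum(w^2), so the gradient is 2 * w
    grad = 2 * w_manual
    velocity = momentum * velocity + grad  # past gradients decay by `momentum`
    w_manual = w_manual - lr * velocity    # step along the accumulated direction

    # The same step carried out by PyTorch
    optimizer.zero_grad()
    (w_torch ** 2).sum().backward()
    optimizer.step()

print(w_manual, w_torch.detach())
```

Both updates should agree up to floating-point rounding.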

**Exercise 4**: How do you do this in practice in PyTorch?

In [None]:
# EXAMPLE

import torch
import torch.optim as optim

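# Placeholder model (an assumption for this example); use your own network here
model = torch.nn.Linear(10, 1)

# Define the optimizer with momentum
optimizer2 = optim.SGD(model.parameters(), lr=0.01, momentum=0.9) # Here, 90 per cent of the previous momentum is carried over.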

**Exercise 5**: What does the term "Exponential Moving Average" cover?

## Learning rate scheduling

**Exercise 6**: [Learning rate scheduling](https://pytorch.org/docs/stable/optim.html) is a technique used during the training of neural networks where the learning rate is adjusted over time according to a predefined schedule. The goal is to improve the training process, potentially speeding up convergence, enhancing model performance, and achieving better generalization. Construct a simple coding example implementing learning rate scheduling.

## Batch normalisation

Before we pass the data to the neural network, we normalise it. However, as the input data $x$ gets transformed, passing through each layer, $x$ might very well blow up significantly. To avoid this, we can normalise the output of each layer using [nn.BatchNorm2d()](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm2d.html) after convolutional or pooling layers and [nn.BatchNorm1d()](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html) after fully connected layers (why?). This approach is called batch normalisation (see also the original article by [Ioffe and Szegedy](https://arxiv.org/abs/1502.03167)).
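The following sketch illustrates what a batch normalisation layer does in training mode. The data are made up, with three features on deliberately different scales. With the default settings, each feature of the output should have a mean close to 0 and a variance close to 1 across the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Made-up batch: 64 samples, 3 features on very different scales
scale = torch.tensor([1.0, 50.0, 0.1])
shift = torch.tensor([0.0, 100.0, -5.0])
x = torch.randn(64, 3) * scale + shift

bn = nn.BatchNorm1d(3)
bn.train()  # in training mode, the statistics come from the current batch

y = bn(x)

print(y.mean(dim=0))                 # per-feature mean across the batch
print(y.var(dim=0, unbiased=False))  # per-feature variance across the batch
```

In evaluation mode (`bn.eval()`), the layer uses running estimates of the mean and variance collected during training instead of the statistics of the current batch.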

**Exercise 7**: Explain the code below.

In [None]:
import torch.nn as nn

# Define a simple CNN with Batch Normalization
class CNNWithBatchNorm(nn.Module):
    def __init__(self):
        super(CNNWithBatchNorm, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.batchnorm1 = nn.BatchNorm2d(32)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.batchnorm2 = nn.BatchNorm2d(64)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2)

        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.batchnorm_fc = nn.BatchNorm1d(128)
        self.relu_fc = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.batchnorm1(x)
        x = self.relu1(x)
        x = self.pool1(x)

        x = self.conv2(x)
        x = self.batchnorm2(x)
        x = self.relu2(x)
        x = self.pool2(x)

        x = x.view(-1, 64 * 7 * 7) # Alternative to flatten
        x = self.fc1(x)
        x = self.batchnorm_fc(x)
        x = self.relu_fc(x)
        x = self.fc2(x)

        return x

## Weight initialisation

Weight initialisation is a crucial aspect of training neural networks. It involves setting the initial values of the weights in the network before training begins. Proper weight initialisation can help improve the convergence speed and the overall performance of the neural network. PyTorch does this automatically when a layer is created, which is why weight initialisation is often not explicitly included in basic examples: the default initialisation methods provided by modern deep learning frameworks are generally well-suited for many common scenarios. For linear and convolutional layers, for instance, PyTorch's default is a variant of Kaiming (He) uniform initialisation. Xavier/Glorot initialisation is a common alternative that you can apply yourself.

**Exercise 8**: You can also decide on the initialisation yourself. Check out nn.init.xavier_uniform_(). What does it do?

In [None]:
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(in_features=10, out_features=5)
        # Explicitly set Xavier/Glorot initialization for the linear layer
        nn.init.xavier_uniform_(self.fc1.weight)

    def forward(self, x):
        x = self.fc1(x)
        return x

# Instantiate the model
model = SimpleNet()

## Gradient clipping

Gradient clipping helps to prevent exploding gradients during the optimization process. You can find the corresponding tools in [PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.utils.clip_grad_norm_.html). Gradient clipping is most commonly used for recurrent neural networks (RNNs) and other models that involve sequential data processing, where the exploding gradient problem is prevalent (clipping does not help against vanishing gradients).
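A minimal sketch of where the clipping call goes in a training step is shown below. The model, the data and the threshold `max_norm` are placeholders for illustration; the targets are made deliberately large to provoke large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholder model and data (assumptions, for illustration only)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

inputs = torch.randn(16, 10)
targets = 1000.0 * torch.randn(16, 1)  # large targets give large gradients

max_norm = 1.0  # hypothetical threshold

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()

# Rescale all gradients in place so that their combined norm is at most max_norm.
# The return value is the total gradient norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

# Recompute the combined norm to see the effect of clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()
print(norm_before.item(), norm_after.item())
```

The clipping call sits between `loss.backward()` and `optimizer.step()`, so the optimizer only ever sees the rescaled gradients.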

## Warm-up steps

You can gradually increase the learning rate from a small value to its target value during the initial steps of training. This approach can stabilise the early phase of training and may help the model to converge more quickly.
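One possible way to implement a linear warm-up is with `torch.optim.lr_scheduler.LambdaLR`, as sketched below. The model, `base_lr` and `warmup_steps` are hypothetical choices, and the actual training computations are omitted.

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 1)  # placeholder model
base_lr = 0.1             # hypothetical target learning rate
warmup_steps = 5          # hypothetical number of warm-up steps

optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)

# The lambda returns a factor that multiplies base_lr:
# it grows linearly from 1/warmup_steps to 1 and then stays at 1.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps))

learning_rates = []
for step in range(8):
    learning_rates.append(optimizer.param_groups[0]["lr"])
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()

print(learning_rates)
```

After the warm-up phase, the learning rate stays at `base_lr`. In practice, warm-up is often followed by a decaying schedule.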


## Ensemble methods

Ensemble methods involve training multiple models and combining their predictions. The idea is that diverse models can collectively produce more accurate and robust predictions. Indeed, many machine learning models draw on ensemble methods (cf. random forest, boosting and bagging). In deep learning, too, you can find various ensemble methods. These include, but are not limited to:

- Model Averaging: Train multiple instances of the same deep learning architecture with different random initializations or hyperparameters and average the predictions of these models during inference.
- Bagging with neural networks: Train multiple instances of the same neural network on different subsets of the training data and average predictions during inference.
- Weight averaging: Instead of combining predictions at the decision level (as in voting or stacking), weight averaging involves combining the weights of multiple trained models to create a single model with averaged weights.
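
As an illustration of the first approach (model averaging), the sketch below averages the predicted class probabilities of three small networks that differ only in their random initialisation. The architecture and the data are made up, and the models are left untrained to keep the example short; in practice, each model would be trained first.

```python
import torch
import torch.nn as nn

# Three instances of the same architecture with different random initialisations
models = []
for seed in range(3):
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 3))
    net.eval()  # inference mode (matters for layers such as dropout or batch norm)
    models.append(net)

torch.manual_seed(123)
x = torch.randn(5, 4)  # made-up batch: 5 samples, 4 features

with torch.no_grad():
    # Class probabilities from each model, shape (n_models, batch, n_classes)
    probs = torch.stack([torch.softmax(m(x), dim=1) for m in models])

ensemble_probs = probs.mean(dim=0)            # average over the models
ensemble_pred = ensemble_probs.argmax(dim=1)  # predicted class per sample

print(ensemble_pred)
```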

**Exercise 9**: Investigate this topic further. What does PyTorch offer in this regard (discuss briefly, e.g. code found [here](https://pytorch.org/docs/stable/optim.html) and [here](https://pytorch.org/tutorials/intermediate/ensembling.html#:~:text=Model%20ensembling%20combines%20the%20predictions,vmap%20.))?

## Monitor and visualise

TensorBoard can be used with PyTorch to visualise and analyse the training of neural networks. It offers real-time visualisation of training metrics, and it includes an interactive interface. Moreover, you can easily compare multiple training runs or experiments in TensorBoard.

**Exercise 10**: Explore TensorBoard for PyTorch (e.g. [here](https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html) and [here](https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/tensorboard_with_pytorch.ipynb)).

## Group discussion

**Exercise 11**: Explore the homepages of PyTorch and Kaggle. Find at least three useful examples and references that you have not been pointed to in these notebooks and discuss them in the group.

**Exercise 12**: Discuss the content of today's lecture and the notebooks. What are the main concepts to take home? Are there any aspects (of the content or Python programming) that you feel you need to dive further into before watching the next lecture? Discuss in the group.

**Exercise 13**: Can you see any applications of the concepts presented in this course to your research?

**Exercise 14**: In the lecture, we briefly discuss graph neural networks (GNN), variational autoencoders (VAE), flow-based generative models, generative adversarial networks (GAN), natural language processing (NLP), transformers, large language models (LLM), and neural density estimators. Dive into any of these topics. Find useful material, examples and references and discuss them in your group. Are any of these deep learning techniques useful for your research?

**Exercise 15**: Are there any Deep Learning methods that we have not mentioned or covered in the course? How do they apply to your topic of research?

**Exercise 16**: Simple for-loops quickly become cumbersome and inefficient for hyperparameter optimisation. Are you interested in scaling up the optimisation of your hyperparameters beyond them? Check out [https://optuna.org/](https://optuna.org/).