<a href="https://colab.research.google.com/github/EffiSciencesResearch/ML4G-2.0/blob/master/workshops/hyperparameters/hyperparameters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hyperparameters


## Goals

The goal of this workshop is to get familiar with:
- the concept of hyperparameters
- what the usual hyperparameters of a neural network are and what effects they have
- how to tune them and validate a choice of hyperparameters

## Structure of the notebook
- Generate toy data
- Creating the neural network in PyTorch
- Creating the training loop
- Creating evaluation functions
- Hyper-parameter optimization
  - The optimizer
  - The architecture
  - The loss function
  - Prediction on a test set

## Hyperparameters

OPTIONAL: This block of text is optional reading and recaps what was explained during the lecture.

In the prerequisites you trained a model to learn the sine function. In the process, we implicitly made many decisions, such as using a degree-3 polynomial, using a specific learning rate, etc.

The learning rate is an example of a **hyperparameter**. A regular parameter is a variable whose value is automatically determined during the training process, using the optimizer. (And because we are using gradient descent, we must be able to differentiate the loss with respect to the parameters.)

The learning rate, in contrast, cannot be determined by this training process. Because it is a hyperparameter, we need an outer loop that wraps the training loop and searches for good learning rate values. This outer loop is called a hyperparameter search. Each iteration tests a different combination of hyperparameters and adds a row to a table of results consisting of pairs $(\text{hyperparameters}, \text{validation performance})$. Obtaining each row of the table requires running the full inner training loop.
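
To make the inner and outer loops concrete, here is a minimal, framework-free sketch. It assumes a toy problem (fitting a single weight `w` to `y = 3x`); the names `train_and_validate` and `candidate_lrs` are illustrative and are not part of this workshop's code.

```python
# Toy hyperparameter search: the "model" is a single weight w fitted to y = 3x.
# All names here (train_and_validate, candidate_lrs) are illustrative only.


def train_and_validate(lr, num_steps=50):
    """Inner loop: fit w by gradient descent, then return the validation MSE."""
    x_train, x_val = [0.0, 1.0, 2.0, 3.0], [4.0, 5.0]
    w = 0.0
    for _ in range(num_steps):
        # Gradient of mean((w*x - 3x)^2) with respect to w
        grad = sum(2 * (w * x - 3 * x) * x for x in x_train) / len(x_train)
        w -= lr * grad
    return sum((w * x - 3 * x) ** 2 for x in x_val) / len(x_val)


# Outer loop: one full training run per candidate value
candidate_lrs = [0.001, 0.01, 0.1, 1.0]
results = [(lr, train_and_validate(lr)) for lr in candidate_lrs]

for lr, val_loss in results:
    print(f"lr={lr:<6} validation MSE={val_loss:.3e}")

best_lr, best_loss = min(results, key=lambda row: row[1])
print("best learning rate:", best_lr)
```

On this toy problem the smallest learning rate has not converged after 50 steps and the largest one diverges, which is the kind of pattern the table of results is meant to expose.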

Because ML researcher time and available compute are limited, we face a trade-off between researcher time, the cost of running the search, and the cost of training the final model. The search space is vast and each row of results is expensive to obtain, so we don't hope to find any sort of optimum but merely to improve upon our initial guesses enough to justify the cost.

In addition, hyperparameters are not always continuous, like the learning rate, but can be discrete too, e.g. the number of layers in the network, choice of loss function, choice of optimization algorithm, learning rate scheduler, etc.
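
A search space that mixes continuous and discrete hyperparameters could be sampled as in the sketch below. The keys and ranges are assumptions chosen for illustration, and `sample_hyperparameters` is a hypothetical helper, not part of the workshop code.

```python
import random


# Hypothetical search space: the keys and ranges are illustrative assumptions.
def sample_hyperparameters(rng):
    return {
        # Continuous: sample the exponent uniformly so each decade is equally likely
        "lr": 10 ** rng.uniform(-3, 1),
        # Discrete, ordered
        "num_hidden_layers": rng.choice([1, 2, 3]),
        "batch_size": rng.choice([10, 50, 100, 400]),
        # Discrete, unordered (categorical)
        "non_linearity": rng.choice(["relu", "sigmoid", "tanh"]),
    }


rng = random.Random(0)  # fixed seed so the sampled configurations are reproducible
configs = [sample_hyperparameters(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Sampling the learning rate on a log scale is a common choice because useful values often span several orders of magnitude.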

More broadly, every design decision can be considered a hyperparameter, including how to preprocess the input data, the connectivity of different layers, the types of operations, etc. Papers such as [AmoebaNet](https://arxiv.org/pdf/1801.01548.pdf) demonstrate that an automated search over architectures can find ones superior to human-designed ones.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import torch

# Display figures on jupyter notebook
%matplotlib inline

## Generate toy data

In [None]:
# We define a function to generate our synthetic dataset, in the form of two interlaced spirals
# You don't need to understand this code, just run it


def spiral(phi):
    x = (phi + 1) * torch.cos(phi)
    y = phi * torch.sin(phi)
    return torch.cat((x, y), dim=1)


def generate_data(num_data):
    angles = torch.empty((num_data, 1)).uniform_(1, 15)
    data = spiral(angles)
    # add some noise to the data
    data += torch.empty((num_data, 2)).normal_(0.0, 0.4)
    labels = torch.zeros((num_data,), dtype=torch.int)
    # flip half of the points to create two classes
    data[num_data // 2 :, :] *= -1
    labels[num_data // 2 :] = 1
    return data, labels

In [None]:
# Generate the training set with 4000 examples
x_train, y_train = generate_data(4000)

print("X_train", x_train.shape)
print("y_train", y_train.shape)

In [None]:
# You don't need to understand this code, just run it
def plot_data(x, y):
    """Plot labeled data points X and y. Label 1 is a red +, label 0 is a blue +."""
    plt.figure(figsize=(5, 5))
    plt.plot(x[y == 1, 0], x[y == 1, 1], "r+")
    plt.plot(x[y == 0, 0], x[y == 0, 1], "b+")

In [None]:
# Visualize the data
plot_data(x_train, y_train)

As seen in the prerequisite materials, PyTorch has `Dataset` and `DataLoader` objects, which make it convenient to load the data in batches, shuffle the data, etc.

In [None]:
from torch.utils.data import TensorDataset, DataLoader

training_set = TensorDataset(x_train, y_train)
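
As a quick illustration of batching and shuffling, the sketch below iterates over a `DataLoader` built from small stand-in tensors. The names `x_demo`, `y_demo`, `demo_set` and `demo_loader` are made up for this example so that the cell runs on its own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in tensors with the same layout as x_train / y_train, but only 10 examples
x_demo = torch.arange(20, dtype=torch.float32).reshape(10, 2)
y_demo = torch.arange(10) % 2
demo_set = TensorDataset(x_demo, y_demo)

# shuffle=True reorders the examples at the start of every pass over the data
demo_loader = DataLoader(demo_set, batch_size=4, shuffle=True)

batch_shapes = []
for x_batch, y_batch in demo_loader:
    batch_shapes.append((tuple(x_batch.shape), tuple(y_batch.shape)))
print(batch_shapes)
```

Because 10 is not a multiple of 4, the last batch holds only 2 examples. The same happens with the real training set whenever the batch size does not divide the number of examples.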

## Creating the neural network

Here we create the neural network. This is the model you'll try to improve in the exercises.

It is already created for you, but you should read through the code and understand what is done on each line.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
from typing_extensions import Literal

A tutorial for constructing models can be found [here](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py).

In [None]:
# Read this code line-by-line. It's code you will see again many times


class Model(nn.Module):
    """
    A fully connected neural network with any number of layers.
    """

    NAME_TO_NONLINEARITY = {
        "relu": nn.ReLU,
        "sigmoid": nn.Sigmoid,
        "tanh": nn.Tanh,
    }

    def __init__(
        self, layers=[2, 10, 1], non_linearity: Literal["relu", "sigmoid", "tanh"] = "relu"
    ):
        super(Model, self).__init__()

        modules = []
        for input_dim, output_dim in zip(layers[:-1], layers[1:]):
            modules.append(nn.Linear(input_dim, output_dim))
            # After each linear layer, we apply a non-linearity
            modules.append(self.NAME_TO_NONLINEARITY[non_linearity]())

        # Remove the last non-linearity, since the last layer is the output layer
        self.layers = nn.Sequential(*modules[:-1])

    def forward(self, inputs):
        output = self.layers(inputs)

        # We want the model to predict 0 for one class and 1 for the other class
        # A Sigmoid activation function maps the output from [-inf, inf] to [0, 1]
        prediction = torch.sigmoid(output)
        return prediction

In [None]:
# Create the model:
model = Model()

# Choose the hyperparameters for the training loop:
num_epochs = 10
batch_size = 10

# Loss function. This one is a mean squared error (MSE) loss between the output
# of the network and the target label
criterion = nn.MSELoss()

# Optimizer. We use SGD optimizer with a learning rate (lr) of 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

## Training the model
More information can be found [here](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py) if needed.

In [None]:
# Read this code line-by-line. It's code you want to understand as it is central to ML

# tqdm is a library used to display progress bars. It's useful when training.
from tqdm.notebook import tqdm


def train(
    num_epochs: int, batch_size: int, criterion, optimizer, model, dataset, verbose: bool = False
):
    """Train a model."""
    # Store the training errors
    train_losses = []
    # Create a DataLoader to iterate over the dataset in batches
    train_loader = DataLoader(dataset, batch_size, shuffle=True)

    for epoch in tqdm(range(num_epochs)):
        epoch_average_loss = 0
        # Each epoch, we iterate over the dataset once
        for x_batch, y_true in train_loader:
            # Compute the predictions.
            # Output shape is (batch_size, 1), so we squeeze the last dimension
            y_predicted = model(x_batch).squeeze(1)

            # The loss is how far the predictions are from the true labels
            loss = criterion(y_predicted, y_true.float())

            # Do gradient descent to minimize the loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Accumulate the batch loss, weighted by the batch's share of the dataset
            # (the last batch can be smaller than batch_size)
            epoch_average_loss += loss.item() * len(x_batch) / len(dataset)

        train_losses.append(epoch_average_loss)

        if verbose:
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {epoch_average_loss:.4f}")

    return train_losses

In [None]:
# run the training loop
train_losses = train(num_epochs, batch_size, criterion, optimizer, model, training_set, verbose=True)

In [None]:
# Plot the training error wrt. the number of epochs
plt.plot(range(1, num_epochs + 1), train_losses)
plt.xlabel("num_epochs")
plt.ylabel("Train error")
plt.title("Visualization of convergence")
plt.show()

## Evaluating the model on the validation set

We first evaluate the accuracy on a validation set, to see how the model performs on data it did not see during training.

In [None]:
# Read this code line-by-line. It's code you want to understand as it is central to ML

# Generate 1000 validation datapoints
x_val, y_val = generate_data(1000)


def get_accuracy(model, x=x_val, y=y_val):
    """Compute the accuracy of the model on a dataset."""
    # Compute the predictions, without keeping track of the gradients
    with torch.no_grad():
        y_predicted = model(x).squeeze(1)

    # The predictions are in [0, 1] and the labels are either 0 or 1
    # So we round the predictions to get the predicted labels
    y_predicted = torch.round(y_predicted)

    # Compute the accuracy by counting the number of correct predictions
    accuracy = (y_predicted == y).sum().item() / len(y)

    print(f"Accuracy on {len(y)} examples: {accuracy:.2%}")
    return accuracy

In [None]:
get_accuracy(model);

Then we visualize what the model has learned by plotting all the predictions.

In [None]:
# You don't need to understand this code, just run it


def compare_predictions(model, x=x_val, y_real=y_val):
    """Compare the prediction with real labels."""

    with torch.no_grad():
        y_predicted = model(x).squeeze(1)

    plt.figure(figsize=(10, 5))

    reds = y_real > 0.5
    plt.subplot(121)
    plt.plot(x[reds, 0], x[reds, 1], "r+")
    plt.plot(x[~reds, 0], x[~reds, 1], "b+")
    plt.title("real data")

    reds = y_predicted > 0.5
    plt.subplot(122)
    plt.plot(x[reds, 0], x[reds, 1], "r+")
    plt.plot(x[~reds, 0], x[~reds, 1], "b+")
    plt.title("predicted data")

    plt.show()

In [None]:
compare_predictions(model)

## Hyper-parameter optimisation

We will now try to find the best combination of hyper-parameters.

- RECOMMENDATION: For this exercise to be maximally useful, make predictions about each experiment before running the code. Write down your predictions somewhere and check them against what is observed.

Bonus:
- If you want, you can make your predictions on [FateBook](https://fatebook.io), a nice website to easily make predictions, resolve them and see your calibration.
- Organize the results of the experiments in a tidy summary table, so that this table would be produced if you reset the notebook and clicked 'Run All'.
- At the end of this notebook, we have a separate test dataset. Why do we need this in addition to the training and validation set?
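
For the summary-table bonus, one possible approach is to append a small dictionary per experiment and render the list at the end. The helper `format_results_table` below is a hypothetical sketch, and the rows are placeholders with no measured accuracy filled in.

```python
# Illustrative only: the rows below are placeholders, not measured results.
def format_results_table(rows, columns):
    """Render a non-empty list of dicts as a plain-text table with aligned columns."""
    widths = [max(len(col), *(len(str(row[col])) for row in rows)) for col in columns]
    header = " | ".join(col.ljust(w) for col, w in zip(columns, widths))
    separator = "-+-".join("-" * w for w in widths)
    body = [
        " | ".join(str(row[col]).ljust(w) for col, w in zip(columns, widths))
        for row in rows
    ]
    return "\n".join([header, separator, *body])


results = []  # append one dict per experiment, e.g. inside your search loop
results.append({"optimizer": "SGD", "lr": 0.01, "batch_size": 10, "val_accuracy": None})
results.append({"optimizer": "Adam", "lr": 0.001, "batch_size": 100, "val_accuracy": None})

table = format_results_table(results, ["optimizer", "lr", "batch_size", "val_accuracy"])
print(table)
```

Replace `None` with the value returned by `get_accuracy` for each run, so that the table is rebuilt from scratch every time the notebook is executed.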

### Exercise 1: Impact of the optimizer

Retrain the model using different hyperparameters. You can change them where they are defined in the previous sections, or put everything you need in the cell below for convenience.

Try to see the impact of the following factors:
* Use different batch sizes, from 10 to 400
* Use different values of the learning rate (between 0.001 and 10), and see how these impact the training process.
* Change the duration of the training by increasing the number of epochs
* Use other optimizers, such as [Adam](https://pytorch.org/docs/stable/optim.html?highlight=adam#torch.optim.Adam) or [RMSprop](https://pytorch.org/docs/stable/optim.html?highlight=rmsprop#torch.optim.RMSprop)
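
When switching between optimizers, a small lookup table can keep the experiment code short. The sketch below assumes a hypothetical helper `make_optimizer` and uses a single `nn.Linear` layer as a stand-in for the workshop's `Model`, so that it runs on its own.

```python
import torch
import torch.nn as nn

# Hypothetical helper, mirroring the NAME_TO_NONLINEARITY lookup used by Model
NAME_TO_OPTIMIZER = {
    "sgd": torch.optim.SGD,
    "adam": torch.optim.Adam,
    "rmsprop": torch.optim.RMSprop,
}


def make_optimizer(name, parameters, lr):
    """Build the optimizer called `name` for the given parameters."""
    return NAME_TO_OPTIMIZER[name](parameters, lr=lr)


# Stand-in for the workshop's Model, so this cell runs on its own
demo_model = nn.Linear(2, 1)
for name in NAME_TO_OPTIMIZER:
    optimizer_demo = make_optimizer(name, demo_model.parameters(), lr=0.01)
    print(name, "->", type(optimizer_demo).__name__)
```

Remember to create a fresh model and a fresh optimizer for every run. Otherwise a later experiment continues from weights that an earlier one already trained.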

### Exercise 2: Impact of the architecture of the model

Try to see the impact of the following factors:

* Try to add more layers (1, 2, 3, more?)
* Try different activation functions ([sigmoid](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.sigmoid), [tanh](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.tanh), [relu](https://pytorch.org/docs/stable/nn.functional.html#torch.nn.functional.relu), etc.)
* Try to change the number of neurons for each layer (5, 10, 20, more?)
* Do all network architectures react the same way to different learning rates?

**Note:** These changes may interact with your previous choices of hyperparameters, and you may need to change them as well!
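
When comparing architectures, it can help to know how many trainable parameters each `layers` list implies. The helper `count_parameters` below is a hypothetical, framework-free sketch that assumes the same convention as `Model`: one fully connected layer with a bias between each pair of consecutive sizes.

```python
def count_parameters(layers):
    """Number of weights and biases in a fully connected network.

    `layers` uses the same convention as Model: [input_dim, hidden..., output_dim].
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers[:-1], layers[1:]))


for layers in ([2, 10, 1], [2, 20, 1], [2, 10, 10, 1], [2, 20, 20, 20, 1]):
    print(layers, "->", count_parameters(layers), "parameters")
```

The default `[2, 10, 1]` network has 41 parameters. Adding a second hidden layer of the same width increases the count far more than doubling the width of the single hidden layer does.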

### Exercise 3: Impact of the loss function

MSELoss is rarely used for classification nowadays; Cross Entropy is used instead.
Cross Entropy interprets the output of the network as the probability $p(y | x)$ of the point $x$ belonging to the class $y$.
Hence, the goal of the neural network is to maximize the probability of the *correct* class, that is, to maximize $\displaystyle \prod_{(x,y) \in Dataset} p(y|x)$.
Applying $-\log$ turns the product into a sum and the maximization into a minimization of:

$$ \sum_{(x,y) \in Dataset} - \log p(y | x) $$

This is the Cross Entropy loss. Usually, the number of outputs in the final layer equals the number of possible classes.
Because we have a binary problem, it is easier to just have a single output and use [BCELoss](https://pytorch.org/docs/stable/nn.html?highlight=bce#torch.nn.BCELoss).
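
As a worked instance of the sum above for the binary case, write $\hat{p} \in (0, 1)$ for the single network output (a symbol introduced here for illustration) and read it as $p(y = 1 | x)$, so that $p(y = 0 | x) = 1 - \hat{p}$. Each term of the sum then becomes:

$$ - \log p(y | x) = - \big[\, y \log \hat{p} + (1 - y) \log (1 - \hat{p}) \,\big], \qquad y \in \{0, 1\} $$

For example, with true label $y = 1$, a confident correct prediction $\hat{p} = 0.9$ costs $-\log 0.9 \approx 0.105$, while a confident wrong prediction $\hat{p} = 0.1$ costs $-\log 0.1 \approx 2.303$. By default, `BCELoss` averages this quantity over the batch rather than summing it.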

Counterintuitively, for numerical stability reasons, it is better to combine the sigmoid (done at the end of `forward`) and the BCELoss into a single function.
This is done by [BCEWithLogitsLoss](https://pytorch.org/docs/stable/nn.html?highlight=bcewithlogit#torch.nn.BCEWithLogitsLoss).
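
The stability issue can be seen without PyTorch. The sketch below compares a naive "sigmoid, then log" computation with a common rewriting that works directly on the logit `z`. The function names are made up for this example, and the rewriting is one standard way to express the loss, not necessarily how PyTorch implements it internally.

```python
import math


def bce_naive(z, y):
    """Sigmoid first, then binary cross entropy on the resulting probability."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """Same quantity, computed directly from the logit z.

    Uses the identity loss = max(z, 0) - z*y + log(1 + exp(-|z|)),
    which never takes the log of a number that rounds to zero.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# The two agree for moderate logits
print(bce_naive(2.0, 1), bce_from_logit(2.0, 1))

# A confident, wrong prediction: sigmoid(40) rounds to exactly 1.0 in float64,
# so the naive version ends up evaluating log(0)
try:
    print(bce_naive(40.0, 0))
except ValueError as error:
    print("naive version failed:", error)
print(bce_from_logit(40.0, 0))
```

The naive version breaks exactly where the loss matters most, on confident mistakes, whereas the logit-based version returns a finite value of about 40.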

So explicitly, your task is:
- Use `BCEWithLogitsLoss` instead of MSE and see how this changes the behavior of the network. This can also interact with the changes from the previous exercises.
- Ensure you modify your network so that no sigmoid is applied to the final output. Note that `get_accuracy` and `compare_predictions` assume outputs in [0, 1], so they need a matching change.

### Exercise 4: Prediction on test set

Once you have chosen your hyper-parameters and trained your final model, you should evaluate it on a test dataset (one that was never seen during training or during validation).

Question: why is this needed?