# Optimisation for Deep Learning

Learning outcomes
- understand, mathematically and intuitively, the most common optimisation algorithms used to train deep models
- implement your own optimiser in PyTorch



## Gradient descent

## SGD

## AdaGrad



## RMSProp



## Adam

## So which algorithm do I use?

Well... as usual, it depends on your problem and your dataset.

Optimisation for deep learning is still a highly active field of research. But in general, **SGD with momentum or Adam** are the go-to choices for optimising deep models.
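
As a reminder of what those two choices do at each step, their update rules can be sketched as below. The notation is an assumption rather than something defined in this module: $\theta$ are the model parameters, $g_t$ is the minibatch gradient at step $t$, $\eta$ is the learning rate, $\mu$ is the momentum coefficient, and $\beta_1, \beta_2, \epsilon$ are Adam's hyperparameters. The momentum form shown is the one `torch.optim.SGD` uses when dampening is left at zero, with $b_0 = 0$.

$$
\begin{aligned}
\textbf{SGD with momentum:}\quad
b_t &= \mu\, b_{t-1} + g_t, &
\theta_t &= \theta_{t-1} - \eta\, b_t \\[6pt]
\textbf{Adam:}\quad
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, &
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, &
\hat{v}_t &= \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
$$

Momentum smooths the direction of travel by accumulating past gradients, while Adam additionally rescales each parameter's step by a running estimate of its gradient magnitude.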

## Using these optimisation algorithms

Let's set up the same neural network as in the previous module, then swap the optimiser for Adam and others, and show how to adapt SGD to use momentum.

In [14]:
import sys
sys.path.append('..')
from utils import NN, get_dataloaders
import torch
import torch.nn.functional as F

my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)

learning_rate = 0.0001

# HOW TO USE DIFFERENT OPTIMISERS PROVIDED BY PYTORCH (uncomment exactly one)
# optimiser = torch.optim.SGD(my_nn.parameters(), lr=learning_rate, momentum=0.1)  # the momentum argument turns on SGD with momentum
# optimiser = torch.optim.Adagrad(my_nn.parameters(), lr=learning_rate)
# optimiser = torch.optim.RMSprop(my_nn.parameters(), lr=learning_rate)
optimiser = torch.optim.Adam(my_nn.parameters(), lr=learning_rate)

The stuff below is exactly the same as before!

In [27]:
# GET DATALOADERS
test_loader, val_loader, train_loader = get_dataloaders()
criterion = F.cross_entropy

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter

# TRAINING LOOP
def train(model, optimiser, graph_name, epochs=1, tag='Loss/Train'):
    writer = SummaryWriter(log_dir=f'../../runs/{tag}') # make a different writer for each tagged optimisation run
    for epoch in range(epochs):
        for idx, minibatch in enumerate(train_loader):
            inputs, labels = minibatch
            prediction = model(inputs)             # pass the data forward through the model
            loss = criterion(prediction, labels)   # compute the loss
            print('Epoch:', epoch, '\tBatch:', idx, '\tLoss:', loss)
            optimiser.zero_grad()                  # reset the gradients attribute of each of the model's params to zero
            loss.backward()                        # backward pass to compute and set all of the model param's gradients
            optimiser.step()                       # update the model's parameters
            writer.add_scalar(f'Loss/{graph_name}', loss.item(), epoch*len(train_loader) + idx)    # write loss to a graph
    writer.close()                                 # flush any pending events to disk

# train(my_nn, optimiser, graph_name='Adam')

Let's compare the training curves generated using some of the optimisers that we explained above.

In [29]:
optimisers = [
    {
        'optimiser_class': torch.optim.SGD, 
        'tag': 'SGD'
    },
    {
        'optimiser_class': torch.optim.Adam,
        'tag': 'Adam'
    },
    {
        'optimiser_class': torch.optim.Adagrad,
        'tag': 'Adagrad'
    },
    {
        'optimiser_class': torch.optim.RMSprop,
        'tag': 'RMSProp'
    }
]

learning_rates = [0.01, 0.001, 0.0001, 0.00001]

for optimiser_obj in optimisers:   
    for lr in learning_rates:
        my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)
        optimiser_class = optimiser_obj['optimiser_class']
        optimiser = optimiser_class(my_nn.parameters(), lr=lr)
        tag = optimiser_obj['tag']
        train(my_nn, optimiser, graph_name=lr, epochs=1, tag=f'Loss/Train/{tag}')
    

 	Loss: tensor(2.1468, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 340 	Loss: tensor(2.1868, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 341 	Loss: tensor(2.2765, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 342 	Loss: tensor(2.1437, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 343 	Loss: tensor(2.2342, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 344 	Loss: tensor(2.1550, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 345 	Loss: tensor(2.1320, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 346 	Loss: tensor(2.1429, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 347 	Loss: tensor(2.0624, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 348 	Loss: tensor(2.1047, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 349 	Loss: tensor(2.0750, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 350 	Loss: tensor(2.1382, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 351 	Loss: tensor(2.1334, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 352 	Loss: tensor(2.1329, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 353 	Loss: tensor(2.205
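
If you want to try the same kind of comparison without the dataset and `utils` module, the loop can be sketched on a toy problem. The snippet below is a minimal sketch, not the experiment above: it fits a small linear regression (the names `X`, `true_w` and `final_losses` are made up for illustration) with each of the four optimiser classes at a single learning rate and records the last loss each one reaches.

```python
import torch

torch.manual_seed(0)

# Toy regression data (illustrative only, not the module's dataset)
X = torch.randn(256, 5)
true_w = torch.tensor([1.0, -2.0, 0.5, 3.0, -1.5])
y = X @ true_w

optimiser_classes = {
    'SGD': torch.optim.SGD,
    'Adam': torch.optim.Adam,
    'Adagrad': torch.optim.Adagrad,
    'RMSProp': torch.optim.RMSprop,
}

# Loss of the all-zero starting weights, for reference
initial_loss = (y ** 2).mean().item()

final_losses = {}
for name, optimiser_class in optimiser_classes.items():
    w = torch.zeros(5, requires_grad=True)         # fresh parameters for every run
    optimiser = optimiser_class([w], lr=0.01)
    for _ in range(200):
        optimiser.zero_grad()
        loss = ((X @ w - y) ** 2).mean()           # mean squared error
        loss.backward()
        optimiser.step()
    final_losses[name] = loss.item()               # loss computed just before the last step

for name, value in final_losses.items():
    print(f'{name}: {initial_loss:.3f} -> {value:.5f}')
```

On a problem this simple the ranking says little about deep models. The point is only the pattern: create fresh parameters for each run, so that every optimiser starts from the same place.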

## Implementing our own PyTorch optimiser

To understand a bit further what's happening under the hood, let's implement SGD from scratch.

In [None]:
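class SGD():
    def __init__(self, model_params, learning_rate):
        self.model_params = list(model_params)     # model.parameters() is a generator, so store it as a list to reuse it
        self.learning_rate = learning_rate

    def step(self):
        with torch.no_grad():                      # the update itself must not be tracked by autograd
            for param in self.model_params:
                if param.grad is not None:
                    param -= self.learning_rate * param.grad

    def zero_grad(self):
        for param in self.model_params:
            if param.grad is not None:             # gradients are None until the first backward pass
                param.grad = torch.zeros_like(param.grad)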

In [None]:
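my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)   # same architecture as above
optimiser = SGD(my_nn.parameters(), learning_rate=0.01)                         # learning rate chosen for illustration

train(my_nn, optimiser, graph_name='custom_SGD')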
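
The same `step` / `zero_grad` pattern can be adapted to use momentum. The class below is a minimal sketch under a few assumptions: the name `MomentumSGD` and the `velocities` attribute are made up for illustration, and the update follows the form `torch.optim.SGD` uses with its default dampening (velocity = momentum * velocity + gradient). To check that reading, the snippet trains two identical small linear layers, one with the hand-written class and one with PyTorch's optimiser, and measures how far their parameters drift apart.

```python
import torch
import torch.nn.functional as F


class MomentumSGD:
    """Hand-written SGD with momentum (hypothetical helper, not part of PyTorch)."""

    def __init__(self, model_params, learning_rate, momentum=0.9):
        self.model_params = list(model_params)     # keep a reusable list of parameters
        self.learning_rate = learning_rate
        self.momentum = momentum
        # one velocity buffer per parameter, starting at zero
        self.velocities = [torch.zeros_like(p) for p in self.model_params]

    def step(self):
        with torch.no_grad():                      # updates must not be tracked by autograd
            for param, velocity in zip(self.model_params, self.velocities):
                if param.grad is None:
                    continue
                velocity.mul_(self.momentum).add_(param.grad)   # v = momentum * v + grad
                param -= self.learning_rate * velocity          # theta = theta - lr * v

    def zero_grad(self):
        for param in self.model_params:
            param.grad = None                      # backward() will allocate fresh gradients


torch.manual_seed(0)
model_a = torch.nn.Linear(3, 1)
model_b = torch.nn.Linear(3, 1)
model_b.load_state_dict(model_a.state_dict())      # start both models from identical weights
x, y = torch.randn(8, 3), torch.randn(8, 1)

ours = MomentumSGD(model_a.parameters(), learning_rate=0.1, momentum=0.9)
reference = torch.optim.SGD(model_b.parameters(), lr=0.1, momentum=0.9)

for _ in range(5):
    for model, opt in ((model_a, ours), (model_b, reference)):
        opt.zero_grad()
        loss = F.mse_loss(model(x), y)
        loss.backward()
        opt.step()

# largest absolute gap between corresponding parameters of the two models
max_difference = max(
    (pa - pb).abs().max().item()
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print('max parameter difference:', max_difference)
```

Because the class exposes the same two methods as the built-in optimisers, it can be passed to the `train` function above in place of a `torch.optim` optimiser.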