# Optimisation for Deep Learning

Learning outcomes
- understand mathematically and intuitively the most common optimisation algorithms used for optimising deep models
- implement your own optimiser in PyTorch



## Challenges with optimising deep models

## Gradient Descent

![](images/gradient_descent.jpg)

## SGD

![](images/SGD.jpg)

## SGD with momentum

![](images/momentum.jpg)

## SGD with Nesterov momentum

![](images/nesterov.jpg)

## AdaGrad

Is there a more systematic way to reduce the learning rate over time?

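AdaGrad assumes so. It keeps a running sum of the squared gradients for each parameter and divides that parameter's learning rate by the square root of this sum, so parameters that have seen large gradients take progressively smaller steps.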

![](images/adagrad.jpg)
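To make the update concrete, here is a minimal sketch of AdaGrad for a single scalar parameter, written in plain Python rather than PyTorch. The function name `adagrad_step`, the toy objective `f(x) = x^2`, and the values of `lr` and `eps` are assumptions chosen for illustration only.

```python
import math

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-10):
    """One AdaGrad update for a single scalar parameter.

    `accum` is the running sum of squared gradients seen so far.
    """
    accum = accum + grad ** 2                            # the sum can only grow
    param = param - lr * grad / (math.sqrt(accum) + eps)
    return param, accum

# Toy problem (an assumption for illustration): minimise f(x) = x^2, gradient 2x
x, accum = 5.0, 0.0
effective_lrs = []
for _ in range(5):
    grad = 2 * x
    x, accum = adagrad_step(x, grad, accum)
    # the step size actually applied to the raw gradient after this update
    effective_lrs.append(0.1 / (math.sqrt(accum) + 1e-10))

print(effective_lrs)
```

Because the accumulator never shrinks, the effective learning rate in this sketch falls on every step for as long as the gradient is non-zero.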

## RMSProp

The problem with AdaGrad is that the learning rate can only ever decrease: once it has slowed down, it can never recover and increase to speed up optimisation again. So if a steep part of the loss surface is encountered before a flatter part, the learning rate for that parameter will be divided by the large gradients accumulated in the steep region and will be too small to make meaningful progress in the flatter region.

RMSProp is similar to AdaGrad except for how it accumulates the gradients used to scale down the learning rate for each parameter. Instead of continuing to sum the squares of all the gradients encountered in each direction, it takes an *exponential moving average* of them. This gives the learning rate the chance to increase again if no steep gradient has been encountered recently, because older gradients have an exponentially smaller influence on the learning rate with each optimisation step.

![](images/rmsprop.jpg)
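The difference between the two accumulators can be sketched in a few lines of plain Python. The gradient sequence below (a short steep region followed by a long flat one), the helper names, and the decay factor of 0.9 are all assumptions for illustration; library implementations may use different defaults.

```python
import math

def adagrad_accumulate(accum, grad):
    # AdaGrad: running sum of squared gradients, which never decreases
    return accum + grad ** 2

def rmsprop_accumulate(accum, grad, decay=0.9):
    # RMSProp: exponential moving average of squared gradients
    return decay * accum + (1 - decay) * grad ** 2

# Hypothetical gradient history: 3 steep steps, then 50 steps on a flat region
grads = [10.0] * 3 + [0.1] * 50
lr, eps = 0.01, 1e-8

ada, rms = 0.0, 0.0
for g in grads:
    ada = adagrad_accumulate(ada, g)
    rms = rmsprop_accumulate(rms, g)

adagrad_lr = lr / (math.sqrt(ada) + eps)
rmsprop_lr = lr / (math.sqrt(rms) + eps)
print('AdaGrad effective lr:', adagrad_lr)
print('RMSProp effective lr:', rmsprop_lr)
```

In this sketch the AdaGrad accumulator still carries the full weight of the three steep steps, whereas the moving average has mostly forgotten them, so RMSProp ends up with the larger effective learning rate on the flat region.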

## Adam

![](images/adam.jpg)

## So which algorithm do I use?

Well... as usual, it depends on your problem and your dataset.

It's still a highly active field of research. But in general, **SGD with momentum or Adam** are the go-to choices for optimising deep models.

## Using these optimisation algorithms

Let's set up the same neural network as in the previous module, then swap the optimiser for Adam and others, and show how SGD can be adapted to use momentum.

In [1]:
import sys
sys.path.append('..')
from utils import NN, get_dataloaders
import torch
import torch.nn.functional as F

my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)

learning_rate = 0.0001

# HOW TO USE DIFFERENT OPTIMISERS PROVIDED BY PYTORCH (uncomment exactly one)
# optimiser = torch.optim.SGD(my_nn.parameters(), lr=learning_rate, momentum=0.1)
# optimiser = torch.optim.Adagrad(my_nn.parameters(), lr=learning_rate)
# optimiser = torch.optim.RMSprop(my_nn.parameters(), lr=learning_rate)
optimiser = torch.optim.Adam(my_nn.parameters(), lr=learning_rate)

The rest of the training code is exactly the same as before!

In [2]:
# GET DATALOADERS
test_loader, val_loader, train_loader = get_dataloaders()
criterion = F.cross_entropy

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter

# TRAINING LOOP
def train(model, optimiser, graph_name, epochs=1, tag='Loss/Train'):
    writer = SummaryWriter(log_dir=f'../../runs/{tag}') # make a different writer for each tagged optimisation run
    for epoch in range(epochs):
        for idx, minibatch in enumerate(train_loader):
            inputs, labels = minibatch
            prediction = model(inputs)             # pass the data forward through the model
            loss = criterion(prediction, labels)   # compute the loss
            print('Epoch:', epoch, '\tBatch:', idx, '\tLoss:', loss)
            optimiser.zero_grad()                  # reset the gradients attribute of each of the model's params to zero
            loss.backward()                        # backward pass to compute and set all of the model param's gradients
            optimiser.step()                       # update the model's parameters
            writer.add_scalar(f'Loss/{graph_name}', loss, epoch*len(train_loader) + idx)    # write loss to a graph

# train(my_nn, optimiser, graph_name='Adam')  # graph_name is a required argument



Let's compare the training curves generated using some of the optimisers that we explained above.

In [3]:
optimisers = [
    {
        'optimiser_class': torch.optim.SGD, 
        'tag': 'SGD'
    },
    {
        'optimiser_class': torch.optim.Adam,
        'tag': 'Adam'
    },
    {
        'optimiser_class': torch.optim.Adagrad,
        'tag': 'Adagrad'
    },
    {
        'optimiser_class': torch.optim.RMSprop,
        'tag': 'RMSProp'
    }
]

learning_rates = [0.01, 0.001, 0.0001, 0.00001]

for optimiser_obj in optimisers:   
    for lr in learning_rates:
        my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)
        optimiser_class = optimiser_obj['optimiser_class']
        optimiser = optimiser_class(my_nn.parameters(), lr=lr)
        tag = optimiser_obj['tag']
        train(my_nn, optimiser, graph_name=lr, epochs=1, tag=f'Loss/Train/{tag}')
    

Epoch: 0 	Batch: 340 	Loss: tensor(2.1566, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 341 	Loss: tensor(2.1759, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 342 	Loss: tensor(2.0939, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 343 	Loss: tensor(2.0399, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 344 	Loss: tensor(2.1666, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 345 	Loss: tensor(2.1858, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 346 	Loss: tensor(2.1874, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 347 	Loss: tensor(2.1195, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 348 	Loss: tensor(2.0071, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 349 	Loss: tensor(2.1568, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 350 	Loss: tensor(2.1484, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 351 	Loss: tensor(2.1725, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 352 	Loss: tensor(2.0955, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 353 	Loss: tensor(2.088

## Implementing our own PyTorch optimiser

To understand a bit further what's happening under the hood, let's implement SGD from scratch.

In [78]:
class SGD:
    def __init__(self, model_params, learning_rate):
        # model.parameters() returns a generator, which is exhausted after one pass;
        # store it as a list so every step can iterate over the parameters again
        self.model_params = list(model_params)
        self.learning_rate = learning_rate

    def step(self):
        with torch.no_grad(): # the update itself must not be tracked by autograd
            for param in self.model_params:
                if param.grad is None: # no gradient yet (loss.backward() not yet called)
                    continue
                param -= self.learning_rate * param.grad

    def zero_grad(self):
        for param in self.model_params:
            if param.grad is None: # if not yet set (loss.backward() not yet called)
                continue
            param.grad = torch.zeros_like(param.grad)


In [79]:
my_nn = NN([784, 1024, 1024, 512, 10], distribution=True, flatten_input=True)
optimiser = SGD(my_nn.parameters(), learning_rate=0.1)

train(my_nn, optimiser, 'custom_sgd')

Epoch: 0 	Batch: 362 	Loss: tensor(2.2908, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 363 	Loss: tensor(2.2891, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 364 	Loss: tensor(2.2966, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 365 	Loss: tensor(2.2943, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 366 	Loss: tensor(2.2945, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 367 	Loss: tensor(2.2896, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 368 	Loss: tensor(2.2937, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 369 	Loss: tensor(2.2968, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 370 	Loss: tensor(2.2864, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 371 	Loss: tensor(2.2943, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 372 	Loss: tensor(2.2943, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 373 	Loss: tensor(2.2934, grad_fn=<NllLossBackward>)
Epoch: 0 	Batch: 374 	Loss: tensor(2.2931, grad_fn=<NllLossBackward>)
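
The same from-scratch approach can be adapted to use momentum. The sketch below is one possible way to do it, not a definitive implementation: the class name `SGDMomentum`, the default `momentum=0.9`, and the toy quadratic used to exercise it are assumptions for illustration. It keeps one velocity tensor per parameter and uses the update `v = momentum * v + grad`, then `param -= learning_rate * v`.

```python
import torch

class SGDMomentum:
    """Hypothetical from-scratch SGD with momentum (illustrative sketch)."""

    def __init__(self, model_params, learning_rate, momentum=0.9):
        self.model_params = list(model_params)  # a generator can only be iterated once
        self.learning_rate = learning_rate
        self.momentum = momentum
        # one velocity buffer per parameter, initialised to zero
        self.velocities = [torch.zeros_like(p) for p in self.model_params]

    def step(self):
        with torch.no_grad():
            for param, velocity in zip(self.model_params, self.velocities):
                if param.grad is None:  # loss.backward() not yet called
                    continue
                velocity.mul_(self.momentum).add_(param.grad)  # v = momentum * v + grad
                param -= self.learning_rate * velocity

    def zero_grad(self):
        for param in self.model_params:
            if param.grad is not None:
                param.grad.zero_()

# Toy check (an assumption for illustration): minimise w^2 starting from w = 5
w = torch.tensor([5.0], requires_grad=True)
optimiser = SGDMomentum([w], learning_rate=0.1, momentum=0.9)
for _ in range(3):
    optimiser.zero_grad()
    loss = (w ** 2).sum()
    loss.backward()
    optimiser.step()
print(w)
```

Since it exposes the same `step` and `zero_grad` methods as the `SGD` class above, it should be possible to pass it to `train` in the same way, for example `train(my_nn, SGDMomentum(my_nn.parameters(), learning_rate=0.1), 'custom_sgd_momentum')`.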


## Challenges
- flash card match images with name of optimisation algorithm
- roughly sketch the paths that different optimisation algorithms might take