# Optimisation for Deep Learning

Learning outcomes
- understand, both mathematically and intuitively, the most common algorithms for optimising deep models
- implement your own optimiser in PyTorch



## Gradient descent
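
Gradient descent repeatedly moves the parameters a small step against the gradient of the loss: $w \leftarrow w - \eta \nabla L(w)$, where $\eta$ is the learning rate. Here is a minimal sketch, assuming a toy one-dimensional loss $L(w) = (w - 3)^2$ that is chosen purely for illustration and is not part of this module.

```python
# Toy loss L(w) = (w - 3)^2, an illustrative assumption; its minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)                      # dL/dw

def gradient_descent(w, learning_rate=0.1, steps=100):
    for _ in range(steps):
        w = w - learning_rate * grad(w)     # step against the gradient
    return w

w_final = gradient_descent(w=0.0)
print(w_final)                              # should end up very close to 3
```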

## SGD
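
Stochastic gradient descent applies the same update, but estimates the gradient from a small random minibatch instead of the whole dataset. The sketch below assumes a hypothetical noiseless dataset where $y = 2x$, so the best weight is $w = 2$; the data, batch size and learning rate are illustrative choices only.

```python
import random

random.seed(0)
# Hypothetical toy data: y = 2x exactly, so the optimal weight is w = 2.
data = [(i / 10, 2 * i / 10) for i in range(1, 101)]

def minibatch_grad(w, batch):
    # Gradient of the mean squared error (w*x - y)^2, computed on this batch only.
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd(w, learning_rate=0.005, epochs=20, batch_size=16):
    for _ in range(epochs):
        random.shuffle(data)                           # new random order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]     # the last batch may be smaller
            w = w - learning_rate * minibatch_grad(w, batch)
    return w

w_final = sgd(w=0.0)
print(w_final)                                         # should end up close to 2
```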

## AdaGrad
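
AdaGrad keeps a running sum of squared gradients for each parameter and divides the step by its square root, so parameters that have seen large gradients take smaller steps. A minimal sketch on the same kind of toy loss $L(w) = (w - 3)^2$ (an assumption for illustration, as are the hyperparameter values):

```python
import math

def grad(w):
    return 2 * (w - 3)                       # gradient of the toy loss (w - 3)^2

def adagrad(w, learning_rate=0.5, steps=500, eps=1e-8):
    grad_sq_sum = 0.0                        # running SUM of squared gradients (never decays)
    for _ in range(steps):
        g = grad(w)
        grad_sq_sum += g * g
        w = w - learning_rate * g / (math.sqrt(grad_sq_sum) + eps)
    return w

w_final = adagrad(w=0.0)
print(w_final)                               # should end up close to 3
```

Because the sum only ever grows, the effective learning rate can only shrink over time.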



## RMSProp
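
RMSProp replaces AdaGrad's ever-growing sum with an exponentially decaying average of squared gradients, so the effective learning rate can recover when gradients become small. A minimal sketch, again assuming the toy loss $L(w) = (w - 3)^2$ and illustrative hyperparameters:

```python
import math

def grad(w):
    return 2 * (w - 3)                       # gradient of the toy loss (w - 3)^2

def rmsprop(w, learning_rate=0.01, decay=0.9, steps=1000, eps=1e-8):
    avg_sq = 0.0                             # exponentially decaying AVERAGE of squared gradients
    for _ in range(steps):
        g = grad(w)
        avg_sq = decay * avg_sq + (1 - decay) * g * g
        w = w - learning_rate * g / (math.sqrt(avg_sq) + eps)
    return w

w_final = rmsprop(w=0.0)
print(w_final)                               # should end up near 3
```

With a constant learning rate the normalised step tends to stay on the order of the learning rate itself, so expect the result to hover near the minimum rather than settle on it exactly.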



## Adam
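
Adam combines a decaying average of the gradients (a momentum-like term) with a decaying average of the squared gradients (as in RMSProp), and corrects both for their bias towards zero in the early steps. A minimal sketch using the commonly quoted defaults $\beta_1 = 0.9$ and $\beta_2 = 0.999$; the toy loss $L(w) = (w - 3)^2$, the learning rate and the step count are assumptions for illustration.

```python
import math

def grad(w):
    return 2 * (w - 3)                       # gradient of the toy loss (w - 3)^2

def adam(w, learning_rate=0.01, beta1=0.9, beta2=0.999, steps=2000, eps=1e-8):
    m = 0.0                                  # decaying average of gradients
    v = 0.0                                  # decaying average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)         # bias correction: m and v start at zero
        v_hat = v / (1 - beta2 ** t)
        w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_final = adam(w=0.0)
print(w_final)                               # should end up near 3
```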

## So which algorithm do I use?

Well... as usual, it depends on your problem and your dataset.

This is still a highly active area of research, but in general **SGD with momentum** and **Adam** are the go-to choices for optimising deep models.
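
Momentum has not been written out so far, so here is a rough sketch of the idea: keep a "velocity" that accumulates past gradients, and step along the velocity rather than the raw gradient. This follows the formulation used by `torch.optim.SGD` (velocity = momentum × velocity + gradient); the toy loss $L(w) = (w - 3)^2$ and the hyperparameter values are assumptions for illustration.

```python
def grad(w):
    return 2 * (w - 3)                                  # gradient of the toy loss (w - 3)^2

def sgd_momentum(w, learning_rate=0.1, momentum=0.9, steps=300):
    velocity = 0.0
    for _ in range(steps):
        velocity = momentum * velocity + grad(w)        # decayed running sum of past gradients
        w = w - learning_rate * velocity                # step along the velocity
    return w

w_final = sgd_momentum(w=0.0)
print(w_final)                                          # should end up close to 3
```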

## Using these optimisation algorithms

Let's set up the same neural network as in the previous module, then swap in different optimisers such as Adam, and see how to add momentum to SGD.

In [8]:
import sys
sys.path.append('..')
from utils import NN, get_dataloaders
import torch
import torch.nn.functional as F

my_nn = NN([784, 1024, 512, 10], distribution=True, flatten_input=True)

learning_rate = 0.1

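# HOW TO USE DIFFERENT OPTIMISERS PROVIDED BY PYTORCH
# only one optimiser should be active at a time; uncomment the one you want to try
# optimiser = torch.optim.SGD(my_nn.parameters(), lr=learning_rate, momentum=0.1)    # SGD with momentum
# optimiser = torch.optim.Adagrad(my_nn.parameters(), lr=learning_rate)
# optimiser = torch.optim.RMSprop(my_nn.parameters(), lr=learning_rate)
optimiser = torch.optim.Adam(my_nn.parameters(), lr=learning_rate)                   # note: PyTorch's default lr for Adam is 1e-3, so 0.1 may be too large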

The rest of the training code is exactly the same as before!

In [10]:
# GET DATALOADERS
test_loader, val_loader, train_loader = get_dataloaders()
criterion = F.cross_entropy

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir='../../runs')                            # we will use this to show our model's performance on a graph

# TRAINING LOOP
def train(model, optimiser, epochs=1):
    for epoch in range(epochs):
        for idx, minibatch in enumerate(train_loader):
            inputs, labels = minibatch
            print(inputs.shape)
            prediction = model(inputs)             # pass the data forward through the model
            print(prediction.shape)
            print(labels.shape)
            loss = criterion(prediction, labels)   # compute the loss
            print('Epoch:', epoch, '\tBatch:', idx, '\tLoss:', loss)
            optimiser.zero_grad()                  # reset the gradient of each of the model's parameters to zero
            loss.backward()                        # backward pass to compute and store the gradient of every model parameter
            optimiser.step()                       # update the model's parameters
            writer.add_scalar('Loss/Train', loss, epoch*len(train_loader) + idx)    # write loss to a graph

train(my_nn, optimiser)

torch.Size([16, 1, 28, 28])
torch.Size([16, 10])
torch.Size([16])
Epoch: 0 	Batch: 478 	Loss: tensor(2.3987, grad_fn=<NllLossBackward>)
torch.Size([16, 1, 28, 28])
torch.Size([16, 10])
torch.Size([16])
Epoch: 0 	Batch: 479 	Loss: tensor(2.3987, grad_fn=<NllLossBackward>)
torch.Size([16, 1, 28, 28])
torch.Size([16, 10])
torch.Size([16])
Epoch: 0 	Batch: 480 	Loss: tensor(2.3987, grad_fn=<NllLossBackward>)
torch.Size([16, 1, 28, 28])
torch.Size([16, 10])
torch.Size([16])
Epoch: 0 	Batch: 481 	Loss: tensor(2.2737, grad_fn=<NllLossBackward>)
torch.Size([16, 1, 28, 28])
torch.Size([16, 10])
torch.Size([16])
Epoch: 0 	Batch: 482 	Loss: tensor(2.3987, grad_fn=<NllLossBackward>)
torch.Size([16, 1, 28, 28])
torch.Size([16, 10])
torch.Size([16])
Epoch: 0 	Batch: 483 	Loss: tensor(2.4612, grad_fn=<NllLossBackward>)
torch.Size([16, 1, 28, 28])
torch.Size([16, 10])
torch.Size([16])
Epoch: 0 	Batch: 484 	Loss: tensor(2.4612, grad_fn=<NllLossBackward>)

Let's visualise the training curves using some of the optimisers that we explained above.
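
One way to get comparable curves is to train an identical model once per optimiser and record the loss at every step. The sketch below is self-contained rather than tied to this module's `NN` and dataloaders: it assumes a small hypothetical linear-regression problem, and the helper name `loss_curve` and the learning rates are illustrative choices, not tuned values.

```python
import torch

def loss_curve(optimiser_class, steps=200, **optimiser_kwargs):
    """Train a fresh toy model with the given optimiser and return its loss per step."""
    torch.manual_seed(0)                               # same initial weights and data for every optimiser
    model = torch.nn.Linear(10, 1)
    inputs = torch.randn(64, 10)
    targets = inputs.sum(dim=1, keepdim=True)          # hypothetical target: the sum of the features
    optimiser = optimiser_class(model.parameters(), **optimiser_kwargs)
    losses = []
    for _ in range(steps):
        optimiser.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimiser.step()
        losses.append(loss.item())
    return losses

curves = {
    'SGD': loss_curve(torch.optim.SGD, lr=0.01),
    'SGD + momentum': loss_curve(torch.optim.SGD, lr=0.01, momentum=0.9),
    'Adagrad': loss_curve(torch.optim.Adagrad, lr=0.1),
    'RMSprop': loss_curve(torch.optim.RMSprop, lr=0.01),
    'Adam': loss_curve(torch.optim.Adam, lr=0.01),
}
for name, losses in curves.items():
    print(f'{name}: first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}')
```

Each list in `curves` could then be written to TensorBoard with `writer.add_scalar`, using a different tag per optimiser so the curves show up as separate lines.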

In [None]:
def train(model, optimiser):
    

## Implementing our own PyTorch optimiser

To understand a bit further what's happening under the hood, let's implement SGD from scratch.

In [None]:
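class SGD():
    def __init__(self, model_params, learning_rate):
        self.model_params = list(model_params)     # .parameters() returns a generator, so store a list we can loop over repeatedly
        self.learning_rate = learning_rate

    def step(self):
        with torch.no_grad():                      # the update itself must not be tracked by autograd
            for param in self.model_params:
                if param.grad is not None:
                    param -= self.learning_rate * param.grad

    def zero_grad(self):
        for param in self.model_params:
            if param.grad is not None:             # gradients are None until the first backward pass
                param.grad = torch.zeros_like(param.grad)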
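
As a possible next step, the same interface can be extended with momentum. The sketch below is an assumption about how one might do it, not part of the original module: the class name `SGDMomentum` is hypothetical, and the update follows the formulation used by `torch.optim.SGD` (velocity = momentum × velocity + gradient). It ends by checking the result against PyTorch's own optimiser on a small stand-in model.

```python
import torch

class SGDMomentum:
    """Hypothetical example class: SGD with momentum, exposing step() and zero_grad()."""
    def __init__(self, model_params, learning_rate, momentum=0.9):
        self.model_params = list(model_params)
        self.learning_rate = learning_rate
        self.momentum = momentum
        # one velocity buffer per parameter, starting at zero
        self.velocities = [torch.zeros_like(p) for p in self.model_params]

    def step(self):
        with torch.no_grad():                                  # updates must not be tracked by autograd
            for param, velocity in zip(self.model_params, self.velocities):
                if param.grad is None:
                    continue
                velocity.mul_(self.momentum).add_(param.grad)  # v = momentum * v + grad
                param -= self.learning_rate * velocity

    def zero_grad(self):
        for param in self.model_params:
            param.grad = None                                  # backward() will allocate fresh gradients

# Sanity check against torch.optim.SGD on two identical toy models
torch.manual_seed(0)
model_a = torch.nn.Linear(4, 2)
model_b = torch.nn.Linear(4, 2)
model_b.load_state_dict(model_a.state_dict())                 # start from the same weights
ours = SGDMomentum(model_a.parameters(), learning_rate=0.1, momentum=0.9)
reference = torch.optim.SGD(model_b.parameters(), lr=0.1, momentum=0.9)
inputs, targets = torch.randn(8, 4), torch.randn(8, 2)

for _ in range(5):
    for model, optimiser in [(model_a, ours), (model_b, reference)]:
        optimiser.zero_grad()
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        loss.backward()
        optimiser.step()

max_diff = max((pa - pb).abs().max().item()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print('largest parameter difference:', max_diff)
```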

In [None]:
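my_nn = NN([784, 1024, 512, 10], distribution=True, flatten_input=True)    # same architecture as above
optimiser = SGD(my_nn.parameters(), learning_rate=learning_rate)           # our own optimiser instead of torch.optim's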

train(my_nn, optimiser)