# Optimizers

Notebook created in PyTorch by [Daniel Fojo](https://www.linkedin.com/in/daniel-fojo/) for the [UPC School](https://www.talent.upc.edu/ing/estudis/formacio/curs/310400/postgrau-artificial-intelligence-deep-learning/) (2020).

Updated by [Gerard I. Gállego](https://www.linkedin.com/in/gerard-gallego/).

In this lab we will take a deep dive into how to use an optimizer to minimize an arbitrary function, and cover best practices for using optimizers in Deep Learning.

In [None]:
import torch
import torch.optim as optim
import plotly.graph_objects as go
import time
import numpy as np

We can use the following functions to view an animation of how the optimizer finds the minimum of a function. These functions use [plotly](https://plotly.com/) to do the plots, which is a great choice thanks to its options to create animations.

In [None]:
def animate_2d_optimization(f, x_range, points):

    x = torch.linspace(*x_range, steps=1000)
    function_graph = go.Scatter(x=x, y=f(x), mode='lines', name="Function")

    def frame_data(p):
        return [function_graph,
                go.Scatter(x=[p], y=[f(torch.tensor([float(p)])).item()], mode='markers', marker=dict(size=[15]), name="Point")]

    frames = [go.Frame(data=frame_data(p)) for p in points]

    fig = go.Figure(
        data=frame_data(points[0]),
        layout=go.Layout(
            title="Start optimization",
            updatemenus=[dict(
                type="buttons",
                buttons=[dict(label="Play",
                            method="animate",
                            args=[None])])],
            showlegend=False
        ),
        frames=frames,
    )
    return fig


def animate_3d_optimization(f, x_range, y_range, points):

    x = torch.linspace(*x_range, steps=100)
    y = torch.linspace(*y_range, steps=100)
    x, y = torch.meshgrid([x, y])
    function_surface = go.Surface(x=x, y=y, z=f(x, y), name="Function")

    def frame_data(p):
        return [function_surface,
                go.Scatter3d(x=[p[0]], y=[p[1]], z=[0.05+f(torch.tensor([p[0]], dtype=float), torch.tensor([p[1]], dtype=float)).item()], mode='markers', marker=dict(size=[15], color="white"), name="Point")]

    frames = [go.Frame(data=frame_data(p)) for p in points]

    fig = go.Figure(
        data=frame_data(points[0]),
        layout=go.Layout(
            title="Start optimization",
            showlegend=False,
            updatemenus=[dict(
                            type="buttons",
                            buttons=[dict(label="Play",
                                        method="animate",
                                        args=[None])
                            ])]
        ),
        frames=frames,
    )
    return fig

## Optimization of single-input functions
In DL we use optimizers to minimize the loss with respect to the parameters (weights) of a neural network. Let's start with a simpler case: let's find the minimum value of an analytic function, i.e., let's find the value x which minimizes the value f(x).
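To make concrete what an optimizer automates, here is a minimal sketch of plain gradient descent written by hand with autograd. The quadratic `g`, the starting point, and the step size are illustrative assumptions (they are not part of the lab); the update `x -= lr * x.grad` is the basic rule that SGD without momentum applies.

```python
import torch

def g(x):
    # Hypothetical test function with a known minimum at x = 3
    return (x - 3.0) ** 2

x = torch.tensor([0.0], requires_grad=True)
lr = 0.1  # illustrative step size

for _ in range(100):
    loss = g(x).sum()      # scalar value to differentiate
    loss.backward()        # fills x.grad with d(loss)/dx
    with torch.no_grad():  # update the value without tracking it in the graph
        x -= lr * x.grad
    x.grad.zero_()         # gradients accumulate, so reset them every step

print(x.item())  # close to 3.0
```

An optimizer object wraps the last two operations of the loop (the update and the gradient reset) behind a common interface.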

In [None]:
def f(x):
    return torch.tanh(x-2.5)**2 + 0.3*torch.tanh(x)**2

### Exercise 1

Many people say that the learning rate is the most important hyperparameter when doing Deep Learning. We will now see the importance of a good choice of LR.

Complete the `optimize_1d_function` and call it with an `init_value` of 4.5. Use SGD as an optimizer, and optimize the tensor `v`.

In [None]:
def optimize_1d_function(f, init_value, lr=0.1, steps=60):
    points = []
    v = torch.tensor([float(init_value)], requires_grad=True)

    # TODO: Use SGD as an optimizer
    optimizer = ...

    for step in range(steps):
        points.append(v.item())

        # TODO
        optimizer...

        loss = f(v)

        # TODO
        loss...

        # TODO
        optimizer...

    return animate_2d_optimization(f, x_range=[-10, 10], points=points)

In [None]:
optimize_1d_function(f, 4.5)

Is the optimizer learning fast enough? We have to tune the hyperparameters. Let's try using a LR of 10.

In [None]:
optimize_1d_function(f, 4.5, lr=10)

Clearly, this LR was too large. Let's now try an LR of 1.

In [None]:
optimize_1d_function(f, 4.5, lr=1)

Great! This was the correct choice. Note that we tried 1e-1, 1e1 and 1e0, and only 1e0 worked. This is a toy example, but this exact phenomenon happens when choosing an LR to optimize our Deep Learning model. Luckily, we have more advanced optimizers that make this choice easier!
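The three regimes (too small, about right, too large) can be reproduced on a simpler function where the arithmetic is easy to follow. A minimal sketch, assuming the hypothetical function h(x) = x² rather than the lab's `f`: each step multiplies x by (1 - 2·lr), so the iterate shrinks slowly for a tiny lr, shrinks fast for a moderate lr, and grows in magnitude once lr exceeds 1.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Run plain gradient descent on h(x) = x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad
    return x

# Illustrative learning rates: too small, reasonable, too large
final = {lr: gradient_descent(lr) for lr in (0.01, 0.4, 1.1)}
for lr, x in final.items():
    print(f"lr={lr}: |x| after 20 steps = {abs(x):.3g}")
```

The thresholds here are specific to this quadratic; for other functions the usable range of learning rates depends on the local curvature.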

**Extra:** Could we have used `init_value` = 0? Why?

### Exercise 2

Complete the function again, but now use Adam instead of SGD. Then, we will try it with the same `init_value` (4.5), first with lr=0.1 and then with lr=1.

In [None]:
def optimize_1d_function_adam(f, init_value, lr=0.1, steps=60):
    points = []
    v = torch.tensor([float(init_value)], requires_grad=True)

    # TODO: Use Adam instead of SGD
    optimizer = ...

    for step in range(steps):
        optimizer.zero_grad()
        points.append(v.item())
        loss = f(v)
        loss.backward()
        optimizer.step()

    return animate_2d_optimization(f, x_range=[-10, 10], points=points)

In [None]:
optimize_1d_function_adam(f, 4.5, lr=0.1)

In [None]:
optimize_1d_function_adam(f, 4.5, lr=1)

We can see that both learning rates worked, although neither of them worked as well as SGD with lr=1. This is the advantage of Adam: the choice of the learning rate is much more lenient. The exact same thing happens when training neural networks.

**Key takeaways**

* SGD works great, but requires tuning the lr much more carefully. We had to use specifically 1e0, otherwise the optimizer didn't find the optimum. This makes SGD the right choice when we are willing to try many different learning rates and want the best possible neural network.

* Adam is much more lenient than SGD. Both values we tried (1e-1 and 1e0) work, and reach an optimum almost as good as SGD's. This makes Adam the default choice for most people, since it performs almost as well as SGD but doesn't require nearly as much tuning.
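One way to see why Adam is more forgiving: it divides the step by a running estimate of the gradient magnitude, so the size of an update is roughly the learning rate itself, while an SGD step scales with the gradient. The sketch below writes out the first Adam step for a single parameter by hand, using the default values of `beta1`, `beta2`, and `eps` from `optim.Adam`. The helper names are hypothetical, and this is a simplification of what the optimizer does over many steps.

```python
import math

def adam_first_step(grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update for a single parameter with gradient `grad`."""
    m = (1 - beta1) * grad          # first-moment estimate (starts from 0)
    v = (1 - beta2) * grad ** 2     # second-moment estimate (starts from 0)
    m_hat = m / (1 - beta1)         # bias correction for step t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

def sgd_step(grad, lr=0.1):
    """Size of a plain SGD update for the same gradient."""
    return lr * grad

for grad in (1e-3, 1.0, 1e3):
    print(f"grad={grad:g}: SGD step={sgd_step(grad):g}, Adam step={adam_first_step(grad):g}")
```

Across six orders of magnitude in the gradient, the first Adam step stays close to `lr`, whereas the SGD step changes by the same six orders of magnitude.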

## Optimization of multi-input functions

Now, we will work on a more realistic case. The function we will try to optimize will have 2 parameters. The main difference between 1d and 2d is that we now can have saddle points, which are points that are a maximum in a direction and a minimum in another direction.
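A saddle point can be recognized from derivatives: the gradient is zero, and the Hessian has eigenvalues of both signs. The sketch below checks this at the origin for the function 0.1·(x² - y²) used in this section; the name `saddle` and the single-vector signature are assumptions made for the example.

```python
import torch

def saddle(v):
    # 0.1 * (x^2 - y^2), written for a single 2-element tensor v = [x, y]
    return 0.1 * (v[0] ** 2 - v[1] ** 2)

origin = torch.zeros(2, requires_grad=True)

# The gradient vanishes at the origin, so it is a critical point.
(grad,) = torch.autograd.grad(saddle(origin), origin)

# The Hessian eigenvalues give the curvature along each principal direction.
hessian = torch.autograd.functional.hessian(saddle, torch.zeros(2))
eigenvalues = torch.linalg.eigvalsh(hessian)

print(grad)         # zero gradient
print(eigenvalues)  # one negative and one positive value: a saddle
```

The positive eigenvalue corresponds to the x direction (a minimum) and the negative one to the y direction (a maximum).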

In [None]:
def f(x, y):
    return 1e-1*((x)**2 - (y)**2)

In [None]:
def optimize_2d_function(f, init_value, lr=0.1, steps=20):
    points = []
    v = torch.tensor(init_value, requires_grad=True)
    optimizer = optim.SGD([v], lr)

    for step in range(steps):
        points.append(v.detach().numpy().copy().tolist())
        optimizer.zero_grad()
        loss = f(v[0], v[1])
        loss.backward()
        optimizer.step()

    return animate_3d_optimization(f, x_range=[-10, 10], y_range=[-10, 10], points=points)

### Exercise 3

Search on a logarithmic scale (0.1, 1, 10, 100...) for the minimum learning rate needed to escape from the saddle point when starting at [-5, 0.01].

In [None]:
optimize_2d_function(f, [-5, 0.01], lr=0.1)

Did we have to use a large learning rate? Using a value that's too high can hinder the training process. Is there another way to escape from saddle points?

### Exercise 4

Now use SGD with 0.9 momentum. We will try a learning rate of 1 with the same starting point.

**Extra:** Do the same but with Nesterov accelerated momentum.

In [None]:
def optimize_2d_function_momentum(f, init_value, lr=0.1, steps=20):
    points = []
    v = torch.tensor(init_value, requires_grad=True)

    # TODO: Use SGD with 0.9 momentum
    optimizer = ...

    for step in range(steps):
        points.append(v.detach().numpy().copy().tolist())
        optimizer.zero_grad()
        loss = f(v[0], v[1])
        loss.backward()
        optimizer.step()

    return animate_3d_optimization(f, x_range=[-10, 10], y_range=[-10, 10], points=points)

In [None]:
optimize_2d_function_momentum(f, [-5, 0.01], lr=1)

By adding momentum, we were able to escape the saddle point with a lower learning rate. That's great!

**Key takeaway:** When using SGD, we almost always want to add momentum. It will generally accelerate learning without adding the problems of using a high learning rate. The only downside is that this adds a new hyperparameter to tune (0.9 is a good reference value for momentum).
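The effect can be traced by hand along the unstable direction of the saddle. A minimal sketch, assuming the update `buf = momentum * buf + grad` followed by `p = p - lr * buf` (the form documented for `optim.SGD` with zero dampening and without Nesterov); `escape_saddle` is a hypothetical helper, not part of the lab.

```python
def escape_saddle(lr, momentum, y0=0.01, steps=20):
    """Follow the y coordinate of 0.1 * (x**2 - y**2) under SGD with momentum."""
    y, buf = y0, 0.0
    for _ in range(steps):
        grad = -0.2 * y              # d/dy of 0.1 * (x**2 - y**2)
        buf = momentum * buf + grad  # running accumulation of past gradients
        y = y - lr * buf             # step along the accumulated direction
    return y

y_plain = escape_saddle(lr=1.0, momentum=0.0)
y_momentum = escape_saddle(lr=1.0, momentum=0.9)
print(f"no momentum: y = {y_plain:.3f}, momentum 0.9: y = {y_momentum:.3f}")
```

With the same learning rate, the accumulated buffer keeps pushing in the direction of earlier gradients, so the point moves away from y = 0 much faster than with plain SGD.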

## How to find a good initial learning rate? The LR Range Test

In 2015, Leslie N. Smith more or less formalized the above trial-and-error into a technique called the LR Range Test. The idea is simple: you run your model and data for a few iterations, starting with a very small learning rate and increasing it after each iteration. You record the loss for each learning rate value and plot it.
![lr-test](https://miro.medium.com/max/1280/1*U0Y0HWHhFZu9mHyGf3OoRw.png)
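The mechanics of the test can be sketched without a neural network. The toy version below is an assumption-laden illustration: the "model" is a single weight `w`, the loss is (w - 3)², the SGD step is written by hand, and the learning rate range is arbitrary. It only shows the bookkeeping: one learning rate per iteration, growing geometrically, with the loss recorded each time.

```python
def lr_range_test(lr_min=1e-4, lr_max=10.0, num=50):
    """Toy LR range test on the loss (w - 3)**2, using hand-written SGD."""
    ratio = (lr_max / lr_min) ** (1 / (num - 1))  # constant multiplicative increase
    lrs = [lr_min * ratio ** i for i in range(num)]
    w, losses = 0.0, []
    for lr in lrs:
        losses.append((w - 3.0) ** 2)  # record the loss seen at this learning rate
        grad = 2.0 * (w - 3.0)
        w -= lr * grad                 # one SGD step with the current learning rate
    return lrs, losses

lrs, losses = lr_range_test()
best = min(range(len(losses)), key=losses.__getitem__)
print(f"loss is lowest around lr={lrs[best]:.2g}, then rises to {losses[-1]:.3g}")
```

In this toy run the loss falls while the learning rate is small enough and blows up once it becomes too large, which is the shape the plot of a real range test is read for.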

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

import torchvision.datasets as datasets
import torchvision.transforms as transforms
import torchvision.models as models
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is available

CIFAR10 is a task of classifying thumbnail-sized images into 10 possible classes (0: airplane, 1: automobile, 2: bird, 3: cat, 4: deer, 5: dog, 6: frog, 7: horse, 8: ship, and 9: truck). The neural network architecture is named "resnet34", but we don't need to know the details for now. Let's take a look at some images:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torchvision



vis_tf = transforms.ToTensor()
vis_set = datasets.CIFAR10(root="data", train=True, download=True, transform=vis_tf)

vis_loader = torch.utils.data.DataLoader(vis_set, batch_size=16, shuffle=True)
imgs, _ = next(iter(vis_loader))  # imgs: [16, 3, 32, 32]

grid = torchvision.utils.make_grid(imgs, nrow=4, padding=2)
plt.figure(figsize=(6,6))
plt.imshow(grid.permute(1, 2, 0))  # CHW -> HWC
plt.axis('off')
plt.tight_layout()
plt.show()

We'll implement and try this method to find a good learning rate for training CIFAR10 with a resnet34 and Adam.

In [None]:
transform = transforms.Compose([
                                transforms.ToTensor(),
                                transforms.Normalize(0.5, 0.5)
])
dataset = datasets.CIFAR10(root="data", download=True, transform=transform)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)


In [None]:
model = models.resnet34(num_classes=10).to(device)

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

We will search for the learning rate between 1e-9 and 1e2, using np.logspace.

In [None]:
lr_range = np.logspace(-9, 2, num=200)

### Exercise 5

Complete the code to do the test. You can change the learning rate with an assignment of the form `optimizer.param_groups[0]["lr"] = ...`

In [None]:
loss_history = []

for lr, (images, targets) in zip(lr_range, dataloader):
    # TODO: Set the learning rate to lr
    optimizer...

    # TODO
    optimizer...

    images, targets = images.to(device), targets.to(device)

    # TODO
    output = ...

    # TODO
    loss = ...

    # TODO
    loss...

    # TODO
    optimizer...

    loss_history.append(loss.item())

Now we can read the optimal learning rate off the plot. Even though the curve is noisy, a good value for the learning rate is visible. Can you spot it?

In [None]:
import matplotlib.pyplot as plt

plt.plot(lr_range, loss_history)
plt.ylim([2, 3])
plt.xscale("log")

## I've found a good initial learning rate. Now what? LR schedulers

A generally good practice is to use a learning rate scheduler. One of the most useful ones is `ReduceLROnPlateau`. This scheduler decays the learning rate every time the loss gets stuck. Using a scheduler like this one can help squeeze a bit more performance out of your model. We will see how to use it in PyTorch. We will train the same model as before, using a lr of 1e-3 (the largest acceptable value we found using the lr test).
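To see the decay in isolation, the sketch below feeds `ReduceLROnPlateau` a made-up sequence of loss values instead of training a model. The placeholder parameter, the `fake_losses` list, and the small `patience` are assumptions chosen so the effect shows up within a few steps.

```python
import torch
import torch.optim as optim

param = torch.zeros(1, requires_grad=True)  # placeholder parameter, never trained
optimizer = optim.SGD([param], lr=0.1)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=2, factor=0.5)

# Made-up loss values: improving at first, then stuck at 0.7
fake_losses = [1.0, 0.8, 0.7] + [0.7] * 7

lr_history = []
for loss_value in fake_losses:
    optimizer.step()            # no gradients here, so this changes nothing
    scheduler.step(loss_value)  # this scheduler needs the monitored value
    lr_history.append(optimizer.param_groups[0]["lr"])

print(lr_history)
```

The learning rate stays at its initial value while the loss improves and is multiplied by `factor` once the loss has failed to improve for more than `patience` consecutive calls.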

First, let's get a baseline without a scheduler:

In [None]:
model = models.resnet34(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [None]:
epochs = 2
min_loss = float("inf")
for epoch in range(epochs):
    losses = []
    for i, (images, targets) in enumerate(dataloader):
        optimizer.zero_grad()
        images, targets = images.to(device), targets.to(device)
        output = model(images)
        loss = criterion(output, targets)
        min_loss = min(loss.item(), min_loss)
        loss.backward()
        losses.append(loss.item())
        optimizer.step()
        if i%50 == 0:
            print(f"Epoch {epoch} [{i}/{len(dataloader)}]: loss: {np.mean(losses):.2f}, lr={optimizer.param_groups[0]['lr']} min_loss:{min_loss:.2f}")

### Exercise 6
Now we will use a scheduler. First, we have to create it from the `torch.optim.lr_scheduler` module. Schedulers work very similarly to optimizers.

In [None]:
model = models.resnet34(num_classes=10).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=150, factor=0.5)

Complete the training code with a scheduler. You should add the scheduler step after the optimizer step.

In [None]:
epochs = 2
min_loss = float("inf")
for epoch in range(epochs):
    losses = []
    for i, (images, targets) in enumerate(dataloader):
        optimizer.zero_grad()
        images, targets = images.to(device), targets.to(device)
        output = model(images)
        loss = criterion(output, targets)
        min_loss = min(loss.item(), min_loss)
        loss.backward()
        losses.append(loss.item())
        optimizer.step()
        # TODO

        if i%50 == 0:
            print(f"Epoch {epoch} [{i}/{len(dataloader)}]: loss: {np.mean(losses):.2f}, lr={optimizer.param_groups[0]['lr']} min_loss:{min_loss:.2f}")