# A tutorial introducing PyTorch and an optimiser
*Marcus Frean*

We build torch code that uses autograd and gradient descent optimisers, for toy search problems in 2D.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import torch

Torch gives you Tensors, which are like numpy arrays, but with
*   **GPU**/CPU choice via *device*, and
*   **AutoGrad** via *requires_grad*
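
A minimal sketch of those two features (the names `device`, `a` and `b` are just for illustration): a tensor can be placed on a chosen device when it is created, and `requires_grad=True` switches on gradient tracking for it.

```python
import torch

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.ones(2, 3, device=device)        # lives on the chosen device, not tracked
b = torch.ones(2, 3, requires_grad=True)   # on the CPU, tracked by autograd

print(a.device, a.requires_grad)
print(b.device, b.requires_grad)
```

By default a new tensor sits on the CPU and is not tracked, so both features are opt-in.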




In [None]:
torch.randn(4,5)

Here we will make something to minimize....

Let's call it "f" -- in practice this is going to be your loss function.

It can be as "interesting" as you like, and can call other functions.

Here we're going to keep it simple, such as a quadratic bowl, or the Himmelblau function from the lecture, say.

---

> But JUST FOR EXAMPLE, it could even involve a forward mapping that *uses* these w to map "input vectors" to "outputs", and there could be another function that compares those outputs to some "targets" and gives a scalar, which f could return..... Just saying!
---
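
To make that aside concrete, here is a rough sketch of what such a loss might look like. Everything in it (`inputs`, `targets`, `f_model`, `w_try`) is made up for illustration and is not used in the rest of this notebook. Note also that, unlike the `f` defined below, it takes a single weight vector and returns one scalar.

```python
import torch

# Hypothetical toy data: three 2D input vectors and one target per input.
inputs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
targets = torch.tensor([2.0, -1.0, 1.0])

def f_model(w):
    # w has shape (1, 2): a single candidate weight vector.
    outputs = inputs @ w[0]                       # forward mapping: a linear model
    return torch.mean((outputs - targets) ** 2)   # compare to targets, return a scalar

w_try = torch.tensor([[2.0, -1.0]])
print(f_model(w_try))   # these weights fit the toy targets exactly, so the loss is 0
```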


In [None]:
def f(w):
  # EXPECTS A TWO-COLUMN TENSOR (w) whose rows are sample vectors.
  #
  # Define some interesting surface in a 2-dimensional space.
  # See https://en.wikipedia.org/wiki/Test_functions_for_optimization   for lots of ideas.
  # These are (x,y) pairs, so we can plot them
  # So let's just rename them x and y here, to match the notation on the wikipedia page.
  x, y = w[:,0], w[:,1]

  # Simple Quadratic bowl:
  return x*x + y*y   # ALT:  torch.sum(w*w, 1)  or  torch.sum(torch.mul(w, w), 1)  or  w.mul(w).sum(1)

  # Himmelblau:
  #return torch.pow(x*x + y -11, 2) + torch.pow(x + y*y -7, 2)

  # Bukin6:
  #return 100.0 * torch.sqrt(torch.absolute(y-0.01*x*x)) + 0.01*torch.absolute(x + 10)

  # Easom = brutal! like a golf putting green.
  #return -torch.cos(x) * torch.cos(y) * torch.exp(-((x-np.pi)*(x-np.pi) + (y-np.pi)*(y-np.pi)))


In [None]:
w = torch.rand(1,2)
print(w, " --> ",f(w))

In [None]:
LIMIT = 6.0
NTESTPOINTS = 8
winit = LIMIT * (2*torch.rand(NTESTPOINTS,2)-1)

In [None]:
w = torch.clone(winit)
print(w)

w.requires_grad = True
print(w.grad)

## **Question:** why did we have to say `requires_grad` ?

Setting `requires_grad = True` tells PyTorch to track all operations that involve that tensor. This allows PyTorch to automatically compute the gradient of anything computed from it (e.g. some loss function we give it) with respect to that tensor.
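
You can see the tracking directly. In this small sketch (the tensor names are illustrative) the same computation is done on a tracked and an untracked tensor, and only the tracked result remembers how it was made, via its `grad_fn`.

```python
import torch

w_plain = torch.tensor([[1.0, 2.0]])                        # not tracked
w_tracked = torch.tensor([[1.0, 2.0]], requires_grad=True)  # tracked

out_plain = (w_plain ** 2).sum()
out_tracked = (w_tracked ** 2).sum()

print(out_plain.grad_fn)     # None: nothing was recorded
print(out_tracked.grad_fn)   # a backward node recording the last operation (the sum)
```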

In [None]:
myLoss = f(w)
myLoss.backward( torch.ones_like(myLoss) )
w.grad

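## **Question:** what did `myLoss.backward` do?

The `backward` function in PyTorch is responsible for **actually computing** those gradients: the gradient of the output (here `myLoss`) with respect to each tracked tensor it depends on (here `w`), evaluated at the current values. The result is stored in `w.grad`. Because `myLoss` has one entry per row of `w` rather than being a single scalar, we pass `torch.ones_like(myLoss)`, which amounts to calling `backward` on the sum of those entries.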
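
For the quadratic bowl we can check the answer by hand, since the gradient of x*x + y*y is just (2x, 2y). A small sketch, with `bowl` and `w_demo` as stand-in names so that nothing above is overwritten:

```python
import torch

def bowl(w):
    # Same quadratic bowl as in f: one value per row of w.
    x, y = w[:, 0], w[:, 1]
    return x * x + y * y

w_demo = torch.tensor([[1.0, -2.0], [0.5, 3.0]], requires_grad=True)
loss = bowl(w_demo)                     # shape (2,): one loss per row
loss.backward(torch.ones_like(loss))    # same effect as loss.sum().backward()

print(w_demo.grad)                      # each row is 2*w: [[2., -4.], [1., 6.]]
```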

# Try some flavour of Gradient Descent on `f(w)`

At this point we COULD *explicitly* step down the gradient in tiny steps...
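
For the record, that explicit version might look something like the sketch below. The names `bowl`, `w_manual` and `lr` and the step size of 0.1 are arbitrary choices for illustration. The update happens inside `torch.no_grad()` so that the step itself is not tracked.

```python
import torch

def bowl(w):
    x, y = w[:, 0], w[:, 1]
    return x * x + y * y

w_manual = torch.tensor([[3.0, -4.0]], requires_grad=True)
lr = 0.1   # step size (assumed value, chosen for this toy problem)

for _ in range(100):
    loss = bowl(w_manual).sum()
    loss.backward()                       # fills w_manual.grad
    with torch.no_grad():
        w_manual -= lr * w_manual.grad    # step down the gradient, untracked
    w_manual.grad.zero_()                 # clear the gradient before the next step

print(w_manual)   # very close to the minimum at (0, 0)
```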

But instead, we'll use an optimiser from `torch.optim` to do it for us.

(note that generic gradient descent is called `SGD` in torch!).

In [None]:
opt = torch.optim.SGD([w], lr=0.001, momentum=0.5)  # lr is the learning rate (controls the step size)
opt.step()
w

Notice that w has now changed.

That was one step. Let's try several:

In [None]:
STEPS = 200
saved = np.ones((STEPS,len(winit),2)) # this is PURELY for making the pics to follow

for t in range(STEPS):
  saved[t,:,:] = w.detach().numpy()

  opt.zero_grad()
  y = f(w)
  y.backward(torch.ones_like(y))
  opt.step()

**Question:** Why did we have to go `w.detach().numpy()` ??



> "detach() is used to detach a tensor from the current computational graph. It returns a new tensor that doesn't require a gradient. When we don't need a tensor to be traced for the gradient computation, we detach the tensor from the current computational graph. It is most often used when you want to save the loss for logging, or save a Tensor for later inspection but you don’t need gradient information."
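
In practice you can see why it is needed: calling `.numpy()` directly on a tensor that requires grad raises an error. A quick sketch (names are illustrative):

```python
import torch

w_demo = torch.ones(2, 2, requires_grad=True)

try:
    w_demo.numpy()            # not allowed while the tensor is being tracked
    raised = False
except RuntimeError as err:
    raised = True
    print("numpy() failed:", err)

arr = w_demo.detach().numpy() # fine: the detached tensor is not tracked
print(arr)
```

One thing to be aware of is that the detached array shares memory with the original tensor. Assigning it into a slice of a separate array, as the loop above does with `saved[t,:,:] = ...`, copies the values, so later optimiser steps do not change what was saved.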



**Question**: why did we have to do `opt.zero_grad()` ?
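
One way to see the answer: `backward` *adds* to whatever is already in `.grad` rather than overwriting it. Without zeroing, each step would use the sum of all the gradients computed so far. A small sketch with illustrative names:

```python
import torch

w_demo = torch.tensor([[1.0, 2.0]], requires_grad=True)

(w_demo ** 2).sum().backward()
first = w_demo.grad.clone()      # [[2., 4.]]

(w_demo ** 2).sum().backward()   # no zeroing in between
second = w_demo.grad.clone()     # [[4., 8.]]: the new gradient was added to the old one

w_demo.grad.zero_()              # roughly what opt.zero_grad() achieves for its parameters
(w_demo ** 2).sum().backward()
third = w_demo.grad.clone()      # back to [[2., 4.]]

print(first, second, third)
```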

# Now display the trajectories
The code here is a little tedious and you need not pay it much attention - we just want to see those trajectories...

Obviously this sort of display is only possible because we're working in this tiny toy world with only 2 dimensions.

We use `mgrid` as a handy way to make the coordinates of all the points in a "grid".
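
If `mgrid` is new to you, a tiny example shows what it returns and how the reshape turns it into a two-column list of points, which is the layout `f` expects. The grid range and the names `small` and `points` are just for illustration.

```python
import numpy as np

small = np.mgrid[-1:1:0.5, -1:1:0.5]
print(small.shape)     # (2, 4, 4): x-coordinates in small[0], y-coordinates in small[1]

points = small.reshape(2, -1).T
print(points.shape)    # (16, 2): one (x, y) pair per row
print(points[:3])      # the first few grid points
```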


In [None]:
grid = np.mgrid[-LIMIT:LIMIT:0.1,  -LIMIT:LIMIT:0.1]
n, m  = grid.shape[1], grid.shape[2]
xs, ys = grid[0], grid[1]
colors = ['b','g','r','c','m','y']

gridInputs = torch.from_numpy(grid.reshape(2,-1).T)
truesurf = f(gridInputs).numpy()

plt.figure(figsize=(10,10))
plt.contour(xs,ys, truesurf.reshape(n,m), levels=50, alpha=.3)
plt.axis('equal')

for i in range(len(winit)):
  c = colors[i % len(colors)]
  plt.plot(saved[0,i,0],saved[0,i,1],'s',color=c,markersize=10)
  plt.plot(saved[:,i,0],saved[:,i,1],'-',color=c)
  plt.plot(saved[:,i,0],saved[:,i,1],'o',color=c)