# Math4ML Part II: Calculus

# Setup Code

This section includes setup code for the remaining sections.

In [None]:
%%capture
!pip install --upgrade -qq wandb okpy==1.15.0

In [None]:
# importing from standard library
import random
import sys

# importing libraries
import autograd
import autograd.numpy as np  # trick for automatic differentiation with numpy
import matplotlib.pyplot as plt
import wandb

if 'google.colab' in str(get_ipython()):
    !git clone --branch "math4ml/reorg" "https://github.com/wandb/edu.git"
    %cd "edu/math-for-ml/02_calculus"

if "../" not in sys.path:
    sys.path.append("../")

# importing course-specific modules
import autograder
import utils

## Section 1. How does Gradient Descent Behave?

In general, machine learning models depend on more than one parameter,
and so to understand the behavior of gradient descent,
we need to consider loss functions with multiple input dimensions,
also known as loss *surfaces*.

To draw a loss surface, we plot the value of the loss function at each combination of values for the inputs. Because we can visualize at most three dimensions, we use two input dimensions and reserve the third for the loss.
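
As a rough illustration of "each combination of values for the inputs", here is a minimal sketch that evaluates a loss on a square grid. The helper name `evaluate_on_grid` and its parameters are hypothetical, introduced only for this example; how `utils.surfaces.plot_loss_surface` builds its mesh is not shown in this notebook and may differ.

```python
import numpy as np


def evaluate_on_grid(loss, n_points, extent):
    """Evaluate loss(x, y) on an n_points-by-n_points grid over [-extent, extent].

    Hypothetical helper for illustration; not part of the course `utils` module.
    """
    axis = np.linspace(-extent, extent, n_points)
    xs, ys = np.meshgrid(axis, axis)  # every combination of x and y values
    return xs, ys, loss(xs, ys)


# same formula as the simplest bowl-shaped loss, x^2 + y^2
bowl = lambda x, y: np.square(x) + np.square(y)

xs, ys, zs = evaluate_on_grid(bowl, n_points=5, extent=1.0)
# zs[i, j] is the height of the surface above the point (xs[i, j], ys[i, j])
```

The three arrays `xs`, `ys`, and `zs` are what a 3-D plotting routine would typically consume: two for the input plane and one for the height.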

A surface is one way to generalize the familiar old idea of the graph of a function to functions that take more than one input.

We live on a surface of this type: the surface of the Earth. To think of it as a loss function, imagine that you dislike being at high altitude, so that the loss is your altitude as a function of your latitude and longitude. Consider: what point on Earth would minimize this loss function?

Let's visualize some surfaces that are closer to what you might see if you plotted the loss surface for a machine learning model you were optimizing. A list of loss functions is defined in the cells below, first in Python and then, for some, in mathematical notation.

In [None]:
scale = 1

losses = [lambda x,y: np.square(x) + np.square(y),
          lambda x,y: np.square(x) + 0.1 * np.square(y),
          lambda x,y: 3 * np.square(x) + 0.1 * np.square(y),
          lambda x,y: np.cos(3 * x) + 0.5 * (np.square(x) + np.square(y) + x),
          lambda x,y: np.cos(3 * x) + np.square(x)+np.square(y),
          lambda x,y: np.where(np.abs(x)+np.abs(y)<0.75,0,np.abs(x)+np.abs(y)-0.75),
          lambda x,y: np.where(x>y+0.25,1.25,y)+np.square(x)+np.square(y),
          lambda x,y: 0.1 * np.random.standard_normal(size=x.shape),
          lambda x,y: utils.surfaces.gauss_random_field(x, y, scale)]

$$
\begin{align}
    l_0(x, y) &= x^2 + y^2\\
    l_1(x, y) &= x^2 + 0.1 \cdot y^2\\
    l_2(x, y) &= 3x^2+ 0.1 \cdot y^2\\
    l_3(x, y) &= \cos(3x) + 0.5 \cdot (x^2 + y^2 + x)\\
    l_4(x, y) &= \cos(3x) + x^2 + y^2\\
    l_5(x, y) &= \left\{\begin{array}{rl}
            |x| + |y| - 0.75, & \text{if } |x| + |y| > 0.75\\
            0, & \text{otherwise }
            \end{array}\right.\\
    l_6(x, y) &= x^2 +y^2 + \left\{\begin{array}{rl}
            1.25, & \text{if } x > y + 0.25\\
            y, & \text{otherwise }
            \end{array}\right.\\
\end{align}
$$

The next cell produces a 3-D plot of a single loss surface, chosen by indexing into the `losses` list.

The plots are interactive, to the extent that you can change the perspective. You can rotate by clicking and dragging with the left mouse button and zoom by doing the same with the right.

In [None]:
loss = losses[0]

N = 50

mesh_extent = 1.5

utils.surfaces.plot_loss_surface(loss, N, mesh_extent)

The following questions will ask you to visualize these loss surfaces and answer questions about them.

For certain problems, gradient descent performs nicely. View `losses[0]`.

#### Q Pick a few different starting points and follow the direction of steepest descent. Where do you end up?

#### Q What's nice about this loss surface?
*This question might be easier to answer once you've seen some of the other loss surfaces.*

When we follow gradients numerically, using a computer, we have to pick a scale for the "size" of steps we take. This can cause problems we might not anticipate with a view of gradient descent based on physical intuition.
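
To make "following gradients numerically" concrete, here is a minimal sketch of gradient descent with an explicit step size. It assumes a central finite-difference estimate of the gradient rather than `autograd`, and the names `numerical_gradient`, `gradient_descent`, `step_size`, and `num_steps` are hypothetical choices for this example, not part of the course code.

```python
def numerical_gradient(loss, x, y, eps=1e-6):
    """Central finite-difference estimate of the gradient of loss at (x, y)."""
    dx = (loss(x + eps, y) - loss(x - eps, y)) / (2 * eps)
    dy = (loss(x, y + eps) - loss(x, y - eps)) / (2 * eps)
    return dx, dy


def gradient_descent(loss, start, step_size, num_steps):
    """Repeatedly step against the gradient; returns every point visited."""
    x, y = start
    path = [(x, y)]
    for _ in range(num_steps):
        dx, dy = numerical_gradient(loss, x, y)
        # step_size scales how far we move along the negative gradient
        x, y = x - step_size * dx, y - step_size * dy
        path.append((x, y))
    return path


bowl = lambda x, y: x ** 2 + y ** 2  # same formula as losses[0]

path = gradient_descent(bowl, start=(1.0, -1.0), step_size=0.1, num_steps=50)
final_x, final_y = path[-1]
```

Try re-running the sketch with other values of `step_size` and watch how the points in `path` change.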

View `losses[1]` and then `losses[2]`.

#### Q Why might picking a step size cause issues with the surface `losses[1]` or `losses[2]`?

Other problems can't be solved effectively by gradient descent. Select `losses[3]`.

#### Q Again, select multiple different starting points and follow the direction of steepest descent. What's different in this case?

#### Q Why might this change in the behavior of gradient descent be a bad thing?

Several cases similar to the issue above are also of interest. View `losses[4]` and then `losses[5]`.

#### Q Compare and contrast the loss surfaces `losses[3]`, `losses[4]`, and `losses[5]`. Which ones cause issues for optimization? Explain your answers.

Some issues are more theoretical than practical. View `losses[6]`.

#### Q Can we still do gradient descent on this loss surface? Why or why not?

For some loss functions, the right method of minimization can be hard to decide. View `losses[7]`.

For this function, the value at each point is random and independent of the values at all other points.
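
One way to see what independence means here is to compare the heights at horizontally adjacent grid points. The sketch below is an illustration only: it uses a seeded `np.random.default_rng` generator (an assumption, for repeatability) instead of the global `np.random` state used in `losses`, and the names `noise_loss` and `neighbor_correlation` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator: an assumption for repeatability


def noise_loss(x, y):
    """Independent Gaussian noise at every input, in the spirit of losses[7]."""
    return 0.1 * rng.standard_normal(size=np.shape(x))


def neighbor_correlation(zs):
    """Correlation between each grid value and its right-hand neighbor."""
    left, right = zs[:, :-1].ravel(), zs[:, 1:].ravel()
    return np.corrcoef(left, right)[0, 1]


axis = np.linspace(-1.5, 1.5, 50)
xs, ys = np.meshgrid(axis, axis)

noise_corr = neighbor_correlation(noise_loss(xs, ys))
bowl_corr = neighbor_correlation(np.square(xs) + np.square(ys))
# noise_corr should sit near 0; bowl_corr should sit near 1
```

For a smooth surface, knowing the height at one point tells you a lot about the height next to it; for this one, it tells you nothing.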

#### Q What strategies might you use to minimize this loss function?

The loss surfaces for things like "a neural network that maps [pictures of horses to almost identical pictures of zebras](https://github.com/junyanz/CycleGAN)" are expected to be much more complicated than the ones we've looked at so far. The next class of loss functions is a simplified model of a neural network loss.

View `losses[8]` with the parameters `N = 100`, `mesh_extent = 10`, and `scale = 1` in the plotting and loss definition cells (original values are `50`, `1.5`, and `1`, for reference).

You can also try increasing these values to `N = 150`, `mesh_extent = 25`, and `scale = 2`. With these settings, the plot will take some time to render on most machines and will lag when you interact with it. On some machines, it may consume too much memory and fail to render. If the plot renders successfully and you'd like to see more, increase the parameters to `N = 250`, `mesh_extent = 50`, and `scale = 3`.

#### Q Do these loss surfaces look like promising candidates for gradient descent? Why or why not?