# Automatic Differentiation

Calculating derivatives is the crucial step in all the optimization algorithms that we will use to train deep networks. While the calculations are straightforward, working them out by hand can be tedious and error-prone, and these issues only grow as our models become more complex.

Fortunately, all modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to autograd). As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on others. To calculate derivatives, automatic differentiation works backwards through this graph, applying the chain rule. The computational algorithm for applying the chain rule in this fashion is called backpropagation.
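
To make the idea of a recorded graph concrete, here is a minimal pure-Python sketch of reverse-mode differentiation for scalars. It is only an illustration of the principle and not how any framework is actually implemented; the `Node` class and the `backward` helper are hypothetical names introduced for this example.

```python
# A toy scalar reverse-mode autodiff (illustrative only).
class Node:
    def __init__(self, value, parents=()):
        self.value = value        # result of the forward computation
        self.parents = parents    # pairs of (parent_node, local_derivative)
        self.grad = 0.0           # accumulated d(output)/d(this node)

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    # Order the nodes topologically so that a node's gradient is complete
    # before it is pushed on to its parents.
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule


x = Node(3.0)
y = x * x + x      # forward pass records the graph for y = x^2 + x
backward(y)        # backward pass walks the graph in reverse
print(x.grad)      # analytically, dy/dx = 2x + 1
```

Each operation stores its inputs together with the local derivative, and the backward pass multiplies and sums these local derivatives along every path from the output to the input.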

Interestingly, while we use autograd to optimize models (in a statistical sense), the optimization of autograd libraries themselves (in a computational sense) is a rich subject of vital interest to framework designers. Here, tools from compilers and graph manipulation are leveraged to compute results in the most expedient and memory-efficient manner.

## Basics

For now, try to remember these basics:

1. Attach gradients to the variables with respect to which we want derivatives.
2. Record the computation of the target value.
3. Execute the backpropagation function.
4. Access the resulting gradient.
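
The four steps above can be sketched in a few lines. The tensor `w` and the function $s = \sum_i w_i^3$ are arbitrary choices made for this illustration and are not part of the running example.

```python
import torch

# 1. Attach a gradient to the variable we want derivatives for.
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# 2. Record the computation of the target value (here s = sum of w_i^3).
s = (w ** 3).sum()

# 3. Execute backpropagation.
s.backward()

# 4. Access the resulting gradient; analytically ds/dw = 3 * w^2.
print(w.grad)
```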

In [1]:
import torch

## 1. Simple Function

Differentiate $y = 2\mathbf{x}^{\top}\mathbf{x} = 2\|\mathbf{x}\|^2$ with respect to the vector $\mathbf{x}$.

In [56]:
x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

Before we calculate the gradient of $y$ with respect to $\mathbf{x}$, we need a place to store it. In general, we avoid allocating new memory every time we take a derivative because deep learning requires successively computing derivatives with respect to the same parameters a great many times, and we might risk running out of memory.
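
One consequence of reusing the same storage is that PyTorch adds each newly computed gradient to the existing `.grad` buffer instead of replacing it. The following minimal sketch shows this accumulation and the in-place reset; the tensor `w` is illustrative only and separate from the running example.

```python
import torch

w = torch.ones(3, requires_grad=True)

(2 * w).sum().backward()
first = w.grad.clone()          # gradient after one backward pass

(2 * w).sum().backward()        # a second pass adds to the same buffer
accumulated = w.grad.clone()

w.grad.zero_()                  # reset in place before the next computation
print(first, accumulated, w.grad)
```

This is why the later cells call `x.grad.zero_()` before computing a new gradient.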

In [57]:
# Can also create x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad  # The gradient is None by default

In [58]:
y = 2 * torch.dot(x, x)  # x is 1-D, so no transpose is needed
y

tensor(28., grad_fn=<MulBackward0>)

In [59]:
y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])
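
The gradient of $2\mathbf{x}^{\top}\mathbf{x}$ with respect to $\mathbf{x}$ is $4\mathbf{x}$, which agrees with the values printed above. As a quick, self-contained check (using a fresh tensor `v` so that the running example is left untouched):

```python
import torch

v = torch.arange(4.0, requires_grad=True)
out = 2 * torch.dot(v, v)
out.backward()

# Compare the autograd result with the analytic gradient 4 * v.
print(v.grad == 4 * v)
```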

## 2. Backward for Non-Scalar Variables

When y is a vector, the most natural representation of the derivative of y with respect to a vector x is a matrix called the Jacobian that contains the partial derivatives of each component of y with respect to each component of x. Likewise, for higher-order y and x, the result of differentiation could be an even higher-order tensor.

While Jacobians do show up in some advanced machine learning techniques, more commonly we want to sum up the gradients of each component of y with respect to the full vector x, yielding a vector of the same shape as x.
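
To make the relation between the Jacobian and the summed gradient concrete, the sketch below assumes the elementwise function y = x * x. It builds the full Jacobian with `torch.autograd.functional.jacobian` and checks that summing over the components of y gives the same vector that `backward` produces when passed a vector of ones. The names `square`, `p`, and `q` are illustrative.

```python
import torch
from torch.autograd.functional import jacobian


def square(t):
    return t * t


p = torch.arange(4.0)
J = jacobian(square, p)          # shape (4, 4); diagonal with entries 2 * p
summed = torch.ones(4) @ J       # sum over the components of y

q = torch.arange(4.0, requires_grad=True)
square(q).backward(gradient=torch.ones(4))

print(J)
print(summed, q.grad)
```

For large inputs the full Jacobian is rarely materialized like this; passing a `gradient` vector to `backward` computes the weighted sum directly.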

In [61]:
x.grad.zero_()

tensor([0., 0., 0., 0.])

In [62]:
y = x * x
y

tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)

In [33]:
y.backward(gradient=torch.ones(len(y)))  # Faster: y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])

In [36]:
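x.grad.zero_()  # Reset; otherwise this gradient is added to the previous one
y = x * x       # Rebuild the graph, which the earlier backward call freed
y.sum().backward()
x.grad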

tensor([0., 2., 4., 6.])

## 3. Detaching Computation

Sometimes, we wish to move some calculations outside of the recorded computational graph. For example, say that we use the input to create some auxiliary intermediate terms for which we do not want to compute a gradient. In this case, we need to detach the respective computational graph from the final result.

The following toy example makes this clearer: suppose we have z = x * y and y = x * x, but we want to focus on the direct influence of x on z rather than the influence conveyed via y. In this case, we can create a new variable u that takes the same value as y but whose provenance (how it was created) has been wiped out. Thus u has no ancestors in the graph, and gradients do not flow through u to x. For example, taking the gradient of z = x * u yields u (not 3 * x * x, as you might have expected since z = x * x * x).

In [39]:
x.grad.zero_()
y = x * x
u = y  # No detach: u is the same tensor as y, so gradients flow through it
z = u * x

z.sum().backward()
x.grad, x.grad == u

(tensor([ 0.,  3., 12., 27.]), tensor([ True, False, False, False]))

In [40]:
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad, x.grad == u

(tensor([0., 1., 4., 9.]), tensor([True, True, True, True]))
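
Note that detaching produces a new tensor and leaves the original graph intact, so we can still backpropagate through the undetached result. The following self-contained sketch illustrates this with fresh, illustrative names (`s`, `t`, `t_detached`).

```python
import torch

s = torch.arange(4.0, requires_grad=True)
t = s * s
t_detached = t.detach()    # same values as t, but cut off from the graph

print(t.requires_grad, t_detached.requires_grad)

# t still carries its history, so its gradient with respect to s is available.
t.sum().backward()
print(s.grad == 2 * s)
```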

## 4. Gradients and Python Control Flow

Dynamic control flow is very common in deep learning. For instance, when processing text, the computational graph depends on the length of the input. In these cases, automatic differentiation becomes vital for statistical modeling since it is impossible to compute the gradient a priori.

So far we reviewed cases where the path from input to output was well defined via a function such as z = x * x * x. Programming offers us a lot more freedom in how we compute results. For instance, we can make them depend on auxiliary variables or condition choices on intermediate results. One benefit of using automatic differentiation is that even if building the computational graph of a function required passing through a maze of Python control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable.

In [51]:
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

Below, we call this function, passing in a random value as input. Since the input is random, we do not know in advance what form the computational graph will take. However, whenever we execute f(a) on a specific input, we realize a specific computational graph and can subsequently run backward.

In [42]:
a = torch.randn(size=(), requires_grad=True)
d = f(a)
a, d

(tensor(-0.3940, requires_grad=True),
 tensor(-161377.0625, grad_fn=<MulBackward0>))

In [43]:
d.backward()

Even though our function f is, for demonstration purposes, a bit contrived, its dependence on the input is quite simple: it is a linear function of a with a piecewise-defined scale. As such, f(a) / a is constant on each piece and, moreover, f(a) / a needs to match the gradient of f(a) with respect to a.

In [55]:
a.grad, d/a, a.grad == d/a

(tensor(409600.), tensor(409600., grad_fn=<DivBackward0>), tensor(True))
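
The same relation should hold for any input, whichever branch and however many loop iterations it triggers. As a rough check, the sketch below repeats the comparison for a handful of random inputs. It redefines `f` so that the block is self-contained; the seed, the number of trials, and the names `a_i`, `d_i`, and `matches` are arbitrary choices for this illustration.

```python
import torch


def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c


torch.manual_seed(0)  # arbitrary seed, only for reproducibility
matches = []
for _ in range(5):
    a_i = torch.randn(size=(), requires_grad=True)
    d_i = f(a_i)
    d_i.backward()
    # allclose tolerates rounding in the division d_i / a_i.
    matches.append(bool(torch.allclose(a_i.grad, d_i / a_i)))
    print(a_i.item(), a_i.grad.item(), matches[-1])
```

Each call traces a different path through the loop and the conditional, yet the gradient is always the scale factor that was actually applied to that input.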

## Exercises