# Automatic Differentiation

## A simple example

In [3]:
import torch

x = torch.arange(4.0, requires_grad=True) 
x.grad # Default value is None

In [4]:
y = 2 * torch.dot(x, x)

y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

In [5]:
# Another example. PyTorch accumulates gradients into x.grad,
# so reset it before the next backward pass.
x.grad.zero_()
y = x.sum()

y.backward()
x.grad

tensor([1., 1., 1., 1.])

How does this work? Computing gradients by hand is tedious and error-prone, so deep learning frameworks automate it. As the code runs, they record a *computational graph* of which variables are functions of which, then apply the chain rule backwards through that graph. This algorithm is called *backpropagation*.
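As a quick sanity check on the first example, the gradient of `2 * dot(x, x)` can be worked out by hand as `4 * x` and compared with what autograd returns. The snippet below is a minimal sketch; the name `analytic` is introduced here for illustration only.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = 2 * torch.dot(x, x)

# Tensors produced by an operation keep a reference to the graph node
# that created them; leaf tensors such as x do not.
has_graph_node = y.grad_fn is not None
print(has_graph_node)

y.backward()

# By hand: d/dx (2 * x.x) = 4x
analytic = 4 * x.detach()
print(torch.equal(x.grad, analytic))
```

If the two tensors match, the graph-plus-chain-rule machinery has reproduced the hand-derived result.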

## Non-scalar variables
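The derivative of a non-scalar `y` with respect to a vector `x` is a Jacobian matrix, and for higher-order inputs and outputs it is a tensor of higher order still. PyTorch's `backward()` raises an error when called on a non-scalar tensor unless a `gradient` argument of the same shape is supplied. The most common case is optimizing a scalar cost function, so in practice we usually reduce to a scalar first, for example with `y.sum().backward()`.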

In [11]:
# Something I don't understand yet
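One way to make the non-scalar case concrete is to try it. The sketch below, which is only an illustration and not the notebook author's code, shows that a bare `backward()` on a vector fails, and then two equivalent workarounds: summing first, or passing a vector of ones as the `gradient` argument. The names `raised`, `grad_via_sum`, and `grad_via_vector` are introduced here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# A bare backward() on a non-scalar output is rejected.
raised = False
try:
    (x * x).backward()
except RuntimeError as err:
    raised = True
    print("backward() on a non-scalar failed:", err)

# Option 1: reduce to a scalar first.
y = x * x
y.sum().backward()
grad_via_sum = x.grad.clone()

# Option 2: pass a vector v to backward(); autograd then computes
# the vector-Jacobian product v^T J instead of the full Jacobian.
x.grad.zero_()
y = x * x  # rebuild the graph, since the previous backward pass freed it
y.backward(gradient=torch.ones_like(y))
grad_via_vector = x.grad.clone()

print(grad_via_sum, grad_via_vector)
```

With a vector of ones, option 2 gives the same result as summing, because the sum's gradient with respect to each element of `y` is 1.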

## Detaching Computation
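Suppose we calculate `a` and `b` as functions of `x`, and then set `y = f(a, b)`. We now want the gradient of `y` w.r.t. `x`, but *treating `a` as a constant*, i.e., ignoring how `a` was computed from `x`. Detaching `a` from the computational graph does exactly this: the detached tensor keeps its value but carries no history, so no gradient flows back through it. In the cell below, `u` plays the role of the detached `a` and `z` plays the role of `y`.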

In [12]:
# We discard information about how u was computed from the graph.
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])
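To see what `detach()` changes, it may help to run the same computation with and without it. In this sketch (variable names `grad_detached` and `grad_full` are introduced for illustration), the detached version should give `x**2`, because `u` is treated as a constant, while the full version differentiates `x**3` and should give `3 * x**2`.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# With detach: u has the value x*x but no history, so dz/dx = u.
y = x * x
u = y.detach()
(u * x).sum().backward()
grad_detached = x.grad.clone()

# Without detach: z = x*x*x, so dz/dx = 3 * x**2.
x.grad.zero_()
y = x * x
(y * x).sum().backward()
grad_full = x.grad.clone()

print(grad_detached)
print(grad_full)
```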

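## Computing the Gradient of Python Control Flow
One of the big advantages of automatic differentiation is that it still works when the computation passes through Python control flow such as loops and conditionals. The graph is recorded as the code actually runs, so the gradient reflects whichever branches and iterations were taken for the given input.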

In [28]:
def f(a): 
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0: 
        c = b
    else: 
        c = 100 * b
    return c

a = torch.randn(size=(), requires_grad=True)
d = f(a)
d.backward()
a.grad

tensor(8192.)
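For any given input, `f` only ever multiplies `a` by constants, so along the path actually taken it is linear: `f(a) = k * a` for some `k` that depends on the input. That suggests a simple check, sketched below: the gradient should equal `d / a`. The fixed inputs `0.5` and `-0.5` and the `slopes` dictionary are assumptions made here for reproducibility; the cell above uses a random `a`, so its printed value will vary from run to run.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

slopes = {}
checks = {}
# Fixed inputs (hypothetical choice) so that both branches are exercised.
for value in (0.5, -0.5):
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    slopes[value] = a.grad.item()
    # On the path taken, f(a) = k * a, so the slope k equals d / a.
    checks[value] = torch.isclose(a.grad, d / a).item()

print(slopes)
print(checks)
```

The positive input takes the `c = b` branch and the negative input takes the `c = 100 * b` branch, so the two slopes differ by a factor of 100.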