In [26]:
#hide
from fastai2.vision.all import *
import math

https://cs231n.github.io/optimization-2/#staged  
Yes, you should understand backward propagation. $\frac{df}{dx} = -3$.



* Demonstrate PyTorch's `backward()` function.
* Show that gradients computed above are consistent with staged computation for intermediate gates.
* Demonstrate with an example where forgetting to reset the gradient leads to erroneous backpropagation.

# Basic notation

## What do you call this?

$$
\frac{\partial{f}}{\partial{x}}
$$

The partial derivative of $f$ with respect to $x$.  
The derivative of $f$ with respect to $x$.  
The derivative of $f$ on $x$.  
The gradient of $f$ on $x$.  
The derivative on $x$.  
The gradient on $x$.

## Tangent

$$
\frac{\partial{f}}{\partial{x}} = \lim_{h\rightarrow 0} \frac{f(x + h) - f(x)}{h}
$$

Locally, $f$ behaves like a linear function of $x$ whose slope is $\frac{\partial{f}}{\partial{x}}$.
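As a rough numerical illustration of the difference quotient, the sketch below estimates the slope of a hypothetical function `g(x) = x**2` at `x = 3` using a small step `h`, then uses that slope as a local linear approximation. The function `g`, the point `x0`, and the step size are assumptions chosen for illustration only.

```python
def g(x):
    # Hypothetical example function; its exact derivative is 2 * x.
    return x ** 2

x0 = 3.0
h = 1e-5  # small step size

# Difference quotient (g(x0 + h) - g(x0)) / h approximates the slope at x0.
slope = (g(x0 + h) - g(x0)) / h

# Near x0, g is close to the straight line g(x0) + slope * (x - x0).
linear_estimate = g(x0) + slope * 0.01
print(slope, linear_estimate, g(x0 + 0.01))
```

The printed slope should be close to the exact derivative `2 * x0`, and the linear estimate should be close to the true value of `g` a short distance away from `x0`.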


## All together

$$
\nabla{f} \rightarrow \left[\frac{\partial{f}}{\partial{x_0}}, \frac{\partial{f}}{\partial{x_1}}, \cdots, \frac{\partial{f}}{\partial{x_n}}\right]
$$

# Circuit diagrams

A circuit diagram describes a sequence of computation steps.  

$$
f(x, y, z) = (x + y)z
$$

[Include a diagram here.]

The inputs to the computation are on the left.  The output of the computation is on the right.  
*Forward pass*: left to right.  
*Backward pass*: right to left.

1. Each gate in the circuit diagram is a function, or mapping.
2. Any kind of differentiable function can be a gate.


## Local gradient

For each gate in the circuit diagram:
1. A gate is a mapping.
2. There can be only one output.
3. There can be more than one input.
4. The output can be computed from the inputs using the mapping.
5. The gradient of the output on the inputs can be computed using locally available information.

## Gradient of circuit output

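The chain rule is used during back propagation.  The gradient of the circuit output on the output of a gate, $\frac{\partial{\mathcal{L}}}{\partial{y}}$, is multiplied by the gradient of the gate's output on each of its inputs, $\frac{\partial{y}}{\partial{x_i}}$, to give the gradient of the circuit output on that input, $\frac{\partial{\mathcal{L}}}{\partial{x_i}} = \frac{\partial{\mathcal{L}}}{\partial{y}} \frac{\partial{y}}{\partial{x_i}}$.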
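One way to picture this is to write each gate as a small object with a `forward` and a `backward` method. The sketch below does this for $f(x, y, z) = (x + y)z$ in plain Python. The class names `AddGate` and `MulGate` and the input values are hypothetical choices for illustration, not part of any library.

```python
class AddGate:
    """Gate computing out = a + b."""

    def forward(self, a, b):
        return a + b

    def backward(self, dout):
        # Local gradients: d(out)/da = 1 and d(out)/db = 1.
        return dout * 1.0, dout * 1.0


class MulGate:
    """Gate computing out = a * b."""

    def forward(self, a, b):
        self.a, self.b = a, b  # cache the inputs for the backward pass
        return a * b

    def backward(self, dout):
        # Local gradients: d(out)/da = b and d(out)/db = a.
        return dout * self.b, dout * self.a


x, y, z = -2.0, 5.0, -4.0
add, mul = AddGate(), MulGate()

# Forward pass, left to right.
q = add.forward(x, y)
f = mul.forward(q, z)

# Backward pass, right to left, starting from df/df = 1.
dq, dz = mul.backward(1.0)
dx, dy = add.backward(dq)
print(f, (dx, dy, dz))
```

Each `backward` method only needs the incoming gradient `dout` and the values the gate saw during the forward pass, which is what "locally available information" means here.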


## Define gates for convenience

1. Multiple gates can be grouped into a single gate.
2. A single gate can be broken up into multiple gates.

For the same overall computation, it is more convenient to have gates whose local gradients are easy to calculate.
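As an illustration of grouping, the sketch below computes the sigmoid function $\sigma(x) = 1 / (1 + e^{-x})$ in two ways: as a single gate with the closed-form local gradient $\sigma(x)(1 - \sigma(x))$, and as a chain of four primitive gates. The function names and the input value are assumptions made for this example.

```python
import math


def sigmoid_single_gate(x):
    # One gate: forward value and closed-form local gradient s * (1 - s).
    s = 1.0 / (1.0 + math.exp(-x))
    return s, s * (1.0 - s)


def sigmoid_staged(x):
    # The same computation split into four primitive gates.
    a = -x             # negate
    b = math.exp(a)    # exponentiate
    c = 1.0 + b        # add a constant
    s = 1.0 / c        # reciprocal
    # Backward pass through each gate, applying the chain rule.
    dc = -1.0 / (c * c)      # d(1/c)/dc
    db = dc * 1.0            # d(1 + b)/db = 1
    da = db * math.exp(a)    # d(exp(a))/da = exp(a)
    dx = da * -1.0           # d(-x)/dx = -1
    return s, dx


s1, g1 = sigmoid_single_gate(0.5)
s2, g2 = sigmoid_staged(0.5)
print(g1, g2)
```

Both versions should give the same value and gradient up to floating-point rounding; the single gate simply needs fewer steps in the backward pass.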

# Staged computation

$$
f(x, y, z) = (x + 2y)z
$$


$$
q = x + 2y \\
f = qz
$$

## Manual implementation

In [20]:
x, y, z = 3, -4, 19

Forward pass.

In [21]:
q = x + 2*y    #1
f = q * z      #2
f

-95

Backward pass.

In [24]:
dq = z             #2
dz = q             #2
dx = dq * 1        #1
dy = dq * 2        #1
dq, dx, dy, dz

(19, 19, 38, -5)
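A quick way to sanity-check hand-derived gradients like these is to compare them with a numerical estimate. The sketch below uses central differences. The helper names `circuit` and `numerical_grad` and the step size `h` are hypothetical choices for this illustration.

```python
def circuit(x, y, z):
    # Same computation as the staged example: f = (x + 2y) * z.
    return (x + 2 * y) * z


def numerical_grad(fn, args, h=1e-6):
    # Central difference estimate of the derivative on each argument in turn.
    grads = []
    for i in range(len(args)):
        plus = list(args)
        minus = list(args)
        plus[i] += h
        minus[i] -= h
        grads.append((fn(*plus) - fn(*minus)) / (2 * h))
    return grads


dx_num, dy_num, dz_num = numerical_grad(circuit, (3.0, -4.0, 19.0))
print(dx_num, dy_num, dz_num)
```

The three estimates should agree with `dx`, `dy`, and `dz` from the manual backward pass up to small floating-point error.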

## Autodiff with PyTorch

PyTorch can automatically compute derivatives.  Let's see how it works for the current example.

In [71]:
x, y, z = tensor(3.), tensor(-4.), tensor(19.)

Forward pass.

In [72]:
q = x + 2*y    #1
f = q * z      #2
f

tensor(-95.)

*In PyTorch, calling the `backward()` method of a variable computes the gradient of that variable on other variables connected to it in the circuit diagram.*

In the current example, you would call `f.backward()` to compute `dx` $=\frac{\partial{f}}{\partial{x}}$, `dy`$=\frac{\partial{f}}{\partial{y}}$, `dz`$=\frac{\partial{f}}{\partial{z}}$, and `dq`$=\frac{\partial{f}}{\partial{q}}$.  However, as things stand, if you run the `backward()` method of `f`, you will get the following Exception:

``` 
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
```

This is because each torch tensor has a `requires_grad` attribute, and the gradient is only computed when this is `True`.  Right now, this attribute is `False` for all the variables, so when you call `backward()`, an Exception is raised.

Each tensor also has a `grad` attribute, which holds the computed gradient.  At the moment, `requires_grad` is `False` and `grad` is `None` for every variable.

In [73]:
for o in [x, y, z, q, f]: print(o.requires_grad)

False
False
False
False
False


### Location of `requires_grad=True`  in circuit diagram

*Setting a variable's `requires_grad` to `True` tells PyTorch that you want to have the gradient on that variable computed when you call `backward()`.*

For example, if you now set `f`'s `requires_grad` attribute to `True`, you will see that  
* `f`'s `requires_grad` is equal to `True`.
* All the variables before `f` in the circuit diagram still have `requires_grad=False`.

In [74]:
f.requires_grad_()

tensor(-95., requires_grad=True)

In [75]:
for o in [x, y, z, q, f]: print(o.requires_grad)

False
False
False
False
True


You can now call `backward()` with no Exception raised.

In [76]:
f.backward()

In PyTorch, the gradient on a variable is stored in the `grad` attribute of that variable.  For example, the gradient on `y`, $\frac{\partial{f}}{\partial{y}}$, is `y.grad`.

Let's check the current gradients on all the variables.

In [79]:
for o in [x, y, z, q, f]: print(o.requires_grad, o.grad)

False None
False None
False None
False None
True tensor(1.)


The backward pass from `f` has only computed the gradient on `f` itself, which is $\frac{\partial{f}}{\partial{f}}=1$.  The gradients on all variables before `f` in the circuit diagram, such as $\frac{\partial{f}}{\partial{q}}$ and $\frac{\partial{f}}{\partial{z}}$, are `None`: they were not computed.

Going back to the beginning.  Suppose, instead of setting `f`'s `requires_grad` to `True`, you set `q`'s `requires_grad` to `True`.

In [80]:
q = x + 2*y    #1
q.requires_grad_()
f = q * z      #2
f

tensor(-95., grad_fn=<MulBackward0>)

In [81]:
for o in [x, y, z, q, f]: print(o.requires_grad, o.grad)

False None
False None
False None
True None
True None


* `f` and `q`'s `requires_grad` are both `True`, even though you have only called `requires_grad_()` on `q`.
* `requires_grad` is still `False` for all the variables that come before `q`.

In [82]:
f.backward()

In [83]:
for o in [x, y, z, q, f]: print(o.requires_grad, o.grad)

False None
False None
False None
True tensor(19.)
True None


`dq`$=\frac{\partial{f}}{\partial{q}}$ is computed.  However, `df`$=\frac{\partial{f}}{\partial{f}}$, even though `f.requires_grad` is `True`, is `None`.

Going back to the beginning again.  What happens if you set `requires_grad` to `True` for `x`?

In [85]:
x.requires_grad_()

tensor(3., requires_grad=True)

In [86]:
q = x + 2*y    #1
f = q * z      #2
f

tensor(-95., grad_fn=<MulBackward0>)

In [87]:
for o in [x, y, z, q, f]: print(o.requires_grad, o.grad)

True None
False None
False None
True None
True None


In [88]:
f.backward()

In [89]:
for o in [x, y, z, q, f]: print(o.requires_grad, o.grad)

True tensor(19.)
False None
False None
True None
True None


`dx`$=\frac{\partial{f}}{\partial{x}}$ is computed and is the only gradient available.  `requires_grad` is `True` for both `q` and `f`, but their `grad` attributes are both `None`.

This behaviour might be slightly surprising: `requires_grad` is `True`, yet the gradient is `None`.  However, it is consistent with PyTorch's documented design.  To save memory, PyTorch only stores the gradients of *leaf variables*.  Intermediate variables' gradients can be accessed through *hooks*, which will not be covered here.  (See [here](https://discuss.pytorch.org/t/why-cant-i-see-grad-of-an-intermediate-variable/94/2) for how to do this.)
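Besides hooks, another option is to call `retain_grad()` on an intermediate tensor before the backward pass, and the `is_leaf` attribute shows which tensors PyTorch treats as leaves. The sketch below is a minimal illustration; it uses `torch.tensor` directly rather than fastai's `tensor` helper so that it is self-contained.

```python
import torch

# Leaf tensors created directly by the user, with gradients requested.
x = torch.tensor(3., requires_grad=True)
y = torch.tensor(-4., requires_grad=True)
z = torch.tensor(19., requires_grad=True)

q = x + 2 * y
q.retain_grad()  # ask autograd to keep the gradient on this non-leaf tensor
f = q * z
f.backward()

# x is a leaf; q and f are results of operations, so they are not.
print(x.is_leaf, q.is_leaf, f.is_leaf)
# q.grad is populated because retain_grad() was called before backward().
print(q.grad)
```

Here `q.grad` should match `dq` from the manual backward pass, while the leaf gradients `x.grad`, `y.grad`, and `z.grad` are stored as usual.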

### Resetting the gradient

In [113]:
x, y, z = tensor(3.), tensor(-4.), tensor(19.)
for o in [x, y, z]: o.requires_grad_()

q = x + 2*y    #1
f = q * z      #2
f

In [115]:
f.backward()

In [116]:
for o in [x, y, z, q, f]: print(o.grad)

tensor(19.)
tensor(38.)
tensor(-5.)
None
None


In [117]:
f.backward()

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

As the `RuntimeError` explains, by default you can't run `backward()` more than once on the same graph.  To do so, call `.backward(retain_graph=True)` the first time.

In [118]:
x, y, z = tensor(3.), tensor(-4.), tensor(19.)
for o in [x, y, z]: o.requires_grad_()

q = x + 2*y    #1
f = q * z      #2
f

tensor(-95., grad_fn=<MulBackward0>)

In [119]:
f.backward(retain_graph=True)

In [120]:
for o in [x, y, z, q, f]: print(o.grad)

tensor(19.)
tensor(38.)
tensor(-5.)
None
None


In [121]:
f.backward()

In [122]:
for o in [x, y, z, q, f]: print(o.grad)

tensor(38.)
tensor(76.)
tensor(-10.)
None
None


The gradients accumulate over multiple backward passes: each call to `backward()` adds to the existing `grad` values rather than overwriting them.

In [123]:
f.backward()

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

In [125]:
x, y, z = tensor(3.), tensor(-4.), tensor(19.)
for o in [x, y, z]: o.requires_grad_()

q = x + 2*y    #1
f = q * z      #2
f

tensor(-95., grad_fn=<MulBackward0>)

In [126]:
f.backward(retain_graph=True)

In [127]:
for o in [x, y, z, q, f]: print(o.grad)

tensor(19.)
tensor(38.)
tensor(-5.)
None
None


In [128]:
for o in [x, y, z, q, f]: o.grad = None

You can reset a gradient at any time by setting it to `None`, for example `x.grad = None` for `x`.

In [129]:
f.backward()

In [130]:
for o in [x, y, z, q, f]: print(o.grad)

tensor(19.)
tensor(38.)
tensor(-5.)
None
None
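To see why resetting matters in practice, the sketch below runs a tiny gradient descent on a hypothetical loss $(w - 3)^2$, once with the gradient reset after every step and once without. The function name `descend`, the loss, the learning rate, and the number of steps are all assumptions chosen for illustration.

```python
import torch


def descend(reset_grad, steps=20, lr=0.1):
    # Minimise (w - 3)^2 starting from w = 0.
    w = torch.tensor(0., requires_grad=True)
    for _ in range(steps):
        loss = (w - 3.) ** 2
        loss.backward()            # adds d(loss)/dw to w.grad
        with torch.no_grad():
            w -= lr * w.grad       # update w without tracking the operation
        if reset_grad:
            w.grad = None          # clear the gradient before the next step
    return w.item()


w_reset = descend(reset_grad=True)
w_stale = descend(reset_grad=False)
print(w_reset, w_stale)
```

With the reset, `w` should move steadily towards the minimum at 3. Without it, each update uses the sum of all gradients seen so far, so `w` overshoots and oscillates around the minimum instead of settling.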
