# Automatic differentiation (in PyTorch)

**Course**: Computer Vision (911.908)    
**Author**: Roland Kwitt (Dept. of Computer Science, Univ. of Salzburg)       
Winter term 2019/20


In this lecture, we take a look at the mechanics of **automatic differentiation** (aka *AutoGrad* / *AutoDiff*) in PyTorch. These techniques are at the very heart of today's frameworks for learning with neural networks.

---

## Contents

- [Introductory example](#Introductory-example)
- [A slightly more involved example](#A-slightly-more-involved-example)
- [The AutoGrad machinery](#The-AutoGrad-machinery)

---

In [2]:
%load_ext autoreload
%autoreload 2

import sys
import agtree2dot

sys.path.append("../")

from utils import visualize_DAG
from IPython.display import HTML



---

## Introductory example

In this example, we have the following computation graph:

<img src="DAG0.svg" alt="drawing" width="700"/>

Let's compute

$$ \frac{\partial a}{\partial b} = \frac{\partial a}{\partial v}\frac{\partial v}{\partial b}$$

First, 

$$ \frac{\partial v}{\partial b} = 1 $$

and

$$ \frac{\partial a}{\partial v} = \frac{\partial}{\partial v} \text{ReLU}(v) = 
\begin{cases}
0 & \text{if}~v \leq 0\\
1 & \text{else}
\end{cases}
$$

Hence, in our case ($v = xw + b = 3\cdot 2 + 1 = 7 > 0$, using the values $x=3$, $w=2$, $b=1$ of this example), we have

$$\frac{\partial a}{\partial v} = 1$$

and consequently 

$$
\frac{\partial a}{\partial b} = 1
$$

Let's try to compute the partial derivative of $a$ wrt. $w$:

$$
\frac{\partial a}{\partial w} = \frac{\partial a}{\partial v}\frac{\partial v}{\partial u}\frac{\partial u}{\partial w}
$$

So, 

$$
\frac{\partial v}{\partial u} = 1
$$

and 

$$
\frac{\partial u}{\partial w} = x
$$

Combined, we obtain (if we use the fact that $x=3$)

$$
\frac{\partial a}{\partial w} = \frac{\partial a}{\partial v}\frac{\partial v}{\partial u}\frac{\partial u}{\partial w} = 1\cdot 1\cdot 3 = 3
$$
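
The hand derivation above can be replayed in a few lines of plain Python. This is only a sketch of the chain rule for this one graph, not how PyTorch implements it; $x=3$ is stated above, while $w=2$ and $b=1$ are taken from the PyTorch example that follows.

```python
# Hand-derived chain rule for a = ReLU(x*w + b), evaluated in plain Python.
# Assumed values: x = 3 (from the text), w = 2 and b = 1 (from the code example).
x, w, b = 3.0, 2.0, 1.0

# Forward pass: keep every intermediate result.
u = x * w          # u = 6
v = u + b          # v = 7
a = max(v, 0.0)    # a = ReLU(v) = 7

# Local derivatives, evaluated at the stored intermediate values.
da_dv = 1.0 if v > 0 else 0.0
dv_du = 1.0
dv_db = 1.0
du_dw = x

# Chain rule: multiply the local derivatives along each path.
da_dw = da_dv * dv_du * du_dw
da_db = da_dv * dv_db

print(da_dw, da_db)  # 3.0 1.0
```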



Conceptually, the **forward pass** (i.e., the computation of the function value) is a standard tensor computation (i.e., in this example simply the computation of all intermediate results), and the **Directed Acyclic Graph (DAG)** of tensor operations is required only to compute derivatives (via the chain rule).

When executing tensor operations, PyTorch can automatically (on-the-fly) construct the graph of operations to compute the gradient of any quantity
with respect to any tensor involved.

In [None]:
import torch
from torch.autograd import grad
import torch.nn.functional as F

**Example**

In [None]:
x = torch.tensor([3.])
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

u = x*w
v = u+b
a = torch.clamp(v, min=0) # alternatively: a = F.relu(x*w + b)

A Tensor has a Boolean field `requires_grad`, set to `False` by default, which
states if PyTorch should build the graph of operations so that gradients with
respect to it can be computed.

**Note**: Only *floating point type* tensors can have their gradient computed.
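
A small sketch of how the `requires_grad` flag behaves; the tensor names `t` and `p` and the flag `int_grad_rejected` are made up for this illustration.

```python
import torch

# Tensors do not track gradients unless asked to.
t = torch.tensor([1.0, 2.0])
print(t.requires_grad)            # False

# A result requires grad as soon as one of its inputs does.
p = torch.tensor([1.0, 2.0], requires_grad=True)
print((t * p).requires_grad)      # True

# Integer tensors cannot require gradients; PyTorch raises a RuntimeError.
try:
    torch.tensor([1, 2], requires_grad=True)
    int_grad_rejected = False
except RuntimeError as err:
    int_grad_rejected = True
    print("RuntimeError:", err)
```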

In [None]:
print("v =", v.item())

Compute gradient $\partial a/\partial w$

In [None]:
grad_w = grad(a, w, retain_graph=True)
print(grad_w[0].item())

Compute gradient $\partial a/\partial b$

In [None]:
print(grad(a, b)[0].item())
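
The first call above passes `retain_graph=True` because, by default, PyTorch releases the intermediate values stored in the graph once a gradient has been computed. The sketch below rebuilds the same example to illustrate this; the flag `second_call_failed` is a name introduced here, and the exact error message may differ between PyTorch versions.

```python
import torch
from torch.autograd import grad

x = torch.tensor([3.])
w = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
a = torch.clamp(x * w + b, min=0)

# First call without retain_graph: saved intermediate values are released afterwards.
print(grad(a, w)[0].item())

# A second pass over the same graph is then expected to fail.
try:
    grad(a, b)
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print("second call failed:", second_call_failed)

# Requesting both gradients in a single call avoids the issue altogether.
a = torch.clamp(x * w + b, min=0)
gw, gb = grad(a, (w, b))
print(gw.item(), gb.item())
```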

---

## A slightly more involved example

In this example, we have a slightly more complex computation graph with a *shared* weight $w$ and two paths to arrive at the result: one path squares its input, the other raises it to the third power (`pow(., 3)`).

<img src="DAG1.svg" alt="drawing" width="700"/>

Let's compute the partial derivative of $a$ wrt. $w$ by hand once more:


$$
\frac{\partial a}{\partial w} = \frac{\partial a}{\partial v_1}\frac{\partial v_1}{\partial u_1}\frac{\partial u_1}{\partial w} + \frac{\partial a}{\partial v_2}\frac{\partial v_2}{\partial u_2}\frac{\partial u_2}{\partial w} = 
1\cdot 2u_1\cdot x + 1 \cdot 3u_2^2 \cdot x = 8 + 24 = 32
$$

Again, having **all the intermediate values** during the **forward pass** makes it very easy to actually compute the gradient of $a$ wrt. $w$, i.e., $\partial a/\partial w$.



In [None]:
x = torch.tensor([2.])
w = torch.tensor([1.], requires_grad=True)

u1 = x*w
u2 = x*w

v1 = torch.pow(u1, 2)
v2 = torch.pow(u2, 3)

a = v1 + v2
print("a =", a.item())

Next, we compute the gradient of $a$ wrt. $w$, i.e., $\partial a/\partial w$

In [None]:
print(grad(a, w)[0].item())
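
To see that the total gradient is the sum over both paths, one can, as a sketch, differentiate each branch separately. The names `g1`, `g2` and `g` are introduced here for illustration only.

```python
import torch
from torch.autograd import grad

x = torch.tensor([2.])
w = torch.tensor([1.], requires_grad=True)

v1 = torch.pow(x * w, 2)   # first path:  (x*w)^2
v2 = torch.pow(x * w, 3)   # second path: (x*w)^3
a = v1 + v2

g1 = grad(v1, w, retain_graph=True)[0]   # 2*u1*x   = 8
g2 = grad(v2, w, retain_graph=True)[0]   # 3*u2^2*x = 24
g = grad(a, w)[0]                        # sum over both paths = 32

print(g1.item(), g2.item(), g.item())
```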

---

## The AutoGrad machinery

The autograd DAG is encoded through the fields `grad_fn` of Tensors, and the
fields `next_functions` of Functions.

In [None]:
# below is another way of saying requires_grad=True
x = torch.tensor([ 1.0, -2.0, 3.0, -4.0 ]).requires_grad_() 
a = x.abs()
s = a.sum()
print(s)

In [None]:
print(s.grad_fn.next_functions)
print(s.grad_fn.next_functions[0][0].next_functions)
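
Following `next_functions` by hand becomes tedious for larger graphs. A minimal sketch of a recursive traversal is given below; `walk` is a hypothetical helper written for this illustration and not part of PyTorch, and the node class names it prints may vary between PyTorch versions.

```python
import torch

def walk(fn, depth=0, names=None):
    """Recursively collect the class names of all nodes reachable from fn."""
    if names is None:
        names = []
    if fn is None:          # inputs that need no gradient show up as None
        return names
    names.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names

x = torch.tensor([1.0, -2.0, 3.0, -4.0]).requires_grad_()
s = x.abs().sum()
names = walk(s.grad_fn)
```

The traversal starts at the node that produced `s` and ends at the leaf tensor `x`, which is represented by a gradient-accumulation node.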

Below, we visualize a simple computation graph:

In [None]:
x = torch.tensor([1., 2., 2.]).requires_grad_()
q = x.norm() # l2-norm of the tensor: sqrt(1+4+4) = 3
print(q.item())
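
As a quick sanity check, the gradient of the $\ell_2$-norm can be compared against its closed form $\partial \|\mathbf{x}\| / \partial \mathbf{x} = \mathbf{x}/\|\mathbf{x}\|$. This is only a sketch; the names `g` and `expected` are introduced here.

```python
import torch
from torch.autograd import grad

x = torch.tensor([1., 2., 2.]).requires_grad_()
q = x.norm()                 # sqrt(1 + 4 + 4) = 3

g = grad(q, x)[0]            # autograd result
expected = x.detach() / 3.0  # analytic gradient x / ||x||

print(g)
print(torch.allclose(g, expected))
```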

In [None]:
agtree2dot.save_dot(q, {x: 'x', q: 'q'}, open('simple1.dot', 'w'))
HTML(visualize_DAG('simple1.dot', 'simple1.jpg'))

Here's another, slightly more involved, example: We have $\mathbf{x} \in \mathbb{R}^{10}$ and **first** compute

$$
\mathbf{h} = \text{tanh}(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)
$$

with $\mathbf{W}_1 \in \mathbb{R}^{20 \times 10}$ and $\mathbf{b}_1 \in \mathbb{R}^{20}$. This results in $\mathbf{h} \in \mathbb{R}^{20}$. **Second**, we compute

$$
\hat{\mathbf{y}}  = \text{tanh}(\mathbf{W}_2\mathbf{h} + \mathbf{b}_2)
$$

with $\mathbf{W}_2 \in \mathbb{R}^{5 \times 20}$ and $\mathbf{b}_2 \in \mathbb{R}^5$. This results in 
$\hat{\mathbf{y}} \in \mathbb{R}^5$. And **finally**, we compute 

$$
L(\mathbf{y},\hat{\mathbf{y}}) = \frac{1}{5} \sum_{i=1}^5(y_i-\hat{y}_i)^2
$$

where $y_i,\hat{y}_i$ are the $i$-th elements of $\mathbf{y}$ and $\hat{\mathbf{y}}$, respectively. In the example below, $y_i$ (which we use as targets) are just random numbers.

In [None]:
w1 = torch.rand(20, 10).requires_grad_()
b1 = torch.rand(20).requires_grad_()
w2 = torch.rand(5, 20).requires_grad_()
b2 = torch.rand(5).requires_grad_()

x = torch.rand(10)
h = torch.tanh(w1 @ x + b1)
y = torch.tanh(w2 @ h + b2)

target = torch.rand(5)

loss = (y - target).pow(2).mean()
print(loss)
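
The sketch below rebuilds this two-layer example and checks two things: that the hand-written loss agrees with PyTorch's built-in mean squared error, and that each gradient has the same shape as its parameter. The fixed seed and the names `params` and `grads` are choices made for this illustration.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

torch.manual_seed(0)  # arbitrary seed, only to make this sketch reproducible

w1 = torch.rand(20, 10).requires_grad_()
b1 = torch.rand(20).requires_grad_()
w2 = torch.rand(5, 20).requires_grad_()
b2 = torch.rand(5).requires_grad_()

x = torch.rand(10)
target = torch.rand(5)

h = torch.tanh(w1 @ x + b1)
y = torch.tanh(w2 @ h + b2)
loss = (y - target).pow(2).mean()

# The hand-written loss matches the built-in mean squared error.
print(torch.allclose(loss, F.mse_loss(y, target)))

# One gradient per parameter, each shaped like the parameter itself.
params = (w1, b1, w2, b2)
grads = grad(loss, params)
for p, g in zip(params, grads):
    print(tuple(p.shape), tuple(g.shape))
```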

Let's look at the computation graph:

In [None]:
agtree2dot.save_dot(loss, 
                    {
                        w1: 'w1',
                        b1: 'b1',
                        w2: 'w2',
                        b2: 'b2',
                        loss: 'loss'
                    }, open('simple2.dot', 'w'))
HTML(visualize_DAG('simple2.dot', 'simple2.jpg'))

### Using `torch.no_grad()`

The `torch.no_grad()` context switches off the autograd machinery, and can be
used for operations such as parameter updates.

In [None]:
a = torch.tensor( 0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()
eta = 0.1

for k in range(100):
    l = (a - 1)**2 + (b + 1)**2 + (a - b)**2
    ga, gb = torch.autograd.grad(l, (a, b))
    
    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb
        
print('%.06f' % a.item(), '%.06f' % b.item())
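
To illustrate why the update is wrapped in `torch.no_grad()`, the sketch below tries the same kind of in-place update outside of the context first. The names `update_rejected` and `c` are introduced here, and the error text may differ between PyTorch versions.

```python
import torch

a = torch.tensor(0.5).requires_grad_()

# Outside no_grad, an in-place update of a leaf that requires grad is rejected.
try:
    a -= 0.1
    update_rejected = False
except RuntimeError as err:
    update_rejected = True
    print("RuntimeError:", err)

# Inside no_grad, the same update is applied and not recorded in any graph.
with torch.no_grad():
    a -= 0.1
    c = a * 2

print(a.item(), a.requires_grad, c.requires_grad)
```

For reference, setting both partial derivatives of $l$ to zero gives $a = 1/3$ and $b = -1/3$, so the gradient descent loop above should end up close to these values.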

### Using `detach()`

The `detach()` method creates a tensor which shares the data, but does not
require gradient computation, and is not connected to the current graph.

This method should be used when the gradient should not be propagated
beyond a variable, or to update leaf tensors.

In [None]:
a = torch.tensor( 0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()
eta = 0.1

for k in range(100):
    l = (a - 1)**2 + (b + 1)**2 + (a.detach() - b)**2
    ga, gb = torch.autograd.grad(l, (a, b))
    
    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb
        
print('%.06f' % a.item(), '%.06f' % b.item())
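
The effect of `detach()` can be isolated in a smaller sketch: the detached tensor shares memory with the original, carries no graph information, and acts as a constant during differentiation. The names `d`, `g_detached` and `g_full` are introduced for this illustration.

```python
import torch
from torch.autograd import grad

x = torch.tensor(2.0).requires_grad_()
d = x.detach()

print(d.requires_grad, d.grad_fn)        # False None
print(d.data_ptr() == x.data_ptr())      # True: same underlying memory

# With detach(), the second factor is treated as a constant: dy/dx = d = 2.
y = x * x.detach()
g_detached = grad(y, x)[0]

# Without detach(), both factors contribute: dz/dx = 2x = 4.
z = x * x
g_full = grad(z, x)[0]

print(g_detached.item(), g_full.item())
```

In the loop above, the same mechanism removes the contribution of $(a - b)^2$ to the gradient with respect to $a$. The iteration therefore no longer minimizes $l$ jointly and instead approaches $a = 1$, $b = 0$.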