# Computer Vision (911.908)

## <font color='crimson'>Automatic differentiation (aka AutoGrad)</font>

**Changelog**:
- *Sep. 2020*: initial version (using PyTorch v1.6) 
- *Sep. 2021*: adaptations to PyTorch v1.9
- *Oct. 2022*: adaptations to PyTorch v1.12.1 + minor fixes
- *Oct. 2024*: adaptations to PyTorch v2.3
- *Oct. 2025*: adaptations to PyTorch v2.6

---

In this lecture, we take a look at the mechanics of **automatic differentiation** (aka *AutoGrad* / *AutoDiff*) in PyTorch. These techniques are at the very heart of today's frameworks for learning with neural networks.

You can also find the official (tutorial-like) documentation [here](https://pytorch.org/docs/stable/notes/autograd.html).

---

## Contents

- [Introductory example](#Introductory-example)
- [A slightly more involved example](#A-slightly-more-involved-example)
- [The AutoGrad machinery](#The-AutoGrad-machinery)

---

In [47]:
#
# Running this code requires torchviz. Install via
#
# pip install torchviz
#

%load_ext autoreload
%autoreload 2

import sys

sys.path.append(".")  # extend the module search path before importing helper modules

import agtree2dot

from utils import visualize_DAG
from IPython.display import HTML
from torchviz import make_dot

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload



---

## Introductory example

In this example, we take the following function (ReLU stands for Rectified Linear Unit, and we will see this later in the lecture)

$$f(x;w,b) = \max\{wx + b, 0\} = \text{ReLU}(wx+b)$$

where 

$$\text{ReLU}: x \mapsto \text{ReLU}(x) = \max\{x, 0\}\enspace.$$

We first look at its representation as a **computation graph** (with example values for $w,b$ and $x$ added):

<img src="DAG0.svg" alt="drawing" width="700"/>

Let's compute

$$ \frac{\partial a}{\partial b} = \frac{\partial a}{\partial v}\frac{\partial v}{\partial b}$$

First, 

$$ \frac{\partial v}{\partial b} = 1 $$

and (using a [sub-derivative](https://en.wikipedia.org/wiki/Subderivative))

$$ \frac{\partial a}{\partial v} = \frac{\partial}{\partial v} \text{ReLU}(v) = 
\begin{cases}
0 & \text{if}~v \leq 0\\
1 & \text{else}
\end{cases}\enspace.
$$

Hence, in our case, we have (as $v=7$)

$$\frac{\partial a}{\partial v} = 1$$

and consequently 

$$
\frac{\partial a}{\partial b} = 1\cdot 1 = 1 \enspace.
$$

Let's try to compute the partial derivative of $a$ wrt. $w$:

$$
\frac{\partial a}{\partial w} = \frac{\partial a}{\partial v}\frac{\partial v}{\partial u}\frac{\partial u}{\partial w}
$$

We have

$$
\frac{\partial v}{\partial u} = 1
$$

and 

$$
\frac{\partial u}{\partial w} = x
$$

Combined, we obtain (knowing that $x=3$, see figure)

$$
\frac{\partial a}{\partial w} = \frac{\partial a}{\partial v}\frac{\partial v}{\partial u}\frac{\partial u}{\partial w} = 1\cdot 1\cdot 3 = 3
$$
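
The hand derivation above can be replayed numerically. The following is a minimal sketch in plain Python (no PyTorch), assuming the example values $x=3$, $w=2$, $b=1$ (consistent with $v=7$); the helper names `forward` and `backward` are illustrative and not part of the lecture code.

```python
def forward(x, w, b):
    """Forward pass of f(x; w, b) = ReLU(w*x + b), keeping all intermediates."""
    u = w * x
    v = u + b
    a = max(v, 0.0)
    return u, v, a


def backward(x, v):
    """Chain rule by hand: returns (da/dw, da/db)."""
    da_dv = 0.0 if v <= 0 else 1.0  # sub-derivative of ReLU
    dv_du = 1.0
    dv_db = 1.0
    du_dw = x
    return da_dv * dv_du * du_dw, da_dv * dv_db


x, w, b = 3.0, 2.0, 1.0
u, v, a = forward(x, w, b)
da_dw, da_db = backward(x, v)

# Sanity check of da/dw with a central finite difference
eps = 1e-6
fd_dw = (forward(x, w + eps, b)[2] - forward(x, w - eps, b)[2]) / (2 * eps)

print(v, a)          # 7.0 7.0
print(da_dw, da_db)  # 3.0 1.0
print(abs(fd_dw - da_dw) < 1e-4)
```

The backward step only needs $x$ and $v$, i.e., values that were already computed during the forward pass.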



Conceptually, the **forward pass** (i.e., the computation of the function value) is a sequence of standard tensor computations (in this example, simply the computation of **all** intermediate results). We need the **Directed Acyclic Graph (DAG)** of tensor operations only to compute derivatives (via the chain rule). 

When executing tensor operations, PyTorch can automatically (on-the-fly) construct the graph of operations to compute the gradient of any quantity with respect to any tensor involved.

In [48]:
import torch
from torch.autograd import grad
import torch.nn.functional as F # functional API

**Example**

Let's implement the simple example from above. Note the use of `requires_grad=True` (`False` by default) for the variables `w` and `b` (explained below).

In [68]:
x = torch.tensor([3.]) # Input
w = torch.tensor([2.], requires_grad=True) 
b = torch.tensor([1.], requires_grad=True) 

u = x*w
v = u+b
a = F.relu(v)
print(a)

tensor([7.], grad_fn=<ReluBackward0>)


A Tensor has a Boolean field `requires_grad`, set to `False` by default, which
states if PyTorch should build the graph of operations so that gradients with
respect to it can be computed.

**Note**: Only tensors of *floating point* (or complex) type can have their gradient computed.
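
The behaviour of the `requires_grad` flag can be probed directly. The snippet below is a small sketch with arbitrary example values; it illustrates that an integer tensor cannot require a gradient and that a result requires a gradient as soon as one of its inputs does.

```python
import torch

# Floating-point tensors can require gradients ...
w = torch.tensor([2.0], requires_grad=True)
print(w.dtype, w.requires_grad)

# ... integer tensors cannot: PyTorch raises a RuntimeError.
try:
    torch.tensor([2], requires_grad=True)
    raised = False
except RuntimeError:
    raised = True
print("integer tensor rejected:", raised)

# Results inherit the flag from their inputs.
x = torch.tensor([3.0])             # requires_grad=False by default
no_graph = (x * 2).requires_grad    # False: no input requires a gradient
with_graph = (x * w).requires_grad  # True: w requires a gradient
print(no_graph, with_graph)
```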

In [69]:
print("v =", v.item())

v = 7.0


Compute gradient $\partial a/\partial w$

In [70]:
grad_w = grad(a, w, retain_graph=True)
print(grad_w[0].item())

3.0


Note the use of `retain_graph=True`. By default, the part of the graph that computes `a` is destroyed (for memory reasons) once the gradient computation is done. Since we call `grad` on `a` a second time in the next cell (and since you may want to execute this cell multiple times), the graph has to be kept. With `retain_graph=False`, the next statement will fail unless you do another forward pass first (try it out). 
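
The failure mode can be reproduced in isolation. The following sketch rebuilds the small example and is only meant as an illustration: it calls `grad` twice without `retain_graph=True`, catches the resulting error, and then shows one way to sidestep the issue by requesting both gradients in a single call after a fresh forward pass.

```python
import torch
import torch.nn.functional as F
from torch.autograd import grad

x = torch.tensor([3.0])
w = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([1.0], requires_grad=True)

a = F.relu(x * w + b)
first = grad(a, w)[0].item()  # default retain_graph=False frees the saved buffers

try:
    grad(a, w)                # second pass through the same (freed) graph
    second_call_failed = False
except RuntimeError:
    second_call_failed = True

# A fresh forward pass rebuilds the graph; asking for both gradients
# in one call avoids the need for retain_graph altogether.
a = F.relu(x * w + b)
gw, gb = grad(a, (w, b))
print(first, second_call_failed, gw.item(), gb.item())
```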

Compute gradient $\partial a/\partial b$

In [71]:
print(grad(a, b)[0].item())

1.0


---

## A slightly more involved example

In this example, we have a slightly more complex computation graph with a *shared* weight $w$ and two paths to arrive at the result. The function is

$$f(x;w) = (wx)^2 + (wx)^3$$

<img src="DAG1.svg" alt="drawing" width="700"/>

Let's compute the partial derivative of $a$ wrt. $w$ by hand once more (with $x=2$ and $w=1$, hence $u_1=u_2=2$):


$$
\frac{\partial a}{\partial w} = \frac{\partial a}{\partial v_1}\frac{\partial v_1}{\partial u_1}\frac{\partial u_1}{\partial w} + \frac{\partial a}{\partial v_2}\frac{\partial v_2}{\partial u_2}\frac{\partial u_2}{\partial w} = 
1\cdot 2u_1\cdot x + 1 \cdot 3u_2^2 \cdot x = 8 + 24 = 32
$$

Again, having **all the intermediate values** during the **forward pass** makes it very easy to actually compute the gradient of $a$ wrt. $w$, i.e., $\partial a/\partial w$.

### Fix values



In [77]:
x = torch.tensor([2.])
w = torch.tensor([1.], requires_grad=True)

u1 = x*w
u2 = x*w

v1 = torch.pow(u1, 2)
v2 = torch.pow(u2, 3)

a = v1 + v2
print("a =", a.item())

a = 12.0


Next, we compute the gradient of $a$ wrt. $w$, i.e., $\partial a/\partial w$

In [78]:
print(grad(a, w)[0].item())

32.0
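
To see how the two paths add up, the contributions can be computed separately. This is a small sketch using the same values as above ($x=2$, $w=1$); it differentiates each branch on its own and compares the sum to the gradient of $a$.

```python
import torch
from torch.autograd import grad

x = torch.tensor([2.0])
w = torch.tensor([1.0], requires_grad=True)

v1 = (x * w).pow(2)  # first path:  (wx)^2
v2 = (x * w).pow(3)  # second path: (wx)^3
a = v1 + v2

g1 = grad(v1, w, retain_graph=True)[0]  # 2 * u1 * x
g2 = grad(v2, w, retain_graph=True)[0]  # 3 * u2^2 * x
g = grad(a, w)[0]                       # sum over both paths

print(g1.item(), g2.item(), g.item())   # 8.0 24.0 32.0
```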


---

## The AutoGrad machinery

The autograd DAG is encoded through the fields `grad_fn` of Tensors, and the
fields `next_functions` of functions.

In [81]:
# below is another way of saying requires_grad=True
x = torch.tensor([ 1.0, -2.0, 3.0, -4.0 ]).requires_grad_() 
a = x.pow(2.) # square each component of x
s = a.sum()
print(s.item())

30.0
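
The `grad_fn` / `next_functions` links can be followed by hand. The sketch below uses a hypothetical helper `walk` (not part of PyTorch or of the lecture utilities) that traverses the backward graph depth-first and prints the type name of each node.

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0, -4.0]).requires_grad_()
s = x.pow(2.0).sum()


def walk(fn, depth=0, names=None):
    """Depth-first traversal of the backward graph starting at `fn`."""
    names = [] if names is None else names
    if fn is None:  # inputs that do not require a gradient
        return names
    names.append(type(fn).__name__)
    print("  " * depth + names[-1])
    for next_fn, _ in getattr(fn, "next_functions", ()):
        walk(next_fn, depth + 1, names)
    return names


names = walk(s.grad_fn)
print(x.grad_fn)  # None: x is a leaf tensor created by the user
```

The traversal ends at an `AccumulateGrad` node, which is where the gradient for the leaf tensor `x` would be stored during a backward pass.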


Below, we visualize a simple computation graph:

In [82]:
x = torch.tensor([1., 2., 2.]).requires_grad_()
q = x.norm() # L2-norm of tensor sqrt(1^2+2^2+2^2)=3
print(q.item())

3.0


In [83]:
# write the graph description and close the file before rendering it
with open('simple1.dot', 'w') as f:
    agtree2dot.save_dot(q, {x: 'x', q: 'q'}, f)
HTML(visualize_DAG('simple1.dot', 'simple1.jpg'))

**Example**

Here's another, slightly more involved, example: We have $\mathbf{x} \in \mathbb{R}^{10}$ and **first** compute (with $\text{tanh}$ applied componentwise)

$$
\mathbf{h}  = \text{tanh}(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1)
$$

with $\mathbf{W}_1 \in \mathbb{R}^{20 \times 10}$ and $\mathbf{b}_1 \in \mathbb{R}^{20}$. This results in $\mathbf{h} \in \mathbb{R}^{20}$. **Second**, we compute

$$
\hat{\mathbf{y}}  = \text{tanh}(\mathbf{W}_2\mathbf{h} + \mathbf{b}_2)
$$

with $\mathbf{W}_2 \in \mathbb{R}^{5 \times 20}$ and $\mathbf{b}_2 \in \mathbb{R}^5$. This results in 
$\hat{\mathbf{y}} \in \mathbb{R}^5$. And **finally**, we compute (the mean-squared error)

$$
L(\mathbf{y},\hat{\mathbf{y}}) = \frac{1}{5} \sum_{i=1}^5(y_i-\hat{y}_i)^2
$$

where $y_i,\hat{y}_i$ are the $i$-th elements of $\mathbf{y}$ and $\hat{\mathbf{y}}$, respectively. In the code below, $y_i$ (which we use as targets) are just random numbers.

In [94]:
w1 = torch.rand(20, 10).requires_grad_()
b1 = torch.rand(20).requires_grad_()
w2 = torch.rand(5, 20).requires_grad_()
b2 = torch.rand(5).requires_grad_()

In [100]:
x = torch.rand(10)
h = torch.tanh(w1 @ x + b1)
yhat = torch.tanh(w2 @ h + b2)

y = torch.rand(5) # target

L = (y - yhat).pow(2).mean()
L.backward() # computes the partial derivatives wrt. w1,b1,w2,b2
print(L.item())

0.2959361672401428
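
After `backward()`, the partial derivatives are stored in the `.grad` field of each leaf tensor. The following sketch (with an arbitrary seed and a hypothetical helper `loss_fn` wrapping the forward pass from above) checks that each gradient has the shape of its parameter, and illustrates that a second backward pass accumulates into `.grad` rather than overwriting it.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make the sketch reproducible

w1 = torch.rand(20, 10).requires_grad_()
b1 = torch.rand(20).requires_grad_()
w2 = torch.rand(5, 20).requires_grad_()
b2 = torch.rand(5).requires_grad_()
x, y = torch.rand(10), torch.rand(5)
params = {"w1": w1, "b1": b1, "w2": w2, "b2": b2}


def loss_fn():
    """Forward pass: two tanh layers followed by the mean-squared error."""
    h = torch.tanh(w1 @ x + b1)
    yhat = torch.tanh(w2 @ h + b2)
    return (y - yhat).pow(2).mean()


loss_fn().backward()
for name, p in params.items():
    print(name, tuple(p.grad.shape))  # same shape as the parameter
shapes_match = all(p.grad.shape == p.shape for p in params.values())

# A second backward pass *adds* to .grad instead of overwriting it.
g_once = w2.grad.clone()
loss_fn().backward()
accumulated = torch.allclose(w2.grad, 2 * g_once)
print(accumulated)

# Reset the gradients before the next optimization step.
for p in params.values():
    p.grad = None
```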


Let's look at the computation graph ...

In [101]:
# write the graph description and close the file before rendering it
with open('simple2.dot', 'w') as f:
    agtree2dot.save_dot(L, 
                        {
                            w1: 'w1',
                            b1: 'b1',
                            w2: 'w2',
                            b2: 'b2',
                            L: 'loss'
                        }, f)
HTML(visualize_DAG('simple2.dot', 'simple2.jpg'))

### Using `torch.no_grad()`

The `torch.no_grad()` **context** switches off the autograd machinery, and can be
used for operations such as parameter updates.
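
The effect of the context can be checked on a single tensor. This is a minimal sketch with an arbitrary starting value; it shows that results computed inside `torch.no_grad()` are not tracked, and that an in-place update of a leaf tensor that requires a gradient is rejected outside the context.

```python
import torch

a = torch.tensor(0.7).requires_grad_()

# Outside the context, operations on `a` are recorded.
tracked_outside = (2 * a).requires_grad  # True

# Inside the context, nothing is recorded.
with torch.no_grad():
    tracked_inside = (2 * a).requires_grad  # False

# An in-place update of a leaf that requires a gradient raises
# a RuntimeError unless it happens inside torch.no_grad().
try:
    a -= 0.1
    rejected = False
except RuntimeError:
    rejected = True

with torch.no_grad():
    a -= 0.1  # fine: the update is invisible to autograd

print(tracked_outside, tracked_inside, rejected)
print(a.requires_grad, round(a.item(), 6))
```

Note that `a` still requires a gradient after the update; only the update itself is excluded from the graph.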

In [105]:
a = torch.tensor( 0.7).requires_grad_()
b = torch.tensor(-0.7).requires_grad_()
eta = 0.1

for k in range(100):
    l = (a - 1)**2 + (b + 1)**2 + (a - b)**2 # i.e., the forward pass
    ga, gb = torch.autograd.grad(l, (a, b))

    print(k,ga,gb)
    
    # walk towards the direction of the negative gradient 
    # (aka gradient descent) - both statements will not change
    # the computation graph, because they are within the 
    # torch.no_grad context!
    with torch.no_grad():
        a -= eta * ga 
        b -= eta * gb
    
print('%.06f' % a.item(), '%.06f' % b.item())

0 tensor(2.2000) tensor(-2.2000)
1 tensor(0.8800) tensor(-0.8800)
2 tensor(0.3520) tensor(-0.3520)
3 tensor(0.1408) tensor(-0.1408)
4 tensor(0.0563) tensor(-0.0563)
5 tensor(0.0225) tensor(-0.0225)
6 tensor(0.0090) tensor(-0.0090)
7 tensor(0.0036) tensor(-0.0036)
8 tensor(0.0014) tensor(-0.0014)
9 tensor(0.0006) tensor(-0.0006)
10 tensor(0.0002) tensor(-0.0002)
11 tensor(9.2387e-05) tensor(-9.2387e-05)
12 tensor(3.6836e-05) tensor(-3.6836e-05)
13 tensor(1.4663e-05) tensor(-1.4663e-05)
14 tensor(5.9605e-06) tensor(-5.9605e-06)
15 tensor(2.3842e-06) tensor(-2.3842e-06)
16 tensor(9.5367e-07) tensor(-9.5367e-07)
17 tensor(3.5763e-07) tensor(-3.5763e-07)
18 tensor(2.3842e-07) tensor(-2.3842e-07)
19 tensor(1.1921e-07) tensor(-1.1921e-07)
20 tensor(1.1921e-07) tensor(-1.1921e-07)
21 tensor(1.1921e-07) tensor(-1.1921e-07)
22 tensor(1.1921e-07) tensor(-1.1921e-07)
23 tensor(1.1921e-07) tensor(-1.1921e-07)
24 tensor(1.1921e-07) tensor(-1.1921e-07)
25 tensor(1.1921e-07) tensor(-1.1921e-07)
...

In case you do not believe that this is the minimum, fire up Mathematica and run the `2DMinimumExample.nb` available in this directory :)
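
The minimizer can also be checked by hand. The short derivation below writes $l(a,b)$ for the function minimized in the loop above and sets both partial derivatives to zero:

$$
\frac{\partial l}{\partial a} = 2(a-1) + 2(a-b) = 0, \qquad
\frac{\partial l}{\partial b} = 2(b+1) - 2(a-b) = 0
$$

This is the linear system $2a - b = 1$, $-a + 2b = -1$ with the solution

$$
a = \tfrac{1}{3}, \qquad b = -\tfrac{1}{3}\enspace.
$$

The Hessian $\begin{pmatrix} 4 & -2 \\ -2 & 4 \end{pmatrix}$ has eigenvalues $2$ and $6$ and is therefore positive definite, so this stationary point is the unique minimum. As a consistency check, the gradient at the starting point $(0.7, -0.7)$ evaluates to $(2.2, -2.2)$, which matches the first line of the printed output.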

### Using `detach()`

The `detach()` method creates a tensor which shares the data, but does not
require gradient computation, and is not connected to the current computation graph.

This method should be used when the gradient should not be propagated
beyond a variable, or to update leaf tensors.
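
These properties can be verified on a small tensor. The snippet below is a sketch with arbitrary values; it checks that the detached tensor shares memory with the original, has no `grad_fn`, and blocks the gradient flow through its branch.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2        # part of the graph: has a grad_fn
d = y.detach()   # same values, cut off from the graph

print(y.requires_grad, y.grad_fn is not None)  # True True
print(d.requires_grad, d.grad_fn)              # False None

shares_memory = d.data_ptr() == y.data_ptr()
print(shares_memory)                           # True: no copy is made

# Gradients stop at the detached tensor: only the y-branch contributes.
(y + 3 * d).sum().backward()
print(x.grad)                                  # tensor([2., 2., 2.])
```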

### A simple example

In [44]:
import torch
x = torch.ones(10, dtype=torch.float32, requires_grad=True)

y = x**2
z = x**3
r=(y+z).sum()
print(r.item())
r.backward() # computes the gradient of r wrt. x
print(x.grad)

20.0
tensor([5., 5., 5., 5., 5., 5., 5., 5., 5., 5.])


Now, let's exclude the contribution of the $x^3$ branch via `detach()`.

In [45]:
x = torch.ones(10, dtype=torch.float32, requires_grad=True)

y = x**2
z = (x**3).detach()
r=(y+z).sum()
print(r.item())
r.backward()
print(x.grad)

20.0
tensor([2., 2., 2., 2., 2., 2., 2., 2., 2., 2.])


In the example above, we have $f(x) = x^2+x^3$ and we compute $\frac{d}{dx}f(x)$, which is $2x+3x^2$. For $x=1$, this evaluates to 5, so we see a vector of all 5's. However, if we exclude the $x^3$ branch from the gradient computation, we end up with $2x$, which is 2 for $x=1$, and that is what we see. Note that the function value (20) is the same in both cases, as `detach()` only affects the backward pass.

Getting back to our example of finding the minimum of a function (see above) ...

In [46]:
a = torch.tensor( 0.5).requires_grad_()
b = torch.tensor(-0.5).requires_grad_()
eta = 0.1

for k in range(100):
    l = (a - 1)**2 + (b + 1)**2 + (a.detach() - b)**2
    ga, gb = torch.autograd.grad(l, (a, b))
    
    with torch.no_grad():
        a -= eta * ga
        b -= eta * gb
        
print('%.06f' % a.item(), '%.06f' % b.item())

1.000000 -0.000000
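
The result $(1, 0)$ can be explained with a short calculation. Because of `a.detach()`, the third term is treated as a constant with respect to $a$ (written $\bar{a}$ below, a notation introduced here only for this derivation), so the gradients used in the update are

$$
\frac{\partial l}{\partial a} = 2(a-1), \qquad
\frac{\partial l}{\partial b} = 2(b+1) - 2(\bar{a}-b)\enspace.
$$

The first expression vanishes only at $a=1$. Inserting $\bar{a}=1$ into the second gives $2(b+1) - 2(1-b) = 4b = 0$, i.e., $b=0$. The iteration therefore converges to $(1, 0)$ and not to the minimizer of the original function, since the update for $a$ no longer sees the coupling term $(a-b)^2$.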


In [123]:
%%timeit
a = torch.rand(500,256,256)
b = torch.rand(500,256,256)
torch.matmul(a,b)
for i in range(500):
    torch.matmul(a[i],b[i])        

171 ms ± 427 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
