
# How does the "gradient" argument in PyTorch's "backward" function work - explained by examples

This post collects examples of the `gradient` argument in PyTorch's `backward` function. The math of `backward(gradient)` is explained in this [tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients) and in these threads ([thread-1](https://discuss.pytorch.org/t/gradient-argument-in-out-backward-gradient/12742), [thread-2](https://stackoverflow.com/questions/43451125/pytorch-what-are-the-gradient-arguments)), along with some examples. Those were very helpful, but I wanted more examples showing how the numbers correspond to the math. I could not find many, so I made some and wrote them down here, so that I can look back when I forget all this in two weeks.

In each example, I run the code in torch, write down the math, redo the math in numpy, and show that the torch result matches the numpy result.

Here is how the PyTorch [tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients) explains the math:

Insert picture


*We will make examples of $x$ and $y=f(x)$ (we omit the arrow-hats of $x$ and $y$ above), and manually calculate the Jacobian $J$.*
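
For reference, the standard Jacobian of a function $y=f(x)$ with $n$ inputs and $m$ outputs can be written as below. The sizes $n$ and $m$ are notation assumed here for illustration. Row $i$ holds the partial derivatives of $y_i$, and column $j$ holds the derivatives with respect to $x_j$.

$$
J =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
$$

With a scalar output, $m=1$ and $J$ is a single row.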

The PyTorch tutorial goes on with the explanation:

Insert picture


In short: if you pass $v^T$ as the `gradient` argument, then after calling `y.backward(gradient)`, `x.grad` holds not $J$ but $v^T \cdot J$.

*We will make examples of $v^T$, calculate $v^T \cdot J$ in numpy, and confirm that the result is the same as `x.grad` after calling `y.backward(gradient)` where `gradient` is $v^T$.*
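
As a sketch of the shapes involved (with $m$ outputs and $n$ inputs, notation assumed here for illustration), $v^T$ is a row of $m$ numbers and the product is a row of $n$ numbers, one per component of $x$:

$$
v^T \cdot J =
\begin{pmatrix} v_1 & \cdots & v_m \end{pmatrix}
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
=
\begin{pmatrix}
\sum_{i=1}^{m} v_i \frac{\partial y_i}{\partial x_1} & \cdots & \sum_{i=1}^{m} v_i \frac{\partial y_i}{\partial x_n}
\end{pmatrix}
$$

This is why `x.grad` has the same shape as `x`, and why `gradient` has to have the same shape as `y`.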

All good? Let's go.

In [147]:
from torch import tensor
from numpy import array

## input is scalar, output is scalar

First, a simple example where $x=1$ and $y = x^2$ are both scalar. 

In PyTorch:

In [148]:
x = tensor(1., requires_grad=True)
print('x:', x)
y = x**2
print('y:', y)
y.backward() # this is the same as y.backward(tensor(1.))
print('x.grad:', x.grad)

x: tensor(1., requires_grad=True)
y: tensor(1., grad_fn=<PowBackward0>)
x.grad: tensor(2.)


Now manually calculate the Jacobian $J$. Here $x$ and $y$ are both scalars (each has a single component, $x_1$ and $y_1$ respectively), so we have

$$
J = [\partial y_1 / \partial x_1] = [\partial y / \partial x] = [2x]
$$

In numpy:

In [149]:
x = x.detach().numpy()  # detach from the autograd graph and convert to numpy
J = array([[2*x]])  # 1x1 Jacobian: dy/dx = 2x
print('J:', J)

J: [[2.]]
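
One way to double-check a hand-derived Jacobian entry is a central finite difference. The sketch below is plain Python. The names `numerical_derivative` and `square` and the step size `eps` are hypothetical, chosen for illustration, and are not part of PyTorch or numpy.

```python
def numerical_derivative(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def square(x):
    # Same function as in the example: y = x^2
    return x ** 2


x0 = 1.0
analytic = 2 * x0                            # the hand-derived entry 2x
numeric = numerical_derivative(square, x0)   # approximation, not exact

print('analytic:', analytic)
print('numeric :', numeric)
```

The two values should agree up to a small floating-point error, which suggests the hand-derived $J$ is right.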


In this example, we did not pass the `gradient` argument to `backward()`, so it defaults to 1. In other words, $v^T$ is our `gradient`, and here its value is 1. We can confirm that $v^T \cdot J$ gives the same result as `x.grad`. All good.

In [150]:
vᵀ = array([[1,]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

vᵀ: [[1]]
vᵀ・J: [[2.]]


## input is scalar, output is scalar, non-default gradient

We can keep everything else the same but pass a non-default `gradient` with the value 100 to `backward()` instead of the default value 1.

In [151]:
x = tensor(1., requires_grad=True)
print('x:', x)
y = x**2
print('y:', y)
gradient_value = 100.
y.backward(tensor(gradient_value)) 
print('x.grad:', x.grad)

x: tensor(1., requires_grad=True)
y: tensor(1., grad_fn=<PowBackward0>)
x.grad: tensor(200.)


This is the same as setting $v^T$ to 100, and we can see that $v^T \cdot J$ still matches `x.grad`. Still good.

In [152]:
x = x.detach().numpy()
J = array([[2*x]])
print('J:', J)

vᵀ = array([[gradient_value,]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

J: [[2.]]
vᵀ: [[100.]]
vᵀ・J: [[200.]]
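
Another way to read a non-default `gradient` is as a scaling of the output: passing 100 should give the same `x.grad` as multiplying $y$ by 100 and calling `backward()` with the default. The sketch below checks this for the same $x$ and $y$. The names `x1`, `x2`, `y1` and `l` are chosen here for illustration.

```python
import torch

# Case 1: pass gradient=100 to backward().
x1 = torch.tensor(1., requires_grad=True)
y1 = x1 ** 2
y1.backward(torch.tensor(100.))

# Case 2: scale the output by 100, then call backward() with the default gradient.
x2 = torch.tensor(1., requires_grad=True)
l = 100. * x2 ** 2
l.backward()

print('x1.grad:', x1.grad)
print('x2.grad:', x2.grad)
```

Both cases give 200, consistent with $v^T \cdot J = 100 \cdot 2x$ at $x=1$.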


## input is vector, output is scalar

Now we look at a more interesting example where $x=[x_1,x_2]=[1,2]$ is a vector and $y=\sum x_i$ is a scalar. 

In [153]:
x = tensor([1., 2.], requires_grad=True)
print('x:', x)
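y = sum(x)  # Python's built-in sum adds the elements one by one (hence AddBackward0 below); x.sum() is the tensor method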
print('y:', y)
y.backward() 
print('x.grad:', x.grad)

x: tensor([1., 2.], requires_grad=True)
y: tensor(3., grad_fn=<AddBackward0>)
x.grad: tensor([1., 1.])


Now manually calculate the Jacobian $J$. In this case $x$ is a vector with components $x_1$ and $x_2$, and $y=x_1+x_2$ is a scalar, so we have

$$
J =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2}
\end{pmatrix}
=
\begin{pmatrix}
1 & 1
\end{pmatrix}
$$


In numpy:

In [155]:
J = array([[1, 1]])
print('J:')
print(J)

J:
[[1 1]]
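
The hand-computed $J$ can also be cross-checked against autograd itself. The sketch below uses `torch.autograd.functional.jacobian`. The wrapper function `f` and the name `J_auto` are chosen here for illustration.

```python
import torch
from torch.autograd.functional import jacobian


def f(x):
    # Same function as in the example: y = x_1 + x_2
    return x.sum()


x = torch.tensor([1., 2.])
J_auto = jacobian(f, x)  # shape (2,): the scalar output adds no dimension

print('J_auto:', J_auto)
```

Because the output is a scalar, the result comes back with shape `(2,)` rather than the `(1, 2)` array written by hand, but the entries are the same.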


In this example, we did not pass the `gradient` argument to `backward()`, so it defaults to 1, i.e., $v^T$ has value 1. We can confirm that $v^T \cdot J$ gives the same result as `x.grad`.

In [158]:
vᵀ = array([[1]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

vᵀ: [[1]]
vᵀ・J: [[1 1]]


## input is vector, output is scalar, non-default gradient

We can keep everything else the same as above but pass a non-default `gradient` with the value 100 to `backward()` instead of the default value 1. As before, $x=[x_1,x_2]=[1,2]$ is a vector and $y=\sum x_i$ is a scalar.

In [170]:
x = tensor([1., 2.], requires_grad=True)
print('x:', x)
y = sum(x)
gradient_value = 100.
y.backward(tensor(gradient_value)) 
print('x.grad:', x.grad)

x: tensor([1., 2.], requires_grad=True)
x.grad: tensor([100., 100.])


This is the same as setting $v^T$ to 100, and we can see that $v^T \cdot J$ still matches `x.grad`. Still good.

In [171]:
J = array([[1, 1]])
print('J:')
print(J)


vᵀ = array([[gradient_value,]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

J:
[[1 1]]
vᵀ: [[100.]]
vᵀ・J: [[100. 100.]]
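
The same vector-Jacobian product can be computed without touching `x.grad`, using `torch.autograd.grad`, where the `grad_outputs` argument plays the role of `gradient` (that is, $v^T$). A minimal sketch, with the name `vjp` chosen here for illustration:

```python
import torch

x = torch.tensor([1., 2.], requires_grad=True)
y = x.sum()

# grad_outputs plays the role of the `gradient` argument of backward().
(vjp,) = torch.autograd.grad(y, x, grad_outputs=torch.tensor(100.))

print('vjp:', vjp)
print('x.grad:', x.grad)  # autograd.grad returns the result instead of storing it in x.grad
```

This returns the product directly, which is handy when you want $v^T \cdot J$ as a value rather than accumulated into `x.grad`.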
