# How does the "gradient" argument in PyTorch's "backward" function work - explained by examples

This post collects some examples of the `gradient` argument in PyTorch's `backward` function. The math of `backward(gradient)` is explained in this [tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients) and these threads ([thread-1](https://discuss.pytorch.org/t/gradient-argument-in-out-backward-gradient/12742), [thread-2](https://stackoverflow.com/questions/43451125/pytorch-what-are-the-gradient-arguments)), along with some examples. Those were very helpful, but I wished for more examples showing how the numbers in the code correspond to the math. I could not find many, so I wrote some down here, so that I can look back when I forget this in two weeks.

In each example, I run the code in torch, write down the math, redo the math in numpy, and show that the torch result matches the numpy result.

Here's how the PyTorch [tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients) explains the math:

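Mathematically, if you have a vector valued function $\vec{y} = f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix:

$$
J = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
$$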


*We will make examples of `x` and `y=f(x)` (we omit the arrow-hats of `x` and `y` above), and manually calculate Jacobian `J`.* 

The PyTorch tutorial goes on with the explanation:

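Generally speaking, `torch.autograd` is an engine for computing vector-Jacobian product. That is, given any vector $v = (v_1 \; v_2 \; \cdots \; v_m)^T$, compute the product $v^T \cdot J$. If $v$ happens to be the gradient of a scalar function $l = g(\vec{y})$, that is, $v = \left(\frac{\partial l}{\partial y_1} \; \cdots \; \frac{\partial l}{\partial y_m}\right)^T$, then by the chain rule, the vector-Jacobian product would be the gradient of $l$ with respect to $\vec{x}$:

$$
J^T \cdot v = \begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
\begin{pmatrix}
\frac{\partial l}{\partial y_1} \\
\vdots \\
\frac{\partial l}{\partial y_m}
\end{pmatrix}
= \begin{pmatrix}
\frac{\partial l}{\partial x_1} \\
\vdots \\
\frac{\partial l}{\partial x_n}
\end{pmatrix}
$$

(Note that $v^T \cdot J$ gives a row vector which can be treated as a column vector by taking $J^T \cdot v$.)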

The above basically says: if you pass `vᵀ` as the `gradient` argument, then `y.backward(gradient=vᵀ)` gives you not `J` but `vᵀ・J` as `x.grad`.
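
As a quick sanity check of that statement before the step-by-step examples, here is a minimal sketch that lets torch build the full Jacobian with `torch.autograd.functional.jacobian` and compares `vᵀ・J` with `x.grad`. The function `f`, the input `x`, and the vector `v` are arbitrary values chosen for illustration; they are not taken from the tutorial.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Illustrative element-wise function: y_i = x_i ** 2,
    # so J is diagonal with entries 2 * x_i.
    return x ** 2

x = torch.tensor([1., 2., 3.], requires_grad=True)
v = torch.tensor([1., 0.1, 0.01])  # arbitrary example vector, same shape as y

y = f(x)
y.backward(gradient=v)  # autograd stores vᵀ・J in x.grad

J = jacobian(f, x.detach())  # full 3x3 Jacobian, computed separately
vjp = v @ J                  # vector-Jacobian product done by hand

print('J:\n', J)
print('vᵀ・J :', vjp)
print('x.grad:', x.grad)
print('match :', torch.allclose(vjp, x.grad))
```

Note that `backward` never materializes `J` itself; building it with `jacobian` here is only for comparison.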

*We will make examples of `vᵀ`, calculate `vᵀ・J` in numpy, and confirm that the result is the same as `x.grad` after calling `y.backward(gradient=vᵀ)`.*

All good? Let's go.

In [53]:
from torch import tensor
from numpy import array

## input is scalar, output is scalar

First, the simplest example where `x = 1` and `y = x**2` are both scalar. 

In [100]:
x = tensor(1., requires_grad=True)
print('x:', x)
y = x**2
print('y:', y)
y.backward() # this is the same as y.backward(tensor(1.))
print('x.grad:', x.grad)

x: tensor(1., requires_grad=True)
y: tensor(1., grad_fn=<PowBackward0>)
x.grad: tensor(2.)


Manually calculate `J`. In this case, calculus says `dy/dx = 2*x`, so `J` has the value `2*x`, and its shape is 1x1.

In [101]:
x = x.detach().numpy()
J = array([[2.*x]])
print('J:', J)

J: [[2.]]


In this example, we did not pass the `gradient` argument to `backward()`, and this defaults to passing the value `1`. As a reminder, `vᵀ` is our `gradient`, and here it has shape 1x1 with value `1` in it. We can confirm that `vᵀ・J` gives the same result as `x.grad`. All good.

In [102]:
vᵀ = array([[1.,]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

vᵀ: [[1.]]
vᵀ・J: [[2.]]
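
The claim that `y.backward()` behaves like `y.backward(tensor(1.))` for a scalar `y` can also be checked directly in torch. The sketch below uses a hypothetical helper, `grad_of_square`, which is not part of PyTorch; it rebuilds the graph on every call so that the two results are independent.

```python
import torch

def grad_of_square(gradient=None):
    # Hypothetical helper: rebuild x and y on every call so that
    # gradients from one call do not accumulate into the next.
    x = torch.tensor(1., requires_grad=True)
    y = x ** 2
    if gradient is None:
        y.backward()          # implicit gradient
    else:
        y.backward(gradient)  # explicit gradient
    return x.grad

g_default = grad_of_square()
g_explicit = grad_of_square(torch.tensor(1.))

print('default :', g_default)
print('explicit:', g_explicit)
print('equal   :', torch.equal(g_default, g_explicit))
```

The implicit default only exists for scalar outputs; when `y` is not a scalar, `backward()` needs an explicit `gradient` of the same shape as `y`.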


## input is scalar, output is scalar, non-default gradient

We can keep everything else the same but pass a non-default `gradient` with the value `100` to `backward()`.

In [103]:
x = tensor(1., requires_grad=True)
print('x:', x)
y = x**2
print('y:', y)
gradient_value = 100.
y.backward(tensor(gradient_value)) 
print('x.grad:', x.grad)

x: tensor(1., requires_grad=True)
y: tensor(1., grad_fn=<PowBackward0>)
x.grad: tensor(200.)


This is the same as setting the value `100` for `vᵀ`, and we can see `vᵀ・J` still matches `x.grad`. Still good.

In [104]:
x = x.detach().numpy()
J = array([[2.*x]])
print('J:', J)

vᵀ = array([[gradient_value,]])
print('vᵀ:', vᵀ)
print('vᵀ・J:', vᵀ@J)

J: [[2.]]
vᵀ: [[100.]]
vᵀ・J: [[200.]]
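
Another way to read the `gradient` argument, following the tutorial's chain-rule remark: passing `v` should behave as if `y` fed into a further scalar `l` with `∂l/∂y = v`. The sketch below assumes the simplest such function, `l = 100 * y` (my choice for illustration), and compares the two routes.

```python
import torch

gradient_value = 100.

# Route 1: pass the gradient argument explicitly.
x1 = torch.tensor(1., requires_grad=True)
y1 = x1 ** 2
y1.backward(torch.tensor(gradient_value))

# Route 2: build the scalar l = gradient_value * y, so that dl/dy = gradient_value,
# and call backward() with the default gradient.
x2 = torch.tensor(1., requires_grad=True)
l = gradient_value * x2 ** 2
l.backward()

print('x1.grad:', x1.grad)
print('x2.grad:', x2.grad)
print('equal  :', torch.equal(x1.grad, x2.grad))
```

Both routes compute `dl/dx = (dl/dy) * (dy/dx) = 100 * 2 * x`.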


## input is vector, output is scalar

Now we look at a slightly more interesting example where `x = [1, 2]` is a vector and `y = sum(x)` is a scalar. 

In [105]:
x = tensor([1., 2.], requires_grad=True)
print('x:', x)
y = sum(x)
print('y:', y)
y.backward() 
print('x.grad:', x.grad)

x: tensor([1., 2.], requires_grad=True)
y: tensor(3., grad_fn=<AddBackward0>)
x.grad: tensor([1., 1.])


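Manually calculate `J`. In this case, `y = x1 + x2`, so calculus says `J` has the value $\begin{pmatrix} \frac{\partial y}{\partial x_1} & \frac{\partial y}{\partial x_2} \end{pmatrix} = \begin{pmatrix} 1 & 1 \end{pmatrix}$, and its shape is 1x2 (one row per output, one column per input).

In [101]:
x = x.detach().numpy()
J = array([[1., 1.]])  # dy/dx1 = 1 and dy/dx2 = 1
print('J:', J)

J: [[1. 1.]]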
