# Automatic Differentiation in PyTorch

Modern deep learning frameworks rely heavily on *automatic differentiation*, a technique that efficiently computes gradients for optimizing complex models. PyTorch’s `autograd` module performs automatic differentiation for you, so gradient-based optimization requires very little extra code. Whether you're training neural networks, computing derivatives of custom functions, or performing higher-order differentiation, `autograd` handles the bookkeeping. In this tutorial, we'll explore how PyTorch tracks computations, calculates gradients using `.backward()` and `torch.autograd.grad()`, and handles multiple gradient computations. By the end, you'll have a clear understanding of how automatic differentiation works in PyTorch and how to apply it in your own projects. This working knowledge will also be essential when we implement physics-informed loss functions.

### Basics of Autograd - Computing Gradients using `.backward()`

PyTorch’s `autograd` module provides automatic differentiation, so gradients of tensor operations can be computed with a single call. This is particularly useful for optimization tasks such as training deep learning models, which we have already seen, and for training physics-informed models, as we will see shortly.

First of all, to enable PyTorch to track operations and compute gradients, set `requires_grad=True` when defining a tensor. This will trigger PyTorch to keep track of all operations performed on `x` to facilitate differentiation.

In [1]:
import torch

x = torch.tensor(1.0, requires_grad=True)

Once `requires_grad=True` is set, any operations on the tensor are recorded for gradient computation:

In [2]:
def tanh(x):
    y = torch.exp(-2.0 * x)
    return (1.0 - y) / (1.0 + y)

y = tanh(x)

At this point, PyTorch builds a computational graph connecting `x` and `y`. The system will use this graph to compute derivatives when needed. To compute the derivative of `y` with respect to `x`, you may simply call:

In [3]:
y.backward()
print(x.grad)    # Should print 0.42, since dy/dx = 1 - tanh(x)**2

tensor(0.4200)


Note that the derivative $dy/dx$ is stored in `x.grad`, after executing `.backward()`. 
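
As a quick sanity check, the value stored in `x.grad` can be compared against the closed-form derivative $1 - \tanh^2(x)$. The sketch below repeats the hand-written `tanh` from above and uses the built-in `torch.tanh` only for the reference value; the names `analytic` and `matches` are introduced here purely for illustration.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)

# Same hand-written tanh as above, built from differentiable torch operations.
e = torch.exp(-2.0 * x)
y = (1.0 - e) / (1.0 + e)
y.backward()                      # stores dy/dx in x.grad

# Closed-form derivative, evaluated without recording a graph.
with torch.no_grad():
    analytic = 1.0 - torch.tanh(x) ** 2

matches = torch.allclose(x.grad, analytic, atol=1e-6)
print(x.grad, analytic, matches)
```

Comparing against a known derivative like this is a cheap way to catch mistakes in a custom function before using it inside a larger model.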

Of course, you can compute the gradient of multivariate functions in the same way. For instance, let's say that you are interested in computing the gradient of the (squared) Euclidean norm $f(x)=x_1^2 + x_2^2 + \cdots + x_n^2$. The following code will compute the gradient of the function $f$.

In [5]:
def norm(x):
    return torch.sum(x**2)

x = 2*torch.rand(5) - 1   # a 5-dimensional random vector with elements ranging between -1 and 1.
x.requires_grad_()      # another way of setting requires_grad
y = norm(x)
y.backward()
print(x)
print(x.grad)
print(2*x)            # analytic gradient. `x.grad` should be the same as this one.

tensor([0.1589, 0.7113, 0.5476, 0.9500, 0.0543], requires_grad=True)
tensor([0.3177, 1.4226, 1.0952, 1.9001, 0.1087])
tensor([0.3177, 1.4226, 1.0952, 1.9001, 0.1087], grad_fn=<MulBackward0>)


### Important Notes About the `.backward()` Method

There are a few important things to remember about the way we computed gradients using the `.backward()` method. First, once `.backward()` is executed, the computational graph connecting `x` and `y`, which is needed to backpropagate the gradient, is automatically freed to save memory. For example, if you run `.backward()` one more time as in the following cell, you will get an error.

In [7]:
y.backward()        # this line raises a RuntimeError

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

As the error message indicates, PyTorch raises the error because it has already freed the computational graph (removed it from memory), so the backpropagation can no longer be executed.

In case you need to backpropagate through the same graph more than once, you must pass the `retain_graph=True` option:

In [13]:
x = torch.tensor(2.0, requires_grad=True)
y = x**3

y.backward(retain_graph=True)    # this will prevent PyTorch from freeing the computational graph
print(x.grad)               # should return 12, because dy/dx = 3*(x**2)

y.backward()                    # computes the gradient again
print(x.grad)               # should return 24, I'll explain why...

tensor(12.)
tensor(24.)


Note in the above that the output of the second round of backpropagation is doubled. This is because we didn't reset the derivative, and PyTorch, by default, keeps the results from the previous calculation and accumulates new ones on top of them. If you want to clear the derivatives and redo the calculation, you should call `x.grad.zero_()` beforehand:

In [14]:
x = torch.tensor(2.0, requires_grad=True)
y = x**3

y.backward(retain_graph=True)
print(x.grad)

x.grad.zero_()    # observe the difference with the addition of this line

y.backward()
print(x.grad)

tensor(12.)
tensor(12.)
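
The same accumulate-then-reset behavior is what a hand-written optimization loop has to deal with. The following is a minimal sketch of plain gradient descent on $(x-1)^2$, assuming a hypothetical step size `lr = 0.1` and 50 iterations chosen only for this demo. Because the graph is rebuilt in every iteration, `retain_graph=True` is not needed here, but the gradient still has to be reset each time.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
lr = 0.1                        # hypothetical step size, chosen for this demo

for step in range(50):
    y = (x - 1.0) ** 2          # a fresh graph is built in every iteration
    y.backward()                # adds dy/dx to whatever is stored in x.grad
    with torch.no_grad():
        x -= lr * x.grad        # parameter update, not tracked by autograd
    x.grad.zero_()              # reset, otherwise gradients pile up across steps

print(x)                        # ends up close to the minimizer x = 1
```

If the `x.grad.zero_()` line is removed, each update uses the sum of all past gradients rather than the current one, which is usually not what you want.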


### Computing the Higher Order Derivatives using the `.backward()` method

If everything above makes sense, computing the higher order derivatives using the `backward()` method shouldn't be terribly complicated. Let's see the example below:

In [18]:
x = torch.tensor(2.0, requires_grad=True)
y = x**2

y.backward(create_graph=True)    # create_graph is used instead of retain_graph
dydx = x.grad.clone()       # Make a copy of x.grad

x.grad.zero_()             # reset grad

dydx.backward()         # compute d^2y/dx^2 using the same graph
d2ydx2 = x.grad

print(dydx)             # This should be 4 because dy/dx = 2*x
print(d2ydx2)           # This should be 2 because d^2y/dx^2 = 2

tensor(4., grad_fn=<CloneBackward0>)
tensor(2., grad_fn=<ZeroBackward0>)


### Advanced Use of Autograd - Computing Gradients using `torch.autograd.grad`

PyTorch provides another way to compute gradients using `torch.autograd.grad()`, which allows **more flexibility** than `.backward()`. Both run the same reverse-mode automatic differentiation engine. In a nutshell, you can think of `.backward()` as the *backpropagation* workflow, a **special case** tailored to optimization: it differentiates an output and accumulates the result into the `.grad` attribute of each leaf tensor. In contrast, `torch.autograd.grad()` is the more general interface: it returns the derivatives of the outputs with respect to whichever inputs you specify, without modifying `.grad`, and it accepts multiple outputs and inputs.

Let's parse what all this means by using the following examples.

Here's a simple example demonstrating the `torch.autograd.grad()` method.

In [28]:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = x*y + y**2

grads = torch.autograd.grad(f, (x, y))    # this is how you use torch.autograd.grad

print(grads[0])   # should print df/dx, which is 3, because df/dx = y
print(grads[1])   # should print df/dy, which is 8, because df/dy = x + 2*y
print(x.grad, y.grad)  # prints None for both. I'll explain why.

tensor(3.)
tensor(8.)
None None


...and this is what you would've done with the `.backward()` method for comparison.

In [29]:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

f = x*y + y**2

f.backward()
print(x.grad)
print(y.grad)

tensor(3.)
tensor(8.)


Note that with the `torch.autograd.grad()` method, the gradients are returned as a separate output rather than stored in `x.grad` and `y.grad`. In fact, we just saw that running the line `print(x.grad, y.grad)` printed `None None`.

So, one of the major differences between the two methods is that:
- `.backward()`: Computes gradients and **stores them** in the `.grad` attribute of tensors.
- `torch.autograd.grad()`: Returns gradients **without modifying** `.grad`, useful when you need gradients without disturbing values already accumulated in `.grad`.
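
Because `torch.autograd.grad()` hands the gradient back as an ordinary tensor, that tensor can itself be differentiated again. Here is a minimal sketch for $f = x^3$ at $x = 3$, assuming `create_graph=True` on the first call so that the derivative stays connected to `x`; the variable names are illustrative only.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
f = x ** 3

# create_graph=True records the gradient computation itself,
# so the returned df/dx can be differentiated once more.
(dfdx,) = torch.autograd.grad(f, x, create_graph=True)
(d2fdx2,) = torch.autograd.grad(dfdx, x)

print(dfdx)      # df/dx = 3*x**2, which is 27 at x = 3
print(d2fdx2)    # d^2f/dx^2 = 6*x, which is 18 at x = 3
print(x.grad)    # still None, nothing was accumulated
```

Compared with the `.backward()` version of higher-order differentiation, there is no need to clone or zero `x.grad` between the two steps.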

Now, here's a more powerful use case for `torch.autograd.grad()`, which is the computation of derivatives for non-scalar outputs.

Consider a situation where we have a model that takes, just for simplicity, one input variable and spits out one output variable. Suppose that we have a batch of inputs and we are going to apply the model to these inputs to produce a batch of outputs. This is a typical scenario as we've seen in the previous lab sessions. In this case, we have a batch input and a batch output and the function (model) is no longer a scalar-valued function. Hence, the following code will not work.

In [60]:
def model(x):
    return x**2    # A toy model for demo. This could be a neural network.

x = torch.rand(8, requires_grad=True)   # a batch of 8 samples, each a scalar
y = model(x)                            # this will return a batch of 8 outputs, each corresponding to an element of x

y.backward()              # this raises an ERROR, because backward() without a `gradient` argument only works for scalar outputs

RuntimeError: grad can be implicitly created only for scalar outputs

Instead, using `torch.autograd.grad()`, we can compute the derivatives without a problem.

In [63]:
x = torch.rand(8, requires_grad=True)
y = model(x)

# note the `grad_outputs` argument. I'll explain it shortly...
grads = torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y))

print(grads)
print(2*x)      # analytical gradient. prediction must be the same as this one.

(tensor([1.6904, 0.8537, 1.5016, 0.3693, 0.5401, 0.0414, 0.1508, 1.6516]),)
tensor([1.6904, 0.8537, 1.5016, 0.3693, 0.5401, 0.0414, 0.1508, 1.6516],
       grad_fn=<MulBackward0>)


In the above, note that the `grad_outputs` argument is added, with a tensor of the same size as `y` and filled with ones. There's a lot to unpack about this one actually, so bear with me.

First, `grad_outputs` should be a tensor of the same shape as `y` (the output). Given `grad_outputs`, `torch.autograd.grad` computes what is called the *vector-Jacobian product* (*VJP*), which is defined as:

$
\mathbf{J}^\top\mathbf{v} =
\begin{bmatrix}
    \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\
    \vdots & \ddots & \vdots \\
    \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \\
\end{bmatrix}
\begin{bmatrix}
    v_1 \\
    \vdots \\
    v_m
\end{bmatrix}
$

where $\mathbf{v}$ is the vector specified for `grad_outputs`. Therefore, if we set `grad_outputs` or $\mathbf{v}$ to be the vector of the same size as `y` and filled with ones, we are effectively computing:

$
\text{grads[i]} = \sum_{j=1}^m \frac{\partial y_j}{\partial x_i}
$

which reduces to $\text{grads[i]} = \frac{\partial y_i}{\partial x_i}$ in our case, because $\frac{\partial y_j}{\partial x_i}=0$ if $i \neq j$ (note that in our example above, `y` was the element-wise square of `x`).

At a glance, this may look like a roundabout way of computing the gradient. However, in the actual implementation, PyTorch never constructs the Jacobian explicitly; it calculates the VJP directly. So the detour through the Jacobian adds no computational cost.
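
To see that `grad_outputs` really plays the role of $\mathbf{v}$ in $\mathbf{J}^\top\mathbf{v}$, it helps to pick a function whose Jacobian is known by construction. The sketch below assumes a linear map $\mathbf{y} = A\mathbf{x}$, for which the Jacobian is simply $A$; the matrix `A`, the vector `v`, and the fixed seed are illustrative choices, not part of the examples above.

```python
import torch

torch.manual_seed(0)                    # fixed seed only so that reruns match
A = torch.rand(3, 4)                    # y = A @ x maps R^4 to R^3, so dy/dx = A
x = torch.rand(4, requires_grad=True)
v = torch.tensor([1.0, -2.0, 0.5])      # arbitrary vector with the same shape as y

y = A @ x
(vjp,) = torch.autograd.grad(y, x, grad_outputs=v)

by_hand = A.T @ v                       # J^T v written out explicitly
print(vjp)
print(by_hand)
```

Here the Jacobian is not diagonal, so the result mixes contributions from all three outputs, weighted by the entries of `v`.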

The `grad_outputs=torch.ones_like(y)` trick is what we will use very frequently when implementing physics-informed neural networks (PINNs).
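
As a rough preview of how this looks in practice, the sketch below computes first and second derivatives for a whole batch at once. It assumes $u(x) = \sin x$ as a stand-in for a network output and checks the toy relation $u'' + u = 0$; the function, the relation, and the name `residual` are all chosen here for illustration and are not taken from a specific PINN implementation.

```python
import torch

x = torch.linspace(0.0, 3.0, 8, requires_grad=True)   # batch of 8 scalar inputs
u = torch.sin(x)                                      # stand-in for a model output u(x)

# First derivative for every sample; create_graph=True keeps it differentiable.
(du_dx,) = torch.autograd.grad(
    u, x, grad_outputs=torch.ones_like(u), create_graph=True
)
# Second derivative, obtained by differentiating du_dx in the same way.
(d2u_dx2,) = torch.autograd.grad(
    du_dx, x, grad_outputs=torch.ones_like(du_dx)
)

residual = d2u_dx2 + u        # u'' + u vanishes for u = sin(x)
print(residual.abs().max())
```

Each sample's output depends only on its own input, so the vector of ones picks out the per-sample derivatives, just as in the element-wise square example.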

### Computing Jacobian

Of course, if you do need the actual Jacobian, there is a way to get it. Consider a function $\mathbf{f}:\mathbb{R}^2\rightarrow\mathbb{R}^2$, given by

$ \mathbf{f}\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = 
\begin{bmatrix} f_1(x,y) \\ f_2(x,y) \end{bmatrix} =
\begin{bmatrix} x^2y \\ 5x+\sin y \end{bmatrix}
$

Then the Jacobian of $\mathbf{f}$ is obtained as:

$
\mathbf{J}_\mathbf{f}(x,y) =
\begin{bmatrix}
    \nabla f_1^\top \\
    \nabla f_2^\top
\end{bmatrix} =
\begin{bmatrix}
    \frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\
    \frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y}
\end{bmatrix} =
\begin{bmatrix}
    2xy & x^2 \\
    5 & \cos y
\end{bmatrix}
$

For the full Jacobian (as opposed to a VJP), we can use `torch.autograd.functional.jacobian()`:

In [65]:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(torch.pi/2, requires_grad=True)

def func(xy):
    x, y = xy
    return torch.stack([
        (x**2)*y,
        5*x + torch.sin(y)
    ])

J = torch.tensor([
    [2*x*y, x**2],
    [5, torch.cos(y)]
])

xy = torch.stack([x,y])

grads = torch.autograd.functional.jacobian(func, xy)

print(func(xy))    # function value at (x, y)
print(J)
print(grads)

tensor([ 6.2832, 11.0000], grad_fn=<StackBackward0>)
tensor([[ 6.2832e+00,  4.0000e+00],
        [ 5.0000e+00, -4.3711e-08]])
tensor([[ 6.2832e+00,  4.0000e+00],
        [ 5.0000e+00, -4.3711e-08]])


Note that, even though it is possible to compute the full Jacobian, the computational load can snowball quickly as the dimensions $m$ and $n$ grow. If a VJP is all you need, compute that instead of the full Jacobian whenever you can.
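
One way to get a feel for this cost is to assemble the Jacobian by hand from VJPs: choosing $\mathbf{v}$ as a one-hot vector picks out one row of $\mathbf{J}$, so $m$ outputs require $m$ separate backward passes. The sketch below does this for the same `func` as above and compares against `torch.autograd.functional.jacobian()`. It is meant only to illustrate the scaling, not to describe how PyTorch computes the Jacobian internally.

```python
import torch

def func(xy):
    x, y = xy
    return torch.stack([(x**2) * y, 5 * x + torch.sin(y)])

xy = torch.tensor([2.0, torch.pi / 2], requires_grad=True)
out = func(xy)

rows = []
for i in range(out.numel()):
    e_i = torch.zeros_like(out)
    e_i[i] = 1.0                      # one-hot vector selecting output i
    # retain_graph=True because the same graph is reused for every row
    (row,) = torch.autograd.grad(out, xy, grad_outputs=e_i, retain_graph=True)
    rows.append(row)

J_rows = torch.stack(rows)            # one backward pass per output
J_ref = torch.autograd.functional.jacobian(func, xy)
print(J_rows)
print(J_ref)
```

With two outputs this is cheap, but the number of backward passes in this loop grows linearly with the number of outputs.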

### Summary

In this session, we saw different ways of computing gradients and other derivatives in PyTorch. Mainly, we compared the `.backward()` method and the `torch.autograd.grad()` method. I know it's a lot of information to process, but here's a quick summary of what we learned in this session:

| Feature | `tensor.backward()` | `torch.autograd.grad()` |
| ------- | ------------------- | ----------------------- |
| Computes gradient? | ✅ Yes | ✅ Yes |
| Stores gradient in `.grad`? | ✅ Yes | ❌ No (returns as output) |
| Works on scalar outputs? | ✅ Yes | ✅ Yes |
| Works on non-scalar outputs? | ❌ Not by default (must provide `gradient=` argument) | ✅ Yes (must specify `grad_outputs`) |
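
The `gradient=` argument mentioned in the last row plays the same role for `.backward()` that `grad_outputs` plays for `torch.autograd.grad()`. A minimal sketch, reusing the element-wise square as a toy model:

```python
import torch

x = torch.rand(8, requires_grad=True)   # a batch of 8 scalar samples
y = x ** 2                              # non-scalar output

# Passing a vector of ones makes backward() compute the same VJP as
# torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y)).
y.backward(gradient=torch.ones_like(y))

expected = 2 * x.detach()               # analytic derivative for comparison
print(x.grad)
print(expected)
```

The difference from `torch.autograd.grad()` remains the same as before: the result is accumulated into `x.grad` instead of being returned.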


When should you use `torch.autograd.grad()` instead of `.backward()`?

1. When you don't want to modify `.grad` (e.g., to avoid accumulation).
1. When computing gradients for multiple variables at once.
1. When working with higher-order derivatives (e.g., Hessians) or Jacobians.
1. When differentiating non-scalar outputs, using `grad_outputs`.