<div>
<img src="https://discuss.pytorch.org/uploads/default/original/2X/3/35226d9fbc661ced1c5d17e374638389178c3176.png" width="400" style="margin: 50px auto; display: block; position: relative; left: -30px;" />
</div>

<!--NAVIGATION-->
# < [Data](2-Data.ipynb) | Autograd | [Optimization](4-Optimization.ipynb) >

### Automatic Differentiation

Automatic differentiation (autodiff) is a key feature of PyTorch.
PyTorch can differentiate the outcome of a computation built from tensor operations with respect to its inputs, so you don't need to compute the gradients yourself. This allows you to express and optimize complex models without worrying about differentiating them correctly by hand.

We will start by discussing a little bit of the math behind autodiff. We then cover PyTorch's `.backward()` method that does everything automatically for you. Finally, we have a quick look under the hood to see how PyTorch does its magic.

### Table of Contents

#### 1. [Understanding gradient computation](#Understanding-gradient-computation)
#### 2. [A linear regression example](#A-linear-regression-example)
#### 3. [Useful features](#Useful-features)
#### 4. [Advanced topics](#Advanced-topics)
---


In [None]:
import torch
torch.__version__

# Understanding gradient computation

### Requires grad attribute

The `requires_grad` property on a Tensor tells PyTorch to track computations based on this tensor. 
After you compute a quantity `y` (forward pass), you can compute the gradient of `y` with respect to all tensors that have `requires_grad==True`. 

In [None]:
x = torch.Tensor([2])
print(x)

In [None]:
print(x.requires_grad)

In [None]:
x.requires_grad_(True)  # note the underscore

In [None]:
print(x.requires_grad)

### Checking how Autograd tracks operations

In [None]:
y = x * x
print(y)

In [None]:
print(y.requires_grad)

In [None]:
print(y.grad_fn)

In [None]:
z = y + 4
print(z)

### Computing gradients with `.backward()`

The gradient computation (the backward pass) is triggered with `z.backward()`. You will find the computed gradients in `x.grad`.

This computes the gradient of z with respect to x.

In [None]:
print(x.grad)

In [None]:
z.backward()

In [None]:
print(x.grad)

Here, we have $z(x) = x^2 + 4$, therefore $\frac{dz}{dx}(x) = 2x$. With $x = 2$, we can indeed check that `x.grad` equals `2 * x`, i.e. `4`.
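
As a quick sanity check, the sketch below rebuilds the same computation and compares the autograd result with the hand-derived value $2x$ and with a central finite difference. The helper `f`, the step size `eps`, and the tolerance are illustrative choices, not part of the original notebook.

```python
import torch

# Rebuild the example: z(x) = x^2 + 4 at x = 2
x = torch.tensor([2.0], requires_grad=True)
z = x * x + 4
z.backward()

# Hand-derived gradient: dz/dx = 2x
analytic = 2 * x.detach()

# Central finite difference as an independent numerical check
def f(t):
    return t * t + 4

eps = 1e-3  # illustrative step size
x0 = x.detach()
numeric = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)

print(x.grad)                                      # gradient from autograd
print(torch.allclose(x.grad, analytic))            # agrees with 2x
print(torch.allclose(x.grad, numeric, atol=1e-2))  # agrees with the finite difference
```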

This simple polynomial expression is easy enough to differentiate by hand. When expressions become tensor-valued and more complex, however, computing gradients becomes tedious and error-prone. The power of PyTorch is that it can compute gradients of any tensor with respect to its 'inputs' automatically. This greatly simplifies the optimization of complex, creative ML models.

Remember:
- `tensor.requires_grad`
- `tensor.grad`
- `tensor.backward()`

# A linear regression example

Consider a simple linear regression with a squared-error loss: $\mathrm{loss} = (x \cdot W + b - y)^{2}$

Let's create our sample data point (`x`, `y`) and our regression parameters `W` and `b`.

Since we want to update `W` and `b`, we need the gradients of the loss with respect to them, so we set their `requires_grad` attribute to `True`.

In [None]:
x = torch.Tensor([1,2,3])
y = torch.Tensor([1])

W = torch.rand((3,1), requires_grad=True)
b = torch.rand(1, requires_grad=True)
print(W, "\n\n", b)

In [None]:
loss = (x @ W + b - y) ** 2
print(loss)

<p>
<img src="figures/grad.png" width="400" style="margin-left: auto;margin-right: auto;display: block;" />
</p>


Before calling `.backward()`, all gradients are `None`.

In [None]:
print(W.grad, b.grad)

In [None]:
loss.backward()

<p>
<img src="figures/backward.png" width="400" style="margin-left: auto;margin-right: auto;display: block;" />
</p>


After calling `.backward()`, the gradients of all parameters have been computed!

In [None]:
print(W.grad, "\n\n", b.grad)

Note: No gradient of the `loss` is computed with respect to `x` and `y`, since they do not require gradients.

In [None]:
print(x.grad, y.grad)
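
For this loss, the gradients can also be derived by hand. Writing the residual as $r = x \cdot W + b - y$, the chain rule gives $\frac{\partial\,\mathrm{loss}}{\partial W} = 2 r\, x^{\top}$ (a column of shape `(3, 1)`) and $\frac{\partial\,\mathrm{loss}}{\partial b} = 2 r$. The sketch below compares these expressions with what autograd stores, using freshly created tensors so that it runs on its own. The names `residual`, `expected_W`, and `expected_b` are introduced here for illustration only.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0])
W = torch.rand((3, 1), requires_grad=True)
b = torch.rand(1, requires_grad=True)

loss = (x @ W + b - y) ** 2
loss.backward()

# Hand-derived gradients, computed without tracking history
with torch.no_grad():
    residual = x @ W + b - y                     # shape (1,)
    expected_W = 2 * residual * x.unsqueeze(1)   # shape (3, 1)
    expected_b = 2 * residual                    # shape (1,)

print(torch.allclose(W.grad, expected_W))
print(torch.allclose(b.grad, expected_b))
```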

#### Gradients accumulate!

In [None]:
loss = (x @ W + b - y) ** 2
loss.backward()

In [None]:
print(W.grad, "\n\n", b.grad)

You see that the second time, the gradient computed is twice as big. This is because `.backward()` __accumulates__ the gradients.

If you want fresh gradient values, you need to set the `.grad` attributes of the parameters to zero before you call `.backward()`.
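
A minimal sketch of this reset, rebuilding the regression tensors so that it runs on its own. The helper `compute_loss` is introduced here for illustration. The gradients are cleared with the in-place `zero_()` method; assigning `None` to `.grad` is another common way to reset them.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([1.0])
W = torch.rand((3, 1), requires_grad=True)
b = torch.rand(1, requires_grad=True)

def compute_loss():
    # Same squared-error loss as in the regression example
    return (x @ W + b - y) ** 2

compute_loss().backward()
first_grad = W.grad.clone()   # gradient after a single backward pass

compute_loss().backward()     # second pass: the new gradient is added to the old one
print(torch.allclose(W.grad, 2 * first_grad))

# Reset the accumulated gradients in place before the next backward pass
W.grad.zero_()
b.grad.zero_()

compute_loss().backward()
print(torch.allclose(W.grad, first_grad))
```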

# Useful features

### Skipping history tracking with `torch.no_grad()`

After you have trained a model, you often just want to use it without computing gradients.
Building a computation graph for every operation would be wasteful if you don't need it.
You can skip this history tracking by wrapping your code in a `with torch.no_grad():` block.

In [None]:
x = torch.randn(3, requires_grad=True)
print("x.requires_grad : ", x.requires_grad)

y = (x ** 2)
print("y.requires_grad : ", y.requires_grad)

with torch.no_grad():
    y = (x ** 2)
    print("y.requires_grad : ", y.requires_grad)

Any tensor computed by an operation inside the `no_grad` context will have `requires_grad == False`.
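
One nuance worth checking: `no_grad` affects the results of operations, but a tensor created inside the block with an explicit `requires_grad=True` keeps that flag, and the inputs themselves are left unchanged. A small sketch (the tensor `w` is introduced here for illustration):

```python
import torch

x = torch.randn(3, requires_grad=True)

with torch.no_grad():
    y = x ** 2                               # result of an operation: not tracked
    w = torch.zeros(3, requires_grad=True)   # explicitly requested: the flag is kept

print(y.requires_grad, y.grad_fn)   # False None
print(w.requires_grad)              # True
print(x.requires_grad)              # True, the input is not modified
```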

### Dropping history with `.detach()`

Some tensors are computed from others, but you may want to treat them as constants with no computation history (such tensors are called leaf variables). For that, you can use the `.detach()` method.

In [None]:
A = torch.rand(1,2, requires_grad=True)
B = A.mean()

print("B : ", B)
print("B.requires_grad :", B.requires_grad)
print("B.grad_fn :", B.grad_fn)

C = B.detach()
print("\n-- C = B.detach() -- \n")

print("C : ", C)
print("C.requires_grad :", C.requires_grad)
print("C.grad_fn :", C.grad_fn)
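
To see what "treating a tensor as a constant" means for the gradients, the sketch below computes the same expression twice, once with full history and once with one factor detached. The tensors `a` and `c` and the names `grad_full` and `grad_detached` are introduced here for illustration.

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)

# Without detach: out = sum(a * 2a) = sum(2 a^2), so d(out)/da = 4a
out = (a * (2 * a)).sum()
out.backward()
grad_full = a.grad.clone()

a.grad = None  # reset before the second experiment

# With detach: the factor 2a is treated as a constant c, so d(out)/da = c
c = (2 * a).detach()
out = (a * c).sum()
out.backward()
grad_detached = a.grad.clone()

print(grad_full)       # tensor([4., 8.])
print(grad_detached)   # tensor([2., 4.])
```

No gradient flows back through `c`, so only the direct dependence on `a` contributes.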

---


# Advanced topics

### Leaves vs Nodes

*Advanced*

PyTorch's autograd mechanism differentiates between two types of tensors:
- __node variables__ are the result of a PyTorch operation
- __leaf variables__ are directly created by a user

We can use the `.is_leaf` property to differentiate between the two types.

In [None]:
A = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
B = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True) + 2  # B is the result of an operation (+)
C = 5 * A  # C is the result of an operation (*)
print("A.is_leaf :", A.is_leaf)
print("B.is_leaf :", B.is_leaf)
print("C.is_leaf :", C.is_leaf)

When `.backward()` is called, only the **leaf variables** have their gradients stored in their `.grad` attribute.
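
One detail the two definitions do not spell out: by PyTorch's convention, a tensor that does not require gradients is reported as a leaf even when it is the result of an operation, because no history is recorded for it. A short sketch (the tensor names are illustrative):

```python
import torch

a = torch.rand(2)                       # created by the user, no gradient tracking
b = a + 1                               # result of an operation, but nothing is tracked
w = torch.rand(2, requires_grad=True)   # created by the user, tracked
v = w + 1                               # result of a tracked operation

print(a.is_leaf, b.is_leaf)   # True True
print(w.is_leaf, v.is_leaf)   # True False
```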

### Differentiating w.r.t. intermediate values: `.retain_grad()`

*Advanced*

When doing the backward pass, Autograd computes the gradient of the output with respect to every intermediate variable in the computation graph. However, by default, only the gradients of variables that were **created by the user** (leaf) and **have `requires_grad` set to `True`** are saved.

Indeed, most of the time when training a model, you only need the gradient of a loss w.r.t. your model parameters (which are leaf variables).

In [None]:
A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()

B = 5 * (A + 3)
C = B.mean()

print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)

In [None]:
A = torch.Tensor([[1, 2], [3, 4]])
A.requires_grad_()

B = 5 * (A + 3)
B.retain_grad()  # <----- This line lets us access the gradient w.r.t. B after the backward pass
C = B.mean()


print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)
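
As an alternative sketch, `torch.autograd.grad` returns the requested gradients directly instead of storing them in `.grad`, which also works for intermediate tensors such as `B`. This is not the approach used in this notebook, only a possible variation. The names `grad_B` and `grad_A` are introduced here for illustration.

```python
import torch

A = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
B = 5 * (A + 3)
C = B.mean()

# Ask for dC/dB and dC/dA directly; the results are returned, not stored in .grad
grad_B, grad_A = torch.autograd.grad(C, [B, A])

print(grad_B)   # each entry is 1/4, since C is the mean of four values
print(grad_A)   # each entry is 5/4, by the chain rule through B = 5 * (A + 3)
print(A.grad)   # None, because nothing was accumulated into .grad
```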

### Inspecting PyTorch's computation graph

*Advanced*

You can explore how PyTorch keeps track of history by inspecting the `tensor.grad_fn` attribute:

In [None]:
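# Rebuild the first example: `x` and `y` have been reassigned since then
x = torch.tensor([2.0], requires_grad=True)
z = x * x + 4
print(z.grad_fn)
print(z.grad_fn.next_functions[0][0])
print(z.grad_fn.next_functions[0][0].next_functions[0][0])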

Each value has a `grad_fn` corresponding to the operation that produced the value. 
Each operation's `grad_fn` points to its inputs through `next_functions`.
For each input, `next_functions` contains a tuple of the input's `grad_fn` and, if the operation had multiple outputs, an index of the relevant output.

In [None]:
# In our example, the final `add` operation has two inputs:
# - The first is the output of `multiplication`.
# - The second is a constant `4` for which we don't require a gradient.
z.grad_fn.next_functions
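
Following `next_functions` recursively gives a view of the whole graph. The helper `walk` below is a hypothetical utility written for this illustration, not a PyTorch function; it collects the class name of each node and skips inputs that have no `grad_fn` (such as the constant `4`).

```python
import torch

def walk(fn, depth=0):
    """Return an indented list of node names reachable from `fn`."""
    if fn is None:
        # Inputs that do not require gradients have no node in the graph
        return []
    lines = ["  " * depth + type(fn).__name__]
    for next_fn, _ in fn.next_functions:
        lines.extend(walk(next_fn, depth + 1))
    return lines

x = torch.tensor([2.0], requires_grad=True)
z = x * x + 4

lines = walk(z.grad_fn)
print("\n".join(lines))
```

The leaf tensor `x` appears twice at the bottom of the tree, once for each operand of the multiplication.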
