# Credits

This notebook is heavily influenced by, and in parts copied from, https://github.com/pytorch/tutorials

# Autograd: automatic differentiation

Central to all neural networks in PyTorch is the `autograd` package.
Let’s first briefly visit this, and we will then go to training our first neural network.

The `autograd` package **provides automatic differentiation for all operations on Tensors**.
It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.
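
As a minimal sketch of what "define-by-run" means in practice, the example below uses ordinary Python control flow to decide which operations are recorded. The function name `forward` and the input values are made up for illustration; they are not part of the original tutorial.

```python
import torch

def forward(x):
    # Ordinary Python control flow decides which operations get recorded.
    if x.sum() > 0:
        return (x ** 2).sum()   # gradient w.r.t. x: 2 * x
    return (3 * x).sum()        # gradient w.r.t. x: 3

a = torch.tensor([1.0, 2.0], requires_grad=True)
forward(a).backward()
print(a.grad)   # tensor([2., 4.])

b = torch.tensor([-1.0, -2.0], requires_grad=True)
forward(b).backward()
print(b.grad)   # tensor([3., 3.])
```

The same function produces two different graphs, and therefore two different backward passes, depending on the data it receives.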

Let us see this in simpler terms with some examples.

## 1. Tensor

`torch.Tensor` is the central class of the package. Setting the attribute `.requires_grad` to `True` will make the tensor "record" all operations on it. When you finish your computation you can call `.backward()` and have all the gradients computed automatically. The gradient for this tensor will be accumulated into the `.grad` attribute.
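
The word "accumulated" matters: a second backward pass adds to `.grad` rather than overwriting it. The following sketch illustrates this with a hypothetical tensor `w` (the name and values are assumptions chosen for the example).

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()
print(first)    # tensor([3., 3.])

# A second backward pass adds to the stored gradient instead of replacing it.
(w * 3).sum().backward()
second = w.grad.clone()
print(second)   # tensor([6., 6.])

# Reset the accumulated gradient in place before the next computation.
w.grad.zero_()
print(w.grad)   # tensor([0., 0.])
```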

![autograd.Variable](https://github.com/DeepLearningDTU/02456-2025/tree/master/static_files/autograd-variable.png)

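- `data`: the values stored in the tensor
- `grad`: the gradient accumulated for this tensor after `.backward()`
- `creator`: the `Function` that produced the tensor (exposed as `.grad_fn` in current PyTorch)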

One more class is central to the autograd implementation: `Function`.

`Tensor` and `Function` are interconnected and build up an acyclic graph that encodes a complete history of the computation. Each tensor has a `.grad_fn` attribute that references the `Function` that created it (except for tensors created by the user, whose `grad_fn` is `None`).
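
The graph described above can be inspected directly. This is only a sketch: it relies on the `is_leaf` attribute of tensors and the `next_functions` attribute of graph nodes, and the exact objects printed in the loop may differ between PyTorch versions.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # created by the user: a leaf
y = x + 2                                  # created by an operation

print(x.is_leaf, x.grad_fn)      # True None
print(y.is_leaf)                 # False
print(type(y.grad_fn).__name__)  # AddBackward0

# Each graph node links to the nodes that produced its inputs.
for parent, input_index in y.grad_fn.next_functions:
    print(parent, input_index)
```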

If you want to compute the derivatives, you can call `.backward()` on a `Tensor`. If the `Tensor` is a scalar (i.e. it holds a single element), you don’t need to pass any arguments to `backward()`. If it has more elements, you need to pass a `gradient` argument: a tensor of matching shape.
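
The difference between the scalar and the non-scalar case can be sketched as follows. The flag `failed` is a helper introduced only for this example.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2            # non-scalar: shape (3,)

failed = False
try:
    y.backward()     # no `gradient` argument for a non-scalar tensor
except RuntimeError as err:
    failed = True
    print(f"backward() failed: {err}")

# Passing a tensor with the same shape as y makes the call valid.
y.backward(gradient=torch.ones_like(y))
print(x.grad)        # tensor([2., 2., 2.])
```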

In [17]:
import torch

Create a tensor

In [18]:
x = torch.ones(2, 2, requires_grad=True) # requires_grad=True to track all operations on x
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


Do a tensor operation:

In [19]:
y = x + 2 # this operation is tracked
print(y)
# Each tensor has a `.grad_fn` attribute that references a `Function` that has created the `Tensor`

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


- Each tensor has a `.grad_fn` attribute that references a `Function` that has created the `Tensor`
    - Except if the tensor was created by the user. In this case, `grad_fn` is `None`.
    - But `y` was created as a result of an operation, so it has a `grad_fn`.

In [20]:
print(y.grad_fn)
print(f"The type of y.grad_fn is {type(y.grad_fn)}, a subclass of torch.autograd.Function")

<AddBackward0 object at 0x1183ac6a0>
The type of y.grad_fn is <class 'AddBackward0'>, a subclass of torch.autograd.Function


Do more operations on y

In [21]:
z = y * y * 3
out = z.mean()

print(z)
print(f"The operation that created z is {z.grad_fn}")
print(out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>)
The operation that created z is <MulBackward0 object at 0x10ed71750>
tensor(27., grad_fn=<MeanBackward0>)


# Assignments

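1. Create a Tensor of size (5, 5) that `requires_grad`
2. Sum the values in the Tensor
3. Print the operation that created the sum
4. Backpropagate from the sum and print the gradient of the Tensor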

In [22]:
# 1.
my_tensor = torch.rand(5, 5, requires_grad=True)

# 2. sum all values
my_tensor_sum = my_tensor.sum()
# my_tensor_sum = sum of all 25 entries x_ij of my_tensor
# -> d(my_tensor_sum)/d(x_ij) = 1 for every entry, so the gradient is a 5x5 matrix of ones

# 3. print the history
print(f"The computation that created my_tensor_sum is {my_tensor_sum.grad_fn}, a subclass of torch.autograd.Function")

# 4. Backprop
my_tensor_sum.backward()

print(my_tensor.grad) # print the gradients d(my_tensor_sum)/d(my_tensor)

The computation that created my_tensor_sum is <SumBackward0 object at 0x10fefd6c0>, a subclass of torch.autograd.Function
tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])


## 2. Gradients

Let’s backprop now. Because `out` contains a single scalar, `out.backward()` is equivalent to `out.backward(torch.tensor(1.0))`.

In [23]:
out.backward()

# out = z.mean() = (y*y*3).mean() = ( (x+2)*(x+2)*3 ).mean()

Print gradients d(out)/dx

In [24]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


You should have a matrix of `4.5`. Let’s denote the tensor `out` with $o$.

We have:
- $o = \frac{1}{4}\sum_i z_i$,
- $z_i = 3(x_i+2)^2$
- $z_i\bigr\rvert_{x_i=1} = 27$.

Only $z_i$ depends on $x_i$, so $\frac{\partial o}{\partial x_i} = \frac{1}{4} \frac{\partial z_i}{\partial x_i} = \frac{1}{4} \cdot 6 (x_i+2) = \frac{3}{2}(x_i+2)$, and therefore $\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
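
The value `4.5` can also be checked numerically without autograd. The sketch below uses a central finite difference in plain Python; the helper `out_fn` and the step size `eps` are assumptions made for this check.

```python
def out_fn(x):
    # o = mean of z_i, with z_i = 3 * (x_i + 2)^2
    return sum(3 * (xi + 2) ** 2 for xi in x) / len(x)

x = [1.0, 1.0, 1.0, 1.0]   # the four entries of the 2x2 tensor of ones
eps = 1e-4

numeric_grad = []
for i in range(len(x)):
    plus = x.copy()
    plus[i] += eps
    minus = x.copy()
    minus[i] -= eps
    # Central difference approximates d(o)/d(x_i).
    numeric_grad.append((out_fn(plus) - out_fn(minus)) / (2 * eps))

print([round(g, 4) for g in numeric_grad])   # [4.5, 4.5, 4.5, 4.5]
```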

You can do many crazy things with autograd!

In [77]:
x = torch.randn(3, requires_grad=True) # normal distribution N(0,1)

y = x * 2
# y carries its values (.data), its gradient (.grad) and its creator (.grad_fn)
# print(f"data: {y.data}, grad: {y.grad}, creator: {y.grad_fn}")

# keep doubling y until its norm is larger than 1000
while y.data.norm() < 1000:
    y = y * 2

print(y)

tensor([1079.9368, -553.6485,  415.8031], grad_fn=<MulBackward0>)


In [78]:
gradients = torch.tensor([0.1, 1.0, 0.0001])  # upstream gradient, same shape as y

y.retain_grad()  # keep the gradient of the non-leaf tensor y

y.backward(gradients)

# By default, only leaf tensors (created by the user with requires_grad=True,
# not the result of an operation) have their .grad populated after backward().
# y is the result of operations (y = x * 2, etc.), so without the y.retain_grad()
# call above, y.grad would be None: gradients of non-leaf tensors are not kept, to save memory.

print(f"Gradient dy/dx at x = {x.data} is {x.grad}")
print(f"Gradient dy/dy at y = {y.data} is {y.grad}")  # This will be None unless y.retain_grad() was called before backward()

Gradient dy/dx at x = tensor([ 2.1093, -1.0813,  0.8121]) is tensor([5.1200e+01, 5.1200e+02, 5.1200e-02])
Gradient dy/dy at y = tensor([1079.9368, -553.6485,  415.8031]) is tensor([1.0000e-01, 1.0000e+00, 1.0000e-04])
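
To see where the values in `x.grad` come from, the sketch below repeats the doubling loop with fixed inputs instead of `torch.randn` (an assumption, so the result is reproducible) and counts the doublings. Each element of `y` is `2**doublings` times the matching element of `x`, so `x.grad` should be the vector passed to `backward()` scaled by that factor.

```python
import torch

x = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)  # fixed values instead of randn
y = x * 2
doublings = 1
while y.detach().norm() < 1000:
    y = y * 2
    doublings += 1

v = torch.tensor([0.1, 1.0, 0.0001])
y.backward(v)

# The Jacobian dy/dx is diagonal with entries 2**doublings,
# so x.grad is v scaled by that factor.
print(doublings)                                    # 9
print(torch.allclose(x.grad, v * 2 ** doublings))   # True
```

In the run shown above the factor is also `2**9 = 512`, which matches `5.12e+01 = 0.1 * 512`.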


<details>

<summary>Click to see the ChatGPT explanation</summary>

Let’s unpack this carefully, step by step. Two subtly different things are involved:

1. The derivative $\frac{dy}{dx}$ (the usual thing we want).
2. The derivative $\frac{dy}{dy}$, which sounds strange but makes sense in the context of backpropagation.

⸻

1. How $\frac{dy}{dx}$ is calculated

When you do:

```python
x = torch.randn(3, requires_grad=True)
y = x * 2
```

PyTorch builds a computation graph.

- The node `y` knows that it was produced by multiplying `x` by 2.
- The gradient rule stored in `y.grad_fn` is $\frac{\partial y}{\partial x} = 2$.

Now, if later you call `y.backward(gradients)`, the chain rule is applied:
$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$.

So your `x.grad` stores exactly this.

⸻

2. What about $\frac{dy}{dy}$?

Here’s the trick: in backpropagation, the seed gradient is the derivative of a tensor with respect to itself.

Formally:
$\frac{\partial y}{\partial y} = 1$.

This is why, if `y` is a scalar and you just do:

```python
y.backward()
```

PyTorch treats it as
$\frac{\partial y}{\partial y} = 1$,
so it starts propagating a gradient of 1 backwards through the computation graph. For a non-scalar `y`, the same call raises an error.

If you instead pass an explicit vector of upstream gradients:

```python
y.backward(gradients)
```

you are telling PyTorch:
$\frac{\partial L}{\partial y} = \text{gradients}$.

So PyTorch doesn't assume $\frac{dy}{dy} = 1$, but instead uses the vector you gave as the starting point for the chain rule.

⸻

3. Why do we need gradients?

Because in your case `y` is not a scalar (it’s a vector of size 3).

- For scalars, $\frac{dy}{dy} = 1$ is unambiguous.
- For vectors, there isn't a single gradient: $\frac{dy}{dx}$ is a full Jacobian matrix (and $\frac{dy}{dy}$ the identity matrix), which autograd does not build explicitly. Instead, it multiplies an upstream gradient vector with the Jacobian.

So PyTorch requires you to provide the upstream gradient (a vector of the same shape as `y`) when you call `.backward()`. That’s what you did with `gradients = torch.FloatTensor([0.1, 1.0, 0.0001])`.

This acts like: “Suppose $L = \text{gradients}^T y$. Compute $\nabla_x L$.”

⸻

4. Why `retain_grad()` is needed

- `x` is a leaf tensor (created by you with `requires_grad=True`). PyTorch keeps its `.grad`.
- `y` is a non-leaf tensor (the result of operations). PyTorch does not keep its `.grad` by default, to save memory.
- If you want to see gradients for `y`, you must call `y.retain_grad()` before `.backward()`.

⸻

✅ Summary:

- $\frac{dy}{dy} = 1$ for scalars, but for tensors PyTorch requires you to provide the upstream gradient.
- That’s why you pass `gradients` into `.backward()`.
- `retain_grad()` is needed for non-leaf tensors if you want to inspect their `.grad`.


</details>
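
The reading of the `gradient` argument as "differentiate the scalar $L = v^T y$" can be checked with a short sketch. The names `v`, `x1`, `x2` and the input values are assumptions chosen for this example.

```python
import torch

v = torch.tensor([0.1, 1.0, 0.0001])

# Route 1: pass v as the upstream gradient.
x1 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y1 = x1 * 2
y1.backward(v)

# Route 2: build the scalar L = v . y explicitly and call backward() on it.
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y2 = x2 * 2
L = (v * y2).sum()
L.backward()

print(torch.allclose(x1.grad, x2.grad))   # True
```

Both routes give `2 * v`, since every element of `y` is twice the matching element of `x`.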

**Read later** \
*Documentation* \
`Tensor`: https://pytorch.org/docs/stable/tensors.html \
`Function`: http://pytorch.org/docs/autograd

# Assignments

1. Define a tensor and set `requires_grad` to `True`
2. Multiply the tensor by 2 and assign the result to a new python variable (i.e. `x = result`)
3. Sum the variable's elements and assign to a new python variable
4. Print the gradients of all the variables
5. Now perform a backward pass on the last variable (NOTE: for each new python variable that you define, call `.retain_grad()` before the backward pass)
6. Print all gradients again
   - what did you notice?

In [None]:
my_tensor = torch.rand(2,3, requires_grad=True)

print(my_tensor)

# multiply by two and assign the result to a new variable
x = my_tensor.multiply(2)

print(x)

# sum the variable's elements
my_sum = x.sum()

print("\n\nBefore backward pass (no gradients computed):\n")

# print the gradients of all the variables
print(f"Gradient of my_sum w.r.t. my_tensor: {my_tensor.grad}")
print(f"Gradient of my_sum w.r.t. x: {x.grad}")
print(f"Gradient of my_sum w.r.t. itself: {my_sum.grad}")

print("\n\nAfter backward pass (gradients computed):\n")

# for each new (non-leaf) variable, call .retain_grad() BEFORE the backward pass
# so that its gradient is kept; my_tensor is a leaf, so its .grad is kept automatically
x.retain_grad()
my_sum.retain_grad()

# perform a backward pass on the last variable
my_sum.backward()

print(f"Gradient of my_sum w.r.t. my_tensor: {my_tensor.grad}")
print(f"Gradient of my_sum w.r.t. x: {x.grad}") # x is not a leaf node, so its gradient is only kept because of retain_grad()
print(f"Gradient of my_sum w.r.t. itself: {my_sum.grad}") # my_sum is not a leaf node either; its gradient w.r.t. itself is 1

tensor([[0.1623, 0.2464, 0.4988],
        [0.2716, 0.6969, 0.1077]], requires_grad=True)
tensor([[0.3245, 0.4928, 0.9976],
        [0.5432, 1.3939, 0.2153]], grad_fn=<MulBackward0>)


Before backward pass (no gradients computed):

Gradient of my_sum w.r.t. my_tensor: None
Gradient of my_sum w.r.t. x: None
Gradient of my_sum w.r.t. itself: None


After backward pass (gradients computed):

Gradient of my_sum w.r.t. my_tensor: tensor([[2., 2., 2.],
        [2., 2., 2.]])
Gradient of my_sum w.r.t. x: tensor([[1., 1., 1.],
        [1., 1., 1.]])
Gradient of my_sum w.r.t. itself: 1.0


