<a href="https://colab.research.google.com/github/ychaulagain/RAG-Anything/blob/main/notebooks/02_autograd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 02 – Autograd (Automatic Differentiation)

## Goals
By the end of this notebook, you will be able to:
- Explain what **autograd** does in PyTorch
- Use `requires_grad` to track gradients
- Understand what `.backward()` computes
- Inspect gradients using `.grad`
- Know why you need to **zero gradients**
- Use `torch.no_grad()` and `detach()` correctly


In [1]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.9.0+cpu


## 1. What is autograd?
**Autograd** is PyTorch's automatic differentiation engine.

When you perform operations on tensors with `requires_grad=True`, PyTorch:
- builds a **computation graph** (records operations)
- can compute derivatives (gradients) using **backpropagation**

In deep learning:
- weights are tensors
- loss is a scalar tensor
- gradients tell us how to update weights to reduce loss

In [2]:
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1

print("x:", x)
print("y:", y)
print("y.requires_grad:", y.requires_grad)
print("y.grad_fn", y.grad_fn)  # shows the function that created y

x: tensor(2., requires_grad=True)
y: tensor(11., grad_fn=<AddBackward0>)
y.requires_grad: True
y.grad_fn <AddBackward0 object at 0x7d3ba928b820>
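
To make the recorded graph more tangible, the sketch below inspects two standard tensor attributes, `is_leaf` and `grad_fn`. The variable names (`x`, `y`, `c`) are only illustrative.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)  # created directly by the user: a leaf tensor
y = x**2 + 3*x + 1                         # produced by tracked operations on x

# Leaf tensors have no grad_fn; results of tracked operations do.
print("x.is_leaf:", x.is_leaf, "| x.grad_fn:", x.grad_fn)
print("y.is_leaf:", y.is_leaf, "| y has grad_fn:", y.grad_fn is not None)

# Operations on tensors that do not require gradients are not tracked at all.
c = torch.tensor(5.0)
print("(c * 2).requires_grad:", (c * 2).requires_grad)
```

Only tensors that depend on something with `requires_grad=True` become part of the graph, which is why `c * 2` carries no gradient information.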


## 2. `.backward()` and gradients
If `y` is a scalar, calling:
```python
y.backward()
```
computes the gradient dy/dx and stores it in `x.grad`.

In [3]:
## Compute the gradient of a scalar
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x + 1  # derivative: dy/dx = 2x + 3

y.backward()
print("y:", y.item())
print("x.grad:", x.grad.item())  # should be 2*2 + 3 = 7


y: 11.0
x.grad: 7.0
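
One way to convince yourself that `.backward()` really computes dy/dx is to compare it with a numerical estimate. The sketch below uses a central finite difference; the helper `f`, the step size `eps`, and the use of `float64` are choices made for this illustration, not part of the original notebook.

```python
import torch

def f(x):
    # Same function as above: f(x) = x^2 + 3x + 1
    return x**2 + 3*x + 1

# float64 keeps the finite-difference estimate accurate
x = torch.tensor(2.0, dtype=torch.float64, requires_grad=True)
f(x).backward()
autograd_grad = x.grad.item()

# Central difference: (f(x + eps) - f(x - eps)) / (2 * eps)
eps = 1e-6
with torch.no_grad():
    numeric_grad = ((f(x + eps) - f(x - eps)) / (2 * eps)).item()

print("autograd:", autograd_grad)
print("numeric: ", numeric_grad)
```

Both values should agree closely with the analytic derivative 2x + 3 = 7 at x = 2.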


## 3. Gradients accumulate
If you call `.backward()` multiple times, gradients are **added** to `.grad`. This is why training loops usually call `optimizer.zero_grad()` once per iteration.

In [4]:
x = torch.tensor(2.0, requires_grad=True)
y = x**2

y.backward()
print("After first backward, x.grad:", x.grad.item()) # 2x = 4

y = x**2
y.backward()
print("After second backward, x.grad:", x.grad.item()) # accumulates -> 8

After first backward, x.grad: 4.0
After second backward, x.grad: 8.0


In [5]:
x.grad.zero_()
print("After zero_, x.grad", x.grad.item())

After zero_, x.grad 0.0
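
The same reset usually happens through an optimizer. The following is a minimal sketch using `torch.optim.SGD` on a single parameter; the toy loss `(2w - 4)^2`, the learning rate, and the number of steps are assumptions chosen only to keep the arithmetic easy to follow.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for step in range(3):
    optimizer.zero_grad()            # clear gradients left over from the previous step
    loss = (w * 2.0 - 4.0) ** 2      # toy loss with its minimum at w = 2
    loss.backward()                  # fills w.grad with d(loss)/dw = 8w - 16
    optimizer.step()                 # w <- w - lr * w.grad
    print(f"step={step} w={w.item():.4f} loss={loss.item():.4f}")
```

If the `optimizer.zero_grad()` call were removed, each step would use the sum of all gradients computed so far instead of the current one.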


## 4. Autograd with vectors (and why loss is usually a scalar)
If the output is **not scalar**, PyTorch needs to know how to combine values to compute gradients.

Most training uses a **scalar loss** (e.g., mean loss), so `.backward()` works naturally.

In [6]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x**2

# Make it scalar using sum (or mean) so backward() is defined
loss = y.sum()
loss.backward()

print("x:", x)
print("y:", y)
print("loss:", loss)
print("x.grad:", x.grad) # derivative of sum(x^2) = 2x

x: tensor([1., 2., 3.], requires_grad=True)
y: tensor([1., 4., 9.], grad_fn=<PowBackward0>)
loss: tensor(14., grad_fn=<SumBackward0>)
x.grad: tensor([2., 4., 6.])


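### Backward on a non-scalar tensor

If `y` is not a scalar, pass a `gradient` tensor with the same shape as `y`. PyTorch then computes a vector-Jacobian product, which equals the gradient of `(y * gradient).sum()` with respect to `x`.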

In [8]:
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x**2 # vector

# Provide grad_output (same shape as y)
grad_output = torch.tensor([1.0, 1.0, 1.0])
y.backward(gradient=grad_output)


print("y:", y)
print("x.grad:", x.grad) # still 2x because grad_output is ones

y: tensor([1., 4., 9.], grad_fn=<PowBackward0>)
x.grad: tensor([2., 4., 6.])
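
With a `gradient` of all ones the result looks the same as `y.sum().backward()`, so the role of the argument is easy to miss. The sketch below uses non-uniform weights (an arbitrary choice for illustration) and also shows that omitting the argument on a vector raises a `RuntimeError`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x**2

# Without a gradient argument, backward() on a non-scalar tensor fails.
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True
print("backward() without gradient raised:", raised)

# Weighted backward: each dy_i/dx_i = 2 * x_i is scaled by weights[i].
y = x**2  # rebuild the graph for a clean backward pass
weights = torch.tensor([0.1, 1.0, 10.0])
y.backward(gradient=weights)
print("x.grad:", x.grad)

# Equivalent scalar formulation: gradient of sum(weights * x^2).
x2 = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
((x2**2) * weights).sum().backward()
print("same as weighted sum:", torch.allclose(x.grad, x2.grad))
```

Here `x.grad` equals `weights * 2 * x`, so the `gradient` argument acts as a per-element weight on the output.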


## 5. `detach()` vs `torch.no_grad()`

### `detach()`
Creates a new tensor that shares the same data but **stops tracking gradients**.

### `torch.no_grad()`
A context manager that temporarily disables gradient tracking.
Used during **inference/evaluation** to save memory and compute.

In [9]:
## detach example
x = torch.tensor(2.0, requires_grad=True)
y = x**2

y_detached = y.detach()
print("y.requires_grad:", y.requires_grad)
print("y_detached.requires_grad:", y_detached.requires_grad)

y.requires_grad: True
y_detached.requires_grad: False


In [10]:
## no_grad example
x = torch.tensor(2.0, requires_grad=True)
y = x**2

with torch.no_grad():
  y = x**2 + 5
print("y:", y)
print("y.requires_grad:", y.requires_grad) # False, because no_grad block


y: tensor(9.)
y.requires_grad: False
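
The two examples above only check `requires_grad`. As a rough sketch of what `detach()` means for an actual gradient, the code below multiplies a detached value back into a computation; the names `y_const`, `grad_detached`, and `grad_full` are illustrative.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2

# detach() returns a tensor that shares storage with y but is cut off from the graph.
y_const = y.detach()
print("same storage:", y_const.data_ptr() == y.data_ptr())

# No gradient flows through the detached branch, so y_const acts as the constant 4.
z = y_const * x
z.backward()
grad_detached = x.grad.item()   # d(4 * x)/dx = 4
print("x.grad with detach:", grad_detached)

# Without detach, the whole expression x^2 * x = x^3 is differentiated.
x.grad = None
z_full = (x**2) * x
z_full.backward()
grad_full = x.grad.item()       # d(x^3)/dx = 3x^2 = 12
print("x.grad without detach:", grad_full)
```

In short, `detach()` cuts one tensor out of the graph, while `torch.no_grad()` turns tracking off for every operation inside the block.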


## 6. Turning gradient tracking on/off

You can enable gradient tracking on an existing tensor using:
```python
x.requires_grad_()
```
Calling `x.requires_grad_(False)` on a leaf tensor turns tracking off again.

In [12]:
### requires_grad_ example
x = torch.tensor([1.0, 2.0, 3.0])
print("Before:", x.requires_grad)

x.requires_grad_()
print("After:", x.requires_grad)

y = (x ** 2).mean()
y.backward()
print("x.grad:", x.grad)

Before: False
After: True
x.grad: tensor([0.6667, 1.3333, 2.0000])
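
The cell above turns tracking on. For the opposite direction, here is a small sketch that switches tracking off for one of two tensors, similar in spirit to freezing a parameter. The tensors `w` and `b` and the toy loss are made up for this example.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

# Turn tracking off for w; this in-place switch is only allowed on leaf tensors.
w.requires_grad_(False)

loss = (w * 3).sum() + b
loss.backward()

print("w.requires_grad:", w.requires_grad)
print("w.grad:", w.grad)   # no gradient is computed for w
print("b.grad:", b.grad)   # d(loss)/db = 1
```

Because `w` no longer requires gradients, autograd treats it as a constant and only `b` receives a gradient.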


## 7. Mini example: gradient descent intuition
We'll minimize
\[
  f(w) = (w - 3)^2
\]

The minimum is at `w = 3`.
We'll use autograd to compute gradients and update `w`.

In [13]:
w = torch.tensor(0.0, requires_grad=True)
lr = 0.01

for step in range(10):
  loss = (w - 3) ** 2
  loss.backward()

  with torch.no_grad():
    w -= lr * w.grad # gradient descent step

  w.grad.zero_() # reset gradients
  print(f"step={step:02d} w={w.item():.4f} loss={loss.item():.4f}")

step=00 w=0.0600 loss=9.0000
step=01 w=0.1188 loss=8.6436
step=02 w=0.1764 loss=8.3013
step=03 w=0.2329 loss=7.9726
step=04 w=0.2882 loss=7.6569
step=05 w=0.3425 loss=7.3537
step=06 w=0.3956 loss=7.0625
step=07 w=0.4477 loss=6.7828
step=08 w=0.4988 loss=6.5142
step=09 w=0.5488 loss=6.2562
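
The slow progress in the output above can be checked by hand. Since f'(w) = 2(w - 3), each update maps `w - 3` to `(1 - 2*lr) * (w - 3)`, so after `k` steps `w = 3 + (1 - 2*lr)**k * (w0 - 3)`. The sketch below reruns the same loop and compares it with this closed form; `w0`, `steps`, and `closed_form` are names introduced only for this check.

```python
import torch

w0, lr, steps = 0.0, 0.01, 10

w = torch.tensor(w0, requires_grad=True)
for _ in range(steps):
    loss = (w - 3) ** 2
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad      # gradient descent step
    w.grad.zero_()            # reset gradients

# Closed form of the recurrence w_{k+1} - 3 = (1 - 2*lr) * (w_k - 3)
closed_form = 3 + (1 - 2 * lr) ** steps * (w0 - 3)

print(f"autograd loop: {w.item():.4f}")
print(f"closed form:   {closed_form:.4f}")
```

With `lr = 0.01` the distance to the minimum shrinks by only 2% per step, which is why `w` is still far from 3 after ten iterations; a larger learning rate (below 1.0 for this function) would converge faster.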


## Common mistakes (and how to avoid them)

1. **Forgetting `requires_grad=True`**
   - The tensor's `.grad` stays `None`; if nothing in the graph requires gradients, `.backward()` raises a `RuntimeError`

2. **Calling `.backward()` repeatedly without resetting grads**
   - Gradients accumulate, so each update uses the sum of old and new gradients and can become far too large

3. **Trying `.backward()` on a non-scalar tensor**
   - Use `.sum()` / `.mean()` or pass `gradient=...`

4. **Tracking gradients during inference**
   - Wrap inference with `torch.no_grad()` to save memory and computation
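
Two of these mistakes are easy to reproduce in a few lines. The snippet below is a sketch with made-up tensors (`x`, `w`, `pred`) rather than a real model.

```python
import torch

# Mistake 1: nothing in the graph requires gradients.
x = torch.tensor(2.0)            # requires_grad defaults to False
y = x**2
try:
    y.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print("backward() failed:", error_message is not None)

# Mistake 4 (the fix): inside no_grad(), results are not tracked
# even when the inputs require gradients.
w = torch.tensor(2.0, requires_grad=True)
with torch.no_grad():
    pred = w * 3
print("pred.requires_grad:", pred.requires_grad)
```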


## Summary

You learned:
- What autograd is and why it matters
- How `requires_grad` controls tracking
- How `.backward()` computes gradients and stores them in `.grad`
- Why gradients accumulate and how to reset them
- How to handle vector outputs (reduce to scalar or pass gradient)
- The difference between `detach()` and `torch.no_grad()`

Next: **03 – `nn.Module` and building neural networks**
