[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/xiangshiyin/llm-from-scratch/blob/main/torch_from_scratch/autograd.ipynb)

References:
- https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
- https://www.youtube.com/watch?v=MswxJw-8PvEm

In [1]:
import torch
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

# Example usage in PyTorch

In [2]:
prediction = model(data) # forward pass

In [3]:
loss = (prediction - labels).sum()
loss.backward() # backward pass

In [4]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
optim.step()  # gradient descent step
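
The three cells above make up one training step. A minimal sketch of the same step in a single block follows. It assumes a small `nn.Linear` layer as a hypothetical stand-in for the ResNet so it runs without downloading weights, and it adds a `zero_grad()` call, which is not in the cells above but is usually needed because `.grad` accumulates across `backward()` calls.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the ResNet above: a single linear layer.
model = nn.Linear(4, 2)
data = torch.rand(8, 4)
labels = torch.rand(8, 2)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Keep a copy of the weights to check that the step changes them.
before = model.weight.detach().clone()

optimizer.zero_grad()                # clear gradients left over from earlier steps
prediction = model(data)             # forward pass
loss = (prediction - labels).sum()   # same toy loss as above
loss.backward()                      # backward pass: fills .grad on each parameter
optimizer.step()                     # update parameters from the stored gradients

changed = not torch.equal(before, model.weight)
print(changed)
```

In a real training loop this block would sit inside a loop over batches, with `zero_grad()` called once per iteration.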

# Differentiation in Autograd

Assume the tensors $a$, $b$ and $Q$ are related as follows:
$$
Q = 3a^3 - b^2
$$

Let $a$ and $b$ be the parameters of a neural network and $Q$ the error. The gradients of $Q$ with respect to $a$ and $b$ are:

$$
\frac{\partial Q}{\partial a} = 9a^2
$$
$$
\frac{\partial Q}{\partial b} = -2b
$$

In [5]:
import torch

# We create two tensors a and b with requires_grad=True. This signals to autograd that every operation on them should be tracked.
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

Q = 3*a**3 - b**2

In [6]:
a.shape, b.shape, Q.shape

(torch.Size([2]), torch.Size([2]), torch.Size([2]))

In [7]:
# When we call .backward() on Q, autograd calculates these gradients and stores them in the respective tensors' .grad attribute.
# We need to pass a gradient argument explicitly in Q.backward() because Q is a vector, not a scalar.
# `gradient` is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself, i.e. dQ/dQ = 1.

external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

# Gradients are now deposited in a.grad and b.grad
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])


In [None]:
a.grad, b.grad

(tensor([36., 81.]), tensor([-12.,  -8.]))
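
An equivalent way to get the same gradients is to reduce `Q` to a scalar first, so that `backward()` needs no explicit `gradient` argument. The sketch below rebuilds `a`, `b` and `Q` from scratch so that gradients do not accumulate on top of the ones computed above; the helper names `grad_a_ok` and `grad_b_ok` are illustrative only.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

Q = 3 * a**3 - b**2

# Summing Q gives a scalar, so backward() can be called without arguments.
# This matches passing a vector of ones as `gradient`.
Q.sum().backward()

# Compare against the analytic gradients 9a^2 and -2b.
grad_a_ok = torch.allclose(a.grad, 9 * a.detach() ** 2)
grad_b_ok = torch.allclose(b.grad, -2 * b.detach())
print(grad_a_ok, grad_b_ok)
```

Passing a `gradient` vector other than all ones would weight each element of `Q` differently, which is what happens when `Q` feeds into a further computation.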

In [11]:
# The result of an operation requires gradients if at least one of its inputs does.
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients?: {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients?: False
Does `b` require gradients?: True
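
The output above shows that `requires_grad` propagates from inputs to results. Two other common ways to keep a computation out of the graph, not used in this notebook, are the `torch.no_grad()` context manager and `Tensor.detach()`. The following is a small sketch; the names `c` and `d` are illustrative.

```python
import torch

z = torch.rand((5, 5), requires_grad=True)

# Inside torch.no_grad(), results are not tracked even if an input requires gradients.
with torch.no_grad():
    c = z * 2

# detach() returns a tensor that shares data with z but is cut off from the graph.
d = z.detach() + 1

print(c.requires_grad, d.requires_grad)  # False False
```

`z` itself still requires gradients after both operations; only the derived tensors are excluded.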


# Exclusion from the DAG

In [12]:
from torch import nn, optim

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

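# Replace the classifier; the new layer's parameters require gradients by default
model.fc = nn.Linear(512, 10)
# Optimize only the classifier: all parameters are registered with the optimizer,
# but only model.fc's parameters receive gradients and get updated
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)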
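
To check that the freezing behaves as intended, one option is to list the parameters that still require gradients and to confirm that frozen layers get no `.grad` after a backward pass. The sketch below assumes a small `nn.Sequential` as a hypothetical stand-in for the pretrained ResNet, so it runs without torchvision or a weight download. It also passes only the trainable parameters to the optimizer, a variation on the cell above that makes the intent explicit.

```python
import torch
from torch import nn, optim

# Hypothetical stand-in for a pretrained backbone plus classifier head.
model = nn.Sequential(
    nn.Linear(8, 16),   # plays the role of the frozen backbone
    nn.ReLU(),
    nn.Linear(16, 4),   # plays the role of the original classifier
)

# Freeze every existing parameter.
for param in model.parameters():
    param.requires_grad = False

# Replace the head; parameters of a new module require gradients by default.
model[2] = nn.Linear(16, 10)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['2.weight', '2.bias']

# Hand only the trainable parameters to the optimizer.
optimizer = optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2, momentum=0.9
)

loss = model(torch.rand(3, 8)).sum()
loss.backward()

# Frozen layers never receive a gradient.
frozen_has_grad = model[0].weight.grad is not None
print(frozen_has_grad)  # False

optimizer.step()
```

With the real model, the same check would use `model.fc` in place of `model[2]`.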