<a href="https://colab.research.google.com/github/sudhanshumukherjeexx/PyTorch-Playground/blob/main/3_PyTorch_Autograd%2C_Gradient_Tracking_and_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Programmed by Sudhanshu Mukherjee
* 10-21-2024: Colab Notebook
* 10-29-2024: Notebook updated with text

### This notebook is inspired by and credits [PyTorch’s Autograd Tutorial](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html). Some of the code presented here is based on that tutorial. However, the explanations and examples in this notebook are tailored to be more beginner-friendly, with simplified language and additional clarifications to help new learners grasp key concepts like `autograd`, `gradient tracking`, and `fine-tuning` using `torch.no_grad()`.

### Forward and Backward Propagation

In neural networks, forward propagation and backward propagation are key processes used during the training of the network.

### 1. Forward Propagation:
- Forward propagation is the process of passing input data through the neural network layers to produce an output (also known as predictions).
#### How it works:
  - The input data is passed into the first layer of the neural network.
  - Each neuron in the layer computes a weighted sum of its inputs (plus a bias) and then applies an activation function to introduce non-linearity.
  - The output from the first layer becomes the input for the next layer.
  - This process continues through all layers until the final output layer produces the prediction (for example, a classification or regression output).

  **The goal is to compute the predicted output for the given input based on the current state of the network's weights.**
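
The steps above can be sketched for a tiny two-layer network. The layer sizes, the ReLU activation, and all variable names here are illustrative assumptions, not part of the ResNet18 example used later in this notebook.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 4 input features, 3 hidden units, 2 outputs
x = torch.rand(1, 4)                       # one input sample
W1, b1 = torch.rand(4, 3), torch.rand(3)   # first layer weights and bias
W2, b2 = torch.rand(3, 2), torch.rand(2)   # output layer weights and bias

# Layer 1: weighted sum of the inputs, then a non-linear activation
hidden = torch.relu(x @ W1 + b1)

# Layer 2: the hidden activations become the input of the next layer
prediction = hidden @ W2 + b2

print(prediction.shape)  # torch.Size([1, 2])
```

In practice, `nn.Linear` and similar layers store the weights and biases for you, but the arithmetic is the same.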

### 2. Backward Propagation:
- Backward propagation (backpropagation) is the process of working out how much each weight contributed to the error (loss) by propagating that error back through the network, so that the weights can then be adjusted.
#### How it works:
  - After forward propagation produces an output, the loss function compares the predicted output with the actual target value to compute the error.
  - Backpropagation computes the gradient of the loss function with respect to each weight in the network using the chain rule of calculus.
  - Starting from the output layer, gradients are propagated back through the network layer by layer. An optimizer then uses these gradients to update the weights of the neurons in all layers.

  **The purpose is to minimize the error by adjusting the weights using optimization techniques like gradient descent.**
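
As a minimal sketch of the chain rule at work, consider a single weight `w`, one input `x`, and a squared-error loss. The names and values are made up for illustration. The gradient computed by hand should match the one autograd stores in `w.grad`.

```python
import torch

x = torch.tensor(2.0)                       # input
target = torch.tensor(10.0)                 # desired output
w = torch.tensor(3.0, requires_grad=True)   # the single weight

prediction = w * x                  # forward pass: 6.0
loss = (prediction - target) ** 2   # squared error: 16.0

loss.backward()                     # backpropagation via autograd

# Chain rule by hand: dloss/dw = dloss/dprediction * dprediction/dw
manual_grad = 2 * (prediction.detach() - target) * x   # 2 * (-4) * 2 = -16

print(w.grad)        # tensor(-16.)
print(manual_grad)   # tensor(-16.)
```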

In [None]:
import torch
from torchvision.models import resnet18, ResNet18_Weights
from torch import nn, optim

In [None]:
model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /root/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|██████████| 44.7M/44.7M [00:00<00:00, 126MB/s]


- The code below runs the forward pass: the input `data` goes through the model to produce a prediction.

In [None]:
prediction = model(data)

- Now, we compare the model's prediction with the actual answer (called labels) to figure out how wrong the prediction is. This measure is called the error (or loss).

- Next, we need to tell the model how to adjust its weights to improve. This process is called **backward propagation** (backpropagation).

- We start backpropagation by calling `.backward()` on the error. This tells the system to automatically calculate how much each parameter of the model contributed to the error. It does this using a feature called **autograd**, which stores these values (called gradients) in each parameter's `.grad` attribute.



In [None]:
loss = (prediction - labels).sum()
loss.backward()

- Here we set up an **optimizer**, which helps the model learn by adjusting its weights.
- We use an optimizer called **SGD** (Stochastic Gradient Descent) with a learning rate of **0.01** and a momentum of **0.9**.
- The learning rate controls how big the updates are when the model adjusts its weights. Momentum carries part of the previous update over into the current one, which smooths the updates and helps the model avoid getting stuck.
- The optimizer is given all the model's parameters so it can update them during training.

In [None]:
# Named `optimizer` so that the `optim` module imported above is not overwritten
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

- We use the optimizer to update the model’s weights by calling `.step()`. This performs one step of gradient descent.
- The optimizer looks at the gradients (the values stored in `.grad`) and adjusts each parameter based on those values. This helps the model improve its predictions by reducing the error.

In [None]:
optimizer.step()
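
To make the update rule concrete, here is a minimal sketch that reproduces two SGD-with-momentum steps by hand on a single made-up parameter and compares the result with `torch.optim.SGD`. It assumes the default form of the update (no dampening, no weight decay, no Nesterov momentum): the momentum buffer is `buf = momentum * buf + grad`, and the parameter moves by `-lr * buf`. The names `p`, `sgd`, `manual_p`, and `buf` are chosen here for illustration.

```python
import torch

lr, momentum = 1e-2, 0.9

p = torch.tensor([1.0, -2.0], requires_grad=True)   # toy parameter
sgd = torch.optim.SGD([p], lr=lr, momentum=momentum)

manual_p = p.detach().clone()       # hand-maintained copy of the parameter
buf = torch.zeros_like(manual_p)    # hand-maintained momentum buffer

for _ in range(2):
    # Update performed by the optimizer
    sgd.zero_grad()
    loss = (p ** 2).sum()
    loss.backward()
    sgd.step()

    # Same update written out by hand
    grad = 2 * manual_p             # gradient of sum(p^2)
    buf = momentum * buf + grad
    manual_p = manual_p - lr * buf

print(torch.allclose(p.detach(), manual_p))  # True
```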

- Now we have everything ready to start training a neural network.
- Next, we will look at how autograd (the automatic gradient calculation) works.

### Differentiation in Autograd

- Let’s see how autograd works to collect gradients (the values needed to update our model's settings)
- We start by creating two tensors, `a` and `b`, with `requires_grad=True`.
- This tells autograd that it needs to keep track of all operations involving these tensors so it can calculate the gradients when we need them.

In [None]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([7., 5.], requires_grad=True)

- Now we create another tensor `S` from `a` and `b`:
$$S = a^9 - 2b^2$$

In [None]:
S = a**9 - 2*b**2

- Let's say `a` and `b` are the parameters of a neural network, and `S` is the error or loss that tells us how wrong the network's predictions are.

- In neural network training, we need to calculate how changing the parameters `a` and `b` affects the error. These changes are called gradients. For example:

- The gradient of `S` with respect to `a` is $$\frac{\partial S}{\partial a} = 9a^8$$ which shows how much the error changes if we adjust `a`.

- The gradient of `S` with respect to `b` is $$\frac{\partial S}{\partial b} = -4b$$ which shows how much the error changes if we adjust `b`.

- When we call `.backward()` on `S`, autograd automatically calculates these gradients for each parameter and stores them in the `.grad` attribute of `a` and `b`.

- Since `S` is a vector rather than a single number, we need to pass a `gradient` argument when calling `S.backward()`. This argument is a tensor of the same shape as `S`. Here it represents the gradient of `S` with respect to itself, which is 1 for every element (the change of something with respect to itself is always 1).

In [None]:
external_grad = torch.tensor([1., 1.])
S.backward(gradient=external_grad)

- Gradients are now deposited at `a.grad` and `b.grad`

In [None]:
print(9*a**8 == a.grad)
print(-4*b == b.grad)

tensor([True, True])
tensor([True, True])
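
Passing a vector of ones as `gradient` has the same effect as first reducing `S` to a single number with `.sum()` and then calling `.backward()` on that number. The sketch below re-creates the tensors under new names (`a2`, `b2`, `S2`, chosen here for illustration) so that the new gradients do not accumulate onto the ones computed above.

```python
import torch

a2 = torch.tensor([2., 3.], requires_grad=True)
b2 = torch.tensor([7., 5.], requires_grad=True)

S2 = a2**9 - 2*b2**2

# A scalar needs no `gradient` argument
S2.sum().backward()

print(torch.allclose(a2.grad, 9 * a2.detach()**8))  # True
print(b2.grad)                                      # tensor([-28., -20.])
```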


### Directed Acyclic Graph (DAG)

_**Note:**_

**In PyTorch, DAGs (Directed Acyclic Graphs) are dynamic. This means that the graph, which records how the model's calculations are done, is built from scratch during every forward pass, as the operations run. Once `.backward()` has calculated the gradients, the graph is thrown away by default, and a new one is populated during the next forward pass.**

**This is important because it allows you to use control flow in your model. For example, you can have `if` statements or loops that change the shape, size, or operations of your neural network on each iteration if needed. PyTorch will still be able to handle it because it builds a fresh graph every time, based on whatever operations happen in that iteration.**
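
A minimal sketch of this behaviour, using a made-up function `forward` with an ordinary Python `if` statement. Which branch runs depends on the input, and autograd differentiates whichever operations were actually executed in that call.

```python
import torch

def forward(x):
    # Ordinary Python control flow decides which operations get recorded
    if x.sum() > 0:
        return (x ** 2).sum()   # gradient is 2 * x
    return (3 * x).sum()        # gradient is 3 for every element

x_pos = torch.tensor([1.0, 2.0], requires_grad=True)
forward(x_pos).backward()
print(x_pos.grad)   # tensor([2., 4.])

x_neg = torch.tensor([-1.0, -2.0], requires_grad=True)
forward(x_neg).backward()
print(x_neg.grad)   # tensor([3., 3.])
```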

---
1. In PyTorch, autograd keeps track of all the operations on tensors that have their `requires_grad` flag set to `True`. This means that PyTorch will automatically calculate gradients for those tensors when needed.

2. However, if you have a tensor that doesn’t need gradients (for example, if it's just an input that you don’t want to change), you can leave or set `requires_grad=False` to exclude it from gradient calculations. This is the default for tensors you create yourself. This way, PyTorch won’t track or calculate gradients for it.

3. Even if you have an operation where only one of the input tensors has `requires_grad=True`, the output of that operation will still require gradients. This is because autograd will continue to track the entire computation to make sure it can compute the gradients for the part that does need them.

In [None]:
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients?: {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients?: False
Does `b` require gradients?: True


- In a neural network (NN), **frozen parameters** are parameters for which gradients are not computed. Freezing parameters means setting `requires_grad=False` on them so that autograd stops tracking operations on them, which can make training faster because fewer computations are needed.

- This is helpful when you’re fine-tuning a model. In fine-tuning, we often use a pre-trained model (a model already trained on a large dataset) and keep most of the model unchanged (frozen). We only update the last few layers, like the classifier, to adapt the model to make predictions on a new task or with new labels.

In [None]:
model = resnet18(weights=ResNet18_Weights.DEFAULT)

# freeze all model parameters
for param in model.parameters():
    param.requires_grad = False

- Let’s say we have a new dataset with 10 labels, and we want to fine-tune the ResNet18 model on this dataset.
- In ResNet18, the classifier that makes predictions is the last layer, called `model.fc`.
- To fine-tune the model, we can replace this last layer with a new Linear layer that has 10 output units (one for each label). This new layer is unfrozen by default, meaning it will be trained and its parameters will be updated during the fine-tuning process.

In [None]:
model.fc = nn.Linear(512, 10)

- Now, in our modified ResNet18 model, all the pre-trained parameters are frozen, meaning they won’t be updated during training. The only part of the model that can learn and update is the new `model.fc` layer that we added.

- This means that the **only parameters that compute gradients** (the ones that get adjusted during training) are the `weight` and `bias` of the `model.fc` layer. All other layers in the model stay the same as they were in the pre-trained version.

- So, during training, the model will focus on fine-tuning just this final classifier layer to adapt to the new dataset with `10` labels.
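
One way to check which parameters are still trainable is to inspect `requires_grad` on each of them. The sketch below uses a small made-up `nn.Sequential` model (`toy_model`) as a stand-in for a pre-trained network, so that it runs quickly without downloading weights. The same checks should apply to the ResNet18 model.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained backbone plus classifier
toy_model = nn.Sequential(
    nn.Linear(8, 4),   # "backbone" layer
    nn.ReLU(),
    nn.Linear(4, 3),   # original "classifier"
)

# Freeze everything
for param in toy_model.parameters():
    param.requires_grad = False

# Replace the classifier; a new layer is trainable by default
toy_model[2] = nn.Linear(4, 10)

trainable = [name for name, p in toy_model.named_parameters() if p.requires_grad]
print(trainable)   # ['2.weight', '2.bias']

# After a backward pass, only the new layer has gradients
toy_model(torch.rand(5, 8)).sum().backward()
print(toy_model[0].weight.grad)         # None
print(toy_model[2].weight.grad.shape)   # torch.Size([10, 4])
```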

In [None]:
# Use the full module path: the name `optim` was reassigned in an earlier cell
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

- Even though we tell the optimizer about all the parameters in the model, the only ones that are actually computing gradients and getting updated during gradient descent are the weights and bias of the new classifier `model.fc`. All the other layers are frozen and not being updated.

- One can also use a special feature in PyTorch called `torch.no_grad()` to temporarily turn off gradient tracking for any operation. This is useful when you don’t want PyTorch to calculate gradients for certain parts of your model (like during evaluation or inference).
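
As a minimal sketch of the effect on a toy tensor (the names `w`, `tracked`, and `untracked` are made up for illustration): a result computed inside `torch.no_grad()` is not connected to the graph, while the input tensor keeps its own `requires_grad` setting.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Normal mode: the result is tracked by autograd
tracked = (w * 2).sum()
print(tracked.requires_grad)     # True

# Inside no_grad: the same computation is not tracked
with torch.no_grad():
    untracked = (w * 2).sum()
print(untracked.requires_grad)   # False
print(untracked.grad_fn)         # None

# The flag on the input tensor itself is unchanged
print(w.requires_grad)           # True
```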

Here’s an example of how you can use `torch.no_grad()` with the ResNet18 model:

In [None]:
input_data = torch.randn(1, 3, 224, 224)  # one random 224x224 image with 3 (RGB) channels

# Unfreeze every parameter for this demonstration
for param in model.parameters():
    param.requires_grad = True

print("\nBefore running anything")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name} grad: {param.grad}")

with torch.no_grad():
    # won't track gradients
    output = model(input_data)


print("\n\nAfter running inside torch.no_grad():")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"{name} grad: {param.grad}")


print(f'\n{output}')


Before running anything
conv1.weight grad: None
bn1.weight grad: None
bn1.bias grad: None
layer1.0.conv1.weight grad: None
layer1.0.bn1.weight grad: None
layer1.0.bn1.bias grad: None
layer1.0.conv2.weight grad: None
layer1.0.bn2.weight grad: None
layer1.0.bn2.bias grad: None
layer1.1.conv1.weight grad: None
layer1.1.bn1.weight grad: None
layer1.1.bn1.bias grad: None
layer1.1.conv2.weight grad: None
layer1.1.bn2.weight grad: None
layer1.1.bn2.bias grad: None
layer2.0.conv1.weight grad: None
layer2.0.bn1.weight grad: None
layer2.0.bn1.bias grad: None
layer2.0.conv2.weight grad: None
layer2.0.bn2.weight grad: None
layer2.0.bn2.bias grad: None
layer2.0.downsample.0.weight grad: None
layer2.0.downsample.1.weight grad: None
layer2.0.downsample.1.bias grad: None
layer2.1.conv1.weight grad: None
layer2.1.bn1.weight grad: None
layer2.1.bn1.bias grad: None
layer2.1.conv2.weight grad: None
layer2.1.bn2.weight grad: None
layer2.1.bn2.bias grad: None
layer3.0.conv1.weight grad: None
layer3.0.bn1.w