<a href="https://colab.research.google.com/github/susant146/PyTorch_Basics_CNNmodels/blob/main/Sl_02_Autograd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 2. Autograd: Automatic Differentiation

The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

``torch.Tensor`` is the central class of the package. If you set its attribute
``.requires_grad`` to ``True``, autograd **starts to track all operations on it**. When
you finish your computation you can call ``.backward()`` and have **all the
gradients computed automatically**. The gradient for this tensor will be
accumulated into its ``.grad`` attribute.

To **stop a tensor from tracking history**, you can call ``.detach()``. It returns
a new tensor that shares the same data but is detached from the computation
history, so future computation on it is not tracked.

To **prevent tracking history (and using memory)**, you can also wrap the code block
in ``with torch.no_grad():``. This is particularly helpful when evaluating a
model, because the model may have trainable parameters with `requires_grad=True`
whose gradients we don't need during evaluation.

In [2]:
import torch

In [10]:
x = torch.randn(3, requires_grad = True)
print("x: ", x)
# Because x requires gradients, PyTorch records every operation on it in a
# computational graph and later computes gradients by traversing that graph backwards.

y = x+2
print('The grad function\t', y) # AddBackward0

z = y*y*2
print('The grad function\t', z) # MulBackward0
z = z.mean()
print('The grad function\t', z) # MeanBackward0


x:  tensor([ 0.3231, -0.5908, -0.2032], requires_grad=True)
The grad function	 tensor([2.3231, 1.4092, 1.7968], grad_fn=<AddBackward0>)
The grad function	 tensor([10.7933,  3.9719,  6.4567], grad_fn=<MulBackward0>)
The grad function	 tensor(7.0739, grad_fn=<MeanBackward0>)


**Grad-Calculation & backward function**

In [11]:
z.backward()  # computes dz/dx and stores it in x.grad
# z = mean(2*(x+2)^2), so dz/dx_i = (4/3)*(x_i+2)
print(x.grad)
# Exercise: set x = torch.randn(3, requires_grad = False) and rerun

tensor([3.0974, 1.8790, 2.3957])
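
The printed gradient can be checked against the closed form. Since `z` is the mean of `2*(x+2)^2` over three elements, each component of the gradient should equal `(4/3)*(x_i+2)`. The following is a minimal sketch of that check; the name `expected` is introduced here only for illustration.

```python
import torch

x = torch.randn(3, requires_grad=True)
z = (2 * (x + 2) ** 2).mean()
z.backward()

# Closed form: dz/dx_i = (1/3) * 2 * 2 * (x_i + 2) = (4/3) * (x_i + 2)
expected = 4.0 / 3.0 * (x.detach() + 2)

print(x.grad)
print(expected)
print(torch.allclose(x.grad, expected))
```

With the values of `x` shown above, `(4/3)*(0.3231+2)` is roughly `3.097`, which agrees with the first entry of the printed gradient.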


In [None]:
# requires_grad = False
x = torch.randn(3, requires_grad = False)
print("x: ", x)
# x does not require gradients, so PyTorch builds no computational graph
# and none of the results below carries a grad_fn.

y = x+2
print('The grad function\t', y) # no grad_fn

z = y*y*2
print('The grad function\t', z) # no grad_fn
z = z.mean()
print('The grad function\t', z) # no grad_fn

z.backward()  # RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
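
To see why the call fails, it can help to inspect the result before calling `backward()`. The sketch below assumes the same computation as the cell above and catches the exception so that the cell finishes; the variable `message` is only illustrative.

```python
import torch

x = torch.randn(3)  # requires_grad defaults to False
z = (2 * (x + 2) ** 2).mean()

# No graph was recorded, so z neither requires grad nor has a grad_fn.
print(z.requires_grad, z.grad_fn)

try:
    z.backward()
except RuntimeError as err:
    message = str(err)
    print("backward() failed:", message)
```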

**Chain Rule**

Mathematically, if you have a vector valued function $\vec{y}=f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix:
\begin{split}J=\left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)\end{split}

Generally speaking, `torch.autograd` is an engine for computing vector-Jacobian products. That is, given any vector $v=\left(\begin{array}{cccc} v_{1} & v_{2} & \cdots & v_{m}\end{array}\right)^{T}$, it computes the product $v^{T}\cdot J$. If $v$ happens to be the gradient of a scalar function $l=g\left(\vec{y}\right)$, that is, $v=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$, then by the chain rule, the vector-Jacobian product is the gradient of $l$ with respect to $\vec{x}$:
\begin{split}J^{T}\cdot v=\left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)\left(\begin{array}{c}
 \frac{\partial l}{\partial y_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial y_{m}}
 \end{array}\right)=\left(\begin{array}{c}
 \frac{\partial l}{\partial x_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial x_{n}}
 \end{array}\right)\end{split}

(Note that $v^{T}\cdot J$ gives a row vector which can be treated as a column vector by taking $J^{T}\cdot v$.)

This property of the vector-Jacobian product makes it convenient to feed external gradients into a model that has a non-scalar output.
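
The relationship between `backward(v)` and the product $v^{T}\cdot J$ can be checked numerically on a small case. The sketch below is illustrative only: the function `f`, the input values, and the vector `v` are assumptions chosen so that the Jacobian is a non-square 2x3 matrix. It builds the full Jacobian with `torch.autograd.functional.jacobian` and compares the explicit product with what `backward(v)` leaves in `.grad`.

```python
import torch
from torch.autograd.functional import jacobian


def f(x):
    # Hypothetical vector-valued function from R^3 to R^2.
    return torch.stack([x[0] * x[1], x[1] + x[2]])


x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([0.5, 2.0])

# Full Jacobian, shape (2, 3): rows are outputs, columns are inputs.
J = jacobian(f, x)
explicit = v @ J  # v^T . J computed explicitly

# The same product via autograd, without ever forming J.
x_req = x.clone().requires_grad_(True)
y = f(x_req)
y.backward(v)

print(J)
print(explicit)
print(x_req.grad)
```

For large models the explicit Jacobian is usually far too big to materialise, which is one reason autograd works with vector-Jacobian products directly.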

In [13]:
x = torch.ones(2, 2, requires_grad=True)
y = x + 2
z = y * y * 3
out = z.mean() # Exercise: remove mean() and re-run
               # Error: grad can be implicitly created only for scalar outputs
out.backward()
print(x.grad)


tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


**Explanation**

Let $out = \frac{1}{4}\sum_i z_i$,  
$z_i = 3(x_i+2)^2$  
and $z_i\bigr\rvert_{x_i=1} = 27$.  
Therefore,  
$\frac{\partial out}{\partial x_i} = \frac{1}{4}\frac{\partial z_i}{\partial x_i} = \frac{1}{4}\cdot 3\cdot 2(x_i+2) = \frac{3}{2}(x_i+2)$,  
 hence  
$\frac{\partial out}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
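
The value 4.5 can also be confirmed without autograd by a central finite difference. This is a plain-Python sketch; the helper `out_fn`, the step size `h`, and the list `numeric_grads` are names introduced here for illustration.

```python
def out_fn(xs):
    # out = (1/4) * sum_i 3 * (x_i + 2)^2 for the four entries of x
    return sum(3 * (x + 2) ** 2 for x in xs) / len(xs)


xs = [1.0, 1.0, 1.0, 1.0]
h = 1e-6  # finite-difference step (an arbitrary small value)

numeric_grads = []
for i in range(len(xs)):
    plus = list(xs)
    minus = list(xs)
    plus[i] += h
    minus[i] -= h
    # Central difference approximation of d(out)/d(x_i)
    numeric_grads.append((out_fn(plus) - out_fn(minus)) / (2 * h))

print(numeric_grads)  # each entry should be close to 4.5
```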

In [17]:
# In most cases the final output is a scalar value.
# If it is not, backward() needs a vector argument to complete the vector-Jacobian product.
x = torch.ones(3, requires_grad=True)
y = x + 2
z = y * y * 3
# z = z.mean() # mean() is removed here, so z is a vector and backward() fails:
               # Error: grad can be implicitly created only for scalar outputs
z.backward()
print(x.grad)

RuntimeError: grad can be implicitly created only for scalar outputs

In [18]:
# In most cases the final output is a scalar value.
# If it is not, backward() needs a vector argument to complete the vector-Jacobian product.
x = torch.ones(3, requires_grad=True)
y = x + 2
z = y * y * 3
# z = z.mean() # mean() is removed here, so z is a vector;
               # passing v to backward() avoids the scalar-output error
v = torch.tensor([0.1, 1.0, 0.001], dtype = torch.float32)
z.backward(v)
print(x.grad)

tensor([1.8000e+00, 1.8000e+01, 1.8000e-02])
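
One way to read this result: here the Jacobian is diagonal with entries `6*(x_i+2) = 18`, so each gradient entry is `18*v_i`. Passing `v` to `backward()` is also equivalent to reducing `z` to the scalar `(z*v).sum()` and calling `backward()` on that. The sketch below compares the two routes; the names `x1`, `x2`, `z1`, and `z2` are only illustrative.

```python
import torch

v = torch.tensor([0.1, 1.0, 0.001])

# Route 1: pass v explicitly to backward()
x1 = torch.ones(3, requires_grad=True)
z1 = 3 * (x1 + 2) ** 2
z1.backward(v)

# Route 2: weight the outputs by v, sum to a scalar, then call backward()
x2 = torch.ones(3, requires_grad=True)
z2 = 3 * (x2 + 2) ** 2
(z2 * v).sum().backward()

print(x1.grad)
print(x2.grad)
print(torch.allclose(x1.grad, x2.grad))
```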


**Preventing Gradient History**

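There are three ways to stop autograd from tracking history:

- `x.requires_grad_(False)` changes the flag on `x` in place.
- `x.detach()` returns a new tensor that shares data with `x` but does not require gradients.
- `with torch.no_grad():` disables tracking for every operation inside the block.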

In [19]:
x = torch.randn(3, requires_grad=True)
print(x)

x.requires_grad_(False)
print(x)

tensor([-0.9978, -1.0081, -0.3368], requires_grad=True)
tensor([-0.9978, -1.0081, -0.3368])


In [20]:
x = torch.randn(3, requires_grad=True)
print(x)

y = x.detach()
print(y)

tensor([-1.4535,  1.0971, -0.9018], requires_grad=True)
tensor([-1.4535,  1.0971, -0.9018])


In [21]:
x = torch.randn(3, requires_grad=True)
print(x)

with torch.no_grad():
  y = x+2
  print(y)

tensor([-0.2543, -0.8507,  0.4326], requires_grad=True)
tensor([1.7457, 1.1493, 2.4326])
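
The three approaches differ in what they change. The following sketch, with illustrative variable names, checks the `requires_grad` flag after each one and shows that `detach()` shares storage with the original tensor while leaving the original untouched.

```python
import torch

x = torch.randn(3, requires_grad=True)

# detach(): new tensor, same underlying storage, no gradient tracking
d = x.detach()
shares_storage = d.data_ptr() == x.data_ptr()
print(d.requires_grad, shares_storage)

# no_grad(): results computed inside the block are not tracked,
# but x itself still requires gradients afterwards
with torch.no_grad():
    y = x + 2
print(y.requires_grad, y.grad_fn, x.requires_grad)

# requires_grad_(False): flips the flag on the tensor itself, in place
w = torch.randn(3, requires_grad=True)
w.requires_grad_(False)
print(w.requires_grad)
```

Because a detached tensor shares storage with its source, an in-place change to `d` would also change the values seen through `x`.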


**Training Example & Backprop With Autograd**

In [25]:
weights = torch.ones(4, requires_grad=True)

for epoch in range(2):
  model_output = (weights*3).sum()

  model_output.backward()

  print(f'Iteration {epoch}, Weight grads:, {weights.grad}')

Iteration 0, Weight grads:, tensor([3., 3., 3., 3.])
Iteration 1, Weight grads:, tensor([6., 6., 6., 6.])


In [26]:
weights = torch.ones(4, requires_grad=True)

for epoch in range(2):
  model_output = (weights*3).sum()

  model_output.backward()

  print(f'Iteration {epoch}, Weight grads:, {weights.grad}')
  # Very important:
  # backward() accumulates gradients into weights.grad on every iteration.
  # Zero the gradients to remove the accumulated values so that the next iteration starts afresh.

  weights.grad.zero_()

Iteration 0, Weight grads:, tensor([3., 3., 3., 3.])
Iteration 1, Weight grads:, tensor([3., 3., 3., 3.])
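
In a typical training loop the zeroing is delegated to an optimizer. The sketch below is one possible version of the loop above using `torch.optim.SGD`; the learning rate of `0.01` is an arbitrary choice made here for illustration, not something taken from the notebook.

```python
import torch

weights = torch.ones(4, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.01)  # lr chosen only for illustration

for epoch in range(2):
    model_output = (weights * 3).sum()

    optimizer.zero_grad()    # clear gradients left over from the previous iteration
    model_output.backward()  # fresh gradients: d(model_output)/d(weights) = 3
    optimizer.step()         # weights <- weights - lr * weights.grad

    print(f'Iteration {epoch}, Weight grads: {weights.grad}, weights: {weights.data}')
```

Each step subtracts `0.01 * 3` from every weight, so after two iterations the weights should be close to `0.94`.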
