# Autograd
Autograd is PyTorch's engine for calculating gradients automatically.

It is provided by the `torch.autograd` package.

In [1]:
import torch

In [2]:
x = torch.rand(3, requires_grad=True)  # requires_grad is False by default; set it to True so gradients can be calculated
print(x)

tensor([0.2254, 0.4946, 0.0936], requires_grad=True)


## Computational graph

Any operation on the variable `x`, which has `requires_grad = True`, is recorded in a computational graph. For example:

$$y = x + 2$$

will result in a graph as shown below -
<p align="center">
<img src="../images/Computational_Graph.png" style="width:600px;height:300px;">
</p>

Each tensor produced by a tracked operation stores a `grad_fn` attribute, which names the backward function that back propagation will use for that operation -
- `AddBackward0`: Back propagation for addition operations
- `MulBackward0`: Back propagation for multiplication operations
- `MeanBackward0`: Back propagation for mean calculations

and so on...

This `grad_fn` attribute links each tensor to the history of computations that produced it.
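As a minimal sketch of what this history looks like, the snippet below walks one step back through the graph. It relies on the `next_functions` attribute of `grad_fn`, which is an internal detail whose layout may differ between PyTorch versions; the names `a`, `b`, `c` and `parent` are only illustrative.

```python
import torch

a = torch.ones(3, requires_grad=True)  # leaf tensor created by the user
b = 2 * a                              # produced by a multiplication
c = b + 2                              # produced by an addition

# Leaf tensors have no grad_fn; results of tracked operations do.
print(a.is_leaf, a.grad_fn)            # True None
print(type(b.grad_fn).__name__)        # MulBackward0
print(type(c.grad_fn).__name__)        # AddBackward0

# Each grad_fn refers to the backward functions of its inputs,
# which is how the history of computations is chained together.
parent = c.grad_fn.next_functions[0][0]
print(type(parent).__name__)
```

Following `next_functions` repeatedly leads back to the leaf tensors, which is the path back propagation takes.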


In [3]:
y = (2*x)+2 # Will create a computational graph
print(y) # A grad_fn attribute is created

tensor([2.4507, 2.9892, 2.1873], grad_fn=<AddBackward0>)


In [4]:
z = 2*(y**2)
print(z)

tensor([12.0119, 17.8701,  9.5685], grad_fn=<MulBackward0>)


In [5]:
z1 = z.mean()
print(z1)

tensor(13.1502, grad_fn=<MeanBackward0>)


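Calling `backward()` on a tensor computes the gradient of that tensor with respect to every leaf tensor in its graph that has `requires_grad = True`, and stores the result in the `.grad` attribute of those tensors. The derivative is computed analytically by applying the chain rule (`grad_fn` decides how each step is differentiated) and **not** by a numerical approximation such as finite differences.

Without arguments, `backward()` strictly needs the tensor to hold a **single** value. For a tensor with more than one element, a gradient tensor of the same shape must be passed to `backward()`.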

In [6]:
z1.backward()  # Computes dz1/dx; no argument is needed because z1 is a single value
# z1 must depend on a tensor with requires_grad = True, otherwise backward() raises a RuntimeError
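The two failure cases mentioned here can be reproduced with a short sketch. The flag names `non_scalar_failed` and `untracked_failed` are illustrative, and the exact error messages may vary between PyTorch versions.

```python
import torch

x = torch.rand(3, requires_grad=True)
z = 2 * (2 * x + 2) ** 2  # three elements, so not a single value

# Case 1: backward() without an argument on a tensor with several elements.
try:
    z.backward()
    non_scalar_failed = False
except RuntimeError as err:
    non_scalar_failed = True
    print("non-scalar:", err)

# Case 2: backward() on a result that depends on no tensor with requires_grad=True.
w = torch.rand(3)  # requires_grad is False by default
try:
    w.sum().backward()
    untracked_failed = False
except RuntimeError as err:
    untracked_failed = True
    print("untracked:", err)
```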

This creates `x.grad`, which holds $\frac{dz1}{dx}$

This would be -
$$y = 2x+2$$
$$z = 2y^2 = 8(x+1)^2$$
$$z1 = \frac{1}{3}\sum_{i} z_i$$
$$\frac{dz1}{dx_i} = \frac{dz1}{dz_i} \times \frac{dz_i}{dx_i} = \frac{1}{3} \times 16(x_i+1) = \frac{16}{3}(x_i+1)$$

In [7]:
print(x.grad)
print(16*(x+1)/3)

tensor([6.5352, 7.9711, 5.8328])
tensor([6.5352, 7.9711, 5.8328], grad_fn=<DivBackward0>)
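The difference between the analytic gradient from `backward()` and a numerical approximation can be sketched by comparing `x.grad` with a central finite difference. This is only an illustration: the helper `f`, the step size `eps` and the input values are assumptions chosen for the example, and `float64` is used to keep rounding error small.

```python
import torch

x = torch.tensor([0.2, 0.5, 0.1], dtype=torch.float64, requires_grad=True)

def f(t):
    # Same computation as above: y = 2t + 2, z = 2y^2, z1 = mean(z)
    return (2 * (2 * t + 2) ** 2).mean()

f(x).backward()
analytic = x.grad.clone()

eps = 1e-6
numeric = torch.zeros_like(x)
with torch.no_grad():
    for i in range(x.numel()):
        step = torch.zeros_like(x)
        step[i] = eps
        # Central difference: (f(x + h) - f(x - h)) / (2h)
        numeric[i] = (f(x + step) - f(x - step)) / (2 * eps)

print(analytic)
print(numeric)
```

The finite-difference version needs two evaluations of `f` per input element, whereas `backward()` obtains all the partial derivatives in one pass.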


In the background, `backward()` computes a vector-Jacobian product, which is shown below

<p align="center">
<img src="images/Jacobian_Vector_Product.png" style="width:600px;height:200px;">
</p>

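where
- $J$: Jacobian matrix of the partial derivatives of the outputs with respect to the inputs
- $v$: Vector of gradients of the final value with respect to those outputs
- The product results in the gradients we need for back propagation

In short, this calculation is the **Chain rule**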
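To make the product concrete, here is a small sketch that builds the full Jacobian with `torch.autograd.functional.jacobian` and multiplies it by a vector by hand, then compares the result with what `backward(v)` stores in `.grad`. The function `f` and the values of `x` and `v` are assumptions made for the illustration.

```python
import torch
from torch.autograd.functional import jacobian

def f(t):
    # Elementwise function from 3 inputs to 3 outputs
    return 2 * (2 * t + 2) ** 2

x = torch.tensor([0.1, 0.2, 0.3], requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0])

# Let autograd compute the vector-Jacobian product directly.
f(x).backward(v)

# Build the full 3x3 Jacobian and form the product by hand.
J = jacobian(f, x.detach())
manual = v @ J  # v^T J

print(J)
print(x.grad)
print(manual)
```

Because `f` acts elementwise, `J` is diagonal here, so the product reduces to an elementwise multiplication by `v`. Autograd never needs to materialize `J`; it only propagates the product.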

In [8]:
# If we want dz/dx
x1 = torch.rand(3, requires_grad=True)
y1 = 2*x1 + 2
z1 = 2*(y1**2)
z2 = z1.sum()
z2.backward()  # Since dz2/dz1 = 1 for every element, x1.grad equals dz1/dx1

Here -
$$\frac{dz_2}{dx_1} = 16(x_1+1)$$

In [9]:
print(x1.grad)
print(16*(x1+1))

tensor([20.0744, 27.6490, 27.1790])
tensor([20.0744, 27.6490, 27.1790], grad_fn=<MulBackward0>)


In [10]:
#Or we pass a vector of same size in the backward() function
x2 = torch.rand(3 ,requires_grad=True)
y2 = 2*x2 + 2
z2 = 2*(y2**2)
v2 = torch.ones_like(z2) 
z2.backward(v2)  # Vector-Jacobian product: x2.grad = v2 * (dz2/dx2), elementwise because z2[i] depends only on x2[i]
print(x2.grad)
print(16*(x2+1))
print(16*(x2+1)*v2)  # Same as the line above because v2 is just a vector of ones

tensor([29.2486, 17.2806, 25.7376])
tensor([29.2486, 17.2806, 25.7376], grad_fn=<MulBackward0>)
tensor([29.2486, 17.2806, 25.7376], grad_fn=<MulBackward0>)


In [11]:
# Let's try a different v2
x2 = torch.randn(3 ,requires_grad=True)
y2 = 2*x2 + 2
z2 = 2*(y2**2)
v2 = torch.rand(len(z2))
z2.backward(v2)
print(x2.grad)
print(16*(x2+1))  # Does not match x2.grad because v2 is no longer all ones
print(16*(x2+1)*v2)  # Matches x2.grad

tensor([ 1.5111, 40.0987, 13.2673])
tensor([20.7550, 41.3589, 21.6156], grad_fn=<MulBackward0>)
tensor([ 1.5111, 40.0987, 13.2673], grad_fn=<MulBackward0>)


## Making `requires_grad = False` later in the calculations

Once back propagation is over, gradient tracking should be turned off while the parameters of a neural network are updated, so that the update itself is not recorded in the computational graph.

Can be done in three ways -
- **`x.requires_grad_(False)`**: Sets `requires_grad = False` on `x` in place
- **`x.detach()`**: Returns a new tensor with `requires_grad = False` that shares its data with `x`
- **`with torch.no_grad():`**: Operations inside this block are not tracked, so their results have `requires_grad = False`

#### `requires_grad_(False)`

In [12]:
x

tensor([0.2254, 0.4946, 0.0936], requires_grad=True)

In [13]:
x.requires_grad_(False)  # The trailing underscore means the change is made in place: x itself now has requires_grad = False
print(x)

tensor([0.2254, 0.4946, 0.0936])
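One caveat, stated from general PyTorch behavior rather than from the cells above: the flag can only be switched off on leaf tensors such as `x`. A short sketch, where `changed` and `y_untracked` are illustrative names:

```python
import torch

x = torch.rand(3, requires_grad=True)
y = 2 * x + 2  # non-leaf: produced by tracked operations

try:
    y.requires_grad_(False)
    changed = True
except RuntimeError as err:
    changed = False
    print(err)

# For a non-leaf tensor, detach() gives an untracked tensor instead.
y_untracked = y.detach()
print(y_untracked.requires_grad)  # False
```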


#### `detach()`

In [14]:
x = torch.randn(3, requires_grad=True)
print(x)
x.detach_()  # In-place version: x itself is detached from the graph
print(x)

tensor([-1.4489,  1.9851,  0.0558], requires_grad=True)
tensor([-1.4489,  1.9851,  0.0558])


In [15]:
x = torch.randn(3, requires_grad=True)
print(x)
y = x.detach()
print('x:',x) # x is unchanged because detach() without the trailing underscore is not in place
print('y:',y)

tensor([ 0.1504,  1.2185, -0.2486], requires_grad=True)
x: tensor([ 0.1504,  1.2185, -0.2486], requires_grad=True)
y: tensor([ 0.1504,  1.2185, -0.2486])
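Note that `detach()` does not copy the data. The sketch below, with illustrative names `shared` and `copied`, shows that an in-place write through the detached tensor is visible in the original, and that adding `clone()` gives an independent copy.

```python
import torch

x = torch.ones(3, requires_grad=True)
shared = x.detach()          # no copy: same underlying data as x
copied = x.detach().clone()  # independent copy

shared[0] = 100.0  # in-place write through the detached tensor
print(x)       # the first element of x has changed as well
print(copied)  # still all ones

print(shared.data_ptr() == x.data_ptr())  # True
```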


#### `with torch.no_grad():`

In [16]:
x = torch.randn(3, requires_grad=True)
print(x)
y = x+2
print('y:',y) # Here y has the grad_fn
with torch.no_grad():
    y = x+2
    print('y:',y) # Here y does not have grad_fn
    print(x)

tensor([-0.9801, -0.3551,  1.2238], requires_grad=True)
y: tensor([1.0199, 1.6449, 3.2238], grad_fn=<AddBackward0>)
y: tensor([1.0199, 1.6449, 3.2238])
tensor([-0.9801, -0.3551,  1.2238], requires_grad=True)
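The typical use of `torch.no_grad()` is the parameter update itself. Below is a minimal sketch of one manual gradient-descent step; the learning rate `lr` and the toy loss are assumptions made for the example.

```python
import torch

weights = torch.ones(3, requires_grad=True)
lr = 0.1  # illustrative learning rate

loss = (weights * 3).sum()
loss.backward()  # weights.grad is now [3., 3., 3.]

with torch.no_grad():
    weights -= lr * weights.grad  # in-place update, not recorded in the graph
weights.grad.zero_()

print(weights)  # tensor([0.7000, 0.7000, 0.7000], requires_grad=True)
print(weights.requires_grad, weights.is_leaf)  # True True
```

Outside the `no_grad` block, the same in-place update on a leaf tensor that requires gradients raises a `RuntimeError`. After the block, `weights` still has `requires_grad = True`, so the next forward pass is tracked as usual.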


## Resetting Gradients

In [17]:
# Let's take an example
weights = torch.ones(3, requires_grad=True)

for epoch in range(3):
    output = (weights*3).sum() # y = weights*3, output = y.sum()
    output.backward() # Should give d(output)/d(y) * d(y)/d(weights) = 1 * [3,3,3] = [3,3,3]
    print(weights.grad)

tensor([3., 3., 3.])
tensor([6., 6., 6.])
tensor([9., 9., 9.])


The gradients add up because `backward()` accumulates into `.grad` instead of overwriting it, so `.grad` needs to be reset after each backward pass
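The accumulation is easier to see with two different losses, as in this small sketch (the tensor `w` and both losses are illustrative). It also shows assigning `None` to `.grad` as an alternative way to reset it.

```python
import torch

w = torch.ones(3, requires_grad=True)

(w * 3).sum().backward()
print(w.grad)  # tensor([3., 3., 3.])

(w * 5).sum().backward()  # added to the existing .grad
print(w.grad)  # tensor([8., 8., 8.])

w.grad = None  # alternative reset: drop the gradient tensor entirely
(w * 5).sum().backward()
print(w.grad)  # tensor([5., 5., 5.])
```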

In [18]:
weights = torch.ones(3, requires_grad=True)
for epoch in range(3):
    output = (weights*3).sum() # y = weights*3, output = y.sum()
    output.backward() # Should give d(output)/d(y) * d(y)/d(weights) = 1 * [3,3,3] = [3,3,3]
    print(weights.grad)
    weights.grad.zero_()

tensor([3., 3., 3.])
tensor([3., 3., 3.])
tensor([3., 3., 3.])


#### Doing it using the `torch.optim` module

In [19]:
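weights = torch.ones(3, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.01)  # SGD: Stochastic Gradient Descent; the parameters are passed as a list
output = (weights*3).sum()
output.backward()  # fills weights.grad
optimizer.step()  # updates parameters
optimizer.zero_grad()  # Resets the gradients of the parameters in optimizer, like .grad.zero_() (recent PyTorch versions set them to None by default)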

After using `backward()` function -----> update the parameters with gradient tracking disabled -----> call `.grad.zero_()` before the next backward pass
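Putting the pieces together, a complete loop might look like the sketch below. The `target` values, the squared-error loss, the learning rate and the number of epochs are all assumptions chosen so that the example converges; `optimizer.step()` performs the update without gradient tracking, so no explicit `torch.no_grad()` block is needed.

```python
import torch

target = torch.tensor([1.0, -2.0, 0.5])  # hypothetical values to fit
weights = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.SGD([weights], lr=0.1)

for epoch in range(100):
    loss = ((weights - target) ** 2).sum()
    loss.backward()        # 1. compute gradients
    optimizer.step()       # 2. update the parameters
    optimizer.zero_grad()  # 3. reset gradients before the next pass

print(weights)
print(loss.item())
```

If `zero_grad()` is left out, each step would use the sum of all previous gradients, as in the accumulation example earlier in this section.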

Image credits - The Python Engineer YouTube channel