# PyTorch Autograd

In [32]:
import torch
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

from torch import nn

## 1. PyTorch Basics for Understanding Autograd

**Tensors**: A tensor is an n-dimensional array in PyTorch. Tensors add two capabilities on top of a plain array:
* Besides the CPU, they can be loaded onto the GPU for faster computation.
* With `.requires_grad = True` set, they start building a backward graph that tracks every operation applied to them, so that gradients can be calculated through a dynamic computation graph (DCG).

> In earlier versions of PyTorch, the `torch.autograd.Variable` class was used to create tensors that support gradient calculations and operation tracking, but as of `PyTorch v0.4.0` the Variable class has been deprecated. `torch.Tensor` and `torch.autograd.Variable` are now the same class. More precisely, `torch.Tensor` is capable of tracking history and behaves like the old Variable.

In [33]:
x = torch.randn(2, 2, requires_grad = True)

# From numpy
x = np.array([1., 2., 3.]) #Only Tensors of floating point dtype can require gradients
x = torch.from_numpy(x)
# Now enable gradient 
x.requires_grad_(True)
# The trailing _ above makes the change in-place (a common PyTorch convention)

tensor([1., 2., 3.], dtype=torch.float64, requires_grad=True)

> Note: By PyTorch’s design, gradients can only be calculated for **floating** point tensors (and complex tensors in recent versions); integer tensors cannot require gradients.
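
The dtype restriction in the note above can be checked directly. The following is a minimal sketch; the variable names `int_error` and `x_float` are only illustrative, and the exact error message may differ between PyTorch versions.

```python
import torch

# Integer tensors cannot require gradients: autograd raises a RuntimeError.
try:
    torch.tensor([1, 2, 3], requires_grad=True)
    int_error = None
except RuntimeError as err:
    int_error = err
print(type(int_error).__name__)

# Casting to a floating point dtype first makes gradient tracking possible.
x_float = torch.tensor([1, 2, 3]).float().requires_grad_(True)
print(x_float.dtype, x_float.requires_grad)
```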

**Autograd**: This package is PyTorch's engine for calculating derivatives (vector-Jacobian products, to be more precise).
* It records all the operations performed on a gradient-enabled tensor and creates an acyclic graph called the dynamic computational graph.
* The leaves of this graph are input tensors and the roots are output tensors. Gradients are calculated by tracing the graph from the root to the leaves and multiplying the local gradients along the way using the chain rule.
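
To make the chain rule concrete, here is a small sketch that compares the gradient autograd produces with one worked out by hand. The functions `y = x ** 2` and `z = 3 * y` are arbitrary choices for illustration.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)  # leaf (input) tensor
y = x ** 2                                 # local gradient: dy/dx = 2x
z = 3 * y                                  # local gradient: dz/dy = 3

# Trace the graph from the root z back to the leaf x.
z.backward()

# Chain rule by hand: dz/dx = dz/dy * dy/dx = 3 * 2x
manual_grad = 3 * 2 * x.item()

print(x.grad)        # gradient computed by autograd
print(manual_grad)   # gradient computed by hand
```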

## 2. Neural networks and Backpropagation

Neural networks are nothing more than composite mathematical functions that are delicately tweaked (trained) to output the required result. The tweaking or the training is done through a remarkable algorithm called backpropagation. Backpropagation is used to calculate the gradients of the loss with respect to the input weights to later update the weights and eventually reduce the loss.

> In a way, back propagation is just fancy name for the chain rule — Jeremy Howard

Creating and training a neural network involves the following essential steps:
1. Define the architecture
2. Forward propagate on the architecture using input data
3. Calculate the loss
4. **Backpropagate to calculate the gradient for each weight**
5. Update the weights using a learning rate

The change in the loss for a small change in an input weight is called the gradient of that weight and is calculated using backpropagation. The gradient is then used to update the weight using a learning rate to overall reduce the loss and train the neural net.
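
The five steps can be sketched without `nn.Module` or an optimizer. The example below is an assumption-laden toy: the model is a single weight `w`, the data follows `targets = 3 * inputs`, and the learning rate and step count are arbitrary.

```python
import torch

# Toy data: the "true" weight is 3.0 (an arbitrary choice for this sketch).
inputs = torch.linspace(-1.0, 1.0, 20)
targets = 3.0 * inputs

# 1. Define the architecture: a single trainable weight, y = w * x.
w = torch.zeros(1, requires_grad=True)
lr = 0.1
losses = []

for step in range(200):
    pred = w * inputs                       # 2. forward propagate
    loss = ((pred - targets) ** 2).mean()   # 3. calculate the loss
    loss.backward()                         # 4. backpropagate: fills w.grad
    with torch.no_grad():                   # 5. update the weight, untracked
        w -= lr * w.grad
    w.grad.zero_()                          # reset the accumulated gradient
    losses.append(loss.item())

print(w.item(), losses[0], losses[-1])
```

The update is wrapped in `torch.no_grad()` so that the weight update itself is not recorded in the graph.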

This is done iteratively. In each iteration, several gradients are calculated, and a computation graph is built to store the gradient functions. PyTorch does this by building a Dynamic Computational Graph (DCG). The graph is rebuilt from scratch in every iteration, which gives maximum flexibility in gradient calculation. For example, for a forward operation (function) `Mul`, a backward operation (function) called `MulBackward` is dynamically integrated into the backward graph to compute the gradient.

## 3. Dynamic Computational graph

Gradient enabled tensors (variables) along with functions (operations) combine to create the dynamic computational graph. 
* The flow of data and the operations applied to the data are defined at runtime hence constructing the computational graph dynamically. 
* This graph is made dynamically by the autograd class under the hood. [You don’t have to encode all possible paths before you launch the training — what you run is what you differentiate](https://pytorch.org/docs/stable/notes/autograd.html).
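
Because the graph is recorded while the code runs, ordinary Python control flow changes what gets differentiated. A minimal sketch, where the helper `forward` and its `n_repeats` argument are hypothetical names used only for this illustration:

```python
import torch

def forward(x, n_repeats):
    # The number of multiplications is decided at runtime,
    # so each call records a different graph.
    out = x
    for _ in range(n_repeats):
        out = out * x
    return out

x = torch.tensor(2.0, requires_grad=True)

# One multiplication: out = x^2, so d(out)/dx = 2x.
forward(x, 1).backward()
grad_square = x.grad.item()

# Two multiplications: out = x^3, so d(out)/dx = 3x^2.
x.grad = None  # clear the previous gradient before the second pass
forward(x, 2).backward()
grad_cube = x.grad.item()

print(grad_square, grad_cube)
```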

A simple DCG for multiplication of two tensors would look like this:

<img src="./assets/simple_autograd.png" width="430" height="430" />

Every tensor (variable) has several attributes, the most important of which are:

* **Data**: The data the variable is holding. `a` holds a scalar tensor with the value 2.0, while `b` holds 3.0. `c` holds the product of the two, i.e. 6.0.

* **requires_grad**: If true, this attribute starts tracking all the operation history and forms a backward graph for gradient calculation. For an arbitrary tensor `a`, it can be set in-place as follows: `a.requires_grad_(True)`.

> If there’s a single input to an operation that requires gradient, its output will also require gradient.
>
> Conversely, only if all inputs don’t require gradient, the output also won’t require it. 
>
> Backward computation is never performed in the subgraphs, where all Tensors didn’t require gradients.

* **grad**: grad holds the value of gradient. If `requires_grad` is False, it will hold a `None` value. Even if `requires_grad` is True, it will hold a `None` value unless `.backward()` function is called from some other node. For example, if you call `out.backward()` for some variable out that involved `x` in its calculations then `x.grad` will hold `∂out/∂x`.

* **grad_fn**: This is the backward function used to calculate the gradient.

* **is_leaf**: A node is a leaf if:
    * It was initialized explicitly by some function like `x = torch.tensor(1.0)` or `x = torch.randn(1, 1)` (basically all the tensor initializing methods).
    * It is created after operations on tensors which all have `requires_grad = False`.
    * It is created by calling `.detach()` method on some tensor.

On calling `backward()`, gradients are populated only for the nodes that have both `requires_grad` and `is_leaf` set to True. These are the gradients of the output node from which `.backward()` is called, with respect to each of those leaf nodes.
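
The rules for `requires_grad` and `is_leaf` can be verified on a handful of tensors. This is a small sketch; the tensor names `d` and `e` are introduced here only for illustration.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)  # created explicitly: leaf
b = torch.tensor(3.0)                      # created explicitly, no grad: leaf

c = a * b       # one input requires grad -> output requires grad, not a leaf
d = b * b       # no input requires grad -> output does not either, so it is a leaf
e = c.detach()  # detached from the graph -> leaf with requires_grad False

for name, t in zip("abcde", [a, b, c, d, e]):
    print(name, "requires_grad:", t.requires_grad, "is_leaf:", t.is_leaf)

# Only a is both a leaf and gradient enabled, so only a.grad is populated.
c.backward()
print(a.grad, b.grad)
```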

On turning `requires_grad = True` PyTorch will start tracking the operation and store the gradient functions at each step as follows:

<img src="./assets/autograd_backprop.png" width="550" height="550" />

In the picture:

* The `Mul` operation has access to a context variable called `ctx`, in which it can store any values it needs for the backward pass.
* `ctx` is passed to the `MulBackward` operation in the backward pass.
* The `MulBackward` function has the attribute `next_functions`, a list of tuples, each associated with one of the inputs that were passed to the `Mul` function.
    * AccumulateGrad is associated with tensor `a` and it accumulates the gradient for the tensor `a`. 
    * None is associated with tensor `b`. It is none because tensor `b` has `requires_grad` set to `False` so we don't need to pass a gradient to it. 
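
The role of `ctx` and `next_functions` can be sketched with a custom `torch.autograd.Function`. The class `MyMul` below is a hypothetical stand-in written for illustration; it is not how PyTorch implements multiplication internally.

```python
import torch

class MyMul(torch.autograd.Function):
    """Hypothetical re-implementation of multiplication, for illustration only."""

    @staticmethod
    def forward(ctx, a, b):
        ctx.save_for_backward(a, b)  # store what the backward pass will need
        return a * b

    @staticmethod
    def backward(ctx, grad_output):
        a, b = ctx.saved_tensors
        # d(a*b)/da = b and d(a*b)/db = a, each scaled by the incoming gradient.
        grad_a = grad_output * b if ctx.needs_input_grad[0] else None
        grad_b = grad_output * a if ctx.needs_input_grad[1] else None
        return grad_a, grad_b

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0)  # requires_grad is False

c = MyMul.apply(a, b)
c.backward()
custom_grad = a.grad.item()
print(custom_grad)

# For the built-in multiplication, inspect the backward graph directly.
a.grad = None
c2 = a * b
next_fns = c2.grad_fn.next_functions
print(next_fns)  # one (node, index) tuple per input of the multiplication
```

The first entry of `next_fns` holds the `AccumulateGrad` node for `a`; the second holds `None` because `b` does not require a gradient.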

The following code generates the above graph under PyTorch’s hood:

In [41]:
# Creating the graph
a = torch.tensor(2.0, requires_grad = True)
b = torch.tensor(3.0)
c = a * b

# Displaying
for i, name in zip([a, b, c], "abc"):
    print(f"{name}\ndata: {i.data}\nrequires_grad: {i.requires_grad}\n\
grad: {i.grad}\ngrad_fn: {i.grad_fn}\nis_leaf: {i.is_leaf}\nrequires_grad: {i.requires_grad}")

a
data: 2.0
requires_grad: True
grad: None
grad_fn: None
is_leaf: True
requires_grad: True
b
data: 3.0
requires_grad: False
grad: None
grad_fn: None
is_leaf: True
requires_grad: False
c
data: 6.0
requires_grad: True
grad: None
grad_fn: <MulBackward0 object at 0x11cba8cd0>
is_leaf: False
requires_grad: True


Using **torch.no_grad() context** to turn off gradient calculation

To stop PyTorch from tracking the history and forming the backward graph, the code can be wrapped inside the context `with torch.no_grad():`. It will make the code run faster whenever gradient tracking is not needed.

In [43]:
# Creating the graph
x = torch.tensor(1.0, requires_grad = True)
# Check if tracking is enabled
print(x.requires_grad) #True
y = x * 2
print(y.requires_grad) #True

with torch.no_grad():
    # Check if tracking is enabled
    y = x * 2
    print(y.requires_grad) #False

True
True
False


## 4. Backward() function

`backward()` is the function that actually calculates the gradients, by passing its argument (a scalar tensor with value 1.0 by default) through the backward graph all the way to every leaf node reachable from the root tensor it is called on. The calculated gradients are then stored in `.grad` of every **leaf node**.

> Remember, the backward graph is already made dynamically during the forward pass. Backward function only calculates the gradient using the already made graph and stores them in leaf nodes.

In [45]:
# Creating the graph
x = torch.tensor(1.0, requires_grad = True)
z = x ** 3
z.backward() #Computes the gradient 
print(z.grad_fn)
print(x.grad.data) #Prints '3' which is dz/dx 

<PowBackward0 object at 0x104993790>
tensor(3.)


An important thing to notice is that when `z.backward()` is called, a tensor is automatically passed as `z.backward(torch.tensor(1.0))`. The `torch.tensor(1.0)` is the external gradient provided to terminate the chain rule gradient multiplications. This external gradient is passed as the input to the `PowBackward` function to further calculate the gradient of `x`. The tensor passed into `.backward()` must have the same shape as the tensor on which `.backward()` is called (here `z`).

For example, if the gradient enabled tensor `x` and `y` are as follows:

In [46]:
x = torch.tensor([0.0, 2.0, 8.0], requires_grad = True)
y = torch.tensor([5.0 , 1.0 , 7.0], requires_grad = True)
z = x * y

Then, to calculate gradients of `z` (a tensor with 3 elements) with respect to `x` or `y`, an external gradient needs to be passed to the `z.backward()` function as follows: `z.backward(torch.FloatTensor([1.0, 1.0, 1.0]))`

In [48]:
z.backward(torch.FloatTensor([1.0, 1.0, 1.0]))
print(z.grad_fn)
print(x.grad.data)       

<MulBackward0 object at 0x11cb9ab90>
tensor([5., 1., 7.])


> `z.backward()` would give a RuntimeError: grad can be implicitly created only for scalar outputs

In [50]:
# This would give the "grad can be implicitly created only for scalar outputs" error.
# z.backward()

The tensor passed into `backward()` acts as a set of weights on the elements of the output. Mathematically, it is the vector `v` in the vector-Jacobian product `vᵀ·J`, where `J` is the Jacobian of the non-scalar output with respect to a leaf tensor. It should therefore almost always be a tensor of ones with the same shape as the tensor `backward()` is called on, unless a weighted combination of the outputs is wanted.
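
The weighting can be seen by passing a non-unit tensor. In the sketch below the weights `v` are arbitrary values chosen for illustration; the second half checks that the result matches differentiating the weighted sum of the outputs.

```python
import torch

x = torch.tensor([0.0, 2.0, 8.0], requires_grad=True)
y = torch.tensor([5.0, 1.0, 7.0])
v = torch.tensor([1.0, 0.5, 2.0])  # arbitrary weights for this sketch

# Pass v as the external gradient: x.grad becomes v * dz/dx = v * y.
z = x * y
z.backward(v)
weighted_grad = x.grad.clone()

# Equivalent formulation: reduce to a scalar with the same weights first.
x.grad = None
((x * y) * v).sum().backward()
scalar_grad = x.grad.clone()

print(weighted_grad)
print(scalar_grad)
```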

> Backward graph is created automatically and dynamically by **autograd** class during **forward pass**. `Backward()` simply calculates the gradients by passing its argument to the already made backward graph.

## 5. Backward calculation

<img src="./assets/calculate_backprop.png" width="580" height="580" />

In [55]:
# Creating the graph
a = torch.tensor(2.0, requires_grad = True)
b = torch.tensor(3.0)
c = a * b

print("Forward pass:")
for i, name in zip([a, b, c], "abc"):
    print(f"{name}\ndata: {i.data}\nrequires_grad: {i.requires_grad}\n\
grad: {i.grad}\ngrad_fn: {i.grad_fn}\nis_leaf: {i.is_leaf}\nrequires_grad: {i.requires_grad}")
    
c.backward()
print("\n")
print("After backward on c:")
# Displaying
for i, name in zip([a, b, c], "abc"):
    print(f"{name}\ndata: {i.data}\nrequires_grad: {i.requires_grad}\n\
grad: {i.grad}\ngrad_fn: {i.grad_fn}\nis_leaf: {i.is_leaf}\nrequires_grad: {i.requires_grad}")

Forward pass:
a
data: 2.0
requires_grad: True
grad: None
grad_fn: None
is_leaf: True
requires_grad: True
b
data: 3.0
requires_grad: False
grad: None
grad_fn: None
is_leaf: True
requires_grad: False
c
data: 6.0
requires_grad: True
grad: None
grad_fn: <MulBackward0 object at 0x11cc22110>
is_leaf: False
requires_grad: True


After backward on c:
a
data: 2.0
requires_grad: True
grad: 3.0
grad_fn: None
is_leaf: True
requires_grad: True
b
data: 3.0
requires_grad: False
grad: None
grad_fn: None
is_leaf: True
requires_grad: False
c
data: 6.0
requires_grad: True
grad: None
grad_fn: <MulBackward0 object at 0x11cc225d0>
is_leaf: False
requires_grad: True


## 6. Freeze specified layers

Setting **.requires_grad = False** on a model's parameters is especially useful when you want to freeze part of your model, or when you know in advance that you’re not going to use gradients w.r.t. some parameters. For example, if you want to finetune a pretrained CNN, it’s enough to switch the `requires_grad` flags in the frozen base, and no intermediate buffers will be saved until the computation gets to the last layer, where the affine transform will use weights that require gradient, and the output of the network will also require them.
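
Before the full walkthrough, the effect of freezing can be sketched in a few lines. The `base` and `head` modules below are hypothetical stand-ins for a pretrained backbone and a new final layer, with randomly initialized weights.

```python
import torch
from torch import nn

torch.manual_seed(0)
base = nn.Sequential(nn.Linear(10, 5), nn.ReLU())  # stand-in for a pretrained base
head = nn.Linear(5, 1)                             # new layer to be trained

# Freeze the base: none of its parameters will receive gradients.
for p in base.parameters():
    p.requires_grad_(False)

features = base(torch.randn(4, 10))
print(features.requires_grad)  # no graph is recorded through the frozen base

loss = head(features).pow(2).mean()
loss.backward()

frozen_grads = [p.grad for p in base.parameters()]
head_grads = [p.grad for p in head.parameters()]
print(frozen_grads)                    # gradients of the frozen parameters
print([g.shape for g in head_grads])   # gradients of the trainable head
```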

In [23]:
# toy feed-forward net
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 5)
        self.fc3 = nn.Linear(5, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x

In [26]:
# define random data
random_input = torch.randn(10,)
random_target = torch.randn(1,)

# define net
net = Net()

# print fc2 weight
print('fc2 weight before train:')
print(net.fc2.weight)

# train the net
criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)
for i in range(100):
    net.zero_grad()
    output = net(random_input)
    loss = criterion(output, random_target)
    loss.backward()
    optimizer.step()

# print the trained fc2 weight
print('fc2 weight after train:')
print(net.fc2.weight)

# save the net
torch.save(net.state_dict(), 'model')

fc2 weight before train:
Parameter containing:
tensor([[ 0.0712,  0.1818, -0.3325,  0.0632,  0.3794],
        [ 0.4307, -0.3777, -0.0134, -0.3388,  0.3315],
        [ 0.0073,  0.0054,  0.4047,  0.1378,  0.0897],
        [ 0.3749, -0.0928, -0.0474, -0.0010,  0.2380],
        [-0.2007, -0.2979, -0.2473, -0.0802,  0.0305]], requires_grad=True)
fc2 weight after train:
Parameter containing:
tensor([[ 0.0181,  0.1118, -0.3705,  0.0418,  0.4444],
        [ 0.4631, -0.3315,  0.0101, -0.3224,  0.2941],
        [-0.0406, -0.0575,  0.3705,  0.1188,  0.1484],
        [ 0.3521, -0.1250, -0.0639, -0.0122,  0.2645],
        [-0.2020, -0.2988, -0.2481, -0.0800,  0.0326]], requires_grad=True)


In [27]:
# delete and redefine the net
del net
net = Net()

# load the weight
net.load_state_dict(torch.load('model'))

# print the pre-trained fc2 weight
print('fc2 pretrained weight (same as the one above):')
print(net.fc2.weight)

fc2 pretrained weight (same as the one above):
Parameter containing:
tensor([[ 0.0181,  0.1118, -0.3705,  0.0418,  0.4444],
        [ 0.4631, -0.3315,  0.0101, -0.3224,  0.2941],
        [-0.0406, -0.0575,  0.3705,  0.1188,  0.1484],
        [ 0.3521, -0.1250, -0.0639, -0.0122,  0.2645],
        [-0.2020, -0.2988, -0.2481, -0.0800,  0.0326]], requires_grad=True)


In [28]:
# we want to freeze the fc2 layer this time: only train fc1 and fc3
net.fc2.weight.requires_grad = False
net.fc2.bias.requires_grad = False

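# NOTE: older PyTorch versions only accepted parameters that require grad in an
# optimizer (see https://github.com/pytorch/pytorch/issues/679); passing frozen ones
# raised "ValueError: optimizing a parameter that doesn't require gradients".
# In those versions, filter the parameters instead:
# optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.1)
# Here all parameters are passed; the frozen ones get no gradient, so they are not updated.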
optimizer = optim.Adam(net.parameters(), lr=0.1)

In [29]:
# train again
criterion = nn.MSELoss()

# define new random data
random_input = torch.randn(10,)
random_target = torch.randn(1,)

for i in range(100):
    net.zero_grad()
    output = net(random_input)
    loss = criterion(output, random_target)
    loss.backward()
    optimizer.step()

# print the retrained fc2 weight
# note that the weight is same as the one before retraining: only fc1 & fc3 changed
print('fc2 weight (frozen) after retrain:')
print(net.fc2.weight)

fc2 weight (frozen) after retrain:
Parameter containing:
tensor([[ 0.0181,  0.1118, -0.3705,  0.0418,  0.4444],
        [ 0.4631, -0.3315,  0.0101, -0.3224,  0.2941],
        [-0.0406, -0.0575,  0.3705,  0.1188,  0.1484],
        [ 0.3521, -0.1250, -0.0639, -0.0122,  0.2645],
        [-0.2020, -0.2988, -0.2481, -0.0800,  0.0326]])


In [30]:
# let's unfreeze the fc2 layer this time for extra tuning
net.fc2.weight.requires_grad = True
net.fc2.bias.requires_grad = True

# # add the unfrozen fc2 weight to the current optimizer
# optimizer.add_param_group({'params': net.fc2.parameters()})

# re-retrain
for i in range(100):
    net.zero_grad()
    output = net(random_input)
    loss = criterion(output, random_target)
    loss.backward()
    optimizer.step()

# print the re-retrained fc2 weight
# note that this time the fc2 weight also changed
print('fc2 weight (unfrozen) after re-retrain:')
print(net.fc2.weight)

fc2 weight (unfrozen) after re-retrain:
Parameter containing:
tensor([[-0.2066,  0.2041, -0.4219,  0.1379,  0.5784],
        [-0.1694,  0.4880, -0.8607,  0.2167,  0.0865],
        [-0.4725,  0.1525,  0.1350,  0.3146,  0.1834],
        [ 0.6044, -0.7760,  0.4370, -0.5776, -0.0451],
        [-0.5125,  0.4816, -0.8304,  0.5518,  0.2252]], requires_grad=True)


## References:

* [[Youtube] PyTorch Autograd Explained - In-depth Tutorial](https://www.youtube.com/watch?v=MswxJw-8PvE)
* [Autograd Mechanics](https://pytorch.org/docs/stable/notes/autograd.html)
* [PyTorch Autograd - Understanding the heart of PyTorch’s magic](https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95)