# PyTorch Tutorial - Part 2
Having gone through Part 1, you should now be familiar with creating Tensor objects and operating on them. As a reminder, indexing a Tensor is similar to indexing a NumPy array. You should also go through the tensor broadcasting rules (similar to the broadcasting rules for NumPy arrays) at https://pytorch.org/docs/stable/notes/broadcasting.html. Below is a set of exercises on indexing and broadcasting semantics for you to attempt before starting this tutorial.

## Exercise
1. Create a 2-d Tensor of size 8x3. Extract all the even-numbered rows by slicing and print them.
2. Create a 2-d Tensor of size 8x3. Extract all the even-numbered rows in reverse order by slicing and print them.
3. Create a 4-d Tensor named t of size 2x3x4x2. What does t[0, ..., 1] extract?
4. For the Tensor in part 1, set the 2nd row (row index 1) to the Tensor with data [2, 3, 4] with just one line of code.
5. For the Tensor in part 3, set the 4th element to -22.7. (only one line of code)
6. For the Tensor in part 3, extract the 1st and 3rd rows of the middle 3x4 array across 1st and 4th dimensions (both of which have size 2). Do this with one line of code.
7. A Tensor with scalar data can be added to a Tensor of size 2. (True or False)
8. A Tensor of size 2x3 can be added to another Tensor of size 2, with both having the same datatype. (True or False)
9. A Tensor of size 2x3 can be added to another Tensor of size 3, with both having the same datatype. (True or False)
10. A Tensor of size 2x1x3x1 can be elementwise multiplied with another Tensor of size 1x2, with both Tensors having the same datatype. (True or False)
11. What will be the size of the output of elementwise multiplying a Tensor of size 2x1x3x1 with another Tensor of size 1x2, assuming both Tensors have the same datatype?
12. For the Tensor defined in part 1, extract the 1st row 2nd element, 5th row 1st element and 8th row 1st element with one line of code.
13. For the Tensor defined in part 1, extract all the corner elements, i.e. the elements at positions (0, 0), (0, 2), (7, 0) and (7, 2). (only one line of code)
14. For the Tensor defined in part 1, extract all the rows whose mean is >= 2. (only one line of code)

In [0]:
import torch

In [35]:
# 1. Create a 2-d Tensor of size 8x3. Extract all the even numbered rows by slicing and print it.
t1 = torch.randn((8,3))
t1_same = t1  # alias, not a copy: the in-place edit to t1 in exercise 4 is also visible through t1_same
print("Ans1:\nt1: ", t1)
# indices1 = torch.LongTensor([i for i in range(t1.shape[0]) if not (i%2)])
# res1 = t1.index_select(dim=0, index=indices1)
print("res1: ", t1[::2,:])
print("................................................................................\n")

# 2. Create a 2-d Tensor of size 8x3. Extract all the even numbered rows in reverse order by slicing and print it.
t2 = torch.randint(low=0, high=100, size=[8,3])
print("Ans2:\nt2: ", t2)
# indices2 = torch.LongTensor([i for i in range(t2.shape[0]-1, -1, -1) if not (i%2)])
# res2 = t2.index_select(dim=0, index=indices2)
print("res2: Negative step is not yet supported.", )
print("................................................................................\n")

# 3. Create a 4-d tensor named t of size 2x3x4x2. What does t[0, ..., 1] extract?
t3 = torch.randn((2,3,4,2))
t3_same = t3
print("Ans3:\nt3[0, ..., 1].shape: ", t3[0, ..., 1].shape, "\ntorch.equal(t3[0, ..., 1],t3[0,:,:,1]) = ", torch.equal(t3[0, ..., 1],t3[0,:,:,1]))
print("................................................................................\n")

# 4. For the tensor in part 1, set the 2nd row (row index is 1) to the Tensor with data [2, 3, 4] with just one line of code.
t1[1] = torch.Tensor([2, 3, 4])
print("Ans4:\nAfter setting: t1[1] = torch.Tensor([2, 3, 4])\nt1: ", t1)
print("t1[1]: ", t1[1])
print("................................................................................\n")

# 5. For the Tensor in part3, set the 4th element to -22.7. (only one line of code)
t3[0,0,1,1] = torch.Tensor([-22.7])
print("Ans5:\nAfter setting: t3[0,0,1,1] = torch.Tensor(-22.7):\nt3: ", t3)
print("t3[0,0,1,1]: ", t3[0,0,1,1]) 
print("................................................................................\n")

# 6. For the Tensor in part 3, extract the 1st and 3rd rows of the middle 3x4 array across 1st and 4th dimensions (both of which have size 2).
#    Do this with one line of code.
t3 = t3_same  # no-op: t3_same is an alias of t3, so the edit from exercise 5 is still present
print("Ans6:\nres6 = t3[:,::2,:,:] = ", t3[:,::2,:,:])
print("................................................................................\n")

# 7. Tensor with scalar data can be added to a Tensor with size 2. (True/False)
t7_1 = torch.tensor([2], dtype=torch.float).cuda()
t7_2 = torch.tensor([2,2], dtype=torch.float).cuda()
print("Ans7:\nt7_1 + t7_2 = ", t7_1+t7_2,"\nAns: True")
print("................................................................................\n")

# 8. Tensor of size 2x3 can be added to another Tensor of size 2 with both having same datatype. (True or False)
t8_1 = torch.randn((2,3),dtype=torch.float).cuda()
t8_2 = torch.tensor([1,1], dtype=torch.float)
print("Ans8: False")
print("................................................................................\n")

# 9. Tensor of size 2x3 can be added to another Tensor of size 3 with both having same datatype. (True or False)
t9_1 = torch.randn((2,3),dtype=torch.float).cuda()
t9_2 = torch.tensor([1,1,1], dtype=torch.float).cuda()
print("Ans9: True")
print("................................................................................\n")

# 10. Tensor of size 2x1x3x1 can be elementwise multiplied with another Tensor of size 1x2 with both Tensors having same datatype. (True or False)
t10_1 = torch.randn((2,1,3,1), dtype=torch.float).cuda()
t10_2 = torch.randn((1,2), dtype=torch.float).cuda()
print("Ans10: .True.")
print("................................................................................\n")

# 11. What will be the size of the output of elementwise multiplying a Tensor of size 2x1x3x1
#     with another Tensor of size 1x2 assuming both Tensors have same datatype?
print("Ans11: Resultant size will be 2x1x3x2.")
print("................................................................................\n")

# 12. For the Tensor defined in part1, extract 1st row 2nd element, 5th row 1st element and 8th row 1st element with one line of code.
t12 = t1_same
# Also, res12 = t12[(0,4,7),(1,0,0)]
print("Ans12:\nres12 = torch.gather(t12.index_select(dim=0, index=torch.LongTensor([0,4,7])),dim=1,index=torch.LongTensor([[1],[0],[0]]))")
print("So, res12 = ", torch.gather(t12.index_select(dim=0, index=torch.LongTensor([0,4,7])),dim=1,index=torch.LongTensor([[1],[0],[0]])))
print("................................................................................\n")

# 13. For the Tensor defined in part1, extract all the corner elements i.e elements
# at positions (0, 0), (0, 2), (7, 0) and (7, 2).(only one line of code).
t13 = t1_same
print(t13)
print("Ans13:\nres13 = t13[::t13.shape[0]-1,::t13.shape[1]-1]")
# also, res13 = t13[(0,0,7,7),(0,2,0,2)]
print("So, res13 = ", t13[::t13.shape[0]-1,::t13.shape[1]-1])
print("................................................................................\n")

# 14. For the tensor defined in part 1, extract all the rows whose mean is >= 2. (only one line of code)
t14 = t1_same
print(t14)
# also (as a Python list of rows), res14 = [t14[i] for i in range(t14.shape[0]) if t14[i].mean() >= 2]
# also (with a boolean mask), res14 = t14[t14.mean(dim=1) >= 2]
res14 = t14.index_select(dim=0,index = torch.LongTensor([i for i in range(t14.shape[0]) if t14[i,:].mean() >= 2]))
print("Ans14:\nres14 = ", res14)
print("................................................................................\n")


Ans1:
t1:  tensor([[ 0.4849, -0.6040, -1.1029],
        [-1.7819,  1.0709, -0.6504],
        [ 0.5057,  0.9325, -1.1939],
        [-0.5370,  1.0823, -1.5110],
        [ 1.6736, -1.1163,  0.6021],
        [-1.1221,  1.2851, -1.8526],
        [ 0.2900,  1.3086, -0.4059],
        [-0.0078,  0.6413, -1.4778]])
res1:  tensor([[ 0.4849, -0.6040, -1.1029],
        [ 0.5057,  0.9325, -1.1939],
        [ 1.6736, -1.1163,  0.6021],
        [ 0.2900,  1.3086, -0.4059]])
................................................................................

Ans2:
t2:  tensor([[97, 80, 23],
        [19, 76, 62],
        [25, 21,  7],
        [32, 78, 74],
        [68, 69, 52],
        [51, 40, 14],
        [33, 36, 16],
        [63, 74, 77]])
res2: Negative step is not yet supported.
................................................................................

Ans3:
t3[0, ..., 1].shape:  torch.Size([3, 4]) 
torch.equal(t3[0, ..., 1],t3[0,:,:,1]) =  True
...............................................
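Since slicing with a negative step is not supported, exercise 2 needs a different route. The sketch below is one possible answer, assuming `torch.flip` is acceptable in place of pure slicing; it also checks the broadcasting exercises 8, 9 and 11 by simply running the operations. The tensor `t` and the variable names are illustrative choices, not part of the original solution.

```python
import torch

# Exercise 2: even-numbered rows in reverse order.
# A slice such as t[::-2] is rejected by PyTorch, so slice first
# and then reverse along dim 0 with torch.flip.
t = torch.arange(24).reshape(8, 3)
res2 = torch.flip(t[::2], dims=[0])   # rows 6, 4, 2, 0
print(res2)

# Exercises 8, 9 and 11: let the actual operation decide.
try:
    torch.zeros(2, 3) + torch.zeros(2)
    ex8 = True
except RuntimeError:                  # shapes (2, 3) and (2,) do not broadcast
    ex8 = False
ex9_shape = (torch.zeros(2, 3) + torch.zeros(3)).shape
ex11_shape = (torch.zeros(2, 1, 3, 1) * torch.zeros(1, 2)).shape
print(ex8, ex9_shape, ex11_shape)
```

The broadcasting rule compares sizes from the trailing dimension backwards, which is why a size-3 Tensor fits a 2x3 Tensor while a size-2 Tensor does not.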

In this tutorial we will focus on the requires_grad, grad and grad_fn attributes of a Tensor object. We will also look at the 'backward' method and at the torch.autograd package.

Until now, all the Tensors we created had requires_grad set to False by default. For example, see below.

In [36]:
import torch
t = torch.tensor([[1, 2], [3, 4]])
t.requires_grad

False

So, what is this requires_grad? Why is it required?
<br>
<br>
In PyTorch, each Tensor has an associated directed acyclic graph (DAG), called its computation graph, that records how the Tensor was computed. For example, if a Tensor w is the sum of two other Tensors x and y, then the computation graph of w has the expressions x and y as leaves and, as its root, the expression x+y that computed w. The computation graph of x is simply a leaf node with the expression x; similarly, the computation graph of y is a leaf node with the expression y. If we have a Tensor u equal to $z*w$, then the computation graph of u has the expressions x, y and z as leaves, the expression x+y that computed w as an internal node, and the expression $z*w$ that computed u as the root. Note that each Tensor has its own associated computation graph, and that these graphs share nodes: the expression x+y involved in computing w is also part of computing u=$z*w$, so it belongs to the computation graphs of both w and u.
<br>
<br>
Now, if the requires_grad attribute of a Tensor t is True, then the 'backward' method defined in the Tensor class can be called on t to compute the gradients of t w.r.t. the Tensors involved in computing it. These gradients are made available in the grad attribute of those Tensors. The 'backward' method computes the gradients automatically by traversing the graph backwards and applying the chain rule. This is how AutoGrad (automatic differentiation) works in PyTorch. Of course, the gradient of t w.r.t. a Tensor that is involved in computing it but whose requires_grad is False will be None. The general rule for propagating requires_grad is as follows: if even a single operand of an operation has requires_grad set to True, the output will also have requires_grad set to True. In other words, the output does not require gradient (its requires_grad is False) only when none of the operands requires gradient.
<br>
<br>
We know that such gradient computations are required in deep learning implementations. That's why PyTorch introduced requires_grad attribute to a Tensor object. If that attribute is set to True for a Tensor, gradients of the Tensor w.r.to Tensors involved in computing it can be automatically obtained using the 'backward' method.
<br>
<br>
Let us see some examples.

In [37]:
x = torch.tensor(2., requires_grad = True) 
y = torch.tensor(1.)
w = x+y
print(f'requires_grad of x, y, w: {x.requires_grad}\t{y.requires_grad}\t{w.requires_grad}')
z = torch.tensor(-1., requires_grad = True)
u = z*w
print(f'requires_grad of z, w, u: {z.requires_grad}\t{w.requires_grad}\t{u.requires_grad}')

requires_grad of x, y, w: True	False	True
requires_grad of z, w, u: True	True	True


Suppose we want to compute the gradients (derivatives here) of u w.r.to z, w, x and y respectively. Let us denote by $dp$ the entity $\frac{d}{dp}(u)$. We should expect the following results:
<br>
$dz$ = $w$ = 3
<br>
$dw$ = $z$ = -1
<br>
$dy$ = $dw*\frac{dw}{dy}$ = $-1*1$ = -1
<br>
$dx$ = $dw*\frac{dw}{dx}$ = $-1*1$ = -1

Let us check if we get these results by calling 'backward' on the Tensor u.


In [38]:
u.backward()
print(f'dz: {z.grad}')
print(f'dw: {w.grad}')
print(f'dy: {y.grad}')
print(f'dx: {x.grad}')

dz: 3.0
dw: None
dy: None
dx: -1.0


We got the right answers for $dz$ and $dx$ (note that the gradients are available in the grad attribute of the Tensors). But what about $dw$ and $dy$? $dy$ is not available because requires_grad of y is False. $dw$ is not available because, by default, gradients are accumulated in the grad attribute only for leaf Tensors. Since $w$ is an internal node, its gradient is not retained. To retain the gradient of an internal node of the computation graph, we have to call the 'retain_grad' method on the corresponding Tensor before calling 'backward'; this enables its grad attribute. To set the requires_grad attribute of an existing Tensor, we can call the 'requires_grad_' method. See the revised example below.

In [39]:
x = torch.tensor(2., requires_grad = True) 
y = torch.tensor(1.)
y.requires_grad_(requires_grad = True)
w = x+y
w.retain_grad()
print(f'requires_grad of x, y, w: {x.requires_grad}\t{y.requires_grad}\t{w.requires_grad}')
z = torch.tensor(-1., requires_grad = True)
u = z*w
print(f'requires_grad of z, w, u: {z.requires_grad}\t{w.requires_grad}\t{u.requires_grad}')

u.backward()
print(f'dz: {z.grad}')
print(f'dw: {w.grad}')
print(f'dy: {y.grad}')
print(f'dx: {x.grad}')

requires_grad of x, y, w: True	True	True
requires_grad of z, w, u: True	True	True
dz: 3.0
dw: -1.0
dy: -1.0
dx: -1.0


Now, let us try to call backward on u once again.

In [40]:
u.backward()

RuntimeError: ignored

What happened?
<br>
<br>
Backward propagation is unable to proceed because the computation graph associated with the Tensor u has already been freed. In PyTorch, once the 'backward' method has been called on a Tensor, the intermediate values that its computation graph saved for the backward pass are freed at the end of the call. If we want to keep the graph so that we can call 'backward' again, we should explicitly set the retain_graph parameter of 'backward' to True. PyTorch can afford to free the graph after one backward pass, which saves memory, because the graph is rebuilt dynamically during every forward pass. This dynamic graph construction is a very powerful feature of PyTorch. But in situations where you want 'backward' to run again on the same graph, make sure the retain_graph parameter of the 'backward' call is set to True.
<br>
<br>
Let us see an example.

In [41]:
x = torch.tensor(2., requires_grad = True) 
y = torch.tensor(1., requires_grad = True)
w = x+y
w.retain_grad()
print(f'requires_grad of x, y, w: {x.requires_grad}\t{y.requires_grad}\t{w.requires_grad}')
z = torch.tensor(-1., requires_grad = True)
u = z*w
print(f'requires_grad of z, w, u: {z.requires_grad}\t{w.requires_grad}\t{u.requires_grad}')

u.backward(retain_graph = True) # backward call; graph is not freed after this call
print('Grads after first call to backward')
print(f'dz: {z.grad}')
print(f'dw: {w.grad}')
print(f'dy: {y.grad}')
print(f'dx: {x.grad}\n\n')

u.backward() # backward call again; graph will be freed after this call
print('Grads after second call to backward')
print(f'dz: {z.grad}')
print(f'dw: {w.grad}')
print(f'dy: {y.grad}')
print(f'dx: {x.grad}')

requires_grad of x, y, w: True	True	True
requires_grad of z, w, u: True	True	True
Grads after first call to backward
dz: 3.0
dw: -1.0
dy: -1.0
dx: -1.0


Grads after second call to backward
dz: 6.0
dw: -2.0
dy: -2.0
dx: -2.0


Observe the grads after the second call to 'backward': they have doubled. This is because PyTorch accumulates gradients into the grad attribute of a Tensor. In the above example, we know that the gradient of u w.r.t. z is 3.0. So, after the first call to 'backward', the grad attribute of z is 3.0. After the second call to 'backward', the gradient of u w.r.t. z is still 3.0, but when it is accumulated into the grad attribute of z, which already holds 3.0, the result is 6.0.
<br>
<br>
One more 'backward' call on u will raise an error, since we did not set retain_graph to True during the second call, so the graph is freed after it. However, do not assume that when the graph is freed, all the associated Tensors are freed too. The graph is not made of Tensors; it is built only from the expressions that computed them. The Tensors, along with their attributes and methods, remain alive. To appreciate this point, let us call 'backward' on the Tensor w. Recall that the computation graph of w is the part of the graph of u that computed w, namely the addition x+y. The call still succeeds here because the addition node does not need any saved intermediate values for its backward pass, so freeing the graph of u left nothing missing; had w been computed by an operation that saves its inputs (for example x*y), this call would raise the same RuntimeError.

In [42]:
w.backward() # backward call again; graph will be freed after this call
print('Grads after call to backward on w')
print(f'dz: {z.grad}')
print(f'dw: {w.grad}')
print(f'dy: {y.grad}')
print(f'dx: {x.grad}')

Grads after call to backward on w
dz: 6.0
dw: -1.0
dy: -1.0
dx: -1.0


Observe that $dz$ is still available (even though the graph associated with u has already been freed) and has not changed, since z is not part of the computation graph of w. However, $dw$, $dx$ and $dy$ have changed. The derivative of w w.r.t. itself is 1 and, since w=x+y, the derivatives of w w.r.t. x and y are both 1. From the previous calls, the grad attributes of w, x and y each hold -2.0. Since grad accumulates gradients, the current call adds 1 to -2.0, giving -1.0 as the latest grad value for each of w, x and y.
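Whether a repeated 'backward' call fails depends on whether the operations in the graph saved values that have since been freed. The sketch below illustrates this with two small graphs of its own (the names `p` and `s` are illustrative): a product, which saves its inputs for the backward pass, and a sum, which saves nothing.

```python
import torch

x = torch.tensor(2., requires_grad=True)
y = torch.tensor(1., requires_grad=True)

# Multiplication saves its inputs for the backward pass,
# so a second backward over the freed graph fails.
p = x * y
p.backward()
try:
    p.backward()
    mul_second_call_ok = True
except RuntimeError:
    mul_second_call_ok = False

# Addition saves nothing, so repeating backward still works
# (and keeps accumulating into .grad).
s = x + y
s.backward()
s.backward()
print(mul_second_call_ok, x.grad, y.grad)
```

Relying on this behaviour is fragile; passing retain_graph=True is the explicit way to ask for a graph that can be traversed again.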

Having gained familiarity with the requires_grad and grad attributes, we will now look at another attribute of a PyTorch Tensor, namely grad_fn. Let us look at the example below.

In [43]:
x = torch.tensor(2., requires_grad = True) 
y = torch.tensor(1., requires_grad = True)
w = x+y
print(f'Gradient functions associated with x and y: {x.grad_fn}\t{y.grad_fn}')
print(f'Gradient function associated with w: {w.grad_fn}')
print(f'Type of the gradient function associated with w: {type(w.grad_fn)}')

Gradient functions associated with x and y: None	None
Gradient function associated with w: <AddBackward0 object at 0x7f07a1ca6d30>
Type of the gradient function associated with w: <class 'AddBackward0'>


As you can see, the grad_fn of Tensor w is an object of type 'AddBackward0'. This is because w is obtained by adding x and y, so the 'Function' (i.e. the operation) that computed w is addition. Formally, PyTorch builds the computation graph of w using these 'Function' objects as nodes. The leaves are the expressions x and y, whose grad_fn attributes are None. The root is the expression x+y, associated with the AddBackward0 function object. Note that, by the chain rule, the gradient at a node is the product of the incoming gradient and the local gradient. The 'Function' objects stored in the grad_fn attribute enable AutoGrad to compute the local gradients. One more example is shown below.

In [44]:
x = torch.tensor(2., requires_grad = True) 
y = torch.tensor(1., requires_grad = True)
w = x+y
z = torch.tensor(-1., requires_grad = True)
u = z*w
print(f'Gradient functions associated with x and y: {x.grad_fn}\t{y.grad_fn}')
print(f'Gradient function associated with w and z: {w.grad_fn}\t{z.grad_fn}')
print(f'Gradient function associated with u: {u.grad_fn}')

Gradient functions associated with x and y: None	None
Gradient function associated with w and z: <AddBackward0 object at 0x7f07a1425240>	None
Gradient function associated with u: <MulBackward0 object at 0x7f07a14253c8>
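The graph behind a Tensor can be inspected by following grad_fn backwards. The sketch below is a minimal walker, assuming the `next_functions` attribute of a grad_fn node, which holds (node, index) pairs for its inputs; the helper name `collect_names` is hypothetical. Leaf Tensors that require grad show up as AccumulateGrad nodes, which is where the grad attribute gets filled in.

```python
import torch

def collect_names(fn, depth=0, out=None):
    """Collect the node type names reachable from a grad_fn, depth first."""
    if out is None:
        out = []
    if fn is None:                      # input that does not require grad
        return out
    out.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        collect_names(next_fn, depth + 1, out)
    return out

x = torch.tensor(2., requires_grad=True)
y = torch.tensor(1., requires_grad=True)
z = torch.tensor(-1., requires_grad=True)
u = z * (x + y)

names = collect_names(u.grad_fn)
print("\n".join(names))
```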


Until now we have dealt with a simple example that had only scalar Tensors (i.e. PyTorch Tensors whose data are scalars). Let us look at another example. Say we want to compute $u$ = $\frac{1}{4}\sum_{i=1}^{4}z_i$ where $z$ = $x+y$, and $x$ and $y$ are 2-d Tensors of the same type and size 2x2. So, $u$ is a function from $R^4$ to $R$ and $z$ is a function from $R^4 \times R^4$ to $R^4$. The gradients of $u$ w.r.t. $z$, $y$ and $x$ are as follows:
<br>
<br>
$dz$ =  local_gradient * incoming_gradient = $J_z(u)^T*\frac{d}{du}(u)$ 
<br>
&emsp;&nbsp; &nbsp; = $[\frac{du}{dz_1} \hspace{0.25cm}\cdots\hspace{0.25cm}\frac{du}{dz_4}]^T*1.0$ 
<br>
&emsp;&nbsp; &nbsp; =  $[0.25\hspace{0.25cm} 0.25 \hspace{0.25cm}0.25\hspace{0.25cm}0.25]^T*1.0$ 
<br>
&emsp;&nbsp; &nbsp; = $\begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\0.25 \end{bmatrix}$ 
<br>
<br>
where $J_z(u)$ is the Jacobian of $u$ w.r.to $z$.

The resultant $dz$, when accumulated into the grad attribute of z, is reshaped to 2x2 since $z$ is 2x2.
<br>
<br>
$dy$ = local_gradient * incoming_gradient = $J_y(z)^T*dz$ = $\begin{bmatrix} \frac{dz_1}{dy_1} & ... & \frac{dz_1}{dy_4}\\ \vdots & \vdots & \vdots \\ \frac{dz_4}{dy_1} & ... & \frac{dz_4}{dy_4} \end{bmatrix}^T * dz$ 
<br>
&emsp;&nbsp; &nbsp; =  $\begin{bmatrix} 1.0 & 0 & 0 & 0 \\ 0 & 1.0 & 0 & 0 \\ 0 & 0 & 1.0 & 0 \\ 0 & 0 & 0 & 1.0 \end{bmatrix}^T*\begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{bmatrix}$ 
<br>
&emsp;&nbsp; &nbsp; = $\begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{bmatrix}$
<br>
<br>
The resultant $dy$, when accumulated into the grad attribute of y, is reshaped to 2x2 since $y$ is 2x2.

Similarly, $dx$ can be computed.
<br>
<br>
In fact, the 'backward' method performs the above set of computations. If the Tensor on which 'backward' is called is scalar valued (like $u$), the initial incoming gradient is taken to be 1. Then, at every node, 'backward' forms the local gradient, i.e. the transpose of the Jacobian of the Tensor from which the gradient comes in w.r.t. the Tensor associated with the current node, and multiplies it with the incoming gradient to give the gradient at the current node. For example, at the node associated with z, the transpose of the Jacobian of $u$ w.r.t. $z$, $J_z(u)^T$, is multiplied with the incoming gradient 1.0 to give $dz$. The resultant gradient is accumulated into the grad attribute of the Tensor associated with the current node. Note that the Jacobians are not explicitly available: they are internal to the 'backward' method, which only needs the Jacobian-transpose-times-vector products and does not have to build the full matrices.
<br>
<br>
Let us code this example and check.

In [45]:
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad = True) 
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad = True)
z = x+y
z.retain_grad()
u = torch.mean(z)

u.backward()
print(f'dz: {z.grad}\n')
print(f'dy: {y.grad}\n')
print(f'dx: {x.grad}\n')

dz: tensor([[0.2500, 0.2500],
        [0.2500, 0.2500]])

dy: tensor([[0.2500, 0.2500],
        [0.2500, 0.2500]])

dx: tensor([[0.2500, 0.2500],
        [0.2500, 0.2500]])



Suppose that in the above example $u$ = $2*z$ instead of $u$ = $\frac{1}{4}\sum_{i=1}^{4}z_i$. Then $u$ is no longer a scalar-valued Tensor but a 2-d Tensor of size 2x2. In this case, 'backward' requires the initial incoming gradient to be supplied explicitly, with the same size and type as $u$, unlike the scalar-valued case where it is assumed to be 1. The following example will give an error.

In [46]:
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad = True) 
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad = True)
z = x+y
z.retain_grad()
u = 2*z

u.backward()
print(f'dz: {z.grad}\n')
print(f'dy: {y.grad}\n')
print(f'dx: {x.grad}\n')

RuntimeError: ignored

To fix this, we supply the initial incoming gradient through the gradient parameter of the 'backward' method. See the example below.

In [47]:
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad = True) 
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad = True)
z = x+y
z.retain_grad()
u = 2*z

u.backward(gradient = torch.tensor([[1., 1.],[1., 1.]]))
print(f'dz: {z.grad}\n')
print(f'dy: {y.grad}\n')
print(f'dx: {x.grad}\n')

dz: tensor([[2., 2.],
        [2., 2.]])

dy: tensor([[2., 2.],
        [2., 2.]])

dx: tensor([[2., 2.],
        [2., 2.]])



Note that the Jacobian at the node of the computation graph associated with z is:
<br>
<br>
&emsp; &emsp; &emsp; $\begin{bmatrix} \frac{du_1}{dz_1} & ... & \frac{du_1}{dz_4} \\ \vdots & \vdots & \vdots \\  \frac{du_4}{dz_1} & ... & \frac{du_4}{dz_4}\end{bmatrix}$
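The relation between this Jacobian and what 'backward' returns can be checked numerically. The sketch below is an illustration under its own assumptions: it uses a flat 4-element Tensor and the elementwise function `f(z) = z**2` (so that the Jacobian, diag(2z), is not just a multiple of the identity as it is for $u$ = $2*z$), and it builds the explicit Jacobian with `torch.autograd.functional.jacobian` purely for comparison.

```python
import torch
from torch.autograd.functional import jacobian

def f(z):
    return z ** 2            # elementwise, so the Jacobian is diag(2*z)

z = torch.tensor([1., 2., 3., 4.], requires_grad=True)
v = torch.tensor([1., 0.5, 2., -1.])   # incoming gradient

f(z).backward(gradient=v)    # autograd's Jacobian-transpose-times-vector product
J = jacobian(f, z.detach())  # explicit 4x4 Jacobian, for comparison only
manual = J.T @ v

print(z.grad)
print(manual)
```

Both printed Tensors should agree, which is the statement that the gradient argument of 'backward' plays the role of the incoming gradient.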

Now, we will look at a couple more important methods defined in the Tensor class.
First, we will look at the 'register_hook' method. Using this method we can register a hook (i.e. a function) on the Tensor on which it is called. The hook will be called every time the gradient w.r.t. this Tensor is computed. The function must take the gradient of this Tensor as its argument. It must not modify this gradient in place. It may optionally return a new gradient, or None. If it returns a new gradient, this new gradient will be used for further backward propagation. 
<br>
Let us revise the above example by registering a hook for z. The hook returns a new gradient that is the negation of the gradient at z. This new gradient is used further in the backpropagation to compute the gradients at x and y. In the output below, the grad attribute of z itself is not modified; this reflects the PyTorch version used to run this notebook, and more recent releases may instead record the hook's returned gradient in z.grad. 'register_hook' returns a handle, and to remove the hook we call the 'remove' method on that handle. See the demo below.

In [63]:
def hook_z(grad):
    return grad * -1

x = torch.tensor([[1., 2.], [3., 4.]], requires_grad = True) 
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad = True)
z = x+y
z.retain_grad()
u = 2*z

h = z.register_hook(hook_z)
u.backward(gradient = torch.tensor([[1., 1.],[1., 1.]]))
print(f'dz: {z.grad}\n')
print(f'dy: {y.grad}\n')
print(f'dx: {x.grad}\n')

h.remove()

dz: tensor([[2., 2.],
        [2., 2.]])

dy: tensor([[-2., -2.],
        [-2., -2.]])

dx: tensor([[-2., -2.],
        [-2., -2.]])



We will now look at the 'detach' method. To understand it, imagine a situation where $z$ = $x+y$ and $u$ = $2*z$. Suppose we want to compute another Tensor $w$ = $z-v$, in which we want to use the $z$ computed from $x+y$ but do not require the gradient of $w$ w.r.t. $z$. Note that $z$ is then part of the computation graphs of both $u$ and $w$. We cannot simply switch off requires_grad on $z$: it is a non-leaf Tensor, and PyTorch only allows the requires_grad flag of leaf Tensors to be changed; even if we could, $dx$ and $dy$ would then no longer be computed. If requires_grad of $z$ stays True, the gradient from a 'backward' call on $w$ will also be accumulated into the grad attribute of $z$ and will flow on to $dx$ and $dy$. To resolve this situation, we can call the 'detach' method on z. It returns a new Tensor whose requires_grad is False but which shares memory with the original Tensor z. So, we have to be careful not to modify the new Tensor in place, as that will also change z and may have a cascading effect on later computations. 

In [66]:
x = torch.tensor([[1., 2.], [3., 4.]], requires_grad = True) 
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad = True)
z = x+y
z.retain_grad()
u = 2*z

v = torch.tensor([[9., 10.], [11., 12.]], requires_grad = True)
w = z.detach() - v
# w = z - v

u.backward(gradient = torch.tensor([[1., 1.],[1., 1.]]), retain_graph = True)
print('Grads after call to backward on u')
print(f'dz: {z.grad}\n')
print(f'dy: {y.grad}\n')
print(f'dx: {x.grad}\n')

w.backward(gradient = torch.tensor([[1., 1.],[1., 1.]]))
print('Grads after call to backward on w')
print(f'dz: {z.grad}\n')
print(f'dv: {v.grad}\n')

u.backward(gradient = torch.tensor([[1., 1.],[1., 1.]]))
print('Grads after 2nd call to backward on u')
print(f'dz: {z.grad}\n')
print(f'dy: {y.grad}\n')
print(f'dx: {x.grad}\n')

Grads after call to backward on u
dz: tensor([[2., 2.],
        [2., 2.]])

dy: tensor([[2., 2.],
        [2., 2.]])

dx: tensor([[2., 2.],
        [2., 2.]])

Grads after call to backward on w
dz: tensor([[2., 2.],
        [2., 2.]])

dv: tensor([[-1., -1.],
        [-1., -1.]])

Grads after 2nd call to backward on u
dz: tensor([[4., 4.],
        [4., 4.]])

dy: tensor([[4., 4.],
        [4., 4.]])

dx: tensor([[4., 4.],
        [4., 4.]])



You can clearly see in the above example that $dz$ is not modified when 'backward' is called on $w$. So, during the second call to 'backward' on $u$, the gradients at $x$, $y$ and $z$ doubled as expected. Check for yourself what happens if 'detach' is not called on z in the computation of w.
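The memory sharing of 'detach' is easy to see directly. The sketch below, with illustrative names `zd` and `zc`, writes into a detached Tensor and shows that the original changes too, while a detached clone stays independent. No 'backward' call is made here; it only demonstrates the aliasing.

```python
import torch

x = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
y = torch.tensor([[5., 6.], [7., 8.]], requires_grad=True)
z = x + y

zd = z.detach()           # shares storage with z
zc = z.detach().clone()   # independent copy

zd[0, 0] = 100.           # in-place write through the detached Tensor
print(z[0, 0].item(), zc[0, 0].item())
print(zd.requires_grad, zd.data_ptr() == z.data_ptr())
```

When an independent value is needed, cloning after detaching avoids this aliasing at the cost of an extra copy.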

We have seen that gradients accumulate in the grad attribute. If you do not want this accumulation, and want only the gradients from the current call, you can explicitly reset the grad attribute to zero before further 'backward' calls. See the following example.

In [67]:
x = torch.tensor(2., requires_grad = True) 
y = torch.tensor(1., requires_grad = True)
w = x+y
w.retain_grad()
print(f'requires_grad of x, y, w: {x.requires_grad}\t{y.requires_grad}\t{w.requires_grad}')
z = torch.tensor(-1., requires_grad = True)
u = z*w
print(f'requires_grad of z, w, u: {z.requires_grad}\t{w.requires_grad}\t{u.requires_grad}')

u.backward(retain_graph = True) # backward call; graph is not freed after this call
print('Grads after first call to backward')
print(f'dz: {z.grad}')
print(f'dw: {w.grad}')
print(f'dy: {y.grad}')
print(f'dx: {x.grad}\n\n')

z.grad.zero_() # zero the gradient in place
w.grad.zero_()
y.grad.zero_()
x.grad.zero_()
u.backward() # backward call again; graph will be freed after this call

print('Grads after second call to backward')
print(f'dz: {z.grad}')
print(f'dw: {w.grad}')
print(f'dy: {y.grad}')
print(f'dx: {x.grad}')

requires_grad of x, y, w: True	True	True
requires_grad of z, w, u: True	True	True
Grads after first call to backward
dz: 3.0
dw: -1.0
dy: -1.0
dx: -1.0


Grads after second call to backward
dz: 3.0
dw: -1.0
dy: -1.0
dx: -1.0


Now we will look at a few functions in the torch.autograd package. I encourage you to go through https://pytorch.org/docs/stable/autograd.html?highlight=torch%20autograd#module-torch.autograd for more details.
<br><br>
Suppose we want to compute higher-order derivatives/gradients of a function. The 'grad' function in torch.autograd is very useful for this purpose. You can look at the documentation of 'grad' for a detailed explanation. I will show through two examples below how higher-order gradients can be computed with it.
<br>
<br>
Let $y$ = $x^2$. Then $y^{'}$ = $2x$, $y^{''} = 2$ and $y^{(3)} = 0$. To compute $y^{'}$, the derivative of y w.r.t. x, we can call 'grad' with the outputs parameter set to ($y$, ) and the inputs parameter set to ($x$, ). Since $y$ is a scalar function, we need not explicitly supply the grad_outputs parameter. We should set the create_graph parameter to True: this builds a graph for the derivative itself, which makes it possible to compute higher-order derivatives. Note that $x$ should have its requires_grad set to True. The result returned is a tuple with one Tensor, the derivative of $y$ w.r.t. $x$, with its requires_grad set to True. To find $y^{''}$, all you have to do is pass the result of the previous call as the outputs parameter in another call to 'grad'. Other higher-order derivatives can be computed similarly. See the example below.



In [68]:
import torch.autograd
x = torch.tensor(3., requires_grad = True) 
y = x**2

dy_dx = torch.autograd.grad(outputs = (y, ), inputs = (x, ), create_graph = True)
print(f'dy_dx at x={x.data}:\t\t{dy_dx[0].data}')

d2y_dx2 = torch.autograd.grad(outputs = dy_dx, inputs = (x, ), create_graph = True)
print(f'd2y_dx2 at x={x.data}:\t{d2y_dx2[0].data}')

d3y_dx3 = torch.autograd.grad(outputs = d2y_dx2, inputs = (x, ))
print(f'd3y_dx3 at x={x.data}:\t{d3y_dx3[0].data}')

dy_dx at x=3.0:		6.0
d2y_dx2 at x=3.0:	2.0
d3y_dx3 at x=3.0:	0.0


Given below is another example. 

In [69]:
x = torch.tensor([[1., 2., 3.], [4., 5., 6.]], requires_grad = True)
y = torch.mean(x ** 2) * torch.ones(2, 3)
print(y.requires_grad)
dy_dx = torch.autograd.grad(outputs = (y, ), inputs = (x, ), grad_outputs = (torch.ones(y.size()), ), create_graph = True)
print(f'dy_dx: {dy_dx}')
d2y_dx2 = torch.autograd.grad(outputs = dy_dx, inputs = (x, ), grad_outputs = (torch.ones(y.size()), ))
print(f'd2y_dx2: {d2y_dx2}')

True
dy_dx: (tensor([[ 2.,  4.,  6.],
        [ 8., 10., 12.]], grad_fn=<MulBackward0>),)
d2y_dx2: (tensor([[2., 2., 2.],
        [2., 2., 2.]]),)
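The repeated calls to 'grad' can be folded into a loop. The sketch below is a minimal version for scalar functions of a scalar Tensor; the helper name `nth_derivative` and the test function $t^4$ are illustrative assumptions, not part of torch.autograd.

```python
import torch

def nth_derivative(f, x, n):
    """Differentiate the scalar function f at x, n times (hypothetical helper)."""
    y = f(x)
    for _ in range(n):
        # create_graph=True keeps each derivative differentiable in turn
        (y,) = torch.autograd.grad(y, x, create_graph=True)
    return y

x = torch.tensor(2., requires_grad=True)
d3 = nth_derivative(lambda t: t ** 4, x, 3)   # analytically 24*x
print(d3)
```

For $x$ = 2 the third derivative of $x^4$ is $24x$ = 48, which the printed value can be compared against.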


Finally, as part of this tutorial, we will look at two context managers defined in the autograd package, namely 'no_grad' and 'enable_grad'. Generally, when testing a model, we perform only forward computations and no backward computation. Invoking the 'no_grad' context manager at such times avoids storing the intermediate results of the forward computations, which would otherwise waste memory since no backward pass follows. See the example code below for using the 'no_grad' context manager. Note in the example that 'no_grad' can also be used as a function decorator.

In [70]:
x = torch.tensor([1.], requires_grad=True)
with torch.autograd.no_grad():    # we can also use torch.no_grad in place of torch.autograd.no_grad
    y = x * 2
print(y.requires_grad)

@torch.autograd.no_grad()    # we can also use torch.no_grad in place of torch.autograd.no_grad
def doubler(x):
    return x * 2
z = doubler(x)
print(z.requires_grad)

False
False
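The effect of the context manager can also be queried directly. The sketch below assumes gradient tracking is in its default (enabled) state when it starts, and uses `torch.is_grad_enabled()` to show that 'no_grad' only changes the mode inside the block; the dictionary `states` is just an illustrative container.

```python
import torch

x = torch.tensor([1.], requires_grad=True)

states = {}
states["before"] = torch.is_grad_enabled()
with torch.no_grad():
    states["inside"] = torch.is_grad_enabled()
    y = x * 2                     # no graph is recorded for y
states["after"] = torch.is_grad_enabled()

print(states)
print(y.grad_fn)                  # None: y has no computation graph
```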


If, for some reason, gradient computation needs to be enabled within the 'no_grad' context, the 'enable_grad' context manager can be used. See the example below.

In [71]:
x = torch.tensor([1.], requires_grad=True)
with torch.autograd.no_grad():  # we can also use torch.no_grad in place of torch.autograd.no_grad
    with torch.autograd.enable_grad():  # we can also use torch.enable_grad in place of torch.autograd.enable_grad
        y = x * 2
        
print(y.requires_grad)
y.backward()
print(x.grad)

@torch.enable_grad()     # torch.autograd.enable_grad can be used here as well
def doubler(x):
    return x * 2

with torch.no_grad():    # torch.autograd.no_grad can be used here as well
    z = doubler(x)
print(z.requires_grad)


True
tensor([2.])
True
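To tie the pieces together, here is a small sketch of how 'backward', 'no_grad' and gradient zeroing typically combine in a hand-written gradient descent loop. The objective $(x-3)^2$, the step size and the iteration count are arbitrary choices made for illustration.

```python
import torch

x = torch.tensor(0., requires_grad=True)
lr = 0.1                       # step size (arbitrary choice)

for _ in range(100):
    loss = (x - 3.) ** 2       # a fresh graph is built on every iteration
    loss.backward()            # fills x.grad with d(loss)/dx
    with torch.no_grad():      # the update itself must not be tracked
        x -= lr * x.grad
    x.grad.zero_()             # stop gradients from accumulating

print(x)
```

Because the graph is rebuilt in each iteration, retain_graph is not needed here; only the accumulated grad has to be reset.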
