# Differentiation in Autograd
Let’s take a look at how autograd collects gradients. We create two tensors, a and b, with requires_grad=True. This signals to autograd that every operation on them should be tracked.

In [1]:
import torch

## 1.1 Basic usage

In [2]:
a = torch.tensor([2.0, 3.0], requires_grad=True)
print(a)

tensor([2., 3.], requires_grad=True)


In [3]:
b = torch.tensor([6.0, 4.0], requires_grad=True)
print(b)

tensor([6., 4.], requires_grad=True)


We create another tensor Q from a and b.

Q = 3a^3 - b^2

In [4]:
Q = 3*a**3 - b**2
print(Q)

tensor([-12.,  65.], grad_fn=<SubBackward0>)


We want the gradients of Q w.r.t. the parameters a and b, i.e.

∂Q/∂a = 9a^2

∂Q/∂b = −2b

When we call .backward() on Q, autograd calculates these gradients and stores them in the respective tensors’ .grad attribute.

We need to explicitly pass a gradient argument to Q.backward() because Q is a vector rather than a scalar. gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself, i.e.

dQ/dQ = 1

Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like Q.sum().backward().

In [5]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

In [6]:
# Gradients are now deposited in a.grad and b.grad.
# Check that the collected gradients are correct.
print(a.grad)
print(b.grad)

print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([36., 81.])
tensor([-12.,  -8.])
tensor([True, True])
tensor([True, True])
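
The equivalence between passing an explicit gradient of ones and calling Q.sum().backward() can be checked directly. The following is a minimal sketch; the names a1, b1, a2, b2 are illustrative copies of the tensors above, rebuilt so that each backward call runs on its own fresh graph.

```python
import torch

# First copy: explicit gradient argument (dQ/dQ = 1 for every element).
a1 = torch.tensor([2.0, 3.0], requires_grad=True)
b1 = torch.tensor([6.0, 4.0], requires_grad=True)
Q1 = 3 * a1**3 - b1**2
Q1.backward(gradient=torch.ones_like(Q1))

# Second copy: reduce to a scalar first, so no gradient argument is needed.
a2 = torch.tensor([2.0, 3.0], requires_grad=True)
b2 = torch.tensor([6.0, 4.0], requires_grad=True)
Q2 = 3 * a2**3 - b2**2
Q2.sum().backward()

# Both routes should deposit the same gradients.
print(torch.equal(a1.grad, a2.grad))
print(torch.equal(b1.grad, b2.grad))
```

Summing works here because the derivative of a sum w.r.t. each of its terms is 1, which is exactly the vector of ones passed in the explicit version.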


## 1.2 Computational Graph

Conceptually, autograd keeps a record of data (tensors) & all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

* run the requested operation to compute a resulting tensor, and
* maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

* computes the gradients from each .grad_fn,
* accumulates them in the respective tensor’s .grad attribute, and
* using the chain rule, propagates all the way to the leaf tensors.

In [7]:
shape = (3,1)
x = torch.randn(shape, requires_grad=True)
print(x)

tensor([[ 2.1829],
        [-0.0914],
        [ 0.7154]], requires_grad=True)


In [8]:
y = x+1
print(y)
print(y.requires_grad)
print(y.grad_fn)

tensor([[3.1829],
        [0.9086],
        [1.7154]], grad_fn=<AddBackward0>)
True
<AddBackward0 object at 0x7f2c10cea9b0>


In [9]:
z = y**2 + 3
print(z)
print(z.requires_grad)
print(z.grad_fn)
print(z.grad_fn.next_functions)
print(z.grad_fn.next_functions[0][0].next_functions)

tensor([[13.1306],
        [ 3.8255],
        [ 5.9425]], grad_fn=<AddBackward0>)
True
<AddBackward0 object at 0x7f2c10cea440>
((<PowBackward0 object at 0x7f2c10cebbe0>, 0), (None, 0))
((<AddBackward0 object at 0x7f2c10cea440>, 0),)
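
Following next_functions by hand gets tedious for deeper graphs. As a rough sketch, a small recursive helper can list every node reachable from an output's grad_fn. The helper name graph_node_names is hypothetical and not part of PyTorch; it relies only on the grad_fn and next_functions attributes used above.

```python
import torch

def graph_node_names(fn, depth=0):
    """Return the class names of all backward nodes reachable from fn,
    indented by their depth in the graph (hypothetical helper)."""
    if fn is None:  # constants such as the literal 3 have no backward node
        return []
    names = ["  " * depth + type(fn).__name__]
    for next_fn, _ in fn.next_functions:
        names.extend(graph_node_names(next_fn, depth + 1))
    return names

x = torch.randn(3, 1, requires_grad=True)
z = (x + 1) ** 2 + 3

# Prints the chain of backward nodes from the root z down to the leaf x.
print("\n".join(graph_node_names(z.grad_fn)))
```

The last node in the listing belongs to the leaf tensor x; it is the node that writes the result into x.grad.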


## Disabling Gradient Tracking

### Exclusion from the DAG

torch.autograd tracks operations on all tensors which have their requires_grad flag set to True. For tensors that don’t require gradients, setting this attribute to False excludes them from the gradient computation DAG.

The output tensor of an operation will require gradients even if only a single input tensor has requires_grad=True.

In [10]:
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients? : {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients? : False
Does `b` require gradients?: True
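
Setting the flag to False on an existing tensor can be sketched as follows, assuming the in-place method requires_grad_() is used on a leaf tensor. The variable names before and after are illustrative.

```python
import torch

x = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

# While z requires gradients, so does any result computed from it.
before = (x + z).requires_grad

# Switch tracking off in place; later operations on z are no longer recorded.
z.requires_grad_(False)
after = (x + z).requires_grad

print(before, after)
```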


In [11]:
# By default, all tensors with requires_grad=True track their
# computational history and support gradient computation. However, in
# some cases we do not need that, for example when we have trained the
# model and only want to apply it to some input data, i.e. we only want
# to do *forward* computations through the network. We can stop tracking
# computations by wrapping the code in a torch.no_grad() block:

In [12]:
with torch.no_grad():
    c = x + z
print(c.requires_grad)

False


In [13]:
# Another way to achieve the same result is to use the ``detach()`` method on the tensor:

In [14]:
d = x + z
d_detached = d.detach()
print(d_detached.requires_grad)

False
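
One difference between the two approaches is worth seeing in code: detach() leaves the original tensor in the graph and returns a new tensor that is cut off from it. A minimal sketch, reusing the names from the cells above:

```python
import torch

x = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

d = x + z                # tracked: d is part of the graph
d_detached = d.detach()  # same values, but no link back to the graph

print(d.requires_grad)           # the original is still tracked
print(d_detached.requires_grad)  # the detached tensor is not
print(d_detached.grad_fn)        # no backward function is attached

# The detached tensor shares memory with the original rather than copying it.
print(d.data_ptr() == d_detached.data_ptr())
```

Because the memory is shared, an in-place change to d_detached also changes d, so it is usually safer to call .clone() first if the detached tensor will be modified.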


## 1.3 autograd in a Neural Network

In [15]:
sample_size = 2
number_features = 3

x = torch.arange(sample_size*number_features).reshape(sample_size, number_features) * 1.0
print(x)

target = torch.arange(2, sample_size+2).reshape(sample_size, 1) * 1.0
print(target)

tensor([[0., 1., 2.],
        [3., 4., 5.]])
tensor([[2.],
        [3.]])


In [16]:
w = torch.ones((3,1), requires_grad=True)
print(w)
print(w.grad)

tensor([[1.],
        [1.],
        [1.]], requires_grad=True)
None


In [17]:
output = x.matmul(w)
print(output)
print(output.grad_fn)

tensor([[ 3.],
        [12.]], grad_fn=<MmBackward0>)
<MmBackward0 object at 0x7f2c10d20640>


In [18]:
loss = ((output - target)**2).mean()
print(loss)

tensor(41., grad_fn=<MeanBackward0>)


In [19]:
loss.backward()

In [20]:
print(w.grad)

tensor([[27.],
        [37.],
        [47.]])


In [21]:
def update_parameter(w, learning_rate=0.1):
    with torch.no_grad():
        w.add_(w.grad, alpha=-1*learning_rate)

In [22]:
update_parameter(w, 0.01)

In [23]:
print(w)

tensor([[0.7300],
        [0.6300],
        [0.5300]], requires_grad=True)
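
The single update above can be repeated in a loop. The sketch below assumes the same data, a fixed learning rate of 0.01, and five steps (both values chosen only for illustration). It also resets w.grad after every update, because backward() adds to the stored gradient rather than overwriting it.

```python
import torch

# Same toy data as above.
x = torch.arange(6).reshape(2, 3) * 1.0
target = torch.arange(2, 4).reshape(2, 1) * 1.0
w = torch.ones((3, 1), requires_grad=True)

losses = []
for step in range(5):
    output = x.matmul(w)                    # forward pass builds a new graph
    loss = ((output - target) ** 2).mean()  # mean squared error
    loss.backward()                         # fills w.grad
    with torch.no_grad():
        w.add_(w.grad, alpha=-0.01)         # gradient descent step
    w.grad.zero_()                          # reset before the next backward
    losses.append(loss.item())

# The recorded loss should shrink from one step to the next.
print(losses)
```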


> DAGs are dynamic in PyTorch
>
> An important thing to note is that the graph is recreated from scratch: after each .backward() call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.
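
To illustrate the dynamic graph, here is a small sketch in which an ordinary Python if statement decides which operations are recorded. The function forward and its two branches are hypothetical and serve only to show that the graph can differ between calls.

```python
import torch

def forward(x, w):
    """Toy model whose operations depend on the input values (hypothetical)."""
    h = x * w
    if h.sum() > 1:            # plain Python control flow on tensor values
        return (h ** 2).sum()  # branch 1: quadratic
    return (3 * h).sum()       # branch 2: linear

w = torch.tensor([0.5, 0.5], requires_grad=True)

# Large input: the quadratic branch is recorded, so the gradient is 2 * x**2 * w.
(g_large,) = torch.autograd.grad(forward(torch.tensor([1.0, 2.0]), w), w)

# Small input: the linear branch is recorded, so the gradient is 3 * x.
(g_small,) = torch.autograd.grad(forward(torch.tensor([0.25, 0.5]), w), w)

print(g_large)
print(g_small)
```

torch.autograd.grad is used here instead of .backward() so that the two gradients are returned separately rather than accumulated in w.grad.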

## 1.4 Vector-Jacobian product (optional reading)

In [24]:
# In many cases, we have a scalar loss function, and we need to compute
# the gradient with respect to some parameters. However, there are cases
# when the output function is an arbitrary tensor. In this case, PyTorch
# allows you to compute so-called **Jacobian product**

For details, see https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#optional-reading-vector-calculus-using-autograd
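
As a rough illustration of this product: for a vector output Q with Jacobian J = dQ/da, calling Q.backward(gradient=v) stores v^T · J in a.grad. The sketch below uses an arbitrary weighting vector v (chosen only for illustration) and compares the result against the full Jacobian, computed with torch.autograd.functional.jacobian.

```python
import torch

def f(a):
    # Elementwise function, so its Jacobian is diagonal: diag(9 * a**2).
    return 3 * a**3

a = torch.tensor([2.0, 3.0], requires_grad=True)
Q = f(a)

v = torch.tensor([0.5, 2.0])  # arbitrary vector, an assumption for this demo
Q.backward(gradient=v)        # stores v^T @ J in a.grad without forming J

# For comparison only: build the full 2x2 Jacobian and multiply explicitly.
J = torch.autograd.functional.jacobian(f, a.detach())
print(a.grad)
print(torch.allclose(v @ J, a.grad))
```

With v set to a vector of ones, this reduces to the Q.backward(gradient=external_grad) call used throughout this section.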

In [25]:
a = torch.tensor([2.0, 3.0], requires_grad=True)
b = torch.tensor([6.0, 4.0], requires_grad=True)
Q = 3*a**3 - b**2
print(Q)

tensor([-12.,  65.], grad_fn=<SubBackward0>)


In [26]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)
print(a.grad)

tensor([36., 81.])


In [27]:
a.grad.zero_()

tensor([0., 0.])

In [29]:
#Q.sum().backward(retain_graph=True)
#print(a.grad)
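
The commented-out call above hints at two details: a graph can normally be backpropagated through only once unless retain_graph=True is passed, and repeated backward calls add to .grad instead of replacing it. A minimal sketch, with first and accumulated as illustrative names:

```python
import torch

a = torch.tensor([2.0, 3.0], requires_grad=True)
Q = 3 * a**3

# Keep the graph alive so that a second backward pass is possible.
Q.sum().backward(retain_graph=True)
first = a.grad.clone()

# The second pass adds its gradients to the ones already stored in a.grad.
Q.sum().backward()
accumulated = a.grad.clone()

# Reset in place, as in the a.grad.zero_() cell above.
a.grad.zero_()

print(first)
print(accumulated)
print(a.grad)
```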