### Autograd: automatic differentiation

Central to all neural networks in PyTorch is the ``autograd`` package. Let’s first briefly visit this, and then we will go on to train our first neural network.

The ``autograd`` package provides automatic differentiation for all operations on Tensors. It is a *define-by-run* framework, which means that your *backprop* is defined by how your code is run, and that every single iteration can be different.
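To make "define-by-run" concrete, here is a minimal sketch in which ordinary Python control flow decides how many operations are recorded. The helper name ``repeated_product`` is made up for this illustration and is not part of PyTorch.

```python
import torch

def repeated_product(x, n_steps):
    # Hypothetical helper: the loop length is plain Python control flow,
    # so each call records a graph of a different size.
    y = x
    for _ in range(n_steps):
        y = y * x
    return y

x = torch.tensor(2.0, requires_grad=True)

# One multiplication: y = x**2, so dy/dx = 2 * x
repeated_product(x, 1).backward()
grad_one_step = x.grad.clone()

x.grad.zero_()  # clear the accumulated gradient before the next run

# Two multiplications: y = x**3, so dy/dx = 3 * x**2
repeated_product(x, 2).backward()
grad_two_steps = x.grad.clone()

print(grad_one_step, grad_two_steps)
```

The same function yields two different backward passes because the graph is rebuilt every time the code runs.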

Let us see this in simpler terms with some examples.


#### Tensor

``torch.Tensor`` is the central class of the package. If you set its attribute ``.requires_grad`` to ``True``, it starts to track all operations on it. When you finish your computation you can call ``.backward()`` and have all the gradients computed automatically. The gradient for this tensor will be accumulated into the ``.grad`` attribute.
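The word "accumulated" matters: repeated calls to ``.backward()`` add to ``.grad`` rather than overwrite it. The following is a small sketch of that behaviour; the tensor ``w`` and the expression ``3 * w`` are arbitrary choices for illustration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(3 * w) = 3 for every element
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without clearing: the new gradient is added to the old one
(3 * w).sum().backward()
second = w.grad.clone()

# Reset in place before starting an independent computation
w.grad.zero_()

print(first)
print(second)
print(w.grad)
```

This is why training loops usually zero the gradients between iterations.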

To stop a tensor from tracking history, you can call ``.detach()`` to detach it from the computation history, and to prevent future computation from being tracked.

To prevent tracking history (and using memory), you can also wrap the code block in ``with torch.no_grad():``. This can be particularly helpful when evaluating a model because the model may have ``trainable`` parameters with ``requires_grad=True``, but we don’t need the gradients.
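As a rough illustration of the evaluation case, the sketch below runs the same forward pass with and without ``torch.no_grad()``. The ``nn.Linear`` layer is only a stand-in for a real model, and the input shape is an arbitrary assumption.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)     # stand-in model; its weight and bias require gradients
inputs = torch.randn(4, 3)  # arbitrary batch of 4 samples

tracked = model(inputs)     # normal forward pass: the output is attached to the graph

with torch.no_grad():
    untracked = model(inputs)  # same computation, but no history is recorded

print(model.weight.requires_grad)                     # parameters are still trainable
print(tracked.requires_grad, tracked.grad_fn)         # tracked output has a grad_fn
print(untracked.requires_grad, untracked.grad_fn)     # untracked output does not
```

The parameters keep ``requires_grad=True`` throughout; only the operations inside the ``with`` block are left out of the graph.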

There’s one more class which is very important for the ``autograd`` implementation: ``Function``.

``Tensor`` and ``Function`` are interconnected and build up an acyclic graph that encodes a complete history of computation. Each tensor has a ``.grad_fn`` attribute that references the ``Function`` that created it (except for Tensors created by the user, whose ``grad_fn`` is ``None``).
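One way to look at this graph is to walk it from a result back towards its inputs. The sketch below assumes the ``is_leaf`` attribute and the ``next_functions`` attribute of ``grad_fn`` objects; the latter is an implementation-level detail, so treat the exact class names it prints as version dependent.

```python
import torch

x = torch.ones(2, requires_grad=True)  # created by the user: a leaf tensor
y = x * 3                              # created by an operation
z = y.sum()                            # created by another operation

print(x.is_leaf, x.grad_fn)            # leaf tensors have no grad_fn
print(y.is_leaf, type(y.grad_fn).__name__)
print(z.is_leaf, type(z.grad_fn).__name__)

# Each Function links back to the Functions that produced its inputs,
# stored as (function, input_index) pairs.
parent_fn, _ = z.grad_fn.next_functions[0]
print(type(parent_fn).__name__)        # the Function that created y
```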

If you want to compute the derivatives, you can call ``.backward()`` on a ``Tensor``. If the ``Tensor`` is a scalar (i.e. it holds a single element), you don’t need to specify any arguments to ``backward()``. If it has more elements, however, you need to specify a ``gradient`` argument that is a tensor of matching shape.
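For a non-scalar output, the ``gradient`` argument acts as a set of weights applied to each output element before the derivatives are propagated back (a vector-Jacobian product). A minimal sketch, with the values of ``x`` and ``weights`` chosen arbitrarily for illustration:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # non-scalar output; dy_i/dx_i = 2 * x_i

# Calling y.backward() with no argument would raise a RuntimeError here,
# because y has more than one element.
weights = torch.tensor([1.0, 0.1, 0.01])  # same shape as y
y.backward(gradient=weights)

# x.grad holds the weighted derivatives: weights_i * 2 * x_i
print(x.grad)
```

Passing a tensor of ones as ``gradient`` would give the same result as calling ``y.sum().backward()``.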

In [1]:
import torch

Create a ``tensor`` and set ``requires_grad=True`` to track computation with it.

In [2]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[ 1.,  1.],
        [ 1.,  1.]])


Do an operation on ``x``.

In [3]:
# ``y`` was created as a result of an operation.
y = x + 2
print(y)

tensor([[ 3.,  3.],
        [ 3.,  3.]])


``y`` was created as a result of an operation, so it has a ``grad_fn``.

In [4]:
print(y.grad_fn)

<AddBackward0 object at 0x1056dcef0>


Do more operations on ``y``.

In [5]:
z = y * y * 3
out = z.mean()

print(z)
print(out)

tensor([[ 27.,  27.],
        [ 27.,  27.]])
tensor(27.)


``.requires_grad_( ... )`` changes an existing Tensor’s ``requires_grad`` flag in-place. The input flag defaults to ``True`` if not given.

In [6]:
a = torch.rand(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)

False


In [7]:
# Operations on ``a`` and Tensors created from it are now being tracked.
a.requires_grad_(True)
print(a.requires_grad)

True


In [8]:
# ``b`` was created as a result of ``a`` (whose requires_grad=True)
b = (a * a).sum()
print(b.grad_fn)

<SumBackward0 object at 0x108e857f0>


#### Gradients

Let’s backprop now. Because ``out`` contains a single scalar, ``out.backward()`` is equivalent to ``out.backward(torch.tensor(1.))``.

In [9]:
out.backward()


Print the gradients ${{\partial out}\over{\partial x}}$.

In [10]:
print(x.grad)

tensor([[ 4.5000,  4.5000],
        [ 4.5000,  4.5000]])


You should get a matrix filled with ``4.5``. Let’s call the ``out`` Tensor $o$.

We have that $o = {1\over4}\sum_i z_i$, $z_i = 3(x_i + 2)^2$ and $z_i|_{x_i = 1} = 27$.

Therefore, ${{\partial o}\over{\partial x_i}} = {3\over2}(x_i + 2)$, hence ${{\partial o}\over{\partial x_i}} |_{x_i = 1} = {9\over2} = 4.5$.
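The derivation above can be checked numerically. The sketch below rebuilds the same computation in one expression and compares the gradient from ``autograd`` against the closed form ${3\over2}(x_i + 2)$; the variable name ``analytic`` is introduced here only for the comparison.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()  # o = (1/4) * sum_i 3 * (x_i + 2)^2
out.backward()

# Closed-form derivative from the text: (3/2) * (x_i + 2)
analytic = 1.5 * (x.detach() + 2)

print(x.grad)
print(torch.allclose(x.grad, analytic))
```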

You can do many crazy things with ``autograd``!

In [19]:
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

tensor([-989.9449,   62.7360,  141.0580])


In [20]:
gradient = torch.tensor([0.1, 1.0, 0.0001], 
                        dtype=torch.float)

y.backward(gradient=gradient)

print(x.grad)

tensor([  204.8000,  2048.0000,     0.2048])


You can also stop ``autograd`` from tracking history on Tensors with ``requires_grad=True`` by wrapping the code block in ``with torch.no_grad():``

In [None]:
print('x.requires_grad = {}'.format(x.requires_grad))
print('(x ** 2).requires_grad = {}'.format((x ** 2).requires_grad))

with torch.no_grad():
    print('(x ** 2).requires_grad = {}'.format((x ** 2).requires_grad))

x.requires_grad = True
(x ** 2).requires_grad = True
(x ** 2).requires_grad = False


``.detach()`` also stops ``autograd`` from tracking history (and using memory), but it returns a new tensor and leaves ``x`` itself unchanged. To stop gradient tracking on ``x`` itself, either re-assign the result (``x = x.detach()``) or call ``x.detach_()`` to modify ``x`` in-place. You can also call the ``.requires_grad_(False)`` method on the tensor.

Example:
```python
>>> x = torch.rand(2, 2, requires_grad=True)
>>> print(x.requires_grad)
True
>>> (x ** 2).requires_grad
True
>>> # Detaching ``x`` completely
>>> x = x.detach()  # or x.detach_()
>>> # or you can also say ``x.requires_grad_(False)``.
>>> print(x.requires_grad)
False
>>> (x ** 2).requires_grad
False
```

In [21]:
print('(x **2).detach().requires_grad = {}'.format((x **2).detach().requires_grad))
print('x.requires_grad = {}'.format(x.requires_grad))

(x **2).detach().requires_grad = False
x.requires_grad = True


In [24]:
x = torch.rand(2, 2, requires_grad=True)
print(x.requires_grad)

True


In [26]:
(x ** 2).requires_grad

True

In [27]:
x = x.detach()  # or x.detach_()
print(x.requires_grad)

False


In [28]:
(x ** 2).requires_grad

False