In [None]:
%matplotlib inline


Autograd: Automatic Differentiation
===================================

Central to all neural networks in PyTorch is the ``autograd`` package.
Let’s briefly look at it first, and then move on to training our
first neural network.


The ``autograd`` package <font color=blue>provides automatic differentiation for all operations
on Tensors.</font> It is a <font color=red>define-by-run</font> framework, which means that <font color=blue>your backprop is defined by how your code is run, and that every single iteration can be different.</font>

Let us see this in simpler terms with some examples.
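
As a minimal sketch of what define-by-run means in practice (the function name `scaled` and the sign test are illustrative assumptions, not part of the tutorial), ordinary Python control flow decides which operations get recorded, so two calls can produce different gradients:

```python
import torch

def scaled(x):
    # Ordinary Python control flow picks the branch at run time,
    # so the recorded graph can differ from one call to the next.
    if x.sum() > 0:
        return (x * 2).sum()
    return (x * x).sum()

a = torch.tensor([1.0, 2.0], requires_grad=True)
scaled(a).backward()
print(a.grad)  # gradient of (2 * a).sum()

b = torch.tensor([-1.0, -2.0], requires_grad=True)
scaled(b).backward()
print(b.grad)  # gradient of (b * b).sum(), i.e. 2 * b
```

The first call takes the linear branch and the second the quadratic one; autograd simply differentiates whatever was executed.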

Tensor
--------

``torch.Tensor`` is the central class of the package. <font color=blue>If you set its attribute
``.requires_grad`` as ``True``, it starts to track all operations on it. When
you finish your computation you can call ``.backward()`` and have all the
gradients computed automatically.</font> The gradient for this tensor will be
accumulated into the ``.grad`` attribute.
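
The word "accumulated" matters: a second backward pass adds to ``.grad`` instead of overwriting it. The sketch below illustrates this with a hypothetical tensor `w`; the variable names are assumptions made for the example only.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()      # d/dw of sum(3w) is 3 for every element

(w * 3).sum().backward()    # a second backward pass adds to .grad
second = w.grad.clone()

w.grad.zero_()              # reset before the next, unrelated computation
print(first, second, w.grad)
```

This is why training loops usually zero the gradients before each new backward pass.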

<font color=red>To stop a tensor from tracking history, you can call ``.detach()`` to detach
it from the computation history, and to prevent future computation from being
tracked.</font>
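
A small sketch of ``.detach()`` in use, with illustrative variable names: the detached tensor holds the same values, but neither it nor anything computed from it is tracked.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x * 2            # tracked: y has a grad_fn
d = y.detach()       # same values, but cut off from the graph
z = d * 3            # computed from d, so not tracked either

print(y.requires_grad, d.requires_grad, z.requires_grad)
print(d.grad_fn)
```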

To prevent tracking history (and using memory), you can also wrap the code block
in ``with torch.no_grad():``. This can be <font color=blue>particularly helpful when evaluating a
model, because the model may have trainable parameters with
``requires_grad=True`` whose gradients we don't need.</font>

One more class is central to the autograd implementation: ``Function``.

``Tensor`` and ``Function`` are <font color=blue>interconnected and build up an **acyclic
graph**, that encodes a complete history of computation.</font> <font color=red>Each tensor has
a ``.grad_fn`` attribute that references a ``Function`` that has created
the ``Tensor`` (except for Tensors created by the user - their
``grad_fn is None``).</font>
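
To make the exception concrete, here is a short sketch (variable names are illustrative) contrasting a tensor created by the user with one produced by an operation. It also prints the ``is_leaf`` attribute, which is not mentioned above but reports the same distinction.

```python
import torch

x = torch.ones(2, requires_grad=True)   # created by the user
y = x + 2                                # created by an operation

print(x.grad_fn)                 # None
print(x.is_leaf, y.is_leaf)
print(y.grad_fn is not None)
```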

<font color=red>If you want to compute the derivatives, you can call ``.backward()`` on
a ``Tensor``. If the ``Tensor`` is a scalar (i.e. it holds a single element),
you don’t need to specify any arguments to ``backward()``.
If it has more elements, however, you need to specify a ``gradient``
argument that is a tensor of matching shape.</font>
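
The following sketch shows both cases under the assumption that calling ``backward()`` without arguments on a non-scalar raises a ``RuntimeError`` (the exact message may vary between versions). The variable names are illustrative.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2                      # three elements, so not a scalar

raised = False
try:
    y.backward()               # no gradient argument supplied
except RuntimeError as err:
    raised = True
    print("backward() failed:", err)

# Supplying a tensor of the same shape as y makes the call valid.
y.backward(torch.ones_like(y))
print(x.grad)
```

Passing a tensor of ones weights every element of ``y`` equally, which gives the same gradient as calling ``y.sum().backward()``.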



In [1]:
import torch
torch.__version__

'0.4.1.post2'

Create a tensor and set ``requires_grad=True`` to track computation with it:



In [2]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


Do a tensor operation:



In [3]:
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward>)


``y`` was created as a result of an <font color=blue>operation</font>, so it has a ``grad_fn``.



In [4]:
print(y.grad_fn)

<AddBackward object at 0x7f347d2c96d8>


Do more operations on ``y``:



In [7]:
z = y * y * 3
out = z.mean()

print(z, out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward>) tensor(27., grad_fn=<MeanBackward1>)


``.requires_grad_( ... )`` changes an existing Tensor's ``requires_grad``
flag <font color=blue>in-place</font>. A newly created Tensor's flag is ``False`` unless you set it; calling ``.requires_grad_()`` with no argument sets it to ``True``.



In [10]:
a = torch.randn(2, 2)
print('[1] a=', a)

a = ((a * 3) / (a - 1))
print('[2] a=', a)
print(a.requires_grad)

a.requires_grad_(True)
print(a.requires_grad)

b = (a * a).sum()
print('[3] b=', b)
print(b.grad_fn)

[1] a= tensor([[ 0.3127,  0.6842],
        [-0.0578, -1.8010]])
[2] a= tensor([[-1.3650, -6.4986],
        [ 0.1641,  1.9289]])
False
True
[3] b= tensor(47.8424, grad_fn=<SumBackward0>)
<SumBackward0 object at 0x7f347d2614a8>
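
As a compact sketch of the flag's behaviour (the names `before`, `after_default` and `after_false` are illustrative), the method can be called without an argument to switch tracking on, or with an explicit ``False`` to switch it off again:

```python
import torch

a = torch.randn(2, 2)
before = a.requires_grad        # new tensors do not track history by default

a.requires_grad_()              # no argument: switches tracking on in-place
after_default = a.requires_grad

a.requires_grad_(False)         # an explicit flag switches it off again
after_false = a.requires_grad

print(before, after_default, after_false)
```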


Gradients
---------
Let's backprop now.
Because ``out`` contains a single scalar, ``out.backward()`` is
equivalent to ``out.backward(torch.tensor(1.))``.



In [11]:
out.backward()

Print gradients d(out)/dx




In [12]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


You should get a matrix in which every entry is ``4.5``. Let’s call the ``out``
*Tensor* “$o$”.<br/>
We have that $o = \frac{1}{4}\sum_i z_i$,
$z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$.<br/>
Therefore,
$\frac{\partial o}{\partial x_i} = \frac{1}{4}\cdot 6(x_i+2) = \frac{3}{2}(x_i+2)$, hence
$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
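
The derivation can be sanity-checked without autograd. The sketch below is plain Python and uses a central finite difference; the helper `out` and the step size `eps` are assumptions made for illustration.

```python
# out(x) = mean(3 * (x_i + 2)^2) over four entries, as in the cells above.
def out(xs):
    return sum(3 * (x + 2) ** 2 for x in xs) / len(xs)

xs = [1.0, 1.0, 1.0, 1.0]
eps = 1e-6

# Central difference with respect to the first entry only.
plus = [xs[0] + eps] + xs[1:]
minus = [xs[0] - eps] + xs[1:]
numeric = (out(plus) - out(minus)) / (2 * eps)

analytic = 1.5 * (xs[0] + 2)   # (3/2)(x_i + 2)
print(numeric, analytic)
```

Both values should agree with the ``4.5`` reported by ``x.grad``, up to floating-point error in the numerical estimate.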



Mathematically, if you have a <font color=orange>vector-valued function</font> <font color=blue>$\vec{y}=f(\vec{x})$</font>,
then the gradient of <font color=blue>$\vec{y}$</font> with respect to <font color=blue>$\vec{x}$</font>
is a **<font color=orange>Jacobian matrix</font>**:

\begin{align}J=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)\end{align}

Generally speaking, ``torch.autograd`` is an engine for computing
**<font color=orange>vector-Jacobian products</font>**.
<br/>That is, given any vector<font color=blue>
$v=\left(\begin{array}{cccc} v_{1} & v_{2} & \cdots & v_{m}\end{array}\right)^{T}$</font>,
compute the product <font color=blue>$v^{T}\cdot J$</font>.<br/> If <font color=blue>$v$</font> happens to be
the gradient of a <font color=orange>scalar function</font> <font color=blue>$l=g\left(\vec{y}\right)$</font>,
that is,<font color=blue>
$v=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$</font>,
then by the <font color=orange>chain rule</font>, the <font color=orange>vector-Jacobian product</font> would be the
gradient of <font color=blue>$l$</font> with respect to <font color=blue>$\vec{x}$</font>:

\begin{align}J^{T}\cdot v=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)\left(\begin{array}{c}
   \frac{\partial l}{\partial y_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial y_{m}}
   \end{array}\right)=\left(\begin{array}{c}
   \frac{\partial l}{\partial x_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial x_{n}}
   \end{array}\right)\end{align}

(Note that <font color=blue>$v^{T}\cdot J$</font> gives a <font color=orange>row vector</font> which can be
treated as a <font color=orange>column vector</font> by taking <font color=blue>$J^{T}\cdot v$</font>.)

This property of the <font color=orange>vector-Jacobian product</font> makes it very
convenient to feed external gradients into a model that has
non-scalar output.
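
To connect the formula with the API, the sketch below uses a small hypothetical function with two inputs and three outputs. Its Jacobian is written out by hand for comparison with what ``backward`` accumulates; the function and the values of `v` are chosen purely for illustration.

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = torch.stack([x[0] * x[1], x[0] + x[1], x[1] ** 2])

# Jacobian of y at x, written out by hand (rows = outputs, columns = inputs).
J = torch.tensor([[3.0, 2.0],
                  [1.0, 1.0],
                  [0.0, 6.0]])
v = torch.tensor([1.0, 0.5, 0.1])

y.backward(v)            # autograd computes v^T J without forming J
manual = v @ J           # the same product, computed explicitly
print(x.grad, manual)
```

The two printed tensors should match, since ``x.grad`` holds exactly $J^{T}\cdot v$.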



Now let's take a look at an example of <font color=orange>vector-Jacobian product</font>, where `.norm()` is **L2-norm**:



In [13]:
x = torch.randn(3, requires_grad=True)

y = x * 2
count = 0
while y.data.norm() < 1000:
    count += 1
    y = y * 2

print(count, '=> ', y)

10 =>  tensor([-856.5516,  499.1843,  456.1978], grad_fn=<MulBackward>)


Now in this case ``y`` is no longer a scalar. ``torch.autograd``
<font color=red>does not compute the full Jacobian directly, but if we just
want the <font color=orange>vector-Jacobian product</font>, we can simply pass the vector to
``backward`` as an argument</font>:



In [14]:
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)

tensor([ 204.8000, 2048.0000,    0.2048])
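
If the full Jacobian is needed after all, one possible approach is to request one vector-Jacobian product per output, using a one-hot vector each time. The sketch below uses ``torch.autograd.grad`` on a small element-wise function; it is an illustration under these assumptions, not the tutorial's own method, and it costs one backward pass per row.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2               # element-wise, so the Jacobian is diag(2 * x)

rows = []
for i in range(y.numel()):
    one_hot = torch.zeros_like(y)
    one_hot[i] = 1.0
    # torch.autograd.grad returns the product instead of accumulating into x.grad;
    # retain_graph=True keeps the graph alive for the next row.
    (row,) = torch.autograd.grad(y, x, grad_outputs=one_hot, retain_graph=True)
    rows.append(row)

jacobian = torch.stack(rows)
print(jacobian)
```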


You can also stop autograd from tracking history on Tensors
with ``.requires_grad=True`` by wrapping the code block in
``with torch.no_grad():``



In [15]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

True
True
False
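
The same pattern applies when evaluating a model. In this sketch, `model` and `inputs` are hypothetical stand-ins (a single ``torch.nn.Linear`` layer and random data): the parameters keep ``requires_grad=True``, yet the forward pass inside the block records no history.

```python
import torch

model = torch.nn.Linear(3, 1)          # parameters have requires_grad=True
inputs = torch.randn(4, 3)

with torch.no_grad():
    preds = model(inputs)              # forward pass only, no graph recorded

print(model.weight.requires_grad, preds.requires_grad, preds.grad_fn)
```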


**Read Later:**

Documentation of ``autograd`` and ``Function`` is at
https://pytorch.org/docs/autograd

