# AutoGrad: Automatic Differentiation in PyTorch

This tutorial covers PyTorch's [autograd](https://pytorch.org/docs/stable/autograd.html) package, which computes the gradients that gradient descent relies on.

Previously, we performed several operations on tensors, but we did nothing to store the sequence of operations or to compute the derivatives of a function.

Here, we introduce the concept of a *dynamic computational graph*, which comprises all the Tensor objects in the network as well as the *Functions* used to create them.

The PyTorch autograd package provides automatic differentiation for all operations on Tensors. When a tensor's `.requires_grad` attribute is set to `True`, autograd starts to track all operations on it, and each resulting tensor records the operation that created it. When you want to compute the gradients, you can call `backward()` on the final tensor and all the gradients are computed.
The gradient of a tensor will be accumulated in the `.grad` attribute.
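
To see what "accumulated" means in practice, the following minimal sketch calls `backward()` twice and inspects `.grad` after each call. The tensor name `w` is only illustrative and is not used elsewhere in this tutorial.

```python
import torch

# `w` is a hypothetical tensor used only for this illustration
w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()        # d(3w)/dw = 3
first = w.grad.clone()    # keep a copy, since later backward calls change .grad

(3 * w).backward()        # a second backward pass adds another 3
second = w.grad.clone()

w.grad.zero_()            # reset the accumulated gradient in place
print(first, second, w.grad)
```

Because gradients add up across calls, training loops usually reset them before each new backward pass.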

In [17]:
import torch
x = torch.tensor(1.0, requires_grad=True)

## Back-propagation on one step
We apply a polynomial function $y = f(x)$ to tensor $x$. Then we backpropagate and print the gradient $\frac {dy} {dx}$.

$\begin{split}Function:\quad y &= x^4 + 3x^3 + 3x^2 + 5x + 1 \\
Derivative:\quad y' &= 4x^3 + 9x^2 + 6x + 5\end{split}$

In [18]:
y = x**4 + 3*x**3 + 3*x**2 + 5*x + 1

print(y)

tensor(13., grad_fn=<AddBackward0>)


Since $y$ has been computed as a result of an operation, it has an associated gradient function, accessible as `y.grad_fn`.
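
As a small sketch of the difference between a tensor created directly and one produced by an operation, the snippet below inspects `grad_fn` and `is_leaf` on two illustrative tensors. The names `a` and `b` are assumptions for this example only, and the exact class name of the gradient function may vary between PyTorch versions.

```python
import torch

# `a` is created directly by the user, so it is a leaf of the graph
a = torch.tensor(1.0, requires_grad=True)
# `b` is the result of operations on `a`
b = a**2 + 1

print(a.grad_fn, a.is_leaf)   # a leaf has no gradient function
print(b.grad_fn, b.is_leaf)   # b remembers the operation that produced it
```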

In [19]:
y.backward()

In [20]:
print(x.grad)

tensor(24.)


In [21]:
# this is the computation of the derivative when x=1
# 24 is the slope of the function at (x,y) = (1,13)
print(4*1**3 + 9*1**2 + 6*1 + 5)

24
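
As an independent sanity check, the slope can also be estimated numerically with a central finite difference. This is only a sketch: the helper `f` and the step size `h` are choices made for this illustration, not part of the autograd API.

```python
# Plain-Python version of the polynomial used above
def f(x):
    return x**4 + 3*x**3 + 3*x**2 + 5*x + 1

h = 1e-5  # small step; an assumed value, other small steps work too
# Central difference: (f(x + h) - f(x - h)) / (2h), evaluated at x = 1
slope = (f(1 + h) - f(1 - h)) / (2 * h)
print(slope)
```

The estimate should agree with the value stored in `x.grad` up to a small numerical error.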


## Back-propagation on multiple steps
Let's add layers $y$ and $z$ after $x$.

In [22]:
x = torch.tensor([[1.,2,3],[3,2,1]], requires_grad=True)
print(x)

tensor([[1., 2., 3.],
        [3., 2., 1.]], requires_grad=True)


Create the first layer with $y = 2x+1$

In [23]:
y = 2*x + 1
print(y)

tensor([[3., 5., 7.],
        [7., 5., 3.]], grad_fn=<AddBackward0>)


Create the second layer $z=y^2$

In [24]:
z = y**2
print(z)

tensor([[ 9., 25., 49.],
        [49., 25.,  9.]], grad_fn=<PowBackward0>)


Set the output to be the matrix mean.

In [25]:
o = z.mean()  # sum of all the elements divided by the number of elements
print(o)

tensor(27.6667, grad_fn=<MeanBackward0>)


Now, we compute the gradient of $o$ with respect to $x$.

In [26]:
o.backward()
print(x.grad)

tensor([[2.0000, 3.3333, 4.6667],
        [4.6667, 3.3333, 2.0000]])


You should see a 2x3 matrix. We can calculate the partial derivative of $o$ with respect to $x_i$ as follows:<br>

$o = \frac {1} {6}\sum_{i=1}^{6} z_i$<br>

$z_i = (y_i)^2 = (2x_i+1)^2$<br>

To differentiate $z_i$ we use the <a href='https://en.wikipedia.org/wiki/Chain_rule'>chain rule</a>, which states that the derivative of $f(g(x))$ is $f'(g(x))g'(x)$.<br>

In this case<br>

$\begin{split} f(g(x)) &= (g(x))^2, \quad f'(g(x)) = 2g(x) \\
g(x) &= 2x+1, \quad g'(x) = 2 \\
\frac {dz} {dx} &= f'(g(x))\,g'(x) = 4g(x) = 4(2x+1) \end{split}$

Therefore,<br>

$\frac{\partial o}{\partial x_i} = \frac{1}{6}\times 4(2x_i+1) = \frac{2}{3}(2x_i+1)$<br>

$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{2}{3}(2(1)+1) = 2$

$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=2} = \frac{2}{3}(2(2)+1) \approx 3.3333$

$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=3} = \frac{2}{3}(2(3)+1) \approx 4.6667$

In [16]:
print(2/3*3)
print(2/3*5)
print(2/3*7)

2.0
3.333333333333333
4.666666666666666
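
Rather than checking each entry by hand, the analytic formula can be compared against autograd for the whole matrix at once. This is a minimal sketch: the names `x_check`, `o_check` and `analytic` are introduced here only to avoid overwriting the tensors defined above.

```python
import torch

# Rebuild the same two-layer computation on a fresh tensor
x_check = torch.tensor([[1., 2, 3], [3, 2, 1]], requires_grad=True)
o_check = ((2 * x_check + 1) ** 2).mean()
o_check.backward()

# Analytic gradient derived above: (2/3) * (2 * x_i + 1)
analytic = (2 / 3) * (2 * x_check.detach() + 1)

# Compare with a tolerance, since the values are floating point
print(torch.allclose(x_check.grad, analytic))
```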


## Turn off tracking
There may be times when we don't want or need to track the computational history.

You can reset a tensor's `requires_grad` attribute in-place using `.requires_grad_(False)`.
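
A short sketch of the in-place switch, using an illustrative tensor `t` that is not part of the examples above:

```python
import torch

# `t` is a hypothetical leaf tensor that initially tracks operations
t = torch.ones(3, requires_grad=True)
tracked = (t * 2).requires_grad      # results of operations are tracked

t.requires_grad_(False)              # switch tracking off in place
untracked = (t * 2).requires_grad    # new results are no longer tracked

print(tracked, untracked)
```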

When performing evaluations, it's often helpful to wrap a set of operations in a `with torch.no_grad():` block, so that no operation inside it is tracked.
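
For instance, the same operation gives a tracked result outside the block and an untracked one inside it. The tensor `p` below is an assumption made for this sketch.

```python
import torch

# `p` is a hypothetical tensor that requires gradients
p = torch.ones(3, requires_grad=True)

with torch.no_grad():
    inside = p * 2     # computed without building a graph

outside = p * 2        # computed normally, so it is tracked

print(inside.requires_grad, outside.requires_grad)
```

Note that `p` itself still has `requires_grad=True` after the block; only the operations inside it are left untracked.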

It is also possible to call `.detach()` on a tensor, which returns a new tensor that shares the same data but is cut off from the graph, so computations on it are not tracked.
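
A minimal sketch of `.detach()`, with illustrative names `v` and `d`:

```python
import torch

# `v` is a hypothetical tensor that tracks operations
v = torch.ones(3, requires_grad=True)
d = v.detach()                 # new tensor, detached from the graph

result = d * 2                 # operations on `d` are not tracked
print(d.requires_grad, result.requires_grad)

# The detached tensor refers to the same underlying memory as the original
print(d.data_ptr() == v.data_ptr())
```

Since both tensors share storage, an in-place change to `d` would also be visible through `v`, so it is usually safer to treat the detached tensor as read-only.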