In [1]:
import torch

# 2. AUTOGRAD: AUTOMATIC DIFFERENTIATION
[A useful article](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/)

Central to all neural networks in PyTorch is the ```autograd``` package, which **provides automatic differentiation for all operations on Tensors**.


Mathematically, if you have a vector-valued function $\tilde{y}=f(\tilde{x})$, then the gradient of $\tilde{y}$ with respect to $\tilde{x}$ is the Jacobian matrix
$$
\begin{split}J=\left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)\end{split}
$$
Generally speaking, ```torch.autograd``` is an engine for computing vector-Jacobian products. That is, given any vector $v$, it computes
the product $v^{T}\cdot J$.

If $v$ happens to be the gradient of a scalar function $l=g\left(\tilde{y}\right)$:
$$v=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$$
By the chain rule, the vector-Jacobian product is then the gradient of $l$ with respect to $\tilde{x}$ (written here as the column vector $J^{T}\cdot v$, which is the transpose of $v^{T}\cdot J$):
$$\begin{split}J^{T}\cdot v=\left(\begin{array}{ccc}
 \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
 \vdots & \ddots & \vdots\\
 \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
 \end{array}\right)\left(\begin{array}{c}
 \frac{\partial l}{\partial y_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial y_{m}}
 \end{array}\right)=\left(\begin{array}{c}
 \frac{\partial l}{\partial x_{1}}\\
 \vdots\\
 \frac{\partial l}{\partial x_{n}}
 \end{array}\right)\end{split}$$
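
The relation above can be checked numerically. The following is a minimal sketch, assuming a PyTorch version that provides ```torch.autograd.functional.jacobian```; the function ```f``` and the vectors ```x``` and ```v``` are made up for illustration. It compares the result of ```backward(v)``` with an explicitly computed $J^{T}\cdot v$.

```python
import torch
from torch.autograd.functional import jacobian


def f(x):
    # Illustrative vector-valued function R^3 -> R^2 (not from the tutorial)
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.5, -1.0])

# Vector-Jacobian product computed by autograd
y = f(x)
y.backward(v)
vjp = x.grad.clone()

# Explicit Jacobian (shape m x n = 2 x 3), then J^T v
J = jacobian(f, x.detach())
explicit = J.T @ v

print(J)
print(vjp)
print(torch.allclose(vjp, explicit))
```

Here $J$ has rows $(x_2, x_1, 0)$ and $(0, 1, 2x_3)$, so both routes should give the same three-element gradient.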



## 2.1 Tensor

### 2.1.1 ```Tensor``` class

* ```requires_grad```
 * The ```torch.Tensor``` class has an attribute ```.requires_grad```.
 * If you set ```.requires_grad``` to ```True```, **autograd starts to track all operations on the tensor**.
 * It defaults to ```False```.
 * ```requires_grad``` is contagious: the result of any operation involving a tensor with ```requires_grad=True``` also has ```requires_grad=True```.
* ```.backward()```
 * Call ```.backward()``` when you finish your computation to have all the gradients computed automatically.
* ```.grad``` attribute
 * The gradient for this tensor is accumulated into its ```.grad``` attribute.
* ```.detach()```
 * To stop a tensor from tracking history, call ```.detach()``` to detach it from the computation history and to prevent future computation from being tracked.
* ```with torch.no_grad():```
 * To prevent tracking history (and using memory), you can also wrap the code block in ```with torch.no_grad():```.
 * This is useful when evaluating a model that has trainable parameters (```.requires_grad=True```) but for which we don't need the gradients.
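
The word "accumulated" in the ```.grad``` item matters in practice. A minimal sketch (the tensor ```w``` and the expression are illustrative only) showing that a second ```backward()``` call adds to the stored gradient rather than replacing it:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(3 * w) = 3 for every element
(w * 3).sum().backward()
first = w.grad.clone()

# Second backward pass on a fresh graph: the new gradient is ADDED to w.grad
(w * 3).sum().backward()
second = w.grad.clone()

# Reset the accumulated gradient in place before the next computation
w.grad.zero_()

print(first)
print(second)
print(w.grad)
```

This is why training loops typically zero the gradients before each backward pass.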

### 2.1.2 ```Function``` class

```Tensor``` and ```Function``` are interconnected and build up an acyclic graph that encodes the complete history of computation.

* ```.grad_fn``` attribute

 * Each tensor has a ```.grad_fn``` attribute that references the ```Function``` that created the ```Tensor```.
 * ```grad_fn``` is ```None``` for tensors created directly by the user (leaf tensors) and for any tensor that does not require gradients.
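
A short sketch of how ```.grad_fn``` behaves for user-created tensors versus results of operations. The names ```a```, ```b``` and ```c``` are illustrative; ```.is_leaf``` is a standard tensor attribute that marks tensors sitting at the start of the graph.

```python
import torch

a = torch.ones(2, requires_grad=True)  # created by the user, tracked
b = torch.ones(2)                      # created by the user, not tracked
c = a * b                              # produced by an operation on a tracked tensor

# User-created tensors have no grad_fn, even when requires_grad=True
print(a.is_leaf, a.grad_fn)
print(b.is_leaf, b.grad_fn)

# The result of the multiplication records the Function that created it
print(c.is_leaf, type(c.grad_fn).__name__)
```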

### 2.1.3 Code

Create a tensor and set ```requires_grad=True``` to track computation with it:




In [3]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


Do a tensor operation:

In [4]:
y = x + 2
print(y,"\n")
print(y.grad_fn)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>) 

<AddBackward0 object at 0x7ff8a74bf048>


```y``` was created as a result of an operation, so it has a ```grad_fn```.

Do more operations on y:

In [5]:
z = y * y * 3
out = z.mean()

print(z,"\n")
print(out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) 

tensor(27., grad_fn=<MeanBackward0>)


We can use ```.requires_grad_( ... )``` to change an existing Tensor’s ```requires_grad``` flag in-place. 

In [6]:
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)

a.requires_grad_(True)
print(a.requires_grad,"\n")

b = (a * a).sum()
print(b.requires_grad)
print(b.grad_fn)

False
True 

True
<SumBackward0 object at 0x7ff8a74c5860>


## 2.2 Gradients
 
Now we can backpropagate.

Because ``out`` contains a single scalar, ``out.backward()`` is
equivalent to ``out.backward(torch.tensor(1.))``.

In [7]:
out.backward()
# out.backward(torch.tensor(1.))
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


We have that $out = \frac{1}{4}\sum_i z_i$,
$z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$.
Therefore,
$\frac{\partial out}{\partial x_i} = \frac{3}{2}(x_i+2)$, hence
$\frac{\partial out}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
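
The derivation above can be confirmed in a few lines. This sketch rebuilds the same computation in a single expression and compares autograd's result with the closed form $\frac{3}{2}(x_i+2)$; the variable name ```analytic``` is introduced here for illustration.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)

# Same computation as above: z = 3 * (x + 2)^2, out = mean(z)
out = (3 * (x + 2) ** 2).mean()
out.backward()

# Closed-form gradient from the derivation: d(out)/dx_i = 1.5 * (x_i + 2)
analytic = 1.5 * (x.detach() + 2)

print(x.grad)
print(torch.allclose(x.grad, analytic))
```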

If the output is no longer a scalar, ```.backward()``` cannot be called without an argument, and ```torch.autograd``` does not compute the full Jacobian directly. If we just want the vector-Jacobian product, simply pass the vector to ```backward``` as an argument:

In [9]:
x = torch.randn(3, requires_grad=True)

# Keep doubling y until its norm reaches 1000
y = x * 2
while y.detach().norm() < 1000:
    y = y * 2
print(y)

# y is not a scalar, so pass the vector v for the vector-Jacobian product
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)

tensor([1388.3756, -279.0660, -226.1103], grad_fn=<MulBackward0>)
tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])
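
In this example every element of ```y``` is the matching element of ```x``` multiplied by the same power of two, so the Jacobian is diagonal and ```x.grad``` is simply ```v``` times that factor (1024 in the run shown above). The sketch below makes this explicit; the seed and the counter ```doublings``` are assumptions added for illustration.

```python
import torch

torch.manual_seed(0)  # assumed seed, only to make the run repeatable

x = torch.randn(3, requires_grad=True)

y = x * 2
doublings = 1  # counts how many times x has been multiplied by 2
while y.detach().norm() < 1000:
    y = y * 2
    doublings += 1

v = torch.tensor([0.1, 1.0, 0.0001])
y.backward(v)

# y = (2 ** doublings) * x, so J = (2 ** doublings) * I and J^T v = (2 ** doublings) * v
scale = 2.0 ** doublings
print(scale)
print(torch.allclose(x.grad, v * scale))
```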


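You can also stop autograd from tracking history on Tensors with ```.requires_grad=True``` by wrapping the code block in ```with torch.no_grad():```

In [10]: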
print(x.requires_grad)
print((x ** 2).requires_grad)
with torch.no_grad():
    print((x ** 2).requires_grad)

True
True
False
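
A typical use is model evaluation. The sketch below uses a small ```torch.nn.Linear``` layer and random inputs as stand-ins (both are illustrative, not part of the tutorial) to show that the parameters keep ```requires_grad=True``` while the output computed inside the block is not tracked.

```python
import torch

model = torch.nn.Linear(3, 1)  # stand-in model with trainable parameters
inputs = torch.randn(4, 3)     # stand-in evaluation batch

# Normal forward pass: the output is part of the autograd graph
tracked = model(inputs)

# Forward pass under no_grad: no graph is recorded
with torch.no_grad():
    untracked = model(inputs)

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn)

# The parameters themselves are unchanged and still trainable
print(all(p.requires_grad for p in model.parameters()))
```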


Alternatively, use ```.detach()``` to get a new Tensor with the same content that does not require gradients:

In [12]:
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all()) # True if x and y hold the same values

True
False
tensor(True)
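
One caveat worth knowing: the tensor returned by ```.detach()``` shares its underlying storage with the original, so in-place changes to one are visible in the other. A minimal sketch (variable names are illustrative) that also shows ```.detach().clone()``` as a way to get an independent copy:

```python
import torch

x = torch.ones(3, requires_grad=True)

# The detached tensor is a view on the same data, not a copy
y = x.detach()
y[0] = 5.0  # in-place write through the detached tensor
print(x)    # the first element of x has changed as well

# clone() after detach() gives a copy with its own storage
c = x.detach().clone()
c[1] = 7.0
print(x)    # x is unaffected by the write to c
```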
