In [1]:
%matplotlib inline


Autograd: Automatic Differentiation
===================================

The ``autograd`` package is at the core of all neural networks in PyTorch.
Let's take a brief look at it first, and then train our first neural network.


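The ``autograd`` package provides automatic differentiation for all operations on Tensors.
It is a **define-by-run** framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.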
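
To make "define-by-run" concrete, here is a minimal sketch in which the number of recorded operations depends on a runtime value, so each call builds a different graph. The helper ``forward`` and its ``n_steps`` parameter are hypothetical names chosen for illustration.

```python
import torch

def forward(x, n_steps):
    # The loop length is an ordinary Python value, so the recorded
    # graph has a different depth on each call.
    y = x
    for _ in range(n_steps):
        y = y * 2
    return y.sum()

x = torch.ones(3, requires_grad=True)

grads = []
for n_steps in (1, 3):
    x.grad = None                      # reset so gradients do not accumulate
    forward(x, n_steps).backward()
    grads.append(x.grad.clone())

print(grads[0])  # every entry equals 2 ** 1
print(grads[1])  # every entry equals 2 ** 3
```

The gradient changes with ``n_steps`` because backprop simply replays whatever operations actually ran.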

Let us see this in simpler terms with some examples.

Tensor
--------

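``torch.Tensor`` is the central class used by the ``autograd`` package. If you set its attribute ``.requires_grad`` to ``True``, it starts to track all operations on it. When you finish your computation, you can call ``.backward()`` to have all the gradients computed automatically. The gradient for this tensor will be accumulated into its ``.grad`` attribute.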
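
Because gradients are accumulated into ``.grad`` rather than overwritten, calling ``.backward()`` twice adds the two results together. The following is a small sketch of that behaviour; the tensor ``w`` and the variables ``first`` and ``second`` are illustrative names only.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()       # gradient of sum(3 * w) is 3 per element

(w * 3).sum().backward()     # a second backward pass adds to .grad
second = w.grad.clone()

w.grad.zero_()               # reset in place before the next pass
print(first, second, w.grad)
```

This is why training loops usually clear the gradients before each backward pass.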

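To stop a tensor from tracking history, you can call ``.detach()`` to detach it from the computation history and to prevent future computation from being tracked. To prevent tracking history (and using memory), you can also wrap the code block in ``with torch.no_grad():``. This is particularly helpful when evaluating a model, because the model may have trainable parameters with
``requires_grad=True``, but we do not need the gradients for them.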
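
The two mechanisms can be compared side by side. This is a minimal sketch with illustrative variable names: ``.detach()`` returns a tensor without history, while ``torch.no_grad()`` disables recording for everything computed inside the block.

```python
import torch

x = torch.ones(2, requires_grad=True)
y = x * 2

d = y.detach()               # same values as y, but cut off from the graph
print(d.requires_grad)       # False
print(d.grad_fn)             # None

with torch.no_grad():
    z = x * 2                # this multiplication is not recorded
print(z.requires_grad)       # False
print(x.requires_grad)       # True: x itself is unchanged
```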

There is one more class that is very important for the autograd implementation: ``Function``.

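``Tensor`` and ``Function`` are interconnected and build up an acyclic graph that encodes a complete history of computation. Each tensor
has a ``.grad_fn`` attribute that references the ``Function`` that created it (for example, a tensor holding 8 that was computed as 4 + 4 records that addition), except for tensors created directly by the user, which come from no operation: their
``grad_fn is None``.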
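
As a quick illustration of the point above, the sketch below builds 8 as 4 + 4 and inspects both tensors. The names ``a`` and ``b`` are illustrative; ``is_leaf`` is the attribute that marks tensors not produced by a tracked operation.

```python
import torch

a = torch.tensor(4.0, requires_grad=True)   # created by the user
b = a + a                                    # created by an addition

print(a.is_leaf, a.grad_fn)   # a leaf tensor has no grad_fn
print(b.is_leaf, b.grad_fn)   # b remembers the addition that produced it
```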

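If you want to compute the derivatives, you can call ``.backward()`` on a ``Tensor``. If the ``Tensor`` is a scalar (i.e. it holds a single element of data), you do not need to specify any arguments to ``backward()``,
but if it has more elements, you need to specify a ``gradient``
argument that is a tensor of matching shape.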
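
A minimal sketch of the rule above, with illustrative names: calling ``backward()`` without arguments on a non-scalar tensor raises a ``RuntimeError``, while passing a ``gradient`` tensor of the same shape works.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                        # non-scalar output

raised = False
try:
    y.backward()                  # no gradient argument for a 3-element tensor
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Supplying a tensor with the same shape as y resolves the ambiguity.
y.backward(gradient=torch.ones_like(y))
print(x.grad)                     # 2 * x
```

Passing a tensor of ones here is equivalent to calling ``y.sum().backward()``.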



In [2]:
import torch

Create a tensor and set ``requires_grad=True`` to track computation with it



In [3]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


Do a tensor operation:



In [4]:
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


``y`` was created as the result of an addition, so it has a ``grad_fn`` attribute.



In [5]:
print(y.grad_fn)

<AddBackward0 object at 0x0000020EBC9D4780>


Do more operations on ``y``



In [6]:
z = y * y * 3
out = z.mean()

print(z, out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward1>)


``.requires_grad_( ... )`` changes an existing Tensor's ``requires_grad`` flag in-place.
A tensor created without the flag has ``requires_grad`` set to ``False`` by default.



In [7]:
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

False
True
<SumBackward0 object at 0x0000020EAC64A828>


Gradients
---------
Let's backprop now.<br>
Because ``out`` contains a single scalar, ``out.backward()``
is equivalent to ``out.backward(torch.tensor(1.))``.



In [8]:
out.backward()

Print gradients d(out)/dx




In [9]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


You should get a matrix filled with ``4.5``. Call the ``out``
*Tensor* “$o$”.
As a function of $x$, we have $o = \frac{1}{4}\sum_i z_i$,
$z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$.
Backpropagation then differentiates $o$ with respect to $x$:
$\frac{\partial o}{\partial x_i} = \frac{1}{4}\cdot 6(x_i+2) = \frac{3}{2}(x_i+2)$, hence
$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
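
The derivation can be checked numerically. The sketch below rebuilds the same computation in one expression and compares ``x.grad`` with the closed form $\frac{3}{2}(x_i+2)$; the variable ``analytic`` is an illustrative name.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()   # same o as above, written in one line
out.backward()

# Closed-form gradient from the derivation: 3/2 * (x_i + 2)
analytic = 1.5 * (x.detach() + 2)

print(x.grad)
print(torch.allclose(x.grad, analytic))
```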



Mathematically, if you have a vector valued function $\vec{y}=f(\vec{x})$,
then the gradient of $\vec{y}$ with respect to $\vec{x}$
is a Jacobian matrix:

\begin{align}J=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)\end{align}

Generally speaking, ``torch.autograd`` is an engine for computing
vector-Jacobian product. That is, given any vector
$v=\left(\begin{array}{cccc} v_{1} & v_{2} & \cdots & v_{m}\end{array}\right)^{T}$,
compute the product $v^{T}\cdot J$. If $v$ happens to be
the gradient of a scalar function $l=g\left(\vec{y}\right)$,
that is,
$v=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$,
then by the chain rule, the vector-Jacobian product would be the
gradient of $l$ with respect to $\vec{x}$:

\begin{align}J^{T}\cdot v=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)\left(\begin{array}{c}
   \frac{\partial l}{\partial y_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial y_{m}}
   \end{array}\right)=\left(\begin{array}{c}
   \frac{\partial l}{\partial x_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial x_{n}}
   \end{array}\right)\end{align}

(Note that $v^{T}\cdot J$ gives a row vector which can be
treated as a column vector by taking $J^{T}\cdot v$.)

This characteristic of vector-Jacobian product makes it very
convenient to feed external gradients into a model that has
non-scalar output.



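Called without arguments, ``backward()`` in PyTorch only differentiates a scalar with respect to tensors.<br>
Scalar (a single element, such as a loss) with respect to a vector (such as the parameters of each layer): the derivative is a vector (possibly obtained as a chain of Jacobians multiplied by a vector).<br>
Vector with respect to vector: the derivative is a Jacobian matrix (which ``backward()`` does not return explicitly).<br>
--[backward in PyTorch](https://blog.csdn.net/witnessai1/article/details/79763596)<br>
Knowing the internal mechanism, you can still recover the Jacobian: pass a one-hot vector $v$ to ``backward()``. With $v = \[1,0,0\]$, ``x.grad`` holds the first row of $J$ (equivalently, the first column of $J^{T}$); $\[0,1,0\]$ gives the second, and so on.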
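
The one-hot approach can be sketched as follows. The function ``f`` is a hypothetical example chosen so that the Jacobian is easy to verify by hand, and ``torch.autograd.functional.jacobian`` is used only as a cross-check.

```python
import torch

def f(x):
    # Hypothetical example function from R^3 to R^3
    return torch.stack([x[0] * x[1], x[1] ** 2, x[0] + x[2]])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = f(x)

rows = []
for i in range(y.numel()):
    v = torch.zeros_like(y)
    v[i] = 1.0                           # one-hot vector selecting y_i
    x.grad = None                        # clear the accumulated gradient
    y.backward(v, retain_graph=True)     # x.grad now holds row i of J
    rows.append(x.grad.clone())

J = torch.stack(rows)
reference = torch.autograd.functional.jacobian(f, x.detach())

print(J)
print(torch.allclose(J, reference))
```

Each ``backward`` call costs one pass through the graph, so this takes one pass per output element; ``retain_graph=True`` keeps the graph alive between passes.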

Now let's take a look at an example of vector-Jacobian product:


In [33]:
x = torch.randn(3, requires_grad=True)
print(x)
y = x * 2
while y.detach().norm() < 1000:
    y = y * 2

print(y)

tensor([ 1.1482, -0.8346, -1.1552], requires_grad=True)
tensor([ 1175.7565,  -854.6229, -1182.9468], grad_fn=<MulBackward0>)


Now in this case ``y`` is no longer a scalar. ``torch.autograd``
could not compute the full Jacobian directly, but if we just
want the vector-Jacobian product, simply pass the vector to
``backward`` as argument:



In [27]:
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)

tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])
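
The printed gradient is consistent with the math: ``y`` is ``x`` multiplied by a power of two, so $J$ is that power of two times the identity and $v^{T}\cdot J$ is just $v$ scaled by it (1024 in the run above). The sketch below fixes the exponent at a hypothetical ``k = 10`` instead of using the data-dependent loop, so the result is reproducible.

```python
import torch

x = torch.randn(3, requires_grad=True)
k = 10                              # assumed number of doublings
y = x * 2 ** k                      # J = 2**k * identity

v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)                       # computes v^T . J

print(x.grad)
print(torch.allclose(x.grad, v * 2 ** k))
```

Note that the result does not depend on the random values in ``x``, because the function is linear.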


You can also stop autograd from tracking history on Tensors
with ``.requires_grad=True`` by wrapping the code block in
``with torch.no_grad():``



In [11]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

True
True
False


**Read Later:**

Documentation of ``autograd`` and ``Function`` is at
https://pytorch.org/docs/autograd

