In [None]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline


Automatic Differentiation with `torch.autograd`
===============================================

When training neural networks, the most frequently used algorithm is
**back propagation**. In this algorithm, parameters (model weights) are
adjusted according to the **gradient** of the loss function with respect
to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation
engine called `torch.autograd`. It supports automatic computation of
gradients for any computational graph.

Consider the simplest one-layer neural network, with input `x`,
parameters `w` and `b`, and some loss function. It can be defined in
PyTorch in the following manner:


In [None]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b

# A loss function for binary classification problems
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

Tensors, Functions and Computational graph
==========================================

This code defines the following **computational graph**:

![](https://pytorch.org/tutorials/_static/img/basics/comp-graph.png)

In this network, `w` and `b` are **parameters**, which we need to
optimize. Thus, we need to be able to compute the gradients of the loss
function with respect to those variables. In order to do that, we set
the `requires_grad` property of those tensors.


<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>
<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">
<p>You can set the value of <code>requires_grad</code> when creating a tensor, or later by using the <code>x.requires_grad_(True)</code> method.</p>
</div>
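
As a small illustration of the note above, the two ways of enabling gradient tracking can be sketched as follows. The tensor names `a` and `c` are chosen only for this example.

```python
import torch

# Option 1: request gradient tracking when the tensor is created.
a = torch.randn(3, requires_grad=True)
print(a.requires_grad)  # True

# Option 2: create the tensor first, then switch tracking on in place.
c = torch.randn(3)
print(c.requires_grad)  # False
c.requires_grad_(True)
print(c.requires_grad)  # True
```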


A function that we apply to tensors to construct a computational graph is
in fact an object of class `Function`. This object knows how to compute
the function in the *forward* direction, and also how to compute its
derivative during the *backward propagation* step. A reference to the
backward propagation function is stored in the `grad_fn` property of a
tensor. You can find more information about `Function` [in the
documentation](https://pytorch.org/docs/stable/autograd.html#function).
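
To make the forward/backward pairing concrete, here is a minimal sketch of a user-defined `Function`. The class name `Square` and the tensor names are hypothetical and exist only for this illustration; built-in operations follow the same idea but are implemented inside PyTorch.

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example: y = x**2 with a hand-written derivative."""

    @staticmethod
    def forward(ctx, x):
        # Save the input; the backward step needs it to compute 2 * x.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: incoming gradient times the local derivative dy/dx = 2x.
        return grad_output * 2 * x

t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
s = Square.apply(t)
print(s.grad_fn)   # the backward node that autograd recorded for `s`

s.sum().backward()
print(t.grad)      # tensor([2., 4., 6.])
```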


In [None]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Computing Gradients
===================

To optimize the parameters of the neural network, we need to
compute the derivatives of our loss function with respect to those parameters,
namely, we need $\frac{\partial loss}{\partial w}$ and
$\frac{\partial loss}{\partial b}$ under some fixed values of `x` and
`y`. To compute those derivatives, we call `loss.backward()`, and then
retrieve the values from `w.grad` and `b.grad`:


In [None]:
loss.backward()
print(w.grad)
print(b.grad)
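
As a sanity check, the gradients for this particular network can also be derived by hand. With the default mean reduction over the three outputs, $\frac{\partial loss}{\partial z} = (\sigma(z) - y)/3$, so $\frac{\partial loss}{\partial b}$ equals that vector and $\frac{\partial loss}{\partial w}$ is its outer product with `x`. The sketch below rebuilds the network so that it runs on its own; the fixed seed and the `expected_*` names are additions for this example.

```python
import torch

torch.manual_seed(0)  # assumption: fixed seed, only for reproducibility

x = torch.ones(5)
y = torch.zeros(3)
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = torch.matmul(x, w) + b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)
loss.backward()

with torch.no_grad():
    # d(loss)/dz for the mean-reduced loss over y.numel() == 3 outputs
    dz = (torch.sigmoid(z) - y) / y.numel()
    expected_b = dz                   # z_j depends on b_j with slope 1
    expected_w = torch.outer(x, dz)   # z_j depends on w_ij with slope x_i

print(torch.allclose(b.grad, expected_b))  # True
print(torch.allclose(w.grad, expected_w))  # True
```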

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>
<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">
<ul>
<li>We can only obtain the <code>grad</code> properties for the leaf nodes of the computational graph, which have the <code>requires_grad</code> property set to <code>True</code>. For all other nodes in our graph, gradients will not be available.</li>
<li>We can only perform gradient calculations using <code>backward</code> once on a given graph, for performance reasons. If we need to do several <code>backward</code> calls on the same graph, we need to pass <code>retain_graph=True</code> to the <code>backward</code> call.</li>
</ul>
</div>
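
Both points in the note can be observed in a short experiment. The sketch below uses a simplified loss (a sum of squares, chosen only for this example). It also uses `retain_grad()`, which is not discussed above: it is a tensor method that asks autograd to keep the gradient of a non-leaf tensor.

```python
import torch

x = torch.ones(5)
w = torch.randn(5, 3, requires_grad=True)

z = torch.matmul(x, w)   # non-leaf tensor: by default z.grad is not kept
z.retain_grad()          # opt in to keeping the gradient of z
loss = z.pow(2).sum()

# First call keeps the graph's buffers so that a second call is possible.
loss.backward(retain_graph=True)
print(z.grad is not None)  # True, because of retain_grad()

# Second call succeeds, but frees the buffers (no retain_graph this time).
loss.backward()

raised = False
try:
    loss.backward()        # third call on a graph whose buffers are gone
except RuntimeError:
    raised = True
print(raised)              # True
```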


Disabling Gradient Tracking
===========================

By default, all tensors with `requires_grad=True` track their
computational history and support gradient computation. However, there
are some cases when we do not need to do that, for example, when we have
trained the model and just want to apply it to some input data, i.e. we
only want to do *forward* computations through the network. We can stop
tracking computations by surrounding our computation code with
a `torch.no_grad()` block:


In [None]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

Another way to achieve the same result is to use the `detach()` method
on the tensor:


In [None]:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

There are reasons you might want to disable gradient tracking:

   -   To mark some parameters in your neural network as **frozen
        parameters**.
        
   -   To **speed up computations** when you are only doing a forward
        pass, because computations on tensors that do not track
        gradients are more efficient.
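
Both use cases can be sketched with a small model. The two-layer `nn.Sequential` network and its layer sizes below are hypothetical and serve only as an illustration.

```python
import torch
from torch import nn

# Hypothetical model, used only for this illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Freeze the first layer: its parameters stop requesting gradients.
for p in model[0].parameters():
    p.requires_grad_(False)

out = model(torch.randn(16, 4))
out.sum().backward()

print(model[0].weight.grad)              # None: frozen, no gradient computed
print(model[2].weight.grad is not None)  # True: still trainable

# Forward pass only: nothing inside this block is recorded by autograd.
with torch.no_grad():
    pred = model(torch.randn(1, 4))
print(pred.requires_grad)                # False
```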


More on Computational Graphs
============================

Conceptually, autograd keeps a record of data (tensors) and all executed
operations (along with the resulting new tensors) in a directed acyclic
graph (DAG) consisting of
[Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
objects. In this DAG, leaves are the input tensors, roots are the output
tensors. By tracing this graph from roots to leaves, you can
automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

-   run the requested operation to compute a resulting tensor
-   maintain the operation's *gradient function* in the DAG.

The backward pass kicks off when `.backward()` is called on the DAG
root. `autograd` then:

-   computes the gradients from each `.grad_fn`,
-   accumulates them in the respective tensor's `.grad` attribute
-   using the chain rule, propagates all the way to the leaf tensors.
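
The recorded DAG can be inspected by following `grad_fn.next_functions` from the root towards the leaves. The helper `walk` below is a hypothetical utility written for this sketch, and the node class names it prints are an implementation detail that may differ between PyTorch versions.

```python
import torch

a = torch.tensor(2.0, requires_grad=True)  # leaf
b = a * 3
c = b + 1                                   # root of this small graph

def walk(fn, depth=0, names=None):
    """Collect and print the backward nodes reachable from `fn`."""
    names = [] if names is None else names
    if fn is None:          # inputs that do not require gradients
        return names
    name = type(fn).__name__
    names.append(name)
    print("  " * depth + name)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names

names = walk(c.grad_fn)
```

The last node printed corresponds to the leaf `a`: it is the node that accumulates the result into `a.grad`.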

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>
<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">
<p>An important thing to note is that the graph is recreated from scratch; after each <code>.backward()</code> call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.</p>
</div>
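
As an illustration of this dynamic behaviour, the sketch below runs a Python loop whose length changes between iterations, so each forward pass records a different graph. The function `repeated_product` is hypothetical and exists only for this example.

```python
import torch

def repeated_product(x, steps):
    # Ordinary Python control flow: the graph gets one multiply per iteration.
    y = x
    for _ in range(steps):
        y = y * x
    return y  # equals x ** (steps + 1)

x = torch.tensor(2.0, requires_grad=True)
grads = []
for steps in (1, 2, 3):
    y = repeated_product(x, steps)   # a fresh graph is built on every call
    x.grad = None                    # discard the gradient of the previous graph
    y.backward()
    grads.append(x.grad.item())

print(grads)  # [4.0, 12.0, 32.0], i.e. 2x, 3x^2 and 4x^3 at x = 2
```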


Optional Reading: Tensor Gradients and Jacobian Products
========================================================

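- The Jacobian matrix is the matrix of partial derivatives of a vector-valued function with respect to its vector input; it contains the partial derivative of every output with respect to every input.
- The gradient of a tensor describes the partial derivatives of a scalar-valued function with respect to a tensor input; it indicates how changes in the input tensor affect the scalar output.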

In many cases, we have a scalar loss function, and we need to compute
the gradient with respect to some parameters. However, there are cases
when the output function is an arbitrary tensor. In this case, PyTorch
allows you to compute a so-called **Jacobian product**, and not the actual
gradient.


For a vector function $\vec{y}=f(\vec{x})$, where
$\vec{x}=\langle x_1,\dots,x_n\rangle$ and
$\vec{y}=\langle y_1,\dots,y_m\rangle$, the gradient of $\vec{y}$ with
respect to $\vec{x}$ is given by the **Jacobian matrix**:

$$\begin{aligned}
J=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)
\end{aligned}$$

Instead of computing the Jacobian matrix itself, PyTorch allows you to
compute the **Jacobian Product** $v^T\cdot J$ for a given input vector
$v=(v_1 \dots v_m)$. This is achieved by calling `backward` with $v$ as
an argument. The size of $v$ should be the same as the size of the
tensor on which `backward` is called, i.e. the output $\vec{y}$:
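
A minimal sketch of this product for a non-square Jacobian ($m=2$, $n=3$) is shown below. It uses `torch.autograd.grad` with `grad_outputs=v`, which returns the product directly instead of accumulating it into `.grad`; the function and the values of `v` are made up for this example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)   # n = 3 inputs
y = torch.stack([x[0] * x[1], x[1] + x[2]])               # m = 2 outputs
v = torch.tensor([1.0, 2.0])                              # same size as y

# J = [[x1, x0, 0],
#      [0,  1,  1]]   evaluated at x = (1, 2, 3)
# v^T J = 1 * [2, 1, 0] + 2 * [0, 1, 1] = [2, 3, 2]
(vjp,) = torch.autograd.grad(y, x, grad_outputs=v)
print(vjp)  # tensor([2., 3., 2.])
```

Note that the result has the size of `x`, while `v` has the size of `y`.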


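In PyTorch, computing a **Jacobian Product** does not require explicitly building the full **Jacobian matrix**. PyTorch's automatic differentiation engine computes the product efficiently, without first materializing the whole Jacobian. This is achieved through **backpropagation** and the **chain rule**.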

### **Understanding the Jacobian Product**

Suppose we have a vector-valued function $\vec{y} = f(\vec{x})$, where:
- $\vec{x}$ is an $n$-dimensional input vector (or tensor).
- $\vec{y}$ is an $m$-dimensional output vector (or tensor).

Its Jacobian matrix $J$ is an $m \times n$ matrix whose element $J_{ij}$ is $\frac{\partial y_i}{\partial x_j}$.

We want to compute the Jacobian Product, that is:

$$v^T \cdot J = \sum_{i=1}^{m} v_i \cdot \nabla_{\vec{x}} y_i$$

where $v$ is an $m$-dimensional vector and $\nabla_{\vec{x}} y_i$ is the $i$-th row of $J$.

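### **Why not compute the Jacobian matrix explicitly?**
If we built the Jacobian matrix $J$ explicitly, it could become very large. In deep learning in particular, model inputs and outputs can be high-dimensional tensors, so both the memory footprint and the time needed to compute the entire Jacobian are very high.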

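### **How does PyTorch compute the Jacobian Product?**
PyTorch uses automatic differentiation and the **chain rule** to compute the Jacobian Product efficiently, without ever computing the whole Jacobian matrix explicitly.

When you call `backward`, PyTorch computes gradients through backpropagation. If you pass a vector $v$ as an argument, PyTorch computes the product of this vector with the Jacobian matrix rather than building the full Jacobian. This happens in two steps:

1. **Forward pass:** compute the output $\vec{y}$.
2. **Backward pass:** PyTorch applies the chain rule operation by operation instead of building the whole Jacobian matrix. Each layer performs the differentiation it needs with respect to its own input.

When you call `backward(v)`, PyTorch actually computes $v^T \cdot J$: the gradient of each output component is weighted by the corresponding entry of $v$, and the weighted gradients are summed.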

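### **An example**

Suppose we have a simple vector-valued function $f(x)$ and we want to compute the Jacobian Product:

```python
import torch

# Define the input tensor with gradient tracking enabled
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Define the vector-valued function f(x)
y = x ** 2  # the output is the vector y = [x_0^2, x_1^2, x_2^2]

# Define the vector v
v = torch.tensor([1.0, 0.5, 0.1])  # same size as the output y

# Compute the Jacobian Product: v^T * J
y.backward(v)  # pass the vector v

# Print the gradient accumulated in x
print(x.grad)  # tensor([2.0000, 2.0000, 0.6000])
```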

#### Explanation:
- The function is $y = f(x) = [x_0^2, x_1^2, x_2^2]$; its output is a 3-dimensional vector.
- By calling `y.backward(v)`, PyTorch computes $v^T \cdot J$, that is, $v_0 \frac{\partial y_0}{\partial x} + v_1 \frac{\partial y_1}{\partial x} + v_2 \frac{\partial y_2}{\partial x}$, where $v = [1.0, 0.5, 0.1]$.

The resulting gradient is $[2, 2, 0.6]$, which is the Jacobian Product and not the full Jacobian matrix.
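
As a cross-check on a problem this small, the full Jacobian can be built explicitly and multiplied by $v$ by hand. The sketch below uses `torch.autograd.functional.jacobian` for that purpose; wrapping the computation in a function `f` is a choice made for this example, and building the full matrix is only practical here because it is 3 by 3.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    return x ** 2

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.5, 0.1])

# Jacobian Product via backpropagation
f(x).backward(v)

# Full Jacobian, built explicitly: diag(2 * x) for this function
J = jacobian(f, x.detach())
print(J)

# The explicit product matches what backward(v) stored in x.grad
print(torch.allclose(v @ J, x.grad))  # True
```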

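### **Summary**
- The **Jacobian Product** is obtained by multiplying a vector $v$ with the Jacobian matrix, and PyTorch can compute this product directly through backpropagation, without explicitly building the whole Jacobian matrix.
- PyTorch exposes this through `backward(v)`, where the size of `v` must match the size of the output tensor. Passing $v$ lets you compute the Jacobian Product efficiently.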

In [None]:
inp = torch.eye(4, 5, requires_grad=True)
out = (inp+1).pow(2).t()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")

out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")

inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

Notice that when we call `backward` for the second time with the same
argument, the value of the gradient is different. This happens because
when doing `backward` propagation, PyTorch **accumulates the
gradients**, i.e. the value of computed gradients is added to the `grad`
property of all leaf nodes of the computational graph. If you want to
compute the proper gradients, you need to zero out the `grad` property
beforehand. In real-life training an *optimizer* helps us to do this.
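
A minimal sketch of how an optimizer handles this is shown below, assuming plain `torch.optim.SGD` and a toy loss $\sum_i w_i^2$ chosen only for this example. Calling `zero_grad()` at the start of every iteration clears the gradients left over from the previous one.

```python
import torch

w = torch.ones(3, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for step in range(3):
    opt.zero_grad()          # clear gradients from the previous iteration
    loss = (w ** 2).sum()
    loss.backward()          # w.grad now holds only this iteration's gradient, 2 * w
    opt.step()               # w <- w - 0.1 * 2 * w = 0.8 * w
    print(step, w.detach())
```

With the gradients cleared each time, every step scales `w` by 0.8; without the `zero_grad()` call the stale gradients would be added in and the updates would be larger.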


<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>
<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">
<p>Previously we were calling the <code>backward()</code> function without parameters. This is essentially equivalent to calling <code>backward(torch.tensor(1.0))</code>, which is a useful way to compute the gradients in case of a scalar-valued function, such as loss during neural network training.</p>
</div>
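
The equivalence in the note can be checked directly. The sketch below uses a made-up scalar function and also shows that a non-scalar output has no such implicit argument.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Scalar output, no argument
(w ** 3).sum().backward()
g_default = w.grad.clone()

# Same computation, explicit argument of 1.0
w.grad = None
(w ** 3).sum().backward(torch.tensor(1.0))
g_explicit = w.grad.clone()

print(torch.equal(g_default, g_explicit))  # True

# For a non-scalar output the vector v has to be supplied explicitly
needs_argument = False
try:
    (w ** 3).backward()
except RuntimeError:
    needs_argument = True
print(needs_argument)  # True
```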


------------------------------------------------------------------------


Further Reading
===============

-   [Autograd
    Mechanics](https://pytorch.org/docs/stable/notes/autograd.html)
