In [1]:
import sys
import torch
from torch import nn
print(torch.__version__)
print(sys.version_info)

2.0.0+cu118
sys.version_info(major=3, minor=10, micro=11, releaselevel='final', serial=0)


- compute loss
    - forward
- loss.backward() (or any objective.backward())
    - backward(compute grad)
- optimizer.step()
    - update parameters
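
The three steps above can be sketched as a single training step. This is a minimal illustration, not code from the original notebook: the `nn.Linear` model, the SGD optimizer, the learning rate, and the toy data are all assumptions chosen only to make the example runnable.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy setup (not from the notebook): one linear layer, one sample.
model = nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.tensor([[1., 2.]])
y = torch.tensor([[10.]])

weight_before = model.weight.detach().clone()

optimizer.zero_grad()                        # clear gradients from any previous step
loss = nn.functional.mse_loss(model(x), y)   # forward: compute the loss
loss.backward()                              # backward: fill .grad of every parameter
optimizer.step()                             # update the parameters

weight_after = model.weight.detach().clone()
print(model.weight.grad)
print(weight_before, weight_after)
```

Note that `optimizer.step()` itself updates the parameters in place; it is allowed to do so because the update runs with gradient tracking disabled.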

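#### Two kinds of in-place operations that are not allowed
1. In-place operations are not allowed on a leaf tensor with requires_grad=True
    - all Parameters are leaf nodes and require grad
    - tensor.is_leaf == True
2. In-place operations are not allowed on a tensor that is needed during the backward pass (gradient computation)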

In [2]:
w = torch.FloatTensor(10)
w.requires_grad = True

In [3]:
w

tensor([-2.0806e-26,  5.7309e-21, -2.9400e+30,  4.5677e-41, -4.1277e+30,
         4.5677e-41,  1.6304e-21,  6.9602e+06, -6.3681e+32,  4.5678e-41],
       requires_grad=True)

In [4]:
w.is_leaf

True

In [5]:
w.normal_()

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

In [6]:
w.data.requires_grad

False

In [7]:
w.data.normal_()

tensor([ 1.2882,  0.9667, -1.1425, -0.3947, -0.5808, -1.7354, -1.4438, -0.3609,
        -0.7266, -0.4986])

In [8]:
w.data

tensor([ 1.2882,  0.9667, -1.1425, -0.3947, -0.5808, -1.7354, -1.4438, -0.3609,
        -0.7266, -0.4986])
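
As an alternative to going through `w.data`, the same in-place initialization can be done inside a `torch.no_grad()` block. The sketch below assumes a fresh tensor created with `torch.empty` (a stand-in for the `torch.FloatTensor(10)` used above) and records whether the unguarded call raises.

```python
import torch

# Uninitialized leaf tensor that requires grad.
w = torch.empty(10, requires_grad=True)

# Outside no_grad, the in-place op on a leaf that requires grad is rejected.
try:
    w.normal_()
    raised = False
except RuntimeError:
    raised = True

# Inside no_grad, the same in-place op is allowed and is not recorded by autograd.
with torch.no_grad():
    w.normal_()

print(raised, w.requires_grad, w.is_leaf)
```

After the block, `w` is still a leaf tensor that requires grad; only its values have changed.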

#### Tensors needed during gradient computation (whether or not they are leaf nodes/variables/parameters)

In [9]:
x = torch.FloatTensor([[1., 2.]])
w1 = torch.FloatTensor([[2.], [1.]])
w2 = torch.FloatTensor([3.])
w1.requires_grad = True
w2.requires_grad = True

In [10]:
w2.is_leaf

True

In [11]:
d = torch.matmul(x, w1)
f = torch.matmul(d, w2)
d[:] = 0

f.backward()

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1, 1]], which is output 0 of torch::autograd::CopySlices, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

In [12]:
d = torch.matmul(x, w1)
d[:] = 0
f = torch.matmul(d, w2)
f.backward()

In [13]:
w2.grad

tensor([0.])

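- When f is computed, d holds a specific value, and the derivative of f with respect to w2 depends on that value of d.
- If d changes after f has been computed, f.backward() would compute a wrong derivative for w2. To prevent this, PyTorch raises an error instead.
- The root cause is that when f = torch.matmul(d, w2) executes, PyTorch's autograd saves a reference to d for the later backward computation.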
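
A small sketch of how autograd detects the change, and of one way to avoid the error. It reuses the tensors from the cells above; the names `d_bad`, `d_zeroed`, `version_before`, and `version_after` are illustrative. `_version` is the tensor's internal version counter, which in-place operations increment; autograd compares it with the version recorded when the tensor was saved, which is what the "is at version 1; expected version 0" message reports.

```python
import torch

x = torch.tensor([[1., 2.]])
w1 = torch.tensor([[2.], [1.]], requires_grad=True)
w2 = torch.tensor([3.], requires_grad=True)

# An in-place write bumps the version counter of the tensor.
d_bad = torch.matmul(x, w1)
version_before = d_bad._version
d_bad[:] = 0
version_after = d_bad._version

# Safe variant: leave d untouched after it was used, and build a new tensor instead.
d = torch.matmul(x, w1)        # d = [[4.]]
f = torch.matmul(d, w2)        # autograd saves d here
d_zeroed = torch.zeros_like(d) # out-of-place: d keeps its saved value
f.backward()

print(version_before, version_after)
print(w2.grad)                 # df/dw2 equals the value of d
```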

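#### .data vs. .detach()
- .detach()
    - returns a new tensor, detached from the current graph
    - The result will never require gradient.
- The tensors returned by x.data and x.detach() have similarities and differences. What they have in common:
    - both share the same underlying data with x
    - both are independent of x's computation history
    - requires_grad=False
- The difference shows up when the returned tensor is modified in place:
    - modifying through x.data raises no error, but the gradients computed afterwards are wrong (effectively a hidden bug)
    - modifying through x.detach() makes the later backward() raise an error (safer for gradients)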

In [14]:
a = torch.tensor([1, 2, 3.], requires_grad=True)
out = a.sigmoid()

c = out.data
print(f'a.requires_grad={a.requires_grad}, c.requires_grad={c.requires_grad}, out.requires_grad={out.requires_grad}')

a.requires_grad=True, c.requires_grad=False, out.requires_grad=True


In [15]:
print(out)
print(c)

tensor([0.7311, 0.8808, 0.9526], grad_fn=<SigmoidBackward0>)
tensor([0.7311, 0.8808, 0.9526])


In [16]:
out.sum().backward()
print(a.grad, a.sigmoid()*(1-a.sigmoid()))

tensor([0.1966, 0.1050, 0.0452]) tensor([0.1966, 0.1050, 0.0452], grad_fn=<MulBackward0>)
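
The cells above only inspect `out.data`; the difference between the two only becomes visible once the shared data is modified in place. The following sketch does that for both variants (the names `silent_grad` and `detach_raised` are illustrative). Sigmoid's backward reuses its output, so zeroing the output changes the gradient.

```python
import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)

# Variant 1: modify through .data. Autograd does not notice the change.
out = a.sigmoid()
out.data.zero_()
out.sum().backward()           # no error is raised
silent_grad = a.grad.clone()   # computed from the zeroed output, so it is wrong

# Variant 2: modify through .detach(). The change is tracked via the version counter.
a.grad = None
out = a.sigmoid()
out.detach().zero_()
try:
    out.sum().backward()
    detach_raised = False
except RuntimeError:
    detach_raised = True

print(silent_grad)
print(detach_raised)
```

With `.data`, the gradient comes out as all zeros instead of the correct `sigmoid(a) * (1 - sigmoid(a))` shown above, and nothing signals the problem.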


#### embedding

Because W in the line computing a requires gradients, we must save embedding.weight to compute those gradients in backward
pass. However, we don’t need to save the entire embedding.weight matrix, just the ro

In [17]:
n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=1)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight.clone() @ W.t() # weight must be cloned for this to be differentiable
b = embedding(idx) @ W.t()
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()

In [18]:
n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=1)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight @ W.t() # weight is NOT cloned here, so backward() fails below
b = embedding(idx) @ W.t()
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3, 5]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Because W requires gradients in the line computing a, autograd must save embedding.weight to compute those gradients in the backward pass. However, in the line computing b, executing embedding(idx) renormalizes embedding.weight in place according to max_norm. So, without cloning it in line a, embedding.weight is modified when line b executes, which changes what was saved for the backward pass that computes the gradient of W. Hence the requirement to clone embedding.weight: the clone preserves the values before they are rescaled in line b. If you don't use embedding.weight outside of the normal forward pass, you don't need to worry about any of this.
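
The in-place renormalization can be observed directly. In this sketch the weights are filled with ones (an assumption made only so that every row norm is known to be sqrt(5), above max_norm), and the row norms are compared before and after a lookup; `norms_before` and `norms_after` are illustrative names.

```python
import torch
from torch import nn

embedding = nn.Embedding(3, 5, max_norm=1)

# Illustrative initialization: every row becomes all ones, so each norm is sqrt(5).
with torch.no_grad():
    embedding.weight.fill_(1.0)

idx = torch.tensor([1, 2])
norms_before = embedding.weight.detach().norm(dim=1).clone()

_ = embedding(idx)   # the forward lookup renormalizes rows 1 and 2 in place

norms_after = embedding.weight.detach().norm(dim=1)
print(norms_before)
print(norms_after)
```

Only the rows that were looked up are rescaled; row 0 keeps its original norm.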