<img src="../../images/Transformer.png" width="40%">

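At a high level, the Transformer has two parts: an encoder and a decoder. The encoder takes the input tokens and outputs a sequence of hidden states. The decoder takes the tokens generated up to the previous step, uses the encoder's output states as part of its input, and returns the predicted probability distribution over the next token.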


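The encoder's input is the sum of the token embedding and the positional encoding. The encoder consists of N stacked layers (blocks), and each block has two parts: Multi-Head Attention, which computes a representation of the sequence with respect to itself, and a feed-forward neural network.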

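Each decoder block has three parts. The first part takes the target tokens and their positional encoding as input and applies masked multi-head attention; the mask is needed because, at prediction time, a position may only see earlier tokens and never later ones. The second part is cross-attention: the output of the masked multi-head attention is the query, and the encoder's output states are the key and value, so the block computes the relevance between the encoder sequence and the decoder sequence and produces a representation from it. The third part is a feed-forward network. Finally, the whole network ends with a linear layer that maps to the output probabilities.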

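Because the Transformer by itself is insensitive to both local and global position information, we need to add a positional encoding. And because the encoder and decoder blocks contain many residual connections, the position information can propagate effectively to the upper layers.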

## Encoder

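In the `encoder`, the sparse `one-hot` vector is first fed into a fully connected layer without `bias` to obtain a dense vector. This saves memory and also gives each word a richer representation. This layer is called the embedding layer: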

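```python
import math

import torch.nn as nn


# The "s" in Embeddings refers to two identical embedding layers that share parameters.
class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        """
        d_model: dimension of the word embedding.
        vocab: size of the vocabulary. For English-to-French translation, the English
               vocabulary size is the source vocabulary size and the French vocabulary
               size is the target vocabulary size.
        """
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        """
        x: the embedding is the first layer, so x is the tensor of vocabulary indices
           that the input text was mapped to.
        """
        # math.sqrt(self.d_model) acts as a scaling factor.
        return self.lut(x) * math.sqrt(self.d_model)
```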

Instantiating and calling `nn.Embedding` looks like this:

```python
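import torch
import torch.nn as nn

embedding = nn.Embedding(10, 3)  # 10 words in the vocabulary, 3-dimensional embeddings
input_word = torch.LongTensor([[1, 2, 4, 1], [4, 3, 2, 9]])
print(embedding(input_word))  # shape: (2, 4, 3)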
```
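The claim that the embedding layer behaves like a bias-free fully connected layer applied to `one-hot` vectors can be checked directly. The following is a minimal sketch, not part of the original notebook; the sizes and the names `looked_up` and `projected` are chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 10, 3  # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)
token_ids = torch.tensor([[1, 2, 4, 1], [4, 3, 2, 9]])

# Path 1: direct table lookup.
looked_up = embedding(token_ids)  # (2, 4, 3)

# Path 2: one-hot vectors multiplied by the same weight matrix (a linear map without bias).
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # (2, 4, 10)
projected = one_hot @ embedding.weight  # (2, 4, 3)

print(torch.allclose(looked_up, projected))
```

The lookup avoids materializing the one-hot tensor, which is where the memory saving comes from.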


### Position Encoding

Because this is a sequence modeling problem and the `Transformer` itself is insensitive to position, a `position encoding` is introduced and added to the `word embedding`.

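We make three assumptions about the `position encoding`. First, it should be deterministic. For example, dividing a sequence uniformly over the interval `0-1` is a poor choice, because the code at each position then depends on the sequence length: with a length of `10` the second position is `0.1`, but with a length of `5` it becomes `0.2`. Second, the distance between the same relative positions should be consistent across sentences of different lengths: two tokens that are two words apart in a long sentence and two tokens that are two words apart in a short sentence should have the same relative distance. Third, the positional encoding should generalize to test sentences that are longer than the training sentences. Given these three assumptions, the authors proposed using sin and cos functions to represent this information.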
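These properties can be checked numerically. The sketch below is illustrative and not part of the original notebook: the helper name `sinusoidal_pe` is hypothetical, the sizes are arbitrary, and it uses the sinusoidal formula that appears later in this section.

```python
import torch

def sinusoidal_pe(num_positions, d_model):
    """Build a (num_positions, d_model) table of sin/cos positional encodings."""
    pos = torch.arange(num_positions, dtype=torch.float64).reshape(-1, 1)
    exponent = torch.arange(0, d_model, 2, dtype=torch.float64) / d_model
    freq = torch.pow(10000.0, exponent).reshape(1, -1)
    pe = torch.zeros(num_positions, d_model, dtype=torch.float64)
    pe[:, 0::2] = torch.sin(pos / freq)  # even columns
    pe[:, 1::2] = torch.cos(pos / freq)  # odd columns
    return pe

pe_short = sinusoidal_pe(5, 8)
pe_long = sinusoidal_pe(50, 8)

# Deterministic: a position gets the same code regardless of the sequence length.
same_code = torch.allclose(pe_short, pe_long[:5])

# Consistent relative distance: the dot product of two codes depends only on
# their offset, because sin(a)sin(b) + cos(a)cos(b) = cos(a - b).
offset = 2
dot_a = pe_long[3] @ pe_long[3 + offset]
dot_b = pe_long[30] @ pe_long[30 + offset]

print(same_code, torch.allclose(dot_a, dot_b))
```

Because the table is computed from a closed-form function, it can be extended to any length at test time without retraining.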



In [1]:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

A sequence modeling problem has a source sentence and a target sentence. We build the sequences, with each token represented by its index:

In [2]:
batch_size = 2
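src_len = torch.randint(2, 5, (batch_size, )) # source sequence lengths: minimum 2, maximum 4 (the upper bound 5 is exclusive)
tgt_len = torch.randint(2, 5, (batch_size, )) # target sequence lengths: minimum 2, maximum 4 (the upper bound 5 is exclusive)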
print(src_len)
print(tgt_len)

tensor([4, 3])
tensor([3, 4])


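Explanation of the output: the batch size is 2, so there are two sentences, and each number is the length of one sentence. With the lengths in hand, we can generate the source and target sequences.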

In [3]:
max_num_src_word = 8  # size of the source vocabulary is 8
max_num_tgt_word = 8  # size of the target vocabulary is 8

src_seq = [torch.randint(1, max_num_src_word, (L,)) for L in src_len]
tgt_seq = [torch.randint(1, max_num_tgt_word, (L,)) for L in tgt_len]
print(src_seq)
print(tgt_seq)

[tensor([5, 6, 5, 5]), tensor([6, 1, 1])]
[tensor([5, 3, 7]), tensor([5, 2, 4, 3])]


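This randomly generates source sequences with a batch size of 2, lengths given by src_len, and a vocabulary size of max_num_src_word. The target sequences are generated the same way. However, the sequences fed into the network must all have the same length, so we need to add padding.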

In [4]:
max_src_seq_len = max(src_len)  # maximum source sequence length
max_tgt_seq_len = max(tgt_len)  # maximum target sequence length

# No padding on the left; pad with 0 on the right.
src_seq = [F.pad(torch.randint(1, max_num_src_word, (L,)), (0, max_src_seq_len-L)) for L in src_len]
tgt_seq = [F.pad(torch.randint(1, max_num_tgt_word, (L,)), (0, max_tgt_seq_len-L)) for L in tgt_len]
print(src_seq)
print(tgt_seq)

[tensor([7, 4, 4, 2]), tensor([4, 3, 5, 0])]
[tensor([3, 6, 6, 0]), tensor([7, 5, 5, 3])]


Next we concatenate them into a two-dimensional tensor. Before calling cat, each sequence needs an extra dimension, which we add with torch.unsqueeze:

In [5]:
max_src_seq_len = max(src_len)  # maximum source sequence length
max_tgt_seq_len = max(tgt_len)  # maximum target sequence length

# No padding on the left; pad with 0 on the right.
src_seq = [F.pad(torch.unsqueeze(torch.randint(1, max_num_src_word, (L,)), 0), (0, max_src_seq_len-L)) \
                                                                                            for L in src_len]
tgt_seq = [F.pad(torch.unsqueeze(torch.randint(1, max_num_tgt_word, (L,)), 0), (0, max_tgt_seq_len-L)) \
                                                                                            for L in tgt_len]

src_seq = torch.cat(src_seq, 0)
tgt_seq = torch.cat(tgt_seq, 0)

print(src_seq)
print(tgt_seq)

tensor([[4, 3, 5, 6],
        [5, 2, 2, 0]])
tensor([[3, 2, 7, 0],
        [7, 3, 2, 2]])


Next we construct the embedding. The official documentation is at:

[https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html?highlight=embedding#torch.nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html?highlight=embedding#torch.nn.Embedding)

```python
CLASS torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None, device=None, dtype=None)
```

The num_embeddings parameter is the size of the vocabulary, and embedding_dim is the dimension of the embedding vectors.

In [6]:
model_dim = 8 # 512 in the original paper

src_embedding_table = nn.Embedding(max_num_src_word, model_dim)
tgt_embedding_table = nn.Embedding(max_num_tgt_word, model_dim)

print(src_embedding_table.weight)

Parameter containing:
tensor([[ 0.4440,  0.9642,  0.4904,  0.6679, -0.4161,  1.6588,  0.8813, -0.5609],
        [-1.3250,  3.0019,  0.7034,  0.1382, -0.6140, -0.3905, -0.6831, -0.1695],
        [ 1.6436, -0.1385, -0.3549,  0.0512, -1.6353, -0.0802,  0.9369, -1.3509],
        [-0.3555, -0.1949,  0.2134,  0.2540, -2.1946,  0.1603,  1.0509,  0.4482],
        [-1.1937, -0.0556,  1.0496, -0.1971,  0.8452, -0.6711,  0.6527,  0.4400],
        [-0.7475, -0.5348, -1.4369,  0.1944,  0.4523, -0.5907,  1.3619,  0.3157],
        [ 1.7535, -0.1772,  1.6677,  0.1393,  0.0733, -0.2864,  0.3537,  1.0370],
        [ 0.3950,  0.6527, -0.2264, -0.1220,  0.5487,  0.0403, -0.0487, -2.1227]],
       requires_grad=True)


In [7]:
src_embedding = src_embedding_table(src_seq)
tgt_embedding = tgt_embedding_table(tgt_seq)
print(src_embedding.size())
print(tgt_embedding.size())

torch.Size([2, 4, 8])
torch.Size([2, 4, 8])


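Next we construct the position embedding. In the formula below, $i$ is the position in the sequence, $\delta$ is the index of the embedding dimension, and $d$ is the model dimension: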

$$
\text{PE}(i,\delta) = 
\begin{cases}
\sin(\frac{i}{10000^{2\delta'/d}}) & \text{if } \delta = 2\delta'\\
\cos(\frac{i}{10000^{2\delta'/d}}) & \text{if } \delta = 2\delta' + 1\\
\end{cases}
$$

In [8]:
max_src_position_len = max_src_seq_len
max_tgt_position_len = max_tgt_seq_len


pos_matrix = torch.arange(max(max_src_position_len, max_tgt_position_len)).reshape(-1, 1)
i_mat = torch.pow(10000, torch.arange(0, model_dim, 2).reshape(1, -1)/model_dim)  # step of 2 over the dimension index
print(pos_matrix)
print(i_mat)

tensor([[0],
        [1],
        [2],
        [3]])
tensor([[   1.,   10.,  100., 1000.]])


Next we initialize a pe_embedding_table and fill in its values:

In [9]:
# Use the larger of max_src_position_len and max_tgt_position_len so that both position sequences can share one embedding table.

pe_embedding_table = torch.zeros(max(max_src_position_len, max_tgt_position_len), model_dim)

pe_embedding_table[:, 0::2] = torch.sin(pos_matrix / i_mat)  # fill the even columns
pe_embedding_table[:, 1::2] = torch.cos(pos_matrix / i_mat)  # fill the odd columns
print(pe_embedding_table)

tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
          1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
          9.9995e-01,  1.0000e-03,  1.0000e+00],
        [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
          9.9980e-01,  2.0000e-03,  1.0000e+00],
        [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9995e-02,
          9.9955e-01,  3.0000e-03,  1.0000e+00]])


Next we load the values of pe_embedding_table into an nn.Embedding:

In [10]:
pe_embedding = nn.Embedding(max(max_src_position_len, max_tgt_position_len), model_dim)
pe_embedding.weight = nn.Parameter(pe_embedding_table, requires_grad=False)
print(pe_embedding.weight)

Parameter containing:
tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
          1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
          9.9995e-01,  1.0000e-03,  1.0000e+00],
        [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
          9.9980e-01,  2.0000e-03,  1.0000e+00],
        [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9995e-02,
          9.9955e-01,  3.0000e-03,  1.0000e+00]])


After that, pe_embedding can be called like a function to embed the positions of src_seq and tgt_seq.

In [11]:
src_pos = torch.cat([torch.unsqueeze(torch.arange(max(src_len)), 0) for _ in src_len]).to(torch.int32)
tgt_pos = torch.cat([torch.unsqueeze(torch.arange(max(tgt_len)), 0) for _ in tgt_len]).to(torch.int32)

src_pe_embedding = pe_embedding(src_pos)
tgt_pe_embedding = pe_embedding(tgt_pos)

print(src_pe_embedding)
print(tgt_pe_embedding)

tensor([[[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
           1.0000e+00,  0.0000e+00,  1.0000e+00],
         [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
           9.9995e-01,  1.0000e-03,  1.0000e+00],
         [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
           9.9980e-01,  2.0000e-03,  1.0000e+00],
         [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9995e-02,
           9.9955e-01,  3.0000e-03,  1.0000e+00]],

        [[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
           1.0000e+00,  0.0000e+00,  1.0000e+00],
         [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
           9.9995e-01,  1.0000e-03,  1.0000e+00],
         [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
           9.9980e-01,  2.0000e-03,  1.0000e+00],
         [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9995e-02,
           9.9955e-01,  3.0000e-03,  1.0000e+00]

### Multi-head Self-attention

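Multi-head self-attention increases modeling capacity and enriches the representation space. It consists of several groups of Q, K, and V, and each group computes its own attention output. The outputs of all groups are then concatenated and passed through a linear layer without bias to obtain the final vector.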

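Multi-head attention reduces the feature dimension of each head. For example, if the original feature vector has 512 dimensions and is split into 8 heads, each head has 64 dimensions. This keeps the overall amount of computation roughly unchanged.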
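The split into heads is only a reshape of the feature dimension. The following is a minimal sketch with the sizes from the paragraph above (512 features, 8 heads); the variable names are illustrative and not taken from the notebook.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 4, 512, 8
head_dim = d_model // num_heads  # 512 / 8 = 64 features per head

x = torch.randn(batch_size, seq_len, d_model)

# Split the last dimension into (num_heads, head_dim) and move the head axis forward,
# so that attention can be computed independently for each head.
heads = x.reshape(batch_size, seq_len, num_heads, head_dim).transpose(1, 2)

# Concatenating the heads again restores the original layout.
merged = heads.transpose(1, 2).reshape(batch_size, seq_len, d_model)

print(heads.shape)
print(torch.equal(merged, x))
```

Each head works on a 64-dimensional slice, so the total number of multiply-adds in the score computation stays close to that of a single 512-dimensional head.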

In the encoder, Q and K are both obtained from the word embedding, each through its own linear layer.

Scaled Dot-Product Attention:

$$
Attention(Q, K, V) = softmax(\frac{QK^{T}}{\sqrt{d_{k}}}) V
$$

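For a pair of words, multiplying $Q$ by $K$ gives the inner product of the two words' vectors. Taking the inner product of one query with every key in turn gives a vector of similarity scores, and applying softmax turns those scores into a probability distribution. Dividing by $\sqrt{d_{k}}$ keeps the variance of the scores, and therefore of the softmax output, from becoming too large, and it also keeps the derivatives in the Jacobian matrix from collapsing to 0.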
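Why $\sqrt{d_{k}}$ specifically: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_{k}$, so dividing by $\sqrt{d_{k}}$ brings the variance back to about 1. The sketch below checks this empirically under that independence assumption; the sample size and `d_k = 64` are arbitrary choices.

```python
import math
import torch

torch.manual_seed(0)
d_k = 64
num_samples = 10000

# Random queries and keys with zero-mean, unit-variance components (an assumption).
q = torch.randn(num_samples, d_k)
k = torch.randn(num_samples, d_k)

raw = (q * k).sum(-1)           # unscaled dot products
scaled = raw / math.sqrt(d_k)   # scaled as in the attention formula

print(raw.var().item())     # close to d_k
print(scaled.var().item())  # close to 1
```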

In [12]:
alpha1 = 0.1
alpha2 = 10
score = torch.randn(5)
prob1 = F.softmax(score * alpha1, -1)
prob2 = F.softmax(score * alpha2, -1)
print(prob1)
print(prob2)

tensor([0.2184, 0.1966, 0.2013, 0.1828, 0.2009])
tensor([9.9946e-01, 2.7264e-05, 2.8074e-04, 1.8708e-08, 2.3486e-04])


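We can see that when the scores are scaled up, the variance of the probabilities is large and the effect on the final probabilities is not linear: large probabilities become relatively larger and small ones relatively smaller.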

In [13]:
def softmax_func(score):
    return F.softmax(score, -1)  # pass the dimension explicitly to avoid the implicit-dim warning

jaco_mat1 = torch.autograd.functional.jacobian(softmax_func, score * alpha1)
jaco_mat2 = torch.autograd.functional.jacobian(softmax_func, score * alpha2)

print(jaco_mat1)
print(jaco_mat2)

tensor([[ 0.1707, -0.0429, -0.0440, -0.0399, -0.0439],
        [-0.0429,  0.1580, -0.0396, -0.0359, -0.0395],
        [-0.0440, -0.0396,  0.1608, -0.0368, -0.0404],
        [-0.0399, -0.0359, -0.0368,  0.1494, -0.0367],
        [-0.0439, -0.0395, -0.0404, -0.0367,  0.1605]])
tensor([[ 5.4252e-04, -2.7249e-05, -2.8059e-04, -1.8698e-08, -2.3473e-04],
        [-2.7249e-05,  2.7263e-05, -7.6541e-09, -5.1004e-13, -6.4032e-09],
        [-2.8059e-04, -7.6541e-09,  2.8066e-04, -5.2520e-12, -6.5936e-08],
        [-1.8698e-08, -5.1004e-13, -5.2520e-12,  1.8708e-08, -4.3937e-12],
        [-2.3473e-04, -6.4032e-09, -6.5936e-08, -4.3937e-12,  2.3481e-04]])




We can see that when alpha is large, the gradients approach 0.

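If the encoder needs a mask, the masked elements of the score should be set to negative infinity. The shape of the mask must match the shape of the score.

We first construct a matrix of valid positions: valid positions are 1 and all other (padding) positions are filled with 0.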

In [14]:
valid_encoder_pos = torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L), (0, max_src_seq_len-L)), 0) \
                                               for L in src_len]), 2)
print(valid_encoder_pos.shape)

torch.Size([2, 4, 1])


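Multiplying this matrix by its own transpose gives the adjacency matrix, which indicates which pairs of positions are related: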

In [15]:
valid_encoder_pos_matrix = torch.bmm(valid_encoder_pos, valid_encoder_pos.transpose(1, 2))
print(valid_encoder_pos_matrix)

tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 0.],
         [1., 1., 1., 0.],
         [1., 1., 1., 0.],
         [0., 0., 0., 0.]]])


In [16]:
invalid_encoder_pos_matrix = 1 - valid_encoder_pos_matrix
print(invalid_encoder_pos_matrix)

tensor([[[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]],

        [[0., 0., 0., 1.],
         [0., 0., 0., 1.],
         [0., 0., 0., 1.],
         [1., 1., 1., 1.]]])


In [17]:
mask_encoder_self_attention = invalid_encoder_pos_matrix.to(torch.bool)
print(mask_encoder_self_attention)

tensor([[[False, False, False, False],
         [False, False, False, False],
         [False, False, False, False],
         [False, False, False, False]],

        [[False, False, False,  True],
         [False, False, False,  True],
         [False, False, False,  True],
         [ True,  True,  True,  True]]])


True marks positions that need to be masked, and False marks positions that do not.

In [18]:
score = torch.rand(batch_size, max(src_len), max(src_len))
masked_score = score.masked_fill(mask_encoder_self_attention, -1e9)  # a large negative value stands in for negative infinity
prob = F.softmax(masked_score, -1)

print(src_len)
print(score)
print(masked_score)
print(prob)

tensor([4, 3])
tensor([[[0.1168, 0.6235, 0.2864, 0.2388],
         [0.3130, 0.5479, 0.1937, 0.9192],
         [0.0649, 0.3943, 0.2376, 0.0828],
         [0.1616, 0.2483, 0.9941, 0.3251]],

        [[0.5100, 0.9872, 0.7432, 0.9793],
         [0.1166, 0.8348, 0.6957, 0.9492],
         [0.6588, 0.9512, 0.7108, 0.4154],
         [0.3914, 0.9951, 0.1293, 0.8756]]])
tensor([[[ 1.1677e-01,  6.2355e-01,  2.8638e-01,  2.3877e-01],
         [ 3.1300e-01,  5.4785e-01,  1.9374e-01,  9.1919e-01],
         [ 6.4862e-02,  3.9433e-01,  2.3756e-01,  8.2819e-02],
         [ 1.6165e-01,  2.4830e-01,  9.9411e-01,  3.2507e-01]],

        [[ 5.1005e-01,  9.8717e-01,  7.4319e-01, -1.0000e+09],
         [ 1.1661e-01,  8.3479e-01,  6.9566e-01, -1.0000e+09],
         [ 6.5880e-01,  9.5118e-01,  7.1082e-01, -1.0000e+09],
         [-1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09]]])
tensor([[[0.2010, 0.3337, 0.2382, 0.2271],
         [0.2006, 0.2537, 0.1780, 0.3677],
         [0.2175, 0.3024, 0.2585, 0.2215],


### LayerNorm & Residual

### Feedforward Neural Network

This feed-forward network models each position independently, and its parameters are shared across positions.
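The position-wise behaviour can be illustrated with a small sketch. It is not taken from the original notebook: the sizes are deliberately tiny (the paper uses 512 and 2048), and the post-norm residual arrangement `norm(x + ffn(x))` is assumed here to match the default layout of the PyTorch encoder layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32  # illustrative sizes

# Two linear layers: expand to d_ff, apply ReLU, project back to d_model.
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 4, d_model)  # (batch, seq_len, d_model)

# Residual connection followed by layer normalization.
out = norm(x + ffn(x))

# Applying the same network to each position separately gives the same result,
# which shows that positions do not interact and share one set of parameters.
per_position = torch.stack([ffn(x[:, t]) for t in range(x.size(1))], dim=1)

print(out.shape)
print(torch.allclose(ffn(x), per_position, atol=1e-6))
```

The output dimension has to equal `d_model`, otherwise the residual addition `x + ffn(x)` would not be defined.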

#### Linear(large)

#### Linear2(d_model)

### LayerNorm & Residual

## Decoder



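### output word embedding

The output word embedding is the embedding of the other sequence. In Chinese-to-English translation, for example, the encoder receives the Chinese tokens and the decoder receives the English tokens.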


### Position Embedding

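### Memory-based Multi-head Cross-attention

Multi-head cross-attention computes the relevance between the decoder and the encoder. The output of the decoder's MHA (multi-head attention) is used as the query, and the encoder's output is used as the key and value, to compute a context representation.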


$Q @ K^{T}$ has shape [batch_size, tgt_seq_len, src_seq_len], so the mask must be constructed with the same shape.
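As a rough illustration of these shapes, the sketch below runs cross-attention with `nn.MultiheadAttention`, using decoder states as the query and the encoder output (the "memory") as key and value. It is a sketch under assumed toy sizes; the tensors are random stand-ins, and padded source positions are marked through `key_padding_mask` rather than through a full [batch, tgt, src] mask.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, src_seq_len, tgt_seq_len, d_model, nhead = 2, 4, 3, 8, 2

cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

decoder_state = torch.randn(batch_size, tgt_seq_len, d_model)  # query
memory = torch.randn(batch_size, src_seq_len, d_model)         # key and value

# True marks padded source positions; the second sentence has length 3.
memory_key_padding_mask = torch.tensor([[False, False, False, False],
                                        [False, False, False, True]])

context, attn_weights = cross_attn(decoder_state, memory, memory,
                                   key_padding_mask=memory_key_padding_mask)

print(context.shape)       # one context vector per target position
print(attn_weights.shape)  # (batch_size, tgt_seq_len, src_seq_len), averaged over heads
```

The attention weights on the padded source position are zero for every target position of the second sentence, which is the effect the mask is meant to achieve.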

In [19]:
valid_decoder_pos = torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L), (0, max_tgt_seq_len-L)), 0) \
                                               for L in tgt_len]), 2)
print(valid_decoder_pos)

tensor([[[1.],
         [1.],
         [1.],
         [0.]],

        [[1.],
         [1.],
         [1.],
         [1.]]])


In [20]:
# Rows index decoder (target) positions, columns index encoder (source) positions.
valid_cross_pos_matrix = torch.bmm(valid_decoder_pos, valid_encoder_pos.transpose(1, 2))
print(valid_cross_pos_matrix)

tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [0., 0., 0., 0.]],

        [[1., 1., 1., 0.],
         [1., 1., 1., 0.],
         [1., 1., 1., 0.],
         [1., 1., 1., 0.]]])


In [21]:
invalid_cross_pos_matrix = 1 - valid_cross_pos_matrix
print(invalid_cross_pos_matrix)

In [None]:
mask_cross_attention = invalid_cross_pos_matrix.to(torch.bool)
print(mask_cross_attention)

### LayerNorm & Residual

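### Causal Multi-head Self-attention

Causal multi-head self-attention is attention with a mask, which makes it obey causality: each position can attend only to itself and to earlier positions.

First we construct the mask for decoder self-attention.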

In [None]:
tri_matrix = [torch.tril(torch.ones(L, L)) for L in tgt_len]
print(tri_matrix)

In [None]:
valid_decoder_tri_matrix = [torch.unsqueeze(F.pad(torch.tril(torch.ones(L, L)), (0, max(tgt_len)-L, 0, max(tgt_len)-L)), 0) \
                            for L in tgt_len]
print(valid_decoder_tri_matrix)

In [None]:
valid_decoder_tri_matrix = torch.cat(valid_decoder_tri_matrix, 0)
print(valid_decoder_tri_matrix)

In [None]:
valid_decoder_tri_matrix = (1 - valid_decoder_tri_matrix).to(torch.bool)
print(valid_decoder_tri_matrix)

In [None]:
score = torch.randn(batch_size, max(tgt_len), max(tgt_len))
masked_score = score.masked_fill(valid_decoder_tri_matrix, -1e9)
prob = F.softmax(masked_score, -1)
print(tgt_len)
print(prob)

### scaled_dot_product_attention

In [None]:
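import math

def scaled_dot_product_attention(Q, K, V, atten_mask):
    # Q: (batch, tgt_len, d_k); K, V: (batch, src_len, d_k); atten_mask is True where attention is not allowed.
    score = torch.bmm(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
    masked_score = score.masked_fill(atten_mask, -1e9)
    prob = F.softmax(masked_score, -1)
    context = torch.bmm(prob, V)
    return context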

### LayerNorm & Residual

### Feedforward Neural Network

#### Linear(large)

#### Linear2(d_model)

### LayerNorm & Residual

## Transformer Masked Loss

In [None]:

# batch_size=2, sequence_len=3, vocab_size = 4
logits = torch.randn(2, 3, 4)
label = torch.randint(0, 4, (2, 3))

In [None]:
logits = logits.transpose(1, 2)
print(F.cross_entropy(logits, label))
print(F.cross_entropy(logits, label, reduction="none"))

If the sentences have different lengths, we also need to mask the predicted output:

In [None]:
tgt_len = torch.Tensor([2, 3]).to(torch.int32)
mask = torch.cat([torch.unsqueeze(F.pad(torch.ones(L), (0, max(tgt_len)-L)), 0) for L in tgt_len])
print(mask)

In [None]:
print(F.cross_entropy(logits, label, reduction="none") * mask)

F.cross_entropy also provides this functionality itself through the ignore_index parameter, whose default value is -100.

In [None]:
label[0, 2] = -100
print(F.cross_entropy(logits, label, reduction="none"))

We can see that this output is masked out as well.
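The two approaches can be compared on the averaged loss. The sketch below is illustrative and uses fresh random tensors rather than the notebook's variables; it assumes the masked loss is averaged over valid tokens only, which is what `F.cross_entropy` does by default when `ignore_index` is used.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 3, 4
# cross_entropy expects the class dimension second: (batch, vocab, seq_len).
logits = torch.randn(batch_size, seq_len, vocab_size).transpose(1, 2)
label = torch.randint(0, vocab_size, (batch_size, seq_len))
tgt_len = torch.tensor([2, 3])

# 1 for valid target positions, 0 for padding.
mask = (torch.arange(seq_len)[None, :] < tgt_len[:, None]).float()

# Option 1: per-token loss, multiplied by the mask and averaged over valid tokens.
per_token = F.cross_entropy(logits, label, reduction="none")
loss_masked = (per_token * mask).sum() / mask.sum()

# Option 2: mark padded labels with ignore_index (-100 by default).
ignored_label = label.masked_fill(mask == 0, -100)
loss_ignore = F.cross_entropy(logits, ignored_label)

print(torch.allclose(loss_masked, loss_ignore))
```

Dividing by `mask.sum()` instead of the total number of positions matters: otherwise batches with more padding would report a smaller loss.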

## Introduction to the official PyTorch API

The source code of the Transformer in PyTorch is at: [https://github.com/pytorch/pytorch/blob/734a97a7c828d6bfe82764c9cae399c349828091/torch/nn/modules/transformer.py](https://github.com/pytorch/pytorch/blob/734a97a7c828d6bfe82764c9cae399c349828091/torch/nn/modules/transformer.py)

We can use torch.nn.Transformer to call this Transformer.

```python
Examples::
        >>> transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
        >>> src = torch.rand((10, 32, 512))
        >>> tgt = torch.rand((20, 32, 512))
        >>> out = transformer_model(src, tgt)
```

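### `__init__` in the Transformer class


The init function is shown below. d_model is the feature dimension of the whole transformer, and nhead is the number of heads in multi-head self-attention. num_encoder_layers and num_decoder_layers are the numbers of blocks. dim_feedforward is the hidden dimension of the feed-forward network: the output of multi-head attention is first mapped into the larger feature space of dim_feedforward=2048 and then mapped back to the 512-dimensional feature space, because the output dimension must be 512 before the residual connection can be applied.


The init function then instantiates several modules. The first is the encoder, implemented by the TransformerEncoder class, which takes encoder_layer, num_encoder_layers, and encoder_norm. TransformerEncoderLayer combines the multi-head self-attention call, the residual connections and layer normalization, and the feed-forward network. The decoder is built the same way and includes self-attention, cross-attention, and a feed-forward network.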

```python
def __init__(self, d_model: int = 512, nhead: int = 8, num_encoder_layers: int = 6,
                 num_decoder_layers: int = 6, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 custom_encoder: Optional[Any] = None, custom_decoder: Optional[Any] = None,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(Transformer, self).__init__()

        if custom_encoder is not None:
            self.encoder = custom_encoder
        else:
            encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout,
                                                    activation, layer_norm_eps, batch_first, norm_first,
                                                    **factory_kwargs)
            encoder_norm = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
            self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        if custom_decoder is not None:
            self.decoder = custom_decoder
        else:
            decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout,
                                                    activation, layer_norm_eps, batch_first, norm_first,
                                                    **factory_kwargs)
            decoder_norm = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
            self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm)

        self._reset_parameters()

        self.d_model = d_model
        self.nhead = nhead

        self.batch_first = batch_first
```

The whole transformer is built from four main components: TransformerEncoderLayer, TransformerEncoder, TransformerDecoderLayer, and TransformerDecoder.

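### forward in the Transformer class

In the forward function, we compute the final decoder output from the source sentence, the target sentence, src_mask, and tgt_mask. The source-side mask is needed because the sequences in a mini-batch have different lengths: before the softmax, the energy (score) of invalid positions is set to negative infinity. In this API, per-batch padding is passed through src_key_padding_mask, as the docstring below describes. tgt_mask is a causal mask.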

```python
def forward(self, src: Tensor, tgt: Tensor, src_mask: Optional[Tensor] = None, tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Take in and process masked source/target sequences.
        Args:
            src: the sequence to the encoder (required).
            tgt: the sequence to the decoder (required).
            src_mask: the additive mask for the src sequence (optional).
            tgt_mask: the additive mask for the tgt sequence (optional).
            memory_mask: the additive mask for the encoder output (optional).
            src_key_padding_mask: the ByteTensor mask for src keys per batch (optional).
            tgt_key_padding_mask: the ByteTensor mask for tgt keys per batch (optional).
            memory_key_padding_mask: the ByteTensor mask for memory keys per batch (optional).
        Shape:
            - src: :math:`(S, E)` for unbatched input, :math:`(S, N, E)` if `batch_first=False` or
              `(N, S, E)` if `batch_first=True`.
            - tgt: :math:`(T, E)` for unbatched input, :math:`(T, N, E)` if `batch_first=False` or
              `(N, T, E)` if `batch_first=True`.
            - src_mask: :math:`(S, S)` or :math:`(N\cdot\text{num\_heads}, S, S)`.
            - tgt_mask: :math:`(T, T)` or :math:`(N\cdot\text{num\_heads}, T, T)`.
            - memory_mask: :math:`(T, S)`.
            - src_key_padding_mask: :math:`(S)` for unbatched input otherwise :math:`(N, S)`.
            - tgt_key_padding_mask: :math:`(T)` for unbatched input otherwise :math:`(N, T)`.
            - memory_key_padding_mask: :math:`(S)` for unbatched input otherwise :math:`(N, S)`.
            Note: [src/tgt/memory]_mask ensures that position i is allowed to attend the unmasked
            positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend
            while the zero positions will be unchanged. If a BoolTensor is provided, positions with ``True``
            are not allowed to attend while ``False`` values will be unchanged. If a FloatTensor
            is provided, it will be added to the attention weight.
            [src/tgt/memory]_key_padding_mask provides specified elements in the key to be ignored by
            the attention. If a ByteTensor is provided, the non-zero positions will be ignored while the zero
            positions will be unchanged. If a BoolTensor is provided, the positions with the
            value of ``True`` will be ignored while the position with the value of ``False`` will be unchanged.
            - output: :math:`(T, E)` for unbatched input, :math:`(T, N, E)` if `batch_first=False` or
              `(N, T, E)` if `batch_first=True`.
            Note: Due to the multi-head attention architecture in the transformer model,
            the output sequence length of a transformer is same as the input sequence
            (i.e. target) length of the decode.
            where S is the source sequence length, T is the target sequence length, N is the
            batch size, E is the feature number
        Examples:
            >>> output = transformer_model(src, tgt, src_mask=src_mask, tgt_mask=tgt_mask)
        """

        is_batched = src.dim() == 3
        if not self.batch_first and src.size(1) != tgt.size(1) and is_batched:
            raise RuntimeError("the batch number of src and tgt must be equal")
        elif self.batch_first and src.size(0) != tgt.size(0) and is_batched:
            raise RuntimeError("the batch number of src and tgt must be equal")

        if src.size(-1) != self.d_model or tgt.size(-1) != self.d_model:
            raise RuntimeError("the feature number of src and tgt must be equal to d_model")

        memory = self.encoder(src, mask=src_mask, src_key_padding_mask=src_key_padding_mask)
        output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask,
                              tgt_key_padding_mask=tgt_key_padding_mask,
                              memory_key_padding_mask=memory_key_padding_mask)
        return output
```
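The following sketch shows one way these arguments might be combined in a call. It is not from the original notebook: the model is deliberately tiny, the inputs are random stand-ins for embedded sequences, and dropout is set to 0 so that the causality check at the end is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead = 16, 2  # small illustrative sizes
model = nn.Transformer(d_model=d_model, nhead=nhead, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=32,
                       dropout=0.0, batch_first=True)

src = torch.randn(2, 4, d_model)  # (batch, source length, features)
tgt = torch.randn(2, 3, d_model)  # (batch, target length, features)

# True marks padded source positions; the second source sentence has length 3.
src_key_padding_mask = torch.tensor([[False, False, False, False],
                                     [False, False, False, True]])
# Additive causal mask: -inf above the diagonal, 0 elsewhere.
tgt_mask = model.generate_square_subsequent_mask(3)

out = model(src, tgt, tgt_mask=tgt_mask,
            src_key_padding_mask=src_key_padding_mask,
            memory_key_padding_mask=src_key_padding_mask)

# Causality check: changing the last target token must not change earlier outputs.
tgt_changed = tgt.clone()
tgt_changed[:, -1] = torch.randn(2, d_model)
out_changed = model(src, tgt_changed, tgt_mask=tgt_mask,
                    src_key_padding_mask=src_key_padding_mask,
                    memory_key_padding_mask=src_key_padding_mask)

print(out.shape)
print(torch.allclose(out[:, :2], out_changed[:, :2], atol=1e-5))
```

The same padding mask is passed twice because the encoder output keeps the source layout, so the decoder's cross-attention must ignore the same padded positions.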



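TransformerEncoderLayer also takes several parameters in its init function. d_model is the feature dimension of the whole transformer, and nhead is the number of heads in multi-head self-attention. dim_feedforward is the output dimension of the first fully connected layer; the output dimension of the second fully connected layer equals d_model so that the residual connection can be applied.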


### init in the EncoderLayer class

```python
class TransformerEncoderLayer(Module):
    r"""TransformerEncoderLayer is made up of self-attn and feedforward network.
    This standard encoder layer is based on the paper "Attention Is All You Need".
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
    Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
    Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
    in a different way during application.
    Args:
        d_model: the number of expected features in the input (required).
        nhead: the number of heads in the multiheadattention models (required).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of the intermediate layer, can be a string
            ("relu" or "gelu") or a unary callable. Default: relu
        layer_norm_eps: the eps value in layer normalization components (default=1e-5).
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False`` (seq, batch, feature).
        norm_first: if ``True``, layer norm is done prior to attention and feedforward
            operations, respectivaly. Otherwise it's done after. Default: ``False`` (after).
    Examples::
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
        >>> src = torch.rand(10, 32, 512)
        >>> out = encoder_layer(src)
    Alternatively, when ``batch_first`` is ``True``:
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
        >>> src = torch.rand(32, 10, 512)
        >>> out = encoder_layer(src)
    Fast path:
        forward() will use a special optimized implementation if all of the following
        conditions are met:
        - Either autograd is disabled (using ``torch.inference_mode`` or ``torch.no_grad``) or no tensor
          argument ``requires_grad``
        - training is disabled (using ``.eval()``)
        - batch_first is ``True`` and the input is batched (i.e., ``src.dim() == 3``)
        - norm_first is ``False`` (this restriction may be loosened in the future)
        - activation is one of: ``"relu"``, ``"gelu"``, ``torch.functional.relu``, or ``torch.functional.gelu``
        - at most one of ``src_mask`` and ``src_key_padding_mask`` is passed
        - if src is a `NestedTensor <https://pytorch.org/docs/stable/nested.html>`_, neither ``src_mask``
          nor ``src_key_padding_mask`` is passed
        - the two ``LayerNorm`` instances have a consistent ``eps`` value (this will naturally be the case
          unless the caller has manually modified one without modifying the other)
        If the optimized implementation is in use, a
        `NestedTensor <https://pytorch.org/docs/stable/nested.html>`_ can be
        passed for ``src`` to represent padding more efficiently than using a padding
        mask. In this case, a `NestedTensor <https://pytorch.org/docs/stable/nested.html>`_ will be
        returned, and an additional speedup proportional to the fraction of the input that
        is padding can be expected.
    """
    __constants__ = ['batch_first', 'norm_first']

    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(TransformerEncoderLayer, self).__init__()
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,
                                            **factory_kwargs)
        # Implementation of Feedforward model
        self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
        self.dropout = Dropout(dropout)
        self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)

        self.norm_first = norm_first
        self.norm1 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.norm2 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.dropout1 = Dropout(dropout)
        self.dropout2 = Dropout(dropout)

        # Legacy string support for activation function.
        if isinstance(activation, str):
            activation = _get_activation_fn(activation)

        # We can't test self.activation in forward() in TorchScript,
        # so stash some information about it instead.
        if activation is F.relu:
            self.activation_relu_or_gelu = 1
        elif activation is F.gelu:
            self.activation_relu_or_gelu = 2
        else:
            self.activation_relu_or_gelu = 0
        self.activation = activation
```

The most important component here is MultiheadAttention.

### forward in the EncoderLayer class

```python
def forward(self, src: Tensor, src_mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the input through the encoder layer.
        Args:
            src: the sequence to the encoder layer (required).
            src_mask: the mask for the src sequence (optional).
            src_key_padding_mask: the mask for the src keys per batch (optional).
        Shape:
            see the docs in Transformer class.
        """

        # see Fig. 1 of https://arxiv.org/pdf/2002.04745v1.pdf

        if (src.dim() == 3 and not self.norm_first and not self.training and
            self.self_attn.batch_first and
            self.self_attn._qkv_same_embed_dim and self.activation_relu_or_gelu and
            self.norm1.eps == self.norm2.eps and
            ((src_mask is None and src_key_padding_mask is None)
             if src.is_nested
             else (src_mask is None or src_key_padding_mask is None))):
            tensor_args = (
                src,
                self.self_attn.in_proj_weight,
                self.self_attn.in_proj_bias,
                self.self_attn.out_proj.weight,
                self.self_attn.out_proj.bias,
                self.norm1.weight,
                self.norm1.bias,
                self.norm2.weight,
                self.norm2.bias,
                self.linear1.weight,
                self.linear1.bias,
                self.linear2.weight,
                self.linear2.bias,
            )
            if (not torch.overrides.has_torch_function(tensor_args) and
                    # We have to use a list comprehension here because TorchScript
                    # doesn't support generator expressions.
                    all([(x.is_cuda or 'cpu' in str(x.device)) for x in tensor_args]) and
                    (not torch.is_grad_enabled() or all([not x.requires_grad for x in tensor_args]))):
                return torch._transformer_encoder_layer_fwd(
                    src,
                    self.self_attn.embed_dim,
                    self.self_attn.num_heads,
                    self.self_attn.in_proj_weight,
                    self.self_attn.in_proj_bias,
                    self.self_attn.out_proj.weight,
                    self.self_attn.out_proj.bias,
                    self.activation_relu_or_gelu == 2,
                    False,  # norm_first, currently not supported
                    self.norm1.eps,
                    self.norm1.weight,
                    self.norm1.bias,
                    self.norm2.weight,
                    self.norm2.bias,
                    self.linear1.weight,
                    self.linear1.bias,
                    self.linear2.weight,
                    self.linear2.bias,
                    src_mask if src_mask is not None else src_key_padding_mask,
                )
        x = src
        if self.norm_first:
            x = x + self._sa_block(self.norm1(x), src_mask, src_key_padding_mask)
            x = x + self._ff_block(self.norm2(x))
        else:
            x = self.norm1(x + self._sa_block(x, src_mask, src_key_padding_mask))
            x = self.norm2(x + self._ff_block(x))

        return x

    # self-attention block
    def _sa_block(self, x: Tensor,
                  attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
        x = self.self_attn(x, x, x,
                           attn_mask=attn_mask,
                           key_padding_mask=key_padding_mask,
                           need_weights=False)[0]
        return self.dropout1(x)

    # feed forward block
    def _ff_block(self, x: Tensor) -> Tensor:
        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.dropout2(x)
```

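As the code shows, the encoder performs self-attention: the query, key, and value are all the same input `x`. The attention output is then added back to its input through a residual connection, with layer normalization applied before or after depending on `norm_first`. The feed-forward block is wrapped in the same way.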
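The post-norm branch (`norm_first=False`) can be checked by hand. The sketch below recomputes the two sub-layers from the layer's own submodules and compares the result with calling the layer directly. The sizes are illustrative, and `dropout=0.0` together with `eval()` is an assumption made here so that the dropout modules can be left out of the manual computation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes; dropout is disabled so the manual path can skip it.
layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=32, dropout=0.0)
layer.eval()

src = torch.rand(5, 2, 16)  # (seq, batch, feature)

with torch.no_grad():
    # Sub-layer 1: self-attention with q = k = v = src, then residual + LayerNorm.
    attn_out = layer.self_attn(src, src, src, need_weights=False)[0]
    x = layer.norm1(src + attn_out)

    # Sub-layer 2: position-wise feed-forward, then residual + LayerNorm.
    ff_out = layer.linear2(layer.activation(layer.linear1(x)))
    manual = layer.norm2(x + ff_out)

    reference = layer(src)

matches = torch.allclose(manual, reference, atol=1e-5)
print(matches)
```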

### init in the Encoder


```python
class TransformerEncoder(Module):
    r"""TransformerEncoder is a stack of N encoder layers
    Args:
        encoder_layer: an instance of the TransformerEncoderLayer() class (required).
        num_layers: the number of sub-encoder-layers in the encoder (required).
        norm: the layer normalization component (optional).
        enable_nested_tensor: if True, input will automatically convert to nested tensor
            (and convert back on output). This will improve the overall performance of
            TransformerEncoder when padding rate is high. Default: ``True`` (enabled).
    Examples::
        >>> encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
        >>> transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
        >>> src = torch.rand(10, 32, 512)
        >>> out = transformer_encoder(src)
    """
    __constants__ = ['norm']

    def __init__(self, encoder_layer, num_layers, norm=None, enable_nested_tensor=True):
        super(TransformerEncoder, self).__init__()
        self.layers = _get_clones(encoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
        self.enable_nested_tensor = enable_nested_tensor
```

`encoder_layer` is an already-instantiated encoder layer, and `num_layers` is the number of blocks to stack.
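A short sketch of what stacking means in practice, assuming `_get_clones` deep-copies the layer it is given (the helper's body is not shown in this section): each block gets its own parameters, but all blocks start from the same initial values as the layer passed in. The sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=32)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

w0 = encoder.layers[0].linear1.weight
w1 = encoder.layers[1].linear1.weight

num_blocks = len(encoder.layers)
# Copies of one layer start with identical values ...
same_initial_values = torch.equal(w0, w1)
# ... but live in separate storage, so training updates them independently.
shares_storage = w0.data_ptr() == w1.data_ptr()

print(num_blocks, same_initial_values, shares_storage)
```

If distinct initial weights per block are wanted, the parameters have to be re-initialized after the encoder is constructed.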

### forward in the Encoder

```python
    def forward(self, src: Tensor, mask: Optional[Tensor] = None, src_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the input through the encoder layers in turn.
        Args:
            src: the sequence to the encoder (required).
            mask: the mask for the src sequence (optional).
            src_key_padding_mask: the mask for the src keys per batch (optional).
        Shape:
            see the docs in Transformer class.
        """
        output = src
        convert_to_nested = False
        first_layer = self.layers[0]
        if isinstance(first_layer, torch.nn.TransformerEncoderLayer):
            if (not first_layer.norm_first and not first_layer.training and
                    first_layer.self_attn.batch_first and
                    first_layer.self_attn._qkv_same_embed_dim and first_layer.activation_relu_or_gelu and
                    first_layer.norm1.eps == first_layer.norm2.eps and
                    src.dim() == 3 and self.enable_nested_tensor) :
                if src_key_padding_mask is not None and not output.is_nested and mask is None:
                    tensor_args = (
                        src,
                        first_layer.self_attn.in_proj_weight,
                        first_layer.self_attn.in_proj_bias,
                        first_layer.self_attn.out_proj.weight,
                        first_layer.self_attn.out_proj.bias,
                        first_layer.norm1.weight,
                        first_layer.norm1.bias,
                        first_layer.norm2.weight,
                        first_layer.norm2.bias,
                        first_layer.linear1.weight,
                        first_layer.linear1.bias,
                        first_layer.linear2.weight,
                        first_layer.linear2.bias,
                    )
                    if not torch.overrides.has_torch_function(tensor_args):
                        if output.is_cuda or 'cpu' in str(output.device):
                            convert_to_nested = True
                            output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not())

        for mod in self.layers:
            if convert_to_nested:
                output = mod(output, src_mask=mask)
            else:
                output = mod(output, src_mask=mask, src_key_padding_mask=src_key_padding_mask)

        if convert_to_nested:
            output = output.to_padded_tensor(0.)

        if self.norm is not None:
            output = self.norm(output)

        return output

```
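The `src_key_padding_mask` argument passed through the loop above marks padded positions so that no token attends to them. As a minimal sketch with illustrative sizes, the example below overwrites the contents of a padded position and checks that the outputs at the real positions stay the same. `eval()` is used here so that dropout does not add randomness.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

encoder_layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=32)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
encoder.eval()  # disable dropout so both runs are deterministic

seq_len, batch = 4, 1
src = torch.rand(seq_len, batch, 16)  # (seq, batch, feature)

# True marks a padded key position; the mask shape is (batch, seq).
padding_mask = torch.tensor([[False, False, False, True]])

# Overwrite only the padded position with different values.
src_changed = src.clone()
src_changed[3] = torch.rand(batch, 16)

with torch.no_grad():
    out = encoder(src, src_key_padding_mask=padding_mask)
    out_changed = encoder(src_changed, src_key_padding_mask=padding_mask)

real_positions_unchanged = torch.allclose(out[:3], out_changed[:3], atol=1e-6)
print(real_positions_unchanged)
```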

### init in the DecoderLayer

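The first argument of `__init__` in `TransformerDecoderLayer` is again `d_model`, the feature dimension used throughout the transformer. `nhead` is likewise the number of heads in multi-head attention. Unlike the encoder layer, two `MultiheadAttention` modules are instantiated here: the first for self-attention and the second for cross-attention.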

```python
class TransformerDecoderLayer(Module):
    r"""TransformerDecoderLayer is made up of self-attn, multi-head-attn and feedforward network.
    This standard decoder layer is based on the paper "Attention Is All You Need".
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
    Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in
    Neural Information Processing Systems, pages 6000-6010. Users may modify or implement
    in a different way during application.
    Args:
        d_model: the number of expected features in the input (required).
        nhead: the number of heads in the multiheadattention models (required).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of the intermediate layer, can be a string
            ("relu" or "gelu") or a unary callable. Default: relu
        layer_norm_eps: the eps value in layer normalization components (default=1e-5).
        batch_first: If ``True``, then the input and output tensors are provided
            as (batch, seq, feature). Default: ``False`` (seq, batch, feature).
        norm_first: if ``True``, layer norm is done prior to self attention, multihead
            attention and feedforward operations, respectively. Otherwise it's done after.
            Default: ``False`` (after).
    Examples::
        >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
        >>> memory = torch.rand(10, 32, 512)
        >>> tgt = torch.rand(20, 32, 512)
        >>> out = decoder_layer(tgt, memory)
    Alternatively, when ``batch_first`` is ``True``:
        >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
        >>> memory = torch.rand(32, 10, 512)
        >>> tgt = torch.rand(32, 20, 512)
        >>> out = decoder_layer(tgt, memory)
    """
    __constants__ = ['batch_first', 'norm_first']

    def __init__(self, d_model: int, nhead: int, dim_feedforward: int = 2048, dropout: float = 0.1,
                 activation: Union[str, Callable[[Tensor], Tensor]] = F.relu,
                 layer_norm_eps: float = 1e-5, batch_first: bool = False, norm_first: bool = False,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super(TransformerDecoderLayer, self).__init__()
        self.self_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,
                                            **factory_kwargs)
        self.multihead_attn = MultiheadAttention(d_model, nhead, dropout=dropout, batch_first=batch_first,
                                                 **factory_kwargs)
        # Implementation of Feedforward model
        self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
        self.dropout = Dropout(dropout)
        self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)

        self.norm_first = norm_first
        self.norm1 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.norm2 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.norm3 = LayerNorm(d_model, eps=layer_norm_eps, **factory_kwargs)
        self.dropout1 = Dropout(dropout)
        self.dropout2 = Dropout(dropout)
        self.dropout3 = Dropout(dropout)

        # Legacy string support for activation function.
        if isinstance(activation, str):
            self.activation = _get_activation_fn(activation)
        else:
            self.activation = activation
```
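The structure built by this `__init__` can be inspected directly. The sketch below, with illustrative sizes, lists the attention and normalization submodules of a decoder layer to confirm that there are two separate `MultiheadAttention` modules and three `LayerNorm` modules.

```python
import torch.nn as nn

decoder_layer = nn.TransformerDecoderLayer(d_model=16, nhead=4, dim_feedforward=32)

# Collect the direct children that are attention modules.
attn_modules = {
    name: module
    for name, module in decoder_layer.named_children()
    if isinstance(module, nn.MultiheadAttention)
}

# Collect the names of the LayerNorm children, in registration order.
norm_names = [
    name
    for name, module in decoder_layer.named_children()
    if isinstance(module, nn.LayerNorm)
]

print(sorted(attn_modules))
print(norm_names)
```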


### forward in the DecoderLayer

```python
    def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None, memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None, memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the inputs (and mask) through the decoder layer.
        Args:
            tgt: the sequence to the decoder layer (required).
            memory: the sequence from the last layer of the encoder (required).
            tgt_mask: the mask for the tgt sequence (optional).
            memory_mask: the mask for the memory sequence (optional).
            tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
            memory_key_padding_mask: the mask for the memory keys per batch (optional).
        Shape:
            see the docs in Transformer class.
        """
        # see Fig. 1 of https://arxiv.org/pdf/2002.04745v1.pdf

        x = tgt
        if self.norm_first:
            x = x + self._sa_block(self.norm1(x), tgt_mask, tgt_key_padding_mask)
            x = x + self._mha_block(self.norm2(x), memory, memory_mask, memory_key_padding_mask)
            x = x + self._ff_block(self.norm3(x))
        else:
            x = self.norm1(x + self._sa_block(x, tgt_mask, tgt_key_padding_mask))
            x = self.norm2(x + self._mha_block(x, memory, memory_mask, memory_key_padding_mask))
            x = self.norm3(x + self._ff_block(x))

        return x

    # self-attention block
    def _sa_block(self, x: Tensor,
                  attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
        x = self.self_attn(x, x, x,
                           attn_mask=attn_mask,
                           key_padding_mask=key_padding_mask,
                           need_weights=False)[0]
        return self.dropout1(x)

    # multihead attention block
    def _mha_block(self, x: Tensor, mem: Tensor,
                   attn_mask: Optional[Tensor], key_padding_mask: Optional[Tensor]) -> Tensor:
        x = self.multihead_attn(x, mem, mem,
                                attn_mask=attn_mask,
                                key_padding_mask=key_padding_mask,
                                need_weights=False)[0]
        return self.dropout2(x)

    # feed forward block
    def _ff_block(self, x: Tensor) -> Tensor:
        x = self.linear2(self.dropout(self.activation(self.linear1(x))))
        return self.dropout3(x)
```
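In `_sa_block` the query, key, and value all come from `tgt`, while in `_mha_block` the query comes from the decoder and the key and value come from `memory`, the encoder output. The sketch below illustrates the effect of a causal `tgt_mask` under illustrative sizes: changing the last target position should leave the outputs at all earlier positions untouched. The mask is built by hand with `torch.triu`, which is one possible construction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

decoder_layer = nn.TransformerDecoderLayer(d_model=16, nhead=4, dim_feedforward=32)
decoder_layer.eval()  # disable dropout so both runs are deterministic

tgt_len, src_len, batch = 5, 3, 2
memory = torch.rand(src_len, batch, 16)  # stands in for the encoder output
tgt = torch.rand(tgt_len, batch, 16)

# Additive causal mask: -inf above the diagonal blocks attention to later positions.
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)

# Change only the last target position.
tgt_changed = tgt.clone()
tgt_changed[-1] = torch.rand(batch, 16)

with torch.no_grad():
    out = decoder_layer(tgt, memory, tgt_mask=tgt_mask)
    out_changed = decoder_layer(tgt_changed, memory, tgt_mask=tgt_mask)

earlier_unchanged = torch.allclose(out[:-1], out_changed[:-1], atol=1e-6)
last_changed = not torch.allclose(out[-1], out_changed[-1], atol=1e-6)
print(out.shape, earlier_unchanged, last_changed)
```

The output keeps the target length even though `memory` has a different length, because the query determines the output sequence length in cross-attention.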

### init in the Decoder

The `TransformerDecoder` class starts with its `__init__` function:

```python
class TransformerDecoder(Module):
    r"""TransformerDecoder is a stack of N decoder layers
    Args:
        decoder_layer: an instance of the TransformerDecoderLayer() class (required).
        num_layers: the number of sub-decoder-layers in the decoder (required).
        norm: the layer normalization component (optional).
    Examples::
        >>> decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
        >>> transformer_decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
        >>> memory = torch.rand(10, 32, 512)
        >>> tgt = torch.rand(20, 32, 512)
        >>> out = transformer_decoder(tgt, memory)
    """
    __constants__ = ['norm']

    def __init__(self, decoder_layer, num_layers, norm=None):
        super(TransformerDecoder, self).__init__()
        self.layers = _get_clones(decoder_layer, num_layers)
        self.num_layers = num_layers
        self.norm = norm
```

`decoder_layer` is an instance of the decoder layer class, and `num_layers` is the total number of blocks.



### forward in the Decoder

```python
    def forward(self, tgt: Tensor, memory: Tensor, tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None, tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None) -> Tensor:
        r"""Pass the inputs (and mask) through the decoder layer in turn.
        Args:
            tgt: the sequence to the decoder (required).
            memory: the sequence from the last layer of the encoder (required).
            tgt_mask: the mask for the tgt sequence (optional).
            memory_mask: the mask for the memory sequence (optional).
            tgt_key_padding_mask: the mask for the tgt keys per batch (optional).
            memory_key_padding_mask: the mask for the memory keys per batch (optional).
        Shape:
            see the docs in Transformer class.
        """
        output = tgt

        for mod in self.layers:
            output = mod(output, memory, tgt_mask=tgt_mask,
                         memory_mask=memory_mask,
                         tgt_key_padding_mask=tgt_key_padding_mask,
                         memory_key_padding_mask=memory_key_padding_mask)

        if self.norm is not None:
            output = self.norm(output)

        return output
```
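The loop above feeds the same `memory` to every block while the target representation is updated block by block. As a minimal sketch with illustrative sizes, the example below repeats that loop by hand, applies the optional final `norm`, and compares the result with calling the decoder directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

decoder_layer = nn.TransformerDecoderLayer(d_model=16, nhead=4, dim_feedforward=32)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2, norm=nn.LayerNorm(16))
decoder.eval()  # disable dropout so both paths are deterministic

memory = torch.rand(3, 2, 16)  # stands in for the encoder output
tgt = torch.rand(5, 2, 16)

with torch.no_grad():
    # Manual version of the forward pass: every block sees the same memory.
    manual = tgt
    for block in decoder.layers:
        manual = block(manual, memory)
    manual = decoder.norm(manual)

    reference = decoder(tgt, memory)

matches = torch.allclose(manual, reference, atol=1e-6)
print(matches)
```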