# Transformer: Understanding and Implementing the Hard Parts
## 1 word embedding
## 2 position embedding
## 3 encoder self-attention mask
## 4 intra-attention mask
## 5 decoder self-attention mask
## 6 multi-head self-attention


## Categories of basic seq2seq modules

![](./imgs/6_1.png)



![](./imgs/6_2.png)




# The Transformer model

## Encoder
1. input word embedding
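    - one-hot * embedding table = a dense, continuous vector
2. position encoding
    - Residual connections carry the position embedding up to the higher layers, so positional information is preserved throughout the network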
3. multi-head self-attention
4. feed forward network
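
The "one-hot * embedding table" item can be checked numerically. The sketch below uses illustrative sizes (a 9-row table with 8 features) and compares an explicit one-hot matrix product with a direct `nn.Embedding` lookup; the variable names are assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 9, 8  # illustrative sizes, not prescribed by the notes
table = nn.Embedding(vocab_size, dim)

token_ids = torch.tensor([5, 7, 0])
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()  # [3, 9]

via_matmul = one_hot @ table.weight  # [3, 9] @ [9, 8] -> [3, 8]
via_lookup = table(token_ids)        # picks rows 5, 7 and 0 directly

print(torch.allclose(via_matmul, via_lookup))
```

The lookup avoids materializing the one-hot matrix, which is why embedding layers take indices rather than one-hot vectors.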

![](./imgs/6_3.png)

![](./imgs/6_4.png)

## Decoder

![](./imgs/6_5.png)

## 1. word embedding

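Processing a piece of text:
1. Extract all the words in the text
2. Build a vocabulary that assigns an index to each extracted word
3. Convert the text into a list of indices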
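
The three steps above can be sketched in plain Python. This is a toy illustration only: whitespace splitting stands in for a real tokenizer, and `build_vocab`, `encode`, and `PAD_INDEX` are hypothetical names introduced here.

```python
PAD_INDEX = 0  # index 0 is reserved for padding, as in the notebook cells below


def build_vocab(sentences):
    """Map each distinct word to an index, starting at 1 (0 is kept for padding)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab, max_len):
    """Convert a sentence to a list of indices, right-padded with PAD_INDEX."""
    ids = [vocab[word] for word in sentence.lower().split()]
    return ids + [PAD_INDEX] * (max_len - len(ids))


corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
encoded = [encode(s, vocab, max_len=4) for s in corpus]

print(vocab)    # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'down': 5}
print(encoded)  # [[1, 2, 3, 0], [1, 4, 3, 5]]
```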


### 1.1 Building the inputs and outputs

In [2]:

import os
import random
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pick a seed (time-based here, so it changes from run to run)
random_seed = int(time.time() * 2)

# Seed every random number generator in use
def seed_it(seed):
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True  # use deterministic cuDNN algorithms
    torch.backends.cudnn.benchmark = False  # no autotuning: deterministic algorithm choice, possibly slower
    torch.backends.cudnn.enabled = True  # keep cuDNN on for speed (this is the default)
    torch.manual_seed(seed)

seed_it(random_seed)


In [3]:
"""
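Sequence-modeling example: build sequences whose tokens are represented by their indices in the vocabulary
"""

batch_size=2

# Maximum sequence lengths
max_src_seq_len=5
max_tgt_seq_len=5

# Vocabulary sizes (the num_embeddings argument of nn.Embedding, i.e. the number of words)
max_num_src_words=8
max_num_tgt_words=8

# The embedding_dim argument of nn.Embedding (the model's feature size)
model_dim=8


# Lengths of the two source sentences: 2 and 4
src_len=torch.Tensor([2,4]).to(torch.int32)

# Lengths of the two target sentences: 4 and 3
tgt_len=torch.Tensor([4,3]).to(torch.int32)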

src_len,tgt_len

(tensor([2, 4], dtype=torch.int32), tensor([4, 3], dtype=torch.int32))

In [22]:
"""
Detailed steps (can be skipped when running the notebook)
"""

# For each length L, generate a 1-D tensor of L word indices in [1, max_num_src_words) (one sentence)
src_seq=[
       torch.randint(
            low=1,
            high=max_num_src_words,
            size=(L,)) for L in src_len
]
# src_seq is now a list holding two 1-D tensors (two sentences)
src_seq  # [tensor([2, 7]), tensor([6, 5, 7, 4])]

# Pad with 0 up to the maximum sequence length
for i in range(len(src_len)):
    L=int(src_len[i])
    src_seq[i]=F.pad(src_seq[i],(0,max_src_seq_len-L))
# src_seq is still a list of two 1-D tensors, now of equal length
src_seq # [tensor([2, 7, 0, 0, 0]), tensor([6, 5, 7, 4, 0])]

# Add a batch dimension to each 1-D tensor and concatenate them into one 2-D tensor
src_seq=torch.cat([torch.unsqueeze(src_seq[L],dim=0) for L in range(len(src_seq)) ])

src_seq  # tensor([[7, 7, 0, 0, 0],
         #         [5, 3, 5, 2, 0]])

[tensor([7, 7, 0, 0, 0]), tensor([5, 3, 5, 2, 0])]

In [4]:
"""
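The steps above combined into one expression
- torch.randint: generates integers of a given shape within a given range
- F.pad: pads with 0 by default
- torch.unsqueeze: inserts a dimension of size 1 at the given position
- torch.cat: concatenates a sequence of tensors along an existing dimension
"""

"""
Step 1: build the source and target sentences from word indices, pad them with 0 (the default), and combine them into a batch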
"""
src_seq = torch.cat(
    [torch.unsqueeze(F.pad(torch.randint(1,max_num_src_words,(L,)), (0,max(src_len)-L)),0 ) for L in src_len]
)

tgt_seq = torch.cat(
    [torch.unsqueeze(F.pad(torch.randint(1,max_num_tgt_words,(L,)), (0,max(tgt_len)-L)),0 ) for L in tgt_len]
)

src_seq,tgt_seq

(tensor([[5, 7, 0, 0],
         [2, 5, 1, 5]]),
 tensor([[6, 6, 1, 1],
         [6, 6, 7, 0]]))

### 1.2 Building the word embedding

In [5]:
src_embedding_table=nn.Embedding(max_num_src_words+1,model_dim)
tgt_embedding_table=nn.Embedding(max_num_tgt_words+1,model_dim)

src_embedding_table  # word embeddings are looked up from this table

Embedding(9, 8)

In [6]:
"""
Each row is one embedding vector
- Row 0 is reserved for the padding index
- The remaining rows are assigned to words
"""
src_embedding_table.weight

Parameter containing:
tensor([[-1.8971, -0.5292, -0.2683, -0.4533, -1.3878, -0.3119,  1.5240,  0.6145],
        [-0.8193, -2.2878,  0.9385,  0.5366,  0.3396,  0.2181, -0.1508, -0.4788],
        [-0.7597,  0.9148, -0.5178,  1.4354, -0.1013,  1.5633, -0.4738,  0.4313],
        [ 1.5991,  0.5838,  0.1552, -1.9887,  1.0617,  0.2108, -0.4492, -1.0874],
        [ 2.3204,  0.4021,  1.9740,  1.1001,  0.7008, -0.9758, -0.5534,  0.4126],
        [-0.4560,  1.1002, -0.5133,  2.2987,  0.8974, -0.2775,  0.5889,  1.2542],
        [ 0.4470, -0.8439, -1.2878, -0.2874,  0.3921, -1.0862, -0.1129,  1.1329],
        [ 0.1421, -0.2498,  0.2293,  1.1480, -0.8974,  2.3922,  0.2216,  2.2292],
        [ 0.4339, -0.3328, -0.2964,  1.2370,  1.1829, -1.3129, -1.3691,  0.7486]],
       requires_grad=True)

In [7]:
# Call the forward method to get the embedding vectors of each sentence
src_embedding=src_embedding_table(src_seq)
tgt_embedding=tgt_embedding_table(tgt_seq)

In [8]:
"""
Each word index in a sentence selects the corresponding row of `src_embedding_table.weight` above; the selected rows form the sentence's embedding vectors
"""
src_seq,src_embedding,src_embedding.shape

(tensor([[5, 7, 0, 0],
         [2, 5, 1, 5]]),
 tensor([[[-0.4560,  1.1002, -0.5133,  2.2987,  0.8974, -0.2775,  0.5889,
            1.2542],
          [ 0.1421, -0.2498,  0.2293,  1.1480, -0.8974,  2.3922,  0.2216,
            2.2292],
          [-1.8971, -0.5292, -0.2683, -0.4533, -1.3878, -0.3119,  1.5240,
            0.6145],
          [-1.8971, -0.5292, -0.2683, -0.4533, -1.3878, -0.3119,  1.5240,
            0.6145]],
 
         [[-0.7597,  0.9148, -0.5178,  1.4354, -0.1013,  1.5633, -0.4738,
            0.4313],
          [-0.4560,  1.1002, -0.5133,  2.2987,  0.8974, -0.2775,  0.5889,
            1.2542],
          [-0.8193, -2.2878,  0.9385,  0.5366,  0.3396,  0.2181, -0.1508,
           -0.4788],
          [-0.4560,  1.1002, -0.5133,  2.2987,  0.8974, -0.2775,  0.5889,
            1.2542]]], grad_fn=<EmbeddingBackward0>),
 torch.Size([2, 4, 8]))

## 2. position embedding

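Reasons given in *Attention Is All You Need* for using the sinusoidal position embedding:
1. It generalizes well: even if the test set contains sequence lengths not seen in training, the embedding of a new position can be obtained from those of known positions through a linear relation
2. It is symmetric and unique: the embedding of each position is deterministic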
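
For reference, the sinusoidal encoding and the linear relation behind point 1 can be written out as below. The shorthand $\omega_i$ is introduced here for readability and is not part of the original notes; the relation itself follows from the angle-addition identities.

```latex
\begin{aligned}
PE_{(pos,\,2i)}   &= \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)}  = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \\[4pt]
\omega_i &= 10000^{-2i/d_{\text{model}}} \\[4pt]
\begin{pmatrix} \sin\bigl(\omega_i (pos+k)\bigr) \\ \cos\bigl(\omega_i (pos+k)\bigr) \end{pmatrix}
&=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}
\end{aligned}
```

The matrix depends only on the offset $k$, not on $pos$, so the encoding at position $pos+k$ is a fixed linear transform of the encoding at $pos$.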

![](./imgs/6_6.png)


In [9]:
# Maximum sequence length to encode (total number of positions)
max_position_len=5

# Build the position embedding
pos_mat=torch.arange(max_position_len).reshape((-1,1))
i_mat=torch.pow(10000,torch.arange(0,model_dim,2).reshape((1,-1))/model_dim)

pos_mat,i_mat

(tensor([[0],
         [1],
         [2],
         [3],
         [4]]),
 tensor([[   1.,   10.,  100., 1000.]]))

In [10]:
pe_embedding_table=torch.zeros(max_position_len,model_dim)
pe_embedding_table[:,0::2]=torch.sin(pos_mat/i_mat)
pe_embedding_table[:,1::2]=torch.cos(pos_mat/i_mat)
pe_embedding_table

tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
          1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
          9.9995e-01,  1.0000e-03,  1.0000e+00],
        [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
          9.9980e-01,  2.0000e-03,  1.0000e+00],
        [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9995e-02,
          9.9955e-01,  3.0000e-03,  1.0000e+00],
        [-7.5680e-01, -6.5364e-01,  3.8942e-01,  9.2106e-01,  3.9989e-02,
          9.9920e-01,  4.0000e-03,  9.9999e-01]])

In [11]:
# Use the table built above as the weight of an nn.Embedding
pe_embedding=nn.Embedding(max_position_len,model_dim)
pe_embedding.weight=nn.Parameter(pe_embedding_table,requires_grad=False)
pe_embedding.weight

Parameter containing:
tensor([[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
          1.0000e+00,  0.0000e+00,  1.0000e+00],
        [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
          9.9995e-01,  1.0000e-03,  1.0000e+00],
        [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
          9.9980e-01,  2.0000e-03,  1.0000e+00],
        [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9995e-02,
          9.9955e-01,  3.0000e-03,  1.0000e+00],
        [-7.5680e-01, -6.5364e-01,  3.8942e-01,  9.2106e-01,  3.9989e-02,
          9.9920e-01,  4.0000e-03,  9.9999e-01]])

In [12]:
src_pos=torch.cat([torch.unsqueeze(torch.arange(max(src_len)),0) for _ in src_seq]).to(torch.int32)
# Target positions are based on the target lengths
tgt_pos=torch.cat([torch.unsqueeze(torch.arange(max(tgt_len)),0) for _ in tgt_seq]).to(torch.int32)
# Pass the position indices to pe_embedding
src_pe_embedding=pe_embedding(src_pos)
tgt_pe_embedding=pe_embedding(tgt_pos)

# With the nn.Embedding API, position embeddings are looked up by position index
src_pe_embedding

tensor([[[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
           1.0000e+00,  0.0000e+00,  1.0000e+00],
         [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
           9.9995e-01,  1.0000e-03,  1.0000e+00],
         [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
           9.9980e-01,  2.0000e-03,  1.0000e+00],
         [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9995e-02,
           9.9955e-01,  3.0000e-03,  1.0000e+00]],

        [[ 0.0000e+00,  1.0000e+00,  0.0000e+00,  1.0000e+00,  0.0000e+00,
           1.0000e+00,  0.0000e+00,  1.0000e+00],
         [ 8.4147e-01,  5.4030e-01,  9.9833e-02,  9.9500e-01,  9.9998e-03,
           9.9995e-01,  1.0000e-03,  1.0000e+00],
         [ 9.0930e-01, -4.1615e-01,  1.9867e-01,  9.8007e-01,  1.9999e-02,
           9.9980e-01,  2.0000e-03,  1.0000e+00],
         [ 1.4112e-01, -9.8999e-01,  2.9552e-01,  9.5534e-01,  2.9995e-02,
           9.9955e-01,  3.0000e-03,  1.0000e+00]]])

## 3. encoder self-attention mask

### 3.1 Softmax demo (why scaling matters)

In [56]:
"""
Softmax demo
"""

alpha1=0.1
alpha2=10

# Treat these scores as query-key similarities (the result of Q*K),
# i.e. the similarity of one word to every position in the sequence
score=torch.randn(5)
score

tensor([ 0.0403,  0.8497, -0.2018,  1.6455, -0.9434])

In [57]:
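# Similarity of the query to each word in the sequence
# The larger the prob, the higher the similarity

# Small scale: the probabilities are close to one another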
prob1=F.softmax(score*alpha1,-1)
prob1

tensor([0.1945, 0.2109, 0.1899, 0.2284, 0.1763])

In [58]:
# Large scale: the probabilities are far apart
prob2=F.softmax(score*alpha2,-1)
prob2

tensor([1.0677e-07, 3.4968e-04, 9.4866e-09, 9.9965e-01, 5.7049e-12])

In [61]:
def softmax_func(score):
    return F.softmax(score,dim=-1)
jaco_mat1=torch.autograd.functional.jacobian(softmax_func,score*alpha1)
jaco_mat1


tensor([[ 0.1567, -0.0410, -0.0369, -0.0444, -0.0343],
        [-0.0410,  0.1664, -0.0400, -0.0482, -0.0372],
        [-0.0369, -0.0400,  0.1538, -0.0434, -0.0335],
        [-0.0444, -0.0482, -0.0434,  0.1762, -0.0403],
        [-0.0343, -0.0372, -0.0335, -0.0403,  0.1452]])

In [62]:
jaco_mat2=torch.autograd.functional.jacobian(softmax_func,score*alpha2)
jaco_mat2


tensor([[ 1.0677e-07, -3.7334e-11, -1.0128e-15, -1.0673e-07, -6.0909e-19],
        [-3.7334e-11,  3.4956e-04, -3.3173e-12, -3.4956e-04, -1.9949e-15],
        [-1.0128e-15, -3.3173e-12,  9.4866e-09, -9.4832e-09, -5.4120e-20],
        [-1.0673e-07, -3.4956e-04, -9.4832e-09,  3.4964e-04, -5.7029e-12],
        [-6.0909e-19, -1.9949e-15, -5.4120e-20, -5.7029e-12,  5.7049e-12]])
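
The two Jacobians above agree with the closed form J = diag(p) - p pᵀ, where p is the softmax output. The sketch below reuses the scores from the demo and checks the closed form against autograd; `softmax_jacobian` is a helper name introduced for this illustration. When the scores are scaled up, p approaches a one-hot vector and every entry of J approaches zero, which is the vanishing-gradient problem that dividing by the square root of the key dimension is meant to reduce.

```python
import torch
import torch.nn.functional as F


def softmax_jacobian(p):
    """Closed-form softmax Jacobian: J[i, j] = p[i] * (delta_ij - p[j])."""
    return torch.diag(p) - torch.outer(p, p)


# Same scores as in the demo above
score = torch.tensor([0.0403, 0.8497, -0.2018, 1.6455, -0.9434])

matches = {}
max_entry = {}
for alpha in (0.1, 10.0):
    scaled = score * alpha
    closed_form = softmax_jacobian(F.softmax(scaled, dim=-1))
    autograd = torch.autograd.functional.jacobian(
        lambda s: F.softmax(s, dim=-1), scaled
    )
    matches[alpha] = torch.allclose(closed_form, autograd, atol=1e-6)
    max_entry[alpha] = closed_form.abs().max().item()

print(matches)    # whether the closed form agrees with autograd
print(max_entry)  # largest absolute Jacobian entry for each scale
```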

### 3.2 Scaled Dot-Product Attention

In [13]:
"""
Build the encoder self-attention mask

Mask shape: [batch_size, max_src_len, max_src_len]; True marks a position that is filled with -inf (here -1e9) before the softmax
"""

# Valid encoder positions
valid_encoder_pos=torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L),(0,max(src_len)-L)) ,0) for L in src_len]),2)
valid_encoder_pos

tensor([[[1.],
         [1.],
         [0.],
         [0.]],

        [[1.],
         [1.],
         [1.],
         [1.]]])

In [14]:
"""
2: batch_size
4: maximum sentence length after padding
"""
valid_encoder_pos.shape

torch.Size([2, 4, 1])

In [15]:
"""
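bmm: batch matrix multiplication of two 3-D tensors (not an element-wise product); here [2,4,1] @ [2,1,4] gives the outer product of each validity vector with itself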
"""
valid_encoder_pos_matrix=torch.bmm(valid_encoder_pos,valid_encoder_pos.transpose(1,2))
valid_encoder_pos_matrix,valid_encoder_pos_matrix.shape

(tensor([[[1., 1., 0., 0.],
          [1., 1., 0., 0.],
          [0., 0., 0., 0.],
          [0., 0., 0., 0.]],
 
         [[1., 1., 1., 1.],
          [1., 1., 1., 1.],
          [1., 1., 1., 1.],
          [1., 1., 1., 1.]]]),
 torch.Size([2, 4, 4]))

In [16]:
invalid_encoder_pos_matrix=1-valid_encoder_pos_matrix
invalid_encoder_pos_matrix

tensor([[[0., 0., 1., 1.],
         [0., 0., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.],
         [0., 0., 0., 0.]]])

In [17]:
# Convert the invalid-position matrix to bool
# True: the position must be masked
mask_encoder_self_attention=invalid_encoder_pos_matrix.to(torch.bool)
mask_encoder_self_attention,mask_encoder_self_attention.shape

(tensor([[[False, False,  True,  True],
          [False, False,  True,  True],
          [ True,  True,  True,  True],
          [ True,  True,  True,  True]],
 
         [[False, False, False, False],
          [False, False, False, False],
          [False, False, False, False],
          [False, False, False, False]]]),
 torch.Size([2, 4, 4]))

In [18]:
score=torch.randn(batch_size,max(src_len),max(src_len))
score,score.shape

(tensor([[[ 1.6365, -1.5191, -0.2236,  0.3235],
          [-0.0399,  0.4800,  1.8378,  1.0547],
          [ 0.5063,  1.3518, -0.8374, -1.6042],
          [-1.1921, -0.2160,  0.1581,  0.7136]],
 
         [[-1.8907,  0.3340, -1.6852,  1.1155],
          [ 0.1832,  0.9547,  1.3048, -1.5454],
          [ 0.6335,  1.0391, -0.6470,  1.2854],
          [ 0.4862,  0.5130,  1.7316, -0.2617]]]),
 torch.Size([2, 4, 4]))

In [19]:
# masked_score=score.masked_fill(mask_encoder_self_attention,-np.inf)
masked_score=score.masked_fill(mask_encoder_self_attention,-1e9)
masked_score

tensor([[[ 1.6365e+00, -1.5191e+00, -1.0000e+09, -1.0000e+09],
         [-3.9859e-02,  4.7999e-01, -1.0000e+09, -1.0000e+09],
         [-1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09],
         [-1.0000e+09, -1.0000e+09, -1.0000e+09, -1.0000e+09]],

        [[-1.8907e+00,  3.3396e-01, -1.6852e+00,  1.1155e+00],
         [ 1.8322e-01,  9.5474e-01,  1.3048e+00, -1.5454e+00],
         [ 6.3351e-01,  1.0391e+00, -6.4700e-01,  1.2854e+00],
         [ 4.8617e-01,  5.1296e-01,  1.7316e+00, -2.6172e-01]]])

In [20]:
prob=F.softmax(masked_score,-1)
prob

tensor([[[0.9591, 0.0409, 0.0000, 0.0000],
         [0.3729, 0.6271, 0.0000, 0.0000],
         [0.2500, 0.2500, 0.2500, 0.2500],
         [0.2500, 0.2500, 0.2500, 0.2500]],

        [[0.0316, 0.2919, 0.0388, 0.6378],
         [0.1560, 0.3374, 0.4789, 0.0277],
         [0.2129, 0.3194, 0.0592, 0.4086],
         [0.1674, 0.1719, 0.5815, 0.0792]]])
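
The uniform 0.25 rows above come from padding queries whose keys are all masked. The small sketch below, with made-up scores, shows one practical reason to fill with a large finite value such as -1e9 instead of -inf: a fully masked row turns into a uniform distribution rather than NaN.

```python
import torch
import torch.nn.functional as F

score = torch.tensor([[0.5, 1.0, -0.3],
                      [0.2, -0.7, 0.9]])
# Row 0 keeps its first two keys; row 1 is a padding query, so every key is masked.
mask = torch.tensor([[False, False, True],
                     [True, True, True]])

prob_finite = F.softmax(score.masked_fill(mask, -1e9), dim=-1)
prob_inf = F.softmax(score.masked_fill(mask, float("-inf")), dim=-1)

print(prob_finite)  # the fully masked row becomes uniform
print(prob_inf)     # the fully masked row becomes NaN
```

Rows that belong to padding queries are meaningless either way; they are normally excluded later, for example by the masked loss in section 7.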

## 4. intra-attention mask

Q @ K^{T} shape: [batch_size,tgt_seq_len,src_seq_len]

In [21]:
valid_encoder_pos

tensor([[[1.],
         [1.],
         [0.],
         [0.]],

        [[1.],
         [1.],
         [1.],
         [1.]]])

In [22]:
# Valid decoder positions
valid_decoder_pos=torch.unsqueeze(torch.cat([torch.unsqueeze(F.pad(torch.ones(L),(0,max(tgt_len)-L)) ,0) for L in tgt_len]),2)
valid_decoder_pos,valid_decoder_pos.shape

(tensor([[[1.],
          [1.],
          [1.],
          [1.]],
 
         [[1.],
          [1.],
          [1.],
          [0.]]]),
 torch.Size([2, 4, 1]))

In [23]:
valid_cross_pos_matrix=torch.bmm(valid_decoder_pos,valid_encoder_pos.transpose(1,2))
valid_cross_pos_matrix

tensor([[[1., 1., 0., 0.],
         [1., 1., 0., 0.],
         [1., 1., 0., 0.],
         [1., 1., 0., 0.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [1., 1., 1., 1.],
         [0., 0., 0., 0.]]])

In [24]:
invalid_cross_pos_matrix=1-valid_cross_pos_matrix
mask_cross_attention=invalid_cross_pos_matrix.to(torch.bool)
mask_cross_attention

tensor([[[False, False,  True,  True],
         [False, False,  True,  True],
         [False, False,  True,  True],
         [False, False,  True,  True]],

        [[False, False, False, False],
         [False, False, False, False],
         [False, False, False, False],
         [ True,  True,  True,  True]]])
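
As an alternative sketch, the same cross-attention mask can be built with broadcasting instead of `F.pad` and `torch.bmm`. The helper `valid_positions` is a name introduced here, and the lengths are the ones used throughout this notebook.

```python
import torch

src_len = torch.tensor([2, 4])
tgt_len = torch.tensor([4, 3])


def valid_positions(lengths, max_len):
    """Return a [batch, max_len] bool tensor, True where the position holds a real token."""
    return torch.arange(max_len)[None, :] < lengths[:, None]


src_valid = valid_positions(src_len, int(src_len.max()))  # [2, 4]
tgt_valid = valid_positions(tgt_len, int(tgt_len.max()))  # [2, 4]

# A (query, key) pair is valid only if both positions hold real tokens.
valid_cross = tgt_valid[:, :, None] & src_valid[:, None, :]  # [batch, tgt, src]
mask_cross_attention = ~valid_cross

print(mask_cross_attention)
```

The result has shape [batch_size, tgt_seq_len, src_seq_len], matching the Q @ K^{T} shape stated at the top of this section.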

## 5. decoder self-attention mask

In [25]:
# Build lower-triangular matrices
tri_matrix=[torch.tril(torch.ones((L,L))) for L in tgt_len]
tri_matrix

[tensor([[1., 0., 0., 0.],
         [1., 1., 0., 0.],
         [1., 1., 1., 0.],
         [1., 1., 1., 1.]]),
 tensor([[1., 0., 0.],
         [1., 1., 0.],
         [1., 1., 1.]])]

In [30]:
valid_decoder_tri_matrix=torch.cat([ torch.unsqueeze(F.pad(torch.tril(torch.ones((L,L))), (0,max(tgt_len)-L,0,max(tgt_len)-L) ),0) for L in tgt_len ])
valid_decoder_tri_matrix,valid_decoder_tri_matrix.shape

(tensor([[[1., 0., 0., 0.],
          [1., 1., 0., 0.],
          [1., 1., 1., 0.],
          [1., 1., 1., 1.]],
 
         [[1., 0., 0., 0.],
          [1., 1., 0., 0.],
          [1., 1., 1., 0.],
          [0., 0., 0., 0.]]]),
 torch.Size([2, 4, 4]))

In [31]:
# Decoder causal mask
invalid_decoder_tri_matrix=(1-valid_decoder_tri_matrix).to(torch.bool)
invalid_decoder_tri_matrix

tensor([[[False,  True,  True,  True],
         [False, False,  True,  True],
         [False, False, False,  True],
         [False, False, False, False]],

        [[False,  True,  True,  True],
         [False, False,  True,  True],
         [False, False, False,  True],
         [ True,  True,  True,  True]]])

In [32]:
score=torch.randn(batch_size,max(tgt_len),max(tgt_len))
masked_score=score.masked_fill(invalid_decoder_tri_matrix,-1e9)
prob=F.softmax(masked_score,-1)
# Attention weight matrix
prob

tensor([[[1.0000, 0.0000, 0.0000, 0.0000],
         [0.5944, 0.4056, 0.0000, 0.0000],
         [0.0514, 0.8686, 0.0799, 0.0000],
         [0.1175, 0.5565, 0.1232, 0.2028]],

        [[1.0000, 0.0000, 0.0000, 0.0000],
         [0.9241, 0.0759, 0.0000, 0.0000],
         [0.3277, 0.1568, 0.5155, 0.0000],
         [0.2500, 0.2500, 0.2500, 0.2500]]])
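
The decoder mask can also be viewed as the union of two separate masks: a causal mask that hides future positions and a padding mask that hides padded positions. The following is a sketch under that reading, using `torch.triu` and the target lengths 4 and 3 from the cells above; the variable names are introduced here.

```python
import torch

tgt_len = torch.tensor([4, 3])
max_len = int(tgt_len.max())

# Causal part: True strictly above the diagonal (a query may not see later keys).
causal = torch.triu(torch.ones(max_len, max_len, dtype=torch.bool), diagonal=1)

# Padding part: True wherever the query or the key is a padded position.
valid = torch.arange(max_len)[None, :] < tgt_len[:, None]  # [batch, max_len]
padding = ~(valid[:, :, None] & valid[:, None, :])         # [batch, max_len, max_len]

decoder_mask = causal[None, :, :] | padding
print(decoder_mask)
```

Keeping the two parts separate makes it easy to reuse the causal part across batches while recomputing only the padding part.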

## 6. scaled self-attention

In [None]:
def scaled_dot_product_attention(Q,K,V,attn_mask):
    # shape of Q,K,V:[batch_size*num_head, seq_len, model_dim/num_head]
    # scale by sqrt(d_k), where d_k is the per-head feature size (last dim of Q)
    score=torch.bmm(Q,K.transpose(-2,-1))/(Q.size(-1)**0.5)
    masked_score=score.masked_fill(attn_mask,-1e9)
    prob=F.softmax(masked_score,-1)
    context=torch.bmm(prob,V)
    return context
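
The outline lists multi-head self-attention, and the shape comment above assumes the heads are folded into the batch dimension. The sketch below shows one possible way to do that folding and call the attention function. It is a simplified illustration: the learned Q/K/V and output projections are omitted, and `split_heads` and `merge_heads` are hypothetical helper names.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, attn_mask):
    # Q, K, V: [batch_size * num_head, seq_len, head_dim]
    score = torch.bmm(Q, K.transpose(-2, -1)) / (Q.size(-1) ** 0.5)
    score = score.masked_fill(attn_mask, -1e9)
    return torch.bmm(F.softmax(score, dim=-1), V)


def split_heads(x, num_head):
    """[batch, seq_len, model_dim] -> [batch * num_head, seq_len, model_dim // num_head]"""
    batch, seq_len, model_dim = x.shape
    x = x.reshape(batch, seq_len, num_head, model_dim // num_head)
    return x.transpose(1, 2).reshape(batch * num_head, seq_len, -1)


def merge_heads(x, num_head):
    """Inverse of split_heads."""
    bh, seq_len, head_dim = x.shape
    x = x.reshape(bh // num_head, num_head, seq_len, head_dim)
    return x.transpose(1, 2).reshape(bh // num_head, seq_len, num_head * head_dim)


torch.manual_seed(0)
batch_size, seq_len, model_dim, num_head = 2, 4, 8, 2
x = torch.randn(batch_size, seq_len, model_dim)

# Padding mask for sequence lengths [2, 4], repeated once per head.
lengths = torch.tensor([2, 4])
valid = torch.arange(seq_len)[None, :] < lengths[:, None]
mask = ~(valid[:, :, None] & valid[:, None, :])  # [batch, seq, seq]
mask = mask.repeat_interleave(num_head, dim=0)   # [batch * num_head, seq, seq]

# Projections omitted: x is used directly as Q, K and V.
q = k = v = split_heads(x, num_head)
context = merge_heads(scaled_dot_product_attention(q, k, v, mask), num_head)
print(context.shape)  # torch.Size([2, 4, 8])
```

`repeat_interleave` is used because `split_heads` lays out the folded dimension batch-major, so the heads of one sentence sit next to each other and share that sentence's mask.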

## 7. Masked loss

In [33]:
# batch_size=2,seqlen=3,vocab_size=4
logits=torch.randn(2,3,4)
logits

tensor([[[-2.0492,  0.9944,  1.2019, -0.3054],
         [ 1.7441,  1.2566, -0.0404,  0.4397],
         [-1.4987, -1.8955, -0.8323, -0.1094]],

        [[ 2.1624,  1.3874, -0.3114, -0.6915],
         [ 0.3334, -0.1706,  1.1871,  0.3635],
         [-0.5511,  1.4288,  1.2365,  0.2242]]])

In [34]:
label=torch.randint(0,4,(2,3))
label

tensor([[3, 1, 0],
        [0, 3, 1]])

In [35]:
logits=logits.transpose(1,2)
logits

tensor([[[-2.0492,  1.7441, -1.4987],
         [ 0.9944,  1.2566, -1.8955],
         [ 1.2019, -0.0404, -0.8323],
         [-0.3054,  0.4397, -0.1094]],

        [[ 2.1624,  0.3334, -0.5511],
         [ 1.3874, -0.1706,  1.4288],
         [-0.3114,  1.1871,  1.2365],
         [-0.6915,  0.3635,  0.2242]]])

In [36]:
F.cross_entropy(logits,label,reduction='none')

tensor([[2.2362, 1.2070, 2.0323],
        [0.4716, 1.5759, 0.8167]])

In [37]:
tgt_len=torch.tensor([2,3]).to(torch.int32)
[torch.ones(L) for L in tgt_len]

[tensor([1., 1.]), tensor([1., 1., 1.])]

In [38]:
mask=torch.cat([torch.unsqueeze(F.pad(torch.ones(L),(0,max(tgt_len)-L)),0) for L in tgt_len])
mask

tensor([[1., 1., 0.],
        [1., 1., 1.]])

In [39]:
F.cross_entropy(logits,label,reduction='none') * mask

tensor([[2.2362, 1.2070, 0.0000],
        [0.4716, 1.5759, 0.8167]])

In [40]:
label

tensor([[3, 1, 0],
        [0, 3, 1]])

In [41]:
label[0,2]=-100

In [42]:
F.cross_entropy(logits,label,reduction='none') * mask

tensor([[2.2362, 1.2070, 0.0000],
        [0.4716, 1.5759, 0.8167]])
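
The cells above keep per-token losses. To get a single training loss, one option is to average over the real tokens only. The sketch below uses random data with the same shapes as above and shows that an explicit mask and the default `ignore_index=-100` of `F.cross_entropy` give the same mean; the variable names are chosen for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 3, 4
logits = torch.randn(batch_size, seq_len, vocab_size)
label = torch.randint(0, vocab_size, (batch_size, seq_len))
tgt_len = torch.tensor([2, 3])

# 1.0 for real tokens, 0.0 for padding: [[1, 1, 0], [1, 1, 1]]
mask = (torch.arange(seq_len)[None, :] < tgt_len[:, None]).float()

# cross_entropy expects the class dimension second: [batch, vocab, seq_len]
per_token = F.cross_entropy(logits.transpose(1, 2), label, reduction="none")
loss_from_mask = (per_token * mask).sum() / mask.sum()

# Equivalent: set padded labels to -100, the default ignore_index
label_ignored = label.masked_fill(mask == 0, -100)
loss_from_ignore = F.cross_entropy(logits.transpose(1, 2), label_ignored)

print(loss_from_mask.item(), loss_from_ignore.item())
```

Dividing by `mask.sum()` rather than by the total number of positions keeps padded positions from diluting the loss.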

# Transformer model summary
## Model structure
![](./imgs/6_9.png)
![](./imgs/6_10.png)

## Usage types
![](./imgs/6_8.png)

## Characteristics
![](./imgs/6_7.png)