
In [72]:
!nvidia-smi

Wed Jan  5 07:27:02 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P0    72W / 149W |   1167MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

This notebook trains a language model using the Encoder part of the Transformer modules built into PyTorch.
It follows the official tutorial on pytorch.org: https://pytorch.org/tutorials/beginner/transformer_tutorial.html

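The pre-training objective is next-word prediction: for every position, the model outputs the probability of each vocabulary word appearing at the next position.
The model is a Transformer Encoder followed by a linear layer, whose output goes through a log-softmax. In this notebook the log-softmax is not a separate layer; nn.CrossEntropyLoss applies it internally to the linear layer's raw scores.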

The input is the sum of a token embedding and a position embedding.
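
To make the objective above concrete, here is a minimal sketch showing that applying log-softmax and then the negative log-likelihood loss gives the same value as nn.CrossEntropyLoss on the raw scores. The toy sizes and target indices are assumptions chosen for illustration; they do not come from the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
vocab_size = 7
num_positions = 4

# Raw scores (logits), as a final linear layer would produce them.
logits = torch.randn(num_positions, vocab_size)
# Index of the true next token at each position.
targets = torch.tensor([1, 0, 6, 3])

# Route 1: explicit log-softmax, then negative log-likelihood.
log_probs = F.log_softmax(logits, dim=-1)
loss_nll = F.nll_loss(log_probs, targets)

# Route 2: CrossEntropyLoss applies the log-softmax internally.
loss_ce = nn.CrossEntropyLoss()(logits, targets)

print(loss_nll.item(), loss_ce.item())
```

Both routes should print the same number, which is why a model can return raw scores and leave the normalization to the loss.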

In [73]:
import math
from typing import Tuple
import torch
import torch.nn as nn
from torch import Tensor
import torch.nn.functional as F
from torch.nn import TransformerEncoder,TransformerEncoderLayer
from torch.utils.data import dataset

Next, write the pre-training model using the Transformer encoder.

In [74]:
class TransformerModel(nn.Module):
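  def __init__(self,ntoken,d_model,n_head,d_hid,nlayers,dropout=0.5):
    # ntoken: vocabulary size; d_model: input feature (embedding) dimension, 512 in the original Transformer and 768 in BERT-base; n_head: number of attention heads
    # d_hid: hidden dimension of the feed-forward sublayer, typically 2048; nlayers: number of TransformerEncoderLayer blocks stacked in the encoder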
    super().__init__()
    self.model_type='Transformer'
    self.d_model=d_model
    self.pos_encoder=PositionalEmbedding(d_model,dropout)
    encoder_layers=TransformerEncoderLayer(d_model,n_head,d_hid,dropout)
    self.transformer_encoder=TransformerEncoder(encoder_layers,num_layers=nlayers)
    self.encoder=nn.Embedding(ntoken,d_model)
    self.decoder=nn.Linear(d_model,ntoken)

    self.init_weights()

  # Initialize the weights of the encoder (embedding) and decoder (linear) layers
  def init_weights(self):
    initrange=0.1
    self.encoder.weight.data.uniform_(-initrange,initrange)
    self.decoder.bias.data.zero_()
    self.decoder.weight.data.uniform_(-initrange,initrange)
  
  def forward(self,src,src_mask):
    # shape: src   [seq_len,batch_size]
    #     src_mask [seq_len,seq_len]

    #embedding
    src=self.encoder(src)*math.sqrt(self.d_model)
    src=self.pos_encoder(src)

    output=self.transformer_encoder(src,src_mask)
    output=self.decoder(output)
    return output
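
As a rough check of the tensor shapes in forward, the sketch below wires up the same kinds of layers directly (embedding, encoder stack, linear decoder) with small, hypothetical sizes and dropout disabled. It omits the positional embedding for brevity, so it is not the notebook's model. It also checks the effect of the causal mask: changing the last token should leave the outputs at earlier positions unchanged.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, for illustration only.
ntoken, d_model, n_head, d_hid, nlayers = 50, 16, 2, 32, 2
seq_len, batch_size = 5, 3

embedding = nn.Embedding(ntoken, d_model)
layer = nn.TransformerEncoderLayer(d_model, n_head, d_hid, dropout=0.0)
encoder = nn.TransformerEncoder(layer, num_layers=nlayers)
decoder = nn.Linear(d_model, ntoken)

src = torch.randint(0, ntoken, (seq_len, batch_size))       # [seq_len, batch_size]
# Causal mask: -inf above the diagonal, 0 on and below it.
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)

def run(tokens):
    x = embedding(tokens) * math.sqrt(d_model)               # [seq_len, batch_size, d_model]
    hidden = encoder(x, mask)                                # [seq_len, batch_size, d_model]
    return decoder(hidden)                                   # [seq_len, batch_size, ntoken]

with torch.no_grad():
    logits = run(src)
    changed = src.clone()
    changed[-1] = (changed[-1] + 1) % ntoken                 # alter only the last position
    logits_changed = run(changed)

print(logits.shape)
# Earlier positions cannot see the last token, so their outputs should match.
earlier_unchanged = torch.allclose(logits[:-1], logits_changed[:-1], atol=1e-5)
print(earlier_unchanged)
```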

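The function below builds the attention mask: a square matrix whose lower-left part, including the diagonal, is 0 and whose upper-right part is -inf, so that each position can attend only to itself and to earlier positions.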

In [75]:
def generate_square_subsequent_mask(sz):
  return torch.triu(torch.ones(sz,sz)*float('-inf'),diagonal=1)
generate_square_subsequent_mask(5)

tensor([[0., -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0.]])
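
The mask is added to the attention scores before the softmax, so every -inf entry becomes a weight of exactly zero. The sketch below illustrates this with hypothetical all-equal scores; it is a simplified stand-in for what happens inside the attention layer, not a copy of PyTorch's implementation.

```python
import torch

def generate_square_subsequent_mask(sz):
    # Same construction as in the cell above: -inf above the diagonal, 0 elsewhere.
    return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)

sz = 4
scores = torch.zeros(sz, sz)                    # hypothetical attention scores, all equal
mask = generate_square_subsequent_mask(sz)
weights = torch.softmax(scores + mask, dim=-1)  # each row sums to 1
print(weights)
```

With equal scores, row i spreads its weight evenly over positions 0 to i and puts zero weight on all later positions.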

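Next, write PositionalEmbedding. It is a fixed (not learned) positional embedding computed with sine and cosine functions. Its dimension matches the token embedding, and the two are added together to form the embedding fed into the language model.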

In [76]:
class PositionalEmbedding(nn.Module):
  def __init__(self,d_model,dropout=0.1,max_len=5000):
    super().__init__()
    self.dropout=nn.Dropout(dropout)

    position=torch.arange(max_len).unsqueeze(1)
    div_term=torch.exp(torch.arange(0,d_model,2)*(-math.log(10000.0)/d_model))
    pe=torch.zeros(max_len,1,d_model)
    pe[:,0,0::2]=torch.sin(position*div_term)
    pe[:,0,1::2]=torch.cos(position*div_term)
    self.register_buffer('pe',pe)

  def forward(self,x):
    x=x+self.pe[:x.size(0)]  # out-of-place add, so the caller's tensor is not modified
    return self.dropout(x)
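
The table built in __init__ corresponds to pe[pos, 2i] = sin(pos / 10000^(2i/d_model)) and pe[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). The sketch below rebuilds a small table the same way (without the extra batch dimension) and compares one entry against that closed form. The sizes d_model=8 and max_len=10 are assumptions for illustration.

```python
import math
import torch

d_model, max_len = 8, 10   # hypothetical toy sizes

position = torch.arange(max_len).unsqueeze(1)                                   # [max_len, 1]
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)   # even columns hold the sines
pe[:, 1::2] = torch.cos(position * div_term)   # odd columns hold the cosines

# Closed form for one entry: position 3, sine/cosine pair i = 1 (columns 2 and 3).
pos, i = 3, 1
angle = pos / (10000 ** (2 * i / d_model))
print(pe[pos, 2 * i].item(), math.sin(angle))
print(pe[pos, 2 * i + 1].item(), math.cos(angle))
```

Each printed pair should agree up to float32 rounding. Position 0 always encodes as sin(0)=0 in the even columns and cos(0)=1 in the odd ones.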

Load and batch data
Use torchtext to load the WikiText-2 dataset, then arrange it into batches with batchify().

In [77]:
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter=WikiText2(split='train')
tokenizer=get_tokenizer('basic_english')
vocab=build_vocab_from_iterator(map(tokenizer,train_iter),specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def data_process(raw_text_iter):
  data=[torch.tensor(vocab(tokenizer(item)),dtype=torch.long) for item in raw_text_iter]
  return torch.cat(tuple(filter(lambda t:t.numel()>0,data)))

train_iter,val_iter,test_iter=WikiText2()
train_data=data_process(train_iter)
val_data=data_process(val_iter)
test_data=data_process(test_iter)

device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data,batch_size):
  # data shape: [N]
  # returns shape: [N//batch_size,batch_size]; leftover tokens that do not fit are dropped
  seq_len=data.size(0)//batch_size
  data=data[:seq_len*batch_size]
  data=data.view(batch_size,seq_len).t().contiguous()
  return data.to(device)

batch_size=20
eval_batch_size=10
train_data=batchify(train_data,batch_size)
val_data=batchify(val_data,eval_batch_size)
test_data=batchify(test_data,eval_batch_size)
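
To see what batchify does to the token stream, here is a small sketch on a stand-in sequence of 26 integer "tokens". It repeats the same reshaping logic but leaves out the device transfer so it runs anywhere. The sequence and the batch size of 4 are assumptions for illustration.

```python
import torch

def batchify(data, batch_size):
    # Same logic as above, minus the .to(device) call.
    seq_len = data.size(0) // batch_size
    data = data[:seq_len * batch_size]            # drop tokens that do not fit evenly
    return data.view(batch_size, seq_len).t().contiguous()

tokens = torch.arange(26)       # stand-in for a stream of 26 token ids
batched = batchify(tokens, 4)   # 26 // 4 = 6 rows; the last 2 tokens are dropped
print(batched.shape)
print(batched[:, 0])            # first column: a contiguous run of the stream
print(batched[0])               # first row: the first token of each column
```

Each column is one contiguous slice of the original stream, and the columns are processed independently of one another.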

get_batch below splits the data into data and target by slicing out chunks: if data holds tokens 2-5, then target holds tokens 3-6. bptt is the maximum chunk length.

In [78]:
bptt=35
def get_batch(source,i)->Tuple[Tensor,Tensor]:
  #source Tensor shape[full_seq_len,batch_size]
  seq_len=min(bptt,len(source)-1-i)
  data=source[i:i+seq_len]
  target=source[i+1:i+1+seq_len].reshape(-1)
  return data,target
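
The one-step shift between data and target is easier to see on a tiny example. The sketch below copies get_batch and runs it with a hypothetical bptt of 3 (the notebook uses 35) on a 6x2 tensor laid out the way batchify would produce it.

```python
import torch

bptt = 3  # hypothetical small chunk length, for illustration only

def get_batch(source, i):
    # source shape: [full_seq_len, batch_size]
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].reshape(-1)
    return data, target

# Two columns, each a contiguous run of token ids: 0..5 and 6..11.
source = torch.arange(12).view(2, 6).t().contiguous()

data, target = get_batch(source, 0)
print(data.tolist())     # rows 0..2 of source
print(target.tolist())   # rows 1..3 of source, flattened row by row

# The final chunk is shorter because every data row still needs a next row as its target.
last_data, last_target = get_batch(source, 3)
print(last_data.tolist(), last_target.tolist())
```

The target is flattened so that it lines up with the model output after output.view(-1,ntoken).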



Initialize the model and define the hyperparameters

In [79]:
ntoken=len(vocab)
embedding_size=200
d_hid=200
nlayers=2
n_head=2
dropout=0.2
model=TransformerModel(ntoken,embedding_size,n_head,d_hid,nlayers,dropout).to(device)

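Run the model. Cross-entropy is the loss function and SGD is the optimizer, with an initial learning rate of 5.0 that StepLR decays after each epoch. During training, nn.utils.clip_grad_norm_ is used to guard against exploding gradients.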

In [80]:
import copy
import time
criterion=nn.CrossEntropyLoss()
lr=5.0
optimizer=torch.optim.SGD(model.parameters(),lr=lr)
scheduler=torch.optim.lr_scheduler.StepLR(optimizer,1.0,gamma=0.95)

def train(model):
  model.train()
  total_loss=0.
  log_interval=200
  start_time=time.time()
  src_mask=generate_square_subsequent_mask(bptt).to(device)

  num_batches=len(train_data)//bptt
  for idx,i in enumerate(range(0,train_data.size(0)-1,bptt)):
    data,targets=get_batch(train_data,i)
    seq_len=data.size(0)  # data is [seq_len,batch_size], so dim 0 is the sequence length
    if seq_len!=bptt:
      src_mask=src_mask[:seq_len,:seq_len]
    output=model(data,src_mask)
    loss=criterion(output.view(-1,ntoken),targets)

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(),0.5)
    optimizer.step()

    total_loss+=loss.item()
    if idx%log_interval ==0 and idx>0:
      lr=scheduler.get_last_lr()[0]
      ms_per_batch=(time.time()-start_time)*1000/log_interval
      cur_loss=total_loss/log_interval
      ppl=math.exp(cur_loss)
      print('epoch {}|{}/{} batches|lr {} ms/batch {} | loss {} | ppl {} '.format(epoch,idx,num_batches,lr,ms_per_batch,cur_loss,ppl))
      total_loss=0
      start_time=time.time()

def evaluate(model,eval_data):
  model.eval()
  total_loss=0.
  src_mask=generate_square_subsequent_mask(bptt).to(device)
  with torch.no_grad():
    for i in range(0,eval_data.size(0)-1,bptt):
      data,targets=get_batch(eval_data,i)
      seq_len=data.size(0)  # dim 0 is the sequence length, not the batch size
      if seq_len!=bptt:
        src_mask=src_mask[:seq_len,:seq_len]
      output=model(data,src_mask)
      output_flat=output.view(-1,ntoken)
      total_loss+=seq_len*criterion(output_flat,targets).item()
  return total_loss/(len(eval_data)-1)
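
The two training aids used above can be looked at in isolation. The sketch below uses a tiny stand-in nn.Linear model (an assumption, not the notebook's model) with hand-set weights so the raw gradient is large. It then shows that clip_grad_norm_ rescales the total gradient norm down to the 0.5 threshold, and that StepLR with step_size=1 and gamma=0.95 multiplies the learning rate by 0.95 on every scheduler step.

```python
import torch
import torch.nn as nn

toy = nn.Linear(4, 1)                      # hypothetical stand-in model
with torch.no_grad():
    toy.weight.fill_(1.0)                  # fixed weights so the example is deterministic
    toy.bias.zero_()

optimizer = torch.optim.SGD(toy.parameters(), lr=5.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

# Large inputs make the raw gradient norm far exceed the clipping threshold.
loss = (toy(torch.ones(8, 4) * 100.0) ** 2).mean()
optimizer.zero_grad()
loss.backward()

# clip_grad_norm_ returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(toy.parameters(), 0.5)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in toy.parameters()))
optimizer.step()

# Record the learning rate for three consecutive "epochs".
lrs = []
for _ in range(3):
    lrs.append(scheduler.get_last_lr()[0])
    scheduler.step()

print(norm_before.item(), norm_after.item(), lrs)
```

Clipping changes only the length of the gradient, not its direction, which keeps the large learning rate of 5.0 from producing huge parameter updates.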


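Set the number of epochs and start training. After each epoch, keep a copy of the model if it has the lowest validation loss so far, then step the learning-rate scheduler.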

In [81]:
best_val_loss=float('inf')
epochs=3
best_model=None

for epoch in range(1,epochs+1):
  epoch_start_time=time.time()
  train(model)
  val_loss=evaluate(model,val_data)
  val_ppl=math.exp(val_loss)
  elapsed=time.time()-epoch_start_time
  print('-'*88)
  print('end of epoch {} |time: {}s | valid loss {},valid ppl {}'.format(epoch,elapsed,val_loss,val_ppl))
  print('-'*88)

  if val_loss<best_val_loss:
    best_val_loss=val_loss
    best_model=copy.deepcopy(model)

  scheduler.step()

epoch 1|200/2928 batches|lr 5.0 ms/batch 37.98854351043701 | loss 8.242470648288727 | ppl 3798.9145048511164 
epoch 1|400/2928 batches|lr 5.0 ms/batch 37.49191761016846 | loss 6.891486141681671 | ppl 983.8624903244288 
epoch 1|600/2928 batches|lr 5.0 ms/batch 37.585289478302 | loss 6.443272621631622 | ppl 628.4601503418046 
epoch 1|800/2928 batches|lr 5.0 ms/batch 37.57092475891113 | loss 6.3073224925994875 | ppl 548.5741692811933 
epoch 1|1000/2928 batches|lr 5.0 ms/batch 37.6913857460022 | loss 6.19197496175766 | ppl 488.81053565999616 
epoch 1|1200/2928 batches|lr 5.0 ms/batch 37.46937036514282 | loss 6.153678452968597 | ppl 470.444716648386 
epoch 1|1400/2928 batches|lr 5.0 ms/batch 37.459876537323 | loss 6.114461686611175 | ppl 452.3524744243365 
epoch 1|1600/2928 batches|lr 5.0 ms/batch 37.42375612258911 | loss 6.101904542446136 | ppl 446.70773427009357 
epoch 1|1800/2928 batches|lr 5.0 ms/batch 37.43600130081177 | loss 6.013551225662232 | ppl 408.93295796960973 
epoch 1|2000/292

Test set results

In [82]:
test_loss=evaluate(best_model,test_data)
test_ppl=math.exp(test_loss)
print('-'*88)
print('end of epoch {} | test loss {},test ppl {}'.format(epoch,test_loss,test_ppl))
print('-'*88)

----------------------------------------------------------------------------------------
end of epoch 3 | test loss 5.502080969667734,test ppl 245.201658932124
----------------------------------------------------------------------------------------
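
Perplexity is the exponential of the mean per-token cross-entropy, so it can be read as an effective number of equally likely next-word choices. The short sketch below uses only the standard library. The vocabulary size V=1000 is a hypothetical value for illustration, while the loss value is the test loss printed above.

```python
import math

def perplexity(mean_cross_entropy):
    # Perplexity is exp of the mean per-token cross-entropy (measured in nats).
    return math.exp(mean_cross_entropy)

# A model that spreads probability uniformly over V tokens has loss ln(V) and perplexity V.
V = 1000  # hypothetical vocabulary size, for illustration only
uniform_ppl = perplexity(math.log(V))

# The test loss reported above.
reported_ppl = perplexity(5.502080969667734)
print(uniform_ppl, reported_ppl)
```

A test perplexity of about 245 therefore means the model is, on average, roughly as uncertain as a uniform choice among 245 words.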
