## 简介
当输入序列长度和输出标签序列长度都可变时(即输入序列的子序列个数和输出序列的子序列个数不同)

数据预处理:
指定子序列长度,填充<pad>词元或裁剪得到固定长度的子序列,增加<eos>词元至子序列末尾

encoder(编码器):
1.子序列(batchsize,num_steps,vocabsize)用embeding层降维至(batchsize,num_steps,embed_size) #num_steps是子序列的词元个数,即RNN的时间步骤个数
2.子序列(batchsize,num_steps,vocabsize)输出为(batchsize,num_steps,num_hiddens)和每个子序列的末尾时间步骤的状态(batchsize,num_layers,num_hiddens) #num_hidddens是隐藏层特征数,num_layers是RNN层数

decoder(编码器):(N个长度固定的子序列->batchsize个为一批次,子序列长度为num_steps)
1.子序列(batchsize,num_steps,vocabsize)用embeding层降维至(batchsize,num_steps,embed_size) #num_steps是子序列的词元个数,即RNN的时间步骤个数
2.取encoder输出的最后一个时间步骤的预测的特征(batchsize,num_hiddens),添加为decoder子序列的特征(batchsize,num_steps,embed_size+num_hiddens) #decoder的每个时间步都包含encoder的特征?
3.拼接后的decoder子序列(batchsize,num_steps,embed_size+num_hiddens)的RNN输出为(batchsize,num_steps,num_hiddens),并通过linear层转为(batchsize,num_steps,vocabsize),其中RNN初始状态是encoder的末尾时间步骤的状态(batchsize,num_layers,num_hiddens) 


In [24]:
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

## 模型定义

In [None]:
class Lit_encoder(pl.LightningModule):
    def __init__(self,vocab_size,embed_size,num_hiddens,num_layers,dropout=0):
        super(Lit_encoder,self).__init__()
        self.embedding = torch.nn.Embedding(vocab_size,embed_size)
        self.rnn = torch.nn.GRU(embed_size,num_hiddens,num_layers,batch_first=True,dropout=dropout)
    
    def forward(self,x):
        x = self.embedding(x)
        outputs,state = self.rnn(x)
        #outputs: (batch_size,num_steps,num_hiddens)
        #state: (batch_size,num_layers,num_hiddens)
        return outputs, state

class Lit_decoder(pl.LightningModule):
    def __init__(self,vocab_size,embed_size,num_hiddens,num_layers,dropout=0):
        super(Lit_decoder,self).__init__()
        self.embedding = torch.nn.Embedding(vocab_size,embed_size)
        self.rnn = torch.nn.GRU(embed_size+num_hiddens,num_hiddens,num_layers,batch_first=True,dropout=dropout)
        self.dense = torch.nn.Linear(num_hiddens,vocab_size)
    
    def forward(self,x,state):
        x = self.embedding(x) #size: (batch_size,num_steps,embed_size)
        enc_outputs, enc_state = state
        #context=enc_outputs[-1]  #d2l的enc_outputs是[nums_step,batch_size,num_hiddens]
        context=enc_outputs[:,-1,:]  #size: (batch_size,num_hiddens)
        context=context.unsqueeze(1) #size: (batch_size,1,num_hiddens)
        context=context.repeat(1,x.shape[1],1) #size: (batch_size,num_steps,num_hiddens)
        x=torch.cat((x,context),2)
        outputs,dec_state = self.rnn(x,enc_state)
        outputs=self.dense(outputs) #size: (batch_size,num_steps,vocab_size)
        return outputs, [enc_outputs,dec_state]

class Lit_encoder_decoder(pl.LightningModule):
    def __init__(self,encoder,decoder,lr=0.001,tgt_pad_id=0):
        super(Lit_encoder_decoder,self).__init__()
        self.encoder=encoder
        self.decoder=decoder
        self.lr=lr
        self.tgt_pad_id=tgt_pad_id
        torch.nn.MultiheadAttention

    def loss(self,y_pred,y):
        l=torch.nn.functional.cross_entropy(y_pred.view(-1,y_pred.shape[-1]),y.view(-1))
        mask=(y.view(-1)!=self.tgt_pad_id).float()
        l=l*mask
        return l.sum()/mask.sum()

    def forward(self,enc_x,dec_x):
        enc_result=self.encoder(enc_x)
        dec_result=self.decoder(dec_x,enc_result)
        return dec_result[0]

    def training_step(self,batch,batch_idx):
        x,y=batch
        y_pred=self(x,y) 
        loss=self.loss(y_pred,y)
        self.log('train_loss',loss, prog_bar=True, logger=True, on_epoch=True,on_step=True)
        return loss
    
    def validation_step(self,batch,batch_idx):
        x,y=batch
        y_pred=self(x,y)
        loss=self.loss(y_pred,y)
        self.log('val_loss',loss, prog_bar=True, logger=True, on_epoch=True,on_step=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(),lr=self.lr)



## 数据集

### d2l的数据集处理方式

In [114]:
import requests
import os
import re
import zipfile
class LitLoadData_fra_impl(pl.LightningDataModule):  
    def prepare_data(self):
        url = 'http://d2l-data.s3-accelerate.amazonaws.com/fra-eng.zip'
        #文件是否存在
        if os.path.exists('./data/fra-eng/fra.txt'):
            return
        #下载文件
        r = requests.get(url, stream=True)
        #解压文件
        with zipfile.ZipFile('./data/fra-eng.zip', 'r') as zip_ref:
            zip_ref.extractall('./data/fra-eng')
            
data=LitLoadData_fra_impl()
data.prepare_data()
#返回fra.txt内容
with open('./data/fra-eng/fra.txt', 'r', encoding='utf-8') as f:
    raw_txt = f.read()
print(raw_txt[:75])

Go.	Va !
Hi.	Salut !
Run!	Cours !
Run!	Courez !
Who?	Qui ?
Wow!	Ça alors !



In [115]:

def preprocess(raw_txt,max_tokens=10000,num_steps=9):
    #大写字母改为小写
    raw_txt=raw_txt.lower()
    #去掉空行,取前max_tokens行
    lines=raw_txt.split('\n')
    lines=[line for line in lines if len(line)>0]
    lines=lines[:max_tokens]
    #每行以 tab 分割为两组
    pairs=[line.split('\t') for line in lines]
    #删除空行
    pairs=[pair for pair in pairs if len(pair)==2]
    #每组单词分割,标点符号视为一个独立单词
    pairs=[[re.findall(r'\w+|[^\w\s]',pair[0]),re.findall(r'\w+|[^\w\s]',pair[1])] for pair in pairs]
    #返回源语言和目标语言
    src=[pair[0] for pair in pairs]
    tgt=[pair[1] for pair in pairs]
    #末尾添加特殊字符'<eos>'
    src=[pair+['<eos>'] for pair in src]
    tgt=[pair+['<eos>'] for pair in tgt]
    #tgt前面添加特殊字符'<bos>'
    tgt=[['<bos>']+pair for pair in tgt]
    #裁剪或填充'<pad>'至num_steps
    src=[pair[:num_steps]+['<pad>']*(num_steps-len(pair)) if len(pair)<num_steps else pair[:num_steps] for pair in src]
    tgt=[pair[:num_steps]+['<pad>']*(num_steps-len(pair)) if len(pair)<num_steps else pair[:num_steps] for pair in tgt]
    return src,tgt


LitLoadData_fra_impl.preprocess=preprocess 

src,tgt=preprocess(raw_txt,max_tokens=10000,num_steps=9)
print(src[:6])
print(tgt[:6])

[['go', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['hi', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['run', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['run', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['who', '?', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['wow', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]
[['<bos>', 'va', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'salut', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'cours', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'courez', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'qui', '?', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'], ['<bos>', 'ça', 'alors', '!', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>']]


In [116]:
import collections

def vocab(sentences, min_freq=0):
    tokens = [token for sentence in sentences for token in sentence]
    counter = collections.Counter(tokens)
    # 去掉频率小于min_freq的单词
    tokens = [token for token in counter if counter[token] >= min_freq] 
    # token的idx按频率降序
    tokens = sorted(tokens, key=lambda x: counter[x], reverse=True)
    idx_to_token = ['<unk>'] + tokens
    token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

LitLoadData_fra_impl.vocab=vocab

idx_to_token_src, token_to_idx_src = vocab(src)
idx_to_token_tgt, token_to_idx_tgt = vocab(tgt)
print(idx_to_token_src[:6])
print(idx_to_token_tgt[:6])
#print(token_to_idx_src['<d>'])  #todo: nokey的token是<unk>

print(token_to_idx_src['go']) #0

['<unk>', '<pad>', '<eos>', '.', 'i', "'"]
['<unk>', '<pad>', '<bos>', '<eos>', '.', "'"]
22


In [117]:
class LitLoadData_fra(LitLoadData_fra_impl):
    def __init__(self,batch_size=64,num_steps=9,num_trains=512,num_val=128):
        super().__init__()
        self.save_hyperparameters()
        self.prepare_data() #download txt
        with open('./data/fra-eng/fra.txt', 'r', encoding='utf-8') as f:
            self.raw_txt = f.read()
        self.src,self.tgt=preprocess(raw_txt= self.raw_txt,max_tokens= num_trains+num_val,num_steps=num_steps)
        self.idx_to_token_src,self.token_to_idx_src=vocab(self.src)
        self.idx_to_token_tgt,self.token_to_idx_tgt=vocab(self.tgt)
        self.src_idx=[[self.token_to_idx_src[token] for token in sentence] for sentence in self.src]
        self.tgt_idx=[[self.token_to_idx_tgt[token] for token in sentence] for sentence in self.tgt]
        self.src_idx=torch.tensor(self.src_idx)
        self.tgt_idx=torch.tensor(self.tgt_idx)

    def get_tgtpad_idx(self):
        return self.token_to_idx_tgt['<pad>']

    def train_dataloader(self):
        src=self.src_idx[:self.hparams.num_trains]
        tgt=self.tgt_idx[:self.hparams.num_trains]
        dataset = torch.utils.data.TensorDataset(src, tgt)
        return torch.utils.data.DataLoader(dataset, batch_size=self.hparams.batch_size, shuffle=True)

    def val_dataloader(self):
        src=self.src_idx[self.hparams.num_trains:]
        tgt=self.tgt_idx[self.hparams.num_trains:]
        dataset = torch.utils.data.TensorDataset(src, tgt)
        return torch.utils.data.DataLoader(dataset, batch_size=self.hparams.batch_size, shuffle=False)

    

### 用tokenizers训练文本自动提取token

In [118]:
from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing
from tokenizers.pre_tokenizers import BertPreTokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

class TokenizerTrainer:
    def __init__(self, num_step=9):
        self.num_step = num_step
        self.tokenizer = Tokenizer(BPE())
        self.tokenizer.pre_tokenizer = BertPreTokenizer()
        self.tokenizer.add_special_tokens(["<pad>", "<bos>", "<eos>", "<unk>"])
        self.trainer = BpeTrainer(special_tokens=["<pad>", "<bos>", "<eos>", "<unk>"], min_frequency=2)
        self.tokenizer.enable_padding(pad_id=self.tokenizer.token_to_id("<pad>"), pad_token="<pad>", length=self.num_step)
        self.tokenizer.enable_truncation(max_length=self.num_step)
        self.tokenizer.post_processor = TemplateProcessing(
            single="<bos> $A <eos>",
            pair="<bos> $A <eos> <bos> $B <eos>",
            special_tokens=[
                ("<bos>", self.tokenizer.token_to_id("<bos>")),
                ("<eos>", self.tokenizer.token_to_id("<eos>")),
            ],
        )

    def train(self, file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        pairs = [line.split('\t') for line in lines]
        self.tokenizer.train_from_iterator(pairs, self.trainer)

    def save(self, path):
        self.tokenizer.save(path)

    def encode(self, text):
        return self.tokenizer.encode(text)

    def decode(self, ids, skip_special_tokens=True):
        return self.tokenizer.decode(ids, skip_special_tokens=skip_special_tokens)

tokenizer = TokenizerTrainer()
tokenizer.train('./data/fra-eng/fra.txt')
tokenizer.save('./data/fra-eng/tokenizer.json')
print(tokenizer.encode('/x020d').tokens) #todo: 是否应该是<unk>? 是和否对模型训练有影响吗?
print(tokenizer.decode([0, 1, 2, 3, 4, 5],skip_special_tokens=False)) #todo: 这里的skip_special_tokens=True是否应该是False? 是和否对模型训练有影响吗?


['<bos>', '/', 'x', '0', '20', 'd', '<eos>', '<pad>', '<pad>']
<pad> <bos> <eos> <unk> ! "


## 流程

In [123]:
if __name__ == '__main__':
    data = LitLoadData_fra(batch_size=128, num_steps=9, num_trains=512, num_val=128)
    encoder = Lit_encoder(len(data.idx_to_token_src),embed_size=256,num_hiddens=256,num_layers=2,dropout=0.2)
    decoder = Lit_decoder(len(data.idx_to_token_tgt),embed_size=256,num_hiddens=256,num_layers=2,dropout=0.2)
    model = Lit_encoder_decoder(encoder, decoder,lr=0.005)
    
    logger = TensorBoardLogger("tensorBoard-logs/", name="RNNModel_v4")
    checkpoint_callback = ModelCheckpoint(
        monitor='val_loss',
        dirpath='checkpoints',
        filename='RNNModel_v4_{epoch:02d}_{val_loss:.2f}',
       # save_top_k=3,
        mode='min',
    )
    trainer = pl.Trainer(max_epochs=30,gradient_clip_algorithm='norm',gradient_clip_val=1, logger=logger, callbacks=[checkpoint_callback])
    trainer.fit(model, data)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
c:\Users\zncyxiong\AppData\Local\anaconda3\envs\pytorch_python3128\Lib\site-packages\pytorch_lightning\callbacks\model_checkpoint.py:654: Checkpoint directory D:\algorithm\deeplearning_zh.d2l.ai\pytorch\checkpoints exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type        | Params | Mode 
------------------------------------------------
0 | encoder | Lit_encoder | 860 K  | train
1 | decoder | Lit_decoder | 1.3 M  | train
------------------------------------------------
2.1 M     Trainable params
0         Non-trainable params
2.1 M     Total params
8.463     Total estimated model params size (MB)
7         Modules in train mode
0         Modules in eval mode


                                                                           

c:\Users\zncyxiong\AppData\Local\anaconda3\envs\pytorch_python3128\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
c:\Users\zncyxiong\AppData\Local\anaconda3\envs\pytorch_python3128\Lib\site-packages\pytorch_lightning\trainer\connectors\data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.
c:\Users\zncyxiong\AppData\Local\anaconda3\envs\pytorch_python3128\Lib\site-packages\pytorch_lightning\loops\fit_loop.py:310: The number of training batches (4) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the trainin

Epoch 29: 100%|██████████| 4/4 [00:00<00:00, 47.62it/s, v_num=17, train_loss_step=0.00225, val_loss_step=1.220, val_loss_epoch=1.220, train_loss_epoch=0.00232] 

`Trainer.fit` stopped: `max_epochs=30` reached.


Epoch 29: 100%|██████████| 4/4 [00:00<00:00, 44.44it/s, v_num=17, train_loss_step=0.00225, val_loss_step=1.220, val_loss_epoch=1.220, train_loss_epoch=0.00232]


In [124]:
%load_ext tensorboard

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [125]:
%tensorboard --logdir ./tensorBoard-logs/RNNModel_v4

Reusing TensorBoard on port 6006 (pid 1344), started 0:10:20 ago. (Use '!kill 1344' to kill it.)

In [None]:
def f(x):
    return 2 * torch.sin(x) + x

n = 40
x_train, _ = torch.sort(torch.rand(n) * 5)
y_train = f(x_train) + torch.randn(n)
x_val = torch.arange(0, 5, 0.1)
y_val = f(x_val)
x_train.reshape((-1, 1)) #shape: (40, 1)
x_val.reshape((1, -1)) #shape: (1, 50)
dists = x_train.reshape((-1, 1)) - x_val.reshape((1, -1)) #shape: (40, 50)

tensor([[0.0000, 0.1000, 0.2000, 0.3000, 0.4000, 0.5000, 0.6000, 0.7000, 0.8000,
         0.9000, 1.0000, 1.1000, 1.2000, 1.3000, 1.4000, 1.5000, 1.6000, 1.7000,
         1.8000, 1.9000, 2.0000, 2.1000, 2.2000, 2.3000, 2.4000, 2.5000, 2.6000,
         2.7000, 2.8000, 2.9000, 3.0000, 3.1000, 3.2000, 3.3000, 3.4000, 3.5000,
         3.6000, 3.7000, 3.8000, 3.9000, 4.0000, 4.1000, 4.2000, 4.3000, 4.4000,
         4.5000, 4.6000, 4.7000, 4.8000, 4.9000]])