# 2021 Language and Intelligence Challenge: Multi-Skill Dialogue

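A multi-skill dialogue system is an open-domain, multi-turn dialogue system that blends several dialogue skills, such as knowledge-grounded dialogue and recommendation dialogue, so that a machine can converse with people fluently and naturally and thereby improve the user experience.

This example shows how to use PaddleNLP to quickly build the baseline for [2021 Language and Intelligence Challenge: Multi-Skill Dialogue](https://aistudio.baidu.com/aistudio/competition/detail/67) and how to improve on it.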

In [1]:
# Install the latest version of paddlenlp
!pip install --upgrade paddlenlp -i https://pypi.org/simple

%cd multi-skill_dialogue/

Requirement already up-to-date: paddlenlp in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.0.5)
/home/aistudio/multi-skill_dialogue


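## Multi-Skill Dialogue Baseline

The competition provides several sub-datasets covering knowledge-grounded dialogue, recommendation dialogue, persona dialogue, and other dialogue types. The baseline uses the UnifiedTransformer model. Besides the data tokens and special tokens such as `[CLS]` and `[SEP]`, the model input also contains special tokens that distinguish the different dialogue skills.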

![Model input](https://ai-studio-static-online.cdn.bcebos.com/24d697df544c4299a679e04e2d3b1442fdf17a14981e454e8a2de5c7acea8051)
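
To make the input layout more concrete, here is a minimal sketch that assembles one sample from a toy vocabulary. Everything in it is an assumption made for illustration: the vocabulary, the skill token name `[SKILL_KNOWLEDGE]`, the helper `build_input`, the simplified layout (`[CLS]`, skill token, context turns, response), and the use of type ID 0 for context and 1 for the response. The real special tokens and ID conversion are defined by the baseline's preprocessing script and tokenizer.

```python
# Toy vocabulary; every entry is illustrative and unrelated to the real vocab file.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[SKILL_KNOWLEDGE]": 3,
         "hello": 4, "hi": 5, "there": 6}


def build_input(skill_token, context_turns, response, vocab=VOCAB):
    """Return (token_ids, type_ids, pos_ids) for one sample.

    Assumed layout: [CLS] skill turn_1 [SEP] ... turn_n [SEP] response [SEP]
    """
    tokens = ["[CLS]", skill_token]
    for turn in context_turns:
        tokens += turn + ["[SEP]"]
    n_context = len(tokens)
    tokens += response + ["[SEP]"]

    token_ids = [vocab[t] for t in tokens]
    # Assumption: 0 marks context positions, 1 marks response positions.
    type_ids = [0] * n_context + [1] * (len(tokens) - n_context)
    pos_ids = list(range(len(tokens)))
    return token_ids, type_ids, pos_ids


token_ids, type_ids, pos_ids = build_input(
    "[SKILL_KNOWLEDGE]", [["hello"]], ["hi", "there"])
print(token_ids)
print(type_ids)
print(pos_ids)
```

Changing only the skill token while keeping the rest of the sequence identical is what would let a single model be told which skill a sample belongs to.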

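### Building the Baseline, Step 1: Data Preprocessing

Because the competition [datasets](https://aistudio.baidu.com/aistudio/competition/detail/67) are **numerous and large** and **differ in format**, a script is needed to preprocess them and convert the text into token IDs.

**Note:** Make sure the input file paths, output file paths, and parameter settings in the script are correct. Because the data is large, the script takes a long time to run (especially on the training set). You can also process the data in several batches.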

In [None]:
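# Note: by default the script processes only part of each dataset as training data for the baseline model; participants should adapt the data processing strategy to their needs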
!python ./tools/convert_data_to_numerical.py ./tools/spm.model

Total num : 250484 
	truncate type 1: 5 rate(0.0000)
	truncate tye 2: 66403 rate(0.2651)
	truncate type 3: 3002 rate(0.0120)
	truncate type 4: 28 rate(0.0001)
Total num : 45334 
	truncate type 1: 271 rate(0.0060)
	truncate tye 2: 6561 rate(0.1447)
	truncate type 3: 351 rate(0.0077)
	truncate type 4: 63 rate(0.0014)
Total num : 23090 
	truncate type 1: 0 rate(0.0000)
	truncate tye 2: 8429 rate(0.3650)
	truncate type 3: 35 rate(0.0015)
	truncate type 4: 0 rate(0.0000)


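### Building the Baseline, Step 2: Building the Model

[UnifiedTransformer](https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue) uses the Transformer encoder as its basic building block and adopts a flexible attention mechanism, which makes it well suited to text generation. Special tokens that identify the dialogue skill are added to the model input, so a single model can support chit-chat, recommendation dialogue, and knowledge-grounded dialogue.

**PaddleNLP provides Chinese pretrained UnifiedTransformer models that can be loaded in one step by name. For convenient data handling, PaddleNLP also ships a matching Tokenizer that handles tokenization, token-to-ID conversion, and ID-to-token conversion.**

PaddleNLP currently provides two Chinese pretrained models for UnifiedTransformer:
- `unified_transformer-12L-cn`: pretrained on a large-scale Chinese conversation dataset
- `unified_transformer-12L-cn-luge`: obtained by fine-tuning `unified_transformer-12L-cn` on the Qianyan (LUGE) dialogue dataset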

In [2]:
from paddlenlp.transformers import UnifiedTransformerLMHeadModel, UnifiedTransformerTokenizer

# Name of the pretrained model
model_name_or_path = 'unified_transformer-12L-cn-luge'

# Load the pretrained model
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)
# Load the matching tokenizer
tokenizer = UnifiedTransformerTokenizer.from_pretrained(model_name_or_path)

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  def convert_to_list(value, n, name, dtype=np.int):
[2021-07-19 10:17:00,991] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/unified_transformer-12L-cn-luge/unified_transformer-12L-cn-luge.pdparams
[2021-07-19 10:17:11,870] [    INFO] - Found /home/aistudio/.paddlenlp/models/unified_transformer-12L-cn-luge/unified_transformer-12L-cn-vocab.txt
[2021-07-19 10:17:11,873] [    INFO] - Found /home/aistudio/.paddlenlp/models/unified_transformer-12L-cn-luge/unified_transformer-12L-cn-spm.model


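### Building the Baseline, Step 3: Loading the Data

The baseline defines a custom iterable dataset, `DialogueDataset`, by subclassing `paddle.io.IterableDataset`. It handles file reading, shuffling, and batching; see `data.py` for details.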
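
The batching idea can be sketched in plain Python. This is not the code in `data.py`; it is a rough illustration under two assumptions: that `batch_size` is a budget of padded tokens per batch (which the value 8192 suggests), and that `sort_pool_size` is the number of samples buffered before sorting by length. The function name `batch_by_tokens` is hypothetical.

```python
import random


def batch_by_tokens(samples, max_tokens, sort_pool_size, shuffle=True, seed=0):
    """Yield batches whose padded size (batch length * longest sample) fits max_tokens."""
    rng = random.Random(seed)

    def flush(pool):
        # Sorting by length groups similar-length samples, which reduces padding.
        pool.sort(key=len)
        batches, batch, max_len = [], [], 0
        for sample in pool:
            new_max = max(max_len, len(sample))
            # Close the current batch if adding this sample would exceed the budget.
            if batch and (len(batch) + 1) * new_max > max_tokens:
                batches.append(batch)
                batch, new_max = [], len(sample)
            batch.append(sample)
            max_len = new_max
        if batch:
            batches.append(batch)
        # Shuffle whole batches so training does not see lengths in sorted order.
        if shuffle:
            rng.shuffle(batches)
        return batches

    pool = []
    for sample in samples:
        pool.append(sample)
        if len(pool) >= sort_pool_size:
            yield from flush(pool)
            pool = []
    if pool:
        yield from flush(pool)


# Six dummy samples with lengths 5, 1, 3, 2, 4, 1.
samples = [[0] * n for n in [5, 1, 3, 2, 4, 1]]
batches = list(batch_by_tokens(samples, max_tokens=6, sort_pool_size=6, shuffle=False))
print([[len(s) for s in b] for b in batches])
```

Because each yielded item is already a complete batch, a loader consuming such a dataset does not need to batch again, which is consistent with passing `batch_size=None` to `DataLoader` in the cell below.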

In [3]:
from paddle.io import DataLoader
from data import DialogueDataset

# Training batch_size
batch_size = 8192
# pool_size used for sorting and shuffling when forming batches
sort_pool_size = 65536

# Training set path; keep it consistent with the preprocessing output path
train_data_path = './datasets/train.txt'
# Initialize the Dataset
train_dataset = DialogueDataset(
    train_data_path,
    batch_size,
    tokenizer.pad_token_id,
    tokenizer.cls_token_id,
    sort_pool_size,
    mode='train')
# Initialize the DataLoader
train_dataloader = DataLoader(train_dataset, return_list=True, batch_size=None)

# Validation set path; keep it consistent with the preprocessing output path
valid_data_path = './datasets/dev.txt'
valid_dataset = DialogueDataset(
    valid_data_path,
    batch_size,
    tokenizer.pad_token_id,
    tokenizer.cls_token_id,
    sort_pool_size,
    mode='valid')
valid_dataloader = DataLoader(valid_dataset, return_list=True, batch_size=None)

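### Building the Baseline, Step 4: Training and Optimization

This baseline uses the cross-entropy loss and `paddle.optimizer.AdamW` as the optimizer.

During training, checkpoints are saved to the `checkpoints` folder in the current directory. The model is also evaluated on the validation set during training, reporting metrics such as `loss` and `PPL`.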
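
The relation between the two reported metrics is that `PPL` is the exponential of the average per-token cross-entropy. The sketch below works through it with made-up numbers; the helper `perplexity` and the batch values are hypothetical and only illustrate the arithmetic.

```python
import math


def perplexity(total_loss, total_tokens):
    """Perplexity = exp(mean per-token cross-entropy, measured in nats)."""
    return math.exp(total_loss / total_tokens)


# Two hypothetical batches: (summed token loss, number of target tokens).
batches = [(120.0, 50), (90.0, 25)]
total_loss = sum(loss for loss, _ in batches)
total_tokens = sum(n for _, n in batches)

avg_loss = total_loss / total_tokens          # 210 / 75 = 2.8, weighted by token count
ppl = perplexity(total_loss, total_tokens)

# Averaging the per-batch means instead would give (2.4 + 3.6) / 2 = 3.0,
# which over-weights the smaller batch.
mean_of_means = sum(loss / n for loss, n in batches) / len(batches)

print('loss: %.4f - ppl: %.4f' % (avg_loss, ppl))
```

Summing the loss over all tokens and dividing once at the end keeps the average correct when batches contain different numbers of target tokens.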

In [4]:
import os

# Helper that saves a model checkpoint together with its tokenizer
def save_ckpt(model, tokenizer, save_dir, name):
    output_dir = os.path.join(save_dir, "model_{}".format(name))
    os.makedirs(output_dir, exist_ok=True)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

In [5]:
import math
import time
import paddle
import paddle.nn.functional as F

# Model evaluation function; during training the model is evaluated on the validation set
@paddle.no_grad()
def evaluation(model, data_loader):
    print('\nEval begin...')
    model.eval()
    total_tokens = 0
    total_loss = 0.0
    start_time = time.time()
    step = 0
    for inputs in data_loader:
        step += 1
        token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs

        logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos)
        loss = F.cross_entropy(logits, tgt_label, reduction='sum')

        total_loss += loss.numpy()[0]
        total_tokens += tgt_label.shape[0]

    avg_loss = total_loss / total_tokens
    ppl = math.exp(avg_loss)
    avg_speed = (time.time() - start_time) / step
    print('loss: %.4f - ppl: %.4f - %.3fs/step\n' % (avg_loss, ppl, avg_speed))
    model.train()

In [6]:
import paddle.nn as nn
from paddle.optimizer.lr import NoamDecay
from paddle.optimizer import AdamW

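# Learning rate
lr = 1e-5
# Number of iterations for the learning rate to warm up to the base learning rate (the lr set above)
warmup_steps = 4000
# weight_decay coefficient used by the AdamW optimizer
weight_decay = 0.01
# Maximum global gradient norm allowed by gradient clipping
max_grad_norm = 0.1

# Initialize the Noam learning rate decay schedule
lr_scheduler = NoamDecay(1 / (warmup_steps * (lr**2)), warmup_steps)
# Do not apply weight_decay to bias and LayerNorm parameters
decay_params = [
    p.name for n, p in model.named_parameters()
    if not any(nd in n for nd in ["bias", "norm"])
]
# Initialize the AdamW optimizer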
optimizer = AdamW(
    learning_rate=lr_scheduler,
    parameters=model.parameters(),
    weight_decay=weight_decay,
    apply_decay_param_fun=lambda x: x in decay_params,
    grad_clip=nn.ClipGradByGlobalNorm(max_grad_norm))
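
The first argument passed to `NoamDecay` may look odd, so here is a small worked example. It assumes the Noam formula `d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)` with a base factor of 1.0; the helper `noam_lr` is a hypothetical re-implementation for illustration and does not call Paddle.

```python
def noam_lr(step, d_model, warmup_steps, base=1.0):
    """Noam schedule, valid for step >= 1."""
    return base * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


lr = 1e-5
warmup_steps = 4000
# Same expression as the first argument of NoamDecay in the cell above.
d_model = 1 / (warmup_steps * lr ** 2)

peak = noam_lr(warmup_steps, d_model, warmup_steps)   # end of warmup
early = noam_lr(400, d_model, warmup_steps)           # 10% of the way through warmup
late = noam_lr(16000, d_model, warmup_steps)          # 4x the warmup length

print('%.7f %.7f %.7f' % (early, peak, late))
```

With this choice, `d_model^-0.5` equals `lr * sqrt(warmup_steps)`, so the schedule rises linearly to exactly `lr` at `warmup_steps` and then decays with the inverse square root of the step. Under the same assumption, the value at step 400 is `1e-6`, which agrees with the `lr: 0.0000010` entry in the training log.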

In [None]:
import time

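# Number of training epochs
epochs = 10
# Logging interval (in steps)
logging_steps = 50
# Interval for saving and evaluating the model (in steps)
save_steps = 100
# Directory where checkpoints are saved
save_dir = './checkpoints/'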

step = 0
total_time = 0.0
for epoch in range(epochs):
    print('\nEpoch %d/%d' % (epoch + 1, epochs))
    batch_start_time = time.time()
    for inputs in train_dataloader:
        step += 1
        token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs

        logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos)
        # Compute the loss with the cross-entropy loss function
        loss = F.cross_entropy(logits, tgt_label)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        total_time += (time.time() - batch_start_time)
        if step % logging_steps == 0:
            ppl = paddle.exp(loss)
            print('step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step'
                % (step, loss, ppl, optimizer.get_lr(), total_time / logging_steps))
            total_time = 0.0
        if step % save_steps == 0:
            # Evaluate the model on the validation set
            evaluation(model, valid_dataloader)
            # Save the model
            save_ckpt(model, tokenizer, save_dir, step)
        batch_start_time = time.time()
print("\n=====training complete=====")


Epoch 1/10
step 50 - loss: 1.9291 - ppl: 6.8832 - lr: 0.0000001 - 0.728s/step
step 100 - loss: 1.9400 - ppl: 6.9587 - lr: 0.0000002 - 0.484s/step

Eval begin...
loss: 2.7435 - ppl: 15.5418 - 0.164s/step

step 150 - loss: 1.8874 - ppl: 6.6023 - lr: 0.0000004 - 0.484s/step
step 200 - loss: 1.7354 - ppl: 5.6713 - lr: 0.0000005 - 0.487s/step

Eval begin...
loss: 2.7431 - ppl: 15.5343 - 0.165s/step

step 250 - loss: 1.3325 - ppl: 3.7904 - lr: 0.0000006 - 0.495s/step
step 300 - loss: 1.6799 - ppl: 5.3651 - lr: 0.0000008 - 0.487s/step

Eval begin...
loss: 2.7431 - ppl: 15.5354 - 0.165s/step

step 350 - loss: 2.3642 - ppl: 10.6360 - lr: 0.0000009 - 0.487s/step
step 400 - loss: 1.8773 - ppl: 6.5358 - lr: 0.0000010 - 0.485s/step

Eval begin...
loss: 2.7426 - ppl: 15.5273 - 0.166s/step

step 450 - loss: 1.4364 - ppl: 4.2056 - lr: 0.0000011 - 0.487s/step
step 500 - loss: 1.6333 - ppl: 5.1206 - lr: 0.0000013 - 0.482s/step

Eval begin...
loss: 2.7436 - ppl: 15.5429 - 0.164s/step

step 550 - loss: 1

KeyboardInterrupt: 

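### Building the Baseline, Step 5: Prediction and Decoding

Initialize the model with the parameters saved during training, load the test set, and run prediction.

**For generation tasks, PaddleNLP provides the `generate` function, which supports the Greedy Search, Beam Search, and Sampling decoding strategies. You only need to specify the decoding strategy and its parameters to run decoding and obtain the token IDs of the generated sequences together with their probability scores.**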
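
For readers unfamiliar with the sampling strategy, the following sketch shows what a single top-k sampling step does. It is a plain-Python illustration, not PaddleNLP's implementation; the function `top_k_sample` and the example logits are hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample one token id from the k highest-scoring entries of `logits`."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to those k entries (shifted by the max for stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, 1.5, -1.0, 0.0, 3.0]

# With k=3 only token ids 5, 0 and 2 (the three largest logits) can be drawn.
samples = [top_k_sample(logits, k=3, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))

# With k=1 the step always picks the single best token, like greedy search.
print(top_k_sample(logits, k=1, rng=rng))
```

A small `k` keeps generation close to the most likely continuation while still allowing the varied candidates that make drawing several sequences per input worthwhile.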

In [7]:
# This can be a pretrained model name provided by paddlenlp or the path to your own fine-tuned model
model_name_or_path = 'unified_transformer-12L-cn-luge'
# Load the model
model = UnifiedTransformerLMHeadModel.from_pretrained(model_name_or_path)

[2021-07-19 10:17:33,509] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/unified_transformer-12L-cn-luge/unified_transformer-12L-cn-luge.pdparams


In [8]:
# Prediction batch_size
batch_size = 4

# Test set path; keep it consistent with the preprocessing output path
test_data_path = './datasets/test.txt'
test_dataset = DialogueDataset(
    test_data_path,
    batch_size,
    tokenizer.pad_token_id,
    tokenizer.cls_token_id,
    mode='test')
test_dataloader = DataLoader(test_dataset, return_list=True, batch_size=None)

In [9]:
import time
from data import select_response

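# Maximum length of the generated sequence
max_dec_len = 64
# Minimum length of the generated sequence
min_dec_len = 1
# Decoding strategy
decode_strategy = 'sampling'
# top_k parameter for top-k sampling
top_k = 5
# Number of output sequences returned per input sequence; the generation API replicates the input internally
num_return_sequences = 20
# Path where the generated text is saved
output_path = './predict.txt'
# Logging interval (in steps)
logging_steps = 10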

print('\nInfer begin...')
model.eval()
total_time = 0.0
start_time = time.time()
responses = []
for step, inputs in enumerate(test_dataloader, 1):
    input_ids, token_type_ids, position_ids, attention_mask = inputs
    ids, scores = model.generate(
        input_ids=input_ids,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        attention_mask=attention_mask,
        max_length=max_dec_len,
        min_length=min_dec_len,
        decode_strategy=decode_strategy,
        top_k=top_k,
        num_return_sequences=num_return_sequences)

    total_time += (time.time() - start_time)
    if step % logging_steps == 0:
        print('step %d - %.3fs/step' % (step, total_time / logging_steps))
        total_time = 0.0
    # Rank the model outputs and pick the best of the num_return_sequences sequences as the result
    results = select_response(ids, scores, tokenizer, max_dec_len, num_return_sequences)
    responses.extend(results)

    start_time = time.time()

# Save the generated text
with open(output_path, 'w', encoding='utf-8') as fout:
    for response in responses:
        fout.write(response + '\n')
print('\nSave inference result into: %s' % output_path)


Infer begin...


Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  elif dtype == np.bool:
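
The selection step in the loop above can be pictured as follows. This is only a sketch of the idea, not the `select_response` function from `data.py`: it assumes the candidates for each input are stored contiguously and that the best candidate is simply the one with the highest score. The real function also decodes token IDs into text and may apply further rules. The name `select_best` is hypothetical.

```python
def select_best(candidates, scores, num_return_sequences):
    """Pick the highest-scoring candidate within each consecutive group."""
    if len(candidates) % num_return_sequences != 0:
        raise ValueError("candidates must split evenly into groups")
    best = []
    for start in range(0, len(candidates), num_return_sequences):
        group = range(start, start + num_return_sequences)
        best.append(candidates[max(group, key=lambda i: scores[i])])
    return best


# Two inputs with two sampled candidates each (scores are made-up log-probabilities).
candidates = ["a1", "a2", "b1", "b2"]
scores = [-1.0, -0.5, -0.2, -0.9]
print(select_best(candidates, scores, num_return_sequences=2))
```

The result contains one response per input, which is why `responses` ends up with one line per test sample before it is written to `output_path`.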


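### Building the Baseline, Step 6: Submitting Results

The predictions are saved to `output_path`. Convert them to the format required by the competition website, then submit them to the [competition website](https://aistudio.baidu.com/aistudio/competition/detail/67) for evaluation.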

The baseline above is built on PaddleNLP. Open source takes effort, so your support is appreciated.
**Remember to give [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) a Star⭐**

GitHub: [https://github.com/PaddlePaddle/PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)
![](https://ai-studio-static-online.cdn.bcebos.com/a0e8ca7743ea4fe9aa741682a63e767f8c48dc55981f4e44a40e0e00d3ab369e)