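# "NLP Live Course" Day 5: SKEP, a Pre-trained Model for Sentiment Analysis

This project introduces two subtasks of sentiment analysis in detail: sentence-level sentiment analysis and aspect-level (target-level) sentiment analysis.

It also demonstrates how to use the sentiment pre-trained model SKEP for both tasks, and describes SKEP and how to use it in PaddleNLP.

The project has four main parts: "Task Introduction", "The Sentiment Pre-trained Model SKEP", "Sentence-level Sentiment Analysis", and "Aspect-level Sentiment Analysis".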


In [None]:
!pip install --upgrade paddlenlp -i https://pypi.org/simple 

Collecting paddlenlp
  Downloading https://files.pythonhosted.org/packages/63/7a/e6098c8794d7753470071f58b07843824c40ddbabe213eae458d321d2dbe/paddlenlp-2.0.3-py3-none-any.whl (451kB)
Installing collected packages: paddlenlp
  Found existing installation: paddlenlp 2.0.1
    Uninstalling paddlenlp-2.0.1:
      Successfully uninstalled paddlenlp-2.0.1
Successfully installed paddlenlp-2.0.3


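##  Part A. The Sentiment Analysis Task

Human language carries rich sentiment: it expresses emotions (such as sadness or joy), moods (such as weariness or melancholy), preferences (such as liking or disliking), personality traits, and stances. Sentiment analysis is applied in scenarios such as product preference, purchasing decisions, and public-opinion analysis. Automatically analyzing these sentiment tendencies helps companies understand how consumers feel about their products, which gives them a basis for product improvement. It also helps companies gauge the attitudes of their business partners so that they can make better business decisions.

The best-known sentiment analysis task classifies a piece of text, for example as a three-way classification of sentiment polarity into **positive**, **negative**, and **other**:
<br></br>
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/b630901b397e4e7a8e78ab1d306dfa1fc070d91015a64ef0b8d590aaa8cfde14" width="600" ></center>
<br><center>The sentiment analysis task</center></br>

- **Positive:** positive sentiment, such as happiness, joy, pleasant surprise, or anticipation.
- **Negative:** negative sentiment, such as sadness, grief, anger, or fear.
- **Other:** any other kind of sentiment.

This familiar task is in fact the **sentence-level sentiment analysis task**.


Sentiment analysis can be further divided into tasks such as **sentence-level sentiment analysis** and **aspect-level sentiment analysis**. The following sections describe both tasks and their application scenarios.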


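## Part B. The Sentiment Pre-trained Model SKEP

In recent years, a large body of research has shown that pre-trained models (PTMs) trained on large corpora learn general-purpose language representations that benefit downstream NLP tasks and avoid training a model from scratch. With growing compute, the arrival of deep architectures (namely the Transformer), and better training techniques, PTMs have kept evolving from shallow to deep.

SKEP (Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis) is a sentiment pre-trained model that uses sentiment knowledge to enhance pre-training. It surpassed the previous SOTA on 14 typical Chinese and English sentiment analysis tasks, and the work was accepted at ACL 2020. SKEP is a sentiment pre-training algorithm proposed by the Baidu research team: it mines sentiment knowledge automatically with an unsupervised method, then uses that knowledge to build pre-training objectives so that the model learns to understand sentiment semantics. SKEP provides a unified and powerful sentiment representation for all kinds of sentiment analysis tasks.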
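
To make "building pre-training objectives from sentiment knowledge" more concrete, here is a toy sketch of sentiment masking: tokens found in a sentiment lexicon are replaced by `[MASK]`, and a model would then be trained to recover them. The lexicon, the whitespace tokenization, and the function name are all hypothetical. SKEP mines its sentiment knowledge automatically, and its actual objectives are described in the paper.

```python
# Hypothetical, hand-written lexicon; SKEP mines sentiment words automatically.
SENTIMENT_LEXICON = {"fast": "positive", "appreciated": "positive", "slow": "negative"}

def sentiment_mask(tokens, lexicon, mask_token="[MASK]"):
    """Mask every token that appears in the sentiment lexicon.

    Returns the masked token list and a dict {position: (word, polarity)}
    that a pre-training objective could use as prediction targets.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            masked.append(mask_token)
            targets[i] = (tok, lexicon[tok])
        else:
            masked.append(tok)
    return masked, targets

tokens = "this product came really fast and i appreciated it".split()
masked, targets = sentiment_mask(tokens, SENTIMENT_LEXICON)
print(masked)
print(targets)
```

The returned `targets` keep both the original word and its polarity, which is the kind of signal a sentiment-aware objective needs in addition to plain masked-word prediction.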

**Paper**: https://arxiv.org/abs/2005.05635

<p align="center">
<img src="https://paddlenlp.bj.bcebos.com/models/transformers/skep/skep.png" width="80%" height="60%"> <br />
</p>

The Baidu research team further validated SKEP on three typical sentiment analysis tasks, sentence-level sentiment classification, aspect-level sentiment classification, and opinion role labeling, covering 14 Chinese and English datasets in total.

For detailed experimental results, see: https://github.com/baidu/Senta#skep




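## Part C. Sentence-level & Aspect-level Sentiment Analysis

### Part C.1 Sentence-level Sentiment Analysis


This task classifies the sentiment polarity of a given piece of text. It is commonly used for movie-review analysis, public-opinion analysis of online forums, and similar scenarios. For example: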

```text
选择珠江花园的原因就是方便，有电动扶梯直接到达海边，周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般，但还算整洁。 泳池在大堂的屋顶，因此很小，不过女儿倒是喜欢。 包的早餐是西式的，还算丰富。 服务吗，一般	1
15.4寸笔记本的键盘确实爽，基本跟台式机差不多了，蛮喜欢数字小键盘，输数字特方便，样子也很美观，做工也相当不错	1
房间太小。其他的都一般。。。。。。。。。	0
```

Here `1` denotes positive sentiment and `0` denotes negative sentiment.
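
Each record above is a tab-separated `text<TAB>label` line. A minimal sketch of parsing such lines, assuming the label is always the last tab-separated field (the helper name `parse_labeled_line` and the shortened sample strings are illustrative only):

```python
def parse_labeled_line(line):
    """Split a 'text<TAB>label' record into (text, label_id)."""
    # Split on the last tab so that a tab inside the text would not break parsing.
    text, label = line.rstrip("\n").rsplit("\t", 1)
    return text, int(label)

POLARITY = {1: "positive", 0: "negative"}

samples = [
    "房间太小。其他的都一般。\t0",
    "样子也很美观，做工也相当不错\t1",
]
parsed = [parse_labeled_line(s) for s in samples]
for text, label in parsed:
    print(POLARITY[label], "|", text)
```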


<br></br>
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/4aae00a800ae4831b6811b669f7461d8482344b183454d8fb7d37c83defb9567" width="550" ></center>
<br><center>The sentence-level sentiment analysis task</center></br>


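#### Common Datasets

ChnSentiCorp is a commonly used public dataset for Chinese sentiment analysis, with two classes. PaddleNLP ships it as a built-in dataset that can be loaded in one line. The cells below, however, download the NLPCC14-SC dataset and load it from local files.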



In [None]:
!pwd
!wget https://dataset-bj.cdn.bcebos.com/qianyan/NLPCC14-SC.zip -O ~/data/NLPCC14-SC.zip
!unzip -oq /home/aistudio/data/NLPCC14-SC.zip -d ~/data/


/home/aistudio
--2021-06-18 10:26:25--  https://dataset-bj.cdn.bcebos.com/qianyan/NLPCC14-SC.zip
Resolving dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)... 182.61.128.166
Connecting to dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)|182.61.128.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1262929 (1.2M) [application/zip]
Saving to: ‘/home/aistudio/data/NLPCC14-SC.zip’


2021-06-18 10:26:25 (37.3 MB/s) - ‘/home/aistudio/data/NLPCC14-SC.zip’ saved [1262929/1262929]



In [None]:

# from paddlenlp.datasets import load_dataset
from paddlenlp.datasets import MapDataset
# train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

# train_ds, dev_ds, test_ds = load_dataset("ChnSentiCorp", splits=["train", "dev", "test"])

def load_dataset(datafiles):
    """Read one or more TSV files into MapDataset objects."""
    def read(data_path):
        with open(data_path, 'r', encoding='utf-8') as fp:
            cnt = 0
            # The first line is the header; it tells us which columns exist.
            title_list = fp.readline().strip('\n').split('\t')
            for line in fp:
                if 'qid' not in title_list:
                    # No qid column (label, text): use a running counter as qid.
                    qid = cnt
                    cnt = cnt + 1
                    labels, texts = line.strip('\n').split('\t')
                elif 'label' not in title_list:
                    # No label column (qid, text): leave the label empty.
                    labels = ''
                    qid, texts = line.strip('\n').split('\t')
                else:
                    qid, labels, texts = line.strip('\n').split('\t')
                yield {'qid': qid, 'label': labels, 'text': texts}

    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, (list, tuple)):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

# Create dataset, tokenizer and dataloader.
train_ds,  test_ds = load_dataset(datafiles=(
        './data/NLPCC14-SC/train.tsv', './data/NLPCC14-SC/test.tsv'))
# train_ds = load_dataset(datafiles=('./data/ChnSentiCorp/dev.tsv'))

print(train_ds[0])
print(train_ds[1])
print(train_ds[2])

print(test_ds[0])
print(test_ds[1])
print(test_ds[2])

{'qid': 0, 'label': '1', 'text': '请问这机不是有个遥控器的吗？'}
{'qid': 1, 'label': '1', 'text': '发短信特别不方便！背后的屏幕很大用起来不舒服，是手触屏的！切换屏幕很麻烦！'}
{'qid': 2, 'label': '1', 'text': '手感超好，而且黑色相比白色在转得时候不容易眼花，找童年的记忆啦。'}
{'qid': '0', 'label': '', 'text': '我终于找到同道中人啦～～～～从初中开始，我就已经喜欢上了michaeljackson.但同学们都用鄙夷的眼光看我，他们人为jackson的样子古怪甚至说＂丑＂．我当场气晕．但现在有同道中人了，我好开心！！！michaeljacksonisthemostsuccessfulsingerintheworld!!~~~'}
{'qid': '1', 'label': '', 'text': '看完已是深夜两点，我却坐在电脑前情难自禁，这是最好的结局。惟有如此，就让那前世今生的纠结就停留在此刻。再相逢时，愿他的人生不再让人唏嘘，他们的身心也会只居一处。可是还是痛心为这样的人，这样的爱……'}
{'qid': '2', 'label': '', 'text': '袁阔成先生是当今评书界的泰斗，十二金钱镖是他的代表作之一'}



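### Loading the SKEP Model

PaddleNLP already implements the SKEP pre-trained model, which can be loaded with a single line of code.

The sentence-level sentiment analysis model is `SkepForSequenceClassification`, the model commonly used to fine-tune SKEP for text classification. It first extracts the semantic features of the sentence with SKEP and then classifies those features.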


![](https://ai-studio-static-online.cdn.bcebos.com/fc21e1201154451a80f32e0daa5fa84386c1b12e4b3244e387ae0b177c1dc963)




In [None]:
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer

# Load the model in one call by specifying its name
model = SkepForSequenceClassification.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch", num_classes=2)
# Likewise, load the matching tokenizer by name; it handles text processing such as splitting into tokens and converting tokens to ids.
tokenizer = SkepTokenizer.from_pretrained(pretrained_model_name_or_path="skep_ernie_1.0_large_ch")

[2021-06-18 12:06:39,390] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-06-18 12:06:49,213] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt


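`SkepForSequenceClassification` can be used for both sentence-level and aspect-level sentiment analysis. It obtains a representation of the input text from the pre-trained SKEP model and then classifies that representation.

* `pretrained_model_name_or_path`: the model name. Supported values are "skep_ernie_1.0_large_ch" and "skep_ernie_2.0_large_en".
    - "skep_ernie_1.0_large_ch": a Chinese pre-trained model obtained by continuing SKEP pre-training on large-scale Chinese data, starting from the pre-trained ernie_1.0_large_ch;
    - "skep_ernie_2.0_large_en": an English pre-trained model obtained by continuing SKEP pre-training on large-scale English data, starting from the pre-trained ernie_2.0_large_en;

* `num_classes`: the number of classes in the dataset.


For details of the SKEP model implementation, see: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/paddlenlp/transformers/skep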
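
The final classification step can be sketched without any framework: the model produces one logit per class, softmax turns the logits into probabilities, and argmax picks the class. The logits below are made-up numbers, not real model output, and the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(logits):
    """Return (class_id, probability) for the most likely class."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return class_id, probs[class_id]

# Made-up logits for num_classes=2: index 0 = negative, index 1 = positive.
example_logits = [-1.2, 2.3]
class_id, prob = predict_class(example_logits)
print(class_id, round(prob, 3))
```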
    

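### Data Processing

As before, the raw data (here, the NLPCC14-SC data loaded above) must be converted into a format the model can read.

SKEP processes Chinese text at character granularity. PaddleNLP's built-in `SkepTokenizer` does this in a single call.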
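
Character-level processing means that each Chinese character becomes one token. A rough sketch of what a tokenizer of this kind produces for a single sentence, assuming a toy vocabulary (the ids and the helper `encode_single` are hypothetical; real ids come from the SKEP vocabulary file, and the real tokenizer handles non-Chinese text differently):

```python
def encode_single(text, vocab, max_seq_len=8, unk_id=1):
    """Build `[CLS] X [SEP]` at character granularity with truncation."""
    # Reserve two positions for [CLS] and [SEP].
    chars = list(text)[: max_seq_len - 2]
    tokens = ["[CLS]"] + chars + ["[SEP]"]
    input_ids = [vocab.get(tok, unk_id) for tok in tokens]
    # A single sentence belongs entirely to segment 0.
    token_type_ids = [0] * len(tokens)
    return {"tokens": tokens, "input_ids": input_ids, "token_type_ids": token_type_ids}

# Toy vocabulary for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "房": 4, "间": 5, "太": 6, "小": 7}
encoded = encode_single("房间太小。", toy_vocab)
print(encoded["tokens"])
print(encoded["input_ids"])
```

Characters missing from the vocabulary (the full stop here) map to the unknown id, and texts longer than `max_seq_len - 2` characters are cut so that the two special tokens always fit.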

In [None]:
import os
from functools import partial


import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad

from utils import create_dataloader

def convert_example(example,
                    tokenizer,
                    max_seq_length=256,
                    is_test=False):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).


    Args:
        example(obj:`dict`): One input example, containing `text` and `qid`, plus `label` if it has one.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated; shorter sequences are padded later, per batch, by `batchify_fn`.
        is_test(obj:`bool`, defaults to `False`): Whether the example is test data without a label.

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`int`, optional): The input label if not is_test.
    """
    # Convert the raw example into the format the model reads; encoded_inputs is a dict with fields such as input_ids and token_type_ids
    encoded_inputs = tokenizer(
        text=example["text"], max_seq_len=max_seq_length)

    # input_ids: the vocabulary ids of the tokens the text was split into
    input_ids = encoded_inputs["input_ids"]
    # token_type_ids: whether each token belongs to sentence 1 or sentence 2 (the segment ids)
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        # label: the sentiment polarity class
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        # qid: the id of each example
        qid = np.array([example["qid"]], dtype="int64")
        return input_ids, token_type_ids, qid

In [None]:
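# Batch size
batch_size = 24
# Maximum text sequence length
max_seq_length = 256

# Convert the data into the format the model can read
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)

# Assemble the data into batches:
# pad text sequences of different lengths to the longest length in the batch,
# and stack the labels of the examples together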
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack()  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)
# dev_data_loader = create_dataloader(
#     dev_ds,
#     mode='dev',
#     batch_size=batch_size,
#     batchify_fn=batchify_fn,
#     trans_fn=trans_func)
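
What `Pad` and `Stack` do inside `batchify_fn` can be sketched in plain Python: variable-length id lists are right-padded to the longest sequence in the batch, while labels are simply collected together. This illustrates the idea only and is not PaddleNLP's implementation; `pad_batch` and `collate` are hypothetical names.

```python
def pad_batch(sequences, pad_val=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_val] * (max_len - len(seq)) for seq in sequences]

def collate(samples, pad_token_id=0, pad_token_type_id=0):
    """Turn a list of (input_ids, token_type_ids, label) into three batch fields."""
    input_ids, token_type_ids, labels = zip(*samples)
    return (
        pad_batch(list(input_ids), pad_token_id),
        pad_batch(list(token_type_ids), pad_token_type_id),
        [list(label) for label in labels],  # stacked labels, shape [batch, 1]
    )

samples = [
    ([2, 4, 5, 3], [0, 0, 0, 0], [1]),
    ([2, 6, 3], [0, 0, 0], [0]),
]
batch_ids, batch_types, batch_labels = collate(samples)
print(batch_ids)
print(batch_labels)
```

Padding to the longest sequence in each batch, rather than to a fixed `max_seq_length`, keeps short batches small.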

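### Model Training and Evaluation


Once the loss function, optimizer, and evaluation metric are defined, training can begin.


**Recommended hyperparameters:**

* `max_seq_length=256`
* `batch_size=48`
* `learning_rate=2e-5`
* `epochs=10`

In practice, adjust `batch_size` and `max_seq_length` to fit the available GPU memory; the cell above, for example, uses `batch_size=24`.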
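
Batch size and epoch count together determine how many optimizer steps a run takes, which is what `len(train_data_loader) * epochs` computes in the next cell. A small sketch of that arithmetic follows. The dataset size of 10,000 is a made-up number, and whether the last partial batch is kept depends on how the data loader is configured.

```python
import math

def training_steps(num_examples, batch_size, epochs, drop_last=False):
    """Number of optimizer steps for a full training run."""
    if drop_last:
        steps_per_epoch = num_examples // batch_size
    else:
        steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Made-up dataset size, for illustration only.
for bs in (24, 48):
    print(bs, training_steps(num_examples=10_000, batch_size=bs, epochs=10))
```

Halving the batch size roughly doubles the number of steps, so checkpoint and logging intervals expressed in steps may need adjusting as well.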



In [None]:
import time

from utils import evaluate

# Number of training epochs
epochs = 10
# Directory where model checkpoints are saved during training
ckpt_dir = "skep_ckpt_NLPCC14-SC"
# len(train_data_loader) is the number of steps in one epoch
num_training_steps = len(train_data_loader) * epochs

# AdamW optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=2e-5,
    parameters=model.parameters())
# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()
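
`paddle.metric.Accuracy` accumulates over every batch it has seen until it is reset, so the `accu` value printed during training is a running average rather than a per-batch score. A framework-free sketch of that behaviour (`RunningAccuracy` is a hypothetical stand-in, not Paddle's class):

```python
class RunningAccuracy:
    """Accumulates correct/total counts across batches."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, labels):
        self.correct += sum(int(p == y) for p, y in zip(predictions, labels))
        self.total += len(labels)

    def accumulate(self):
        return self.correct / self.total if self.total else 0.0

metric_sketch = RunningAccuracy()
metric_sketch.update([1, 0, 1, 1], [1, 0, 0, 1])   # 3 of 4 correct
first = metric_sketch.accumulate()
metric_sketch.update([0, 0, 1, 1], [1, 1, 1, 1])   # 2 of 4 correct
second = metric_sketch.accumulate()                # 5 of 8 overall
print(first, second)
```

Calling `reset()` at the start of each epoch, or before evaluation, gives a score for just that stretch of data.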

In [None]:
# Start training
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed the batch to the model
        logits = model(input_ids, token_type_ids)
        # Compute the loss
        loss = criterion(logits, labels)
        # Predicted class probabilities
        probs = F.softmax(logits, axis=1)
        # Compute accuracy
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        
        # Backpropagate the gradients and update the parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 1000 == 0:
            save_dir = os.path.join(ckpt_dir, "model_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Evaluate the current model (disabled here because no dev set is loaded)
            # evaluate(model, criterion, metric, dev_data_loader)
            # Save the current model parameters
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary and related files
            tokenizer.save_pretrained(save_dir)
save_dir = os.path.join(ckpt_dir, "model_final")
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
model.save_pretrained(save_dir)
# Save the tokenizer vocabulary and related files
tokenizer.save_pretrained(save_dir)

global step 10, epoch: 1, batch: 10, loss: 0.29067, accu: 0.85659, speed: 1.06 step/s
global step 20, epoch: 1, batch: 20, loss: 0.09907, accu: 0.85746, speed: 1.02 step/s
global step 30, epoch: 1, batch: 30, loss: 0.10869, accu: 0.85832, speed: 1.12 step/s
global step 40, epoch: 1, batch: 40, loss: 0.07327, accu: 0.85916, speed: 1.09 step/s
global step 50, epoch: 1, batch: 50, loss: 0.20912, accu: 0.86003, speed: 1.38 step/s
global step 60, epoch: 1, batch: 60, loss: 0.04829, accu: 0.86112, speed: 0.95 step/s
global step 70, epoch: 1, batch: 70, loss: 0.02235, accu: 0.86187, speed: 1.15 step/s
global step 80, epoch: 1, batch: 80, loss: 0.07145, accu: 0.86245, speed: 1.08 step/s
global step 90, epoch: 1, batch: 90, loss: 0.12644, accu: 0.86326, speed: 1.03 step/s
global step 100, epoch: 1, batch: 100, loss: 0.11074, accu: 0.86386, speed: 1.11 step/s
global step 110, epoch: 1, batch: 110, loss: 0.21706, accu: 0.86456, speed: 1.09 step/s
global step 120, epoch: 1, batch: 120, loss: 0.023

### Predicting and Submitting Results


The trained model can also be used to predict the sentiment of new text.


In [None]:
import numpy as np
import paddle

# Process the test set
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # segment
    Stack() # qid
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

In [None]:
# Change the parameter path to match your own run
params_path = 'skep_ckpt_NLPCC14-SC/model_final/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load the model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)

Loaded parameters from skep_ckpt_NLPCC14-SC/model_final/model_state.pdparams


In [None]:
label_map = {0: '0', 1: '1'}
results = []
# Switch the model to evaluation mode, which disables dropout and other sources of randomness
model.eval()
for batch in test_data_loader:
    input_ids, token_type_ids, qids = batch
    # Feed the batch to the model
    logits = model(input_ids, token_type_ids)
    # Predict the class
    probs = F.softmax(logits, axis=-1)
    idx = paddle.argmax(probs, axis=1).numpy()
    idx = idx.tolist()
    labels = [label_map[i] for i in idx]
    qids = qids.numpy().tolist()
    results.extend(zip(qids, labels))
print(results[2])

([2], '0')
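
Each `qid` in `results` is a one-element list because `convert_example` wraps it as `np.array([qid])` and `Stack()` keeps that extra dimension. A small sketch of unwrapping such pairs before writing them out; the values below are made up:

```python
# Made-up predictions in the same shape as `results` above: ([qid], label).
raw_results = [([0], "1"), ([1], "0"), ([2], "0")]

# Unwrap the single-element qid lists.
flat_results = [(qid[0], label) for qid, label in raw_results]

# Build the tab-separated rows of a submission file, header first.
rows = ["index\tprediction"] + [f"{qid}\t{label}" for qid, label in flat_results]
print("\n".join(rows))
```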


In [None]:
res_dir = "./results"
if not os.path.exists(res_dir):
    os.makedirs(res_dir)
# Write the predictions to a file
with open(os.path.join(res_dir, "NLPCC14-SC.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for qid, label in results:
        f.write(str(qid[0])+"\t"+label+"\n")

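### Part C.2 Aspect-level Sentiment Analysis

In e-commerce product analysis, besides analyzing the overall sentiment polarity toward a product, the analysis is often refined to specific "aspects" of the product (aspect-level). For example:

* 这个薯片口味有点咸，太辣了，不过口感很脆。 (These chips taste a bit salty and are too spicy, but the texture is very crisp.)

The **taste** of the chips receives a negative evaluation (salty, too spicy), whereas the **texture** receives a positive one (very crisp).

* 我很喜欢夏威夷，就是这边的海鲜太贵了。 (I really like Hawaii, but the seafood here is too expensive.)

**Hawaii** receives a positive evaluation (like), whereas **the seafood in Hawaii** receives a negative one (too expensive).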
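
Seen as data, one review yields several labeled (aspect, review) examples. A minimal sketch using the two reviews above; the English aspect names and the dict layout are illustrative, with `1` for positive and `0` for negative as in the sentence-level data:

```python
reviews = {
    "这个薯片口味有点咸，太辣了，不过口感很脆。": {"taste": 0, "texture": 1},
    "我很喜欢夏威夷，就是这边的海鲜太贵了。": {"Hawaii": 1, "seafood in Hawaii": 0},
}

def to_aspect_examples(reviews):
    """Expand each review into one example per aspect."""
    examples = []
    for text, aspects in reviews.items():
        for aspect, label in aspects.items():
            examples.append({"text": aspect, "text_pair": text, "label": label})
    return examples

aspect_examples = to_aspect_examples(reviews)
for ex in aspect_examples:
    print(ex["label"], ex["text"], "|", ex["text_pair"])
```

The same review text appears in several examples, and only the aspect changes, which is why the model needs the aspect as part of its input.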



<br></br>
<center><img src="https://ai-studio-static-online.cdn.bcebos.com/052d46409ba3451693a718552b968d188fa4677235bc43ddbc15fe11ad3b57b1" width="600" ></center>
<br><center>The aspect-level sentiment analysis task</center></br>

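#### Common Datasets

The [Qianyan datasets](https://www.luge.ai/) provide commonly used datasets for many tasks.
The sentiment analysis datasets can be downloaded from: https://aistudio.baidu.com/aistudio/competition/detail/50/?isFromLUGE=TRUE

SE-ABSA16_PHNS is an aspect-level sentiment analysis dataset about phones, and PaddleNLP ships it as a built-in dataset. The cells below instead download the camera-domain SE-ABSA16_CAME dataset and load it from local files: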


In [None]:
!pwd
!wget https://dataset-bj.cdn.bcebos.com/qianyan/SE-ABSA16_CAME.zip -O ~/data/SE-ABSA16_CAME.zip
!unzip -oq /home/aistudio/data/SE-ABSA16_CAME.zip -d ~/data/

/home/aistudio
--2021-06-18 17:19:25--  https://dataset-bj.cdn.bcebos.com/qianyan/SE-ABSA16_CAME.zip
Resolving dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)... 182.61.128.166
Connecting to dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)|182.61.128.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 373019 (364K) [application/zip]
Saving to: ‘/home/aistudio/data/SE-ABSA16_CAME.zip’


2021-06-18 17:19:25 (19.4 MB/s) - ‘/home/aistudio/data/SE-ABSA16_CAME.zip’ saved [373019/373019]



In [None]:
# from paddlenlp.datasets import load_dataset
# train_ds, test_ds = load_dataset("seabsa16", "phns", splits=["train", "test"])
# train_ds, test_ds = load_dataset("seabsa16", "came", splits=["train", "test"])
from paddlenlp.datasets import MapDataset
# train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

# train_ds, dev_ds, test_ds = load_dataset("ChnSentiCorp", splits=["train", "dev", "test"])

def load_dataset(datafiles):
    """Read one or more three-column TSV files into MapDataset objects."""
    def read(data_path):
        with open(data_path, 'r', encoding='utf-8') as fp:
            # Skip the header line.
            fp.readline()
            for line in fp:
                # Train and test files are parsed identically: the first column
                # is stored under 'label', followed by the aspect and the review.
                label, text, text_pair = line.strip('\n').split('\t')
                yield {'text': text, 'text_pair': text_pair, 'label': label}

    if isinstance(datafiles, str):
        return MapDataset(list(read(datafiles)))
    elif isinstance(datafiles, (list, tuple)):
        return [MapDataset(list(read(datafile))) for datafile in datafiles]

# Create dataset, tokenizer and dataloader.
train_ds,  test_ds = load_dataset(datafiles=(
        './data/SE-ABSA16_CAME/train.tsv', './data/SE-ABSA16_CAME/test.tsv'))
# train_ds = load_dataset(datafiles=('./data/ChnSentiCorp/dev.tsv'))
print(train_ds[0])
print(test_ds[0])


{'text': 'camera#design_features', 'text_pair': '千呼万唤始出来，尼康的APSC小相机终于发布了，COOLPIX A. 你怎么看呢？我看，尼康是挤牙膏挤惯了啊，1，外观既没有V1时尚，也没P7100专业，反而类似P系列。2，CMOS炒冷饭。3，OVF没有任何提示和显示。（除了框框)4，28MM镜头是不错，可是F2.8定焦也太小气了。5，电池坑爹，用D800和V1的电池很难吗？6，考虑到1100美元的定价，富士X100S表示很欢乐。***好处是，可以确定，尼康会继续大力发展1系列了***另外体积比X100S小也算是A的优势吧***。等2014年年中跌倒1900左右的时候就可以入手了。', 'label': '0'}
{'text': 'camera#quality', 'text_pair': '一直潜水，昨天入d300s +35 1.8g，谈谈感受，dx说，标题一定要长！在我们这尼康一个代理商开的大型体验中心提的货，老板和销售mm都很热情，不欺诈，也没有店大欺客，mm很热情，从d300s到d800，d7000，到d3x配各种镜头，全部把玩了一番，感叹啊，真他妈好东西！尤其d3x，有钱了，一定要他妈买一个，还有，就是d800，一摸心中的神机，顿时凉了半截，可能摸她之前，摸了她们的头牌，d3x的缘故，这手感 真是差了点，样子嘛，之所以喜欢尼康，就是喜欢棱角分明的感觉，d3x方方正正 ，甚是讨喜，d800这丫头，变得圆滑了不少，不喜欢。都说电子产品，买新不买旧，我倒不认为这么看，中低端产品的确如此，但顶级的高端产品，真不是这么回事啊，d3x也是51点对焦，我的d300s也是51点，但明显感觉，对焦就是比d300s 快，准，暗部反差较小时，也很少拉风箱，我的d300s就不行，光线不好反差较小，拉回来拉过去，半天合不上焦，说真的，一分价钱一分货啊，d800电子性能 肯定是先进的，但机械性能 跟d3x还是没可比性，传感器固然先进，但三千多万 像素 和两千多万像素 对我们来说，真的差别这么大吗？d800e3万多，有这钱真的不如加点买 d3x啊，真要是d3x烂，为什么尼康不停产了？人说高像素 是给商业摄影师用，我们的音乐老师，是业余的音乐制作人，也拍摄一些商业广告，平时他玩的时候 都是数码什么的，nc 加起来十几个，大三元全都配齐，但

In [None]:
print(train_ds.label_list)
print(train_ds[0])
print(test_ds[0])

['0', '1']
{'text': 'phone#design_features', 'text_pair': '今天有幸拿到了港版白色iPhone 5真机，试玩了一下，说说感受吧：1. 真机尺寸宽度与4/4s保持一致没有变化，长度多了大概一厘米，也就是之前所说的多了一排的图标。2. 真机重量比上一代轻了很多，个人感觉跟i9100的重量差不多。（用惯上一代的朋友可能需要一段时间适应了）3. 由于目前还没有版的SIM卡，无法插卡使用，有购买的朋友要注意了，并非简单的剪卡就可以用，而是需要去运营商更换新一代的SIM卡。4. 屏幕显示效果确实比上一代有进步，不论是从清晰度还是不同角度的视角，iPhone 5绝对要更上一层，我想这也许是相对上一代最有意义的升级了。5. 新的数据接口更小，比上一代更好用更方便，使用的过程会有这样的体会。6. 从简单的几个操作来讲速度比4s要快，这个不用测试软件也能感受出来，比如程序的调用以及照片的拍摄和浏览。不过，目前水货市场上坑爹的价格，最好大家可以再观望一下，不要急着出手。', 'label': 1}
{'text': 'software#usability', 'text_pair': '刚刚入手8600，体会。刚刚从淘宝购买，1635元（包邮）。1、全新，应该是欧版机，配件也是正品全新。2、在三星官网下载了KIES，可用免费软件非常多，绝对够用。3、不到2000元能买到此种手机，知足了。'}


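#### Loading the SKEP Model

The aspect-level model also uses `SkepForSequenceClassification`, but its input is a sentence pair rather than a single sentence. One sentence describes the aspect being evaluated, and the other is the review of that aspect, as shown in the figure below.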


![](https://ai-studio-static-online.cdn.bcebos.com/1a4b76447dae404caa3bf123ea28e375179cb09a02de4bef8a2f172edc6e3c8f)
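
The sentence-pair layout can be sketched as follows: the aspect fills segment 0 and the review fills segment 1. Tokens are single characters here, and only the review is truncated, which is a simplification; the real `SkepTokenizer` output and truncation strategy may differ, and `encode_pair` is a hypothetical helper.

```python
def encode_pair(aspect_tokens, review_tokens, max_seq_len=16):
    """Build `[CLS] A [SEP] B [SEP]` and its segment ids, truncating B if needed."""
    # Three special tokens are always present: one [CLS] and two [SEP].
    budget = max_seq_len - 3 - len(aspect_tokens)
    review_tokens = review_tokens[: max(budget, 0)]
    tokens = ["[CLS]"] + aspect_tokens + ["[SEP]"] + review_tokens + ["[SEP]"]
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP].
    token_type_ids = [0] * (len(aspect_tokens) + 2) + [1] * (len(review_tokens) + 1)
    return tokens, token_type_ids

pair_tokens, pair_types = encode_pair(["口", "味"], list("太辣了"))
print(pair_tokens)
print(pair_types)
```

The segment ids are what lets the model tell the aspect apart from the review, so they must always have the same length as the token list.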



In [None]:
from paddlenlp.transformers import SkepForSequenceClassification, SkepTokenizer
# Load the model in one call by specifying its name
model = SkepForSequenceClassification.from_pretrained(
    'skep_ernie_1.0_large_ch', num_classes=2)
# Load the tokenizer in one call by specifying the model name
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

[2021-06-18 17:30:16,517] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-06-18 17:30:26,120] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt


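### Data Processing

As before, the raw data (here, the SE-ABSA16_CAME data loaded above) must be converted into a format the model can read.

SKEP processes Chinese text at character granularity. PaddleNLP's built-in `SkepTokenizer` does this in a single call.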

In [None]:
from functools import partial
import os
import time

import numpy as np
import paddle
import paddle.nn.functional as F
from paddlenlp.data import Stack, Tuple, Pad


def convert_example(example,
                    tokenizer,
                    max_seq_length=512,
                    is_test=False,
                    dataset_name="chnsenticorp"):
    """
    Builds model inputs from a sequence or a pair of sequence for sequence classification tasks
    by concatenating and adding special tokens. And creates a mask from the two sequences passed 
    to be used in a sequence-pair classification task.
        
    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence has the following format:
    ::
        - single sequence: ``[CLS] X [SEP]``
        - pair of sequences: ``[CLS] A [SEP] B [SEP]``

    A skep_ernie_1.0_large_ch/skep_ernie_2.0_large_en sequence pair mask has the following format:
    ::

        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |

    If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
    
    note: There is no need token type ids for skep_roberta_large_ch model.


    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it have label.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` 
            which contains most of the methods. Users should refer to the superclass for more information regarding methods.
        max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
            Sequences longer than this will be truncated, sequences shorter will be padded.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
        dataset_name((obj:`str`, defaults to "chnsenticorp"): The dataset name, "chnsenticorp" or "sst-2".

    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        token_type_ids(obj: `list[int]`): List of sequence pair mask.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """
    encoded_inputs = tokenizer(
        text=example["text"],
        text_pair=example["text_pair"],
        max_seq_len=max_seq_length)

    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]

    if not is_test:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, label
    else:
        return input_ids, token_type_ids

In [None]:
from utils import create_dataloader
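# Maximum text sequence length to process
max_seq_length=256
# Batch size
batch_size=16

# Convert the data into the format the model can read
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length)
# Assemble the data into batches:
# pad text sequences of different lengths to the longest length in the batch,
# and stack the labels of the examples together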
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64")  # labels
): [data for data in fn(samples)]
train_data_loader = create_dataloader(
    train_ds,
    mode='train',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

### Model Training

Once the loss function, optimizer, and evaluation metric are defined, training can begin.

In [None]:
# Number of training epochs
epochs = 10
# Total number of training steps
num_training_steps = len(train_data_loader) * epochs
# Optimizer
optimizer = paddle.optimizer.AdamW(
    learning_rate=5e-5,
    parameters=model.parameters())
# Cross-entropy loss
criterion = paddle.nn.loss.CrossEntropyLoss()
# Accuracy metric
metric = paddle.metric.Accuracy()

In [None]:
# Start training
ckpt_dir = "skep_aspect"
global_step = 0
tic_train = time.time()
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        input_ids, token_type_ids, labels = batch
        # Feed the batch to the model
        logits = model(input_ids, token_type_ids)
        # Compute the loss
        loss = criterion(logits, labels)
        # Predicted class probabilities
        probs = F.softmax(logits, axis=1)
        # Compute accuracy
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()

        global_step += 1
        if global_step % 10 == 0:
            print(
                "global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f, speed: %.2f step/s"
                % (global_step, epoch, step, loss, acc,
                    10 / (time.time() - tic_train)))
            tic_train = time.time()
        
        # Backpropagate the gradients and update the parameters
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()

        if global_step % 500 == 0:
            
            save_dir = os.path.join(ckpt_dir, "modelCAME_%d" % global_step)
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
            # Save the model parameters
            model.save_pretrained(save_dir)
            # Save the tokenizer vocabulary and related files
            tokenizer.save_pretrained(save_dir)
save_dir = os.path.join(ckpt_dir, "modelCAME_final")
if not os.path.exists(save_dir):
    os.makedirs(save_dir)
# Save the model parameters
model.save_pretrained(save_dir)
# Save the tokenizer vocabulary and related files
tokenizer.save_pretrained(save_dir)

global step 10, epoch: 1, batch: 10, loss: 0.81712, acc: 0.60625, speed: 1.24 step/s
global step 20, epoch: 1, batch: 20, loss: 0.67679, acc: 0.59062, speed: 1.26 step/s
global step 30, epoch: 1, batch: 30, loss: 0.67554, acc: 0.60417, speed: 1.27 step/s
global step 40, epoch: 1, batch: 40, loss: 0.65711, acc: 0.59844, speed: 1.28 step/s
global step 50, epoch: 1, batch: 50, loss: 0.75251, acc: 0.59750, speed: 1.25 step/s
global step 60, epoch: 1, batch: 60, loss: 0.57564, acc: 0.61146, speed: 1.24 step/s
global step 70, epoch: 1, batch: 70, loss: 0.57873, acc: 0.61786, speed: 1.25 step/s
global step 80, epoch: 1, batch: 80, loss: 0.72466, acc: 0.62969, speed: 1.30 step/s
global step 90, epoch: 2, batch: 7, loss: 0.49440, acc: 0.63681, speed: 1.30 step/s
global step 100, epoch: 2, batch: 17, loss: 0.50333, acc: 0.64317, speed: 1.26 step/s
global step 110, epoch: 2, batch: 27, loss: 0.62377, acc: 0.64723, speed: 1.25 step/s
global step 120, epoch: 2, batch: 37, loss: 0.57376, acc: 0.6532

### Predicting and Submitting Results

The trained model can also be used to predict the sentiment toward an evaluated aspect.

In [None]:
@paddle.no_grad()
def predict(model, data_loader, label_map):
    """
    Given a prediction dataset, it gives the prediction results.

    Args:
        model(obj:`paddle.nn.Layer`): A model to classify texts.
        data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches.
        label_map(obj:`dict`): The label id (key) to label str (value) map.
    """
    model.eval()
    results = []
    for batch in data_loader:
        input_ids, token_type_ids = batch
        logits = model(input_ids, token_type_ids)
        probs = F.softmax(logits, axis=1)
        idx = paddle.argmax(probs, axis=1).numpy()
        idx = idx.tolist()
        labels = [label_map[i] for i in idx]
        results.extend(labels)
    return results

In [None]:
# Process the test set
label_map = {0: '0', 1: '1'}
trans_func = partial(
    convert_example,
    tokenizer=tokenizer,
    max_seq_length=max_seq_length,
    is_test=True)
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),  # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
): [data for data in fn(samples)]
test_data_loader = create_dataloader(
    test_ds,
    mode='test',
    batch_size=batch_size,
    batchify_fn=batchify_fn,
    trans_fn=trans_func)

In [None]:
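# Change the parameter path to match your own run.
# The training cell above saves to ckpt_dir = "skep_aspect", so the path points there.
params_path = 'skep_aspect/modelCAME_final/model_state.pdparams'
if params_path and os.path.isfile(params_path):
    # Load the model parameters
    state_dict = paddle.load(params_path)
    model.set_dict(state_dict)
    print("Loaded parameters from %s" % params_path)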

results = predict(model, test_data_loader, label_map)

In [None]:
# Write the predictions to a file
with open(os.path.join("results", "SE-ABSA16_CAME.tsv"), 'w', encoding="utf8") as f:
    f.write("index\tprediction\n")
    for idx, label in enumerate(results):
        f.write(str(idx)+"\t"+label+"\n")

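Compress the prediction files into a zip file and submit it to the [Qianyan competition site](https://aistudio.baidu.com/aistudio/competition/detail/50/?isFromLUGE=TRUE).

**NOTE:** The files NLPCC14-SC.tsv, SE-ABSA16_CAME.tsv, COTE_BD.tsv, COTE_MFW.tsv, COTE_DP.tsv, and so on in the results folder were added to fill out the submission so that it is accepted.
Their scores still have room for improvement.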

In [None]:
# Compress the prediction files into a zip file for submission
!zip -r results.zip results

  adding: results/ (stored 0%)
  adding: results/SE-ABSA16_PHNS.tsv (deflated 65%)
  adding: results/ChnSentiCorp.tsv (deflated 63%)


## Part D. Opinion Target Extraction
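
Opinion target extraction is treated here as token classification: every token is tagged `B` (beginning of the target), `I` (inside the target), or `O` (outside it), matching the `label_list` defined in the cells below. A character-level sketch of building those tags, assuming the target string appears verbatim in the text (`bio_tags` is a hypothetical helper, and the sample sentence is made up):

```python
def bio_tags(text, target):
    """Tag each character of `text` as B/I/O with respect to `target`."""
    tags = ["O"] * len(text)
    start = text.find(target)
    while target and start != -1:
        tags[start] = "B"
        for i in range(start + 1, start + len(target)):
            tags[i] = "I"
        # Continue searching after this occurrence.
        start = text.find(target, start + len(target))
    return tags

text = "重庆老灶火锅还是很赞的"
tags = bio_tags(text, "重庆老灶火锅")
print(list(zip(text, tags)))
```

Every occurrence of the target is tagged, which mirrors how the dataset class below splits the text on the target string and re-inserts it between the pieces.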

In [1]:
import paddlenlp
from paddlenlp.transformers import SkepForTokenClassification, SkepTokenizer

In [None]:
# Download the data
!wget https://dataset-bj.cdn.bcebos.com/qianyan/COTE-BD.zip -O ~/data/COTE-BD.zip
!wget https://dataset-bj.cdn.bcebos.com/qianyan/COTE-MFW.zip -O ~/data/COTE-MFW.zip
!wget https://dataset-bj.cdn.bcebos.com/qianyan/COTE-DP.zip -O ~/data/COTE-DP.zip


# Unzip the data
!unzip -o data/COTE-BD
!unzip -o data/COTE-DP
!unzip -o data/COTE-MFW

--2021-06-23 11:28:01--  https://dataset-bj.cdn.bcebos.com/qianyan/COTE-BD.zip
Resolving dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)... 182.61.128.166
Connecting to dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)|182.61.128.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1182741 (1.1M) [application/zip]
Saving to: ‘/home/aistudio/data/COTE-BD.zip’


2021-06-23 11:28:01 (39.9 MB/s) - ‘/home/aistudio/data/COTE-BD.zip’ saved [1182741/1182741]

--2021-06-23 11:28:01--  https://dataset-bj.cdn.bcebos.com/qianyan/COTE-MFW.zip
Resolving dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)... 182.61.128.166
Connecting to dataset-bj.cdn.bcebos.com (dataset-bj.cdn.bcebos.com)|182.61.128.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4872264 (4.6M) [application/zip]
Saving to: ‘/home/aistudio/data/COTE-MFW.zip’


2021-06-23 11:28:02 (45.1 MB/s) - ‘/home/aistudio/data/COTE-MFW.zip’ saved [4872264/4872264]

--2021-06-23 11

In [2]:
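# Dataset dictionary
def open_func(file_path):
    # Skip the header line and keep only rows with at least two tab-separated fields.
    with open(file_path, 'r', encoding='utf8') as f:
        return [line.strip() for line in f.readlines()[1:] if len(line.strip().split('\t')) >= 2]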

data_dict = {'cotebd': {'test': open_func('COTE-BD/test.tsv'),
                        'train': open_func('COTE-BD/train.tsv')},
             'cotedp': {'test': open_func('COTE-DP/test.tsv'),
                        'train': open_func('COTE-DP/train.tsv')},
             'cotemfw': {'test': open_func('COTE-MFW/test.tsv'),
                        'train': open_func('COTE-MFW/train.tsv')}}

In [3]:
# Define the dataset
from paddle.io import Dataset, DataLoader
from paddlenlp.data import Pad, Stack, Tuple
import numpy as np
label_list = {'B': 0, 'I': 1, 'O': 2}
index2label = {0: 'B', 1: 'I', 2: 'O'}

# Note: token_type_ids are not used here; only input_ids are returned
class MyDataset(Dataset):
    def __init__(self, data, tokenizer, max_len=512, for_test=False):
        super().__init__()
        self._data = data
        self._tokenizer = tokenizer
        self._max_len = max_len
        self._for_test = for_test
    
    def __len__(self):
        return len(self._data)
    
    def __getitem__(self, idx):
        samples = self._data[idx].split('\t')
        label = samples[-2]
        text = samples[-1]
        if self._for_test:
            origin_enc = self._tokenizer.encode(text, max_seq_len=self._max_len)['input_ids']
            return np.array(origin_enc, dtype='int64')
        else:
            # Characters and tokens do not map one-to-one, so use a simple workaround:
            # encode the label (the target span) on its own, encode the text pieces
            # around it, then stitch everything back together.
            texts = text.split(label)
            label_enc = self._tokenizer.encode(label)['input_ids']
            cls_enc = label_enc[0]
            sep_enc = label_enc[-1]
            label_enc = label_enc[1:-1]

            # Merge the pieces: the target span gets B/I tags, everything else gets O
            origin_enc = []
            label_ids = []
            for index, piece in enumerate(texts):
                piece_enc = self._tokenizer.encode(piece)['input_ids']
                piece_enc = piece_enc[1:-1]
                origin_enc += piece_enc
                label_ids += [label_list['O']] * len(piece_enc)
                if index != len(texts) - 1:
                    origin_enc += label_enc
                    label_ids += [label_list['B']] + [label_list['I']] * (len(label_enc) - 1)

            origin_enc = [cls_enc] + origin_enc + [sep_enc]
            label_ids = [label_list['O']] + label_ids + [label_list['O']]

            # Truncate: keep the first max_len - 1 tokens plus the final [SEP]
            if len(origin_enc) > self._max_len:
                origin_enc = origin_enc[:self._max_len-1] + origin_enc[-1:]
                label_ids = label_ids[:self._max_len-1] + label_ids[-1:]
            return np.array(origin_enc, dtype='int64'), np.array(label_ids, dtype='int64')


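# Note: both collate functions read the global `tokenizer`, so it must be defined before a loader is iterated.
def batchify_fn(for_test=False):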
    if for_test:
        return lambda samples, fn=Pad(axis=0, pad_val=tokenizer.pad_token_id): np.row_stack([data for data in fn(samples)])
    else:
        return lambda samples, fn=Tuple(Pad(axis=0, pad_val=tokenizer.pad_token_id),
                                        Pad(axis=0, pad_val=label_list['O'])): [data for data in fn(samples)]


def get_data_loader(data, tokenizer, batch_size=32, max_len=512, for_test=False):
    dataset = MyDataset(data, tokenizer, max_len, for_test)
    # Shuffle training data only; test order must stay fixed so predictions line up with sample ids
    shuffle = not for_test
    data_loader = DataLoader(dataset=dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=batchify_fn(for_test))
    return data_loader
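
The labelling strategy in the dataset class (split the text on the target span, tag the target with B/I and the rest with O, then truncate while keeping the final token) can be sketched without PaddleNLP. Everything here is an illustration: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for the real one, and the sentence is made up.

```python
LABELS = {'B': 0, 'I': 1, 'O': 2}


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one token per whitespace-separated word."""
    return text.split()


def bio_tag(text, target):
    """Tag every occurrence of `target` in `text` with B/I and all other tokens with O."""
    target_tokens = toy_tokenize(target)
    pieces = text.split(target)
    tokens, labels = [], []
    for i, piece in enumerate(pieces):
        piece_tokens = toy_tokenize(piece)
        tokens += piece_tokens
        labels += [LABELS['O']] * len(piece_tokens)
        if i != len(pieces) - 1:  # a target occurrence sat between this piece and the next
            tokens += target_tokens
            labels += [LABELS['B']] + [LABELS['I']] * (len(target_tokens) - 1)
    # Add the special tokens, both labelled O.
    tokens = ['[CLS]'] + tokens + ['[SEP]']
    labels = [LABELS['O']] + labels + [LABELS['O']]
    return tokens, labels


def truncate(seq, max_len):
    """Keep the first max_len - 1 items plus the final item (the [SEP] position)."""
    if len(seq) > max_len:
        return seq[:max_len - 1] + seq[-1:]
    return seq


tokens, labels = bio_tag('the old town hotpot is great', 'old town hotpot')
short_tokens = truncate(tokens, 5)
short_labels = truncate(labels, 5)
print(list(zip(tokens, labels)))
print(list(zip(short_tokens, short_labels)))
```

Because tokens and labels are built side by side and truncated with the same rule, the two sequences always stay the same length, which is what the padding step relies on.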

In [8]:
import paddle
from paddle.static import InputSpec
from paddlenlp.metrics import Perplexity

# Model and tokenizer
model = SkepForTokenClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=3)
tokenizer = SkepTokenizer.from_pretrained('skep_ernie_1.0_large_ch')

# Settings
data_path_name = 'cotedp'  # change this option to switch datasets
# data_path_name = 'cotebd'
# data_path_name = 'cotemfw'
## Training
epochs = 2
learning_rate = 2e-5
batch_size = 8
max_len = 512

## Data
train_dataloader = get_data_loader(data_dict[data_path_name]['train'], tokenizer, batch_size, max_len, for_test=False)

input = InputSpec((-1, -1), dtype='int64', name='input')
label = InputSpec((-1, -1), dtype='int64', name='label')  # one class id per token: (batch, seq_len)
model = paddle.Model(model, [input], [label])

# Prepare the model

optimizer = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())
model.prepare(optimizer, loss=paddle.nn.CrossEntropyLoss(), metrics=[Perplexity()])

[2021-06-23 13:29:43,003] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
[2021-06-23 13:29:47,936] [    INFO] - Found /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.vocab.txt


In [9]:
# Start training
model.fit(train_dataloader, batch_size=batch_size, epochs=epochs, save_freq=5, save_dir='./checkpoints', log_freq=200)

The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/2
step  200/3158 - loss: 0.0156 - Perplexity: 1.0533 - 308ms/step
step  400/3158 - loss: 0.0082 - Perplexity: 1.0349 - 306ms/step
step  600/3158 - loss: 0.0099 - Perplexity: 1.0277 - 310ms/step
step  800/3158 - loss: 0.0179 - Perplexity: 1.0242 - 311ms/step
step 1000/3158 - loss: 0.0054 - Perplexity: 1.0215 - 313ms/step
step 1200/3158 - loss: 0.0069 - Perplexity: 1.0199 - 316ms/step
step 1400/3158 - loss: 0.0011 - Perplexity: 1.0185 - 315ms/step
step 1600/3158 - loss: 0.0104 - Perplexity: 1.0174 - 315ms/step
step 1800/3158 - loss: 0.0073 - Perplexity: 1.0167 - 315ms/step
step 2000/3158 - loss: 0.0283 - Perplexity: 1.0159 - 316ms/step
step 2200/3158 - loss: 0.0074 - Perplexity: 1.0154 - 317ms/step
step 2400/3158 - loss: 0.0205 - Perplexity: 1.0150 - 316ms/step
step 2600/3158 - loss: 0.0234 - Perplexity: 1.0145 - 316ms/step
step 2800/3158 - loss: 0.0060 - Perplexity: 1.01
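
The log reports a cross-entropy loss alongside Perplexity. As a rough reading aid, and assuming the metric follows the usual definition (the exponential of the mean per-token cross-entropy), the relationship can be sketched as follows. The helper name `perplexity_from_losses` is hypothetical and is not part of PaddleNLP.

```python
import math


def perplexity_from_losses(token_losses):
    """Perplexity as exp(mean per-token cross-entropy); assumes natural-log losses."""
    return math.exp(sum(token_losses) / len(token_losses))


# A perfect tagger (zero loss on every token) has perplexity 1.
perfect = perplexity_from_losses([0.0, 0.0, 0.0])

# A tagger that guesses uniformly over the 3 classes (B, I, O) has perplexity 3.
uniform = perplexity_from_losses([math.log(3)] * 4)

print(perfect, uniform)
```

Under that assumption, values just above 1 mean the model assigns nearly all probability mass to the correct tag for most tokens.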

In [10]:
import re
# Load the fine-tuned checkpoint
checkpoint_path = './checkpoints/final'  # path where the trained model was saved

model = SkepForTokenClassification.from_pretrained('skep_ernie_1.0_large_ch', num_classes=3)
input = InputSpec((-1, -1), dtype='int64', name='input')
model = paddle.Model(model, [input])
model.load(checkpoint_path)

# Load the test set
test_dataloader = get_data_loader(data_dict[data_path_name]['test'], tokenizer, batch_size, max_len, for_test=True)
# Run prediction and save the results

save_file = {'cotebd': './COTE-BD.tsv', 'cotedp': './COTE-DP.tsv', 'cotemfw': './COTE-MFW.tsv'}
predicts = []
input_ids = []
for batch in test_dataloader:
    predict = model.predict_batch(batch)
    predicts += predict[0].argmax(axis=-1).tolist()
    input_ids += batch.numpy().tolist()

# Find each B position (label id 0), then follow the I tags (label id 1) after it; the collected tokens form one entity.
def find_entity(prediction, input_ids):
    entity = []
    entity_ids = []
    for index, idx in enumerate(prediction):
        if idx == label_list['B']:
            entity_ids = [input_ids[index]]
        elif idx == label_list['I']:
            if entity_ids:
                entity_ids.append(input_ids[index])
        elif idx == label_list['O']:
            if entity_ids:
                entity.append(''.join(tokenizer.convert_ids_to_tokens(entity_ids)))
                entity_ids = []
    return entity

with open(save_file[data_path_name], 'w', encoding='utf8') as f:
    f.write("index\tprediction\n")
    for idx, sample in enumerate(data_dict[data_path_name]['test']):
        qid = sample.split('\t')[0]
        entity = find_entity(predicts[idx], input_ids[idx])
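        entity = list(set(entity))  # deduplicate
        entity = [re.sub('##', '', i) for i in entity]  # strip the WordPiece continuation marker left by English tokens
        entity = [i.replace('[UNK]', '') for i in entity]  # strip the literal unknown-token marker (an unescaped '[UNK]' regex would delete every U, N and K)
        f.write(qid + '\t' + '\x01'.join(entity) + '\n')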

[2021-06-23 14:06:32,786] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/skep_ernie_1.0_large_ch/skep_ernie_1.0_large_ch.pdparams
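
The decoding rule used for prediction (start a span at B, extend it through I, close it at O) can be tried on a toy prediction. This is a minimal sketch with made-up tokens and tags; `decode_spans` is a hypothetical name that works on token strings directly rather than on ids. It also shows why a literal `[UNK]` marker should be removed with an escaped pattern or `str.replace`: an unescaped `[UNK]` is a regex character class.

```python
import re

LABELS = {'B': 0, 'I': 1, 'O': 2}


def decode_spans(pred_ids, tokens):
    """Collect token runs that start at a B tag and continue through I tags; an O tag closes the run."""
    spans, current = [], []
    for tag, token in zip(pred_ids, tokens):
        if tag == LABELS['B']:
            current = [token]
        elif tag == LABELS['I']:
            if current:  # ignore an I tag that does not follow a B tag
                current.append(token)
        else:
            if current:
                spans.append(''.join(current))
                current = []
    return spans


tokens = ['[CLS]', 'sky', '##line', 'bar', 'is', 'nice', '[SEP]']
pred = [2, 0, 1, 2, 2, 2, 2]

spans = decode_spans(pred, tokens)
cleaned = [s.replace('##', '') for s in spans]

# Unescaped, '[UNK]' matches any single U, N or K character.
wrong = re.sub('[UNK]', '', 'KUNMING[UNK]')
right = re.sub(re.escape('[UNK]'), '', 'KUNMING[UNK]')

print(spans, cleaned, wrong, right)
```

Note that with this rule a span is only emitted when an O tag follows it, so a span that runs to the very end of the sequence is dropped; in practice the trailing `[SEP]` position usually supplies that closing O.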


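The implementation above is built on PaddleNLP. Maintaining open-source software takes real effort, so your support is much appreciated.

**Remember to give [PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP) a Star⭐ to keep up with the latest news and features.**

GitHub: [https://github.com/PaddlePaddle/PaddleNLP](https://github.com/PaddlePaddle/PaddleNLP)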