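# [2022 CCF BDCI] Baseline for Interpretability Evaluation of Reading Comprehension, Built on TrustAI
## 1. Project Overview
Deep learning models have achieved great success on many NLP tasks, but they are usually treated as black boxes whose internal prediction mechanisms are opaque to users. As a result, users do not trust their outputs, which makes deployment harder, especially in sensitive domains such as healthcare and law. Moreover, when a model performs poorly or lacks robustness, it is hard to improve it without understanding how it works internally.
Interpretability of deep learning models has recently attracted growing attention, but methods for evaluating it are still immature. This baseline provides evaluation data and metrics for the reading comprehension task, with the aim of evaluating model interpretability.
Baidu recently released TrustAI, a trustworthy-AI toolkit that combines trustworthiness analysis and enhancement and is designed to explore model prediction mechanisms and improve model performance. This baseline is built on TrustAI.
## 2. Running the Baseline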



### Installing Dependencies
Install the required packages.

In [1]:
!pip3 install -U paddlepaddle-gpu==2.2.2
!pip3 install -U paddlenlp==2.3.0
!pip3 install trustai==0.1.5

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting paddlepaddle-gpu==2.2.2
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/88/06/edee9889fa8aff3eda68ca34528e13b694d6110e7ad678268eb2fa05994f/paddlepaddle_gpu-2.2.2-cp37-cp37m-manylinux1_x86_64.whl (435.4 MB)
Installing collected packages: paddlepaddle-gpu
  Attempting uninstall: paddlepaddle-gpu
    Found existing installation: paddlepaddle-gpu 2.3.2.post112
    Uninstalling paddlepaddle-gpu-2.3.2.post112:
      Successfully uninstalled paddlepaddle-gpu-2.3.2.post112
Successfully installed paddlepaddle-gpu-2.2.2

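### Data Preparation
#### 1) Training data
We recommend the DuReader-robust dataset for training the Chinese reading comprehension model. PaddleNLP downloads and caches the training dataset automatically; the default cache path is "~/.paddlenlp/datasets". To use a different training dataset, change DATASET_NAME in the Initialization section.
#### 2) Downloading the pre-trained model
The baseline uses the ERNIE-3.0-base pre-trained model. PaddleNLP caches the model files automatically; the default cache path is "~/.paddlenlp/models". To use a different pre-trained model, change MODEL_NAME in the Initialization section.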

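### Initialization
Initialization covers selecting and loading the model, selecting the training dataset, setting the path where the trained model is saved, and setting the ratio of the extracted rationale length to the length of the original text. Adjust these as needed.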
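To illustrate how the rationale ratio translates into a rationale length, the sketch below mirrors the `math.ceil(num_tokens * RATIONALE_RATIO)` computation that this baseline applies when it generates the evaluation file. The helper name `rationale_length` is hypothetical and exists only for this example.

```python
import math

# Ratio used by this baseline for Chinese data (0.102 is suggested for English).
RATIONALE_RATIO = 0.096


def rationale_length(num_tokens, ratio=RATIONALE_RATIO):
    """Return how many tokens are kept as the rationale (rounded up)."""
    return math.ceil(num_tokens * ratio)


print(rationale_length(100))  # 10
print(rationale_length(300))  # 29
```

Because the result is rounded up, any non-empty context yields at least one rationale token.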

In [2]:
import sys
import json
import numpy as np
import paddle
import paddlenlp
from paddlenlp.transformers import ErnieForQuestionAnswering, ErnieTokenizer
from mrc_utils import *

# Select pre-trained model
MODEL_NAME = "ernie-3.0-base-zh" # choose from ["ernie-1.0", "ernie-1.0-base-zh", "ernie-1.0-large-zh-cw", "ernie-2.0-base-zh", "ernie-2.0-large-zh", "ernie-3.0-xbase-zh", "ernie-3.0-base-zh", "ernie-3.0-medium-zh", "ernie-3.0-mini-zh", "ernie-3.0-micro-zh", "ernie-3.0-nano-zh"]
# Select dataset for model training
DATASET_NAME = 'dureader_robust'
# Set the path to save the trained model
MODEL_SAVE_PATH = f'save_model/{DATASET_NAME}-{MODEL_NAME}/1'
# Set the rationale length ratio which determines the length of the extracted rationales.
RATIONALE_RATIO = 0.096 # 0.096 for Chinese dataset, 0.102 for English dataset

# Init model and tokenizer
model = ErnieForQuestionAnswering.from_pretrained(MODEL_NAME, num_classes=2)
tokenizer = ErnieTokenizer.from_pretrained(MODEL_NAME)

[2022-09-30 16:31:41,451] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh
[2022-09-30 16:31:41,455] [    INFO] - Downloading ernie_3.0_base_zh.pdparams from https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh.pdparams
100%|██████████| 452M/452M [00:08<00:00, 53.1MB/s] 
W0930 16:31:50.475505   210 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.2
W0930 16:31:50.479920   210 device_context.cc:465] device: 0, cuDNN Version: 8.2.
[2022-09-30 16:31:56,938] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/transformers/ernie_3.0/ernie_3.0_base_zh_vocab.txt and saved to /home/aistudio/.paddlenlp/models/ernie-3.0-base-zh
[2022-09-30 16:31:56,942] [    INFO] - Downloading ernie_3.0_base_zh_vocab.txt from https://bj.bcebos.com/paddlenlp/models/transf

### Model Training
Here we train a reading comprehension model, using ERNIE-3.0 as an example.

In [None]:
from paddlenlp.datasets import load_dataset
# Hyperparameters
batch_size = 6
max_seq_length = 512
epochs = 3
warmup_proportion = 0.1
weight_decay = 0.01
doc_stride = 512
learning_rate = 1e-5

# Load dataset
train_ds, dev_ds, test_ds = load_dataset(DATASET_NAME, splits=["train", "dev", "test"])

# Start training
training_mrc_model(model, 
                tokenizer,
                train_ds, 
                dev_ds,
                batch_size=batch_size,
                epochs=epochs,
                learning_rate=learning_rate,
                warmup_proportion=warmup_proportion,
                max_seq_length=max_seq_length,
                doc_stride=doc_stride, 
                weight_decay=weight_decay,
                save_dir=MODEL_SAVE_PATH)

100%|██████████| 20038/20038 [00:07<00:00, 2571.18it/s]


Training Starts:
global step 100, epoch: 1, batch: 100, loss: 5.97629
global step 200, epoch: 1, batch: 200, loss: 5.95978
global step 300, epoch: 1, batch: 300, loss: 5.23865
global step 400, epoch: 1, batch: 400, loss: 2.96501
global step 500, epoch: 1, batch: 500, loss: 2.71297
global step 600, epoch: 1, batch: 600, loss: 2.61423
global step 700, epoch: 1, batch: 700, loss: 1.07601
global step 800, epoch: 1, batch: 800, loss: 1.71146
global step 900, epoch: 1, batch: 900, loss: 1.81692
global step 1000, epoch: 1, batch: 1000, loss: 1.86749
global step 1100, epoch: 1, batch: 1100, loss: 2.20435
global step 1200, epoch: 1, batch: 1200, loss: 1.16324
global step 1300, epoch: 1, batch: 1300, loss: 1.85572
global step 1400, epoch: 1, batch: 1400, loss: 0.86686
global step 1500, epoch: 1, batch: 1500, loss: 1.44837
global step 1600, epoch: 1, batch: 1600, loss: 1.27209
global step 1700, epoch: 1, batch: 1700, loss: 0.79163
global step 1800, epoch: 1, batch: 1800, loss: 1.80296
global step

[2022-09-30 16:43:42,801] [    INFO] - tokenizer config file saved in save_model/dureader_robust-ernie-3.0-base-zh/1/tokenizer_config.json
[2022-09-30 16:43:42,804] [    INFO] - Special tokens file saved in save_model/dureader_robust-ernie-3.0-base-zh/1/special_tokens_map.json


global step 2700, epoch: 2, batch: 9, loss: 0.41021
global step 2800, epoch: 2, batch: 109, loss: 2.12763
global step 2900, epoch: 2, batch: 209, loss: 1.52163
global step 3000, epoch: 2, batch: 309, loss: 1.27877
global step 3100, epoch: 2, batch: 409, loss: 1.52239
global step 3200, epoch: 2, batch: 509, loss: 0.99942
global step 3300, epoch: 2, batch: 609, loss: 1.02544
global step 3400, epoch: 2, batch: 709, loss: 0.36325
global step 3500, epoch: 2, batch: 809, loss: 0.75883
global step 3600, epoch: 2, batch: 909, loss: 0.58391
global step 3700, epoch: 2, batch: 1009, loss: 0.80946
global step 3800, epoch: 2, batch: 1109, loss: 0.19867
global step 3900, epoch: 2, batch: 1209, loss: 0.76185
global step 4000, epoch: 2, batch: 1309, loss: 2.78727
global step 4100, epoch: 2, batch: 1409, loss: 0.70396
global step 4200, epoch: 2, batch: 1509, loss: 0.92491
global step 4300, epoch: 2, batch: 1609, loss: 1.33201
global step 4400, epoch: 2, batch: 1709, loss: 1.89055
global step 4500, epoc

[2022-09-30 16:51:24,621] [    INFO] - tokenizer config file saved in save_model/dureader_robust-ernie-3.0-base-zh/1/tokenizer_config.json
[2022-09-30 16:51:24,693] [    INFO] - Special tokens file saved in save_model/dureader_robust-ernie-3.0-base-zh/1/special_tokens_map.json


global step 5400, epoch: 3, batch: 18, loss: 0.31900
global step 5500, epoch: 3, batch: 118, loss: 0.88382
global step 5600, epoch: 3, batch: 218, loss: 0.63239
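The `warmup_proportion` and `learning_rate` hyperparameters suggest a warmup-based learning-rate schedule. The internals of `training_mrc_model` (defined in `mrc_utils`) are not shown here, so the following is only a sketch of one common choice, linear warmup followed by linear decay, and it is an assumption that the baseline behaves this way. The function name `linear_warmup_decay_lr` and the example step counts are hypothetical.

```python
def linear_warmup_decay_lr(step, total_steps, base_lr=1e-5, warmup_proportion=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay.

    Assumption: this illustrates a common schedule; it is not confirmed to be
    what training_mrc_model does internally.
    """
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Example with a hypothetical run of 1000 optimizer steps.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay_lr(step, total_steps=1000))
```

With `warmup_proportion = 0.1`, the first 10% of steps use a reduced learning rate, which can help stabilize the early phase of fine-tuning.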


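### Obtaining Importance Scores
This step assigns an importance score to every token in the input, indicating how strongly that token influences the prediction. It consists of three sub-steps.
#### 1) Loading the model and the evaluation dataset
Update the model and evaluation data paths (MODEL_PATH and DATA_PATH), then load the model and the evaluation dataset. The stage 1 dataset contains 1,855 examples and the stage 2 dataset contains 4,366; please confirm that your evaluation dataset is complete.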

In [10]:
from utils import load_data
from functools import partial

# Correct MODEL_PATH and DATA_PATH before executing
MODEL_PATH = MODEL_SAVE_PATH + '/model_state.pdparams'
DATA_PATH = 'mrc_interpretation.txt'

# Load the trained parameters
state_dict = paddle.load(MODEL_PATH)
model.set_dict(state_dict)

# Load test data
data_ds = DuReader().read(DATA_PATH)
data = load_data(DATA_PATH)
print("Num of data:", len(data))

Num of data: 1855
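Since the evaluation set must be complete, it can be worth checking the loaded size programmatically rather than by eye. The sketch below does so using the two sizes stated above; the names `EXPECTED_SIZES` and `check_dataset_size` are hypothetical and not part of the baseline's utilities.

```python
# Expected number of examples per competition stage, as stated in this guide.
EXPECTED_SIZES = {1: 1855, 2: 4366}


def check_dataset_size(num_examples, stage):
    """Raise ValueError if the loaded evaluation set has an unexpected size."""
    expected = EXPECTED_SIZES[stage]
    if num_examples != expected:
        raise ValueError(
            f"stage {stage}: expected {expected} examples, got {num_examples}"
        )
    return True


# In the notebook this would be called as check_dataset_size(len(data), stage=1).
print(check_dataset_size(1855, stage=1))  # True
```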



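#### 2) Data preprocessing

a) Input formatting: arrange the two input texts into the format the model expects. For the ERNIE-3.0-base model, for example, the input takes the form [CLS]question[SEP]context[SEP].

b) Token position indexing: compute, for every token, its position in the original text. This is done for both the model tokenization and the standard tokenization.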
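The two preprocessing steps can be sketched on a toy example. The code below is a simplified illustration only: the helpers `build_input` and `offset_map` are hypothetical, the tokens are hand-written rather than produced by `ErnieTokenizer`, and the segment-id convention (0 for the question part, 1 for the context part) is assumed from the usual BERT-style layout.

```python
def build_input(question_tokens, context_tokens):
    """Arrange tokens as [CLS] question [SEP] context [SEP] with segment ids."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Assumed convention: segment 0 covers [CLS] question [SEP], segment 1 the rest.
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    return tokens, token_type_ids


def offset_map(text, tokens):
    """Return the (start, end) character span of each token within `text`.

    Assumes the tokens occur in `text` in order and without modification.
    """
    offsets, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


context = "TrustAI explains models"
model_tokens = ["Trust", "AI", "explains", "model", "s"]   # toy subword split
standard_tokens = ["TrustAI", "explains", "models"]        # toy standard split

tokens, token_type_ids = build_input(["what", "explains", "models"], model_tokens)
print(tokens)
print(offset_map(context, model_tokens))     # [(0, 5), (5, 7), (8, 16), (17, 22), (22, 23)]
print(offset_map(context, standard_tokens))  # [(0, 7), (8, 16), (17, 23)]
```

Having character spans for both tokenizations is what later allows scores computed on model tokens to be mapped onto the standard tokens used for evaluation.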

In [None]:
from mrc_utils import *
# Hyperparameters
batch_size = 12
max_seq_length = 512
epochs = 3
warmup_proportion = 0.1
weight_decay = 0.01
doc_stride = 512

# Prepare dataloader
test_trans_func = partial(prepare_validation_features, 
                            max_seq_length=max_seq_length, 
                            doc_stride=doc_stride,
                            tokenizer=tokenizer)
                            
data_ds.map(test_trans_func, batched=True, num_workers=4)
test_batch_sampler = paddle.io.DistributedBatchSampler(
        data_ds, batch_size=batch_size, shuffle=False)

test_batchify_fn = lambda samples, fn=Dict({
    "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
    "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id)
}): fn(samples)
test_data_loader = paddle.io.DataLoader(
    dataset=data_ds,
    batch_sampler=test_batch_sampler,
    collate_fn=test_batchify_fn,
    return_list=True)

# Get offset maps which will be used for score alignment
contexts, standard_split, ori_offset_maps, standard_split_offset_maps = pre_process(data, data_ds, tokenizer)

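#### 3) Computing importance scores
We provide two interpretation methods, attention and IG (integrated gradients). Choose whichever works best in your own experiments.

##### a) Attention-based Interpreter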

In [12]:
from trustai.interpretation.token_level import AttentionInterpreter
from utils import create_dataloader_from_scratch
import paddle
# Hyperparameters
BATCH_SIZE = 64

# Init an attention interpreter and get the importance scores
att = AttentionInterpreter(model, device="gpu", predict_fn=attention_predict_fn)

# Use attention interpreter to get the importance scores for all data
interp_results = None
for batch in test_data_loader:
    if interp_results:
        interp_results += att(batch)
    else:
        interp_results = att(batch)

# Trim the output to get scores only for context
interp_results = trim_output(interp_results, data_ds, tokenizer)

# Align the results back to the standard split tokens so that they can be evaluated correctly later
align_res = att.alignment(interp_results, contexts, standard_split, standard_split_offset_maps, ori_offset_maps, special_tokens=["[CLS]", '[SEP]'])
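The alignment step maps scores computed on model tokens onto the standard tokens by means of the two offset maps. How TrustAI's `alignment` method combines the scores internally is not shown in this guide, so the sketch below is only one plausible approach (summing the scores of overlapping model tokens, weighted by the overlapping fraction). The function `align_scores` is hypothetical.

```python
def align_scores(model_offsets, model_scores, standard_offsets):
    """Project per-model-token scores onto standard tokens via character spans.

    Assumption: each standard token receives the sum of the scores of the model
    tokens it overlaps, weighted by the fraction of the model token covered.
    """
    aligned = []
    for s_start, s_end in standard_offsets:
        total = 0.0
        for (m_start, m_end), score in zip(model_offsets, model_scores):
            overlap = min(s_end, m_end) - max(s_start, m_start)
            if overlap > 0:
                total += score * overlap / (m_end - m_start)
        aligned.append(total)
    return aligned


# Toy spans for the text "TrustAI explains models".
model_offsets = [(0, 5), (5, 7), (8, 16), (17, 22), (22, 23)]
model_scores = [0.25, 0.25, 0.5, 0.125, 0.125]
standard_offsets = [(0, 7), (8, 16), (17, 23)]

print(align_scores(model_offsets, model_scores, standard_offsets))  # [0.5, 0.5, 0.25]
```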

##### b) IG-based Interpreter

In [None]:
from trustai.interpretation.token_level import IntGradInterpreter
from utils import create_dataloader_from_scratch
# Hyperparameters
IG_STEP = 100
BATCH_SIZE = 64

# Init an IG interpreter
ig = IntGradInterpreter(model, predict_fn=IG_predict_fn, device="gpu")

# Use IG interpreter to get the importance scores for all data
interp_results = None
for batch in test_data_loader:
    if interp_results:
        interp_results += ig(batch, steps=IG_STEP)
    else:
        interp_results = ig(batch, steps=IG_STEP)

# trim the output to get scores only for context
interp_results = trim_output(interp_results, data_ds, tokenizer)

# Align the results back to the standard split tokens so that they can be evaluated correctly later
align_res = ig.alignment(interp_results, contexts, standard_split, standard_split_offset_maps, ori_offset_maps, special_tokens=["[CLS]", '[SEP]'])
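For intuition about what the `steps` argument controls, the sketch below implements integrated gradients on a toy function in plain Python. It approximates the path integral of the gradient from a baseline to the input with a midpoint Riemann sum, where `steps` is the number of interpolation points. This is a generic illustration of the technique, not TrustAI's implementation; the function `integrated_gradients` and the toy model are assumptions made for the example.

```python
def integrated_gradients(grad_fn, x, baseline, steps=100):
    """Approximate integrated gradients with a midpoint Riemann sum.

    grad_fn(point) must return the gradient of the model output at `point`.
    """
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Scale the averaged gradient by the input difference.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Toy model: f(x) = sum(w_i * x_i ** 2), whose gradient is 2 * w_i * x_i.
weights = [1.0, 2.0, 3.0]


def toy_model(x):
    return sum(w * xi ** 2 for w, xi in zip(weights, x))


def toy_grad(x):
    return [2.0 * w * xi for w, xi in zip(weights, x)]


x = [1.0, -1.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline, steps=100)
print(attributions)
# Completeness check: attributions should sum to f(x) - f(baseline).
print(sum(attributions), toy_model(x) - toy_model(baseline))
```

More steps give a closer approximation of the integral at the cost of proportionally more gradient evaluations, which is why IG is usually slower than the attention-based interpreter.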

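### Generating the Evaluation File
The evaluation file must contain three tab-separated columns: ID\tpredicted answer\trationale. We provide a script that converts the model output into this format.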

In [18]:
import math

# Re-sort the token index according to their importance scores
def resort(index_array, importance_score):
    res = sorted([[idx, importance_score[idx]] for idx in index_array], key=lambda x:x[1], reverse=True)
    res = [n[0] for n in res]
    return res

# Post-prepare the result data so that it can be used for the evaluation directly
def prepare_eval_data(data, results, paddle_model):
    res = {}
    idx = 0
    for data_id, inter_res in zip(data, results):
        
        # Drop the first and last entries of inter_res.word_attributions
        importance_score = np.array(inter_res.word_attributions[1:-1])
        # Extract topK importance scores
        topk = math.ceil(len(data[data_id]['sent_token'])*RATIONALE_RATIO)
        
        eval_data = {}        
        eval_data['id'] = data_id
        label = list(inter_res.pred_label)
        if int(label[0])>=int(label[1])+1:
            eval_data['pred_label'] = ''
        else:
            eval_data['pred_label'] = ''.join(tokenizer.convert_ids_to_tokens(data_ds[idx]['input_ids'][int(label[0]):int(label[1])+1]))
        # Find the token index of the topK importance scores
        eval_data['rationale'] = np.argpartition(importance_score, -topk)[-topk:]
        # Re-sort the token index according to their importance scores
        eval_data['rationale'] = resort(eval_data['rationale'], importance_score)

        res[data_id] = eval_data
        idx += 1
    return res

# Generate results for evaluation
predicts = prepare_eval_data(data, align_res, model)
with open('./mrc_rationale.txt', 'w') as out_file:
    for key in predicts:
        # Rationale token indices are written as a comma-separated list
        rationale = ','.join(str(idx) for idx in predicts[key]['rationale'])
        out_file.write(str(predicts[key]['id']) + '\t' + str(predicts[key]['pred_label']) + '\t' + rationale + '\n')
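Before submitting, it may help to confirm that every line of the generated file follows the three-column format described above. The following sketch checks that each line has exactly three tab-separated fields and that the rationale is a comma-separated list of integer indices. The function `validate_rationale_file` is hypothetical, and the sample rows are made up purely to exercise it.

```python
import os
import tempfile


def validate_rationale_file(path):
    """Check the id<TAB>answer<TAB>rationale format; return the line count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            # Strip only the newline: the answer column may legitimately be empty.
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
            rationale = fields[2]
            if not all(tok.isdigit() for tok in rationale.split(",")):
                raise ValueError(f"line {line_no}: rationale must be comma-separated indices")
            count += 1
    return count


# Demonstration on a small made-up file; in practice pass './mrc_rationale.txt'.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "mrc_rationale.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("1\tanswer\t3,0,7\n")
        f.write("2\t\t5\n")  # empty predicted answer is allowed
    num_valid = validate_rationale_file(sample_path)

print(num_valid)  # 2
```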