# 2020 Language and Intelligence Challenge: Machine Reading Comprehension Task
https://aistudio.baidu.com/aistudio/competition/detail/28

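Machine reading comprehension (MRC) means having a machine read a text and then answer questions about what it read. Reading comprehension is an important frontier topic in natural language processing and artificial intelligence. It is valuable for raising the intelligence level of machines and for giving them the ability to acquire knowledge continuously, and it has attracted wide attention from academia and industry in recent years.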

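The Chinese Information Processing Society of China (CIPS), the China Computer Federation (CCF), and Baidu jointly held machine reading comprehension evaluations in 2018 and 2019, which greatly advanced Chinese machine reading comprehension technology. As the technology has progressed, some current models already achieve good performance on some reading comprehension test sets. In real applications, however, these models still show poor robustness. The 2020 Language and Intelligence Challenge therefore continues the machine reading comprehension evaluation, focusing on the robustness of reading comprehension models in real application scenarios and challenging their over-sensitivity, over-stability, and generalization ability.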

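This evaluation provides DuReader Robust, a high-quality Chinese reading comprehension dataset built for real application scenarios. It aims to give researchers and developers a platform for academic and technical exchange, to further raise the level of machine reading comprehension research, and to advance technology and applications in language understanding and artificial intelligence. A technical exchange forum and an award ceremony for the competition will be held at the 5th Language and Intelligence Summit. Researchers and developers from academia and industry are warmly invited to take part!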

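## Schedule
    2020/3/10   Registration opens; sample data released
    2020/3/31   Evaluation portal and leaderboard open; full training data and the first batch of test data released to registrants
    2020/5/12   Registration closes
    2020/5/13   Final test data released
    2020/5/20   Deadline for submitting system results
    2020/5/30   Competition results announced; system reports and papers accepted
    2020/6/30   Paper submission deadline
    2020/7      Exchange and award ceremony at the Language and Intelligence Summit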

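## Data

The competition dataset contains about 21K questions in total: 15K in the training set, about 1.4K in the in-domain development set, and 5K in the test set. The test set consists of an in-domain test set and a robustness test set; the robustness test set in turn comprises an over-sensitivity test set, an over-stability test set, and a generalization test set. The data is split into 4 parts for participants to download:

1. Training set: 15K samples, used to train competition models.
2. Development set: 1.4K samples with answers, used for model training and parameter tuning.
3. Test set 1: 2K samples, consisting mostly of in-domain test data plus a small portion of robustness test data. No reference answers are provided; participants use it to check model performance themselves on the competition platform. To prevent tuning against the test set, extra distractor data is mixed into the file.
4. Test set 2: the final test data for the competition (it includes test set 1), 5K questions in total, covering the entire in-domain and robustness test sets. No reference answers are provided. To prevent tuning against the test set, extra distractor data is mixed into the file. Results on this part cannot be self-checked on the competition platform.

### Data Sample

The platform provides the data as JSON files. A sample is shown below: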

    {
        "data": [
            {
                "paragraphs": [
                    {
                        "qas": [
                            {
                                "question": "非洲气候带", 
                                "id": "bd664cb57a602ae784ae24364a602674", 
                                "answers": [
                                    {
                                        "text": "热带气候", 
                                        "answer_start": 45
                                    }
                                ]
                            }
                        ], 
                        "context": "1、全年气温高，有热带大陆之称。主要原因在与赤道穿过大陆中部，位于南北纬30度之间，主要是热带气候，没有温带和寒带… "
                    }, 
                    {
                        "qas": [
                            {
                                "question": "韩国全称", 
                                "id": "a7eec8cf0c55077e667e0d85b45a6b34", 
                                "answers": [
                                    {
                                        "text": "大韩民国", 
                                        "answer_start": 5
                                    }
                                ]
                            }
                        ], 
                        "context": "韩国全称“大韩民国”，位于朝鲜半岛南部，隔“三八线”与朝鲜民主主义人民共和国相邻，面积9.93万平方公理… "
                    }
                ], 
                "title": ""
            }
        ]
    }
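
In the sample above, each `answer_start` is a character offset into `context`. The following is a minimal sketch of a sanity check for that property; the helper name `check_answer_offsets` and the shortened sample context are illustrative assumptions, not part of the official toolkit.

```python
def check_answer_offsets(dataset):
    """Return the ids of questions whose answer_start does not point at the answer text."""
    mismatched = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Test files carry no answers, so default to an empty list.
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    text = answer["text"]
                    if context[start:start + len(text)] != text:
                        mismatched.append(qa["id"])
    return mismatched


# Shortened copy of the second sample (context cut after the first clauses).
sample = {
    "data": [{
        "title": "",
        "paragraphs": [{
            "context": "韩国全称“大韩民国”，位于朝鲜半岛南部",
            "qas": [{
                "question": "韩国全称",
                "id": "a7eec8cf0c55077e667e0d85b45a6b34",
                "answers": [{"text": "大韩民国", "answer_start": 5}],
            }],
        }],
    }]
}

# An empty list means every offset matches its answer text.
print(check_answer_offsets(sample))
```

Running a check like this on the training file before building labels can surface offsets that were shifted by whitespace or encoding changes.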





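* Unofficial baseline for the machine reading comprehension track of Baidu LIC2020
* Directly uses RoBERTa + Softmax to predict the start and end positions of the answer
* The BASE model reaches an F1 of 0.69 on the phase-one test set, better than the official baseline
* If you have enough GPU memory, you can switch to the RoBERTa Large model, which can reach an F1 of 0.71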

In [1]:
!pip install --upgrade pip
!pip install bert4keras

Collecting pip
  Downloading https://files.pythonhosted.org/packages/43/84/23ed6a1796480a6f1a2d38f2802901d078266bda38388954d01d3f2e821d/pip-20.1.1-py2.py3-none-any.whl (1.5MB)
Installing collected packages: pip
  Found existing installation: pip 19.3.1
    Uninstalling pip-19.3.1:
      Successfully uninstalled pip-19.3.1
Successfully installed pip-20.1.1
Collecting bert4keras
  Downloading bert4keras-0.8.4.tar.gz (39 kB)
Building wheels for collected packages: bert4keras
  Building wheel for bert4keras (setup.py) ... done
  Created wheel for bert4keras: filename=bert4keras-0.8.4-py3-none-any.whl size=37855 sha256=d86611a7e7540d2dde0e4ea44d9e2e8d5fc2573bcddb4125f31f7360462f46e5
  Stored in directory: /root/.cache/pip/wheels/71/d1/a7/d54027d32e1f308c36aa561e08e6d05323a71b8ddc0d6a5f23
Successfully built bert4keras
Installing collected packages: bert4keras
Successfully installed bert4keras-0.8.4


In [2]:
!nvidia-smi

Mon Jul 27 13:09:25 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               

In [3]:
import json, os
import numpy as np
from bert4keras.backend import keras, K
from bert4keras.models import build_transformer_model
from bert4keras.tokenizers import Tokenizer
from bert4keras.optimizers import Adam
from bert4keras.snippets import sequence_padding, DataGenerator
from bert4keras.snippets import open
from keras.layers import Layer, Dense, Permute
from keras.models import Model
from tqdm import tqdm

# Basic settings
maxlen = 128
epochs = 20
batch_size = 4
learing_rate = 2e-5

Using TensorFlow backend.


# Download the model

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Unzip the model archive

In [5]:
 !ls /content/drive/"My Drive"/kaikeba/project03/roberta/data

bert_config.json				    dev.json.pred.json
bert_model.ckpt.data-00000-of-00001		    evaluate.py
bert_model.ckpt.index				    License.docx
bert_model.ckpt.meta				    __pycache__
chinese_roberta_wwm_large_ext_L-24_H-1024_A-16.zip  README.md
demo						    test1.json
demo_dev.json					    test2.json
demo_dev.json.pred.json				    train.json
demo_train.json					    vocab.txt
dev.json


In [6]:
! unzip -o /content/drive/"My Drive"/kaikeba/project03/roberta/data/chinese_roberta_wwm_large_ext_L-24_H-1024_A-16.zip -d /content/drive/"My Drive"/kaikeba/project03/roberta/data

Archive:  /content/drive/My Drive/kaikeba/project03/roberta/data/chinese_roberta_wwm_large_ext_L-24_H-1024_A-16.zip
  inflating: /content/drive/My Drive/kaikeba/project03/roberta/data/bert_config.json  
  inflating: /content/drive/My Drive/kaikeba/project03/roberta/data/bert_model.ckpt.data-00000-of-00001  
  inflating: /content/drive/My Drive/kaikeba/project03/roberta/data/bert_model.ckpt.index  
  inflating: /content/drive/My Drive/kaikeba/project03/roberta/data/bert_model.ckpt.meta  
  inflating: /content/drive/My Drive/kaikeba/project03/roberta/data/vocab.txt  


## Set the data paths

In [7]:
data_dir='/content/drive/My Drive/kaikeba/project03/roberta/data'
output_dir='/content/drive/My Drive/kaikeba/project03/roberta/output'

# Model paths

## bert

In [8]:
bert_dir = '/content/drive/My Drive/kaikeba/project03/roberta/data'
config_path = f'{bert_dir}/bert_config.json'
checkpoint_path = f'{bert_dir}/bert_model.ckpt'
dict_path = f'{bert_dir}/vocab.txt'

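# Load the data

In [9]:
def load_data(filename):
    """Read a dataset file and return a list of [id, context, question, answers].

    Only the paragraphs of the first entry in 'data' are read. Files without
    reference answers (the test sets) yield an empty answers list.
    """
    D = []
    for d in json.load(open(filename))['data'][0]['paragraphs']:
        for qa in d['qas']:
            D.append([
                qa['id'], d['context'], qa['question'],
                [a['text'] for a in qa.get('answers', [])]
            ])
    return D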

# Read the data

In [10]:
train_data = load_data(
    # os.path.join(data_dir,'train.json')
    os.path.join(data_dir,'demo_train.json')
)

In [11]:
train_data[0]

['bd664cb57a602ae784ae24364a602674',
 '1、全年气温高，有热带大陆之称。主要原因在与赤道穿过大陆中部，位于南北纬30度之间，主要是热带气候，没有温带和寒带。2、气候带呈明显带状分布，且南北对称。原因在于赤道穿过大陆中部，整个大陆基本被赤道均分为两部分。因此，纬度地带性明显。气候带以热带雨林为中心，向南北依次分布着热带草原、热带沙漠和地中海式气候。3、气候炎热干燥。第一：热带雨林气候面积较小，主要位于刚果河流域，面积较小。第二，地中海式气候，位于大陆的南北边缘，面积较小。夏季炎热而干旱，冬季温暖而湿润。第三，面积较大热带草原气候，有明显的干湿季。第四，热带沙漠气候主要位于撒哈拉大沙漠和西南角狭长地带。而撒哈拉沙漠占非洲总面积的1/4，全年炎热干燥，日照时间长，昼夜温差大。总之，全非洲纬度低，气温高；干燥地区广，常年湿润地区面积小。',
 '非洲气候带',
 ['热带气候']]

# Build the tokenizer

In [12]:
tokenizer = Tokenizer(dict_path, do_lower_case=True)

# Subsequence search

In [13]:
def search(pattern, sequence):
    """Find the sub-sequence pattern inside sequence.
    Return the index of the first match if found; otherwise return -1.
    """
    n = len(pattern)
    for i in range(len(sequence)):
        if sequence[i:i + n] == pattern:
            return i
    return -1
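
To make the role of `search` concrete, here is a small sketch of how an answer's token IDs are turned into start and end labels. The function is repeated so the snippet runs on its own; the token IDs are made-up values for illustration, not real vocabulary entries.

```python
def search(pattern, sequence):
    """Return the index of the first occurrence of pattern in sequence, or -1."""
    n = len(pattern)
    for i in range(len(sequence)):
        if sequence[i:i + n] == pattern:
            return i
    return -1


# Hypothetical token IDs: [CLS] question [SEP] context [SEP]
token_ids = [101, 7, 8, 102, 5, 6, 7, 8, 9, 102]
answer_ids = [7, 8, 9]

start = search(answer_ids, token_ids)
end = start + len(answer_ids) - 1
print(start, end)  # 6 8

# An answer that was truncated away (or tokenized differently) is not found.
print(search([42], token_ids))  # -1
```

Note that the partial match `[7, 8]` inside the question part is skipped because the full pattern has to match.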

# Data generator

In [14]:
class data_generator(DataGenerator):
    """Yield ([token_ids, segment_ids], labels) batches for span training."""
    def __iter__(self, random=False):
        batch_token_ids, batch_segment_ids, batch_labels = [], [], []
        for is_end, item in self.sample(random):
            context, question, answers = item[1:]
            token_ids, segment_ids = tokenizer.encode(
                question, context, maxlen=maxlen
            )
            # Pick one reference answer and locate its tokens in the input
            a = np.random.choice(answers)
            a_token_ids = tokenizer.encode(a)[0][1:-1]
            start_index = search(a_token_ids, token_ids)
            # Skip samples whose answer cannot be found (e.g. truncated away)
            if start_index != -1:
                labels = [[start_index], [start_index + len(a_token_ids) - 1]]
                batch_token_ids.append(token_ids)
                batch_segment_ids.append(segment_ids)
                batch_labels.append(labels)
            # Yield a full batch, or flush the remainder at the end of the data,
            # even when the last sample itself was skipped
            if batch_token_ids and (
                len(batch_token_ids) == self.batch_size or is_end
            ):
                batch_token_ids = sequence_padding(batch_token_ids)
                batch_segment_ids = sequence_padding(batch_segment_ids)
                batch_labels = sequence_padding(batch_labels)
                yield [batch_token_ids, batch_segment_ids], batch_labels
                batch_token_ids, batch_segment_ids, batch_labels = [], [], []

# Mask

In [15]:
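class MaskedSoftmax(Layer):
    """
    Apply softmax along the sequence-length axis and mask out the padding positions.
    """
    def compute_mask(self, inputs, mask=None):
        # Do not propagate the mask to later layers
        return None

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = K.cast(mask, K.floatx())
            mask = K.expand_dims(mask, 2)
            # Push padded positions to a very negative value so softmax gives them ~0
            inputs = inputs - (1.0 - mask) * 1e12
        return K.softmax(inputs, 1)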
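
The effect of the masking can be seen in a NumPy sketch that mirrors the same arithmetic outside Keras. The function name `masked_softmax` and the toy inputs are assumptions for illustration; the extra max-subtraction is only there for numerical stability and does not change the result.

```python
import numpy as np


def masked_softmax(logits, mask):
    """Softmax over axis 1 (sequence length), ignoring positions where mask == 0.

    logits: array of shape (batch, seq_len, 2)
    mask:   array of shape (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = mask.astype(logits.dtype)[:, :, None]
    logits = logits - (1.0 - mask) * 1e12
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# One sequence of length 4 whose last position is padding.
logits = np.zeros((1, 4, 2))
mask = np.array([[1, 1, 1, 0]])
probs = masked_softmax(logits, mask)

print(probs[0, :, 0])  # the three real positions share the probability mass
```

Because the softmax runs over the sequence axis, each of the two output channels becomes a distribution over token positions: one for the answer start and one for the answer end.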

# Build the model

In [16]:
model = build_transformer_model(
    config_path,
    checkpoint_path,
)

output = Dense(2)(model.output)
output = MaskedSoftmax()(output)
output = Permute((2, 1))(output)

model = Model(model.input, output)
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, None)         0                                            
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
Embedding-Token (Embedding)     (None, None, 1024)   21635072    Input-Token[0][0]                
__________________________________________________________________________________________________
Embedding-Segment (Embedding)   (None, None, 1024)   2048        Input-Segment[0][0]              
____________________________________________________________________________________________

# Loss and metric functions

In [17]:
def sparse_categorical_crossentropy(y_true, y_pred):
    # Re-establish the shape and dtype of y_true
    y_true = K.reshape(y_true, K.shape(y_pred)[:-1])
    y_true = K.cast(y_true, 'int32')
    y_true = K.one_hot(y_true, K.shape(y_pred)[2])
    # Compute the cross-entropy
    return K.mean(K.categorical_crossentropy(y_true, y_pred))


def sparse_accuracy(y_true, y_pred):
    # Re-establish the shape and dtype of y_true
    y_true = K.reshape(y_true, K.shape(y_pred)[:-1])
    y_true = K.cast(y_true, 'int32')
    # Compute the accuracy
    y_pred = K.cast(K.argmax(y_pred, axis=2), 'int32')
    return K.mean(K.cast(K.equal(y_true, y_pred), K.floatx()))
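
As a rough cross-check of what these two functions compute, the sketch below reproduces them in NumPy for predictions of shape `(batch, 2, seq_len)`, which is the layout after `Permute((2, 1))`. The names `span_cross_entropy` and `span_accuracy`, the clipping constant, and the toy numbers are assumptions for illustration.

```python
import numpy as np


def span_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability of the true start and end positions."""
    y_true = np.asarray(y_true).reshape(y_pred.shape[:-1]).astype(int)
    # Pick the predicted probability at each true position.
    picked = np.take_along_axis(y_pred, y_true[..., None], axis=-1)[..., 0]
    return float(np.mean(-np.log(np.clip(picked, eps, 1.0))))


def span_accuracy(y_true, y_pred):
    """Fraction of start/end positions whose argmax equals the label."""
    y_true = np.asarray(y_true).reshape(y_pred.shape[:-1]).astype(int)
    return float(np.mean(y_pred.argmax(axis=-1) == y_true))


# One sample, sequence length 3: row 0 is the start distribution, row 1 the end.
y_pred = np.array([[[0.1, 0.7, 0.2],
                    [0.2, 0.2, 0.6]]])
y_true = np.array([[[1], [2]]])  # start at position 1, end at position 2

print(span_cross_entropy(y_true, y_pred))  # mean of -log(0.7) and -log(0.6)
print(span_accuracy(y_true, y_pred))       # 1.0
```

Start and end are scored independently here, so a prediction with the right start but the wrong end still earns half of the accuracy.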

# Compile the model

In [19]:
model.compile(
    loss=sparse_categorical_crossentropy,
    optimizer=Adam(learing_rate),
    metrics=[sparse_accuracy]
)

# Answer extraction

In [21]:
def extract_answer(question, context, max_a_len=16):
    """
    Extract the answer span for a question from its context.
    """
    max_q_len = 64
    q_token_ids = tokenizer.encode(question, maxlen=max_q_len)[0]
    # Give the context whatever token budget the question leaves over
    c_token_ids = tokenizer.encode(
        context, maxlen=maxlen - len(q_token_ids) + 1
    )[0]
    token_ids = q_token_ids + c_token_ids[1:]
    segment_ids = [0] * len(q_token_ids) + [1] * (len(c_token_ids) - 1)
    # Map each context token back to its character positions in the raw text
    c_tokens = tokenizer.tokenize(context)[1:-1]
    mapping = tokenizer.rematch(context, c_tokens)
    probas = model.predict([[token_ids], [segment_ids]])[0]
    # Keep only the context positions (drop the question part and the final [SEP])
    probas = probas[:, len(q_token_ids):-1]
    # Find the (start, end) pair with the highest joint probability
    start_end, score = None, -1
    for start, p_start in enumerate(probas[0]):
        for end, p_end in enumerate(probas[1]):
            if end >= start and end < start + max_a_len:
                if p_start * p_end > score:
                    start_end = (start, end)
                    score = p_start * p_end
    start, end = start_end
    return context[mapping[start][0]:mapping[end][-1] + 1]
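
The nested loop over start and end positions can also be written with NumPy. The following is an alternative sketch, not the notebook's own code: `best_span` is a hypothetical helper that builds the outer product of the two distributions and keeps only pairs with `start <= end < start + max_a_len`.

```python
import numpy as np


def best_span(p_start, p_end, max_a_len=16):
    """Return (start, end, score) maximizing p_start[start] * p_end[end]
    subject to start <= end < start + max_a_len."""
    p_start = np.asarray(p_start, dtype=float)
    p_end = np.asarray(p_end, dtype=float)
    scores = np.outer(p_start, p_end)
    # np.triu keeps end >= start; subtracting the band from the max_a_len-th
    # diagonal upward zeroes out spans that are too long.
    scores = np.triu(scores) - np.triu(scores, k=max_a_len)
    start, end = np.unravel_index(np.argmax(scores), scores.shape)
    return int(start), int(end), float(scores[start, end])


p_start = [0.1, 0.6, 0.3]
p_end = [0.5, 0.1, 0.4]

print(best_span(p_start, p_end))               # span (1, 2)
print(best_span(p_start, p_end, max_a_len=1))  # only single-token spans: (2, 2)
```

For the sequence lengths used here (at most 128 tokens) the difference in speed is small, so this is mainly a more compact way to express the same constraint.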

# Generate the prediction file

In [22]:
def predict_to_file(infile, out_file):
    """Write the predictions to a file for easy submission.
    """
    fw = open(out_file, 'w', encoding='utf-8')
    R = {}
    for d in tqdm(load_data(infile)):
        a = extract_answer(d[2], d[1])
        R[d[0]] = a
    R = json.dumps(R, ensure_ascii=False, indent=4)
    fw.write(R)
    fw.close()

# Official evaluation function

In [23]:
import sys
import io
import json
sys.path.append('/content/drive/My Drive/kaikeba/project03/roberta/data')
from evaluate import evaluate as src_evaluate
from collections import OrderedDict

In [24]:
def evaluate(filename):
    """
    Evaluation function (uses the officially provided script evaluate.py).
    """
    predict_to_file(filename, filename + '.pred.json')
    ref_ans = json.load(io.open(filename))
    pred_ans = json.load(io.open(filename + '.pred.json'))
    F1, EM, TOTAL, SKIP = src_evaluate(ref_ans, pred_ans)
    output_result = OrderedDict()
    output_result['F1'] = '%.3f' % F1
    output_result['EM'] = '%.3f' % EM
    output_result['TOTAL'] = TOTAL
    output_result['SKIP'] = SKIP
    return output_result


class Evaluator(keras.callbacks.Callback):
    """
    Evaluate the model after each epoch and save the best one.
    """
    def __init__(self):
        self.best_val_f1 = 0.

    def on_epoch_end(self, epoch, logs=None):
        metrics = evaluate(
            os.path.join(data_dir,'dev.json')
            # os.path.join(data_dir,'demo_dev.json')
        )
        if float(metrics['F1']) >= self.best_val_f1:
            self.best_val_f1 = float(metrics['F1'])
            model.save_weights(os.path.join(output_dir,'roberta_best_model.weights'))
            model.save(os.path.join(output_dir,'roberta_best_model.h5'))
        metrics['BEST_F1'] = self.best_val_f1
        print(metrics)
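
For intuition about the F1 and EM numbers printed by the callback, here is a simplified character-level scorer. It is a sketch under stated assumptions, not the official `evaluate.py`: it assumes F1 is computed from the longest common substring between prediction and reference, takes the best score over all references, and skips any text normalization the official script may apply. All function names are hypothetical.

```python
def longest_common_substring_len(a, b):
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b):
            if x == y:
                cur[j + 1] = prev[j] + 1
                best = max(best, cur[j + 1])
        prev = cur
    return best


def char_f1(gold_answers, prediction):
    """Best character-level F1 of the prediction against any reference answer."""
    scores = [0.0]
    for gold in gold_answers:
        overlap = longest_common_substring_len(gold, prediction)
        if overlap == 0:
            continue
        precision = overlap / len(prediction)
        recall = overlap / len(gold)
        scores.append(2 * precision * recall / (precision + recall))
    return max(scores)


def exact_match(gold_answers, prediction):
    """1.0 if the prediction equals one of the reference answers, else 0.0."""
    return float(prediction in gold_answers)


gold = ["大韩民国"]
print(char_f1(gold, "大韩民国"), exact_match(gold, "大韩民国"))      # 1.0 1.0
print(char_f1(gold, "韩民国政府"), exact_match(gold, "韩民国政府"))  # partial credit, 0.0
```

A partially overlapping prediction earns F1 credit but no EM credit, which is why the EM values in the training log sit well below the F1 values.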

# Prepare the data

In [25]:
train_generator = data_generator(train_data, batch_size)
evaluator = Evaluator()

# Train the model

In [None]:
epochs=5
model.fit_generator(
    train_generator.forfit(),
    steps_per_epoch=len(train_generator),
    epochs=epochs,
    verbose=1,
    callbacks=[evaluator]
)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Epoch 1/5


100%|██████████| 1417/1417 [01:59<00:00, 11.89it/s]


OrderedDict([('F1', '44.755'), ('EM', '31.193'), ('TOTAL', 1417), ('SKIP', 0), ('BEST_F1', 44.755)])
Epoch 2/5


100%|██████████| 1417/1417 [01:58<00:00, 11.94it/s]


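# Load the best model

In [None]:
from keras.models import load_model
# Register every custom object that was used when the model was built and compiled
model = load_model(
    os.path.join(output_dir, 'roberta_best_model.h5'),
    custom_objects={
        'MaskedSoftmax': MaskedSoftmax,
        'sparse_categorical_crossentropy': sparse_categorical_crossentropy,
        'sparse_accuracy': sparse_accuracy
    }
)
print(evaluate(os.path.join(data_dir, 'dev.json')))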

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
100%|██████████| 1417/1417 [00:50<00:00, 27.86it/s]


OrderedDict([('F1', '41.801'), ('EM', '26.888'), ('TOTAL', 1417), ('SKIP', 0)])


In [None]:
# from keras.models import load_model
# model=load_model(os.path.join(output_dir,'roberta_best_model.h5'))

In [None]:
# model.load_weights(os.path.join(data_dir,'best_model.weights'))

# Predict the results

## Test1

In [None]:
predict_to_file(os.path.join(data_dir,'test1.json'), os.path.join(output_dir,'pred1.json'))

100%|██████████| 50000/50000 [28:26<00:00, 29.29it/s]


## Test2

In [None]:
predict_to_file(os.path.join(data_dir,'test2.json'),  os.path.join(output_dir,'pred2.json'))

100%|██████████| 50000/50000 [28:09<00:00, 29.59it/s]


# Save and upload the results