unbiarirang/Dureader-bert

DuReader BERT

A machine reading comprehension model for DuReader 2019.

Base code: Dureader-Bert

Pretrained model downloads: BERT-base-chinese, wwm & wwm-ext

DuReader data download: DuReader_v2.0_preprocessed.zip

Summary

48.89 Baseline
49.65 (+0.76) Hyperparameter optimization
50.01 (+0.36) Paragraph selection – general BERT QP classification model
50.27 (+0.26) Paragraph selection – fine-tuning BERT QP classification model
50.55 (+0.28) Sample selection – use full range of match scores
50.89 (+0.34) Improved pre-training – wwm-ext
51.46 (+0.57) Model Improvement
51.50 (+0.04) Data Augmentation – CMRC and DRCD
51.57 (+0.07) Data Augmentation – synonym word replacement
51.78 (+0.21) Ensemble
Single: ROUGE-L 51.57, BLEU-4 48.7
Ensemble: ROUGE-L 51.78, BLEU-4 48.37
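
The per-step gains in the summary can be recomputed from the scores themselves. The sketch below copies the scores from the list above (they appear to be ROUGE-L, since the last two match the Single and Ensemble lines) and derives each increment and the total gain. The `steps` and `gains` names are illustrative only and are not part of the repository.

```python
# Scores copied from the summary; names here are illustrative, not repo code.
steps = [
    ("Baseline", 48.89),
    ("Hyperparameter optimization", 49.65),
    ("Paragraph selection, general BERT QP model", 50.01),
    ("Paragraph selection, fine-tuned BERT QP model", 50.27),
    ("Sample selection, full range of match scores", 50.55),
    ("Improved pre-training, wwm-ext", 50.89),
    ("Model improvement", 51.46),
    ("Data augmentation, CMRC and DRCD", 51.50),
    ("Data augmentation, synonym replacement", 51.57),
    ("Ensemble", 51.78),
]


def gains(steps):
    """Return (name, score, gain over the previous step) for each step."""
    rows = []
    prev = None
    for name, score in steps:
        delta = 0.0 if prev is None else round(score - prev, 2)
        rows.append((name, score, delta))
        prev = score
    return rows


rows = gains(steps)
# Overall improvement from the baseline to the ensemble.
total_gain = round(steps[-1][1] - steps[0][1], 2)

for name, score, delta in rows:
    print(f"{score:.2f} (+{delta:.2f}) {name}")
print(f"total gain: +{total_gain:.2f}")
```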

Code

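  • The handle_data folder preprocesses the DuReader data. It is specific to DuReader and has little to do with BERT.
  • The dataset folder contains the code that processes the Chinese text: it converts the text into BERT inputs (inputs_ids, token_type_ids, input_mask) and builds a dataloader from them.
  • The predict folder is used for prediction. It is largely the same as training, with some differences in the details (the output).
  • In short, any data works as long as it matches the BERT input format: (inputs_ids, token_type_ids, input_mask).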
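
To make the input format concrete, here is a minimal sketch of how a question/paragraph pair could be turned into the three BERT inputs. It is not the code in the dataset folder: the `build_features` function, the toy `vocab`, and the character-level split are assumptions made for illustration (a real run would use the tokenizer and vocabulary shipped with the pretrained model). The `input_ids` variable corresponds to inputs_ids in the text above.

```python
def build_features(question, paragraph, vocab, max_seq_length=16):
    """Build (input_ids, token_type_ids, input_mask) for one QA pair."""
    # Assumption: split Chinese text into single characters.
    q_tokens = list(question)
    p_tokens = list(paragraph)

    # Reserve three positions for [CLS], [SEP], [SEP]; truncate the paragraph.
    max_p = max_seq_length - len(q_tokens) - 3
    if max_p < 0:
        raise ValueError("question is too long for max_seq_length")
    p_tokens = p_tokens[:max_p]

    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(input_ids)

    pad = max_seq_length - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    token_type_ids += [0] * pad
    input_mask += [0] * pad
    return input_ids, token_type_ids, input_mask


# Toy vocabulary, hypothetical and far smaller than a real BERT vocabulary.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "天": 4, "气": 5, "好": 6, "吗": 7, "今": 8}

input_ids, token_type_ids, input_mask = build_features(
    "天气好吗", "今天天气很好", vocab)
```

All three lists have the same fixed length, so they can be stacked into tensors and batched by a dataloader.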

How to Run

Dependencies

  • python3
  • torch 1.0
  • packages: pytorch-pretrained-bert, tqdm, torchtext

Installation with pip

pip install -r requirements.txt

Preprocess the data

Place the downloaded DuReader data in the data folder.

|- data
| |- trainset
| | |- search.train.json
| | |- zhidao.train.json
| |- devset
| | |- search.dev.json
| | |- zhidao.dev.json
| |- testset
| | |- search.test.json
| | |- zhidao.test.json

# Preprocess the data
cd handle_data && sh run.sh
# Build the dataset
cd dataset && python run_squad.py
# Build the prediction dataset
cd predict && python util.py
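
Before running the preprocessing steps, it can help to confirm that the data folder matches the layout shown above. The following is a small sketch of such a check; the `missing_files` helper is hypothetical and not part of the repository, and the file list is copied from the directory tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED = [
    "trainset/search.train.json",
    "trainset/zhidao.train.json",
    "devset/search.dev.json",
    "devset/zhidao.dev.json",
    "testset/search.test.json",
    "testset/zhidao.test.json",
]


def missing_files(data_dir):
    """Return the expected files that are absent under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

Calling `missing_files("data")` from the repository root returns an empty list when all six files are in place.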

Build more datasets

# Build the qp-relevance prediction dataset
cd handle_data && sh run_qp.sh && cd ../predict && python util.py --dev-search-input-file '../../data/extracted/devset/search-qp.dev.json' --dev-zhidao-input-file '../../data/extracted/devset/zhidao-qp.dev.json' --predict-example-files 'predict-qp.data'
# Build the no-match-score training set
cd dataset && python run_squad_no_match_score.py
# Build the synonym training set (training set with synonym replacement)
cd dataset && python run_squad_synonym.py
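
Synonym replacement as a data augmentation technique can be sketched as follows. This is a generic illustration, not the logic of run_squad_synonym.py: the `replace_synonyms` function, the `synonyms` dictionary, and the replacement probability are all assumptions.

```python
import random


def replace_synonyms(tokens, synonyms, prob=0.1, seed=None):
    """Replace each token that has synonyms with probability `prob`."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok)
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


# Hypothetical synonym dictionary for illustration only.
synonyms = {"快速": ["迅速"], "方法": ["办法"]}
tokens = ["学习", "的", "快速", "方法"]
augmented = replace_synonyms(tokens, synonyms, prob=1.0, seed=0)
```

For span-extraction data, a replacement that changes the token length shifts the answer positions, so a real pipeline would need to recompute the answer span after augmenting.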

Train

python train.py --model-name 'best_model'

Predict

Predict on the first 5 paragraphs:

cd predict && python predicting.py --model-name 'best_model' --result-file-name 'best_model.json'

Predict on the 5 paragraphs with the highest qp-relevance scores:

cd predict && python predicting.py --model-name 'best_model' --result-file-name 'best_model-qp.json' --source-file-name predict-qp.data
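
The idea of keeping the 5 highest-scoring paragraphs can be sketched as below. The relevance scores would come from the question-paragraph (QP) classification model; here they are passed in as plain numbers. The `select_top_paragraphs` function is hypothetical, and restoring the original paragraph order after selection is a design assumption rather than something the repository states.

```python
def select_top_paragraphs(paragraphs, scores, k=5):
    """Keep the k paragraphs with the highest relevance scores."""
    if len(paragraphs) != len(scores):
        raise ValueError("paragraphs and scores must have the same length")
    # Rank paragraph indices by score, highest first.
    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: scores[i], reverse=True)
    # Assumption: put the selected paragraphs back in document order.
    keep = sorted(ranked[:k])
    return [paragraphs[i] for i in keep]


paragraphs = ["p0", "p1", "p2", "p3", "p4", "p5", "p6"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.6]
selected = select_top_paragraphs(paragraphs, scores, k=5)
```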

Ensemble prediction:

cd predict && python ensemble-predicting.py --model-names '["best_model1", "best_model2", "best_model3"]' --model-nums '[6, 6, 6]' --config-names '["bert_config.json", "bert_config.json", "bert_config.json"]' --result-file-name 'ensemble-qp.json' --source-file-name predict-qp.data
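
One common way to ensemble span-extraction models is to average the start and end probabilities of each model and then pick the best span. The sketch below shows that approach under stated assumptions; it is not confirmed to be what ensemble-predicting.py does, and `softmax`, `average_probs`, `best_span`, and `max_answer_len` are illustrative names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probs(per_model_logits):
    """Average per-position probabilities across models."""
    probs = [softmax(logits) for logits in per_model_logits]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]


def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing P(start) * P(end)."""
    start = average_probs(start_logits)
    end = average_probs(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(start):
        # Only consider spans with end >= start and bounded length.
        for j in range(i, min(i + max_answer_len, len(end))):
            score = ps * end[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Two hypothetical models scoring a 4-token passage.
start_logits = [[0.0, 5.0, 0.0, 0.0], [0.0, 4.0, 1.0, 0.0]]
end_logits = [[0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 4.0, 1.0]]
span = best_span(start_logits, end_logits)
```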

Eval

cd metric && python mrc_eval.py best_model.json ref.json v1
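
For readers unfamiliar with the ROUGE-L metric reported in the summary, here is a minimal sketch based on the longest common subsequence (LCS). It illustrates the metric only; the tokenization, the `beta` weighting, and any normalization used by mrc_eval.py may differ, and the default `beta=1.2` here is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure between candidate and reference token lists."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    # beta > 1 weights recall more heavily than precision.
    return ((1 + beta ** 2) * precision * recall
            / (recall + beta ** 2 * precision))


# Character-level comparison of a predicted and a reference answer.
score = rouge_l(list("今天天气很好"), list("今天天气好"))
```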

All-in-one (train, predict, eval)

sh train_and_predict.sh 8 2 512 3e-05 4 6

Reproduce the best result

Train for 1 epoch on train-synonym.data, then for 2 epochs on train-no-match-score.data:

python train.py --epochs 1 --model-name 'best_model_synonym' --trainset-name train-synonym.data --test-lines 186139 --state-dict pytorch_model_wwm_ext.bin --model-num 6 \
&& python train.py --model-name 'best_model_synonym' --state-dict model_dir/best_model_synonym --model-num 6 --trainset-name train-no-match-score.data --test-lines 229345 \
&& cd predict \
&& python predicting.py --model-name 'best_model_synonym' --result-file-name 'best_model_synonym-qp.json' --source-file-name predict-qp.data --model-num 6 && cd ../metric \
&& python mrc_eval.py best_model_synonym-qp.json ref.json v1
