This repository is the official implementation of paper "External Features Enriched Model for Biomedical Question Answering".
The code is based on the original BERT Repository released by Google Team and the BioBERT Model released by DMIS-LAB.
The repo's structure is:
BioQAExternalFeatures
├─ dataProcess
│ ├─ bioner.py
│ ├─ posTreat.py
│ ├─ pos_ner_treat.py
│ └─ readme.md
├─ scripts
│ ├─ evaluation.sh
│ ├─ prediction.sh
│ ├─ readme.md
│ └─ training.sh
├─ LICENSE
├─ README.md
├─ __init__.py
├─ create_pretraining_data.py
├─ extract_features.py
├─ modeling.py
├─ modeling_test.py
├─ optimization.py
├─ optimization_test.py
├─ requirements.txt
├─ run_factoid_baseline.py # Original BioBERT model
├─ run_factoid_pos_ner.py # Our model (Feature Fusion + NER feature + POS feature)
├─ tokenization.py
└─ tokenization_test.py
Trained Models's parameters and processed data (added POS and NER labels) can be directly downloaded from Google Drive.
For Chinese users, if the speed of Google Drive is limited, please contact us for the data and models on Baidu Pan.
For the trained models:
-
We firstly fine-tuned BERT (BERT-Base, Multilingual Cased) under our enriched external-feature framework on SQuAD v1.1, of which the param could be found under
model/squad
-
Next, we further fine-tuned the model respectively on Biomedical Training Sets (6b, 7b and 8b), and the well trained models could be downloaded from
model/6b
,model/7b
,model/8b
.
For the data utilized in our experiments:
-
The data mainly come from
SQuAD
andBioASQ Challenge
. Particularly, for the biomedical QA data, we have utilized the enriched data provided by DMIS-LAB; -
You can get the feature-enriched training data under two ways:
-
Directly download the processed data from our google drive link
-
Use the provided tools in
dataProcess
to extract the NER, POS and BioNER features on your own side.
-
To install requirements:
pip install -r requirements.txt
Following are the using guides for training, predicting and evaluating our framework. For more information,
please refer to the script files in the scripts
folder.
If you want to train the model from scratch (which means from original BERT), please download the BERT param
to a model
folder,
and the feature-enrich squad training data (data/squad
)
to a data
folder, indicate the variables MODEL_DIR
and DATA_DIR
,
export MODEL_DIR=/full/path/to/model
export DATA_DIR=/full/path/to/data
export OUTPUT_DIR=/please/set/an/output/dir
and run the training file:
python run_factoid_pos_ner.py \
--do_train=True\
--do_predict=True \
--vocab_file=$MODEL_DIR/vocab.txt \
--bert_config_file=$MODEL_DIR/bert_config.json \
--init_checkpoint=$MODEL_DIR/model.ckpt-1000000 \
--max_seq_length=384 \
--train_batch_size=8 \
--learning_rate=5e-6 \
--doc_stride=128 \
--num_train_epochs=4.0 \
--do_lower_case=False \
--train_file=$DATA_DIR/ner_pos_train-v1.1.json \
--predict_file=$DATA_DIR/ner_pos_BioASQ-test-factoid-6b-1.json \
--output_dir=$OUTPUT_DIR
If you want to train the model from SQuAD-trained step, please download the squad param under model/squad
to a model
folder,
and the target BioASQ training data (e.g., data/6b
)
to a data
folder, similarly indicate the variables MODEL_DIR
and DATA_DIR
,
export MODEL_DIR=/full/path/to/model
export DATA_DIR=/full/path/to/data
export OUTPUT_DIR=/please/set/an/output/dir
and run the training file:
python run_factoid_pos_ner.py \
--do_train=True\
--do_predict=True \
--vocab_file=$MODEL_DIR/vocab.txt \
--bert_config_file=$MODEL_DIR/bert_config.json \
--init_checkpoint=$MODEL_DIR/model.ckpt-1000000 \
--max_seq_length=384 \
--train_batch_size=8 \
--learning_rate=5e-6 \
--doc_stride=128 \
--num_train_epochs=4.0 \
--do_lower_case=False \
--train_file=$DATA_DIR/ner_pos_6b.json \
--predict_file=$DATA_DIR/ner_pos_BioASQ-test-factoid-6b-1.json \
--output_dir=$OUTPUT_DIR
If you only want to predict the result on a specific BioASQ challenge test set, please download the the target model param under for example model/6b
to a model
folder,
and the target BioASQ test data (e.g., data/6b
)
to a data
folder, similarly indicate the variables MODEL_DIR
and DATA_DIR
,
export MODEL_DIR=/full/path/to/model
export DATA_DIR=/full/path/to/data
export OUTPUT_DIR=/please/set/an/output/dir
and run the training file:
python run_factoid_pos_ner.py \
--do_train=False\
--do_predict=True \
--vocab_file=$MODEL_DIR/vocab.txt \
--bert_config_file=$MODEL_DIR/bert_config.json \
--init_checkpoint=$MODEL_DIR/model.ckpt-1000000 \
--max_seq_length=384 \
--train_batch_size=8 \
--learning_rate=5e-6 \
--doc_stride=128 \
--num_train_epochs=4.0 \
--do_lower_case=False \
--train_file=$DATA_DIR/ner_pos_6b.json \
--predict_file=$DATA_DIR/ner_pos_BioASQ-test-factoid-6b-1.json \
--output_dir=$OUTPUT_DIR
To evaluate BioASQ answers, the system should be able to execute java codes, and please refer https://github.com/BioASQ/Evaluation-Measures, the BioASQ official evaluation tool, for details.
Besides, we have utilized the transformation script released by DMIS-LAB
To conduct the complete evalution process, it is required the golden answers released by BioASQ Challenge, which could be downloaded after registering on http://www.bioasq.org/.
Metrics | SAcc | LAcc | MRR |
---|---|---|---|
Our Model | 0.4517 | 0.6294 | 0.5197 |
Metrics | SAcc | LAcc | MRR |
---|---|---|---|
Our Model | 0.4444 | 0.6419 | 0.5165 |
Metrics | SAcc | LAcc | MRR |
---|---|---|---|
Our Model | 0.3937 | 0.6098 | 0.4688 |