Cross-type Biomedical Named Entity Recognition with Deep Multi-task Learning

This project provides a neural-network-based multi-task learning framework for biomedical named entity recognition (BioNER).

The implementation is based on the PyTorch library. Our model jointly trains on corpora covering different biomedical entity types to build a unified model. This benefits the training of each individual entity type and yields a higher F1 score than the state-of-the-art BioNER systems we compare against on all five datasets listed under Benchmarks.

Links

Installation

For training, a GPU is strongly recommended for speed. CPU is supported but training could be extremely slow.

PyTorch

The code is based on PyTorch. You can find installation instructions here.

Dependencies

The code is written in Python 3.6. Its dependencies are summarized in the file requirements.txt. You can install these dependencies like this:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, first download the corpora and the embedding file from here, unzip the folder data_bioner_5/, and put it under the main folder ./. Then run the model with the following script:

./run_lm-lstm-crf5.sh

Data

We use five biomedical corpora collected by Crichton et al. for biomedical NER. The dataset is publicly available and can be downloaded from here. The details of each dataset are listed below:

| Dataset | Entity Type | Dataset Size |
| --- | --- | --- |
| BC2GM | Gene/Protein | 20,000 sentences |
| BC4CHEMD | Chemical | 10,000 abstracts |
| BC5CDR | Chemical, Disease | 1,500 articles |
| NCBI-disease | Disease | 793 abstracts |
| JNLPBA | Gene/Protein, DNA, Cell Type, Cell Line, RNA | 2,404 abstracts |

Note

In our paper, we merge the original training set and development set into a new training set, as many teams did in the challenge. Some previous work (e.g., Luo et al., Bioinformatics 2017; Lu et al., Journal of Cheminformatics 2015; and Leaman and Lu, Bioinformatics 2016) also preprocessed the data this way. To reproduce our results, please preprocess the data the same way.
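The merge described above can be sketched as a plain file concatenation. This is a minimal sketch, assuming each split is a CoNLL-style text file; the function name `merge_conll_files` and the file names in the usage note are hypothetical and are not part of this repository.

```python
from pathlib import Path


def merge_conll_files(paths, out_path):
    """Concatenate CoNLL-style files into one file (hypothetical helper)."""
    chunks = []
    for path in paths:
        # Drop leading/trailing newlines so the separator below is uniform.
        text = Path(path).read_text(encoding="utf-8").strip("\n")
        if text:
            chunks.append(text)
    # A blank line between files keeps the last sentence of one file
    # from running into the first sentence of the next.
    Path(out_path).write_text("\n\n".join(chunks) + "\n", encoding="utf-8")
```

For example, `merge_conll_files(["train.tsv", "devel.tsv"], "train_merged.tsv")` would write the combined training file, which can then be passed to `--train_file`.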

Format

Users may want to use other datasets. We assume the corpus is formatted in the same way as the CoNLL 2003 NER dataset.

More specifically, empty lines are used as separators between sentences, and the separator between documents is a special line as below.

-DOCSTART- -X- -X- -X- O

Other lines contain words, labels, and other fields. The word must be the first field and the label must be the last. For example,

-DOCSTART- -X- -X- -X- O

Selegiline	S-Chemical
-	O
induced	O
postural	B-Disease
hypotension	E-Disease
in	O
Parkinson	B-Disease
'	I-Disease
s	I-Disease
disease	E-Disease
:	O
a	O
longitudinal	O
study	O
on	O
the	O
effects	O
of	O
drug	O
withdrawal	O
.	O
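To check that a custom corpus follows this format, a small reader can help. The sketch below is an illustration only, not the loader used by train_wc.py; the function name `read_conll` is hypothetical. It assumes fields are separated by whitespace, takes the first field as the word and the last as the label, and treats blank lines and -DOCSTART- lines as boundaries.

```python
def read_conll(lines):
    """Yield sentences as lists of (word, label) pairs (hypothetical helper)."""
    sentence = []
    for line in lines:
        line = line.strip()
        # Blank lines separate sentences; -DOCSTART- lines separate documents.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        # The word is the first field, the label is the last one.
        sentence.append((fields[0], fields[-1]))
    # Emit the final sentence if the input does not end with a blank line.
    if sentence:
        yield sentence
```

Passing an open file handle, e.g. `list(read_conll(open(path, encoding="utf-8")))`, returns one list of (word, label) pairs per sentence.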

Embedding

We initialize the word embedding matrix with pre-trained word vectors from Pyysalo et al., 2013. These word vectors are trained using the skip-gram model on the PubMed abstracts together with all the full-text articles from PubMed Central (PMC) and a Wikipedia dump. You can download the embedding files from here.

Usage

train_wc.py is the script for our multi-task LSTM-CRF model. Its usage can be displayed with

python3 train_wc.py -h

A typical training command is:

python3 train_wc.py --train_file [training file 1] [training file 2] ... [training file N] \
                    --dev_file [development file 1] [development file 2] ... [development file N] \
                    --test_file [testing file 1] [testing file 2] ... [testing file N] \
                    --caseless --fine_tune --emb_file [embedding file] --shrink_embedding --word_dim 200

Users may incorporate an arbitrary number of corpora into the training process. In each epoch, our model randomly selects one dataset i. We use training set i to learn the parameters and development set i to evaluate the performance. If the current model achieves its best performance so far for dataset i on the development set, we then calculate the precision, recall, and F1 on test set i.
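The per-epoch schedule described above can be sketched as follows. This is a simplified illustration, not the code in train_wc.py: the callbacks `train_fn`, `dev_fn`, and `test_fn` are hypothetical stand-ins for the real training and evaluation steps, and uniform random sampling of the dataset index is an assumption.

```python
import random


def train_multitask(datasets, num_epochs, train_fn, dev_fn, test_fn, seed=0):
    """Sketch of the multi-task schedule; all callbacks are hypothetical."""
    rng = random.Random(seed)
    best_dev = [float("-inf")] * len(datasets)  # best dev score per dataset
    test_at_best = [None] * len(datasets)       # test result at that point
    for _ in range(num_epochs):
        # Assumption: each dataset is equally likely to be picked.
        i = rng.randrange(len(datasets))
        train_fn(datasets[i])          # learn parameters on training set i
        dev_score = dev_fn(datasets[i])  # evaluate on development set i
        if dev_score > best_dev[i]:
            # New best for dataset i: only now evaluate on test set i.
            best_dev[i] = dev_score
            test_at_best[i] = test_fn(datasets[i])
    return best_dev, test_at_best
```

Tracking the best development score separately per dataset means an improvement on one corpus never triggers a test evaluation on another.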

Benchmarks

Here we compare our model with recent state-of-the-art models on the five datasets mentioned above. We use F1 score as the evaluation metric.

| Model | BC2GM | BC4CHEMD | BC5CDR | NCBI-disease | JNLPBA |
| --- | --- | --- | --- | --- | --- |
| Dataset Benchmark | - | 88.06 | 86.76 | 82.90 | 72.55 |
| Crichton et al. 2016 | 73.17 | 83.02 | 83.90 | 80.37 | 70.09 |
| Lample et al. 2016 | 80.51 | 87.74 | 86.92 | 85.80 | 73.48 |
| Ma and Hovy 2016 | 78.48 | 86.84 | 86.65 | 82.62 | 72.68 |
| Liu et al. 2018 | 80.00 | 88.75 | 86.96 | 83.92 | 72.17 |
| Our Model | 80.74 | 89.37 | 88.78 | 86.14 | 73.52 |
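For readers who want to compute the metric on their own predictions, here is a minimal sketch of precision, recall, and F1 over entity spans. It assumes exact-match, entity-level scoring, which is common for NER but is not spelled out in this README; the function name and the (start, end, type) span representation are hypothetical. The function returns fractions in [0, 1], whereas the table reports percentages.

```python
def precision_recall_f1(gold_spans, pred_spans):
    """Exact-match scores over (start, end, type) spans (assumed scheme)."""
    gold, pred = set(gold_spans), set(pred_spans)
    # A prediction counts only if boundaries and type both match exactly.
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With three gold spans and four predicted spans of which two match exactly, precision is 2/4, recall is 2/3, and F1 is 4/7.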

Prediction

train_wc.py can write the annotation results directly during training when the --output_annotation flag is set, i.e.,

python3 train_wc.py --train_file [training file 1] [training file 2] ... [training file N] \
                    --dev_file [development file 1] [development file 2] ... [development file N] \
                    --test_file [testing file 1] [testing file 2] ... [testing file N] \
                    --caseless --fine_tune --emb_file [embedding file] --shrink_embedding --output_annotation --word_dim 200 --gpu 0

If --output_annotation is not used, the best-performing model found during training will be saved in ./checkpoint/.

Pre-trained Model

We have released our pre-trained model. You can download the Arg file and the Model file and put them in ./checkpoint/.

Using the saved model, seq_wc.py can be applied to annotate raw text. Its usage can be displayed with the command

python3 seq_wc.py -h

and an example command is provided below:

python3 seq_wc.py --load_arg checkpoint/cwlm_lstm_crf.json --load_check_point checkpoint/cwlm_lstm_crf.model --input_file test.tsv --output_file annotate/output --gpu 0

The annotation results will be in ./annotate/.

The input format is similar to CoNLL, but each line must contain exactly one field: the token. For example, an input file could be:

The
severe
anemia
(
hemoglobin
1
.
2
g
/
dl
)
appeared
to
be
the
primary
etiologic
factor
.

and the corresponding output is:

The O
severe O
anemia O
( O
hemoglobin B-GENE
1 I-GENE
. I-GENE
2 I-GENE
g I-GENE
/ I-GENE
dl E-GENE
) O
appeared O
to O
be O
the O
primary O
etiologic O
factor O
. O 
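The two formats above can be produced and consumed with a few helper functions. The sketch below is an illustration under stated assumptions, not part of seq_wc.py: the regex tokenizer is inferred from the example (it splits "1.2" into "1", ".", "2") and may differ from the tokenization used for the training corpora, and the function names are hypothetical. The decoder relies only on the B-/I-/E-/S- prefixes and the O tag that appear in the examples in this README.

```python
import re


def tokenize(text):
    """Split text into word/number runs and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_input_text(sentences):
    """One token per line, with a blank line after each sentence."""
    lines = []
    for sentence in sentences:
        lines.extend(tokenize(sentence))
        lines.append("")
    return "\n".join(lines)


def entities_from_output(output_text):
    """Group 'token tag' lines into (mention, type) pairs."""
    entities, current, current_type = [], [], None
    for line in output_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            # Blank or malformed line: close any open span.
            current, current_type = [], None
            continue
        token, tag = fields[0], fields[-1]
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            entities.append((token, etype))
            current, current_type = [], None
        elif prefix == "B":
            # Start of a multi-token entity.
            current, current_type = [token], etype
        elif prefix in ("I", "E") and current and etype == current_type:
            current.append(token)
            if prefix == "E":
                # End tag closes the entity.
                entities.append((" ".join(current), etype))
                current, current_type = [], None
        else:
            # "O" or an inconsistent tag: drop any open span.
            current, current_type = [], None
    return entities
```

Applied to the output shown above, `entities_from_output` returns a single pair, `("hemoglobin 1 . 2 g / dl", "GENE")`.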

Citation

If you find the implementation useful, please cite the following paper:

@article{wang2018cross,
  title={Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning},
  author={Wang, Xuan and Zhang, Yu and Ren, Xiang and Zhang, Yuhao and Zitnik, Marinka and Shang, Jingbo and Langlotz, Curtis and Han, Jiawei},
  journal={arXiv preprint arXiv:1801.09851},
  year={2018}
}