
Cross-type Biomedical Named Entity Recognition with Deep Multi-task Learning

This project provides a neural-network-based multi-task learning framework for biomedical named entity recognition (BioNER).

The implementation is based on the PyTorch library. Our model jointly trains on different biomedical entity types to build a unified model that benefits the training of each individual entity type and achieves significantly better performance than state-of-the-art BioNER systems.



For training, a GPU is strongly recommended for speed. CPU is supported but training could be extremely slow.


The code is based on PyTorch. You can find installation instructions here.


The code is written in Python 3.6. Its dependencies are summarized in the file requirements.txt. You can install these dependencies like this:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, you can first download the corpora and the embedding file from here, unzip the folder data_bioner_5/ and put it under the main folder ./. Then the following running script can be used to run the model.



We use five biomedical corpora collected by Crichton et al. for biomedical NER. The dataset is publicly available and can be downloaded from here. The details of each dataset are listed below:

| Dataset | Entity Type | Dataset Size |
| --- | --- | --- |
| BC2GM | Gene/Protein | 20,000 sentences |
| BC4CHEMD | Chemical | 10,000 abstracts |
| BC5CDR | Chemical, Disease | 1,500 articles |
| NCBI-disease | Disease | 793 abstracts |
| JNLPBA | Gene/Protein, DNA, Cell Type, Cell Line, RNA | 2,404 abstracts |


In our paper, we merge the original training and development sets into a new training set, as many teams did in the challenge. Some previous work (e.g., Luo et al., Bioinformatics 2017; Lu et al., Journal of Cheminformatics 2015; and Leaman and Lu, Bioinformatics 2016) also preprocessed the data this way. To reproduce our results, please preprocess the data in the same way.
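The merging step described above can be sketched as follows. This is a minimal illustration, not the preprocessing script used for the paper: it assumes both files are in the CoNLL-style format, and the helper names `merge_conll_text` and `merge_conll_files` are hypothetical.

```python
from pathlib import Path


def merge_conll_text(texts):
    """Join CoNLL-style texts, separating them with exactly one blank line."""
    # Drop empty inputs and trim surrounding newlines so that sentence
    # boundaries between files stay a single blank line.
    parts = [t.strip("\n") for t in texts if t.strip()]
    return "\n\n".join(parts) + "\n" if parts else ""


def merge_conll_files(in_paths, out_path):
    """Write the concatenation of several CoNLL-style files to out_path."""
    texts = [Path(p).read_text(encoding="utf-8") for p in in_paths]
    Path(out_path).write_text(merge_conll_text(texts), encoding="utf-8")
```

For example, `merge_conll_files(["train.tsv", "devel.tsv"], "train_merged.tsv")` would build the merged training file, where the three file names are placeholders for the files of one corpus.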


Users may want to use other datasets. We assume the corpus is formatted in the same way as the CoNLL 2003 NER dataset.

More specifically, empty lines are used as separators between sentences, and the separator between documents is a special line as below.

-DOCSTART- -X- -X- -X- O

Other lines contain words, labels and other fields. The word must be the first field, and the label must be the last. For example,

-DOCSTART- -X- -X- -X- O

Selegiline	S-Chemical
-	O
induced	O
postural	B-Disease
hypotension	E-Disease
in	O
Parkinson	B-Disease
'	I-Disease
s	I-Disease
disease	E-Disease
:	O
a	O
longitudinal	O
study	O
on	O
the	O
effects	O
of	O
drug	O
withdrawal	O
.	O
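A reader for this format can be sketched in a few lines. The function below is an illustration of the layout described above and is not the loader used in this repository; the name `read_conll` is hypothetical, and it assumes fields are separated by whitespace.

```python
def read_conll(lines):
    """Yield sentences as lists of (word, label) pairs.

    Assumes the layout described above: blank lines separate sentences,
    -DOCSTART- lines separate documents, the word is the first field and
    the label is the last field.
    """
    sentence = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document separators both close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        sentence.append((fields[0], fields[-1]))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence
```

Because it accepts any iterable of lines, it can be used directly on an open file, e.g. `with open(path, encoding="utf-8") as f: sentences = list(read_conll(f))`.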


We initialize the word embedding matrix with pre-trained word vectors from Pyysalo et al., 2013. These word vectors are trained using the skip-gram model on the PubMed abstracts together with all the full-text articles from PubMed Central (PMC) and a Wikipedia dump. You can download the embedding files from here.

Usage

The training script implements our multi-task LSTM-CRF model. Its options can be listed with:

python -h

The default running commands are:

python3 --train_file [training file 1] [training file 2] ... [training file N] \
                    --dev_file [developing file 1] [developing file 2] ... [developing file N] \
                    --test_file [testing file 1] [testing file 2] ... [testing file N] \
                    --caseless --fine_tune --emb_file [embedding file] --shrink_embedding --word_dim 200

Users may incorporate an arbitrary number of corpora into the training process. In each epoch, our model randomly selects one dataset i. We use training set i to learn the parameters and developing set i to evaluate the performance. If the current model achieves the best performance for dataset i on the developing set, we will then calculate the precision, recall and F1 on testing set i.
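The per-epoch procedure described above can be sketched as the loop below. It only illustrates the control flow and is not the repository's training code: the three callbacks (`train_one_epoch`, `evaluate_dev`, `evaluate_test`) are hypothetical stand-ins for the real training and evaluation routines, and using a single score such as F1 for model selection is an assumption.

```python
import random


def multitask_schedule(num_datasets, num_epochs, train_one_epoch,
                       evaluate_dev, evaluate_test, seed=0):
    """Run the dataset-sampling loop and track test results at the best dev score.

    The callbacks are hypothetical: each takes a dataset index i;
    evaluate_dev(i) must return a comparable score (e.g. F1).
    """
    rng = random.Random(seed)
    best_dev = [float("-inf")] * num_datasets
    test_at_best = [None] * num_datasets
    for _ in range(num_epochs):
        i = rng.randrange(num_datasets)   # randomly select one dataset i
        train_one_epoch(i)                # learn parameters on training set i
        dev_score = evaluate_dev(i)       # evaluate on development set i
        if dev_score > best_dev[i]:
            # New best for dataset i: only now is test set i evaluated.
            best_dev[i] = dev_score
            test_at_best[i] = evaluate_test(i)
    return best_dev, test_at_best
```

Note that with this scheme a dataset that is never sampled keeps `None` as its test result, so the number of epochs should be large relative to the number of corpora.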


Here we compare our model with recent state-of-the-art models on the five datasets mentioned above. We use F1 score as the evaluation metric.

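| Model | BC2GM | BC4CHEMD | BC5CDR | NCBI-disease | JNLPBA |
| --- | --- | --- | --- | --- | --- |
| Dataset Benchmark | - | 88.06 | 86.76 | 82.90 | 72.55 |
| Crichton et al. 2016 | 73.17 | 83.02 | 83.90 | 80.37 | 70.09 |
| Lample et al. 2016 | 80.51 | 87.74 | 86.92 | 85.80 | 73.48 |
| Ma and Hovy 2016 | 78.48 | 86.84 | 86.65 | 82.62 | 72.68 |
| Liu et al. 2018 | 80.00 | 88.75 | 86.96 | 83.92 | 72.17 |
| Our Model | 80.74 | 89.37 | 88.78 | 86.14 | 73.52 |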
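For readers unfamiliar with the metric, entity-level precision, recall and F1 can be computed as sketched below. This assumes exact-match scoring, where a predicted entity counts as correct only if its boundaries and type both match a gold entity; the evaluation code in this repository may differ in details, and the function name is hypothetical.

```python
def precision_recall_f1(gold_spans, pred_spans):
    """Compute exact-match precision, recall and F1 over entity spans.

    Each span is any hashable description of an entity, for example a
    (start, end, entity_type) tuple.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The scores in the table are percentages, so a value returned by this function would be multiplied by 100 for comparison.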


The training script provides an option to output the annotation results directly during training via the --output_annotation parameter, i.e.,

python3 --train_file [training file 1] [training file 2] ... [training file N] \
                    --dev_file [developing file 1] [developing file 2] ... [developing file N] \
                    --test_file [testing file 1] [testing file 2] ... [testing file N] \
                    --caseless --fine_tune --emb_file [embedding file] --shrink_embedding --output_annotation --word_dim 200 --gpu 0

If users do not use --output_annotation, the best performing model during the training process will be saved in ./checkpoint/.

Pre-trained Model

We have released our pre-trained model. You can download the Arg file and the Model file and put them in ./checkpoint/.

Using the saved model, the annotation script can be applied to annotate raw text. Its options can be listed with the command

python -h

and a running command example is provided below:

python3 --load_arg checkpoint/cwlm_lstm_crf.json --load_check_point checkpoint/cwlm_lstm_crf.model --input_file test.tsv --output_file annotate/output --gpu 0

The annotation results will be in ./annotate/.

The input format is similar to CoNLL, but each line must contain exactly one field, the token. For example, an input file could be:


and the corresponding output is:

The O
severe O
anemia O
( O
hemoglobin B-GENE
) O
appeared O
to O
be O
the O
primary O
etiologic O
factor O
. O 
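To turn token-level output like the above into a list of entities, the tags can be grouped as in the sketch below. It is an illustrative post-processing helper, not part of this repository: the name `extract_entities` is hypothetical, and it assumes each output line holds a token followed by a tag with a `B-`, `I-`, `E-` or `S-` prefix, or `O`.

```python
def extract_entities(lines):
    """Group 'token tag' lines into (entity_text, entity_type) tuples."""
    entities, current, current_type = [], [], None

    def flush():
        # Close the entity being built, if any.
        nonlocal current, current_type
        if current:
            entities.append((" ".join(current), current_type))
        current, current_type = [], None

    for line in lines:
        fields = line.split()
        if len(fields) < 2:          # blank line: sentence boundary
            flush()
            continue
        token, tag = fields[0], fields[-1]
        prefix, _, etype = tag.partition("-")
        if prefix in ("B", "S") or (prefix in ("I", "E") and etype != current_type):
            # Start a new entity (also tolerates an I-/E- tag with no open entity).
            flush()
            current, current_type = [token], etype
        elif prefix in ("I", "E"):
            current.append(token)
        else:                        # "O" or an unrecognised tag
            flush()
            continue
        if prefix in ("S", "E"):     # these prefixes close the entity at once
            flush()
    flush()
    return entities
```

Applied to the output above, this yields a single entity, `("hemoglobin", "GENE")`. Tokens are joined with spaces, so the original spacing of the raw text is not recovered.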


If you find the implementation useful, please cite the following paper:

  title={Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning},
  author={Wang, Xuan and Zhang, Yu and Ren, Xiang and Zhang, Yuhao and Zitnik, Marinka and Shang, Jingbo and Langlotz, Curtis and Han, Jiawei},
  journal={arXiv preprint arXiv:1801.09851},