End-to-End Pipeline for the BioCreative VI Precision Medicine Relation Extraction Task

This code constitutes an end-to-end system for extracting interacting protein-protein pairs affected by a genetic mutation. The pipeline includes three primary components: a supervised named entity recognition (NER) model, a gene normalization (GN) system, and a supervised relation classification (RC) model.

The challenge/task description can be found here:


Requirements

python 2.7
tensorflow 1.0.0 with tensorflow-fold

In order to train the NER and RC models, pre-trained word embeddings are needed. Since these files tend to be large (gigabytes), they must be downloaded separately into the embeddings/ folder; see embeddings/ for details on how to do this.
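The exact embedding file format depends on the source you download from; as an illustration only, assuming a plain-text word2vec-style layout (one token per line followed by its vector components), a loader might look like:

```python
def load_text_embeddings(path, vocab=None):
    """Load word vectors from a whitespace-separated text file.

    Each line: <token> <v1> <v2> ... <vd>. If `vocab` is given,
    only tokens in it are kept, to save memory on gigabyte files.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip headers or blank lines
            token, values = parts[0], parts[1:]
            if vocab is not None and token not in vocab:
                continue
            vectors[token] = [float(v) for v in values]
    return vectors
```

Restricting the load to the corpus vocabulary is what keeps memory usage manageable for large embedding files.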


This will train on the pre-formatted data provided in the corpus_train directory. Note that this data is compiled from multiple sources including an augmented version of the original training data; please see the paper below for more details on the training data composition.

Training the supervised models may take up to 24 hours depending on hardware.

Named Entity Recognition (NER)

From the ner_model directory, run:

python --datapath=../corpus_train

This will create a folder saved_model_autumn in corpus_train where the trained models will be saved (each part of an ensemble). A word vocabulary word_vocab.ner.txt will also be created in the same folder. These files will be loaded during the annotation process.
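The actual tokenization and cutoff rules for word_vocab.ner.txt are defined by the training script, but the general shape of building a frequency-sorted vocabulary file can be sketched as follows (function names and the `min_count` threshold are illustrative, not the repository's API):

```python
from collections import Counter

def build_word_vocab(sentences, min_count=1):
    """Collect a frequency-sorted vocabulary from tokenized sentences.

    `sentences` is an iterable of token lists; tokens occurring fewer
    than `min_count` times are dropped.
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    return [tok for tok, c in counts.most_common() if c >= min_count]

def write_vocab(path, vocab):
    """Write one token per line, most frequent first."""
    with open(path, "w", encoding="utf-8") as f:
        for tok in vocab:
            f.write(tok + "\n")
```

The same vocabulary file must be used at annotation time so that token indices match the trained model.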

Relation Classification (RC)

From the rc_model directory, run:

python --datapath=../corpus_train

This will similarly create a saved_model_ppi folder and word vocabulary file in the corpus_train directory.

Test Set Annotation

All intermediate outputs of the pipeline will be saved to the pipeline_test folder. Let's assume we want to annotate articles from the file Final_Gold_Relation.json corresponding to the test set. Keep in mind that this file contains both the articles and their ground-truth annotations. First, we will run the following command from inside the pipeline_test directory:

python Final_Gold_Relation.json

This will create a new file pipeline_feed.txt serving as input to our pipeline. The input includes only the PMID and title/abstract text of each article, one article per line; no ground-truth annotations are fed into the pipeline. Once a feed file is generated, we can run a series of commands from the root directory to process the feed through each component of the pipeline. For convenience, we provide a bash script to handle this aspect of the system:


echo "Step 1 / Tokenizing input feed / $PTOKEN"
python -u $PIPELINE/ < $INPUT > $PTOKEN
echo "Step 2 / NER Annotations / $PNER"
python -u ner_model/ --datapath=$DATAPATH < $PTOKEN > $PNER
echo "Step 3 / NER Corrections / $PNER2"
python -u ner_correction/ --datapath=$DATAPATH < $PNER > $PNER2
echo "Step 4 / Gene Normalization / $PGNORM"
python -u gn_model/ --datapath=$DATAPATH --cachepath=$GNCACHE < $PNER2 > $PGNORM
echo "Step 5 / PPIm Extraction / $OUTPUT"
python -u rc_model/ --datapath=$DATAPATH < $PGNORM > $OUTPUT
echo "Done" 
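The feed-generation step described above can be sketched as follows, under an assumed layout of the task JSON (a top-level "documents" list whose entries carry an "id" and "passages" with "text" fields — check the actual file before relying on these names):

```python
import json

def make_pipeline_feed(gold_json_path, feed_path):
    """Write one line per article: PMID<TAB>title and abstract text.

    Assumes a "documents" list whose entries have an "id" and
    "passages" with "text" fields (an assumption about the schema).
    No relation annotations are copied over.
    """
    with open(gold_json_path, encoding="utf-8") as f:
        data = json.load(f)
    with open(feed_path, "w", encoding="utf-8") as out:
        for doc in data["documents"]:
            # Concatenate all passage texts (title + abstract).
            text = " ".join(p["text"] for p in doc.get("passages", []))
            out.write("%s\t%s\n" % (doc["id"], text))
```

Keeping the feed to one article per line is what lets each pipeline stage stream its input and output through stdin/stdout, as the bash script above does.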

WARNING: Please note that, for gene normalization, the program will make frequent external queries to the NCBI gene database. Caching is implemented to avoid querying redundant information.

It is sufficient to run the following command from the root directory:


This will produce an output file at pipeline_test/pipeline_output.txt. The last step is to take these predictions and put them in a format that can be processed by the official evaluation script. A simple way to conform to the required format is to take the JSON test file, replace all ground-truth relation annotations with our own predictions, and save the result as a new JSON file. Here we use a utility script to accomplish this:

python Final_Gold_Relation.json

This will produce a nicely formatted JSON file with our predicted annotations: PMtask_results.json.
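The substitution step amounts to overwriting each document's relation list while leaving everything else in the JSON intact. A hedged sketch, assuming a per-document "relations" field and predictions keyed by PMID (field names are assumptions, not the task's confirmed schema):

```python
import json

def substitute_relations(gold_path, predictions, out_path):
    """Replace each document's gold relations with our predictions.

    `predictions` maps PMID -> list of relation dicts; documents with
    no prediction get an empty relation list. The "documents"/"id"/
    "relations" field names are assumptions about the schema.
    """
    with open(gold_path, encoding="utf-8") as f:
        data = json.load(f)
    for doc in data["documents"]:
        doc["relations"] = predictions.get(doc["id"], [])
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2)
```

Because only the relation lists are touched, the output stays structurally identical to the gold file, which is exactly what the evaluation script expects.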


Evaluation

The official evaluation script is found here:

To run the evaluation script for this dataset:

python relation Final_Gold_Relation.json PMtask_results.json
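The relation evaluation compares the predicted and gold relation sets of each document; micro-averaged precision, recall, and F1 over those sets can be sketched as follows (a simplified illustration, not the official script's exact logic):

```python
def micro_prf(gold, pred):
    """Micro-averaged precision/recall/F1 over per-document relation sets.

    `gold` and `pred` map document IDs to sets of hashable relations,
    e.g. tuples of normalized gene IDs.
    """
    tp = fp = fn = 0
    for pmid in set(gold) | set(pred):
        g = gold.get(pmid, set())
        p = pred.get(pmid, set())
        tp += len(g & p)   # relations we predicted correctly
        fp += len(p - g)   # relations we predicted but are not gold
        fn += len(g - p)   # gold relations we missed
    prec = tp / float(tp + fp) if tp + fp else 0.0
    rec = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Micro-averaging pools counts across all documents, so articles with many relations weigh more than articles with few.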


Please consider citing the following paper if you use this software in your work:


The NER model of this system is heavily inspired by the model proposed in the following paper:

Chiu, Jason P.C., and Eric Nichols. "Named Entity Recognition with Bidirectional LSTM-CNNs." Transactions of the Association for Computational Linguistics 4 (2016): 357-370.


Tung Tran
tung.tran [at]

