This repository contains code and data for the following article:
Wojciech Kusa, Michael Spranger. External Evaluation of Event Extraction Classifiers for Automatic Pathway Curation: An extended study of the mTOR pathway. In Proceedings of the 2017 Workshop on Biomedical Natural Language Processing (BioNLP 2017), pages 247–256. Association for Computational Linguistics, 2017.
This software was tested on Ubuntu 16.04.
This project requires Python 2.7.
$ conda create -n bioNLP2017 python=2.7
$ conda activate bioNLP2017
(bioNLP2017) $ pip install -r requirements.txt
This project uses the Turku Event Extraction System (TEES), version 2.2.1. To install all of its dependencies (classifiers, models, corpora and preprocessing tools), run:
(bioNLP2017) $ python tees/configure.py
TEES installs the following dependencies:
- GENIA Sentence Splitter
- BANNER named entity recognizer
- BLLIP parser
- Stanford Parser
The Stanford Parser and BLLIP require Java, g++, flex and Ruby. Use the following command to install them if they are missing on your system:
(bioNLP2017) $ sudo apt-get install g++ ruby flex default-jre
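Before running the install command, it can help to see which of these tools are actually missing. The following is a minimal sketch, not part of this repository: `missing_tools` is a hypothetical helper name, and the sketch assumes that the `java` binary is what `default-jre` provides.

```shell
# Hypothetical helper (not part of this repository):
# print the name of every given tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tool names mirror the apt-get line above; java is assumed to come from default-jre.
missing_tools java g++ flex ruby
```

If the call prints nothing, all four tools are already available and the `apt-get` step can be skipped.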
After TEES has been installed successfully, export the path to its local settings file:
(bioNLP2017) $ export TEES_SETTINGS=/home/${USER}/.tees_local_settings.py
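A quick way to confirm that the variable points at a real file is sketched below. `check_tees_settings` is a hypothetical helper, not part of TEES or this repository, and the sketch assumes that `tees/configure.py` wrote the settings file to the exported location.

```shell
# Hypothetical helper (not part of TEES or this repository):
# succeed only if TEES_SETTINGS is set and names an existing file.
check_tees_settings() {
  if [ -z "${TEES_SETTINGS:-}" ]; then
    echo "TEES_SETTINGS is not set" >&2
    return 1
  fi
  if [ ! -f "$TEES_SETTINGS" ]; then
    echo "TEES_SETTINGS points to a missing file: $TEES_SETTINGS" >&2
    return 1
  fi
  echo "Using TEES settings from $TEES_SETTINGS"
}

check_tees_settings || echo "Run tees/configure.py and export TEES_SETTINGS first"
```

Note that `export` only affects the current shell session, so the variable has to be set again in every new terminal unless it is added to a shell startup file.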
Training datasets were created using three different sources:
- ANN consists of 60 abstracts of scientific papers from the PubMed database related to the mTOR pathway map.
- GE11 consists of 908 abstracts and full texts of scientific papers used in the BioNLP ST 2011 GENIA Event Extraction task.
- PC13 consists of 260 abstracts of scientific papers used in the BioNLP ST 2013 Pathway Curation task.
All training datasets are stored in the data/ directory:
- data/GE11-train.tar.gz - standalone GE11
- data/GE11_mTOR-ann-train.tar.gz - GE11+ANN - combined GE11 and ANN
- data/GE11_PC13_mTOR-ann-train.tar.gz - GE11+PC13+ANN - combined GE11, PC13 and ANN
- data/PC13_mTOR-ann-train.tar.gz - PC13+ANN - combined PC13 and ANN
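To see what each archive contains without unpacking it, the entries can be listed with `tar -tzf`. This is only a sketch: `peek_archives` is a hypothetical helper, and the glob assumes that the commands are run from the repository root.

```shell
# Hypothetical helper (not part of this repository):
# show the first five entries of each archive without extracting anything.
peek_archives() {
  for archive in "$@"; do
    if [ -f "$archive" ]; then
      echo "== $archive"
      tar -tzf "$archive" | sed -n '1,5p'
    else
      echo "missing: $archive" >&2
    fi
  done
}

# Matches all four training archives listed above.
peek_archives data/*-train.tar.gz
```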
The "GE11-Devel BioNLP ST2011" dataset was used for hyperparameter optimization of all classifiers.
Test data consists of 449 full-text papers mentioned in the mTOR pathway map. The original PDFs of the papers were downloaded and converted to raw txt files using CERMINE.
The test data, in the form of preprocessed txt files, can be downloaded from here. Extract the documents to the data/evaluation_mTOR_full/ directory.
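After extraction, a simple sanity check is to count the documents. The sketch below assumes one `.txt` file per paper, which this README does not state explicitly; `count_txt` is a hypothetical helper.

```shell
# Hypothetical helper (not part of this repository):
# count the files ending in .txt anywhere below the given directory.
count_txt() {
  find "$1" -type f -name '*.txt' | wc -l | tr -d ' '
}

expected=449  # number of full-text papers stated above
found="$(count_txt data/evaluation_mTOR_full 2>/dev/null)"
if [ "$found" -eq "$expected" ]; then
  echo "Found all $expected documents"
else
  echo "Found $found documents, expected $expected"
fi
```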
First, preprocess the training datasets:
(bioNLP2017) $ python preprocess.py
To train an event extraction model, run:
(bioNLP2017) $ python train.py \
--output_path results/GE11-SVM/ \
--classifier svm \
--train_data data/GE11-train/GE11-train.xml
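The same command can be repeated for the other training sets. The sketch below only prints the commands (a dry run) so they can be reviewed first. It assumes that every archive unpacks to `data/<name>/<name>.xml` in the same way as `GE11-train`, which is confirmed here only for GE11; `print_train_commands` and the `results/<label>-SVM/` naming for the other sets are illustrative choices, not conventions of this repository.

```shell
# Hypothetical helper (not part of this repository):
# print one train.py command per training set; nothing is executed.
# ASSUMPTION: each set lives at data/<name>/<name>.xml, as GE11-train does.
print_train_commands() {
  for name in "$@"; do
    label="${name%-train}"  # e.g. GE11-train -> GE11
    echo "python train.py --output_path results/${label}-SVM/ --classifier svm --train_data data/${name}/${name}.xml"
  done
}

print_train_commands GE11-train GE11_mTOR-ann-train GE11_PC13_mTOR-ann-train PC13_mTOR-ann-train
```

Once the printed paths have been checked against the extracted data, each line can be pasted into the activated `bioNLP2017` environment.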
To run predictions on the mTOR papers with the model trained in the previous step:
(bioNLP2017) $ python predict.py \
--model_path results/GE11-SVM/ \
--input_data data/evaluation_mTOR_full/ \
--output_path results/GE11-SVM/evaluation_mTOR_full/
If you don't want to train your own models, you can use the models included in TEES, e.g.:
(bioNLP2017) $ python predict.py \
--model_path /home/${USER}/.tees/models/GE11-devel \
--input_data data/evaluation_mTOR_full/ \
--output_path results/GE11/SVM/evaluation_mTOR_full/
Results from all 17 pretrained models from the paper can be downloaded from here.
Evaluation was done with scripts from sbnlp/mTOR-evaluation.