Entity and Relation Extraction with Knowledge Bases

This repository includes recent models and data for sentence-level extraction of entities and relations using knowledge bases (i.e., distant supervision). In particular, it contains the source code for the WWW'17 paper CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases.

Task Setting: Given a text corpus with entity mentions detected and heuristically labeled by distant supervision, the task is to identify the relation type (label) between a pair of entity mentions based on the sentence context in which they co-occur.

Run on your own data: Code for producing the JSON files from a raw corpus for running CoType and baseline models is here.

Performance

Performance comparison with several relation extraction systems on the KBP 2013 dataset (sentence-level extraction).

| Method | Precision | Recall | F1 |
|---|---|---|---|
| Mintz (our implementation, Mintz et al., 2009) | 0.296 | 0.387 | 0.335 |
| LINE + Dist Sup (Tang et al., 2015) | 0.360 | 0.257 | 0.299 |
| MultiR (Hoffmann et al., 2011) | 0.325 | 0.278 | 0.301 |
| FCM + Dist Sup (Gormley et al., 2015) | 0.151 | 0.498 | 0.300 |
| CoType (Ren et al., 2017) | 0.348 | 0.406 | 0.369 |

Dependencies

The instructions below use Ubuntu as an example.

  • python 2.7
  • Python library dependencies
$ pip install pexpect ujson tqdm
$ cd code/DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip

Data

We processed three public datasets into our JSON format using our data pipeline. We ran Stanford NER on the training sets to detect entity mentions, and performed distant supervision using DBpedia Spotlight to assign type labels:

  • BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from the MeSH ontology). (Download JSON)
  • NYT (Riedel et al., 2011): 1.18M sentences sampled from 294K New York Times news articles. 395 sentences are manually annotated with 24 relation types and 47 entity types. (Download JSON)
  • Wiki-KBP: the training corpus contains 1.5M sentences sampled from 780k Wikipedia articles (Ling & Weld, 2012) plus ~7,000 sentences from the 2013 KBP corpus. Test data consists of 14k system-labeled sentences from the 2013 KBP slot filling assessment results. It has 13 relation types and 126 entity types after filtering out numeric value relations. (Download JSON)

Please put the data files in the corresponding subdirectories under CoType/data/source.

Makefile

We have included compiled binaries. If you need to re-compile retype.cpp in your own g++ environment, run:

$ cd CoType/code/Model/retype; make

Default Run

Run CoType for the task of Relation Extraction on the Wiki-KBP dataset

Start the Stanford CoreNLP server for the Python wrapper.

$ java -mx4g -cp "code/DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
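Before launching run.sh, it can help to confirm that the server is actually accepting connections. The sketch below is not part of this repository; the function name is hypothetical, and it assumes the server listens on localhost port 9000, which is the CoreNLP server default when no -port option is given (as in the command above).

```python
import socket


def corenlp_server_is_up(host="localhost", port=9000, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Assumption: the CoreNLP server was started without -port, so it
    listens on its default port 9000. This only checks reachability,
    not that the service behind the port is really CoreNLP.
    """
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True
```

Calling `corenlp_server_is_up()` should return True once the java process has finished loading; if it returns False, feature extraction in run.sh will most likely fail when the Python wrapper tries to connect.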

Feature extraction, embedding learning on training data, and evaluation on test data.

$ ./run.sh  

For relation classification, the "none"-labeled instances must first be removed from the train/test JSON files. The hyperparameters for embedding learning are included in the run.sh script.
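One way to strip the "none"-labeled instances is sketched below. It is an illustration, not a script shipped with this repository. It assumes that each line of the JSON file holds one sentence object, that relation mentions live in a `relationMentions` list with a `label` field, and that the negative label is spelled "None" (compared case-insensitively); check these against your own files. Dropping sentences that end up with no relation mentions is also an assumption.

```python
import json


def drop_none_relations(lines, none_label="None"):
    """Yield JSON lines with none-labeled relation mentions removed.

    Field names ("relationMentions", "label") are assumptions about the
    file format. Sentences left without any relation mention are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sent = json.loads(line)
        kept = [rm for rm in sent.get("relationMentions", [])
                if rm.get("label", "").lower() != none_label.lower()]
        if kept:
            sent["relationMentions"] = kept
            yield json.dumps(sent)
```

A typical use would be to open the original train.json, pass the file object to `drop_none_relations`, and write each yielded line to a new file, keeping the original untouched.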

Parameters - run.sh

Dataset to run on.

Data="KBP"

Hyperparameters for relation extraction:

  • KBP: -negative 3 -iters 400 -lr 0.02 -transWeight 1.0
  • NYT: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
  • BioInfer: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0

Hyperparameters for relation classification are included in the run.sh script.
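If you drive the model from Python rather than editing run.sh by hand, the relation extraction settings listed above can be kept in one table. This is a minimal sketch: the dictionary and the helper `retype_flags` are hypothetical names, and only the four flags listed in this README are covered. Any other options that run.sh passes to the binary are not reproduced here.

```python
# Relation extraction hyperparameters, copied from the list in this README.
RE_HYPERPARAMS = {
    "KBP": {"negative": 3, "iters": 400, "lr": 0.02, "transWeight": 1.0},
    "NYT": {"negative": 5, "iters": 700, "lr": 0.02, "transWeight": 7.0},
    "BioInfer": {"negative": 5, "iters": 700, "lr": 0.02, "transWeight": 7.0},
}

# Fixed order so the generated flags match the order used in this README.
FLAG_ORDER = ("negative", "iters", "lr", "transWeight")


def retype_flags(dataset):
    """Return the command-line flags for a dataset as a list of strings."""
    params = RE_HYPERPARAMS[dataset]
    flags = []
    for name in FLAG_ORDER:
        flags.extend(["-" + name, str(params[name])])
    return flags
```

For example, `retype_flags("NYT")` gives the flag list for the NYT row, which can be appended to whatever command line run.sh uses for the model binary.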

Evaluation

Evaluate relation extraction performance (precision, recall, F1): first produce predictions along with their confidence scores, then filter the predicted instances by tuning the threshold.

$ python code/Evaluation/emb_test.py extract KBP retype cosine 0.0
$ python code/Evaluation/tune_threshold.py extract KBP emb retype cosine
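The idea behind threshold tuning can be sketched as follows. This is not the code in tune_threshold.py; the function names and the input layout (dictionaries keyed by a mention identifier) are assumptions made for illustration. Predictions whose confidence falls below the threshold are discarded, and the threshold with the best F1 is kept.

```python
def prf(num_correct, num_predicted, num_gold):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = num_correct / float(num_predicted) if num_predicted else 0.0
    r = num_correct / float(num_gold) if num_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def tune_threshold(predictions, gold, thresholds):
    """Return (threshold, precision, recall, f1) with the highest F1.

    predictions: {mention_id: (label, score)}  (hypothetical layout)
    gold:        {mention_id: label}, assumed to hold only real
                 (non-"none") relation mentions
    """
    best = None
    for t in thresholds:
        # Keep only predictions at or above the confidence threshold.
        kept = {m: lab for m, (lab, score) in predictions.items()
                if score >= t}
        correct = sum(1 for m, lab in kept.items() if gold.get(m) == lab)
        p, r, f1 = prf(correct, len(kept), len(gold))
        if best is None or f1 > best[3]:
            best = (t, p, r, f1)
    return best
```

Raising the threshold usually trades recall for precision, which is why the sketch scans several candidate values instead of fixing one in advance.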

Prediction

The last command in run.sh generates a JSON file of predicted results in the same format as test.json in data/source/$DATANAME, except that it contains only the predicted relation mention labels. Replace the second parameter with the threshold you want to use.

$ python code/Evaluation/convertPredictionToJson.py $Data 0.0

Reference

Please cite the following paper if you find the code and datasets useful:

@inproceedings{ren2017cotype,
 author = {Ren, Xiang and Wu, Zeqiu and He, Wenqi and Qu, Meng and Voss, Clare R. and Ji, Heng and Abdelzaher, Tarek F. and Han, Jiawei},
 title = {CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases},
 booktitle = {Proceedings of the 26th International Conference on World Wide Web},
 year = {2017},
 pages = {1015--1024},
}