Source code and data for the SIGKDD'16 paper "Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding".
Given a text corpus with entity mentions detected and heuristically labeled by distant supervision, this code performs (1) label noise reduction over distant supervision, and (2) learning type classifiers over de-noised training data. For example, check out PLE's output on Tech news.
An end-to-end tool (corpus to typed entities) is under development. Please keep track of our updates.
Performance of fine-grained entity type classification on the Wiki (Ling & Weld, 2012) dataset. We applied PLE to clean the training data, then ran FIGER (Ling & Weld, 2012) over the de-noised labeled data to train type classifiers (hence our final system is named FIGER + PLE).
Method | Accuracy | Macro-F1 | Micro-F1 |
---|---|---|---|
HYENA (Yosef et al., 2012) | 0.288 | 0.528 | 0.506 |
WSABIE (Yogatama et al., 2015) | 0.480 | 0.679 | 0.657 |
FIGER (Ling & Weld, 2012) | 0.474 | 0.692 | 0.655 |
FIGER + All Filter (Gillick et al., 2014) | 0.453 | 0.648 | 0.582 |
FIGER + PLE (Ren et al., 2016) | 0.599 | 0.763 | 0.749 |
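The three columns can be reproduced from per-mention gold and predicted type sets. The following is a minimal sketch, assuming the strict accuracy, loose macro-F1, and loose micro-F1 definitions of Ling & Weld (2012); this README does not spell the definitions out, and the helper name `typing_metrics` and the toy labels are hypothetical, not part of this repository.

```python
def typing_metrics(gold, pred):
    """Return (strict accuracy, loose macro-F1, loose micro-F1).

    gold, pred: equal-length lists of sets of type labels, one set per mention.
    Hypothetical helper for illustration; it is not Evaluation/evaluation.py.
    """
    pairs = list(zip(gold, pred))
    if not pairs:
        raise ValueError("need at least one mention")

    def f1(p, r):
        return 2.0 * p * r / (p + r) if p + r > 0 else 0.0

    # Strict accuracy: the predicted set must match the gold set exactly.
    acc = sum(1.0 for g, p in pairs if g == p) / len(pairs)

    # Loose macro: average per-mention precision and recall, then combine.
    with_pred = [(g, p) for g, p in pairs if p]
    with_gold = [(g, p) for g, p in pairs if g]
    macro_p = (sum(len(g & p) / float(len(p)) for g, p in with_pred) / len(with_pred)
               if with_pred else 0.0)
    macro_r = (sum(len(g & p) / float(len(g)) for g, p in with_gold) / len(with_gold)
               if with_gold else 0.0)

    # Loose micro: pool the label counts over all mentions first.
    overlap = sum(len(g & p) for g, p in pairs)
    n_pred = sum(len(p) for _, p in pairs)
    n_gold = sum(len(g) for g, _ in pairs)
    micro_p = overlap / float(n_pred) if n_pred else 0.0
    micro_r = overlap / float(n_gold) if n_gold else 0.0

    return acc, f1(macro_p, macro_r), f1(micro_p, micro_r)


# Toy data (made up for illustration).
gold = [{"/person", "/person/artist"}, {"/organization"}, {"/location"}]
pred = [{"/person"}, {"/organization"}, {"/location", "/location/city"}]
print("acc=%.3f macro-F1=%.3f micro-F1=%.3f" % typing_metrics(gold, pred))
```

Strict accuracy only rewards exact set matches, which is why it is much lower than the two F1 scores in the table above.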
The output on the BBN dataset can be found here. Each line is a sentence in the BBN test data, with entity mentions and their fine-grained entity types identified.
- python 2.7, g++
- Python library dependencies
$ pip install pexpect unidecode six requests protobuf
- Set up Stanford CoreNLP and its Python wrapper.
$ cd DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip
$ rm stanford-corenlp-full-2016-10-31.zip
- eigen 3.2.5 (already included).
We processed (using our data pipeline) three public datasets into our JSON format. We ran Stanford NER on the training sets to detect entity mentions, and performed distant supervision using DBpediaSpotlight to assign type labels:
- Wiki (Ling & Weld, 2012): 1.5M sentences sampled from 780k Wikipedia articles. 434 news sentences are manually annotated for evaluation. 113 entity types are organized into a 2-level hierarchy. (download JSON)
- OntoNotes (Weischedel et al., 2011): 13k news articles, 77 of which are manually labeled for evaluation. 89 entity types are organized into a 3-level hierarchy. (download JSON)
- BBN (Weischedel et al., 2005): 2,311 WSJ articles that are manually annotated using 93 types in a 2-level hierarchy. (download JSON)
- Type hierarchies for each dataset are included.
- Please put the data files in the corresponding subdirectories under PLE/Data/.
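To illustrate why distantly supervised labels are noisy, here is a toy sketch. It assumes, for illustration only, a hand-written lookup table in place of DBpediaSpotlight and path-style labels such as `/person/actor`; the names `TOY_KB`, `with_ancestors`, and `candidate_types` are hypothetical and do not appear in this repository. Every mention of an entity receives all of that entity's knowledge-base types regardless of the sentence it occurs in, and these context-agnostic candidate types are what PLE is meant to de-noise.

```python
# Toy stand-in for a knowledge base (entries are made up for illustration).
TOY_KB = {
    "Arnold Schwarzenegger": {
        "/person/actor",
        "/person/politician",
        "/person/athlete",
    },
}


def with_ancestors(label):
    """Expand '/person/actor' into {'/person', '/person/actor'}."""
    parts = label.strip("/").split("/")
    return {"/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)}


def candidate_types(mention, kb=TOY_KB):
    """Return all KB types of a mention plus their ancestors, ignoring context."""
    types = set()
    for label in kb.get(mention, ()):
        types |= with_ancestors(label)
    return types


sentences = [
    "Arnold Schwarzenegger starred in The Terminator.",
    "Arnold Schwarzenegger signed the bill into law.",
]
for sentence in sentences:
    # Both sentences get the same candidate set, although only some of the
    # types fit each context: that mismatch is the label noise.
    print(sentence, sorted(candidate_types("Arnold Schwarzenegger")))
```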
We have included compiled binaries. If you need to re-compile hple.cpp under your own g++ environment:
$ cd PLE/Model/ple/; make
Run PLE for label noise reduction on the BBN dataset. The first command starts the Stanford CoreNLP server; the second runs PLE:
$ java -mx4g -cp "DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
$ ./run.sh
- run.sh contains the parameters for running on the three datasets, including the dataset to run on:
Data="BBN"
Evaluate the prediction results (from the classifier trained on de-noised data) on the test data:
$ python Evaluation/evaluation.py BBN hple hete_feature
- Usage: python Evaluation/evaluation.py -DATA(BBN/ontonotes/FIGER) -METHOD(hple/...) -EMB_MODE(hete_feature)
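To evaluate several datasets in one go, the positional arguments shown above can be assembled programmatically. The sketch below assumes the three-argument form from the usage pattern and takes the dataset names from it; the helper name `build_eval_command` is hypothetical and not part of this repository.

```python
# Dataset names as listed in the usage pattern above.
DATASETS = ("BBN", "ontonotes", "FIGER")


def build_eval_command(data, method="hple", emb_mode="hete_feature"):
    """Build the argument list for one evaluation run (hypothetical helper)."""
    if data not in DATASETS:
        raise ValueError("unknown dataset: %r" % (data,))
    return ["python", "Evaluation/evaluation.py", data, method, emb_mode]


# Print the commands instead of executing them, so this is safe to run anywhere.
for data in DATASETS:
    print(" ".join(build_eval_command(data)))
```

To actually run an evaluation, each list can be passed to `subprocess.call` from the repository root, after `run.sh` has been run for that dataset.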
Please cite the following paper if you find the code/datasets useful:
@inproceedings{ren2016label,
title={Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding},
author={Ren, Xiang and He, Wenqi and Qu, Meng and Voss, Clare R and Ji, Heng and Han, Jiawei},
booktitle={Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
pages={1825--1834},
year={2016},
organization={ACM}
}