ACE Reader for the 2004 and 2005 datasets.

Overview

Dataset Annotation Guidelines: https://www.ldc.upenn.edu/collaborations/past-projects/ace/annotation-tasks-and-specifications

Dataset Download links: ACE-2004, ACE-2005

Implementation details

Each document is read into a TextAnnotation instance with the following views defined in the ViewNames class:

  • TOKENS: The basic TokenLabelView generated by a Tokenizer from the raw dataset text.
  • MENTION_ACE: A SpanLabelView with overlapping constituents. Each constituent represents an entity extent and is labeled with the Coarse Entity Type; head-word information and the Fine Entity Type are stored as attributes of the constituent. Relations between entities are represented as edges between Entity constituents.
  • COREF_HEAD, COREF_EXTENT: Each CoreferenceView uses copies of mentions from the NER_ACE_COARSE_* views, marks the longest mention as the canonical mention, and adds coreference edges to the other mentions of the same entity.
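
The canonical-mention rule described for the coreference views can be illustrated with a small, self-contained sketch. This is not the reader's code: the `Mention` class and the `canonical` method are hypothetical stand-ins, and the sketch assumes that "longest" means the widest token span and that ties go to the first mention encountered.

```java
import java.util.Arrays;
import java.util.List;

public class CanonicalMentionSketch {

    /** Hypothetical stand-in for a mention: a token span [start, end) and its text. */
    static final class Mention {
        final int start;
        final int end;
        final String text;

        Mention(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }

        /** Span width in tokens (assumption: length is measured in tokens). */
        int length() {
            return end - start;
        }
    }

    /**
     * Returns the longest mention in a coreference cluster.
     * Ties are resolved in favour of the mention that appears first in the list.
     */
    static Mention canonical(List<Mention> cluster) {
        if (cluster.isEmpty()) {
            throw new IllegalArgumentException("cluster must not be empty");
        }
        Mention best = cluster.get(0);
        for (Mention m : cluster) {
            if (m.length() > best.length()) {
                best = m;
            }
        }
        return best;
    }

    /** Builds a small illustrative cluster; the mentions are made up for the example. */
    static List<Mention> exampleCluster() {
        return Arrays.asList(
                new Mention(0, 1, "she"),
                new Mention(5, 8, "the committee chair"),
                new Mention(14, 16, "the chair"));
    }
}
```

Calling `canonical(exampleCluster())` selects the three-token mention; the remaining mentions in the cluster would then be the targets of the coreference edges.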

Usage

Directory Structure

The reader expects the dataset to be laid out in the following structure:

corpusHomeDir
├── bc
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
├── bn
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
├── cts
│   ├── apf.dtd
│   └── <other files (*.apf.xml, *.sgm)>
└── newswire_nw
    ├── apf.dtd
    └── <other files (*.apf.xml, *.sgm)>

Each sub-directory represents a section, and each section has its own text-parsing logic. The reader selects the parser from the suffix of the section directory's name (for example, newswire_nw ends in nw), according to the following mapping:

  • bn : Broadcast News
  • nw : Newswire
  • bc : Broadcast Conversation
  • wl : Weblog
  • un : Usenet Newsgroups/Discussion Forum
  • cts : Conversational Telephone Speech
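
The suffix rule above can be sketched as a simple lookup. This is an illustrative sketch rather than the reader's implementation: the class and method names are hypothetical, and it assumes the match is case-sensitive and that an unrecognised directory name yields `null`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SectionSuffixSketch {

    // Suffix-to-section table copied from the list above.
    private static final Map<String, String> SUFFIX_TO_SECTION = new LinkedHashMap<>();

    static {
        SUFFIX_TO_SECTION.put("bn", "Broadcast News");
        SUFFIX_TO_SECTION.put("nw", "Newswire");
        SUFFIX_TO_SECTION.put("bc", "Broadcast Conversation");
        SUFFIX_TO_SECTION.put("wl", "Weblog");
        SUFFIX_TO_SECTION.put("un", "Usenet Newsgroups/Discussion Forum");
        SUFFIX_TO_SECTION.put("cts", "Conversational Telephone Speech");
    }

    /**
     * Returns the section type for a directory name such as "newswire_nw",
     * or null when no known suffix matches (assumed behaviour for this sketch).
     */
    static String sectionFor(String directoryName) {
        for (Map.Entry<String, String> entry : SUFFIX_TO_SECTION.entrySet()) {
            if (directoryName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

With this table, `sectionFor("newswire_nw")` and `sectionFor("nw")` both resolve to Newswire, which is why a directory may carry any prefix as long as its name ends with one of the listed suffixes.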

Note: The version of the 2005 corpus for which this reader was developed kept the markup files (.xml, .sgm, etc.) in a single timex2norm directory under each section. The reader should work with that directory structure too.
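
To check which layout a local copy of the corpus uses, a fallback lookup along the following lines may help. It is a minimal sketch under stated assumptions, not the reader's actual logic: the class and method names are hypothetical, and it assumes that a timex2norm sub-directory, when present, holds all of the section's annotation files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MarkupLocatorSketch {

    /**
     * Lists the *.apf.xml files of one section directory, looking inside
     * a timex2norm sub-directory when it exists and in the section
     * directory itself otherwise. Results are sorted by path.
     */
    static List<Path> annotationFiles(Path sectionDir) throws IOException {
        Path nested = sectionDir.resolve("timex2norm");
        Path searchDir = Files.isDirectory(nested) ? nested : sectionDir;

        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                     Files.newDirectoryStream(searchDir, "*.apf.xml")) {
            for (Path file : stream) {
                result.add(file);
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

Running `annotationFiles` on each section directory before constructing the reader gives a quick sanity check that the corpus path points at the expected level of the directory tree.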

Java Usage

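import edu.illinois.cs.cogcomp.core.datastructures.textannotation.TextAnnotation;
import edu.illinois.cs.cogcomp.nlp.corpusreaders.ACEReader;

// In these examples the final boolean argument is true for ACE-2004 and false for ACE-2005.

// Read all sections in ACE-2004
ACEReader reader2004 = new ACEReader("data/ace2004/data/English", true);

// Read all sections in ACE-2005
ACEReader reader2005 = new ACEReader("data/ace2005/data/English", false);

// Read limited sections only (sections are named by their directory suffix)
String[] sections = new String[] { "nw", "bn" };
ACEReader reader2004Partial = new ACEReader("data/ace2004/data/English", sections, true);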

ACEReader implements the Iterable<TextAnnotation> interface.

Sample Usage (here reader is any of the ACEReader instances created above):

while (reader.hasNext()) {
	TextAnnotation doc = reader.next();
	...
}

or

for (TextAnnotation doc : reader) {
	...
}

Caveats

  • Values, TimeEX, and Events are not currently parsed.