

These files contain the train and test data for the three parts of 
the CoNLL-2002 shared task:

   esp.testa: Spanish test data for the development stage
   esp.testb: Spanish test data
   esp.train: Spanish train data
   ned.testa: Dutch test data for the development stage
   ned.testb: Dutch test data
   ned.train: Dutch train data

All data files contain a single word per line with its associated 
named entity tag in the IOB2 format (Tjong Kim Sang and Veenstra,
EACL 1999). Sentence breaks are encoded by empty lines. Additionally,
the Dutch data contains unchecked part-of-speech tags generated
by the MBT tagger (Daelemans, WVLC 1996). In the Dutch data,
article boundaries have been marked by a special tag (-DOCSTART-).
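
The format described above can be read with a few lines of Python. The
following is a minimal sketch, not part of the original distribution. It
assumes that columns are separated by whitespace, that the named entity
tag is always the last column, and that the Dutch part-of-speech tag
sits between the word and the entity tag. The function names read_conll
and entity_spans and the sample lines are hypothetical.

```python
def read_conll(lines):
    """Yield sentences as lists of (word, pos_or_None, ne_tag) tuples.

    Assumption: whitespace-separated columns, NE tag in the last column,
    optional POS tag in the middle column (Dutch files).
    """
    sentence = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split()
        word, tag = fields[0], fields[-1]
        pos = fields[1] if len(fields) == 3 else None
        sentence.append((word, pos, tag))
    if sentence:
        yield sentence


def entity_spans(tags):
    """Return (type, start, end) spans from IOB2 tags; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and tag[2:] == etype
        if not continues and start is not None:
            # B-, O, or an I- tag of another type closes the open span.
            spans.append((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Lenient handling of an I- tag without a preceding B- tag.
            start, etype = i, tag[2:]
    if start is not None:
        spans.append((etype, start, len(tags)))
    return spans


# Invented sample lines: a two-column sentence and a three-column sentence.
sample = ["Juan B-PER", "Carlos I-PER", "vive O", "", "Amsterdam N B-LOC", ""]
sentences = list(read_conll(sample))
spans = entity_spans([tag for _, _, tag in sentences[0]])
```

In IOB2, every entity starts with a B- tag, so a span ends at the next
B- tag, at an O tag, or at the end of the sentence.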

Associated url:


* Files in these directories may only be used for research
  applications in the context of the CoNLL-2002 shared task.
  No permission is given for use in other applications, especially
  not for commercial applications.
* Some redundant empty lines were removed from the Spanish 
  data files on May 1, 2002. The extra empty lines had no effect 
  on the evaluation results.
* An extra round of checks has been applied to the Dutch data files,
  and these were replaced by new versions on August 22, 2002.
  The original Dutch files that were used by the participants
  of CoNLL-2002 can be found in the subdirectory OldFiles.
* Note that for copyright reasons the sentences in the Dutch files 
  have been randomized within each article. Your system can rely on 
  sentences between two article boundaries belonging to the same
  article, but it should not rely on first occurrences of entities.
* Xavier Carreras provides the Spanish data sets with part of speech 
  tags at (20030803)
* Inconsistencies in the named entity annotation can be reported
  to Erik Tjong Kim Sang <>. 
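
Because sentences between two article boundaries belong to the same
article, the Dutch sentences can be grouped per article. The following
is a minimal sketch under stated assumptions: each sentence is a list of
token tuples whose first element is the word, and the -DOCSTART- marker
appears in the word position at the start of an article. The function
name group_by_article and the sample tokens are hypothetical.

```python
DOCSTART = "-DOCSTART-"


def group_by_article(sentences):
    """Yield articles as lists of sentences, split at -DOCSTART- markers.

    Assumption: each token is a tuple whose first element is the word,
    and the marker occurs at the start of an article.
    """
    article = []
    for sentence in sentences:
        if any(token[0] == DOCSTART for token in sentence):
            # A marker opens a new article; emit the previous one first.
            if article:
                yield article
            article = []
            sentence = [token for token in sentence if token[0] != DOCSTART]
        if sentence:
            article.append(sentence)
    if article:
        yield article


# Invented sample: two articles, the second with the marker inside a sentence.
sample_sentences = [
    [("-DOCSTART-", "X", "O")],
    [("De", "Art", "O")],
    [("Gent", "N", "B-LOC")],
    [("-DOCSTART-", "X", "O"), ("Hij", "Pron", "O")],
]
articles = list(group_by_article(sample_sentences))
```

Since the sentence order within an article is randomized, features that
depend on the first mention of an entity in an article are not reliable.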


The Spanish data is a collection of news wire articles made
available by the Spanish EFE News Agency. The articles are from 
May 2000. The annotation was carried out by the TALP Research 
Center of the Technical University of Catalonia (UPC) and the 
Center of Language and Computation (CLiC) of the University of 
Barcelona (UB), and funded by the European Commission through 
the NAMIC project (IST-1999-12392).

The Dutch data consist of four editions of the Belgian newspaper
"De Morgen" of 2000 (June 2, July 1, August 1 and September 1).
The data was annotated as a part of the Atranos project
( at the University of