Different datasets for developing and testing keyword extraction algorithms
Latest commit ba4966c May 19, 2015

Keyword Extraction Datasets

Different datasets for developing, evaluating and testing keyword extraction algorithms. For benchmarking performance see: O. Medelyan. 2009. Human-competitive automatic topic indexing. PhD Thesis. University of Waikato, New Zealand.

Extracting keywords using a controlled vocabulary or a thesaurus as a source:

500 PubMed documents with MeSH terms

fao780.tar.gz - 780 FAO publications with Agrovoc terms

fao30.tar.gz - 30 FAO publications, each annotated by 6 professional FAO indexers

Free-text keyword extraction (without a vocabulary):

citeulike180.tar.gz - 180 publications crawled from CiteULike, with keywords assigned by the different CiteULike users who saved these publications

SemEval-2010 Keyphrase extraction track data in Maui format

keyphrextr.tar.gz - Keyphrase extraction model created using SemEval-2010 training data. This model is used in the Maui GPL demo when no vocabulary is selected.

Extracting keywords using Wikipedia as a controlled vocabulary of allowed terms:

wiki20.tar.gz - 20 Computer Science papers, each annotated with at least 5 Wikipedia articles by 15 teams of indexers
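Once an archive is unpacked, each document needs to be paired with its gold-standard keywords before an algorithm can be evaluated. The following is a minimal sketch of that step, not a description of the exact layout of every archive listed above. It assumes the common Maui convention in which a document `name.txt` sits next to a `name.key` file with one keyword per line; this README does not confirm the layout, and archives with several indexers per document may use a subdirectory per indexer. The function name `load_keyword_dataset` is hypothetical.

```python
from pathlib import Path


def load_keyword_dataset(root):
    """Pair each document with its list of assigned keywords.

    Assumption (not confirmed by this README): every document `name.txt`
    has a sibling `name.key` file holding one keyword per line.
    Returns a dict mapping the path relative to `root` (without suffix)
    to a (document_text, keywords) tuple.
    """
    root = Path(root)
    dataset = {}
    # rglob also descends into subdirectories, e.g. one folder per indexer
    for key_file in sorted(root.rglob("*.key")):
        doc_file = key_file.with_suffix(".txt")
        if not doc_file.exists():
            continue  # skip keyword files that have no matching document
        raw_keys = key_file.read_text(encoding="utf-8", errors="replace")
        keywords = [line.strip() for line in raw_keys.splitlines() if line.strip()]
        text = doc_file.read_text(encoding="utf-8", errors="replace")
        name = key_file.relative_to(root).with_suffix("").as_posix()
        dataset[name] = (text, keywords)
    return dataset
```

An archive such as citeulike180.tar.gz could first be unpacked with the standard library, for example `tarfile.open("citeulike180.tar.gz").extractall("data")`, and the resulting directory passed to `load_keyword_dataset("data")`. Keying the result by relative path keeps annotations from different indexers separate if they are stored in different subdirectories.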
