Importing Datasets

The datasets module provides easy access to existing standard test collections, particularly those from TREC. Each defined dataset can download and provide easy access to:

files containing the documents of the corpus

topics (queries), as a dataframe, ready for retrieval

relevance assessments (aka, labels or qrels), as a dataframe, ready for evaluation

ready-made Terrier indices, where appropriate

pyterrier.datasets.list_datasets()

pyterrier.datasets.find_datasets()

pyterrier.datasets.get_dataset()

pyterrier.datasets.Dataset

Examples

Many of the PyTerrier unit tests are based on the Vaswani NPL test collection, a corpus of ~11,000 scientific abstracts. PyTerrier provides a ready-made index for this collection on the Terrier Data Repository, which allows experiments to be conducted easily:

import pyterrier as pt  # older PyTerrier releases also need pt.init() before first use

dataset = pt.get_dataset("vaswani")
bm25 = pt.BatchRetrieve.from_dataset(dataset, "terrier_stemmed", wmodel="BM25")
dph = pt.BatchRetrieve.from_dataset(dataset, "terrier_stemmed", wmodel="DPH")
pt.Experiment(
    [bm25, dph],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map"]
)
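To make the inputs to the experiment above more concrete, the following is a minimal plain-Python sketch of what the "map" metric computes from topics, qrels and ranked results. It uses lists of dicts as toy stand-ins for the dataframes, with the qid, query, docno and label column names; all queries, document ids and rankings are invented for illustration, and the helper functions are hypothetical rather than part of PyTerrier.

```python
# Toy stand-ins for the topics and qrels dataframes (values are invented).
topics = [
    {"qid": "1", "query": "dielectric constant measurement"},
    {"qid": "2", "query": "circuit analysis"},
]
qrels = [
    {"qid": "1", "docno": "d1", "label": 1},
    {"qid": "1", "docno": "d3", "label": 1},
    {"qid": "2", "docno": "d2", "label": 1},
]
# Invented ranked results per query id, best document first.
results = {
    "1": ["d1", "d2", "d3"],
    "2": ["d4", "d2"],
}


def average_precision(ranking, relevant):
    """Average of precision values at the ranks where relevant documents appear."""
    hits = 0
    total = 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(topics, qrels, results):
    """Mean of average precision over all topics (hypothetical helper)."""
    relevant = {}
    for row in qrels:
        if row["label"] > 0:
            relevant.setdefault(row["qid"], set()).add(row["docno"])
    scores = [
        average_precision(results.get(t["qid"], []), relevant.get(t["qid"], set()))
        for t in topics
    ]
    return sum(scores) / len(scores)


map_score = mean_average_precision(topics, qrels, results)
print(round(map_score, 4))
```

In practice pt.Experiment performs this evaluation for every system and metric requested, so a sketch like this is only useful for understanding the numbers it reports.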

Indexing the MSMARCO document corpus and then retrieving from the resulting index can be achieved as follows:

dataset = pt.get_dataset("trec-deep-learning-docs")
indexer = pt.TRECCollectionIndexer("./index")
# this downloads the file msmarco-docs.trec.gz 
indexref = indexer.index(dataset.get_corpus())
index = pt.IndexFactory.of(indexref)

DPH_br = pt.BatchRetrieve(index, wmodel="DPH") % 100
BM25_br = pt.BatchRetrieve(index, wmodel="BM25") % 100
# this runs an experiment to obtain results on the TREC 2019 Deep Learning track queries and qrels
pt.Experiment(
    [DPH_br, BM25_br], 
    dataset.get_topics("test"), 
    dataset.get_qrels("test"), 
    eval_metrics=["recip_rank", "ndcg_cut_10", "map"])
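In the experiment above, `% 100` is the rank cutoff operator, which keeps only the top 100 results for each query before evaluation. The plain-Python sketch below illustrates the idea together with the "recip_rank" metric. ToyRetriever and reciprocal_rank are hypothetical names introduced here for illustration, and the rankings are invented; this is not how PyTerrier implements the operator internally.

```python
class ToyRetriever:
    """Hypothetical stand-in for a retriever that returns a fixed ranking per query id."""

    def __init__(self, rankings, cutoff=None):
        self.rankings = rankings
        self.cutoff = cutoff

    def __mod__(self, k):
        # retriever % k returns a new retriever that keeps only the top k results
        return ToyRetriever(self.rankings, cutoff=k)

    def search(self, qid):
        ranking = self.rankings.get(qid, [])
        return ranking if self.cutoff is None else ranking[: self.cutoff]


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


full = ToyRetriever({"q1": ["d5", "d2", "d9", "d1"]})
top2 = full % 2

print(full.search("q1"))
print(top2.search("q1"))
print(reciprocal_rank(full.search("q1"), {"d9"}))
print(reciprocal_rank(top2.search("q1"), {"d9"}))
```

The sketch shows why the cutoff matters for evaluation: a relevant document ranked below the cutoff no longer contributes to the metric.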

For more details on use of MSMARCO, see our MSMARCO leaderboard submission notebooks.

You can also index datasets that include a corpus using IterDictIndexer and get_corpus_iter:

dataset = pt.datasets.get_dataset('irds:cord19/trec-covid')
indexer = pt.index.IterDictIndexer('./cord19-index')
indexref = indexer.index(dataset.get_corpus_iter(), fields=('title', 'abstract'))
index = pt.IndexFactory.of(indexref)

DPH_br = pt.BatchRetrieve(index, wmodel="DPH") % 100
BM25_br = pt.BatchRetrieve(index, wmodel="BM25") % 100
# this runs an experiment to obtain results on the TREC COVID queries and qrels
pt.Experiment(
    [DPH_br, BM25_br], 
    dataset.get_topics('title'), 
    dataset.get_qrels(), 
    eval_metrics=["P_5", "P_10", "ndcg_cut_10", "map"])
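get_corpus_iter provides the corpus as an iterable of dictionaries, one per document, each carrying a docno and the document's text fields. The sketch below mimics that shape with a toy generator and shows which text the chosen fields would select. The function names and documents are invented for illustration, and the real indexer tokenises the text and builds an index rather than joining strings.

```python
def toy_corpus_iter():
    """Hypothetical stand-in for dataset.get_corpus_iter(); the documents are invented."""
    yield {
        "docno": "doc1",
        "title": "Coronavirus transmission",
        "abstract": "A study of airborne spread.",
    }
    # A document with an empty abstract, to show that missing text is skipped.
    yield {"docno": "doc2", "title": "Vaccine trials", "abstract": ""}


def text_to_index(doc, fields=("title", "abstract")):
    """Join the non-empty values of the selected fields into one string."""
    return " ".join(doc[f] for f in fields if doc.get(f))


indexed = {doc["docno"]: text_to_index(doc) for doc in toy_corpus_iter()}
for docno, text in indexed.items():
    print(docno, "->", text)
```

Because the corpus is consumed as an iterator, documents can be streamed to the indexer without loading the whole collection into memory.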

Available Datasets

The table below lists the provided datasets, detailing the attributes available for each dataset. In each column, True designates the presence of a single artefact of that type, while a list denotes the available variants. Datasets with the irds: prefix are from the ir_datasets package; further documentation on these datasets can be found in the ir_datasets documentation.
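As a rough guide to reading a table cell, the sketch below maps a cell value to the variant names it implies: True means a single artefact fetched without a variant name, as in dataset.get_qrels(), while a list means each variant is fetched by name, as in dataset.get_topics("test"). The helper variants_for and the example row are hypothetical and are not taken from the real table.

```python
def variants_for(cell):
    """Return the variant names implied by one table cell.

    True  -> [None]: a single artefact, fetched with no variant argument
    list  -> the listed variants, each fetched by name
    other -> []: nothing of this type is available
    """
    if cell is True:
        return [None]
    if isinstance(cell, (list, tuple)):
        return list(cell)
    return []


# Invented row for illustration; not copied from the real table.
toy_row = {
    "corpus": True,
    "topics": ["train", "dev", "test"],
    "qrels": ["dev", "test"],
    "index": None,
}

for attribute, cell in toy_row.items():
    print(attribute, variants_for(cell))
```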
