# Introduction to PyTerrier

_IN4325: Information retrieval lecture, TU Delft_

**Part 3: Datasets**

This notebook focuses on IR datasets and pre-made indexes that can be loaded automatically in PyTerrier.


In [1]:
%pip install python-terrier==0.10.0

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm="notebook")

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


## Importing datasets

PyTerrier provides a large number of datasets that can be loaded directly. This is convenient because parsing is already taken care of and any required files are downloaded automatically.

A list of available datasets can be found [in the documentation](https://pyterrier.readthedocs.io/en/latest/datasets.html#available-datasets) or by calling the following function:


In [3]:
pt.datasets.list_datasets()

[INFO] trec-robust04 is deprecated. Consider using disks45/nocr/trec-robust-2004 instead, which provides better parsing of the corpus.
[INFO] trec-robust04/fold1 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold1 instead, which provides better parsing of the corpus.
[INFO] trec-robust04/fold2 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold2 instead, which provides better parsing of the corpus.
[INFO] trec-robust04/fold3 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold3 instead, which provides better parsing of the corpus.
[INFO] trec-robust04/fold4 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold4 instead, which provides better parsing of the corpus.
[INFO] trec-robust04/fold5 is deprecated. Consider using disks45/nocr/trec-robust-2004/fold5 instead, which provides better parsing of the corpus.
[INFO] acessing TREC Fair Ranking 2021 through trec-fair-2021 is deprecated; use trec-fair/2021 instead.
[INFO] acessing TREC Fair

| | dataset | topics | topics_lang | qrels | corpus | corpus_lang | index | info_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 50pct | [training, validation] | en | [training, validation] | | | [ex2, ex3] | |
| 1 | antique | [train, test] | en | [train, test] | True | en | | https://ciir.cs.umass.edu/downloads/Antique/re... |
| 2 | vaswani | True | en | True | True | en | True | http://ir.dcs.gla.ac.uk/resources/test_collect... |
| 3 | msmarco_document | [train, dev, test, test-2020, leaderboard-2020] | en | [train, dev, test, test-2020] | True | en | True | https://microsoft.github.io/msmarco/ |
| 4 | msmarcov2_document | [train, dev1, dev2, valid1, valid2, trec_2021] | en | [train, dev1, dev2, valid1, valid2] | | | True | https://microsoft.github.io/msmarco/TREC-Deep-... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 746 | irds:neuclir | | | | | | | https://ir-datasets.com/neuclir.html |
| 747 | irds:neuclir/1 | | | | | | | https://ir-datasets.com/neuclir.html#neuclir/1 |
| 754 | irds:sara | True | en | True | True | en | | https://ir-datasets.com/sara.html |
| 755 | trec-deep-learning-docs | [train, dev, test, test-2020, leaderboard-2020] | en | [train, dev, test, test-2020] | True | en | True | https://microsoft.github.io/msmarco/ |


A dataset can provide the following components:

- corpus (the documents),
- index (pre-made, ready to use),
- topics (queries or topic descriptions, grouped in folds or splits),
- qrels (relevance judgments for the queries; we'll use these for evaluation in an upcoming notebook).

Note that, for many datasets, some of these components are missing. Furthermore, the prefix `irds:` denotes that the corresponding dataset is loaded from the [`ir_datasets`](https://ir-datasets.com/) library, which seamlessly integrates with PyTerrier.
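
To make the point about missing components concrete, here is a minimal sketch in plain Python. It copies three rows from the `list_datasets()` output above into dictionaries (with `None` standing in for an empty cell) and reports which components each dataset offers. The helper names `available_components` and `is_from_ir_datasets` are hypothetical and not part of PyTerrier.

```python
# Rows copied by hand from the list_datasets() output; None marks an empty cell.
ROWS = [
    {"dataset": "vaswani", "topics": True, "qrels": True, "corpus": True, "index": True},
    {"dataset": "antique", "topics": ["train", "test"], "qrels": ["train", "test"],
     "corpus": True, "index": None},
    {"dataset": "irds:neuclir", "topics": None, "qrels": None, "corpus": None, "index": None},
]

COMPONENTS = ("corpus", "index", "topics", "qrels")


def available_components(row):
    """Return the names of the components that this dataset row provides."""
    return [name for name in COMPONENTS if row.get(name)]


def is_from_ir_datasets(name):
    """Return True if the dataset name carries the irds: prefix."""
    return name.startswith("irds:")


for row in ROWS:
    source = "ir_datasets" if is_from_ir_datasets(row["dataset"]) else "built-in"
    print(row["dataset"], source, available_components(row))
```

In the notebook itself, the same check could be done on the DataFrame returned by `pt.datasets.list_datasets()`.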

Let's start by loading the `vaswani` dataset:


In [4]:
dataset = pt.get_dataset("vaswani")

For this dataset, there are pre-made indexes available that we can load. In order to do this, we need to select a _variant_. The variants differ slightly, for example, in terms of pre-processing. An overview of the indexes and variants can be found in the [Terrier data repository](http://data.terrier.org/).

We'll use the standard variant, `terrier_stemmed`, to create a BM25 retriever:


In [5]:
index = dataset.get_index(variant="terrier_stemmed")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
bm25.search("computer")

Downloading vaswani index to /Users/tomighita/.pyterrier/corpora/vaswani/index/terrier_stemmed


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 3941 | 3942 | 0 | 7.697643 | computer |
| 1 | 1 | 139 | 140 | 1 | 7.634963 | computer |
| 2 | 1 | 2546 | 2547 | 2 | 7.582282 | computer |
| 3 | 1 | 3597 | 3598 | 3 | 7.582282 | computer |
| 4 | 1 | 394 | 395 | 4 | 7.504337 | computer |
| ... | ... | ... | ... | ... | ... | ... |
| 527 | 1 | 3718 | 3719 | 527 | 2.401824 | computer |
| 528 | 1 | 10015 | 10016 | 528 | 2.379234 | computer |
| 529 | 1 | 3484 | 3485 | 529 | 2.335306 | computer |
| 530 | 1 | 8380 | 8381 | 530 | 2.313945 | computer |


We can also create a retriever directly from the dataset like so:


In [6]:
bm25 = pt.BatchRetrieve.from_dataset(dataset, variant="terrier_stemmed", wmodel="BM25")

We can also browse the corpus:


In [7]:
for doc in dataset.get_corpus_iter():
    print(doc)
    break

Downloading vaswani corpus to /Users/tomighita/.pyterrier/corpora/vaswani/corpus


{'docno': '1', 'text': 'compact memories have flexible capacities  a digital data storage\nsystem with capacity up to bits and random and or sequential access\nis described'}
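
If you want to look at more than one document, `itertools.islice` avoids the manual `break`. The sketch below uses a hypothetical stand-in generator, `toy_corpus_iter`, instead of `dataset.get_corpus_iter()` so that it runs without downloading anything. Only the first text is taken (shortened) from the output above; the other two are placeholders.

```python
from itertools import islice


def toy_corpus_iter():
    """Hypothetical stand-in for dataset.get_corpus_iter(): yields docno/text dicts."""
    yield {"docno": "1", "text": "compact memories have flexible capacities"}
    yield {"docno": "2", "text": "a second placeholder document"}
    yield {"docno": "3", "text": "a third placeholder document"}


# Take the first two documents without consuming the rest of the iterator.
first_docs = list(islice(toy_corpus_iter(), 2))
for doc in first_docs:
    print(doc["docno"], doc["text"])
```

With the real dataset, replacing `toy_corpus_iter()` by `dataset.get_corpus_iter()` should give the same kind of dictionaries.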


Similarly, the topics (queries) are returned as a `pandas.DataFrame`, so we can pass them directly to the retriever:


In [8]:
bm25(dataset.get_topics())

Downloading vaswani topics to /Users/tomighita/.pyterrier/corpora/vaswani/query-text.trec


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 8171 | 8172 | 0 | 24.566031 | measurement of dielectric constant of liquids ... |
| 1 | 1 | 9880 | 9881 | 1 | 22.110514 | measurement of dielectric constant of liquids ... |
| 2 | 1 | 5501 | 5502 | 2 | 21.717148 | measurement of dielectric constant of liquids ... |
| 3 | 1 | 1501 | 1502 | 3 | 19.478355 | measurement of dielectric constant of liquids ... |
| 4 | 1 | 9858 | 9859 | 4 | 18.626342 | measurement of dielectric constant of liquids ... |
| ... | ... | ... | ... | ... | ... | ... |
| 91925 | 93 | 5495 | 5496 | 995 | 8.163870 | high frequency oscillators using transistors t... |
| 91926 | 93 | 9187 | 9188 | 996 | 8.163650 | high frequency oscillators using transistors t... |
| 91927 | 93 | 4821 | 4822 | 997 | 8.163588 | high frequency oscillators using transistors t... |
| 91928 | 93 | 3279 | 3280 | 998 | 8.163573 | high frequency oscillators using transistors t... |


Note that for some datasets, `get_topics()` requires a variant, such as `variant="train"`.
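
Whether a variant is needed can be read off the `topics` column of the `list_datasets()` output: `True` means there is a single topic set, while a list names the available variants. The following sketch turns that rule into a small helper. `pick_topics_variant` and its `preferred` parameter are hypothetical, and falling back to the first listed variant is just one possible choice.

```python
def pick_topics_variant(topics_entry, preferred="train"):
    """Decide which topics variant to request for a dataset.

    topics_entry is the value from the `topics` column: True, a list of
    variant names, or None/empty when the dataset has no topics.
    Returns None when no variant is needed.
    """
    if topics_entry is True:
        return None  # a single topic set, as for vaswani
    if not topics_entry:
        raise ValueError("this dataset provides no topics")
    if preferred in topics_entry:
        return preferred
    return topics_entry[0]  # assumption: fall back to the first listed variant


print(pick_topics_variant(True))
print(pick_topics_variant(["train", "test"]))
print(pick_topics_variant(["training", "validation"]))
```

The returned name would then be passed on as the `variant` argument of `get_topics()` whenever it is not `None`.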

Since the corpus iterator already yields the documents in the correct format (see part 2: indexing), we can use it directly to create our own index if we wish:


In [9]:
from pathlib import Path

index = pt.IterDictIndexer(
    str(Path.cwd()),  # index path; ignored because the index is kept in memory
    type=pt.index.IndexingType.MEMORY,
).index(dataset.get_corpus_iter())
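
The same dictionary format can be produced for your own documents. The sketch below is a hypothetical helper, `docs_from_texts`, that wraps a list of strings into `docno`/`text` dictionaries like the one printed by the corpus iterator earlier. Numbering the documents from "1" is an assumption made for illustration; any unique string identifiers should do.

```python
def docs_from_texts(texts):
    """Yield one dict per text, in the docno/text format of the corpus iterator.

    Hypothetical helper: docnos are consecutive numbers, stored as strings.
    """
    for number, text in enumerate(texts, start=1):
        yield {"docno": str(number), "text": text}


my_texts = ["first example document", "second example document"]
docs = list(docs_from_texts(my_texts))
print(docs)
```

In the notebook, such a generator could be handed to the indexer's `index()` method in place of `dataset.get_corpus_iter()`.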

## Further reading

Check out the [datasets section](https://pyterrier.readthedocs.io/en/latest/datasets.html) in the documentation.
