# Tutorial on how to access the underlying datasets of the lsr-benchmark

This tutorial shows how to access the datasets underlying the lsr-benchmark. Most datasets are private and therefore only available in the TIRA sandbox; where the license permits, we make the subsamples public. **(Note: we still publish the embeddings for private datasets, but the underlying text cannot be accessed in those cases.)**

Datasets that are completely public are:

- `msmarco-passage/trec-dl-2019/judged`
- `msmarco-passage/trec-dl-2020/judged`
- `msmarco-segment-v2.1/trec-rag-2024`
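
As a small sketch of how a script could guard against requesting text from a private dataset, the list above can be kept as a set and checked before loading. The helper `has_public_text` and the ID `some-private-dataset` are hypothetical names used for illustration only; they are not part of the lsr-benchmark package.

```python
# Dataset IDs that this tutorial lists as completely public.
PUBLIC_DATASETS = frozenset({
    "msmarco-passage/trec-dl-2019/judged",
    "msmarco-passage/trec-dl-2020/judged",
    "msmarco-segment-v2.1/trec-rag-2024",
})


def has_public_text(dataset_id: str) -> bool:
    """Return True if the tutorial lists this dataset as completely public."""
    return dataset_id in PUBLIC_DATASETS


print(has_public_text("msmarco-passage/trec-dl-2019/judged"))
# "some-private-dataset" is a made-up ID standing in for any private dataset.
print(has_public_text("some-private-dataset"))
```

Keeping the IDs in one place means the check has to be updated only once if further datasets become public.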

## Step 1: Install the lsr-benchmark

In [None]:
!pip3 install lsr-benchmark

## Step 2: Register and load a dataset

In [1]:
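import ir_datasets
from lsr_benchmark import register_to_ir_datasets

# Register the benchmark dataset so that ir_datasets can resolve its ID.
register_to_ir_datasets("msmarco-passage/trec-dl-2019/judged")

# The registered dataset is loaded under the "lsr-benchmark/" prefix.
dataset = ir_datasets.load("lsr-benchmark/msmarco-passage/trec-dl-2019/judged")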
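
Note that the ID passed to `register_to_ir_datasets` differs from the one passed to `ir_datasets.load` by the `lsr-benchmark/` prefix. The sketch below captures that naming pattern in two helpers. Both `to_irds_id` and `to_pyterrier_id` are hypothetical helpers, not functions of the lsr-benchmark package, and the sketch assumes that the other datasets follow the same prefix pattern as the one shown in this tutorial.

```python
IRDS_PREFIX = "lsr-benchmark/"  # prefix seen in the ir_datasets.load call above
PYTERRIER_PREFIX = "irds:"      # PyTerrier's prefix for ir_datasets-backed datasets


def to_irds_id(benchmark_id: str) -> str:
    """Build the ID under which a registered dataset is loaded via ir_datasets."""
    return IRDS_PREFIX + benchmark_id


def to_pyterrier_id(benchmark_id: str) -> str:
    """Build the ID that PyTerrier's get_dataset expects for the same dataset."""
    return PYTERRIER_PREFIX + to_irds_id(benchmark_id)


print(to_irds_id("msmarco-passage/trec-dl-2019/judged"))
print(to_pyterrier_id("msmarco-passage/trec-dl-2019/judged"))
```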

## Step 3: Process Documents, Queries, and Qrels

The following cells process the dataset with ir_datasets. See Step 4 below for the PyTerrier integration via ir_datasets.

In [2]:
for doc in dataset.docs_iter():
    print("doc_id:", doc.doc_id)
    print("Text:", doc.default_text())
    break

doc_id: 8811478
Text: Solution to Example 6: Function f is defined for all real values of x. The domain of f is the set of all real numbers. We will graph it by considering the value of the function in each interval. In the interval (- inf , -2] the graph of f is a horizontal line y = f(x) = -1 (see formula for this interval above).


In [3]:
for query in dataset.queries_iter():
    print("query_id:", query.query_id)
    print("Text:", query.default_text())
    break

query_id: 156493
Text: do goldfish grow


In [4]:
for qrel in dataset.qrels_iter():
    print(qrel)
    break

TrecQrel(query_id='19335', doc_id='1017759', relevance=0, iteration='0')
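
Beyond printing a single record, a common first check is how the relevance labels are distributed. The following is a minimal sketch that assumes each qrel exposes a `relevance` attribute, as the `TrecQrel` record above does. The `Qrel` namedtuple and its sample rows are made-up stand-ins so that the example runs without downloading data.

```python
from collections import Counter, namedtuple

# Stand-in for the TrecQrel records shown above; the sample rows are made up.
Qrel = namedtuple("Qrel", ["query_id", "doc_id", "relevance", "iteration"])


def label_distribution(qrels):
    """Count how often each relevance label occurs in an iterable of qrels."""
    return Counter(qrel.relevance for qrel in qrels)


sample = [
    Qrel("q1", "d1", 0, "0"),
    Qrel("q1", "d2", 2, "0"),
    Qrel("q2", "d3", 0, "0"),
]
print(label_distribution(sample))
```

With the dataset loaded as in Step 2, `label_distribution(dataset.qrels_iter())` should give the counts for the real judgments.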


## Step 4: Process Documents, Queries, and Qrels with PyTerrier

In [5]:
import pyterrier as pt

# The "irds:" prefix tells PyTerrier to load the dataset through ir_datasets.
dataset = pt.datasets.get_dataset("irds:lsr-benchmark/msmarco-passage/trec-dl-2019/judged")

In [6]:
for doc in dataset.get_corpus_iter():
    print("doc_id:", doc["docno"])
    print("Text:", doc["text"])
    break

lsr-benchmark/msmarco-passage/trec-dl-2019/judged documents:   0%|          | 0/32123 [00:00<?, ?it/s]

doc_id: 8811478
Text: Solution to Example 6: Function f is defined for all real values of x. The domain of f is the set of all real numbers. We will graph it by considering the value of the function in each interval. In the interval (- inf , -2] the graph of f is a horizontal line y = f(x) = -1 (see formula for this interval above).
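
To look at more than one document without iterating over the whole corpus, the iterator can be truncated with `itertools.islice`. This is a sketch that assumes each record is a dict with `docno` and `text` keys, as in the loop above. The helper `peek` and the sample records are hypothetical and only there to keep the example self-contained.

```python
from itertools import islice


def peek(records, n=3, max_chars=80):
    """Return up to n (docno, shortened text) pairs from an iterator of dict records."""
    preview = []
    for record in islice(records, n):
        text = record["text"]
        if len(text) > max_chars:
            # Shorten long passages so the preview stays readable.
            text = text[:max_chars] + "..."
        preview.append((record["docno"], text))
    return preview


# Made-up records standing in for the output of dataset.get_corpus_iter().
sample_records = [
    {"docno": "d1", "text": "a" * 100},
    {"docno": "d2", "text": "short"},
    {"docno": "d3", "text": "never reached when n=2"},
]
for docno, text in peek(iter(sample_records), n=2, max_chars=10):
    print(docno, text)
```

With the PyTerrier dataset from above, `peek(dataset.get_corpus_iter())` would preview the first three passages.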





In [8]:
dataset.get_topics()

| | qid | query |
|---|---|---|
| 0 | 156493 | do goldfish grow |
| 1 | 1110199 | what is wifi vs bluetooth |
| 2 | 1063750 | why did the us volunterilay enter ww1 |
| 3 | 130510 | definition declaratory judgment |
| 4 | 489204 | right pelvic pain causes |
| 5 | 573724 | what are the social determinants of health |
| 6 | 168216 | does legionella pneumophila cause pneumonia |
| 7 | 1133167 | how is the weather in jamaica |
| 8 | 527433 | types of dysarthria from cerebral palsy |
| 9 | 1037798 | who is robert gray |


In [9]:
dataset.get_qrels()

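| | qid | docno | label | iteration |
|---|---|---|---|---|
| 0 | 19335 | 1017759 | 0 | 0 |
| 1 | 19335 | 1082489 | 0 | 0 |
| 2 | 19335 | 109063 | 0 | 0 |
| 3 | 19335 | 1160863 | 0 | 0 |
| 4 | 19335 | 1160871 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 9255 | 1133167 | 8839920 | 2 | 0 |
| 9256 | 1133167 | 8839922 | 2 | 0 |
| 9257 | 1133167 | 944810 | 0 | 0 |
| 9258 | 1133167 | 949411 | 0 | 0 |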
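
The qrels table can also be summarized per query, for example by counting how many documents reach a given label. The sketch below works on plain dict records with the `qid`, `docno`, and `label` columns shown above, so it runs without pandas. The helper `relevant_per_query` is hypothetical, and the default threshold `min_label=1` is an assumption; choose the cutoff that matches your evaluation setup.

```python
from collections import defaultdict


def relevant_per_query(records, min_label=1):
    """Count, per qid, the rows whose label is at least min_label."""
    counts = defaultdict(int)
    for row in records:
        if row["label"] >= min_label:
            counts[row["qid"]] += 1
    return dict(counts)


# A few rows copied from the qrels output above.
sample_records = [
    {"qid": "19335", "docno": "1017759", "label": 0},
    {"qid": "1133167", "docno": "8839920", "label": 2},
    {"qid": "1133167", "docno": "8839922", "label": 2},
    {"qid": "1133167", "docno": "944810", "label": 0},
]
print(relevant_per_query(sample_records))
```

Assuming `dataset.get_qrels()` returns a pandas DataFrame, as the output above suggests, its rows can be passed in via `dataset.get_qrels().to_dict("records")`. Queries without any document at or above the threshold do not appear in the result.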
