# Information Retrieval Lab WiSe 2024/2025: Evaluation

This Jupyter notebook shows how to evaluate the research hypothesis that you developed during the course.

We used a subset of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset to develop the systems and to formulate the research hypothesis. This dataset is loaded into the PyTerrier dataset `pt_training_dataset` below (dataset id `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`).

Furthermore, we jointly built our own test dataset during the course on the MS MARCO v2.1 passage dataset. You can use it to test whether your research hypothesis holds on an unseen evaluation corpus. This dataset is loaded into the PyTerrier dataset `pt_test_dataset` below (dataset id `ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test`).

### Step 1: Ensure Libraries are installed

We install the dependencies and remove cached older versions of the test dataset that had no qrels available.

In [None]:
!pip3 install 'tira>=0.0.141' ir-datasets 'python-terrier==0.10.0'

In [1]:
# remove cached versions of the dataset that might not have the qrels
!rm -Rf ~/.tira/extracted_datasets/ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test
!rm -Rf ~/.tira/.archived

### Step 2: Import Dependencies and Load Datasets

In [None]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
from ir_measures import nDCG, Judged, RR

ensure_pyterrier_is_loaded()
tira = Client()

from pyterrier import get_dataset, Experiment

pt_training_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')
pt_test_dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test')

### Step 3: Load the Runs for the Evaluation

In [4]:
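# TIRA team and approach whose submitted runs are evaluated; replace with your own
MY_TEAM = 'ir-wise-24-tutors'
MY_APPROACH = 'Retrieval Baseline'

# Run of the approach on the training dataset
trainings_run = tira.pt.from_retriever_submission(f'ir-lab-wise-2024/{MY_TEAM}/{MY_APPROACH}', dataset=pt_training_dataset)

# Run of the same approach on the test dataset
test_run = tira.pt.from_retriever_submission(f'ir-lab-wise-2024/{MY_TEAM}/{MY_APPROACH}', dataset=pt_test_dataset)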

### Step 4: Evaluation on the Training Dataset

In [5]:
Experiment(
    retr_systems = [trainings_run],
    names = [MY_APPROACH],
    eval_metrics = [nDCG@10, nDCG(judged_only=True)@10, RR@10, Judged@10],
    topics = pt_training_dataset.get_topics('title'),
    qrels = pt_training_dataset.get_qrels(),
)

| | name | nDCG@10 | nDCG(judged_only=True)@10 | RR@10 | Judged@10 |
|---|---|---|---|---|---|
| 0 | Retrieval Baseline | 0.489469 | 0.495665 | 0.784737 | 0.94433 |
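
To make the four reported measures more concrete, the following is a minimal sketch of how they could be computed for a single query in plain Python. It is not the `ir_measures` implementation: the function names, the toy `qrels` and `ranking`, and the choice of linear gains with a log2 discount for nDCG are assumptions made for illustration only.

```python
import math

def rr_at_k(ranking, qrels, k=10, min_rel=1):
    """Reciprocal rank of the first relevant document in the top k (0.0 if none)."""
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if qrels.get(doc_id, 0) >= min_rel:
            return 1.0 / rank
    return 0.0

def judged_at_k(ranking, qrels, k=10):
    """Fraction of the retrieved top-k documents that have a relevance judgment."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in qrels) / len(top)

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg_at_k(ranking, qrels, k=10, judged_only=False):
    """nDCG@k; with judged_only=True, unjudged documents are dropped before scoring."""
    if judged_only:
        ranking = [doc_id for doc_id in ranking if doc_id in qrels]
    # Unjudged documents count as non-relevant (gain 0).
    gains = [max(qrels.get(doc_id, 0), 0) for doc_id in ranking[:k]]
    ideal = sorted((g for g in qrels.values() if g > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy data: "d9" is retrieved first but has no judgment.
qrels = {"d1": 2, "d2": 0, "d3": 1}
ranking = ["d9", "d1", "d2", "d3"]

print("RR@10:", rr_at_k(ranking, qrels))          # 0.5, first relevant document at rank 2
print("Judged@10:", judged_at_k(ranking, qrels))  # 0.75, three of four documents are judged
print("nDCG@10:", ndcg_at_k(ranking, qrels))
print("nDCG(judged_only=True)@10:", ndcg_at_k(ranking, qrels, judged_only=True))
```

In this toy ranking the judged-only nDCG is higher than the plain nDCG because removing the unjudged document moves the relevant documents up by one rank.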


### Step 5: Evaluation on the Test Dataset

In [7]:
Experiment(
    retr_systems = [test_run],
    names = [MY_APPROACH],
    eval_metrics = [nDCG@10, nDCG(judged_only=True)@10, RR@10, Judged@10],
    topics = pt_test_dataset.get_topics('title'),
    qrels = pt_test_dataset.get_qrels(),
)

| | name | nDCG@10 | nDCG(judged_only=True)@10 | RR@10 | Judged@10 |
|---|---|---|---|---|---|
| 0 | Retrieval Baseline | 0.493561 | 0.542444 | 0.624715 | 0.908696 |
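
As a rough way to read the two result tables side by side, the sketch below copies the reported scores of the Retrieval Baseline into two dictionaries and prints the difference per measure. The variable names are hypothetical, and the differences are purely descriptive: the two datasets use different corpora and topics, so this is not a significance test.

```python
# Scores copied from the training and test evaluation outputs in this notebook.
training_scores = {
    "nDCG@10": 0.489469,
    "nDCG(judged_only=True)@10": 0.495665,
    "RR@10": 0.784737,
    "Judged@10": 0.94433,
}
test_scores = {
    "nDCG@10": 0.493561,
    "nDCG(judged_only=True)@10": 0.542444,
    "RR@10": 0.624715,
    "Judged@10": 0.908696,
}

# Difference per measure (test minus training).
deltas = {m: test_scores[m] - training_scores[m] for m in training_scores}

for measure, delta in deltas.items():
    print(f"{measure}: training={training_scores[measure]:.4f} "
          f"test={test_scores[measure]:.4f} delta={delta:+.4f}")
```

For this baseline, nDCG@10 is almost unchanged between the two datasets, RR@10 is lower on the test dataset, and Judged@10 drops from about 0.94 to about 0.91, so more of the top-10 results are unjudged on the test dataset.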
