## Step 1: Ensure that libraries are imported

In [1]:
# Quote the version specifier: without quotes the shell treats `>` as an output redirect.
!pip3 install 'tira>=0.0.141' ir-datasets python-terrier==0.10.0


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [2]:
!rm -Rf ~/.tira/extracted_datasets/ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test
!rm -Rf ~/.tira/.archived

In [3]:
# This command loads and starts PyTerrier so that it also works in TIRA.

from tira.third_party_integrations import ensure_pyterrier_is_loaded

ensure_pyterrier_is_loaded()

PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [4]:
# PyTerrier must be imported after `ensure_pyterrier_is_loaded` is called.

from pyterrier import started, init

if not started():
    init()

## Step 2: Load the dataset

In [5]:
from pyterrier import get_dataset

dataset = get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test')
dataset

IRDSDataset('ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test')

## Step 3: Create the retrieval pipeline with TIRA

In this example, we use two existing retrieval submissions from the IR lab: a combined approach (BM25 with doc2query and data cleaning) and a plain BM25 baseline, both lexical rankers.
We load both approaches via the TIRA API.

In [6]:
from tira.rest_api_client import Client

tira_client = Client()

In [7]:
kombi = tira_client.pt.from_retriever_submission(
    approach='ir-lab-wise-2024/ir-wise-24-th25/BM25+doc2query+DataCleaning',
    dataset='subsampled-ms-marco-ir-lab-20250105-test',
)
kombi

TiraSourceTransformer()

In [8]:
bm25 = tira_client.pt.from_retriever_submission(
    approach='ir-lab-wise-2024/ir-wise-24-uk-ir-1/BM25',
    dataset='subsampled-ms-marco-ir-lab-20250105-test',
)
bm25

TiraSourceTransformer()

## Step 4: Measure effectiveness

Now let us measure the nDCG@10 effectiveness of both systems on the subsampled MS MARCO test dataset loaded in Step 2.

In [9]:
from pyterrier.pipelines import Experiment

experiment = Experiment(
    retr_systems=[
        kombi,
        bm25,
    ],
    topics=dataset.get_topics("query"),
    qrels=dataset.get_qrels(),
    eval_metrics=["ndcg_cut_10"],
    names=[
        "Kombination",
        "BM25",
    ],
    perquery=True,
)
experiment.sample(n=10)

Download from Zenodo: https://zenodo.org/records/14743268/files/subsampled-ms-marco-ir-lab-20250105-test-truths.zip


Download: 100%|██████████| 50.6k/50.6k [00:00<00:00, 1.77MiB/s]


Download finished. Extract...
Extraction finished:  /home/codespace/.tira/extracted_datasets/ir-lab-wise-2024/subsampled-ms-marco-ir-lab-20250105-test/


| | name | qid | measure | value |
|---|---|---|---|---|
| 60 | BM25 | 46 | ndcg_cut_10 | 0.72733 |
| 41 | Kombination | 31 | ndcg_cut_10 | 0.0 |
| 75 | BM25 | 11 | ndcg_cut_10 | 0.063621 |
| 42 | Kombination | 10 | ndcg_cut_10 | 0.0 |
| 29 | Kombination | 11 | ndcg_cut_10 | 0.0 |
| 32 | Kombination | 13 | ndcg_cut_10 | 0.367921 |
| 18 | Kombination | 6 | ndcg_cut_10 | 0.0 |
| 62 | BM25 | 36 | ndcg_cut_10 | 0.220092 |
| 73 | BM25 | 17 | ndcg_cut_10 | 0.0 |
| 3 | Kombination | 48 | ndcg_cut_10 | 0.388981 |
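
For readers who want to see what a per-query `ndcg_cut_10` value stands for, here is a minimal sketch of nDCG@k in plain Python. It assumes linear gain (the relevance grade itself) and a log2 rank discount, which is the common trec_eval-style definition; the helper name `ndcg_at_k` and the toy qrels are made up for illustration and are not part of PyTerrier or TIRA.

```python
import math


def ndcg_at_k(ranking, qrels, k=10):
    """Sketch of nDCG@k for one query.

    ranking: list of document ids, best first (hypothetical system output).
    qrels:   dict mapping document id -> graded relevance (unjudged counts as 0).
    """
    def dcg(gains):
        # Rank positions start at 1, so the discount is log2(rank + 1).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal_gains = sorted(qrels.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    # A query without any relevant document gets a score of 0.
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Toy judgments: d1 is highly relevant, d2 is relevant, d3 is not relevant.
toy_qrels = {"d1": 2, "d2": 1, "d3": 0}

perfect = ndcg_at_k(["d1", "d2", "d3"], toy_qrels)   # ideal order
shuffled = ndcg_at_k(["d3", "d1", "d2"], toy_qrels)  # relevant docs pushed down
missed = ndcg_at_k(["d3", "d9"], toy_qrels)          # nothing relevant retrieved
print(perfect, shuffled, missed)
```

A value of 0.0 in the table above therefore means that none of the top 10 retrieved documents for that query was judged relevant, while 1.0 means the top 10 were in ideal order.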


In [10]:
experiment_bm25 = experiment[experiment["name"] == "BM25"]\
    .drop(columns=["name"])
experiment_kombi = experiment[experiment["name"] == "Kombination"]\
    .drop(columns=["name"])

experiment_paired = experiment_bm25.merge(
    experiment_kombi,
    on=["qid", "measure"],
    suffixes=("_bm25", "_kombi"),
)
experiment_paired.head(n=10)

| | qid | measure | value_bm25 | value_kombi |
|---|---|---|---|---|
| 0 | 10 | ndcg_cut_10 | 0.546257 | 0.0 |
| 1 | 11 | ndcg_cut_10 | 0.063621 | 0.0 |
| 2 | 12 | ndcg_cut_10 | 0.224663 | 0.330138 |
| 3 | 13 | ndcg_cut_10 | 1.0 | 0.367921 |
| 4 | 14 | ndcg_cut_10 | 0.921602 | 0.0 |
| 5 | 16 | ndcg_cut_10 | 0.643404 | 0.44222 |
| 6 | 17 | ndcg_cut_10 | 0.0 | 0.0 |
| 7 | 18 | ndcg_cut_10 | 0.0 | 0.0 |
| 8 | 19 | ndcg_cut_10 | 0.425208 | 0.078398 |
| 9 | 2 | ndcg_cut_10 | 1.0 | 1.0 |


## Step 5: Conduct hypothesis tests

On this _paired_ measurement data, we can now conduct _paired_ t-tests to test for statistical significance of given hypotheses.
Remember that the choice of your test depends (amongst other factors) on how the hypothesis is formulated.
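
To make "paired" concrete: a paired t-test looks only at the per-query differences between the two systems. The following minimal sketch computes the test statistic by hand in plain Python. The two score lists are made-up illustrative values, not results from this notebook, and the variable names are hypothetical.

```python
import math
import statistics

# Made-up per-query scores for two systems on the same eight queries.
baseline_scores = [0.55, 0.06, 0.22, 1.00, 0.92, 0.64, 0.43, 0.80]
variant_scores = [0.10, 0.00, 0.33, 0.37, 0.20, 0.44, 0.08, 0.50]

# Pairing: one difference per query (variant minus baseline).
differences = [v - b for v, b in zip(variant_scores, baseline_scores)]

n = len(differences)
mean_diff = statistics.mean(differences)
# statistics.stdev is the sample standard deviation (n - 1 in the denominator).
std_error = statistics.stdev(differences) / math.sqrt(n)

# Paired t statistic with n - 1 degrees of freedom.
t_statistic = mean_diff / std_error
print(f"mean difference = {mean_diff:.3f}, t = {t_statistic:.3f}, df = {n - 1}")
```

The sign of the statistic carries the direction: a negative value here means the variant scored lower than the baseline on average, which is exactly the information a one-sided test uses and a two-sided test ignores.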

Let us test some hypotheses to get a feeling of what this means:

#### Hypothesis 3: Combining query expansion and data cleaning with BM25 leads to significantly improved nDCG@10 scores, because the cleaner data basis from data cleaning and the expanded queries from query expansion act synergistically, so that more relevant documents can be identified.

Significance test: one-sided paired t-test \
Significance level: $\alpha = 0.05$ (or $p < 0.05$)

In [11]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_kombi"],
    experiment_paired["value_bm25"],
    alternative='two-sided',
).pvalue

2.964309781917969e-06

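Because this p-value is lower than our significance level, the two-sided test suggests a statistically significant difference between the two systems. A two-sided test does not tell us the direction of that difference, however. Since Hypothesis 3 claims an improvement, we follow up with a one-sided test.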

In [37]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_kombi"],
    experiment_paired["value_bm25"],
    alternative='greater',
).pvalue

0.9851371761165255

This time, the p-value (the probability of observing a result at least this extreme if the null hypothesis were true) is higher than our significance level $\alpha$.
So we cannot reject the null hypothesis and fail to confirm Hypothesis 3.
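
The effect of the `alternative` parameter can be seen by running all three variants of `scipy.stats.ttest_rel` on the same data. The sketch below uses made-up scores (an assumption for illustration, not results from this notebook) in which the variant is mostly worse than the baseline.

```python
from scipy.stats import ttest_rel

# Made-up per-query scores; the variant is lower than the baseline on most queries.
baseline_scores = [0.55, 0.06, 0.22, 1.00, 0.92, 0.64, 0.43, 0.80]
variant_scores = [0.10, 0.00, 0.33, 0.37, 0.20, 0.44, 0.08, 0.50]

# H1: the two systems differ (no direction).
p_two_sided = ttest_rel(variant_scores, baseline_scores, alternative="two-sided").pvalue
# H1: the variant is better than the baseline.
p_greater = ttest_rel(variant_scores, baseline_scores, alternative="greater").pvalue
# H1: the variant is worse than the baseline.
p_less = ttest_rel(variant_scores, baseline_scores, alternative="less").pvalue

print(f"two-sided: {p_two_sided:.4f}, greater: {p_greater:.4f}, less: {p_less:.4f}")
```

On the same data, the two one-sided p-values add up to 1, and the two-sided p-value is twice the smaller of them. A large p-value for `alternative='greater'` therefore goes hand in hand with a small one for `alternative='less'`.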