# IR Lab Tutorial: Statistical Analysis

This tutorial shows how to conduct a hypothesis test to compare two retrieval approaches.
The two runs compared in this example are loaded from the TIRA cache.

## Step 1: Ensure that libraries are imported

In [21]:
# This command loads and starts PyTerrier so that it also works in TIRA.

from tira.third_party_integrations import ensure_pyterrier_is_loaded

ensure_pyterrier_is_loaded()

In [22]:
# PyTerrier must be imported after `ensure_pyterrier_is_loaded` is called.

from pyterrier import started, init

if not started():
    init()

## Step 2: Load the dataset

In [23]:
from pyterrier import get_dataset

dataset = get_dataset('irds:ir-benchmarks/argsme-touche-2020-task-1-20230209-training')
dataset

IRDSDataset('ir-benchmarks/argsme-touche-2020-task-1-20230209-training')

## Step 3: Create the retrieval pipeline with TIRA

In this example, we use two existing lexical rankers from TIREx: BM25 and DirichletLM.
We load both approaches via the TIRA API.

In [24]:
from tira.rest_api_client import Client

tira_client = Client()

In [25]:
dlm = tira_client.pt.from_retriever_submission(
    approach='ir-benchmarks/tira-ir-starter/DirichletLM (tira-ir-starter-pyterrier)',
    dataset='argsme-touche-2020-task-1-20230209-training',
)
dlm

SourceTransformer()

In [26]:

bm25 = tira_client.pt.from_retriever_submission(
    approach='ir-benchmarks/tira-ir-starter/BM25 (tira-ir-starter-pyterrier)',
    dataset='argsme-touche-2020-task-1-20230209-training',
)
bm25

SourceTransformer()

## Step 4: Measure effectiveness

Now let us measure the nDCG@10 effectiveness of both systems on the Touché 2020 task 1 dataset.

In [27]:
from pyterrier.pipelines import Experiment

experiment = Experiment(
    retr_systems=[
        dlm,
        bm25,
    ],
    topics=dataset.get_topics("query"),
    qrels=dataset.get_qrels(),
    eval_metrics=["ndcg_cut_10"],
    names=[
        "DirichletLM",
        "BM25",
    ],
    perquery=True,
)
experiment.sample(n=10)

Unnamed: 0,name,qid,measure,value
3,DirichletLM,4,ndcg_cut_10,0.453743
49,BM25,1,ndcg_cut_10,0.661871
76,BM25,29,ndcg_cut_10,0.254807
75,BM25,28,ndcg_cut_10,0.063621
41,DirichletLM,43,ndcg_cut_10,0.829042
11,DirichletLM,12,ndcg_cut_10,0.19279
83,BM25,36,ndcg_cut_10,0.286423
7,DirichletLM,8,ndcg_cut_10,0.315163
89,BM25,42,ndcg_cut_10,0.508146
73,BM25,26,ndcg_cut_10,0.286346


This data frame shows the nDCG@10 values measured for each query and both systems (DirichletLM and BM25). \
So we have pairs of measurements where the same metric (i.e., nDCG@10) is measured using the same input (e.g., query #1) but for two different systems.
Let's re-arrange the data frame so that BM25 and DirichletLM values are in separate columns, not rows.

In [28]:
experiment_bm25 = experiment[experiment["name"] == "BM25"]\
    .drop(columns=["name"])
experiment_dlm = experiment[experiment["name"] == "DirichletLM"]\
    .drop(columns=["name"])

experiment_paired = experiment_bm25.merge(
    experiment_dlm,
    on=["qid", "measure"],
    suffixes=("_bm25", "_dlm"),
)
experiment_paired.head(n=10)

Unnamed: 0,qid,measure,value_bm25,value_dlm
0,1,ndcg_cut_10,0.661871,0.8805
1,10,ndcg_cut_10,0.158507,0.63322
2,11,ndcg_cut_10,0.309352,0.752969
3,12,ndcg_cut_10,0.061113,0.19279
4,13,ndcg_cut_10,0.31488,0.434739
5,14,ndcg_cut_10,0.355866,0.408224
6,15,ndcg_cut_10,0.094788,0.542364
7,16,ndcg_cut_10,0.208744,0.443535
8,17,ndcg_cut_10,0.0,0.686715
9,18,ndcg_cut_10,0.540948,0.699474


## Step 5: Conduct hypothesis tests

On this _paired_ measurement data, we can now conduct _paired_ t-tests to test for statistical significance of given hypotheses.
Remember that the choice of your test depends (amongst other factors) on how the hypothesis is formulated.
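
To make the idea of a _paired_ test more concrete, here is a minimal sketch of the statistic that a paired t-test is built on: the mean of the per-query differences divided by its standard error. It uses only the first ten rows of the paired data frame shown above, so the numbers are illustrative and will not match the results on the full topic set. The variable names are chosen for this example and are not part of the notebook.

```python
import math
import statistics

# First ten rows of the paired data frame shown above (not the full topic set).
value_bm25 = [0.661871, 0.158507, 0.309352, 0.061113, 0.31488,
              0.355866, 0.094788, 0.208744, 0.0, 0.540948]
value_dlm = [0.8805, 0.63322, 0.752969, 0.19279, 0.434739,
             0.408224, 0.542364, 0.443535, 0.686715, 0.699474]

# Per-query differences, in the same order as ttest_rel(a, b): a - b.
differences = [b - d for b, d in zip(value_bm25, value_dlm)]

n = len(differences)
mean_diff = statistics.mean(differences)
# Sample standard deviation (n - 1 in the denominator).
std_diff = statistics.stdev(differences)

# Paired t statistic: mean difference divided by its standard error.
t_statistic = mean_diff / (std_diff / math.sqrt(n))
degrees_of_freedom = n - 1

print(f"mean difference (BM25 - DirichletLM): {mean_diff:.4f}")
print(f"t = {t_statistic:.3f} with {degrees_of_freedom} degrees of freedom")
```

A function such as `scipy.stats.ttest_rel` converts this statistic into a $p$-value using Student's t distribution with $n - 1$ degrees of freedom.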

Let us test some hypotheses to get a feeling of what this means:

#### Hypothesis 1: BM25 has a significantly different nDCG@10 on Touché 2020 task 1 than DirichletLM.

Significance test: two-sided paired t-test \
Significance level: $\alpha = 0.05$ (i.e., the effect is only considered significant if $p < 0.05$)

In [29]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_bm25"],
    experiment_paired["value_dlm"],
    alternative='two-sided',
).pvalue

1.0865032406710116e-08

The above value is the $p$-value: the probability of observing a difference at least as extreme as the measured one if the null hypothesis (no difference in mean nDCG@10 between the two systems) were true. \
Because $p$ is lower than our significance level $\alpha$, we reject the null hypothesis, which supports hypothesis 1. \
BM25 and DirichletLM lead to significantly different nDCG@10 scores.

Now it would be great to find out which is better. \
One way could be to formulate a hypothesis with a predefined "direction". In this example we assume BM25 to be better.

#### Hypothesis 2: BM25 has a significantly higher nDCG@10 on Touché 2020 task 1 than DirichletLM.

Significance test: one-sided paired t-test \
Significance level: $\alpha = 0.05$ (i.e., the effect is only considered significant if $p < 0.05$)

In [30]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_bm25"],
    experiment_paired["value_dlm"],
    alternative='greater',
).pvalue

0.9999999945674838

This time, the $p$-value is much higher than our significance level $\alpha$. \
So we cannot reject the null hypothesis (that the mean nDCG@10 of BM25 is not higher than that of DirichletLM), and hypothesis 2 is not supported.

Last, we test the opposite direction: BM25 could be worse w.r.t. nDCG@10 than DirichletLM.

#### Hypothesis 3: BM25 has a significantly lower nDCG@10 on Touché 2020 task 1 than DirichletLM.

Significance test: one-sided paired t-test \
Significance level: $\alpha = 0.05$ (i.e., the effect is only considered significant if $p < 0.05$)

In [31]:
from scipy.stats import ttest_rel

ttest_rel(
    experiment_paired["value_bm25"],
    experiment_paired["value_dlm"],
    alternative='less',
).pvalue

5.432516203355058e-09

Here, $p$ is less than our significance level $\alpha$. We reject the null hypothesis, which supports hypothesis 3: on this dataset, DirichletLM achieves a significantly higher nDCG@10 than BM25.
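
The three $p$-values reported above are not independent of each other. For a test statistic with a symmetric, continuous distribution (such as the t statistic), the two one-sided $p$-values add up to 1, and the two-sided $p$-value is twice the smaller one-sided value. The following sketch checks this with the numbers copied from the outputs above and applies the decision rule $p < \alpha$ to each test; the variable names are chosen for this example only.

```python
import math

# p-values copied from the three ttest_rel outputs above.
p_two_sided = 1.0865032406710116e-08  # alternative='two-sided'
p_greater = 0.9999999945674838        # alternative='greater'
p_less = 5.432516203355058e-09        # alternative='less'

alpha = 0.05

# The two one-sided p-values are complementary.
one_sided_complementary = math.isclose(p_greater, 1.0 - p_less, rel_tol=1e-12)

# The two-sided p-value is twice the smaller one-sided p-value.
two_sided_is_double = math.isclose(
    p_two_sided, 2.0 * min(p_less, p_greater), rel_tol=1e-12
)

# Decision rule: reject the null hypothesis only if p < alpha.
p_values = {"two-sided": p_two_sided, "greater": p_greater, "less": p_less}
reject_null = {name: p < alpha for name, p in p_values.items()}

print(one_sided_complementary, two_sided_is_double)
for name, rejected in reject_null.items():
    print(f"alternative={name!r}: reject null hypothesis = {rejected}")
```

In practice this means that a significant two-sided result together with the sign of the mean difference already indicates which one-sided test would also be significant.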