# PyTerrier Notebook for Full-Rank Submissions

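This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index, retrieves with a linear combination of BM25 (with Bo1 query expansion) and PL2, reranks the top results passage-wise, and compares different passage aggregation strategies.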

### Step 1: Ensure Libraries are Imported

In [1]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
    !pip3 install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5.git
else:
    print('We are in the TIRA sandbox.')

from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt

Collecting python-terrier
  Downloading python-terrier-0.10.0.tar.gz (107 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tira==0.0.88
  Downloading tira-0.0.88-py3-none-any.whl.metadata (4.4 kB)
Collecting ir_datasets
  Downloading ir_datasets-0.5.5-py3-none-any.whl.metadata (12 kB)
Collecting docker==6.*,>=6.0.0 (from tira==0.0.88)
  Downloading docker-6.1.3-py3-none-any.whl.metadata (3.5 kB)
Collecting wget (from python-terrier)
  Downloading wget-3.2.zip (10 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tqdm (from python-terrier)
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57

  from .autonotebook import tqdm as notebook_tqdm


#### Ensure the PyTerrier integration is loaded.

In [2]:
ensure_pyterrier_is_loaded()

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True
terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /home/codespace/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /home/codespace/.pyterrier...
Done
terrier-prf -SNAPSHOT jar not found, downloading to /home/codespace/.pyterrier...
Done


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load data, create index

In [3]:
dataset = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')
qrels = dataset.get_qrels()
topics = dataset.get_topics(variant="title")

from pathlib import Path
index_loc = "./index"
if not (Path(index_loc) / "data.properties").exists():
    indexer = pt.IterDictIndexer(index_loc)
    indexref = indexer.index(dataset.get_corpus_iter())
else:
    indexref = pt.IndexFactory.of(index_loc)

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/validation-20231104-training" from tira.
No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.


ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:52<00:00, 1172.38it/s]


### Step 3: Create retrieval pipeline

#### We aim to retrieve documents via a linear combination of PL2 and BM25. First, let's set up PL2.

In [4]:
pl2 = pt.BatchRetrieve(indexref, wmodel="PL2", verbose=True)

#### Next, we set up BM25 retrieval with Bo1 query expansion.

In [5]:
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25", verbose=True)

bo1_expansion = bm25 >> pt.rewrite.Bo1QueryExpansion(indexref)
bm25_bo1 = bo1_expansion >> bm25

#### Let's combine the two systems.

In [6]:
bm25_bo1_pl2 = 2 * bm25_bo1 + pl2
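
The effect of such a weighted linear combination can be sketched in plain Python. This is a minimal illustration, not PyTerrier's implementation: the helper `weighted_sum` and the toy scores are hypothetical, and the sketch assumes that a document retrieved by only one system simply contributes that system's weighted score.

```python
from collections import defaultdict


def weighted_sum(runs_with_weights):
    """Combine several runs by summing weight * score per (qid, docno)."""
    combined = defaultdict(float)
    for weight, run in runs_with_weights:
        for key, score in run.items():
            combined[key] += weight * score
    return dict(combined)


# Toy scores (hypothetical values, not taken from the notebook's index).
bm25_bo1_scores = {("q1", "d1"): 10.0, ("q1", "d2"): 8.0}
pl2_scores = {("q1", "d2"): 5.0, ("q1", "d3"): 4.0}

# Same weighting as in the notebook: BM25+Bo1 counts twice as much as PL2.
fused = weighted_sum([(2.0, bm25_bo1_scores), (1.0, pl2_scores)])

# Rank by fused score, highest first.
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Note that the two systems score on different scales, so the weight of 2 is a tuning choice rather than a statement that BM25 is "twice as important".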

#### Check that the query columns need to be swapped before reranking (after query expansion, `query` holds the expanded query and `query_0` the original one).

In [7]:
run = bm25_bo1_pl2(topics)
run.rename(columns={"query": "query_0", "query_0": "query"})

BR(BM25): 100%|██████████| 882/882 [00:20<00:00, 42.49q/s]
BR(BM25): 100%|██████████| 878/878 [00:22<00:00, 38.89q/s]
BR(PL2): 100%|██████████| 882/882 [00:18<00:00, 46.44q/s]


| | qid | docid | docno | score | query | query_0 | rank |
|---|---|---|---|---|---|---|---|
| 0 | q072210025 | 21318.0 | doc072207501000 | 66.759043 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 1 |
| 1 | q072210025 | 9672.0 | doc072212607743 | 66.977549 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 0 |
| 2 | q072210025 | 43796.0 | doc072207504499 | 66.178928 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 2 |
| 3 | q072210025 | 8216.0 | doc072201202671 | 62.435225 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 3 |
| 4 | q072210025 | 59542.0 | doc072204307357 | 62.309674 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1099732 | q072230074 | | doc072211000339 | 3.473103 | | | 1309 |
| 1099733 | q072230074 | | doc072208407385 | 3.473081 | | | 1310 |
| 1099734 | q072230074 | | doc072203309049 | 3.472435 | | | 1311 |
| 1099735 | q072230074 | | doc072202201538 | 3.471925 | | | 1312 |


### Next, we want to rerank the output with a transformer model (monoT5), after loading the document text and restoring the original query.

In [8]:
import pandas as pd

corpus = pd.DataFrame(dataset.get_corpus_iter())


class GetText(pt.Transformer):
    """Attach the document text by joining the run with the corpus on `docno`."""
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        return pd.merge(topics_or_res, corpus, on="docno")


class ResetQueryColumn(pt.Transformer):
    """Swap `query` and `query_0` so that the reranker sees the original query."""
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        return topics_or_res.rename(columns={"query": "query_0", "query_0": "query"})


No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.

ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:03<00:00, 16319.50it/s]


In [9]:
# TODO: use the first 2 lines in the final version; the third is only for debugging (GitHub Codespaces does not seem to have enough RAM to run monoT5)
# from pyterrier_t5 import MonoT5ReRanker
# monoT5 = MonoT5ReRanker(verbose=True, batch_size=1)

monoT5 = pt.text.scorer(body_attr="text", wmodel="BM25")

In [10]:
bm25_bo1_pl2_mono = (
    bm25_bo1_pl2 % 10 >> 
    GetText() >> 
    ResetQueryColumn() >> 
    pt.text.sliding(length=400, stride=64, prepend_attr=None, text_attr="text") >> 
    monoT5)
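
The sliding-window step splits each document into overlapping passages before scoring. The following is a minimal sketch of that idea, assuming whitespace tokenization; the helper `sliding_passages` is hypothetical, and the handling of short documents and of the last window may differ from what `pt.text.sliding` actually does.

```python
def sliding_passages(text, length=400, stride=64):
    """Split text into windows of `length` tokens, starting every `stride` tokens."""
    tokens = text.split()
    # A document shorter than one window becomes a single passage.
    if len(tokens) <= length:
        return [" ".join(tokens)]
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + length]))
        # Stop once a window has reached the end of the document.
        if start + length >= len(tokens):
            break
    return passages


# Small toy parameters so the overlap is easy to see.
words = " ".join(f"w{i}" for i in range(10))
for passage in sliding_passages(words, length=4, stride=2):
    print(passage)
```

With a stride smaller than the window length, consecutive passages overlap, so a relevant span that straddles a window boundary is still fully contained in at least one passage.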

#### Cache the pipeline.

In [11]:
import pandas as pd
from dataclasses import dataclass

# TODO rename NamedTransformer after debugging to invalidate cache

@dataclass(frozen=True)
class NamedTransformer(pt.Transformer):
    nametrans: str
    wrapped: pt.Transformer

    def __repr__(self) -> str:
        return self.nametrans

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return self.wrapped.transform(df)
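
The TODO above suggests that the cache identifies a pipeline by its `repr`, which is why the wrapper gets a fixed, hand-chosen name. The sketch below illustrates that idea with a toy in-memory cache; `DISK`, `Named`, and `cached` are hypothetical stand-ins, and keying on `repr` is an assumption about the caching layer, not a description of its actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

DISK = {}  # stands in for a cache that survives across notebook runs


@dataclass(frozen=True)
class Named:
    name: str
    fn: Callable[[str], str]

    def __repr__(self) -> str:
        return self.name

    def __call__(self, query: str) -> str:
        return self.fn(query)


def cached(transformer):
    """Return a lookup function that caches results under (repr, query)."""
    def lookup(query):
        key = (repr(transformer), query)
        if key not in DISK:
            DISK[key] = transformer(query)
        return DISK[key]
    return lookup


# First version of the pipeline: the result is computed and stored.
cached(Named("pipeline_v1", str.upper))("spring roll")

# Changed logic under the same name: the stale cached result is returned.
stale = cached(Named("pipeline_v1", str.lower))("spring roll")

# Renaming the wrapper yields a new key, so the result is recomputed.
fresh = cached(Named("pipeline_v2", str.lower))("spring roll")
print(stale, fresh)
```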


#### Wrap the pipeline in a cache and check that the cache uses the chosen name.

In [12]:
bm25_bo1_pl2_mono_cached = ~NamedTransformer("bm25_bo1_pl2_mono_cache_bust_6", bm25_bo1_pl2_mono)
repr(bm25_bo1_pl2_mono_cached)

'Cache(bm25_bo1_pl2_mono_cache_bust_6)'

### Step 4: Test hypotheses.

### Hypothesis 1: There is a significant ($\alpha = 0.05$) difference in nDCG between max passage and mean passage aggregation.

#### First, rerank with max passage aggregation.

In [13]:
bm25_bo1_pl2_max = bm25_bo1_pl2_mono_cached >> pt.text.max_passage()
bm25_bo1_pl2_max.transform(topics).head()

BR(BM25): 100%|██████████| 882/882 [00:18<00:00, 46.79q/s]
BR(BM25): 100%|██████████| 878/878 [00:22<00:00, 39.54q/s]
BR(PL2): 100%|██████████| 882/882 [00:18<00:00, 48.63q/s]


calling sliding on df of 8780 rows


                                                                



| | qid | query_0 | text | score | query | docno | rank |
|---|---|---|---|---|---|---|---|
| 56 | q072210025 | applypipeline:off recip^1.053565089 spring^1.3... | Recipe Peanut Sauce for Spring Rolls Peanut S... | 26.547201 | recipe spring roll | doc072201202671 | 4 |
| 95 | q072210025 | applypipeline:off recip^1.053565089 spring^1.3... | Spring Roll: Easy Recipe Created: 12 August 2... | 26.232037 | recipe spring roll | doc072201901565 | 6 |
| 100 | q072210025 | applypipeline:off recip^1.053565089 spring^1.3... | recipe: spring rolls (Vietnamese recipe) serv... | 25.881011 | recipe spring roll | doc072203110074 | 7 |
| 57 | q072210025 | applypipeline:off recip^1.053565089 spring^1.3... | Spring Rolls Recipe \| Pratique.fr Spring roll... | 26.691692 | recipe spring roll | doc072204307357 | 3 |
| 0 | q072210025 | applypipeline:off recip^1.053565089 spring^1.3... | Recipe Vegetarian Spring Rolls By Kitchen Z 4... | 26.881702 | recipe spring roll | doc072207501000 | 2 |


#### Second, rerank with mean passage aggregation.

In [14]:
bm25_bo1_pl2_mean = bm25_bo1_pl2_mono_cached >> pt.text.mean_passage()
bm25_bo1_pl2_mean.transform(topics).head()

| | qid | docno | score | query | query_0 | rank |
|---|---|---|---|---|---|---|
| 0 | q072210025 | doc072201202671 | 26.547201 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 2 |
| 1 | q072210025 | doc072201901565 | 23.233679 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 7 |
| 2 | q072210025 | doc072203110074 | 25.881011 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 4 |
| 3 | q072210025 | doc072204307357 | 26.691692 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 1 |
| 4 | q072210025 | doc072207501000 | 25.487764 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 6 |
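
The difference between the two aggregations can be sketched in plain Python. This is an illustration under assumptions, not PyTerrier's code: the helper `aggregate` and the toy scores are hypothetical, and the sketch assumes that passage identifiers have the form `<docno>%p<number>`.

```python
from collections import defaultdict
from statistics import mean


def aggregate(passage_scores, how):
    """Group passage scores by document and reduce each group with `how`."""
    by_doc = defaultdict(list)
    for passage_id, score in passage_scores.items():
        docno = passage_id.split("%p")[0]
        by_doc[docno].append(score)
    return {docno: how(scores) for docno, scores in by_doc.items()}


# Toy passage scores (hypothetical values).
passage_scores = {"d1%p0": 3.0, "d1%p1": 1.0, "d2%p0": 2.5}

max_scores = aggregate(passage_scores, max)
mean_scores = aggregate(passage_scores, mean)
print(max_scores, mean_scores)
```

In this toy case, max aggregation ranks `d1` first because of its single strong passage, while mean aggregation prefers `d2`. A document with only one passage receives the same score under both strategies.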


#### Let's compare both systems.

In [15]:
pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5", "ndcg"],
    names=["max passage", "mean passage"],
)

| | name | ndcg_cut_5 | ndcg |
|---|---|---|---|
| 0 | max passage | 0.139535 | 0.179825 |
| 1 | mean passage | 0.135164 | 0.177454 |
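
For readers unfamiliar with the measure, nDCG at a cutoff $k$ can be sketched as follows. This is a minimal version assuming linear gains and a $\log_2$ rank discount; the function names and the toy judgments are hypothetical, and the evaluation backend used by `pt.Experiment` may differ in details such as tie handling.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_docnos, qrels, k):
    """nDCG@k for one query; unjudged documents count as non-relevant."""
    gains = [qrels.get(docno, 0) for docno in ranked_docnos[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy relevance judgments for a single query (hypothetical values).
qrels = {"d1": 1, "d2": 0, "d3": 2}

perfect = ndcg_at_k(["d3", "d1", "d2"], qrels, k=5)
worse = ndcg_at_k(["d2", "d1", "d3"], qrels, k=5)
print(perfect, worse)
```

The table above reports the mean of such per-query values over all topics.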


In [16]:
metrics_per_query = pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5"],
    names=["max passage", "mean passage"],
    perquery=True,
)
metrics_per_query

| | name | qid | measure | value |
|---|---|---|---|---|
| 0 | max passage | q072210025 | ndcg_cut_5 | 0.000000 |
| 1 | max passage | q072210054 | ndcg_cut_5 | 0.000000 |
| 2 | max passage | q072210114 | ndcg_cut_5 | 0.151020 |
| 3 | max passage | q07221016 | ndcg_cut_5 | 0.327395 |
| 4 | max passage | q072210178 | ndcg_cut_5 | 0.000000 |
| ... | ... | ... | ... | ... |
| 1755 | mean passage | q07229744 | ndcg_cut_5 | 0.000000 |
| 1756 | mean passage | q07229758 | ndcg_cut_5 | 0.000000 |
| 1757 | mean passage | q07229782 | ndcg_cut_5 | 0.000000 |
| 1758 | mean passage | q07229809 | ndcg_cut_5 | 0.478543 |


### Significance Test between both systems.

In [17]:

hypo1 = pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5", "ndcg"],
    names=["max passage", "mean passage"],
    baseline=0,  # compare every other system against the first one (max passage)
)

hypo1


| | name | ndcg | ndcg_cut_5 | ndcg + | ndcg - | ndcg p-value | ndcg_cut_5 + | ndcg_cut_5 - | ndcg_cut_5 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | max passage | 0.179825 | 0.139535 | | | | | | |
| 1 | mean passage | 0.177454 | 0.135164 | 185.0 | 171.0 | 0.322267 | 137.0 | 148.0 | 0.258123 |
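
The extra columns of such a comparison can be sketched as follows, assuming that `+` and `-` count the queries on which a system scores higher or lower than the baseline and that the p-value comes from a paired t-test on the per-query differences. The helper `paired_comparison` and the toy scores are hypothetical. The sketch stops at the t statistic because the Python standard library has no t-distribution; a function such as `scipy.stats.ttest_rel` would return the p-value as well.

```python
import math
from statistics import mean, stdev


def paired_comparison(baseline, other):
    """Return (#queries better, #queries worse, paired t statistic)."""
    diffs = [o - b for b, o in zip(baseline, other)]
    better = sum(d > 0 for d in diffs)
    worse = sum(d < 0 for d in diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs))) if sd > 0 else 0.0
    return better, worse, t_stat


# Toy per-query scores for two systems on four queries (hypothetical values).
baseline_scores = [0.0, 0.5, 0.3, 0.8]
other_scores = [0.1, 0.5, 0.2, 0.9]

better, worse, t_stat = paired_comparison(baseline_scores, other_scores)
print(better, worse, round(t_stat, 3))
```

A difference counts as significant at $\alpha = 0.05$ only if the resulting p-value falls below 0.05.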


### Hypothesis 2: Choosing $k \in \{2i \mid i \in \{1, \dots, 10\}\}$ such that the nDCG of $k$-max average aggregation is maximized yields a significantly ($\alpha = 0.05$) better nDCG than max passage or mean passage aggregation.

In [18]:
from pyterrier.text import KMaxAvgPassage

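@dataclass(unsafe_hash=True)
class TuneableKMaxAvgPassage(KMaxAvgPassage):
    # Wraps a pipeline and exposes `k` as an attribute so that pt.GridSearch can tune it.
    pipeline: pt.Transformer
    k: int

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        # Rebuild the aggregation step with the current value of k on every call.
        pipeline = self.pipeline >> pt.text.kmaxavg_passage(k=self.k)
        return pipeline.transform(topics_or_res)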
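
The idea behind $k$-max average aggregation can be sketched in a few lines: a document's score is the average of its $k$ best passage scores. The helper `kmax_avg` is hypothetical, and averaging over all available passages when a document has fewer than $k$ is an assumption of this sketch.

```python
def kmax_avg(scores, k):
    """Average of the k highest passage scores of one document."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


# Toy passage scores of a single document (hypothetical values).
scores = [3.0, 1.0, 2.0]

print(kmax_avg(scores, 1))   # behaves like max passage
print(kmax_avg(scores, 2))   # average of the two best passages
print(kmax_avg(scores, 10))  # k exceeds the passage count: behaves like mean passage
```

Under this assumption, $k = 1$ reduces to max passage and a sufficiently large $k$ reduces to mean passage, so tuning $k$ interpolates between the two strategies of Hypothesis 1.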


In [38]:
bm25_bo1_pl2_kmax = TuneableKMaxAvgPassage(bm25_bo1_pl2_mono_cached, k=2)

#### Find $k$ such that nDCG of reranking with $k$-max average passage is maximized.

In [39]:
bm25_bo1_pl2_best_kmax = pt.GridSearch(
    bm25_bo1_pl2_kmax,
    {bm25_bo1_pl2_kmax :  {'k' : [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]}},
    topics,
    qrels,
    'ndcg',
    verbose=True
)

GridScan: 100%|██████████| 10/10 [00:51<00:00,  5.13s/it]

Best ndcg is 0.180469
Best setting is ['TuneableKMaxAvgPassage(pipeline=Cache(bm25_bo1_pl2_mono_cache_bust_6), k=20) k=6']
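
Conceptually, the grid search evaluates the pipeline once per candidate value and keeps the best one. A minimal sketch of that loop follows; `grid_search` and the stand-in `evaluate` function are hypothetical, and the toy scoring function is invented so that it peaks at $k = 6$ purely for illustration.

```python
def grid_search(evaluate, candidates):
    """Evaluate every candidate and return (best candidate, best score)."""
    results = {candidate: evaluate(candidate) for candidate in candidates}
    best = max(results, key=results.get)
    return best, results[best]


# The grid from Hypothesis 2: k = 2i for i = 1, ..., 10.
candidates = [2 * i for i in range(1, 11)]


def evaluate(k):
    # Stand-in for "run the pipeline with this k and compute mean nDCG".
    return -((k - 6) ** 2)


best_k, best_score = grid_search(evaluate, candidates)
print(best_k, best_score)
```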





In [40]:
pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean, bm25_bo1_pl2_best_kmax],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5", "ndcg"],
    names=["max passage", "mean passage", "best_k_pipeline"],
)

| | name | ndcg_cut_5 | ndcg |
|---|---|---|---|
| 0 | max passage | 0.139535 | 0.179825 |
| 1 | mean passage | 0.135164 | 0.177454 |
| 2 | best_k_pipeline | 0.139134 | 0.180469 |


In [27]:
metrics_per_query_2 = pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean, bm25_bo1_pl2_best_kmax],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5"],
    names=["max passage", "mean passage", "best kmax pipeline"],
    perquery=True,
)
metrics_per_query_2

| | name | qid | measure | value |
|---|---|---|---|---|
| 1764 | best kmax pipeline | q072210025 | ndcg_cut_5 | 0.000000 |
| 1765 | best kmax pipeline | q072210054 | ndcg_cut_5 | 0.000000 |
| 1766 | best kmax pipeline | q072210114 | ndcg_cut_5 | 0.151020 |
| 1767 | best kmax pipeline | q07221016 | ndcg_cut_5 | 0.327395 |
| 1768 | best kmax pipeline | q072210178 | ndcg_cut_5 | 0.000000 |
| ... | ... | ... | ... | ... |
| 1755 | mean passage | q07229744 | ndcg_cut_5 | 0.000000 |
| 1756 | mean passage | q07229758 | ndcg_cut_5 | 0.000000 |
| 1757 | mean passage | q07229782 | ndcg_cut_5 | 0.000000 |
| 1758 | mean passage | q07229809 | ndcg_cut_5 | 0.478543 |


### Significance Test between systems.

In [29]:
hypo2 = pt.Experiment(
    [bm25_bo1_pl2_best_kmax, bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5", "ndcg"],
    names=["best kmax pipeline", "max passage", "mean passage"],
    baseline=0,  # compare max passage and mean passage against the tuned k-max pipeline
)

hypo2

| | name | ndcg | ndcg_cut_5 | ndcg + | ndcg - | ndcg p-value | ndcg_cut_5 + | ndcg_cut_5 - | ndcg_cut_5 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | best kmax pipeline | 0.180415 | 0.138147 | | | | | | |
| 1 | max passage | 0.179825 | 0.139535 | 135.0 | 163.0 | 0.704515 | 107.0 | 114.0 | 0.602257 |
| 2 | mean passage | 0.177454 | 0.135164 | 162.0 | 146.0 | 0.115133 | 117.0 | 103.0 | 0.334018 |


### Step 5: Persist results.

In [43]:
with open("results.txt", "wt") as file:
    file.write("Results for Hypothesis 1:\n\n")
    file.write("Significance test:\n")
    hypo1_string = hypo1.to_string(header=True, index=False)
    file.write(hypo1_string + "\n\n")
    # ...
    file.write("Results for Hypothesis 2:\n\n")
    file.write("Best k in [2, 4, 6, 8, 10, 12, 14, 16, 18, 20] is " + str(bm25_bo1_pl2_best_kmax.k) + "\n\n")
    file.write("Significance test:\n")
    hypo2_string = hypo2.to_string(header=True, index=False)
    file.write(hypo2_string)
    # ...