# PyTerrier Notebook for Full-Rank Submissions

This notebook is a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index, retrieves with a linear combination of BM25 (with Bo1 query expansion) and PL2, and compares passage aggregation strategies for re-ranking.

### Step 1: Ensure Libraries are Imported

In [1]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
    !pip3 install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5.git
else:
    print('We are in the TIRA sandbox.')

from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt

Collecting python-terrier
  Downloading python-terrier-0.10.0.tar.gz (107 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tira==0.0.88
  Downloading tira-0.0.88-py3-none-any.whl.metadata (4.4 kB)
Collecting ir_datasets
  Downloading ir_datasets-0.5.5-py3-none-any.whl.metadata (12 kB)
Collecting docker==6.*,>=6.0.0 (from tira==0.0.88)
  Downloading docker-6.1.3-py3-none-any.whl.metadata (3.5 kB)
Collecting wget (from python-terrier)
  Downloading wget-3.2.zip (10 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tqdm (from python-terrier)
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57



Ensure that the PyTerrier integration is loaded.

In [2]:
ensure_pyterrier_is_loaded()

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True
terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /home/codespace/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /home/codespace/.pyterrier...
Done
terrier-prf -SNAPSHOT jar not found, downloading to /home/codespace/.pyterrier...
Done


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load data, create index

In [16]:
dataset = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')
qrels = dataset.get_qrels()
# For debugging: append [:5] to use only the first 5 topics.
topics = dataset.get_topics(variant="title")  # [:5]

from pathlib import Path
index_loc = "./index"
if not (Path(index_loc) / "data.properties").exists():
    indexer = pt.IterDictIndexer(index_loc)
    indexref = indexer.index(dataset.get_corpus_iter())
else:
    indexref = pt.IndexFactory.of(index_loc)

No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.

ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:49<00:00, 1248.40it/s]


### Step 3: Create retrieval pipeline

#### We aim to retrieve documents via a linear combination of PL2 and BM25. First, let's set up PL2.

In [17]:
pl2 = pt.BatchRetrieve(indexref, wmodel="PL2", verbose=True)

#### Next, we set up BM25 retrieval with Bo1 query expansion.

In [18]:
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25", verbose=True)

bo1_expansion = bm25 >> pt.rewrite.Bo1QueryExpansion(indexref)
bm25_bo1 = bo1_expansion >> bm25
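
The pipeline above retrieves with BM25, lets Bo1 pick expansion terms from the top-ranked documents, and then runs the reformulated query through BM25 again. As a rough illustration of the term weighting, the sketch below follows the Bo1 formula as it is commonly documented for Terrier. The function names, the `n_terms` parameter, and the toy statistics are assumptions made for this example and are not part of PyTerrier.

```python
import math


def bo1_weight(tf_feedback: int, coll_freq: int, num_docs: int) -> float:
    """Bo1 weight of a candidate expansion term.

    tf_feedback: occurrences of the term in the feedback documents
    coll_freq:   occurrences of the term in the whole collection
    num_docs:    number of documents in the collection
    """
    p_n = coll_freq / num_docs
    return tf_feedback * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


def top_expansion_terms(feedback_tf, coll_freq, num_docs, n_terms=10):
    """Rank candidate terms by Bo1 weight and keep the n_terms best."""
    weights = {
        term: bo1_weight(tf, coll_freq[term], num_docs)
        for term, tf in feedback_tf.items()
    }
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n_terms]


# Hypothetical statistics: both terms occur 6 times in the feedback documents,
# but "the" is far more frequent in the collection than "purchase".
feedback_tf = {"purchase": 6, "the": 6}
coll_freq = {"purchase": 50, "the": 100_000}
ranked_terms = top_expansion_terms(feedback_tf, coll_freq, num_docs=1000)
```

Terms that are frequent in the feedback documents but rare in the collection receive the highest weights, which is why a content word outranks a stopword here.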

#### Let's combine the two systems.

In [19]:
# Weight the BM25+Bo1 scores twice as heavily as the PL2 scores.
bm25_bo1_pl2 = 2 * bm25_bo1 + pl2
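
To make the weighted sum concrete, here is a minimal plain-Python sketch of score fusion. It assumes that a document missing from one result list contributes a score of 0 for that list. The helper names and the toy scores are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def linear_combine(runs_with_weights):
    """Sum weighted scores per (qid, docno) across several runs.

    Each run maps (qid, docno) -> score. A document missing from a run
    contributes 0 for that run (an assumption of this sketch).
    """
    combined = defaultdict(float)
    for run, weight in runs_with_weights:
        for key, score in run.items():
            combined[key] += weight * score
    return dict(combined)


def rank(combined):
    """Return (qid, docno, score) tuples sorted by qid, then descending score."""
    return sorted(
        ((qid, docno, score) for (qid, docno), score in combined.items()),
        key=lambda row: (row[0], -row[2]),
    )


# Hypothetical toy scores, not taken from the notebook's index.
bm25_bo1_scores = {("q1", "d1"): 10.0, ("q1", "d2"): 8.0}
pl2_scores = {("q1", "d2"): 5.0, ("q1", "d3"): 4.0}

fused = linear_combine([(bm25_bo1_scores, 2.0), (pl2_scores, 1.0)])
ranking = rank(fused)
```

In this toy case `d2` overtakes `d1` because it is found by both systems, even though `d1` has the higher BM25 score.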

#### Check that swapping the `query` and `query_0` columns restores the original query for re-ranking.

In [53]:
run = bm25_bo1_pl2(topics)
# After Bo1, "query" holds the expanded query and "query_0" the original one; swap them.
run.rename(columns={"query": "query_0", "query_0": "query"})

BR(BM25): 100%|██████████| 5/5 [00:00<00:00, 22.31q/s]
BR(BM25): 100%|██████████| 5/5 [00:00<00:00, 19.57q/s]
BR(PL2): 100%|██████████| 5/5 [00:00<00:00, 27.19q/s]


|  | qid | docid | docno | score | query | query_0 | rank |
|---|---|---|---|---|---|---|---|
| 0 | q072224 | 59628.0 | doc072202002905 | 36.642224 | purchase money | applypipeline:off purchas^1.124916191 monei^1.... | 0 |
| 1 | q072224 | 27801.0 | doc072211705166 | 32.470182 | purchase money | applypipeline:off purchas^1.124916191 monei^1.... | 2 |
| 2 | q072224 | 16790.0 | doc072215502070 | 32.121412 | purchase money | applypipeline:off purchas^1.124916191 monei^1.... | 3 |
| 3 | q072224 | 19718.0 | doc072211305429 | 32.121412 | purchase money | applypipeline:off purchas^1.124916191 monei^1.... | 4 |
| 4 | q072224 | 40315.0 | doc072203203007 | 32.121412 | purchase money | applypipeline:off purchas^1.124916191 monei^1.... | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6515 | q072242 |  | doc072201001557 | 2.758636 |  |  | 1527 |
| 6516 | q072242 |  | doc072203205122 | 2.757709 |  |  | 1528 |
| 6517 | q072242 |  | doc072200401101 | 2.756787 |  |  | 1529 |
| 6518 | q072242 |  | doc072208905974 | 2.755216 |  |  | 1530 |


### Next, we want to rerank the output with a transformer.
(after loading the document text and resetting the expanded query)

In [20]:
import pandas as pd

corpus = pd.DataFrame(dataset.get_corpus_iter())


class GetText(pt.Transformer):
    """Attach the document text by joining the results with the corpus on docno."""

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        return pd.merge(topics_or_res, corpus, on="docno")


class ResetQueryColumn(pt.Transformer):
    """Swap query and query_0 so that the re-ranker sees the original query."""

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        return topics_or_res.rename(columns={"query": "query_0", "query_0": "query"})


No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.

ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:04<00:00, 14706.38it/s]


In [21]:
# from pyterrier_t5 import MonoT5ReRanker
# monoT5 = MonoT5ReRanker(verbose=True, batch_size=1)

# For debugging: GitHub Codespaces does not seem to have enough RAM to run monoT5,
# so a BM25 text scorer stands in for the re-ranker under the same variable name.
monoT5 = pt.text.scorer(body_attr="text", wmodel="BM25")

In [22]:
bm25_bo1_pl2_mono = (
    bm25_bo1_pl2 % 10 >> 
    GetText() >> 
    ResetQueryColumn() >> 
    pt.text.sliding(length=400, stride=64, prepend_attr=None, text_attr="text") % 5 >> 
    monoT5)
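
The `pt.text.sliding` step splits every document into overlapping passages before scoring. The plain-Python sketch below illustrates the general idea with whitespace tokens. It is an approximation written for this notebook, so details such as how the final window is handled may differ from PyTerrier's implementation.

```python
def sliding_passages(text: str, length: int = 400, stride: int = 64):
    """Split text into overlapping passages of at most `length` tokens.

    Tokens are whitespace-separated; a new window starts every `stride`
    tokens. Short texts yield a single passage.
    """
    tokens = text.split()
    if len(tokens) <= length:
        return [" ".join(tokens)]
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + length]))
        if start + length >= len(tokens):
            break  # this window already reaches the end of the text
    return passages


# Toy document with 10 tokens, windows of 4 tokens, a new window every 2 tokens.
toy_text = " ".join(f"t{i}" for i in range(10))
toy_passages = sliding_passages(toy_text, length=4, stride=2)
```

With `length=400` and `stride=64`, consecutive passages overlap heavily, so a long document produces many passages that each need a score.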

#### Cache the pipeline.

In [23]:
import pandas as pd
from dataclasses import dataclass


@dataclass(frozen=True)
class NamedTransformer(pt.Transformer):
    nametrans: str
    wrapped: pt.Transformer

    def __repr__(self) -> str:
        return self.nametrans

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return self.wrapped.transform(df)


In [24]:
bm25_bo1_pl2_mono_cached = ~NamedTransformer("bm25_bo1_pl2_mono_cache_bust_6", bm25_bo1_pl2_mono)
repr(bm25_bo1_pl2_mono_cached)

'Cache(bm25_bo1_pl2_mono_cache_bust_6)'
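
The suffix `cache_bust_6` hints at how the name is used: changing it forces the cached results to be recomputed. The sketch below assumes that the cache is keyed on the `repr` of the wrapped step, which is consistent with the `Cache(...)` output above but is not verified here. All class names in the sketch are hypothetical stand-ins, not PyTerrier classes.

```python
class NamedStep:
    """A callable step whose repr is a fixed, human-chosen name."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __repr__(self):
        return self.name

    def __call__(self, query):
        return self.func(query)


class ReprKeyedCache:
    """Memoize a step's results under (repr(step), query)."""

    def __init__(self, step, store):
        self.step = step
        self.store = store  # shared dict standing in for an on-disk cache
        self.misses = 0

    def __call__(self, query):
        key = (repr(self.step), query)
        if key not in self.store:
            self.misses += 1
            self.store[key] = self.step(query)
        return self.store[key]


store = {}
cached_v6 = ReprKeyedCache(NamedStep("pipeline_cache_bust_6", str.upper), store)
first = cached_v6("purchase money")   # computed and stored
second = cached_v6("purchase money")  # served from the store

# A new name gives a new key, so the changed step is recomputed.
cached_v7 = ReprKeyedCache(NamedStep("pipeline_cache_bust_7", str.title), store)
third = cached_v7("purchase money")
```

Under this assumption, forgetting to bump the name after changing the pipeline would silently return stale results.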

### Hypothesis 1: There is a significant ($\alpha = 0.05$) difference in nDCG between max passage and mean passage aggregation.

#### Firstly, rerank with max passage aggregation.

In [25]:
bm25_bo1_pl2_max = bm25_bo1_pl2_mono_cached >> pt.text.max_passage()
bm25_bo1_pl2_max.transform(topics).head()

BR(BM25): 100%|██████████| 877/877 [00:19<00:00, 44.80q/s]
BR(BM25): 100%|██████████| 873/873 [00:23<00:00, 36.91q/s]
BR(PL2): 100%|██████████| 877/877 [00:18<00:00, 46.67q/s]


calling sliding on df of 8730 rows


                                                                



|  | qid | query_0 | text | score | query | docno | rank |
|---|---|---|---|---|---|---|---|
| 29 | q072224 | applypipeline:off purchas^1.124916191 monei^1.... | to stay, making inflation even more difficult ... | 0.78848 | purchase money | doc072215502070 | 7 |
| 29 | q072210025 | applypipeline:off recip^1.053565089 spring^1.3... | Recipe Peanut Sauce for Spring Rolls Peanut S... | 26.489269 | recipe spring roll | doc072201202671 | 5 |
| 30 | q072224 | applypipeline:off purchas^1.124916191 monei^1.... | doc072211305429 Aztec Group Inc Florida Singa... | 1.111982 | purchase money | doc072211305429 | 6 |
| 30 | q072210025 | applypipeline:off recip^1.053565089 spring^1.3... | Spring Rolls Recipe \| Pratique.fr Spring roll... | 26.640448 | recipe spring roll | doc072204307357 | 4 |
| 0 | q072224 | applypipeline:off purchas^1.124916191 monei^1.... | Real Estate Offer Form \| Gorvamur Real Estate... | 0.470985 | purchase money | doc072202002905 | 11 |


#### Secondly, rerank with mean passage aggregation.

In [26]:
bm25_bo1_pl2_mean = bm25_bo1_pl2_mono_cached  >> pt.text.mean_passage()
bm25_bo1_pl2_mean.transform(topics).head()

|  | qid | docno | score | query | query_0 | rank |
|---|---|---|---|---|---|---|
| 0 | q072210025 | doc072201202671 | 26.489269 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 2 |
| 1 | q072210025 | doc072204307357 | 26.640448 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 1 |
| 2 | q072210025 | doc072207501000 | 25.446916 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 3 |
| 3 | q072210025 | doc072207504499 | 28.091892 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 0 |
| 4 | q072210025 | doc072212607743 | 21.485242 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 4 |
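
Both aggregations collapse the passage scores of a document into a single document score; they differ only in the reduction. A minimal plain-Python sketch with hypothetical toy scores shows how the two can rank the same documents differently:

```python
from collections import defaultdict
from statistics import mean


def aggregate_passages(passage_scores, how="max"):
    """Collapse passage scores into one score per document.

    passage_scores: iterable of (docno, score), one entry per passage
    how: "max" keeps the best passage, "mean" averages all passages
    """
    per_doc = defaultdict(list)
    for docno, score in passage_scores:
        per_doc[docno].append(score)
    reducer = {"max": max, "mean": mean}[how]
    return {docno: reducer(scores) for docno, scores in per_doc.items()}


# Hypothetical passage scores: d1 has one strong and one weak passage,
# d2 has two moderately good passages.
toy_passages = [("d1", 9.0), ("d1", 1.0), ("d2", 6.0), ("d2", 6.0)]

by_max = aggregate_passages(toy_passages, how="max")
by_mean = aggregate_passages(toy_passages, how="mean")
```

Max aggregation prefers `d1` (9.0 versus 6.0), whereas mean aggregation prefers `d2` (6.0 versus 5.0). Whether this difference matters on real data is what the comparison that follows measures.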


#### Let's compare both systems.

In [29]:
pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5", "ndcg"],
    names=["max passage", "mean passage"],
)

|  | name | ndcg_cut_5 | ndcg |
|---|---|---|---|
| 0 | max passage | 0.142422 | 0.137406 |
| 1 | mean passage | 0.14106 | 0.136222 |
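
For readers unfamiliar with the measures: `ndcg_cut_5` is nDCG truncated at rank 5, and the reported values are averages over all topics. The following sketch computes nDCG for a single query in plain Python. It assumes linear gains and a log2 discount; the function names and toy judgments are hypothetical, and the evaluation library used by `pt.Experiment` may differ in details.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_docnos, qrels, k=None):
    """nDCG for one query.

    ranked_docnos: docnos in ranked order
    qrels: dict docno -> graded relevance (missing means 0)
    k: cutoff; None scores the full ranking
    """
    gains = [qrels.get(d, 0) for d in ranked_docnos[:k]]
    ideal = sorted((r for r in qrels.values() if r > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one query.
toy_qrels = {"d1": 1, "d2": 0, "d3": 2}
perfect = ndcg_at_k(["d3", "d1", "d2"], toy_qrels, k=5)
imperfect = ndcg_at_k(["d1", "d2", "d3"], toy_qrels, k=5)
```

A ranking that places the most relevant document first reaches 1.0; pushing it down to rank 3 lowers the score.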


In [32]:
metrics_per_query = pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5"],
    names=["max passage", "mean passage"],
    perquery=True,
)
metrics_per_query

|  | name | qid | measure | value |
|---|---|---|---|---|
| 1 | max passage | q072210025 | ndcg_cut_5 | 0.00000 |
| 3 | max passage | q072210054 | ndcg_cut_5 | 0.00000 |
| 6 | max passage | q072210114 | ndcg_cut_5 | 0.00000 |
| 8 | max passage | q07221016 | ndcg_cut_5 | 0.00000 |
| 9 | max passage | q072210178 | ndcg_cut_5 | 0.00000 |
| ... | ... | ... | ... | ... |
| 1755 | mean passage | q07229744 | ndcg_cut_5 | 0.00000 |
| 1756 | mean passage | q07229758 | ndcg_cut_5 | 0.00000 |
| 1757 | mean passage | q07229782 | ndcg_cut_5 | 0.00000 |
| 1758 | mean passage | q07229809 | ndcg_cut_5 | 0.38223 |


### Significance Test between both systems.

In [35]:

pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5", "ndcg"],
    names=["max passage", "mean passage"],
    baseline=0,  # index of the system the others are tested against (max passage)
)


|  | name | ndcg | ndcg_cut_5 | ndcg + | ndcg - | ndcg p-value | ndcg_cut_5 + | ndcg_cut_5 - | ndcg_cut_5 p-value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | max passage | 0.137406 | 0.142422 |  |  |  |  |  |  |
| 1 | mean passage | 0.136222 | 0.14106 | 103.0 | 104.0 | 0.559196 | 103.0 | 104.0 | 0.510276 |
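
Reading the output: the `+` and `-` columns count the topics on which mean passage scores above or below the baseline, and both p-values (0.559196 and 0.510276) are well above 0.05, so this run does not support Hypothesis 1. To our understanding, `pt.Experiment` uses a paired t-test by default. The sketch below deliberately swaps in a different technique, a paired sign-flip randomization test, because it needs only the standard library. Function names and toy scores are hypothetical.

```python
import random
from statistics import mean


def wins_losses(baseline, system):
    """Count queries where `system` scores above / below `baseline`."""
    better = sum(s > b for b, s in zip(baseline, system))
    worse = sum(s < b for b, s in zip(baseline, system))
    return better, worse


def paired_randomization_test(baseline, system, rounds=10_000, seed=0):
    """Two-sided p-value for the mean per-query difference.

    Under the null hypothesis the sign of each paired difference is
    arbitrary, so signs are flipped at random and the resulting mean
    differences are compared with the observed one.
    """
    diffs = [s - b for b, s in zip(baseline, system)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(rounds):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    return extreme / rounds


# Hypothetical per-query nDCG values for twelve queries.
toy_baseline = [0.10, 0.25, 0.00, 0.40, 0.33, 0.50, 0.12, 0.08, 0.61, 0.27, 0.19, 0.45]
toy_system = [b + 0.1 for b in toy_baseline]  # consistently better on every query

p_improved = paired_randomization_test(toy_baseline, toy_system)
p_identical = paired_randomization_test(toy_baseline, toy_baseline)
```

A consistent improvement on every query yields a small p-value, while identical systems yield a p-value of 1.0. The per-query values returned by `perquery=True` could be fed into such a test after aligning them by `qid`.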


### Hypothesis 2: Choosing $k \in \{5 \cdot i \mid i \in \{1, \dots, 10\}\}$ such that the nDCG of $k$-max average aggregation is maximized yields a significantly ($\alpha = 0.05$) better nDCG than max passage or mean passage aggregation.

In [None]:
from pyterrier.text import KMaxAvgPassage

@dataclass(unsafe_hash=True)
class TuneableKMaxAvgPassage(KMaxAvgPassage):
    """Wraps a pipeline and exposes k as an attribute that pt.GridSearch can vary."""

    pipeline: pt.Transformer
    k: int

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        # Rebuild the aggregation step on every call so that the current k is used.
        pipeline = self.pipeline >> pt.text.kmaxavg_passage(k=self.k)
        return pipeline.transform(topics_or_res)


In [None]:
bm25_bo1_pl2_kmax = TuneableKMaxAvgPassage(bm25_bo1_pl2_mono_cached, k=5)

#### Find $k$ such that nDCG of reranking with $k$-max average passage is maximized.

In [None]:
bm25_bo1_pl2_best_kmax = pt.GridSearch(
    bm25_bo1_pl2_kmax,
    {bm25_bo1_pl2_kmax: {"k": [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]}},
    topics,
    qrels,
    'ndcg',
    verbose=True
)

GridScan: 100%|██████████| 10/10 [00:00<00:00, 17.61it/s]

Best ndcg is 0.000000
Best setting is ['TuneableKMaxAvgPassage(pipeline=Cache(bm25_bo1_pl2_mono_cache_bust_3), k=50) k=5']
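
Conceptually, the grid search re-scores the documents for every candidate $k$ and keeps the value that maximizes the metric. The plain-Python sketch below illustrates this with $k$-max average aggregation, that is, the mean of the $k$ best passage scores per document. The helper names, the toy scores, and the reciprocal-rank metric standing in for nDCG are assumptions of this example.

```python
from collections import defaultdict
from statistics import mean


def kmaxavg(passage_scores, k):
    """Score each document by the mean of its k best passage scores."""
    per_doc = defaultdict(list)
    for docno, score in passage_scores:
        per_doc[docno].append(score)
    return {
        docno: mean(sorted(scores, reverse=True)[:k])
        for docno, scores in per_doc.items()
    }


def grid_search_k(passage_scores, candidates, evaluate):
    """Return (best_k, best_value) for the metric computed by `evaluate`.

    evaluate: callable taking a ranked list of docnos and returning a
    number to maximize (hypothetical stand-in for nDCG).
    """
    best_k, best_value = None, float("-inf")
    for k in candidates:
        doc_scores = kmaxavg(passage_scores, k)
        ranking = sorted(doc_scores, key=doc_scores.get, reverse=True)
        value = evaluate(ranking)
        if value > best_value:  # ties keep the earlier candidate
            best_k, best_value = k, value
    return best_k, best_value


# Hypothetical passages: d_rel is uniformly good, d_other has one outlier passage.
toy_passages = [
    ("d_rel", 5.0), ("d_rel", 5.0), ("d_rel", 5.0),
    ("d_other", 9.0), ("d_other", 2.0), ("d_other", 1.0),
]


def reciprocal_rank_of_d_rel(ranking):
    return 1.0 / (ranking.index("d_rel") + 1)


best_k, best_value = grid_search_k(toy_passages, [1, 2, 3], reciprocal_rank_of_d_rel)
```

With $k = 1$ the aggregation equals max passage, and once $k$ reaches the number of passages it equals mean passage, so the tuned $k$ interpolates between the two strategies compared under Hypothesis 1.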





In [64]:
pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean, bm25_bo1_pl2_best_kmax],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5"],
    names=["max passage", "mean passage", "best_k_pipeline"],
)

|  | name | ndcg_cut_5 |
|---|---|---|
| 0 | max passage | 0.0 |
| 1 | mean passage | 0.0 |
| 2 | best_k_pipeline | 0.0 |


### Step 4: Persist results.

In [65]:
metrics_per_query = pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean, bm25_bo1_pl2_best_kmax],
    topics,
    qrels,
    eval_metrics=["ndcg_cut_5"],
    names=["max passage", "mean passage", "best kmax pipeline"],
    perquery=True,
)
metrics_per_query

|  | name | qid | measure | value |
|---|---|---|---|---|
| 10 | best kmax pipeline | q072224 | ndcg_cut_5 | 0.0 |
| 11 | best kmax pipeline | q072226 | ndcg_cut_5 | 0.0 |
| 12 | best kmax pipeline | q072232 | ndcg_cut_5 | 0.0 |
| 13 | best kmax pipeline | q072240 | ndcg_cut_5 | 0.0 |
| 14 | best kmax pipeline | q072242 | ndcg_cut_5 | 0.0 |
| 0 | max passage | q072224 | ndcg_cut_5 | 0.0 |
| 1 | max passage | q072226 | ndcg_cut_5 | 0.0 |
| 2 | max passage | q072232 | ndcg_cut_5 | 0.0 |
| 3 | max passage | q072240 | ndcg_cut_5 | 0.0 |
| 4 | max passage | q072242 | ndcg_cut_5 | 0.0 |


In [66]:
# TODO: Significance tests between the best k-max average pipeline and the two others

In [67]:
with open("results.txt", "wt") as file:
    file.write("Results for Hypothesis 1:\n")
    # ...
    file.write("Results for Hypothesis 2:\n")
    # ...
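
One possible way to fill in the placeholders is to render each result table as a short text section before writing it. The sketch below is only a suggestion; `format_results` is a hypothetical helper, and the numbers are copied from the Hypothesis 1 comparison table in this notebook.

```python
def format_results(title, rows):
    """Render metric rows as a plain-text section.

    rows: list of (system name, {metric: value}) pairs
    """
    lines = [f"{title}:"]
    for name, metrics in rows:
        cells = ", ".join(f"{m}={v:.4f}" for m, v in metrics.items())
        lines.append(f"  {name}: {cells}")
    return "\n".join(lines) + "\n"


# Values copied from the Hypothesis 1 comparison table in this notebook.
hypothesis_1 = [
    ("max passage", {"ndcg_cut_5": 0.142422, "ndcg": 0.137406}),
    ("mean passage", {"ndcg_cut_5": 0.14106, "ndcg": 0.136222}),
]
report = format_results("Results for Hypothesis 1", hypothesis_1)
```

Calling `file.write(report)` inside the `with` block would then replace the first placeholder comment.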