# PyTerrier Notebook for Full-Rank Submissions

This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index, retrieves with a linear combination of BM25 (with Bo1 query expansion) and PL2, and reranks the top results with monoT5.

### Step 1: Ensure Libraries are Imported

In [1]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !python -m pip install --upgrade pip
    !pip3 install -q python-terrier tira==0.0.88 ir_datasets
    !pip3 install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5.git
else:
    print('We are in the TIRA sandbox.')

from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt




Ensure the PyTerrier integration is loaded.

In [2]:
ensure_pyterrier_is_loaded()

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load data, create index

In [4]:
dataset = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')
topics = dataset.get_topics(variant="title")

from pathlib import Path
index_loc = "./index"
if not (Path(index_loc) / "data.properties").exists():
    indexer = pt.IterDictIndexer(index_loc)
    indexref = indexer.index(dataset.get_corpus_iter())
else:
    indexref = pt.IndexFactory.of(index_loc)

### Step 3: Create retrieval pipeline

#### We aim to retrieve documents via a linear combination of PL2 and BM25. First, we set up PL2.

In [5]:
pl2 = pt.BatchRetrieve(indexref, wmodel="PL2", verbose=True)

#### Next, we set up BM25 retrieval with Bo1 query expansion.

In [6]:
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25", verbose=True)

bo1_expansion = bm25 >> pt.rewrite.Bo1QueryExpansion(indexref)
bm25_bo1 = bo1_expansion >> bm25
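
The pipeline above retrieves with BM25, lets Bo1 pick expansion terms from the top-ranked documents, and then retrieves again with the expanded query. As a rough illustration of how Bo1 could score candidate terms, here is a minimal sketch that assumes the Bose-Einstein 1 weighting as it is commonly described for Terrier. The function `bo1_weight` and the toy term statistics are hypothetical and are not taken from PyTerrier.

```python
import math


def bo1_weight(tf_feedback, collection_freq, num_docs):
    """Bo1 weight of one candidate term (sketch, assumed formula).

    tf_feedback:     frequency of the term in the top-ranked (feedback) documents
    collection_freq: frequency of the term in the whole collection
    num_docs:        number of documents in the collection
    """
    p_n = collection_freq / num_docs
    return tf_feedback * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


# Hypothetical statistics: term -> (tf in feedback docs, collection frequency)
candidates = {"wrapper": (6, 40), "recipe": (6, 4000), "the": (9, 90000)}
num_docs = 61307  # corpus size reported by the progress bar in this notebook

weights = {term: bo1_weight(tf, cf, num_docs) for term, (tf, cf) in candidates.items()}
# Terms that are frequent in the feedback documents but rare overall rank first.
expansion_terms = sorted(weights, key=weights.get, reverse=True)
print(expansion_terms)
```

Terms with the highest weights would be appended to the query, which is consistent with the weighted terms (`recip^1.05...`) visible in the expanded query strings.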

#### Let's combine the two systems.

In [7]:
bm25_bo1_pl2 = 2 * bm25_bo1 + pl2
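
To make the effect of the weighted sum concrete, the following pure-Python sketch combines two rankings for a single query. It assumes that a document missing from one ranking contributes a score of 0 from that ranking; PyTerrier's operator may handle details differently. The function `combine_scores` and the toy scores are hypothetical.

```python
def combine_scores(run_a, run_b, weight_a=1.0, weight_b=1.0):
    """Weighted sum of two {docno: score} mappings for one query."""
    combined = {}
    for docno in set(run_a) | set(run_b):
        # A document retrieved by only one system gets 0 from the other one.
        combined[docno] = weight_a * run_a.get(docno, 0.0) + weight_b * run_b.get(docno, 0.0)
    # Highest combined score first; ties are broken by docno for determinism.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


bm25_bo1_scores = {"d1": 10.0, "d2": 8.0}  # toy scores
pl2_scores = {"d2": 5.0, "d3": 4.0}        # toy scores

ranking = combine_scores(bm25_bo1_scores, pl2_scores, weight_a=2.0)
print(ranking)
```

Because BM25 and PL2 scores live on different scales, the weight of 2 is a tuning choice rather than a normalised interpolation.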

#### Inspect the run to confirm that the query columns must be renamed before reranking.

In [7]:
# run = bm25_bo1_pl2(topics)
# run.rename(columns= {"query": "query_0", "query_0": "query"})

BR(BM25): 100%|██████████| 882/882 [00:23<00:00, 36.84q/s]
BR(BM25): 100%|██████████| 878/878 [00:21<00:00, 40.37q/s]
BR(PL2): 100%|██████████| 882/882 [00:18<00:00, 47.68q/s]


| | qid | docid | docno | score | query | query_0 | rank |
|---|---|---|---|---|---|---|---|
| 0 | q072210025 | 21318.0 | doc072207501000 | 66.759043 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 1 |
| 1 | q072210025 | 9672.0 | doc072212607743 | 66.977549 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 0 |
| 2 | q072210025 | 43796.0 | doc072207504499 | 66.178928 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 2 |
| 3 | q072210025 | 8216.0 | doc072201202671 | 62.435225 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 3 |
| 4 | q072210025 | 59542.0 | doc072204307357 | 62.309674 | recipe spring roll | applypipeline:off recip^1.053565089 spring^1.3... | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1099732 | q072230074 | | doc072211000339 | 3.473103 | | | 1309 |
| 1099733 | q072230074 | | doc072208407385 | 3.473081 | | | 1310 |
| 1099734 | q072230074 | | doc072203309049 | 3.472435 | | | 1311 |
| 1099735 | q072230074 | | doc072202201538 | 3.471925 | | | 1312 |


### Next, we rerank the output with a transformer.
Before reranking, we load the document text and reset the expanded query to the original query.

In [8]:
import pandas as pd

corpus = pd.DataFrame(dataset.get_corpus_iter())


class GetText(pt.Transformer):
    """Attach the document text to each result row by joining on docno."""

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        return pd.merge(topics_or_res, corpus, on="docno")


class ResetQueryColumn(pt.Transformer):
    """Swap the `query` and `query_0` columns before reranking."""

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        return topics_or_res.rename(columns={"query": "query_0", "query_0": "query"})


No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.

ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:04<00:00, 14955.48it/s]


In [9]:
from pyterrier_t5 import MonoT5ReRanker
monoT5 = MonoT5ReRanker(verbose=True, batch_size=16)

# only for debugging (GitHub Codespaces seems not to have enough RAM to run monoT5):
# monoT5 = pt.text.scorer(body_attr="text", wmodel="BM25")



In [11]:
bm25_bo1_pl2_mono = (
    bm25_bo1_pl2 % 50 >> 
    GetText() >> 
    ResetQueryColumn() >> 
    pt.text.sliding(length=400, stride=64, prepend_attr=None, text_attr="text") >> 
    monoT5)
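
The `pt.text.sliding` step splits each document into overlapping passages (here 400 tokens long, starting every 64 tokens) so that monoT5 scores passages instead of whole documents. The sketch below illustrates the idea in pure Python with whitespace tokens. The function `sliding_passages`, the `%p` passage-id suffix, and the handling of short documents and trailing tokens are assumptions of this sketch; the library may behave differently at the edges.

```python
def sliding_passages(docno, text, length=400, stride=64):
    """Split whitespace tokens into overlapping passages of `length` tokens."""
    tokens = text.split()
    if len(tokens) <= length:
        # Short documents become a single passage.
        return [(f"{docno}%p0", " ".join(tokens))]
    passages = []
    # Only full windows are emitted; trailing tokens that do not fill
    # a window are dropped in this sketch.
    for i, start in enumerate(range(0, len(tokens) - length + 1, stride)):
        passages.append((f"{docno}%p{i}", " ".join(tokens[start:start + length])))
    return passages


toy_text = " ".join(f"t{i}" for i in range(10))  # ten toy tokens t0..t9
passages = sliding_passages("doc1", toy_text, length=4, stride=2)
for passage_id, passage_text in passages:
    print(passage_id, "->", passage_text)
```

With a stride much smaller than the window length, as in the pipeline above, neighbouring passages overlap heavily, which increases reranking cost but reduces the chance of cutting a relevant span in half.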

#### Cache the pipeline.

In [12]:
import pandas as pd
from dataclasses import dataclass

# TODO: after debugging, change the name passed to MonoT5Transformer to invalidate the cache

@dataclass(frozen=True)
class MonoT5Transformer(pt.Transformer):
    nametrans: str
    wrapped: pt.Transformer

    def __repr__(self) -> str:
        return self.nametrans

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        return self.wrapped.transform(df)


#### Check if caching worked.

In [13]:
bm25_bo1_pl2_monot5_cache = ~MonoT5Transformer("bm25_bo1_pl2_mono_cache_bust_10", bm25_bo1_pl2_mono)
repr(bm25_bo1_pl2_monot5_cache)

'Cache(bm25_bo1_pl2_mono_cache_bust_10)'
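
The wrapper gives the pipeline a stable, hand-chosen `repr`, and changing that name (the `cache_bust_10` suffix) is how stale results are discarded. The following is a minimal sketch of that idea, assuming a cache that is keyed on `repr()` of the wrapped object; the actual PyTerrier cache may be implemented differently. `NamedStep` and `ReprCache` are hypothetical names.

```python
class NamedStep:
    """A callable whose repr is a fixed, user-chosen name."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __repr__(self):
        return self.name

    def __call__(self, query):
        return self.func(query)


class ReprCache:
    """Cache results under the key (repr(step), query)."""

    _store = {}  # shared by all instances, like a cache directory on disk

    def __init__(self, step):
        self.step = step
        self.calls = 0  # how often the wrapped step actually ran

    def __call__(self, query):
        key = (repr(self.step), query)
        if key not in self._store:
            self.calls += 1
            self._store[key] = self.step(query)
        return self._store[key]


cached_v1 = ReprCache(NamedStep("pipeline_v1", str.upper))
first = cached_v1("spring roll")
second = cached_v1("spring roll")  # served from the cache

# A new name produces new cache keys, so the step runs again.
cached_v2 = ReprCache(NamedStep("pipeline_v2", str.upper))
third = cached_v2("spring roll")
print(cached_v1.calls, cached_v2.calls)
```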

### Step 4: Test hypotheses.

### Hypothesis 1: There is a significant ($\alpha = 0.05$) difference in nDCG between max-passage and mean-passage aggregation.

#### Firstly, rerank with max passage aggregation.

#### Secondly, rerank with mean passage aggregation.

In [None]:
from pyterrier.text import KMaxAvgPassage

@dataclass(unsafe_hash=True)
class TuneableKMaxAvgPassage(KMaxAvgPassage):
    pipeline: pt.Transformer
    k: int

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        pipeline = self.pipeline >> pt.text.kmaxavg_passage(k=self.k)
        return pipeline.transform(topics_or_res)
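
Max-passage and mean-passage aggregation can both be seen as special cases of averaging the k best passage scores per document: k = 1 gives the maximum, and a k at least as large as the number of passages gives the mean. The pure-Python sketch below illustrates this on toy scores; the function `kmaxavg` is a hypothetical stand-in and not the PyTerrier implementation.

```python
from collections import defaultdict


def kmaxavg(passage_scores, k):
    """Aggregate (docno, score) pairs: mean of the k best passages per document."""
    by_doc = defaultdict(list)
    for docno, score in passage_scores:
        by_doc[docno].append(score)
    result = {}
    for docno, scores in by_doc.items():
        top_k = sorted(scores, reverse=True)[:k]
        # Documents with fewer than k passages are averaged over what they have.
        result[docno] = sum(top_k) / len(top_k)
    return result


toy_scores = [("d1", 0.5), ("d1", 1.0), ("d1", 0.0), ("d2", 0.25)]

max_scores = kmaxavg(toy_scores, k=1)   # equivalent to max-passage aggregation
top2_scores = kmaxavg(toy_scores, k=2)  # in between
mean_scores = kmaxavg(toy_scores, k=3)  # equivalent to mean-passage here
print(max_scores, top2_scores, mean_scores)
```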


In [None]:
k = 4
bm25_bo1_pl2_kmax = TuneableKMaxAvgPassage(bm25_bo1_pl2_monot5_cache, k=k)
kmax_run = bm25_bo1_pl2_kmax.transform(topics)
persist_and_normalize_run(kmax_run, f"bm25_bo1_pl2_kmax_k_{k}")
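
Testing Hypothesis 1 requires comparing per-topic nDCG values of two runs over the same topics, for which a paired test is a common choice. The sketch below computes the paired t statistic in pure Python. It assumes a paired t-test is the intended test, and the per-topic nDCG values are invented for illustration, not results from this notebook.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-topic differences between two runs."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error


# Hypothetical per-topic nDCG values for the same four topics.
ndcg_max_passage = [0.5, 0.75, 0.5, 1.0]
ndcg_mean_passage = [0.25, 0.5, 0.5, 0.25]

t_stat = paired_t_statistic(ndcg_max_passage, ndcg_mean_passage)
degrees_of_freedom = len(ndcg_max_passage) - 1
print(f"t = {t_stat:.3f} with {degrees_of_freedom} degrees of freedom")
```

Turning the statistic into a p-value needs the t distribution, for example via `scipy.stats.ttest_rel`; the difference counts as significant at $\alpha = 0.05$ only if that p-value is below 0.05.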

### Step 5: Persist run.