# PyTerrier Notebook for Full-Rank Submissions

This notebook is a full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index, retrieves with a linear combination of BM25 (with Bo1 query expansion) and PL2, and reranks the top results with monoT5.

### Step 1: Ensure Libraries are Imported

In [1]:
import os

# Detect whether we are in the TIRA sandbox.
# Install the required dependencies only if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
    !pip3 install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5.git
else:
    print('We are in the TIRA sandbox.')

from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt

Collecting python-terrier
  Downloading python-terrier-0.10.0.tar.gz (107 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 107.6/107.6 kB 3.3 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tira==0.0.88
  Downloading tira-0.0.88-py3-none-any.whl.metadata (4.4 kB)
Collecting ir_datasets
  Downloading ir_datasets-0.5.5-py3-none-any.whl.metadata (12 kB)
Collecting docker==6.*,>=6.0.0 (from tira==0.0.88)
  Downloading docker-6.1.3-py3-none-any.whl.metadata (3.5 kB)
Collecting wget (from python-terrier)
  Downloading wget-3.2.zip (10 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tqdm (from python-terrier)
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57

  from .autonotebook import tqdm as notebook_tqdm


Ensure the PyTerrier integration is loaded

In [2]:
ensure_pyterrier_is_loaded()

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True
terrier-assemblies 5.7 jar-with-dependencies not found, downloading to /home/codespace/.pyterrier...
Done
terrier-python-helper 0.0.7 jar not found, downloading to /home/codespace/.pyterrier...
Done
terrier-prf -SNAPSHOT jar not found, downloading to /home/codespace/.pyterrier...
Done


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load data, create index

In [3]:
dataset = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')
qrels = dataset.get_qrels()
topics = dataset.get_topics(variant="title")

index_loc = "./index"
indexer = pt.IterDictIndexer(index_loc)
indexref = indexer.index(dataset.get_corpus_iter())

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/validation-20231104-training" from tira.
No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.


ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:54<00:00, 1128.00it/s]


### Step 3: Create retrieval pipeline

#### We retrieve documents with a linear combination of PL2 and BM25. First, we set up PL2.

In [4]:
pl2 = pt.BatchRetrieve(indexer, wmodel="PL2", verbose=True)

#### Next, we set up BM25 retrieval with Bo1 query expansion.

In [5]:
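# BM25 with b = 0.8; Terrier reads this parameter from the "bm25.b" control.
bm25 = pt.BatchRetrieve(indexer, wmodel="BM25", verbose=True, controls={"bm25.b": 0.8})

# First-pass BM25 retrieval (cached via ~), then Bo1 adds expansion terms to the query.
bo1_expansion = ~bm25 >> pt.rewrite.Bo1QueryExpansion(indexer)
# Second-pass BM25 retrieval with the expanded query.
bm25_bo1 = bo1_expansion >> bm25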
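
To give a rough idea of what the Bo1 step computes, the sketch below weights candidate expansion terms with the Bose-Einstein 1 formula, w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n) with P_n = F / N. The function name, the toy term statistics, and the cut-off of two terms are assumptions for illustration only; the actual Terrier implementation additionally selects candidates from the top-ranked documents and re-weights the query.

```python
import math


def bo1_weight(tf_x: float, coll_freq: float, num_docs: int) -> float:
    """Bo1 weight of a candidate expansion term.

    tf_x:      frequency of the term in the pseudo-relevant (top-ranked) documents
    coll_freq: frequency of the term in the whole collection (F)
    num_docs:  number of documents in the collection (N)
    """
    p_n = coll_freq / num_docs
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


# Hypothetical statistics: term -> (tf in feedback documents, collection frequency).
candidates = {"retrieval": (6, 120), "the": (40, 90000), "terrier": (3, 15)}
NUM_DOCS = 61307  # collection size shown by the indexing progress bar

weights = {t: bo1_weight(tf, cf, NUM_DOCS) for t, (tf, cf) in candidates.items()}
# Keep the two highest-weighted terms as expansion terms.
expansion_terms = sorted(weights, key=weights.get, reverse=True)[:2]
print(expansion_terms)
```

In this toy example the very frequent term "the" ranks last despite its high feedback frequency, because terms that are common across the collection gain little weight per occurrence.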

#### Let's combine the two systems.

In [13]:
# Weighted sum of scores: BM25 with Bo1 counts twice as much as PL2.
bm25_bo1_pl2 = 2 * bm25_bo1 + pl2
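
The following plain-Python sketch illustrates what such a weighted combination amounts to: scores are scaled per system and then summed per (qid, docno) pair, with a document missing from one system contributing 0 for that system. The helper names and the scores are hypothetical, and the sketch only approximates how PyTerrier's `*` and `+` operators behave.

```python
from collections import defaultdict


def scale(run, factor):
    """Multiply every score of a run by a constant factor."""
    return {key: factor * score for key, score in run.items()}


def comb_sum(run_a, run_b):
    """Sum scores per (qid, docno); a missing entry counts as 0."""
    combined = defaultdict(float)
    for run in (run_a, run_b):
        for key, score in run.items():
            combined[key] += score
    return dict(combined)


# Hypothetical scores keyed by (qid, docno).
bm25_bo1_scores = {("q1", "d1"): 4.0, ("q1", "d2"): 2.0}
pl2_scores = {("q1", "d2"): 3.0, ("q1", "d3"): 1.0}

combined = comb_sum(scale(bm25_bo1_scores, 2), pl2_scores)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Because the raw scores of the two weighting models are not on a common scale, the factor 2 reflects both the intended emphasis and the difference in score ranges.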

#### Next, we rerank the output with monoT5, a T5-based neural reranker.

In [7]:
from pyterrier_t5 import MonoT5ReRanker
monoT5 = MonoT5ReRanker()

spiece.model: 100%|██████████| 792k/792k [00:00<00:00, 8.80MB/s]
tokenizer.json: 100%|██████████| 1.39M/1.39M [00:00<00:00, 13.0MB/s]
config.json: 100%|██████████| 1.21k/1.21k [00:00<00:00, 6.61MB/s]
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as exp

In [8]:
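import pandas as pd

# Load the whole corpus into memory so that document text can be joined to results.
corpus = pd.DataFrame(dataset.get_corpus_iter())


class GetText(pt.Transformer):
    """Adds the document text to a result frame by joining on docno."""

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        # Inner join: results whose docno is missing from the corpus are dropped.
        return pd.merge(topics_or_res, corpus, on="docno")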

No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.


ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:03<00:00, 16879.48it/s]


In [9]:
from pyterrier_t5 import T5Tokenizer

### Hypothesis 1: There is a significant difference in nDCG ($p < 0.05$) between max-passage and mean-passage aggregation.

#### First, rerank with max-passage aggregation.

In [17]:

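# Rerank the top 10 documents per query: split each into overlapping passages,
# score every passage with monoT5, and keep the best passage score per document.
bm25_bo1_pl2_max = (bm25_bo1_pl2 % 10 >> GetText()
        >> pt.text.sliding(length=400, stride=64, prepend_attr=None, text_attr="text")
        >> monoT5
        >> pt.text.max_passage())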
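
As a simplified sketch of the passage-splitting step, the function below cuts a whitespace-tokenised document into windows of `length` tokens that start every `stride` tokens. The function name and the `%p<i>` passage-id suffix are assumptions chosen for illustration, and tokens after the last full window are dropped here, so this should not be read as PyTerrier's exact behaviour.

```python
def sliding_passages(docno, text, length=400, stride=64):
    """Split a document into overlapping passages of `length` whitespace tokens."""
    tokens = text.split()
    if len(tokens) <= length:
        # Short documents become a single passage.
        return [(f"{docno}%p0", " ".join(tokens))]
    passages = []
    # Only full windows are emitted; a trailing partial window is dropped.
    for i, start in enumerate(range(0, len(tokens) - length + 1, stride)):
        window = tokens[start:start + length]
        passages.append((f"{docno}%p{i}", " ".join(window)))
    return passages


# Toy document with 10 tokens, split with a small window for readability.
doc_text = " ".join(f"t{i}" for i in range(10))
passages = sliding_passages("d1", doc_text, length=4, stride=2)
for passage_id, passage_text in passages:
    print(passage_id, passage_text)
```

With `length=400` and `stride=64`, consecutive passages overlap by 336 tokens, so a long document produces many passages for monoT5 to score. Whitespace tokens are also not T5 subword tokens, which is one plausible reason a 400-token passage can exceed the model's 512-token limit.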


#### Second, rerank with mean-passage aggregation.

In [18]:
# Same pipeline as above, but average the passage scores per document.
bm25_bo1_pl2_mean = (bm25_bo1_pl2 % 10 >> GetText()
        >> pt.text.sliding(length=400, stride=64, prepend_attr=None, text_attr="text")
        >> monoT5
        >> pt.text.mean_passage())
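
The difference between the two aggregation strategies can be sketched in plain Python. The passage scores below are hypothetical, and the `%p` separator between document id and passage index is an assumption made for this illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate(passage_scores, how):
    """Group passage scores by document and reduce them with `how`."""
    per_doc = defaultdict(list)
    for passage_id, score in passage_scores.items():
        docno = passage_id.split("%p")[0]
        per_doc[docno].append(score)
    return {docno: how(scores) for docno, scores in per_doc.items()}


# Hypothetical reranker scores: d1 has one strong passage, d2 is uniformly decent.
passage_scores = {
    "d1%p0": 0.9, "d1%p1": 0.1, "d1%p2": 0.2,
    "d2%p0": 0.5, "d2%p1": 0.6,
}

max_scores = aggregate(passage_scores, max)
mean_scores = aggregate(passage_scores, mean)
print(max_scores)
print(mean_scores)
```

In this toy case max aggregation ranks d1 first because of its single strong passage, while mean aggregation prefers d2, whose passages are consistently decent. This is the kind of ranking difference Hypothesis 1 is meant to probe.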

#### Let's compare both systems.

In [20]:
pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    dataset.get_topics("text"),
    dataset.get_qrels(),
    eval_metrics=["ndcg"],
    names=["max passage", "mean passage"],
    baseline=0,  # paired t-test against the first system, needed to test Hypothesis 1
)

BR(BM25): 100%|██████████| 878/878 [00:21<00:00, 40.48q/s]
BR(PL2): 100%|██████████| 882/882 [00:17<00:00, 50.36q/s]


calling sliding on df of 8780 rows


monoT5:   0%|          | 7/18445 [00:38<28:23:03,  5.54s/batches]
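
Testing Hypothesis 1 needs a paired significance test over per-query nDCG values; `pt.Experiment` can run one itself when given a `baseline` argument. As a minimal sketch of the underlying idea, the code below computes a paired t statistic by hand. All nDCG values are invented for illustration and are not results of this notebook, and the critical value is the standard two-sided one for $\alpha = 0.05$ with 4 degrees of freedom.

```python
import math
from statistics import mean, stdev

# Hypothetical per-query nDCG values for five queries (not real results).
ndcg_max = [0.50, 0.62, 0.40, 0.71, 0.55]
ndcg_mean = [0.45, 0.60, 0.42, 0.65, 0.50]

# Paired test: work on the per-query differences.
diffs = [a - b for a, b in zip(ndcg_max, ndcg_mean)]
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

T_CRITICAL = 2.776  # two-sided, alpha = 0.05, df = len(diffs) - 1 = 4
significant = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.3f}, significant: {significant}")
```

With these invented numbers the difference would not be significant, even though max aggregation has the higher mean. This illustrates why comparing average nDCG alone does not settle the hypothesis.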

In [16]:
print('Create max-run')
run_max = bm25_bo1_pl2_max(dataset.get_topics("text"))
print('Max-run was created')
print('Create mean-run')
run_mean = bm25_bo1_pl2_mean(dataset.get_topics("text"))
print('Done, mean-run was created')

Create max-run


BR(BM25): 100%|██████████| 878/878 [00:21<00:00, 40.24q/s]
BR(PL2): 100%|██████████| 882/882 [00:17<00:00, 49.01q/s]


calling sliding on df of 8780 rows


monoT5:   0%|          | 0/18445 [00:00<?, ?batches/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (653 > 512). Running this sequence through the model will result in indexing errors
monoT5:   0%|          | 46/18445 [04:28<29:52:09,  5.84s/batches]


KeyboardInterrupt: 

### Step 4: Persist the runs

In [8]:
# Use distinct system names so that the two run files can be told apart.
persist_and_normalize_run(run_max, output_file="./max_output", system_name='t5-reranker-max')
persist_and_normalize_run(run_mean, output_file="./mean_output", system_name='t5-reranker-mean')

I use the environment variable "TIRA_OUTPUT_DIR" to determine where I should store the run file using "." as default.
Done. run file is stored under "./run.txt".
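
For orientation, the sketch below formats a small result list in the standard six-column TREC run format (`qid Q0 docno rank score tag`). It assumes the persisted run file follows this layout; the helper name and the scores are hypothetical, and the exact normalisation that `persist_and_normalize_run` applies is not shown in this notebook.

```python
def to_trec_lines(run, system_name):
    """Format (qid, docno, score) triples as TREC run lines, ranked per query."""
    by_query = {}
    for qid, docno, score in run:
        by_query.setdefault(qid, []).append((docno, score))

    lines = []
    for qid, docs in by_query.items():
        # Rank documents of each query by descending score, starting at rank 1.
        ranked = sorted(docs, key=lambda d: d[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines


# Hypothetical results for two queries.
run = [("1", "d2", 0.3), ("1", "d1", 0.9), ("2", "d5", 0.1)]
for line in to_trec_lines(run, "t5-reranker"):
    print(line)
```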
