# PyTerrier Notebook for Full-Rank Submissions

This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index, retrieves with BM25 and PL2 (with query expansion), and reranks the top results with monoT5.

### Step 1: Ensure Libraries are Imported

In [3]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
    !pip3 install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5.git
else:
    print('We are in the TIRA sandbox.')

from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt

#import matplotlib.pyplot as plt





Ensure the PyTerrier integration is loaded

In [6]:
ensure_pyterrier_is_loaded()

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.


### Step 2: Load data, create index

In [42]:
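# Load the dataset; inside TIRA, ir_datasets returns the mounted input dataset.
dataset = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')
qrels = dataset.get_qrels()
topics = dataset.get_topics(variant="title")
# Use only the first five topics.
topics = topics.head(5)

# Build the index from the corpus iterator.
index_loc = "./index4"
indexer = pt.IterDictIndexer(index_loc)
indexref = indexer.index(dataset.get_corpus_iter())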

No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.

ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:46<00:00, 1312.13it/s]


#### For retrieval with PL2, c=1.2 seems to work best

In [43]:
# PL2's normalization parameter is set through the "c" control.
pl2 = pt.BatchRetrieve(indexer, wmodel="PL2", verbose=True, controls={"c" : 1.2})

#### Next, we want to experiment with query expansion for retrieval with BM25.

In [44]:
# BM25's b parameter is set through the "bm25.b" control.
bm25 = pt.BatchRetrieve(indexer, wmodel="BM25", verbose=True, controls={"bm25.b" : 0.8})

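# First-pass BM25 retrieval (cached with ~), then Bo1 query expansion, then BM25 on the expanded query.
bo1_expansion = ~bm25 >> pt.rewrite.Bo1QueryExpansion(indexer)
bm25_bo1 = bo1_expansion >> bm25

# The same pipeline with KL query expansion.
kl = ~bm25 >> pt.rewrite.KLQueryExpansion(indexer)
bm25_kl = kl >> bm25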

#### Additionally, we linearly combine retrieval with BM25 and PL2.

In [45]:
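# Both combinations are pipelines (transformers), not results.
bm25_bo1_pl2 = 2 * bm25_bo1 + pl2
bm25_kl_pl2 = 2 * bm25_kl + pl2
# Apply the KL variant to the topics; the result is a DataFrame of ranked documents.
bm25_kl_pl2_run = bm25_kl_pl2.transform(topics)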

BR(BM25): 100%|██████████| 5/5 [00:00<00:00, 25.01q/s]
BR(BM25): 100%|██████████| 5/5 [00:00<00:00, 20.47q/s]
BR(PL2): 100%|██████████| 5/5 [00:00<00:00, 20.90q/s]
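
To make the `2 * bm25_bo1 + pl2` expression concrete, the following sketch reproduces the idea in plain Python: the scores of two rankers are weighted and summed per (qid, docno) pair. It is a simplified illustration, not PyTerrier's implementation. The function names `combine_scores` and `rank`, the toy scores, and the choice to treat a document missing from one ranking as contributing a score of 0 are assumptions made for this example.

```python
from collections import defaultdict


def combine_scores(run_a, run_b, weight_a=2.0, weight_b=1.0):
    """Weighted sum of two runs, each a dict {(qid, docno): score}."""
    combined = defaultdict(float)
    for key, score in run_a.items():
        combined[key] += weight_a * score
    for key, score in run_b.items():
        # Documents found by only one ranker implicitly get 0 from the other.
        combined[key] += weight_b * score
    return dict(combined)


def rank(run):
    """Order (qid, docno) pairs by query, then by descending score."""
    return sorted(run, key=lambda key: (key[0], -run[key]))


# Toy scores (hypothetical values, not taken from the notebook's index).
bm25_scores = {("q1", "d1"): 3.0, ("q1", "d2"): 1.0}
pl2_scores = {("q1", "d2"): 4.5, ("q1", "d3"): 2.0}

fused = combine_scores(bm25_scores, pl2_scores)
print(rank(fused))
```

Because the two models score on different scales, the weight also determines how much each model dominates the fused ranking.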


#### Let's run an experiment to see which retrieval model works best.

#### The combination of BM25 with Bo1 query expansion and PL2 performs slightly better than the others.
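
The experiment cell itself is not preserved in this notebook. In PyTerrier, such a comparison is usually run with `pt.Experiment`, which evaluates several pipelines against the qrels. As a rough sketch of what such a comparison computes, the following plain-Python example scores two systems with precision at k. The metric choice, the function names, and all judgments and rankings are hypothetical and do not reproduce the result stated above.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked docnos that are judged relevant."""
    return sum(1 for docno in ranking[:k] if docno in relevant) / k


def mean_precision_at_k(run, qrels, k=10):
    """Average P@k over all queries that have relevance judgments."""
    scores = [
        precision_at_k(run.get(qid, []), relevant, k)
        for qid, relevant in qrels.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical judgments: query id -> set of relevant docnos.
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d7"}}

# Hypothetical rankings: system name -> query id -> ranked docnos.
systems = {
    "system_a": {"q1": ["d1", "d2", "d3"], "q2": ["d8", "d7"]},
    "system_b": {"q1": ["d2", "d4", "d1"], "q2": ["d8", "d9"]},
}

results = {
    name: mean_precision_at_k(run, toy_qrels, k=2)
    for name, run in systems.items()
}
for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: P@2 = {score:.3f}")
```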

### Step 3: Rerank the output with a transformer

In [46]:
from pyterrier_t5 import MonoT5ReRanker
monoT5 = MonoT5ReRanker()

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [47]:
import pandas as pd

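# Load the full corpus into memory so document text can be joined onto results.
corpus = pd.DataFrame(dataset.get_corpus_iter())


class GetText(pt.Transformer):
    """Attach the corpus columns (including the text) to each result row."""

    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        # Inner join on docno: each retrieved document gains its text column.
        return pd.merge(topics_or_res, corpus, on="docno")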

No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.

ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:03<00:00, 18118.30it/s]


In [48]:
from pyterrier_t5 import T5Tokenizer

Rerank with transformer

In [49]:

bm25_bo1_pl2_monot5 = (
    bm25_bo1_pl2 % 10
    >> GetText()
    >> pt.text.sliding(length=400, stride=64, prepend_attr=None, text_attr="text")
    >> monoT5
    >> pt.text.max_passage()
)
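
The passage steps in this pipeline can be sketched in plain Python: a long document is split into overlapping windows of whitespace tokens, each window is scored separately, and the document keeps the score of its best window. This is a simplified illustration under our own assumptions, not the code behind `pt.text.sliding` or `pt.text.max_passage`; the helper names, the stopping rule for the last window, and the toy scores are hypothetical. Note that a window of 400 whitespace tokens can still exceed 512 subword tokens, as the tokenizer warning in the run output shows.

```python
def sliding_passages(text, length=400, stride=64):
    """Split whitespace tokens into overlapping windows of up to `length` tokens."""
    tokens = text.split()
    if len(tokens) <= length:
        return [" ".join(tokens)]
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + length]))
        if start + length >= len(tokens):
            break  # this window already reaches the end of the document
    return passages


def max_passage(passage_scores):
    """Keep, for each docno, the highest score among its passages."""
    best = {}
    for docno, score in passage_scores:
        if docno not in best or score > best[docno]:
            best[docno] = score
    return best


# A ten-token toy document, split with a small window for readability.
doc = " ".join(f"w{i}" for i in range(10))
passages = sliding_passages(doc, length=4, stride=2)
print(passages)

# Hypothetical reranker scores for passages of two documents.
doc_scores = max_passage([("d1", 0.2), ("d1", 0.9), ("d2", 0.4)])
print(doc_scores)
```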

In [50]:
print('Create run')
run = bm25_bo1_pl2_monot5(topics)
print('Done, run was created')

Create run


BR(BM25): 100%|██████████| 5/5 [00:00<00:00, 53.98q/s]
BR(BM25): 100%|██████████| 5/5 [00:00<00:00,  9.88q/s]
BR(PL2): 100%|██████████| 5/5 [00:00<00:00, 62.89q/s]


calling sliding on df of 50 rows


Token indices sequence length is longer than the specified maximum sequence length for this model (549 > 512). Running this sequence through the model will result in indexing errors
monoT5:   0%|          | 0/103 [00:09<?, ?batches/s]


RuntimeError: unknown parameter type


### Step 4: Persist run.

In [8]:
persist_and_normalize_run(run, 't5-reranker')

I use the environment variable "TIRA_OUTPUT_DIR" to determine where I should store the run file using "." as default.
Done. run file is stored under "./run.txt".
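
For orientation, the sketch below formats scores as lines in the standard six-column TREC run format (`qid Q0 docno rank score tag`). It assumes that the stored `run.txt` follows this format; what `persist_and_normalize_run` does internally is not shown in this notebook, so the helper `to_trec_lines`, the rank numbering starting at 1, and the toy scores are hypothetical.

```python
def to_trec_lines(scores, system_tag):
    """Format {(qid, docno): score} as TREC run lines: qid Q0 docno rank score tag."""
    by_query = {}
    for (qid, docno), score in scores.items():
        by_query.setdefault(qid, []).append((docno, score))

    lines = []
    for qid in sorted(by_query):
        # Highest score first; ranks are assigned per query.
        ranked = sorted(by_query[qid], key=lambda pair: -pair[1])
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_tag}")
    return lines


# Toy scores (hypothetical values).
toy_scores = {("q1", "d1"): 6.0, ("q1", "d2"): 6.5}
trec_lines = to_trec_lines(toy_scores, "t5-reranker")
print("\n".join(trec_lines))
```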
