# PyTerrier Notebook for Full-Rank Submissions

This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index, retrieves the top 50 documents per query with BM25, and re-ranks them with monoT5 to create the run.

### Step 1: Ensure Libraries are Imported

In [2]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
    !pip3 install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5.git
else:
    print('We are in the TIRA sandbox.')



In [3]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# This loads and starts PyTerrier so that it also works inside the TIRA sandbox.
ensure_pyterrier_is_loaded()

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt




Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


In [None]:
data = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')

In [4]:
index_loc_j = "./index"
#dataset = pt.get_dataset("irds:vaswani")
dataset_j = data
indexer_j = pt.IterDictIndexer(index_loc_j)
indexref_j = indexer_j.index(dataset_j.get_corpus_iter())

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/validation-20231104-training" from tira.
No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.


ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:50<00:00, 1224.78it/s]


In [5]:
from pyterrier_t5 import MonoT5ReRanker, DuoT5ReRanker
monoT5 = MonoT5ReRanker()
duoT5 = DuoT5ReRanker()

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

In [6]:
import pandas as pd

In [7]:
corpus = pd.DataFrame(dataset_j.get_corpus_iter())

No settings given in /home/codespace/.tira/.tira-settings.json. I will use defaults.


ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:03<00:00, 17772.57it/s]


In [8]:
import pandas as pd
from pandas import DataFrame


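class GetText(pt.Transformer):
    """Attach the document text to each retrieved row by joining on docno."""

    def transform(self, topics_or_res: DataFrame) -> DataFrame:
        # Inner join: only rows whose docno appears in the corpus frame are kept.
        return pd.merge(topics_or_res, corpus, on="docno")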

In [10]:
bm25_j = pt.BatchRetrieve(indexref_j, wmodel="BM25")
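
To make the first retrieval stage more concrete, here is a minimal sketch of textbook BM25 scoring in plain Python. It is an illustration only: the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions for this example, and Terrier's own BM25 weighting model may use a different IDF variant and defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation penalises documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, not taken from the dataset used in this notebook.
docs = [
    "bm25 ranks documents by term frequency".split(),
    "neural rerankers rescore the top documents".split(),
    "bm25 bm25 retrieval baseline".split(),
]
scores = bm25_scores("bm25 retrieval".split(), docs)
```

The second document shares no term with the query and scores zero, while the short third document matches both query terms and ranks first.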

In [24]:
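# BM25 top 50 -> attach text -> split into overlapping passages
# -> score passages with monoT5 -> average passage scores per document
mono_pipeline_j = (
    bm25_j % 50
    >> GetText()
    >> pt.text.sliding(length=480, stride=64, prepend_attr=None, text_attr="text")
    >> monoT5
    >> pt.text.mean_passage()
)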
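
The passage step in the pipeline above can be pictured with a small plain-Python sketch: long texts are cut into overlapping token windows, each window is scored, and the window scores are averaged per document. This is a simplified illustration under the assumption of whitespace tokenisation, not PyTerrier's implementation; `sliding_passages` and `mean_passage_score` are hypothetical helper names.

```python
def sliding_passages(text, length=480, stride=64):
    """Split whitespace-separated tokens into overlapping windows."""
    tokens = text.split()
    if len(tokens) <= length:
        # Short texts stay as a single passage.
        return [" ".join(tokens)]
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + length]))
        if start + length >= len(tokens):
            # The current window already reaches the end of the text.
            break
    return passages


def mean_passage_score(passage_scores):
    """Aggregate per-passage scores into one document score by averaging."""
    return sum(passage_scores) / len(passage_scores)


# Toy example with small window sizes so the overlap is easy to see.
toy_text = " ".join(f"t{i}" for i in range(10))
passages = sliding_passages(toy_text, length=4, stride=2)
doc_score = mean_passage_score([1.0, 2.0, 3.0])
```

With `length=480` and `stride=64` as in the pipeline, consecutive windows overlap heavily, so a long document contributes many passages to the monoT5 stage.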

In [16]:
#pt.Experiment(
#  [bm25_j, mono_pipeline_j,],
#  dataset_j.get_topics("text").head(1),
#  dataset_j.get_qrels(),
#  names=["BM25", "BM25 >> monoT5",],
#  eval_metrics=["map", "P"]
#)

calling sliding on df of 50 rows


monoT5: 100%|██████████| 86/86 [08:23<00:00,  5.85s/batches]


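|   | name | map | P@5 | P@10 | P@15 | P@20 | P@30 | P@100 | P@200 | P@500 | P@1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25 | 0.001868 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002 | 0.002 |
| 1 | BM25 >> monoT5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |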
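
For readers unfamiliar with the `map` and `P` measures reported above, the following sketch computes precision at a cutoff and average precision for a single query on toy data. The function names and the example ranking are hypothetical. MAP is the mean of the average precision values over all evaluated queries.

```python
def precision_at_k(ranked_docnos, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked_docnos[:k] if d in relevant) / k


def average_precision(ranked_docnos, relevant):
    """Mean of the precision values at the ranks of the retrieved relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


# Toy ranking: relevant documents appear at ranks 2 and 4, one is never retrieved.
ranking = ["d4", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p_at_5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
```

Because P@k divides by the cutoff, a single relevant document in the top 500 yields a precision of only 0.002 at that cutoff.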


In [13]:
print('Create run')
run = mono_pipeline_j(dataset_j.get_topics("text"))
print('Done, run was created')

Create run


BR(TF_IDF): 100%|██████████| 882/882 [00:18<00:00, 47.89q/s]


Done, run was created


In [8]:
persist_and_normalize_run(run, 't5-reranker')

I use the environment variable "TIRA_OUTPUT_DIR" to determine where I should store the run file using "." as default.
Done. run file is stored under "./run.txt".
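
As a rough picture of what a normalized run file contains, the sketch below sorts results per query by descending score, assigns ranks, and formats each row in the common six-column TREC run layout (`qid Q0 docno rank score tag`). That `run.txt` follows exactly this layout is an assumption here, and `to_trec_lines` is a hypothetical helper, not part of the `tira` package.

```python
def to_trec_lines(rows, tag):
    """Turn (qid, docno, score) tuples into TREC-style run lines."""
    by_query = {}
    for qid, docno, score in rows:
        by_query.setdefault(qid, []).append((docno, score))
    lines = []
    for qid, docs in by_query.items():
        # Highest score first; ties are broken by docno to keep the order deterministic.
        docs.sort(key=lambda pair: (-pair[1], pair[0]))
        for rank, (docno, score) in enumerate(docs, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {tag}")
    return lines


# Toy results, not taken from the run created in this notebook.
rows = [("q1", "d2", 0.3), ("q1", "d1", 0.9), ("q2", "d5", 1.5)]
lines = to_trec_lines(rows, "t5-reranker")
```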
