# PyTerrier Notebook for Full-Rank Submissions

This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index, retrieves with a linear combination of BM25 (with Bo1 query expansion) and PL2, and reranks the top results with monoT5.

### Step 1: Ensure Libraries are Imported

In [2]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install -q python-terrier tira==0.0.88 ir_datasets
    !pip3 install -q --upgrade git+https://github.com/terrierteam/pyterrier_t5.git
else:
    print('We are in the TIRA sandbox.')

from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt

Ensure the PyTerrier integration is loaded.

In [3]:
ensure_pyterrier_is_loaded()

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load data, create index

In [4]:
# Use a different dataset if the TIRA server is down.
#dataset = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')
dataset = pt.get_dataset('irds:cranfield')
qrels = dataset.get_qrels()
#topics = dataset.get_topics(variant="title")[:5]
topics = dataset.get_topics(variant="text")[:5]

#index_loc = "./index"
index_loc = "./index_cranfield"
indexer = pt.IterDictIndexer(index_loc)
indexref = indexer.index(dataset.get_corpus_iter())

cranfield documents:  37%|███▋      | 513/1400 [00:01<00:01, 617.00it/s]



cranfield documents: 100%|██████████| 1400/1400 [00:02<00:00, 587.88it/s] 


19:32:01.195 [ForkJoinPool-1-worker-1] WARN org.terrier.structures.indexing.Indexer - Indexed 2 empty documents
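
The indexer consumes an iterable of dictionaries, one per document. The following minimal sketch does not call PyTerrier; it only illustrates that shape and how one might count documents with empty text, which is what the indexer warning above reports. The sample documents and the helper `count_empty_docs` are hypothetical, and the sketch assumes each dictionary carries a `docno` and a `text` field and that whitespace-only text counts as empty.

```python
from typing import Dict, Iterable, Iterator


def corpus_iter() -> Iterator[Dict[str, str]]:
    """Yield hypothetical documents in the dict-per-document shape."""
    yield {"docno": "1", "text": "experimental investigation of the aerodynamics of a wing"}
    yield {"docno": "2", "text": ""}
    yield {"docno": "3", "text": "   "}


def count_empty_docs(docs: Iterable[Dict[str, str]]) -> int:
    """Count documents whose text is missing or consists only of whitespace."""
    return sum(1 for doc in docs if not doc.get("text", "").strip())


print(count_empty_docs(corpus_iter()))
```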


### Step 3: Create retrieval pipeline

#### We aim to retrieve documents via a linear combination of PL2 and BM25. First, let's set up PL2.

In [5]:
pl2 = pt.BatchRetrieve(indexer, wmodel="PL2", verbose=True)

#### Next, we set up BM25 retrieval with Bo1 query expansion.

In [6]:
bm25 = pt.BatchRetrieve(indexer, wmodel="BM25", verbose=True)

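# `~` caches the first-pass BM25 results that Bo1 uses as feedback documents.
bo1_expansion = ~bm25 >> pt.rewrite.Bo1QueryExpansion(indexer)
# Retrieve again with the expanded query.
bm25_bo1 = bo1_expansion >> bm25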
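
Bo1 expands a query with terms taken from the top-ranked documents of a first retrieval pass. As a rough sketch of how candidate terms are weighted, the function below implements the Bose-Einstein 1 formula $w(t) = tf_x \cdot \log_2\frac{1 + P_n}{P_n} + \log_2(1 + P_n)$ with $P_n = F / N$, where $tf_x$ is the term frequency in the feedback documents, $F$ the term frequency in the collection, and $N$ the number of documents. The function name and the sample numbers are hypothetical, and Terrier's implementation may differ in details such as normalization.

```python
import math


def bo1_weight(tf_feedback: float, coll_freq: float, num_docs: int) -> float:
    """Bo1 weight of a candidate expansion term (illustrative sketch)."""
    p_n = coll_freq / num_docs
    return tf_feedback * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


# A term that is frequent in the feedback documents but rare in the collection
# receives a higher weight than a term that is common everywhere.
rare = bo1_weight(tf_feedback=5, coll_freq=20, num_docs=1400)
common = bo1_weight(tf_feedback=5, coll_freq=2000, num_docs=1400)
print(rare > common)
```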

#### Let's combine the two systems.

In [7]:
bm25_bo1_pl2 = 2 * bm25_bo1 + pl2
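
To make the combination concrete, here is a minimal pure-Python sketch of a weighted score sum over two rankings for a single query. It assumes that a document missing from one ranking contributes a score of 0 from that side and that the result is re-sorted by the combined score. The function name and the sample scores are hypothetical; this is not PyTerrier's implementation.

```python
from typing import Dict, List, Tuple


def combine(run_a: Dict[str, float], run_b: Dict[str, float],
            weight_a: float = 2.0, weight_b: float = 1.0) -> List[Tuple[str, float]]:
    """Weighted sum of two docno -> score mappings, sorted by descending score."""
    docnos = set(run_a) | set(run_b)
    combined = {
        d: weight_a * run_a.get(d, 0.0) + weight_b * run_b.get(d, 0.0)
        for d in docnos
    }
    # Ties are broken by docno so that the order is deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


bm25_scores = {"51": 20.0, "12": 15.0}   # hypothetical BM25+Bo1 scores
pl2_scores = {"51": 10.0, "486": 12.0}   # hypothetical PL2 scores
print(combine(bm25_scores, pl2_scores))
```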

#### Check whether the `query` and `query_0` columns must be swapped before reranking.

In [8]:
run = bm25_bo1_pl2(topics)
run.rename(columns={"query": "query_0", "query_0": "query"})

BR(BM25): 100%|██████████| 5/5 [00:00<00:00, 13.60q/s]
BR(PL2): 100%|██████████| 5/5 [00:00<00:00, 35.83q/s]


| | qid | docid | docno | score | query | query_0 | rank |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 50.0 | 51 | 77.128727 | what similarity laws must be obeyed when const... | applypipeline:off similar^1.226527223 law^1.00... | 0 |
| 1 | 1 | 11.0 | 12 | 62.807036 | what similarity laws must be obeyed when const... | applypipeline:off similar^1.226527223 law^1.00... | 1 |
| 2 | 1 | 485.0 | 486 | 61.268570 | what similarity laws must be obeyed when const... | applypipeline:off similar^1.226527223 law^1.00... | 2 |
| 3 | 1 | 183.0 | 184 | 58.663315 | what similarity laws must be obeyed when const... | applypipeline:off similar^1.226527223 law^1.00... | 3 |
| 4 | 1 | 877.0 | 878 | 49.728666 | what similarity laws must be obeyed when const... | applypipeline:off similar^1.226527223 law^1.00... | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 5036 | 8 | | 133 | 1.062810 | | | 1036 |
| 5037 | 8 | | 1106 | 1.051268 | | | 1037 |
| 5038 | 8 | | 779 | 1.037575 | | | 1038 |
| 5039 | 8 | | 1214 | 1.035661 | | | 1039 |


### Next, we want to rerank the output with a transformer.

In [8]:
from pyterrier_t5 import MonoT5ReRanker
#monoT5 = MonoT5ReRanker()

In [15]:
import pandas as pd
from dataclasses import dataclass

corpus = pd.DataFrame(dataset.get_corpus_iter())

@dataclass(frozen=True)
class NamedTransformer(pt.Transformer):
    """Wraps a transformer and uses a fixed name as its repr."""
    nametrans: str
    wrapped: pt.Transformer

    def __repr__(self) -> str:
        return self.nametrans

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Delegate to the wrapped transformer.
        return self.wrapped.transform(df)
    
class GetText(pt.Transformer):
    # Attach the document fields (including "text") to each result row.
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        return pd.merge(topics_or_res, corpus, on="docno")

    #def __repr__(self) -> str:
    #    return "GetText"
    
class ResetQueryColumn(pt.Transformer):
    # Swap "query" and "query_0" so that "query" holds the original query text again.
    def transform(self, topics_or_res: pd.DataFrame) -> pd.DataFrame:
        return topics_or_res.rename(columns={"query": "query_0", "query_0": "query"})

    #def __repr__(self) -> str:
    #    return "ResetQueryColumn"

#class SlidingWindowPassager(pt.Transformer):
#    def __repr__(self) -> str:
#        return "SlidingWindowPassager"

#class MonoT5ReRanker(pt.Transformer):
#    def __repr__(self) -> str:
#        return "MonoT5ReRanker"
    
monoT5 = MonoT5ReRanker()

cranfield documents: 100%|██████████| 1400/1400 [00:00<00:00, 37971.98it/s]
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [12]:
from pyterrier_t5 import T5Tokenizer

#### Cache the pipeline.

In [16]:
bm25_bo1_pl2_mono = ~(bm25_bo1_pl2 % 10 >> GetText() >> ResetQueryColumn()
        >> pt.text.sliding(length=400, stride=64, prepend_attr=None, text_attr="text") % 5
        >> monoT5)
repr(bm25_bo1_pl2_mono)

  warn("Cannot cache pipeline %s has a component has not overridden __repr__" % trepr)


"Cache(ComposedPipeline(ComposedPipeline(ComposedPipeline(ComposedPipeline(RankCutoffTransformer(CombSumTransformer(ScalarProductTransformer(ComposedPipeline(ComposedPipeline(Cache(BR(./index_cranfield/data.properties,{'terrierql': 'on', 'parsecontrols': 'on', 'parseql': 'on', 'applypipeline': 'on', 'localmatching': 'on', 'filters': 'on', 'decorate': 'on', 'wmodel': 'BM25'},{'querying.processes': 'terrierql:TerrierQLParser,parsecontrols:TerrierQLToControls,parseql:TerrierQLToMatchingQueryTerms,matchopql:MatchingOpQLParser,applypipeline:ApplyTermPipeline,context_wmodel:org.terrier.python.WmodelFromContextProcess,localmatching:LocalManager$ApplyLocalMatching,qe:QueryExpansion,labels:org.terrier.learning.LabelDecorator,filters:LocalManager$PostFilterProcess,decorate:SimpleDecorateProcess', 'querying.postfilters': 'decorate:SimpleDecorate,site:SiteFilter,scope:Scope', 'querying.default.controls': 'wmodel:DPH,parsecontrols:on,parseql:on,applypipeline:on,terrierql:on,localmatching:on,filters

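
The passaging step in this pipeline splits each document into overlapping windows of tokens. The following is a minimal sketch of that idea, assuming whitespace tokenization and that a final, shorter window is kept. `sliding_windows` is a hypothetical helper, and PyTerrier's own passager may handle document boundaries differently.

```python
from typing import List


def sliding_windows(text: str, length: int, stride: int) -> List[str]:
    """Split text into windows of `length` tokens, starting every `stride` tokens."""
    tokens = text.split()
    if len(tokens) <= length:
        # Short documents yield exactly one passage.
        return [" ".join(tokens)]
    passages = []
    for start in range(0, len(tokens), stride):
        passages.append(" ".join(tokens[start:start + length]))
        if start + length >= len(tokens):
            # This window already reaches the end of the document.
            break
    return passages


print(sliding_windows("a b c d e f g", length=4, stride=2))
```

Under these assumptions, with `length=400` and `stride=64` any document shorter than 400 tokens yields a single passage.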
### Hypothesis 1: There is a significant ($\alpha = 0.05$) difference in nDCG between max-passage and mean-passage aggregation.

#### First, rerank with max-passage aggregation.

In [19]:
bm25_bo1_pl2_max = bm25_bo1_pl2_mono >> pt.text.max_passage()

#### Second, rerank with mean-passage aggregation.

In [20]:
bm25_bo1_pl2_mean = bm25_bo1_pl2_mono >> pt.text.mean_passage()
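
Both aggregations collapse passage-level scores back into one score per document. The sketch below illustrates the difference in plain Python. It assumes that passage identifiers have the form `<docno>%p<index>`; the helper `aggregate` and the sample scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple


def aggregate(passage_scores: Iterable[Tuple[str, float]],
              agg: Callable[[List[float]], float]) -> Dict[str, float]:
    """Group passage scores by document and reduce each group with `agg`."""
    by_doc: Dict[str, List[float]] = defaultdict(list)
    for passage_id, score in passage_scores:
        docno = passage_id.split("%p")[0]  # strip the passage suffix
        by_doc[docno].append(score)
    return {docno: agg(scores) for docno, scores in by_doc.items()}


scores = [("51%p0", -0.5), ("51%p1", -2.5), ("12%p0", -1.0)]
print(aggregate(scores, max))
print(aggregate(scores, mean))
```

For a document with a single passage, such as `12` here, both aggregations return the same score.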

#### Let's compare both systems.

In [22]:
pt.Experiment(
    [bm25_bo1_pl2_max, bm25_bo1_pl2_mean],
    topics,
    qrels,
    eval_metrics=["ndcg"],
    names=["max passage", "mean passage"],
)

BR(BM25): 100%|██████████| 5/5 [00:00<00:00,  6.40q/s]
BR(BM25): 100%|██████████| 5/5 [00:00<00:00, 18.04q/s]
BR(PL2): 100%|██████████| 5/5 [00:00<00:00, 22.07q/s]


calling sliding on df of 50 rows


monoT5: 100%|██████████| 7/7 [00:25<00:00,  3.62s/batches]
BR(BM25): 100%|██████████| 5/5 [00:00<00:00, 19.69q/s]
BR(PL2): 100%|██████████| 5/5 [00:00<00:00, 26.24q/s]


calling sliding on df of 50 rows


monoT5: 100%|██████████| 7/7 [00:24<00:00,  3.47s/batches]


| | name | ndcg |
|---|---|---|
| 0 | max passage | 0.06937 |
| 1 | mean passage | 0.06937 |
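
A mean nDCG per system cannot establish significance on its own; that requires per-topic scores. `pt.Experiment` accepts a `baseline` argument for paired significance testing. As a library-independent illustration, the sketch below runs an exact paired randomization (sign-flip) test, which is a substitute technique and not necessarily the test this hypothesis has in mind. The function name and the per-topic values are hypothetical.

```python
from itertools import product
from typing import Sequence


def sign_flip_p_value(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Two-sided exact randomization test on paired per-topic scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    total = 0
    extreme = 0
    # Enumerate every way of flipping the sign of each per-topic difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        permuted = abs(sum(s * d for s, d in zip(signs, diffs)))
        if permuted >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical per-topic nDCG values for two systems that score identically.
ndcg_max = [0.10, 0.00, 0.15, 0.05, 0.05]
ndcg_mean = [0.10, 0.00, 0.15, 0.05, 0.05]
print(sign_flip_p_value(ndcg_max, ndcg_mean))
```

With only five topics there are 32 sign assignments, so the smallest p-value this particular test can produce is 2/32 = 0.0625; it could not fall below 0.05 on a five-topic sample.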


### Hypothesis 2: Choosing $k \in \{5 \cdot i \mid i \in \{1, \dots, 10\}\}$ such that the nDCG of $k$-max average aggregation is maximized yields a significantly ($\alpha = 0.05$) better nDCG than max-passage or mean-passage aggregation.

In [None]:
# Keep the aggregator in its own variable so that GridSearch can tune it.
kmax = pt.text.kmaxavg_passage(5)
bm25_bo1_pl2_kmax = (bm25_bo1_pl2 % 10 >> GetText() >> ResetQueryColumn()
        >> pt.text.sliding(length=400, stride=64, prepend_attr=None, text_attr="text")
        >> monoT5
        >> kmax)

#### Find $k$ such that nDCG of reranking with $k$-max average passage is maximized.

In [None]:
# GridSearch expects the tuned transformer itself as the dictionary key.
# Assumption: the k-max average aggregator stores its cutoff in the attribute `K`.
pt.GridSearch(
    bm25_bo1_pl2_kmax,
    {kmax: {'K': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]}},
    topics,
    qrels,
    'ndcg',
    verbose=True
)
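
For intuition, $k$-max average aggregation scores a document by the average of its $k$ highest passage scores. The sketch below is a minimal illustration with a hypothetical helper and made-up scores. It assumes a non-empty score list and that a document with fewer than $k$ passages is averaged over the passages it has, which may differ from PyTerrier's behavior.

```python
from typing import List


def kmax_avg(scores: List[float], k: int) -> float:
    """Average of the k highest passage scores of one document."""
    top_k = sorted(scores, reverse=True)[:k]
    return sum(top_k) / len(top_k)


passage_scores = [4.0, 1.0, 3.0, 2.0]
print(kmax_avg(passage_scores, 1))   # same as max-passage
print(kmax_avg(passage_scores, 2))
print(kmax_avg(passage_scores, 10))  # same as mean-passage
```

Under these assumptions, $k = 1$ reduces to max-passage aggregation, and any $k$ at least as large as the number of passages reduces to mean-passage aggregation.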

In [16]:
print('Create max-run')
run_max = bm25_bo1_pl2_max(dataset.get_topics("text"))
print('Max-run was created')
print('Create mean-run')
run_mean = bm25_bo1_pl2_mean(dataset.get_topics("text"))
print('Done, mean-run was created')

Create max-run


BR(BM25): 100%|██████████| 878/878 [00:21<00:00, 40.24q/s]
BR(PL2): 100%|██████████| 882/882 [00:17<00:00, 49.01q/s]


calling sliding on df of 8780 rows


monoT5:   0%|          | 0/18445 [00:00<?, ?batches/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (653 > 512). Running this sequence through the model will result in indexing errors
monoT5:   0%|          | 46/18445 [04:28<29:52:09,  5.84s/batches]


KeyboardInterrupt: 

### Step 4: Persist run.

In [8]:
persist_and_normalize_run(run_max, output_file="./max_output", system_name='t5-reranker')
persist_and_normalize_run(run_mean, output_file="./mean_output", system_name='t5-reranker')

I use the environment variable "TIRA_OUTPUT_DIR" to determine where I should store the run file using "." as default.
Done. run file is stored under "./run.txt".
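
Run files in IR evaluation conventionally use the TREC run format, with six whitespace-separated columns: query ID, the literal `Q0`, document ID, rank, score, and system name. The sketch below shows how result rows might be serialized that way. It is an illustration only, not the `tira` implementation; the helper name and sample rows are hypothetical, and it assumes ranks start at 1.

```python
from typing import Dict, Iterable, List, Tuple


def to_trec_lines(rows: Iterable[Tuple[str, str, float]], system_name: str) -> List[str]:
    """Turn (qid, docno, score) rows into TREC-style run lines."""
    by_query: Dict[str, List[Tuple[str, float]]] = {}
    for qid, docno, score in rows:
        by_query.setdefault(qid, []).append((docno, score))
    lines = []
    for qid, docs in by_query.items():
        # Rank documents of each query by descending score.
        ranked = sorted(docs, key=lambda d: -d[1])
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines


rows = [("1", "51", 2.5), ("1", "12", 3.0)]
print("\n".join(to_trec_lines(rows, "t5-reranker")))
```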
