# Post-Hoc Experimentation for [Retrieval Benchmarks in the IR Experiment Platform](https://www.tira.io/task/ir-benchmarks)

This notebook showcases how post-hoc experiments of the IR Experiment Platform can be conducted.

To start the notebook, please clone the archived shared task repository:

```
git clone git@github.com:tira-io/ir-experiment-platform-benchmarks.git
```

Inside the cloned repository, the following command installs a minimal virtual environment (if not already present) and starts the Jupyter notebook:
```
make jupyterlab
```

The notebook covers:

- 1.) Diving into artifacts submitted to some shared task
- 2.) Re-evaluation of submitted approaches (e.g., a different subset of the data or different measures)
- 3.) Execution of submitted approaches on new or manipulated data (e.g., for ablation studies or ensembles)

All software submissions and evaluators come as Docker images.
Hence, a minimal environment is sufficient: you need `Python3` and `Docker`.
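Before running the notebook, it can help to confirm that both prerequisites are reachable. The following is a minimal sketch using only the Python standard library; the helper `check_prerequisites` is hypothetical and not part of `tira` or this repository. It only checks that a `docker` executable is on the `PATH`, not that the Docker daemon is running.

```python
import shutil
import sys


def check_prerequisites(commands=("docker",)):
    """Return a dict mapping each prerequisite name to True (found) or False (missing).

    Hypothetical helper for illustration; not part of the tira client.
    """
    status = {"python3": sys.version_info.major == 3}
    for command in commands:
        # shutil.which returns None when the executable is not on the PATH
        status[command] = shutil.which(command) is not None
    return status


if __name__ == "__main__":
    for name, available in check_prerequisites().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

If `docker` is reported as missing, the cells that execute submitted software are likely to fail, since they rely on Docker images.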

A virtual environment simplifies the installation: executing `make jupyterlab` installs the virtual environment (if not already done) and starts the Jupyter notebook, ready to run all parts of the tutorial.

For each software submitted to TIRA, the `tira` integration for PyTerrier loads the submitted Docker image and executes it in PyTerrier pipelines (so the first execution can take slightly longer).

Other notebooks cover further use cases as tutorials:

- [full-rank-retriever-tutorial.ipynb](full-rank-retriever-tutorial.ipynb): showcases how full-rankers can be reproduced/replicated.
- [re-rank-tutorial.ipynb](re-rank-tutorial.ipynb): showcases how re-rankers can be reproduced/replicated.
- [interoparability-tutorial](interoparability-tutorial): showcases how full-rankers and re-rankers submitted in TIRA can be combined in new ways in post-hoc experiments.


### Import Dependencies

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', 0)

from tira.local_client import Client
tira = Client()

import pyterrier as pt
if not pt.started():
    pt.init()


# 1.) Overview of Artifacts submitted to the Shared Task

In [3]:
# Overview of evaluators
tira.all_evaluators().describe()

Unnamed: 0,dataset,image,command
count,33,33,33
unique,33,1,1
top,clueweb09-en-trec-web-2011-20230107-training,webis/ir_measures_evaluator:1.0.5,"/ir_measures_evaluator.py --run ${inputRun}/run.txt --topics ${inputDataset}/queries.jsonl --qrels ${inputDataset}/qrels.txt --output_path ${outputDir} --measures ""P@10"" ""nDCG@10"" ""MRR"""
freq,1,33,33
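The evaluator command above requests the measures `P@10`, `nDCG@10`, and `MRR`. As a rough illustration of what two of them compute, here is a minimal pure-Python sketch of precision at k and the reciprocal rank for a single query. It is not the implementation used by the evaluator image; the function names, the ranking, and the relevance judgments are made up for illustration.

```python
def precision_at_k(ranked_docnos, relevant_docnos, k=10):
    """Fraction of the top-k positions that hold a relevant document.

    This sketch divides by k even if fewer than k documents are retrieved.
    """
    top_k = ranked_docnos[:k]
    return sum(1 for docno in top_k if docno in relevant_docnos) / k


def reciprocal_rank(ranked_docnos, relevant_docnos):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant_docnos:
            return 1.0 / rank
    return 0.0


# Hypothetical ranking and relevance judgments for one query (illustration only)
ranking = ["d3", "d1", "d2"]
relevant = {"d1"}

p_at_10 = precision_at_k(ranking, relevant, k=10)
rr = reciprocal_rank(ranking, relevant)
print(f"P@10 = {p_at_10:.2f}, RR = {rr:.2f}")
```

MRR is the mean of the reciprocal rank over all queries; with a single query, as here, it equals that query's reciprocal rank.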


In [4]:
# Overview of software submissions
tira.all_softwares().describe()

Unnamed: 0,approach,team,image,command
count,75,75,75,75
unique,75,1,75,51
top,ir-benchmarks/tira-ir-starter/DuoT5 Top-25 (tira-ir-starter-pyterrier),tira-ir-starter,docker.io/webis/ir-benchmarks-submissions:tira-ir-starter-duot5-0-0-1-duot5-base-msmarco-tira-docker-software-id-felt-slide,/reranking.py --input $inputDataset --output $outputDir --score_function cos_sim
freq,1,75,1,9


In [5]:
tira.all_datasets().describe()

Unnamed: 0,dataset
count,33
unique,33
top,antique-test-20230107-training
freq,1


# 2.) Re-execute re-ranking approaches submitted to a shared task

We run the approach `'ir-benchmarks/tira-ir-starter/SBERT multi-qa-MiniLM-L6-dot-v1 (tira-ir-starter-beir)'` on a small dataset.
To do so, we create a small re-ranking dataset and build a pipeline:

```
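# BM25 scorer that scores the texts passed in the input DataFrame
bm25 = pt.text.scorer(wmmodel='bm25')

# Re-rank the BM25 output with the approach submitted to TIRA
advanced_pipeline = bm25 >> tira.pt.reranker('ir-benchmarks/tira-ir-starter/SBERT multi-qa-MiniLM-L6-dot-v1 (tira-ir-starter-beir)')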
```

Here, `ir-benchmarks` is the task that contains many retrieval datasets, `tira-ir-starter` is the team (i.e., the baseline team), and `SBERT multi-qa-MiniLM-L6-dot-v1 (tira-ir-starter-beir)` is the approach.
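The task/team/approach structure of such an identifier can be made explicit with a small helper. This is a sketch only: `ApproachId` and `parse_approach_id` are hypothetical names, not part of the `tira` client, and the assumption is that the first two slashes separate the three components.

```python
from typing import NamedTuple


class ApproachId(NamedTuple):
    """Hypothetical container for the three parts of an approach identifier."""
    task: str
    team: str
    approach: str


def parse_approach_id(identifier: str) -> ApproachId:
    # Split on the first two slashes only, in case the approach name
    # itself contains a slash (an assumption of this sketch).
    parts = identifier.split("/", 2)
    if len(parts) != 3:
        raise ValueError(f"Expected '<task>/<team>/<approach>', got: {identifier!r}")
    return ApproachId(*parts)


parsed = parse_approach_id(
    "ir-benchmarks/tira-ir-starter/SBERT multi-qa-MiniLM-L6-dot-v1 (tira-ir-starter-beir)"
)
print(parsed.task)
print(parsed.team)
print(parsed.approach)
```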


In [6]:
data_to_rerank = pd.DataFrame([
        ["d1", "this is the first document of many documents", "1", "first document"],
        ["d2", "this is another document", "1", "first document"],
        ["d3", "the topic of this document is unknown", "1", "first document"]
    ], columns=["docno", "body", "qid", "query"])

data_to_rerank

Unnamed: 0,docno,body,qid,query
0,d1,this is the first document of many documents,1,first document
1,d2,this is another document,1,first document
2,d3,the topic of this document is unknown,1,first document


In [7]:
bm25 = pt.text.scorer(wmmodel='bm25')

bm25(data_to_rerank)

Unnamed: 0,docno,body,qid,rank,score,query
0,d1,this is the first document of many documents,1,0,0.5560003,first document
1,d2,this is another document,1,2,-3.085859e-10,first document
2,d3,the topic of this document is unknown,1,1,0.05681316,first document


In [8]:
advanced_pipeline = bm25 >> tira.pt.reranker('ir-benchmarks/tira-ir-starter/SBERT multi-qa-MiniLM-L6-dot-v1 (tira-ir-starter-beir)')

advanced_pipeline(data_to_rerank)

Unnamed: 0,docno,body,qid,query,q0,rank,score,system
0,d1,this is the first document of many documents,1,first document,0,1,46.084885,multi-qa-MiniLM-L6-dot-v1-dot
1,d2,this is another document,1,first document,0,2,40.802025,multi-qa-MiniLM-L6-dot-v1-dot
2,d3,the topic of this document is unknown,1,first document,0,3,37.29475,multi-qa-MiniLM-L6-dot-v1-dot
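To see how the re-ranker changed the BM25 ordering, the two outputs above can be compared directly. The sketch below copies the scores from those outputs into plain dictionaries and uses only the standard library; the helper `order_by_score` is hypothetical. Note that the BM25 output numbers ranks from 0 while the re-ranker output numbers them from 1, so the sketch derives 1-based positions from the scores for both.

```python
# Scores copied from the BM25 output and the re-ranker output above
bm25_scores = {"d1": 0.5560003, "d2": -3.085859e-10, "d3": 0.05681316}
sbert_scores = {"d1": 46.084885, "d2": 40.802025, "d3": 37.29475}


def order_by_score(scores):
    """Return the docnos sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


bm25_order = order_by_score(bm25_scores)
sbert_order = order_by_score(sbert_scores)

# Map each document whose position changed to (BM25 position, re-ranked position), 1-based
moved = {
    docno: (bm25_order.index(docno) + 1, sbert_order.index(docno) + 1)
    for docno in bm25_order
    if bm25_order.index(docno) != sbert_order.index(docno)
}

print("BM25 order:     ", bm25_order)
print("Re-ranked order:", sbert_order)
print("Moved documents:", moved)
```

In this toy example, `d1` stays on top in both rankings, while `d2` and `d3` swap places after re-ranking.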
