# IR Lab Tutorial: Re-using Existing PyTerrier Indices from TIREx

This tutorial shows how to re-use PyTerrier indices that were built for the public datasets in [TIREx](https://www.tira.io/tirex).
All indices were created by the same immutable submission to [TIRA](https://www.tira.io/) that was also executed on the non-public (confidential test) datasets, so you can re-use them as a prior stage in your own retrieval system.

In [None]:
# This is only needed in Google Colab, not in the Codespace / Dev-Container
!pip3 install python-terrier ir-datasets git+https://github.com/tira-io/tira.git@development#\&subdirectory=python-client

### Import All Libraries

In [15]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client
ensure_pyterrier_is_loaded()
import pandas as pd
import pyterrier as pt
from tqdm import tqdm

tira = Client()

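def pyterrier_index_from_tira(dataset):
  """Load the PyTerrier index that the TIREx indexing submission built for `dataset`."""
  # get_run_output returns the directory that holds the run's output; the index is in its 'index' subdirectory
  index_dir = tira.get_run_output('ir-benchmarks/tira-ir-starter/Index (tira-ir-starter-pyterrier)', dataset) + '/index'
  return pt.IndexFactory.of(index_dir)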


### Run BM25 on all Public Datasets

In [2]:
# Datasets in TIREx that are publicly accessible. Further datasets are only available within the TIRA sandbox.
public_tirex_datasets = [
    'msmarco-passage-trec-dl-2019-judged-20230107-training', 'msmarco-passage-trec-dl-2020-judged-20230107-training',
    'antique-test-20230107-training', 'vaswani-20230107-training',
    'cranfield-20230107-training', 'medline-2004-trec-genomics-2004-20230107-training',
    'medline-2017-trec-pm-2017-20230211-training', 'cord19-fulltext-trec-covid-20230107-training',
    'nfcorpus-test-20230107-training', 'argsme-touche-2020-task-1-20230209-training',
    'argsme-touche-2021-task-1-20230209-training', 'medline-2017-trec-pm-2018-20230211-training',
    'medline-2004-trec-genomics-2005-20230107-training', 'trec-tip-of-the-tongue-dev-20230607-training',
    'longeval-short-july-20230513-training', 'longeval-heldout-20230513-training',
    'longeval-long-september-20230513-training', 'longeval-train-20230513-training'
]
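
The dataset ids above appear to follow the pattern `<name>-<YYYYMMDD>-training`. This is an observation from the list, not a documented TIRA convention. Under that assumption, a minimal sketch for deriving shorter labels (for tables or plots) could look as follows; `split_dataset_id` is a hypothetical helper and not part of the `tira` client.

```python
import re

# Assumed pattern: "<name>-<YYYYMMDD>-training" (inferred from the ids listed above)
DATASET_ID = re.compile(r"^(?P<name>.+)-(?P<date>\d{8})-training$")

def split_dataset_id(dataset_id):
    """Split a TIREx dataset id into (name, date); raise if the id does not fit the assumed pattern."""
    match = DATASET_ID.match(dataset_id)
    if match is None:
        raise ValueError(f"Unexpected dataset id: {dataset_id}")
    return match.group("name"), match.group("date")

# Three ids copied from the list above
examples = [
    'vaswani-20230107-training',
    'medline-2017-trec-pm-2017-20230211-training',
    'longeval-short-july-20230513-training',
]

# Map each full id to its short name
short_names = {dataset_id: split_dataset_id(dataset_id)[0] for dataset_id in examples}
print(short_names)
```

The full id is still what `pyterrier_index_from_tira` and `pt.get_dataset` expect, so the short name is only meant for display.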

In [13]:
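df_all = []

for dataset in tqdm(public_tirex_datasets):
  # Load the pre-built index and the matching topics/qrels for this dataset
  index = pyterrier_index_from_tira(dataset)
  pt_dataset = pt.get_dataset(f"irds:ir-benchmarks/{dataset}")

  # Retrieve with BM25 and evaluate nDCG@10 on the dataset's queries
  bm25 = pt.BatchRetrieve(index, wmodel="BM25")
  df = pt.Experiment([bm25], pt_dataset.get_topics('query'), pt_dataset.get_qrels(), ['ndcg_cut.10'], names=['BM25'])
  df['dataset'] = dataset
  df_all.append(df)

# Combine the per-dataset results into a single DataFrame
df_all = pd.concat(df_all)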

100%|██████████| 18/18 [04:26<00:00, 14.79s/it]
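
The experiment above reports `ndcg_cut.10`, i.e., nDCG truncated at rank 10. To make the measure concrete, here is a minimal pure-Python sketch of nDCG@k for a single query. It assumes linear gains (the relevance label itself), a `log2(rank + 1)` discount, and negative labels counted as zero gain; the evaluation backend used by `pt.Experiment` may differ in details, so treat this as an illustration, not a re-implementation. The toy `qrels` and `ranking` are made up for the example.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount; ranks start at 1."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))

def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k for one query.

    ranking: document ids in the order the system returned them
    qrels:   dict mapping document id -> graded relevance label
    """
    # Unjudged documents and negative labels contribute no gain (assumption)
    gains = [max(qrels.get(doc_id, 0), 0) for doc_id in ranking[:k]]
    # Ideal ranking: all relevant judged documents, sorted by decreasing label
    ideal = sorted((rel for rel in qrels.values() if rel > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy data: d4 is unjudged, d3 is judged non-relevant
qrels = {"d1": 2, "d2": 1, "d3": 0}
ranking = ["d3", "d1", "d4", "d2"]

score = ndcg_at_k(ranking, qrels)
print(f"nDCG@10 = {score:.4f}")
```

The value reported per dataset in the experiment is the mean of such per-query scores over all queries of that dataset.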


### Analyse The Results

In [18]:
df_all.sort_values('ndcg_cut.10')

Unnamed: 0,name,ndcg_cut.10,dataset
0,BM25,0.012241,cranfield-20230107-training
0,BM25,0.103825,trec-tip-of-the-tongue-dev-20230607-training
0,BM25,0.16252,longeval-heldout-20230513-training
0,BM25,0.176477,longeval-train-20230513-training
0,BM25,0.179376,longeval-short-july-20230513-training
0,BM25,0.182979,longeval-long-september-20230513-training
0,BM25,0.267611,nfcorpus-test-20230107-training
0,BM25,0.2809,medline-2017-trec-pm-2017-20230211-training
0,BM25,0.298936,argsme-touche-2020-task-1-20230209-training
0,BM25,0.344219,medline-2004-trec-genomics-2004-20230107-training
