# PyTerrier Notebook for Full-Rank Submissions

This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex) that builds a PyTerrier index and subsequently creates a run with BM25.

### Step 1: Ensure Libraries are Imported

In [1]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
else:
    print('We are in the TIRA sandbox.')



In [2]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# This loads and starts PyTerrier so that it also works in the TIRA sandbox.
ensure_pyterrier_is_loaded()

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the data

In [3]:
data = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/validation-20231104-training" from tira.


In [43]:
print('See the first two queries:')
topics = data.get_topics('title')
print(topics.head(2))

See the first two queries:
       qid              query
0  q072224     purchase money
1  q072226  purchase used car


In [60]:
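import pandas as pd

# Hold out 20% of the topics for testing; the fixed seed keeps the split reproducible.
train_topics = topics.sample(frac=0.8, random_state=200)
test_topics = topics.drop(train_topics.index)

# The relevance judgments are split 80/20 by row, independently of the topic split.
qrels = data.get_qrels()
train_qrels = qrels.sample(frac=0.8, random_state=200)
test_qrels = qrels.drop(train_qrels.index)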
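
Because the split above samples the qrels rows independently of the topics, a test topic can end up with only part of its judgments in `test_qrels`. A minimal sketch of an alternative that keeps each query's judgments on the same side as the query is shown below. The helper name `split_by_qid` and the toy data are hypothetical; only the `qid` column name is taken from the notebook output.

```python
import pandas as pd


def split_by_qid(topics, qrels, frac=0.8, seed=200):
    """Split topics at random, then assign each judgment to its query's side."""
    train_topics = topics.sample(frac=frac, random_state=seed)
    test_topics = topics.drop(train_topics.index)
    # Filter the judgments by query ID instead of sampling rows independently.
    train_qrels = qrels[qrels["qid"].isin(train_topics["qid"])]
    test_qrels = qrels[qrels["qid"].isin(test_topics["qid"])]
    return train_topics, test_topics, train_qrels, test_qrels


# Toy data for illustration only (not taken from the dataset).
toy_topics = pd.DataFrame({
    "qid": ["q1", "q2", "q3", "q4", "q5"],
    "query": ["a", "b", "c", "d", "e"],
})
toy_qrels = pd.DataFrame({
    "qid": ["q1", "q1", "q2", "q3", "q4", "q5", "q5"],
    "docno": ["d1", "d2", "d3", "d4", "d5", "d6", "d7"],
    "label": [1, 0, 1, 1, 0, 1, 1],
})

tr_t, te_t, tr_q, te_q = split_by_qid(toy_topics, toy_qrels)
```

With this variant, every judgment lands in exactly one of the two qrels frames, and no query ID appears on both sides.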


### Step 3: Build the Index

In [14]:
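print('Build index:')
# Remove any index left over from a previous run so that indexing starts from a clean directory.
!rm -Rf /tmp/index
# meta={'docno': 100} reserves up to 100 characters for each stored document ID.
iter_indexer = pt.IterDictIndexer("/tmp/index", meta={'docno': 100}, verbose=True)
indexref = iter_indexer.index(data.get_corpus_iter())
print('Done. Index is created')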

Build index:
No settings given in /Users/dominicwild/.tira/.tira-settings.json. I will use defaults.




ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:26<00:00, 2341.51it/s]


Done. Index is created
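
The indexer above reserves 100 characters for the `docno` metadata field. A quick sanity check before indexing is to confirm that no document ID exceeds that length. The sketch below assumes the corpus iterator yields dictionaries with a `docno` key, as `IterDictIndexer` expects; the helper name and the toy corpus are hypothetical.

```python
DOCNO_META_LENGTH = 100  # mirrors meta={'docno': 100} in the indexing cell


def longest_docno(corpus_iter):
    """Return the length of the longest document ID in an iterable of dicts."""
    return max(len(doc["docno"]) for doc in corpus_iter)


# Toy corpus for illustration only; a real check would iterate over
# data.get_corpus_iter() instead.
toy_corpus = [
    {"docno": "doc-1", "text": "purchase money"},
    {"docno": "doc-0002", "text": "purchase used car"},
]

max_len = longest_docno(toy_corpus)
fits = max_len <= DOCNO_META_LENGTH
```

If `fits` were `False`, the reserved length in `meta` would need to be raised before building the index.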


### Step 4: Create the Retrieval Pipeline

In [84]:
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25", verbose=True)
tf_idf = pt.BatchRetrieve(indexref, wmodel="TF_IDF", verbose=True)
dph = pt.BatchRetrieve(indexref, wmodel="DPH", verbose=True)
drlm = pt.BatchRetrieve(indexref, wmodel="DirichletLM", verbose=True)

bo1 = pt.rewrite.Bo1QueryExpansion(indexref, verbose=True)
rm3 = pt.rewrite.RM3(indexref, verbose=True)
klq = pt.rewrite.KLQueryExpansion(indexref, verbose=True)

tf_idf_bo1 = tf_idf >> bo1 >> tf_idf
tf_idf_rm3 = tf_idf >> rm3 >> tf_idf
tf_idf_klq = tf_idf >> klq >> tf_idf


In [78]:
pt.Experiment(
    [bm25, tf_idf, dph, drlm],
    test_topics, test_qrels,
    ["P_10", "recall_10", "ndcg"],
    ["BM25","TF_IDF", "DPH", "DirichletLM"],
    highlight="bold"
)

BR(BM25): 100%|██████████| 176/176 [00:01<00:00, 97.62q/s] 
BR(DPH): 100%|██████████| 176/176 [00:01<00:00, 98.95q/s] 
BR(DirichletLM): 100%|██████████| 176/176 [00:01<00:00, 94.87q/s]


| | name | P_10 | recall_10 | ndcg |
|---|---|---|---|---|
| 0 | BM25 | 0.021118 | 0.146998 | 0.140814 |
| 1 | TF_IDF | 0.021118 | 0.150104 | 0.131495 |
| 2 | DPH | 0.019255 | 0.13147 | 0.143123 |
| 3 | DirichletLM | 0.018634 | 0.1294 | 0.133254 |
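
To make the `P_10` and `recall_10` columns easier to read, here is a minimal sketch of how precision and recall at a cutoff are commonly defined for a single query. It is an illustration with hypothetical function names and toy data, not the evaluation code that `pt.Experiment` runs internally.

```python
def precision_at_k(ranked_docnos, relevant_docnos, k=10):
    """Fraction of the top-k positions that hold a relevant document."""
    hits = sum(1 for docno in ranked_docnos[:k] if docno in relevant_docnos)
    return hits / k


def recall_at_k(ranked_docnos, relevant_docnos, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_docnos:
        return 0.0
    hits = sum(1 for docno in ranked_docnos[:k] if docno in relevant_docnos)
    return hits / len(relevant_docnos)


# Toy ranking for one query: one of two relevant documents is retrieved.
ranking = ["d3", "d7", "d1", "d9", "d2", "d8", "d4", "d6", "d5", "d0"]
relevant = {"d1", "d42"}

p10 = precision_at_k(ranking, relevant)
r10 = recall_at_k(ranking, relevant)
```

Precision at 10 is bounded by the number of relevant documents divided by 10, so when a query has only a few relevant documents it stays low even for a good ranking.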


In [85]:
pt.Experiment(
    [tf_idf_bo1, tf_idf_rm3, tf_idf_klq],
    test_topics, test_qrels,
    ["P_10", "recall_10", "ndcg"],
    ["TF_IDF >> bo1 >> TF_IDF","TF_IDF >> rm3 >> TF_IDF", "TF_IDF >> klq >> TF_IDF"],
    highlight="bold"
)

BR(TF_IDF): 100%|██████████| 176/176 [00:02<00:00, 87.24q/s]
Transformer: 100%|██████████| 175/175 [00:00<00:00, 176.30q/s]
BR(TF_IDF): 100%|██████████| 175/175 [00:02<00:00, 76.30q/s]
BR(TF_IDF): 100%|██████████| 176/176 [00:01<00:00, 110.52q/s]
Transformer: 100%|██████████| 175/175 [00:00<00:00, 184.71q/s]
BR(TF_IDF): 100%|██████████| 175/175 [00:02<00:00, 80.37q/s]
BR(TF_IDF): 100%|██████████| 176/176 [00:01<00:00, 102.21q/s]
Transformer: 100%|██████████| 175/175 [00:01<00:00, 170.44q/s]
BR(TF_IDF): 100%|██████████| 175/175 [00:02<00:00, 70.99q/s]


| | name | P_10 | recall_10 | ndcg |
|---|---|---|---|---|
| 0 | `TF_IDF >> bo1 >> TF_IDF` | 0.020497 | 0.15528 | 0.149736 |
| 1 | `TF_IDF >> rm3 >> TF_IDF` | 0.019876 | 0.144928 | 0.140707 |
| 2 | `TF_IDF >> klq >> TF_IDF` | 0.020497 | 0.152174 | 0.152469 |


In [81]:
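# Expand the query with Bo1, then retrieve with TF_IDF while also computing BM25 and DPH scores as features.
# No PL2 model is used here, despite the variable name and the experiment label.
tfidf_bo1_tfpl2_union = tf_idf >> bo1 >> pt.FeaturesBatchRetrieve(indexref, wmodel="TF_IDF", features=["WMODEL:BM25", "WMODEL:DPH"], verbose=True)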

pt.Experiment(
    [tfidf_bo1_tfpl2_union],
    test_topics, test_qrels,
    ["P_10", "recall_10", "ndcg"],
    ["TF_IDF >> BO1 >> TF_IDF >> TF**PL2"],
    highlight="bold"
)

FBR(TF_IDF and 2 features): 100%|██████████| 175/175 [00:18<00:00,  9.41q/s]


| | name | P_10 | recall_10 | ndcg |
|---|---|---|---|---|
| 0 | `TF_IDF >> BO1 >> TF_IDF >> TF**PL2` | 0.019876 | 0.145963 | 0.144568 |


In [83]:
import numpy as np

tf_idf = pt.BatchRetrieve(indexref, wmodel="TF_IDF", controls={"tf_idf.b" : 0.75})
tfidf_klq = tf_idf >> klq >> tf_idf

param_map = {
    tf_idf: {"tf_idf.b": list(np.arange(0.5, 1.5, 0.1))},
    # bo1: {
    #     "fb_terms": list(range(1, 12, 3)),  # [1, 4, 7, 10]
    #     "fb_docs": list(range(2, 30, 6)),   # [2, 8, 14, 20, 26]
    # }
}
# Tune on the training split, then evaluate the best setting on the held-out test split.
tfidf_klq_tuned = pt.GridSearch(tfidf_klq, param_map, train_topics, train_qrels, verbose=True, metric="ndcg")
pt.Experiment([tfidf_klq_tuned], test_topics, test_qrels, ["P_10", "recall_10", "ndcg"])

Transformer: 100%|██████████| 703/703 [00:12<00:00, 57.06q/s]
Transformer: 100%|██████████| 703/703 [00:12<00:00, 56.64q/s]
Transformer: 100%|██████████| 703/703 [00:12<00:00, 56.66q/s]
Transformer: 100%|██████████| 703/703 [00:12<00:00, 57.63q/s]
Transformer: 100%|██████████| 703/703 [00:12<00:00, 57.53q/s]
Transformer: 100%|██████████| 703/703 [00:12<00:00, 55.29q/s]
Transformer: 100%|██████████| 703/703 [00:13<00:00, 52.83q/s]
Transformer: 100%|██████████| 703/703 [00:12<00:00, 55.62q/s]
Transformer: 100%|██████████| 703/703 [00:12<00:00, 55.14q/s]
Transformer: 100%|██████████| 703/703 [00:12<00:00, 56.46q/s]
GridScan: 100%|██████████| 10/10 [05:16<00:00, 31.67s/it]


Best ndcg is 0.231389
Best setting is ['BR(TF_IDF) tf_idf.b=0.9999999999999999']


Transformer: 100%|██████████| 175/175 [00:00<00:00, 182.94q/s]


| | name | P_10 | recall_10 | ndcg |
|---|---|---|---|---|
| 0 | `Compose(Compose(BR(TF_IDF), QueryExpansion(/tm...` | 0.020497 | 0.153209 | 0.142564 |
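
The grid search reports its best setting as `tf_idf.b=0.9999999999999999` rather than `1.0`. This is a floating-point artifact of building the grid with `np.arange` and a fractional step. A minimal sketch of one way to get clean grid values is to compute each point from its index and round it. The helper name `make_grid` is hypothetical.

```python
def make_grid(start, step, count, digits=1):
    """Build `count` evenly spaced values, rounded to avoid float drift."""
    return [round(start + i * step, digits) for i in range(count)]


# Same ten candidate values as the grid above, without representation noise.
b_values = make_grid(0.5, 0.1, 10)
```

Passing `b_values` in place of `list(np.arange(0.5, 1.5, 0.1))` would search the same points and make the logged best setting easier to read.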


### Step 5: Create and Persist the Run

In [8]:
print('Create run')
run = bm25(topics)
print('Done, run was created')

Create run


BR(PL2): 100%|██████████| 882/882 [00:09<00:00, 96.97q/s] 


Done, run was created


In [9]:
persist_and_normalize_run(run, 'bm25-baseline')

I use the environment variable "TIRA_OUTPUT_DIR" to determine where I should store the run file using "." as default.
Done. run file is stored under "./run.txt".
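
For orientation, the sketch below shows what lines in the standard six-column TREC run format look like (`qid Q0 docno rank score system`). It assumes that `run.txt` follows this format; the exact normalization applied by `persist_and_normalize_run` is not shown in this notebook, so the tie-breaking rule and the starting rank used here are assumptions, and the helper name is hypothetical.

```python
def to_trec_lines(rows, system):
    """Format (qid, docno, score) tuples as TREC run lines, ranked per query."""
    by_qid = {}
    for qid, docno, score in rows:
        by_qid.setdefault(qid, []).append((docno, score))

    lines = []
    for qid in sorted(by_qid):
        # Highest score first; ties are broken by docno (an assumption).
        ranked = sorted(by_qid[qid], key=lambda pair: (-pair[1], pair[0]))
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system}")
    return lines


# Toy results for illustration only.
toy_rows = [("q1", "d2", 1.5), ("q1", "d1", 2.5), ("q2", "d3", 0.5)]
trec_lines = to_trec_lines(toy_rows, "bm25-baseline")
```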
