# PyTerrier Notebook for Full-Rank Submissions

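This notebook serves as a baseline full-rank submission for [TIRA](https://tira.io)/[TIREx](https://tira.io/tirex). It builds a PyTerrier index, compares several weighting models (BM25, TF_IDF, DPH, DirichletLM) and query-expansion pipelines, and persists a run created with a tuned TF_IDF pipeline that uses KL query expansion.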

### Step 1: Ensure Libraries are Imported

In [1]:
import os

# Detect if we are in the TIRA sandbox
# Install the required dependencies if we are not in the sandbox.
if 'TIRA_DATASET_ID' not in os.environ:
    !pip3 install python-terrier tira==0.0.88 ir_datasets
else:
    print('We are in the TIRA sandbox.')



In [2]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run

# This loads and starts PyTerrier so that it also works in the TIRA sandbox.
ensure_pyterrier_is_loaded()

# PyTerrier must be imported after the call to ensure_pyterrier_is_loaded in TIRA.
import pyterrier as pt


PyTerrier 0.10.0 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


### Step 2: Load the data

In [3]:
data = pt.get_dataset('irds:ir-lab-jena-leipzig-wise-2023/validation-20231104-training')

Load ir_dataset "ir-lab-jena-leipzig-wise-2023/validation-20231104-training" from tira.


In [4]:
print('See the first two queries:')
topics = data.get_topics('title')
print(topics.head(2))

See the first two queries:
No settings given in /Users/dominicwild/.tira/.tira-settings.json. I will use defaults.
       qid              query
0  q072224     purchase money
1  q072226  purchase used car


In [5]:
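import pandas as pd

# Hold out 20% of the topics; the fixed seed keeps the split reproducible.
train_topics = topics.sample(frac=0.8, random_state=200)
test_topics = topics.drop(train_topics.index)

# Split the qrels by topic so that every judgment lands on the same side as its query.
qrels = data.get_qrels()
train_qrels = qrels[qrels['qid'].isin(train_topics['qid'])]
test_qrels = qrels[qrels['qid'].isin(test_topics['qid'])]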




### Step 3: Build the Index

In [6]:
print('Build index:')
# Remove any stale index first, then build a fresh one from the corpus.
!rm -Rf /tmp/index
iter_indexer = pt.IterDictIndexer("/tmp/index", meta={'docno': 100}, verbose=True)
indexref = iter_indexer.index(data.get_corpus_iter())
print('Done. Index is created')

Build index:



ir-lab-jena-leipzig-wise-2023/validation-20231104-training documents: 100%|██████████| 61307/61307 [00:25<00:00, 2409.97it/s]


Done. Index is created


### Step 4: Create the Retrieval Pipeline

In [7]:
# First-stage retrievers, one per weighting model.
bm25 = pt.BatchRetrieve(indexref, wmodel="BM25", verbose=True)
tf_idf = pt.BatchRetrieve(indexref, wmodel="TF_IDF", verbose=True)
dph = pt.BatchRetrieve(indexref, wmodel="DPH", verbose=True)
drlm = pt.BatchRetrieve(indexref, wmodel="DirichletLM", verbose=True)

# Query-expansion rewriters based on pseudo-relevance feedback.
bo1 = pt.rewrite.Bo1QueryExpansion(indexref, verbose=True)
rm3 = pt.rewrite.RM3(indexref, verbose=True)
klq = pt.rewrite.KLQueryExpansion(indexref, verbose=True)

# Retrieve, expand the query from the top-ranked documents, then retrieve again.
tf_idf_bo1 = tf_idf >> bo1 >> tf_idf
tf_idf_rm3 = tf_idf >> rm3 >> tf_idf
tf_idf_klq = tf_idf >> klq >> tf_idf


In [16]:
pt.Experiment(
    [bm25, tf_idf, dph, drlm],
    topics, qrels,
    ["P_10", "recall_10", "ndcg"],
    ["BM25","TF_IDF", "DPH", "DirichletLM"],
    highlight="bold"
)

BR(BM25): 100%|██████████| 882/882 [00:09<00:00, 91.94q/s] 
BR(TF_IDF): 100%|██████████| 882/882 [00:09<00:00, 92.98q/s] 
BR(DPH): 100%|██████████| 882/882 [00:09<00:00, 91.95q/s] 
BR(DirichletLM): 100%|██████████| 882/882 [00:09<00:00, 91.71q/s] 


|   | name | P_10 | recall_10 | ndcg |
|---|------|------|-----------|------|
| 0 | BM25 | 0.093878 | 0.24599 | 0.31043 |
| 1 | TF_IDF | 0.096032 | 0.251361 | 0.311507 |
| 2 | DPH | 0.092177 | 0.239749 | 0.310629 |
| 3 | DirichletLM | 0.077664 | 0.200004 | 0.279046 |
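
To make two of the measures in this table concrete, here is a minimal sketch of how P_10 and recall_10 can be computed for a single query, assuming binary relevance. The function names and document ids are hypothetical and are not part of PyTerrier; `pt.Experiment` reports the mean of such per-query values over all topics.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k positions that hold a relevant document."""
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / k


def recall_at_k(ranking, relevant, k=10):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / len(relevant)


# Hypothetical ranking for one query (document ids are made up), best first.
ranking = [f"d{i}" for i in range(1, 21)]   # d1 ... d20
relevant = {"d2", "d7", "d15", "d40"}       # four judged-relevant documents

p10 = precision_at_k(ranking, relevant)     # d2 and d7 are in the top 10
r10 = recall_at_k(ranking, relevant)        # 2 of the 4 relevant documents
print(p10, r10)
```

In this toy case only two of the four relevant documents reach the top 10, so precision is 2/10 and recall is 2/4.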


In [17]:
pt.Experiment(
    [tf_idf_bo1, tf_idf_rm3, tf_idf_klq],
    topics, qrels,
    ["P_10", "recall_10", "ndcg"],
    ["TF_IDF >> bo1 >> TF_IDF","TF_IDF >> rm3 >> TF_IDF", "TF_IDF >> klq >> TF_IDF"],
    highlight="bold"
)

BR(TF_IDF): 100%|██████████| 882/882 [00:09<00:00, 94.04q/s] 
Transformer: 100%|██████████| 878/878 [00:18<00:00, 46.38q/s]
BR(TF_IDF): 100%|██████████| 878/878 [00:12<00:00, 72.33q/s]
BR(TF_IDF): 100%|██████████| 882/882 [00:08<00:00, 98.47q/s] 
Transformer: 100%|██████████| 878/878 [00:19<00:00, 45.63q/s]
BR(TF_IDF): 100%|██████████| 878/878 [00:11<00:00, 77.75q/s]
BR(TF_IDF): 100%|██████████| 882/882 [00:09<00:00, 95.38q/s] 
Transformer: 100%|██████████| 878/878 [00:19<00:00, 46.00q/s]
BR(TF_IDF): 100%|██████████| 878/878 [00:13<00:00, 65.93q/s]


|   | name | P_10 | recall_10 | ndcg |
|---|------|------|-----------|------|
| 0 | TF_IDF >> bo1 >> TF_IDF | 0.096032 | 0.248524 | 0.31513 |
| 1 | TF_IDF >> rm3 >> TF_IDF | 0.095011 | 0.247119 | 0.307819 |
| 2 | TF_IDF >> klq >> TF_IDF | 0.096145 | 0.248862 | 0.31388 |


In [8]:
import numpy as np

# Re-create the retriever with an explicit b so that GridSearch can vary it.
tf_idf = pt.BatchRetrieve(indexref, wmodel="TF_IDF", controls={"tf_idf.b" : 0.75})
tfidf_klq_tuned = tf_idf >> klq >> tf_idf

param_map = {
        tf_idf : { "tf_idf.b" : list(np.arange(0.5,1.5,0.1))},
        # bo1 : {
        #     "fb_terms" : list(range(1, 12, 3)), # makes the list [1, 4, 7, 10]
        #     "fb_docs" : list(range(2, 30, 6))   # makes the list [2, 8, 14, 20, 26]
        # }
}
# Note: tuning and evaluation both use the full topics and qrels, so the scores below are optimistic.
tfidf_klq_tuned = pt.GridSearch(tfidf_klq_tuned, param_map, topics, qrels, verbose=True, metric="ndcg")
pt.Experiment(
    [tfidf_klq_tuned],
    topics, qrels,
    ["P_10", "recall_10", "ndcg"],
    ["TF_IDF >> klq >> TF_IDF (tuned hyperparameter b)"],
    highlight="bold"
)

Transformer: 100%|██████████| 878/878 [00:18<00:00, 47.32q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 47.82q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 47.67q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 47.98q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 48.16q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 48.27q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 48.42q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 48.36q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 48.33q/s]
Transformer: 100%|██████████| 878/878 [00:18<00:00, 48.37q/s]
GridScan: 100%|██████████| 10/10 [06:47<00:00, 40.72s/it]


Best ndcg is 0.319023
Best setting is ['BR(TF_IDF) tf_idf.b=0.9999999999999999']


Transformer: 100%|██████████| 878/878 [00:18<00:00, 48.68q/s]


|   | name | P_10 | recall_10 | ndcg |
|---|------|------|-----------|------|
| 0 | TF_IDF >> klq >> TF_IDF (tuned hyperparameter b) | 0.099546 | 0.259348 | 0.319023 |
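
The best setting is reported as `0.9999999999999999` rather than `1.0` because the float grid carries floating-point rounding error. A minimal sketch of one way to avoid this, assuming the intended candidates are the one-decimal values from 0.5 to 1.4, is to derive each value from an integer step and round it. The names `start`, `step`, `count`, and `b_grid` are hypothetical.

```python
# Build the grid from integer steps and round each value,
# so the candidates are exactly the intended one-decimal numbers.
start, step, count = 0.5, 0.1, 10
b_grid = [round(start + i * step, 1) for i in range(count)]
print(b_grid)
```

Such a list could be used in place of `list(np.arange(0.5,1.5,0.1))` in `param_map`; the search covers the same candidates, only with cleaner values in the report.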


### Step 5: Create the Run and Persist the Run

In [8]:
print('Create run')
run = tfidf_klq_tuned(topics)
print('Done, run was created')

Create run


BR(PL2): 100%|██████████| 882/882 [00:09<00:00, 96.97q/s] 


Done, run was created


In [9]:
persist_and_normalize_run(run, 'tfidf_klq_tuned')

I use the environment variable "TIRA_OUTPUT_DIR" to determine where I should store the run file using "." as default.
Done. run file is stored under "./run.txt".
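
For orientation, here is a minimal sketch of the standard TREC run format (`qid Q0 docno rank score tag`), assuming that `run.txt` follows it. The helper `to_trec_lines` and the sample results are hypothetical and do not reproduce what `persist_and_normalize_run` does internally; starting ranks at 1 is also an assumption.

```python
def to_trec_lines(results, tag):
    """Format (qid, docno, score) triples as TREC run lines.

    Documents are ranked per query by descending score; ranks start at 1
    (an assumption made for this sketch).
    """
    by_query = {}
    for qid, docno, score in results:
        by_query.setdefault(qid, []).append((docno, score))

    lines = []
    for qid in sorted(by_query):
        ranked = sorted(by_query[qid], key=lambda pair: pair[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {tag}")
    return lines


# Hypothetical scored results for two queries.
results = [("q1", "docA", 7.5), ("q1", "docB", 9.25), ("q2", "docC", 3.0)]
lines = to_trec_lines(results, "tfidf_klq_tuned")
print("\n".join(lines))
```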


## FAILED EXPERIMENTS

In [18]:
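# FeaturesBatchRetrieve still ranks by TF_IDF; the BM25 and DPH scores are only attached as features.
# No later stage consumes the features, so the scores below equal those of TF_IDF >> bo1 >> TF_IDF.
tfidf_bo1_tfpl2_union = tf_idf >> bo1 >> pt.FeaturesBatchRetrieve(indexref, wmodel="TF_IDF", features=["WMODEL:BM25", "WMODEL:DPH"], verbose=True)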

pt.Experiment(
    [tfidf_bo1_tfpl2_union],
    topics, qrels,
    ["P_10", "recall_10", "ndcg"],
    ["TF_IDF >> BO1 >> TF_IDF >> TF**PL2"],
    highlight="bold"
)

BR(TF_IDF): 100%|██████████| 882/882 [00:09<00:00, 97.50q/s] 
Transformer: 100%|██████████| 878/878 [00:18<00:00, 47.48q/s]
FBR(TF_IDF and 2 features): 100%|██████████| 878/878 [01:21<00:00, 10.73q/s]


|   | name | P_10 | recall_10 | ndcg |
|---|------|------|-----------|------|
| 0 | TF_IDF >> BO1 >> TF_IDF >> TF**PL2 | 0.096032 | 0.248524 | 0.31513 |


In [77]:
from chatnoir_api.chat import ChatNoirChatClient
from tqdm import tqdm

chatnoir_chat = ChatNoirChatClient()

def pseudo_document(query):
    # Return only the first 100 characters (roughly 10 terms).
    # Hypothesis: the LLM outputs the more important terms first. (The cutoff is cherry-picked.)
    return chatnoir_chat.chat(f'I am a search engine. Please name related terms for the query "{query}".')[:100]

# Generate a pseudo-relevant document for the first 150 queries only;
# the remaining queries get an empty document.
llm_expansion_documents = []
for i, (_, topic) in tqdm(enumerate(topics.iterrows())):
    text = pseudo_document(topic.query) if i < 150 else ""
    llm_expansion_documents.append({'docno': f'llm-expansion-for-query-{topic.qid}', 'text': text})


ChatNoir Chat uses ws_host from environment Environment variable
ChatNoir Chat uses API key from Environment variable
ChatNoir Chat uses model 'alpaca-en-7b' from Environment variable
ChatNoir Chat uses endpoint 'https://chat.web.webis.de/' from {endpoint[1]}


146it [00:13, 11.24it/s]ChatNoir API quota exceeded. Retrying in 1 seconds.
882it [00:15, 58.54it/s]


In [83]:
indexer = pt.IterDictIndexer('/tmp/llm-expansion-index', overwrite=True, blocks=True, meta={'docno': 100, 'text': 20480})
index_ref = indexer.index(llm_expansion_documents)
llm_index = pt.IndexFactory.of(index_ref)

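# We make a PyTerrier transformer out of the expansion documents
# so that we can use it in subsequent pipelines.
# Note: 'qid' is the enumeration index and 'docno' holds the generated text, so neither matches
# the topic ids or the indexed docnos. This is the likely cause of the ValueError below.
llm_expansion = pt.Transformer.from_df(pd.DataFrame([
    {'qid': f'{i}', 'docno': f'{expanded_doc["text"]}'} for i, expanded_doc in enumerate(llm_expansion_documents)
]))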

tfidf_llm_bo1 = llm_expansion >> pt.rewrite.Bo1QueryExpansion(llm_index) >> tf_idf

12:10:00.825 [ForkJoinPool-12-worker-1] WARN org.terrier.structures.indexing.Indexer - Indexed 732 empty documents


In [84]:
pt.Experiment(
    [tfidf_llm_bo1, tf_idf_bo1],
    topics, qrels,
    ["P_10", "recall_10", "ndcg"],
    # The names must be listed in the same order as the systems above.
    ["LLM + TF_IDF", "TF_IDF >> bo1 >> TF_IDF"],
    highlight="bold"
)


ValueError: 882 topics, but no results received from Compose(Compose(Transformer(), QueryExpansion(/tmp/llm-expansion-index/data.properties,3,10,<org.terrier.querying.QueryExpansion at 0x29841a570 jclass=org/terrier/querying/QueryExpansion jself=<LocalRef obj=0x136ec3d52 at 0x299146ed0>>)), BR(TF_IDF))
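
A likely cause of this error is that the rows passed to `pt.Transformer.from_df` use the enumeration index as `qid` and the generated text as `docno`, so nothing matches the real topic ids or the indexed expansion documents. The following is a minimal sketch of how matching rows could be built, assuming each expansion document keeps the `llm-expansion-for-query-{qid}` docno used above. The helper `expansion_rows` and the sample texts are hypothetical, and this has not been verified against the full pipeline.

```python
def expansion_rows(qids, expansion_documents):
    """Pair each query id with the docno of its non-empty expansion document."""
    docs_by_docno = {doc["docno"]: doc for doc in expansion_documents}
    rows = []
    for qid in qids:
        docno = f"llm-expansion-for-query-{qid}"
        doc = docs_by_docno.get(docno)
        # Skip queries whose expansion document is missing or empty.
        if doc is not None and doc["text"].strip():
            rows.append({"qid": qid, "docno": docno})
    return rows


# Hypothetical stand-ins for the notebook's data; the expansion texts are made up.
qids = ["q072224", "q072226"]
expansion_documents = [
    {"docno": "llm-expansion-for-query-q072224", "text": "loan mortgage credit"},
    {"docno": "llm-expansion-for-query-q072226", "text": ""},
]

rows = expansion_rows(qids, expansion_documents)
print(rows)
```

The resulting rows could then be wrapped in `pd.DataFrame` and passed to `pt.Transformer.from_df`, as in the cell above. Queries without a usable expansion document are left out here and would need a fallback, for example the unexpanded query.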