# BM25 + SDM Model trial

## Step 1: Import libraries

No libraries are needed beyond those used by the baselines.
We ensure that PyTerrier is loaded and initialized, and import `persist_and_normalize_run`, which normalizes a run and writes it to an output text file.

In [1]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run
import json
from tqdm import tqdm
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./iranthology-dataset-tira')

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will use a small hardcoded example located in ./iranthology-dataset-tira.


In [2]:
tira = Client()

## Step 2: Load the data and create the index
The index has already been built in PyTerrier format and is loaded from TIRA. The dataset is the union of the IR Anthology and the ACL Anthology.

In [3]:
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)
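
The pre-built index is used as is. Proximity operators such as the ones SDM generates need term positions, which Terrier stores as "blocks". If a pre-built index lacks them, one option is to build a local index with blocks enabled. The following is a minimal sketch and not part of the original run: the helper name `build_block_index`, the index path, and the `docno` length of 100 are assumptions, and it presumes the dataset exposes its documents through `get_corpus_iter()` with a `text` field.

```python
def build_block_index(dataset, index_path="./index-with-blocks"):
    """Index `dataset` with term positions ("blocks") enabled.

    Hypothetical helper; path and meta length are illustrative assumptions.
    """
    # Imported lazily; assumes PyTerrier has already been started.
    import pyterrier as pt

    indexer = pt.IterDictIndexer(
        index_path,
        blocks=True,           # store term positions for phrase/proximity operators
        overwrite=True,        # replace an existing index at this path
        meta={"docno": 100},   # assumed upper bound on docno length
    )
    # get_corpus_iter() yields one dict per document (e.g. 'docno', 'text').
    return indexer.index(dataset.get_corpus_iter())
```

A call such as `build_block_index(pt_dataset)` would return an index reference that could be passed to `pt.BatchRetrieve` in place of the TIRA index. Indexing the full corpus locally takes noticeably longer than loading the pre-built index.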

## Step 3: Create the retrieval pipeline

The pipeline is our baseline BM25 model, which we try to improve with SDM (sequential dependence model) query rewriting.

In [4]:
sdm = pt.rewrite.SDM()
bm25 = pt.BatchRetrieve(index, wmodel="BM25", verbose=True)

retrieval_pipeline = sdm >> bm25
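
To make clearer what the SDM rewrite adds to a query, the sketch below lists the three kinds of evidence the sequential dependence model combines: single terms, exact adjacent pairs, and adjacent pairs within an unordered window. It is a plain-Python illustration only; the function name `sdm_components` is hypothetical, the operator strings follow the `#1(...)` and `#uw8(...)` notation, and the real rewritten query produced by `pt.rewrite.SDM()` also carries weights that are omitted here.

```python
def sdm_components(query):
    """Return the unigram, ordered-pair and unordered-window parts of a query.

    Illustrative only: this mimics the structure of an SDM rewrite,
    not the exact string that PyTerrier generates.
    """
    terms = query.split()
    # Adjacent term pairs, e.g. ("information", "retrieval")
    pairs = list(zip(terms, terms[1:]))
    return {
        "unigrams": terms,
        # Both terms must appear next to each other, in this order.
        "ordered": [f"#1({a} {b})" for a, b in pairs],
        # Both terms must appear within a window of 8 tokens, in any order.
        "unordered": [f"#uw8({a} {b})" for a, b in pairs],
    }


parts = sdm_components("information retrieval evaluation")
for name, values in parts.items():
    print(name, values)
```

The ordered and unordered parts are what require positional information in the index; the unigram part alone corresponds to the plain BM25 query.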

## Step 4: Create the run
Run the retrieval with the SDM rewrite applied.

In [5]:

print('First, we have a short look at the first three topics:')

topics = pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


In [6]:
print('Now we do the retrieval...')
run_sdm = retrieval_pipeline(topics)

print('Done. Here are the first 10 entries of the run')
run_sdm.head(10)

Now we do the retrieval...


BR(BM25):   0%|          | 0/3 [00:00<?, ?q/s]

18:32:03.657 [main] ERROR org.terrier.querying.LocalManager - Problem running Matching, returning empty result set as query 1
java.io.IOException: This index does not support blocks
	at org.terrier.matching.matchops.PhraseOp.createFinalPostingIterator(PhraseOp.java:85)
	at org.terrier.matching.matchops.MultiTermOp.getPostingIterator(MultiTermOp.java:131)
	at org.terrier.matching.matchops.MultiTermOp.getMatcher(MultiTermOp.java:147)
	at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:304)
	at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:282)
	at org.terrier.matching.daat.Full.match(Full.java:88)
	at org.terrier.querying.LocalManager$ApplyLocalMatching.process(LocalManager.java:518)
	at org.terrier.querying.LocalManager.runSearchRequest(LocalManager.java:895)
Caused by: java.lang.ClassCastException: class org.terrier.structures.postings.bit.FieldIterablePosting cannot be cast to class org.terrier.structures.postings.BlockPosting (org.terri

BR(BM25):  33%|███▎      | 1/3 [00:00<00:00,  7.46q/s]

18:32:03.702 [main] ERROR org.terrier.querying.LocalManager - Problem running Matching, returning empty result set as query 2
java.io.IOException: This index does not support blocks
	at org.terrier.matching.matchops.PhraseOp.createFinalPostingIterator(PhraseOp.java:85)
	at org.terrier.matching.matchops.MultiTermOp.getPostingIterator(MultiTermOp.java:131)
	at org.terrier.matching.matchops.MultiTermOp.getMatcher(MultiTermOp.java:147)
	at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:304)
	at org.terrier.matching.PostingListManager.<init>(PostingListManager.java:282)
	at org.terrier.matching.daat.Full.match(Full.java:88)
	at org.terrier.querying.LocalManager$ApplyLocalMatching.process(LocalManager.java:518)
	at org.terrier.querying.LocalManager.runSearchRequest(LocalManager.java:895)
Caused by: java.lang.ClassCastException: class org.terrier.structures.postings.bit.FieldIterablePosting cannot be cast to class org.terrier.structures.postings.BlockPosting (org.terri

BR(BM25): 100%|██████████| 3/3 [00:00<00:00, 16.40q/s]

Done. Here are the first 10 entries of the run





Unnamed: 0,docid,docno,rank,score,qid,query,query_0
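
The run is empty because every query failed with `This index does not support blocks`, raised from Terrier's `PhraseOp`. In Terrier, blocks hold the positions of terms inside documents, and phrase or window operators cannot be evaluated without them. The toy example below is a sketch of that idea in plain Python, not Terrier's implementation; the documents and the function names `build_positional_index` and `phrase_match` are made up for illustration.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {docno: [positions]} for a dict of docno -> text."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(docno, []).append(pos)
    return index


def phrase_match(index, first, second):
    """Return docnos where `second` occurs directly after `first`."""
    hits = []
    for docno, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(docno, []))
        # A phrase match needs a position p of `first` with `second` at p + 1.
        if any(p + 1 in following for p in positions):
            hits.append(docno)
    return sorted(hits)


docs = {
    "d1": "neural information retrieval",
    "d2": "retrieval of information",
}
index = build_positional_index(docs)
print(phrase_match(index, "information", "retrieval"))  # ['d1']
```

Both documents contain both terms, so an index that only records which documents a term occurs in could not tell them apart; only the stored positions identify `d1` as the phrase match.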


## Step 5: Save the run to a file
This step would be needed to evaluate the run in our full retrieval system. Since we evaluate directly in this notebook, it is not necessary here and is left commented out.

In [7]:
# Now we add the result to our run.txt
#persist_and_normalize_run(run_sdm,  default_output='../runs', system_name='BM25-SDM', depth=1000)


## Step 6: Evaluation

In [8]:
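# NOTE: `sdm` on its own is only a query rewriter and returns no ranking;
# the complete system is `retrieval_pipeline` (sdm >> bm25).
pt.Experiment(
    retr_systems=[bm25, sdm],
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    names=['BM25', 'BM25_SDM'],
    eval_metrics=['recall_1000', 'ndcg_cut_5', 'ndcg_cut_10', 'recip_rank']
)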

BR(BM25): 100%|██████████| 68/68 [00:02<00:00, 27.34q/s]


ValueError: unknown run format: DataFrame missing columns: ['score', 'doc_id'] (found ['query_id', 'query', 'query_0'])
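
The `ValueError` reports that the second system returned only query columns (`query_id`, `query`, `query_0`) and no `score` or `doc_id`. That matches what a query rewriter produces when it is evaluated on its own, which suggests the experiment should compare `bm25` against the composed pipeline rather than against `sdm`. A minimal sketch of such a call follows; the wrapper name `compare_bm25_with_sdm` is hypothetical, and it assumes PyTerrier is already started and that the index supports blocks, since otherwise the SDM run stays empty as seen in Step 4.

```python
def compare_bm25_with_sdm(bm25, sdm, dataset):
    """Evaluate plain BM25 against the SDM-rewritten BM25 pipeline.

    Hypothetical wrapper; the metrics and names mirror the cell above.
    """
    # Imported lazily; assumes PyTerrier has already been started.
    import pyterrier as pt

    return pt.Experiment(
        # The second system is the full pipeline, so it returns ranked documents.
        retr_systems=[bm25, sdm >> bm25],
        topics=dataset.get_topics('text'),
        qrels=dataset.get_qrels(),
        names=['BM25', 'BM25_SDM'],
        eval_metrics=['recall_1000', 'ndcg_cut_5', 'ndcg_cut_10', 'recip_rank'],
    )
```

In the notebook this would be called as `compare_bm25_with_sdm(bm25, sdm, pt_dataset)`, returning a table with one row per system.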