# BM25 Retrieval with PyTerrier

### Step 1: Import everything and load variables

In [2]:
import pyterrier as pt
import pandas as pd

# We use three helpers from tira.third_party_integrations that mitigate common pitfalls:
# - ensure_pyterrier_is_loaded:
#   Loads PyTerrier without an internet connection
#   (in TIRA, retrieval approaches have no internet access, which improves reproducibility).
#
# - get_input_directory_and_output_directory:
#   Your software is expected to read its data from an input directory and write its results (i.e., the run file) to an output directory.
#   Both directories are passed as arguments when the software is executed within TIRA.
#   Outside of TIRA, this method returns the default input directory that you pass to it (which might be mounted),
#   so the same notebook runs unchanged during development and in TIRA.
#
# - persist_and_normalize_run:
#   Writing run files comes with some non-obvious edge cases (e.g., score ties).
#   This method takes care of some of the most frequent ones.
#
# You do not have to use any of these methods; in the end, the task is only to "generate an output from an input".
# We are also happy to receive pull requests that improve the handling of frequently used patterns.
# Please find the documentation here: https://github.com/tira-io/tira/blob/main/python-client/tira/third_party_integrations.py
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run
import json
from tqdm import tqdm

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./iranthology-dataset-tira')


I will use a small hardcoded example located in ./iranthology-dataset-tira.
The output directory is /tmp/
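
The fallback behaviour described in the comments above can be sketched in a few lines. This is only an illustration, not the library's implementation: `resolve_directories` is a hypothetical helper, and the environment variable names `TIRA_INPUT_DIRECTORY` and `TIRA_OUTPUT_DIRECTORY` are assumptions about how the directories might be passed inside the sandbox.

```python
import os

def resolve_directories(default_input, default_output="/tmp/"):
    """Hypothetical stand-in for get_input_directory_and_output_directory."""
    # Assumption: inside the sandbox the directories arrive via these
    # environment variables; outside of it the defaults are used.
    input_dir = os.environ.get("TIRA_INPUT_DIRECTORY", default_input)
    output_dir = os.environ.get("TIRA_OUTPUT_DIRECTORY", default_output)
    return input_dir, output_dir

input_dir, output_dir = resolve_directories("./iranthology-dataset-tira")
print(input_dir, output_dir)
```

The point of the pattern is that the notebook itself never has to know where it is running: only the defaults are hardcoded, and they apply during local development.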


In [3]:
print('The input directory contains the following files:\n')
!ls -lh {input_directory}

The input directory contains the following files:

total 77M
-rw-r--r-- 1 root root  77M May  3 05:57 documents.jsonl
-rw-r--r-- 1 root root   41 May  3 05:57 metadata.json
-rw-r--r-- 1 root root 1.6K May  3 05:57 queries.jsonl
-rw-r--r-- 1 root root 2.1K May  3 05:57 queries.xml


### Step 2: Load the Data

In [4]:
print('Step 2: Load the data.')

queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')

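# Each line of documents.jsonl is one JSON object; the with-block closes the file afterwards.
with open(input_directory + '/documents.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]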


Step 2: Load the data.
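
Before indexing, it can help to verify that every loaded record carries the fields the later steps rely on (`docno` and `text`) and that no `docno` occurs twice. The following is a minimal sketch under those assumptions; `check_documents` is a hypothetical helper and the two sample records are made up.

```python
def check_documents(docs, required=("docno", "text")):
    """Return (missing, duplicates): positions of records that lack a
    required field, and docnos that occur more than once."""
    missing = [i for i, d in enumerate(docs) if any(k not in d for k in required)]
    seen, duplicates = set(), set()
    for d in docs:
        docno = d.get("docno")
        if docno is None:
            continue  # already reported in `missing`
        if docno in seen:
            duplicates.add(docno)
        seen.add(docno)
    return missing, sorted(duplicates)

# Made-up records standing in for the parsed documents.jsonl
sample_docs = [
    {"docno": "d1", "text": "health related queries"},
    {"docno": "d1", "text": "a repeated identifier"},
    {"text": "a record without docno"},
]
missing, duplicates = check_documents(sample_docs)
print(missing, duplicates)
```

With the real data, the call would be `check_documents(documents)`.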


In [6]:
print('We look at the first document:\n')
print(documents[0])

We look at the first document:

{'docno': '2019.sigirconf_workshop-2019birndl.0', 'text': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019 ', 'original_document': {'doc_id': '2019.sigirconf_workshop-2019birndl.0', 'abstract': '', 'title': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019', 'authors': [], 'year': '2019', 'booktitle': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIR

In [7]:
print('We look at the first query:\n')
print(queries.iloc[0].to_dict())

We look at the first query:

{'qid': '1', 'query': ' detect health related queries'}
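
Note that the query text starts with a space. This may well be harmless for retrieval, but trimming it keeps logs and run files tidy. A minimal sketch on plain dictionaries follows (`normalize_query` is a hypothetical helper); on the `queries` DataFrame, `queries['query'].str.strip()` would trim the ends in the same way.

```python
def normalize_query(text):
    """Trim the ends and collapse internal runs of whitespace to single spaces."""
    return " ".join(text.split())

# The first query from above, including its leading space
topics = [{"qid": "1", "query": " detect health related queries"}]
cleaned = [{**t, "query": normalize_query(t["query"])} for t in topics]
print(cleaned[0])
```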


### Step 3: Create the Index

In [8]:
print('Step 3: Create the Index.')

!rm -Rf ./index
iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100})
index_ref = iter_indexer.index(tqdm(documents))

Step 3: Create the Index.


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 53673/53673 [00:11<00:00, 4500.49it/s]


06:05:27.563 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 3 empty documents
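
The warning above reports documents that ended up empty during indexing. One way to look for candidates is to list the records whose `text` field is missing or blank, as in the sketch below. `find_empty_documents` is a hypothetical helper and the sample records are made up; the indexer may also count documents as empty after its own tokenisation, so the numbers need not match exactly.

```python
def find_empty_documents(docs):
    """Return the docnos of records whose text is missing or only whitespace."""
    return [d["docno"] for d in docs if not d.get("text", "").strip()]

# Made-up records; with the real data, pass `documents` instead
sample_docs = [
    {"docno": "d1", "text": "detect health related queries"},
    {"docno": "d2", "text": "   "},
    {"docno": "d3", "text": ""},
]
print(find_empty_documents(sample_docs))
```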


### Step 4: Create Retrieval Pipeline

In [9]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True)
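
To make clear what the `BM25` weighting model computes, here is a minimal pure-Python sketch of a textbook Okapi BM25 variant. It is not Terrier's implementation: tokenisation is a plain whitespace split without stemming or stopword removal, `k1=1.2` and `b=0.75` are common textbook defaults, and the three documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (docno -> text) against `query`."""
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for docno, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation: long documents need more occurrences
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[docno] = score
    return scores

docs = {
    "d1": "health related queries in web search",
    "d2": "query expansion for web search",
    "d3": "health records retrieval",
}
scores = bm25_scores("health related queries", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Documents that share no term with the query receive no score at all, which is why the run produced in the next step contains only matching documents.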

### Step 5: Create the Run

In [10]:
print('Step 5: Create Run.')

run = bm25(queries)

Step 5: Create Run.


BR(BM25): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.34q/s]


In [11]:
print('We look at the first 10 results of the run:\n')
run.head(10)

We look at the first 10 results of the run:



| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 49659 | 2021.ipm_journal-ir0anthology0volumeA58A1.6 | 0 | 16.708093 | detect health related queries |
| 1 | 1 | 27490 | 2011.spire_conference-2011.10 | 1 | 15.699445 | detect health related queries |
| 2 | 1 | 19930 | 2019.cikm_conference-2019.346 | 2 | 15.507586 | detect health related queries |
| 3 | 1 | 39429 | 2021.tist_journal-ir0anthology0volumeA12A2.4 | 3 | 15.137599 | detect health related queries |
| 4 | 1 | 33009 | 2013.wwwconf_conference-2013c.302 | 4 | 14.88186 | detect health related queries |
| 5 | 1 | 33172 | 2018.wwwconf_conference-2018.13 | 5 | 14.657232 | detect health related queries |
| 6 | 1 | 23061 | 2010.cikm_conference-2010.284 | 6 | 14.632242 | detect health related queries |
| 7 | 1 | 28878 | 2012.wwwconf_conference-2012c.37 | 7 | 14.61578 | detect health related queries |
| 8 | 1 | 14388 | 2016.fire_conference-2016w.51 | 8 | 14.587083 | detect health related queries |
| 9 | 1 | 13049 | 2020.clef_conference-2020w.101 | 9 | 14.474406 | detect health related queries |


### Step 6: Persist Run

In [12]:
print('Step 6: Persist Run.')

persist_and_normalize_run(run, output_file=output_directory, system_name='BM25', depth=1000)

print('Done :)')

Step 6: Persist Run.
Done :)
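
To illustrate the kind of edge case mentioned in Step 1 (score ties), the sketch below shows one possible way to normalise a run before writing it in the common TREC run format (`qid Q0 docno rank score tag`). It is not the implementation of `persist_and_normalize_run`: `normalize_run` and `to_trec_lines` are hypothetical helpers, breaking ties by `docno` and starting ranks at 1 are assumptions made for this example, and the rows are made up.

```python
def normalize_run(rows, depth=1000):
    """Re-rank each query's hits deterministically and cut off at `depth`."""
    by_query = {}
    for row in rows:
        by_query.setdefault(row["qid"], []).append(row)
    normalized = []
    for qid, hits in by_query.items():
        # Descending score; equal scores are ordered by docno so that
        # repeated executions always produce the same ranking.
        hits = sorted(hits, key=lambda r: (-r["score"], r["docno"]))[:depth]
        for rank, hit in enumerate(hits, start=1):
            normalized.append({**hit, "rank": rank})
    return normalized

def to_trec_lines(rows, system_name):
    """Format normalised rows as TREC run lines."""
    return [
        f'{r["qid"]} Q0 {r["docno"]} {r["rank"]} {r["score"]} {system_name}'
        for r in rows
    ]

# Made-up rows with a score tie between doc-a and doc-b
rows = [
    {"qid": "1", "docno": "doc-b", "score": 2.5},
    {"qid": "1", "docno": "doc-a", "score": 2.5},
    {"qid": "1", "docno": "doc-c", "score": 3.0},
]
lines = to_trec_lines(normalize_run(rows, depth=2), "BM25")
print("\n".join(lines))
```

Without an explicit tie-breaker, the order of tied documents depends on the order in which the retrieval system happened to return them, which makes runs harder to reproduce and compare.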
