# BM25 Retrieval with PyTerrier

### Step 1: Import everything and load variables

In [1]:
import pyterrier as pt
import pandas as pd

# We use three methods from the tira third_party_integrations that help to mitigate common pitfalls:
# - ensure_pyterrier_is_loaded:
#   Loads PyTerrier without an internet connection
#   (in TIRA, retrieval approaches have no access to the internet, which improves reproducibility).
#
# - get_input_directory_and_output_directory:
#   Your software is expected to read the data from an input directory and write the results (i.e., the run file) to an output directory.
#   Both directories are passed as arguments when the software is executed within TIRA.
#   If the software is not executed in TIRA, this method returns the input directory passed to it (which might be mounted),
#   so you can run the same notebook during development and in TIRA.
#
# - persist_and_normalize_run:
#   Writing run files can come with some non-obvious edge cases (e.g., score ties).
#   This method takes care of some of the more frequent ones.
#
# You do not have to use any of these methods; in the end, the task is only to "generate an output from an input".
# We are of course also happy about pull requests that help to improve the handling of frequently used patterns.
# Please find the documentation here: https://github.com/tira-io/tira/blob/main/python-client/tira/third_party_integrations.py
#
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run
import json
from tqdm import tqdm

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('ir-anthology-dataset')


Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


I will use a small hardcoded example located in ir-anthology-dataset.
The output directory is /tmp/


In [2]:
print('The input directory contains the following files:\n')
!ls -lh {input_directory}

The input directory contains the following files:

total 111M
-rw-r--r-- 1 root root 111M Apr 30 13:09 documents.jsonl
-rw-r--r-- 1 root root   73 Apr 30 13:09 metadata.json
-rw-r--r-- 1 root root 1.8K Apr 30 13:09 queries.jsonl
-rw-r--r-- 1 root root 2.3K Apr 30 13:09 queries.xml


### Step 2: Load the Data

In [3]:
print('Step 2: Load the data.')

queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')

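# Read one JSON document per line; the context manager closes the file afterwards.
with open(input_directory + '/documents.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]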


Step 2: Load the data.


In [4]:
print('We look at the first document:\n')
print(documents[0])

We look at the first document:

{'docno': '2019.sigirconf_workshop-2019birndl.0', 'text': 'CEUR Workshop Proceedings 2414 CEUR-WS.org 2019 http://ceur-ws.org/Vol-2414 urn:nbn:de:0074-2414-3 https://dblp.org/rec/conf/sigir/2019birndl.bib dblp computer science bibliography, https://dblp.org DBLP:conf/sigir/2019birndl proceedings Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019 Muthu Kumar Chandrasekaran Philipp Mayr SIGIR 1581522299.0  Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25

In [5]:
print('We look at the first query:\n')
print(queries.iloc[0].to_dict())

We look at the first query:

{'qid': '1', 'query': ' fake news detection'}
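
Loading all documents into a list is fine for a 111M file, but larger collections may not fit into memory. The following is a minimal sketch of a lazy alternative; the helper name `iter_jsonl` is our own and not part of PyTerrier or TIRA.

```python
import json
from typing import Dict, Iterator


def iter_jsonl(path: str) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at the end of the file
                yield json.loads(line)
```

Since `IterDictIndexer.index` accepts an iterable of dictionaries, a generator like this could probably be passed to it directly instead of the `documents` list. We have not tested that variant in this notebook.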


### Step 3: Create the Index

In [6]:
print('Step 3: Create the Index.')

!rm -Rf ./index
iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100})
index_ref = iter_indexer.index(tqdm(documents))

Step 3: Create the Index.


100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 53673/53673 [00:17<00:00, 2993.66it/s]
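
As far as we understand the PyTerrier documentation, `meta={'docno' : 100}` reserves 100 characters for the `docno` field in the index metadata, so longer identifiers may be cut off. A quick sanity check before indexing could look like the sketch below; `longest_field` is a hypothetical helper of ours, and the two records only stand in for the real `documents` list.

```python
def longest_field(records, field="docno"):
    """Return the length of the longest value of `field` across all records."""
    return max(len(str(record[field])) for record in records)


# Toy records standing in for the `documents` list loaded in Step 2.
sample = [
    {"docno": "2019.sigirconf_workshop-2019birndl.0", "text": "..."},
    {"docno": "2018.wwwconf_conference-2018c.140", "text": "..."},
]

reserved = 100  # the value passed as meta={'docno': 100}
assert longest_field(sample) <= reserved, "increase the reserved docno length"
```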


### Step 4: Create Retrieval Pipeline

In [10]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True)
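
To make clearer what the `BM25` weighting model computes, here is a small pure-Python sketch of a textbook BM25 variant. It is not Terrier's implementation: the idf formula, the whitespace tokenization, and the parameter values `k1=1.2` and `b=0.75` are assumptions for illustration, so the numbers will not match the scores in our run.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, shorter ones are boosted.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms that do not occur contribute nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "fake news detection on social media".split(),
    "neural ranking models for web search".split(),
    "detection of fake reviews".split(),
]
scores = bm25_scores("fake news detection".split(), docs)
```

In this toy collection the first document matches all three query terms and therefore receives the highest score, while the second document matches none and scores zero.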

### Step 5: Create the run

In [11]:
print('Step 5: Create Run.')

run = bm25(queries)

Step 5: Create Run.


BR(BM25): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  4.69q/s]


In [12]:
print('We look at the first 10 results of the run:\n')
run.head(10)

We look at the first 10 results of the run:



| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 14643 | 2019.wsdm_conference-2019.115 | 0 | 27.200501 | fake news detection |
| 1 | 1 | 33491 | 2018.wwwconf_conference-2018c.140 | 1 | 26.183819 | fake news detection |
| 2 | 1 | 13148 | 2020.clef_conference-2020w.200 | 2 | 24.150446 | fake news detection |
| 3 | 1 | 13000 | 2020.clef_conference-2020w.52 | 3 | 23.678194 | fake news detection |
| 4 | 1 | 13059 | 2020.clef_conference-2020w.111 | 4 | 23.382961 | fake news detection |
| 5 | 1 | 8166 | 2018.sigirconf_conference-2018.30 | 5 | 23.355508 | fake news detection |
| 6 | 1 | 13022 | 2020.clef_conference-2020w.74 | 6 | 23.279387 | fake news detection |
| 7 | 1 | 12972 | 2020.clef_conference-2020w.24 | 7 | 23.100486 | fake news detection |
| 8 | 1 | 13002 | 2020.clef_conference-2020w.54 | 8 | 22.934617 | fake news detection |
| 9 | 1 | 13117 | 2020.clef_conference-2020w.169 | 9 | 22.790468 | fake news detection |


### Step 6: Persist Run

In [14]:
print('Step 6: Persist Run.')

persist_and_normalize_run(run, output_file=output_directory, system_name='BM25', depth=1000)

print('Done :)')

Step 6: Persist Run.
Done :)
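
The score ties mentioned in Step 1 are one reason why writing a run file by hand can go wrong: if two documents have the same score, their order has to be fixed somehow so that the ranks are reproducible. The sketch below shows one way to do this for the common six-column TREC run format. It is our own illustration and not the code behind `persist_and_normalize_run`; the function name `normalize_run`, the tie-breaking by `docno`, and ranks starting at 1 are all assumptions.

```python
from itertools import groupby


def normalize_run(rows, system_name, depth=1000):
    """Return TREC-style run lines with deterministic ranks.

    `rows` is an iterable of dicts with the keys qid, docno, and score.
    """
    # Sort by query, then by descending score, then by docno,
    # so that tied scores always end up in the same order.
    ordered = sorted(rows, key=lambda r: (str(r["qid"]), -r["score"], r["docno"]))
    lines = []
    for qid, group in groupby(ordered, key=lambda r: str(r["qid"])):
        # Keep at most `depth` results per query and assign ranks from 1.
        for rank, row in enumerate(list(group)[:depth], start=1):
            lines.append(f'{qid} Q0 {row["docno"]} {rank} {row["score"]} {system_name}')
    return lines


rows = [
    {"qid": "1", "docno": "doc-b", "score": 2.5},
    {"qid": "1", "docno": "doc-a", "score": 2.5},  # tied with doc-b
    {"qid": "1", "docno": "doc-c", "score": 3.0},
]
run_lines = normalize_run(rows, system_name="BM25")
```

With this ordering, `doc-c` is ranked first, and the tie between `doc-a` and `doc-b` is always resolved in favor of `doc-a`, regardless of the order in which the rows arrive.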


# Reflections

## Iris

## Yannick

The first steps of the second milestone were straightforward, and we generated the `run.txt` and `run.html` files quickly. Some of the later steps were unclear, though, and after we asked in the support forum the task description was updated to be clearer. Because we had already started the milestone and some of the steps had changed, we had to go back and redo parts of it, which was a bit confusing. In the end, once we understood the steps of the assignment properly, we were able to create the binary relevance assessments relatively quickly, and the remaining steps went more or less as planned.

## Justin
