# BM25 Retrieval on Multiple Fields in PyTerrier

This Jupyter notebook retrieves documents using multiple fields (title, abstract, and full text) and combines the per-field scores with a simple, hand-picked weighted sum.

It may serve as a starting point or inspiration for feature-based learning-to-rank approaches.

The notebook is deliberately condensed.
For a more detailed walkthrough, see [pyterrier-bm25.ipynb](pyterrier-bm25.ipynb).

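To make the idea of combining field scores concrete, here is a minimal sketch in plain Python. The document names and per-field scores are made up for illustration; only the weights (2, 1, 0.5) are taken from the formula used later in this notebook.

```python
# Hypothetical per-field BM25 scores for two documents and one query.
field_scores = {
    "doc-a": {"title": 4.0, "abstract": 2.0, "text": 6.0},
    "doc-b": {"title": 1.0, "abstract": 5.0, "text": 8.0},
}

# Field weights as used in this notebook: title counts most, full text least.
weights = {"title": 2.0, "abstract": 1.0, "text": 0.5}


def combine(scores, weights):
    """Return the weighted sum of the per-field scores of one document."""
    return sum(weights[field] * score for field, score in scores.items())


combined = {docno: combine(scores, weights) for docno, scores in field_scores.items()}
ranking = sorted(combined, key=combined.get, reverse=True)

print(combined)  # {'doc-a': 13.0, 'doc-b': 11.0}
print(ranking)   # ['doc-a', 'doc-b']
```

A learning-to-rank approach would keep the per-field scores as separate features and learn the weights from relevance judgments instead of fixing them by hand.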
### Step 1: Import everything and load variables

In [2]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run
import json
from tqdm import tqdm

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./iranthology-dataset-tira')


I will use a small hardcoded example located in ./iranthology-dataset-tira.
The output directory is /tmp/


### Step 2: Load the Data

In [3]:
print('Step 2: Load the data.')

queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')

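# Read one JSON document per line; the with-block closes the file afterwards
with open(input_directory + '/documents.jsonl', 'r') as f:
    documents = [json.loads(line) for line in f]
# Keep only the fields we index; title and abstract are nested in 'original_document'
documents = [{'docno': i['docno'], 'text': i['text'], 'title': i['original_document']['title'], 'abstract': i['original_document']['abstract']} for i in documents]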

print('We look at the first document:\n')
print(documents[0])

Step 2: Load the data.
We look at the first document:

{'docno': '2019.sigirconf_workshop-2019birndl.0', 'text': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019 ', 'title': 'Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019) co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), Paris, France, July 25, 2019', 'abstract': ''}

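The flattening step above assumes that every record has a `title` and an `abstract` inside `original_document`. Whether that holds for the whole dataset is not shown here, so the following sketch uses two made-up records in the same nested layout and falls back to an empty string when a field is missing. The helper name `flatten` is hypothetical.

```python
import io
import json

# Two made-up records in the same nested layout as documents.jsonl (assumed).
raw = io.StringIO(
    '{"docno": "d1", "text": "A title An abstract", '
    '"original_document": {"title": "A title", "abstract": "An abstract"}}\n'
    '{"docno": "d2", "text": "Only a title", '
    '"original_document": {"title": "Only a title"}}\n'
)


def flatten(record):
    """Lift title and abstract to the top level; missing fields become ''."""
    original = record.get("original_document", {})
    return {
        "docno": record["docno"],
        "text": record.get("text", ""),
        "title": original.get("title") or "",
        "abstract": original.get("abstract") or "",
    }


flat_documents = [flatten(json.loads(line)) for line in raw if line.strip()]
print(flat_documents[1])
```

With this variant a record without an abstract yields `'abstract': ''`, matching the empty abstract of the first document printed above, instead of raising a `KeyError`.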

### Step 3: Create the Index

In [4]:
print('Step 3: Create the Index.')

!rm -Rf ./index
iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100, 'title': 10240, 'abstract': 10240, 'text': 10240}, blocks=True)
index_ref = iter_indexer.index(tqdm(documents))

Step 3: Create the Index.


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 53673/53673 [00:19<00:00, 2717.15it/s]


06:06:29.865 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 3 empty documents

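The `meta` argument reserves a maximum length per stored field. If values longer than these limits are cropped when stored (an assumption about the indexer that is worth verifying in the PyTerrier documentation), the field scorers in the next step would only see the cropped text. The sketch below checks in plain Python which values exceed their limit. The sample documents and the helper `fields_over_limit` are made up, and the limits are assumed to count characters.

```python
# Limits copied from the meta argument of the indexer above.
meta_limits = {"docno": 100, "title": 10240, "abstract": 10240, "text": 10240}

# Made-up documents; in the notebook this would be the `documents` list.
sample_documents = [
    {"docno": "d1", "title": "Short", "abstract": "", "text": "Short"},
    {"docno": "d2", "title": "Long", "abstract": "", "text": "x" * 20000},
]


def fields_over_limit(docs, limits):
    """Yield (docno, field, length) for each value longer than its limit."""
    for doc in docs:
        for field, limit in limits.items():
            length = len(doc.get(field, ""))
            if length > limit:
                yield doc["docno"], field, length


too_long = list(fields_over_limit(sample_documents, meta_limits))
print(too_long)  # [('d2', 'text', 20000)]
```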

### Step 4: Create Retrieval Pipeline

In [5]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True, metadata=['docno', 'text', 'title', 'abstract'])

bm25_title = pt.text.scorer(body_attr="title", wmodel="BM25")
bm25_abstract = pt.text.scorer(body_attr="abstract", wmodel="BM25")
bm25_text = pt.text.scorer(body_attr="text", wmodel="BM25")


# A hand-picked ("random") ranking formula that puts the highest weight on the title
# and the lowest weight on matches in the text field.
# There is big potential for improvement here :)
combined_bm25_score = ((2*bm25_title) + (1*bm25_abstract) + (0.5*bm25_text))


dph_title = pt.text.scorer(body_attr="title", wmodel="DPH")
dph_abstract = pt.text.scorer(body_attr="abstract", wmodel="DPH")
dph_text = pt.text.scorer(body_attr="text", wmodel="DPH")

# The same hand-picked weighting, now applied to the DPH scores:
# highest weight on the title, lowest weight on the text field.
# There is big potential for improvement here :)
combined_dph_score = ((2*dph_title) + (1*dph_abstract) + (0.5*dph_text))

# The overall pipeline: retrieve the top 1000 results with BM25, then re-rank them
# using the sum of the combined BM25 and DPH scores.
# There is big potential for improvement here :)
retrieval_pipeline = bm25 % 1000 >> (combined_bm25_score + combined_dph_score)

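The pipeline expression relies on Python's operator precedence: `%` binds tighter than `+`, which binds tighter than `>>`. The sketch below uses a tiny stand-in class (hypothetical, not part of PyTerrier) that only records how the operators nest, so the grouping can be checked without an index. It shows that the rank cutoff applies to the first-stage retriever and that the two re-rankers are added before they are chained.

```python
class Expr:
    """Tiny stand-in for a transformer that records how operators nest."""

    def __init__(self, label):
        self.label = label

    def __repr__(self):
        return self.label

    def __mod__(self, cutoff):    # transformer % k
        return Expr(f"({self} % {cutoff})")

    def __rshift__(self, other):  # first >> second
        return Expr(f"({self} >> {other})")

    def __add__(self, other):     # left + right
        return Expr(f"({self} + {other})")


# Stand-ins for the first-stage retriever and the two combined scorers.
first_stage = Expr("bm25")
bm25_fields = Expr("combined_bm25_score")
dph_fields = Expr("combined_dph_score")

pipeline = first_stage % 1000 >> bm25_fields + dph_fields
print(pipeline)
# ((bm25 % 1000) >> (combined_bm25_score + combined_dph_score))
```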
In [6]:
print('We show the first five query-document pairs after BM25 retrieval to show what fields we have added:')
run = bm25(queries[:1])
run.head(5)

We show the first five query-document pairs after BM25 retrieval to show what fields we have added:


BR(BM25): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.10q/s]


|   | qid | docid | docno | text | title | abstract | rank | score | query |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 49659 | 2021.ipm_journal-ir0anthology0volumeA58A1.6 | Detecting health misinformation in online heal... | Detecting health misinformation in online heal... |  | 0 | 16.708093 | detect health related queries |
| 1 | 1 | 27490 | 2011.spire_conference-2011.10 | Detecting Health Events on the Social Web to E... | Detecting Health Events on the Social Web to E... |  | 1 | 15.699445 | detect health related queries |
| 2 | 1 | 19930 | 2019.cikm_conference-2019.346 | Concept Drift Adaption for Online Anomaly Dete... | Concept Drift Adaption for Online Anomaly Dete... |  | 2 | 15.507586 | detect health related queries |
| 3 | 1 | 39429 | 2021.tist_journal-ir0anthology0volumeA12A2.4 | Indirectly Supervised Anomaly Detection of Cli... | Indirectly Supervised Anomaly Detection of Cli... |  | 3 | 15.137599 | detect health related queries |
| 4 | 1 | 33009 | 2013.wwwconf_conference-2013c.302 | From health-persona to societal health ABSTRAC... | From health-persona to societal health | ABSTRACTIn this position paper, we propose an ... | 4 | 14.88186 | detect health related queries |


### Step 5: Create the run

In [7]:
print('Step 5: Create Run.')

run = retrieval_pipeline(queries)

Step 5: Create Run.


BR(BM25): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.91q/s]




In [8]:
print('We look at the first 10 results of the run:\n')
run.head(10)

We look at the first 10 results of the run:



|   | qid | docid | docno | text | title | abstract | score | query | rank |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 49659 | 2021.ipm_journal-ir0anthology0volumeA58A1.6 | Detecting health misinformation in online heal... | Detecting health misinformation in online heal... |  | 25.991802 | detect health related queries | 56 |
| 1 | 1 | 27490 | 2011.spire_conference-2011.10 | Detecting Health Events on the Social Web to E... | Detecting Health Events on the Social Web to E... |  | 25.015085 | detect health related queries | 64 |
| 2 | 1 | 19930 | 2019.cikm_conference-2019.346 | Concept Drift Adaption for Online Anomaly Dete... | Concept Drift Adaption for Online Anomaly Dete... |  | 24.333914 | detect health related queries | 71 |
| 3 | 1 | 39429 | 2021.tist_journal-ir0anthology0volumeA12A2.4 | Indirectly Supervised Anomaly Detection of Cli... | Indirectly Supervised Anomaly Detection of Cli... |  | 23.046158 | detect health related queries | 101 |
| 4 | 1 | 33009 | 2013.wwwconf_conference-2013c.302 | From health-persona to societal health ABSTRAC... | From health-persona to societal health | ABSTRACTIn this position paper, we propose an ... | 33.452691 | detect health related queries | 5 |
| 5 | 1 | 33172 | 2018.wwwconf_conference-2018.13 | Did You Really Just Have a Heart Attack?: Towa... | Did You Really Just Have a Heart Attack?: Towa... | ABSTRACTMillions of users share their experien... | 39.376744 | detect health related queries | 1 |
| 6 | 1 | 23061 | 2010.cikm_conference-2010.284 | Unsupervised public health event detection for... | Unsupervised public health event detection for... | ABSTRACTRecent pandemics such as Swine Flu hav... | 41.056776 | detect health related queries | 0 |
| 7 | 1 | 28878 | 2012.wwwconf_conference-2012c.37 | Making use of social media data in public heal... | Making use of social media data in public health | ABSTRACTDisease surveillance systems exist to ... | 32.582234 | detect health related queries | 7 |
| 8 | 1 | 14388 | 2016.fire_conference-2016w.51 | Team DA_IICT at Consumer Health Information Se... | Team DA_IICT at Consumer Health Information Se... | Consumer Health Information Search task focuse... | 27.562023 | detect health related queries | 39 |
| 9 | 1 | 13049 | 2020.clef_conference-2020w.101 | LIG-Health at Adhoc and Spoken IR Consumer Hea... | LIG-Health at Adhoc and Spoken IR Consumer Hea... | This paper describes the work done by the LIG ... | 33.311709 | detect health related queries | 6 |

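The rows above still appear in the order of the first-stage BM25 retrieval, while the `rank` column already reflects the combined score. The sketch below re-sorts five of the rows by score in plain Python. The ranks it assigns cover only these five rows, so they differ from the ranks in the table, which were computed over all re-ranked results for the query.

```python
# (docno, combined score) for five rows of the table above.
rows = [
    ("2021.ipm_journal-ir0anthology0volumeA58A1.6", 25.991802),
    ("2013.wwwconf_conference-2013c.302", 33.452691),
    ("2018.wwwconf_conference-2018.13", 39.376744),
    ("2010.cikm_conference-2010.284", 41.056776),
    ("2012.wwwconf_conference-2012c.37", 32.582234),
]

# Sort by descending score and assign zero-based ranks, as in the run.
by_score = sorted(rows, key=lambda row: row[1], reverse=True)
ranked = [(rank, docno, score) for rank, (docno, score) in enumerate(by_score)]

for rank, docno, score in ranked:
    print(rank, docno, score)
```

In the notebook itself, `run.sort_values(['qid', 'rank'])` gives the same kind of view for the full run.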

### Step 6: Persist Run

In [9]:
print('Step 6: Persist Run.')

persist_and_normalize_run(run, output_file=output_directory, system_name='multi-field', depth=1000)

print('Done :)')

Step 6: Persist Run.
Done :)

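Run files in information retrieval are commonly stored in the TREC run layout, one line per result with the columns `qid Q0 docno rank score system`. The persisted file presumably follows this layout, but that is an assumption and not something this notebook shows. The sketch below formats made-up results in that layout. The helper `to_trec_lines` is hypothetical, and starting ranks at 1 is a choice of the sketch.

```python
# Made-up run rows: (qid, docno, score). The system name matches Step 6.
system_name = "multi-field"
results = [
    ("1", "doc-b", 39.4),
    ("1", "doc-a", 41.1),
    ("2", "doc-c", 12.0),
]


def to_trec_lines(results, system_name, depth=1000):
    """Format results as TREC run lines: qid Q0 docno rank score system."""
    lines = []
    for qid in sorted({qid for qid, _, _ in results}):
        per_query = sorted(
            (r for r in results if r[0] == qid), key=lambda r: r[2], reverse=True
        )
        # Keep at most `depth` results per query; ranks start at 1 here.
        for rank, (_, docno, score) in enumerate(per_query[:depth], start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines


trec_lines = to_trec_lines(results, system_name)
print("\n".join(trec_lines))
```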