# BM25 Retrieval with RM3 Query Expansion in PyTerrier

This Jupyter notebook implements RM3 query expansion for BM25 retrieval.
It is deliberately condensed and keeps the explanations short.
For a more detailed walkthrough of the individual steps, see [pyterrier-bm25.ipynb](pyterrier-bm25.ipynb).

### Step 1: Import everything and load variables

In [2]:
import pyterrier as pt
import pandas as pd
from tira.third_party_integrations import ensure_pyterrier_is_loaded, get_input_directory_and_output_directory, persist_and_normalize_run
import json
from tqdm import tqdm

ensure_pyterrier_is_loaded()
input_directory, output_directory = get_input_directory_and_output_directory('./iranthology-dataset-tira')


I will use a small hardcoded example located in ./iranthology-dataset-tira.
The output directory is /tmp/


### Step 2: Load the Data

In [3]:
print('Step 2: Load the data.')

queries = pt.io.read_topics(input_directory + '/queries.xml', format='trecxml')

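# Read one JSON document per line; the context manager closes the file afterwards
with open(input_directory + '/documents.jsonl', 'r') as documents_file:
    documents = [json.loads(line) for line in documents_file]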


Step 2: Load the data.
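To make the JSONL format concrete, here is a minimal sketch that parses a tiny in-memory sample the same way. The two records and their field names (`docno`, `text`) are assumptions for illustration; the real `documents.jsonl` may carry other fields.

```python
import io
import json

# Hypothetical two-record sample; the field names "docno" and "text" are assumptions.
sample_jsonl = io.StringIO(
    '{"docno": "doc-1", "text": "detecting health related queries"}\n'
    '\n'
    '{"docno": "doc-2", "text": "query expansion with relevance models"}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


sample_documents = read_jsonl(sample_jsonl)
print(len(sample_documents), sample_documents[0]["docno"])
```

Skipping blank lines is a defensive choice of this sketch; the notebook cell itself assumes every line holds a valid JSON object.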


### Step 3: Create the Index

In [4]:
print('Step 3: Create the Index.')

!rm -Rf ./index
iter_indexer = pt.IterDictIndexer("./index", meta={'docno' : 100}, blocks=True)
index_ref = iter_indexer.index(tqdm(documents))

Step 3: Create the Index.


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 53673/53673 [00:13<00:00, 4009.39it/s]


06:03:29.125 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 3 empty documents
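With `blocks=True`, the index also records where each term occurs inside a document. The following sketch illustrates that idea with a toy positional inverted index in plain Python. It is not how Terrier stores its index; the whitespace tokenizer and the sample documents are assumptions.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {docno: [positions]} using naive whitespace tokenization."""
    index = defaultdict(dict)
    for doc in docs:
        for position, term in enumerate(doc["text"].lower().split()):
            index[term].setdefault(doc["docno"], []).append(position)
    return index


# Hypothetical sample documents
toy_docs = [
    {"docno": "d1", "text": "health query detection"},
    {"docno": "d2", "text": "query expansion improves query recall"},
]

toy_index = build_positional_index(toy_docs)
print(toy_index["query"])
```

Positions like these are what make phrase and proximity matching possible, at the cost of a larger index.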


### Step 4: Create Retrieval Pipeline

In [5]:
rm3 = pt.rewrite.RM3(index_ref)
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose=True)

# First, BM25 retrieves some "pseudo-relevant" documents.
# RM3 then uses the top BM25 results to add weighted expansion terms to the query.
# Finally, BM25 retrieves again with the expanded query.

retrieval_pipeline = bm25 >> rm3 >> bm25
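The expansion step in the middle of this pipeline can be sketched in plain Python. This is a simplified illustration of the RM3 idea (a feedback term distribution interpolated with the original query), not Terrier's implementation. The function name, the parameters `fb_terms` and `original_weight`, the use of normalized retrieval scores as document weights, and the sample data are all assumptions.

```python
from collections import Counter


def rm3_expand(query_terms, feedback_docs, fb_terms=3, original_weight=0.6):
    """Return {term: weight} for an expanded query.

    feedback_docs is a list of (tokens, retrieval_score) pairs from the first pass.
    """
    total_score = sum(score for _, score in feedback_docs)

    # Relevance model: term frequency in each feedback document,
    # weighted by that document's share of the retrieval score.
    relevance_model = Counter()
    for tokens, score in feedback_docs:
        doc_weight = score / total_score
        for term, count in Counter(tokens).items():
            relevance_model[term] += doc_weight * count / len(tokens)

    # Keep only the strongest feedback terms and renormalize them.
    top_terms = dict(relevance_model.most_common(fb_terms))
    norm = sum(top_terms.values())
    feedback_model = {term: weight / norm for term, weight in top_terms.items()}

    # Original query as a uniform distribution over its terms.
    query_counts = Counter(query_terms)
    query_model = {t: c / len(query_terms) for t, c in query_counts.items()}

    # Interpolate the original query with the feedback model.
    return {
        term: original_weight * query_model.get(term, 0.0)
        + (1 - original_weight) * feedback_model.get(term, 0.0)
        for term in set(query_model) | set(feedback_model)
    }


# Hypothetical first-pass results: (tokens, BM25 score)
feedback = [
    (["health", "queries", "social", "media"], 2.0),
    (["health", "symptom", "search"], 1.0),
]
expanded = rm3_expand(["health", "queries"], feedback)
print(" ".join(f"{term}^{weight:.3f}" for term, weight in sorted(expanded.items())))
```

The interpolation keeps the original query terms dominant, so the expansion terms refine the second retrieval pass rather than replace the user's intent.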

### Step 5: Create the run

In [6]:
print('Step 5: Create Run.')

run = retrieval_pipeline(queries)

Step 5: Create Run.


BR(BM25): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.92q/s]
BR(BM25): 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 21.94q/s]


In [7]:
print('We look at the first 10 results of the run (query has been expanded):\n')
run.head(10)

We look at the first 10 results of the run (query has been expanded):



| | qid | docid | docno | rank | score | query_0 | query |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 27490 | 2011.spire_conference-2011.10 | 0 | 20.478847 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 1 | 1 | 19930 | 2019.cikm_conference-2019.346 | 1 | 17.698716 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 2 | 1 | 23061 | 2010.cikm_conference-2010.284 | 2 | 17.479005 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 3 | 1 | 49659 | 2021.ipm_journal-ir0anthology0volumeA58A1.6 | 3 | 17.344492 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 4 | 1 | 39429 | 2021.tist_journal-ir0anthology0volumeA12A2.4 | 4 | 15.679457 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 5 | 1 | 33172 | 2018.wwwconf_conference-2018.13 | 5 | 15.508869 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 6 | 1 | 31383 | 2014.wwwconf_conference-2014c.211 | 6 | 15.482942 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 7 | 1 | 28878 | 2012.wwwconf_conference-2012c.37 | 7 | 15.478182 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 8 | 1 | 31384 | 2014.wwwconf_conference-2014c.212 | 8 | 15.33639 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
| 9 | 1 | 33009 | 2013.wwwconf_conference-2013c.302 | 9 | 15.053277 | detect health related queries | applypipeline:off social^0.028764594 structur^... |
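The `query` column holds the expanded query as `term^weight` tokens, while `query_0` keeps the original query. A small sketch for reading such a string back into a dictionary follows. The sample string is hypothetical: only `social^0.028764594` is taken from the output above, and the `health` token and its weight are invented for illustration. The sketch also assumes that tokens without `^` (such as `applypipeline:off`) are control tokens that can be skipped.

```python
def parse_weighted_query(query):
    """Turn 'term^weight' tokens into {term: weight}, skipping other tokens."""
    weights = {}
    for token in query.split():
        if "^" not in token:
            continue  # assumed control token, e.g. "applypipeline:off"
        term, _, weight = token.rpartition("^")
        weights[term] = float(weight)
    return weights


# Hypothetical expanded query; the "health" token and its weight are made up.
sample_query = "applypipeline:off social^0.028764594 health^0.2"
term_weights = parse_weighted_query(sample_query)
print(term_weights)
```

Inspecting the weights this way makes it easy to check which expansion terms dominate the second retrieval pass.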


### Step 6: Persist Run

In [8]:
print('Step 6: Persist Run.')

persist_and_normalize_run(run, output_file=output_directory, system_name='BM25-RM3', depth=1000)

print('Done :)')

Step 6: Persist Run.
Done :)
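To illustrate what persisting a run involves, here is a minimal sketch that sorts results per query, cuts them at a depth, and formats them in the common six-column TREC run layout (`qid Q0 docno rank score system`). It is not the code behind `persist_and_normalize_run`; the function name, the 1-based ranks, and the exact column layout are assumptions. The two sample rows reuse docnos and scores from the run shown earlier.

```python
def to_trec_lines(rows, system_name, depth=1000):
    """Format result rows as TREC run lines, keeping at most `depth` hits per query."""
    by_query = {}
    for row in rows:
        by_query.setdefault(row["qid"], []).append(row)

    lines = []
    for qid, hits in by_query.items():
        # Highest score first, then cut off at the requested depth
        ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)[:depth]
        for rank, hit in enumerate(ranked, start=1):
            lines.append(f'{qid} Q0 {hit["docno"]} {rank} {hit["score"]} {system_name}')
    return lines


sample_rows = [
    {"qid": "1", "docno": "2019.cikm_conference-2019.346", "score": 17.698716},
    {"qid": "1", "docno": "2011.spire_conference-2011.10", "score": 20.478847},
]
trec_lines = to_trec_lines(sample_rows, system_name="BM25-RM3", depth=1000)
print("\n".join(trec_lines))
```

Re-ranking from the scores, instead of trusting an existing `rank` column, keeps the output consistent even if the rows arrive unsorted.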
