## DocT5Query Document Expansion
This retrieval system implements the DocT5Query document expansion [tutorial](https://github.com/tira-io/teaching-ir-with-shared-tasks/blob/main/tutorials/tutorial-doc-t5-query.ipynb).
The approach assumes that first-stage retrieval on the corpus already achieves high recall. Our corpus consists of the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/).
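
As background, document expansion of this kind attaches queries that a model predicts for a document to the text that gets indexed, so that a lexical model such as BM25 can match query terms that the original document lacks. The following is a minimal plain-Python sketch of that idea. The helper `expand_document` and the example texts are hypothetical and only for illustration; in this notebook the generated queries are loaded precomputed from TIRA, and Step 5 indexes the generated queries in place of the original text rather than appending them.

```python
def expand_document(text, generated_queries):
    """Append generated queries to the original document text (illustrative helper)."""
    return " ".join([text] + list(generated_queries))


# Made-up example data, not taken from the corpus
doc = "BM25 is a bag-of-words ranking function."
queries = ["what is bm25", "how does bm25 rank documents"]

expanded = expand_document(doc, queries)
print(expanded)
```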

## Step 1. Imports

In [1]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
ensure_pyterrier_is_loaded()
import pandas as pd
import pyterrier as pt
from tqdm import tqdm
from jnius import autoclass
import gzip
import json
import os



PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


## Step 2. Initialize TIRA client

In [2]:
tira = Client()

## Step 3. Dataset setup
The dataset: the union of the IR Anthology and the ACL Anthology

In [3]:
dataset = 'antique-test-20230107-training'
pt_dataset = pt.get_dataset(f'irds:ir-benchmarks/{dataset}')
bm25 = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 Re-Rank (tira-ir-starter-pyterrier)', dataset)
# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

# Retrieve topics from the dataset
#topics=pt_dataset.get_topics('text')


## Directly initialize BM25 model using PyTerrier

In [4]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

## Step 4. Evaluate the BM25 retrieval model to show its recall

In [5]:
pt.Experiment(
    retr_systems=[bm25],
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    names=['BM25'],
    eval_metrics=['recall_1000']
)

| | name | recall_1000 |
|---|---|---|
| 0 | BM25 | 0.788732 |
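
To make the metric concrete: recall at a cutoff is the share of a topic's relevant documents that appear in the top k results, and an experiment reports the mean over all topics. The sketch below computes it for one topic in plain Python. The function name `recall_at_k` and the toy ranking are assumptions for illustration, not part of PyTerrier.

```python
def recall_at_k(ranked_docnos, relevant_docnos, k=1000):
    """Fraction of the relevant documents that occur in the top-k ranking."""
    relevant = set(relevant_docnos)
    if not relevant:
        return 0.0
    found = relevant.intersection(ranked_docnos[:k])
    return len(found) / len(relevant)


# Toy data: four relevant documents, two of them ranked in the top 3
ranking = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2", "d3", "d4"}
print(recall_at_k(ranking, relevant, k=3))
```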


## Step 5. Define function to read DocT5Query expanded documents

In [6]:
def doc_t5_query(dataset_id):
    # Locate the precomputed DocT5Query output for this dataset on TIRA
    docs = tira.get_run_output('ir-benchmarks/seanmacavaney/DocT5Query', dataset_id) + '/documents.jsonl.gz'
    with gzip.open(docs, 'rt') as f:
        for line in tqdm(f):
            doc = json.loads(line)
            # Index the generated queries as the document text and use PyTerrier's 'docno' key
            doc['text'] = doc.pop('querygen')
            doc['docno'] = doc.pop('doc_id')
            yield doc


# Define function to create index from DocT5Query expanded documents
def doc_t5_query_index(dataset_id):
    indexer = pt.IterDictIndexer("/tmp/index2", overwrite=True, meta={'docno': 100, 'text': 20480})
    index_ref = indexer.index(doc_t5_query(dataset_id))
    return pt.IndexFactory.of(index_ref)
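
The reading step can be tried without TIRA. The sketch below writes a tiny gzipped JSONL file to a temporary directory and streams it back with the same field mapping (`doc_id` to `docno`, `querygen` to `text`). The records are made up, the function name `read_expanded_docs` is hypothetical, and unlike the generator above it yields only the two mapped fields.

```python
import gzip
import json
import os
import tempfile


def read_expanded_docs(path):
    """Stream records from a gzipped JSONL file, renaming fields for indexing."""
    with gzip.open(path, "rt") as f:
        for line in f:
            record = json.loads(line)
            yield {"docno": record["doc_id"], "text": record["querygen"]}


# Write two made-up records to a temporary file
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "documents.jsonl.gz")
with gzip.open(path, "wt") as f:
    f.write(json.dumps({"doc_id": "d1", "text": "original one", "querygen": "what is bm25"}) + "\n")
    f.write(json.dumps({"doc_id": "d2", "text": "original two", "querygen": "what is recall"}) + "\n")

docs = list(read_expanded_docs(path))
print(docs)
```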

## Step 6. Create index from expanded documents
We first check whether the expanded index already exists and build it only if necessary.

In [7]:
if not os.path.exists("/tmp/index2/data.properties"):
    print("Creating DocT5Query index...")
    index = doc_t5_query_index(dataset)
else:
    print("Loading existing DocT5Query index...")
    index = pt.IndexFactory.of('/tmp/index2')

Creating DocT5Query index...


4606it [00:02, 4205.45it/s]



403666it [00:32, 12412.39it/s]


19:26:43.315 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 113 empty documents


## Step 7. Retrieve documents using BM25 model

In [8]:
docs_retrieved_by_bm25 = {}
bm25_result = bm25(pt_dataset.get_topics('title'))

for _, i in tqdm(bm25_result.iterrows()):
    qid, docno = str(i['qid']), str(i['docno'])

    if qid not in docs_retrieved_by_bm25:
        docs_retrieved_by_bm25[qid] = set()
    
    docs_retrieved_by_bm25[qid].add(docno)

188633it [00:07, 25078.82it/s]
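
The loop above builds a lookup from each query ID to the set of document IDs that BM25 retrieved. A compact way to express the same grouping is sketched here in plain Python with `collections.defaultdict`. The function `docnos_by_query` and the toy rows are assumptions for illustration; with a results DataFrame, rows of this shape could be obtained via `DataFrame.to_dict("records")`.

```python
from collections import defaultdict


def docnos_by_query(rows):
    """Group document IDs by query ID, normalizing both to strings."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[str(row["qid"])].add(str(row["docno"]))
    return dict(grouped)


# Toy result rows; the last one duplicates the first and is absorbed by the set
rows = [
    {"qid": 1, "docno": "d1"},
    {"qid": 1, "docno": "d2"},
    {"qid": 2, "docno": "d1"},
    {"qid": 1, "docno": "d1"},
]
lookup = docnos_by_query(rows)
print(lookup)
```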


## Step 8. Define a filter that omits documents already retrieved by BM25

In [9]:
def _drop_bm25_hits(df):
    # Keep only rows whose docno BM25 did not retrieve for the same qid; unseen qids are kept as-is
    return df[df.apply(lambda row: str(row['docno']) not in docs_retrieved_by_bm25.get(str(row['qid']), set()), axis=1)]
omit_already_retrieved_docs = pt.apply.generic(_drop_bm25_hits)
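
The filtering logic itself is a per-query set difference. The plain-Python sketch below shows it on toy data, independent of pandas and PyTerrier. The function name `omit_already_retrieved` and the example rows are hypothetical, and a query that is absent from the lookup keeps all of its rows.

```python
def omit_already_retrieved(rows, retrieved_by_query):
    """Drop rows whose docno was already retrieved for the same qid."""
    return [
        row for row in rows
        if str(row["docno"]) not in retrieved_by_query.get(str(row["qid"]), set())
    ]


# Toy lookup of first-stage results and a second result list to filter
retrieved = {"1": {"d1", "d2"}}
candidates = [
    {"qid": "1", "docno": "d1"},  # already retrieved for query 1: dropped
    {"qid": "1", "docno": "d5"},  # new for query 1: kept
    {"qid": "2", "docno": "d1"},  # query 2 has no earlier results: kept
]
remaining = omit_already_retrieved(candidates, retrieved)
print(remaining)
```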

## Step 9. Create BM25 model for DocT5Query index and apply omission filter

In [10]:
bm25_doct5query = pt.BatchRetrieve(index, wmodel="BM25")
bm25_doct5query_new = bm25_doct5query >> omit_already_retrieved_docs

## Step 10. Applying Bo1 Expansion

In [11]:
bo1_expansion = bm25_doct5query_new >> pt.rewrite.Bo1QueryExpansion(index)
# Final retrieval pipeline
bm25_bo1 = bo1_expansion >> bm25
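
For intuition about the expansion step: Bo1 scores candidate expansion terms taken from the top-ranked feedback documents, and the best-scoring terms are added to the query before the final BM25 retrieval. The sketch below implements the Bo1 term weight as it is commonly stated, w(t) = tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n) with P_n = F / N, where tf_x is the term frequency in the feedback documents, F the term frequency in the collection, and N the number of documents. The function name and the example numbers are assumptions for illustration; this is not Terrier's code.

```python
import math


def bo1_weight(tf_feedback, collection_freq, num_docs):
    """Bo1 weight of a candidate expansion term (illustrative implementation)."""
    p_n = collection_freq / num_docs
    return tf_feedback * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


# Made-up statistics: both terms occur 5 times in the feedback documents,
# but one is rare in the collection and the other is very common
rare = bo1_weight(tf_feedback=5, collection_freq=50, num_docs=100_000)
common = bo1_weight(tf_feedback=5, collection_freq=80_000, num_docs=100_000)
print(rare > common)
```

Terms that are frequent in the feedback documents but rare in the collection receive the highest weights, which is why they are preferred as expansion terms.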

## Step 11. Compare the Bo1 pipeline, BM25 on the DocT5Query index, and BM25 on the DocT5Query index without the documents already retrieved by BM25

In [12]:
pt.Experiment(
    retr_systems=[bm25_bo1, bm25_doct5query, bm25_doct5query_new],
    topics=pt_dataset.get_topics('text'),
    qrels=pt_dataset.get_qrels(),
    names=['BM25_Bo1', 'DocT5Query >> BM25', 'DocT5Query w.o. BM25 >> BM25'],
    eval_metrics=['recall_1000','ndcg_cut_5', 'ndcg_cut.10', 'recip_rank']
)

| | name | recall_1000 | ndcg_cut_5 | ndcg_cut.10 | recip_rank |
|---|---|---|---|---|---|
| 0 | BM25_Bo1 | 0.776453 | 0.493561 | 0.467125 | 0.912156 |
| 1 | DocT5Query >> BM25 | 0.534685 | 0.399011 | 0.348678 | 0.793546 |
| 2 | DocT5Query w.o. BM25 >> BM25 | 0.019399 | 0.015763 | 0.01267 | 0.056205 |
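
The ranking metrics in this table can be illustrated for a single topic. The sketch below computes the reciprocal rank and nDCG at a cutoff, assuming linear gains and a log2(rank + 1) discount; evaluation tools can differ in details such as the gain function. The function names and the toy judgments are hypothetical.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gains, k):
    """nDCG@k with linear gains and a log2(rank + 1) discount."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1) for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy judgments: d1 is highly relevant, d3 is relevant, d2 is unjudged
gains = {"d1": 2, "d3": 1}
ranked = ["d2", "d1", "d3"]
rr = reciprocal_rank(ranked, {d for d, g in gains.items() if g > 0})
ndcg = ndcg_at_k(ranked, gains, k=3)
print(rr, round(ndcg, 4))
```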


In [13]:
# Create and run the retrieval
#topics = pt_dataset.get_topics('text')
#print('Create run')
#run = bm25(topics)
#print('Done, run was created')



In [14]:
# Persist and normalize the run
#persist_and_normalize_run(run, system_name = 'doc_T5_Query', default_output='../runs')

# Diagnostic: Check the first few rows of the run
#print(run.head())
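
The commented-out cells above delegate persistence to TIRA. As a rough illustration of what a TREC-style run file contains (an assumption about the common `qid Q0 docno rank score tag` layout, not a description of what `persist_and_normalize_run` does internally), the following sketch formats toy results as run lines. The function `to_trec_run_lines` and the data are hypothetical.

```python
def to_trec_run_lines(rows, system_name):
    """Format result rows as TREC-style run lines, ranked by descending score per query."""
    by_query = {}
    for row in rows:
        by_query.setdefault(str(row["qid"]), []).append(row)

    lines = []
    for qid, hits in by_query.items():
        ordered = sorted(hits, key=lambda hit: hit["score"], reverse=True)
        for rank, hit in enumerate(ordered, start=1):
            lines.append(f"{qid} Q0 {hit['docno']} {rank} {hit['score']} {system_name}")
    return lines


# Toy results for one query
rows = [
    {"qid": 1, "docno": "d1", "score": 1.2},
    {"qid": 1, "docno": "d2", "score": 3.5},
]
run_lines = to_trec_run_lines(rows, "demo")
print("\n".join(run_lines))
```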