# IR Lab SoSe 2024: Combined Retrieval System with Query Segmentation

This Jupyter notebook implements a retrieval system that combines BM25, query expansion, query segmentation, and score fusion with additional retrieval models.
We use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the IR Anthology and the ACL Anthology). The notebook takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We use TIRA, a shared task platform for information retrieval, to load the (pre-built) retrieval index, and ir_datasets to load the corpus and topics. The retrieval system itself is built with PyTerrier, an open-source framework for information retrieval experiments. NLTK provides the tokenization for query segmentation.

In [None]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
import pandas as pd
import os
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

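# Download the tokenizer data used by word_tokenize.
# Newer NLTK releases look for 'punkt_tab' instead of 'punkt', so request both.
nltk.download('punkt')
nltk.download('punkt_tab')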

# Make sure PyTerrier is initialized before it is used.
ensure_pyterrier_is_loaded()

# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
tira = Client()

### Step 2: Load the Dataset and the Index

In [None]:
# The dataset: the union of the IR Anthology and the ACL Anthology
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

### Step 3: Define Query Segmentation and the Retrieval Pipeline

In [None]:
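def query_segmentation(query, max_segment_length=3):
    """Return all word n-grams of the query (n = 1..max_segment_length), joined by spaces."""
    tokens = word_tokenize(query)
    segments = []
    for n in range(1, min(max_segment_length, len(tokens)) + 1):
        segments.extend([' '.join(gram) for gram in ngrams(tokens, n)])
    return ' '.join(segments)

class QuerySegmentationTransformer(pt.Transformer):
    """Rewrites the 'query' column of the topics with query_segmentation."""
    def transform(self, topics):
        # Work on a copy so the caller's topics are not modified in place;
        # other branches of a pipeline then still see the original query.
        topics = topics.copy()
        topics['query'] = topics['query'].apply(query_segmentation)
        return topics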

# Base retrieval model with BM25
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# Query expansion with Bo1
bo1_expansion = pt.rewrite.Bo1QueryExpansion(index, fb_docs=10, fb_terms=20)

# Additional retrieval models whose scores are fused with the BM25 branch
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")
dirichletLM = pt.BatchRetrieve(index, wmodel="DirichletLM")

# Combined pipeline: segmented query -> BM25 -> Bo1 expansion -> BM25,
# linearly combined with TF-IDF and DirichletLM scores (each weighted by 2)
combined_pipeline = (
    QuerySegmentationTransformer() >>
    bm25 >>
    bo1_expansion >>
    bm25
) + 2 * tf_idf + 2 * dirichletLM
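
To make the effect of `query_segmentation` concrete, here is a minimal sketch that reproduces the n-gram expansion with a plain whitespace tokenizer instead of NLTK. The swap is an assumption made so that the snippet runs without downloading tokenizer data, and `segment_query_sketch` is a hypothetical name. The sketch shows that joining all n-grams up to length 3 mainly repeats the original terms. The joined string contains no phrase operators, so a bag-of-words retriever will most likely see the same terms with different repetition counts.

```python
from collections import Counter

def segment_query_sketch(query, max_segment_length=3):
    """Whitespace-tokenized stand-in for query_segmentation (illustrative only)."""
    tokens = query.split()
    segments = []
    for n in range(1, min(max_segment_length, len(tokens)) + 1):
        # Slide a window of n tokens over the query
        for start in range(len(tokens) - n + 1):
            segments.append(' '.join(tokens[start:start + n]))
    return ' '.join(segments)

expanded = segment_query_sketch('neural query expansion')
print(expanded)

# Count how often each original term occurs in the expanded query
term_counts = Counter(expanded.split())
print(term_counts)
```

In this toy query, the middle term ends up with more repetitions than the two edge terms, because it is covered by more n-gram windows.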

### Step 4: Create the Run

In [None]:
print('First, we have a short look at the first three topics:')
print(pt_dataset.get_topics('text').head(3))

print('Now we do the retrieval...')
run = combined_pipeline.transform(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
print(run.head(10))
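
The `score` column of the run is the result of the weighted combination in the pipeline definition. The following pure-Python sketch illustrates that idea on made-up scores. It assumes that PyTerrier's `+` and scalar `*` operators act as a linear combination of scores matched per query and document, and that a document missing from one result list contributes a score of 0; this is worth checking against the PyTerrier operator documentation. The function `fuse_scores` and all score values are hypothetical.

```python
def fuse_scores(weighted_runs):
    """Weighted sum of scores per document; a missing score counts as 0."""
    fused = {}
    for weight, scores in weighted_runs:
        for docno, score in scores.items():
            fused[docno] = fused.get(docno, 0.0) + weight * score
    # Sort by descending fused score; ties are broken by docno for determinism
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

# Toy scores for a single query (not real retrieval results)
bm25_bo1_scores = {'d1': 12.0, 'd2': 9.0}
tf_idf_scores = {'d1': 6.0, 'd3': 5.0}
dirichlet_scores = {'d2': 2.0, 'd3': 1.0}

ranking = fuse_scores([
    (1, bm25_bo1_scores),
    (2, tf_idf_scores),
    (2, dirichlet_scores),
])
for rank, (docno, score) in enumerate(ranking):
    print(rank, docno, score)
```

Because the three models produce scores on different scales, the weights interact with those scales, so the weight of 2 does not necessarily mean "twice as important".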

### Step 5: Persist the run file for subsequent evaluations

In [None]:
# Create the 'runs' directory if it doesn't exist
os.makedirs('../runs', exist_ok=True)

persist_and_normalize_run(run, system_name='combined-bm25-bo1-tfidf-dirichlet-querysegmentation', default_output='../runs')
print('Run file is stored under "../runs/run.txt".')
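
For a quick sanity check of the stored file, the sketch below parses a single line of a six-column, whitespace-separated TREC-style run file (query id, an unused column, document id, rank, score, system tag). It assumes that `persist_and_normalize_run` writes this layout. The sample line, the document id `doc-1`, and the helper `parse_run_line` are made up for illustration.

```python
from typing import NamedTuple

class RunEntry(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: float
    tag: str

def parse_run_line(line):
    """Parse one whitespace-separated line of a TREC-style run file."""
    qid, _unused, docno, rank, score, tag = line.split()
    return RunEntry(qid, docno, int(rank), float(score), tag)

# Hypothetical example line, not taken from an actual run
sample = '1 0 doc-1 1 24.5 combined-bm25-bo1-tfidf-dirichlet-querysegmentation'
entry = parse_run_line(sample)
print(entry)
```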