# IR Lab SoSe 2024: Combined Retrieval System with Query Segmentation

This Jupyter notebook implements a retrieval system that combines BM25, query expansion, query segmentation, and additional retrieval models whose scores are fused into one ranking.
We will use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). The notebook takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please run evaluations in a separate, dedicated notebook.

### Step 1: Import Libraries

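We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to access the corpus and topics. The retrieval system itself is built with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source Python framework for information retrieval experiments on top of the Terrier search engine. We'll also use NLTK for query segmentation, so the `nltk` package must be installed in the notebook environment (e.g., via `pip install nltk`).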

In [1]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
import pandas as pd
import os
import nltk
from nltk.tokenize import word_tokenize
from nltk.chunk import ne_chunk
from nltk.tag import pos_tag

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()


### Step 2: Load the Dataset and the Index

In [None]:
# The dataset: the union of the IR Anthology and the ACL Anthology
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

### Step 3: Define Query Segmentation Function and the Retrieval Pipeline
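
Before the full cell, here is a minimal sketch of the grouping rule that the segmentation function applies: consecutive nouns and adjectives are merged into one quoted phrase. To keep the sketch runnable without NLTK, it works on hand-written POS tags (an assumption for illustration; in the notebook the tags come from `nltk.pos_tag`), and it leaves out the named-entity step. The helper name `group_tagged_tokens` is hypothetical.

```python
def group_tagged_tokens(tagged_tokens):
    """Group consecutive noun/adjective tokens into quoted phrases."""
    segments, current = [], []
    for word, pos in tagged_tokens:
        if pos.startswith(("NN", "JJ")):
            # Nouns and adjectives extend the current phrase
            current.append(word)
        elif current:
            # Any other tag closes the current phrase
            segments.append(" ".join(current))
            current = []
    if current:
        segments.append(" ".join(current))
    return " ".join(f'"{segment}"' for segment in segments)

# Hand-written tags for illustration only
example = [
    ("neural", "JJ"), ("ranking", "NN"), ("models", "NNS"),
    ("for", "IN"), ("web", "NN"), ("search", "NN"),
]
segmented = group_tagged_tokens(example)
print(segmented)  # "neural ranking models" "web search"
```

Words with other tags (here the preposition "for") only act as phrase boundaries and are dropped from the segmented query.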

In [None]:
def segment_query(query):
    # Tokenize the query
    tokens = word_tokenize(query)

    # Perform POS tagging
    tagged = pos_tag(tokens)

    # Perform named entity recognition
    chunked = ne_chunk(tagged)

    # Extract named entities and runs of consecutive nouns/adjectives
    segments = []
    current_segment = []
    for subtree in chunked:
        if isinstance(subtree, nltk.Tree):
            # Flush the pending noun/adjective run first so that
            # segments keep the word order of the original query
            if current_segment:
                segments.append(' '.join(current_segment))
                current_segment = []
            segments.append(' '.join(word for word, pos in subtree.leaves()))
        else:
            word, pos = subtree
            if pos.startswith('NN') or pos.startswith('JJ'):
                current_segment.append(word)
            elif current_segment:
                segments.append(' '.join(current_segment))
                current_segment = []
    if current_segment:
        segments.append(' '.join(current_segment))

    # Fall back to the original query if no segment was found,
    # so that we never send an empty query to the retriever
    if not segments:
        return query

    # Join segments with quotes for exact phrase matching
    segmented_query = ' '.join(f'"{segment}"' for segment in segments)

    return segmented_query

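# Query segmentation as a PyTerrier transformer.
# pt.apply.query passes each topic row (not the bare query string) to the function.
query_segmentation = pt.apply.query(lambda row: segment_query(row['query']))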

# Base retrieval model with BM25
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# Query expansion with Bo1
bo1_expansion = pt.rewrite.Bo1QueryExpansion(index, fb_docs=10, fb_terms=20)
bm25_bo1 = bm25 >> bo1_expansion >> bm25

# Additional retrieval models (full retrievers, not rerankers) whose scores are fused below
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")
dirichletLM = pt.BatchRetrieve(index, wmodel="DirichletLM")

# Combined retrieval pipeline with query segmentation:
# a weighted sum of the scores of the three retrieval models
combined_pipeline = query_segmentation >> (bm25_bo1 + 2 * tf_idf + 2 * dirichletLM)
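
The `+` and `*` operators above fuse the three result lists by a weighted sum of their scores. The following plain-Python sketch illustrates the idea on made-up scores. It assumes that documents are matched on `(qid, docno)` and that a document missing from one list contributes a score of 0 there, which is our understanding of PyTerrier's behavior and not verified here. The function name `weighted_sum_fusion` and all scores are hypothetical.

```python
from collections import defaultdict

def weighted_sum_fusion(runs, weights):
    """Fuse runs given as {(qid, docno): score} dicts by a weighted score sum."""
    fused = defaultdict(float)
    for run, weight in zip(runs, weights):
        for key, score in run.items():
            # A document absent from a run simply adds nothing for that run
            fused[key] += weight * score
    return dict(fused)

# Made-up scores for a single topic, for illustration only
bm25_bo1_run = {("q1", "d1"): 10.0, ("q1", "d2"): 8.0}
tf_idf_run = {("q1", "d1"): 3.0, ("q1", "d3"): 2.0}
dirichlet_run = {("q1", "d2"): 1.5}

fused = weighted_sum_fusion(
    [bm25_bo1_run, tf_idf_run, dirichlet_run],
    [1.0, 2.0, 2.0],  # same weights as in the pipeline above
)
ranking = sorted(fused.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Because the three weighting models produce scores on different scales, the weights interact with those scales; it may be worth checking in the evaluation notebook whether normalizing scores before fusion changes effectiveness.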

### Step 4: Create the Run

In [None]:
print('First, we have a short look at the first three topics:')
print(pt_dataset.get_topics('text').head(3))

print('Now we do the retrieval...')
run = combined_pipeline.transform(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
print(run.head(10))

### Step 5: Persist the run file for subsequent evaluations

In [None]:
# Create the 'runs' directory if it doesn't exist
os.makedirs('../runs', exist_ok=True)

persist_and_normalize_run(run, system_name='combined-bm25-bo1-tfidf-dirichlet-with-segmentation', default_output='../runs')
print('Run file is stored under "../runs/run.txt".')
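
For the evaluation notebook, the stored run can be read back line by line. The sketch below assumes that the run file uses the common six-column TREC run format (`qid Q0 docno rank score system`); we have not verified the exact output of `persist_and_normalize_run` here. The helper `parse_trec_run_line` and the sample line are hypothetical.

```python
def parse_trec_run_line(line):
    """Parse one line of a TREC-style run file into a dict (assumed format)."""
    qid, _q0, docno, rank, score, system = line.split()
    return {
        "qid": qid,
        "docno": docno,
        "rank": int(rank),
        "score": float(score),
        "system": system,
    }

# Made-up line for illustration; real lines come from the stored run file
sample_line = "1 Q0 doc-a 1 16.0 combined-system"
entry = parse_trec_run_line(sample_line)
print(entry["qid"], entry["docno"], entry["score"])
```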