# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system: it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new, dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to access the data. On top of these, we build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [1]:
# You only need to execute this cell if you are using Google Colab.
!pip3 install tira ir-datasets python-terrier


In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt



In [3]:
# Create a REST client
client = Client(base_url='http://localhost:12345')

In [4]:
# Ensure PyTerrier is loaded
ensure_pyterrier_is_loaded()
# Only initialize PyTerrier if it has not been started yet;
# calling pt.init() a second time raises a RuntimeError.
if not pt.started():
    pt.init()

PyTerrier 0.10.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.

In [None]:
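# Load the dataset
dataset = pt.get_dataset('irds:antique/train')
# Datasets loaded via ir_datasets ('irds:...') provide no pre-built Terrier index,
# so we build one from the corpus iterator ('./index' is an assumed output path).
indexer = pt.IterDictIndexer('./index', overwrite=True)
index = indexer.index(dataset.get_corpus_iter())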

In [None]:
# Initialize BM25 model
bm25 = pt.BatchRetrieve(index, wmodel='BM25')
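
To make the weighting model less of a black box, here is a minimal pure-Python sketch of BM25 scoring. It uses a common textbook form of the formula (with a smoothed idf), which may differ in detail from what Terrier implements. The function name `bm25_scores`, the toy documents, and the defaults `k1=1.2` and `b=0.75` are assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: b controls how strongly long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (hypothetical data, not taken from the actual collection).
docs = [
    "neural ranking models for retrieval".split(),
    "query expansion improves retrieval recall".split(),
    "parsing with neural networks".split(),
]
scores = bm25_scores("query expansion".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Varying `k1` (term-frequency saturation) and `b` (length normalization) in such a sketch gives a feel for what parameter tuning changes in the ranking.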

In [None]:
# Perform query expansion using Bo1 (a Divergence-from-Randomness term weighting model, not Rocchio)
qe = pt.rewrite.Bo1QueryExpansion(index)
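
The idea behind Bo1 can be sketched in a few lines: terms from the top-ranked (pseudo-relevant) documents are weighted by how much more frequent they are in the feedback set than expected from the whole collection. The sketch below assumes the usual Bo1 weight `tf_x * log2((1 + P_n) / P_n) + log2(1 + P_n)` with `P_n = F / N`. The function name `bo1_expansion_terms` and all statistics are hypothetical, and this is not PyTerrier's actual implementation.

```python
import math
from collections import Counter


def bo1_expansion_terms(feedback_docs, collection_freq, n_docs, n_terms=10):
    """Rank candidate expansion terms from tokenized feedback documents with Bo1."""
    # tf_x: frequency of each term within the pseudo-relevant feedback documents.
    tf_x = Counter(term for doc in feedback_docs for term in doc)
    weights = {}
    for term, tf in tf_x.items():
        # P_n: collection frequency of the term divided by the number of documents.
        p_n = collection_freq[term] / n_docs
        weights[term] = tf * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:n_terms]


# Hypothetical feedback documents and collection statistics.
feedback = [
    ["query", "expansion", "retrieval"],
    ["expansion", "feedback", "retrieval"],
]
collection_freq = {"query": 500, "expansion": 50, "retrieval": 2000, "feedback": 100}
top_terms = bo1_expansion_terms(feedback, collection_freq, n_docs=1000)
print([term for term, _ in top_terms])
```

Rare terms that recur in the feedback documents (here `expansion`) receive the highest weights, while terms that are frequent across the whole collection (here `retrieval`) are discounted.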

In [None]:
# Retrieve with BM25, expand each query from the top-ranked results, then retrieve again with the expanded query
pipeline = bm25 >> qe >> bm25

In [None]:
# Perform retrieval
topics = dataset.get_topics('text')
expanded_run = pipeline(topics)

In [None]:
# Split each expanded query string into whitespace-separated tokens.
# This is only a mock-up of query segmentation: it runs after retrieval, so it does not change the ranking.
def segment_query(query):
    return query.split()

expanded_run['query'] = expanded_run['query'].apply(segment_query)
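
Whitespace splitting treats every word as its own segment. A slightly more realistic sketch, assuming a small hand-made phrase dictionary, groups adjacent words into known multi-word phrases by greedy longest match. The function name `segment_with_phrases`, the `max_len` parameter, and the phrase set are hypothetical and only illustrate the idea.

```python
def segment_with_phrases(query, phrases, max_len=3):
    """Greedy longest-match segmentation of a query against a set of known phrases."""
    tokens = query.lower().split()
    segments = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first and fall back to a single token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in phrases:
                segments.append(candidate)
                i += n
                break
    return segments


# Hypothetical phrase dictionary; in practice it could be mined from the corpus.
known_phrases = {"query expansion", "natural language processing"}
segments = segment_with_phrases(
    "query expansion for natural language processing", known_phrases
)
print(segments)
```

To influence the ranking, such segments would have to be applied to the topics before retrieval rather than to the finished run.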

In [None]:
print('Retrieval with query expansion and segmentation is done.')
print('Here are the first 10 entries of the expanded and segmented run:')
print(expanded_run.head(10))

### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be statistically evaluated, ideally in a separate notebook.
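
For orientation, the sketch below writes a ranking in the widely used six-column TREC run format (`qid Q0 docno rank score tag`). It assumes the persisted run follows this layout and does not reproduce the normalization that `persist_and_normalize_run` performs. The helper name `write_trec_run` and the example ranking are hypothetical.

```python
import io


def write_trec_run(ranked, system_name, out):
    """Write rankings as TREC run lines: qid Q0 docno rank score tag.

    ranked maps each query id to a list of (docno, score) pairs.
    """
    for qid, hits in ranked.items():
        # Sort by descending score so that rank 1 is the best-scoring document.
        ordered = sorted(hits, key=lambda hit: hit[1], reverse=True)
        for rank, (docno, score) in enumerate(ordered, start=1):
            out.write(f"{qid} Q0 {docno} {rank} {score:.4f} {system_name}\n")


# Hypothetical ranking for a single topic, written to an in-memory buffer.
buffer = io.StringIO()
write_trec_run({"1": [("doc-b", 7.5), ("doc-a", 9.25)]}, "bm25-qe-segmented", buffer)
run_text = buffer.getvalue()
print(run_text)
```

Evaluation tools typically read the query id, document id, and score or rank columns; the final column only names the system that produced the run.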

In [None]:
persist_and_normalize_run(expanded_run, system_name='bm25-qe-segmented', default_output='../runs')

In [None]:
# The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
# Done. Run file is stored under "../runs/run.txt".