# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.
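
For orientation, a run file in the common TREC format holds one line per retrieved document: query id, the literal `Q0`, document id, rank, score, and a system tag. The following is a minimal sketch of that layout in plain Python. The helper name `format_run_lines` and the sample ids are made up for illustration; in this notebook the actual file is written by `persist_and_normalize_run`.

```python
def format_run_lines(results, system_name):
    """Turn {qid: [(docno, score), ...]} into TREC-style run lines.

    Hypothetical helper, for illustration only.
    """
    lines = []
    for qid, scored_docs in results.items():
        # Order by descending score; ranks start at 1 in this sketch.
        ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines


# Made-up example data.
example_results = {"1": [("doc-b", 7.5), ("doc-a", 9.25)]}
run_lines = format_run_lines(example_results, "bm25-baseline")
for line in run_lines:
    print(line)
```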

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_datasets](https://ir-datasets.com/) to subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other ways include reformulating queries, tuning parameters, or building better retrieval pipelines.
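
To make "tuning parameters" concrete: BM25 has two commonly tuned parameters, `k1` (term-frequency saturation) and `b` (document-length normalization). The sketch below implements one textbook variant of BM25 in plain Python on a toy corpus. It is not Terrier's implementation; the function name `bm25_score`, the toy documents, and the specific IDF formula are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query (textbook BM25 variant)."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0 or tf[term] == 0:
            continue
        # Smoothed IDF that stays non-negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization: b=0 ignores document length entirely.
        norm = 1 - b + b * len(doc_terms) / avg_len
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
    return score


# Toy corpus (made up): the second document is long and repeats the query term.
corpus = [
    "neural ranking models for retrieval".split(),
    "retrieval retrieval retrieval evaluation with many extra filler words here".split(),
    "parsing natural language".split(),
]
query = ["retrieval"]

for b_value in (0.0, 0.75):
    scores = [round(bm25_score(query, doc, corpus, b=b_value), 3) for doc in corpus]
    print(f"b={b_value}: {scores}")
```

With `b=0.0` the long second document benefits fully from its repeated term; raising `b` penalizes its length, which narrows the gap to the short first document.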

In [1]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier


[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: python3.11 -m pip install --upgrade pip


In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
import pyterrier as pt
import pandas as pd

# Ensure PyTerrier is loaded
ensure_pyterrier_is_loaded()

# Load the dataset ('irds:dataset-name' is a placeholder; replace it with the actual ir_datasets ID)
pt_dataset = pt.get_dataset('irds:dataset-name')

# Load the pre-built index ('/path/to/index' is a placeholder; loading fails until it points to a real index)
index = pt.IndexFactory.of('/path/to/index')

PyTerrier 0.10.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


JavaException: JVM exception occurred: No IndexLoaders were supported for indexref /path/to/index; It may be your ref has the wrong location. Alternatively, Terrier is misconfigured - did you import the correct package to deal with this indexref? java.lang.UnsupportedOperationException

In [3]:

# Load the index (assuming the index has already been built and is available)
index_path = '/path/to/actual/index'  # Replace with the correct path to your index
index = pt.IndexFactory.of(index_path)


JavaException: JVM exception occurred: No IndexLoaders were supported for indexref /path/to/actual/index; It may be your ref has the wrong location. Alternatively, Terrier is misconfigured - did you import the correct package to deal with this indexref? java.lang.UnsupportedOperationException
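
The `No IndexLoaders were supported` error above is what Terrier reports when the index reference does not point at a loadable index, which is expected here because `/path/to/actual/index` is still a placeholder. A small pre-check can make this failure easier to read. The sketch below assumes an on-disk Terrier index directory that contains a `data.properties` file; the helper name `looks_like_terrier_index` is hypothetical.

```python
import os


def looks_like_terrier_index(index_path):
    """Return True if index_path is a directory holding a data.properties file.

    Hypothetical helper; assumes the usual on-disk Terrier index layout.
    """
    return os.path.isfile(os.path.join(index_path, "data.properties"))


index_path = "/path/to/actual/index"  # placeholder from the cell above
if not looks_like_terrier_index(index_path):
    print(f"No Terrier index found at {index_path!r}; check the path before loading it.")
```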

### Step 2: Perform Retrieval with BM25

In [None]:
# Define the BM25 retrieval component
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# Perform BM25 retrieval
print('Now we do the retrieval with BM25...')
run_bm25 = bm25(pt_dataset.get_topics('text'))

# Display the first 10 entries of the BM25 run
print('Done. Here are the first 10 entries of the BM25 run')
print(run_bm25.head(10))

# Persist the BM25 run file for subsequent evaluations
persist_and_normalize_run(run_bm25, system_name='bm25-baseline', default_output='../runs')

### Step 3: Perform Retrieval with BM25 and TF-IDF

In [None]:
# Define the TF-IDF retrieval component
tfidf = pt.BatchRetrieve(index, wmodel="TF_IDF")

# Chain the components: BM25 retrieves candidates, then TF-IDF re-scores (re-ranks) them
pipeline = bm25 >> tfidf

# Perform retrieval with BM25 and TF-IDF
print('Now we do the retrieval with BM25 and TF-IDF...')
run_bm25_tfidf = pipeline(pt_dataset.get_topics('text'))

# Display the first 10 entries of the BM25 + TF-IDF run
print('Done. Here are the first 10 entries of the BM25 + TF-IDF run')
print(run_bm25_tfidf.head(10))

# Persist the BM25 + TF-IDF run file for subsequent evaluations
persist_and_normalize_run(run_bm25_tfidf, system_name='bm25-tfidf-baseline', default_output='../runs')
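
The `>>` operator chains two stages: the output of the left component becomes the input of the right one, so the second stage only reorders what the first stage found. The following plain-Python sketch illustrates that retrieve-then-rerank idea with toy data. It does not use PyTerrier; the function name `retrieve_then_rerank` and all ids and scores are made up for illustration.

```python
def retrieve_then_rerank(query, first_stage, second_stage, k=3):
    """Take the top-k candidates of first_stage and reorder them by second_stage scores."""
    candidates = first_stage(query)[:k]
    rescored = [(doc_id, second_stage(query, doc_id)) for doc_id in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for two scoring models (made-up ids and scores).
first_stage_ranking = ["d1", "d2", "d3", "d4"]
second_stage_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.5, "d4": 1.0}

reranked = retrieve_then_rerank(
    "example query",
    first_stage=lambda query: first_stage_ranking,
    second_stage=lambda query, doc_id: second_stage_scores[doc_id],
)
print(reranked)
```

Note that `d4` never appears in the result even though the second stage would score it highest: a re-ranker cannot recover documents that the first stage cut off.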

### Step 4: Compare the Results

In [None]:
# Compare the results
comparison = run_bm25.set_index(['qid', 'docid']).join(run_bm25_tfidf.set_index(['qid', 'docid']), lsuffix='_bm25', rsuffix='_bm25_tfidf')
print('Comparison of BM25 and BM25 + TF-IDF results:')
print(comparison.head(10))

# Save comparison to a file
comparison.to_csv('../runs/comparison_bm25_vs_bm25_tfidf.csv')
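
Beyond joining the two result frames, a quick numeric summary can show how much the two runs differ. The following sketch computes the overlap of the top-k documents per query in plain Python. The function name `topk_overlap` and the toy rankings are illustrative assumptions; with real runs, the ranked lists would be taken from `run_bm25` and `run_bm25_tfidf`.

```python
def topk_overlap(ranking_a, ranking_b, k=10):
    """Fraction of shared documents among the top-k of two ranked lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty rankings are treated as identical
    return len(top_a & top_b) / max(len(top_a), len(top_b))


# Toy rankings (document ids in rank order) for two hypothetical runs.
run_a = {"1": ["d1", "d2", "d3", "d4"], "2": ["d9", "d8"]}
run_b = {"1": ["d2", "d1", "d5", "d3"], "2": ["d7", "d9"]}

overlaps = {qid: topk_overlap(run_a[qid], run_b[qid], k=3) for qid in run_a}
print(overlaps)
```

An overlap close to 1.0 for most queries would suggest that the second run mostly reorders the same documents rather than surfacing different ones.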