# IR Lab SoSe 2024: Combined Retrieval System

This Jupyter notebook implements a retrieval system that combines BM25, Bo1 query expansion, and two additional retrieval models (TF-IDF and a Dirichlet-smoothed language model) whose scores are combined linearly.
We use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). The notebook takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new, dedicated notebook.

### Step 1: Import Libraries

We use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the pre-built retrieval index, [ir_datasets](https://ir-datasets.com/) to access the corpus and topics, and [PyTerrier](https://github.com/terrier-org/pyterrier), a Python framework built on the open-source Terrier search engine, to build the retrieval system.

In [1]:
!pip3 install tira ir-datasets python-terrier transformers torch nltk

Collecting tira
  Using cached tira-0.0.134-py3-none-any.whl.metadata (4.6 kB)
Collecting ir-datasets
  Using cached ir_datasets-0.5.8-py3-none-any.whl.metadata (12 kB)
Collecting python-terrier
  Using cached python_terrier-0.10.1-py3-none-any.whl
Collecting transformers
  Using cached transformers-4.42.3-py3-none-any.whl.metadata (43 kB)
Collecting docker==7.*,>=7.1.0 (from tira)
  Using cached docker-7.1.0-py3-none-any.whl.metadata (3.8 kB)
Collecting inscriptis>=2.2.0 (from ir-datasets)
  Using cached inscriptis-2.5.0-py3-none-any.whl.metadata (25 kB)
Collecting statsmodels (from python-terrier)
  Using cached statsmodels-0.14.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (9.2 kB)
Collecting ir-measures>=0.3.1 (from python-terrier)
  Using cached ir_measures-0.3.3-py3-none-any.whl
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers)
  Using cached huggingface_hub-0.23.4-py3-none-any.whl.metadata (12 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers)
  Using cached toke

In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
import pandas as pd
import os

# Start PyTerrier if it is not already running.
ensure_pyterrier_is_loaded()

# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
tira = Client()

PyTerrier 0.10.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the Dataset and the Index

In [3]:
# The dataset: the union of the IR Anthology and the ACL Anthology
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)

### Step 3: Define the Retrieval Pipeline
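
The pipeline in this step expands each query with Bo1 before a second BM25 pass. As a rough illustration of what that expansion does, the sketch below scores candidate terms with the Bo1 (Bose-Einstein 1) weight as it is commonly stated in the divergence-from-randomness literature and keeps the `fb_terms` highest-weighted ones. The function names and all frequencies are made up for illustration; this is not Terrier's code, and steps the real implementation may perform (such as normalizing the weights and merging them with the original query terms) are omitted.

```python
import math


def bo1_weight(tf_x, coll_freq, num_docs):
    """Bo1 weight of a candidate expansion term.

    tf_x:      frequency of the term in the feedback (top-ranked) documents
    coll_freq: frequency of the term in the whole collection
    num_docs:  number of documents in the collection
    """
    p_n = coll_freq / num_docs
    return tf_x * math.log2((1 + p_n) / p_n) + math.log2(1 + p_n)


def top_expansion_terms(feedback_tf, coll_freqs, num_docs, fb_terms):
    """Rank candidate terms by Bo1 weight and keep the best fb_terms."""
    weights = {
        term: bo1_weight(tf, coll_freqs[term], num_docs)
        for term, tf in feedback_tf.items()
    }
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:fb_terms]


# Hypothetical statistics: term -> frequency in the feedback documents / in the collection.
feedback_tf = {"retrieval": 8, "the": 40, "reranking": 3}
coll_freqs = {"retrieval": 5000, "the": 900000, "reranking": 200}
num_docs = 100000

top_terms = top_expansion_terms(feedback_tf, coll_freqs, num_docs, fb_terms=2)
for term, weight in top_terms:
    print(term, round(weight, 2))
```

In this toy setting the very frequent word "the" receives a low weight despite occurring often in the feedback documents, because it is just as common in the rest of the collection.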

In [4]:
# Base retrieval model with BM25
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# Query expansion with Bo1
bo1_expansion = pt.rewrite.Bo1QueryExpansion(index, fb_docs=10, fb_terms=20)
bm25_bo1 = bm25 >> bo1_expansion >> bm25

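# Additional retrieval models (each retrieves directly from the index; neither reranks BM25 results)
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")
dirichletLM = pt.BatchRetrieve(index, wmodel="DirichletLM")

# Combined retrieval pipeline: weighted sum of the raw (unnormalized) scores of the three systems
combined_pipeline = bm25_bo1 + 2 * tf_idf + 2 * dirichletLM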
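
To make the combination step concrete, the sketch below mimics in plain Python what a weighted linear combination of result lists does: for every query, the weighted scores of a document are summed across the systems, and a document that a system did not retrieve contributes nothing for that system. The helper `linear_fusion` and the toy scores are illustrative assumptions, not PyTerrier code; it is assumed here that PyTerrier's `+` and `*` operators treat missing documents this way.

```python
from collections import defaultdict


def linear_fusion(runs, weights):
    """Sum weighted scores per (qid, docno) over several result lists.

    runs:    list of dicts mapping (qid, docno) -> score
    weights: one multiplier per run, in the same order
    """
    fused = defaultdict(float)
    for run, weight in zip(runs, weights):
        for key, score in run.items():
            fused[key] += weight * score
    return dict(fused)


# Toy scores, made up for illustration only.
bm25_bo1_run = {("1", "docA"): 12.0, ("1", "docB"): 9.0}
tf_idf_run = {("1", "docA"): 6.0, ("1", "docB"): 5.0}
dirichlet_run = {("1", "docA"): 1.5, ("1", "docC"): 1.25}

fused = linear_fusion([bm25_bo1_run, tf_idf_run, dirichlet_run], [1, 2, 2])
for (qid, docno), score in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(qid, docno, score)
```

Here `docC` is found only by the Dirichlet model, so its fused score stays far below that of documents found by all three systems. Because the raw scores of the three models are on different scales, the weights `1`, `2`, `2` do not translate directly into relative influence.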

### Step 4: Create the Run

In [5]:
print('First, we have a short look at the first three topics:')
print(pt_dataset.get_topics('text').head(3))

print('Now we do the retrieval...')
run = combined_pipeline.transform(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
print(run.head(10))

First, we have a short look at the first three topics:
  qid                                     query
0   1  retrieval system improving effectiveness
1   2  machine learning language identification
2   3             social media detect self harm
Now we do the retrieval...
Done. Here are the first 10 entries of the run
  qid     docid                                        docno      score  \
0   1  125738.0             1971.ipm_journal-ir0volumeA7A6.0  26.030957   
1   1       NaN              1971.sigirconf_conference-71.10   2.714984   
2   1  124795.0             1972.ipm_journal-ir0volumeA8A5.1  25.507635   
3   1   82625.0               1973.sigirconf_conference-73.4  23.323948   
4   1       NaN  1976.sigirjournals_journal-ir0volumeA11A2.4   2.749110   
5   1  123985.0            1977.ipm_journal-ir0volumeA13A2.2  24.760950   
6   1   81370.0               1978.sigirconf_conference-78.2  22.373232   
7   1  121909.0  1980.sigirjournals_journal-ir0volumeA15A3.4  25.766639   
8   
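
The rows printed above are not in score order (26.03 is followed by 2.71), so the preview should not be read as the final ranking. A minimal sketch, assuming a run is simply a list of `(qid, docno, score)` tuples, of how a ranking is typically derived: sort by descending score within each query and number the positions. The helper `add_ranks_by_score` is hypothetical, ranks start at 1 by assumption, and the three rows are copied from the preview.

```python
from itertools import groupby


def add_ranks_by_score(rows):
    """Return (qid, docno, score, rank) tuples, ranked per query by descending score."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
    ranked = []
    for qid, group in groupby(ordered, key=lambda r: r[0]):
        for rank, (_, docno, score) in enumerate(group, start=1):
            ranked.append((qid, docno, score, rank))
    return ranked


# Three rows taken from the preview of the run.
rows = [
    ("1", "1971.ipm_journal-ir0volumeA7A6.0", 26.030957),
    ("1", "1971.sigirconf_conference-71.10", 2.714984),
    ("1", "1972.ipm_journal-ir0volumeA8A5.1", 25.507635),
]

ranked = add_ranks_by_score(rows)
for row in ranked:
    print(row)
```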

### Step 5: Persist the run file for subsequent evaluations

In [6]:
# Create the 'runs' directory if it doesn't exist
os.makedirs('../runs', exist_ok=True)

persist_and_normalize_run(run, system_name='combined-bm25-bo1-tfidf-dirichlet', default_output='../runs')
print('Run file is stored under "../runs/run.txt".')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
Run file is stored under "../runs/run.txt".
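
For orientation, run files are commonly exchanged in the six-column TREC format (`qid Q0 docno rank score tag`). The sketch below formats two entries that way. It is an assumption that `run.txt` follows this layout, and `to_trec_lines` is a hypothetical helper, not part of `tira`; the ranks shown are illustrative and cover only these two rows.

```python
def to_trec_lines(ranked_rows, system_name):
    """Format (qid, docno, score, rank) tuples as TREC-style run lines."""
    return [
        f"{qid} Q0 {docno} {rank} {score} {system_name}"
        for qid, docno, score, rank in ranked_rows
    ]


# Two entries from the run preview, with illustrative ranks.
ranked_rows = [
    ("1", "1971.ipm_journal-ir0volumeA7A6.0", 26.030957, 1),
    ("1", "1972.ipm_journal-ir0volumeA8A5.1", 25.507635, 2),
]

lines = to_trec_lines(ranked_rows, "combined-bm25-bo1-tfidf-dirichlet")
print("\n".join(lines))
```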
