# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [None]:
!pip3 install "tira>=0.0.139" ir-datasets "python-terrier==0.10.0"

Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [None]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [3]:
import pyterrier as pt

pt_dataset = pt.get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

### Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

In [None]:
import os

indexer = pt.IterDictIndexer(
    index_path=os.getcwd() + os.sep + "index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)

index = indexer.index(pt_dataset.get_corpus_iter())
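The `meta` argument reserves a length for each stored field. As a rough sanity check before indexing, the sketch below scans records shaped like those the corpus iterator yields (dictionaries with `docno` and `text` keys, as the `meta` setting implies) and counts how many values are longer than the reserved length. The helper name `count_oversized_fields` and the toy records are illustrative assumptions and not part of PyTerrier; character counts are used as an approximation.

```python
from collections import Counter


def count_oversized_fields(records, meta_lengths):
    """Count, per field, how many records exceed the reserved meta length."""
    oversized = Counter()
    for record in records:
        for field, max_length in meta_lengths.items():
            if len(str(record.get(field, ""))) > max_length:
                oversized[field] += 1
    return dict(oversized)


# Toy records (hypothetical); real ones would come from pt_dataset.get_corpus_iter().
toy_records = [
    {"docno": "d1", "text": "short passage"},
    {"docno": "d2", "text": "x" * 5000},
]
oversized_counts = count_oversized_fields(toy_records, {"docno": 50, "text": 4096})
print(oversized_counts)
```

If many passages exceed the reserved length for `text`, raising that value is one option to consider.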

### Step 4: Define the retrieval pipeline

We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [5]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
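To give an intuition for what the BM25 weighting model rewards, here is a minimal sketch of a common textbook BM25 variant in plain Python. It is not Terrier's implementation, whose exact formula and parameters may differ in details; the function names, the toy document, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score one query term for one document with a textbook BM25 variant."""
    # Smoothed inverse document frequency; the 1 inside the log keeps it positive.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalisation: longer-than-average documents are penalised.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Term frequency saturates: each extra occurrence adds less.
    return idf * tf * (k1 + 1) / (tf + norm)


def bm25_score(query_terms, doc_terms, avg_doc_len, n_docs, doc_freqs):
    """Sum the term scores over all query terms that occur in the document."""
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf > 0:
            score += bm25_term_score(tf, len(doc_terms), avg_doc_len, n_docs, doc_freqs[term])
    return score


# Toy document and collection statistics (hypothetical values).
toy_doc = "the cat sat on the mat".split()
toy_score = bm25_score(
    ["cat", "dog"], toy_doc, avg_doc_len=6.0, n_docs=100, doc_freqs={"cat": 10, "dog": 5}
)
print(round(toy_score, 3))
```

Only query terms that occur in the document contribute, so every term appended to a query by expansion can change which documents score highly.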

# Create verachell Synonyms Dict
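Based on the synonym data from https://github.com/verachell/English-word-lists-synonyms-antonyms. The loader below reads the word from the first CSV column and its `#`-separated synonyms from the second.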

In [None]:
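synonyms_dict_verachell = {}
with open("synonym files/syn-ant.csv", "r") as file:
    for line in file:
        # Column 0 is the word, column 1 its '#'-separated synonyms.
        # Strip the newline so it cannot end up inside the last synonym.
        values = line.rstrip("\n").split(",")
        word = values[0]
        synonyms = values[1].split("#")
        # A trailing '#' leaves an empty last entry; drop it.
        if synonyms[-1] == "":
            synonyms = synonyms[:-1]
        synonyms_dict_verachell[word] = synonyms
# Print only the size; the full dictionary is too large to read in the output.
print(len(synonyms_dict_verachell))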
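The parsing step can be tried on a single line without the file. This sketch assumes the layout that the loader implies (`word,syn1#syn2#`); the sample line is made up and not taken from `syn-ant.csv`, and `parse_synonym_line` is a hypothetical helper name.

```python
def parse_synonym_line(line):
    """Parse one 'word,syn1#syn2#' line into (word, [synonyms])."""
    values = line.rstrip("\n").split(",")
    word = values[0]
    # Drop every empty string, e.g. the one produced by a trailing '#'.
    synonyms = [synonym for synonym in values[1].split("#") if synonym]
    return word, synonyms


sample_line = "happy,glad#joyful#\n"  # hypothetical line for illustration
parsed_word, parsed_synonyms = parse_synonym_line(sample_line)
print(parsed_word, parsed_synonyms)
```

Unlike the loader above, this variant removes all empty entries rather than only the last one, which is a slightly stricter design choice.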

In [7]:
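def addVerachellSynonyms(q):
    # q is one row of the topics frame; split its query text on spaces.
    query = q["query"].split(" ")

    # Keep the original terms first, then append the synonyms of each term.
    new_query = list(query)
    for word in query:
        if word in synonyms_dict_verachell:
            new_query += synonyms_dict_verachell[word]
    return " ".join(new_query)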

verachell_querys = pt.apply.query(addVerachellSynonyms)
verachell_pipeline = verachell_querys >> bm25

In [None]:
verachell_querys.transform(pt_dataset.get_topics('text')[:10])
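To see how quickly this kind of expansion grows a query, the following sketch applies the same append-all-synonyms logic to a toy dictionary. The dictionary, the query, and the helper name `expand_query` are made up for illustration.

```python
def expand_query(query, synonyms_by_word):
    """Append all synonyms of every query term to the original query."""
    terms = query.split(" ")
    expanded = list(terms)
    for term in terms:
        expanded += synonyms_by_word.get(term, [])
    return " ".join(expanded)


# Toy synonym dictionary and query (hypothetical).
toy_synonyms = {"big": ["large", "huge", "great"], "car": ["automobile", "vehicle"]}
toy_query = "big red car"
expanded_query = expand_query(toy_query, toy_synonyms)
print(expanded_query)
print(len(toy_query.split()), "->", len(expanded_query.split()), "terms")
```

Each appended synonym becomes an ordinary query term, so a loosely related synonym such as "great" for "big" can pull in documents that do not match the original intent.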

# Create zaibacu Synonyms Dict
Based on the synonym data from https://github.com/zaibacu/thesaurus, a JSON Lines file whose entries are read via their `word`, `pos`, and `synonyms` fields.

In [9]:
import json

synonyms_dict_zaibacu = {}
with open("synonym files/en_thesaurus.jsonl", "r") as file:
    for line in file:
        entry = json.loads(line)
        if entry["synonyms"]:
            # Strip possessive "'s" and any other apostrophes from each synonym.
            synonyms = [
                synonym.replace("'s", "").replace("'", "")
                for synonym in entry["synonyms"]
            ]
            # A word that occurs on several lines keeps only its last entry.
            synonyms_dict_zaibacu[entry["word"]] = synonyms
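The apostrophe clean-up can be checked in isolation on one in-memory JSON line. The entry below is invented for illustration and only mirrors the three fields the loader reads; `clean_synonym` is a hypothetical helper name.

```python
import json


def clean_synonym(synonym):
    """Remove a possessive "'s" and any remaining apostrophes."""
    return synonym.replace("'s", "").replace("'", "")


# Build a sample JSON line (hypothetical entry) instead of reading the file.
sample_line = json.dumps(
    {"word": "shop", "pos": "noun", "synonyms": ["store", "grocer's", "o'clock"]}
)
entry = json.loads(sample_line)
cleaned_synonyms = [clean_synonym(synonym) for synonym in entry["synonyms"]]
print(cleaned_synonyms)
```

Synonyms without an apostrophe pass through unchanged, so no separate check for `"'"` is needed before calling `replace`.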

In [10]:
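def addZaibacuSynonyms(q):
    # q is one row of the topics frame; split its query text on spaces.
    query = q["query"].split(" ")

    # Keep the original terms first, then append the synonyms of each term.
    new_query = list(query)
    for word in query:
        if word in synonyms_dict_zaibacu:
            new_query += synonyms_dict_zaibacu[word]
    return " ".join(new_query)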

zaibacu_querys = pt.apply.query(addZaibacuSynonyms)
zaibacu_pipeline = zaibacu_querys >> bm25

In [None]:
zaibacu_querys.transform(pt_dataset.get_topics('text')[:10])

# Create zaibacu Synonyms Dicts - Only nouns / adjectives / verbs
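Based on the synonym data from https://github.com/zaibacu/thesaurus. Entries are grouped by their `pos` field so that expansion can be restricted to nouns, adjectives, or verbs.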

In [12]:
import json

# One synonym dictionary per part of speech, keyed by the entry's "pos" value.
synonyms_dicts_zaibacu = {}
with open("synonym files/en_thesaurus.jsonl", "r") as file:
    for line in file:
        entry = json.loads(line)

        # Create the dictionary for this part of speech on first sight.
        pos_dict = synonyms_dicts_zaibacu.setdefault(entry["pos"], {})
        if entry["synonyms"]:
            # Strip possessive "'s" and any other apostrophes from each synonym.
            pos_dict[entry["word"]] = [
                synonym.replace("'s", "").replace("'", "")
                for synonym in entry["synonyms"]
            ]
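The grouping by part of speech can be sketched on a few in-memory entries. The sample entries are invented; only the field names and the `noun`, `adj`, and `verb` tags follow what the notebook code uses. `collections.defaultdict` is used here as one possible way to create the inner dictionaries on demand.

```python
import json
from collections import defaultdict

# Hypothetical JSON Lines entries mirroring the fields the loader reads.
sample_lines = [
    json.dumps({"word": "run", "pos": "verb", "synonyms": ["sprint", "jog"]}),
    json.dumps({"word": "run", "pos": "noun", "synonyms": ["streak"]}),
    json.dumps({"word": "quick", "pos": "adj", "synonyms": []}),
]

synonyms_by_pos = defaultdict(dict)
for line in sample_lines:
    entry = json.loads(line)
    # Accessing the key creates the inner dictionary on first sight.
    pos_dict = synonyms_by_pos[entry["pos"]]
    if entry["synonyms"]:
        pos_dict[entry["word"]] = entry["synonyms"]

print(dict(synonyms_by_pos))
```

The same word can carry different synonyms per part of speech, which a single flat dictionary keyed only by word cannot represent. Note that the query words themselves are not tagged; a word is expanded whenever it appears in the chosen dictionary.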

In [13]:
def makeZaibacuSynonymAdder(pos):
    # Build a query rewriter that appends synonyms from one part-of-speech dictionary.
    def addSynonyms(q):
        query = q["query"].split(" ")
        new_query = list(query)
        for word in query:
            if word in synonyms_dicts_zaibacu[pos]:
                new_query += synonyms_dicts_zaibacu[pos][word]
        return " ".join(new_query)
    return addSynonyms

addZaibacuSynonymsNouns = makeZaibacuSynonymAdder("noun")
addZaibacuSynonymsAdjectives = makeZaibacuSynonymAdder("adj")
addZaibacuSynonymsVerbs = makeZaibacuSynonymAdder("verb")

zaibacu_querys_noun = pt.apply.query(addZaibacuSynonymsNouns)
zaibacu_pipeline_noun = zaibacu_querys_noun >> bm25

zaibacu_querys_adjvective = pt.apply.query(addZaibacuSynonymsAdjectives)
zaibacu_pipeline_adjvective = zaibacu_querys_adjvective >> bm25

zaibacu_querys_verb = pt.apply.query(addZaibacuSynonymsVerbs)
zaibacu_pipeline_verb = zaibacu_querys_verb >> bm25

In [None]:
zaibacu_querys_noun.transform(pt_dataset.get_topics('text')[:10])

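### Step 5: Evaluate your run

We compare the BM25 baseline with the five synonym-expansion pipelines on the topics and relevance judgments of the training dataset using `pt.Experiment`.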

In [18]:
experiments = [bm25, 
               verachell_pipeline, 
               zaibacu_pipeline, 
               zaibacu_pipeline_noun, 
               zaibacu_pipeline_adjvective, 
               zaibacu_pipeline_verb
               ]
experiments_names = ["BM25",
                     "BM25 + verachell synonyms",
                     "BM25 + zaibacu synonyms",
                     "BM25 + zaibacu synonyms only nouns",
                     "BM25 + zaibacu synonyms only adjectives",
                     "BM25 + zaibacu synonyms only verbs"
                     ]

pt.Experiment(
    experiments,
    pt_dataset.get_topics('text'),
    pt_dataset.get_qrels(),
    eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=experiments_names,
    verbose=True,
)


|   | name | map | recip_rank | ndcg_cut_10 | P_1 | P_5 | P_10 |
|---|------|-----|------------|-------------|-----|-----|------|
| 0 | BM25 | 0.412718 | 0.786653 | 0.489469 | 0.701031 | 0.62268 | 0.574227 |
| 1 | BM25 + verachell synonyms | 0.275158 | 0.518404 | 0.304276 | 0.381443 | 0.395876 | 0.371134 |
| 2 | BM25 + zaibacu synonyms | 0.36356 | 0.689224 | 0.42157 | 0.587629 | 0.523711 | 0.508247 |
| 3 | BM25 + zaibacu synonyms only nouns | 0.365621 | 0.71169 | 0.427496 | 0.608247 | 0.546392 | 0.51134 |
| 4 | BM25 + zaibacu synonyms only adjectives | 0.406223 | 0.754702 | 0.470799 | 0.659794 | 0.606186 | 0.559794 |
| 5 | BM25 + zaibacu synonyms only verbs | 0.39555 | 0.766893 | 0.474376 | 0.680412 | 0.589691 | 0.55567 |
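
To make two of the metric names concrete, here is a minimal sketch of reciprocal rank (`recip_rank`) and precision at a cutoff (`P_1`, `P_5`, `P_10`) for a single query with binary relevance. The ranking, the judgments, and the function names are toy assumptions; the evaluation above averages such per-query values over all topics, and its handling of graded judgments may differ from this simplification.

```python
def reciprocal_rank(ranked_docnos, relevant_docnos):
    """Return 1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant_docnos:
            return 1.0 / rank
    return 0.0


def precision_at_k(ranked_docnos, relevant_docnos, k):
    """Return the fraction of the top-k results that are relevant."""
    top_k = ranked_docnos[:k]
    return sum(1 for docno in top_k if docno in relevant_docnos) / k


# Toy ranking and judgments for one query (hypothetical).
toy_ranking = ["d3", "d1", "d7", "d2", "d9"]
toy_relevant = {"d1", "d2"}

rr = reciprocal_rank(toy_ranking, toy_relevant)
p_at_1 = precision_at_k(toy_ranking, toy_relevant, 1)
p_at_5 = precision_at_k(toy_ranking, toy_relevant, 5)
print(rr, p_at_1, p_at_5)
```

In this toy case the first relevant document sits at rank 2, so the top result counts against `P_1` while two of the five results count towards `P_5`.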
