# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [None]:
!pip3 install "tira>=0.0.139" ir-datasets "python-terrier==0.10.0"

Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [None]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [4]:
import pyterrier as pt

pt_dataset = pt.get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

### Step 3: Build an index

We will then create an index from the documents in the dataset we just loaded.

In [None]:
import os

indexer = pt.IterDictIndexer(
    index_path=os.getcwd() + os.sep + "index",
    meta={'docno': 50, 'text': 4096},
    # If an index already exists there, then overwrite it.
    overwrite=True,
)

index = indexer.index(pt_dataset.get_corpus_iter())
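
To give an intuition for what an index stores, here is a minimal sketch of an inverted index in plain Python. It is not how Terrier lays out its index on disk; `toy_corpus` and `build_inverted_index` are hypothetical names, and the documents only mimic the `docno`/`text` dictionary shape that the indexer above is configured for.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus in the same dict shape (docno, text) as above.
toy_corpus = [
    {"docno": "d1", "text": "the cat sat on the mat"},
    {"docno": "d2", "text": "the dog chased the cat"},
]


def build_inverted_index(docs):
    """Map each term to a dict of {docno: term frequency}."""
    postings = defaultdict(dict)
    for doc in docs:
        # Naive whitespace tokenization; a real indexer also stems and removes stopwords.
        for term, tf in Counter(doc["text"].lower().split()).items():
            postings[term][doc["docno"]] = tf
    return dict(postings)


toy_index = build_inverted_index(toy_corpus)
print(toy_index["cat"])  # {'d1': 1, 'd2': 1}
print(toy_index["the"])  # {'d1': 2, 'd2': 2}
```

At query time, a retrieval model only needs to look up the postings of the query terms instead of scanning every document.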

### Step 4: Define the retrieval pipeline

We will define a simple retrieval pipeline using just BM25 as a baseline. For details, refer to the PyTerrier [documentation](https://pyterrier.readthedocs.io) or [tutorial](https://github.com/terrier-org/ecir2021tutorial).

In [6]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
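
For intuition about what the BM25 weighting model computes, the following is a minimal sketch of a textbook BM25 variant in plain Python. It is an illustration under assumptions, not Terrier's implementation: the tokenization, the IDF formula, and the defaults `k1=1.2` and `b=0.75` are choices made here, and `bm25_scores` and `toy_docs` are hypothetical names.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` ({docno: text}) against `query`."""
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for docno, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Non-negative IDF variant (assumption; other variants exist).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docno] = score
    return scores


toy_docs = {
    "d1": "cat sat on the mat",
    "d2": "dog chased the cat",
    "d3": "birds fly south",
}
toy_scores = bm25_scores("dog cat", toy_docs)
toy_ranking = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(toy_ranking)  # ['d2', 'd1', 'd3']
```

Documents that contain more query terms, rarer query terms, or that are shorter than average score higher; documents without any query term score zero.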

# Using WordNet and synset_similarity:

In [None]:
!pip3 install nltk

## Use NLTK stopwords and lemmatizer

In [None]:
import re
from typing import Literal, Set, Tuple
import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd

nltk.download('wordnet')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

## Create some helper functions

In [9]:
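def get_synset_similarity(word1, word2):
    # Compare only the first synset that WordNet lists for each word.
    try:
        synset1 = wordnet.synsets(word1)[0]
        synset2 = wordnet.synsets(word2)[0]
    except IndexError:
        # At least one of the words has no synset in WordNet.
        return 0.0
    # path_similarity can return None when no path connects the two synsets.
    similarity = synset1.path_similarity(synset2)
    return similarity if similarity is not None else 0.0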

def remove_bad_characters(text):
    text = text.replace("'s", "")
    text = re.sub(r'[^a-zA-Z0-9_]', '', text)
    return text

def filter_min_similarity(synonyms: Set[Tuple[str, float]], similarity: float) -> Set[Tuple[str, float]]:
    synonyms = set(filter(lambda x: x[1] > similarity, synonyms))
    return synonyms

def find_top_k(synonyms: Set[Tuple[str, float]], k: int) -> Set[Tuple[str, float]]:
    sorted_synonyms_list = sorted(synonyms, key=lambda x: x[1], reverse=True)
    synonyms = set(sorted_synonyms_list[:min(k, len(synonyms))])
    return synonyms

def add_synset_similarity(synonyms_in: Set[str], term: str) -> Set[Tuple[str, float]]:
    return set([(name, get_synset_similarity(term, name)) for name in synonyms_in])

def remove_stopwords(query):
    filtered_query = [w for w in query.split() if w not in stop_words]
    return " ".join(filtered_query)

def lemmatize_query(query):
    lemmatized_query = []
    for word in query.split():
        lemma = lemmatizer.lemmatize(word)
        lemmatized_query.append(lemma)
    return " ".join(lemmatized_query)
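
The selection logic of these helpers can be tried without WordNet. The sketch below assumes the usual definition of path similarity, 1 / (1 + length of the shortest path between two concepts in the is-a hierarchy), and applies the same strict threshold and top-k selection to a hand-made taxonomy. `toy_hypernyms` and `toy_path_similarity` are hypothetical and not part of NLTK.

```python
from collections import deque

# Hypothetical is-a edges (child -> parent); not taken from WordNet.
toy_hypernyms = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "animal",
    "cat": "feline",
    "feline": "animal",
}


def toy_path_similarity(a, b):
    """Return 1 / (1 + shortest path length) between two toy concepts."""
    neighbours = {}
    for child, parent in toy_hypernyms.items():
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)
    # Breadth-first search finds the shortest path in an unweighted graph.
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (1 + dist)
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0  # No path: treat as dissimilar, like the helper above.


scored = {(w, toy_path_similarity("dog", w)) for w in ["dog", "canine", "wolf", "cat"]}
kept = {pair for pair in scored if pair[1] > 0.3}                 # strict threshold
top_2 = sorted(kept, key=lambda pair: pair[1], reverse=True)[:2]  # top-k selection
nothing = {pair for pair in scored if pair[1] > 1.0}

print(top_2)    # [('dog', 1.0), ('canine', 0.5)]
print(nothing)  # set()
```

Because the comparison is strict and this similarity never exceeds 1, a threshold of 1.0 keeps no candidates at all, so such a setting adds no synonyms to the query.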

## Create a PyTerrier transformer to modify the queries

In [10]:
class WordnetQueryModifier(pt.Transformer):
    def __init__(self, min_similarity: float, top_k: int, pos: Literal["noun", "verb", "adjective"] | None = None):
        self.min_similarity = min_similarity
        self.top_k = top_k
        # WordNet expects single-letter POS tags: "n", "v", or "a".
        self.pos = pos[0] if pos is not None else None
        
    def transform(self, queries: pd.DataFrame):
        # Work on a copy so that the caller's topics are not modified in place.
        queries = queries.copy()
        queries["query"] = queries["query"].apply(self.expand_query_wordnet)
        return queries
    
    def expand_query_wordnet(self, query):
        query = remove_stopwords(query)
        query = lemmatize_query(query)
        expanded_query = query.split()
        
        # find synonyms
        for term in query.split():
            synonyms: Set[str] = set()
            for syn in wordnet.synsets(term, self.pos):
                for lemma in syn.lemmas():
                    name = lemma.name()
                    name = remove_bad_characters(name)
                    synonyms.add(name)
                    
            # only select some synonyms
            synonyms_with_similarity: Set[Tuple[str, float]] = add_synset_similarity(synonyms, term)
            synonyms_with_similarity_filtered: Set[Tuple[str, float]] = filter_min_similarity(synonyms_with_similarity, self.min_similarity)
            synonyms_with_similarity_filtered_top_k: Set[Tuple[str, float]] = find_top_k(synonyms_with_similarity_filtered, self.top_k)
            
            # add selected synonyms to query
            synonyms_words: list[str] = [syn[0] for syn in synonyms_with_similarity_filtered_top_k]
            for synonym in synonyms_words:
                if len(expanded_query) < 64:
                    if synonym not in expanded_query and synonym != term and synonym.lower() != term.lower():
                        expanded_query.append(synonym)
                else:
                    print(f"query '{query}' has too many synonyms to add: {synonyms_words}")
        
        # print(f' final query \"{" ".join(expanded_query)}\"')
        return " ".join(expanded_query)


In [None]:
wordnet_modifier = WordnetQueryModifier(0.5, 3, None)
wordnet_modifier.transform(pt_dataset.get_topics('text')[:5])
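
The expansion loop itself can be illustrated without WordNet. This is a simplified sketch: `TOY_SYNONYMS` is a hypothetical hand-written synonym table standing in for the WordNet lookup, ordered lists replace the sets used above so that the output is deterministic, and the stopword removal and lemmatization steps are left out.

```python
# Hypothetical synonym table; the transformer above derives these from WordNet.
TOY_SYNONYMS = {
    "car": ["automobile", "auto", "car"],
    "fast": ["quick", "rapid"],
}
MAX_QUERY_TERMS = 64  # Same cap as in the transformer above.


def expand_query(query, synonyms=TOY_SYNONYMS, max_terms=MAX_QUERY_TERMS):
    """Append synonyms of each query term, skipping duplicates, up to max_terms."""
    expanded = query.split()
    for term in query.split():
        for synonym in synonyms.get(term, []):
            if len(expanded) >= max_terms:
                break  # The query is full; stop adding synonyms for this term.
            if synonym.lower() != term.lower() and synonym not in expanded:
                expanded.append(synonym)
    return " ".join(expanded)


print(expand_query("fast car"))               # fast car quick rapid automobile auto
print(expand_query("fast car", max_terms=3))  # fast car quick
```

The original terms always stay at the front of the query; synonyms are only appended, never substituted.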

### Step 5: Evaluate your run
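We run the grid search with `ParameterGrid` from `sklearn.model_selection` instead of `pt.GridScan`, because GridScan reuses the queries modified in a previous attempt for the next one.
The grid search looks for the best values of the following parameters:
- `min_similarity`: the minimum synset similarity
- `top_k`: the number of synonyms to keep per term
- `pos`: the word type (part of speech)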

In [None]:
from sklearn.model_selection import ParameterGrid
import tqdm

params = {
    # Round to avoid floating-point artifacts such as 0.30000000000000004 in the run names.
    "min_similarity": [round(0.1 * x, 1) for x in range(11)],
    "top_k": list(range(1, 6)),
    "pos": ["noun", "verb", "adjective", None]
}
param_grid = ParameterGrid(params)

results = []
names = []
for p in tqdm.tqdm(param_grid):
    wordnet_query_modifier = WordnetQueryModifier(**p)
    wordnet_pipeline = wordnet_query_modifier >> bm25

    experiment_results = pt.Experiment(
        [wordnet_pipeline],
        pt_dataset.get_topics('text'),
        pt_dataset.get_qrels(),
        eval_metrics = ["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
    )
    results.append(experiment_results)
    names.append(str(p))

all_results = pd.concat(results, keys=names)
all_results = all_results.reset_index().drop(["name", "level_1"], axis=1).rename(columns={"level_0": "name"})
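
As a sanity check on the size and order of the grid, the combinations can be enumerated with the standard library alone. This sketch assumes that `ParameterGrid` iterates over the Cartesian product of the value lists with the parameter names sorted alphabetically; `grid` and `keys` are names introduced here, and the thresholds are rounded for readable labels.

```python
from itertools import product

params = {
    "min_similarity": [round(0.1 * x, 1) for x in range(11)],
    "top_k": list(range(1, 6)),
    "pos": ["noun", "verb", "adjective", None],
}

# Sort the parameter names, then take the Cartesian product of their values.
keys = sorted(params)  # ['min_similarity', 'pos', 'top_k']
grid = [dict(zip(keys, values)) for values in product(*(params[k] for k in keys))]

print(len(grid))  # 220
print(grid[217])  # {'min_similarity': 1.0, 'pos': None, 'top_k': 3}
```

With 11 thresholds, 5 values of `top_k`, and 4 word types, the loop above runs 220 experiments, so it pays to keep the evaluated topic set small while tuning.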


In [32]:
all_results['row_average'] = all_results[['map', 'recip_rank', 'ndcg_cut_10', 'P_1', 'P_5', 'P_10']].mean(axis=1)
df_sorted = all_results.sort_values(by='row_average', ascending=False)
df_sorted

| | name | map | recip_rank | ndcg_cut_10 | P_1 | P_5 | P_10 | row_average |
|---|---|---|---|---|---|---|---|---|
| 217 | {'min_similarity': 1.0, 'pos': None, 'top_k': 3} | 0.417705 | 0.794299 | 0.491219 | 0.711340 | 0.620619 | 0.575258 | 0.601740 |
| 219 | {'min_similarity': 1.0, 'pos': None, 'top_k': 5} | 0.417705 | 0.794299 | 0.491219 | 0.711340 | 0.620619 | 0.575258 | 0.601740 |
| 218 | {'min_similarity': 1.0, 'pos': None, 'top_k': 4} | 0.417705 | 0.794299 | 0.491219 | 0.711340 | 0.620619 | 0.575258 | 0.601740 |
| 202 | {'min_similarity': 1.0, 'pos': 'noun', 'top_k'... | 0.417705 | 0.794299 | 0.491219 | 0.711340 | 0.620619 | 0.575258 | 0.601740 |
| 203 | {'min_similarity': 1.0, 'pos': 'noun', 'top_k'... | 0.417705 | 0.794299 | 0.491219 | 0.711340 | 0.620619 | 0.575258 | 0.601740 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17 | {'min_similarity': 0.0, 'pos': None, 'top_k': 3} | 0.352749 | 0.610806 | 0.378328 | 0.484536 | 0.480412 | 0.461856 | 0.461448 |
| 38 | {'min_similarity': 0.1, 'pos': None, 'top_k': 4} | 0.336860 | 0.568524 | 0.351076 | 0.422680 | 0.457732 | 0.436082 | 0.428826 |
| 18 | {'min_similarity': 0.0, 'pos': None, 'top_k': 4} | 0.329683 | 0.568963 | 0.345760 | 0.432990 | 0.447423 | 0.431959 | 0.426130 |
| 39 | {'min_similarity': 0.1, 'pos': None, 'top_k': 5} | 0.330339 | 0.552143 | 0.337982 | 0.402062 | 0.443299 | 0.423711 | 0.414923 |
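
To make the columns easier to read, here is a minimal sketch of how two of the metrics are computed for a single query, followed by a recomputation of `row_average` for the first row of the output above. The sketch assumes binary relevance; `toy_ranking` and `toy_relevant` are hypothetical data, and the helper functions are simplified stand-ins for the measures that `pt.Experiment` reports.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0


# Hypothetical ranking and relevance judgments for one query.
toy_ranking = ["d3", "d1", "d7", "d2", "d9"]
toy_relevant = {"d1", "d2"}
print(precision_at_k(toy_ranking, toy_relevant, 5))  # 0.4
print(reciprocal_rank(toy_ranking, toy_relevant))    # 0.5

# Metric values copied from the first row of the results above.
best_row = {
    "map": 0.417705,
    "recip_rank": 0.794299,
    "ndcg_cut_10": 0.491219,
    "P_1": 0.711340,
    "P_5": 0.620619,
    "P_10": 0.575258,
}
row_average = sum(best_row.values()) / len(best_row)
print(round(row_average, 6))  # 0.60174
```

Note that `row_average` weights all six metrics equally, although they measure different things; sorting by a single target metric is an equally reasonable choice.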
