# Information Retrieval Lab WiSe 2024/2025: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can improve upon.
We use subsets of the MS MARCO datasets to retrieve passages of web documents.
We will show you how to create a software submission to TIRA from this notebook.

An overview of all corpora that we use in the current course is available at [https://tira.io/datasets?query=ir-lab-wise-2024](https://tira.io/datasets?query=ir-lab-wise-2024). The dataset IDs for loading the datasets are:

- `ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training`: A subsample of the TREC 2019/2020 Deep Learning tracks on the MS MARCO v1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/subsampled-ms-marco-rag-20241202-training` (_work in progress_): A subsample of the TREC 2024 Retrieval-Augmented Generation track on the MS MARCO v2.1 passage dataset. Use this dataset to tune your system(s).
- `ir-lab-wise-2024/ms-marco-rag-20241203-test` (_work in progress_): The test corpus that we have created together in the course, based on the MS MARCO v2.1 passage dataset. We will use this dataset as the test dataset, i.e., evaluation scores become available only after the submission deadline.

### Step 1: Import libraries

We will use [tira](https://tira.io/), an information retrieval shared task platform, and [ir_datasets](https://ir-datasets.com/) for loading the datasets. Subsequently, we will build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine framework.

First, we need to install the required libraries.

In [20]:
!pip3 install "tira>=0.0.139" ir-datasets "python-terrier==0.10.0"



Create an API client to interact with the TIRA platform (e.g., to load datasets and submit runs).

In [21]:
from tira.third_party_integrations import ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the dataset

We load the dataset by its ir_datasets ID (as listed in the Readme). Just be sure to add the `irds:` prefix before the dataset ID to tell PyTerrier to load the data from ir_datasets.

In [22]:
import pyterrier as pt

pt_dataset = pt.get_dataset('irds:ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training')

In [23]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [24]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [25]:
from importlib import reload
from collections.abc import Iterator
import os
import data_cleaning
reload(data_cleaning)

class DataCleaningIter(Iterator):
    """Wrap a corpus iterator and drop all tokens whose spaCy POS tag is in remove_token_tags."""

    def __init__(self, dataset_iter, remove_token_tags: list[str]) -> None:
        self.dataset_iter = iter(dataset_iter)
        self.remove_token_tags = set(remove_token_tags)

    def __iter__(self):
        return self

    def __next__(self):
        # Raises StopIteration once the underlying corpus iterator is exhausted.
        item = next(self.dataset_iter)

        # Tag the passage with spaCy and keep only tokens whose coarse POS tag is not filtered.
        doc = nlp(item["text"])
        kept_tokens = [str(token) for token in doc if token.pos_ not in self.remove_token_tags]

        # Tokens are re-joined with single spaces, so the original spacing is not preserved.
        item["text"] = " ".join(kept_tokens)
        return item
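
To make the effect of the filter concrete without loading spaCy, the following minimal sketch applies the same keep-or-drop rule to a hand-tagged sentence. The tagged tokens and the helper name `drop_tags` are made up for illustration; in the notebook the tags come from `token.pos_`.

```python
# Hypothetical, hand-tagged tokens; in the notebook spaCy assigns these tags.
tagged_tokens = [
    ("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"), (".", "PUNCT"),
]

def drop_tags(tokens, remove_token_tags):
    """Return the text with every token removed whose POS tag is in remove_token_tags."""
    kept = [text for text, tag in tokens if tag not in remove_token_tags]
    return " ".join(kept)

print(drop_tags(tagged_tokens, ["DET"]))           # cat sat on mat .
print(drop_tags(tagged_tokens, ["DET", "PUNCT"]))  # cat sat on mat
print(drop_tags(tagged_tokens, []))                # The cat sat on the mat .
```

An empty tag list leaves the text unchanged apart from whitespace normalization, which is what the baseline run relies on.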


In [26]:
from sklearn.model_selection import ParameterGrid
import tqdm
import pandas as pd

spacy_token_tags = ['DET', 'INTJ', 'ADJ', 'VERB', 'PRON', 'CCONJ', 'PART', 'X', 'ADV', 'PUNCT', 'SCONJ', 'SYM', 'SPACE', 'NUM', 'ADP', 'AUX', 'PROPN', 'NOUN']

# Create Parameter Grid

In [27]:
params = {
    "token_tag": spacy_token_tags,
}
param_grid = ParameterGrid(params)

results = []
names = []
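
`ParameterGrid` expands a dict of value lists into one dict per combination. With a single key, as here, that means one dict per tag, so the grid has as many entries as `spacy_token_tags`. The sketch below reproduces this expansion with the standard library only; `expand_grid` is a hypothetical helper, not part of scikit-learn, and is meant to illustrate the idea rather than replace `ParameterGrid`.

```python
from itertools import product

def expand_grid(params):
    """Yield one dict per combination of the value lists in params."""
    keys = sorted(params)  # assumption: sorted key order for a stable result
    for values in product(*(params[key] for key in keys)):
        yield dict(zip(keys, values))

grid = list(expand_grid({"token_tag": ["DET", "INTJ", "ADJ"]}))
print(grid)
# [{'token_tag': 'DET'}, {'token_tag': 'INTJ'}, {'token_tag': 'ADJ'}]
```

Each grid entry triggers one full indexing run, so the cost of the experiment grows linearly with the number of tags.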


# Baseline BM25

In [28]:
# The baseline removes no tokens, so the list of POS tags to filter is empty.
data_cleaning_iter = DataCleaningIter(pt_dataset.get_corpus_iter(verbose=True), [])

index_path = os.path.join(os.getcwd(), "index")

if os.path.exists(index_path):
    # Reuse the index from a previous run.
    index = pt.IndexFactory.of(index_path)
else:
    indexer = pt.IterDictIndexer(
        index_path=index_path,
        meta={'docno': 50, 'text': 4096},
        # Never overwrite an existing index at this path.
        overwrite=False
    )
    index = indexer.index(data_cleaning_iter)
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

experiment_results = pt.Experiment([bm25],
    pt_dataset.get_topics('text'),
    pt_dataset.get_qrels(),
    eval_metrics = ["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"]
)
results.append(experiment_results)
names.append("BM25")
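
For readers who want to see what a BM25 weighting model roughly computes, here is a minimal sketch of one common textbook formulation. The parameter values `k1=1.2` and `b=0.75`, the non-negative IDF variant, and the whitespace tokenization are assumptions made for this illustration; Terrier's implementation may differ in details, so these scores are not expected to match PyTerrier's output.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 variant."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # terms missing from the document contribute nothing
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "a cat and another cat",
]
print(bm25_scores("cat", docs))
```

The second document scores zero because it contains "cats" but not "cat"; this sketch does no stemming. The third document ranks first because it is shorter and mentions the term twice.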



# Run Parameter Grid

In [30]:
for i, p in tqdm.tqdm(enumerate(param_grid), total=len(spacy_token_tags)):
    data_cleaning_iter = DataCleaningIter(
        pt_dataset.get_corpus_iter(verbose=True), [p["token_tag"]]
    )
    index_path = os.path.join(os.getcwd(), f"index-{i}")

    if os.path.exists(index_path):
        # Reuse the index from a previous run.
        index = pt.IndexFactory.of(index_path)
    else:
        indexer = pt.IterDictIndexer(
            index_path=index_path,
            meta={"docno": 50, "text": 4096},
            # Never overwrite an existing index at this path.
            overwrite=False,
        )
        index = indexer.index(data_cleaning_iter)
    bm25 = pt.BatchRetrieve(index, wmodel="BM25")

    experiment_results = pt.Experiment(
        [bm25],
        pt_dataset.get_topics("text"),
        pt_dataset.get_qrels(),
        eval_metrics=["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    )
    results.append(experiment_results)
    names.append(str(p))


# Show Results

In [31]:
all_results = pd.concat(results, keys=names)
all_results = all_results.reset_index().drop(["name", "level_1"], axis=1).rename(columns={"level_0": "name"})
all_results

| | name | map | recip_rank | ndcg_cut_10 | P_1 | P_5 | P_10 |
|---|---|---|---|---|---|---|---|
| 0 | BM25 | 0.41257 | 0.78651 | 0.488841 | 0.701031 | 0.620619 | 0.574227 |
| 1 | {'token_tag': 'DET'} | 0.412687 | 0.78651 | 0.488207 | 0.701031 | 0.620619 | 0.573196 |
| 2 | {'token_tag': 'INTJ'} | 0.411576 | 0.78502 | 0.487932 | 0.701031 | 0.618557 | 0.573196 |
| 3 | {'token_tag': 'ADJ'} | 0.367286 | 0.743099 | 0.441431 | 0.639175 | 0.587629 | 0.534021 |
| 4 | {'token_tag': 'VERB'} | 0.382058 | 0.733259 | 0.430185 | 0.628866 | 0.569072 | 0.515464 |
| 5 | {'token_tag': 'PRON'} | 0.414859 | 0.788805 | 0.489165 | 0.701031 | 0.620619 | 0.574227 |
| 6 | {'token_tag': 'CCONJ'} | 0.412701 | 0.786939 | 0.488706 | 0.701031 | 0.620619 | 0.574227 |
| 7 | {'token_tag': 'PART'} | 0.412541 | 0.78651 | 0.489154 | 0.701031 | 0.620619 | 0.574227 |
| 8 | {'token_tag': 'X'} | 0.412144 | 0.785911 | 0.488158 | 0.701031 | 0.614433 | 0.573196 |
| 9 | {'token_tag': 'ADV'} | 0.412123 | 0.786666 | 0.487221 | 0.701031 | 0.620619 | 0.572165 |
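
The column names are trec_eval measure names: `recip_rank` is the reciprocal rank of the first relevant result, and `P_k` is the precision at cutoff k. The sketch below computes both for a single query. The document IDs and relevance judgments are invented, and binary relevance is assumed; `pt.Experiment` reports the mean of such per-query values over all topics.

```python
def reciprocal_rank(ranking, relevant):
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranking, relevant, k):
    """Return the fraction of the top-k positions that hold a relevant document."""
    hits = sum(1 for docno in ranking[:k] if docno in relevant)
    return hits / k

# Hypothetical ranking and judgments for one query.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d7", "d4"}

print(reciprocal_rank(ranking, relevant))    # 0.5
print(precision_at_k(ranking, relevant, 1))  # 0.0
print(precision_at_k(ranking, relevant, 5))  # 0.4
```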


In [32]:
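import ast

# Average the metrics per run. Note that P_10 is not part of this average.
metric_columns = ["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5"]
all_results["avg"] = all_results[metric_columns].mean(axis=1)

# Keep only the runs whose average beats the unmodified BM25 baseline.
bm25_avg = all_results.loc[all_results["name"] == "BM25", "avg"].values[0]
only_better_results = all_results[all_results["avg"] > bm25_avg]
print(only_better_results)
# Each name is the string form of a parameter dict, e.g. "{'token_tag': 'PRON'}".
only_better_token_tags = [ast.literal_eval(name)["token_tag"] for name in only_better_results["name"]]
print(f"removing these token_tags improves the results: {only_better_token_tags}")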

                      name       map  recip_rank  ndcg_cut_10       P_1  \
5    {'token_tag': 'PRON'}  0.414859    0.788805     0.489165  0.701031   
6   {'token_tag': 'CCONJ'}  0.412701    0.786939     0.488706  0.701031   
7    {'token_tag': 'PART'}  0.412541    0.786510     0.489154  0.701031   
11  {'token_tag': 'SCONJ'}  0.412617    0.786510     0.488955  0.701031   
16    {'token_tag': 'AUX'}  0.412462    0.791188     0.490190  0.711340   

         P_5      P_10       avg  
5   0.620619  0.574227  0.602896  
6   0.620619  0.574227  0.601999  
7   0.620619  0.574227  0.601971  
11  0.620619  0.574227  0.601946  
16  0.616495  0.572165  0.604335  
removing these token_tags improves the results: ['PRON', 'CCONJ', 'PART', 'SCONJ', 'AUX']


These tags stand for:
- PRON: Pronoun (e.g., "I", "you", "he", "she", "it", "we", "they", "this", "that")
- CCONJ: Coordinating conjunction (e.g., "and", "or", "but")
- PART: Particle (e.g., "not", "'s" in "cat's")
- SCONJ: Subordinating conjunction (e.g., "because", "although", "if", "since")
- AUX: Auxiliary verb (e.g., "is", "are", "was", "were", "have", "has", "had", "will", "would", "can", "could", "should", "may", "might")
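
The selection step can be checked by hand with the numbers from the results above. The sketch recomputes the average for the BM25 baseline and for the `PRON` run over the five metrics that enter `avg` (the printed averages do not include `P_10`).

```python
# Metric values copied from the results: map, recip_rank, ndcg_cut_10, P_1, P_5.
bm25_metrics = [0.41257, 0.78651, 0.488841, 0.701031, 0.620619]
pron_metrics = [0.414859, 0.788805, 0.489165, 0.701031, 0.620619]

def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

bm25_average = average(bm25_metrics)
pron_average = average(pron_metrics)

print(round(bm25_average, 6))       # 0.601914
print(round(pron_average, 6))       # 0.602896
print(pron_average > bm25_average)  # True
```

The margin is small (about 0.001), so these differences should be read as tentative rather than as clear improvements.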

In [33]:
data_cleaning_iter = DataCleaningIter(pt_dataset.get_corpus_iter(verbose=True), only_better_token_tags)
index_path = os.path.join(os.getcwd(), "index-combined")

if os.path.exists(index_path):
    # Reuse the index from a previous run.
    index = pt.IndexFactory.of(index_path)
else:
    indexer = pt.IterDictIndexer(
        index_path=index_path,
        meta={'docno': 50, 'text': 4096},
        # Never overwrite an existing index at this path.
        overwrite=False
    )
    # Build the index only when none exists yet.
    index = indexer.index(data_cleaning_iter)
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

experiment_results = pt.Experiment([bm25],
    pt_dataset.get_topics('text'),
    pt_dataset.get_qrels(),
    eval_metrics = ["map", "recip_rank", "ndcg_cut_10", "P_1", "P_5", "P_10"],
    names=["BM25 removed token_types with spacy"]
)

experiment_results


ir-lab-wise-2024/subsampled-ms-marco-deep-learning-20241201-training documents: 100%|██████████| 68261/68261 [11:17<00:00, 100.72it/s]


23:45:57.547 [main] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents


| | name | map | recip_rank | ndcg_cut_10 | P_1 | P_5 | P_10 |
|---|---|---|---|---|---|---|---|
| 0 | BR(BM25) | 0.414901 | 0.793504 | 0.491667 | 0.71134 | 0.618557 | 0.572165 |
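
The indexing cells above all follow the same reuse-or-build logic. One way to avoid repeating it is a small helper that takes the loading and building steps as callables. The sketch below is generic and uses a plain text file in a temporary directory as a stand-in for the index; `load_or_build` and both callables are hypothetical names, and with PyTerrier the callables would wrap `pt.IndexFactory.of` and `indexer.index`.

```python
import os
import tempfile

def load_or_build(path, load, build):
    """Return load(path) if path exists, otherwise build(path)."""
    if os.path.exists(path):
        return load(path)
    return build(path)

def build(path):
    # Stand-in for an expensive indexing run.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("index contents")
    return "built"

def load(path):
    # Stand-in for opening an existing index.
    return "loaded"

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "index")
    first = load_or_build(path, load, build)
    second = load_or_build(path, load, build)

print(first, second)  # built loaded
```

Keeping the build step inside the helper also rules out the case where indexing runs although an index already exists at the target path.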
