# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as the retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, together with [ir_datasets](https://ir-datasets.com/) to load the data, and then build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine. In addition, NLTK, spaCy, and scikit-learn supply the stopword lists used in Step 2.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [None]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier

In [1]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
# stopword imports
import nltk
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# further imports
import os
import pandas as pd
import re



In [2]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()
# spacy model
!python -m spacy download en_core_web_sm

Due to execution in TIRA, I have patched ir_datasets to always return the single input dataset mounted to the sandbox.
Start PyTerrier with version=5.7, helper_version=0.0.7, no_download=True
No settings given in /root/.tira/.tira-settings.json. I will use defaults.


PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 893 kB/s eta 0:00:01
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Step 2: Stopword Removal

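We combine the English stopword lists of NLTK, spaCy, and scikit-learn into one custom list, save it to a file, and configure PyTerrier to use that file.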

In [3]:
# download stopwords
nltk.download('stopwords')

# Generate custom stopword list
nltk_stopwords = set(stopwords.words('english'))
nlp = spacy.load("en_core_web_sm")
spacy_stopwords = set(nlp.Defaults.stop_words)
sklearn_stopwords = set(ENGLISH_STOP_WORDS)
combined_stopwords = set.union(nltk_stopwords, spacy_stopwords, sklearn_stopwords)

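# Create and save the stopword file (make sure the target directory exists)
file_path = "../custom-stopwords/custom_stopwords.txt"
os.makedirs(os.path.dirname(file_path), exist_ok=True)

# Write one stopword per line, sorted so that the file is reproducible
with open(file_path, 'w', encoding='utf-8') as file:
    for element in sorted(combined_stopwords):
        file.write(element + "\n")

# Set property for stopword file in PyTerrier
pt.set_property('stopwords.filename', file_path)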

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
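
To illustrate what such a list does to a text, here is a minimal sketch of stopword filtering in plain Python. The tiny `toy_stopwords` set and the `remove_stopwords` helper are hypothetical stand-ins for illustration only; PyTerrier applies its own tokenisation and stopword handling internally, which may behave differently.

```python
# Hypothetical toy list; the notebook's real list is the union of the
# NLTK, spaCy, and scikit-learn English stopword lists.
toy_stopwords = {"the", "of", "a", "for", "in", "to"}


def remove_stopwords(text, stopword_set):
    """Lowercase, split on whitespace, and drop tokens found in stopword_set."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopword_set]


filtered = remove_stopwords(
    "Improving the effectiveness of a retrieval system", toy_stopwords
)
print(filtered)
```

A larger stopword list shrinks the index and removes very frequent terms from matching, at the risk of dropping words that carry meaning in some queries.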


### Step 3: Load the Dataset
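We load the training dataset through PyTerrier's `ir_datasets` integration. It provides both the document corpus and the topics.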

In [4]:
print('Loading Dataset...')
# Creates an IRDSDataset object that wraps the ir_datasets dataset with the given identifier.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
print('Dataset loaded.')

# TODO implement Query Expansion

Loading Dataset...
Load ir_dataset "ir-lab-sose-2024/ir-acl-anthology-20240504-training" from tira.
Dataset loaded.


### Step 4: Index Building
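We build a Terrier index over the corpus with `pt.IterDictIndexer`, passing the combined stopword list so that these terms are excluded. The index is written to `/tmp/index` and overwritten on every run.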

In [6]:
print('Building Index...')

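def create_index(corpus_iter, stopword_list):
    # Parameter names avoid shadowing the global pt_dataset and the nltk `stopwords` import.
    # `meta` lists the document fields stored in the index and the length reserved for each.
    indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, meta={'docno': 100, 'text': 20480}, stopwords=stopword_list)
    index_ref = indexer.index(corpus_iter)
    return pt.IndexFactory.of(index_ref)

index = create_index(pt_dataset.get_corpus_iter(), combined_stopwords)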
print('Index created.')

Building Index...
No settings given in /root/.tira/.tira-settings.json. I will use defaults.


ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:10<00:00, 11929.72it/s]


13:53:00.051 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 4 empty documents
Index created.
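
Conceptually, the indexer builds an inverted index that maps each term to the documents containing it. The following is a simplified sketch with made-up documents; the `toy_docs` entries and the `build_inverted_index` helper are hypothetical, and Terrier's actual index structures are considerably more elaborate.

```python
from collections import defaultdict

# Made-up documents in the same docno/text shape that the corpus iterator yields.
toy_docs = [
    {"docno": "d1", "text": "neural retrieval models"},
    {"docno": "d2", "text": "retrieval evaluation measures"},
]


def build_inverted_index(docs, stopword_set=frozenset()):
    """Map each non-stopword term to a dict of {docno: term frequency}."""
    postings = defaultdict(dict)
    for doc in docs:
        for token in doc["text"].lower().split():
            if token in stopword_set:
                continue
            postings[token][doc["docno"]] = postings[token].get(doc["docno"], 0) + 1
    return dict(postings)


toy_index = build_inverted_index(toy_docs)
print(toy_index["retrieval"])
```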


### Step 5: Define the Retrieval Pipeline

We define a BM25 retrieval pipeline over the index built above as our baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)
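
To give an intuition for what BM25 computes, here is a minimal sketch of a common textbook variant of the scoring function. It is an illustration under stated assumptions, not Terrier's implementation: the `bm25_score` helper, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are chosen for the example, and Terrier's formula differs in some details.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Sum, over matching query terms, of idf times a saturated, length-normalised tf."""
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freqs.get(term, 0)
        # Smoothed inverse document frequency (always positive in this variant)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Longer-than-average documents are penalised via b
        norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical two-document collection
doc_a = ["retrieval", "system", "effectiveness"]
doc_b = ["machine", "translation", "system"]
doc_freqs = {"retrieval": 1, "system": 2}
query = ["retrieval", "system"]

score_a = bm25_score(query, doc_a, doc_freqs, num_docs=2, avg_doc_len=3)
score_b = bm25_score(query, doc_b, doc_freqs, num_docs=2, avg_doc_len=3)
print(score_a, score_b)
```

In this toy collection, `doc_a` matches both query terms, including the rarer term `retrieval`, so it scores higher than `doc_b`.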

In [7]:
# definition of BM25 pipeline with stopword index
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

### Step 6: Create the Run


In [8]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:
No settings given in /root/.tira/.tira-settings.json. I will use defaults.


| | qid | query |
|---|---|---|
| 0 | 1 | retrieval system improving effectiveness |
| 1 | 2 | machine learning language identification |
| 2 | 3 | social media detect self harm |

In [9]:
print('Create run')
run = bm25(pt_dataset.get_topics('text'))
print('Done. Here are the first 10 entries of the run')
run.head(10)

Create run
Done. Here are the first 10 entries of the run


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 5868 | W05-0704 | 0 | 13.822877 | retrieval system improving effectiveness |
| 1 | 1 | 116566 | 1988.jasis_journal-ir0volumeA39A2.0 | 1 | 13.57962 | retrieval system improving effectiveness |
| 2 | 1 | 126826 | 2007.tois_journal-ir0volumeA26A1.4 | 2 | 13.548694 | retrieval system improving effectiveness |
| 3 | 1 | 116546 | 1988.jasis_journal-ir0volumeA39A3.0 | 3 | 13.312149 | retrieval system improving effectiveness |
| 4 | 1 | 74020 | 2008.ntcir_workshop-2008evia.1 | 4 | 13.163106 | retrieval system improving effectiveness |
| 5 | 1 | 94858 | 2004.cikm_conference-2004.47 | 5 | 13.025476 | retrieval system improving effectiveness |
| 6 | 1 | 81397 | 1986.sigirconf_conference-86.12 | 6 | 12.820963 | retrieval system improving effectiveness |
| 7 | 1 | 96429 | 1999.cikm_conference-99.43 | 7 | 12.812355 | retrieval system improving effectiveness |
| 8 | 1 | 111285 | 2005.trec_conference-2005.11 | 8 | 12.781339 | retrieval system improving effectiveness |
| 9 | 1 | 123051 | 2002.ipm_journal-ir0volumeA38A1.0 | 9 | 12.727434 | retrieval system improving effectiveness |


### Step 7: Persist the run file for subsequent evaluations

The output of a typical retrieval system is a run file. This run file can later be evaluated statistically, ideally in a different notebook.
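
For orientation, run files are commonly written in the TREC format, with one line per retrieved document: `qid Q0 docno rank score system_name`. The sketch below renders two rows from the run shown above in that layout. The `format_trec_run` helper and the `bm25-example` tag are hypothetical, and the file that tira persists may differ in detail, for example in how ranks are normalized.

```python
def format_trec_run(rows, system_name):
    """Render rows as TREC-style lines: qid Q0 docno rank score system_name."""
    lines = []
    for row in rows:
        lines.append(
            f"{row['qid']} Q0 {row['docno']} {row['rank']} {row['score']} {system_name}"
        )
    return "\n".join(lines)


# The first two results for topic 1, copied from the run table in Step 6
example_rows = [
    {"qid": "1", "docno": "W05-0704", "rank": 0, "score": 13.822877},
    {"qid": "1", "docno": "1988.jasis_journal-ir0volumeA39A2.0", "rank": 1, "score": 13.57962},
]

run_text = format_trec_run(example_rows, "bm25-example")
print(run_text)
```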

In [10]:
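# NOTE: the TypeError below indicates that the installed tira version does not accept
# `default_output`. More recent tira releases accept this keyword, so upgrading
# (pip3 install --upgrade tira) is a likely fix.
persist_and_normalize_run(run, system_name='bm25-stopwords-query-expansion', default_output='../runs')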

TypeError: persist_and_normalize_run() got an unexpected keyword argument 'default_output'