# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it receives a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_datasets](https://ir-datasets.com/) for loading the data, and then build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [94]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier


In [95]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
# stopword imports
import nltk
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# pandas for displaying topics and runs
import pandas as pd
pd.set_option('display.max_colwidth', 0)

# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()
# spacy model
!python -m spacy download en_core_web_sm

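if not pt.started():
    pt.init(boot_packages=['mam10eks:custom-terrier-token-processing:0.0.1'])

# Import autoclass unconditionally (after PyTerrier has started), so that it is
# also available when PyTerrier was already running before this cell.
from jnius import autoclass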

Collecting en-core-web-sm==3.4.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
[0m[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


### Step 2: Stopword Removal

In [97]:
# download stopwords
nltk.download('stopwords')

# Generate custom stopword list
nltk_stopwords = set(stopwords.words('english'))
nlp = spacy.load("en_core_web_sm")
spacy_stopwords = set(nlp.Defaults.stop_words)
sklearn_stopwords = set(ENGLISH_STOP_WORDS)
combined_stopwords = set.union(nltk_stopwords, spacy_stopwords, sklearn_stopwords)

# output to verify stopwords
print('a quick look at the stopwords')
print(combined_stopwords)

# Create and save the stopword file (one stopword per line)
file_path = "../custom-stopwords/custom_stopwords.txt"

# Write as UTF-8 because some stopwords contain typographic apostrophes (e.g. ’m)
with open(file_path, 'w', encoding='utf-8') as file:
    for element in sorted(combined_stopwords):
        file.write(element + "\n")

# Point PyTerrier at the custom stopword file
pt.set_property('stopwords.filename', file_path)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


a quick look at the stopwords
{"you'd", 'anyway', 'although', 'ltd', 'none', 'ever', 'always', 'call', 'out', 'are', 'in', 'describe', '’m', 'a', 'amount', 'theirs', 'that', "hadn't", 'her', 'ma', 'first', 'hasn', 'via', 'latter', 'at', 'didn', 'mustn', 'see', 'nothing', 'five', 'become', "hasn't", "should've", 'she', 'mostly', 'himself', 'almost', "weren't", 'give', 'fifty', 'yourself', 'while', 'too', 'would', 'except', 'every', 'with', 'back', 'only', 'six', 'name', 'against', 'amongst', 'what', 'until', 'eg', 'the', 'moreover', 'whose', 'other', 'meanwhile', "'m", 'herself', 'regarding', 'sometime', 'afterwards', 'noone', 'more', 'doesn', 'of', 'again', 'eleven', 'perhaps', 'say', 'me', 'seems', 'whence', '‘m', "wasn't", 'such', 'itself', 'nevertheless', 'hasnt', '’ll', 'thick', 'been', 'seem', '‘d', 'please', 'three', 'hereby', 'used', 'indeed', 'when', 'ie', 'un', 'enough', 'between', 'done', "needn't", "wouldn't", 'mill', 'etc', 'must', 'bill', 'doing', 'but', 'same', 'unless', 
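
To illustrate what such a combined list does to a query, here is a minimal sketch in plain Python. The miniature lists and the `remove_stopwords` helper are hypothetical stand-ins; Terrier applies its own tokenizer together with the stopword file configured above, so real behaviour may differ in details.

```python
def remove_stopwords(text, stopword_set):
    """Lowercase, split on whitespace, and drop tokens found in stopword_set."""
    return [tok for tok in text.lower().split() if tok not in stopword_set]

# Hypothetical miniature lists standing in for the NLTK, spaCy and scikit-learn lists.
list_a = {"the", "of", "in"}
list_b = {"the", "a", "how"}
toy_stopwords = set.union(list_a, list_b)  # duplicates such as "the" collapse in the union

filtered = remove_stopwords("How to improve the effectiveness of a retrieval system", toy_stopwords)
print(filtered)
```

Because the lists are merged with a set union, a word only needs to appear in one of the source lists to be removed.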

### Step 3: Load the Dataset

In [98]:
print('Loading Dataset...')
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
print('Dataset loaded.')

Loading Dataset...
Dataset loaded.


### Step 3b: Lemmatizer and Stemmer Helpers

In [99]:
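# Helper functions that wrap Terrier's Java term-processing classes via pyjnius.
# They are defined for experimentation; the index in Step 4 sets its stemmer directly.

# Stanford lemmatizer (assumed to be provided by the custom boot package from Step 1)
def lemmatize(t):
    lemmatizer = autoclass("org.terrier.terms.StanfordLemmatizer")()
    return lemmatizer.stem(t)

# Porter stemmer
def stem(t):
    stemmer = autoclass("org.terrier.terms.PorterStemmer")()
    return stemmer.stem(t)

# Krovetz stemmer (Lemur implementation)
def stem_krovetz(t):
    stemmer = autoclass("org.terrier.terms.LemurKrovetzStemmer")()
    return stemmer.stem(t)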

### Step 4: Index Building

In [101]:
print('Building Index...')

def create_index(documents, stopwords):
    # Index with the custom stopword list and Porter stemming.
    # The stemmer needs to be set explicitly.
    indexer = pt.IterDictIndexer(
        "/tmp/index",
        overwrite=True,
        meta={'docno': 100, 'text': 20480},
        stopwords=sorted(stopwords),  # pass the stopwords as a list of strings
        stemmer='PorterStemmer',
    )
    index_ref = indexer.index(documents)
    return pt.IndexFactory.of(index_ref)

# Note: this custom index does not lead to an improvement at the moment.
index = create_index(pt_dataset.get_corpus_iter(), combined_stopwords)

# A (pre-built) PyTerrier index loaded from TIRA
# index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)
print('Index created.')

Building Index...


ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:12<00:00, 10364.52it/s]


Index created.
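
Conceptually, the indexer turns the corpus into an inverted index that maps each term to the documents containing it. The following is a toy sketch of that idea in plain Python; `build_inverted_index` and the two example documents are hypothetical, and the sketch omits stemming, tokenization rules, and the on-disk structures that Terrier actually uses.

```python
from collections import defaultdict

def build_inverted_index(docs, stopword_set):
    """Map each non-stopword term to a dict of {docno: term frequency}."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        for tok in text.lower().split():
            if tok in stopword_set:
                continue  # stopwords never enter the index
            index[tok][docno] = index[tok].get(docno, 0) + 1
    return index

# Hypothetical toy corpus
toy_docs = {
    "d1": "improving the effectiveness of retrieval",
    "d2": "retrieval of social media posts",
}
toy_index = build_inverted_index(toy_docs, {"the", "of"})
print(dict(toy_index["retrieval"]))
```

A larger stopword list therefore shrinks the index, but it also means that queries consisting mostly of removed words can no longer match anything.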


### Step 5: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as a baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)
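
As a rough intuition for what BM25 computes, the sketch below scores a single document with a common textbook formulation of BM25. The function name, the parameter defaults (`k1=1.2`, `b=0.75`), and the toy statistics are assumptions for illustration; Terrier's own BM25 implementation differs in details, so the numbers will not match PyTerrier's scores.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25: sum over query terms of idf * saturated, length-normalized tf."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)   # term frequency in this document
        df = doc_freqs.get(term, 0)  # number of documents containing the term
        if tf == 0 or df == 0:
            continue
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score

# Hypothetical collection statistics for a three-document toy corpus
doc_freqs = {"retrieval": 2, "system": 1}
query = ["retrieval", "system"]
doc_a = ["retrieval", "system", "evaluation", "retrieval"]
doc_b = ["language", "identification", "retrieval", "models"]

score_a = bm25_score(query, doc_a, doc_freqs, num_docs=3, avg_doc_len=4)
score_b = bm25_score(query, doc_b, doc_freqs, num_docs=3, avg_doc_len=4)
print(score_a > score_b)
```

The document that matches both query terms (and the rarer term `system` in particular) receives the higher score.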

In [103]:
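# Optional Bo1 query expansion (currently disabled)
#bo1 = pt.rewrite.Bo1QueryExpansion(index)

# Retrieval models on the custom index (custom stopwords, Porter stemming)
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
tf = pt.BatchRetrieve(index, wmodel="Tf")
pl2 = pt.BatchRetrieve(index, wmodel="PL2")

#bm25_expand = bm25 >> bo1 >> bm25

# BM25 retrieves and ranks the candidates; the feature union (**) then adds the
# Tf and PL2 scores of each candidate as a 'features' column.
pipeline = bm25 >> (tf ** pl2)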
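
The composition `bm25 >> (tf ** pl2)` can be pictured with plain Python functions. This is only a conceptual sketch: `then`, `feature_union`, and the toy scorers are hypothetical names, and real PyTerrier transformers operate on DataFrames rather than lists of dicts.

```python
def then(first_stage, second_stage):
    """Stand-in for `>>`: feed the first stage's results into the second stage."""
    return lambda query: second_stage(query, first_stage(query))

def feature_union(*scorers):
    """Stand-in for `**`: attach one feature value per scorer to every candidate."""
    def apply(query, candidates):
        return [
            {**cand, "features": [scorer(query, cand["docno"]) for scorer in scorers]}
            for cand in candidates
        ]
    return apply

# Hypothetical toy components with made-up scores
def toy_bm25(query):
    return [{"docno": "d1", "rank": 0, "score": 2.5}, {"docno": "d2", "rank": 1, "score": 1.0}]

def toy_tf(query, docno):
    return {"d1": 3.0, "d2": 5.0}[docno]

def toy_pl2(query, docno):
    return {"d1": 0.9, "d2": 0.4}[docno]

toy_pipeline = then(toy_bm25, feature_union(toy_tf, toy_pl2))
toy_result = toy_pipeline("retrieval system")
for row in toy_result:
    print(row)
```

In this sketch the first-stage `rank` and `score` are left untouched and the extra scores only appear under `features`. To let those features influence the ranking, a further stage (for example a learned re-ranker) would have to consume them.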

### Step 6: Create the Run


In [104]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


|   | qid | query |
|---|-----|-------|
| 0 | 1 | retrieval system improving effectiveness |
| 1 | 2 | machine learning language identification |
| 2 | 3 | social media detect self harm |


In [105]:
print('Create run')
run = pipeline(pt_dataset.get_topics('text'))
print('Done. Here are the first 10 entries of the run')
run.head(10)

Create run
Done. Here are the first 10 entries of the run


|   | qid | docid | docno | rank | score | query | features |
|---|-----|-------|-------|------|-------|-------|----------|
| 0 | 1 | 94858 | 2004.cikm_conference-2004.47 | 0 | 15.331717 | retrieval system improving effectiveness | [22.0, 9.600797845496768] |
| 1 | 1 | 125137 | 1989.ipm_journal-ir0volumeA25A4.2 | 1 | 14.986278 | retrieval system improving effectiveness | [4.0, 9.618979393386638] |
| 2 | 1 | 84876 | 2016.ntcir_conference-2016.90 | 2 | 14.40577 | retrieval system improving effectiveness | [14.0, 8.401517143790793] |
| 3 | 1 | 5868 | W05-0704 | 3 | 13.929917 | retrieval system improving effectiveness | [15.0, 9.282615310408456] |
| 4 | 1 | 125817 | 2005.ipm_journal-ir0volumeA41A5.11 | 4 | 13.856653 | retrieval system improving effectiveness | [15.0, 7.8862767550893285] |
| 5 | 1 | 126826 | 2007.tois_journal-ir0volumeA26A1.4 | 5 | 13.800983 | retrieval system improving effectiveness | [16.0, 8.818200605377877] |
| 6 | 1 | 94415 | 2008.cikm_conference-2008.183 | 6 | 13.760831 | retrieval system improving effectiveness | [18.0, 8.10588972959852] |
| 7 | 1 | 124801 | 2006.ipm_journal-ir0volumeA42A3.2 | 7 | 13.689213 | retrieval system improving effectiveness | [17.0, 7.677016688150646] |
| 8 | 1 | 81840 | 2006.sigirconf_conference-2006.103 | 8 | 13.629453 | retrieval system improving effectiveness | [8.0, 7.669430024361412] |
| 9 | 1 | 82472 | 1998.sigirconf_conference-98.15 | 9 | 13.616212 | retrieval system improving effectiveness | [12.0, 7.7792092708026255] |


### Step 7: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be statistically evaluated, ideally in a different notebook.

In [106]:
# Note: the system name still mentions query expansion, although Bo1 is commented out in Step 5.
persist_and_normalize_run(run, system_name='bm25-stopwords-query-expansion', output_file='../runs')

Done. run file is stored under "../runs/run.txt".
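
For orientation, run files are commonly exchanged in the six-column TREC layout (`qid Q0 docno rank score tag`). The sketch below formats one row of the run shown above in that layout. Whether `persist_and_normalize_run` writes exactly these columns, and how it normalizes ranks and scores, is an assumption here; `format_trec_run` and the tag `toy-system` are hypothetical.

```python
def format_trec_run(rows, tag):
    """Render run rows as lines in the six-column TREC run layout."""
    return [
        f'{row["qid"]} Q0 {row["docno"]} {row["rank"]} {row["score"]} {tag}'
        for row in rows
    ]

# First entry of the run shown above
toy_rows = [
    {"qid": 1, "docno": "2004.cikm_conference-2004.47", "rank": 0, "score": 15.331717},
]
trec_lines = format_trec_run(toy_rows, tag="toy-system")
print(trec_lines[0])
```

The constant `Q0` column is a historical placeholder, and the final column identifies the system that produced the run.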


### Step 8: Evaluation

In [107]:
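# Read the persisted run back in. A separate name avoids overwriting the bm25 transformer from Step 5.
own_run = pt.io.read_results('../runs/run.txt')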

bm25_baseline = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 (tira-ir-starter-pyterrier)', pt_dataset)
sparse_cross_encoder = tira.pt.from_submission('ir-benchmarks/fschlatt/sparse-cross-encoder-4-512', pt_dataset)
rank_zephyr = tira.pt.from_submission('workshop-on-open-web-search/fschlatt/rank-zephyr', pt_dataset)

pt.Experiment(
    [pipeline, bm25_baseline, sparse_cross_encoder, rank_zephyr],
    pt_dataset.get_topics(),
    pt_dataset.get_qrels(),
    ["ndcg_cut.10", "recip_rank", "recall_100"],
    names=["BM25 (Own)", "BM 25 (Baseline)", "Sparse Cross Encoder", "RankZephyr"]
)

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


|   | name | ndcg_cut.10 | recip_rank | recall_100 |
|---|------|-------------|------------|------------|
| 0 | BM25 (Own) | 0.355181 | 0.564025 | 0.561727 |
| 1 | BM 25 (Baseline) | 0.374041 | 0.579877 | 0.601333 |
| 2 | Sparse Cross Encoder | 0.36646 | 0.61298 | 0.601333 |
| 3 | RankZephyr | 0.34707 | 0.568413 | 0.601333 |
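
The three measures requested in `pt.Experiment` can be sketched in plain Python as follows. This is a simplified per-query illustration with hypothetical helper names and toy judgments, using linear gains and a log2 discount for nDCG. It is not the evaluation code that PyTerrier runs, and the table above reports means over all topics rather than values for a single query.

```python
import math

def reciprocal_rank(ranking, relevant):
    """1 / position of the first relevant document, or 0 if none is retrieved."""
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / position
    return 0.0

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranking, gains, k):
    """DCG of the top k divided by the DCG of an ideal ordering of the judged documents."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1) for i, d in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical judgments: d2 is highly relevant (2), d1 is relevant (1), d3 is unjudged.
toy_gains = {"d1": 1, "d2": 2}
toy_ranking = ["d3", "d1", "d2"]

rr = reciprocal_rank(toy_ranking, set(toy_gains))
rec = recall_at_k(toy_ranking, set(toy_gains), k=2)
ndcg = ndcg_at_k(toy_ranking, toy_gains, k=10)
print(rr, rec, round(ndcg, 4))
```

Recall at 100 depends only on which documents the first stage retrieves, which is consistent with the three systems above that share the same `recall_100` value; nDCG@10 and reciprocal rank additionally depend on the order near the top of the ranking.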
