# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_datasets](https://ir-datasets.com/) to subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [2]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier

In [6]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
# stopword imports
import nltk
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
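# data handling imports
import pandas as pd
pd.set_option('display.max_colwidth', 0)

# Start PyTerrier with the custom token-processing package on the classpath.
if not pt.started():
    pt.init(boot_packages=['mam10eks:custom-terrier-token-processing:0.0.1'])

# Import autoclass only after the JVM has been started, and outside the
# if-block so it is also available when PyTerrier was already running.
from jnius import autoclass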

In [5]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()
# spacy model
!python -m spacy download en_core_web_sm

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Step 2: Stopword Removal

In [7]:
# download stopwords
nltk.download('stopwords')

# Generate custom stopword list
nltk_stopwords = set(stopwords.words('english'))
nlp = spacy.load("en_core_web_sm")
spacy_stopwords = set(nlp.Defaults.stop_words)
sklearn_stopwords = set(ENGLISH_STOP_WORDS)
combined_stopwords = set.union(nltk_stopwords, spacy_stopwords, sklearn_stopwords)

# output to verify stopwords
print('a quick look at the stopwords')
print(combined_stopwords)

# Create and save the stopword file (one stopword per line)
file_path = "../custom-stopwords/custom_stopwords.txt"

# Write as UTF-8 because some stopwords contain typographic apostrophes
with open(file_path, 'w', encoding='utf-8') as file:
    for element in sorted(combined_stopwords):
        file.write(element + "\n")

# Point PyTerrier's stopword property at the file written above
pt.set_property('stopwords.filename', file_path)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


a quick look at the stopwords
{'ours', 'nobody', '‘ve', 'any', 'others', 'serious', "'m", 'formerly', 'former', 'its', 'behind', 'you', 'towards', 'not', 'toward', 'latter', 'twenty', 'onto', 'her', 'wasn', 'been', 'de', 'there', 're', 'un', 'side', 'y', 'if', 'n‘t', 'sixty', "hasn't", 'upon', 'most', 'ie', 'front', 'whose', 'amount', 'take', 'along', 'own', 'becoming', 'beforehand', 'last', '‘ll', 'few', 'cannot', "needn't", 'used', 'whereas', 'again', 'enough', 'himself', 'are', 'eight', 'seems', 'ma', "you'd", 'they', "n't", 'between', 'theirs', 'ten', 'system', 'do', 'being', 'cant', "didn't", '‘d', 'nor', 'we', 'fill', 'while', 'those', 'when', "haven't", '’ve', 'my', 'least', 'beyond', 'twelve', 'that', 'elsewhere', "won't", 'though', 'namely', 'itself', 'how', 'ain', 'make', 'unless', 'what', "mustn't", 'had', 'move', 'now', 'alone', 'weren', 'part', 'whom', 'each', 'but', 'found', 'eleven', 'indeed', 'meanwhile', 'become', 'top', 'than', 'get', 'therefore', '‘m', 'isn', 'these'
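
To make the effect of such a list concrete, the following sketch filters a topic with a small hand-picked subset of the stopwords printed above. The helper `remove_stopwords` and the whitespace tokenization are assumptions made for illustration; Terrier applies its own tokenizer and term pipeline.

```python
# Illustrative subset of the combined stopword set printed above.
sample_stopwords = {"system", "are", "these", "used", "that", "than"}


def remove_stopwords(query, stopword_set):
    """Drop every whitespace-separated token found in stopword_set (hypothetical helper)."""
    return [token for token in query.lower().split() if token not in stopword_set]


filtered = remove_stopwords("retrieval system improving effectiveness", sample_stopwords)
print(filtered)  # ['retrieval', 'improving', 'effectiveness']
```

Because `system` is part of the combined list, a topic such as "retrieval system improving effectiveness" would lose a content-bearing term whenever the list is applied, so a very aggressive stopword list is worth double-checking against the topics.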

### Step 3: Load the Dataset

In [8]:
print('Loading Dataset...')
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
print('Dataset loaded.')

Loading Dataset...
Dataset loaded.


### Step 3b: Lemmatizer and Stemmer Helpers

In [9]:
# Helper functions that wrap Terrier term-pipeline classes (Java) via pyjnius.
# Each call instantiates the Java class and processes a single token.

# Stanford lemmatizer
def lemmatize(t):
    lemmatizer = autoclass("org.terrier.terms.StanfordLemmatizer")()
    return lemmatizer.stem(t)

# Porter stemmer
def stem(t):
    stemmer = autoclass("org.terrier.terms.PorterStemmer")()
    return stemmer.stem(t)

# Lemur Krovetz stemmer
def stem_krovetz(t):
    stemmer = autoclass("org.terrier.terms.LemurKrovetzStemmer")()
    return stemmer.stem(t)

### Step 4: Index Building

In [10]:
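print('Building Index...')


def create_index(corpus_iter, stopwords):
    # Index with the custom stopwords and a Krovetz stemmer in the term pipeline.
    # NOTE: the log output of this cell reports a ClassNotFoundException for
    # org.terrier.terms.LemurKrovetzStemmer, so check that the stemmer is loaded.
    indexer = pt.IterDictIndexer(
        "/tmp/index",
        overwrite=True,
        meta={'docno': 100, 'text': 20480},
        stopwords=stopwords,
        stemmer='LemurKrovetzStemmer',
    )
    index_ref = indexer.index(corpus_iter)
    return pt.IndexFactory.of(index_ref)


index = create_index(pt_dataset.get_corpus_iter(), combined_stopwords)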
print('Index created.')

Building Index...


ir-lab-sose-2024/ir-acl-anthology-20240504-training documents:   0%|          | 0/126958 [00:00<?, ?it/s]

19:00:00.043 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer - TermPipeline object org.terrier.terms.LemurKrovetzStemmer not found: java.lang.ClassNotFoundException: org.terrier.terms.LemurKrovetzStemmer


java.lang.ClassNotFoundException: org.terrier.terms.LemurKrovetzStemmer
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at org.terrier.utility.ApplicationSetup.getClass(ApplicationSetup.java:416)
	at org.terrier.structures.indexing.Indexer.load_pipeline(Indexer.java:323)
	at org.terrier.structures.indexing.Indexer.init(Indexer.java:197)
	at org.terrier.structures.indexing.classical.BasicIndexer.<init>(BasicIndexer.java:183)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingCo

Index created.
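
As a rough mental model of what indexing produces, the sketch below builds a toy inverted index in plain Python: each term maps to the documents that contain it, together with the term frequency. The function `build_inverted_index`, the toy documents, and the whitespace tokenization are assumptions for illustration and do not reflect Terrier's actual index structures.

```python
from collections import defaultdict


def build_inverted_index(docs, stopword_set):
    """Map each non-stopword term to {docno: term frequency} (toy example)."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        for token in text.lower().split():
            if token in stopword_set:
                continue  # stopwords never reach the index
            index[token][docno] = index[token].get(docno, 0) + 1
    return dict(index)


toy_docs = {
    "d1": "retrieval of scientific papers",
    "d2": "evaluation of retrieval retrieval effectiveness",
}
toy_index = build_inverted_index(toy_docs, stopword_set={"of"})
print(toy_index["retrieval"])  # {'d1': 1, 'd2': 2}
```

Stopword removal and stemming both change which keys end up in this mapping, which is why they have to be applied consistently at indexing time and at query time.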


### Step 5: Define the Retrieval Pipeline + Query Expansion

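We define a BM25 retrieval pipeline as the baseline and extend it with Bo1 query expansion: BM25 retrieves an initial ranking, Bo1 rewrites the query using the top-ranked documents, and BM25 retrieves again with the expanded query. For details, see: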

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)
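
For orientation, BM25 scoring can be sketched in a few lines of plain Python. This uses a common textbook formulation with `k1` and `b` values chosen for illustration; Terrier's BM25 implementation differs in details, and `bm25_score` as well as the toy statistics are hypothetical.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document with a textbook BM25 formula (not Terrier's exact variant)."""
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        n_t = doc_freqs.get(term, 0)  # number of documents containing the term
        idf = math.log(1 + (num_docs - n_t + 0.5) / (n_t + 0.5))
        length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score


toy_query = ["retrieval", "evaluation"]
doc_a = ["retrieval", "system", "evaluation"]
doc_b = ["neural", "machine", "translation"]
toy_doc_freqs = {"retrieval": 1, "evaluation": 1}

score_a = bm25_score(toy_query, doc_a, toy_doc_freqs, num_docs=2, avg_doc_len=3)
score_b = bm25_score(toy_query, doc_b, toy_doc_freqs, num_docs=2, avg_doc_len=3)
print(score_a > score_b)
```

The parameters `k1` (term-frequency saturation) and `b` (length normalization) are the usual candidates when tuning BM25.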

In [11]:
# Bo1 query expansion based on the index built above
bo1 = pt.rewrite.Bo1QueryExpansion(index)

# BM25 retrieval on the custom index
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

# Retrieve, expand the query with Bo1, then retrieve again
bm25_expand = bm25 >> bo1 >> bm25

19:00:17.429 [main] ERROR org.terrier.terms.BaseTermPipelineAccessor - TermPipeline object org.terrier.terms.LemurKrovetzStemmer not found
java.lang.ClassNotFoundException: org.terrier.terms.LemurKrovetzStemmer
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at org.terrier.utility.ApplicationSetup.getClass(ApplicationSetup.java:421)
	at org.terrier.terms.BaseTermPipelineAccessor.<init>(BaseTermPipelineAccessor.java:70)
	at org.terrier.querying.ApplyTermPipeline.load_pipeline(ApplyTermPipeline.java:90)
	at org.terrier.querying.ApplyTermPipeline.<init>(ApplyTermPipeline.java:80)
	at org.terrier.querying.ApplyTermPipeline.<init>(ApplyTermPipeline.java:74)


### Step 6: Create the Run


In [12]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


| | qid | query |
|---|---|---|
| 0 | 1 | retrieval system improving effectiveness |
| 1 | 2 | machine learning language identification |
| 2 | 3 | social media detect self harm |


In [13]:
print('Create run')
run = bm25_expand(pt_dataset.get_topics('text'))
print('Done. Here are the first 10 entries of the run')
run.head(10)

Create run
19:00:21.865 [main] ERROR org.terrier.terms.BaseTermPipelineAccessor - TermPipeline object org.terrier.terms.LemurKrovetzStemmer not found
java.lang.ClassNotFoundException: org.terrier.terms.LemurKrovetzStemmer
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
	at java.base/java.lang.Class.forName0(Native Method)
	at java.base/java.lang.Class.forName(Class.java:398)
	at org.terrier.utility.ApplicationSetup.getClass(ApplicationSetup.java:421)
	at org.terrier.terms.BaseTermPipelineAccessor.<init>(BaseTermPipelineAccessor.java:70)
	at org.terrier.querying.ApplyTermPipeline.load_pipeline(ApplyTermPipeline.java:90)
	at org.terrier.querying.ApplyTermPipeline.getPipeline(ApplyTermPipeline.java:150)
	at org.terrier.querying.ApplyTermPipeline.process(ApplyTermPipeline.java:162)
	at 

| | qid | docid | docno | rank | score | query_0 | query |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 91391 | 2004.ecir_conference-2004.21 | 0 | 19.12088 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 1 | 1 | 123051 | 2002.ipm_journal-ir0volumeA38A1.0 | 1 | 19.034918 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 2 | 1 | 90655 | 2017.airs_conference-2017.1 | 2 | 19.026703 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 3 | 1 | 125137 | 1989.ipm_journal-ir0volumeA25A4.2 | 3 | 19.013045 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 4 | 1 | 53327 | O07-2010 | 4 | 18.890801 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 5 | 1 | 84876 | 2016.ntcir_conference-2016.90 | 5 | 18.700888 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 6 | 1 | 78846 | 2016.iir_workshop-2016.13 | 6 | 18.355955 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 7 | 1 | 125817 | 2005.ipm_journal-ir0volumeA41A5.11 | 7 | 17.709968 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 8 | 1 | 126826 | 2007.tois_journal-ir0volumeA26A1.4 | 8 | 16.507355 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |
| 9 | 1 | 121640 | 2005.sigirjournals_journal-ir0volumeA39A2.12 | 9 | 16.231794 | retrieval system improving effectiveness | applypipeline:off retrieval^1.248593174 system^1.049840886 improving^1.173257422 effectiveness^1.301538472 by^0.075992356 the^0.062643036 of^0.041941845 multiple^0.000000000 discussed^0.000000000 use^0.000000000 |


### Step 7: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be statistically evaluated, ideally in a separate notebook.
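
For readers unfamiliar with run files, the sketch below writes and parses one line, assuming the standard six-column TREC run format (`qid Q0 docno rank score tag`). The helpers `format_run_line` and `parse_run_line` are hypothetical and the values are illustrative; the exact normalization applied when the run is persisted is not shown here.

```python
def format_run_line(qid, docno, rank, score, tag):
    """Render one ranking entry as a TREC-style run line (assumed format)."""
    return f"{qid} Q0 {docno} {rank} {score} {tag}"


def parse_run_line(line):
    """Split a TREC-style run line back into its named fields."""
    qid, _, docno, rank, score, tag = line.split()
    return {"qid": qid, "docno": docno, "rank": int(rank), "score": float(score), "tag": tag}


line = format_run_line("1", "2004.ecir_conference-2004.21", 1, 19.12088, "bm25-stopwords-query-expansion")
print(line)
print(parse_run_line(line)["docno"])
```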

In [14]:
persist_and_normalize_run(run, system_name='bm25-stopwords-query-expansion', output_file='../runs')

Done. run file is stored under "../runs/run.txt".


### Step 8: Custom Evaluation

In [15]:
# Load our own run from disk (named own_run so the BM25 retriever defined earlier is not overwritten)
own_run = pt.io.read_results('../runs/run.txt')

# Load reference systems from TIRA submissions for comparison
bm25_baseline = tira.pt.from_submission('ir-benchmarks/tira-ir-starter/BM25 (tira-ir-starter-pyterrier)', pt_dataset)
sparse_cross_encoder = tira.pt.from_submission('ir-benchmarks/fschlatt/sparse-cross-encoder-4-512', pt_dataset)
rank_zephyr = tira.pt.from_submission('workshop-on-open-web-search/fschlatt/rank-zephyr', pt_dataset)

pt.Experiment(
    [own_run, bm25_baseline, sparse_cross_encoder, rank_zephyr],
    pt_dataset.get_topics(),
    pt_dataset.get_qrels(),
    ["ndcg_cut.10", "recip_rank", "recall_100"],
    names=["BM25 (Own)", "BM 25 (Baseline)", "Sparse Cross Encoder", "RankZephyr"]
)

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


| | name | ndcg_cut.10 | recip_rank | recall_100 |
|---|---|---|---|---|
| 0 | BM25 (Own) | 0.274717 | 0.463803 | 0.459857 |
| 1 | BM 25 (Baseline) | 0.374041 | 0.579877 | 0.601333 |
| 2 | Sparse Cross Encoder | 0.36646 | 0.61298 | 0.601333 |
| 3 | RankZephyr | 0.34707 | 0.568413 | 0.601333 |
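
To clarify what the three measures capture, here is a minimal per-topic sketch in plain Python. It assumes the usual conventions (linear gain and a log2 rank discount for nDCG); the function names and the toy ranking are hypothetical, and the numbers reported above come from `pt.Experiment`, not from this code.

```python
import math


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for position, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranking, relevant, k):
    """Fraction of the relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranking[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranking, gains, k):
    """DCG of the top k divided by the DCG of an ideal ordering."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 1) for i, d in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


toy_ranking = ["d3", "d1", "d2"]
toy_gains = {"d1": 1, "d2": 1}  # graded relevance labels for one topic
toy_relevant = set(toy_gains)

rr = reciprocal_rank(toy_ranking, toy_relevant)
rec = recall_at_k(toy_ranking, toy_relevant, k=2)
ndcg = ndcg_at_k(toy_ranking, toy_gains, k=3)
print(rr, rec, round(ndcg, 4))
```

An experiment reports the mean of each per-topic value over all topics.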
