# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_datasets](https://ir-datasets.com/) for accessing the data. We subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.
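
As a rough illustration of what query reformulation could look like, the following sketch appends synonyms to a query before it is sent to the retrieval system. The synonym table and the function name `expand_query` are hypothetical and only serve the example; in practice the expansion terms might come from a thesaurus, embeddings, or relevance feedback.

```python
from typing import Dict, List

# Hypothetical synonym table, made up for this example.
SYNONYMS: Dict[str, List[str]] = {
    "retrieval": ["search"],
    "effectiveness": ["quality"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> str:
    """Append known synonyms of each query term, keeping the original terms first."""
    terms = query.lower().split()
    expansions: List[str] = []
    for term in terms:
        for candidate in synonyms.get(term, []):
            # Avoid adding a term twice.
            if candidate not in terms and candidate not in expansions:
                expansions.append(candidate)
    return " ".join(terms + expansions)


expanded = expand_query("retrieval system improving effectiveness", SYNONYMS)
print(expanded)
```

Whether such an expansion helps or hurts depends on the topics and must be checked in the evaluation notebook.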

In [None]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
# !pip3 install tira ir-datasets python-terrier

In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt

In [3]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.0 has loaded Terrier 5.8 (built by craigm on 2023-11-01 18:05) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 2: Load the Dataset and the Index

The index object that we load has the type `<class 'jnius.reflect.org.terrier.structures.Index'>`, which is in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped in Python. You do not need to worry about this, though: at this point, we simply use the provided index object to run procedures defined in Python.

In [4]:
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

# A (pre-built) PyTerrier index loaded from TIRA
index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)
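
Because the index is a wrapped Java object, its Java methods can be called directly from Python. The following sketch collects a few collection statistics; it assumes that the loaded index exposes Terrier's `getCollectionStatistics()` accessor, and the helper name `describe_index` is made up for this example.

```python
def describe_index(index):
    """Collect basic collection statistics from a Terrier index object."""
    stats = index.getCollectionStatistics()
    return {
        "documents": stats.getNumberOfDocuments(),
        "unique_terms": stats.getNumberOfUniqueTerms(),
        "tokens": stats.getNumberOfTokens(),
        "avg_doc_length": stats.getAverageDocumentLength(),
    }


# In this notebook, after the index has been loaded:
# print(describe_index(index))
```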

In [5]:
print(dir(pt_dataset))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_configure', '_describe_component', '_irds_id', '_irds_ref', 'get_corpus', 'get_corpus_iter', 'get_corpus_lang', 'get_index', 'get_qrels', 'get_results', 'get_topics', 'get_topics_lang', 'get_topicsqrels', 'info_url', 'irds_ref']


In [6]:
print(pt_dataset.get_topics())

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.
   qid                                      text  \
0    1  retrieval system improving effectiveness   
1    2  machine learning language identification   
2    3             social media detect self-harm   
3    4             stemming for arabic languages   
4    5            audio based animal recognition   
..  ..                                       ...   
63  65         information in different language   
64  66                  Abbreviations in queries   
65  67                  lemmatization algorithms   
66  68                  filter ad rich documents   
67  18     Advancements in Information Retrieval   

                                       title  \
0   retrieval system improving effectiveness   
1   machine learning language identification   
2              social media detect self-harm   
3   

In [7]:
print(pt_dataset.get_topics()["text"])

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.
0     retrieval system improving effectiveness
1     machine learning language identification
2                social media detect self-harm
3                stemming for arabic languages
4               audio based animal recognition
                        ...                   
63           information in different language
64                    Abbreviations in queries
65                    lemmatization algorithms
66                    filter ad rich documents
67       Advancements in Information Retrieval
Name: text, Length: 68, dtype: object


In [8]:
# Sanity check: count the documents with non-empty text and print any empty ones.
it_test = iter(pt_dataset.get_corpus_iter())

sanity = 0
for doc in it_test:
    if doc["text"] != "":
        sanity += 1
    else:
        print(doc)

print(sanity)

ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:02<00:00, 57474.14it/s]

126958
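
The same check can be wrapped in a small reusable helper. This is a minimal sketch that works on any iterable of dictionaries with `docno` and `text` keys; unlike the cell above, it also treats whitespace-only texts as empty, which is a design choice of this example. The sample documents are invented.

```python
def count_texts(docs):
    """Return (number of non-empty documents, docnos of empty documents)."""
    non_empty = 0
    empty_docnos = []
    for doc in docs:
        if doc["text"].strip():
            non_empty += 1
        else:
            empty_docnos.append(doc["docno"])
    return non_empty, empty_docnos


# Invented sample documents; in the notebook you could pass pt_dataset.get_corpus_iter().
sample = [
    {"docno": "d1", "text": "A Study on Word Similarity"},
    {"docno": "d2", "text": ""},
    {"docno": "d3", "text": "   "},
]
print(count_texts(sample))
```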





In [9]:
it_test = iter(pt_dataset.get_corpus_iter())

next(it_test)

ir-lab-sose-2024/ir-acl-anthology-20240504-training documents:   0%|          | 0/126958 [00:00<?, ?it/s]

{'text': 'A Study on Word Similarity using Context Vector Models\n\n\n There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarit ies. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environment o

In [10]:
# Print the docnos of the first 21 documents in the corpus.
it_test = iter(pt_dataset.get_corpus_iter())
for i in range(21):
    print(next(it_test)["docno"])

ir-lab-sose-2024/ir-acl-anthology-20240504-training documents:   0%|          | 0/126958 [00:02<?, ?it/s]

O02-2002
L02-1310
R13-1042
W05-0819
L02-1309
R13-1044
W05-0818
L02-1313
R13-1045
W05-0821
L02-1314
R13-1046
W05-0820
L02-1315
R13-1047
W05-0823
2009.mtsummit-posters.23
L02-1316
R13-1048
W05-0822
L02-1317





In [11]:
print(pt_dataset.get_qrels()["docno"].head(20))

0      2005.ipm_journal-ir0volumeA41A1.7
1     2019.tois_journal-ir0volumeA37A1.2
2     2008.sigirconf_conference-2008.127
3      2015.ipm_journal-ir0volumeA51A5.7
4     2008.tois_journal-ir0volumeA27A1.1
5            1999.ntcir_workshop-1999.31
6       2001.sigirconf_workshop-2001w1.0
7           2018.wsdm_conference-2018.47
8        1995.sigirconf_conference-95.30
9      2006.ipm_journal-ir0volumeA42A3.0
10    2013.ipm_journal-ir0volumeA49A1.18
11    2007.ipm_journal-ir0volumeA43A1.13
12          2008.cikm_conference-2008.76
13          2004.cikm_conference-2004.47
14      2000.sigirconf_conference-2000.7
15       1997.sigirconf_conference-97.36
16    2005.ipm_journal-ir0volumeA41A4.12
17          2008.cikm_conference-2008.59
18     2013.ipm_journal-ir0volumeA49A1.7
19    2011.tois_journal-ir0volumeA29A2.0
Name: docno, dtype: object
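
The docnos printed so far come in visibly different shapes, which can be used to get a rough idea of which anthology a document comes from. The heuristic below is an assumption derived only from the identifiers shown in this notebook (identifiers containing `_conference-`, `_journal-`, or `_workshop-` are treated as IR Anthology entries), not an official rule of either anthology.

```python
from collections import Counter

# Assumed markers, based only on the docnos printed in this notebook.
IR_ANTHOLOGY_MARKERS = ("_conference-", "_journal-", "_workshop-")


def guess_source(docno):
    """Heuristically guess the anthology a docno belongs to."""
    if any(marker in docno for marker in IR_ANTHOLOGY_MARKERS):
        return "ir-anthology"
    return "acl-anthology"


docnos = [
    "O02-2002",
    "2009.mtsummit-posters.23",
    "2005.ipm_journal-ir0volumeA41A1.7",
    "2018.wsdm_conference-2018.47",
    "1999.ntcir_workshop-1999.31",
]
counts = Counter(guess_source(docno) for docno in docnos)
print(counts)
```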


In [12]:
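# Collect the docnos of all documents in the corpus.
it_test = iter(pt_dataset.get_corpus_iter())

docno_list = []
for doc in it_test:
    docno_list.append(doc["docno"])

# Docnos that have relevance judgments (qrels).
docno_qrel_list = pt_dataset.get_qrels()["docno"].tolist()

# Judged docnos that also occur in the corpus.
print(set(docno_qrel_list).intersection(set(docno_list)))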

ir-lab-sose-2024/ir-acl-anthology-20240504-training documents:   0%|          | 20/126958 [00:02<3:38:25,  9.69it/s]
ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:02<00:00, 56493.07it/s]

{'2016.sigirconf_conference-2016.43', '1984.jasis_journal-ir0volumeA35A5.2', '2011.wwwconf_conference-2011c.21', '2010.sigirconf_conference-2010.49', '2017.chiir_conference-2017.72', '2015.wwwconf_conference-2015.58', '2010.sigirconf_conference-2010.75', '2010.ntcir_workshop-2010.21', '2021.wwwconf_conference-2021c.80', '2003.wwwconf_conference-2003.32', '2013.cikm_workshop-2013esair.6', '2020.wwwconf_conference-2020.240', '2005.wwwconf_conference-2005.26', '2010.wwwconf_conference-2010.231', '2019.ipm_journal-ir0volumeA56A3.34', '2017.sigirconf_conference-2017.48', '2021.wwwconf_conference-2021.256', '2008.clef_workshop-2008.72', '2000.clef_workshop-2000.11', '2021.chiir_workshop-2021birds.9', '2021.ecir_conference-20212.27', '2019.wwwconf_conference-2019.134', '2017.wsdm_conference-2017.97', '2009.cikm_conference-2009.41', '2013.cikm_conference-2013.122', '2013.cikm_workshop-2013ueo.0', '2018.cikm_conference-2018.277', '2017.cikm_conference-2017.320', '2013.ictir_conference-2013.3', 
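
Printing the full intersection is hard to read. A more compact check is to report which share of the judged documents occurs in the corpus and which ones are missing. The following sketch does this on plain Python collections; the function name `qrels_coverage` and the missing docno in the toy data are made up for illustration.

```python
def qrels_coverage(qrel_docnos, corpus_docnos):
    """Return (share of judged docnos found in the corpus, sorted missing docnos)."""
    judged = set(qrel_docnos)
    corpus = set(corpus_docnos)
    missing = judged - corpus
    ratio = (len(judged) - len(missing)) / len(judged) if judged else 0.0
    return ratio, sorted(missing)


# Toy data; in the notebook you could pass docno_qrel_list and docno_list instead.
toy_corpus = ["O02-2002", "2004.cikm_conference-2004.47", "2008.cikm_conference-2008.76"]
toy_qrels = [
    "2004.cikm_conference-2004.47",
    "2008.cikm_conference-2008.76",
    "hypothetical-missing-doc",
]
ratio, missing = qrels_coverage(toy_qrels, toy_corpus)
print(f"{ratio:.2%} of the judged documents are in the corpus; missing: {missing}")
```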




In [13]:
print(pt_dataset.get_topicsqrels())

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.
(   qid                                      text  \
0    1  retrieval system improving effectiveness   
1    2  machine learning language identification   
2    3             social media detect self-harm   
3    4             stemming for arabic languages   
4    5            audio based animal recognition   
..  ..                                       ...   
63  65         information in different language   
64  66                  Abbreviations in queries   
65  67                  lemmatization algorithms   
66  68                  filter ad rich documents   
67  18     Advancements in Information Retrieval   

                                       title  \
0   retrieval system improving effectiveness   
1   machine learning language identification   
2              social media detect self-harm   
3  

In [20]:
# Access the index structures: direct index, document index, lexicon, and meta index.
di = index.getDirectIndex()
doi = index.getDocumentIndex()
lex = index.getLexicon()
meta = index.getMetaIndex()
docid = 0
# NB: postings will be null if the document is empty
for posting in di.getPostings(doi.getDocumentEntry(docid)):
  termid = posting.getId()
  lee = lex.getLexiconEntry(termid)
  # Print each indexed (stemmed) term together with its frequency in the document.
  print("%s with frequency %d" % (lee.getKey(), posting.getFrequency()))

i with frequency 2
mixtur with frequency 1
adopt with frequency 1
exampl with frequency 1
featur with frequency 2
especi with frequency 1
classif with frequency 1
requir with frequency 1
gener with frequency 1
measur with frequency 2
idf with frequency 1
characterist with frequency 1
valu with frequency 2
languag with frequency 1
vector with frequency 3
environ with frequency 1
data with frequency 1
turn with frequency 1
differ with frequency 1
model with frequency 2
solv with frequency 1
context with frequency 5
probabilist with frequency 1
similarit with frequency 2
accord with frequency 3
weight with frequency 1
invers with frequency 1
propos with frequency 1
similar with frequency 8
appli with frequency 1
base with frequency 3
defin with frequency 1
syntact with frequency 5
word with frequency 7
us with frequency 3
categori with frequency 1
applic with frequency 1
adjust with frequency 1
algorithm with frequency 1
need with frequency 1
spars with frequency 1
process with frequency 

In [35]:
# Build a bag-of-words representation (term -> frequency) for every document.
bow = {}
for docid in range(doi.getNumberOfDocuments()):
  # Look up the docno once per document instead of once per posting.
  docno = meta.getItem("docno", docid)
  for posting in di.getPostings(doi.getDocumentEntry(docid)):
    if docno not in bow:
      bow[docno] = {}
    termid = posting.getId()
    lee = lex.getLexiconEntry(termid)
    bow[docno][lee.getKey()] = posting.getFrequency()

In [36]:
print(bow["O02-2002"])

{'i': 2, 'mixtur': 1, 'adopt': 1, 'exampl': 1, 'featur': 2, 'especi': 1, 'classif': 1, 'requir': 1, 'gener': 1, 'measur': 2, 'idf': 1, 'characterist': 1, 'valu': 2, 'languag': 1, 'vector': 3, 'environ': 1, 'data': 1, 'turn': 1, 'differ': 1, 'model': 2, 'solv': 1, 'context': 5, 'probabilist': 1, 'similarit': 2, 'accord': 3, 'weight': 1, 'invers': 1, 'propos': 1, 'similar': 8, 'appli': 1, 'base': 3, 'defin': 1, 'syntact': 5, 'word': 7, 'us': 3, 'categori': 1, 'applic': 1, 'adjust': 1, 'algorithm': 1, 'need': 1, 'spars': 1, 'process': 1, 'studi': 1, 'natur': 1, 'precis': 1, 'consid': 1, 'paper': 1, 'frequenc': 1, 'group': 2, 'agglom': 1, 'problem': 1, 'document': 1, 'approach': 2, 'semant': 6, 'distribut': 1, 'real': 1, 'occurr': 2, 'deriv': 1, 'pars': 1, 'class': 2, 'taxonomi': 2, 'cluster': 1, 'distanc': 1, 'contextu': 1, 'theoret': 1, 'onli': 1}
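
Such a bag-of-words dictionary is easy to analyze with the standard library. As a small sketch, the following snippet lists the most frequent terms of a document; the dictionary `bow_doc` is only an excerpt copied from the output above, and the helper name `top_terms` is made up for this example.

```python
from collections import Counter

# Excerpt of the bag-of-words of document O02-2002, copied from the output above.
bow_doc = {"similar": 8, "word": 7, "semant": 6, "context": 5, "syntact": 5, "vector": 3, "i": 2}


def top_terms(term_freqs, k=3):
    """Return the k most frequent terms as (term, frequency) pairs."""
    return Counter(term_freqs).most_common(k)


print(top_terms(bow_doc))
```

In the notebook, `top_terms(bow["O02-2002"])` would apply the same helper to the full document representation.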


### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as the baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

In [4]:
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
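
To build some intuition for what BM25 computes, here is a minimal sketch of the textbook scoring function in plain Python. It is not Terrier's implementation, which uses its own variant and defaults; the values `k1=1.2` and `b=0.75` are commonly used defaults, and the non-negative IDF formulation as well as all statistics in the usage example (except the corpus size and the term frequencies shown earlier) are assumptions made for illustration.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 contribution of a single query term to a document's score."""
    # Rare terms (low document frequency) receive a higher weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Documents longer than average are penalized, controlled by b.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # The term frequency saturates, controlled by k1.
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, n_docs, doc_freqs):
    """Sum the term scores over all query terms that occur in the document."""
    return sum(
        bm25_term_score(doc_tf[term], doc_len, avg_doc_len, n_docs, doc_freqs[term])
        for term in query_terms
        if term in doc_tf and term in doc_freqs
    )


# Term frequencies taken from the bag-of-words example; all other numbers are hypothetical.
doc_tf = {"word": 7, "similar": 8, "context": 5}
doc_freqs = {"word": 20000, "similar": 9000, "context": 12000}
score = bm25_score(["word", "similar"], doc_tf, doc_len=120, avg_doc_len=100,
                   n_docs=126958, doc_freqs=doc_freqs)
print(score)
```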

### Step 4: Create the Run


In [5]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:
Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/ir-lab-sose2024/ir-acl-anthology-20240504-truth.zip?download=1
	This is only used for last spot checks before archival to Zenodo.


Download: 100%|██████████| 29.6k/29.6k [00:00<00:00, 1.49MiB/s]

Download finished. Extract...
Extraction finished:  /root/.tira/extracted_datasets/ir-lab-sose-2024/ir-acl-anthology-20240504-training/





|   | qid | query |
|---|-----|-------|
| 0 | 1 | retrieval system improving effectiveness |
| 1 | 2 | machine learning language identification |
| 2 | 3 | social media detect self harm |


In [11]:
print('Now we do the retrieval...')
run = bm25(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
run.head(10)

Now we do the retrieval...
Done. Here are the first 10 entries of the run


|   | qid | docid | docno | rank | score | query |
|---|-----|-------|-------|------|-------|-------|
| 0 | 1 | 94858 | 2004.cikm_conference-2004.47 | 0 | 15.681777 | retrieval system improving effectiveness |
| 1 | 1 | 125137 | 1989.ipm_journal-ir0volumeA25A4.2 | 1 | 15.04738 | retrieval system improving effectiveness |
| 2 | 1 | 125817 | 2005.ipm_journal-ir0volumeA41A5.11 | 2 | 14.144223 | retrieval system improving effectiveness |
| 3 | 1 | 5868 | W05-0704 | 3 | 14.025748 | retrieval system improving effectiveness |
| 4 | 1 | 84876 | 2016.ntcir_conference-2016.90 | 4 | 13.947994 | retrieval system improving effectiveness |
| 5 | 1 | 82472 | 1998.sigirconf_conference-98.15 | 5 | 13.901647 | retrieval system improving effectiveness |
| 6 | 1 | 94415 | 2008.cikm_conference-2008.183 | 6 | 13.808208 | retrieval system improving effectiveness |
| 7 | 1 | 17496 | O01-2005 | 7 | 13.749449 | retrieval system improving effectiveness |
| 8 | 1 | 82490 | 1998.sigirconf_conference-98.33 | 8 | 13.735541 | retrieval system improving effectiveness |
| 9 | 1 | 124801 | 2006.ipm_journal-ir0volumeA42A3.2 | 9 | 13.569263 | retrieval system improving effectiveness |


### Step 5: Persist the Run File for Subsequent Evaluations

The output of a prototypical retrieval system is a run file. This run file can later be evaluated statistically, ideally in a different notebook.

In [20]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
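
For orientation, run files are commonly stored in the six-column TREC format (`qid Q0 docno rank score tag`). The sketch below formats a few rows of the run shown above in that layout. It assumes the standard TREC layout and is not taken from the implementation of `persist_and_normalize_run`, whose normalization (e.g., of ranks) may produce different values; the helper name `to_trec_run_lines` is made up for this example.

```python
def to_trec_run_lines(rows, tag):
    """Format run entries as lines in the standard TREC run layout."""
    return [
        f"{row['qid']} Q0 {row['docno']} {row['rank']} {row['score']} {tag}"
        for row in rows
    ]


# The first three entries of the run shown above.
rows = [
    {"qid": "1", "docno": "2004.cikm_conference-2004.47", "rank": 0, "score": 15.681777},
    {"qid": "1", "docno": "1989.ipm_journal-ir0volumeA25A4.2", "rank": 1, "score": 15.04738},
    {"qid": "1", "docno": "2005.ipm_journal-ir0volumeA41A5.11", "rank": 2, "score": 14.144223},
]
lines = to_trec_run_lines(rows, tag="bm25-baseline")
print("\n".join(lines))
```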
