# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as the retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please run evaluations in a new, dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to access the data. With these, we then build a retrieval system in [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.
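
As a rough illustration of what "reformulating queries" can mean, here is a minimal sketch in plain Python. The helper `reformulate` and the tiny `STOPWORDS` set are hypothetical and only serve this example; they are not part of the baseline and not Terrier's stopword list.

```python
# Illustrative stopword set (an assumption for this sketch, not Terrier's list)
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "to", "and"}

def reformulate(query):
    """Lowercase the query, drop stopwords, and remove duplicate terms while keeping order."""
    kept = []
    for term in query.lower().split():
        if term not in STOPWORDS and term not in kept:
            kept.append(term)
    return " ".join(kept)

print(reformulate("Improving the effectiveness of a retrieval system"))
```

In PyTerrier, a function like this could presumably be applied to the topics with `pt.apply.query` before retrieval; whether it helps effectiveness has to be checked in the evaluation notebook.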

In [5]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier


In [6]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
import pandas as pd #extra import
from sklearn.feature_extraction.text import CountVectorizer #extra import

In [7]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

In [8]:
# Load the dataset into test_dataset
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.

test_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
print(test_dataset)

IRDSDataset('ir-lab-sose-2024/ir-acl-anthology-20240504-training')


Create a custom tokenizer method.
This method takes a String as an argument and returns a List of all tokens, including ngrams between 1 and 3 words.

In [9]:
# Function that takes a string and returns a list of all n-grams in the given range.
# 1-grams: every word on its own. 2-grams: all pairs of adjacent words, and so on.
def ngram_tokenizer(text, ngram_range=(1, 3)):
    vectorizer = CountVectorizer(ngram_range=ngram_range, token_pattern=r'\b\w+\b', analyzer='word')
    analyze = vectorizer.build_analyzer()
    return analyze(text)

# Example of how a text is tokenized
print(ngram_tokenizer("Testing of Information Retrieval Systems"))

['testing', 'of', 'information', 'retrieval', 'systems', 'testing of', 'of information', 'information retrieval', 'retrieval systems', 'testing of information', 'of information retrieval', 'information retrieval systems']
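
To make explicit what the tokenizer does, the same n-gram construction can be sketched without scikit-learn. The function `simple_ngrams` is a hypothetical helper written for this example; it assumes lowercasing and the same `\b\w+\b` token pattern, and it lists n-grams from shortest to longest.

```python
import re

def simple_ngrams(text, ngram_range=(1, 3)):
    """Return all word n-grams of `text`, shortest n first (hypothetical helper)."""
    # Lowercase and split into word tokens, mirroring the token pattern used above
    tokens = re.findall(r"\b\w+\b", text.lower())
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the token list
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(simple_ngrams("Testing of Information Retrieval Systems"))
```

For a text with five words this yields 5 + 4 + 3 = 12 tokens, so the indexed vocabulary grows quickly as the upper bound of `ngram_range` increases.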


Create the iterator for our dataset. Every element of this iterator is a dict that contains a `text` and a `docno`.
See the example printed below.

In [10]:

# Create an iterator of dicts from our dataset
# Return type: Iterator[Dict[str, Any]]
corpus_iter = test_dataset.get_corpus_iter()

# Print an example element of corpus_iter.
# corpus_iter is an iterator whose elements are dicts that look like this
for doc in corpus_iter:
    if doc.get("docno") == "L02-1310":
        print(doc)

Download from the Incubator: https://files.webis.de/data-in-production/data-research/tira-zenodo-dump-preparation/ir-lab-sose2024/ir-acl-anthology-20240504-inputs.zip?download=1
	This is only used for last spot checks before archival to Zenodo.


Download: 100%|██████████| 39.4M/39.4M [00:00<00:00, 64.5MiB/s]


Download finished. Extract...
Extraction finished:  /root/.tira/extracted_datasets/ir-lab-sose-2024/ir-acl-anthology-20240504-training/


ir-lab-sose-2024/ir-acl-anthology-20240504-training documents:   4%|▍         | 5496/126958 [00:00<00:02, 54941.71it/s]

{'text': 'Bootstrapping Large Sense Tagged Corpora', 'docno': 'L02-1310'}


ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:02<00:00, 56152.48it/s]
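
One possible way to get n-grams into an index without modifying the indexer is to rewrite each document's `text` before indexing. The following is a minimal sketch under that assumption; `expand_with_ngrams` and the stand-in tokenizer `word_bigrams` are hypothetical names introduced only for this example.

```python
def expand_with_ngrams(docs, tokenizer):
    """Yield a copy of each document dict whose 'text' is the space-joined token list."""
    for doc in docs:
        expanded = dict(doc)  # copy, so the original document is left untouched
        expanded["text"] = " ".join(tokenizer(doc["text"]))
        yield expanded

def word_bigrams(text):
    # Stand-in tokenizer for this sketch: single words followed by adjacent word pairs
    words = text.lower().split()
    return words + [" ".join(pair) for pair in zip(words, words[1:])]

sample_docs = [{"docno": "L02-1310", "text": "Bootstrapping Large Sense Tagged Corpora"}]
expanded_docs = list(expand_with_ngrams(sample_docs, word_bigrams))
print(expanded_docs[0]["text"])
```

Because the generator is lazy, it can wrap the corpus iterator directly and be handed to an indexer's `index(...)` call without holding all 126,958 documents in memory.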


In [None]:
# Create a custom indexer that extends pt.IterDictIndexer
class CustomTokenizerIterDictIndexer(pt.IterDictIndexer):
    # Constructor that calls the superclass constructor and forwards the arguments
    def __init__(self, index_path, tokenizer, meta, meta_lengths):
        super().__init__(index_path, meta, meta_lengths)  # pass the parameters on to the superclass constructor
        self.tokenizer = tokenizer  # the tokenizer cannot be passed as a parameter, so it has to be set by hand like this




In [None]:
# Create an instance of the custom indexer
index_path = "./ngram_test_index"
ngram_indexer = CustomTokenizerIterDictIndexer(index_path, ngram_tokenizer, meta=['docno', 'text'], meta_lengths=[20, 4096]) 




JavaException: JVM exception occurred: java.lang.NullPointerException

In [9]:

#index = tira.pt.index('ir-lab-sose-2024-needthegrade/baseline-retrieval-system/index_ngram2', test_dataset)
        

AttributeError: 'Client' object has no attribute 'submissions_of_team'

### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

In [8]:
# index_ref must reference a built index; it is not defined in the cells shown above
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25")
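
To give an intuition for what the BM25 weighting model computes, here is a self-contained sketch of a textbook BM25 variant on a toy corpus. It is an illustration only: the function `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF form are assumptions of this sketch, and Terrier's implementation may differ in its details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with a textbook BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1 and is normalized by document length with b
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

toy_docs = [
    "improving retrieval system effectiveness",
    "machine learning for language identification",
    "neural machine translation",
]
toy_scores = bm25_scores("retrieval system effectiveness", toy_docs)
print(toy_scores)
```

Only the first toy document shares terms with the query, so it is the only one with a non-zero score. The parameters `k1` and `b` are the usual knobs when tuning BM25.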

### Step 4: Create the Run


In [9]:
print('First, we have a short look at the first three topics:')

test_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


| | qid | query |
|---|---|---|
| 0 | 1 | retrieval system improving effectiveness |
| 1 | 2 | machine learning language identification |
| 2 | 3 | social media detect self harm |


In [10]:
print('Now we do the retrieval...')
run = bm25(test_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
run.head(10)

Now we do the retrieval...
Done. Here are the first 10 entries of the run


| | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 94858 | 2004.cikm_conference-2004.47 | 0 | 19.564883 | retrieval system improving effectiveness |
| 1 | 1 | 125137 | 1989.ipm_journal-ir0volumeA25A4.2 | 1 | 19.211812 | retrieval system improving effectiveness |
| 2 | 1 | 125817 | 2005.ipm_journal-ir0volumeA41A5.11 | 2 | 19.090084 | retrieval system improving effectiveness |
| 3 | 1 | 84876 | 2016.ntcir_conference-2016.90 | 3 | 19.011221 | retrieval system improving effectiveness |
| 4 | 1 | 124801 | 2006.ipm_journal-ir0volumeA42A3.2 | 4 | 18.919271 | retrieval system improving effectiveness |
| 5 | 1 | 82472 | 1998.sigirconf_conference-98.15 | 5 | 18.856596 | retrieval system improving effectiveness |
| 6 | 1 | 74513 | 2001.clef_workshop-2001w.24 | 6 | 18.825665 | retrieval system improving effectiveness |
| 7 | 1 | 81840 | 2006.sigirconf_conference-2006.103 | 7 | 18.813841 | retrieval system improving effectiveness |
| 8 | 1 | 82490 | 1998.sigirconf_conference-98.33 | 8 | 18.794527 | retrieval system improving effectiveness |
| 9 | 1 | 17496 | O01-2005 | 9 | 18.793485 | retrieval system improving effectiveness |


In [11]:
index = pt.IndexFactory.of(index_ref)
meta = index.getMetaIndex()

# List of metadata keys
meta_keys = ['docno', 'text']  # Adjust this list based on your meta fields

# Function to print document attributes
def print_doc_attributes(docid):
    attributes = {key: meta.getItem(key, docid) for key in meta_keys}
    print(f"Attributes for docid {docid}: {attributes}")

# Example: Print attributes for the first 5 documents
for docid in range(5):
    print_doc_attributes(docid)

# Function to print document attributes by docno
def print_doc_attributes_by_docno(docno):
    try:
        docid = meta.getDocument("docno", docno)
    except KeyError:
        docid = -1
    # Assumption: the meta index signals a missing docno with a negative docid
    if docid < 0:
        print(f"Document with docno {docno} not found.")
    else:
        print_doc_attributes(docid)

# List of specific docnos to retrieve
docnos = ['W05-0704']

# Retrieve and print attributes for each specified docno
for docno in docnos:
    print_doc_attributes_by_docno(docno)

Attributes for docid 0: {'docno': 'O02-2002', 'text': 'a study on word similarity using context vector models there is a need to measure word similarity when processing natural languages especially when using generalization classification or example based approaches usually measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy the taxonomy approaches are more or less semantic based that do not consider syntactic similarit ies however in real applications both semantic and syntactic'}
Attributes for docid 1: {'docno': 'L02-1310', 'text': 'bootstrapping large sense tagged corpora bootstrapping large large sense sense tagged tagged corpora bootstrapping large sense large sense tagged sense tagged corpora'}
Attributes for docid 2: {'docno': 'R13-1042', 'text': 'headerless quoteless but not hopeless using pairwise email classification to disentangle email threads thread disentanglement is the task of separating o

### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be evaluated statistically, ideally in a different notebook.

In [None]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
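
For orientation, run files are commonly stored in the six-column TREC format (`qid Q0 docno rank score system`). The sketch below formats two rows from the run above in that layout; that `persist_and_normalize_run` writes exactly this layout is an assumption here, and `to_trec_lines` is a hypothetical helper, not part of tira.

```python
def to_trec_lines(rows, system_name):
    """Format (qid, docno, rank, score) tuples as six-column TREC run lines."""
    return [
        f"{qid} Q0 {docno} {rank} {score} {system_name}"
        for qid, docno, rank, score in rows
    ]

# Two rows taken from the run shown above
sample_rows = [
    ("1", "2004.cikm_conference-2004.47", 0, 19.564883),
    ("1", "1989.ipm_journal-ir0volumeA25A4.2", 1, 19.211812),
]
trec_lines = to_trec_lines(sample_rows, "bm25-baseline")
print("\n".join(trec_lines))
```

Evaluation tools typically read this file together with the relevance judgments (qrels), which is why the evaluation can live in a separate notebook.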
