# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as the retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please run evaluations in a new, dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to access the data. On top of these, we build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.
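
As one illustration of query reformulation, the sketch below appends to a query those vocabulary words whose embeddings are close to a query term in cosine similarity. The hand-written vectors and the names `TOY_VECTORS`, `similar_words`, and `expand_query` are assumptions for illustration only; a real system would take its vectors from a trained model.

```python
import math

# Hypothetical toy embeddings; real vectors would come from a trained model.
TOY_VECTORS = {
    "retrieval": (0.9, 0.1, 0.0),
    "search": (0.8, 0.2, 0.1),
    "ranking": (0.7, 0.3, 0.2),
    "banana": (0.0, 0.1, 0.9),
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_words(word, threshold=0.95):
    """Vocabulary words with similarity >= threshold to `word`, most similar first."""
    if word not in TOY_VECTORS:
        return []
    target = TOY_VECTORS[word]
    scored = [(cosine(target, vec), other)
              for other, vec in TOY_VECTORS.items() if other != word]
    return [w for score, w in sorted(scored, reverse=True) if score >= threshold]

def expand_query(query, threshold=0.95):
    """Append close neighbours of each term (expansion rather than replacement)."""
    terms = query.split()
    expanded = list(terms)
    for term in terms:
        for neighbour in similar_words(term, threshold):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return " ".join(expanded)

print(expand_query("retrieval system"))
```

Appending neighbours keeps the original terms in the query, whereas replacing a term discards it. Which of the two helps is an empirical question for each collection.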

In [1]:
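# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier
!pip3 install spacy
# The query expansion cell below loads this spaCy model, so it must be downloaded once.
!python3 -m spacy download en_core_web_md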



In [2]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
import spacy 
import pandas as pd




In [3]:

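# Medium-sized English spaCy model; it ships with word vectors.
nlp = spacy.load('en_core_web_md')

def get_similar_words(word, threshold=0.60):
    """Return lowercase alphabetic vocabulary words whose similarity to `word` is at least `threshold`."""
    # nlp(word) returns a Doc; it is compared against each vocabulary lexeme below.
    token = nlp(word)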
    similar_words = []
    for vocab_word in nlp.vocab:
        if vocab_word.has_vector and vocab_word.is_lower and vocab_word.is_alpha:
            similarity = token.similarity(vocab_word)
            if similarity >= threshold:
                similar_words.append(vocab_word.text)
    return similar_words if similar_words else [word]

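def get_best_word(original_word, similar_words, bm25, topic, pt_dataset):
    """Return the candidate that maximizes the mean of nDCG@10, reciprocal rank, and recall@1000.

    Each candidate is substituted for `original_word` in the topic's query and
    evaluated against the qrels. Note that `str.replace` substitutes every
    occurrence of `original_word` in the query.
    """
    best_word = original_word
    best_score = -float('inf')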
    
    for word in similar_words:
        topic_copy = topic.copy()
        topic_copy['query'] = topic_copy['query'].replace(original_word, word)
        
        print(f"Testing word: {word} in query: {topic_copy['query']}")
        
        experiment = pt.Experiment(
            [bm25],
            pd.DataFrame([topic_copy]),  
            pt_dataset.get_qrels(),
            ["ndcg_cut_10", "recip_rank", "recall_1000"],
            names=["BM25 - Low Entities"]
        )
        
        print(experiment)
        
        score = experiment[['ndcg_cut_10', 'recip_rank', 'recall_1000']].mean().mean()
        
        if score > best_score:
            best_score = score
            best_word = word
    
    return best_word

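def queryExpansion(topics, bm25, pt_dataset):
    """Replace each query term by its best-scoring similar word.

    Modifies `topics` in place and returns it together with the original and
    the rewritten queries.
    """
    expandedQueries = []
    originalQueries = topics['query'].tolist()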

    for index, row in topics.iterrows():
        expandedTopic = []
        for word in row['query'].split(' '):
            similar_words = get_similar_words(word)
            best_word = get_best_word(word, similar_words, bm25, row, pt_dataset)
            expandedTopic.append(best_word)
        expandedQueries.append(' '.join(expandedTopic))
    topics['query'] = expandedQueries
    return topics, originalQueries, expandedQueries


ensure_pyterrier_is_loaded()
tira = Client()

pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
topics = pt_dataset.get_topics(variant='title')

index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

expanded_topics, original_queries, expanded_queries = queryExpansion(topics, bm25, pt_dataset)


experiment = pt.Experiment(
    [bm25],
    expanded_topics,
    pt_dataset.get_qrels(),
    ["ndcg_cut_10", "recip_rank", "recall_1000"],
    names=["BM25 - Final Entities"]
)

for original, expanded in zip(original_queries, expanded_queries):
    print(f"Original Query: {original}")
    print(f"Expanded Query: {expanded}\n")

print(experiment)

PyTerrier 0.10.1 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7



Testing word: retrieval in query: retrieval system improving effectiveness
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities      0.83578         1.0     0.966667
Testing word: system in query: retrieval system improving effectiveness
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities      0.83578         1.0     0.966667
Testing word: improving in query: retrieval system improving effectiveness
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities      0.83578         1.0     0.966667
Testing word: improving in query: retrieval system improving improving
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities     0.271088    0.333333     0.866667
Testing word: effectiveness in query: retrieval system improving effectiveness
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities      0.83578         1.0     0.966667
Testing word: that



                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.083333          1.0
Testing word: of in query: exhaustivity of index
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.083333          1.0
Testing word: index in query: exhaustivity of index
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.083333          1.0
Testing word: query in query: query optimization
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities     0.706544         0.5     0.821429
Testing word: improving in query: query improving
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities     0.078398    0.166667     0.357143
Testing word: research in query: query research
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.055556     0.107143
Testing word: recomm



                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.020833       0.4375
Testing word: what in query: what makes natural language processing natural
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.020833       0.4375
Testing word: nothin in query: nothin makes natural language processing natural
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.020833       0.4375
Testing word: that in query: that makes natural language processing natural
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.020833       0.4375
Testing word: makes in query: what makes natural language processing natural
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities          0.0    0.020833       0.4375
Testing word: natural in query: what makes natural language processing natural
  



Testing word: algorithms in query: lemmatization algorithms
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities     0.414123         0.5          1.0
Testing word: algorithm in query: lemmatization algorithm
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities     0.414123         0.5          1.0
Testing word: analysis in query: lemmatization analysis
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities      0.33916         1.0          1.0
Testing word: graph in query: lemmatization graph
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities     0.414123         0.5          1.0
Testing word: optimization in query: lemmatization optimization
                  name  ndcg_cut_10  recip_rank  recall_1000
0  BM25 - Low Entities     0.414123         0.5          1.0
Testing word: techniques in query: lemmatization techniques
                  name  ndcg_cut_10  recip

In [4]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

### Step 2: Load the Dataset and the Index

The index object that we load has the type `<class 'jnius.reflect.org.terrier.structures.Index'>`, which is in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped in Python. You do not need to worry about this: from here on, we simply use the provided index object in procedures defined in Python.

In [5]:
# The dataset: the union of the IR Anthology and the ACL Anthology
# This line creates an IRDSDataset object and registers it under the name provided as an argument.
pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
print(pt_dataset)
topics = pt_dataset.get_topics(variant='title')

# Link entities in the topic titles via the TIRA entity-linking component.
query_entity_linking = tira.pt.transform_queries('ir-benchmarks/marcel-gohsen/entity-linking', pt_dataset)
linked_queries = query_entity_linking(topics)  # compute once and reuse below
print(linked_queries.iloc[1].to_dict())
print(linked_queries.iloc[0].to_dict())

# Split the topics by the number of linked entities (fewer than 17 counts as "low").
lowEntity = []
highEntity = []

for i in range(len(linked_queries)):
    entities = linked_queries.iloc[i].to_dict().get('entities')
    if entities is not None and len(entities) < 17:
        lowEntity.append(entities)
    elif entities is not None:
        highEntity.append(entities)

print(len(lowEntity))
print(len(highEntity))

index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
bm25QE = bm25 >> query_entity_linking # Test with the "<<" operator; a manual approach is being tried in the Python file 'main.py'.

IRDSDataset('ir-lab-sose-2024/ir-acl-anthology-20240504-training')
{'qid': '2', 'query': 'machine learning language identification', 'original_query': {'query_id': '2', 'title': 'machine learning language identification', 'description': 'What papers are about machine learning for language identification?', 'narrative': 'Relevant papers include research on methods of machine learning for language identification or how to improve those methods. Papers that focus on other methods for language identification or the usaged of machine learning not for language identification are not relevant.'}, 'entities': [{'begin': 17, 'end': 40, 'mention': 'language identification', 'url': 'https://en.wikipedia.org/wiki/Language_identification', 'score': 1.0}, {'begin': 0, 'end': 16, 'mention': 'machine learning', 'url': 'https://en.wikipedia.org/wiki/Machine_learning', 'score': 0.9745664739884391}, {'begin': 8, 'end': 16, 'mention': 'learning', 'url': 'https://en.wikipedia.org/wiki/Learning', 'score': 0

### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as the baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)
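
To make the ranking function concrete, here is a minimal pure-Python sketch of BM25 scoring over a toy corpus. It follows a common textbook formulation with `k1 = 1.2`, `b = 0.75`, and a smoothed, non-negative idf; Terrier's implementation differs in details, so these scores are not comparable to the ones in this notebook. The corpus and all names are illustrative assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (doc id -> text) against `query` with BM25."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed idf that stays non-negative (Lucene-style variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

# Hypothetical toy corpus.
toy_docs = {
    "d1": "improving the effectiveness of a retrieval system",
    "d2": "neural machine translation with attention",
    "d3": "query optimization for retrieval",
}
toy_scores = bm25_scores("retrieval system", toy_docs)
ranking = sorted(toy_scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

In this toy corpus, `d1` matches both query terms and ranks first, `d3` matches only `retrieval`, and `d2` matches nothing and scores zero.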

In [6]:
# Experiment without query expansion

pt.Experiment(
    [bm25],
    pt_dataset.get_topics(),
    pt_dataset.get_qrels(),
    ["ndcg_cut.10", "recip_rank", "recall_1000"],
    names=["BM25"]
)

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


|  | name | ndcg_cut.10 | recip_rank | recall_1000 |
|---|---|---|---|---|
| 0 | BM25 | 0.374041 | 0.579877 | 0.825376 |
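
The three measures reported above can be sketched in pure Python for a single query. This is a simplified illustration that assumes graded relevance labels in a dict and a linear gain for nDCG; the evaluation tooling behind `pt.Experiment` may handle gains and ties differently. All names and the toy data are hypothetical.

```python
import math

def reciprocal_rank(ranked_docnos, qrels):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, docno in enumerate(ranked_docnos, start=1):
        if qrels.get(docno, 0) > 0:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_docnos, qrels, k):
    """Fraction of all relevant documents that appear in the top k."""
    relevant = {docno for docno, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_docnos[:k])) / len(relevant)

def ndcg_at_k(ranked_docnos, qrels, k):
    """DCG of the top k divided by the DCG of an ideal ordering (linear gain)."""
    def dcg(gains):
        return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))
    gains = [qrels.get(docno, 0) for docno in ranked_docnos[:k]]
    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical judgments and ranking for one query.
toy_qrels = {"a": 1, "b": 0, "c": 1}
toy_ranking = ["b", "a", "d", "c"]

print(reciprocal_rank(toy_ranking, toy_qrels))
print(recall_at_k(toy_ranking, toy_qrels, 3))
print(ndcg_at_k(toy_ranking, toy_qrels, 10))
```

The values in the table are averages of such per-query scores over all topics.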


In [7]:
# Experiment with query expansion

pt.Experiment(
    [bm25QE],
    pt_dataset.get_topics(),
    pt_dataset.get_qrels(),
    ["ndcg_cut.10", "recip_rank", "recall_1000"],
    names=["BM25"]
)

There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


|  | name | ndcg_cut.10 | recip_rank | recall_1000 |
|---|---|---|---|---|
| 0 | BM25 | 0.374041 | 0.579877 | 0.825376 |
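
The two experiments report identical numbers, which is consistent with how `>>` composes stages: the left stage runs first and its output feeds the right stage, so in `bm25 >> query_entity_linking` the ranking is produced before any entities are linked. The toy classes below mimic that left-to-right composition in plain Python; they are hypothetical stand-ins, not PyTerrier's implementation.

```python
class Stage:
    """Toy stand-in for a pipeline stage that supports `>>` composition."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # `a >> b` applies a first, then b on a's output.
        return Stage(lambda rows: other(self(rows)))

trace = []  # records the order in which the stages actually run

def toy_retrieve(rows):
    trace.append("retrieve")
    return [dict(row, docno=f"doc-for-{row['query']}") for row in rows]

def toy_link(rows):
    trace.append("link")
    return [dict(row, entities=[]) for row in rows]

retrieve, link = Stage(toy_retrieve), Stage(toy_link)
toy_topics = [{"qid": "1", "query": "retrieval system"}]

(retrieve >> link)(toy_topics)
order_a = list(trace)
trace.clear()

(link >> retrieve)(toy_topics)
order_b = list(trace)

print(order_a, order_b)
```

A query-side stage has to sit on the left of the retriever to influence the ranking, and even then it only changes the results if it rewrites the `query` column rather than merely annotating the rows.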


### Step 4: Create the Run


In [8]:
print('First, we have a short look at the first three topics:')

pt_dataset.get_topics('text').head(3)

First, we have a short look at the first three topics:


|  | qid | query |
|---|---|---|
| 0 | 1 | retrieval system improving effectiveness |
| 1 | 2 | machine learning language identification |
| 2 | 3 | social media detect self harm |


In [9]:
print('Now we do the retrieval...')
run = bm25(pt_dataset.get_topics('text'))

print('Done. Here are the first 10 entries of the run')
run.head(10)

Now we do the retrieval...
Done. Here are the first 10 entries of the run


|  | qid | docid | docno | rank | score | query |
|---|---|---|---|---|---|---|
| 0 | 1 | 94858 | 2004.cikm_conference-2004.47 | 0 | 15.681777 | retrieval system improving effectiveness |
| 1 | 1 | 125137 | 1989.ipm_journal-ir0volumeA25A4.2 | 1 | 15.04738 | retrieval system improving effectiveness |
| 2 | 1 | 125817 | 2005.ipm_journal-ir0volumeA41A5.11 | 2 | 14.144223 | retrieval system improving effectiveness |
| 3 | 1 | 5868 | W05-0704 | 3 | 14.025748 | retrieval system improving effectiveness |
| 4 | 1 | 84876 | 2016.ntcir_conference-2016.90 | 4 | 13.947994 | retrieval system improving effectiveness |
| 5 | 1 | 82472 | 1998.sigirconf_conference-98.15 | 5 | 13.901647 | retrieval system improving effectiveness |
| 6 | 1 | 94415 | 2008.cikm_conference-2008.183 | 6 | 13.808208 | retrieval system improving effectiveness |
| 7 | 1 | 17496 | O01-2005 | 7 | 13.749449 | retrieval system improving effectiveness |
| 8 | 1 | 82490 | 1998.sigirconf_conference-98.33 | 8 | 13.735541 | retrieval system improving effectiveness |
| 9 | 1 | 124801 | 2006.ipm_journal-ir0volumeA42A3.2 | 9 | 13.569263 | retrieval system improving effectiveness |


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be evaluated statistically, ideally in a different notebook.

In [10]:
# Commented out because I wanted to check with main.py what run.txt looks like
#persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')
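
For orientation, run files conventionally follow the six-column TREC layout (`qid Q0 docno rank score tag`). The sketch below formats two rows taken from the run shown in Step 4 into that layout. It only illustrates the format and is not what `persist_and_normalize_run` does internally; the helper name `to_trec_lines` is hypothetical, and ranks are passed through as given.

```python
def to_trec_lines(rows, tag):
    """Format (qid, docno, rank, score) tuples as TREC-style run lines."""
    return [f"{qid} Q0 {docno} {rank} {score} {tag}"
            for qid, docno, rank, score in rows]

# Two entries copied from the run displayed in Step 4.
rows = [
    ("1", "2004.cikm_conference-2004.47", 0, 15.681777),
    ("1", "1989.ipm_journal-ir0volumeA25A4.2", 1, 15.04738),
]
lines = to_trec_lines(rows, tag="bm25-baseline")
print("\n".join(lines))
```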