# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles and abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please run evaluations in a new, dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, to load the (pre-built) retrieval index, and [ir_datasets](https://ir-datasets.com/) to access the data. On top of these, we build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is one way to improve upon this baseline (for example, if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.
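As an illustration of the query reformulation option, here is a minimal sketch that appends synonyms to a query. The `SYNONYMS` table and the `expand_query` helper are hypothetical and not part of this notebook or of any library; a real system would more likely derive candidates from word embeddings, a thesaurus, or relevance feedback.

```python
# Hypothetical, hand-written synonym table used only for illustration.
SYNONYMS = {
    "comparison": ["evaluation"],
    "detect": ["detection", "identify"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Append synonyms of each query token, keeping the original tokens."""
    tokens = query.split()
    expanded = list(tokens)
    for token in tokens:
        for candidate in synonyms.get(token, []):
            # Avoid adding the same term twice.
            if candidate not in expanded:
                expanded.append(candidate)
    return " ".join(expanded)


print(expand_query("social media detect self harm"))
```

Appending terms instead of replacing them keeps the original information need intact, which is usually the safer choice when the synonym source is noisy.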

In [4]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier
!pip3 install spacy
!python3 -m spacy download en_core_web_md

Collecting en-core-web-md==3.7.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [5]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt
import spacy
import pandas as pd
import en_core_web_md


In [6]:

nlp = en_core_web_md.load()

def average_score(scores):
    """Return the arithmetic mean of `scores`, or 0 if there are none."""
    if len(scores) == 0:
        return 0
    return sum(scores) / len(scores)

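def get_similar_words(word, threshold=0.60):
    """Return vocabulary words whose vector similarity to `word` is >= threshold.

    Falls back to `[word]` when no candidate qualifies.
    """
    token = nlp(word)
    similar_words = []
    for vocab_word in nlp.vocab:
        # Only consider lowercase alphabetic lexemes that have a vector.
        if vocab_word.has_vector and vocab_word.is_lower and vocab_word.is_alpha:
            similarity = token.similarity(vocab_word)
            if similarity >= threshold:
                similar_words.append(vocab_word.text)
    return similar_words if similar_words else [word]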

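def get_best_word(original_word, similar_words, bm25, topic, pt_dataset):
    """Pick the candidate that maximizes the mean BM25 score of the query.

    `pt_dataset` is unused and kept only so that existing calls keep working.
    """
    tokens = topic['query'].split(' ')
    best_word = original_word
    best_score = average_score(bm25.search(topic['query'])['score'])

    for word in similar_words:
        # Swap whole tokens only: str.replace would also rewrite substrings
        # of longer words (e.g. "in" inside "stemming").
        candidate_query = ' '.join(word if t == original_word else t for t in tokens)
        score = average_score(bm25.search(candidate_query)['score'])
        if score > best_score:
            best_score = score
            best_word = word

    return best_word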

def queryExpansion(topics, bm25, pt_dataset):    
    expandedQueries = []
    originalQueries = topics['query'].tolist()

    for index, row in topics.iterrows():
        expandedTopic = []
        for word in row['query'].split(' '):
            similar_words = get_similar_words(word)
            best_word = get_best_word(word, similar_words, bm25, row, pt_dataset)
            expandedTopic.append(best_word)
        expandedQueries.append(' '.join(expandedTopic))
    topics['query'] = expandedQueries
    return topics, originalQueries, expandedQueries


ensure_pyterrier_is_loaded()
tira = Client()


pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
topics = pt_dataset.get_topics(variant='title')

index = tira.pt.index('ir-lab-sose-2024/tira-ir-starter/Index (tira-ir-starter-pyterrier)', pt_dataset)
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

expanded_topics, original_queries, expanded_queries = queryExpansion(topics, bm25, pt_dataset)

for original, expanded in zip(original_queries, expanded_queries):
    print(f"Original Query: {original}")
    print(f"Expanded Query: {expanded}\n")


# print(experiment)



Original Query: retrieval system improving effectiveness
Expanded Query: retrieval system improving effectiveness

Original Query: machine learning language identification
Expanded Query: machine learning language identification

Original Query: social media detect self harm
Expanded Query: social media detect self harm

Original Query: stemming for arabic languages
Expanded Query: stemming for arabic languages

Original Query: audio based animal recognition
Expanded Query: audio based animal recognition

Original Query: comparison different retrieval models
Expanded Query: effectiveness different retrieval models

Original Query: cache architecture
Expanded Query: cache architecture

Original Query: document scoping formula
Expanded Query: identification scoping formula

Original Query: pseudo relevance feedback
Expanded Query: pseudo relevance feedback

Original Query: how to represent natural conversations in word nets
Expanded Query: how to represent natural conversations in word n

In [7]:
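# Retrieve with the expanded queries. A topics DataFrame is an input to the
# pipeline, not a stage of it, so it is passed to bm25 directly instead of
# being composed into the pipeline with `>>`.
run = bm25(expanded_topics)
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')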


The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".


### Step 2: Load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as a baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)
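To give an intuition for what a BM25 pipeline computes, here is a minimal pure-Python sketch of a textbook BM25 variant. It is not Terrier's implementation, which differs in details such as tokenization, stemming, and the exact idf formula. The function name `bm25_scores`, the whitespace tokenizer, the smoothed idf, and the values `k1=1.2` and `b=0.75` are assumptions made for this illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 variant."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed idf that stays non-negative (an assumption of this sketch).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents need more occurrences.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "query expansion for retrieval",
    "neural machine translation",
    "retrieval models and retrieval evaluation",
]
scores = bm25_scores("retrieval models", docs)
print(scores)
```

The third document matches both query terms and therefore scores highest, while the second document shares no term with the query and scores zero.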

### Step 4: Create the Run


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be statistically evaluated, ideally in a different notebook.
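Run files conventionally follow the six-column TREC layout (query ID, the literal `Q0`, document ID, rank, score, system name). The sketch below shows that layout under stated assumptions: the helper `format_run`, the document IDs, and the scores are hypothetical, and the exact file written by `persist_and_normalize_run` may differ in details such as rank numbering.

```python
def format_run(results, system_name):
    """Turn {qid: [(docno, score), ...]} into TREC-style run lines."""
    lines = []
    for qid, scored_docs in results.items():
        # Rank documents by descending score within each query.
        ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score} {system_name}")
    return lines


# Hypothetical retrieval results for a single query.
example = {"1": [("doc-b", 7.5), ("doc-a", 12.25)]}
for line in format_run(example, "bm25-baseline"):
    print(line)
```

Keeping the run in this plain-text form makes it easy to evaluate the same output with different tools in a separate notebook.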

In [8]:
# Commented out, since main.py was used to test what run.txt looks like
#run = bm25(pt_dataset.get_topics('text'))
#persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')