# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it gets a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_datasets](https://ir-datasets.com/) to subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index is already one way to improve upon this baseline (if you want to focus on creating good document representations). Other options include reformulating queries, tuning parameters, or building better retrieval pipelines.

In [33]:
# You only need to execute this cell if you are using Google Colab.
# If you use GitHub Codespaces, everything is already installed.
!pip3 install tira ir-datasets python-terrier
!pip3 install spacy

In [34]:
# Imports
from tira.third_party_integrations import ensure_pyterrier_is_loaded, persist_and_normalize_run
from tira.rest_api_client import Client
import pyterrier as pt


In [35]:
def custom_stopwords():
    # Path to the stopword file
    file_path = '../stopwords/stopword-list.txt'  # Default stopword list by Terrier

    # Start with an empty list
    stopwords_list = []

    # Open the file and read it line by line
    with open(file_path, 'r') as file:
        for line in file:
            # Strip leading and trailing whitespace (including newlines)
            stripped_line = line.strip()
            if "information" in stripped_line:  # Skip entries containing "information" so the term stays searchable
                continue
            # Add the cleaned line to the list if it is not empty
            if stripped_line:
                stopwords_list.append(stripped_line)
    return stopwords_list
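
The substring test `"information" in stripped_line` above skips every entry that merely contains "information", not only the exact word. If only the exact word should be exempted, an exact-match filter is one option. The following is a minimal sketch under that assumption; the function name `filter_stopwords` and its `keep` parameter are hypothetical and not part of the notebook or of Terrier.

```python
def filter_stopwords(lines, keep=frozenset({"information"})):
    """Return cleaned stopwords, dropping blank lines and any word in `keep`.

    `keep` holds words that should stay searchable, so they are removed
    from the stopword list by exact match rather than by substring.
    """
    stopwords = []
    for line in lines:
        word = line.strip()
        if word and word not in keep:
            stopwords.append(word)
    return stopwords


# In-memory sample instead of a file, so the sketch runs anywhere.
sample = ["the\n", "information\n", "\n", "informational\n", "  of  \n"]
result = filter_stopwords(sample)
```

With this variant, "informational" stays in the stopword list while "information" is removed from it. The same function can be fed an open file handle, since iterating a file yields lines.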


In [36]:
# Create a REST client to the TIRA platform for retrieving the pre-indexed data.
ensure_pyterrier_is_loaded()
tira = Client()

pt_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')

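# Build a local index with the custom stopword list.
# blocks=True stores term positions, which phrase queries need.
iter_indexer = pt.IterDictIndexer("../index", stopwords=custom_stopwords(), meta={'docno': 50, 'text': 4096}, overwrite=True, blocks=True)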

index = iter_indexer.index(pt_dataset.get_corpus_iter())

ir-lab-sose-2024/ir-acl-anthology-20240504-training documents: 100%|██████████| 126958/126958 [00:22<00:00, 5584.87it/s] 


10:24:47.075 [ForkJoinPool-4-worker-3] WARN org.terrier.structures.indexing.Indexer - Indexed 1 empty documents


### Step 2: Define the Retrieval Pipeline and load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

We will define a BM25 retrieval pipeline as baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)
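
To make the baseline less of a black box, here is a minimal sketch of textbook BM25 scoring in plain Python. It assumes whitespace tokenization, the common default values `k1=1.2` and `b=0.75`, and a non-negative IDF variant; Terrier's implementation may differ in details such as tokenization, stemming, and the exact IDF formula. The function name `bm25_scores` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (docno -> text) against `query`."""
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))

    scores = {}
    for docno, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # The "1 +" inside the log keeps the IDF non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer documents get a larger denominator.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docno] = score
    return scores


docs = {
    "d1": "neural ranking models for information retrieval",
    "d2": "statistical machine translation",
    "d3": "information retrieval evaluation with large test collections",
}
scores = bm25_scores("information retrieval", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Both `d1` and `d3` contain each query term once, so the shorter document `d1` scores higher because of length normalization, and `d2` scores zero.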

In [47]:
# The dataset (pt_dataset, loaded above) is the union of the IR Anthology and the ACL Anthology.
# Use the topic titles as queries.

topics = pt_dataset.get_topics(variant='title')

query_entity_linking = tira.pt.transform_queries('ir-benchmarks/marcel-gohsen/entity-linking', pt_dataset)

linked_queries = query_entity_linking(topics)

bm25 = pt.BatchRetrieve("../index", wmodel="BM25")

def remove_questionmark_from_queries():
    # Strip literal question marks from all queries in one vectorized pass
    # (regex=False so that '?' is not interpreted as a regex quantifier).
    linked_queries['query'] = linked_queries['query'].str.replace('?', '', regex=False)

def keyphrase_containment_checker(input_phrase, wanted_string):
    # Check whether the wanted string is already contained in the input phrase (substring match)
    return wanted_string in input_phrase

def entity_keyphrase_length_checker(entity_keyphrase):
    # Split the string on spaces and return the number of words
    entity_word = entity_keyphrase.split(" ")
    return len(entity_word)

def add_missing_terms(input_phrase, wanted_string):
    # Append the lowercased term (plus a trailing space) unless it already occurs in the phrase
    wanted_string = wanted_string.lower()
    if not keyphrase_containment_checker(input_phrase, wanted_string):
        input_phrase += wanted_string + " "
    return input_phrase


def query_rewrite(linked_queries, score_threshold=0.9, entity_list_size=2):
    queries_entities = linked_queries['entities'].to_dict()
    for qid in queries_entities:
        query = linked_queries['query'][qid]
        query_entities = queries_entities[qid]  # List of entities linked in this query
        if len(query_entities) > 0:  # Some queries may have no entities
            # Take at most entity_list_size entities (fewer if the query has fewer).
            # keyphrases[i][0] is the entity mention, keyphrases[i][1] its score.
            keyphrases = {}
            for i in range(min(entity_list_size, len(query_entities))):
                entity_keyphrase = query_entities[i]['mention']
                entity_score = query_entities[i]['score']
                keyphrases[i] = (entity_keyphrase, entity_score)

            # Mean score of the selected entities of this query
            delta = 0
            for i in range(len(keyphrases)):
                delta += keyphrases[i][1]
            delta = delta / len(keyphrases)

            result = ""
            old_result = ""
            if delta >= score_threshold:
                for i in range(len(keyphrases)):
                    phrase = keyphrases[i][0]
                    # Quote multi-word mentions that are not already part of the rewritten query
                    if entity_keyphrase_length_checker(phrase) >= 2 and not keyphrase_containment_checker(old_result, phrase):
                        result += f'"{phrase}" '
                        old_result = result

                if len(result) != 0:
                    # Append the query words that the rewritten query does not contain yet
                    query_word_list = query.split(" ")
                    for i in range(len(query_word_list)):
                        result = add_missing_terms(result, query_word_list[i])
                    result = result.strip(" ")
                    print("Changing Query from [" + query + "] to [" + result + "]")
                    # Use .loc to avoid pandas chained assignment
                    linked_queries.loc[qid, 'query'] = result
remove_questionmark_from_queries()
query_rewrite(linked_queries, 0.9, 2)

Changing Query from [machine learning language identification] to ["language identification" "machine learning"]
Changing Query from [The Ethics of Artificial Intelligence] to ["ethics of artificial intelligence" the]
Changing Query from [machine learning for more relevant results] to ["machine learning" for more relevant results]
Changing Query from [Crawling websites using machine learning] to ["machine learning" crawling websites using]
Changing Query from [Limitations machine learning] to ["machine learning" limitations]
Changing Query from [Natural Language Processing] to ["natural language processing"]
Changing Query from [ risks of information retrieval in social media ] to ["information retrieval" "social media" risks of]
Changing Query from [processing natural language for information retrieval] to ["information retrieval" "natural language" processing]
Changing Query from [The University of Amsterdam] to ["university of amsterdam" the]
Changing Query from [ What makes Natural
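
Because `keyphrase_containment_checker` tests substrings, a query word such as "in" counts as already present once the rewritten query contains "information". That is why "in" is missing from the rewritten social media query above. The following sketch compares whole tokens instead. It assumes whitespace tokenization and lowercases everything; `rewrite_with_phrases` is a hypothetical helper, not the notebook's function, and it only skips exact duplicate phrases rather than phrases nested in longer ones.

```python
def rewrite_with_phrases(query, phrases):
    """Quote multi-word phrases, then append query words they do not cover."""
    quoted = []      # phrases already emitted, lowercased
    covered = set()  # single tokens that appear inside a quoted phrase
    for phrase in phrases:
        phrase = " ".join(phrase.lower().split())
        tokens = phrase.split(" ")
        if len(tokens) >= 2 and phrase not in quoted:
            quoted.append(phrase)
            covered.update(tokens)

    rest = []
    for word in query.lower().split():
        # Whole-token comparison: "in" is not swallowed by "information".
        if word not in covered and word not in rest:
            rest.append(word)
    return " ".join([f'"{p}"' for p in quoted] + rest)


rewritten = rewrite_with_phrases(
    " risks of information retrieval in social media ",
    ["information retrieval", "social media"],
)
```

Whether keeping such short function words helps retrieval is an empirical question, since many of them are stopwords anyway; the sketch only makes the containment behavior explicit.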

In [45]:
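# Wrap the rewritten queries in a transformer instead of relying on implicit DataFrame coercion
bm25QR = pt.Transformer.from_df(linked_queries) >> bm25
# Weighted sum of the scores of both systems
linear = 2 * bm25 + 0.5 * bm25QR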

pt.Experiment(
    [bm25, bm25QR, linear],
    pt_dataset.get_topics(),
    pt_dataset.get_qrels(),
    ["ndcg_cut.10", "recip_rank", "recall_1000", "P"],
    names=["BM25", "BM25QR", "Linear"]
)


# Results recorded for different stopword settings (same columns as the pt.Experiment output):
# name, ndcg_cut.10, recip_rank, recall_1000, P@5, P@10, P@15, P@20, P@30, P@100, P@200, P@500, P@1000

# stopwords=['a', 'an', 'the']
# 0  BM25    0.356608  0.572054  0.824595  0.326471  0.320588  0.285294  0.250000  0.203431  0.100735  0.061250  0.029500  0.016235
# 1  BM25QR  0.330551  0.548680  0.727783  0.300000  0.291176  0.259804  0.229412  0.186765  0.091618  0.055147  0.026206  0.014368
# 2  Linear  0.352036  0.581671  0.824595  0.320588  0.311765  0.287255  0.248529  0.203431  0.100294  0.060809  0.029529  0.016235

# Default stopword list
# 0  BM25    0.374041  0.579877  0.825376  0.376471  0.332353  0.311765  0.270588  0.219608  0.108382  0.063676  0.029941  0.016191
# 1  BM25QR  0.327492  0.512991  0.701300  0.329412  0.286765  0.266667  0.227941  0.186765  0.091765  0.053971  0.025176  0.013721
# 2  Linear  0.373095  0.581238  0.825376  0.370588  0.330882  0.306863  0.269118  0.217647  0.107500  0.063529  0.029853  0.016191

# stopwords/stopword-list.txt
# 0  BM25    0.367383  0.582789  0.834716  0.364706  0.323529  0.305882  0.263971  0.211275  0.106912  0.063824  0.030059  0.016500
# 1  BM25QR  0.336283  0.553027  0.724920  0.326471  0.292647  0.267647  0.227941  0.186275  0.092941  0.055809  0.026324  0.014441
# 2  Linear  0.363278  0.584142  0.834716  0.355882  0.319118  0.298039  0.259559  0.208333  0.104853  0.063456  0.029941  0.016500



There are multiple query fields available: ('text', 'title', 'query', 'description', 'narrative'). To use with pyterrier, provide variant or modify dataframe to add query column.


| | name | ndcg_cut.10 | recip_rank | recall_1000 | P@5 | P@10 | P@15 | P@20 | P@30 | P@100 | P@200 | P@500 | P@1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BM25 | 0.367383 | 0.582789 | 0.834716 | 0.364706 | 0.323529 | 0.305882 | 0.263971 | 0.211275 | 0.106912 | 0.063824 | 0.030059 | 0.0165 |
| 1 | BM25QR | 0.336283 | 0.553027 | 0.72492 | 0.326471 | 0.292647 | 0.267647 | 0.227941 | 0.186275 | 0.092941 | 0.055809 | 0.026324 | 0.014441 |
| 2 | Linear | 0.363278 | 0.584142 | 0.834716 | 0.355882 | 0.319118 | 0.298039 | 0.259559 | 0.208333 | 0.104853 | 0.063456 | 0.029941 | 0.0165 |
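
The `Linear` system combines the two rankings as `2 * bm25 + 0.5 * bm25QR`. As a rough illustration of what such a weighted score sum does, here is a plain-Python sketch. It assumes that a document missing from one run contributes a score of 0 for that run; the helper names `weighted_sum` and `rank` and the toy scores are hypothetical, and this is not PyTerrier's implementation.

```python
from collections import defaultdict


def weighted_sum(runs_with_weights):
    """Fuse runs given as [(weight, {(qid, docno): score}), ...]."""
    fused = defaultdict(float)
    for weight, run in runs_with_weights:
        for key, score in run.items():
            fused[key] += weight * score
    return dict(fused)


def rank(fused, qid):
    """Return the docnos for `qid`, highest fused score first."""
    docs = [(score, docno) for (q, docno), score in fused.items() if q == qid]
    return [docno for score, docno in sorted(docs, key=lambda pair: -pair[0])]


bm25_run = {("q1", "d1"): 10.0, ("q1", "d2"): 8.0}
qr_run = {("q1", "d2"): 9.0, ("q1", "d3"): 4.0}
fused = weighted_sum([(2.0, bm25_run), (0.5, qr_run)])
order = rank(fused, "q1")
```

With a weight of 2 on BM25 and 0.5 on the rewritten queries, the fused ranking is dominated by BM25, which is consistent with `Linear` tracking `BM25` closely in the table above. Trying other weight ratios is one of the tunable parameters.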


### Step 4: Create the Run


In [None]:
print('Now we do the retrieval...')
runDefault = bm25(pt_dataset.get_topics('text'))
runQR = bm25QR(pt_dataset.get_topics('text'))
print("Done!")

Now we do the retrieval...
Done!


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later (ideally in a different notebook) be statistically evaluated.
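
For orientation, a run file in the standard TREC format has one line per retrieved document with six columns: query ID, the literal `Q0`, document ID, rank, score, and a system tag. The sketch below formats such lines in plain Python. It assumes ranks start at 1 and uses made-up document IDs; the exact normalization performed by `persist_and_normalize_run` is not shown in this notebook, and `to_trec_run` is a hypothetical helper.

```python
def to_trec_run(results, system_name):
    """Format {qid: [(docno, score), ...]} as TREC run lines."""
    lines = []
    for qid, docs in results.items():
        # Highest score first; the rank is the 1-based position in that order.
        ranked = sorted(docs, key=lambda pair: pair[1], reverse=True)
        for position, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {position} {score} {system_name}")
    return lines


lines = to_trec_run({"1": [("docA", 7.5), ("docB", 9.25)]}, "bm25-baseline")
```

Each element of `lines` is one row of the run file, so writing them out is a matter of joining with newlines.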

In [None]:
persist_and_normalize_run(runDefault, system_name='bm25-baseline', default_output='../runs/defaultRuns')
# Give the query-rewriting run its own system name so the two runs can be told apart
persist_and_normalize_run(runQR, system_name='bm25-query-rewrite', default_output='../runs/QRRuns')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs/defaultRuns".
Done. run file is stored under "../runs/defaultRuns/run.txt".
The run file is normalized outside the TIRA sandbox, I will store it at "../runs/QRRuns".
Done. run file is stored under "../runs/QRRuns/run.txt".
