# IR Lab SoSe 2024: Baseline Retrieval System

This Jupyter notebook serves as a baseline retrieval system that you can try to improve upon.
We will use a corpus of scientific papers (titles + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This notebook serves only as a retrieval system, i.e., it takes a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries



In [1]:
!pip3 install tira ir-datasets python-terrier
import pyterrier as pt
from collections import Counter
import pandas as pd
from tira.third_party_integrations import persist_and_normalize_run,  ensure_pyterrier_is_loaded
from tira.rest_api_client import Client


### Step 2: Start PyTerrier
Only start PyTerrier if it has not been started already.

In [2]:
# Initialize PyTerrier
if not pt.started():
  pt.init()
# Call the helper (the bare name without parentheses does nothing)
ensure_pyterrier_is_loaded()
tira = Client()

PyTerrier 0.10.1 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 3: Load the dataset and convert it into a pandas DataFrame
For now, we only load a small hand-written dataset for testing purposes.
Later, this is where the proper dataset needs to be loaded from TIRA and converted into the same pandas DataFrame format.

In [26]:
test_dataset = [
    ["d1", "systems for implementing machine learning algorithms"], 
    ["d2", "algorithms that help with optimizing machine learning"], 
    ["d3", "machine learning systems"], 
    ["d4", "learning machine systems"],
    ["d5", "white quick fox"]
    ]
# For some reason, the dataset cannot be loaded in this notebook, although it works in another notebook
#real_dataset = pt.get_dataset('irds:ir-lab-sose-2024/ir-acl-anthology-20240504-training')
#print(test_dataset)

docs_df = pd.DataFrame(test_dataset, columns=["docno", "text"])

In [32]:
# Useful method for getting all ngrams in a String, returns a list of the ngrams
# Not currently used anywhere
def tokenize_ngrams(text, n=2):
    tokens = text.split()
    ngrams = [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    return ngrams
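
As a quick check of the helper above, the following sketch repeats `tokenize_ngrams` so that it runs on its own and applies it to short example strings. It splits on whitespace only, so unlike the Terrier index it applies no stemming or stopword removal.

```python
def tokenize_ngrams(text, n=2):
    """Return all word n-grams of `text` as space-joined strings."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = tokenize_ngrams("machine learning systems")
trigrams = tokenize_ngrams("machine learning systems", n=3)
too_short = tokenize_ngrams("fox", n=2)  # fewer than n tokens

print(bigrams)    # ['machine learning', 'learning systems']
print(trigrams)   # ['machine learning systems']
print(too_short)  # []
```

Note that a text with fewer than `n` tokens yields an empty list, so single-word queries produce no bigrams.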

### Step 4: Create the index
Here, we create our index. We use `DFIndexer`, which indexes pandas DataFrames. We chose it because we could not get block queries, which are integral to our n-gram approach, to work with `IterDictIndexer`, our first choice.

In [33]:
#Create Indexer 
#overwrite = True : if index already exists, delete the old one
#blocks = True : allow block queries (which we use for ngrams)
indexer = pt.DFIndexer("./test_index", overwrite=True, blocks=True, verbose=True)

# Create the index from our data. The first argument is the column to index, the second is stored as metadata
index_ref = indexer.index(docs_df["text"], docs_df["docno"])

print(f"Indexing complete: {index_ref.toString()}")

# Create the actual index which will be used for retrieval
index = pt.IndexFactory.of(index_ref)

# print some stats from our index for debugging purposes
print("\nSome useful stats about our index:")
print(index.getCollectionStatistics())

# Print the frequency of every indexed term
print("Total frequencies for each term in our index:")
for term, le in index.getLexicon():
    print(f"{le.getFrequency()} {term}")

100%|██████████| 5/5 [00:00<00:00, 17.50documents/s]

Indexing complete: ./test_index/data.properties

Some useful stats about our index:
Number of documents: 5
Number of terms: 10
Number of postings: 19
Number of fields: 0
Number of tokens: 19
Field names: []
Positions:   true

Total frequencies for each term in our index:
2 algorithm
1 fox
1 help
1 implement
4 learn
4 machin
1 optim
1 quick
3 system
1 white
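
To illustrate why positional information (`blocks=True`) matters for the n-gram approach, here is a minimal pure-Python sketch of a positional inverted index with an exact phrase match. It is not how Terrier implements its index: the function names `build_positional_index` and `phrase_match` are hypothetical, and tokens are split on whitespace without stemming or stopword removal.

```python
from collections import defaultdict

docs = {
    "d1": "systems for implementing machine learning algorithms",
    "d2": "algorithms that help with optimizing machine learning",
    "d3": "machine learning systems",
    "d4": "learning machine systems",
    "d5": "white quick fox",
}

def build_positional_index(docs):
    """Map each term to {docno: [token positions]} (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for docno, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term][docno].append(pos)
    return index

def phrase_match(index, phrase):
    """Return docnos where the words of `phrase` occur adjacently and in order."""
    terms = phrase.split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can contain the phrase
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for docno in sorted(candidates):
        for start in index[terms[0]][docno]:
            # The k-th phrase word must sit exactly k positions after the first
            if all(start + k in index[t][docno] for k, t in enumerate(terms)):
                hits.append(docno)
                break
    return hits

positional_index = build_positional_index(docs)
phrase_hits = phrase_match(positional_index, "machine learning")
print(phrase_hits)  # ['d1', 'd2', 'd3']
```

Document `d4` contains both words but in the opposite order, so it does not match the phrase. Without stored positions, this distinction could not be made.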





### Step 5: Define the retrieval pipeline

We will define a BM25 retrieval pipeline as baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

In [34]:
#bm25 = pt.BatchRetrieve(index)
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose = True)
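
For intuition about what the BM25 weighting model computes, here is a minimal pure-Python sketch on the toy documents. It is an assumption-laden illustration, not Terrier's implementation: it uses a common textbook BM25 variant with `k1=1.2` and `b=0.75`, whitespace tokens, and no stemming or stopword removal, so the absolute scores are not expected to match PyTerrier's output.

```python
import math
from collections import Counter

docs = {
    "d1": "systems for implementing machine learning algorithms",
    "d2": "algorithms that help with optimizing machine learning",
    "d3": "machine learning systems",
    "d4": "learning machine systems",
    "d5": "white quick fox",
}

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document for `query` with a textbook BM25 variant."""
    tokenized = {docno: text.split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for docno, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation penalises documents longer than average
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[docno] = score
    return scores

scores = bm25_scores("machine learning", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this sketch the two short documents `d3` and `d4` receive the same score and rank above the longer `d1` and `d2`, because BM25 ignores word order and normalises by document length.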

### Step 6: load and modify the queries
This is where we modify the queries to include n-grams. We achieve this by using specific operators from the [Terrier query language](https://github.com/terrier-org/terrier-core/blob/5.x/doc/querylanguage.md).

Right now, we only load a single example query for testing purposes. Later, this is where we first load all the queries from TIRA and then modify them to include n-grams.

In [35]:
# A quoted phrase query such as '"word1 word2"' only returns documents in which word1 and word2 appear next to each other in this order (documents without the phrase are not ranked)
#query = 'brown fox'
query = 'machine learning'
tokenise_query_ngram = pt.rewrite.tokenise(lambda query: tokenize_ngrams(query))
print()
poortokenisation = tokenise_query_ngram >> pt.BatchRetrieve(index_ref)
print()
df = pd.DataFrame([{"query": query}])
transformed_df = tokenise_query_ngram.transform(df)
print(transformed_df["query"].values)
# TODO: put multiple queries into a list and run the pipeline on all of them
# TODO: load the queries from TIRA
# TODO: add n-grams to the queries as shown, but only for real phrases; perhaps use an LLM to decide which n-grams are real English phrases and only add those,
#       or perhaps just weight all n-grams without making them mandatory



['machine$learning']
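
One possible way to add n-grams on the query side is to append each adjacent word pair as a quoted phrase, as sketched below. The helper name `add_phrase_operators` is hypothetical, and the sketch only builds the query string. How Terrier scores or filters on the quoted phrases should be checked against the query language documentation linked above.

```python
def add_phrase_operators(query):
    """Append every adjacent word pair of `query` as a quoted phrase.

    Hypothetical helper: assumes the query contains plain words only
    (no quotes or other query-language operators).
    """
    tokens = query.split()
    phrases = [f'"{a} {b}"' for a, b in zip(tokens, tokens[1:])]
    return ' '.join(tokens + phrases)

two_words = add_phrase_operators("machine learning")
three_words = add_phrase_operators("information retrieval systems")
one_word = add_phrase_operators("retrieval")

print(two_words)    # machine learning "machine learning"
print(three_words)  # information retrieval systems "information retrieval" "retrieval systems"
print(one_word)     # retrieval
```

Keeping the original terms next to the phrases is a design choice: if a phrase turns out to be a hard requirement, the phrases could instead be added selectively, as suggested in the TODO notes above.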


### Step 7: Create the run
Print the top 10 best results

In [31]:
run = poortokenisation.search(query)
print(run.head(10))

  qid  docid docno  rank     score           query_0             query
0   1      2    d3     0  0.753881  machine learning  machine learning
1   1      3    d4     1  0.753881  machine learning  machine learning
2   1      0    d1     2  0.698101  machine learning  machine learning
3   1      1    d2     3  0.698101  machine learning  machine learning


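Print the attributes of the results. The text comes back empty, most likely because only `docno` was passed to the indexer as metadata in Step 4, so no `text` field is stored in the meta index.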

In [18]:
#index = pt.IndexFactory.of(index_ref)
meta = index.getMetaIndex()

# List of metadata keys
meta_keys = ['docno', 'text']  # Adjust this list based on your meta fields

# Function to print document attributes
def print_doc_attributes(docid):
    attributes = {key: meta.getItem(key, docid) for key in meta_keys}
    print(f"Attributes for docid {docid}: {attributes}")

# Example: Print attributes for the first 3 documents
for docid in range(3):
    print_doc_attributes(docid)

# Function to print document attributes by docno
def print_doc_attributes_by_docno(docno):
    # getDocument returns a negative docid (instead of raising) if the docno is unknown
    docid = meta.getDocument("docno", docno)
    if docid < 0:
        print(f"Document with docno {docno} not found.")
    else:
        print_doc_attributes(docid)


Attributes for docid 0: {'docno': 'd1', 'text': ''}
Attributes for docid 1: {'docno': 'd2', 'text': ''}
Attributes for docid 2: {'docno': 'd3', 'text': ''}


### Step 8: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later be statistically evaluated, ideally in a different notebook.

In [19]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
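
For orientation, run files conventionally use the six-column TREC layout `qid Q0 docno rank score tag`. The sketch below formats the results shown in Step 7 in that layout. It assumes `persist_and_normalize_run` writes this conventional format. The helper `to_trec_run_lines` is hypothetical, and the actual normalisation (for example, how ranks are numbered) is up to TIRA.

```python
# Results copied from the Step 7 output above
rows = [
    {"qid": "1", "docno": "d3", "rank": 0, "score": 0.753881},
    {"qid": "1", "docno": "d4", "rank": 1, "score": 0.753881},
    {"qid": "1", "docno": "d1", "rank": 2, "score": 0.698101},
    {"qid": "1", "docno": "d2", "rank": 3, "score": 0.698101},
]

def to_trec_run_lines(rows, tag):
    """Format result rows as 'qid Q0 docno rank score tag' (hypothetical helper)."""
    return [
        f'{r["qid"]} Q0 {r["docno"]} {r["rank"]} {r["score"]:.6f} {tag}'
        for r in rows
    ]

run_lines = to_trec_run_lines(rows, tag="bm25-baseline")
for line in run_lines:
    print(line)
```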


### Legacy Code and old methods that might still be useful

In [21]:
'''
# Apply n-gram tokenization to the dataset
for doc in dataset:
    ngram_tokens1 = tokenize_ngrams(doc['text'], n=1)
    ngram_tokens2 = tokenize_ngrams(doc['text'], n=2)
    ngram_tokens = ngram_tokens1 + ngram_tokens2
    token_counts = Counter(ngram_tokens)
    doc['toks'] = token_counts
    del doc['text']  # Remove the 'text' field as it's not needed anymore
'''