# IR Lab SoSe 2024: Baseline Retrieval System

This jupyter notebook serves as baseline retrieval system that you can try to improve upon.
We will use the a corpus of scientific papers (title + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This serves Jupyter notebook only serves as retrieval system, i.e., it gets a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_dataset](https://ir-datasets.com/) to subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index can be already one way that you can try to improve upon this baseline (if you want to focus on creating good document representations). Other ways could include reformulating queries or tuning parameters or building better retrieval pipelines.

In [2]:
import pyterrier as pt
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import re

from tira.third_party_integrations import persist_and_normalize_run,  ensure_pyterrier_is_loaded
from tira.rest_api_client import Client

### Step 2: Start pyterrier

In [3]:

# Initialize PyTerrier
if not pt.started():
    pt.init()


PyTerrier 0.10.1 has loaded Terrier 5.9 (built by craigm on 2024-05-02 17:40) and terrier-helper 0.0.8

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.


### Step 3: Load dataset

In [4]:
#TODO Load real dataset from tira and process it
# Sample dataset
dataset = [
    {'docno': 'd1', 'text': 'do goldfish grow?'},
    {'docno': 'd2', 'text': 'a quick brown fox'},
    {'docno' : 'd3', 'text' : 'a brown quick fox'}
]

documents = [
    {"docno": "d1", "text": "machine learning is fun"},
    {"docno": "d2", "text": "machines are helpful"},
    {"docno": "d3", "text": "machine learning algorithms"},
    {"docno": "d4", "text": "machine learning algorithms are interesting"},
    {"docno": "d5", "text": "nothing related here"},
    {"docno": "d6", "text": "machine machine machine learning learning learning algorithms algorithms"}  # Increased term frequency
]

### Step 4: Convert documents to include ngrams
Right now, we only include bigrams
We will change to to include variable ngrams

In [5]:
#This is our ngram tokenizer. It takes a string and returns a dict of all ngrams, where each ngram is seperated by $$ so it will be parsed as one token

def tokenize_ngrams_to_dict(text, n1=1, n2=3):
    # Replace spaces with dollar signs
    text_with_dollar_signs = re.sub(r'\s+', '$', text)
    
    # Tokenize the text into words
    words = text_with_dollar_signs.split('$')
    
    # Initialize an empty Counter to hold all n-grams
    all_ngram_counts = Counter()
    
    # Loop through each n from n1 to n2
    for n in range(n1, n2 + 1):
        # Generate n-grams for the current n
        ngrams = ['$$'.join(words[i:i+n]) for i in range(len(words)-n+1)]
        
        # Update the Counter with the current n-grams
        all_ngram_counts.update(ngrams)
    
    return dict(all_ngram_counts)



In [6]:
# Apply n-gram tokenization to the dataset
for doc in documents:
    doc_1gram = tokenize_ngrams_to_dict(doc['text'], n1=1, n2= 3)

    doc['toks'] = doc_1gram
    del doc['text']  # Remove the 'text' field as it's not needed anymore

### Step 5: Create the index


In [7]:

# Initialize the IterDictIndexer with pretokenised set to True
iter_indexer = pt.IterDictIndexer("./pretokindex", overwrite=True, meta={'docno': 20}, pretokenised=True)

# Index the pretokenized dataset
index_ref = iter_indexer.index(documents)

print(f"Indexing complete: {index_ref}")
# Now you can use the index_ref as usual
index = pt.IndexFactory.of(index_ref)

print(index.getCollectionStatistics())
for term, le in index.getLexicon():
    print(term) 
    print(le.getFrequency())

# Access the MetaIndex and Lexicon
meta = index.getMetaIndex()
lexicon = index.getLexicon()



Indexing complete: <org.terrier.querying.IndexRef at 0x7a74d7b0f060 jclass=org/terrier/querying/IndexRef jself=<LocalRef obj=0x634f5e66dfe8 at 0x7a74d7c85cf0>>
Number of documents: 6
Number of terms: 38
Number of postings: 53
Number of fields: 0
Number of tokens: 60
Field names: []
Positions:   false

algorithms
4
algorithms$$algorithms
1
algorithms$$are
1
algorithms$$are$$interesting
1
are
2
are$$helpful
1
are$$interesting
1
fun
1
helpful
1
here
1
interesting
1
is
1
is$$fun
1
learning
6
learning$$algorithms
3
learning$$algorithms$$algorithms
1
learning$$algorithms$$are
1
learning$$is
1
learning$$is$$fun
1
learning$$learning
2
learning$$learning$$algorithms
1
learning$$learning$$learning
1
machine
6
machine$$learning
4
machine$$learning$$algorithms
2
machine$$learning$$is
1
machine$$learning$$learning
1
machine$$machine
2
machine$$machine$$learning
1
machine$$machine$$machine
1
machines
1
machines$$are
1
machines$$are$$helpful
1
nothing
1
nothing$$related
1
nothing$$related$$here
1
rel

### Step 6: Define the retrieval pipeline
First, we will define a method that takes a string (our query) and returns a list of all ngrams in the same $$-Format as we used during indexing

In [14]:
#takes a string and returns a list with all ngrams
def tokenize_ngrams_to_list(text, n1=1, n2=3):
    #Split the text into all individual words
    #TODO use another tokenizer here first to get rid of all special characters and to do stemming etc
    words = text.split()

    # Initialize an empty list to hold all n-grams
    all_ngrams = []
    
    # Loop through each n between n1 to n2
    for n in range(n1, n2 + 1):
        # Generate n-grams for the current n
        ngrams = ['$$'.join(words[i:i+n]) for i in range(len(words)-n+1)]
        
        # Add all current ngrams to the all_ngrams list
        all_ngrams.extend(ngrams)
    
    return all_ngrams

Now, we define our actual retrieval pipeline. It consists of two steps. First, we tokenise the query into the ngrams with the method above. Then, we do the retrieval using the standard bm25 algorithm.

In [15]:
# This transformer will tokenise the queries into the ngrams
tokenise_query_ngram = pt.rewrite.tokenise(lambda query: tokenize_ngrams_to_list(query))

# This transformer will do the retrieval using bm25
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose = True)

# This is our retrieval pipeline
retr_pipeline = tokenise_query_ngram >> pt.BatchRetrieve(index_ref)

### Step 7: Load the queries
Here, we load the queries and rewrite them to include all the ngrams, also seperated by $$

In [16]:
#TODO put multiple queries into list and run on all queries
#TODO load queries from Tira

query = 'machine learning algorithms'

# Print the new query representation with ngrams included. This is how our query will get passed to bm25
df = pd.DataFrame([{"query": query}])
transformed_df = tokenise_query_ngram.transform(df)
print(transformed_df["query"].values)

['machine learning algorithms machine$$learning learning$$algorithms machine$$learning$$algorithms']


### Step 8: Create the run


In [20]:
run = retr_pipeline.search(query)
print(run.head(10))
#results = retr_pipeline.transform(query)
#print(results)

  qid  docid docno  rank     score                      query_0  \
0   1      2    d3     0  5.327106  machine learning algorithms   
1   1      3    d4     1  4.098277  machine learning algorithms   
2   1      5    d6     2  2.982158  machine learning algorithms   
3   1      0    d1     3  1.881809  machine learning algorithms   

                                               query  
0  machine learning algorithms machine$$learning ...  
1  machine learning algorithms machine$$learning ...  
2  machine learning algorithms machine$$learning ...  
3  machine learning algorithms machine$$learning ...  


### Step 9: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later (optimally in a different notebook) be statistically evaluated.

In [22]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".


### Legacy methods that might still be useful

In [21]:

'''
# Function to tokenize query into n-grams
def query_tokenize_ngrams(query, n=2):
    return list(tokenize_ngrams(query, n=n).keys())

# Define the retrieval pipeline
class NgramTokenizeTransform:
    def __init__(self, n=2):
        self.n = n

    def transform(self, queries):
        def ngrams(text, n):
            tokens = text.split()
            return [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
        
        queries['query'] = queries['query'].apply(lambda x: ' '.join(ngrams(x, self.n)))
        return queries

ngram_tokenize = NgramTokenizeTransform(n=2)

'''


#print('Now we do the retrieval...')
#run = bm25(test_dataset.get_topics('text'))
#run = bm25([{"qid": "0", "query" : "retrieval systems"}])
#print('Done. Here are the first 10 entries of the run')
#run.head(10)


# tokenize ngram the queries
# Create a query pipeline that includes the n-gram tokenization
#ngram_tokenize = NgramTokenizeTransform(n=2)
#retr_pipe = ngram_tokenize >> bm25
'''
import pandas as pd
# Sample queries
queries = pd.DataFrame([
    {'qid': 'q1', 'query': 'do goldfish grow?'},
    {'qid': 'q2', 'query': 'quick brown fox'}
])

# Retrieve results
run = retr_pipe(queries)
run.head(10)
results = retr_pipe.transform(queries)
print(results)



#index = pt.IndexFactory.of(index_ref)
meta = index.getMetaIndex()

# List of metadata keys
meta_keys = ['docno', 'text']  # Adjust this list based on your meta fields

# Function to print document attributes
def print_doc_attributes(docid):
    attributes = {key: meta.getItem(key, docid) for key in meta_keys}
    print(f"Attributes for docid {docid}: {attributes}")

# Example: Print attributes for the first 5 documents
for docid in range(3):
    print_doc_attributes(docid)

# Function to print document attributes by docno
def print_doc_attributes_by_docno(docno):
    try:
        docid = meta.getDocument("docno", docno)
        print_doc_attributes(docid)
    except KeyError:
        print(f"Document with docno {docno} not found.")

# List of specific docnos to retrieve
#docnos = ['W05-0704']

# Retrieve and print attributes for each specified docno
#for docno in docnos:
 #   print_doc_attributes_by_docno(docno)




def tokenize_ngrams_dollar_sign_query_OLD(text, n=2):
    tokens = text.split()
    ngrams = ['$$'.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    return ngrams


 
def tokenize_ngrams_dollar_signOLD(text, n=2):
    # Replace spaces with dollar signs
    text_with_dollar_signs = re.sub(r'\s+', '$', text)
    
    # Tokenize the text into words
    words = text_with_dollar_signs.split('$')
    
    # Generate n-grams manually
    ngrams = ['$$'.join(words[i:i+n]) for i in range(len(words)-n+1)]
    
    # Count occurrences of each n-gram
    ngram_counts = Counter(ngrams)
    
    return dict(ngram_counts)




# Function to tokenize text into n-grams
def tokenize_ngrams(text, n=2):
    vectorizer = CountVectorizer(ngram_range=(n, n), token_pattern=r'\b\w+\b')
    X = vectorizer.fit_transform([text])
    ngrams = vectorizer.get_feature_names_out()
    counts = X.toarray().flatten()
    return dict(zip(ngrams, counts))

'''

'\nimport pandas as pd\n# Sample queries\nqueries = pd.DataFrame([\n    {\'qid\': \'q1\', \'query\': \'do goldfish grow?\'},\n    {\'qid\': \'q2\', \'query\': \'quick brown fox\'}\n])\n\n# Retrieve results\nrun = retr_pipe(queries)\nrun.head(10)\nresults = retr_pipe.transform(queries)\nprint(results)\n\n\n\n#index = pt.IndexFactory.of(index_ref)\nmeta = index.getMetaIndex()\n\n# List of metadata keys\nmeta_keys = [\'docno\', \'text\']  # Adjust this list based on your meta fields\n\n# Function to print document attributes\ndef print_doc_attributes(docid):\n    attributes = {key: meta.getItem(key, docid) for key in meta_keys}\n    print(f"Attributes for docid {docid}: {attributes}")\n\n# Example: Print attributes for the first 5 documents\nfor docid in range(3):\n    print_doc_attributes(docid)\n\n# Function to print document attributes by docno\ndef print_doc_attributes_by_docno(docno):\n    try:\n        docid = meta.getDocument("docno", docno)\n        print_doc_attributes(docid)\n  