# IR Lab SoSe 2024: Baseline Retrieval System

This jupyter notebook serves as baseline retrieval system that you can try to improve upon.
We will use the a corpus of scientific papers (title + abstracts) from the fields of information retrieval and natural language processing (the [IR Anthology](https://ir.webis.de/anthology/) and the [ACL Anthology](https://aclanthology.org/)). This serves Jupyter notebook only serves as retrieval system, i.e., it gets a set of information needs (topics) and a corpus as input and produces a run file as output. Please do evaluations in a new dedicated notebook.

### Step 1: Import Libraries

We will use [tira](https://www.tira.io/), an information retrieval shared task platform, for loading the (pre-built) retrieval index and [ir_dataset](https://ir-datasets.com/) to subsequently build a retrieval system with [PyTerrier](https://github.com/terrier-org/pyterrier), an open-source search engine.

Building your own index can be already one way that you can try to improve upon this baseline (if you want to focus on creating good document representations). Other ways could include reformulating queries or tuning parameters or building better retrieval pipelines.

In [21]:
import pyterrier as pt
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import re
# Initialize PyTerrier
if not pt.started():
    pt.init()

# Sample dataset
dataset = [
    {'docno': 'd1', 'text': 'do goldfish grow?'},
    {'docno': 'd2', 'text': 'a quick brown fox'},
    {'docno' : 'd3', 'text' : 'a brown quick fox'}
]

# Function to tokenize text into n-grams
def tokenize_ngrams(text, n=2):
    vectorizer = CountVectorizer(ngram_range=(n, n), token_pattern=r'\b\w+\b')
    X = vectorizer.fit_transform([text])
    ngrams = vectorizer.get_feature_names_out()
    counts = X.toarray().flatten()
    return dict(zip(ngrams, counts))

def tokenize_ngrams_dollar_sign(text, n=2):
    # Replace spaces with dollar signs
    text_with_dollar_signs = re.sub(r'\s+', '$', text)
    
    # Tokenize the text into words
    words = text_with_dollar_signs.split('$')
    
    # Generate n-grams manually
    ngrams = ['$$'.join(words[i:i+n]) for i in range(len(words)-n+1)]
    
    # Count occurrences of each n-gram
    ngram_counts = Counter(ngrams)
    
    return dict(ngram_counts)


# Apply n-gram tokenization to the dataset
for doc in dataset:
    doc['toks'] = tokenize_ngrams_dollar_sign(doc['text'], n=2)
    del doc['text']  # Remove the 'text' field as it's not needed anymore

# Initialize the IterDictIndexer with pretokenised set to True
iter_indexer = pt.IterDictIndexer("./pretokindex", meta={'docno': 20}, pretokenised=True)

# Index the pretokenized dataset
index_ref = iter_indexer.index(dataset)

print(f"Indexing complete: {index_ref}")
# Now you can use the index_ref as usual
index = pt.IndexFactory.of(index_ref)

print(index.getCollectionStatistics())
for term, le in index.getLexicon():
    print(term) 
    print(le.getFrequency())

# Access the MetaIndex and Lexicon
meta = index.getMetaIndex()
lexicon = index.getLexicon()



Indexing complete: <org.terrier.querying.IndexRef at 0x76c1f4e528e0 jclass=org/terrier/querying/IndexRef jself=<LocalRef obj=0x5ae7c75c8b28 at 0x76c1e2bb4530>>
Number of documents: 3
Number of terms: 8
Number of postings: 8
Number of fields: 0
Number of tokens: 8
Field names: []
Positions:   false

a$$brown
1
a$$quick
1
brown$$fox
1
brown$$quick
1
do$$goldfish
1
goldfish$$grow?
1
quick$$brown
1
quick$$fox
1


In [22]:
# Useful method for getting all ngrams in a String, returns a list of the ngrams
# Not currently used anywhere
def tokenize_ngrams_dollar_sign_query(text, n=2):
    tokens = text.split()
    ngrams = ['$$'.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
    return ngrams

In [24]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose = True)

#query for ngrams by typing '"word1 word2"' will only return documents where "word1 word2" appear in this constellation (document where they dont appear like this will not get ranked)
#query = 'brown fox'
query = 'brown quick fox'
tokenise_query_ngram = pt.rewrite.tokenise(lambda query: tokenize_ngrams_dollar_sign_query(query))
print()
poortokenisation = tokenise_query_ngram >> pt.BatchRetrieve(index_ref)
print()
df = pd.DataFrame([{"query": query}])
transformed_df = tokenise_query_ngram.transform(df)
print(transformed_df["query"].values)
#TODO put multiple queries into list and run on all queries
#TODO load queries from Tira
#TODO add ngrams to the queries like shown, but only for real ngrams. perhaps use LLMs to find out which ngrams are real phrases from the english language and only put those
# or perhaps just weight all the ngrams but dont make them necessary

run = poortokenisation.search(query)
print(run.head(10))



['brown$$quick quick$$fox']
  qid  docid docno  rank     score          query_0                    query
0   1      2    d3     0  1.088135  brown quick fox  brown$$quick quick$$fox


In [19]:

# Function to tokenize query into n-grams
def query_tokenize_ngrams(query, n=2):
    return list(tokenize_ngrams(query, n=n).keys())

# Define the retrieval pipeline
class NgramTokenizeTransform:
    def __init__(self, n=2):
        self.n = n

    def transform(self, queries):
        def ngrams(text, n):
            tokens = text.split()
            return [' '.join(tokens[i:i+n]) for i in range(len(tokens)-n+1)]
        
        queries['query'] = queries['query'].apply(lambda x: ' '.join(ngrams(x, self.n)))
        return queries

ngram_tokenize = NgramTokenizeTransform(n=2)



### Step 2: Load the Dataset and the Index

The type of the index object that we load is `<class 'jnius.reflect.org.terrier.structures.Index'>`, in fact a [Java class](http://terrier.org/docs/v3.6/javadoc/org/terrier/structures/Index.html) wrapped into Python. However, you do not need to worry about this: at this point, we will simply use the provided Index object to run procedures defined in Python.

Create a custom tokenizer method.
This method takes a String as an argument and returns a List of all tokens, including ngrams between 1 and 3 words.

Create the Iterator for our dataset. This is an Iterator where every element is a dict which contains a text and a docno.
See the example print below.

### Step 3: Define the Retrieval Pipeline

We will define a BM25 retrieval pipeline as baseline. For details, see:

- [https://pyterrier.readthedocs.io](https://pyterrier.readthedocs.io)
- [https://github.com/terrier-org/ecir2021tutorial](https://github.com/terrier-org/ecir2021tutorial)

In [3]:
bm25 = pt.BatchRetrieve(index_ref, wmodel="BM25", verbose = True)

### Step 4: Create the Run


In [4]:
print('Now we do the retrieval...')
#run = bm25(test_dataset.get_topics('text'))
#run = bm25([{"qid": "0", "query" : "retrieval systems"}])
#print('Done. Here are the first 10 entries of the run')
#run.head(10)


# tokenize ngram the queries
# Create a query pipeline that includes the n-gram tokenization
#ngram_tokenize = NgramTokenizeTransform(n=2)
retr_pipe = ngram_tokenize >> bm25

import pandas as pd
# Sample queries
queries = pd.DataFrame([
    {'qid': 'q1', 'query': 'do goldfish grow?'},
    {'qid': 'q2', 'query': 'quick brown fox'}
])

# Retrieve results
run = retr_pipe(queries)
run.head(10)
results = retr_pipe.transform(queries)
print(results)

Now we do the retrieval...


TypeError: ValueError() takes no keyword arguments

In [9]:
#index = pt.IndexFactory.of(index_ref)
meta = index.getMetaIndex()

# List of metadata keys
meta_keys = ['docno', 'text']  # Adjust this list based on your meta fields

# Function to print document attributes
def print_doc_attributes(docid):
    attributes = {key: meta.getItem(key, docid) for key in meta_keys}
    print(f"Attributes for docid {docid}: {attributes}")

# Example: Print attributes for the first 5 documents
for docid in range(3):
    print_doc_attributes(docid)

# Function to print document attributes by docno
def print_doc_attributes_by_docno(docno):
    try:
        docid = meta.getDocument("docno", docno)
        print_doc_attributes(docid)
    except KeyError:
        print(f"Document with docno {docno} not found.")

# List of specific docnos to retrieve
#docnos = ['W05-0704']

# Retrieve and print attributes for each specified docno
#for docno in docnos:
 #   print_doc_attributes_by_docno(docno)

Attributes for docid 0: {'docno': '0', 'text': ''}
Attributes for docid 1: {'docno': '1', 'text': ''}
Attributes for docid 2: {'docno': '2', 'text': ''}


### Step 5: Persist the run file for subsequent evaluations

The output of a prototypical retrieval system is a run file. This run file can later (optimally in a different notebook) be statistically evaluated.

In [None]:
persist_and_normalize_run(run, system_name='bm25-baseline', default_output='../runs')

The run file is normalized outside the TIRA sandbox, I will store it at "../runs".
Done. run file is stored under "../runs/run.txt".
