# Information Retrieval Experiment Notebook

In this notebook, we use the [BEIR](https://github.com/beir-cellar/beir) library to experiment with different information retrieval techniques. We keep the same dataset (XML-Coll-withSem). Since we have already explored probabilistic models and vector space models, this notebook focuses on neural models.

Read the BEIR paper:
[BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://openreview.net/pdf?id=wCu6T5xFjeJ)

---

## Import Libraries
Let's start by importing the libraries we need for this project. You can install any missing libraries using the requirements.txt file provided or by running the following command in your terminal:

```bash
make install
```

In [65]:
from beir.datasets.data_loader import GenericDataLoader
from time import time
from beir import util, LoggingHandler
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging

import jsonlines
import zipfile
import io
import re
import os

from textprocessor import CustomTextProcessorNoStem

## Data Preprocessing
Because we want to use our own dataset with BEIR, we need to convert it to the format that BEIR expects.
We also use our ``textprocessor`` module to preprocess the documents before feeding them to the models. The queries file was converted to the BEIR format by hand.
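
For reference, BEIR's `GenericDataLoader` reads a JSON Lines corpus with `_id`, `title`, and `text` fields, a JSON Lines query file with `_id` and `text` fields, and a tab-separated qrels file with a header row (`query-id`, `corpus-id`, `score`). The sketch below writes a tiny instance of each using only the standard library. The helper name `write_beir_files`, the file names, and the toy contents are made up for illustration.

```python
import csv
import json
import os
import tempfile


def write_beir_files(out_dir, docs, queries, qrels):
    """Write corpus.jsonl, queries.jsonl and qrels.tsv in the layout BEIR loads.

    docs:    {doc_id: text}
    queries: {query_id: text}
    qrels:   [(query_id, doc_id, relevance), ...]
    """
    corpus_path = os.path.join(out_dir, "corpus.jsonl")
    query_path = os.path.join(out_dir, "queries.jsonl")
    qrels_path = os.path.join(out_dir, "qrels.tsv")

    # One JSON object per line, one line per document
    with open(corpus_path, "w", encoding="utf-8") as f:
        for doc_id, text in docs.items():
            f.write(json.dumps({"_id": doc_id, "title": "", "text": text}) + "\n")

    # Queries only carry an id and a text
    with open(query_path, "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}) + "\n")

    # Relevance judgments: tab-separated, with a header row
    with open(qrels_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["query-id", "corpus-id", "score"])
        writer.writerows(qrels)

    return corpus_path, query_path, qrels_path


# Toy data, invented for this example only
out_dir = tempfile.mkdtemp()
paths = write_beir_files(
    out_dir,
    docs={"612": "algorithms calculating variance"},
    queries={"2009011": "variance algorithm"},
    qrels=[("2009011", "612", 1)],
)
```

The three returned paths can then be passed to `GenericDataLoader` as `corpus_file`, `query_file`, and `qrels_file`, as done later in this notebook.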

In [66]:
textprocessor = CustomTextProcessorNoStem()

def parse_xml_to_json(filename, lines):
    """
    Parses XML content into a JSON format.

    Args:
        filename (str): The name of the file being parsed.
        lines (list): The lines of content to be parsed.

    Returns:
        dict: A dictionary containing the parsed data in JSON format.
    """
    docno = filename.split('/')[-1].split('.')[0]
    
    content = ' '.join(lines)
    content = re.sub('&[^;]+;', ' ', content)
    
    text = re.sub('<[^>]+>', '', content)
    tokens = textprocessor.pre_processing(text)
     
    return {docno: {'title': "", 'text': ' '.join(tokens)}}
    
    
def parse_collection(file):
    """
    Parses a collection of XML files into a JSON format.

    Args:
        file (str): The path to the collection file.

    Returns:
        dict: A dictionary containing the parsed data in JSON format.
    """
    parsed_data = {}
    with zipfile.ZipFile(file, 'r') as zip_file:
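        for filename in zip_file.namelist():
            # Skip directory entries: they have no content and would yield an empty docno
            if filename.endswith('/'):
                continue
            with zip_file.open(filename) as binary_file:
                with io.TextIOWrapper(binary_file, encoding='utf-8') as f:
                    parsed_data.update(parse_xml_to_json(filename, f.readlines()))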
                    
    return parsed_data
                    
def save_parsed_collection(parsed_data, output_file):
    """
    Saves the parsed collection data to a JSON Lines file (one document per line).

    Args:
        parsed_data (dict): The parsed data to be saved.
        output_file (str): The path to the output file.
    """
    with jsonlines.open(output_file, 'w') as writer:
        for docno, data in parsed_data.items():
            writer.write({'_id': docno, 'title': data['title'], 'text': data['text']})
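
To see what the two regular expressions in `parse_xml_to_json` do, here is a minimal sketch on an invented XML fragment. The project's `textprocessor` module is not shown in this notebook, so a plain lowercase-and-split step stands in for `pre_processing`; the real processor may tokenize differently.

```python
import re


def strip_markup(content):
    """Replace character entities with spaces, then drop XML tags."""
    content = re.sub(r'&[^;]+;', ' ', content)  # e.g. &amp; or &#233;
    return re.sub(r'<[^>]+>', '', content)      # e.g. <title>, </p>


# Invented sample; not taken from the collection
sample = '<article><title>Variance</title> <p>mean &amp; deviation</p></article>'
text = strip_markup(sample)
tokens = text.lower().split()  # stand-in for textprocessor.pre_processing
print(tokens)
```

Note that tags are replaced by the empty string, so text from adjacent elements with no whitespace between them is joined: `<a>x</a><b>y</b>` becomes `xy`.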

**CHOOSE DATASET:** Change the dataset name in the following cell to use a different dataset.
If the dataset has not been formatted yet, uncomment the two lines below `dataset_name` to parse and save it.

In [67]:
dataset_name = 'small'
# parsed_data = parse_collection('../lib/data/practice_05/' + dataset_name + '.zip') 
# save_parsed_collection(parsed_data,'./data/' + dataset_name + '.jsonl')

In [68]:
corpus_path = "./data/" + dataset_name + ".jsonl"
query_path = "./data/queries.jsonl"
qrels_path = "./data/qrl.tsv" # Mandatory for validation evaluation only

## Beir Setup
Now that our data is ready, we can start using BEIR. We first define a logger and load the dataset, then define a few wrapper classes that simplify switching between the different models.

In [69]:
# define a logger and capture results
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

# load our dataset
corpus, queries, qrels = GenericDataLoader(
    corpus_file=corpus_path, 
    query_file=query_path, 
    qrels_file=qrels_path).load_custom()

2023-12-30 23:27:08 - Loading Corpus...


100%|██████████| 10/10 [00:00<00:00, 647.77it/s]

2023-12-30 23:27:08 - Loaded 10 Documents.
2023-12-30 23:27:08 - Doc Example: {'text': 'algorithms calculating variance lese statistical algorithms statistical deviation dispersion articles example pseudocode algorithms calculating variance play major role statistical computing key problem design good algorithms problem formulas variance involve sums squares lead numerical instability well arithmetic overflow dealing large values nave algorithm formula calculating variance entire population size displaystylefrac sum_ sum_ x_i formula calculating unbiased estimate population variance finite sample observations displaystylefrac sum_ sum_ x_i naive algorithm calculate estimated variance pseudocode sum sum_sqr foreach data sum sum sum_sqr sum_sqr sumn variance sum_sqr sum algorithm easily adapted compute variance finite population simply divide minus sum_sqr sum numbers precision result inherent precision floatingpoint arithmetic perform computation bad variance small relative sum numbers 




In [70]:
class BEIRModelWrapper:
    def __init__(self):
        self.model = None
        self.retriever = None
        self.model_name = None
        self.score_function = "dot"

    def get_results(self, corpus, queries):
        results = self.retriever.retrieve(corpus, queries)
        return results
    
    def evaluate(self, qrels, results):
        return self.retriever.evaluate(qrels, results, self.retriever.k_values)

    def get_model_name(self):
        return self.model_name

class SentenceBERTModelWrapper(BEIRModelWrapper):
    def __init__(self, batch_size=256, corpus_chunk_size=512*9999):
        super().__init__()
        self.model_path = "msmarco-distilbert-base-tas-b"
        self.model = DRES(models.SentenceBERT(self.model_path), batch_size=batch_size, corpus_chunk_size=corpus_chunk_size)
        self.retriever = EvaluateRetrieval(self.model, score_function=self.score_function)
        self.model_name = f"SBERT_{self.model_path.replace('-', '_')}"
        
class ANCEModelWrapper(BEIRModelWrapper):
    def __init__(self):
        super().__init__()
        self.model_path = "msmarco-roberta-base-ance-firstp"
        self.model = DRES(models.SentenceBERT(self.model_path))
        self.retriever = EvaluateRetrieval(self.model, score_function="dot")
        self.model_name = f"ANCE_{self.model_path.replace('-', '_')}"
    
# ###############################################
# Below Models need GPU support
# ###############################################

class DPRModelWrapper(BEIRModelWrapper):
    def __init__(self, batch_size=128):
        super().__init__()
        self.question_encoder = "facebook/dpr-question_encoder-multiset-base"
        self.ctx_encoder = "facebook/dpr-ctx_encoder-multiset-base"
        self.model = DRES(models.DPR((self.question_encoder, self.ctx_encoder), batch_size=batch_size))
        
        self.retriever = EvaluateRetrieval(self.model, score_function=self.score_function)
        self.model_name = "DPR_dpr_question_encoder_multiset_base_dpr_ctx_encoder_multiset_base"
        
class UseQAModelWrapper(BEIRModelWrapper):
    def __init__(self):
        super().__init__()
        self.model_path = "https://tfhub.dev/google/universal-sentence-encoder-qa/3"
        self.model = DRES(models.UseQA(self.model_path))
        self.retriever = EvaluateRetrieval(self.model, score_function="dot")
        
        self.model_name = "USEQA_universal_sentence_encoder_qa"
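
The wrappers above all rely on exact (brute-force) dense search: every document embedding is scored against the query embedding and the highest scores are kept. The sketch below illustrates that idea with the `dot` score function on hand-made three-dimensional vectors in plain Python. It is not BEIR's implementation, which encodes text with the model and works on batched tensors; `exact_dot_search` and the toy vectors are hypothetical.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def exact_dot_search(query_vecs, doc_vecs, top_k=2):
    """Score every document against every query and keep the top_k per query.

    Returns {query_id: {doc_id: score}}, the same nesting as the `results`
    dictionary used in this notebook.
    """
    results = {}
    for query_id, q in query_vecs.items():
        scored = ((doc_id, dot(q, d)) for doc_id, d in doc_vecs.items())
        best = heapq.nlargest(top_k, scored, key=lambda pair: pair[1])
        results[query_id] = dict(best)
    return results


# Toy embeddings, invented for illustration
doc_vecs = {"d1": [1.0, 0.0, 0.0], "d2": [0.5, 0.5, 0.0], "d3": [0.0, 0.0, 2.0]}
query_vecs = {"q1": [2.0, 1.0, 0.0]}

toy_results = exact_dot_search(query_vecs, doc_vecs)
print(toy_results)
```

Because dot-product scores are unnormalized, they are mainly useful for ranking documents within one query rather than as absolute relevance values.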

In [72]:
model_wrapper = SentenceBERTModelWrapper()

2023-12-30 23:29:31 - Load pretrained SentenceTransformer: msmarco-distilbert-base-tas-b


2023-12-30 23:29:36 - Use pytorch device: cpu


In [None]:
start_time = time()
results = model_wrapper.get_results(corpus, queries)
end_time = time()

print("Time taken to retrieve: {:.2f} seconds".format(end_time - start_time))

# logging.info("Retriever evaluation for k in: {}".format(model_wrapper.retriever.k_values))
# ndcg, _map, recall, precision = model_wrapper.evaluate(qrels, results)

2023-12-30 23:17:27 - Encoding Queries...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches: 100%|██████████| 1/1 [00:02<00:00,  2.03s/it]


2023-12-30 23:17:29 - Sorting Corpus by document length (Longest first)...
2023-12-30 23:17:29 - Scoring Function: Dot Product (dot)
2023-12-30 23:17:29 - Encoding Batch 1/1...


Batches: 100%|██████████| 1/1 [00:14<00:00, 14.72s/it]


Time taken to retrieve: 17.25 seconds
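
The commented-out `evaluate` call in the cell above would compute NDCG, MAP, recall, and precision at each cutoff in `k_values`. As a rough illustration of what two of those numbers mean, the sketch below computes precision@k and recall@k by hand for a single query. BEIR itself delegates the computation to `pytrec_eval`; the function name and the toy data here are hypothetical.

```python
def precision_recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Return (precision@k, recall@k) for one query.

    ranked_doc_ids:   document ids ordered from best to worst score
    relevant_doc_ids: set of ids judged relevant in the qrels
    """
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_doc_ids)
    precision = hits / k
    recall = hits / len(relevant_doc_ids) if relevant_doc_ids else 0.0
    return precision, recall


# Toy ranking and judgments, invented for illustration
ranking = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4", "d8"}

p_at_5, r_at_5 = precision_recall_at_k(ranking, relevant, k=5)
print(p_at_5, r_at_5)
```

Here two of the five retrieved documents are relevant, and two of the four relevant documents were retrieved.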


## Run Generation

Now that we have the results, we can generate the run files that will be used to evaluate the models. 
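
Each line of a run file, as written by the last cell of this section, holds the query id, the literal `Q0`, the document id, the rank, the score, the team name, and an element path (`/article[1]` here). A minimal sketch of a line formatter and its inverse follows; `format_run_line` and `parse_run_line` are hypothetical helpers, not part of the notebook.

```python
def format_run_line(query_id, doc_id, rank, score, team, path="/article[1]"):
    """Build one whitespace-separated run line."""
    return f"{query_id} Q0 {doc_id} {rank} {score} {team} {path}"


def parse_run_line(line):
    """Split a run line back into typed fields."""
    query_id, _q0, doc_id, rank, score, team, path = line.split()
    return query_id, doc_id, int(rank), float(score), team, path


# Values taken from the first result shown in this notebook's output
line = format_run_line("2009011", "1064", 1, 703.6972045898438,
                       "BengezzouIdrissMezianeGhilas")
print(line)
print(parse_run_line(line))
```

Parsing a line back is a cheap way to check that no field contains whitespace, which would break the column layout.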

In [73]:
RUN_OUTPUT_FOLDER = "../docs/resources/runs/"

def format_results(results):
    """Return {query_id: [(doc_id, score), ...]} sorted by descending score."""
    res_grouped = {}
    for query_id, doc_scores in results.items():
        # sort each query's documents from highest to lowest score
        res_grouped[query_id] = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)

    return res_grouped

def get_run_id(folder_path=RUN_OUTPUT_FOLDER):
    files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
    return len(files) + 1

def display_top_k_results(results, k=10):
    for query_id, doc_ids in results.items():
        print(f"Query {query_id}:")
        for i, (doc_id, score) in enumerate(doc_ids[:k]):
            print(f"\t{i+1}. {doc_id} ({score})")
        print()   

In [74]:
team_name = "BengezzouIdrissMezianeGhilas"
run_id = get_run_id()
processing = textprocessor.get_text_processor_name()
run_file = f"{RUN_OUTPUT_FOLDER}{team_name}_{run_id}_{model_wrapper.get_model_name()}_{processing}.txt"

In [75]:
formatted_results = format_results(results)
# display_top_k_results(formatted_results)

with open(run_file, "w") as f_out:
    for query_id in formatted_results:
        for rank, (doc_id, score) in enumerate(formatted_results[query_id]):
            # reuse team_name instead of hard-coding it in the format string
            f_out.write("{} Q0 {} {} {} {} /article[1]\n".format(query_id, doc_id, rank+1, score, team_name))

Query 2009011:
	1. 1064 (703.6972045898438)
	2. 1134 (702.343505859375)
	3. 612 (702.04296875)
	4. 753 (701.8960571289062)
	5. 627 (701.69287109375)
	6. 1164 (701.625244140625)
	7. 775 (701.3036499023438)
	8. 1063 (700.8485107421875)
	9. 780 (699.796142578125)
	10. 717 (698.7045288085938)

Query 2009036:
	1. 753 (706.2047119140625)
	2. 612 (704.6282958984375)
	3. 627 (703.4949951171875)
	4. 775 (702.9257202148438)
	5. 1134 (702.78515625)
	6. 1164 (702.0452880859375)
	7. 1063 (701.9610595703125)
	8. 1064 (701.3392944335938)
	9. 780 (699.7847900390625)
	10. 717 (698.1004638671875)

Query 2009067:
	1. 775 (707.5659790039062)
	2. 612 (706.60302734375)
	3. 1134 (706.4557495117188)
	4. 1164 (705.624267578125)
	5. 1063 (705.209228515625)
	6. 627 (705.1932983398438)
	7. 1064 (703.7508544921875)
	8. 753 (701.9403686523438)
	9. 780 (701.7125244140625)
	10. 717 (699.7239990234375)

Query 2009073:
	1. 1134 (705.3028564453125)
	2. 1063 (704.0633544921875)
	3. 612 (703.7564697265625)
	4. 775 (703.65