# Information Retrieval Experiment Notebook

In this notebook, we use the [BEIR](https://github.com/beir-cellar/beir) library to experiment with different information retrieval techniques on the same dataset as before (XML-Coll-withSem). Since we have already explored probabilistic models and vector space models, we focus on neural models here.

Read the BEIR paper:
[A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://openreview.net/pdf?id=wCu6T5xFjeJ)

---

## Import Libraries
Let's start by importing the libraries we need for this project. You can install any missing libraries using the requirements.txt file provided or by running the following command in your terminal:

```bash
make install
```

In [1]:
from time import time

from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging

import jsonlines
import zipfile
import io
import re
import os

from textprocessor import CustomTextProcessorNoStem



## Data Preprocessing
Since we want to use our own dataset with BEIR, we need to convert it to the format that BEIR expects.
We also use our ``textprocessor`` module to preprocess the documents before feeding them to the models. The queries file was converted to the BEIR format manually.
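As a rough sketch of the target layout, assuming BEIR's usual custom-dataset conventions: queries are stored as JSON Lines with `_id` and `text` keys, and relevance judgments as a tab-separated file with a `query-id`, `corpus-id`, `score` header. The query ids, texts, and temporary file names below are made up for illustration and are not taken from our dataset.

```python
import csv
import json
import os
import tempfile

# Hypothetical sample data, for illustration only.
sample_queries = {"q1": "patent case law", "q2": "neural retrieval"}
sample_qrels = [("q1", "doc42", 1), ("q2", "doc7", 1)]

out_dir = tempfile.mkdtemp()
queries_file = os.path.join(out_dir, "queries.jsonl")
qrels_file = os.path.join(out_dir, "qrels.tsv")

# Queries: one JSON object per line, with "_id" and "text" keys.
with open(queries_file, "w", encoding="utf-8") as f:
    for query_id, text in sample_queries.items():
        f.write(json.dumps({"_id": query_id, "text": text}) + "\n")

# Qrels: a header row, then one (query, document, relevance) triple per line.
with open(qrels_file, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t", lineterminator="\n")
    writer.writerow(["query-id", "corpus-id", "score"])
    writer.writerows(sample_qrels)

# Read the first query back to check the layout.
with open(queries_file, encoding="utf-8") as f:
    first_query = json.loads(f.readline())
print(first_query)
```

The corpus file follows the same JSON Lines idea, with an additional `title` key per document.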

In [2]:
textprocessor = CustomTextProcessorNoStem()

def parse_xml_to_json(filename, lines):
    """
    Parses XML content into a JSON format.

    Args:
        filename (str): The name of the file being parsed.
        lines (list): The lines of content to be parsed.

    Returns:
        dict: A dictionary containing the parsed data in JSON format.
    """
    docno = filename.split('/')[-1].split('.')[0]
    
    content = ' '.join(lines)
    content = re.sub('&[^;]+;', ' ', content)
    
    text = re.sub('<[^>]+>', '', content)
    tokens = textprocessor.pre_processing(text)
     
    return {docno: {'title': "", 'text': ' '.join(tokens)}}
    
    
def parse_collection(file):
    """
    Parses a collection of XML files into a JSON format.

    Args:
        file (str): The path to the collection file.

    Returns:
        dict: A dictionary containing the parsed data in JSON format.
    """
    parsed_data = {}
    with zipfile.ZipFile(file, 'r') as zip_file:
        for filename in zip_file.namelist():
            with zip_file.open(filename) as binary_file:
                with io.TextIOWrapper(binary_file, encoding='utf-8') as f:
                    parsed_data.update(parse_xml_to_json(filename, f.readlines()))
                    
    return parsed_data
                    
def save_parsed_collection(parsed_data, output_file):
    """
    Saves the parsed collection data to a JSON file.

    Args:
        parsed_data (dict): The parsed data to be saved.
        output_file (str): The path to the output file.
    """
    with jsonlines.open(output_file, 'w') as writer:
        for docno, data in parsed_data.items():
            writer.write({'_id': docno, 'title': data['title'], 'text': data['text']})
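
To illustrate what the two regular expressions in `parse_xml_to_json` do, here is a small sketch on a made-up XML fragment (the fragment is hypothetical, and the `textprocessor` step is left out). Entities are replaced by a space, while tags are removed without a replacement, so text separated only by tags ends up glued together.

```python
import re

# Hypothetical XML fragment, for illustration only.
sample_lines = [
    "<article><title>Gottschalk v. Benson</title>",
    "<p>A patent&nbsp;case &amp; its holding.</p></article>",
]

content = " ".join(sample_lines)
content = re.sub(r"&[^;]+;", " ", content)  # entities such as &amp; become a space
text = re.sub(r"<[^>]+>", "", content)      # tags are dropped with no replacement
print(text)

# Words separated only by tags are merged into a single token.
glued = re.sub(r"<[^>]+>", "", "<b>Douglas</b><i>joined</i>")
print(glued)
```

If merged tokens turn out to be a problem, replacing tags with a space instead of an empty string is one possible variation.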

**CHOOSE DATASET:** Change the dataset name in the following cell to use a different dataset.
If the dataset has not been formatted yet, uncomment the parsing lines in the following cell to format it.

In [3]:
dataset_name = 'XML-Coll-withSem'

# Uncomment to parse the collection
# parsed_data = parse_collection('../lib/data/practice_05/' + dataset_name + '.zip') 
# save_parsed_collection(parsed_data,'./data/' + dataset_name + '.jsonl')

In [4]:
corpus_path = "./data/" + dataset_name + ".jsonl"
query_path = "./data/queries.jsonl"
qrels_path = "./data/qrl.tsv" # Mandatory for validation evaluation only (not used)

## Beir Setup
Now that our data is ready, we can start using BEIR. We define a logger, load the dataset, and then define a few wrapper classes to simplify the use of the different models.

In [5]:
# define a logger and capture results
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

# load our dataset
corpus, queries, qrels = GenericDataLoader(
    corpus_file=corpus_path, 
    query_file=query_path, 
    qrels_file=qrels_path).load_custom()

2023-12-31 03:17:25 - Loading Corpus...


100%|██████████| 9804/9804 [00:01<00:00, 8954.79it/s]

2023-12-31 03:17:29 - Loaded 9804 Documents.
2023-12-31 03:17:29 - Doc Example: {'text': 'gottschalk benson eastlaw united patent case law flagged us supreme court articles law computerrelated patent case law united supreme court cases gottschalk benson supreme court united argued october november full case gottschalk acting commissioner patents benson al citations us ct us lexis uspq bna prior history certiorari united court customs patent appeals subsequent history diamond diehr diamond chakrabarty holding respondents method converting numerical binarycoded decimal numbers pure binary numbers programming conventional generalpurpose digital computers series mathematical calculations mental steps constitute patentable process meaning patent usc court membership chief justice warren burger associate justices william douglas william brennan jr potter stewart byron white thurgood marshall harry blackmun lewis powell jr william rehnquist case opinions majority douglasjoined burger brennan 




In [6]:
class BEIRModelWrapper:
    def __init__(self):
        self.model = None
        self.retriever = None
        self.model_name = None
        self.score_function = "dot"
        self.k_values = [1, 3, 5, 10, 100, 1000, 1500]

    def get_results(self, corpus, queries):
        results = self.retriever.retrieve(corpus, queries)
        return results
    
    def evaluate(self, qrels, results):
        return self.retriever.evaluate(qrels, results, self.retriever.k_values)

    def get_model_name(self):
        return self.model_name

class SentenceBERTModelWrapper(BEIRModelWrapper):
    def __init__(self, batch_size=32, corpus_chunk_size=16*9999):
        super().__init__()
        
        self.model_path = "msmarco-distilbert-base-tas-b"
        self.model = DRES(models.SentenceBERT(self.model_path), batch_size=batch_size, corpus_chunk_size=corpus_chunk_size)
        
        self.score_function = "cos_sim"
        self.retriever = EvaluateRetrieval(self.model, score_function=self.score_function, k_values=self.k_values)
        
        self.model_name = f"SBERT_{self.model_path.replace('-', '_')}"
        
class ANCEModelWrapper(BEIRModelWrapper):
    def __init__(self):
        super().__init__()
        self.model_path = "msmarco-roberta-base-ance-firstp"
        self.model = DRES(models.SentenceBERT(self.model_path))
        self.retriever = EvaluateRetrieval(self.model, score_function="dot", k_values=self.k_values)
        self.model_name = f"ANCE_{self.model_path.replace('-', '_')}"
    
# ###############################################
# Below Models need GPU support
# ###############################################

class DPRModelWrapper(BEIRModelWrapper):
    def __init__(self, batch_size=128):
        super().__init__()
        self.question_encoder = "facebook/dpr-question_encoder-multiset-base"
        self.ctx_encoder = "facebook/dpr-ctx_encoder-multiset-base"
        self.model = DRES(models.DPR((self.question_encoder, self.ctx_encoder), batch_size=batch_size))
        
        self.retriever = EvaluateRetrieval(self.model, score_function=self.score_function, k_values=self.k_values)
        self.model_name = "DPR_dpr_question_encoder_multiset_base_dpr_ctx_encoder_multiset_base"
        
class UseQAModelWrapper(BEIRModelWrapper):
    def __init__(self):
        super().__init__()
        self.model_path = "https://tfhub.dev/google/universal-sentence-encoder-qa/3"
        self.model = DRES(models.UseQA(self.model_path))
        self.retriever = EvaluateRetrieval(self.model, score_function="dot", k_values=self.k_values)
        
        self.model_name = f"USEQA_universal_sentence_encoder_qa"
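
The wrappers above differ in their `score_function` (`"dot"` or `"cos_sim"`). The sketch below, written in plain Python on made-up two-dimensional vectors rather than real model embeddings, shows why the choice can matter: the dot product rewards vectors with a large norm, whereas cosine similarity only looks at direction.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cos_sim(u, v):
    """Dot product of the two vectors after length normalization."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical embeddings, for illustration only.
query = [1.0, 0.0]
aligned_doc = [1.0, 0.0]     # same direction as the query, small norm
large_norm_doc = [3.0, 3.0]  # different direction, large norm

print("dot:    ", dot(query, aligned_doc), dot(query, large_norm_doc))
print("cos_sim:", cos_sim(query, aligned_doc), cos_sim(query, large_norm_doc))
```

With these toy vectors the two functions rank the documents in opposite orders, so the scoring function should match the one the chosen model was trained with.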

In [7]:
model_wrapper = SentenceBERTModelWrapper()

2023-12-31 03:17:29 - Load pretrained SentenceTransformer: msmarco-distilbert-base-tas-b


2023-12-31 03:17:30 - Use pytorch device: cpu


In [8]:
start_time = time()
results = model_wrapper.get_results(corpus, queries)
end_time = time()

print("Time taken to retrieve: {:.2f} seconds".format(end_time - start_time))

# logging.info("Retriever evaluation for k in: {}".format(model_wrapper.retriever.k_values))
# ndcg, _map, recall, precision = model_wrapper.evaluate(qrels, results)

2023-12-31 03:17:30 - Encoding Queries...


Batches: 100%|██████████| 1/1 [00:00<00:00,  1.39it/s]


2023-12-31 03:17:31 - Sorting Corpus by document length (Longest first)...
2023-12-31 03:17:31 - Scoring Function: Cosine Similarity (cos_sim)
2023-12-31 03:17:31 - Encoding Batch 1/1...


Batches:  20%|█▉        | 60/307 [22:46<1:51:54, 27.19s/it]
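
The evaluation call is commented out above. To give an idea of what the precision and recall values at each `k` mean, here is a hand-rolled sketch on toy data. It is not BEIR's own implementation, and the function names, results, and qrels are hypothetical.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_doc_ids)
    return hits / k

def recall_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant_doc_ids)
    return hits / len(relevant_doc_ids)

# Hypothetical toy data, in the same nested-dict shape as `results` and `qrels`.
toy_results = {"q1": {"d1": 0.9, "d2": 0.7, "d3": 0.1}}
toy_qrels = {"q1": {"d1": 1, "d3": 1}}

# Rank documents by decreasing score and keep the ids judged relevant.
ranked = sorted(toy_results["q1"], key=toy_results["q1"].get, reverse=True)
relevant = {doc_id for doc_id, rel in toy_qrels["q1"].items() if rel > 0}

for k in (1, 3):
    print(k, precision_at_k(ranked, relevant, k), recall_at_k(ranked, relevant, k))
```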


## Run Generation

Now that we have the results, we can generate the run files that will be used to evaluate the models. 

In [None]:
RUN_OUTPUT_FOLDER = "../docs/resources/runs/"

def format_results(results):
    """Return, for each query, its (doc_id, score) pairs sorted by decreasing score."""
    res_grouped = {}
    for query_id, doc_scores in results.items():
        # Queries without any retrieved document are skipped.
        if not doc_scores:
            continue
        res_grouped[query_id] = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)

    return res_grouped

def get_run_id(folder_path=RUN_OUTPUT_FOLDER):
    files = [f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))]
    return len(files) + 1

def display_top_k_results(results, k=10):
    for query_id, doc_ids in results.items():
        print(f"Query {query_id}:")
        for i, (doc_id, score) in enumerate(doc_ids[:k]):
            print(f"\t{i+1}. {doc_id} ({score})")
        print()   

In [None]:
team_name = "BengezzouIdrissMezianeGhilas"
run_id = get_run_id()
processing = textprocessor.get_text_processor_name()
granularity = "article"
run_file = f"{RUN_OUTPUT_FOLDER}{team_name}_{run_id}_{model_wrapper.get_model_name()}_{granularity}_{processing}.txt"

In [None]:
formatted_results = format_results(results)
# display_top_k_results(formatted_results)

# One line per (query, document) pair: query id, Q0, doc id, rank, score, team name, XPath
with open(run_file, "w") as f_out:
    for query_id in formatted_results:
        for rank, (doc_id, score) in enumerate(formatted_results[query_id], start=1):
            f_out.write("{} Q0 {} {} {} {} /article[1]\n".format(query_id, doc_id, rank, score, team_name))
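
To make the run-file layout concrete, the sketch below builds the lines for a toy results dictionary without touching the file system. The team name, query ids, document ids, and scores are placeholders for illustration; the field order mirrors the cell above.

```python
# Hypothetical retrieval output: {query_id: {doc_id: score}}.
toy_results = {"q1": {"d1": 0.2, "d2": 0.9}, "q2": {"d3": 0.5}}
toy_team = "ExampleTeam"  # placeholder, not our actual team name

lines = []
for query_id, doc_scores in toy_results.items():
    # Sort each query's documents by decreasing score; ranks start at 1.
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    for rank, (doc_id, score) in enumerate(ranked, start=1):
        lines.append(f"{query_id} Q0 {doc_id} {rank} {score} {toy_team} /article[1]")

print("\n".join(lines))
```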

In [None]:
import json
# write results to file
with open("result.json", "w") as f_out:
    json.dump(results, f_out)