# RAG Evaluation with [BERGEN Benchmark](https://github.com/naver/bergen/)

- Ships with out-of-the-box models and datasets
- Supports custom models and datasets
- Lets you swap in individual components (for example, only a custom reranker, with everything else used out-of-the-box)
- Requires many dependencies and additional code for your custom components

<br><br>

**The ugly:**
- Broken setup (dependencies)
  - Tried with Anaconda and Docker

<br><br>

**Content:**
- [Python Env](#python-env)
- [Example RAG Model](#example-rag-model)
  - Retriever: Embedding + Indexing (Database) (+ example data)
  - Reranker (we don't use one)
  - Generator: Tokenizer + LLM
- [Evaluation with BERGEN](#evaluation-with-bergen)
  - 1. Defining our Model in BERGEN Repo
    - Classes + Configs
  - 2. Evaluate your model with BERGEN

<br><br>

---



### Python Env

Install Repository:
```bash
git clone https://github.com/naver/bergen.git ./bergen
```

<br><br>

Tried with **Anaconda** and **Python 3.10** & **3.11**

Tried with **Docker** and **CUDA 11.2**, **11.8**, **13.0** + **Python 3.9**, **3.10**, **3.12**

Tried with **Google Colab**

<span style="color:red">=> None of these setups worked</span>

<span style="color:red">*but the procedure can still be shown</span>


### System Information

In [None]:
import prime_printer as prime
print(prime.get_hardware())

### Example RAG Model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import faiss

In [None]:
embedding_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
embedding_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", dtype=torch.float16)
embedding_model.resize_token_embeddings(len(embedding_tokenizer))

In [None]:
example_documents = [
    "The Eiffel Tower is located in Paris.",
    "The Pythagorean theorem describes the relationship between the sides of a right triangle.",
    "The capital of Germany is Berlin.",
]

In [None]:
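def encode(model, tokenizer, texts):
    # Tokenize and move the inputs to the same device as the model
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**tokens)
    # Mean-pool over real tokens only, so padding does not skew the embedding
    hidden = outputs.last_hidden_state
    mask = tokens["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return embeddings.cpu().numpy()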

In [None]:
doc_embeddings = encode(embedding_model, embedding_tokenizer, example_documents)

Build FAISS Index (our "database")

In [None]:
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
# FAISS only accepts float32 vectors, while the embedding model runs in float16
index.add(doc_embeddings.astype("float32"))
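
For intuition, `faiss.IndexFlatL2` performs an exact (brute-force) search: the query is compared against every stored vector, and the entries with the smallest squared L2 distance are returned. The sketch below reproduces that idea in plain Python. `flat_l2_search` and the toy vectors are hypothetical names and data for illustration only; they are not part of FAISS or BERGEN.

```python
def flat_l2_search(stored, query, k):
    """Return (distances, indices) of the k stored vectors closest to the query."""
    scored = []
    for idx, vec in enumerate(stored):
        # Squared L2 distance, the metric a flat L2 index ranks by
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-dimensional "embeddings" (made up for this example)
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indices = flat_l2_search(toy_vectors, [0.9, 1.2], k=2)
print(indices)
```

A flat index like this scales linearly with the number of documents, which is fine for a handful of passages as in this notebook.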

Load a language model (decoder)

In [None]:
model_name = "gpt2"  # "distilgpt2"
generator_tokenizer = AutoTokenizer.from_pretrained(model_name)
generator_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
generator_model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             dtype=torch.float16)
generator_model.resize_token_embeddings(len(generator_tokenizer))

In [None]:
generator_model.device

RAG Method

In [None]:
import numpy as np

def rag_answer(query, given_passages, k=2):
    # Create prompt + docs embedding
    embedded_prompt = encode(embedding_model, embedding_tokenizer, [query])[0]
    given_passages_embedded = encode(embedding_model, embedding_tokenizer, given_passages)

    # Convert embeddings to float32 numpy arrays
    prompt_vec = embedded_prompt.astype(np.float32).reshape(1, -1)
    passage_vecs = given_passages_embedded.astype(np.float32)

    # Build index
    dim = passage_vecs.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(passage_vecs)

    # Retrieve top-k docs
    distances, indices = index.search(prompt_vec, k)
    retrieved = [given_passages[i] for i in indices[0]]

    # Build the final prompt for generation
    context_text = "\n".join(retrieved)
    prompt = (
        f"Use the following context to answer the question.\n\n"
        f"Context: {context_text}\n\n"
        f"Question: {query}\nAnswer:"
    )

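    # Tokenize final prompt and move it to the generator's device
    inputs = generator_tokenizer(prompt, return_tensors="pt").to(generator_model.device)

    # Generate
    outputs = generator_model.generate(
        **inputs,
        max_length=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=generator_tokenizer.pad_token_id
    )

    # Decode only the newly generated tokens, i.e. everything after the prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    answer = generator_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    return answer, retrieved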

Example Run

In [None]:
answer, retrieved_docs = rag_answer("Where is the Eiffel Tower located?", example_documents, k=2)
print(f"Retrieved Docs: {retrieved_docs}")
print(f"\nRAG Answer:\n'{answer}'")

### **Evaluation with BERGEN**

[See documentation](https://github.com/naver/bergen/blob/main/documentation/extensions.md)

### 1. Defining our Model in BERGEN Repo

You can add a custom:
- Retriever
- Reranker
- Generator
- Dataset

Or use one of the out-of-the-box components.

<br><br>

**Retriever**
- inherit from `models.retrievers.retriever.Retriever`
- needed methods:
  - `collate_fn(self, batch, query_or_doc=None)`
  - `__call__(self, kwargs)`
  - `similarity_fn(self, q_embs, doc_embs)`
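
To make the role of `similarity_fn` concrete: it scores every query embedding against every document embedding, and the highest scores decide which documents are retrieved. The sketch below computes such a score matrix with a dot product in plain Python. `dot_similarity` and the toy embeddings are hypothetical and only illustrate the shape of the computation; a real retriever would do this with tensors.

```python
def dot_similarity(q_embs, doc_embs):
    """Return a matrix with one row per query and one column per document."""
    return [
        [sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_embs]
        for q_vec in q_embs
    ]


# One toy query and three toy documents (values made up for this example)
scores = dot_similarity([[1.0, 0.0]], [[0.5, 0.5], [2.0, 1.0], [0.0, 3.0]])

# Rank documents for the first query, best match first
ranking = sorted(range(len(scores[0])), key=lambda j: scores[0][j], reverse=True)
print(scores, ranking)
```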

In [None]:
import os

new_retriever = """
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModel

from models.retrievers.retriever import Retriever

class NewRetriever(Retriever):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        # AutoModel returns hidden states (last_hidden_state), which encode() pools;
        # a causal-LM head would return logits instead
        self.model = AutoModel.from_pretrained(model_name,
                                               device_map="auto",
                                               dtype=torch.float16)
        self.model.resize_token_embeddings(len(self.tokenizer))

    def encode(self, texts):
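        # Move the inputs to the model's device (device_map="auto" may place it on a GPU)
        tokens = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(self.model.device)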
        with torch.no_grad():
            outputs = self.model(**tokens)
            embeddings = outputs.last_hidden_state.mean(dim=1)
        return embeddings.cpu().numpy()

    def collate_fn(self, batch, query_or_doc=None):
        if isinstance(batch[0], dict):
            return [sample["content"] for sample in batch]
        return batch

    def __call__(self, kwargs):
        texts = kwargs["content"]
        emb = self.encode(texts)
        return {"embeddings": emb, "raw_texts": texts}

    def similarity_fn(self, q_embs, doc_embs):
        return torch.matmul(q_embs, doc_embs.T)
"""

os.makedirs("./bergen/models/retrievers/", exist_ok=True)
with open("./bergen/models/retrievers/new_retriever.py", "w") as f:
  f.write(new_retriever)

Add config yaml to `config/retriever`

In [None]:
new_retriever_config = """
init_args:
  _target_: models.retrievers.new_retriever.NewRetriever
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
batch_size: 1024
batch_size_sim: 256
"""

os.makedirs("./bergen/config/retriever/", exist_ok=True)
with open("./bergen/config/retriever/new_retriever.yaml", "w") as f:
  f.write(new_retriever_config)
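
The `_target_` key appears to follow the Hydra convention: a dotted path to a class, with the remaining keys under `init_args` passed to its constructor. The sketch below imitates that lookup with `importlib`. `instantiate_from_config` is a hypothetical, simplified stand-in for `hydra.utils.instantiate`, and it targets `fractions.Fraction` from the standard library only so that it runs without the BERGEN repository.

```python
import importlib


def instantiate_from_config(init_args):
    """Build an object from a dict that names its class under '_target_'."""
    args = dict(init_args)  # copy so the caller's config is left untouched
    module_path, class_name = args.pop("_target_").rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    # All remaining keys become keyword arguments of the constructor
    return cls(**args)


# Stand-in config; in BERGEN the target would be e.g. models.retrievers.new_retriever.NewRetriever
obj = instantiate_from_config(
    {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
)
print(obj)
```

This is why every key under `init_args` has to match a parameter of the class's `__init__`: an unknown key would be passed as an unexpected keyword argument.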

<br><br>

**Reranker**
- inherit from `models.rerankers.reranker.Reranker`
- needed methods:
  - `collate_fn(self, batch, query_or_doc=None)`
  - `__call__(self, kwargs)`

In [None]:
new_reranker = """
from models.rerankers.reranker import Reranker

class NewReranker(Reranker):
    def __init__(self, model_name=None):
        self.model_name = 'no_reranker'

    def collate_fn(self, batch, query_or_doc=None):
        return batch

    def __call__(self, kwargs):
        return kwargs
"""

os.makedirs("./bergen/models/rerankers/", exist_ok=True)
with open("./bergen/models/rerankers/new_reranker.py", "w") as f:
  f.write(new_reranker)

Add config yaml to `config/reranker`

In [None]:
new_reranker_config = """
init_args:
  _target_: models.rerankers.new_reranker.NewReranker
  model_name: "new_reranker"
batch_size: 2048
"""

os.makedirs("./bergen/config/reranker/", exist_ok=True)
with open("./bergen/config/reranker/new_reranker.yaml", "w") as f:
  f.write(new_reranker_config)

<br><br>

**Generator**
- inherit from `models.generators.generator.Generator`
- needed methods:
  - `collate_fn(self, inp)`
  - `generate(self, inp)`
  - `prediction_step(self, model, model_input, label_ids=None)`

In [None]:
new_generator = """
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from models.generators.generator import Generator

class NewGenerator(Generator):
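    def __init__(self, model_name="gpt2", max_new_tokens=128, **kwargs):
        # **kwargs absorbs any extra keys passed from the YAML config
        self.model_name = model_name
        self.max_new_tokens = max_new_tokens
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        # Decoder-only models need left padding for batched generation
        self.tokenizer.padding_side = "left"
        self.model = AutoModelForCausalLM.from_pretrained(model_name,
                                                          device_map="auto",
                                                          dtype=torch.float16)
        self.model.resize_token_embeddings(len(self.tokenizer))

    def collate_fn(self, inp):
        return self.tokenizer(
            inp,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )

    def generate(self, inp):
        input_ids = inp["input_ids"].to(self.model.device)
        outputs = self.model.generate(
            input_ids=input_ids,
            attention_mask=inp["attention_mask"].to(self.model.device),
            max_new_tokens=self.max_new_tokens,
            do_sample=True,
            temperature=0.7,
            pad_token_id=self.tokenizer.pad_token_id,
        )
        # Return only the generated continuation, not the echoed prompt
        return self.tokenizer.batch_decode(outputs[:, input_ids.shape[1]:], skip_special_tokens=True)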

    def prediction_step(self, model, model_input, label_ids=None):
        output = model(**model_input, labels=label_ids)
        return output.logits, output.loss

"""

os.makedirs("./bergen/models/generators/", exist_ok=True)
with open("./bergen/models/generators/new_generator.py", "w") as f:
  f.write(new_generator)

Add config yaml to `config/generator`

In [None]:
new_generator_config = """
init_args:
  _target_: models.generators.new_generator.NewGenerator
  model_name: "gpt2"
  max_new_tokens: 128
batch_size: 32
max_inp_length: null
"""

os.makedirs("./bergen/config/generator/", exist_ok=True)
with open("./bergen/config/generator/new_generator.yaml", "w") as f:
  f.write(new_generator_config)

<br><br>

Other:

**Dataset**
- inherit from `modules.dataset_processor.Processor`
- needed methods:
  - `__init__(self, *args, **kwargs)`
  - `process(self)`
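
No dataset class is defined in this notebook, so here is a rough sketch of the shape such a processor could take. It assumes, without confirmation from the BERGEN source, that `process()` should yield records with an `id` and a `content` field (the retriever's `collate_fn` above reads `sample["content"]`). `ProcessorStub` is a hypothetical stand-in for `modules.dataset_processor.Processor` so that the example runs on its own; in the real repository you would inherit from the BERGEN class, and the expected return type may well be a Hugging Face dataset rather than a list.

```python
class ProcessorStub:
    """Hypothetical stand-in for modules.dataset_processor.Processor."""

    def __init__(self, *args, **kwargs):
        self.split = kwargs.get("split", "full")


class NewDataset(ProcessorStub):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Toy corpus reused from the example RAG model in this notebook
        self.raw_documents = [
            "The Eiffel Tower is located in Paris.",
            "The Pythagorean theorem describes the relationship between the sides of a right triangle.",
            "The capital of Germany is Berlin.",
        ]

    def process(self):
        # Assumed record layout: one dict per document with 'id' and 'content'
        return [
            {"id": str(i), "content": text}
            for i, text in enumerate(self.raw_documents)
        ]


records = NewDataset(split="full").process()
print(records[0])
```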




Add config yaml to `config/dataset`

In [None]:
new_dataset_config = """
test:
  doc: null
  query: null
dev:
  doc:
    init_args:
      _target_: modules.dataset_processor.NewDataset
      split: "full"
  query:
    init_args:
      _target_: modules.dataset_processor.KILTNQProcessor
      split: "validation"
train:
  doc: null
  query: null
"""

with open("./bergen/config/dataset/new_config.yaml", "w") as f:
  f.write(new_dataset_config)

<br><br>


**Prompt**


In [None]:
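# Raw string: keeps \n and "\ " as literal text in the YAML file instead of letting Python interpret them
new_prompt_config = r"""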
system: "You are a helpful assistant. Your task is to extract relevant information from the provided documents and to answer questions accordingly."
user: f"Background:\ {docs}\n\nQuestion:\ {question}\nAnswer:"
system_without_docs: "You are a helpful assistant."
user_without_docs: f"Question:\ {question}\nAnswer:"
"""

with open("./bergen/config/prompt/new_prompt.yaml", "w") as f:
  f.write(new_prompt_config)

### 2. Evaluate your model with BERGEN

In [None]:
!python ./bergen/bergen.py retriever='new_retriever' \
                           reranker='new_reranker' \
                           generator='new_generator' \
                           dataset='kilt_hotpotqa'