# RAG Evaluation with [BERGEN Benchmark](https://github.com/naver/bergen/)

BERGEN ships with out-of-the-box models and datasets and also accepts custom ones. Components can be mixed: for example, a custom reranker can be combined with the built-in retriever and generator. Custom components do require many dependencies and some additional code.

> Warning: Installation and dependency management are complex. We will use the sandbox environment's pre-installed Python 3.11 and `pip` for a clean setup.

- [Python Env Setup](#python-env-setup)
- [Example RAG Model](#example-rag-model)
- [Evaluation with BERGEN](#evaluation-with-bergen)
  - 1. Defining our Model in BERGEN Repo
    - Classes + Configs
  - 2. Evaluate your model with Bergen

<br><br>

---



### Python Env Setup

**Note:** The following steps assume you are running this notebook in the `/home/ubuntu/` directory, where the `bergen` repository has been cloned.

1. **Clone Repository (Already done in the sandbox):**
    ```bash
    git clone https://github.com/naver/bergen.git ./bergen
    ```

2. **Create Anaconda Env:**
    ```bash
    conda create -n bergen-fixed python=3.11 -y
    conda activate bergen-fixed
    ```

3. **Install Dependencies (Already done in the sandbox):**
    The following command installs the necessary packages, including `torch`, `transformers`, `faiss-cpu`, `datasets`, `hydra-core`, `omegaconf`, `pytrec_eval`, and `torchinfo`, plus `jupyter` and `ipykernel` for the notebook kernel.
    ```bash
    pip3 install torch transformers faiss-cpu datasets hydra-core omegaconf pytrec_eval torchinfo jupyter ipykernel
    ```

**Please ensure you select the `Python 3.11 (bergen)` kernel for this notebook.**

In [5]:
!nvidia-smi | sed "/Processes/,$d"

Sun Nov 30 11:41:21 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 581.29                 Driver Version: 581.29         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4060      WDDM  |   00000000:26:00.0  On |                  N/A |
|  0%   36C    P8            N/A  /  115W |    2745MiB /   8188MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

### Example RAG Model

In [6]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import faiss
from torchinfo import summary
import numpy as np
import os

In [7]:
embedding_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
embedding_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Use torch.float32 for CPU/general compatibility, as float16 might cause issues on CPU or without a dedicated GPU.
embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", dtype=torch.float32)
embedding_model.resize_token_embeddings(len(embedding_tokenizer))

Embedding(30522, 384, padding_idx=0)

In [8]:
# Removed tokenizer summary as it is not a torch.nn.Module
print("Embedding Tokenizer loaded successfully.")

Embedding Tokenizer loaded successfully.


In [9]:
dummy_input = embedding_tokenizer("This is a test sentence.", return_tensors="pt")
summary(
    embedding_model,
    # torchinfo accepts a tensor, list, or plain dict; a BatchEncoding is none of these
    input_data=dict(dummy_input),
    verbose=0
)

In [10]:
example_documents = [
    "The Eiffel Tower is located in Paris.",
    "The Pythagorean theorem describes the relationship between the sides of a right triangle.",
    "The capital of Germany is Berlin.",
]

In [11]:
def encode(model, tokenizer, texts):
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    # Ensure model is in evaluation mode
    model.eval()
    with torch.no_grad():
        outputs = model(**tokens)
        # Mean pooling of the last hidden state for sentence embedding
        embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.cpu().numpy()
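
Note that `encode` averages over every token position, so padded positions contribute whenever the texts in a batch differ in length. A minimal sketch of attention-mask-weighted mean pooling follows, written in NumPy for clarity. `masked_mean_pool` is a hypothetical helper, not part of this notebook or of `transformers`, and the toy arrays are made up for illustration.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: float array of shape (batch, seq_len, dim)
    attention_mask: 0/1 array of shape (batch, seq_len)
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                 # avoid division by zero
    return summed / counts

# Toy batch: one sequence with two real tokens and one padded position.
hidden = np.array([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With a PyTorch model, the same idea would apply to `outputs.last_hidden_state` and `tokens["attention_mask"]`.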

In [12]:
doc_embeddings = encode(embedding_model, embedding_tokenizer, example_documents)

Build FAISS Index (our "database")

In [13]:
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(doc_embeddings.astype('float32')) # FAISS requires float32
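
`faiss.IndexFlatL2` performs an exhaustive search and reports squared L2 distances. To make clear what the index computes, here is a rough NumPy equivalent. `flat_l2_search` is a hypothetical name used only for this sketch, and the vectors are toy values rather than real embeddings.

```python
import numpy as np

def flat_l2_search(stored, queries, k):
    """Exhaustive nearest-neighbour search by squared L2 distance."""
    diff = queries[:, None, :] - stored[None, :, :]   # (n_queries, n_stored, dim)
    sq_dist = (diff ** 2).sum(axis=-1)                # (n_queries, n_stored)
    order = np.argsort(sq_dist, axis=1)[:, :k]        # k closest per query
    return np.take_along_axis(sq_dist, order, axis=1), order

stored = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype=np.float32)
queries = np.array([[0.9, 0.0]], dtype=np.float32)
distances, indices = flat_l2_search(stored, queries, k=2)
print(distances, indices)
```

This is only practical for small collections; it is meant to explain the result shape (`distances` and `indices`, one row per query), not to replace FAISS.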

Load a language model (decoder)

In [15]:
model_name = "gpt2"  # "distilgpt2"
generator_tokenizer = AutoTokenizer.from_pretrained(model_name)
generator_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# Use torch.float32 for CPU/general compatibility
generator_model = AutoModelForCausalLM.from_pretrained(model_name,
                                             #device_map="auto",
                                             dtype=torch.float32)
generator_model.resize_token_embeddings(len(generator_tokenizer))

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Embedding(50258, 768)

In [16]:
# Removed tokenizer summary as it is not a torch.nn.Module
print("Generator Tokenizer loaded successfully.")

Generator Tokenizer loaded successfully.


In [17]:
dummy_input = generator_tokenizer("This is a test sentence.", return_tensors="pt")
summary(
    generator_model,
    # torchinfo accepts a tensor, list, or plain dict; a BatchEncoding is none of these
    input_data=dict(dummy_input),
    verbose=0
)

In [18]:
print(f"Generator model device: {generator_model.device}")

Generator model device: cpu


RAG Method

In [19]:
def rag_answer(query, given_passages, k=2):
    # Create prompt + docs embedding
    embedded_prompt = encode(embedding_model, embedding_tokenizer, [query])[0]
    given_passages_embedded = encode(embedding_model, embedding_tokenizer, given_passages)

    # Convert embeddings to float32 numpy arrays
    prompt_vec = embedded_prompt.astype(np.float32).reshape(1, -1)
    passage_vecs = given_passages_embedded.astype(np.float32)

    # Build index
    dim = passage_vecs.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(passage_vecs)

    # Retrieve top-k docs
    distances, indices = index.search(prompt_vec, k)
    retrieved = [given_passages[i] for i in indices[0]]

    # Build the final prompt for generation
    context_text = "\n".join(retrieved)
    prompt = (
        f"Use the following context to answer the question.\n\n"
        f"Context:\n{context_text}\n\n"
        f"Question: {query}\n\n"
        f"Answer:"
    )

    # Generate answer
    input_ids = generator_tokenizer.encode(prompt, return_tensors="pt")
    # Move to model device if not already there
    input_ids = input_ids.to(generator_model.device)

    # Set pad_token_id to eos_token_id for open-ended generation if it's not set
    if generator_tokenizer.pad_token_id is None:
        generator_tokenizer.pad_token_id = generator_tokenizer.eos_token_id

    output = generator_model.generate(
        input_ids,
        max_length=input_ids.shape[-1] + 50, # Generate up to 50 new tokens
        num_return_sequences=1,
        do_sample=True,
        temperature=0.7,
        pad_token_id=generator_tokenizer.pad_token_id
    )

    generated_text = generator_tokenizer.decode(output[0], skip_special_tokens=True)
    # Remove the prompt part from the generated text
    answer = generated_text[len(prompt):].strip()

    return answer, retrieved

query = "Where is the famous tower in Paris?"
answer, retrieved_docs = rag_answer(query, example_documents)

print(f"Query: {query}")
print(f"Retrieved Documents: {retrieved_docs}")
print(f"Generated Answer: {answer}")

Query: Where is the famous tower in Paris?
Retrieved Documents: ['The Eiffel Tower is located in Paris.', 'The Pythagorean theorem describes the relationship between the sides of a right triangle.']
Generated Answer: It is in the middle of the city.

The Pythagorean theorem is one of the most famous facts about the architecture of the United States. The Eiffel Tower is one of the most famous buildings in the world and is famous for
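
Because `k=2` always returns two passages, the unrelated Pythagorean sentence ends up in the context. One possible mitigation is to drop passages whose distance exceeds a threshold before building the prompt. The sketch below assumes a hypothetical helper `filter_by_distance` and a hypothetical `max_distance` value; the distances are illustrative, not measured from the run above.

```python
def filter_by_distance(passages, distances, indices, max_distance):
    """Keep retrieved passages whose distance is at most max_distance."""
    return [
        passages[i]
        for dist, i in zip(distances, indices)
        if i >= 0 and dist <= max_distance  # FAISS uses -1 for missing results
    ]

passages = [
    "The Eiffel Tower is located in Paris.",
    "The Pythagorean theorem describes the relationship between the sides of a right triangle.",
    "The capital of Germany is Berlin.",
]
# Illustrative values standing in for distances[0] and indices[0] from index.search.
example_distances = [0.4, 1.6]
example_indices = [0, 1]
kept = filter_by_distance(passages, example_distances, example_indices, max_distance=1.0)
print(kept)
```

A suitable threshold depends on the embedding model and would need to be tuned on real queries.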


---

## Evaluation with BERGEN

### 1. Defining our Model in BERGEN Repo

**Note:** We will create the necessary files in the cloned `bergen` directory.

**Retriever**
- inherit from `models.retrievers.retriever.Retriever`
- needed methods:
  - `collate_fn(self, batch, query_or_doc=None)`
  - `__call__(self, kwargs)`

In [20]:
new_retriever = """
from models.retrievers.retriever import Retriever
from transformers import AutoTokenizer, AutoModel
import torch

class NewRetriever(Retriever):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2", **kwargs):
        # Call super().__init__ if the base class requires it, but for a simple custom retriever, it might be optional.
        # super().__init__(**kwargs)
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        # Use float32 for compatibility
        self.model = AutoModel.from_pretrained(model_name, dtype=torch.float32)
        self.model.resize_token_embeddings(len(self.tokenizer))

    def collate_fn(self, batch, query_or_doc=None):
        # This is a simplified collate_fn for demonstration
        return self.tokenizer(
            batch,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )

    def __call__(self, kwargs):
        # This is a simplified __call__ for demonstration
        tokens = kwargs['input_ids']
        attention_mask = kwargs['attention_mask']
        
        self.model.eval()
        with torch.no_grad():
            outputs = self.model(input_ids=tokens, attention_mask=attention_mask)
            # Mean pooling
            embeddings = outputs.last_hidden_state.mean(dim=1)
        
        # BERGEN expects a dictionary with 'embedding' key
        return {'embedding': embeddings.cpu().numpy()}
"""

os.makedirs("./bergen/models/retrievers/", exist_ok=True)
with open("./bergen/models/retrievers/new_retriever.py", "w") as f:
  f.write(new_retriever)

In [21]:
new_retriever_config = """
init_args:
  _target_: models.retrievers.new_retriever.NewRetriever
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
batch_size: 2048
"""

os.makedirs("./bergen/config/retriever/", exist_ok=True)
with open("./bergen/config/retriever/new_retriever.yaml", "w") as f:
  f.write(new_retriever_config)
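
The `_target_` key is a Hydra convention: `hydra.utils.instantiate` imports the dotted path and calls it with the remaining keys as keyword arguments, which is presumably how BERGEN turns `init_args` into a `NewRetriever` object. The following is a simplified sketch of that idea, not Hydra's actual implementation. `instantiate_from_target` is a hypothetical helper, and a standard-library class stands in for the retriever so the example runs without BERGEN.

```python
import importlib

def instantiate_from_target(init_args):
    """Import the class named by `_target_` and call it with the remaining keys."""
    kwargs = dict(init_args)
    module_path, _, class_name = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)

# Stand-in target from the standard library (assumption for illustration only).
example = instantiate_from_target({"_target_": "datetime.timedelta", "hours": 2})
print(example.total_seconds())
```

This also explains why the module path in `_target_` must match the file location: `models.retrievers.new_retriever.NewRetriever` has to be importable from the BERGEN root.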

<br><br>

**Reranker**
- inherit from `models.rerankers.reranker.Reranker`
- needed methods:
  - `collate_fn(self, batch, query_or_doc=None)`
  - `__call__(self, kwargs)`

In [22]:
new_reranker = """
from models.rerankers.reranker import Reranker

class NewReranker(Reranker):
    def __init__(self, model_name=None):
        # BERGEN expects model_name to be set
        self.model_name = 'no_reranker'

    def collate_fn(self, batch, query_or_doc=None):
        # No-op collate function
        return batch

    def __call__(self, kwargs):
        # No-op reranker, returns the input kwargs as is
        return kwargs
"""

os.makedirs("./bergen/models/rerankers/", exist_ok=True)
with open("./bergen/models/rerankers/new_reranker.py", "w") as f:
  f.write(new_reranker)

Add config yaml to `config/reranker`

In [23]:
new_reranker_config = """
init_args:
  _target_: models.rerankers.new_reranker.NewReranker
  model_name: "new_reranker"
batch_size: 2048
"""

os.makedirs("./bergen/config/reranker/", exist_ok=True)
with open("./bergen/config/reranker/new_reranker.yaml", "w") as f:
  f.write(new_reranker_config)

<br><br>

**Generator**
- inherit from `models.generators.generator.Generator`
- needed methods:
  - `collate_fn(self, inp)`
  - `generate(self, inp)`
  - `prediction_step(self, model, model_input, label_ids=None)`

In [26]:
new_generator = """
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from models.generators.generator import Generator

class NewGenerator(Generator):
    def __init__(self, model_name="gpt2", max_new_tokens=128):
        self.model_name = model_name
        self.max_new_tokens = max_new_tokens
        # Left padding keeps each prompt adjacent to its generated tokens in a batch
        self.tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
        self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        # Use float32 for compatibility
        self.model = AutoModelForCausalLM.from_pretrained(model_name,
                                                          # device_map="auto",
                                                          dtype=torch.float32)
        self.model.resize_token_embeddings(len(self.tokenizer))
        # Set pad_token_id for generation
        if self.tokenizer.pad_token_id is None:
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def collate_fn(self, inp):
        # inp is a list of strings (prompts)
        return self.tokenizer(
            inp,
            padding=True,
            truncation=True,
            return_tensors="pt",
            pad_to_multiple_of=8 # Optimization for some hardware
        )

    def generate(self, inp):
        # inp is the output of collate_fn (a dict of tensors)
        # Move tensors to model device
        input_ids = inp["input_ids"].to(self.model.device)
        attention_mask = inp["attention_mask"].to(self.model.device)
        
        outputs = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=self.max_new_tokens, # Use config value
            do_sample=True,
            temperature=0.7,
            pad_token_id=self.tokenizer.pad_token_id
        )
        
        # Decode only the newly generated part
        decoded_outputs = []
        for i, output in enumerate(outputs):
            # Slice to get only the generated tokens (after the input prompt)
            generated_tokens = output[input_ids.shape[1]:]
            decoded_text = self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
            decoded_outputs.append(decoded_text.strip())
            
        return decoded_outputs

    def prediction_step(self, model, model_input, label_ids=None):
        # This method is typically for training/validation, but required by the base class.
        # We can simplify it for a basic evaluation setup.
        output = model(**model_input, labels=label_ids)
        return output.logits, output.loss

"""

os.makedirs("./bergen/models/generators/", exist_ok=True)
with open("./bergen/models/generators/new_generator.py", "w") as f:
  f.write(new_generator)
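
For batched generation with a decoder-only model such as GPT-2, prompts are usually padded on the left, so that each prompt ends directly where generation starts and slicing off the first `input_ids.shape[1]` tokens leaves only new tokens. The sketch below illustrates the difference on plain token-id lists. `pad_batch` is a hypothetical helper and the ids are made up.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-id lists to equal length on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        fill = [pad_id] * (width - len(seq))
        padded.append(fill + seq if side == "left" else seq + fill)
    return padded

batch = [[11, 12, 13], [21]]
left = pad_batch(batch, pad_id=0, side="left")
right = pad_batch(batch, pad_id=0, side="right")
print(left)   # the short prompt ends at the last position
print(right)  # the short prompt is followed by padding
```

With right padding, the model would continue the second prompt after two pad tokens, which tends to degrade the output for the shorter prompts in a batch.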

Add config yaml to `config/generator`

In [25]:
new_generator_config = """
init_args:
  _target_: models.generators.new_generator.NewGenerator
  model_name: "gpt2"
  max_new_tokens: 128
batch_size: 32
max_inp_length: null
"""

os.makedirs("./bergen/config/generator/", exist_ok=True)
with open("./bergen/config/generator/new_generator.yaml", "w") as f:
  f.write(new_generator_config)

<br><br>

**Dataset**
- inherit from `modules.dataset_processor.Processor`
- needed methods:
  - `__init__(self, *args, **kwargs)`
  - `process(self)`

**Note:** No custom `NewDataset` class is implemented in this notebook. The dataset config below is kept for completeness and points to `modules.dataset_processor.KILTNQProcessor` as a placeholder target that can be replaced with a custom dataset class later. The evaluation run itself uses the built-in `kilt_nq` dataset config.

In [27]:
new_dataset_config = """
test:
  doc: null
  query: null
dev:
  doc:
    init_args:
      _target_: modules.dataset_processor.KILTNQProcessor # Placeholder, replace with your custom class if needed
      split: "validation"
  query:
    init_args:
      _target_: modules.dataset_processor.KILTNQProcessor # Placeholder, replace with your custom class if needed
      split: "validation"
train:
  doc: null
  query: null
"""

os.makedirs("./bergen/config/dataset/", exist_ok=True)
with open("./bergen/config/dataset/new_config.yaml", "w") as f:
  f.write(new_dataset_config)

<br><br>


**Prompt**
This config defines the prompt template for the generator.

In [28]:
# Raw string: keep "\n" as a two-character escape so YAML, not Python, interprets it
new_prompt_config = r"""
system: "You are a helpful assistant. Your task is to extract relevant information from the provided documents and to answer questions accordingly."
user: "Background:\n{docs}\n\nQuestion:\n{question}\nAnswer:"
system_without_docs: "You are a helpful assistant."
user_without_docs: "Question:\n{question}\nAnswer:"
"""

os.makedirs("./bergen/config/prompt/", exist_ok=True)
with open("./bergen/config/prompt/new_prompt.yaml", "w") as f:
  f.write(new_prompt_config)
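
When YAML is embedded in a Python string literal, escape handling happens twice: first by Python, then by the YAML parser. In a regular triple-quoted string, Python turns `\n` into a real line break before the file is written, and a YAML double-quoted scalar folds a single line break into a space, so the parsed template may not contain the intended newlines. A raw string writes the two characters `\` and `n` to the file and leaves the escape to YAML. The short sketch below shows only the Python side of this; the template line is shortened for illustration.

```python
# Regular string: Python converts "\n" into real line breaks.
cooked = """user: "Question:\n{question}\nAnswer:" """

# Raw string: the backslash and the "n" are written to the file unchanged.
raw = r"""user: "Question:\n{question}\nAnswer:" """

cooked_lines = cooked.splitlines()
raw_lines = raw.splitlines()
print(len(cooked_lines), len(raw_lines))
```

Inspecting the generated `new_prompt.yaml` in a text editor is a quick way to confirm which form was written.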

### 2. Evaluate your model with Bergen

Run the BERGEN evaluation script using the custom retriever, reranker, generator, and prompt together with the built-in `kilt_nq` dataset.
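
Before launching a long run, it can help to confirm that every config named in the overrides exists on disk. The sketch below assumes the `config/<group>/<name>.yaml` layout used in the cells above. `missing_configs` is a hypothetical helper, not part of BERGEN.

```python
from pathlib import Path

def missing_configs(repo_root, overrides):
    """Return config files that the overrides refer to but that do not exist."""
    config_dir = Path(repo_root) / "config"
    expected = [config_dir / group / f"{name}.yaml" for group, name in overrides.items()]
    return [str(path) for path in expected if not path.is_file()]

overrides = {
    "retriever": "new_retriever",
    "reranker": "new_reranker",
    "generator": "new_generator",
    "dataset": "kilt_nq",
    "prompt": "new_prompt",
}
# An empty list means all referenced config files were found.
print(missing_configs("./bergen", overrides))
```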

In [34]:
# Use the custom configs we just created
!python ./bergen/bergen.py retriever=new_retriever \
                           reranker=new_reranker \
                           generator=new_generator \
                           dataset=kilt_nq \
                           prompt=new_prompt

Unfinished experiment_folder: experiments/tmp_9f14019e8482790a
experiment_folder experiments/9f14019e8482790a
run_name: null
dataset_folder: datasets/
index_folder: indexes/
runs_folder: runs/
generated_query_folder: generated_queries/
processed_context_folder: processed_contexts/
experiments_folder: experiments/
retrieve_top_k: 50
rerank_top_k: 50
generation_top_k: 5
pyserini_num_threads: 20
processing_num_proc: 40
retriever:
  init_args:
    _target_: models.retrievers.new_retriever.NewRetriever
    model_name: sentence-transformers/all-MiniLM-L6-v2
  batch_size: 2048
reranker:
  init_args:
    _target_: models.rerankers.new_reranker.NewReranker
    model_name: new_reranker
  batch_size: 2048
generator:
  init_args:
    _target_: models.generators.new_generator.NewGenerator
    model_name: gpt2
    max_new_tokens: 128
  batch_size: 32
  max_inp_length: null
dataset:
  train:
    doc:
      init_args:
        _target_: modules.dataset_processor.KILT100w
        split: full
    query:


Error executing job with overrides: ['retriever=new_retriever', 'reranker=new_reranker', 'generator=new_generator', 'dataset=kilt_nq', 'prompt=new_prompt']

Traceback (most recent call last):
  File "d:\Informatik\Projekte\RAG_Evaluation\bergen\bergen.py", line 18, in main
    rag = RAG(**config, config=config)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Informatik\Projekte\RAG_Evaluation\bergen\modules\rag.py", line 160, in __init__
    self.datasets = ProcessDatasets.process(
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Informatik\Projekte\RAG_Evaluation\bergen\modules\dataset_processor.py", line 674, in process
    dataset = processor.get_dataset()
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "d:\Informatik\Projekte\RAG_Evaluation\bergen\modules\dataset_processor.py", line 92, in get_dataset
    dataset = self.process()
              ^^^^^^^^^^^^^^
  File "d:\Informatik\Projekte\RAG_Evaluation\bergen\modules\dataset_processor.py", line 305, in process
    dataset = 

In [32]:
import pyarrow
print(f"pyarrow version: {pyarrow.__version__}")

pyarrow version: 22.0.0


In [33]:
# These commands are for post-evaluation analysis, which you can run after the main evaluation is complete.
# !python evaluate.py --experiments_folder experiments/ --llm_batch_size 16 --split 'dev' --llm vllm_SOLAR-107B
# !python print_results.py --folder experiments/ --format=tiny
print("Post-evaluation analysis commands are ready to run after the main BERGEN evaluation.")

Post-evaluation analysis commands are ready to run after the main BERGEN evaluation.
