# RAG System Evaluation with the BERGEN Benchmark

- [Python Env Setup](#python-env-setup)
  - install anaconda
  - install BERGEN
  - setup python env
- [Example RAG Model](#example-rag-model)
  - Retriever: Embedding + Indexing (Database) (+ example data)
  - Reranker (we don't use one)
  - Generator: Tokenizer + LLM
- [Evaluation with BERGEN](#evaluation-with-bergen)
  - 1. Defining our Model in BERGEN Repo
    - Classes + Configs
  - 2. Evaluate your model with BERGEN

<br><br>

---

<br><br>


### Python Env Setup

Installing Anaconda

In [1]:
# !wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
# !bash Anaconda3-2022.05-Linux-x86_64.sh -y -b -f -p /usr/local

--2025-11-28 20:14:47--  https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
Resolving repo.anaconda.com (repo.anaconda.com)... 104.16.32.241, 104.16.191.158, 2606:4700::6810:bf9e, ...
Connecting to repo.anaconda.com (repo.anaconda.com)|104.16.32.241|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 690850711 (659M) [application/x-sh]
Saving to: ‘Anaconda3-2022.05-Linux-x86_64.sh’


2025-11-28 20:14:50 (225 MB/s) - ‘Anaconda3-2022.05-Linux-x86_64.sh’ saved [690850711/690850711]



Creating Python Env with Requirements

In [4]:
!conda create -n bergen python=3.10 -y

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.12.0
  latest version: 25.11.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /usr/local/envs/bergen

  added / updated specs:
    - python=3.10


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _openmp_mutex-5.1          |            1_gnu          21 KB
    bzip2-1.0.8                |       h5eee18b_6         262 KB
    ca-certificates-2025.11.4  |       h06a4308_0         128 KB
    expat-2.7.3                |       h3385a95_0         167 KB
    ld_impl_linux-64-2.44      |       h153f514_2         672 KB
    libffi-3.4.4   

In [None]:
!pip install ipykernel jupyter notebook traitlets tornado ipython
!pip install matplotlib numpy scipy pandas

In [5]:
# !source ~/.bashrc
# !conda init bash
# !source ~/.bashrc
# !conda activate bergen

In [6]:
!conda info --envs

# conda environments:
#
base                  *  /usr/local
bergen                   /usr/local/envs/bergen



In [7]:
!conda env list

# conda environments:
#
base                  *  /usr/local
bergen                   /usr/local/envs/bergen



In [8]:
RAG_ENV = "/usr/local/envs/bergen"
!ls $RAG_ENV
!$RAG_ENV/bin/python --version
!$RAG_ENV/bin/pip --version

bin		 conda-meta  lib  share  x86_64-conda-linux-gnu
compiler_compat  include     man  ssl
Python 3.10.19
pip 25.3 from /usr/local/envs/bergen/lib/python3.10/site-packages/pip (python 3.10)


Cloning Benchmark Repo

In [None]:
!git clone https://github.com/naver/bergen.git

Cloning into 'bergen'...
remote: Enumerating objects: 3749, done.
remote: Counting objects: 100% (1363/1363), done.
remote: Compressing objects: 100% (313/313), done.
remote: Total 3749 (delta 1244), reused 1050 (delta 1050), pack-reused 2386 (from 2)
Receiving objects: 100% (3749/3749), 142.41 MiB | 9.17 MiB/s, done.
Resolving deltas: 100% (2653/2653), done.
Updating files: 100% (621/621), done.


In [9]:
!pip install torch
!pip install packaging
!pip install ninja
!pip install flash-attn --no-build-isolation   # skip on V100: FlashAttention-2 needs an Ampere or newer GPU
!pip install vllm
!pip install -r ./bergen/requirements.txt

Collecting torch
  Downloading torch-2.9.1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (30 kB)
Collecting filelock (from torch)
  Downloading filelock-3.20.0-py3-none-any.whl.metadata (2.1 kB)
Collecting typing-extensions>=4.10.0 (from torch)
  Downloading typing_extensions-4.15.0-py3-none-any.whl.metadata (3.3 kB)
Collecting sympy>=1.13.3 (from torch)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting networkx>=2.5.1 (from torch)
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec>=0.8.5 (from torch)
  Downloading fsspec-2025.10.0-py3-none-any.whl.metadata (10 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.8.93 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cuda-runtime-cu12==12.8.90 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.8.90-p

Install dependencies for our RAG

In [10]:
!pip install transformers sentence-transformers faiss-cpu accelerate
# !pip install nbformat datasets transformers omegaconf pandas scikit-learn tqdm

Collecting sentence-transformers
  Downloading sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.13.0-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (7.7 kB)
Downloading sentence_transformers-5.1.2-py3-none-any.whl (488 kB)
Downloading faiss_cpu-1.13.0-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (23.6 MB)
Installing collected packages: faiss-cpu, sentence-transformers
Successfully installed faiss-cpu-1.13.0 sentence-transformers-5.1.2


Use installed env

In [12]:
!python --version

import os
# Put the env's bin directory first so that `python` and `pip` resolve to it
os.environ["PATH"] = f"{RAG_ENV}/bin:{os.environ['PATH']}"

!python --version

Python 3.9.12
Python 3.10.19


In [13]:
!$RAG_ENV/bin/python -m ipykernel install --user --name custom-env --display-name "Python (Custom ENV)"

Installed kernelspec custom-env in /root/.local/share/jupyter/kernels/custom-env


Now:
- Go to `Runtime` > `Change runtime type` > `Kernel`
- Select `Python (Custom ENV)`
- Restart runtime (most likely automatically, if not: `Runtime` > `Restart Runtime`)

In [14]:
raise Exception("Stop here and follow the instructions above. After the runtime restarts, continue with the cells below.")

Exception: Stop here and follow the instructions above. After the runtime restarts, continue with the cells below.
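
After the restart, it is worth confirming that the notebook really runs on the new kernel. A minimal check, assuming the environment lives at `/usr/local/envs/bergen` as in the `conda env list` output above (the helper name `running_in_env` is illustrative):

```python
import sys

def running_in_env(env_prefix: str) -> bool:
    """Return True if the current interpreter belongs to the given environment prefix."""
    return sys.prefix.rstrip("/") == env_prefix.rstrip("/")

RAG_ENV = "/usr/local/envs/bergen"  # path taken from the `conda env list` output
print("Interpreter:", sys.executable)
print("Python version:", ".".join(map(str, sys.version_info[:3])))
print("Running in bergen env:", running_in_env(RAG_ENV))
```

If the last line prints `False`, the kernel switch did not take effect and the imports below will use the base environment's packages.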

### Example RAG Model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer
import faiss

In [None]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
example_documents = [
    "The Eiffel Tower is located in Paris.",
    "The Pythagorean theorem describes the relationship between the sides of a right triangle.",
    "The capital of Germany is Berlin.",
]

In [None]:
doc_embeddings = embedding_model.encode(example_documents, convert_to_numpy=True)

Build FAISS Index (our "database")

In [None]:
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(doc_embeddings)

# save for later
faiss.write_index(index, "/content/my_index.faiss")
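
`IndexFlatL2` ranks documents by squared Euclidean distance. For unit-length vectors this yields the same ranking as the dot product, because ||a - b||^2 = 2 - 2 (a . b). The sketch below checks that identity on toy 2-D vectors in plain Python. Whether it applies to this index depends on the embedding model returning normalised vectors, which is an assumption worth checking on `doc_embeddings` (for example by computing the row norms).

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for real embeddings
query = normalize([1.0, 0.2])
docs = [normalize(v) for v in ([0.9, 0.1], [0.0, 1.0], [0.5, 0.5])]

# Smallest distance first vs. largest dot product first
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_dot = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
print("ranking by L2: ", by_l2)
print("ranking by dot:", by_dot)
```

If the embeddings are not normalised, the two rankings can differ, and `faiss.IndexFlatIP` would be the index type that matches a dot-product similarity.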

Load a language model (decoder)

In [None]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

RAG Method

In [None]:
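def rag_answer(query, k=2):
    # Embed the query with the same model used for the documents
    embedded_query = embedding_model.encode([query], convert_to_numpy=True)

    # Retrieve the top-k nearest documents (smallest L2 distance first)
    distances, idx = index.search(embedded_query, k)
    retrieved = [example_documents[i] for i in idx[0]]

    # Build the final prompt; join the documents instead of embedding the list repr
    context = "\n".join(retrieved)
    prompt = (
        "Use the following context to answer the given question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

    # Generate; max_new_tokens bounds the answer length independently of the prompt length
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 defines no pad token
    )

    # Decode only the newly generated tokens, not the echoed prompt
    answer_ids = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(answer_ids, skip_special_tokens=True), retrieved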


Example Run

In [None]:
answer, retrieved_docs = rag_answer("Where is the Eiffel Tower located?")
print("Retrieved Docs:", retrieved_docs)
print("\nRAG Answer:\n", answer)

### Evaluation with BERGEN

[See documentation](https://github.com/naver/bergen/blob/main/documentation/extensions.md)

### 1. Defining our Model in BERGEN Repo

You can add a custom:
- Retriever
- Reranker
- Generator
- Dataset

Or you can choose one of the out-of-the-box options.

<br><br>

**Retriever**
- inherit from `models.retrievers.retriever.Retriever`
- needed methods:
  - `collate_fn(self, batch, query_or_doc=None)`
  - `__call__(self, kwargs)`
  - `similarity_fn(self, q_embs, doc_embs)`

In [None]:
new_retriever = """
import faiss
import torch
from models.retrievers.retriever import Retriever
from sentence_transformers import SentenceTransformer

class NewRetriever(Retriever):
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.model = SentenceTransformer(model_name)
        # Index built and saved in the example above
        self.index = faiss.read_index("/content/my_index.faiss")

    def collate_fn(self, batch, query_or_doc=None):
        # Return a dict so that __call__ can read kwargs["content"]
        if isinstance(batch[0], dict):
            batch = [sample["content"] for sample in batch]
        return {"content": batch}

    def __call__(self, kwargs):
        texts = kwargs["content"]
        emb = self.model.encode(texts, convert_to_tensor=True)
        return {"embeddings": emb, "raw_texts": texts}

    def similarity_fn(self, q_embs, doc_embs):
        # Dot-product scores, shape (num_queries, num_docs)
        return torch.matmul(q_embs, doc_embs.T)
"""

with open("./bergen/models/retrievers/new_retriever.py", "w") as f:
  f.write(new_retriever)

Add config yaml to `config/retriever`

In [None]:
new_retriever_config = """
init_args:
  _target_: models.retrievers.new_retriever.NewRetriever
  model_name: "all-MiniLM-L6-v2"
batch_size: 1024
batch_size_sim: 256
"""

with open("./bergen/config/retriever/new_retriever.yaml", "w") as f:
  f.write(new_retriever_config)
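
The `_target_` key follows the Hydra convention: it is a dotted path to a class, and the remaining keys under `init_args` become constructor arguments. The sketch below is a simplified stand-in for `hydra.utils.instantiate`, written with the standard library only. The helper name `instantiate_from_config` is hypothetical, and `fractions.Fraction` merely stands in for a retriever class.

```python
import importlib

def instantiate_from_config(init_args: dict):
    """Resolve the dotted `_target_` path and call it with the remaining keys."""
    kwargs = dict(init_args)  # copy so the caller's dict is left untouched
    module_path, _, class_name = kwargs.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**kwargs)

# Stand-in config: a standard-library class instead of NewRetriever
config = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate_from_config(config)
print(type(obj).__name__, obj)
```

This is why every key next to `_target_` must be an argument the constructor accepts, and why `model_name` has to be a value that `NewRetriever.__init__` can actually load.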

<br><br>

**Reranker**
- inherit from `models.rerankers.reranker.Reranker`
- needed methods:
  - `collate_fn(self, batch, query_or_doc=None)`
  - `__call__(self, kwargs)`

In [None]:
new_reranker = """
from models.rerankers.reranker import Reranker

class NewReranker(Reranker):
    def __init__(self, model_name=None):
        self.model_name = 'no_reranker'

    def collate_fn(self, batch, query_or_doc=None):
        return batch

    def __call__(self, kwargs):
        return kwargs
"""

with open("./bergen/models/rerankers/new_reranker.py", "w") as f:
  f.write(new_reranker)

Add config yaml to `config/reranker`

In [None]:
new_reranker_config = """
init_args:
  _target_: models.rerankers.new_reranker.NewReranker
  model_name: "new_reranker"
batch_size: 2048
"""

with open("./bergen/config/reranker/new_reranker.yaml", "w") as f:
  f.write(new_reranker_config)

<br><br>

**Generator**
- inherit from `models.generators.generator.Generator`
- needed methods:
  - `collate_fn(self, inp)`
  - `generate(self, inp)`
  - `prediction_step(self, model, model_input, label_ids=None)`

In [None]:
new_generator = """
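import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from models.generators.generator import Generator

class NewGenerator(Generator):
    # **kwargs tolerates extra init_args (an assumption about what BERGEN may pass)
    def __init__(self, model_name="gpt2", max_new_tokens=128, **kwargs):
        self.model_name = model_name
        self.max_new_tokens = max_new_tokens
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # GPT-2 has no pad token: reuse EOS and pad on the left for decoder-only generation
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "left"
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def collate_fn(self, inp):
        return self.tokenizer(
            inp,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )

    def generate(self, inp):
        outputs = self.model.generate(
            input_ids=inp["input_ids"],
            attention_mask=inp["attention_mask"],
            max_new_tokens=self.max_new_tokens,
            do_sample=True,
            temperature=0.7,
            pad_token_id=self.tokenizer.pad_token_id,
        )
        # Strip the prompt tokens so only the generated answer is decoded
        new_tokens = outputs[:, inp["input_ids"].shape[1]:]
        return self.tokenizer.batch_decode(new_tokens, skip_special_tokens=True)

    def prediction_step(self, model, model_input, label_ids=None):
        output = model(**model_input, labels=label_ids)
        return output.logits, output.loss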

"""

with open("./bergen/models/generators/new_generator.py", "w") as f:
  f.write(new_generator)

Add config yaml to `config/generator`

In [None]:
new_generator_config = """
defaults:
  - prompt: basic
init_args:
  _target_: models.generators.new_generator.NewGenerator
  model_name: "gpt2"
  max_new_tokens: 128
batch_size: 32
max_inp_length: null
"""

with open("./bergen/config/generator/new_generator.yaml", "w") as f:
  f.write(new_generator_config)

<br><br>

Other:

**Dataset**
- inherit from `modules.dataset_processor.Processor`
- needed methods:
  - `__init__(self, *args, **kwargs)`
  - `process(self)`




Add config yaml to `config/dataset`

In [None]:
new_dataset_config = """
test:
  doc: null
  query: null
dev:
  doc:
    init_args:
      _target_: modules.dataset_processor.NewDataset
      split: "full"
  query:
    init_args:
      _target_: modules.dataset_processor.KILTNQProcessor
      split: "validation"
train:
  doc: null
  query: null
"""

with open("./bergen/config/dataset/new_config.yaml", "w") as f:
  f.write(new_dataset_config)

<br><br>


**Prompt**


In [None]:
# Raw string so that "\n" and "\ " reach the YAML file unchanged
new_prompt_config = r"""
system: "You are a helpful assistant. Your task is to extract relevant information from the provided documents and to answer questions accordingly."
user: f"Background:\ {docs}\n\nQuestion:\ {question}\nAnswer:"
system_without_docs: "You are a helpful assistant."
user_without_docs: f"Question:\ {question}\nAnswer:"
"""

with open("./bergen/config/prompt/new_prompt.yaml", "w") as f:
  f.write(new_prompt_config)
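
The `{docs}` and `{question}` placeholders are filled in at run time. How BERGEN renders them internally is not shown here, so the sketch below only illustrates the idea with `str.format`. The helper `render_prompt` and the "Document i:" numbering are illustrative choices, not part of the benchmark.

```python
USER_TEMPLATE = "Background: {docs}\n\nQuestion: {question}\nAnswer:"
USER_TEMPLATE_WITHOUT_DOCS = "Question: {question}\nAnswer:"

def render_prompt(question, docs=None):
    """Fill the template; fall back to the no-docs variant when nothing was retrieved."""
    if not docs:
        return USER_TEMPLATE_WITHOUT_DOCS.format(question=question)
    numbered = "\n".join(f"Document {i}: {d}" for i, d in enumerate(docs, start=1))
    return USER_TEMPLATE.format(docs=numbered, question=question)

print(render_prompt(
    "Where is the Eiffel Tower located?",
    ["The Eiffel Tower is located in Paris.", "The capital of Germany is Berlin."],
))
```

Printing a rendered prompt like this before a full run is a cheap way to catch escaping mistakes in the template.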

### 2. Evaluate your model with BERGEN

In [None]:
!cd bergen && python bergen.py retriever='new_retriever' \
                  reranker='new_reranker' \
                  generator='new_generator' \
                  dataset='kilt_nq'
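
An evaluation run of this kind compares generated answers with the reference answers of the dataset. As a generic illustration of lexical QA metrics (a sketch, not BERGEN's own implementation), exact match and token-level F1 can be computed like this:

```python
import re
import string
from collections import Counter

def normalize_answer(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize_answer(prediction) == normalize_answer(reference))

def token_f1(prediction, reference):
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris.", "paris"))
print(token_f1("It is located in Paris", "Paris"))
```

Exact match rewards only answers that are identical after normalisation, whereas token F1 gives partial credit to longer answers that contain the reference.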