# RAG Evaluation with [BERGEN Benchmark](https://github.com/naver/bergen/)

BERGEN ships with out-of-the-box models and datasets, and custom models and datasets can be added. Components can also be mixed, for example a custom reranker combined with an out-of-the-box retriever and generator. Using custom components requires many dependencies and some additional code.

- [System Setup](#system-setup)
  - docker setup
  - install BERGEN
- [Example RAG Model](#example-rag-model)
  - Retriever: Embedding + Indexing (Database) (+ example data)
  - Reranker (we don't use one)
  - Generator: Tokenizer + LLM
- [Evaluation with BERGEN](#evaluation-with-bergen)
  - 1. Defining our Model in BERGEN Repo
    - Classes + Configs
  - 2. Evaluate your model with BERGEN

<br><br>

---



### System Setup

Using Docker

You might want to check your CUDA version first:
```bash
nvidia-smi | sed '/Processes/,$d'
```

- `nvidia-smi` -> standard NVIDIA information command
- `|` -> pipe the output to `sed` (a stream editor)
- `sed '/Processes/,$d'` -> delete (`d`) from line containing `Processes` to the end (`$`)
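The same trimming can be done from Python when no shell pipeline is at hand. A minimal sketch, assuming the `nvidia-smi` output is already available as a string; the helper name `drop_from_marker` and the sample text are hypothetical:

```python
def drop_from_marker(text: str, marker: str = "Processes") -> str:
    """Keep the lines before the first line containing `marker`.

    Mirrors `sed '/Processes/,$d'`: the matching line and everything
    after it are dropped. Without a match, the text is kept unchanged.
    """
    kept = []
    for line in text.splitlines():
        if marker in line:
            break
        kept.append(line)
    return "\n".join(kept)


# Shortened, made-up stand-in for real `nvidia-smi` output.
sample = "GPU 0: demo\nMemory: demo\n| Processes: |\n| PID 1234 |"
print(drop_from_marker(sample))
```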

**Build your image:**
1. Open `Docker Desktop` and open its terminal (bottom-right corner)
2. Run:
    ```bash
    cd D:\Informatik\Projekte\RAG_Evaluation
    docker build -t rag-eval .
    ```

**Starting Setup:**<br>
1. Open `Docker Desktop`
2. Start the container:
    ```bash
    docker run -it --rm --gpus all -v .:/workspace -w /workspace rag-eval bash
    ```
3. Attach Visual Studio to the running container (requires the Docker extension)

<br>

[We used this Image](./Dockerfile)

### System Information

In [None]:
import prime_printer as prime
print(prime.get_hardware())


-------------------------------- 
Your Hardware:

    ---> General <---
Operatingsystem: Linux
Version: #1 SMP Tue Nov 5 00:21:55 UTC 2024
Architecture: ('64bit', '')
Processor: x86_64

    ---> GPU <---
GPU Name: NVIDIA GeForce RTX 4060
VRAM Total: 8188 MB
VRAM Used: 1992 MB
Utilization: 3.0 %
PyTorch Support: True (NVIDIA GeForce RTX 4060)
TensorFlow Support: False -> not installed

    ---> CPU <---
CPU-Name: AMD Ryzen 7 3700X 8-Core Processor
CPU Kernels: 8
Logical CPU-Kernels: 16
CPU-Frequence: 0 MHz
CPU-Utilization: 0.9 %

    ---> RAM <---
RAM Total: 15 GB
RAM Available: 10 GB
RAM-Utilization: 31.6 %

--------------------------------


### Example RAG Model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import faiss

In [None]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", dtype=torch.float16)

`torch_dtype` is deprecated! Use `dtype` instead!


TypeError: Unable to convert function return value to a Python type! The signature was
	() -> handle

In [None]:
example_documents = [
    "The Eiffel Tower is located in Paris.",
    "The Pythagorean theorem describes the relationship between the sides of a right triangle.",
    "The capital of Germany is Berlin.",
]

In [None]:
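def encode(model, tokenizer, texts):
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    tokens = {k: v.to(model.device) for k, v in tokens.items()}
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state.float()
    # Mean pooling over real tokens only; padding positions are masked out
    mask = tokens["attention_mask"].unsqueeze(-1).float()
    embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    # FAISS expects float32 arrays
    return embeddings.cpu().numpy()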

In [None]:
doc_embeddings = encode(embedding_model, tokenizer, example_documents)
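Mean pooling turns the per-token vectors into one vector per text. When a batch is padded, the attention mask marks which positions are real tokens, and whether padding is counted changes the average. A toy sketch in plain Python (no model involved; the vectors and the helper name `mean_pool` are made up for illustration):

```python
def mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / max(count, 1) for t in total]


# Two real tokens followed by one padding position (toy 2-d vectors).
tokens = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0]]
mask = [1, 1, 0]

masked = mean_pool(tokens, mask)          # ignores the padding slot
unmasked = mean_pool(tokens, [1, 1, 1])   # treats padding as a token
print(masked, unmasked)
```

The masked result is the average of the two real tokens, while the unmasked one is pulled toward the padding vector.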

Build FAISS Index (our "database")

In [None]:
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(doc_embeddings)

# save for later
faiss.write_index(index, "/content/my_index.faiss")
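`IndexFlatL2` performs an exhaustive search: the query is compared with every stored vector and the `k` smallest squared L2 distances are returned. The idea can be sketched without FAISS; `flat_l2_search` below is an illustrative stand-in, not the FAISS API:

```python
def flat_l2_search(stored, query, k):
    """Return (distances, indices) of the k nearest stored vectors.

    Distances are squared L2 distances, sorted in ascending order.
    """
    scored = []
    for i, vec in enumerate(stored):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    scored.sort()
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
distances, indices = flat_l2_search(stored, [0.9, 1.2], k=2)
print(distances, indices)
```

Because every vector is scanned, this is exact but scales linearly with the number of documents, which is fine for a three-document example.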

Load a language model (decoder)

In [None]:
model_name = "gpt2"  # "distilgpt2"
# Separate name so the embedding tokenizer loaded earlier is not overwritten
gen_tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             torch_dtype=torch.float16)

RAG Method

In [None]:
def rag_answer(query, k=2):
    # Create prompt embedding (uses the embedding model's tokenizer)
    embedded_prompt = encode(embedding_model, tokenizer, [query])

    # Retrieve top-k docs
    distances, idx = index.search(embedded_prompt, k)
    retrieved = [example_documents[i] for i in idx[0]]

    # Build the final prompt for generation
    context = "\n".join(retrieved)
    prompt = (
        "Use the following context to answer the given question.\n\n"
        f"Context: {context}\n\n"
        f"Question: {query}\nAnswer:"
    )

    # Generate (inputs must live on the same device as the model)
    inputs = gen_tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=gen_tokenizer.eos_token_id,
    )

    # Decode output
    return gen_tokenizer.decode(outputs[0], skip_special_tokens=True), retrieved


Example Run

In [None]:
answer, retrieved_docs = rag_answer("Where is the Eiffel Tower located?")
print("Retrieved Docs:", retrieved_docs)
print("\nRAG Answer:\n", answer)
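The prompt is the only place where retrieval and generation meet, so its layout is worth inspecting on its own. A small sketch, assuming the retrieved passages are listed one per line rather than inserted as a Python list; `build_prompt` is a hypothetical helper and no model is called:

```python
def build_prompt(query, passages):
    """Assemble a RAG prompt from a question and retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Use the following context to answer the given question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "Where is the Eiffel Tower located?",
    ["The Eiffel Tower is located in Paris.", "The capital of Germany is Berlin."],
)
print(prompt)
```

Printing the prompt before generating makes it easy to spot formatting problems such as list brackets or missing separators.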

### Evaluation with BERGEN

[See documentation](https://github.com/naver/bergen/blob/main/documentation/extensions.md)

### 1. Defining our Model in BERGEN Repo

You can add a custom:
- Retriever
- Reranker
- Generator
- Dataset

Alternatively, use one of the out-of-the-box components.

<br><br>

**Retriever**
- inherit from `models.retrievers.retriever.Retriever`
- needed methods:
  - `collate_fn(self, batch, query_or_doc=None)`
  - `__call__(self, kwargs)`
  - `similarity_fn(self, q_embs, doc_embs)`

In [None]:
new_retriever = """
import torch
from transformers import AutoTokenizer, AutoModel

from models.retrievers.retriever import Retriever

class NewRetriever(Retriever):
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # Encoder-only embedding model, so AutoModel (not AutoModelForCausalLM)
        self.model = AutoModel.from_pretrained(model_name,
                                               device_map="auto",
                                               torch_dtype=torch.float16)

    def encode(self, texts):
        tokens = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        tokens = {k: v.to(self.model.device) for k, v in tokens.items()}
        with torch.no_grad():
            hidden = self.model(**tokens).last_hidden_state.float()
        # Mean pooling over real tokens only; returns a torch tensor
        mask = tokens["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    def collate_fn(self, batch, query_or_doc=None):
        # Return a dict so that __call__ can read kwargs["content"]
        if isinstance(batch[0], dict):
            batch = [sample["content"] for sample in batch]
        return {"content": batch}

    def __call__(self, kwargs):
        texts = kwargs["content"]
        emb = self.encode(texts)
        # Key names as used in this example; verify against BERGEN's built-in retrievers
        return {"embeddings": emb, "raw_texts": texts}

    def similarity_fn(self, q_embs, doc_embs):
        # Dot-product similarity between query and document embeddings
        return torch.matmul(q_embs, doc_embs.T)
"""

with open("./bergen/models/retrievers/new_retriever.py", "w") as f:
  f.write(new_retriever)

Add config yaml to `config/retriever`

In [None]:
new_retriever_config = """
init_args:
  _target_: models.retrievers.new_retriever.NewRetriever
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
batch_size: 1024
batch_size_sim: 256
"""

with open("./bergen/config/retriever/new_retriever.yaml", "w") as f:
  f.write(new_retriever_config)
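The `_target_` value is a dotted import path: everything before the last dot is the module, the last part is the class. That is why the file location and the class name have to match the config. A short sketch of that mapping; `split_target` is a hypothetical helper that only mimics the naming rule and does not import anything:

```python
def split_target(target: str):
    """Split a dotted `_target_` path into (module, class, file path)."""
    module, _, class_name = target.rpartition(".")
    file_path = module.replace(".", "/") + ".py"
    return module, class_name, file_path


module, class_name, file_path = split_target(
    "models.retrievers.new_retriever.NewRetriever"
)
print(module)      # models.retrievers.new_retriever
print(class_name)  # NewRetriever
print(file_path)   # models/retrievers/new_retriever.py
```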

<br><br>

**Reranker**
- inherit from `models.rerankers.reranker.Reranker`
- needed methods:
  - `collate_fn(self, batch, query_or_doc=None)`
  - `__call__(self, kwargs)`

In [None]:
new_reranker = """
from models.rerankers.reranker import Reranker

class NewReranker(Reranker):
    def __init__(self, model_name=None):
        self.model_name = 'no_reranker'

    def collate_fn(self, batch, query_or_doc=None):
        return batch

    def __call__(self, kwargs):
        return kwargs
"""

with open("./bergen/models/rerankers/new_reranker.py", "w") as f:
  f.write(new_reranker)

Add config yaml to `config/reranker`

In [None]:
new_reranker_config = """
init_args:
  _target_: models.rerankers.new_reranker.NewReranker
  model_name: "new_reranker"
batch_size: 2048
"""

with open("./bergen/config/reranker/new_reranker.yaml", "w") as f:
  f.write(new_reranker_config)

<br><br>

**Generator**
- inherit from `models.generators.generator.Generator`
- needed methods:
  - `collate_fn(self, inp)`
  - `generate(self, inp)`
  - `prediction_step(self, model, model_input, label_ids=None)`

In [None]:
new_generator = """
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from models.generators.generator import Generator

class NewGenerator(Generator):
    def __init__(self, model_name="gpt2", max_new_tokens=128, **kwargs):
        # **kwargs absorbs further config keys that this minimal example does not use
        self.model_name = model_name
        self.max_new_tokens = max_new_tokens
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        # GPT-2 has no pad token; reuse EOS and pad on the left for generation
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.tokenizer.padding_side = "left"
        self.model = AutoModelForCausalLM.from_pretrained(model_name,
                                                          device_map="auto",
                                                          torch_dtype=torch.float16)

    def collate_fn(self, inp):
        return self.tokenizer(
            inp,
            padding=True,
            truncation=True,
            return_tensors="pt"
        )

    def generate(self, inp):
        outputs = self.model.generate(
            input_ids=inp["input_ids"].to(self.model.device),
            attention_mask=inp["attention_mask"].to(self.model.device),
            max_new_tokens=self.max_new_tokens,
            do_sample=True,
            temperature=0.7,
            pad_token_id=self.tokenizer.pad_token_id,
        )
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

    def prediction_step(self, model, model_input, label_ids=None):
        output = model(**model_input, labels=label_ids)
        return output.logits, output.loss

"""

with open("./bergen/models/generators/new_generator.py", "w") as f:
  f.write(new_generator)

Add config yaml to `config/generator`

In [None]:
new_generator_config = """
defaults:
  - prompt: basic
init_args:
  _target_: models.generators.new_generator.NewGenerator
  model_name: "gpt2"
  max_new_tokens: 128
batch_size: 32
max_inp_length: null
"""

with open("./bergen/config/generator/new_generator.yaml", "w") as f:
  f.write(new_generator_config)

<br><br>

Other:

**Dataset**
- inherit from `modules.dataset_processor.Processor`
- needed methods:
  - `__init__(self, *args, **kwargs)`
  - `process(self)`




Add config yaml to `config/dataset`

In [None]:
new_dataset_config = """
test:
  doc: null
  query: null
dev:
  doc:
    init_args:
      _target_: modules.dataset_processor.NewDataset
      split: "full"
  query:
    init_args:
      _target_: modules.dataset_processor.KILTNQProcessor
      split: "validation"
train:
  doc: null
  query: null
"""

with open("./bergen/config/dataset/new_config.yaml", "w") as f:
  f.write(new_dataset_config)

<br><br>


**Prompt**


In [None]:
# Raw string: keeps the backslash sequences as literal characters in the YAML file
new_prompt_config = r"""
system: "You are a helpful assistant. Your task is to extract relevant information from the provided documents and to answer questions accordingly."
user: f"Background:\ {docs}\n\nQuestion:\ {question}\nAnswer:"
system_without_docs: "You are a helpful assistant."
user_without_docs: f"Question:\ {question}\nAnswer:"
"""

with open("./bergen/config/prompt/new_prompt.yaml", "w") as f:
  f.write(new_prompt_config)
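Since the YAML text is written from a Python string literal, escape sequences are interpreted by Python before the file is written. A quick check of the difference between a normal and a raw triple-quoted string (the variable names and the shortened template are illustrative):

```python
normal = """user: "Question: {question}\nAnswer:" """
raw = r"""user: "Question: {question}\nAnswer:" """

# In the normal string, \n has already become a real line break,
# so the value would be split across two lines in the written file.
print(len(normal.splitlines()))  # 2

# In the raw string, the backslash and the letter n are kept as two
# characters, so the value stays on one line.
print(len(raw.splitlines()))     # 1
```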

### 2. Evaluate your model with BERGEN

In [None]:
!python bergen.py retriever='new_retriever' \
                  reranker='new_reranker' \
                  generator='new_generator' \
                  dataset='kilt_nq'
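When several configurations are to be compared, the override list can be generated instead of typed. A sketch that only builds the argument list (nothing is executed); the helper name `bergen_command` is hypothetical, and the override keys are the ones used in the command above:

```python
import shlex


def bergen_command(**overrides):
    """Build the argument list for one `bergen.py` run."""
    args = ["python", "bergen.py"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


cmd = bergen_command(
    retriever="new_retriever",
    reranker="new_reranker",
    generator="new_generator",
    dataset="kilt_nq",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` from the directory that contains `bergen.py`.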