# RAG Evaluation with [RAGBench Benchmark](https://huggingface.co/datasets/galileo-ai/ragbench)

**Features:**
- Multiple datasets to evaluate your model + some evaluation scripts
- Used for evaluation of:
  - Hallucination Detection
  - Context Relevance Detection
  - Context Utilization Detection

<br><br>

**The ugly:**
- Does not provide:
  - any guide on how to use it
  - a requirements file or any setup details
  - out-of-the-box models
- Outdated dependencies (you have to upgrade the *RAGBench code* to use it with modern RAG pipelines)

<br><br>

**Content:**
- [Python Env](#python-env)
- [Example RAG Model](#example-rag-model)
  - Retriever: Embedding + Indexing (Database) (+ example data)
  - Reranker (we don't use one)
  - Generator: Tokenizer + LLM
- [Evaluation with RAGBench](#evaluation-with-ragbench)
  - 1. Load Datasets
  - 2. Run our Model on them
  - 3. Run RAGAS/TruLens annotation
  - 4. Calculate scores

<br><br>

---



### Python Env

Install Repository:
```bash
cd D:\Informatik\Projekte\RAG_Evaluation && D:
git clone https://github.com/rungalileo/ragbench.git ./ragbench
```
Comment out the following lines to avoid errors caused by outdated dependencies:
- inference.py, line 6
- inference.py, line 9 -> remove 'context_relevancy'
- trulens_async.py, line 17
- trulens_async.py, line 18
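
The manual edits above can also be scripted. This is a minimal sketch, assuming a hypothetical helper `comment_out_lines` (not part of RAGBench) that prefixes the given 1-based line numbers with `# `. Line numbers can shift between commits, so check them against your checkout first. The `context_relevancy` change on line 9 is an in-line edit and still has to be done by hand.

```python
from pathlib import Path


def comment_out_lines(path, line_numbers):
    """Prefix the given 1-based line numbers of a file with '# ' (hypothetical helper)."""
    file_path = Path(path)
    lines = file_path.read_text(encoding="utf-8").splitlines(keepends=True)
    for number in line_numbers:
        line = lines[number - 1]
        stripped = line.lstrip(" \t")
        if stripped.startswith("#"):
            continue  # already a comment, leave it untouched
        indent = len(line) - len(stripped)
        # Keep the original indentation so the file stays valid Python
        lines[number - 1] = line[:indent] + "# " + stripped
    file_path.write_text("".join(lines), encoding="utf-8")


# Example call (path assumed from the clone location used above):
# comment_out_lines("./ragbench/ragbench/inference.py", [6])
```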

<br><br>

Installation in Anaconda Bash:
```bash
conda create -n ragbench python=3.12 -y 
conda activate ragbench

pip install ipykernel jupyter notebook ipython

pip install torch torchvision --index-url https://download.pytorch.org/whl/cu130

pip install datasets transformers faiss-cpu accelerate prime_printer

pip install ragas 
pip install "trulens-eval==1.4.0"
```
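
Since RAGBench ships no requirements file, it helps to confirm that the environment really contains what the commands above installed. A small sketch using only the standard library; the list of distribution names is copied from the `pip install` lines and is an assumption about what you actually need.

```python
from importlib import metadata

# Distribution names taken from the pip commands above
REQUIRED = ["torch", "datasets", "transformers", "faiss-cpu", "accelerate", "ragas", "trulens-eval"]


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Run it inside the activated `ragbench` environment; any `NOT INSTALLED` row points at a failed or skipped install step.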




### System Information

In [1]:
import prime_printer as prime
print(prime.get_hardware())




-------------------------------- 
Your Hardware:

    ---> General <---
Operatingsystem: Windows
Version: 10.0.26200
Architecture: ('64bit', 'WindowsPE')
Processor: AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD

    ---> GPU <---
GPU Name: NVIDIA GeForce RTX 4060
VRAM Total: 8188 MB
VRAM Used: 1617 MB
Utilization: 39.0 %
PyTorch Support: True (NVIDIA GeForce RTX 4060)
TensorFlow Support: False -> not installed

    ---> CPU <---
CPU-Name: AMD Ryzen 7 3700X 8-Core Processor
CPU Kernels: 8
Logical CPU-Kernels: 16
CPU-Frequence: 3600 MHz
CPU-Utilization: 20.9 %

    ---> RAM <---
RAM Total: 31 GB
RAM Available: 15 GB
RAM-Utilization: 50.7 %

--------------------------------


### Example RAG Model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import faiss

In [14]:
embedding_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
embedding_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", dtype=torch.float16).to("cpu")
embedding_model.resize_token_embeddings(len(embedding_tokenizer))

Embedding(30522, 384, padding_idx=0)

In [6]:
example_documents = [
    "The Eiffel Tower is located in Paris.",
    "The Pythagorean theorem describes the relationship between the sides of a right triangle.",
    "The capital of Germany is Berlin.",
]

In [10]:
def encode(model, tokenizer, texts):
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**tokens)
        # Mean pooling over all token positions (padding included); this is not CLS pooling
        embeddings = outputs.last_hidden_state.mean(dim=1)
        # Mask-aware alternative that ignores padding tokens:
        # attention_mask = tokens["attention_mask"].unsqueeze(-1)
        # embeddings = (outputs.last_hidden_state * attention_mask).sum(dim=1)
        # embeddings = embeddings / attention_mask.sum(dim=1)
    return embeddings.cpu().numpy()
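
The difference between the plain `.mean(dim=1)` in `encode` and the commented-out mask-aware variant matters as soon as texts of different lengths are batched together, because shorter texts get padded. The sketch below uses plain Python lists and toy numbers (no model involved) so the arithmetic is visible; `mean_pool` is a hypothetical helper, not part of any library.

```python
def mean_pool(hidden_states, attention_mask=None):
    """Average token vectors; with a mask, padded positions (mask == 0) are ignored."""
    if attention_mask is None:
        attention_mask = [1] * len(hidden_states)
    dim = len(hidden_states[0])
    kept = sum(attention_mask)  # number of positions that count
    return [
        sum(vec[d] * m for vec, m in zip(hidden_states, attention_mask)) / kept
        for d in range(dim)
    ]


# One sequence with two real tokens and two padding positions (toy numbers)
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

plain = mean_pool(hidden)         # padding vectors leak into the average
masked = mean_pool(hidden, mask)  # only the two real tokens contribute
print(plain, masked)
```

With a single text per call (as for the query) both variants agree; they diverge only when padding is present.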

In [16]:
doc_embeddings = encode(embedding_model, embedding_tokenizer, example_documents)
doc_embeddings

array([[ 0.3853 ,  0.01636,  0.0658 , ...,  0.1343 ,  0.3914 ,  0.2822 ],
       [-0.2357 ,  0.3455 , -0.3367 , ...,  0.4631 ,  0.2502 ,  0.0448 ],
       [ 0.35   ,  0.08856,  0.197  , ...,  0.1251 ,  0.1675 ,  0.04062]],
      shape=(3, 384), dtype=float16)

Build FAISS Index (our "database")

In [17]:
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(doc_embeddings.astype("float32"))  # FAISS works on float32 vectors, the embeddings are float16
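
`IndexFlatL2` is an exact index: it compares the query against every stored vector and ranks by squared Euclidean distance. What happens on `search` can be sketched in a few lines of plain Python. This is an illustration of the idea with a hypothetical `flat_l2_search` function, not FAISS code, and it is far slower than the real thing.

```python
def flat_l2_search(stored_vectors, query, k):
    """Return (squared_distances, indices) of the k stored vectors closest to query."""
    scored = []
    for position, vector in enumerate(stored_vectors):
        squared_distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((squared_distance, position))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-D "database" with three entries
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
distances, indices = flat_l2_search(vectors, query=[0.9, 1.2], k=2)
print(indices, distances)
```

For three documents this brute-force approach is perfectly fine; approximate index types only pay off for much larger collections.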

Load a language model (decoder)

In [20]:
model_name = "gpt2"  # "distilgpt2"
generator_tokenizer = AutoTokenizer.from_pretrained(model_name)
generator_tokenizer.add_special_tokens({'pad_token': '[PAD]'})
generator_model = AutoModelForCausalLM.from_pretrained(model_name,
                                             dtype=torch.float16).to('cpu')  # no device_map="auto": the model is moved to the CPU anyway
generator_model.resize_token_embeddings(len(generator_tokenizer))

Embedding(50258, 768)

In [21]:
generator_model.device

device(type='cpu')

RAG Method

In [77]:
import numpy as np

def rag_answer(query, given_passages, k=2):
    # Create prompt + docs embedding
    embedded_prompt = encode(embedding_model, embedding_tokenizer, [query])[0]
    given_passages_embedded = encode(embedding_model, embedding_tokenizer, given_passages)

    # Convert embeddings to float32 numpy arrays
    prompt_vec = embedded_prompt.astype(np.float32).reshape(1, -1)
    passage_vecs = given_passages_embedded.astype(np.float32)

    # Build index
    dim = passage_vecs.shape[1]
    index = faiss.IndexFlatL2(dim)
    index.add(passage_vecs)

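    # Retrieve top-k docs (never ask for more passages than exist; FAISS fills missing hits with index -1)
    k = min(k, len(given_passages))
    distances, indices = index.search(prompt_vec, k)
    retrieved = [given_passages[i] for i in indices[0]]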

    # Build the final prompt for generation
    context_text = "\n".join(retrieved)
    prompt = (
        f"Use the following context to answer the question.\n\n"
        f"Context: {context_text}\n\n"
        f"Question: {query}\nAnswer:"
    )

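    # Tokenize final prompt
    inputs = generator_tokenizer(prompt, 
                                 truncation=True,
                                 max_length=500,
                                 return_tensors="pt")

    # Generate; max_new_tokens leaves room for an answer even when the prompt
    # already fills all 500 tokens (200 is an arbitrary cap, adjust as needed)
    outputs = generator_model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        num_return_sequences=1,
        pad_token_id=generator_tokenizer.pad_token_id
    )

    # Decode only the newly generated tokens instead of cutting the prompt off by character count
    prompt_length = inputs["input_ids"].shape[1]
    answer = generator_tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    return answer, retrieved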
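
One pitfall of `rag_answer`: the tokenizer truncates at 500 tokens, and with the default right-side truncation a long context pushes the `Question: ... Answer:` part out of the prompt. A possible workaround is to trim only the context before assembling the prompt. The sketch below assumes a hypothetical `build_prompt` helper and a character budget `max_context_chars`, which is only a crude stand-in for a real token budget.

```python
def build_prompt(query, passages, max_context_chars=1500):
    """Assemble the prompt, trimming only the context so the question always survives."""
    context_text = "\n".join(passages)
    if len(context_text) > max_context_chars:
        context_text = context_text[:max_context_chars]  # cut the context, never the question
    return (
        "Use the following context to answer the question.\n\n"
        f"Context: {context_text}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "Where is the Eiffel Tower located?",
    ["The Eiffel Tower is located in Paris.", "The capital of Germany is Berlin."],
)
print(prompt)
```

The prompt wording is the same as in `rag_answer`, so the helper could replace the inline f-string there.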

Example Run

In [78]:
answer, retrieved_docs = rag_answer("Where is the Eiffel Tower located?", example_documents, k=2)
print(f"Retrieved Docs: {retrieved_docs}")
print(f"\nRAG Answer:\n'{answer}'")

Retrieved Docs: ['The Eiffel Tower is located in Paris.', 'The Pythagorean theorem describes the relationship between the sides of a right triangle.']

RAG Answer:
'The Eiffel Tower is between the Eiffel Tower in Paris and the Aussies in New York City.

The Pythagorean theorem describes the relationship between the sides of a right triangle. The Pythagorean theorem is an interesting idea that is not to be confused with the Pythagorean theorem, which states that if two sides are equal in a straight line, the two sides may be equal in a straight line. Thus, the Pythagorean theorem is a theorem by which the triangle must be one of two sides, and this is true if the triangle is equal to one side.

The Pythagorean theorem is an interesting idea that is not to be confused with the Pythagorean theorem, which states that if two sides are equal in a straight line, the two sides may be equal in a straight line. Thus, the Pythagorean theorem is a theorem by which the triangle must be one of two s

### Evaluation with RAGBench

With the given models it is easy:
```bash
python run_inference.py --dataset msmarco --model trulens --output results
```

1. Load Datasets

In [27]:
from datasets import load_dataset

# load the full ragbench dataset
ragbench = {}
columns = set()
for dataset in ['covidqa', 'cuad', 'delucionqa', 'emanual', 'expertqa', 'finqa', 'hagrid', 'hotpotqa', 'msmarco', 'pubmedqa', 'tatqa', 'techqa']:
  ragbench[dataset] = load_dataset("rungalileo/ragbench", dataset)
  prime.awesome_print(f"Loaded '{dataset}' dataset from RAGBench", prime.RED)
  # columns = columns.union(set(ragbench[dataset]['test'].keys()))
  columns = columns.union(set(ragbench[dataset]['test'].column_names))
print(f"Columns in ragbench datasets: {columns}")

Loaded 'covidqa' dataset from RAGBench
Loaded 'cuad' dataset from RAGBench
Loaded 'delucionqa' dataset from RAGBench
Loaded 'emanual' dataset from RAGBench
Loaded 'expertqa' dataset from RAGBench
Loaded 'finqa' dataset from RAGBench
Loaded 'hagrid' dataset from RAGBench
Loaded 'hotpotqa' dataset from RAGBench
Loaded 'msmarco' dataset from RAGBench
Loaded 'pubmedqa' dataset from RAGBench
Loaded 'tatqa' dataset from RAGBench
Loaded 'techqa' dataset from RAGBench
Columns in ragbench datasets: {'trulens_groundedness', 'adherence_score', 'documents_sentences', 'response', 'sentence_support_information', 'relevance_score', 'overall_supported_explanation', 'all_relevant_sentence_keys', 'trulens_context_relevance', 'ragas_faithfulness', 'ragas_context_relevance', 'relevance_explanation', 'gpt3_context_relevance', 'id', 'unsupported_response

In [41]:
ds = ragbench['delucionqa']['test']
small_ds = ds.select([0])

2. Run our Model on them

In [65]:
from datasets import Dataset

# collect your predictions in a list of dicts
data = []

for sample in small_ds:
    question = sample["question"]
    context = sample["documents"]
    gold = sample["response"]

    pred_answer, pred_contexts = rag_answer(question, given_passages=context, k=3)

    data.append({
        "question": question,
        "gold_answer": gold,
        "response": pred_answer,
        "documents": pred_contexts,
    })

# convert to Hugging Face Dataset
ds_ = Dataset.from_list(data)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
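
Generation on the CPU is slow, so it can be worth saving the collected predictions before moving on to the annotation step. A minimal sketch with the standard `json` module; `save_jsonl`, `load_jsonl` and the file name in the usage note are hypothetical and not part of RAGBench.

```python
import json


def save_jsonl(records, path):
    """Write a list of dicts to a JSON Lines file, one record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path):
    """Read a JSON Lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Usage would be `save_jsonl(data, "./my_rag_predictions.jsonl")` right after the loop, and later `Dataset.from_list(load_jsonl("./my_rag_predictions.jsonl"))` to rebuild the dataset without running the model again.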


3. Run RAGAS/TruLens annotation

In [46]:
import sys
sys.path += ["./ragbench/ragbench"]

In [75]:
from inference import ragas_annotate_dataset

annotated_ds = ragas_annotate_dataset(ds_, output_path="./my_rag_predictions_ragas.jsonl")

Running RAGAS Inference on 1 rows


OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable
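
The error says that the OpenAI client found no API key, so the annotation step needs `OPENAI_API_KEY` in the environment. A small guard makes this fail early with a readable message instead of deep inside the annotation call. `require_env` is a hypothetical helper; it only reads the environment and never stores the key.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a readable error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; set it before running the annotation step.")
    return value


# Typical use in the notebook, before calling the annotation function:
# require_env("OPENAI_API_KEY")
```

The key itself can be set in the shell before starting Jupyter, or inside the notebook with `os.environ["OPENAI_API_KEY"] = getpass.getpass()`, which avoids writing the key into a cell.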

4. Calculate scores

In [None]:
from evaluation import calculate_metrics

metrics = calculate_metrics(
    annotated_ds,
    pred_adherence="pred_adherence",
    pred_context_utilization="pred_context_utilization"
)

metrics
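
To get a feeling for what such scores mean, here is a dependency-free sketch of two metrics that fit these tasks: AUROC for a binary label such as adherence, and RMSE for continuous scores such as relevance or utilization. That `calculate_metrics` reports exactly these is an assumption on my side; the functions below are illustrations, not the RAGBench implementation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long lists of numbers."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy data: gold adherence labels vs. predicted adherence scores
gold_labels = [1, 0, 1, 0]
pred_scores = [0.9, 0.1, 0.8, 0.3]
print(auroc(gold_labels, pred_scores))
print(rmse([1.0, 2.0], [1.0, 4.0]))
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a quick sanity check on a handful of rows but not for the full test splits.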