# RAG Evaluation with [RAGBench Benchmark](https://huggingface.co/datasets/galileo-ai/ragbench)

RAGBench is a large benchmark dataset for evaluating RAG systems. It ships with some evaluation scripts but has no direct support for integrating custom models. No registration is required, and the only installation needed is the Hugging Face tooling.

Used for evaluation of:
- Hallucination Detection
- Context Relevance Detection
- Context Utilization Detection

<br><br>

- [System Setup](#system-setup)
  - docker setup
  - install RAGBench
- [Example RAG Model](#example-rag-model)
  - Retriever: Embedding + Indexing (Database) (+ example data)
  - Reranker (we don't use one)
  - Generator: Tokenizer + LLM
- [Evaluation with RAGBench](#evaluation-with-ragbench)
  - 1. Load Datasets
  - 2. Calculate scores
  - 3. Evaluate your model

<br><br>

---



### System Setup

Using Docker

You might want to check your CUDA version first:
```bash
nvidia-smi | sed '/Processes/,$d'
```

- `nvidia-smi` -> standard NVIDIA information command
- `|` -> pipe the output to `sed` (a stream editor)
- `sed '/Processes/,$d'` -> delete (`d`) from line containing `Processes` to the end (`$`)
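
The same filtering can be sketched in Python for readers who are not working in a shell. This is only an illustrative equivalent of the `sed` expression, not part of the original setup: `strip_process_table` is a hypothetical helper name and the sample text is a made-up stand-in for real `nvidia-smi` output.

```python
def strip_process_table(report: str, marker: str = "Processes") -> str:
    """Keep every line before the first one containing `marker`,
    mirroring `sed '/Processes/,$d'`."""
    kept = []
    for line in report.splitlines():
        if marker in line:
            break  # sed deletes from this line through the end of the input
        kept.append(line)
    return "\n".join(kept)


# Made-up stand-in for real `nvidia-smi` output.
sample_report = "\n".join([
    "| NVIDIA-SMI  (driver and CUDA version header) |",
    "| GPU  Name   (per-GPU table)                  |",
    "| Processes:                                   |",
    "| GPU   PID   Process name                     |",
])
print(strip_process_table(sample_report))
```

On a machine with an NVIDIA driver, the real report could be obtained with `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` and passed to the same function.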

<br><br>

> **Notice: This uses the same Docker setup as BERGEN, but this repository would also work with a much simpler system setup.**

<br>

**Build your image:**
1. Open `Docker Desktop` and open the terminal (bottom-right corner)
2. Run:
    ```bash
    cd D:\Informatik\Projekte\RAG_Evaluation
    docker build -t rag-eval .
    ```

**Starting Setup:**<br>
1. Open `Docker Desktop`
2. Start the container:
    ```bash
    docker run -it --rm  --gpus all -v .:/workspace -w /workspace rag-eval bash
    ```
3. Attach Visual Studio Code to that container (requires the Docker extension)

<br>

[We used this Image](./Dockerfile)

### System Information

In [None]:
import prime_printer as prime
print(prime.get_hardware())

### Example RAG Model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM
import faiss

In [None]:
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", dtype=torch.float16)

In [None]:
example_documents = [
    "The Eiffel Tower is located in Paris.",
    "The Pythagorean theorem describes the relationship between the sides of a right triangle.",
    "The capital of Germany is Berlin.",
]

In [None]:
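def encode(model, tokenizer, texts):
    """Embed a list of texts and return a float32 NumPy array (one row per text)."""
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**tokens)
        # Mean pooling over all token positions (padding included), not CLS pooling
        embeddings = outputs.last_hidden_state.mean(dim=1)
        # Masked mean pooling that ignores padding tokens:
        # attention_mask = tokens["attention_mask"].unsqueeze(-1)
        # embeddings = (outputs.last_hidden_state * attention_mask).sum(dim=1)
        # embeddings = embeddings / attention_mask.sum(dim=1)
    # FAISS expects float32, while the embedding model is loaded in float16
    return embeddings.float().cpu().numpy()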
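
The difference between a plain mean over all token positions and a mean that ignores padding can be seen on a toy example. The sketch below uses plain Python lists instead of tensors so that it runs without any dependencies; `mean_pool` and `masked_mean_pool` are illustrative names and the numbers are invented.

```python
def mean_pool(token_vectors):
    """Average over every token position, padding included."""
    n = len(token_vectors)
    return [sum(col) / n for col in zip(*token_vectors)]


def masked_mean_pool(token_vectors, attention_mask):
    """Average only over positions whose attention mask is 1."""
    real = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(col) / len(real) for col in zip(*real)]


# One sequence: two real tokens followed by two padding positions.
token_vectors = [[1.0, 3.0], [3.0, 5.0], [8.0, 8.0], [8.0, 8.0]]
attention_mask = [1, 1, 0, 0]

print(mean_pool(token_vectors))                         # [5.0, 6.0]
print(masked_mean_pool(token_vectors, attention_mask))  # [2.0, 4.0]
```

With `padding=True`, shorter texts in a batch receive padding positions, so the plain mean shifts their embeddings. The masked variant corresponds to the commented-out lines in `encode`.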

In [None]:
doc_embeddings = encode(embedding_model, tokenizer, example_documents)

Build FAISS Index (our "database")

In [None]:
dim = doc_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(doc_embeddings)

# save for later
faiss.write_index(index, "/content/my_index.faiss")
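
Conceptually, a flat L2 index compares the query against every stored vector and returns the `k` closest ones by squared Euclidean distance. The following dependency-free sketch illustrates that idea on invented 2-D vectors; it is not how FAISS is implemented, and `l2_search` is a hypothetical name.

```python
def l2_search(doc_vectors, query_vector, k):
    """Exhaustive nearest-neighbour search by squared Euclidean distance."""
    scored = []
    for doc_id, vec in enumerate(doc_vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query_vector))
        scored.append((dist, doc_id))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


doc_vectors = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
distances, ids = l2_search(doc_vectors, [1.0, 0.5], k=2)
print(distances, ids)  # [0.25, 1.25] [1, 0]
```

The returned ids play the same role as the indices that `index.search` returns: they are positions in the list of documents that was added to the index.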

Load a language model (decoder)

In [None]:
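model_name = "gpt2"  # "distilgpt2"
# Separate name so the embedding tokenizer loaded earlier is not overwritten
llm_tokenizer = AutoTokenizer.from_pretrained(model_name)
# If a prompt is too long, drop the start of the context and keep the question
llm_tokenizer.truncation_side = "left"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             torch_dtype=torch.float16)

RAG Method

In [None]:
def rag_answer(query, given_passages=None, k=2):
    # Create prompt embedding (uses the embedding tokenizer, not the LLM tokenizer)
    embedded_prompt = encode(embedding_model, tokenizer, [query])

    # Pick what to search: the global example index or a temporary index
    # built from the passages that came with the sample
    if not given_passages:
        passages, search_index = example_documents, index
    else:
        passages = list(given_passages)
        given_passages_embedded = encode(embedding_model, tokenizer, passages)
        search_index = faiss.IndexFlatL2(given_passages_embedded.shape[1])
        search_index.add(given_passages_embedded)

    # Retrieve top-k docs (k is capped at the number of available passages)
    distances, idx = search_index.search(embedded_prompt, min(k, len(passages)))
    retrieved = [passages[i] for i in idx[0]]

    # Build the final prompt for generation
    context = "\n".join(retrieved)
    prompt = (
        "Use the following context to answer the given question.\n\n"
        f"Context: {context}\n\n"
        f"Question: {query}\nAnswer:"
    )

    # Generate; truncate long prompts so they fit GPT-2's 1024-token context window
    inputs = llm_tokenizer(prompt, return_tensors="pt",
                           truncation=True, max_length=824).to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        temperature=0.7,
        pad_token_id=llm_tokenizer.eos_token_id  # GPT-2 has no pad token
    )

    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return llm_tokenizer.decode(new_tokens, skip_special_tokens=True), retrieved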


Example Run

In [None]:
answer, retrieved_docs = rag_answer("Where is the Eiffel Tower located?")
print("Retrieved Docs:", retrieved_docs)
print("\nRAG Answer:\n", answer)
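
The prompt layout can be checked without loading any model. The sketch below isolates the prompt-building step into a hypothetical helper, `build_prompt`; the bullet formatting of the passages is an illustrative choice, not something the original notebook prescribes.

```python
def build_prompt(query, passages):
    """Join retrieved passages into a context block ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Use the following context to answer the given question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "Where is the Eiffel Tower located?",
    ["The Eiffel Tower is located in Paris.", "The capital of Germany is Berlin."],
)
print(prompt)
```

Ending the prompt with `Answer:` nudges a plain causal language model such as GPT-2 to continue with an answer instead of more context.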

### Evaluation with RAGBench

With the models that RAGBench already supports, evaluation is a single command:
```bash
python run_inference.py --dataset msmarco --model trulens --output results
```

1. Load Datasets

In [None]:
from datasets import load_dataset

# load all subsets of the RAGBench dataset
ragbench = {}
columns = set()
for dataset in ['covidqa', 'cuad', 'delucionqa', 'emanual', 'expertqa', 'finqa', 'hagrid', 'hotpotqa', 'msmarco', 'pubmedqa', 'tatqa', 'techqa']:
  ragbench[dataset] = load_dataset("rungalileo/ragbench", dataset)
  print(f"Loaded '{dataset}' dataset from RAGBench")
  # a datasets.Dataset exposes its columns via .column_names (it has no .keys())
  columns = columns.union(set(ragbench[dataset]['test'].column_names))
print(f"Columns in ragbench datasets: {columns}")

2. Calculate scores

In [None]:
import sys
sys.path += ["./ragbench/ragbench"]

In [None]:
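from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy

def evaluate_rag_output(question, answer, contexts):
    # ragas metrics are not callable on a dict; they are scored via evaluate()
    # on a datasets.Dataset. Assumes a ragas version that still ships
    # context_relevancy (e.g. 0.1.x) and a configured judge LLM
    # (by default an OpenAI key in OPENAI_API_KEY).
    data = Dataset.from_dict({
        "question": [question],
        "answer": [answer],
        "contexts": [contexts],  # list of passages for this one sample
    })

    scores = evaluate(
        data, metrics=[faithfulness, context_relevancy, answer_relevancy]
    ).to_pandas().iloc[0]

    adherence = scores["faithfulness"]
    relevance = scores["context_relevancy"]
    utilization = scores["answer_relevancy"]  # used here as a proxy for utilization

    return float(adherence), float(relevance), float(utilization)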


In [None]:
results = {}
for dataset_name, dataset in ragbench.items():
    print(f"Evaluating on {dataset_name}...")
    results[dataset_name] = []
    for sample in dataset['test']:
        question = sample['question']
        # RAGBench ships context documents and a reference response with each sample.
        # The column names used below are assumed from the dataset card; check them
        # against the columns printed in step 1.
        ground_truth = sample.get('response', '')

        given_passages = sample.get('documents', [])
        rag_response, contexts = rag_answer(question, given_passages)

        # score the generated response (the original passed an undefined `answer`)
        adherence, relevance, utilization = evaluate_rag_output(question, rag_response, contexts)

        results[dataset_name] += [{
            'question': question,
            'ground_truth': ground_truth,
            'rag_response': rag_response,
            "pred_adherence": adherence,
            "pred_context_relevance": relevance,
            "pred_context_utilization": utilization,
            "supported": sample["adherence_score"],
            "relevance": sample["relevance_score"],
            "utilization": sample["utilization_score"],
        }]

    print(f"Completed evaluation on {dataset_name}")

In [None]:
from datasets import Dataset

# one datasets.Dataset per RAGBench subset
eval_datasets = [Dataset.from_list(result_dataset) for result_dataset in results.values()]

3. Evaluate your model

In [None]:
from evaluation import calculate_metrics

all_metrics = []
for annotated in eval_datasets:
    metrics = calculate_metrics(
        annotated,
        pred_adherence="pred_adherence",
        pred_context_relevance="pred_context_relevance",
        pred_context_utilization="pred_context_utilization",
        ground_truth_adherence="supported",
        ground_truth_context_relevance="relevance",
        ground_truth_context_utilization="utilization")
    all_metrics += [metrics]

all_metrics
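
To get a feeling for what comparing predicted scores with annotations involves, here is a small dependency-free sketch of two common measures: RMSE for continuous scores (relevance, utilization) and AUROC for a binary label (supported or not). To our understanding the RAGBench paper reports these two measures, but whether `calculate_metrics` returns exactly them is an assumption to verify against `evaluation.py`. The function names and all numbers below are invented for illustration.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equally long score lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def auroc(scores, labels):
    """Probability that a random positive outranks a random negative
    (ties count as half a win)."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Invented predictions vs. annotations
print(rmse([0.2, 0.5, 0.9], [0.0, 0.5, 1.0]))
print(auroc([0.9, 0.8, 0.3, 0.1], [True, False, True, False]))  # 0.75
```

A lower RMSE means the predicted relevance or utilization is closer to the annotation. An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation of supported and unsupported responses.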