# Evaluation of RAG Systems

Evaluating a Retrieval-Augmented Generation (RAG) system is **critical for making it production-grade**. Unlike traditional language models that generate responses based solely on internal knowledge, RAG systems rely on **retrieved external documents**. A wrong answer can therefore originate in either the retriever or the generator, so both stages need to be measured.

#### **Importance of RAG Evaluation for Production-Readiness**

* **Pinpoints Failures Precisely:** It helps identify whether issues are due to **poor retrieval** (irrelevant documents) or **generation errors** (hallucinations or incoherent answers).
* **Improves End-to-End Quality:** By using evaluation metrics like **faithfulness, relevance, and context recall**, teams can systematically improve both components of the system.
* **Boosts Reliability and Trust:** In production environments, factual correctness is crucial. RAG evaluation ensures answers are **grounded in retrieved data**, increasing user trust.
* **Enables Iterative Optimization:** Consistent evaluation provides feedback loops for optimizing **retrievers, rerankers, chunking strategies, and prompts**, making the system scalable and robust.
* **Supports Debugging & Monitoring:** In real-time applications, evaluation helps detect when the system fails silently (e.g., generating plausible but incorrect answers).



**While evaluation is essential, it's not flawless:**

* **Automated metrics** like BLEU or faithfulness may **miss nuanced errors** or fail to capture partial truths.
* **Subjectivity in responses** means human evaluation is often needed—but it’s **costly and inconsistent**.
* **Metric limitations:** High fluency doesn’t always mean factual accuracy. A response may look good but be based on irrelevant or incorrect context.
* **Hard to fully automate:** Especially in domain-specific settings, context relevance or truthfulness may be hard for generic metrics to judge.


So, **it’s best used as a guiding tool, not an absolute measure**. Combining automated + human-in-the-loop evaluation gives the most reliable results.

## 1. Retrieval Metrics

Retrieval metrics evaluate how effectively the retriever fetches relevant documents for a given user query.

Good retrieval ensures the LLM has the right information to generate accurate answers. Poor retrieval leads to hallucinated or irrelevant outputs, even if the generator is strong. Scoring retrieval accurately requires human feedback in the form of labeled relevant documents for each query.

In [None]:
# This is Dummy Data that we will use later
retrieved_docs = [
    "Albert Einstein developed the theory of relativity.",
    "He was a German physicist known for E=mc².",
    "Einstein was born in a city in Germany.",
    "He entered the world in the late 19th century.",
    "Quantum mechanics and relativity shaped modern physics.",
    "He worked at the Swiss Patent Office."
]

ground_truth = ["Einstein was born in Ulm, Germany in 1879."]

Suppose the variable `retrieved_docs` contains the documents **retrieved by the system for a given query**, in ranked order. Six are listed, and the top-K metrics below score the first **k = 5**. The variable `ground_truth` holds the **ground-truth document** (or chunk) that actually contains the correct answer, which has been **manually identified by a human evaluator**.



### **A. Recall\@K**

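Recall\@K, as defined here, evaluates whether **at least one relevant document** appears in the **top K** documents retrieved for a query, averaged over all queries. This hit-or-miss variant is often called Hit Rate\@K; the classic definition instead divides the number of relevant documents found in the top K by the total number of relevant documents for the query.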

#### **Formula:**

$$
\text{Recall@K} = \frac{\text{Number of queries with at least one relevant doc in top K}}{\text{Total number of queries}}
$$

* **High Recall\@K** = Retriever is surfacing relevant context frequently.
* **Low Recall\@K** = Retriever is failing to include useful documents, leading to LLM hallucinations.
----

* It does need **labeled relevant documents** (ground truth).
* Doesn’t consider **rank** or **multiple relevant docs**—just if **any** one is found.
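The formula above aggregates over many queries, whereas a single query only yields a 0 or 1. As a minimal sketch, assuming each query has a ranked list of document IDs and a human-labeled set of relevant IDs (the names `hit_at_k`, `recall_at_k`, `runs`, and `labels` are illustrative, and the data is made up), the aggregation could look like this:

```python
from typing import Dict, List, Set


def hit_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> int:
    """Return 1 if any of the top-k retrieved IDs is labeled relevant, else 0."""
    return int(any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]))


def recall_at_k(runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not runs:
        return 0.0
    hits = sum(hit_at_k(docs, labels.get(query, set()), k) for query, docs in runs.items())
    return hits / len(runs)


# Hypothetical toy data: ranked document IDs per query, plus human relevance labels.
runs = {
    "q1": ["d3", "d7", "d1"],
    "q2": ["d4", "d9", "d2"],
    "q3": ["d5", "d6", "d8"],
}
labels = {"q1": {"d1"}, "q2": {"d2", "d4"}, "q3": {"d0"}}

print(recall_at_k(runs, labels, k=1))  # only q2 has a relevant doc at rank 1
print(recall_at_k(runs, labels, k=3))  # q1 and q2 hit, q3 misses
```

With exact ID labels there is no similarity threshold to tune, which makes the score easier to reproduce than an embedding-based match.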

In [None]:
from sentence_transformers import SentenceTransformer, util

# Load pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

def recall_at_k_semantic(retrieved_docs, ground_truth, k=5, similarity_threshold=0.7):
    """
    Compute Recall@K for retrieval with semantic similarity.

    Args:
      retrieved_docs (List[str]): Retrieved docs for a single query.
      ground_truth (List[str]): Ground-truth relevant docs for the query.
      k (int): Top K docs to consider.
      similarity_threshold (float): Minimum cosine similarity to count as relevant.

    Returns:
      int: 1 if at least one doc in top K passes similarity threshold, else 0.
    """
    # Encode ground truth and top-k retrieved docs
    ground_truth_embeddings = model.encode(ground_truth, convert_to_tensor=True)
    retrieved_embeddings = model.encode(retrieved_docs[:k], convert_to_tensor=True)

    # Compute cosine similarity matrix [top_k x ground_truth]
    cos_scores = util.cos_sim(retrieved_embeddings, ground_truth_embeddings)

    # Check if any retrieved doc has similarity >= threshold with any ground truth doc
    max_similarities, _ = cos_scores.max(dim=1)  # max similarity per retrieved doc

    if (max_similarities >= similarity_threshold).any():
        return 1
    else:
        return 0


# Compute Recall@5
recall_score = recall_at_k_semantic(retrieved_docs, ground_truth, k=5, similarity_threshold=0.7)
print("Recall@5 (semantic):", recall_score)


Recall@5 (semantic): 1


Semantic embeddings capture the overall meaning, so phrases like **"Einstein was born in a city in Germany" strongly overlap with "Einstein was born in Ulm, Germany" by referencing the same event and location generally**. Even without exact matches like "Ulm," the model detects key concepts such as "born," "Einstein," and "Germany," resulting in a **cosine similarity above the threshold and a positive Recall\@5.**



### **B. Context Recall**

Context Recall measures how much of the **ground-truth answer** is present in the **retrieved context** passed to the LLM.

$$
\text{Context Recall} = \frac{\text{Number of overlapping tokens between answer and context}}{\text{Total tokens in the ground-truth answer}}
$$


* **High Context Recall** = The context supports answering the query correctly.
* **Low Context Recall** = The LLM is expected to guess or hallucinate due to missing info.

---
* Token overlap may not always reflect **semantic similarity**.
* Doesn’t evaluate **usefulness** or **clarity** of retrieved content—just literal overlap.
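The token-overlap formula above can also be computed literally, without embeddings. The sketch below is one possible reading of it, assuming lowercase alphanumeric tokenization; the helper names `tokenize` and `context_recall_overlap` are illustrative and do not come from any library.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def context_recall_overlap(answer: str, context: str) -> float:
    """Share of ground-truth answer tokens that literally appear in the context."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    context_vocab = set(tokenize(context))
    matched = sum(1 for tok in answer_tokens if tok in context_vocab)
    return matched / len(answer_tokens)


answer = "Einstein was born in Ulm, Germany in 1879."
context = (
    "Einstein was born in a city in Germany. "
    "He entered the world in the late 19th century."
)
score = context_recall_overlap(answer, context)
print(round(score, 2))  # "ulm" and "1879" are the only answer tokens not found
```

Exact matching is strict: "1879" and "late 19th century" share no token, so a paraphrased fact counts as missing.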



In [None]:
def context_recall_semantic(answer: str, context: str, similarity_threshold=0.7) -> float:
    """
    Compute semantic Context Recall using embeddings.
    Measures how many segments of the ground-truth answer are semantically found in the context.

    Args:
        answer (str): Ground-truth answer string.
        context (str): Retrieved context string.
        similarity_threshold (float): Minimum cosine similarity to consider a token matched.

    Returns:
        float: Semantic context recall (0 to 1).
    """
    # Split into tokens or short phrases (approximate)
    answer_tokens = answer.strip().split()
    context_tokens = context.strip().split()

    if not answer_tokens:
        return 0.0

    # Encode each token or short phrase as a vector
    answer_embeds = model.encode(answer_tokens, convert_to_tensor=True)
    context_embeds = model.encode(context_tokens, convert_to_tensor=True)

    # Compute cosine similarity matrix [answer_token x context_token]
    similarity_matrix = util.cos_sim(answer_embeds, context_embeds)

    # For each answer token, find the max similarity to any context token
    max_similarities, _ = similarity_matrix.max(dim=1)

    # Count how many tokens are semantically matched (above threshold)
    matched_count = (max_similarities >= similarity_threshold).sum().item()

    return matched_count / len(answer_tokens)

# Compute context string and answer
retrieved_context = " ".join(retrieved_docs)
ground_truth_answer = ground_truth[0]

# Compute semantic Context Recall
score = context_recall_semantic(ground_truth_answer, retrieved_context)
print("Semantic Context Recall:", round(score, 2))


Semantic Context Recall: 0.75


The answer "Einstein was born in Ulm, Germany in 1879" shares tokens like "Einstein," "born," and "Germany" with vague retrieved chunks such as "Einstein was born in a city in Germany" and "He entered the world in the late 19th century." However, specific tokens like "Ulm" and "1879" are missing. The score of 0.75 is consistent with 6 of the answer's 8 whitespace-separated tokens finding a match, which indicates the retrieved context is relevant but lacks the precise details.


### **C. Precision\@K**

Precision\@K measures how many of the **top K retrieved documents** are actually relevant to the query.

$$
\text{Precision@K} = \frac{\text{Number of relevant documents in top K}}{K}
$$

* **High Precision\@K** = Most retrieved documents are relevant, reducing noise.
* **Low Precision\@K** = Many irrelevant documents appear, potentially confusing the LLM.


*Precision\@K depends on reliable labeling of relevant documents and helps assess retrieval quality.*
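When explicit relevance labels are available, Precision\@K reduces to counting labeled documents in the top K. The following is a minimal sketch under that assumption; the function name `precision_at_k` and the document IDs are hypothetical.

```python
from typing import List, Set


def precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the top-k retrieved IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    relevant_in_top_k = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k, not len(top_k), so short result lists are penalized
    return relevant_in_top_k / k


# Hypothetical ranked results and human relevance labels for one query
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d8"}

print(precision_at_k(ranked, relevant, k=5))  # d1 and d2 are relevant: 2 of 5
print(precision_at_k(ranked, relevant, k=2))  # only d1 is relevant: 1 of 2
```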

In [None]:
def precision_at_k_semantic(retrieved_docs, ground_truth, k=5, similarity_threshold=0.7):
    """
    Compute Precision@K for retrieval with semantic similarity.

    Args:
      retrieved_docs (List[str]): Retrieved docs for a single query.
      ground_truth (List[str]): Ground-truth relevant docs for the query.
      k (int): Top K docs to consider.
      similarity_threshold (float): Minimum cosine similarity to count as relevant.

    Returns:
      float: Precision@K = (# relevant docs in top K) / K
    """
    # Encode ground truth and top-k retrieved docs
    ground_truth_embeddings = model.encode(ground_truth, convert_to_tensor=True)
    retrieved_embeddings = model.encode(retrieved_docs[:k], convert_to_tensor=True)

    # Compute cosine similarity matrix [top_k x ground_truth]
    cos_scores = util.cos_sim(retrieved_embeddings, ground_truth_embeddings)

    # For each retrieved doc, find max similarity to any ground truth doc
    max_similarities, _ = cos_scores.max(dim=1)

    # Count how many retrieved docs pass the similarity threshold
    relevant_count = (max_similarities >= similarity_threshold).sum().item()

    precision = relevant_count / k
    return precision

# Calculate semantic Precision@5
prec_at_5 = precision_at_k_semantic(retrieved_docs, ground_truth, k=5, similarity_threshold=0.7)
print("Precision@5 (semantic):", prec_at_5)

Precision@5 (semantic): 0.2


**Precision\@5 = 0.2** means that only 1 out of 5 retrieved chunks was semantically close to the ground-truth answer. Most retrieved content was loosely related or irrelevant, making it harder for the LLM to generate accurate responses. This indicates the retriever needs improvement—such as better chunking, embeddings, or reranking.


### **D. Context Precision**

Context Precision measures what fraction of the **retrieved context tokens** also appear in the **ground-truth answer**.

$$
\text{Context Precision} = \frac{\text{Number of overlapping tokens between context and answer}}{\text{Total tokens in the retrieved context}}
$$

* **High Context Precision** = Most tokens in the retrieved context are relevant to the answer, meaning less noise.
* **Low Context Precision** = Retrieved context contains many irrelevant tokens, potentially distracting the LLM.

*Like Context Recall, Context Precision measures literal overlap and does not capture semantic relevance or usefulness.*
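To see how extra, unrelated text lowers the literal-overlap score, here is a small sketch of the formula above. It assumes lowercase alphanumeric tokenization, and the names `tokenize` and `context_precision_overlap` are illustrative rather than taken from a library.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def context_precision_overlap(context: str, answer: str) -> float:
    """Share of context tokens that literally appear in the ground-truth answer."""
    context_tokens = tokenize(context)
    if not context_tokens:
        return 0.0
    answer_vocab = set(tokenize(answer))
    matched = sum(1 for tok in context_tokens if tok in answer_vocab)
    return matched / len(context_tokens)


answer = "Einstein was born in Ulm, Germany in 1879."
focused = "Einstein was born in a city in Germany."
padded = focused + " Quantum mechanics and relativity shaped modern physics."

print(context_precision_overlap(focused, answer))  # 6 of 8 context tokens match
print(context_precision_overlap(padded, answer))   # same 6 matches, now out of 15
```

The second context contains exactly the same useful sentence, yet its score drops because the denominator grows with every unrelated token.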


In [None]:
from sentence_transformers import SentenceTransformer, util

# Load model
model = SentenceTransformer('all-MiniLM-L6-v2')

def context_precision_semantic(context: str, answer: str, similarity_threshold=0.7) -> float:
    """
    Compute semantic Context Precision: what fraction of tokens in the retrieved context
    are semantically relevant to the ground-truth answer.

    Args:
        context (str): Retrieved context string.
        answer (str): Ground-truth answer string.
        similarity_threshold (float): Cosine similarity threshold for a semantic match.

    Returns:
        float: Semantic context precision score (0 to 1).
    """
    context_tokens = context.strip().split()
    answer_tokens = answer.strip().split()

    if not context_tokens:
        return 0.0

    # Encode context and answer tokens
    context_embeds = model.encode(context_tokens, convert_to_tensor=True)
    answer_embeds = model.encode(answer_tokens, convert_to_tensor=True)

    # Compute cosine similarity matrix [context_token x answer_token]
    similarity_matrix = util.cos_sim(context_embeds, answer_embeds)

    # For each context token, find its max similarity to any answer token
    max_similarities, _ = similarity_matrix.max(dim=1)

    # Count context tokens semantically matching the answer
    matched_count = (max_similarities >= similarity_threshold).sum().item()

    return matched_count / len(context_tokens)

retrieved_context = " ".join(retrieved_docs)
ground_truth_answer = ground_truth[0]

# Compute semantic context precision
ctx_precision_score = context_precision_semantic(retrieved_context, ground_truth_answer)
print("Semantic Context Precision:", round(ctx_precision_score, 2))

Semantic Context Precision: 0.22


**Context Precision = 0.22** means that only about 22% of the tokens in the retrieved context match the ground-truth answer. This indicates the context contains mostly irrelevant or unrelated information, adding noise and making it difficult for the LLM to focus on the correct answer. It suggests low-quality retrieval or excessive irrelevant content in the context.


## 2. Generation Metrics

Generation metrics track the quality, correctness, and fluency of the answers the LLM produces.

In [None]:
# Example
response = "The first Super Bowl was held on January 15, 1967."
contexts = [
    "The First AFL–NFL World Championship Game, later known as Super Bowl I, was held on January 15, 1967, in Los Angeles.",
    "It was a historic match between the Green Bay Packers and the Kansas City Chiefs."
]

### **A. BLEU & ROUGE**

These are **overlap-based metrics** comparing the generated answer to one or more reference answers:

* **BLEU:** Measures n-gram precision, mostly used in machine translation.
* **ROUGE:** Measures n-gram recall, common in summarization.


$$
\text{BLEU} = BP \times \exp\left(\sum_{n=1}^N w_n \log p_n\right)
$$

Where:

* $p_n$ = precision of n-grams
* $w_n$ = weights (usually uniform)
* $BP$ = brevity penalty to penalize short outputs


$$
\text{ROUGE-N} = \frac{\sum_{\text{gram}_n \in \text{Ref}} \text{Count}_\text{match}(\text{gram}_n)}{\sum_{\text{gram}_n \in \text{Ref}} \text{Count}(\text{gram}_n)}
$$

Where:

* $\text{Count}_\text{match}$ = count of overlapping n-grams between generated and reference
* $\text{Count}$ = total n-grams in reference

*Good for automated evaluation but may miss semantic correctness.*

---

BLEU measures n-gram precision between the generated text and the references, whereas ROUGE measures recall against the references (ROUGE-N over n-grams, ROUGE-L over the longest common subsequence).

In short:

- BLEU asks: Of what I generated, how much matches the reference?

- ROUGE asks: Of what the reference says, how much did I generate?
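The two questions above can be made concrete with unigrams. The sketch below computes only BLEU's clipped unigram precision $p_1$ (no brevity penalty, no higher-order n-grams) and ROUGE-1 recall; it is not a full BLEU or ROUGE implementation. The tokenization and the function names are assumptions made for illustration, and the shortened `generated` sentence is a made-up variant of the example response.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def clipped_overlap(candidate: List[str], reference: List[str]) -> int:
    """Count shared unigrams, clipping each token at its count in the reference."""
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    return sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())


def unigram_precision(generated: str, reference: str) -> float:
    """BLEU-style p_1: of what was generated, how much matches the reference?"""
    cand = tokenize(generated)
    return clipped_overlap(cand, tokenize(reference)) / len(cand) if cand else 0.0


def rouge1_recall(generated: str, reference: str) -> float:
    """ROUGE-1 recall: of what the reference says, how much was generated?"""
    ref = tokenize(reference)
    return clipped_overlap(tokenize(generated), ref) / len(ref) if ref else 0.0


generated = "The first Super Bowl was held in 1967."
reference = "The first Super Bowl was held on January 15, 1967."

print(unigram_precision(generated, reference))  # 7 of 8 generated tokens match
print(rouge1_recall(generated, reference))      # 7 of 10 reference tokens covered
```

The generated sentence scores high on precision because almost everything it says is in the reference, but lower on recall because it omits the month and day.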

### **B. Faithfulness**

Faithfulness measures how **factually accurate and consistent** the generated answer is with respect to the provided context or source documents.

$$
\text{Faithfulness Score} = \frac{\text{Number of claims in the generated answer that can be inferred from the given context}}{\text{Total number of claims in the generated answer}}
$$


* **High faithfulness** = Generated content is grounded and truthful.
* **Low faithfulness** = Model hallucinates or introduces incorrect information.

*Essential to ensure trustworthy, reliable outputs in RAG systems.*
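The formula counts claims, so a direct implementation needs two pieces: a way to split the answer into claims and a way to decide whether a claim is supported. In practice both steps are usually delegated to an LLM or an NLI model. The sketch below assumes the claims are already split by hand and uses a deliberately naive word-containment check as a stand-in; `faithfulness_score` and `naive_is_supported` are hypothetical names, and the second claim is invented for the example.

```python
import re
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_is_supported(claim: str, context: str) -> bool:
    """Toy support check (an assumption): every claim token occurs in the context."""
    context_vocab = set(tokenize(context))
    return all(tok in context_vocab for tok in tokenize(claim))


def faithfulness_score(
    claims: List[str],
    contexts: List[str],
    is_supported: Callable[[str, str], bool] = naive_is_supported,
) -> float:
    """Supported claims divided by total claims, as in the formula above."""
    if not claims:
        return 0.0
    supported = sum(
        1 for claim in claims if any(is_supported(claim, ctx) for ctx in contexts)
    )
    return supported / len(claims)


contexts = [
    "The First AFL-NFL World Championship Game, later known as Super Bowl I, "
    "was held on January 15, 1967, in Los Angeles."
]
# Claims split by hand; the second one is deliberately unsupported
claims = [
    "Super Bowl I was held on January 15, 1967",
    "Super Bowl I was held in Miami",
]

print(faithfulness_score(claims, contexts))  # one of two claims is supported
```

Swapping `naive_is_supported` for a stronger checker changes the judgment of each claim but leaves the ratio itself unchanged.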

In [None]:
from sentence_transformers import SentenceTransformer, util
import numpy as np
import torch

# Load a pre-trained model (small and fast for demo)
model = SentenceTransformer('all-MiniLM-L6-v2')

def calculate_faithfulness(response, contexts, threshold=0.7):
    """
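    Approximates faithfulness with the maximum cosine similarity between the
    response and any context chunk. This is a proxy for the claim-based formula:
    a high similarity does not guarantee that every claim is supported.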
    """
    # Encode response and all context chunks
    response_embedding = model.encode(response, convert_to_tensor=True)
    context_embeddings = model.encode(contexts, convert_to_tensor=True)

    # Compute cosine similarities
    cosine_scores = util.cos_sim(response_embedding, context_embeddings)[0]

    # Get maximum similarity score across all contexts
    max_score = float(torch.max(cosine_scores))

    # Return result with interpretation
    return {
        "faithfulness_score": round(max_score, 3),
        "is_faithful": max_score > threshold
    }


result = calculate_faithfulness(response, contexts)
print(result)


{'faithfulness_score': 0.849, 'is_faithful': True}


### **C. Perplexity**

Perplexity measures how **predictable or fluent** the generated text is according to the language model.

$$
\text{Perplexity} = \exp\left(-\frac{1}{N} \sum_{i=1}^N \log p(w_i)\right)
$$

Where:

* $N$ = number of tokens
* $p(w_i)$ = probability assigned by the model to token $w_i$

* **Lower perplexity** = More fluent, coherent text.
* **Higher perplexity** = Text is unexpected or unnatural.

*Focuses on language quality rather than factual correctness.*
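The formula can be checked by hand on a short list of token probabilities. This is a minimal sketch that assumes the per-token probabilities are already available (in practice they come from a language model); the function name `perplexity` and the probability values are illustrative.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the average negative log-probability over the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# A model that is equally unsure among 4 options at every step
uniform = [0.25, 0.25, 0.25, 0.25]
# A model that assigns high probability to each observed token (made-up values)
confident = [0.9, 0.8, 0.95]

print(round(perplexity(uniform), 2))
print(round(perplexity(confident), 2))
```

For the uniform case the result equals the number of equally likely choices, which is a common way to read perplexity: the model behaves as if it were choosing among that many options per token.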

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

# Load GPT-2 once so repeated calls do not reload the weights
ppl_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
ppl_model = GPT2LMHeadModel.from_pretrained('gpt2')
ppl_model.eval()

def calculate_perplexity(sentence):
    """Perplexity of `sentence` under GPT-2: exp of the mean token-level loss."""
    # Encode input
    input_ids = ppl_tokenizer.encode(sentence, return_tensors='pt')

    with torch.no_grad():
        # Passing labels=input_ids makes the model return the average next-token loss
        outputs = ppl_model(input_ids, labels=input_ids)
        perplexity = torch.exp(outputs.loss)

    return perplexity.item()


In [None]:
# Example sentence
ppl = calculate_perplexity(response)
print(f"Perplexity: {ppl:.2f}")

Perplexity: 18.87



In our evaluation, we used human-written responses as the ground truth. However, an **alternative approach** involves using a **larger language model—guided by prompt engineering and chain-of-thought reasoning—to assess the output of a smaller model**. This method allows the larger model to act as a judge, helping to calculate key evaluation metrics.
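A judge setup of this kind usually has three parts: a grading prompt, a call to the larger model, and a parser for its verdict. The sketch below shows only the plumbing and is an assumption throughout: the prompt wording, the 1 to 5 scale, and the names `build_judge_prompt`, `parse_score`, and `judge_answer` are hypothetical, and `fake_llm` is a stub standing in for whatever model client is actually used.

```python
import re
from typing import Callable, List, Optional

# Hypothetical grading prompt; the wording and scale are assumptions
JUDGE_TEMPLATE = """You are grading an answer produced by a RAG system.

Context:
{context}

Answer:
{answer}

Think step by step, then finish with a line of the form 'SCORE: <1-5>',
where 5 means every statement in the answer is supported by the context."""


def build_judge_prompt(answer: str, contexts: List[str]) -> str:
    """Fill the grading template with the answer and its retrieved context."""
    return JUDGE_TEMPLATE.format(context="\n".join(contexts), answer=answer)


def parse_score(judge_output: str) -> Optional[int]:
    """Extract the last 'SCORE: n' verdict, or None if the judge gave none."""
    matches = re.findall(r"SCORE:\s*([1-5])\b", judge_output)
    return int(matches[-1]) if matches else None


def judge_answer(answer: str, contexts: List[str], call_llm: Callable[[str], str]) -> Optional[int]:
    """Send the grading prompt to the judge model and parse its verdict."""
    return parse_score(call_llm(build_judge_prompt(answer, contexts)))


def fake_llm(prompt: str) -> str:
    """Stub judge used only to make the example runnable."""
    return "The date in the answer matches the context.\nSCORE: 5"


verdict = judge_answer(
    "The first Super Bowl was held on January 15, 1967.",
    ["Super Bowl I was held on January 15, 1967, in Los Angeles."],
    fake_llm,
)
print(verdict)
```

Returning `None` for an unparseable verdict keeps malformed judge outputs visible instead of silently counting them as a low score.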

There are also several purpose-built libraries designed to support this kind of evaluation. Tools like **Ragas**, **TruLens**, **G-Eval**, and **LMEval** provide frameworks to assess critical aspects of a RAG system’s performance, including **faithfulness**, **relevance**, **fluency**, and **overall answer quality**.

# Limitations of RAG Evaluation

While RAG evaluation is essential for production readiness, it faces several challenges:

1. **Ground Truth Scarcity** – Many queries have multiple valid answers or no definitive “correct” document, making recall/precision evaluation tricky.
2. **Dependence on Corpus Quality** – Outdated, biased, or incomplete knowledge bases can mislead metrics, even if the retriever and generator are strong.
3. **Metric Blind Spots** – Overlap-based metrics (BLEU, ROUGE) and even semantic similarity can miss nuanced factual errors or partial truths.
4. **Attribution Ambiguity** – It’s often unclear if a fact came from retrieved context, the model’s memory, or hallucination, complicating error diagnosis.
5. **Subjectivity & Cost of Human Evaluation** – Human judgment remains the gold standard for faithfulness and relevance, but it is slow, costly, and inconsistent.
6. **Context Sensitivity** – Small changes in top-k, chunking, or prompt design can significantly shift results, reducing reproducibility.
7. **Limited Automation for Truthfulness** – Current automated tools struggle with subtle fact-checking, especially in domain-specific contexts.

**Bottom line:**
RAG evaluation should be used as a *guiding framework*, not an absolute scorecard. Combining automated metrics with targeted human review yields the most reliable insights.




## References

![Different RAG Configuration and accuracy](https://i.postimg.cc/d1PgJ3F4/RAG-settings-accuracy.png)



The graph illustrates how the performance of the RAG system improved through various tuning techniques, courtesy of Hugging Face.

[Pinecone RAG Evaluation](https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/rag-evaluation/)

[Hugging Face RAG Evaluation](https://huggingface.co/learn/cookbook/en/rag_evaluation)

[RAGAS Evaluation Metrics](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/context_precision/)