# RAG Evaluation with RAGAS Framework

In this notebook, we explore various RAG (Retrieval-Augmented Generation) evaluation metrics using the [RAGAS framework](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/).

## Why LLM-as-Judge Evaluation?

Traditional testing approaches such as exact-match assertions break down for LLM systems, because outputs are non-deterministic and many differently worded answers can be equally correct. We therefore use LLMs themselves as judges to evaluate semantic similarity, factual accuracy, and context utilization.

## RAG Pipeline Overview

RAG enhances language models through a 4-step process:
1. **Document Ingestion** → Vector embeddings stored in database
2. **Query Processing** → User queries converted to embeddings  
3. **Retrieval** → Similar document chunks retrieved
4. **Generation** → LLM generates grounded responses using query + context

## Evaluation Metrics

We examine five metrics that together cover both the retrieval and the generation stages of a RAG pipeline:

**Retrieval Quality**

1. **Context Precision**: Whether the relevant retrieved contexts are ranked above the irrelevant ones
2. **Context Recall**: How much of the ground truth information is covered by the retrieved contexts

**Response Quality**

3. **Response Relevance**: How well response addresses the query
4. **Faithfulness**: Response consistency with retrieved contexts
5. **Factual Correctness**: Agreement of response claims with a ground truth reference


<image src="assets/Evaluation.png">

In [1]:
from ragas import SingleTurnSample
from ragas.metrics import (
    LLMContextPrecisionWithoutReference,
    LLMContextRecall,
    ResponseRelevancy,
    Faithfulness,
    FactualCorrectness,
)
from ragas.llms.base import llm_factory
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv

In [2]:
# load environment variables
load_dotenv("configs.env")

True

In [3]:
# MODELS
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4o-mini"

In [4]:
# initialize LLMs and embedding models
llm_as_judge = llm_factory(LLM_MODEL)
embeddings = OpenAIEmbeddings(model=EMBEDDING_MODEL)

## 1. Context Precision

Context Precision measures how well a system puts the most useful chunks at the top of your search results.

<image src="assets/Zoom Context Precision.png">

The sample data below uses `user_query` for the user's question, `retrieved_contexts` for the simulated retrieved text chunks, and `response` for the model-generated output.

We initialize the context precision metric with our LLM as judge. This metric evaluates how well the most relevant contexts are ranked at the top of the retrieved results, providing a score between 0 and 1 where 1 indicates perfect precision with all relevant contexts ranked highest.

In [5]:
# sample data for context precision
sample_data = {
    "user_query": "What is the capital of France?",
    "retrieved_contexts": [
        "France is a country in Europe. Its capital is Paris.",
        "The Eiffel Tower is located in Paris, the capital city of France.",
        "Berlin is the capital of Germany.",
    ],
    "response": "The capital of France is Paris.",
}

# initialize LLMContextPrecisionWithoutReference
context_precision = LLMContextPrecisionWithoutReference(llm=llm_as_judge)
sample = SingleTurnSample(
    user_input=sample_data["user_query"],
    response=sample_data["response"],
    retrieved_contexts=sample_data["retrieved_contexts"],
)

# print context precision score
print(await context_precision.single_turn_ascore(sample))

0.99999999995


If the same contexts are retrieved in a different order, with the Berlin chunk **(irrelevant information)** ranked first, the context precision score drops from about 1.0 to about 0.58. This shows that context precision measures the ranking quality of the retrieved information, not just which chunks were retrieved.

In [6]:
# Revised sample data with different context ranking
sample_data = {
    "user_query": "What is the capital of France?",
    "retrieved_contexts": [
        "Berlin is the capital of Germany.",
        "The Eiffel Tower is located in Paris, the capital city of France.",
        "France is a country in Europe. Its capital is Paris.",
    ],
    "response": "The capital of France is Paris.",
}

# initialize LLMContextPrecisionWithoutReference
context_precision = LLMContextPrecisionWithoutReference(llm=llm_as_judge)
sample = SingleTurnSample(
    user_input=sample_data["user_query"],
    response=sample_data["response"],
    retrieved_contexts=sample_data["retrieved_contexts"],
)

# print context precision score
print(await context_precision.single_turn_ascore(sample))

0.5833333333041666
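Both scores can be reproduced by hand. The sketch below assumes the judge labels the two Paris chunks as relevant (1) and the Berlin chunk as irrelevant (0), then averages precision@k over the ranks that hold a relevant chunk. The function name `context_precision_from_verdicts` and the `eps` term are illustrative assumptions rather than the RAGAS API; a small epsilon in the denominator would explain why the first score prints as 0.99999999995 instead of exactly 1.0.

```python
def context_precision_from_verdicts(verdicts, eps=1e-10):
    """Average of precision@k over the ranks k that hold a relevant chunk.

    verdicts: 0/1 relevance labels in ranked order (hypothetical judge output).
    eps: small constant, assumed here, that avoids division by zero.
    """
    hits = 0        # relevant chunks seen so far
    weighted = 0.0  # running sum of precision@k at relevant ranks
    for k, verdict in enumerate(verdicts, start=1):
        hits += verdict
        weighted += (hits / k) * verdict
    return weighted / (sum(verdicts) + eps)


original_order = context_precision_from_verdicts([1, 1, 0])  # Paris, Paris, Berlin
berlin_first = context_precision_from_verdicts([0, 1, 1])    # Berlin, Paris, Paris
print(original_order, berlin_first)
```

With the Berlin chunk first, the relevant chunks contribute precision@2 = 1/2 and precision@3 = 2/3, whose mean is about 0.583.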


## 2. Context Recall

Context Recall measures how well the retrieved contexts cover all the important facts in the ground truth.

$$\text{Context Recall} = \frac{\text{Number of facts supported by retrieved contexts}}{\text{Total number of facts in the ground truth answer}}$$

<image src="assets/Zoom Context Recall.png">

Below are two examples illustrating how **Context Recall** is computed.

Note:
* `user_input`: The user's query 
* `retrieved_contexts`: Simulated retrieved text chunks from the database 
* `reference`: Ground truth answer

### **Example 1**

In [7]:
# create sample data for context recall
sample = SingleTurnSample(
    user_input="Where is Bob's burger located and what is it famous for?",
    retrieved_contexts=[
        "Bob's burger is owned by Bob Belcher and is located in New York City.",
        "Bob's burger has a new daily burger of the day special.",
    ],
    reference="Bob's burger is located in New York City. It is famous for its burger of the day specials.",
)

# initialize LLMContextRecall
context_recall = LLMContextRecall(llm=llm_as_judge)

# print context recall score
print(await context_recall.single_turn_ascore(sample))

1.0


**Analysis for Example 1**

* **Ground truth facts:**
    1. Bob's burger is located in New York City.
    2. Bob's burger is famous for its burger of the day specials.

* **Retrieved contexts:**
    * "Bob's burger is owned by Bob Belcher and is located in New York City."
    * "Bob's burger has a new daily burger of the day special."

* **Fact coverage:**
    * The **first fact** ("located in New York City") is **fully supported**, as the first context explicitly mentions the location.
    * The **second fact** ("famous for its burger of the day specials") is **fully supported**, because the second context mentions a daily burger of the day special which directly relates to what makes it famous.

$$\text{Context Recall} = \frac{\text{Facts supported by context}}{\text{Total facts in ground truth}} = \frac{2}{2} = 1.0$$

The high score of 1.0 indicates that all ground truth facts are adequately covered by the retrieved contexts.


### **Example 2**

In [8]:
# create sample data for context recall
sample = SingleTurnSample(
    user_input="Where is Bob's burger located and what is it famous for?",
    retrieved_contexts=[
        "Bob's burger is owned by Bob Belcher and is established since 2011.",
        "Bob's burger has a new daily burger of the day special.",
    ],
    reference="Bob's burger is located in New York City. It is famous for its burger of the day specials.",
)

# initialize LLMContextRecall
context_recall = LLMContextRecall(llm=llm_as_judge)

# print context recall score
print(await context_recall.single_turn_ascore(sample))

0.5


**Analysis for Example 2**

* **Ground truth facts:**
  1. Bob's burger is located in New York City.
  2. Bob's burger is famous for its burger of the day specials.

* **Retrieved contexts:**
  * "Bob's burger is owned by Bob Belcher and is established since 2011."
  * "Bob's burger has a new daily burger of the day special."

* **Fact coverage:**
  * The **first fact** ("located in New York City") is **not supported**, as neither context mentions the location of Bob's burger.
  * The **second fact** ("famous for its burger of the day specials") is **supported**, because the second context mentions a daily burger of the day special.

$$\text{Context Recall} = \frac{\text{Facts supported by context}}{\text{Total facts in ground truth}} = \frac{1}{2} = 0.5$$


## 3. Response Relevance

The Response Relevancy metric measures how relevant a generated response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.

<image src="assets/Zoom Response Relevance.png">

In [9]:
# create sample data for response relevancy
sample = SingleTurnSample(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
)

# initialize ResponseRelevancy
scorer = ResponseRelevancy(llm=llm_as_judge, embeddings=embeddings)
response_relevance_score = await scorer.single_turn_ascore(sample)

# print response relevancy score
print(response_relevance_score)

0.9164761343745044
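As a rough sketch of how such a score can be formed: as we understand the RAGAS documentation, the judge generates a few questions from the response, embeds them, and averages their cosine similarity to the original user input. The code below illustrates only that averaging step with made-up 2-dimensional vectors. `mean_cosine_relevancy` and the vectors are hypothetical; real embeddings would come from the embedding model configured earlier.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_cosine_relevancy(query_vec, generated_question_vecs):
    """Mean similarity between the query and questions generated from the response."""
    sims = [cosine(query_vec, q) for q in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy vectors (assumed for illustration): one generated question matches the
# query exactly, the other only partially.
query_vec = [1.0, 0.0]
generated = [[1.0, 0.0], [0.6, 0.8]]
relevancy = mean_cosine_relevancy(query_vec, generated)
print(relevancy)
```

A response that drifts off topic would lead to generated questions that sit further from the original query, pulling the mean down.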


## 4. Faithfulness

The Faithfulness metric measures how factually consistent a response is with the retrieved context. Scores range from 0 to 1, and higher scores indicate better consistency, so a low score helps flag responses in which the model is hallucinating.

<image src="assets/Zoom Faithfulness.png">

In [10]:
# create sample data for faithfulness
sample = SingleTurnSample(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    retrieved_contexts=[
        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ],
)

# initialize Faithfulness metric
scorer = Faithfulness(llm=llm_as_judge)
faithfulness_score = await scorer.single_turn_ascore(sample)

# print faithfulness score
print(faithfulness_score)

1.0
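Following the same pattern as Context Recall, Faithfulness can be written as a ratio of claims. This is a sketch of the usual formulation, assuming the judge splits the response into individual claims and checks each one against the retrieved contexts:

$$\text{Faithfulness} = \frac{\text{Number of response claims supported by retrieved contexts}}{\text{Total number of claims in the response}}$$

In the example above, the response makes a single claim (the game was held on January 15, 1967), and the retrieved context supports it, which is consistent with the score of $\frac{1}{1} = 1.0$.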


## 5. Factual Correctness

Factual Correctness measures the agreement between a model’s generated response and a reference ground truth for a given question. 

<image src="assets/Zoom Factual Correctness.png">

In [11]:
# create sample data for factual correctness
sample = SingleTurnSample(
    response="Bob has 3 kids: Tina, Gene, Louise, and he also has a dog.",
    reference="Bob has 3 kids: Tina, Gene, and Louise.",
)

# initialize Factual Correctness metric
factual_scorer = FactualCorrectness(llm=llm_as_judge)
factual_score = await factual_scorer.single_turn_ascore(sample)

# print factual correctness score
print(factual_score)

0.8
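One way to read the 0.8 is as an F1 score over claims. The sketch below assumes the judge decomposes the response into three claims: two match the reference (true positives), one (the dog) does not (false positive), and no reference claim is missing from the response (false negatives = 0). These counts are an assumption chosen to be consistent with the printed score; the actual decomposition is decided by the LLM judge and may differ between runs. `claim_f1` is a hypothetical helper, not part of RAGAS.

```python
def claim_f1(true_positives, false_positives, false_negatives):
    """F1 over claims: harmonic mean of claim-level precision and recall."""
    denominator = true_positives + 0.5 * (false_positives + false_negatives)
    return true_positives / denominator if denominator else 0.0


# Assumed claim counts for the Bob example (not taken from the judge's output).
score = claim_f1(true_positives=2, false_positives=1, false_negatives=0)
print(round(score, 2))
```

Under these assumptions, claim precision is 2/3 and claim recall is 1, so the extra unsupported claim about the dog is what lowers the score.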


## Summary of Metrics

|Metric|Assesses|Ground truth required|
|----|-----|----------|
|Context Precision|Retrieval|No|
|Context Recall|Retrieval|Yes (reference answer)|
|Response Relevance|Generation|No|
|Faithfulness|Generation|No|
|Factual Correctness|Generation|Yes (reference answer)|
