In this notebook, we will explore various RAG (Retrieval-Augmented Generation) evaluation metrics using the [RAGAS framework](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/).

**Why LLM Evaluation Matters**

Traditional software testing approaches fall short when evaluating LLM-based systems because:
- Outputs are non-deterministic and context-dependent
- Multiple correct answers can exist for the same input
- Semantic understanding is required beyond exact string matching

**LLMs as Judges**

We leverage LLMs themselves as evaluation judges to assess:
- Semantic similarity and relevance
- Factual accuracy and consistency
- Context utilization and precision

**What You'll Learn**

This notebook walks through five RAGAS metrics with practical code examples, covering both the retrieval side of a RAG pipeline (context precision, context recall) and the generation side (response relevance, faithfulness, factual correctness).

# Background

<image src="assets/Evaluation.png">

## 1. Context Precision

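Context Precision measures how well the retriever ranks relevant chunks ahead of irrelevant ones. Each chunk in `retrieved_contexts` is judged relevant or not relevant, and the score is the mean of precision@k over the ranks k that hold a relevant chunk, so relevant chunks near the top raise the score. The variant used here, `LLMContextPrecisionWithoutReference`, has the LLM judge each chunk against the response rather than against a reference answer.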


## 2. Context Recall

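Context Recall measures how much of the information needed for the answer was actually retrieved. The reference answer is broken into claims, the LLM checks whether each claim can be attributed to the retrieved contexts, and the score is the fraction of reference claims that can.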

## 3. Response Relevance

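Response relevance measures how directly the response addresses the user's question, regardless of whether it is factually correct. The LLM generates questions that the response would answer, and the score is the mean cosine similarity between the embeddings of those generated questions and the embedding of the original question.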

## 4. Faithfulness

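Faithfulness measures how factually consistent the response is with the retrieved contexts. The response is broken into claims, and the score is the fraction of those claims that the contexts support. A low score points to hallucinated content.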

## 5. Factual Correctness

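Factual Correctness compares the response against a reference answer. Both are broken into claims, the claims are matched against each other, and the result is reported as precision, recall, or (by default) their F1 score.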

# Code implementation

In [86]:
# imports
from ragas import SingleTurnSample
from ragas.metrics import (
    LLMContextPrecisionWithoutReference,
    LLMContextRecall,
    ResponseRelevancy,
    Faithfulness,
    FactualCorrectness,
)
from ragas.llms.base import llm_factory
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv

In [None]:
# load environment variables
load_dotenv("configs.env")

True

In [64]:
# initialize LLMs and embedding models
llm_as_judge = llm_factory("gpt-4o-mini")
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

## 1. Context Precision


In [20]:
sample_data = {
    "user_query": "What is the capital of France?",
    "retrieved_contexts": [
        "France is a country in Europe. Its capital is Paris.",
        "The Eiffel Tower is located in Paris, the capital city of France.",
        "Berlin is the capital of Germany.",
    ],
    "response": "The capital of France is Paris.",
}

In [21]:
context_precision = LLMContextPrecisionWithoutReference(llm=llm_as_judge)

In [22]:
sample = SingleTurnSample(
    user_input=sample_data["user_query"],
    response=sample_data["response"],
    retrieved_contexts=sample_data["retrieved_contexts"],
)

In [23]:
await context_precision.single_turn_ascore(sample)

0.99999999995
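
The bare `await` in the cell above relies on the event loop that the notebook already runs. In a plain Python script there is no such loop, and one common way to make the same kind of call is to wrap it in a coroutine and hand it to `asyncio.run`. The sketch below uses a hypothetical stand-in, `fake_single_turn_ascore`, instead of a real metric so that it runs without an API key; it is an illustration of the pattern, not RAGAS code.

```python
import asyncio


async def fake_single_turn_ascore(sample):
    """Hypothetical stand-in for a metric's async scoring call (no API request)."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return 1.0


async def main():
    # In a real script this would await the metric's single_turn_ascore(sample)
    sample = {"user_input": "What is the capital of France?"}
    return await fake_single_turn_ascore(sample)


# asyncio.run creates an event loop, runs main() to completion, and closes the loop
score = asyncio.run(main())
print(score)
```

Calling `asyncio.run` from inside a notebook cell raises an error because a loop is already running there, so this form is meant for scripts only.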

In [24]:
sample_data = {
    "user_query": "What is the capital of France?",
    "retrieved_contexts": [
        "Berlin is the capital of Germany.",
        "The Eiffel Tower is located in Paris, the capital city of France.",
        "France is a country in Europe. Its capital is Paris.",
    ],
    "response": "The capital of France is Paris.",
}

In [25]:
context_precision = LLMContextPrecisionWithoutReference(llm=llm_as_judge)

In [26]:
sample = SingleTurnSample(
    user_input=sample_data["user_query"],
    response=sample_data["response"],
    retrieved_contexts=sample_data["retrieved_contexts"],
)

In [27]:
await context_precision.single_turn_ascore(sample)

0.5833333333041666
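
The two samples contain the same chunks and differ only in their order, yet the score drops from about 1.0 to about 0.58. A minimal sketch of the arithmetic, assuming the judge marked both Paris chunks as relevant and the Berlin chunk as not relevant, and assuming the score is the mean of precision@k over the relevant ranks with a small epsilon in the denominator. The helper name `average_precision` and the `eps` value are illustrative and are not taken from the RAGAS source.

```python
def average_precision(verdicts, eps=1e-10):
    """Mean of precision@k over the ranks that hold a relevant chunk.

    verdicts: 1 if the chunk at that rank was judged relevant, else 0.
    eps: assumed guard against division by zero when no chunk is relevant.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        hits += verdict
        # precision@rank contributes only when the chunk at this rank is relevant
        weighted_sum += (hits / rank) * verdict
    return weighted_sum / (sum(verdicts) + eps)


# Assumed judge verdicts for the two samples above (Berlin chunk = 0)
relevant_first = average_precision([1, 1, 0])  # (1/1 + 2/2) / 2
relevant_last = average_precision([0, 1, 1])   # (1/2 + 2/3) / 2
print(round(relevant_first, 4), round(relevant_last, 4))
```

Under these assumptions the helper gives roughly 1.0 and 0.583, in line with the two outputs above: pushing the irrelevant chunk to the top lowers precision at every later rank.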

## 2. Context Recall

In [30]:
sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["Paris is the capital of France."],
)

In [32]:
context_recall = LLMContextRecall(llm=llm_as_judge)

In [33]:
await context_recall.single_turn_ascore(sample)

1.0
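
A minimal sketch of how a recall score like this can be derived, assuming the judge returns one attributed / not attributed verdict per reference claim. The helper `context_recall` and the verdict lists are hypothetical; the real verdicts come from the LLM and are not shown in the output above.

```python
def context_recall(claim_attributed):
    """Fraction of reference claims that the retrieved contexts support."""
    if not claim_attributed:
        raise ValueError("the reference produced no claims to check")
    return sum(claim_attributed) / len(claim_attributed)


# Assumed: the judge saw a single reference claim and marked it attributable
one_claim_supported = context_recall([True])

# Hypothetical: two reference claims, only the first found in the contexts
half_supported = context_recall([True, False])

print(one_claim_supported, half_supported)
```

Because the denominator is the number of claims in the reference, a short reference with a single claim can only score 0.0 or 1.0.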

## 3. Response Relevance

In [76]:
sample = SingleTurnSample(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
)

In [77]:
scorer = ResponseRelevancy(llm=llm_as_judge, embeddings=embeddings)

In [78]:
response_relevance_score = await scorer.single_turn_ascore(sample)

In [79]:
print(response_relevance_score)

0.8808327374424308
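
This metric is the only one here that needs the embedding model. A minimal sketch of the similarity step, assuming the score is the mean cosine similarity between the user question's embedding and the embeddings of questions generated from the response. The 2-D vectors are toy values standing in for real embeddings, and both helper functions are illustrative rather than RAGAS code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def response_relevancy(question_vec, generated_question_vecs):
    """Mean similarity between the original question and each generated question."""
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (hypothetical): one identical, one partly aligned, one orthogonal
user_question = [1.0, 0.0]
generated = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

score = response_relevancy(user_question, generated)
print(round(score, 4))
```

A response that answers a different question, or only part of the question, yields generated questions that sit further from the original in embedding space, which pulls the mean down.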


## 4. Faithfulness

In [82]:
sample = SingleTurnSample(
    user_input="When was the first super bowl?",
    response="The first superbowl was held on Jan 15, 1967",
    retrieved_contexts=[
        "The First AFL–NFL World Championship Game was an American football game played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles."
    ],
)

In [83]:
scorer = Faithfulness(llm=llm_as_judge)

In [84]:
faithfulness_score = await scorer.single_turn_ascore(sample)

In [85]:
print(faithfulness_score)

1.0
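
A score of 1.0 means every claim in the response was judged to be supported by the retrieved context. To see how an unsupported claim would lower it, here is a minimal sketch that assumes the judge returns one supported / not supported verdict per response claim. The helper `faithfulness` and the second claim are hypothetical and were not part of the run above.

```python
def faithfulness(claim_verdicts):
    """Return (score, unsupported claims).

    claim_verdicts maps each response claim to True if the contexts support it.
    """
    unsupported = [claim for claim, ok in claim_verdicts.items() if not ok]
    score = (len(claim_verdicts) - len(unsupported)) / len(claim_verdicts)
    return score, unsupported


# Hypothetical verdicts: the second claim contradicts the Los Angeles context
verdicts = {
    "The first Super Bowl was held on Jan 15, 1967.": True,
    "It was played in Miami.": False,
}

score, unsupported = faithfulness(verdicts)
print(score, unsupported)
```

Returning the unsupported claims alongside the score makes it easier to see which part of a response was hallucinated.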


## 5. Factual Correctness

In [87]:
sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris. It has a height of 1000ft.",
)

In [88]:
factual_scorer = FactualCorrectness(llm=llm_as_judge)

In [89]:
factual_score = await factual_scorer.single_turn_ascore(sample)

In [90]:
print(factual_score)

0.67
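
The value 0.67 is what an F1 score gives when the response gets one claim right and misses one claim from the reference. A minimal sketch of that arithmetic, assuming the judge found one claim shared by response and reference (true positive), no response claim absent from the reference (false positive), and one reference claim absent from the response (false negative). The helper `factual_correctness` and its `beta` parameter are illustrative, and the counts are an assumption about the run above.

```python
def factual_correctness(tp, fp, fn, beta=1.0):
    """F-beta score from claim-level counts, rounded to two decimals."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return round(f_beta, 2)


# Assumed counts: the location claim matches, the height claim is missing
score = factual_correctness(tp=1, fp=0, fn=1)
print(score)  # 0.67
```

Here precision is 1.0 and recall is 0.5, so the response is penalized for omitting the height even though everything it states is correct.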


# Summary of metrics

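The table below summarizes the five metrics as they were used in this notebook. All of them use `llm_as_judge`; only `ResponseRelevancy` also needs the embedding model.

| Metric | Class | Sample fields used | Needs embeddings | Score in this notebook |
|---|---|---|---|---|
| Context Precision | `LLMContextPrecisionWithoutReference` | `user_input`, `response`, `retrieved_contexts` | No | ≈1.0 (relevant chunks first), ≈0.58 (relevant chunks last) |
| Context Recall | `LLMContextRecall` | `user_input`, `reference`, `retrieved_contexts` | No | 1.0 |
| Response Relevance | `ResponseRelevancy` | `user_input`, `response` | Yes | ≈0.88 |
| Faithfulness | `Faithfulness` | `user_input`, `response`, `retrieved_contexts` | No | 1.0 |
| Factual Correctness | `FactualCorrectness` | `response`, `reference` | No | 0.67 |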