[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weaviate/recipes/blob/main/integrations/operations/deepeval/rag_evaluation_deepeval.ipynb)

# RAG Evaluation with DeepEval

# Tutorial Overview

In this tutorial, we will use [DeepEval](https://docs.confident-ai.com/) to evaluate a RAG pipeline built with **Weaviate**. Our goal is to optimize RAG responses by selecting the appropriate hyperparameters. These hyperparameters include both:

- **Generation Hyperparameters:** such as `model` and `prompt template`
- **Retrieval Hyperparameters:** such as `top-K`, `embedding model`, and `chunk size`

Evaluation lets us compare candidate settings and pick the combination that gives the best RAG pipeline performance.

In this notebook, we will:

1. Define **DeepEval [metrics](https://docs.confident-ai.com/docs/metrics-contextual-precision)** to measure RAG performance
2. Build a simple RAG pipeline with Weaviate  
3. Run evaluations on the RAG pipeline using DeepEval metrics
4. Optimize the hyperparameters based on evaluation results  

DeepEval metrics work out of the box without any additional configuration. This example demonstrates the basics of using DeepEval. For more details on advanced usage, please visit the [docs](https://docs.confident-ai.com/).


# 1. Install packages and dependencies

Begin by installing the necessary libraries.

In [None]:
!pip install -U deepeval weaviate-client

# 2. Define DeepEval RAG Metrics

There are 2 types of RAG evaluation metrics:

*   **Generator Metrics:** measure response generation quality
*   **Retrieval Metrics:** measure retrieval quality

We'll be using both of these metrics in this tutorial to evaluate our RAG.

DeepEval metrics are **powered by LLMs**. You can use any LLM, but for this tutorial we'll be using `gpt-4o` as the default model.

Begin by setting your `OPENAI_API_KEY`. Once this environment variable is set, `gpt-4o` will automatically be used as the default model for running these metrics.

In [2]:
import os

# Store the API key in an environment variable so DeepEval can pick it up
openai_api_key = "Your OPENAI_API_KEY"
os.environ["OPENAI_API_KEY"] = openai_api_key
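
Pasting a key into a notebook cell makes it easy to leak by accident. As an alternative, here is a minimal sketch that reads a key from the environment and fails early if it is missing or still set to a placeholder. The helper name `require_env` is hypothetical and is not part of DeepEval or Weaviate.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    # Treat an unset variable and the tutorial's "Your ..." placeholder the same way
    if not value or value.startswith("Your "):
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return value
```

With the variable exported in your shell, `openai_api_key = require_env("OPENAI_API_KEY")` could stand in for the hardcoded string.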

DeepEval offers **2 generator metrics** to evaluate response generations:

* [Answer Relevancy](https://docs.confident-ai.com/docs/metrics-answer-relevancy): evaluates how relevant the output of your LLM application is to the provided input.
* [Faithfulness](https://docs.confident-ai.com/docs/metrics-faithfulness): evaluates whether the actual output factually aligns with the contents of your `retrieval_context`.

In [3]:
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric
)

# Initialize the generator metrics
answer_relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)

...and **3 retriever metrics** to measure retrieval quality:

* [Contextual Precision](https://docs.confident-ai.com/metrics/contextual-precision): checks that the most relevant chunks are ranked higher than irrelevant ones.
* [Contextual Recall](https://docs.confident-ai.com/metrics/contextual-recall): measures how well the retrieved information aligns with the expected LLM output.
* [Contextual Relevancy](https://docs.confident-ai.com/metrics/contextual-relevancy): checks how well the retrieved context aligns with the query.

In [4]:
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric
)

# Initialize the retriever metrics
contextual_precision = ContextualPrecisionMetric(threshold=0.7)
contextual_recall = ContextualRecallMetric(threshold=0.7)
contextual_relevancy = ContextualRelevancyMetric(threshold=0.7)
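
To make the ranking idea behind Contextual Precision concrete, the sketch below scores a ranked list of chunks with a rank-weighted precision (average precision) formula. This is an illustration under assumptions. In DeepEval an LLM judge decides whether each chunk is relevant, whereas here the verdicts are hand-written booleans. The function name `rank_weighted_precision` is hypothetical.

```python
from typing import List


def rank_weighted_precision(relevance: List[bool]) -> float:
    """Score a ranked list; relevance[k] is True if the chunk at rank k + 1 is relevant."""
    relevant_seen = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            total += relevant_seen / rank  # precision measured at this rank
    # Average over the relevant chunks; an all-irrelevant list scores 0
    return total / relevant_seen if relevant_seen else 0.0


early = rank_weighted_precision([True, False, True])  # relevant chunks at ranks 1 and 3
late = rank_weighted_precision([False, True, True])   # relevant chunks at ranks 2 and 3
print(early > late)
```

Both lists contain two relevant chunks, but the one that ranks a relevant chunk first scores higher.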

# 3. Defining your Weaviate RAG Pipeline



With the metrics defined, we can start building our RAG pipeline. In this tutorial, we'll construct and evaluate a QA RAG system designed to answer questions about Git. Begin by defining the Weaviate client.

In [5]:
os.environ["WCD_URL"] = "Your Weaviate URL"
os.environ["WCD_API_KEY"] = "Your Weaviate API Key"

In [32]:
import weaviate
from weaviate.classes.init import Auth

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WCD_URL"],
    auth_credentials=Auth.api_key(os.environ["WCD_API_KEY"]),
    headers={
        "X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")
    }
)

### Download & Chunk

Next, we'll prepare the chunks of information about Git that will populate our Weaviate collection. To do this, we'll download a chapter from the Pro Git book, clean the text, and then chunk it.

In [33]:
from typing import List


def download_and_chunk(src_url: str, chunk_size: int, overlap_size: int) -> List[str]:
    import requests
    import re

    response = requests.get(src_url)  # Retrieve source text
    response.raise_for_status()  # Fail early if the download did not succeed
    source_text = re.sub(r"\s+", " ", response.text).strip()  # Collapse whitespace runs and trim the ends
    text_words = re.split(r"\s", source_text)  # Split text by single whitespace

    chunks = []
    for i in range(0, len(text_words), chunk_size):  # Iterate through & chunk data
        chunk = " ".join(text_words[max(i - overlap_size, 0): i + chunk_size])  # Join a set of words into a string
        chunks.append(chunk)
    return chunks


pro_git_chapter_url = "https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/what-is-git.asc"
chunked_text = download_and_chunk(pro_git_chapter_url, 150, 25)
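
To see how `chunk_size` and `overlap_size` interact, the sketch below applies the same windowing logic to a small in-memory word list instead of a downloaded file. The helper name `chunk_words` is hypothetical and introduced only for this illustration.

```python
from typing import List


def chunk_words(words: List[str], chunk_size: int, overlap_size: int) -> List[str]:
    """Apply the same windowing as download_and_chunk to an in-memory word list."""
    chunks = []
    for i in range(0, len(words), chunk_size):
        start = max(i - overlap_size, 0)  # reach back into the previous chunk
        chunks.append(" ".join(words[start: i + chunk_size]))
    return chunks


words = [f"w{n}" for n in range(10)]
for chunk in chunk_words(words, chunk_size=4, overlap_size=1):
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6 w7
# w7 w8 w9
```

Every chunk after the first starts `overlap_size` words early, so it can hold up to `chunk_size + overlap_size` words.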

### Creating the Collection

Once we have our chunks, we can define a collection for them. This collection serves as our vector store and retrieval engine. Weaviate also has **built-in RAG capabilities** that let you generate responses to queries automatically.


When using RAG capabilities with Weaviate, it's important to designate your preferred generative module directly at the collection level. In the example below, the `GitBookChunk` collection is configured with `text2vec-openai` as the vectorizer and `generative-openai` as the generative module.

Let's define our collection and call it `GitBookChunk`.

In [35]:
import weaviate.classes as wvc

collection_name = "GitBookChunk"

if client.collections.exists(collection_name):  # In case we've created this collection before
    client.collections.delete(collection_name)  # THIS WILL DELETE ALL DATA IN THE COLLECTION

chunks = client.collections.create(
    name=collection_name,
    properties=[
        wvc.config.Property(
            name="chunk",
            data_type=wvc.config.DataType.TEXT
        ),
        wvc.config.Property(
            name="chapter_title",
            data_type=wvc.config.DataType.TEXT
        ),
        wvc.config.Property(
            name="chunk_index",
            data_type=wvc.config.DataType.INT
        ),
    ],
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # Use `text2vec-openai` as the vectorizer
    # Pass the model as an argument; assigning an attribute on `Configure.Generative.openai` has no effect
    generative_config=wvc.config.Configure.Generative.openai(model="gpt-3.5-turbo"),  # Use `generative-openai` with `gpt-3.5-turbo`
)

Then, populate the collection with the text chunks we created and cleaned.

In [None]:
chunks_list = list()
for i, chunk in enumerate(chunked_text):
    data_properties = {
        "chapter_title": "What is Git",
        "chunk": chunk,
        "chunk_index": i
    }
    data_object = wvc.data.DataObject(properties=data_properties)
    chunks_list.append(data_object)
chunks.data.insert_many(chunks_list)

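Finally, run a test query. Note that `fetch_objects` returns stored objects without ranking them against the question, so this call only confirms that generation works. A similarity search such as `generate.near_text` would rank chunks by relevance to the query.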

In [None]:
response = chunks.generate.fetch_objects(
    limit=1,
    grouped_task="What is git"
)
print(response.generated)

# 4. Evaluating the RAG


With the RAG pipeline ready, we can begin evaluating it. Evaluation consists of two main steps:

1. **Test Case Preparation:**  
   Prepare an input query along with the expected LLM response. Then, use the input to generate a response from your RAG pipeline, creating an `LLMTestCase` that contains:
   - `input`
   - `actual_output`
   - `expected_output`
   - `retrieval_context`

2. **Test Case Evaluation:**  
   Evaluate the test case using the selection of RAG metrics we previously defined.

### Test Case preparation

Let's begin by defining an `input` and preparing an `expected_output` for it.

In [37]:
input = "How does Git work, and why is it so fast?"
expected_output = "Git is a distributed version control system that manages project data by capturing snapshots of the entire filesystem with each commit. This snapshot-based approach, combined with the fact that nearly all operations are performed locally, enables Git to provide near-instantaneous responses even without a network connection."

Next, run a RAG query against the `chunks` collection we defined in the previous section to obtain the `actual_output` and `retrieval_context` for this input, then create an `LLMTestCase` from them.

In [38]:
from deepeval.test_case import LLMTestCase

# Run a RAG query against the collection and capture the answer and its context
response = chunks.generate.fetch_objects(
    limit=1,
    grouped_task=f"Answer the following question using only the information contained in your chunks, in not more than 2 sentences: {input}"
)
actual_output = response.generated
retrieval_context = [o.properties['chunk'] for o in response.objects]

test_case = LLMTestCase(
    input=input,
    actual_output=actual_output,
    expected_output=expected_output,
    retrieval_context=retrieval_context
)

In [None]:
print(test_case)
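
When there are several questions to evaluate, it can help to separate "run the pipeline" from "assemble the test case fields". The sketch below does this with plain dictionaries so it runs without DeepEval or Weaviate. The names `build_case_fields` and `fake_rag` are hypothetical, and `fake_rag` is a stub standing in for the real Weaviate query.

```python
from typing import Callable, Dict, List, Tuple

# A RAG function takes a question and returns (answer, retrieved chunks)
RagFn = Callable[[str], Tuple[str, List[str]]]


def build_case_fields(question: str, expected: str, rag_fn: RagFn) -> Dict[str, object]:
    """Run the pipeline once and collect the four fields a test case needs."""
    actual_output, retrieval_context = rag_fn(question)
    return {
        "input": question,
        "actual_output": actual_output,
        "expected_output": expected,
        "retrieval_context": retrieval_context,
    }


def fake_rag(question: str) -> Tuple[str, List[str]]:
    # Stub for illustration only: returns a canned answer and a canned chunk
    return "stub answer", ["stub chunk"]


fields = build_case_fields("What is Git?", "A version control system.", fake_rag)
print(sorted(fields))
```

With a real pipeline function in place of the stub, `LLMTestCase(**fields)` would build the same kind of test case as the cell above.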

### Run Evaluations

To run evaluations, simply pass the test case and metrics into DeepEval's `evaluate` function.

In [None]:
from deepeval import evaluate

evaluate(
  [test_case],
  [answer_relevancy, faithfulness, contextual_precision, contextual_recall, contextual_relevancy]
)

# 5. Optimizing RAG

You may notice that the RAG pipeline we created performs well on some metrics and underperforms on others. This is why it helps to iterate over different hyperparameters and see which combinations yield the best scores across the board.

We could vary several hyperparameters, such as the embedding model and prompt template, but here we'll iterate over **top-K (limit)** values to find the best-performing option across the retrieval metrics. A simple `for` loop that rebuilds the test case and calls `evaluate` for each value is enough.



In [None]:
# Re-run the RAG query and the retrieval metrics for each top-K value
for top_k in [1, 3, 5, 7]:
    response = chunks.generate.fetch_objects(
        limit=top_k,
        grouped_task=f"Answer the following question using only the information contained in your chunks, in not more than 2 sentences: {input}"
    )
    actual_output = response.generated
    retrieval_context = [o.properties['chunk'] for o in response.objects]

    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        expected_output=expected_output,
        retrieval_context=retrieval_context
    )

    evaluate([test_case], [contextual_precision, contextual_recall, contextual_relevancy])
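
Once scores have been collected for each top-K value, a small helper can pick the winner. The sketch below ranks settings by their mean metric score, which is one possible rule and not something DeepEval prescribes. The name `best_top_k` is hypothetical, and the numbers are placeholders, not results from this notebook. One way to fill such a dictionary, assuming the standard metric interface, is to call each metric's `measure(test_case)` and read its `score` attribute.

```python
from statistics import mean
from typing import Dict


def best_top_k(scores_by_top_k: Dict[int, Dict[str, float]]) -> int:
    """Return the top-K value with the highest mean metric score (ties go to the smaller K)."""
    return max(sorted(scores_by_top_k), key=lambda k: mean(scores_by_top_k[k].values()))


# Placeholder numbers for illustration only; these are NOT real evaluation results
placeholder_scores = {
    1: {"contextual_precision": 1.0, "contextual_recall": 0.4, "contextual_relevancy": 0.8},
    3: {"contextual_precision": 0.9, "contextual_recall": 0.8, "contextual_relevancy": 0.7},
    5: {"contextual_precision": 0.7, "contextual_recall": 0.9, "contextual_relevancy": 0.5},
}
print(best_top_k(placeholder_scores))
```

A plain mean weights all metrics equally. If recall matters more for your use case, a weighted mean or a minimum-score rule may be a better fit.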

To optimize all hyperparameters, iterate over each one along with the metrics to find the optimal combination for your specific use case!
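
A minimal sketch of enumerating such combinations with `itertools.product` follows. The search space is hypothetical, and its values are examples, not recommendations.

```python
from itertools import product

# Hypothetical search space; the values are examples, not recommendations
grid = {
    "top_k": [1, 3, 5],
    "chunk_size": [100, 150],
    "overlap_size": [0, 25],
}

# One dictionary per combination, e.g. {"top_k": 1, "chunk_size": 100, "overlap_size": 0}
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))  # 3 * 2 * 2 = 12
print(combinations[0])
```

Changing `chunk_size` or `overlap_size` means re-chunking the text and re-populating the collection, so those settings are more expensive to sweep than `top_k`.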





