# DeepEval Integration with ValidMind

Learn how to integrate [DeepEval](https://github.com/confident-ai/deepeval) with the ValidMind Library to evaluate Large Language Models (LLMs) and AI agents. This notebook demonstrates how to use DeepEval's evaluation metrics within ValidMind's testing infrastructure.

To integrate DeepEval with ValidMind, we'll:
 1. Set up both frameworks and install required dependencies
 2. Create DeepEval test cases for simple Q&A, RAG, and agent scenarios
 3. Convert the test cases into ValidMind datasets
 4. Score each dataset with ValidMind's DeepEval scorers
 5. Review the resulting scores and reasons


## Contents    
- [Introduction](#toc1_)    
- [About DeepEval Integration](#toc2_)    
  - [Before you begin](#toc2_1_)    
  - [Key concepts](#toc2_2_)    
- [Setting up](#toc3_)    
  - [Install required packages](#toc3_1_)    
  - [Initialize ValidMind](#toc3_2_)    
- [Basic Usage - Simple Q&A Evaluation](#toc4_)    
- [RAG System Evaluation](#toc5_)    
  - [Create test cases](#toc5_1_)    
  - [Build dataset](#toc5_2_)    
  - [Evaluation metrics](#toc5_3_)    
    - [Contextual Relevancy](#toc5_3_1_)    
    - [Contextual Precision](#toc5_3_2_)    
    - [Contextual Recall](#toc5_3_3_)    
- [LLM Agent Evaluation](#toc6_)    
  - [Create test cases](#toc6_1_)    
  - [Build dataset](#toc6_2_)    
  - [Evaluation metrics](#toc6_3_)    
    - [Faithfulness](#toc6_3_1_)    
    - [Hallucination](#toc6_3_2_)    
    - [Summarization](#toc6_3_3_)    
    - [Task Completion](#toc6_3_4_)      
- [In summary](#toc10_)    
- [Next steps](#toc11_)    



<a id="toc1_"></a>

## Introduction

Large Language Model (LLM) evaluation is critical for understanding model performance across different tasks and scenarios. This notebook demonstrates how to integrate DeepEval's comprehensive evaluation framework with ValidMind's testing infrastructure to create a robust LLM evaluation pipeline.

DeepEval provides over 30 evaluation metrics specifically designed for LLMs, covering scenarios from simple Q&A to complex agent interactions. By integrating with ValidMind, you can leverage these metrics within a structured testing framework that supports documentation, collaboration, and compliance requirements.


<a id="toc2_"></a>

## About DeepEval Integration

DeepEval is a comprehensive evaluation framework for LLMs that provides metrics for various scenarios including hallucination detection, answer relevancy, faithfulness, and custom evaluation criteria. ValidMind is a platform for managing model risk and documentation through automated testing.

Together, these tools enable comprehensive LLM evaluation within a structured, compliant framework.


<a id="toc2_1_"></a>

### Before you begin

This notebook assumes you have basic familiarity with Python and Large Language Models. You'll need:

- Python 3.8 or higher
- Access to OpenAI API (for DeepEval metrics evaluation)
- ValidMind account and model registration

If you encounter errors due to missing modules, install them with `pip install` and re-run the notebook.
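
DeepEval's LLM-judged metrics call OpenAI by default and read the key from the `OPENAI_API_KEY` environment variable. The following is a minimal sketch for failing early when the key is missing; the helper name `missing_env_vars` is hypothetical and not part of either library.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(names: Iterable[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or blank in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


# Hypothetical pre-flight check to run before any metric cell
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Set these environment variables before running the metric cells: {missing}")
```

Passing `env` explicitly keeps the helper easy to test without touching the real environment.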


<a id="toc2_2_"></a>

### Key concepts

**LLMTestCase**: A DeepEval object that represents a single test case with input, expected output, actual output, and optional context.

**LLMAgentDataset**: A ValidMind dataset class that bridges DeepEval test cases with ValidMind's testing infrastructure.

**RAG Evaluation**: Testing retrieval-augmented generation systems that combine document retrieval with generation.

**Agent Evaluation**: Testing LLM agents that can use tools and perform multi-step reasoning.
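
To make the bridge between test cases and datasets concrete, the sketch below models a test case as a plain dataclass and flattens a list of them into one row per case, which is roughly the tabular shape a dataset needs. It is a standard-library illustration only: `SketchTestCase` and `to_rows` are hypothetical names, and this is not how `LLMTestCase` or `LLMAgentDataset` are implemented.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchTestCase:
    """Hypothetical stand-in for a test case: one interaction to evaluate."""
    input: str
    actual_output: str
    expected_output: Optional[str] = None
    retrieval_context: List[str] = field(default_factory=list)


def to_rows(cases: List[SketchTestCase]) -> List[Dict[str, Any]]:
    """Flatten test cases into one dict per case (one row per interaction)."""
    return [asdict(case) for case in cases]


rows = to_rows([
    SketchTestCase(
        input="What is machine learning?",
        actual_output="A subset of AI that learns patterns from data.",
        retrieval_context=["Machine learning is a branch of AI."],
    )
])
print(sorted(rows[0]))  # field names become the columns of the table
```

Each field becomes a column, so optional fields that were not supplied show up as empty values rather than missing columns.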


<a id="toc3_"></a>

## Setting up


<a id="toc3_1_"></a>

### Install required packages

First, let's install the required packages and set up our environment.


In [None]:
%pip install -q validmind deepeval

<a id="toc3_2_"></a>

### Initialize ValidMind

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>For access to all features available in this notebook, you'll need access to a ValidMind account.</b></span>
<br></br>
<a href="https://docs.validmind.ai/guide/configuration/register-with-validmind.html" style="color: #DE257E;"><b>Register with ValidMind</b></a></div>


In [None]:
# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env

# Or replace with your code snippet
import validmind as vm

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    model="...",
)

In [None]:
# Core imports
import pandas as pd
import warnings
from deepeval.test_case import LLMTestCase, ToolCall
from validmind.datasets.llm import LLMAgentDataset

warnings.filterwarnings('ignore')

<a id="toc4_"></a>

## Basic Usage - Simple Q&A Evaluation

Let's start with the simplest use case: evaluating a basic question-and-answer interaction with an LLM. This demonstrates how to create LLMTestCase objects and integrate them with ValidMind's dataset infrastructure.


### Create a simple LLM test case

In [None]:
simple_test_cases = [
LLMTestCase(
    input="What is machine learning?",
    actual_output="""Machine learning is a subset of artificial intelligence (AI) that enables 
    computers to learn and make decisions from data without being explicitly programmed for every task. 
    It uses algorithms to find patterns in data and make predictions or decisions based on those patterns.""",
    expected_output="""Machine learning is a method of data analysis that automates analytical 
    model building. It uses algorithms that iteratively learn from data, allowing computers to find 
    hidden insights without being explicitly programmed where to look.""",
    context=["Machine learning is a branch of AI that focuses on algorithms that can learn from data."],
    retrieval_context=["Machine learning is a branch of AI that focuses on algorithms that can learn from data."],
    tools_called=[
        ToolCall(
            name="search_docs",
            input_parameters={"query": "machine learning definition"},
            output="Found definition of machine learning in documentation."
        )
    ]
),
LLMTestCase(
    input="What is deep learning?",
    actual_output="""Bananas are yellow fruits that grow on trees in tropical climates. 
    They are rich in potassium and make a great healthy snack. You can also use them 
    in smoothies and baking.""",
    expected_output="""Deep learning is an advanced machine learning technique that uses neural networks
    with many layers to automatically learn representations of data with multiple levels of abstraction.
    It has enabled major breakthroughs in AI applications.""",
    context=["Deep learning is a specialized machine learning approach that uses deep neural networks to learn from data."],
    retrieval_context=["Deep learning is a specialized machine learning approach that uses deep neural networks to learn from data."],
    tools_called=[
        ToolCall(
            name="search_docs",
            input_parameters={"query": "deep learning definition"},
            output="Found definition of deep learning in documentation."
        )
    ]
)]


### Create LLMAgentDataset from the test cases
Let's create a ValidMind dataset from DeepEval's test cases.

In [None]:
print("\nCreating ValidMind dataset...")

simple_dataset = LLMAgentDataset.from_test_cases(
    test_cases=simple_test_cases,
    input_id="simple_qa_dataset"
)


# Display the dataset without truncating long text columns
pd.set_option('display.width', 120)
pd.set_option('display.max_colwidth', None)
print("\nDataset preview:")
display(simple_dataset.df)

### Compute metrics using ValidMind scorer interface

Now we'll compute metrics on our dataset using ValidMind's scorer interface, which provides a standardized way to assess model outputs. The cells below score each row for answer relevancy and bias, so we can see how well each response addresses its question and whether it contains biased language.


In [None]:
simple_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.AnswerRelevancy",
    input_column = "input",
    actual_output_column = "actual_output",
)
simple_dataset._df.head()

In [None]:
simple_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.Bias",
    input_column = "input",
    actual_output_column = "actual_output",
)
simple_dataset._df.head()

<a id="toc5_"></a>

## RAG System Evaluation

Now let's evaluate a more complex use case: a Retrieval-Augmented Generation (RAG) system that retrieves relevant documents and generates responses based on them. RAG systems combine document retrieval with text generation, requiring specialized evaluation approaches.


<a id="toc5_1_"></a>

### Create test cases

In [None]:

print("Creating RAG evaluation test cases...")
rag_test_cases = [
    LLMTestCase(
        input="How do I return a product that doesn't fit?",
        actual_output="""You can return any product within 30 days of purchase for a full refund. 
        Simply visit our returns page on the website and follow the step-by-step instructions. 
        You'll need your order number and email address. No questions asked!""",
        expected_output="We offer a 30-day return policy for full refunds. Visit our returns page to start the process.",
        context=["Company policy allows 30-day returns for full refund with no restocking fees."],
        retrieval_context=[
            "Return Policy: All items can be returned within 30 days of purchase for a full refund.",
            "Return Process: Visit our website's returns page and enter your order details.",
            "Customer Service: Available 24/7 to help with returns and refunds.",
            "No restocking fees apply to returns within the 30-day window."
        ]
    ),
    LLMTestCase(
        input="What are your shipping options and costs?",
        actual_output="""We offer three shipping options: Standard (5-7 days, $5.99), 
        Express (2-3 days, $9.99), and Overnight (next day, $19.99). 
        Free shipping is available on orders over $50 with Standard delivery.""",
        expected_output="Multiple shipping options available with costs ranging from $5.99 to $19.99. Free shipping on orders over $50.",
        context=["Shipping information includes various speed and cost options."],
        retrieval_context=[
            "Standard Shipping: 5-7 business days, $5.99",
            "Express Shipping: 2-3 business days, $9.99", 
            "Overnight Shipping: Next business day, $19.99",
            "Free Standard Shipping on orders over $50"
        ]
    ),
    LLMTestCase(
        input="Do you have a warranty on electronics?",
        actual_output="""Yes, all electronics come with a manufacturer's warranty. 
        Most items have a 1-year warranty, while premium products may have 2-3 years. 
        We also offer extended warranty options for purchase.""",
        expected_output="Electronics include manufacturer warranty, typically 1-year, with extended options available.",
        context=["Electronics warranty information varies by product type and manufacturer."],
        retrieval_context=[
            "Electronics Warranty: Manufacturer warranty included with all electronic items",
            "Standard Coverage: 1 year for most electronics",
            "Premium Products: May include 2-3 year coverage",
            "Extended Warranty: Available for purchase at checkout"
        ]
    )
]

print(f"Created {len(rag_test_cases)} RAG test cases")

<a id="toc5_2_"></a>

### Build dataset

In this section, we'll convert our DeepEval `LLMTestCase` objects into a ValidMind dataset.
This lets us use ValidMind's scorers and documentation features while keeping
DeepEval's test case structure.

The dataset will contain:
- Input queries
- Actual model outputs 
- Expected outputs
- Context information
- Retrieved context passages

This structured format enables detailed analysis of the RAG system's performance
across multiple evaluation dimensions.


In [None]:
rag_dataset = LLMAgentDataset.from_test_cases(
    test_cases=rag_test_cases,
    input_id="rag_evaluation_dataset"
)

print(f"RAG Dataset: {rag_dataset}")
print(f"Shape: {rag_dataset.df.shape}")

# Show dataset structure
print("\nRAG Dataset Preview:")
display(rag_dataset.df[['input', 'actual_output', 'context', 'retrieval_context']].head())


<a id="toc5_3_"></a>

### Evaluation metrics

<a id="toc5_3_1_"></a>

#### Contextual Relevancy
The Contextual Relevancy metric evaluates how well the retrieved context aligns with the input query.
It measures whether the context contains the necessary information to answer the query accurately.
A high relevancy score indicates that the retrieved context is highly relevant and contains the key information needed.
This helps validate that the RAG system is retrieving appropriate context for the given queries.

<a id="toc5_3_2_"></a>

#### Contextual Precision
The Contextual Precision metric evaluates how well a RAG system ranks retrieved context nodes by relevance to the input query. 
It checks if the most relevant nodes are ranked at the top of the retrieval results.
A high precision score indicates that the retrieved context is highly relevant to the query and properly ranked.
This is particularly useful for evaluating RAG systems and ensuring they surface the most relevant information first.
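
As a worked example of why ranking matters, the sketch below computes a weighted cumulative precision from per-node relevance verdicts, assuming the formulation described in DeepEval's documentation (precision at rank k, summed over relevant ranks and divided by the number of relevant nodes). In the real metric an LLM judge produces the verdicts; here they are supplied by hand, and the function name is hypothetical.

```python
from typing import Sequence


def contextual_precision(relevance: Sequence[bool]) -> float:
    """Weighted cumulative precision over ranked nodes (True = relevant)."""
    relevant_so_far = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at rank k, counted only at ranks holding a relevant node
            total += relevant_so_far / k
    return total / relevant_so_far if relevant_so_far else 0.0


print(contextual_precision([True, True, False]))  # relevant nodes ranked first
print(contextual_precision([False, True, True]))  # same nodes, irrelevant one ranked first
```

Both rankings contain the same two relevant nodes, yet the second scores lower because an irrelevant node sits at the top.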

<a id="toc5_3_3_"></a>

#### Contextual Recall
The Contextual Recall metric evaluates how well the retrieved context covers all the information needed to generate the expected output.
It extracts statements from the expected output and checks how many of them can be attributed to the retrieved context.
A high recall score indicates that the retrieved context contains all the key information needed to generate the expected response.
This helps ensure the RAG system retrieves comprehensive context that covers all aspects of the expected answer.
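
The attribution step can be sketched as a simple ratio: the share of expected-output statements that some retrieved passage supports. This is an illustration under the assumption that verdicts are already available (the real metric asks an LLM judge for them); `contextual_recall` is a hypothetical helper, and the third statement is made up to show a gap.

```python
from typing import Dict, List, Tuple


def contextual_recall(verdicts: Dict[str, bool]) -> Tuple[float, List[str]]:
    """Return (recall, unsupported statements) from per-statement verdicts."""
    if not verdicts:
        return 0.0, []
    unsupported = [statement for statement, ok in verdicts.items() if not ok]
    return (len(verdicts) - len(unsupported)) / len(verdicts), unsupported


score, gaps = contextual_recall({
    "We offer a 30-day return policy for full refunds.": True,
    "Visit our returns page to start the process.": True,
    "Returns are free of charge.": False,  # made-up statement with no supporting passage
})
print(f"recall={score:.2f}, unsupported={gaps}")
```

Returning the unsupported statements alongside the score makes it easier to see which facts the retriever failed to surface.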

Now we'll evaluate the RAG system's performance using multiple metrics at once. The `assign_scores()` method accepts a list of metrics to evaluate different aspects of the system's behavior. The metrics will add score and reason columns to the dataset, providing quantitative and qualitative feedback on the system's performance. This multi-metric evaluation gives us comprehensive insights into the strengths and potential areas for improvement.
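
To illustrate the idea of adding score and reason columns per metric, here is a standard-library sketch of a multi-scorer pass over rows. Everything in it is an assumption for illustration: `assign_scores_sketch`, the toy `keyword_overlap` scorer, and the `<name>_score` / `<name>_reason` column pattern are hypothetical and do not describe ValidMind's actual implementation or column names.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Scorer = Callable[[Row], Dict[str, Any]]  # returns {"score": float, "reason": str}


def keyword_overlap(row: Row) -> Dict[str, Any]:
    """Toy scorer: share of input words that reappear in the output."""
    asked = set(row["input"].lower().split())
    answered = set(row["actual_output"].lower().split())
    shared = len(asked & answered)
    score = shared / len(asked) if asked else 0.0
    return {"score": score, "reason": f"{shared} of {len(asked)} input words reused"}


def assign_scores_sketch(rows: List[Row], scorers: Dict[str, Scorer]) -> List[Row]:
    """Return copies of rows with <name>_score and <name>_reason columns added."""
    scored = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for name, scorer in scorers.items():
            result = scorer(row)
            new_row[f"{name}_score"] = result["score"]
            new_row[f"{name}_reason"] = result["reason"]
        scored.append(new_row)
    return scored


rows = [{"input": "shipping options", "actual_output": "We offer three shipping options"}]
scored = assign_scores_sketch(rows, {"KeywordOverlap": keyword_overlap})
print(scored[0]["KeywordOverlap_score"], scored[0]["KeywordOverlap_reason"])
```

Keeping every scorer behind the same row-in, score-and-reason-out contract is what lets several metrics run in one call.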


In [None]:
rag_dataset.assign_scores(
    metrics = ["validmind.scorer.llm.deepeval.ContextualRelevancy",
               "validmind.scorer.llm.deepeval.ContextualPrecision",
               "validmind.scorer.llm.deepeval.ContextualRecall"],
    input_column = "input",
    expected_output_column = "expected_output",
    retrieval_context_column = "retrieval_context",
)
display(rag_dataset._df.head(2))

<a id="toc6_"></a>

## LLM Agent Evaluation

Let's evaluate LLM agents that can use tools to accomplish tasks. This is one of the most advanced evaluation scenarios, requiring assessment of both response quality and tool usage appropriateness.
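
One simple way to think about tool usage appropriateness is to compare the names of the tools an agent called with the names it was expected to call. The sketch below does only that name-level comparison; it is a hypothetical helper, not DeepEval's tool-correctness metric, and it ignores arguments and outputs. The tool names match those used in the test cases in this section.

```python
from typing import List


def tool_name_match(called: List[str], expected: List[str]) -> float:
    """Fraction of expected tool names that appear among the called tools."""
    if not expected:
        return 1.0  # nothing was required, so nothing is missing
    called_set = set(called)
    return sum(name in called_set for name in expected) / len(expected)


print(tool_name_match(["WeatherAPI"], ["WeatherAPI"]))                # all expected tools used
print(tool_name_match(["Calculator"], ["WeatherAPI", "Calculator"]))  # one expected tool missing
```

A fuller check would also compare input parameters, which is why the test cases record both `tools_called` and `expected_tools`.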

<a id="toc6_1_"></a>

### Create test cases


In [None]:
# Create LLM Agent test cases with tool usage
print("Creating Agent evaluation test cases...")

# Create test cases
agent_test_cases = [
    LLMTestCase(
        input="What's the weather like in New York City today?",
        actual_output="""Based on current weather data, New York City is experiencing partly cloudy skies 
        with a temperature of 72°F (22°C). The humidity is at 60% and there's a light breeze from the west at 8 mph. 
        No precipitation is expected today.""",
        expected_output="Current weather in New York shows mild temperatures with partly cloudy conditions.",
        tools_called=[
            ToolCall(
                name="WeatherAPI",
                description="Fetches current weather information for a specified location",
                input_parameters={"city": "New York City", "units": "fahrenheit", "include_forecast": False},
                output={
                    "temperature": 72,
                    "condition": "partly_cloudy", 
                    "humidity": 60,
                    "wind_speed": 8,
                    "wind_direction": "west"
                },
                reasoning="User asked for current weather in NYC, so I need to call the weather API"
            )
        ],
        expected_tools=[
            ToolCall(
                name="WeatherAPI",
                description="Should fetch weather information for New York City",
                input_parameters={"city": "New York City"}
            )
        ],
        retrieval_context=[
            "Temperature: 72°F, Condition: Partly Cloudy, Humidity: 60%, Wind: 8mph from west",
            "No precipitation in forecast for today",
            "Historical average temperature for this date: 70°F"
        ]
    ),
    LLMTestCase(
        input="Calculate the compound interest on $1000 invested at 5% annually for 3 years",
        actual_output="""Using the compound interest formula A = P(1 + r)^t, where P = $1000, r = 0.05, and t = 3:
        A = 1000(1 + 0.05)^3 = 1000(1.05)^3 = 1000 × 1.157625 = $1,157.63
        The investment will grow to $1,157.63, earning $157.63 in compound interest.""",
        expected_output="The compound interest calculation shows the investment grows to approximately $1,157.63.",
        tools_called=[
            ToolCall(
                name="Calculator",
                description="Performs mathematical calculations",
                input_parameters={"expression": "1000 * (1.05 ** 3)", "precision": 2},
                output=1157.63,
                reasoning="Need to calculate compound interest using the standard formula"
            )
        ],
        expected_tools=[
            ToolCall(
                name="Calculator", 
                description="Should perform compound interest calculation",
                input_parameters={"calculation_type": "compound_interest"}
            )
        ],
        retrieval_context=[
            "Calculation result: $1,157.63",
            "Formula used: A = P(1 + r)^t",
            "Parameters: Principal=$1000, Rate=5%, Time=3 years"
        ]
    ),
    LLMTestCase(
        input="Send an email to john@example.com about our meeting tomorrow at 2 PM",
        actual_output="""I've sent an email to john@example.com with the subject "Meeting Reminder - Tomorrow at 2 PM". 
        The email confirms your meeting scheduled for tomorrow at 2:00 PM and asks John to confirm his attendance. 
        The email was successfully delivered to the recipient.""",
        expected_output="Email sent successfully to john@example.com about the 2 PM meeting tomorrow.",
        tools_called=[
            ToolCall(
                name="EmailSender",
                description="Sends emails to specified recipients",
                input_parameters={
                    "to": "john@example.com",
                    "subject": "Meeting Reminder - Tomorrow at 2 PM", 
                    "body": "Hi John,\n\nThis is a reminder about our meeting scheduled for tomorrow at 2:00 PM. Please confirm your attendance.\n\nBest regards"
                },
                output={"status": "sent", "message_id": "msg_12345", "timestamp": "2024-01-15T10:30:00Z"},
                reasoning="User requested to send email, so I need to use the email tool with appropriate content"
            )
        ],
        expected_tools=[
            ToolCall(
                name="EmailSender",
                description="Should send an email about the meeting",
                input_parameters={"recipient": "john@example.com"}
            )
        ],
        retrieval_context=[
            "Email sent successfully (msg_12345)",
            "Recipient: john@example.com",
            "Subject: Meeting Reminder - Tomorrow at 2 PM",
            "Timestamp: 2024-01-15T10:30:00Z"
        ]
    )
]
print(f"Created {len(agent_test_cases)} Agent test cases")

<a id="toc6_2_"></a>

### Build dataset


In [None]:
# Create Agent dataset
agent_dataset = LLMAgentDataset.from_test_cases(
    test_cases=agent_test_cases,
    input_id="agent_evaluation_dataset"
)

print(f"Agent Dataset: {agent_dataset}")
print(f"Shape: {agent_dataset.df.shape}")

# Analyze tool usage
tool_usage = {}
for case in agent_test_cases:
    if case.tools_called:
        for tool in case.tools_called:
            tool_usage[tool.name] = tool_usage.get(tool.name, 0) + 1

print("\nTool Usage Analysis:")
for tool, count in tool_usage.items():
    print(f"  - {tool}: {count} times")

print("\nAgent Dataset Preview:")
display(agent_dataset.df[['input', 'actual_output', 'tools_called']].head())

<a id="toc6_3_"></a>

### Evaluation metrics

<a id="toc6_3_1_"></a>

#### Faithfulness
The Faithfulness metric evaluates whether the model's output contains any contradictions or hallucinations compared to the provided context. It ensures that the model's response is grounded in and consistent with the given information, rather than making up facts or contradicting the context. A high faithfulness score indicates that the model's output aligns well with the source material.


In [None]:
agent_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.Faithfulness",
    input_column = "input",
    actual_output_column = "actual_output",
    retrieval_context_column = "retrieval_context",
)
agent_dataset._df.head()

<a id="toc6_3_2_"></a>

#### Hallucination
The Hallucination metric evaluates whether the model's output contains information that is not supported by or contradicts the provided context. It helps identify cases where the model makes up facts or includes details that aren't grounded in the source material. A low hallucination score indicates that the model's response stays faithful to the given context without introducing unsupported information.
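
Because this metric runs in the opposite direction from most others, a small worked example may help. The sketch assumes the score is the share of context passages that the output contradicts, as DeepEval's documentation describes, with verdicts supplied by hand instead of an LLM judge. The function name and the 0.5 threshold are hypothetical.

```python
from typing import Sequence


def hallucination_score(contradicted: Sequence[bool]) -> float:
    """Share of context passages the output contradicts (0.0 is best)."""
    if not contradicted:
        return 0.0
    return sum(contradicted) / len(contradicted)


# One verdict per context passage: True means the output contradicts it
score = hallucination_score([False, True, False])
threshold = 0.5  # hypothetical maximum acceptable score
print(f"score={score:.2f}, passes={score <= threshold}")
```

Note the comparison: a row passes when its score is at or below the threshold, not above it.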


In [None]:
agent_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.Hallucination",
    input_column = "input",
    actual_output_column = "actual_output",
    context_column = "retrieval_context",
)
agent_dataset._df.head()

<a id="toc6_3_3_"></a>

#### Summarization
The Summarization metric evaluates how well a model's output summarizes the given context by generating assessment questions to check if the summary is factually aligned with and sufficiently covers the source text. It helps ensure that summaries are accurate, complete, and maintain the key information from the original content without introducing unsupported details or omitting critical points. A high summarization score indicates that the model effectively condenses the source material while preserving its essential meaning.
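
The two checks described above, factual alignment and coverage, can be combined as in the sketch below. It assumes the final score is the minimum of an alignment score and a coverage score, as DeepEval's documentation describes; the verdict lists are supplied by hand here, and the function names are hypothetical.

```python
from typing import Sequence


def _fraction(flags: Sequence[bool]) -> float:
    """Share of True values, or 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def summarization_score(claims_supported: Sequence[bool],
                        questions_covered: Sequence[bool]) -> float:
    """Minimum of alignment (claims backed by the source) and coverage
    (assessment questions the summary answers the same way as the source)."""
    alignment = _fraction(claims_supported)
    coverage = _fraction(questions_covered)
    return min(alignment, coverage)


# Every claim is accurate, but the summary covers only one of two questions
print(summarization_score([True, True, True, True], [True, False]))
```

Taking the minimum means a summary cannot compensate for missing content by being accurate about the little it does say, or the reverse.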


In [None]:
agent_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.Summarization",
    input_column = "input",
    actual_output_column = "actual_output",
)
agent_dataset._df.head()

<a id="toc6_3_4_"></a>

#### Task Completion
The Task Completion metric evaluates whether the model's output successfully accomplishes the intended task or goal specified in the input prompt. It assesses if the model has properly understood the task requirements and provided a complete and appropriate response. A high task completion score indicates that the model has effectively addressed the core objective of the prompt and delivered a satisfactory solution.


In [None]:
agent_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.TaskCompletion",
    input_column = "input",
    actual_output_column = "actual_output",
    agent_output_column = "agent_output",
    tools_called_column = "tools_called",
)
agent_dataset._df.head()

<a id="toc10_"></a>

## In summary

This notebook demonstrated the comprehensive integration between DeepEval and ValidMind for LLM evaluation:

**Key Achievements:**
- Successfully created and evaluated different types of LLM test cases (Q&A, RAG, Agents)
- Integrated DeepEval metrics with ValidMind's testing infrastructure
- Showed how to handle complex agent scenarios with tool usage

**Integration Benefits:**
- **Comprehensive Coverage**: Evaluate LLMs across 30+ specialized metrics
- **Structured Documentation**: Leverage ValidMind's compliance and documentation features
- **Flexibility**: Support for custom metrics and domain-specific evaluation criteria
- **Production Ready**: Handle real-world LLM evaluation scenarios at scale

The `LLMAgentDataset` class provides a seamless bridge between DeepEval's evaluation capabilities and ValidMind's testing infrastructure, enabling robust LLM evaluation within a structured, compliant framework.


<a id="toc11_"></a>

## Next steps

**Explore Advanced Features:**
- **Continuous Evaluation**: Set up automated LLM evaluation pipelines
- **A/B Testing**: Compare different LLM models and configurations
- **Metrics Customization**: Create domain-specific evaluation criteria
- **Integration Patterns**: Embed evaluation into your LLM development workflow

**Additional Resources:**
- [ValidMind Library Documentation](https://docs.validmind.ai/developer/validmind-library.html) - Complete API reference and tutorials

**Try These Examples:**
- Implement custom business-specific evaluation metrics
- Create automated evaluation pipelines for model deployment
- Integrate with your existing ML infrastructure and workflows
- Explore multi-modal evaluation scenarios (text, code, images)

Start building comprehensive LLM evaluation workflows that combine the power of DeepEval's specialized metrics with ValidMind's structured testing and documentation framework.
