# DeepEval Integration with ValidMind

Learn how to integrate [DeepEval](https://github.com/confident-ai/deepeval) with the ValidMind Library to evaluate Large Language Models (LLMs) and AI agents. This notebook demonstrates the complete integration through the new `LLMAgentDataset` class, enabling you to leverage DeepEval's 30+ evaluation metrics within ValidMind's testing infrastructure.

To integrate DeepEval with ValidMind, we'll:

1. Set up both frameworks and install required dependencies
2. Create and evaluate LLM test cases for different scenarios
3. Work with RAG systems and agent evaluations
4. Use Golden templates for standardized testing
5. Integrate everything with ValidMind's testing framework
6. Create custom evaluation metrics with G-Eval
7. Apply production-ready evaluation patterns


## Contents    
- [Introduction](#toc1_)    
- [About DeepEval Integration](#toc2_)    
  - [Before you begin](#toc2_1_)    
  - [Key concepts](#toc2_2_)    
- [Setting up](#toc3_)    
  - [Install required packages](#toc3_1_)    
  - [Initialize ValidMind](#toc3_2_)    
- [Basic Usage - Simple Q&A Evaluation](#toc4_)    
- [RAG System Evaluation](#toc5_)    
- [LLM Agent Evaluation](#toc6_)    
- [Working with Golden Templates](#toc7_)    
- [ValidMind Integration](#toc8_)    
- [Custom Metrics with G-Eval](#toc9_)    
- [In summary](#toc10_)    
- [Next steps](#toc11_)    



<a id="toc1_"></a>

## Introduction

Large Language Model (LLM) evaluation is critical for understanding model performance across different tasks and scenarios. This notebook demonstrates how to integrate DeepEval's comprehensive evaluation framework with ValidMind's testing infrastructure to create a robust LLM evaluation pipeline.

DeepEval provides over 30 evaluation metrics specifically designed for LLMs, covering scenarios from simple Q&A to complex agent interactions. By integrating with ValidMind, you can leverage these metrics within a structured testing framework that supports documentation, collaboration, and compliance requirements.


<a id="toc2_"></a>

## About DeepEval Integration

DeepEval is a comprehensive evaluation framework for LLMs that provides metrics for various scenarios including hallucination detection, answer relevancy, faithfulness, and custom evaluation criteria. ValidMind is a platform for managing model risk and documentation through automated testing.

Together, these tools enable comprehensive LLM evaluation within a structured, compliant framework.


<a id="toc2_1_"></a>

### Before you begin

This notebook assumes you have basic familiarity with Python and Large Language Models. You'll need:

- Python 3.8 or higher
- Access to the OpenAI API (for DeepEval metrics evaluation)
- A ValidMind account and a registered model

If you encounter errors due to missing modules, install them with `pip install` and re-run the notebook.
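
To see which packages are missing before running the rest of the notebook, you can probe for them without importing anything. The sketch below uses only the standard library; the package list is an assumption based on the imports used in this notebook, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list, taken from the imports used later in this notebook
required = ["validmind", "deepeval", "pandas", "dotenv"]
missing = missing_packages(required)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

Anything reported as missing can then be installed with `pip install` before re-running.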


<a id="toc2_2_"></a>

### Key concepts

**LLMTestCase**: A DeepEval object that represents a single test case with input, expected output, actual output, and optional context.

**Golden Templates**: Pre-defined test templates with inputs and expected outputs that can be converted to test cases by generating actual outputs.

**G-Eval**: Generative evaluation using LLMs to assess response quality based on custom criteria.

**LLMAgentDataset**: A ValidMind dataset class that bridges DeepEval test cases with ValidMind's testing infrastructure.

**RAG Evaluation**: Testing retrieval-augmented generation systems that combine document retrieval with generation.

**Agent Evaluation**: Testing LLM agents that can use tools and perform multi-step reasoning.
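
To make the relationship between these concepts concrete, the sketch below models a test case as a plain dataclass and flattens it into one dictionary per row, which is roughly the shape a tabular dataset needs. `SketchTestCase` and `to_row` are hypothetical stand-ins for illustration only; they are not DeepEval or ValidMind classes, and the actual column layout of `LLMAgentDataset` may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SketchTestCase:
    """Hypothetical stand-in mirroring the test case fields described above."""
    input: str
    actual_output: Optional[str] = None
    expected_output: Optional[str] = None
    context: List[str] = field(default_factory=list)
    retrieval_context: List[str] = field(default_factory=list)


def to_row(case: SketchTestCase) -> dict:
    """Flatten one test case into a dict that can serve as a table row."""
    return asdict(case)


rows = [
    to_row(
        SketchTestCase(
            input="What is machine learning?",
            actual_output="A subset of AI that learns from data.",
            context=["ML is a branch of AI."],
        )
    )
]
print(rows[0]["input"], "|", rows[0]["context"])
```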


<a id="toc3_"></a>

## Setting up


<a id="toc3_1_"></a>

### Install required packages

First, let's install the required packages and set up our environment.


In [None]:
%pip install -q validmind deepeval

<a id="toc3_2_"></a>

### Initialize ValidMind

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>For access to all features available in this notebook, you'll need access to a ValidMind account.</b></span>
<br></br>
<a href="https://docs.validmind.ai/guide/configuration/register-with-validmind.html" style="color: #DE257E;"><b>Register with ValidMind</b></a></div>
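
If you keep credentials in a `.env` file, it can help to confirm they are present before calling `vm.init()`. The sketch below is a minimal check using only the standard library. The variable names `VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, and `VM_API_MODEL` are assumptions about how the `.env` file is laid out, so adjust them to match your own file; `missing_credentials` is a hypothetical helper.

```python
import os

# Assumed variable names; change these to match your own `.env` file
REQUIRED_VARS = ["VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_API_MODEL"]


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials()
if missing:
    print("Missing credentials:", ", ".join(missing))
else:
    print("All credentials found")
```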


In [None]:
# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env

# Or replace with your code snippet
import validmind as vm

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    model="...",
)

In [None]:
# Core imports
import pandas as pd
import warnings
from deepeval.test_case import LLMTestCase, ToolCall, LLMTestCaseParams
from deepeval.dataset import Golden
from deepeval.metrics import GEval
from validmind.datasets.llm import LLMAgentDataset

warnings.filterwarnings('ignore')

<a id="toc4_"></a>

## Basic Usage - Simple Q&A Evaluation

Let's start with the simplest use case: evaluating a basic question-and-answer interaction with an LLM. This demonstrates how to create LLMTestCase objects and integrate them with ValidMind's dataset infrastructure.


### Create a simple LLM test case

In [None]:
simple_test_cases = [
    LLMTestCase(
        input="What is machine learning?",
        actual_output="""Machine learning is a subset of artificial intelligence (AI) that enables 
        computers to learn and make decisions from data without being explicitly programmed for every task. 
        It uses algorithms to find patterns in data and make predictions or decisions based on those patterns.""",
        expected_output="""Machine learning is a method of data analysis that automates analytical 
        model building. It uses algorithms that iteratively learn from data, allowing computers to find 
        hidden insights without being explicitly programmed where to look.""",
        context=["Machine learning is a branch of AI that focuses on algorithms that can learn from data."],
        retrieval_context=["Machine learning is a branch of AI that focuses on algorithms that can learn from data."]
    ),
    LLMTestCase(
        input="What is deep learning?",
        actual_output="""Bananas are yellow fruits that grow on trees in tropical climates. 
        They are rich in potassium and make a great healthy snack. You can also use them 
        in smoothies and baking.""",
        expected_output="""Deep learning is an advanced machine learning technique that uses neural networks
        with many layers to automatically learn representations of data with multiple levels of abstraction.
        It has enabled major breakthroughs in AI applications.""",
        context=["Deep learning is a specialized machine learning approach that uses deep neural networks to learn from data."],
        retrieval_context=["Deep learning is a specialized machine learning approach that uses deep neural networks to learn from data."]
    )
]





### Create LLMAgentDataset from the test cases

Let's create a ValidMind dataset from DeepEval's test cases.

In [None]:
print("\nCreating ValidMind dataset...")

simple_dataset = LLMAgentDataset.from_test_cases(
    test_cases=simple_test_cases,
    input_id="simple_qa_dataset"
)

# Display the dataset
print("\nDataset preview:")
display(simple_dataset.df)


### Compute metrics using ValidMind scorer interface

Next, we'll score the dataset through ValidMind's scorer interface, which provides a standardized way to run a metric over the test cases and add the results to the dataset. Here we apply DeepEval's answer relevancy metric, which assesses how well each actual output addresses its input.
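
The idea behind an answer relevancy score can be illustrated without an LLM. The sketch below is a toy stand-in, not how DeepEval computes the metric: it splits an answer into sentences and counts a sentence as relevant when it shares at least one content word with the question, whereas LLM-based scorers rely on a model to judge relevance. The function name and stopword list are hypothetical.

```python
import re


def toy_answer_relevancy(question: str, answer: str) -> float:
    """Fraction of answer sentences sharing at least one content word with the question.

    Toy stand-in: a real scorer would use an LLM judge instead of word overlap.
    """
    stopwords = {"what", "is", "the", "a", "an", "of", "to", "and", "in", "that", "are", "on"}
    question_words = set(re.findall(r"[a-z]+", question.lower())) - stopwords
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    relevant = sum(
        1 for s in sentences
        if question_words & set(re.findall(r"[a-z]+", s.lower()))
    )
    return relevant / len(sentences)


score = toy_answer_relevancy(
    "What is deep learning?",
    "Deep learning uses neural networks. Bananas are yellow.",
)
print(f"Toy relevancy: {score:.2f}")
```

An off-topic answer, like the banana response in the second test case above, would score low under this kind of measure.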


In [None]:
simple_dataset.assign_scores(metrics="validmind.scorer.llm.deepeval.AnswerRelevancy")

In [None]:
simple_dataset.df.head()

<a id="toc5_"></a>

## RAG System Evaluation

Now let's evaluate a more complex use case: a Retrieval-Augmented Generation (RAG) system that retrieves relevant documents and generates responses based on them. RAG systems combine document retrieval with text generation, requiring specialized evaluation approaches.


### Create multiple RAG test cases

In [None]:

print("Creating RAG evaluation test cases...")
rag_test_cases = [
    LLMTestCase(
        input="How do I return a product that doesn't fit?",
        actual_output="""You can return any product within 30 days of purchase for a full refund. 
        Simply visit our returns page on the website and follow the step-by-step instructions. 
        You'll need your order number and email address. No questions asked!""",
        expected_output="We offer a 30-day return policy for full refunds. Visit our returns page to start the process.",
        context=["Company policy allows 30-day returns for full refund with no restocking fees."],
        retrieval_context=[
            "Return Policy: All items can be returned within 30 days of purchase for a full refund.",
            "Return Process: Visit our website's returns page and enter your order details.",
            "Customer Service: Available 24/7 to help with returns and refunds.",
            "No restocking fees apply to returns within the 30-day window."
        ]
    ),
    LLMTestCase(
        input="What are your shipping options and costs?",
        actual_output="""We offer three shipping options: Standard (5-7 days, $5.99), 
        Express (2-3 days, $9.99), and Overnight (next day, $19.99). 
        Free shipping is available on orders over $50 with Standard delivery.""",
        expected_output="Multiple shipping options available with costs ranging from $5.99 to $19.99. Free shipping on orders over $50.",
        context=["Shipping information includes various speed and cost options."],
        retrieval_context=[
            "Standard Shipping: 5-7 business days, $5.99",
            "Express Shipping: 2-3 business days, $9.99", 
            "Overnight Shipping: Next business day, $19.99",
            "Free Standard Shipping on orders over $50"
        ]
    ),
    LLMTestCase(
        input="Do you have a warranty on electronics?",
        actual_output="""Yes, all electronics come with a manufacturer's warranty. 
        Most items have a 1-year warranty, while premium products may have 2-3 years. 
        We also offer extended warranty options for purchase.""",
        expected_output="Electronics include manufacturer warranty, typically 1-year, with extended options available.",
        context=["Electronics warranty information varies by product type and manufacturer."],
        retrieval_context=[
            "Electronics Warranty: Manufacturer warranty included with all electronic items",
            "Standard Coverage: 1 year for most electronics",
            "Premium Products: May include 2-3 year coverage",
            "Extended Warranty: Available for purchase at checkout"
        ]
    )
]

print(f"Created {len(rag_test_cases)} RAG test cases")



### Convert RAG test cases to a ValidMind dataset

In this section, we'll convert our DeepEval `LLMTestCase` objects into a ValidMind dataset. This lets us use ValidMind's evaluation capabilities while keeping DeepEval's test case structure.

The dataset will contain:
- Input queries
- Actual model outputs
- Expected outputs
- Context information
- Retrieved context passages

This structured format enables analysis of the RAG system's performance across multiple evaluation dimensions.
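
Because retrieval-focused metrics depend on the retrieved passages being present, it can be worth checking the fields listed above before scoring. The sketch below works on plain dictionaries and uses only the standard library; `REQUIRED_RAG_FIELDS` and `incomplete_rag_rows` are hypothetical names, and which fields a given scorer actually requires is an assumption to verify against its documentation.

```python
# Assumed minimum set of fields for a retrieval-focused metric
REQUIRED_RAG_FIELDS = ("input", "actual_output", "retrieval_context")


def incomplete_rag_rows(rows):
    """Return (index, missing_fields) pairs for rows lacking any required field."""
    problems = []
    for index, row in enumerate(rows):
        # A field counts as missing when it is absent, None, or empty
        missing = [name for name in REQUIRED_RAG_FIELDS if not row.get(name)]
        if missing:
            problems.append((index, missing))
    return problems


rows = [
    {
        "input": "How do I return a product?",
        "actual_output": "Within 30 days.",
        "retrieval_context": ["Return Policy: 30 days."],
    },
    {
        "input": "Shipping options?",
        "actual_output": "Standard or Express.",
        "retrieval_context": [],
    },
]
problems = incomplete_rag_rows(rows)
print(problems)
```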


In [None]:
rag_dataset = LLMAgentDataset.from_test_cases(
    test_cases=rag_test_cases,
    input_id="rag_evaluation_dataset"
)

print(f"RAG Dataset: {rag_dataset}")
print(f"Shape: {rag_dataset.df.shape}")

# Show dataset structure
print("\nRAG Dataset Preview:")
display(rag_dataset.df[['input', 'actual_output', 'context', 'retrieval_context']].head())


In [None]:
rag_dataset.assign_scores(metrics="validmind.scorer.llm.deepeval.ContextualRelevancy")
# Display the dataset
print("\nDataset preview:")
display(rag_dataset.df)

<a id="toc6_"></a>

## LLM Agent Evaluation

Let's evaluate LLM agents that can use tools to accomplish tasks. This is one of the most advanced evaluation scenarios, requiring assessment of both response quality and tool usage appropriateness.
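
One simple way to think about tool usage appropriateness is to compare the names of the tools an agent called against the names it was expected to call. The sketch below is a minimal, name-only comparison written for illustration; `tool_name_match` is a hypothetical helper, not a DeepEval metric, and it ignores input parameters, ordering, and outputs.

```python
def tool_name_match(called, expected):
    """Fraction of expected tool names that appear among the called tool names."""
    expected_names = set(expected)
    if not expected_names:
        return 1.0  # nothing was expected, so nothing can be missing
    return len(expected_names & set(called)) / len(expected_names)


full_match = tool_name_match(["WeatherAPI"], ["WeatherAPI"])
partial_match = tool_name_match(["Calculator"], ["Calculator", "EmailSender"])
print(full_match, partial_match)
```

A fuller comparison would also check the parameters passed to each tool, which is why the test cases below record `input_parameters` for both called and expected tools.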


In [None]:
# Create LLM Agent test cases with tool usage
print("Creating Agent evaluation test cases...")

agent_test_cases = [
    LLMTestCase(
        input="What's the weather like in New York City today?",
        actual_output="""Based on current weather data, New York City is experiencing partly cloudy skies 
        with a temperature of 72°F (22°C). The humidity is at 60% and there's a light breeze from the west at 8 mph. 
        No precipitation is expected today.""",
        expected_output="Current weather in New York shows mild temperatures with partly cloudy conditions.",
        tools_called=[
            ToolCall(
                name="WeatherAPI",
                description="Fetches current weather information for a specified location",
                input_parameters={"city": "New York City", "units": "fahrenheit", "include_forecast": False},
                output={
                    "temperature": 72,
                    "condition": "partly_cloudy", 
                    "humidity": 60,
                    "wind_speed": 8,
                    "wind_direction": "west"
                },
                reasoning="User asked for current weather in NYC, so I need to call the weather API"
            )
        ],
        expected_tools=[
            ToolCall(
                name="WeatherAPI",
                description="Should fetch weather information for New York City",
                input_parameters={"city": "New York City"}
            )
        ]
    ),
    LLMTestCase(
        input="Calculate the compound interest on $1000 invested at 5% annually for 3 years",
        actual_output="""Using the compound interest formula A = P(1 + r)^t, where P = $1000, r = 0.05, and t = 3:
        A = 1000(1 + 0.05)^3 = 1000(1.05)^3 = 1000 × 1.157625 = $1,157.63
        The investment will grow to $1,157.63, earning $157.63 in compound interest.""",
        expected_output="The compound interest calculation shows the investment grows to approximately $1,157.63.",
        tools_called=[
            ToolCall(
                name="Calculator",
                description="Performs mathematical calculations",
                input_parameters={"expression": "1000 * (1.05 ** 3)", "precision": 2},
                output=1157.63,
                reasoning="Need to calculate compound interest using the standard formula"
            )
        ],
        expected_tools=[
            ToolCall(
                name="Calculator",
                description="Should perform compound interest calculation",
                input_parameters={"calculation_type": "compound_interest"}
            )
        ]
    ),
    LLMTestCase(
        input="Send an email to john@example.com about our meeting tomorrow at 2 PM",
        actual_output="""I've sent an email to john@example.com with the subject "Meeting Reminder - Tomorrow at 2 PM". 
        The email confirms your meeting scheduled for tomorrow at 2:00 PM and asks John to confirm his attendance. 
        The email was successfully delivered to the recipient.""",
        expected_output="Email sent successfully to john@example.com about the 2 PM meeting tomorrow.",
        tools_called=[
            ToolCall(
                name="EmailSender",
                description="Sends emails to specified recipients",
                input_parameters={
                    "to": "john@example.com",
                    "subject": "Meeting Reminder - Tomorrow at 2 PM", 
                    "body": "Hi John,\n\nThis is a reminder about our meeting scheduled for tomorrow at 2:00 PM. Please confirm your attendance.\n\nBest regards"
                },
                output={"status": "sent", "message_id": "msg_12345", "timestamp": "2024-01-15T10:30:00Z"},
                reasoning="User requested to send email, so I need to use the email tool with appropriate content"
            )
        ],
        expected_tools=[
            ToolCall(
                name="EmailSender",
                description="Should send an email about the meeting",
                input_parameters={"recipient": "john@example.com"}
            )
        ]
    )
]

print(f"Created {len(agent_test_cases)} Agent test cases")

# Create Agent dataset
agent_dataset = LLMAgentDataset.from_test_cases(
    test_cases=agent_test_cases,
    input_id="agent_evaluation_dataset"
)

print(f"Agent Dataset: {agent_dataset}")
print(f"Shape: {agent_dataset.df.shape}")

# Analyze tool usage
tool_usage = {}
for case in agent_test_cases:
    if case.tools_called:
        for tool in case.tools_called:
            tool_usage[tool.name] = tool_usage.get(tool.name, 0) + 1

print("\nTool Usage Analysis:")
for tool, count in tool_usage.items():
    print(f"  - {tool}: {count} times")

print("\nAgent Dataset Preview:")
display(agent_dataset.df[['input', 'actual_output', 'tools_called']].head())


<a id="toc7_"></a>

## Working with Golden Templates

Golden templates are a powerful feature of DeepEval that allow you to define test inputs and expected outputs, then generate actual outputs at evaluation time. This approach enables systematic testing across multiple scenarios.
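
The conversion step can be pictured as mapping each template through a generation function. The sketch below shows that pattern with plain dataclasses; `SketchGolden`, `SketchCase`, and `goldens_to_cases` are hypothetical stand-ins for illustration and do not reflect the internals of DeepEval or `LLMAgentDataset`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchGolden:
    """Hypothetical template: an input and its expected output, no actual output yet."""
    input: str
    expected_output: str


@dataclass
class SketchCase:
    """Hypothetical test case: a template plus the generated actual output."""
    input: str
    expected_output: str
    actual_output: Optional[str] = None


def goldens_to_cases(
    goldens: List[SketchGolden], generate: Callable[[str], str]
) -> List[SketchCase]:
    """Call `generate` on each template input to fill in the actual output."""
    return [SketchCase(g.input, g.expected_output, generate(g.input)) for g in goldens]


cases = goldens_to_cases(
    [SketchGolden("What is 2 + 2?", "4")],
    generate=lambda text: f"Echo: {text}",
)
print(cases[0].actual_output)
```

In practice, `generate` would be your LLM application rather than an echo function.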


In [None]:
# Create Golden templates
print("Creating Golden templates...")

goldens = [
    Golden(
        input="Explain the concept of neural networks in simple terms",
        expected_output="Neural networks are computing systems inspired by biological neural networks that constitute animal brains.",
        context=["Neural networks are a key component of machine learning and artificial intelligence."]
    ),
    Golden(
        input="What are the main benefits of cloud computing for businesses?", 
        expected_output="Cloud computing offers scalability, cost-effectiveness, accessibility, and reduced infrastructure maintenance.",
        context=["Cloud computing provides on-demand access to computing resources over the internet."]
    ),
    Golden(
        input="How does password encryption protect user data?",
        expected_output="Password encryption converts passwords into unreadable formats using cryptographic algorithms, protecting against unauthorized access.",
        context=["Encryption is a fundamental security technique used to protect sensitive information."]
    ),
    Golden(
        input="What is the difference between machine learning and deep learning?",
        expected_output="Machine learning is a broad field of AI, while deep learning is a subset that uses neural networks with multiple layers.",
        context=["Both are important areas of artificial intelligence with different approaches and applications."]
    )
]

print(f"Created {len(goldens)} Golden templates")

# Create dataset from goldens
golden_dataset = LLMAgentDataset.from_goldens(
    goldens=goldens,
    input_id="golden_templates_dataset"
)

print(f"Golden Dataset: {golden_dataset}")
print(f"Shape: {golden_dataset.df.shape}")

print("\nGolden Templates Preview:")
display(golden_dataset.df[['input', 'expected_output', 'context', 'type']].head())

# Mock LLM application function for demonstration
def mock_llm_application(input_text: str) -> str:
    """
    Simulate an LLM application generating responses.
    In production, this would be your actual LLM application.
    """
    
    responses = {
        "neural networks": """Neural networks are computational models inspired by the human brain. 
        They consist of interconnected nodes (neurons) that process information by learning patterns from data. 
        These networks can recognize complex patterns and make predictions, making them useful for tasks like 
        image recognition, natural language processing, and decision-making.""",
        
        "cloud computing": """Cloud computing provides businesses with flexible, scalable access to computing resources 
        over the internet. Key benefits include reduced upfront costs, automatic scaling based on demand, 
        improved collaboration through shared access, enhanced security through professional data centers, 
        and reduced need for internal IT maintenance.""",
        
        "password encryption": """Password encryption protects user data by converting passwords into complex, 
        unreadable strings using mathematical algorithms. When you enter your password, it's immediately encrypted 
        before storage or transmission. Even if data is intercepted, the encrypted password appears as random characters, 
        making it virtually impossible for attackers to determine the original password.""",
        
        "machine learning": """Machine learning is a broad approach to artificial intelligence where computers learn 
        to make predictions or decisions by finding patterns in data. Deep learning is a specialized subset that uses 
        artificial neural networks with multiple layers (hence 'deep') to process information in ways that mimic 
        human brain function, enabling more sophisticated pattern recognition and decision-making."""
    }
    
    # Simple keyword matching for demonstration
    input_lower = input_text.lower()
    for keyword, response in responses.items():
        if keyword in input_lower:
            return response.strip()
    
    return f"Thank you for your question about: {input_text}. I'd be happy to provide a comprehensive answer based on current knowledge and best practices."

print(f"\nMock LLM application ready - will generate responses for {len(goldens)} templates")


In [None]:
# Convert goldens to test cases by generating actual outputs
print("Converting Golden templates to test cases...")

print("Before conversion:")
print(f"  - Test cases: {len(golden_dataset.test_cases)}")
print(f"  - Goldens: {len(golden_dataset.goldens)}")

# Convert goldens to test cases using our mock LLM
golden_dataset.convert_goldens_to_test_cases(mock_llm_application)

print("\nAfter conversion:")
print(f"  - Test cases: {len(golden_dataset.test_cases)}")
print(f"  - Goldens: {len(golden_dataset.goldens)}")

print("\nConversion completed!")

# Show the updated dataset
print("\nUpdated Dataset with Generated Outputs:")
dataset_df = golden_dataset.df
# Filter for rows with actual output
mask = pd.notna(dataset_df['actual_output']) & (dataset_df['actual_output'] != '')
converted_df = dataset_df[mask]

if not converted_df.empty:
    display(converted_df[['input', 'actual_output', 'expected_output']])

    # Analyze output lengths using pandas string methods
    actual_lengths = converted_df['actual_output'].astype(str).str.len()
    expected_lengths = converted_df['expected_output'].astype(str).str.len()

    print("\nOutput Analysis:")
    print(f"Average actual output length: {actual_lengths.mean():.0f} characters")
    print(f"Average expected output length: {expected_lengths.mean():.0f} characters")
    print(f"Ratio (actual/expected): {(actual_lengths.mean() / expected_lengths.mean()):.2f}x")
else:
    print("No converted test cases found")


<a id="toc8_"></a>

## ValidMind Integration

Now let's demonstrate how to integrate our LLMAgentDataset with ValidMind's testing framework, enabling comprehensive documentation and compliance features.


In [None]:
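# Register the datasets with ValidMind
print("Integrating with ValidMind framework...")

try:
    # Re-initialize ValidMind; called without arguments, this relies on
    # credentials already available in the environment (e.g. from `.env`)
    vm.init()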
    print("ValidMind initialized")
    
    # Register our datasets with ValidMind
    datasets_to_register = [
        (simple_dataset, "simple_qa_dataset"),
        (rag_dataset, "rag_evaluation_dataset"),
        (agent_dataset, "agent_evaluation_dataset"),
        (golden_dataset, "golden_templates_dataset")
    ]
    
    registered = []
    for dataset, dataset_id in datasets_to_register:
        try:
            vm.init_dataset(
                dataset=dataset.df,
                input_id=dataset_id,
                text_column="input",
                target_column="expected_output"
            )
            registered.append(dataset_id)
            print(f"Registered: {dataset_id}")
        except Exception as e:
            print(f"WARNING: Failed to register {dataset_id}: {e}")

    # Registered datasets can be referenced by their input_id in test suites
    print("\nValidMind Integration Complete:")
    print(f"  - Datasets registered: {len(registered)} of {len(datasets_to_register)}")
    print("  - Ready for use in ValidMind test suites")
    print("  - Can be referenced by their input_id in test configurations")
        
except Exception as e:
    print(f"ERROR: ValidMind integration failed: {e}")
    print("Note: Some ValidMind features may require additional setup")

# Demonstrate dataset compatibility
from validmind.vm_models import VMDataset

print("\nDataset Compatibility Check:")

for dataset, name in [(simple_dataset, "Simple Q&A"), (rag_dataset, "RAG"), (agent_dataset, "Agent"), (golden_dataset, "Golden")]:
    print(f"\n{name} Dataset:")
    print(f"  - Type: {type(dataset).__name__}")
    print(f"  - Inherits VMDataset: {isinstance(dataset, VMDataset)}")
    print(f"  - Has text_column: {hasattr(dataset, 'text_column')}")
    print(f"  - Has target_column: {hasattr(dataset, 'target_column')}")
    print(f"  - DataFrame shape: {dataset.df.shape}")
    print(f"  - Columns: {len(dataset.columns)}")


<a id="toc9_"></a>

## Custom Metrics with G-Eval

One of DeepEval's most powerful features is the ability to create custom evaluation metrics using G-Eval (Generative Evaluation). This enables domain-specific evaluation criteria tailored to your use case.
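
Each custom metric below is configured with a `threshold`. The sketch below shows one way to summarize a batch of scores against such a threshold, assuming scores fall between 0 and 1 and that a score passes when it is at least the threshold. `summarize_scores` is a hypothetical helper and the scores are made-up illustration values, not evaluation results.

```python
def summarize_scores(scores, threshold):
    """Return pass count, fail count, and pass rate for scores in [0, 1]."""
    passed = sum(1 for s in scores if s >= threshold)
    total = len(scores)
    return {
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
    }


# Illustrative scores only, not output from a real metric
summary = summarize_scores([0.9, 0.75, 0.8], threshold=0.8)
print(summary)
```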


In [None]:
# Create custom evaluation metrics using G-Eval
print("Creating custom evaluation metrics...")

# Custom metric 1: Technical Accuracy
technical_accuracy_metric = GEval(
    name="Technical Accuracy",
    criteria="""Evaluate whether the response is technically accurate and uses appropriate 
    terminology for the domain. Consider if the explanations are scientifically sound 
    and if technical concepts are explained correctly.""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.CONTEXT
    ],
    threshold=0.8
)

# Custom metric 2: Clarity and Comprehensiveness  
clarity_metric = GEval(
    name="Clarity and Comprehensiveness",
    criteria="""Assess whether the response is clear, well-structured, and comprehensive. 
    The response should be easy to understand, logically organized, and address all 
    aspects of the user's question without being overly verbose.""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.75
)

# Custom metric 3: Business Context Appropriateness
business_context_metric = GEval(
    name="Business Context Appropriateness", 
    criteria="""Evaluate whether the response is appropriate for a business context. 
    Consider if the tone is professional, if the content is relevant to business needs, 
    and if it provides actionable information that would be valuable to a business user.""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.7
)

# Custom metric 4: Tool Usage Appropriateness (for agents)
tool_usage_metric = GEval(
    name="Tool Usage Appropriateness",
    criteria="""Evaluate whether the agent used appropriate tools for the given task. 
    Consider if the tools were necessary, if they were used correctly, and if the 
    agent's reasoning for tool selection was sound.""",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT
    ],
    threshold=0.8
)

custom_metrics = [
    technical_accuracy_metric,
    clarity_metric, 
    business_context_metric,
    tool_usage_metric
]

print("Custom metrics created:")
for metric in custom_metrics:
    print(f"  - {metric.name}: threshold {metric.threshold}")

# Demonstrate metric application to different dataset types
print(f"\nMetric-Dataset Matching:")
metric_dataset_pairs = [
    ("Technical Accuracy", "golden_templates_dataset (tech questions)"),
    ("Clarity and Comprehensiveness", "simple_qa_dataset (general Q&A)"),
    ("Business Context Appropriateness", "rag_evaluation_dataset (business support)"),
    ("Tool Usage Appropriateness", "agent_evaluation_dataset (agent actions)")
]

for metric_name, dataset_name in metric_dataset_pairs:
    print(f"  - {metric_name} → {dataset_name}")

print(f"\nEvaluation Setup (Demo Mode):")
print("Note: Actual evaluation requires OpenAI API key")
print("These metrics would evaluate:")
print("  - Technical accuracy of AI/ML explanations") 
print("  - Clarity of business support responses")
print("  - Appropriateness of agent tool usage")
print("  - Overall comprehensiveness across all domains")


In [None]:
from validmind.vm_models import VMDataset
# Create a test dataset for evaluating the custom metrics
test_cases = [
    LLMTestCase(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It uses statistical techniques to allow computers to find patterns in data.",
        context=["Machine learning is a branch of AI that focuses on building applications that learn from data and improve their accuracy over time without being programmed to do so."],
        expected_output="Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
    ),  
    LLMTestCase(
        input="How do I implement a neural network?",
        actual_output="To implement a neural network, you need to: 1) Define the network architecture (layers, neurons), 2) Initialize weights and biases, 3) Implement forward propagation, 4) Calculate loss, 5) Perform backpropagation, and 6) Update weights using gradient descent.",
        context=["Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process and transmit signals."],
        expected_output="Neural network implementation involves defining network architecture, initializing parameters, implementing forward and backward propagation, and using optimization algorithms for training."
    )
]

# Convert the test cases to an LLMAgentDataset for G-Eval scoring
geval_dataset = LLMAgentDataset.from_test_cases(
    test_cases=test_cases,
    input_id="geval_dataset"
)


# Apply custom metrics to individual test cases
print("Applying custom metrics to evaluation dataset:")
for metric in custom_metrics:
    print(f"\nResults for {metric.name}:")
    for i, test_case in enumerate(test_cases):
        try:
            result = metric.measure(test_case)
            print(f"Test case {i+1}:")
            print(f"  Score: {metric.score:.2f}")
            print(f"  Reason: {metric.reason}")
        except Exception as e:
            print(f"Test case {i+1}: Error - {str(e)}")


In [None]:
# Metric definition and column mapping for the GenericEval scorer
name = "Technical Accuracy"
criteria = """Evaluate whether the response is technically accurate and uses appropriate 
    terminology for the domain. Consider if the explanations are scientifically sound 
    and if technical concepts are explained correctly.
    """
threshold = 0.8
input_column = "input"
actual_output_column = "actual_output"
context_column = "context"

geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GenericEval",
    input_column=input_column,
    actual_output_column=actual_output_column,
    context_column=context_column,
    metric_name=name,
    criteria=criteria,
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.CONTEXT
    ],
    threshold=threshold,
)

<a id="toc10_"></a>

## In summary

This notebook demonstrated the comprehensive integration between DeepEval and ValidMind for LLM evaluation:

**Key Achievements:**
- Successfully created and evaluated different types of LLM test cases (Q&A, RAG, Agents)
- Integrated DeepEval metrics with ValidMind's testing infrastructure
- Demonstrated Golden template workflows for systematic testing
- Created custom evaluation metrics using G-Eval
- Showed how to handle complex agent scenarios with tool usage

**Integration Benefits:**
- **Comprehensive Coverage**: Evaluate LLMs across 30+ specialized metrics
- **Structured Documentation**: Leverage ValidMind's compliance and documentation features
- **Flexibility**: Support for custom metrics and domain-specific evaluation criteria
- **Production Ready**: Handle real-world LLM evaluation scenarios at scale

The `LLMAgentDataset` class provides a seamless bridge between DeepEval's evaluation capabilities and ValidMind's testing infrastructure, enabling robust LLM evaluation within a structured, compliant framework.


<a id="toc11_"></a>

## Next steps

**Explore Advanced Features:**
- **Continuous Evaluation**: Set up automated LLM evaluation pipelines
- **A/B Testing**: Compare different LLM models and configurations
- **Metrics Customization**: Create domain-specific evaluation criteria
- **Integration Patterns**: Embed evaluation into your LLM development workflow

**Additional Resources:**
- [ValidMind Library Documentation](https://docs.validmind.ai/developer/validmind-library.html) - Complete API reference and tutorials

**Try These Examples:**
- Implement custom business-specific evaluation metrics
- Create automated evaluation pipelines for model deployment
- Integrate with your existing ML infrastructure and workflows
- Explore multi-modal evaluation scenarios (text, code, images)

Start building comprehensive LLM evaluation workflows that combine the power of DeepEval's specialized metrics with ValidMind's structured testing and documentation framework.
