# G-Eval Integration for DeepEval within ValidMind

Let's learn how to integrate [DeepEval](https://github.com/confident-ai/deepeval) with the ValidMind Library to evaluate Large Language Models (LLMs) and AI agents. This notebook demonstrates how to use DeepEval's G-Eval custom evaluation metrics within ValidMind's testing infrastructure.

To integrate DeepEval with ValidMind, we'll:
 1. Set up both frameworks and install required dependencies
 2. Create a dataset of test cases with inputs, model outputs, and context
 3. Score the dataset and analyze the results using G-Eval custom metrics


## Contents    
- [Introduction](#toc1_)    
- [About DeepEval Integration](#toc2_)    
  - [Before you begin](#toc2_1_)    
  - [Key concepts](#toc2_2_)    
- [Setting up](#toc3_)    
  - [Install required packages](#toc3_1_)    
  - [Initialize ValidMind](#toc3_2_)    
- [Custom Metrics with G-Eval](#toc4_)    
  - [Technical accuracy](#toc4_1_)    
  - [Clarity and Comprehensiveness](#toc4_2_)    
  - [Business Context Appropriateness](#toc4_3_)    
  - [Tool Usage Appropriateness](#toc4_4_)    
  - [Coherence Evaluation](#toc4_5_)    
- [Next steps](#toc6_)    



<a id="toc1_"></a>

## Introduction

Large Language Model (LLM) evaluation is critical for understanding model performance across different tasks and scenarios. This notebook demonstrates how to integrate DeepEval's comprehensive evaluation framework with ValidMind's testing infrastructure to create a robust LLM evaluation pipeline.




<a id="toc2_"></a>

## About DeepEval Integration

DeepEval is a comprehensive evaluation framework for LLMs that provides metrics for various scenarios including hallucination detection, answer relevancy, faithfulness, and custom evaluation criteria. ValidMind is a platform for managing model risk and documentation through automated testing.

Together, these tools enable comprehensive LLM evaluation within a structured, compliant framework.


<a id="toc2_1_"></a>

### Before you begin

This notebook assumes you have basic familiarity with Python and Large Language Models. You'll need:

- Python 3.8 or higher
- Access to the OpenAI API (for DeepEval metrics evaluation)
- A ValidMind account and a registered model

If you encounter errors due to missing modules, install them with `pip install` and re-run the notebook.
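
DeepEval's OpenAI-backed metrics read the API key from the `OPENAI_API_KEY` environment variable by default. As a minimal pre-flight sketch, you could confirm the variable is set before running the evaluation cells. The helper name `missing_env_vars` is hypothetical and not part of either library:

```python
import os


def missing_env_vars(names, env=None):
    """Return the names from `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Hypothetical pre-flight check; OPENAI_API_KEY is the variable DeepEval reads by default
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Set these environment variables before continuing: {missing}")
```

Passing an explicit `env` dictionary makes the helper easy to check without touching the real environment.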


<a id="toc2_2_"></a>

### Key concepts

**LLMTestCase**: A DeepEval object that represents a single test case with input, expected output, actual output, and optional context.

**G-Eval**: Generative evaluation using LLMs to assess response quality based on custom criteria.
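
In the original G-Eval formulation, the judge model's final score can be smoothed by weighting each candidate score by the probability the judge assigns to it, rather than taking only the single most likely score. A minimal sketch of that weighting follows. The function name and the example probabilities are illustrative assumptions, and this is not DeepEval's internal implementation:

```python
def probability_weighted_score(score_probs):
    """Expected score: sum of score * probability, renormalized over the given candidates."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(score * prob for score, prob in score_probs.items()) / total


# Made-up judge probabilities for candidate scores on a 1-5 scale
example = {3: 0.2, 4: 0.5, 5: 0.3}
weighted = probability_weighted_score(example)
```

Weighting by probability yields a continuous score, which separates responses that would otherwise tie on the same integer rating.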



<a id="toc3_"></a>

## Setting up


<a id="toc3_1_"></a>

### Install required packages

First, let's install the required packages and set up our environment.


In [None]:
%pip install -q validmind

<a id="toc3_2_"></a>

### Initialize ValidMind

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>For access to all features available in this notebook, you'll need access to a ValidMind account.</b></span>
<br></br>
<a href="https://docs.validmind.ai/guide/configuration/register-with-validmind.html" style="color: #DE257E;"><b>Register with ValidMind</b></a></div>


In [None]:
# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env

# Or replace with your code snippet
import validmind as vm

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    model="...",
)

In [None]:
# Core imports
import pandas as pd
import warnings
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.dataset import Golden
from validmind.datasets.llm import LLMAgentDataset

warnings.filterwarnings('ignore')


## Create test cases

Let's create test cases to demonstrate the G-Eval custom metrics functionality.

In [None]:
# Create a test dataset for evaluating the custom metrics
test_cases = [
    LLMTestCase(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It uses statistical techniques to allow computers to find patterns in data.",
        context=["Machine learning is a branch of AI that focuses on building applications that learn from data and improve their accuracy over time without being programmed to do so."],
        expected_output="Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
    ),  
    LLMTestCase(
        input="How do I implement a neural network?",
        actual_output="To implement a neural network, you need to: 1) Define the network architecture (layers, neurons), 2) Initialize weights and biases, 3) Implement forward propagation, 4) Calculate loss, 5) Perform backpropagation, and 6) Update weights using gradient descent.",
        context=["Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process and transmit signals."],
        expected_output="Neural network implementation involves defining network architecture, initializing parameters, implementing forward and backward propagation, and using optimization algorithms for training."
    )
]

# Create Agent dataset
geval_dataset = LLMAgentDataset.from_test_cases(
    test_cases=test_cases,
    input_id="geval_dataset"
)

<a id="toc4_"></a>

## Custom Metrics with G-Eval

One of DeepEval's most powerful features is the ability to create custom evaluation metrics using G-Eval (Generative Evaluation). This enables domain-specific evaluation criteria tailored to your use case.

<a id="toc4_1_"></a>

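### Technical accuracy

This evaluation assesses whether responses are technically accurate, use appropriate domain terminology, and explain technical concepts correctly.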

In [None]:
# Define the metric name, evaluation criteria, and passing threshold
name = "Technical Accuracy"
criteria = """Evaluate whether the response is technically accurate and uses appropriate
terminology for the domain. Consider if the explanations are scientifically sound
and if technical concepts are explained correctly."""
threshold = 0.8

# Score every test case with the G-Eval scorer
geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GEval",
    metric_name=name,
    criteria=criteria,
    threshold=threshold,
)
geval_dataset._df.head()
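
The `threshold` argument sets the minimum score a response must reach to count as passing, with DeepEval metric scores falling between 0 and 1. As a rough sketch of how a threshold could be applied to a list of scores, consider the helper below. The scores are made up, and `pass_rate` is a hypothetical helper, not a ValidMind or DeepEval function:

```python
def pass_rate(scores, threshold):
    """Fraction of scores at or above the threshold (0.0 for an empty list)."""
    if not scores:
        return 0.0
    passed = sum(1 for score in scores if score >= threshold)
    return passed / len(scores)


# Made-up scores for illustration only
example_scores = [0.9, 0.75, 0.8, 0.6]
rate = pass_rate(example_scores, threshold=0.8)
```

A score exactly equal to the threshold counts as a pass in this sketch.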

<a id="toc4_2_"></a>

### Clarity and Comprehensiveness

This evaluation assesses the clarity and comprehensiveness of responses, focusing on how well-structured and understandable they are. The criteria examine whether responses are logically organized, address all aspects of the question thoroughly, and maintain an appropriate level of detail without being overly verbose.


In [None]:
name = "Clarity and Comprehensiveness"
criteria = """Assess whether the response is clear, well-structured, and comprehensive.
The response should be easy to understand, logically organized, and address all
aspects of the user's question without being overly verbose."""
threshold = 0.75

geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GEval",
    metric_name=name,
    criteria=criteria,
    threshold=threshold,
)
geval_dataset._df.head()

<a id="toc4_3_"></a>

### Business Context Appropriateness

This evaluation assesses whether responses are appropriate for a business context, considering factors like professional tone, business relevance, and actionable insights. The criteria focus on ensuring the content would be valuable and applicable for business users.


In [None]:
name = "Business Context Appropriateness"
criteria = """Evaluate whether the response is appropriate for a business context.
Consider if the tone is professional, if the content is relevant to business needs,
and if it provides actionable information that would be valuable to a business user."""
threshold = 0.7

geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GEval",
    metric_name=name,
    criteria=criteria,
    threshold=threshold,
)
geval_dataset._df.head()


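<a id="toc4_4_"></a>

### Tool Usage Appropriateness

This evaluation assesses whether the agent chose appropriate tools for the task, considering whether the tools were necessary, whether they were used correctly, and whether the reasoning behind tool selection was sound. The test cases defined earlier do not include tool calls, so treat the scores from this metric as illustrative.

In [None]: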
name = "Tool Usage Appropriateness"
criteria = """Evaluate whether the agent used appropriate tools for the given task.
Consider if the tools were necessary, if they were used correctly, and if the
agent's reasoning for tool selection was sound."""
threshold = 0.8

geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GEval",
    metric_name=name,
    criteria=criteria,
    threshold=threshold,
)
geval_dataset._df.head()
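
The metric cells above all issue the same `assign_scores` call with a different name, criteria, and threshold. One way to cut the repetition, sketched under the assumption that the keyword arguments are exactly those used above, is to build the arguments from a small list of definitions. `geval_kwargs` and `METRIC_DEFINITIONS` are hypothetical names introduced here:

```python
# Scorer ID used in the cells above
GEVAL_SCORER_ID = "validmind.scorer.llm.deepeval.GEval"


def geval_kwargs(name, criteria, threshold):
    """Build the keyword arguments for one G-Eval `assign_scores` call."""
    return {
        "metrics": GEVAL_SCORER_ID,
        "metric_name": name,
        "criteria": criteria,
        "threshold": threshold,
    }


# Abbreviated criteria for illustration; use the full text from the cells above
METRIC_DEFINITIONS = [
    ("Technical Accuracy", "Evaluate whether the response is technically accurate.", 0.8),
    ("Business Context Appropriateness", "Evaluate whether the response suits a business context.", 0.7),
]

all_kwargs = [geval_kwargs(*definition) for definition in METRIC_DEFINITIONS]
```

Each entry could then be passed along as `geval_dataset.assign_scores(**kwargs)`, keeping the metric definitions in one place.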


<a id="toc4_5_"></a>

### Coherence Evaluation

This evaluation assesses how well the responses flow and connect logically. It examines whether the content builds naturally from sentence to sentence to form a coherent narrative, rather than just being a collection of related but disconnected information. The evaluation considers factors like fluency, logical progression, and overall readability.



In [None]:
criteria = """Coherence (1-5) - the collective quality of all sentences. We align this dimension with
the DUC quality question of structure and coherence whereby the summary should be
well-structured and well-organized. The summary should not just be a heap of related information, but should build from sentence to sentence to a coherent body of information about a topic."""

# Note: `evaluation_steps` and `rubrics` are defined here for reference but are
# not passed to `assign_scores` below, so they do not affect the scores.
evaluation_steps = [
    "Read the news article carefully and identify the main topic and key points.",
    "Read the summary and compare it to the news article. Check if the summary covers the main topic and key points of the news article, and if it presents them in a clear and logical order.",
    "Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.",
]

rubrics = [
    {
        "score": 0,
        "criteria": "Measure the fluency of the actual output.",
        "expected_outcome": "The output should be fluent and natural sounding",
    },
    {
        "score": 2,
        "criteria": "Measure the logical flow of the actual output.",
        "expected_outcome": "The output should flow logically from one point to the next",
    },
    {
        "score": 3,
        "criteria": "Measure the linguistic flow of the actual output.",
        "expected_outcome": "The output should have good linguistic structure and readability",
    },
]

# Evaluate coherence using the `context` column as the input
geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GEval",
    metric_name="Coherence",
    criteria=criteria,
    input_column="context",
)
geval_dataset._df.head()
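
The coherence criteria describe a 1 to 5 scale, while the thresholds in the earlier sections are on a 0 to 1 scale. Whether the scorer rescales the judge's output is not stated here, so if you need to compare a raw 1 to 5 rating against those thresholds, a simple min-max rescaling is one option. The helper below is a hypothetical sketch, not part of ValidMind or DeepEval:

```python
def rescale_to_unit(score, low=1.0, high=5.0):
    """Linearly map a score from [low, high] onto [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError(f"score {score} is outside [{low}, {high}]")
    return (score - low) / (high - low)


# A mid-scale rating of 3 on the 1-5 scale maps to the midpoint of [0, 1]
midpoint = rescale_to_unit(3)
```

Rejecting out-of-range inputs keeps a malformed rating from silently producing a score outside the unit interval.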

<a id="toc6_"></a>

## Next steps

**Explore Advanced Features:**
- **Continuous Evaluation**: Set up automated LLM evaluation pipelines
- **Metrics Customization**: Create domain-specific evaluation criteria
- **Integration Patterns**: Embed evaluation into your LLM development workflow

**Additional Resources:**
- [ValidMind Library Documentation](https://docs.validmind.ai/developer/validmind-library.html) - Complete API reference and tutorials

**Try These Examples:**
- Implement custom business-specific evaluation metrics
- Create automated evaluation pipelines for model deployment
- Integrate with your existing ML infrastructure and workflows
- Explore multi-modal evaluation scenarios (text, code, images)

Start building comprehensive LLM evaluation workflows that combine the power of DeepEval's specialized metrics with ValidMind's structured testing and documentation framework.
