# G-Eval Integration for DeepEval within ValidMind

Let's learn how to integrate [DeepEval](https://github.com/confident-ai/deepeval) with the ValidMind Library to evaluate Large Language Models (LLMs) and AI agents. 
LLM evaluation requires robust metrics to assess model outputs. G-Eval, a key feature of DeepEval, uses LLMs themselves to evaluate model responses across dimensions such as factual accuracy, coherence, and relevance. This notebook demonstrates how to use G-Eval metrics within ValidMind's testing infrastructure to create comprehensive, automated evaluations of LLM outputs.

To integrate DeepEval with ValidMind, we'll:
 1. Set up both frameworks and install required dependencies
 2. Create a dataset of test cases with inputs, actual outputs, and expected outputs
 3. Score the outputs with custom G-Eval metrics and analyze the results


## Contents    
- [Introduction](#toc1_)    
  - [Before you begin](#toc2_1_)    
  - [Key concepts](#toc2_2_)    
- [Setting up](#toc3_)    
  - [Install required packages](#toc3_1_)    
  - [Initialize ValidMind](#toc3_2_)    
- [Custom Metrics with G-Eval](#toc4_)    
  - [Technical accuracy](#toc4_1_)    
  - [Clarity and Comprehensiveness](#toc4_2_)    
  - [Business Context Appropriateness](#toc4_3_)    
  - [Conciseness Evaluation](#toc4_5_)
- [Next steps](#toc6_)    



<a id="toc1_"></a>

## Introduction
**G-Eval** is a framework that uses large language models (LLMs) as evaluators—essentially treating an LLM as a “judge” to assess the quality of other LLM outputs. Instead of relying on traditional metrics like BLEU or ROUGE, G-Eval enables natural-language evaluation criteria (e.g., “rate how factual this summary is”). The framework guides the judge model through structured reasoning steps, producing more consistent, transparent, and interpretable scoring results. It is particularly effective for subjective or open-ended tasks such as summarization, dialogue generation, and content evaluation.

Key advantages of G-Eval include:

* **Structured reasoning:** Uses a step-by-step approach to improve reliability and reduce bias.
* **Custom evaluation criteria:** Supports diverse factors like accuracy, tone, safety, or style.
* **Enhanced consistency:** Provides more repeatable judgments than earlier LLM-as-a-judge methods.
* **Production scalability:** Integrates easily with CI/CD pipelines via tools like *DeepEval*.
* **Broader applicability:** Works across multiple domains and task types, from creative writing to factual QA.
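
To make the idea of natural-language criteria plus structured reasoning steps concrete, here is a minimal sketch of how a judge prompt could be assembled. The `build_judge_prompt` helper and its prompt layout are hypothetical and written only for illustration; they are not DeepEval's actual prompt template.

```python
from typing import List


def build_judge_prompt(criteria: str, steps: List[str], question: str, answer: str) -> str:
    """Assemble a judge prompt from criteria, numbered steps, and the case under review."""
    # Number the steps so the judge model can follow them in order.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Evaluation criteria:\n{criteria.strip()}\n\n"
        f"Evaluation steps:\n{numbered}\n\n"
        f"Input:\n{question}\n\n"
        f"Actual output:\n{answer}\n\n"
        "Return a score from 0 to 10 and a short reason."
    )


# Hypothetical criteria and steps, used only to show the resulting prompt.
prompt = build_judge_prompt(
    criteria="Rate how factual the answer is.",
    steps=["Identify the claims in the answer.", "Check each claim against the input."],
    question="What is machine learning?",
    answer="Machine learning lets systems learn patterns from data.",
)
print(prompt)
```

Keeping the criteria and the steps as separate inputs makes it easy to reuse the same steps across several metrics, which is roughly how the scorer calls later in this notebook are organized.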

<a id="toc2_1_"></a>

### Before you begin

This notebook assumes you have basic familiarity with Python and Large Language Models. You'll need:

- Python 3.8 or higher
- Access to the OpenAI API (for DeepEval metrics evaluation)
- A ValidMind account and a registered model

If you encounter errors due to missing modules, install them with `pip install` and re-run the notebook.


<a id="toc2_2_"></a>

### Key concepts

**LLMTestCase**: A DeepEval object that represents a single test case with input, expected output, actual output, and optional context.

**G-Eval**: Generative evaluation using LLMs to assess response quality based on custom criteria.
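
The G-Eval paper also describes weighting each candidate score by the probability the judge model assigns to it, instead of relying on a single sampled score. The sketch below illustrates only that arithmetic, under stated assumptions: `weighted_score` is a hypothetical helper, the probabilities are made up, and a 0-10 raw scale is assumed so that the result can be rescaled to 0-1.

```python
from typing import Dict


def weighted_score(score_probs: Dict[int, float], max_score: int = 10) -> float:
    """Probability-weighted score, rescaled to the 0-1 range."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("score probabilities must sum to a positive value")
    # Expected raw score: each candidate score weighted by its probability.
    expected = sum(score * prob for score, prob in score_probs.items()) / total
    return expected / max_score


# Hypothetical judge output: probability mass on each candidate score.
probs = {7: 0.2, 8: 0.5, 9: 0.3}
normalized = weighted_score(probs)
print(round(normalized, 2))
```

Dividing by the total probability mass keeps the result well defined even when the listed probabilities do not sum exactly to 1.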



<a id="toc3_"></a>

## Setting up


<a id="toc3_1_"></a>

### Install required packages

First, let's install the required packages and set up our environment.


In [None]:
# deepeval is imported directly later in this notebook, so install it alongside validmind
%pip install -q validmind deepeval

<a id="toc3_2_"></a>

### Initialize ValidMind

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>For access to all features available in this notebook, you'll need access to a ValidMind account.</b></span>
<br></br>
<a href="https://docs.validmind.ai/guide/configuration/register-with-validmind.html" style="color: #DE257E;"><b>Register with ValidMind</b></a></div>


In [None]:
# Load your model identifier credentials from an `.env` file
%load_ext dotenv
%dotenv .env

# Or replace with your code snippet
import validmind as vm

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    model="...",
)

In [None]:
# Core imports
import warnings
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics.g_eval.utils import Rubric
from validmind.datasets.llm import LLMAgentDataset
import pandas as pd

warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', None)


## Create test cases

Let's create test cases to demonstrate the G-Eval custom metrics functionality.

In [None]:
# Create a test dataset for evaluating the custom metrics
test_cases = [
    LLMTestCase(
        input="What is machine learning?",
        actual_output="Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It uses statistical techniques to allow computers to find patterns in data.",
        context=["Machine learning is a branch of AI that focuses on building applications that learn from data and improve their accuracy over time without being programmed to do so."],
        expected_output="Machine learning is a method of data analysis that automates analytical model building. It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
    ),  
    LLMTestCase(
        input="How do I implement a neural network?",
        actual_output="To implement a neural network, you need to: 1) Define the network architecture (layers, neurons), 2) Initialize weights and biases, 3) Implement forward propagation, 4) Calculate loss, 5) Perform backpropagation, and 6) Update weights using gradient descent.",
        context=["Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process and transmit signals."],
        expected_output="Neural network implementation involves defining network architecture, initializing parameters, implementing forward and backward propagation, and using optimization algorithms for training."
    )
]

# Create Agent dataset
geval_dataset = LLMAgentDataset.from_test_cases(
    test_cases=test_cases,
    input_id="geval_dataset"
)
geval_dataset._df

## Scorers in ValidMind

Scorers are evaluation metrics that analyze model outputs and store their results in the dataset. When using `assign_scores()`:

- The GEval scorer adds three new columns to the dataset for each metric: `GEval_{metric_name}_score`, `GEval_{metric_name}_reason`, and `GEval_{metric_name}_criteria`
- The score column contains the numeric score (typically 0-1) for each example
- Multiple scorers can be run on the same dataset, each adding their own columns
- Scores are persisted in the dataset for later analysis and visualization
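
As a rough illustration of the naming pattern above, the sketch below builds the three expected column names for a metric. The `geval_column_names` helper is hypothetical, and replacing spaces with underscores is an assumption inferred from the column names used in the box plot cell later in this notebook, not a documented rule.

```python
from typing import Dict


def geval_column_names(metric_name: str) -> Dict[str, str]:
    """Build the expected score, reason, and criteria column names for one metric."""
    # Assumption: spaces in the metric name become underscores in column names.
    slug = metric_name.strip().replace(" ", "_")
    return {suffix: f"GEval_{slug}_{suffix}" for suffix in ("score", "reason", "criteria")}


columns = geval_column_names("Technical Accuracy")
print(columns["score"])
```

A helper like this can be handy when selecting score columns for plots, but it is safer to check the actual column names in the dataset after running a scorer.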

<a id="toc4_"></a>

## Custom Metrics with G-Eval
One of DeepEval's most powerful features is the ability to create custom evaluation metrics using G-Eval (Generative Evaluation). This enables domain-specific evaluation criteria tailored to your use case.


<a id="toc4_1_"></a>

### Technical accuracy

This evaluation assesses whether responses are technically accurate and use terminology appropriate for the domain.

In [None]:
# Name and natural-language criteria for the custom metric
name = "Technical Accuracy"
criteria = """Evaluate whether the response is technically accurate and uses appropriate
terminology for the domain. Consider if the explanations are scientifically sound
and if technical concepts are explained correctly."""
threshold = 0.8

# Score each test case and store the results as new dataset columns
geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GEval",
    metric_name=name,
    criteria=criteria,
    threshold=threshold,
    evaluation_params={
        LLMTestCaseParams.INPUT: "input",
        LLMTestCaseParams.ACTUAL_OUTPUT: "actual_output",
    },
)
geval_dataset._df
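
In DeepEval, a metric's `threshold` is the minimum score that counts as passing. Here is a minimal sketch of that comparison, using made-up scores and a hypothetical `flag_below_threshold` helper instead of the real dataset columns; how the ValidMind scorer itself applies the threshold is not shown here.

```python
from typing import Dict, List

# Hypothetical scores for illustration only; real values come from the scorer.
rows: List[Dict[str, object]] = [
    {"input": "What is machine learning?", "score": 0.9},
    {"input": "How do I implement a neural network?", "score": 0.7},
]


def flag_below_threshold(rows: List[Dict[str, object]], threshold: float) -> List[str]:
    """Return the inputs whose score falls below the threshold."""
    return [row["input"] for row in rows if row["score"] < threshold]


failing = flag_below_threshold(rows, threshold=0.8)
print(failing)
```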

<a id="toc4_2_"></a>

### Clarity and Comprehensiveness
This evaluation assesses the clarity and comprehensiveness of responses, focusing on how well-structured and understandable they are. The criteria examine whether responses are logically organized, address all aspects of the question thoroughly, and maintain an appropriate level of detail without being overly verbose.
 


In [None]:
# Name and criteria; this metric compares the actual output with the expected output
name = "Clarity and Comprehensiveness"
criteria = """Evaluate the clarity, structure, and comprehensiveness of the actual output
in relation to the expected output. The response should be clear, well-organized, and
comparable in coverage to the expected output, addressing all relevant aspects without
being overly verbose. Deduct points if important points or details present in the expected
output are missing or inaccurately conveyed in the actual output."""
threshold = 0.75

# Include the expected output so the judge can compare coverage
geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GEval",
    metric_name=name,
    criteria=criteria,
    threshold=threshold,
    evaluation_params={
        LLMTestCaseParams.INPUT: "input",
        LLMTestCaseParams.ACTUAL_OUTPUT: "actual_output",
        LLMTestCaseParams.EXPECTED_OUTPUT: "expected_output",
    },
)
geval_dataset._df.head()

<a id="toc4_3_"></a>

### Business Context Appropriateness

This evaluation assesses whether responses are appropriate for a business context, considering factors like professional tone, business relevance, and actionable insights. The criteria focus on ensuring the content would be valuable and applicable for business users.


In [None]:
# Name and natural-language criteria for the custom metric
name = "Business Context Appropriateness"
criteria = """Evaluate whether the response is appropriate for a business context.
Consider if the tone is professional, if the content is relevant to business needs,
and if it provides actionable information that would be valuable to a business user."""
threshold = 0.7

# Score each test case and store the results as new dataset columns
geval_dataset.assign_scores(
    metrics="validmind.scorer.llm.deepeval.GEval",
    metric_name=name,
    criteria=criteria,
    threshold=threshold,
    evaluation_params={
        LLMTestCaseParams.INPUT: "input",
        LLMTestCaseParams.ACTUAL_OUTPUT: "actual_output",
        LLMTestCaseParams.EXPECTED_OUTPUT: "expected_output",
    },
)
geval_dataset._df.head()


<a id="toc4_5_"></a>

### Conciseness Evaluation
This evaluation assesses how concise the responses are. It examines whether each response answers the question directly and succinctly, focusing on the information requested without unnecessary, irrelevant, or excessive detail. This example also guides the judge with explicit evaluation steps and a scoring rubric.

In [None]:
criteria = """
    Evaluate the conciseness of the generation on a continuous scale from 0 to 1.
    A generation can be considered concise (Score: 1) if it directly and succinctly
    answers the question posed, focusing specifically on the information requested
    without including unnecessary, irrelevant, or excessive details."""

# Explicit steps for the judge model to follow, one string per step
evaluation_steps = [
    "Read the input and identify which pieces of information need to be conveyed.",
    "Read the actual_output and check if it includes all the required information.",
    "Check if the actual_output excludes irrelevant details or redundancies.",
    "Check if the wording is as brief as possible while still being clear and complete.",
    "Assign a score (e.g., 0-10) based on how well the actual_output meets the above.",
]

rubric=[
        Rubric(score_range=(0, 1), expected_outcome="Very poor Conciseness"),
        Rubric(score_range=(2, 3), expected_outcome="Poor Conciseness"),
        Rubric(score_range=(4, 5), expected_outcome="Fair Conciseness"),
        Rubric(score_range=(6, 7), expected_outcome="Good Conciseness"),
        Rubric(score_range=(8, 10), expected_outcome="Excellent Conciseness"),
    ]

geval_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.GEval",
    metric_name="Conciseness", 
    criteria = criteria,
    rubric=rubric,
    evaluation_steps=evaluation_steps,
    evaluation_params={
        LLMTestCaseParams.INPUT: "input",
        LLMTestCaseParams.ACTUAL_OUTPUT: "actual_output",
    }
)
geval_dataset._df.head()
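
The rubric above maps raw judge scores on a 0-10 scale to descriptive bands. The sketch below mirrors that lookup with a hypothetical `Band` dataclass standing in for `Rubric`; dividing by 10 to rescale to 0-1 is an assumption about how the raw score relates to the stored score.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Band:
    """Hypothetical stand-in for a rubric entry: inclusive score range plus label."""
    low: int
    high: int
    label: str


# Same ranges and labels as the rubric in the cell above.
BANDS: Tuple[Band, ...] = (
    Band(0, 1, "Very poor Conciseness"),
    Band(2, 3, "Poor Conciseness"),
    Band(4, 5, "Fair Conciseness"),
    Band(6, 7, "Good Conciseness"),
    Band(8, 10, "Excellent Conciseness"),
)


def band_for(raw_score: int, bands: Tuple[Band, ...] = BANDS) -> Optional[str]:
    """Return the label of the band containing raw_score, or None if none matches."""
    for band in bands:
        if band.low <= raw_score <= band.high:
            return band.label
    return None


raw = 7
print(band_for(raw), raw / 10)  # band label and the score rescaled to 0-1
```

Because the bands cover every integer from 0 to 10 without overlap, each valid raw score falls into exactly one band.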

Let's plot all of these metrics together with the `BoxPlot` test:

In [None]:
vm.tests.run_test(
    "validmind.plots.BoxPlot",
    inputs={"dataset": geval_dataset},
    params={
        "columns": [
            "GEval_Technical_Accuracy_score",
            "GEval_Clarity_and_Comprehensiveness_score",
            "GEval_Business_Context_Appropriateness_score",
            "GEval_Conciseness_score"
        ],
        "title": "Distribution of G-Eval Scores",
        "ylabel": "Score",
    }
).log()
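
A box plot summarizes each score column by its minimum, quartiles, and maximum. The sketch below computes those five numbers with the standard library for a made-up list of scores; real values would come from the score columns listed above.

```python
import statistics

# Hypothetical scores for one metric, used only to illustrate the summary.
scores = [0.6, 0.7, 0.8, 0.9, 1.0]

# With n=4, quantiles() returns the three cut points: Q1, median, and Q3.
q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
summary = {"min": min(scores), "q1": q1, "median": median, "q3": q3, "max": max(scores)}
print(summary)
```

With only a handful of test cases, as in this notebook, the quartiles are coarse, so the plot is most informative once the dataset grows.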


<a id="toc6_"></a>

## Next steps

**Explore Advanced Features:**
- **Continuous Evaluation**: Set up automated LLM evaluation pipelines
- **Metrics Customization**: Create domain-specific evaluation criteria
- **Integration Patterns**: Embed evaluation into your LLM development workflow

**Additional Resources:**
- [ValidMind Library Documentation](https://docs.validmind.ai/developer/validmind-library.html) - Complete API reference and tutorials

**Try These Examples:**
- Implement custom business-specific evaluation metrics
- Create automated evaluation pipelines for model deployment
- Integrate with your existing ML infrastructure and workflows
- Explore multi-modal evaluation scenarios (text, code, images)

Start building comprehensive LLM evaluation workflows that combine the power of DeepEval's specialized metrics with ValidMind's structured testing and documentation framework.
