You've built, or are evaluating, your own custom LLM (for data privacy or confidentiality reasons), or you want to compare the performance of a new LLM against a mature LLM such as ChatGPT, Gemini, or LLaMA.
Validating AI output means measuring how accurate, relevant, or reliable an AI-generated response (from your LLM of interest) is for a given input prompt, compared to a reference or ideal answer.
Using a JudgeLLM (a mature, widely accepted LLM such as ChatGPT, Gemini, or LLaMA) lets you automate this evaluation.
Quantitative validation includes:
- Comparison against a reference or ground-truth answer.
- Scoring AI-generated outputs based on measurable quality metrics.
- Aggregation of scores to produce an overall performance rating.
- Reasoned justification explaining why a particular score or judgment was given.
flowchart TD
%% -----------------------------
%% Core Test Case Structure
%% -----------------------------
subgraph TC["LLM Test Case"]
P["input_prompt"]
EA["expected_answer"]
AGA["ai_generated_answer"]
SoT["source_of_truth"]
AiT["ai_inferred_truth"]
end
%% -----------------------------
%% Metrics Section
%% -----------------------------
subgraph METRICS["GenAI QA Framework"]
AR["Answer Relevancy"]
FA["Faithfulness"]
CP["Contextual Precision"]
CR["Contextual Recall"]
CRe["Contextual Relevancy"]
HA["Hallucination"]
SU["Summarization"]
end
%% -----------------------------
%% Judge LLM
%% -----------------------------
subgraph LLM["Judge LLM"]
J["AmazonBedrockModel (amazon.nova-pro-v1:0)"]
end
%% -----------------------------
%% Main Flow
%% -----------------------------
TC -->|"Passed to"| METRICS
METRICS -->|"Each metric uses"| LLM
LLM -->|"Evaluates each test metric for a given LLM Test Case"| METRICS
%% -----------------------------
%% Relationships and Outputs
%% -----------------------------
METRICS -->|"Generate test scores + reasons"| RES["Metric Results per Test Case"]
RES -->|"Aggregated across test cases"| AVG["Aggregated Test Metrics"]
%% -----------------------------
%% Visual Group Labels
%% -----------------------------
classDef main fill:#f9f9f9,stroke:#333,stroke-width:1px;
classDef block fill:#eef6ff,stroke:#4a90e2,stroke-width:1px;
classDef judge fill:#fff2cc,stroke:#e6b800,stroke-width:1px;
classDef result fill:#e6ffed,stroke:#27ae60,stroke-width:1px;
class TC,SoT,AiT,P,EA,AGA block
class METRICS,AR,FA,CP,CR,CRe,HA,SU main
class LLM,J judge
class RES,AVG result
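
The flow in the diagram can be sketched in plain Python. This is a minimal illustration, not the repository's code: `stub_judge`, `run_metrics`, and `aggregate` are hypothetical names, only two of the seven metrics are listed, and the stub scores by simple word overlap where the real pipeline would ask the Judge LLM through deepeval.

```python
from collections import defaultdict
from statistics import mean

# Subset of the metric names, for illustration only.
METRICS = ["answer_relevancy", "faithfulness"]


def stub_judge(metric: str, test_case: dict) -> tuple[float, str]:
    """Hypothetical stand-in for the Judge LLM: scores by word overlap."""
    expected = set(test_case["expected_answer"].lower().split())
    actual = set(test_case["ai_generated_answer"].lower().split())
    found = len(expected & actual)
    score = found / len(expected) if expected else 0.0
    reason = f"{metric}: {found} of {len(expected)} expected words found"
    return score, reason


def run_metrics(test_cases, judge=stub_judge):
    """Evaluate every metric for every test case; one result row per pair."""
    results = []
    for case in test_cases:
        for metric in METRICS:
            score, reason = judge(metric, case)
            results.append({"test_case": case["input_prompt"], "metric": metric,
                            "score": score, "reason": reason})
    return results


def aggregate(results):
    """Average the scores of each metric across all test cases."""
    by_metric = defaultdict(list)
    for row in results:
        by_metric[row["metric"]].append(row["score"])
    return {metric: mean(scores) for metric, scores in by_metric.items()}


cases = [
    {"input_prompt": "Capital of France?", "expected_answer": "Paris",
     "ai_generated_answer": "Paris"},
    {"input_prompt": "Capital of Japan?", "expected_answer": "Tokyo",
     "ai_generated_answer": "Kyoto"},
]
results = run_metrics(cases)
summary = aggregate(results)
print(summary)
```

Each result row keeps its own score and reason, so the per-test-case detail stays available after the averages are computed.
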
- Python3: Core implementation language
- DeepEval: Framework for extensible GenAI test metrics and customizable Judge LLM integration
- AWS Bedrock (optional): Used with Amazon Nova Pro as the Judge LLM (for this demo)
- Vega (with altair): Visualization library for analyzing and illustrating GenAI testing metrics
This program was tested using Python 3.10.16 on an Apple M1 Mac running macOS Ventura 13.7.1.

    python3 --version
    # Python 3.10.16
    pip3 --version
    # pip 23.0.1 from /Users/{your_username}/.pyenv/versions/3.10.16/lib/python3.10/site-packages/pip (python 3.10)
    uname -a
    # Darwin {your_machine_name} 22.6.0 Darwin Kernel Version 22.6.0: Thu Sep 5 20:47:01 PDT 2024; root:xnu-8796.141.3.708.1~1/RELEASE_ARM64_T6000 arm64
    sw_vers
    # ProductName: macOS
    # ProductVersion: 13.7.1
    # BuildVersion: 22H221
    uname -m
    # arm64
    sysctl -n machdep.cpu.brand_string
    # Apple M1 Max

- Install Python3.
- Ensure you have an AWS account and valid AWS credentials.
- Configure your AWS credentials for your default AWS profile using the `aws configure` command (they are stored inside the `$HOME/.aws` folder).
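
To confirm that the default profile is in place before running the demo, a small check like the one below can help. It is a sketch under stated assumptions: `has_default_profile` is a hypothetical helper, and it only inspects the shared credentials file (`$HOME/.aws/credentials`) for a `[default]` section with both key fields. Credentials supplied through environment variables or other mechanisms are not detected.

```python
import configparser
from pathlib import Path


def has_default_profile(credentials_path: Path) -> bool:
    """Return True if the file has a [default] profile with both key fields."""
    # Interpolation is disabled because secret values may contain '%'.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(credentials_path)  # a missing file is silently skipped
    if not parser.has_section("default"):
        return False
    section = parser["default"]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


if __name__ == "__main__":
    path = Path.home() / ".aws" / "credentials"
    print(f"{path}: default profile configured = {has_default_profile(path)}")
```
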

    # Clone this repository
    git clone https://github.com/tech-magic/gen-ai-validator.git
    cd gen-ai-validator
    # Create your own python virtual environment
    python3 -m venv llm-testing-venv
    source llm-testing-venv/bin/activate
    # Install all requirements into the python virtual environment
    pip install -r requirements.txt
    # Run the app from the python virtual environment
    python main.py

AWS is not a mandatory requirement for running deepeval test cases.
It is listed as a prerequisite here ONLY because Amazon Nova Pro is used as the JudgeLLM for this demo (see `modules/judge_llm.py`).
You can choose any trusted LLM as your JudgeLLM from the following list in the deepeval library:
https://github.com/confident-ai/deepeval/tree/main/deepeval/models/llms
The available options for choosing a JudgeLLM (as per the above link) include:
- OpenAI
- Deepseek
- Gemini
- AWS
- Azure
- Ollama
- and more...

You can even create your own custom JudgeLLM by extending the class `deepeval.models.DeepEvalBaseModel`.
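
A custom JudgeLLM is essentially a thin wrapper that takes a prompt and returns the judge's text. The sketch below shows that shape without importing deepeval, so it runs on its own. `JudgeBase` is a hypothetical stand-in for the deepeval base class, and the method names (`load_model`, `generate`, `a_generate`, `get_model_name`) are an assumption about what that base class expects; check the deepeval source for the exact abstract methods in your installed version. `KeywordJudge` is a toy rule-based judge, not a real model.

```python
import asyncio
from abc import ABC, abstractmethod


class JudgeBase(ABC):
    """Hypothetical stand-in for the deepeval base class (assumed interface)."""

    @abstractmethod
    def load_model(self): ...

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    async def a_generate(self, prompt: str) -> str: ...

    @abstractmethod
    def get_model_name(self) -> str: ...


class KeywordJudge(JudgeBase):
    """Toy judge: answers 'yes' when the prompt mentions a known keyword."""

    def __init__(self, keywords):
        self.keywords = {k.lower() for k in keywords}

    def load_model(self):
        return self  # there is no underlying model to load

    def generate(self, prompt: str) -> str:
        words = set(prompt.lower().split())
        return "yes" if words & self.keywords else "no"

    async def a_generate(self, prompt: str) -> str:
        # The async variant simply reuses the synchronous logic.
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "keyword-judge (toy)"


judge = KeywordJudge(["paris"])
print(judge.get_model_name(), judge.generate("Is Paris the capital of France?"))
print(asyncio.run(judge.a_generate("Is Tokyo the capital of Japan?")))
```

In a real integration, `generate` would forward the prompt to your model's API and return its text response; the rest of the class stays the same shape.
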
Each test case contains the following fields:
| Key | Type | Required | Description |
|---|---|---|---|
| `input_prompt` | `str` | Yes | The question or instruction given to the AI model. |
| `expected_answer` | `str` | Yes | The correct or reference answer (ground truth). |
| `ai_generated_answer` | `str` | Yes | The actual response produced by the AI model. |
| `test_case_group` | `str` | No | Optional logical grouping (e.g., "General Knowledge", "Math", "Geography"). Defaults to `"default"`. |
| `test_case_name` | `str` | No | Optional unique name for the test case. Defaults to the `input_prompt` value. |
| `background_context` | `str` | No | Optional textual context or reference document supporting the expected answer. If not provided, it is auto-generated as a combination of `input_prompt` and `expected_answer`. |
| `ai_inferred_context` | `str` | No | Optional textual context inferred from the AI's reasoning. If not provided, it is auto-generated as a combination of `input_prompt` and `ai_generated_answer`. |
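
The defaulting rules in the table can be sketched as a small dataclass. This is an illustration, not the repository's implementation: `CaseRecord` is a hypothetical name, and joining the two fields with a single space is an assumption about what "combination" means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseRecord:
    """Hypothetical container mirroring the test case fields."""
    input_prompt: str
    expected_answer: str
    ai_generated_answer: str
    test_case_group: str = "default"
    test_case_name: Optional[str] = None
    background_context: Optional[str] = None
    ai_inferred_context: Optional[str] = None

    def __post_init__(self):
        # The name falls back to the prompt itself.
        if self.test_case_name is None:
            self.test_case_name = self.input_prompt
        # Assumption: "combination" is modelled as a space-joined string.
        if self.background_context is None:
            self.background_context = f"{self.input_prompt} {self.expected_answer}"
        if self.ai_inferred_context is None:
            self.ai_inferred_context = f"{self.input_prompt} {self.ai_generated_answer}"


case = CaseRecord(
    input_prompt="What is the capital of France?",
    expected_answer="Paris",
    ai_generated_answer="The capital of France is Paris.",
)
print(case.test_case_group, "|", case.background_context)
```
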
All metrics produce a score between 0.0 and 1.0, where 1.0 indicates perfect alignment with the evaluation goal.
Each metric includes a Reason that explains the score, showing what the AI did well or where it failed.

Answer Relevancy
- Purpose: Measures how relevant the AI-generated answer is to the original input prompt.
- Scoring: 0.0 to 1.0
- Reason: Explains whether the output directly addresses the user query.

Low scores indicate off-topic or only partially relevant answers.

Faithfulness
- Purpose: Assesses whether the AI output is factually correct relative to the source/reference.
- Scoring: 0.0 to 1.0
- Reason: Highlights factual errors or unsupported claims.

High scores mean the answer faithfully reflects the expected content.

Contextual Precision
- Purpose: Measures the proportion of relevant content in the AI output compared to all content generated.
- Scoring: 0.0 to 1.0
- Reason: Indicates whether irrelevant or extra content was produced.

Low precision means unnecessary or inaccurate information was included.

Contextual Recall
- Purpose: Measures how much of the expected information is actually present in the AI output.
- Scoring: 0.0 to 1.0
- Reason: Explains which key points were missing.

High recall means most of the expected content was covered.
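
The two proportions described for precision and recall can be illustrated with plain sets. This is a simplified sketch: in the actual metrics the Judge LLM decides which statements are relevant or covered, whereas here each statement is an exact string, and `precision` and `recall` are hypothetical helper names.

```python
def precision(generated: set, relevant: set) -> float:
    """Share of the generated statements that are relevant."""
    return len(generated & relevant) / len(generated) if generated else 0.0


def recall(generated: set, expected: set) -> float:
    """Share of the expected statements that appear in the output."""
    return len(generated & expected) / len(expected) if expected else 0.0


expected = {"paris is the capital", "paris is on the seine"}
generated = {"paris is the capital", "paris has 20 districts", "berlin is in germany"}

# One of three generated statements is relevant; one of two expected is covered.
p = precision(generated, expected)
r = recall(generated, expected)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here the output contains extra statements (lowering precision) and misses one expected point (lowering recall), which is the kind of detail the Reason field is meant to explain.
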

Contextual Relevancy
- Purpose: Combines Precision and Recall to evaluate how relevant and complete the content is in context.
- Scoring: 0.0 to 1.0
- Reason: Highlights missing or irrelevant pieces, giving a holistic view of content quality.

Hallucination
- Purpose: Detects whether the AI produced fabricated or unsupported content.
- Scoring: 0.0 to 1.0
- Reason: Identifies speculative or false content when the score is low.

A high score means no hallucinations.

Summarization
- Purpose: Evaluates how well the AI output summarizes the provided context or expected answer.
- Scoring: 0.0 to 1.0
- Reason: Indicates whether key points were retained or lost, and whether the summary is concise and coherent.
