
tech-magic/gen-ai-validator


🤖 GenAI Output Validation

💡 Problem

You have built, or are evaluating, your own custom LLM (for data privacy or confidentiality reasons), or you want to compare a new LLM of interest against a mature LLM such as ChatGPT, Gemini, or LLaMA.

Validating AI output means measuring how accurate, relevant, and reliable a response from your LLM of interest is for a given input prompt, compared with a reference or ideal answer.

Using a JudgeLLM (a mature, widely accepted LLM such as ChatGPT, Gemini, or LLaMA) lets you automate this evaluation.

📊 About Quantitative Validation

Quantitative validation includes:

  • 🔍 Comparison against a reference or ground-truth answer.
  • 🧮 Scoring AI-generated outputs based on measurable quality metrics.
  • 📈 Aggregation of scores to produce an overall performance rating.
  • 🧠 Reasoned justification explaining why a particular score or judgment was given.
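The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the `MetricResult` record, the `aggregate` helper, and the sample scores are hypothetical, and using a plain arithmetic mean for aggregation is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class MetricResult:
    """One judged score for one test case (hypothetical record type)."""
    test_case: str
    metric: str
    score: float  # between 0.0 and 1.0
    reason: str   # justification for the score


def aggregate(results):
    """Average the scores per metric across test cases (assumption: plain mean)."""
    by_metric = defaultdict(list)
    for result in results:
        by_metric[result.metric].append(result.score)
    return {metric: mean(scores) for metric, scores in by_metric.items()}


# Illustrative values only, not real evaluation output.
results = [
    MetricResult("capital-of-france", "Answer Relevancy", 1.0, "Directly answers the question."),
    MetricResult("capital-of-france", "Faithfulness", 0.5, "One claim is unsupported."),
    MetricResult("largest-planet", "Answer Relevancy", 0.5, "Partly off-topic."),
    MetricResult("largest-planet", "Faithfulness", 1.0, "All claims match the reference."),
]

summary = aggregate(results)
for metric, score in summary.items():
    print(f"{metric}: {score:.2f}")
```

Keeping the reason next to each score means a low aggregate can be traced back to the individual judgments that caused it.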

⚙️🛠️ Solution

flowchart TD

    %% -----------------------------
    %% Core Test Case Structure
    %% -----------------------------
    subgraph TC["LLM Test Case"]
        P["🧩 input_prompt"]
        EA["📘 expected_answer"]
        AGA["🤖 ai_generated_answer"]
        SoT["📄 source_of_truth"]
        AiT["🧠 ai_inferred_truth"]
    end

    %% -----------------------------
    %% Metrics Section
    %% -----------------------------
    subgraph METRICS["GenAI QA Framework"]
        AR["Answer Relevancy"]
        FA["Faithfulness"]
        CP["Contextual Precision"]
        CR["Contextual Recall"]
        CRe["Contextual Relevancy"]
        HA["Hallucination"]
        SU["Summarization"]
    end

    %% -----------------------------
    %% Judge LLM
    %% -----------------------------
    subgraph LLM["Judge LLM"]
        J["AmazonBedrockModel (amazon.nova-pro-v1:0)"]
    end

    %% -----------------------------
    %% Main Flow
    %% -----------------------------
    TC -->|"Passed to"| METRICS
    METRICS -->|"Each metric uses"| LLM
    LLM -->|"Evaluates each test metric for a given LLM Test Case"| METRICS

    %% -----------------------------
    %% Relationships and Outputs
    %% -----------------------------
    METRICS -->|"Generate test scores + reasons"| RES["📊 Metric Results per Test Case"]
    RES -->|"Aggregated across test cases"| AVG["📈 Aggregated Test Metrics"]

    %% -----------------------------
    %% Visual Group Labels
    %% -----------------------------
    classDef main fill:#f9f9f9,stroke:#333,stroke-width:1px;
    classDef block fill:#eef6ff,stroke:#4a90e2,stroke-width:1px;
    classDef judge fill:#fff2cc,stroke:#e6b800,stroke-width:1px;
    classDef result fill:#e6ffed,stroke:#27ae60,stroke-width:1px;

    class TC,SoT,AiT,P,EA,AGA block
    class METRICS,AR,FA,CP,CR,CRe,HA,SU main
    class LLM,J judge
    class RES,AVG result
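The flow in the diagram above (test cases passed to metrics, each metric consulting a judge, results collected per test case) might look roughly like the following. This is a sketch under stated assumptions: the `Judge` signature, `stub_judge`, and `evaluate` are hypothetical names, and the stub replaces the real judge LLM with a trivial exact-match check so the example runs without any external service.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical judge signature: (metric name, test case) -> (score, reason).
Judge = Callable[[str, Dict[str, str]], Tuple[float, str]]

# Metric names taken from the flowchart.
METRICS = [
    "Answer Relevancy",
    "Faithfulness",
    "Contextual Precision",
    "Contextual Recall",
    "Contextual Relevancy",
    "Hallucination",
    "Summarization",
]


def stub_judge(metric: str, test_case: Dict[str, str]) -> Tuple[float, str]:
    """Stand-in for a judge LLM: 1.0 on a case-insensitive exact match, else 0.0."""
    expected = test_case["expected_answer"].strip().lower()
    actual = test_case["ai_generated_answer"].strip().lower()
    if expected == actual:
        return 1.0, f"{metric}: answer matches the expected answer."
    return 0.0, f"{metric}: answer differs from the expected answer."


def evaluate(test_cases: List[Dict[str, str]], judge: Judge) -> List[Dict[str, object]]:
    """Pass every test case to every metric and collect one row per (case, metric)."""
    rows = []
    for test_case in test_cases:
        for metric in METRICS:
            score, reason = judge(metric, test_case)
            rows.append({
                "input_prompt": test_case["input_prompt"],
                "metric": metric,
                "score": score,
                "reason": reason,
            })
    return rows


test_cases = [
    {"input_prompt": "What is 2 + 2?", "expected_answer": "4", "ai_generated_answer": "4"},
    {"input_prompt": "What is the capital of France?", "expected_answer": "Paris", "ai_generated_answer": "Lyon"},
]

rows = evaluate(test_cases, stub_judge)
print(len(rows), "metric results")
```

Passing the judge in as a callable keeps the loop independent of which LLM does the judging.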

📸 Demo

[Image: Demo]


⚙️ Tech Stack

  • Python3 – Core implementation language
  • DeepEval – Framework for extensible GenAI test metrics and customizable Judge LLM integration
  • AWS Bedrock (Optional) – Used with Amazon Nova Pro as the Judge LLM (for this demo)
  • Vega (with Altair) – Visualization library for analyzing and illustrating GenAI testing metrics

📦 Installation Guide

🖥️ Test Environment

This program was tested using Python 3.10.16 on an Apple M1 Mac running macOS Ventura 13.7.1.

python3 --version
# Python 3.10.16

pip3 --version
# pip 23.0.1 from /Users/{your_username}/.pyenv/versions/3.10.16/lib/python3.10/site-packages/pip (python 3.10)

uname -a
# Darwin {your_machine_name} 22.6.0 Darwin Kernel Version 22.6.0: Thu Sep  5 20:47:01 PDT 2024; root:xnu-8796.141.3.708.1~1/RELEASE_ARM64_T6000 arm64

sw_vers
# ProductName:            macOS
# ProductVersion:         13.7.1
# BuildVersion:           22H221

uname -m
# arm64

sysctl -n machdep.cpu.brand_string
# Apple M1 Max

🧩 Pre-requisites

  • 🐍 Install Python3
  • ☁️ Ensure you have an AWS account and valid AWS credentials 🔑
  • ⚙️ Configure the credentials for your default AWS profile with the aws configure command, which stores them in the $HOME/.aws folder 📂.
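Before running the demo, it can help to confirm that a default profile exists. The sketch below is not part of the repository: `has_aws_profile` is a hypothetical helper that only checks for a `[default]` section in the shared credentials file written by `aws configure`. Credentials supplied another way (environment variables, SSO) would not be detected by it.

```python
import configparser
from pathlib import Path


def has_aws_profile(credentials_path, profile="default"):
    """Return True if the credentials file exists and contains the given profile section."""
    path = Path(credentials_path)
    if not path.is_file():
        return False
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser.has_section(profile)


# The shared credentials file normally lives under $HOME/.aws.
default_path = Path.home() / ".aws" / "credentials"
if has_aws_profile(default_path):
    print("Found a [default] AWS profile.")
else:
    print("No [default] AWS profile found; run `aws configure` first.")
```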

🚀 How to Run

# Clone this repository
git clone https://github.com/tech-magic/gen-ai-validator.git
cd gen-ai-validator

# Create your own python virtual environment
python3 -m venv llm-testing-venv
source llm-testing-venv/bin/activate

# Install all requirements into the python virtual environment
pip install -r requirements.txt

# Run the app from the python virtual environment
python main.py

📝 Notes

🧠 AWS is not a mandatory requirement to run deepeval test cases. It is ONLY listed as a pre-requisite here since Amazon-Nova-Pro is being used as the JudgeLLM for this demo (see modules/judge_llm.py).

You can choose any trusted LLM 🤖 as your JudgeLLM from the following list in the deepeval library:

https://github.com/confident-ai/deepeval/tree/main/deepeval/models/llms

🧠 The available options for choosing a JudgeLLM (as per the above link) include:

  • 💬 OpenAI
  • 🔍 DeepSeek
  • 🌐 Gemini
  • ☁️ AWS
  • 🔷 Azure
  • 🧠 Ollama
  • ➕ and more...

💡 You can even create your own custom JudgeLLM 🧩 by extending the class: deepeval.models.DeepEvalBaseModel 🧱


🧩 LLM Test Case Structure

Each test case contains the following fields:

| Key | Type | Required | Description |
| --- | --- | --- | --- |
| input_prompt | str | ✅ | The question or instruction given to the AI model. |
| expected_answer | str | ✅ | The correct or reference answer (ground truth). |
| ai_generated_answer | str | ✅ | The actual response produced by the AI model. |
| test_case_group | str | ❌ | Optional logical grouping (e.g., “General Knowledge”, “Math”, “Geography”). Defaults to "default". |
| test_case_name | str | ❌ | Optional unique name for the test case. Defaults to the input_prompt value. |
| background_context | str | ❌ | Optional textual context or reference document supporting the expected answer. If not provided, it is auto-generated as a combination of input_prompt and expected_answer. |
| ai_inferred_context | str | ❌ | Optional textual context inferred from the AI’s reasoning. If not provided, it is auto-generated as a combination of input_prompt and ai_generated_answer. |
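The defaulting rules in the table can be expressed as a small dataclass. This is a sketch, not the repository's code: the class name `ValidatorTestCase` is hypothetical, and joining the two strings with a newline is an assumption, since the table only says the contexts are "a combination" of the respective fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ValidatorTestCase:
    """Hypothetical container mirroring the fields in the table above."""
    input_prompt: str
    expected_answer: str
    ai_generated_answer: str
    test_case_group: str = "default"
    test_case_name: Optional[str] = None
    background_context: Optional[str] = None
    ai_inferred_context: Optional[str] = None

    def __post_init__(self):
        # The name falls back to the prompt itself.
        if self.test_case_name is None:
            self.test_case_name = self.input_prompt
        # Assumption: "combination" is modelled as a newline-joined string.
        if self.background_context is None:
            self.background_context = f"{self.input_prompt}\n{self.expected_answer}"
        if self.ai_inferred_context is None:
            self.ai_inferred_context = f"{self.input_prompt}\n{self.ai_generated_answer}"


case = ValidatorTestCase(
    input_prompt="What is the capital of France?",
    expected_answer="Paris",
    ai_generated_answer="The capital of France is Paris.",
)
print(case.test_case_group, "|", case.test_case_name)
```

Only the three required fields need to be supplied; every optional field then resolves to the default described in the table.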

🧮 GenAI Test Metrics

All metrics produce a score between 0.0 and 1.0, where 1.0 🎯 indicates perfect alignment with the evaluation goal.
Each metric includes a 💬 Reason that explains the score, showing what the AI did well ✅ or where it failed ❌.

1️⃣ 💬 Answer Relevancy

  • 🎯 Purpose: Measures how relevant the AI-generated answer is to the original input prompt.
  • 📊 Scoring: 0.0 – 1.0
  • 🧠 Reason: Explains whether the output directly addresses the user query.
    🔻 Low scores = off-topic or partially relevant answers.

2️⃣ 📚 Faithfulness

  • 🎯 Purpose: Assesses whether the AI output is factually correct ✅ relative to the source/reference.
  • 📊 Scoring: 0.0 – 1.0
  • 🔍 Reason: Highlights factual errors ❌ or unsupported claims.
    💯 High scores = answer faithfully reflects expected content.

3️⃣ 🎯 Contextual Precision

  • 🎯 Purpose: Measures the proportion of relevant content in the AI output compared to all content generated.
  • 📊 Scoring: 0.0 – 1.0
  • 🔎 Reason: Indicates if irrelevant or extra content was produced.
    ⚠️ Low precision = unnecessary or inaccurate info included.

4️⃣ 📈 Contextual Recall

  • 🎯 Purpose: Measures how much of the expected information is actually present in the AI output.
  • 📊 Scoring: 0.0 – 1.0
  • 🧾 Reason: Explains which key points were missing.
    🚀 High recall = most expected content covered.

5️⃣ 🔗 Contextual Relevancy

  • 🎯 Purpose: Combines Precision 🎯 and Recall 📈 to evaluate how relevant and complete the content is in context.
  • 📊 Scoring: 0.0 – 1.0
  • 🧩 Reason: Highlights missing or irrelevant pieces, giving a holistic view 🌐 of content quality.
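To build intuition for how precision, recall, and a combined score relate, here is a toy calculation over sets of key points. It is only an illustration under stated assumptions: in this framework the scores come from the judge LLM rather than from set arithmetic, the helper names are hypothetical, and combining the two values with a harmonic mean (F1) is an assumption, since the description above does not say how they are combined.

```python
def precision_recall(expected_points, generated_points):
    """Toy precision and recall over sets of key points."""
    expected, generated = set(expected_points), set(generated_points)
    overlap = expected & generated
    # Precision: share of generated points that were expected.
    precision = len(overlap) / len(generated) if generated else 0.0
    # Recall: share of expected points that were generated.
    recall = len(overlap) / len(expected) if expected else 0.0
    return precision, recall


def harmonic_mean(precision, recall):
    """F1-style combination (assumption: one possible way to combine the two)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


expected = {"capital is Paris", "located on the Seine", "in northern France", "largest French city"}
generated = {"capital is Paris", "located on the Seine", "founded in 1900"}

precision, recall = precision_recall(expected, generated)
combined = harmonic_mean(precision, recall)
print(f"precision={precision:.3f} recall={recall:.3f} combined={combined:.3f}")
```

Here two of the three generated points are expected (precision 2/3), and two of the four expected points are covered (recall 1/2), so extra content lowers precision while missing content lowers recall.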

6️⃣ 🚫 Hallucination

  • 🎯 Purpose: Detects whether the AI produced fabricated 🌀 or unsupported content.
  • 📊 Scoring: 0.0 – 1.0
  • 🔍 Reason: High score ✅ = no hallucinations;
    Low score ❌ = speculative or false content was identified.

7️⃣ 📝 Summarization

  • 🎯 Purpose: Evaluates how well the AI output summarizes 📚 the provided context or expected answer.
  • 📊 Scoring: 0.0 – 1.0
  • 🧠 Reason: Indicates whether key points were retained ✨ or lost, and if the summary is concise and coherent 💬.

About

Measuring the Magic: Quantitatively validating AI-Generated output!
