# DeepEval: Evaluating AI Agents with LLM-as-Judge

**A Complete Guide to Goldens, Datasets, and Metrics**

[![DeepEval](https://img.shields.io/badge/DeepEval-Latest-purple.svg)](https://docs.confident-ai.com/)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)

---

## What is DeepEval?

[DeepEval](https://docs.confident-ai.com/) is an open-source LLM evaluation framework that uses the **LLM-as-Judge** methodology: a judge model scores your application's outputs against defined criteria.

### Key Concepts

| Concept | Description |
|---------|-------------|
| **Golden** | Blueprint/template for test cases (input + expected data) |
| **LLMTestCase** | Actual test case with LLM outputs |
| **EvaluationDataset** | Collection of Goldens or TestCases |
| **Metrics** | LLM-as-Judge evaluators (Relevancy, Faithfulness, etc.) |

### DeepEval is Framework-Agnostic

DeepEval works with **any** LLM framework:
- Agno (this project)
- LangChain
- LlamaIndex
- Custom implementations

What matters is: `input` → `actual_output` → `context`
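
As a rough illustration of that contract, the sketch below normalizes two differently shaped agent callables into one record. `AgentRun`, `normalize`, the assumed dict shape, and both toy agents are hypothetical names for this example; none of them belong to DeepEval or to any agent framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class AgentRun:
    """Framework-neutral record of one agent call (hypothetical helper)."""
    input: str
    actual_output: str
    context: List[str] = field(default_factory=list)


def normalize(agent: Callable[[str], Any], prompt: str) -> AgentRun:
    """Call any agent and coerce its result into an AgentRun."""
    result = agent(prompt)
    if isinstance(result, str):
        # Agent returned plain text and exposed no context.
        return AgentRun(input=prompt, actual_output=result)
    # Assumed dict shape: {"text": ..., "sources": [...]}.
    return AgentRun(
        input=prompt,
        actual_output=str(result["text"]),
        context=[str(s) for s in result.get("sources", [])],
    )


def text_agent(prompt: str) -> str:
    """Toy agent that returns a bare string."""
    return f"echo: {prompt}"


def dict_agent(prompt: str) -> dict:
    """Toy agent that returns text plus the sources it used."""
    return {"text": f"answer to {prompt}", "sources": ["doc-1"]}


runs = [normalize(text_agent, "hi"), normalize(dict_agent, "hi")]
```

Whatever the framework returns, only these three fields need to reach the evaluation step.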

---

## 1. Installation & Setup

In [None]:
# Install DeepEval (uncomment if needed)
# !pip install -q deepeval openai pandas python-dotenv

In [None]:
import os
import json
from typing import List, Dict, Any, Optional
from dataclasses import dataclass

# DeepEval imports
from deepeval import evaluate, assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval,
)
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

print("DeepEval imported successfully!")

In [None]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Verify API key
assert os.getenv("OPENAI_API_KEY"), "Please set OPENAI_API_KEY"

# Configuration
EVAL_MODEL = os.getenv("DEEPEVAL_MODEL", "gpt-4o-mini")

print(f"Evaluation model: {EVAL_MODEL}")

## 2. Understanding Goldens vs Test Cases

### Best Practice from DeepEval Documentation

> "Think of goldens as 'pending test cases' - they contain all the input data and expected results, but are missing the dynamic elements (actual_output, retrieval_context) that will be generated when your LLM processes them."

**Workflow:**
1. Define **Goldens** with inputs and expected behavior
2. Run LLM/Agent to generate **actual_output**
3. Convert to **LLMTestCase** for evaluation

```
Golden (template) → LLM Execution → LLMTestCase (with actual output) → Metrics Evaluation
```

## 3. Define Goldens (Test Templates)

Following [DeepEval best practices](https://www.confident-ai.com/docs/llm-evaluation/core-concepts/test-cases-goldens-datasets):
- Start with ~100 test cases
- Include diverse real-world inputs
- Vary complexity levels
- Cover edge cases
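
Before a golden set grows toward that size, it can help to check it mechanically. The following is a minimal sketch in plain Python; `check_goldens` and the required-key set are assumptions chosen to match the dictionary layout used in this notebook, not a DeepEval API.

```python
from collections import Counter
from typing import Dict, List

# Assumed minimal schema for the golden dictionaries in this notebook.
REQUIRED_KEYS = {"id", "input", "expected_output", "complexity"}


def check_goldens(goldens: List[Dict]) -> Dict[str, int]:
    """Raise on missing keys or duplicate ids; return counts per complexity."""
    seen = set()
    for g in goldens:
        missing = REQUIRED_KEYS - set(g)
        if missing:
            raise ValueError(f"{g.get('id', '<no id>')}: missing {sorted(missing)}")
        if g["id"] in seen:
            raise ValueError(f"duplicate golden id: {g['id']}")
        seen.add(g["id"])
    return dict(Counter(g["complexity"] for g in goldens))


sample = [
    {"id": "a", "input": "q1", "expected_output": "x", "complexity": "simple"},
    {"id": "b", "input": "q2", "expected_output": "y", "complexity": "edge_case"},
    {"id": "c", "input": "q3", "expected_output": "z", "complexity": "simple"},
]
complexity_counts = check_goldens(sample)
```

The returned counts give a quick view of how evenly the set covers each complexity level.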

In [None]:
# Define Goldens - templates for test cases
# These contain input + expected behavior, NOT actual_output

GOLDENS_DEFINITION = [
    {
        "id": "loan_calculation_basic",
        "input": "Calculate my monthly payment for a $50,000 loan at 5% annual interest for 36 months.",
        "expected_output": "Monthly payment around $1,498.88",
        "expected_tools": ["calculate_loan_payment"],
        "expected_keywords": ["1,498", "1498", "monthly"],
        "complexity": "simple",
    },
    {
        "id": "eligibility_check",
        "input": "Check my loan eligibility: I'm 25 years old, monthly income $6000, credit score 720, requesting $30,000 loan. I work full-time.",
        "expected_output": "Eligible for the loan with good credit score",
        "expected_tools": ["check_loan_eligibility"],
        "expected_keywords": ["eligible", "720", "good"],
        "complexity": "simple",
    },
    {
        "id": "affordability_analysis",
        "input": "Can I afford a $40,000 loan at 6% interest for 48 months? My monthly income is $5,500 and I have $500 in existing debt.",
        "expected_output": "Affordability analysis with DTI ratio calculation",
        "expected_tools": ["check_loan_affordability"],
        "expected_keywords": ["afford", "DTI", "debt"],
        "complexity": "moderate",
    },
    {
        "id": "edge_case_low_income",
        "input": "I earn $2000/month. Can I get a $100,000 loan?",
        "expected_output": "Not eligible or not affordable due to income constraints",
        "expected_tools": ["check_loan_eligibility", "check_loan_affordability"],
        "expected_keywords": ["income", "not", "afford"],
        "complexity": "edge_case",
    },
]

print(f"Defined {len(GOLDENS_DEFINITION)} Golden templates:")
for g in GOLDENS_DEFINITION:
    print(f"  [{g['complexity']}] {g['id']}")

## 4. Simulate Agent Execution

In production, you would run your actual agent here.

For this demo, we use **pre-collected responses** to show the evaluation workflow without requiring the full agent setup.
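
If a live agent is available, a dictionary of the same shape could be produced by a loop like the one sketched here. It assumes the agent is a callable returning a dict with `text`, `tools`, and `tool_outputs` keys; that return shape, `collect_responses`, and `fake_agent` are hypothetical stand-ins for whatever your framework exposes.

```python
import json
from typing import Callable, Dict, List


def collect_responses(
    goldens: List[Dict], agent: Callable[[str], Dict]
) -> Dict[str, Dict]:
    """Run the agent on every golden and store results keyed by golden id."""
    responses: Dict[str, Dict] = {}
    for g in goldens:
        try:
            result = agent(g["input"])
        except Exception as exc:  # keep going so one failure does not abort the run
            print(f"  Warning: agent failed on {g['id']}: {exc}")
            continue
        responses[g["id"]] = {
            "actual_output": result["text"],
            "tools_called": list(result.get("tools", [])),
            # Serialize each tool output so it can serve as retrieval_context.
            "retrieval_context": [
                json.dumps(o) for o in result.get("tool_outputs", [])
            ],
        }
    return responses


def fake_agent(prompt: str) -> Dict:
    """Stand-in agent that fails on empty input."""
    if not prompt:
        raise ValueError("empty prompt")
    return {"text": "ok", "tools": ["calc"], "tool_outputs": [{"value": 1}]}


demo = collect_responses(
    [{"id": "g1", "input": "hello"}, {"id": "g2", "input": ""}],
    fake_agent,
)
```

Goldens whose run fails are simply absent from the result, so they are skipped rather than scored against an empty output.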

In [None]:
# Pre-collected agent responses (simulating agent execution)
# In production: actual_output = agent.run(golden["input"])

AGENT_RESPONSES = {
    "loan_calculation_basic": {
        "actual_output": """Based on your loan details:

**Loan Amount**: $50,000.00
**Interest Rate**: 5.00% per year
**Term**: 36 months

### Monthly Payment: $1,498.88

**Total Payment**: $53,959.68
**Total Interest**: $3,959.68
**Interest as % of Principal**: 7.9%""",
        "tools_called": ["calculate_loan_payment"],
        "retrieval_context": [
            json.dumps({
                "monthly_payment": 1498.88,
                "total_payment": 53959.68,
                "total_interest": 3959.68,
                "loan_amount": 50000,
                "annual_interest_rate": 0.05,
                "loan_term_months": 36
            })
        ]
    },
    "eligibility_check": {
        "actual_output": """Congratulations! Based on your profile, you ARE eligible for the $30,000 loan.

**Your Profile**:
- Age: 25 (meets requirement: 18-65)
- Monthly Income: $6,000 (exceeds minimum: $3,000)
- Credit Score: 720 (Good - above 670 threshold)
- Employment: Full-time (stable)

Your debt-to-income ratio is healthy at 15%, well below the 50% maximum.""",
        "tools_called": ["check_loan_eligibility"],
        "retrieval_context": [
            json.dumps({
                "is_eligible": True,
                "credit_rating": "Good",
                "dti_ratio": 0.15,
                "max_loan_amount": 50000,
                "reasons": ["Age OK", "Income OK", "Credit OK"]
            })
        ]
    },
    "affordability_analysis": {
        "actual_output": """Based on affordability analysis:

**Loan Details**:
- Amount: $40,000 at 6% for 48 months
- Monthly Payment: $939.15

**Your Financial Situation**:
- Monthly Income: $5,500
- Existing Debt: $500/month
- New Loan Payment: $939.15/month
- Total Debt: $1,439.15/month

**DTI Ratio**: 26.2% (Below 50% threshold)

Yes, this loan is AFFORDABLE for you.""",
        "tools_called": ["check_loan_affordability"],
        "retrieval_context": [
            json.dumps({
                "is_affordable": True,
                "monthly_payment": 939.15,
                "current_dti": 0.091,
                "new_dti": 0.262,
                "max_dti": 0.5
            })
        ]
    },
    "edge_case_low_income": {
        "actual_output": """Based on your income and requested loan amount:

**Analysis**:
- Monthly Income: $2,000
- Requested Loan: $100,000

Unfortunately, this loan is NOT affordable for you.

The monthly payment would be approximately $2,300, which exceeds your income.
Your DTI ratio would be over 100%, far exceeding the 50% maximum.

**Recommendation**: Consider a smaller loan amount that fits within 50% of your income.""",
        "tools_called": ["check_loan_eligibility", "check_loan_affordability"],
        "retrieval_context": [
            json.dumps({
                "is_eligible": False,
                "is_affordable": False,
                "reason": "Income too low for requested amount",
                "projected_dti": 1.15
            })
        ]
    }
}

print(f"Loaded {len(AGENT_RESPONSES)} pre-collected agent responses")

## 5. Create Golden → Dataset Workflow

This follows [DeepEval's recommended workflow](https://deepeval.com/docs/evaluation-datasets):

1. Loop through goldens
2. Generate actual_output from LLM/Agent
3. Create test cases with proper context

In [None]:
def create_evaluation_dataset(
    goldens_def: List[Dict],
    agent_responses: Dict[str, Dict]
) -> EvaluationDataset:
    """
    Create EvaluationDataset from Golden definitions + Agent responses.
    
    This follows DeepEval best practice:
    - Goldens define expected behavior
    - actual_output is generated at evaluation time
    - retrieval_context holds the tool outputs that Faithfulness (and, via
      `context`, Hallucination) is checked against
    """
    goldens = []
    
    for golden_def in goldens_def:
        test_id = golden_def["id"]
        
        # Get agent response (in production: run agent here)
        response = agent_responses.get(test_id, {})
        
        if not response:
            print(f"  Warning: No response for {test_id}, skipping")
            continue
        
        # Populate the Golden with the generated output (actual_output is optional on Golden)
        golden = Golden(
            input=golden_def["input"],
            actual_output=response["actual_output"],
            expected_output=golden_def.get("expected_output", ""),
            retrieval_context=response.get("retrieval_context", []),
            additional_metadata={
                "test_id": test_id,
                "complexity": golden_def.get("complexity", "unknown"),
                "expected_tools": golden_def.get("expected_tools", []),
                "actual_tools_called": response.get("tools_called", []),
                "expected_keywords": golden_def.get("expected_keywords", []),
            }
        )
        goldens.append(golden)
    
    return EvaluationDataset(goldens=goldens)


# Create dataset
print("Creating EvaluationDataset...")
dataset = create_evaluation_dataset(GOLDENS_DEFINITION, AGENT_RESPONSES)
print(f"Created dataset with {len(dataset.goldens)} goldens")

## 6. Convert Goldens to Test Cases

DeepEval metrics score `LLMTestCase` objects, so each populated Golden is converted before evaluation.

In [None]:
def goldens_to_test_cases(dataset: EvaluationDataset) -> List[LLMTestCase]:
    """
    Convert Goldens to LLMTestCase for metric evaluation.
    
    Key mappings:
    - context: Used by HallucinationMetric
    - retrieval_context: Used by FaithfulnessMetric
    """
    return [
        LLMTestCase(
            input=g.input,
            actual_output=g.actual_output,
            expected_output=g.expected_output,
            context=g.retrieval_context,          # For HallucinationMetric
            retrieval_context=g.retrieval_context, # For FaithfulnessMetric
        )
        for g in dataset.goldens
    ]


test_cases = goldens_to_test_cases(dataset)
print(f"Converted to {len(test_cases)} LLMTestCase objects")

## 7. Define Evaluation Metrics

### Standard Metrics

| Metric | Purpose | Passing Score |
|--------|---------|---------------|
| **AnswerRelevancy** | Response addresses the question | >= 0.7 |
| **Faithfulness** | Response grounded in context | >= 0.7 |
| **Hallucination** | No made-up information (lower is better) | <= 0.5 |
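
The direction of the comparison differs by metric: a Hallucination score passes when it is at or below its threshold, while the other two pass at or above theirs. A minimal sketch of that rule follows, using a hypothetical `passes` helper and made-up scores rather than DeepEval's own `is_successful()`.

```python
# Metrics for which a LOWER score is better.
LOWER_IS_BETTER = {"Hallucination"}


def passes(metric_name: str, score: float, threshold: float) -> bool:
    """Apply the threshold in the direction appropriate for the metric."""
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold
    return score >= threshold


# Illustrative scores only; they are not results from this notebook.
checks = {
    "AnswerRelevancy": passes("AnswerRelevancy", 0.82, 0.7),
    "Faithfulness": passes("Faithfulness", 0.65, 0.7),
    "Hallucination": passes("Hallucination", 0.10, 0.5),
}
```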

### Custom Metrics with G-Eval

[G-Eval](https://deepeval.com/docs/metrics-llm-evals) allows custom evaluation criteria.

In [None]:
# Standard metrics
standard_metrics = [
    AnswerRelevancyMetric(threshold=0.7, model=EVAL_MODEL),
    FaithfulnessMetric(threshold=0.7, model=EVAL_MODEL),
    HallucinationMetric(threshold=0.5, model=EVAL_MODEL),
]

# Custom metric: Financial Accuracy (using G-Eval)
financial_accuracy_metric = GEval(
    name="Financial Accuracy",
    criteria="""Evaluate whether the financial calculations and advice are accurate:
    1. Are numerical calculations correct?
    2. Are financial terms used correctly?
    3. Is the advice financially sound?""",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
    model=EVAL_MODEL
)

# Combine all metrics
all_metrics = standard_metrics + [financial_accuracy_metric]

print("Configured metrics:")
for m in all_metrics:
    name = m.name if hasattr(m, 'name') else m.__class__.__name__
    print(f"  - {name} (threshold: {m.threshold})")

## 8. Run Evaluation

In [None]:
print("Running DeepEval evaluation...")
print("=" * 60)

# Run batch evaluation
results = evaluate(
    test_cases=test_cases,
    metrics=all_metrics,
)

print("\n" + "=" * 60)
print("Evaluation Complete!")

## 9. Detailed Results Analysis

In [None]:
import pandas as pd

# Collect detailed results
results_data = []

print("\n" + "=" * 60)
print("DETAILED RESULTS BY TEST CASE")
print("=" * 60)

for tc, golden in zip(test_cases, dataset.goldens):
    test_id = golden.additional_metadata["test_id"]
    complexity = golden.additional_metadata["complexity"]
    
    print(f"\n[{complexity.upper()}] {test_id}")
    print("-" * 40)
    
    row = {
        "test_id": test_id,
        "complexity": complexity,
    }
    
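    for metric in all_metrics:
        # Re-score each case individually so per-metric scores can be tabulated.
        # Note: this repeats the judge calls already made by evaluate() above.
        metric.measure(tc)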
        metric_name = metric.name if hasattr(metric, 'name') else metric.__class__.__name__.replace("Metric", "")
        status = "PASS" if metric.is_successful() else "FAIL"
        print(f"  {metric_name}: {metric.score:.2f} ({status})")
        
        row[metric_name] = metric.score
        row[f"{metric_name}_pass"] = metric.is_successful()
    
    results_data.append(row)

# Create DataFrame
df = pd.DataFrame(results_data)

In [None]:
# Summary statistics
print("\n" + "=" * 60)
print("SUMMARY STATISTICS")
print("=" * 60)

# Score columns
score_cols = [c for c in df.columns if not c.endswith('_pass') and c not in ['test_id', 'complexity']]

print("\nAverage Scores:")
for col in score_cols:
    avg = df[col].mean()
    print(f"  {col}: {avg:.2f}")

# Pass rates
print("\nPass Rates:")
pass_cols = [c for c in df.columns if c.endswith('_pass')]
for col in pass_cols:
    rate = df[col].mean()
    print(f"  {col.replace('_pass', '')}: {rate:.1%}")

# Overall
overall = df[pass_cols].all(axis=1).mean()
print(f"\nOverall Pass Rate: {overall:.1%}")

In [None]:
# Results by complexity
print("\n" + "=" * 60)
print("RESULTS BY COMPLEXITY")
print("=" * 60)

for complexity in df['complexity'].unique():
    subset = df[df['complexity'] == complexity]
    print(f"\n{complexity.upper()} ({len(subset)} tests):")
    for col in score_cols:
        avg = subset[col].mean()
        print(f"  {col}: {avg:.2f}")

## 10. Tool Call Validation

Verify the agent called the expected tools.
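
The check in this section passes when the expected tools are a subset of those called, so extra tool calls go unnoticed. Where that matters, a comparison could also report missing and unexpected tools; `compare_tools` below is a hypothetical helper sketched for that purpose.

```python
from typing import Dict, Iterable


def compare_tools(expected: Iterable[str], actual: Iterable[str]) -> Dict[str, object]:
    """Report missing and unexpected tool calls, ignoring order and repeats."""
    exp, act = set(expected), set(actual)
    return {
        "missing": sorted(exp - act),      # expected but never called
        "unexpected": sorted(act - exp),   # called but not expected
        "exact_match": exp == act,
    }


report = compare_tools(
    expected=["check_loan_eligibility", "check_loan_affordability"],
    actual=["check_loan_eligibility", "calculate_loan_payment"],
)
```

Sorting the two lists keeps the report stable across runs, which plain `list(set(...))` does not guarantee.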

In [None]:
print("\n" + "=" * 60)
print("TOOL CALL VALIDATION")
print("=" * 60)

tool_results = []

for golden in dataset.goldens:
    meta = golden.additional_metadata
    test_id = meta["test_id"]
    expected = set(meta.get("expected_tools", []))
    actual = set(meta.get("actual_tools_called", []))
    
    # Pass if every expected tool was called (extra tool calls are allowed)
    match = expected.issubset(actual)
    status = "PASS" if match else "FAIL"
    
    print(f"\n{test_id}: {status}")
    print(f"  Expected: {list(expected)}")
    print(f"  Actual:   {list(actual)}")
    
    tool_results.append({
        "test_id": test_id,
        "expected": list(expected),
        "actual": list(actual),
        "match": match
    })

# Summary
tool_pass_rate = sum(r["match"] for r in tool_results) / len(tool_results)
print(f"\nTool Call Accuracy: {tool_pass_rate:.1%}")

## 11. Keyword Validation
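
The check in this section uses plain substring matching, which can pass by accident: a short keyword such as "not" also matches "note". One possible refinement, sketched with a hypothetical `keyword_coverage` helper, matches whole words only and reports the fraction of keywords found.

```python
import re
from typing import List


def keyword_coverage(output: str, keywords: List[str]) -> float:
    """Fraction of keywords that appear as whole words (case-insensitive)."""
    if not keywords:
        return 1.0  # nothing to check
    hits = 0
    for kw in keywords:
        # The lookarounds act as word boundaries that also work for "1,498".
        pattern = rf"(?<!\w){re.escape(kw)}(?!\w)"
        if re.search(pattern, output, flags=re.IGNORECASE):
            hits += 1
    return hits / len(keywords)


text = "Please note: the monthly payment is $1,498.88."
kw_score = keyword_coverage(text, ["not", "monthly", "1,498"])
```

Here "monthly" and "1,498" are found while "not" is rejected, so two of the three keywords count as hits.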

In [None]:
print("\n" + "=" * 60)
print("KEYWORD VALIDATION")
print("=" * 60)

keyword_results = []

for golden in dataset.goldens:
    meta = golden.additional_metadata
    test_id = meta["test_id"]
    expected_kw = meta.get("expected_keywords", [])
    output = golden.actual_output.lower()
    
    # Case-insensitive substring match; a case passes if ANY expected keyword appears
    found = [kw for kw in expected_kw if kw.lower() in output]
    match = bool(found) if expected_kw else True
    status = "PASS" if match else "FAIL"
    
    print(f"\n{test_id}: {status}")
    print(f"  Expected: {expected_kw}")
    print(f"  Found:    {found}")
    
    keyword_results.append({"test_id": test_id, "match": match})

kw_pass_rate = sum(r["match"] for r in keyword_results) / len(keyword_results)
print(f"\nKeyword Match Rate: {kw_pass_rate:.1%}")

## 12. Export Results

In [None]:
# Final results table
print("\n" + "=" * 60)
print("FINAL RESULTS TABLE")
print("=" * 60)
print(df.to_string(index=False))

# Optionally save to CSV
# df.to_csv("evaluation_results.csv", index=False)
# print("\nResults saved to evaluation_results.csv")

## 13. Key Takeaways

### DeepEval Best Practices (Summary)

1. **Framework-Agnostic**: Works with Agno, LangChain, or any LLM framework
2. **Golden → TestCase Workflow**: Define templates, generate outputs at eval time
3. **Context is Critical**: `retrieval_context` feeds Faithfulness and `context` feeds Hallucination; this notebook fills both with the tool outputs
4. **Multiple Metrics**: Combine standard + custom (G-Eval) metrics
5. **Start with ~100 cases**: Diverse, varying complexity, include edge cases

### For AI Agent Evaluation

When evaluating agents that use tools:
- **Tool outputs = retrieval_context** (ground truth)
- Validate both LLM quality AND tool selection
- Use `additional_metadata` to track tool calls

---

### Resources

- [DeepEval Documentation](https://docs.confident-ai.com/)
- [Evaluation Datasets Guide](https://deepeval.com/docs/evaluation-datasets)
- [G-Eval Custom Metrics](https://deepeval.com/docs/metrics-llm-evals)
- [AI Agent Evaluation Guide](https://deepeval.com/guides/guides-ai-agent-evaluation-metrics)
- [DeepEval GitHub](https://github.com/confident-ai/deepeval)