# DeepEval: Evaluating LangChain Agents

**LLM-as-Judge Evaluation for AI Agents**

[![DeepEval](https://img.shields.io/badge/DeepEval-Latest-purple.svg)](https://docs.confident-ai.com/)
[![LangChain](https://img.shields.io/badge/LangChain-0.2+-green.svg)](https://python.langchain.com/)

---

## What This Notebook Covers

1. **Golden → Dataset → Evaluation** workflow with DeepEval
2. **LLM-as-Judge metrics**: Answer Relevancy, Faithfulness, Hallucination
3. **Multi-type loan agent**: Personal, Mortgage, Auto loans
4. **Tool output as context**: Using tool results for evaluation

### Loan Advisor Agent Tools

| Tool | Description |
|------|-------------|
| `calculate_personal_loan` | Personal loan payment calculation |
| `calculate_mortgage` | Home loan with down payment & LTV |
| `calculate_auto_loan` | Car loan with trade-in support |
| `check_loan_eligibility` | Credit & income eligibility check |
| `check_affordability` | DTI-based affordability analysis |
| `compare_loan_options` | Compare different loan terms |
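
The table mentions two ratios, LTV (loan-to-value) and DTI (debt-to-income), without defining them. As a minimal sketch, assuming the conventional definitions, they could be computed as shown here. The function names are hypothetical and are not taken from `langchain_tools.py`, which may apply its own rounding and thresholds.

```python
def loan_to_value(home_price: float, down_payment: float) -> float:
    """Loan-to-value ratio in percent: loan amount / property value."""
    if home_price <= 0:
        raise ValueError("home_price must be positive")
    loan_amount = home_price - down_payment
    return 100.0 * loan_amount / home_price


def debt_to_income(monthly_debt: float, monthly_income: float) -> float:
    """Debt-to-income ratio in percent: monthly debt payments / monthly income."""
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    return 100.0 * monthly_debt / monthly_income


# $400,000 house with 10% ($40,000) down
print(loan_to_value(400_000, 40_000))
# $1,500 of existing debt on $4,000 income, before any new loan payment
print(debt_to_income(1_500, 4_000))
```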

---

## 1. Setup

In [None]:
# Install dependencies (uncomment for Kaggle/Colab)
# !pip install -q deepeval langchain langchain-openai langgraph pandas python-dotenv

In [None]:
import os
import json
from typing import Dict, Any

# =============================================================================
# API Key Setup (Kaggle / Colab / Local)
# =============================================================================

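def setup_api_key():
    """Load OPENAI_API_KEY from the environment, Kaggle, Colab, or a .env file.

    Returns the name of the source used, or None if no key was found.
    """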
    if os.getenv("OPENAI_API_KEY"):
        return "environment"
    
    try:  # Kaggle
        from kaggle_secrets import UserSecretsClient
        os.environ["OPENAI_API_KEY"] = UserSecretsClient().get_secret("OPENAI_API_KEY")
        return "Kaggle Secrets"
    except Exception:  # not running on Kaggle, or secret not set
        pass
    
    try:  # Colab
        from google.colab import userdata
        os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
        return "Colab Secrets"
    except Exception:  # not running on Colab, or secret not set
        pass
    
    try:  # Local .env
        from dotenv import load_dotenv
        load_dotenv()
        if os.getenv("OPENAI_API_KEY"):
            return ".env file"
    except ImportError:  # python-dotenv not installed
        pass
    
    return None

source = setup_api_key()
if source:
    print(f"API key loaded from {source}")
else:
    raise ValueError("OPENAI_API_KEY not found. Set via Kaggle/Colab Secrets or .env file.")

## 2. Import Loan Advisor Tools

Tools are defined in `langchain_tools.py`, a separate module that keeps tool logic out of the notebook.
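
The module itself is not shown in this notebook. As a rough sketch of the arithmetic a payment tool needs, the standard fixed-rate amortization formula can be written in a few lines. The helper name `monthly_payment` is hypothetical; the actual functions in `langchain_tools.py` may differ in signature, rounding, and extra fields.

```python
def monthly_payment(principal: float, annual_rate_pct: float, months: int) -> float:
    """Fixed-rate amortized payment: P * r / (1 - (1 + r) ** -n)."""
    if months <= 0:
        raise ValueError("months must be positive")
    r = annual_rate_pct / 100 / 12  # monthly interest rate
    if r == 0:
        return principal / months  # interest-free loan
    return principal * r / (1 - (1 + r) ** -months)


# Two scenarios used later in the goldens
print(round(monthly_payment(25_000, 10, 48), 2))     # personal loan
print(round(monthly_payment(400_000, 6.5, 360), 2))  # mortgage after 20% down
```

Running a helper like this by hand is a cheap way to sanity-check the numeric `expected_keywords` before spending API calls on the agent.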

In [None]:
# Import tools from external module
from langchain_tools import (
    get_all_tools,
    get_tool_descriptions,
    calculate_personal_loan,
    calculate_mortgage,
    calculate_auto_loan,
    check_loan_eligibility,
    check_affordability,
    compare_loan_options,
)

tools = get_all_tools()
print(f"Loaded {len(tools)} tools:\n")
print(get_tool_descriptions())

## 3. Create LangChain Agent

In [None]:
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langchain_core.messages import HumanMessage, AIMessage, ToolMessage

# Create agent
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

SYSTEM_PROMPT = """You are a Loan Advisor assistant helping users with:
- Personal loans, mortgages (home loans), and auto (car) loans
- Payment calculations, eligibility checks, and affordability analysis
Use the provided tools for accurate calculations. Be clear and helpful."""

agent = create_react_agent(llm, tools, prompt=SYSTEM_PROMPT)
print(f"Agent created with {len(tools)} loan advisor tools")

In [None]:
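# Agent runner: extracts the final answer, tool names, and tool outputs
class AgentRunner:
    """Thin wrapper that runs the agent and collects the fields DeepEval needs."""

    def __init__(self, agent):
        self.agent = agent
    
    def run(self, query: str) -> Dict[str, Any]:
        """Invoke the agent once; return its output, tools called, and tool results."""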
        result = self.agent.invoke({"messages": [HumanMessage(content=query)]})
        messages = result["messages"]
        
        # Extract final output
        actual_output = next(
            (m.content for m in reversed(messages) if isinstance(m, AIMessage) and m.content),
            ""
        )
        
        # Extract tool calls and results (retrieval context)
        tools_called = []
        retrieval_context = []
        
        for msg in messages:
            if isinstance(msg, AIMessage) and msg.tool_calls:
                tools_called.extend([tc.get("name", "") for tc in msg.tool_calls if isinstance(tc, dict)])
            if isinstance(msg, ToolMessage):
                retrieval_context.append(msg.content)
        
        return {
            "actual_output": actual_output,
            "tools_called": tools_called,
            "retrieval_context": retrieval_context,
        }

runner = AgentRunner(agent)

## 4. Define Test Goldens

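Goldens are test templates: each pairs an input with the behavior expected from the agent (here, the tools it should call and keywords its answer should contain). The set below covers every loan type and all six tools.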
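
Because the goldens are plain dictionaries, a typo in a key or a duplicated `id` would only surface after the agent has already run. A small structural check can catch this up front. This is a sketch under the assumption that every golden uses the five keys seen in this notebook; `validate_goldens` is a hypothetical helper, not part of DeepEval.

```python
REQUIRED_KEYS = {"id", "category", "input", "expected_tools", "expected_keywords"}


def validate_goldens(goldens: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the set is well-formed."""
    problems = []
    seen_ids = set()
    for i, g in enumerate(goldens):
        missing = REQUIRED_KEYS - set(g)
        if missing:
            problems.append(f"golden #{i}: missing keys {sorted(missing)}")
        gid = g.get("id")
        if gid in seen_ids:
            problems.append(f"golden #{i}: duplicate id {gid!r}")
        seen_ids.add(gid)
        if not g.get("expected_tools"):
            problems.append(f"golden #{i}: no expected tools listed")
    return problems


# Illustrative data: the second entry is deliberately malformed
sample = [
    {"id": "a", "category": "personal", "input": "q1",
     "expected_tools": ["calculate_personal_loan"], "expected_keywords": ["monthly"]},
    {"id": "a", "category": "auto", "input": "q2", "expected_tools": []},
]
for problem in validate_goldens(sample):
    print(problem)
```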

In [None]:
GOLDENS = [
    # --- Personal Loan ---
    {
        "id": "personal_loan_basic",
        "category": "personal",
        "input": "Calculate monthly payment for a $25,000 personal loan at 10% interest for 48 months.",
        "expected_tools": ["calculate_personal_loan"],
        "expected_keywords": ["634", "monthly", "payment"],
    },
    {
        "id": "personal_loan_comparison",
        "category": "personal",
        "input": "Compare a $20,000 personal loan at 9% interest for 36, 48, and 60 months.",
        "expected_tools": ["compare_loan_options"],
        "expected_keywords": ["36", "48", "60", "interest"],
    },
    
    # --- Mortgage (Home Loan) ---
    {
        "id": "mortgage_basic",
        "category": "mortgage",
        "input": "Calculate mortgage payment for a $500,000 home with 20% down payment at 6.5% for 30 years.",
        "expected_tools": ["calculate_mortgage"],
        "expected_keywords": ["2,528", "monthly", "down payment", "400,000"],
    },
    {
        "id": "mortgage_low_down",
        "category": "mortgage",
        "input": "What's the monthly payment for a $400,000 house with only 10% down at 7% for 30 years? Will I need PMI?",
        "expected_tools": ["calculate_mortgage"],
        "expected_keywords": ["LTV", "PMI", "monthly"],
    },
    
    # --- Auto Loan (Car Loan) ---
    {
        "id": "auto_loan_basic",
        "category": "auto",
        "input": "Calculate car loan payment for a $35,000 vehicle with $5,000 down at 5.9% for 60 months.",
        "expected_tools": ["calculate_auto_loan"],
        # $30,000 financed at 5.9% for 60 months amortizes to about $578.59/month
        "expected_keywords": ["578", "monthly", "30,000"],
    },
    {
        "id": "auto_loan_trade_in",
        "category": "auto",
        "input": "I want to buy a $40,000 car. I have a trade-in worth $8,000 and can put $2,000 down. What's my payment at 6% for 72 months?",
        "expected_tools": ["calculate_auto_loan"],
        "expected_keywords": ["trade", "monthly", "30,000"],
    },
    
    # --- Eligibility ---
    {
        "id": "eligibility_good_credit",
        "category": "eligibility",
        "input": "Check my loan eligibility: age 35, income $8,000/month, credit score 750, full-time employed, requesting $50,000 personal loan.",
        "expected_tools": ["check_loan_eligibility"],
        "expected_keywords": ["eligible", "Excellent", "750"],
    },
    {
        "id": "eligibility_low_credit",
        "category": "eligibility",
        "input": "Am I eligible for a mortgage? Age 28, income $5,000/month, credit score 580, self-employed, want $300,000.",
        "expected_tools": ["check_loan_eligibility"],
        "expected_keywords": ["not", "credit", "580"],
    },
    
    # --- Affordability ---
    {
        "id": "affordability_ok",
        "category": "affordability",
        "input": "Can I afford a $30,000 car loan at 6% for 60 months? I earn $6,000/month with $500 existing debt.",
        "expected_tools": ["check_affordability"],
        "expected_keywords": ["affordable", "DTI"],
    },
    {
        "id": "affordability_stretched",
        "category": "affordability",
        "input": "Monthly income $4,000, existing debt $1,500. Can I afford a $25,000 loan at 8% for 48 months?",
        "expected_tools": ["check_affordability"],
        "expected_keywords": ["DTI", "exceed", "not"],
    },
]

print(f"Defined {len(GOLDENS)} test cases across categories:")
categories = {}
for g in GOLDENS:
    cat = g["category"]
    categories[cat] = categories.get(cat, 0) + 1
for cat, count in categories.items():
    print(f"  - {cat}: {count}")

## 5. Run Agent on All Test Cases

In [None]:
from deepeval.dataset import EvaluationDataset, Golden

print("Running agent on all test cases...\n")

goldens_with_output = []

for g in GOLDENS:
    print(f"[{g['category']}] {g['id']}")
    
    result = runner.run(g["input"])
    print(f"  Tools: {result['tools_called']}")
    
    golden = Golden(
        input=g["input"],
        actual_output=result["actual_output"],
        retrieval_context=result["retrieval_context"],
        additional_metadata={
            "test_id": g["id"],
            "category": g["category"],
            "expected_tools": g["expected_tools"],
            "actual_tools": result["tools_called"],
            "expected_keywords": g["expected_keywords"],
        }
    )
    goldens_with_output.append(golden)

dataset = EvaluationDataset(goldens=goldens_with_output)
print(f"\nCreated dataset with {len(dataset.goldens)} test cases")

## 6. Configure DeepEval Metrics

In [None]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

EVAL_MODEL = "gpt-4o-mini"

metrics = [
    AnswerRelevancyMetric(threshold=0.7, model=EVAL_MODEL),
    FaithfulnessMetric(threshold=0.7, model=EVAL_MODEL),
    HallucinationMetric(threshold=0.5, model=EVAL_MODEL),
]

print("Metrics configured:")
for m in metrics:
    print(f"  - {m.__class__.__name__} (threshold: {m.threshold})")

In [None]:
# Convert Goldens to TestCases
test_cases = [
    LLMTestCase(
        input=g.input,
        actual_output=g.actual_output,
        context=g.retrieval_context,           # For Hallucination
        retrieval_context=g.retrieval_context,  # For Faithfulness
    )
    for g in dataset.goldens
]
print(f"Created {len(test_cases)} test cases")

## 7. Run Evaluation

In [None]:
print("Running DeepEval evaluation...\n")

results = evaluate(
    test_cases=test_cases,
    metrics=metrics,
)

print("\nEvaluation complete!")

## 8. Analyze Results

In [None]:
import pandas as pd

results_data = []

for tc, golden in zip(test_cases, dataset.goldens):
    meta = golden.additional_metadata
    row = {"test_id": meta["test_id"], "category": meta["category"]}
    
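    for metric in metrics:
        # NOTE: measure() calls the judge model again, so this repeats the API
        # cost of evaluate() above, and scores may differ slightly between runs.
        metric.measure(tc)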
        name = metric.__class__.__name__.replace("Metric", "")
        row[name] = metric.score
        row[f"{name}_pass"] = metric.is_successful()
    
    results_data.append(row)

df = pd.DataFrame(results_data)
print("RESULTS BY TEST CASE\n")
print(df.to_string(index=False))

In [None]:
# Summary by category (Hallucination: lower is better; other metrics: higher is better)
print("\nAVERAGE SCORES BY CATEGORY\n")
score_cols = [c for c in df.columns if not c.endswith("_pass") and c not in ["test_id", "category"]]
summary = df.groupby("category")[score_cols].mean().round(2)
print(summary.to_string())

In [None]:
# Overall pass rates
print("\nOVERALL PASS RATES\n")
pass_cols = [c for c in df.columns if c.endswith("_pass")]
for col in pass_cols:
    rate = df[col].mean()
    print(f"  {col.replace('_pass', '')}: {rate:.1%}")

overall = df[pass_cols].all(axis=1).mean()
print(f"\n  Overall (all metrics pass): {overall:.1%}")

## 9. Tool Call Validation

In [None]:
print("TOOL CALL VALIDATION\n")

tool_results = []
for golden in dataset.goldens:
    meta = golden.additional_metadata
    expected = set(meta["expected_tools"])
    actual = set(meta["actual_tools"])
    # Subset check: passes if every expected tool was called; extra tool calls are tolerated
    match = expected.issubset(actual)
    
    print(f"[{'PASS' if match else 'FAIL'}] {meta['test_id']}")
    if not match:
        print(f"      Expected: {list(expected)}, Got: {list(actual)}")
    
    tool_results.append(match)

print(f"\nTool Accuracy: {sum(tool_results)/len(tool_results):.1%}")
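
Each golden also carries `expected_keywords`, which the notebook stores in `additional_metadata` but never checks. One possible deterministic complement to the LLM-judged metrics is a simple keyword coverage check. The helper `keyword_coverage` is hypothetical, and its case-insensitive substring matching is an assumption: numeric keywords such as `2,528` only match if the agent formats numbers the same way.

```python
def keyword_coverage(output: str, expected_keywords: list[str]) -> tuple[float, list[str]]:
    """Return (fraction of keywords found, list of missing keywords)."""
    if not expected_keywords:
        return 1.0, []
    text = output.lower()
    missing = [kw for kw in expected_keywords if kw.lower() not in text]
    found = len(expected_keywords) - len(missing)
    return found / len(expected_keywords), missing


# Illustrative output string, not a real agent response
sample_output = "Your monthly payment would be about $2,528 on a $400,000 loan."
coverage, missing = keyword_coverage(
    sample_output, ["2,528", "monthly", "down payment", "400,000"]
)
print(f"coverage={coverage:.0%}, missing={missing}")
```

In the notebook, the same function could be applied to `golden.actual_output` and `golden.additional_metadata["expected_keywords"]` inside the validation loop.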

## 10. Key Takeaways

### DeepEval Evaluation Flow

```
Goldens (templates) → Agent Execution → LLMTestCase → Metrics → Results
                           ↓
                    Tool Outputs = retrieval_context
```

### Metrics Explained

| Metric | Measures | Context field |
|--------|----------|---------------|
| **AnswerRelevancy** | Is the response relevant to the question? | None |
| **Faithfulness** | Is the response grounded in the retrieved context? | `retrieval_context` |
| **Hallucination** | Does the response contradict the supplied context? | `context` |
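
One detail the table leaves implicit is score direction. In DeepEval, the threshold of `AnswerRelevancyMetric` and `FaithfulnessMetric` is a minimum, while the threshold of `HallucinationMetric` is a maximum, so a lower hallucination score is better. The sketch mirrors that rule in plain Python; `passes` is a hypothetical helper for illustration only, since `metric.is_successful()` already applies the correct direction.

```python
# Metrics for which a LOWER score is better (threshold acts as a maximum).
LOWER_IS_BETTER = {"Hallucination"}


def passes(metric_name: str, score: float, threshold: float) -> bool:
    """Apply the threshold in the direction appropriate for the metric."""
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold
    return score >= threshold


print(passes("Faithfulness", 0.9, 0.7))   # high faithfulness
print(passes("Hallucination", 0.9, 0.5))  # high hallucination
print(passes("Hallucination", 0.0, 0.5))  # no contradictions found
```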

### Best Practices

1. **Tool outputs → `retrieval_context` and `context`**: Tool results act as the ground truth for Faithfulness and Hallucination
2. **Separate tools module**: Keep notebook focused on evaluation
3. **Category-based Goldens**: Cover different loan types and scenarios
4. **Validate tool calls**: Ensure agent uses correct tools

---

**Resources**: [DeepEval Docs](https://docs.confident-ai.com/) | [LangChain](https://python.langchain.com/) | [LangGraph](https://langchain-ai.github.io/langgraph/)