# AI Agent Validation with ValidMind - Banking Demo

This notebook shows how to document and evaluate an agentic AI system with the ValidMind Library. Using a small banking agent built in LangGraph as an example, you will run ValidMind’s built-in and custom tests and produce the artifacts needed to create evidence-backed documentation.

An AI agent is an autonomous system that interprets inputs, selects from available tools or actions, and carries out multi-step behaviors to achieve user goals. In this example, our agent acts as a professional banking assistant that analyzes user requests and automatically selects and invokes the most appropriate specialized banking tool (credit, account, or fraud) to deliver accurate, compliant, and actionable responses.

However, agentic capabilities bring concrete risks. The agent may misinterpret user inputs or fail to extract required parameters, producing incorrect credit assessments or inappropriate account actions. It can also select the wrong tool (for example, invoking account management instead of fraud detection), which may cause unsafe, non-compliant, or customer-impacting behaviour.

This interactive notebook guides you step-by-step through building a demo LangGraph banking agent, preparing an evaluation dataset, initializing the ValidMind Library and required objects, writing custom tests for tool-selection accuracy and entity extraction, running ValidMind’s built-in and custom test suites, and logging documentation artifacts to ValidMind.

::: {.content-hidden when-format="html"}
## Contents    
- [About ValidMind](#toc1__)    
  - [Before you begin](#toc1_1__)    
  - [New to ValidMind?](#toc1_2__)    
  - [Key concepts](#toc1_3__)    
- [Setting up](#toc2__)    
  - [Install the ValidMind Library](#toc2_1__)    
  - [Initialize the ValidMind Library](#toc2_2__)    
    - [Register sample model](#toc2_2_1__)    
    - [Apply documentation template](#toc2_2_2__)    
    - [Get your code snippet](#toc2_2_3__)    
  - [Initialize the Python environment](#toc2_3__)    
- [Banking Tools](#toc3__)    
  - [Tool Overview](#toc3_1__)    
  - [Test Banking Tools Individually](#toc3_2__)    
- [Complete LangGraph Banking Agent](#toc4__)    
- [ValidMind Model Integration](#toc5__)    
- [Prompt Validation](#toc6__)    
- [Banking Test Dataset](#toc7__)    
  - [Initialize ValidMind Dataset](#toc7_1__)    
  - [Run the Agent and capture result through assign predictions](#toc7_2__)    
    - [Dataframe Display Settings](#toc7_2_1__)    
- [Banking Accuracy Test](#toc8__)    
- [Banking Tool Call Accuracy Test](#toc9__)    
- [Scorers in ValidMind](#toc10__)
  - [Plan Quality Metric scorer](#toc10_1)    
  - [Plan Adherence Metric scorer](#toc10_2)    
  - [Tool Correctness Metric scorer](#toc10_3)    
  - [Argument Correctness Metric scorer](#toc10_4)    
  - [Task Completion scorer](#toc10_5)    
- [RAGAS Tests for an Agent Evaluation](#toc12__)    
  - [Faithfulness](#toc12_1__)    
  - [Response Relevancy](#toc12_2__)    
  - [Context Recall](#toc12_3__)    
- [Safety](#toc13__)    
  - [AspectCritic](#toc13_1__)    
  - [Prompt bias](#toc13_2__)    
  - [Toxicity](#toc13_3__)    
- [Demo Summary and Next Steps](#toc14__)    
  - [What We Built](#toc14_1__)    
  - [Next Steps](#toc14_2__)    
  - [Key Benefits](#toc14_3__)    

:::
<!-- jn-toc-notebook-config
	numbering=false
	anchor=true
	flat=false
	minLevel=2
	maxLevel=4
	/jn-toc-notebook-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

<a id='toc1__'></a>

## About ValidMind
ValidMind is a suite of tools for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Library to automate documentation and validation tests, and then use the ValidMind Platform to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between you and model validators.

<a id='toc1_1__'></a>

### Before you begin
This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with `pip install`, and then re-run the notebook. For more help, refer to [Installing Python Modules](https://docs.python.org/3/installing/index.html).

<a id='toc1_2__'></a>

### New to ValidMind?
If you haven't already seen our documentation on the [ValidMind Library](https://docs.validmind.ai/developer/validmind-library.html), we recommend you begin by exploring the available resources in this section. There, you can learn more about documenting models and running tests, as well as find code samples and our Python Library API reference.

<div class="alert alert-block alert-info" style="background-color: #B5B5B510; color: black; border: 1px solid #083E44; border-left-width: 5px; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.2);border-radius: 5px;"><span style="color: #083E44;"><b>For access to all features available in this notebook, you'll need access to a ValidMind account.</b></span>
<br></br>
<a href="https://docs.validmind.ai/guide/configuration/register-with-validmind.html" style="color: #DE257E;"><b>Register with ValidMind</b></a></div>

<a id='toc1_3__'></a>

### Key concepts

**Model documentation**: A structured and detailed record pertaining to a model, encompassing key components such as its underlying assumptions, methodologies, data sources, inputs, performance metrics, evaluations, limitations, and intended uses. It serves to ensure transparency, adherence to regulatory requirements, and a clear understanding of potential risks associated with the model’s application.

**Documentation template**: Defines the structure of your model documentation, segmented into sections and sub-sections, and specifies which tests should be run and how the results should be displayed. A documentation template therefore also functions as a test suite.

**Tests**: Functions contained in the ValidMind Library, each designed to run a specific quantitative test on a dataset or model. Tests are the building blocks of ValidMind, used to evaluate and document models and datasets, and can be run individually or as part of a suite defined by your model documentation template.

**Custom tests**: Custom tests are functions that you define to evaluate your model or dataset. These functions can be registered via the ValidMind Library to be used with the ValidMind Platform.

**Inputs**: Objects to be evaluated and documented in the ValidMind Library. They can be any of the following:

- **model**: A single model that has been initialized in ValidMind with [`vm.init_model()`](https://docs.validmind.ai/validmind/validmind.html#init_model).
- **dataset**: Single dataset that has been initialized in ValidMind with [`vm.init_dataset()`](https://docs.validmind.ai/validmind/validmind.html#init_dataset).
- **models**: A list of ValidMind models - usually this is used when you want to compare multiple models in your custom test.
- **datasets**: A list of ValidMind datasets - usually this is used when you want to compare multiple datasets in your custom test. See this [example](https://docs.validmind.ai/notebooks/how_to/run_tests_that_require_multiple_datasets.html) for more information.

**Parameters**: Additional arguments that can be passed when running a ValidMind test, used to pass additional information to a test, customize its behavior, or provide additional context.

**Outputs**: Custom tests can return elements like tables or plots. Tables may be a list of dictionaries (each representing a row) or a pandas DataFrame. Plots may be matplotlib or plotly figures.

**Test suites**: Collections of tests designed to run together to automate and generate model documentation end-to-end for specific use-cases.
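
To make the **Outputs** concept above concrete, here is a minimal sketch of a function that returns a table as a list of dictionaries, one dictionary per row. The function name `summarize_lengths` and its columns are hypothetical and chosen only for illustration; in this notebook, real custom tests are additionally registered with the `@vm.test` decorator, which is omitted here so the example runs without ValidMind.

```python
def summarize_lengths(texts):
    """Return one table row (a dictionary) per input text.

    Hypothetical example: every dictionary uses the same keys, so the
    list can be rendered as a table with one column per key.
    """
    return [
        {"text": text, "characters": len(text), "words": len(text.split())}
        for text in texts
    ]


table = summarize_lengths(["Check my balance", "Is this transfer fraudulent?"])
for row in table:
    print(row)
```

Because every row shares the same keys, the same data could also be returned as a pandas DataFrame, for example with `pd.DataFrame(table)`.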

<a id='toc2__'></a>

## Setting up

<a id='toc2_1__'></a>

### Install the ValidMind Library

To install the library:

In [None]:
%pip install -q validmind 

<a id='toc2_2__'></a>

### Initialize the ValidMind Library

<a id='toc2_2_1__'></a>

#### Register sample model

Let's first register a sample model for use with this notebook:

1. In a browser, [log in to ValidMind](https://docs.validmind.ai/guide/configuration/log-in-to-validmind.html).

2. In the left sidebar, navigate to **Inventory** and click **+ Register Model**.

3. Enter the model details and click **Next >** to continue to assignment of model stakeholders. ([Need more help?](https://docs.validmind.ai/guide/model-inventory/register-models-in-inventory.html))

4. Select your own name under the **MODEL OWNER** drop-down.

5. Click **Register Model** to add the model to your inventory.

<a id='toc2_2_2__'></a>

#### Apply documentation template

Once you've registered your model, let's select a documentation template. A template predefines sections for your model documentation and provides a general outline to follow, making the documentation process much easier.

1. In the left sidebar that appears for your model, click **Documents** and select **Documentation**.

2. Under **TEMPLATE**, select `Agentic AI System`.

3. Click **Use Template** to apply the template.

<a id='toc2_2_3__'></a>

#### Get your code snippet

ValidMind generates a unique _code snippet_ for each registered model to connect with your developer environment. You initialize the ValidMind Library with this code snippet, which ensures that your documentation and tests are uploaded to the correct model when you run the notebook.

1. On the left sidebar that appears for your model, select **Getting Started** and click **Copy snippet to clipboard**.
2. Next, [load your model identifier credentials from an `.env` file](https://docs.validmind.ai/developer/model-documentation/store-credentials-in-env-file.html) or replace the placeholder with your own code snippet:

In [None]:
import validmind as vm

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    model="...",
)
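
If you store your credentials in an `.env` file, the keyword arguments for `vm.init()` can be assembled from environment variables instead of being pasted into the notebook. The sketch below is one possible way to do this. The variable names (`VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, `VM_API_MODEL`) and the helper `read_validmind_credentials` are assumptions made for this example, so adjust them to whatever your own `.env` file defines.

```python
import os

# Assumed variable names; change these to match your own .env file.
REQUIRED_VARS = ("VM_API_HOST", "VM_API_KEY", "VM_API_SECRET", "VM_API_MODEL")


def read_validmind_credentials(env=None):
    """Build vm.init() keyword arguments from environment variables.

    Hypothetical helper: raises KeyError listing every missing variable
    so configuration problems surface before any API call is made.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "api_host": env["VM_API_HOST"],
        "api_key": env["VM_API_KEY"],
        "api_secret": env["VM_API_SECRET"],
        "model": env["VM_API_MODEL"],
    }


# Demonstration with a stand-in mapping rather than the real environment
example_env = {
    "VM_API_HOST": "https://example.invalid/api",
    "VM_API_KEY": "example-key",
    "VM_API_SECRET": "example-secret",
    "VM_API_MODEL": "example-model-id",
}
credentials = read_validmind_credentials(example_env)
print(sorted(credentials))
```

With real values loaded into the environment, the call would then look like `vm.init(**read_validmind_credentials())`.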

<a id='toc2_3__'></a>

### Initialize the Python environment

Next, let's import all the necessary libraries for building our banking LangGraph agent system:

- **LangChain components** for LLM integration and tool management
- **LangGraph** for building stateful, multi-step agent workflows
- **ValidMind** for model validation and testing
- **Banking tools** for specialized financial services
- **Standard libraries** for data handling and environment management

The setup includes loading environment variables (like OpenAI API keys) needed for the LLM components to function properly.

In [None]:
# Standard library imports
from typing import TypedDict, Annotated, Sequence

# Third party imports
import pandas as pd
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, END, START
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

# Local imports
from banking_tools import AVAILABLE_TOOLS
from validmind.tests import run_test

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', None)

# Load environment variables if using .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("dotenv not installed. Make sure OPENAI_API_KEY is set in your environment.")

<a id='toc3__'></a>

## Banking Tools

The agent relies on the following demo banking tools, each covering a common financial services use case:

<a id='toc3_1__'></a>

### Tool Overview
1. **Credit Risk Analyzer** - Loan applications and credit decisions
2. **Customer Account Manager** - Account services and customer support
3. **Fraud Detection System** - Security and fraud prevention

In [None]:
print(f"Available tools: {len(AVAILABLE_TOOLS)}")
print("\nTool Details:")
for tool in AVAILABLE_TOOLS:
    print(f"   - {tool.name}")

<a id='toc3_2__'></a>

### Test Banking Tools Individually

Let's test each banking tool individually to ensure they're working correctly before integrating them into our agent.

In [None]:
print("Testing Individual Banking Tools")
print("=" * 60)

# Test 1: Credit Risk Analyzer
print("TEST 1: Credit Risk Analyzer")
print("-" * 40)
try:
    # Access the underlying function using .func
    credit_result = AVAILABLE_TOOLS[0].func(
        customer_income=75000,
        customer_debt=1200,
        credit_score=720,
        loan_amount=50000,
        loan_type="personal"
    )
    print(credit_result)
    print("Credit Risk Analyzer test PASSED")
except Exception as e:
    print(f"Credit Risk Analyzer test FAILED: {e}")

print("=" * 60)

# Test 2: Customer Account Manager
print("TEST 2: Customer Account Manager")
print("-" * 40)
try:
    # Test checking balance
    account_result = AVAILABLE_TOOLS[1].func(
        account_type="checking",
        customer_id="12345",
        action="check_balance"
    )
    print(account_result)
    
    # Test getting account info
    info_result = AVAILABLE_TOOLS[1].func(
        account_type="all",
        customer_id="12345", 
        action="get_info"
    )
    print(info_result)
    print("Customer Account Manager test PASSED")
except Exception as e:
    print(f"Customer Account Manager test FAILED: {e}")

print("=" * 60)

# Test 3: Fraud Detection System
print("TEST 3: Fraud Detection System")
print("-" * 40)
try:
    fraud_result = AVAILABLE_TOOLS[2].func(
        transaction_id="TX123",
        customer_id="12345",
        transaction_amount=500.00,
        transaction_type="withdrawal",
        location="Miami, FL",
        device_id="DEVICE_001"
    )
    print(fraud_result)
    print("Fraud Detection System test PASSED")
except Exception as e:
    print(f"Fraud Detection System test FAILED: {e}")

print("=" * 60)

<a id='toc4__'></a>

## Complete LangGraph Banking Agent

Now we'll create our intelligent banking agent with LangGraph that can automatically select and use the appropriate banking tools based on user requests.

In [None]:

# Enhanced banking system prompt with tool selection guidance
system_context = """You are a professional banking AI assistant with access to specialized banking tools.
            Analyze the user's banking request and directly use the most appropriate tools to help them.
            
            AVAILABLE BANKING TOOLS:
            
            credit_risk_analyzer - Analyze credit risk for loan applications and credit decisions
            - Use for: loan applications, credit assessments, risk analysis, mortgage eligibility
            - Examples: "Analyze credit risk for $50k personal loan", "Assess mortgage eligibility for $300k home purchase"
            - Parameters: customer_income, customer_debt, credit_score, loan_amount, loan_type

            customer_account_manager - Manage customer accounts and provide banking services
            - Use for: account information, transaction processing, product recommendations, customer service
            - Examples: "Check balance for checking account 12345", "Recommend products for customer with high balance"
            - Parameters: account_type, customer_id, action, amount, account_details

            fraud_detection_system - Analyze transactions for potential fraud and security risks
            - Use for: transaction monitoring, fraud prevention, risk assessment, security alerts
            - Examples: "Analyze fraud risk for $500 ATM withdrawal in Miami", "Check security for $2000 online purchase"
            - Parameters: transaction_id, customer_id, transaction_amount, transaction_type, location, device_id

            BANKING INSTRUCTIONS:
            - Analyze the user's banking request carefully and identify the primary need
            - If they need credit analysis → use credit_risk_analyzer
            - If they need account services → use customer_account_manager
            - If they need security analysis → use fraud_detection_system
            - Extract relevant parameters from the user's request
            - Provide helpful, accurate banking responses based on tool outputs
            - Always consider banking regulations, risk management, and best practices
            - Be professional and thorough in your analysis

            Choose and use tools wisely to provide the most helpful banking assistance.
        """
# Initialize the main LLM for banking responses
main_llm = ChatOpenAI(
    model="gpt-5-mini",
    reasoning={
        "effort": "low",
        "summary": "auto"
    }
)
# Bind all banking tools to the main LLM
llm_with_tools = main_llm.bind_tools(AVAILABLE_TOOLS)

# Banking Agent State Definition
class BankingAgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]
    user_input: str
    session_id: str
    context: dict

def create_banking_langgraph_agent():
    """Create a comprehensive LangGraph banking agent with intelligent tool selection."""
    def llm_node(state: BankingAgentState) -> BankingAgentState:
        """Main LLM node that processes banking requests and selects appropriate tools."""
        messages = state["messages"]
        # Add system context to messages
        enhanced_messages = [SystemMessage(content=system_context)] + list(messages)
        # Get LLM response with tool selection
        response = llm_with_tools.invoke(enhanced_messages)
        return {
            **state,
            "messages": messages + [response]
        }
    
    def should_continue(state: BankingAgentState) -> str:
        """Decide whether to use tools or end the conversation."""
        last_message = state["messages"][-1]
        # Check if the LLM wants to use tools
        if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
            return "tools"
        return END
        
    # Create the banking state graph
    workflow = StateGraph(BankingAgentState)
    # Add nodes
    workflow.add_node("llm", llm_node)
    workflow.add_node("tools", ToolNode(AVAILABLE_TOOLS))
    # Simplified entry point - go directly to LLM
    workflow.add_edge(START, "llm")
    # From LLM, decide whether to use tools or end
    workflow.add_conditional_edges(
        "llm",
        should_continue,
        {"tools": "tools", END: END}
    )
    # Tool execution flows back to LLM for final response
    workflow.add_edge("tools", "llm")
    # Set up memory
    memory = MemorySaver()
    # Compile the graph
    agent = workflow.compile(checkpointer=memory)
    return agent

# Create the banking intelligent agent
banking_agent = create_banking_langgraph_agent()

print("Banking LangGraph Agent Created Successfully!")
print("\nFeatures:")
print("   - Intelligent banking tool selection")
print("   - Comprehensive banking system prompt")
print("   - Streamlined workflow: LLM → Tools → Response")
print("   - Automatic tool parameter extraction")
print("   - Professional banking assistance")


<a id='toc5__'></a>

## ValidMind Model Integration

Now we'll integrate our banking LangGraph agent with ValidMind for comprehensive testing and validation.

In [None]:
from validmind.models import Prompt
from validmind.scorer.llm.deepeval import extract_tool_calls_from_agent_output, _convert_to_tool_call_list
def banking_agent_fn(input):
    """
    Invoke the banking agent with the given input.
    """
    try:
        # Initial state for banking agent
        initial_state = {
            "user_input": input["input"],
            "messages": [HumanMessage(content=input["input"])],
            "session_id": input["session_id"],
            "context": {}
        }
        session_config = {"configurable": {"thread_id": input["session_id"]}}
        result = banking_agent.invoke(initial_state, config=session_config)

        from utils import capture_tool_output_messages

        # Capture all tool outputs and metadata
        captured_data = capture_tool_output_messages(result)
    
        # Access specific tool outputs, this will be used for RAGAS tests
        tool_message = ""
        for output in captured_data["tool_outputs"]:
            tool_message += output['content']
        
        tool_calls_found = []
        messages = result['messages']
        for message in messages:
            if hasattr(message, 'tool_calls') and message.tool_calls:
                for tool_call in message.tool_calls:
                    # Handle both dictionary and object formats
                    if isinstance(tool_call, dict):
                        tool_calls_found.append(tool_call['name'])
                    else:
                        # ToolCall object - use attribute access
                        tool_calls_found.append(tool_call.name)


        return {
            "prediction": result['messages'][-1].content[0]['text'],
            "output": result,
            "tool_messages": [tool_message],
            # "tool_calls": tool_calls_found,
            "tool_called": _convert_to_tool_call_list(extract_tool_calls_from_agent_output(result))
        }
    except Exception as e:
        # Return a fallback response if the agent fails
        error_message = f"""I apologize, but I encountered an error while processing your banking request: {str(e)}.
        Please try rephrasing your question or contact support if the issue persists."""
        return {
            "prediction": error_message, 
            "output": {
                "messages": [HumanMessage(content=input["input"]), SystemMessage(content=error_message)],
                "error": str(e)
            }
        }

## Initialize the model
vm_banking_model = vm.init_model(
    input_id="banking_agent_model",
    predict_fn=banking_agent_fn,
    prompt=Prompt(template=system_context)
)

# Add the banking agent to the vm model
vm_banking_model.model = banking_agent

print("Banking Agent Successfully Integrated with ValidMind!")
print(f"Model ID: {vm_banking_model.input_id}")

<a id='toc6__'></a>

## Prompt Validation

Let's get an initial sense of how well the prompt meets a few best practices for prompt engineering. These tests use an LLM to rate the prompt on a scale of 1-10 against the following criteria:

- **Clarity**: How clearly the prompt states the task.
- **Conciseness**: How succinctly the prompt states the task.
- **Delimitation**: Whether each element of a complex prompt (examples, contextual information, or other elements) is clearly separated from the others.
- **NegativeInstruction**: Whether the prompt contains negative instructions.
- **Specificity**: How specifically the prompt defines the task.

In [None]:
run_test(
    "validmind.prompt_validation.Clarity",
    inputs={
        "model": vm_banking_model,
    },
).log()

In [None]:
run_test(
    "validmind.prompt_validation.Conciseness",
    inputs={
        "model": vm_banking_model,
    },
).log()

In [None]:
run_test(
    "validmind.prompt_validation.Delimitation",
    inputs={
        "model": vm_banking_model,
    },
).log()

In [None]:
run_test(
    "validmind.prompt_validation.NegativeInstruction",
    inputs={
        "model": vm_banking_model,
    },
).log()

In [None]:
run_test(
    "validmind.prompt_validation.Specificity",
    inputs={
        "model": vm_banking_model,
    },
).log()

<a id='toc7__'></a>

## Banking Test Dataset

We'll use a sample test dataset to evaluate our agent's performance across different banking scenarios.

<a id='toc7_1__'></a>

### Initialize ValidMind Dataset

Before we can run tests and evaluations, we need to initialize our banking test dataset as a ValidMind dataset object.

In [None]:
# Import our banking-specific test dataset
from banking_test_dataset import banking_test_dataset

vm_test_dataset = vm.init_dataset(
    input_id="banking_test_dataset",
    dataset=banking_test_dataset.sample(2),
    text_column="input",
    target_column="possible_outputs",
)

print("Banking Test Dataset Initialized in ValidMind!")
print(f"Dataset ID: {vm_test_dataset.input_id}")
print(f"Dataset columns: {vm_test_dataset._df.columns}")
vm_test_dataset._df.head(1)

<a id='toc7_2__'></a>

### Run the Agent and capture result through assign predictions

Now we'll execute our banking agent on the test dataset and capture its responses for evaluation.

In [None]:
vm_test_dataset.assign_predictions(vm_banking_model)

print("Banking Agent Predictions Generated Successfully!")
print(f"Predictions assigned to {len(vm_test_dataset._df)} test cases")
vm_test_dataset._df.head()

<a id='toc8__'></a>

## Banking Accuracy Test

This test evaluates the banking agent's ability to provide accurate responses by:
- Testing against a dataset of predefined banking questions and expected answers
- Checking if responses contain expected keywords and banking terminology
- Providing detailed test results including pass/fail status
- Helping identify any gaps in the agent's banking knowledge or response quality

In [None]:

@vm.test("my_custom_tests.banking_accuracy_test")
def banking_accuracy_test(model, dataset, list_of_columns):
    """
    The Banking Accuracy Test evaluates whether the agent’s responses include 
    critical domain-specific keywords and phrases that indicate accurate, compliant,
    and contextually appropriate banking information. This test ensures that the agent
    provides responses containing the expected banking terminology, risk classifications,
    account details, or other domain-relevant information required for regulatory compliance,
    customer safety, and operational accuracy.
    """
    df = dataset._df
    
    # Pre-compute responses for all tests
    y_true = dataset.y.tolist()
    y_pred = dataset.y_pred(model).tolist()

    # Check each response for at least one expected keyword
    test_results = []
    for response, keywords in zip(y_pred, y_true):
        # Convert keywords to list if not already a list
        if not isinstance(keywords, list):
            keywords = [keywords]
        test_results.append(any(str(keyword).lower() in str(response).lower() for keyword in keywords))
        
    results = pd.DataFrame()
    column_names = [col + "_details" for col in list_of_columns]
    results[column_names] = df[list_of_columns]
    results["actual"] = y_pred
    results["expected"] = y_true
    results["passed"] = test_results
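    # Record an error message per row, only for rows that failed
    results["error"] = [
        None if passed else f"Response did not contain any expected keywords: {expected}"
        for passed, expected in zip(test_results, y_true)
    ]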
    
    return results
   
result = run_test(
    "my_custom_tests.banking_accuracy_test",
    inputs={
        "dataset": vm_test_dataset,
        "model": vm_banking_model
    },
    params={
        "list_of_columns": ["input"]
    }
)
result.log()
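
The pass/fail rule used by this test is a case-insensitive substring match in which a single matching keyword is enough. The sketch below reproduces that rule on a few made-up responses so the behaviour is easy to inspect. The helper name `contains_any_keyword` and the sample sentences are illustrative assumptions, not data from the banking test dataset.

```python
def contains_any_keyword(response, keywords):
    """Return True if any expected keyword occurs in the response."""
    if not isinstance(keywords, list):
        keywords = [keywords]
    response_text = str(response).lower()
    return any(str(keyword).lower() in response_text for keyword in keywords)


# Made-up (response, expected keywords) pairs for illustration
examples = [
    ("Your checking balance is $2,500.", ["balance", "checking"]),
    ("The application is LOW RISK and can proceed.", "low risk"),
    ("I cannot help with that request.", ["fraud", "suspicious"]),
]
outcomes = [contains_any_keyword(response, keywords) for response, keywords in examples]
print(outcomes)  # [True, True, False]
```

Because one matching substring is sufficient, a short keyword such as `risk` would also pass on a response that says "high risk", so keyword lists are worth choosing with care.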

<a id='toc9__'></a>

## Banking Tool Call Accuracy Test

This test evaluates how accurately the banking agent selects the correct tools for different banking requests. It provides quantitative feedback on the agent's core capability: understanding what users need and selecting the right banking tools to help them.

In [None]:
@vm.test("my_custom_tests.BankingToolCallAccuracy")
def BankingToolCallAccuracy(dataset, agent_output_column, expected_tools_column):
    """
    Evaluates the tool selection accuracy of a LangGraph-powered banking agent.

    This test measures whether the agent correctly identifies and invokes the required banking tools
    for each user query scenario.
    For each case, the outputs generated by the agent (including its tool calls) are compared against an
    expected set of tools. The test considers both coverage and exactness: it computes the proportion of
    expected tools correctly called by the agent for each instance.

    Parameters:
        dataset (VMDataset): The dataset containing user queries, agent outputs, and ground-truth tool expectations.
        agent_output_column (str): Dataset column name containing agent outputs (should include tool call details in 'messages').
        expected_tools_column (str): Dataset column specifying the true expected tools (as lists).

    Returns:
        List[dict]: Per-row dictionaries with details: expected tools, found tools, match count, total expected, and accuracy score.

    Purpose:
        Provides diagnostic evidence of the banking agent's core reasoning ability—specifically, its capacity to
        interpret user needs and select the correct banking actions. Useful for diagnosing gaps in tool coverage,
        misclassifications, or breakdowns in agent logic.

    Interpretation:
        - An accuracy of 1.0 signals perfect tool selection for that example.
        - Lower scores may indicate partial or complete failures to invoke required tools.
        - Review 'found_tools' vs. 'expected_tools' to understand the source of discrepancies.

    Strengths:
        - Directly tests a core capability of compositional tool-use agents.
        - Framework-agnostic; robust to tool call output format (object or dict).
        - Supports batch validation and result logging for systematic documentation.

    Limitations:
        - Does not penalize extra, unnecessary tool calls.
        - Does not assess result quality—only correct invocation.

    """
    def validate_tool_calls_simple(messages, expected_tools):
        """Simple validation of tool calls without RAGAS dependency issues."""
        
        tool_calls_found = []
        
        for message in messages:
            if hasattr(message, 'tool_calls') and message.tool_calls:
                for tool_call in message.tool_calls:
                    # Handle both dictionary and object formats
                    if isinstance(tool_call, dict):
                        tool_calls_found.append(tool_call['name'])
                    else:
                        # ToolCall object - use attribute access
                        tool_calls_found.append(tool_call.name)
        
        # Check if expected tools were called
        accuracy = 0.0
        matches = 0
        if expected_tools:
            matches = sum(1 for tool in expected_tools if tool in tool_calls_found)
            accuracy = matches / len(expected_tools)
        
        return {
            'expected_tools': expected_tools,
            'found_tools': tool_calls_found,
            'matches': matches,
            'total_expected': len(expected_tools) if expected_tools else 0,
            'accuracy': accuracy,
        }

    df = dataset._df
    
    results = []
    for _, row in df.iterrows():
        result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])
        results.append(result)
         
    return results

run_test(
    "my_custom_tests.BankingToolCallAccuracy",
    inputs = {
        "dataset": vm_test_dataset,
    },
    params = {
        "agent_output_column": "banking_agent_model_output",
        "expected_tools_column": "expected_tools"
    }
).log()
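
The per-row score computed by this test is the fraction of expected tools that appear among the tools the agent called. The sketch below works through that arithmetic on a few hand-written cases. The helper name `tool_call_accuracy` is hypothetical, and the cases are illustrative rather than taken from the test dataset.

```python
def tool_call_accuracy(expected_tools, found_tools):
    """Fraction of expected tools that appear among the called tools."""
    if not expected_tools:
        return 0.0
    matches = sum(1 for tool in expected_tools if tool in found_tools)
    return matches / len(expected_tools)


# Exact match: the single expected tool was called
exact = tool_call_accuracy(["credit_risk_analyzer"], ["credit_risk_analyzer"])

# Partial match: one of two expected tools was called
partial = tool_call_accuracy(
    ["credit_risk_analyzer", "fraud_detection_system"],
    ["fraud_detection_system"],
)

# Extra call: the unnecessary second tool does not lower the score
extra = tool_call_accuracy(
    ["fraud_detection_system"],
    ["fraud_detection_system", "customer_account_manager"],
)

print(exact, partial, extra)  # 1.0 0.5 1.0
```

The third case illustrates the limitation noted in the docstring: extra, unnecessary tool calls are not penalized, so a score of 1.0 does not by itself show that the agent called only the tools it needed.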

<a id='toc10__'></a>

## Scorers in ValidMind

Scorers are evaluation metrics that analyze model outputs and store their results in the dataset. When using `assign_scores()`:

- Each scorer adds a new column to the dataset, named using the format `{scorer_name}_{metric_name}`
- The column contains the numeric score (typically 0-1) for each example
- Multiple scorers can be run on the same dataset, each adding their own column
- Scores are persisted in the dataset for later analysis and visualization
- Common scorer patterns include:
  - Model performance metrics (accuracy, F1, etc)
  - Output quality metrics (relevance, faithfulness)
  - Task-specific metrics (completion, correctness)

<a id="toc6_3_4_"></a>

### AI Agent Evaluation Metrics

AI agent evaluation metrics are specialized measurements designed to assess how well autonomous LLM-based agents reason, plan, select and execute tools, and ultimately complete user tasks by analyzing the **full execution trace**—including reasoning steps, tool calls, intermediate decisions, and outcomes—rather than just single input–output pairs.

These metrics are essential because agent failures often occur in ways traditional LLM metrics miss (e.g., choosing the right tool with wrong arguments, creating a good plan but not following it, or completing a task inefficiently).

**DeepEval’s AI agent evaluation framework** breaks evaluation into three layers with corresponding metric categories:

1. **Reasoning Layer** – Evaluates planning and strategy generation:

   * *PlanQualityMetric* – how logical, complete, and efficient the agent’s plan is
   * *PlanAdherenceMetric* – whether the agent follows its own plan during execution 

2. **Action Layer** – Assesses tool usage and argument generation:

   * *ToolCorrectnessMetric* – whether the agent selects and calls the right tools
   * *ArgumentCorrectnessMetric* – whether the agent generates correct tool arguments

3. **Execution Layer** – Measures end-to-end performance:

   * *TaskCompletionMetric* – whether the agent successfully completes the intended task
   * *StepEfficiencyMetric* – whether the agent avoids unnecessary or redundant steps

Together, these metrics enable granular diagnosis of agent behavior, help pinpoint where failures occur (reasoning, action, or execution), and support both development benchmarking and production monitoring.
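
One way to use the three layers for diagnosis is to average the metric scores per layer and flag the weakest one. The sketch below assumes scores in the 0-1 range keyed by short metric names; the `weakest_layer` helper, the 0.7 threshold, and the sample scores are hypothetical and not part of DeepEval or ValidMind.

```python
# Metric grouping taken from the three layers described above
LAYER_METRICS = {
    "reasoning": ["PlanQuality", "PlanAdherence"],
    "action": ["ToolCorrectness", "ArgumentCorrectness"],
    "execution": ["TaskCompletion", "StepEfficiency"],
}


def weakest_layer(scores, threshold=0.7):
    """Return (layer, mean score) for the lowest-scoring layer below threshold, else None."""
    layer_means = {}
    for layer, metrics in LAYER_METRICS.items():
        values = [scores[m] for m in metrics if m in scores]
        if values:
            layer_means[layer] = sum(values) / len(values)
    if not layer_means:
        return None
    layer, mean = min(layer_means.items(), key=lambda item: item[1])
    return (layer, mean) if mean < threshold else None


# Hypothetical scores: the right tool was chosen, but with wrong arguments
diagnosis = weakest_layer({
    "PlanQuality": 0.9,
    "PlanAdherence": 0.8,
    "ToolCorrectness": 1.0,
    "ArgumentCorrectness": 0.0,
    "TaskCompletion": 0.5,
    "StepEfficiency": 0.75,
})
healthy = weakest_layer({"PlanQuality": 0.9, "ToolCorrectness": 1.0, "TaskCompletion": 0.8})
print(diagnosis, healthy)
```

Averaging hides which metric inside a layer failed, so in practice the per-metric scores would still be inspected once a layer is flagged.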

<a id='toc10_1'></a>

#### **Reasoning Layer**
#### PlanQualityMetric
This metric measures how well the agent generates a plan before acting. A high score means the plan is logical, complete, and efficient.

In [None]:

vm_test_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.PlanQuality",
    input_column = "input",
    actual_output_column = "banking_agent_model_prediction",
    tools_called_column = "banking_agent_model_tool_called",
    agent_output_column = "banking_agent_model_output",
)
vm_test_dataset._df.head()

<a id='toc10_2'></a>

#### PlanAdherenceMetric
This metric checks whether the agent follows the plan it created. Deviations lower the score and indicate gaps between reasoning and execution.

In [None]:
vm_test_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.PlanAdherence",
    input_column = "input",
    actual_output_column = "banking_agent_model_prediction",
    expected_output_column = "expected_output",
    tools_called_column = "banking_agent_model_tool_called",
    agent_output_column = "banking_agent_model_output",

)
vm_test_dataset._df.head()

<a id='toc10_3'></a>

#### **Action Layer**
#### ToolCorrectnessMetric
This metric evaluates whether the agent selects the appropriate tool for the task. Choosing the wrong tool reduces performance even if the reasoning was correct.

In [None]:
vm_test_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.ToolCorrectness",
    input_column = "input",
    actual_output_column = "banking_agent_model_prediction",
    tools_called_column = "banking_agent_model_tool_called",
    expected_tools_column = "expected_tools",
    agent_output_column = "banking_agent_model_output",

)
vm_test_dataset._df.head()

<a id='toc10_4'></a>

#### ArgumentCorrectnessMetric
This metric assesses whether the agent provides correct inputs or arguments to the selected tool. Incorrect arguments can lead to failed or unexpected results.

In [None]:
vm_test_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.ArgumentCorrectness",
    input_column = "input",
    actual_output_column = "banking_agent_model_prediction",
    tools_called_column = "banking_agent_model_tool_called",
    agent_output_column = "banking_agent_model_output",

)
vm_test_dataset._df.head()



<a id='toc10_5'></a>

#### **Execution Layer**
#### TaskCompletionMetric
The TaskCompletion test evaluates whether our banking agent successfully completes the requested tasks by analyzing its outputs and tool usage. This metric assesses the agent's ability to understand user requests, execute appropriate actions, and provide complete responses that address the original query. The test provides a score between 0 and 1 along with detailed feedback on task completion quality.

In [None]:
vm_test_dataset.assign_scores(
    metrics = "validmind.scorer.llm.deepeval.TaskCompletion",
    input_column = "input",
    actual_output_column = "banking_agent_model_prediction",
    agent_output_column = "banking_agent_model_output",
    tools_called_column = "banking_agent_model_tool_called",

)
vm_test_dataset._df.head()

The TaskCompletion scorer has added a new column, `TaskCompletion_score`, to our dataset. When scorers run through `assign_scores()`, their return values are automatically added as new columns named `{scorer_name}_{metric_name}`. Let's visualize the distribution of task completion scores across our test cases with the box plot test.

In [None]:
run_test(
    "validmind.plots.BoxPlot",
    inputs={"dataset": vm_test_dataset},
    params={
        "columns": "TaskCompletion_score",
        "title": "Distribution of Task Completion Scores",
        "ylabel": "Score",
        "figsize": (8, 6)
    }
).log()
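
A box plot summarizes a score column through its minimum, quartiles, and maximum. As a rough, library-free sketch of the numbers such a plot encodes, the five-number summary can be computed with the standard library. The scores below are hypothetical, and plotting libraries may use a slightly different quartile convention than the `inclusive` method assumed here.

```python
import statistics

# Hypothetical task completion scores, for illustration only
task_completion_scores = [0.2, 0.5, 0.8, 0.9, 1.0]

# quantiles() with n=4 returns the three quartile cut points
q1, median, q3 = statistics.quantiles(task_completion_scores, n=4, method="inclusive")

summary = {
    "min": min(task_completion_scores),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(task_completion_scores),
}
print(summary)
```

A low first quartile combined with a high median would suggest that most test cases complete well while a small group of requests fails, which is worth inspecting row by row.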


<a id='toc12__'></a>

## RAGAS Tests for an Agent Evaluation

RAGAS (Retrieval-Augmented Generation Assessment) provides specialized metrics for evaluating conversational AI systems like our banking agent.

Our banking agent uses tools to retrieve information and generates responses based on that context, making it similar to a RAG system. RAGAS metrics help evaluate:

- **Response Quality**: How well the agent uses retrieved tool outputs to generate helpful banking responses
- **Information Faithfulness**: Whether agent responses accurately reflect tool outputs  
- **Relevance Assessment**: How well responses address the original banking query
- **Context Utilization**: How effectively the agent incorporates tool results into final answers

These tests provide insights into how well our banking agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to banking users.

<a id='toc12_1__'></a>

### Faithfulness

Faithfulness measures how accurately the banking agent's responses reflect the information retrieved from tools. This metric evaluates:

**Information Accuracy**: Whether the agent correctly uses tool outputs in its responses
- **Fact Preservation**: Ensuring credit scores, loan calculations, compliance results are accurately reported
- **No Hallucination**: Verifying the agent doesn't invent banking information not provided by tools
- **Source Attribution**: Checking that responses align with actual tool outputs

**Critical for Banking Trust**: Faithfulness is essential for banking agent reliability because users need to trust that:
- Credit analysis results are reported correctly
- Financial calculations are accurate  
- Compliance checks return real information
- Risk assessments are properly communicated
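
RAGAS defines faithfulness as the share of claims in the response that can be inferred from the retrieved context. The real metric uses an LLM to extract and verify claims; the sketch below replaces that step with simple case-insensitive substring matching, so it only illustrates the ratio. The function name, the tool output, and the claims are hypothetical.

```python
def faithfulness_sketch(claims, tool_outputs):
    """Share of claims found verbatim (case-insensitive) in the combined tool outputs."""
    if not claims:
        return 0.0
    context = " ".join(tool_outputs).lower()
    supported = sum(1 for claim in claims if claim.lower() in context)
    return supported / len(claims)


# Hypothetical tool output and response claims, for illustration only
tool_outputs = ["Credit score: 720. Risk level: low."]
claims = [
    "credit score: 720",         # supported by the tool output
    "risk level: low",           # supported by the tool output
    "approved amount: $50,000",  # not present in the tool output
]

faithfulness = faithfulness_sketch(claims, tool_outputs)
print(round(faithfulness, 2))
```

Two of the three claims are backed by the tool output, so the unsupported loan amount pulls the score down to two thirds.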

In [None]:
run_test(
    "validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "response_column": ["banking_agent_model_prediction"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
    },
).log()

<a id='toc12_2__'></a>

### Response Relevancy

Response Relevancy evaluates how well the banking agent's answers address the user's original banking question or request. This metric assesses:

**Query Alignment**: Whether responses directly answer what users asked for
- **Intent Fulfillment**: Checking if the agent understood and addressed the user's actual banking need
- **Completeness**: Ensuring responses provide sufficient information to satisfy the banking query
- **Focus**: Avoiding irrelevant information that doesn't help the banking user

**Banking Quality**: Measures the agent's ability to maintain relevant, helpful banking dialogue
- **Context Awareness**: Responses should be appropriate for the banking conversation context
- **User Satisfaction**: Answers should be useful and actionable for banking users
- **Clarity**: Banking information should be presented in a way that directly helps the user

High relevancy indicates the banking agent successfully understands user needs and provides targeted, helpful banking responses.
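
RAGAS estimates response relevancy by generating questions from the response, embedding them, and averaging their cosine similarity to the original user question. The sketch below shows only that final averaging step, assuming the embeddings already exist; the two-dimensional vectors are hand-made placeholders, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relevancy_sketch(question_vec, generated_question_vecs):
    """Mean cosine similarity between the user question and the generated questions."""
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)


# Placeholder embeddings, for illustration only
user_question = [1.0, 0.0]
generated_questions = [
    [1.0, 0.0],  # same direction as the user question
    [0.0, 1.0],  # orthogonal, i.e. an unrelated question
]

relevancy = relevancy_sketch(user_question, generated_questions)
print(relevancy)
```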

In [None]:
run_test(
    "validmind.model_validation.ragas.ResponseRelevancy",
    inputs={"dataset": vm_test_dataset},
    params={
        "user_input_column": "input",
        "response_column": "banking_agent_model_prediction",
        "retrieved_contexts_column": "banking_agent_model_tool_messages",
    }
).log()

<a id='toc12_3__'></a>

### Context Recall

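Context Recall measures how much of a reference answer is supported by the information retrieved from tools. In this notebook the agent's final response is passed as the reference, so the score reflects how much of that response can be traced back to tool outputs. This metric evaluates: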

**Information Utilization**: Whether the agent effectively incorporates tool outputs into its responses
- **Coverage**: How much of the available tool information is used in the response
- **Integration**: How well tool outputs are woven into coherent, natural banking responses
- **Completeness**: Whether all relevant information from tools is considered

**Tool Effectiveness**: Assesses whether selected banking tools provide useful context for responses
- **Relevance**: Whether tool outputs actually help answer the user's banking question
- **Sufficiency**: Whether enough information was retrieved to generate good banking responses
- **Quality**: Whether the tools provided accurate, helpful banking information

High context recall indicates the banking agent not only selects the right tools but also effectively uses their outputs to create comprehensive, well-informed banking responses.

In [None]:
run_test(
    "validmind.model_validation.ragas.ContextRecall",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
        "reference_column": ["banking_agent_model_prediction"],
    },
).log()

<a id='toc13__'></a>

## Safety

Safety testing is critical for banking AI agents to ensure they operate reliably and securely.
These tests help validate that our banking agent maintains high standards of fairness and professionalism.

<a id='toc13_1__'></a>

### AspectCritic

AspectCritic provides comprehensive evaluation across multiple dimensions of banking agent performance. This metric analyzes various aspects of response quality:

**Multi-Dimensional Assessment**: Evaluates responses across different quality criteria:
  - **Conciseness**: Whether responses are clear and to-the-point without unnecessary details
  - **Coherence**: Whether responses are logically structured and easy to follow
  - **Correctness**: Accuracy of banking information and appropriateness of recommendations
  - **Harmfulness**: Whether responses could cause harm or damage to users or systems
  - **Maliciousness**: Whether responses contain malicious content or intent

**Holistic Quality Scoring**: Provides an overall assessment that considers:
- **User Experience**: How satisfying and useful the banking interaction would be for real users
- **Professional Standards**: Whether responses meet quality expectations for production banking systems
- **Consistency**: Whether the banking agent maintains quality across different types of requests

AspectCritic helps identify specific areas where the banking agent excels or needs improvement, providing actionable insights for enhancing overall performance and user satisfaction in banking scenarios.

In [None]:
run_test(
    "validmind.model_validation.ragas.AspectCritic",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "response_column": ["banking_agent_model_prediction"],
        "retrieved_contexts_column": ["banking_agent_model_tool_messages"],
    },
).log()

<a id='toc13_2__'></a>

### Prompt bias

Let's check if the agent's prompts contain unintended biases that could affect banking decisions.

In [None]:
run_test(
    "validmind.prompt_validation.Bias",
    inputs={
        "model": vm_banking_model,
    },
).log()

<a id='toc13_3__'></a>

### Toxicity

Let's ensure responses are professional and appropriate for banking contexts.

In [None]:
run_test(
    "validmind.data_validation.nlp.Toxicity",
    inputs={
        "dataset": vm_test_dataset,
    },
).log()

<a id='toc14__'></a>

## Demo Summary and Next Steps

We have successfully built and tested a comprehensive **Banking AI Agent** using LangGraph and ValidMind. Here's what we've accomplished:

<a id='toc14_1__'></a>

### What We Built

1. **3 Specialized Banking Tools**
   - Credit Risk Analyzer for loan assessments
   - Customer Account Manager for account services
   - Fraud Detection System for security monitoring

2. **Intelligent LangGraph Agent**
   - Automatic tool selection based on user requests
   - Banking-specific system prompts and guidance
   - Professional banking assistance and responses

3. **Comprehensive Testing Framework**
   - Banking-specific test cases
   - ValidMind integration for validation
   - Performance analysis across banking domains

<a id='toc14_2__'></a>

### Next Steps

1. **Customize Tools**: Adapt the banking tools to your specific banking requirements
2. **Expand Test Cases**: Add more banking scenarios and edge cases
3. **Integrate with Real Data**: Connect to actual banking systems and databases
4. **Add More Tools**: Implement additional banking-specific functionality
5. **Production Deployment**: Deploy the agent in a production banking environment

<a id='toc14_3__'></a>

### Key Benefits

- **Industry-Specific**: Designed specifically for banking operations
- **Regulatory Compliance**: Built-in SR 11-7 and SS 1-23 compliance checks
- **Risk Management**: Comprehensive credit and fraud risk assessment
- **Customer Focus**: Tools for both retail and commercial banking needs
- **Real-World Applicability**: Addresses actual banking use cases and challenges

Your banking AI agent is now ready to handle real-world banking scenarios while maintaining regulatory compliance and risk management best practices!

<small>

***

Copyright © 2023-2026 ValidMind Inc. All rights reserved.<br>
Refer to the [LICENSE file in the root of the GitHub `validmind-library` repository](https://github.com/validmind/validmind-library/blob/main/LICENSE) for details.<br>
SPDX-License-Identifier: AGPL-3.0 AND ValidMind Commercial</small>