# Simplified LangGraph Agent Model Documentation

This notebook demonstrates how to build and validate a simplified AI agent using LangGraph integrated with ValidMind for comprehensive testing and monitoring.

Learn how to create intelligent agents that can:
- **Automatically select appropriate tools** based on user queries using LLM-powered routing
- **Manage workflows** with state management and memory
- **Handle two specialized tools** with smart decision-making
- **Provide validation and testing** through ValidMind integration

We'll build a simplified agent system that intelligently routes user requests to two specialized tools: **search_engine** for document search and **task_assistant** for general assistance, then validate its performance using ValidMind's testing framework.



## Setup and Imports

First, let's import all the necessary libraries for building our LangGraph agent system:

- **LangChain components** for LLM integration and tool management
- **LangGraph** for building stateful, multi-step agent workflows  
- **ValidMind** for model validation and testing
- **Standard libraries** for data handling and environment management

The setup includes loading environment variables (like OpenAI API keys) needed for the LLM components to function properly.


In [None]:
%pip install -q langgraph langchain langchain-openai validmind openai

In [None]:
from typing import TypedDict, Annotated, Sequence, Optional
from langchain.tools import tool
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END, START
from langgraph.prebuilt import ToolNode
from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph.message import add_messages
import pandas as pd

# Load environment variables if using .env file
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("dotenv not installed. Make sure OPENAI_API_KEY is set in your environment.")


In [None]:
import validmind as vm

vm.init(
    api_host="...",
    api_key="...",
    api_secret="...",
    model="...",
)

## Simplified Tools with Rich Docstrings

We've simplified the agent to use only two core tools:
- **search_engine**: For searching through documents, policies, and knowledge base  
- **task_assistant**: For general-purpose task assistance and problem-solving


In [None]:
# Search Engine Tool
@tool
def search_engine(query: str, document_type: Optional[str] = "all") -> str:
    """
    Search through internal documents, policies, and knowledge base.
    
    This tool can search for:
    - Company policies and procedures
    - Technical documentation and manuals
    - Compliance and regulatory documents
    - Historical records and reports
    - Product specifications and requirements
    - Legal documents and contracts
    
    Args:
        query (str): Search terms or questions about documents
        document_type (str, optional): Type of document to search ("policy", "technical", "legal", "all")
    
    Returns:
        str: Relevant document excerpts and references
        
    Examples:
        - "Find our data privacy policy"
        - "Search for loan approval procedures"
        - "What are the security guidelines for API access?"
        - "Show me compliance requirements for financial reporting"
    """
    document_db = {
        "policy": [
            "Data Privacy Policy: All personal data must be encrypted...",
            "Remote Work Policy: Employees may work remotely up to 3 days...",
            "Security Policy: All systems require multi-factor authentication..."
        ],
        "technical": [
            "API Documentation: REST endpoints available at /api/v1/...",
            "Database Schema: User table contains id, name, email...",
            "Deployment Guide: Use Docker containers with Kubernetes..."
        ],
        "legal": [
            "Terms of Service: By using this service, you agree to...",
            "Privacy Notice: We collect information to provide services...",
            "Compliance Framework: SOX requirements mandate quarterly audits..."
        ]
    }
    
    results = []
    search_types = [document_type] if document_type != "all" else document_db.keys()
    
    for doc_type in search_types:
        if doc_type in document_db:
            for doc in document_db[doc_type]:
                if any(term.lower() in doc.lower() for term in query.split()):
                    results.append(f"[{doc_type.upper()}] {doc}")
    
    if not results:
        results.append(f"No documents found matching '{query}'")
    
    return "\n\n".join(results)

# Task Assistant Tool
@tool
def task_assistant(task_description: str, context: Optional[str] = None) -> str:
    """
    General-purpose task assistance and problem-solving tool.
    
    This tool can help with:
    - Breaking down complex tasks into steps
    - Providing guidance and recommendations
    - Answering questions and explaining concepts
    - Suggesting solutions to problems
    - Planning and organizing activities
    - Research and information gathering
    
    Args:
        task_description (str): Description of the task or question
        context (str, optional): Additional context or background information
    
    Returns:
        str: Helpful guidance, steps, or information for the task
        
    Examples:
        - "How do I prepare for a job interview?"
        - "What are the steps to deploy a web application?"
        - "Help me plan a team meeting agenda"
        - "Explain machine learning concepts for beginners"
    """
    responses = {
        "meeting": "For planning meetings: 1) Define objectives, 2) Create agenda, 3) Invite participants, 4) Prepare materials, 5) Set time limits",
        "interview": "Interview preparation: 1) Research the company, 2) Practice common questions, 3) Prepare examples, 4) Plan your outfit, 5) Arrive early",
        "deploy": "Deployment steps: 1) Test in staging, 2) Backup production, 3) Deploy code, 4) Run health checks, 5) Monitor performance",
        "learning": "Learning approach: 1) Start with basics, 2) Practice regularly, 3) Build projects, 4) Join communities, 5) Stay updated"
    }
    
    task_lower = task_description.lower()
    for key, response in responses.items():
        if key in task_lower:
            return f"Task assistance for '{task_description}':\n\n{response}"
    
    
    # Generic fallback when no keyword matches (built without stray newlines or indentation)
    return (
        f"For the task '{task_description}', I recommend: 1) Break it into smaller steps, "
        "2) Gather necessary resources, 3) Create a timeline, "
        "4) Start with the most critical parts, 5) Review and adjust as needed."
    )

# Collect all tools for the LLM router - SIMPLIFIED TO ONLY 2 TOOLS
AVAILABLE_TOOLS = [
    search_engine,
    task_assistant
]

print("Simplified tools created!")
print(f"Available tools: {len(AVAILABLE_TOOLS)}")
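# Use a different loop variable so the imported `tool` decorator is not shadowed
for available_tool in AVAILABLE_TOOLS:
    print(f"   - {available_tool.name}: {available_tool.description[:50]}...")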
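
The matching rule inside `search_engine` is a case-insensitive substring test on each whitespace-separated query term. The standalone sketch below restates that rule outside the `@tool` wrapper so its behaviour can be checked without an LLM. `match_documents` and `sample_db` are illustrative names that do not appear elsewhere in this notebook.

```python
def match_documents(query, document_db, document_type="all"):
    """Return tagged documents that contain at least one query term as a substring."""
    search_types = list(document_db) if document_type == "all" else [document_type]
    terms = [term.lower() for term in query.split()]
    hits = []
    for doc_type in search_types:
        for doc in document_db.get(doc_type, []):
            if any(term in doc.lower() for term in terms):
                hits.append(f"[{doc_type.upper()}] {doc}")
    return hits


# Hypothetical two-document database, reusing excerpts from the tool above
sample_db = {
    "policy": ["Data Privacy Policy: All personal data must be encrypted..."],
    "technical": ["Database Schema: User table contains id, name, email..."],
}

all_hits = match_documents("data privacy", sample_db)
policy_hits = match_documents("data privacy", sample_db, "policy")
print(len(all_hits), len(policy_hits))
```

Because the test is a substring match, the term "data" also matches "Database", so short or common query terms tend to return broad results. Restricting `document_type` is the only way this mock tool narrows a search.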


## Complete LangGraph Agent with Intelligent Router


In [None]:

# Simplified Agent State (removed routing fields)
class IntelligentAgentState(TypedDict):
    messages: Annotated[Sequence[BaseMessage], add_messages]
    user_input: str
    session_id: str
    context: dict

def create_intelligent_langgraph_agent():
    """Create a simplified LangGraph agent with direct LLM tool selection."""
    
    # Initialize the main LLM for responses
    main_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
    
    # Bind tools to the main LLM
    llm_with_tools = main_llm.bind_tools(AVAILABLE_TOOLS)
    
    def llm_node(state: IntelligentAgentState) -> IntelligentAgentState:
        """Main LLM node that processes requests and directly selects tools."""
        
        messages = state["messages"]
        
        # Enhanced system prompt with tool selection guidance
        system_context = f"""You are a helpful AI assistant with access to specialized tools.
            Analyze the user's request and directly use the most appropriate tools to help them.
            
            AVAILABLE TOOLS:
            🔍 **search_engine** - Search through internal documents, policies, and knowledge base
            - Use for: finding company policies, technical documentation, compliance documents
            - Examples: "Find our data privacy policy", "Search for API documentation"

            🎯 **task_assistant** - General-purpose task assistance and problem-solving  
            - Use for: guidance, recommendations, explaining concepts, planning activities
            - Examples: "How to prepare for an interview", "Help plan a meeting", "Explain machine learning"

            INSTRUCTIONS:
            - Analyze the user's request carefully
            - If they need to find documents/policies → use search_engine
            - If they need general help/guidance/explanations → use task_assistant  
            - If the request needs specific information search, use search_engine first
            - You can use tools directly based on the user's needs
            - Provide helpful, accurate responses based on tool outputs
            - If no tools are needed, respond conversationally

            Choose and use tools wisely to provide the most helpful response."""
        
        # Add system context to messages
        enhanced_messages = [SystemMessage(content=system_context)] + list(messages)
        
        # Get LLM response with tool selection
        response = llm_with_tools.invoke(enhanced_messages)
        
        # Return only the new message; the add_messages reducer appends it to the history
        return {"messages": [response]}
    
    def should_continue(state: IntelligentAgentState) -> str:
        """Decide whether to use tools or end the conversation."""
        last_message = state["messages"][-1]
        
        # Check if the LLM wants to use tools
        if hasattr(last_message, 'tool_calls') and last_message.tool_calls:
            return "tools"
        
        return END
        
    
    # Create the simplified state graph  
    workflow = StateGraph(IntelligentAgentState)
    
    # Add nodes (removed router node)
    workflow.add_node("llm", llm_node) 
    workflow.add_node("tools", ToolNode(AVAILABLE_TOOLS))
    
    # Simplified entry point - go directly to LLM
    workflow.add_edge(START, "llm")
    
    # From LLM, decide whether to use tools or end
    workflow.add_conditional_edges(
        "llm",
        should_continue,
        {"tools": "tools", END: END}
    )
    
    # Tool execution flows back to LLM for final response
    workflow.add_edge("tools", "llm")
    
    # Set up memory
    memory = MemorySaver()
    
    # Compile the graph
    agent = workflow.compile(checkpointer=memory)
    
    return agent

# Create the simplified intelligent agent
intelligent_agent = create_intelligent_langgraph_agent()

print("Simplified LangGraph Agent Created!")
print("Features:")
print("   - Direct LLM tool selection (no separate router)")
print("   - Enhanced system prompt for intelligent tool choice")
print("   - Streamlined workflow: LLM -> Tools -> Response")
print("   - Automatic tool parameter extraction")
print("   - Clean, simplified architecture")
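
The control flow of the graph (LLM, then tools if requested, then back to the LLM) can be traced without LangGraph or an API key. The sketch below is a plain-Python approximation under stated assumptions: `FakeMessage`, `run_loop`, and the scripted replies are hypothetical stand-ins, and the local `END` constant only mimics the value imported from `langgraph.graph`.

```python
from dataclasses import dataclass, field

END = "__end__"  # stand-in for langgraph.graph.END, assumed here for illustration


@dataclass
class FakeMessage:
    content: str
    tool_calls: list = field(default_factory=list)


def should_continue(messages):
    """Mirror the conditional edge: go to tools only if the last message requests them."""
    last = messages[-1]
    return "tools" if getattr(last, "tool_calls", None) else END


def run_loop(llm_replies, run_tool):
    """Replay scripted LLM replies, executing requested tool calls in between."""
    messages = []
    for reply in llm_replies:
        messages.append(reply)
        if should_continue(messages) == END:
            break
        for call in reply.tool_calls:
            messages.append(FakeMessage(content=run_tool(call)))
    return messages


scripted = [
    FakeMessage("", tool_calls=[{"name": "search_engine", "args": {"query": "privacy"}}]),
    FakeMessage("Here is the policy."),
]
history = run_loop(iter(scripted), lambda call: f"result of {call['name']}")
print([m.content for m in history])
```

The first scripted reply requests a tool, so the loop records the tool output and asks the "LLM" again. The second reply carries no tool calls, which ends the run.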


## ValidMind Model Integration

Now we'll integrate our LangGraph agent with ValidMind for comprehensive testing and validation. This step is crucial for:

**Model Wrapping**: We create a wrapper function (`agent_fn`) that standardizes the agent interface for ValidMind
- **Input Formatting**: Converts ValidMind inputs to the agent's expected format
- **State Management**: Handles session configuration and conversation threads
- **Result Processing**: Returns agent responses in a consistent format

**ValidMind Agent Initialization**: Using `vm.init_model()` creates a ValidMind model object that:
- **Enables Testing**: Allows us to run validation tests on the agent
- **Tracks Performance**: Monitors agent behavior and responses  
- **Provides Documentation**: Generates documentation and analysis reports
- **Supports Evaluation**: Enables quantitative assessment of agent capabilities

This integration allows us to treat our LangGraph agent like any other machine learning model in the ValidMind ecosystem, enabling comprehensive testing and validation workflows.

In [None]:
def agent_fn(input):
    """
    Invoke the simplified agent with the given input.
    """
    # Simplified initial state (removed routing fields)
    initial_state = {
        "user_input": input["input"],
        "messages": [HumanMessage(content=input["input"])],
        "session_id": input["session_id"],
        "context": {}
    }

    session_config = {"configurable": {"thread_id": input["session_id"]}}

    result = intelligent_agent.invoke(initial_state, config=session_config)

    return {"prediction": result['messages'][-1].content, "output": result}


vm_intelligent_model = vm.init_model(input_id="financial_model", predict_fn=agent_fn)
# Attach the compiled LangGraph agent so custom tests can reach it through `model.model`
vm_intelligent_model.model = intelligent_agent

## Prepare Sample Test Dataset

We'll create a small test dataset to evaluate our agent's performance across the two kinds of requests it is built to handle. This dataset includes:

**Test Cases**: User requests that each target one of the agent's two tools:
- **Document Search Requests**: Queries that should be handled by `search_engine`
- **General Assistance Requests**: Open-ended questions that should be handled by `task_assistant`

**Expected Outputs**: For each test case, we define:
- **Expected Tools**: Which tools the LLM should select
- **Possible Outputs**: Valid response patterns or values
- **Session IDs**: Unique identifiers for conversation tracking

This structured approach allows us to systematically evaluate both tool selection accuracy and response quality.

In [None]:
import pandas as pd
import uuid

# Simplified test dataset with only search_engine and task_assistant tools
test_dataset = pd.DataFrame([
    {
        "input": "Find our company's data privacy policy",
        "expected_tools": ["search_engine"],
        "possible_outputs": ["privacy_policy.pdf", "data_protection.doc", "company_privacy_guidelines.txt"],
        "session_id": str(uuid.uuid4())
    },
    {
        "input": "Search for loan approval procedures", 
        "expected_tools": ["search_engine"],
        "possible_outputs": ["loan_procedures.doc", "approval_process.pdf", "lending_guidelines.txt"],
        "session_id": str(uuid.uuid4())
    },
    {
        "input": "How should I prepare for a technical interview?",
        "expected_tools": ["task_assistant"],
        "possible_outputs": ["algorithms", "data structures", "system design", "coding practice"],
        "session_id": str(uuid.uuid4())
    },
    {
        "input": "Help me understand machine learning basics",
        "expected_tools": ["task_assistant"],
        "possible_outputs": ["supervised", "unsupervised", "neural networks", "training", "testing"],
        "session_id": str(uuid.uuid4())
    },
    {
        "input": "What can you do for me?",
        "expected_tools": ["task_assistant"],
        "possible_outputs": ["search documents", "provide assistance", "answer questions", "help with tasks"],
        "session_id": str(uuid.uuid4())
    },
    {
        "input": "Find technical documentation about API endpoints",
        "expected_tools": ["search_engine"],
        "possible_outputs": ["API_documentation.pdf", "REST_endpoints.doc", "technical_guide.txt"],
        "session_id": str(uuid.uuid4())
    },
    {
        "input": "Help me plan a team meeting agenda",
        "expected_tools": ["task_assistant"],
        "possible_outputs": ["objectives", "agenda", "participants", "materials", "time limits"],
        "session_id": str(uuid.uuid4())
    }
])

print("Simplified test dataset created!")
print(f"Number of test cases: {len(test_dataset)}")
print(f"Test tools: {test_dataset['expected_tools'].explode().unique()}")
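
Before running the agent, it can help to confirm that the test cases are internally consistent. The sketch below is one possible check, not part of ValidMind: `check_test_cases` and `sample_cases` are hypothetical names. It works on plain dictionaries with the same keys as the rows above.

```python
import uuid


def check_test_cases(cases, known_tools):
    """Return a list of problems; an empty list means the cases look consistent."""
    problems = []
    seen_sessions = set()
    for idx, case in enumerate(cases):
        unknown = [t for t in case["expected_tools"] if t not in known_tools]
        if unknown:
            problems.append(f"case {idx}: unknown tools {unknown}")
        if not case["possible_outputs"]:
            problems.append(f"case {idx}: no expected keywords")
        if case["session_id"] in seen_sessions:
            problems.append(f"case {idx}: duplicate session_id")
        seen_sessions.add(case["session_id"])
    return problems


# Illustrative cases; the second one deliberately names a tool that does not exist
sample_cases = [
    {"input": "Find our company's data privacy policy",
     "expected_tools": ["search_engine"],
     "possible_outputs": ["privacy"],
     "session_id": str(uuid.uuid4())},
    {"input": "Help me plan a team meeting agenda",
     "expected_tools": ["calendar"],
     "possible_outputs": ["agenda"],
     "session_id": str(uuid.uuid4())},
]

problems = check_test_cases(sample_cases, {"search_engine", "task_assistant"})
print(problems)
```

In this notebook the same check could be run with `test_dataset.to_dict("records")` and the names in `AVAILABLE_TOOLS`. Unique session IDs matter because the agent's checkpointer is keyed on `thread_id`, so reusing an ID would likely carry conversation history from one test case into another.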


In [None]:
# Display the simplified test dataset
print("Using simplified test dataset with only 2 tools:")
print(f"Number of test cases: {len(test_dataset)}")
print(f"Available tools being tested: {sorted(test_dataset['expected_tools'].explode().unique())}")
print("\nTest cases preview:")
for i, row in test_dataset.iterrows():
    print(f"{i+1}. {row['input']} -> Expected tool: {row['expected_tools'][0]}")


### Initialize ValidMind Dataset

Before we can run tests and evaluations, we need to initialize our test dataset as a ValidMind dataset object. 
This step is essential for integrating our agent evaluation into ValidMind's comprehensive testing and validation framework.


In [None]:
vm_test_dataset = vm.init_dataset(
    input_id="test_dataset",
    dataset=test_dataset,
    target_column="possible_outputs"
)

### Run Agent and Assign Predictions

Now we'll execute our agent on the test dataset and capture its responses for evaluation. This process generates the prediction data needed for comprehensive performance evaluation and comparison against expected outputs.

In [None]:
vm_test_dataset.assign_predictions(vm_intelligent_model)

#### Dataframe display settings

In [None]:
pd.set_option('display.width', 120)
pd.set_option('display.max_colwidth', None)  # show full cell contents
vm_test_dataset._df

## Visualization
This section visualizes the LangGraph agent's workflow structure using Mermaid diagrams.
The test below validates that the agent's architecture is properly structured by:
- Checking if the model has a valid LangGraph Graph object
- Generating a visual representation of component connections and flow
- Ensuring the graph can be properly rendered as a Mermaid diagram


In [None]:
import langgraph.graph.state  # binds `langgraph` and makes sure the submodule used below is loaded

@vm.test("my_custom_tests.LangGraphVisualization")
def LangGraphVisualization(model):
    """
    Visualizes the LangGraph workflow structure using Mermaid diagrams.
    
    ### Purpose
    Creates a visual representation of the LangGraph agent's workflow using Mermaid diagrams
    to show the connections and flow between different components. This helps validate that
    the agent's architecture is properly structured.
    
    ### Test Mechanism
    1. Retrieves the graph representation from the model using get_graph()
    2. Attempts to render it as a Mermaid diagram
    3. Returns the visualization and validation results
    
    ### Signs of High Risk
    - Failure to generate graph visualization indicates potential structural issues
    - Missing or broken connections between components
    - Invalid graph structure that cannot be rendered
    """
    try:
        if not hasattr(model, 'model') or not isinstance(model.model, langgraph.graph.state.CompiledStateGraph):
            return {
                'test_results': False,
                'summary': {
                    'status': 'FAIL', 
                    'details': 'Model must have a LangGraph Graph object as model attribute'
                }
            }
        graph = model.model.get_graph(xray=False)
        mermaid_png = graph.draw_mermaid_png()
        return mermaid_png
    except Exception as e:
        return {
            'test_results': False, 
            'summary': {
                'status': 'FAIL',
                'details': f'Failed to generate graph visualization: {str(e)}'
            }
        }

vm.tests.run_test(
    "my_custom_tests.LangGraphVisualization",
    inputs = {
        "model": vm_intelligent_model
    }
).log()
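
For reference, the topology that the rendered diagram should show can be written down by hand from the edges added in `create_intelligent_langgraph_agent()`. The sketch below emits simplified Mermaid text for those edges. It is an approximation: `EDGES` and `to_mermaid` are hypothetical names, and the text LangGraph actually generates contains additional styling.

```python
# Edges as wired in the agent; True marks an edge that comes from add_conditional_edges
EDGES = [
    ("__start__", "llm", False),
    ("llm", "tools", True),
    ("llm", "__end__", True),
    ("tools", "llm", False),
]


def to_mermaid(edges):
    """Render edges as Mermaid flowchart text, using dotted arrows for conditional edges."""
    lines = ["graph TD;"]
    for source, target, conditional in edges:
        arrow = "-.->" if conditional else "-->"
        lines.append(f"    {source} {arrow} {target};")
    return "\n".join(lines)


mermaid_text = to_mermaid(EDGES)
print(mermaid_text)
```

If the rendered PNG is missing the `tools` to `llm` edge or either conditional branch out of `llm`, the graph was not wired as intended.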

## Accuracy Test
The purpose of this test is to evaluate the agent's ability to provide accurate responses by:
- Testing against a dataset of predefined questions and expected answers
- Checking if responses contain expected keywords
- Providing detailed test results including pass/fail status
- Helping identify any gaps in the agent's knowledge or response quality

In [None]:
import pandas as pd
import validmind as vm

@vm.test("my_custom_tests.accuracy_test")
def accuracy_test(model, dataset, list_of_columns):
    """
    Check each agent response against its expected keywords.

    A test case passes when the response contains at least one of the expected
    keywords (case-insensitive substring match).
    """
    df = dataset._df

    # Expected keyword lists and agent responses for every test case
    y_true = dataset.y.tolist()
    y_pred = dataset.y_pred(model).tolist()

    # One pass/fail flag per test case
    test_results = [
        any(str(keyword).lower() in str(response).lower() for keyword in keywords)
        for response, keywords in zip(y_pred, y_true)
    ]

    results = pd.DataFrame()
    column_names = [col + "_details" for col in list_of_columns]
    results[column_names] = df[list_of_columns]
    results["actual"] = y_pred
    results["expected"] = y_true
    results["passed"] = test_results
    # Record an error message only for the rows that failed
    results["error"] = [
        None if passed else f"Response did not contain any expected keywords: {keywords}"
        for passed, keywords in zip(test_results, y_true)
    ]

    return results
   
result = vm.tests.run_test(
    "my_custom_tests.accuracy_test",
    inputs={
        "dataset": vm_test_dataset,
        "model": vm_intelligent_model
    },
    params={
        "list_of_columns": ["input"]
    }
)
result.log()
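
The pass/fail rule used by the test can be tried on plain strings. The sketch below isolates it in a hypothetical helper, `keyword_check`, and applies it to two made-up responses.

```python
def keyword_check(response, keywords):
    """Return (passed, error) for one response against its expected keywords."""
    text = str(response).lower()
    passed = any(str(keyword).lower() in text for keyword in keywords)
    error = None if passed else f"Response did not contain any expected keywords: {keywords}"
    return passed, error


# Made-up responses paired with keyword lists in the style of the test dataset
examples = [
    ("Start by defining objectives, then draft the agenda.", ["objectives", "agenda"]),
    ("I could not find a matching document.", ["privacy_policy.pdf", "data_protection.doc"]),
]
outcomes = [keyword_check(response, keywords) for response, keywords in examples]
for passed, error in outcomes:
    print(passed, error)
```

Note that several `possible_outputs` entries in the test dataset are file names such as `privacy_policy.pdf`, which do not occur in the mock `search_engine` documents. Those cases may fail this check even when the agent picks the right tool.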

## Tool Call Accuracy Test

This test evaluates how accurately the agent selects the correct tools for different user requests. It provides quantitative feedback on the agent's core capability: understanding what users need and choosing the right tool to help them.

In [None]:
import validmind as vm

# Test with a real LangGraph result instead of creating mock objects
@vm.test("my_custom_tests.ToolCallAccuracy")
def ToolCallAccuracy(dataset, agent_output_column, expected_tools_column):
    """Test validation using actual LangGraph agent results."""
    # Let's create a simpler validation without the complex RAGAS setup
    def validate_tool_calls_simple(messages, expected_tools):
        """Simple validation of tool calls without RAGAS dependency issues."""
        
        tool_calls_found = []
        
        for message in messages:
            if hasattr(message, 'tool_calls') and message.tool_calls:
                for tool_call in message.tool_calls:
                    # Handle both dictionary and object formats
                    if isinstance(tool_call, dict):
                        tool_calls_found.append(tool_call['name'])
                    else:
                        # ToolCall object - use attribute access
                        tool_calls_found.append(tool_call.name)
        
        # Check if expected tools were called
        accuracy = 0.0
        matches = 0
        if expected_tools:
            matches = sum(1 for tool in expected_tools if tool in tool_calls_found)
            accuracy = matches / len(expected_tools)
        
        return {
            'accuracy': accuracy,
            'expected_tools': expected_tools,
            'found_tools': tool_calls_found,
            'matches': matches,
            'total_expected': len(expected_tools) if expected_tools else 0
        }

    df = dataset._df
    
    results = []
    for i, row in df.iterrows():
        result = validate_tool_calls_simple(row[agent_output_column]['messages'], row[expected_tools_column])
        results.append(result)
         
    return results

vm.tests.run_test(
    "my_custom_tests.ToolCallAccuracy",
    inputs = {
        "dataset": vm_test_dataset,
    },
    params = {
        "agent_output_column": "output",
        "expected_tools_column": "expected_tools"
    }
)
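
The score computed per row is the fraction of expected tools that were actually called. The sketch below reproduces that arithmetic on hand-written inputs. `tool_call_accuracy` is an illustrative name, and the mean at the end is an extra aggregate that the test above does not compute.

```python
def tool_call_accuracy(expected_tools, found_tools):
    """Fraction of expected tools that appear among the tools the agent called."""
    if not expected_tools:
        return 0.0
    matches = sum(1 for tool in expected_tools if tool in found_tools)
    return matches / len(expected_tools)


per_case = [
    tool_call_accuracy(["search_engine"], ["search_engine"]),                     # 1.0
    tool_call_accuracy(["search_engine", "task_assistant"], ["search_engine"]),   # 0.5
    tool_call_accuracy(["task_assistant"], []),                                   # 0.0
    tool_call_accuracy(["task_assistant"], ["task_assistant", "search_engine"]),  # 1.0
]
mean_accuracy = sum(per_case) / len(per_case)
print(per_case, mean_accuracy)
```

The last case shows that the metric behaves like recall: calling an extra, unexpected tool does not lower the score. If over-calling tools should be penalized, a precision-style term would have to be added.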

## RAGAS Tests for Agent Evaluation

RAGAS (Retrieval-Augmented Generation Assessment) provides specialized metrics for evaluating retrieval-based conversational AI systems. We apply these metrics here to our LangGraph agent.

Our agent uses tools to retrieve information (document excerpts from `search_engine`, guidance from `task_assistant`) and generates responses based on that context, making it similar to a RAG system. RAGAS metrics help evaluate:

- **Response Quality**: How well the agent uses retrieved tool outputs to generate helpful responses
- **Information Faithfulness**: Whether agent responses accurately reflect tool outputs  
- **Relevance Assessment**: How well responses address the original user query
- **Context Utilization**: How effectively the agent incorporates tool results into final answers

These tests provide insights into how well our agent integrates tool usage with conversational abilities, ensuring it provides accurate, relevant, and helpful responses to users.


### Dataset Preparation - Extract Context from Agent State

Before running RAGAS tests, we need to extract and prepare the context information from our agent's execution results. This process:

**Tool Output Extraction**: Retrieves the outputs from tools used during agent execution
- **Message Parsing**: Analyzes the agent's conversation state to find tool outputs
- **Content Aggregation**: Combines outputs from multiple tools when used in sequence
- **Context Formatting**: Structures tool outputs as context for RAGAS evaluation

**RAGAS Format Preparation**: Converts agent data into the format expected by RAGAS metrics
- **User Input**: Original user queries from the test dataset
- **Retrieved Context**: Tool outputs treated as "retrieved" information  
- **Agent Response**: Final responses generated by the agent
- **Ground Truth**: Expected outputs for comparison

This preparation step is essential because RAGAS metrics were designed for traditional RAG systems, so we need to map our agent's tool-based architecture to the RAG paradigm for meaningful evaluation. 

In [None]:
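# `capture_tool_output_messages` comes from a local `utils` module that is expected to sit next to this notebook
from utils import capture_tool_output_messages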

tool_messages = []
for i, row in vm_test_dataset._df.iterrows():
    tool_message = ""
    result = row['output']
    # Capture all tool outputs and metadata
    captured_data = capture_tool_output_messages(result)
   
    # Access specific tool outputs
    for output in captured_data["tool_outputs"]:
        tool_message += output['content']
    tool_messages.append([tool_message])

vm_test_dataset._df['tool_messages'] = tool_messages
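
If the `utils` helper is not available, tool outputs can be pulled out of the agent result directly. The sketch below is an assumption about what such an extraction might look like, not the helper's actual implementation. It relies on LangChain messages exposing a `type` attribute (`"human"`, `"ai"`, `"tool"`) and uses a hypothetical `StubMessage` class so it runs without LangChain installed.

```python
from dataclasses import dataclass


@dataclass
class StubMessage:
    """Minimal stand-in for a LangChain message (only the fields used here)."""
    type: str
    content: str


def collect_tool_outputs(result):
    """Return the content of every tool message in an agent result, in order."""
    return [m.content for m in result["messages"] if getattr(m, "type", None) == "tool"]


stub_result = {
    "messages": [
        StubMessage("human", "Find our data privacy policy"),
        StubMessage("ai", ""),
        StubMessage("tool", "[POLICY] Data Privacy Policy: All personal data must be encrypted..."),
        StubMessage("ai", "The policy requires encryption of personal data."),
    ]
}
contexts = collect_tool_outputs(stub_result)
print(contexts)
```

Returning a list with one entry per tool message keeps each tool output as a separate retrieved context, whereas the cell above concatenates them into a single string per row.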

In [None]:
vm_test_dataset._df.head(2)

### Faithfulness

Faithfulness measures how accurately the agent's responses reflect the information retrieved from tools. This metric evaluates:

**Information Accuracy**: Whether the agent correctly uses tool outputs in its responses
- **Fact Preservation**: Ensuring document excerpts and task guidance returned by tools are reported accurately
- **No Hallucination**: Verifying the agent doesn't invent information not provided by tools
- **Source Attribution**: Checking that responses align with actual tool outputs

**Critical for Agent Trust**: Faithfulness is essential for agent reliability because users need to trust that:
- Document searches return real information
- Policy, technical, and legal excerpts are relayed without distortion
- Guidance from `task_assistant` is passed on as the tool produced it

In [None]:
vm.tests.run_test(
    "validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "response_column": ["financial_model_prediction"],
        "retrieved_contexts_column": ["tool_messages"],
    },
).log()
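
Conceptually, a faithfulness score is the share of claims in the response that the retrieved context supports. RAGAS uses an LLM to split the response into claims and to judge each one. The sketch below skips that step and takes hand-assigned verdicts, so it only illustrates the final ratio. `faithfulness_score` and the example claims are hypothetical, and returning `0.0` for an empty claim list is a choice made for this sketch.

```python
def faithfulness_score(claim_supported):
    """Share of response claims supported by the tool outputs.

    claim_supported: list of booleans, one verdict per claim.
    """
    if not claim_supported:
        return 0.0  # sketch-only convention for responses with no claims
    return sum(claim_supported) / len(claim_supported)


# Hand-assigned verdicts for a made-up response about the data privacy policy
claims = {
    "Personal data must be encrypted": True,        # stated in the tool output
    "Encryption keys rotate every 30 days": False,  # not present in the tool output
}
score = faithfulness_score(list(claims.values()))
print(score)
```

A score of 0.5 here means half of what the agent said could be traced back to the tool output, and the other half was added by the model.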

### Response Relevancy

Response Relevancy evaluates how well the agent's answers address the user's original question or request. This metric assesses:

**Query Alignment**: Whether responses directly answer what users asked for
- **Intent Fulfillment**: Checking if the agent understood and addressed the user's actual need
- **Completeness**: Ensuring responses provide sufficient information to satisfy the query
- **Focus**: Avoiding irrelevant information that doesn't help the user

**Conversational Quality**: Measures the agent's ability to maintain relevant, helpful dialogue
- **Context Awareness**: Responses should be appropriate for the conversation context
- **User Satisfaction**: Answers should be useful and actionable for the user
- **Clarity**: Information should be presented in a way that directly helps the user

High relevancy indicates the agent successfully understands user needs and provides targeted, helpful responses.

In [None]:
vm.tests.run_test(
    "validmind.model_validation.ragas.ResponseRelevancy",
    inputs={"dataset": vm_test_dataset},
    params={
        "user_input_column": "input",
        "response_column": "financial_model_prediction",
        "retrieved_contexts_column": "tool_messages",
    }
).log()

### Context Recall

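Context Recall measures how much of a reference answer is supported by the retrieved context. In this notebook the agent's own response is passed as the reference, so the score reflects how much of the response can be traced back to tool outputs. This metric evaluates: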

**Information Utilization**: Whether the agent effectively incorporates tool outputs into its responses
- **Coverage**: How much of the available tool information is used in the response
- **Integration**: How well tool outputs are woven into coherent, natural responses
- **Completeness**: Whether all relevant information from tools is considered

**Tool Effectiveness**: Assesses whether selected tools provide useful context for responses
- **Relevance**: Whether tool outputs actually help answer the user's question
- **Sufficiency**: Whether enough information was retrieved to generate good responses
- **Quality**: Whether the tools provided accurate, helpful information

High context recall indicates the agent not only selects the right tools but also effectively uses their outputs to create comprehensive, well-informed responses.

In [None]:
vm.tests.run_test(
    "validmind.model_validation.ragas.ContextRecall",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "retrieved_contexts_column": ["tool_messages"],
        "reference_column": ["financial_model_prediction"],
    },
).log()

### AspectCritic

AspectCritic provides comprehensive evaluation across multiple dimensions of agent performance. This metric analyzes various aspects of response quality:

**Multi-Dimensional Assessment**: Evaluates responses across different quality criteria
- **Helpfulness**: Whether responses genuinely assist users in accomplishing their goals
- **Relevance**: How well responses address the specific user query
- **Coherence**: Whether responses are logically structured and easy to follow
- **Correctness**: Accuracy of information and appropriateness of recommendations

**Holistic Quality Scoring**: Provides an overall assessment that considers:
- **User Experience**: How satisfying and useful the interaction would be for real users
- **Professional Standards**: Whether responses meet quality expectations for production systems
- **Consistency**: Whether the agent maintains quality across different types of requests

AspectCritic helps identify specific areas where the agent excels or needs improvement, providing actionable insights for enhancing overall performance and user satisfaction.

In [None]:
vm.tests.run_test(
    "validmind.model_validation.ragas.AspectCritic",
    inputs={"dataset": vm_test_dataset},
    param_grid={
        "user_input_column": ["input"],
        "response_column": ["financial_model_prediction"],
        "retrieved_contexts_column": ["tool_messages"],
    },
).log()