# MCP Tools Evaluation with flex-evals

This notebook demonstrates how to evaluate LLM tool-calling capabilities using the [flex-evals](https://github.com/shane-kercheval/flex-evals) framework.

We'll verify that `gpt-4.1-mini` correctly predicts tool arguments for our MCP tools (e.g., `edit_content`), then execute those predictions and verify the actual results.

## Prerequisites

Before running this notebook, ensure your `.env` file contains the following settings, then complete the numbered steps:
- `VITE_DEV_MODE=true` (bypasses auth)
- `OPENAI_API_KEY=sk-...`

1. **Start Docker containers** (PostgreSQL + Redis):
   ```bash
   make docker-up
   ```

2. **Run database migrations** (if needed):
   ```bash
   make migrate
   ```

3. **Start the API server** (port 8000):
   ```bash
   make run
   ```

4. **Start the Content MCP server** (port 8001) in a separate terminal:
   ```bash
   make content-mcp-server
   ```

5. **Install Node.js** (provides `npx`, which runs `mcp-remote`):
   - `mcp-remote` needs no separate install; `npx` downloads it automatically on first use
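
Before moving on, it can help to confirm that both servers are actually listening. The sketch below is a hypothetical preflight helper (the name `port_is_open` and the 1-second timeout are assumptions, not part of the project). It only checks that a TCP connection can be opened on the ports listed above, not that the services behind them are healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unresolved hosts
        return False


if __name__ == "__main__":
    # Ports taken from the prerequisites above: API on 8000, Content MCP on 8001
    for name, port in [("API server", 8000), ("Content MCP server", 8001)]:
        state = "up" if port_is_open("localhost", port) else "NOT reachable"
        print(f"{name} (port {port}): {state}")
```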

## Setup

In [1]:
import json
from dotenv import load_dotenv

load_dotenv()

# Dummy token - works with VITE_DEV_MODE=true
PAT_TOKEN = "bm_devtoken"

# MCP Server URL
CONTENT_MCP_URL = "http://localhost:8001/mcp"

In [2]:
from sik_llms.mcp_manager import MCPClientManager

mcp_config = {
    "mcpServers": {
        "content": {
            "command": "npx",
            "args": [
                "mcp-remote",
                CONTENT_MCP_URL,
                "--header",
                f"Authorization: Bearer {PAT_TOKEN}",
            ],
        },
    },
}
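
If more MCP servers are added later, the same `mcp-remote` wiring can be generated rather than copied. A minimal sketch, assuming every server accepts the same bearer token; `build_mcp_config` is a hypothetical helper, not part of `sik_llms`:

```python
def build_mcp_config(servers: dict[str, str], token: str) -> dict:
    """Build an mcpServers config that proxies each URL through `npx mcp-remote`."""
    return {
        "mcpServers": {
            name: {
                "command": "npx",
                "args": [
                    "mcp-remote",
                    url,
                    "--header",
                    f"Authorization: Bearer {token}",
                ],
            }
            for name, url in servers.items()
        },
    }


# Same structure as the config above, for the single Content MCP server
config = build_mcp_config({"content": "http://localhost:8001/mcp"}, "bm_devtoken")
```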

## End-to-End Eval: Tool Prediction + Execution

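This eval runs each test case end to end:
1. Creates a note whose content contains a typo
2. Searches the note for the typo with `search_in_content`, as an agent would
3. Builds a realistic prompt from the search result (no hardcoded answer)
4. Gets the LLM's tool prediction
5. Executes the predicted tool
6. Verifies both the prediction and the final content

The test cases define expected values, and the checks reference them via JSONPath, so the same checks apply unchanged to every test case.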
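
To make the JSONPath idea concrete, the sketch below resolves a dotted path such as `$.test_case.expected.tool_name` against an evaluation context. It is a simplified illustration only: `resolve_path` is a hypothetical helper that handles plain dotted keys, and it is not how flex-evals resolves paths internally.

```python
from typing import Any


def resolve_path(path: str, context: dict[str, Any]) -> Any:
    """Resolve a simple '$.a.b.c' path by walking nested dict keys."""
    if not path.startswith("$."):
        raise ValueError(f"Expected a path starting with '$.', got {path!r}")
    value: Any = context
    for key in path[2:].split("."):
        value = value[key]  # A KeyError here means the path does not exist
    return value


# Hypothetical context mirroring the paths used by the checks later in this notebook
context = {
    "test_case": {"expected": {"tool_name": "edit_content"}},
    "output": {"value": {"tool_prediction": {"tool_name": "edit_content"}}},
}

actual = resolve_path("$.output.value.tool_prediction.tool_name", context)
expected = resolve_path("$.test_case.expected.tool_name", context)
print(actual == expected)
```

Because both paths resolve to `edit_content`, the comparison prints `True`. Each test case supplies its own `expected` values, which is why one check definition can serve all of them.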

In [None]:
# Define test case data - each case has content with a typo and expected outcomes
TEST_CASES_DATA = [
    {
        "content": "This docu needs to be updated.\n\nSee the documentation for more details.",
        "typo": "docu",  # What to search for
        "expected": {
            "tool_name": "edit_content",
            "old_str_contains": "docu",
            "new_str_contains": "document",
            # "document" alone would already match "documentation" in the original content
            "final_must_contain": "document needs",
            "final_must_not_contain": "documentmentation",  # Corruption check
        },
        "description": "Fix 'docu' -> 'document' without corrupting 'documentation'",
    },
    {
        "content": "The recieve function handles incoming data.\n\nUsers will receive a confirmation email.",  # noqa: E501
        "typo": "recieve",  # Common misspelling
        "expected": {
            "tool_name": "edit_content",
            "old_str_contains": "recieve",
            "new_str_contains": "receive",
            "final_must_contain": "receive function",
            "final_must_not_contain": "receivee",  # Corruption check
        },
        "description": "Fix 'recieve' -> 'receive' without corrupting second 'receive'",
    },
]
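
Since each check is only as good as the test data behind it, it can be worth validating the cases before spending API calls on them. The sketch below is a hypothetical `validate_case` helper (not part of flex-evals) that flags missing keys, a typo that does not occur in the content, and expected strings that the original content already satisfies or violates.

```python
REQUIRED_EXPECTED_KEYS = {
    "tool_name",
    "old_str_contains",
    "new_str_contains",
    "final_must_contain",
    "final_must_not_contain",
}


def validate_case(case: dict) -> list[str]:
    """Return a list of problems found in one test case (empty if none)."""
    problems = []
    expected = case.get("expected", {})
    missing = REQUIRED_EXPECTED_KEYS - expected.keys()
    if missing:
        problems.append(f"missing expected keys: {sorted(missing)}")
    content = case.get("content", "")
    typo = case.get("typo")
    if not typo or typo not in content:
        problems.append("typo does not occur in content")
    # A must-contain phrase already present before the edit cannot detect a missed fix
    must_contain = expected.get("final_must_contain")
    if must_contain and must_contain in content:
        problems.append("final_must_contain already present in original content")
    # A forbidden phrase present before the edit would fail regardless of the model
    must_not = expected.get("final_must_not_contain")
    if must_not and must_not in content:
        problems.append("final_must_not_contain already present in original content")
    return problems


# Hypothetical case used only to demonstrate the helper
sample_case = {
    "content": "Teh cat sat on the mat.",
    "typo": "Teh",
    "expected": {
        "tool_name": "edit_content",
        "old_str_contains": "Teh",
        "new_str_contains": "The",
        "final_must_contain": "The cat",
        "final_must_not_contain": "Thee",
    },
}
print(validate_case(sample_case))
```

An empty list means the case passed every sanity check; any returned strings describe what to fix before running the eval.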

### Helper Functions

In [None]:
import httpx
from sik_llms import create_client, RegisteredClients, user_message
from flex_evals import TestCase, Output, ExactMatchCheck, ContainsCheck, evaluate


async def get_tool_prediction(prompt: str, tools: list) -> dict:
    """Get the LLM's tool prediction for a given prompt."""
    client = create_client(
        client_type=RegisteredClients.OPENAI_TOOLS,
        model_name="gpt-4.1-mini",
        tools=tools,
    )
    response = await client.run_async(messages=[user_message(prompt)])

    if response.tool_prediction:
        return {
            "tool_name": response.tool_prediction.name,
            "arguments": response.tool_prediction.arguments,
        }
    return {"tool_name": None, "arguments": {}}


async def run_eval_case(mcp, case_data: dict, tools: list) -> dict:  # noqa: ANN001
    """
    Run a single eval case end-to-end.

    1. Create note with content
    2. Search for the typo
    3. Get tool prediction with realistic prompt
    4. Execute the tool
    5. Get final content
    6. Delete the note

    Returns dict with all data for evaluation.
    """
    # Create the note
    create_result = await mcp.call_tool("create_note", {
        "title": "Eval Test Note",
        "content": case_data["content"],
        "tags": ["eval-test"],
    })
    note_data = json.loads(create_result.content[0].text)
    note_id = note_data["id"]

    try:
        # Search for the typo (like an agent would)
        search_result = await mcp.call_tool("search_in_content", {
            "id": note_id,
            "type": "note",
            "query": case_data["typo"],
        })
        search_response = json.dumps(search_result.structuredContent, indent=2)

        # Build a realistic prompt - agent knows there's a typo but doesn't know the fix
        prompt = f"""I found a typo in this note. Please fix it.

Note ID: {note_id}
Type: note

search_in_content result for "{case_data["typo"]}":
```json
{search_response}
```

Fix the typo."""

        # Get tool prediction
        prediction = await get_tool_prediction(prompt, tools)

        # Execute the tool if it's edit_content
        tool_result = None
        final_content = None

        if prediction["tool_name"] == "edit_content":
            edit_result = await mcp.call_tool("edit_content", prediction["arguments"])
            if not edit_result.isError:
                tool_result = edit_result.structuredContent

                # Get final content
                get_result = await mcp.call_tool("get_content", {"id": note_id, "type": "note"})
                if not get_result.isError:
                    final_content = get_result.structuredContent.get("content")

        return {
            "note_id": note_id,
            "prompt": prompt,
            "tool_prediction": prediction,
            "tool_result": tool_result,
            "final_content": final_content,
        }
    finally:
        # Clean up: delete the note
        async with httpx.AsyncClient() as client:
            await client.delete(
                f"http://localhost:8000/notes/{note_id}?permanent=true",
                headers={"Authorization": f"Bearer {PAT_TOKEN}"},
            )
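
The prompt construction inside `run_eval_case` can also be pulled out into a pure function, which makes it testable without a running server. A sketch under that assumption: `build_prompt` is a hypothetical name, and the Markdown code fence is assembled from single backticks only to keep this example readable.

```python
import json


def build_prompt(note_id: str, typo: str, search_result: dict) -> str:
    """Build the typo-fix prompt from a note ID and a search_in_content result."""
    fence = "`" * 3  # Markdown code fence, built at runtime
    search_response = json.dumps(search_result, indent=2)
    return (
        "I found a typo in this note. Please fix it.\n\n"
        f"Note ID: {note_id}\n"
        "Type: note\n\n"
        f'search_in_content result for "{typo}":\n'
        f"{fence}json\n{search_response}\n{fence}\n\n"
        "Fix the typo."
    )


# Hypothetical inputs; a real call would pass search_result.structuredContent
print(build_prompt("note-123", "docu", {"matches": [], "total_matches": 0}))
```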

In [5]:
# Example: Run a single eval case to see the output structure
async with MCPClientManager(mcp_config) as mcp:
    tools = mcp.get_tools()
    example_result = await run_eval_case(mcp, TEST_CASES_DATA[0], tools)

print("Example eval case result:")
print(json.dumps(example_result, indent=2))

Example eval case result:
{
  "note_id": "019bdea3-ae64-7ef4-a85c-de5bc1ee2b8d",
  "prompt": "I found a typo in this note. Please fix it.\n\nNote ID: 019bdea3-ae64-7ef4-a85c-de5bc1ee2b8d\nType: note\n\nsearch_in_content result for \"docu\":\n```json\n{\n  \"matches\": [\n    {\n      \"field\": \"content\",\n      \"line\": 1,\n      \"context\": \"This docu needs to be updated.\\n\\nSee the documentation for more details.\"\n    },\n    {\n      \"field\": \"content\",\n      \"line\": 3,\n      \"context\": \"This docu needs to be updated.\\n\\nSee the documentation for more details.\"\n    }\n  ],\n  \"total_matches\": 2\n}\n```\n\nFix the typo.",
  "tool_prediction": {
    "tool_name": "edit_content",
    "arguments": {
      "id": "019bdea3-ae64-7ef4-a85c-de5bc1ee2b8d",
      "type": "note",
      "old_str": "This docu needs to be updated.",
      "new_str": "This document needs to be updated."
    }
  },
  "tool_result": {
    "response_type": "minimal",
    "match_type": "exact"

### Run Eval Cases

In [6]:
# Run all eval cases and collect results
eval_results = []

async with MCPClientManager(mcp_config) as mcp:
    tools = mcp.get_tools()

    for case_data in TEST_CASES_DATA:
        print(f"Running: {case_data['description']}")
        result = await run_eval_case(mcp, case_data, tools)
        eval_results.append(result)

        print(f"  Tool: {result['tool_prediction']['tool_name']}")
        print(f"  Args: {json.dumps(result['tool_prediction']['arguments'], indent=4)}")
        print(f"  Final: {result['final_content']!r}")
        print()

Running: Fix 'docu' -> 'document' without corrupting 'documentation'
  Tool: edit_content
  Args: {
    "id": "019bdea3-b987-71c8-9199-611a7616e364",
    "type": "note",
    "old_str": "This docu needs to be updated.",
    "new_str": "This document needs to be updated."
}
  Final: 'This document needs to be updated.\n\nSee the documentation for more details.'

Running: Fix 'recieve' -> 'receive' without corrupting second 'receive'
  Tool: edit_content
  Args: {
    "id": "019bdea3-c0e4-772e-9090-fa764b94a30a",
    "type": "note",
    "old_str": "The recieve function handles incoming data.",
    "new_str": "The receive function handles incoming data."
}
  Final: 'The receive function handles incoming data.\n\nUsers will receive a confirmation email.'
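
Each test case also defines `final_must_not_contain`, which none of the checks in the next section reference. A minimal sketch of applying it directly to the collected results in plain Python: `find_corruption` is a hypothetical helper, and it treats a missing `final_content` as a failure on the assumption that no edit was applied.

```python
from typing import Optional


def find_corruption(expected: dict, final_content: Optional[str]) -> Optional[str]:
    """Return a failure reason, or None if the final content looks uncorrupted."""
    if final_content is None:
        return "no final content (the edit was not applied)"
    forbidden = expected["final_must_not_contain"]
    if forbidden in final_content:
        return f"final content contains forbidden text {forbidden!r}"
    return None


# Values copied from the first test case and its run output above
expected = {"final_must_not_contain": "documentmentation"}
final = "This document needs to be updated.\n\nSee the documentation for more details."
print(find_corruption(expected, final))
```

For this run the helper returns `None`, meaning no corruption was found. In the notebook it could be applied to each pair from `zip(TEST_CASES_DATA, eval_results)`, passing `case_data["expected"]` and `result["final_content"]`.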



### Evaluate Results

In [None]:
# Build TestCases and Outputs from our results
test_cases = []
outputs = []

# strict=True (Python 3.10+) raises if a test case is missing its result
for case_data, result in zip(TEST_CASES_DATA, eval_results, strict=True):
    test_cases.append(TestCase(
        input=result["prompt"],
        expected=case_data["expected"],
        metadata={"description": case_data["description"]},
    ))
    outputs.append(Output(
        value=result,
        metadata={"model": "gpt-4.1-mini"},
    ))

# Define checks that reference expected values via JSONPath
# Note: phrases takes a single JSONPath string (not a list) - it gets converted to a list
# internally
checks = [
    # Verify correct tool was selected
    ExactMatchCheck(
        actual="$.output.value.tool_prediction.tool_name",
        expected="$.test_case.expected.tool_name",
    ),
    # Verify old_str contains what we expect
    ContainsCheck(
        text="$.output.value.tool_prediction.arguments.old_str",
        phrases="$.test_case.expected.old_str_contains",
    ),
    # Verify new_str contains what we expect
    ContainsCheck(
        text="$.output.value.tool_prediction.arguments.new_str",
        phrases="$.test_case.expected.new_str_contains",
    ),
    # Verify final content contains the fix
    ContainsCheck(
        text="$.output.value.final_content",
        phrases="$.test_case.expected.final_must_contain",
    ),
]

results = await evaluate(test_cases, outputs, checks)

print("Evaluation Results:")
print(f"  Status: {results.status}")
print(f"  Total: {results.summary.total_test_cases}, Completed: {results.summary.completed_test_cases}, Errors: {results.summary.error_test_cases}")  # noqa: E501

for i, result in enumerate(results.results):
    desc = test_cases[i].metadata.get("description", f"Case {i+1}")
    print(f"\n{desc}:")
    for check_result in result.check_results:
        passed = check_result.results.get('passed', False)
        status = "PASS" if passed else "FAIL"
        print(f"  [{status}] {check_result.check_type}")
        if not passed:
            print(f"         resolved_arguments: {json.dumps(check_result.resolved_arguments, indent=12)}")  # noqa: E501
            print(f"         results: {check_result.results}")

Evaluation Results:
  Status: completed
  Total: 2, Completed: 2, Errors: 0

Fix 'docu' -> 'document' without corrupting 'documentation':
  [PASS] exact_match
  [PASS] contains
  [PASS] contains
  [PASS] contains

Fix 'recieve' -> 'receive' without corrupting second 'receive':
  [PASS] exact_match
  [PASS] contains
  [PASS] contains
  [PASS] contains
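
With more test cases, the per-check PASS/FAIL listing gets long, so an aggregate per check type becomes useful. A sketch using plain `(check_type, passed)` tuples as a stand-in for the flex-evals result objects; `summarize` is a hypothetical helper:

```python
from collections import defaultdict


def summarize(outcomes: list[tuple[str, bool]]) -> dict[str, tuple[int, int]]:
    """Map each check type to (passed_count, total_count)."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for check_type, passed in outcomes:
        counts[check_type][0] += int(passed)
        counts[check_type][1] += 1
    return {check_type: (p, t) for check_type, (p, t) in counts.items()}


# Outcomes transcribed from the output above: 2 cases x (1 exact_match + 3 contains)
outcomes = [("exact_match", True)] * 2 + [("contains", True)] * 6
for check_type, (passed, total) in summarize(outcomes).items():
    print(f"{check_type}: {passed}/{total} passed")
```

In the notebook, the tuples could be collected inside the reporting loop as `(check_result.check_type, check_result.results.get("passed", False))`.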
