# NPS Agent with OpenTelemetry Tracing & Agent-as-a-Judge

Query the National Parks Service using LlamaStack + MCP, with **OpenTelemetry** tracing exported to MLflow via OTLP, and automated evaluation.

**Prerequisites:**
- LlamaStack server on `localhost:8321`
- NPS MCP server on `localhost:3005`
- MLflow server running: `mlflow server --backend-store-uri sqlite:///mlflow.db --port 5001`
- `OPENAI_API_KEY` in environment

In [1]:
import json
import os
import time
from dotenv import load_dotenv

# Load .env from parent directory (agents_tracing-eval_mlflow/.env)
env_path = os.path.join(os.path.dirname(os.getcwd()), ".env")
load_dotenv(env_path)

import mlflow
from mlflow.genai.judges import make_judge
from llama_stack_client import LlamaStackClient
from typing import Literal

from opentelemetry import trace as otel_trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter



## Configuration

In [2]:
# Configuration
LLAMA_STACK_URL = "http://localhost:8321/"
NPS_MCP_URL = "http://localhost:3005/sse/"
MODEL_ID = "openai/gpt-4o"      # LlamaStack model identifier (provider/model)
JUDGE_MODEL = "openai:/gpt-4o"  # MLflow judge model URI (provider:/model)

In [3]:
MLFLOW_TRACKING_URI = os.environ.get("MLFLOW_TRACKING_URI", "http://127.0.0.1:5001")
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
experiment = mlflow.set_experiment("nps-agent")
print(f"MLflow tracking: {MLFLOW_TRACKING_URI}")

MLflow tracking: http://127.0.0.1:5001


## OpenTelemetry Tracing Setup

Initialize a `TracerProvider` with an OTLP HTTP exporter that sends traces to the MLflow server.

- **Endpoint**: `{MLFLOW_TRACKING_URI}/v1/traces`
- **Header**: `x-mlflow-experiment-id` tells MLflow which experiment the traces belong to

In [4]:
# Init OTel tracing → OTLP export to MLflow
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(
    endpoint=f"{MLFLOW_TRACKING_URI.rstrip('/')}/v1/traces",
    headers={"x-mlflow-experiment-id": experiment.experiment_id},
)))
otel_trace.set_tracer_provider(tracer_provider)
tracer = otel_trace.get_tracer("nps-agent")
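
The endpoint and header construction can be pulled into a small helper so it is easy to check in isolation. This is a minimal sketch; the helper name `otlp_export_config` is hypothetical and not part of MLflow or OpenTelemetry. It only reproduces the string handling from the cell above.

```python
def otlp_export_config(tracking_uri: str, experiment_id: str) -> tuple[str, dict[str, str]]:
    """Return the (endpoint, headers) pair used for OTLP export to MLflow."""
    # Strip trailing slashes so "http://host:5001/" does not yield "//v1/traces".
    endpoint = f"{tracking_uri.rstrip('/')}/v1/traces"
    # Header values are sent as strings, so coerce the experiment ID explicitly.
    headers = {"x-mlflow-experiment-id": str(experiment_id)}
    return endpoint, headers


endpoint, headers = otlp_export_config("http://127.0.0.1:5001/", "1")
print(endpoint)
print(headers)
```

The returned values could then be passed to `OTLPSpanExporter(endpoint=..., headers=...)` exactly as in the cell above.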

## Agent Function

`query_nps` sends the prompt to LlamaStack with the NPS MCP tools attached. It wraps the call in OTel spans via `tracer.start_as_current_span()` and records inputs and outputs as span attributes so the judge can inspect them.

In [5]:
def query_nps(prompt: str, model: str = MODEL_ID) -> tuple[str, str]:
    """Query the National Parks Service agent. Returns (response_text, trace_id)."""
    client = LlamaStackClient(base_url=LLAMA_STACK_URL)

    with tracer.start_as_current_span("query_nps") as root:
        root.set_attribute("input.question", prompt)
        root.set_attribute("mlflow.spanInputs", json.dumps({"prompt": prompt}))

        with tracer.start_as_current_span("mcp_tool_call") as span:
            span.set_attribute("model", model)
            span.set_attribute("input.prompt", prompt)
            response = client.responses.create(
                model=model,
                input=prompt,
                tools=[{"type": "mcp", "server_url": NPS_MCP_URL, "server_label": "NPS tools"}],
            )
            span.set_attribute("response.id", response.id)
            span.set_attribute("response.status", response.status)

            # Record tool calls from the response so the judge can see them
            for i, output in enumerate(response.output):
                if hasattr(output, "type"):
                    span.set_attribute(f"output.{i}.type", output.type)
                if hasattr(output, "name"):
                    span.set_attribute(f"output.{i}.name", output.name)
                if hasattr(output, "arguments") and output.arguments:
                    span.set_attribute(f"output.{i}.arguments", str(output.arguments)[:1000])

        # Extract text response
        result = ""
        for output in response.output:
            if output.type in ("text", "message") and hasattr(output, "content") and output.content:
                result = output.content[0].text
                break

        root.set_attribute("output.response", result[:4000])
        root.set_attribute("mlflow.spanOutputs", json.dumps({"response": result[:4000]}))
        trace_id = format(root.get_span_context().trace_id, "032x")

    return result, trace_id
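
To see which span attributes the loop over `response.output` produces, the flattening can be tried on stand-in objects without a running LlamaStack server. This is a sketch under assumptions: `output_attributes` is a hypothetical helper, the `SimpleNamespace` items are fakes rather than real LlamaStack response objects, and the `mcp_call` type and `state_code` argument are illustrative values only. Unlike the cell above, it skips fields whose value is `None`.

```python
from types import SimpleNamespace


def output_attributes(outputs, max_arg_len: int = 1000) -> dict[str, str]:
    """Flatten response output items into span-attribute key/value pairs."""
    attrs = {}
    for i, item in enumerate(outputs):
        for field in ("type", "name"):
            value = getattr(item, field, None)
            if value is not None:
                attrs[f"output.{i}.{field}"] = str(value)
        arguments = getattr(item, "arguments", None)
        if arguments:
            # Truncate long argument payloads, as the agent function does.
            attrs[f"output.{i}.arguments"] = str(arguments)[:max_arg_len]
    return attrs


# Hypothetical stand-ins for response items (not real API objects).
fake_outputs = [
    SimpleNamespace(type="mcp_call", name="search_parks", arguments='{"state_code": "RI"}'),
    SimpleNamespace(type="message", content=[SimpleNamespace(text="Here are some parks...")]),
]
for key, value in output_attributes(fake_outputs).items():
    print(f"{key} = {value}")
```

Keys are indexed by output position, so the judge can tell which tool call came first when it reads the span.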

## Agent-as-a-Judge

The agent-as-a-judge is an LLM agent that evaluates the NPS agent's trace after execution. Rather than looking only at inputs and outputs, it uses tools to inspect the full execution:
- What spans were created
- What tools were called
- How long each step took

The `{{ trace }}` template variable in the instructions tells MLflow to give the judge these inspection tools.

In [6]:
# Agent-as-a-Judge scorer
nps_judge = make_judge(
    name="nps_agent_evaluator",
    instructions=(
        "Evaluate the NPS agent's performance in {{ trace }}.\n\n"
        "Check for:\n"
        "1. Response Quality: Did the agent correctly identify parks and provide accurate information?\n"
        "2. Tool Usage: Were the correct NPS MCP tools used (search_parks, get_park_events, etc.)?\n"
        "3. Completeness: Did the agent answer all parts of the user's question?\n\n"
        "Rate as: 'good', 'acceptable', or 'poor'"
    ),
    feedback_value_type=Literal["good", "acceptable", "poor"],
    model=JUDGE_MODEL,
)
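
The `Literal` type passed as `feedback_value_type` restricts the judge's rating to three labels. The sketch below illustrates that constraint with the standard library only. It is not how MLflow validates feedback internally, and `Rating` and `is_valid_rating` are hypothetical names.

```python
from typing import Literal, get_args

# Same label set as the judge's feedback_value_type.
Rating = Literal["good", "acceptable", "poor"]


def is_valid_rating(value: str) -> bool:
    """Return True if value is one of the labels allowed by Rating."""
    return value in get_args(Rating)


print(get_args(Rating))
print(is_valid_rating("good"), is_valid_rating("excellent"))
```

A check like this can be useful when post-processing evaluation results, for example to flag rows where the judge returned something outside the expected labels.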

## Run Agent & Evaluate

1. Send a query to the NPS agent (OTel traces sent via OTLP)
2. Flush traces to MLflow and wait for ingestion
3. Load traces via `mlflow.search_traces()`
4. Evaluate with `mlflow.genai.evaluate()`
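
Step 2 relies on a fixed two-second sleep for ingestion. One alternative is to poll until the trace appears or a timeout expires. The sketch below shows the polling pattern with the standard library only; `poll_until` and `fake_lookup` are hypothetical names, and in this notebook the `fetch` callable would presumably wrap `mlflow.search_traces()` plus the ID filter.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], Optional[T]],
    timeout_s: float = 10.0,
    interval_s: float = 0.5,
) -> Optional[T]:
    """Call fetch() until it returns a non-None value or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_s)


# Stand-in for a trace lookup that succeeds on the third attempt.
attempts = {"count": 0}


def fake_lookup() -> Optional[str]:
    attempts["count"] += 1
    return "trace-found" if attempts["count"] >= 3 else None


print(poll_until(fake_lookup, timeout_s=2.0, interval_s=0.01))
```

Polling returns as soon as the trace is visible and fails explicitly on timeout, whereas a fixed sleep can be either too short or needlessly long.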

In [8]:
prompt = "Tell me about some parks in Rhode Island, and let me know if there are any upcoming events at them."

result, trace_id = query_nps(prompt)
print(f"Response:\n{result}")
print(f"Trace ID: {trace_id}")

# Flush OTel traces to MLflow
print("\nFlushing traces to MLflow...")
tracer_provider.force_flush()
time.sleep(2)

# Load traces from MLflow and evaluate with agent-as-a-judge
print("Loading traces from MLflow...")
traces_df = mlflow.search_traces(locations=[experiment.experiment_id])

# Keep only the trace produced in this run. Try each candidate ID column;
# if none matches, the unfiltered DataFrame is kept.
for col in ("request_id", "client_request_id", "trace_id"):
    if not traces_df.empty and col in traces_df.columns:
        match_ids = [trace_id] if col != "trace_id" else [f"tr-{trace_id}"]
        filtered = traces_df[traces_df[col].isin(match_ids)]
        if not filtered.empty:
            traces_df = filtered
            break

print(f"Found {len(traces_df)} trace(s) matching this run.\n")

if traces_df.empty:
    print(
        "ERROR: No traces found. Make sure MLflow server is running:\n"
        "  mlflow server --backend-store-uri sqlite:///mlflow.db --port 5001\n"
    )
else:
    print("=== Running agent-as-a-judge evaluation ===\n")
    results = mlflow.genai.evaluate(
        data=traces_df,
        scorers=[nps_judge],
    )
    print("--- Evaluation results ---")
    for name, table in results.tables.items():
        print(f"\n{name}:")
        for _, row in table.iterrows():
            for col in table.columns:
                if "nps_agent_evaluator" in str(col):
                    print(f"  {col}: {row[col]}")

Response:
Here are some of the national parks in Rhode Island and their upcoming events:

### Blackstone River Valley National Historical Park
- **Description**: The Blackstone River was pivotal to America's entry into the Age of Industry, starting with Samuel Slater's cotton spinning mill in Pawtucket, RI. It's a place where you can learn about how this industrial revolution changed the landscape and lives in the Blackstone Valley and beyond.
- **Website**: [More Info](https://www.nps.gov/blrv/index.htm)

**Upcoming Events**:
1. **Revolutionary War Pension Files Transcription Event**
   - **Date**: February 10 and February 20, 2026
   - **Venue**: Carpenter Museum and Upton Community Center in Massachusetts
   - **Description**: Participate in transcribing Revolutionary War Pension Files. It's free and open to all.
   
2. **Old Slater Mill Tour**
   - **Description**: A guided tour of Slater Mill, the start of the American Industrial Revolution. The tour is 30 minutes long and begins 

2026/02/09 13:53:04 INFO mlflow.models.evaluation.utils.trace: Auto tracing is temporarily enabled during the model evaluation for computing some metrics and debugging. To disable tracing, call `mlflow.autolog(disable=True)`.


Loading traces from MLflow...
Found 1 trace(s) matching this run.

=== Running agent-as-a-judge evaluation ===




--- Evaluation results ---

eval_results:
  nps_agent_evaluator/value: good


## View Traces in MLflow UI

The MLflow server should already be running, since OTLP export requires it:

```bash
# If not already running:
mlflow server --backend-store-uri sqlite:///mlflow.db --port 5001
```

Then open http://localhost:5001 in your browser.

### How to Navigate

1. **Select the Experiment** - Click on `nps-agent` in the left sidebar
2. **Go to Traces tab** - Click the "Traces" tab to see all agent executions
3. **View Trace Details** - Click on any Trace ID to open the trace detail view
   - You'll see the span hierarchy showing the agent execution (`query_nps` → `mcp_tool_call`)
   - Click on individual spans to see inputs/outputs for each step
4. **View Assessments** - In the trace detail view, look for the assessments side-panel on the right
   - This shows the Agent-as-a-Judge evaluation results.
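
To locate this run's trace in the Traces tab, it helps to know how the printed trace ID relates to the ID MLflow stores. The sketch below assumes the `tr-` prefix convention that the filtering code in the evaluation cell relies on; both helper names are hypothetical and the example integer is arbitrary.

```python
def otel_trace_id_to_hex(trace_id: int) -> str:
    """Format a 128-bit OpenTelemetry trace ID as 32 lowercase hex characters."""
    return format(trace_id, "032x")


def mlflow_trace_id(hex_trace_id: str) -> str:
    """Prefix the hex ID with "tr-" (assumed MLflow convention, see lead-in)."""
    return f"tr-{hex_trace_id}"


example = 0x5B8EFFF798038103D269B633813FC60C  # arbitrary example value
hex_id = otel_trace_id_to_hex(example)
print(hex_id)
print(mlflow_trace_id(hex_id))
```

Zero-padding to 32 characters matters because small trace IDs would otherwise produce shorter strings that fail to match the stored ID.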