fix(agents): constrain experiment read-only queries to read tools #347

@w7-mgfcode

Summary

A manual Chat UI test against the Experiment Agent derailed a read-only query into an unrelated what-if scenario proposal.

User prompt: "List the most recent model runs and tell me which has the lowest WAPE."

Observed wrong answer: "Proposed what-if for store 123 / product 456 toward the objective ''. Cut price 15% ..."

Investigation evidence

  • Correct agent (experiment), fresh session, no stale context.
  • Models at the time: primary ollama:llama3.1:8b, fallback ollama:qwen3:8b.
  • The model first called the correct read tool: tool_list_runs({"status":"success"}), which returned run data including WAPE.
  • The model then produced malformed structured output missing ExperimentReport.summary.
  • A PydanticAI output-validation retry occurred.
  • On the retry, the local 8B model derailed and called tool_propose_scenario with hallucinated store_id=123/product_id=456.
  • The final answer summarized the unrelated scenario proposal.

Root cause

Two factors combined. First, the local 8B model is unreliable under a PromptedOutput validation retry. Second, the experiment agent exposes scenario/write/planning tools during a read-only task, with no guard telling the model (a) to stick to read tools for read-only intents and (b) that an output-format retry is a reformat, not a reason to start a new action.

Desired behavior

  • Read-only queries answer using read-only tools only.
  • Validation retries only reformat the previous result into ExperimentReport — they never call new scenario/write tools.
  • Read-only intents that should never trigger scenario/write tools (unless the user explicitly asks to create/save/promote/archive/run something): top products, sales/revenue/units summaries, forecast summaries, registry aliases & deployment status, model-run & metric comparisons (WAPE/MAE/RMSE), backtest metrics, RAG/document questions.
  • Ambiguous rankings (e.g. "top products") → ask a clarifying question ("Top by revenue, units sold, forecasted demand, or model error?").
  • If no read-only tool exists for the requested metric, state the limitation rather than invent data.
  • Never invent store_id/product_id/run_id values.
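
As an illustration of the first two bullets, here is a minimal sketch of a tool allow-list guard. It is not the agent's actual code: `allowed_tools`, the two tool sets, and the keyword heuristic are all hypothetical, and only `tool_list_runs` and `tool_propose_scenario` are names taken from this report. The regex is deliberately crude and would need a proper intent check in practice.

```python
import re

# Only these two tool names come from the report; a real agent would
# register more read tools. The grouping itself is an assumption.
READ_TOOLS = frozenset({"tool_list_runs"})
WRITE_TOOLS = frozenset({"tool_propose_scenario"})

# Hypothetical heuristic: the user explicitly asks to create/save/promote/
# archive something, or to "run a/the ..." job. The plural noun "runs"
# (as in "model runs") does not match.
_EXPLICIT_ACTION = re.compile(
    r"\b(create|save|promote|archive)\b|\brun\s+(a|an|the|this|that)\b",
    re.IGNORECASE,
)


def allowed_tools(user_prompt: str, is_validation_retry: bool) -> frozenset:
    """Return the tool names the model may call for this turn."""
    if is_validation_retry:
        # A retry only reformats the previous result, so no tools at all.
        return frozenset()
    if _EXPLICIT_ACTION.search(user_prompt):
        return READ_TOOLS | WRITE_TOOLS
    # Default: treat the request as read-only.
    return READ_TOOLS
```

With this sketch, the WAPE prompt from the report maps to the read tools only, and the retry turn gets no tools, so the derailment into `tool_propose_scenario` could not happen at the tool-dispatch level even if the prompt guard were ignored.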

Secondary validation gap

tool_propose_scenario accepted non-existent store/product IDs (123/456) and returned a normal proposal. It should reject non-existent store/product pairs with a clear, non-persistable validation error.
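
A rough sketch of the intended check, under stated assumptions: `ScenarioValidationError`, the `known_pairs` set, and the `saved` list are hypothetical stand-ins for whatever lookup and persistence the real tool uses, and the function signature is not the real one.

```python
class ScenarioValidationError(ValueError):
    """Raised when a scenario references a store/product pair that does not exist."""


def propose_scenario(store_id, product_id, known_pairs, saved):
    """Build a proposal only for a known (store_id, product_id) pair.

    known_pairs: set of existing (store_id, product_id) tuples (assumed
                 to come from a database lookup in the real tool).
    saved:       list standing in for persistence.
    """
    # Validate before building or storing anything, so a failure
    # leaves no partial state behind.
    if (store_id, product_id) not in known_pairs:
        raise ScenarioValidationError(
            f"Unknown store/product pair: store_id={store_id}, product_id={product_id}"
        )
    proposal = {"store_id": store_id, "product_id": product_id}
    saved.append(proposal)
    return proposal
```

Calling `propose_scenario(123, 456, {(1, 10)}, saved)` raises `ScenarioValidationError` and leaves `saved` untouched.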

Acceptance

  • Generalized read-only intent guard added to the experiment-agent prompt.
  • Regression tests covering the exact WAPE case and broader read-only questions (top products, highest forecasted demand, current deployment alias) — no live model calls.
  • propose_scenario rejects non-existent entity pairs (123/456 covered explicitly); persists nothing on failure.
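
One possible shape for the regression tests, sketched without any live model: the tests assert on a recorded tool-call trace. How the trace is produced (for example by a scripted stub model) is left out and is an assumption; `write_tool_calls` and the trace constants are hypothetical names. The derailed trace mirrors the tool calls listed under Investigation evidence.

```python
# Tool name taken from the report; the set would list every write tool.
WRITE_TOOLS = frozenset({"tool_propose_scenario"})


def write_tool_calls(trace):
    """Return the names of write tools found in a (tool_name, args) trace."""
    return [name for name, _args in trace if name in WRITE_TOOLS]


# Trace observed in the failing session: a read, then the stray proposal.
DERAILED_TRACE = [
    ("tool_list_runs", {"status": "success"}),
    ("tool_propose_scenario", {"store_id": 123, "product_id": 456}),
]

# Trace expected once the guard is in place.
EXPECTED_TRACE = [("tool_list_runs", {"status": "success"})]


def test_wape_query_stays_read_only():
    assert write_tool_calls(EXPECTED_TRACE) == []


def test_derailed_trace_is_detected():
    assert write_tool_calls(DERAILED_TRACE) == ["tool_propose_scenario"]
```

The same helper could cover the broader read-only questions (top products, highest forecasted demand, current deployment alias) by parametrizing over prompts and their recorded traces.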

Labels: bug
