Summary
In a manual Chat UI test, the Experiment Agent derailed a read-only query into an unrelated what-if scenario proposal.
User prompt: "List the most recent model runs and tell me which has the lowest WAPE."
Observed wrong answer: "Proposed what-if for store 123 / product 456 toward the objective ''. Cut price 15% ..."
Investigation evidence
- Correct agent (experiment), fresh session, no stale context.
- Models at the time: primary ollama:llama3.1:8b, fallback ollama:qwen3:8b.
- The model first called the correct read tool: tool_list_runs({"status":"success"}), which returned run data including WAPE.
- The model then produced malformed structured output missing ExperimentReport.summary.
- A PydanticAI output-validation retry occurred.
- On the retry, the local 8B model derailed and called tool_propose_scenario with hallucinated store_id=123/product_id=456.
- The final answer summarized the unrelated scenario proposal.
Root cause
Two factors combined. First, the local 8B model is unreliable under PromptedOutput validation retries. Second, the experiment agent exposes scenario/write/planning tools during read-only tasks, with no guard telling the model (a) to stick to read tools for read-only intents and (b) that an output-format retry is a reformat, not a reason to start a new action.
Desired behavior
- Read-only queries answer using read-only tools only.
- Validation retries only reformat the previous result into ExperimentReport; they never call new scenario/write tools.
- Read-only intents that should never trigger scenario/write tools (unless the user explicitly asks to create/save/promote/archive/run something): top products, sales/revenue/units summaries, forecast summaries, registry aliases & deployment status, model-run & metric comparisons (WAPE/MAE/RMSE), backtest metrics, RAG/document questions.
- Ambiguous rankings (e.g. "top products") → ask a clarifying question ("Top by revenue, units sold, forecasted demand, or model error?").
- If no read-only tool exists for the requested metric, state the limitation rather than invent data.
- Never invent store_id/product_id/run_id values.
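The first two rules above could also be enforced outside the prompt by restricting which tools are offered per turn. The following is a minimal sketch under stated assumptions, not the agent's actual implementation: the function allowed_tools, the keyword heuristic, and the two tool sets are hypothetical, and only the tool names tool_list_runs and tool_propose_scenario come from this report.

```python
import re

# Tool names taken from this report; a real agent would register more of each kind.
READ_TOOLS = frozenset({"tool_list_runs"})
WRITE_TOOLS = frozenset({"tool_propose_scenario"})

# Hypothetical heuristic for "the user explicitly asks to write something".
# "run" only counts when followed by a determiner, so the noun "runs" does not match.
_EXPLICIT_WRITE = re.compile(
    r"\b(?:create|save|promote|archive)\b|\brun\s+(?:a|an|the|this)\b",
    re.IGNORECASE,
)


def allowed_tools(prompt: str, is_validation_retry: bool = False) -> frozenset:
    """Return the tool names the model may call on this turn."""
    if is_validation_retry:
        # A validation retry only reformats the previous result: offer no tools.
        return frozenset()
    if _EXPLICIT_WRITE.search(prompt):
        return READ_TOOLS | WRITE_TOOLS
    # Default: read-only intent, so only read tools are exposed.
    return READ_TOOLS
```

With this sketch, the prompt from the incident yields only the read tool, and a retry turn yields no tools at all. A keyword heuristic like this will misclassify some phrasings, so it would complement the prompt guard rather than replace it.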
Secondary validation gap
tool_propose_scenario accepted non-existent store/product IDs (123/456) and returned a normal proposal. It should reject non-existent store/product pairs with a clear, non-persistable validation error.
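The intended check can be sketched as follows. This is an illustration only: the signature, the known_pairs set, the saved_proposals list, and UnknownEntityError are all hypothetical stand-ins, since the report does not show how the real tool looks up entities or persists proposals.

```python
class UnknownEntityError(ValueError):
    """Raised when a store/product pair does not exist (hypothetical error type)."""


def propose_scenario(store_id, product_id, known_pairs, saved_proposals):
    """Build and persist a proposal only for an existing store/product pair.

    known_pairs: set of (store_id, product_id) tuples that exist (assumed lookup).
    saved_proposals: list standing in for whatever persistence the tool uses.
    """
    # Validate before any side effect so a failure persists nothing.
    if (store_id, product_id) not in known_pairs:
        raise UnknownEntityError(
            f"Unknown store/product pair: store_id={store_id}, product_id={product_id}"
        )
    proposal = {"store_id": store_id, "product_id": product_id}
    saved_proposals.append(proposal)
    return proposal
```

Raising before the proposal is built keeps the failure path free of side effects, which is what "non-persistable" requires here.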
Acceptance
- Generalized read-only intent guard added to the experiment-agent prompt.
- Regression tests covering the exact WAPE case and broader read-only questions (top products, highest forecasted demand, current deployment alias) — no live model calls.
- propose_scenario rejects non-existent entity pairs (123/456 covered explicitly); persists nothing on failure.
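One possible shape for the WAPE regression test, with no live model calls, is to replay a recorded tool-call log and assert that it contains no write tools. The sketch below assumes a log of (tool name, arguments) tuples; the helper names and the fixture run data are invented for illustration and are not values from the incident.

```python
import unittest

# Only tool_propose_scenario is named in this report; extend as needed.
WRITE_TOOLS = frozenset({"tool_propose_scenario"})


def write_tool_calls(call_log):
    """Return the names of write tools found in a recorded tool-call log."""
    return [name for name, _args in call_log if name in WRITE_TOOLS]


def lowest_wape(runs):
    """Pick the run with the smallest WAPE from stubbed tool_list_runs output."""
    return min(runs, key=lambda run: run["wape"])


class ReadOnlyWapeRegression(unittest.TestCase):
    # Made-up fixture rows standing in for tool_list_runs output.
    RUNS = [
        {"run_id": "run-a", "wape": 0.21},
        {"run_id": "run-b", "wape": 0.17},
    ]

    def test_incident_log_is_flagged(self):
        # Tool-call sequence observed in the incident.
        log = [
            ("tool_list_runs", {"status": "success"}),
            ("tool_propose_scenario", {"store_id": 123, "product_id": 456}),
        ]
        self.assertEqual(write_tool_calls(log), ["tool_propose_scenario"])

    def test_expected_log_is_read_only(self):
        log = [("tool_list_runs", {"status": "success"})]
        self.assertEqual(write_tool_calls(log), [])
        self.assertEqual(lowest_wape(self.RUNS)["run_id"], "run-b")


if __name__ == "__main__":
    unittest.main()
```

The same pattern extends to the broader read-only questions by adding one scripted log per prompt.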