fix(agents): stop experiment read-only tool-call loop #350
Root cause
Follow-up to #347 / PR #348. The read-only intent guard correctly stops the Experiment Agent from derailing into scenario/write tools, but on a weak local model (ollama:llama3.1:8b) read-only queries then loop on the read tool and never finish, surfacing as "invalid tool call".
Live evidence (session 3b5f965b…, three queries, identical pattern):
- tool_list_runs → it returns the data (10 successful runs incl. WAPE) → model re-calls tool_list_runs 3 more times (tool_call_count 1→2→3→4) → "Exceeded maximum output retries (3)" → UnexpectedModelBehavior → error event.
- … naive run 2fad611b… (18.93).
The #348 guard correctly closed the propose_scenario "escape hatch" the model previously used to emit something (a wrong but complete answer), so the weak model now loops to retry exhaustion instead. Pure llama3.1:8b structured-output weakness: not a regression, not a data issue.
Fix summary
Prompt-only hardening of READ_ONLY_INTENT_GUARD (agents/base.py), adding a new "FINISH IN ONE PASS — do not loop" section:
- … ExperimentReport.summary from what it returned.
- … tool_list_runs loop).
- … summary ("No model runs found.") instead of retrying.
No tool surfaces added, no mutation surfaces widened, HITL gates untouched, no API contract change.
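To make the failure mode and the intended post-fix behaviour concrete, here is a toy simulation in plain Python. It is only a sketch: it does not use the agent framework, and `fake_list_runs`, `looping_model`, `one_pass_model`, `run_agent`, `RetryBudgetExceeded` and the run IDs are hypothetical names invented for illustration. The retry accounting is a simplification chosen so that the counts line up with the evidence above (four tool calls, then exhaustion of a budget of 3). It is not the framework's actual logic.

```python
class RetryBudgetExceeded(RuntimeError):
    """Toy stand-in for the framework error raised on retry exhaustion."""

    def __init__(self, tool_call_count: int, max_retries: int) -> None:
        super().__init__(f"Exceeded maximum output retries ({max_retries})")
        self.tool_call_count = tool_call_count


def fake_list_runs() -> list[dict]:
    # Hypothetical payload; the real tool's return shape is not shown in the PR.
    return [
        {"run_id": "run-a", "model": "naive", "wape": 18.93},
        {"run_id": "run-b", "model": "other", "wape": 25.0},
    ]


def looping_model(results: list) -> tuple:
    # Weak-model behaviour: keeps asking for the tool, never answers.
    return "call_tool", None


def one_pass_model(results: list) -> tuple:
    # Guarded behaviour: call the tool at most once, then summarize.
    if not results:
        return "call_tool", None
    runs = results[-1]
    if not runs:
        # Empty tool result: summarize instead of retrying.
        return "final", "No model runs found."
    best = min(runs, key=lambda r: r["wape"])  # lower WAPE is better
    return "final", f"Best run: {best['run_id']} (WAPE {best['wape']})"


def run_agent(model, tool=fake_list_runs, max_retries: int = 3) -> tuple:
    """Drive the toy model until it answers or the retry budget runs out.

    Returns (summary, tool_call_count). Every tool request made after the
    tool has already returned counts against the retry budget.
    """
    results: list = []
    retries = 0
    while True:
        action, summary = model(results)
        if action == "final":
            return summary, len(results)
        if results:
            retries += 1
            if retries > max_retries:
                raise RetryBudgetExceeded(len(results), max_retries)
        results.append(tool())
```

With `looping_model`, the tool is called four times before the budget is exhausted. With `one_pass_model`, the run ends after a single tool call, including when the tool returns an empty list.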
Tests added
app/features/agents/tests/test_read_only_guard.py:
- test_guard_forbids_tool_call_loops: asserts "FINISH IN ONE PASS", "AT MOST ONCE", "NEVER call a tool again that has already returned", "STOP calling tools".
- test_guard_handles_empty_tool_result: asserts the empty-result → summarize (don't retry) rule.
Deterministic, no live model calls. The existing test_prompts_only_reference_registered_tool_names invariant and test_guard_is_delivered_in_system_prompt_to_model continue to pass (guard reaches the model).
Validation results
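For readers who want a picture of the guard tests listed under "Tests added", a deterministic phrase check of that kind might look roughly like the sketch below. The guard text here is a hypothetical stand-in: the real READ_ONLY_INTENT_GUARD lives in agents/base.py and its wording is not reproduced in this description. Only the four quoted phrases come from the PR. The `missing_phrases` helper and the exact assertion in the empty-result test are assumptions.

```python
# Hypothetical stand-in for the real guard in agents/base.py. Only the four
# phrases in REQUIRED_PHRASES are taken from the PR description.
READ_ONLY_INTENT_GUARD = """\
FINISH IN ONE PASS
- Call each read tool AT MOST ONCE per question.
- NEVER call a tool again that has already returned a result.
- Once a tool has returned, STOP calling tools and write the summary.
- If a tool returns nothing, answer "No model runs found." instead of retrying.
"""

REQUIRED_PHRASES = (
    "FINISH IN ONE PASS",
    "AT MOST ONCE",
    "NEVER call a tool again that has already returned",
    "STOP calling tools",
)


def missing_phrases(prompt: str, required=REQUIRED_PHRASES) -> list[str]:
    """Return the required phrases that do not appear in the prompt."""
    return [phrase for phrase in required if phrase not in prompt]


def test_guard_forbids_tool_call_loops() -> None:
    # Pure string check: no model call, so the test is deterministic.
    assert missing_phrases(READ_ONLY_INTENT_GUARD) == []


def test_guard_handles_empty_tool_result() -> None:
    # Assumption: the empty-result rule is detectable via its fallback summary.
    assert "No model runs found." in READ_ONLY_INTENT_GUARD
    assert "instead of retrying" in READ_ONLY_INTENT_GUARD
```

Checking for literal phrases keeps the tests fast and independent of any model, at the cost of coupling them to the prompt's exact wording.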
- ruff check . → All checks passed · ruff format --check . → 334 files formatted
- mypy app/ → only the pre-existing xgboost/lightgbm optional-extra import errors in untouched files; none in changed files
- pyright (changed files) → 0 errors
- pytest -m "not integration" → 1692 passed, 12 skipped
Notes / limitations
- … summary in the common case, but cannot fully guarantee structured-output reliability on a weak local model. For consistently robust structured analytical queries, a cloud model remains the stronger option (not changed here).
Closes #349