fix(agents): stop experiment read-only tool-call loop #350

Merged: w7-mgfcode merged 1 commit into dev from fix/agents-read-tool-loop-guard (Jun 1, 2026)

@w7-mgfcode
Owner

Root cause

Follow-up to #347 / PR #348. The read-only intent guard correctly stops the Experiment Agent from derailing into scenario/write tools — but on a weak local model (ollama:llama3.1:8b) read-only queries then loop the read tool and never finish, surfacing as "invalid tool call".

Live evidence (session 3b5f965b…, three queries, identical pattern):

  • Model calls tool_list_runs → it returns the data (10 successful runs incl. WAPE) → model re-calls tool_list_runs 3 more times (tool_call_count 1→2→3→4) → "Exceeded maximum output retries (3)" (UnexpectedModelBehavior) → error event.
  • Zero scenario/write tool calls: the guard from #348 (fix(agents): constrain experiment read-only queries) works; the derail is gone.
  • The answer was available throughout: lowest WAPE = naive run 2fad611b… (18.93).

The #348 guard correctly closed the propose_scenario "escape hatch" the model previously used to emit something (a wrong but complete answer), so the weak model now loops to retry-exhaustion instead. Pure llama3.1:8b structured-output weakness — not a regression, not a data issue.
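
The re-call pattern above can be spotted mechanically in a persisted session trace. The following is a minimal sketch, not code from this PR: the event dictionaries and their field names (`tool`, `tool_call_count`) are assumptions about how a trace might be shaped, and `repeated_read_calls` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical trace: one dict per persisted tool call.
# Field names are assumptions, not the project's actual schema.
events = [
    {"tool": "tool_list_runs", "tool_call_count": 1},
    {"tool": "tool_list_runs", "tool_call_count": 2},
    {"tool": "tool_list_runs", "tool_call_count": 3},
    {"tool": "tool_list_runs", "tool_call_count": 4},
]


def repeated_read_calls(trace):
    """Return {tool_name: n_calls} for every tool called more than once."""
    counts = Counter(event["tool"] for event in trace)
    return {name: n for name, n in counts.items() if n > 1}


print(repeated_read_calls(events))  # {'tool_list_runs': 4}
```

A healthy read-only session would produce an empty dict here, since each read tool is expected to run once.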

Fix summary

Prompt-only hardening of READ_ONLY_INTENT_GUARD (agents/base.py) — a new "FINISH IN ONE PASS — do not loop" section:

  • Call each read-only tool at most once per question.
  • The moment a read tool returns, STOP calling tools and write the ExperimentReport.summary from what it returned.
  • NEVER re-call a tool that already returned (the exact 4× tool_list_runs loop).
  • If a read tool returns an empty result, say so in the summary ("No model runs found.") instead of retrying.
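
For orientation, the new section might look roughly like the sketch below. The wording is illustrative only: it is assembled from the bullets above and from the phrases the tests assert on, and the real text in agents/base.py may differ. `FINISH_IN_ONE_PASS` and `with_one_pass_rule` are hypothetical names.

```python
# Illustrative wording only; the actual section in agents/base.py may differ.
FINISH_IN_ONE_PASS = """\
FINISH IN ONE PASS: do not loop
- Call each read-only tool AT MOST ONCE per question.
- The moment a read tool returns, STOP calling tools and write
  ExperimentReport.summary from what it returned.
- NEVER call a tool again that has already returned.
- If a read tool returns an empty result, say so in the summary
  (for example "No model runs found.") instead of retrying.
"""


def with_one_pass_rule(existing_guard: str) -> str:
    """Append the one-pass section to an existing guard prompt (hypothetical helper)."""
    return existing_guard.rstrip() + "\n\n" + FINISH_IN_ONE_PASS
```

Keeping the rule as plain prompt text is what makes the change prompt-only: nothing in the tool registry or the agent loop has to change.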

No tool surfaces added, no mutation surfaces widened, HITL gates untouched, no API contract change.

Tests added

app/features/agents/tests/test_read_only_guard.py:

  • test_guard_forbids_tool_call_loops — asserts "FINISH IN ONE PASS", "AT MOST ONCE", "NEVER call a tool again that has already returned", "STOP calling tools".
  • test_guard_handles_empty_tool_result — asserts the empty-result → summarize (don't retry) rule.

Deterministic, no live model calls. The existing test_prompts_only_reference_registered_tool_names invariant and test_guard_is_delivered_in_system_prompt_to_model continue to pass (guard reaches the model).
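
A sketch of what such deterministic checks could look like follows. It is not the test file from this PR: `GUARD` is a stand-in string so the example runs on its own (the real tests would presumably import READ_ONLY_INTENT_GUARD from agents/base.py), and the exact assertions in the empty-result test are assumptions.

```python
# Stand-in for READ_ONLY_INTENT_GUARD so the sketch is self-contained.
GUARD = (
    "FINISH IN ONE PASS: do not loop. "
    "Call each read-only tool AT MOST ONCE per question. "
    "When a read tool returns, STOP calling tools and write the summary. "
    "NEVER call a tool again that has already returned. "
    'If a tool returns an empty result, say so ("No model runs found.").'
)


def test_guard_forbids_tool_call_loops() -> None:
    # Phrases taken from the PR description above.
    for phrase in (
        "FINISH IN ONE PASS",
        "AT MOST ONCE",
        "NEVER call a tool again that has already returned",
        "STOP calling tools",
    ):
        assert phrase in GUARD


def test_guard_handles_empty_tool_result() -> None:
    # Assumed checks: the guard names the empty case and gives a fallback sentence.
    assert "empty result" in GUARD
    assert "No model runs found." in GUARD
```

Because the assertions only inspect a string constant, they run under pytest without a model or network access.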

Validation results

  • ruff check . → All checks passed · ruff format --check . → 334 files formatted
  • mypy app/ → only the pre-existing xgboost/lightgbm optional-extra import errors in untouched files; none in changed files
  • pyright (changed files) → 0 errors
  • pytest -m "not integration" → 1692 passed, 12 skipped

Notes / limitations

  • Prompt-level mitigation. It directly targets the observed re-call loop and should let the 8B model terminate with a summary in the common case, but cannot fully guarantee structured-output reliability on a weak local model. For consistently robust structured analytical queries, a cloud model remains the stronger option (not changed here).
  • No live model/agent calls were made; verification is via the persisted logs/DB and deterministic tests.

Closes #349
