
fix(agents): salvage experiment answer when weak model fails structured output #352

Merged: w7-mgfcode merged 2 commits into dev from fix/agents-finalizer-fallback on Jun 1, 2026

@w7-mgfcode
Owner

Root cause (from a full capture_run_messages trace, live ollama:llama3.1:8b)

msg[1] TOOLCALL tool_list_runs {"status":"success"}   ← real tool call
msg[2] TOOLRETURN → 10 runs incl. WAPE                ← data obtained
msg[3] TEXT: {"runs":[...],"total":10,"page":1}       ← model returns RAW tool data as its "answer"
msg[4] RETRY: summary: Field required                 ← not an ExperimentReport
msg[5] TOOLCALL tool_list_runs {} ... loops → "Exceeded maximum output retries (3)" → UnexpectedModelBehavior

The Experiment Agent's failure on weak local models is not a tool-loop the model chooses — it is structured-output (PromptedOutput(ExperimentReport)) incompatibility. llama3.1:8b calls the tool and gets the data, but emits the raw tool-result shape ({"runs":[...]}) instead of ExperimentReport{summary}; PromptedOutput rejects it and the run exhausts its output-retry budget. (qwen3:8b is worse — it emits the tool call itself as text.) No prompt change (#348, #349) can fix this — the model cannot produce the schema while juggling tools.
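
The failure mode described above can be reproduced without any model. The following is a minimal simulation using only the standard library, not the real framework code: `validate_report`, `run_with_output_retries`, and `OutputRetriesExceeded` are hypothetical stand-ins for the schema check, the output-retry loop, and `UnexpectedModelBehavior`, and the exact retry accounting of the real framework is assumed rather than confirmed.

```python
import json

MAX_OUTPUT_RETRIES = 3  # mirrors the "(3)" in the error message from the trace


class OutputRetriesExceeded(Exception):
    """Hypothetical stand-in for the framework's UnexpectedModelBehavior."""


def validate_report(raw_text: str) -> dict:
    """Accept only a JSON object that carries a string `summary` field."""
    data = json.loads(raw_text)  # malformed JSON raises ValueError as well
    if not isinstance(data, dict) or not isinstance(data.get("summary"), str):
        raise ValueError("summary: Field required")
    return data


def run_with_output_retries(next_reply) -> dict:
    """Ask for a reply until one validates or the retry budget is spent.

    `next_reply` is a zero-argument callable returning the model's next text.
    Assumption: one initial attempt plus MAX_OUTPUT_RETRIES retries.
    """
    for _ in range(1 + MAX_OUTPUT_RETRIES):
        try:
            return validate_report(next_reply())
        except ValueError:
            continue  # the real framework would send a retry prompt here
    raise OutputRetriesExceeded(
        f"Exceeded maximum output retries ({MAX_OUTPUT_RETRIES})"
    )


# A model that keeps echoing the raw tool result never satisfies the schema.
raw_tool_shape = '{"runs": [], "total": 10, "page": 1}'
try:
    run_with_output_retries(lambda: raw_tool_shape)
except OutputRetriesExceeded as exc:
    print(exc)
```

Because the reply is valid JSON but has the wrong shape, each retry fails the same way, which is why no amount of retrying helps when the model cannot produce the schema.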

Fix — graceful plain-text finalizer fallback

When an agent run raises UnexpectedModelBehavior and there is no pending HITL action to salvage, the service now extracts the tool results already obtained during the run (via capture_run_messages) and makes ONE tool-less, output_type=str follow-up call to the same model to answer the user's question from that data. Plain text is what weak models can produce — this converts the structured-output failure into a correct answer, on Ollama, with no cloud model.

  • Wired into both chat() and stream_chat() misbehavior handlers.
  • HITL approval salvage (#344, "fix(agents): Ollama chat HITL: sanitize null content and preserve pending approval") takes precedence and is untouched.
  • Finalizer has no tools (cannot loop) and str output (cannot fail schema validation); on any finalizer error it returns None and the caller emits the existing recoverable error — never a crash.
  • PromptedOutput(ExperimentReport) stays the happy path; cloud models are unaffected.
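
The fallback described in this list can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: `ToolReturn`, `Text`, `extract_tool_payloads`, `salvage_answer`, and the `ask_plain_text` callable are hypothetical names, and the real service reads the messages from `capture_run_messages` and calls the same model with `output_type=str` instead of a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolReturn:
    """Hypothetical stand-in for a tool-return part in the captured messages."""
    tool_name: str
    content: str


@dataclass
class Text:
    """Hypothetical stand-in for a plain text part emitted by the model."""
    content: str


def extract_tool_payloads(messages: list) -> list[str]:
    """Collect the content of every tool return seen during the failed run."""
    return [m.content for m in messages if isinstance(m, ToolReturn)]


def salvage_answer(
    question: str,
    messages: list,
    ask_plain_text: Callable[[str], str],
) -> Optional[str]:
    """Make ONE tool-less, plain-text follow-up call; return None on any problem."""
    payloads = extract_tool_payloads(messages)
    if not payloads:
        return None  # nothing to answer from: caller emits the recoverable error
    prompt = (
        "Answer the question using only the data below.\n"
        f"Question: {question}\n"
        "Data:\n" + "\n".join(payloads)
    )
    try:
        answer = ask_plain_text(prompt)
    except Exception:
        return None  # a finalizer failure must never crash the caller
    return answer.strip() or None


# Usage with a fake model, so no live call is needed.
history = [ToolReturn("tool_list_runs", '{"runs": []}'), Text('{"runs": []}')]
print(salvage_answer("Which run has the lowest WAPE?", history, lambda p: "No runs found."))
```

Returning `None` for every failure path keeps the decision in the caller: it either shows the salvaged text or falls back to the existing recoverable error.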

Tests (deterministic, no live model calls)

  • test_chat_finalizer_salvages_answer_on_misbehavior — on output-retry exhaustion with tool data, chat() returns the salvaged plain-text answer (not the generic error).
  • TestFinalizerSalvage::test_extract_tool_payloads_pulls_tool_returns / _empty_when_no_tool_returns — tool-return extraction.
  • test_salvage_returns_none_without_tool_data — no data → None → caller falls back to the recoverable error (existing test_chat_model_misbehavior_returns_friendly_message still passes).

Validation

  • ruff check . ✓ · ruff format --check
  • mypy app/ → only the pre-existing xgboost/lightgbm optional-extra import errors in untouched files
  • pyright (changed files) → 0 errors
  • pytest -m "not integration" → 1696 passed, 12 skipped

Notes

Closes #351

Contributor

@sourcery-ai sourcery-ai Bot left a comment


Sorry @w7-mgfcode, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

@coderabbitai

coderabbitai Bot commented Jun 1, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d6c9a551-40fa-481c-85e2-40437e466e10


@w7-mgfcode
Owner Author

Live verification on Ollama (llama3.1:8b)

Rebuilt the backend from this branch and ran the full AgentService.chat() path against the live model with the original failing prompt:

List the most recent model runs and tell me which has the lowest WAPE.

Before: Exceeded maximum output retries (3) → "invalid tool call" error.
After (this branch):

... tool_list_runs ×4 → Exceeded maximum output retries (3)
agents.chat_finalizer_salvage
MESSAGE: The most recent model run with the lowest WAPE is "2fad611b4cef41a0adbad1cc0f804859" with a WAPE of 18.930180701722712.

✅ No crash, and the answer is correct: 2fad611b (naive) = 18.93 is the true minimum (vs seasonal_naive 99.0, prophet_like 999.0).

Second commit (ae25be8) was needed: the first finalizer build returned a wrong lowest (99.0) because the 6000-char cap truncated the run list before the 18.93 run. Compacting the tool payload (dropping model_config/runtime_info/etc., keeping metrics) lets the finalizer see every run's WAPE, so the ranking is correct. Verified live above.
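
To illustrate why compaction matters under a character cap, here is a small sketch under stated assumptions. The 6000-character cap and the dropped `model_config` and `runtime_info` keys come from the comment above; the function names, the `run_id` field, and the shape of the payload are hypothetical and may differ from the actual commit.

```python
import json

MAX_PAYLOAD_CHARS = 6000  # cap mentioned in the comment above
DROP_KEYS = frozenset({"model_config", "runtime_info"})  # bulky keys named above


def compact_runs(payload: dict) -> dict:
    """Return a copy of the payload with bulky per-run keys removed."""
    compact = dict(payload)
    compact["runs"] = [
        {key: value for key, value in run.items() if key not in DROP_KEYS}
        for run in payload.get("runs", [])
    ]
    return compact


def to_prompt_text(payload: dict, limit: int = MAX_PAYLOAD_CHARS) -> str:
    """Serialize compactly, then hard-truncate to the character cap."""
    return json.dumps(payload, separators=(",", ":"))[:limit]


# Ten synthetic runs; the best one comes last, behind large config blobs.
runs = [
    {
        "run_id": f"run-{i}",
        "model_config": {"blob": "x" * 900},
        "runtime_info": {"host": "worker"},
        "metrics": {"wape": 99.0},
    }
    for i in range(9)
]
runs.append(
    {
        "run_id": "best",
        "model_config": {"blob": "x" * 900},
        "runtime_info": {"host": "worker"},
        "metrics": {"wape": 18.93},
    }
)
payload = {"runs": runs, "total": 10, "page": 1}

print("18.93" in to_prompt_text(payload))                # truncated before the best run
print("18.93" in to_prompt_text(compact_runs(payload)))  # every run's WAPE survives
```

Dropping keys before truncating means the cap cuts off far less of the run list, so a ranking question can be answered from the complete set of metrics.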

@w7-mgfcode w7-mgfcode merged commit 1b4c3f3 into dev Jun 1, 2026
7 of 8 checks passed