fix(agents): salvage experiment answer when weak model fails structured output by w7-mgfcode · Pull Request #352 · w7-mgfcode/ForecastLabAI

w7-mgfcode · 2026-06-01T03:16:35Z

Root cause (from a full `capture_run_messages` trace, live `ollama:llama3.1:8b`)

msg[1] TOOLCALL tool_list_runs {"status":"success"}   ← real tool call
msg[2] TOOLRETURN → 10 runs incl. WAPE                ← data obtained
msg[3] TEXT: {"runs":[...],"total":10,"page":1}       ← model returns RAW tool data as its "answer"
msg[4] RETRY: summary: Field required                 ← not an ExperimentReport
msg[5] TOOLCALL tool_list_runs {} ... loops → "Exceeded maximum output retries (3)" → UnexpectedModelBehavior

The Experiment Agent's failure on weak local models is not a tool-loop the model chooses — it is structured-output (PromptedOutput(ExperimentReport)) incompatibility. llama3.1:8b calls the tool and gets the data, but emits the raw tool-result shape ({"runs":[...]}) instead of ExperimentReport{summary}; PromptedOutput rejects it and the run exhausts its output-retry budget. (qwen3:8b is worse — it emits the tool call itself as text.) No prompt change (#348, #349) can fix this — the model cannot produce the schema while juggling tools.
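The msg[3] → msg[4] step in the trace can be illustrated with a minimal, stdlib-only sketch. It does not use pydantic or `PromptedOutput`; `REQUIRED_FIELDS` and `missing_fields` are hypothetical stand-ins, and the only detail taken from the trace is that `ExperimentReport` requires a `summary` field.

```python
import json

# Assumption: only `summary` is known to be required (from the retry message in the trace).
REQUIRED_FIELDS = ("summary",)


def missing_fields(raw_text: str) -> list[str]:
    """Return the required ExperimentReport fields absent from the model's JSON text."""
    data = json.loads(raw_text)
    return [name for name in REQUIRED_FIELDS if name not in data]


# Shape of what llama3.1:8b emitted at msg[3]: the tool result echoed back as the answer.
raw_tool_echo = '{"runs": [], "total": 10, "page": 1}'
# Smallest payload that would satisfy the required field.
valid_report = '{"summary": "Run 2fad611b has the lowest WAPE."}'

echo_errors = missing_fields(raw_tool_echo)    # non-empty: triggers an output retry
report_errors = missing_fields(valid_report)   # empty: would be accepted
```

Each rejected output consumes one output retry, which is how the run reaches "Exceeded maximum output retries (3)" even though the data was already fetched.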

Fix — graceful plain-text finalizer fallback

When an agent run raises UnexpectedModelBehavior and there is no pending HITL action to salvage, the service now extracts the tool results already obtained during the run (via capture_run_messages) and makes ONE tool-less, output_type=str follow-up call to the same model to answer the user's question from that data. Plain text is what weak models can produce — this converts the structured-output failure into a correct answer, on Ollama, with no cloud model.

Wired into both chat() and stream_chat() misbehavior handlers.
HITL approval salvage (#344) takes precedence and is untouched.
Finalizer has no tools (cannot loop) and str output (cannot fail schema validation); on any finalizer error it returns None and the caller emits the existing recoverable error — never a crash.
PromptedOutput(ExperimentReport) stays the happy path; cloud models are unaffected.
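The precedence described in the list above could be sketched roughly as follows. This is not the service code: `handle_misbehavior`, `RECOVERABLE_ERROR`, the approval string, and the injected `finalizer` callable are all hypothetical names, and the exception class is a local stand-in for pydantic-ai's `UnexpectedModelBehavior` so the sketch runs with the stdlib only.

```python
from typing import Callable, Optional


class UnexpectedModelBehavior(Exception):
    """Local stand-in for the pydantic-ai exception of the same name."""


# Hypothetical wording; the real recoverable error message is not shown in this PR.
RECOVERABLE_ERROR = "The model could not complete this request. Please try again."


def handle_misbehavior(
    pending_hitl_action: Optional[str],
    tool_payloads: list[str],
    question: str,
    finalizer: Callable[[str, list[str]], Optional[str]],
) -> str:
    """Choose a response after an agent run raised UnexpectedModelBehavior."""
    # 1. HITL approval salvage takes precedence and is left untouched.
    if pending_hitl_action is not None:
        return f"approval required: {pending_hitl_action}"
    # 2. Without tool data there is nothing to answer from.
    if not tool_payloads:
        return RECOVERABLE_ERROR
    # 3. One tool-less, plain-text follow-up call; any failure degrades to the error.
    try:
        answer = finalizer(question, tool_payloads)
    except Exception:
        return RECOVERABLE_ERROR
    return answer or RECOVERABLE_ERROR


def broken_finalizer(question: str, payloads: list[str]) -> Optional[str]:
    raise UnexpectedModelBehavior("finalizer failed")


question = "Which run has the lowest WAPE?"
salvaged = handle_misbehavior(None, ['{"runs": []}'], question, lambda q, p: "plain-text answer")
no_data = handle_misbehavior(None, [], question, lambda q, p: "never called")
failed = handle_misbehavior(None, ['{"runs": []}'], question, broken_finalizer)
hitl = handle_misbehavior("delete_run", ['{"runs": []}'], question, lambda q, p: "ignored")
```

Passing the finalizer in as a callable keeps the decision logic testable without a live model, which matches the fully mocked test approach used in this PR.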

Tests (deterministic, no live model calls)

test_chat_finalizer_salvages_answer_on_misbehavior — on output-retry exhaustion with tool data, chat() returns the salvaged plain-text answer (not the generic error).
TestFinalizerSalvage::test_extract_tool_payloads_pulls_tool_returns / _empty_when_no_tool_returns — tool-return extraction.
test_salvage_returns_none_without_tool_data — no data → None → caller falls back to the recoverable error (existing test_chat_model_misbehavior_returns_friendly_message still passes).
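For readers without the diff open, the tool-return extraction exercised by these tests might look something like the sketch below. `Part` and `Message` are simplified stand-ins for the message objects that `capture_run_messages` yields, and the `"tool-return"` kind string is an assumption about how such parts are tagged; the real helper may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """Simplified stand-in for one part of a captured message."""
    part_kind: str
    content: str = ""


@dataclass
class Message:
    """Simplified stand-in for a captured request/response message."""
    parts: list[Part] = field(default_factory=list)


def extract_tool_payloads(messages: list[Message]) -> list[str]:
    """Collect the content of every tool-return part, in conversation order."""
    return [
        part.content
        for message in messages
        for part in message.parts
        if part.part_kind == "tool-return"
    ]


tool_json = '{"runs": [], "total": 10, "page": 1}'
trace = [
    Message([Part("tool-call")]),               # msg[1]: real tool call
    Message([Part("tool-return", tool_json)]),  # msg[2]: data obtained
    Message([Part("text", tool_json)]),         # msg[3]: raw data echoed as the answer
]

with_data = extract_tool_payloads(trace)
without_data = extract_tool_payloads([Message([Part("text", "hello")])])
```

Only the tool-return part is picked up; the model's echoed text at msg[3] is ignored, so the finalizer works from what the tool actually returned.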

Validation

ruff check . ✓ · ruff format --check ✓
mypy app/ → only the pre-existing xgboost/lightgbm optional-extra import errors in untouched files
pyright (changed files) → 0 errors
pytest -m "not integration" → 1696 passed, 12 skipped

Notes

This is the durable, Ollama-compatible fix the previous prompt-only mitigations (#348/#350, #349/#350) could not achieve. Diagnosed via a capture_run_messages reproduction; no production agent call was needed for the fix, and tests are fully mocked.
Follows #347/#348 (read-only guard) and #349/#350 (one-pass rule), which remain valuable (they stop the derail into scenario tools); this PR handles the residual structured-output failure.

Closes #351

coderabbitai · 2026-06-01T03:16:42Z

Review skipped: auto reviews are disabled on base/target branches other than the default branch.

w7-mgfcode · 2026-06-01T03:24:32Z

Live verification on Ollama (`llama3.1:8b`)

Rebuilt the backend from this branch and ran the full AgentService.chat() path against the live model with the original failing prompt:

List the most recent model runs and tell me which has the lowest WAPE.

Before: Exceeded maximum output retries (3) → "invalid tool call" error.
After (this branch):

... tool_list_runs ×4 → Exceeded maximum output retries (3)
agents.chat_finalizer_salvage
MESSAGE: The most recent model run with the lowest WAPE is "2fad611b4cef41a0adbad1cc0f804859" with a WAPE of 18.930180701722712.

✅ No crash, and the answer is correct — 2fad611b (naive) = 18.93 is the true minimum (vs seasonal_naive 99.0, prophet_like 999.0).

Second commit (ae25be8) was needed: the first finalizer build returned a wrong lowest (99.0) because the 6000-char cap truncated the run list before the 18.93 run. Compacting the tool payload (dropping model_config/runtime_info/etc., keeping metrics) lets the finalizer see every run's WAPE, so the ranking is correct. Verified live above.
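The truncation problem and the compaction fix can be reproduced with synthetic data. This is a sketch under stated assumptions: the 6000-char cap and the dropped `model_config`/`runtime_info` keys come from the comment above, while `compact_runs`, `capped`, `KEEP_KEYS`, and the `run_id`/`model_type` field names are hypothetical.

```python
import json

MAX_CHARS = 6000  # cap mentioned above
# Assumption: which keys survive compaction; the source only says metrics are kept.
KEEP_KEYS = ("run_id", "model_type", "metrics")


def compact_runs(payload: dict) -> str:
    """Drop bulky per-run fields, keeping identifying fields and metrics."""
    slim = [
        {key: run[key] for key in KEEP_KEYS if key in run}
        for run in payload.get("runs", [])
    ]
    return json.dumps({"runs": slim}, separators=(",", ":"))


def capped(text: str, limit: int = MAX_CHARS) -> str:
    """Apply the character cap by plain truncation."""
    return text[:limit]


def make_run(run_id: str, model_type: str, wape: float) -> dict:
    """Build a synthetic run with large config blobs, as in a verbose tool result."""
    return {
        "run_id": run_id,
        "model_type": model_type,
        "metrics": {"wape": wape},
        "model_config": {"blob": "x" * 900},
        "runtime_info": {"blob": "y" * 900},
    }


# Synthetic data: nine poor runs first, the best run last.
runs = [make_run(f"run-{i}", "seasonal_naive", 99.0) for i in range(9)]
runs.append(make_run("run-best", "naive", 18.93))
payload = {"runs": runs, "total": 10, "page": 1}

raw_view = capped(json.dumps(payload))      # best run is cut off by the cap
compact_view = capped(compact_runs(payload))  # every run's WAPE fits under the cap
best = min(json.loads(compact_view)["runs"], key=lambda r: r["metrics"]["wape"])
```

Compacting before capping is the important ordering: truncating the raw JSON first silently removes whichever runs happen to come last.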

fix(agents): salvage plain-text answer when structured output fails (#351) · 57cc894

sourcery-ai Bot reviewed Jun 1, 2026

fix(agents): compact tool data for finalizer to fix metric ranking (#351) · ae25be8

w7-mgfcode merged commit 1b4c3f3 into dev Jun 1, 2026
7 of 8 checks passed

This was referenced Jun 1, 2026

fix(agents): stop experiment read-only tool-call loop on weak models #349

Closed

fix(agents): salvage experiment answer when weak model fails structured output #351

Closed

w7-mgfcode merged 2 commits into dev from fix/agents-finalizer-fallback
