fix(agents): salvage experiment answer when weak model fails structured output #352
Live verification on Ollama (
Root cause (from a full capture_run_messages trace, live ollama:llama3.1:8b)
The Experiment Agent's failure on weak local models is not a tool loop that the model chooses; it is a structured-output (PromptedOutput(ExperimentReport)) incompatibility. llama3.1:8b calls the tool and gets the data, but emits the raw tool-result shape ({"runs":[...]}) instead of ExperimentReport{summary}; PromptedOutput rejects it and the run exhausts its output-retry budget. (qwen3:8b is worse: it emits the tool call itself as text.) No prompt change (#348, #349) can fix this, because the model cannot produce the schema while juggling tools.

Fix: graceful plain-text finalizer fallback
When an agent run raises UnexpectedModelBehavior and there is no pending HITL action to salvage, the service now extracts the tool results already obtained during the run (via capture_run_messages) and makes ONE tool-less, output_type=str follow-up call to the same model to answer the user's question from that data. Plain text is what weak models can produce, so this converts the structured-output failure into a correct answer, on Ollama, with no cloud model.
- Applies to the chat() and stream_chat() misbehavior handlers.
- The finalizer uses str output (which cannot fail schema validation); on any finalizer error it returns None and the caller emits the existing recoverable error, never a crash.
- PromptedOutput(ExperimentReport) stays the happy path; cloud models are unaffected.

Tests (deterministic, no live model calls)
- test_chat_finalizer_salvages_answer_on_misbehavior: on output-retry exhaustion with tool data, chat() returns the salvaged plain-text answer (not the generic error).
- TestFinalizerSalvage::test_extract_tool_payloads_pulls_tool_returns / _empty_when_no_tool_returns: tool-return extraction.
- test_salvage_returns_none_without_tool_data: no data → None → caller falls back to the recoverable error (existing test_chat_model_misbehavior_returns_friendly_message still passes).

Validation
- ruff check . ✓ · ruff format --check ✓
- mypy app/ → only the pre-existing xgboost/lightgbm optional-extra import errors in untouched files
- pyright (changed files) → 0 errors
- pytest -m "not integration" → 1696 passed, 12 skipped

Notes
- capture_run_messages reproduction; no production agent call was needed for the fix, and tests are fully mocked.

Closes #351
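
For reviewers, the fallback described under Fix can be sketched roughly as below. This is a minimal, stdlib-only illustration and not the code in this PR: the Message and ToolReturn classes, the salvage_answer function, and the finalize callable are hypothetical stand-ins, and the signature of extract_tool_payloads is assumed. In the real service the finalize step would be the single tool-less, output_type=str call to the same model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class ToolReturn:
    """Hypothetical stand-in for a tool-return part of a captured message."""
    tool_name: str
    content: str


@dataclass
class Message:
    """Hypothetical stand-in for one message captured during the failed run."""
    parts: Sequence[object]


def extract_tool_payloads(messages: Sequence[Message]) -> list[str]:
    """Collect the content of every tool return seen during the run."""
    return [
        part.content
        for message in messages
        for part in message.parts
        if isinstance(part, ToolReturn)
    ]


def salvage_answer(
    question: str,
    messages: Sequence[Message],
    finalize: Callable[[str], str],
) -> Optional[str]:
    """Try to answer from tool data with one plain-text follow-up call.

    Returns None when there is no tool data or the finalizer fails, so the
    caller can fall back to its existing recoverable error.
    """
    payloads = extract_tool_payloads(messages)
    if not payloads:
        return None  # nothing to salvage

    prompt = (
        "Answer the question using only the tool data below.\n"
        f"Question: {question}\n"
        "Tool data:\n" + "\n".join(payloads)
    )
    try:
        answer = finalize(prompt)  # assumed: one tool-less, plain-text call
    except Exception:
        return None  # a finalizer error must never crash the caller
    return answer.strip() or None
```

Passing the finalizer in as a callable keeps the sketch testable without a live model, which mirrors the fully mocked tests listed above.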