fix(reflect): let a fresh mental model short-circuit forced retrieval (no extra LLM call)#2011
Merged
Merged
Conversation
Reflect forced the full hierarchical path search_mental_models -> search_observations -> recall via a named tool_choice on the first iterations. Because a named tool_choice forbids the model from emitting `done`, the agent could never answer off a fresh, directly-relevant mental model — it always paid for the lower layers too (issue #1971). Fix: after the forced search_mental_models result, decide deterministically (no extra LLM call) whether to keep forcing. If the call is low/mid budget and every retrieved mental model is explicitly fresh (is_stale is False) with non-empty content, stop forcing from the next iteration on. That iteration — which happens regardless — now runs under `auto`, so the agent either answers directly or, having just read the mental model, issues its own targeted search_observations/recall. Stale, empty, or missing mental models keep the full forced path; high budget always keeps it. This reuses the agentic step that already occurs instead of adding a separate sufficiency-classifier LLM call, so the sufficient path saves two forced rounds and no path ever adds a round.
Two hs_llm_core end-to-end tests drive the real agent loop (stubbed search functions, real llm_config) to verify behaviour the deterministic MockLLM tests cannot: - fresh + sufficient mental model: the released agent answers off it and never calls search_observations/recall (judge-verified grounding); - stale mental model: no short-circuit, lower layers stay forced, and the agent corrects the stale summary using the freshly retrieved raw fact. The stale case (forcing is deterministic) is used rather than a "fresh-but-incomplete model retrieves deeper on its own" case, because whether a released model chooses to dig deeper is model-dependent and not something the fix guarantees — only release-to-auto is guaranteed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1971.
Reflect forced the full hierarchical retrieval path on the first iterations:
The root cause is one mechanism: those iterations set
tool_choiceto a named function, and a namedtool_choiceforbids the model from emittingdone. So even whensearch_mental_modelsreturned a fresh, directly-relevant mental model, the agent could not answer — it was obligated to callsearch_observationsand thenrecallfirst, paying for the lower layers (and often re-reading the very facts the mental model already summarized).The fix
After the forced
search_mental_modelsresult, decide deterministically — with no extra LLM call — whether to keep forcing:is_stale is False) with non-empty content → stop forcing from the next iteration onward.autotool choice, so the agent either answers directly, or — having just read the mental model — issues its own targetedsearch_observations/recall.The key insight: a reflect iteration always makes exactly one
call_with_tools;tool_choiceonly changes what the model may emit, not whether a call happens. So the sufficiency decision can be folded into the next agentic step that already occurs, rather than added as a separate classifier call.Why not a sufficiency-classifier LLM call
An alternative (see #2005) inserts a dedicated structured "is this mental model sufficient?" LLM call after the mental-model layer. That re-introduces a cost on the very path the issue is about:
This PR never adds an LLM round-trip in any scenario, and the targeted follow-up query the issue asks for comes for free: under
autothe model composes its own observations/recall query having just read the mental model, so it naturally targets the gap/conclusion.The deterministic freshness/usability guard (
is_stale is False+ non-empty content; missing flag treated as unsafe) is what makes "release to auto" safe — stale/empty results never short-circuit. Reliability-over-latency callers usehighbudget, which preserves the full forced path.Behavior
highbudget → full forced verification path (unchanged).auto.Tests
hindsight-api-slim/tests/test_reflect_agent.py:TestMentalModelFreshnessHelper— the deterministic guard (fresh, stale, missing-flag, blank-content, empty-list).test_fresh_mental_model_releases_forced_retrieval— fresh mm → second iteration isauto, obs/recall never called, and no extrallm.call.test_short_circuited_agent_may_still_retrieve_under_auto— after release the agent can still recall, by its own choice, with its own query.test_stale_mental_model_keeps_forced_retrieval/test_no_mental_models_keeps_forced_retrieval— full forced path preserved.test_high_budget_keeps_forced_path_for_fresh_mental_model— high budget unaffected.This is a deterministic control-flow change (no prompt or model-interpretation change), so fast MockLLM unit tests are the appropriate coverage.
Commands run:
./scripts/hooks/lint.shuv run ty check hindsight_api/engine/reflect/agent.pyuv run pytest tests/test_reflect_agent.py(50 passed)