fix(reflect): gate mental-model retrieval by policy #2005
Add a retrieval policy step after search_mental_models so low- and mid-budget reflect calls can stop forcing lower-level retrieval when fresh mental models are sufficient. Keep the fast path bounded by deterministic safeguards: stale, missing, or empty mental models continue to search observations and recall. Malformed policy responses fall back to conservative retrieval. When lower layers are needed, carry the policy's follow-up query into the forced tool call instead of only rewriting the tool sequence. Tests cover the forced-path rewrite, stale/empty guard, targeted follow-up query, and real-LLM policy behavior.
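The guard and fallback described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the `PolicyDecision` type, the JSON shape with a `sufficient` flag, and the `content` / `is_stale` fields on a mental model are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolicyDecision:
    """Hypothetical result of the retrieval policy step."""
    needs_lower_layers: bool
    follow_up_query: Optional[str] = None


# Conservative default: keep searching observations and recall.
CONSERVATIVE = PolicyDecision(needs_lower_layers=True)


def parse_policy_response(raw: str) -> PolicyDecision:
    """Parse the policy LLM output; anything malformed falls back to CONSERVATIVE."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return CONSERVATIVE
    # Assumed response shape: {"sufficient": bool, "follow_up_query": str | null}
    if not isinstance(data, dict) or not isinstance(data.get("sufficient"), bool):
        return CONSERVATIVE
    query = data.get("follow_up_query")
    query = query.strip() if isinstance(query, str) and query.strip() else None
    return PolicyDecision(needs_lower_layers=not data["sufficient"], follow_up_query=query)


def decide(mental_models: list, raw_policy: str) -> PolicyDecision:
    """Apply deterministic guards first, then trust the policy response."""
    # Guard: missing, empty, or stale mental models never take the fast path.
    # The "content" and "is_stale" keys are assumed field names.
    if not mental_models:
        return CONSERVATIVE
    for model in mental_models:
        if not (model.get("content") or "").strip() or model.get("is_stale", False):
            return CONSERVATIVE
    return parse_policy_response(raw_policy)
```

In this sketch the deterministic guard runs before the policy output is consulted, so a stale or empty result cannot be released to the fast path even if the policy response claims the models are sufficient.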
Opened #2011 as an alternative approach to the same issue (#1971), for comparison. The root cause is that the forced iterations set #2011 instead makes the decision deterministically, with no extra LLM call: after the forced The motivating difference is cost on the path the issue is about:
#2011 never adds an LLM round-trip, and the targeted follow-up query comes for free (the model writes it after reading the mental model, so the
The one thing this PR's classifier genuinely buys is a decoupled, conservative sufficiency judgment. While writing #2011's real-LLM e2e tests I actually saw the flip side of not having that: when a fresh-but-incomplete model is released to
Not trying to block this, just offering the lighter-weight alternative so the maintainers can pick the tradeoff (guaranteed extra call + stronger conservatism vs. zero extra calls + deterministic guard). Happy to converge on whichever direction is preferred.
Summary
Fixes #1971.
This PR makes the forced `reflect` retrieval path conditional after the mental-model layer. When `search_mental_models` returns fresh, usable mental models that are sufficient for the user query, low- and mid-budget reflect calls can stop forcing lower-level retrieval and let the agent answer or choose the next action normally.
When lower-level retrieval is still needed, the policy can provide a targeted follow-up query for `search_observations` or `recall`, so deeper retrieval focuses on the mental-model conclusion, gap, or stale point that needs verification instead of blindly repeating the original query.
Motivation
Before this change, the reflect agent forced the same retrieval sequence whenever all layers were enabled: `search_mental_models`, then `search_observations`, then `recall`.
Because each forced iteration uses `tool_choice` with a specific function, the agent could not answer after a sufficient mental-model result. It still had to call `search_observations` and then `recall`.
That weakens the value of mental models as reusable synthesized knowledge.
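To make the forced sequence and its conditional variant concrete, here is a small sketch of how a per-iteration `tool_choice` could be planned. It is an illustration under assumptions, not the agent's actual code: `plan_iteration` and `forced_tool_choice` are hypothetical helpers, the budget is assumed to be one of the strings `"low"`, `"mid"`, or `"high"`, and the payload follows the OpenAI-style `{"type": "function", "function": {"name": ...}}` shape.

```python
from typing import Optional, Tuple, Union

# Forced order when all layers are enabled (taken from the description above).
FULL_SEQUENCE = ["search_mental_models", "search_observations", "recall"]

ToolChoice = Union[str, dict]


def forced_tool_choice(name: str) -> dict:
    """OpenAI-style tool_choice that forces one specific function (assumed format)."""
    return {"type": "function", "function": {"name": name}}


def plan_iteration(
    iteration: int,
    budget: str,
    needs_lower_layers: bool,
    original_query: str,
    follow_up_query: Optional[str] = None,
) -> Tuple[ToolChoice, Optional[str]]:
    """Return (tool_choice, query) for a 0-based reflect iteration."""
    # The mental-model layer is always searched first with the user's query.
    if iteration == 0:
        return forced_tool_choice(FULL_SEQUENCE[0]), original_query
    # Fast path: on low/mid budget, sufficient mental models release the agent.
    if budget != "high" and not needs_lower_layers:
        return "auto", None
    # Otherwise keep forcing the lower layers, preferring the targeted query.
    if iteration < len(FULL_SEQUENCE):
        return forced_tool_choice(FULL_SEQUENCE[iteration]), follow_up_query or original_query
    # After the forced sequence, the agent chooses freely.
    return "auto", None
```

With `needs_lower_layers=False` on a low budget, iteration 1 already returns `"auto"`, so the agent may answer directly. On `"high"` the sketch keeps forcing `search_observations` and `recall` regardless of the policy result.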
Changes
- Add a retrieval policy step after `search_mental_models`.
- Keep `high` budget on the full verification path.
- Carry the policy's `follow_up_query` into forced `search_observations` and `recall` calls when deeper retrieval is needed.
- `search_mental_models` entry into the override sequence.
Behavior
After this change:
- When the budget is `high`, reflect preserves the fuller verification path.
- `auto` tool choice.
Tests
- `uv run ruff check tests/test_reflect_agent.py`
- `uv run ty check hindsight_api/engine/reflect/agent.py`
- `./scripts/hooks/lint.sh`
- `uv run pytest tests/test_reflect_agent.py::TestMentalModelRetrievalPolicyRealLLM -q`
- `HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 uv run pytest tests/test_reflect_agent.py -q`
The new tests cover:
- `high` budget;