Skip to content

fix(reflect): gate mental-model retrieval by policy#2005

Closed
Sanderhoff-alt wants to merge 1 commit into
vectorize-io:mainfrom
Sanderhoff-alt:fix/reflect-mental-model-policy
Closed

fix(reflect): gate mental-model retrieval by policy#2005
Sanderhoff-alt wants to merge 1 commit into
vectorize-io:mainfrom
Sanderhoff-alt:fix/reflect-mental-model-policy

Conversation

@Sanderhoff-alt

Copy link
Copy Markdown
Contributor

Summary

Fixes #1971.

This PR makes the forced reflect retrieval path conditional after the mental-model layer. When search_mental_models returns fresh, usable mental models that are sufficient for the user query, low- and mid-budget reflect calls can stop forcing lower-level retrieval and let the agent answer or choose the next action normally.

When lower-level retrieval is still needed, the policy can provide a targeted follow-up query for search_observations or recall, so deeper retrieval focuses on the mental-model conclusion, gap, or stale point that needs verification instead of blindly repeating the original query.

Motivation

Before this change, the reflect agent forced the same retrieval sequence whenever all layers were enabled:

search_mental_models -> search_observations -> recall

Because each forced iteration uses tool_choice with a specific function, the agent could not answer after a sufficient mental-model result. It still had to call search_observations and then recall.

That weakens the value of mental models as reusable synthesized knowledge:

  • fresh and relevant mental models cannot act as an answer boundary;
  • low-budget reflect calls still pay for the full three-layer retrieval path;
  • later retrieval can duplicate evidence already summarized by the mental model;
  • the deeper retrieval query is not guaranteed to verify the specific conclusion or gap that matters.

Changes

  • Adds a structured retrieval-policy decision after search_mental_models.
  • Keeps high budget on the full verification path.
  • Allows low- and mid-budget calls to stop forced lower-level retrieval when fresh mental models are sufficient.
  • Treats stale, missing-staleness, or empty mental models as unsafe for short-circuiting and falls back to lower-level retrieval.
  • Falls back to conservative lower-level retrieval if the policy LLM call or structured output parsing fails.
  • Uses retrieval relevance as metadata only, without adding a fixed score threshold.
  • Carries a policy-generated follow_up_query into forced search_observations and recall calls when deeper retrieval is needed.
  • Records the retrieval-policy decision in the reflect trace.
  • Removes the old forced-sequence alignment workaround that inserted an already-executed search_mental_models entry into the override sequence.

Behavior

After this change:

  • If no mental models are found, reflect continues to the enabled lower layers.
  • If any retrieved mental model is stale or has no usable content, reflect continues to the enabled lower layers.
  • If the budget is high, reflect preserves the fuller verification path.
  • If fresh mental models are sufficient, forced lower-layer retrieval stops and the next agent iteration uses normal auto tool choice.
  • If fresh mental models are not sufficient, reflect forces only the lower layers requested by the policy and uses the policy's targeted query when available.

Tests

  • uv run ruff check tests/test_reflect_agent.py
  • uv run ty check hindsight_api/engine/reflect/agent.py
  • ./scripts/hooks/lint.sh
  • uv run pytest tests/test_reflect_agent.py::TestMentalModelRetrievalPolicyRealLLM -q
  • HF_HUB_OFFLINE=1 TRANSFORMERS_OFFLINE=1 uv run pytest tests/test_reflect_agent.py -q

The new tests cover:

  • stopping forced retrieval when fresh mental models are sufficient;
  • preventing stale or empty mental models from short-circuiting;
  • preserving full retrieval for high budget;
  • using a targeted follow-up query for forced lower-layer retrieval;
  • real-LLM behavior for both sufficient and verification-needed policy decisions.

Add a retrieval policy step after search_mental_models so low- and
mid-budget reflect calls can stop forcing lower-level retrieval when
fresh mental models are sufficient.

Keep the fast path bounded by deterministic safeguards: stale,
missing, or empty mental models continue to search observations and
recall. Malformed policy responses fall back to conservative retrieval.
When lower layers are needed, carry the policy's follow-up query into
the forced tool call instead of only rewriting the tool sequence.

Tests cover the forced-path rewrite, stale/empty guard, targeted
follow-up query, and real-LLM policy behavior.
@nicoloboschi

Copy link
Copy Markdown
Collaborator

Opened #2011 as an alternative approach to the same issue (#1971), for comparison.

The root cause is that the forced iterations set tool_choice to a named function, which forbids the model from emitting done — so a fresh, sufficient mental model can never be an answer boundary. This PR resolves that with a dedicated structured "is the mental model sufficient?" LLM classifier call after the mental-model layer.

#2011 instead makes the decision deterministically, with no extra LLM call: after the forced search_mental_models result, if the call is low/mid budget and every retrieved model is explicitly fresh (is_stale is False) with non-empty content, it stops forcing from the next iteration on. That next iteration already happens — it just runs under auto — so the agent answers directly, or (having just read the model) issues its own targeted search_observations/recall. Stale/empty/missing keep the full forced path; high budget keeps it too.

The motivating difference is cost on the path the issue is about:

Scenario Before This PR (#2005) #2011
model sufficient 3 forced rounds mm + classifier + done → −1 mm + done → −2
model not sufficient 3 forced rounds mm + classifier + obs + recall → +1 mm + obs/recall under auto → 0

#2011 never adds an LLM round-trip, and the targeted follow-up query comes for free (the model writes it after reading the mental model, so the follow_up_query plumbing / query-override bookkeeping isn't needed).

The one thing this PR's classifier genuinely buys is a decoupled, conservative sufficiency judgment. While writing #2011's real-LLM e2e tests I actually saw the flip side of not having that: when a fresh-but-incomplete model is released to auto, gemini-flash-lite sometimes just answers from it rather than digging deeper. So the classifier's conservatism is a real, if costly, benefit — #2011 leans on the deterministic fresh-gate + the high budget tier (and, optionally, a one-line prompt nudge) to cover that instead of a per-call LLM tax.

Not trying to block this — just offering the lighter-weight alternative so the maintainers can pick the tradeoff (guaranteed extra call + stronger conservatism vs. zero extra calls + deterministic guard). Happy to converge on whichever direction is preferred.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reflect Forces Lower-Level Retrieval Even When a Mental Model May Be Sufficient

2 participants