Gate prediction-error flags on recurrence and fit; fix contextual_mismatch#47

Merged
raphasouthall merged 1 commit into main from fix/prediction-error-precision
Jun 3, 2026

@raphasouthall
Owner

Problem

Triaging all 46 live prediction-error flags on LXC 122 found ~0 genuinely actionable note defects. The subsystem was measuring query difficulty, not note health.

  • low_overlap fired on any single query whose #1 result had low cosine. 35 of 36 flags came from one ad-hoc query each: bare-identifier lookups ("azvmsqlp02") that the keyword path matched correctly, abstract/meta sweeps ("open decisions verification needed TODO"), and cross-domain least-bad hits ("spline 3D shader" → neurostack.md).
  • contextual_mismatch flagged correct retrievals: it fired whenever the #1 note was absent from the recall-limited in_context_notes boost set, including exact-title hits with strong cosine (m365-copilot-mcp at sim 0.63, reddit-engagement-daemon at 0.69). Root cause: the caller context label ("nyk-azure") isn't even a substring of the folder (nyk-europe-azure), so the set leans on a brittle tag/folder-cosine heuristic that excludes correctly-domiciled notes. Precision ≈ 0%.
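
To make the root cause concrete, here is a minimal sketch using only the two strings quoted above. The variable names are hypothetical and the token comparison is purely illustrative; it is not how the boost set is built and not something this PR changes.

```python
# The two strings are taken from the PR description; the variable names are hypothetical.
caller_context = "nyk-azure"
folder = "nyk-europe-azure"

# A plain substring test fails because "europe-" sits between the two halves of the label.
label_is_substring = caller_context in folder

# Illustration only: every hyphen-separated token of the label does occur in the folder
# name, which is why a human reads the two as the same domain.
label_tokens_in_folder = set(caller_context.split("-")) <= set(folder.split("-"))

print(label_is_substring, label_tokens_in_folder)
```

This is why a correctly filed note can miss the in_context_notes set even though its folder obviously belongs to the caller's domain.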

These false flags weren't inert — the prediction-error demotion stage down-weighted the (correct) flagged notes in later retrieval.

Changes

  • contextual_mismatch now also requires the top note to be a weak fit (sim < CONTEXTUAL_MISMATCH_MAX_SIM = 0.45). A strong hit outside the boost set is not a mismatch.
  • Surfacing (vault_prediction_errors MCP + CLI) and the retrieval demotion now require at least PREDICTION_ERROR_MIN_OCCURRENCES (2) distinct events per note. Single flags still accumulate toward the threshold but neither surface nor demote.
  • New tests/test_prediction_errors.py exercises the detection branch end-to-end with a real in-memory sqlite DB (not MagicMock): fires below threshold, no flag above, FTS-only hits skip detection, only deduped[0] is checked, contextual_mismatch fires in-band and is suppressed for strong hits, and the occurrence gate surfaces only recurrent notes.
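
The two gates above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the two constants and their values come from the description, while the function names, the (note, query) event shape, and the reading of "distinct events" as distinct note/query pairs are assumptions.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# Constant names and values as stated in this PR.
CONTEXTUAL_MISMATCH_MAX_SIM = 0.45
PREDICTION_ERROR_MIN_OCCURRENCES = 2


def is_contextual_mismatch(top_note: str, top_sim: float, in_context_notes: Set[str]) -> bool:
    """Hypothetical helper: flag only if the top note is outside the boost set AND a weak fit."""
    return top_note not in in_context_notes and top_sim < CONTEXTUAL_MISMATCH_MAX_SIM


def recurrent_notes(events: Iterable[Tuple[str, str]]) -> Set[str]:
    """Hypothetical helper: notes with enough distinct (note, query) events to surface or demote."""
    # Assumption: repeating the same query against the same note counts as one event.
    counts = Counter(note for note, _query in set(events))
    return {note for note, n in counts.items() if n >= PREDICTION_ERROR_MIN_OCCURRENCES}


# A strong hit outside the boost set is no longer a mismatch (sim 0.63 from the examples above).
strong_hit_flagged = is_contextual_mismatch("m365-copilot-mcp", 0.63, set())
# A weak hit outside the boost set still is (0.30 is a made-up value below the threshold).
weak_hit_flagged = is_contextual_mismatch("some-note", 0.30, set())

# Made-up events: only the note flagged by two different queries passes the occurrence gate.
events = [
    ("third-parties.md", "query A"),
    ("third-parties.md", "query B"),
    ("neurostack.md", "spline 3D shader"),
    ("neurostack.md", "spline 3D shader"),
]
surfaced = recurrent_notes(events)
```

Under these assumptions a one-off flag is still stored, so a second distinct event can later push the note over the threshold, but on its own it neither appears in the surfaced list nor affects ranking.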

Impact

On the live DB this collapses 46 surfaced flags → 1 (third-parties.md, which genuinely surprised two distinct CSP/AOBO queries). Full suite: 575 passed, ruff clean.

…match

Triaging the 46 live flags on LXC 122 showed ~0 actionable note defects.
The subsystem was measuring query difficulty, not note health:

- low_overlap fired on any single query whose top hit had low cosine —
  bare-identifier lookups ("azvmsqlp02") that keyword-matched correctly,
  abstract/meta sweeps, and cross-domain least-bad hits. 35 of 36 flags
  came from one ad-hoc query each.
- contextual_mismatch flagged *correct* retrievals: it fired whenever the
  #1 note was absent from the recall-limited in_context_notes boost set,
  including exact-title hits with strong cosine (m365-copilot-mcp at sim
  0.63, reddit-engagement-daemon at 0.69). Precision ~0%.

These false flags also demoted the correct notes in later searches via the
prediction-error demotion stage.

Changes:
- contextual_mismatch now requires the top note to also be a weak fit
  (sim < CONTEXTUAL_MISMATCH_MAX_SIM = 0.45). A strong hit outside the
  boost set is not a mismatch.
- Surfacing (vault_prediction_errors, CLI) and the retrieval demotion now
  require >= PREDICTION_ERROR_MIN_OCCURRENCES (2) distinct events. Single
  flags still accumulate toward the threshold but neither surface nor demote.
- New tests/test_prediction_errors.py exercises the detection branch
  end-to-end (real in-memory sqlite): low_overlap fires below threshold,
  no flag above, FTS-only hits skip detection, only deduped[0] is checked,
  contextual_mismatch fires in-band and is suppressed for strong hits, and
  the occurrence gate surfaces only recurrent notes.

On the live DB this collapses 46 surfaced flags to 1 (third-parties.md,
which genuinely surprised two distinct CSP/AOBO queries).
@raphasouthall raphasouthall merged commit 942917d into main Jun 3, 2026
5 checks passed
@raphasouthall raphasouthall deleted the fix/prediction-error-precision branch June 3, 2026 14:55