fix(recall): keyword scoring for vector results + softer adaptive floor #150
Merged
jack-arturo merged 5 commits on Apr 23, 2026
…loor (#128)
- scoring.py: fall back to content-token hit count for keyword_component when a result came from vector search (was zero before), so vector hits with question-term overlap aren't penalised vs keyword-sourced results.
- recall.py: raise adaptive score-floor dropoff threshold from 15% to 25% of the top score and only apply the cut if at least half the results survive it; the prior filter truncated the tail too hard.
- docs + unit tests updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ignore `.worktrees/` so isolated local worktrees do not pollute git status while we prepare the full-eval branch.
Capture the full judge-on LoCoMo run for #128 after the harness fixes so the branch has a concrete benchmark checkpoint for moving the recall changes forward.
…lumn
Fold the #142 prose block into a single Results-table row so every PR is recorded the same way, and extend the Category Breakdown table with an Overall column so per-category scores no longer require a cross-reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 3f3fbb6 to 26aff1c.
Pull request overview
This PR improves recall ranking quality by restoring keyword influence for vector-sourced hits and by softening the adaptive score-floor truncation so relevant “tail” results aren’t aggressively removed. It also updates the benchmark harness/tests/docs to support judge preflight and GPT‑5-family request behavior.
Changes:
- automem/utils/scoring.py: add a content-token overlap fallback for `keyword` scoring when results are not keyword/trending sourced (e.g., vector hits).
- automem/api/recall.py: relax adaptive floor application (25% threshold) and add a guardrail so it only applies when at least half of results remain.
- Bench harness + tests/docs: default cat‑5 judge model to `gpt-5.1`, add judge preflight, structured output handling for GPT‑5, and env loading via dotenv.
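The keyword fallback listed above can be illustrated with a minimal sketch. This is not the code from `automem/utils/scoring.py`: the function names, the `match_type` argument, the tokenizer, and the normalization by query-token count are all assumptions made for illustration. It only shows the idea of scoring a vector-sourced hit by how many query tokens appear in its content instead of leaving the keyword component at zero.

```python
import re


def tokenize(text: str) -> set[str]:
    """Hypothetical tokenizer: lowercase, split on non-alphanumerics."""
    return {tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok}


def keyword_component(
    query: str,
    content: str,
    match_type: str,
    keyword_score: float = 0.0,
) -> float:
    """Return a keyword score in [0, 1] for one recall result.

    Keyword- or trending-sourced results keep the score they already have.
    Any other source (e.g. vector search) falls back to the fraction of
    query tokens that occur in the result content. The normalization is
    an assumption; the PR only says "content-token hit count".
    """
    if match_type in {"keyword", "trending"}:
        return keyword_score

    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0

    hits = len(query_tokens & tokenize(content))
    return hits / len(query_tokens)


# A vector hit that shares one of four query tokens ("alice") is no
# longer scored as zero on the keyword component.
vector_score = keyword_component(
    "where does alice work", "Alice works at Acme", match_type="vector"
)
```

Exact-token matching is used here for simplicity, so "work" does not match "works"; whether the real implementation stems or normalizes tokens is not stated in this PR.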
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| automem/utils/scoring.py | Adds keyword fallback scoring for non-keyword result types based on query-token presence in content. |
| automem/api/recall.py | Adjusts adaptive floor threshold and adds a “retain at least half” guardrail. |
| tests/test_api_endpoints.py | Adds unit tests for keyword fallback scoring and adaptive floor behavior. |
| tests/test_locomo_cat5_judge.py | Updates cat‑5 judge tests to use default judge model + adds preflight/structured-output/retry coverage. |
| tests/test_benchmark_env_loading.py | New tests ensuring benchmark scripts load OPENAI_API_KEY from global dotenv config. |
| tests/benchmarks/test_locomo.py | Adds dotenv loading, judge preflight, GPT‑5 structured output request format, retry token budgets, and default judge model constant. |
| tests/benchmarks/longmemeval/test_longmemeval.py | Adds dotenv loading to align benchmark env behavior. |
| docs/TESTING.md | Updates LoCoMo judge docs (default model + preflight behavior). |
| docs/ENVIRONMENT_VARIABLES.md | Documents keyword scoring behavior now including the vector-content fallback. |
| test-locomo-benchmark.sh | Updates help text to reflect new default judge model. |
| benchmarks/EXPERIMENT_LOG.md | Normalizes/updates experiment log entries for #128 and #142 results. |
| .gitignore | Ignores /.worktrees/ for local benchmark worktrees. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
jack-arturo added a commit that referenced this pull request on Apr 23, 2026
🤖 I have created a release *beep* *boop*

---

## [0.15.2](v0.15.1...v0.15.2) (2026-04-23)

### Bug Fixes

* **benchmarks:** make LoCoMo judge runs reliable ([#149](#149)) ([c22f2c9](c22f2c9))
* **recall:** bypass tag filter on expansion candidates ([#142](#142)) ([#146](#146)) ([4f0fcf8](4f0fcf8))
* **recall:** keyword scoring for vector results + softer adaptive floor ([#150](#150)) ([591b2c7](591b2c7))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Closes #128.
Summary
- scoring.py: fall back to content-token hit count for `keyword_component` when a result came from vector search (was zero before), so vector hits with question-term overlap aren't penalised vs keyword-sourced results.
- recall.py: raise adaptive score-floor dropoff threshold from 15% → 25% of the top score, and only apply the cut when at least half the results survive it; the prior filter was truncating the tail too hard.

Benchmark impact (LoCoMo)
Cat-5 scored 92.83% (414/446) with 0 judge skips on the full-judge rerun. Per-question diff vs the 2026-03-10 baseline showed 17 previously-correct questions regressing, all with the same fingerprint: top similarity dropped from ~0.54 to 0.25; the adaptive floor was cutting valid results that the softened threshold now retains. See `benchmarks/EXPERIMENT_LOG.md` rows for 2026-04-23.

Other changes on this branch
- `d5c547b`: log the full-judge LoCoMo result for #128 ("recall: Keyword scoring dead for vector results + adaptive floor too aggressive").
- `26aff1c`: normalize EXPERIMENT_LOG entries (fold the prose block for #142, "recall: tag hard-filter is applied to expansion candidates, making expand_relations a no-op under scoped queries", into a table row, add an Overall column to the category breakdown).
- `14d581e`: gitignore `.worktrees/` for local benchmark runs.

(The judge-reliability harness fix originally committed on this branch was rebased out because the same work landed on main via #149.)
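For readers unfamiliar with the adaptive floor, the recall.py change from the Summary can be sketched roughly as follows. The PR text does not say how the dropoff is measured, so this sketch assumes it is the gap between consecutive scores relative to the top score, and it applies only the first qualifying gap. The function name and parameters are hypothetical and are not taken from `automem/api/recall.py`.

```python
def apply_adaptive_floor(
    scores: list[float],
    dropoff: float = 0.25,
    min_keep_ratio: float = 0.5,
) -> list[float]:
    """Trim the low-scoring tail of a descending score list.

    Assumed behaviour (not confirmed by the PR): find the first place
    where the score falls by at least `dropoff * top_score` from one
    result to the next, and cut there, but only if at least
    `min_keep_ratio` of the results survive the cut.
    """
    if len(scores) < 2 or scores[0] <= 0:
        return list(scores)

    top = scores[0]
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] >= dropoff * top:
            # Cutting here would keep the first i results.
            if i >= min_keep_ratio * len(scores):
                return list(scores[:i])
            # Guardrail: too few survivors, so skip the floor entirely.
            return list(scores)
    return list(scores)


# The large gap before the last score triggers a cut that keeps 3 of 4.
trimmed = apply_adaptive_floor([0.54, 0.50, 0.48, 0.20])

# A steep drop right after the top hit would keep only 1 of 4, so the
# guardrail leaves the list untouched.
untouched = apply_adaptive_floor([0.54, 0.20, 0.19, 0.18])
```

Under this reading, raising the threshold from 15% to 25% makes the floor softer because a larger drop is needed before any cut is considered, which matches the stated goal of truncating the tail less aggressively.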
Known follow-up
There's a residual main-to-main drift between the 2026-03-10 baseline (89.36%, no judge) and today's same-day baseline (82.13%, no judge) on an identical corpus. Per-question analysis shows it's the same adaptive-floor interaction, but the score distribution shifted between those dates from something other than #73. #128 recovers ~half the gap. Filing a follow-up to bisect the score-distribution shift (likely #131 or #134) and run a control with `RECALL_ADAPTIVE_FLOOR=false` to confirm.

Test plan
- …`pytest tests/test_recall*.py`).
- …`keyword` component.

🤖 Generated with Claude Code