fix(recall): keyword scoring for vector results + softer adaptive floor by jack-arturo · Pull Request #150 · verygoodplugins/automem

jack-arturo · 2026-04-23T21:33:43Z

Closes #128.

Summary

scoring.py: fall back to content-token hit count for keyword_component when a result came from vector search (was zero before), so vector hits with question-term overlap aren't penalised vs keyword-sourced results.
recall.py: raise adaptive score-floor dropoff threshold from 15% → 25% of the top score, and only apply the cut when at least half the results survive it — the prior filter was truncating the tail too hard.
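
The adaptive-floor change can be illustrated with a minimal sketch. The PR text does not spell out how the dropoff is measured, so this assumes the cut happens at the first gap between consecutive ranked scores that exceeds 25% of the top score, and that the cut is skipped when fewer than half the results would survive. The function name `apply_adaptive_floor` and its parameters are hypothetical and are not taken from `recall.py`.

```python
from typing import List, Sequence


def apply_adaptive_floor(
    scores: Sequence[float],
    dropoff_ratio: float = 0.25,      # PR: 25% of the top score (previously 15%)
    min_keep_fraction: float = 0.5,   # PR: at least half the results must survive
) -> List[float]:
    """Hypothetical sketch: truncate the ranked tail at the first large score drop."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return ranked

    # Assumption: a "dropoff" is a gap between neighbours larger than
    # dropoff_ratio * top score.
    gap_limit = ranked[0] * dropoff_ratio
    for i in range(1, len(ranked)):
        if ranked[i - 1] - ranked[i] > gap_limit:
            survivors = ranked[:i]
            # Guardrail: only apply the cut if enough results remain.
            if len(survivors) >= len(ranked) * min_keep_fraction:
                return survivors
            return ranked
    return ranked  # no large drop found, keep everything


# A drop after the third result keeps 3 of 5, so the cut is applied.
kept = apply_adaptive_floor([1.0, 0.9, 0.85, 0.4, 0.35])
# A drop right after the top result would keep 1 of 4, so the cut is skipped.
unchanged = apply_adaptive_floor([1.0, 0.5, 0.45, 0.4])
```

Under these assumptions, the guardrail is what protects queries with a single dominant hit from losing their whole tail.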

Benchmark impact (LoCoMo)

| Suite | Baseline (same-day main, no judge) | #128 | Δ |
|---|---|---|---|
| LoCoMo-mini (2 convos, 235 Qs) | 82.13% (193/235) | 85.53% (201/235) | +3.40pp |
| LoCoMo-full (10 convos, 1986 Qs, judge on) | 80.06% (1590/1986) vs #97 anchor | 83.99% (1668/1986) | +3.93pp vs canonical anchor |
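
As a quick sanity check, the percentages and percentage-point deltas in the table, plus the cat-5 figure quoted in the next paragraph, can be recomputed from the raw counts. The helper names here are illustrative only and are not part of the benchmark harness.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100.0 * correct / total, 2)


def delta_pp(baseline_pct: float, candidate_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(candidate_pct - baseline_pct, 2)


# Counts copied from the PR description.
mini_base, mini_new = accuracy_pct(193, 235), accuracy_pct(201, 235)
full_base, full_new = accuracy_pct(1590, 1986), accuracy_pct(1668, 1986)
cat5 = accuracy_pct(414, 446)

print(mini_base, mini_new, delta_pp(mini_base, mini_new))
print(full_base, full_new, delta_pp(full_base, full_new))
print(cat5)
```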

Cat-5 scored 92.83% (414/446) with 0 judge skips on the full-judge rerun. Per-question diff vs the 2026-03-10 baseline showed 17 previously-correct questions regressing, all with the same fingerprint: top similarity dropped from ~0.54 to 0.25 — the adaptive floor was cutting valid results that the softened threshold now retains. See benchmarks/EXPERIMENT_LOG.md rows for 2026-04-23.

Other changes on this branch

d5c547b: log the #128 full-judge LoCoMo result.
26aff1c: normalize EXPERIMENT_LOG entries (fold the #142 prose block into a table row, add an Overall column to the category breakdown).
14d581e: gitignore .worktrees/ for local benchmark runs.

(The judge-reliability harness fix originally committed on this branch was rebased out because the same work landed on main via #149.)

Known follow-up

There's a residual main-to-main drift between the 2026-03-10 baseline (89.36%, no judge) and today's same-day baseline (82.13%, no judge) on identical corpus. Per-question analysis shows it's the same adaptive-floor interaction, but the score distribution shifted between those dates from something other than #73. #128 recovers ~half the gap. Filing a follow-up to bisect the score-distribution shift (likely #131 or #134) and run a control with RECALL_ADAPTIVE_FLOOR=false to confirm.

Test plan

Unit tests pass (pytest tests/test_recall*.py).
LoCoMo-mini no-judge rerun on this branch vs same-day main baseline (+3.40pp).
LoCoMo-full with GPT-5.1 cat-5 judge (+3.93pp vs #97 canonical anchor).
Production smoke: "AutoJack" query on Railway returns ≥30 results with non-zero keyword component.

🤖 Generated with Claude Code

…loor (#128)

- scoring.py: fall back to content-token hit count for keyword_component when a result came from vector search (was zero before), so vector hits with question-term overlap aren't penalised vs keyword-sourced results.
- recall.py: raise adaptive score-floor dropoff threshold from 15% to 25% of the top score and only apply the cut if at least half the results survive it; the prior filter truncated the tail too hard.
- docs + unit tests updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…lumn

Fold the #142 prose block into a single Results-table row so every PR is recorded the same way, and extend the Category Breakdown table with an Overall column so per-category scores no longer require a cross-reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR improves recall ranking quality by restoring keyword influence for vector-sourced hits and by softening the adaptive score-floor truncation so relevant “tail” results aren’t aggressively removed. It also updates the benchmark harness/tests/docs to support judge preflight and GPT‑5-family request behavior.

Changes:

automem/utils/scoring.py: add a content-token overlap fallback for keyword scoring when results are not keyword/trending sourced (e.g., vector hits).
automem/api/recall.py: relax adaptive floor application (25% threshold) and add a guardrail so it only applies when at least half of results remain.
Bench harness + tests/docs: default cat‑5 judge model to gpt-5.1, add judge preflight, structured output handling for GPT‑5, and env loading via dotenv.
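
The content-token overlap fallback listed above might look roughly like the following sketch. It assumes a simple lowercase alphanumeric tokenizer and normalizes the hit count by the number of query tokens; both choices, along with the names `keyword_component` (as a function), `match_type`, and `keyword_score`, are illustrative assumptions rather than the actual `automem/utils/scoring.py` code.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens (assumed tokenizer, not the project's)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_component(
    query: str,
    content: str,
    match_type: str,
    keyword_score: float = 0.0,
) -> float:
    """Hypothetical keyword component with a fallback for vector-sourced hits."""
    # Keyword/trending results already carry a keyword score; use it as-is.
    if match_type in {"keyword", "trending"}:
        return keyword_score

    # Fallback: count query tokens that appear in the result content,
    # normalized to the 0..1 range (normalization is an assumption).
    query_tokens = _tokens(query)
    if not query_tokens:
        return 0.0
    hits = len(query_tokens & _tokens(content))
    return hits / len(query_tokens)


full_overlap = keyword_component(
    "AutoJack release notes", "Release notes for AutoJack 2.0", "vector"
)
partial_overlap = keyword_component("foo bar", "only foo here", "vector")
```

Before a change of this kind, the fallback branch would effectively return zero for every vector hit, which is the behaviour #128 describes as keyword scoring being "dead" for vector results.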

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| `automem/utils/scoring.py` | Adds keyword fallback scoring for non-keyword result types based on query-token presence in content. |
| `automem/api/recall.py` | Adjusts adaptive floor threshold and adds a “retain at least half” guardrail. |
| `tests/test_api_endpoints.py` | Adds unit tests for keyword fallback scoring and adaptive floor behavior. |
| `tests/test_locomo_cat5_judge.py` | Updates cat‑5 judge tests to use default judge model + adds preflight/structured-output/retry coverage. |
| `tests/test_benchmark_env_loading.py` | New tests ensuring benchmark scripts load `OPENAI_API_KEY` from global dotenv config. |
| `tests/benchmarks/test_locomo.py` | Adds dotenv loading, judge preflight, GPT‑5 structured output request format, retry token budgets, and default judge model constant. |
| `tests/benchmarks/longmemeval/test_longmemeval.py` | Adds dotenv loading to align benchmark env behavior. |
| `docs/TESTING.md` | Updates LoCoMo judge docs (default model + preflight behavior). |
| `docs/ENVIRONMENT_VARIABLES.md` | Documents keyword scoring behavior now including the vector-content fallback. |
| `test-locomo-benchmark.sh` | Updates help text to reflect new default judge model. |
| `benchmarks/EXPERIMENT_LOG.md` | Normalizes/updates experiment log entries for #128 and #142 results. |
| `.gitignore` | Ignores `/.worktrees/` for local benchmark worktrees. |

🤖 I have created a release *beep* *boop*

---

## [0.15.2](v0.15.1...v0.15.2) (2026-04-23)

### Bug Fixes

* **benchmarks:** make LoCoMo judge runs reliable ([#149](#149)) ([c22f2c9](c22f2c9))
* **recall:** bypass tag filter on expansion candidates ([#142](#142)) ([#146](#146)) ([4f0fcf8](4f0fcf8))
* **recall:** keyword scoring for vector results + softer adaptive floor ([#150](#150)) ([591b2c7](591b2c7))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Copilot AI review requested due to automatic review settings April 23, 2026 21:33

jack-arturo linked an issue Apr 23, 2026 that may be closed by this pull request

recall: Keyword scoring dead for vector results + adaptive floor too aggressive #128

Closed

Copilot started reviewing on behalf of jack-arturo April 23, 2026 21:34

jack-arturo and others added 4 commits April 23, 2026 22:36

chore(repo): ignore project-local worktrees

14d581e

Ignore `.worktrees/` so isolated local worktrees do not pollute git status while we prepare the full-eval branch.

docs(benchmarks): record full LoCoMo result for #128

d5c547b

Capture the full judge-on LoCoMo run for #128 after the harness fixes so the branch has a concrete benchmark checkpoint for moving the recall changes forward.

jack-arturo force-pushed the fix/128-recall-keyword-scoring-dead-for-vector-results-adaptive-floor-too-aggressive branch from 3f3fbb6 to 26aff1c April 23, 2026 21:37

Copilot AI reviewed Apr 23, 2026

Comment thread automem/utils/scoring.py Outdated

Potential fix for pull request finding

7769e9c

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

jack-arturo merged commit 591b2c7 into main Apr 23, 2026
6 checks passed

jack-arturo deleted the fix/128-recall-keyword-scoring-dead-for-vector-results-adaptive-floor-too-aggressive branch April 23, 2026 21:47

jack-arturo mentioned this pull request Apr 23, 2026

chore(main): release 0.15.2 #148

Merged

jack-arturo merged 5 commits into main from fix/128-recall-keyword-scoring-dead-for-vector-results-adaptive-floor-too-aggressive