fix(recall): keyword scoring for vector results + softer adaptive floor#150

Merged: jack-arturo merged 5 commits into main from fix/128-recall-keyword-scoring-dead-for-vector-results-adaptive-floor-too-aggressive on Apr 23, 2026

Conversation

@jack-arturo jack-arturo commented Apr 23, 2026

Member

Closes #128.

Summary

  • scoring.py: fall back to content-token hit count for keyword_component when a result came from vector search (was zero before), so vector hits with question-term overlap aren't penalised vs keyword-sourced results.
  • recall.py: raise adaptive score-floor dropoff threshold from 15% → 25% of the top score, and only apply the cut when at least half the results survive it — the prior filter was truncating the tail too hard.
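A minimal sketch of the keyword fallback described in the first bullet, for readers who want the shape without opening scoring.py. Everything here is illustrative: the function names, the `match_type` values, the regex tokenizer, and the normalisation to a 0..1 fraction are assumptions, not the code in this PR (the PR only says "content-token hit count").

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens (stand-in tokenizer, assumed)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_component(
    query: str,
    content: str,
    match_type: str,
    keyword_score: float = 0.0,
) -> float:
    """Hypothetical keyword component for one recall result."""
    # Keyword/trending-sourced results keep the score their search assigned.
    if match_type in {"keyword", "trending"}:
        return keyword_score

    # Fallback for vector (and other) results: count query tokens that
    # appear in the content. Previously this path contributed zero.
    query_tokens = tokenize(query)
    if not query_tokens:
        return 0.0
    hits = len(query_tokens & tokenize(content))
    # Dividing by the query length is an assumption made for this sketch.
    return hits / len(query_tokens)
```

With this sketch, a vector hit whose content contains one of two query terms gets a keyword component of 0.5 instead of 0.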

Benchmark impact (LoCoMo)

| Suite | Baseline (same-day main, no judge) | #128 | Δ |
| --- | --- | --- | --- |
| LoCoMo-mini (2 convos, 235 Qs) | 82.13% (193/235) | 85.53% (201/235) | +3.40pp |
| LoCoMo-full (10 convos, 1986 Qs, judge on) | 80.06% (1590/1986) vs #97 anchor | 83.99% (1668/1986) | +3.93pp vs canonical anchor |
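The deltas can be re-derived from the raw counts. The helper names below are hypothetical and exist only for this check; the counts are the ones reported in this PR.

```python
def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to two decimals."""
    return round(100 * correct / total, 2)


def delta_pp(baseline: tuple[int, int], candidate: tuple[int, int]) -> float:
    """Percentage-point difference between two (correct, total) pairs."""
    return round(accuracy_pct(*candidate) - accuracy_pct(*baseline), 2)


# Counts copied from the benchmark results in this PR.
mini = delta_pp(baseline=(193, 235), candidate=(201, 235))
full = delta_pp(baseline=(1590, 1986), candidate=(1668, 1986))
print(f"LoCoMo-mini: +{mini:.2f}pp, LoCoMo-full: +{full:.2f}pp")
```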

Cat-5 scored 92.83% (414/446) with 0 judge skips on the full-judge rerun. Per-question diff vs the 2026-03-10 baseline showed 17 previously-correct questions regressing, all with the same fingerprint: top similarity dropped from ~0.54 to 0.25 — the adaptive floor was cutting valid results that the softened threshold now retains. See benchmarks/EXPERIMENT_LOG.md rows for 2026-04-23.

Other changes on this branch

(The judge-reliability harness fix originally committed on this branch was rebased out because the same work landed on main via #149.)

Known follow-up

There's a residual main-to-main drift between the 2026-03-10 baseline (89.36%, no judge) and today's same-day baseline (82.13%, no judge) on identical corpus. Per-question analysis shows it's the same adaptive-floor interaction, but the score distribution shifted between those dates from something other than #73. #128 recovers ~half the gap. Filing a follow-up to bisect the score-distribution shift (likely #131 or #134) and run a control with RECALL_ADAPTIVE_FLOOR=false to confirm.

Test plan

  • Unit tests pass (pytest tests/test_recall*.py).
  • LoCoMo-mini no-judge rerun on this branch vs same-day main baseline (+3.40pp).
  • LoCoMo-full with GPT-5.1 cat-5 judge (+3.93pp vs #97 canonical anchor).
  • Production smoke: "AutoJack" query on Railway returns ≥30 results with non-zero keyword component.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 23, 2026 21:33
jack-arturo and others added 4 commits April 23, 2026 22:36
…loor (#128)

- scoring.py: fall back to content-token hit count for keyword_component
  when a result came from vector search (was zero before), so vector
  hits with question-term overlap aren't penalised vs keyword-sourced
  results.
- recall.py: raise adaptive score-floor dropoff threshold from 15% to
  25% of the top score and only apply the cut if at least half the
  results survive it — the prior filter truncated the tail too hard.
- docs + unit tests updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ignore `.worktrees/` so isolated local worktrees do not pollute git status while we prepare the full-eval branch.
Capture the full judge-on LoCoMo run for #128 after the harness fixes so the branch has a concrete benchmark checkpoint for moving the recall changes forward.
…lumn

Fold the #142 prose block into a single Results-table row so every PR is
recorded the same way, and extend the Category Breakdown table with an
Overall column so per-category scores no longer require a cross-reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jack-arturo jack-arturo force-pushed the fix/128-recall-keyword-scoring-dead-for-vector-results-adaptive-floor-too-aggressive branch from 3f3fbb6 to 26aff1c Compare April 23, 2026 21:37

Copilot AI left a comment

Contributor

Pull request overview

This PR improves recall ranking quality by restoring keyword influence for vector-sourced hits and by softening the adaptive score-floor truncation so relevant “tail” results aren’t aggressively removed. It also updates the benchmark harness/tests/docs to support judge preflight and GPT‑5-family request behavior.

Changes:

  • automem/utils/scoring.py: add a content-token overlap fallback for keyword scoring when results are not keyword/trending sourced (e.g., vector hits).
  • automem/api/recall.py: relax adaptive floor application (25% threshold) and add a guardrail so it only applies when at least half of results remain.
  • Bench harness + tests/docs: default cat‑5 judge model to gpt-5.1, add judge preflight, structured output handling for GPT‑5, and env loading via dotenv.
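A rough sketch of the adaptive-floor guardrail, under one reading of "dropoff threshold": a cut is triggered at the first gap between adjacent scores that exceeds 25% of the top score, and the cut is applied only if at least half of the results survive. This interpretation, the function name, and the parameters are assumptions; recall.py may define the floor differently.

```python
def apply_adaptive_floor(
    scores: list[float],
    dropoff_ratio: float = 0.25,
    min_keep_fraction: float = 0.5,
) -> list[float]:
    """Return the retained prefix of `scores` (assumed sorted descending)."""
    if len(scores) < 2 or scores[0] <= 0:
        return list(scores)

    max_gap = dropoff_ratio * scores[0]
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] > max_gap:
            # Guardrail: apply the cut only if at least half survive it.
            if i >= len(scores) * min_keep_fraction:
                return list(scores[:i])
            return list(scores)
    return list(scores)


scores = [1.0, 0.95, 0.75, 0.7]
print(apply_adaptive_floor(scores, dropoff_ratio=0.15))  # old ratio cuts the tail
print(apply_adaptive_floor(scores, dropoff_ratio=0.25))  # new ratio keeps all four
```

Under this reading, raising the ratio makes the cut fire less often, which matches the "softer" framing of the change.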

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| automem/utils/scoring.py | Adds keyword fallback scoring for non-keyword result types based on query-token presence in content. |
| automem/api/recall.py | Adjusts adaptive floor threshold and adds a “retain at least half” guardrail. |
| tests/test_api_endpoints.py | Adds unit tests for keyword fallback scoring and adaptive floor behavior. |
| tests/test_locomo_cat5_judge.py | Updates cat‑5 judge tests to use default judge model + adds preflight/structured-output/retry coverage. |
| tests/test_benchmark_env_loading.py | New tests ensuring benchmark scripts load OPENAI_API_KEY from global dotenv config. |
| tests/benchmarks/test_locomo.py | Adds dotenv loading, judge preflight, GPT‑5 structured output request format, retry token budgets, and default judge model constant. |
| tests/benchmarks/longmemeval/test_longmemeval.py | Adds dotenv loading to align benchmark env behavior. |
| docs/TESTING.md | Updates LoCoMo judge docs (default model + preflight behavior). |
| docs/ENVIRONMENT_VARIABLES.md | Documents keyword scoring behavior now including the vector-content fallback. |
| test-locomo-benchmark.sh | Updates help text to reflect new default judge model. |
| benchmarks/EXPERIMENT_LOG.md | Normalizes/updates experiment log entries for #128 and #142 results. |
| .gitignore | Ignores /.worktrees/ for local benchmark worktrees. |

Comment thread automem/utils/scoring.py Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@jack-arturo jack-arturo merged commit 591b2c7 into main Apr 23, 2026
6 checks passed
@jack-arturo jack-arturo deleted the fix/128-recall-keyword-scoring-dead-for-vector-results-adaptive-floor-too-aggressive branch April 23, 2026 21:47
jack-arturo added a commit that referenced this pull request Apr 23, 2026
🤖 I have created a release *beep* *boop*
---

## [0.15.2](v0.15.1...v0.15.2) (2026-04-23)

### Bug Fixes

* **benchmarks:** make LoCoMo judge runs reliable ([#149](#149)) ([c22f2c9](c22f2c9))
* **recall:** bypass tag filter on expansion candidates ([#142](#142)) ([#146](#146)) ([4f0fcf8](4f0fcf8))
* **recall:** keyword scoring for vector results + softer adaptive floor ([#150](#150)) ([591b2c7](591b2c7))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Development

Successfully merging this pull request may close these issues.

recall: Keyword scoring dead for vector results + adaptive floor too aggressive

2 participants