PrecisionMemBench is a multi-dimensional retrieval benchmark for LLM memory systems. It measures four orthogonal properties that single-turn answer-quality benchmarks cannot detect:
- Retrieval precision - does the right belief surface, and only that belief, against a fixed seed corpus of 35 beliefs spanning two domain scopes, a supersession chain, and a secondary-user fixture
- Noise isolation - do beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns across a 10-turn session
- Session-turn latency - does retrieval latency degrade under session load relative to single-turn baselines
- Belief mutability - do beliefs updated mid-session surface immediately within the same session via the alias enrichment flywheel
These properties are independent. A system can pass on precision and fail on drift. A system can have clean single-turn latency and degrade 4x under session load. A system with no write-time mutation primitive cannot be scored on the fourth property at all, it is an architectural absence, not a performance difference.
Every case specifies not just what the memory system must return, but what it must not. Noise is a hard failure, not an invisible inference cost.
89 cases covering: alias resolution · scope disambiguation · supersession chain exclusion · fuzzy matching · cross-user isolation · budget eviction · ranking stability · session-level noise isolation under multi-turn topic drift
Paper: arXiv — Dataset: HuggingFace — Leaderboard: HuggingFace Spaces
| Provider | Active passes | Total passes | Mean precision | Mean recall | Retrieval p50 (ms) | Ingestion total (s) |
|---|---|---|---|---|---|---|
tenure |
43/43 | 77/77 | 1.00 | 1.00 | 9.77 | 1.00 |
supermemory |
17/17 | 44/77 | 0.43 | 0.55 | 819.48 | 0.00 |
gbrain |
5/5 | 34/77 | 0.14 | 0.17 | 543.84 | 28.60 |
agentmemory |
0/0 | 7/77 | 0.17 | 0.97 | 82.28 | 1.10 |
yourmemory |
0/0 | 21/77 | 0.17 | 0.88 | 313.39 | 16.40 |
atomicmemory |
0/0 | 9/77 | 0.15 | 0.95 | 71.01 | 658.90 |
zep |
0/0 | 9/77 | 0.09 | 0.95 | 124.36 | 897.00 |
vector |
0/0 | 11/77 | 0.09 | 1.00 | 71.87 | — |
hindsight |
0/0 | 9/77 | 0.06 | 1.00 | 589.86 | 173.30 |
mem0 |
0/0 | 9/77 | 0.06 | 0.99 | 64.94 | 111.30 |
a-mem |
0/0 | 9/77 | 0.06 | 0.99 | 13.80 | 178.80 |
Active passes are the only column that answers whether the memory system itself retrieved correctly. A system cannot accumulate active passes by returning everything or nothing.
Recall of 1.0 does not imply precision. Every comparison system returns the correct belief alongside many incorrect ones and scores perfectly on recall as a result. Mean precision of 0.05 to 0.09 means roughly 10 to 18 irrelevant beliefs are returned alongside each correct one.
Total pass counts require this breakdown to be interpreted correctly. All counts are over the 77 non-session cases.
| Provider | Active retrieval | Structural | Trivially empty |
|---|---|---|---|
tenure |
43 | 25 | 9 |
supermemory |
17 | 18 | 9 |
gbrain |
5 | 20 | 9 |
a-mem |
0 | 6 | 3 |
agentmemory |
0 | 5 | 2 |
atomicmemory |
0 | 6 | 3 |
hindsight |
0 | 6 | 3 |
mem0 |
0 | 6 | 3 |
vector |
0 | 8 | 3 |
yourmemory |
0 | 15 | 6 |
zep |
0 | 6 | 3 |
- Active retrieval pass - the case carries a
retrievalPrecisionassertion and it is satisfied. This is the only pass type that demonstrates verified retrieval capability. - Structural pass - the case asserts scope isolation, supersession exclusion, or type routing without a precision assertion, and the structural property holds.
- Trivially empty pass - the expected
relevantBeliefstier is empty by case design (empty query,maxBeliefs: 0, budget set to exact pinned count). Any system returning an empty set passes by construction.
| Model | Precision | Recall | Passes | Mean (ms) | p95 (ms) |
|---|---|---|---|---|---|
| nomic-embed-text (768) | 0.09 | 1.0 | 11/77 | 43.36 | 85.21 |
| mxbai-embed-large (1024) | 0.09 | 1.0 | 11/77 | 96.48 | 257.24 |
| qwen3-8b (4096) | 0.09 | 1.0 | 11/77 | 1130.95 | 2604.84 |
All 11 passes in every configuration are structural or trivially empty. Active retrieval passes are 0 across all three models.
The 12 session cases test three orthogonal properties: whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns, whether latency degrades under session load, and whether beliefs introduced mid-session surface within the same session window via the alias enrichment flywheel.
The drift score is the fraction of retrieved non-pinned beliefs originating from drift-turn topics; 0 is perfect isolation.
| Provider | Turns passed | Pass rate | Mean drift | Noise isolation | Mean precision | Session p50 (ms) |
|---|---|---|---|---|---|---|
tenure |
12/12 | 1.00 | 0.0000 | 1.00 | 1.0000 | 47.79 |
supermemory |
2/12 | 0.17 | 0.1667 | 0.17 | 0.6000 | 867.83 |
yourmemory |
1/12 | 0.08 | 0.7365 | 0.08 | 0.1965 | 430.49 |
gbrain |
1/12 | 0.08 | 0.0000 ‡ | 0.08 | — | 535.61 |
agentmemory |
0/12 | 0.00 | 0.8087 | 0.00 | 0.1913 | 98.49 |
atomicmemory |
0/12 | 0.00 | 0.8449 | 0.00 | 0.1551 | 355.08 |
zep |
0/12 | 0.00 | 0.8888 | 0.00 | 0.1112 | 418.13 |
vector |
0/12 | 0.00 | 0.9142 | 0.00 | 0.0858 | 256.75 |
a-mem |
0/12 | 0.00 | 0.9259 | 0.00 | 0.0741 | 25.66 |
hindsight |
0/12 | 0.00 | 0.9285 | 0.00 | 0.0715 | 1880.60 |
mem0 |
0/12 | 0.00 | 0.9398 | 0.00 | 0.0602 | 377.93 |
‡ gbrain returned no results for these session cases. A drift score of 0.0 is recorded by construction; no beliefs were returned, so none could originate from drift topics. The correct belief also failed to surface, making this an empty-result failure rather than a genuine isolation pass.
Understanding the three pass types is required to interpret any results table.
Active retrieval pass — the case carries a retrievalPrecision assertion and it is satisfied. This is the only pass type that demonstrates verified retrieval capability. A system cannot accumulate active passes by returning everything or nothing.
Structural pass — the case asserts scope isolation, supersession exclusion, or type routing without a precision assertion, and the structural property holds.
Trivially empty pass — the expected relevantBeliefs tier is empty by case design (empty query, maxBeliefs: 0, budget set to exact pinned count). Any system returning an empty set passes by construction. retrievalPrecision is null for these cases.
Without this breakdown, aggregate pass counts do not distinguish verified retrieval from structural or empty-set passes.
The 89 cases cover the following categories. Session cases extend the corpus dynamically — beliefs are created and alias sets updated mid-session — validating that retrieval reflects the live store state rather than a snapshot.
| Category | Cases |
|---|---|
| Alias resolution | 23 |
| Scope disambiguation | 12 |
| Session-level noise isolation | 12 |
| Fuzzy matching and prefix guards | 8 |
| Design boundary cases | 6 |
| Type routing and open questions | 6 |
| Budget eviction and capacity | 5 |
| Relation expansion | 4 |
| Persona prelude content | 4 |
| Supersession chain exclusion | 3 |
| Ranking stability | 3 |
| Counter-signal retrieval | 2 |
| Cross-user isolation | 1 |
| Cold start behavior | 1 |
| Total | 89 |
Alias resolution — whether variant surface forms (short-form, natural-language, multi-word) resolve to the correct belief.
Scope disambiguation — whether scope alone correctly discriminates between beliefs sharing an alias across different domain scopes.
Supersession chain exclusion — whether superseded beliefs are excluded at depth in a multi-hop chain. A query matching both a superseded and a superseding term must surface neither superseded belief; the active terminal belief surfaces via the pinned facts tier.
Fuzzy matching and prefix guards — whether the retrieval layer correctly handles transpositions and near-miss terms while blocking prefix mismatches that edit distance alone would permit. Both pass and fail behaviors are documented as intentional design properties.
Counter-signal retrieval — whether a query referencing a rejected or superseded term surfaces the active replacement belief via a counter-signal alias. Both cases carry an active retrieval precision assertion.
Relation expansion — whether relation-type beliefs correctly surface and expand their participants via a one-hop join, with participant type routing and scope filters applied during expansion.
Session-level noise isolation — whether beliefs introduced during off-topic drift turns contaminate retrieval on subsequent unrelated turns. The primary case is a 10-turn session with topic drift across 8 turns followed by an implicit return; per-turn assertions verify isolation at re-entry.
Budget eviction and capacity — whether the retrieval layer handles slot constraints correctly, including graceful empty returns, single-slot priority, and resistance to high-reinforcement flooding at the budget ceiling.
Design boundary cases — cases where both pass and fail behaviors are documented as intentional design properties.
Type routing and open questions — whether open questions are retrieved by a separate path that returns only pinned open questions for the active scope and are never returned by text search.
Ranking stability — whether retrieval results remain stable across equivalent queries without score-driven reordering artifacts.
Cross-user isolation — whether beliefs belonging to a second user are structurally excluded from a primary user's retrieval regardless of semantic proximity.
Cold start behavior — whether a new user with zero seeded beliefs returns a fully empty context without error.
Persona prelude content — whether the persona prelude generated from the accumulated belief state is injected correctly and reflects the live belief store.
Four metrics are recorded per case:
- Retrieval precision and recall — computed over the
relevantBeliefstier on cases where that tier carries an active assertion. Cases where this metric is structurally inapplicable record null and are excluded from aggregate computation. - Pinned coverage — recorded on cases where the
pinnedFactstier is asserted. - Question precision and recall — recorded on cases where the
openQuestionstier is asserted.
A pass requires all asserted tiers to be simultaneously satisfied. A case with retrievalPrecision: 1.0 that also carries an unmet pinnedCoverage assertion fails.
Drift score is reported for session cases: the fraction of retrieved non-pinned beliefs originating from drift-turn topics. 0 is perfect isolation.
Pre-run reports for all reference systems are committed at test-results/baseline/:
test-results/baseline/
retrieval-report.json
retrieval-report-vector.json
retrieval-report-mem0.json
retrieval-report-zep.json
retrieval-report-hindsight.json
Each report contains per-case results including passed, failures, retrievalPrecision, retrievalRecall, and retrievalLatencyMs, plus aggregate p50/p95 latency and mean precision/recall at the top level.
When you run against your own provider, compare your output in test-results/ directly against these files.
- Node.js 20+
- Docker (for the vector baseline and provider stacks)
- An Ollama instance for the vector baseline only
npm installStart the provider's stack, then:
MEMORY_PROVIDER=mem0 npx ava retrieval.external.eval.test.ts
MEMORY_PROVIDER=mem0 npx ava session-retrieval.external.eval.test.tsReports land in test-results/. Valid values: mem0, zep, hindsight
The vector eval manages its own MongoDB Atlas Local container. Docker must be running but you do not set anything up manually.
# One-time: generate embeddings and commit the result
OLLAMA_URL=http://localhost:11434 npx tsx embed-seed.ts
# Run the eval
npx ava retrieval.vector.eval.test.ts
npx ava session-retrieval.vector.eval.test.tsThe Atlas Local container starts and stops automatically per run. Ports 27019 (single-turn) and 27021 (session) are used.
python export_to_hf.py
# Output: hf_export/leaderboard.json + hf_export/README.mdExpose a FastAPI service with three endpoints. See wrappers/mem0_service.py for the full contract.
POST /add
{
"text": "redis_cache Redis",
"user_id": "test-user",
"metadata": { "beliefId": "b-redis-code" }
}POST /search
{ "query": "Redis eviction policy", "user_id": "test-user", "limit": 20 }Returns: { "results": [ { "id": "...", "memory": "...", "metadata": { "beliefId": "..." } } ] }
DELETE /reset
Clears all memories for all users. Called once before seeding.
The beliefId in metadata is how the harness maps provider results back to the benchmark's belief schema. If your provider cannot round-trip arbitrary metadata, implement a custom resolveBeliefId in the adapter.
Add one entry to providers.config.json:
"myprovider": {
"envVar": "MYPROVIDER_URL",
"defaultUrl": "http://localhost:8082",
"seedDelayMs": 1000,
"beliefToText": "canonical_name_aliases"
}MEMORY_PROVIDER=myprovider npx ava retrieval.external.eval.test.ts
MEMORY_PROVIDER=myprovider npx ava session-retrieval.external.eval.test.tsThe eval files themselves never need to change.
- Fork this repo.
- Run the full eval suite against your provider (both
retrieval.external.eval.test.tsandsession-retrieval.external.eval.test.ts). - Commit your report files from
test-results/totest-results/baseline/using the naming conventionretrieval-report-{provider}.json. - Open a PR. Include the provider name, Docker image digest (if applicable), and any relevant configuration notes in the description.
Results from merged PRs are reflected on the live leaderboard.
Tenure's eval lives in the Tenure repo and runs directly against its BeliefsReader and ContextBuilder implementations. It is fully self-contained. The Atlas Local container starts and stops automatically. Reports land in test-results/. Results are re-produced on every pull request via CI.
git clone https://github.com/tenurehq/tenure.git
cd tenure
npm i
npm run test:evalEach comparison provider is wrapped with a thin FastAPI service that normalises the /add / /search / /reset contract. Wrappers are in wrappers/.
cd wrappers && docker compose upRequires MEM0_URL, an Ollama instance for embeddings, and a running Qdrant container (included in docker-compose.yml).
cd wrappers
HINDSIGHT_URL=http://localhost:8888 python hindsight_wrapper.pycd wrappers && docker compose up@article{flynt2026precisionmembench,
title = {Structured Belief State and the First Precision-Aware Benchmark
for LLM Memory Retrieval},
author = {Flynt, Jeffrey},
year = {2026}
}