Memory Arena

A 30-line vector store beats every funded agent-memory SDK on the same eval. None of them is close to solving the problem.

Memory Arena is an open-source harness that runs 20 memory strategies through one lifecycle, one judge, one model, fully reproducible. Run it yourself, add your own strategy, or PR a vendor's tuned config and I'll re-run it.

What this is. Memory Arena runs 20 agent-memory strategies through the same setup -> ingest -> recall -> teardown lifecycle against the same chat-session corpus (LongMemEval-S smoke), judged by the same model (Claude Opus 4.7), with the same top_k. Six are vendor SDKs at their documented defaults (Mem0, Graphiti, Graphiti-on-FalkorDB, Cognee, LangMem, Memori). Twelve are pure-Python baselines and retrievers (vector, BM25, RRF, HyDE, RAPTOR, Reflection, Persona Profile, Karpathy's LLM Wiki, A-MEM, HippoRAG 2, full-context, recency window). Two are quantum rerankers over the same vector store (QISS, a pure-NumPy fidelity reranker; SQR, a Qiskit Aer SWAP-test reranker). Each result JSON is stamped with commit SHA, package versions, model IDs, and seed, so anyone can re-run and bisect.

What it isn't. Not vendor-tuned. Not a single-judge truth (19-way Opus-vs-GPT-4o cross-judge Spearman is +0.967; GPT-4o runs more lenient in absolute terms but agrees on the ranking). Not a large-N study (the smoke subset is 16 questions; the v0.2 full LongMemEval-S brings 500). Read What this benchmark does NOT claim before quoting the numbers.

_{20 systems, one eval. Sorted by accuracy under the Opus judge. The funded vendor SDKs (coral) all land below `naive_vector`, a 30-line ChromaDB script; `mem0` and `langmem` run on the same model as the baselines, so it isn't a model handicap. Opus is a strict judge: GPT-4o grades these ~15pp higher with +0.967 rank agreement. Even the leader gets ~half right, memory is an open problem, not a shipped feature.}

_{The cost view. Accuracy vs cost (log scale). The cheap pure-Python retrievers (navy) own the frontier; the vendor SDKs (coral) and full_context (grey) are both worse and no cheaper once you count their internal calls. qiss lands next to naive_vector: it reranks the same candidates by cosine-squared fidelity, which reorders retrieval a little but doesn't move accuracy past noise; sqr trails. Why, in [`docs/quantum-and-compression.md`](docs/quantum-and-compression.md).}

_{The 20 strategies placed in one design space. X-axis: when computation happens (read-time → write-time). Y-axis: representation (raw text → vectors → derived facts → graph). The pattern: tier-1 accuracy spans the whole grid, no single quadrant wins.}

Rule of thumb: which agent memory should you use?

Start with a plain vector store. Spend your effort on how the model reasons over what it retrieves, not on a fancier index. Reach for a graph only when plain vectors actually fail, which here was time-aware questions. The expensive SDKs don't beat the free baseline.

_{Why this is the rule. Each bar is how often the answer is correct; each diamond is how often the right memory was actually retrieved (recall@5). Every top retriever finds the memory 80-92% of the time but answers correctly only about half. That gap is reasoning, not retrieval, which is why a 30-line vector store is enough and a fancier index or a funded SDK doesn't close it. full_context doesn't even retrieve (it stuffs the whole conversation) and costs the most.}

Default to a plain vector store (embed each turn, top-k cosine, ~30 lines). It beat every funded SDK here and costs almost nothing.
Fix the reasoning, not the retrieval. The right memory gets found 85-92% of the time; the model uses it correctly only ~half the time. A better answer prompt beats a fancier retriever.
Reach for a graph only where plain vectors actually fail: temporal questions. On time-aware "when did X happen" queries every vector method collapsed (4-15%) and a from-scratch HippoRAG 2 recovered about half (~60%, consistent across seeds, at ~10x the cost). On multi-session reasoning the vector store actually won (75% vs 58%), so "graphs for multi-hop" is a theory this 16-question corpus can't confirm; v0.2 will.
Pay a vendor SDK for managed infra, not accuracy. On the same model, Mem0 / Graphiti / Cognee / LangMem didn't beat the free baseline. The convenience (extraction, updates, hosting) is real; the accuracy bump isn't.
Never just stuff the whole conversation into context. full_context is the worst trade on the board: most expensive, and lower accuracy than a cheap retriever.

Full decision tree and per-use-case matrix in docs/decision-guide.md.

What's interesting in the data

One-liner: You don't need an agent-memory SDK to beat one. On this corpus a 30-line vector store tops every memory SDK, including Mem0 run on the exact same model.

The top tier is entirely pure-Python, and every vendor SDK lands below it. hipporag2 (54.5%), qiss (50.5%), persona_profile (49.7%), naive_vector (49.2%), reflection (47.4%), and raptor (46.8%) cluster at the top with overlapping 95% CIs. The funded vendor SDKs all score lower: Mem0 at 31.4% (and we put it on the exact same model as the baselines, Claude Sonnet, so it isn't a model handicap), below even bm25; the graph vendors at ~23%. A 30-line ChromaDB script beats all of them. Leveling cuts both ways, and we measured it: putting Mem0 on Sonnet barely moved it (~3pp; its drop from a naive-default 51% is mostly a v1->v2 regression, ~16pp), while langmem rose +6pp on Sonnet to 33% (the top vendor, still below bm25). graphiti and cognee can't be cleanly leveled in one environment (documented in the methodology note); they stay vendor-default ~26pp back regardless.
Pure-Python advanced retrievers don't separate from the simple vector store. hybrid_rrf (~44%), hyde (~43%), raptor (~47%), and reflection (~47%) are all within ±5pp of plain naive_vector despite costing 1.5-8x more per run. At this sample size these are noise; worth re-checking on the full 500-question corpus.
The "I'll just stuff the whole conversation in" baseline is dead. full_context scores 29.5% at $5.16/run; every retrieval strategy beats it for 1-50% of the cost. Long-context isn't a substitute for retrieval, even when the model can hold it.
Graph-shaped vendors (graphiti, cognee, karpathy_llm_wiki) underperform on this slice. Both temporal-graph vendors land at ~23% and the LLM-maintained wiki at ~22%; about half of naive_vector. This is a measurement boundary, not a verdict. The smoke corpus contains only 4 multi-session-reasoning questions out of 16; graph approaches earn their write-time cost on multi-hop synthesis the v0.2 full corpus (500 questions, ~125 multi-session) will exercise. The graph-vs-vector head-to-head you actually want is in v0.2; treat the smoke numbers for graph systems as a lower bound on what the architecture can do.
The quantum rerankers don't beat the closed-form baseline. qiss (quantum-inspired, pure NumPy) reranks naive_vector's candidate pool by cosine-squared state fidelity. It does reorder retrieval (its top-5 differs from naive_vector's on about one question per seed, so Recall@5 is 84.9% vs 87.0%), but the reordering doesn't help: 50.5% (3-seed mean, 95% CI overlapping the whole top tier) is statistically tied with naive_vector's 49.2%. sqr (a real SWAP-test circuit on the Qiskit Aer simulator) drops to 40.4%: amplitude-encoding forces a PCA squeeze of 3072-d embeddings into a 16-amplitude 4-qubit register, which keeps only 36% of the variance (measured, results/longmemeval-s_quantum_diagnostics.json), and the reranker eats that loss. The honest read: simulated quantum machinery buys nothing over the closed-form cosine on this task; the interesting number is the dimensionality-reduction tax, not a quantum win.
Retrieval is mostly solved; reasoning over the retrieved memory is the bottleneck. Across six top-tier strategies, recall@5 is 85-92% while accuracy is only 47-51%. Restricting to just the questions where the correct session was retrieved barely moves accuracy (~48-54% mean), and they still score below half marks on roughly half of those questions. The hard part isn't finding the memory, it's answering correctly once you hold it. (hipporag2 is the exception that proves it: same recall (~85%), but 63% accuracy on the perfect-recall subset, the best at using what it retrieves. N=16, so directional; computed from the per-question recall_records in results/.)

_{Pairwise significance heatmap. Read row "A" vs column "B": navy = A beats B with 95% bootstrap CI excluding 0; coral = B beats A; white = statistical tie. The white block in the upper-left is the "tier 1 is tied" claim, visually verified.}

What this benchmark does NOT claim

Not "vendor SDK X is bad." We report vendor-default behavior on a single corpus. A tuned config is a separate measurement; vendor PRs are explicitly invited (see Vendors: PR your tuned config).
Not "the top tier is equivalent." N=16 makes 95% CIs wide; we report a statistical tie at this sample size, not equivalence. The full LongMemEval-S (500 questions) ships in v0.2 with CIs tight enough to rank within tier 1.
Not "graph memory is dead." graphiti and cognee score lower than vector retrievers on this slice. The smoke subset has 4 multi-session-reasoning questions; graph approaches need many more to demonstrate their advantage. The full LongMemEval-S (500 questions, ~125 multi-session) and LongMemEval-M (multi-hop heavy) ship in v0.2.
Not "every vendor is on the same model." mem0 and langmem now extract on Claude Sonnet, leveled to the harness (see the methodology note). graphiti and cognee can't be cleanly leveled in one environment (graphiti-core 0.13 ships no Anthropic client, and 0.17, which does, is a version confound; cognee 1.0.3 needs a newer starlette than the pinned fastapi allows), so they run vendor-default (gpt-4o / gpt-4o-mini). Read those two rows as vendor-default, not model-controlled; both sit ~26pp below the top tier regardless of model.
Not "a single judge is bias-free." Opus 4.7 grades every answer, but 19 of the 20 strategies were re-graded by GPT-4o and the ranking holds: Spearman rho = +0.967 (results/cross_judge_report.json). GPT-4o runs ~10-25pp more lenient in absolute terms but agrees on order, and independently ranks every vendor SDK below the pure-Python tier (sqr's results lack a seed suffix, so the seed-based re-grade skips it). See Zheng et al. (NeurIPS 2023, arXiv:2306.05685) for the judge-bias framework.
Not "the configurations are tuned." They are documented vendor defaults plus the deviations listed in Configurations below.
Not "this generalizes to all memory workloads." Chat sessions are one slice; tool-use traces, codebases, and long documents are not in scope.
Not "rank N vs N+1 is meaningful at the smoke scale." With 20 benchmarked strategies pairwise-compared on 7 metrics, family-wise error rates require Benjamini-Hochberg correction (queued for v0.2 alongside the paired bootstrap). Treat within-tier ranks as unordered.

Configurations

Vendor-default everywhere except where flagged. top_k=5 held constant across all strategies for the headline run. Same OpenAI text-embedding-3-large (3072 dims) for every embedding-based strategy. Same Anthropic claude-sonnet-4-6 for the recall-step generator unless the SDK pins its own. Same Anthropic claude-opus-4-7 as the LLM judge.

Strategy	top_k	Embedding model	Generator	Judge	Max tokens (gen)	Notes
`full_context`	n/a	n/a	sonnet-4-6	opus-4-7	4096	Stuffs every turn into the prompt up to `full_context_token_budget=150000`.
`recency_window`	n/a	n/a	sonnet-4-6	opus-4-7	4096	Last `recency_window_n=20` turns only.
`naive_vector`	5	text-embedding-3-large	sonnet-4-6	opus-4-7	4096	ChromaDB collection `naive_<run_id>`.
`bm25`	5	n/a (lexical)	sonnet-4-6	opus-4-7	4096	rank-bm25 in-process.
`hybrid_rrf`	5	text-embedding-3-large	sonnet-4-6	opus-4-7	4096	RRF k=60. Inner top_k = `max(top_k * 4, 20) = 20` per branch.
`hyde`	5	text-embedding-3-large	sonnet-4-6	opus-4-7	4096	Hypothetical answer drafted by sonnet, then embedded for retrieval.
`persona_profile`	5	text-embedding-3-large	sonnet-4-6	opus-4-7	4096	Haiku 4.5 builds the one-shot user profile (~15K-char buffer) before first recall.
`reflection`	5	text-embedding-3-large	sonnet-4-6	opus-4-7	4096	Reflection summary written every 4 sessions by sonnet, indexed alongside raw turns.
`raptor`	5	text-embedding-3-large	sonnet-4-6	opus-4-7	4096	scikit-learn k-means, 4 levels, branch=4. Cluster summaries via Haiku 4.5.
`karpathy_llm_wiki`	5	n/a (markdown wiki)	sonnet-4-6	opus-4-7	4096	Haiku 4.5 picks pages, sonnet writes answers. Lint pass every 10 sessions over the 30 largest pages.
`mem0`	5	text-embedding-3-large	sonnet-4-6	opus-4-7	vendor	mem0ai 2.0.2 (v2). Internal fact-extraction runs on Claude Sonnet, leveled to the harness (v2's Anthropic adapter is fixed). See methodology note.
`graphiti`	5	text-embedding-3-large	gpt-4o	opus-4-7	16384	Deviates from doc default: `max_tokens=16384` (paper default 8192) plus 2-turn (user+assistant) chunked episodes to keep entity-extraction under the cap. Cost-driven trade-off.
`graphiti_falkor`	5	text-embedding-3-large	gpt-4o	opus-4-7	16384	Identical config to `graphiti` but on graphiti-core 0.17 with the FalkorDB driver (host port 6381) instead of Neo4j. Separate env (needs pydantic 2.11+).
`cognee`	5	text-embedding-3-large (3072 dims)	gpt-4o-mini	opus-4-7	vendor	`add` -> `cognify` -> `search(GRAPH_COMPLETION)`. networkx graph (default).
`langmem`	5	text-embedding-3-large	sonnet-4-6	opus-4-7	vendor	LangGraph `InMemoryStore` + `create_memory_store_manager(anthropic:claude-sonnet-4-6)`, leveled to the harness.
`memori`	5	vendor-internal (cloud)	gpt-4o-mini	opus-4-7	vendor	Memori 3.x augmentation runtime owns embedding internally; we cannot pin it to ours. Without `MEMORI_API_KEY` the augmentation is throttled near zero.

Methodology note: leveling mem0's generator

A clean comparison routes every strategy's internal LLM calls through the same model so the only variable is the memory architecture. Older runs of mem0 could not: mem0ai v1 (0.1.114) had an Anthropic-adapter bug that forced its internal fact-extraction onto OpenAI gpt-4o-mini, while pure-Python rows ran end-to-end on Sonnet 4.6. v2 (2.0.2) fixes that adapter, so mem0 now extracts on Claude Sonnet, the same model as the baselines. Embeddings (text-embedding-3-large) are pinned across the table.

We measured the confound directly instead of guessing at it (3-seed and 5-seed runs):

mem0 config	accuracy	recall@5
v1 + gpt-4o-mini (the old published number)	50.9%	91%
v2 + gpt-4o-mini	34.6%	80%
v2 + Sonnet (leveled, what the table now shows)	31.4%	71%

The model swap is small, ~3pp. Holding the version at v2, moving extraction from gpt-4o-mini to Sonnet costs 2.7-3.2pp across 5 seeds. So mem0's score was never a model artifact; the generator the field worried about barely moves it.
The version is the real story, ~16pp. mem0 v2's OSS rewrite (which removed the graph store and consolidates facts more aggressively) scores ~16pp lower than v1 on this haystack corpus, regardless of model. We benchmark v2 because it is what pip install 'memory-arena[mem0]' gives you today; pinning the deprecated v1 to keep a flattering 51% would test software nobody installs.
Cost (~$0.05/run) counts only memory-arena's own generation calls; mem0's internal Sonnet extraction is not counted (footnote ‡).

langmem is leveled the same way, and it cut the other direction: on Sonnet it rose +6pp (26.9% -> 33.0%), the top vendor on the board, so its default model was selling it short. graphiti and cognee can't be cleanly leveled in this environment: graphiti-core 0.13 has no Anthropic client (0.17 does, but the version jump is itself a confound, like mem0 v1->v2), and cognee 1.0.3 needs a newer starlette than the pinned fastapi allows, so it only runs in its own venv. Both stay at their benchmarked vendor default; leveling them needs a separate environment and is the one remaining gap. mem0g (mem0 v1 graph memory) is excluded from the leaderboard entirely: v2 removed the OSS graph store, so it can only run on the deprecated v1 + gpt-4o-mini, and graph memory is already covered by graphiti.

Statistical methodology

Bootstrap. For strategies with 3 seeds, we compute the per-seed per-question accuracy, then resample question IDs with replacement (2000 iterations) and compute the mean. The 95% CI is the 2.5th-97.5th percentile of the bootstrap distribution. Single-seed strategies show no CI; their numbers are point estimates.
What CI width to expect. With N=16 questions per seed and three seeds, the median 95% CI half-width on the accuracy axis is ~18 pp. Strategies with low variance across seeds (e.g. raptor, recency_window, bm25) collapse to <1 pp; strategies whose performance hinges on stochastic extraction (e.g. mem0, hyde) end up >15 pp. The full LongMemEval-S (500 questions, v0.2) shrinks every row's CI by a factor of ~5.6.
Pairwise comparisons. The pairwise heatmap (docs/pairwise.png) uses a paired bootstrap of per-question accuracy differences (2000 resamples), reporting a cell as "significant" only if the 95% CI excludes 0 AND the absolute gap exceeds 5 percentage points. This is more conservative than treating overlapping seed-level CIs as non-significance.
Multiple-comparisons correction. With 20 benchmarked strategies and 7 axes, pairwise tests imply up to 20 x 19 / 2 x 7 = 1330 implicit hypotheses if every cell were treated independently. The published pairwise heatmap is informally corrected by the 5 pp gap floor; a formal Benjamini-Hochberg q-value column ships in v0.2. Until then, treat the pairwise heatmap as a screening tool, not a defendable rank.
Single judge bias. Opus 4.7 is the single judge. Re-grading 19 of the 20 strategies with GPT-4o yields Spearman rho = +0.967 on the ranks (see results/cross_judge_report.json); GPT-4o grades higher in absolute terms but agrees on order. The framework for quantifying judge bias is Zheng et al. (NeurIPS 2023, arXiv:2306.05685).
Reproducibility. Every per-seed result file under results/ carries a metadata block with commit_sha, seed, package_versions, models, top_k, host, python, and timestamp. Re-running scripts/aggregate_bootstrap.py from a fresh checkout reproduces the headline numbers from the per-seed files; re-running the bench against the same seed reproduces those.

No API keys, no Docker, no corpus download. The package ships with a smoke result snapshot and the dashboard renders it from the bundled FastAPI server:

pip install memory-arena
memory-arena demo

The dashboard lives at http://localhost:8000/. Three pages: Home (20 strategy cards, 4 question categories evaluated), Benchmark (sortable table with 95% CIs + cost + latency), Recall Lab (per-question HIT/MISS drill-down).

_{The bundled dashboard's Benchmark page, sortable by any column. memory-arena demo serves it from the snapshot with no API keys. Also: home (20 strategy cards) and recall lab (per-question HIT/MISS).}

Real benchmark numbers

LongMemEval-S smoke subset: 16 questions across 4 categories: 4 questions each from information_extraction, multi_session_reasoning, temporal, and knowledge_update. 82 sessions. Judge: claude-opus-4-7. 3-seed bootstrap where available (13/20 benchmarked strategies); remaining strategies are single-seed. CIs are 95% bootstrap. top_k=5 held constant across all strategies.

Read this first. N=16 yields a median 95% bootstrap CI half-width of ~18 pp on accuracy, so the top tier is a wide statistical tie. hipporag2 posts the top point estimate (54.5%, 3 seeds) but its three seeds alone swing 49-62%, and its CI overlaps qiss, persona_profile, naive_vector, reflection, and raptor, all pure-Python. Every vendor SDK sits below this tier. Treat the leaderboard as "tier 1 is a ~47-55% pure-Python tie; clear gap to bm25; vendors below; floor at recency_window." Cross-judge sanity check: re-grading 19 of the 20 strategies with GPT-4o instead of Opus 4.7 gives Spearman rho = +0.967 on the ranks (see results/cross_judge_report.json). The full 500-question LongMemEval-S lands in v0.2 with CI half-widths small enough to rank within tier 1. See Statistical methodology for the bootstrap recipe, multiple-comparison handling, and known limits.

Strategy	Accuracy (95% CI)	Recall@5	Cost	Latency	Status	What it does
`hipporag2`	54.5% ±18.6	85.2%	$0.859	6542ms	ok	HippoRAG 2 (ICML 2025): extracts entity triples into a graph and ranks passages by personalized PageRank.
`qiss`	50.5% ±17.7	84.9%	$0.091	3724ms	ok	Reranks vector hits by quantum fidelity (cosine squared) over the same embeddings; pure NumPy, optional multi-query superposition fusion.
`persona_profile`	49.7% ±18.2	86.5%	$0.202	6554ms	ok	Builds a one-page bio of the user up front and pastes it into every answer.
`naive_vector`	49.2% ±18.6	87.0%	$0.089	6653ms	ok	Embeds every message and pulls the closest matches by cosine similarity.
`reflection`	47.4% ±18.2	83.3%	$0.431	6838ms	ok	Periodically writes journal-style summaries of recent sessions, then searches both.
`raptor`	46.8% ±19.6	89.1%	$0.115	3861ms	ok	Hierarchical k-means clustering with LLM cluster summaries (Sarthi et al. 2024).
`hybrid_rrf`	43.6% ±18.8	84.9%	$0.731	12360ms	ok	Reciprocal rank fusion of vector + BM25. Blended ranking.
`hyde`	42.5% ±17.0	78.5%	$0.223	10652ms	ok	Guesses what the answer might look like first, then searches for messages like the guess.
`sqr`	40.4% ±17.6	66.7%	$0.067	3714ms	ok	Reranks vector hits with a real SWAP-test circuit on the Qiskit Aer simulator (PCA-reduced amplitude encoding, exact statevector).
`amem`	39.5% ±16.3	91.5%	$1.09	3634ms	ok	A-MEM (NeurIPS 2025): writes an LLM note per memory and periodically links and revises related notes.
`bm25`	34.4% ±16.8	79.2%	$0.090	8299ms	ok	Old-school keyword search. What Google did before vectors.
`langmem`	33.0% ±9.5	-§	$0.163‡	4227ms	ok	LangChain's memory store: extracts facts as they happen and recalls them by similarity.
`mem0`	31.4% ±16.7	70.7%	$0.054‡	4082ms	ok	Vendor SDK that extracts standalone facts and stores them as memories.
`full_context`	29.5% ±19.0	8.3%	$5.16	7664ms	ok	Pastes the entire chat history into every prompt. Ceiling reference; expensive.
`graphiti`	23.4% ±15.3	58.9%	$0.034‡	3489ms	ok	Temporal knowledge graph with valid_at/invalid_at edges (Zep OSS).
`cognee`	23.3% ±17.9	-§	$0.023‡	9205ms	ok	Knowledge-graph memory: add → cognify → search(GRAPH_COMPLETION).
`graphiti_falkor`	22.5% ±14.0	63.0%	$0.034‡	3223ms	ok	Same Graphiti algorithm on FalkorDB (Redis graph engine) instead of Neo4j; isolates the database to test the engine's latency claim.
`karpathy_llm_wiki`	22.5% ±15.5	51.6%	$1.27	4263ms	ok	LLM maintains a markdown wiki of entity pages with cross-links and citations.
`recency_window`	5.8% ±5.9	6.2%	$0.340	5894ms	ok	Only remembers the last 20 messages. Floor baseline.
`memori`	1.4% ±1.0	-§	$0.012‡	2472ms	ok	SQL-native fact store with augmentation pipeline. ‖

Footnotes. ‡ memory-arena-paid generation cost only; vendor SDKs run additional internal LLM calls (mem0 extraction, graphiti entity extraction, langmem fact extraction, cognee cognify, etc.) that aren't counted. True cost is this number plus the vendor-internal component. § Recall@k requires the strategy to return chat-session pointers; LangMem/Cognee/Memori store extracted facts, neither maps to LongMemEval session IDs. ‖ memori at this score reflects the no-MEMORI_API_KEY baseline (cloud augmentation throttled to ~zero). Set the key for the vendor's intended throughput; PRs welcome.

20 strategies. 13/20 strategies have 3-or-more-seed bootstrap CIs; the rest are single-seed (no CI). top_k=5 held constant. Judge: claude-opus-4-7. Generation: claude-sonnet-4-6 (strategies that route through their own SDK use the SDK default). Hardware/SDK versions stamped in results/<strategy>_summary.json metadata block.

FalkorDB vs Neo4j, version-controlled. The graphiti row pins graphiti-core 0.13 (the published baseline); graphiti_falkor needs 0.17, the first release with the FalkorDB driver. So the two rows are not a clean engine comparison: graphiti-core 0.17 by itself lifts Neo4j from 23.4% to 26.7% accuracy and 3489ms to 3054ms latency. Re-running graphiti on Neo4j under the same 0.17 gives 26.7% / 66.1% recall@5 / 3054ms vs graphiti_falkor at 22.5% / 63.0% / 3223ms. Under one library version the two engines are statistically tied (16 questions, single seed, ~14pp CI half-width), with Neo4j marginally ahead on every axis and zero ingest failures (FalkorDB's driver hits ~1.2%). FalkorDB's headline graph-query latency advantage does not surface end-to-end because the answer-generation LLM call dominates recall latency. No measurable FalkorDB win at this level of measurement.

The vendor caveat: rows labeled config-failed-at-default (none in the current run) reflect a vendor's shipped default config breaking ingest at install time, not what the SDK can do when tuned. PRs that ship a working default for any vendor in that bucket are welcome, see CONTRIBUTING.md. The table answers the question "what does memory-arena measure when I run the documented install command?", not "what's the maximum each vendor SDK can achieve."

Sanity check vs the LongMemEval paper

The LongMemEval paper (Wu et al., ICLR 2025; Table 3) reports plain semantic-retrieval baselines on the M variant of the corpus. Their session-granularity vector retriever lands at Recall@5 = 0.706 with GPT-4o reader/judge accuracy = 0.670. Our naive_vector on LongMemEval-S smoke gets Recall@5 = 0.870 (we use the newer text-embedding-3-large) and 49.2% accuracy under Opus 4.7 as judge. The Opus judge is stricter than the paper's GPT-4o; we treat ours as a floor, not a comparable. A 19-way cross-judge under GPT-4o gives Spearman rho = +0.967 on the ranks (GPT-4o more lenient, same order). The underlying retrieval is, if anything, slightly better than the paper's because of the newer embedding model.

Dig deeper

Want…	Read
Why mem0 wins question X but loses question Y	`docs/case-studies.md`: 4 questions, side-by-side answers
Which architecture to pick for your use case	`docs/decision-guide.md`: decision tree + use-case matrix
What every strategy answered for every question	`docs/per-question-comparison.md`: static "ask all 16" page
Common objections + answers	`docs/FAQ.md`: 18 pre-empted questions
Do the quantum strategies beat cosine? (no, and why)	`docs/quantum-and-compression.md`: root cause, literature, and the compression cost frontier
Vendor SDK pin reasons + breakages	`docs/vendor-pins.md`

How memory-arena measures

Every strategy goes through the same lifecycle:

setup(run_id) -> ingest_session(...) sequentially -> recall(query) -> teardown()

Same OpenAI text-embedding-3-large for vectors that need them. Same Anthropic Sonnet 4.6 for generation. Same Anthropic Opus 4.7 for the LLM judge.

The 7-axis evaluator

Structural, must_mention, must_not_claim, max_tokens
Sources, at least one labeled supporting_session_id retrieved
LLM judge, Opus 4.7 grades 0..100 against the reference
Eval memo, identical (answer, reference) pairs cached in-process
Temporal correctness, claimed time-marker overlaps the ground-truth window
Update precision, answer reflects the latest fact version
Abstention F1: F1 over abstention questions. Returns null when no abstention question is evaluated; the current smoke subset has no abstention category (returns null across all rows). The v0.2 sweep restores 4 abstention questions per seed.

The 20 strategies

Pure-Python baselines and retrievers

Strategy	Backing	Notes
`full_context`	in-process	Stuff every turn into the prompt up to the budget.
`recency_window`	in-process	Last N turns. Cheapest baseline.
`naive_vector`	local Chroma	Embed every turn, top-k cosine.
`bm25`	in-process	Pure-Python lexical baseline.
`hybrid_rrf`	local Chroma + rank-bm25	Reciprocal Rank Fusion of vector + BM25.
`hyde`	local Chroma	Hypothetical Document Embeddings, guess answer first, embed that.
`persona_profile`	local Chroma	One-shot persona stuffed as system context for every recall.
`reflection`	local Chroma	Synthetic LLM-authored summaries every 4 sessions, indexed alongside raw turns.
`raptor`	scikit-learn	Hierarchical k-means clustering with LLM cluster summaries.
`karpathy_llm_wiki`	local markdown wiki	LLM maintains a markdown wiki with `[[wikilinks]]` and `[session=...]` citations. (pattern)
`amem`	local Chroma	A-MEM (NeurIPS 2025): LLM-authored memory notes with a periodic link-evolution pass.
`hipporag2`	networkx	HippoRAG 2 (ICML 2025): open-IE triples plus personalized PageRank over an entity graph.

Quantum rerankers (over `naive_vector`)

Both coarse-retrieve top_k x fanout candidates from naive_vector's Chroma index, then rerank by a quantum-state fidelity. They share the vector store, so the retrieval substrate is identical and only the reranking math differs.

Strategy	Backing	Notes
`qiss`	local Chroma + NumPy	Quantum-Inspired Semantic Similarity. Reranks by fidelity Tr(rho_q rho_d) = cosine squared. Pure NumPy, no new deps; optional multi-query superposition adds interference cross-terms.
`sqr`	local Chroma + Qiskit Aer	Simulated Quantum Reranker. SWAP-test circuit on the Aer simulator (exact statevector); embeddings PCA-reduced to 2^n_qubits dims and amplitude-encoded. Needs `pip install 'memory-arena[quantum]'`.

Vendor SDKs

Strategy	Backing	Notes
`mem0`	Chroma	mem0ai v2; internal fact-extraction leveled to Claude Sonnet.
`graphiti`	Neo4j	Temporal knowledge graph (Zep OSS).
`graphiti_falkor`	FalkorDB	Same algorithm as `graphiti` on a Redis-based graph engine; isolates Neo4j vs FalkorDB.
`cognee`	networkx (default) / Neo4j	`add` → `cognify` → `search(GRAPH_COMPLETION)`.
`langmem`	LangGraph InMemoryStore	`create_memory_store_manager(anthropic:claude-sonnet-4-6)`, leveled; text-embedding-3-large.
`memori`	Postgres	SQL-native, augmentation pipeline. Cloud-quota throttled without `MEMORI_API_KEY`.

Quick start (local)

git clone https://github.com/xmpuspus/memory-arena
cd memory-arena
pip install -e '.[dev]'

# The benchmark makes real LLM + embedding calls, so set provider keys first
# (pydantic-settings reads them from the environment or a local .env file):
export ANTHROPIC_API_KEY=...   # Claude: generation + judge
export OPENAI_API_KEY=...      # embeddings (text-embedding-3-large)

# Bring up Neo4j (graphiti) and Postgres-pgvector (memori)
docker compose up -d neo4j postgres

# Pull the LongMemEval corpus and ingest the smoke subset
memory-arena download-longmemeval
memory-arena ingest-sessions --corpus longmemeval-s

# Pure-Python strategies, 3-seed bootstrap
for SEED in 0 1 2; do
  memory-arena benchmark --corpus longmemeval-s \
    --strategy 'bm25,naive_vector,recency_window,hybrid_rrf,hyde,persona_profile,reflection,raptor,karpathy_llm_wiki' \
    --cost-cap 3 --top-k 5 --seed $SEED
done

# Vendor SDK strategies
pip install 'memory-arena[mem0,graphiti,cognee,langmem,memori]'
for SEED in 0 1 2; do
  memory-arena benchmark --corpus longmemeval-s \
    --strategy 'mem0,graphiti,langmem,memori' \
    --cost-cap 3 --top-k 5 --seed $SEED
done

# Aggregate to per-strategy summaries with 95% CIs
python scripts/aggregate_bootstrap.py

# Render the README headline table from those summaries
python scripts/render_readme.py

# Build the hero chart from the same summaries
python scripts/build_hero_chart.py

# Launch the dashboard
memory-arena serve

Every result JSON is stamped with the commit SHA, installed package versions, model IDs, host info, and seed under metadata. If your numbers differ from the published table, post the diff and we'll bisect.

Why this exists

A simple chart someone can send when asked "which memory SDK should I pick?" Vendor benchmarks are vendor-tuned, academic benchmarks are corpus-tuned, and the rankings flip every quarter. Memory Arena is the apples-to-apples version: same lifecycle, same eval, same configs, every strategy run end-to-end with real LLM calls.

Bring your own corpus

Memory Arena reads any chat-session corpus that fits the schema:

class Session(BaseModel):
    id: str
    user_id: str
    timestamp: str | None
    turns: list[Turn]

class Turn(BaseModel):
    id: str
    session_id: str
    role: str   # "user" | "assistant" | "system"
    content: str
    timestamp: str | None

Drop normalized JSONL into datasets/<your-corpus>/processed/sessions.jsonl and YAML question files into datasets/<your-corpus>/questions/smoke/.

Project structure

memory_arena/strategies/, 20 strategies, all subclass MemoryStrategy
memory_arena/sessions/, corpus loaders (LongMemEval today)
memory_arena/benchmark/, runner, evaluator, recall_metrics, recall_lab
memory_arena/llm/, dual-model LLM client (Haiku/Sonnet/Opus, anthropic+openai providers)
memory_arena/chatbot/api.py, FastAPI dashboard server, mounts the Next.js static bundle
memory_arena/data/, bundled smoke corpus + result snapshot for pip install users
memory_arena/static/, built Next.js dashboard, shipped inside the wheel
memory_arena/paths.py, bundled-then-local data resolution (overridable with MEM_ARENA_DATASETS_PATH)
scripts/build_hero_chart.py, generate docs/hero.png from bootstrap summaries
scripts/build_taxonomy_chart.py, generate docs/taxonomy.png (2D design-space placement)
scripts/build_pairwise_chart.py, generate docs/pairwise.png (significance heatmap)
scripts/build_per_question_comparison.py, regenerate docs/per-question-comparison.md
scripts/build_social_preview.py, generate docs/social-preview.png (1280×640 GitHub social card)
scripts/aggregate_bootstrap.py, aggregate _seed{N}.json files into _summary.json
scripts/render_readme.py, rewrite the README headline table from summaries
scripts/cross_judge.py, re-grade with GPT-4o, compute Spearman ρ vs Opus
scripts/robustness.py, gen × judge 2×2 sweep (v0.1.6 deliverable)
scripts/build_showcase_chart.py, regenerate docs/showcase.png (sorted-bar leaderboard hero)
scripts/build_reasoning_gap_chart.py, regenerate docs/reasoning-gap.png (recall@5 vs accuracy, the rule-of-thumb proof)
web/, Next.js 14 dashboard source (cd web && npx next build && cp -R out/* ../memory_arena/static/)
tests/: 362 tests, mock-based, no live API calls

Conventions

Functions: snake_case
Classes: PascalCase
Models: Pydantic v2 BaseModel everywhere with ConfigDict(extra="forbid")
Config: pydantic-settings, all from environment with MEM_ARENA_ prefix
CLI: Typer + Rich
Async: every strategy method is async; the runner is a single asyncio.gather across strategies

Compose profiles

docker compose up -d neo4j postgres        # baseline (graphiti, memori backends)
docker compose --profile full up -d        # also brings up the api+web containers

MEM_ARENA_NEO4J_PASSWORD is required, compose refuses to start without it. Generate one with openssl rand -hex 16.

Limitations

Memori cloud quota. Memori 3.x routes its augmentation runtime through a cloud quota service that 429s anonymous IPs after a few requests. Set MEMORI_API_KEY for full throughput.
Full-context cost cap. full_context always hits the cost cap on the smoke subset; bump --cost-cap to 25+ to evaluate all 16 questions.
Statistical power. N=16 questions across 3 seeds yields a CI half-width of ~18 pp on accuracy. Top-of-table rankings are statistically tied; the gap to recency_window and bm25 is real. A 500-question variant lands in v0.2.
Single generator. Sonnet 4.6 runs the recall-step generation for every strategy that does not pin its own (vendor SDKs use their own internals). A robustness sweep across generators is implemented in scripts/robustness.py; results to be added in v0.2.
Single judge. Opus 4.7 grades every answer; a 19-way cross-judge with GPT-4o yields Spearman ρ = +0.967 on the ranks (same order, GPT-4o more lenient in absolute terms). See results/cross_judge_report.json.

Verify our numbers in 5 minutes

Don't trust the bundled snapshot, verify it. With OPENAI_API_KEY and ANTHROPIC_API_KEY exported and the corpus already ingested, run:

memory-arena benchmark --corpus longmemeval-s \
  --strategy 'naive_vector,bm25' --seed 0 --top-k 5 --cost-cap 1

Expected (~5 min wall, ~$0.50 spend, single seed):

Strategy	Accuracy	Recall@5
`naive_vector`	49% ±5	87% ±5
`bm25`	34% ±5	79% ±5

If your numbers fall outside that envelope, please open an issue with the result JSON attached, we'll bisect.

Vendors: PR your tuned config

The table reports each vendor at its documented default. If your SDK ships with a recommended config that beats the default, open a PR against memory_arena/strategies/<vendor>.py with:

The config delta (new SDK call args)
A link to the vendor doc page that recommends those defaults
A diff between the old and new results/longmemeval-s_<vendor>_summary.json
The reproduction command

We re-run the bench against the new config and merge if the gain is real and reproducible. The PR template walks through every required field: see .github/PULL_REQUEST_TEMPLATE.md.

FAQ

Common objections (small N, single judge, vendor defaults vs tuned, why memori is at 1%, etc.) are answered in docs/FAQ.md. Read that before opening an issue, most of what you'd ask is already addressed there.

About the author

I'm Xavier Puspus, an AI engineering lead. I built Memory Arena because I needed to choose a memory store for an agent at work and could not find a single comparison that ran the same eval against the same corpus across all the vendor SDKs. Vendor blog posts compared themselves to ChatGPT memory; academic papers compared to GPT-3.5. Nothing compared the things you'd actually pick between.

This is what that comparison looks like. Numbers will move as I add the full LongMemEval corpus, cross-judges, and tuned vendor configs in v0.2 - the methodology is open, the result JSONs are stamped, and PRs from vendors are explicitly invited.

Strategies and benchmarks not yet covered

The arena is intentionally narrow at v0.1.8. These are the directions queued for v0.2 and beyond, with paper / repo links so readers can follow the source:

Letta (formerly MemGPT, sleep-time compute), was prototyped and removed from v0.1.5 due to slow per-step context loop; see commit 4ccb115. Worth re-evaluating with their April-2026 sleep-time-compute changes.
Mem0+Graph (mem0g) is excluded from the leaderboard: mem0 v2.0.0 removed the OSS graph store, so it can only run on the deprecated v1 + gpt-4o-mini. Graph memory is covered by graphiti; the strategy stays in the repo for anyone pinning mem0 v1.
MemoryAgentBench (ICLR 2026), arxiv 2507.05257. Defines a 4-competency taxonomy (accurate retrieval, test-time learning, long-range understanding, conflict resolution) the field is converging on. v0.2 will map memory-arena's 7 axes onto it.

Roadmap

Tracked in detail in STATUS.md. Headline items for v0.2:

Tuned-mode runner that records vendor-recommended config for each system.
Live tests in tests/live/ for each vendor SDK.
Audit module retargeted as a memory-gap analyzer.
Arena ELO engine wired so the dashboard's leaderboard reflects actual matches.
Expand smoke corpus to full LongMemEval-S (500 questions). Restores the abstention category (currently absent from the v0.1.6 smoke subset) and tightens per-category CIs from N=4 to ~125 per category.
Investigate why Mem0 / Mem0g / Cognee extract little signal from haystack-style sessions; possibly retrofit a session-aware ingest formatter.
Add tests for the 9 retriever strategies.
Benjamini-Hochberg q-values for the pairwise matrix (paired-bootstrap groundwork landed; q-value column queued for v0.2).
Multi-generator robustness sweep (scripts/robustness.py) results published.

References

The strategies and methodology in Memory Arena build directly on prior work. The full machine-readable list is in CITATION.cff; the most load-bearing references are:

LongMemEval corpus. Wu, Wang, Yu, Zhang, Chang, Yu. "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory." ICLR 2025. arXiv:2410.10813. The chat-session corpus and 4-category question taxonomy used here.
LLM-as-judge methodology + bias. Zheng, Chiang, Sheng, Wu, Zhuang, Lin, Li, Li, Xing, Zhang, Gonzalez, Stoica. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." NeurIPS 2023. arXiv:2306.05685. Framework for quantifying judge bias; cited when interpreting the Opus 4.7 single-judge floor.
BM25. Robertson and Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval, 2009. Underlies the bm25 strategy and the lexical branch of hybrid_rrf.
Reciprocal Rank Fusion. Cormack, Clarke, Buettcher. "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods." ACM SIGIR 2009. Used by hybrid_rrf with k=60.
HyDE (Hypothetical Document Embeddings). Gao, Ma, Lin, Callan. "Precise Zero-Shot Dense Retrieval without Relevance Labels." 2022. arXiv:2212.10496. Used by hyde.
RAPTOR. Sarthi, Abdullah, Tuli, Khanna, Goldie, Manning. "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval." ICLR 2024. arXiv:2401.18059. Used by raptor.
Generative Agents (reflection memory). Park, O'Brien, Cai, Morris, Liang, Bernstein. "Generative Agents: Interactive Simulacra of Human Behavior." UIST 2023. arXiv:2304.03442. Pattern used by reflection.
A-MEM (Agentic Memory). Xu, Liang, Mei, Gao, Tan, Zhang. "A-MEM: Agentic Memory for LLM Agents." 2025. arXiv:2502.12110. Implemented as the amem strategy.
HippoRAG 2. Jimenez Gutierrez and Sun. "HippoRAG 2: Tightening Dense+Sparse Retrieval with Personalized PageRank for Episodic Memory." 2025. arXiv:2502.14802. Implemented as the hipporag2 strategy.
Karpathy's LLM Wiki. Karpathy. "LLM wiki" gist, 2024. https://gist.github.com/karpathy/.... Pattern implemented by karpathy_llm_wiki.
Bootstrap confidence intervals. Efron and Tibshirani. An Introduction to the Bootstrap. Chapman and Hall / CRC, 1993. The resampling procedure under the accuracy + paired-bootstrap CIs.

Cite

If you use Memory Arena in research or a blog post, cite via CITATION.cff. LongMemEval (the underlying corpus) should be cited separately:

@software{puspus2026memoryarena,
  title  = {Memory Arena: Apples-to-apples benchmark for agent-memory architectures},
  author = {Puspus, Xavier},
  year   = {2026},
  url    = {https://github.com/xmpuspus/memory-arena}
}

@inproceedings{wu2024longmemeval,
  title     = {LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory},
  author    = {Wu, Di and Wang, Hongwei and Yu, Wenhao and Zhang, Yuwei and
               Chang, Kai-Wei and Yu, Dong},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025},
  url       = {https://arxiv.org/abs/2410.10813}
}

License

MIT. Vendor SDKs are pinned per their own licenses. The bundled LongMemEval-S smoke subset is derived from xiaowu0162/LongMemEval (MIT, ICLR 2025).

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github		.github
cypher		cypher
datasets/longmemeval-s/questions/smoke_synthetic		datasets/longmemeval-s/questions/smoke_synthetic
docker		docker
docs		docs
memory_arena		memory_arena
pypi-placeholders		pypi-placeholders
results		results
scripts		scripts
tests		tests
web		web
.dockerignore		.dockerignore
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
STATUS.md		STATUS.md
demo-flow.yaml		demo-flow.yaml
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Memory Arena

Rule of thumb: which agent memory should you use?

What's interesting in the data

What this benchmark does NOT claim

Configurations

Methodology note: leveling mem0's generator

Statistical methodology

Real benchmark numbers

Sanity check vs the LongMemEval paper

Dig deeper

How memory-arena measures

The 7-axis evaluator

The 20 strategies

Pure-Python baselines and retrievers

Quantum rerankers (over naive_vector)

Vendor SDKs

Quick start (local)

Why this exists

Bring your own corpus

Project structure

Conventions

Compose profiles

Limitations

Verify our numbers in 5 minutes

Vendors: PR your tuned config

FAQ

About the author

Strategies and benchmarks not yet covered

Roadmap

References

Cite

License

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Quantum rerankers (over `naive_vector`)

Packages