ALMA-memory v0.9.0 — LongMemEval R@5=0.964 (#1 Open-Source)
Benchmark Champion: R@5 = 0.964 on LongMemEval
ALMA v0.9.0 achieves the highest LongMemEval score of any open-source AI agent memory system — R@5=0.964 on the full 500-question benchmark, using only local embeddings with zero API keys.
| System | R@5 | R@10 | MRR | API Required |
|---|---|---|---|---|
| ALMA v0.9.0 | 0.964 | 0.980 | 0.872 | No |
| Hindsight | 0.914 | — | — | Yes (Gemini-3 Pro) |
| Zep/Graphiti | 0.638 | — | — | Yes |
| Mem0 | 0.490 | — | — | No |
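For readers unfamiliar with the column headers, here is a minimal sketch of how R@k and MRR are commonly computed for retrieval benchmarks. It assumes the "any relevant session in the top k" definition of recall; the benchmark runner in this release may define it differently, and the function names and toy data below are illustrative, not part of ALMA.

```python
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return 1.0 if any relevant item is in the top k results, else 0.0."""
    return 1.0 if any(item in relevant_ids for item in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[Sequence[str], Set[str]]]) -> Dict[str, float]:
    """Average the per-question metrics over all questions."""
    n = len(runs)
    return {
        "R@5": sum(recall_at_k(r, rel, 5) for r, rel in runs) / n,
        "R@10": sum(recall_at_k(r, rel, 10) for r, rel in runs) / n,
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Toy data (hypothetical): ranked session IDs and the sessions holding the answer.
runs = [
    (["s3", "s1", "s7"], {"s1"}),                    # first hit at rank 2
    (["s2", "s4"], {"s2"}),                          # first hit at rank 1
    (["s9", "s8", "s5", "s6", "s0", "s1"], {"s1"}),  # first hit at rank 6
]
metrics = evaluate(runs)
print(metrics)
```

In the toy data the third question only counts toward R@10, since its first relevant session sits at rank 6.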
The Journey: R@5 0.236 → 0.964
This release fixes a critical retrieval ranking bug where FAISS similarity scores were silently discarded in the pipeline. The progression from the broken baseline to the final fix, in four stages:
- Baseline (R@5=0.236): FAISS found answers perfectly (R@50=1.0) but scores were thrown away — ranking was random
- FAISS ordering preserved: Storage layer now maintains FAISS distance order instead of re-sorting by SQL columns
- Pure similarity scoring (R@5=0.800): New BENCHMARK mode uses similarity=1.0 weights, eliminating recency/confidence noise
- Score propagation (R@5=0.964): Actual FAISS cosine similarity scores now flow end-to-end from storage through engine to scorer
Full technical analysis: BENCHMARK-REPORT.md
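The effect of the stages above can be illustrated with a small, self-contained sketch. It is not ALMA's scorer: the `Hit` dataclass, the weight dictionaries, and the mixed-weight values are hypothetical, chosen only to show how discarded similarity scores or recency/confidence weighting can push the correct memory out of the top positions.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    memory_id: str
    similarity: float  # cosine similarity reported by the vector index
    confidence: float  # 0..1
    recency: float     # 0..1, where 1 is the newest memory


# Pure-similarity weights, as described for BENCHMARK mode.
BENCHMARK_WEIGHTS = {"similarity": 1.0, "confidence": 0.0, "recency": 0.0}
# Hypothetical mixed weights, for contrast only.
MIXED_WEIGHTS = {"similarity": 0.4, "confidence": 0.3, "recency": 0.3}


def score(hit: Hit, weights: Dict[str, float]) -> float:
    """Weighted sum of the three signals."""
    return (weights["similarity"] * hit.similarity
            + weights["confidence"] * hit.confidence
            + weights["recency"] * hit.recency)


def rank(hits: List[Hit], weights: Dict[str, float]) -> List[str]:
    """Sort by descending score; ties keep their input order (stable sort)."""
    ordered = sorted(hits, key=lambda h: score(h, weights), reverse=True)
    return [h.memory_id for h in ordered]


hits = [
    Hit("answer", similarity=0.91, confidence=0.50, recency=0.10),
    Hit("distractor_a", similarity=0.40, confidence=0.95, recency=0.90),
    Hit("distractor_b", similarity=0.35, confidence=0.80, recency=1.00),
]

# Real similarity scores with pure-similarity weights: the answer ranks first.
propagated = rank(hits, BENCHMARK_WEIGHTS)

# Mixed weights: confidence and recency outweigh the similarity gap.
mixed = rank(hits, MIXED_WEIGHTS)

# Scores discarded: every hit gets similarity=1.0, so all scores tie and the
# ranking simply mirrors whatever order the storage layer returned.
storage_order = list(reversed(hits))
discarded = rank([replace(h, similarity=1.0) for h in storage_order],
                 BENCHMARK_WEIGHTS)

print(propagated, mixed, discarded)
```

With a placeholder similarity of 1.0 the scorer has nothing to discriminate on, which is consistent with the near-random baseline ranking reported above.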
New Features
- File/Conversation Ingestion (`alma/ingestion/`): Ingests project files and chat exports in 6 formats (Claude Code JSONL, ChatGPT JSON, Claude.ai JSON, OpenAI Codex JSONL, Slack JSON, plain text). Adapted from MemPalace (MIT).
- 4-Layer MemoryStack (`alma/context/`): Token-efficient context loading across L0 Identity (~100 tok), L1 Essential Story (~800 tok), L2 On-Demand, and L3 Deep Search. Wake-up cost: ~900 tokens.
- Entity Detector (`alma/extraction/entity_detector.py`): Regex-based person/project detection from text.
- Temporal Graph Edges: `valid_from`/`valid_to` on the Relationship dataclass, plus `get_relationships_as_of()` temporal queries.
- Query Sanitizer (`alma/retrieval/query_sanitizer.py`): Prevents system prompt contamination in search queries. Recovers R@10 from 1% to 80%+.
- BENCHMARK and SIMILARITY retrieval modes: Pure cosine similarity scoring for benchmarks and raw retrieval.
- LongMemEval Benchmark Runner (`benchmarks/`): Full benchmark infrastructure. Reproduce results in 30 minutes.
- Hybrid Search in Benchmarks: Vector + BM25 via Reciprocal Rank Fusion.
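Reciprocal Rank Fusion, mentioned in the hybrid search item, merges several ranked lists by summing 1/(k + rank) for each document across the lists. The sketch below shows the general technique only; the function name is illustrative, and the constant k=60 is the value conventionally used in the literature, assumed here rather than taken from ALMA's code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break exact ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a BM25 keyword search.
vector_ranking = ["m1", "m2", "m3"]
bm25_ranking = ["m3", "m1", "m4"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)
```

Documents that appear in both lists (`m1` and `m3` here) accumulate two contributions and rise above documents found by only one retriever, without requiring the two scoring scales to be comparable.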
Critical Bug Fixes
- `retrieve_with_scope` `scope_filter` passthrough: Scope filter was computed but never forwarded to storage backends. Workflow-scoped retrieval was returning ALL memories.
- FAISS similarity scores discarded: Storage layer threw away distance scores, causing random ranking.
- SQL ORDER BY destroying FAISS order: Storage re-sorted by confidence/timestamp instead of similarity.
- Scorer receiving fake similarity=1.0: Engine now extracts `_faiss_similarity` from metadata and passes it to the scorer.
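The scope filter bug follows a common pattern: a filter is built but never handed to the layer that applies it. The sketch below is a toy reconstruction of that pattern, not ALMA's code. Only the names `retrieve_with_scope` and `scope_filter` come from these notes; `InMemoryStorage`, its `search` signature, and the `workflow_id` field are hypothetical.

```python
from typing import Dict, List, Optional


class InMemoryStorage:
    """Hypothetical storage backend with an optional metadata filter."""

    def __init__(self, memories: List[Dict[str, str]]) -> None:
        self._memories = memories

    def search(self, query: str,
               scope_filter: Optional[Dict[str, str]] = None) -> List[Dict[str, str]]:
        results = [m for m in self._memories if query.lower() in m["text"].lower()]
        if scope_filter:
            results = [m for m in results
                       if all(m.get(key) == value for key, value in scope_filter.items())]
        return results


def retrieve_with_scope_buggy(storage: InMemoryStorage, query: str,
                              workflow_id: str) -> List[Dict[str, str]]:
    scope_filter = {"workflow_id": workflow_id}  # computed...
    return storage.search(query)                 # ...but never forwarded


def retrieve_with_scope(storage: InMemoryStorage, query: str,
                        workflow_id: str) -> List[Dict[str, str]]:
    scope_filter = {"workflow_id": workflow_id}
    return storage.search(query, scope_filter=scope_filter)  # forwarded


storage = InMemoryStorage([
    {"text": "Deploy uses blue-green rollout", "workflow_id": "wf-a"},
    {"text": "Deploy requires manual approval", "workflow_id": "wf-b"},
])

leaked = retrieve_with_scope_buggy(storage, "deploy", "wf-a")
scoped = retrieve_with_scope(storage, "deploy", "wf-a")
print(len(leaked), len(scoped))
```

The buggy variant returns memories from every workflow because the backend never sees the filter, which matches the symptom described in the first item above.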
Per-Question-Type Results (500 questions)
| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 1.000 | 1.000 | 78 |
| multi-session | 0.992 | 1.000 | 133 |
| single-session-preference | 0.967 | 0.967 | 30 |
| temporal-reasoning | 0.947 | 0.970 | 133 |
| single-session-assistant | 0.946 | 0.946 | 56 |
| single-session-user | 0.914 | 0.971 | 70 |
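As a consistency check, the headline numbers can be recovered from the table above by weighting each per-type score by its question count. The snippet only re-uses the values printed in the table; the variable names are illustrative.

```python
# (R@5, R@10, question count) per question type, copied from the table above.
per_type = {
    "knowledge-update": (1.000, 1.000, 78),
    "multi-session": (0.992, 1.000, 133),
    "single-session-preference": (0.967, 0.967, 30),
    "temporal-reasoning": (0.947, 0.970, 133),
    "single-session-assistant": (0.946, 0.946, 56),
    "single-session-user": (0.914, 0.971, 70),
}

total_questions = sum(count for _, _, count in per_type.values())

# Count-weighted averages across question types.
overall_r5 = sum(r5 * count for r5, _, count in per_type.values()) / total_questions
overall_r10 = sum(r10 * count for _, r10, count in per_type.values()) / total_questions

# Approximate number of questions answered within the top 5, per type.
hits_at_5 = sum(round(r5 * count) for r5, _, count in per_type.values())

print(total_questions, round(overall_r5, 3), round(overall_r10, 3), hits_at_5)
```

The counts sum to 500, and the weighted averages round to the reported R@5 of 0.964 and R@10 of 0.980.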
Documentation
- 4 Excalidraw architecture diagrams (retrieval pipeline, MemoryStack, learning loop, benchmark comparison)
- Feature docs for ingestion, MemoryStack, entity detection, benchmarks
- Architecture design document for retrieval pipeline fix
- Full benchmark report with technical methodology
Stats
- 1,854 tests passing (222 new)
- 52,000+ lines of production Python
- 7 storage backends, 4 graph backends, 22 MCP tools
- Zero API keys required for core functionality
Install
pip install alma-memory # Core
pip install "alma-memory[local]" # + SQLite + FAISS + local embeddings
pip install "alma-memory[all]" # Everything
Reproduce the Benchmark
pip install "alma-memory[local]" sentence-transformers
curl -fsSL -o /tmp/longmemeval.json \
https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
python -m benchmarks.longmemeval.runner --data /tmp/longmemeval.json
Full changelog: CHANGELOG.md