Release ALMA-memory v0.9.0 — LongMemEval R@5=0.964 (#1 Open-Source) · RBKunnela/ALMA-memory

Benchmark Champion: R@5 = 0.964 on LongMemEval

ALMA v0.9.0 achieves the highest LongMemEval score of any open-source AI agent memory system — R@5=0.964 on the full 500-question benchmark, using only local embeddings with zero API keys.

System	R@5	R@10	MRR	API Required
ALMA v0.9.0	0.964	0.980	0.872	No
Hindsight	0.914	—	—	Yes (Gemini-3 Pro)
Zep/Graphiti	0.638	—	—	Yes
Mem0	0.490	—	—	No
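
For readers unfamiliar with the column headers, the sketch below shows one common way to compute R@k and MRR. It is a generic illustration, not the ALMA benchmark runner: the function names and toy data are hypothetical, and it assumes a question counts as a hit at k when any gold item appears in the top k results.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Return 1.0 if any gold item appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in gold_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: Set[str]) -> float:
    """Return 1/rank of the first gold item (1-indexed), or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in gold_ids:
            return 1.0 / rank
    return 0.0


# Toy data (hypothetical): each entry is (ranked result IDs, gold IDs) for one question.
runs = [
    (["s3", "s1", "s7"], {"s1"}),        # gold item at rank 2
    (["s4", "s9", "s2", "s8"], {"s8"}),  # gold item at rank 4
]

# Average the per-question values over all questions.
r_at_2 = sum(recall_at_k(ranked, gold, 2) for ranked, gold in runs) / len(runs)
mrr = sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)
print(r_at_2, mrr)
```

In this toy run only the first question has its gold item in the top 2, while both questions contribute to MRR through the rank of their first correct result.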

The Journey: R@5 0.236 → 0.964

This release fixes a critical retrieval ranking bug: FAISS similarity scores were silently discarded in the pipeline. The progression, in four stages:

Baseline (R@5=0.236): FAISS retrieved the correct answers (R@50=1.0), but its scores were thrown away, so the ranking among retrieved candidates was effectively random
FAISS ordering preserved: The storage layer now maintains FAISS distance order instead of re-sorting by SQL columns
Pure similarity scoring (R@5=0.800): The new BENCHMARK mode weights similarity at 1.0, eliminating recency/confidence noise
Score propagation (R@5=0.964): Actual FAISS cosine similarity scores now flow end-to-end from storage through the engine to the scorer
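
The effect of discarding similarity scores can be illustrated with a minimal sketch. This is not ALMA's code: the `Hit` class, both ranking functions, and the toy values are hypothetical, and the "buggy" path simply re-sorts by a stored confidence column to mimic the SQL re-sort described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    memory_id: str
    similarity: float  # cosine similarity reported by the vector index
    confidence: float  # stored column that is unrelated to the query


def rank_buggy(hits: List[Hit], k: int) -> List[str]:
    """Mimic the bug: ignore similarity and re-sort by a stored column."""
    ordered = sorted(hits, key=lambda h: h.confidence, reverse=True)
    return [h.memory_id for h in ordered[:k]]


def rank_fixed(hits: List[Hit], k: int) -> List[str]:
    """Keep the similarity score and rank by it alone (pure-similarity scoring)."""
    ordered = sorted(hits, key=lambda h: h.similarity, reverse=True)
    return [h.memory_id for h in ordered[:k]]


# Hypothetical candidate set: the answer is retrieved, so recall over the
# full candidate set is perfect, but its stored confidence is the lowest.
hits = [
    Hit("m_answer", similarity=0.91, confidence=0.40),
    Hit("m_a", similarity=0.35, confidence=0.95),
    Hit("m_b", similarity=0.30, confidence=0.90),
    Hit("m_c", similarity=0.22, confidence=0.85),
]

print(rank_buggy(hits, k=2))  # the answer is pushed out of the top 2
print(rank_fixed(hits, k=2))  # the answer is ranked first
```

The candidate set is identical in both cases; only the ordering differs, which is why recall at a large cutoff can be perfect while recall at a small cutoff is poor.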

Full technical analysis: BENCHMARK-REPORT.md

New Features

File/Conversation Ingestion (alma/ingestion/) — Ingest project files and chat exports in 6 formats: Claude Code JSONL, ChatGPT JSON, Claude.ai JSON, OpenAI Codex JSONL, Slack JSON, plain text. Adapted from MemPalace (MIT).
4-Layer MemoryStack (alma/context/) — Token-efficient context loading: L0 Identity (~100 tok), L1 Essential Story (~800 tok), L2 On-Demand, L3 Deep Search. Wake-up cost: ~900 tokens.
Entity Detector (alma/extraction/entity_detector.py) — Regex-based person/project detection from text.
Temporal Graph Edges — valid_from/valid_to on Relationship dataclass + get_relationships_as_of() temporal queries.
Query Sanitizer (alma/retrieval/query_sanitizer.py) — Prevents system prompt contamination in search queries. Recovers R@10 from 1% to 80%+.
BENCHMARK and SIMILARITY retrieval modes — Pure cosine similarity scoring for benchmarks and raw retrieval.
LongMemEval Benchmark Runner (benchmarks/) — Full benchmark infrastructure. Reproduce results in 30 minutes.
Hybrid Search in Benchmarks — Vector + BM25 via Reciprocal Rank Fusion.
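
Reciprocal Rank Fusion, mentioned in the last item above, merges several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is a generic version of that technique under stated assumptions: the function name and sample IDs are hypothetical, and k = 60 is a commonly used constant, not a value confirmed for ALMA.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["m2", "m1", "m3"]  # hypothetical vector-search order
bm25_ranking = ["m1", "m4", "m2"]    # hypothetical BM25 order
fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)
```

Because only ranks are used, the two retrievers do not need comparable score scales, which is the usual reason for choosing this fusion method over a weighted sum of raw scores.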

Critical Bug Fixes

retrieve_with_scope scope_filter passthrough — Scope filter was computed but never forwarded to storage backends. Workflow-scoped retrieval was returning ALL memories.
FAISS similarity scores discarded — Storage layer threw away distance scores, causing random ranking.
SQL ORDER BY destroying FAISS order — Storage re-sorted by confidence/timestamp instead of similarity.
Scorer receiving fake similarity=1.0 — The scorer was given a placeholder similarity of 1.0 instead of the real score. The engine now extracts _faiss_similarity from metadata and passes it to the scorer.
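
The scope filter bug in the first item follows a common pattern: a value is computed and then never passed on. The sketch below shows that pattern in isolation. Everything in it is hypothetical (the `InMemoryStorage` class, the `workflow_id` key, and both retrieve functions) and it does not reproduce the real `retrieve_with_scope` signature.

```python
from typing import Dict, List, Optional

Memory = Dict[str, str]


class InMemoryStorage:
    """Hypothetical storage backend holding memories tagged with a workflow ID."""

    def __init__(self, memories: List[Memory]) -> None:
        self._memories = memories

    def search(self, query: str, scope_filter: Optional[Dict[str, str]] = None) -> List[Memory]:
        # The query text is ignored here; only the filtering behaviour is illustrated.
        if not scope_filter:
            return list(self._memories)
        return [
            m for m in self._memories
            if all(m.get(key) == value for key, value in scope_filter.items())
        ]


def retrieve_buggy(storage: InMemoryStorage, query: str, workflow_id: str) -> List[Memory]:
    scope_filter = {"workflow_id": workflow_id}  # computed ...
    return storage.search(query)                 # ... but never forwarded


def retrieve_fixed(storage: InMemoryStorage, query: str, workflow_id: str) -> List[Memory]:
    scope_filter = {"workflow_id": workflow_id}
    return storage.search(query, scope_filter=scope_filter)


storage = InMemoryStorage([
    {"id": "m1", "workflow_id": "wf-a"},
    {"id": "m2", "workflow_id": "wf-b"},
    {"id": "m3", "workflow_id": "wf-a"},
])
print(len(retrieve_buggy(storage, "deploy steps", "wf-b")))  # every memory comes back
print(len(retrieve_fixed(storage, "deploy steps", "wf-b")))  # only the wf-b memory
```

A regression test that asserts on the number of results for a scoped query, as in the last two lines, is usually enough to catch this class of bug.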

Per-Question-Type Results (500 questions)

Type	R@5	R@10	Count
knowledge-update	1.000	1.000	78
multi-session	0.992	1.000	133
single-session-preference	0.967	0.967	30
temporal-reasoning	0.947	0.970	133
single-session-assistant	0.946	0.946	56
single-session-user	0.914	0.971	70
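
As a consistency check, the per-type figures above can be combined into a count-weighted average and compared with the headline numbers. The snippet only re-uses the values from the table; because the per-type scores are rounded to three decimals, the result is approximate.

```python
# (question type, R@5, R@10, count) copied from the table above.
rows = [
    ("knowledge-update", 1.000, 1.000, 78),
    ("multi-session", 0.992, 1.000, 133),
    ("single-session-preference", 0.967, 0.967, 30),
    ("temporal-reasoning", 0.947, 0.970, 133),
    ("single-session-assistant", 0.946, 0.946, 56),
    ("single-session-user", 0.914, 0.971, 70),
]

total = sum(count for _, _, _, count in rows)
# Weight each type's score by its question count, then divide by the total.
overall_r5 = sum(r5 * count for _, r5, _, count in rows) / total
overall_r10 = sum(r10 * count for _, _, r10, count in rows) / total

print(total, round(overall_r5, 3), round(overall_r10, 3))
```

The counts sum to 500, and the weighted averages round to the overall R@5 of 0.964 and R@10 of 0.980 reported in the comparison table.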

Documentation

4 Excalidraw architecture diagrams (retrieval pipeline, MemoryStack, learning loop, benchmark comparison)
Feature docs for ingestion, MemoryStack, entity detection, benchmarks
Architecture design document for retrieval pipeline fix
Full benchmark report with technical methodology

Stats

1,854 tests passing (222 new)
52,000+ lines of production Python
7 storage backends, 4 graph backends, 22 MCP tools
Zero API keys required for core functionality

Install

pip install alma-memory                  # Core
pip install "alma-memory[local]"         # + SQLite + FAISS + local embeddings (quoted so shells such as zsh do not expand the brackets)
pip install "alma-memory[all]"           # Everything

Reproduce the Benchmark

pip install "alma-memory[local]" sentence-transformers
curl -fsSL -o /tmp/longmemeval.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
python -m benchmarks.longmemeval.runner --data /tmp/longmemeval.json

Full changelog: CHANGELOG.md
