ALMA-memory v0.9.0 — LongMemEval R@5=0.964 (#1 Open-Source)

@RBKunnela RBKunnela released this 13 Apr 14:43
· 33 commits to main since this release

Benchmark Champion: R@5 = 0.964 on LongMemEval

ALMA v0.9.0 achieves the highest LongMemEval score of any open-source AI agent memory system — R@5=0.964 on the full 500-question benchmark, using only local embeddings with zero API keys.

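| System       | R@5   | R@10  | MRR   | API Required       |
|--------------|-------|-------|-------|--------------------|
| ALMA v0.9.0  | 0.964 | 0.980 | 0.872 | No                 |
| Hindsight    | 0.914 | -     | -     | Yes (Gemini-3 Pro) |
| Zep/Graphiti | 0.638 | -     | -     | Yes                |
| Mem0         | 0.490 | -     | -     | No                 |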
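
For readers unfamiliar with the column headings, the sketch below shows one common way to compute R@k and MRR. It assumes a question counts as a hit when any gold item appears in the top k results; the benchmark runner may define recall differently, and the function names and toy data here are illustrative only, not ALMA's API.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Set[str], k: int) -> float:
    """Return 1.0 if any gold item appears in the top-k results, else 0.0."""
    return 1.0 if any(item in gold_ids for item in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: Sequence[str], gold_ids: Set[str]) -> float:
    """Return 1/rank of the first gold item, or 0.0 if none was retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in gold_ids:
            return 1.0 / rank
    return 0.0


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Toy data (hypothetical): two questions with their ranked results and gold sets.
runs = [
    (["s3", "s1", "s9"], {"s1"}),  # gold item found at rank 2
    (["s4", "s5", "s6"], {"s7"}),  # gold item never retrieved
]

r_at_5 = mean([recall_at_k(ranked, gold, 5) for ranked, gold in runs])
mrr = mean([reciprocal_rank(ranked, gold) for ranked, gold in runs])
print(r_at_5, mrr)
```

Both metrics are averaged over all questions, so a single question contributes at most 1/500 to the full-benchmark score.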

The Journey: R@5 0.236 → 0.964

This release fixes a critical retrieval ranking bug in which FAISS similarity scores were silently discarded in the pipeline. The four stages, from the broken baseline to the final fix:

  1. Baseline (R@5=0.236): FAISS found answers perfectly (R@50=1.0) but scores were thrown away — ranking was random
  2. FAISS ordering preserved: Storage layer now maintains FAISS distance order instead of re-sorting by SQL columns
  3. Pure similarity scoring (R@5=0.800): New BENCHMARK mode uses similarity=1.0 weights, eliminating recency/confidence noise
  4. Score propagation (R@5=0.964): Actual FAISS cosine similarity scores now flow end-to-end from storage through engine to scorer
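
The stages above can be illustrated with a small pure-Python sketch. It is not ALMA's code: the `Memory` class, the function names, and the toy vectors are hypothetical, and plain cosine similarity stands in for a FAISS index. It contrasts re-sorting hits by an unrelated column with carrying the similarity score through to a scorer whose weights put 1.0 on similarity.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Memory:
    id: str
    vector: List[float]
    confidence: float


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search_discarding_scores(query: List[float], memories: List[Memory]) -> List[Memory]:
    """Buggy pattern: nearest neighbours are found, then re-sorted by confidence."""
    hits = sorted(memories, key=lambda m: cosine(query, m.vector), reverse=True)
    return sorted(hits, key=lambda m: m.confidence, reverse=True)


def search_propagating_scores(query: List[float], memories: List[Memory]) -> List[Tuple[float, Memory]]:
    """Fixed pattern: each hit keeps its similarity score and similarity order."""
    scored = [(cosine(query, m.vector), m) for m in memories]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def final_score(similarity: float, confidence: float, weights: dict) -> float:
    return weights["similarity"] * similarity + weights["confidence"] * confidence


# Assumed weight set for a pure-similarity mode: only similarity counts.
BENCHMARK_WEIGHTS = {"similarity": 1.0, "confidence": 0.0}

memories = [
    Memory("a", [1.0, 0.0], confidence=0.2),  # exact match, low confidence
    Memory("b", [0.0, 1.0], confidence=0.9),  # unrelated, high confidence
    Memory("c", [0.7, 0.7], confidence=0.5),
]
query = [1.0, 0.0]

buggy_order = [m.id for m in search_discarding_scores(query, memories)]
fixed = search_propagating_scores(query, memories)
fixed_order = [m.id for _, m in fixed]
fixed_scores = [final_score(sim, m.confidence, BENCHMARK_WEIGHTS) for sim, m in fixed]
print(buggy_order, fixed_order)
```

In the buggy variant the best match ends up last because confidence decides the order; in the fixed variant the final score equals the similarity itself, so the ranking follows the vector search.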

Full technical analysis: BENCHMARK-REPORT.md

New Features

  • File/Conversation Ingestion (alma/ingestion/) — Ingest project files and chat exports in 6 formats: Claude Code JSONL, ChatGPT JSON, Claude.ai JSON, OpenAI Codex JSONL, Slack JSON, plain text. Adapted from MemPalace (MIT).
  • 4-Layer MemoryStack (alma/context/) — Token-efficient context loading: L0 Identity (~100 tok), L1 Essential Story (~800 tok), L2 On-Demand, L3 Deep Search. Wake-up cost: ~900 tokens.
  • Entity Detector (alma/extraction/entity_detector.py) — Regex-based person/project detection from text.
  • Temporal Graph Edges — valid_from/valid_to on Relationship dataclass + get_relationships_as_of() temporal queries.
  • Query Sanitizer (alma/retrieval/query_sanitizer.py) — Prevents system prompt contamination in search queries. Recovers R@10 from 1% to 80%+.
  • BENCHMARK and SIMILARITY retrieval modes — Pure cosine similarity scoring for benchmarks and raw retrieval.
  • LongMemEval Benchmark Runner (benchmarks/) — Full benchmark infrastructure. Reproduce results in 30 minutes.
  • Hybrid Search in Benchmarks — Vector + BM25 via Reciprocal Rank Fusion.
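
As a rough illustration of the hybrid search item, here is a minimal sketch of Reciprocal Rank Fusion in its standard form, where each document scores the sum of 1/(k + rank) over the input rankings. The constant k=60 is the conventional default and an assumption here; the value used in the ALMA benchmarks is not stated in these notes, and the function name and hit lists are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of ids into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Toy rankings (hypothetical): one from vector search, one from BM25.
vector_hits = ["m2", "m1", "m3"]
bm25_hits = ["m1", "m4", "m2"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Because RRF uses only ranks, it needs no normalization between cosine similarities and BM25 scores, which live on different scales.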

Critical Bug Fixes

  • retrieve_with_scope scope_filter passthrough — Scope filter was computed but never forwarded to storage backends. Workflow-scoped retrieval was returning ALL memories.
  • FAISS similarity scores discarded — Storage layer threw away distance scores, causing random ranking.
  • SQL ORDER BY destroying FAISS order — Storage re-sorted by confidence/timestamp instead of similarity.
  • Scorer receiving fake similarity=1.0 — Engine now extracts _faiss_similarity from metadata and passes to scorer.
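
The scope_filter passthrough bug follows a common pattern that a short sketch makes concrete. Everything below is hypothetical: the signatures, the dictionary-based memories, and the substring matching are stand-ins and do not reflect ALMA's actual retrieve_with_scope implementation or storage interface.

```python
from typing import Dict, List, Optional


def storage_search(memories: List[dict], query: str,
                   scope_filter: Optional[Dict[str, str]] = None) -> List[dict]:
    """Stand-in storage backend: substring match plus an optional scope filter."""
    hits = [m for m in memories if query in m["text"]]
    if scope_filter is not None:
        hits = [m for m in hits
                if all(m.get(key) == value for key, value in scope_filter.items())]
    return hits


def retrieve_with_scope_buggy(memories: List[dict], query: str, workflow_id: str) -> List[dict]:
    scope_filter = {"workflow_id": workflow_id}  # computed ...
    return storage_search(memories, query)       # ... but never forwarded


def retrieve_with_scope_fixed(memories: List[dict], query: str, workflow_id: str) -> List[dict]:
    scope_filter = {"workflow_id": workflow_id}
    return storage_search(memories, query, scope_filter=scope_filter)


memories = [
    {"text": "deploy checklist", "workflow_id": "wf-1"},
    {"text": "deploy rollback", "workflow_id": "wf-2"},
]

buggy_hits = retrieve_with_scope_buggy(memories, "deploy", "wf-1")
fixed_hits = retrieve_with_scope_fixed(memories, "deploy", "wf-1")
print(len(buggy_hits), len(fixed_hits))
```

The buggy variant returns memories from every workflow without raising any error, which is why this class of bug is easy to miss without a scoped test.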

Per-Question-Type Results (500 questions)

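| Type                      | R@5   | R@10  | Count |
|---------------------------|-------|-------|-------|
| knowledge-update          | 1.000 | 1.000 | 78    |
| multi-session             | 0.992 | 1.000 | 133   |
| single-session-preference | 0.967 | 0.967 | 30    |
| temporal-reasoning        | 0.947 | 0.970 | 133   |
| single-session-assistant  | 0.946 | 0.946 | 56    |
| single-session-user       | 0.914 | 0.971 | 70    |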
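
As a consistency check, the per-type figures above can be combined into a count-weighted average and compared with the headline numbers. The sketch uses only the values from the table; because those values are already rounded to three decimals, the result is approximate.

```python
# (R@5, R@10, question count) per question type, copied from the table above.
per_type = {
    "knowledge-update": (1.000, 1.000, 78),
    "multi-session": (0.992, 1.000, 133),
    "single-session-preference": (0.967, 0.967, 30),
    "temporal-reasoning": (0.947, 0.970, 133),
    "single-session-assistant": (0.946, 0.946, 56),
    "single-session-user": (0.914, 0.971, 70),
}

total = sum(count for _, _, count in per_type.values())
overall_r5 = sum(r5 * count for r5, _, count in per_type.values()) / total
overall_r10 = sum(r10 * count for _, r10, count in per_type.values()) / total

print(total, round(overall_r5, 3), round(overall_r10, 3))
```

The counts sum to 500, and the weighted averages round to 0.964 and 0.980, matching the overall R@5 and R@10 reported for the full benchmark.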

Documentation

  • 4 Excalidraw architecture diagrams (retrieval pipeline, MemoryStack, learning loop, benchmark comparison)
  • Feature docs for ingestion, MemoryStack, entity detection, benchmarks
  • Architecture design document for retrieval pipeline fix
  • Full benchmark report with technical methodology

Stats

  • 1,854 tests passing (222 new)
  • 52,000+ lines of production Python
  • 7 storage backends, 4 graph backends, 22 MCP tools
  • Zero API keys required for core functionality

Install

pip install alma-memory                  # Core
pip install "alma-memory[local]"         # + SQLite + FAISS + local embeddings
pip install "alma-memory[all]"           # Everything

Reproduce the Benchmark

pip install "alma-memory[local]" sentence-transformers
curl -fsSL -o /tmp/longmemeval.json \
  https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned/resolve/main/longmemeval_s_cleaned.json
python -m benchmarks.longmemeval.runner --data /tmp/longmemeval.json

Full changelog: CHANGELOG.md