MEME Benchmark — Reference Implementation

MEME (Multi-Entity and Evolving Memory Evaluation) is a benchmark that evaluates LLM memory systems along the multi-entity and evolving axes. It defines six tasks including three types absent in prior memory benchmarks (Cascade and Absence for dependency reasoning, and Deletion for post-removal state), alongside Tracking, Aggregation, and Exact Recall. This repo contains the reference implementations of six memory systems, the GPT-4o judge pipeline, and the dataset construction pipeline.

Examples of the six MEME task types across three categories: Retrieval (Exact Recall, Aggregation; merging the Single-Fact and Multi-Fact Retrieval quadrants of the taxonomy), State Management (Tracking, Deletion), and Dependency Reasoning (Cascade, Absence).

Layout

code/
  agents/                    # 6 memory system implementations + base interface
  eval/                      # run_agent.py, judge.py, golden_memory.py, ...
  data/                      # dataset construction pipeline + entity pools, dependency maps
  dataset_tools/             # build_dataset.py, unpack_dataset.py
  scripts/                   # bash orchestration (setup, neo4j cluster, main run)
  requirements.txt           # core deps
  requirements/              # extras for graphiti, mem0
  docker-compose.yml         # neo4j + qdrant
  .env.example
dataset/
  README.md                  # HF dataset card
LICENSE                      # MIT

Quick start

1. Get the dataset

python3 -c "
from huggingface_hub import hf_hub_download
for name in ['meme_filler32k.json', 'meme_filler128k.json', 'meme_nofiller.json']:
    hf_hub_download('meme-benchmark/MEME', name, repo_type='dataset', local_dir='dataset')
"

Expand the flat HF JSON into per-episode files for the eval runner:

python3 code/dataset_tools/unpack_dataset.py --input dataset/meme_filler32k.json --output code/data
# writes code/data/filler32k_pl/episode_NNN.json and code/data/filler32k_sw/...

2. Set up environments

Requires Python 3.12 or newer (Karpathy's claude-memory-compiler pins >=3.12; graphiti-core and mem0ai pin >=3.10). Three isolated Python venvs (main + Graphiti + Mem0 to avoid version conflicts):

PYBIN=python3.12 ./code/scripts/setup_venvs.sh   # or python3.13

Backing services for Graphiti (Neo4j) and Mem0 (Qdrant):

docker compose -f code/docker-compose.yml up -d

The Karpathy Wiki agent wraps coleam00/claude-memory-compiler; clone it once at the expected path (only needed if running --agent-type karpathy):

git clone https://github.com/coleam00/claude-memory-compiler code/.deps/claude-memory-compiler

API keys:

cp code/.env.example code/.env
# Set OPENAI_API_KEY, ANTHROPIC_API_KEY, optionally OPENROUTER_API_KEY

3. Run an agent + judge

# agent
python3 code/eval/run_agent.py -d code/data/filler32k_pl \
    --agent-type bm25 --model gpt-4.1-mini -o output/bm25

# judge
python3 code/eval/judge.py -d output/bm25 -o output/bm25/judge

The full main-table run is ./code/scripts/run_main_experiment.sh (six systems × two domains).

Reproducing paper experiments

Paper artifact	Section	Script
`tab:main_results` (main table)	4.2	`code/scripts/run_main_experiment.sh`
`tab:topk-sweep` (top-$k$ sweep, BM25 + dense + Mem0)	4.3	`code/scripts/run_topk_sweep.sh`
`tab:llm-ablation` (a) — answer-LLM swap	4.3	`code/scripts/run_answer_llm_swap.sh`
`tab:llm-ablation` (b) — internal-LLM swap	4.3	`code/scripts/run_internal_llm_swap.sh`
`fig:noise-detail` — noise robustness	4.3	`code/scripts/run_noise_ablation.sh`
`tab:sd` — multi-seed stability	4.3 / App.	`code/scripts/run_multi_seed.sh`
In-context baseline (Table 2 top rows)	4.2	`code/scripts/run_in_context_baseline.sh`
In-context ceiling (gold facts only)	4.1	`python -m eval.golden_memory -d data/filler32k_pl --model <answer-llm>`
SIMBA prompt opt. (MD-flat)	4.3	`code/scripts/run_simba_mdflat.sh`
SIMBA prompt opt. (Karpathy)	4.3	`code/scripts/run_simba_karpathy.sh`
SIMBA prompt opt. (Graphiti)	4.3	`code/scripts/run_simba_graphiti.sh`
SIMBA prompt opt. (Mem0)	4.3	`code/scripts/run_simba_mem0.sh`

The SIMBA experiments live in code/simba/{mdflat,karpathy,graphiti,mem0}/ with a per-system isolated venv (DSPy / agent-library version conflicts forbid a shared environment). See code/simba/README.md for prerequisites and cost expectations.

Every main / ablation script honors DATA_DIR, OUT, WORKERS, ANSWER_MODEL, INTERNAL_MODEL env vars; see each file's header for the defaults that match the paper config.

After any agent run, judge with:

source .venvs/main_env/bin/activate
python -m eval.judge -d <output_dir>/<system> -o <output_dir>/<system>/judge

Memory system requirements

System	Backing service	Internal LLM	Venv
BM25	(none)	(none)	main_env
Dense (text-embedding-3-small)	OpenAI embedding API	(none)	main_env
Mem0	Qdrant	gpt-4.1-mini	mem0_env
Graphiti	Neo4j	gpt-4.1-mini	graphiti_env
MD-flat	(none)	gpt-4.1-mini	main_env
Karpathy Wiki	(none)	gpt-4.1-mini	main_env

The answering LLM (default gpt-4.1-mini) is shared across systems via --model.

Regenerating the dataset

The construction pipeline (Stages 1–5 + flatten) is in code/data/ and code/dataset_tools/. Pre-built fillers used in haystack assembly are at meme-benchmark/MEME-fillers; the LLM-judge filter that produced them (code/data/filter_fillers.py) requires raw LongMemEval and ShareGPT 52K corpora.

License

Code: MIT (see LICENSE). Dataset: CC BY 4.0, see the dataset card for filler-source attribution (LongMemEval, ShareGPT52K).

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
assets		assets
code		code
dataset		dataset
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MEME Benchmark — Reference Implementation

Layout

Quick start

1. Get the dataset

2. Set up environments

3. Run an agent + judge

Reproducing paper experiments

Memory system requirements

Regenerating the dataset

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MEME Benchmark — Reference Implementation

Layout

Quick start

1. Get the dataset

2. Set up environments

3. Run an agent + judge

Reproducing paper experiments

Memory system requirements

Regenerating the dataset

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages