Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents
GateMem evaluates whether memory-augmented LLM agents can remain useful, enforce access control, and honor deletion requests in multi-principal shared-memory environments.
📄 Paper • 🤗 Hugging Face Dataset • 🧪 Benchmark Toolkit • 🗂 Dataset Card • 🌐 Project Page
GateMem evaluates memory governance in multi-principal shared-memory agents, including utility, access control, and active forgetting.
Most memory benchmarks ask a simple question:
Can the agent remember correctly?
GateMem asks a harder and more deployment-relevant one:
Can the agent govern shared memory correctly across multiple principals, roles, scopes, and deletion requests?
This matters in settings like:
- Medical: patient / clinician / family coordination / ...
- Office: manager / HR / employee / contractor workflows / ...
- Education: student / teacher / counselor / staff interactions / ...
- Household: family / guest / caregiver / resident coordination / ...
GateMem evaluates three tightly coupled capabilities:
| Capability | What it tests |
|---|---|
| Utility | Can the agent answer correctly for authorized requests? |
| Access Control | Can the agent avoid leaking protected information to unauthorized or over-scoped requesters? |
| Active Forgetting | Can the agent avoid recovering or confirming deleted information after explicit deletion requests? |
| Property | Value |
|---|---|
| Domains | Medical, Office, Education, Household |
| Episodes | 91 long-form multi-party episodes |
| Checkpoints | 2,218 hidden checkpoints |
| Evaluation categories | Utility / Access Control / Active Forgetting |
| Baselines included | Long-Context, RAG-Naive, RAG-Policy, A-MEM, Mem0, ReMeM-I, ReMeM-S |
| Main summary metric | Memory Governance Score (MGS) |
Main metric:
MGS = U * (1 - A) * (1 - F)
where:
U= UtilityA= Access-Control Violation RateF= Active-Forgetting Failure Rate
GateMem is designed for shared-memory agents, where memory is not a private cache but a common pool used by multiple principals.
A good shared-memory agent must do all three:
- Be helpful when the request is legitimate
- Be safe when the request exceeds authorization
- Be forgetful when information has been explicitly deleted
This makes GateMem different from prior benchmarks focused mainly on recall, personalization, or long-horizon memory utility.
GateMem is built from domain-specific scenario specifications, long-form multi-party episodes, and hidden checkpoints for utility, access control, and active forgetting.
Across diverse backbones and memory-agent baselines, no method simultaneously achieves strong utility, robust access control, and reliable active forgetting.
- Long-context prompting often gives the strongest overall trade-off, but at substantial token cost.
- Retrieval-based and external-memory methods reduce cost, but still exhibit non-trivial leakage and post-deletion recovery failures.
- Shared-memory governance remains a challenging open problem for memory-augmented LLM agents.
Across diverse backbones and memory-agent baselines, no method simultaneously achieves strong utility, robust access control, and reliable active forgetting.
Install dependencies:
pip install -r requirements.txtSet API keys if you use external LLMs:
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."Run a simple baseline on the medical domain:
python bench/scripts/run_eval.py \
--data_dir bench/data/medical \
--agent long_contextRun Policy RAG with an answer model and LLM judge:
python bench/scripts/run_eval.py \
--data_dir bench/data/medical \
--agent rag_policy \
--llm_provider openai \
--llm_model gpt-4o-mini \
--temperature 0.2 \
--max_output_tokens 4096 \
--use_llm_judge \
--judge_provider openai \
--judge_model gpt-4o \
--judge_concurrency 4For full benchmark usage, see bench/README.md.
GateMem provides an online submission interface for evaluating new methods and updating the public leaderboard.
Submit results: 👉 GateMem-Submit
View leaderboard: 🌐 GateMem Project Page
- Generate a
predictions.jsonlfile with your method. - Upload it to the GateMem Submission Space.
- Fill in method metadata, including method name, backbone model, domain, contact email, and code/artifact link.
- The server scores the submission using the official GateMem evaluator.
- After maintainer approval, the result is added to the public leaderboard.
Your submission should be a JSONL file named predictions.jsonl, with one prediction per checkpoint.
See docs/prediction_format.md.
To implement a method inside this repository, see:
Once predictions.jsonl is generated, submit it through the online interface above.
.
├── README.md
├── DATASET_CARD.md
├── CITATION.cff
├── bench/
│ ├── README.md
│ ├── data/
│ │ ├── medical/
│ │ ├── office/
│ │ ├── education/
│ │ └── household/
│ ├── agents/
│ ├── scripts/
│ └── ...
├── docs/
│ ├── adding_new_agent.md
│ ├── prediction_format.md
│ ├── evaluation_protocol.md
│ └── reproduce_paper.md
├── configs/
└── scripts/
The paper uses the terms:
- Utility
- Access Control
- Active Forgetting
For backward compatibility, the code and released JSON files keep legacy names:
| Paper term | Code / data name |
|---|---|
| Utility | utility |
| Access Control | privacy |
| Active Forgetting | safety |
Corresponding metrics:
utility_accuracy→ UtilityUprivacy_leakage_rate→ Access-Control ViolationAdeletion_leakage_rate→ Active-Forgetting FailureFcompliance_utility_score→ Memory Governance ScoreMGS
GateMem is released as an open-label offline benchmark for transparent research and method development.
During evaluation, methods should not use hidden annotation fields such as:
query_typeexpected_actionjudge_specleak_targets
These fields are intended for scoring and analysis only.
If you use GateMem, please cite the accompanying paper.
@article{gatemem2026,
title={GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents},
author={Ren, Zhe and Yang, Yibo and Chen, Yimeng and Zhao, Zijun and Fu, Benshuo and Shu, Zhihao and Zhang, Bingjie and Guo, Dandan},
year={2026}
}

