GateMem

Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

GateMem evaluates whether memory-augmented LLM agents can remain useful, enforce access control, and honor deletion requests in multi-principal shared-memory environments.

📄 Paper • 🤗 Hugging Face Dataset • 🧪 Benchmark Toolkit • 🗂 Dataset Card • 🌐 Project Page

GateMem evaluates memory governance in multi-principal shared-memory agents, including utility, access control, and active forgetting.

✨ What is GateMem?

Most memory benchmarks ask a simple question:

Can the agent remember correctly?

GateMem asks a harder and more deployment-relevant one:

Can the agent govern shared memory correctly across multiple principals, roles, scopes, and deletion requests?

This matters in settings like:

Medical: patient / clinician / family coordination / ...
Office: manager / HR / employee / contractor workflows / ...
Education: student / teacher / counselor / staff interactions / ...
Household: family / guest / caregiver / resident coordination / ...

GateMem evaluates three tightly coupled capabilities:

Capability	What it tests
Utility	Can the agent answer correctly for authorized requests?
Access Control	Can the agent avoid leaking protected information to unauthorized or over-scoped requesters?
Active Forgetting	Can the agent avoid recovering or confirming deleted information after explicit deletion requests?

📌 Benchmark at a Glance

Property	Value
Domains	Medical, Office, Education, Household
Episodes	91 long-form multi-party episodes
Checkpoints	2,218 hidden checkpoints
Evaluation categories	Utility / Access Control / Active Forgetting
Baselines included	Long-Context, RAG-Naive, RAG-Policy, A-MEM, Mem0, ReMeM-I, ReMeM-S
Main summary metric	Memory Governance Score (MGS)

Main metric:

MGS = U * (1 - A) * (1 - F)

where:

U = Utility
A = Access-Control Violation Rate
F = Active-Forgetting Failure Rate

🧠 Why GateMem?

GateMem is designed for shared-memory agents, where memory is not a private cache but a common pool used by multiple principals.

A good shared-memory agent must do all three:

Be helpful when the request is legitimate
Be safe when the request exceeds authorization
Be forgetful when information has been explicitly deleted

This makes GateMem different from prior benchmarks focused mainly on recall, personalization, or long-horizon memory utility.

🏗️ Benchmark Construction Pipeline

GateMem is built from domain-specific scenario specifications, long-form multi-party episodes, and hidden checkpoints for utility, access control, and active forgetting.

📊 Main Takeaway

Across diverse backbones and memory-agent baselines, no method simultaneously achieves strong utility, robust access control, and reliable active forgetting.

Long-context prompting often gives the strongest overall trade-off, but at substantial token cost.
Retrieval-based and external-memory methods reduce cost, but still exhibit non-trivial leakage and post-deletion recovery failures.
Shared-memory governance remains a challenging open problem for memory-augmented LLM agents.

Across diverse backbones and memory-agent baselines, no method simultaneously achieves strong utility, robust access control, and reliable active forgetting.

🚀 Quick Start

Install dependencies:

pip install -r requirements.txt

Set API keys if you use external LLMs:

export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."

Run a simple baseline on the medical domain:

python bench/scripts/run_eval.py \
  --data_dir bench/data/medical \
  --agent long_context

Run Policy RAG with an answer model and LLM judge:

python bench/scripts/run_eval.py \
  --data_dir bench/data/medical \
  --agent rag_policy \
  --llm_provider openai \
  --llm_model gpt-4o-mini \
  --temperature 0.2 \
  --max_output_tokens 4096 \
  --use_llm_judge \
  --judge_provider openai \
  --judge_model gpt-4o \
  --judge_concurrency 4

For full benchmark usage, see bench/README.md.

🔬 Evaluate Your Own Method

GateMem provides an online submission interface for evaluating new methods and updating the public leaderboard.

Submit results: 👉 GateMem-Submit
View leaderboard: 🌐 GateMem Project Page

Workflow

Generate a predictions.jsonl file with your method.
Upload it to the GateMem Submission Space.
Fill in method metadata, including method name, backbone model, domain, contact email, and code/artifact link.
The server scores the submission using the official GateMem evaluator.
After maintainer approval, the result is added to the public leaderboard.

Prediction Format

Your submission should be a JSONL file named predictions.jsonl, with one prediction per checkpoint.

See docs/prediction_format.md.

Implementing a New Agent

To implement a method inside this repository, see:

Once predictions.jsonl is generated, submit it through the online interface above.

📁 Repository Structure

.
├── README.md
├── DATASET_CARD.md
├── CITATION.cff
├── bench/
│   ├── README.md
│   ├── data/
│   │   ├── medical/
│   │   ├── office/
│   │   ├── education/
│   │   └── household/
│   ├── agents/
│   ├── scripts/
│   └── ...
├── docs/
│   ├── adding_new_agent.md
│   ├── prediction_format.md
│   ├── evaluation_protocol.md
│   └── reproduce_paper.md
├── configs/
└── scripts/

📝 Notes on Evaluation

The paper uses the terms:

Utility
Access Control
Active Forgetting

For backward compatibility, the code and released JSON files keep legacy names:

Paper term	Code / data name
Utility	`utility`
Access Control	`privacy`
Active Forgetting	`safety`

Corresponding metrics:

utility_accuracy → Utility U
privacy_leakage_rate → Access-Control Violation A
deletion_leakage_rate → Active-Forgetting Failure F
compliance_utility_score → Memory Governance Score MGS

⚠️ Release Note

GateMem is released as an open-label offline benchmark for transparent research and method development.

During evaluation, methods should not use hidden annotation fields such as:

query_type
expected_action
judge_spec
leak_targets

These fields are intended for scoring and analysis only.

📚 Citation

If you use GateMem, please cite the accompanying paper.

@article{gatemem2026,
  title={GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents},
  author={Ren, Zhe and Yang, Yibo and Chen, Yimeng and Zhao, Zijun and Fu, Benshuo and Shu, Zhihao and Zhang, Bingjie and Guo, Dandan},
  year={2026}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GateMem

✨ What is GateMem?

📌 Benchmark at a Glance

🧠 Why GateMem?

🏗️ Benchmark Construction Pipeline

📊 Main Takeaway

🚀 Quick Start

🔬 Evaluate Your Own Method

Workflow

Prediction Format

Implementing a New Agent

📁 Repository Structure

📝 Notes on Evaluation

⚠️ Release Note

📚 Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
assets		assets
bench		bench
configs		configs
docs		docs
scripts		scripts
third_party/mem0_upstream		third_party/mem0_upstream
.gitignore		.gitignore
CITATION.cff		CITATION.cff
DATASET_CARD.md		DATASET_CARD.md
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

GateMem

✨ What is GateMem?

📌 Benchmark at a Glance

🧠 Why GateMem?

🏗️ Benchmark Construction Pipeline

📊 Main Takeaway

🚀 Quick Start

🔬 Evaluate Your Own Method

Workflow

Prediction Format

Implementing a New Agent

📁 Repository Structure

📝 Notes on Evaluation

⚠️ Release Note

📚 Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages