Skip to content

zz-haooo/Meta-Team

Repository files navigation

Meta-Team

The implementations for the paper "Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems".

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution that are difficult to eliminate during design. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization.

Overview

Meta-Team advocates a simple principle: a MAS should not only execute as a team, but also evolve as a team. After each task, the team reflects collaboratively at three levels:

Level Scope What Evolves
L1 Agent-Level Individual agent scaffold Prompt patches, skill additions
L2 Interaction-Level Communication patterns Teammate profiles, collaboration notes
L3 Team-Level Shared team scaffold Constitution, organization, coordination rules

Meta-Team Overview

Installation

cd Meta-Team
pip install -e .
cp .env.example .env
# Edit .env with your LLM API configuration

Requirements: Python >= 3.10, Docker (for SWE-bench / BeyondSWE tasks)

Configuration

Meta-Team uses any OpenAI-compatible LLM API. Edit .env:

# LLM Gateway (OpenAI-compatible)
DEV_API_BASE=https://api.anthropic.com/v1
DEV_API_KEY=sk-your-key-here

All experiments use Claude Sonnet 4.6 (temperature=0.2, max_tokens=32768). The model name is configured in each agent pool's config.yaml.

Data Preparation

Download benchmark data into the corresponding directories before running experiments:

Benchmark Location Source
SWE-bench Pro benchmarks/SWE-bench-Pro/ ScaleAI/SWE-bench_Pro + Docker images
BeyondSWE benchmarks/BeyondSWE/data/ BeyondSWE + Docker images
LOCA-Bench benchmarks/LOCA-bench/ LOCA-bench
GAIA data/gaia/val_files/ GAIA (attachment files only; task splits are pre-included)
LoCoBench benchmarks/LoCoBench/ LoCoBench
ResearchRubrics data/researchrubrics/researchrubrics.jsonl ResearchRubrics

Evolution and test

Each benchmark follows the same pipeline: Evolve (around 20 cases) → Freeze (save evolved team) → Test (holdout). All results are reported as avg@3 (run the test 3 times independently and average).

SWE-bench Pro

# Ansible
bash scripts/run_swebench_pro_paper_experiment.sh evolve ansible
bash scripts/run_swebench_pro_paper_experiment.sh freeze <run_id> ansible
bash scripts/run_swebench_pro_paper_experiment.sh test ansible pool_SWE_Pro_evolved_ansible --rollout 3

# Qutebrowser
bash scripts/run_swebench_pro_paper_experiment.sh evolve qutebrowser
bash scripts/run_swebench_pro_paper_experiment.sh freeze <run_id> qutebrowser
bash scripts/run_swebench_pro_paper_experiment.sh test qutebrowser pool_SWE_Pro_evolved_qutebrowser --rollout 3

BeyondSWE

# CrossRepo
bash scripts/run_beyondswe_paper_experiment.sh evolve crossrepo
bash scripts/run_beyondswe_paper_experiment.sh freeze <run_id> crossrepo
bash scripts/run_beyondswe_paper_experiment.sh test crossrepo pool_BeyondSWE_evolved_crossrepo --rollout 3

# DepMigrate
bash scripts/run_beyondswe_paper_experiment.sh evolve depmigrate
bash scripts/run_beyondswe_paper_experiment.sh freeze <run_id> depmigrate
bash scripts/run_beyondswe_paper_experiment.sh test depmigrate pool_BeyondSWE_evolved_depmigrate --rollout 3

LOCA-Bench

bash scripts/run_locabench_paper_experiment.sh evolve-mt
bash scripts/run_locabench_paper_experiment.sh freeze <run_id> <version>
bash scripts/run_locabench_paper_experiment.sh eval-all pool_LOCAbench_evolved

LOCA-Bench uses 5 seeds per task; results are averaged across all seeds.

GAIA

bash scripts/run_gaia_evolve_and_test.sh evolve
# Promote: cp -r runs/<run_id>_evolve/team/<latest_version> agents/pool_GAIA_MT_evolved
bash scripts/run_gaia_evolve_and_test.sh test pool_GAIA_MT_evolved --rollout 3

LoCoBench

# Feature Implementation
bash scripts/run_locobench_paper_experiment.sh evolve fi
bash scripts/run_locobench_paper_experiment.sh freeze <run_id> fi
bash scripts/run_locobench_paper_experiment.sh test fi pool_LoCoBench_evolved_fi --rollout 3

# Cross-file Refactoring
bash scripts/run_locobench_paper_experiment.sh evolve cr
bash scripts/run_locobench_paper_experiment.sh freeze <run_id> cr
bash scripts/run_locobench_paper_experiment.sh test cr pool_LoCoBench_evolved_cr --rollout 3

ResearchRubrics

bash scripts/run_rr_paper_experiment.sh evolve
bash scripts/run_rr_paper_experiment.sh freeze <run_id>
bash scripts/run_rr_paper_experiment.sh test pool_DeepResearch_evolved --rollout 3

Note: Each evolve step prints its run_id upon completion — use it in the subsequent freeze step.

Project Structure

Meta-Team/
├── core/                    # Framework engine
│   ├── agent.py             # Agent loop (LLM + tools)
│   ├── runner.py            # Task execution lifecycle
│   ├── reflection_runner.py # Three-layer evolution orchestration
│   ├── llm.py              # LLM API with retry
│   └── message_store.py    # Inter-agent communication
├── tools/                   # Agent capabilities
│   ├── primitives.py       # Team primitives (recruit, send_message, terminate)
│   ├── docker_bash.py      # Sandboxed code execution
│   └── reflection.py       # Evolution tools (L1/L2/L3)
├── evaluation/              # Rubric-based evaluation (ResearchRubrics)
├── agents/                  # Team configurations (6 base pools)
├── benchmarks/              # Benchmark adapters
├── scripts/                 # Experiment pipelines
├── prompts/                 # Reflection prompt templates
├── data/                    # Data splits and placeholders
└── assets/                  # Figures

Contact

For questions or discussions, feel free to contact: haozhezheng@outlook.com

About

The implementation of the paper "Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems"

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors