Implementation of the paper "Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems".
LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution that are difficult to eliminate during design. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization.
Meta-Team advocates a simple principle: a MAS should not only execute as a team, but also evolve as a team. After each task, the team reflects collaboratively at three levels:
| Level | Scope | What Evolves |
|---|---|---|
| L1 Agent-Level | Individual agent scaffold | Prompt patches, skill additions |
| L2 Interaction-Level | Communication patterns | Teammate profiles, collaboration notes |
| L3 Team-Level | Shared team scaffold | Constitution, organization, coordination rules |
cd Meta-Team
pip install -e .
cp .env.example .env
# Edit .env with your LLM API configuration

Requirements: Python >= 3.10, Docker (for SWE-bench / BeyondSWE tasks)
Meta-Team uses any OpenAI-compatible LLM API. Edit .env:
# LLM Gateway (OpenAI-compatible)
DEV_API_BASE=https://api.anthropic.com/v1
DEV_API_KEY=sk-your-key-here

All experiments use Claude Sonnet 4.6 (temperature=0.2, max_tokens=32768). The model name is configured in each agent pool's config.yaml.
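As a rough illustration of how these settings might be consumed, the sketch below reads the two variables from an environment mapping and pairs them with the sampling parameters stated above. The function name `load_llm_config` and the returned dictionary layout are hypothetical and are not taken from the repository; only `DEV_API_BASE`, `DEV_API_KEY`, and the temperature and token values come from this README.

```python
import os


def load_llm_config(env=None):
    """Collect gateway settings from an environment mapping.

    Hypothetical helper for illustration; the repository's own loader may differ.
    """
    env = os.environ if env is None else env
    # Fail early with a clear message if either variable is absent or empty.
    missing = [key for key in ("DEV_API_BASE", "DEV_API_KEY") if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return {
        "base_url": env["DEV_API_BASE"].rstrip("/"),
        "api_key": env["DEV_API_KEY"],
        # Sampling parameters reported for all experiments in this README.
        "temperature": 0.2,
        "max_tokens": 32768,
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to check without touching the real environment.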
Download benchmark data into the corresponding directories before running experiments:
| Benchmark | Location | Source |
|---|---|---|
| SWE-bench Pro | benchmarks/SWE-bench-Pro/ | ScaleAI/SWE-bench_Pro + Docker images |
| BeyondSWE | benchmarks/BeyondSWE/data/ | BeyondSWE + Docker images |
| LOCA-Bench | benchmarks/LOCA-bench/ | LOCA-bench |
| GAIA | data/gaia/val_files/ | GAIA (attachment files only; task splits are pre-included) |
| LoCoBench | benchmarks/LoCoBench/ | LoCoBench |
| ResearchRubrics | data/researchrubrics/researchrubrics.jsonl | ResearchRubrics |
Each benchmark follows the same pipeline: Evolve (around 20 cases) → Freeze (save evolved team) → Test (holdout). All results are reported as avg@3 (run the test 3 times independently and average).
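The avg@3 metric can be sketched as follows, assuming each test run yields one pass/fail outcome per holdout task. The function name `avg_at_k` and the input layout are illustrative assumptions, not the repository's evaluation code.

```python
from statistics import mean


def avg_at_k(rollouts):
    """Average the per-run success rate over k independent test runs.

    `rollouts` is a list of runs; each run is a list of booleans,
    one per holdout task (True means the task was solved).
    """
    if not rollouts or any(len(run) == 0 for run in rollouts):
        raise ValueError("each rollout needs at least one task result")
    # Success rate of each run, then the mean across runs.
    per_run = [sum(run) / len(run) for run in rollouts]
    return mean(per_run)


# Three independent runs over four hypothetical holdout tasks.
runs = [
    [True, False, True, True],    # 0.75
    [True, True, False, False],   # 0.50
    [True, False, False, True],   # 0.50
]
score = avg_at_k(runs)
```

Averaging per-run rates rather than pooling all outcomes gives each run equal weight, which matches the "run the test 3 times independently and average" description when every run covers the same task set.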
# Ansible
bash scripts/run_swebench_pro_paper_experiment.sh evolve ansible
bash scripts/run_swebench_pro_paper_experiment.sh freeze <run_id> ansible
bash scripts/run_swebench_pro_paper_experiment.sh test ansible pool_SWE_Pro_evolved_ansible --rollout 3
# Qutebrowser
bash scripts/run_swebench_pro_paper_experiment.sh evolve qutebrowser
bash scripts/run_swebench_pro_paper_experiment.sh freeze <run_id> qutebrowser
bash scripts/run_swebench_pro_paper_experiment.sh test qutebrowser pool_SWE_Pro_evolved_qutebrowser --rollout 3

# CrossRepo
bash scripts/run_beyondswe_paper_experiment.sh evolve crossrepo
bash scripts/run_beyondswe_paper_experiment.sh freeze <run_id> crossrepo
bash scripts/run_beyondswe_paper_experiment.sh test crossrepo pool_BeyondSWE_evolved_crossrepo --rollout 3
# DepMigrate
bash scripts/run_beyondswe_paper_experiment.sh evolve depmigrate
bash scripts/run_beyondswe_paper_experiment.sh freeze <run_id> depmigrate
bash scripts/run_beyondswe_paper_experiment.sh test depmigrate pool_BeyondSWE_evolved_depmigrate --rollout 3

bash scripts/run_locabench_paper_experiment.sh evolve-mt
bash scripts/run_locabench_paper_experiment.sh freeze <run_id> <version>
bash scripts/run_locabench_paper_experiment.sh eval-all pool_LOCAbench_evolved

LOCA-Bench uses 5 seeds per task; results are averaged across all seeds.
bash scripts/run_gaia_evolve_and_test.sh evolve
# Promote: cp -r runs/<run_id>_evolve/team/<latest_version> agents/pool_GAIA_MT_evolved
bash scripts/run_gaia_evolve_and_test.sh test pool_GAIA_MT_evolved --rollout 3

# Feature Implementation
bash scripts/run_locobench_paper_experiment.sh evolve fi
bash scripts/run_locobench_paper_experiment.sh freeze <run_id> fi
bash scripts/run_locobench_paper_experiment.sh test fi pool_LoCoBench_evolved_fi --rollout 3
# Cross-file Refactoring
bash scripts/run_locobench_paper_experiment.sh evolve cr
bash scripts/run_locobench_paper_experiment.sh freeze <run_id> cr
bash scripts/run_locobench_paper_experiment.sh test cr pool_LoCoBench_evolved_cr --rollout 3

bash scripts/run_rr_paper_experiment.sh evolve
bash scripts/run_rr_paper_experiment.sh freeze <run_id>
bash scripts/run_rr_paper_experiment.sh test pool_DeepResearch_evolved --rollout 3

Note: Each evolve step prints its run_id upon completion; use it in the subsequent freeze step.
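The three-step pipeline (evolve, freeze with the printed run_id, test) can be sketched as a small command builder. The example below only assembles the SWE-bench Pro commands listed above and does not execute them. The helper name `pipeline_commands` is hypothetical, and the pool naming pattern `pool_SWE_Pro_evolved_<subset>` is an assumption inferred from the two examples in this README.

```python
SCRIPT = "scripts/run_swebench_pro_paper_experiment.sh"


def pipeline_commands(subset, run_id, rollout=3):
    """Build the evolve, freeze, and test commands for one SWE-bench Pro subset.

    Hypothetical helper: it returns argument lists and runs nothing.
    """
    # Assumed naming pattern, inferred from the ansible and qutebrowser examples.
    pool = f"pool_SWE_Pro_evolved_{subset}"
    return [
        ["bash", SCRIPT, "evolve", subset],
        ["bash", SCRIPT, "freeze", run_id, subset],
        ["bash", SCRIPT, "test", subset, pool, "--rollout", str(rollout)],
    ]


# "<run_id>" stays a placeholder; substitute the value printed by the evolve step.
commands = pipeline_commands("ansible", "<run_id>")
for cmd in commands:
    print(" ".join(cmd))
```

Returning argument lists rather than a single shell string means the same values could later be handed to `subprocess.run` without extra quoting, once the real run_id is known.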
Meta-Team/
├── core/ # Framework engine
│ ├── agent.py # Agent loop (LLM + tools)
│ ├── runner.py # Task execution lifecycle
│ ├── reflection_runner.py # Three-layer evolution orchestration
│ ├── llm.py # LLM API with retry
│ └── message_store.py # Inter-agent communication
├── tools/ # Agent capabilities
│ ├── primitives.py # Team primitives (recruit, send_message, terminate)
│ ├── docker_bash.py # Sandboxed code execution
│ └── reflection.py # Evolution tools (L1/L2/L3)
├── evaluation/ # Rubric-based evaluation (ResearchRubrics)
├── agents/ # Team configurations (6 base pools)
├── benchmarks/ # Benchmark adapters
├── scripts/ # Experiment pipelines
├── prompts/ # Reflection prompt templates
├── data/ # Data splits and placeholders
└── assets/ # Figures
For questions or discussions, feel free to contact: haozhezheng@outlook.com
