What if we started evolving language models?
Evolora treats a small LLM as a computational substrate—like the "physics" of a world—and evolves a population of tiny LoRA adapters ("organelles") that compete, learn, merge, and die based on their performance. It's artificial life meets language models: adapters pay energy to attempt tasks, earn rewards when they succeed, go bankrupt when they fail, and occasionally merge into new offspring when they prove their worth.
```
┌─────────────────────────────────────────────────────────────────┐
│                    FROZEN HOST (Qwen3-0.6B)                     │
│                     (weights never change)                      │
├─────────────────────────────────────────────────────────────────┤
│  ┌──────────┐   ┌──────────┐   ┌──────────┐       ┌──────────┐  │
│  │Organelle │   │Organelle │   │Organelle │  ...  │Organelle │  │
│  │  (LoRA)  │   │  (LoRA)  │   │  (LoRA)  │       │  (LoRA)  │  │
│  └────┬─────┘   └────┬─────┘   └────┬─────┘       └────┬─────┘  │
│       │              │              │                  │        │
│       └──────────────┴──────┬──────┴──────────────────┘        │
│                             ▼                                   │
│               ┌─────────────────────────┐                       │
│               │     ENERGY ECONOMY      │                       │
│               │  • Pay to participate   │                       │
│               │  • Earn via rewards     │                       │
│               │  • Bankrupt → retire    │                       │
│               │  • Merge on uplift      │                       │
│               └─────────────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘
```
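As a minimal sketch of the per-episode settlement rule in that box (illustrative names only, not the project API; the real logic lives in `src/symbiont_ecology/economics/`):

```python
# Toy per-episode energy settlement: pay to participate, earn via rewards,
# retire on bankruptcy. Names here are illustrative, not the project API.
from dataclasses import dataclass

@dataclass
class Organelle:
    name: str
    energy: float = 10.0  # ATP balance

def settle_episode(org: Organelle, reward: float, cost: float,
                   energy_floor: float = 0.0) -> bool:
    """Apply ROI = reward - cost; return False when the organelle is bankrupt."""
    org.energy += reward - cost
    return org.energy > energy_floor  # below the floor -> retire
```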
Three things are being optimised simultaneously:

- Per-call competence — each organelle tries to maximise reward on tasks
- Energy efficiency — ROI = reward − cost; cheap and good wins
- Population diversity — a quality-diversity archive nudges the ecology toward covering different niches

These objectives are pursued by three adaptation mechanisms operating at different timescales (sketched below):

- Local plasticity — reward-modulated local updates within each organelle (Hebbian or backprop; per episode)
- Population selection — assimilation + merging of successful adapters (per generation)
- Meta-evolution — tuning ecology parameters (assimilation threshold, energy floor, etc.)
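A toy rendering of how the three timescales nest (stand-in names and numbers; the actual loop lives in `src/symbiont_ecology/environment/`):

```python
# Toy nested adaptation loops. Everything here is a stand-in (random "skills",
# fixed cost); it only illustrates the control flow, not the real algorithm.
import random

population = [{"energy": 10.0, "skill": random.random()} for _ in range(4)]
params = {"energy_floor": 0.0, "assimilation_threshold": 0.5}

for meta_round in range(2):                          # meta-evolution (slowest)
    params["assimilation_threshold"] *= 0.95         # tune ecology parameters
    for generation in range(3):                      # population selection
        for episode in range(10):                    # local plasticity (fastest)
            org = random.choice(population)          # stand-in for bandit routing
            reward = org["skill"] + random.gauss(0.0, 0.1)
            cost = 0.3                               # pay to participate
            org["skill"] += 0.01 * (reward - cost)   # reward-modulated update
            org["energy"] += reward - cost           # ROI settlement
        population = [o for o in population
                      if o["energy"] > params["energy_floor"]]  # bankrupt -> retire
```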
Headline results from the three-condition comparison:

| Condition | Episodes | Mean ROI | Merges | QD Coverage | Holdout Acc | Holdout Cost |
|---|---|---|---|---|---|---|
| Frozen base | 1,727 | 0.94 | 0 | 0% | 13.3% | 1.91 |
| Single adapter | 63 | −0.06 | 0 | 0% | 13.3% | 1.94 |
| Full ecology | 7,413 | 1.39 | 52 | 16.7% | 14.2% | 1.14 |
The ecology achieves higher ROI, better holdout accuracy, and 40% lower inference cost compared to baselines—while the frozen host and single-adapter setups remain structurally inert.
See `docs/paper_packs/` for detailed run reports and plots.
So far, the pattern looks like this (the robustness numbers below come from a historical broad-mix run on `Qwen/Qwen2.5-0.5B`, kept for comparison with the Qwen3 table above):
- SFT wins on specialization: on in-distribution tasks, compute-matched SFT (matched on training FLOPs) is usually the strongest baseline.
- Evolution can win on robustness: on a family shift, a routed portfolio of evolved specialists can beat SFT (a minimal routing sketch follows this list), sometimes by retaining a capability that a single adapter loses (e.g., `word.count`).
- ID (6-family grid mix; `Qwen/Qwen2.5-0.5B`): base 70/512 (13.7%), SFT 318/512 (62.1%), evo (routed) 157/512 (30.7%)
- OOD (paper holdout; `math` + `word.count` + `code.format`; `Qwen/Qwen2.5-0.5B`): base 5/120 (4.2%), SFT 54/120 (45.0%), evo (routed) 68/120 (56.7%)
- Best single evolved organelle: 58/120 (48.3%); routing + OOD reselection: 68/120 (56.7%)
- Most striking detail: SFT collapsed on `word.count` (1/40) while evo stayed closer to base (7/40).
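As a minimal sketch of the routed-portfolio idea (hypothetical helper names; the project's actual routing lives in `src/symbiont_ecology/routing/` and `scripts/evaluate_holdout.py`): score every organelle per task family on a small selection set, then route each holdout task to that family's best specialist.

```python
# Toy family-based routing: pick, per task family, the organelle that scores
# best on a held-out selection set. `score` is a hypothetical callable that
# returns 1.0 for a correct answer and 0.0 otherwise.
from collections import defaultdict

def build_routing_map(organelles, selection_tasks, score):
    by_family = defaultdict(list)
    for task in selection_tasks:
        by_family[task["family"]].append(task)
    routing = {}
    for family, tasks in by_family.items():
        routing[family] = max(
            organelles,
            key=lambda org: sum(score(org, t) for t in tasks) / len(tasks),
        )
    return routing  # family -> specialist organelle
```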
Reproduce the OOD comparison (uses an existing run directory)
```bash
# Example: evaluate an existing grid run on the paper holdout (OOD family shift).
# IMPORTANT: selection tasks must be different from the holdout to avoid leakage.
# Note: this particular holdout includes two families not present in the 6-family training mix (`math`, `word.count`)
# plus one overlapping family (`code.format`) with a shifted distribution.
#
# Fairness note: "routed + reselection" uses a small selection set to pick specialists (extra eval-time compute).
# If you want a strict single-pass comparison, use a single selected organelle (`--evo-eval-routing single`)
# or reuse an existing routing map.
python scripts/generate_grid_datasets.py \
--config config/experiments/paper_qwen3_ecology.yaml \
--seed 4242 \
--train-size 1000 \
--selection-size 192 \
--holdout-size 192 \
--out-dir artifacts_ood_eval_paper/datasets
python scripts/evaluate_holdout.py \
--holdout config/evaluation/paper_qwen3_holdout_v1.jsonl \
--model <BASE_MODEL_MATCHING_RUN> \
--sft-adapter <RUN_DIR>/sft/peft_adapter \
--evo-checkpoint <RUN_DIR>/checkpoint.pt \
--evo-eval-routing family \
--evo-selection-tasks artifacts_ood_eval_paper/datasets/selection_tasks.jsonl \
--evo-selection-max-samples 64 \
--evo-selection-max-new-tokens 64 \
  --output artifacts_ood_eval_paper/eval_paper_holdout.json
```

A key question: does evolutionary adaptation actually outperform standard supervised fine-tuning given the same compute budget? This section explains how to run a fair comparison.
Note: `scripts/run_sft.py` depends on `datasets` (in `requirements.txt`). If you see `ModuleNotFoundError: No module named 'datasets'`, run `make setup` (or `.venv/bin/python -m pip install -r requirements.txt`).
In a historical compute-matched run on `Qwen/Qwen2.5-0.5B` (LoRA rank=7; evo organelles trained with `EVOLORA_PLASTICITY=backprop`):
- OOD holdout (`config/evaluation/regex_generalization.jsonl`, 19 tasks): base 2/19 (10.5%), SFT 12/19 (63.2%), evo 5/19 (26.3%)
- ID holdout (512 generated tasks from the same task mix): base 94/512 (18.4%), SFT 354/512 (69.1%), evo 148/512 (28.9%)
Interpretation:
- Signal density matters: under a fixed training compute budget, SFT (gold targets + dense token-level gradients) is far more sample-efficient than evo (rollouts + sparse scalar rewards); see the rough illustration after this list.
- Evo > base, but ≪ SFT: the best evolved organelle reliably improves over base, but remains well behind compute-matched SFT in this setting.
- No-op adapter pitfall: mixed populations can include adapters with effectively zero LoRA weights; `evaluate_holdout.py` skips these when any non-zero adapters exist.
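To make the signal-density point concrete, a rough back-of-envelope illustration (the 64-token target length is an assumption, not a measured value):

```python
# Rough signal-density comparison per training example. Assumes a 64-token
# gold target for SFT vs one scalar reward per evo rollout.
target_tokens = 64
sft_signals = target_tokens        # one cross-entropy term per target token
evo_signals = 1                    # one scalar reward per episode
print(sft_signals / evo_signals)   # SFT sees ~64x more learning signal here
```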
See `docs/regex_generalization.md` for the task mix, evaluation details, and additional context.
```bash
# End-to-end runner (calibration → resume → compute-matched SFT → eval).
# Generates distribution-matched regex datasets into <run_dir>/datasets/.
make regex-evo-vs-sft REGEX_FULL_GENS=10 REGEX_CALIB_GENS=2 REGEX_CHECKPOINT_EVERY=1
```

- Run evolution with compute tracking enabled:
  - `ComputeBudget` tracks total tokens, forward passes, and Hebbian updates
  - Metrics are saved to the checkpoint and `gen_summaries.jsonl`
- Run SFT with a matched budget (see the back-of-envelope FLOPs sketch after this list):

  ```bash
  # Recommended: match evo "training" compute via FLOPs (more stable than tokens)
  python scripts/run_sft.py \
    --checkpoint artifacts_evo/checkpoint.pt \
    --match-budget-field train_flops \
    --backprop-multiplier 3.0 \
    --attn-implementation eager \
    --engine manual \
    --resume \
    --data artifacts_evo/datasets/sft_train.jsonl \
    --output artifacts_sft

  # Alternatives: wall-clock match or an explicit token budget
  python scripts/run_sft.py \
    --match-budget-field wall_clock_seconds \
    --attn-implementation eager \
    --engine manual \
    --resume \
    --data artifacts_evo/datasets/sft_train.jsonl \
    --output artifacts_sft

  python scripts/run_sft.py \
    --token-budget 500000 \
    --data artifacts_evo/datasets/sft_train.jsonl \
    --output artifacts_sft
  ```
  - On macOS/MPS, the default engine uses a manual training loop with NaN guards and resumable checkpoints
  - Exports LoRA weights compatible with `HostKernel.load_organelle_adapter()`
- Evaluate both on the same holdout tasks:

  ```bash
  # In-distribution regex holdout (and selection set for evo organelle picking)
  python scripts/evaluate_holdout.py \
    --holdout artifacts_evo/datasets/holdout_tasks.jsonl \
    --evo-selection-tasks artifacts_evo/datasets/selection_tasks.jsonl \
    --sft-adapter artifacts_sft/peft_adapter \
    --evo-checkpoint artifacts_evo/checkpoint.pt \
    --output comparison_results_id.json

  # OOD / mixed holdout (optional)
  python scripts/evaluate_holdout.py \
    --holdout config/evaluation/regex_generalization.jsonl \
    --evo-selection-tasks artifacts_evo/datasets/selection_tasks.jsonl \
    --sft-adapter artifacts_sft/peft_adapter \
    --evo-checkpoint artifacts_evo/checkpoint.pt \
    --output comparison_results_ood.json
  ```
  - Compares the base model, SFT, and the best evolved organelle
  - Picks the best evolved organelle on `--evo-selection-tasks` (to avoid test leakage)
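For intuition on the FLOPs match, a back-of-envelope sketch (the 2·N FLOPs-per-token estimate and the example numbers are common rules of thumb, not the exact accounting in `ComputeBudget`):

```python
# Back-of-envelope FLOPs matching. Assumes ~2*N FLOPs per token for a forward
# pass of an N-parameter model and ~3x that for a full training step, which is
# what the --backprop-multiplier 3.0 flag above reflects.
N_PARAMS = 0.5e9  # e.g., a 0.5B-parameter host

def forward_flops(tokens: float) -> float:
    return 2.0 * N_PARAMS * tokens

def train_flops(tokens: float, backprop_multiplier: float = 3.0) -> float:
    return backprop_multiplier * forward_flops(tokens)

# If an evo run spent 1e15 "train" FLOPs, the matched SFT token budget is:
evo_budget = 1e15
sft_tokens = evo_budget / (3.0 * 2.0 * N_PARAMS)  # ~333k tokens
```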
| Script | Purpose |
|---|---|
| `scripts/run_evolution.py` | Run evolutionary LoRA with compute tracking |
| `scripts/run_sft.py` | Train SFT baseline with matched token budget |
| `scripts/evaluate_holdout.py` | Compare models on holdout tasks |
- Model compatibility: the evolution checkpoint and evaluation must use the same base model (check that tensor shapes match)
- Token tracking: tokens are recorded in `observations.metrics.tokens` in `episodes.jsonl`
- Best organelle selection: `evaluate_holdout.py` selects the best organelle by scoring candidates on a separate selection/validation set (configurable via `--evo-selection-tasks`), with an ROI-based fallback. If any adapters have non-zero LoRA weights, selection ignores no-op (zero-magnitude) adapters to avoid picking a "base-like" organelle; a minimal version of that filter is sketched below.
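A minimal sketch of the no-op filter (hypothetical helper names; the real check lives in `scripts/evaluate_holdout.py`):

```python
# Treat an adapter as a no-op when its total LoRA weight magnitude is ~zero.
import torch

def lora_magnitude(state_dict: dict[str, torch.Tensor]) -> float:
    return sum(t.abs().sum().item() for k, t in state_dict.items() if "lora_" in k)

def drop_noop_adapters(adapters: dict[str, dict], eps: float = 1e-8) -> dict[str, dict]:
    nonzero = {name: sd for name, sd in adapters.items() if lora_magnitude(sd) > eps}
    return nonzero or adapters  # if everything is a no-op, keep them all
```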
This enables a fair comparison: evolutionary adaptation (many small adapters + population dynamics) vs. traditional gradient-based fine-tuning (a single adapter, supervised loss).
```
src/symbiont_ecology/
├── host/          # Frozen backbone wrapper, LoRA slot management
├── organelles/    # Hebbian-PEFT adapters with eligibility traces
├── routing/       # Bandit router for adapter selection
├── evolution/     # Population manager, model merger, morphogenesis
├── environment/   # Task factory, ecology loop, grid controller
├── economics/     # ATP ledger, energy settlement
├── metrics/       # Telemetry sink, QD archive, ComputeBudget tracking
└── config.py      # Pydantic config models
```

```
config/
├── experiments/   # YAML configs: frozen, single, ecology variants
├── evaluation/    # Holdout task sets (e.g., regex_generalization.jsonl)
├── training/      # SFT training data (e.g., regex_sft_data.jsonl)
└── ecology.yaml   # Base ecology parameters
```

```
scripts/
├── run_evolution.py        # Main experiment runner (resumable)
├── run_sft.py              # SFT baseline trainer (compute-matched)
├── evaluate_holdout.py     # Compare models on holdout tasks
├── analyze_ecology_run.py  # Generate reports + plots from a run
├── evoscope.py             # Interactive run visualisation
├── paper_pack.py           # Bundle runs into tracked summaries
└── benchmark_suite.py      # CI-safe benchmark harness
```
- YAML-configured experiments under `config/experiments/`
- CI-safe benchmark harness (stubbed backend; no model downloads)
- Run artifacts under `artifacts_*` (telemetry + plots)
- Analysis + visualization: `scripts/analyze_ecology_run.py`, `scripts/evoscope.py`
- Paper packs (tracked summaries): `docs/paper_packs/`
```bash
make all

# CI-safe smoke test (stubbed backend; no model downloads)
AGENT_MODE=baseline .venv/bin/python scripts/benchmark_suite.py --mode ci

# Run the full paper ecology suite
scripts/run_paper_ecology_suite.sh all
```

Optional: add a fixed, measurement-only holdout suite (paper packs will include holdout metrics):

```bash
FINAL_HOLDOUT_TASKS=config/evaluation/paper_qwen3_holdout_v1.jsonl \
FINAL_HOLDOUT_SAMPLE_SIZE=120 \
scripts/run_paper_ecology_suite.sh all
```

For long runs, a good workflow is "calibrate → resume": run ~10 generations to validate the config/model load, then resume the same run directory to the full length via `--resume-from`:

```bash
make calibrate-resume CONFIG=config/experiments/paper_qwen3_ecology.yaml FULL_GENS=50
```

Analyze and visualize a finished run:

```bash
MPLCONFIGDIR="$(mktemp -d)" .venv/bin/python scripts/analyze_ecology_run.py <run_dir> --plots --report
MPLCONFIGDIR="$(mktemp -d)" .venv/bin/python scripts/evoscope.py <run_dir>
```

- `docs/running.md` — how to run longer experiments (resume/holdout)
- `docs/ecology_overview.md` — conceptual overview + baselines framing
- `docs/architecture.md` — module-level architecture sketch
- `docs/paper_packs/README.md` — paper packs (tables + plots) + examples
- Full experiments download model weights from Hugging Face; use the CI benchmark mode above for a fast smoke test without downloads.
- On macOS, `MPLCONFIGDIR="$(mktemp -d)"` avoids matplotlib font-cache issues in long runs/plots.
