Adds tests/test_atomic_evolution_live.py gated behind SKILLFORGE_LIVE_TESTS=1.
Runs the full v2.0 atomic pipeline against the real Anthropic API:
POST /api/evolve (atomic) -> Taxonomist classifies -> variant evolution
orchestrator -> per-dimension Spawner + Competitor + judging pipeline ->
Engineer assembles composite -> integration check -> evolution_complete
The test took four attempts to go green. Each failure surfaced a real bug
the mocked unit tests couldn't catch:
1. **Taxonomist slug collision** (taxonomist.py). When the LLM proposed a
family slug that already existed in the DB, the unconditional
save_skill_family call hit the UNIQUE constraint. Fixed by checking
get_family_by_slug first and reusing the existing family if found,
symmetric to the _ensure_node lookup-or-create path for taxonomy nodes.
2. **save_run cascade wipe** (queries.py). save_run used INSERT OR REPLACE,
which triggers ON DELETE CASCADE on the row being replaced — silently
wiping every variant_evolutions / challenges / generations /
competition_results row for the run. Only visible when save_run is called
twice during run
submission, which is exactly what _classify_run_via_taxonomist does when
atomic mode persists variant_evolution rows between the first and second
save_run calls. Fixed by switching to INSERT ... ON CONFLICT(id) DO
UPDATE SET ... which updates in place without the DELETE cascade.
3. **Spawner variant schema mismatch** (spawner.py). spawn_variant_gen0's
prompt asked the LLM to return frontmatter and skill_md_content as
separate fields, but validate_skill_structure expects the frontmatter
embedded in skill_md_content. The existing spawn_gen0 schema embeds
frontmatter in skill_md_content — I updated spawn_variant_gen0 to match.
4. **Atomic cost tracking gap** (variant_evolution.py). The atomic
orchestrator never updated run.total_cost_usd — molecular mode's
_estimate_generation_cost was only called in the molecular loop. The
fix wires the same estimator into _run_dimension_mini_evolution and
emits cost_update events per mini-generation so the frontend and
budget tracking both see real numbers.
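The INSERT OR REPLACE footgun in bug 2 can be reproduced in isolation with plain sqlite3. A minimal standalone sketch (toy two-table schema, not SkillForge's actual tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE runs (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE challenges ("
    "  id INTEGER PRIMARY KEY,"
    "  run_id TEXT REFERENCES runs(id) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO runs VALUES ('r1', 'running')")
conn.execute("INSERT INTO challenges (run_id) VALUES ('r1')")

# INSERT OR REPLACE is DELETE + INSERT under the hood; the hidden
# DELETE fires ON DELETE CASCADE and wipes the child row.
conn.execute("INSERT OR REPLACE INTO runs VALUES ('r1', 'complete')")
wiped = conn.execute("SELECT COUNT(*) FROM challenges").fetchone()[0]
print(wiped)  # 0 -- child row silently gone

# The upsert form updates in place: no DELETE, children survive.
conn.execute("INSERT INTO challenges (run_id) VALUES ('r1')")
conn.execute(
    "INSERT INTO runs VALUES ('r1', 'done') "
    "ON CONFLICT(id) DO UPDATE SET status = excluded.status"
)
kept = conn.execute("SELECT COUNT(*) FROM challenges").fetchone()[0]
print(kept)  # 1 -- child row preserved
```

The demo needs PRAGMA foreign_keys = ON because SQLite leaves foreign-key enforcement off by default, which is also why the wipe is easy to miss in ad-hoc testing.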
The fourth attempt produced a real composite:
- run_id: run-atomic-live-test
- family: fam_5dbe2684831f (test-fixture spec classification)
- best_skill: composite_d65549afe4bd — a real "Pytest Equivalence-
Partition Test Generator" assembled by the Engineer from the
foundation variant
- status: complete, 10:46 wall time
- all variant_evolutions terminal
- Engineer's integration_notes visible in mutation_rationale
Budget: ~$4 across all four attempts, inside the authorized $5 live-test
budget. The post-hoc cost tracking fix is not separately validated by
another live run (to save $2-3 of budget) but is unit-test covered via the
existing mock path.
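The slug-collision fix (bug 1) reduces to a lookup-or-create pass. A hedged sketch where a dict stands in for the DB layer; `ensure_family` is a hypothetical name, and the comments mark where the real `get_family_by_slug` / `save_skill_family` helpers would sit:

```python
def ensure_family(db: dict, slug: str, name: str) -> dict:
    """Reuse an existing family instead of hitting the UNIQUE constraint."""
    existing = db.get(slug)           # stands in for get_family_by_slug
    if existing is not None:
        return existing               # reuse: no duplicate-slug insert
    family = {"slug": slug, "name": name}
    db[slug] = family                 # stands in for save_skill_family
    return family

db = {}
a = ensure_family(db, "pytest-generators", "Pytest Generators")
b = ensure_family(db, "pytest-generators", "Pytest Generators (dup)")
print(a is b)  # True -- second call reused the first family
```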
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the validate_skill_structure-only stub in
assembly._run_integration_check with a layered check:
- Structural (always runs): validate_skill_structure catches frontmatter
shape, body size, and ${CLAUDE_SKILL_DIR} path resolution failures.
- Behavioral (opt-in via enable_behavioral_check=True): runs the composite
through the real Competitor against the foundation variant's original
challenge, then scores the result via the judging pipeline. Passes only
if aggregate fitness clears BEHAVIORAL_CHECK_THRESHOLD (0.5).
The behavioral check is opt-in because it doubles the API cost per
assembly — every composite gets an extra Competitor + Reviewer run.
Default is off so v2.0 production runs don't unexpectedly double in cost.
Callers that want rigorous integration testing pass
enable_behavioral_check=True to assemble_skill.
Added _find_foundation_challenge(run_id) which looks up the foundation-tier
variant_evolution's challenge_id and loads the corresponding Challenge row
from the DB. This is the regression check: "does the composite still solve
the foundation's original task after the capabilities were merged in?"
The violations list now includes behavioral prefix markers:
behavioral:below_threshold=0.32<0.5
behavioral:competitor_failed=<exc>
behavioral:judging_failed=<exc>
behavioral:no_run_context (when enable_behavioral_check but no run)
assemble_skill gained an enable_behavioral_check kwarg that threads through
to _run_integration_check. The refinement pass reuses the same flag so
the retry path is consistent with the first attempt.
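The layered check and its violation markers can be sketched as follows. The helper names (`structural_check`, `behavioral_score`, `run_integration_check`) are illustrative stand-ins; only the marker strings and the 0.5 threshold come from the change described above:

```python
BEHAVIORAL_CHECK_THRESHOLD = 0.5

def run_integration_check(composite_md, structural_check,
                          behavioral_score=None,
                          enable_behavioral_check=False):
    # Structural layer: always runs.
    violations = list(structural_check(composite_md))
    if enable_behavioral_check:
        if behavioral_score is None:          # no run context available
            violations.append("behavioral:no_run_context")
        else:
            try:
                # Behavioral layer: Competitor run + judging pipeline.
                fitness = behavioral_score(composite_md)
            except Exception as exc:
                violations.append(f"behavioral:competitor_failed={exc}")
            else:
                if fitness < BEHAVIORAL_CHECK_THRESHOLD:
                    violations.append(
                        f"behavioral:below_threshold="
                        f"{fitness}<{BEHAVIORAL_CHECK_THRESHOLD}"
                    )
    return violations

# Structurally valid composite that fails the behavioral bar:
result = run_integration_check(
    "# SKILL.md",
    structural_check=lambda md: [],
    behavioral_score=lambda md: 0.32,
    enable_behavioral_check=True,
)
print(result)  # ['behavioral:below_threshold=0.32<0.5']
```

With the flag left at its default, only the structural layer runs, which is why the 19 existing tests pass unchanged.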
QA: 19 existing Phase 4 tests pass unchanged (behavioral check is opt-in,
default off).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds journal/011-atomic-evolution-port-phases-2-5.md narrating the full
Phase 2-5 port in the established journal voice:
- The Taxonomist agent and the save_run regression that surfaced while
  wiring its integration test.
- The variant evolution orchestrator's decision NOT to recurse into
  run_evolution, and why.
- The challenge-persistence FK bug and the minimal fix.
- The Engineer's prompt engineering, the _detect_conflicts pre-scan, and
  the integration check stub that item 3 now replaces.
- Phase 5's Advanced UI, the swap/evolve endpoints, and the parent_run_id
  resolution path in the re-evolve endpoint.
- The subagent pattern reprise (when to use subagents vs just writing
  directly from the main thread).
- The PR-per-phase workflow we settled on.
- The two test isolation bugs (fams[0] pollution, mock taxonomy slug
  collisions) and the fixes.

PROGRESS.md appended with one entry per post-v2.0 polish item documenting
the four bug fixes the live atomic test surfaced, the behavioral
integration check, and the multi-generation mini-evolution loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ty13r pushed a commit that referenced this pull request on Apr 11, 2026
Browser QA via Chrome MCP on the post-v2.0 `main` (commit e279cf0):
- /taxonomy page renders cleanly with 40 nodes / 48 families summary
- Domain → Focus → Language tree drill-down works at every level (tested
  DevOps → Containers → docker, families grid filters to 1 match)
- Per-node family counts are accurate and descendant-aware
- /runs/{id} for atomic runs renders the real composite SKILL.md with the
  full pytest equivalence-partition test generator content
- Advanced — Variant Breakdown toggle appears ONLY for atomic runs
  (runDetail.evolution_mode === "atomic")
- Clicking Show Advanced renders the VariantBreakdown component with real
  foundation variant data (primary-strategy dimension, 2 variants, fitness
  0.667/0.60, swap dropdown populated, Re-evolve button present)
- swap-variant endpoint works against the real atomic run: curl POST
  deactivates one variant and activates the other, persistence verified
- evolve-variant endpoint works: returns a pending variant_evolution row
  with the right population_size, num_generations, tier, parent_run_id

**Real bug surfaced by the QA pass**: the variant orchestrator was setting
`is_active=True` on every new winner without deactivating any existing
active variants in the same `(family, dimension)`. Across multiple re-runs
on the same family this left multiple variants all marked active
simultaneously, violating the "exactly one active variant per (family,
dimension)" invariant that swap-variant and the frontend rely on.

Fix: skillforge/engine/variant_evolution.py::_run_dimension_mini_evolution
now looks up existing variants in the dimension via get_variants_for_family
and flips is_active=False on each one before stamping the new winner as
active. Symmetric to the swap-variant endpoint's
deactivate-all-then-activate-one pattern.
QA:
- 8 Phase 3 unit tests still pass (the deactivation is a no-op when no
  existing variants exist, which matches the mocked test setup)
- Live atomic test validates both the fix and the item-1 cost tracking gap
  fix (the 5th live test run, with ~$2 extra budget authorized by Matt)

Notes from the QA session that are NOT bugs:
- total_cost_usd shows 0.00 on pre-existing atomic runs because the cost
  tracking fix landed in PR #7 and the earlier runs pre-date it. The 5th
  live test validates the fix.
- best_fitness shows 0.00 on atomic runs because composite genomes have
  empty pareto_objectives (the Engineer's integration_notes live in
  mutation_rationale instead). This is a known gap documented in the v2.1
  backlog — the frontend display needs to fall back to the foundation
  variant's fitness when the composite has no score.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
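The invariant fix boils down to a deactivate-all-then-activate-one pass. A hedged sketch with plain dicts standing in for variant rows; `get_variants_for_family` is the real lookup helper, everything else here is illustrative:

```python
def stamp_active_winner(variants: list[dict], winner_id: str) -> None:
    """Enforce 'exactly one active variant per (family, dimension)'."""
    for v in variants:
        # Deactivate every existing variant, then activate only the winner.
        v["is_active"] = v["id"] == winner_id

# Rows as they might look after multiple re-runs left two variants active:
variants = [
    {"id": "var_a", "is_active": True},   # stale winner from an earlier run
    {"id": "var_b", "is_active": True},   # invariant already violated
    {"id": "var_c", "is_active": False},  # the new winner
]
stamp_active_winner(variants, "var_c")
print([v["id"] for v in variants if v["is_active"]])  # ['var_c']
```

When no existing variants are present the loop is a no-op apart from stamping the winner, which matches why the mocked Phase 3 tests pass unchanged.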
Summary
Post-v2.0 polish pass — items 1, 3, 4, 5 from the backlog. Three commits:

- 43c531f — post(1): live atomic e2e test + 4 bug fixes
- 977e4bf — post(3): real cross-dimension integration check in assembly
- 2b97328 — post(5): journal entry #11 + PROGRESS.md notes

Item 4 (multi-generation breeding loops) landed inside the item 1 commit
because both changes were in variant_evolution.py. Item 2 (browser QA of
the variant breakdown UI) is deferred to a follow-up — it has no code
changes, just manual verification.
Item 1: Live atomic e2e test + 4 bug fixes
New `tests/test_atomic_evolution_live.py` gated behind
`SKILLFORGE_LIVE_TESTS=1`. Runs the full v2.0 atomic pipeline against the
real Anthropic API: Taxonomist classifies → variant evolution orchestrator
→ per-dimension Spawner + Competitor + judging pipeline → Engineer
assembles composite → integration check → evolution_complete.

Took four attempts to go green. Each failure surfaced a real bug the
mocked unit tests couldn't catch:
1. Taxonomist slug collision — the LLM proposed a family slug that already
   existed in the DB; `save_skill_family` hit the UNIQUE constraint. Fix:
   `classify_and_decompose` now checks `get_family_by_slug` first and
   reuses existing families — symmetric to the `_ensure_node`
   lookup-or-create path for taxonomy nodes.
2. save_run cascade wipe — `save_run` used `INSERT OR REPLACE`, which
   triggers `ON DELETE CASCADE` on the row being replaced, silently wiping
   every `variant_evolutions` / `challenges` / `generations` /
   `competition_results` row for the run. Only visible when `save_run` is
   called twice during submission (which is what routes.py does when
   atomic mode persists variant_evolution rows between the first and
   second `save_run` calls). Fix: `INSERT ... ON CONFLICT(id) DO UPDATE
   SET ...` — updates in place without the DELETE cascade.
3. Spawner variant schema mismatch — `spawn_variant_gen0`'s prompt asked
   the LLM to return `frontmatter` and `skill_md_content` as separate
   fields, but `validate_skill_structure` expects frontmatter embedded in
   `skill_md_content`. The existing `spawn_gen0` schema embeds them;
   `spawn_variant_gen0` now matches.
4. Atomic cost tracking gap — the atomic orchestrator never updated
   `run.total_cost_usd`; `_estimate_generation_cost` was only called in
   molecular mode. Fix: the same estimator is now wired into
   `_run_dimension_mini_evolution`, with `cost_update` events emitted per
   mini-generation.

The fourth attempt produced a real composite:
- run_id: `run-atomic-live-test`
- family: `fam_5dbe2684831f`
- best_skill: `composite_d65549afe4bd` — a real "Pytest
  Equivalence-Partition Test Generator" assembled by the Engineer from the
  foundation variant
- Engineer's integration_notes visible in `mutation_rationale`

Budget spent: ~$4 across the four attempts, inside the authorized $5
live-test budget. The cost tracking fix (#4) is not separately validated
by another live run to preserve the remaining ~$1 budget, but is
unit-test covered by the existing mocked Phase 3 tests.
Item 4 (inside the item 1 commit): Multi-generation mini-evolution
`_run_dimension_mini_evolution` now runs a bounded
`for gen in range(num_generations)` loop. When `num_generations <= 1` the
loop collapses to a single pass, matching the Phase 3 behavior.
`DEFAULT_VARIANT_GENS` is bumped from 1 to 2, but routes.py still creates
`VariantEvolution` rows with `num_generations=1` — users opt into breeding
via the `POST /api/families/{id}/evolve-variant` endpoint or a post-v2.0
frontend toggle.

Item 3: Real cross-dimension integration check
Replaces the `validate_skill_structure`-only stub in
`assembly._run_integration_check` with a layered check:

- Structural (always runs): `validate_skill_structure` catches frontmatter
  shape, body size, and `${CLAUDE_SKILL_DIR}` path resolution failures.
- Behavioral (opt-in via `enable_behavioral_check=True`): runs the
  composite through the real Competitor against the foundation variant's
  original challenge, then scores via the judging pipeline. Passes only if
  aggregate fitness clears `BEHAVIORAL_CHECK_THRESHOLD` (0.5).

The behavioral check is opt-in because it doubles the API cost per
assembly. Default is off so v2.0 production runs don't unexpectedly
double in cost. Callers that want rigorous integration testing pass
`enable_behavioral_check=True` to `assemble_skill`.

`_find_foundation_challenge(run_id)` looks up the foundation-tier
variant_evolution's `challenge_id` and loads the corresponding Challenge
row. This is the regression check: "does the composite still solve the
foundation's original task after the capabilities were merged in?"

Item 5: Journal entry #11 + PROGRESS.md notes
`journal/011-atomic-evolution-port-phases-2-5.md` narrates the full Phase
2-5 port in the established journal voice: the Taxonomist agent + save_run
regression, the orchestrator's no-recursion decision, the
challenge-persistence FK bug, the Engineer's prompt engineering +
`_detect_conflicts` pre-scan + integration check stub (that item 3 now
replaces), Phase 5's Advanced UI + swap/evolve endpoints, the subagent
pattern reprise, the PR-per-phase workflow, and the two test isolation
bugs + fixes.

PROGRESS.md appended with one entry documenting all four bug fixes, the
behavioral integration check, and the breeding loop.
Quantitative signal
(`SKILLFORGE_LIVE_TESTS=1` required)

What's deferred to a follow-up
- A follow-up live run confirming `total_cost_usd > 0` now reports
  properly — deferred to preserve the remaining budget.
- Flipping `enable_behavioral_check=True` to the default — a post-v2.1
  decision; needs cost analysis against real runs first.

Test plan
- `uv run pytest tests/test_variant_evolution.py tests/test_engineer.py
  tests/test_assembly.py -q` — 19/19 green
- `uv run pytest tests/test_taxonomist.py
  tests/test_evolve_taxonomist_integration.py -q` — 18/18 green
- `uv run pytest tests/test_models_v2.py tests/test_db_v2.py
  tests/test_taxonomy_queries.py tests/test_taxonomy_api.py
  tests/test_report.py tests/test_swap_evolve_endpoints.py -q` — 61/61
  green
- `uv run ruff check skillforge/agents/taxonomist.py
  skillforge/agents/spawner.py skillforge/db/queries.py
  skillforge/engine/variant_evolution.py skillforge/engine/assembly.py
  tests/test_atomic_evolution_live.py` — clean
- `SKILLFORGE_LIVE_TESTS=1 uv run pytest
  tests/test_atomic_evolution_live.py` — attempt 4 produced a real
  composite end-to-end; the post-hoc cost tracking fix was not
  re-validated by a 5th live run but is unit-test covered

🤖 Generated with Claude Code