Adds tests/test_atomic_evolution_live.py gated behind SKILLFORGE_LIVE_TESTS=1.
Runs the full v2.0 atomic pipeline against the real Anthropic API:
POST /api/evolve (atomic) -> Taxonomist classifies -> variant evolution
orchestrator -> per-dimension Spawner + Competitor + judging pipeline ->
Engineer assembles composite -> integration check -> evolution_complete
The test took four attempts to go green. Each failure surfaced a real bug
the mocked unit tests couldn't catch:
1. **Taxonomist slug collision** (taxonomist.py). When the LLM proposed a
family slug that already existed in the DB, the unconditional
save_skill_family call hit the UNIQUE constraint. Fixed by checking
get_family_by_slug first and reusing the existing family if found,
symmetric to the _ensure_node lookup-or-create path for taxonomy nodes.
2. **save_run cascade wipe** (queries.py). save_run used INSERT OR REPLACE,
which triggers ON DELETE CASCADE on the row being replaced — silently
wiping every variant_evolutions / challenges / generations /
competition_results row for the run. Only visible when save_run is called
twice during run
submission, which is exactly what _classify_run_via_taxonomist does when
atomic mode persists variant_evolution rows between the first and second
save_run calls. Fixed by switching to INSERT ... ON CONFLICT(id) DO
UPDATE SET ... which updates in place without the DELETE cascade.
3. **Spawner variant schema mismatch** (spawner.py). spawn_variant_gen0's
prompt asked the LLM to return frontmatter and skill_md_content as
separate fields, but validate_skill_structure expects the frontmatter
embedded in skill_md_content. The existing spawn_gen0 schema embeds
frontmatter in skill_md_content — I updated spawn_variant_gen0 to match.
4. **Atomic cost tracking gap** (variant_evolution.py). The atomic
orchestrator never updated run.total_cost_usd — molecular mode's
_estimate_generation_cost was only called in the molecular loop. The
fix wires the same estimator into _run_dimension_mini_evolution and
emits cost_update events per mini-generation so the frontend and
budget tracking both see real numbers.
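The INSERT OR REPLACE footgun in bug 2 can be reproduced in isolation with plain sqlite3. A minimal standalone sketch (toy two-table schema, not SkillForge's actual tables):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE runs (id TEXT PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE challenges ("
    "  id INTEGER PRIMARY KEY,"
    "  run_id TEXT REFERENCES runs(id) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO runs VALUES ('r1', 'running')")
conn.execute("INSERT INTO challenges (run_id) VALUES ('r1')")

# INSERT OR REPLACE is DELETE + INSERT under the hood; the hidden
# DELETE fires ON DELETE CASCADE and wipes the child row.
conn.execute("INSERT OR REPLACE INTO runs VALUES ('r1', 'complete')")
wiped = conn.execute("SELECT COUNT(*) FROM challenges").fetchone()[0]
print(wiped)  # 0 -- child row silently gone

# The upsert form updates in place: no DELETE, children survive.
conn.execute("INSERT INTO challenges (run_id) VALUES ('r1')")
conn.execute(
    "INSERT INTO runs VALUES ('r1', 'done') "
    "ON CONFLICT(id) DO UPDATE SET status = excluded.status"
)
kept = conn.execute("SELECT COUNT(*) FROM challenges").fetchone()[0]
print(kept)  # 1 -- child row preserved
```

The demo needs PRAGMA foreign_keys = ON because SQLite leaves foreign-key enforcement off by default, which is also why the wipe is easy to miss in ad-hoc testing.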
The fourth attempt produced a real composite:
- run_id: run-atomic-live-test
- family: fam_5dbe2684831f (test-fixture spec classification)
- best_skill: composite_d65549afe4bd — a real "Pytest Equivalence-
Partition Test Generator" assembled by the Engineer from the
foundation variant
- status: complete, 10:46 wall time
- all variant_evolutions terminal
- Engineer's integration_notes visible in mutation_rationale
Budget: ~$4 across all four attempts, inside the authorized $5 live-test
budget. The post-hoc cost tracking fix is not separately validated by
another live run (to save $2-3 of budget) but is unit-test covered via the
existing mock path.
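The slug-collision fix (bug 1) reduces to a lookup-or-create pass. A hedged sketch where a dict stands in for the DB layer; `ensure_family` is a hypothetical name, and the comments mark where the real `get_family_by_slug` / `save_skill_family` helpers would sit:

```python
def ensure_family(db: dict, slug: str, name: str) -> dict:
    """Reuse an existing family instead of hitting the UNIQUE constraint."""
    existing = db.get(slug)           # stands in for get_family_by_slug
    if existing is not None:
        return existing               # reuse: no duplicate-slug insert
    family = {"slug": slug, "name": name}
    db[slug] = family                 # stands in for save_skill_family
    return family

db = {}
a = ensure_family(db, "pytest-generators", "Pytest Generators")
b = ensure_family(db, "pytest-generators", "Pytest Generators (dup)")
print(a is b)  # True -- second call reused the first family
```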
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the validate_skill_structure-only stub in
assembly._run_integration_check with a layered check:
- Structural (always runs): validate_skill_structure catches frontmatter
shape, body size, and ${CLAUDE_SKILL_DIR} path resolution failures.
- Behavioral (opt-in via enable_behavioral_check=True): runs the composite
through the real Competitor against the foundation variant's original
challenge, then scores the result via the judging pipeline. Passes only
if aggregate fitness clears BEHAVIORAL_CHECK_THRESHOLD (0.5).
The behavioral check is opt-in because it doubles the API cost per
assembly — every composite gets an extra Competitor + Reviewer run.
Default is off so v2.0 production runs don't unexpectedly double in cost.
Callers that want rigorous integration testing pass
enable_behavioral_check=True to assemble_skill.
Added _find_foundation_challenge(run_id) which looks up the foundation-tier
variant_evolution's challenge_id and loads the corresponding Challenge row
from the DB. This is the regression check: "does the composite still solve
the foundation's original task after the capabilities were merged in?"
The violations list now includes behavioral prefix markers:
behavioral:below_threshold=0.32<0.5
behavioral:competitor_failed=<exc>
behavioral:judging_failed=<exc>
behavioral:no_run_context (when enable_behavioral_check but no run)
assemble_skill gained an enable_behavioral_check kwarg that threads through
to _run_integration_check. The refinement pass reuses the same flag so
the retry path is consistent with the first attempt.
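The layered check and its violation markers can be sketched as follows. The helper names (`structural_check`, `behavioral_score`, `run_integration_check`) are illustrative stand-ins; only the marker strings and the 0.5 threshold come from the change described above:

```python
BEHAVIORAL_CHECK_THRESHOLD = 0.5

def run_integration_check(composite_md, structural_check,
                          behavioral_score=None,
                          enable_behavioral_check=False):
    # Structural layer: always runs.
    violations = list(structural_check(composite_md))
    if enable_behavioral_check:
        if behavioral_score is None:          # no run context available
            violations.append("behavioral:no_run_context")
        else:
            try:
                # Behavioral layer: Competitor run + judging pipeline.
                fitness = behavioral_score(composite_md)
            except Exception as exc:
                violations.append(f"behavioral:competitor_failed={exc}")
            else:
                if fitness < BEHAVIORAL_CHECK_THRESHOLD:
                    violations.append(
                        f"behavioral:below_threshold="
                        f"{fitness}<{BEHAVIORAL_CHECK_THRESHOLD}"
                    )
    return violations

# Structurally valid composite that fails the behavioral bar:
result = run_integration_check(
    "# SKILL.md",
    structural_check=lambda md: [],
    behavioral_score=lambda md: 0.32,
    enable_behavioral_check=True,
)
print(result)  # ['behavioral:below_threshold=0.32<0.5']
```

With the flag left at its default, only the structural layer runs, which is why the 19 existing tests pass unchanged.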
QA: 19 existing Phase 4 tests pass unchanged (behavioral check is opt-in,
default off).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds journal/011-atomic-evolution-port-phases-2-5.md narrating the full
Phase 2-5 port in the established journal voice:
- The Taxonomist agent and the save_run regression that surfaced while
  wiring its integration test.
- The variant evolution orchestrator's decision NOT to recurse into
  run_evolution, and why.
- The challenge-persistence FK bug and the minimal fix.
- The Engineer's prompt engineering, the _detect_conflicts pre-scan, and
  the integration check stub that item 3 now replaces.
- Phase 5's Advanced UI, the swap/evolve endpoints, and the parent_run_id
  resolution path in the re-evolve endpoint.
- The subagent pattern reprise (when to use subagents vs just writing
  directly from the main thread).
- The PR-per-phase workflow we settled on.
- The two test isolation bugs (fams[0] pollution, mock taxonomy slug
  collisions) and the fixes.

PROGRESS.md appended with one entry per post-v2.0 polish item documenting
the four bug fixes the live atomic test surfaced, the behavioral
integration check, and the multi-generation mini-evolution loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ty13r pushed a commit that referenced this pull request on Apr 11, 2026
Browser QA via Chrome MCP on the post-v2.0 `main` (commit e279cf0):
- /taxonomy page renders cleanly with 40 nodes / 48 families summary
- Domain → Focus → Language tree drill-down works at every level (tested
  DevOps → Containers → docker, families grid filters to 1 match)
- Per-node family counts are accurate and descendant-aware
- /runs/{id} for atomic runs renders the real composite SKILL.md with the
  full pytest equivalence-partition test generator content
- Advanced — Variant Breakdown toggle appears ONLY for atomic runs
  (runDetail.evolution_mode === "atomic")
- Clicking Show Advanced renders the VariantBreakdown component with real
  foundation variant data (primary-strategy dimension, 2 variants, fitness
  0.667/0.60, swap dropdown populated, Re-evolve button present)
- swap-variant endpoint works against the real atomic run: curl POST
  deactivates one variant and activates the other, persistence verified
- evolve-variant endpoint works: returns a pending variant_evolution row
  with the right population_size, num_generations, tier, parent_run_id

**Real bug surfaced by the QA pass**: the variant orchestrator was setting
`is_active=True` on every new winner without deactivating any existing
active variants in the same `(family, dimension)`. Across multiple re-runs
on the same family this left multiple variants all marked active
simultaneously, violating the "exactly one active variant per (family,
dimension)" invariant that swap-variant and the frontend rely on.

Fix: skillforge/engine/variant_evolution.py::_run_dimension_mini_evolution
now looks up existing variants in the dimension via get_variants_for_family
and flips is_active=False on each one before stamping the new winner as
active. Symmetric to the swap-variant endpoint's
deactivate-all-then-activate-one pattern.
QA:
- 8 Phase 3 unit tests still pass (the deactivation is a no-op when no
  existing variants exist, which matches the mocked test setup)
- Live atomic test validates both the fix and the item-1 cost tracking gap
  fix (the 5th live test run, with ~$2 extra budget authorized by Matt)

Notes from the QA session that are NOT bugs:
- total_cost_usd shows 0.00 on pre-existing atomic runs because the cost
  tracking fix landed in PR #7 and the earlier runs pre-date it. The 5th
  live test validates the fix.
- best_fitness shows 0.00 on atomic runs because composite genomes have
  empty pareto_objectives (the Engineer's integration_notes live in
  mutation_rationale instead). This is a known gap documented in the v2.1
  backlog — the frontend display needs to fall back to the foundation
  variant's fitness when the composite has no score.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
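The invariant fix boils down to a deactivate-all-then-activate-one pass. A hedged sketch with plain dicts standing in for variant rows; `get_variants_for_family` is the real lookup helper, everything else here is illustrative:

```python
def stamp_active_winner(variants: list[dict], winner_id: str) -> None:
    """Enforce 'exactly one active variant per (family, dimension)'."""
    for v in variants:
        # Deactivate every existing variant, then activate only the winner.
        v["is_active"] = v["id"] == winner_id

# Rows as they might look after multiple re-runs left two variants active:
variants = [
    {"id": "var_a", "is_active": True},   # stale winner from an earlier run
    {"id": "var_b", "is_active": True},   # invariant already violated
    {"id": "var_c", "is_active": False},  # the new winner
]
stamp_active_winner(variants, "var_c")
print([v["id"] for v in variants if v["is_active"]])  # ['var_c']
```

When no existing variants are present the loop is a no-op apart from stamping the winner, which matches why the mocked Phase 3 tests pass unchanged.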
Summary
Post-v2.0 polish pass — items 1, 3, 4, 5 from the backlog. Three commits:

- 43c531f — post(1): live atomic e2e test + 4 bug fixes
- 977e4bf — post(3): real cross-dimension integration check in assembly
- 2b97328 — post(5): journal entry #11 + PROGRESS.md notes

Item 4 (multi-generation breeding loops) landed inside the item 1 commit
because both changes were in variant_evolution.py. Item 2 (browser QA of
the variant breakdown UI) is deferred to a follow-up — it has no code
changes, just manual verification.
Item 1: Live atomic e2e test + 4 bug fixes
New `tests/test_atomic_evolution_live.py` gated behind
`SKILLFORGE_LIVE_TESTS=1`. Runs the full v2.0 atomic pipeline against the
real Anthropic API: Taxonomist classifies → variant evolution orchestrator
→ per-dimension Spawner + Competitor + judging pipeline → Engineer
assembles composite → integration check → evolution_complete.

Took four attempts to go green. Each failure surfaced a real bug the
mocked unit tests couldn't catch:
1. Taxonomist slug collision — the LLM proposed a family slug that already
   existed in the DB; `save_skill_family` hit the UNIQUE constraint. Fix:
   `classify_and_decompose` now checks `get_family_by_slug` first and
   reuses existing families — symmetric to the `_ensure_node`
   lookup-or-create path for taxonomy nodes.
2. save_run cascade wipe — `save_run` used `INSERT OR REPLACE`, which
   triggers `ON DELETE CASCADE` on the row being replaced, silently wiping
   every `variant_evolutions` / `challenges` / `generations` /
   `competition_results` row for the run. Only visible when `save_run` is
   called twice during submission (which is what routes.py does when
   atomic mode persists variant_evolution rows between the first and
   second `save_run` calls). Fix: `INSERT ... ON CONFLICT(id) DO UPDATE
   SET ...` — updates in place without the DELETE cascade.
3. Spawner variant schema mismatch — `spawn_variant_gen0`'s prompt asked
   the LLM to return `frontmatter` and `skill_md_content` as separate
   fields, but `validate_skill_structure` expects frontmatter embedded in
   `skill_md_content`. The existing `spawn_gen0` schema embeds them;
   `spawn_variant_gen0` now matches.
4. Atomic cost tracking gap — the atomic orchestrator never updated
   `run.total_cost_usd`; `_estimate_generation_cost` was only called in
   molecular mode. Fix: the same estimator is now wired into
   `_run_dimension_mini_evolution`, with `cost_update` events emitted per
   mini-generation.

The fourth attempt produced a real composite:
- run_id: `run-atomic-live-test`
- family: `fam_5dbe2684831f`
- best_skill: `composite_d65549afe4bd` — a real "Pytest
  Equivalence-Partition Test Generator" assembled by the Engineer from the
  foundation variant
- Engineer's integration_notes visible in `mutation_rationale`

Budget spent: ~$4 across the four attempts, inside the authorized $5
live-test budget. The cost tracking fix (#4) is not separately validated
by another live run to preserve the remaining ~$1 budget, but is
unit-test covered by the existing mocked Phase 3 tests.
Item 4 (inside the item 1 commit): Multi-generation mini-evolution
`_run_dimension_mini_evolution` now runs a bounded
`for gen in range(num_generations)` loop. When `num_generations <= 1` the
loop collapses to a single pass, matching the Phase 3 behavior.
`DEFAULT_VARIANT_GENS` is bumped from 1 to 2, but routes.py still creates
`VariantEvolution` rows with `num_generations=1` — users opt into breeding
via the `POST /api/families/{id}/evolve-variant` endpoint or a post-v2.0
frontend toggle.

Item 3: Real cross-dimension integration check
Replaces the `validate_skill_structure`-only stub in
`assembly._run_integration_check` with a layered check:

- Structural (always runs): `validate_skill_structure` catches frontmatter
  shape, body size, and `${CLAUDE_SKILL_DIR}` path resolution failures.
- Behavioral (opt-in via `enable_behavioral_check=True`): runs the
  composite through the real Competitor against the foundation variant's
  original challenge, then scores via the judging pipeline. Passes only if
  aggregate fitness clears `BEHAVIORAL_CHECK_THRESHOLD` (0.5).

The behavioral check is opt-in because it doubles the API cost per
assembly. Default is off so v2.0 production runs don't unexpectedly
double in cost. Callers that want rigorous integration testing pass
`enable_behavioral_check=True` to `assemble_skill`.

`_find_foundation_challenge(run_id)` looks up the foundation-tier
variant_evolution's `challenge_id` and loads the corresponding Challenge
row. This is the regression check: "does the composite still solve the
foundation's original task after the capabilities were merged in?"

Item 5: Journal entry #11 + PROGRESS.md notes
`journal/011-atomic-evolution-port-phases-2-5.md` narrates the full Phase
2-5 port in the established journal voice: the Taxonomist agent + save_run
regression, the orchestrator's no-recursion decision, the
challenge-persistence FK bug, the Engineer's prompt engineering +
`_detect_conflicts` pre-scan + integration check stub (that item 3 now
replaces), Phase 5's Advanced UI + swap/evolve endpoints, the subagent
pattern reprise, the PR-per-phase workflow, and the two test isolation
bugs + fixes.

PROGRESS.md appended with one entry documenting all four bug fixes, the
behavioral integration check, and the breeding loop.
Quantitative signal
(`SKILLFORGE_LIVE_TESTS=1` required)

What's deferred to a follow-up
- A follow-up live run confirming `total_cost_usd > 0` now reports
  properly — deferred to preserve the remaining budget.
- Flipping `enable_behavioral_check=True` to the default — a post-v2.1
  decision; needs cost analysis against real runs first.

Test plan
- `uv run pytest tests/test_variant_evolution.py tests/test_engineer.py
  tests/test_assembly.py -q` — 19/19 green
- `uv run pytest tests/test_taxonomist.py
  tests/test_evolve_taxonomist_integration.py -q` — 18/18 green
- `uv run pytest tests/test_models_v2.py tests/test_db_v2.py
  tests/test_taxonomy_queries.py tests/test_taxonomy_api.py
  tests/test_report.py tests/test_swap_evolve_endpoints.py -q` — 61/61
  green
- `uv run ruff check skillforge/agents/taxonomist.py
  skillforge/agents/spawner.py skillforge/db/queries.py
  skillforge/engine/variant_evolution.py skillforge/engine/assembly.py
  tests/test_atomic_evolution_live.py` — clean
- `SKILLFORGE_LIVE_TESTS=1 uv run pytest
  tests/test_atomic_evolution_live.py` — attempt 4 produced a real
  composite end-to-end; the post-hoc cost tracking fix was not
  re-validated by a 5th live run but is unit-test covered

🤖 Generated with Claude Code