post(v2.0): live atomic e2e test + 4 bug fixes + real integration check + journal #011 #7

Merged
ty13r merged 3 commits into main from v2.0/post-1-live-atomic-test on Apr 11, 2026
ty13r commented Apr 11, 2026

Summary

Post-v2.0 polish pass covering items 1, 3, 4, and 5 from the backlog, in three commits (one each for items 1, 3, and 5).

Item 4 (multi-generation breeding loops) landed inside the item 1 commit because both changes were in variant_evolution.py.

Item 2 (browser QA of the variant breakdown UI) is deferred to a follow-up — it has no code changes, just manual verification.

Item 1: Live atomic e2e test + 4 bug fixes

New tests/test_atomic_evolution_live.py gated behind SKILLFORGE_LIVE_TESTS=1. Runs the full v2.0 atomic pipeline against the real Anthropic API: Taxonomist classifies → variant evolution orchestrator → per-dimension Spawner + Competitor + judging pipeline → Engineer assembles composite → integration check → evolution_complete.
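The gating itself is not shown in this PR, so the following is only a minimal sketch of how an opt-in flag like SKILLFORGE_LIVE_TESTS=1 could be checked. The helper name `live_tests_enabled` and the strict comparison against "1" are assumptions for illustration, not code from the repository.

```python
import os
from typing import Mapping, Optional

LIVE_FLAG = "SKILLFORGE_LIVE_TESTS"  # flag name taken from the PR description


def live_tests_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the opt-in flag is set to exactly "1".

    Hypothetical helper: the real test module may gate differently.
    """
    env = os.environ if env is None else env
    return env.get(LIVE_FLAG) == "1"
```

In a pytest module a helper like this would typically back a module-level marker such as `pytestmark = pytest.mark.skipif(not live_tests_enabled(), reason="live API tests are opt-in")`, so a plain `pytest` run never spends API budget.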

Took four attempts to go green. Each failure surfaced a real bug the mocked unit tests couldn't catch:

  1. Taxonomist slug collision — LLM proposed a family slug that already existed in the DB; save_skill_family hit the UNIQUE constraint. Fix: classify_and_decompose now checks get_family_by_slug first and reuses existing families — symmetric to the _ensure_node lookup-or-create path for taxonomy nodes.

  2. save_run cascade wipe: save_run used INSERT OR REPLACE, which triggers ON DELETE CASCADE on the row being replaced, silently wiping every variant_evolutions / challenges / generations / competition_results row for the run. Only visible when save_run is called twice during submission (which is what routes.py does when atomic mode persists variant_evolution rows between the first and second save_run calls). Fix: INSERT ... ON CONFLICT(id) DO UPDATE SET ..., which updates in place without the DELETE cascade.

  3. Spawner variant schema mismatch: spawn_variant_gen0's prompt asked the LLM to return frontmatter and skill_md_content as separate fields, but validate_skill_structure expects frontmatter embedded in skill_md_content. The existing spawn_gen0 schema embeds them; spawn_variant_gen0 now matches.

  4. Atomic cost tracking gap — the atomic orchestrator never updated run.total_cost_usd. _estimate_generation_cost was only called in molecular mode. Fix: the same estimator is now wired into _run_dimension_mini_evolution, with cost_update events emitted per mini-generation.
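Bug 2 is easy to reproduce in isolation. The sketch below uses an in-memory SQLite database with a simplified two-table schema (the table and column layout is an assumption, not the real SkillForge schema) and saves the same run twice with a child row written in between, once with INSERT OR REPLACE and once with an upsert.

```python
import sqlite3

# Old behaviour: REPLACE deletes the conflicting row, then inserts a new one.
REPLACE_SQL = "INSERT OR REPLACE INTO runs (id, status) VALUES (?, ?)"
# Fixed behaviour: update the existing row in place (needs SQLite >= 3.24).
UPSERT_SQL = (
    "INSERT INTO runs (id, status) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET status = excluded.status"
)


def child_rows_after_second_save(save_sql: str) -> int:
    """Save a run, attach one child row, save again, count surviving children."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # cascades only fire when enabled
    conn.execute("CREATE TABLE runs (id TEXT PRIMARY KEY, status TEXT)")
    conn.execute(
        "CREATE TABLE variant_evolutions ("
        " id INTEGER PRIMARY KEY,"
        " run_id TEXT REFERENCES runs(id) ON DELETE CASCADE)"
    )
    conn.execute(save_sql, ("run-1", "pending"))   # first save_run
    conn.execute("INSERT INTO variant_evolutions (run_id) VALUES ('run-1')")
    conn.execute(save_sql, ("run-1", "running"))   # second save_run
    (count,) = conn.execute("SELECT COUNT(*) FROM variant_evolutions").fetchone()
    conn.close()
    return count


if __name__ == "__main__":
    print("INSERT OR REPLACE:", child_rows_after_second_save(REPLACE_SQL))
    print("ON CONFLICT DO UPDATE:", child_rows_after_second_save(UPSERT_SQL))
```

With foreign keys enabled, the REPLACE path leaves zero child rows while the upsert path keeps the one that was inserted, which matches the failure mode described in item 2.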

The fourth attempt produced a real composite:

  • run_id: run-atomic-live-test
  • family: fam_5dbe2684831f
  • best_skill: composite_d65549afe4bd — a real "Pytest Equivalence-Partition Test Generator" assembled by the Engineer from the foundation variant
  • status: complete, 10:46 wall time
  • Engineer's integration_notes visible in the composite's mutation_rationale

Budget spent: ~$4 across the four attempts, inside the authorized $5 live-test budget. The cost tracking fix (#4) is not separately validated by another live run to preserve the remaining ~$1 budget, but is unit-test covered by the existing mocked Phase 3 tests.

Item 4 (inside the item 1 commit): Multi-generation mini-evolution

_run_dimension_mini_evolution now runs a bounded for gen in range(num_generations) loop:

  • gen 0: spawn → compete → judge → score
  • gen 1..N-1: breed from previous gen → compete → judge → score
  • pick the highest-fitness genome across ALL generations as the winner

When num_generations <= 1 the loop collapses to a single pass, matching the Phase 3 behavior. DEFAULT_VARIANT_GENS is bumped from 1 to 2, but routes.py still creates VariantEvolution rows with num_generations=1, so breeding stays opt-in: users enable it via the POST /api/families/{id}/evolve-variant endpoint or a post-v2.0 frontend toggle.
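The loop shape described above can be sketched as follows. This is an illustration under stated assumptions, not the body of _run_dimension_mini_evolution: the `Genome` class and the `spawn`, `breed`, and `evaluate` callables are hypothetical stand-ins, with `evaluate` collapsing the compete, judge, and score steps into one call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Genome:
    name: str
    fitness: float = 0.0


def run_mini_evolution(
    num_generations: int,
    spawn: Callable[[], List[Genome]],
    breed: Callable[[List[Genome]], List[Genome]],
    evaluate: Callable[[Genome], float],
) -> Genome:
    """Run a bounded loop and return the fittest genome seen in any generation."""
    seen: List[Genome] = []
    population: List[Genome] = []
    # max(1, ...) makes num_generations <= 1 collapse to a single gen-0 pass.
    for gen in range(max(1, num_generations)):
        population = spawn() if gen == 0 else breed(population)
        for genome in population:
            genome.fitness = evaluate(genome)  # stand-in for compete/judge/score
        seen.extend(population)
    # The winner is chosen across ALL generations, not just the last one.
    return max(seen, key=lambda g: g.fitness)
```

Selecting over every generation means a strong gen-0 genome is never lost when breeding produces weaker children.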

Item 3: Real cross-dimension integration check

Replaces the validate_skill_structure-only stub in assembly._run_integration_check with a layered check:

  • Structural (always runs): validate_skill_structure catches frontmatter shape, body size, and ${CLAUDE_SKILL_DIR} path resolution failures.
  • Behavioral (opt-in via enable_behavioral_check=True): runs the composite through the real Competitor against the foundation variant's original challenge, then scores via the judging pipeline. Passes only if aggregate fitness clears BEHAVIORAL_CHECK_THRESHOLD (0.5).

The behavioral check is opt-in because it doubles the API cost per assembly. Default is off so v2.0 production runs don't unexpectedly double in cost. Callers that want rigorous integration testing pass enable_behavioral_check=True to assemble_skill.
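A rough sketch of the layering, for orientation only. The function signature, the injected `validate_structure`, `compete`, and `judge` callables, and the choice to skip the behavioral layer when the structural layer already failed are all assumptions; only the threshold value, the opt-in flag name, and the `behavioral:` marker strings come from this PR's description and commit messages.

```python
from typing import Callable, List, Optional

BEHAVIORAL_CHECK_THRESHOLD = 0.5  # value quoted in this PR


def run_integration_check(
    skill_md: str,
    validate_structure: Callable[[str], List[str]],
    compete: Optional[Callable[[str], str]] = None,
    judge: Optional[Callable[[str], float]] = None,
    enable_behavioral_check: bool = False,
) -> List[str]:
    """Return a list of violations; an empty list means the check passed."""
    # Layer 1: structural, always runs.
    violations = list(validate_structure(skill_md))
    # Assumption: skip the paid behavioral layer if structure already failed.
    if violations or not enable_behavioral_check:
        return violations
    # Layer 2: behavioral, opt-in because it costs extra API calls.
    if compete is None or judge is None:
        return ["behavioral:no_run_context"]
    try:
        output = compete(skill_md)
    except Exception as exc:
        return [f"behavioral:competitor_failed={exc}"]
    try:
        fitness = judge(output)
    except Exception as exc:
        return [f"behavioral:judging_failed={exc}"]
    # Assumption: a fitness exactly equal to the threshold counts as a pass.
    if fitness < BEHAVIORAL_CHECK_THRESHOLD:
        return [f"behavioral:below_threshold={fitness}<{BEHAVIORAL_CHECK_THRESHOLD}"]
    return []
```

Keeping the flag off by default means existing callers get only the structural layer and pay nothing extra.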

_find_foundation_challenge(run_id) looks up the foundation-tier variant_evolution's challenge_id and loads the corresponding Challenge row. This is the regression check: "does the composite still solve the foundation's original task after the capabilities were merged in?"

Item 5: Journal entry #11 + PROGRESS.md notes

journal/011-atomic-evolution-port-phases-2-5.md narrates the full Phase 2-5 port in the established journal voice: the Taxonomist agent + save_run regression, the orchestrator's no-recursion decision, the challenge-persistence FK bug, the Engineer's prompt engineering + _detect_conflicts pre-scan + integration check stub (that item 3 now replaces), Phase 5's Advanced UI + swap/evolve endpoints, the subagent pattern reprise, the PR-per-phase workflow, the two test isolation bugs + fixes.

PROGRESS.md appended with one entry documenting all four bug fixes, the behavioral integration check, and the breeding loop.

Quantitative signal

  • 98 v2.0 unit tests still pass (no regressions from any of the changes; ruff clean on all touched files)
  • Live atomic test went green on the fourth attempt after all four bug fixes landed (not gating CI — SKILLFORGE_LIVE_TESTS=1 required)
  • Budget spent: ~$4 of the authorized $5 live-test budget

What's deferred to a follow-up

  • Item 2 (browser QA) — needs dev servers running + manual click-through. No code changes. Will be a separate follow-up.
  • Re-run the live test with the cost tracking fix — would cost another ~$1-2 to confirm total_cost_usd > 0 now reports properly. Deferred to preserve the remaining budget.
  • Flip enable_behavioral_check=True as the default — post-v2.1 decision; needs cost analysis against real runs first.

Test plan

  • uv run pytest tests/test_variant_evolution.py tests/test_engineer.py tests/test_assembly.py -q — 19/19 green
  • uv run pytest tests/test_taxonomist.py tests/test_evolve_taxonomist_integration.py -q — 18/18 green
  • uv run pytest tests/test_models_v2.py tests/test_db_v2.py tests/test_taxonomy_queries.py tests/test_taxonomy_api.py tests/test_report.py tests/test_swap_evolve_endpoints.py -q — 61/61 green
  • uv run ruff check skillforge/agents/taxonomist.py skillforge/agents/spawner.py skillforge/db/queries.py skillforge/engine/variant_evolution.py skillforge/engine/assembly.py tests/test_atomic_evolution_live.py — clean
  • SKILLFORGE_LIVE_TESTS=1 uv run pytest tests/test_atomic_evolution_live.py — attempt 4 produced a real composite end-to-end; the post-hoc cost tracking fix was not re-validated by a 5th live run but is unit-test covered

🤖 Generated with Claude Code

Matt (via Claude Code) and others added 3 commits April 10, 2026 20:34
Adds tests/test_atomic_evolution_live.py gated behind SKILLFORGE_LIVE_TESTS=1.
Runs the full v2.0 atomic pipeline against the real Anthropic API:

  POST /api/evolve (atomic) -> Taxonomist classifies -> variant evolution
  orchestrator -> per-dimension Spawner + Competitor + judging pipeline ->
  Engineer assembles composite -> integration check -> evolution_complete

The test took four attempts to go green. Each failure surfaced a real bug
the mocked unit tests couldn't catch:

1. **Taxonomist slug collision** (taxonomist.py). When the LLM proposed a
   family slug that already existed in the DB, the unconditional
   save_skill_family call hit the UNIQUE constraint. Fixed by checking
   get_family_by_slug first and reusing the existing family if found,
   symmetric to the _ensure_node lookup-or-create path for taxonomy nodes.

2. **save_run cascade wipe** (queries.py). save_run used INSERT OR REPLACE,
   which triggers ON DELETE CASCADE on the row being replaced — silently
   wiping every variant_evolutions / challenges / generations / competition
   row for the run. Only visible when save_run is called twice during run
   submission, which is exactly what _classify_run_via_taxonomist does when
   atomic mode persists variant_evolution rows between the first and second
   save_run calls. Fixed by switching to INSERT ... ON CONFLICT(id) DO
   UPDATE SET ... which updates in place without the DELETE cascade.

3. **Spawner variant schema mismatch** (spawner.py). spawn_variant_gen0's
   prompt asked the LLM to return frontmatter and skill_md_content as
   separate fields, but validate_skill_structure expects the frontmatter
   embedded in skill_md_content. The existing spawn_gen0 schema embeds
   frontmatter in skill_md_content — I updated spawn_variant_gen0 to match.

4. **Atomic cost tracking gap** (variant_evolution.py). The atomic
   orchestrator never updated run.total_cost_usd — molecular mode's
   _estimate_generation_cost was only called in the molecular loop. The
   fix wires the same estimator into _run_dimension_mini_evolution and
   emits cost_update events per mini-generation so the frontend and
   budget tracking both see real numbers.
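
As an illustration of fix 1, here is a minimal lookup-or-create sketch. `FamilyStore` is a hypothetical in-memory stand-in for a table with a UNIQUE constraint on the slug, and `get_or_create_family` is an invented name; the real code calls get_family_by_slug and save_skill_family, whose signatures are not shown here.

```python
from typing import Dict, Optional


class FamilyStore:
    """In-memory stand-in for a table with a UNIQUE constraint on slug."""

    def __init__(self) -> None:
        self._by_slug: Dict[str, dict] = {}

    def get_by_slug(self, slug: str) -> Optional[dict]:
        return self._by_slug.get(slug)

    def save(self, family: dict) -> None:
        # Mimics the database rejecting a duplicate slug.
        if family["slug"] in self._by_slug:
            raise ValueError("UNIQUE constraint failed: slug")
        self._by_slug[family["slug"]] = family


def get_or_create_family(store: FamilyStore, slug: str, label: str) -> dict:
    """Reuse an existing family when the slug is taken, else create one."""
    existing = store.get_by_slug(slug)
    if existing is not None:
        return existing  # the LLM proposed a slug we already have
    family = {"slug": slug, "label": label}
    store.save(family)
    return family
```

Calling the helper twice with the same slug returns the first family instead of raising, which is the behavior the unconditional save lacked.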

The fourth attempt produced a real composite:
  - run_id: run-atomic-live-test
  - family: fam_5dbe2684831f (test-fixture spec classification)
  - best_skill: composite_d65549afe4bd — a real "Pytest Equivalence-
    Partition Test Generator" assembled by the Engineer from the
    foundation variant
  - status: complete, 10:46 wall time
  - all variant_evolutions terminal
  - Engineer's integration_notes visible in mutation_rationale

Budget: ~$4 across all four attempts, inside the authorized $5 live-test
budget. The post-hoc cost tracking fix is not separately validated by
another live run (saving $2-3 of budget) but is unit-test covered via the
existing mock path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the validate_skill_structure-only stub in
assembly._run_integration_check with a layered check:

- Structural (always runs): validate_skill_structure catches frontmatter
  shape, body size, and ${CLAUDE_SKILL_DIR} path resolution failures.
- Behavioral (opt-in via enable_behavioral_check=True): runs the composite
  through the real Competitor against the foundation variant's original
  challenge, then scores the result via the judging pipeline. Passes only
  if aggregate fitness clears BEHAVIORAL_CHECK_THRESHOLD (0.5).

The behavioral check is opt-in because it doubles the API cost per
assembly — every composite gets an extra Competitor + Reviewer run.
Default is off so v2.0 production runs don't unexpectedly double in cost.
Callers that want rigorous integration testing pass
enable_behavioral_check=True to assemble_skill.

Added _find_foundation_challenge(run_id) which looks up the foundation-tier
variant_evolution's challenge_id and loads the corresponding Challenge row
from the DB. This is the regression check: "does the composite still solve
the foundation's original task after the capabilities were merged in?"

The violations list now includes behavioral prefix markers:
  behavioral:below_threshold=0.32<0.5
  behavioral:competitor_failed=<exc>
  behavioral:judging_failed=<exc>
  behavioral:no_run_context  (when enable_behavioral_check but no run)

assemble_skill gained an enable_behavioral_check kwarg that threads through
to _run_integration_check. The refinement pass re-uses the same flag so
the retry path is consistent with the first attempt.

QA: 19 existing Phase 4 tests pass unchanged (behavioral check is opt-in,
default off).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds journal/011-atomic-evolution-port-phases-2-5.md narrating the full
Phase 2-5 port in the established journal voice:

- The Taxonomist agent and the save_run regression that surfaced while
  wiring its integration test.
- The variant evolution orchestrator's decision NOT to recurse into
  run_evolution, and why.
- The challenge-persistence FK bug and the minimal fix.
- The Engineer's prompt engineering, the _detect_conflicts pre-scan, and
  the integration check stub that item 3 now replaces.
- Phase 5's Advanced UI, the swap/evolve endpoints, and the
  parent_run_id resolution path in the re-evolve endpoint.
- The subagent pattern reprise (when to use subagents vs just writing
  directly from the main thread).
- The PR-per-phase workflow we settled on.
- The two test isolation bugs (fams[0] pollution, mock taxonomy slug
  collisions) and the fixes.

PROGRESS.md appended with one entry per post-v2.0 polish item documenting
the four bug fixes the live atomic test surfaced, the behavioral
integration check, and the multi-generation mini-evolution loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ty13r merged commit e279cf0 into main Apr 11, 2026
ty13r pushed a commit that referenced this pull request Apr 11, 2026
Browser QA via Chrome MCP on the post-v2.0 `main` (commit e279cf0):

- /taxonomy page renders cleanly with 40 nodes / 48 families summary
- Domain → Focus → Language tree drill-down works at every level
  (tested DevOps → Containers → docker, families grid filters to 1 match)
- Per-node family counts are accurate and descendant-aware
- /runs/{id} for atomic runs renders the real composite SKILL.md with
  the full pytest equivalence-partition test generator content
- Advanced — Variant Breakdown toggle appears ONLY for atomic runs
  (runDetail.evolution_mode === "atomic")
- Clicking Show Advanced renders the VariantBreakdown component with
  real foundation variant data (primary-strategy dimension, 2 variants,
  fitness 0.667/0.60, swap dropdown populated, Re-evolve button present)
- swap-variant endpoint works against the real atomic run: curl POST
  deactivates one variant and activates the other, persistence verified
- evolve-variant endpoint works: returns a pending variant_evolution row
  with the right population_size, num_generations, tier, parent_run_id

**Real bug surfaced by the QA pass**: the variant orchestrator was
setting ``is_active=True`` on every new winner without deactivating any
existing active variants in the same ``(family, dimension)``. Across
multiple re-runs on the same family this left multiple variants all
marked active simultaneously, violating the "exactly one active variant
per (family, dimension)" invariant that swap-variant and the frontend
rely on.

Fix: skillforge/engine/variant_evolution.py::_run_dimension_mini_evolution
now looks up existing variants in the dimension via get_variants_for_family
and flips is_active=False on each one before stamping the new winner as
active. Symmetric to the swap-variant endpoint's deactivate-all-then-
activate-one pattern.
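
The deactivate-all-then-activate-one pattern can be sketched as below. The `Variant` dataclass and the in-memory list are assumptions made for illustration; the real fix goes through get_variants_for_family and the database.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Variant:
    id: str
    family_id: str
    dimension: str
    is_active: bool = False


def activate_winner(variants: List[Variant], winner: Variant) -> None:
    """Leave exactly one active variant in the winner's (family, dimension)."""
    for variant in variants:
        same_slot = (
            variant.family_id == winner.family_id
            and variant.dimension == winner.dimension
        )
        if same_slot:
            variant.is_active = False  # deactivate every sibling first
    winner.is_active = True            # then stamp the new winner
    if not any(v is winner for v in variants):
        variants.append(winner)
```

When no prior variants exist the loop is a no-op, which is consistent with the mocked Phase 3 tests passing unchanged.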

QA
- 8 Phase 3 unit tests still pass (the deactivation is a no-op when no
  existing variants exist, which matches the mocked test setup)
- Live atomic test validates both the fix and the item-1 cost tracking
  gap fix (the 5th live test run, with ~$2 extra budget authorized by
  Matt)

Notes from the QA session that are NOT bugs:
- total_cost_usd shows 0.00 on pre-existing atomic runs because the
  cost tracking fix landed in PR #7 and the earlier runs pre-date it.
  The 5th live test validates the fix.
- best_fitness shows 0.00 on atomic runs because composite genomes
  have empty pareto_objectives (the Engineer's integration_notes live
  in mutation_rationale instead). This is a known gap documented in
  the v2.1 backlog — the frontend display needs to fall back to the
  foundation variant's fitness when the composite has no score.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ty13r deleted the v2.0/post-1-live-atomic-test branch April 19, 2026 20:21