fix: 3 LLM-fragility bugs exposed by cheap-Haiku atomic live test by ty13r · Pull Request #54 · ty13r/skillforge

ty13r · 2026-04-19T21:09:02Z

Summary

Fixes three LLM-fragility bugs uncovered by the first live atomic-evolution runs. Each one was killing the whole pipeline in 1-of-3 cheap-Haiku runs; each one now repairs at the boundary instead of escalating to an LLM retry.

Bug 1: Engineer oversize composite description

Haiku routinely overshoots the 250-char composite description cap by 10–40%. Previous behavior: raise ValueError, burn a ~$0.20 retry that often produced the same class of overshoot.

Fix: _try_truncate_description helper in agents/engineer.py repairs oversize descriptions at a word boundary, preserving the "Use when" pushy-pattern marker. _validate_composite_shape now repairs in place before raising; only truly unsalvageable cases (no "Use when" in the first 250 chars) still raise.
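The repair described above can be sketched as follows. This is an illustrative stand-in, not the code in agents/engineer.py: the function name, the constants, and the exact None-on-failure contract are assumptions inferred from the description and the unit test names listed further down.

```python
from typing import Optional

MAX_DESCRIPTION_CHARS = 250  # cap quoted in this PR
MARKER = "Use when"          # pushy-pattern marker quoted in this PR


def try_truncate_description(
    description: str, cap: int = MAX_DESCRIPTION_CHARS
) -> Optional[str]:
    """Return a description that fits within `cap`, or None if unsalvageable.

    Hypothetical sketch: the real helper may differ in details.
    """
    if len(description) <= cap:
        return description  # already valid, nothing to repair

    window = description[:cap]
    if MARKER not in window:
        return None  # marker missing or cut off: the caller should raise

    if description[cap].isspace():
        # The cap happens to fall exactly at the end of a word.
        truncated = window.rstrip()
    else:
        # Back up to the last space so no word is cut in half.
        cut = window.rfind(" ")
        if cut == -1:
            return None
        truncated = window[:cut].rstrip()

    # Backing up may have removed the marker itself; treat that as unsalvageable.
    return truncated if MARKER in truncated else None
```

A validator built on this would call the helper first and raise ValueError only when it returns None, which is what keeps the common 10–40% overshoot from costing an LLM retry.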

Bug 2: Managed-agents skill upload 400 on missing YAML frontmatter

~1% of managed_agents.upload_skill calls returned 400 SKILL.md must start with YAML frontmatter (---). Models occasionally prepend a UTF-8 BOM or stray whitespace that our structural validator tolerates but the API doesn't.

Fix: upload_skill now lstrips BOM + whitespace before calling the API. If the normalized content still doesn't start with ---, raise ValueError with a clear message instead of round-tripping to a generic 400. Caller fallback-to-inline still works for genuinely bad content; BOM/whitespace damage now uploads cleanly.
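A minimal sketch of the normalization step, assuming the payload is handled as a Python str. The function name and the error message are hypothetical; only the behavior (strip BOM and leading whitespace, then require a leading ---) comes from the description above. Note that a bare str.lstrip() would not be enough, because U+FEFF is not classified as whitespace in Python.

```python
BOM = "\ufeff"  # UTF-8 byte-order mark once decoded to str


def normalize_skill_md(content: str) -> str:
    """Strip a leading BOM and whitespace, then require YAML frontmatter.

    Hypothetical sketch of the pre-upload check, not the project's code.
    """
    # lstrip(chars) removes any leading mix of the listed characters,
    # so "BOM + newline + spaces" is handled in one pass.
    normalized = content.lstrip(BOM + " \t\r\n")
    if not normalized.startswith("---"):
        raise ValueError(
            "SKILL.md must start with YAML frontmatter (---) "
            "after BOM/whitespace normalization"
        )
    return normalized
```

Raising locally gives the caller a specific error to act on (for example, falling back to inline) without spending a network round trip on a request that is certain to be rejected.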

Bug 3: Spawner missing `${CLAUDE_SKILL_DIR}` reference files

Spawner emits SKILL.md bodies that reference references/*-guide.md in prose but forget to include the file in supporting_files. Structural rule 8 rejected them, killing 1-of-3 atomic dimensions.

Fix: new _auto_repair_missing_references helper in agents/spawner.py runs before _validate_genomes. For each missing reference, stub a placeholder file with a clear auto-generated marker. Skill renders, reference resolves at runtime, validation passes; the Breeder can flesh out stubs in later generations.
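The stubbing step might look roughly like the sketch below. Everything here is an assumption for illustration: the genome is reduced to a SKILL.md string plus a dict of supporting files keyed by relative path, and the regex, stub text, and function name are invented rather than taken from agents/spawner.py.

```python
import re

# Matches ${CLAUDE_SKILL_DIR}/<path> and captures the relative path.
REFERENCE_PATTERN = re.compile(r"\$\{CLAUDE_SKILL_DIR\}/([\w./-]+)")
STUB_MARKER = "<!-- auto-generated stub: replace with real content -->"


def auto_repair_missing_references(
    skill_md: str, supporting_files: dict[str, str]
) -> list[str]:
    """Add a placeholder for each referenced file that is missing.

    Mutates `supporting_files` in place and returns the paths it stubbed.
    Hypothetical sketch, not the project's implementation.
    """
    added: list[str] = []
    for raw_path in REFERENCE_PATTERN.findall(skill_md):
        # Drop a sentence-ending period that the regex may have captured.
        path = raw_path.rstrip(".")
        if path and path not in supporting_files:
            supporting_files[path] = (
                f"{STUB_MARKER}\n\n# {path}\n\nPlaceholder content.\n"
            )
            added.append(path)
    return added
```

Because the helper leaves existing files alone and is a no-op when nothing is missing, it can run unconditionally before validation.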

Verification — live run passed

Full end-to-end atomic evolution on this branch:

SKILLFORGE_TEST_TIER=cheap \
SKILLFORGE_MODEL_ENGINEER=claude-sonnet-4-6 \
SKILLFORGE_LIVE_TESTS=1 \
SKILLFORGE_COMPETITOR_BACKEND=managed \
uv run pytest tests/test_atomic_evolution_live.py

Result: 1 passed in 19m 04s
Real cost (from DB): $1.46 (under the $5 authorized budget)
Pareto objectives (all 5 axes measured): correctness=0.00, code_quality=0.97, token_efficiency=0.04, trigger_accuracy=0.98, consistency=0.00
Zero upload failures, zero skill leaks (vs 3 + 5 on the first un-fixed run)

Before this PR, the same config failed at Engineer assembly. After this PR, the pipeline runs clean end-to-end.

Test plan

uv run ruff check skillforge — clean
uv run mypy skillforge — 65 files pass
uv run pytest tests/ — 410 passed (+7 from new tests), 2 skipped
Live atomic test passes with cheap Haiku + Sonnet Engineer — confirmed locally ($1.46)
Frontend untouched — still green

Unit tests added (7 new, plus 2 that replace existing tests)

test_try_truncate_description_noop_when_under_cap
test_try_truncate_description_repairs_oversize_at_word_boundary
test_try_truncate_description_returns_none_when_use_when_is_past_cap
test_validate_composite_shape_truncates_oversize_description_in_place (replaces old reject-test)
test_validate_composite_shape_rejects_unsalvageable_oversize_description
test_upload_skill_rejects_payload_without_frontmatter (replaces old fallback-test)
test_upload_skill_strips_bom_and_leading_whitespace
test_auto_repair_missing_references_stubs_missing_files
test_auto_repair_missing_references_noop_when_all_present

🤖 Generated with Claude Code

Fixes two bugs surfaced by the first live atomic-evolution run on the cheap Haiku tier (see journal #17).

1. Engineer oversize description (assembly-killer)
-------------------------------------------------

Haiku routinely overshoots the 250-char composite description cap by 10–40%. The previous behavior raised ValueError, which triggered an ~$0.20 LLM retry that often produced the same class of overshoot. For 2-of-3 runs it killed the whole atomic pipeline.

Fix: add `_try_truncate_description` helper that repairs oversize descriptions at a word boundary when it can do so without clobbering the "Use when" pushy-pattern marker. `_validate_composite_shape` now repairs in place before raising; only truly unsalvageable cases (no "Use when" in the first 250 chars) still raise.

Covered by 3 new unit tests plus the updated existing oversize test.

2. Managed-agents skill upload YAML frontmatter 400 (upload-killer)
-------------------------------------------------------------------

~1% of `managed_agents.upload_skill` calls returned ``400 SKILL.md must start with YAML frontmatter (---)``: the Anthropic Skills API is byte-strict about the leading ``---``. Models occasionally prepend a UTF-8 BOM or stray whitespace that the structural validator (which uses `startswith("---")` after standard string handling) happens to tolerate.

Fix: `upload_skill` now `lstrip`s BOM + whitespace from the payload before calling the API. If the normalized content still doesn't start with ``---``, raise ValueError with a clear message instead of round-tripping to a generic 400. Caller (`competitor_managed`) still falls back to inline for genuine bad content; BOM/whitespace-only damage now uploads cleanly instead of falling back.

Covered by 2 new unit tests (reject + BOM-strip).

QA
--

ruff check skillforge - clean
mypy skillforge - 65 files pass
pytest tests/ - 408 passed (+5), 2 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Third LLM-fragility bug from the cheap-Haiku atomic run (same class as the two fixed in the previous commit): the Spawner routinely emits SKILL.md bodies that reference references/*-guide.md files in prose but forgets to include the file in supporting_files. Structural rule 8 rejects those genomes, and in atomic mode (pop=2, 1 retry) this was killing the whole dimension 1-of-3 times.

Fix: new _auto_repair_missing_references helper runs before _validate_genomes. For each ${CLAUDE_SKILL_DIR}/<path> reference missing from supporting_files, stub a placeholder file with a clear auto-generated marker. The skill renders, the reference resolves at runtime, validation passes, and the Breeder can flesh out the stub in later generations if the signal warrants it.

Same defensive-repair pattern as the Engineer description truncation: cheap LLM produces almost-valid output, we repair at the boundary instead of burning another ~$0.20 retry that often reproduces the same oversight.

Covered by 2 new unit tests (stubs-missing + noop-when-present).

QA: ruff + mypy + 410 pytest (+2 from new tests), all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Matt (via Claude Code) and others added 2 commits April 19, 2026 15:38

ty13r merged commit bb1728d into main Apr 19, 2026
2 checks passed

ty13r deleted the fix/live-test-engineer-and-upload-bugs branch April 19, 2026 21:10

ty13r mentioned this pull request Apr 20, 2026

Known limitation: composite scorer is Elixir-scoped #58

Open

ty13r merged 2 commits into main from fix/live-test-engineer-and-upload-bugs
