Skip to content

fix: 3 LLM-fragility bugs exposed by cheap-Haiku atomic live test#54

Merged
ty13r merged 2 commits intomainfrom
fix/live-test-engineer-and-upload-bugs
Apr 19, 2026
Merged

fix: 3 LLM-fragility bugs exposed by cheap-Haiku atomic live test#54
ty13r merged 2 commits intomainfrom
fix/live-test-engineer-and-upload-bugs

Conversation

@ty13r
Copy link
Copy Markdown
Owner

@ty13r ty13r commented Apr 19, 2026

Summary

Fixes three LLM-fragility bugs uncovered by the first live atomic-evolution runs. Each one was killing the whole pipeline in 1-of-3 cheap-Haiku runs; each one now repairs at the boundary instead of escalating to an LLM retry.

Bug 1: Engineer oversize composite description

Haiku routinely overshoots the 250-char composite description cap by 10–40%. Previous behavior: raise ValueError, burn a ~$0.20 retry that often produced the same class of overshoot.

Fix: _try_truncate_description helper in agents/engineer.py repairs oversize descriptions at a word boundary, preserving the "Use when" pushy-pattern marker. _validate_composite_shape now repairs in place before raising; only truly unsalvageable cases (no "Use when" in the first 250 chars) still raise.

Bug 2: Managed-agents skill upload 400 on missing YAML frontmatter

~1% of managed_agents.upload_skill calls returned 400 SKILL.md must start with YAML frontmatter (---). Models occasionally prepend a UTF-8 BOM or stray whitespace that our structural validator tolerates but the API doesn't.

Fix: upload_skill now lstrips BOM + whitespace before calling the API. If the normalized content still doesn't start with ---, raise ValueError with a clear message instead of round-tripping to a generic 400. Caller fallback-to-inline still works for genuinely bad content; BOM/whitespace damage now uploads cleanly.

Bug 3: Spawner missing ${CLAUDE_SKILL_DIR} reference files

Spawner emits SKILL.md bodies that reference references/*-guide.md in prose but forget to include the file in supporting_files. Structural rule 8 rejected them, killing 1-of-3 atomic dimensions.

Fix: new _auto_repair_missing_references helper in agents/spawner.py runs before _validate_genomes. For each missing reference, stub a placeholder file with a clear auto-generated marker. Skill renders, reference resolves at runtime, validation passes; the Breeder can flesh out stubs in later generations.

Verification — live run passed

Full end-to-end atomic evolution on this branch:

SKILLFORGE_TEST_TIER=cheap \
SKILLFORGE_MODEL_ENGINEER=claude-sonnet-4-6 \
SKILLFORGE_LIVE_TESTS=1 \
SKILLFORGE_COMPETITOR_BACKEND=managed \
uv run pytest tests/test_atomic_evolution_live.py
  • Result: 1 passed in 19m 04s
  • Real cost (from DB): $1.46 (under the $5 authorized budget)
  • Pareto objectives (all 5 axes measured): correctness=0.00, code_quality=0.97, token_efficiency=0.04, trigger_accuracy=0.98, consistency=0.00
  • Zero upload failures, zero skill leaks (vs 3 + 5 on the first un-fixed run)

Before this PR, the same config failed at Engineer assembly. After this PR, the pipeline runs clean end-to-end.

Test plan

  • uv run ruff check skillforge — clean
  • uv run mypy skillforge — 65 files pass
  • uv run pytest tests/ — 410 passed (+7 from new tests), 2 skipped
  • Live atomic test passes with cheap Haiku + Sonnet Engineer — confirmed locally ($1.46)
  • Frontend untouched — still green

Unit tests added (7 total)

  • test_try_truncate_description_noop_when_under_cap
  • test_try_truncate_description_repairs_oversize_at_word_boundary
  • test_try_truncate_description_returns_none_when_use_when_is_past_cap
  • test_validate_composite_shape_truncates_oversize_description_in_place (replaces old reject-test)
  • test_validate_composite_shape_rejects_unsalvageable_oversize_description
  • test_upload_skill_rejects_payload_without_frontmatter (replaces old fallback-test)
  • test_upload_skill_strips_bom_and_leading_whitespace
  • test_auto_repair_missing_references_stubs_missing_files
  • test_auto_repair_missing_references_noop_when_all_present

🤖 Generated with Claude Code

Matt (via Claude Code) and others added 2 commits April 19, 2026 15:38
Fixes two bugs surfaced by the first live atomic-evolution run on the
cheap Haiku tier (see journal #17).

1. Engineer oversize description (assembly-killer)
-------------------------------------------------
Haiku routinely overshoots the 250-char composite description cap by
10–40%. The previous behavior raised ValueError, which triggered an
~$0.20 LLM retry that often produced the same class of overshoot. For
2-of-3 runs it killed the whole atomic pipeline.

Fix: add `_try_truncate_description` helper that repairs oversize
descriptions at a word boundary when it can do so without clobbering
the "Use when" pushy-pattern marker. `_validate_composite_shape` now
repairs in place before raising; only truly unsalvageable cases
(no "Use when" in the first 250 chars) still raise.

Covered by 3 new unit tests plus the updated existing oversize test.

2. Managed-agents skill upload YAML frontmatter 400 (upload-killer)
-------------------------------------------------------------------
~1% of `managed_agents.upload_skill` calls returned
``400 SKILL.md must start with YAML frontmatter (---)`` — the
Anthropic Skills API is byte-strict about the leading ``---``. Models
occasionally prepend a UTF-8 BOM or stray whitespace that the
structural validator (which uses `startswith("---")` after standard
string handling) happens to tolerate.

Fix: `upload_skill` now `lstrip`s BOM + whitespace from the payload
before calling the API. If the normalized content still doesn't start
with ``---``, raise ValueError with a clear message instead of
round-tripping to a generic 400. Caller (`competitor_managed`) still
falls back to inline for genuine bad content; BOM/whitespace-only
damage now uploads cleanly instead of falling back.

Covered by 2 new unit tests (reject + BOM-strip).

QA
--
  ruff check skillforge     - clean
  mypy skillforge           - 65 files pass
  pytest tests/             - 408 passed (+5), 2 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third LLM-fragility bug from the cheap-Haiku atomic run (same class as
the two fixed in the previous commit): the Spawner routinely emits
SKILL.md bodies that reference references/*-guide.md files in prose
but forgets to include the file in supporting_files. Structural rule 8
rejects those genomes, and in atomic mode (pop=2, 1 retry) this was
killing the whole dimension 1-of-3 times.

Fix: new _auto_repair_missing_references helper runs before
_validate_genomes. For each ${CLAUDE_SKILL_DIR}/<path> reference
missing from supporting_files, stub a placeholder file with a clear
auto-generated marker. The skill renders, the reference resolves at
runtime, validation passes, and the Breeder can flesh out the stub
in later generations if the signal warrants it.

Same defensive-repair pattern as the Engineer description truncation:
cheap LLM produces almost-valid output, we repair at the boundary
instead of burning another ~\$0.20 retry that often reproduces the
same oversight.

Covered by 2 new unit tests (stubs-missing + noop-when-present).

QA: ruff + mypy + 410 pytest (+2 from new tests) — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ty13r ty13r merged commit bb1728d into main Apr 19, 2026
2 checks passed
@ty13r ty13r deleted the fix/live-test-engineer-and-upload-bugs branch April 19, 2026 21:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant