fix: use sonnet for skill trigger evals instead of haiku by stack72 · Pull Request #809 · systeminit/swamp

stack72 · 2026-03-21T00:41:43Z

Summary

- Remove the EVAL_MODEL: "claude-haiku-4-5-20251001" override from the
  skill-trigger-eval CI job. Skill descriptions were tuned against Sonnet,
  and Haiku doesn't follow routing instructions (like "use this skill INSTEAD
  OF domain-specific skills") reliably enough: swamp-workflow dropped to 63%
  and swamp-troubleshooting to 65% on Haiku, versus 80%+ on Sonnet.
- Keep EVAL_RUNS=1 and 25 concurrent workers for speed. Dispatching 185 Sonnet
  calls in parallel is still fast.

Test plan

- Haiku results: 2 skills failing (workflow 63%, troubleshooting 65%)
- Sonnet results: all skills passing (≥80%)
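
The pass/fail split above implies a per-skill threshold of 80%. A minimal sketch of that check follows, assuming the eval produces one pass rate per skill; `SkillResult`, `PASS_THRESHOLD`, and `failingSkills` are hypothetical names for illustration and are not taken from the repository's eval script.

```typescript
// Hypothetical shape: one trigger-eval pass rate per skill, in the range 0..1.
type SkillResult = { skill: string; passRate: number };

// Assumption: a skill passes when its rate is at least 80%, per "all skills passing (≥80%)".
const PASS_THRESHOLD = 0.8;

// Return the names of skills whose pass rate falls below the threshold.
function failingSkills(
  results: SkillResult[],
  threshold: number = PASS_THRESHOLD,
): string[] {
  return results
    .filter((r) => r.passRate < threshold)
    .map((r) => r.skill);
}

// The two Haiku figures reported in this PR.
const haikuRun: SkillResult[] = [
  { skill: "swamp-workflow", passRate: 0.63 },
  { skill: "swamp-troubleshooting", passRate: 0.65 },
];

console.log(failingSkills(haikuRun));
```

With the Haiku figures both skills are reported as failing, which matches the first item of the test plan.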

🤖 Generated with Claude Code

github-actions

Adversarial Review

This is a one-line CI config change removing the EVAL_MODEL: "claude-haiku-4-5-20251001" env var from the skill-trigger-eval job. When EVAL_MODEL is unset, the eval script passes no --model flag to claude -p, which defaults to Sonnet.
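
The optional-override behaviour described here can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/eval_skill_triggers.ts: `buildClaudeArgs` is a hypothetical helper, and in a Deno script its second argument would come from `Deno.env.get("EVAL_MODEL")`.

```typescript
// Hypothetical helper: build the argument list for a `claude -p` invocation.
// When no model is supplied, no --model flag is added, so the CLI falls back
// to its own default model.
function buildClaudeArgs(prompt: string, model?: string): string[] {
  const args: string[] = ["-p", prompt];
  if (model) {
    args.push("--model", model);
  }
  return args;
}

// Before this PR: the CI job set EVAL_MODEL, so the flag was passed.
const withOverride = buildClaudeArgs(
  "example prompt",
  "claude-haiku-4-5-20251001",
);

// After this PR: EVAL_MODEL is unset (undefined), so the flag is omitted.
const withoutOverride = buildClaudeArgs("example prompt", undefined);

console.log(withOverride, withoutOverride);
```

Treating an empty string like an unset variable is a design choice of this sketch; the real script may handle that case differently.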

Critical / High

None.

Medium

None.

Low

.github/workflows/ci.yml:137 — Implicit model default: The default model is now whatever claude -p ships with, rather than an explicit pin. If Anthropic changes the CLI default, eval results could shift without any PR touching this repo. This is unlikely to cause a real problem and is arguably the intended behavior (track the recommended model), but worth noting.
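
If an explicit pin were ever preferred over the CLI default, one option is a fallback inside the script. The sketch below is hypothetical and is not what this PR does: `resolveEvalModel` is an invented name, and the pinned value `"sonnet"` is an assumed placeholder that would need to be checked against the model names the CLI actually accepts.

```typescript
// Hypothetical: choose the model for an eval run.
// An explicit EVAL_MODEL value wins; otherwise use a pinned fallback, so that
// eval results do not shift when the CLI's default model changes.
function resolveEvalModel(
  envValue: string | undefined,
  pinned: string,
): string {
  const trimmed = envValue?.trim();
  return trimmed ? trimmed : pinned;
}

// Assumed placeholder for the pin; not a value taken from this repository.
const PINNED_MODEL = "sonnet";

console.log(resolveEvalModel(undefined, PINNED_MODEL));
console.log(resolveEvalModel("claude-haiku-4-5-20251001", PINNED_MODEL));
```

The trade-off is the one the review notes: a pin gives reproducible results, while the implicit default tracks whatever model the CLI ships with.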

Verdict

PASS — Trivial CI config change. The removed env var override is correctly handled as optional by the eval script (Deno.env.get("EVAL_MODEL") returns undefined, which propagates to skip the --model flag). No logic, security, or correctness concerns.

github-actions

LGTM — clean, minimal CI config change.

The diff removes EVAL_MODEL: "claude-haiku-4-5-20251001" from the skill-trigger-eval job, falling back to the default Sonnet model. Verified that EVAL_MODEL is optional in scripts/eval_skill_triggers.ts (line 107) — when unset, claude -p uses its default model.

No blocking issues.

The rationale is sound: skill descriptions were tuned against Sonnet, and Haiku doesn't follow routing instructions reliably enough (63-65% vs 80%+ on Sonnet). Keeping EVAL_RUNS=1 and 25 workers maintains fast CI execution.

🤖 Generated with Claude Code

github-actions bot reviewed Mar 21, 2026

github-actions bot approved these changes Mar 21, 2026

stack72 merged commit 7925fc7 into main Mar 21, 2026
10 checks passed

stack72 deleted the eval-skills-model branch March 21, 2026 00:47
