fix: use sonnet for skill trigger evals instead of haiku #809
## Summary

- Remove `EVAL_MODEL: "claude-haiku-4-5-20251001"` override from the skill-trigger-eval CI job. Skill descriptions were tuned against Sonnet, and Haiku doesn't follow routing instructions (like "use this skill INSTEAD OF domain-specific skills") reliably enough: swamp-workflow dropped to 63% and swamp-troubleshooting to 65% on Haiku vs 80%+ on Sonnet.
- Keep `EVAL_RUNS=1` and 25 concurrent workers for speed. 185 Sonnet calls dispatched in parallel is still fast.

## Test plan

- [x] Haiku results: 2 skills failing (workflow 63%, troubleshooting 65%)
- [x] Sonnet results: all skills passing (≥80%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Adversarial Review
This is a one-line CI config change removing the EVAL_MODEL: "claude-haiku-4-5-20251001" env var from the skill-trigger-eval job. When EVAL_MODEL is unset, the eval script passes no --model flag to claude -p, which defaults to Sonnet.
Critical / High
None.
Medium
None.
Low
.github/workflows/ci.yml:137 (implicit model default): The default model is now whatever claude -p ships with, rather than an explicit pin. If Anthropic changes the CLI default, eval results could shift without any PR touching this repo. This is unlikely to cause a real problem and is arguably the intended behavior (track the recommended model), but worth noting.
Verdict
PASS — Trivial CI config change. The removed env var override is correctly handled as optional by the eval script (Deno.env.get("EVAL_MODEL") returns undefined, which propagates to skip the --model flag). No logic, security, or correctness concerns.
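The optional-override behavior described above can be sketched roughly as follows. This is not the code from scripts/eval_skill_triggers.ts; `buildClaudeArgs` is a hypothetical helper, and the only details taken from the review are that the value comes from `Deno.env.get("EVAL_MODEL")` and that `--model` is skipped when it is unset. Treating an empty or whitespace-only value as unset is an extra assumption.

```typescript
// Hypothetical helper (not from the repo): builds the argument list for a
// `claude -p` invocation. `model` stands in for the EVAL_MODEL value.
function buildClaudeArgs(prompt: string, model?: string): string[] {
  const args: string[] = ["-p", prompt];
  // Only pass --model when an override is actually set; otherwise the CLI
  // falls back to its own default model.
  // Assumption: empty or whitespace-only values count as "unset".
  if (model !== undefined && model.trim() !== "") {
    args.push("--model", model);
  }
  return args;
}

// In a Deno script the override would be read with Deno.env.get("EVAL_MODEL"),
// which returns undefined when the variable is not defined.
console.log(buildClaudeArgs("Which skill handles this request?"));
console.log(
  buildClaudeArgs("Which skill handles this request?", "claude-haiku-4-5-20251001"),
);
```

Taking the model as a parameter instead of reading the environment inside the helper keeps the sketch runnable outside Deno and makes the unset case easy to test.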
LGTM — clean, minimal CI config change.
The diff removes EVAL_MODEL: "claude-haiku-4-5-20251001" from the skill-trigger-eval job, falling back to the default Sonnet model. Verified that EVAL_MODEL is optional in scripts/eval_skill_triggers.ts (line 107) — when unset, claude -p uses its default model.
No blocking issues.
The rationale is sound: skill descriptions were tuned against Sonnet, and Haiku doesn't follow routing instructions reliably enough (63-65% vs 80%+ on Sonnet). Keeping EVAL_RUNS=1 and 25 workers maintains fast CI execution.
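For illustration, the pass/fail comparison behind those percentages might look like the sketch below. The 80% threshold comes from the test plan ("≥80%"); the `SkillResult` shape, the `failingSkills` helper, and the sample counts are hypothetical and are not taken from the eval script or from the actual run.

```typescript
// Hypothetical result shape: how many trigger cases a skill passed out of the total.
interface SkillResult {
  skill: string;
  passed: number;
  total: number;
}

// Threshold taken from the PR's test plan (skills pass at >= 80%).
const PASS_THRESHOLD = 0.8;

// Returns the names of skills whose pass rate is below the threshold.
// A skill with no cases is treated as failing (an assumption of this sketch).
function failingSkills(results: SkillResult[], threshold: number = PASS_THRESHOLD): string[] {
  return results
    .filter((r) => r.total === 0 || r.passed / r.total < threshold)
    .map((r) => r.skill);
}

// Made-up counts, for illustration only.
const sample: SkillResult[] = [
  { skill: "example-a", passed: 17, total: 20 }, // 85%, passes
  { skill: "example-b", passed: 12, total: 20 }, // 60%, fails
];

console.log(failingSkills(sample)); // [ "example-b" ]
```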
🤖 Generated with Claude Code