fix: use sonnet for skill trigger evals instead of haiku #809

Merged
stack72 merged 1 commit into main from eval-skills-model
Mar 21, 2026

stack72 commented Mar 21, 2026

Summary

  • Remove the EVAL_MODEL: "claude-haiku-4-5-20251001" override from the
    skill-trigger-eval CI job. Skill descriptions were tuned against Sonnet,
    and Haiku does not follow routing instructions (like "use this skill INSTEAD
    OF domain-specific skills") reliably enough: swamp-workflow dropped to 63%
    and swamp-troubleshooting to 65% on Haiku, versus 80%+ on Sonnet.
  • Keep EVAL_RUNS=1 and 25 concurrent workers for speed. Dispatching 185
    Sonnet calls in parallel is still fast.

Test plan

  • Haiku results: 2 skills failing (workflow 63%, troubleshooting 65%)
  • Sonnet results: all skills passing (≥80%)
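
The pass/fail split above can be illustrated with a small sketch. It assumes the gate is a per-skill pass rate compared against an 80% threshold, which is inferred from the "≥80%" figure. The `SkillResult` shape, `PASS_THRESHOLD`, and `failingSkills` are hypothetical names, not taken from the eval script.

```typescript
// Hypothetical shape for one skill's eval result; not taken from the repo.
interface SkillResult {
  skill: string;
  passRate: number; // fraction of trigger cases routed correctly, 0..1
}

// Assumed from the "≥80%" figure in the test plan.
const PASS_THRESHOLD = 0.8;

// Return the names of skills whose pass rate falls below the threshold.
function failingSkills(
  results: SkillResult[],
  threshold: number = PASS_THRESHOLD,
): string[] {
  return results.filter((r) => r.passRate < threshold).map((r) => r.skill);
}

// The two Haiku figures reported in this PR.
const haikuResults: SkillResult[] = [
  { skill: "swamp-workflow", passRate: 0.63 },
  { skill: "swamp-troubleshooting", passRate: 0.65 },
];

console.log(failingSkills(haikuResults));
```

With the Haiku numbers from this PR, both skills land below the assumed threshold, which matches the "2 skills failing" line.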

🤖 Generated with Claude Code

github-actions bot left a comment

Adversarial Review

This is a one-line CI config change removing the EVAL_MODEL: "claude-haiku-4-5-20251001" env var from the skill-trigger-eval job. When EVAL_MODEL is unset, the eval script passes no --model flag to claude -p, which defaults to Sonnet.
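
A minimal sketch of the behaviour described above, assuming the script builds an argument list for `claude -p` and only appends `--model` when a value is present. `buildClaudeArgs` is a hypothetical helper, not the actual code in the eval script. The `model` parameter stands in for whatever `Deno.env.get("EVAL_MODEL")` returns.

```typescript
// Hypothetical helper: build the argument list for a `claude -p` call.
// `model` stands in for the value of the EVAL_MODEL environment variable,
// which is undefined when the variable is not set.
function buildClaudeArgs(prompt: string, model?: string): string[] {
  const args = ["-p", prompt];
  // Only pass --model when a value was provided; otherwise the CLI
  // falls back to its own default model.
  if (model) {
    args.push("--model", model);
  }
  return args;
}

// With the override removed from CI, the model is undefined.
console.log(buildClaudeArgs("example prompt", undefined));
// With the previous override in place.
console.log(buildClaudeArgs("example prompt", "claude-haiku-4-5-20251001"));
```

Treating an unset variable as "no flag" keeps the override optional, so removing the env var from the workflow needs no script change.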

Critical / High

None.

Medium

None.

Low

  1. .github/workflows/ci.yml:137 — Implicit model default: The default model is now whatever claude -p ships with, rather than an explicit pin. If Anthropic changes the CLI default, eval results could shift without any PR touching this repo. This is unlikely to cause a real problem and is arguably the intended behavior (track the recommended model), but worth noting.
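
If an explicit pin were ever preferred over the CLI default, one option is a fallback inside the script. The sketch below is illustrative only. `resolveEvalModel` and `DEFAULT_EVAL_MODEL` are hypothetical names, and the `"sonnet"` alias is an assumption about what the CLI accepts, not something this PR configures.

```typescript
// Hypothetical pinned default; the "sonnet" alias is an assumption.
const DEFAULT_EVAL_MODEL = "sonnet";

// Resolve the model to use: the env override if set and non-empty,
// otherwise the pinned default. `envValue` stands in for the result of
// Deno.env.get("EVAL_MODEL").
function resolveEvalModel(
  envValue: string | undefined,
  fallback: string = DEFAULT_EVAL_MODEL,
): string {
  const trimmed = envValue?.trim();
  return trimmed ? trimmed : fallback;
}

console.log(resolveEvalModel(undefined));
console.log(resolveEvalModel("claude-haiku-4-5-20251001"));
```

The trade-off is the one noted in the finding: a pin keeps results stable across CLI releases, while the implicit default tracks whatever model the CLI ships with.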

Verdict

PASS — Trivial CI config change. The removed env var override is correctly handled as optional by the eval script (Deno.env.get("EVAL_MODEL") returns undefined, which propagates to skip the --model flag). No logic, security, or correctness concerns.

github-actions bot left a comment

LGTM — clean, minimal CI config change.

The diff removes EVAL_MODEL: "claude-haiku-4-5-20251001" from the skill-trigger-eval job, falling back to the default Sonnet model. Verified that EVAL_MODEL is optional in scripts/eval_skill_triggers.ts (line 107) — when unset, claude -p uses its default model.

No blocking issues.

The rationale is sound: skill descriptions were tuned against Sonnet, and Haiku doesn't follow routing instructions reliably enough (63-65% vs 80%+ on Sonnet). Keeping EVAL_RUNS=1 and 25 workers maintains fast CI execution.

🤖 Generated with Claude Code

stack72 merged commit 7925fc7 into main on Mar 21, 2026
10 checks passed
stack72 deleted the eval-skills-model branch on March 21, 2026 at 00:47