From ee8604eb0843b5b0ebd7ca980ed3504f0f8e4504 Mon Sep 17 00:00:00 2001
From: stack72
Date: Mon, 6 Apr 2026 23:23:40 +0100
Subject: [PATCH] docs: add skill lifecycle design document
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Describes how skills are authored, tested, and shipped — covering
skill-creator conventions, tessl quality review, promptfoo trigger routing
evaluation, CI pipeline, and weekly multi-model regression.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 design/skills.md | 283 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 283 insertions(+)
 create mode 100644 design/skills.md

diff --git a/design/skills.md b/design/skills.md
new file mode 100644
index 00000000..dcf3c495
--- /dev/null
+++ b/design/skills.md
@@ -0,0 +1,283 @@

# Skill Lifecycle

This document describes how skills are authored, tested, and shipped through CI.

## Overview

A skill gives Claude specialized knowledge about a specific domain. At runtime,
skills work through progressive disclosure:

- Claude always sees every skill's `name` and `description` in the system prompt
- When a user query matches a skill's description, Claude loads the full
  `SKILL.md` body into context
- If the skill references detailed docs in `references/`, Claude loads those on
  demand

This means the `description` field is the single most important part of a
skill — it's the trigger mechanism that decides whether Claude engages the skill
at all.

Skills live in `.claude/skills/<skill-name>/` and follow a three-phase lifecycle:

1. **Author** — create the skill following skill-creator conventions
2. **Test** — validate quality (tessl) and trigger routing (promptfoo)
3. **Ship** — CI gates enforce quality before merge

Authoring and testing are iterative — after using a skill on real tasks, you
refine the description, body, or evals based on what worked and what didn't.
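
To make the trigger mechanism concrete, a minimal `SKILL.md` for a hypothetical `payments-models` skill might look like the following (the name, description, and body are illustrative, not taken from a real skill):

```markdown
---
name: payments-models
description: >-
  Search, inspect, and modify payment model types. Use when the user asks
  about payment models, payment type search, or payment schema changes.
---

# Payments Models

Use this skill to look up payment model types before modifying them.

Core workflow and selection guidance go here; variant-specific detail
belongs in references/.
```

Note that the description carries both the "what" and the "when" — that is the part Claude sees before deciding to load the rest.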

## Authoring a Skill

Always load the `skill-creator` skill before creating or modifying a skill — it
is the authoritative guide for structure and best practices.

### Directory Structure

```
.claude/skills/<skill-name>/
├── SKILL.md        # Required — uppercase, contains frontmatter + body
├── references/     # Optional — detailed docs loaded on demand
│   └── *.md
└── evals/          # Optional — trigger evaluation test cases
    └── trigger_evals.json
```

### SKILL.md

**Frontmatter** (required YAML between `---` fences):

- `name` — hyphen-case identifier, must match the directory name, max 64 chars
- `description` — what the skill does AND when to trigger it. This is the
  primary mechanism Claude uses to decide whether to load the skill, so include
  specific trigger phrases and contexts. Max 1024 chars.

No other frontmatter fields are allowed (except optional `license`).

**Body** guidelines:

- Keep it under 500 lines — split detailed content into `references/` files
- Use the imperative form ("Use this for…", "To create…")
- Do not duplicate "when to use" guidance that belongs in the description
- Run `deno fmt` after editing

### Progressive Disclosure

Skills load in three tiers to manage context efficiently:

1. **Metadata** (name + description) — always in context (~100 words)
2. **SKILL.md body** — loaded when the skill triggers (<5k words)
3. **References** — loaded on demand by Claude (unlimited)

Keep only core workflow and selection guidance in SKILL.md. Move variant-specific
details, API references, and lengthy examples into `references/`.

## Writing Trigger Evals

Trigger evals define test cases that verify Claude routes user queries to the
correct skill. They live at
`.claude/skills/<skill-name>/evals/trigger_evals.json`.

### Format

```json
[
  {
    "query": "Search for model types that handle payments",
    "should_trigger": true,
    "note": "type search is a listed trigger keyword"
  },
  {
    "query": "List all the data I have stored",
    "should_trigger": false,
    "note": "Listing stored data routes to swamp-data, not this skill"
  }
]
```

Each entry has:

- `query` — a realistic user request
- `should_trigger` — `true` if this query should route to this skill, `false` if
  it should route elsewhere
- `note` — optional explanation of the routing decision

Write both positive and negative cases. Negative cases are important — they
verify that similar-sounding queries route to the correct neighboring skill
rather than to this one.

## Testing Locally

Both testing tools require **Node.js/npm** and **Deno** installed locally. They
test fundamentally different things:

- **Tessl** answers: _"Is this skill well-written?"_ — it scores the quality of
  the SKILL.md description and content.
- **Promptfoo** answers: _"Does the LLM actually route to this skill
  correctly?"_ — it sends real queries to a model and checks which skill gets
  called.

A skill can score perfectly on tessl but still fail promptfoo if its description
overlaps with another skill's. Both must pass.

### Tessl — Skill Quality Review

Tessl evaluates the structural quality of a skill's SKILL.md by scoring its
description and content.

**Review a single skill:**

```bash
npx tessl skill review .claude/skills/<skill-name> --json
```

This returns three metrics:

- **Validation** — structural checks (frontmatter format, naming conventions)
- **Description Judge** — quality score for the frontmatter description (0–1.0)
- **Content Judge** — quality score for the SKILL.md body (0–1.0)

The average of the description and content scores must be ≥ 90%.
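
The gating rule amounts to a small check over the review output. A sketch in TypeScript (the field names here are assumptions for illustration, not tessl's documented output schema):

```typescript
// Sketch: decide pass/fail from a tessl-style review result.
// Field names are assumptions, not tessl's documented JSON schema.
interface SkillReview {
  descriptionScore: number; // Description Judge, 0–1.0
  contentScore: number; // Content Judge, 0–1.0
  validationPassed: boolean; // structural Validation checks
}

function passesQualityGate(review: SkillReview, threshold = 0.9): boolean {
  // Validation must pass outright; the two judge scores are averaged
  // and compared against the 90% threshold.
  const average = (review.descriptionScore + review.contentScore) / 2;
  return review.validationPassed && average >= threshold;
}

// Example: 0.95 and 0.88 average to 0.915, which clears the 90% bar.
console.log(
  passesQualityGate({
    descriptionScore: 0.95,
    contentScore: 0.88,
    validationPassed: true,
  }),
); // true
```

Note that one weak score can be carried by a strong one — a 0.95 description offsets a 0.88 body — but a validation failure is not averaged away.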

**Review all skills at once:**

```bash
deno run review-skills
```

This runs tessl against every skill in `.claude/skills/` and prints a summary
table. In CI, it writes the table to the GitHub Actions step summary.

### Promptfoo — Trigger Routing Evaluation

Promptfoo tests whether an LLM correctly routes user queries to the right skill.
It reads all `trigger_evals.json` files, presents each query to the model as a
tool-selection task, and asserts that the correct skill was (or was not) called.

**Run against the default model (sonnet):**

```bash
export ANTHROPIC_API_KEY=<key>
deno run eval-skill-triggers
```

**Run against a specific model:**

```bash
export ANTHROPIC_API_KEY=<key>
deno run eval-skill-triggers --model opus

export OPENAI_API_KEY=<key>
deno run eval-skill-triggers --model gpt-4.1

export GOOGLE_API_KEY=<key>
deno run eval-skill-triggers --model gemini-2.5-pro
```

**Tune concurrency and threshold:**

```bash
deno run eval-skill-triggers --model sonnet --concurrency 10 --threshold 0.85
```

The defaults are concurrency 20 and a 90% threshold. Each sonnet run costs ~$2
and takes ~30 seconds.

If the required API key is missing, the script gracefully skips (exits 0) rather
than failing.

## CI Pipeline

CI runs skill checks automatically when relevant files change. The change
detection filter triggers on modifications to:

- `.claude/skills/**`
- `CLAUDE.md`
- `scripts/review_skills.ts`
- `evals/promptfoo/**`

### PR Checks

Two jobs run in parallel on every PR that touches skill files:

**skill-review** — runs `deno run review-skills` (tessl) against all skills.
Fails the PR if any skill scores below 90%.

**skill-trigger-eval** — runs `deno run eval-skill-triggers` (promptfoo) with
the default sonnet model. Fails the PR if the pass rate drops below 90%.

Both jobs write detailed results to the GitHub Actions step summary for easy
review.
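
Wired into GitHub Actions, the two PR jobs might look roughly like this (a sketch only: job steps, action versions, and secret names are assumptions about this repo's actual workflow file):

```yaml
# Sketch of the two parallel PR jobs; details are assumptions.
name: skill-checks
on:
  pull_request:
    paths:
      - ".claude/skills/**"
      - "CLAUDE.md"
      - "scripts/review_skills.ts"
      - "evals/promptfoo/**"

jobs:
  skill-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: denoland/setup-deno@v2
      - run: deno run review-skills # fails if any skill scores below 90%

  skill-trigger-eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: denoland/setup-deno@v2
      - run: deno run eval-skill-triggers # fails if pass rate drops below 90%
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

The `paths` filter mirrors the change-detection list above, so PRs that never touch skill files skip both jobs entirely.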

### Weekly Multi-Model Regression

A scheduled workflow (`multi-model-eval.yml`) runs every Saturday at 08:00 UTC.
It tests trigger routing against four models in parallel:

| Model          | Provider  | Concurrency | API Key Env         |
| -------------- | --------- | ----------- | ------------------- |
| sonnet         | Anthropic | 20          | `ANTHROPIC_API_KEY` |
| opus           | Anthropic | 20          | `ANTHROPIC_API_KEY` |
| gpt-4.1        | OpenAI    | 5           | `OPENAI_API_KEY`    |
| gemini-2.5-pro | Google    | 20          | `GOOGLE_API_KEY`    |

This catches model-specific regressions that the single-model PR check would
miss. It can also be triggered manually via `workflow_dispatch` with a model
filter.

### After CI

Once all checks pass (skill review, trigger eval, code review, and standard
tests), PRs are auto-merged unless the `hold` label is applied.

## Lifecycle Flow

```
 ┌─────────────────────────┐
 │ Author                  │
 │                         │
 │ Create skill dir        │
 │ Write SKILL.md          │
 │ Write trigger evals     │
 │ Add references          │
 └────────────┬────────────┘
              │
              ▼
 ┌─────────────────────────┐
 │ Test Locally            │
 │                         │
 │ tessl → quality ok?     │
 │ promptfoo → routing ok? │
 └────────────┬────────────┘
              │
      ┌───────┴───────┐
      │ Pass?         │
      │ no ───────────┼──── loop back to Author
      │ yes           │
      └───────┬───────┘
              │
              ▼
 ┌─────────────────────────┐
 │ Open PR                 │
 │                         │
 │ CI: skill-review        │
 │ CI: skill-trigger-eval  │
 │ CI: code review         │
 └────────────┬────────────┘
              │
              ▼
 ┌─────────────────────────┐
 │ Merge                   │
 └────────────┬────────────┘
              │
              ▼
 ┌─────────────────────────┐
 │ Use on real tasks       │
 │                         │
 │ Notice gaps or          │
 │ mis-routes? Loop back   │
 │ to Author.              │
 └─────────────────────────┘

 Weekly: multi-model regression catches
 model-specific routing drift (Sat 08:00 UTC)
```