Runbook skill #21

Merged: mxriverlynn merged 4 commits into main from runbook-skill on May 28, 2026
@mxriverlynn (Collaborator) commented May 28, 2026

Summary

This PR adds the /runbook skill so engineers can capture a single operational procedure from a real trigger — a fired alert, an incident, a recurring task, or a live failure mode — without writing speculative on-call docs.

  • New deterministic skill that produces one runbook file per invocation at docs/runbooks/{slug}.md using a cross-format core template (symptom-first metadata block with Severity, Triggers, Reversible, Last validated, Last edited, Owner, Origin; then Symptoms, Prerequisites, imperative Resolve steps with expected output per step, Verify, Escalate, Rollback).
  • Hard YAGNI preflight gates the work: the user must point at a real alert that has fired, a documented incident or post-mortem, a scheduled recurring task, a live failure mode on a service receiving traffic, or a stakeholder commitment. With no evidence the skill recommends deferral and names the trigger that would justify revisiting; the user can override and the override is recorded in the runbook's Origin field.
  • Skill catalog grows from 20 to 21; new "Operations" group introduced in the skills index. About fifteen sibling skill docs gained one-to-three-line "this is not /runbook — use /runbook for that" cross-links in their negative-space sections.
  • Documentation-only PR. No runtime code paths in the plugin shift; the change is a new skill definition plus a template, a long-form operator doc, the underlying research report, and index updates.
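
The preflight gate described above could be sketched roughly as follows. This is an illustration only, not the skill's implementation: the skill is defined in prose in SKILL.md, and the names `Evidence`, `PreflightResult`, and `yagni_preflight` are hypothetical. The sketch also lists all five triggers on deferral, whereas the skill names the specific trigger that would justify revisiting.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Evidence(Enum):
    """The five evidence types listed in the PR description."""
    FIRED_ALERT = "alert that has fired"
    INCIDENT = "documented incident or post-mortem"
    RECURRING_TASK = "scheduled recurring task"
    LIVE_FAILURE_MODE = "live failure mode on a service receiving traffic"
    STAKEHOLDER_COMMITMENT = "stakeholder commitment"


@dataclass
class PreflightResult:
    proceed: bool
    origin: Optional[str]  # text for the runbook's Origin field when proceeding
    message: str


def yagni_preflight(evidence: Optional[Evidence], reference: str = "",
                    user_override: bool = False) -> PreflightResult:
    """Hypothetical gate: accept real evidence, record overrides, else defer."""
    if evidence is not None:
        origin = f"{evidence.value}: {reference}" if reference else evidence.value
        return PreflightResult(True, origin, "Evidence accepted.")
    if user_override:
        # The override itself becomes the provenance recorded in Origin.
        return PreflightResult(True, "User override (no evidence supplied)",
                               "Proceeding on user override; recorded in Origin.")
    triggers = "; ".join(e.value for e in Evidence)
    return PreflightResult(False, None, f"Defer. Revisit when one of: {triggers}.")
```

For example, `yagni_preflight(Evidence.FIRED_ALERT, "HighErrorRate")` proceeds with an Origin naming the alert, while `yagni_preflight(None)` defers and leaves Origin unset.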

Behavior changes

Before: users had no operations-category skill; runbook-shaped requests landed in /project-documentation (wrong shape — describes systems, not procedures) or /architectural-decision-record (wrong shape — records decisions, not procedures). After: /runbook exists, refuses to write speculative runbooks by default, and discovers the project's existing runbook directory and filename convention (flat, per-service, or alert-keyed) instead of imposing one.

What to look at first

  • The YAGNI preflight in SKILL.md Step 2. This is the central design decision and the place the skill will most often surprise users. Worth checking whether the five accepted evidence types and the override-recording behavior match how the team wants on-call documentation to be gated.
  • The choice of a deterministic template installer over a multi-agent interview. The research report walks six options and explains why the elaborate interview-plus-review design did not survive adversarial validation. If the conclusion feels under-justified, that's the file to push back on.
  • Filename-convention discovery in Step 3 (flat / per-service / alert-keyed). The skill matches existing convention when the project already has more than two runbooks. Worth confirming the "two-runbook" threshold and the fallback order are sensible.
  • The negative-space cross-links scattered across sibling skill docs. They are short, but there are about fifteen of them; check that the wording is consistent and that none of them oversell what /runbook does.
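
To make the threshold question concrete, here is a minimal sketch of how convention discovery might behave. It is not taken from SKILL.md: the function name `detect_convention`, the `known_alerts` parameter, the majority rule for nested paths, and the check order are all assumptions made for illustration. Only the "more than two runbooks" threshold and the three convention names come from the description above.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional, Set

THRESHOLD = 2  # infer a convention only when more than two runbooks exist


def detect_convention(paths: Iterable[str], root: str = "docs/runbooks",
                      known_alerts: Optional[Set[str]] = None) -> Optional[str]:
    """Guess the existing runbook layout; all paths are assumed to sit under root."""
    rels = [PurePosixPath(p).relative_to(root) for p in paths if p.endswith(".md")]
    if len(rels) <= THRESHOLD:
        return None  # too few runbooks to infer anything; caller uses its default
    # Assumption: a majority of files in subdirectories means per-service layout.
    nested = sum(1 for r in rels if len(r.parts) > 1)
    if nested > len(rels) / 2:
        return "per-service"
    # Assumption: alert-keyed means every filename stem is a known alert name.
    if known_alerts and all(r.stem in known_alerts for r in rels):
        return "alert-keyed"
    return "flat"
```

Under these assumptions, a project with exactly two runbooks returns `None` and falls through to the default, which is the boundary case reviewers may want to check against Step 3.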

Files of interest

  • plugin/skills/runbook/SKILL.md — The skill's seven-step process, including the YAGNI preflight and the mode-selection table. Primary decision surface.
  • plugin/...runbook/references/runbook-template.md — The output template every runbook will copy from; shapes every future runbook in any project that installs this plugin.
  • docs/research/runbook-skill-research.md — Evidence and adversarial validation behind picking a deterministic template installer over an interview-driven design.
  • docs/skills/runbook.md — Operator-facing long-form doc; the "When to use it" and "Do not invoke for" lists are where users will form their mental model.
  • CLAUDE.md — Skill count moves 20 to 21 and a new "Operations" group lands in the catalog; confirms the index header alignment fix.

Research for the new `/runbook` skill (issue #18). Surveys industry
formats (Google SRE, GitLab, OpenShift, PagerDuty, Atlassian, Rootly,
OneUptime, FireHydrant, incident.io, Nobl9, et al.) and Han codebase
patterns. Recommends a deterministic template installer with a YAGNI
preflight and mandatory staleness metadata, after adversarial validation
collapsed an earlier interview-driven design.

Refs #18

Adds a new `/runbook` skill that creates or updates a runbook for a
single operational scenario (alert that has fired, incident, recurring
task, known failure mode) using a consistent symptom-first template.

The template was reviewed by information-architect and junior-developer
passes to land progressive disclosure on what problem the runbook
resolves (Symptoms surfaced above secondary disambiguation) and how to
resolve it (Quick fix → Resolve with imperative commands and expected
output → Verify the fix landed → If a step fails → Escalate with
condition-then-channel → Rollback). Metadata block carries the trust
signals an on-call engineer needs at 2am: Severity, Triggers,
Reversible, Last validated vs. Last edited (distinct), Owner, Origin.
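
One way to see why Last validated and Last edited are kept distinct is a small check like the one below. It is a sketch under stated assumptions, not part of the skill: the function name `trust_warnings` and the 180-day default are hypothetical, and the skill itself only records the two dates.

```python
from datetime import date, timedelta
from typing import List


def trust_warnings(last_validated: date, last_edited: date, today: date,
                   max_age_days: int = 180) -> List[str]:
    """Return reasons an on-call engineer might distrust a runbook (hypothetical)."""
    warnings = []
    # An edit after the last validation means the current steps were never exercised.
    if last_edited > last_validated:
        warnings.append("edited since last validation")
    # The age limit is an assumed policy, not something the PR specifies.
    if today - last_validated > timedelta(days=max_age_days):
        warnings.append(f"not validated in over {max_age_days} days")
    return warnings
```

A single "last updated" date could not separate these two cases, which is the point of carrying both fields.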

Applies the project's YAGNI rule as a preflight: requires real evidence
(alert that has fired, documented incident, recurring task, live
failure mode on a service receiving traffic, or stakeholder commitment)
before producing the runbook. Speculative runbooks are deferred; the
user can override and the override is recorded in the runbook's Origin
field.

Includes plugin/skills/runbook/SKILL.md, the reviewed
references/runbook-template.md, docs/skills/runbook.md long-form
operator doc, and index updates across CLAUDE.md, docs/skills/README.md,
docs/concepts.md, README.md, and every long-form skill doc's
"All N skills" reference. Bumps the skill count from 20 to 21 and adds
an "Operations" purpose grouping in the skills index.

Refs #18

Add reverse boundary statements and Related documentation pointers so the
new /runbook skill is discoverable from the skills it sits next to:

- project-documentation, architectural-decision-record, coding-standard
  now point users to /runbook for operational scenarios, in both the
  SKILL.md frontmatter and the long-form "Do not invoke for" section.
- investigate's Related documentation now names the investigate -> runbook
  composition pair.

The CLAUDE.md skill catalog described the last group as 'operational',
while docs/skills/README.md uses 'Operations' and docs/concepts.md uses
'operations'. Align CLAUDE.md to match.

@mxriverlynn marked this pull request as ready for review May 28, 2026 15:53
@mxriverlynn merged commit 489eeb3 into main May 28, 2026
@mxriverlynn deleted the runbook-skill branch May 28, 2026 15:54