diff --git a/CLAUDE.md b/CLAUDE.md
index 4569cc9..87adc17 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -16,7 +16,7 @@ Han is a Claude Code plugin: a suite of skills and agents for solo (or small-tea
 │   ├── .claude-plugin/
 │   │   └── plugin.json
 │   ├── agents/  # 23 agent definitions (.md with frontmatter)
-│   ├── skills/  # 20 skill directories, each with SKILL.md + references/
+│   ├── skills/  # 21 skill directories, each with SKILL.md + references/
 │   └── references/  # Cross-skill reference files (e.g. yagni-rule.md)
 ├── docs/  # Operator-facing documentation
 │   ├── writing-voice.md  # Voice profile every doc follows
@@ -25,7 +25,7 @@ Han is a Claude Code plugin: a suite of skills and agents for solo (or small-tea
 │   ├── sizing.md
 │   ├── yagni.md
 │   ├── agents/  # Long-form docs for all 23 agents, plus README
-│   ├── skills/  # Long-form docs for all 20 skills, plus README
+│   ├── skills/  # Long-form docs for all 21 skills, plus README
 │   ├── guidance/  # Contributor-facing authoring guidance
 │   ├── templates/  # Templates and coverage rule for long-form docs
 │   ├── plans/  # Plan documents (one folder per plan; nested research lives inside)
@@ -56,7 +56,7 @@ The plugin is shipped from `plugin/`; documentation lives in `docs/`. Long-form
 
 ### Skill catalog (`docs/skills/`)
 
-- **[docs/skills/README.md](./docs/skills/README.md).** Index of all 20 skills grouped by purpose (planning, building, investigation and research, review, discovery, conventions, reporting). Start here when looking for the right slash command.
+- **[docs/skills/README.md](./docs/skills/README.md).** Index of all 21 skills grouped by purpose (planning, building, investigation and research, review, discovery, conventions, reporting, operations). Start here when looking for the right slash command.
 - **[docs/skills/plan-a-feature.md](./docs/skills/plan-a-feature.md).** Spec a feature from scratch through an evidence-based interview that walks the design tree and dispatches specialist reviewers.
 - **[docs/skills/plan-implementation.md](./docs/skills/plan-implementation.md).** Turn a feature specification into an implementation plan through a project-manager-led team conversation.
 - **[docs/skills/plan-a-phased-build.md](./docs/skills/plan-a-phased-build.md).** Split a body of context (gap analysis, PRD, design doc) into a numbered sequence of vertical-slice phases, each independently demoable.
@@ -77,6 +77,7 @@ The plugin is shipped from `plugin/`; documentation lives in `docs/`. Long-form
 - **[docs/skills/coding-standard.md](./docs/skills/coding-standard.md).** Create and update coding standards from existing patterns or evidence-based research.
 - **[docs/skills/architectural-decision-record.md](./docs/skills/architectural-decision-record.md).** Create, extract, or convert architectural decision records (ADRs).
 - **[docs/skills/update-pr-description.md](./docs/skills/update-pr-description.md).** Generate a PR description from the current branch's changes.
+- **[docs/skills/runbook.md](./docs/skills/runbook.md).** Create or update a runbook for a single operational scenario (alert that has fired, incident, recurring task, known failure mode). Applies a YAGNI preflight that requires real evidence before writing.
 
 ### Agent catalog (`docs/agents/`)
 
@@ -134,4 +135,4 @@ Folder selection rule: if the artifact is the plan, write to `docs/plans/{plan-n
 - **Every long-form doc links up.** The first bullet of the "Related Documentation" section always points back to the README at the repo root.
 - **Voice is uniform.** Every doc follows [docs/writing-voice.md](./docs/writing-voice.md). No em-dashes, direct second person, no flattery or hype.
 - **YAGNI applies to docs too.** Don't add speculative sections, for-future-flexibility warnings, or examples for behavior the skill doesn't have. The same evidence rule that gates plan steps gates docs.
-- **Counts to verify when editing indexes.** 23 agents in `plugin/agents/`; 20 skills in `plugin/skills/`; 23 long-form agent docs in `docs/agents/`; 20 long-form skill docs in `docs/skills/`.
+- **Counts to verify when editing indexes.** 23 agents in `plugin/agents/`; 21 skills in `plugin/skills/`; 23 long-form agent docs in `docs/agents/`; 21 long-form skill docs in `docs/skills/`.
diff --git a/README.md b/README.md
index c55e573..e575749 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@ Read [Concepts](./docs/concepts.md) for the skill-and-agent model that runs thro
 ## Which path are you on?
 
 - **New to han?** → Start with [Concepts](./docs/concepts.md), then the [Quickstart](./docs/quickstart.md).
-- **Looking for a specific skill?** → [Skills Index](./docs/skills/README.md). 20 skills grouped by purpose.
+- **Looking for a specific skill?** → [Skills Index](./docs/skills/README.md). 21 skills grouped by purpose.
 - **Looking for a specific agent?** → [Agents Index](./docs/agents/README.md). 23 agents grouped by role.
 - **Wondering how the agent swarms scale?** → [Sizing](./docs/sizing.md). The small / medium / large dispatch model used by `/architectural-analysis`, `/code-review`, `/gap-analysis`, `/iterative-plan-review`, `/plan-a-feature`, `/plan-implementation`, and `/research`.
 - **Wondering why a skill said "YAGNI"?** → [YAGNI](./docs/yagni.md). The evidence-based rule every planning, review, and architecture skill applies before committing items to an artifact.
@@ -34,7 +34,7 @@ Add the Test Double skills marketplace to Claude Code, then install the plugin:
 
 - [Concepts](./docs/concepts.md). Skill vs. agent, and how they compose. Read once before using the plugin.
 - [Quickstart](./docs/quickstart.md). Four paths for four common situations. Each path is a short sequence of skills.
-- [Skills Index](./docs/skills/README.md). All 20 skills, grouped by purpose.
+- [Skills Index](./docs/skills/README.md). All 21 skills, grouped by purpose.
 - [Agents Index](./docs/agents/README.md). All 23 agents, grouped by role.
 - [Sizing](./docs/sizing.md). The small / medium / large model that decides how many agents the swarming skills dispatch.
 - [YAGNI](./docs/yagni.md). The evidence-based "You Aren't Gonna Need It" rule every planning, review, and architecture skill applies.
diff --git a/docs/concepts.md b/docs/concepts.md
index f2b9050..f7d2ce1 100644
--- a/docs/concepts.md
+++ b/docs/concepts.md
@@ -92,7 +92,7 @@ Direct invocation uses the `Agent` tool with `subagent_type: han:{agent-name}` (
 
 ## What does the plugin include?
 
-- **20 skills.** The [skills index](./skills/README.md) groups them by purpose (planning, building, investigation and research, review, discovery, conventions, reporting).
+- **21 skills.** The [skills index](./skills/README.md) groups them by purpose (planning, building, investigation and research, review, discovery, conventions, reporting, operations).
 - **23 agents.** The [agents index](./agents/README.md) groups them by role (planning and facilitation, adversarial reviewers, investigation, architecture, testing, gap and content).
 
 Skim the indexes after you read this page. Pick the one skill you need right now. Come back later to learn the rest.
diff --git a/docs/research/runbook-skill-research.md b/docs/research/runbook-skill-research.md
new file mode 100644
index 0000000..6eaf57f
--- /dev/null
+++ b/docs/research/runbook-skill-research.md
@@ -0,0 +1,494 @@
+# Research: A `/runbook` skill for Han
+
+One open-ended question: how should a new `/runbook` skill produce runbooks in a consistent format, and what format, scope, inputs, and dispatch model should it adopt?
+
+Evidence mode: **strict** (default — every claim that bears on the recommendation is sourced or carried with an explicit single-source caveat).
+
+## Summary
+
+Industry runbook practice splits cleanly into two structural families — full operations manuals and per-alert incident triage documents — and the production-grade examples that combine both (GitLab, OpenShift) layer them, with a per-service README on top and small per-scenario runbook files underneath. A small core set of sections recurs across nearly every published format: who owns it, when to use it, the exact commands to run with their expected output, how to verify the fix worked, who to escalate to, how to roll back. Staleness is the universally cited failure mode, and the strongest mitigation is making the runbook live in version control next to the code it describes, with explicit owner and last-validated metadata that surface decay rather than hide it.
+
+For Han's primary audience of solo or small-team product engineers, the recommended skill is the simplest version that satisfies the evidence: a deterministic template installer that asks a few targeted questions, fills a single cross-format core template, enforces a YAGNI preflight that the runbook is grounded in something real (an alert that has fired, a recurring task, a live failure mode on a service that has traffic), and writes a single runbook file per invocation. The evidence is well-corroborated for the structural choices (template content, file location, staleness metadata) and medium-confidence for the input-collection style; an earlier, more elaborate "bounded interview plus optional specialist review" design did not survive adversarial validation and was simplified.
+
+## Research Results
+
+### Two structural families and one production hybrid
+
+Across the surveyed formats, runbooks split into two structural families. **Comprehensive operations manuals** (SkeltonThatcher's `run-book-template`, the Limoncelli seven-section model surfaced via PagerDuty and Process.st, Lab Zero's DevOps Runbook Guide, the Atlassian Confluence DevOps Runbook Template) treat a runbook as a wide-scope document covering service overview, architecture, deployment, configuration, routine operations, monitoring, and disaster recovery (A1, A2, A4, A7, A9). **Incident-focused triage documents** (Emmer's incident runbook template, Rootly's incident-response runbooks guide, OneUptime's effective-runbooks guide, Nobl9's runbook example, The Good Shell's incident runbook template) treat a runbook as a narrow, per-alert artifact organized around trigger, diagnosis, mitigation, verification, escalation, and rollback (A3, A12, A15, A16, A19).
+
+In live production at scale, the two families combine rather than compete. GitLab organizes its production runbooks at `{service-name}/{runbook-name}.md` within a dedicated runbooks repository: each service directory has a README that covers the operations-manual concerns and separate per-scenario `.md` files that handle individual symptoms (A6). OpenShift uses `alerts/{operator-name}/{AlertName}.md` — naming the file after the alert it answers (A17). Google SRE's "playbook entry" model is structurally equivalent: every alert ties to a playbook entry with severity, impact, debugging suggestions, and mitigation steps (A13).
+
+### The cross-format core: sections that appear nearly everywhere
+
+A small set of sections recurs across most of the surveyed formats. These are the sections corroborated independently by at least four sources, and they map directly to what an engineer actually needs at 3am: header metadata (owner, last-updated, last-validated date, severity, alert or trigger link), a trigger or "when to use" statement, step-by-step procedure written in imperative voice with copy-pasteable commands and expected output per step, verification of the fix, escalation path, and rollback (A3, A12, A15, A16; reinforced by GitLab and OpenShift production practice in A6 and A17). Sections that are format-specific or supported by only a single source — explicit incident-commander/comms-lead role assignments (A19), SLA section (A1, A7), full deployment instructions (A1, A7), and Lab Zero's "Future Considerations" (A9) — are real choices but lower-confidence, and are not in the cross-format core.
+
+The "five A's" framework (Actionable, Accessible, Accurate, Authoritative, Adaptable) appears in three sources — Emmer, Rootly, and incident.io (A3, A12, A20) — and is a useful vocabulary, but it does not appear in Google SRE, GitLab production practice, or OpenShift production practice [V8]. Treat it as helpful shorthand, not industry-standard terminology.
+
+### Staleness is the universally cited failure mode
+
+Every surveyed source identifies staleness as the runbook failure mode that matters most. Google SRE names a specific tension: the more detailed the runbook, the faster it goes out of date as systems change (A13). The Hacker News practitioner thread confirms staleness as the most-cited reason runbooks get abandoned (A14). Vendor-source claims that "outdated runbooks are worse than no runbooks" (A6) and "if an engineer runs a command that fails, they will stop using the runbook entirely" (A8) appear with commercial interest behind them [V3, V8], but the directional finding — staleness destroys trust — is corroborated by non-vendor production practice in GitLab and OpenShift (active per-runbook maintenance) and by Google SRE.
+
+The strongest mitigation in the evidence is structural: keep the runbook in version control next to the code it describes (A6, A8, A9, A13), ship runbook updates in the same pull request as infrastructure changes (A6), and require owner plus last-validated metadata on every runbook so decay is visible rather than hidden (A6 dual-date tracking; A9 metadata headers; A13 ownership fields). Game-day testing and quarterly review cycles are corroborated mitigations (A9, A10, A12), but they are workflow recommendations the skill cannot enforce on its own.
+
+### Audience: 3am on-call, with a documented gap for solo engineers
+
+Every incident-focused source explicitly designs for an on-call engineer under pressure who may have been onboarded recently and may have been pulled out of REM sleep (A3, A12, A14, A16). Operations-manual sources additionally serve new-hire onboarding and ops-team reference (A2, A4, A9). The Hacker News thread surfaces a real tension: runbooks written for new hires drift into over-explanation that experienced engineers don't want when an alert fires (A14).
+
+No surveyed source specifically designs for solo or small-team product engineers — Han's primary audience. The closest fits are the Atlassian (A4) and Lab Zero (A9) operations-manual formats, both of which implicitly assume a team with dedicated ops roles. This is a documented audience gap, not corroborated evidence that any specific format serves Han's audience [V2]. The implication is that an opinionated, low-friction format is more valuable to Han's audience than a comprehensive format borrowed from enterprise practice.
+
+### Level of detail: imperative commands with expected output, not prose
+
+There is near-universal convergence across sources on the right level of detail: exact, copy-paste-ready commands written in imperative voice, with expected output for each step (A1, A2, A8, A9, A12, A13, A16). Rootly's framing — "every step should be a command, not a paragraph" (A12) — is corroborated by practitioner reports that placeholders requiring mental substitution at 3am are a usability failure (A14). Screenshots are recommended as supplements for visually complex steps, not as substitutes (A1, A12). Troubleshooting trees with conditional branches handle non-deterministic incident paths (A2, A12).
+
+### Input modes in practice
+
+Four input modes are observable in the field, but only the first is broadly corroborated:
+
+- **Deterministic template fill-in** is the dominant documented approach: a shared template the engineer fills in, with the structural scaffolding providing the consistency and the human supplying the specifics (A1, A2, A9, A10).
+- **Questionnaire / onboarding interview** appears in AWS Incident Detection and Response, which uses a CLI-based onboarding questionnaire to derive runbook drafts (A5). This is a managed-service enterprise practice rather than a general industry pattern [V2], and Nobl9 (a vendor source [V8]) similarly recommends SME surveys for wider input collection (A10).
+- **Postmortem-derived** continuous improvement — runbooks updated from incident action items — is a corroborated authoring trigger, not a primary creation mode (A3, A4, A6, A13, A16).
+- **System-area scan** — generating a runbook from code or infrastructure inspection — is not described by any surveyed source.
+
+The academic FATA framework (A14) claims 27.7–47.4% quality improvement from proactive clarifying questions, but the figures are single-source, the paper is not runbook-specific, and the magnitude cannot be independently verified [V1]. Discounting A14 entirely leaves the directional claim (asking targeted questions before generating produces better output) corroborated weakly by A5 and A10, both of which are domain- or audience-mismatched.
+
+### File location and naming
+
+There is no single industry standard, but corroborated directional patterns are clear. GitLab uses `{service}/{runbook-name}.md` in a dedicated runbooks repository (A6). OpenShift uses `alerts/{operator-name}/{AlertName}.md` (A17). OneUptime recommends `{service}-{action}-{scope}.md` (A9, single-source on that exact formula). The directional convention — service or alert identifier first in the name, kebab-case, version-controlled, alongside or adjacent to code — is corroborated across A6, A7, A9, A13, A17.
+
+For Han, which writes into whatever target project it is invoked in, the simplest defensible convention is `docs/runbooks/{slug}.md` with kebab-case slugs of the form `{service-or-area}-{scenario}`. Subdirectories per service (`docs/runbooks/{service}/{scenario}.md`) are the right form for projects with multiple services.
+
+### Han codebase patterns this skill must align with
+
+Han's documentation-producing skills (`project-documentation`, `architectural-decision-record`, `coding-standard`, `stakeholder-summary`) follow a uniform pattern: resolve project context from CLAUDE.md's Project Discovery section first, fall back to `project-discovery.md`, then glob for defaults; ask before overwriting an existing file; optionally dispatch agents for review; write the output; update CLAUDE.md or an index. The skill skeleton — YAML frontmatter, optional Pre-requisites section, Project Context block of context-injection commands, numbered imperative steps — is consistent across all skills, and a reference folder holds the templates the SKILL.md reads.
+
+Two scope facts from existing agents are load-bearing. The `on-call-engineer` agent **explicitly excludes runbook documents** from its scope, naming them as `devops-engineer`'s domain. The `devops-engineer` agent **explicitly names "runbook for an alert that has never fired" as a YAGNI anti-pattern**, requiring evidence the alert is firing or imminently will. The canonical example is Sentry runbooks for staging-only Sentry where data isn't reaching production. This is a strict version of the YAGNI gate, and applying it as a blanket prohibition on proactive runbooks would over-trigger [V5]: a runbook for a known failure mode (disk full, OOM kill) on a service that is in production but hasn't yet hit that failure is not the same as a runbook for an alert that will never fire because no signal flows.
+
+The relevant Han skills the `/runbook` skill should not absorb the work of: `/project-documentation` (feature and system docs, not operational triage), `/architectural-decision-record` (one-off decisions, not repeatable procedures), `/coding-standard` (conventions, not operational procedures). `/project-documentation` could in principle produce a runbook-shaped document but its template is for Overview / Key Files / Behavior / Configuration / Error Handling, not for triage sequences with copy-pasteable commands [V4].
+
+## Options to Consider
+
+### O1: Comprehensive operations-manual format (Limoncelli / SkeltonThatcher model)
+
+- **What it is:** One wide-scope document per service covering overview, architecture, deploy, ops, monitoring, troubleshooting.
+- **Trade-offs:** Captures everything a new team member needs. Broad maintenance surface; harder to scan at 3am. Conflates "operate this day-to-day" with "respond to this alert now." Largest staleness exposure.
+- **Rests on:** A1, A2, A7, A9
+- **Evidence status:** corroborated
+
+### O2: Per-alert incident-focused runbook (Emmer / Rootly / OneUptime model)
+
+- **What it is:** One runbook per alert or failure mode: trigger, diagnosis, mitigation, verification, escalation, rollback.
+- **Trade-offs:** Best 3am usability and alert-to-runbook linking; lowest per-document maintenance surface. Requires mature alerting infrastructure; may be overhead for solo engineers maintaining many small files.
+- **Rests on:** A3, A12, A15, A16, A17, A19 (with vendor-source caveat per V3 on the breadth of "6+ corroborating sources")
+- **Evidence status:** corroborated, with vendor weighting noted
+
+### O3: Two-layer hybrid (GitLab / OpenShift production pattern)
+
+- **What it is:** Per-service README covering operations-manual concerns plus per-scenario runbook files in the incident-focused structure.
+- **Trade-offs:** Separates "understanding the system" from "resolving the alert." Two artifacts to maintain. No single published template targets this for small teams; the synthesis is from observed production practice.
+- **Rests on:** A6, A17 with synthesis [reasoning]
+- **Evidence status:** corroborated for the pattern; [reasoning] for the small-team applicability
+
+### O4: Minimal cross-format core template (deterministic fill-in)
+
+- **What it is:** A single lean template — header metadata, trigger, steps, verification, escalation, rollback — that the user fills in. No interview, no agent dispatch.
+- **Trade-offs:** Lowest build complexity. Lowest per-runbook authoring friction for experienced authors. Blank-page problem for first-time authors; relies on the template prompting being good. No architecture context — that lives elsewhere.
+- **Rests on:** A3, A12, A15, A16; reinforced by GitLab and OpenShift production sections in A6, A17
+- **Evidence status:** corroborated
+
+### O5: Bounded-interview hybrid with optional specialist review (original recommendation)
+
+- **What it is:** A bounded interview collecting service identity, trigger, mitigation commands with expected output, escalation, rollback, validation contact; deterministic template producing the draft; optional `devops-engineer` review pass.
+- **Trade-offs:** Captures structured input quality benefits — if those benefits are real. Most complex to build and test. The "optional devops-engineer review" component does not match an existing agent's protocols [V6]. The interview-over-template advantage rests on a single domain-mismatched academic paper (A14) and a single audience-mismatched enterprise practice (A5) [V1, V2, V7].
+- **Rests on:** A5 [V2], A14 [V1], A10 (vendor), plus the corroborated cross-format core
+- **Evidence status:** core sections corroborated; interview-structure justification weakened to single-source after validation; "optional review" component refuted
+
+### O6: Template installer with YAGNI preflight and mandatory staleness metadata
+
+- **What it is:** A deterministic skill that resolves project context, asks a small number of targeted questions (service or area, scenario or trigger, owner, last-validated date), enforces a YAGNI preflight (the scenario is real — an alert has fired, a recurring task exists, or a live service has the failure mode in scope), fills the minimal cross-format core template, and writes one runbook per invocation to `docs/runbooks/{slug}.md`. No agent dispatch. Mandatory metadata fields force staleness signals to be visible from day one.
+- **Trade-offs:** Simplest version that satisfies the evidence. Aligns with how `/architectural-decision-record` works — install a template, force the user to surface the forcing function. No interview loop; the targeted questions are part of normal Project Context resolution rather than a separate conversational phase. Misses the audience benefit (if any) of free-form clarifying conversation. Quality depends heavily on template prompting.
+- **Rests on:** A3, A12, A15, A16 for sections; A6, A17 for naming and version-control conventions; Han's `devops-engineer` agent and `yagni-rule.md` for the YAGNI preflight; V4 and V7 for framing.
+- **Evidence status:** corroborated
+
+## Recommendation
+
+**Recommendation: O6 — template installer with YAGNI preflight and mandatory staleness metadata, writing one runbook per invocation to `docs/runbooks/{slug}.md`.**
+
+This is the original-recommendation O5 with the components that did not survive validation stripped out: the bounded interview is reduced to targeted preflight questions, and the "optional `devops-engineer` review" is removed because that agent's protocols do not match runbook-document review [V6]. The result is structurally closer to how `/architectural-decision-record` works, which is the right shape for a Han skill: a template, a forcing-function gate, and the human filling in the specifics with prompting in the template itself.
+
+**Evidence basis:**
+
+- **Cross-format core sections** (header metadata, trigger, steps with imperative copy-pasteable commands and expected output, verification, escalation, rollback) — corroborated by at least four independent sources (A3, A12, A15, A16) and reinforced by GitLab and OpenShift production practice (A6, A17). This is the strongest evidence in the set.
+- **File location and naming** (`docs/runbooks/{slug}.md`, kebab-case, service-or-area first) — corroborated directionally across A6, A7, A9, A13, A17. The exact path is a defensible synthesis of corroborated patterns.
+- **Mandatory staleness metadata** (owner, last-validated date) — corroborated mitigations against the universally cited staleness failure mode, anchored in non-vendor practice (GitLab and OpenShift active maintenance, Google SRE post-page updates) even after vendor sources are downweighted [V8].
+- **YAGNI preflight** — anchored in Han's own `devops-engineer` agent and `yagni-rule.md`, with the threshold tuned to match the rule's actual evidence test (real production code path, documented incident, real alert that has fired, recurring task, measured metric) rather than the strict Sentry-style "no signal at all" form [V5]. The preflight warns and offers to proceed if evidence is thin; it blocks only when the scenario is purely speculative.
+- **Single-runbook scope per invocation** — the per-alert incident-focused model is the most usable shape at 3am (A14, A12), and producing one runbook at a time keeps each invocation focused.
+
+**What this recommendation does not rest on:** the FATA framework's quality-improvement figures (A14, single-source and domain-mismatched [V1]); AWS IDR's questionnaire-to-template pipeline as a model for solo engineers (A5, audience-mismatched [V2]); vendor-sourced strong-form staleness quotes (A6, A8 [V3, V8]); the "five A's" framework as industry-standard vocabulary [V8]; specialist-agent review of runbook drafts [V6].
+
+**Deciding criteria for teams who would want a different answer:**
+
+- Mature alerting infrastructure with many services and well-named alerts: O2 scales better than O6 because alert-to-runbook linking becomes the primary navigation.
+- Onboarding is the primary use case: O1 (full operations manual) fits the need; `/project-documentation` may already cover it.
+- A team with a real DevOps reviewer in the loop: O5's "specialist review" component becomes meaningful if it's a human reviewer rather than an agent that doesn't match the protocol [V6].
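Reviewer note (not part of the patch): the Recommendation section describes two mechanical pieces, the `docs/runbooks/{slug}.md` naming rule and the three-way YAGNI preflight (proceed, warn and offer to proceed, block). The sketch below is one possible reading of that text, not the skill's actual implementation. The function names, the evidence labels, and the `live_service` flag are all hypothetical, and treating "live service with no stronger evidence" as the thin-but-plausible case is an assumption drawn from the V5 wording.

```python
import re
from enum import Enum


class Preflight(Enum):
    PROCEED = "proceed"
    WARN = "warn"      # thin evidence: warn and offer to proceed
    BLOCK = "block"    # purely speculative scenario


# Hypothetical labels for the evidence kinds listed in the YAGNI preflight bullet.
STRONG_EVIDENCE = frozenset({
    "production_code_path",
    "documented_incident",
    "alert_fired",
    "recurring_task",
    "measured_metric",
})


def slugify(text):
    """Lowercase, collapse runs of non-alphanumerics to '-', trim edge dashes."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def runbook_path(service_or_area, scenario):
    """Build docs/runbooks/{service-or-area}-{scenario}.md in kebab-case."""
    return f"docs/runbooks/{slugify(service_or_area)}-{slugify(scenario)}.md"


def yagni_preflight(evidence, live_service):
    """Classify a runbook request (assumed mapping, see lead-in)."""
    if STRONG_EVIDENCE & set(evidence):
        return Preflight.PROCEED
    if live_service:
        return Preflight.WARN
    return Preflight.BLOCK
```

For example, `runbook_path("Payments API", "Disk Full")` yields `docs/runbooks/payments-api-disk-full.md`, and a request with no evidence against a service that is not live is classified as `Preflight.BLOCK`. Real calibration of the warn/block boundary would need the testing on real invocations that the Confidence Assessment calls for.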
+
+## Validation
+
+### V1: FATA framework is single-source and domain-mismatched
+
+- **Strategy:** Challenge the Evidence
+- **Investigation:** Checked corroboration for arXiv 2508.08308's 27.7–47.4% quality-improvement claim; found no independent replication and no runbook-specific application of the framework.
+- **Result:** Partially Refuted — the magnitude figures are not load-bearing; the directional claim survives weakly via A5 and A10.
+- **Impact:** Recommendation pivoted away from interview-structure as primary justification; O5 → O6.
+
+### V2: AWS IDR is audience-mismatched to Han users
+
+- **Strategy:** Challenge the Evidence
+- **Investigation:** A5 is a paid enterprise managed-service workflow; no source addresses solo or small-team product engineers, Han's primary audience.
+- **Result:** Partially Refuted — A5 establishes that interview-driven runbook collection exists in practice, but not that it generalizes to Han's audience.
+- **Impact:** The bounded-interview component lost its primary corroboration; reduced to a few targeted preflight questions in O6.
+
+### V3: Vendor-blog concentration inflates the "6+ sources" cross-format core claim
+
+- **Strategy:** Challenge the Evidence-Gathering Integrity
+- **Investigation:** Reweighted sources: Rootly, OneUptime, FireHydrant, incident.io, Nobl9 sell adjacent tooling. Independent non-vendor corroboration reduces to Emmer + The Good Shell + GitLab production + OpenShift production.
+- **Result:** Partially Refuted — the cross-format core survives, but the strength is "4 independent sources including production practice" rather than "6+ corroborating sources."
+- **Impact:** Sections beyond the core (specifically dual-date metadata, severity level) are treated as recommended-not-required in the template.
+
+### V4: "No new skill" and "template installer" options were missing from the framing
+
+- **Strategy:** Challenge the Options Framing
+- **Investigation:** `/project-documentation` has a Guard check that suggests siblings for ADRs and standards; a runbook is neither, and that skill's template targets feature/system docs, not triage. A template-installer option matching how `/architectural-decision-record` works was not framed.
+- **Result:** Refuted — the original options set had a real gap.
+- **Impact:** Added O6; recommendation pivoted to it.
+
+### V5: The YAGNI gate threshold is undefined
+
+- **Strategy:** Challenge the Assumptions
+- **Investigation:** The `devops-engineer` agent's canonical YAGNI example (Sentry where no data flows) is a stricter case than "alert has not fired yet on a live service." The general YAGNI rule allows several other forms of evidence (production code path, recurring task, measured metric).
+- **Result:** Confirmed — the original O5 was ambiguous on which gate applies.
+- **Impact:** O6 specifies that the preflight blocks purely speculative runbooks and warns-and-proceeds when evidence is thin but plausible (live service, known failure mode class, recurring task).
+
+### V6: "Optional devops-engineer review" component is phantom
+
+- **Strategy:** Challenge the Recommendation
+- **Investigation:** `devops-engineer`'s protocols cover DORA, Twelve-Factor, Golden Signals, production-readiness — not runbook-document quality review. `on-call-engineer` explicitly excludes runbook documents from its scope. No Han agent's defined scope covers runbook-document quality.
+- **Result:** Confirmed — the component is unvalidated.
+- **Impact:** Removed from O6.
+
+### V7: Discounting A14 + A5 erases O5's edge over O4
+
+- **Strategy:** Challenge the Recommendation
+- **Investigation:** With A14 (FATA) discounted as single-source and A5 (AWS IDR) downweighted as audience-mismatched, the interview-vs-template distinction in O5 loses its evidence base. The remaining substantive differences (YAGNI gate, forced metadata, structured output) can all be implemented in a template installer.
+- **Result:** Refuted — the original O5 recommendation does not survive when its weakest sources are discounted.
+- **Impact:** Recommendation rewritten to O6.
+
+### V8: "Five A's" framework and strong-form staleness claims are vendor-anchored
+
+- **Strategy:** Challenge the Evidence-Gathering Integrity
+- **Investigation:** The "five A's" appears in Emmer (practitioner blog), Rootly (vendor), incident.io (vendor) — three sources, only one non-vendor. Strong-form staleness quotes ("worse than no runbooks") trace to SupportBench and UptimeLabs (both vendors). Non-vendor practice (Google SRE, GitLab, OpenShift) corroborates the direction but not the strong form.
+- **Result:** Partially Refuted — staleness mitigations survive (owner, last-validated date); the "five A's" treated as shorthand, not as industry-standard vocabulary.
+- **Impact:** Template does not adopt "five A's" as required structure; staleness metadata fields are required.
+
+### Adjustments Made
+
+The recommendation was rewritten from O5 (bounded interview + minimal template + optional devops-engineer review) to O6 (template installer with YAGNI preflight and mandatory staleness metadata). The change is driven by V4 (missed option), V6 (phantom review component), and V7 (interview justification collapses under V1 + V2 discounting). The structural choices the original recommendation made — cross-format core sections, file location and naming, mandatory staleness metadata, YAGNI gate — all survive; the input-collection style is simpler.
+ +### Confidence Assessment + +- **Confidence:** Medium-high on structural and content choices; medium on input-collection style. +- **Remaining risks:** + - **Audience gap is unresolved.** No source addresses solo or small-team product engineers; the recommendation extrapolates from enterprise and open-source-infrastructure practice. If the audience needs differ materially, the template's prompting will need iteration. + - **YAGNI gate calibration.** V5 set the threshold direction; the actual prompts in the preflight need to be tested on real invocations to confirm they neither over-block (refusing reasonable proactive runbooks) nor under-block (waving through speculative ones). + - **No agent reviews runbook quality.** If quality variance is high after the skill ships, the answer is either to define a new `runbook-reviewer` agent with the right protocols or to harden the template's in-line prompting. Both are deferred until evidence of variance accumulates. + - **Single-author bias.** The template installer model assumes the author and the eventual runbook user may be the same person (solo / small team). For larger teams, the absent peer-review workflow (corroborated as a mitigation in A4, A6, A8) is unaddressed; teams that need it can layer their normal PR review on top. + +## Artifacts + +### A1: PagerDuty — What is a Runbook? +- **Link / location:** https://www.pagerduty.com/resources/automation/learn/what-is-a-runbook/ +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor — interested-party scrutiny) +- **Summary:** Defines a runbook as a how-to guide for repeated tasks; cites Limoncelli's seven sections (Overview, Build, Deploy, Common Tasks, Pager Playbook, DR, SLA). Distinguishes runbook (single-task) from playbook (multi-runbook strategy). 
+- **Evidence status:** corroborated by A7 + +### A2: SkeltonThatcher run-book-template +- **Link / location:** https://github.com/SkeltonThatcher/run-book-template +- **Retrieved:** 2026-05-28 +- **Trust class:** web (open-source template) +- **Summary:** Ten-section operations-manual template covering service overview, characteristics, resources, security, configuration, backup/restore, monitoring, operational tasks, maintenance, failover/recovery. Dev team owns it. +- **Evidence status:** corroborated by A1, A7, A9 on operations-manual breadth + +### A3: Christian Emmer — An Effective Incident Runbook Template +- **Link / location:** https://emmer.dev/blog/an-effective-incident-runbook-template/ +- **Retrieved:** 2026-05-28 +- **Trust class:** web (independent practitioner blog) +- **Summary:** Five-section incident runbook (Summary, Triage, Mitigation, Validation, Remediation). Introduces "five A's." Cites Google SRE's 3x MTTR improvement claim. Recommends continuous editing over formal review cycles. +- **Evidence status:** corroborated by A12, A20 on five A's; by A13 on the 3x claim + +### A4: Atlassian Confluence — DevOps Runbook Template +- **Link / location:** https://www.atlassian.com/software/confluence/templates/devops-runbook +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Three-section template: architecture overview, contacts, procedures (start/stop/monitor/troubleshoot). Heavier on architecture and contacts than on incident branching. +- **Evidence status:** corroborated by A2, A6 on architecture-overview structure + +### A5: AWS Incident Detection and Response — Develop Runbooks +- **Link / location:** https://docs.aws.amazon.com/IDR/latest/userguide/idr-workloads-dev-runbook.html +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor / managed service) +- **Summary:** AWS IDR uses a CLI-based onboarding questionnaire to derive runbook drafts for enterprise customers. 
+- **Evidence status:** single-source for questionnaire-to-runbook pipeline; audience-mismatched per V2 + +### A6: GitLab Runbooks repository +- **Link / location:** https://runbooks.gitlab.com/ ; https://gitlab.com/gitlab-com/runbooks +- **Retrieved:** 2026-05-28 +- **Trust class:** web (production open-source practice) +- **Summary:** Service-centric directory structure matching service catalog; per-service README plus individual `.md` runbooks organized by symptom. Naming kebab-case. Co-located with infrastructure code; updates ship in same PRs. +- **Evidence status:** corroborated by A17 on alert-keyed organization + +### A7: Process.st — How to Create a Runbook +- **Link / location:** https://www.process.st/create-a-runbook/ +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor blog) +- **Summary:** Independent corroboration of Limoncelli's seven sections; recommends plan/write/test phases. +- **Evidence status:** corroborated by A1 + +### A8: UptimeLabs — Incident Response Runbook +- **Link / location:** https://uptimelabs.io/learn/what-is-an-incident-response-runbook/ +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Service teams own service runbooks; SRE owns shared infrastructure runbooks. PR-based updates. "If an engineer runs a command that fails, they will stop using the runbook entirely." +- **Evidence status:** strong-form staleness quote single-sourced per V8; ownership model corroborated by A23 (IncidentHub), A28 (drdroid) + +### A9: Lab Zero — DevOps Runbook Guide +- **Link / location:** https://guides.labzero.com/technical_guides/dev_ops_runbook_guide.html +- **Retrieved:** 2026-05-28 +- **Trust class:** web (consultancy guide) +- **Summary:** "Table of contents" model: Overview, Observability, Onboarding, Admin, Deploy, Server, Services, Config, Certificates, Further Docs, Known Failures, Future Considerations. Runbook as link hub more than play-by-play. 
+- **Evidence status:** corroborated by A2 on operations-manual breadth + +### A10: Nobl9 — Runbook Example: A Best Practices Guide +- **Link / location:** https://www.nobl9.com/it-incident-management/runbook-example +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor — SLO tooling, commercially adjacent) +- **Summary:** Core template: Title/Objective, Triggers, Instructions, Outcomes, Escalation, Contact. Recommends SME conversations and surveys for input collection. Emphasizes cross-runbook consistency. +- **Evidence status:** corroborated by A12, A15 on sections; vendor weighting per V8 + +### A11: GitLab Runbooks — directory layout (operational artifact) +- **Link / location:** https://gitlab.com/gitlab-com/runbooks (tree) +- **Retrieved:** 2026-05-28 +- **Trust class:** codebase-equivalent (production open-source repository) +- **Summary:** Folder convention explicitly tied to service catalog; explicit rule against ad-hoc top-level directories. +- **Evidence status:** corroborated by A6 (same source, operational view) + +### A12: Rootly — Incident Response Runbooks Guide +- **Link / location:** https://rootly.com/incident-response/runbooks +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor — incident management) +- **Summary:** Seven sections: Trigger/Detection, Impact, Containment, Resolution, Validation, Communication, Post-Incident. Endorses five A's. Recommends copy-pasteable commands, version control, quarterly reviews. +- **Evidence status:** corroborated by A3, A15, A16; vendor weighting per V3 + +### A13: Google SRE Workbook — On-Call +- **Link / location:** https://sre.google/workbook/on-call/ +- **Retrieved:** 2026-05-28 +- **Trust class:** web (industry reference, non-vendor) +- **Summary:** Google calls their runbooks "playbooks"; every alert ties to a playbook entry with severity, impact, debugging, mitigation. Names the specificity-vs-staleness tension. Update after every page. 
+- **Evidence status:** corroborated by A3 on 3x MTTR claim; anchors several other findings as non-vendor source + +### A14: arXiv 2508.08308 — FATA framework +- **Link / location:** https://arxiv.org/html/2508.08308v1 +- **Retrieved:** 2026-05-28 +- **Trust class:** web (academic, single-source) +- **Summary:** Claims 27.7–47.4% quality improvement from proactive clarifying questions. Not runbook-specific. +- **Evidence status:** single-source; magnitude figures not load-bearing per V1 + +### A15: Nobl9 — Runbook Example (separate from A10) +- **Link / location:** see A10 +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Merged into A10 in this consolidated registry; retained as ID for cross-references from upstream research. +- **Evidence status:** see A10 + +### A16: OneUptime — How to Create Effective Runbooks +- **Link / location:** https://oneuptime.com/blog/post/2026-02-02-effective-runbooks/view +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Seven sections: Metadata Header (ID, version, owner, duration, risk), Trigger, Prerequisites, Steps with expected outputs, Verification, Escalation, Rollback. Imperative voice; copy-paste-ready commands; change-triggered review; monthly game days. Recommends `{service}-{action}-{scope}.md` naming. +- **Evidence status:** corroborated by A12 on structure; naming pattern single-source on exact form (direction corroborated by A6) + +### A17: OpenShift Runbooks +- **Link / location:** https://github.com/openshift/runbooks +- **Retrieved:** 2026-05-28 +- **Trust class:** codebase-equivalent (production open-source) +- **Summary:** `alerts/{operator-name}/{AlertName}.md` naming. Files named after the alert they address. Confirms alert-to-runbook linking as first-class. 
+- **Evidence status:** corroborated by A6 on alert-keyed organization + +### A18: FireHydrant — Runbook Best Practices +- **Link / location:** https://docs.firehydrant.com/docs/runbook-best-practices +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Six-section template (Title, Scope, Objective, Steps, Troubleshooting, References) with branching paths and visual aids. +- **Evidence status:** corroborated by A9, A15 on scope/references structure; vendor weighting per V3 + +### A19: The Good Shell — Incident Runbook Template +- **Link / location:** https://thegoodshell.com/incident-runbook-template/ +- **Retrieved:** 2026-05-28 +- **Trust class:** web (technical blog) +- **Summary:** Ten-section incident-focused template with explicit role assignments (incident commander, ops lead, comms lead, scribe) and communication templates per severity level. +- **Evidence status:** role-assignments component single-source per V8 + +### A20: incident.io — What Are Runbooks? +- **Link / location:** https://incident.io/blog/what-are-runbooks +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor — incident management) +- **Summary:** Endorses five A's framework. Distinguishes runbooks (specific incidents, technical) from playbooks (overall strategy). +- **Evidence status:** five A's corroborated weakly per V8 + +### A21: SupportBench — Runbook Maintenance Best Practices +- **Link / location:** https://www.supportbench.com/how-to-maintain-runbooks-when-engineering-changes-processes/ +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor — customer support software) +- **Summary:** Identifies unclear ownership and lack of change-control integration as primary rot causes. Recommends ship-with-code, dual-date tracking, peer review. "Outdated runbooks are worse than no runbooks." 
+- **Evidence status:** strong-form quote single-sourced per V8; ship-with-code mitigation corroborated by A6 + +### A22: incident.io — Automated Runbooks Guide +- **Link / location:** https://incident.io/blog/automated-runbook-guide +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Addresses staleness/trust problem directly. Service-based ownership via service catalog. Identifies trigger types (alert, webhook, manual, scheduled). +- **Evidence status:** corroborated by A8, A21 on staleness; service-based ownership corroborated by A6 + +### A23: IncidentHub — No-Nonsense Guide to Runbook Best Practices +- **Link / location:** https://blog.incidenthub.cloud/The-No-Nonsense-Guide-to-Runbook-Best-Practices +- **Retrieved:** 2026-05-28 +- **Trust class:** web (technical blog) +- **Summary:** Service-team vs. SRE/ops ownership split; post-incident update by on-call engineer reviewed by peers. Recommends descriptive titles like `runbook-cpu-usage-critical-alert`. +- **Evidence status:** corroborated by A8, A16 on ownership + +### A24: Cortex — Runbooks vs. Playbooks +- **Link / location:** https://www.cortex.io/post/runbooks-vs-playbooks +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Runbooks tactical, playbooks strategic. DR playbook contains a runbook per technical sub-task. +- **Evidence status:** corroborated by A20 on tactical/strategic split + +### A25: Cutover — Runbooks vs. Playbooks vs. SOPs +- **Link / location:** https://cutover.com/blog/differences-runbooks-playbooks-sops +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Playbooks strategic-adaptive; runbooks complex multi-step known operations; SOPs granular routine. Predictability spectrum. +- **Evidence status:** corroborated by A24, A26 + +### A26: Upstat — Runbook vs. 
SOP +- **Link / location:** https://upstat.io/blog/runbook-vs-sop +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Runbooks reactive with branching; SOPs proactive linear. Deployment guides are specialized SOPs. +- **Evidence status:** corroborated by A25 + +### A27: Hacker News — Writing Runbook Documentation When You're an SRE (thread) +- **Link / location:** https://news.ycombinator.com/item?id=22207452 +- **Retrieved:** 2026-05-28 +- **Trust class:** web (practitioner discussion, mixed) +- **Summary:** Practitioner debate: staleness most-cited failure mode. Recommendations for single-page keyword-dense format, `$VARIABLE` notation for safe copy-paste, sample expected outputs. +- **Evidence status:** corroborated by A13, A16 on practitioner experience + +### A28: drdroid — Runbooks Guide for SRE / On-Call Teams +- **Link / location:** https://drdroid.io/guides/runbooks-guide-for-sre-on-call-teams +- **Retrieved:** 2026-05-28 +- **Trust class:** web (vendor) +- **Summary:** Runbook creation as documentation requirement for new launches; on-call updates after incidents. +- **Evidence status:** corroborated by A4, A23 + +### A29: Han codebase — `plugin/skills/project-documentation/SKILL.md` +- **Link / location:** plugin/skills/project-documentation/SKILL.md +- **Retrieved:** n/a (codebase current state) +- **Trust class:** codebase +- **Summary:** Resolves docs directory from CLAUDE.md Project Discovery section; falls back to project-discovery.md. Dispatches 2-3 codebase-explorer agents in parallel. Updates CLAUDE.md with reference. Asks before overwriting. +- **Evidence status:** corroborated by A30, A31 on shared skeleton + +### A30: Han codebase — `plugin/skills/architectural-decision-record/SKILL.md` +- **Link / location:** plugin/skills/architectural-decision-record/SKILL.md +- **Retrieved:** n/a +- **Trust class:** codebase +- **Summary:** Writes to discovered ADR directory with one- or two-level filename hierarchy. 
Template installer pattern with forcing-function YAGNI gate. Dispatches validators. +- **Evidence status:** corroborated by A29, A31 + +### A31: Han codebase — `plugin/skills/coding-standard/SKILL.md` +- **Link / location:** plugin/skills/coding-standard/SKILL.md +- **Retrieved:** n/a +- **Trust class:** codebase +- **Summary:** Writes to `{docs-dir}/coding-standards/{name}.md` plus path-scoped index files. Applies YAGNI gate with evidence of active use and friction. +- **Evidence status:** corroborated by A29, A30 + +### A32: Han codebase — `plugin/agents/on-call-engineer.md` +- **Link / location:** plugin/agents/on-call-engineer.md +- **Retrieved:** n/a +- **Trust class:** codebase +- **Summary:** Hard scope boundary: "You do not audit ... runbook documents ... Those belong to `devops-engineer`. Your altitude is application source files only." +- **Evidence status:** load-bearing for skill ownership question + +### A33: Han codebase — `plugin/agents/devops-engineer.md` +- **Link / location:** plugin/agents/devops-engineer.md +- **Retrieved:** n/a +- **Trust class:** codebase +- **Summary:** Explicitly reads runbooks. Names "runbook for an alert that has never fired" as a YAGNI anti-pattern; canonical example is Sentry runbooks where data isn't reaching production. +- **Evidence status:** load-bearing for YAGNI preflight + +### A34: Han codebase — `plugin/references/yagni-rule.md` +- **Link / location:** plugin/references/yagni-rule.md +- **Retrieved:** n/a +- **Trust class:** codebase +- **Summary:** Gate 1 evidence test: at least one of user-described need, named dependency, production code path, regulatory rule, documented incident, real alert that fired, customer report, measured metric. Gate 2: simpler-version test. 
+- **Evidence status:** load-bearing for V5 threshold calibration and V7 simpler-version pivot + +### A35: Han codebase — `plugin/skills/stakeholder-summary/SKILL.md` +- **Link / location:** plugin/skills/stakeholder-summary/SKILL.md +- **Retrieved:** n/a +- **Trust class:** codebase +- **Summary:** Single-file output with strict plain-language constraint; references template; three self-check passes; no agent dispatch. +- **Evidence status:** corroborated by A29, A30, A31 on the template-installer pattern + +### A36: Han codebase — `docs/templates/skill-long-form-template.md` +- **Link / location:** docs/templates/skill-long-form-template.md +- **Retrieved:** n/a +- **Trust class:** codebase +- **Summary:** Strict 13-section structure for long-form operator docs. Related Documentation must link back to README. +- **Evidence status:** load-bearing for the long-form doc deliverable + +## References + +- PagerDuty — What is a Runbook? https://www.pagerduty.com/resources/automation/learn/what-is-a-runbook/ +- SkeltonThatcher run-book-template. https://github.com/SkeltonThatcher/run-book-template +- Christian Emmer — An Effective Incident Runbook Template. https://emmer.dev/blog/an-effective-incident-runbook-template/ +- Atlassian Confluence — DevOps Runbook Template. https://www.atlassian.com/software/confluence/templates/devops-runbook +- AWS Incident Detection and Response — Develop Runbooks. https://docs.aws.amazon.com/IDR/latest/userguide/idr-workloads-dev-runbook.html +- GitLab Runbooks. https://runbooks.gitlab.com/ ; https://gitlab.com/gitlab-com/runbooks +- Process.st — How to Create a Runbook. https://www.process.st/create-a-runbook/ +- UptimeLabs — Incident Response Runbook. https://uptimelabs.io/learn/what-is-an-incident-response-runbook/ +- Lab Zero — DevOps Runbook Guide. https://guides.labzero.com/technical_guides/dev_ops_runbook_guide.html +- Nobl9 — Runbook Example: A Best Practices Guide. 
https://www.nobl9.com/it-incident-management/runbook-example +- Rootly — Incident Response Runbooks Guide. https://rootly.com/incident-response/runbooks +- Google SRE Workbook — On-Call. https://sre.google/workbook/on-call/ +- arXiv 2508.08308 — FATA framework. https://arxiv.org/html/2508.08308v1 +- OneUptime — How to Create Effective Runbooks. https://oneuptime.com/blog/post/2026-02-02-effective-runbooks/view +- OpenShift Runbooks. https://github.com/openshift/runbooks +- FireHydrant — Runbook Best Practices. https://docs.firehydrant.com/docs/runbook-best-practices +- The Good Shell — Incident Runbook Template. https://thegoodshell.com/incident-runbook-template/ +- incident.io — What Are Runbooks? https://incident.io/blog/what-are-runbooks +- SupportBench — Runbook Maintenance Best Practices. https://www.supportbench.com/how-to-maintain-runbooks-when-engineering-changes-processes/ +- incident.io — Automated Runbooks Guide. https://incident.io/blog/automated-runbook-guide +- IncidentHub — No-Nonsense Guide to Runbook Best Practices. https://blog.incidenthub.cloud/The-No-Nonsense-Guide-to-Runbook-Best-Practices +- Cortex — Runbooks vs. Playbooks. https://www.cortex.io/post/runbooks-vs-playbooks +- Cutover — Runbooks vs. Playbooks vs. SOPs. https://cutover.com/blog/differences-runbooks-playbooks-sops +- Upstat — Runbook vs. SOP. https://upstat.io/blog/runbook-vs-sop +- Hacker News — Writing Runbook Documentation When You're an SRE. https://news.ycombinator.com/item?id=22207452 +- drdroid — Runbooks Guide for SRE / On-Call Teams. 
https://drdroid.io/guides/runbooks-guide-for-sre-on-call-teams +- Han plugin — plugin/skills/project-documentation/SKILL.md +- Han plugin — plugin/skills/architectural-decision-record/SKILL.md +- Han plugin — plugin/skills/coding-standard/SKILL.md +- Han plugin — plugin/skills/stakeholder-summary/SKILL.md +- Han plugin — plugin/agents/on-call-engineer.md +- Han plugin — plugin/agents/devops-engineer.md +- Han plugin — plugin/references/yagni-rule.md +- Han plugin — docs/templates/skill-long-form-template.md diff --git a/docs/skills/README.md b/docs/skills/README.md index 32dd5c4..b7c63cf 100644 --- a/docs/skills/README.md +++ b/docs/skills/README.md @@ -64,6 +64,12 @@ Skills for turning the work back into something sharable. - **[`/update-pr-description`](./update-pr-description.md).** Generate a PR description from the current branch's changes. +## Operations + +Skills for capturing operational knowledge in artifacts the next on-call engineer can use. + +- **[`/runbook`](./runbook.md).** Create or update a runbook for a single operational scenario (alert that has fired, incident, recurring task, known failure mode). Symptom-first template with imperative-voice procedure, expected output per step, escalation conditions, and rollback. Applies a YAGNI preflight that requires real evidence before writing. + --- ## How dispatch scales: sizing diff --git a/docs/skills/architectural-analysis.md b/docs/skills/architectural-analysis.md index 757fb1d..9238d42 100644 --- a/docs/skills/architectural-analysis.md +++ b/docs/skills/architectural-analysis.md @@ -166,7 +166,7 @@ URL: https://www.domainlanguage.com/ddd/ ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [Sizing](../sizing.md). 
The small / medium / large dispatch model this skill shares with the other swarming skills. - [`structural-analyst`](../agents/structural-analyst.md), [`behavioral-analyst`](../agents/behavioral-analyst.md), [`concurrency-analyst`](../agents/concurrency-analyst.md). The discovery analysts. - [`adversarial-security-analyst`](../agents/adversarial-security-analyst.md), [`data-engineer`](../agents/data-engineer.md), [`devops-engineer`](../agents/devops-engineer.md), [`on-call-engineer`](../agents/on-call-engineer.md), [`codebase-explorer`](../agents/codebase-explorer.md). The signal-selected specialists added at medium and large. diff --git a/docs/skills/architectural-decision-record.md b/docs/skills/architectural-decision-record.md index 4d2effb..5a62366 100644 --- a/docs/skills/architectural-decision-record.md +++ b/docs/skills/architectural-decision-record.md @@ -34,6 +34,7 @@ Operator documentation for the `/architectural-decision-record` skill in the han - **Enforceable coding rules.** Use [`/coding-standard`](./coding-standard.md). An ADR records the decision; a coding standard encodes the rule it produces. - **Feature documentation.** Use [`/project-documentation`](./project-documentation.md). - **Recording an investigation's findings.** Use [`/investigate`](./investigate.md) for bug investigations with evidence and validation. +- **Runbooks for operational scenarios.** Use [`/runbook`](./runbook.md). A runbook captures the procedure for an alert or incident; an ADR records the decision that shaped the system the runbook operates on. ## How to invoke it @@ -120,7 +121,7 @@ URL: https://www.thoughtworks.com/radar/techniques/lightweight-architecture-deci - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule this skill applies before committing items. 
The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/coding-standard`](./coding-standard.md). For rules that come out of a decision. Link the standard to the ADR. - [`/architectural-analysis`](./architectural-analysis.md). Often produces decisions worth recording as ADRs. - [`/project-documentation`](./project-documentation.md). For feature docs that reference the ADR. diff --git a/docs/skills/code-review.md b/docs/skills/code-review.md index 4675a51..253457d 100644 --- a/docs/skills/code-review.md +++ b/docs/skills/code-review.md @@ -171,7 +171,7 @@ URL: https://itrevolution.com/product/accelerate/ - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule this skill applies before committing items. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/gh-pr-review`](./gh-pr-review.md). Wraps this skill and posts the review to a GitHub PR. - [`/investigate`](./investigate.md). Next step when a CRIT finding hides a bug whose root cause needs deeper analysis. - [`/architectural-analysis`](./architectural-analysis.md). Run alongside when the change touches module boundaries. diff --git a/docs/skills/coding-standard.md b/docs/skills/coding-standard.md index 32b5220..5f03e90 100644 --- a/docs/skills/coding-standard.md +++ b/docs/skills/coding-standard.md @@ -35,6 +35,7 @@ Operator documentation for the `/coding-standard` skill in the han plugin. This - **Feature documentation.** Use [`/project-documentation`](./project-documentation.md) for describing how a system works. 
- **Style rules that a linter or formatter can enforce.** Configure the tool. Do not write a standard that duplicates it. - **Open-ended research not destined for a standard.** Use [`/research`](./research.md) to survey options and prior art when the output you want is a recommendation, not an enforceable rule. +- **Runbooks for operational scenarios.** Use [`/runbook`](./runbook.md). A runbook captures the procedure for an alert or incident; a coding standard encodes a rule the code itself must follow. ## How to invoke it @@ -143,7 +144,7 @@ URL: https://code.claude.com/docs/en/memory - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule this skill applies before committing items. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/architectural-decision-record`](./architectural-decision-record.md). For decisions rather than rules. Link the standard to the ADR when the rule embeds a choice. - [`/project-documentation`](./project-documentation.md). For system and feature documentation that is not a rule. - [`/code-review`](./code-review.md). Reads standards during every review. Violations become findings. diff --git a/docs/skills/gap-analysis.md b/docs/skills/gap-analysis.md index 59a9128..83b7d7b 100644 --- a/docs/skills/gap-analysis.md +++ b/docs/skills/gap-analysis.md @@ -201,7 +201,7 @@ URLs: https://hbr.org/2007/09/performing-a-project-premortem and https://en.wiki ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. 
- [Sizing](../sizing.md). The cross-skill sizing model. Explains the small / medium / large bands, the default-to-small rule, and the `$size` override. - [`gap-analyzer`](../agents/gap-analyzer.md). The agent that performs the underlying gap analysis. The skill always dispatches it once and reads its full output. - [`adversarial-validator`](../agents/adversarial-validator.md). Required swarm role at every size. Attacks each gap with counter-evidence to produce per-gap `confirmed` / `contradicted` / `inconclusive` verdicts. diff --git a/docs/skills/gh-pr-review.md b/docs/skills/gh-pr-review.md index ef72132..eaa51ab 100644 --- a/docs/skills/gh-pr-review.md +++ b/docs/skills/gh-pr-review.md @@ -98,7 +98,7 @@ URL: https://google.github.io/eng-practices/review/reviewer/ ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/code-review`](./code-review.md). The skill this one wraps. Use directly for local review without GitHub posting. - [`/update-pr-description`](./update-pr-description.md). For writing the PR description. - [`/investigate`](./investigate.md). Next step when a Critical finding hides a bug. diff --git a/docs/skills/investigate.md b/docs/skills/investigate.md index 8dd05df..50ce373 100644 --- a/docs/skills/investigate.md +++ b/docs/skills/investigate.md @@ -122,7 +122,7 @@ URL: https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/issue-triage`](./issue-triage.md). 
Run before investigation when the incoming report is too vague to trace; triage produces the sharp problem statement investigation needs. - [`/research`](./research.md). The question-shaped sibling. Use it when nothing is broken and you want options, prior art, or how something works before committing. - [`evidence-based-investigator`](../agents/evidence-based-investigator.md). The agent the skill dispatches in parallel for multi-angle evidence gathering. @@ -130,4 +130,5 @@ URL: https://pragprog.com/titles/tpp20/the-pragmatic-programmer-20th-anniversary - [`concurrency-analyst`](../agents/concurrency-analyst.md), [`behavioral-analyst`](../agents/behavioral-analyst.md), [`data-engineer`](../agents/data-engineer.md). Specialist analysts dispatched alongside the investigators when the symptom classification calls for them. - [`/iterative-plan-review`](./iterative-plan-review.md). Pair when the fix plan needs further stress-testing before implementation. - [`/code-review`](./code-review.md). Run before merge when the fix lands, to audit the change end-to-end. +- [`/runbook`](./runbook.md). Pair after the investigation lands a procedure the team will reuse. Investigate captures the root cause and fix; the runbook captures the procedure for the next engineer who sees the same symptom. - [`SKILL.md` for /investigate](../../plugin/skills/investigate/SKILL.md). The internal process definition. diff --git a/docs/skills/issue-triage.md b/docs/skills/issue-triage.md index 018abf0..7eeb3f5 100644 --- a/docs/skills/issue-triage.md +++ b/docs/skills/issue-triage.md @@ -170,7 +170,7 @@ The skill dispatches no sub-agents. It reads the report and, only to sharpen the ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/investigate`](./investigate.md). 
The natural next skill when the issue is a bug or failure with enough context to trace. - [`/plan-a-feature`](./plan-a-feature.md). The natural next skill when the issue is a feature request with enough context to spec. - [`/plan-implementation`](./plan-implementation.md). The next skill when triage confirms a well-defined problem and a spec already exists. diff --git a/docs/skills/iterative-plan-review.md b/docs/skills/iterative-plan-review.md index 2f64b3a..abb2055 100644 --- a/docs/skills/iterative-plan-review.md +++ b/docs/skills/iterative-plan-review.md @@ -191,7 +191,7 @@ URLs: https://asana.com/resources/raid-log and https://projectmanagementcompass. - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule this skill applies before committing items. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [Sizing](../sizing.md). The cross-skill sizing model. Explains the small / medium / large bands, the default-to-small rule, and the `$size` override. - [`/plan-a-feature`](./plan-a-feature.md). The upstream skill for producing a feature specification from scratch. This skill can iterate on that spec, but the typical handoff is spec → `/plan-implementation` → this skill. - [`/plan-implementation`](./plan-implementation.md). The upstream skill for producing a committable implementation plan. This skill is the natural next step when the team wants the implementation plan stress-tested across multiple review passes. 
diff --git a/docs/skills/plan-a-feature.md b/docs/skills/plan-a-feature.md index e14f4a9..5c49111 100644 --- a/docs/skills/plan-a-feature.md +++ b/docs/skills/plan-a-feature.md @@ -184,7 +184,7 @@ URLs: https://asana.com/resources/raid-log and https://projectmanagementcompass. - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule this skill applies before committing items. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [Sizing](../sizing.md). The cross-skill sizing model. Explains the small / medium / large bands, the default-to-small rule, and the `$size` override. - [`/plan-implementation`](./plan-implementation.md). The next step after this skill. Takes the `feature-specification.md` produced here and turns it into a feature-implementation-plan through an iterative, project-manager-led team conversation. - [`/stakeholder-summary`](./stakeholder-summary.md). The optional sibling for non-technical feedback. Takes the `feature-specification.md` produced here and turns it into a plain-language stakeholder summary with Mermaid diagrams, for sharing with leadership, product, or customer-facing reviewers before implementation kicks off. diff --git a/docs/skills/plan-a-phased-build.md b/docs/skills/plan-a-phased-build.md index 1c505ee..f2baa1f 100644 --- a/docs/skills/plan-a-phased-build.md +++ b/docs/skills/plan-a-phased-build.md @@ -195,7 +195,7 @@ URL: see [`information-architect` agent definition](../../plugin/agents/informat - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule this skill applies before committing items. 
The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`information-architect`](../agents/information-architect.md). The agent the skill dispatches at runtime to review the rendered outline. Also the agent that reviewed the output template before the skill shipped. - [`/gap-analysis`](./gap-analysis.md). Pair upstream when the source artifact is a comparison between current and desired state. Run `/gap-analysis` first to produce the gap report, then point this skill at the report. `G-NNN` gap IDs become source citations on the phase entries that close them. - [`/plan-a-feature`](./plan-a-feature.md). Pair upstream when the source artifact is a single feature that needs a phased rollout. Run `/plan-a-feature` first to produce the spec, then point this skill at the spec when the feature is large enough to ship in slices rather than all at once. diff --git a/docs/skills/plan-implementation.md b/docs/skills/plan-implementation.md index 477f9f3..1ef1b53 100644 --- a/docs/skills/plan-implementation.md +++ b/docs/skills/plan-implementation.md @@ -200,7 +200,7 @@ URL: https://ieeexplore.ieee.org/document/1204375 - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule this skill applies before committing items. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [Sizing](../sizing.md). The cross-skill sizing model. Explains the small / medium / large bands, the default-to-small rule, and the `$size` override. - [`/plan-a-feature`](./plan-a-feature.md). The prior step. 
Produces the `feature-specification.md` this skill consumes. Running the two in sequence is the intended flow: *what* first, *how* second. - [`/stakeholder-summary`](./stakeholder-summary.md). The optional intermediate step. Turns the `feature-specification.md` into a plain-language summary for non-technical stakeholders before this skill runs, so the implementation plan starts from a shape stakeholders have already greenlit. diff --git a/docs/skills/plan-work-items.md b/docs/skills/plan-work-items.md index 18a9e31..c5c9870 100644 --- a/docs/skills/plan-work-items.md +++ b/docs/skills/plan-work-items.md @@ -115,7 +115,7 @@ URL: https://www.mountaingoatsoftware.com/books/user-stories-applied ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule. This skill does not gate on it; enforcement belongs upstream. - [`project-manager`](../agents/project-manager.md). Dispatched in Step 5 to draft the work item breakdown. - [`/plan-implementation`](./plan-implementation.md). Pair upstream to produce the implementation plan this skill breaks down. diff --git a/docs/skills/project-discovery.md b/docs/skills/project-discovery.md index e8b6e19..3ec546a 100644 --- a/docs/skills/project-discovery.md +++ b/docs/skills/project-discovery.md @@ -98,7 +98,7 @@ URL: https://research.google/pubs/why-google-stores-billions-of-lines-of-code-in ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/project-documentation`](./project-documentation.md). For feature and system docs. 
Reads the discovery reference to find the right directory and language. - [`/coding-standard`](./coding-standard.md). For coding rules. Reads the discovery reference to find the standards directory. - [`/architectural-decision-record`](./architectural-decision-record.md). For architectural decisions. Reads the discovery reference to find the ADR directory. diff --git a/docs/skills/project-documentation.md b/docs/skills/project-documentation.md index 948e6d5..24441df 100644 --- a/docs/skills/project-documentation.md +++ b/docs/skills/project-documentation.md @@ -34,6 +34,7 @@ Operator documentation for the `/project-documentation` skill in the han plugin. - **Architectural decisions.** Use [`/architectural-decision-record`](./architectural-decision-record.md). - **Coding conventions.** Use [`/coding-standard`](./coding-standard.md). - **PR descriptions.** Use [`/update-pr-description`](./update-pr-description.md). +- **Runbooks for operational scenarios.** Use [`/runbook`](./runbook.md). A runbook captures what to do when an alert fires or a known failure mode occurs; project documentation describes how the feature or system works. ## How to invoke it @@ -115,7 +116,7 @@ URL: https://en.wikipedia.org/wiki/Darwin_Information_Typing_Architecture ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/project-discovery`](./project-discovery.md). Run first. The documentation skill reads the discovery reference to find the docs directory and stack language. - [`/architectural-decision-record`](./architectural-decision-record.md). Use for decisions rather than system documentation. - [`/coding-standard`](./coding-standard.md). Use for rules rather than descriptions. 
diff --git a/docs/skills/research.md b/docs/skills/research.md index bbd8a27..d041f48 100644 --- a/docs/skills/research.md +++ b/docs/skills/research.md @@ -127,7 +127,7 @@ URL: https://hbr.org/2007/09/performing-a-project-premortem ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/investigate`](./investigate.md). The symptom-shaped sibling. Use it when something is broken; use `/research` when you have a question. - [`/plan-a-feature`](./plan-a-feature.md). Pair downstream: turn a recommended option into a behavioral spec. - [`research-analyst`](../agents/research-analyst.md). The agent the skill dispatches for the web / prior-art / option-comparison angles. diff --git a/docs/skills/runbook.md b/docs/skills/runbook.md new file mode 100644 index 0000000..19fcc14 --- /dev/null +++ b/docs/skills/runbook.md @@ -0,0 +1,151 @@ +# /runbook + +Operator documentation for the `/runbook` skill in the han plugin. This document helps you decide *when* and *how* to use the skill. For what the skill does internally, read the skill definition at [`plugin/skills/runbook/SKILL.md`](../../plugin/skills/runbook/SKILL.md). + +> See also: [Plugin landing page](../../README.md) · [All skills](./README.md) · [All agents](../agents/README.md) · [YAGNI](../yagni.md) + +## TL;DR + +- **What it does.** Creates or updates a runbook for a single operational scenario, using a consistent template that leads with symptoms and progressively discloses the procedure. +- **When to use it.** An alert has fired, an incident has occurred, a recurring task needs to be captured, or a known failure mode on a live service needs a documented response. 
+- **What you get back.** A single runbook file under `docs/runbooks/` (or the project's existing runbook directory) with metadata, symptoms, prerequisites, an imperative-voice procedure with expected output per step, verification, escalation, and rollback. + +## Key concepts + +- **Three modes.** Creating new, Updating existing (edit in place, new change-history entry), Validating existing (refresh `Last validated` after running the procedure end-to-end). +- **One runbook per invocation.** The skill produces a single file. Rerun the skill per scenario; do not try to batch. +- **YAGNI preflight.** Before the skill writes anything, it requires the scenario to be real: an alert that has fired, a documented incident, a recurring task, a live failure mode on a service receiving traffic, or a customer / stakeholder commitment. Speculative runbooks are deferred. +- **Symptom-first structure.** The template promotes Symptoms to a top-level section directly under the metadata block so a reader arriving from an alert link can confirm "this is the right runbook" in under ten seconds. +- **Imperative commands with expected output.** Every step in the procedure shows the exact command and what success looks like. Prose paragraphs in place of commands are an authoring failure the skill prompts against. +- **Staleness made visible.** Owner, Last validated, Last edited, Reversible, Origin, and a Change history with validation status all sit in the metadata so decay shows up in the artifact instead of hiding inside it. + +## When to use it + +**Invoke when:** + +- An alert just fired for the first time and you mitigated it manually; capture what you did before you forget. +- A documented incident or post-mortem produced a procedure that should be reusable. +- The team performs a recurring task (cert rotation, index rebuild, monthly data export) and the procedure should be captured so it does not live only in one person's head. 
+- A known failure mode on a live service needs a documented response before the next on-call rotation. +- You ran an existing runbook end-to-end and want to refresh its `Last validated` date and change-history entry. + +**Do not invoke for:** + +- **Feature or system documentation.** Use [`/project-documentation`](./project-documentation.md). That skill describes what a feature does and how it works; this skill describes what to do when an operational scenario occurs. +- **An architectural or design decision.** Use [`/architectural-decision-record`](./architectural-decision-record.md). An ADR records a decision and its alternatives; a runbook captures an operational procedure. +- **Coding rules or conventions.** Use [`/coding-standard`](./coding-standard.md). +- **An incident investigation in flight.** Use [`/investigate`](./investigate.md) for evidence-based root-cause work. Run `/runbook` after the investigation lands a procedure that the team will reuse. +- **A speculative runbook for an alert that has not fired.** The skill's YAGNI preflight will defer it. Wait until the alert actually fires or until evidence accumulates. + +## How to invoke it + +Run `/runbook` in Claude Code. + +Give it: + +1. **The scenario.** Lead with the observable symptom or operation: *"Postgres primary unreachable: connections time out,"* *"Weekly reindex job,"* *"Queue backlog over 5000."* The clearer the scenario, the less the skill needs to ask. +2. **The evidence the scenario is real.** A link to the firing alert, a post-mortem, the schedule file, a customer report, or a brief description of how you observed the failure mode. The skill's YAGNI preflight needs this before it will write the runbook. +3. **The procedure that worked.** The exact commands you ran, what their output looked like, what you checked to confirm the fix. The skill captures these verbatim; it does not invent commands. +4. **Optional: an existing runbook to update.** Pass the path. 
The skill will read it, ask what changed, and edit in place with a new change-history entry. + +Example prompts: + +- `/runbook`. *"Write the runbook for the queue-backlog alert I just mitigated. Alert fired at 14:22 today, incident report at `docs/incidents/2026-05-28-queue-backlog.md`. Fix was to restart the consumer pool with `kubectl rollout restart deploy/consumer -n workers` and verify queue depth dropped below 1000 within five minutes."* +- `/runbook`. *"Capture our weekly Postgres reindex procedure. Schedule lives in `ops/cron/reindex.yaml`; the steps are in my head."* +- `/runbook docs/runbooks/postgres-primary-unreachable.md`. *"Update — we changed the escalation channel from PagerDuty to OpsGenie last week, and I ran the procedure end-to-end this morning."* +- `/runbook`. *"I want to write a runbook for a Sentry alert we don't have data flowing to yet."* The skill will defer this per YAGNI. + +## What you get back + +A single runbook file plus light integration: + +- **`docs/runbooks/{slug}.md`** (or the project's existing runbook directory and convention). The file follows the template at [`references/runbook-template.md`](../../plugin/skills/runbook/references/runbook-template.md). Required sections: title, one-line description, metadata block (Severity, Triggers, Reversible, Last validated, Last edited, Owner, Origin), Symptoms, Prerequisites, Resolve (or Quick fix), Verify the fix landed, Escalate, Rollback, Live links, Change history. Optional sections (deleted entirely if they do not apply): Likely cause, Not this — try instead, Background, Quick fix, If a step fails, If the problem comes back, What didn't work and why, Background and related. +- **A metadata block tuned for 2am scanning.** Severity and Triggers up top; Reversible visible before the engineer commits to any destructive step; Last validated distinct from Last edited so trust signals are not muddied; Origin holding the YAGNI evidence. 
+- **An imperative procedure.** Every step shows the exact command and what success looks like, with explicit branching when output differs. +- **Filename convention discovered from the project.** Flat (`docs/runbooks/{scenario}.md`), per-service (`docs/runbooks/{service}/{scenario}.md`), or alert-keyed (`docs/runbooks/alerts/{AlertName}.md`) depending on what the project already uses. The skill matches existing convention when more than two runbooks are present; consistency is the larger value. +- **Cross-references.** If CLAUDE.md or AGENTS.md lists runbooks, the skill adds an entry. If the runbook closes a procedure in an incident report or post-mortem, the skill adds a back-reference. If the alert that triggers the runbook has a definition file in the repository, the skill adds a comment in that file pointing to the runbook. + +## How to get the most out of it + +- **Bring real evidence, not "we should probably have a runbook for X."** The YAGNI preflight will defer speculative runbooks. The skill is most useful right after a real incident, while the procedure is fresh. +- **Capture the commands verbatim.** The skill writes what you give it. If you paste the exact `kubectl` invocation that worked, that is what the runbook will say. If you describe the procedure in prose, the skill will ask you for the commands before writing. +- **Note "what didn't work" too.** The template has an optional section for it. The next reader benefits from knowing which paths look promising but fail. +- **Run the procedure end-to-end before updating `Last validated`.** Editing the runbook does not validate it. The skill keeps Last edited and Last validated separate on purpose. +- **Pair with `/investigate`** when the runbook comes out of a bug investigation. The investigation lands the fix; `/runbook` captures the procedure for the next engineer who sees the same symptom. 
+ +## YAGNI + +A runbook requires **evidence the scenario is real today**: an alert that has fired, a documented incident, a recurring task that exists, a live failure mode on a service receiving production traffic, or a customer or stakeholder commitment to document the procedure. Runbooks for hypothetical alerts, "we might need this someday," or symmetry with other runbooks ("we have one for the database, so we should have one for the cache") are YAGNI candidates and are deferred. + +The canonical project anti-pattern: Sentry runbooks for staging-only Sentry where data isn't reaching production. The alerts will never fire because no signal flows, and the runbook becomes a load-bearing pattern future agents will copy. + +When the preflight finds no current trigger, the skill recommends deferring the runbook and names the trigger that would justify revisiting (the alert firing, the first occurrence of the failure mode, the first run of the recurring task, a customer commitment landing). The user always wins; if they override, the override is recorded explicitly in the runbook's Origin field so future readers can see the runbook was written without standard evidence. + +See [YAGNI](../yagni.md) for the two gates, the acceptable-evidence list, and the named anti-patterns. + +## Cost and latency + +The skill is deterministic and does not dispatch agents. A typical run is one or two short rounds of clarifying questions (the YAGNI evidence, missing metadata, the exact commands) followed by a single file write. Runs are fast; the cost is dominated by the back-and-forth needed to capture the procedure accurately. + +The skill is built for tight-loop iteration after an incident: write the runbook now while the commands are fresh, then rerun the skill in validate mode the next time someone executes the procedure to refresh `Last validated`. + +## In more detail + +The skill walks a seven-step process: + +1. 
**Determine mode.** Creating new, Updating existing, or Validating existing. +2. **YAGNI preflight.** Gate the work on real evidence: alert that has fired, incident, recurring task, live failure mode, customer commitment. Recommend deferral when no trigger exists; the user can override and the override is recorded. +3. **Discover project structure.** Resolve the runbooks directory from CLAUDE.md's Project Discovery section, then `project-discovery.md`, then defaults (`docs/runbooks/`, `runbooks/`). Detect whether the project organizes runbooks flat, per-service, or alert-keyed. +4. **Gather context.** Title, severity, triggers, reversibility, origin, owner, prerequisites, symptoms, the procedure with exact commands and expected output, verification, escalation conditions and channels, rollback. +5. **Write the runbook.** Copy the template, fill the metadata, fill each required section, fill applicable optional sections, delete the headings for optional sections that do not apply, delete the author guidance block. +6. **Integration.** CLAUDE.md or AGENTS.md entry if the project lists runbooks; back-reference from incident reports or post-mortems; a comment in each alert-definition file that points to the runbook. +7. **Verification.** Re-read the file, confirm no placeholders remain, confirm Origin contains real evidence (or an explicit override), confirm Symptoms is concrete, confirm every step shows command and expected output, confirm Verify is distinct from per-step output, confirm Escalate leads with conditions, confirm Rollback is filled or explicitly marked not applicable, confirm empty optional sections are deleted, confirm the change-history creation entry exists. + +The template reflects review input from [`information-architect`](../agents/information-architect.md) and [`junior-developer`](../agents/junior-developer.md) that landed during its design pass. 
Progressive disclosure runs in two directions: from observable symptom toward likely cause and adjacent failures, and from quick fix toward branching procedure with verification and rollback. The metadata block carries the front-door signals (Severity, Reversible, Last validated) that a tired reader needs before committing to any step. + +## Sources + +The skill's structure is grounded in established runbook practice and the project's own evidence-based conventions. + +### Google SRE Workbook — On-Call + +The "playbook entry" pattern in Google SRE — every alert ties to a playbook entry with severity, impact, debugging, and mitigation — anchors the skill's per-scenario structure and the alert-to-runbook linking convention. The corroborated 3x MTTR improvement claim is the only quantitative evidence in the field for runbook value. + +URL: https://sre.google/workbook/on-call/ + +### GitLab Production Runbooks + +GitLab's per-service runbooks repository demonstrates the production-grade pattern the skill mirrors: kebab-case filenames, runbooks organized by service or alert, owned by the team that operates the service, updated in the same pull requests as the infrastructure they describe. The skill's flat / per-service / alert-keyed convention detection traces to this practice. + +URL: https://runbooks.gitlab.com/ + +### OpenShift Runbooks + +The alert-keyed naming convention (`alerts/{operator}/{AlertName}.md`) the skill detects and matches comes from OpenShift's runbook repository, where the runbook file name is the alert it answers. + +URL: https://github.com/openshift/runbooks + +### `plugin/references/yagni-rule.md` + +The skill's YAGNI preflight applies the project's own evidence-based YAGNI rule. The canonical anti-pattern — "runbook for an alert that has never fired" — comes directly from this rule and from the `devops-engineer` agent definition that codifies it. 
+ +URL: [`plugin/references/yagni-rule.md`](../../plugin/references/yagni-rule.md) + +### `docs/research/runbook-skill-research.md` + +The skill's design rests on a research pass that surveyed industry runbook formats (Google SRE, GitLab, OpenShift, PagerDuty, Atlassian, Rootly, OneUptime, FireHydrant, incident.io, Nobl9, and more), Han codebase patterns, and adversarial validation. The validation collapsed an earlier interview-driven design in favor of the simpler template installer. + +URL: [`docs/research/runbook-skill-research.md`](../research/runbook-skill-research.md) + +## Related documentation + +- [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. +- [YAGNI](../yagni.md). The evidence-based rule the skill applies before writing a runbook. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. +- [Skills Index](./README.md). All skills, grouped by purpose. +- [`/investigate`](./investigate.md). The investigation skill that often produces a procedure worth capturing as a runbook. Investigate first, then capture. +- [`/project-documentation`](./project-documentation.md). For feature and system docs. Pair when a runbook needs background a feature doc already provides. +- [`/architectural-decision-record`](./architectural-decision-record.md). For decisions that produce the system the runbook operates on. +- [`information-architect`](../agents/information-architect.md). Reviewed the runbook output template for progressive disclosure during the skill's design pass. +- [`junior-developer`](../agents/junior-developer.md). Reviewed the runbook output template for generalist readability during the skill's design pass. +- [`devops-engineer`](../agents/devops-engineer.md). The agent that consumes runbooks during production-readiness review and whose YAGNI anti-pattern definition anchors the skill's preflight. 
+- [`SKILL.md` for /runbook](../../plugin/skills/runbook/SKILL.md). The internal process definition. diff --git a/docs/skills/tdd.md b/docs/skills/tdd.md index 98e61b6..47c770b 100644 --- a/docs/skills/tdd.md +++ b/docs/skills/tdd.md @@ -126,7 +126,7 @@ URL: https://growing-object-oriented-software.com/ ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule the refactor step and test list apply. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. - [`/test-planning`](./test-planning.md). Plan what to test without writing code. Use it before `/tdd` to enumerate behaviors, or instead of it when you want analysis rather than implementation. - [`/plan-a-feature`](./plan-a-feature.md). Specify behavior first; the spec becomes the test list `/tdd` builds from. diff --git a/docs/skills/test-planning.md b/docs/skills/test-planning.md index f49fcca..3b304e0 100644 --- a/docs/skills/test-planning.md +++ b/docs/skills/test-planning.md @@ -115,7 +115,7 @@ URL: https://www.wiley.com/en-us/Testing+Computer+Software%2C+2nd+Edition-p-9780 - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. - [YAGNI](../yagni.md). The evidence-based "You Aren't Gonna Need It" rule this skill applies before committing items. The two gates, the acceptable-evidence list, the named anti-patterns, and the deferral format. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/code-review`](./code-review.md). Dispatches the same agents plus `adversarial-security-analyst`. Use when you want correctness findings too. 
- [`/architectural-analysis`](./architectural-analysis.md). For structural testability concerns. - [`/iterative-plan-review`](./iterative-plan-review.md). Use to stress-test an already-written test plan. diff --git a/docs/skills/update-pr-description.md b/docs/skills/update-pr-description.md index e01e6d3..144aa8b 100644 --- a/docs/skills/update-pr-description.md +++ b/docs/skills/update-pr-description.md @@ -100,7 +100,7 @@ URL: https://martinfowler.com/articles/feature-toggles.html ## Related documentation - [Plugin landing page](../../README.md). The front door. Start here if you arrived from outside the docs tree. -- [Skills Index](./README.md). All 20 skills, grouped by purpose. +- [Skills Index](./README.md). All 21 skills, grouped by purpose. - [`/gh-pr-review`](./gh-pr-review.md). Post a code review to the same PR. - [`/code-review`](./code-review.md). Local code review without touching GitHub. - [`junior-developer`](../agents/junior-developer.md). Runs the reviewer context check against the drafted description. diff --git a/plugin/skills/architectural-decision-record/SKILL.md b/plugin/skills/architectural-decision-record/SKILL.md index 42c3d83..6aea172 100644 --- a/plugin/skills/architectural-decision-record/SKILL.md +++ b/plugin/skills/architectural-decision-record/SKILL.md @@ -7,7 +7,8 @@ description: > design decision, or updating the status of an existing ADR. Does not create or update enforceable coding standards or conventions — use coding-standard for that. Does not write feature or system documentation — use - project-documentation instead. + project-documentation instead. Does not produce runbooks for operational + scenarios — use runbook for that. 
argument-hint: [topic-or-title or document-path] allowed-tools: Read, Write, Edit, Glob, Grep, Agent, Bash(git config *), Bash(whoami), Bash(mkdir *), Bash(find *) --- diff --git a/plugin/skills/coding-standard/SKILL.md b/plugin/skills/coding-standard/SKILL.md index 088eb82..d67c472 100644 --- a/plugin/skills/coding-standard/SKILL.md +++ b/plugin/skills/coding-standard/SKILL.md @@ -9,7 +9,8 @@ description: > records — use architectural-decision-record for ADRs. Does not write feature or system documentation — use project-documentation for that. Does not research open-ended options or prior art that is not destined for a standard — use - research. + research. Does not produce runbooks for operational scenarios — use runbook + for that. argument-hint: [standard-topic or document-path] allowed-tools: Read, Write, Edit, Glob, Grep, Agent, Bash(git config *), Bash(whoami), Bash(mkdir *), Bash(find *) --- diff --git a/plugin/skills/project-documentation/SKILL.md b/plugin/skills/project-documentation/SKILL.md index f0c23d7..830aa77 100644 --- a/plugin/skills/project-documentation/SKILL.md +++ b/plugin/skills/project-documentation/SKILL.md @@ -9,7 +9,8 @@ description: > analysis and config detection. Does not create architectural decision records — use architectural-decision-record for ADRs. Does not create or update coding standards — use coding-standard instead. Does not generate PR - descriptions — use update-pr-description for that. + descriptions — use update-pr-description for that. Does not produce runbooks + for operational scenarios — use runbook for that. 
 argument-hint: [feature-name or document-path]
 allowed-tools: Read, Write, Edit, Glob, Grep, Agent, Bash(date *), Bash(git config *), Bash(whoami), Bash(mkdir *), Bash(find *)
 ---
diff --git a/plugin/skills/runbook/SKILL.md b/plugin/skills/runbook/SKILL.md
new file mode 100644
index 0000000..289725b
--- /dev/null
+++ b/plugin/skills/runbook/SKILL.md
@@ -0,0 +1,149 @@
+---
+name: runbook
+description: >
+  Create or update a runbook for an operational scenario — an incident an
+  alert fires for, a recurring scheduled task, or a known failure mode on a
+  live service — using a consistent template. Use when writing, drafting,
+  authoring, or updating a runbook for an alert, incident, on-call procedure,
+  scheduled maintenance, or operational SOP. Each invocation produces one
+  runbook at a time. Applies a YAGNI preflight that requires the scenario to
+  be real (an alert that has fired, a recurring task that exists, or a live
+  failure mode on a service that receives traffic) before producing the
+  runbook. Does not produce feature or system documentation — use
+  project-documentation. Does not record architectural decisions — use
+  architectural-decision-record. Does not create coding standards — use
+  coding-standard.
+argument-hint: [topic or scenario, or path to existing runbook to update]
+allowed-tools: Read, Write, Edit, Glob, Grep, Bash(git config *), Bash(whoami), Bash(date *), Bash(mkdir *), Bash(find *)
+---
+
+# Create or Update Runbook
+
+## Operating Principles
+
+- **YAGNI applies to runbooks themselves.** Apply the evidence-based YAGNI rule from [../../references/yagni-rule.md](../../references/yagni-rule.md). A runbook is worth writing only when the scenario is grounded in something real: an alert that has actually fired, a documented incident, a recurring task that exists, or a known failure mode on a service that receives production traffic. Runbooks for hypothetical alerts, "best practice says we should have one," or "we'll need this someday" are YAGNI candidates and the runbook should be deferred until the scenario actually occurs. The canonical anti-pattern from project history: Sentry runbooks for staging-only Sentry where data isn't reaching production — alerts that will never fire because no signal flows. The user always wins; the rule's job is to make the cost of speculative runbooks visible.
+- **One runbook per invocation.** The skill produces a single runbook file. Multi-runbook batches conflate scope; rerun the skill per scenario.
+- **Imperative commands with expected output.** The template requires every step to show the exact command and what success looks like. Prose paragraphs in place of commands are an authoring failure the skill prompts against.
+- **Staleness is the failure mode.** The template requires owner, last-validated, last-edited, and a change-history entry so decay is visible rather than hidden. The skill does not enforce a review cadence — that is a team-level workflow concern — but the metadata fields make the cadence auditable.
+
+## Project Context
+
+- Git user: !`git config user.name` (!`git config user.email`)
+- OS username: !`whoami`
+- Today's date: !`date +%Y-%m-%d`
+- CLAUDE.md: !`find . -maxdepth 1 -name "CLAUDE.md" -type f`
+- project-discovery.md: !`find . -maxdepth 3 -name "project-discovery.md" -type f`
+
+## Step 1: Determine Mode
+
+Determine which mode to operate in based on the user's request:
+
+| Mode | When | Then |
+|------|------|------|
+| Creating new | Drafting a runbook for a scenario the project does not yet have one for | → Step 2 |
+| Updating existing | Modifying an existing runbook (new step, validation date refresh, escalation change) | Read the existing runbook → Step 4 |
+| Validating existing | User says they ran the procedure end-to-end and wants to refresh `Last validated` and add a change-history entry | Read the existing runbook → Step 4 (update mode, validation entry only) |
+
+## Step 2: Apply the YAGNI Preflight
+
+Before discovering structure or gathering context, gate the work. Ask the user (or confirm from their request) which of the following describes the scenario:
+
+1. **An alert that has actually fired** — name the alert, link the firing incident or alert manager record.
+2. **A documented incident or post-mortem** — link it.
+3. **A recurring scheduled task** that the team performs (weekly index rebuild, monthly cert rotation, etc.) — name the cadence and where the schedule lives.
+4. **A live failure mode** on a service that receives production traffic, where the failure has occurred or is expected to occur with current measured pressure — name the service and the failure mode.
+5. **Customer report or stakeholder commitment** requiring this procedure to be documented now — link it.
+
+If none of these applies, recommend deferring the runbook. Surface the recommendation to the user with the trigger that would justify revisiting:
+
+> "I don't see a current trigger forcing this runbook. Per the project's YAGNI rule, runbooks for alerts that have never fired are an anti-pattern. Recommend deferring until {trigger — first alert fires, first occurrence of the failure mode, first run of the recurring task, customer commitment lands}. Override and proceed anyway?"
+
+The user always wins. If they override, record the override in the runbook's Origin field as `"override: written preventively at user request on {date} — {reason}"` so future readers can see the runbook was written without standard evidence.
+
+If the scenario does pass the preflight, capture the evidence — the user will be asked again at Step 4 to drop the link or reference into the runbook's `Origin` metadata field.
+
+## Step 3: Discover Project Structure
+
+1. **Resolve project config.** Read CLAUDE.md's `## Project Discovery` section for documented runbook and docs directories. Fall back to `project-discovery.md`. Fall back to Glob defaults (`docs/runbooks/`, `runbooks/`, `docs/`). Continue without any keys that remain unfound.
+
+2. **Determine the runbooks directory.** Use the runbooks directory if found; otherwise use `{docs-dir}/runbooks/` if a docs directory was found; otherwise default to `docs/runbooks/`. Run `mkdir -p` on the resolved directory to ensure it exists.
+
+3. **Enumerate existing runbooks.** Use Glob to find existing `.md` files in the runbooks directory and any service subdirectories. Read filenames to detect whether the project organizes runbooks flat (`docs/runbooks/{scenario}.md`), per-service (`docs/runbooks/{service}/{scenario}.md`), or alert-keyed (`docs/runbooks/alerts/{AlertName}.md`).
+
+4. **Resolve author information.** If git user or email is empty in the project context above, ask the user for their name and email.
+
+5. **Check existing runbook format.** If existing runbooks were found, read one to understand the project's format. If it differs from [runbook-template.md](references/runbook-template.md), ask the user whether to match the existing format or use this skill's template. Default to matching the existing format when the project already has more than two runbooks — consistency is the larger value.
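
Editor's illustration, not part of the patch: the directory fallback and layout detection described in Step 3 could be sketched roughly as below. The function names, the idea of passing in already-discovered directories, and the simple path-depth heuristic are all assumptions made for this example; the skill itself describes the behavior in prose only.

```python
from pathlib import Path, PurePosixPath
from typing import Iterable, Optional


def resolve_runbooks_dir(runbooks_dir: Optional[str], docs_dir: Optional[str]) -> Path:
    """Apply the Step 3 fallback order (hypothetical helper)."""
    if runbooks_dir:
        return Path(runbooks_dir)
    if docs_dir:
        return Path(docs_dir) / "runbooks"
    return Path("docs/runbooks")


def detect_layout(relative_paths: Iterable[str]) -> str:
    """Guess the layout from runbook paths relative to the runbooks directory.

    Assumed heuristic: a file one level down under "alerts/" means alert-keyed,
    a file one level down under any other directory means per-service,
    and anything else (including no runbooks at all) is treated as flat.
    """
    parts = [PurePosixPath(p).parts for p in relative_paths]
    if any(len(p) == 2 and p[0] == "alerts" for p in parts):
        return "alert-keyed"
    if any(len(p) == 2 for p in parts):
        return "per-service"
    return "flat"
```

The `mkdir -p` step would correspond to calling `mkdir(parents=True, exist_ok=True)` on the returned path. A project that mixes layouts would still need the user to decide, which this sketch does not attempt.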
+
+## Step 4: Gather Context
+
+From the arguments, conversation, and YAGNI preflight in Step 2, capture:
+
+- **Title** — the symptom-first title per the template's title rule. Lead with the observable failure or operation, not the system name. Good: `Postgres primary unreachable: connections time out`. Bad: `Database failover`.
+- **Severity** — the org's severity scheme. If the alert uses a different name (P1/P2), record both.
+- **Triggers** — the alert name (with link to alert definition or monitoring), the schedule, the upstream runbook, or "manual".
+- **Reversibility** — yes, partial, no — wait it out, no — data loss possible. This sets the front-door signal so the engineer knows before they commit whether they can back out.
+- **Origin** — the link or reference captured in Step 2. Required.
+- **Owner** — team or person paged at 2am for this runbook's freshness.
+- **Prerequisites** — access groups, VPN, kubectl context, CLI tools with minimum versions, on-call privileges. "None — workstation only" is a valid answer; blank is not.
+- **Symptoms** — what the engineer sees that brings them to this runbook.
+- **The procedure** — for each step, the exact command (or non-command action), what success looks like, and what to do if the output differs. Use imperative voice.
+- **Verification** — how to confirm the original symptom is gone (separate from per-step expected output).
+- **Escalation** — for each escalation step, the condition (time-box or specific failure), the recipient, and the channel (PagerDuty service, Slack room, phone).
+- **Rollback** — how to undo the fix, or the explicit alternative if rollback is not possible.
+
+If any of these are unclear, use `AskUserQuestion` to clarify before writing. Ask only for what is genuinely missing; do not re-ask for values present in the user's request.
+
+When the user gives you a recent incident, post-mortem, or alert as the scenario, read it to extract the symptoms, the procedure that worked, and the verification — do not re-derive these from the model's understanding.
+
+## Step 5: Write the Runbook
+
+1. **Copy the template** from [runbook-template.md](references/runbook-template.md).
+
+2. **File name and location.** Place the file in the runbooks directory from Step 3.
+
+   - **Slug:** kebab-case, lead with the scenario or symptom, not the system name. `postgres-primary-unreachable.md`, not `failover.md`.
+   - **Per-service subdirectory:** when the project already organizes runbooks per-service (detected in Step 3), place the file under the matching service directory: `docs/runbooks/{service}/{scenario}.md`. Reuse an existing service directory when one fits; only introduce a new service directory when no existing one applies.
+   - **Alert-keyed:** when the project organizes by alert name (detected in Step 3), use the alert name as the file name: `docs/runbooks/alerts/{AlertName}.md`.
+   - **Flat default:** when the project has no convention yet, place the file at `docs/runbooks/{slug}.md`.
+   - If the project has more than one reasonable placement, ask the user before writing.
+
+3. **Fill the metadata block** with Severity, Triggers, Reversible, Last validated (today's date and the validating party — if the procedure has not been run end-to-end, leave `Last validated` empty and note in change history that it has not yet been validated), Last edited (today's date), Owner, and Origin (from the YAGNI preflight in Step 2).
+
+4. **Fill each required section** following the template's HTML comments for guidance:
+   - **Symptoms** — what the engineer sees.
+   - **Prerequisites** — required access and tools. Write "None — workstation only" if nothing is required; do not leave blank.
+   - **Resolve** — numbered steps with exact commands and expected output. One logical action per step.
+   - **Verify the fix landed** — concrete checks that the original symptom is gone.
+   - **Escalate** — condition → recipient → channel.
+   - **Rollback** — steps to undo, or explicit "Not applicable — {reason and alternative}".
+   - **Live links** — operational surfaces used during the incident.
+   - **Change history** — start with the creation entry citing the Origin reference.
+
+5. **Fill applicable optional sections** and **delete the headings for any optional section that does not apply**. The optional sections are: Likely cause, Not this — try instead, Background, Quick fix, If a step fails, If the problem comes back, What didn't work and why, Background and related. An empty heading reads as "this runbook is incomplete" — delete rather than leave blank.
+
+6. **Delete the author guidance comment block** at the top of the template once the file is filled in.
+
+7. **If updating an existing runbook:** edit the existing file in place. Append a new change-history entry on top with the date, your name, what changed and why, and the validation status. Update `Last edited` to today; update `Last validated` only if you actually ran the procedure end-to-end against production or a faithful staging environment.
+
+## Step 6: Integration
+
+1. If the project's CLAUDE.md or AGENTS.md has a section that lists runbooks (or that references operational documentation by name), add a one-line entry pointing to the new runbook. Follow the pattern of existing entries; do not invent a new convention.
+2. If the runbook closes a procedure documented in an incident report, post-mortem, or related ADR, add a cross-reference from that document back to the runbook.
+3. If the runbook's `Triggers` field names an alert that has a definition file in the repository (Prometheus rule, monitoring-as-code config), add a comment in the alert definition pointing to the runbook path.
+
+## Step 7: Verification
+
+Read back the runbook file and confirm:
+
+1. All metadata fields are filled — no `{placeholder}` values remain in Severity, Triggers, Reversible, Owner, Origin. `Last validated` is either a real date with the validating party or explicitly noted as not yet validated in change history.
+2. The Origin field contains a real link or reference per the YAGNI preflight. If the user overrode the preflight, the override is recorded explicitly.
+3. The Symptoms section is concrete (alert text, error message, log line, or user-visible behavior) rather than generic prose.
+4. Every step in Resolve has either an exact command with expected output, or a non-command action with the equivalent "what success looks like" signal.
+5. Verify the fix landed lists at least one concrete check that the original symptom is gone, distinct from per-step expected output.
+6. Escalate entries lead with a condition (when), then the recipient, then the channel.
+7. Rollback is either filled with steps or explicitly marked not applicable with an alternative.
+8. Optional sections that do not apply have been deleted entirely — no empty headings remain.
+9. The author guidance comment block at the top of the template has been removed.
+10. Change history has at least one entry — the creation entry citing Origin.
+
+Fix any issues found before presenting the runbook to the user.
diff --git a/plugin/skills/runbook/references/runbook-template.md b/plugin/skills/runbook/references/runbook-template.md
new file mode 100644
index 0000000..2784735
--- /dev/null
+++ b/plugin/skills/runbook/references/runbook-template.md
@@ -0,0 +1,159 @@
+
+
+# Runbook: {Title}
+
+
+
+> {One-line description: what the engineer will see and what this runbook does about it. Mirror the alert text where possible.}
+
+- **Severity:** {SEV-1 | SEV-2 | SEV-3 | routine}
+- **Triggers:** {alert name(s) and link, schedule, upstream runbook, customer report, or "manual"}
+- **Reversible:** {yes — see Rollback | partial — see Rollback | no — wait it out | no — data loss possible}
+- **Last validated:** {YYYY-MM-DD by {who}}
+- **Last edited:** {YYYY-MM-DD}
+- **Owner:** {team or person paged at 2am for this runbook's freshness}
+- **Origin:** {link to the incident, alert-firing record, ticket, recurring task, or "first observed YYYY-MM-DD in {context}"}
+
+## Symptoms
+
+
+
+- …
+
+### Likely cause (optional)
+
+
+
+### Not this — try instead (optional)
+
+
+
+- **{Adjacent symptom}** — try {other runbook path}
+
+## Background (optional)
+
+
+
+## Prerequisites
+
+
+
+- …
+
+## Quick fix (optional)
+
+
+
+**Run only if:** {the single precondition that makes this safe to run sight-unseen}.
+
+```
+$ {exact-command-1}
+$ {exact-command-2}
+```
+
+Then jump to [Verify the fix landed](#verify-the-fix-landed).
+
+## Resolve
+
+
+
+### 1. {First action}
+
+```
+$ {exact-command}
+```
+
+Expected output:
+
+```
+{what success looks like}
+```
+
+If you see something different: {what that means and which step or escalation to jump to}.
+
+### 2. {Second action}
+
+```
+$ {exact-command}
+```
+
+Expected output:
+
+```
+{what success looks like}
+```
+
+
+
+## Verify the fix landed
+
+
+
+- {Check 1 — what to look at, what counts as healthy}
+- {Check 2 — alert auto-clears, dashboard returns to baseline, user confirmation, …}
+
+## If a step fails (optional)
+
+
+
+- **Step N failed with {error}:** {what to try, or which escalation}
+
+## If the problem comes back (optional)
+
+
+
+- **Recurrence pattern:** {what to investigate, which related runbook to consult, when to open an incident}
+
+## What didn't work and why (optional)
+
+
+
+- **{What was tried}:** {why it failed, when not to try it again}
+
+## Escalate
+
+
+
+1. **If {condition, e.g., step 3 fails or 15 minutes elapsed without resolution}:** page {role / person} via {channel — PagerDuty service `service-name`, Slack `#channel`, phone}
+2. **If {next condition}:** {next contact and channel}
+
+## Rollback
+
+
+
+{Describe the rollback as steps; include exact commands when applicable. If not applicable, write "Not applicable — {reason and what to do instead}."}
+
+## Live links
+
+
+
+- {label}: {url}
+
+## Background and related (optional)
+
+
+
+- {label}: {path or url}
+
+## Change history
+
+
+
+- **{YYYY-MM-DD}** — {who}: {what changed and why} [validated: yes | no | partial — {scope}]
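
Editor's illustration, not part of the patch: the template marks unfilled values with `{...}` placeholders, and the skill's verification step asks that none remain in Severity, Triggers, Reversible, Owner, or Origin. A minimal Python sketch of that one check might look like the following. The function name, the regular expressions, and the decision to treat a missing field as a failure are assumptions for this example, not something the skill defines.

```python
import re
from typing import List

# Fields that must not keep a {placeholder}. "Last validated" is left out on
# purpose: the skill allows it to stay empty until a real validation run.
REQUIRED_FIELDS = ("Severity", "Triggers", "Reversible", "Owner", "Origin")

# Matches metadata bullets of the form "- **Name:** value".
FIELD_LINE = re.compile(r"^- \*\*(?P<name>[^:*]+):\*\*\s*(?P<value>.*)$")


def unfilled_metadata(runbook_text: str) -> List[str]:
    """Return required metadata fields that are blank, missing, or still hold {...}."""
    problems = []
    seen = set()
    for line in runbook_text.splitlines():
        match = FIELD_LINE.match(line.strip())
        if not match or match["name"] not in REQUIRED_FIELDS:
            continue
        seen.add(match["name"])
        value = match["value"]
        if not value or re.search(r"\{[^}]*\}", value):
            problems.append(match["name"])
    # Assumption: a required field that never appears also counts as unfilled.
    problems.extend(name for name in REQUIRED_FIELDS if name not in seen)
    return problems
```

An empty result only says the metadata block has no leftover placeholders; whether the Origin link is real evidence still needs a human or the skill's own read-back.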