feat(skills): SRE skill-library expansion — incident playbooks + runtime/AI engines (11 skills) by rlaope · Pull Request #94 · rlaope/cloudy

rlaope · 2026-05-28T03:04:47Z

Summary

Expands the built-in skill library from 13 to 24 skills. All 11 additions are read-only markdown playbooks that compose existing tool groups: no new tools and no Go changes are needed, because //go:embed builtin/*.md auto-loads them. TestLoadBuiltin asserts only >= 7 loaded skills, so it needs no update.

Incident / ops playbooks (5)

deploy-regression — align Argo CD sync vs. error/latency onset; verdict + explicit rollback-target SHA.
capacity-scheduling — classify Pending pods (capacity / constraint / downstream-block) from the verbatim FailedScheduling reason; reasons on requests, not utilisation.
slo-burn — multi-window multi-burn-rate budget analysis; fast-burn vs. slow-burn page decision.
network-connectivity — walk DNS → Service/endpoints → NetworkPolicy → Ingress/mesh, stop at the first broken layer.
triage-orchestrator — first-responder router: localise the blast radius, hand off to one specialist skill.
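The multi-window multi-burn-rate decision in slo-burn can be sketched as below. This is an illustration under stated assumptions, not the playbook's actual logic: the type and function names are hypothetical, and the thresholds (14.4 for the fast pair, 6 for the slow pair) are the commonly cited values from Google's SRE Workbook rather than anything confirmed by this PR.

```go
package main

import "fmt"

// burnRate says how many times faster than "exactly on budget" the error
// budget is being consumed: observed error ratio / allowed error ratio.
func burnRate(errorRatio, sloTarget float64) float64 {
	return errorRatio / (1 - sloTarget)
}

// windowPair holds the error ratios observed over a long window and its
// shorter confirmation window, plus the burn-rate threshold for the pair.
type windowPair struct {
	Long, Short float64
	Threshold   float64
}

// firing requires BOTH windows to exceed the threshold, so a burn that has
// already stopped (short window recovered) does not trigger.
func (w windowPair) firing(slo float64) bool {
	return burnRate(w.Long, slo) >= w.Threshold &&
		burnRate(w.Short, slo) >= w.Threshold
}

// decide classifies the current state; the fast pair takes precedence.
func decide(slo float64, fast, slow windowPair) string {
	switch {
	case fast.firing(slo):
		return "fast-burn"
	case slow.firing(slo):
		return "slow-burn"
	default:
		return "none"
	}
}

func main() {
	const slo = 0.999 // 99.9% target, i.e. a 0.1% error budget

	// Assumed thresholds: 14.4x (e.g. 1h/5m) and 6x (e.g. 6h/30m).
	fast := windowPair{Long: 0.02, Short: 0.0005, Threshold: 14.4}
	slow := windowPair{Long: 0.008, Short: 0.007, Threshold: 6}

	// The fast pair's short window has recovered, so only slow-burn fires.
	fmt.Println(decide(slo, fast, slow)) // prints "slow-burn"
}
```

Requiring the short window to agree with the long one is the design point: it keeps an alert from firing long after a brief spike has ended.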

Runtime / compile-engine + AI playbooks (6)

Fill the engine-level gaps where perf/prom tools existed but no skill drove them. Each embeds the runtime's compile/optimisation model so the output names a cause class + lever. perf.* sampling is RiskHigh (operator-approved).

node-runtime — V8 event-loop lag vs. GC (scavenge/mark-sweep) vs. TurboFan deopt; perf.v8_inspector_*.
go-runtime — goroutine leaks, GC pacing (GOGC/GOMEMLIMIT), GOMAXPROCS oversubscription; perf.go_pprof_cpu.
ruby-runtime — GVL contention, generational GC, YJIT, malloc bloat; perf.rbspy_dump.
dotnet-runtime — gen2/LOH GC, ThreadPool starvation, Server-vs-Workstation GC, tiered-JIT warmup (prom EventCounters).
native-perf — C/C++/Rust hotspots: codegen/cache/branch/contention classes; perf.linux_perf_record.
ai-inference — LLM serving TTFT vs. ITL decomposition, KV-cache/batch saturation, GPU compute-vs-memory bound (vLLM/Triton/TGI/TorchServe).

Combined with existing jvm-gc / jvm-thread / py-perf, runtime coverage now spans Go, JVM, .NET, V8/Node, Python, Ruby, native, and AI serving.

Test plan

go test ./internal/core/skills/... — all 24 builtin skills parse/load
go test ./... — full suite green
Every allowed_tools entry references a registered tool (k8s/prom/log/trace/db/alert/gitops/perf)
skills.Parse invariants hold for each file (name==stem, description, ≥1 allowed_tools, non-empty body)
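The last two checks can be sketched as a single validation function. This is a hedged illustration rather than the real skills.Parse: the `Skill` struct, `validate`, and `registeredGroups` are hypothetical, and matching only the group prefix of each tool name is a simplification of checking against the actual tool registry.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// Skill is a hypothetical parsed form of one builtin/*.md playbook.
type Skill struct {
	Name         string
	Description  string
	AllowedTools []string
	Body         string
}

// registeredGroups lists the tool groups named in the test plan.
var registeredGroups = map[string]bool{
	"k8s": true, "prom": true, "log": true, "trace": true,
	"db": true, "alert": true, "gitops": true, "perf": true,
}

// validate collects every violated invariant instead of stopping at the
// first, and returns nil when the skill is well-formed.
func validate(filename string, s Skill) error {
	var errs []error
	stem := strings.TrimSuffix(path.Base(filename), ".md")
	if s.Name != stem {
		errs = append(errs, fmt.Errorf("name %q != file stem %q", s.Name, stem))
	}
	if strings.TrimSpace(s.Description) == "" {
		errs = append(errs, errors.New("empty description"))
	}
	if len(s.AllowedTools) == 0 {
		errs = append(errs, errors.New("no allowed_tools"))
	}
	if strings.TrimSpace(s.Body) == "" {
		errs = append(errs, errors.New("empty body"))
	}
	for _, tool := range s.AllowedTools {
		// Simplification: only the "group." prefix is checked here.
		group, _, found := strings.Cut(tool, ".")
		if !found || !registeredGroups[group] {
			errs = append(errs, fmt.Errorf("unregistered tool %q", tool))
		}
	}
	return errors.Join(errs...)
}

func main() {
	ok := Skill{
		Name:         "go-runtime",
		Description:  "Go runtime diagnosis",
		AllowedTools: []string{"perf.go_pprof_cpu"},
		Body:         "Playbook steps go here.",
	}
	fmt.Println(validate("builtin/go-runtime.md", ok) == nil) // prints "true"

	bad := ok
	bad.AllowedTools = []string{"shell.exec"}
	fmt.Println(validate("builtin/go-runtime.md", bad))
}
```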

…, SLO burn, connectivity, triage router

Adds five read-only diagnostic skills that compose existing tool groups (no new tools, no code change; go:embed picks up builtin/*.md):

- deploy-regression: align Argo sync time vs. error/latency onset, name the rollback-target revision.
- capacity-scheduling: classify Pending pods as capacity vs. constraint vs. downstream-block from the verbatim FailedScheduling reason.
- slo-burn: multi-window multi-burn-rate budget analysis, fast-burn vs. slow-burn page decision.
- network-connectivity: walk the request path (DNS → Service/endpoints → NetworkPolicy → Ingress/mesh) and stop at the first broken layer.
- triage-orchestrator: breadth-first first-responder router that localises the blast radius and hands off to one specialist skill.

Signed-off-by: rlaope <piyrw9754@gmail.com>

…ruby, dotnet, native, ai)

Fills the engine-level monitoring gaps where the perf/prom tools already exist but no skill drove them. Each playbook embeds the runtime's compile/optimisation model so the diagnosis names the cause class and the lever, not just the symptom. Read-only; perf.* sampling is RiskHigh and gated on operator approval.

- node-runtime: V8 event-loop lag vs. GC (scavenge/mark-sweep) vs. TurboFan deopt; CPU profile via perf.v8_inspector_*.
- go-runtime: goroutine leaks, GC pacing (GOGC/GOMEMLIMIT), GOMAXPROCS oversubscription; pprof via perf.go_pprof_cpu.
- ruby-runtime: GVL contention, generational GC, YJIT, malloc bloat; stacks via perf.rbspy_dump.
- dotnet-runtime: gen2/LOH GC, ThreadPool starvation, Server-vs-Workstation GC, tiered JIT warmup (prom EventCounters).
- native-perf: C/C++/Rust CPU hotspots: codegen/cache/branch/contention cause classes; perf.linux_perf_record call graph.
- ai-inference: LLM serving TTFT vs. ITL decomposition, KV-cache/batch saturation, GPU compute-vs-memory bound across vLLM/Triton/TGI/TorchServe.

Signed-off-by: rlaope <piyrw9754@gmail.com>

rlaope added 2 commits May 28, 2026 12:04

rlaope changed the title ~~feat(skills): Phase 1 SRE playbooks (deploy/capacity/SLO/network/triage)~~ feat(skills): SRE skill-library expansion — incident playbooks + runtime/AI engines (11 skills) May 28, 2026

docs(skills): tighten runtime accuracy from code review (Go GOGC pacing, Node heap default, vLLM KV-cache metric, .NET threadpool injection)

cbd5d16

Signed-off-by: rlaope <piyrw9754@gmail.com>

rlaope merged commit c93a980 into master May 28, 2026
1 of 2 checks passed

rlaope deleted the feat/sre-skills-phase1 branch May 28, 2026 03:22

This was referenced May 28, 2026

fix(skills): correct perf tool-arg names + metric/runtime inaccuracies (code-review followup) #95

Merged

docs(skills): consolidate to one embedded source + sync README/CHANGELOG #99

Merged

rlaope merged 3 commits into master from feat/sre-skills-phase1
