
feat(skills): SRE skill-library expansion — incident playbooks + runtime/AI engines (11 skills)#94

Merged
rlaope merged 3 commits into master from feat/sre-skills-phase1 on May 28, 2026


@rlaope rlaope commented May 28, 2026

Summary

Expands the built-in skill library from 13 to 24 skills. All 11 additions are read-only markdown playbooks that compose existing tool groups: no new tools and no Go change, because //go:embed builtin/*.md auto-loads them (TestLoadBuiltin asserts a floor of >= 7 skills).

Incident / ops playbooks (5)

  • deploy-regression — align Argo CD sync vs. error/latency onset; verdict + explicit rollback-target SHA.
  • capacity-scheduling — classify Pending pods (capacity / constraint / downstream-block) from the verbatim FailedScheduling reason; reasons on requests, not utilisation.
  • slo-burn — multi-window multi-burn-rate budget analysis; fast-burn vs. slow-burn page decision.
  • network-connectivity — walk DNS → Service/endpoints → NetworkPolicy → Ingress/mesh, stop at the first broken layer.
  • triage-orchestrator — first-responder router: localise the blast radius, hand off to one specialist skill.
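
The slo-burn decision above can be illustrated with a small sketch. Everything here is an assumption for illustration: the function and type names are hypothetical, and the thresholds (14.4 and 6) are the commonly cited multi-window defaults from SRE practice, not values read from the skill file.

```go
package main

import "fmt"

// burnRate is how many times faster than "exactly on budget" errors are
// being spent: observed error ratio divided by the budget (1 - SLO target).
func burnRate(errorRatio, sloTarget float64) float64 {
	return errorRatio / (1 - sloTarget)
}

// windows holds the error ratio measured over a long window and over a
// short confirmation window.
type windows struct {
	Long, Short float64
}

// Assumed thresholds, not taken from slo-burn.md.
const (
	fastBurnThreshold = 14.4 // e.g. 1h long window with a 5m short window
	slowBurnThreshold = 6.0  // e.g. 6h long window with a 30m short window
)

// exceeds reports whether both windows burn faster than the threshold.
func exceeds(w windows, sloTarget, threshold float64) bool {
	return burnRate(w.Long, sloTarget) > threshold &&
		burnRate(w.Short, sloTarget) > threshold
}

// decide returns "fast-burn", "slow-burn", or "none".
func decide(sloTarget float64, fast, slow windows) string {
	switch {
	case exceeds(fast, sloTarget, fastBurnThreshold):
		return "fast-burn"
	case exceeds(slow, sloTarget, slowBurnThreshold):
		return "slow-burn"
	default:
		return "none"
	}
}

func main() {
	slo := 0.999
	fmt.Println(decide(slo, windows{0.02, 0.03}, windows{0.01, 0.02}))
	fmt.Println(decide(slo, windows{0.008, 0.001}, windows{0.008, 0.009}))
}
```

Requiring both the long and the short window to exceed the threshold is what keeps a brief spike, or an incident that has already recovered, from triggering a page.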

Runtime / compile-engine + AI playbooks (6)

Fill the engine-level gaps where perf/prom tools existed but no skill drove them. Each embeds the runtime's compile/optimisation model so the output names a cause class + lever. perf.* sampling is RiskHigh (operator-approved).

  • node-runtime — V8 event-loop lag vs. GC (scavenge/mark-sweep) vs. TurboFan deopt; perf.v8_inspector_*.
  • go-runtime — goroutine leaks, GC pacing (GOGC/GOMEMLIMIT), GOMAXPROCS oversubscription; perf.go_pprof_cpu.
  • ruby-runtime — GVL contention, generational GC, YJIT, malloc bloat; perf.rbspy_dump.
  • dotnet-runtime — gen2/LOH GC, ThreadPool starvation, Server-vs-Workstation GC, tiered-JIT warmup (prom EventCounters).
  • native-perf — C/C++/Rust hotspots: codegen/cache/branch/contention classes; perf.linux_perf_record.
  • ai-inference — LLM serving TTFT vs. ITL decomposition, KV-cache/batch saturation, GPU compute-vs-memory bound (vLLM/Triton/TGI/TorchServe).

Combined with existing jvm-gc / jvm-thread / py-perf, runtime coverage now spans Go, JVM, .NET, V8/Node, Python, Ruby, native, and AI serving.

Test plan

  • go test ./internal/core/skills/... — all 24 builtin skills parse/load
  • go test ./... — full suite green
  • Every allowed_tools entry references a registered tool (k8s/prom/log/trace/db/alert/gitops/perf)
  • skills.Parse invariants hold for each file (name==stem, description, ≥1 allowed_tools, non-empty body)
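
The last two checks can be pictured roughly as below. This is a sketch under assumptions, not the actual `skills.Parse` code: the `skill` struct, `validate`, and the idea of checking only the tool-group prefix (the text before the first dot) are all hypothetical simplifications; the group names are the ones listed in the test plan.

```go
package main

import (
	"fmt"
	"strings"
)

// skill is a hypothetical in-memory form of one parsed playbook.
type skill struct {
	Name         string
	Description  string
	AllowedTools []string
	Body         string
}

// registeredGroups mirrors the tool groups named in the test plan.
var registeredGroups = map[string]bool{
	"k8s": true, "prom": true, "log": true, "trace": true,
	"db": true, "alert": true, "gitops": true, "perf": true,
}

// validate checks the invariants for a skill loaded from "<stem>.md".
func validate(stem string, s skill) error {
	if s.Name != stem {
		return fmt.Errorf("name %q does not match file stem %q", s.Name, stem)
	}
	if strings.TrimSpace(s.Description) == "" {
		return fmt.Errorf("%s: empty description", stem)
	}
	if len(s.AllowedTools) == 0 {
		return fmt.Errorf("%s: no allowed_tools", stem)
	}
	if strings.TrimSpace(s.Body) == "" {
		return fmt.Errorf("%s: empty body", stem)
	}
	for _, tool := range s.AllowedTools {
		// Simplification: only the group prefix is checked, not the full tool name.
		group, _, found := strings.Cut(tool, ".")
		if !found || !registeredGroups[group] {
			return fmt.Errorf("%s: unregistered tool %q", stem, tool)
		}
	}
	return nil
}

func main() {
	s := skill{
		Name:         "go-runtime",
		Description:  "Goroutine leaks, GC pacing, GOMAXPROCS oversubscription.",
		AllowedTools: []string{"prom.query", "perf.go_pprof_cpu"},
		Body:         "## Steps\n...",
	}
	fmt.Println(validate("go-runtime", s)) // prints <nil>
	fmt.Println(validate("ruby-runtime", s))
}
```

In this sketch `prom.query` is a made-up tool name used only to exercise the prefix check; `perf.go_pprof_cpu` is the one named in the PR description.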

rlaope added 2 commits May 28, 2026 12:04
…, SLO burn, connectivity, triage router

Adds five read-only diagnostic skills that compose existing tool groups
(no new tools, no code change — go:embed picks up builtin/*.md):

- deploy-regression: align Argo sync time vs. error/latency onset, name
  the rollback-target revision.
- capacity-scheduling: classify Pending pods as capacity vs. constraint
  vs. downstream-block from the verbatim FailedScheduling reason.
- slo-burn: multi-window multi-burn-rate budget analysis, fast-burn vs.
  slow-burn page decision.
- network-connectivity: walk the request path (DNS → Service/endpoints →
  NetworkPolicy → Ingress/mesh) and stop at the first broken layer.
- triage-orchestrator: breadth-first first-responder router that
  localises the blast radius and hands off to one specialist skill.
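
As an illustration of the capacity-scheduling classification, the sketch below maps a FailedScheduling message to one of the three classes by substring. The message fragments are typical kube-scheduler wording and may differ across Kubernetes versions; treating unbound PersistentVolumeClaims as the "downstream-block" case is an assumed reading, and `classify` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether s contains at least one of the fragments.
func containsAny(s string, fragments ...string) bool {
	for _, f := range fragments {
		if strings.Contains(s, f) {
			return true
		}
	}
	return false
}

// classify maps a verbatim FailedScheduling message to a cause class.
// The first matching case wins, so a mixed message is reported by its
// highest-priority class only (a deliberate simplification).
func classify(message string) string {
	m := strings.ToLower(message)
	switch {
	case containsAny(m, "unbound immediate persistentvolumeclaims"):
		return "downstream-block"
	case containsAny(m, "insufficient ", "too many pods"):
		return "capacity"
	case containsAny(m, "untolerated taint", "didn't match pod's node affinity",
		"didn't match pod affinity", "topology spread constraints"):
		return "constraint"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(classify("0/5 nodes are available: 5 Insufficient cpu."))
	fmt.Println(classify("0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector."))
	fmt.Println(classify("pod has unbound immediate PersistentVolumeClaims"))
}
```

"Insufficient cpu" in these messages is computed from pod requests against node allocatable, which matches the playbook's point that the diagnosis reasons on requests rather than on live utilisation.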

Signed-off-by: rlaope <piyrw9754@gmail.com>
…ruby, dotnet, native, ai)

Fills the engine-level monitoring gaps where the perf/prom tools already
exist but no skill drove them. Each playbook embeds the runtime's
compile/optimisation model so the diagnosis names the cause class and the
lever, not just the symptom. Read-only; perf.* sampling is RiskHigh and
gated on operator approval.

- node-runtime: V8 event-loop lag vs. GC (scavenge/mark-sweep) vs. TurboFan
  deopt; CPU profile via perf.v8_inspector_*.
- go-runtime: goroutine leaks, GC pacing (GOGC/GOMEMLIMIT), GOMAXPROCS
  oversubscription; pprof via perf.go_pprof_cpu.
- ruby-runtime: GVL contention, generational GC, YJIT, malloc bloat;
  stacks via perf.rbspy_dump.
- dotnet-runtime: gen2/LOH GC, ThreadPool starvation, Server-vs-Workstation
  GC, tiered JIT warmup (prom EventCounters).
- native-perf: C/C++/Rust CPU hotspots — codegen/cache/branch/contention
  cause classes; perf.linux_perf_record call graph.
- ai-inference: LLM serving TTFT vs. ITL decomposition, KV-cache/batch
  saturation, GPU compute-vs-memory bound across vLLM/Triton/TGI/TorchServe.

Signed-off-by: rlaope <piyrw9754@gmail.com>
@rlaope rlaope changed the title from "feat(skills): Phase 1 SRE playbooks (deploy/capacity/SLO/network/triage)" to "feat(skills): SRE skill-library expansion — incident playbooks + runtime/AI engines (11 skills)" on May 28, 2026
…ng, Node heap default, vLLM KV-cache metric, .NET threadpool injection)

Signed-off-by: rlaope <piyrw9754@gmail.com>
@rlaope rlaope merged commit c93a980 into master May 28, 2026
1 of 2 checks passed
@rlaope rlaope deleted the feat/sre-skills-phase1 branch May 28, 2026 03:22