feat: pod-startup diagnostics + make issues mean "what's broken right now" by nadaverell · Pull Request #775 · skyhook-io/radar

nadaverell · 2026-05-25T21:19:16Z

Two coupled changes. (1) Makes pod-startup failures a first-class signal — radar's biggest blind spot. (2) Tightens issues into a single curated "what's broken right now" stream and stops exporting detector taxonomy as an API/agent knob.

1. Pod-startup diagnostics

Unlike state problems (CrashLoop, OOM) or reference problems (missing PVC), the why of a pod that can't reach Running is diffuse — spread across the Pod's PodScheduled condition, a sibling ReplicaSet's FailedCreate, and the join between a pod's constraints and the fleet's node labels. The scheduler already did the analysis; it just hands it back as one opaque string. We parse it and (the differentiator) resolve "didn't match node affinity/selector" to the specific offending label by joining the node cache:

Unschedulable: no node has kubernetes.io/arch=arm64 — 2 node(s) carry [amd64] (0/2 available)
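The cache join behind a message like the one above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/k8s/scheduling.go: `SelectorMiss` and `resolveSelectorMiss` are hypothetical names, nodes are reduced to plain label maps, and each selector key is checked independently (a real resolver would also have to handle affinity expressions and requirements that must hold on the same node).

```go
package main

import (
	"fmt"
	"sort"
)

// SelectorMiss describes one nodeSelector requirement that no node satisfies.
// Hypothetical type, for illustration only.
type SelectorMiss struct {
	Key      string   // label key the pod requires
	Want     string   // label value the pod requires
	Observed []string // distinct other values nodes carry for Key, sorted
	Nodes    int      // number of nodes carrying Key with a different value
}

// resolveSelectorMiss returns the first selector entry (in key order) that no
// node satisfies, or nil when every entry is satisfied by at least one node.
func resolveSelectorMiss(selector map[string]string, nodes []map[string]string) *SelectorMiss {
	keys := make([]string, 0, len(selector))
	for k := range selector {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order, since map iteration is random

	for _, k := range keys {
		want := selector[k]
		seen := map[string]bool{}
		carrying, matched := 0, false
		for _, labels := range nodes {
			v, ok := labels[k]
			if !ok {
				continue
			}
			if v == want {
				matched = true
				break
			}
			carrying++
			seen[v] = true
		}
		if matched {
			continue
		}
		observed := make([]string, 0, len(seen))
		for v := range seen {
			observed = append(observed, v)
		}
		sort.Strings(observed)
		return &SelectorMiss{Key: k, Want: want, Observed: observed, Nodes: carrying}
	}
	return nil
}

func main() {
	nodes := []map[string]string{
		{"kubernetes.io/arch": "amd64"},
		{"kubernetes.io/arch": "amd64"},
	}
	miss := resolveSelectorMiss(map[string]string{"kubernetes.io/arch": "arm64"}, nodes)
	if miss != nil {
		fmt.Printf("no node has %s=%s; %d node(s) carry %v\n", miss.Key, miss.Want, miss.Nodes, miss.Observed)
	}
}
```

Returning a structured value rather than a formatted string keeps the offending label available to callers that render it differently (banner, tooltip, MCP response).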

Three layers (internal/k8s/scheduling.go, ~30 tests): bind-time (PodScheduled=False + node-fit resolver; SchedulingGated correctly excluded), admission (controller FailedCreate: quota / LimitRange / PodSecurity / webhook — the layer with no Pod to inspect, latest-blocker-wins dedup), post-bind (ContainerCreating → CNI IP-exhaustion / volume attach-mount, latest-event-wins).

Plumbed end-to-end: ResourceQuota added to the typed informer cache (browsable, RBAC-probed) and to the Helm ClusterRole's read-only grant (self-hosted/Cloud installs need this on upgrade or the quota section silently no-ops); scheduling issue source; MCP issues + diagnose (startupBlockers section) + dashboard; UI Pod banner / Namespace quota bars / topology tooltips.

2. `issues` = "what's broken right now"

issues is now one curated operational stream (problems + dangling refs + pod-startup blockers + False CRD conditions), severity-ranked. Detection provenance is no longer a filter axis:

Dropped source= from /api/issues + the MCP tool — source survives only as an output label + CEL binding.
Removed event + kyverno from composition (and include_events/include_kyverno). Raw events → get_events/timeline; policy posture → get_cluster_audit.
Removed proactive quota saturation from the stream — it's namespace capacity context, not a live failure (reactive FailedCreate + the Namespace quota UI still cover it).
get_resource drops the crippled include=logs; docs/mcp.md tool table refreshed; write tools documented as destructiveHint: true.

Verified live

Against radar-test-nonprod with induced fixtures (arm64 nodeSelector + near-full quota): /api/issues, /api/dashboard, /api/resources/resourcequotas correct. Live run caught a real bug (/api/dashboard is a separate builder from MCP get_dashboard and initially double-surfaced pods — fixed).

Test plan

go build ./... + pkg/k8score clean; go test ./internal/{k8s,issues,mcp,server}/ + pkg/k8score pass; make tsc clean; live API verification.

Notes / follow-ups

Hub: /api/issues no longer reads source=/include_* (extra params ignored, no 400). radar-hub-web should be checked for any fleet view that relied on source=kyverno/include_kyverno.
Deferred: Kyverno's aggregated home (fold into get_cluster_audit vs a compliance view — open decision; per-resource renderer is the interim); MCP tool consolidation (deemed low-value).

Note

Medium Risk
Behavioral breaking changes for API/MCP clients that relied on source=, Kyverno/events in issues, or get_resource include=logs; core triage paths and dashboard issues composition were touched across server, MCP, and issues packages.

Overview
This PR couples pod-startup diagnostics with a tighter definition of what issues means for operators and AI agents.

Pod-startup failures get a dedicated scheduling path in internal/k8s/scheduling.go: bind-time unschedulable pods (scheduler messages parsed and node affinity/selector misses resolved against the node cache), admission-time FailedCreate blocks (quota, LimitRange, PodSecurity, webhooks), and post-bind stalls (CNI IP exhaustion, volume attach/mount). Generic problem detection skips unschedulable pods; richer scheduling rows win over duplicate bare Pending problem rows. The same signals feed dashboard, MCP get_dashboard, diagnose (startupBlockers), and the issues composer.

issues is now “what’s broken right now” only: composition always includes problem, missing_ref, scheduling, and CRD condition sources. source=, include_events / include_kyverno, and warning-event/Kyverno composition are removed; source remains on each row and in CEL filter=. Kyverno PolicyReports are documented as per-resource posture, not the issues stream. get_resource no longer supports include=logs (explicit redirect to log/diagnose tools). MCP/docs now describe write tools as destructiveHint: true.

ResourceQuota is added to the informer cache, list/get APIs, Helm ClusterRole, capabilities probes, and Namespace detail UI (usage bars). README/docs reflect LimitRanges/ResourceQuotas and the issues/MCP contract changes.

Reviewed by Cursor Bugbot for commit e68685b.

Decompose why a Pod can't run into structured signals:
- bind-time: PodScheduled=False → parse the scheduler verdict + resolve node affinity/selector misses against the node cache, naming the offending label (e.g. "no node has kubernetes.io/arch=arm64")
- admission: controller FailedCreate (quota/LimitRange/PodSecurity/webhook) + proactive ResourceQuota saturation (the layer with no Pod to inspect)
- post-bind: ContainerCreating decoded into CNI IP-exhaustion + volume attach/mount, cross-checked against still-stuck pods

Add ResourceQuota to the typed informer cache (mirroring LimitRange) so the proactive quota read + a browsable ResourceQuota view work. The generic problem detector now defers unschedulable pods to the scheduling source so they aren't double-reported as a bare "Pending".

New SourceScheduling composes the three scheduling detectors through the issues pipeline (default-on, high-signal operational state). /api/issues, the MCP issues tool, and per-resource summaryContext now surface placement/admission/post-bind failures, filterable via source=scheduling. ParseSources accepts the new value; the Provider gains DetectScheduling.

- issues tool: source=scheduling documented and in the default set
- diagnose: a schedulability section scoped to the workload (its unschedulable pods, its ReplicaSet's FailedCreate, and any namespace ResourceQuota saturation), the one-shot answer for an admission/quota stall
- get_dashboard: scheduling rows roll into the problem list; admission rows have no Pod, so the dashboard pod loop never surfaced them before

- PodRenderer: lead the banner with the decomposed scheduler verdict instead of a bare "Unschedulable" (untolerated taints, insufficient resources, and affinity/selector misses named). New PodProblem.detail keeps message exact so filter-chip matching is unaffected.
- NamespaceRenderer: a ResourceQuota usage section with per-resource saturation bars (amber >=90%, red >=100%). Quota pressure was shown nowhere despite being exactly why a namespace stops admitting pods. Fetched via a new useNamespaceQuotas hook over /api/resources/resourcequotas.
- topology tooltips: scheduling-aware guidance for the new reason keywords (Unschedulable, QuotaExceeded, IPExhaustion, VolumeMount/Attach, …).

/api/dashboard (the home ProblemsPanel source) is a separate builder from the MCP get_dashboard one wired earlier — it only gathered DetectProblems + DetectMissingRefs, so unschedulable pods and quota saturation never reached the home view. Append the three scheduling detectors directly (bypassing the Missing-ref Pod filter, since an Unschedulable row is the reason, not a dup). Verified live: the panel now shows the arch-mismatch Unschedulable row (with the offending label named) and the 99% QuotaNearLimit row.

nadaverell · 2026-05-25T21:44:07Z

bugbot run

- server: route /api/resources/resourcequotas through the typed informer in handleListResources + handleGetResource (it fell through to the dynamic cache, so the namespace quota UI could read [] on first open before sync).
- scheduling: restrict the quota-pressure check to pod-admission-relevant resources (cpu/memory/pods/ephemeral-storage/requests.*/limits.*/PVC) so an object-count quota (configmaps/services) no longer shows as "blocks new pods".
- scheduling: cross-check the involved workload's current readiness before emitting an admission FailedCreate row, so a since-recovered workload no longer surfaces as critical off a lingering event.
- dashboard: skip unschedulable pods in the REST rollup (they're owned by the scheduling rows) so they don't double-surface; fix the stale comment.
- frontend: thread the namespace quota fetch error through: 403 hides the section, but 500/503 now shows a note instead of silently rendering quota-free.
- types: drop dead NodeFacts.Taints/Unschedulable + TaintFact (written, never read); document the SchedulingReason union invariant.
- mcp: add scheduling to the issues tool Description defaults + example.
- comments: correct the node-fit resolver doc (no taint cache-join); strip external bench scenario name.
- tests: cache-level integration tests for the quota ramp + S1 filter, bind-time node-fit naming, and the admission recovered-workload cross-check; ParseSources scheduling token; frontend summarizeSchedulerMessage.
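The pod-admission filter described in the second bullet might look something like the sketch below. The function name isPodAdmissionQuotaResource appears later in this PR, but the signature and body here are assumptions, as is reading "PVC" as the `persistentvolumeclaims` quota key.

```go
package main

import (
	"fmt"
	"strings"
)

// isPodAdmissionQuotaResource reports whether a ResourceQuota resource name
// can plausibly block pod creation. Assumed shape, not the repo's actual code.
func isPodAdmissionQuotaResource(name string) bool {
	switch name {
	case "cpu", "memory", "pods", "ephemeral-storage", "persistentvolumeclaims":
		return true
	}
	// requests.* and limits.* cover keys such as requests.cpu or limits.memory.
	return strings.HasPrefix(name, "requests.") || strings.HasPrefix(name, "limits.")
}

func main() {
	for _, name := range []string{"pods", "requests.cpu", "limits.memory", "configmaps", "services"} {
		fmt.Printf("%-15s relevant=%v\n", name, isPodAdmissionQuotaResource(name))
	}
}
```

Object-count keys such as `configmaps` and `services` fall through to `false`, which is what keeps them out of the "blocks new pods" wording.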

…ostics

# Conflicts:
# internal/issues/issues_test.go
# internal/mcp/tools.go

…a tones

Replace inline raw red/orange Tailwind in the namespace quota section with the shared severity-color constants, per the repo styling rule.

- detectAdmissionFailures: dedup FailedCreate rows by involved object. A quota-blocked controller emits one event per attempt, each with a different generated pod name (distinct cached events), so one workload produced many near-identical rows. Now one row per workload.
- admissionTargetStillBlocked: gate on created-count (Status.Replicas / CurrentNumberScheduled) below desired, not readiness. A workload whose pods were created but stay not-ready for another reason (e.g. unschedulable after a quota was raised) is no longer admission-blocked, so a stale FailedCreate no longer surfaces a critical QuotaExceeded row.
- admissionTargetStillBlocked (Job): a terminally-failed Job (Failed>0) no longer counts as blocked; only a Job that has created nothing (Active, Succeeded, Failed all 0) does.
- diagnose schedulingFindingsForWorkload: tighten the Deployment→ReplicaSet match to a single hyphen-free hash suffix (isReplicaSetOf), so diagnosing "api" no longer claims "api-gateway-<hash>".
- tests: dedup assertion, created-but-not-ready skip, isReplicaSetOf table.
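The tightened Deployment→ReplicaSet name match can be illustrated with a short sketch. The name isReplicaSetOf comes from the commit message; the argument order and body below are assumptions about how such a check could be written.

```go
package main

import (
	"fmt"
	"strings"
)

// isReplicaSetOf reports whether rsName looks like "<deployName>-<hash>",
// where <hash> is a single non-empty segment containing no hyphen.
// Assumed implementation, for illustration only.
func isReplicaSetOf(rsName, deployName string) bool {
	prefix := deployName + "-"
	if !strings.HasPrefix(rsName, prefix) {
		return false
	}
	suffix := strings.TrimPrefix(rsName, prefix)
	return suffix != "" && !strings.Contains(suffix, "-")
}

func main() {
	fmt.Println(isReplicaSetOf("api-7d9f8c6b5", "api"))         // matches
	fmt.Println(isReplicaSetOf("api-gateway-7d9f8c6b5", "api")) // does not match
}
```

A name-based match stays a heuristic; ReplicaSets created by a Deployment also carry an ownerReference to it, which is the stricter signal when it is available.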

…oundary

- Add TestDetectAdmissionProblems_JobAndDaemonSetCrossCheck: a Job that created no pod and a partially-scheduled DaemonSet surface QuotaExceeded; a terminally-failed Job (Failed>0) and a fully-scheduled DaemonSet are skipped. This pins the net-new Job (Failed==0) and DaemonSet (CurrentNumberScheduled) cross-check branches that the ReplicaSet test didn't exercise.
- Add a below-threshold (50%) quota case to the saturation test so the >=90% warn boundary is pinned, not just the >=90%/100% arms.
- Reword the Job cross-check comment to state the true invariant (any of Active/Succeeded/Failed > 0 means a pod was created) instead of the inaccurate "terminally-failed (backoffLimit)" phrasing; note the mid-retry trade-off explicitly.
- Replace the opaque "S1 filter" test comment with the real mechanism name (isPodAdmissionQuotaResource).
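The cross-check branches that the new test pins can be summarized in a small sketch. `WorkloadStatus` and `admissionStillBlocked` are hypothetical stand-ins for the typed Kubernetes status objects and for admissionTargetStillBlocked; returning false for unknown kinds is a choice of this sketch, not something the PR states.

```go
package main

import "fmt"

// WorkloadStatus flattens the status fields the cross-check reads.
// Hypothetical type; the real code works on typed Kubernetes objects.
type WorkloadStatus struct {
	Kind    string // "ReplicaSet", "DaemonSet" or "Job"
	Desired int32  // desired pod count (ReplicaSet/DaemonSet)
	Created int32  // Status.Replicas or CurrentNumberScheduled
	// Job pod counters; any value > 0 means a pod was created at some point.
	Active, Succeeded, Failed int32
}

// admissionStillBlocked reports whether a lingering FailedCreate event still
// reflects a workload that cannot get its pods created.
func admissionStillBlocked(w WorkloadStatus) bool {
	switch w.Kind {
	case "Job":
		// Blocked only if the Job has never created a pod.
		return w.Active == 0 && w.Succeeded == 0 && w.Failed == 0
	case "ReplicaSet", "DaemonSet":
		// Gate on created count, not readiness.
		return w.Created < w.Desired
	default:
		return false // assumption of this sketch
	}
}

func main() {
	fmt.Println(admissionStillBlocked(WorkloadStatus{Kind: "Job"}))                            // never created a pod
	fmt.Println(admissionStillBlocked(WorkloadStatus{Kind: "Job", Failed: 1}))                 // created, then failed
	fmt.Println(admissionStillBlocked(WorkloadStatus{Kind: "DaemonSet", Desired: 3, Created: 2})) // partially scheduled
}
```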

… first-seen

FailedCreate events are deduped per involved object, but informer List order is arbitrary and the active blocker can change within the 30m window (quota cleared, webhook now rejects). Keep the latest event by LastTimestamp so the surfaced reason reflects the current cause, not whichever the cache iterated first. Pin with a quota→webhook test. Also fix stale source= comments/examples to include scheduling.
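A latest-wins dedup of this kind can be sketched as below. The `Event` type, the `latestPerObject` name, the object-key format, and the "WebhookRejected" reason string are all hypothetical; only the idea of keeping the event with the greatest LastTimestamp per involved object comes from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a simplified stand-in for a cached Kubernetes event.
type Event struct {
	InvolvedObject string    // e.g. "ReplicaSet/default/api-7d9f8c6b5"
	Reason         string    // decoded blocker
	LastTimestamp  time.Time // when the event was last observed
}

// at returns a fixed base time plus the given minutes; it only exists to
// build deterministic sample data.
func at(minutes int) time.Time {
	base := time.Date(2026, time.May, 25, 21, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(minutes) * time.Minute)
}

// latestPerObject keeps, for each involved object, the event with the most
// recent LastTimestamp, regardless of the order events arrive in.
func latestPerObject(events []Event) map[string]Event {
	latest := make(map[string]Event, len(events))
	for _, e := range events {
		cur, ok := latest[e.InvolvedObject]
		if !ok || e.LastTimestamp.After(cur.LastTimestamp) {
			latest[e.InvolvedObject] = e
		}
	}
	return latest
}

func main() {
	events := []Event{
		{InvolvedObject: "ReplicaSet/default/api-7d9f8c6b5", Reason: "WebhookRejected", LastTimestamp: at(5)},
		{InvolvedObject: "ReplicaSet/default/api-7d9f8c6b5", Reason: "QuotaExceeded", LastTimestamp: at(0)},
	}
	for obj, e := range latestPerObject(events) {
		fmt.Println(obj, e.Reason)
	}
}
```

Because the comparison is on timestamps rather than slice position, the result is the same for any iteration order of the informer cache.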

Two Bugbot findings:
- DetectPostBindProblems kept the first qualifying kubelet event per pod by informer order, so a stale blocker could win when the cause changed (NetworkNotReady → FailedMount). Keep the latest by LastTimestamp, mirroring detectAdmissionFailures.
- A pod stuck post-bind surfaced twice in the issues composer: a generic problem-source Pending row AND the richer scheduling-source row. Dedup in the composer so the scheduling row wins for the same Pod. A plain DetectProblems skip can't do this: the problem threshold is 5m but the post-bind event window is 10m, so a pod stuck >10m would lose its only row.
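The composer-level dedup from the second finding could be sketched like this. The `Issue` fields and `preferSchedulingRows` are hypothetical, and dropping every problem-source row for an owned Pod (rather than only the Pending one) is a simplification of this sketch.

```go
package main

import "fmt"

// Issue is a simplified issue row. The source values mirror the names used
// in this PR (problem, scheduling); the struct itself is hypothetical.
type Issue struct {
	Source    string
	Kind      string
	Namespace string
	Name      string
	Reason    string
}

// preferSchedulingRows drops problem-source Pod rows whenever a
// scheduling-source row exists for the same Pod. Pods without a scheduling
// row keep their problem row, so nothing loses its only row.
func preferSchedulingRows(rows []Issue) []Issue {
	owned := map[string]bool{}
	for _, r := range rows {
		if r.Source == "scheduling" && r.Kind == "Pod" {
			owned[r.Namespace+"/"+r.Name] = true
		}
	}
	out := make([]Issue, 0, len(rows))
	for _, r := range rows {
		if r.Source == "problem" && r.Kind == "Pod" && owned[r.Namespace+"/"+r.Name] {
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []Issue{
		{Source: "problem", Kind: "Pod", Namespace: "default", Name: "web-1", Reason: "Pending"},
		{Source: "scheduling", Kind: "Pod", Namespace: "default", Name: "web-1", Reason: "VolumeMount"},
		{Source: "problem", Kind: "Pod", Namespace: "default", Name: "web-2", Reason: "Pending"},
	}
	for _, r := range preferSchedulingRows(rows) {
		fmt.Println(r.Source, r.Name, r.Reason)
	}
}
```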

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit af8ad61.

…rface

The /api/dashboard builder is separate from the issues composer: its pod health rollup flagged long-Pending pods as warnings (skipping only unschedulable) while also appending post-bind scheduling rows, yielding two rows for one stuck pod (bare Pending + the richer VolumeMount/CNI row). Compute the post-bind-owned pod set up front and skip those in the rollup the same way unschedulable pods are skipped; reuse the slice for the scheduling append. Gap-free: only pods that actually get a post-bind row are skipped, so a pod past the 10m post-bind window keeps its rollup row.

…taxonomy filters

issues is now one curated operational stream (workload/pod problems, dangling refs, pod-startup blockers, and False CRD conditions), severity-ranked. Detection provenance is no longer a user/agent filter axis:
- Drop the source= filter from /api/issues and the MCP issues tool. source survives only as an output label on each row + a CEL filter binding.
- Remove event + kyverno from issue composition entirely (and the include_events/include_kyverno flags). Raw events live in get_events / the timeline; policy posture lives in get_cluster_audit. Shrinks the issues.Provider interface (WarningEvents/KyvernoFindings/KyvernoStatus) and deletes the source-parse plumbing.
- SchedulingGated pods are no longer flagged Unschedulable (gate on reason==Unschedulable, matching the frontend).
- Remove proactive ResourceQuota saturation from the stream: a saturated quota is namespace capacity context, not a live failure; the reactive FailedCreate path and the Namespace quota UI still cover it. This also fixes diagnose over-attributing namespace quota to unrelated workloads.
- Rename diagnose's scheduling response field to startupBlockers (it spans bind-time, admission, and post-bind, not just scheduling).
- Drop the crippled include=logs from get_resource (use get_pod_logs / get_workload_logs / diagnose).
- Refresh docs/mcp.md tool table (was missing 6+ central tools) and fix the "non-destructive" wording: write tools are destructiveHint:true.

NOTE: /api/issues no longer reads source=/include_*; extra query params are ignored (no 400). radar-hub-web should be checked for any fleet view that relied on source=kyverno/include_kyverno.

…ostics

…resh stale docs

Addresses review findings on the issues/scheduling work:
- clusterrole.yaml: add resourcequotas to the core read-only rule. The PR caches + probes ResourceQuota (capabilities.go) but in-cluster installs could not list it, silently hiding the Namespace quota section + the ResourceQuota API/UI for the users who need them.
- get_resource: include=logs was dropped but silently became a no-op; return a logsError pointing to get_pod_logs / get_workload_logs / diagnose so a client on a stale schema is redirected instead of seeing empty success.
- Refresh stale docs/comments the issues refactor missed: issues/types.go (package doc, Severity mapping, Source doc, Issue doc still described the removed event/kyverno sources + source-as-filter), summarycontext.go (referenced removed Filters fields), docs/mcp.md "non-destructive" line, docs/integrations.md (Kyverno-via-/api/issues no longer exists).
- Delete now-dead policy_reports_testhooks.go (its only consumer was the removed issues_handler_test.go).
- Tests: pin the CEL source binding (now the only source-slicing path) and startupBlockersForWorkload workload-scoping (its contract changed).

The origin/main merge (prior commit) absorbs #780, resolving the apparent client.ts cache-seeding revert (merge skew - this branch never touched it).

…nknown include values; document resourcequotas RBAC

Follow-up review findings on the issues/MCP work:
- docs/integrations.md + issues MCP tool description wrongly routed Kyverno PolicyReport findings to the cluster audit (/api/audit + get_cluster_audit). Audit consumes only typed K8s + Crossplane and has zero PolicyReport input. Kyverno surfaces per-resource (PolicyReport detail view + resourceContext policy rollup); say that, and stop pointing agents at a tool that returns nothing for it.
- Issue struct doc: the snapshot-timestamp note was only true for problem/missing_ref/scheduling (LastSeen=compose time); condition rows set both timestamps to the condition's lastTransitionTime. Distinguish the two.
- get_resource: include=logs was guarded, but every OTHER unknown token (typos, or "relationships" which moved to resourceContext) still silently no-op'd. Surface unknown include values via includeError so a token that did nothing is reported, not swallowed.
- README.md + docs/in-cluster.md: document the resourcequotas (and LimitRanges) read grant the chart ClusterRole now carries, so the supported-resources lists match the deployed RBAC.

nadaverell added 5 commits May 26, 2026 00:07

nadaverell requested a review from hisco as a code owner May 25, 2026 21:19