feat: pod-startup diagnostics + make issues mean "what's broken right now" #775

Merged
nadaverell merged 17 commits into main from feat/scheduling-diagnostics
May 26, 2026
@nadaverell nadaverell commented May 25, 2026

Two coupled changes. (1) Makes pod-startup failures a first-class signal — radar's biggest blind spot. (2) Tightens issues into a single curated "what's broken right now" stream and stops exporting detector taxonomy as an API/agent knob.

1. Pod-startup diagnostics

Unlike state problems (CrashLoop, OOM) or reference problems (missing PVC), the why of a pod that can't reach Running is diffuse — spread across the Pod's PodScheduled condition, a sibling ReplicaSet's FailedCreate, and the join between a pod's constraints and the fleet's node labels. The scheduler already did the analysis; it just hands it back as one opaque string. We parse it and (the differentiator) resolve "didn't match node affinity/selector" to the specific offending label by joining the node cache:

Unschedulable: no node has kubernetes.io/arch=arm64 — 2 node(s) carry [amd64] (0/2 available)
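
The join described above can be sketched roughly as follows. This is a minimal illustration, not the code in internal/k8s/scheduling.go: `explainSelectorMiss` and its signature are hypothetical, it only handles a plain nodeSelector (no affinity expressions), and it checks each selector key independently rather than requiring one node to satisfy all keys.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// explainSelectorMiss is a hypothetical helper. Given a pod's nodeSelector and
// the label maps of the cached nodes, it names the first selector key=value
// that no node satisfies and lists the values nodes carry for that key.
// It returns "" when every selector entry is matched by at least one node.
func explainSelectorMiss(selector map[string]string, nodes []map[string]string) string {
	// Sort keys so the result does not depend on map iteration order.
	keys := make([]string, 0, len(selector))
	for k := range selector {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		want := selector[k]
		matched := false
		carriers := 0
		carried := map[string]bool{}
		for _, labels := range nodes {
			v, ok := labels[k]
			if !ok {
				continue
			}
			carriers++
			carried[v] = true
			if v == want {
				matched = true
			}
		}
		if matched {
			continue
		}
		vals := make([]string, 0, len(carried))
		for v := range carried {
			vals = append(vals, v)
		}
		sort.Strings(vals)
		return fmt.Sprintf("no node has %s=%s; %d node(s) carry [%s]",
			k, want, carriers, strings.Join(vals, ", "))
	}
	return ""
}

func main() {
	nodes := []map[string]string{
		{"kubernetes.io/arch": "amd64"},
		{"kubernetes.io/arch": "amd64"},
	}
	sel := map[string]string{"kubernetes.io/arch": "arm64"}
	// Prints: no node has kubernetes.io/arch=arm64; 2 node(s) carry [amd64]
	fmt.Println(explainSelectorMiss(sel, nodes))
}
```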

Three layers (internal/k8s/scheduling.go, ~30 tests): bind-time (PodScheduled=False + node-fit resolver; SchedulingGated correctly excluded), admission (controller FailedCreate: quota / LimitRange / PodSecurity / webhook — the layer with no Pod to inspect, latest-blocker-wins dedup), post-bind (ContainerCreating → CNI IP-exhaustion / volume attach-mount, latest-event-wins).
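
As an illustration of the admission layer, a FailedCreate message could be bucketed by substring along these lines. This is a sketch under assumptions: `classifyAdmissionMessage` is a hypothetical name, only QuotaExceeded appears as a reason keyword in this PR (the other returned strings are made up for the example), and the matched substrings are the wording the API server typically uses for these rejections.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyAdmissionMessage is a hypothetical classifier for the message of a
// controller's FailedCreate event. It returns an illustrative reason keyword.
func classifyAdmissionMessage(msg string) string {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "exceeded quota"):
		return "QuotaExceeded"
	case strings.Contains(m, "violates podsecurity"):
		return "PodSecurity" // illustrative keyword, not from the PR
	case strings.Contains(m, "admission webhook"):
		return "WebhookDenied" // illustrative keyword, not from the PR
	case strings.Contains(m, "usage per container"), strings.Contains(m, "usage per pod"):
		return "LimitRange" // illustrative keyword, not from the PR
	default:
		return "AdmissionFailed" // fallback for unrecognized rejections
	}
}

func main() {
	msgs := []string{
		`pods "web-abc" is forbidden: exceeded quota: compute, requested: cpu=1, used: cpu=4, limited: cpu=4`,
		`pods "web-abc" is forbidden: maximum cpu usage per Container is 500m, but limit is 1`,
		`pods "web-abc" is forbidden: violates PodSecurity "restricted:latest": privileged`,
		`admission webhook "validate.example.com" denied the request: image not allowed`,
	}
	for _, m := range msgs {
		fmt.Println(classifyAdmissionMessage(m))
	}
}
```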

Plumbed end-to-end: ResourceQuota added to the typed informer cache (browsable, RBAC-probed) and to the Helm ClusterRole's read-only grant (self-hosted/Cloud installs need this on upgrade or the quota section silently no-ops); scheduling issue source; MCP issues + diagnose (startupBlockers section) + dashboard; UI Pod banner / Namespace quota bars / topology tooltips.

2. issues = "what's broken right now"

issues is now one curated operational stream (problems + dangling refs + pod-startup blockers + False CRD conditions), severity-ranked. Detection provenance is no longer a filter axis:

  • Dropped source= from /api/issues + the MCP tool — source survives only as an output label + CEL binding.
  • Removed event + kyverno from composition (and include_events/include_kyverno). Raw events → get_events/timeline; policy posture → get_cluster_audit.
  • Removed proactive quota saturation from the stream — it's namespace capacity context, not a live failure (reactive FailedCreate + the Namespace quota UI still cover it).
  • get_resource drops the crippled include=logs; docs/mcp.md tool table refreshed; write tools documented as destructiveHint: true.

Verified live

Against radar-test-nonprod with induced fixtures (arm64 nodeSelector + near-full quota): /api/issues, /api/dashboard, /api/resources/resourcequotas correct. Live run caught a real bug (/api/dashboard is a separate builder from MCP get_dashboard and initially double-surfaced pods — fixed).

Test plan

go build ./... + pkg/k8score clean; go test ./internal/{k8s,issues,mcp,server}/ + pkg/k8score pass; make tsc clean; live API verification.

Notes / follow-ups

  • Hub: /api/issues no longer reads source=/include_* (extra params ignored, no 400). radar-hub-web should be checked for any fleet view that relied on source=kyverno/include_kyverno.
  • Deferred: Kyverno's aggregated home (fold into get_cluster_audit vs a compliance view — open decision; per-resource renderer is the interim); MCP tool consolidation (deemed low-value).
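
For hub-side callers, the "extra params ignored, no 400" behavior amounts to the handler never reading the removed parameters. A minimal sketch, assuming a hypothetical `handleIssues` handler and an assumed `namespace` parameter that is still read (neither is taken from the real server package):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// handleIssues is a hypothetical stand-in for the /api/issues handler.
// It reads only the parameters it still understands; source= and include_*
// are never inspected, so a stale client that sends them is not rejected.
func handleIssues(w http.ResponseWriter, r *http.Request) {
	ns := r.URL.Query().Get("namespace") // assumed example of a surviving parameter
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]any{
		"namespace": ns,
		"issues":    []string{},
	})
}

// statusFor runs one GET request through the handler and returns the status code.
func statusFor(url string) int {
	rec := httptest.NewRecorder()
	handleIssues(rec, httptest.NewRequest(http.MethodGet, url, nil))
	return rec.Code
}

func main() {
	// Prints 200: the removed parameters are ignored rather than rejected.
	fmt.Println(statusFor("/api/issues?source=kyverno&include_kyverno=true"))
}
```

The trade-off is that a stale caller gets a successful response with a different row set than it expected, which is why the radar-hub-web check above matters.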

Note

Medium Risk
Behavioral breaking changes for API/MCP clients that relied on source=, Kyverno/events in issues, or get_resource include=logs; core triage paths and dashboard issues composition were touched across server, MCP, and issues packages.

Overview
This PR couples pod-startup diagnostics with a tighter definition of what issues means for operators and AI agents.

Pod-startup failures get a dedicated scheduling path in internal/k8s/scheduling.go: bind-time unschedulable pods (scheduler messages parsed and node affinity/selector misses resolved against the node cache), admission-time FailedCreate blocks (quota, LimitRange, PodSecurity, webhooks), and post-bind stalls (CNI IP exhaustion, volume attach/mount). Generic problem detection skips unschedulable pods; richer scheduling rows win over duplicate bare Pending problem rows. The same signals feed dashboard, MCP get_dashboard, diagnose (startupBlockers), and the issues composer.

issues is now “what’s broken right now” only: composition always includes problem, missing_ref, scheduling, and CRD condition sources. source=, include_events / include_kyverno, and warning-event/Kyverno composition are removed; source remains on each row and in CEL filter=. Kyverno PolicyReports are documented as per-resource posture, not the issues stream. get_resource no longer supports include=logs (explicit redirect to log/diagnose tools). MCP/docs now describe write tools as destructiveHint: true.

ResourceQuota is added to the informer cache, list/get APIs, Helm ClusterRole, capabilities probes, and Namespace detail UI (usage bars). README/docs reflect LimitRanges/ResourceQuotas and the issues/MCP contract changes.

Reviewed by Cursor Bugbot for commit e68685b. Bugbot is set up for automated code reviews on this repo.

Decompose why a Pod can't run into structured signals:
- bind-time: PodScheduled=False → parse the scheduler verdict + resolve node
  affinity/selector misses against the node cache, naming the offending label
  (e.g. "no node has kubernetes.io/arch=arm64")
- admission: controller FailedCreate (quota/LimitRange/PodSecurity/webhook) +
  proactive ResourceQuota saturation — the layer with no Pod to inspect
- post-bind: ContainerCreating decoded into CNI IP-exhaustion + volume
  attach/mount, cross-checked against still-stuck pods

Add ResourceQuota to the typed informer cache (mirroring LimitRange) so the
proactive quota read + a browsable ResourceQuota view work. The generic
problem detector now defers unschedulable pods to the scheduling source so
they aren't double-reported as a bare "Pending".

New SourceScheduling composes the three scheduling detectors through the
issues pipeline (default-on, high-signal operational state). /api/issues, the
MCP issues tool, and per-resource summaryContext now surface placement/
admission/post-bind failures, filterable via source=scheduling. ParseSources
accepts the new value; the Provider gains DetectScheduling.

- issues tool: source=scheduling documented and in the default set
- diagnose: a schedulability section scoped to the workload — its unschedulable
  pods, its ReplicaSet's FailedCreate, and any namespace ResourceQuota
  saturation (the one-shot answer for an admission/quota stall)
- get_dashboard: scheduling rows roll into the problem list; admission rows
  have no Pod, so the dashboard pod loop never surfaced them before

- PodRenderer: lead the banner with the decomposed scheduler verdict instead
  of a bare "Unschedulable" (untolerated taints, insufficient resources, and
  affinity/selector misses named). New PodProblem.detail keeps message exact
  so filter-chip matching is unaffected.
- NamespaceRenderer: a ResourceQuota usage section with per-resource
  saturation bars (amber >=90%, red >=100%) — quota pressure was shown nowhere
  despite being exactly why a namespace stops admitting pods. Fetched via a new
  useNamespaceQuotas hook over /api/resources/resourcequotas.
- topology tooltips: scheduling-aware guidance for the new reason keywords
  (Unschedulable, QuotaExceeded, IPExhaustion, VolumeMount/Attach, …).
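
The saturation thresholds in the NamespaceRenderer bullet (amber >=90%, red >=100%) can be expressed as a small pure function. The frontend itself is TypeScript; this Go sketch only illustrates the boundaries, and `quotaTone` and its return strings are hypothetical names.

```go
package main

import "fmt"

// quotaTone is a hypothetical helper mapping quota usage to a display tone.
// It uses integer comparisons so the 90% and 100% boundaries are exact:
// used/hard >= 0.9 is equivalent to used*10 >= hard*9.
func quotaTone(used, hard int64) string {
	if hard <= 0 {
		return "none" // no meaningful ratio for a zero or negative limit
	}
	switch {
	case used >= hard:
		return "red"
	case used*10 >= hard*9:
		return "amber"
	default:
		return "ok"
	}
}

func main() {
	for _, used := range []int64{50, 89, 90, 99, 100} {
		fmt.Printf("%d/100 -> %s\n", used, quotaTone(used, 100))
	}
}
```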

/api/dashboard (the home ProblemsPanel source) is a separate builder from the
MCP get_dashboard one wired earlier — it only gathered DetectProblems +
DetectMissingRefs, so unschedulable pods and quota saturation never reached
the home view. Append the three scheduling detectors directly (bypassing the
Missing-ref Pod filter, since an Unschedulable row is the reason, not a dup).
Verified live: the panel now shows the arch-mismatch Unschedulable row (with
the offending label named) and the 99% QuotaNearLimit row.
@nadaverell nadaverell requested a review from hisco as a code owner May 25, 2026 21:19
Comment thread internal/k8s/scheduling.go
Comment thread internal/k8s/problems.go
@nadaverell
Contributor Author

bugbot run

Comment thread internal/mcp/tools_diagnose.go
- server: route /api/resources/resourcequotas through the typed informer in
  handleListResources + handleGetResource (it fell through to the dynamic
  cache, so the namespace quota UI could read [] on first open before sync).
- scheduling: restrict the quota-pressure check to pod-admission-relevant
  resources (cpu/memory/pods/ephemeral-storage/requests.*/limits.*/PVC) so an
  object-count quota (configmaps/services) no longer shows as "blocks new pods".
- scheduling: cross-check the involved workload's current readiness before
  emitting an admission FailedCreate row — a since-recovered workload no longer
  surfaces as critical off a lingering event.
- dashboard: skip unschedulable pods in the REST rollup (they're owned by the
  scheduling rows) so they don't double-surface; fix the stale comment.
- frontend: thread the namespace quota fetch error through — 403 hides the
  section, but 500/503 now shows a note instead of silently rendering quota-free.
- types: drop dead NodeFacts.Taints/Unschedulable + TaintFact (written, never
  read); document the SchedulingReason union invariant.
- mcp: add scheduling to the issues tool Description defaults + example.
- comments: correct the node-fit resolver doc (no taint cache-join); strip
  external bench scenario name.
- tests: cache-level integration tests for the quota ramp + S1 filter, bind-time
  node-fit naming, and the admission recovered-workload cross-check; ParseSources
  scheduling token; frontend summarizeSchedulerMessage.
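
The pod-admission-relevant filter from the second bullet above might look like the following. The function name matches the one mentioned later in this PR (isPodAdmissionQuotaResource), but the body is a guess reconstructed from the bullet's resource list, not the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// isPodAdmissionQuotaResource reports whether a ResourceQuota resource name
// can plausibly block pod creation. Object-count quotas such as configmaps or
// services return false. The exact set below is an assumption based on the
// list in the commit message.
func isPodAdmissionQuotaResource(name string) bool {
	switch name {
	case "cpu", "memory", "pods", "ephemeral-storage", "persistentvolumeclaims":
		return true
	}
	return strings.HasPrefix(name, "requests.") || strings.HasPrefix(name, "limits.")
}

func main() {
	for _, n := range []string{"requests.cpu", "limits.memory", "pods", "configmaps", "services"} {
		fmt.Printf("%s: %v\n", n, isPodAdmissionQuotaResource(n))
	}
}
```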
Comment thread internal/k8s/scheduling.go
Comment thread internal/k8s/scheduling.go Outdated
…ostics

# Conflicts:
#	internal/issues/issues_test.go
#	internal/mcp/tools.go
…a tones

Replace inline raw red/orange Tailwind in the namespace quota section with the
shared severity-color constants, per the repo styling rule.

- detectAdmissionFailures: dedup FailedCreate rows by involved object. A
  quota-blocked controller emits one event per attempt, each with a different
  generated pod name (distinct cached events), so one workload produced many
  near-identical rows. Now one row per workload.
- admissionTargetStillBlocked: gate on created-count (Status.Replicas /
  CurrentNumberScheduled) below desired, not readiness. A workload whose pods
  were created but stay not-ready for another reason (e.g. unschedulable after
  a quota was raised) is no longer admission-blocked, so a stale FailedCreate
  no longer surfaces a critical QuotaExceeded row.
- admissionTargetStillBlocked (Job): a terminally-failed Job (Failed>0) no
  longer counts as blocked — only a Job that has created nothing (Active,
  Succeeded, Failed all 0) does.
- diagnose schedulingFindingsForWorkload: tighten the Deployment→ReplicaSet
  match to a single hyphen-free hash suffix (isReplicaSetOf), so diagnosing
  "api" no longer claims "api-gateway-<hash>".
- tests: dedup assertion, created-but-not-ready skip, isReplicaSetOf table.
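
The tightened Deployment→ReplicaSet match can be sketched as below. `isReplicaSetOf` is the name used in the commit message; the body is an assumption that follows the stated rule (deployment name, one hyphen, then a hash suffix containing no further hyphens).

```go
package main

import (
	"fmt"
	"strings"
)

// isReplicaSetOf reports whether rsName looks like "<deployment>-<hash>",
// where <hash> is non-empty and contains no hyphen. A sketch of the rule
// described in the commit message, not the real implementation.
func isReplicaSetOf(rsName, deployment string) bool {
	prefix := deployment + "-"
	if !strings.HasPrefix(rsName, prefix) {
		return false
	}
	suffix := strings.TrimPrefix(rsName, prefix)
	return suffix != "" && !strings.Contains(suffix, "-")
}

func main() {
	fmt.Println(isReplicaSetOf("api-7d9f8c6b5", "api"))                 // true
	fmt.Println(isReplicaSetOf("api-gateway-7d9f8c6b5", "api"))         // false
	fmt.Println(isReplicaSetOf("api-gateway-7d9f8c6b5", "api-gateway")) // true
}
```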
…oundary

- Add TestDetectAdmissionProblems_JobAndDaemonSetCrossCheck: a Job that created
  no pod and a partially-scheduled DaemonSet surface QuotaExceeded; a terminally-
  failed Job (Failed>0) and a fully-scheduled DaemonSet are skipped — pins the
  net-new Job (Failed==0) and DaemonSet (CurrentNumberScheduled) cross-check
  branches that the ReplicaSet test didn't exercise.
- Add a below-threshold (50%) quota case to the saturation test so the >=90%
  warn boundary is pinned, not just the >=90%/100% arms.
- Reword the Job cross-check comment to state the true invariant (any of
  Active/Succeeded/Failed > 0 means a pod was created) instead of the
  inaccurate "terminally-failed (backoffLimit)" phrasing; note the mid-retry
  trade-off explicitly.
- Replace the opaque "S1 filter" test comment with the real mechanism name
  (isPodAdmissionQuotaResource).
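
The Job and DaemonSet cross-check branches pinned by this test could be reduced to the following predicates. The struct types here are simplified stand-ins for the batch/v1 JobStatus and apps/v1 DaemonSetStatus fields named above, and the function names are hypothetical.

```go
package main

import "fmt"

// jobStatus mirrors the three batch/v1 JobStatus counters used by the check.
type jobStatus struct {
	Active, Succeeded, Failed int32
}

// daemonSetStatus mirrors the two apps/v1 DaemonSetStatus counters used.
type daemonSetStatus struct {
	DesiredNumberScheduled, CurrentNumberScheduled int32
}

// jobStillBlocked treats a Job as admission-blocked only if it has created
// nothing: any of Active/Succeeded/Failed > 0 means a pod was created.
func jobStillBlocked(s jobStatus) bool {
	return s.Active == 0 && s.Succeeded == 0 && s.Failed == 0
}

// daemonSetStillBlocked treats a DaemonSet as blocked while its created
// count is below the desired count.
func daemonSetStillBlocked(s daemonSetStatus) bool {
	return s.CurrentNumberScheduled < s.DesiredNumberScheduled
}

func main() {
	fmt.Println(jobStillBlocked(jobStatus{}))                                      // true: nothing created
	fmt.Println(jobStillBlocked(jobStatus{Failed: 1}))                             // false: a pod was created
	fmt.Println(daemonSetStillBlocked(daemonSetStatus{DesiredNumberScheduled: 3, CurrentNumberScheduled: 2})) // true
	fmt.Println(daemonSetStillBlocked(daemonSetStatus{DesiredNumberScheduled: 3, CurrentNumberScheduled: 3})) // false
}
```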
… first-seen

FailedCreate events are deduped per involved object, but informer List
order is arbitrary and the active blocker can change within the 30m
window (quota cleared, webhook now rejects). Keep the latest event by
LastTimestamp so the surfaced reason reflects the current cause, not
whichever the cache iterated first. Pin with a quota→webhook test.

Also fix stale source= comments/examples to include scheduling.
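
The latest-event-wins dedup described in this commit can be sketched as a single pass that keeps, per involved object, the event with the greatest LastTimestamp. The `event` struct, `latestPerObject`, and `atMinute` are simplified, hypothetical stand-ins for the cached event type and detector code.

```go
package main

import (
	"fmt"
	"time"
)

// event is a simplified stand-in for a cached FailedCreate event.
type event struct {
	InvolvedUID   string
	Message       string
	LastTimestamp time.Time
}

// atMinute builds a fixed timestamp for examples and tests.
func atMinute(m int) time.Time {
	return time.Date(2026, 5, 26, 12, m, 0, 0, time.UTC)
}

// latestPerObject keeps one event per involved object: the one with the
// latest LastTimestamp, regardless of the order the input is iterated in.
func latestPerObject(events []event) map[string]event {
	latest := make(map[string]event)
	for _, e := range events {
		cur, seen := latest[e.InvolvedUID]
		if !seen || e.LastTimestamp.After(cur.LastTimestamp) {
			latest[e.InvolvedUID] = e
		}
	}
	return latest
}

func main() {
	// The newer webhook rejection is listed first, as an informer might return it.
	events := []event{
		{InvolvedUID: "rs-1", Message: "admission webhook denied the request", LastTimestamp: atMinute(20)},
		{InvolvedUID: "rs-1", Message: "exceeded quota", LastTimestamp: atMinute(5)},
	}
	fmt.Println(latestPerObject(events)["rs-1"].Message) // admission webhook denied the request
}
```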
Comment thread internal/k8s/scheduling.go Outdated
Two Bugbot findings:

- DetectPostBindProblems kept the first qualifying kubelet event per pod
  by informer order, so a stale blocker could win when the cause changed
  (NetworkNotReady → FailedMount). Keep the latest by LastTimestamp,
  mirroring detectAdmissionFailures.

- A pod stuck post-bind surfaced twice in the issues composer: a generic
  problem-source Pending row AND the richer scheduling-source row. Dedup
  in the composer so the scheduling row wins for the same Pod. A plain
  DetectProblems skip can't do this — the problem threshold is 5m but the
  post-bind event window is 10m, so a pod stuck >10m would lose its only
  row.
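
The composer-level dedup from the second finding can be illustrated as follows: drop a problem-source Pod row only when a scheduling-source row exists for the same Pod, so a pod that has aged out of the post-bind window keeps its only row. The `issue` struct and `dedupPreferScheduling` are hypothetical; the source labels "problem" and "scheduling" are the ones named in this PR.

```go
package main

import "fmt"

// issue is a simplified stand-in for a composed issue row.
type issue struct {
	Source, Kind, Namespace, Name, Reason string
}

// dedupPreferScheduling removes problem-source Pod rows that are shadowed by
// a scheduling-source row for the same Pod. All other rows pass through in
// their original order.
func dedupPreferScheduling(rows []issue) []issue {
	owned := make(map[string]bool)
	for _, r := range rows {
		if r.Source == "scheduling" && r.Kind == "Pod" {
			owned[r.Namespace+"/"+r.Name] = true
		}
	}
	out := make([]issue, 0, len(rows))
	for _, r := range rows {
		if r.Source == "problem" && r.Kind == "Pod" && owned[r.Namespace+"/"+r.Name] {
			continue // the richer scheduling row wins
		}
		out = append(out, r)
	}
	return out
}

func main() {
	rows := []issue{
		{Source: "problem", Kind: "Pod", Namespace: "prod", Name: "web-1", Reason: "Pending"},
		{Source: "scheduling", Kind: "Pod", Namespace: "prod", Name: "web-1", Reason: "VolumeMount"},
		{Source: "problem", Kind: "Pod", Namespace: "prod", Name: "web-2", Reason: "Pending"},
	}
	for _, r := range dedupPreferScheduling(rows) {
		fmt.Println(r.Source, r.Name, r.Reason)
	}
}
```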

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit af8ad61.

Comment thread internal/server/dashboard.go
…rface

The /api/dashboard builder is separate from the issues composer: its pod
health rollup flagged long-Pending pods as warnings (skipping only
unschedulable) while also appending post-bind scheduling rows, yielding
two rows for one stuck pod (bare Pending + the richer VolumeMount/CNI
row). Compute the post-bind-owned pod set up front and skip those in the
rollup the same way unschedulable pods are skipped; reuse the slice for
the scheduling append. Gap-free — only pods that actually get a post-bind
row are skipped, so a pod past the 10m post-bind window keeps its rollup
row.
…taxonomy filters

issues is now one curated operational stream — workload/pod problems,
dangling refs, pod-startup blockers, and False CRD conditions — severity
ranked. Detection provenance is no longer a user/agent filter axis:

- Drop the source= filter from /api/issues and the MCP issues tool. source
  survives only as an output label on each row + a CEL filter binding.
- Remove event + kyverno from issue composition entirely (and the
  include_events/include_kyverno flags). Raw events live in get_events /
  the timeline; policy posture lives in get_cluster_audit. Shrinks the
  issues.Provider interface (WarningEvents/KyvernoFindings/KyvernoStatus)
  and deletes the source-parse plumbing.
- SchedulingGated pods are no longer flagged Unschedulable (gate on
  reason==Unschedulable, matching the frontend).
- Remove proactive ResourceQuota saturation from the stream — a saturated
  quota is namespace capacity context, not a live failure; the reactive
  FailedCreate path and the Namespace quota UI still cover it. This also
  fixes diagnose over-attributing namespace quota to unrelated workloads.
- Rename diagnose's scheduling response field to startupBlockers (it spans
  bind-time, admission, and post-bind — not just scheduling).
- Drop the crippled include=logs from get_resource (use get_pod_logs /
  get_workload_logs / diagnose).
- Refresh docs/mcp.md tool table (was missing 6+ central tools) and fix the
  "non-destructive" wording — write tools are destructiveHint:true.

NOTE: /api/issues no longer reads source=/include_*; extra query params are
ignored (no 400). radar-hub-web should be checked for any fleet view that
relied on source=kyverno/include_kyverno.
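
The SchedulingGated exclusion in the list above comes down to gating on the condition's reason, not just its status. A minimal sketch with a simplified `podCondition` struct (a stand-in for the core/v1 PodCondition fields) and a hypothetical `isUnschedulable` helper:

```go
package main

import "fmt"

// podCondition carries the three PodCondition fields the check needs.
type podCondition struct {
	Type, Status, Reason string
}

// isUnschedulable reports true only for PodScheduled=False with
// reason "Unschedulable". A pod held by scheduling gates also has
// PodScheduled=False, but with reason "SchedulingGated", and is excluded.
func isUnschedulable(conds []podCondition) bool {
	for _, c := range conds {
		if c.Type == "PodScheduled" && c.Status == "False" {
			return c.Reason == "Unschedulable"
		}
	}
	return false
}

func main() {
	gated := []podCondition{{Type: "PodScheduled", Status: "False", Reason: "SchedulingGated"}}
	stuck := []podCondition{{Type: "PodScheduled", Status: "False", Reason: "Unschedulable"}}
	fmt.Println(isUnschedulable(gated)) // false
	fmt.Println(isUnschedulable(stuck)) // true
}
```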
@nadaverell nadaverell changed the title: feat: scheduling as a first-class concern (issues + MCP + UI) → feat: pod-startup diagnostics + make issues mean "what's broken right now" (May 26, 2026)
…resh stale docs

Addresses review findings on the issues/scheduling work:

- clusterrole.yaml: add resourcequotas to the core read-only rule. The PR
  caches + probes ResourceQuota (capabilities.go) but in-cluster installs
  could not list it, silently hiding the Namespace quota section + the
  ResourceQuota API/UI for the users who need them.
- get_resource: include=logs was dropped but silently became a no-op; return
  a logsError pointing to get_pod_logs / get_workload_logs / diagnose so a
  client on a stale schema is redirected instead of seeing empty success.
- Refresh stale docs/comments the issues refactor missed: issues/types.go
  (package doc, Severity mapping, Source doc, Issue doc still described the
  removed event/kyverno sources + source-as-filter), summarycontext.go
  (referenced removed Filters fields), docs/mcp.md "non-destructive" line,
  docs/integrations.md (Kyverno-via-/api/issues no longer exists).
- Delete now-dead policy_reports_testhooks.go (its only consumer was the
  removed issues_handler_test.go).
- Tests: pin the CEL source binding (now the only source-slicing path) and
  startupBlockersForWorkload workload-scoping (its contract changed).

The origin/main merge (prior commit) absorbs #780, resolving the apparent
client.ts cache-seeding revert (merge skew - this branch never touched it).
…nknown include values; document resourcequotas RBAC

Follow-up review findings on the issues/MCP work:

- docs/integrations.md + issues MCP tool description wrongly routed Kyverno
  PolicyReport findings to the cluster audit (/api/audit + get_cluster_audit).
  Audit consumes only typed K8s + Crossplane and has zero PolicyReport input.
  Kyverno surfaces per-resource (PolicyReport detail view + resourceContext
  policy rollup); say that, and stop pointing agents at a tool that returns
  nothing for it.
- Issue struct doc: the snapshot-timestamp note was only true for problem/
  missing_ref/scheduling (LastSeen=compose time); condition rows set both
  timestamps to the condition's lastTransitionTime. Distinguish the two.
- get_resource: include=logs was guarded, but every OTHER unknown token
  (typos, or "relationships" which moved to resourceContext) still silently
  no-op'd. Surface unknown include values via includeError so a token that
  did nothing is reported, not swallowed.
- README.md + docs/in-cluster.md: document the resourcequotas (and
  LimitRanges) read grant the chart ClusterRole now carries, so the
  supported-resources lists match the deployed RBAC.
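
The include handling described above (a redirect for logs, an error for any other unknown token) might be sketched like this. `parseInclude`, `includeResult`, the error wording, and the set of known tokens passed in are all hypothetical; only the logsError / includeError field names and the redirect targets come from the commit messages.

```go
package main

import (
	"fmt"
	"strings"
)

// includeResult reports which include tokens were accepted and why others were not.
type includeResult struct {
	Accepted     []string
	LogsError    string
	IncludeError string
}

// parseInclude splits a comma-separated include value. "logs" yields a
// redirect message, tokens in known are accepted, and everything else is
// reported instead of being silently ignored.
func parseInclude(raw string, known map[string]bool) includeResult {
	var res includeResult
	var unknown []string
	for _, tok := range strings.Split(raw, ",") {
		tok = strings.TrimSpace(tok)
		switch {
		case tok == "":
			// Skip empty segments such as a trailing comma.
		case tok == "logs":
			res.LogsError = "include=logs is not supported; use get_pod_logs, get_workload_logs, or diagnose"
		case known[tok]:
			res.Accepted = append(res.Accepted, tok)
		default:
			unknown = append(unknown, tok)
		}
	}
	if len(unknown) > 0 {
		res.IncludeError = "unknown include value(s): " + strings.Join(unknown, ", ")
	}
	return res
}

func main() {
	known := map[string]bool{"events": true} // hypothetical accepted token
	res := parseInclude("events,logs,relationships,evnts", known)
	fmt.Println(res.Accepted)
	fmt.Println(res.LogsError)
	fmt.Println(res.IncludeError)
}
```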
@nadaverell nadaverell merged commit fab0799 into main May 26, 2026
8 checks passed
@nadaverell nadaverell deleted the feat/scheduling-diagnostics branch May 26, 2026 20:58
@nadaverell nadaverell restored the feat/scheduling-diagnostics branch May 29, 2026 23:45
@nadaverell nadaverell deleted the feat/scheduling-diagnostics branch May 30, 2026 00:16