
TCL-5951: Reconcile orphaned Slurm node registrations #26

Merged
jhu-svg merged 4 commits into jhu/tcl-5951-add-delete-node-interface from jhu/tcl-5951-reconcile-orphaned-slurm-nodes
May 8, 2026

@jhu-svg jhu-svg commented May 4, 2026

Summary

  • Adds DeleteOrphanedNodes to SlurmControlInterface — lists all Slurm nodes, compares against current pods, deletes entries with no matching pod
  • Inserts syncOrphanedSlurmNodes step in the sync() chain, after cache refresh and before syncNodeSet (scale decisions)
  • Fixes ghost scontrol entries that persist when pods terminate without running PreStop (force-delete, OOM, node crash)
  • Uses the NodeSet's hostname template prefix (e.g. slinky-) to match nodes, not nodeset.Name — critical fix found during staging E2E
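The orphan check in the summary above can be sketched as a set difference between Slurm's node list and the current pod hostnames. This is a minimal illustration, not the operator's code: `NodeController`, `fakeController`, `deleteOrphanedNodes`, and the `belongs` predicate are hypothetical names, and the real `SlurmControlInterface` method signatures are not shown in this PR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// NodeController is a hypothetical stand-in for the Slurm-facing part of
// SlurmControlInterface; the real method set is an assumption here.
type NodeController interface {
	ListNodes() ([]string, error)
	DeleteNode(name string) error
}

// deleteOrphanedNodes deletes Slurm nodes that belong to this NodeSet
// (according to belongs) but have no pod with the same hostname.
// It returns the deleted names in sorted order.
func deleteOrphanedNodes(c NodeController, podHostnames []string, belongs func(string) bool) ([]string, error) {
	live := make(map[string]struct{}, len(podHostnames))
	for _, h := range podHostnames {
		live[h] = struct{}{}
	}
	nodes, err := c.ListNodes()
	if err != nil {
		return nil, fmt.Errorf("list slurm nodes: %w", err)
	}
	var deleted []string
	for _, n := range nodes {
		if !belongs(n) {
			continue // another NodeSet's node: never touch it
		}
		if _, ok := live[n]; ok {
			continue // a pod still backs this registration
		}
		if err := c.DeleteNode(n); err != nil {
			return deleted, fmt.Errorf("delete slurm node %q: %w", n, err)
		}
		deleted = append(deleted, n)
	}
	sort.Strings(deleted)
	return deleted, nil
}

// fakeController is an in-memory NodeController used only for this sketch.
type fakeController struct{ nodes map[string]bool }

func (f *fakeController) ListNodes() ([]string, error) {
	out := make([]string, 0, len(f.nodes))
	for n := range f.nodes {
		out = append(out, n)
	}
	sort.Strings(out)
	return out, nil
}

func (f *fakeController) DeleteNode(name string) error {
	if !f.nodes[name] {
		return fmt.Errorf("node %q not found", name)
	}
	delete(f.nodes, name)
	return nil
}

func main() {
	c := &fakeController{nodes: map[string]bool{
		"slinky-0": true, "slinky-1": true, "slinky-99": true, "bar-0": true,
	}}
	belongs := func(n string) bool { return strings.HasPrefix(n, "slinky-") }
	deleted, err := deleteOrphanedNodes(c, []string{"slinky-0", "slinky-1"}, belongs)
	if err != nil {
		panic(err)
	}
	fmt.Println(deleted) // prints [slinky-99]
}
```

Passing the membership test in as a predicate keeps the set difference separate from the naming rule, which is the part that changed during staging.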

Context

Part of TCL-5951 (Slinky operator lifecycle gaps). This is PR 2 of 3, stacked on PR #25:

  1. PR #25 (TCL-5951: Add DeleteNode to SlurmControlInterface): adds DeleteNode capability (must merge first)
  2. This PR — uses DeleteNode to clean up ghosts in the reconciliation loop
  3. TCCO PVC remediation — detects PVCs bound to occupied (not just missing) nodes

Why this matters

During the FA cluster incident (May 2026), 6 ghost entries accumulated after hardware failures. The operator had no fallback when PreStop didn't run: the only scontrol delete call was inside the dying pod. Ghost entries cascade into parking spot collisions (Gap 2) and wrong AHC lookups (Gap 3).

Test plan

  • Unit tests: 12 Ginkgo specs pass (orphan deleted, no-op, cross-NodeSet, prefix-overlap, hostname-template)
  • go build ./... passes
  • Staging E2E on jhu-test-slurm-slinky-gap-orphaned-node (2-node H100 cluster, staging s2-us-central-8a):
    • Injected slinky-99 → deleted by operator within one reconcile
    • Injected 3 orphans at once (slinky-5, -10, -42) → all deleted in single cycle
    • Active slinky-0 and slinky-1 → never touched across all tests
    • bar-0 (different NodeSet) → NOT deleted
    • slurm-worker-slinky-99 (doesn't match hostname prefix) → NOT deleted
    • slurm-worker-slinky-extra-0 (non-integer suffix) → NOT deleted
    • Force-delete both workers simultaneously → no crash, clean recovery
    • Operator stable (1/1 Running) throughout all tests
  • Bug found & fixed in staging: isNodeFromNodeSet was using nodeset.Name as prefix but real clusters use the hostname template (slinky-). Fixed with nodeNamePrefixForNodeSet() that reads nodeset.Spec.Template.PodSpecWrapper.Hostname.
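The name-matching rule exercised by the staging cases above (hostname-template prefix plus an integer ordinal) can be sketched as follows. This is an illustration under assumptions, not the merged code: `isNodeFromPrefix` and `nodeNamePrefix` are hypothetical names, and the fallback to the NodeSet name plus `-` when no hostname template is set is a guess based on the earlier commit's behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeNamePrefix picks the prefix used for Slurm node names.
// Assumption: a non-empty hostname template (e.g. "slinky-") wins;
// otherwise fall back to the NodeSet name plus "-".
func nodeNamePrefix(nodeSetName, hostnameTemplate string) string {
	if hostnameTemplate != "" {
		return hostnameTemplate
	}
	return nodeSetName + "-"
}

// isNodeFromPrefix reports whether name is exactly prefix followed by a
// canonical non-negative decimal ordinal, such as "slinky-" + "12".
// The Itoa round trip rejects signs, leading zeros, and empty suffixes.
func isNodeFromPrefix(name, prefix string) bool {
	if !strings.HasPrefix(name, prefix) {
		return false
	}
	suffix := strings.TrimPrefix(name, prefix)
	n, err := strconv.Atoi(suffix)
	return err == nil && n >= 0 && strconv.Itoa(n) == suffix
}

func main() {
	prefix := nodeNamePrefix("slurm-worker-slinky", "slinky-")
	for _, name := range []string{
		"slinky-0", "slinky-99", "bar-0",
		"slurm-worker-slinky-99", "slinky-extra-0",
	} {
		fmt.Printf("%-24s %v\n", name, isNodeFromPrefix(name, prefix))
	}
}
```

Requiring the full `prefix + ordinal` shape, rather than a bare prefix match, is what keeps a NodeSet whose name merely starts with the same string from being swept up.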

jhu-svg and others added 4 commits May 4, 2026 16:28
Add DeleteOrphanedNodes to the sync loop so ghost scontrol entries
are cleaned up automatically. When a worker pod terminates without
running its PreStop hook (force-delete, OOM, node crash), its Slurm
node registration persists forever. This step compares the Slurm node
list against current pods and deletes entries with no matching pod.

Runs after RefreshNodeCache (cache is fresh) and before syncNodeSet
(scale decisions) so the operator doesn't count ghosts when deciding
replica count.

DeleteOrphanedNodes previously listed all Slurm nodes from the
controller and deleted any without a matching pod in the current
NodeSet's pod list. If multiple NodeSets share a controller, this
would incorrectly delete other NodeSets' valid nodes.

Filter by nodeNamePrefix (nodeset.Name + "-") so only nodes belonging
to the reconciling NodeSet are considered for deletion.

Only delete Slurm nodes that match the current NodeSet's exact ordinal naming pattern so prefix-overlapping NodeSets cannot be touched.

Co-authored-by: Cursor <cursoragent@cursor.com>

The NodeSet hostname template (e.g. "slinky-") determines the actual
Slurm node names ("slinky-0"), not the NodeSet name ("slurm-worker-slinky").
Use the template prefix so orphan detection works on real clusters.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jhu-svg jhu-svg merged commit 3eb4716 into jhu/tcl-5951-add-delete-node-interface May 8, 2026
1 check passed