
feat: add system defaults to buildSlurmConf (TCL-5588) #23

Merged

jhu-svg merged 3 commits into slurm-1.0-together-changes from TCL-5588/system-defaults-in-buildSlurmConf on Apr 24, 2026

Conversation


jhu-svg commented Apr 20, 2026

Summary

Add system-critical Slurm defaults to the Slinky operator so every cluster (IC and BM) gets them automatically. These defaults are infrastructure configuration in the same category as AuthType=auth/slurm.

What is added

slurm.conf (buildSlurmConf)

### SYSTEM DEFAULTS ###
UnkillableStepTimeout=600
HealthCheckInterval=60
HealthCheckNodeState=ANY
HealthCheckProgram=/usr/bin/gpu_healthcheck.sh

cgroup.conf (buildCgroupConf)

ConstrainRAMSpace=yes
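
For context, here is a minimal Go sketch of the slurm.conf side. The function and variable names are illustrative, not the operator's actual code; the point is only the ordering: the system defaults are emitted ahead of the user-supplied extra config, so a user value for the same key still wins because Slurm takes the last occurrence of a parameter (as the first commit message below also notes).

```go
package main

import (
	"fmt"
	"strings"
)

// buildSlurmConfSketch illustrates the ordering only: system defaults are
// written before the user's extraConf, so a user-provided value for the same
// key overrides the default (Slurm keeps the last occurrence).
func buildSlurmConfSketch(extraConf string) string {
	var b strings.Builder
	b.WriteString("### SYSTEM DEFAULTS ###\n")
	b.WriteString("UnkillableStepTimeout=600\n")
	b.WriteString("HealthCheckInterval=60\n")
	b.WriteString("HealthCheckNodeState=ANY\n")
	b.WriteString("HealthCheckProgram=/usr/bin/gpu_healthcheck.sh\n")
	b.WriteString("\n### EXTRA CONFIG ###\n")
	b.WriteString(extraConf)
	return b.String()
}

func main() {
	// A user override of HealthCheckInterval takes effect because it comes last.
	fmt.Println(buildSlurmConfSketch("HealthCheckInterval=120\n"))
}
```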

Why

  • UnkillableStepTimeout=600: Prevents "kill task failed" node drains (the default of 60s is too short for GPU job teardown). FA had 20 nodes drain from this.
  • HealthCheckProgram: Auto-resumes nodes drained for "kill task failed" if the GPUs are healthy.
  • ConstrainRAMSpace=yes: Without this, --mem in sbatch is just a scheduling hint. Tangible confirmed that --mem=500GB did not prevent OOM. With this, Slurm enforces memory limits via cgroups (see the sketch after this list).
  • Covers IC and BM: Both use the Slinky operator.
  • No annotation gate needed: The Slinky operator rebuilds slurm.conf natively.
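
A rough sketch of the cgroup.conf side, again with illustrative names rather than the operator's actual buildCgroupConf implementation. The base content passed in is only an example value.

```go
package main

import "fmt"

// buildCgroupConfSketch illustrates the change only: appending
// ConstrainRAMSpace=yes makes Slurm confine each step to a memory cgroup
// sized from its allocation, so a job that exceeds --mem is terminated by
// Slurm instead of tripping the kernel OOM killer.
func buildCgroupConfSketch(base string) string {
	return base + "ConstrainRAMSpace=yes\n"
}

func main() {
	fmt.Print(buildCgroupConfSketch("CgroupPlugin=autodetect\n"))
}
```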

What is NOT hardcoded

  • MemSpecLimit: Node-size-dependent. Set per-cluster via extraConf/UI.

Requires

v1.0.7+ worker images for HealthCheckProgram (/usr/bin/gpu_healthcheck.sh must exist).

QA verification (Apr 18)

Cluster jhu-test-slurm-default-configs on QA (t-c7f12d03). Operator image 1.0.3-tcl5588-test + worker v1.0.7-test.

All 10 checks passed:

 1. slurm.conf has SYSTEM DEFAULTS with all 4 directives
 2. cgroup.conf has ConstrainRAMSpace=yes
 3. scontrol show config confirms all values active
 4. Both partitions alive (slinky + all*), node idle
 5. 17 SSSD packages installed on worker (no crash)
 6. sssd_sudo responder running on worker
 7. /usr/bin/gpu_healthcheck.sh exists with auto-undrain logic
 8. nsswitch.conf has sudoers: files sss
 9. Drain "kill task failed" -> auto-resumed within 60s
10. Drain "maintenance: BIOS upgrade" -> NOT resumed
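
Checks 9 and 10 exercise the auto-undrain behavior described above. The production check is the shell script /usr/bin/gpu_healthcheck.sh shipped in the v1.0.7 worker image; the Go snippet below is only a hedged sketch of the decision it is described as making, with the node name, drain-reason string, and GPU probe all assumed for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// gpusHealthy is a placeholder GPU probe; a real check would inspect
// nvidia-smi (or similar) output in detail.
func gpusHealthy() bool {
	return exec.Command("nvidia-smi").Run() == nil
}

func main() {
	node := "node001" // hypothetical node name
	out, err := exec.Command("scontrol", "show", "node", node).Output()
	if err != nil {
		fmt.Println("scontrol failed:", err)
		return
	}
	info := string(out)
	// Only act on nodes drained for "kill task failed"; drains for other
	// reasons (e.g. "maintenance: BIOS upgrade") are left untouched.
	drainedForKillTask := strings.Contains(info, "DRAIN") &&
		strings.Contains(info, "Kill task failed")
	if drainedForKillTask && gpusHealthy() {
		// Resume the node so it can accept jobs again.
		exec.Command("scontrol", "update", "NodeName="+node, "State=RESUME").Run()
	}
}
```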

Follow-up

After this ships in a new chart version, remove slurmHealthCheckConf() from the cluster operator's buildSlurmExtraConf (currently duplicated there via PR #472).

One-pager: https://www.notion.so/345b878aad1a81148a86c29cdb2c7d3b
Ticket: TCL-5588

jhu-svg added 3 commits April 20, 2026 15:11

Add UnkillableStepTimeout=600, HealthCheckInterval=60,
HealthCheckNodeState=ANY, HealthCheckProgram to the generated
slurm.conf as system defaults. These appear before ### EXTRA CONFIG ###
so user extraConf can override if needed (Slurm uses last value).

Covers both IC and BM clusters since both use the Slinky operator.
No annotation gate needed — Slinky operator rebuilds slurm.conf
natively on any Controller CR change.

Requires v1.0.7+ worker images (gpu_healthcheck.sh must exist at
/usr/bin/gpu_healthcheck.sh). Pre-v1.0.7 workers will log harmless
"HealthCheckProgram not found" warnings.

Made-with: Cursor

Without this, --mem in sbatch is just a scheduling hint — Slurm
doesn't enforce memory limits. With ConstrainRAMSpace=yes, jobs
that exceed their memory allocation get killed by Slurm instead
of triggering the kernel OOM killer.

MemSpecLimit is NOT added as a default because the right value
depends on node memory size (per-cluster tuning via extraConf).

Made-with: Cursor

Prevents OOM → requeue → OOM loops. When a node OOMs and the job
fails, Slurm auto-requeues it by default (JobRequeue=1). The
requeued job runs again, OOMs again, creating a crash loop.

Same category as UnkillableStepTimeout — every cluster should have
this. Placed before EXTRA CONFIG so user can override if needed.

Made-with: Cursor
jhu-svg merged commit e76f20c into slurm-1.0-together-changes on Apr 24, 2026
1 check passed