
feat: add system defaults to buildSlurmConf (TCL-5588) #23

Merged

jhu-svg merged 3 commits into slurm-1.0-together-changes from TCL-5588/system-defaults-in-buildSlurmConf on Apr 24, 2026

Conversation


jhu-svg commented Apr 20, 2026

Summary

Add system-critical Slurm defaults to the Slinky operator so every cluster (IC and BM) gets them automatically. These defaults are infrastructure configuration in the same category as AuthType=auth/slurm.

What is added

slurm.conf (buildSlurmConf)

### SYSTEM DEFAULTS ###
UnkillableStepTimeout=600
HealthCheckInterval=60
HealthCheckNodeState=ANY
HealthCheckProgram=/usr/bin/gpu_healthcheck.sh

cgroup.conf (buildCgroupConf)

ConstrainRAMSpace=yes
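
For context, here is a minimal Go sketch of the slurm.conf side. The function and variable names are illustrative, not the operator's actual code; the point is only the ordering: the system defaults are emitted ahead of the user-supplied extra config, so a user value for the same key still wins because Slurm takes the last occurrence of a parameter (as the first commit message below also notes).

```go
package main

import (
	"fmt"
	"strings"
)

// buildSlurmConfSketch illustrates the ordering only: system defaults are
// written before the user's extraConf, so a user-provided value for the same
// key overrides the default (Slurm keeps the last occurrence).
func buildSlurmConfSketch(extraConf string) string {
	var b strings.Builder
	b.WriteString("### SYSTEM DEFAULTS ###\n")
	b.WriteString("UnkillableStepTimeout=600\n")
	b.WriteString("HealthCheckInterval=60\n")
	b.WriteString("HealthCheckNodeState=ANY\n")
	b.WriteString("HealthCheckProgram=/usr/bin/gpu_healthcheck.sh\n")
	b.WriteString("\n### EXTRA CONFIG ###\n")
	b.WriteString(extraConf)
	return b.String()
}

func main() {
	// A user override of HealthCheckInterval takes effect because it comes last.
	fmt.Println(buildSlurmConfSketch("HealthCheckInterval=120\n"))
}
```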

Why

  • UnkillableStepTimeout=600: Prevents "kill task failed" node drains (the default of 60s is too short for GPU job teardown). FA had 20 nodes drain from this.
  • HealthCheckProgram: Auto-resumes nodes drained for "kill task failed" if the GPUs are healthy.
  • ConstrainRAMSpace=yes: Without this, --mem in sbatch is just a scheduling hint. Tangible confirmed that --mem=500GB did not prevent OOM. With this, Slurm enforces memory limits via cgroups (see the sketch after this list).
  • Covers IC and BM: Both use the Slinky operator.
  • No annotation gate needed: The Slinky operator rebuilds slurm.conf natively.
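
A rough sketch of the cgroup.conf side, again with illustrative names rather than the operator's actual buildCgroupConf implementation. The base content passed in is only an example value.

```go
package main

import "fmt"

// buildCgroupConfSketch illustrates the change only: appending
// ConstrainRAMSpace=yes makes Slurm confine each step to a memory cgroup
// sized from its allocation, so a job that exceeds --mem is terminated by
// Slurm instead of tripping the kernel OOM killer.
func buildCgroupConfSketch(base string) string {
	return base + "ConstrainRAMSpace=yes\n"
}

func main() {
	fmt.Print(buildCgroupConfSketch("CgroupPlugin=autodetect\n"))
}
```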

What is NOT hardcoded

  • MemSpecLimit: Node-size-dependent. Set per-cluster via extraConf/UI.

Requires

v1.0.7+ worker images for HealthCheckProgram (/usr/bin/gpu_healthcheck.sh must exist).

QA verification (Apr 18)

Cluster jhu-test-slurm-default-configs on QA (t-c7f12d03). Operator image 1.0.3-tcl5588-test + worker v1.0.7-test.

All 10 checks passed:

 1. slurm.conf has SYSTEM DEFAULTS with all 4 directives
 2. cgroup.conf has ConstrainRAMSpace=yes
 3. scontrol show config confirms all values active
 4. Both partitions alive (slinky + all*), node idle
 5. 17 SSSD packages installed on worker (no crash)
 6. sssd_sudo responder running on worker
 7. /usr/bin/gpu_healthcheck.sh exists with auto-undrain logic
 8. nsswitch.conf has sudoers: files sss
 9. Drain "kill task failed" -> auto-resumed within 60s
10. Drain "maintenance: BIOS upgrade" -> NOT resumed
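
Checks 9 and 10 exercise the auto-undrain behavior described above. The production check is the shell script /usr/bin/gpu_healthcheck.sh shipped in the v1.0.7 worker image; the Go snippet below is only a hedged sketch of the decision it is described as making, with the node name, drain-reason string, and GPU probe all assumed for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// gpusHealthy is a placeholder GPU probe; a real check would inspect
// nvidia-smi (or similar) output in detail.
func gpusHealthy() bool {
	return exec.Command("nvidia-smi").Run() == nil
}

func main() {
	node := "node001" // hypothetical node name
	out, err := exec.Command("scontrol", "show", "node", node).Output()
	if err != nil {
		fmt.Println("scontrol failed:", err)
		return
	}
	info := string(out)
	// Only act on nodes drained for "kill task failed"; drains for other
	// reasons (e.g. "maintenance: BIOS upgrade") are left untouched.
	drainedForKillTask := strings.Contains(info, "DRAIN") &&
		strings.Contains(info, "Kill task failed")
	if drainedForKillTask && gpusHealthy() {
		// Resume the node so it can accept jobs again.
		exec.Command("scontrol", "update", "NodeName="+node, "State=RESUME").Run()
	}
}
```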

Follow-up

After this ships in a new chart version, remove slurmHealthCheckConf() from the cluster operator's buildSlurmExtraConf (currently duplicated there via PR #472).

One-pager: https://www.notion.so/345b878aad1a81148a86c29cdb2c7d3b
Ticket: TCL-5588

jhu-svg added 3 commits April 20, 2026 15:11

Add UnkillableStepTimeout=600, HealthCheckInterval=60,
HealthCheckNodeState=ANY, HealthCheckProgram to the generated
slurm.conf as system defaults. These appear before ### EXTRA CONFIG ###
so user extraConf can override if needed (Slurm uses last value).

Covers both IC and BM clusters since both use the Slinky operator.
No annotation gate needed — Slinky operator rebuilds slurm.conf
natively on any Controller CR change.

Requires v1.0.7+ worker images (gpu_healthcheck.sh must exist at
/usr/bin/gpu_healthcheck.sh). Pre-v1.0.7 workers will log harmless
"HealthCheckProgram not found" warnings.

Made-with: Cursor

Without this, --mem in sbatch is just a scheduling hint — Slurm
doesn't enforce memory limits. With ConstrainRAMSpace=yes, jobs
that exceed their memory allocation get killed by Slurm instead
of triggering the kernel OOM killer.

MemSpecLimit is NOT added as a default because the right value
depends on node memory size (per-cluster tuning via extraConf).

Made-with: Cursor

Prevents OOM → requeue → OOM loops. When a node OOMs and the job
fails, Slurm auto-requeues it by default (JobRequeue=1). The
requeued job runs again, OOMs again, creating a crash loop.

Same category as UnkillableStepTimeout — every cluster should have
this. Placed before EXTRA CONFIG so user can override if needed.

Made-with: Cursor
jhu-svg merged commit e76f20c into slurm-1.0-together-changes on Apr 24, 2026
1 check passed