feat: add system defaults to buildSlurmConf (TCL-5588)#23
Merged
jhu-svg merged 3 commits intoApr 24, 2026
Merged
Conversation
Add UnkillableStepTimeout=600, HealthCheckInterval=60, HealthCheckNodeState=ANY, HealthCheckProgram to the generated slurm.conf as system defaults. These appear before ### EXTRA CONFIG ### so user extraConf can override if needed (Slurm uses last value). Covers both IC and BM clusters since both use the Slinky operator. No annotation gate needed — Slinky operator rebuilds slurm.conf natively on any Controller CR change. Requires v1.0.7+ worker images (gpu_healthcheck.sh must exist at /usr/bin/gpu_healthcheck.sh). Pre-v1.0.7 workers will log harmless "HealthCheckProgram not found" warnings. Made-with: Cursor
Without this, --mem in sbatch is just a scheduling hint — Slurm doesn't enforce memory limits. With ConstrainRAMSpace=yes, jobs that exceed their memory allocation get killed by Slurm instead of triggering the kernel OOM killer. MemSpecLimit is NOT added as a default because the right value depends on node memory size (per-cluster tuning via extraConf). Made-with: Cursor
Prevents OOM → requeue → OOM loops. When a node OOMs and the job fails, Slurm auto-requeues it by default (JobRequeue=1). The requeued job runs again, OOMs again, creating a crash loop. Same category as UnkillableStepTimeout — every cluster should have this. Placed before EXTRA CONFIG so user can override if needed. Made-with: Cursor
eb3095
approved these changes
Apr 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add system-critical Slurm defaults to the Slinky operator so every cluster (IC and BM) gets them automatically. These are infrastructure config in the same category as AuthType=auth/slurm.
What is added
slurm.conf (buildSlurmConf)
cgroup.conf (buildCgroupConf)
Why
What is NOT hardcoded
Requires
v1.0.7+ worker images for HealthCheckProgram (/usr/bin/gpu_healthcheck.sh must exist).
QA verification (Apr 18)
Cluster jhu-test-slurm-default-configs on QA (t-c7f12d03). Operator image 1.0.3-tcl5588-test + worker v1.0.7-test.
Follow-up
After this ships in a new chart version, remove slurmHealthCheckConf() from the cluster operator buildSlurmExtraConf (currently duplicated via PR #472).
One-pager: https://www.notion.so/345b878aad1a81148a86c29cdb2c7d3b
Ticket: TCL-5588