fix: better job status queries for slurm executor #2136
Conversation
- … listing them in comments so it is easily brought back for specific debugging
- …s can also generate timings on other slurm clusters
- …pendent on `max_status_checks_per_second` -- that variable is a rate limit, not a sleep time, and its usage here has caused confusion
- …instead of 0.03 for reasonable checking times
Please format your code with black.
# grep "sacct output" .snakemake/log/2023-02-13T210004.601290.snakemake.log | \ | ||
# awk '{ counter += 1; sum += $6; sum_of_squares += ($6)^2 } \ | ||
# END { print "average: ",sum/counter," sd: ",sqrt((sum_of_squares - sum^2/counter) / counter); } |
This is quick code to get the average and standard deviation of the `sacct` timings that this PR introduces (along with the respective `scontrol` timings) into the debugging output. If others could use it to also provide timings, this could help decide whether `scontrol` is the better default for checking the status of `active_jobs`.
I have just highlighted the timing code in the comments. Also, I have just asked on the slurm-users mailing list whether `sacct` or `scontrol` is the preferable status query command from the cluster's perspective.
From the slurmctld's perspective, sacct is great. Because it's querying the slurmdbd instead of the slurmctld. :) Note that there are minor delays in propagating job status info to the slurmdbd, and you're probably better off getting this status from slurmctld in most instances. I haven't looked in depth at what you're doing, so apologies if this doesn't line up well, but: a single 'scontrol show job' (or an equivalent squeue command with appropriate filtering options set) that grabs the entire job queue, rather than 10s to 1000s of individual 'scontrol show job <jobid>' calls, is preferred from the slurmctld's perspective. The trade-offs are around central job lock contention, and a single RPC (although grabbing the entire queue state rather than the state of one individual job) vs. a slew of them is vastly preferred as it limits the lock contention.
Thanks already for the insights; this confirms what I have also been hearing on the `slurm-users` mailing list.
So I guess I'll use this PR to also address this, changing the code to do batch requests.
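In code, the batching boils down to something like the following minimal sketch (not the executor's actual code; the function and variable names are made up): one `squeue` call covering all active job IDs instead of one `scontrol show job <jobid>` call per job.

```python
import subprocess


def batched_job_states(job_ids):
    """Query the state of many jobs with a single squeue call instead of one
    scontrol call per job, keeping slurmctld lock contention low."""
    result = subprocess.run(
        ["squeue", "--noheader", "--format=%i|%T", "--jobs", ",".join(job_ids)],
        capture_output=True, text=True, check=True,
    )
    states = {}
    for line in result.stdout.splitlines():
        job_id, state = line.split("|", maxsplit=1)
        states[job_id] = state
    return states
```

One caveat (and part of the motivation for later preferring `sacct`): jobs that have already left the queue no longer show up in `squeue` output.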
Just documenting some further context I gathered from the very helpful `slurm-users` mailing list replies, in particular regarding the `max_status_checks_per_second` defaults.
For (3), note that the job name filtering for squeue is handled entirely within the client command. It won't make a difference in terms of the load on slurmctld. (For sacct, the SlurmDBD does use the name as part of the underlying MySQL query to limit the returned results.) Both squeue and 'scontrol show job' are using the same underlying API calls (slurm_load_jobs), so I'd encourage you to use whichever command is more convenient for you to parse and manipulate.
Many thanks again, @wickberg, that's useful info for the implementation.
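To illustrate the `--name`-based filtering: if all jobs of a run share one job name, a single `sacct` call retrieves their states while the filtering happens inside the slurmdbd database query. This is only a sketch under that naming assumption (the helper name is hypothetical, not the executor's actual code):

```python
import subprocess


def states_for_run(run_name):
    """Return {job_id: state} for all jobs submitted with --job-name run_name."""
    result = subprocess.run(
        ["sacct", "--noheader", "--parsable2", "--allocations",
         "--name", run_name, "--format", "JobID,State"],
        capture_output=True, text=True, check=True,
    )
    return dict(
        line.split("|", maxsplit=1)
        for line in result.stdout.splitlines()
        if "|" in line
    )
```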
Awesome guys! I am very much looking forward to the outcome of this PR! Just keep going!
Kudos, SonarCloud Quality Gate passed! 0 bugs, no coverage information.
That was me being logged into the wrong account, sorry :-)
🤖 I have created a release *beep* *boop*

## [7.24.1](v7.24.0...v7.24.1) (2023-03-09)

### Bug Fixes

* better job status queries for slurm executor ([#2136](#2136)) ([a4df38c](a4df38c))
* get python version for script environment in a backwards compatible way that works down to python 2.7 ([#2161](#2161)) ([44e59b9](44e59b9))
* prevents DeprecationWarning caused by using old draft of json schema ([#2152](#2152)) ([9791ffb](9791ffb))

### Performance Improvements

* Gfal2 remote provider using gfal2-python instead of gfal2-utils. ([#2128](#2128)) ([0b9bfe5](0b9bfe5))

### Documentation

* fix minor typos in a linting rule ([#2162](#2162)) ([71e1171](71e1171))

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

This is a follow-up pull request to #2136, providing a more thorough solution to putting less strain on the slurm cluster system with job status queries. The solution suggested here:

* assigns the same UUID4 as the job name during `sbatch` job submission for all jobs during a run of snakemake -- this goes back to [an idea by Sean Maxwell on the `slurm-users` mailing list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009713.html)
* uses `sacct` as the job status query command, as this [will put less strain on `slurmctld` and instead query the `slurmdbd`](#2136 (comment)), and we assume that any reasonable cluster setup will have at least basic [job accounting](https://slurm.schedmd.com/accounting.html) set up
* uses `sacct --name <uuid4>` to make more efficient database queries to `slurmdbd`, according to [a suggestion by @wickberg](#2136 (comment))
* adaptively adjusts the minimum wait time between job status query calls (although this only applies to retries when previously seen jobs are somehow not present -- this usually shouldn't happen, and usually queries are spaced by at least 20 seconds, see below), setting it to the maximum (longer waiting time period) of either (this also goes back to [an idea by Sean Maxwell on the `slurm-users` mailing list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009713.html)):
  * the specified rate limit (default now set to 2 seconds between calls, but users can set this differently), or
  * 5 times the query time of the last call to `sacct`
* increases the sleep time between checking on all the running jobs by 10 seconds every time no job had finished at the last check; resets it to a minimum wait time of 20 seconds once a finishing job was detected; and caps it at a maximum sleep time of 180 seconds (see the sketch below)

The solution explicitly DOES NOT use:

* `squeue --noheader --format="%F|%T" --name <uuid4> --jobs <job_ids_of_jobs_with_no_sacct_status>` as the fallback job status query command any more, because we assume basic accounting is always set up on a cluster where workflows would be run (it would be better than `scontrol show job`, because it can also do the `--name` based filtering)
* either the `--yaml` or `--json` flags of slurm commands, as these seem to require (optional) plugins to be installed and they weren't on the system I tested -- so we cannot assume these to usually work. (This came up as an [idea by Bas van der Vlies on the `slurm-users` mailing list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009728.html).)
* some kind of mechanism to wait for jobs finishing (this came up as an [idea by Brian Andrus on the `slurm-users` mailing list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009724.html)), because the only easy solution I saw was to have [`sbatch --wait`](https://slurm.schedmd.com/sbatch.html#OPT_wait) until a job finishes and only return then. However, this would keep a thread running for every job currently active, likely overstraining the login node in cases of high parallelism in the workflow (or oversubmissions).

### QC

<!-- Make sure that you can tick the boxes below. -->

* [ ] The PR contains a test case for the changes or the changes are already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).
Co-authored-by: David Laehnemann <david.laehnemann@helmholtz-hzi.de>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>
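As a rough sketch of the adaptive waiting scheme described in that follow-up, with the constants taken from its description (2 s rate limit, 5x the last `sacct` query time, 20 s base sleep, +10 s per idle round, 180 s cap); names and structure are illustrative, not the actual executor code:

```python
import time
import uuid

RUN_UUID = str(uuid.uuid4())  # passed as --job-name to every sbatch call of this run

RATE_LIMIT_SECONDS = 2   # default minimum spacing between sacct retry calls
MIN_SLEEP = 20           # base sleep between full status rounds
SLEEP_INCREMENT = 10     # added whenever a round saw no newly finished job
MAX_SLEEP = 180          # cap on the between-round sleep


def retry_wait(last_sacct_seconds):
    """Minimum wait before re-querying jobs missing from the last sacct output:
    the rate limit or 5x the duration of the previous sacct call, whichever is longer."""
    return max(RATE_LIMIT_SECONDS, 5 * last_sacct_seconds)


def status_loop(check_active_jobs):
    """check_active_jobs() is assumed to return the number of newly finished jobs."""
    sleep_seconds = MIN_SLEEP
    while True:
        finished = check_active_jobs()
        if finished:
            sleep_seconds = MIN_SLEEP  # progress: fall back to the base interval
        else:
            sleep_seconds = min(sleep_seconds + SLEEP_INCREMENT, MAX_SLEEP)
        time.sleep(sleep_seconds)
```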
Description

This PR changes two things:

1. Previously, checking the status of `active_jobs` was only initiated after sleeping for `1/max_status_checks_per_second` seconds, which meant a minimum sleeping time of 30 seconds. This sleeping interval is now set to a fixed 30 seconds. However, one idea could also be to turn this into a new variable for the generic executor and expose it to the user via a command-line argument -- it is also manually set for other executors (e.g. the google executor), and might require fine-tuning for specific workflows and/or clusters.
2. Previously, any `max_status_checks_per_second > 1` specified by the user was set down to `max_status_checks_per_second = 0.03` (no clue why there were different values in the comparison and the final setting), and the default value was also `0.03`. This meant that only one job was checked for its status every 30 seconds. For workflows with hundreds or thousands of short-running jobs, it could easily take hours for finished jobs to be registered as `COMPLETED`, and until that point the respective resources were not available for scheduling new jobs (see the back-of-the-envelope numbers below). The new default and maximum of `max_status_checks_per_second = 2` considerably speeds this up, but stays well below the number of status checks per second that would be possible on the Slurm cluster I tested this on (which would be almost 20 for `sacct` and more than 50 for `scontrol`). Here, it would be useful if users on other Slurm clusters could check whether `scontrol` is also always quicker for them compared to `sacct`. If so, it would make sense to make `scontrol` the (cheaper) default (which also works for jobs that are not yet started), and `sacct` the fall-back (which also works for jobs that finished a while ago and might not be found by `scontrol` any more).
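To give a rough sense of the difference (illustrative numbers only; the 500-job workflow is an assumption, the rates are the old and new defaults):

```python
for label, rate in [("old default", 0.03), ("new default/maximum", 2)]:
    seconds_per_check = 1 / rate          # sleep enforced before each single status check
    full_pass = 500 * seconds_per_check   # one status pass over 500 active jobs
    print(f"{label}: {seconds_per_check:.1f} s per check, "
          f"{full_pass / 3600:.2f} h for a full pass")
# old default: ~33 s per check, ~4.6 h for a full pass
# new default/maximum: 0.5 s per check, ~0.07 h (about 4 minutes)
```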
QC

* The documentation (`docs/`) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).