
fix: better job status queries for slurm executor #2136

Merged
merged 8 commits into main from debug-slurm-register-finished-jobs on Mar 7, 2023

Conversation

dlaehnemann
Contributor

Description

This PR changes two things:

  1. Previously, checking active_jobs was only started after sleeping for 1/max_status_checks_per_second seconds, which with the old default meant a minimum sleep of 30 seconds. This sleep interval is now a fixed 30 seconds (see the sketch after this list). One further idea would be to turn the interval into a new variable of the generic executor and expose it to the user via a command-line argument -- it is also set manually for other executors (e.g. the google executor) and might require fine-tuning for specific workflows and/or clusters.
  2. Previously, any user-specified max_status_checks_per_second > 1 was capped at max_status_checks_per_second = 0.03 (it is unclear why the comparison and the final setting used different values), and the default value was also 0.03. As a result, only one job was checked for its status every 30 seconds. For workflows with hundreds or thousands of short-running jobs, it could therefore easily take hours for finished jobs to be registered as COMPLETED, and until that point their resources were not available for scheduling new jobs. The new default and maximum of max_status_checks_per_second = 2 speeds this up considerably, while staying well below the rate of status checks that would be possible on the Slurm cluster I tested this on (almost 20 per second for sacct and more than 50 for scontrol). It would be useful if users on other Slurm clusters could check whether scontrol is also consistently quicker than sacct for them. If so, it would make sense to make scontrol the (cheaper) default, which also works for jobs that have not yet started, and sacct the fallback, which also works for jobs that finished a while ago and might no longer be found by scontrol.
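
As a rough illustration (not the actual executor code), the intended split between the fixed sleep interval between rounds of checks and the per-check rate limit could look something like this:

```python
import time

# Hypothetical sketch: the rate limit governs how individual status checks are
# spaced within one round, while the pause between whole rounds is fixed.
STATUS_CHECK_SLEEP = 30           # fixed pause between rounds of checks (seconds)
MAX_STATUS_CHECKS_PER_SECOND = 2  # new default and maximum rate limit

def check_round(active_jobs, query_job_status):
    """Check each active job once, respecting the per-check rate limit."""
    min_gap = 1 / MAX_STATUS_CHECKS_PER_SECOND
    still_active = []
    for job in active_jobs:
        start = time.time()
        if query_job_status(job) != "COMPLETED":
            still_active.append(job)
        # space out individual checks instead of sleeping 30 s before each one
        time.sleep(max(0, min_gap - (time.time() - start)))
    return still_active

def watch(active_jobs, query_job_status):
    while active_jobs:
        active_jobs = check_round(active_jobs, query_job_status)
        time.sleep(STATUS_CHECK_SLEEP)  # fixed 30 s interval between rounds
```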

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change neither modifies the language nor the behavior or functionalities of Snakemake).

David Laehnemann added 4 commits February 23, 2023 10:27
… listing them in comments so it is easily brought back for specific debugging
…s can also generate timings on other slurm clusters
…pendent on max_status_checks_per_second -- that variable is a rate limit, not a sleep time, and its usage here has caused confusion
…instead of 0.03 for reasonable checking times
@github-actions
Contributor

Please format your code with black: `black snakemake tests/*.py`.

Comment on lines +202 to +204
# grep "sacct output" .snakemake/log/2023-02-13T210004.601290.snakemake.log | \
# awk '{ counter += 1; sum += $6; sum_of_squares += ($6)^2 } \
# END { print "average: ",sum/counter," sd: ",sqrt((sum_of_squares - sum^2/counter) / counter); }'
Contributor Author

This is quick code to get the average and standard deviation of the sacct timings that this PR adds to the debugging output (alongside the respective scontrol timings). If others could use it to also provide timings, this could help decide whether scontrol is the better default for checking the status of active_jobs.
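
For anyone who prefers Python over awk, a rough equivalent (assuming, as the awk one-liner does, that the timing is the sixth whitespace-separated field on each log line containing "sacct output") would be:

```python
import statistics

def sacct_timing_stats(logfile):
    """Average and (population) standard deviation of the sacct timings,
    taken from the sixth whitespace-separated field of matching log lines."""
    timings = []
    with open(logfile) as f:
        for line in f:
            if "sacct output" in line:
                timings.append(float(line.split()[5]))
    return statistics.mean(timings), statistics.pstdev(timings)

avg, sd = sacct_timing_stats(
    ".snakemake/log/2023-02-13T210004.601290.snakemake.log"
)
print("average:", avg, "sd:", sd)
```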

@dlaehnemann
Contributor Author

I have just highlighted the timing code in the comments. I have also just asked on the slurm-users mailing list whether scontrol is always expected to be quicker than sacct, and whether scontrol would thus make a better default that puts less strain on the scheduler. We might want to wait for an answer to that question before moving along with this PR.

@wickberg

From the slurmctld's perspective, sacct is great. Because it's querying the slurmdbd instead of the slurmctld. :)

Note that there are minor delays in propagating job status info to the slurmdbd, and you're probably better off getting this status from slurmctld in most instances.

I haven't looked in depth at what you're doing, so apologies if this doesn't line up well, but: a single 'scontrol show job' (or an equivalent squeue command with appropriate filtering options set) that grabs the entire job queue is preferred from the slurmctld's perspective over 10s to 1000s of individual 'scontrol show job <jobid>' calls. The trade-offs are around central job lock contention: a single RPC (even though it grabs the entire queue state rather than the state of one individual job) is vastly preferred over a slew of them, as it limits the lock contention.
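
A minimal sketch of that pattern (one call that dumps the state of the whole queue, then purely client-side lookup; the command and format flags are standard squeue options, the parsing is illustrative):

```python
import subprocess

def queue_states():
    """One squeue call that dumps job id and state for the whole queue."""
    out = subprocess.run(
        ["squeue", "--noheader", "--format=%i|%T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("|", maxsplit=1) for line in out.splitlines())

def states_of(job_ids):
    # single RPC for the entire queue, instead of one call per job id
    states = queue_states()
    return {job_id: states.get(job_id, "UNKNOWN") for job_id in job_ids}
```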

@dlaehnemann
Contributor Author

Thanks already for the insights, and this confirms what I have also been hearing on the slurm-users mailing list. So from what I understand:

  1. sacct is the better default command compared to scontrol when it comes to straining the slurm system.
  2. Both sacct and scontrol should preferably be queried with one batch request with a list of all the jobids of interest, not one individual request for each jobid.

So I guess I'll use this PR to also address this, changing the code to do batch requests.
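
As a first rough sketch (not the final implementation), such a batched request could use `sacct` with a comma-separated list of job ids; the flags below are standard `sacct` options, and the parsing is purely illustrative:

```python
import subprocess

def sacct_states(job_ids):
    """One batched sacct call for a comma-separated list of job ids."""
    out = subprocess.run(
        ["sacct", "--parsable2", "--noheader",
         "--format=JobIdRaw,State", "--jobs", ",".join(job_ids)],
        capture_output=True, text=True, check=True,
    ).stdout
    states = {}
    for line in out.splitlines():
        job_id, state = line.split("|", maxsplit=1)
        if "." in job_id:
            continue  # skip job step lines like <jobid>.batch
        states[job_id] = state
    return states
```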

@dlaehnemann
Contributor Author

Just documenting some further context I gathered from the very helpful slurm-users mailing list:

  1. For the suggested batch queries, it would make sense to run them at an adaptive interval: the maximum of a fixed interval (say 30 or 45 seconds) and 5x the time that the previous batch query took (see the sketch after this list). That way, whenever the system is not very responsive, we do not strain it further.
  2. It might make sense to use one name for all of the jobs in one snakemake run, as this might speed up queries if they are done with this (single and unique) name. However, we then have to ensure that this job name is really unique for each snakemake run (a UUID?) and will still have to parse every job id from the respective output internally.
  3. squeue might be an alternative fallback command, instead of scontrol, because it can handle job names. However, both seem to query slurmctld, and one should be careful not to overstrain slurmctld with queries. Hopefully, using it only as a fallback when sacct doesn't return anything already helps reduce the strain a lot. I have usually only seen sacct return nothing very shortly after submission, probably because there is a lag before jobs show up in the slurmdbd database that sacct queries. So an initial waiting period before the first query at the start of a snakemake run, plus a reasonable minimum waiting time between batch queries, should further minimize the instances where snakemake even has to use the fallback command.
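
A rough sketch of the adaptive interval from point 1 (the fixed minimum and the 5x factor are the values suggested above; everything else is illustrative):

```python
import time

MIN_QUERY_INTERVAL = 30  # seconds; something like 30-45 s as suggested above

def timed_batch_query(run_query):
    """Run one batch status query, returning its result and duration in seconds."""
    start = time.time()
    result = run_query()
    return result, time.time() - start

def next_wait(last_query_duration):
    # wait at least the fixed minimum, or 5x as long as the previous
    # batch query took, whichever is longer
    return max(MIN_QUERY_INTERVAL, 5 * last_query_duration)
```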

@dlaehnemann dlaehnemann changed the title from "fix: more reasonable slurm max_status_checks_per_second defaults" to "fix: better job status queries for slurm executor" on Feb 27, 2023
@wickberg

For (3), note that the job name filtering for squeue is handled entirely within the client-command. It won't make a difference in terms of the load on slurmctld. (For sacct, the SlurmDBD does use the name as part of the underlying MySQL query to limit the returned results.)

Both squeue and 'scontrol show job' are using the same underlying API calls (slurm_load_jobs), so I'd encourage you to use whichever command is more convenient for you to parse and manipulate.

@dlaehnemann
Contributor Author

Many thanks again, @wickberg, that's useful info for the implementation.

For squeue and scontrol show job, it probably still makes sense to submit batch queries with either a name (that many jobs share) or a list of JobIds, right? Or would querying each JobId individually create the exact same amount of load anyway? Such individual queries might keep the code for handling the JobIds that need the fallback call after sacct a bit simpler. But maybe I am overthinking this; I should just jump in and try it some time this week!

@johanneskoester
Contributor

Awesome guys! I am very much looking forward to the outcome of this PR! Just keep going!

@sonarcloud

sonarcloud bot commented Mar 2, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 1 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@johanneskoester
Contributor

That was me being logged into the wrong account, sorry :-)

@johanneskoester johanneskoester merged commit a4df38c into main Mar 7, 2023
@johanneskoester johanneskoester deleted the debug-slurm-register-finished-jobs branch March 7, 2023 13:30
johanneskoester pushed a commit that referenced this pull request Mar 9, 2023
🤖 I have created a release *beep* *boop*
---


## [7.24.1](v7.24.0...v7.24.1) (2023-03-09)


### Bug Fixes

* better job status queries for slurm executor
([#2136](#2136))
([a4df38c](a4df38c))
* get python version for script environment in a backwards compatible
way that works down to python 2.7
([#2161](#2161))
([44e59b9](44e59b9))
* prevents DeprecationWarning caused by using old draft of json schema
([#2152](#2152))
([9791ffb](9791ffb))


### Performance Improvements

* Gfal2 remote provider using gfal2-python instead of gfal2-utils.
([#2128](#2128))
([0b9bfe5](0b9bfe5))


### Documentation

* fix minor typos in a linting rule
([#2162](#2162))
([71e1171](71e1171))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
johanneskoester added a commit that referenced this pull request Mar 22, 2023
### Description

This is a follow-up pull request to #2136, providing a more thorough
solution for putting less strain on the slurm cluster system with job
status queries.

The solution suggested here:
* assigns the same UUID4 as the job name during `sbatch` job submission
for all jobs during a run of snakemake -- this goes back to [an idea by
Sean Maxwell on the `slurm-users` mailing
list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009713.html)
* uses `sacct` as the job status query command, as this [will put less
strain on `slurmctld` and instead query the
`slurmdbd`](#2136 (comment))
and we assume that any reasonable cluster setup will have at least basic
[job accounting](https://slurm.schedmd.com/accounting.html) set up
* uses `sacct --name <uuid4>` to make more efficient database queries to
`slurmdbd`, according to [a suggestion by
@wickberg](#2136 (comment))
* adaptively adjusts the minimum wait time between job status query
calls (although this only applies to retries when previously seen jobs
are somehow not present -- this usually shouldn't happen, and queries
are usually spaced by at least 20 seconds, see below), which will be the
maximum (i.e. the longer waiting period) of either (this also goes back
to [an idea by Sean Maxwell on the `slurm-users` mailing
list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009713.html)):
  * the specified rate limit (default now set to 2 seconds between
calls, but users can set this differently), or
  * 5 times the query time of the last call to `sacct`
* increases the sleep time between checks on all the running jobs by 10
seconds every time no job has finished since the last check; resets it
to a minimum wait time of 20 seconds once a finished job is detected;
caps it at a maximum sleep time of 180 seconds (see the sketch after
this list)
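
A rough sketch of how the `--name`-based `sacct` batch query and the adaptive
sleep logic described above fit together (function and variable names here are
illustrative, not the actual executor code):

```python
import subprocess
import time
import uuid

# All jobs of one snakemake run share a UUID4 job name (passed to sbatch via
# --job-name at submission), so a single `sacct --name <uuid4>` call covers them.
run_uuid = str(uuid.uuid4())

def sacct_states(job_name):
    out = subprocess.run(
        ["sacct", "--parsable2", "--noheader",
         "--format=JobIdRaw,State", "--name", job_name],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("|", maxsplit=1) for line in out.splitlines())

def watch(active_jobs, job_name, rate_limit_seconds=2):
    sleep_time = 20  # minimum sleep between rounds of status checks
    while active_jobs:
        start = time.time()
        states = sacct_states(job_name)
        # space query calls by at least the rate limit, or 5x the time the
        # last sacct call took, whichever is longer
        min_gap = max(rate_limit_seconds, 5 * (time.time() - start))
        finished = {j for j in active_jobs if states.get(j) == "COMPLETED"}
        active_jobs = active_jobs - finished
        if finished:
            sleep_time = 20                         # reset once a job finished
        else:
            sleep_time = min(sleep_time + 10, 180)  # back off, capped at 180 s
        time.sleep(max(sleep_time, min_gap))
```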

The solution explicitly DOES NOT use:
* `squeue --noheader --format="%F|%T" --name <uuid4> --jobs
<job_ids_of_jobs_with_no_sacct_status>` as the fallback job status query
command any more, because we assume basic accounting is always set up on
a cluster where workflows would be run (it would be better than
`scontrol show job`, because it can also do the `--name`-based
filtering)
* either the `--yaml` or `--json` flags of slurm commands, as these seem
to require (optional) plugins to be installed and they weren't on the
system I tested -- so we cannot assume these to usually work. (This came
up as an [idea by Bas van der Vlies on the `slurm-users` mailing
list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009728.html).)
* any mechanism to wait for jobs finishing (this came up as an [idea by
Brian Andrus on the `slurm-users` mailing
list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009724.html)),
because the only easy solution I saw was to have [`sbatch
--wait`](https://slurm.schedmd.com/sbatch.html#OPT_wait) wait until a
job finishes and only return then. However, this would keep a thread
running for every job currently active, likely overstraining the login
node in cases of high parallelism in the workflow (or oversubmissions).

### QC
<!-- Make sure that you can tick the boxes below. -->

* [ ] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change neither modifies the language
nor the behavior or functionalities of Snakemake).

---------

Co-authored-by: David Laehnemann <david.laehnemann@helmholtz-hzi.de>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>