
fix: better job status queries for slurm executor #2136

Merged
merged 8 commits into main from debug-slurm-register-finished-jobs on Mar 7, 2023

Conversation

dlaehnemann
Contributor

Description

This PR changes two things:

  1. Previously, checking active_jobs was only started after sleeping for 1/max_status_checks_per_second seconds, which with the old default meant a minimum sleep of 30 seconds. This sleep interval is now a fixed 30 seconds (see the sketch after this list). One further idea would be to turn the interval into a new variable of the generic executor and expose it to the user via a command-line argument -- it is also set manually for other executors (e.g. the google executor) and might require fine-tuning for specific workflows and/or clusters.
  2. Previously, any user-specified max_status_checks_per_second > 1 was capped at max_status_checks_per_second = 0.03 (it is unclear why the comparison and the final setting used different values), and the default value was also 0.03. As a result, only one job was checked for its status every 30 seconds. For workflows with hundreds or thousands of short-running jobs, it could therefore easily take hours for finished jobs to be registered as COMPLETED, and until that point their resources were not available for scheduling new jobs. The new default and maximum of max_status_checks_per_second = 2 speeds this up considerably, while staying well below the rate of status checks that would be possible on the Slurm cluster I tested this on (almost 20 per second for sacct and more than 50 for scontrol). It would be useful if users on other Slurm clusters could check whether scontrol is also consistently quicker than sacct for them. If so, it would make sense to make scontrol the (cheaper) default, which also works for jobs that have not yet started, and sacct the fallback, which also works for jobs that finished a while ago and might no longer be found by scontrol.
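
As a rough illustration (not the actual executor code), the intended split between the fixed sleep interval between rounds of checks and the per-check rate limit could look something like this:

```python
import time

# Hypothetical sketch: the rate limit governs how individual status checks are
# spaced within one round, while the pause between whole rounds is fixed.
STATUS_CHECK_SLEEP = 30           # fixed pause between rounds of checks (seconds)
MAX_STATUS_CHECKS_PER_SECOND = 2  # new default and maximum rate limit

def check_round(active_jobs, query_job_status):
    """Check each active job once, respecting the per-check rate limit."""
    min_gap = 1 / MAX_STATUS_CHECKS_PER_SECOND
    still_active = []
    for job in active_jobs:
        start = time.time()
        if query_job_status(job) != "COMPLETED":
            still_active.append(job)
        # space out individual checks instead of sleeping 30 s before each one
        time.sleep(max(0, min_gap - (time.time() - start)))
    return still_active

def watch(active_jobs, query_job_status):
    while active_jobs:
        active_jobs = check_round(active_jobs, query_job_status)
        time.sleep(STATUS_CHECK_SLEEP)  # fixed 30 s interval between rounds
```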

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change neither modifies the language nor the behavior or functionalities of Snakemake).

David Laehnemann added 4 commits February 23, 2023 10:27
… listing them in comments so it is easily brought back for specific debugging
…s can also generate timings on other slurm clusters
…pendent on max_status_checks_per_second -- that variable is a rate limit, not a sleep time, and its usage here has caused confusion
…instead of 0.03 for reasonable checking times
@github-actions
Contributor

Please format your code with black: `black snakemake tests/*.py`.

Comment on lines +202 to +204
# grep "sacct output" .snakemake/log/2023-02-13T210004.601290.snakemake.log | \
# awk '{ counter += 1; sum += $6; sum_of_squares += ($6)^2 } \
# END { print "average: ",sum/counter," sd: ",sqrt((sum_of_squares - sum^2/counter) / counter); }'
Contributor Author

This is quick code to get the average and standard deviation of the sacct timings that this PR adds to the debugging output (alongside the respective scontrol timings). If others could use it to also provide timings, this could help decide whether scontrol is the better default for checking the status of active_jobs.
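
For anyone who prefers Python over awk, a rough equivalent (assuming, as the awk one-liner does, that the timing is the sixth whitespace-separated field on each log line containing "sacct output") would be:

```python
import statistics

def sacct_timing_stats(logfile):
    """Average and (population) standard deviation of the sacct timings,
    taken from the sixth whitespace-separated field of matching log lines."""
    timings = []
    with open(logfile) as f:
        for line in f:
            if "sacct output" in line:
                timings.append(float(line.split()[5]))
    return statistics.mean(timings), statistics.pstdev(timings)

avg, sd = sacct_timing_stats(
    ".snakemake/log/2023-02-13T210004.601290.snakemake.log"
)
print("average:", avg, "sd:", sd)
```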

@dlaehnemann
Contributor Author

I have just highlighted the timing code in the comments. I have also just asked on the slurm-users mailing list whether scontrol is always expected to be quicker than sacct, and whether scontrol would thus make a better default that puts less strain on the scheduler. We might want to wait for an answer to that question before moving along with this PR.

@wickberg

From the slurmctld's perspective, sacct is great. Because it's querying the slurmdbd instead of the slurmctld. :)

Note that there are minor delays in propagating job status info to the slurmdbd, and you're probably better off getting this status from slurmctld in most instances.

I haven't looked in depth at what you're doing, so apologies if this doesn't line up well, but: a single 'scontrol show job' (or an equivalent squeue command with appropriate filtering options set) that grabs the entire job queue is preferred from the slurmctld's perspective over 10s to 1000s of individual 'scontrol show job <jobid>' calls. The trade-offs are around central job lock contention: a single RPC (even though it grabs the entire queue state rather than the state of one individual job) is vastly preferred over a slew of them, as it limits the lock contention.
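
A minimal sketch of that pattern (one call that dumps the state of the whole queue, then purely client-side lookup; the command and format flags are standard squeue options, the parsing is illustrative):

```python
import subprocess

def queue_states():
    """One squeue call that dumps job id and state for the whole queue."""
    out = subprocess.run(
        ["squeue", "--noheader", "--format=%i|%T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("|", maxsplit=1) for line in out.splitlines())

def states_of(job_ids):
    # single RPC for the entire queue, instead of one call per job id
    states = queue_states()
    return {job_id: states.get(job_id, "UNKNOWN") for job_id in job_ids}
```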

@dlaehnemann
Contributor Author

Thanks already for the insights, and this confirms what I have also been hearing on the slurm-users mailing list. So from what I understand:

  1. sacct is the better default command compared to scontrol when it comes to straining the slurm system.
  2. Both sacct and scontrol should preferably be queried with one batch request with a list of all the jobids of interest, not one individual request for each jobid.

So I guess I'll use this PR to also address this, changing the code to do batch requests.
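
As a first rough sketch (not the final implementation), such a batched request could use `sacct` with a comma-separated list of job ids; the flags below are standard `sacct` options, and the parsing is purely illustrative:

```python
import subprocess

def sacct_states(job_ids):
    """One batched sacct call for a comma-separated list of job ids."""
    out = subprocess.run(
        ["sacct", "--parsable2", "--noheader",
         "--format=JobIdRaw,State", "--jobs", ",".join(job_ids)],
        capture_output=True, text=True, check=True,
    ).stdout
    states = {}
    for line in out.splitlines():
        job_id, state = line.split("|", maxsplit=1)
        if "." in job_id:
            continue  # skip job step lines like <jobid>.batch
        states[job_id] = state
    return states
```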

@dlaehnemann
Contributor Author

Just documenting some further context I gathered from the very helpful slurm-users mailing list:

  1. For the suggested batch queries, it would make sense to run them at an adaptive interval: the maximum of a fixed interval (say 30 or 45 seconds) and 5x the time that the previous batch query took (see the sketch after this list). That way, whenever the system is not very responsive, we do not strain it further.
  2. It might make sense to use one name for all of the jobs in one snakemake run, as this might speed up queries if they are done with this (single and unique) name. However, we then have to ensure that this job name is really unique for each snakemake run (a UUID?) and will still have to parse every job id from the respective output internally.
  3. squeue might be an alternative fallback command, instead of scontrol, because it can handle job names. However, both seem to query slurmctld, and one should be careful not to overstrain slurmctld with queries. Hopefully, using it only as a fallback when sacct doesn't return anything already helps reduce the strain a lot. I have usually only seen sacct return nothing very shortly after submission, probably because there is a lag before jobs show up in the slurmdbd database that sacct queries. So an initial waiting period before the first query at the start of a snakemake run, plus a reasonable minimum waiting time between batch queries, should further minimize the instances where snakemake even has to use the fallback command.
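
A rough sketch of the adaptive interval from point 1 (the fixed minimum and the 5x factor are the values suggested above; everything else is illustrative):

```python
import time

MIN_QUERY_INTERVAL = 30  # seconds; something like 30-45 s as suggested above

def timed_batch_query(run_query):
    """Run one batch status query, returning its result and duration in seconds."""
    start = time.time()
    result = run_query()
    return result, time.time() - start

def next_wait(last_query_duration):
    # wait at least the fixed minimum, or 5x as long as the previous
    # batch query took, whichever is longer
    return max(MIN_QUERY_INTERVAL, 5 * last_query_duration)
```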

@dlaehnemann dlaehnemann changed the title from "fix: more reasonable slurm max_status_checks_per_second defaults" to "fix: better job status queries for slurm executor" on Feb 27, 2023
@wickberg

For (3), note that the job name filtering for squeue is handled entirely within the client-command. It won't make a difference in terms of the load on slurmctld. (For sacct, the SlurmDBD does use the name as part of the underlying MySQL query to limit the returned results.)

Both squeue and 'scontrol show job' are using the same underlying API calls (slurm_load_jobs), so I'd encourage you to use whichever command is more convenient for you to parse and manipulate.

@dlaehnemann
Contributor Author

Many thanks again, @wickberg, that's useful info for the implementation.

For squeue and scontrol show job, it probably still makes sense to submit batch queries with either a name (that many jobs share) or a list of JobIds, right? Or would querying each JobId individually create the exact same amount of load anyway? Such individual queries might keep the code for handling the JobIds that need the fallback call after sacct a bit simpler. But maybe I am overthinking this; I should just jump in and try it some time this week!

@johanneskoester
Contributor

Awesome guys! I am very much looking forward to the outcome of this PR! Just keep going!

@sonarcloud

sonarcloud bot commented Mar 2, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 1 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@johanneskoester
Contributor

That was me being logged into the wrong account, sorry :-)

@johanneskoester johanneskoester merged commit a4df38c into main Mar 7, 2023
@johanneskoester johanneskoester deleted the debug-slurm-register-finished-jobs branch March 7, 2023 13:30
johanneskoester pushed a commit that referenced this pull request Mar 9, 2023
🤖 I have created a release *beep* *boop*
---


## [7.24.1](v7.24.0...v7.24.1) (2023-03-09)


### Bug Fixes

* better job status queries for slurm executor
([#2136](#2136))
([a4df38c](a4df38c))
* get python version for script environment in a backwards compatible
way that works down to python 2.7
([#2161](#2161))
([44e59b9](44e59b9))
* prevents DeprecationWarning caused by using old draft of json schema
([#2152](#2152))
([9791ffb](9791ffb))


### Performance Improvements

* Gfal2 remote provider using gfal2-python instead of gfal2-utils.
([#2128](#2128))
([0b9bfe5](0b9bfe5))


### Documentation

* fix minor typos in a linting rule
([#2162](#2162))
([71e1171](71e1171))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
johanneskoester added a commit that referenced this pull request Mar 22, 2023
### Description

This is a follow-up pull request to #2136, providing a more thorough
solution for putting less strain on the slurm cluster system with job
status queries.

The solution suggested here:
* assigns the same UUID4 as the job name during `sbatch` job submission
for all jobs during a run of snakemake -- this goes back to [an idea by
Sean Maxwell on the `slurm-users` mailing
list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009713.html)
* uses `sacct` as the job status query command, as this [will put less
strain on `slurmctld` and instead query the
`slurmdbd`](#2136 (comment))
and we assume that any reasonable cluster setup will have at least basic
[job accounting](https://slurm.schedmd.com/accounting.html) set up
* uses `sacct --name <uuid4>` to make more efficient database queries to
`slurmdbd`, according to [a suggestion by
@wickberg](#2136 (comment))
* adaptively adjusts the minimum wait time between job status query
calls (although this only applies to retries when previously seen jobs
are somehow not present -- this usually shouldn't happen, and queries
are usually spaced by at least 20 seconds, see below), which will be the
maximum (i.e. the longer waiting period) of either (this also goes back
to [an idea by Sean Maxwell on the `slurm-users` mailing
list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009713.html)):
  * the specified rate limit (default now set to 2 seconds between
calls, but users can set this differently), or
  * 5 times the query time of the last call to `sacct`
* increases the sleep time between checks on all the running jobs by 10
seconds every time no job has finished since the last check; resets it
to a minimum wait time of 20 seconds once a finished job is detected;
caps it at a maximum sleep time of 180 seconds (see the sketch after
this list)
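
A rough sketch of how the `--name`-based `sacct` batch query and the adaptive
sleep logic described above fit together (function and variable names here are
illustrative, not the actual executor code):

```python
import subprocess
import time
import uuid

# All jobs of one snakemake run share a UUID4 job name (passed to sbatch via
# --job-name at submission), so a single `sacct --name <uuid4>` call covers them.
run_uuid = str(uuid.uuid4())

def sacct_states(job_name):
    out = subprocess.run(
        ["sacct", "--parsable2", "--noheader",
         "--format=JobIdRaw,State", "--name", job_name],
        capture_output=True, text=True, check=True,
    ).stdout
    return dict(line.split("|", maxsplit=1) for line in out.splitlines())

def watch(active_jobs, job_name, rate_limit_seconds=2):
    sleep_time = 20  # minimum sleep between rounds of status checks
    while active_jobs:
        start = time.time()
        states = sacct_states(job_name)
        # space query calls by at least the rate limit, or 5x the time the
        # last sacct call took, whichever is longer
        min_gap = max(rate_limit_seconds, 5 * (time.time() - start))
        finished = {j for j in active_jobs if states.get(j) == "COMPLETED"}
        active_jobs = active_jobs - finished
        if finished:
            sleep_time = 20                         # reset once a job finished
        else:
            sleep_time = min(sleep_time + 10, 180)  # back off, capped at 180 s
        time.sleep(max(sleep_time, min_gap))
```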

The solution explicitly DOES NOT use:
* `squeue --noheader --format="%F|%T" --name <uuid4> --jobs
<job_ids_of_jobs_with_no_sacct_status>` as the fallback job status query
command any more, because we assume basic accounting is always set up on
a cluster where workflows would be run (it would be better than
`scontrol show job`, because it can also do the `--name`-based
filtering)
* either the `--yaml` or `--json` flags of slurm commands, as these seem
to require (optional) plugins to be installed and they weren't on the
system I tested -- so we cannot assume these to usually work. (This came
up as an [idea by Bas van der Vlies on the `slurm-users` mailing
list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009728.html).)
* any mechanism to wait for jobs finishing (this came up as an [idea by
Brian Andrus on the `slurm-users` mailing
list](https://lists.schedmd.com/pipermail/slurm-users/2023-February/009724.html)),
because the only easy solution I saw was to have [`sbatch
--wait`](https://slurm.schedmd.com/sbatch.html#OPT_wait) wait until a
job finishes and only return then. However, this would keep a thread
running for every job currently active, likely overstraining the login
node in cases of high parallelism in the workflow (or oversubmissions).

### QC
<!-- Make sure that you can tick the boxes below. -->

* [ ] The PR contains a test case for the changes or the changes are
already covered by an existing test case.
* [x] The documentation (`docs/`) is updated to reflect the changes or
this is not necessary (e.g. if the change neither modifies the language
nor the behavior or functionalities of Snakemake).

---------

Co-authored-by: David Laehnemann <david.laehnemann@helmholtz-hzi.de>
Co-authored-by: Johannes Köster <johannes.koester@tu-dortmund.de>