
Job groups and SLURM job arrays support #301

Open
lpla opened this issue Apr 1, 2020 · 23 comments
Labels
enhancement New feature or request

Comments

@lpla
Contributor

lpla commented Apr 1, 2020

Is your feature request related to a problem? Please describe.
Hello. I am a developer of Bitextor (https://github.com/bitextor/bitextor), which is based on Snakemake, and we are having issues running it at large HPC-cluster scale. We have been using Bitextor on several HPC clusters running PBS and SLURM for the https://paracrawl.eu project, and the number of submitted jobs is large enough to disturb the cluster scheduler.

Describe the solution you'd like
SLURM, which is the scheduler we are using right now, has a 'job array' option that groups identical jobs with different inputs into a single submission (or a few submissions, if there are too many tasks for one array). Snakemake doesn't have an easy way to automate this.
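For illustration, a minimal sketch of the kind of array submission we currently have to write by hand (the tool name and file paths are placeholders):

```bash
#!/bin/bash
#SBATCH --array=0-999%100      # 1000 identical tasks, at most 100 running at once
#SBATCH --cpus-per-task=1
#SBATCH --time=01:00:00

# Each array task picks its own input by index; the files differ only by a number.
INPUT="data/chunk_${SLURM_ARRAY_TASK_ID}.txt"
OUTPUT="results/chunk_${SLURM_ARRAY_TASK_ID}.out"

my_tool --input "$INPUT" --output "$OUTPUT"
```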

Describe alternatives you've considered
We tried to redesign our code to reduce the number of tasks and the time needed to construct the DAG, I researched other workflow managers (none of them support this; only Airflow offers a similar feature via AWS Batch jobs) and, as a last resort, we had to run the whole pipeline/workflow manually using job arrays (--array) and bash scripts iterating over each rule.

Additional context
https://bitbucket.org/snakemake/snakemake/issues/676/job-groups-and-slurm-jobarray

@lpla lpla added the enhancement New feature or request label Apr 1, 2020
@lpla
Contributor Author

lpla commented May 5, 2020

Temporary workaround? #343

@lpla lpla mentioned this issue Sep 25, 2020
@ianrgraham

ianrgraham commented Nov 1, 2021

Hey there, did you ever find a nice solution for this?

@lpla
Contributor Author

lpla commented Nov 2, 2021

Nope. As I said back then, we have had to run the whole workflow manually this whole time under SLURM.

@ianrgraham

Hm, I wonder if the devs would accept an enhancement that allows this to be done. I mean, it wouldn't be too hard to write, no?

@lpla
Contributor Author

lpla commented Nov 2, 2021

Well, I think it is not that easy. Snakemake would need to take into account all jobs from a rule and wait until all inputs for those jobs are ready, and then group them as a job array, which would probably also require renaming inputs so that they differ between jobs only by a number. This makes processing more "horizontal" rather than the "vertical" processing Snakemake provides by design (Snakemake only waits for inputs, not for similar jobs with similar inputs).
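As a rough illustration of that grouping and renaming step, a hypothetical wrapper (process_one_shard and the directory names are made up) that would run once all inputs of a rule are ready:

```bash
#!/bin/bash
# Hypothetical manual grouping for one rule: link the ready inputs under
# index-based names, then submit a single array job covering all of them.
mkdir -p work/array_in work/array_out

i=0
for f in work/tokenised/*.gz; do
    ln -sf "$(realpath "$f")" "work/array_in/${i}.gz"   # inputs now differ only by a number
    i=$((i + 1))
done

sbatch --array=0-$((i - 1)) --wrap \
    'process_one_shard "work/array_in/${SLURM_ARRAY_TASK_ID}.gz" "work/array_out/${SLURM_ARRAY_TASK_ID}.gz"'
```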

@lpla
Contributor Author

lpla commented Dec 5, 2022

As @johanneskoester said on Twitter (https://twitter.com/zngu/status/1499479835290308618), Pull Request #1015 should allow grouping jobs using SLURM job arrays.

@lpla
Contributor Author

lpla commented Dec 21, 2022

Hi. I reviewed the PR that implements the SLURM backend in Snakemake, and it doesn't support job arrays yet. It still needs code work, as there is no reference to the --array option in the sbatch client calls. See https://help.rc.ufl.edu/doc/SLURM_Job_Arrays

This issue should be kept open until this feature is finally implemented.

@pvandyken
Contributor

Sure thing. @cmeesters, I think this is more in your ballpark

@pvandyken pvandyken reopened this Dec 21, 2022
@cmeesters
Contributor

cmeesters commented Dec 22, 2022

Thank you, @pvandyken. However, I am not sure what to think and here is why:

What is the purpose of job arrays? "Historically" they were a convenience feature for people working with batch systems and, in some implementations, a way to avoid a little scheduling overhead. With SLURM you avoid a bit of looping (e.g. in bash). Yet submitting a few hundred or even a few thousand individual jobs is usually a non-issue with negligible overhead. Only when there are very many such jobs does the overhead kick in. Then again, admins impose a limit on arrays to avoid a flooded, non-functional cluster (sometimes they first have to learn this the hard way). So, on many clusters there is a sweet spot. With snakemake -j unlimited ... we are already able to launch as many jobs as the workflow allows at any given point.

OK. How would we implement it in snakemake? We cannot use the slurm_args resource definition, because this is essentially ignored by snakemake (on purpose). We cannot add an additional resource flag either, because the executor only sees individual jobs (or group jobs, which are conceptually different). Hence, we would need an array_job feature (or similarly named) at the keyword level, with the information propagated down to the executor. That, however, would conflict with the idea that all workflows should be portable, because a job array option only exists in cluster environments.

Besides, I do not see the argument for more "horizontal" processing as you do, @lpla. Assuming a rule with predecessors, potentially carried out many times: when snakemake is able to submit a job, it will do so (if not throttled). If we add an array feature as proposed, we would add synchronization overhead to the workflow itself, hence a throttle.

Hence, either I have made a mistake in my line of thought and there is an implementation option I have overlooked, or this is perhaps not a good idea after all. Feedback is very much appreciated! @johanneskoester?

@lpla
Contributor Author

lpla commented Dec 22, 2022

Hello, @cmeesters. Thank you for your response.

I do agree that a few thousand jobs shouldn't be an issue on a modern supercomputer. But in our practical case, that's not the situation we had.

We were forced by the cluster sysadmins (from two different clusters: EPCC and CSD3) to use job arrays because the number of files we had to process was huge for them. In their words, from the mails I received, our "jobs have been very disruptive on the system - it's not really designed for lots of single core jobs" and the scheduler was "being overwhelmed by requests, [producing] timeouts in job submission and queue queries". They even insisted: "If you are submitting many jobs using repeated sbatch commands, please learn how to use array jobs to reduce the overhead". But they didn't specify a number to describe how many jobs are "many".

In our pipeline, Bitextor, the initial preprocessing rules are not designed to be parallelized at the thread level. Also, input file sizes in the Paracrawl project were quite heterogeneous (each file could range from KBs to GBs), so we even implemented a program that reduces the number of downstream jobs by grouping the preprocessed files in a way specific to our problem.

So, in our case, throttling the workflow by adding an optional synchronization step so that job arrays work in Snakemake would probably keep the sysadmins happy and allow people to run this kind of pipeline on picky SLURM clusters the same way we run it on dedicated servers. But I must apologize, because I don't know enough about the internals of Snakemake to be sure of this.

@jelmervdl was in charge of running our code for production for Paracrawl, so he could better explain the exact issues and requirements for a feature like this. Tagging @kpu just in case, as he asked @johanneskoester on Twitter.

@cmeesters
Contributor

cmeesters commented Dec 22, 2022

Ah, so there are two issues here:

  • I/O and
  • pooling a.k.a. node level scheduling.

snakemake can be a remedy for certain I/O issues, by doing some stage-in and stage-out prior to execution. But that in turn might require adjustments to the rules.

With regard to pooling jobs, the group feature is already a partial remedy. It depends on the number of job steps, i.e. the snakemake group jobs therein - so probably not what you require. As discussed in PR #1015, I would much rather have an option to launch more job steps from a master node. The original idea was to oversubscribe (via SLURM) the resources for a group job with respect to the number of tasks and trigger srun to execute on the other reserved nodes.

This, however, would break the group job feature when dealing with pipes in group jobs. It would nevertheless be a remedy for your case and for mine, so (next year) we will see what we can do about it. Tweaking this idea is what I favour instead of array jobs. The relative arrogance of my colleagues ("please learn how to use array jobs to reduce the overhead") should not keep us from thinking out of the box, particularly as array jobs are meant for balanced resource requirements (e.g. run times) and are not a suitable solution with respect to scheduling and your fair share when dealing with uneven input sizes to crunch.

With regard to the I/O issue, we would need a little more info and perhaps an online meeting to get the details.

@jdblischak
Contributor

In practice, I often create a one-off sbatch script to submit a bottleneck rule as a big array job, which I briefly describe in #1814 (comment). Even if you don't get in trouble with your sys admins, waiting for Snakemake to serially submit hundreds of thousands of jobs takes forever.
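A hedged sketch of such a one-off script, assuming the bottleneck rule boils down to one command per sample (samples.txt, some_tool and the paths are placeholders); once the array finishes, a regular Snakemake run should find the outputs in place and continue with the downstream rules:

```bash
#!/bin/bash
#SBATCH --array=1-10000%500     # one array task per sample (subject to the cluster's MaxArraySize limit)
#SBATCH --cpus-per-task=1
#SBATCH --time=00:30:00

# Run the bottleneck rule's command directly, bypassing per-job submission.
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
some_tool "data/${SAMPLE}.fastq.gz" > "results/${SAMPLE}.out"
```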

@cmeesters
Contributor

THAT might be a workaround, but not really a solution within snakemake. Is there a published workflow, so that the scenario can be traced in the code?

jdblischak added a commit to jdblischak/smk-simple-slurm that referenced this issue Dec 22, 2022
snakemake/snakemake#301

Doesn't actually use the smk-simple-slurm profile. However, it is a useful (albeit hacky) workaround that I use often
@jdblischak
Contributor

I don't have a public example that I can share, but I put together a minimal example to demonstrate my workaround:

https://github.com/jdblischak/smk-simple-slurm/tree/main/examples/job-array

@cmeesters
Contributor

cmeesters commented Dec 22, 2022

This is exactly the scenario I have in mind: like in a group job, snakemake would need to be aware of the number of job items, then submit one job, which launches number-of-job-items SLURM job steps and requires number-of-job-items x number-of-threads resources, presumably modulo a fudge factor (if the job is smaller and some steps finish sooner, the waste of resources can be minimized). The trick here: SLURM can execute job steps, triggered by srun, across the reserved nodes. The only limitation: pipes across rules will not work in such scenarios. But whoever needs to launch this many jobs will not need piping anyway (for example, sorting BAM files via a pipe is so slow that it poses an enormous overhead in our scenario).

The idea is similar to this solution with GNU parallel.
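A minimal sketch of that pattern, with invented task counts, resource numbers and a placeholder process_item command; one submitted job reserves the aggregate resources and then fans the work out as srun job steps:

```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks=64             # number-of-job-items (invented)
#SBATCH --cpus-per-task=2       # threads per job item (invented)

# One job step per item; SLURM places the steps across the reserved nodes.
for i in $(seq 0 63); do
    srun --ntasks=1 --cpus-per-task=2 --exact \
        process_item "inputs/${i}.dat" "outputs/${i}.dat" &   # --exact keeps steps on distinct CPUs
done
wait   # hold the allocation until every job step has finished
```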

Coding this for snakemake is fairly simple. But I do not know how to test it in the CI (the CI-SLURM is pretty limited in resources, and even a toy example requires more than we have). I shall give this a try early next year, if Johannes agrees to the implementation I have in mind. In any case, my January is pretty booked already: Do not expect miracles.

One more question, though: Is your idea to start this many aligners?

@jdblischak
Contributor

One more question, though: Is your idea to start this many aligners?

It was just an example. My pipeline wasn't aligning reads. But now I'm curious: what if I have RNAseq for 20k single cells? What is your recommendation?

@cmeesters
Contributor

Do not use a (sequential) pipeline in the first place! ;-)

On a more serious note: the I/O issue might be far more important. Assuming the reference is bigger than the file system cache, stage the reference in prior to any job (e.g. onto a node-local filesystem) to avoid random I/O. Here, our contemplated "have one big job with many job steps" approach might reduce the overhead significantly (only a few stage-in steps). Supporting sbcast in snakemake will not be easy, but it is worth considering, too.
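A hedged sketch of that stage-in idea inside a job script (the reference path, the node-local location and align_sample are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=32

# Broadcast the reference once to node-local storage on every allocated node,
# so subsequent steps read it locally instead of from the shared filesystem.
LOCAL_REF="/tmp/genome.fa"                  # node-local scratch; the path is cluster-specific
sbcast --force reference/genome.fa "$LOCAL_REF"

# Every following job step can now point at the local copy.
srun --ntasks=1 align_sample --reference "$LOCAL_REF" --reads "data/sample_01.fastq.gz"
```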

@jdblischak
Contributor

Do not use a (sequential) pipeline in the first place! ;-)

A similar idea to reduce the number of times the reference has to be loaded would be to align multiple samples at once. However, the thing I like about Snakemake and similar approaches is keeping everything organized. When I have thousands of samples, I really appreciate the peace of mind when I can easily read my Snakefile and know that each sample was processed correctly (even if not done in the most efficient manner).

@tardigradus

tardigradus commented Jan 23, 2023

[snip (11 lines)]

This, however, would break the group job feature when dealing with pipes in group jobs. It would nevertheless be a remedy for your case and for mine, so (next year) we will see what we can do about it. Tweaking this idea is what I favour instead of array jobs. The relative arrogance of my colleagues ("please learn how to use array jobs to reduce the overhead") should not keep us from thinking out of the box, particularly as array jobs are meant for balanced resource requirements (e.g. run times) and are not a suitable solution with respect to scheduling and your fair share when dealing with uneven input sizes to crunch.

[snip (2 lines)]

Not using an array when submitting a large number of jobs has some drawbacks and can seriously impact the throughput of an HPC system using Slurm. From a practical point of view, command-line tools which list jobs become unwieldy if there are thousands of individual jobs; the jobs in an array, on the other hand, can all be represented on a single line. The major issue, however, is that a large number of individual jobs which all have the same resource requirements will prevent backfill from working efficiently. Backfill is the mechanism whereby jobs can be started earlier than they otherwise would be, by the scheduler identifying gaps in the planned schedule. However, only a limited number of jobs are considered for backfilling. If these all have identical requirements, this can prevent other jobs which might be eligible from being considered. If run times really are very different, then that would be a potential argument against an array, but in my limited experience, people tend to generate a large number of jobs all with the same run time.

I don't think that the request "please learn how to use array jobs to reduce the overhead" is arrogant. People running HPC systems have an obligation to all users, whether they use snakemake or not. If snakemake causes a degradation of the quality of service for people who don't use it, then the operators are going to have to take measures to minimize that effect.

@Marc-commits

Any update on this?

@cmeesters
Contributor

I'm afraid not. Awaiting the release of Snakemake 8 (and with it the restructuring into Snakemake executor plugins), and with tons of other work items, no work on this has started yet.

However, note that stage-in/-out processes will be possible with Snakemake 8. See https://snakemake.readthedocs.io/en/latest/snakefiles/storage.html - using default-storage-provider: fs.
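For reference, a hedged sketch of the command-line form, assuming the slurm executor plugin and the fs storage plugin are installed (the scratch prefix is a placeholder and the exact options may still change while v8 is consolidated):

```bash
snakemake --executor slurm \
          --default-storage-provider fs \
          --local-storage-prefix /local/scratch \
          --jobs 100
```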

@tardigradus

Could you elaborate on how stage-in / -out might alleviate the problem which job arrays address?

BTW: Nextflow's support for job arrays seems to be progressing.

@cmeesters
Contributor

We are currently consolidating the changes introduced with v8, I'm afraid.

Could you elaborate on how stage-in / -out might alleviate the problem which job arrays address?

Not at all. It was a remark in a side discussion.
