Job groups and SLURM job arrays support #301
Temporary workaround? #343
Hey there, did you ever find a nice solution for this?
Nope. As I said back then, we have had to run the whole workflow manually under SLURM this whole time.
Hm, I wonder if the devs would accept an enhancement that allows this to be done. I mean, it wouldn't be too hard to write, no?
Well, I think it is not that easy. Snakemake would need to take into account all jobs from a rule, wait until all inputs for those jobs are ready, and then group them as a job array, which would probably also require renaming the inputs so they differ between jobs only by an index. This makes processing more "horizontal", as opposed to the verticality that Snakemake provides by design (Snakemake only waits for inputs, not for similar jobs with similar inputs).
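To make the "inputs differ only by an index" point concrete, here is a minimal sketch of what a per-task array script could look like. The file naming scheme (`input_N.txt`, `output_N.txt`) is hypothetical, not anything Snakemake produces; the only real SLURM piece is the `SLURM_ARRAY_TASK_ID` variable that SLURM sets for each array task.

```shell
#!/usr/bin/env bash
# Sketch of a per-task script in a SLURM job array: every task selects its
# own input purely from its array index, which is why inputs would first
# have to be renamed (or symlinked) into a numbered scheme.
# SLURM sets SLURM_ARRAY_TASK_ID inside a real array job; default to 0 here
# so the script can be run outside SLURM for illustration.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"
INPUT="input_${TASK_ID}.txt"      # hypothetical naming scheme
OUTPUT="output_${TASK_ID}.txt"
echo "task ${TASK_ID}: ${INPUT} -> ${OUTPUT}"
```

Submitted as `sbatch --array=0-99 script.sh`, this one script would run once per index, each task picking a different input.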
As @johanneskoester said on Twitter (https://twitter.com/zngu/status/1499479835290308618), pull request #1015 should allow grouping jobs using SLURM job arrays.
Hi. I reviewed the PR that implements the SLURM backend in Snakemake, and it doesn't support job arrays yet. It still needs code work, as the corresponding SLURM option is not referenced anywhere. This issue should be kept open until this feature is finally implemented.
Sure thing. @cmeesters, I think this is more in your ballpark.
Thank you, @pvandyken. However, I am not sure what to think, and here is why: what is the purpose of job arrays? "Historically" it was a convenience feature for people working with batch systems and, in some implementations, a way to avoid a little scheduling overhead. With SLURM you avoid a bit of looping (e.g. in bash). Yet, submitting a few hundred or even a few thousand individual jobs is usually a non-issue with negligible overhead. Only if there are very many such jobs does the overhead kick in. Then again, admins impose a limit on array sizes to avoid a flooded, non-functional cluster (sometimes they first have to make that experience first-hand). So, on many clusters there is a sweet spot.

OK. How would we implement it in Snakemake? We cannot use SLURM's array submission as-is. Besides, I do not see the argument for more "horizontal" processing as you do, @ipla. Assuming a rule with predecessor rules, potentially carried out many times: when Snakemake is able to submit a job, it will do so (if not throttled). If we add an array feature like the one proposed, we would add synchronization overhead to the workflow itself, hence a throttle.

So, either I have made a mistake in my line of thought and there is an overlooked implementation option, or this is perhaps not a good idea after all. Feedback is very much appreciated! @johanneskoester ?
Hello, @cmeesters. Thank you for your response. I do agree that a few thousand jobs shouldn't be an issue on a modern supercomputer. But in our practical case, that's not what we had. We were forced by the clusters' sysadmins (from two different clusters: EPCC and CSD3) to use job arrays because the number of files we had to process was huge for them. Using their words from the mails I received, our "jobs have been very disruptive on the system - it's not really designed for lots of single core jobs" and the scheduler was "being overwhelmed by requests, [producing] timeouts in job submission and queue queries". They even insist: "If you are submitting many jobs using repeated sbatch commands, please learn how to use array jobs to reduce the overhead". But they don't specify a number to describe how many jobs are "many".

In our pipeline, Bitextor, the initial preprocessing rules are not designed to be parallelized at thread level. Also, input file sizes in the Paracrawl project were quite heterogeneous (each file could be anywhere from KBs to GBs), so we even implemented a program that reduces the number of jobs in the subsequent tasks by grouping the preprocessed files in a way specific to our problem.

So, in our case, throttling the workflow by adding an optional synchronization to make job arrays work in Snakemake would probably keep the sysadmins happy and allow people to run this kind of pipeline on picky SLURM clusters the same way we run it on dedicated servers. But I have to apologize, because I don't know enough about the internals of Snakemake to be sure about this. @jelmervdl was in charge of running our code in production for Paracrawl, so he could better explain the exact issues and requirements for a feature like this. Tagging @kpu just in case, as he asked @johanneskoester on Twitter.
Ah, so there are two issues here:
snakemake can be a remedy for certain I/O issues by doing some stage-in and stage-out prior to execution, but that in turn might require adjustments to the rules.

With regard to pooling jobs, one idea is to submit a single job that runs many rule instances as job steps. This, however, would break the group job feature when dealing with pipes in group jobs. It would nevertheless be a remedy for your case and one for mine, so (next year) we will see what we can do about it. Tweaking this idea is what I favour instead of array jobs. The relative arrogance of my colleagues ("please learn how to use array jobs to reduce the overhead") should not keep us from thinking out of the box, particularly as array jobs are meant for balanced resource requirements (e.g. run times) and are not a suitable solution with respect to scheduling and your fair share when dealing with uneven input sizes to crunch.

With regard to the I/O issue, we would need a little more info and perhaps an online meeting to get the details.
In practice, I often create a one-off sbatch script to submit a bottleneck rule as a big array job, which I briefly describe in #1814 (comment). Even if you don't get in trouble with your sysadmins, waiting for Snakemake to serially submit hundreds of thousands of jobs takes forever.
THAT might be a work-around, but not really a solution within snakemake. Is there a published workflow, such that the scenario can be traced in the code?
snakemake/snakemake#301 doesn't actually use the smk-simple-slurm profile. However, it is a useful (albeit hacky) workaround that I use often.
I don't have a public example that I can share, but I put together a minimal example to demonstrate my workaround: https://github.com/jdblischak/smk-simple-slurm/tree/main/examples/job-array
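For readers who cannot follow the link: the one-off workaround usually boils down to a submission script along these lines. This is a sketch only; the sample list, target path, and array bounds are hypothetical and not taken from the linked example.

```shell
#!/usr/bin/env bash
# Sketch of a one-off array submission for a bottleneck rule.
# Instead of letting Snakemake submit thousands of jobs serially, write one
# sbatch script and let SLURM fan it out as an array.
cat > submit_array.sh <<'EOF'
#!/bin/bash
#SBATCH --array=1-1000%50       # 1000 tasks, at most 50 running at once
#SBATCH --cpus-per-task=1
# Hypothetical: samples.txt holds one sample name per line;
# each array task picks the line matching its index.
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
# Re-invoke snakemake for just this sample's output (illustrative target).
snakemake --cores 1 "results/${SAMPLE}.bam"
EOF
echo "wrote submit_array.sh; submit with: sbatch submit_array.sh"
```

The `%50` throttle is what tends to keep sysadmins happy: one scheduler transaction, a bounded number of concurrently running tasks.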
This is exactly the scenario I have in mind: like in a group job, snakemake would need to be aware of the number of job items, then submit one job, which launches the individual items as job steps. The idea is similar to this solution with GNU parallel.

Coding this for snakemake is fairly simple. But I do not know how to test it in the CI (the CI SLURM is pretty limited in resources, and even a toy example requires more than we have). I shall give this a try early next year, if Johannes agrees to the implementation I have in mind. In any case, my January is pretty booked already: do not expect miracles.

One more question, though: is your idea to start this many aligners?
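A rough sketch of the "one job, many steps" idea, with plain background processes standing in for job steps so it runs anywhere. In a real SLURM job, `run_step` would be an `srun` invocation so each instance becomes a job step rather than a separate job; `run_step` and `N_ITEMS` are placeholders introduced here, not Snakemake internals.

```shell
#!/usr/bin/env bash
# Sketch: one submitted job that runs N rule instances as parallel steps.
# Inside a real SLURM allocation, run_step would wrap something like
# 'srun --ntasks=1 ...' so each instance is a job step, not a new job.
N_ITEMS=4                        # number of job items snakemake knows about

run_step() {                     # placeholder for one rule instance
    echo "processing item $1"
}

for i in $(seq 1 "$N_ITEMS"); do
    run_step "$i" &              # launch the steps concurrently
done
wait                             # single synchronization point for the batch
echo "all ${N_ITEMS} items done"
```

This is where the synchronization overhead mentioned above becomes visible: the `wait` means the whole batch finishes together, which is exactly the throttle the array-style grouping introduces.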
It was just an example. My pipeline wasn't aligning reads. But now I'm curious: what if I have RNAseq for 20k single cells? What is your recommendation?
Do not use a (sequential) pipeline in the first place! ;-) On a more serious note: the I/O issue might be far more important. Assuming the reference is bigger than the file system cache, stage in the reference prior to any job (e.g. onto a node-local filesystem) to avoid random I/O. Here, our contemplated "have one big job with many job steps" approach might reduce the overhead significantly (only a few stage-in steps).
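In its simplest form, the stage-in idea is just a copy to node-local storage before the compute steps run. The paths below are placeholders (on SLURM clusters `$TMPDIR` commonly points at node-local scratch, but that is site-specific):

```shell
#!/usr/bin/env bash
# Sketch: copy a shared reference to node-local storage once per job/node,
# then point every subsequent step at the local copy, avoiding random I/O
# against the shared parallel filesystem.
REF_SHARED="${REF_SHARED:-reference.fa}"    # placeholder path on shared storage
LOCAL_DIR="${TMPDIR:-/tmp}/refcache.$$"
mkdir -p "$LOCAL_DIR"

# Create a dummy reference if none exists (for illustration only).
[ -f "$REF_SHARED" ] || echo ">chr1" > "$REF_SHARED"

cp "$REF_SHARED" "$LOCAL_DIR/"              # the stage-in step
REF_LOCAL="$LOCAL_DIR/$(basename "$REF_SHARED")"
echo "using local reference: $REF_LOCAL"
# ... all aligner steps would now read $REF_LOCAL instead of $REF_SHARED ...
```

With the "one big job, many job steps" pooling, this copy happens once per allocation instead of once per task, which is where the savings come from.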
A similar idea to reduce the number of times the reference has to be loaded would be to align multiple samples at once. However, the thing I like about Snakemake and similar approaches is keeping everything organized. When I have thousands of samples, I really appreciate the peace of mind of being able to easily read my Snakefile and know that each sample was processed correctly (even if not in the most efficient manner).
[snip (11 lines)]
[snip (2 lines)] Not using an array when submitting a large number of jobs has some drawbacks and can seriously impact the throughput of an HPC system using Slurm. From a practical point of view, command line tools which list jobs become unwieldy if there are thousands of individual jobs. On the other hand, the jobs in an array can all be represented on a single line.

The major issue, however, is that a large number of individual jobs which all have the same resource requirements will prevent backfill from working efficiently. This is the mechanism whereby jobs can be started earlier than they otherwise would be, by the scheduler identifying gaps in the planned schedule. However, only a limited number of jobs are considered for backfilling; if these all have identical requirements, this can prevent other jobs which might be eligible from being considered. If run times really are very different, then that would be a potential argument against an array, but in my limited experience, people tend to generate a large number of jobs all with the same run time.

I don't think that the request "please learn how to use array jobs to reduce the overhead" is arrogant. People running HPC systems have an obligation to all users, whether they use snakemake or not. If snakemake causes a degradation of the quality of service for people who don't use it, then the operators are going to have to take measures to minimize that effect.
Any update on this?
I'm afraid not. Awaiting the release of Snakemake 8 (and with it the restructuring into Snakemake executor plugins) and under tons of other work items, no work on this has started yet. However, note that stage-in/-out processes will be possible with Snakemake 8; see https://snakemake.readthedocs.io/en/latest/snakefiles/storage.html.
Could you elaborate on how stage-in/-out might alleviate the problem which job arrays address? BTW: Nextflow's support for job arrays seems to be progressing.
We are currently consolidating the changes introduced with v8, I'm afraid.
Not at all. It was a remark in a side discussion.
Is your feature request related to a problem? Please describe.
Hello. I am a developer of Bitextor (https://github.com/bitextor/bitextor), which is based on Snakemake, and we are having issues running it at large HPC cluster scale. We have been using Bitextor on several HPC clusters running PBS and SLURM for the https://paracrawl.eu project, and the number of submitted jobs is big enough to disturb the cluster scheduler.
Describe the solution you'd like
SLURM, which is the scheduler we are using right now, has a 'job array' option to group identical jobs with different inputs into a single submission (or a few submissions, if there are too many tasks for one array). Snakemake doesn't have an easy way to automate this.
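To illustrate the difference this makes on the scheduler side, here is a small sketch in which `submit` merely echoes the command it would run, standing in for a real `sbatch` call (`job.sh` is a placeholder script name):

```shell
#!/usr/bin/env bash
# Sketch: N individual submissions vs. one array submission.
# 'submit' echoes instead of calling sbatch, so this runs anywhere.
submit() { echo "sbatch $*"; }

# Without arrays: one scheduler transaction per input -- this is what
# overwhelms schedulers when there are tens of thousands of inputs.
for i in 0 1 2; do
    submit "--export=IDX=$i" job.sh
done

# With an array: a single transaction; SLURM expands it into tasks 0..2,
# each of which sees its own SLURM_ARRAY_TASK_ID at run time.
submit "--array=0-2" job.sh
```

The array form is what the sysadmins quoted above are asking for: the scheduler tracks one record instead of thousands.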
Describe alternatives you've considered
We tried to redesign our code so that the number of tasks and the time to construct the DAG are reduced; I researched other workflow managers (none of them support this; only Airflow supports a similar feature, AWS Batch jobs); and, as a last resort, we had to run the whole pipeline/workflow manually using 'job arrays' (--array) and bash scripts iterating through each rule.
Additional context
https://bitbucket.org/snakemake/snakemake/issues/676/job-groups-and-slurm-jobarray