
snakemake waiting for files much longer than latency-wait (slurm, NFS) #2739

Closed
Dmitry-Antipov opened this issue Mar 7, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@Dmitry-Antipov

When running a large swarm of jobs on an NFS cluster, I sometimes see snakemake waiting for files much longer than the time stated in the --latency-wait option.

This is likely related to the NFS problem first mentioned in #39, but the difference is that I do not get an IncompleteFilesException. After a few hours snakemake 'realizes' that the job is finished and execution continues.

For example, for one of the jobs sacct reports an end time of 07:18:42, ls reports 07:42 as the last-modified time of the output files, and the snakemake log reports 12:22:34 as the end time (same day).

Could it be related to this place in the code, https://github.com/snakemake/snakemake/blob/977951ea541bceb97b6a77709fde863f6c638352/snakemake/io.py#L895C7-L895C8 ? As far as I can see, snakemake just waits until the file appears to exist in _IOCache, regardless of latency-wait.

--latency-wait is set to 30 seconds, so the expected behavior in such a case would be to fail with an error - is it intentional that this does not happen?

And is there a known workaround for this problem that does not involve running ls in a separate window or modifying snakemake's code (as mentioned in #39)?
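To make the question concrete, here is a minimal illustrative sketch (not snakemake's actual code; the function names and the cache object are hypothetical) contrasting a latency-wait-bounded check with the unbounded wait on a cached existence flag described above:

```python
import os
import time

def wait_for_output(path, latency_wait=30, poll_interval=1):
    """Expected behavior: poll the filesystem for at most latency_wait seconds, then fail."""
    deadline = time.time() + latency_wait
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    raise RuntimeError(f"{path} not present after {latency_wait}s (latency-wait exceeded)")

def wait_on_cached_existence(path, cache, poll_interval=1):
    """Reported behavior: keep polling a cached existence flag (dict-like here) with no
    upper bound, so a stale cache can stall the workflow for hours."""
    while not cache.get(path, False):
        time.sleep(poll_interval)
    return True
```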

@Dmitry-Antipov Dmitry-Antipov added the bug Something isn't working label Mar 7, 2024
@cmeesters
Contributor

Please list the plugins and the plugin versions you are using.

@Dmitry-Antipov
Author

We were finally able to resolve the delay problem. It happened because --max-status-checks-per-second was set to 0.02, so it took ~5 hours just to check the status of the swarm of ~400 jobs.

I still do not understand why the filesystem timestamp differs from both the slurm-reported job end time and snakemake's job end time, but that is not so important (at least for me). No plugins were used, only snakemake's native slurm support. snakemake v7.30.1.
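For reference, the arithmetic behind the roughly five-hour delay (a back-of-the-envelope sketch, assuming one status check per job at the stated rate):

```python
# Time for one status sweep over the whole swarm at the reported settings.
jobs = 400
checks_per_second = 0.02   # value of --max-status-checks-per-second
sweep_seconds = jobs / checks_per_second
print(sweep_seconds / 3600)  # ~5.6 hours to check every job once
```

Raising --max-status-checks-per-second brings a full sweep down from hours to minutes.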

@cmeesters
Contributor

Ah, that's a good one: I will make a note for the SLURM executor plugin documentation. Thank you.

@conchoecia
Contributor

conchoecia commented May 8, 2024

Possibly related to: #2496

@cmeesters
Contributor

@conchoecia no: overwriting the job name will cause the job state testing to fail. What is needed is a mechanism to prevent users from overwriting it.
