
fix: improved error handling for cluster status scripts and smarter job selector choice in case of cluster submission (use greedy for single jobs). #1142

Merged
6 commits merged on Aug 27, 2021

Conversation

johanneskoester
Contributor

@johanneskoester johanneskoester commented Aug 20, 2021

Description

See title. Could help with issue #759.

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes, or this is not necessary (e.g. if the change neither modifies the language nor the behavior or functionality of Snakemake).
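The "use greedy for single jobs" part of this PR can be illustrated with a small sketch. All names here are hypothetical, not Snakemake's actual scheduler code; the point is just the dispatch:

```python
def greedy_select(jobs):
    # Stub standing in for the cheap greedy job selector.
    return ("greedy", list(jobs))

def ilp_select(jobs):
    # Stub standing in for the ILP-based job selector.
    return ("ilp", list(jobs))

def select_jobs(candidate_jobs):
    # With a single pending job there is nothing to optimize, so the
    # expensive ILP solver is skipped in favor of the greedy selector.
    if len(candidate_jobs) == 1:
        return greedy_select(candidate_jobs)
    return ilp_select(candidate_jobs)
```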

johanneskoester and others added 2 commits Aug 20, 2021
@github-actions
Contributor

github-actions bot commented Aug 20, 2021

Please format your code with black: black snakemake tests/*.py.

@johanneskoester
Contributor Author

johanneskoester commented Aug 23, 2021

@vsoch do you have any idea why the GLS test suddenly fails here? I did not expect my changes to have any side effects on GLS.

@vsoch
Contributor

vsoch commented Aug 23, 2021

It looks like the last step in the GLS test is now using this greedy selector, so that's the change:

DEBUG snakemake.logging:logging.py:553 Using greedy selector because only single job has to be scheduled.

This one here:

And IIRC this step was added to make sure that we can handle the directory creation:

I think it probably has to do with the fact that in this new function we try to calculate a size for the input directory, which won't have one / possibly hasn't been created yet.

Which ultimately gets us here:

self._inputsize = sum(f.size for f in self.input)

and then calls this size_local function on it:

return self._iofile.size_local

So ultimately it's this call that is wrong and isn't aware of directories perhaps?

self.check_broken_symlink()

If I remember, we had to add a bunch of new logic to handle directories, and I think it would need to be added in here somewhere so we essentially don't stat something that is an input but doesn't exist yet.

That's my best guess without interacting with it directly. I would add directory handling, meaning some check using a @property. It looks like there is other logic in that file handling that case in other places!
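The guard described above (don't stat an input that is a directory or doesn't exist yet) could look roughly like this. This is a hypothetical helper, not Snakemake's actual size_local code:

```python
import os

def input_size(path):
    # Hypothetical helper: guard against inputs that are directories
    # or have not been created yet before calling stat on them.
    if not os.path.exists(path):
        # Input hasn't been created yet; report zero instead of raising.
        return 0
    if os.path.isdir(path):
        # A directory's st_size is not meaningful; sum the files inside.
        return sum(
            os.path.getsize(os.path.join(root, name))
            for root, _, names in os.walk(path)
            for name in names
        )
    return os.path.getsize(path)
```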

@sonarcloud

sonarcloud bot commented Aug 27, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: A, 0 bugs
Vulnerabilities: A, 0 vulnerabilities
Security Hotspots: A, 0 security hotspots
Code Smells: A, 0 code smells

No coverage information
0.0% duplication

@johanneskoester
Contributor Author

johanneskoester commented Aug 27, 2021

Thanks a lot @vsoch for figuring out the issue!

@johanneskoester johanneskoester merged commit 48d2dd9 into main Aug 27, 2021
6 checks passed
@johanneskoester johanneskoester deleted the cluster-statusscript-robustness branch Aug 27, 2021
@mbhall88
Member

mbhall88 commented Apr 18, 2022

I'm not sure if it's caused by these changes or not, but I've noticed recently that snakemake's status checking for cluster jobs is very delayed.

For example, in the LSF profile, if we can't get the job status using bjobs -o stat <jobid> - because it completed a while ago - then we go looking in the log file. I've noticed with a recent pipeline that has ~29k jobs that it spends a lot of time submitting jobs and then after a while, it tries to go back and check the status of the earlier jobs and they've now dropped out of the bjobs history.

I hope this makes sense? Basically I'm asking whether there is some kind of new prioritisation between submitting and status-checking?
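The status logic described above boils down to a two-step fallback. Here is a minimal sketch with the bjobs query and the log parsing injected as callables; the names are hypothetical, not the LSF profile's actual code:

```python
def cluster_job_status(jobid, query_bjobs, read_log_status):
    # Sketch of the fallback order described above: ask bjobs first,
    # and only when the job has dropped out of bjobs' history fall
    # back to parsing the job's cluster log file.
    status = query_bjobs(jobid)
    if status:
        return status
    return read_log_status(jobid)
```

If status checks are deferred long enough, the bjobs branch starts missing for older jobs and the slower log-file branch runs far more often, which matches the slowdown described.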

@SichongP
Contributor

SichongP commented Apr 18, 2022

@mbhall88
Member

mbhall88 commented Apr 18, 2022

Sort of. It looks at the last 100 lsb.events files, which means potentially opening 100 files. This is much slower than our current solution of looking in the job's cluster log file. Anyway, that's not really the point of my question. I just wanted to know whether the submission/status priority has changed.
