
fix: efficient job status checking when using DRMAA API (this should yield much better parallelization and performance when using --drmaa) #1156

Merged
merged 3 commits into main from drmaa-get-status on Sep 3, 2021

Conversation

johanneskoester
Contributor

@johanneskoester johanneskoester commented Aug 27, 2021

Description

see above

QC

  • The PR contains a test case for the changes or the changes are already covered by an existing test case.
  • The documentation (docs/) is updated to reflect the changes or this is not necessary (e.g. if the change does neither modify the language nor the behavior or functionalities of Snakemake).

@marrip

marrip commented Aug 30, 2021

I tried running this branch on our infrastructure and it seems something is causing an error:

Traceback (most recent call last):
  File "/home/marrip/snakemake_drmaa_test/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 727, in _wait_thread
    self._wait_for_jobs()
  File "/home/marrip/snakemake_drmaa_test/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 1458, in _wait_for_jobs
    suspended_msg.remove(active_job.job.jobid)
KeyError: 1657

@johanneskoester
Contributor Author

johanneskoester commented Aug 31, 2021

I tried running this branch on our infrastructure and it seems something is causing an error:

Traceback (most recent call last):
  File "/home/marrip/snakemake_drmaa_test/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 727, in _wait_thread
    self._wait_for_jobs()
  File "/home/marrip/snakemake_drmaa_test/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 1458, in _wait_for_jobs
    suspended_msg.remove(active_job.job.jobid)
KeyError: 1657

Thanks for trying. I think this should be fixed now. Can you give it another try?
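The actual fix is not shown in the thread, but the traceback points at a common Python pitfall: `set.remove()` raises `KeyError` when the element is absent. A plausible sketch of the failure mode and its repair (the names `suspended_msg`, `note_running`, and `note_suspended` are hypothetical stand-ins, not Snakemake's actual code):

```python
# Hypothetical reconstruction: suspended_msg tracks job ids for which a
# "suspended" message has already been printed, so the message is not repeated.
suspended_msg = set()

def note_suspended(jobid):
    # Print the message only once per job, then remember the job id.
    if jobid not in suspended_msg:
        print(f"Job {jobid} was suspended by the cluster.")
        suspended_msg.add(jobid)

def note_running(jobid):
    # Original pattern was suspended_msg.remove(jobid), which raises KeyError
    # for a job that was never suspended (as in the traceback above).
    # set.discard() removes the id if present and is a no-op otherwise.
    suspended_msg.discard(jobid)

# A job that runs without ever having been suspended no longer crashes:
note_running(1657)
note_suspended(1657)
note_running(1657)
```

With `discard`, the status-tracking code tolerates jobs that transition straight to running without ever entering the suspended state.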

@sonarcloud

sonarcloud bot commented Aug 31, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0)
Vulnerabilities: A (0)
Security Hotspots: A (0)
Code Smells: A (0)

No Coverage information
0.0% Duplication

@marrip

marrip commented Sep 1, 2021

Hey Johannes,

Looking very promising. I submitted a total of 17847 jobs and atm it is running smoothly. I will keep an eye on it for a bit, but it seems to work 🎉

@marrip

marrip commented Sep 3, 2021

I have had the workflow running for about 2 days now, and it is down again to 20, sometimes 40 jobs, even though I can see it could submit more: the dependencies are met and I set a limit of 100 jobs. Yesterday, after 1 day, it still looked fine, but now the lag seems to be back...

@johanneskoester
Contributor Author

johanneskoester commented Sep 3, 2021

I have had the workflow running for about 2 days now, and it is down again to 20, sometimes 40 jobs, even though I can see it could submit more: the dependencies are met and I set a limit of 100 jobs. Yesterday, after 1 day, it still looked fine, but now the lag seems to be back...

Mhm, do you get any of these messages here: https://github.com/snakemake/snakemake/pull/1156/files#diff-438f3317205fd7130727d0589d2fc1a6c2e1f6fc48c2c04d354a8a09b91ba2f4R1447?

@marrip

marrip commented Sep 3, 2021

I checked the logs and found 14 of them over the total of 2 days of running it. I think that should not be the reason why it's lagging. But somehow the workflow recovered now and is back to 100 jobs. I am not quite sure what the problem was. Could it be related to many short, small jobs, such that Snakemake has problems "catching up" due to filesystem latency etc.?

@johanneskoester
Contributor Author

johanneskoester commented Sep 3, 2021

Yes, that makes sense. Maybe those jobs had finished, but the main process was then waiting for their output files to become visible. I am relieved to hear that it seems to work then :-). Let me merge this, but please contact me via Discord if other problems occur with DRMAA. I really want this to work as well as possible.
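The waiting described above (a job has finished on the cluster, but its output files are not yet visible to the main process because of shared-filesystem latency) can be sketched as a simple polling loop. This is a minimal illustration of the idea, not Snakemake's actual implementation; `wait_for_files` and its parameters are hypothetical:

```python
import os
import time

def wait_for_files(paths, latency_wait=5.0, poll_interval=0.1):
    """Return True once all paths exist, or False if they have not all
    appeared within latency_wait seconds (tolerating filesystem latency)."""
    deadline = time.monotonic() + latency_wait
    while time.monotonic() < deadline:
        if all(os.path.exists(p) for p in paths):
            return True
        time.sleep(poll_interval)
    # One final check after the deadline has passed.
    return all(os.path.exists(p) for p in paths)

# Example: a path that already exists is detected immediately.
print(wait_for_files([os.devnull], latency_wait=1.0))
```

Until such a wait succeeds, the scheduler cannot mark the job's consumers as ready, which is one way a backlog of short jobs can make the number of running jobs drop below the submission limit.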

@johanneskoester johanneskoester merged commit ac004cb into main Sep 3, 2021
6 checks passed
@johanneskoester johanneskoester deleted the drmaa-get-status branch Sep 3, 2021
@marrip

marrip commented Sep 6, 2021

Sure, I will keep you updated if I see any inconsistent behavior again. Thank you so much for your help and support!
