Split the concat_zarrs step to avoid very large dask task counts #1034
Thanks for this! Does this change fix your problem? Do you think we should keep the old behaviour as a default for smaller datasets or doesn't it make much difference?
With this change the conversion of this dataset was successful!
I'll run a quick comparison on a smaller VCF.
I've done a test on chr 22 of the 1k genomes data, there is no difference in runtime.
+1
Pending fix to pre-commit.
BTW - I tried bigger blocks to reduce the number of dask tasks needed, but could only go 2.5x bigger than the default before I hit a RAM limit in the first VCF parsing step, so it is a trade-off between these two steps.
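To put rough numbers on that trade-off, here is a back-of-the-envelope sketch that is not taken from the PR. It assumes the task count scales with the number of chunks per variable times the number of variables; `estimated_chunks` and every number below (variant count, variable count, chunk lengths) are made-up illustrative values, not the project's defaults.

```python
import math


def estimated_chunks(n_variants: int, chunk_length: int, n_variables: int) -> int:
    """Rough chunk count: one chunk per block of variants, per variable."""
    return math.ceil(n_variants / chunk_length) * n_variables


# Hypothetical values for illustration only.
n_variants = 100_000_000
n_variables = 20
default_chunk = 10_000
bigger_chunk = 25_000  # 2.5x the hypothetical default

default_count = estimated_chunks(n_variants, default_chunk, n_variables)
bigger_count = estimated_chunks(n_variants, bigger_chunk, n_variables)

# Under this assumption, blocks that are 2.5x bigger give about 2.5x fewer chunks.
print(default_count, bigger_count, default_count / bigger_count)
```

Under these assumptions the chunk count only shrinks linearly with block size, so the RAM limit in the parsing step caps how far this route can go.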
cb57444 to 8179307
This can go in after #1041 fixes the build
@mergify rebase
✅ Branch has been successfully rebased
8179307 to 3b5251d
Codecov Report
@@            Coverage Diff            @@
##              main     #1034   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           49        49
  Lines         4549      4545    -4
=========================================
- Hits          4549      4545    -4
When converting very large VCFs I'm running out of RAM when constructing the task graph for the concat_zarrs step. This PR breaks the task graph into independent steps.
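The idea can be illustrated with a toy model. This is a minimal sketch in plain Python, not the code in this PR and not dask itself: a "graph" here is just a dict of zero-argument callables, and `build_graph`, `run`, `concat_single_graph`, and `concat_per_variable` are hypothetical names. It assumes the variables can be concatenated independently of each other, and compares the largest graph held in memory when everything is built up front against building and running one variable at a time.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for a task graph: task key -> zero-argument callable.
Graph = Dict[Tuple[str, int], Callable[[], int]]


def build_graph(variable: str, n_chunks: int) -> Graph:
    """Build one task per chunk of a single variable."""
    return {(variable, i): (lambda i=i: i) for i in range(n_chunks)}


def run(graph: Graph) -> int:
    """Execute every task in the graph and return how many ran."""
    return sum(1 for task in graph.values() if task() is not None)


def concat_single_graph(variables: List[str], n_chunks: int) -> Tuple[int, int]:
    """Build one graph covering all variables, then run it once."""
    graph: Graph = {}
    for variable in variables:
        graph.update(build_graph(variable, n_chunks))
    peak = len(graph)  # the whole graph is in memory at once
    return run(graph), peak


def concat_per_variable(variables: List[str], n_chunks: int) -> Tuple[int, int]:
    """Build and run a separate graph for each variable in turn."""
    done, peak = 0, 0
    for variable in variables:
        graph = build_graph(variable, n_chunks)
        peak = max(peak, len(graph))  # only one variable's graph at a time
        done += run(graph)
    return done, peak


# Illustrative variable names and chunk count.
variables = ["call_genotype", "variant_position", "variant_allele"]
print(concat_single_graph(variables, 1000))
print(concat_per_variable(variables, 1000))
```

Both versions run the same number of tasks, but in the per-variable version the largest graph held at any time is that of a single variable, which is the effect the split is aiming for.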