Split the concat_zarrs step to avoid very large dask task counts #1034

Merged: 1 commit, Mar 2, 2023

Conversation

benjeffery
Collaborator

When converting very large VCFs, I run out of RAM while constructing the task graph for the concat_zarrs step. This PR breaks the task graph into independent steps.
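A minimal sketch of why splitting can help, using plain Python stand-ins rather than dask or the code in this PR. The variable names, chunk counts, and the `build_graph`/`run` helpers are hypothetical; the point is only that building and running one variable at a time keeps the peak number of live tasks at the largest single graph rather than the sum of all of them.

```python
def build_graph(name, n_chunks):
    """Stand-in for a lazy task graph: one zero-argument task per chunk."""
    return [lambda name=name, i=i: (name, i) for i in range(n_chunks)]


def run(graph):
    """Stand-in for the scheduler: execute every task and collect the results."""
    return [task() for task in graph]


# Hypothetical variables and chunk counts, for illustration only.
variables = {"call_genotype": 6, "variant_position": 3, "variant_allele": 3}

# Single graph: build the tasks for every variable, then execute them together.
combined = [t for name, n in variables.items() for t in build_graph(name, n)]
peak_combined = len(combined)
results_combined = run(combined)

# Split graphs: build and execute one variable at a time.
peak_split = 0
results_split = []
for name, n in variables.items():
    graph = build_graph(name, n)
    peak_split = max(peak_split, len(graph))
    results_split.extend(run(graph))

print(peak_combined, peak_split)
```

Both approaches produce the same results here; only the number of tasks held at once differs.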

@tomwhite
Collaborator

Thanks for this! Does this change fix your problem?

Do you think we should keep the old behaviour as a default for smaller datasets or doesn't it make much difference?

@benjeffery
Collaborator Author

benjeffery commented Feb 28, 2023

With this change the conversion of this dataset was successful!
[Image: Screenshot from 2023-02-28 10-42-57]
(Excuse the screenshot; I can't cut and paste because the security protocols won't let me!)
Thanks for the amazing work on the vcf conversion code @tomwhite, it has worked very well.

@benjeffery
Collaborator Author

> Do you think we should keep the old behaviour as a default for smaller datasets or doesn't it make much difference?

I'll run a quick comparison on a smaller VCF.

@benjeffery
Collaborator Author

I've run a test on chr 22 of the 1000 Genomes data; there is no difference in runtime.

Collaborator

@tomwhite tomwhite left a comment


+1

Pending fix to pre-commit.

@benjeffery
Collaborator Author

BTW, I tried bigger blocks to reduce the number of dask tasks needed, but could only go 2.5x bigger than the default before I hit a RAM limit in the first VCF parsing step, so it is a trade-off between these two steps.
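The trade-off can be sketched with rough arithmetic. Everything below is an assumption for illustration: the dataset size, the baseline chunk length, and the simple memory model (one byte per allele call, diploid samples) are hypothetical and are not taken from this PR or from sgkit's defaults. Only the 2.5x factor comes from the comment above.

```python
import math


def n_chunks(n_variants, chunk_length):
    """Number of chunks along the variants dimension, a rough proxy for task count."""
    return math.ceil(n_variants / chunk_length)


def chunk_bytes(chunk_length, n_samples, ploidy=2, itemsize=1):
    """Approximate size of one genotype chunk held in memory while parsing."""
    return chunk_length * n_samples * ploidy * itemsize


# Hypothetical dataset and chunk sizes, for illustration only.
n_variants = 10_000_000
n_samples = 100_000
base_chunk = 10_000
bigger_chunk = int(base_chunk * 2.5)  # the 2.5x increase mentioned above

for chunk_length in (base_chunk, bigger_chunk):
    print(
        chunk_length,
        n_chunks(n_variants, chunk_length),
        chunk_bytes(chunk_length, n_samples),
    )
```

Under these assumptions, a 2.5x larger chunk cuts the chunk count by 2.5x but raises the per-chunk memory by the same factor, which is the tension between the parsing step and the concatenation step.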

@tomwhite tomwhite added the auto-merge Auto merge label for mergify test flight label Feb 28, 2023
@tomwhite
Collaborator

tomwhite commented Mar 1, 2023

This can go in after #1041 fixes the build

@benjeffery
Collaborator Author

@mergify rebase

@mergify
Contributor

mergify bot commented Mar 2, 2023

⚠️ This pull request got rebased on behalf of a random user of the organization.
This behavior will change on the 1st February 2023, Mergify will pick the author of the pull request instead.

To get the future behavior now, you can configure bot_account options (e.g.: bot_account: { author } or update_bot_account: { author }).

Or you can create a dedicated github account for squash and rebase operations, and use it in different bot_account options.

@mergify
Contributor

mergify bot commented Mar 2, 2023

rebase

✅ Branch has been successfully rebased

@codecov-commenter

codecov-commenter commented Mar 2, 2023

Codecov Report

Merging #1034 (3b5251d) into main (2db3a38) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main     #1034   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           49        49           
  Lines         4549      4545    -4     
=========================================
- Hits          4549      4545    -4     

Impacted Files               Coverage Δ
sgkit/io/vcfzarr_reader.py   100.00% <100.00%> (ø)


@mergify mergify bot merged commit 16d187c into sgkit-dev:main Mar 2, 2023