
bcf conversion memory requirement #661

Answered by tomwhite
alxsimon asked this question in Q&A

Hi @alxsimon - thanks for raising this issue.

Could you explain to me what is going on under the hood?

The vcf_to_zarr function partitions the VCF (or BCF in this case) into a set of contiguous regions that cover the whole file, and then writes an intermediate Zarr file for each partition, in parallel (using Dask). The intermediate Zarr files are concatenated, rechunked, then written to the final output (again using Dask). We have had problems in the past with this rechunking step running out of memory, which is what seems to be happening in this case.

The underlying Dask issue has never been fixed (see dask/dask#6745), but we do have some (internal) code in sgkit that avoids this memory…
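
For illustration, a minimal sketch of such a conversion is shown below. The file paths are placeholders, and the tempdir / chunk_length / chunk_width parameters and the Dask memory settings are assumptions based on the sgkit VCF documentation rather than anything quoted in this discussion; check the vcf_to_zarr signature in your installed sgkit version before relying on them.

```python
# Sketch: convert a BCF to Zarr with sgkit, running the Dask-backed steps on a
# local cluster with an explicit per-worker memory limit. Parameter names are
# assumptions taken from the sgkit VCF docs and may differ between versions.
from dask.distributed import Client
from sgkit.io.vcf import vcf_to_zarr  # requires the sgkit "vcf" extra (cyvcf2)

if __name__ == "__main__":
    # Limit concurrency and per-worker memory so the parallel write and
    # rechunking steps are less likely to exhaust RAM.
    client = Client(n_workers=4, threads_per_worker=1, memory_limit="4GB")

    vcf_to_zarr(
        "input.bcf",          # hypothetical input path
        "output.zarr",        # hypothetical output store
        tempdir="scratch",    # where intermediate per-partition Zarr files go
        chunk_length=10_000,  # variants per chunk in the final store
        chunk_width=1_000,    # samples per chunk in the final store
    )

    client.close()
```

Creating the distributed Client before calling vcf_to_zarr makes it the default Dask scheduler, so the per-partition writes and the rechunking run under the memory limits set above.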
