The best way to append samples to a dataset?

Hi,

I am working with array datasets and wish to concatenate samples across multiple Zarr stores (>100). Since these are genotyping array datasets, they differ only in the samples dimension; everything else is identical. Is there a built-in, optimized function in sgkit to do that? I see [`concat_zarrs`](https://sgkit-dev.github.io/sgkit/0.6.0/generated/sgkit.io.vcf.concat_zarrs.html) in the documentation (version 0.6.0), but I cannot find its source code and it seems to be deprecated. In version 0.9.0, I see `concat_zarrs_optimized`, but again I cannot find its source code or documentation.

Alternatively, I am simply trying the following:

````python
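import xarray as xr
import sgkit as sg

# dslist: list of paths to the per-batch Zarr stores (defined elsewhere)
# Variables to concatenate along the "samples" dimension
variables_to_concat = ["call_GQ", "call_IGC", "call_LRR", "call_NORMX", "call_NORMY", "call_R", "call_THETA",
                       "call_X", "call_Y", "call_genotype", "call_genotype_mask", "call_genotype_phased", "sample_id"]
ds = xr.open_mfdataset(dslist, concat_dim="samples", combine="nested", data_vars=variables_to_concat)
# Rechunk to a uniform chunk size along "samples" before writing
ds = ds.chunk(chunks={"samples": 100})
sg.save_dataset(ds, "samples.zarr")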
````

Zarr expects a uniform chunk size, and the rechunking seems to be expensive. I read the discussions on sgkit and see that you have already encountered this issue. I wanted to check whether there is an optimized function within sgkit for this kind of concatenation.
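To make the chunking problem concrete, here is a minimal plain-Python sketch. It is not sgkit or xarray code: the helper names and the per-store sample counts are made up for illustration, and it assumes each store is chunked at 100 samples. It shows that naive concatenation leaves short chunks at every store boundary, which is why a rechunk is needed before writing.

```python
def concatenated_chunks(store_sizes, store_chunk):
    """Chunk lengths along `samples` after naively concatenating stores
    that are each chunked at `store_chunk` (hypothetical helper)."""
    chunks = []
    for n_samples in store_sizes:
        full, rest = divmod(n_samples, store_chunk)
        chunks.extend([store_chunk] * full)
        if rest:
            # Each store ends with a short chunk unless its size is a multiple
            chunks.append(rest)
    return tuple(chunks)


def is_uniform(chunks):
    """True if all chunks are equal, except that the last may be smaller."""
    if not chunks:
        return True
    first = chunks[0]
    return all(c == first for c in chunks[:-1]) and chunks[-1] <= first


def uniform_chunks(total, chunk):
    """Chunk lengths after rechunking `total` samples to a uniform `chunk`."""
    full, rest = divmod(total, chunk)
    return (chunk,) * full + ((rest,) if rest else ())


# Made-up sample counts for three stores
store_sizes = [250, 130, 96]
naive = concatenated_chunks(store_sizes, 100)
target = uniform_chunks(sum(store_sizes), 100)

print(naive, is_uniform(naive))
print(target, is_uniform(target))
```

With more than 100 stores, almost every store boundary introduces a short chunk, so the rechunk has to move data across nearly all of them.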

I would highly appreciate any suggestions or pointers.

Thank you.

The best way to append samples to a dataset? #1295

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

The best way to append samples to a dataset? #1295

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions