Description
Hi,
I am working with genotyping array datasets and wish to concatenate samples across multiple Zarr stores (>100). Because these are genotyping array datasets, the stores differ only in the samples dimension; everything else is identical. Is there a built-in, optimized function in sgkit to do that? I see concat_zarrs in the documentation (version 0.6.0), but I cannot find its source code and it seems to be deprecated. In version 0.9.0, I see concat_zarrs_optimized, but again I cannot find its source code or documentation.
Alternatively, I am simply trying the following:
import xarray as xr
import sgkit as sg
# dslist: list of paths to the input Zarr stores (defined elsewhere)
variables_to_concat = [
    "call_GQ", "call_IGC", "call_LRR", "call_NORMX", "call_NORMY", "call_R", "call_THETA",
    "call_X", "call_Y", "call_genotype", "call_genotype_mask", "call_genotype_phased", "sample_id",
]
ds = xr.open_mfdataset(dslist, engine="zarr", concat_dim="samples", combine="nested", data_vars=variables_to_concat)
# Rechunk to a uniform size along samples before writing
ds = ds.chunk(chunks={"samples": 100})
sg.save_dataset(ds, "samples.zarr")
Zarr expects a uniform chunk size along each dimension, and the rechunking needed to get there seems to be expensive. I read the discussions on the sgkit repository and see that you have already encountered this issue, so I wanted to check whether there is an optimized function within sgkit for this kind of concatenation.
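To make the cost concrete, here is a small pure-Python sketch of why the rechunk is expensive, as far as I understand it. The helper `rechunk_plan` and the per-store sample counts are hypothetical and only for illustration; this is not sgkit, xarray, or Zarr code. It shows that when the input stores do not line up with the target chunk size, most uniform output chunks have to read from more than one input store.

```python
from itertools import accumulate


def rechunk_plan(store_sizes, target):
    """For each uniform output chunk, list the input stores it overlaps."""
    # Cumulative sample offsets where each input store starts and ends.
    bounds = [0, *accumulate(store_sizes)]
    total = bounds[-1]
    plan = []
    for start in range(0, total, target):
        stop = min(start + target, total)
        # A store contributes if its sample range intersects [start, stop).
        sources = [
            i
            for i in range(len(store_sizes))
            if bounds[i] < stop and bounds[i + 1] > start
        ]
        plan.append(((start, stop), sources))
    return plan


# Hypothetical per-store sample counts; real values come from each store.
store_sizes = [96, 96, 48, 120]
for (start, stop), sources in rechunk_plan(store_sizes, target=100):
    print(f"output chunk [{start}:{stop}) reads from stores {sources}")
```

With these made-up sizes, three of the four output chunks straddle two input stores, which is the kind of cross-store shuffling I suspect makes the rechunk slow.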
I would highly appreciate any suggestions or pointers.
Thank you.