The best way to append samples to a dataset? #1295

Open
@rajwanir

Hi,

I am working with array datasets and wish to concatenate samples across multiple zarr stores (>100). Since these are genotyping array datasets, they differ only in the sample dimension; everything else is identical. Is there a built-in, optimized function in sgkit to do that? I see concat_zarrs in the documentation (version 0.6.0), but I cannot find its source code and it seems to be deprecated. In version 0.9.0, I see concat_zarrs_optimized, but again I cannot find its source code or documentation.

Alternatively, I am simply trying the following:

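import xarray as xr
import sgkit as sg

# dslist: list of paths to the input zarr stores (defined elsewhere)
# Variables that have a "samples" dimension and therefore need concatenating
variables_to_concat = ['call_GQ', 'call_IGC', 'call_LRR', 'call_NORMX', 'call_NORMY', 'call_R', 'call_THETA', 'call_X', 'call_Y', 'call_genotype', 'call_genotype_mask', 'call_genotype_phased', 'sample_id']
ds = xr.open_mfdataset(dslist, engine="zarr", concat_dim="samples", combine="nested", data_vars=variables_to_concat)
# Rechunk so that chunks along "samples" are uniform before writing
ds = ds.chunk(chunks={"samples": 100})
sg.save_dataset(ds, "samples.zarr")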

Zarr expects a uniform chunk size along each dimension, and the rechunking seems to be expensive. I have read the discussions on sgkit and see that you have already encountered this issue. I wanted to check whether there is an optimized function within sgkit for this kind of concatenation.
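To illustrate why rechunking is needed, here is a minimal sketch in plain Python. It assumes each input store is already chunked along the samples dimension with the same chunk size, and that only the final chunk of a Zarr array may be smaller than the rest. The helper names and the example store sizes are hypothetical and are not part of sgkit, xarray, or zarr.

```python
def concatenated_chunks(store_sizes, chunk_size):
    """Chunk lengths along "samples" after naively concatenating the stores.

    Each store contributes its own full chunks plus one partial chunk
    if its sample count is not a multiple of chunk_size.
    """
    chunks = []
    for n_samples in store_sizes:
        full, remainder = divmod(n_samples, chunk_size)
        chunks.extend([chunk_size] * full)
        if remainder:
            chunks.append(remainder)
    return chunks


def rechunked_chunks(total_samples, chunk_size):
    """Chunk lengths after rechunking the concatenated axis uniformly."""
    full, remainder = divmod(total_samples, chunk_size)
    return [chunk_size] * full + ([remainder] if remainder else [])


def is_zarr_compatible(chunks):
    """True if every chunk has the same length, except that the last may be smaller."""
    if not chunks:
        return True
    first = chunks[0]
    return all(c == first for c in chunks[:-1]) and chunks[-1] <= first


# Hypothetical example: three stores with 250, 250 and 130 samples.
sizes = [250, 250, 130]
naive = concatenated_chunks(sizes, 100)
uniform = rechunked_chunks(sum(sizes), 100)
print(naive, is_zarr_compatible(naive))
print(uniform, is_zarr_compatible(uniform))
```

The partial chunks left at the end of each input store land in the middle of the concatenated axis, which is what forces the rechunk before writing. If every store except the last had a sample count that is an exact multiple of the chunk size, the naive layout would already be uniform.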

I would highly appreciate any suggestions or pointers.

Thank you.
