SampleData frontend for sgkit/xarray dataset #728

jeromekelleher · 2022-10-11T08:46:37Z

An evolutionary step towards getting rid of our SampleData file format is to use it as a frontend for an sgkit zarr dataset. The workflow would be:

1. Obtain VCF
2. Convert to sgkit
3. QC the dataset to mark sites used for ancestral haplotype inference, import ancestral states, etc. using sgkit, and save to zarr (variable names used for these columns TBD)
4. Create a SampleData instance using a method like from_sgkit, which provides a thin wrapper around this sgkit dataset. Do not create a copy of the data.

I imagine we could do this reasonably straightforwardly, with something like this:

    import zarr
    from tsinfer import SampleData

    class SgkitSampleData(SampleData):
        def __init__(self, path):
            # Open the sgkit zarr store read-only; nothing is copied here.
            self.data = zarr.open(path, mode="r")  # or whatever
            # Check for presence of required variables in the zarr, raising an informative error if missing.

        # Override properties to access the differently (but largely equivalently) formatted data
        @property
        def sites_genotypes(self):
            return self.data["call_genotype"]  # May need to be reshaped with a view?
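
The two comments in the sketch above (checking for required variables, and reshaping `call_genotype`) could look roughly like the following. This is a minimal sketch, not tsinfer's implementation: a plain dict of NumPy arrays stands in for the opened zarr group, the `REQUIRED_VARIABLES` tuple and both helper names are hypothetical, and it assumes `call_genotype` has the (variants, samples, ploidy) layout.

```python
import numpy as np

# Hypothetical list of variables the wrapper would insist on; the real
# set of required names was still to be decided in this discussion.
REQUIRED_VARIABLES = ("call_genotype", "variant_position", "variant_allele")


def check_required_variables(store, required=REQUIRED_VARIABLES):
    """Raise an informative error if any required variable is missing."""
    missing = [name for name in required if name not in store]
    if missing:
        raise ValueError(
            "Dataset is missing required variables: " + ", ".join(missing)
        )


def flatten_ploidy(call_genotype):
    """View (variants, samples, ploidy) genotypes as (variants, haplotypes)."""
    g = np.asarray(call_genotype)
    # For a C-contiguous array, reshape returns a view rather than a copy.
    return g.reshape(g.shape[0], -1)


# Stand-in for a zarr group: 2 sites, 2 diploid samples.
store = {
    "call_genotype": np.array(
        [[[0, 1], [1, 1]], [[0, 0], [1, 0]]], dtype=np.int8
    ),
    "variant_position": np.array([10, 20]),
    "variant_allele": np.array([["A", "T"], ["G", "C"]]),
}
check_required_variables(store)
haplotype_genotypes = flatten_ploidy(store["call_genotype"])
```

With a real zarr array, indexing (for example a chunk-sized slice) reads the data into a NumPy array first, so the no-copy property applies to that in-memory chunk rather than to the on-disk store.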

The tricky bits will be dealing with all the extra stuff like individuals, provenances etc, but I think we can probably add the required variables to the dataset.

@benjeffery - any thoughts on feasibility here?

jeromekelleher · 2022-10-11T08:52:12Z

The nice thing here is that we don't get any immediate software dependency on sgkit, we're just using a zarr with a slightly different format as the storage backend.

hyanwong · 2022-10-11T09:00:33Z

Re sites used for inference: we now determine those at runtime, when generating ancestors, dependent on the number of alleles defined (must be 2 for inference) and whether or not the site position is in an excluded list.

So the SampleData dataset need not mark sites, but we need to check that the number of alleles that we get out of the sgkit dataset is as expected.
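
As an illustration of that runtime rule, here is a minimal sketch in plain Python. The function name `inference_site_mask` and its arguments are hypothetical, it implements only the two criteria mentioned here (exactly 2 alleles, position not excluded), and it assumes that unused allele slots are padded with empty strings, which is an assumption about how the sgkit dataset was written.

```python
def inference_site_mask(positions, alleles, exclude_positions=()):
    """Return one boolean per site: True if the site may be used for inference."""
    excluded = set(exclude_positions)
    mask = []
    for position, site_alleles in zip(positions, alleles):
        # Count only real alleles; "" is assumed to be padding.
        num_alleles = sum(1 for allele in site_alleles if allele != "")
        mask.append(num_alleles == 2 and position not in excluded)
    return mask


positions = [10, 20, 30, 40]
alleles = [
    ("A", "T", ""),   # biallelic: usable
    ("G", "C", "T"),  # three alleles: not usable
    ("C", "", ""),    # one allele: not usable
    ("T", "A", ""),   # biallelic, but excluded by position below
]
mask = inference_site_mask(positions, alleles, exclude_positions=[40])
```

Doing the allele count at this point is also a natural place to check that the number of alleles coming out of the sgkit dataset is as expected.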

Re the ancestral states: I'm assuming that we want to address https://github.com/pystatgen/sgkit/issues/585 in sgkit to get this to work nicely?

jeromekelleher · 2022-10-11T09:04:33Z

We don't need any upstream changes in sgkit, we just need to decide on a variable name to use by default and provide an API for overriding this default.
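
One way to read this is a keyword argument with a default. A minimal sketch, in which the function name `get_ancestral_states`, the parameter `ancestral_state_name`, and the default name `variant_ancestral_allele` are all placeholders rather than names agreed in this thread:

```python
# Placeholder default; the actual variable name was still to be decided.
DEFAULT_ANCESTRAL_STATE_NAME = "variant_ancestral_allele"


def get_ancestral_states(store, ancestral_state_name=DEFAULT_ANCESTRAL_STATE_NAME):
    """Look up the ancestral-state variable, allowing the name to be overridden."""
    if ancestral_state_name not in store:
        raise KeyError(
            f"No variable named {ancestral_state_name!r} in the dataset; "
            "pass ancestral_state_name=... to use a different one"
        )
    return store[ancestral_state_name]


# A dict stands in for the zarr group here.
store = {"variant_ancestral_allele": ["A", "G"], "my_aa": ["T", "G"]}
default_states = get_ancestral_states(store)
custom_states = get_ancestral_states(store, ancestral_state_name="my_aa")
```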

jeromekelleher · 2023-05-15T12:37:57Z

Done in #791 and related PRs

jeromekelleher mentioned this issue Oct 11, 2022

Allow sample data and ancestors files to store their time units #696

Open

jeromekelleher mentioned this issue Oct 11, 2022

Unify SampleData and AncestorsData file formats #729

Open

hyanwong added this to the Release 0.4.0 milestone Oct 26, 2022

jeromekelleher closed this as completed May 15, 2023
