SampleData frontend for sgkit/xarray dataset #728

Closed
jeromekelleher opened this issue Oct 11, 2022 · 4 comments

@jeromekelleher
Member

jeromekelleher commented Oct 11, 2022

An evolutionary step towards getting rid of our SampleData file format is to use the SampleData class as a frontend for an sgkit zarr dataset. The workflow would be:

  1. Obtain VCF
  2. Convert to sgkit
  3. QC the dataset using sgkit (mark the sites used for ancestral haplotype inference, import ancestral states, etc.) and save to zarr (variable names used for these columns TBD)
  4. Create a SampleData instance using a method like from_sgkit, which provides a thin wrapper around this sgkit dataset without creating a copy of the data.

I imagine we could do this reasonably straightforwardly, with something like this:

    import zarr
    from tsinfer import SampleData

    class SgkitSampleData(SampleData):
        def __init__(self, path):
            self.data = zarr.open(path, mode="r")  # or whatever
            # Check for presence of required variables in the zarr,
            # raising an informative error if any are missing.

        # Override properties to access the differently (but largely
        # equivalently) formatted data.
        @property
        def sites_genotypes(self):
            # May need to be reshaped with a view?
            return self.data["call_genotype"]
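
The "check for presence of required variables" step could look roughly like the sketch below. It is only an illustration: the helper name `check_required_variables` is hypothetical, and the variable names are assumed from sgkit's naming conventions, since the thread leaves the final set of required variables open. It only relies on the store supporting `in`, so a plain dict stands in for the opened zarr group here.

```python
from collections.abc import Mapping

# Assumed variable names, following sgkit naming conventions; the thread
# does not fix the final list of required variables.
REQUIRED_VARIABLES = ("call_genotype", "variant_position", "variant_allele")


def check_required_variables(store: Mapping, required=REQUIRED_VARIABLES) -> None:
    """Raise a ValueError naming every required variable missing from store."""
    missing = [name for name in required if name not in store]
    if missing:
        raise ValueError(
            "Dataset is missing required variables: " + ", ".join(missing)
        )


# Usage with a dict standing in for the zarr group.
check_required_variables(
    {"call_genotype": [], "variant_position": [], "variant_allele": []}
)
```

Reporting all missing names at once, rather than failing on the first, saves a round trip when a dataset lacks several variables.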

The tricky bits will be dealing with all the extra stuff like individuals, provenances etc, but I think we can probably add the required variables to the dataset.

@benjeffery - any thoughts on feasibility here?

@jeromekelleher
Member Author

The nice thing here is that we don't get any immediate software dependency on sgkit, we're just using a zarr with a slightly different format as the storage backend.

@hyanwong
Member

Re sites used for inference, we now determine those at runtime, when generating ancestors, depending on the number of alleles defined (must be 2 for inference) and whether or not the site position is in an excluded list.

So the SampleData dataset need not mark sites, but we need to check that the number of alleles that we get out of the sgkit dataset is as expected.
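
As a rough illustration of the two runtime criteria described above (exactly 2 alleles, and position not in the excluded list), here is a minimal sketch. The function name `inference_site_mask` is hypothetical, and it assumes that unused allele slots are padded with empty strings or `None`; the actual selection logic may apply further criteria not mentioned in this thread.

```python
def inference_site_mask(positions, alleles, exclude_positions=()):
    """Return one boolean per site: True if the site would be used for inference.

    A site qualifies when it has exactly 2 defined alleles and its position
    is not in exclude_positions.
    """
    excluded = set(exclude_positions)
    mask = []
    for position, site_alleles in zip(positions, alleles):
        # Assumption: unused allele slots are padded with "" or None.
        num_alleles = sum(1 for allele in site_alleles if allele not in ("", None))
        mask.append(num_alleles == 2 and position not in excluded)
    return mask


positions = [10, 20, 30, 40]
alleles = [("A", "T", ""), ("A", "C", "G"), ("G", "", ""), ("C", "T", "")]
mask = inference_site_mask(positions, alleles, exclude_positions=[40])
print(mask)  # [True, False, False, False]
```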

Re the ancestral states, I'm assuming that we want to address https://github.com/pystatgen/sgkit/issues/585 in sgkit to get this to work nicely?

@jeromekelleher
Member Author

We don't need any upstream changes in sgkit, we just need to decide on a variable name to use by default and provide an API for overriding this default.
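
One possible shape for "a default variable name plus an API for overriding it" is sketched below. Everything named here is an assumption for illustration: the default `variant_ancestral_allele` and the function `ancestral_states` are hypothetical, not names agreed in this thread, and a dict again stands in for the dataset.

```python
# Hypothetical default; the thread has not yet decided on a variable name.
DEFAULT_ANCESTRAL_STATE_VARIABLE = "variant_ancestral_allele"


def ancestral_states(dataset, variable_name=DEFAULT_ANCESTRAL_STATE_VARIABLE):
    """Return the ancestral state array stored under variable_name."""
    if variable_name not in dataset:
        raise ValueError(
            f"Ancestral state variable '{variable_name}' not found in dataset; "
            "pass variable_name to use a different one."
        )
    return dataset[variable_name]


# Default name, then an override for a dataset that uses another name.
print(ancestral_states({"variant_ancestral_allele": ["A", "C"]}))
print(ancestral_states({"my_ancestral": ["G", "T"]}, variable_name="my_ancestral"))
```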

@jeromekelleher
Member Author

Done in #791 and related PRs
