SampleData frontend for sgkit/xarray dataset #728
Comments
The nice thing here is that we don't get any immediate software dependency on sgkit, we're just using a zarr with a slightly different format as the storage backend.
Re sites used for inference, we now determine those at runtime, when generating ancestors, dependent on the number of alleles defined (must be 2 for inference) and whether or not the site position is in an excluded list. So the SampleData dataset need not mark sites, but we need to check that the number of alleles that we get out of the sgkit dataset is as expected. Re the ancestral states, I'm assuming that we want to address https://github.com/pystatgen/sgkit/issues/585 in sgkit to get this to work nicely?
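To make the runtime rule concrete, here is a minimal sketch of the site selection described above. The function name `inference_site_mask` and its parameters are hypothetical, not tsinfer's actual API, and it assumes the per-site allele counts have already been read out of the dataset.

```python
def inference_site_mask(num_alleles, positions, exclude_positions=()):
    """Return one boolean per site: True if the site can be used for inference.

    Hypothetical helper. A site qualifies when it has exactly two alleles
    and its position is not in the excluded list.
    """
    excluded = set(exclude_positions)
    return [
        n == 2 and pos not in excluded
        for n, pos in zip(num_alleles, positions)
    ]


# Five sites; the one at position 30 is biallelic but explicitly excluded.
mask = inference_site_mask(
    num_alleles=[2, 3, 2, 1, 2],
    positions=[10, 20, 30, 40, 50],
    exclude_positions=[30],
)
print(mask)
```

Because the mask is computed from the allele counts and the exclusion list alone, nothing needs to be stored in the dataset to mark sites.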
We don't need any upstream changes in sgkit, we just need to decide on a variable name to use by default and provide an API for overriding this default.
Done in #791 and related PRs.
An evolutionary step towards getting rid of our SampleData file format is to use it as a frontend for an sgkit zarr dataset. The workflow would be:
from_sgkit, which provides a thin wrapper around this sgkit dataset. Do not create a copy of the data.

I'm imagining that we could do this reasonably straightforwardly, something like this:
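A rough sketch of what such a thin wrapper might look like, under stated assumptions rather than as the actual implementation. The class name `SgkitBackedSampleData`, its property names, and the default ancestral-state variable name are placeholders chosen for illustration. `variant_position` and `variant_allele` follow sgkit's variable naming. A plain dict stands in for the dataset here; an xarray Dataset or zarr group would be indexed by name in the same way.

```python
class SgkitBackedSampleData:
    """Hypothetical read-only view over an sgkit-style dataset.

    `ds` is anything indexable by variable name. Only a reference is
    kept, so the underlying data is never copied.
    """

    # Assumed placeholder default; callers can override it per instance.
    DEFAULT_ANCESTRAL_STATE_NAME = "variant_ancestral_allele"

    def __init__(self, ds, ancestral_state_name=None):
        self._ds = ds
        self._ancestral_state_name = (
            ancestral_state_name or self.DEFAULT_ANCESTRAL_STATE_NAME
        )

    @property
    def num_sites(self):
        return len(self._ds["variant_position"])

    @property
    def sites_position(self):
        return self._ds["variant_position"]

    @property
    def sites_alleles(self):
        return self._ds["variant_allele"]

    @property
    def sites_ancestral_state(self):
        return self._ds[self._ancestral_state_name]


# Stand-in dataset with a non-default ancestral-state variable name.
ds = {
    "variant_position": [10, 20],
    "variant_allele": [["A", "T"], ["C", "G"]],
    "my_ancestral_allele": ["A", "C"],
}
sd = SgkitBackedSampleData(ds, ancestral_state_name="my_ancestral_allele")
print(sd.num_sites, sd.sites_ancestral_state)
```

Each property simply forwards to the named variable, so the wrapper holds no state of its own beyond the reference to the dataset and the chosen variable name.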
The tricky bits will be dealing with all the extra stuff like individuals, provenances etc, but I think we can probably add the required variables to the dataset.
@benjeffery - any thoughts on feasibility here?