Document data structures and design philosophy #87

Closed
hammer opened this issue Aug 3, 2020 · 10 comments
Labels: documentation (Improvements or additions to documentation)

@hammer
Contributor

hammer commented Aug 3, 2020

Now that #51 is in, it would be good to have some documentation to describe the data structures at the heart of sgkit and the design philosophy used to formulate them.

I will pick this up and work with @alimanfoo and @eric-czech to ensure I capture their thinking as the intellectual forebears of sgkit.

@hammer added the documentation label Aug 3, 2020
@hammer
Contributor Author

hammer commented Aug 3, 2020

@tomwhite notes http://xarray.pydata.org/en/stable/data-structures.html is a good example of this sort of documentation.

@hammer
Contributor Author

hammer commented Aug 7, 2020

Some thoughts from @eric-czech at https://github.com/pystatgen/sgkit/pull/78#issuecomment-669878845. As he notes in that comment, the design philosophy of sgkit right now is to treat xarray as a container for genetics data and to only check its shape and content with the various check_ calls when invoking a method. Because our methods don’t all hang off a central data structure, and each method can take a subset or transformation of the central data structure, it doesn’t make sense to center a data structure in the docs.
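The check-on-invocation approach described above can be sketched as follows. This is only an illustration under stated assumptions: `check_array_like` and `count_alt_alleles` are hypothetical names, and the variable name `call_genotype` and its dimensions are assumed for the example rather than taken from this thread.

```python
import numpy as np
import xarray as xr


def check_array_like(ds, name, dims, kinds):
    """Hypothetical validator: confirm `name` exists with the expected dims and dtype kind."""
    if name not in ds:
        raise ValueError(f"Dataset is missing required variable '{name}'")
    var = ds[name]
    if var.dims != dims:
        raise ValueError(f"'{name}' has dims {var.dims}, expected {dims}")
    if var.dtype.kind not in kinds:
        raise TypeError(f"'{name}' has dtype {var.dtype}, expected kind in {kinds!r}")


def count_alt_alleles(ds):
    """Illustrative method: validate only the variable it needs, then compute."""
    check_array_like(ds, "call_genotype", ("variants", "samples", "ploidy"), "iu")
    gt = ds["call_genotype"]
    # Count non-reference alleles per call; missing alleles (-1) are not counted.
    return (gt > 0).sum(dim="ploidy")


# A plain xarray Dataset acts as the container; no custom class is involved.
ds = xr.Dataset(
    {
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.array([[[0, 1], [1, 1]], [[0, 0], [-1, 1]]], dtype="int8"),
        )
    }
)
alt_counts = count_alt_alleles(ds)
```

Because the check lives inside the method, a caller can pass any subset or transformation of a larger Dataset as long as the variables that method needs are present and well-formed.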

@hammer
Contributor Author

hammer commented Aug 7, 2020

Some examples of documentation that centers the data structure with a diagram:

The latest Hail docs emphasize "input unification", but don't have a diagram for it.

@alimanfoo
Collaborator

alimanfoo commented Aug 7, 2020

This is somewhat off-the-wall, but I spent some time thinking about the web a few years back and enjoyed reading Roy Fielding's PhD dissertation on the design of REST. I know we're talking about something quite different here, but the approach of thinking about design in terms of adding constraints was something I found novel and interesting. Chapter 5 is probably the most relevant.

@eric-czech
Collaborator

I wanted to try to flesh this out a bit more so I started writing this description as if it was the sort of thing that would eventually live in our documentation somewhere. This is following up on https://github.com/pystatgen/sgkit/pull/78#issuecomment-669878845 and tries to explain some of that with much higher level context as well. I didn't want to go a lot further though without making sure we're all in agreement with at least this much. Here's what I've got so far:


Sgkit supports a variety of analytical methods for quantitative and population genetics using general-purpose frameworks such as Xarray, Dask, and Zarr. The intent of the sgkit API is to facilitate genetic analysis over large datasets while still offering seamless scaling down to smaller, experimental studies and new users. While traditional workflows of a similar nature often involve a heterogeneous mixture of algorithm implementations, programming languages, system dependencies and even hardware, sgkit strives to offer the same flexibility in a single distributed computing framework. This flexibility is largely a result of the capabilities already inherent to other Python libraries for scientific computing, and sgkit attempts to better adapt these capabilities to the genetics domain by formalizing conventions for common quantities, providing access to appropriate file formats, porting standard algorithms, and prioritizing documentation/examples that promote best practices.

The primary interface to sgkit functionality begins with the Xarray API. There are currently no data models in the library that attempt to capture the complexity of many (or even common) analyses and the data structures that would support them -- operations are applied solely to Xarray Dataset objects. Users are free to manipulate data within these objects as they see fit, but they must do so within the confines of a set of conventions for variable names, dimensions, and underlying data types. The example below illustrates a Dataset format that would result from an assay expressible as PLINK or BGEN. This is a guideline, however, and a Dataset seen in practice might include many more or fewer variables and dimensions.

[Screenshot: diagram of the example Dataset layout]
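As a rough sketch of what a Dataset following such conventions might look like in code, the snippet below builds a tiny example by hand. The variable names, dimension names, and dtypes are assumptions chosen for illustration; the thread itself does not list them.

```python
import numpy as np
import xarray as xr

# Assumed conventions for this sketch: calls are indexed by (variants, samples, ploidy),
# with -1 marking a missing allele; per-variant and per-sample metadata share those dims.
ds = xr.Dataset(
    data_vars={
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.array(
                [[[0, 0], [0, 1]], [[1, 1], [0, 1]], [[0, 1], [-1, -1]]],
                dtype="int8",
            ),
        ),
        "variant_position": ("variants", np.array([101, 250, 977], dtype="int32")),
        "variant_contig": ("variants", np.array([0, 0, 1], dtype="int16")),
        "sample_id": ("samples", np.array(["S0", "S1"])),
    },
    attrs={"contigs": ["chr1", "chr2"]},
)

# The Dataset is manipulated with the ordinary Xarray API,
# e.g. keeping only the variants on the first contig.
on_first_contig = np.flatnonzero(ds["variant_contig"].values == 0)
first_contig = ds.isel(variants=on_first_contig)
```

Printing `ds` shows the dimensions, variables, and dtypes at a glance, which is the kind of repr a user would inspect before calling a method.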


Let me know if you all think that's on the right track and I'll keep going at some point.

@jeromekelleher
Collaborator

jeromekelleher commented Aug 12, 2020

Looks good @eric-czech. The model in the diagram more-or-less applies to VCF data too, right?

@eric-czech
Collaborator

Yep. Perhaps it makes sense even at that introductory level to show an Xarray dataset for VCF, but just as the repr of an actual dataset and not a diagram.

@jeromekelleher
Collaborator

Yeah. I think it's a good idea to say that, across all the formats, we work with a dense variant matrix that looks like your diagram, but the exact details of what goes in the cells and the information we have about the rows and columns differ a bit depending on the source.
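One way to picture this point: two Datasets from different kinds of sources can share the same (variants, samples) matrix while holding different cell contents. The sketch below is illustrative only; the variable names `call_genotype` and `call_dosage` and the helper `matrix_shape` are assumptions, not names confirmed in this thread.

```python
import numpy as np
import xarray as xr

# Hard-call style source: integer allele indices per call.
hard_calls = xr.Dataset(
    {
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.zeros((4, 3, 2), dtype="int8"),
        )
    }
)

# Probabilistic style source: one floating-point dosage per call instead.
dosages = xr.Dataset(
    {
        "call_dosage": (
            ("variants", "samples"),
            np.full((4, 3), 0.5, dtype="float32"),
        )
    }
)


def matrix_shape(ds):
    """Return the (variants, samples) shape that both layouts share."""
    return ds.sizes["variants"], ds.sizes["samples"]
```

Both Datasets report the same matrix shape even though one stores integer calls with a ploidy dimension and the other stores a single float per cell.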

@hammer
Contributor Author

hammer commented Sep 10, 2020

It may be useful to point to some external documentation on migrating from working on NumPy to working with Xarray and Dask. The Satpy project has a dedicated page in their docs on this topic: Migrating to xarray and dask.

eric-czech added 7 commits to eric-czech/sgkit that referenced this issue between Sep 23 and Sep 29, 2020
eric-czech added a commit that referenced this issue Sep 30, 2020
* Add usage and design documentation #87

* Update docs/index.rst

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>

* Suggested changes

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>

* Force push gh-pages branch in gh action

* Suggested changes

* Fix typo

Co-authored-by: Alistair Miles <alimanfoo@googlemail.com>
@tomwhite
Collaborator

tomwhite commented Oct 1, 2020

Fixed in #278

@tomwhite closed this as completed Oct 1, 2020