Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove slash from all data variable names #81

Closed
eric-czech opened this issue Aug 1, 2020 · 11 comments
Closed

Remove slash from all data variable names #81

eric-czech opened this issue Aug 1, 2020 · 11 comments
Labels
data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc

Comments

@eric-czech
Copy link
Collaborator

eric-czech commented Aug 1, 2020

We're currently naming variables like call/genotype, variant/id, sample/id, etc., but I think we should switch to call_genotype, variant_id, sample_id, etc.

The disadvantages of using the slashes are:

  • Xarray stores these as separate Zarr groups which means you can't load an sgkit dataset with single command. You have to instead do something like this: ds = xr.merge([xr.open_zarr(path, group=g) for g in ['call', 'variant', 'sample']). There is no clear advantage to having the variables split up on disk by this grouping. If they were instead grouped by something more meaningful like contig, the partitioning would make more sense but creating directories based on similar variables does not.
  • Assigning variables requires a kwargs splat rather than using the simpler ds.assign(call_genotype=...) syntax, e.g. ds.assign(**{'call/genotype': ...})
  • I've found that for some datasets, you can't pass custom Zarr encodings to Xarry when variables have '/' in the name -- the bug has been hard to reproduce on a small dataset so I'm not sure why yet.
  • You cannot autocomplete variable names on a dataset instance

The only disadvantage I can see to not using the '/' is that it offers a convenient delimiter for extracting the group name for a set of variables like "variant" or "call". I don't think that's difficult to live without and using underscore case is more common in other pydata projects anyhow.

@alimanfoo or @tomwhite do you have any objections to this?

@alimanfoo
Copy link
Collaborator

alimanfoo commented Aug 1, 2020 via email

@ravwojdyla
Copy link
Collaborator

@eric-czech +1 btw, have you thought about storing those strings as constants, so that we don't have to retype them? Example here: https://github.com/pystatgen/sgkit/blob/rav/api_dataset/sgkit/api.py#L51-L69 (doesn't have to be like that obviously).

@eric-czech
Copy link
Collaborator Author

Yes and I don't want to use them: https://github.com/pystatgen/sgkit/issues/17

@tomwhite
Copy link
Collaborator

tomwhite commented Aug 3, 2020

No objections from me.

@jeromekelleher
Copy link
Collaborator

I'm going to push back slightly here, as using slashes is quite a natural way of grouping things, and communicates neatly to users that everything starting with call/ has to do with calls. I'm not too worried about the problems with xr.open_zarr, as I think we'll want to have a function sgkit.open/sgkit.load or whatever anyway, which will be either interacting with xr.open_zarr in quite a detailed way, or possibly bypassing some of it entirely. I don't think that xr.open_zarr should be something we recommend users interact with at all.

Your points about mapping to Python variables are well taken though, which is why I'm only pushing back very, very gently!

@jeromekelleher
Copy link
Collaborator

jeromekelleher commented Aug 3, 2020

So, to me this comes down to the following question: do we ever imagine being able to do something like v = ds.variant or v = ds["variant"], where we regard the groupings as sub-datasets, or will we always regard all of the variables as independent? This is the advantage of grouping, and the slash delimiters then have meaning. Repeating the same x_ prefix on variable names is a big warning sign for me that the object model is missing a level of structure.

I realise that this doesn't necessarily map to how things work with xarray in practise, but we should at least keep the door open to this sort of structure in the future, if we think it might be useful.

@eric-czech
Copy link
Collaborator Author

eric-czech commented Aug 3, 2020

Repeating the same x_ prefix on variable names is a big warning sign for me that the object model is missing a level of structure

I think most (but certainly not all) of that structure is present though in the dimensions. Xarray doesn't have it unfortunately but kind of like multi-indexes in pandas or the pandas_df.select_dtypes function, I think the groupings that the / implied is somewhat redundant with something like ds[[v for v in ds if ds[v].dims == ("variants",)]] # equivalent to ds["variant"]. They have a drop_dims function but a select_dims would be nice.

but we should at least keep the door open to this sort of structure in the future, if we think it might be useful.

I think the structure is still accessible though I suppose it remains to be seen if some structure within a given set of dimensions will also be necessary. That could potentially be organized as variable groups in attrs or with separate datasets given how lightweight merging is. That's maybe better than reflecting the groupings as variable prefixes where you have split on a common delimiter with a limit (i.e. 'call_genotype_mask".split('_', 1)) rather than a unique delimiter but either way, +1 to always reflecting any useful groupings in way that's documented and easily accessible.

@jeromekelleher
Copy link
Collaborator

Sounds good @eric-czech - I just wanted to raise a flag here.

@hammer
Copy link
Contributor

hammer commented Aug 3, 2020

They have a drop_dims function but a select_dims would be nice.

File upstream?

@hammer hammer added the data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc label Aug 3, 2020
@eric-czech
Copy link
Collaborator Author

File upstream?

@hammer I would if I could see a way to generalize it beyond what's easy with a comprehension. I imagine they wouldn't want to add it without a stronger case and I'm coming up empty for one at the moment. drop_dims wouldn't be a nice one-liner so I think it's much easier to justify the need for that.

@eric-czech
Copy link
Collaborator Author

Changed in https://github.com/pystatgen/sgkit/pull/83.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data representation Issues related to how data is represented: data types, data structures, indexes, access methods, etc
Projects
None yet
Development

No branches or pull requests

6 participants