Remove slash from all data variable names #81

eric-czech · 2020-08-01T14:22:20Z

We're currently naming variables like call/genotype, variant/id, sample/id, etc., but I think we should switch to call_genotype, variant_id, sample_id, etc.

The disadvantages of using the slashes are:

- Xarray stores these as separate Zarr groups, which means you can't load an sgkit dataset with a single command. You have to do something like this instead: ds = xr.merge([xr.open_zarr(path, group=g) for g in ['call', 'variant', 'sample']]). There is no clear advantage to having the variables split up on disk by this grouping. If they were instead grouped by something more meaningful like contig, the partitioning would make more sense, but creating directories based on similar variables does not.
- Assigning variables requires a kwargs splat, e.g. ds.assign(**{'call/genotype': ...}), rather than the simpler ds.assign(call_genotype=...) syntax.
- I've found that for some datasets, you can't pass custom Zarr encodings to Xarray when variables have '/' in the name. The bug has been hard to reproduce on a small dataset, so I'm not sure why yet.
- You cannot autocomplete variable names on a dataset instance.
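
The assignment and autocomplete points both follow from Python's identifier rules, which can be checked without xarray. A minimal sketch, where `assign` is a hypothetical stand-in for any method that takes variables as keyword arguments (it is not the xarray implementation):

```python
def assign(**variables):
    """Hypothetical stand-in for a method that accepts variables as keyword arguments."""
    return dict(variables)


# A name containing '/' is not a valid Python identifier, so it cannot be
# written as a keyword argument or offered as an attribute completion.
slash_is_identifier = "call/genotype".isidentifier()
underscore_is_identifier = "call_genotype".isidentifier()

# Underscore names work with the plain keyword syntax.
direct = assign(call_genotype=[0, 1])

# Slash names can only be passed by unpacking a dict.
splatted = assign(**{"call/genotype": [0, 1]})
```

Writing `assign(call/genotype=[0, 1])` directly is a syntax error, which is why the splat is needed for the slash form.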

The only disadvantage I can see to not using the '/' is that it offers a convenient delimiter for extracting the group name for a set of variables like "variant" or "call". I don't think that's difficult to live without and using underscore case is more common in other pydata projects anyhow.

@alimanfoo or @tomwhite do you have any objections to this?

alimanfoo · 2020-08-01T20:12:48Z

Hi Eric, no objection, sounds like several things will fit with xarray more naturally if we avoid slash so happy to go with your suggestion.

ravwojdyla · 2020-08-03T09:55:07Z

@eric-czech +1 btw, have you thought about storing those strings as constants, so that we don't have to retype them? Example here: https://github.com/pystatgen/sgkit/blob/rav/api_dataset/sgkit/api.py#L51-L69 (doesn't have to be like that obviously).

eric-czech · 2020-08-03T09:58:12Z

Yes and I don't want to use them: https://github.com/pystatgen/sgkit/issues/17

tomwhite · 2020-08-03T10:12:49Z

No objections from me.

jeromekelleher · 2020-08-03T11:34:03Z

I'm going to push back slightly here, as using slashes is quite a natural way of grouping things, and communicates neatly to users that everything starting with call/ has to do with calls. I'm not too worried about the problems with xr.open_zarr, as I think we'll want to have a function sgkit.open/sgkit.load or whatever anyway, which will be either interacting with xr.open_zarr in quite a detailed way, or possibly bypassing some of it entirely. I don't think that xr.open_zarr should be something we recommend users interact with at all.

Your points about mapping to Python variables are well taken though, which is why I'm only pushing back very, very gently!

jeromekelleher · 2020-08-03T11:42:48Z

So, to me this comes down to the following question: do we ever imagine being able to do something like v = ds.variant or v = ds["variant"], where we regard the groupings as sub-datasets, or will we always regard all of the variables as independent? This is the advantage of grouping, and the slash delimiters then have meaning. Repeating the same x_ prefix on variable names is a big warning sign for me that the object model is missing a level of structure.

I realise that this doesn't necessarily map to how things work with xarray in practice, but we should at least keep the door open to this sort of structure in the future, if we think it might be useful.

eric-czech · 2020-08-03T12:35:42Z

> Repeating the same x_ prefix on variable names is a big warning sign for me that the object model is missing a level of structure

I think most (but certainly not all) of that structure is already present in the dimensions, though. Xarray doesn't have it built in, unfortunately, but, a bit like multi-indexes in pandas or the pandas_df.select_dtypes function, I think the grouping that the / implied is somewhat redundant with something like ds[[v for v in ds if ds[v].dims == ("variants",)]], which is equivalent to ds["variant"]. They have a drop_dims function, but a select_dims would be nice.
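
The comprehension above can be wrapped in a small helper. This is only a sketch: `select_dims` is a hypothetical name (xarray does not provide it), and the toy dataset and its variable names are made up for illustration.

```python
import numpy as np
import xarray as xr


def select_dims(ds, dims):
    """Hypothetical helper: keep only the data variables whose dims match exactly."""
    dims = tuple(dims)
    return ds[[name for name in ds.data_vars if ds[name].dims == dims]]


# Toy dataset using the underscore naming proposed in this issue.
ds = xr.Dataset(
    {
        "variant_id": (("variants",), np.array(["rs1", "rs2"])),
        "variant_position": (("variants",), np.array([100, 200])),
        "sample_id": (("samples",), np.array(["s1", "s2", "s3"])),
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.zeros((2, 3, 2), dtype="int8"),
        ),
    }
)

# Roughly what ds["variant"] would have meant with slash-delimited groups.
variant_vars = select_dims(ds, ("variants",))
```

Matching on the exact dims tuple is a design choice in this sketch; a looser variant could instead keep every variable that merely includes the "variants" dimension.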

> but we should at least keep the door open to this sort of structure in the future, if we think it might be useful.

I think the structure is still accessible, though I suppose it remains to be seen whether some structure within a given set of dimensions will also be necessary. That could potentially be organized as variable groups in attrs, or with separate datasets, given how lightweight merging is. That's maybe better than reflecting the groupings as variable prefixes, where you have to split on a common delimiter with a limit (i.e. 'call_genotype_mask'.split('_', 1)) rather than a unique delimiter. Either way, +1 to always reflecting any useful groupings in a way that's documented and easily accessible.
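
To illustrate the delimiter point, here is a minimal pure-Python sketch. `group_by_prefix` is a hypothetical helper and the variable names are examples only.

```python
from collections import defaultdict


def group_by_prefix(names, delimiter):
    """Hypothetical helper: group names by the text before the first delimiter."""
    groups = defaultdict(list)
    for name in names:
        # partition splits on the first occurrence only, like split(delimiter, 1)
        prefix, _, rest = name.partition(delimiter)
        groups[prefix].append(rest)
    return dict(groups)


# With a unique delimiter, a plain split recovers the group unambiguously.
slash_parts = "call/genotype_mask".split("/")

# With underscores, an unlimited split also breaks up the variable name itself,
# so the split has to be limited to the first underscore.
unlimited = "call_genotype_mask".split("_")
limited = "call_genotype_mask".split("_", 1)

groups = group_by_prefix(["call_genotype", "call_genotype_mask", "variant_id"], "_")
```

The limited split only recovers the group if every variable name starts with a single-word group prefix, which is a convention the names would have to follow.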

jeromekelleher · 2020-08-03T12:50:40Z

Sounds good @eric-czech - I just wanted to raise a flag here.

hammer · 2020-08-03T14:18:19Z

> They have a drop_dims function but a select_dims would be nice.

File upstream?

eric-czech · 2020-08-04T12:28:09Z

> File upstream?

@hammer I would if I could see a way to generalize it beyond what's easy with a comprehension. I imagine they wouldn't want to add it without a stronger case and I'm coming up empty for one at the moment. drop_dims wouldn't be a nice one-liner so I think it's much easier to justify the need for that.

eric-czech · 2020-08-05T09:46:51Z

Changed in https://github.com/pystatgen/sgkit/pull/83.

This was referenced Aug 3, 2020

Removing slash from variable names #83

Merged

Removing slash from variable names sgkit-dev/sgkit-plink#13

Merged

Removing slash from variable names sgkit-dev/sgkit-bgen#4

Merged

hammer added the "data representation" label (Issues related to how data is represented: data types, data structures, indexes, access methods, etc.) Aug 3, 2020

eric-czech closed this as completed Aug 5, 2020

hammer mentioned this issue Aug 5, 2020

[WIP] Docs describing the Genotype Call XArray #78

Closed

tomwhite referenced this issue in tomwhite/sgkit Aug 6, 2020

Fixes following https://github.com/pystatgen/sgkit/issues/81

3b1ea7a

tomwhite referenced this issue in tomwhite/sgkit Aug 10, 2020

Fixes following https://github.com/pystatgen/sgkit/issues/81

8f494ee
