Rules for propagating attrs and encoding #1614

Open
jhamman opened this Issue Oct 9, 2017 · 15 comments

8 participants
@jhamman
Member

jhamman commented Oct 9, 2017

We need to come up with some clear rules for when and how xarray should propagate metadata (attrs/encoding). This has come up routinely (e.g. #25, #138, #442, #688, #828, #988, #1009, #1271, #1297, #1586) and we don't have a clear direction as to when to keep/drop metadata.

I'll take a first cut:

| operation  | attrs      | encoding   |
|------------|------------|------------|
| reduce     | drop       | drop       |
| arithmetic | drop       | drop       |
| copy       | keep       | keep       |
| concat     | keep first | keep first |
| slice      | keep       | drop       |

cc @shoyer (following up on #1586 (comment))
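
To make the proposal concrete, here is a minimal sketch that encodes the table as a lookup over plain dictionaries. It models the proposed rules only and is not xarray's implementation; the names `RULES`, `_apply` and `propagate` are hypothetical, and "keep" is treated the same as "keep first" because a single-input operation has only one set of metadata to keep.

```python
# Proposed rules from the table, as (attrs rule, encoding rule) per operation.
# This is an illustration of the proposal, not xarray's actual code.
RULES = {
    "reduce": ("drop", "drop"),
    "arithmetic": ("drop", "drop"),
    "copy": ("keep", "keep"),
    "concat": ("keep first", "keep first"),
    "slice": ("keep", "drop"),
}


def _apply(rule, metadata_per_input):
    """Apply one rule to a list of metadata dicts, one dict per input object."""
    if rule == "drop":
        return {}
    # "keep" and "keep first" both copy the first input's metadata.
    return dict(metadata_per_input[0])


def propagate(operation, attrs_per_input, encoding_per_input):
    """Return the (attrs, encoding) the result would carry under the proposal."""
    attrs_rule, encoding_rule = RULES[operation]
    return (
        _apply(attrs_rule, attrs_per_input),
        _apply(encoding_rule, encoding_per_input),
    )


attrs = [{"units": "K"}, {"units": "degC"}]
encoding = [{"dtype": "int16"}, {"dtype": "float32"}]

print(propagate("slice", attrs[:1], encoding[:1]))  # ({'units': 'K'}, {})
print(propagate("concat", attrs, encoding))  # ({'units': 'K'}, {'dtype': 'int16'})
```

Writing the rules as data rather than scattering them across methods would also make them easy to document in one place.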

@ethan-campbell

ethan-campbell commented Nov 10, 2017

I'd also suggest that a global option of always_keep_attrs=True would be useful. While I understand the logic of dropping units during certain operations, it makes attributes unusable for storing other miscellaneous metadata, e.g. on data provenance. As a recent xarray convert, this behavior has been frustrating.

@mraspaud

Contributor

mraspaud commented Feb 2, 2018

This issue is very relevant for me too. I would like to also propose that a user could provide a function that would know how to combine the attrs of different DataArrays.
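
As a rough illustration of what such a user-provided function could look like, the sketch below works on plain dictionaries. The hook name `combine`, its `combine_func` parameter and the example strategy `keep_identical` are all hypothetical; nothing here is an existing xarray API.

```python
def keep_identical(attrs_list):
    """Example strategy: keep only keys whose values agree across all inputs."""
    if not attrs_list:
        return {}
    first, *rest = attrs_list
    return {
        key: value
        for key, value in first.items()
        if all(key in other and other[key] == value for other in rest)
    }


def combine(attrs_list, combine_func=None):
    """Hypothetical hook: drop attrs by default, else defer to the user's function."""
    if combine_func is None:
        return {}
    return combine_func(attrs_list)


a = {"units": "m", "sensor": "A"}
b = {"units": "m", "sensor": "B"}

print(combine([a, b]))  # {}
print(combine([a, b], keep_identical))  # {'units': 'm'}
```

Passing the function per call, as assumed here, avoids a single global override that several libraries might compete for.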

@brey

brey commented Feb 2, 2018

I am also interested. I am in principle OK with the table from @jhamman. However, there could be an option to refer to the original attrs in order to provide provenance even on operations like reduce and arithmetic. The idea here is reproducibility and traceability. Maybe an 'origin' attribute?
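
One way to picture the 'origin' idea is the sketch below, which uses a plain list and dictionary instead of a DataArray. The function `reduce_with_origin` is hypothetical: it drops the attrs as the table proposes for reduce, but stores a copy of the input attrs under an `origin` key.

```python
import copy


def reduce_with_origin(values, attrs):
    """Sum the values; drop attrs but keep a copy of them under 'origin'."""
    result = sum(values)
    # Deep copy so later changes to the input attrs do not alter the record.
    new_attrs = {"origin": copy.deepcopy(attrs)}
    return result, new_attrs


total, total_attrs = reduce_with_origin([1, 2, 3], {"units": "m", "source": "station 4"})
print(total)  # 6
print(total_attrs)  # {'origin': {'units': 'm', 'source': 'station 4'}}
```

Under this assumption, repeated operations nest one `origin` inside another, so the chain of earlier attrs stays recoverable at the cost of growing metadata.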

@shoyer

Member

shoyer commented Feb 3, 2018

The challenge with a user-specified function is that there can be weird conflicts if multiple libraries try to override it. Possibly it's worth it for the convenience, but subclasses with explicit hooks (as in NumPy) are probably the cleanest solution.

@SeanDS

SeanDS commented Jun 18, 2018

Hi, this feature would be very relevant to the intended use case of a project I'd like to use xarray for. Is the behaviour discussed in the first post implemented anywhere, e.g. in the trunk, for me to play with?

@SeanDS

SeanDS commented Jun 18, 2018

Also - might I suggest you consider some kind of history tracker as part of the metadata propagation? Perhaps metadata could be saved from each step of a set of operations, so that there is a full paper trail of the operations that have been applied to the data. It could however get complicated to merge together objects with their own separate histories, especially if they ultimately descend from the same original data set.

This would be very relevant for scientific analyses.
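
A minimal sketch of such a tracker, and of the merge problem mentioned above, might look like this. Histories are modelled as tuples of step descriptions; `record` and `merge_histories` are hypothetical helpers, and collapsing the shared leading steps is only one of several possible ways to handle a common ancestor.

```python
def record(history, step):
    """Return a new history with one more step appended."""
    return history + (step,)


def merge_histories(left, right, step):
    """Combine two histories, listing their shared leading steps only once."""
    shared = 0
    for a, b in zip(left, right):
        if a != b:
            break
        shared += 1
    # Steps after the common prefix are kept as two separate branches.
    branch = ("branch", left[shared:], right[shared:])
    return left[:shared] + (branch, step)


raw = ("open dataset",)
a = record(raw, "mean over time")
b = record(raw, "slice x=0:10")
merged = merge_histories(a, b, "a - b")
print(merged)
# ('open dataset', ('branch', ('mean over time',), ('slice x=0:10',)), 'a - b')
```

Even this toy version has to decide how to represent branches, which hints at why the problem grows quickly for real analyses.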

@shoyer

Member

shoyer commented Jun 18, 2018

> Hi, this feature would be very relevant to the intended use case of a project I'd like to use xarray for. Is the behaviour discussed in the first post implemented anywhere, e.g. in the trunk, for me to play with?

Are you referring to a different issue? The first post only summarizes some simple proposed rules.

@shoyer

Member

shoyer commented Jun 18, 2018

> Also - might I suggest you consider some kind of history tracker as part of the metadata propagation?

Certainly this would be out of scope for xarray itself, but it could perhaps be done with a library that wraps xarray's API. If I recall correctly, @pwolfram was also interested in this.

We did discuss customizable hooks for attribute handling in #988 but I'm no longer sure that is a good idea. These sorts of overloads are really hard to get right, as we've seen with NumPy's long history of different override protocols (the most recent example being __array_ufunc__).

@max-sixty

Collaborator

max-sixty commented Jun 18, 2018

> consider some kind of history tracker as part of the metadata propagation?

Data lineage is a big, hard, unsolved problem (for us, above both naming things and cache invalidation).

To second @shoyer, I think it's big and difficult enough to be a separate library.

@SeanDS

SeanDS commented Jun 18, 2018

> Are you referring to a different issue? The first post only summarizes some simple proposed rules.

No, just the proposed feature to keep or delete metadata based on the various operations. Is this behaviour already part of the library, and this issue is just to clarify the intended behaviour, or is this a feature proposal?

@shoyer

Member

shoyer commented Jun 18, 2018

> No, just the proposed feature to keep or delete metadata based on the various operations. Is this behaviour already part of the library, and this issue is just to clarify the intended behaviour, or is this a feature proposal?

We already have most of this behavior (matching what @jhamman lists in the first comment), though it isn't clearly documented. It should just work if you use xarray methods/functions.

@ethan-campbell

ethan-campbell commented Jun 18, 2018

@shoyer, I assume you are referring to the keep_attrs option. Is there a way to persist attrs during arithmetic operations? I find myself writing a bunch of boilerplate to transfer the wealth of metadata included with most netCDF files.

I realize that adding a module-level or DataArray instance-specific maintain_attrs configuration flag (as discussed in #131, #988, #1271) could be problematic, but this strikes me as complexity worth adding. The current approach of dropping all metadata (not just units) seems heavy-handed and unintuitive for new/casual users. As you mentioned in #1271, better to have stale metadata than no metadata at all.

@shoyer

Member

shoyer commented Jun 18, 2018

I would be happy to add a global keep_attrs option to xarray.set_options(), which we could use for controlling arithmetic. I'm not planning on working on it personally, but I would be happy to review a PR.

@gerritholl

Contributor

gerritholl commented Oct 31, 2018

Another one to decide is xarray.zeros_like(...) and friends.

@shoyer

Member

shoyer commented Nov 3, 2018

> I would be happy to add a global keep_attrs option to xarray.set_options(), which we could use for controlling arithmetic. I'm not planning on working on it personally, but I would be happy to review a PR.

Note that this was implemented by @TomNicholas in #2482.
