Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xarray-style "attrs", global and per-record field #1391

Closed
jpivarski opened this issue Mar 31, 2022 · 6 comments · Fixed by #2757
Closed

xarray-style "attrs", global and per-record field #1391

jpivarski opened this issue Mar 31, 2022 · 6 comments · Fixed by #2757
Assignees
Labels
feature New feature or request

Comments

@jpivarski
Copy link
Member

Description of new feature

@philippemiron is converting data from NetCDF4 files into Awkward Arrays, and one of the features we lack is a place to put attributes. These can be descriptions, units, meanings of flags, etc., and a single field can have more than one of these (e.g. so that "units" are programmatically accessible). In general, any JSON-encodable data.

This applies at two levels: globally for a whole array, in such a way that all derived arrays pass on the attributes, and per-record field. The per-record field attributes should only be passed on if the meaning of the field is not changed, and should not be counted as part of the Content node's type. Also, these should be read and written to files as metadata wherever possible.

Why not use parameter __doc__?

This would work as a per-record field attribute that is passed on whenever the meaning of the field is not changed, and doesn't count as part of the Content node's type. However, it has to be a string, since this is what goes into the Python __doc__ property (and therefore IPython and Jupyter help). The attributes have to be general JSON-encodable metadata.

Also, this only encodes per-record field attributes, not global attributes.

Why not use a new parameter?

It would only be accessible through idioms like array.layout.parameters["attrs"]. This is high-level data analyst information, and they shouldn't have to go through layout, which is mid-level, for library developers. Also, such an idiom would fail if the layout ever gets buried in another Content node (i.e. the user has to be aware of layouts and how they change, which is not a high-level view!). For example, rearranging records in a RecordArray nests the RecordArray within an IndexedArray for performance reasons; so after certain kinds of slices, array.layout.parameters["attrs"] becomes array.layout.content.parameters["attrs"], probably unexpectedly.

Why not use behavior?

It's an odd thing to do, but behavior is passed from an ak.Array to any new ak.Array derived from it. Any keys in behavior that aren't recognized are ignored. However, this would only encode global attributes, not per-record field attributes, and behavior (which typically contains function objects and class objects) is not serialized when writing files, or filled when reading files.

What instead?

This is really two new features:

  1. Per-record field metadata, as a parameter, that follows all the same rules as __doc__, but doesn't have to be a string and is exposed at high-level in a different way (one that allows non-strings). Parameters are already serializable.
  2. Global metadata that follows all the same rules as behavior, but is serializable.

They should have names like attrs, following xarray's convention, but the per-record field attrs should be distinguishable somehow from the global attrs.

Perhaps both of these should be dicts (JSON objects) and when we're looking at one field, the two dicts are overlaid, with the per-field keys taking precedence over the global keys? That sounds natural and would minimize the use of names, but it sounds like just the sort of thing that would break something in the future.

What does xarray do?

>>> import xarray as xr
>>> dataarray = xr.DataArray([1, 2, 3], attrs={"wow": "look!"})
>>> dataset = xr.Dataset({"x": dataarray}, attrs={"wow": "wee!", "another": "thing"})
>>> dataset.attrs
{'wow': 'wee!', 'another': 'thing'}
>>> dataset.x.attrs
{'wow': 'look!'}

xarray never has this issue because DataArray and Dataset are different types and can never be confused. In Awkward, an ak.Array is an ak.Array is an ak.Array, regardless of whether it's an array of lists of records, an array of just the records, or an array of one of the fields of those records. The global attrs can propagate down when extracting the array of records from the array of lists of records, but when you get to the array of one of the fields of those records, there's a conflict. They should probably be named differently, since "per-field attributes" and "global attributes" are different concepts, but which one gets the exalted name of "attrs"? And what would the other one be called?

@jpivarski jpivarski added the feature New feature or request label Mar 31, 2022
@jpivarski jpivarski added this to To-do: easy in Prioritized issues Apr 15, 2022
@jpivarski jpivarski added the pr-next-release Required for the next release label Oct 31, 2022
@jpivarski jpivarski self-assigned this Nov 10, 2022
@agoose77
Copy link
Collaborator

agoose77 commented Nov 12, 2022

I was thinking about this, and the per-field attributes could be implemented using parameters; we just need to provide the appropriate high-level API for this. I imagine this in the same way that the string abstraction is a specialisation of the behaviour system, we could do the same for attrs. However, a drawback of this method is that we don't retain the benefits of field-association, i.e. with_field(array, None, 'field') wouldn't update the attr metadata unless we implemented logic for this.

Saying this, in general, I prefer the idea of a dedicated per-field metadata object rather than piggybacking an existing mechanism. If we redesigned awkward from scratch, I'd be tempted to do the same for the __array__ behavior mechanism such that parameters only contains user-metadata.

Maybe the name of this mechanism is field_parameters? I'll give it some more thought...

@agoose77 agoose77 removed the pr-next-release Required for the next release label Nov 15, 2022
@agoose77
Copy link
Collaborator

Jim and I discussed this - we'll move this to another release in favour of more pressing issues. This will some require some thought.

@agoose77
Copy link
Collaborator

agoose77 commented Feb 1, 2023

We spoke about this today, and in-general, parameters added to a RecordArray should survive most operations, unless we wouldn't want them to anyway. So, we shouldn't need a new place for Xarray metadata here, it can sit under an __xarray__ name. Meanwhile, although the ak.behavior dict is very close to what we want for a metadata object, it's not a perfect fit:

  • metadata should be JSON serialisable
  • metadata should be stored in parquet, etc.

Although we can still use behavior for this, it might be clearer to explicitly add a new metadata mechanism that enforces all of these rules instead of piggy-backing onto the existing behaviour mechanism.

I believe this would motivate replacing ak._util.wrap and ak.to_layout usage internally with a forward/backwards deconstruction mechanism, e.g. ak.compose(...) and ak.decompose(...) that pulls out the (layout, metadata, behavior).

Finally, we likely also want a mechanism for resolving parameters within a dimension.

This metadata should be per-array, explicitly.

@jpivarski
Copy link
Member Author

As discussed in CoffeaTeam/coffea#824, there's a use-case for wanting a local storage on ak.Array (in the meta of a dak.Array) that

  1. is propagated to derived arrays, just as behavior is
  2. is appropriate for non-behaviors (Coffea wants to store whole arrays there)
  3. is not serialized in any way: pickle, Parquet, etc.

That use-case shares desirata (1) and (2) with the attrs requested here. The attrs requested here need to be more persistent than behavior, to be saved not only in pickle but also file formats like Parquet (i.e. be included in the AwkwardArrowType.metadata), so the two use-cases strongly disagree on (3).

So how about adding both attrs and transient_attrs to ak.Array. It would be propagated to all derived arrays like behavior, merging like the behavior dict when an operation involves more than one array, but separate from the behavior machinery itself. Where they diverge is

  • attrs must be JSON-like and will be serialized with the array in all forms
  • transient_attrs can be any Python objects and will never be serialized.

dask-awkward should someday implement attrs as well, so that the JSON-like metadata can be used on Dask workers, but dask-awkward will never need to implement the transient_attrs, since you never know if a Dask object has been serialized and sent to another worker or not. (Maybe there would be value in dak.Array.transient_attrs for caching node-level information, but that can be reconsidered.) Even with that limitation, it still satisfies the Coffea use-case because Coffea only needs this at type-tracer/Dask graph-building time, and dak.Array.meta.transient_attrs would then be available.

Cc: @lgray, @nsmith-, @martindurant, @douglasdavis, @agoose77

@agoose77
Copy link
Collaborator

agoose77 commented Jun 11, 2023

I was starting to work on this, and concluded that the merging rules for attrs are at first-glance more like parameters than behavior. I've been thinking about attrs as array-associated metadata, much like parameters are. But, it seems like what Coffea wants is more like non-serialisable "context", very much like "behavior".

@lgray / @nsmith- could you possibly elucidate the properties that you'd need from a new context object?

Specifically, whether we need union or intersection properties for operations on two arrays? (behavior has union, properties have intersection). I assume you need union, because you're currently using behavior for this.

My current list is:

  • non serialised
  • union-merging rules

@nsmith-
Copy link
Contributor

nsmith- commented Jun 12, 2023

On our end we had not bothered to deal with name clashes if two NanoEvents arrays have different origins, so this didn't come up. I suppose we should figure out an appropriate hash for a NanoEventsFactory-generated array (should be straightforward) to ensure correct provenance. But in that case, I think we would actually want intersection because the mixin will not know which lineage to use for its components (unless they all get labeled somehow, which sounds like a can of worms best left unopened.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

3 participants