Add support for netCDF4.EnumType #8147

bzah · 2023-09-05T17:20:50Z

This pull request adds support for enums in the netCDF4 backend.

Enums were added to netCDF4-python in 1.2.0 (September 2015).
In the netCDF format, they are defined as types and can be used across the dataset to type variables on creation.
They are meant to be an alternative to flag_values and flag_meanings.

This pull request makes it possible for xarray to read existing enums in a file, convert them into flag_values/flag_meanings, and save them as enums when a special encoding flag is filled in.

TODO:

Add implementation for other backends? Will be added in a follow-up PR.

Closes Add support for netcdf4 enum #8144
Tests added
User visible changes (including notable bug fixes) are documented in whats-new.rst

xarray/conventions.py

bzah · 2023-09-06T10:24:58Z

According to the netCDF4 example here, to deal with missing values in variables whose meaning is explained by an enum, the enum declaration should contain a mapping between the fill value and "Missing" or an equivalent value.

But the netCDF C and Python libraries do not enforce this.
So we can end up with clunky variables that contain missing values which are invalid according to the enum declaration. The example below illustrates this:

import netCDF4 as nc

ds = nc.Dataset("./toto.nc", "w")
my_enum = ds.createEnumType("u1","my_enum",{"a":0, "b":1})
ds.createDimension("time", 10)
my_var = ds.createVariable("my_var", my_enum, "time")
# no fill_value defined above
my_var[:].data
# my_var[:] is full of 255, which is the default fill value for unsigned byte (u1)

Related to Unidata/netcdf-c#982, in particular Unidata/netcdf-c#982 (comment).

In this PR, this causes to_netcdf to crash.

benbovy · 2023-09-06T12:34:34Z

Thanks @bzah for working on this! I’m probably not the most qualified to review this PR as I’m not familiar with the netCDF4 Enum data type, though.

Validate with xarray team if it's ok to add attribute to Variable and DataArray

This seems a bit too invasive to me for such an edge case. Propagating variable metadata (attrs and encoding) in xarray operations is already complicated and is often a source of bugs. Extending the xarray.Variable class with two additional attributes to track will make the situation worse.

Since this is specific to the netCDF4 format, a less invasive approach would be to implement a class EnumArray(VariableCoder) (see xarray/coding/variables.py) and plug it into xarray's builtin netcdf backend. I see two options for decoding an enum netcdf variable:

1. return an xarray.Variable with integer (e.g., np.uint8) dtype and extract the enum_dict as an item of the variable attrs (or encoding?).
2. wrap the netcdf4 variable as a custom, explicitly indexed (duck) array (see xarray/core/indexing.py for a few classes inheriting from ExplicitlyIndexedNDArrayMixin).

bzah · 2023-09-06T13:07:34Z

Hi @benbovy and thanks for the comment. I guessed it would be a bit spicy to add attributes to Variable, but I wanted to have a naive working approach to enable discussions.
Note that enums are also available in HDF5, so we could implement an enum handler for some other backends. I don't think enums are part of the zarr specification though.

TBH, I don't really like using encoding and attrs. I can see that it is convenient to put everything into these two dictionaries, but this is not tidy enough for me. I like when I can look at a class definition and know what can be there and what cannot. With attrs or encoding this is all hidden IMHO.

I like the ExplicitlyIndexedNDArrayMixin idea. I will try to make something out of it.

dcherian · 2023-09-08T16:10:35Z

(1) is definitely a lot easier. We'd also want to support specifying the enum type at write, so that we can roundtrip the file.

(2) would be a lot more involved. What kind of operations would you like to see take advantage of the enum dictionary?

jhamman · 2023-09-08T17:35:51Z

(1) is definitely a lot easier. We'd also want to support specifying the enum type at write, so that we can roundtrip the file.

+1 on this. Don't want to push you too hard off (2) but (1) would have been my recommended approach.

TBH, I don't really like using encoding and attrs.

I'd be interested to get more of your rationale on this. We've been discussing making encoding a read-only field (outside of the backend constructors). For more control over the content of attrs, you may want to look at xarray-dataclasses or xarray-schema.

kmuehlbauer · 2023-09-08T18:36:37Z

(1) is definitely a lot easier. We'd also want to support specifying the enum type at write, so that we can roundtrip the file.

+1 on that, too. I have this on my plate for h5netcdf anyway, so it would be good to coordinate.

bzah · 2023-09-11T20:36:28Z

Understood, thanks for the feedback everyone. I will then try implementing 1):

return an xarray.Variable with integer (e.g., np.uint8) dtype and extract the enum_dict as an item of the variable attrs (or encoding?).

I should be able to find time for that this week.

@jhamman, my thought was that attrs and encoding can be filled with basically anything (can they?) and it may be hard to keep track of what may be in there. Whereas having dedicated properties at class level makes the purpose of each attribute obvious. But maybe it's just my Java instinct that is tickling.
Thanks for sharing xarray-dataclasses and xarray-schema, this looks very interesting to me.


bzah · 2023-09-13T15:03:17Z

OK, I have a simple working implementation for enums. I still have unit tests to fix and to add though.
However, I know that it will not work for existing datasets that already contain enums.
There is indeed a hole in the specification and implementation of HDF5 and netcdf-c that makes it possible to create clunky datasets.

Basically, you can create datasets with fill values outside the enum range and they are considered valid by HDF5.
For example, when creating a dataset with:

    import netCDF4 as nc
    import xarray as xr

    clouds_ds = nc.Dataset("clouds_ds__explicit_fill_value.nc", "w")
    cloud_type = clouds_ds.createEnumType("u1","cloud_type", {"clear": 0, "cloudy":1})
    clouds_ds.createDimension("time", size=10)
    clouds_ds.createVariable(
                "clouds",
                cloud_type,
                "time",
                fill_value=255, # or None, same result
            )
    clouds_ds["clouds"][0] = 1
    print(clouds_ds["clouds"][:].data)
    # [out] [ 1 255 255 255 255 255 255 255 255 255 ]
    clouds_ds.close()

netCDF4 lets you create a variable with a fill_value outside the enum range, but ncdump will crash if you run ncdump clouds_ds__explicit_fill_value.nc.
And with the current PR's implementation, xarray will fail to write it too:

xr_ds = xr.open_dataset("clouds_ds__explicit_fill_value.nc")
xr_ds.to_netcdf("xr_clouds-clouds_ds__explicit_fill_value.nc")
# --> throws an exception because xarray unmasks the values
#     and tries to push 255 (fill_value) into the resulting netCDF file.

It's worth noting that data producers may be tempted to avoid specifying a missing value in the enum definition if they believe the variable will always be filled with something. But I believe this should be discouraged.

Possible workarounds

When reading a netCDF file with enums, if the fill_value (either in attributes or from the mask) is not among the enum's possible values and there are missing values, then:

i. Add a dedicated enum key-value pair for the fill_value, like {"_Undefined": <fill_value>}, thus modifying the enum declaration.
ii. Or, keep track of the mask of missing values and re-apply it at writing time (assuming it's still valid).
iii. Or, forbid opening the file as we consider it invalid, and perhaps have a flag to fall back to i.

I don't like i. because we lose fidelity: after simply opening a file and rewriting a copy, the content would not be identical.
ii. seems bad too because if the variable is modified by the user we can't know where to apply the mask.
iii. would be the best IMO; in particular, it might force data producers that use xarray to think about missing values when designing enums.
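To make workaround i. concrete, here is a minimal pure-Python sketch of patching an enum declaration so that the fill value becomes a valid member. The helper name `add_fill_value_member` and the default member name `_Undefined` are illustrative assumptions, not part of this PR or of netCDF4-python.

```python
def add_fill_value_member(enum_dict, fill_value, name="_Undefined"):
    """Return a copy of enum_dict in which fill_value is a valid member.

    Hypothetical helper: if some member already maps to fill_value the
    declaration is returned unchanged, otherwise `name` is added for it.
    """
    if fill_value in enum_dict.values():
        return dict(enum_dict)
    if name in enum_dict:
        raise ValueError(f"{name!r} is already used for another value")
    return {**enum_dict, name: fill_value}


cloud_type = {"clear": 0, "cloudy": 1}
patched = add_fill_value_member(cloud_type, 255)
print(patched)
```

The original declaration is left untouched, which also shows the drawback raised above: the rewritten file would carry a different enum than the one that was read.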

Relevant discussion on netcdf-c: Unidata/netcdf-c#982

Do you have suggestions ?

kmuehlbauer · 2023-09-13T16:15:14Z

@bzah Thanks for this first wip implementation. I'll try to review over the next days.

kmuehlbauer · 2023-09-14T08:18:58Z

Hi @bzah, yes indeed, the default netcdf fill_value issue is a tricky one. There is a general discussion in #2742 with quite some offspring issues.

In general, xarray has a relaxed view when it comes to reading non-standard/broken/mismatched (you name it) files. If it is readable, xarray should be able to import it. As netcdf4-python is able to read those files, users will expect xarray to ingest them too.

So I'd add another point to your above list:

iv. do nothing on read (just read), then raise on write with a meaningful error

As this affects only existing files

- which have the default fill value activated and are not written completely
- which set a _FillValue outside the enum range

we might get away without too much hassle.
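A rough sketch of what option iv. could look like at write time, in plain Python: values are checked against the enum declaration and a descriptive error is raised before anything is handed to the backend. The function name `check_enum_values` and the wording of the message are assumptions for illustration only.

```python
def check_enum_values(values, enum_dict, var_name):
    """Raise a descriptive error if `values` contains non-members of the enum.

    Hypothetical check; `values` is any iterable of integers.
    """
    valid = set(enum_dict.values())
    invalid = sorted(set(values) - valid)
    if invalid:
        raise ValueError(
            f"Variable {var_name!r} contains values {invalid} that are not "
            f"members of its enum {enum_dict}. Add an enum member for the "
            "fill value or replace these values before writing."
        )


message = None
try:
    # 255 is the default fill value for u1 and is not declared in the enum
    check_enum_values([1, 255, 255], {"clear": 0, "cloudy": 1}, "clouds")
except ValueError as err:
    message = str(err)
print(message)
```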

For the overall approach we could think to create flag_meanings and flag_values attributes on read (established CF convention) in conjunction with a new key enum in encoding (encoding["enum"]="my_enum_name"). On write, those attributes should be removed and the netCDF4.EnumType be created instead.

PROs:

- users' current workflows with flag_meanings/flag_values will automagically work for the new netCDF4.EnumType too
- even if encoding is lost while processing, data will still be written out in a meaningful way (flag_meanings and flag_values)
- can be meaningfully serialized to other backends, which do not have a notion of enum
- users could use the new netCDF4.EnumType as output just by setting encoding["enum"]="my_enum_name" for their existing variables (if they have flag_meanings and flag_values available)
- no need to add dict to valid_types for the backend API
- no need to invent a new naming scheme (enum_name, enum_meaning)
- backwards and roundtrip compatible (encoding["dtype"] isn't touched)

CONs:

- I'm surely biased, but I can't immediately see any
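The mapping between an enum declaration and the CF attributes proposed above can be sketched in a few lines of plain Python. CF stores flag_meanings as a single blank-separated string and flag_values as a list of the matching values; the function names `enum_to_flags` and `flags_to_enum` are hypothetical and do not come from xarray.

```python
def enum_to_flags(enum_dict):
    """Convert {"name": value} into CF-style flag attributes."""
    items = sorted(enum_dict.items(), key=lambda kv: kv[1])
    return {
        "flag_values": [value for _, value in items],
        "flag_meanings": " ".join(name for name, _ in items),
    }


def flags_to_enum(attrs):
    """Rebuild the enum dict from flag_values / flag_meanings attributes."""
    meanings = attrs["flag_meanings"].split()
    values = list(attrs["flag_values"])
    if len(meanings) != len(values):
        raise ValueError("flag_meanings and flag_values differ in length")
    return dict(zip(meanings, values))


flags = enum_to_flags({"cloudy": 1, "clear": 0})
print(flags)
print(flags_to_enum(flags))
```

Note that this round trip only works when enum member names contain no whitespace, since flag_meanings is split on blanks.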

Also note this comment from @samain-eum cf-convention/discuss#238 (comment) where EUMETSAT is following a similar path.

WDYT @bzah?

bzah · 2023-09-14T10:07:42Z

Hi @kmuehlbauer, thanks for linking the fill_value issue, interesting read. I would be willing to try to open a PR to fix that (fetching the mask, getting the implicit fill_value and making it explicit in attrs) once I'm done with Enums.

Regarding:

iv. do nothing on read (just read), then raise on write with a meaningful error

Looks better than raising an error on read, but might be frustrating for users if they do all their modifications and get an error on calling to_netcdf.
In particular, if the opening, processing and saving is done by a library, then it would be the role of this library to reject or fix the file.
Maybe printing a warning at the opening of the file and having a flag has_valid_enum_fill_value would help?

As for:

create flag_meanings and flag_values attributes on read [...] [and] encoding["enum"]="my_enum_name". On write, those attributes should be removed and the netCDF4.EnumType be created instead.

I like the idea: it's simpler than mine and, from xarray's point of view, looks elegant and makes it easy to be CF compliant.
In particular, we avoid some of the inconsistencies described by @samain-eum because xarray users will not have access to the EnumType, so they can only modify flag_meanings and flag_values to update the enum declaration.

However, one EnumType may be used in several variables, possibly in several groups. In my opinion, the biggest issue with flag_meanings and flag_values is how to synchronize them across variables when one changes:
Say I have an enum like: safety: {"safe": 0, "unsafe": 1}.
As this is a very generic enum, it can be used in several groups and variables. We would be tempted to declare it once in a parent group, just like you would do with a dimension.
However, if our xarray user wants to add a level "very_unsafe": 2, it would be tedious to keep all the flag_meanings and flag_values in other variables consistent.
Even without groups, in the same Dataset we could have multiple DataArrays which rely on the same enum and we would have the same consistency issue.
Whereas the EnumType in netCDF4 is declared once, and modifying it would update every variable typed with it.

One solution could be to have these <enum_name>_flag_values and <enum_name>_flag_meanings at Dataset level, but then it's no longer CF compliant.

Note that my implementation has the same consistency problem as what you are suggesting.
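The consistency problem can be illustrated with a small, self-contained check. It assumes the layout proposed earlier in the thread (an `encoding["enum"]` name plus flag_values/flag_meanings attrs) and models variables as plain dicts; `find_diverging_enums` is a hypothetical helper, not an xarray function.

```python
from collections import defaultdict


def find_diverging_enums(variables):
    """Return names of enums whose declarations differ between variables.

    `variables` maps a variable name to {"attrs": {...}, "encoding": {...}}.
    """
    declarations = defaultdict(set)
    for var in variables.values():
        enum_name = var["encoding"].get("enum")
        if enum_name is None:
            continue  # not an enum-typed variable
        attrs = var["attrs"]
        declarations[enum_name].add(
            (tuple(attrs["flag_values"]), attrs["flag_meanings"])
        )
    return sorted(name for name, decls in declarations.items() if len(decls) > 1)


variables = {
    "road": {
        "attrs": {"flag_values": [0, 1], "flag_meanings": "safe unsafe"},
        "encoding": {"enum": "safety"},
    },
    "bridge": {
        # the user added a level here but forgot to update "road"
        "attrs": {"flag_values": [0, 1, 2], "flag_meanings": "safe unsafe very_unsafe"},
        "encoding": {"enum": "safety"},
    },
}
print(find_diverging_enums(variables))
```

A writer could run such a check before creating the shared EnumType and refuse to guess which declaration is the intended one.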

kmuehlbauer · 2023-09-14T12:09:37Z

Hi @kmuehlbauer, thanks for linking the fill_value issue, interesting read. I would be willing to try to open a PR to fix that (fetching the mask, getting the implicit fill_value and making it explicit in attrs) once I'm done with Enums.

This has been tried before, but there was no conclusion on how to handle this, see #5680 (comment) and ff.

Regarding:

iv. do nothing on read (just read), then raise on write with a meaningful error

Looks better than raising an error on read, but might be frustrating for user if they do all their modifications and get an error on calling to_netcdf. In particular if the opening, processing and saving is done by a library, then it would be the role of this library to reject or fix the file. Maybe printing a warning at the opening of the file and having a flag has_valid_enum_fill_value would help ?

Warnings might be overlooked or disabled by users. They might help a bit, though. The users might be frustrated because their source data is broken, but an explicit error message raised by xarray describing the problem should help them to fix things before another write attempt. And, just to note that, without that the error would be thrown by netCDF4-python (as is now).

As for:

create flag_meanings and flag_values attributes on read [...] [and] encoding["enum"]="my_enum_name". On write, those attributes should be removed and the netCDF4.EnumType be created instead.

I like the idea: it's simpler than mine and, from xarray's point of view, looks elegant and makes it easy to be CF compliant. In particular, we avoid some of the inconsistencies described by @samain-eum because xarray users will not have access to the EnumType, so they can only modify flag_meanings and flag_values to update the enum declaration.

I'm interested in how to modify an existing enum. I could not find anything about that in the netCDF4-python docs.

However, one EnumType may be used in several variables, possibly in several groups. In my opinion, the biggest issue with flag_meanings and flag_values is how to synchronize them across variables when one changes: Say I have an enum like: safety: {"safe": 0, "unsafe": 1}. As this is a very generic enum, it can be used in several groups and variables. We would be tempted to declare it once in a parent group, just like you would do with a dimension. However, if our xarray user wants to add a level "very_unsafe": 2, it would be tedious to keep all the flag_meanings and flag_values in other variables consistent. Even without groups, in the same Dataset we could have multiple DataArrays which rely on the same enum and we would have the same consistency issue. Whereas the EnumType in netCDF4 is declared once, and modifying it would update every variable typed with it.

This is one thing which might be a problem, but the backends can do this kind of discovery (e.g. for dimensions, IIRC). I have doubts that the enum type can be updated without touching/rewriting all connected variables. The enum type is directly written to the HDF5 dataset (see h5dump), besides being declared as DATATYPE.

DATATYPE "my_enum" H5T_ENUM {
      H5T_STD_I16LE;
      "a"                1;
      "b"                2;
   };
 DATASET "my_var" {
    DATATYPE  H5T_ENUM {
       H5T_STD_I16LE;
       "a"                1;
       "b"                2;
    }
  :
  :

Looking at this, I'm also interested in how netcdf-c maps the declared DATATYPE to already existing DATASETs with that DATATYPE. And why did netcdf decide to have named enum types when, apparently, every DATASET has the enum type attached to itself? I'll do some experimenting myself, also for the implementation in h5netcdf.

Another interesting note is that h5py adds a little metadata to its numpy dtype to mark it as an enum:

import h5py

dt = h5py.enum_dtype({"RED": 0, "GREEN": 1, "BLUE": 42}, basetype='i')
print(dt.type)
# [out] <class 'numpy.int32'>
print(dt.metadata)
# [out] {'enum': {'RED': 0, 'GREEN': 1, 'BLUE': 42}}

kmuehlbauer · 2024-01-11T09:15:59Z

I've added my suggestions and also removed the test file which slipped in. Needed to special case h5netcdf for now until upstream has added this feature (named enum types).

To make it more explicit, I had to add NativeEnumCoder to add the dtype metadata in the encoding step. We can still have a dedicated CFEnumCoder later (flags* stuff).

With the metadata trick this works also for h5netcdf backend (with invalid_netcdf=True). It writes a transient enum to the dataset (as netCDF4 does) but does not create the named enum type in the file. This has to be implemented in h5netcdf.
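For readers unfamiliar with the "metadata trick", the following NumPy-only sketch shows the general idea: a dtype can carry an arbitrary metadata dict, so the enum declaration can travel alongside an ordinary integer dtype. The `enum` key mirrors what h5py does (see above); the `enum_name` key and the `get_enum` helper are assumptions for illustration and are not taken from the NativeEnumCoder code.

```python
import numpy as np

enum_dict = {"clear": 0, "cloudy": 1}

# An unsigned-byte dtype that carries the enum declaration as metadata.
# "enum_name" is an assumed key used here to remember the type's name.
enum_dtype = np.dtype("u1", metadata={"enum": enum_dict, "enum_name": "cloud_type"})


def get_enum(dtype):
    """Return the enum dict attached to a dtype, or None if there is none."""
    meta = np.dtype(dtype).metadata or {}
    return meta.get("enum")


print(get_enum(enum_dtype))
print(get_enum(np.dtype("u1")))
```

NumPy's documentation cautions that dtype metadata is not well supported and may not propagate through every operation, so code relying on it should re-check for the metadata right before writing.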

I always have trouble with typing, so I'd appreciate any help with this. Besides the typing, this is ready. @bzah is this also good to go from your side? We might have to add a note to the io-docs.

bzah · 2024-01-11T10:32:47Z

Many thanks @kmuehlbauer for these improvements. I will have a look at mypy issues.

bzah · 2024-01-11T10:59:30Z

xarray/backends/netCDF4_.py

        attributes = {k: var.getncattr(k) for k in var.ncattrs()}
+        data = indexing.LazilyIndexedArray(NetCDF4ArrayWrapper(name, self))
+        encoding: dict[str, Any] = {}


A TypedDict for encoding and its possible values would be cleaner.

AFAICS this is taken care of in #8520. So if #8520 goes in first, we should change this here.

kmuehlbauer · 2024-01-11T14:28:30Z

@dcherian I think this is ready for final round of review. Can't get the one windows run to work, error seems unrelated. Thanks @bzah for fixing the typing.

kmuehlbauer

Beside the one doubled same this is LGTM.

xarray/backends/netCDF4_.py

kmuehlbauer · 2024-01-15T06:48:07Z

xarray/backends/netCDF4_.py

        attributes = {k: var.getncattr(k) for k in var.ncattrs()}
+        data = indexing.LazilyIndexedArray(NetCDF4ArrayWrapper(name, self))
+        encoding: dict[str, Any] = {}


AFAICS this is taken care of in #8520. So if #8520 goes in first, we should change this here.

xarray/backends/netCDF4_.py

kmuehlbauer · 2024-01-16T07:48:39Z

@dcherian I'm about to merge this, but I'm a bit unsure about the failing test. It looks like it's not related, but every once in a while all tests succeed. Is this a flaky thing and can we merge anyway? Thanks!

kmuehlbauer · 2024-01-17T07:20:31Z

Thanks @bzah for sticking with us and pushing this through!

bzah · 2024-01-17T08:09:13Z

Many thanks @kmuehlbauer , @dcherian and others for making it possible !

dcherian · 2024-01-17T16:44:27Z

Thanks @bzah and @kmuehlbauer . It'd be nice to follow up with some docs on how to create a new variable that gets encoded to Enum on write.

kmuehlbauer · 2024-01-17T19:10:50Z

@dcherian Definitely! I'm about to release the EnumType feature in h5netcdf in the next few days. I'll open a PR with the necessary changes on the xarray side and will add documentation appropriately. Thanks for pointing that out!

github-actions bot added topic-backends topic-CF conventions topic-cftime io labels Sep 5, 2023

bzah force-pushed the enh/add-enum-support branch from 59e7390 to bdfa8ce Compare September 5, 2023 17:22

bzah commented Sep 6, 2023


xarray/conventions.py Outdated Show resolved Hide resolved

bzah changed the title ~~WIP: naive implementation of enum~~ WIP: Add enum support Sep 6, 2023

Abel Aoun added 3 commits September 12, 2023 15:21

ENH: make a light refactoring

f1bc33b

Reuse instead of duplicating function.

dirty commit

4da8938

Clean

ab53970

Remove attempts to workaround the fill_value issues.

bzah force-pushed the enh/add-enum-support branch from ced991c to ab53970 Compare September 13, 2023 10:22

github-actions bot removed the topic-cftime label Sep 13, 2023

pre-commit-ci bot and others added 2 commits September 13, 2023 10:23

[pre-commit.ci] auto fixes from pre-commit.com hooks

75e00c7

for more information, see https://pre-commit.ci

wip: fix tests

e1d51e3

kmuehlbauer mentioned this pull request Sep 13, 2023

Add support for netcdf4 enum #8144

Closed

Abel Aoun added 3 commits September 15, 2023 13:18

dirty

95e30b2

clean

a3160c5

Remove dict from valid attrs type

59ef686

kmuehlbauer added 4 commits January 11, 2024 09:44

add NativeEnumCoder, adapt tests

89a8751

remove test-file

ac20a40

restructure datatype extraction

d515e0d

use invalid_netcdf for h5netcdf tests

5c66563

kmuehlbauer self-requested a review January 11, 2024 09:17

FIX: encoding typing

d62ac29

bzah commented Jan 11, 2024


kmuehlbauer changed the title ~~Add enum support~~ Add support for netCDF4.EnumType Jan 11, 2024

kmuehlbauer mentioned this pull request Jan 14, 2024

ENH: add EnumType h5netcdf/h5netcdf#226

Merged

3 tasks

kmuehlbauer approved these changes Jan 14, 2024


xarray/backends/netCDF4_.py Outdated Show resolved Hide resolved

Update xarray/backends/netCDF4_.py

f834ede

kmuehlbauer added the plan to merge Final call for comments label Jan 14, 2024

kmuehlbauer approved these changes Jan 15, 2024


kmuehlbauer reviewed Jan 15, 2024


xarray/backends/netCDF4_.py Show resolved Hide resolved

kmuehlbauer added 2 commits January 15, 2024 09:12

Merge branch 'main' into enh/add-enum-support

9a3980a

Merge branch 'main' into enh/add-enum-support

2a3103f

kmuehlbauer closed this Jan 16, 2024

kmuehlbauer reopened this Jan 16, 2024

Merge branch 'main' into enh/add-enum-support

f22046d

kmuehlbauer merged commit d20ba0d into pydata:main Jan 17, 2024
26 checks passed

bzah deleted the enh/add-enum-support branch January 17, 2024 08:09

bzah mentioned this pull request Feb 6, 2024

MNT: Update enum parsing xarray-contrib/xncml#68

Merged

2 tasks
