Add support for netcdf4 enum #8144

Closed
bzah opened this issue Sep 4, 2023 · 10 comments · Fixed by #8147

Comments

@bzah
Contributor

bzah commented Sep 4, 2023

Is your feature request related to a problem?

When a netCDF file contains netCDF4 enums, xarray ignores the underlying enum type.
The association between the values of the variable and their actual meaning is then lost.

MRE:

import netCDF4 as nc
import xarray as xr

# -- Create dataset with an enum using the netcdf4 lib
ds = nc.Dataset("mre.nc", "w", format="NETCDF4")   
cloud_type_enum = ds.createEnumType(int,"cloud_type",{"clear":0, "cloudy":1})
print(ds.enumtypes)
# {'cloud_type': <class 'netCDF4._netCDF4.EnumType'>: name = 'cloud_type', numpy dtype = int64, fields/values ={'clear': 0, 'cloudy': 1}} 
ds.createVariable("cloud", cloud_type_enum)
ds["cloud"][0] = 1
ds.close()

# -- Open dataset with xarray
xr_ds = xr.open_dataset("./mre.nc")
print(xr_ds.cloud)
# <xarray.DataArray 'cloud' ()> \n [1 values with dtype=int64]   
# --> We get no metadata about the cloud_type enum that we created above 
xr_ds.to_netcdf("mre_xr.nc")

# -- Open xarray outputted dataset with netCDF4 lib
print(nc.Dataset("mre_xr.nc", "r").enumtypes)  # enumtypes is a dict attribute, not a method
# {}
# --> Empty dictionary: the enum we created is lost

If you know CF, enums could replace flag_meanings and flag_values, see CF.
Enums are not yet part of CF though.

Describe the solution you'd like

As far as I understand, to describe the enum we only need a dictionary that maps numbers (enum keys) to strings (enum values) and a way to reference this dictionary in variables that are "typed" to this enum.
Bear in mind that the dtype of the variable would still be a number; the enum type would be secondary metadata.
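
A minimal sketch of that idea in plain Python, not xarray's actual API. The names `ENUM_TYPES`, `cloud_var`, and `labels_for` are hypothetical, and the variable is modeled as a simple dict. The data stays integer; the enum is looked up by name only when labels are needed.

```python
# Hypothetical registry of enum types, keyed by enum name.
# Each enum maps an integer value to its label.
ENUM_TYPES = {
    "cloud_type": {0: "clear", 1: "cloudy"},
}

# Hypothetical stand-in for a variable: integer data plus a reference
# to the enum by name (the data itself stays numeric).
cloud_var = {"data": [1, 0, 1], "enum": "cloud_type"}


def labels_for(var, enum_types):
    """Translate a variable's integer values to their enum labels."""
    mapping = enum_types[var["enum"]]
    return [mapping[value] for value in var["data"]]


print(labels_for(cloud_var, ENUM_TYPES))
```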

Describe alternatives you've considered

Most people who produce data could get away with using flag_meanings and flag_values to describe their data in a way that is both CF-compliant and properly handled by xarray.
For me, the only workaround at the moment is to use the netCDF4 library directly.

Additional context

nc.__version__
# 1.6.2

xr.__version__
# 2023.2.0
@welcome

welcome bot commented Sep 4, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@bzah
Contributor Author

bzah commented Sep 6, 2023

Also, it might be nice to have a way to automatically translate variables typed with enums to the CF flag_meanings and flag_values attributes, and the other way around too.
This should probably be in cf-xarray though.
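
A rough sketch of what such a translation could look like, assuming the enum is available as a `{name: value}` dict (the shape passed to `createEnumType` in the MRE). The helper names `enum_to_cf_flags` and `cf_flags_to_enum` are hypothetical and are not part of xarray or cf-xarray.

```python
def enum_to_cf_flags(enum_dict):
    """Convert a {name: value} enum dict to CF flag attributes."""
    names = list(enum_dict)
    return {
        "flag_values": [enum_dict[name] for name in names],
        # CF stores the meanings as one blank-separated string.
        "flag_meanings": " ".join(names),
    }


def cf_flags_to_enum(attrs):
    """Rebuild a {name: value} enum dict from CF flag attributes."""
    names = attrs["flag_meanings"].split()
    return dict(zip(names, attrs["flag_values"]))


cloud_type = {"clear": 0, "cloudy": 1}
flags = enum_to_cf_flags(cloud_type)
print(flags)
print(cf_flags_to_enum(flags) == cloud_type)
```

Because flag_meanings is blank-separated, this round trip only works for enum names that contain no whitespace.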

@kmuehlbauer
Contributor

I'll add ncdump/h5dump of the above file for those who are interested:

ncdump -h mre.nc

netcdf mre {
types:
  int64 enum cloud_type {clear = 0, cloudy = 1} ;
variables:
	cloud_type cloud ;
data:

 cloud = cloudy ;
}

h5dump mre.nc

HDF5 "mre.nc" {
GROUP "/" {
   ATTRIBUTE "_NCProperties" {
      DATATYPE  H5T_STRING {
         STRSIZE 34;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "version=2,netcdf=4.9.2,hdf5=1.14.1"
      }
   }
   DATASET "cloud" {
      DATATYPE  H5T_ENUM {
         H5T_STD_I64LE;
         "clear"            0;
         "cloudy"           1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): cloudy
      }
   }
   DATATYPE "cloud_type" H5T_ENUM {
      H5T_STD_I64LE;
      "clear"            0;
      "cloudy"           1;
   };
}
}

@dcherian
Contributor

dcherian commented Sep 9, 2023

A core problem is that there isn't an enumerated array type other than the 1D pandas.Categorical so we can't decode to anything very useful.

IMO the best solution at the moment is to "decode" by saving the enum dictionary as an attribute at read time and "encode" by specifying enum types at write time.

> it might be nice to have a way to automatically translate variable typed with enums to the CF flag_meanings, flag_values attributes. And the other way around too.
> This should probably be in cf-xarray though.

👍 PR welcome!

@kmuehlbauer
Contributor

@bzah Thanks for tackling this. We've just discussed this at the dev-meeting.

It should be decoded into encoding or attrs as suggested by @dcherian #8147 (comment).

For the user, it might be more discoverable that a certain DataArray is of enum type if we attached the information as attributes. This would follow the pattern of flag_meanings and flag_values.

To roundtrip, we could use the .encoding dtype key (e.g. dtype='enum') and special-case this in the backends.

@bzah
Contributor Author

bzah commented Sep 14, 2023

Ah snap, I would have tagged along if I had known the dev-meeting was yesterday, my bad.
I will try to come for the meeting of 11.10.2023, if this issue is not yet closed.

The only issue with attrs so far is that when saving the file with netCDF4 we have to remove the enum dict from the attributes because an attribute cannot be a dict for netCDF4.
I was also thinking about adding a str and/or html_repr for DataArrays that have an enum, to show the enum values instead of the int flags (same as ncdump does). Maybe in the next PR.
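
The attrs limitation mentioned above (the enum dict has to be removed before the attributes are handed to netCDF4) could be handled roughly like this. This is only a sketch: the helper name `split_enum_attr`, the attribute key `"enum"`, and the `long_name` value are assumptions made for illustration, not what the PR implements.

```python
def split_enum_attr(attrs, key="enum"):
    """Return (attrs without the enum dict, the enum dict or None).

    The input mapping is left unmodified.
    """
    serializable = {k: v for k, v in attrs.items() if k != key}
    return serializable, attrs.get(key)


attrs = {"long_name": "cloud cover", "enum": {"clear": 0, "cloudy": 1}}
plain_attrs, enum_dict = split_enum_attr(attrs)
print(plain_attrs)
print(enum_dict)
```

The plain attributes can then be written as usual, while the enum dict is used separately to create the enum type on the file.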

@kmuehlbauer
Contributor

kmuehlbauer commented Sep 14, 2023

@bzah Yeah, sorry, I put this on the agenda only shortly before the meeting. You are very welcome to attend any meeting. I've added some ideas on your PR about how the suggested attributes/encoding solution could be laid out.

@bzah
Contributor Author

bzah commented Sep 22, 2023

In case people are interested, there will be a lightning talk and a hackathon session about adding Enums in CF at the CF workshop.
It's online and the dates are 3rd-5th of October. See cf-convention/discuss#243 for registration and details.

@dcherian
Contributor

dcherian commented Jan 5, 2024

What do you think of saving an "enum" entry in attrs with a Python Enum as value? Conversion to flag_* can be done later, e.g. through cf-xarray.

That feels a lot more explicit to me.

@bzah
Contributor Author

bzah commented Jan 8, 2024

If you mean generating the Enum with something like attrs["enum"] = Enum("cloudiness", {"cloudy": 42, "not_cloudy": 0 } )
that would work for me.
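
For reference, a small sketch of how such an attribute could be used, relying only on the standard library's functional `Enum` API. The `attrs` dict here is a plain stand-in for a DataArray's attributes, and the `"enum"` key is the one proposed above rather than an established xarray convention.

```python
from enum import Enum

# Functional API: build an Enum from a {name: value} mapping.
Cloudiness = Enum("cloudiness", {"cloudy": 42, "not_cloudy": 0})

attrs = {"enum": Cloudiness}

# Look up a member by its stored integer value, then read its name.
print(attrs["enum"](42).name)

# Recover the plain {name: value} dict, e.g. to write an enum type to disk.
print({member.name: member.value for member in attrs["enum"]})
```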

3 participants