Description
Two steps of the netcdf3 encoding process trigger unexpected computation in some specific cases. This happens when the function xarray.backends.netcdf3.encode_nc3_variable is called. It concerns saving to netcdf with the scipy engine, but also with netcdf4 when format='NETCDF4_CLASSIC' is requested. In both sub-issues, passing compute=False has no effect.
Issue A
Let's say one has a dataset with a dask-backed data variable that is a count of days (e.g. a yearly series of the number of days over 30°C in a region). Units are set to "days" and the dtype of the data is "int32". When saving to netcdf with format='NETCDF4_CLASSIC', the computation is triggered here:
xarray/backends/netcdf3.py, lines 106 to 118 in e0ad952:

```python
def _maybe_prepare_times(var):
    # checks for integer-based time-like and
    # replaces np.iinfo(np.int64).min with _FillValue or np.nan
    # this keeps backwards compatibility
    data = var.data
    if data.dtype.kind in "iu":
        units = var.attrs.get("units", None)
        if units is not None and coding.variables._is_time_like(units):
            mask = data == np.iinfo(np.int64).min
            if mask.any():
                data = np.where(mask, var.attrs.get("_FillValue", np.nan), data)
    return data
```
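For reference, a minimal reproducer along these lines hits the eager mask.any() evaluation (file name, variable names and array sizes here are purely illustrative):

```python
import dask.array as da
import xarray as xr

# yearly count of days over 30°C, stored lazily as int32 with units="days"
counts = da.zeros((100, 500, 500), dtype="int32", chunks=(10, 500, 500))
ds = xr.Dataset({"hot_days": (("year", "y", "x"), counts, {"units": "days"})})

# saving as NETCDF4_CLASSIC goes through encode_nc3_variable, and
# _maybe_prepare_times evaluates mask.any() eagerly, even with compute=False
delayed = ds.to_netcdf("hot_days.nc", format="NETCDF4_CLASSIC", compute=False)
```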
My guess is that this method was written with coordinates in mind, i.e. variables that almost never use dask, or are at least pretty light. In my situation, the data variable can be quite heavy, and I don't want to trigger that computation twice.
I'm not sure the process is needed for timedelta variables, however. Maybe the problem could be reduced by skipping the replacement when coding.variables._is_time_like(units) == 'timedelta'. The issue would still happen for datetime data variables.
Maybe the np.where line could also be applied without checking mask.any() first? If mask.any() is False, the np.where is a no-op anyway, but at least the computation is not triggered and the result stays a dask array. A sketch combining both ideas follows below.
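A sketch of what _maybe_prepare_times could look like with both ideas applied. It is untested, keeps the module's existing np and coding imports, and assumes _is_time_like returns the string "timedelta" for timedelta-like units; np.where dispatches to dask.array.where on dask inputs, so the replacement stays lazy:

```python
def _maybe_prepare_times(var):
    # replaces np.iinfo(np.int64).min with _FillValue or np.nan for
    # integer-based time-like variables, keeping backwards compatibility
    data = var.data
    if data.dtype.kind in "iu":
        units = var.attrs.get("units", None)
        time_kind = coding.variables._is_time_like(units) if units is not None else False
        # skip timedelta variables, where the sentinel replacement may not
        # be needed (assumption taken from the discussion above)
        if time_kind and time_kind != "timedelta":
            # without the mask.any() guard, np.where is a no-op when no
            # sentinel value is present; on dask data it dispatches to
            # dask.array.where and does not trigger computation
            data = np.where(
                data == np.iinfo(np.int64).min,
                var.attrs.get("_FillValue", np.nan),
                data,
            )
    return data
```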
Issue B
The encoding process then calls coerce_nc3_dtype. When non-netcdf3 integer dtypes are encountered, the data is cast and the verification step triggers computation.
xarray/backends/netcdf3.py, lines 77 to 85 in e0ad952:

```python
dtype = str(arr.dtype)
if dtype in _nc3_dtype_coercions:
    new_dtype = _nc3_dtype_coercions[dtype]
    # TODO: raise a warning whenever casting the data-type instead?
    cast_arr = arr.astype(new_dtype)
    if not (cast_arr == arr).all():
        raise ValueError(
            COERCION_VALUE_ERROR.format(dtype=dtype, new_dtype=new_dtype)
        )
```
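Issue B can also be hit without any time units at all, for instance with a plain int64 variable (a minimal illustration; it assumes the platform's default integer dtype is int64):

```python
import dask.array as da
import xarray as xr

# int64 is not a netCDF3 dtype, so coerce_nc3_dtype casts to int32 and then
# verifies the cast with (cast_arr == arr).all(), computing the whole graph
ds = xr.Dataset({"x": (("t",), da.arange(1_000_000, chunks=100_000))})
delayed = ds.to_netcdf("x.nc", format="NETCDF4_CLASSIC", compute=False)
```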
Solutions
For issue A I had the ideas above, but I don't have any for issue B. Maybe in the case of dask-backed arrays, a warning could be raised instead of performing the check? Something like: "Data was in non-supported integer type {dtype_in} and was converted to {dtype_out}. This could lead to errors."
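A rough sketch of what that could look like (hypothetical behaviour, not a final proposal; it reuses the existing module-level _nc3_dtype_coercions and COERCION_VALUE_ERROR, and tests for dask with a plain isinstance where xarray itself would likely use its duck-array helpers):

```python
import warnings

import dask.array as da


def coerce_nc3_dtype(arr):
    dtype = str(arr.dtype)
    if dtype in _nc3_dtype_coercions:
        new_dtype = _nc3_dtype_coercions[dtype]
        cast_arr = arr.astype(new_dtype)
        if isinstance(arr, da.Array):
            # verifying the cast would compute the dask graph, so warn
            # instead of checking (hypothetical behaviour)
            warnings.warn(
                f"Data was in non-supported integer type {dtype} and was "
                f"converted to {new_dtype}. This could lead to errors."
            )
        elif not (cast_arr == arr).all():
            raise ValueError(
                COERCION_VALUE_ERROR.format(dtype=dtype, new_dtype=new_dtype)
            )
        return cast_arr
    return arr
```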
For users, the simplest workaround for issue A is to convert the data to float, and for issue B to convert it to one of int32, int16 or int8.
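In code, reusing the illustrative variable names from the examples above:

```python
# issue A: floats skip _maybe_prepare_times (dtype.kind is no longer "i"/"u")
ds["hot_days"] = ds["hot_days"].astype("float32")

# issue B: int32 is a valid netCDF3 dtype, so no coercion check is needed
ds["x"] = ds["x"].astype("int32")

ds.to_netcdf("out.nc", format="NETCDF4_CLASSIC", compute=False)
```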