Writing integer data in netcdf3 (or classic) is not always lazy #10826

@aulemahal

Description

Two steps of the netCDF3 encoding process trigger unexpected computation in some specific cases. This happens when the function xarray.backends.netcdf3.encode_nc3_variable is called. It concerns saving to netCDF with the scipy engine, but also with the netcdf4 engine when format='NETCDF4_CLASSIC' is requested. In both sub-issues, passing compute=False has no effect.

Issue A

Let's say one has a dataset with a dask-backed data variable that is a count of days (e.g. a yearly series of the number of days over 30°C in a region). Its units attribute is "days" and its dtype is "int32". When we save to netCDF with format='NETCDF4_CLASSIC', the computation is triggered here:

def _maybe_prepare_times(var):
    # checks for integer-based time-like and
    # replaces np.iinfo(np.int64).min with _FillValue or np.nan
    # this keeps backwards compatibility
    data = var.data
    if data.dtype.kind in "iu":
        units = var.attrs.get("units", None)
        if units is not None and coding.variables._is_time_like(units):
            mask = data == np.iinfo(np.int64).min
            if mask.any():
                data = np.where(mask, var.attrs.get("_FillValue", np.nan), data)
    return data
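
For context, here is a minimal sketch of a setup that hits this path (the variable and file names are made up; it assumes dask and netCDF4 are installed):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "days_above_30": (
            ("year", "region"),
            np.random.randint(0, 365, size=(50, 100), dtype="int32"),
            {"units": "days"},
        )
    }
).chunk({"year": 10})

# even with compute=False, the dask graph is computed during encoding
delayed = ds.to_netcdf("counts.nc", format="NETCDF4_CLASSIC", compute=False)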

My guess is that this method was written with coordinates in mind, i.e. variables that almost never use dask, or are at least pretty light. In my situation the data variable can be quite heavy, requiring a lot of computation that I don't want to trigger twice. The trigger is the "if mask.any():" line: for dask-backed data, mask.any() is itself lazy, and using it in an if statement coerces it to a bool, which computes the whole mask.

I'm not sure the process is needed for timedelta variables, however. Maybe the problem could be reduced by testing coding.variables._is_time_like(units) == 'timedelta' and skipping the replacement in that case. The issue would still happen for datetime data variables.

Maybe the np.where line could also be applied without checking mask.any() first? If mask.any() is False, the np.where is a no-op anyway, but at least the computation is not triggered and the result stays a dask array.
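
Combining both ideas, a rough sketch of what this could look like (untested; it assumes _is_time_like returns 'datetime' / 'timedelta' / falsy as the comparison above implies, and relies on np.where dispatching to dask for dask inputs so the result stays lazy):

def _maybe_prepare_times(var):
    # replace np.iinfo(np.int64).min with _FillValue or np.nan, but only for
    # datetime-like units, and without forcing the mask to be computed
    data = var.data
    if data.dtype.kind in "iu":
        units = var.attrs.get("units", None)
        if units is not None and coding.variables._is_time_like(units) == "datetime":
            fill = var.attrs.get("_FillValue", np.nan)
            # no mask.any() short-circuit: np.where is a no-op when the mask is
            # all False, and it stays lazy for dask-backed data
            data = np.where(data == np.iinfo(np.int64).min, fill, data)
    return data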

Issue B

The encoding process then calls coerce_nc3_dtype. When non-netCDF3 integer dtypes are encountered, the data is cast and the verification step triggers computation.

dtype = str(arr.dtype)
if dtype in _nc3_dtype_coercions:
    new_dtype = _nc3_dtype_coercions[dtype]
    # TODO: raise a warning whenever casting the data-type instead?
    cast_arr = arr.astype(new_dtype)
    if not (cast_arr == arr).all():
        raise ValueError(
            COERCION_VALUE_ERROR.format(dtype=dtype, new_dtype=new_dtype)
        )
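
Here the eager evaluation comes from the if statement: for a dask-backed array, (cast_arr == arr).all() is itself a lazy 0-d dask array, and "if not ..." coerces it to a bool, which forces a compute. A tiny illustration (assuming dask is installed):

import dask.array as da
import numpy as np

arr = da.from_array(np.arange(10, dtype="int64"), chunks=5)
cast_arr = arr.astype("int32")

check = (cast_arr == arr).all()  # still lazy: a 0-d dask array
print(type(check))               # <class 'dask.array.core.Array'>
bool(check)                      # coercion to bool triggers the computation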

Solutions

For issue A, I had some ideas (above), but I don't have any for issue B. Maybe in the case of dask-backed arrays, a warning could be raised instead of performing the check? Something like: "Data was in non-supported integer type {dtype_in} and was converted to {dtype_out}. This could lead to errors."
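
To make that idea concrete, a rough sketch (not a tested patch; is_duck_dask_array and the exact message are only placeholders for whatever xarray would actually use):

import warnings

def coerce_nc3_dtype(arr):
    dtype = str(arr.dtype)
    if dtype in _nc3_dtype_coercions:
        new_dtype = _nc3_dtype_coercions[dtype]
        cast_arr = arr.astype(new_dtype)
        if is_duck_dask_array(arr):
            # skip the eager verification for dask-backed data and warn instead
            warnings.warn(
                f"Data was in non-supported integer type {dtype} and was "
                f"converted to {new_dtype} without verification because it is "
                "dask-backed. This could lead to errors."
            )
        elif not (cast_arr == arr).all():
            raise ValueError(
                COERCION_VALUE_ERROR.format(dtype=dtype, new_dtype=new_dtype)
            )
        return cast_arr
    return arr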

For users, the simplest workaround for issue A is to convert the data to a float dtype, and for issue B to convert it to one of int32, int16 or int8 before saving.
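
For example (ds and the variable names below stand for the dataset described above):

# issue A: a float dtype skips the integer time-like masking entirely
ds["days_above_30"] = ds["days_above_30"].astype("float32")

# issue B: int32 (or int16 / int8) is already a valid netCDF3 dtype,
# so no coercion check is needed
ds["some_int64_var"] = ds["some_int64_var"].astype("int32")

delayed = ds.to_netcdf("counts.nc", format="NETCDF4_CLASSIC", compute=False)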
