`to_zarr` with `mode='a'` fails when `_FillValue` is present, caused by `open_zarr` with `mask_and_scale=False` #9053
Comments
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
Hi @cpegel, I just stumbled on the same problem while rechunking some large datasets. This also occurs by just doing […]. I think fixing this is very relevant! One or two years ago, I was using the […] So something must have changed in the past year or so!
MVCE for our use case: […]
@ghiggi Can you explain why you are using packed data ([…])?
Hey @kmuehlbauer. I am not sure I understand your question. I had a long day. What do you refer to with […]?
@ghiggi Unpacking works as `unpacked = packed * scale_factor + add_offset`. This means that any decimal places which are there in the unpacked data are effectively removed when packing. I can't see how you get back any decimal places when unpacking. Maybe I'm missing something, though (see the numeric sketch after this comment).

But nevertheless I think I've detected the root cause of the issue:

xarray/xarray/backends/zarr.py, lines 674 to 698 at 1f3bc7e
In this block the existing variables in the store are decoded (even if the dataset was opened with `mask_and_scale=False`). I do not immediately see a solution to this. Maybe just checking if the data to be written has […].
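A small numeric sketch of the packing arithmetic discussed above (the values, `scale_factor`, `add_offset`, and the int16 storage dtype are illustrative, not taken from the thread):

```python
import numpy as np

# CF-style packing: packed = round((unpacked - add_offset) / scale_factor),
# stored in an integer dtype; unpacking reverses the affine transform.
scale_factor, add_offset = 1.0, 0.0  # illustrative values
unpacked = np.array([21.57, 22.34])

packed = np.round((unpacked - add_offset) / scale_factor).astype("int16")
restored = packed * scale_factor + add_offset

print(packed)    # [22 22]   -- the decimal places are lost at pack time
print(restored)  # [22. 22.] -- and unpacking cannot recover them
```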
Hey @kmuehlbauer. Shame on me. I was too tired yesterday to notice that in the MVCE I provided I unconsciously specified […].
This is a tricky issue. From a fundamental point of view, this is challenging because there are really two layers of encoding: Xarray's encoding pipeline and the Zarr codecs. In the Zarr world, codecs handle all of the data transformations, while attrs are used to hold user-defined metadata. Xarray overloads the Zarr attrs with encoding information, and Zarr doesn't really know or understand this.

The opinion I have developed is that it would be best if we could push as much encoding as possible down to the Zarr layer, rather than Xarray. (This is already possible today for scale_factor and add_offset.) When we originally implemented the Zarr backend, we just copied a lot from NetCDF, which delegated lots of encoding to Xarray. But, especially with Zarr V3, we have the ability to create new, custom codecs that do exactly what we want. CF-time encoding, for example, could be a Zarr codec.

In the case where Xarray is handling the encoding, I feel like we need a more comprehensive set of compatibility checks between the encoding attributes and the Zarr attributes on disk in the existing store. Right now these are quite ad hoc. We need a way to clearly distinguish which Zarr attributes are actually encoding parameters.
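As a sketch of what pushing scale/offset encoding down to the Zarr layer can look like today (store path, variable name, and codec parameters are illustrative; this assumes a zarr-python 2 store, where xarray's Zarr encoding accepts a `filters` list):

```python
import numpy as np
import xarray as xr
from numcodecs import FixedScaleOffset

ds = xr.Dataset({"t": ("x", np.array([21.57, 22.34, 23.11]))})

# The codec packs round((x - offset) * scale) into int16 on write and
# reverses the transform on read, so no scale_factor/add_offset attrs
# are needed at the xarray level.
encoding = {
    "t": {"filters": [FixedScaleOffset(offset=20.0, scale=100, dtype="f8", astype="i2")]}
}
ds.to_zarr("packed.zarr", mode="w", encoding=encoding)

# Decoding happens inside Zarr during chunk reads; any client that
# implements the codec sees unpacked float values.
roundtrip = xr.open_zarr("packed.zarr")
```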
Though wouldn't using a custom codec for writing potentially create some interoperability challenges with other clients?
"Custom codec" here means an official Zarr extension. We now have this possibility in Zarr V3. Yes, implementations would have to support it. It's important to recognize that we are already essentially using such custom codecs (without any spec at the Zarr level) when we let Xarray handle the encoding. If a non-Xarray Zarr client opens data with cf-time encoding, or |
👍 to an official codec. I agree it's good to formalize and make the user experience more robust. I don't think it's necessarily better to provide an error rather than undecoded data. Undecoded data can be introspected and allows the possibility for the user to handle it manually. An error is a full stop, with no workaround. It is, by definition, user-hostile, especially in the case where the user is not responsible for creating the file.
What happened?
When opening a Zarr dataset with `open_zarr` and then writing to it using `to_zarr`, we get a `ValueError` when the `_FillValue` attribute of at least one data variable or coordinate is present. This happens, for example, when opening the dataset with `mask_and_scale=False`.

The error can be worked around by deleting the `_FillValue` attributes of all data variables and coordinates before calling `to_zarr` (see the sketch below). However, if the Zarr metadata contains meaningful `fill_value` attributes beforehand, they will then be lost after a round-trip of `open_zarr` (with `mask_and_scale=False`) and `to_zarr`.
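A minimal sketch of that workaround (the store path is an illustrative placeholder):

```python
import xarray as xr

ds = xr.open_zarr("example.zarr", mask_and_scale=False)

# Drop _FillValue from every data variable and coordinate before writing
# back; note that this discards the fill_value metadata in the store.
for name in list(ds.data_vars) + list(ds.coords):
    ds[name].attrs.pop("_FillValue", None)

ds.to_zarr("example.zarr", mode="a")
```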
The same behavior, but without the cause being `mask_and_scale=False`, has been reported in the open issues #6069 and #6329 without any solutions.

What did you expect to happen?
I expect to be able to read from and then write to a Zarr store without having to delete attributes in between or lose available `fill_value` metadata from the store. Calling `to_zarr` with `mode='a'` should just write any DataArray's `_FillValue` attribute to the `fill_value` field in the Zarr metadata instead of failing with a `ValueError`.

Minimal Complete Verifiable Example
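The original code block did not survive extraction; the following is a hedged reconstruction from the description above (store path, variable name, and fill value are illustrative):

```python
import numpy as np
import xarray as xr

# Create a store whose variable carries fill-value metadata.
ds = xr.Dataset({"var": ("x", np.array([1.0, 2.0, np.nan]))})
ds.to_zarr("example.zarr", mode="w", encoding={"var": {"_FillValue": -9999.0}})

# Re-opened without masking/scaling, _FillValue ends up in .attrs
# rather than in .encoding.
ds2 = xr.open_zarr("example.zarr", mask_and_scale=False)

# Writing back with mode='a' raises a ValueError about _FillValue.
ds2.to_zarr("example.zarr", mode="a")
```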
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 07:53:56) [MSC v.1937 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('de_DE', 'cp1252')
libhdf5: None
libnetcdf: None
xarray: 2024.3.0
pandas: 2.2.0
numpy: 1.26.4
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.18.0
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.4.1
distributed: 2024.4.1
matplotlib: 3.8.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 24.0
conda: None
pytest: None
mypy: None
IPython: 8.21.0
sphinx: None