Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

correctly encode/decode _FillValues/missing_values/dtypes for packed data #8713

Merged
merged 19 commits into from
Mar 15, 2024

Conversation

kmuehlbauer
Copy link
Contributor

@kmuehlbauer kmuehlbauer commented Feb 6, 2024

This resurrects some of #7654. It takes special care of correctly handling dtypes when encoding/decoding packed data

@kmuehlbauer
Copy link
Contributor Author

Still an issue with some example in the docs. Probably non-conforming NaN _FillValue for packed data in that one example.

xarray/tests/test_backends.py Outdated Show resolved Hide resolved
xarray/coding/variables.py Show resolved Hide resolved
@dcherian
Copy link
Contributor

dcherian commented Feb 6, 2024

@mankoff if you have time, can you check your workload against this PR, or review it please?

Comment on lines +254 to +258
raw_fill_dict = {}
[
pop_to(attrs, raw_fill_dict, attr, name=name)
for attr in ("missing_value", "_FillValue")
]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pops missing_values and/or _FillValue into temporary raw_fill_dict.

Comment on lines +260 to +272
for k in list(raw_fill_dict):
v = raw_fill_dict[k]
kfill = {fv for fv in np.ravel(v) if not pd.isnull(fv)}
if not kfill and np.issubdtype(dtype, np.integer):
warnings.warn(
f"variable {name!r} has non-conforming {k!r} "
f"{v!r} defined, dropping {k!r} entirely.",
SerializationWarning,
stacklevel=3,
)
del raw_fill_dict[k]
else:
encoded_fill_values |= kfill
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterate over the (two) possible keys and extract the provided fill values. If the extracted fill values are empty (due to filtering with pd.isnull) a warning is issued for integer type data and the according key is deleted from the dict. This prevents from moving the nonconforming fill value into encoding.

Comment on lines +274 to +280
if len(encoded_fill_values) > 1:
warnings.warn(
f"variable {name!r} has multiple fill values "
f"{encoded_fill_values} defined, decoding all values to NaN.",
SerializationWarning,
stacklevel=3,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have multiple fill values after the procedure a warning is issued.

if raw_fill_dict:
dims, data, attrs, encoding = unpack_for_decoding(variable)
[
safe_setitem(encoding, attr, value, name=name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move remaining key, value pairs into encoding.

@@ -348,20 +375,51 @@ def _scale_offset_decoding(data, scale_factor, add_offset, dtype: np.typing.DTyp
return data


def _choose_float_dtype(dtype: np.dtype, has_offset: bool) -> type[np.floating[Any]]:
def _choose_float_dtype(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function checks for the most appropriate dtype to use when encoding/decoding in CFScaleOffsetCoder.

@kmuehlbauer
Copy link
Contributor Author

@mankoff if you have time, can you check your workload against this PR, or review it please?

It would be great if Ken could have a look here. At least I tried to follow the changes to CF which followed after the discussion in cf-convention/cf-conventions#374.

@mankoff
Copy link
Contributor

mankoff commented Feb 7, 2024 via email

@kmuehlbauer
Copy link
Contributor Author

I'm reading this via slow satellite in Antarctica. I'm 'offline'(ish) for another week, then have 2 months of emails and work to catch up on. Reviewing this PR will be low priority.

Thanks Ken, for letting us know. Greetings to Antarctica, hope time is good there! I'll try to ping some folks from the linked issues to get more input here.

@dcherian
Copy link
Contributor

@kmuehlbauer shall we merge? A number of numpy warnings (and presumably numpy 2 failures) are from this type of dtype maniputation. 🤞🏾 this PR fixes them all! :)

@dcherian
Copy link
Contributor

xarray/tests/test_conventions.py::test_decode_cf_with_conflicting_fill_missing_value
  /home/runner/work/xarray/xarray/xarray/conventions.py:286: SerializationWarning: variable 't' has non-conforming '_FillValue' nan defined, dropping '_FillValue' entirely.
    var = coder.decode(var, name=name)

should silence this warning

@kmuehlbauer
Copy link
Contributor Author

@dcherian I'm good to merge this, we still can iterate later if complaints are coming in.

@kmuehlbauer
Copy link
Contributor Author

kmuehlbauer commented Mar 15, 2024

@dcherian There are still some warnings which could be fixed/silenced with this PR. All try to get behind it now.

Update: so some warnings (cast RuntimeWarnings) are not directly connected to this PR but to some ill-defined test setup (mixing int/uint) with resulting casting issues. Best resolved in separate PR.

@@ -63,7 +63,13 @@ def test_decode_cf_with_conflicting_fill_missing_value() -> None:
np.arange(10),
{"units": "foobar", "missing_value": np.nan, "_FillValue": np.nan},
)
actual = conventions.decode_cf_variable("t", var)

# the following code issues two warnings, so we need to check for both
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcherian I've tried to make this way shorter. Using for-loop looks ugly, though.

@dcherian
Copy link
Contributor

LGTM. Thanks!

@dcherian dcherian enabled auto-merge (squash) March 15, 2024 16:14
@dcherian dcherian merged commit fbcac76 into pydata:main Mar 15, 2024
27 of 29 checks passed
dcherian added a commit to dcherian/xarray that referenced this pull request Mar 15, 2024
* main: (31 commits)
  correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713)
  Expand use of `.oindex` and `.vindex` (pydata#8790)
  Return a dataclass from Grouper.factorize (pydata#8777)
  [skip-ci] Fix upstream-dev env (pydata#8839)
  Add dask-expr for windows envs (pydata#8837)
  [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835)
  Add `dask-expr` to environment-3.12.yml (pydata#8827)
  Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736)
  Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784)
  try to get the `upstream-dev` CI to complete again (pydata#8823)
  Bump the actions group with 1 update (pydata#8818)
  Update documentation for clarity (pydata#8817)
  DOC: link to zarr.convenience.consolidate_metadata (pydata#8816)
  Refactor Grouper objects (pydata#8776)
  Grouper object design doc (pydata#8510)
  Bump the actions group with 2 updates (pydata#8804)
  tokenize() should ignore difference between None and {} attrs (pydata#8797)
  fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791)
  Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782)
  Migrate treenode module. (pydata#8757)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Mar 16, 2024
* main: (42 commits)
  correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713)
  Expand use of `.oindex` and `.vindex` (pydata#8790)
  Return a dataclass from Grouper.factorize (pydata#8777)
  [skip-ci] Fix upstream-dev env (pydata#8839)
  Add dask-expr for windows envs (pydata#8837)
  [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835)
  Add `dask-expr` to environment-3.12.yml (pydata#8827)
  Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736)
  Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784)
  try to get the `upstream-dev` CI to complete again (pydata#8823)
  Bump the actions group with 1 update (pydata#8818)
  Update documentation for clarity (pydata#8817)
  DOC: link to zarr.convenience.consolidate_metadata (pydata#8816)
  Refactor Grouper objects (pydata#8776)
  Grouper object design doc (pydata#8510)
  Bump the actions group with 2 updates (pydata#8804)
  tokenize() should ignore difference between None and {} attrs (pydata#8797)
  fix: remove Coordinate from __all__ in xarray/__init__.py (pydata#8791)
  Fix non-nanosecond casting behavior for `expand_dims` (pydata#8782)
  Migrate treenode module. (pydata#8757)
  ...
dcherian added a commit to kmsquire/xarray that referenced this pull request Mar 21, 2024
* upstream/main: (765 commits)
  increase typing annotations coverage in `xarray/core/indexing.py` (pydata#8857)
  pandas 3 MultiIndex fixes (pydata#8847)
  FIX: adapt handling of copy keyword argument in scipy backend for numpy >= 2.0dev (pydata#8851)
  FIX: do not cast _FillValue/missing_value in CFMaskCoder if _Unsigned is provided (pydata#8852)
  Implement setitem syntax for `.oindex` and `.vindex` properties (pydata#8845)
  Support pandas copy-on-write behaviour (pydata#8846)
  correctly encode/decode _FillValues/missing_values/dtypes for packed data (pydata#8713)
  Expand use of `.oindex` and `.vindex` (pydata#8790)
  Return a dataclass from Grouper.factorize (pydata#8777)
  [skip-ci] Fix upstream-dev env (pydata#8839)
  Add dask-expr for windows envs (pydata#8837)
  [skip-ci] Add dask-expr dependency to doc.yml (pydata#8835)
  Add `dask-expr` to environment-3.12.yml (pydata#8827)
  Make list_chunkmanagers more resilient to broken entrypoints (pydata#8736)
  Do not attempt to broadcast when global option ``arithmetic_broadcast=False`` (pydata#8784)
  try to get the `upstream-dev` CI to complete again (pydata#8823)
  Bump the actions group with 1 update (pydata#8818)
  Update documentation for clarity (pydata#8817)
  DOC: link to zarr.convenience.consolidate_metadata (pydata#8816)
  Refactor Grouper objects (pydata#8776)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants