
open_dataset with engine='zarr' changed from '2023.8.0' to '2023.9.0' #8269

Closed
mps01060 opened this issue Oct 3, 2023 · 4 comments · Fixed by #8277

Comments

@mps01060

mps01060 commented Oct 3, 2023

What is your issue?

When moving from xarray version '2023.8.0' to '2023.9.0', the behavior of opening a zarr changed for me (code to create the example zarr is at the end of this post). When importing a variable with units "days accumulated", the values are scaled differently between the two versions. The latest version seems to automatically treat this as a time-like array (I think the -9.223372e+18 values seen are NaT-like?).

Open the zarr:

import xarray as xr
ds = xr.open_dataset('debug.zarr', engine='zarr', chunks={})

Print as a pandas-like table for each version of xarray for readability:

ds.to_dataframe()

Version '2023.8.0':

time        dapr (dtype=float32)  mdpr (dtype=float32)
2000-01-01  NaN                   NaN
2000-01-02  NaN                   NaN
2000-01-03  2.0                   1.5

Version '2023.9.0':

time        dapr (dtype=float64)  mdpr (dtype=float32)
2000-01-01  -9.223372e+18         NaN
2000-01-02  -9.223372e+18         NaN
2000-01-03  2.000000e+00          1.5
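
For context, -9.223372e+18 matches the int64 sentinel NumPy uses for NaT, which is why I suspect the fill values are being decoded as not-a-time rather than NaN (a quick check, not part of the zarr itself):

import numpy as np

# NaT is stored as the minimum int64 value, which prints as -9.223372e+18 as a float.
nat_int = np.datetime64('NaT').astype('int64')
print(nat_int)         # -9223372036854775808
print(float(nat_int))  # -9.223372036854776e+18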

I can work around this by passing decode_cf=False or mask_and_scale=False and then scaling the variable manually, though that is not ideal. The decode_timedelta argument doesn't seem to have an effect on this data, either.

I understand that the word "days" appears in my units, but the full unit string is "days accumulated". Has the behavior of xarray changed to detect keywords such as "days" anywhere in the units (e.g. as a substring)? Do you have any other suggestions? Thank you for the help.
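
For completeness, a minimal sketch of the workaround mentioned above, assuming the debug.zarr created by the snippet below; with mask_and_scale=False the _FillValue/scale_factor/add_offset stay on the variable's attrs and have to be applied by hand:

import xarray as xr

# Open with CF mask/scale decoding disabled so dapr stays as raw packed values.
ds = xr.open_dataset('debug.zarr', engine='zarr', chunks={}, mask_and_scale=False)

# Manually mask the fill value and apply scale_factor/add_offset from the attrs,
# mirroring what CF decoding would normally do for this variable.
attrs = ds['dapr'].attrs
raw = ds['dapr'].astype('float32')
dapr = raw.where(raw != attrs['_FillValue'])
dapr = dapr * attrs.get('scale_factor', 1.0) + attrs.get('add_offset', 0.0)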

Code to create the debug.zarr for the tables above:

import numpy as np
import pandas as pd
import xarray as xr
import zarr

# Create some multiday precipitation data (similar to https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily)
# mdpr is the amount of a multiday total (inches)
# dapr is the number of days each multiday total occurred over (days accumulated).
# In this example, 1.50 inches of rain fell over 2 days (2 observation periods), ending on 2000-01-03
# I use float32 to represent these, but pack these as int16 values in the zarr.
mdpr = np.array([np.NaN, np.NaN, 1.50], dtype=np.float32)
dapr = np.array([np.NaN, np.NaN , 2.0], dtype=np.float32)
time = pd.date_range('2000-01-01', periods=3)

# Create a dataset from these values
ds = xr.Dataset(
    data_vars=dict(
        mdpr=(['time'], mdpr),
        dapr=(['time'], dapr),
    ),
    coords=dict(
        time=time,
    ),
    attrs=dict(description='multiday precipitation data'),
)

# Specify encoding to pack these float32 values as int16
encoding = {
    'mdpr' : {
        'chunks' : (3,),
        'compressor': zarr.Blosc(cname='zstd', clevel=3, shuffle=1),
        'filters': None,
        'missing_value': -32768,
        '_FillValue': -32768,
        'scale_factor': 0.01,
        'add_offset': 0.0,
        'dtype': np.int16,
    },
    'dapr' : {
        'chunks' : (3,),
        'compressor': zarr.Blosc(cname='zstd', clevel=3, shuffle=1),
        'filters': None,
        'missing_value': -32768,
        '_FillValue': -32768,
        'scale_factor': 1.0,
        'add_offset': 0.0,
        'dtype': np.int16,
    },
}

# Create attributes. The "units" of the dapr variable seems to be the issue:
# "days" appears inside "days accumulated".
ds.mdpr.attrs['units'] = 'inches'
ds.mdpr.attrs['description'] = 'multiday precip amount'

ds.dapr.attrs['units'] = 'days accumulated'
ds.dapr.attrs['description'] = 'number of days included in the multiday precipitation'

# Save to zarr
ds.to_zarr('debug.zarr', mode='w', encoding=encoding)
@mps01060 added the needs triage label Oct 3, 2023
@welcome

welcome bot commented Oct 3, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@kmuehlbauer
Contributor

Thanks @mps01060, that's most likely a regression in the check for timedelta-like units, as you already suggested.

We've had several changes to time/timedelta encoding in 2023.09.0. I'm trying to track it down and will issue a fix.

@kmuehlbauer added the regression label and removed the needs triage label Oct 4, 2023
@kmuehlbauer
Contributor

I was a bit too fast here, as the decoding can't be the source:

Timedeltas

def decode(self, variable: Variable, name: T_Name = None) -> Variable:
    units = variable.attrs.get("units", None)
    if isinstance(units, str) and units in TIME_UNITS:
        dims, data, attrs, encoding = unpack_for_decoding(variable)

Datetimes

def decode(self, variable: Variable, name: T_Name = None) -> Variable:
    units = variable.attrs.get("units", None)
    if isinstance(units, str) and "since" in units:
        dims, data, attrs, encoding = unpack_for_decoding(variable)
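
To double-check (a sketch, assuming TIME_UNITS mirrors xarray's set of exact timedelta unit strings), neither decode condition matches the reported units:

# Assumed to mirror xarray's TIME_UNITS set of exact timedelta unit strings.
TIME_UNITS = {"days", "hours", "minutes", "seconds",
              "milliseconds", "microseconds", "nanoseconds"}

units = "days accumulated"
print(units in TIME_UNITS)  # False -> timedelta decoding does not trigger
print("since" in units)     # False -> datetime decoding does not trigger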

@kmuehlbauer
Copy link
Contributor

It's not directly related to time decoding, but in the attempt to preserve nanosecond resolution we've added some code to CFMaskCoder to special-case time-like data.

And yes, it currently only checks whether specific time-related strings appear anywhere in the units:

def _is_time_like(units):
    # test for time-like
    time_strings = [
        "since",
        "days",
        "hours",
        "minutes",
        "seconds",
        "milliseconds",
        "microseconds",
        "nanoseconds",
    ]
    return any(tstr in str(units) for tstr in time_strings)
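
With that substring check, the reported units match even though they are not a CF time unit (a quick check, re-stating the function above so it runs standalone):

def _is_time_like(units):
    # same substring check as in CFMaskCoder above
    time_strings = ["since", "days", "hours", "minutes", "seconds",
                    "milliseconds", "microseconds", "nanoseconds"]
    return any(tstr in str(units) for tstr in time_strings)

print(_is_time_like("days accumulated"))  # True  -> dapr gets the time-like special case
print(_is_time_like("inches"))            # False -> mdpr is left alone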
