
open_dataset with engine='zarr' changed from '2023.8.0' to '2023.9.0' #8269

Closed
mps01060 opened this issue Oct 3, 2023 · 4 comments · Fixed by #8277

Comments

@mps01060

mps01060 commented Oct 3, 2023

What is your issue?

When moving from xarray version '2023.8.0' to '2023.9.0', the behavior of opening a zarr changed for me (code to create the example zarr is at the end of this post). When importing a variable with units "days accumulated", the values are scaled differently between the two versions. The latest version seems to automatically treat this as a time-like array (I think the -9.223372e+18 values seen are NaT-like?).

Open the zarr:

import xarray as xr
ds = xr.open_dataset('debug.zarr', engine='zarr', chunks={})

Print as a pandas-like table for each version of xarray for readability:

ds.to_dataframe()

Version '2023.8.0':

time        dapr (dtype=float32)  mdpr (dtype=float32)
2000-01-01  NaN                   NaN
2000-01-02  NaN                   NaN
2000-01-03  2.0                   1.5

Version '2023.9.0':

time        dapr (dtype=float64)  mdpr (dtype=float32)
2000-01-01  -9.223372e+18         NaN
2000-01-02  -9.223372e+18         NaN
2000-01-03  2.000000e+00          1.5
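
For context, -9.223372e+18 matches the int64 sentinel NumPy uses for NaT, which is why I suspect the fill values are being decoded as not-a-time rather than NaN (a quick check, not part of the zarr itself):

import numpy as np

# NaT is stored as the minimum int64 value, which prints as -9.223372e+18 as a float.
nat_int = np.datetime64('NaT').astype('int64')
print(nat_int)         # -9223372036854775808
print(float(nat_int))  # -9.223372036854776e+18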

I can work around this by passing decode_cf=False or mask_and_scale=False and then scaling the variable manually, though that is not ideal. The decode_timedelta argument doesn't seem to have an effect on this data, either.

I understand that the word "days" appears in my units, but the full unit string is "days accumulated". Has the behavior of xarray changed to detect keywords such as "days" anywhere in the units (e.g. as a substring)? Do you have any other suggestions? Thank you for the help.
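
For completeness, a minimal sketch of the workaround mentioned above, assuming the debug.zarr created by the snippet below; with mask_and_scale=False the _FillValue/scale_factor/add_offset stay on the variable's attrs and have to be applied by hand:

import xarray as xr

# Open with CF mask/scale decoding disabled so dapr stays as raw packed values.
ds = xr.open_dataset('debug.zarr', engine='zarr', chunks={}, mask_and_scale=False)

# Manually mask the fill value and apply scale_factor/add_offset from the attrs,
# mirroring what CF decoding would normally do for this variable.
attrs = ds['dapr'].attrs
raw = ds['dapr'].astype('float32')
dapr = raw.where(raw != attrs['_FillValue'])
dapr = dapr * attrs.get('scale_factor', 1.0) + attrs.get('add_offset', 0.0)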

Code to create the debug.zarr for the tables above:

import numpy as np
import pandas as pd
import xarray as xr
import zarr

# Create some multiday precipitation data (similar to https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily)
# mdpr is the amount of a multiday total (inches)
# dapr is the number of days each multiday total occurred over (days accumulated).
# In this example, 1.50 inches of rain fell over 2 days (2 observation periods), ending on 2000-01-03
# I use float32 to represent these, but pack these as int16 values in the zarr.
mdpr = np.array([np.NaN, np.NaN, 1.50], dtype=np.float32)
dapr = np.array([np.NaN, np.NaN , 2.0], dtype=np.float32)
time = pd.date_range('2000-01-01', periods=3)

# Create a dataset from these values
ds = xr.Dataset(
    data_vars=dict(
        mdpr=(['time'], mdpr),
        dapr=(['time'], dapr),
    ),
    coords=dict(
        time=time,
    ),
    attrs=dict(description='multiday precipitation data'),
)

# Specify encoding to pack these float32 values as int16
encoding = {
    'mdpr' : {
        'chunks' : (3,),
        'compressor': zarr.Blosc(cname='zstd', clevel=3, shuffle=1),
        'filters': None,
        'missing_value': -32768,
        '_FillValue': -32768,
        'scale_factor': 0.01,
        'add_offset': 0.0,
        'dtype': np.int16,
    },
    'dapr' : {
        'chunks' : (3,),
        'compressor': zarr.Blosc(cname='zstd', clevel=3, shuffle=1),
        'filters': None,
        'missing_value': -32768,
        '_FillValue': -32768,
        'scale_factor': 1.0,
        'add_offset': 0.0,
        'dtype': np.int16,
    },
}

# Create attributes. The "units" of the dapr variable seems to be the issue:
# "days" appears inside "days accumulated".
ds.mdpr.attrs['units'] = 'inches'
ds.mdpr.attrs['description'] = 'multiday precip amount'

ds.dapr.attrs['units'] = 'days accumulated'
ds.dapr.attrs['description'] = 'number of days included in the multiday precipitation'

# Save to zarr
ds.to_zarr('debug.zarr', mode='w', encoding=encoding)
@mps01060 added the needs triage label Oct 3, 2023
@welcome

welcome bot commented Oct 3, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@kmuehlbauer
Contributor

Thanks @mps01060, that's most likely a regression in the check for timedelta-like units, as you already suggested.

We've had several changes to time/timedelta encoding in 2023.09.0. I'm trying to track it down and will issue a fix.

@kmuehlbauer added the regression label and removed the needs triage label Oct 4, 2023
@kmuehlbauer
Contributor

I was a bit too fast here, as the decoding can't be the source:

Timedeltas

def decode(self, variable: Variable, name: T_Name = None) -> Variable:
    units = variable.attrs.get("units", None)
    if isinstance(units, str) and units in TIME_UNITS:
        dims, data, attrs, encoding = unpack_for_decoding(variable)

Datetimes

def decode(self, variable: Variable, name: T_Name = None) -> Variable:
    units = variable.attrs.get("units", None)
    if isinstance(units, str) and "since" in units:
        dims, data, attrs, encoding = unpack_for_decoding(variable)
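
To double-check (a sketch, assuming TIME_UNITS mirrors xarray's set of exact timedelta unit strings), neither decode condition matches the reported units:

# Assumed to mirror xarray's TIME_UNITS set of exact timedelta unit strings.
TIME_UNITS = {"days", "hours", "minutes", "seconds",
              "milliseconds", "microseconds", "nanoseconds"}

units = "days accumulated"
print(units in TIME_UNITS)  # False -> timedelta decoding does not trigger
print("since" in units)     # False -> datetime decoding does not trigger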

@kmuehlbauer
Copy link
Contributor

It's not directly related to time decoding, but in the attempt to preserve nanosecond resolution we've added some code to CFMaskCoder to special-case time-like data.

And yes, it currently only checks whether specific time-related strings appear anywhere in the units:

def _is_time_like(units):
    # test for time-like
    time_strings = [
        "since",
        "days",
        "hours",
        "minutes",
        "seconds",
        "milliseconds",
        "microseconds",
        "nanoseconds",
    ]
    return any(tstr in str(units) for tstr in time_strings)
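
With that substring check, the reported units match even though they are not a CF time unit (a quick check, re-stating the function above so it runs standalone):

def _is_time_like(units):
    # same substring check as in CFMaskCoder above
    time_strings = ["since", "days", "hours", "minutes", "seconds",
                    "milliseconds", "microseconds", "nanoseconds"]
    return any(tstr in str(units) for tstr in time_strings)

print(_is_time_like("days accumulated"))  # True  -> dapr gets the time-like special case
print(_is_time_like("inches"))            # False -> mdpr is left alone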
