'open_mfdataset' zarr zip timestamp issue #7354

Open
4 tasks done
peterdudfield opened this issue Dec 4, 2022 · 5 comments
Labels
bug topic-backends topic-zarr Related to zarr storage library

Comments

@peterdudfield

peterdudfield commented Dec 4, 2022

What happened?

We have been collecting satellite data, and we save each image as one {time}.zarr.zip file.
We then collate the images using xr.open_mfdataset and save them to one large.zarr.zip file.
When loading this file the timestamps are all the same.

This bug does not appear in xarray 2022.3.0 but does appear in 2022.6.0.

I tried to keep this as minimal as possible, but it's a bit of a long example. Hopefully the comments help.

Sorry if this has already been reported, but I could not find it in the issue list.

What did you expect to happen?

Expected the time stamps to reflect the data that went in

Minimal Complete Verifiable Example

import pandas as pd
import xarray as xr
import numpy as np
from datetime import datetime, timedelta
import zarr
import os
import glob

# ids and times
path = "tmp.zarr.zip"
ids = np.array(range(0, 10))
times = [datetime(2022, 9, 1) + timedelta(minutes=60 * i) for i in range(0, 10)]


# make 10 random zip files (create the output directory first)
os.makedirs("tmp_dir", exist_ok=True)
for time in times:
    dataset = xr.DataArray(
        np.random.uniform(size=(1, len(ids))),
        coords=(("time", [time]), ("id", ids)),
        name="data",
    ).to_dataset(name="data")

    file_name = f"tmp_dir/{time.isoformat()}.zarr.zip"

    if os.path.exists(file_name):
        os.remove(file_name)
    with zarr.ZipStore(file_name) as store:
        dataset.to_zarr(store)

# load them all together
files = glob.glob("tmp_dir/*.zarr.zip")
dataset = xr.open_mfdataset(files, engine="zarr").sortby("time")

# this is fine!
assert pd.to_datetime(dataset.time.values[0]) == times[0]
assert pd.to_datetime(dataset.time.values[1]) == times[1]

# save to file
if os.path.exists(path):
    os.remove(path)
with zarr.ZipStore(path) as store:
    dataset.to_zarr(store)

# read the file
dataset_read = xr.open_dataset(path, engine="zarr")
print(dataset_read)

# this causes an error
assert pd.to_datetime(dataset_read.time.values[0]) == times[0]
assert pd.to_datetime(dataset_read.time.values[1]) == times[1]

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

/Users/peterdudfield/Documents/Github/nwp/venv/lib/python3.8/site-packages/xarray/core/dataset.py:2060: SerializationWarning: saving variable None with floating point data as an integer dtype without any _FillValue to use for NaNs
  return to_zarr(  # type: ignore
<xarray.Dataset>
Dimensions:  (time: 10, id: 10)
Coordinates:
  * id       (id) int64 0 1 2 3 4 5 6 7 8 9
  * time     (time) datetime64[ns] 2022-09-01 2022-09-01 ... 2022-09-01
Data variables:
    data     (time, id) float64 ...
Traceback (most recent call last):
  File "/Users/peterdudfield/Documents/Github/nwp/venv/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3251, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-16-45f86e8a5977>", line 36, in <module>
    assert pd.to_datetime(dataset_read.time.values[1]) == times[1]
AssertionError

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.8.2 (default, Jun 8 2021, 11:59:35)
[Clang 12.0.5 (clang-1205.0.22.11)]
python-bits: 64
OS: Darwin
OS-release: 20.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.7.4
xarray: 2022.6.0
pandas: 1.4.2
numpy: 1.22.0
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.13.1
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.6.0
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.10
cfgrib: 0.9.9.1
iris: None
bottleneck: 1.3.4
dask: 2022.01.0
distributed: None
matplotlib: 3.5.1
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 57.0.0
pip: 21.1.2
conda: None
pytest: 6.2.5
IPython: 8.0.1
sphinx: None

@peterdudfield peterdudfield added bug needs triage Issue that has not been reviewed by xarray team member labels Dec 4, 2022
@jhamman
Member

jhamman commented Dec 12, 2022

@peterdudfield - have you tried this workflow with the latest version of xarray (2022.12.0)?

@peterdudfield
Author

@peterdudfield - have you tried this workflow with the latest version of xarray (2022.12.0)?

Yes, the same bug appears there too, so it affects 2022.6.0 onwards.

@dcherian dcherian added topic-zarr Related to zarr storage library needs triage Issue that has not been reviewed by xarray team member and removed needs triage Issue that has not been reviewed by xarray team member labels Dec 12, 2022
@jhamman
Member

jhamman commented Dec 14, 2022

I took a minute to look into this and think I understand what is going on. First, a little debugging:

for name in [files[0], files[1], path]:
    print(name)
    ds = xr.open_zarr(name, decode_cf=False)
    print('  > time.attrs', ds.time.attrs)
    print('  > time.encoding', ds.time.encoding)

tmp_dir/2022-09-01T03:00:00.zarr.zip
  > time.attrs {'calendar': 'proleptic_gregorian', 'units': 'days since 2022-09-01 03:00:00'}
  > time.encoding {'chunks': (1,), 'preferred_chunks': {'time': 1}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int64')}
tmp_dir/2022-09-01T04:00:00.zarr.zip
  > time.attrs {'calendar': 'proleptic_gregorian', 'units': 'days since 2022-09-01 04:00:00'}
  > time.encoding {'chunks': (1,), 'preferred_chunks': {'time': 1}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int64')}
tmp.zarr.zip
  > time.attrs {'calendar': 'proleptic_gregorian', 'units': 'days since 2022-09-01'}
  > time.encoding {'chunks': (1,), 'preferred_chunks': {'time': 1}, 'compressor': Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0), 'filters': None, 'dtype': dtype('int64')}

A few things that I noticed:

  • the dtype of the time variable is int64.
  • the units attr is days since ....

open_mfdataset tends to take the units of the first file and doesn't check if all the others agree. It also does not clear out the dtype encoding.
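To illustrate why those two observations matter together, here is a minimal NumPy-only sketch of the arithmetic. It is not xarray's actual encoding code; the variable names are made up for illustration, and it assumes the combined dataset is written with units of "days since 2022-09-01" and an int64 dtype, as the debugging output suggests.

```python
import numpy as np

# Ten hourly timestamps, matching the example in the original report.
reference = np.datetime64("2022-09-01T00:00")
times = reference + np.arange(10) * np.timedelta64(1, "h")

# Expressed in "days since 2022-09-01", the offsets are fractional
# (0, 1/24, 2/24, ...).
offsets_in_days = (times - reference) / np.timedelta64(1, "D")

# Storing them as int64 truncates every offset to zero.
stored = offsets_in_days.astype("int64")

# Decoding the stored integers then yields the reference time ten times.
decoded = reference + stored * np.timedelta64(1, "D")

print(stored)
print(decoded)
```

This would also be consistent with the SerializationWarning in the log about saving floating point data as an integer dtype.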

One quick solution here is that you could add

del dataset['time'].encoding['units']

on the line right after your open_mfdataset call. You could also change the encoded dtype of your time variable to float64.
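As a rough sketch of that workaround in context: the in-memory dataset below only stands in for the result of open_mfdataset, and the encoding values are set by hand to mimic what the debugging output shows (that part is an assumption, not something xarray does in this snippet). Using `pop` with a default instead of `del` avoids a KeyError if the key is already absent.

```python
import numpy as np
import xarray as xr

# Stand-in for the dataset returned by open_mfdataset: ten hourly steps.
times = np.datetime64("2022-09-01", "ns") + np.arange(10) * np.timedelta64(1, "h")
dataset = xr.Dataset(
    {"data": (("time", "id"), np.random.uniform(size=(10, 10)))},
    coords={"time": times, "id": np.arange(10)},
)

# Assumption: mimic the encoding inherited from the first source file.
dataset["time"].encoding.update(
    {"units": "days since 2022-09-01", "dtype": np.dtype("int64")}
)

# Workaround: drop the inherited units so they are chosen again on write.
dataset["time"].encoding.pop("units", None)

print(dataset["time"].encoding)
```

After this, the dataset can be written with to_zarr as in the original example.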

@jhamman
Member

jhamman commented Dec 14, 2022

After thinking about this for a bit longer, I think we should strongly consider dropping source encoding for datasets generated by open_mfdataset. Or, if nothing else, we should think about ways to alert the user that encoding was not consistent across all of the datasets loaded.
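Until something like that exists, a user-side check is one option. The helper below is hypothetical (it is not part of xarray) and works on plain dictionaries; the assumption is that you collect one encoding dictionary per source file yourself, for example from `ds.time.encoding` after opening each file.

```python
from collections import defaultdict


def find_inconsistent_encoding(encodings, keys=("units", "calendar", "dtype")):
    """Return {key: set of distinct values} for keys whose values differ.

    Hypothetical helper: `encodings` is a list of plain dicts, one per file.
    Values are compared by their string form so that dtypes and strings
    can be handled uniformly.
    """
    seen = defaultdict(set)
    for enc in encodings:
        for key in keys:
            seen[key].add(str(enc.get(key)))
    return {key: values for key, values in seen.items() if len(values) > 1}


# Encodings resembling the two source files in the debugging output above.
per_file = [
    {"units": "days since 2022-09-01 03:00:00", "dtype": "int64"},
    {"units": "days since 2022-09-01 04:00:00", "dtype": "int64"},
]
mismatches = find_inconsistent_encoding(per_file)
print(mismatches)
```

If the result is non-empty, clearing the affected encoding keys before writing is probably the safer choice.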

Other relevant issues:

@peterdudfield
Author

Thanks @jhamman for looking into this. I'll try your suggestions

@jhamman jhamman added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Dec 16, 2022