
can't store zarr after open_zarr and isel #2278

Closed
apatlpo opened this issue Jul 11, 2018 · 10 comments

@apatlpo (Contributor) commented Jul 11, 2018

Code Sample, a copy-pastable example if possible

This works fine:

import numpy as np
import xarray as xr

nx, ny, nt = 32, 32, 64
ds = xr.Dataset({}, coords={'x': np.arange(nx), 'y': np.arange(ny), 't': np.arange(nt)})
ds = ds.assign(v=ds.t * np.cos(np.pi/180./100*ds.x) * np.cos(np.pi/180./50*ds.y))
ds = ds.chunk({'t': 1, 'x': nx // 2, 'y': ny // 2})

ds.isel(t=0).to_zarr('data_t0.zarr', mode='w')

But if I store, reload and select, I cannot store:

ds.to_zarr('data.zarr', mode='w')
ds = xr.open_zarr('data.zarr')
ds.isel(t=0).to_zarr('data_t0.zarr', mode='w')

Error message ends with:

~/.miniconda3/envs/equinox/lib/python3.6/site-packages/xarray/backends/zarr.py in _extract_zarr_variable_encoding(variable, raise_on_invalid)
    181 
    182     chunks = _determine_zarr_chunks(encoding.get('chunks'), variable.chunks,
--> 183                                     variable.ndim)
    184     encoding['chunks'] = chunks
    185     return encoding

~/.miniconda3/envs/equinox/lib/python3.6/site-packages/xarray/backends/zarr.py in _determine_zarr_chunks(enc_chunks, var_chunks, ndim)
    112         raise ValueError("zarr chunks tuple %r must have same length as "
    113                          "variable.ndim %g" %
--> 114                          (enc_chunks_tuple, ndim))
    115 
    116     for x in enc_chunks_tuple:

ValueError: zarr chunks tuple (1, 16, 16) must have same length as variable.ndim 2

Output of xr.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.5.final.0 python-bits: 64 OS: Linux OS-release: 3.12.53-60.30-default machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

xarray: 0.10.7
pandas: 0.23.1
numpy: 1.14.2
scipy: 1.1.0
netCDF4: 1.4.0
h5netcdf: 0.6.1
h5py: 2.8.0
Nio: None
zarr: 2.2.0
bottleneck: 1.2.1
cyordereddict: None
dask: 0.18.1
distributed: 1.22.0
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: None
setuptools: 39.2.0
pip: 10.0.1
conda: None
pytest: None
IPython: 6.4.0
sphinx: None

@shoyer (Member) commented Jul 11, 2018

Yes, this is definitely a bug.

One workaround is to explicitly remove the broken chunks encoding from the loaded dataset, e.g., del ds['v'].encoding['chunks']

@apatlpo (Contributor, Author) commented Jul 12, 2018

Thanks for the workaround suggestion.
Apparently you also need to delete chunks for the t singleton coordinate, though.
The complete workaround ends up looking like:

ds = xr.open_zarr('data.zarr')
del ds['v'].encoding['chunks']
del ds['t'].encoding['chunks']
ds.isel(t=0).to_zarr('data_t0.zarr', mode='w')

Any idea about how serious this is and/or where it's coming from?

@rabernat (Contributor) commented Jul 12, 2018

Any idea about how serious this is and/or where it's coming from?

The source of the bug is that the chunks encoding metadata (which describes the chunk size of the underlying zarr store) is automatically populated when you load the zarr store (ds = xr.open_zarr('data.zarr')), and this encoding metadata is preserved as you transform (sub-select) the dataset. Some possible solutions would be to

  1. Not put chunks into encoding at all.
  2. Figure out a way to strip chunks when performing selection operations or other operations that change shape.

Idea 1 is easier but would mean discarding some relevant metadata about encoding. This would break round-tripping of the un-modified zarr dataset.

@apatlpo (Contributor, Author) commented Jul 12, 2018

With the same case, I get another error message that may or may not reflect the same issue; maybe you can tell me. I am posting it because the error message is different.

Starting from the same dataset:

import numpy as np
import xarray as xr

nx, ny, nt = 32, 32, 64
ds = xr.Dataset({}, coords={'x': np.arange(nx), 'y': np.arange(ny), 't': np.arange(nt)})
ds = ds.assign(v=ds.t * np.cos(np.pi/180./100*ds.x) * np.cos(np.pi/180./50*ds.y))
ds = ds.chunk({'t': 1, 'x': nx // 2, 'y': ny // 2})
ds.to_zarr('data.zarr', mode='w')

Case 1 works fine:

ds = ds.chunk({'t': nt, 'x': nx // 4, 'y': ny // 4})
ds.to_zarr('data_rechunked.zarr', mode='w')

Case 2 breaks:

ds = xr.open_zarr('data.zarr')
ds = ds.chunk({'t': nt, 'x': nx // 4, 'y': ny // 4})
ds.to_zarr('data_rechunked.zarr', mode='w')

with the following error message:

 ....
NotImplementedError: Specified zarr chunks (1, 16, 16) would overlap multiple dask chunks ((64,), (8, 8, 8, 8), (8, 8, 8, 8)). This is not implemented in xarray yet.  Consider rechunking the data using `chunk()` or specifying different chunks in encoding.

@apatlpo (Contributor, Author) commented Jul 12, 2018

Note that case 2 also has a simple fix: del ds['v'].encoding['chunks'] prior to storing the data.

@rabernat (Contributor) commented Jul 12, 2018 via email

@shoyer (Member) commented Jul 12, 2018 via email

@apatlpo (Contributor, Author) commented Jul 13, 2018

Could you please be more specific about where this is done for netCDF?

@shoyer (Member) commented Jul 13, 2018

In the netCDF backend:

if chunks_too_big or changed_shape:
    del encoding['chunksizes']
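That netCDF-side guard could in principle be mirrored for the zarr encoding. A hypothetical sketch (strip_stale_chunks is not xarray API): drop the stored 'chunks' whenever its length no longer matches the variable's dimensionality.

```python
import numpy as np
import xarray as xr

def strip_stale_chunks(var):
    # Hypothetical helper, not part of xarray: mirror the netCDF guard
    # by deleting 'chunks' when it no longer matches variable.ndim.
    chunks = var.encoding.get('chunks')
    if chunks is not None and len(chunks) != var.ndim:
        del var.encoding['chunks']
    return var

v = xr.Variable(('x', 'y'), np.zeros((4, 4)))
v.encoding['chunks'] = (1, 2, 2)   # stale 3-tuple left over from a dropped dim
strip_stale_chunks(v)
print('chunks' in v.encoding)      # False
```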

@tinaok commented May 17, 2019

Hi,
the second test case indicated by @apatlpo on 12 Jul 2018 breaks:

import numpy as np
import xarray as xr

nx, ny, nt = 32, 32, 64
ds = xr.Dataset({}, coords={'x': np.arange(nx), 'y': np.arange(ny), 't': np.arange(nt)})
ds = ds.assign(v=ds.t * np.cos(np.pi/180./100*ds.x) * np.cos(np.pi/180./50*ds.y))
ds = ds.chunk({'t': 1, 'x': nx // 2, 'y': ny // 2})
ds.to_zarr('data.zarr', mode='w')

ds = xr.open_zarr('data.zarr')
ds = ds.chunk({'t': nt, 'x': nx // 4, 'y': ny // 4})
ds.to_zarr('data_rechunked.zarr', mode='w')

The error message is the following:

ValueError: Final chunk of Zarr array must be the same size or smaller than the first. The specified Zarr chunk encoding is (1, 16, 16), but (64,) in variable Dask chunks ((64,), (8, 8, 8, 8), (8, 8, 8, 8)) is incompatible. Consider rechunking using `chunk()

(if I add del ds.v.encoding['chunks'] as follows, it does not break)

import numpy as np
import xarray as xr

nx, ny, nt = 32, 32, 64
ds = xr.Dataset({}, coords={'x': np.arange(nx), 'y': np.arange(ny), 't': np.arange(nt)})
ds = ds.assign(v=ds.t * np.cos(np.pi/180./100*ds.x) * np.cos(np.pi/180./50*ds.y))
ds = ds.chunk({'t': 1, 'x': nx // 2, 'y': ny // 2})
ds.to_zarr('data.zarr', mode='w')
ds = xr.open_zarr('data.zarr')
del ds.v.encoding['chunks']
ds = ds.chunk({'t': nt, 'x': nx // 4, 'y': ny // 4})
ds.to_zarr('data_rechunked.zarr', mode='w')
