Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opening from zarr.ZipStore fails to read (store???) unicode characters #3815

Open
hmaarrfk opened this issue Mar 1, 2020 · 20 comments
Open
Labels
topic-backends topic-zarr Related to zarr storage library

Comments

@hmaarrfk
Copy link
Contributor

hmaarrfk commented Mar 1, 2020

See upstream: zarr-developers/zarr-python#551

It seems that using a ZipStore creates 1 byte objects for Unicode string attributes.

For example, saving the same Dataset with a DirectoryStore and a Zip Store creates an attribute for a unicode array with 20 bytes in size in the first, and 1 byte in size in the second.

In fact, ubuntu file roller isn't even allowing me to extract the files.

I have a feeling it is due to the note in the zarr documentation

Note that Zip files do not provide any way to remove or replace existing entries.

https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore

MCVE Code Sample

ZipStore

import xarray as xr
import zarr
x = xr.Dataset()
x['hello'] = 'world'
x
with zarr.ZipStore('test_store.zip', mode='w') as store:
    x.to_zarr(store)
with zarr.ZipStore('test_store.zip', mode='r') as store:
    x_read = xr.open_zarr(store).compute()
Issued error
---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-1-2a92a6db56ab> in <module>
      7     x.to_zarr(store)
      8 with zarr.ZipStore('test_store.zip', mode='r') as store:
----> 9     x_read = xr.open_zarr(store).compute()

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/dataset.py in compute(self, **kwargs)
    803         """
    804         new = self.copy(deep=False)
--> 805         return new.load(**kwargs)
    806 
    807     def _persist_inplace(self, **kwargs) -> "Dataset":

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/dataset.py in load(self, **kwargs)
    655         for k, v in self.variables.items():
    656             if k not in lazy_data:
--> 657                 v.load()
    658 
    659         return self

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/variable.py in load(self, **kwargs)
    370             self._data = as_compatible_data(self._data.compute(**kwargs))
    371         elif not hasattr(self._data, "__array_function__"):
--> 372             self._data = np.asarray(self._data)
    373         return self
    374 

~/miniconda3/envs/dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/core/indexing.py in __array__(self, dtype)
    545     def __array__(self, dtype=None):
    546         array = as_indexable(self.array)
--> 547         return np.asarray(array[self.key], dtype=None)
    548 
    549     def transpose(self, order):

~/miniconda3/envs/dev/lib/python3.7/site-packages/xarray/backends/zarr.py in __getitem__(self, key)
     46         array = self.get_array()
     47         if isinstance(key, indexing.BasicIndexer):
---> 48             return array[key.tuple]
     49         elif isinstance(key, indexing.VectorizedIndexer):
     50             return array.vindex[

~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
    570 
    571         fields, selection = pop_fields(selection)
--> 572         return self.get_basic_selection(selection, fields=fields)
    573 
    574     def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):

~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
    693         if self._shape == ():
    694             return self._get_basic_selection_zd(selection=selection, out=out,
--> 695                                                 fields=fields)
    696         else:
    697             return self._get_basic_selection_nd(selection=selection, out=out,

~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_zd(self, selection, out, fields)
    709             # obtain encoded data for chunk
    710             ckey = self._chunk_key((0,))
--> 711             cdata = self.chunk_store[ckey]
    712 
    713         except KeyError:

~/miniconda3/envs/dev/lib/python3.7/site-packages/zarr/storage.py in __getitem__(self, key)
   1249         with self.mutex:
   1250             with self.zf.open(key) as f:  # will raise KeyError
-> 1251                 return f.read()
   1252 
   1253     def __setitem__(self, key, value):

~/miniconda3/envs/dev/lib/python3.7/zipfile.py in read(self, n)
    914             self._offset = 0
    915             while not self._eof:
--> 916                 buf += self._read1(self.MAX_N)
    917             return buf
    918 

~/miniconda3/envs/dev/lib/python3.7/zipfile.py in _read1(self, n)
   1018         if self._left <= 0:
   1019             self._eof = True
-> 1020         self._update_crc(data)
   1021         return data
   1022 

~/miniconda3/envs/dev/lib/python3.7/zipfile.py in _update_crc(self, newdata)
    946         # Check the CRC if we're at the end of the file
    947         if self._eof and self._running_crc != self._expected_crc:
--> 948             raise BadZipFile("Bad CRC-32 for file %r" % self.name)
    949 
    950     def read1(self, n):

BadZipFile: Bad CRC-32 for file 'hello/0'
0
2
Untitled10.ipynb

Working Directory Store example

import xarray as xr
import zarr
x = xr.Dataset()
x['hello'] = 'world'
x
store = zarr.DirectoryStore('test_store2.zarr')
x.to_zarr(store)
x_read = xr.open_zarr(store)
x_read.compute()
assert x_read.hello == x.hello

Expected Output

The string metadata should work.

Output of xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jan  7 2020, 22:33:48) 
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 5.3.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8
libhdf5: None
libnetcdf: None

xarray: 0.14.1
pandas: 1.0.0
numpy: 1.17.5
scipy: 1.4.1
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.4.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.10.1
distributed: 2.10.0
matplotlib: 3.1.3
cartopy: None
seaborn: None
numbagg: None
setuptools: 45.1.0.post20200119
pip: 20.0.2
conda: None
pytest: 5.3.1
IPython: 7.12.0
sphinx: 2.3.1
@max-sixty
Copy link
Collaborator

Hi @hmaarrfk , I don't know this area of code well but attempting to ensure that we resolve the question.

Should this be an xarray or a zarr issue? Is there something you think xarray is doing incorrectly here?

@hmaarrfk
Copy link
Contributor Author

My guess is that that xarray might be trying to write to the store character by character???

Otherwise, not too sure.

@max-sixty
Copy link
Collaborator

Do you have a reproducible example? Otherwise I'm not sure how to help resolve...

@hmaarrfk
Copy link
Contributor Author

I thought I provided it, but in either case, here is my traceback

In [3]: import xarray as xr 
   ...: import zarr 
   ...: x = xr.Dataset() 
   ...: x['hello'] = 'world' 
   ...: x 
   ...: with zarr.ZipStore('test_store.zip', mode='w') as store: 
   ...:     x.to_zarr(store) 
   ...: with zarr.ZipStore('test_store.zip', mode='r') as store: 
   ...:     x_read = xr.open_zarr(store).compute() 
   ...:                                                                                                  
---------------------------------------------------------------------------
BadZipFile                                Traceback (most recent call last)
<ipython-input-3-5ad5a0456766> in <module>
      7     x.to_zarr(store)
      8 with zarr.ZipStore('test_store.zip', mode='r') as store:
----> 9     x_read = xr.open_zarr(store).compute()
     10 

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/core/dataset.py in compute(self, **kwargs)
    805         """
    806         new = self.copy(deep=False)
--> 807         return new.load(**kwargs)
    808 
    809     def _persist_inplace(self, **kwargs) -> "Dataset":

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/core/dataset.py in load(self, **kwargs)
    657         for k, v in self.variables.items():
    658             if k not in lazy_data:
--> 659                 v.load()
    660 
    661         return self

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/core/variable.py in load(self, **kwargs)
    373             self._data = as_compatible_data(self._data.compute(**kwargs))
    374         elif not hasattr(self._data, "__array_function__"):
--> 375             self._data = np.asarray(self._data)
    376         return self
    377 

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/core/indexing.py in __array__(self, dtype)
    555     def __array__(self, dtype=None):
    556         array = as_indexable(self.array)
--> 557         return np.asarray(array[self.key], dtype=None)
    558 
    559     def transpose(self, order):

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/xarray/backends/zarr.py in __getitem__(self, key)
     47         array = self.get_array()
     48         if isinstance(key, indexing.BasicIndexer):
---> 49             return array[key.tuple]
     50         elif isinstance(key, indexing.VectorizedIndexer):
     51             return array.vindex[

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/zarr/core.py in __getitem__(self, selection)
    570 
    571         fields, selection = pop_fields(selection)
--> 572         return self.get_basic_selection(selection, fields=fields)
    573 
    574     def get_basic_selection(self, selection=Ellipsis, out=None, fields=None):

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/zarr/core.py in get_basic_selection(self, selection, out, fields)
    693         if self._shape == ():
    694             return self._get_basic_selection_zd(selection=selection, out=out,
--> 695                                                 fields=fields)
    696         else:
    697             return self._get_basic_selection_nd(selection=selection, out=out,

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/zarr/core.py in _get_basic_selection_zd(self, selection, out, fields)
    709             # obtain encoded data for chunk
    710             ckey = self._chunk_key((0,))
--> 711             cdata = self.chunk_store[ckey]
    712 
    713         except KeyError:

~/miniconda3/envs/mcam_dev/lib/python3.7/site-packages/zarr/storage.py in __getitem__(self, key)
   1249         with self.mutex:
   1250             with self.zf.open(key) as f:  # will raise KeyError
-> 1251                 return f.read()
   1252 
   1253     def __setitem__(self, key, value):

~/miniconda3/envs/mcam_dev/lib/python3.7/zipfile.py in read(self, n)
    914             self._offset = 0
    915             while not self._eof:
--> 916                 buf += self._read1(self.MAX_N)
    917             return buf
    918 

~/miniconda3/envs/mcam_dev/lib/python3.7/zipfile.py in _read1(self, n)
   1018         if self._left <= 0:
   1019             self._eof = True
-> 1020         self._update_crc(data)
   1021         return data
   1022 

~/miniconda3/envs/mcam_dev/lib/python3.7/zipfile.py in _update_crc(self, newdata)
    946         # Check the CRC if we're at the end of the file
    947         if self._eof and self._running_crc != self._expected_crc:
--> 948             raise BadZipFile("Bad CRC-32 for file %r" % self.name)
    949 
    950     def read1(self, n):

BadZipFile: Bad CRC-32 for file 'hello/0'

@hmaarrfk
Copy link
Contributor Author

See the zipstore example in my first comment

@max-sixty
Copy link
Collaborator

You did supply the traceback! But it's not reproducible. And consistent with #3831 (comment), it's difficult to attribute the break on examples like these.

You could try debugging into the return array[key.tuple] call, which is the last one in xarray, and identifying whether xarray is making a bad call or zarr is incorrectly failing?

@hmaarrfk
Copy link
Contributor Author

hmm i didn't realize this. I"m running from conda-forge + linux.

Let me try on your CIs.

@hmaarrfk
Copy link
Contributor Author

Not sure if the builds in #3888 help reproduce things or not?

@hmaarrfk
Copy link
Contributor Author

I will have to try the debugging things you mentionned some later time :/

@max-sixty
Copy link
Collaborator

@hmaarrfk my apologies, I misread your issue and it was always reproducible. I had thought it was referencing a file test_store.zip when actually it was writing it, mea culpa.

@max-sixty
Copy link
Collaborator

I had a quick debug but didn't completely resolve it. It works fine if you add it as an attribute; do you mean to be adding a single string as a data variable?

In [8]: import xarray as xr
   ...: import zarr
   ...: x = xr.Dataset()
   ...: x.attrs['hello'] = 'world'  # changed
   ...: x
   ...: with zarr.ZipStore('test_store.zip', mode='w') as store:
   ...:     x.to_zarr(store)
   ...: with zarr.ZipStore('test_store.zip', mode='r') as store:
   ...:     x_read = xr.open_zarr(store).compute()
   ...:

Debugging is non-trivial because the ZipFile is closed by the time I could run %debug. It's not impossible but would take more than a quick look in a repl. That said, I do think this looks likely to be a zarr issue given it depends on the storage format that zarr is using. It's possible xarray is calling it incorrectly but I'd think it's less likely.

@hmaarrfk
Copy link
Contributor Author

Hmm, interesting!

I've avoided attrs since they often get "lost" in computation, and don't get dragged along as rigorously as coordinates.

I do have some real coordinates that are stored as strings.

Thanks for the quickfeedback.

Here is the reproducing code without using context managers (which auto clsoe things you know)

import xarray as xr
import zarr
x = xr.Dataset()
x['hello'] = 'world'
x
with zarr.ZipStore('test_store.zip', mode='w') as store:
    x.to_zarr(store)

read_store = zarr.ZipStore('test_store.zip', mode='r')                                           
x_read = xr.open_zarr(read_store).compute() 
# The error will happen before this line is executed
# read_store.close()

@max-sixty
Copy link
Collaborator

I've avoided attrs since they often get "lost" in computation, and don't get dragged along as rigorously as coordinates.

Yeah, we're gradually fixing this. You can set a config to keep attrs by default, or pass keep_attrs=True in most methods. We might still be missing a couple of methods; would appreciate a bug report for any of those.

@max-sixty
Copy link
Collaborator

Thanks for the alternative repro. And I'm fairly confident this is a zarr issue; if the format can't handle a zero-dimensioned coord, it should raise a specific error rather than it being a bad ZipFile

@hmaarrfk
Copy link
Contributor Author

Honestly, i've found that __ coordinates are treated as less strict coordinates, so I've been using those.

Keeping things like:

  • The version of different software that was used.

Seems more like an attribute, but really, it is all data, so :/.

Regarding the originally issue, it seems that you are right in the sense that a 0 dimension string might be buggy in zarr itself.

I guess we (when we have time) will have to dig down to find a MVC example that reproduces the issue without xarray to submit to Zarr.

@hmaarrfk
Copy link
Contributor Author

@jakirkham not sure if you have any thoughts on why the code above is bugging out.

@jakirkham
Copy link

Sorry I don't know. Maybe @rabernat can advise? 🙂

@hmaarrfk
Copy link
Contributor Author

I think the reason is that for zero sized arrays, you technically aren't allowed to write data to them.

writer.add(v.data, zarr_array)

This means that when you create the 0 sized array, you can't actually change the value.

Here is a reproducer without xarray

import zarr
import numpy as np
name = 'hello'
data = np.array('world', dtype='<U5')
store = zarr.ZipStore('test_store.zip', mode='w')
root = zarr.open(store , mode='w')
zarr_array = root.create_dataset(name, shape=data.shape, dtype=data.dtype)
root[name][...] = data
zarr_array[...]

Though the code path follows what xarray does in the backend.

@hmaarrfk
Copy link
Contributor Author

And actually, zarr provides a data argument in create_dataset that actually encounters the same bug

import zarr
import numpy as np
name = 'hello'
data = np.array('world', dtype='<U5')
store = zarr.ZipStore('test_store.zip', mode='w')
root = zarr.open(store , mode='w')
zarr_array = root.create_dataset(name, data=data, shape=data.shape, dtype=data.dtype)
zarr_array[...]

I guess i can open upstream in zarr, but I think for catching the 0 sized array case, it is probably best to use the data argument.

@jakirkham
Copy link

Sure an upstream issue would be welcome. Thanks for unpacking that further Mark 😀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-backends topic-zarr Related to zarr storage library
Projects
None yet
Development

No branches or pull requests

4 participants