Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zarr.open_array does not properly recognize NestedDirectoryStore #530

Closed
JimBiardCics opened this issue Dec 18, 2019 · 11 comments · Fixed by #716
Closed

zarr.open_array does not properly recognize NestedDirectoryStore #530

JimBiardCics opened this issue Dec 18, 2019 · 11 comments · Fixed by #716

Comments

@JimBiardCics
Copy link

import numpy
import zarr

a = zarr.create((10, 100, 100), chunks = (1, 100, 100), dtype = 'f4',
                store = zarr.NestedDirectoryStore('foo.zarr'), overwrite = True)

a[...] = 1.0

print('shape =', a.shape)
print('a has non-zero values =', numpy.any(a))

del a

a = zarr.open_array('foo.zarr', mode = 'r')

print('shape =', a.shape)
print('a has non-zero values =', numpy.any(a))

Problem description

A zarr 'file' created using NestedDirectoryStore will not load correctly when later opened using the path alone. The shape (and other metadata) is correct, but the values are not loaded. In the code sample above, the shape of the original and reloaded array is the same, but the contents don't match. The second array contains only zeroes.

Version and installation information

  • Value of zarr.__version__ - 2.3.2
  • Value of numcodecs.__version__ - 0.6.4
  • Version of Python interpreter - 3.6.7 (anaconda)
  • Operating system (Linux/Windows/Mac) - Mac OS Mojave (10.14.6)
  • How Zarr was installed - using conda
@jrbourbeau
Copy link
Member

Thanks for opening an issue and the nice example @JimBiardCics! I'm able to reproduce the same behavior.

When using a string path with zarr.open_array, zarr will use the following logic to guess what store should be used

def normalize_store_arg(store, clobber=False, default=dict):
if store is None:
return default()
elif isinstance(store, str):
if store.endswith('.zip'):
mode = 'w' if clobber else 'a'
return ZipStore(store, mode=mode)
elif store.endswith('.n5'):
return N5Store(store)
else:
return DirectoryStore(store)
else:
return store

For a path like 'foo.zarr', zarr will use a DirectoryStore. Since the array was saved using a NestedDirectoryStore, you'll want to open the array with the same store.

In your example, replacing

a = zarr.open_array('foo.zarr', mode = 'r')

with

a = zarr.open_array(zarr.NestedDirectoryStore('foo.zarr'), mode = 'r')

should fix things

@JimBiardCics
Copy link
Author

JimBiardCics commented Dec 18, 2019

@jrbourbeau The problem, as I see it, is that the application using the file has no way to determine which type of store to use. It is not particularly reasonable to expect all applications to know ahead of time which store type to use when opening a zarr file.

@stuarteberg
Copy link

+1 for this issue!

Instead of guessing the store from the file extension, can the store class name simply be added to the metadata in .zarray?

@alimanfoo
Copy link
Member

alimanfoo commented Feb 28, 2020 via email

@joshmoore
Copy link
Member

A general big 👍 from me as well, and a cross link to zarr-developers/zarr-specs#49, but I'd add a caution that a class name will not necessarily be portable between implementations.

@joshmoore
Copy link
Member

@alimanfoo : do you imagine it would be possible to do something in v2 attributes as a prototype of the ultimate v3 implementation?

@alimanfoo
Copy link
Member

@alimanfoo : do you imagine it would be possible to do something in v2 attributes as a prototype of the ultimate v3 implementation?

Yes this should be doable. The code that constructs the chunk key is _chunk_key(). You could modify that to be something like:

    def _chunk_key(self, chunk_coords):
        return self._key_prefix + self._key_separator.join(map(str, chunk_coords))

You could then modify the implementation of the Array class constructor to look either in array metadata or attributes for a 'zarr_chunk_key_separator' field or something like that, store the value as an instance attribute, then use it within the _chunk_key() method. A few questions to work through, like how to specify the chunk key when creating a new array, and how to store it. Happy to elaborate and review a PR if anyone has interest in implementing.

@perlman
Copy link

perlman commented Jul 31, 2020

I hit this as well, despite having previously seen this issue!

I look forward to the fix in v3.

@joshmoore
Copy link
Member

(cross-posting from the linked neuroglancer PR) I will be looking at a PR for zarr-python ... which will use one of a small number of options for finding the key. Work on the Java side has taken place in bcdev/jzarr#17:

  • heuristic that loops through chunks "0.0.0", "0/0/0", "0.0.1", "0/0/1" searching for existence... (with the downside being the overheaded of the search if no key_separator was chosen)
  • uses optional metadata in the .zarray (with the downside that we need to agree on this metadata)
  • ...

see also: ome/ngff#29 which references https://gitter.im/zarr-developers/community?at=601d471fc83ec358be27944f

@andreasg123
Copy link

In Python, one should be able to resolve this with an os.listdir, possibly with an additional os.stat, assuming a local file system. Obviously, this would be more difficult for S3 and other storage options. If os.listdir finds anything matching N.M* (glob), the separator is .. If none of them match and any N is a directory, the separator is / (defining N and M as a sequence of digits).

@joshmoore
Copy link
Member

I believe #716 now implements a solution for this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants