
Requirements of store data #349

Closed
jakirkham opened this issue Dec 2, 2018 · 11 comments

Comments

@jakirkham
Member

Raising this issue to get an idea of what our requirements are of stores and what can be placed in them.

For instance, in many cases we require Arrays to have an object_codec to allow storing object types, since many stores would have difficulty with this data without explicit conversion to some sort of bytes-like object; however, we appear to be placing objects in a store as a test. We also seem to expect stores to be easily comparable, but this doesn't work if the store contains NumPy ndarrays. ( #348 )
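For context, this is roughly what the object_codec requirement looks like from the user side (a sketch assuming the zarr-python v2-era API; VLenUTF8 is just one example codec):

```python
import numcodecs
import numpy as np
import zarr

# Object-dtype arrays need an explicit object_codec so that chunks are
# encoded down to plain bytes before they ever reach the store.
z = zarr.empty(3, dtype=object, object_codec=numcodecs.VLenUTF8())
z[:] = np.array(['a', 'bb', 'ccc'], dtype=object)
```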

Should we set explicit requirements for stores? If so, what would those requirements be? And how would we enforce them?

@alimanfoo
Member

Thanks @jakirkham, very good idea to raise this.

FWIW I think as a minimum, a store:

  • must implement the MutableMapping interface
  • must support keys that are ASCII strings (provided as str on both PY2 and PY3)
  • must normalise keys as described in the zarr format spec v2
  • must reject keys that are invalid as defined in the zarr format spec v2
  • must support setting values that are any buffer-like object
  • must return values as any buffer-like object
  • must be pickleable
  • must (should?) implement equals comparison (__eq__) and compare true to another store of the same type holding the same data

...where a "buffer-like object" is an object exporting (PY2 and PY3) the new-style buffer interface or (PY2 only) the old-style buffer interface.

Optionally, a store:

  • may reject any key that is not an ASCII string
  • may reject a value that is not a buffer-like object
  • may (should?) reject a value that is buffer-like but of object dtype
  • may implement case-insensitive comparison of keys (not ideal, but hard to avoid for storage on some file-systems)
  • should implement listdir(), rmdir(), rename(), getsize() methods where possible
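
For concreteness, a minimal in-memory store meeting the "must" items above might look like the following sketch (the class name and the simplified key normalisation are illustrative, not zarr's actual implementation):

```python
from collections.abc import MutableMapping


class MemoryStore(MutableMapping):
    """Illustrative in-memory store covering the "must" items above.

    Key normalisation here is deliberately simplified; a real store would
    follow the rules in the zarr v2 spec.
    """

    def __init__(self):
        self._data = {}

    @staticmethod
    def _normalize(key):
        # collapse leading/trailing and repeated '/' characters
        return '/'.join(s for s in key.split('/') if s)

    def __getitem__(self, key):
        return self._data[self._normalize(key)]

    def __setitem__(self, key, value):
        # accept any buffer-like value and store it as bytes
        self._data[self._normalize(key)] = bytes(memoryview(value))

    def __delitem__(self, key):
        del self._data[self._normalize(key)]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __eq__(self, other):
        # compares true to another MemoryStore holding the same data
        return isinstance(other, MemoryStore) and self._data == other._data
```

Because the instance only holds a plain dict, it is also pickleable without any extra work.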

I know this doesn't directly answer your question about tests involving object arrays, but maybe gives a bit more context to that discussion.

At least this helps to clarify for me that we shouldn't really be using dict as a store class, as this does not handle the requirements regarding key normalisation. So your proposed move to use DictStore instead of dict as the default store class for arrays seems good to me.
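
To illustrate why a plain dict falls short on normalisation (using a simplified rule, not the exact spec wording):

```python
# A plain dict treats spelling variants of the same logical key as distinct:
d = {}
d['foo/bar'] = b'1'
d['/foo/bar/'] = b'2'
assert len(d) == 2   # two entries for what the spec regards as one key

# A conforming store normalises keys first, e.g. (simplified):
def normalize_key(key):
    return '/'.join(s for s in key.split('/') if s)

assert normalize_key('foo/bar') == normalize_key('/foo/bar/') == 'foo/bar'
```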

@jakirkham
Member Author

Thanks for outlining this so clearly.

Based on our discussion here and in ( #348 ), I am leaning towards hardening the requirement that DictStore must hold bytes ( #350 ) (or at least bytes-like data) and that Array should use DictStore for in-memory storage to enforce this requirement ( #351 ).

Something to think about in the larger context is how we validate a store. Should we have a function that is able to run through a store and make sure it is spec conforming?
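
One possible shape for such a helper, purely hypothetical (neither the name validate_store nor the specific checks exist in zarr):

```python
import pickle
from collections.abc import MutableMapping


def validate_store(store):
    """Hypothetical helper: report basic ways a store falls short of the
    requirements sketched above (name and checks are illustrative only)."""
    problems = []
    if not isinstance(store, MutableMapping):
        problems.append('store does not implement MutableMapping')
        return problems
    try:
        pickle.dumps(store)
    except Exception:
        problems.append('store is not pickleable')
    # round-trip a buffer-like value through an ASCII key
    key, value = 'zarr_validate_test', b'\x00\x01\x02'
    try:
        store[key] = value
        if bytes(memoryview(store[key])) != value:
            problems.append('value did not round-trip as a buffer-like object')
    finally:
        store.pop(key, None)
    return problems
```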

@alimanfoo
Member

alimanfoo commented Dec 3, 2018 via email

@jakirkham
Member Author

That's an interesting idea. Was thinking about it in the context of validating stores for use with Array (though maybe this could/should be thought of more generally). How dynamic probably depends on when/how it is run. For instance we could register valid stores, in which case this would happen once when they are registered. Stores could be pre-registered as well (i.e. builtin ones). Alternatively we could do it whenever the store is used (maybe caching valid store types). Could also just leave this as a user facing function and trust users have tested their store with it. This latter case may lead to defensive coding on our part though.

@alimanfoo
Member

> That's an interesting idea. Was thinking about it in the context of validating stores for use with Array (though maybe this could/should be thought of more generally). How dynamic probably depends on when/how it is run. For instance we could register valid stores, in which case this would happen once when they are registered. Stores could be pre-registered as well (i.e. builtin ones). Alternatively we could do it whenever the store is used (maybe caching valid store types). Could also just leave this as a user facing function and trust users have tested their store with it. This latter case may lead to defensive coding on our part though.

FWIW I'd be happy if we provided developer support so store class developers can thoroughly test a store class implementation, but then at runtime trust that users provide something sensible as a store. We're already pretty defensive, e.g., we normalise all storage paths above the storage layer. We could also check that the result of the chunk encoding pipeline is an object supporting the buffer protocol before passing it on to storage. In other words, we guarantee that we'll provide valid keys and values to the storage layer, but after that I think we can just trust stores to do something reasonable with keys and values.
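
That buffer-protocol check could be as simple as attempting to take a memoryview of the encoded chunk before handing it to the store (a sketch, not existing zarr code):

```python
def ensure_buffer(chunk):
    """Raise a clear error if the encoded chunk does not support the buffer protocol."""
    try:
        memoryview(chunk)
    except TypeError:
        raise TypeError(
            'chunk encoding pipeline produced an object that does not '
            'support the buffer protocol: {!r}'.format(type(chunk))
        )
    return chunk
```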

@rabernat
Contributor

rabernat commented Jan 9, 2019

I just reviewed this thread. I'm thinking ahead towards questions of language inter-operability, and I'm concerned that our definition of a store is too python-centric.

While it should always be possible to implement a custom store by following the requirements above, perhaps we should also define a spec for a store that does not depend on python concepts such as mutable-mapping, pickleable, etc. This would make it easier to implement zarr in other languages.

@alimanfoo
Member

alimanfoo commented Jan 10, 2019 via email

@rabernat
Contributor

Thanks for the clarification. I see how this thread is specific to python implementations.

I guess I worry that the spec is too vague with regards to the implementation of the key value store, and the methods that can be used to query it:

> A Zarr array can be stored in any storage system that provides a key/value interface, where a key is an ASCII string and a value is an arbitrary sequence of bytes, and the supported operations are read (get the sequence of bytes associated with a given key), write (set the sequence of bytes associated with a given key) and delete (remove a key/value pair).

In terms of operations, "read", "write", and "delete" don't seem like a complete enumeration of the operations a store must support. When implementing a store, you also need at least some form of "list" operation; otherwise zarr can't discover what is in the store. (The exception is consolidated metadata stores.) In fact, you have to implement a MutableMapping, which has five abstract methods: __getitem__, __setitem__, __delitem__, __iter__, and __len__.
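
Those five abstract methods can be confirmed from the standard library:

```python
from collections.abc import MutableMapping

print(sorted(MutableMapping.__abstractmethods__))
# ['__delitem__', '__getitem__', '__iter__', '__len__', '__setitem__']
```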

More generally, how do we ensure that DirectoryStore, ZipStore, or any of the myriad cloud stores that have been developed can truly be read from different implementations of zarr? I wonder if it would be worth explicitly defining a spec for certain commonly used stores that gives more detail about the implementation choices that have already been made in the zarr python code.

@alimanfoo
Member

alimanfoo commented Jan 10, 2019 via email

@jakirkham
Member Author

In PR ( #789 ) we added a BaseStore class, which addresses some of these basic needs of stores.
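
A rough sketch of what a custom store looks like on top of that (assuming zarr-python >= 2.11, where BaseStore can be imported from zarr.storage; the class body here is illustrative only, and the exact hooks BaseStore provides vary by version):

```python
from zarr.storage import BaseStore


class MyStore(BaseStore):
    """Minimal custom store: we implement the five MutableMapping abstract
    methods; BaseStore layers the zarr-specific hooks on top."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)
```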

Subsequent discussion around the v3 spec and around storing standardized data from libraries handles the other concerns raised here.

Were there any other things still needing to be addressed here?

cc @joshmoore @grlee77

@jhamman
Member

jhamman commented Dec 7, 2023

Closing now that the v3 spec goes into much more detail on the subject.

@jhamman closed this as completed Dec 7, 2023.