
Clarify status and semantics of object ('O') data type in storage spec #6

Open
alimanfoo opened this issue Jan 16, 2019 · 7 comments
Labels
data-type protocol-extension Protocol extension related issue v2

Comments

@alimanfoo
Member

The current storage spec in the section on “data type encoding” describes how the data type of an array should be encoded in the array metadata. Here is the content from the beginning of the section:

Simple data types are encoded within the array metadata as a string,
following the `NumPy array protocol type string (typestr) format
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html>`_. The format
consists of 3 parts:

* One character describing the byteorder of the data (``"<"``: little-endian;
  ``">"``: big-endian; ``"|"``: not-relevant)
* One character code giving the basic type of the array (``"b"``: Boolean (integer
  type where all values are only True or False); ``"i"``: integer; ``"u"``: unsigned
  integer; ``"f"``: floating point; ``"c"``: complex floating point; ``"m"``: timedelta;
  ``"M"``: datetime; ``"S"``: string (fixed-length sequence of char); ``"U"``: unicode
  (fixed-length sequence of Py_UNICODE); ``"V"``: other (void * – each item is a
  fixed-size chunk of memory))
* An integer specifying the number of bytes the type uses.

The byte order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and
``"|S12"`` are valid data type encodings.
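As an illustration (not part of the spec), the three parts of a typestr can be pulled apart with a short stdlib helper; `parse_typestr` is a hypothetical name used here only for the example:

```python
import re

# Parse a NumPy typestr like "<f8" into (byteorder, kind, nbytes).
# Trailing unit annotations such as "[ns]" are ignored by this sketch.
TYPESTR_RE = re.compile(r'^([<>|])([biufcmMSUV])(\d+)')

def parse_typestr(typestr):
    m = TYPESTR_RE.match(typestr)
    if m is None:
        raise ValueError(f"invalid typestr: {typestr!r}")
    byteorder, kind, nbytes = m.groups()
    return byteorder, kind, int(nbytes)

print(parse_typestr("<f8"))   # ('<', 'f', 8)
print(parse_typestr("|S12"))  # ('|', 'S', 12)
```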

The spec then goes on to describe how datetime and timedelta data types are encoded:

For datetime64 ("M") and timedelta64 ("m") data types, these MUST also include the
units within square brackets. A list of valid units and their definitions are given in
the `NumPy documentation on Datetimes and Timedeltas
<https://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#datetime-units>`_.
For example, ``"<M8[ns]"`` specifies a datetime64 data type with nanosecond time units.
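For illustration, the bracketed units can be extracted from such a typestr with a small stdlib helper (`datetime_units` is a hypothetical name, not a function defined by the spec):

```python
import re

def datetime_units(typestr):
    # Extract the time units from a datetime64/timedelta64 typestr,
    # e.g. "<M8[ns]" -> "ns"; returns None if no units are present.
    m = re.search(r'\[(\w+)\]$', typestr)
    return m.group(1) if m else None

print(datetime_units("<M8[ns]"))  # ns
print(datetime_units("<m8[D]"))   # D
```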

...and also how structured data types are encoded:

Structured data types (i.e., with multiple named fields) are encoded as a list
of two-element lists, following `NumPy array protocol type descriptions (descr)
<http://docs.scipy.org/doc/numpy/reference/arrays.interface.html#>`_. For
example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]`` defines a
data type composed of three single-byte unsigned integers labelled "r", "g" and
"b".
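To make the layout concrete, here is a stdlib sketch that reads such a JSON descr and sums the per-field byte counts to obtain the total item size (illustrative only; `itemsize` is a hypothetical helper):

```python
import json
import re

# A structured data type as it would appear in zarr v2 array metadata.
descr = json.loads('[["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]')

def itemsize(typestr):
    # Number of bytes used by a single typestr, e.g. "|u1" -> 1.
    return int(re.match(r'^[<>|][a-zA-Z](\d+)', typestr).group(1))

# Total bytes per item of the structured type: 1 + 1 + 1.
total = sum(itemsize(t) for _, t in descr)
print(total)  # 3
```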

Implicit in all of this is that the spec is inheriting the numpy definition of data types, and deferring to the numpy documentation as much as possible.

In addition to fixed-memory data types, numpy also defines an “object” data type (character code ‘O’). In numpy, an array with object data type is an array of memory pointers, where each pointer dereferences to a Python object. Although the object data type is described in the numpy documentation, it is not mentioned at all in the zarr storage spec. It is therefore unclear whether it is or is not a valid data type for use in a zarr array, and if it is, what its semantics are.

The Python zarr implementation has in fact fully supported the object data type since version 2.2 (zarr-developers/zarr-python#212). The implementation follows numpy in the sense that, when data are retrieved from a zarr array with object data type, they are returned to the user as a numpy array with object data type, i.e., as an array of Python objects.

However, this does not mean that the encoded zarr data are necessarily Python-specific. When storing data into a zarr array with object data type, how the objects are encoded is delegated to the first codec in the filter chain. For example, if the first codec in the filter chain is the MsgPack codec, then data will be encoded using MessagePack encoding, which is a language-independent encoding and can in principle be decoded in a variety of programming languages. Similarly, if the array contains only string objects, then the VLenUTF8 codec can be used, which will encode data in a format similar to parquet encoding, and which could be decoded in any programming language. Further examples are given in the sections on string arrays and object arrays in the zarr tutorial.
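To give a flavour of what such a language-independent encoding might look like, here is a stdlib sketch of a length-prefixed UTF-8 layout in the same spirit as VLenUTF8 (the actual on-disk format is defined by numcodecs, not by this sketch; function names here are illustrative):

```python
import struct

def encode_vlen_utf8(strings):
    # Sketch of a VLenUTF8-style layout: a little-endian uint32 item
    # count, then for each item a uint32 byte length followed by the
    # UTF-8 bytes of the string.
    out = [struct.pack('<I', len(strings))]
    for s in strings:
        b = s.encode('utf-8')
        out.append(struct.pack('<I', len(b)))
        out.append(b)
    return b''.join(out)

def decode_vlen_utf8(buf):
    # Inverse of encode_vlen_utf8: walk the buffer item by item.
    n = struct.unpack_from('<I', buf, 0)[0]
    pos = 4
    items = []
    for _ in range(n):
        length = struct.unpack_from('<I', buf, pos)[0]
        pos += 4
        items.append(buf[pos:pos + length].decode('utf-8'))
        pos += length
    return items

data = ["foo", "barbaz", "qüx"]
assert decode_vlen_utf8(encode_vlen_utf8(data)) == data
```

Because every field is a plain length-prefixed byte sequence, a decoder in any language can reconstruct the strings without knowing anything about Python.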

In the longer term, the community may want to revisit the approach to encoding of arrays with variable-length data types, and to produce a new major revision of the storage spec. However, I suggest that we first aim to resolve this issue by adding some clarifying text to the version 2 storage spec, to make explicit the status and semantics of the object data type. As precedent, we have previously made a number of edits to the version 2 storage spec to make clarifications, see the changes section of the spec. For example, we added clarifications regarding the datetime and timedelta data types, and we added clarifications regarding the encoding of fill values, so I am hoping for a similar resolution here.

@jakirkham
Member

I'm sure @DennisHeimbigner has thoughts here :)

@DennisHeimbigner

One immediate thought is that using a codec/filter (like for compression) to manage
the format of an object introduces the same problem as with filters. Namely, it
becomes impossible to read the data if you do not have the codec.

Also, how does this handle, say, trees, where nodes will contain other 'O'-typed objects?
How is the codec determined? Or is it assumed that the top-level codec knows
everything about the types used in the tree?

@alimanfoo
Member Author

One immediate thought is that using a codec/filter (like for compression) to manage
the format of an object introduces the same problem as with filters. Namely, it
becomes impossible to read the data if you do not have the codec.

Yes that's true. I'd say that the way to address that would be for the community to converge on a set of "core" codecs that all implementations are encouraged to support where possible. That applies to all codecs, including compressors.

Also, how does this handle, say, trees, where nodes will contain other 'O'-typed objects?
How is the codec determined? Or is it assumed that the top-level codec knows
everything about the types used in the tree?

The object data are passed to the first codec in the filter chain, and it is assumed that this codec will flatten the data into a contiguous sequence of bytes and pass that on to the next codec in the chain (or to storage if there are no more codecs). The first codec can raise an error if the object data contains types that it can't encode. E.g., the VLenUTF8 codec will raise if it gets an array of anything other than string objects.

Btw I'm not trying to sell this as a perfect solution, I'm sure we will want to revisit this in the next iteration of the core spec. But this is how the Python implementation works currently, and I think it's worth clarifying in the v2 spec that this is how it works, so that anyone targeting compatibility with the v2 spec knows what to aim for.
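The delegation described above can be sketched as a simple codec chain, in which only the first codec understands objects and every later codec maps bytes to bytes (the class names here are hypothetical and do not correspond to the numcodecs API):

```python
import json
import zlib

class JSONCodec:
    # Hypothetical first codec: flattens an array of Python objects
    # into a contiguous byte sequence.
    def encode(self, objects):
        return json.dumps(objects).encode('utf-8')
    def decode(self, buf):
        return json.loads(buf.decode('utf-8'))

class ZlibCodec:
    # Bytes-to-bytes codec, as every later member of the chain must be.
    def encode(self, buf):
        return zlib.compress(buf)
    def decode(self, buf):
        return zlib.decompress(buf)

def encode_chain(codecs, objects):
    # Pass data through each codec in order; the result goes to storage.
    buf = objects
    for c in codecs:
        buf = c.encode(buf)
    return buf

def decode_chain(codecs, buf):
    # Reading back applies the codecs in reverse order.
    for c in reversed(codecs):
        buf = c.decode(buf)
    return buf

chain = [JSONCodec(), ZlibCodec()]
data = [{"a": 1}, [1, 2, 3], "hello"]
assert decode_chain(chain, encode_chain(chain, data)) == data
```

A first codec of this kind can simply raise an error when handed objects it cannot serialise, which is the behaviour described above for VLenUTF8 when given non-string objects.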

@jstriebel jstriebel added data-type protocol-extension Protocol extension related issue core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec labels Nov 16, 2022
@jstriebel
Member

At the moment, the object data type is not supported in v3 and might be added as an extension. So this is mostly a v2 issue for now, I assume.

@jstriebel jstriebel added v2 and removed core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec labels Nov 23, 2022
@rabernat
Contributor

The issue here is whether V3 aims to be fully backwards compatible with V2. In the memory-layout discussion, the desire for backwards compatibility was cited as a reason to allow both C and F order. If V3 will not support "O" data type, then this will break backwards compatibility with existing V2 data.

The extension would be an obvious way around this.

I personally support the idea of dropping "O" dtype in the core spec and then re-adding it immediately as an extension.

@jakirkham
Member

Am also in favor of dropping the "O" dtype. Adding it through an extension sounds reasonable.

@jbms
Contributor

jbms commented Nov 23, 2022

Definitely agreed that it should be dropped for now.

It could be added back as a Python-specific python-object data type for use with a pickle codec, but not all uses of the zarr v2 data type should be migrated to it.

For example:

* With the json codec it should be migrated to a new json data type.
* With the vlenutf8 codec it should be migrated to a new vlen Unicode string data type.
* etc.

The end result would be that remaining uses of python-object are likely pretty rare.
