Add minimal drafts of a few more dtype extensions #135

grlee77 · 2022-03-02T19:17:27Z

This adds draft extensions for the remaining dtypes supported for Zarr v2 arrays in zarr-python that are not part of the core v3 spec. Merging this would at least give us a URL to point to for these dtypes in zarr-developers/zarr-python#898.

The language here will eventually need updating to be more spec-like and more implementation neutral (e.g. less NumPy/Python specific). What is there now, at least describes the current behavior in zarr-python.

In the last community call @jbms (from from the tensorstore team) mentioned that NumPy's fixed width unicode arrays are not very desirable and variable length strings are likely of more interest. There is not a variable length string dtype in NumPy, but variable length strings can be stored as object arrays (H5Py for example does this). I know Pandas 1.0 also introduced their own StringDType, but I haven't looked into the specifics of that.

jbms · 2022-03-02T19:57:31Z

docs/protocol/extensions/object-dtypes/v1.0.rst

+is to have an array where each element is another array (and each array can
+have a different length). Another use case is to store an array of variable
+length strings. It is important to note that such an array actually just stores the references to the Python objects and not the objects themselves. Accessing
+an element of the array returns the Python object it refers to.


This should specify how this data type is actually encoded in the store --- presumably via pickling.

In general this sort of encoding issue might fall under the domain of "filters", which are not yet part of zarr v3 but might be important to add.

In general I have a number of concerns about this data type:

The pickle format by its nature is inherently tied to Python and cannot be supported by non-Python implementations.

By the nature of the format, the Python unpickle operation can lead to arbitrary code execution and therefore requires that the source data be fully trusted. I think if it is supported it is likely to be used out of convenience even in situations that don't call for that level of trust. In particular, if it is available by default in zarr-python then opening any untrusted zarr array becomes a potential security vulnerability.

In cases where the intent is not to store an arbitrary Python object, but e.g. just a variable length string or json value, just using the "object" type leaves the actual type unspecified.

I think it would be much better instead to decouple the zarr v3 data types from numpy data types --- and define as explicit zarr v3 data types:

variable length unicode and byte strings

json values

I think it would be best to avoid supporting a general Python object type at all.

Note: These variable length strings and json values when stored in a numpy array would have to be stored using the numpy object data type. But that does not need to match the data type of the zarr array.

Yes, I agree that the extension should not be specifically for general Python objects and needs to be revised substantially. We actually did not have any tests in core zarr-python involving object arrays, but I found that they were used in the Xarray test suite. Specifically this method on the DatasetIOBase class that the Zarr backend tests are a subclass of.

jbms · 2022-03-02T20:01:38Z

docs/protocol/extensions/string-dtypes/v1.0.rst

+-------------------------------------------
+
+These are fixed width strings corresponding to NumPy dtypes with `kind` 'S'.
+For backward compatibility with Python 2's ``str`` these are zero-terminated


Zero-terminated strings normally refers to the C null-terminated string convention where the length is not explicitly indicated and instead the end of the string is indicated by a '\0' value. I would call this zero-padded rather than zero-terminated, since there is no zero terminator if the string is exactly equal to the specified length.

Yeah, I was a bit confused regarding this as the NumPy dtype docs refer to this type as "zero-terminated bytes".

jbms · 2022-03-02T20:04:21Z

docs/protocol/extensions/struct-dtypes/v1.0.rst

+     - Numerical type
+     - Size (no. bytes)
+     - Byte order
+   * - list of (<name>, <type>) tuples


NumPy (and zarr v2) supports a third variant, where you have (name, type, shape). The shape specifies the dimensions of an "inner" array of elements.

Yeah, I had forgotten to list that one and will update it here to describe an optional [, shape]. I think we do test it already.

joshmoore

Great to have these discussions started, @grlee77! Also 👍 for @jbms' comments, but I could also see getting this into the dev branch before they are resolved as with zarr-developers/zarr-python#898 so that we have more people working from the same starting point.

joshmoore · 2022-03-07T18:45:59Z

docs/protocol/extensions.rst

@@ -11,6 +11,9 @@ Under construction.
   extensions/filters/v1.0
   extensions/complex-dtypes/v1.0
   extensions/datetime-dtypes/v1.0
+   extensions/object-dtypes/v1.0


Mostly as a side note as we work out the process, we may want to lower these numbers depending in our confidence and the intended interpretation.

joshmoore · 2022-05-06T14:02:51Z

Wonderfully, this has no conflicts despite the recent updates! I'm merging and will leave the comments above unresolved for capturing as the editing of v3 moves forward. (If anyone feels moved to create issues, please just cross-link)

Note: as mentioned in gitter, merging PRs into core-protocol-v3.0-dev doesn't imply that a feature is final.

grlee77 added 3 commits March 2, 2022 13:04

add draft of string dtypes extension

177c552

add draft of struct dtype extension

a16e120

add draft of object dtype extension

b73be77

grlee77 mentioned this pull request Mar 2, 2022

Implement Zarr V3 protocol zarr-developers/zarr-python#898

Merged

6 tasks

jbms reviewed Mar 2, 2022

View reviewed changes

joshmoore reviewed Mar 7, 2022

View reviewed changes

grlee77 mentioned this pull request Apr 11, 2022

implement Zarr v3 spec support pydata/xarray#6475

Merged

4 tasks

jbms mentioned this pull request May 4, 2022

v3 mission #140

Closed

joshmoore merged commit e889419 into zarr-developers:core-protocol-v3.0-dev May 6, 2022

joshmoore mentioned this pull request May 6, 2022

WIP: Core protocol v3.0 development branch #16

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add minimal drafts of a few more dtype extensions #135

Add minimal drafts of a few more dtype extensions #135

grlee77 commented Mar 2, 2022

jbms Mar 2, 2022

jbms Mar 2, 2022

grlee77 Mar 2, 2022 •

edited

jbms Mar 2, 2022

grlee77 Mar 2, 2022

jbms Mar 2, 2022

grlee77 Mar 2, 2022

joshmoore left a comment

joshmoore Mar 7, 2022

joshmoore commented May 6, 2022 •

edited

Add minimal drafts of a few more dtype extensions #135

Add minimal drafts of a few more dtype extensions #135

Conversation

grlee77 commented Mar 2, 2022

jbms Mar 2, 2022

Choose a reason for hiding this comment

jbms Mar 2, 2022

Choose a reason for hiding this comment

grlee77 Mar 2, 2022 • edited

Choose a reason for hiding this comment

jbms Mar 2, 2022

Choose a reason for hiding this comment

grlee77 Mar 2, 2022

Choose a reason for hiding this comment

jbms Mar 2, 2022

Choose a reason for hiding this comment

grlee77 Mar 2, 2022

Choose a reason for hiding this comment

joshmoore left a comment

Choose a reason for hiding this comment

joshmoore Mar 7, 2022

Choose a reason for hiding this comment

joshmoore commented May 6, 2022 • edited

grlee77 Mar 2, 2022 •

edited

joshmoore commented May 6, 2022 •

edited