
Protocol extensions #49

Open · hammer opened this issue Feb 27, 2020 · 10 comments
Labels: protocol-extension (Protocol extension related issue)

@hammer commented Feb 27, 2020

Protocol extensions are mentioned throughout the v3.0 core protocol docs but they are still undocumented. Have there been any design discussions yet around how to specify and implement protocol extensions?

@ryan-williams (Member) commented:

Very timely q, @hammer 😀; @joshmoore, @jakirkham, and I have been discussing a simple but expressive formulation of "plugins" that can be supported by Zarr relatively non-invasively.

Here's a writeup of my thinking coming from those discussions; it's by no means authoritative, and I'm not even sure we all mean the same things when we refer to "plugins" vs. "extensions", but those discussions seem to finally be happening more concretely after a bit of a hiatus:

"Plugins" draft proposal

A surprising variety of extensions to core Zarr behaviors can be enabled via a simple hook into Zarr's "load" and "save" paths, intercepting the transformation between [a given "on-disk" hierarchy] and [a Zarr Group|Dataset] (and vice versa):

"Read" side

  • when reading/loading a Zarr Group or Dataset, a "plugin" is essentially a partial function, passed to Zarr's loading machinery, that takes a "path" (really a store "key") as input and returns something that looks like a Group or Dataset (but can actually be of any type)
  • during loading, Zarr passes each path (which would normally point to a Group|Dataset to load) to each enabled "plugin", which can return either:
    • None (to opt out of operating on the given input), or
    • [some other object that should replace the input path's position in the loaded Zarr tree]
  • the returned object can have special attrs that allow it to present, to the Zarr loading machinery, as a virtual Group whose descendants Zarr should continue to expand/traverse (see the sketch after this list)
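A minimal sketch of this read-side hook, in Python; the names (ReadPlugin, load_node) are hypothetical and do not exist in any Zarr implementation today:

from typing import Any, Callable, List, Optional

# A read-side "plugin" is a partial function from a store key to either None
# (decline) or a replacement object that takes that key's place in the loaded tree.
ReadPlugin = Callable[[str], Optional[Any]]

def load_node(store: dict, key: str, plugins: List[ReadPlugin]) -> Any:
    # Offer the key to each enabled plugin in order; the first non-None result wins.
    for plugin in plugins:
        replacement = plugin(key)
        if replacement is not None:
            return replacement
    # Fall back to whatever the core loader would produce (a plain dict lookup
    # stands in for that here).
    return store[key]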

"Write" side

A straightforward inversion of the above applies; a plugin presents to Zarr's write/save machinery as:

  • a partial function, taking any object (typically a normal Group|Dataset) and (optionally) returning a Group|Dataset (or some other object that supports serialization into a directory-tree structure, but that potentially looks nothing like a Zarr Group|Dataset)
  • during saving, Zarr passes each Group|Dataset|[object that was loaded by a plugin] to each plugin, which optionally transforms it into [something Zarr knows how to serialize to a directory tree] (see the sketch after this list).
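The write-side mirror, equally hypothetical (WritePlugin and save_node are illustrative names only):

from typing import Any, Callable, List, Optional

# A write-side "plugin" may transform an in-memory object into something the core
# machinery knows how to serialize to a directory tree, or return None to decline.
WritePlugin = Callable[[Any], Optional[Any]]

def save_node(store: dict, key: str, node: Any, plugins: List[WritePlugin]) -> None:
    for plugin in plugins:
        transformed = plugin(node)
        if transformed is not None:
            node = transformed  # first plugin to claim the object wins
            break
    # A plain dict assignment stands in for the core serialization step.
    store[key] = node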

Examples

Lots of commonly-discussed extensions to Zarr can be modeled by the above scheme:

Sparse arrays

Sparse arrays are a great "hello world" example of a plugin:

  • input: Group with 3 1-D arrays (data, indices, indptr) and attr with value "CSR" or "CSC"
  • output: scipy.sparse 2-D array

This would allow users to load a Zarr tree with a scipy.sparse array substituted for Groups of a certain shape. This is approximately how the zsparse proof-of-concept is implemented.
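A sketch of such a plugin, assuming the hypothetical read-side hook hands it an already-loaded group-like object, and assuming (purely for illustration) that the format is recorded in an attr named "sparse_type":

import numpy as np
import scipy.sparse

def sparse_read_plugin(group):
    # "sparse_type" is an assumed attr name; the proposal above only says an
    # attr carries the value "CSR" or "CSC".
    fmt = group.attrs.get("sparse_type")
    if fmt not in ("CSR", "CSC"):
        return None  # decline: not a sparse-array group
    cls = scipy.sparse.csr_matrix if fmt == "CSR" else scipy.sparse.csc_matrix
    # scipy infers the 2-D shape from indptr/indices; it could also be stored in attrs.
    return cls((np.asarray(group["data"]),
                np.asarray(group["indices"]),
                np.asarray(group["indptr"])))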

N5 interop/support

N5 and Zarr have similar specs (cf. #3), and there are plans for Zarr to more seamlessly support and interoperate with N5 in v3.

N5 groups and datasets could be loaded virtually, as their Zarr counterparts, by an appropriate "plugin" that makes the requisite changes to how paths are loaded and saved.

N5 support is currently provided by the N5Store, but that doesn't compose with other "stores" (cf. zarr-developers/zarr-python#395). "Plugins" as described here could solve this issue; it's also possible that a cleaner way to nest/compose "stores" would work (while keeping "extensions" more narrowly-scoped).

Pyramidal Data

Various ways of encoding "pyramidal data" (e.g. the same image data, duplicated at various power-of-2-downscaled resolutions, common in the imaging and geo communities and a widely requested Zarr feature; cf. #23) can be implemented as "plugins" by injecting a processing step during loading/saving (similarly to the sparse arrays example above).

Specific encodings can vary:

  • a naive set of sibling Datasets, one per resolution
  • a JPEG2000 file (which stores multiple resolutions natively), perhaps with a zattr indicating the data should be parsed as JPEG2000
  • the output could be a tensor (where all resolutions are stacked, with the downsampled layers virtually upscaled), or some other object that forces a choice of resolution at the outset, etc.

HDF5/TileDB interop/support

Combining ideas from the N5 and Pyramidal examples above, other binary/opaque structures could be embedded in a Zarr tree.

An "HDF5" plugin for Zarr could:

  • intercept Zarr's loading machinery and detect the presence of an HDF5 file as a child of a Zarr group
  • load the HDF5 file (e.g. using h5py)
  • return a virtual Zarr Group wrapping the loaded HDF5 file's root group
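A rough sketch of those steps, assuming the hypothetical hook also gives the plugin a way to resolve a store key to a local filesystem path (resolve_path below is an invented helper):

import h5py

def hdf5_read_plugin(key, resolve_path):
    # resolve_path is a hypothetical helper mapping a store key to a local file path.
    path = resolve_path(key)
    if path is None or not h5py.is_hdf5(path):
        return None  # decline: not an HDF5 file
    # Return the file's root group; the loading machinery would wrap it as a
    # virtual Zarr Group and continue traversal from there.
    return h5py.File(path, mode="r")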

Consolidated Metadata

Zarr "plugins" as described can change how metadata is loaded (e.g. loading a single "consolidated metadata" file containing the metadata for all descendents of a Group, and virtually attaching each Group and Dataset's metadata to it before returning to the user; cf. zarr-developers/zarr-python#268).

Symlinks

A plugin can read a specific zattr key as encoding a pointer or symlink to another Zarr Group or Dataset, allowing linking (cf. zarr-developers/zarr-python#297, zarr-developers/zarr-python#389).
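For illustration only, with an invented attr name ("zarr_link_target") and the assumption that the plugin can see the loaded node plus the enclosing hierarchy:

def symlink_read_plugin(node, hierarchy):
    # "zarr_link_target" is an assumed attr name holding the path of the link target.
    target = node.attrs.get("zarr_link_target")
    if target is None:
        return None           # not a link; decline
    return hierarchy[target]  # substitute the linked Group/Dataset in the tree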

Heterogeneous Chunk Sizes

cf. #40. This is not just a matter of injecting a different code path at load/save time; getitem/setitem on the returned object would also need to be handled substantially differently. It may be outside the scope of "plugins" as currently conceived here.

Non-JSON Metadata, Attributes

cf. #37. Likely implementable as a plugin in this scheme.

Discussion

  • The number of significant changes to Zarr that may be implementable using "plugins" of this type is large enough to give pause about whether this design is overly broad.
  • It should have minimal footprint on the main Zarr codebase, which is nice.
  • If 90% of Zarr usage doesn't use any such plugins, then it's probably fine and good that there is a long tail of extended functionality available to users based on specific contexts' needs.
  • Composing plugins should be handled with care:
    • plugins (like filters today) will be order-dependent, and their outputs should (optionally) be suitable inputs to downstream plugins
      • having plugins that simply don't play nicely with downstream plugins is probably fine to allow, as a baseline
      • otherwise, plugins' outputs should have a way to pretend to look like directory trees (to downstream plugins consuming them as inputs)

Next Steps

  • It will be really useful to prototype a loading/saving hook like the one described above (and a few "plugins" using that hook that implement some of the examples above).
  • We should also critique the proposed design in the meantime, and analyze other comparable "plugin"/"extension" frameworks.
  • Some pieces of the above design can be hashed out further (e.g. the as-yet-unspecified zattrs used for communication between plugins' returned values and Zarr, such as for presenting as, or being serialized to, directory trees).

@meggart (Member) commented Feb 27, 2020

I want to throw in another use case I have been thinking about for quite some time, which is overlapping chunks. The problem is that for moving-window operations in >1 dimension you cannot process your data chunk by chunk, even if all you need are just a few data points from the boundary of the neighbouring chunk.

So the idea would be to duplicate some data at the chunk boundaries to make read operations that read a chunk plus a few neighbouring data points faster. Software implementing this extension would need (a rough sketch of (3) follows the list):

  1. a filter that allocates a larger matrix when reading a chunk and generates a view with the appropriate offset
  2. a modified __getitem__ which optimizes access across chunk boundaries, so that when the user supplies indices that just overlap a chunk boundary the necessary data is available in a single chunk
  3. a sync_boundaries or similar function which copies the overlap data from every chunk to its neighbours, so that the dataset is consistent again after write operations
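Not part of the proposal above, just an illustration of step (3) for a 1-D array split into numpy chunks that each carry an overlap-sized halo on both sides:

def sync_boundaries(chunks, overlap):
    # Each chunk is a 1-D numpy array laid out as [left halo | core | right halo],
    # with `overlap` duplicated values on each side.
    for left, right in zip(chunks[:-1], chunks[1:]):
        right[:overlap] = left[-2 * overlap:-overlap]  # left chunk's core edge -> right chunk's halo
        left[-overlap:] = right[overlap:2 * overlap]   # right chunk's core edge -> left chunk's halo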

I think (1) could be handled by hooking into load/save as you described. But (2) and (3), I guess, would not be possible with your currently proposed save/load hook? I don't think this should stop your approach, since it indeed covers a lot of cases, but I just wanted to mention it here; maybe others have had similar thoughts already and have ideas towards an implementation.

This leads me to the next question: should/can Zarr extensions be language-specific? That means I could potentially implement overlapping chunks as an optional extension to the Zarr Julia package. Would it be necessary to provide a Python implementation as well? I just want to understand whether what you call a "plugin" is conceived from a software side or from a spec side. I could quite easily write a spec extension for my scenario just describing how the data would be stored in the chunks, but the implementation would be language-dependent.

@alimanfoo (Member) commented:

Thanks @ryan-williams, this is really helpful.

One quick comment, I think it helps to make a distinction between protocols and implementations.

A protocol is a definition of how data and metadata are organised, transferred and processed, and should be described in a way that is independent of any programming language.

We are working on a zarr core protocol. Hopefully there will be multiple implementations of that protocol, in various programming languages.

The community will also want to define a variety of protocol extensions. For any given protocol extension, there will then be at least one implementation.

Some protocol extensions (e.g., consolidated metadata) may be general purpose and widely used, and so many of the software libraries which implement the core protocol may also choose to implement some of these protocol extensions.

Some protocol extensions may be a bit more specialist or domain-specific, and for those it may in some cases make sense to implement in a separate library, as a plugin to an existing core protocol library. To put this another way, a software library which implements the core protocol might also provide some hooks where additional "plugin" code can be registered and can be delegated to provide special behaviour when a protocol extension is being used and has been declared in the zarr metadata.

So for each of the examples given above, I would ask two separate questions:

  1. Can we define a protocol extension that describes, constrains or modifies how zarr data and metadata are organised and processed?

  2. How might such a protocol extension be implemented in a given programming language and tooling ecosystem?

There is also a third question, which is, does the zarr core protocol provide the necessary extension points to allow such a protocol extension to be declared?

Hope that's useful, I'll follow up with some discussion of specific examples shortly.

@alimanfoo (Member) commented Apr 8, 2020

Speaking generally, for any of the examples given above, I imagine that a spec for a protocol extension would ideally include things like:

  • What problem is this protocol extension trying to solve? I.e., what are the use cases? What additional or modified functionality will it enable?

  • What is the URI for the protocol extension?

  • Is this a protocol extension that can reasonably be ignored by a zarr reader and still provide some useful functionality? Or should a zarr reader terminate processing if it does not implement this extension?

  • What additional and/or modified data and/or metadata will be stored for a zarr hierarchy that uses this protocol extension? What store keys will be used? How will the data and/or metadata be formatted?

  • How is an implementation of this protocol extension expected to behave during reading? And during writing (which could include creating arrays or groups or writing data within an array)?

To take the example of consolidated metadata, this is a protocol extension which is intended to accelerate reading of zarr metadata, particularly in storage systems with high latency such as cloud object stores. It is used for data that is typically written once and read many times, and thus where the zarr metadata is relatively static once the hierarchy is created. The idea is to combine all the array and group metadata documents from a hierarchy into a single document that can be read via a single storage request.

Ideally some kind of persistent URI would be declared for this protocol extension, e.g., something like https://purl.org/zarr/spec/protocol/extension/consolidated-metadata/1.0. This URI then provides an identifier to use in zarr metadata to declare the protocol extension is being used. The URI should also dereference (redirect) to the actual spec document published on the web somewhere, so it is easily discoverable.

This would be a protocol extension that can be safely ignored. I.e., a zarr reader could ignore it and still provide useful functionality, because all of the original metadata documents are still present.

The protocol extension spec would then define the concept of a consolidated metadata document, and define its format (JSON) and how it would be structured internally to store the contents of all the array and group metadata documents. The spec would also declare the storage key under which this document would be stored.
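Purely as an illustration of the consolidation step (not spec text): treating the store as a mapping of keys to JSON text, and assuming, arbitrarily, that metadata keys end in ".json" and that the combined document nests the originals under a "metadata" member:

import json

def consolidate_metadata(store, consolidated_key="metadata.json"):
    # Collect every metadata document in the store into one JSON document.
    combined = {
        key: json.loads(value)
        for key, value in store.items()
        if key.endswith(".json") and key != consolidated_key
    }
    store[consolidated_key] = json.dumps({"metadata": combined})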

The spec would also define how to declare the protocol extension is in use, along with any configuration options. In this case, because this is a protocol extension that applies to the whole hierarchy, this could be done via the zarr entry point metadata, e.g., a hierarchy with consolidated metadata would have entry point metadata like:

{
    "zarr_format": "https://purl.org/zarr/spec/protocol/core/3.0",
    "metadata_encoding": "application/json",
    "extensions": [
        {
            "extension": "https://purl.org/zarr/spec/protocol/extension/consolidated-metadata/1.0",
            "must_understand": false,
            "configuration": {
                "key": "metadata.json",
                "compressor": null
            }
        }
    ]
}

Adding this information to the entry point metadata provides a standard route for implementations to discover that consolidated metadata is present. It would also tell the implementation where to look for the consolidated metadata (in this case under the "metadata.json" key in the store). Any other variations, such as whether the consolidated metadata document was compressed or not, could also be declared here via the extension config.

A zarr reader which implements this extension would start by reading the entry point metadata document, and would look to see if the extension had been declared. If it was, it would then retrieve the document under the "metadata.json" storage key, parse it, and use the consolidated metadata instead of the individual array and group metadata documents.
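A sketch of that reader behaviour against a dict-like store, assuming (for illustration) that the entry point metadata is stored under the key "zarr.json", and using the example extension URI and configuration above:

import json

CONSOLIDATED_URI = "https://purl.org/zarr/spec/protocol/extension/consolidated-metadata/1.0"

def read_consolidated(store):
    entry = json.loads(store["zarr.json"])  # assumed entry point metadata key
    for ext in entry.get("extensions", []):
        if ext.get("extension") == CONSOLIDATED_URI:
            key = ext["configuration"]["key"]  # "metadata.json" in the example above
            return json.loads(store[key])      # use the consolidated document
    return None  # not declared: fall back to individual metadata documents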

The spec would probably also need to deal with issues like what if consolidated metadata is out of sync with array and group metadata documents (e.g., it is up to the user to keep them in sync in a way that makes sense for the given dataset and its intended usage.)

Hope that helps to illustrate what a protocol extension might look like for a particular example. I'll try to do a couple more.

@alimanfoo (Member) commented:

To work through another example, here's some thoughts on how to do sparse arrays.

The idea of a sparse arrays protocol extension is to define a common convention for how to store the different component arrays of a sparse array.

The URI of the protocol extension could be something like https://purl.org/zarr/spec/protocol/extension/sparse-arrays/1.0.

The sparse arrays protocol extension spec would define conventions for how to store data for each of the common types of sparse arrays.

For example, for a sparse array using the compressed sparse row (CSR) representation, the data should be stored in three component 1-D arrays named "data", "indices" and "indptr", which are siblings within the same group. The spec would describe what each of these component arrays should contain, any constraints on dtype and shape, etc.

This would be a protocol extension that can be safely ignored, i.e., a zarr reader could ignore it and still provide some useful functionality, because it might still be useful to view the underlying group and arrays or manually extract data from a component array.

Because this is a protocol extension that applies to individual groups within a hierarchy, the spec would then also define how to declare that a particular group contains a sparse array conforming to the spec. This could be done via the zarr group metadata document, e.g.:

{
    "extensions": [
        {
            "extension": "https://purl.org/zarr/spec/protocol/extension/sparse-arrays/1.0",
            "must_understand": false,
            "configuration": {
                "sparse_type": "csr_matrix",
            }
        }
    ]
}

An implementation might then offer a function to load a sparse array from a given path within a zarr hierarchy. For example, if a sparse array had been stored in a group at path "/foo/bar" then a call to load_sparse("/foo/bar") would return an appropriate object for the given programming language and tooling ecosystem. In Python, for example, it would make sense to return an instance of the scipy.sparse.csr_matrix class.

Similarly, an implementation might offer a function to save a sparse array to a given hierarchy path within a zarr hierarchy. E.g., in Python this function might accept an instance of scipy.sparse.csr_matrix and a hierarchy path like "/foo/bar" and then write out the component arrays into a group at that path.
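As a sketch of what the write side might look like in Python, assuming a zarr-python v2 style group API (create_group, create_dataset, attrs); storing the extension declaration in user attributes here is only a stand-in, since the declaration would really live in the group metadata document as shown above:

SPARSE_URI = "https://purl.org/zarr/spec/protocol/extension/sparse-arrays/1.0"

def save_sparse(root, path, matrix):
    # Write a scipy.sparse.csr_matrix as the three component arrays of the convention.
    group = root.create_group(path)
    group.create_dataset("data", data=matrix.data)
    group.create_dataset("indices", data=matrix.indices)
    group.create_dataset("indptr", data=matrix.indptr)
    # Stand-in for the extension declaration in the group metadata document.
    group.attrs["extensions"] = [{
        "extension": SPARSE_URI,
        "must_understand": False,
        "configuration": {"sparse_type": "csr_matrix"},
    }]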

@alimanfoo (Member) commented:

Here's an illustration of how the non-regular chunks example could be accommodated as a protocol extension. This example is a bit different from the others above, because it hooks into a specific extension point within the v3 core protocol: it is an alternative chunk grid type.

The protocol extension spec for non-regular chunks would describe how, in some use cases, it is necessary to organise data into chunks that form a rectilinear grid. The spec would also give a precise definition of this grid type. E.g., (borrowing from Wikipedia) a rectilinear grid is a tessellation of an array's index space by rectangles (or rectangular cuboids) that are not, in general, all congruent to each other. The chunks in the grid may still be indexed by integers as in a regular grid, but the mapping from indices to vertex coordinates is not uniform, as it would be in a regular grid.

As in the other examples, the extension would need a URI, e.g. https://purl.org/zarr/spec/protocol/extension/rectilinear-chunk-grid/1.0.

The protocol extension would then define how to store the sizes of the chunks along each dimension of the grid. For a general rectilinear grid, this requires a list of integers for each dimension of the array, where the length of each list corresponds to the number of grid divisions in that dimension and the integer values give the chunk sizes along that dimension.

Given that this is a protocol extension that applies to an individual array within a hierarchy, this would be declared within a zarr array metadata document. There is a specific hook for this type of extension, which is using the chunk_grid name in the metadata. E.g.:

{
    "shape": [10000, 1000],
    "data_type": "<f8",
    "chunk_grid": {
        "type": "rectilinear",
        "extension": "https://purl.org/zarr/spec/protocol/extension/rectilinear-chunk-grid/1.0",
        "chunk_sizes": [
            [100, 200, 300, 400],
            [200, 100, 200, 100, 200, 100, 100]
        ]
    },
    "chunk_memory_layout": "C",
    "compressor": {
        "codec": "https://purl.org/zarr/spec/codec/gzip/1.0",
        "configuration": {
            "level": 1
        }
    },
    "fill_value": "NaN",
    "extensions": [],
    "attributes": {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }
}

The store keys used to store array chunks could still be the same as for a regular grid. I.e., you could still use keys like "0.1.3" for a chunk in a 3D array, because a rectilinear chunk grid can be indexed in the same way that a regular chunk grid can, even though the chunk sizes are not equal.

The protocol extension spec would then need to describe how an implementation would deal with operations that are reading or writing to a specific region within an array using this grid type. This would include a description of how to identify the set of chunks that overlap a given array region.
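For the chunk-lookup step, a small illustration (not spec text): given the per-dimension chunk sizes from the example metadata above and a requested region as half-open (start, stop) ranges, compute which chunk indices overlap the region:

import itertools
import numpy as np

def overlapping_chunks(chunk_sizes, region):
    # chunk_sizes: one list of sizes per dimension; region: one (start, stop) per dimension.
    per_dim = []
    for sizes, (start, stop) in zip(chunk_sizes, region):
        edges = np.cumsum([0] + list(sizes))  # chunk boundaries along this dimension
        first = int(np.searchsorted(edges, start, side="right")) - 1
        last = int(np.searchsorted(edges, stop, side="left")) - 1
        per_dim.append(range(first, last + 1))
    return list(itertools.product(*per_dim))  # chunk indices as tuples

# E.g. for the example array above, the region [150:350, 0:250] touches chunks
# (1, 0), (1, 1), (2, 0) and (2, 1).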

Note that an implementation of this protocol extension would need to hook into and override the part of a core protocol implementation that deals with reading and writing regions of an array. You could still think of this as a "plugin" possibly, although it is a specific type of plugin that implements a chunk grid protocol extension.

Also note that this is an example of a protocol extension which a zarr reader must understand in order to read data correctly. However, I did not include a must_understand key in the array metadata document because I assumed that chunk grid extension types are always must understand, and therefore a zarr reader that does not recognise the chunk grid type should terminate processing immediately.

@joshmoore (Member) commented:

@DennisHeimbigner gave an update yesterday on the netcdf-c zarr implementation ("nczarr") during which he touched on the special issues that will be faced in C when trying to discover, instantiate, etc. extensions. I'll leave him to say more.

@alimanfoo (Member) commented:

> @DennisHeimbigner gave an update yesterday on the netcdf-c zarr implementation ("nczarr") during which he touched on the special issues that will be faced in C when trying to discover, instantiate, etc. extensions. I'll leave him to say more.

Sorry I missed that, looks like a lot of great discussion.

It's certainly an important question to pin down the codec interface, to decide whether one interface can be used for both compressors and filters, and whether we need to allow data type information to be passed through in addition to raw buffers.

@DennisHeimbigner commented:

As part of the netcdf zarr, I was looking forward to implementing types like enumerations and structures (in the C sense) using extensions. Let me focus on structures: can someone outline to me how a struct extension would be specified?
