Zarr N5 spec diff #3

Open
constantinpape opened this issue Jan 15, 2019 · 12 comments

constantinpape commented Jan 15, 2019

Overview of the diff between the zarr and n5 specs, with the potential goal of consolidating the two formats.
@alimanfoo, @jakirkham / @axtimwalde please correct me if I am misrepresenting the zarr / N5 spec or if you think there is something to add here.
Note that the zarr and n5 specs use different naming conventions:
the data-containers are called arrays in zarr and datasets in n5.
Zarr refers to the nested storage of data-containers as hierarchies or groups (it is not quite clear to me what the actual difference is, see below), whereas n5 only refers to groups.
I will use the group / dataset notation.

Edit:
Some corrections from @alimanfoo; I left in the original statements but struck them out.

Groups

  1. attributes
  • zarr: groups MUST contain a json file .zgroup which MUST contain zarr_format and MUST NOT contain any other keys. They CAN contain additional attributes in .zattrs.
  • n5: groups CAN contain a file attributes.json containing arbitrary json-serializable attributes. The root group "/" MUST contain the key n5 with the n5 version. (A minimal sketch of both attribute files follows this list.)
  2. zarr makes a distinction between hierarchies and groups. ~~I am not quite certain if there is a difference. The way I read the spec, having nested datasets is allowed, i.e. having a dataset that contains another dataset.~~ Zarr does not allow nested datasets (i.e. a dataset containing another dataset). This is not allowed in n5 either, I think. The spec does not explicitly forbid it though.
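
For illustration, a minimal sketch of what the two group-level files might contain, written here as Python that creates them (the paths and version strings are illustrative placeholders, not taken from either spec):

```python
import json, os

# Sketch: minimal group-level metadata for both formats
# (paths and version strings below are illustrative placeholders).
os.makedirs("example.zarr/group", exist_ok=True)
with open("example.zarr/group/.zgroup", "w") as f:
    json.dump({"zarr_format": 2}, f)           # the only key allowed in .zgroup
with open("example.zarr/group/.zattrs", "w") as f:
    json.dump({"comment": "arbitrary user attributes"}, f)

os.makedirs("example.n5/group", exist_ok=True)
with open("example.n5/attributes.json", "w") as f:
    json.dump({"n5": "2.0.0"}, f)              # version key required in the root group
with open("example.n5/group/attributes.json", "w") as f:
    json.dump({"comment": "arbitrary user attributes"}, f)
```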

Datasets

  1. metadata:
  • zarr: metadata is stored in .zarray. (A minimal sketch of both metadata files follows this list.)
  • n5: metadata is stored in attributes.json.
  2. layout:
  • zarr: supports C (row-major) and F (column-major) ~~indexing, which determines how chunks are indexed and how chunks are stored~~ order, which determines how items are stored within a chunk. This is specified via the key order. Chunks are always indexed as row-major.
  • n5: chunk indexing and storage is done according to column-major layout (F).
  3. dtype:
  • zarr: key dtype holds the numpy type encoding. Importantly, zarr supports big- and little-endian, which MUST be specified.
  • n5: key dataType; only numerical types and only big endian.
  4. compression:
  • zarr: supports all numcodecs compressors (and no compression), stored in key compressor.
  • n5: by default supports raw (= no compression), bzip2, gzip, lz4 and xz. There is a mechanism to support additional compressors. Stored in key compression.
  5. filters:
  • zarr: supports additional filters from numcodecs that can be applied to chunks before (de-)serialization. Stored in key filters.
  • n5: does not support this. However, the mechanism for additional compressors could be hijacked to achieve something similar.
  6. fill-value:
  • zarr: the fill-value determines how chunks that don't exist are initialised. Stored in key fill_value.
  • n5: the fill-value is hard-coded to 0 (and hence not part of the spec).
  7. attributes:
  • zarr: additional attributes can be stored in .zattrs.
  • n5: additional attributes can be stored in attributes.json. They MUST NOT override keys reserved for metadata.
    In addition, zarr and n5 store the shape of the dataset and of the chunks in the metadata with the keys shape, chunks / dimensions, blockSize.
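
For illustration, a sketch of equivalent dataset metadata for both formats, written as Python dicts (all values are illustrative; in particular, the exact n5 compression encoding differs between n5 spec versions):

```python
# Sketch of equivalent dataset metadata (illustrative values only).

# zarr: contents of .zarray
zarray = {
    "zarr_format": 2,
    "shape": [100, 100],
    "chunks": [10, 10],
    "dtype": "<f8",          # numpy encoding; endianness ("<" / ">") is explicit
    "order": "C",            # or "F"; item order within a chunk
    "fill_value": 0,
    "compressor": {"id": "gzip", "level": 5},
    "filters": None,
}

# n5: contents of attributes.json (these keys are reserved; additional
# user attributes may sit alongside them)
n5_attributes = {
    "dimensions": [100, 100],
    "blockSize": [10, 10],
    "dataType": "float64",   # numerical types only, always big endian
    "compression": {"type": "gzip", "level": 5},
}
```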

Chunk storage

  1. header:
  • zarr: chunks are stored without a header.
  • n5: chunks are stored with a header that encodes the chunk's mode (see 3.) and the shape of the chunk. (A rough sketch of reading such a header follows this list.)
  2. shape of edge chunks:
  • zarr: chunks are always stored with the full chunk shape, even if they overhang the dataset (e.g. chunk shape (30, 30) and dataset shape (100, 100)).
  • n5: only the valid part of a chunk is stored. This is possible due to 1.
  3. varlength chunks:
  • zarr: as far as I know not supported.
  • n5: supports var-length mode (specified in the header). In this case, the size of the chunk is not determined by the chunk's shape, but is additionally defined in the header. This is useful for ND storage of "less structured" data, e.g. a histogram of the values in the ROI corresponding to the chunk.
  4. indexing / storage:
  • zarr: chunks are indexed by . separated keys, e.g. 2.4. I think somewhere @alimanfoo mentioned that zarr also supports nested chunks, but I can't find this in the spec. These keys get mapped to a representation appropriate for the implementation. E.g. on the filesystem, keys can be trivially mapped to files called 2.4 or nested as 2/4.
  • n5: chunks are stored nested, e.g. 2/4. (This is also implementation dependent. There are implementations where nested might not make sense. The difference is only . separated vs. / separated.)
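
To make the header and varlength points concrete, here is a rough sketch of reading an n5 chunk header in Python. This is my reading of the n5 format description; the exact field layout (uint16 mode, uint16 number of dimensions, one uint32 per dimension, optional uint32 element count) is an assumption and should be checked against the n5 spec:

```python
import struct

def read_n5_chunk_header(f):
    # Assumed layout (see lead-in above); all fields big endian.
    mode, ndim = struct.unpack(">HH", f.read(4))                # mode, number of dimensions
    shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))   # per-dimension block size
    num_elements = None
    if mode == 1:                                               # varlength mode
        (num_elements,) = struct.unpack(">I", f.read(4))        # explicit element count
    return mode, shape, num_elements
```
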
@alimanfoo (Member)

  1. zarr makes a distinction between hierarchies and groups. I am not quite certain if there is a difference. The way I read the spec, having nested datasets is allowed, i.e. having a dataset that contains another dataset. This is not allowed in n5, I think. The spec does not explicitly forbid it though.

Sorry for any confusion, nesting datasets inside datasets is not allowed in zarr. I.e., you can put a group inside another group, or you can put a dataset inside a group. The word "hierarchy" in the zarr spec is used to mean a tree of groups and datasets, starting from some root group.

@alimanfoo (Member)

  • zarr: supports C and F indexing, which determines how chunks are indexed and how chunks are stored. This is determined via the key order.

In zarr, the "C" or "F" order refers to the ordering of items within a chunk.

Not completely sure what you mean by "indexing" here. E.g., do you mean how we refer to a specific chunk within the grid of chunks for a given array? If so, the indexing of chunks within the chunk grid is only ever done in zarr in row-major order. E.g., for a 2D array of shape (100, 100) and chunk shape (10, 10), chunk "0.1" always means the chunk with rows 0-9 and columns 10-19.
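
To make this concrete, a tiny sketch (plain Python; the helper name is just illustrative) of how an element index maps to a row-major chunk key:

```python
# Map an element index to its row-major, "."-separated chunk key.
def chunk_key(index, chunk_shape):
    return ".".join(str(i // c) for i, c in zip(index, chunk_shape))

# For a (100, 100) array with (10, 10) chunks:
print(chunk_key((5, 12), (10, 10)))   # "0.1" -> rows 0-9, columns 10-19
print(chunk_key((95, 3), (10, 10)))   # "9.0"
```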

@constantinpape (Author)

Sorry for any confusion, nesting datasets inside datasets is not allowed in zarr. I.e., you can put a group inside another group, or you can put a dataset inside a group. The word "hierarchy" in the zarr spec is used to mean a tree of groups and datasets, starting from some root group.

Thanks for clarifying, that makes total sense.
I find the part on groups and hierarchies in the spec a bit confusing, maybe there is room to improve it.
That would be a separate issue though.

@constantinpape (Author)

E.g., do you mean how we refer to a specific chunk within the grid of chunks for a given array? If so, the indexing of chunks within the chunk grid is only ever done in zarr in row-major order. E.g., for a 2D array of shape (100, 100) and chunk shape (10, 10), chunk "0.1" always means the chunk with rows 0-9 and columns 10-19.

Yes, that's what I meant.
Good to know.

@alimanfoo (Member) commented Jan 15, 2019

  • zarr: chunks are stored flat by . separated keys, e.g. 2.4. I think somewhere @alimanfoo mentioned that zarr also supports nested chunks, but I can't find this in the spec.

In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features. E.g., the default file-system storage class in zarr python (DirectoryStore) does the obvious thing of mapping keys to file paths without any transformation, so you will get a file called "2.4". But there is an alternative implementation of file-system storage (NestedDirectoryStore) which applies a transformation on the chunk keys to get to file paths, so you get file paths like "2/4".
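
As an illustration (a sketch against the zarr-python v2 API, not normative), the two stores differ only in how the same chunk keys are mapped to file paths:

```python
import zarr

# Flat layout: chunk key "2.4" becomes the file data_flat.zarr/2.4
z1 = zarr.zeros((100, 100), chunks=(10, 10), dtype="f8",
                store=zarr.DirectoryStore("data_flat.zarr"))
z1[:] = 1.0

# Nested layout: the same chunk key becomes data_nested.zarr/2/4
z2 = zarr.zeros((100, 100), chunks=(10, 10), dtype="f8",
                store=zarr.NestedDirectoryStore("data_nested.zarr"))
z2[:] = 1.0
```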

This is an example of how in the zarr storage spec there is a separation between the store interface, which is assumed to be an abstract key-value interface and does not make any assumptions about how that will get implemented in terms of files or objects or memory or whatever; and the underlying storage implementation, which makes concrete decisions about what files to create (if using a file system) and what each file should contain.

The zarr storage spec does not place any constraint on the storage implementation, as long as you can provide a key-value interface over it then any form of storage is allowed. E.g., storing data inside an sqlite3, bdb or lmdb database, or zip file, are all valid ways of storing zarr data on a file system.
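
For instance (a rough sketch, not part of the spec), zarr-python accepts any MutableMapping as a store, so a plain in-memory dict works just like files on disk:

```python
import zarr

# A plain dict as the key-value store: it ends up holding the
# ".zarray" metadata document plus one entry per chunk ("0.0", "0.1", ...).
store = {}
z = zarr.zeros((100, 100), chunks=(10, 10), dtype="f8", store=store)
z[:] = 42
print(sorted(store)[:4])   # e.g. ['.zarray', '0.0', '0.1', '0.2']
```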

That said I fully take the point that several people have made that it would be useful to also have some concrete storage implementations documented, either within the zarr storage spec or in some associated specs, so that e.g. anyone who wants to implement a specific file format can do so more easily.

@alimanfoo (Member)

Many thanks @constantinpape, great summary. Hopefully comments have clarified a few points, but very happy to expand on any areas.

@constantinpape (Author)

In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features.

Ok, this makes perfect sense. There might be implementations where nested does not have any meaning.
This means the difference to n5 is rather cosmetic, i.e. / separated vs. . separated. (And also C vs F order).

That said I fully take the point that several people have made that it would be useful to also have some concrete storage implementations documented, either within the zarr storage spec or in some associated specs

Yes, that would be very helpful indeed.

@constantinpape (Author)

Thanks for clarifying @alimanfoo.
I have edited the text accordingly.

@jakirkham (Member)

In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features.

Ok, this makes perfect sense. There might be implementations where nested does not have any meaning.
This means the difference to n5 is rather cosmetic, i.e. / separated vs. . separated. (And also C vs F order).

Along these lines PR ( zarr-developers/zarr-python#309 ) might be of interest. This effectively remaps keys to allow access of N5 content from within the Python Zarr library.
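
If that lands, usage might look roughly like the sketch below (assuming the store is exposed as zarr.N5Store; the name and details are not confirmed here):

```python
import zarr

# Hypothetical usage of the key-remapping store from that PR
# (assuming it is exposed as zarr.N5Store): read/write an N5
# container through the zarr API.
store = zarr.N5Store("example.n5")
z = zarr.zeros((100, 100), chunks=(10, 10), dtype="u1", store=store)
z[:] = 7
```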

@constantinpape (Author)

Along these lines PR ( zarr-developers/zarr#309 ) might be of interest. This effectively remaps keys to allow access of N5 content from within the Python Zarr library.

Thanks for pointing this out @jakirkham.
It will be very useful to read n5 from the main zarr library.
Though afaik the n5 varlen mode is not supported yet. Maybe @funkey could clarify.

I think that in general consolidating the specs would be of great use nevertheless.
It would reduce the double development effort and new language implementations would be able to read both formats by default.

@joshmoore (Member)

This is great, thank you!

I think that in general consolidating the specs would be of great use nevertheless. It would reduce the double development effort and new language implementations would be able to read both formats by default.

👍

For the items you listed under Groups and Datasets, @constantinpape, I get the feeling that n5 could be characterized as "minimum" whereas zarr might be "standard", if that's a useful relationship to have between specs.

The chunk differences seem to reverse that though....

@constantinpape (Author) commented Jan 16, 2019

For the items you listed under Groups and Datasets, @constantinpape, I get the feeling that n5 could be characterized as "minimum" whereas zarr might be "standard", if that's a useful relationship to have between specs.

The chunk differences seem to reverse that though....

Yes, that captures it pretty well. For Groups the differences are minimal though.
For Datasets zarr is more elaborate: it allows for more datatypes, e.g. unicode and structured datatypes (= tuples of datatypes), supports little and big endian as well as C and F order,
and has native support for filters other than compression.

For chunks n5 is more expressive, as it supports clipped edge chunks and varlength mode by means of the header data.
Note that zarr supports something similar to varlength chunks as well with datatype O (object), where encoding and decoding are achieved through a filter, see #6.
To me the n5 approach seems more portable though, because chunks can be decoded without the need for an extra filter.
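
For reference, a rough sketch of that zarr-side approach (zarr-python v2 plus numcodecs; VLenArray is just one possible object codec):

```python
import numpy as np
import zarr
from numcodecs import VLenArray

# "Ragged" data via object dtype plus an object codec: each element can
# hold a variable-length int array, e.g. a per-chunk histogram.
z = zarr.empty(4, dtype=object, object_codec=VLenArray("<i8"))
z[0] = np.arange(3)
z[1] = np.arange(10)
print(z[0], z[1])
```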
