Zarr N5 spec diff #3

Open
constantinpape opened this issue Jan 15, 2019 · 12 comments

constantinpape commented Jan 15, 2019

Overview of the diff between the zarr and n5 specs, with the potential goal of consolidating the two formats.
@alimanfoo, @jakirkham / @axtimwalde please correct me if I am misrepresenting the zarr / N5 spec or if you think there is something to add here.
Note that the zarr and n5 specs use different naming conventions:
the data-containers are called arrays in zarr and datasets in n5.
Zarr refers to the nested storage of data-containers as hierarchies or groups (it is not quite clear to me what the actual difference is, see below), whereas n5 only refers to groups.
I will use the group / dataset notation.

Edit:
Some corrections from @alimanfoo; I left in the original statements but struck them out.

Groups

  1. attributes
  • zarr: groups MUST contain a json file .zgroup which MUST contain zarr_format and MUST NOT contain any other keys. They CAN contain additional attributes in .zattrs.
  • n5: groups CAN contain a file attributes.json containing arbitrary json-serializable attributes. The root group "/" MUST contain the key n5 with the n5 version. (A minimal sketch of both attribute files follows this list.)
  2. zarr makes a distinction between hierarchies and groups. ~~I am not quite certain if there is a difference. The way I read the spec, having nested datasets is allowed, i.e. having a dataset that contains another dataset.~~ Zarr does not allow nested datasets (i.e. a dataset containing another dataset). This is not allowed in n5 either, I think. The spec does not explicitly forbid it though.
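
For illustration, a minimal sketch of what the two group-level files might contain, written here as Python that creates them (the paths and version strings are illustrative placeholders, not taken from either spec):

```python
import json, os

# Sketch: minimal group-level metadata for both formats
# (paths and version strings below are illustrative placeholders).
os.makedirs("example.zarr/group", exist_ok=True)
with open("example.zarr/group/.zgroup", "w") as f:
    json.dump({"zarr_format": 2}, f)           # the only key allowed in .zgroup
with open("example.zarr/group/.zattrs", "w") as f:
    json.dump({"comment": "arbitrary user attributes"}, f)

os.makedirs("example.n5/group", exist_ok=True)
with open("example.n5/attributes.json", "w") as f:
    json.dump({"n5": "2.0.0"}, f)              # version key required in the root group
with open("example.n5/group/attributes.json", "w") as f:
    json.dump({"comment": "arbitrary user attributes"}, f)
```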

Datasets

  1. metadata:
  • zarr: metadata is stored in .zarray. (A minimal sketch of both metadata files follows this list.)
  • n5: metadata is stored in attributes.json.
  2. layout:
  • zarr: supports C (row-major) and F (column-major) ~~indexing, which determines how chunks are indexed and how chunks are stored~~ order, which determines how items are stored within a chunk. This is specified via the key order. Chunks are always indexed as row-major.
  • n5: chunk indexing and storage is done according to column-major layout (F).
  3. dtype:
  • zarr: key dtype holds the numpy type encoding. Importantly, zarr supports big- and little-endian, which MUST be specified.
  • n5: key dataType; only numerical types and only big endian.
  4. compression:
  • zarr: supports all numcodecs compressors (and no compression), stored in key compressor.
  • n5: by default supports raw (= no compression), bzip2, gzip, lz4 and xz. There is a mechanism to support additional compressors. Stored in key compression.
  5. filters:
  • zarr: supports additional filters from numcodecs that can be applied to chunks before (de-)serialization. Stored in key filters.
  • n5: does not support this. However, the mechanism for additional compressors could be hijacked to achieve something similar.
  6. fill-value:
  • zarr: the fill-value determines how chunks that don't exist are initialised. Stored in key fill_value.
  • n5: the fill-value is hard-coded to 0 (and hence not part of the spec).
  7. attributes:
  • zarr: additional attributes can be stored in .zattrs.
  • n5: additional attributes can be stored in attributes.json. They MUST NOT override keys reserved for metadata.
    In addition, zarr and n5 store the shape of the dataset and of the chunks in the metadata with the keys shape, chunks / dimensions, blockSize.
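
For illustration, a sketch of equivalent dataset metadata for both formats, written as Python dicts (all values are illustrative; in particular, the exact n5 compression encoding differs between n5 spec versions):

```python
# Sketch of equivalent dataset metadata (illustrative values only).

# zarr: contents of .zarray
zarray = {
    "zarr_format": 2,
    "shape": [100, 100],
    "chunks": [10, 10],
    "dtype": "<f8",          # numpy encoding; endianness ("<" / ">") is explicit
    "order": "C",            # or "F"; item order within a chunk
    "fill_value": 0,
    "compressor": {"id": "gzip", "level": 5},
    "filters": None,
}

# n5: contents of attributes.json (these keys are reserved; additional
# user attributes may sit alongside them)
n5_attributes = {
    "dimensions": [100, 100],
    "blockSize": [10, 10],
    "dataType": "float64",   # numerical types only, always big endian
    "compression": {"type": "gzip", "level": 5},
}
```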

Chunk storage

  1. header:
  • zarr: chunks are stored without a header.
  • n5: chunks are stored with a header that encodes the chunk's mode (see 3.) and the shape of the chunk. (A rough sketch of reading such a header follows this list.)
  2. shape of edge chunks:
  • zarr: chunks are always stored with the full chunk shape, even if they overhang the dataset (e.g. chunk shape (30, 30) and dataset shape (100, 100)).
  • n5: only the valid part of a chunk is stored. This is possible due to 1.
  3. varlength chunks:
  • zarr: as far as I know not supported.
  • n5: supports var-length mode (specified in the header). In this case, the size of the chunk is not determined by the chunk's shape, but is additionally defined in the header. This is useful for ND storage of "less structured" data, e.g. a histogram of the values in the ROI corresponding to the chunk.
  4. indexing / storage:
  • zarr: chunks are indexed by . separated keys, e.g. 2.4. I think somewhere @alimanfoo mentioned that zarr also supports nested chunks, but I can't find this in the spec. These keys get mapped to a representation appropriate for the implementation. E.g. on the filesystem, keys can be trivially mapped to files called 2.4 or nested as 2/4.
  • n5: chunks are stored nested, e.g. 2/4. (This is also implementation dependent. There are implementations where nested might not make sense. The difference is only . separated vs. / separated.)
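
To make the header and varlength points concrete, here is a rough sketch of reading an n5 chunk header in Python. This is my reading of the n5 format description; the exact field layout (uint16 mode, uint16 number of dimensions, one uint32 per dimension, optional uint32 element count) is an assumption and should be checked against the n5 spec:

```python
import struct

def read_n5_chunk_header(f):
    # Assumed layout (see lead-in above); all fields big endian.
    mode, ndim = struct.unpack(">HH", f.read(4))                # mode, number of dimensions
    shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))   # per-dimension block size
    num_elements = None
    if mode == 1:                                               # varlength mode
        (num_elements,) = struct.unpack(">I", f.read(4))        # explicit element count
    return mode, shape, num_elements
```
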
@alimanfoo (Member)

  1. zarr makes a distinction between hierarchies and groups. I am not quite certain if there is a difference. The way I read the spec, having nested datasets is allowed, i.e. having a dataset that contains another dataset. This is not allowed in n5, I think. The spec does not explicitly forbid it though.

Sorry for any confusion, nesting datasets inside datasets is not allowed in zarr. I.e., you can put a group inside another group, or you can put a dataset inside a group. The word "hierarchy" in the zarr spec is used to mean a tree of groups and datasets, starting from some root group.

@alimanfoo (Member)

  • zarr: supports C and F indexing, which determines how chunks are indexed and how chunks are stored. This is determined via the key order.

In zarr, the "C" or "F" order refers to the ordering of items within a chunk.

Not completely sure what you mean by "indexing" here. E.g., do you mean how we refer to a specific chunk within the grid of chunks for a given array? If so, the indexing of chunks within the chunk grid is only ever done in zarr in row-major order. E.g., for a 2D array of shape (100, 100) and chunk shape (10, 10), chunk "0.1" always means the chunk with rows 0-9 and columns 10-19.
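
To make this concrete, a tiny sketch (plain Python; the helper name is just illustrative) of how an element index maps to a row-major chunk key:

```python
# Map an element index to its row-major, "."-separated chunk key.
def chunk_key(index, chunk_shape):
    return ".".join(str(i // c) for i, c in zip(index, chunk_shape))

# For a (100, 100) array with (10, 10) chunks:
print(chunk_key((5, 12), (10, 10)))   # "0.1" -> rows 0-9, columns 10-19
print(chunk_key((95, 3), (10, 10)))   # "9.0"
```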

@constantinpape (Author)

Sorry for any confusion, nesting datasets inside datasets is not allowed in zarr. I.e., you can put a group inside another group, or you can put a dataset inside a group. The word "hierarchy" in the zarr spec is used to mean a tree of groups and datasets, starting from some root group.

Thanks for clarifying, that makes total sense.
I find the part on groups and hierarchies in the spec a bit confusing, maybe there is room to improve it.
That would be a separate issue though.

@constantinpape (Author)

E.g., do you mean how we refer to a specific chunk within the grid of chunks for a given array? If so, the indexing of chunks within the chunk grid is only ever done in zarr in row-major order. E.g., for a 2D array of shape (100, 100) and chunk shape (10, 10), chunk "0.1" always means the chunk with rows 0-9 and columns 10-19.

Yes, that's what I meant.
Good to know.

@alimanfoo (Member) commented Jan 15, 2019

  • zarr: chunks are stored flat by . separated keys, e.g. 2.4. I think somewhere @alimanfoo mentioned that zarr also supports nested chunks, but I can't find this in the spec.

In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features. E.g., the default file-system storage class in zarr python (DirectoryStore) does the obvious thing of mapping keys to file paths without any transformation, so you will get a file called "2.4". But there is an alternative implementation of file-system storage (NestedDirectoryStore) which applies a transformation on the chunk keys to get to file paths, so you get file paths like "2/4".
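
As an illustration (a sketch against the zarr-python v2 API, not normative), the two stores differ only in how the same chunk keys are mapped to file paths:

```python
import zarr

# Flat layout: chunk key "2.4" becomes the file data_flat.zarr/2.4
z1 = zarr.zeros((100, 100), chunks=(10, 10), dtype="f8",
                store=zarr.DirectoryStore("data_flat.zarr"))
z1[:] = 1.0

# Nested layout: the same chunk key becomes data_nested.zarr/2/4
z2 = zarr.zeros((100, 100), chunks=(10, 10), dtype="f8",
                store=zarr.NestedDirectoryStore("data_nested.zarr"))
z2[:] = 1.0
```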

This is an example of how in the zarr storage spec there is a separation between the store interface, which is assumed to be an abstract key-value interface and does not make any assumptions about how that will get implemented in terms of files or objects or memory or whatever; and the underlying storage implementation, which makes concrete decisions about what files to create (if using a file system) and what each file should contain.

The zarr storage spec does not place any constraint on the storage implementation, as long as you can provide a key-value interface over it then any form of storage is allowed. E.g., storing data inside an sqlite3, bdb or lmdb database, or zip file, are all valid ways of storing zarr data on a file system.
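
For instance (a rough sketch, not part of the spec), zarr-python accepts any MutableMapping as a store, so a plain in-memory dict works just like files on disk:

```python
import zarr

# A plain dict as the key-value store: it ends up holding the
# ".zarray" metadata document plus one entry per chunk ("0.0", "0.1", ...).
store = {}
z = zarr.zeros((100, 100), chunks=(10, 10), dtype="f8", store=store)
z[:] = 42
print(sorted(store)[:4])   # e.g. ['.zarray', '0.0', '0.1', '0.2']
```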

That said I fully take the point that several people have made that it would be useful to also have some concrete storage implementations documented, either within the zarr storage spec or in some associated specs, so that e.g. anyone who wants to implement a specific file format can do so more easily.

@alimanfoo (Member)

Many thanks @constantinpape, great summary. Hopefully comments have clarified a few points, but very happy to expand on any areas.

@constantinpape (Author)

In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features.

Ok, this makes perfect sense. There might be implementations where nested does not have any meaning.
This means the difference to n5 is rather cosmetic, i.e. / separated vs. . separated. (And also C vs F order).

That said I fully take the point that several people have made that it would be useful to also have some concrete storage implementations documented, either within the zarr storage spec or in some associated specs

Yes, that would be very helpful indeed.

@constantinpape (Author)

Thanks for clarifying @alimanfoo.
I have edited the text accordingly.

@jakirkham (Member)

In zarr the chunk keys are always formed by separating chunk indices with a period, e.g. "2.4". However, the storage layer can make choices about how it maps keys down to underlying storage features.

Ok, this makes perfect sense. There might be implementations where nested does not have any meaning.
This means the difference to n5 is rather cosmetic, i.e. / separated vs. . separated. (And also C vs F order).

Along these lines PR ( zarr-developers/zarr-python#309 ) might be of interest. This effectively remaps keys to allow access of N5 content from within the Python Zarr library.
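
If that lands, usage might look roughly like the sketch below (assuming the store is exposed as zarr.N5Store; the name and details are not confirmed here):

```python
import zarr

# Hypothetical usage of the key-remapping store from that PR
# (assuming it is exposed as zarr.N5Store): read/write an N5
# container through the zarr API.
store = zarr.N5Store("example.n5")
z = zarr.zeros((100, 100), chunks=(10, 10), dtype="u1", store=store)
z[:] = 7
```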

@constantinpape (Author)

Along these lines PR ( zarr-developers/zarr#309 ) might be of interest. This effectively remaps keys to allow access of N5 content from within the Python Zarr library.

Thanks for pointing this out @jakirkham.
It will be very useful to read n5 from the main zarr library.
Though afaik the n5 varlen mode is not supported yet. Maybe @funkey could clarify.

I think that in general consolidating the specs would be of great use nevertheless.
It would reduce the double development effort and new language implementations would be able to read both formats by default.

@joshmoore (Member)

This is great, thank you!

I think that in general consolidating the specs would be of great use nevertheless. It would reduce the double development effort and new language implementations would be able to read both formats by default.

👍

For the items you listed under Groups and Datasets, @constantinpape, I get the feeling that n5 could be characterized as "minimum" whereas zarr might be "standard", if that's a useful relationship to have between specs.

The chunk differences seem to reverse that though....

@constantinpape (Author) commented Jan 16, 2019

For the items you listed under Groups and Datasets, @constantinpape, I get the feeling that n5 could be characterized as "minimum" whereas zarr might be "standard", if that's a useful relationship to have between specs.

The chunk differences seem to reverse that though....

Yes, that captures it pretty well. For Groups the differences are minimal though.
For Datasets zarr is more elaborate: it allows for more datatypes, e.g. unicode and structured datatypes (= tuples of datatypes), supports little and big endian as well as C and F order,
and has native support for filters other than compression.

For chunks n5 is more expressive, as it supports clipped edge chunks and varlength mode by means of the header data.
Note that zarr supports something similar to varlength chunks as well with datatype O (object), where encoding and decoding are achieved through a filter, see #6.
To me the n5 approach seems more portable though, because chunks can be decoded without the need for an extra filter.
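
For reference, a rough sketch of that zarr-side approach (zarr-python v2 plus numcodecs; VLenArray is just one possible object codec):

```python
import numpy as np
import zarr
from numcodecs import VLenArray

# "Ragged" data via object dtype plus an object codec: each element can
# hold a variable-length int array, e.g. a per-chunk histogram.
z = zarr.empty(4, dtype=object, object_codec=VLenArray("<i8"))
z[0] = np.arange(3)
z[1] = np.arange(10)
print(z[0], z[1])
```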
