
z5 library (Zarr/N5 interoperability) #44

Closed
jakirkham opened this issue Jan 24, 2018 · 9 comments

Comments

@jakirkham
Member

Ran across z5 recently, which allows reading and writing of both Zarr and N5 in C++ and Python. As Zarr and N5 have both grown FWICT for similar reasons, but in different languages (Python and Java respectively), am interested to understand the similarities and differences between them. Along those lines, it would be good to learn in what areas interoperability between Zarr and N5 can be improved. I think we would be in a really great place if data can more smoothly move between these two formats and different languages.

cc @constantinpape @saalfeldlab

@constantinpape

constantinpape commented Jan 24, 2018

I started writing z5 because I needed access to chunked data storage that allows parallel I/O from C++. I wasn't sure whether to use N5 or Zarr, so I decided to support both, as the differences between the specifications are small.
Since then, the project has evolved to function more as Python bindings for N5, though I still try to maintain compatibility with Zarr; this is just not tested as thoroughly.

The main differences I can think of right now:

  • Zarr uses the z, y, x axis convention, N5 uses x, y, z. In z5, I use z, y, x and switch internally when using the N5 backend.
  • Zarr always stores full chunks, even if they are not fully contained in the dataset. N5 only stores the part contained in the dataset. This creates the need for chunk headers, which are not necessary in Zarr. Also, if chunks become too small, they might cause issues with the compression library (though I have not seen this happen in practice).
  • N5 supports a mode for storing multiple values per index (not supported in z5 yet).
  • N5 stores data in big endian, whereas Zarr supports both endiannesses (little endian by default).
  • In Zarr, all chunks are stored in the dataset directory as z.y.x; in N5 they are stored in subdirectories x/y/z.
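The naming and axis-order differences in the last bullet can be sketched in a few lines. This is an illustrative helper, not an official API of either library; it assumes the "." separator used by Zarr before nested directories landed:

```python
# Sketch (not an official API): map a chunk index to its on-disk name
# under the conventions described above. Zarr joins the index with dots
# inside the dataset directory; N5 nests subdirectories and reverses the
# axis order (x/y/z vs. z/y/x).

def zarr_chunk_key(index):
    """Chunk index (z, y, x) -> flat key like '0.1.2'."""
    return ".".join(str(i) for i in index)

def n5_chunk_path(index):
    """Chunk index (z, y, x) -> nested path like '2/1/0' (axes reversed)."""
    return "/".join(str(i) for i in reversed(index))

print(zarr_chunk_key((0, 1, 2)))  # 0.1.2
print(n5_chunk_path((0, 1, 2)))   # 2/1/0
```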

@jhamman
Member

jhamman commented Jan 24, 2018

@jakirkham - thank you for posting. With the recent addition of Zarr as a backend to the xarray project, we have been discussing that a potentially key limitation of the library/storage format is the lack of a low-level language implementation (e.g. C++) for others beyond the Python world to use. It sounds like this has been accomplished by z5, which is indeed encouraging on many fronts.

cc @rabernat and @kmpaul.

@mrocklin

Is it a sensible goal to evolve both projects to implement the same spec?

@jakirkham
Member Author

Possibly. FWIW here's N5's spec.

@jakirkham
Member Author

Thanks for the info @constantinpape. That's very helpful. :)

Regarding the axis convention, did you play at all with Zarr Array's order parameter? Personally I haven't really gone outside the default C ordering. That said, a quick test writing and loading the data directly with NumPy shows this is Fortran-ordered on disk. So maybe this would help? Would be curious to know. :)

Could you please elaborate on the partial chunks point? What parts are being stored? What is tracked in the header?

Also could you elaborate on what it means to store multiple values per index?

As per the directory layout, recently PR ( zarr-developers/zarr-python#177 ) provided the option to nest directories with Zarr as well. So this option should be available in the 2.2.0 release.

It seems N5 also stores the attributes in a different file (though still JSON).

@constantinpape

constantinpape commented Jan 25, 2018

@jakirkham

The axis convention actually matters in two different places:

First in how the attributes (i.e. shape and chunk shape) are stored and accordingly in which order the chunks are addressed. Here, Zarr uses the z, y, x convention (i.e. C-order), whereas N5 uses x, y, z (i.e. F-order).

Second, in how chunks are stored on disk. Here, Zarr supports both orders via the order parameter,
whereas N5 stores chunks in C order. (In z5, I am only supporting C order for the time being.)
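The first of these two points, the metadata axis order, amounts to reversing the shape and chunk-shape lists when switching backends, which is presumably what z5 does internally. A minimal sketch, using the real metadata key names of each format (`shape`/`chunks` in Zarr's .zarray, `dimensions`/`blockSize` in N5's attributes.json) but with an illustrative helper name:

```python
# Sketch of the metadata translation described above: N5 stores shape and
# chunk shape in x, y, z order, Zarr in z, y, x, so the lists are simply
# reversed when converting between the two.

def zarr_to_n5_meta(zarr_meta):
    """Translate Zarr-style array metadata to N5-style keys and axis order."""
    return {
        "dimensions": list(reversed(zarr_meta["shape"])),
        "blockSize": list(reversed(zarr_meta["chunks"])),
    }

meta = zarr_to_n5_meta({"shape": [100, 200, 300], "chunks": [10, 20, 30]})
print(meta)  # {'dimensions': [300, 200, 100], 'blockSize': [30, 20, 10]}
```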

Regarding chunks:
The chunk-header in N5 is built up as follows:

  • 2 bytes storing the mode (0 means the default mode, 1 the varlength mode, i.e. multiple values per position)
  • 4 bytes storing the number of dimensions D
  • followed by D times 4 bytes indicating the size for each dimension
    If the varlength mode is activated, this is followed by another 4 bytes indicating the actual number of elements (because in varlength mode, each position can hold multiple values).
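The header layout can be sketched with a small parser. Field widths follow the published N5 spec (big-endian uint16 mode, uint32 dimension count, uint32 per-dimension size, and a uint32 element count in varlength mode); the function itself is illustrative, not z5's or N5's actual code:

```python
import struct

# Minimal sketch of reading the N5 chunk header described above.
def read_n5_header(buf):
    """Return (mode, stored chunk shape, varlength element count, data offset)."""
    mode, ndim = struct.unpack_from(">HI", buf, 0)       # 2 + 4 bytes
    shape = struct.unpack_from(">%dI" % ndim, buf, 6)    # 4 bytes per dim
    offset = 6 + 4 * ndim
    num_elements = None
    if mode == 1:  # varlength: the actual element count follows
        (num_elements,) = struct.unpack_from(">I", buf, offset)
        offset += 4
    return mode, shape, num_elements, offset

# A default-mode header for a 1-D chunk storing 10 elements:
header = struct.pack(">HI1I", 0, 1, 10)
print(read_n5_header(header))  # (0, (10,), None, 10)
```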

Note that the chunk shape is adjusted such that each chunk "fits" into the dataset.
E.g. given a dataset with shape (100,) and chunks (30,), the last chunk would only have shape (10,).
This fact (and the varlength mode) creates the need for the header.

Zarr doesn't need a header, because it always stores chunks at full shape; e.g. in the case above
the last chunk would have shape (30,), but [10:] would not be accessible.
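The clipping rule in that example is simple to state in code. This is an illustrative helper (not part of either library) computing the shape N5 actually stores for a given chunk index:

```python
# Sketch of the edge-chunk clipping described above: N5 stores only the
# part of a chunk that lies inside the dataset, while Zarr always stores
# the full chunk shape.

def n5_stored_chunk_shape(shape, chunks, chunk_index):
    """Per-axis stored extent of a chunk, clipped at the dataset boundary."""
    return tuple(
        min(c, s - i * c)
        for s, c, i in zip(shape, chunks, chunk_index)
    )

# Dataset of shape (100,) with chunks (30,): the 4th chunk holds only 10.
print(n5_stored_chunk_shape((100,), (30,), (3,)))  # (10,)
print(n5_stored_chunk_shape((100,), (30,), (0,)))  # (30,)
```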

Good to know that Zarr will also support nested directories.

N5 stores all metadata in attributes.json. This includes shape, chunk shape, datatype and compression. Custom attributes are also stored there, and are not allowed to clash with the metadata entries.
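For concreteness, here is what such an attributes.json might look like. The key names (`dimensions`, `blockSize`, `dataType`, `compression`) come from the N5 spec; the concrete values, and the `resolution` custom attribute, are made up for the example:

```python
import json

# Illustrative N5 attributes.json: spec-defined metadata entries plus one
# hypothetical custom attribute stored alongside them.
attributes = {
    "dimensions": [100, 200, 300],
    "blockSize": [10, 20, 30],
    "dataType": "uint8",
    "compression": {"type": "gzip", "level": -1},
    # custom attributes live in the same file, but must not clash
    # with the metadata keys above
    "resolution": [4.0, 4.0, 40.0],
}
print(json.dumps(attributes, indent=2))
```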

@jakirkham
Member Author

Thanks for the follow-up @constantinpape.

Was wondering if the chunk naming was the cause of the C/F difference. It might be possible to make this configurable in Zarr with the intention of closer compatibility with other formats. Raised issue ( zarr-developers/zarr-python#232 ) to discuss this point and have laid out a potential way forward, though other thoughts/suggestions would certainly be welcome.

As to the chunk layout, we probably could investigate shortening edge chunks somehow. Have written up some thoughts on it in issue ( zarr-developers/zarr-python#233 ) with a few possibilities. Think we could implement this without a header, but it would be good to know if I'm missing anything. Also would generally appreciate feedback on the ideas there and any other approaches that would be worth considering.

Am still a little confused on varlength. Is this only used for making shortened edge blocks or is there something else it is helpful for? In particular each position holding multiple values is confusing. Any other examples that might be useful?

As to the metadata, raised issue ( saalfeldlab/n5#24 ) suggesting these be broken out into two files. Though that is admittedly a breaking change as it is proposed. Perhaps there is a way to do it so that it is non-breaking?

Any thoughts on any of these issues would be appreciated.

@axtimwalde

The current plan for N5 is to support zarr as one of many possible backends. Towards this end, I introduced extensible compression schemes in 2.0.0, because zarr uses Blosc instead of the Java internals that we use out of convenience. n5-hdf5 is an example of such an alternative backend: it implements the N5 API on top of HDF5 at a best-effort level. For zarr, this best effort will be better because the concepts are more similar.

I reached out to @alimanfoo about the similarities and differences and we are both interested in converging, yet limited in how much time we can spend. The new "/" block separator for zarr is one of the outcomes of this; this project "z5" is another ;). Z5 currently only supports the file system spec of both formats, not the AWS and Google cloud implementations.

Shortening edge chunks without a header means that you have to do some metadata math before loading, and I wanted to keep it simple. The header also allows overlapping blocks, which starts to become handy in applications where we use overlapping blocks. varlength is for datatypes that do not store scalar pixel values but e.g. multisets. I am undecided what the best, most flexible, yet simple and efficient spec would be, and we will see evolution.

@alimanfoo
Member

This has been folded into various other discussions so I'm going to close it, but I'll also transfer the issue to the zarr-specs repo so it's nearer to other spec-related discussions.

@alimanfoo alimanfoo transferred this issue from zarr-developers/zarr-python Jul 3, 2019