
Contributing Golang and Rust implementations #5

Open
rw opened this issue May 8, 2019 · 25 comments

@rw

rw commented May 8, 2019

Hi there!

Via @ryan-williams, I'm posting here because I'm interested in making two ports of Zarr: to native Golang and to native Rust. (To me, "native" means "no FFI to an existing Zarr library".)

For background:

I wrote and maintain the official Golang, Python, and Rust ports of Google's FlatBuffers serialization library. I received a Google Open Source Contributors Award for my volunteer efforts on FlatBuffers.

Here are some relevant links:

All of that is to say that I have a background in writing high-performance serialization code in open-source projects.

So, is there a need from the community for Golang and/or Rust ports? I'm happy to spearhead/lead those initiatives, if so.

I'm interested in getting involved with Zarr because I like both your technical solutions and the community-friendly group dynamics that I've seen.

Best,
Robert

@martindurant
Member

I am a total novice with Rust, but I have been looking for a project to teach myself on. Since I know something about the Python implementation (although my main experience in this area is with pythonic file-systems), this might be the perfect entry point.

As for need, I simply don't know; you would have to look at other science or big-array workflows in the Rust world. While some have been speaking of C/C++ implementations of the zarr spec, there is certainly an argument for trying to get the same performance, but with more modern code, via Rust.

@rw
Author

rw commented May 8, 2019

I am total novice with Rust, but have been looking for a project to teach myself on.

@martindurant Great! I'm thinking I can use my Rust serialization experience to start the project, then would you be interested in helping to maintain it?

@constantinpape

There is a rust implementation for n5 already:
https://github.com/aschampion/rust-n5

It should be easy to port this to zarr, given how close the specs are.

@jakirkham
Member

Thanks @constantinpape. Let’s see if @aschampion has thoughts. 🙂

@martindurant
Member

I don't want to promise too much, given how much I (don't) know...

There is a rust implementation for n5 already

There is a chance of merging the libraries or spec eventually, right? (z5, whatever)

@rw
Author

rw commented May 8, 2019

There is a rust implementation for n5 already:
https://github.com/aschampion/rust-n5

It should be easy to port this to zarr, given how close the specs are.

@constantinpape I'm not yet an expert in Zarr or N5, but after reading that code base, I think there are a lot of opportunities to improve its usage of the heap. The way it is written leads me to believe that it is not designed to minimize heap allocations.
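To make the allocation concern concrete, here is a minimal sketch (not taken from rust-n5; the function names and the toy "decode" step are purely illustrative) contrasting a per-chunk allocation in a tight loop with a single scratch buffer reused across iterations:

```rust
// Illustrative only: a stand-in "decode" widens bytes to u64 and sums them.

fn sum_chunks_allocating(chunks: &[&[u8]]) -> u64 {
    let mut total = 0u64;
    for chunk in chunks {
        // Allocates a fresh Vec on every iteration of the tight loop.
        let decoded: Vec<u64> = chunk.iter().map(|&b| b as u64).collect();
        total += decoded.iter().sum::<u64>();
    }
    total
}

fn sum_chunks_reusing(chunks: &[&[u8]]) -> u64 {
    let mut total = 0u64;
    let mut scratch: Vec<u64> = Vec::new(); // one buffer, reused
    for chunk in chunks {
        scratch.clear(); // keeps capacity; no realloc once grown
        scratch.extend(chunk.iter().map(|&b| b as u64));
        total += scratch.iter().sum::<u64>();
    }
    total
}

fn main() {
    let a = [1u8, 2, 3];
    let b = [4u8, 5];
    let chunks: [&[u8]; 2] = [&a, &b];
    assert_eq!(sum_chunks_allocating(&chunks), 15);
    assert_eq!(sum_chunks_reusing(&chunks), 15);
    println!("both variants agree: {}", sum_chunks_reusing(&chunks));
}
```

Both produce the same result; the second does one heap allocation for the whole loop instead of one per chunk, which is the kind of restructuring being suggested.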

@constantinpape

@martindurant

There is a chance of merging the libraries or spec eventually, right? (z5, whatever)

Yes, hopefully zarr spec v3 will merge zarr and n5. (I couldn't make the call today so I am not quite up-to-date on the progress regarding this).

@rw

It's not clear to me that that project is written with performance in mind. Reading the code, I see multiple heap allocations throughout, in what I understand to be tight loops.

That's probably something @aschampion could comment more on.

@constantinpape

cc @j6k4m8 who might be interested in a Go implementation

@alimanfoo
Member

Hi @rw, just to say thanks for proposing this, it would be very cool to have implementations in these languages.

The current version of the underlying specification is version 2. We are just getting started on work towards a version 3.0 of the core protocol, which will hopefully provide a common implementation target for both the zarr and n5 communities. Current vision for the 3.0 core protocol is that it will be quite minimal and so may be a slightly easier implementation target than the current version 2 spec.

In any case, you'd have a choice about whether to target the v2 or v3 spec. It would be nice if you could target the v3 spec while it's in development, as that would give us some early feedback on implementation experiences and pain points. However it may take some time to fully flesh out the spec, and there may be some to-and-fro on some decision points, so it would be a moving target to a certain extent. So if you'd rather target v2 initially (or a subset of v2) to get some interoperability with existing implementations and data then I'd perfectly understand.

@aschampion

@constantinpape I'm not an expert yet in Zarr nor N5, but after reading that code base, I think there are a lot of opportunities to improve its usage of the heap. The way it is written leads me to believe that it is not designed to minimize heap allocations.

There are opportunities for optimization. As it is, it's at least a bit faster than the Java reference implementation, which it started as a rather direct translation of (see the java-rust-n5 benchmarks).

I had a branch that reduced some allocations with SmallVec, etc., but since in practice my performance is usually IO-bound, the difference was marginal, and having fewer dependencies was critical to getting wasm compilation working at the time. Because of improvements in the Rust toolchain and ecosystem in the last few years, that's less of a concern now. With modern Rust one could easily get rid of the allocations for compression dispatch, etc.
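One way to "get rid of the allocations for compression dispatch" is enum dispatch instead of boxed trait objects. The sketch below is hypothetical (the `Codec`/`Compression` names and toy codecs are not the rust-n5 API); it shows the general pattern: the codec lives inline in an enum, so selecting a codec costs no heap allocation and no vtable indirection:

```rust
// Toy codecs for illustration; real ones would be raw/gzip/blosc etc.
trait Codec {
    fn decode(&self, data: &[u8]) -> Vec<u8>;
}

struct Raw;
struct Xor(u8); // "compression" that XORs every byte with a key

impl Codec for Raw {
    fn decode(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec()
    }
}

impl Codec for Xor {
    fn decode(&self, data: &[u8]) -> Vec<u8> {
        data.iter().map(|b| b ^ self.0).collect()
    }
}

// Enum dispatch: no Box<dyn Codec>, the variant is matched statically.
enum Compression {
    Raw(Raw),
    Xor(Xor),
}

impl Compression {
    fn decode(&self, data: &[u8]) -> Vec<u8> {
        match self {
            Compression::Raw(c) => c.decode(data),
            Compression::Xor(c) => c.decode(data),
        }
    }
}

fn main() {
    let codec = Compression::Xor(Xor(0xFF));
    assert_eq!(codec.decode(&[0x00, 0xFF]), vec![0xFF, 0x00]);
    println!("decoded ok");
}
```

Crates like `enum_dispatch` automate the `match`, but the hand-written version above is the whole trick.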

The kludge for composing block reads/writes from ndarrays is a mess of needless allocations, but that's effectively an outer loop around block IO.

I don't have an immediate need or time to develop zarr support myself, but I would happily accept PRs, including major restructuring, so long as the downstream wasm and conda-less pip-installable Python packages remain possible. But I also understand if you'd rather start from scratch.

@rw
Author

rw commented May 11, 2019

@alimanfoo I'm happy to start with the V3 spec, and give feedback as I go. Would you be able to fill in the TODOs in the v3 spec with your best guesses, so that I have a concrete description upon which to work? Or, perhaps, provide links to the relevant parts of the v2 spec to fill in the gaps?

@alimanfoo
Member

alimanfoo commented May 13, 2019 via email

@rw
Author

rw commented May 17, 2019

@alimanfoo Hmm, should I just start with v2? I can still offer feedback on the v3 spec.

@alimanfoo
Member

@alimanfoo Hmm, should I just start with v2? I can still offer feedback on the v3 spec.

That would be fine I'm sure, lots of value in having something that would interoperate with current Zarr implementations.

@alimanfoo alimanfoo transferred this issue from zarr-developers/zarr-python Jul 3, 2019
@gauteh

gauteh commented Dec 11, 2020

Hi, I would be interested in a native Rust version of zarr. I've created a native HDF5 reader and streamer for Rust, hidefix, in order to be able to read HDF5 files concurrently for a Rust OPeNDAP server, dars. The reader performs quite well, especially for concurrent reads. As far as I can see, it works in similar ways to zarr, by creating an index of the HDF5 file. For it to work with the DAP server I implemented zero-copy serialization/deserialization of the index (I cannot keep indexes of all datasets in memory). I originally used flatbuffers, but gave up on that in favor of bincode because of performance and alignment issues, though I think it could be made to work.

It would be very interesting to support zarr in this server, but concurrent reads are a must; otherwise performance will be very poor. HDF5 is starting async work upstream, but it will be a very big job to make that fast and correct. It seems this should be possible to do more safely with zarr + rust.

@joshmoore
Member

Thanks for re-raising this conversation, @gauteh. I recently had a conversation with @clbarnes in a conference slack (i.e. I'm considering it largely public):

We have a rust implementation of N5, which also compiles to webassembly and has a python wrapper designed to be a drop-in h5py replacement. We've got a branch somewhere to make it more generic, with the intention of adding backends for zarr3 and possibly zarr2.

see: https://github.com/aschampion/rust-n5

cc: @aschampion

@aschampion

To be specific, we made a prototype implementation of zarr v3 in rust about a year ago, and now that the spec seems to be stabilizing and that we have the time, just this week picked it up again. When the crate is available I'll link it here.

@gauteh

gauteh commented Dec 11, 2020

To be specific, we made a prototype implementation of zarr v3 in rust about a year ago, and now that the spec seems to be stabilizing and that we have the time, just this week picked it up again. When the crate is available I'll link it here.

Great! Will this implementation support concurrent/parallel reads? Is the development version available somewhere already?

@Carreau

Carreau commented Dec 11, 2020

If it's available I would be happy to test as well while working on the Python impl.

@aschampion

Great! Will this implementation support concurrent/parallel reads? Is the development version available somewhere already?

The implementation is thread safe like the current rust-n5 crate, but based on how you're asking, I should clarify: our Zarr implementation exposes a minimal API that attempts to faithfully match the Zarr spec for doing low-level chunk-based operations. This means the expectation is that one builds concurrent/parallel access in libraries on top of that, rather than it being done implicitly. For example pyn5, built on top of rust-n5, threads chunk-wise access if requested. So in that sense the Zarr implementation itself doesn't do any parallel reads, because it sits at a layer below parallelism.

The filesystem implementation does N5-style advisory file locking, although because of Zarr's KV-store approach it can't prevent some data races that N5 can. File locking is only an extra safety measure anyway, not a concurrency coordinator.
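To illustrate the layering being described, here is a minimal sketch of how a caller could build parallel reads on top of a thread-safe per-chunk API. The `Array`/`read_chunk` names are hypothetical stand-ins, not the actual crate interface; the point is only that parallelism lives in the caller, using scoped threads over chunk indices:

```rust
use std::thread;

// Hypothetical stand-in for a thread-safe, chunk-oriented reader.
// A real implementation would decode chunks from a store; this one
// just holds them in memory.
struct Array {
    chunks: Vec<Vec<u8>>,
}

impl Array {
    fn read_chunk(&self, idx: usize) -> Vec<u8> {
        self.chunks[idx].clone()
    }
}

// Parallelism built *on top of* the chunk API: one scoped thread per
// chunk (a real library would use a pool or rayon instead).
fn read_all_parallel(arr: &Array) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..arr.chunks.len())
            .map(|i| s.spawn(move || arr.read_chunk(i)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let arr = Array {
        chunks: vec![vec![1, 2], vec![3, 4]],
    };
    let out = read_all_parallel(&arr);
    assert_eq!(out, vec![vec![1, 2], vec![3, 4]]);
    println!("read {} chunks in parallel", out.len());
}
```

Because the chunk-level API is thread safe and takes `&self`, scoped threads can borrow the array freely; nothing in the low-level layer needs to know about the parallelism.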

We are also making an interface crate on top of the Zarr crate that provides more h5py-like, ergonomic, rust-idiomatic use of the zarr rust backend with ndarrays, along with the existing rust-n5 backend and eventually a few others (likely TIFF stacks, KLB, and HDF). That crate would provide wrappers for doing implicit or easy parallel operations. It will be released around whenever I upload a version 0.1 of the Zarr crate to crates.io (i.e., after the holidays).

Since there's interest in poking at the Zarr impl and it can now read hierarchies output from zarrita, I made the repo public here. Caveats:

  • This is pre-semver, so breaking changes are being made constantly without notice. Not may, will.
  • There's a lot of cruft remaining from the N5 rewrite that no longer makes sense in Zarr but is yet to be removed (e.g., having to wrap chunks in the *DataChunk container types).
  • Docs have yet to be done.
  • The path/node/key name handling is a heap-y mess left over from working out what the correct behavior should be, and will be rewritten.
  • Raw data types (r*) are not well supported. Use of extensions or extension types should at least return errors, although not yet in a consistent way across different extension points (e.g., grids vs. types).

@joshmoore
Member

Community call suggestion

The subject of Rust and particularly rust-n5 came up during the 2021-01-27 community call. @aschampion @pattonw @clbarnes, would any of you be free/interested in joining the next call, a week from today, Wed. 10.02 at 2000 CET? If that's not an ideal slot, would you care to suggest another?

cc @WardF @DennisHeimbigner in case there are any points of discussion re: libnczarr.

@aschampion

Sure, I can join the call next week.

@gauteh

gauteh commented Feb 5, 2021

Great! Will this implementation support concurrent/parallel reads? Is the development version available somewhere already?

The implementation is thread safe like the current rust-n5 crate, but based on how you're asking, I should clarify: our Zarr implementation exposes a minimal API that attempts to faithfully match the Zarr spec for doing low-level chunk-based operations. This means the expectation is that one builds concurrent/parallel access in libraries on top of that, rather than it being done implicitly. For example pyn5, built on top of rust-n5, threads chunk-wise access if requested.

That sounds great, this would certainly keep it general enough that we can build something async on top.

@aschampion

That sounds great, this would certainly keep it general enough that we can build something async on top.

Definitely -- n5-wasm already does this, for example, to build an N5 implementation on top of HTTP Promises. By now it's an out-of-date futures 0.1-style trait, but eventually the Zarr crate and our cross-backend interface crate will have a modern async keyword/futures trait for async block access now that those are stabilizing. I'm just waiting for a few ergonomic issues to get sorted out with futures before committing to an interface.
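As a rough sketch of what async block access could look like once `async fn` is usable here, consider the following. Everything is hypothetical (the `Store` type, the key naming, and the hand-rolled executor are illustrative, not the crate's eventual interface); a real server would drive such futures with an executor like tokio:

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical store exposing async per-chunk reads over a key-value
// mapping. Over HTTP, read_chunk would await a request; here the
// future resolves immediately.
struct Store {
    chunks: HashMap<String, Vec<u8>>,
}

impl Store {
    async fn read_chunk(&self, key: &str) -> Option<Vec<u8>> {
        self.chunks.get(key).cloned()
    }
}

// Minimal executor so the sketch runs without an async runtime crate.
// It busy-polls, which is fine only for already-ready futures.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    let mut chunks = HashMap::new();
    chunks.insert("0.0".to_string(), vec![1u8, 2, 3]);
    let store = Store { chunks };
    assert_eq!(block_on(store.read_chunk("0.0")), Some(vec![1u8, 2, 3]));
    assert_eq!(block_on(store.read_chunk("1.0")), None);
    println!("async chunk read ok");
}
```

The interesting design point is the same one made above for threads: the async surface stays at the chunk level, and higher layers decide how many reads to have in flight.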

@jakirkham
Member

Ran across a Go implementation recently. Opened issue ( #50 ) with more details.
