JVM Zarr implementation? #15

Open
ryan-williams opened this issue Aug 1, 2018 · 27 comments

Comments

@ryan-williams
Member

There isn't one, is there?

I've started making one, will post updates here.

@martindurant
Member

In #285 there was mention of N5, which has Java and Rust implementations, maybe more. N5 is similar in concept to Zarr, apparently.

@jakirkham
Member

jakirkham commented Aug 1, 2018

N5 is basically this. The specs differ a bit in minor ways. Convergence would be good to have. Some relevant discussion is in https://github.com/zarr-developers/zarr/issues/231.

ref: https://github.com/saalfeldlab/n5
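
One of the minor differences, as an illustration: the two formats name chunk files differently. The sketch below assumes the Zarr v2 default of joining chunk grid indices with `.`, and the N5 convention of one directory level per axis with the fastest-varying axis first. The class and method names are hypothetical and do not come from either library.

```java
import java.util.StringJoiner;

final class ChunkKeys {
    // Zarr v2 (default separator): grid indices joined with '.', in the array's own axis order.
    static String zarrV2Key(long[] gridIndex) {
        StringJoiner joiner = new StringJoiner(".");
        for (long index : gridIndex) {
            joiner.add(Long.toString(index));
        }
        return joiner.toString();
    }

    // N5 (assumed convention): one directory per axis, fastest-varying axis first,
    // so a row-major (C-order) grid index is written in reverse.
    static String n5Path(long[] rowMajorGridIndex) {
        StringJoiner joiner = new StringJoiner("/");
        for (int d = rowMajorGridIndex.length - 1; d >= 0; d--) {
            joiner.add(Long.toString(rowMajorGridIndex[d]));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        long[] gridIndex = {2, 4, 1};
        System.out.println(zarrV2Key(gridIndex));
        System.out.println(n5Path(gridIndex));
    }
}
```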

@alimanfoo
Member

JVM implementation of Zarr would be very cool, particularly if it had the same flexibility as the Python implementation to plug in different storage back-ends including cloud object stores.
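
A minimal sketch of what such a pluggable back-end could look like on the JVM, assuming a store is nothing more than a mapping from string keys to bytes (as in the Python implementation). `KeyValueStore`, `InMemoryStore` and `StoreSketch` are hypothetical names for illustration, not part of any existing library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface: a store maps string keys to raw bytes.
interface KeyValueStore {
    Optional<byte[]> get(String key);

    void put(String key, byte[] value);
}

// In-memory back-end; a filesystem or object-store back-end
// would implement the same two methods.
final class InMemoryStore implements KeyValueStore {
    private final Map<String, byte[]> entries = new ConcurrentHashMap<>();

    @Override
    public Optional<byte[]> get(String key) {
        return Optional.ofNullable(entries.get(key));
    }

    @Override
    public void put(String key, byte[] value) {
        entries.put(key, value.clone());
    }
}

final class StoreSketch {
    // Reads array metadata through the interface only, without knowing the back-end.
    static String readArrayMetadata(KeyValueStore store, String arrayPath) {
        return store.get(arrayPath + "/.zarray")
                .map(bytes -> new String(bytes, StandardCharsets.UTF_8))
                .orElseThrow(() -> new IllegalArgumentException("no array at " + arrayPath));
    }

    public static void main(String[] args) {
        KeyValueStore store = new InMemoryStore();
        store.put("temperature/.zarray", "{\"zarr_format\": 2}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readArrayMetadata(store, "temperature"));
    }
}
```

Because the array-level code depends only on the interface, swapping in a cloud object store would not require changes to it.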

@ryan-williams
Member Author

Thanks for all the pointers! I've looked a bit at n5 and z5; a couple questions:

  • what are the tradeoffs of wrapping z5's C++ implementation for JVM use?
    • we would call z5 via JNI, right?
    • JNI seems to have a reputation for being hard/brittle; is that warranted? I'm not experienced with it.
  • can z5 read/write directly to cloud stores?
    • doing this in python (via gcsfs/s3fs) and Java (via NIO adapters) seems to work well
    • otoh I've been stymied by python-wrapped C libraries that don't seem to allow this, e.g. h5py
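
To illustrate the NIO point: a reader written against `java.nio.file.Path` works with whichever `FileSystemProvider` backs the path. The sketch below uses only the JDK and a local temporary directory; pointing it at a cloud bucket would need a third-party NIO adapter on the classpath, which is an assumption here. `NioChunkReader` and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class NioChunkReader {
    // The root may belong to any installed NIO file system (local disk in this sketch).
    private final Path root;

    NioChunkReader(Path root) {
        this.root = root;
    }

    // Returns the array metadata document as text.
    String readMetadata() {
        try {
            return new String(Files.readAllBytes(root.resolve(".zarray")), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the stored (still compressed) bytes of one chunk, e.g. key "0.0".
    byte[] readRawChunk(String chunkKey) {
        try {
            return Files.readAllBytes(root.resolve(chunkKey));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a tiny placeholder array to a temporary directory for the demo.
    static Path createSampleArray() {
        try {
            Path dir = Files.createTempDirectory("zarr-sketch");
            Files.write(dir.resolve(".zarray"), "{\"zarr_format\": 2}".getBytes(StandardCharsets.UTF_8));
            Files.write(dir.resolve("0.0"), new byte[] {1, 2, 3});
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        NioChunkReader reader = new NioChunkReader(createSampleArray());
        System.out.println(reader.readMetadata());
        System.out.println(reader.readRawChunk("0.0").length);
    }
}
```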

@martindurant
Member

I'm not aware of a way to read HDF5 from cloud stores in python

gcsfs's FUSE module does allow this, and there are other FUSE solutions out there too. The implementation is not at all performant compared to zarr. In addition, https://github.com/ContinuumIO/intake-xarray will shortly allow streaming of any xarray dataset, including HDF, from a server; again, there are other solutions that do something similar.

@clbarnes

clbarnes commented Aug 1, 2018

N5, which has Java and Rust implementations, maybe more

z5 provides C++ and Python implementations of both zarr and N5.

can z5 read/write directly to cloud stores

No, it's purely targeted at the file system format for both zarr and n5 as far as I know.

@ryan-williams there is already a bit of an ecosystem (albeit one tightly constrained to one institute...) rapidly evolving around the java N5 implementation, including a high-performance 3D data viewer, some image registration tools, and a volumetric image annotation suite. The java N5 already supports a number of backends, including the N5 filesystem format, HDF5, google cloud, and AWS (take a look here). It might make sense for a JVM implementation of the zarr file system format to take the form of an N5 backend (initially, at least) - that would potentially give all of those other tools access to zarr datasets for free, as well as saving you writing some of the higher-level boilerplate. That's if you're happy with the API, of course.

My feeling is that zarr has more momentum behind it and will have more impact in the future. Convergence would be great, but if the N5 tool ecosystem could get access to zarr file system arrays for free, that could also solve the problem.

@jakirkham
Member

I'd be really happy if Zarr and N5 converged on the same spec. It would make it much easier for people in this problem domain to collaborate more effectively on many other common challenges.

@ryan-williams
Member Author

ryan-williams commented Sep 28, 2018

checking in here after a long gap!

I'm far along with a Zarr implementation in Scala, which will address the "JVM implementation" request here.

Some notes:

Looking forward to sharing more info on this shortly!

@alimanfoo
Member

alimanfoo commented Sep 28, 2018 via email

@lesserwhirls

Excellent! I would love to build off of this work on the netCDF-Java side to provide an IOSP to Zarr (read Zarr into the Common Data Model). At that point, we could enable the THREDDS Data Server to serve data stored in Zarr :-)

Would you be open to that idea, and does the license permit such usage?

@ryan-williams
Member Author

@lesserwhirls yea, it will be Apache-2.0 licensed, happy to have it feed into netCDF things!

@lesserwhirls

It might be helpful/less painful for everyone if we get the changes to netCDF-Java made upstream. @tomwhite - would you be willing to contribute those changes?

@tomwhite

tomwhite commented Oct 3, 2018

@lesserwhirls, yes I'd be happy to. I'll open an issue/PR to discuss.

@aluhamaa

Hi @ryan-williams how is it going?

I'm far along with a Zarr implementation in Scala, which will address the "JVM implementation" request here.

Some notes:

  • it's in a branch that I am aggressively cleaning up atm; I'll send a link by Monday, but wanted to just mention now since other relevant discussions are ongoing.

@ryan-williams
Member Author

hello! I've been side-tracked, but what I have is here: lasersonlab/ndarray.scala. It's pretty "alpha" still, and the issues reasonably capture the things I'm focused on next.

I'll be checking back in on this in the coming weeks, and will give some more updates here.

@alimanfoo transferred this issue from zarr-developers/zarr-python on Jul 3, 2019

@joshmoore
Member

@SabineEmbacher

SabineEmbacher commented Mar 31, 2020 via email

@SabineEmbacher

If you need array objects that behave almost like NumPy arrays, you can also wrap the data using the ND4J INDArray from deeplearning4j.org. You can find examples in the data writing and reading examples.

https://jzarr.readthedocs.io/en/latest/tutorial.html#writing-and-reading-data

Or directly in the code example
https://github.com/bcdev/jzarr/blob/master/docs/examples/java/Tutorial_rtd.java#L41

@SabineEmbacher

Can any of you tell me how to publish the JZarr Java library to the Maven Central repository? I've never done this before.
Do any of you have the time to guide or support me?

Best Regards
Sabine

@joshmoore
Member

Hi @SabineEmbacher. I don't remember what HOWTO we followed originally for our jars (cc: @sbesson) but https://stackoverflow.com/questions/28846802/how-to-manually-publish-jar-to-maven-central looks reasonable enough. The biggest hurdles I remember are (1) proving that you own your groupId (*.bc.com) and (2) making sure that all of your dependencies are accessible from maven central. I've created bcdev/jzarr#4 since this may become protracted, but certainly happy to help. ~Josh

@sbesson

sbesson commented Apr 7, 2020

Following-up on #15 (comment), the process used by OME for releasing some of its Java components to Sonatype is documented here with the relevant links to OSSRH in case it's useful. If possible, big 👍 for having jzarr available from Maven Central.

@SabineEmbacher

SabineEmbacher commented Apr 7, 2020

alimanfoo commented on 1 Aug 2018

JVM implementation of Zarr would be very cool, particularly if it had the same flexibility as the Python implementation to plug in different storage back-ends including cloud object stores.

Did you see the example of how to read and write to Amazon AWS S3 cloud storage using JZarr?
See:
https://jzarr.readthedocs.io/en/latest/amazonS3.html
and code example
https://github.com/bcdev/jzarr/blob/master/docs/examples/java/S3Array_nio.java

@axtimwalde

Completely missed this thread but wanted to mention that https://github.com/saalfeldlab/n5-zarr has implemented https://zarr.readthedocs.io/en/stable/spec/v2.html as an N5 backend since September 2019. This way it is available for array processing with ImgLib2 (https://github.com/saalfeldlab/n5-imglib2), which has no size limits and built-in memory caching, and is also the native data library for BigDataViewer and a bunch of processing tools that we use and build. n5-zarr includes blosc compression and locking and is included in the standard distribution of https://fiji.sc/. With the N5 API, talking to Zarr, N5, and HDF5 is all the same.

There is currently no official cloud backend (other than through FS wrappers) for N5-Zarr because we haven't yet separated the interfaces for store and translation layers, i.e. writing a backend for HDF5 or Zarr is entangled with writing a backend for another store (like the AWS and GoogleCloud stores for N5). I remember that there was a fork that copied the n5-aws-s3 logic into n5-zarr as a temporary solution. @joshmoore, wasn't that you who did this?

@bogovicj

bogovicj commented Feb 5, 2021

I remember that there was a fork that copied the n5-aws-s3 logic into n5-zarr as a temporary solution. @joshmoore, wasn't that you who did this?

Yup, see saalfeldlab/n5-aws-s3#10 and saalfeldlab/n5-zarr#5

@joshmoore
Member

Yup. It then got copied into the bdv/mobie code base for @tischi's I2K work. Having a way to unblock all of that would be great. (Note: I only copied-n-pasted the reader side of things. Writing still needs work as far as I know.)

@joshmoore
Member

As with the rust focus during the Feb. 10th meeting, there may be a Java-leaning to the upcoming call this Wednesday if anyone is interested in joining to chat.

cc: @SabineEmbacher @axtimwalde @DennisHeimbigner @WardF

@axtimwalde

Thanks @joshmoore! I'll be there. Looking forward to seeing you all.
