
Projects using Zarr #19

Open
alimanfoo opened this issue Jan 3, 2018 · 29 comments

@alimanfoo (Member) commented Jan 3, 2018

If you are using Zarr for whatever purpose, we would love to hear about it. If you can, please leave a couple of sentences in a comment below describing who you are and what you are using Zarr for. This information helps us to get a sense of the different use cases that are important for Zarr users, and helps core developers to justify the time they spend on the project.

@alimanfoo (Member, Author) commented Jan 3, 2018

Zarr is used by MalariaGEN within several large-scale collaborative scientific projects related to the genetic study of malaria. For example, Zarr is used by the Anopheles gambiae 1000 Genomes project to store and analyse data on genotype calls derived from whole genome sequencing of malaria mosquitoes, which led to this publication.

@jhamman (Member) commented Jan 3, 2018

Xarray has recently (pydata/xarray#1528) developed a Zarr backend for reading and writing datasets to local and remote stores. In the Xarray use case, we hope to use Zarr to interface with Dask to support parallel reads/writes from cloud-based datastores. We hope to find that this approach outperforms traditional file-based workflows that use HDF5/NetCDF (pangeo-data/pangeo#48). Xarray itself is used in a wide variety of fields, from climate science to finance.

@rabernat gets 99% of the credit for the zarr implementation in xarray.

@alimanfoo (Member, Author) commented Jan 4, 2018

Thanks @jhamman.

@MattKleinsmith commented Jan 22, 2018

I'm participating in a Kaggle competition where the goal is to classify images by the camera models that took them, which is useful for forensics; for example, determining whether an image is a splice of multiple images.

I'm using bcolz right now (carrays only), but when I googled whether bcolz can handle simultaneous reads from two processes, one read per process, I came across Zarr. In a Dask issue in 2016, @alimanfoo said that bcolz was not thread-safe. If this is also true of processes, then I'll likely switch to Zarr.

I'm a beginner when it comes to concurrency and parallelism (which is why I used the term "simultaneous" above). Could someone tell me why thread-safety and process-safety come into play even when only reads will occur? Does it have something to do with decompression? Even the name of a concept I could google would be helpful. Thank you.

@alimanfoo (Member, Author) commented Jan 22, 2018

@NickMortimer commented Jun 28, 2018

Hi, I'm currently looking at processing Argo float data in the cloud. Argo float basic data comes as single NetCDF files. You could read that data into a big database, but I'm trying to come up with a database-less solution so that the current file processing systems see no change.

@jhamman I like Zarr and find it much easier to understand than HDF5, and more flexible. Where I am struggling a little is with xarray: I tried just sending each cast to Zarr with xarray's to_zarr, but this resulted in thousands of small files. Because of the fixed cluster size, this resulted in poor disk utilisation (on Windows), since each attribute ended up in its own directory with a single float in a 1 KB file.

Instead I've been storing an xarray object as a pickle. I know this could cause problems (security, etc.), but it is working very nicely. I create large arrays of profiles (pickled objects) and it works great as a cache.

@jhamman (Member) commented Jun 28, 2018

@NickMortimer - if there are some xarray/zarr specific issues that would help your use case, I think we'd be keen to engage on the xarray issue tracker.

@NickMortimer commented Jun 28, 2018

@jhamman OK, I'll go there. I think it's mainly due to the file that I'm trying to process: there are lots of single-value attributes that get stored in their own files. I will talk more over at xarray.

@alimanfoo transferred this issue from zarr-developers/zarr-python on Jul 3, 2019
@alimanfoo pinned this issue on Jul 3, 2019
@alimanfoo unpinned this issue on Jul 3, 2019
@jrbourbeau (Member) commented Aug 20, 2019

Dask array has to_zarr/from_zarr functionality for interfacing with Zarr.

> helps core developers to justify the time they spend on the project

This is a good point. It could be useful to have a webpage to point to that lists projects using Zarr. For example, xarray has an "Xarray related projects" page in its Sphinx docs and Dask has a "Powered by Dask" section on https://dask.org/. Perhaps Zarr could have something similar.

@QueensGambit commented Dec 19, 2019

Hello @alimanfoo,
The Zarr library is used to compress the training data for a neural network named CrazyAra that learns to play chess variants.

In 2019, Zarr was also used to compress the data generated from self-play games in a reinforcement learning setup. Here, Zarr was combined with the z5 library to export data in C++ and subsequently load it in Python for training.

I would like to cite this awesome library in my master's thesis. Is there a preferred way of doing so? Otherwise I would rely on the default way of citing GitHub repositories:

@misc{miles2019zarr,
    title = {zarr-developers/zarr-python},
    author = {Miles, Alistair},
    copyright = {MIT},
    url = {https://github.com/zarr-developers/zarr-python},
    abstract = {An implementation of chunked, compressed, N-dimensional arrays for Python.},
    urldate = {2019-12-19},
    publisher = {Zarr Developers},
    month = dec,
    year = {2019},
    note = {original-date: 2015-12-15T14:49:40Z}
}

@tiagoantao commented Jan 4, 2020

I am writing a book for Manning, tentatively called "High Performance Python for Data Analytics", and Zarr is a big part of it, especially because a good chunk of the book is about persistence efficiency.

@amcnicho commented Jan 30, 2020

Working on a project that is currently exploring the adoption of a stack built around Dask and Zarr (interfacing via Xarray). This is envisioned as a replacement for a legacy persistent data model underlying the flagging, calibration, synthesis, visualization, and analysis of astronomical data, especially from radio interferometers (e.g., ALMA).

@alimanfoo, does the Zarr project have a preferred method of citation in academic publications?

@alimanfoo (Member, Author) commented Jan 30, 2020

Just to say thanks everyone for adding projects here, very much appreciated.

Re citation, we don't have a preferred method; your suggestion @QueensGambit sounds good in the interim. Writing a short paper about Zarr is very much on the wish list.

@benbovy commented Mar 23, 2020

I've recently added support for saving simulation data as Zarr datasets in xarray-simlab.

https://xarray-simlab.readthedocs.io/en/latest/io_storage.html#using-zarr

Everything went smoothly when working on this. Thanks @alimanfoo and all Zarr developers! (Also thanks @rabernat, @jhamman and others for Zarr integration with Xarray).

I'm working now on running batches of simulations in parallel using Dask and saving all that data in the same Zarr datasets (along a batch dimension), but I'm struggling with different things, so I might ask for some help.

@joshmoore (Member) commented Apr 6, 2020

https://google.github.io/tensorstore/tensorstore/driver/index.html#chunked-storage-drivers

A TensorStore is an asynchronous view of a multi-dimensional array. Every TensorStore is backed by a driver, which connects the high-level TensorStore interface to an underlying data storage mechanism. Using an appropriate driver, a TensorStore may be used to access:

  • chunked storage formats like zarr, N5, and Neuroglancer Precomputed, backed by a supported key-value storage system, such as Google Cloud Storage or local and network filesystems

@olly-writes-code commented Apr 30, 2020

I'm running parallel simulations using Dask distributed. I'm using Zarr to create and persist input data to disk for fast reading by each simulation, as the data is too large to pass between the parallel processes. I was using Feather before, but I needed more complex data structures. Switching to Zarr improved simulation speed by roughly 30%. Thanks for the work on this ❤️

@gzuidhof commented Jun 27, 2020

Lyft Level 5 just released a dataset in Zarr format; it consists of over 1000 hours of driving data and observations. Along with it, they have open-sourced a codebase relating to prediction and planning tasks in autonomous vehicles. Zarr is a great fit for ML problems!

Links:

@jakirkham (Member) commented Jun 28, 2020

Do they happen to have a tweet? Might be worth retweeting.

@gzuidhof commented Jun 29, 2020

@alimanfoo (Member, Author) commented Jun 29, 2020

Thanks @gzuidhof, tweeted from zarr_dev here: https://twitter.com/zarr_dev/status/1277515272270815232

@jakirkham (Member) commented Jun 29, 2020

Thanks both 🙂

@ericgyounkin commented Nov 8, 2020

Hi, I am building a distributed multibeam sonar processing software suite using dask/xarray/zarr. Big fan of zarr! The format is so easy to work with. Really appreciate the work of this community.

Kluster

@PaulJWright commented Apr 16, 2021

We are going to be hosting the Solar Dynamics Observatory (SDO) Machine Learning dataset (https://iopscience.iop.org/article/10.3847/1538-4365/ab1005) on a public-facing Google Cloud bucket, stored in Zarr format! Appreciate the work here, this is fantastic!

@parashardhapola commented May 18, 2021

Hi @alimanfoo and Zarr developers,

First of all, thank you for developing, maintaining and consistently improving this incredible software.

We have created a memory-efficient single-cell genomics data processing toolkit, called Scarf, using Zarr.
There is a preprint describing Scarf here.
A tweet introducing Scarf can be found here.

We chose Zarr over HDF5 because:

  • support for highly performant compression libraries like LZ4
  • parallel read support through Dask
  • random access to data stored on AWS S3
  • rectangular chunking that allows data to be loaded in batches of both rows and columns

@RichardScottOZ commented Jun 22, 2021

Using it to enable country-scale machine learning geology modelling from Sentinel data.

@jhamman (Member) commented Sep 27, 2021

We're using Zarr for a new visualization application, rendering chunks of data directly from a cloud object store using WebGL and Mapbox. A full writeup of our approach is here: https://carbonplan.org/blog/maps-library-release

@thomcom commented Oct 28, 2021

Hi there! I wanted to let you know that I'm working on a Python wrapper for NVIDIA's nvcomp that will implement the numcodecs API and hopefully work seamlessly with Zarr. We're investigating using Zarr with cuDF and Dask.

@aladinor commented Dec 3, 2021

Hi there! I am using Zarr to convert several HDF5 files coming from NASA P3 aircraft (planes that fly inside clouds, including hurricanes). This conversion allows us to manage and access large datasets for machine learning applied to cloud microphysics.

@magnunor commented Jun 8, 2022

Hey, we've added support for Zarr in the HyperSpy library, which is a library for processing electron microscopy data. In recent years we've been getting faster and faster detectors, so our datasets have drastically increased in size: from ~100 MB to 100+ GB. Previously we used HDF5 for this, but using Zarr has greatly improved performance when working with these large datasets.

Kudos to @CSSFrancis for adding this to HyperSpy: hyperspy/hyperspy#2825

https://hyperspy.org/hyperspy-doc/current/user_guide/io.html#zspy-format
