
Projects using Zarr #19

Open
alimanfoo opened this issue Jan 3, 2018 · 29 comments

@alimanfoo (Member) commented Jan 3, 2018

If you are using Zarr for whatever purpose, we would love to hear about it. If you can, please leave a couple of sentences in a comment below describing who you are and what you are using Zarr for. This information helps us to get a sense of the different use cases that are important for Zarr users, and helps core developers to justify the time they spend on the project.

@alimanfoo (Member, Author) commented Jan 3, 2018

Zarr is used by MalariaGEN within several large-scale collaborative scientific projects related to the genetic study of malaria. For example, Zarr is used by the Anopheles gambiae 1000 Genomes project to store and analyse data on genotype calls derived from whole genome sequencing of malaria mosquitoes, which led to this publication.

@jhamman (Member) commented Jan 3, 2018

Xarray has recently (pydata/xarray#1528) developed a Zarr backend for reading and writing datasets to local and remote stores. In the Xarray use case, we hope to use Zarr to interface with Dask to support parallel reads/writes from cloud-based datastores. We hope to find that this approach outperforms traditional file-based workflows that use HDF5/NetCDF (pangeo-data/pangeo#48). Xarray itself is used in a wide variety of fields, from climate science to finance.

@rabernat gets 99% of the credit for the zarr implementation in xarray.

@alimanfoo (Member, Author) commented Jan 4, 2018

Thanks @jhamman.

@MattKleinsmith commented Jan 22, 2018

I'm participating in a Kaggle competition where the goal is to classify images by the camera models that took them, which is useful for forensics; for example, determining whether an image is a splice of multiple images.

I'm using bcolz right now (carrays only), but when I googled whether bcolz can handle simultaneous reads from two processes, one read per process, I came across Zarr. In a Dask issue in 2016, @alimanfoo said that bcolz was not thread-safe. If this is also true of processes, then I'll likely switch to Zarr.

I'm a beginner when it comes to concurrency and parallelism (which is why I used the term "simultaneous" above). Could someone tell me why thread-safety and process-safety come into play even when only reads will occur? Does it have something to do with decompression? Even the name of a concept I could google would be helpful. Thank you.

@alimanfoo (Member, Author) commented Jan 22, 2018

@NickMortimer commented Jun 28, 2018

Hi, I'm currently looking at processing Argo float data in the cloud. Argo float basic data comes as single NetCDF files. You could read that data into a big database, but I'm trying to come up with a database-less solution so that the current file processing systems see no change.

@jhamman I like Zarr and find it much easier to understand than HDF5, and more flexible. Where I am struggling a little is with xarray: I tried just sending each cast to Zarr with xarray's to_zarr, but this resulted in thousands of small files. Because of the fixed cluster size, this resulted in poor disk utilisation (on Windows), since each attribute ended up in its own directory with a single float in a 1 KB file.

Instead I've been storing an xarray object as a pickle. I know this could cause problems (security, etc.), but it is working very nicely. I create large arrays of profiles (pickled objects) and it works great as a cache.

@jhamman (Member) commented Jun 28, 2018

@NickMortimer - if there are some xarray/zarr specific issues that would help your use case, I think we'd be keen to engage on the xarray issue tracker.

@NickMortimer commented Jun 28, 2018

@jhamman OK, I'll go there. I think it's mainly due to the file that I'm trying to process: there are lots of single-value attributes that get stored in their own files. I will talk more over at xarray.

@alimanfoo transferred this issue from zarr-developers/zarr-python on Jul 3, 2019
@alimanfoo pinned this issue on Jul 3, 2019
@alimanfoo unpinned this issue on Jul 3, 2019
@jrbourbeau (Member) commented Aug 20, 2019

Dask array has to_zarr/from_zarr functionality for interfacing with Zarr.

> helps core developers to justify the time they spend on the project

This is a good point. It could be useful to have a webpage to point to that lists projects using Zarr. For example, xarray has an "Xarray related projects" page in its Sphinx docs and Dask has a "Powered by Dask" section on https://dask.org/. Perhaps Zarr could have something similar.

@QueensGambit commented Dec 19, 2019

Hello @alimanfoo,
The Zarr library is used to compress the training data for a neural network named CrazyAra that learns to play chess variants.

In 2019, Zarr was also used to compress the data generated from self-play games in a reinforcement learning setup. Here, Zarr was combined with the z5 library to export data in C++ and subsequently load it in Python for training.

I would like to cite this awesome library in my master's thesis. Is there a preferred way of doing so? Otherwise I would rely on the default way of citing GitHub repositories:

@misc{miles2019zarr,
    title = {zarr-developers/zarr-python},
    author = {Miles, Alistair},
    copyright = {MIT},
    url = {https://github.com/zarr-developers/zarr-python},
    abstract = {An implementation of chunked, compressed, N-dimensional arrays for Python.},
    urldate = {2019-12-19},
    publisher = {Zarr Developers},
    month = dec,
    year = {2019},
    note = {original-date: 2015-12-15T14:49:40Z}
}

@tiagoantao commented Jan 4, 2020

I am writing a book for Manning, tentatively called "High Performance Python for Data Analytics", and Zarr is a big part of it, especially because a good chunk of the book is about persistence efficiency.

@amcnicho commented Jan 30, 2020

Working on a project that is currently exploring the adoption of a stack built around Dask and Zarr (interfacing via Xarray). This is envisioned as a replacement for a legacy persistent data model underlying the flagging, calibration, synthesis, visualization, and analysis of astronomical data, especially from radio interferometers (e.g., ALMA).

@alimanfoo, does the Zarr project have a preferred method of citation in academic publications?

@alimanfoo (Member, Author) commented Jan 30, 2020

Just to say thanks everyone for adding projects here, very much appreciated.

Re citation, we don't have a preferred method; your suggestion @QueensGambit sounds good in the interim. Writing a short paper about Zarr is very much on the wish list.

@benbovy commented Mar 23, 2020

I've recently added support for saving simulation data as Zarr datasets in xarray-simlab.

https://xarray-simlab.readthedocs.io/en/latest/io_storage.html#using-zarr

Everything went smoothly when working on this. Thanks @alimanfoo and all Zarr developers! (Also thanks @rabernat, @jhamman and others for Zarr integration with Xarray).

I'm working now on running batches of simulations in parallel using Dask and saving all that data in the same Zarr datasets (along a batch dimension), but I'm struggling with different things, so I might ask for some help.

@joshmoore (Member) commented Apr 6, 2020

https://google.github.io/tensorstore/tensorstore/driver/index.html#chunked-storage-drivers

A TensorStore is an asynchronous view of a multi-dimensional array. Every TensorStore is backed by a driver, which connects the high-level TensorStore interface to an underlying data storage mechanism. Using an appropriate driver, a TensorStore may be used to access:

  • chunked storage formats like zarr, N5, and Neuroglancer Precomputed, backed by a supported key-value storage system, such as Google Cloud Storage or local and network filesystems

@olly-writes-code commented Apr 30, 2020

I'm running parallel simulations using Dask distributed. I'm using Zarr to create and persist input data to disk for fast reading by each simulation, as the data is too large to pass between the parallel processes. I was using Feather before, but I needed more complex data structures. Switching to Zarr improved simulation speed by roughly 30%. Thanks for the work on this ❤️

@gzuidhof commented Jun 27, 2020

Lyft Level 5 just released a dataset in Zarr format; it consists of over 1000 hours of driving data and observations. Along with it, they have open-sourced a codebase relating to prediction and planning tasks in autonomous vehicles. Zarr is a great fit for ML problems!

Links:

@jakirkham (Member) commented Jun 28, 2020

Do they happen to have a tweet? Might be worth retweeting.

@gzuidhof commented Jun 29, 2020

@alimanfoo (Member, Author) commented Jun 29, 2020

Thanks @gzuidhof, tweeted from zarr_dev here: https://twitter.com/zarr_dev/status/1277515272270815232

@jakirkham (Member) commented Jun 29, 2020

Thanks both 🙂

@ericgyounkin commented Nov 8, 2020

Hi, I am building a distributed multibeam sonar processing software suite using dask/xarray/zarr. Big fan of zarr! The format is so easy to work with. Really appreciate the work of this community.

Kluster

@PaulJWright commented Apr 16, 2021

We are going to be hosting the Solar Dynamics Observatory (SDO) Machine Learning dataset (https://iopscience.iop.org/article/10.3847/1538-4365/ab1005) on a public-facing Google Cloud bucket, stored in Zarr format! Appreciate the work here, this is fantastic!

@parashardhapola commented May 18, 2021

Hi @alimanfoo and Zarr developers,

First of all, thank you for developing, maintaining and consistently improving this incredible software.

We have created a memory-efficient single-cell genomics data processing toolkit, called Scarf, using Zarr.
There is a preprint describing Scarf here.
A tweet introducing Scarf can be found here.

We chose Zarr over HDF5 because:

  • support for highly performant compression libraries like LZ4
  • parallel read support through Dask
  • random access to data stored on AWS S3
  • rectangular chunking that allows data to be loaded in batches of both rows and columns

@RichardScottOZ commented Jun 22, 2021

Using it to enable country-scale machine learning geology modelling from Sentinel data.

@jhamman (Member) commented Sep 27, 2021

We're using Zarr for a new visualization application, rendering chunks of data directly from a cloud object store using WebGL and Mapbox. A full writeup of our approach is here: https://carbonplan.org/blog/maps-library-release

@thomcom commented Oct 28, 2021

Hi there! I wanted to let you know that I'm working on a Python wrapper for NVIDIA's nvcomp that will implement the numcodecs API and hopefully work seamlessly with Zarr. We're investigating using Zarr with cuDF and Dask.

@aladinor commented Dec 3, 2021

Hi there! I am using Zarr to convert several HDF5 files coming from NASA P3 aircraft (planes that fly inside clouds, including hurricanes). This conversion allows us to manage and access large datasets for machine learning applied to cloud microphysics.

@magnunor commented Jun 8, 2022

Hey, we've added support for Zarr in the HyperSpy library, which is a library for processing electron microscopy data. In recent years we've been getting faster and faster detectors, so our datasets have drastically increased in size: from ~100 MB to 100+ GB. Previously we used HDF5 for this, but using Zarr has greatly improved performance when working with these large datasets.

Kudos to @CSSFrancis for adding this to HyperSpy: hyperspy/hyperspy#2825

https://hyperspy.org/hyperspy-doc/current/user_guide/io.html#zspy-format
