Recipes for loading Zarr arrays onto Spark RDD #523

dokooh · 2019-12-02T14:27:18Z

Hi,
I am experimenting with a large number of NetCDF N-dimensional arrays, which I have compressed (Blosc, level 3) and stored in Zarr, using the latest stable version from PyPI together with the Azure Blob Storage (ABSStore) integration.

I am currently trying to parallelize a selection of the variables onto a Spark RDD. Has anyone here been down this road, and is there an optimal way of doing this? Any tips and tricks would be very helpful.

The most relevant library I have spotted seems to be Zappy:
https://github.com/lasersonlab/zappy
However, it has been archived and its build status is not stable, so I am hesitant to try it.
Thanks in advance for your attention.

jakirkham · 2019-12-02T14:52:09Z

cc @ryan-williams (in case you have thoughts 😉)

rabernat · 2019-12-02T16:38:26Z

@dokooh - I can't speak for Spark RDD specifically, but the general approach would be to implement a lightweight MutableMapping interface for your storage medium. Once you can address your storage medium as a MutableMapping (i.e. a dictionary), then you can start putting zarr into it.

https://docs.python.org/3/library/collections.abc.html#collections.abc.MutableMapping

If you take care to make sure your mapping is pickleable, it can be serialized and used for distributed computations using dask.
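To make the suggestion above concrete, here is a minimal sketch of a picklable MutableMapping, using only the standard library. The class name `DirStore` and its directory-per-store layout are illustrative assumptions, not part of zarr. The point is that the object holds only plain, picklable state (a path) and opens files on demand, so a copy sent to a worker process keeps working.

```python
import os
import pickle
import tempfile
from collections.abc import MutableMapping


class DirStore(MutableMapping):
    """Hypothetical store: each key is a file under `root`, values are bytes."""

    def __init__(self, root):
        # Keep only plain state here (no open handles or clients),
        # so that the default pickling of the instance just works.
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Keys use "/" as separator, e.g. "temperature/0.0".
        return os.path.join(self.root, *key.split("/"))

    def __getitem__(self, key):
        try:
            with open(self._path(key), "rb") as f:
                return f.read()
        except (FileNotFoundError, IsADirectoryError):
            raise KeyError(key)

    def __setitem__(self, key, value):
        path = self._path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(bytes(value))

    def __delitem__(self, key):
        try:
            os.remove(self._path(key))
        except FileNotFoundError:
            raise KeyError(key)

    def __iter__(self):
        # Walk the directory tree and yield keys relative to the root.
        for dirpath, _, filenames in os.walk(self.root):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), self.root)
                yield rel.replace(os.sep, "/")

    def __len__(self):
        return sum(1 for _ in self)


if __name__ == "__main__":
    store = DirStore(tempfile.mkdtemp())
    store["x/.zarray"] = b"{}"
    # Round-trip through pickle, as a distributed scheduler would.
    clone = pickle.loads(pickle.dumps(store))
    print(clone["x/.zarray"])
```

With zarr 2.x, an object like this can in principle be passed wherever a store is expected. A store backed by a remote service would follow the same pattern but would need to recreate its client after unpickling rather than pickle the connection itself.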

jakirkham · 2019-12-02T18:13:30Z

cc @tomwhite (who also may have thoughts here 🙂)

NightMachinery · 2022-01-27T21:46:14Z

@dokooh Did you succeed in loading zarr files into Spark? dask can load these files easily; can't we just feed dask arrays into pyspark?

shoyer · 2022-01-27T22:13:51Z

There are a few ways you could do this, but my guess is that strategies similar to those we use in Xarray-Beam could be effective. Beam has a very similar data model to Spark's RDD.
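One possible reading of this idea, sketched under assumptions: treat the array as a collection of (offsets, chunk) pairs, distribute only the chunk regions, and let each worker open the store and read its own region. The helper names `chunk_slices` and `zarr_to_rdd` are hypothetical, and `zarr_to_rdd` assumes that pyspark and zarr are installed and that `store_path` is readable from every executor. This is not how Xarray-Beam itself is implemented.

```python
from itertools import product


def chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk of an array with the given shape."""
    ranges = [range(0, dim, step) for dim, step in zip(shape, chunks)]
    for starts in product(*ranges):
        yield tuple(
            # Clip the last chunk along each axis to the array bounds.
            slice(start, min(start + step, dim))
            for start, step, dim in zip(starts, chunks, shape)
        )


def zarr_to_rdd(sc, store_path):
    """Hypothetical helper: build an RDD of (offsets, ndarray) pairs.

    `sc` is an existing pyspark SparkContext; `store_path` points at a
    single Zarr array (an assumption made for this sketch).
    """
    import zarr  # imported lazily so chunk_slices works without zarr

    arr = zarr.open(store_path, mode="r")
    regions = list(chunk_slices(arr.shape, arr.chunks))

    def read_region(region):
        # Reopen the array on the worker instead of shipping it from the driver.
        worker_arr = zarr.open(store_path, mode="r")
        offsets = tuple(s.start for s in region)
        return offsets, worker_arr[region]

    # Only the small slice tuples travel from the driver; data is read remotely.
    return sc.parallelize(regions).map(read_region)
```

Aligning the regions with the Zarr chunk grid means each task decompresses whole chunks only once, and keeping the offsets as the key lets downstream steps know where each block belongs in the full array.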

dokooh · 2022-01-28T09:51:16Z

> @dokooh Did you succeed in loading zarr files into Spark? dask can load these files easily; can't we just feed dask arrays into pyspark?

We did as we were advised. The smaller the chunks, the easier they were to load, so we limited each Zarr store to only several months of each year and faced no issues.

dokooh · 2022-01-28T09:52:23Z

> There are a few ways you could do this, but my guess is that strategies similar to those we use in Xarray-Beam could be effective. Beam has a very similar data model to Spark's RDD.

Thanks Stephen. I have heard of Beam, as it has gained attention in the wider large-scale data processing community. I will add it to my to-do list to try out.

dokooh closed this as completed Jan 28, 2022
