Using the Zarr library to read HDF5 #535

Closed
rsignell-usgs opened this issue Feb 5, 2020 · 21 comments

@rsignell-usgs commented Feb 5, 2020

The USGS contracted the HDFGroup to do a test:
Could we make the HDF5 format as performant on the cloud as the Zarr format by writing the HDF5 chunk locations into .zmetadata and then having the Zarr library read those chunks instead of Zarr-format chunks?

From our first test the answer appears to be YES: https://gist.github.com/rsignell-usgs/3cbe15670bc2be05980dec7c5947b540

We modified both the zarr and xarray libraries to make that notebook possible, adding the FileChunkStore concept. The modified libraries are: https://github.com/rsignell-usgs/hurricane-ike-water-levels/blob/zarr-hdf5/binder/environment.yml#L20-L21

Feel free to try running the notebook yourself: Binder
(If you run into a "stream is closed" error when computing the max of the zarr data, just run the cell again; I'm still trying to figure out why that error occurs sometimes.)
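
For readers skimming this thread, here is a minimal sketch of the FileChunkStore idea, under the assumption that an augmented .zmetadata supplies a mapping from Zarr chunk keys to byte ranges; the class name and index layout below are hypothetical and are not the API of the modified libraries linked above:

```python
# Hypothetical, read-only illustration of the FileChunkStore concept:
# zarr asks for a chunk key, and the store answers by reading the
# corresponding byte range from the original HDF5 file.
from collections.abc import Mapping


class FileChunkStoreSketch(Mapping):
    def __init__(self, fileobj, chunk_index):
        self._f = fileobj          # open binary file-like object (local file or fsspec/s3fs)
        self._index = chunk_index  # e.g. {"SST/0.0": {"offset": 4096, "size": 102400}, ...}

    def __getitem__(self, key):
        loc = self._index[key]                # a missing key signals a missing chunk
        self._f.seek(loc["offset"])
        return self._f.read(loc["size"])      # raw (possibly compressed) chunk bytes

    def __iter__(self):
        return iter(self._index)

    def __len__(self):
        return len(self._index)
```

Array shape, dtype, and compressor would still come from the consolidated Zarr metadata; only the chunk payloads are redirected to the HDF5 file.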

@Cadair commented Feb 5, 2020

I like the sound of the FileChunkStore. The ability to delegate reading chunks (and also metadata) to other libraries would be super useful, for instance for asdf-format/asdf#718.

@rabernat (Contributor) commented Feb 5, 2020

This recent blog post by @RPrudden gives some very pertinent suggestions:
https://medium.com/informatics-lab/arrays-on-the-fly-9509b6b2e46a

@alimanfoo (Member) commented:

Hi @rsignell-usgs, thanks a lot for posting, nice proof of concept.

@rsignell-usgs (Author) commented Feb 10, 2020

Thanks @alimanfoo! As @RPrudden pointed out to me, probably the neatest thing here is a mechanism for allowing existing community formats to be used efficiently via the Zarr library.

I've got a draft Medium post, in case folks are interested in commenting.

@djhoese commented Feb 10, 2020

Moving a question from gitter here: what would it take to do this type of operation for HDF5 files without generating the external .zmetadata file? For HDF5, could the C library be modified to read some byte range for the header, parse out the chunk locations for variable data, and then use that to load the chunks of data? I'm not sure whether the HDF5 format has a static-length header, which would probably be required to do this reliably.

@rsignell-usgs (Author) commented:

@ajelenak can provide a better answer, but my understanding is that it would be pretty complicated to add the functionality we demonstrated here to the HDF5 library. That's what made using the Zarr library so convenient!

I know there are several annoying steps in the current workflow that could be improved upon. We could imagine computing the augmented .zmetadata on the fly, or perhaps inventing some JSON convention, a "virtual zarr file", that points to both (1) the augmented .zmetadata and (2) the binary file containing the chunks.

Do folks have other ideas for how this could be made more user-friendly?
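
For illustration only, such a "virtual zarr file" might be nothing more than a tiny JSON document; every key name and URI below is invented:

```python
# Purely hypothetical sketch of a "virtual zarr file" convention: a small JSON
# document pointing at (1) the augmented .zmetadata and (2) the binary file
# that actually holds the chunks. Key names and URIs are made up.
import json

virtual_zarr = {
    "format_version": 1,
    "zmetadata_uri": "s3://example-bucket/ike/.zmetadata",       # consolidated metadata + chunk locations
    "chunk_source_uri": "s3://example-bucket/ike/ike_hdf5.nc",   # original HDF5/NetCDF4 file
}

print(json.dumps(virtual_zarr, indent=2))
```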

@djhoese commented Feb 10, 2020

> We could imagine computing the augmented .zmetadata on the fly

When the binary file is first created? Or when the binary data is being read?

I'm just hoping I can read GOES-R ABI NetCDF4 data from Amazon/Google without having to convince NOAA to add .zmetadata files for everything. 😉

@ajelenak (Contributor) commented:

The HDF5 library has the S3 Virtual File Driver (released in v1.10.6, I think) that enables access to HDF5 files in S3. You can also use h5py with the library without this virtual file driver; I have an example notebook showing how to do that. Neither of these methods is optimized: both need to make frequent requests for small amounts of file content to figure out where the chunks are located in the file.
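
As a rough sketch of the driver-free route (bucket and file names are placeholders), h5py can read from a Python file-like object provided by s3fs/fsspec:

```python
# Sketch: open an HDF5/NetCDF4 file in S3 with h5py through an s3fs file-like
# object, no special HDF5 driver required. Bucket/key names are placeholders.
import h5py
import s3fs

fs = s3fs.S3FileSystem(anon=True)  # anonymous access works for public buckets
with fs.open("s3://example-bucket/example-file.nc", "rb") as f:
    with h5py.File(f, mode="r") as h5:
        print(list(h5.keys()))     # every metadata lookup triggers small byte-range GETs
```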

To answer @rsignell-usgs: the HDF5 library would need essentially the same information that is in .zmetadata, so a post-creation step to generate that information would also be needed. Of course, the library could be extended to generate this "consolidated file content" on every file close operation.

One solution to this problem right now could be setting up a Lambda function that generates .zmetadata each time NOAA uploads a new GOES file to their S3 bucket.

@rsignell-usgs (Author) commented Feb 10, 2020

@djhoese, you don't have to convince NOAA to create the augmented .zmetadata! You can create them yourself and start working effectively with that data right now!

@djhoese commented Feb 10, 2020

> you don't have to convince NOAA to create the augmented .zmetadata! You can create them yourself and start working effectively with that data right now!

I thought I needed the entire input file to properly generate the .zmetadata with the script mentioned in the medium post?

@rsignell-usgs (Author) commented:

@djhoese, you just need to extract the metadata from the existing GOES NetCDF4 files and stick it in another file. Then you reference both files when you read, as in our example notebook.

@djhoese commented Feb 10, 2020

> you just need to extract the metadata from the existing GOES NetCDF4 files

I just read the notebook @ajelenak linked to. This makes it clearer. When the Python file-like object from fsspec is passed to h5py.File, it doesn't read the entire file; it knows to parse only specific byte ranges to get all the metadata it needs. Even though it makes a ton of requests, it won't download the entire file, which is what I was worried about. So theoretically you don't need to make the .zmetadata file, since you could generate that information on the fly from an h5py.File object, but for the best performance and the fewest HTTP requests (as @ajelenak pointed out) a .zmetadata file should be created before processing. Correct me if I'm wrong.
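
As a sketch of the on-the-fly option: recent h5py releases (2.10+, if I recall correctly) expose the chunk index of a dataset, so the offset/size pairs that an augmented .zmetadata would hold can be pulled straight from an open file. File and dataset names below are placeholders:

```python
# Sketch: collect chunk byte locations from an open h5py.File, i.e. the raw
# material for an augmented .zmetadata. File/dataset names are placeholders.
import h5py

with h5py.File("example.nc", "r") as h5:
    dset = h5["SST"]
    chunk_locations = {}
    for i in range(dset.id.get_num_chunks()):
        info = dset.id.get_chunk_info(i)  # StoreInfo: chunk_offset, filter_mask, byte_offset, size
        # translate the element offset of the chunk into a zarr-style chunk key, e.g. "SST/3.7"
        key = ".".join(str(off // size) for off, size in zip(info.chunk_offset, dset.chunks))
        chunk_locations[f"SST/{key}"] = {"offset": info.byte_offset, "size": info.size}

    print(len(chunk_locations), "chunks indexed")
```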

@ajelenak (Contributor) commented:

You are correct.

@rsignell-usgs (Author) commented Feb 10, 2020

@djhoese, I'm afraid that if you're after efficient reading of the GOES data, we still have some more work to do. I downloaded a sample file, and the chunks are tiny:

```python
import xarray as xr
file = '/users/rsignell/downloads/OR_ABI-L2-SSTF-M6_G16_s20200412100059_e20200412159366_c20200412206173.nc'
ds = xr.open_dataset(file)
print(ds['SST'].encoding)
```

produces:

```python
{'zlib': True,
 'shuffle': False,
 'complevel': 1,
 'fletcher32': False,
 'contiguous': False,
 'chunksizes': (226, 226),
 'source': 'c:\\users\\rsignell\\downloads\\OR_ABI-L2-SSTF-M6_G16_s20200412100059_e20200412159366_c20200412206173.nc',
 'original_shape': (5424, 5424),
 'dtype': dtype('int16'),
 '_Unsigned': 'true',
 '_FillValue': 65535,
 'scale_factor': 0.00244163,
 'add_offset': 180.0,
 'coordinates': 'retrieval_local_zenith_angle quantitative_local_zenith_angle retrieval_solar_zenith_angle t y x'}
```

so each chunk is only 226x226 int16 values, i.e. 226 × 226 × 2 bytes ≈ 0.1 MB, and that's before compression!

@ajelenak had some ideas about creating meta-chunks when we encounter tiny chunks, to make the S3 byte-range requests and dask tasks bigger. But that would require more thought and effort...

@rsignell-usgs (Author) commented Feb 26, 2020

For those interested, here is the Medium blog post on this work.

@pbranson commented Apr 3, 2020

+1

I think this is awesome and will be particularly beneficial for the Australian Ocean Data Network, which has a vast quantity of data stored in NetCDF format in AWS S3.

How are things looking with regard to incorporating a 'FileChunkStore', and a convenience function to generate the chunk metadata, into the main development branch?

@satra commented Jun 29, 2021

Just a quick update here: the h5py library now includes support for the ros3 driver, so potentially one could use that to create the .zmetadata and pass it on to zarr.
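
A minimal sketch of that route (it assumes an HDF5 build with the read-only S3 driver enabled; the URL is a placeholder):

```python
# Sketch: open an HDF5 file directly from S3 with h5py's ros3 driver
# (needs HDF5 built with ros3 support; the URL is a placeholder).
import h5py

with h5py.File("https://example-bucket.s3.amazonaws.com/example-file.nc",
               mode="r", driver="ros3") as h5:
    print(list(h5.keys()))
```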

@RichardScottOZ (Contributor) commented:

I now have ~200K smallish HDF5-type files to deal with, so an 'automatic' metadata creator sounds handy.

e.g. https://geoh5py.readthedocs.io/en/stable/

@rsignell-usgs @rabernat - any idea where you would start?

@satra commented Mar 28, 2022

There is also this now, which could help: https://github.com/fsspec/kerchunk

@rsignell-usgs (Author) commented:

Thanks @satra! Yes, https://github.com/fsspec/kerchunk should help -- version 0.0.6 now allows not only merging files along a dimension, but also merging variables together! Give it a shot, @RichardScottOZ, and raise an issue at https://github.com/fsspec/kerchunk/issues; we'll try to help out and perhaps improve the docs!
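
For anyone landing here later, here is a short sketch of that kerchunk workflow; the bucket/file names are placeholders, and the exact APIs may have evolved, so check the kerchunk docs:

```python
# Sketch: build a kerchunk reference set for one HDF5/NetCDF4 file, then open
# it through fsspec's "reference" filesystem with xarray/zarr. Paths are placeholders.
import json

import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/example-file.nc"
with fsspec.open(url, "rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url).translate()   # zarr metadata + chunk offsets/sizes

with open("refs.json", "w") as out:
    json.dump(refs, out)

fs = fsspec.filesystem("reference", fo="refs.json",
                       remote_protocol="s3", remote_options={"anon": True})
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr",
                     backend_kwargs={"consolidated": False})
print(ds)
```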

@RichardScottOZ (Contributor) commented:

Thanks Rich, will do, having a look now.

@zarr-developers zarr-developers locked and limited conversation to collaborators Feb 3, 2024
@jhamman jhamman converted this issue into discussion #1645 Feb 3, 2024

This issue was moved to a discussion.

You can continue the conversation there.
