Parallel + multi-threaded reading of NetCDF4 + HDF5: Hidefix! #7446

gauteh · 2023-01-17T08:56:03Z

What is your issue?

Greetings,

I have developed a parallel, multi-threaded (and even async) reader for HDF5 and NetCDF4 files. It is still at a somewhat experimental stage (and does not support all compression filters, etc.), but has been tested a fair bit by now. The reader is written in Rust with Python bindings:

https://github.com/gauteh/hidefix (pending conda package: conda-forge/staged-recipes#21742)

The regular NetCDF4 and HDF5 libraries are not thread-safe, and there is a global process-wide lock for reading files. With hidefix this lock is removed. This allows datasets to be read in parallel within the same process, as opposed to splitting the work across processes. Additionally, the reader can read directly into the target buffer and thus avoids a cache for decoded chunks (effectively reducing memory usage and chunk re-decoding).

The reader works by indexing the chunks of a dataset so that chunks can be accessed independently.
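To make the indexing idea concrete, here is a minimal Python sketch, not hidefix's actual implementation. It assumes a toy file layout in which uncompressed chunks are stored back to back after a small header; `write_toy_file` and `read_all_parallel` are hypothetical names. The index maps each chunk to a (file offset, size) pair, so threads can fetch chunks independently with positional reads and place them in a shared target buffer.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_toy_file(path, chunks):
    """Write chunks back to back and return an index of (file_offset, size) per chunk."""
    index = []
    with open(path, "wb") as f:
        f.write(b"HEADER--")  # stand-in for metadata stored before the data
        for payload in chunks:
            index.append((f.tell(), len(payload)))
            f.write(payload)
    return index


def read_all_parallel(path, index, workers=4):
    """Fetch every chunk with a positional read and copy it into one target buffer."""
    target = bytearray(sum(size for _, size in index))

    # Destination offset of each chunk inside the target buffer.
    starts = []
    pos = 0
    for _, size in index:
        starts.append(pos)
        pos += size

    fd = os.open(path, os.O_RDONLY)
    try:
        def read_chunk(i):
            offset, size = index[i]
            # os.pread does not move a shared file position, so threads
            # can use the same file descriptor without a lock.
            data = os.pread(fd, size, offset)
            target[starts[i]:starts[i] + size] = data

        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(read_chunk, range(len(index))))
    finally:
        os.close(fd)
    return bytes(target)
```

A real reader would also decompress and de-filter each chunk before placing it in the output; the sketch leaves that out to keep the index lookup visible.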

I have created a basic xarray backend, combined with the NetCDF4 backend for reading attributes etc: https://github.com/gauteh/hidefix/blob/main/python/hidefix/xarray.py and it works pretty well for reading:

On my laptop with 8 CPUs we get a 6x speed-up over the xarray NetCDF4 backend (reading a 380 MB variable)! On larger machines the speed-up is even greater (to control the number of threads, set the RAYON_NUM_THREADS environment variable).
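A small sketch of how the thread count could be pinned from Python and a read timed. The helper names `set_rayon_threads` and `time_read` are hypothetical. It assumes the variable is set before the Rayon thread pool is first used, since Rayon reads RAYON_NUM_THREADS when its global pool is initialized.

```python
import os
import time


def set_rayon_threads(n):
    """Set RAYON_NUM_THREADS; call this before the first hidefix read in the process."""
    os.environ["RAYON_NUM_THREADS"] = str(n)


def time_read(path, variable, engine):
    """Return (seconds, shape) for loading one variable with the given xarray engine."""
    import xarray as xr  # imported lazily so the helper above has no dependencies

    start = time.perf_counter()
    with xr.open_dataset(path, engine=engine) as ds:
        values = ds[variable].values
    return time.perf_counter() - start, values.shape
```

For example, `set_rayon_threads(4)` followed by `time_read('tests/data/barents_zdepth_m00_FC.nc', 'v', 'hidefix')` would time the read from the benchmark in this post on four threads.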

Running benchmarks along the lines of:

import xarray as xr

i = xr.open_dataset('tests/data/barents_zdepth_m00_FC.nc', engine='hidefix')
d = i['v']
v = d[...].values
print(v.shape, type(v))

for the different backends (with or without xarray):

At this point it turned out that a significant portion of the time was spent setting the _FillValue for the returned array (less important for NetCDF4, since the reader took much longer anyway). This can also be done in Rust in parallel: https://github.com/gauteh/hidefix/blob/main/src/python/mod.rs#L128, which reduces it to a negligible amount of time. This can also be used with the existing xarray NetCDF4 backend.
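For reference, this is roughly what applying a fill value amounts to, written as a single-threaded NumPy sketch. It is not xarray's or hidefix's actual code, and `mask_fill_value` is a hypothetical name. Every element is compared against the fill value and replaced by NaN, which is a full pass over the array and therefore noticeable once the read itself is fast.

```python
import numpy as np


def mask_fill_value(raw, fill_value):
    """Return a float64 copy of `raw` with entries equal to `fill_value` set to NaN."""
    out = raw.astype(np.float64)  # astype copies, so `raw` is left untouched
    out[raw == fill_value] = np.nan
    return out
```

Since each element is handled independently, the work can be split into blocks and run on several threads, which is the approach taken in the Rust code linked above.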

I hope this can be of general interest, and if it would be of interest to move the hidefix xarray backend into xarray that would be very cool.

Best regards, Gaute


rabernat · 2023-01-17T16:23:01Z

Hi @gauteh! This is very cool! Thanks for sharing. I'm really excited about the way that Rust can be used to optimize different parts of our stack.

A couple of questions:

Can your reader read over HTTP / S3 protocol? Or is it just local files?
Do you know about kerchunk? The approach you described:

The reader works by indexing the chunks of a dataset so that chunks can be accessed independently.

...is identical to the approach taken by Kerchunk (although the implementation is different). I'm curious what specification you use to store your indexes. Could we make your implementation interoperable with kerchunk, such that a kerchunk reference specification could be read by your reader? It would be great to reach for some degree of alignment here.
Do you know about hdf5-coro - http://icesat2sliderule.org/h5coro/ - they have similar goals, but focused on cloud-based access
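As a rough illustration of what interoperability could look like, the sketch below turns a chunk index into a kerchunk-style reference mapping. It assumes the version 1 reference layout, in which each chunk key maps to a `[url, offset, length]` triple, and it assumes a `.`-separated chunk key. The function name and the shape of `chunk_index` are hypothetical, and the `.zarray` / `.zattrs` metadata entries that a complete reference set needs are left out.

```python
import json


def to_kerchunk_refs(url, variable, chunk_index):
    """Build a reference mapping from {chunk grid coordinates: (offset, size)}."""
    refs = {}
    for coords, (offset, size) in chunk_index.items():
        # Chunk keys look like "v/0.1" for chunk (0, 1) of variable "v".
        key = f"{variable}/" + ".".join(str(c) for c in coords)
        refs[key] = [url, offset, size]
    return {"version": 1, "refs": refs}
```

The result is plain JSON-serializable data, so `json.dumps(to_kerchunk_refs(...))` gives a document that can be stored next to the original file.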

I hope this can be of general interest, and if it would be of interest to move the hidefix xarray backend into xarray that would be very cool.

This is definitely of general interest! However, it is not necessary to add a new backend directly into xarray. We support entry points which allow packages to implement their own readers, as you have apparently already discovered: https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html

Installing your package should be enough to enable the new engine.

We would, however, welcome a documentation PR that described how to use this package on the I/O page.

gauteh · 2023-01-19T07:44:30Z

On Tue, Jan 17, 2023 at 5:23 PM Ryan Abernathey wrote:

Can your reader read over HTTP / S3 protocol? Or is it just local files?

It is built to do this, but I haven't implemented it. I initially wrote it for an OpenDAP server (dars: https://github.com/gauteh/dars), where the plan is to also support files stored in the cloud. So the hidefix-reader can read from any interface that supports ReadAt or Read + Seek. It would probably be beneficial to index the files beforehand. I submitted a patch to HDF5 that allows it to iterate over the chunks quickly, so indexing a 5-6 GB file takes only a couple of hundred ms, and I no longer store the index for local files. It is still faster than native HDF5 including the indexing.
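The ReadAt idea can be sketched in Python as a small protocol: anything that can return `size` bytes starting at `offset` can serve chunks, whether it is a local file or a remote object fetched with byte-range requests. The class and function names below are hypothetical, and `InMemory` is only a stand-in for a remote source.

```python
import os
import tempfile
from typing import Protocol


class ReadAt(Protocol):
    def read_at(self, offset: int, size: int) -> bytes: ...


class LocalFile:
    """Positional reads from a local file; no shared file position is moved."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_RDONLY)

    def read_at(self, offset, size):
        return os.pread(self._fd, size, offset)

    def close(self):
        os.close(self._fd)


class InMemory:
    """Stand-in for a remote object that is served through byte-range requests."""

    def __init__(self, blob):
        self._blob = blob

    def read_at(self, offset, size):
        return self._blob[offset:offset + size]


def fetch_chunk(source: ReadAt, offset: int, size: int) -> bytes:
    """Read one chunk given its index entry, independent of where the bytes live."""
    return source.read_at(offset, size)
```

With an index built beforehand, a remote source only needs one range request per chunk and never has to parse the file structure over the network.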

Do you know about kerchunk (https://fsspec.github.io/kerchunk/)? The approach you described:

The reader works by indexing the chunks of a dataset so that chunks can be accessed independently.

...is identical to the approach taken by Kerchunk (although the implementation is different). I'm curious what specification you use to store your indexes. Could we make your implementation interoperable with kerchunk, such that a kerchunk reference specification could be read by your reader? It would be great to reach for some degree of alignment here.

The index is serializable using the Rust serde system, so it can be stored in any format supported by that. A fair amount of effort went into making the deserialization _zero-copy_: the e.g. 10 MB index for a 5-6 GB file can be read very quickly because the read buffers are memory-mapped directly onto the structures, so very little deserialization work is needed. I don't have a specific format at the moment, but I have used bincode a lot in e.g. dars.
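A Python analogy of the zero-copy idea, not the serde/bincode mechanism itself: if the index is stored as fixed-width integers, a memory-mapped file can be viewed directly as an array of those integers, and a lookup touches only the bytes it needs. The on-disk layout here (pairs of native-endian unsigned 64-bit integers) and the function names are assumptions made for the illustration.

```python
import mmap
import os
import struct
import tempfile


def write_index(path, entries):
    """Store (offset, size) pairs as consecutive native-endian unsigned 64-bit integers."""
    with open(path, "wb") as f:
        for offset, size in entries:
            f.write(struct.pack("=QQ", offset, size))


def lookup(path, chunk_number):
    """Return (offset, size) of one chunk without parsing the rest of the index."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # cast() reinterprets the mapped bytes as uint64 values in place;
            # no per-entry decoding or copying of the whole index happens.
            with memoryview(mm) as raw, raw.cast("Q") as view:
                return view[2 * chunk_number], view[2 * chunk_number + 1]
        finally:
            mm.close()
```

The memoryviews are released before the mapping is closed, which `mmap` requires while buffers are still exported.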

Do you know about hdf5-coro - http://icesat2sliderule.org/h5coro/ - they have similar goals, but focused on cloud-based access

I hope this can be of general interest, and if it would be of interest to move the hidefix xarray backend into xarray that would be very cool.

This is definitely of general interest! However, it is not necessary to add a new backend directly into xarray. We support entry points which allow packages to implement their own readers, as you have apparently already discovered: https://docs.xarray.dev/en/stable/internals/how-to-add-new-backend.html

Installing your package should be enough to enable the new engine.

We would, however, welcome a documentation PR that described how to use this package on the I/O page.

Great, the package should already register itself with xarray.

dcherian · 2023-06-26T16:27:23Z

@gauteh is there anything we can help with?

cc @scottyhq @betolink

gauteh · 2023-06-26T16:47:26Z

Yeah, lots to do :D In what way were you thinking?

The conda package is stalling, which I think limits stuff: add hidefix conda-forge/staged-recipes#21742
Docs on xarray
There are some issues on the repo
I think a very interesting application is object-store readers. Zarr is not that fast when stored in an object store; maybe this approach is better. It would also be interesting for Zarr to know whether it can be done faster.
The OpenDAP server, unless this + object-store is fast enough.

dcherian · 2023-06-26T18:05:13Z

The conda package is stalling, which I think limits stuff

cc @ocefpaf

Docs on xarray

I think we could advertise hidefix and h5netcdf more here
We just have a "tip" at the moment.

ocefpaf · 2023-06-26T18:08:03Z

The conda package is stalling, which I think limits stuff

cc @ocefpaf

There is a pending review. Please address the reviewer comments in that PR and/or explain why you cannot.

gauteh · 2023-06-26T18:21:43Z

The conda package is stalling, which I think limits stuff

cc @ocefpaf

There is a pending review. Please address the reviewer comments in that PR and/or explain why you cannot.

When I click the review thing it doesn't show anything anymore. I think it is on some outdated code.

gauteh · 2023-06-26T18:33:02Z

By the way, applying fill_value and similar takes a significant share of the time spent loading NetCDF files. Doing this in Rust in parallel makes things go much faster. This could be used for the regular NetCDF reader as well, and it would probably save tens of seconds for large datasets.

dcherian · 2023-06-26T22:06:46Z

@gauteh see nsidc/earthaccess#251

gauteh added the needs triage Issue that has not been reviewed by xarray team member label Jan 17, 2023

gauteh mentioned this issue Jan 17, 2023

xarray backend: tracking issue gauteh/hidefix#17

Open

dcherian added topic-backends and removed needs triage Issue that has not been reviewed by xarray team member labels Jan 17, 2023

gauteh mentioned this issue May 4, 2023

Backend registration does not match docs, and is no longer specifiable in maturin pyproject toml #7816

Closed

4 tasks

weiji14 mentioned this issue Aug 15, 2023

Wishlist ICESAT-2HackWeek/h5cloud#18

Open

8 tasks

TomNicholas mentioned this issue Apr 11, 2024

Using hidefix to determine byte ranges in HDF files? gauteh/hidefix#38

Open
