Change default netCDF engine to use h5netcdf and add netcdf_engine_order #10755
The default `engine` when reading/writing netCDF files is now h5netcdf or scipy, which are typically faster than the prior default of netCDF4-python. You can control this default behavior explicitly via the new `netcdf_engine_order` parameter in `set_options()`, e.g., `xr.set_options(netcdf_engine_order=['netcdf4', 'scipy', 'h5netcdf'])` to restore the prior defaults. I've also updated the documentation page which misled @lesserwhirls about Xarray supporting invalid netCDF files without `invalid_netcdf=True`. Fixes pydata#10657
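To illustrate how an ordered preference list such as `netcdf_engine_order` could resolve to a single engine, here is a minimal sketch. It is not xarray's implementation: `ENGINE_MODULES`, `pick_engine`, and the `is_available` hook are hypothetical names, and the rule that the first available engine in the list wins is an assumption.

```python
from importlib.util import find_spec

# Hypothetical mapping from engine name to the module it needs to import.
ENGINE_MODULES = {"netcdf4": "netCDF4", "h5netcdf": "h5netcdf", "scipy": "scipy"}


def pick_engine(order, is_available=None):
    """Return the first engine in `order` that is available (assumed rule)."""
    if is_available is None:
        # Default check: is the backing module importable in this environment?
        def is_available(name):
            return find_spec(ENGINE_MODULES[name]) is not None

    for name in order:
        if is_available(name):
            return name
    raise ValueError(f"none of the engines in {list(order)!r} are available")
```

For example, `pick_engine(['netcdf4', 'scipy', 'h5netcdf'])` tries the prior default order against whatever is installed locally. Passing a custom `is_available` callable makes the choice easy to test without installing any backend.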
Judging by the test failures, we previously supported writing NCZarr with … Should we try to auto-detect URLs like this and use netcdf4 as the backend? Or is it better to encourage users to make an explicit choice?
In general I'm pro "explicit choice", but this would be a breaking change. @malmans2 how common is …
I went ahead and added automatic support for writing NCZarr. This wasn't hard to check.
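A check of this kind might look like the sketch below. It assumes NCZarr targets are identified by a `#mode=nczarr,...` URL fragment, as in netCDF-C style URLs; `looks_like_nczarr` is a hypothetical helper and not necessarily the check this PR adds.

```python
from urllib.parse import urlsplit


def looks_like_nczarr(path):
    """Guess whether `path` is an NCZarr URL from its `#mode=` fragment."""
    fragment = urlsplit(str(path)).fragment
    for part in fragment.split("&"):
        key, _, value = part.partition("=")
        # The mode value is a comma-separated list, e.g. "nczarr,file".
        if key == "mode" and "nczarr" in value.split(","):
            return True
    return False
```

Under that assumption, `looks_like_nczarr("file:///tmp/out.zarr#mode=nczarr,file")` is true, while a plain path such as `/tmp/out.nc` is not matched and would go through the normal engine order.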
I've never seen it actually used in Python applications either. From a quick search on GitHub, it looks like the few packages that write to NCZarr directly use netcdf4-python rather than xarray.
I added … This turned up because scipy is now used in preference to netcdf4 when opening netCDF v3 files, but scipy doesn't support opening groups. In principle we could add support for reading groups to the SciPy backend (netCDF3 files arguably contain a single group, at the root node), but in any case this will also come up for custom backends.
I would love to get this in before the next release, to avoid needing repeated breaking changes.
LGTM, Stephan. Nice to be able to parametrize this.
…tcdf engine first for xarray loading - caused due to pydata/xarray#10755
I know I'm late to the party, but I just wanted to mention that the (local) netCDF files we use in our community (Earth Observation satellite processing) are generally read faster with netcdf4 than with h5netcdf: h5netcdf takes about twice as long.
Sorry if I missed this documentation somewhere else, but I didn't see it mentioned here or in the related issue. Does anyone know of any benchmarks comparing the engines with recent versions of the other dependency libraries (e.g., numpy, pandas, dask)? I have the same use cases as @mraspaud above, but I'll admit it's been a while since I've compared the netcdf4 and h5netcdf engines. Since there are so many ways to access files (local, S3 URI, open file-like object, parallel or single-threaded, etc.) and so many different types of files (array size, on-disk chunking, etc.), I'm wondering if anyone has done the work and documented the performance they found for some of these cases. It seems there are ongoing (or at least wished-for) optimizations for h5netcdf (see h5netcdf/h5netcdf#195) that would be interesting to compare against any existing numbers.
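For anyone who wants to collect such numbers on their own files, a small timing harness along these lines may be a starting point. It is only a sketch: `time_loaders` is a hypothetical helper, and each loader is assumed to be a zero-argument callable that reads the whole file.

```python
import time


def time_loaders(loaders, repeats=3):
    """Return the best wall-clock time in seconds for each named loader."""
    results = {}
    for name, load in loaders.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            load()
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results
```

With xarray, the loaders could be something like `{"h5netcdf": lambda: xr.open_dataset(path, engine="h5netcdf").load(), "netcdf4": lambda: xr.open_dataset(path, engine="netcdf4").load()}`, where `path` is a placeholder for a local file. Taking the best of several repeats mostly measures warm-cache reads, so cold-cache or remote access would need a separate setup.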