shoyer
Member

@shoyer shoyer commented Sep 16, 2025

The default `engine` when reading/writing netCDF files is now h5netcdf or scipy, which are typically faster than the prior default of netCDF4-python. You can control this default behavior explicitly via the new `netcdf_engine_order` parameter in `set_options()`, e.g., `xr.set_options(netcdf_engine_order=['netcdf4', 'scipy', 'h5netcdf'])` to restore the prior defaults.
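As a rough illustration of what an engine preference order means, the sketch below picks the first engine in an ordered list whose backing module is installed. The helper `first_available_engine`, the engine-to-module mapping, and the h5netcdf-first list are assumptions made for this example; they are not Xarray's actual implementation.

```python
from importlib.util import find_spec

# Assumed mapping from engine name to the module that must be importable.
ENGINE_MODULES = {"netcdf4": "netCDF4", "h5netcdf": "h5netcdf", "scipy": "scipy"}


def first_available_engine(order, is_installed=None):
    """Return the first engine in `order` whose module is installed (hypothetical helper)."""
    if is_installed is None:
        # Default check: ask the import system whether the module can be found.
        is_installed = lambda module: find_spec(module) is not None
    for engine in order:
        if is_installed(ENGINE_MODULES[engine]):
            return engine
    raise ValueError(f"none of the engines {list(order)} are installed")


# Simulate an environment where only scipy and netCDF4 are installed.
installed = {"scipy", "netCDF4"}

# An illustrative h5netcdf-first order versus the prior order quoted above.
new_default = first_available_engine(["h5netcdf", "scipy", "netcdf4"], installed.__contains__)
old_default = first_available_engine(["netcdf4", "scipy", "h5netcdf"], installed.__contains__)
print(new_default, old_default)
```

In this simulated environment the same file would be handled by different engines depending only on the order, which is why the ordering is exposed as an option.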

I've also updated the documentation page, which misled @lesserwhirls about Xarray supporting invalid netCDF files without `invalid_netcdf=True`.

Fixes pydata#10657
@github-actions github-actions bot added the io, topic-backends, and topic-DataTree labels Sep 16, 2025
@shoyer shoyer changed the title from "Add option for netcdf_engine_order" to "Change default netCDF engine to use h5netcdf and add netcdf_engine_order" Sep 16, 2025
@shoyer
Member Author

shoyer commented Sep 16, 2025

Judging by the test failures, we previously supported writing NCZarr with `ds.to_netcdf(f"file://{filename}#mode=nczarr")`. Now this also requires passing `engine='netcdf4'` explicitly.

Should we try to auto-detect URLs like this and use netcdf4 as the backend? Or is it better to encourage users to make an explicit choice?

@dcherian
Contributor

in general I'm pro "explicit choice", but this would be a breaking change.

@malmans2 how common is nczarr use? I haven't really seen it.

@shoyer
Member Author

shoyer commented Sep 17, 2025

I went ahead and added automatic support for writing nczarr. This wasn't hard to check.
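A check of this kind could look roughly like the sketch below, which inspects the URL fragment for `mode=nczarr` and falls back to the default engine otherwise. The helpers `looks_like_nczarr` and `choose_write_engine` are hypothetical names for illustration; the check actually merged in this PR may differ.

```python
from urllib.parse import urlsplit


def looks_like_nczarr(path):
    """Hypothetical check: does the URL fragment request NCZarr mode?"""
    fragment = urlsplit(str(path)).fragment  # text after '#', or '' if absent
    for item in fragment.split("&"):
        key, _, value = item.partition("=")
        # The mode value may be a comma-separated list, e.g. "nczarr,file".
        if key == "mode" and "nczarr" in value.split(","):
            return True
    return False


def choose_write_engine(path, default_engine):
    """Use netcdf4 for NCZarr-style URLs, else keep the default (illustrative only)."""
    return "netcdf4" if looks_like_nczarr(path) else default_engine


print(choose_write_engine("file:///tmp/out.zarr#mode=nczarr", "h5netcdf"))
print(choose_write_engine("/tmp/out.nc", "h5netcdf"))
```

Ordinary file paths carry no fragment, so they keep whatever engine the preference order selects.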

@malmans2
Contributor

> in general I'm pro "explicit choice", but this would be a breaking change.
>
> @malmans2 how common is nczarr use? I haven't really seen it.

I've never seen it actually used in Python applications either. From a quick search on GitHub, it looks like the few packages that write to nczarr use netcdf4-python directly rather than xarray.

@shoyer
Member Author

shoyer commented Sep 17, 2025

I added supports_groups to BackendEntrypoint. Otherwise, we have no way to check if a backend supports open_datatree() short of calling the open_datatree() method.

This turned up because scipy is now used in preference to netcdf4 when opening netcdf v3 files, but scipy doesn't support opening groups.

In principle we could add support for reading groups to the SciPy backend (netcdf3 files arguably contain a single group, at the root node), but in any case this will also come up for custom backends.
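The idea of a declared capability flag can be sketched as follows: each backend advertises `supports_groups`, and the caller filters on that attribute instead of calling `open_datatree()` to find out. The classes below are stand-ins written for this example, not Xarray's real `BackendEntrypoint`, and `first_group_capable` is a hypothetical helper.

```python
class FakeEntrypoint:
    """Stand-in for a backend entrypoint; not Xarray's real class."""

    supports_groups = False  # conservative default for backends that say nothing

    def open_datatree(self, path):
        raise NotImplementedError


class GroupAwareBackend(FakeEntrypoint):
    supports_groups = True

    def open_datatree(self, path):
        return f"tree from {path}"


class FlatBackend(FakeEntrypoint):
    pass  # inherits supports_groups = False, like a netCDF3-only reader


def first_group_capable(backends):
    """Pick the first backend that declares group support, without calling it."""
    for name, backend in backends.items():
        if getattr(backend, "supports_groups", False):
            return name
    raise ValueError("no installed backend supports groups")


# The flat reader comes first in the preference order but is skipped.
backends = {"flat": FlatBackend(), "grouped": GroupAwareBackend()}
chosen = first_group_capable(backends)
print(chosen)
```

Reading the flag through `getattr` with a `False` default also covers custom backends that never define the attribute.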

@shoyer
Member Author

shoyer commented Sep 23, 2025

I would love to get this in before the next release, to avoid needing repeated breaking changes.

Contributor

@kmuehlbauer kmuehlbauer left a comment

LGTM, Stephan. Nice to be able to parametrize this.

@shoyer shoyer merged commit 4722bf1 into pydata:main Sep 24, 2025
37 checks passed
@shoyer shoyer deleted the netcdf_engine_order branch September 24, 2025 18:40
rajeeja added a commit to UXARRAY/uxarray that referenced this pull request Sep 29, 2025
@mraspaud
Contributor

mraspaud commented Oct 1, 2025

I know I'm late to the party, but I just wanted to mention that the (local) netCDF files we use in our community (Earth Observation Satellite processing) are generally read faster with netcdf4 than with h5netcdf; h5netcdf takes about twice as long.
No harm done, since we can now change the engine preference, but I thought I'd let you know for reference.

@djhoese
Contributor

djhoese commented Oct 3, 2025

Sorry if I missed this documentation somewhere else, but I didn't see it mentioned here or in the related issue. Does anyone know of any benchmarks comparing the engines with recent versions of the other dependency libraries (e.g. numpy, pandas, dask)? I have the same use cases as @mraspaud above, but I'll admit it's been a while since I've compared the netcdf4 and h5netcdf engines. Since there are so many ways to access files (local, S3 URI, open file-like object, parallel or single-threaded, etc.) and so many different types of files (array size, on-disk chunking, etc.), I'm wondering if anyone has done the work and documented the performance they found for some of these cases.

It seems there are ongoing (or at least hoped-for) optimizations for h5netcdf (see h5netcdf/h5netcdf#195) that would be interesting to compare against any existing numbers.
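For anyone wanting to collect such numbers locally, a minimal timing sketch might look like the following. The helpers `best_of` and `compare_engines` and the file name `my_file.nc` are placeholders invented for this example; results will depend heavily on file layout, access method, and OS caching, so this is a starting point rather than a rigorous benchmark.

```python
import time


def best_of(func, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_engines(path, engines=("netcdf4", "h5netcdf")):
    """Time a full read of `path` per engine; needs xarray and each engine installed."""
    import xarray as xr  # imported lazily so the timer above works without xarray

    def read_with(engine):
        with xr.open_dataset(path, engine=engine) as ds:
            ds.load()  # force the data to be read, not just the metadata

    return {engine: best_of(lambda e=engine: read_with(e)) for engine in engines}


# Example (not run here; "my_file.nc" is a placeholder path):
# print(compare_engines("my_file.nc"))
```

Taking the minimum over several repeats reduces noise from other processes, but it also means later runs benefit from a warm file cache.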

@shoyer
Member Author

shoyer commented Oct 3, 2025

@djhoese Let's discuss this back in #10657

I am thinking that perhaps the change to the default ordering here was premature.

Successfully merging this pull request may close these issues.

Should Xarray prefer h5netcdf and scipy to netCDF4?
6 participants