netCDF4 indexing: reindex_like is very slow if dataset not loaded into memory
#8945
Comments
Can you try h5netcdf?
Using So I guess the issue is with netCDF4, or how xarray uses netCDF4. In the netcdf4-python issue I linked above, I think the problem wasn't present in h5netcdf as well. That issue is slightly different, since the slowdown was from loading with an index like 0, 2, 4, ... Reindexing with I'll try some variations/simplifying the example later.
Interestingly, we actually treat h5netcdf and netCDF4 arrays differently in terms of indexing: xarray/xarray/backends/netCDF4_.py Lines 100 to 103 in 5a35ca4
xarray/xarray/backends/h5netcdf_.py Lines 51 to 54 in 5a35ca4
But in this example, the underlying arrays are being indexed with the same tuple, so yes, I'm inclined to blame netCDF4.
I'm going to close since there doesn't seem to be much to do here, except switch to
Thanks for looking into it!
What is your issue?
Reindexing a dataset without loading it into memory seems to be very slow (about 1000x slower than reindexing after loading into memory).
Here is a minimal working example:
Then
takes over a minute, while
is almost instantaneous (timeit says 91ms, including opening the dataset... I'm not sure if caching is influencing this).
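For readers who want to try the comparison themselves, here is a rough sketch of how the two timings could be set up. It is not the example from this issue: the dataset shape, the jitter size, the helper names `make_datasets` and `time_reindex`, and the use of `method="nearest"` with a tolerance are all assumptions, and it may or may not reproduce the slowdown on a given netCDF4 build.

```python
import time

import numpy as np
import xarray as xr


def make_datasets(n=200, seed=0):
    """Build a source dataset with a jittered x coordinate and a regular target grid.

    Hypothetical stand-in data; the sizes and the variable name are assumptions.
    """
    rng = np.random.default_rng(seed)
    x = np.arange(n, dtype=float)
    # Jitter stays well below half the grid spacing, so x remains strictly increasing.
    x_jittered = x + rng.uniform(-0.3, 0.3, size=n)
    source = xr.Dataset(
        {"flux": (("x", "y"), rng.normal(size=(n, n)))},
        coords={"x": x_jittered, "y": x},
    )
    target = xr.Dataset(coords={"x": x, "y": x})
    return source, target


def time_reindex(path, target, load, engine="netcdf4"):
    """Time reindex_like on a dataset opened from disk, optionally loading it first."""
    start = time.perf_counter()
    with xr.open_dataset(path, engine=engine) as ds:
        if load:
            ds = ds.load()
        # The trailing .load() forces the lazy result to be read in both cases.
        ds.reindex_like(target, method="nearest", tolerance=0.5).load()
    return time.perf_counter() - start


# Possible usage (writes a file, so it is left commented out):
# source, target = make_datasets()
# source.to_netcdf("jittered.nc", engine="netcdf4")
# print(time_reindex("jittered.nc", target, load=False))
# print(time_reindex("jittered.nc", target, load=True))
```

Passing `engine="h5netcdf"` to both `to_netcdf` and `time_reindex` would give the comparison against the other backend, assuming that package is installed.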
Profiling the "reindex without load" cell:
The getitem call at the top is from xarray.backends.netCDF4_.py, line 114. Because of the jittered coordinates in flux, I'm assuming that the index passed to netCDF4 is not consecutive/strictly monotonic integers (0, 1, 2, 3, ...). In the past, this has caused issues: Unidata/netcdf4-python#680.
In my venv, netCDF4 was installed from a wheel with the following versions:
This is with xarray version 2023.12.0, numpy 1.26, and pandas 1.5.3.
I will try to investigate more and hopefully simplify the example. (Can't quite justify spending more time on it at work because this is just to tag a version that was used in some experiments before we switch to zarr as a backend, so hopefully it won't be relevant at that point.)
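To illustrate the assumption about the index, here is a small sketch of how a label lookup can turn into integer positions that are not a simple run. It assumes the lookup behaves like pandas `Index.get_indexer` with `method="nearest"`; the coordinate values and the helper `is_consecutive` are made up for illustration and are not taken from the profile above.

```python
import numpy as np
import pandas as pd

# Hypothetical coordinates: the values stored on disk and the labels being requested.
stored = pd.Index(np.arange(10, dtype=float))
requested = np.array([0.1, 2.2, 1.9, 7.0])

# Nearest-neighbour lookup converts each requested label into an integer position.
positions = stored.get_indexer(requested, method="nearest")


def is_consecutive(idx):
    """Return True when idx is a run k, k+1, k+2, ... that one slice could express."""
    idx = np.asarray(idx)
    return idx.size < 2 or bool(np.all(np.diff(idx) == 1))


print(positions.tolist(), is_consecutive(positions))  # [0, 2, 2, 7] False
```

A backend that only has a fast path for slices would have to handle an array like this some other way, which is one possible explanation for the difference between the lazy and the loaded case.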