Dataset.where performance regression #7516
Comments
Can confirm: on my machine it went from 520 ms to 5 s.
Git bisect pinpoints this to #6690 which, funnily enough, is my PR.
I am a bit puzzled here... The major difference I can find is: Maybe someone with more experience in dask can help out?
The old code had:
This loaded the array once and then passed NumPy values to the indexing code. Now the dask array is passed to the indexing code and is computed many times. #5873 raises an error saying boolean indexing with dask arrays is not allowed. For here just do
I think we should close this.
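As a rough illustration of the point above (compute the condition once, then hand in-memory values to `where`), here is a minimal sketch. The dataset, the variable names `value` and `flag`, and the chunk size are all made up for the example; this is not the snippet from the original comment, which was lost.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset, chunked so that the variables are dask-backed.
ds = xr.Dataset(
    {
        "value": ("x", np.arange(10.0)),
        "flag": ("x", np.arange(10) % 2),
    },
    coords={"x": np.arange(10)},
).chunk({"x": 5})

# Evaluate the boolean condition once so that `where` receives a
# NumPy-backed mask instead of a lazy dask array.
mask = (ds["flag"] == 1).compute()

# Keep only the positions where the mask is True.
selected = ds.where(mask, drop=True)
result = selected["value"].values
```

The trade-off is that the mask has to fit in memory, which matches the later remark in this thread about being forced to load the selection expression.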
Does
This "compute" finishes and takes more than 80 s on both versions, with huge memory consumption (it loads the 4 coordinates and the result itself). I know xarray has to keep more information regarding coordinates and dimensions, but doing this (just dask arrays):
takes less than 6 seconds.
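For readers without the original snippet, a selection on plain dask arrays might look like the sketch below. The array contents and chunk sizes are invented for illustration, and the timings quoted above do not apply to it.

```python
import dask.array as da
import numpy as np

# Hypothetical 1D inputs standing in for a data variable and a selector.
values = da.from_array(np.arange(10.0), chunks=5)
flags = da.from_array(np.arange(10) % 2, chunks=5)

# Build the boolean mask lazily and index with it; the result stays lazy
# and has unknown chunk sizes until it is computed.
mask = flags == 1
selected = values[mask]

result = selected.compute()
```

Unlike `Dataset.where(..., drop=True)`, this carries no coordinates or dimension names, which is part of why it is cheaper.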
Yeah, that was another change, I guess. We could extract out the variable using
do your
Hello, I'm not sure the performance problems were fully addressed (we are now forced to fully compute/load the selection expression), but changes made in recent versions make this issue irrelevant, and I think we can close it. Thank you!
What happened?
Hello,
I'm using the Dataset.where function to select data based on the values of some fields, and it takes way too much time!
The dask dashboard seems to show some tasks repeating themselves many times.
The provided example uses a 1D array, for which the selection could be done with Dataset.sel, but in our real use case we make selections on 2D variables.
This problem seems to have appeared with the xarray 2022.6.0 release; 2022.3.0 works as expected.
What did you expect to happen?
Using the 2022.3.0 release, this selection takes 1.37 seconds.
Using 2022.6.0 up to 2023.2.0 (the one from yesterday), this selection takes 8.47 seconds.
This example is very simple and small; with real data and our real use case, we simply cannot use this function anymore.
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
Problematic version
INSTALLED VERSIONS
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2023.2.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
flox: None
numpy_groupies: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
mypy: None
IPython: 8.7.0
sphinx: 5.3.0
Working version
INSTALLED VERSIONS
commit: None
python: 3.10.9 | packaged by conda-forge | (main, Feb 2 2023, 20:20:04) [GCC 11.3.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-58-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: ('fr_FR', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.8.1
xarray: 2022.3.0
pandas: 1.5.3
numpy: 1.23.5
scipy: 1.8.1
netCDF4: 1.6.2
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.13.6
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.3.4
cfgrib: 0.9.10.3
iris: None
bottleneck: None
dask: 2023.1.1
distributed: 2023.1.1
matplotlib: 3.6.3
cartopy: 0.21.1
seaborn: None
numbagg: None
fsspec: 2023.1.0
cupy: None
pint: 0.20.1
sparse: None
setuptools: 67.1.0
pip: 23.0
conda: 22.11.1
pytest: 7.2.1
IPython: 8.7.0
sphinx: 5.3.0