
Allow indexing unindexed dimensions using dask arrays #5873

Merged: 27 commits into pydata:main from bzah's fix/dask_indexing branch on Mar 15, 2023

Conversation

@bzah (Contributor) commented Oct 18, 2021

This is a naive attempt to make isel work with Dask.
Known limitation: it triggers the computation.

WIP
The code presented here is mainly to support the discussion on #2511.
It has not been unit tested and should probably not be merged as is.
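
For context, a minimal sketch of the kind of call this targets (array names, shapes, and chunking are illustrative, not taken from this PR):

import dask.array
import numpy as np
import xarray as xr

# A chunked DataArray whose "time" dimension has no index (made-up data).
arr = xr.DataArray(np.arange(30).reshape(10, 3), dims=("time", "x")).chunk({"time": 5})

# Index the unindexed "time" dimension with a dask array instead of a NumPy array.
# Per the known limitation above, the naive version still computes the indexer.
idx = dask.array.from_array(np.array([1, 3, 5]), chunks=2)
subset = arr.isel(time=idx)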

@pep8speaks commented Oct 18, 2021

Hello @bzah! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-03 09:02:49 UTC

@bzah force-pushed the fix/dask_indexing branch 2 times, most recently from b7e4d61 to 71d5ff6, on October 18, 2021 08:08
@keewis marked this pull request as draft on October 18, 2021 08:39
@github-actions bot commented Oct 18, 2021

Unit Test Results

4 files, 4 suites, 31m 32s ⏱️
15 157 tests: 13 818 ✔️ passed, 1 339 💤 skipped, 0 failed
60 628 runs: 55 272 ✔️ passed, 5 356 💤 skipped, 0 failed

Results for commit 270873a.

♻️ This comment has been updated with latest results.

xarray/core/indexing.py: 3 resolved review threads (outdated)
Comment on lines 743 to 746
idx = da.argmax("time")
actual = da.isel(time=idx)

assert np.all(actual == 42)
Contributor

How about

Suggested change
-idx = da.argmax("time")
-actual = da.isel(time=idx)
-assert np.all(actual == 42)
+with raise_if_dask_computes():
+    actual = da.isel(time=dask.array.from_array([9], chunks=(1,)))
+expected = da.isel(time=[9])
+assert_identical(expected, actual)

with `from . import raise_if_dask_computes` at the top.

We could test 9, [9], and [9, 9]
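
A rough sketch of how those three indexer cases might be parametrized (the test name and the eager expected value are illustrative, not from this PR):

import dask.array
import numpy as np
import pytest
from xarray import DataArray
from xarray.testing import assert_identical
from xarray.tests import raise_if_dask_computes


@pytest.mark.parametrize("indexer", [9, [9], [9, 9]])
def test_isel_with_lazy_indexer(indexer):
    # Hypothetical test covering the scalar, length-1, and repeated cases.
    da = DataArray(
        np.ones(10 * 3 * 3).reshape((10, 3, 3)),
        dims=("time", "x", "y"),
    ).chunk(dict(time=-1, x=1, y=1))
    with raise_if_dask_computes():
        actual = da.isel(time=dask.array.from_array(np.asarray(indexer)))
    assert_identical(da.isel(time=np.asarray(indexer)), actual)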

@bzah (Contributor Author) commented Dec 14, 2021 (edited)
When the indexer is an independent Dask array, the computation is triggered.
Modifying the unit test as follows raises "RuntimeError: Too many computes. Total: 1 > max: 0":

import dask.array
import numpy as np
from xarray import DataArray
from xarray.tests import raise_if_dask_computes


def test_indexing_dask_array():
    da = DataArray(
        np.ones(10 * 3 * 3).reshape((10, 3, 3)),
        dims=("time", "x", "y"),
    ).chunk(dict(time=-1, x=1, y=1))
    with raise_if_dask_computes():
        actual = da.isel(time=dask.array.from_array([9], chunks=(1,)))

I thought it would only be triggered when the indexer is constructed from da.
That's why the unit test was using idx = da.argmax("time").

Contributor

> I thought it would only be triggered when the indexer is constructed from da.

Yeah, that `raise_if_dask_computes` decorator is great for checking these ideas.

I've got it fixed but we should wait for @benbovy to finish with #5692 (though it doesn't look like that PR conflicts with these changes). It would be useful to add more test cases here.

@dcherian (Contributor) commented Dec 14, 2021

Some other tests could be the ones in #4276, #4663, and this one from #2511

import dask.array as da
import numpy as np
import xarray as xr
from xarray.tests import raise_if_dask_computes

darr = xr.DataArray(data=[0.2, 0.4, 0.6], coords={"z": range(3)}, dims=("z",))
indexer = xr.DataArray(
    data=np.random.randint(0, 3, 8).reshape(4, 2).astype(int),
    coords={"y": range(4), "x": range(2)},
    dims=("y", "x"),
)
with raise_if_dask_computes():
    actual = darr[indexer.chunk({"y": 2})]

xr.testing.assert_identical(actual, darr[indexer])

This actually does work now, but it ends up computing the indexer.

Contributor Author

TBH, I was not expecting these changes to be merged; they were only meant to support the discussion on the issue, but it's nice if they can be integrated.
I'll try to add the unit tests.

dcherian and others added 5 commits December 14, 2021 11:35
* main: (34 commits)
  Explicit indexes (pydata#5692)
  Remove test_rasterio_vrt_network (pydata#6371)
  Allow write_empty_chunks to be set in Zarr encoding (pydata#6348)
  Add new tutorial video (pydata#6353)
  Revert "explicitly install `ipython_genutils` (pydata#6350)" (pydata#6361)
  [pre-commit.ci] pre-commit autoupdate (pydata#6357)
  Run pyupgrade on core/groupby (pydata#6351)
  Generate reductions for DataArray, Dataset, GroupBy and Resample (pydata#5950)
  explicitly install `ipython_genutils` (pydata#6350)
  Bump actions/setup-python from 2 to 3 (pydata#6338)
  Bump actions/checkout from 2 to 3 (pydata#6337)
  In documentation on adding a new backend, add missing import and tweak headings (pydata#6330)
  Lengthen underline, correct spelling, and reword (pydata#6326)
  quantile: use skipna=None (pydata#6303)
  New whatsnew section
  v2022.03.0 release notes (pydata#6319)
  fix typos (using codespell) (pydata#6316)
  Add General issue template (pydata#6314)
  Disable CI runs on forks (pydata#6315)
  Enable running sphinx-build on Windows (pydata#6237)
  ...
@dcherian changed the title from "WIP: Attempt to fix indexing for Dask" to "Allow indexing unindexed dimensions using dask arrays" on Mar 18, 2022
if not np.issubdtype(k.dtype, np.integer):
    raise TypeError(
        f"invalid indexer array, does not have integer dtype: {k!r}"
    )
-if k.ndim != 1:
+if k.ndim > 1:
@dcherian (Contributor) commented Jun 24, 2022

k might be a 0d dask array (see below and #4276)
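
For illustration (the value is made up), a 0-d integer dask array is a legitimate scalar indexer; the old `k.ndim != 1` check would have treated it as invalid, while `k.ndim > 1` lets it through and still rejects multi-dimensional indexers:

import dask.array
import numpy as np

k = dask.array.from_array(np.array(3))     # 0-d dask array, as in pydata#4276
assert np.issubdtype(k.dtype, np.integer)  # passes the dtype check above
assert k.ndim == 0                         # not rejected by `k.ndim > 1`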

@@ -109,7 +111,11 @@ def _advanced_indexer_subspaces(key):
        return (), ()

    non_slices = [k for k in key if not isinstance(k, slice)]
    ndim = len(np.broadcast(*non_slices).shape)
Contributor

I think this is just an optimization. Happy to add it in a different PR.
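
For what it's worth, a tiny illustration of the broadcast that line computes over the non-slice indexers (the example key is made up):

import numpy as np

# Slices are excluded; only the array indexers are broadcast against each other.
key = (slice(None), np.arange(4).reshape(4, 1), np.arange(3))
non_slices = [k for k in key if not isinstance(k, slice)]
print(np.broadcast(*non_slices).shape)  # (4, 3), the advanced-indexing subspace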

@dcherian marked this pull request as ready for review on June 24, 2022 22:12
key = tuple(
-    k.data.item() if isinstance(k, Variable) and k.ndim == 0 else k for k in key
+    k.data if isinstance(k, Variable) and k.ndim == 0 else k for k in key
)
Contributor

Necessary for #4276
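
Roughly why this matters for pydata#4276 (the Variable below is made up): `.data.item()` assumes a concrete scalar is available, while `.data` passes the lazy 0-d dask array through unchanged.

import dask.array
import numpy as np
from xarray import Variable

k = Variable((), dask.array.from_array(np.array(3)))  # 0-d, dask-backed
print(type(k.data))  # a lazy dask array; extracting a concrete scalar here is what we want to avoid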

(da.array([1, 99, 99, 99]), np.arange(4) > 0),
(da.array([99, 99, 99, 99]), Variable(("x"), da.array([1, 2, 3, 4])) > 0),
(da.array([1, 99, 99, 99]), np.array([False, True, True, True])),
(da.array([99, 99, 99, 99]), Variable(("x"), np.array([True] * 4))),
@dcherian (Contributor) commented Jun 25, 2022

The last test was silently computing earlier.

* upstream/main: (291 commits)
  Update error message when saving multiindex (pydata#7475)
  DOC: cross ref the groupby tutorial (pydata#7555)
  [pre-commit.ci] pre-commit autoupdate (pydata#7543)
  supress namespace_package deprecation warning (doctests) (pydata#7548)
  [skip-ci] Add PDF of Xarray logo (pydata#7530)
  Support complex arrays in xr.corr (pydata#7392)
  clarification for thresh arg of dataset.dropna() (pydata#7481)
  [pre-commit.ci] pre-commit autoupdate (pydata#7524)
  Require to explicitly defining optional dimensions such as hue and markersize (pydata#7277)
  Use plt.rc_context for default styles (pydata#7318)
  Update HOW_TO_RELEASE.md (pydata#7512)
  Update whats-new for dev (pydata#7511)
  Fix whats-new for 2023.02.0 (pydata#7506)
  Update apply_ufunc output_sizes error message (pydata#7509)
  Zarr: drop "source" and  "original_shape" from encoding (pydata#7500)
  [pre-commit.ci] pre-commit autoupdate (pydata#7507)
  Whats-new for 2023.03.0
  Add `inclusive` argument to `cftime_range` and `date_range` and deprecate `closed` argument (pydata#7373)
  Make text match code example (pydata#7499)
  Remove Dask single-threaded setting in tests (pydata#7489)
  ...
@dcherian added the topic-arrays (related to flexible array support) label on Feb 28, 2023
@dcherian added the plan to merge (Final call for comments) label on Mar 6, 2023
@dcherian merged commit 83e159e into pydata:main on Mar 15, 2023
@bzah deleted the fix/dask_indexing branch on March 16, 2023 14:51
@bzah (Contributor Author) commented Mar 16, 2023

Hey, thanks @dcherian for taking over and merging this PR!
(And sorry for not being active on it myself for the past year...)

dcherian added a commit to dcherian/xarray that referenced this pull request Mar 26, 2023
* main: (40 commits)
  Faq pull request (According to pull request pydata#7604 & issue pydata#1285 (pydata#7638)
  add timeouts for tests (pydata#7657)
  Pull Request Labeler - Undo workaround sync-labels bug (pydata#7667)
  [pre-commit.ci] pre-commit autoupdate (pydata#7651)
  Allow all integer dtypes in `polyval` (pydata#7619)
  [skip-ci] dev whats-new (pydata#7660)
  Redo whats-new for 2023.03.0 (pydata#7659)
  Set copy=False when calling pd.Series (pydata#7642)
  Pin pandas < 2 (pydata#7650)
  Whats-new for release 2023.03.0 (pydata#7643)
  Bump pypa/gh-action-pypi-publish from 1.7.1 to 1.8.1 (pydata#7648)
  Use more descriptive link texts (pydata#7625)
  Fix missing 'dim' argument in _get_nan_block_lengths (pydata#7598)
  Fix `pcolormesh` with str coords (pydata#7612)
  [skip-ci] Fix groupby binary ops benchmarks (pydata#7603)
  Remove incomplete sentence in IO docs (pydata#7631)
  Allow indexing unindexed dimensions using dask arrays (pydata#5873)
  Bump pypa/gh-action-pypi-publish from 1.6.4 to 1.7.1 (pydata#7618)
  [pre-commit.ci] pre-commit autoupdate (pydata#7620)
  add a test for scatter colorbar extend (pydata#7616)
  ...
dcherian added a commit to kmsquire/xarray that referenced this pull request Mar 29, 2023
* upstream/main: (716 commits)
  Faq pull request (According to pull request pydata#7604 & issue pydata#1285 (pydata#7638)
  add timeouts for tests (pydata#7657)
  Pull Request Labeler - Undo workaround sync-labels bug (pydata#7667)
  [pre-commit.ci] pre-commit autoupdate (pydata#7651)
  Allow all integer dtypes in `polyval` (pydata#7619)
  [skip-ci] dev whats-new (pydata#7660)
  Redo whats-new for 2023.03.0 (pydata#7659)
  Set copy=False when calling pd.Series (pydata#7642)
  Pin pandas < 2 (pydata#7650)
  Whats-new for release 2023.03.0 (pydata#7643)
  Bump pypa/gh-action-pypi-publish from 1.7.1 to 1.8.1 (pydata#7648)
  Use more descriptive link texts (pydata#7625)
  Fix missing 'dim' argument in _get_nan_block_lengths (pydata#7598)
  Fix `pcolormesh` with str coords (pydata#7612)
  [skip-ci] Fix groupby binary ops benchmarks (pydata#7603)
  Remove incomplete sentence in IO docs (pydata#7631)
  Allow indexing unindexed dimensions using dask arrays (pydata#5873)
  Bump pypa/gh-action-pypi-publish from 1.6.4 to 1.7.1 (pydata#7618)
  [pre-commit.ci] pre-commit autoupdate (pydata#7620)
  add a test for scatter colorbar extend (pydata#7616)
  ...
Labels: plan to merge (Final call for comments), topic-arrays (related to flexible array support), topic-dask, topic-indexing
Projects: None yet
4 participants