
Allow indexing unindexed dimensions using dask arrays #5873

Merged: 27 commits into pydata:main from bzah's fix/dask_indexing branch on Mar 15, 2023

Conversation

@bzah (Contributor) commented Oct 18, 2021

This is a naive attempt to make isel work with Dask.
Known limitation: it triggers the computation.

WIP
The code presented here is mainly to support the discussion on #2511.
It has not been unit tested and should probably not be merged as is.
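
For context, a minimal sketch of the kind of call this targets (array names, shapes, and chunking are illustrative, not taken from this PR):

import dask.array
import numpy as np
import xarray as xr

# A chunked DataArray whose "time" dimension has no index (made-up data).
arr = xr.DataArray(np.arange(30).reshape(10, 3), dims=("time", "x")).chunk({"time": 5})

# Index the unindexed "time" dimension with a dask array instead of a NumPy array.
# Per the known limitation above, the naive version still computes the indexer.
idx = dask.array.from_array(np.array([1, 3, 5]), chunks=2)
subset = arr.isel(time=idx)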

@pep8speaks commented Oct 18, 2021

Hello @bzah! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-11-03 09:02:49 UTC

@bzah force-pushed the fix/dask_indexing branch 2 times, most recently from b7e4d61 to 71d5ff6, on October 18, 2021 08:08
@keewis marked this pull request as draft on October 18, 2021 08:39
@github-actions bot commented Oct 18, 2021

Unit Test Results

4 files, 4 suites, 31m 32s ⏱️
15 157 tests: 13 818 ✔️ passed, 1 339 💤 skipped, 0 failed
60 628 runs: 55 272 ✔️ passed, 5 356 💤 skipped, 0 failed

Results for commit 270873a.

♻️ This comment has been updated with latest results.

xarray/core/indexing.py: 3 resolved review threads (outdated)
Comment on lines 743 to 746
idx = da.argmax("time")
actual = da.isel(time=idx)

assert np.all(actual == 42)
Contributor

How about

Suggested change
-idx = da.argmax("time")
-actual = da.isel(time=idx)
-assert np.all(actual == 42)
+with raise_if_dask_computes():
+    actual = da.isel(time=dask.array.from_array([9], chunks=(1,)))
+expected = da.isel(time=[9])
+assert_identical(expected, actual)

with `from . import raise_if_dask_computes` at the top.

We could test 9, [9], and [9, 9]
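
A rough sketch of how those three indexer cases might be parametrized (the test name and the eager expected value are illustrative, not from this PR):

import dask.array
import numpy as np
import pytest
from xarray import DataArray
from xarray.testing import assert_identical
from xarray.tests import raise_if_dask_computes


@pytest.mark.parametrize("indexer", [9, [9], [9, 9]])
def test_isel_with_lazy_indexer(indexer):
    # Hypothetical test covering the scalar, length-1, and repeated cases.
    da = DataArray(
        np.ones(10 * 3 * 3).reshape((10, 3, 3)),
        dims=("time", "x", "y"),
    ).chunk(dict(time=-1, x=1, y=1))
    with raise_if_dask_computes():
        actual = da.isel(time=dask.array.from_array(np.asarray(indexer)))
    assert_identical(da.isel(time=np.asarray(indexer)), actual)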

@bzah (Contributor Author) commented Dec 14, 2021 (edited)
When the indexer is an independent Dask array, the computation is triggered.
Modifying the unit test as follows raises "RuntimeError: Too many computes. Total: 1 > max: 0":

import dask.array
import numpy as np
from xarray import DataArray
from xarray.tests import raise_if_dask_computes


def test_indexing_dask_array():
    da = DataArray(
        np.ones(10 * 3 * 3).reshape((10, 3, 3)),
        dims=("time", "x", "y"),
    ).chunk(dict(time=-1, x=1, y=1))
    with raise_if_dask_computes():
        actual = da.isel(time=dask.array.from_array([9], chunks=(1,)))

I thought it would only be triggered when the indexer is constructed from da.
That's why the unit test was using idx = da.argmax("time").

Contributor

> I thought it would only be triggered when the indexer is constructed from da.

Yeah, that `raise_if_dask_computes` decorator is great for checking these ideas.

I've got it fixed but we should wait for @benbovy to finish with #5692 (though it doesn't look like that PR conflicts with these changes). It would be useful to add more test cases here.

@dcherian (Contributor) commented Dec 14, 2021

Some other tests could be the ones in #4276, #4663, and this one from #2511

import dask.array as da
import numpy as np
import xarray as xr
from xarray.tests import raise_if_dask_computes

darr = xr.DataArray(data=[0.2, 0.4, 0.6], coords={"z": range(3)}, dims=("z",))
indexer = xr.DataArray(
    data=np.random.randint(0, 3, 8).reshape(4, 2).astype(int),
    coords={"y": range(4), "x": range(2)},
    dims=("y", "x"),
)
with raise_if_dask_computes():
    actual = darr[indexer.chunk({"y": 2})]

xr.testing.assert_identical(actual, darr[indexer])

This actually does work now, but it ends up computing the indexer.

Contributor Author

TBH, I was not expecting these changes to be merged; they were only meant to support the discussion on the issue, but it's nice if they can be integrated.
I'll try to add the unit tests.

dcherian and others added 5 commits December 14, 2021 11:35
* main: (34 commits)
  Explicit indexes (pydata#5692)
  Remove test_rasterio_vrt_network (pydata#6371)
  Allow write_empty_chunks to be set in Zarr encoding (pydata#6348)
  Add new tutorial video (pydata#6353)
  Revert "explicitly install `ipython_genutils` (pydata#6350)" (pydata#6361)
  [pre-commit.ci] pre-commit autoupdate (pydata#6357)
  Run pyupgrade on core/groupby (pydata#6351)
  Generate reductions for DataArray, Dataset, GroupBy and Resample (pydata#5950)
  explicitly install `ipython_genutils` (pydata#6350)
  Bump actions/setup-python from 2 to 3 (pydata#6338)
  Bump actions/checkout from 2 to 3 (pydata#6337)
  In documentation on adding a new backend, add missing import and tweak headings (pydata#6330)
  Lengthen underline, correct spelling, and reword (pydata#6326)
  quantile: use skipna=None (pydata#6303)
  New whatsnew section
  v2022.03.0 release notes (pydata#6319)
  fix typos (using codespell) (pydata#6316)
  Add General issue template (pydata#6314)
  Disable CI runs on forks (pydata#6315)
  Enable running sphinx-build on Windows (pydata#6237)
  ...
@dcherian changed the title from "WIP: Attempt to fix indexing for Dask" to "Allow indexing unindexed dimensions using dask arrays" on Mar 18, 2022
if not np.issubdtype(k.dtype, np.integer):
    raise TypeError(
        f"invalid indexer array, does not have integer dtype: {k!r}"
    )
-if k.ndim != 1:
+if k.ndim > 1:
@dcherian (Contributor) commented Jun 24, 2022

k might be a 0d dask array (see below and #4276)
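
For illustration (the value is made up), a 0-d integer dask array is a legitimate scalar indexer; the old `k.ndim != 1` check would have treated it as invalid, while `k.ndim > 1` lets it through and still rejects multi-dimensional indexers:

import dask.array
import numpy as np

k = dask.array.from_array(np.array(3))     # 0-d dask array, as in pydata#4276
assert np.issubdtype(k.dtype, np.integer)  # passes the dtype check above
assert k.ndim == 0                         # not rejected by `k.ndim > 1`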

@@ -109,7 +111,11 @@ def _advanced_indexer_subspaces(key):
        return (), ()

    non_slices = [k for k in key if not isinstance(k, slice)]
    ndim = len(np.broadcast(*non_slices).shape)
Contributor

I think this is just an optimization. Happy to add it in a different PR.
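
For what it's worth, a tiny illustration of the broadcast that line computes over the non-slice indexers (the example key is made up):

import numpy as np

# Slices are excluded; only the array indexers are broadcast against each other.
key = (slice(None), np.arange(4).reshape(4, 1), np.arange(3))
non_slices = [k for k in key if not isinstance(k, slice)]
print(np.broadcast(*non_slices).shape)  # (4, 3), the advanced-indexing subspace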

@dcherian marked this pull request as ready for review on June 24, 2022 22:12
key = tuple(
-    k.data.item() if isinstance(k, Variable) and k.ndim == 0 else k for k in key
+    k.data if isinstance(k, Variable) and k.ndim == 0 else k for k in key
)
Contributor

Necessary for #4276
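
Roughly why this matters for pydata#4276 (the Variable below is made up): `.data.item()` assumes a concrete scalar is available, while `.data` passes the lazy 0-d dask array through unchanged.

import dask.array
import numpy as np
from xarray import Variable

k = Variable((), dask.array.from_array(np.array(3)))  # 0-d, dask-backed
print(type(k.data))  # a lazy dask array; extracting a concrete scalar here is what we want to avoid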

(da.array([1, 99, 99, 99]), np.arange(4) > 0),
(da.array([99, 99, 99, 99]), Variable(("x"), da.array([1, 2, 3, 4])) > 0),
(da.array([1, 99, 99, 99]), np.array([False, True, True, True])),
(da.array([99, 99, 99, 99]), Variable(("x"), np.array([True] * 4))),
@dcherian (Contributor) commented Jun 25, 2022

The last test was silently computing earlier.

* upstream/main: (291 commits)
  Update error message when saving multiindex (pydata#7475)
  DOC: cross ref the groupby tutorial (pydata#7555)
  [pre-commit.ci] pre-commit autoupdate (pydata#7543)
  supress namespace_package deprecation warning (doctests) (pydata#7548)
  [skip-ci] Add PDF of Xarray logo (pydata#7530)
  Support complex arrays in xr.corr (pydata#7392)
  clarification for thresh arg of dataset.dropna() (pydata#7481)
  [pre-commit.ci] pre-commit autoupdate (pydata#7524)
  Require to explicitly defining optional dimensions such as hue and markersize (pydata#7277)
  Use plt.rc_context for default styles (pydata#7318)
  Update HOW_TO_RELEASE.md (pydata#7512)
  Update whats-new for dev (pydata#7511)
  Fix whats-new for 2023.02.0 (pydata#7506)
  Update apply_ufunc output_sizes error message (pydata#7509)
  Zarr: drop "source" and  "original_shape" from encoding (pydata#7500)
  [pre-commit.ci] pre-commit autoupdate (pydata#7507)
  Whats-new for 2023.03.0
  Add `inclusive` argument to `cftime_range` and `date_range` and deprecate `closed` argument (pydata#7373)
  Make text match code example (pydata#7499)
  Remove Dask single-threaded setting in tests (pydata#7489)
  ...
@dcherian added the topic-arrays (related to flexible array support) label on Feb 28, 2023
@dcherian added the plan to merge (Final call for comments) label on Mar 6, 2023
@dcherian merged commit 83e159e into pydata:main on Mar 15, 2023
@bzah deleted the fix/dask_indexing branch on March 16, 2023 14:51
@bzah (Contributor Author) commented Mar 16, 2023

Hey, thanks @dcherian for taking over and merging this PR!
(And sorry for not being active on it myself for the past year...)

dcherian added a commit to dcherian/xarray that referenced this pull request Mar 26, 2023
* main: (40 commits)
  Faq pull request (According to pull request pydata#7604 & issue pydata#1285 (pydata#7638)
  add timeouts for tests (pydata#7657)
  Pull Request Labeler - Undo workaround sync-labels bug (pydata#7667)
  [pre-commit.ci] pre-commit autoupdate (pydata#7651)
  Allow all integer dtypes in `polyval` (pydata#7619)
  [skip-ci] dev whats-new (pydata#7660)
  Redo whats-new for 2023.03.0 (pydata#7659)
  Set copy=False when calling pd.Series (pydata#7642)
  Pin pandas < 2 (pydata#7650)
  Whats-new for release 2023.03.0 (pydata#7643)
  Bump pypa/gh-action-pypi-publish from 1.7.1 to 1.8.1 (pydata#7648)
  Use more descriptive link texts (pydata#7625)
  Fix missing 'dim' argument in _get_nan_block_lengths (pydata#7598)
  Fix `pcolormesh` with str coords (pydata#7612)
  [skip-ci] Fix groupby binary ops benchmarks (pydata#7603)
  Remove incomplete sentence in IO docs (pydata#7631)
  Allow indexing unindexed dimensions using dask arrays (pydata#5873)
  Bump pypa/gh-action-pypi-publish from 1.6.4 to 1.7.1 (pydata#7618)
  [pre-commit.ci] pre-commit autoupdate (pydata#7620)
  add a test for scatter colorbar extend (pydata#7616)
  ...
dcherian added a commit to kmsquire/xarray that referenced this pull request Mar 29, 2023
* upstream/main: (716 commits)
  Faq pull request (According to pull request pydata#7604 & issue pydata#1285 (pydata#7638)
  add timeouts for tests (pydata#7657)
  Pull Request Labeler - Undo workaround sync-labels bug (pydata#7667)
  [pre-commit.ci] pre-commit autoupdate (pydata#7651)
  Allow all integer dtypes in `polyval` (pydata#7619)
  [skip-ci] dev whats-new (pydata#7660)
  Redo whats-new for 2023.03.0 (pydata#7659)
  Set copy=False when calling pd.Series (pydata#7642)
  Pin pandas < 2 (pydata#7650)
  Whats-new for release 2023.03.0 (pydata#7643)
  Bump pypa/gh-action-pypi-publish from 1.7.1 to 1.8.1 (pydata#7648)
  Use more descriptive link texts (pydata#7625)
  Fix missing 'dim' argument in _get_nan_block_lengths (pydata#7598)
  Fix `pcolormesh` with str coords (pydata#7612)
  [skip-ci] Fix groupby binary ops benchmarks (pydata#7603)
  Remove incomplete sentence in IO docs (pydata#7631)
  Allow indexing unindexed dimensions using dask arrays (pydata#5873)
  Bump pypa/gh-action-pypi-publish from 1.6.4 to 1.7.1 (pydata#7618)
  [pre-commit.ci] pre-commit autoupdate (pydata#7620)
  add a test for scatter colorbar extend (pydata#7616)
  ...
Labels: plan to merge (Final call for comments), topic-arrays (related to flexible array support), topic-dask, topic-indexing
Projects: None yet
4 participants