(fix): optimize subsetting dask array #1432

Merged
merged 5 commits into from Mar 22, 2024
Changes from 2 commits
9 changes: 2 additions & 7 deletions anndata/_core/index.py
@@ -147,15 +147,10 @@ def _subset(a: np.ndarray | pd.DataFrame, subset_idx: Index):

 @_subset.register(DaskArray)
 def _subset_dask(a: DaskArray, subset_idx: Index):
-    if all(isinstance(x, cabc.Iterable) for x in subset_idx):
+    if len(subset_idx) > 1 and all(isinstance(x, cabc.Iterable) for x in subset_idx):
         if isinstance(a._meta, csc_matrix):
             return a[:, subset_idx[1]][subset_idx[0], :]
-        elif isinstance(a._meta, spmatrix):
-            return a[subset_idx[0], :][:, subset_idx[1]]
-        else:
-            # TODO: this may have been working for some cases?
-            subset_idx = np.ix_(*subset_idx)
-            return a.vindex[subset_idx]
+        return a[subset_idx[0], :][:, subset_idx[1]]
@ilan-gold (Contributor, Author) commented on Mar 21, 2024:

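dask does not support `a[subset_idx]` when more than one entry of `subset_idx` is an array-like (fancy) index; it raises `NotImplementedError`, so each axis has to be indexed in a separate step: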

anndata/_core/anndata.py:1506: in copy
    X=_subset(self._adata_ref.X, (self._oidx, self._vidx)).copy()
/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/functools.py:909: in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
anndata/_core/index.py:155: in _subset_dask
    return a[subset_idx]
venv/lib/python3.11/site-packages/dask/array/core.py:1994: in __getitem__
    dsk, chunks = slice_array(out, self.name, self.chunks, index2, self.itemsize)
venv/lib/python3.11/site-packages/dask/array/slicing.py:176: in slice_array
    dsk_out, bd_out = slice_with_newaxes(out_name, in_name, blockdims, index, itemsize)
venv/lib/python3.11/site-packages/dask/array/slicing.py:198: in slice_with_newaxes
    dsk, blockdims2 = slice_wrap_lists(out_name, in_name, blockdims, index2, itemsize)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

out_name = 'getitem-32365ec69f5d5f165e6565bb934d931b', in_name = 'array-780508e68d811416a0a1a22cb32db79f', blockdims = ((30,), (15,))
index = (array([ 0,  2,  4,  9, 11, 12, 13, 14, 16, 17, 20, 21, 22, 25, 27, 28, 29]), array([ 3,  6, 10])), itemsize = 4

    def slice_wrap_lists(out_name, in_name, blockdims, index, itemsize):
        """
        Fancy indexing along blocked array dasks

        Handles index of type list.  Calls slice_slices_and_integers for the rest

        See Also
        --------

        take : handle slicing with lists ("fancy" indexing)
        slice_slices_and_integers : handle slicing with slices and integers
        """
        assert all(isinstance(i, (slice, list, Integral)) or is_arraylike(i) for i in index)
        if not len(blockdims) == len(index):
            raise IndexError("Too many indices for array")

        # Do we have more than one list in the index?
        where_list = [
            i for i, ind in enumerate(index) if is_arraylike(ind) and ind.ndim > 0
        ]
        if len(where_list) > 1:
>           raise NotImplementedError("Don't yet support nd fancy indexing")
E           NotImplementedError: Don't yet support nd fancy indexing

venv/lib/python3.11/site-packages/dask/array/slicing.py:244: NotImplementedError

     return a[subset_idx]
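
The change above relies on two chained single-axis selections giving the same result as one outer (`np.ix_`-style) selection. A minimal NumPy-only sketch of that equivalence follows, reusing the array shape and index arrays printed in the traceback. The array contents and variable names are made up for illustration, and plain NumPy stands in for dask here, so this shows the indexing semantics only, not dask's behaviour or performance.

```python
import numpy as np

# Same shape and index arrays as in the traceback: a (30, 15) array,
# 17 selected rows and 3 selected columns. The values are illustrative.
a = np.arange(30 * 15).reshape(30, 15)
rows = np.array([0, 2, 4, 9, 11, 12, 13, 14, 16, 17, 20, 21, 22, 25, 27, 28, 29])
cols = np.array([3, 6, 10])

# One fancy index per step: rows first, then columns (the default branch).
rows_then_cols = a[rows, :][:, cols]

# Columns first, then rows (the order the diff keeps for csc_matrix chunks).
cols_then_rows = a[:, cols][rows, :]

# Outer selection in a single step, using the same np.ix_ index
# that the removed branch passed to vindex.
outer = a[np.ix_(rows, cols)]

assert rows_then_cols.shape == (17, 3)
assert np.array_equal(rows_then_cols, cols_then_rows)
assert np.array_equal(rows_then_cols, outer)

# Passing both arrays at once is pointwise in NumPy, not an outer selection:
# the (17,) and (3,) index arrays cannot be broadcast against each other.
try:
    a[rows, cols]
    pointwise_failed = False
except IndexError:
    pointwise_failed = True
```

Since the order of the two steps does not change the result, the `csc_matrix` branch can select columns first without affecting what the caller sees.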


2 changes: 2 additions & 0 deletions docs/release-notes/0.10.7.md
@@ -10,3 +10,5 @@

```{rubric} Performance
```

* Remove `vindex` for subsetting `dask.array.Array` because of its slowness and memory consumption {user}`ilan-gold` {pr}`1432`
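
To see the two code paths from this note side by side, here is a minimal sketch, assuming dask is installed (it is skipped otherwise). `subset_both_ways` is a hypothetical helper, and the chunk sizes and data are arbitrary. It only checks that chained slicing and `vindex` with an `np.ix_` index return the same values; it does not measure the time or memory difference the note refers to.

```python
import numpy as np

try:
    import dask.array as da  # assumption: dask is available in the environment
except ImportError:
    da = None  # the comparison below is skipped without dask


def subset_both_ways(x, rows, cols, chunks):
    """Hypothetical helper: outer row/column selection via both code paths."""
    a = da.from_array(x, chunks=chunks)
    # New path: one fancy index per step.
    chained = a[rows, :][:, cols].compute()
    # Removed path: pointwise vindex fed with a broadcastable np.ix_ index.
    vindexed = a.vindex[np.ix_(rows, cols)].compute()
    return chained, vindexed


x = np.arange(30 * 15, dtype="float32").reshape(30, 15)
rows = np.array([0, 2, 4, 9, 11])
cols = np.array([3, 6, 10])
expected = x[np.ix_(rows, cols)]  # NumPy reference result

if da is not None:
    chained, vindexed = subset_both_ways(x, rows, cols, chunks=(10, 5))
    assert np.array_equal(chained, expected)
    assert np.array_equal(vindexed, expected)
```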