
(fix): optimize subsetting dask array #1432

Merged: 5 commits, Mar 22, 2024
Conversation

@ilan-gold (Contributor Author) commented Mar 21, 2024

MVCE as motivation for this change:

import dask.array as da
import numpy as np
import scipy as sp

# NOTE: the original snippet left these undefined; illustrative values added.
chunksize = 1000
size = 10_000
index_0 = np.random.choice(size, 500, replace=False)
index_1 = np.random.choice(size, 500, replace=False)

# DENSE
arr = np.random.randn(chunksize, chunksize)
X = da.map_blocks(
    lambda block_id: arr,
    dtype=arr.dtype,
    meta=arr,
    chunks=((chunksize,) * (size // chunksize),) * 2,
)

%timeit X.vindex[np.ix_(index_0, index_1)]  # slow
%timeit X[index_0, :][:, index_1]  # fast

np.array_equal(X[index_0, :][:, index_1].compute(), X.vindex[np.ix_(index_0, index_1)].compute())

# SPARSE
arr = sp.sparse.random(chunksize, chunksize, format="csr", density=0.1)
X = da.map_blocks(
    lambda block_id: arr,
    dtype=arr.dtype,
    meta=arr,
    chunks=((chunksize,) * (size // chunksize),) * 2,
)

%timeit X.vindex[np.ix_(index_0, index_1)]
%timeit X[index_0, :][:, index_1]

np.array_equal(X[index_0, :][:, index_1].compute().toarray(), X.vindex[np.ix_(index_0, index_1)].compute().toarray())

I don't understand why the sparse case fails to compute in the last step, but in any case the non-vindex operation is much faster and also works, so it may not matter why vindex fails to compute, since we are removing it anyway.
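The replacement relies on the fact that chaining two single-axis selections is equivalent to outer indexing with np.ix_. A minimal NumPy-only sketch of that equivalence (array shape and index values are illustrative, not taken from the PR):

```python
import numpy as np

# Chained axis-wise indexing vs. outer indexing with np.ix_.
rng = np.random.default_rng(0)
arr = rng.standard_normal((30, 15))
index_0 = np.array([0, 2, 4, 9])
index_1 = np.array([3, 6, 10])

outer = arr[np.ix_(index_0, index_1)]  # fancy outer indexing in one shot
chained = arr[index_0, :][:, index_1]  # two cheap single-axis selections

assert np.array_equal(outer, chained)
```

In dask, the chained form stays within the plain __getitem__ fast path, which is why it avoids vindex entirely.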

@ilan-gold ilan-gold added this to the 0.10.7 milestone Mar 21, 2024
codecov bot commented Mar 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.08%. Comparing base (98d33da) to head (34e3476).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1432      +/-   ##
==========================================
- Coverage   86.26%   84.08%   -2.18%     
==========================================
  Files          36       36              
  Lines        5612     5599      -13     
==========================================
- Hits         4841     4708     -133     
- Misses        771      891     +120     
Flag        Coverage Δ
gpu-tests   ?

Flags with carried forward coverage won't be shown.

Files Coverage Δ
anndata/_core/index.py 93.19% <100.00%> (-0.14%) ⬇️

... and 10 files with indirect coverage changes

-# TODO: this may have been working for some cases?
-subset_idx = np.ix_(*subset_idx)
-return a.vindex[subset_idx]
+return a[subset_idx[0], :][:, subset_idx[1]]
@ilan-gold commented Mar 21, 2024

dask does not support a[subset_idx] when subset_idx contains more than one array index (it raises NotImplementedError for n-d fancy indexing):

anndata/_core/anndata.py:1506: in copy
    X=_subset(self._adata_ref.X, (self._oidx, self._vidx)).copy()
/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/functools.py:909: in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
anndata/_core/index.py:155: in _subset_dask
    return a[subset_idx]
venv/lib/python3.11/site-packages/dask/array/core.py:1994: in __getitem__
    dsk, chunks = slice_array(out, self.name, self.chunks, index2, self.itemsize)
venv/lib/python3.11/site-packages/dask/array/slicing.py:176: in slice_array
    dsk_out, bd_out = slice_with_newaxes(out_name, in_name, blockdims, index, itemsize)
venv/lib/python3.11/site-packages/dask/array/slicing.py:198: in slice_with_newaxes
    dsk, blockdims2 = slice_wrap_lists(out_name, in_name, blockdims, index2, itemsize)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

out_name = 'getitem-32365ec69f5d5f165e6565bb934d931b', in_name = 'array-780508e68d811416a0a1a22cb32db79f', blockdims = ((30,), (15,))
index = (array([ 0,  2,  4,  9, 11, 12, 13, 14, 16, 17, 20, 21, 22, 25, 27, 28, 29]), array([ 3,  6, 10])), itemsize = 4

    def slice_wrap_lists(out_name, in_name, blockdims, index, itemsize):
        """
        Fancy indexing along blocked array dasks

        Handles index of type list.  Calls slice_slices_and_integers for the rest

        See Also
        --------

        take : handle slicing with lists ("fancy" indexing)
        slice_slices_and_integers : handle slicing with slices and integers
        """
        assert all(isinstance(i, (slice, list, Integral)) or is_arraylike(i) for i in index)
        if not len(blockdims) == len(index):
            raise IndexError("Too many indices for array")

        # Do we have more than one list in the index?
        where_list = [
            i for i, ind in enumerate(index) if is_arraylike(ind) and ind.ndim > 0
        ]
        if len(where_list) > 1:
>           raise NotImplementedError("Don't yet support nd fancy indexing")
E           NotImplementedError: Don't yet support nd fancy indexing

venv/lib/python3.11/site-packages/dask/array/slicing.py:244: NotImplementedError
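The traceback above comes from dask reserving pointwise ("nd fancy") indexing for .vindex. A NumPy-only sketch of the two indexing semantics at play (values are illustrative):

```python
import numpy as np

# In NumPy, passing two index arrays does *pointwise* fancy indexing,
# not outer indexing. dask's plain __getitem__ refuses this case with
# NotImplementedError, reserving pointwise semantics for .vindex.
arr = np.arange(12).reshape(3, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])

pointwise = arr[rows, cols]      # picks elements (0, 1) and (2, 3)
outer = arr[np.ix_(rows, cols)]  # full 2x2 cross product of rows x cols

assert pointwise.shape == (2,)
assert outer.shape == (2, 2)
```

Since anndata only ever needs the outer-product selection here, chaining two single-axis selections sidesteps the unsupported code path.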

@ilan-gold commented Mar 21, 2024
ilan-gold commented Mar 21, 2024

/Users/ilangold/Projects/Theis/anndata/anndata/_core/index.py:153: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]

To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return a[subset_idx[0], :][:, subset_idx[1]]

This warning appears when using this with large datasets, for example via filter_cells.

UPDATE:

Ok the warning arises from

Dask warns when indexing like this produces a chunk that’s 5x larger than the array.chunk-size config option. You have two options to deal with that warning:

    Set dask.config.set({"array.slicing.split_large_chunks": False}) to allow the large chunk and silence the warning.

    Set dask.config.set({"array.slicing.split_large_chunks": True}) to avoid creating the large chunk in the first place.

The right choice will depend on your downstream operations. See [Chunks](https://docs.dask.org/en/latest/array-chunks.html#array-chunks) for more on choosing chunk sizes.

In other words, this is a global warning, and the default for array.chunk-size is 128 MiB, so this is not a bug.
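The two config options from the warning can be sketched as follows; X, index_0, and index_1 are small illustrative stand-ins, not the actual data that triggered the warning:

```python
import dask
import dask.array as da
import numpy as np

X = da.random.random((1000, 1000), chunks=(100, 100))
index_0 = np.arange(0, 1000, 3)
index_1 = np.arange(0, 1000, 7)

# Option A: accept the large output chunk and silence the warning.
with dask.config.set({"array.slicing.split_large_chunks": False}):
    subset_a = X[index_0, :][:, index_1]

# Option B: split the output into smaller chunks up front.
with dask.config.set({"array.slicing.split_large_chunks": True}):
    subset_b = X[index_0, :][:, index_1]

# Both produce the same values; they differ only in output chunking.
```

Which option is right depends on downstream operations; the warning threshold itself is governed by the global array.chunk-size option.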

@ilan-gold ilan-gold marked this pull request as ready for review March 22, 2024 11:55
@ilan-gold ilan-gold enabled auto-merge (squash) March 22, 2024 11:57
@flying-sheep (Member) commented:
With my change, the dependencies changed like this:

--- 2024-03-21.txt      2024-03-22 15:46:21.047726071 +0100
+++ 2024-03-22.txt      2024-03-22 15:46:49.230697833 +0100
@@ -1,12 +1,12 @@
-anndata           0.11.0.dev90+g8f4c755
+anndata           0.11.0.dev95+g9db28d5
 anyio             4.3.0
-array-api-compat  1.4.1
+array-api-compat  1.5.1
 asciitree         0.3.3
 memray            1.11.0
 msgpack           1.0.8
 natsort           8.4.0
 networkx          3.2.1
-numba             0.59.0
+numba             0.59.1
 numcodecs         0.12.1
 numpy             1.26.4
 numpy-groupies    0.10.2
@@ -26,7 +26,7 @@
 pytest            8.1.1
 pytest-cov        4.1.0
 pytest-memray     1.5.0
-pytest-mock       3.12.0
+pytest-mock       3.14.0
 pytest-xdist      3.5.0
 python-dateutil   2.9.0.post0
 pytz              2024.1
@@ -34,7 +34,7 @@
 rich              13.7.1
 scanpy            1.10.0rc2
 scikit-learn      1.4.1.post1
-scipy             1.12.0
+scipy             1.13.0rc1
 seaborn           0.13.2
 session-info      1.0.0
 setuptools        69.2.0
@@ -46,7 +46,7 @@
 stdlib-list       0.10.0
 tblib             3.0.0
 textual           0.53.1
-threadpoolctl     3.3.0
+threadpoolctl     3.4.0
 toolz             0.12.1
 tornado           6.4
 tqdm              4.66.2
@@ -55,7 +55,7 @@
 uc-micro-py       1.0.3
 umap-learn        0.5.5
 urllib3           2.2.1
-uv                0.1.22
+uv                0.1.23
 zarr              2.17.1
 zict              3.0.0

@ilan-gold ilan-gold merged commit 507444a into main Mar 22, 2024
15 checks passed
@ilan-gold ilan-gold deleted the ig/subsetting_dask_array branch March 22, 2024 15:43
lumberbot-app bot commented Mar 22, 2024

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

1. Check out the backport branch and update it:
   git checkout 0.10.x
   git pull
2. Cherry-pick the first parent of this PR on top of the older branch:
   git cherry-pick -x -m1 507444a5081c5c9c092f683c7501a721b513fe8c
3. You will likely have some merge/cherry-pick conflicts; fix them and commit:
   git commit -am 'Backport PR #1432: (fix): optimize subsetting dask array'
4. Push to a named branch:
   git push YOURFORK 0.10.x:auto-backport-of-pr-1432-on-0.10.x
5. Create a PR against branch 0.10.x; I would have named this PR:

"Backport PR #1432 on branch 0.10.x ((fix): optimize subsetting dask array)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@ivirshup (Member) commented:
@ilan-gold, did you find or open a bug on dask for this?

@ilan-gold (Contributor Author) commented:

@ivirshup is this a bug? I guess it's a performance issue; I can open an issue.
