
(feat): allow boolean indices to pass down to X #1365

Merged
merged 10 commits into main from ig/anndata_slice_indexing on Feb 23, 2024

Conversation


@ilan-gold ilan-gold commented Feb 9, 2024


codecov bot commented Feb 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.08%. Comparing base (3f6a217) to head (591dae5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1365      +/-   ##
==========================================
- Coverage   86.17%   84.08%   -2.10%     
==========================================
  Files          35       35              
  Lines        5584     5592       +8     
==========================================
- Hits         4812     4702     -110     
- Misses        772      890     +118     
Flag        Coverage Δ
gpu-tests   ?

Flags with carried forward coverage won't be shown.

Files                    Coverage Δ
anndata/_core/index.py   93.33% <100.00%> (+0.86%) ⬆️
anndata/_core/views.py   84.05% <100.00%> (-5.60%) ⬇️

... and 7 files with indirect coverage changes

@ilan-gold
Contributor Author

@ivirshup I have a fix here, but I'm unsure of the original reasoning for converting to integer indices. Their primary use seemed to be that sparse/h5py needed integers. The numpy/pandas subsetting seems to work fine with booleans; for example, we don't seem to need integer indices to form a proper mesh using ix_. I'm also aware of the array-api boolean-indexing issues. I added a test to ensure that the resulting objects are equal. Also, after #1366, we can actually inspect usage of the optimization if we keep the mocking/spy.
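
For instance, a quick generic numpy check (not code from this PR) that np.ix_ handles boolean masks directly, without an explicit np.where:

import numpy as np

X = np.arange(20).reshape(4, 5)
row_mask = np.array([True, False, True, False])
col_mask = np.array([True, True, False, False, True])

# np.ix_ accepts boolean arrays and builds the same open mesh as the
# integer indices obtained via np.where.
mesh_bool = X[np.ix_(row_mask, col_mask)]
mesh_int = X[np.ix_(np.where(row_mask)[0], np.where(col_mask)[0])]
assert np.array_equal(mesh_bool, mesh_int)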

@ilan-gold ilan-gold marked this pull request as ready for review February 12, 2024 12:04
@ivirshup
Member

I have a fix here, but I'm unsure of the original reasoning for converting to integer indices. Their primary use seemed to be that sparse/h5py needed integers.

I'm not sure what you mean by this. Are you wondering why we were doing np.where in the first place?

@ilan-gold
Contributor Author

I'm not sure what you mean by this. Are you wondering why we were doing np.where in the first place?

I'm guessing the reason was sparse/hdf5 compatibility, but I'm not sure and I don't want to miss something I'm not thinking of.
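
A rough sketch of the conversion being discussed (whether older sparse/HDF5 code paths actually required it is an assumption here, not something verified in this thread):

import numpy as np
from scipy import sparse

mask = np.array([True, False, True, True, False])
int_idx = np.where(mask)[0]  # array([0, 2, 3])

X = sparse.random(5, 3, density=0.5, format="csr")
# Current scipy CSR accepts the boolean mask directly; converting via
# np.where only matters for backends that want integer indices.
assert (X[mask].toarray() == X[int_idx].toarray()).all()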

@ivirshup
Member

  • You could intentionally break it, see what tests fail, and decide if there are any cases missing that should be tested
  • You could run the scanpy tests with this branch installed, that often catches other things

It's also possible we didn't support masking because it was easier to support fewer cases and the performance hit from doing np.where wasn't high.

@ilan-gold
Contributor Author

You could intentionally break it, see what tests fail, and decide if there are any cases missing that should be tested

This is how I approached this PR (by removing np.where), so I am not sure what else could be "intentionally [broken]" but am open to ideas.

You could run the scanpy tests with this branch installed, that often catches other things

Great idea, will definitely do this.

@ivirshup
Member

so I am not sure what else could be "intentionally [broken]" but am open to ideas.

You could invert the mask?

@ilan-gold
Contributor Author

@ivirshup I got no errors in scanpy, and nothing looks particularly out of place in AnnData either. As also mentioned on Slack, subset_func is randomized, so using it is not possible here.

@ivirshup
Member

Why would subset_func being randomized matter here?

adata.X[idx] should have some relationship to adata[idx].X regardless of the value of idx, no?
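
A minimal sketch of that invariant (the AnnData construction here is illustrative, not the PR's actual test):

import numpy as np
import anndata as ad
from anndata.tests.helpers import assert_equal

adata = ad.AnnData(np.arange(12, dtype=np.float32).reshape(4, 3))
idx = np.array([True, False, True, False])

# Subsetting then accessing X should match accessing X then subsetting,
# whether idx is a boolean mask or its integer equivalent.
assert_equal(np.asarray(adata[idx].X), adata.X[idx])
assert_equal(np.asarray(adata[np.where(idx)[0]].X), adata.X[idx])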

@ilan-gold
Contributor Author

Oh @ivirshup, do you just want to test whether it fails or not? I thought we wanted to check equality, i.e. assert_equal(adata[subset_from_subset_func1, :], adata[subset_from_subset_func2, :]), which I don't think is possible because subset_from_subset_func2 will not have the same indices as subset_from_subset_func1.

@ilan-gold
Contributor Author

@ivirshup Done. Sorry for the confusion.

@ilan-gold ilan-gold self-assigned this Feb 20, 2024
@ivirshup ivirshup left a comment

So, I just tried seeing if we would continue using boolean masks on repeated indexing. Not surprisingly, we don't:

import scanpy as sc

adata = sc.datasets.pbmc3k_processed().raw.to_adata()
v = adata[adata.obs["louvain"].isin(["B cells", "NK cells"])]
v2 = v[v.obs["louvain"] == "B cells"]

v._oidx
# array([False,  True, False, ...,  True,  True, False])

v2._oidx
# array([   1,   10,   18,   19,   20,   23,   31,   55,   75,   85,   88,
#          89,   92,   93,  105,  107,  117,  122,  127,  134,  139,  143,
#         146,  160,  162,  165,  178,  179,  181,  190,  192,  218,  231,
# ...

This means it's quite easy to hit the unoptimized case for any sort of more complex querying.

Do you think we should fix this by continuing to pass through booleans, or do we now need to address the integer case? I think masks on top of masks could happen in this PR (though it doesn't strictly have to), but the integer case would be a separate PR.

@ilan-gold
Contributor Author

@ivirshup I think it's fine for that not to hit the optimization, since the contiguous slices that would be used have an average length of basically 1. Do you see a significant performance change with this? The original issue was about long(ish) contiguous slices: #1224

@ivirshup
Member

We could resolve nested boolean masks like:

def update_mask_numpy(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    mask_new[np.flatnonzero(mask_orig)[mask_incoming]] = True
    return mask_new

There may be a better solution if we use numba or add some compiled code:

import numpy as np, numba

@numba.njit
def update_mask_numba(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    incoming_pos = 0

    for orig_pos, val in enumerate(mask_orig):
        if val:
            mask_new[orig_pos] = mask_incoming[incoming_pos]
            incoming_pos += 1
    if incoming_pos != len(mask_incoming):
        raise ValueError("Incoming mask is too long")

    return mask_new

def update_mask_numpy(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    mask_new[np.flatnonzero(mask_orig)[mask_incoming]] = True
    return mask_new

rng = np.random.default_rng()

def update_mask_bench(size1, size2, size3):
    int_idx1 = rng.choice(size1, size2, replace=False)
    int_idx2 = rng.choice(size2, size3, replace=False)

    bool_idx1 = np.zeros(size1, dtype=bool)
    bool_idx2 = np.zeros(size2, dtype=bool)
    bool_idx1[int_idx1] = True
    bool_idx2[int_idx2] = True

    %timeit update_mask_numba(bool_idx1, bool_idx2)
    %timeit update_mask_numpy(bool_idx1, bool_idx2)

In [57]: update_mask_bench(1_000_000, 500_000, 10_000)
4.55 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.58 ms ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [58]: update_mask_bench(1_000_000, 500_000, 400_000)
4.51 ms ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.75 ms ± 7.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [59]: update_mask_bench(1_000_000, 500_000, 450_000)
4.51 ms ± 2.86 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.08 ms ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [60]: update_mask_bench(10_000_000, 9_000_000, 8_000_000)
18.9 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
63.1 ms ± 89.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [61]: update_mask_bench(10_000_000, 9_000_000, 500_000)
18.8 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
39 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [62]: update_mask_bench(10_000_000, 1_000_000, 500_000)
14.8 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.2 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [64]: update_mask_bench(10_000_000, 1_000_000, 100_000)
14.7 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.4 ms ± 62.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So the differences are never huge, and we're talking about tens of milliseconds for 10 million rows in the worst cases.

We could dispatch to this in _resolve_idx like so:

@_resolve_idx.register(np.ndarray)
def _resolve_idx_ndarray(old, new, l):
    if is_bool_dtype(old) and is_bool_dtype(new):
        return update_mask(old, new)
    if is_bool_dtype(old):
        old = np.where(old)[0]
    return old[new]
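
For reference, a quick toy check of the nested-mask resolution using the update_mask_numpy sketch above (values chosen only for illustration):

import numpy as np

mask_orig = np.array([True, False, True, True])    # selects rows 0, 2, 3
mask_incoming = np.array([True, False, True])      # of those, keep the 1st and 3rd
update_mask_numpy(mask_orig, mask_incoming)
# array([ True, False, False,  True])  -> rows 0 and 3 of the original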

@ivirshup
Member

@ivirshup I think it's fine for that not to hit the optimization, since the contiguous slices that would be used have an average length of basically 1. Do you see a significant performance change with this? The original issue was about long(ish) contiguous slices: #1224

This is just a toy case, where the data is not optimized for querying; I just wanted to see whether we would propagate the masks. In data optimized for querying, we would sort by the fields being queried on, which would lead to longer runs.

@ilan-gold
Contributor Author

Do you think we should fix this by continuing to pass through booleans, or do we now need to address the integer case?

I think the mask type should stay boolean, then, yes. I also think the numpy performance is acceptable; your test cases were quite large. I can add this into the current PR.
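
A hypothetical test along those lines (the fixture, names, and the _oidx assertion are assumptions, not the PR's actual test code):

import numpy as np
import anndata as ad
from anndata.tests.helpers import assert_equal

def test_nested_boolean_masks():
    adata = ad.AnnData(np.arange(50, dtype=np.float32).reshape(10, 5))
    mask1 = np.tile([True, False], 5)                    # rows 0, 2, 4, 6, 8
    mask2 = np.array([True, True, False, True, False])   # of those, rows 0, 2, 6

    v2 = adata[mask1][mask2]

    # If nested masks are resolved as masks, the view's index stays boolean...
    assert v2._oidx.dtype == bool
    # ...and agrees with composing the two masks by hand.
    combined = np.zeros_like(mask1)
    combined[np.flatnonzero(mask1)[mask2]] = True
    assert_equal(np.asarray(v2.X), adata.X[combined])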

@ivirshup ivirshup self-requested a review February 23, 2024 14:57
@ivirshup ivirshup left a comment

Great!

@ivirshup ivirshup merged commit a478647 into main Feb 23, 2024
15 checks passed
@ivirshup ivirshup deleted the ig/anndata_slice_indexing branch February 23, 2024 14:58
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Feb 23, 2024
flying-sheep pushed a commit that referenced this pull request Feb 23, 2024
…pass down to `X`) (#1387)

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>