
(feat): allow boolean indices to pass down to X #1365

Merged
merged 10 commits into main from ig/anndata_slice_indexing on Feb 23, 2024

Conversation


@ilan-gold ilan-gold commented Feb 9, 2024


codecov bot commented Feb 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.08%. Comparing base (3f6a217) to head (591dae5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1365      +/-   ##
==========================================
- Coverage   86.17%   84.08%   -2.10%     
==========================================
  Files          35       35              
  Lines        5584     5592       +8     
==========================================
- Hits         4812     4702     -110     
- Misses        772      890     +118     
Flag        Coverage Δ
gpu-tests   ?

Flags with carried forward coverage won't be shown.

Files                    Coverage Δ
anndata/_core/index.py   93.33% <100.00%> (+0.86%) ⬆️
anndata/_core/views.py   84.05% <100.00%> (-5.60%) ⬇️

... and 7 files with indirect coverage changes

@ilan-gold
Contributor Author

@ivirshup I have a fix here, but I'm unsure of the original reasoning for converting to integer indices. Their primary use seemed to be that sparse/h5py needed integers. The numpy/pandas subsetting seems to work fine with booleans; for example, we don't seem to need integer indices to form a proper mesh using ix_. I'm also aware of the array-api boolean-indexing issues. I added a test to ensure that the resulting objects are equal. Also, after #1366, we can actually inspect usage of the optimization if we keep the mocking/spy.
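
For instance, a quick generic numpy check (not code from this PR) that np.ix_ handles boolean masks directly, without an explicit np.where:

import numpy as np

X = np.arange(20).reshape(4, 5)
row_mask = np.array([True, False, True, False])
col_mask = np.array([True, True, False, False, True])

# np.ix_ accepts boolean arrays and builds the same open mesh as the
# integer indices obtained via np.where.
mesh_bool = X[np.ix_(row_mask, col_mask)]
mesh_int = X[np.ix_(np.where(row_mask)[0], np.where(col_mask)[0])]
assert np.array_equal(mesh_bool, mesh_int)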

@ilan-gold ilan-gold marked this pull request as ready for review February 12, 2024 12:04
@ivirshup
Member

I have a fix here, but I'm unsure of the original reasoning for converting to integer indices. Their primary use seemed to be that sparse/h5py needed integers.

I'm not sure what you mean by this. Are you wondering why we were doing np.where in the first place?

@ilan-gold
Contributor Author

I'm not sure what you mean by this. Are you wondering why we were doing np.where in the first place?

I'm guessing the reason was sparse/hdf5 compatibility, but I'm not sure and I don't want to miss something I'm not thinking of.
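
A rough sketch of the conversion being discussed (whether older sparse/HDF5 code paths actually required it is an assumption here, not something verified in this thread):

import numpy as np
from scipy import sparse

mask = np.array([True, False, True, True, False])
int_idx = np.where(mask)[0]  # array([0, 2, 3])

X = sparse.random(5, 3, density=0.5, format="csr")
# Current scipy CSR accepts the boolean mask directly; converting via
# np.where only matters for backends that want integer indices.
assert (X[mask].toarray() == X[int_idx].toarray()).all()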

@ivirshup
Member

  • You could intentionally break it, see what tests fail, and decide if there are any cases missing that should be tested
  • You could run the scanpy tests with this branch installed, that often catches other things

It's also possible we didn't support masking because it was easier to support fewer cases and the performance hit from doing np.where wasn't high.

@ilan-gold
Contributor Author

You could intentionally break it, see what tests fail, and decide if there are any cases missing that should be tested

This is how I approached this PR (by removing np.where), so I am not sure what else could be "intentionally [broken]" but am open to ideas.

You could run the scanpy tests with this branch installed, that often catches other things

Great idea, will definitely do this.

@ivirshup
Member

so I am not sure what else could be "intentionally [broken]" but am open to ideas.

You could invert the mask?

@ilan-gold
Contributor Author

@ivirshup I got no errors in scanpy, and nothing looks particularly out of place in AnnData either. As also mentioned on Slack, subset_func is randomized, so using it is not possible here.

@ivirshup
Member

Why would subset_func being randomized matter here?

adata.X[idx] should have some relationship to adata[idx].X regardless of the value of idx, no?
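
A minimal sketch of that invariant (the AnnData construction here is illustrative, not the PR's actual test):

import numpy as np
import anndata as ad
from anndata.tests.helpers import assert_equal

adata = ad.AnnData(np.arange(12, dtype=np.float32).reshape(4, 3))
idx = np.array([True, False, True, False])

# Subsetting then accessing X should match accessing X then subsetting,
# whether idx is a boolean mask or its integer equivalent.
assert_equal(np.asarray(adata[idx].X), adata.X[idx])
assert_equal(np.asarray(adata[np.where(idx)[0]].X), adata.X[idx])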

@ilan-gold
Contributor Author

Oh @ivirshup, do you just want to test whether it fails or not? I thought we wanted to check equality, i.e. assert_equal(adata[subset_from_subset_func1, :], adata[subset_from_subset_func2, :]), which I don't think is possible because subset_from_subset_func2 will not have the same indices as subset_from_subset_func1.

@ilan-gold
Contributor Author

@ivirshup Done. Sorry for the confusion.

@ilan-gold ilan-gold self-assigned this Feb 20, 2024
@ivirshup ivirshup left a comment

So, I just tried seeing if we would continue using boolean masks on repeated indexing. Not surprisingly, we don't:

import scanpy as sc

adata = sc.datasets.pbmc3k_processed().raw.to_adata()
v = adata[adata.obs["louvain"].isin(["B cells", "NK cells"])]
v2 = v[v.obs["louvain"] == "B cells"]

v._oidx
# array([False,  True, False, ...,  True,  True, False])

v2._oidx
# array([   1,   10,   18,   19,   20,   23,   31,   55,   75,   85,   88,
#          89,   92,   93,  105,  107,  117,  122,  127,  134,  139,  143,
#         146,  160,  162,  165,  178,  179,  181,  190,  192,  218,  231,
# ...

This means it's quite easy to hit the unoptimized case for any sort of more complex querying.

Do you think we should fix this by continuing to pass through booleans, or do we now need to address the integer case? I think masks on top of masks could happen in this PR (though it doesn't strictly have to), but the integer case would be a separate PR.

@ilan-gold
Contributor Author

@ivirshup I think it's fine for that not to hit the optimization, since the contiguous slices that would be used have an average length of basically 1. Do you see a significant performance change with this? The original issue was about long(ish) contiguous slices: #1224

@ivirshup
Member

We could resolve nested boolean masks like:

def update_mask_numpy(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    mask_new[np.flatnonzero(mask_orig)[mask_incoming]] = True
    return mask_new

There may be a better solution if we use numba or add some compiled code:

import numpy as np, numba

@numba.njit
def update_mask_numba(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    incoming_pos = 0

    for orig_pos, val in enumerate(mask_orig):
        if val:
            mask_new[orig_pos] = mask_incoming[incoming_pos]
            incoming_pos += 1
    if incoming_pos != len(mask_incoming):
        raise ValueError("Incoming mask is too long")

    return mask_new

def update_mask_numpy(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    mask_new[np.flatnonzero(mask_orig)[mask_incoming]] = True
    return mask_new

rng = np.random.default_rng()

def update_mask_bench(size1, size2, size3):
    int_idx1 = rng.choice(size1, size2, replace=False)
    int_idx2 = rng.choice(size2, size3, replace=False)

    bool_idx1 = np.zeros(size1, dtype=bool)
    bool_idx2 = np.zeros(size2, dtype=bool)
    bool_idx1[int_idx1] = True
    bool_idx2[int_idx2] = True

    %timeit update_mask_numba(bool_idx1, bool_idx2)
    %timeit update_mask_numpy(bool_idx1, bool_idx2)

In [57]: update_mask_bench(1_000_000, 500_000, 10_000)
4.55 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.58 ms ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [58]: update_mask_bench(1_000_000, 500_000, 400_000)
4.51 ms ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.75 ms ± 7.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [59]: update_mask_bench(1_000_000, 500_000, 450_000)
4.51 ms ± 2.86 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.08 ms ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [60]: update_mask_bench(10_000_000, 9_000_000, 8_000_000)
18.9 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
63.1 ms ± 89.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [61]: update_mask_bench(10_000_000, 9_000_000, 500_000)
18.8 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
39 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [62]: update_mask_bench(10_000_000, 1_000_000, 500_000)
14.8 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.2 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [64]: update_mask_bench(10_000_000, 1_000_000, 100_000)
14.7 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.4 ms ± 62.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So the differences are never huge, and we're talking about tens of milliseconds for 10 million rows in the worst cases.

We could dispatch to this in _resolve_idx like so:

@_resolve_idx.register(np.ndarray)
def _resolve_idx_ndarray(old, new, l):
    if is_bool_dtype(old) and is_bool_dtype(new):
        return update_mask(old, new)
    if is_bool_dtype(old):
        old = np.where(old)[0]
    return old[new]
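
For reference, a quick toy check of the nested-mask resolution using the update_mask_numpy sketch above (values chosen only for illustration):

import numpy as np

mask_orig = np.array([True, False, True, True])    # selects rows 0, 2, 3
mask_incoming = np.array([True, False, True])      # of those, keep the 1st and 3rd
update_mask_numpy(mask_orig, mask_incoming)
# array([ True, False, False,  True])  -> rows 0 and 3 of the original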

@ivirshup
Member

@ivirshup I think it's fine for that not to hit the optimization, since the contiguous slices that would be used have an average length of basically 1. Do you see a significant performance change with this? The original issue was about long(ish) contiguous slices: #1224

This is just a toy case, where the data is not optimized for querying; I just wanted to see whether we would propagate the masks. In data optimized for querying, we would sort by the fields being queried on, which would lead to longer runs.

@ilan-gold
Contributor Author

Do you think we should fix this by continuing to pass through booleans, or do we now need to address the integer case?

I think the mask type should stay boolean, then, yes. I also think the numpy performance is acceptable; your test cases were quite large. I can add this into the current PR.
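
A hypothetical test along those lines (the fixture, names, and the _oidx assertion are assumptions, not the PR's actual test code):

import numpy as np
import anndata as ad
from anndata.tests.helpers import assert_equal

def test_nested_boolean_masks():
    adata = ad.AnnData(np.arange(50, dtype=np.float32).reshape(10, 5))
    mask1 = np.tile([True, False], 5)                    # rows 0, 2, 4, 6, 8
    mask2 = np.array([True, True, False, True, False])   # of those, rows 0, 2, 6

    v2 = adata[mask1][mask2]

    # If nested masks are resolved as masks, the view's index stays boolean...
    assert v2._oidx.dtype == bool
    # ...and agrees with composing the two masks by hand.
    combined = np.zeros_like(mask1)
    combined[np.flatnonzero(mask1)[mask2]] = True
    assert_equal(np.asarray(v2.X), adata.X[combined])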

@ivirshup ivirshup self-requested a review February 23, 2024 14:57
@ivirshup ivirshup left a comment

Great!

@ivirshup ivirshup merged commit a478647 into main Feb 23, 2024
15 checks passed
@ivirshup ivirshup deleted the ig/anndata_slice_indexing branch February 23, 2024 14:58
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Feb 23, 2024
flying-sheep pushed a commit that referenced this pull request Feb 23, 2024
…pass down to `X`) (#1387)

Co-authored-by: Ilan Gold <ilanbassgold@gmail.com>