
Sparse X Chunks #568

Open · wants to merge 12 commits into main

Conversation

ilan-gold
Contributor

Fixes #524.

I only pass in the chunks arg to indices and data since those are really the "data" arrays. I don't think it makes sense to do it for indptr since you really need the whole thing in memory to be able to index.
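
For context, a CSC/CSR matrix is stored in zarr as three 1D arrays, and this change applies the chunks setting only to the per-nonzero arrays. A minimal sketch of that layout (paths, chunk sizes, and the exact calls are illustrative, not the PR's actual code):

import scipy.sparse as sparse
import zarr

# Illustrative: write the three component arrays of a CSC matrix to a
# zarr group, chunking only the per-nonzero arrays (data and indices).
X = sparse.random(13_000, 2_000, density=0.01, format="csc")
g = zarr.open_group("X_store.zarr", mode="w")

chunks = (100_000,)  # hypothetical 1D chunk size
g.create_dataset("data", data=X.data, chunks=chunks)
g.create_dataset("indices", data=X.indices, chunks=chunks)
# indptr has only n_cols + 1 entries and is needed in full for indexing,
# so this sketch writes it without an explicit chunks argument.
g.create_dataset("indptr", data=X.indptr)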

@codecov

codecov bot commented May 28, 2021

Codecov Report

Merging #568 (d32526c) into master (9620645) will increase coverage by 0.05%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
+ Coverage   86.74%   86.80%   +0.05%     
==========================================
  Files          25       25              
  Lines        3545     3561      +16     
==========================================
+ Hits         3075     3091      +16     
  Misses        470      470              
Impacted Files           | Coverage Δ
anndata/_io/read.py      | 72.11% <ø> (ø)
anndata/_core/anndata.py | 82.75% <100.00%> (ø)
anndata/_io/zarr.py      | 94.27% <100.00%> (+0.43%) ⬆️

@ilan-gold
Contributor Author

ilan-gold commented May 28, 2021

To provide some motivation: the chunking of the sparse matrix (specifically CSC) can make fetching arbitrary genes (i.e. columns) costly when the chunks are large due to default behavior. The idea of our approach is to make as small a request as possible for each selection so the UI remains responsive, or to make no request at all thanks to caching. For example, this demo is backed by a zarr AnnData store with a CSC matrix (just sitting on a file server, no special API or anything), but the chunks (i.e. gene selections) that we fetch are hundreds of KB each, and this example is "only" ~13,000 cells. We would like to be able to access chunks efficiently for datasets with hundreds of thousands of cells:
[Screenshot: network requests for the sparse (CSC) zarr-backed demo]
I have a dataset with a million cells, so having some more fine-grained control would be nice. For comparison, here is a dense-matrix-backed example with considerably smaller chunk requests:
[Screenshot: network requests for the dense-matrix-backed example]
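
To make the fetch pattern concrete, here is a sketch (store path and gene index are made up) of why one gene selection against a CSC-stored X only needs a small slice of data and indices once those arrays are chunked finely:

import zarr

# Illustrative: fetch a single gene (column j) from a CSC matrix stored as
# data / indices / indptr arrays in a zarr group on a file server.
g = zarr.open_group("X_store.zarr", mode="r")
j = 42  # hypothetical gene index

start, stop = int(g["indptr"][j]), int(g["indptr"][j + 1])
values = g["data"][start:stop]    # nonzero expression values for gene j
cells = g["indices"][start:stop]  # row (cell) indices of those values
# Only the chunks of data/indices overlapping [start, stop) are fetched,
# so finer chunking keeps each per-gene request small.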

@ilan-gold
Contributor Author

@ivirshup sorry to ping directly, but is there anything else I can do for this PR?

Member

@ivirshup ivirshup left a comment

No worries. I haven't gotten around to reviewing this yet since I want to have a chance to check it out and remind myself of how the chunk API in zarr works. I'll be able to give a more informed review later this week or maybe early next week.

Some questions:

> when the chunks are large due to default behavior

How large are the chunks by default?
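
For reference, when no chunks are given zarr guesses a chunk shape from the array's shape and dtype; one way to inspect what it picks (illustrative shape, exact values depend on zarr's heuristic):

import zarr

# With chunks=True (or no chunks specified), zarr guesses a chunk shape
# from the array's shape and dtype; inspect .chunks to see the result.
z = zarr.zeros((10_000_000,), dtype="float32", chunks=True)
print(z.chunks)  # heuristic choice, typically a few hundred KB to a few MB per chunk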

Comment on lines 193 to 194
if "chunks" in dataset_kwargs:
del dataset_kwargs["chunks"]
Member

Wouldn't this mutate the dataset_kwargs dict and change downstream behavior?
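
One non-mutating alternative (a sketch, not necessarily what the PR ends up doing) is to drop the key from a copy rather than from the caller's dict:

# Sketch: work on a copy so the caller's dataset_kwargs is left untouched.
dataset_kwargs = {"chunks": (20,), "compressor": None}  # example input
local_kwargs = {k: v for k, v in dataset_kwargs.items() if k != "chunks"}
# dataset_kwargs still contains "chunks" for any later writer that needs it.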

Contributor Author

I am not sure how far "API stability" goes, but it seemed like the only place this could come from was here, where it is set explicitly.

z = zarr.open(str(zarr_pth)) # As of v2.3.2 zarr won’t take a Path
assert z["X"]["data"].chunks == (20,)
assert z["X"]["indices"].chunks == (20,)
# Chunks arg only affects the "data" arrays.
Member

If you only wanted to access a subset of genes or cells, I would imagine you would run into similar problems if the indptr array was unchunked. Does it really make sense for chunking to not apply here?

Contributor Author

@ilan-gold ilan-gold Jun 22, 2021

I was a little conflicted on this, but I suppose it couldn't hurt, for example if the axis that indptr indexes has a million entries. This array is probably smaller than data or indices anyway.
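
For a rough sense of scale (illustrative numbers only): for CSC, indptr has one entry per column plus one, while data and indices each have one entry per stored value:

import scipy.sparse as sparse

# Illustrative sizes: 10k cells x 2k genes at 5% density, stored as CSC.
X = sparse.random(10_000, 2_000, density=0.05, format="csc")
print(X.indptr.shape)   # (2001,)     -> n_cols + 1 entries
print(X.data.shape)     # (1000000,)  -> one entry per stored nonzero
print(X.indices.shape)  # same length as data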

@ilan-gold
Contributor Author

ilan-gold commented Jul 6, 2021

@ivirshup I am running this locally on Python 3.8 without issue. I also tried running the CI on the initial commit of this PR, and it now fails with the same issue that appears to be causing failures here (something about an excel reader dependency?). Should I open an issue?

Other than that, I allowed the argument to pass through to indptr as discussed. Looking forward to next steps.

Member

@ivirshup ivirshup left a comment

Sorry for the longer than expected wait! PhD stuff and a troublesome scanpy release.

An issue I'm not sure how to deal with here is that chunks=(100, 100) now has very different meanings for sparse and dense arrays. I believe that in the future we are going to want the chunks to divide sparse arrays like they do dense arrays. I think the solution you had proposed previously (#523 (comment)) is more appropriate here.

Could you make this behaviour available through a sparse_chunks argument?


I'll fix the failing excel test (#588)
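
A sketch of what the requested interface could look like from the caller's side; sparse_chunks is the name from this discussion, but the exact signature (tuple vs. scalar, defaults) is hypothetical here:

import anndata as ad

adata = ad.read_h5ad("example.h5ad")  # hypothetical input file

# Dense arrays keep the usual 2D chunking; the sparse component arrays
# (data / indices / indptr) would get their own 1D chunk size.
adata.write_zarr(
    "example.zarr",
    chunks=(1000, 1000),       # applies to dense arrays
    sparse_chunks=(100_000,),  # hypothetical form of the kwarg for sparse matrices
)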

@ilan-gold
Contributor Author

Will do @ivirshup. Thanks.

@ilan-gold
Contributor Author

@ivirshup All seems well. Let me know if there's anything else I can do.

@ilan-gold
Contributor Author

@ivirshup Also opened ilan-gold#1 into this branch for passing the chunks and sparse_chunks to the layers as well. It seems like a nice, low-risk extension that should give us a lot of useful functionality for visualizing different matrices from the same AnnData store without introducing a wildly new API.

@ilan-gold ilan-gold requested a review from ivirshup July 21, 2021 13:35
@ivirshup
Member

ivirshup commented Jul 26, 2021

Thanks for the changes! Sorry about the long wait for feedback.

> Also opened ilan-gold#1 into this branch for passing the chunks and sparse_chunks to the layers as well

I'd like that addition to be made. It makes more sense to me if this is applied to all sparse matrices in the anndata object.

@ilan-gold
Contributor Author

ilan-gold commented Jul 26, 2021

@ivirshup do you want to have a look at it first, or should I merge it into this branch? Or do you want to merge this branch, and then I can make a PR here for that feature as well?

@ivirshup
Member

I'm happy for you to merge that branch into this PR.

`layers` use `chunks` and `sparse_chunks` API for `zarr`
@ilan-gold
Contributor Author

Done! Feel free to re-review, merge, request changes etc. Thanks!

@ilan-gold
Contributor Author

@ivirshup I think the pre-commit should pass now.
