Sparse X Chunks #568
base: main
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
+ Coverage   86.74%   86.80%   +0.05%
==========================================
  Files          25       25
  Lines        3545     3561      +16
==========================================
+ Hits         3075     3091      +16
  Misses        470      470
```
To provide some motivation: with a sparse matrix (specifically CSC), the default chunking can make fetching arbitrary genes (i.e. columns) costly when the chunks are large. The idea of our approach is to make as small a request as possible for each selection so the UI remains responsive, or to make no request at all thanks to caching. For example, this demo is backed by a zarr AnnData store with a CSC matrix (just sitting on a file server, no special API or anything), but the chunks (i.e. gene selections) that we fetch are hundreds of KB in size, and this example is "only" ~13,000 cells. We would like to be able to access chunks efficiently for datasets that have hundreds of thousands of cells.
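To make the cost model concrete, here is a minimal sketch (my own illustration, not code from this PR; the sizes and chunk shapes are made up) of a CSC matrix stored as flat 1-D zarr arrays and a single-gene fetch, using the zarr v2 API:

```python
import scipy.sparse as sp
import zarr

# A small CSC matrix standing in for an X of ~13k cells.
X = sp.random(13000, 2000, density=0.01, format="csc", dtype="float32")

g = zarr.group()  # in-memory here; a remote store behaves the same way
g.create_dataset("data", data=X.data, chunks=(1000,))
g.create_dataset("indices", data=X.indices, chunks=(1000,))
g.create_dataset("indptr", data=X.indptr)

# Fetching gene (column) j only needs the slice indptr[j]:indptr[j + 1]
# of "data"/"indices", so the request size scales with the chunk size,
# not with the size of the whole matrix.
j = 42
start, end = X.indptr[j], X.indptr[j + 1]
col_values = g["data"][start:end]
col_rows = g["indices"][start:end]
```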
@ivirshup sorry to ping directly, but is there anything else I can do for this PR?
No worries. I haven't gotten around to reviewing this yet since I want to have a chance to check it out and remind myself of how the chunk API in zarr works. I'll be able to give a more informed review later this week or maybe early next week.
Some questions:
> when the chunks are large due to default behavior

How large are the chunks by default?
anndata/_io/zarr.py (Outdated)

```python
if "chunks" in dataset_kwargs:
    del dataset_kwargs["chunks"]
```
Wouldn't this mutate the `dataset_kwargs` dict and change downstream behavior?
I am not sure how deep "API stability" goes, but it seemed the only place this could come from was here, where it is set explicitly.
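For comparison, a non-mutating variant would drop the key from a shallow copy so the caller's dict is left untouched (a sketch, not the code that ended up in the PR):

```python
# Shallow-copy first, then discard the key; the caller's dict is unchanged.
dataset_kwargs = dict(dataset_kwargs)
dataset_kwargs.pop("chunks", None)
```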
anndata/tests/test_readwrite.py (Outdated)

```python
z = zarr.open(str(zarr_pth))  # As of v2.3.2 zarr won't take a Path
assert z["X"]["data"].chunks == (20,)
assert z["X"]["indices"].chunks == (20,)
# Chunks arg only affects the "data" arrays.
```
If you only wanted to access a subset of genes or cells, I would imagine you would run into similar problems if the `indptr` array was unchunked. Does it really make sense for chunking to not apply here?
I was a little conflicted on this, but I suppose it couldn't hurt, for example if you have a million variables along the axis that `indptr` indexes. That array is probably smaller than `data` or `indices` anyway.
@ivirshup I am running this locally on Python 3.8 without issue. I also tried the CI on the initial commit to this PR, and it now fails with the same issue that appears to be causing failures here (something about an excel reader dependency?). Should I open an issue? Other than that, I allowed the argument to pass through to …
Sorry for the longer than expected wait! PhD stuff and a troublesome scanpy release.
An issue I'm not sure how to deal with here is that `chunks=(100, 100)` now has very different meanings for sparse and dense arrays. I believe that in the future we are going to want `chunks` to divide sparse arrays the same way it divides dense arrays. I think the solution you had proposed previously (#523 (comment)) is more appropriate here. Could you make this behaviour available through a `sparse_chunks` argument?
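To illustrate the ambiguity (my own sketch, assuming the zarr v2 API): for a dense array, `chunks=(100, 100)` tiles both axes, while a sparse matrix is stored as three 1-D arrays, so the same tuple has no direct interpretation there:

```python
import numpy as np
import zarr

# Dense: chunks=(100, 100) is a genuine 2-D tiling.
dense = zarr.array(np.zeros((1000, 1000)), chunks=(100, 100))
print(dense.chunks)  # (100, 100)

# Sparse storage is three 1-D arrays (data/indices/indptr), where only a
# 1-D chunk shape like (100,) applies -- hence a separate `sparse_chunks`
# argument to keep the two meanings apart.
flat = zarr.array(np.zeros(10_000), chunks=(100,))
print(flat.chunks)  # (100,)
```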
I'll fix the failing excel test (#588).
Will do @ivirshup. Thanks.
@ivirshup All seems well. Let me know if there's anything else I can do.
@ivirshup Also opened ilan-gold#1 into this branch for passing the …
Thanks for the changes! Sorry about the long wait for feedback.
I'd like that addition to be made. It makes more sense to me if this was applied to all sparse matrices in the anndata object.
@ivirshup do you want to have a look at it first, or should I merge it into this branch? Or do you want to merge this branch, and then I can make a PR here for that feature as well?
I'm happy for you to merge that branch into this PR.
`layers` use `chunks` and `sparse_chunks` API for `zarr`
Done! Feel free to re-review, merge, request changes etc. Thanks!
@ivirshup I think the pre-commit should pass now.
Fixes #524.
I only pass in the chunks arg to `indices` and `data` since those are really the "data" arrays. I don't think it makes sense to do it for `indptr` since you really need the whole thing in memory to be able to index.
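As a rough sketch of that layout (illustrative names and sizes, not the PR's actual write path): the chunks argument would apply to `data` and `indices`, while `indptr` is written with zarr's defaults so it can be read whole for indexing:

```python
import scipy.sparse as sp
import zarr

X = sp.random(100, 50, density=0.1, format="csc", dtype="float32")
g = zarr.group()
chunks = (20,)
g.create_dataset("data", data=X.data, chunks=chunks)
g.create_dataset("indices", data=X.indices, chunks=chunks)
# indptr is needed in full to locate any column's extent in data/indices,
# so it keeps zarr's default chunking rather than the user-supplied one.
g.create_dataset("indptr", data=X.indptr)
```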