Sparse X Chunks #568
base: main
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##           master     #568      +/-   ##
==========================================
+ Coverage   86.74%   86.80%   +0.05%
==========================================
  Files          25       25
  Lines        3545     3561      +16
==========================================
+ Hits         3075     3091      +16
  Misses        470      470
```
To provide some motivation: with a sparse matrix (specifically CSC), the default chunking can make fetching arbitrary genes (i.e. columns) costly when the chunks are large. The idea of our approach is to make as small a request as possible for each selection so the UI remains responsive, or to make no request at all thanks to caching. For example, this demo is backed by a zarr AnnData store with a CSC matrix (just sitting on a file server, no special API or anything), but the chunks (i.e. gene selections) that we fetch are hundreds of KB in size, and this example is "only" ~13,000 cells. We would like to be able to access chunks efficiently for datasets that have hundreds of thousands of cells.
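To make the cost model concrete, here is a minimal sketch (my own illustration, not code from this PR; the sizes and chunk shapes are made up) of a CSC matrix stored as flat 1-D zarr arrays and a single-gene fetch, using the zarr v2 API:

```python
import scipy.sparse as sp
import zarr

# A small CSC matrix standing in for an X of ~13k cells.
X = sp.random(13000, 2000, density=0.01, format="csc", dtype="float32")

g = zarr.group()  # in-memory here; a remote store behaves the same way
g.create_dataset("data", data=X.data, chunks=(1000,))
g.create_dataset("indices", data=X.indices, chunks=(1000,))
g.create_dataset("indptr", data=X.indptr)

# Fetching gene (column) j only needs the slice indptr[j]:indptr[j + 1]
# of "data"/"indices", so the request size scales with the chunk size,
# not with the size of the whole matrix.
j = 42
start, end = X.indptr[j], X.indptr[j + 1]
col_values = g["data"][start:end]
col_rows = g["indices"][start:end]
```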
@ivirshup sorry to ping directly, but is there anything else I can do for this PR?
No worries. I haven't gotten around to reviewing this yet since I want to have a chance to check it out and remind myself of how the chunk API in zarr works. I'll be able to give a more informed review later this week or maybe early next week.
Some questions:
> when the chunks are large due to default behavior

How large are the chunks by default?
anndata/_io/zarr.py (Outdated)

```python
if "chunks" in dataset_kwargs:
    del dataset_kwargs["chunks"]
```
Wouldn't this mutate the `dataset_kwargs` dict and change downstream behavior?
I am not sure how deep "API stability" goes, but it seemed the only place this could come from was here, where it is set explicitly.
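For comparison, a non-mutating variant would drop the key from a shallow copy so the caller's dict is left untouched (a sketch, not the code that ended up in the PR):

```python
# Shallow-copy first, then discard the key; the caller's dict is unchanged.
dataset_kwargs = dict(dataset_kwargs)
dataset_kwargs.pop("chunks", None)
```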
anndata/tests/test_readwrite.py (Outdated)

```python
z = zarr.open(str(zarr_pth))  # As of v2.3.2 zarr won't take a Path
assert z["X"]["data"].chunks == (20,)
assert z["X"]["indices"].chunks == (20,)
# Chunks arg only affects the "data" arrays.
```
If you only wanted to access a subset of genes or cells, I would imagine you would run into similar problems if the `indptr` array was unchunked. Does it really make sense for chunking to not apply here?
I was a little conflicted on this, but I suppose it couldn't hurt, for example if you have a million variables along the axis that `indptr` indexes. That array is probably smaller than `data` or `indices` anyway.
@ivirshup I am running this locally on Python 3.8 without issue. I also tried the CI on the initial commit to this PR, and it now fails with the same issue that appears to be causing failures here (something about an excel reader dependency?). Should I open an issue? Other than that, I allowed the argument to pass through to …
Sorry for the longer than expected wait! PhD stuff and a troublesome scanpy release.
An issue I'm not sure how to deal with here is that `chunks=(100, 100)` now has very different meanings for sparse and dense arrays. I believe that in the future we are going to want `chunks` to divide sparse arrays the same way it divides dense arrays. I think the solution you had proposed previously (#523 (comment)) is more appropriate here. Could you make this behaviour available through a `sparse_chunks` argument?
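To illustrate the ambiguity (my own sketch, assuming the zarr v2 API): for a dense array, `chunks=(100, 100)` tiles both axes, while a sparse matrix is stored as three 1-D arrays, so the same tuple has no direct interpretation there:

```python
import numpy as np
import zarr

# Dense: chunks=(100, 100) is a genuine 2-D tiling.
dense = zarr.array(np.zeros((1000, 1000)), chunks=(100, 100))
print(dense.chunks)  # (100, 100)

# Sparse storage is three 1-D arrays (data/indices/indptr), where only a
# 1-D chunk shape like (100,) applies -- hence a separate `sparse_chunks`
# argument to keep the two meanings apart.
flat = zarr.array(np.zeros(10_000), chunks=(100,))
print(flat.chunks)  # (100,)
```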
I'll fix the failing excel test (#588).
Will do @ivirshup. Thanks.
@ivirshup All seems well. Let me know if there's anything else I can do.
@ivirshup Also opened ilan-gold#1 into this branch for passing the …
Thanks for the changes! Sorry about the long wait for feedback.
I'd like that addition to be made. It makes more sense to me if this was applied to all sparse matrices in the anndata object.
@ivirshup do you want to have a look at it first, or should I merge it into this branch? Or do you want to merge this branch, and then I can make a PR here for that feature as well?
I'm happy for you to merge that branch into this PR.
`layers` use `chunks` and `sparse_chunks` API for `zarr`
Done! Feel free to re-review, merge, request changes etc. Thanks!
@ivirshup I think the pre-commit should pass now.
Fixes #524.
I only pass in the chunks arg to `indices` and `data` since those are really the "data" arrays. I don't think it makes sense to do it for `indptr` since you really need the whole thing in memory to be able to index.
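As a rough sketch of that layout (illustrative names and sizes, not the PR's actual write path): the chunks argument would apply to `data` and `indices`, while `indptr` is written with zarr's defaults so it can be read whole for indexing:

```python
import scipy.sparse as sp
import zarr

X = sp.random(100, 50, density=0.1, format="csc", dtype="float32")
g = zarr.group()
chunks = (20,)
g.create_dataset("data", data=X.data, chunks=chunks)
g.create_dataset("indices", data=X.indices, chunks=chunks)
# indptr is needed in full to locate any column's extent in data/indices,
# so it keeps zarr's default chunking rather than the user-supplied one.
g.create_dataset("indptr", data=X.indptr)
```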