
Patch AnnData.__sizeof__() for backed datasets #1230

Merged
merged 22 commits into scverse:main on Nov 17, 2023

Conversation

Neah-Ko
Contributor

@Neah-Ko Neah-Ko commented Nov 7, 2023


codecov bot commented Nov 7, 2023

Codecov Report

Merging #1230 (a362854) into main (1f5965b) will decrease coverage by 1.86%.
The diff coverage is 95.23%.

❗ Current head a362854 differs from pull request most recent head c77deed. Consider uploading reports for the commit c77deed to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1230      +/-   ##
==========================================
- Coverage   84.97%   83.12%   -1.86%     
==========================================
  Files          34       34              
  Lines        5399     5405       +6     
==========================================
- Hits         4588     4493      -95     
- Misses        811      912     +101     
Flag | Coverage Δ
gpu-tests | ?

Flags with carried forward coverage won't be shown.

Files | Coverage Δ
anndata/_core/anndata.py | 85.33% <95.23%> (+2.20%) ⬆️

... and 7 files with indirect coverage changes

@flying-sheep flying-sheep added this to the 0.10.4 milestone Nov 7, 2023
Member

@flying-sheep flying-sheep left a comment


Thanks! This helps a lot, but I think there are still a few assumptions that could break.

We can of course help out with testing or so, just tell us if you need support!

PS: Please also add a release note here:

```{rubric} Bugfix
```
* Only try to use `Categorical.map(na_action=…)` in actually supported Pandas ≥2.1 {pr}`1226` {user}`flying-sheep`

@Neah-Ko
Contributor Author

Neah-Ko commented Nov 8, 2023

Thanks! This helps a lot, but I think there are still a few assumptions that could break.

We can of course help out with testing or so, just tell us if you need support!

PS: Please also add a release note here:

```{rubric} Bugfix
```
* Only try to use `Categorical.map(na_action=…)` in actually supported Pandas ≥2.1 {pr}`1226` {user}`flying-sheep`

Hello @flying-sheep

Maybe I need some help with testing / refining the specs we are aiming for here.

I designed this naive test to append to anndata/tests/test_backed_sparse.py:

def test_backed_sizeof(ondisk_equivalent_adata):
    csr_mem, csr_disk, csc_disk, dense_disk = ondisk_equivalent_adata

    assert_equal(dense_disk.__sizeof__(), csr_mem.__sizeof__())
    assert_equal(dense_disk.__sizeof__(), csr_disk.__sizeof__())
    assert_equal(dense_disk.__sizeof__(), csc_disk.__sizeof__())

It does two passes, testing both the h5ad and zarr backends (you may add diskfmt to the argument list of the test function to check).

However, it highlighted that the current cs_to_bytes() implementation can return quite different results from multiplying the number of elements by the size of an individual element.

E.g. if you place a debug breakpoint on the first assert and execute some commands:

nelem_x_size = lambda X: np.array(X.shape).prod() * X.dtype.itemsize
cstb = lambda X: X.data.nbytes + X.indptr.nbytes + X.indices.nbytes

# h5py pass
cstb(csr_mem.X)
3204
cstb(csc_disk.X._to_backed())
3204
cstb(csr_disk.X._to_backed())
3204
nelem_x_size(dense_disk.X)
20000

# zarr pass
cstb(csr_mem.X)
3204
cstb(csc_disk.X)
3204
cstb(csr_disk.X)
3204
nelem_x_size(dense_disk.X)
20000

Lead

Then I decided to try re-implementing the get_size function like this:

def get_size(X):
    if isinstance(X, (h5py.Dataset, 
                      sparse.csr_matrix,
                      sparse.csc_matrix,
                      BaseCompressedSparseDataset)):
        return np.array(X.shape).prod() * X.dtype.itemsize
    else:
        return X.__sizeof__()

Effect on the test:

# h5py pass
get_size(csr_mem.X)
20000
get_size(csr_disk.X)
20000
get_size(csc_disk.X)
20000
get_size(dense_disk.X)
20000

# zarr pass
get_size(csr_mem.X)
20000
get_size(csr_disk.X)
20000
get_size(csc_disk.X)
20000
get_size(dense_disk.X)
20128

The test fails because the size of dense_disk.X, an np.ndarray, is slightly bigger than the sum of its parts. Now I feel a little blocked, because X can contain many data structures and harmonizing this calculation to the bit seems near-impossible at worst and hacky at best.

Reflections

I am starting to question implementing this directly in __sizeof__(), since it should in principle return the size of the object and not the size it would have if the data had been realized.

Maybe this deserves another function with a more explicit name? Or that function could simply compute the size of the data, making it less precise but good enough in terms of order of magnitude.

I see that #981 and #947 are about adding lazy support for coordinates other than X; I think this is something we need to keep in mind while designing that feature as well.

Let me know what you think.

Best,

@flying-sheep
Member

X can contain many data structures, harmonizing this calculation to the bit seems near-impossible at worst and hacky at best.

What do you mean? Do you mean things like DataFrames, which have several parts, or that there can be complex dtypes whose size isn't easy to calculate?

I’d say: Only test simple cases.

Unless I misunderstood and even simple arrays can have varying sizes. In that case maybe just assert lower_bound < size < upper_bound or so.
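
A minimal sketch of what such a bounded check could look like (the tolerance and the test name are illustrative, not a prescription):

```python
def test_backed_sizeof_bounded(ondisk_equivalent_adata):
    csr_mem, csr_disk, csc_disk, dense_disk = ondisk_equivalent_adata

    # The dense ndarray carries some object overhead on top of its raw buffer,
    # so only require each size to land within a loose tolerance of the dense one.
    dense_size = dense_disk.__sizeof__()
    for adata in (csr_mem, csr_disk, csc_disk):
        size = adata.__sizeof__()
        assert 0.9 * dense_size < size < 1.1 * dense_size
```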

@Neah-Ko
Contributor Author

Neah-Ko commented Nov 10, 2023

Hello @flying-sheep,

I meant that it would be hard to return a consistent size value for the various classes that can be returned by accessing AnnData.X.

Since you don't have a problem with an imprecise test, I've updated my solution with the lower/upper bounds asserts.

@flying-sheep
Member

flying-sheep commented Nov 13, 2023

Hm, I think I wasn’t clear enough. What I meant is

For each data type you add support for, it’s better to have not entirely precise size measurements rather than none.

With a focus on entirely. I thought you were referring to a few bytes of housekeeping data that some class has.

Also I think you’re now making it so sparse matrix size isn’t reported correctly anymore. np.array(X.shape).prod() * X.dtype.itemsize is the size a dense array would have, which can be much more than a sparse one. The code without your PR (X_csr.data.nbytes + X_csr.indptr.nbytes + X_csr.indices.nbytes) is correct AFAIK. We can add a precise test for the case of sparse matrices.
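
For reference, a minimal sketch of what a precise check for the sparse case could look like (a standalone scipy example, not the test added in this PR):

```python
import numpy as np
from scipy import sparse

X = sparse.random(100, 50, density=0.1, format="csr", dtype=np.float64)

# Exact in-memory footprint of a CSR matrix: value buffer + row pointers + column indices.
expected = X.data.nbytes + X.indptr.nbytes + X.indices.nbytes

# A dense array of the same shape would be considerably larger at this density.
assert expected < np.prod(X.shape) * X.dtype.itemsize
```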

@Neah-Ko
Contributor Author

Neah-Ko commented Nov 13, 2023

@flying-sheep
Hello,
I've returned to the cs_to_bytes method for sparse datasets and split tests between sparse and dense.

@flying-sheep
Member

flying-sheep commented Nov 13, 2023

OK, great! Now the only remaining point is

I am starting to question implementing this directly in __sizeof__(), since it should in principle return the size of the object and not the size it would have if the data had been realized.

Actually it should only return the size of the AnnData object, not even the things it refers to, see the docs: https://docs.python.org/3/library/sys.html#sys.getsizeof

I think it makes sense to customize it. For now, we could change __sizeof__ to something like

def __sizeof__(self, *, with_disk: bool = False) -> int:
    ...

Then sys.getsizeof(adata) will still return the (less wrong) value of the approximate total memory size, but you can manually call adata.__sizeof__(with_disk=True) to get memory + on disk.
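
Roughly the intended usage under that proposal (the file path here is hypothetical):

```python
import sys
import anndata as ad

adata = ad.read_h5ad("data.h5ad", backed="r")  # hypothetical backed file

in_mem = sys.getsizeof(adata)             # goes through __sizeof__() → in-memory footprint only
total = adata.__sizeof__(with_disk=True)  # additionally counts the backed on-disk matrices
```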

If we want to follow the specs, we would have to change it to

def __sizeof__(self, *, with_fields: bool = False, with_disk: bool = False) -> int:
    ...

which means we’d have to change behavior, so maybe let’s not do this right now.

What do you think?

@Neah-Ko
Contributor Author

Neah-Ko commented Nov 13, 2023

@flying-sheep
Makes sense, I agree that we don't need to change the behavior too much. I've pushed an implementation with the with_disk argument and modified tests accordingly. Let me know if this is what you had in mind.

Member

@flying-sheep flying-sheep left a comment


Pretty much!

I think the behavior should be:

  1. with_disk=True → everything
  2. with_disk=False → all in-memory structures, sparse or dense.
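
A rough sketch of how a per-field size helper could dispatch on that flag (a simplified illustration of the behavior above, not the code merged in this PR):

```python
import h5py
import numpy as np
from scipy import sparse

def get_size(X, *, with_disk: bool = False) -> int:
    """Illustrative size estimate for a single AnnData field."""
    if isinstance(X, (sparse.csr_matrix, sparse.csc_matrix)):
        # In-memory sparse: count the actual buffers, regardless of with_disk.
        return X.data.nbytes + X.indptr.nbytes + X.indices.nbytes
    if isinstance(X, np.ndarray):
        # In-memory dense: counted regardless of with_disk.
        return X.nbytes
    if isinstance(X, h5py.Dataset):
        # Backed (on-disk) dense dataset: only counted when with_disk is requested.
        return int(np.prod(X.shape)) * X.dtype.itemsize if with_disk else 0
    return X.__sizeof__()
```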

@flying-sheep flying-sheep enabled auto-merge (squash) November 17, 2023 09:29
@flying-sheep flying-sheep merged commit d4cde5c into scverse:main Nov 17, 2023
12 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Nov 17, 2023
@flying-sheep
Member

Thank you for the PR and for being patient with my many requests 😄

flying-sheep pushed a commit that referenced this pull request Nov 17, 2023
…cked datasets) (#1234)

Co-authored-by: Etienne JODRY <Etienne.JODRY@hotmail.fr>
@Neah-Ko
Contributor Author

Neah-Ko commented Nov 17, 2023

Thank you for the PR and for being patient with my many requests 😄

Sure, with pleasure. It was fun to dig into it :) Happy that it passed.

Best,

Development

Successfully merging this pull request may close these issues.

scipy.sparse.issparse check is always false in AnnData.__sizeof__() method + csr_matrix() realizes data
2 participants