
Patch AnnData.__sizeof__() for backed datasets #1230

Merged
merged 22 commits into scverse:main on Nov 17, 2023

Conversation

Neah-Ko
Contributor

@Neah-Ko Neah-Ko commented Nov 7, 2023


codecov bot commented Nov 7, 2023

Codecov Report

Merging #1230 (a362854) into main (1f5965b) will decrease coverage by 1.86%.
The diff coverage is 95.23%.

❗ Current head a362854 differs from pull request most recent head c77deed. Consider uploading reports for the commit c77deed to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1230      +/-   ##
==========================================
- Coverage   84.97%   83.12%   -1.86%     
==========================================
  Files          34       34              
  Lines        5399     5405       +6     
==========================================
- Hits         4588     4493      -95     
- Misses        811      912     +101     
Flag | Coverage Δ
gpu-tests | ?

Flags with carried forward coverage won't be shown.

Files | Coverage Δ
anndata/_core/anndata.py | 85.33% <95.23%> (+2.20%) ⬆️

... and 7 files with indirect coverage changes

@flying-sheep flying-sheep added this to the 0.10.4 milestone Nov 7, 2023
Member

@flying-sheep flying-sheep left a comment


Thanks! This helps a lot, but I think there are still a few assumptions that could break.

We can of course help out with testing or so, just tell us if you need support!

PS: Please also add a release note here:

```{rubric} Bugfix
```
* Only try to use `Categorical.map(na_action=…)` in actually supported Pandas ≥2.1 {pr}`1226` {user}`flying-sheep`

@Neah-Ko
Contributor Author

Neah-Ko commented Nov 8, 2023

Thanks! This helps a lot, but I think there are still a few assumptions that could break.

We can of course help out with testing or so, just tell us if you need support!

PS: Please also add a release note here:

```{rubric} Bugfix
```
* Only try to use `Categorical.map(na_action=…)` in actually supported Pandas ≥2.1 {pr}`1226` {user}`flying-sheep`

Hello @flying-sheep

Maybe I need some help with testing / refining the specs we are aiming for here.

I designed this naive test to append to anndata/tests/test_backed_sparse.py:

def test_backed_sizeof(ondisk_equivalent_adata):
    csr_mem, csr_disk, csc_disk, dense_disk = ondisk_equivalent_adata

    assert_equal(dense_disk.__sizeof__(), csr_mem.__sizeof__())
    assert_equal(dense_disk.__sizeof__(), csr_disk.__sizeof__())
    assert_equal(dense_disk.__sizeof__(), csc_disk.__sizeof__())

It does two passes, testing both the h5ad and zarr backends (you may add diskfmt to the argument list of the test function to check).

However, it highlighted that the current cs_to_bytes() implementation can return quite different results from multiplying the number of elements by the size of an individual element.

E.g. if you place a debug breakpoint on the first assert and execute some commands:

nelem_x_size = lambda X: np.array(X.shape).prod() * X.dtype.itemsize
cstb = lambda X: X.data.nbytes + X.indptr.nbytes + X.indices.nbytes

# h5py pass
cstb(csr_mem.X)
3204
cstb(csc_disk.X._to_backed())
3204
cstb(csr_disk.X._to_backed())
3204
nelem_x_size(dense_disk.X)
20000

# zarr pass
cstb(csr_mem.X)
3204
cstb(csc_disk.X)
3204
cstb(csr_disk.X)
3204
nelem_x_size(dense_disk.X)
20000

Lead

Then I decided to try re-implementing the get_size function like this:

def get_size(X):
    if isinstance(X, (h5py.Dataset, 
                      sparse.csr_matrix,
                      sparse.csc_matrix,
                      BaseCompressedSparseDataset)):
        return np.array(X.shape).prod() * X.dtype.itemsize
    else:
        return X.__sizeof__()

Effect on the test:

# h5py pass
get_size(csr_mem.X)
20000
get_size(csr_disk.X)
20000
get_size(csc_disk.X)
20000
get_size(dense_disk.X)
20000

# zarr pass
get_size(csr_mem.X)
20000
get_size(csr_disk.X)
20000
get_size(csc_disk.X)
20000
get_size(dense_disk.X)
20128

The test fails because the size of dense_disk.X, an np.ndarray, is slightly bigger than the sum of its parts. Now I feel a little blocked, because X can contain many data structures and harmonizing this calculation to the bit seems near-impossible at worst and hacky at best.

Reflections

I am starting to question implementing this directly in __sizeof__(), since it should in principle return the size of the object and not the size it would have if the data had been realized.

Maybe this deserves another function with a more explicit name? Or that function could simply compute the size of the data, making it less precise but good enough in terms of order of magnitude.

I see that #981 and #947 are about adding lazy support for coordinates other than X; I think this is something we need to keep in mind while designing that feature as well.

Let me know what you think.

Best,

@flying-sheep
Member

X can contain many data structures, harmonizing this calculation to the bit seems near-impossible at worst and hacky at best.

What do you mean? Do you mean things like DataFrames, which have several parts, or that there can be complex dtypes whose size isn't easy to calculate?

I’d say: Only test simple cases.

Unless I misunderstood and even simple arrays can have varying sizes. In that case maybe just assert lower_bound < size < upper_bound or so.
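
A minimal sketch of what such a bounded check could look like (the tolerance and the test name are illustrative, not a prescription):

```python
def test_backed_sizeof_bounded(ondisk_equivalent_adata):
    csr_mem, csr_disk, csc_disk, dense_disk = ondisk_equivalent_adata

    # The dense ndarray carries some object overhead on top of its raw buffer,
    # so only require each size to land within a loose tolerance of the dense one.
    dense_size = dense_disk.__sizeof__()
    for adata in (csr_mem, csr_disk, csc_disk):
        size = adata.__sizeof__()
        assert 0.9 * dense_size < size < 1.1 * dense_size
```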

@Neah-Ko
Contributor Author

Neah-Ko commented Nov 10, 2023

Hello @flying-sheep,

I meant that it would be hard to return a consistent size value for the various classes that can be returned by accessing AnnData.X.

Since you don't have a problem with an imprecise test, I've updated my solution with the lower/upper bounds asserts.

@flying-sheep
Member

flying-sheep commented Nov 13, 2023

Hm, I think I wasn’t clear enough. What I meant is

For each data type you add support for, it’s better to have not entirely precise size measurements rather than none.

With a focus on entirely. I thought you were referring to a few bytes of housekeeping data that some class has.

Also I think you’re now making it so sparse matrix size isn’t reported correctly anymore. np.array(X.shape).prod() * X.dtype.itemsize is the size a dense array would have, which can be much more than a sparse one. The code without your PR (X_csr.data.nbytes + X_csr.indptr.nbytes + X_csr.indices.nbytes) is correct AFAIK. We can add a precise test for the case of sparse matrices.
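
For reference, a minimal sketch of what a precise check for the sparse case could look like (a standalone scipy example, not the test added in this PR):

```python
import numpy as np
from scipy import sparse

X = sparse.random(100, 50, density=0.1, format="csr", dtype=np.float64)

# Exact in-memory footprint of a CSR matrix: value buffer + row pointers + column indices.
expected = X.data.nbytes + X.indptr.nbytes + X.indices.nbytes

# A dense array of the same shape would be considerably larger at this density.
assert expected < np.prod(X.shape) * X.dtype.itemsize
```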

@Neah-Ko
Contributor Author

Neah-Ko commented Nov 13, 2023

@flying-sheep
Hello,
I've returned to the cs_to_bytes method for sparse datasets and split tests between sparse and dense.

@flying-sheep
Member

flying-sheep commented Nov 13, 2023

OK, great! Now the only remaining point is

I am starting to question implementing this directly in __sizeof__(), since it should in principle return the size of the object and not the size it would have if the data had been realized.

Actually it should only return the size of the AnnData object, not even the things it refers to, see the docs: https://docs.python.org/3/library/sys.html#sys.getsizeof

I think it makes sense to customize it. For now, we could change __sizeof__ to something like

def __sizeof__(self, *, with_disk: bool = False) -> int:
    ...

Then sys.getsizeof(adata) will still return the (less wrong) value of the approximate total memory size, but you can manually call adata.__sizeof__(with_disk=True) to get memory + on disk.
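
Roughly the intended usage under that proposal (the file path here is hypothetical):

```python
import sys
import anndata as ad

adata = ad.read_h5ad("data.h5ad", backed="r")  # hypothetical backed file

in_mem = sys.getsizeof(adata)             # goes through __sizeof__() → in-memory footprint only
total = adata.__sizeof__(with_disk=True)  # additionally counts the backed on-disk matrices
```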

If we want to follow the specs, we would have to change it to

def __sizeof__(self, *, with_fields: bool = False, with_disk: bool = False) -> int:
    ...

which means we’d have to change behavior, so maybe let’s not do this right now.

What do you think?

@Neah-Ko
Contributor Author

Neah-Ko commented Nov 13, 2023

@flying-sheep
Makes sense, I agree that we don't need to change the behavior too much. I've pushed an implementation with the with_disk argument and modified tests accordingly. Let me know if this is what you had in mind.

Member

@flying-sheep flying-sheep left a comment


Pretty much!

I think the behavior should be:

  1. with_disk=True → everything
  2. with_disk=False → all in-memory structures, sparse or dense.
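
A rough sketch of how a per-field size helper could dispatch on that flag (a simplified illustration of the behavior above, not the code merged in this PR):

```python
import h5py
import numpy as np
from scipy import sparse

def get_size(X, *, with_disk: bool = False) -> int:
    """Illustrative size estimate for a single AnnData field."""
    if isinstance(X, (sparse.csr_matrix, sparse.csc_matrix)):
        # In-memory sparse: count the actual buffers, regardless of with_disk.
        return X.data.nbytes + X.indptr.nbytes + X.indices.nbytes
    if isinstance(X, np.ndarray):
        # In-memory dense: counted regardless of with_disk.
        return X.nbytes
    if isinstance(X, h5py.Dataset):
        # Backed (on-disk) dense dataset: only counted when with_disk is requested.
        return int(np.prod(X.shape)) * X.dtype.itemsize if with_disk else 0
    return X.__sizeof__()
```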

@flying-sheep flying-sheep enabled auto-merge (squash) November 17, 2023 09:29
@flying-sheep flying-sheep merged commit d4cde5c into scverse:main Nov 17, 2023
12 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Nov 17, 2023
@flying-sheep
Member

Thank you for the PR and for being patient with my many requests 😄

flying-sheep pushed a commit that referenced this pull request Nov 17, 2023
…cked datasets) (#1234)

Co-authored-by: Etienne JODRY <Etienne.JODRY@hotmail.fr>
@Neah-Ko
Contributor Author

Neah-Ko commented Nov 17, 2023

Thank you for the PR and for being patient with my many requests 😄

Sure, with pleasure. It was fun to dig into it :) Happy that it passed.

Best,

Development

Successfully merging this pull request may close these issues.

scipy.sparse.issparse check is always false in AnnData.__sizeof__() method + csr_matrix() realizes data
2 participants