
lazy dataframes in .obs and .var with backed="r" mode #981

Open
gszep opened this issue Apr 15, 2023 · 13 comments · May be fixed by #983 or #1247

@gszep

gszep commented Apr 15, 2023

As far as I understand, backed mode works for .X (and possibly .layers?), since accessing that attribute returns an h5py.Dataset. But accessing .obs and .var still returns a pandas.DataFrame, which suggests that all of that data is loaded into memory. Ideally, accessing .obs or .var would return a dask.DataFrame (probably using dask's own read functions) that loads column and row slices as and when needed.
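
For illustration, a minimal sketch of the behaviour described above (the file path is hypothetical, and for sparse data .X is a backed sparse-dataset wrapper rather than an h5py.Dataset):

import anndata as ad

adata = ad.read_h5ad("example.h5ad", backed="r")
print(type(adata.X))    # h5py.Dataset (dense) or a backed sparse-dataset wrapper
print(type(adata.obs))  # pandas.DataFrame, i.e. already fully loaded into memory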

@gszep gszep changed the title lazy .obs and .var with backed="r" mode lazy dataframes in .obs and .var with backed="r" mode using dask.DataFrame Apr 15, 2023
@gszep gszep changed the title lazy dataframes in .obs and .var with backed="r" mode using dask.DataFrame lazy dataframes in .obs and .var with backed="r" mode Apr 15, 2023
@gszep
Author

gszep commented Apr 15, 2023

open to a pure h5py solution too that doesn't need additional deps

@gszep gszep linked a pull request Apr 16, 2023 that will close this issue
@ivirshup
Member

Hi @gszep! I like this idea! But I would note that @ilan-gold is already working on something similar over in:

What do you think of that approach?

@gszep
Author

gszep commented Apr 18, 2023

Fetching partial data over the internet would make sense for datasets that are on the order of TB in size. Otherwise I don't see much of an advantage over downloading the datasets to local disk and reading from there. It should be noted that I am not very familiar with zarr arrays, however, so I may be missing something.

The aim with this PR is to expose more of the lazy, constant-time random access capabilities of the HDF5 format. It would be a shame to restrict it to only .X. In our use case we have up to a billion rows and possibly hundreds of columns in .obs, of which we only need to slice subsets at any given moment.

In terms of implementation I prefer a multiple-dispatch-like approach, i.e. modifying the code in anndata/_io/specs/methods.py:

@_REGISTRY.register_read(H5Group, IOSpec("dataframe", "0.2.0"))
@_REGISTRY.register_read(ZarrGroup, IOSpec("dataframe", "0.2.0"))
def read_dataframe(elem, _reader):
    return LazyDataFrame(elem, _reader)

adding a new LazyDataFrame that behaves exactly like a pd.DataFrame requires touching fewer locations in the code. I tend to shy away from subclassing/creating abstract base classes, as in my experience no one agrees on what a "useful" level of abstraction is.
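
As a rough sketch of the idea (LazyDataFrame is hypothetical here, not an existing anndata class, and categorical/nullable columns are ignored), such a wrapper could read individual columns from the HDF5 group only on access:

import h5py
import pandas as pd

class LazyDataFrame:
    # Minimal column-lazy view over an anndata dataframe group.
    def __init__(self, group: h5py.Group, _reader=None):  # _reader unused in this sketch
        self._group = group
        # anndata stores the column order in the group's attributes
        self._columns = list(group.attrs.get("column-order", list(group.keys())))

    @property
    def columns(self):
        return self._columns

    def __getitem__(self, column: str) -> pd.Series:
        # only this column is read from disk; categorical columns (stored as
        # sub-groups of codes/categories) would need extra decoding here
        return pd.Series(self._group[column][...], name=column)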

@ilan-gold
Contributor

ilan-gold commented Apr 18, 2023

@gszep My approach will almost certainly not be zarr-specific. It's just easier for me at the moment because I can read what is happening off an HTTP server log while debugging. Also, the following:

adding a new LazyDataFrame that behaves exactly like a pd.DataFrame requires touching fewer locations on code

is a non-trivial problem, which is partially solved by my PR. In any case, the read_dispatched method should allow you to customize this if you have an implementation.

Also, out of curiosity, what sort of data do you have that is a billion rows and hundreds of columns if you can share?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

@Neah-Ko
Contributor

Neah-Ko commented Jan 31, 2024

I am having issues with the time it takes to analyze the memory footprint of big datasets, after contributing to #1230.
In use, it seems that AnnData still loads a good chunk of the file into memory when in backed mode.

I implemented a pure h5py function to see if we could, at least in principle, bypass this:

import h5py as h5
from anndata.experimental import sparse_dataset  # anndata >= 0.10

def cs_to_bytes(X) -> int:
    # size of a compressed sparse matrix: data + index pointers + indices
    return int(X.data.nbytes + X.indptr.nbytes + X.indices.nbytes)

def h5_size(filepath):
    hf = h5.File(filepath)
    count = 0
    for sub_idx in ("X", "layers", "obs", "obsm", "obsp",
                    "raw", "uns", "var", "varm", "varp"):
        subarr = hf.get(sub_idx, None)
        # wrap the HDF5 group as an anndata sparse dataset without loading it;
        # _to_backed() is a private anndata method returning a backed sparse matrix
        spd = sparse_dataset(subarr) if subarr else None
        count += cs_to_bytes(spd._to_backed()) if spd else 0
    return count

I then profiled and compared the two. You may reproduce this using this script, taking an .h5ad path as CLI argument:

Note: the datasets can be found on the CellXGene data portal; the filename contains the size in cells ("_xk").

~$ python3 h5py_script.py /datasets/thalamic_complex_13k.h5ad 
/datasets/thalamic_complex_13k.h5ad
Profiling:            <h5_size>  RSS:   1.83MB | VMS: 942.08kB | SHR 811.01kB | time:   2.18ms | result 618.07MB
Profiling:       <anndata_size>  RSS:  56.05MB | VMS:  55.58MB | SHR   1.45MB | time: 174.85ms | result 653.81MB
~$ python3 h5py_script.py /datasets/Submucosa_Epcam_16k.h5ad 
/datasets/Submucosa_Epcam_16k.h5ad
Profiling:            <h5_size>  RSS:   2.37MB | VMS:   1.43MB | SHR 811.01kB | time:   6.53ms | result 277.55MB
Profiling:       <anndata_size>  RSS:  58.94MB | VMS:  58.49MB | SHR   1.52MB | time: 205.04ms | result 314.66MB
~$ python3 h5py_script.py /datasets/mCT-seq_37k.h5ad 
/datasets/mCT-seq_37k.h5ad
Profiling:            <h5_size>  RSS:   2.09MB | VMS:   1.44MB | SHR 802.82kB | time:   2.73ms | result   2.06GB
Profiling:       <anndata_size>  RSS:  58.13MB | VMS:  56.64MB | SHR   1.34MB | time: 183.95ms | result   2.08GB
~$ python3 h5py_script.py /datasets/70k_1T.h5ad 
/datasets/70k_1T.h5ad
Profiling:            <h5_size>  RSS:   2.64MB | VMS:   1.99MB | SHR 811.01kB | time:   3.58ms | result   1.82GB
Profiling:       <anndata_size>  RSS:  54.75MB | VMS:  54.54MB | SHR   1.07MB | time: 183.13ms | result   1.85GB
~$ python3 h5py_script.py /datasets/70k_1T_2.h5ad 
/datasets/70k_1T_2.h5ad
Profiling:            <h5_size>  RSS:   2.57MB | VMS:   1.98MB | SHR 737.28kB | time:   3.64ms | result   1.47GB
Profiling:       <anndata_size>  RSS:  70.29MB | VMS:  69.69MB | SHR   1.35MB | time: 244.01ms | result   1.52GB
  • This "pure" h5py solution is off by a good margin, but in the right order of magnitude
  • We can see that AnnData, in comparison, takes far more time & memory

Investigating a bit with the debugger reveals that when reading an h5ad dataset in backed mode, the "attribute" sub-matrices are loaded into dataframes, as @gszep was pointing out.

anndata/anndata/_io/h5ad.py

Lines 143 to 168 in c790113

def read_h5ad_backed(filename: str | Path, mode: Literal["r", "r+"]) -> AnnData:
    d = dict(filename=filename, filemode=mode)

    f = h5py.File(filename, mode)

    attributes = ["obsm", "varm", "obsp", "varp", "uns", "layers"]
    df_attributes = ["obs", "var"]

    if "encoding-type" in f.attrs:
        attributes.extend(df_attributes)
    else:
        for k in df_attributes:
            if k in f:  # Backwards compat
                d[k] = read_dataframe(f[k])

    d.update({k: read_elem(f[k]) for k in attributes if k in f})

    d["raw"] = _read_raw(f, attrs={"var", "varm"})

    adata = AnnData(**d)

    # Backwards compat to <0.7
    if isinstance(f["obs"], h5py.Dataset):
        _clean_uns(adata)

    return adata

So X stays backed, but the attributes are loaded, i.e. read from disk and allocated in RAM, which is taxing on resources.

I see that there are some contributions to address this issue; however, they are now stale. Is this on the roadmap?

@ilan-gold
Contributor

@Neah-Ko I am actively working on a solution involving xarray. We have prototype branches (e.g., #1247); it would be great to have feedback. There are a few data-type stumbling blocks with xarray, which is why we are hesitant to release this. I am hopeful this can be in a release soon, though, as I am making progress.

@ilan-gold
Contributor

Relatedly, you could write your own I/O like here via read_dispatched. I imagine the arrays in obsm are taking up the most space, so this should alleviate that.
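
For example, a rough sketch (not the linked code) that skips obsm/varm at read time with anndata.experimental.read_dispatched, assuming the 0.10-era callback signature (func, elem_name, elem, iospec) and a hypothetical file path:

import h5py
from anndata.experimental import read_dispatched

def skip_obsm_varm(func, elem_name, elem, iospec):
    # returning None instead of calling func(elem) skips reading this element
    if elem_name.startswith(("/obsm", "/varm")):
        return None
    return func(elem)

with h5py.File("example.h5ad", "r") as f:
    adata = read_dispatched(f, skip_obsm_varm)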

@Neah-Ko
Contributor

Neah-Ko commented Jan 31, 2024

Hi @ilan-gold, thanks for your answer.

I've run my profiling test on your code: 1. check out the PR code, then 2. use it to load via the new feature:

from anndata.experimental.backed import read_backed

@profile  # measurement decorator from the profiling script used above
def anndata_size(filepath):
    adata = read_backed(filepath)
    return adata.__sizeof__(with_disk=True)

On the same datasets as above, the results look like this:

~$ python3 h5py_script.py /datasets/thalamic_complex_13k.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/thalamic_complex_13k.h5ad
Profiling:            <h5_size>  RSS:   1.88MB | VMS: 831.49kB | SHR 868.35kB | time:   2.18ms | result 618.07MB
Profiling:       <anndata_size>  RSS:  43.46MB | VMS:  44.13MB | SHR   2.62MB | time: 232.45ms | result 618.07MB
~$ python3 h5py_script.py /datasets/Submucosa_Epcam_16k.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/Submucosa_Epcam_16k.h5ad
Profiling:            <h5_size>  RSS:   2.08MB | VMS:   1.26MB | SHR 786.43kB | time:    2.4ms | result 277.55MB
Profiling:       <anndata_size>  RSS:  38.95MB | VMS:  38.46MB | SHR   2.57MB | time: 211.77ms | result 283.48MB
~$ python3 h5py_script.py /datasets/mCT-seq_37k.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/mCT-seq_37k.h5ad
Profiling:            <h5_size>  RSS:   2.08MB | VMS:   1.32MB | SHR 790.53kB | time:   2.76ms | result   2.06GB
Profiling:       <anndata_size>  RSS:  48.96MB | VMS:  48.58MB | SHR   2.07MB | time: 312.39ms | result   2.06GB
~$ python3 h5py_script.py /datasets/70k_1T.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/70k_1T.h5ad
Profiling:            <h5_size>  RSS:    2.7MB | VMS:   1.73MB | SHR 868.35kB | time:   3.76ms | result   1.82GB
Profiling:       <anndata_size>  RSS:  45.43MB | VMS:  45.99MB | SHR    2.1MB | time: 376.34ms | result   1.82GB
~$ python3 h5py_script.py /datasets/70k_1T_2.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/70k_1T_2.h5ad
Profiling:            <h5_size>  RSS:   2.67MB | VMS:   1.86MB | SHR 843.78kB | time:   3.71ms | result   1.47GB
Profiling:       <anndata_size>  RSS:  84.01MB | VMS:  83.52MB | SHR   2.09MB | time: 356.92ms | result   1.47GB

As a result we see:

  • The two size estimates are now much closer, which could indicate that some data is missing from the count
  • Slightly less memory use
  • It doesn't seem to improve performance much, so some data realization is probably still happening

@ilan-gold
Contributor

@Neah-Ko I will need to check this but I would be shocked if that weren't a bug. We have tests to specifically check nothing is read in except what we have no choice but to read in (I think it's dataframes within obsm/varm and awkward arrays, but that's it).

@ilan-gold
Contributor

@Neah-Ko Why are you using the with_disk argument? I thought you'd be interested in with_disk=False, since you're interested in I/O. For example, based on the above PR I link to:

from anndata.tests.helpers import gen_adata
from anndata.experimental import read_backed
from scipy import sparse

adata = gen_adata((1000, 1000), sparse.csc_matrix)
backed_path = "backed.zarr"
adata.write_zarr(backed_path)
remote = read_backed(backed_path) 
# remote = read_backed('47bb980c-9482-4aee-9558-443c04170dec.h5ad')
size_backed = remote.__sizeof__()
size_backed_with_disk = remote.__sizeof__(with_disk=True)
size_in_memory = remote.to_memory().__sizeof__()
# the difference is orders of magnitude
assert size_backed < size_backed_with_disk and size_backed_with_disk < size_in_memory

I also tried this with an h5ad dataset from CellXGene and got similar results: Dissection: Thalamus (THM) - medial nuclear complex of thalamus (MNC) - mediodorsal nucleus of thalamus + reuniens nucleus (medioventral nucleus) of thalamus - MD + Re (not sure if the download link here works or not, but it's there to try).

Does this work for you? Am I misunderstanding the with_disk argument?

@Neah-Ko
Contributor

Neah-Ko commented Feb 1, 2024

@ilan-gold

For the with_disk argument, I redirect you to the conversation on PR #1230.
It was an addition suggested by @flying-sheep.

My initial use case was to estimate the size that a dataset is going to take into memory simply by opening it in backed mode.

The __sizeof__() function is supposed to return the current size of the object in memory; with_disk also adds the size that sparse/backed/lazy datasets would take if loaded.

@Neah-Ko I will need to check this but I would be shocked if that weren't a bug. We have tests to specifically check nothing is read in except what we have no choice but to read in (I think it's dataframes within obsm/varm and awkward arrays, but that's it).

So it is a bit hard for me to fully understand your read_backed function, as it seems to be bouncing around with read_dispatched. Let me know if you find out what causes this loading.

While looking at the original read_h5ad_backed function that I included in a comment above, I was thinking that maybe we could load into a {dask, x}array at that step and then trigger compute() on property access
(e.g. AnnData._X holds a lazy array, and AnnData.X is a getter that returns AnnData._X.compute()).
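
A rough sketch of that idea, with a hypothetical LazyBacked helper and assuming a dense X stored as a plain HDF5 dataset (sparse X and the .obs/.var dataframes would need their own wrappers):

import dask.array as da
import h5py

class LazyBacked:  # hypothetical helper, not anndata code
    def __init__(self, filepath: str):
        self._f = h5py.File(filepath, "r")
        # building the dask array only records shape/chunk metadata;
        # no matrix values are read yet
        self._X = da.from_array(self._f["X"], chunks="auto")

    @property
    def X(self):
        # values are only realized in memory when the property is accessed
        return self._X.compute()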

@ilan-gold
Contributor

ilan-gold commented Feb 1, 2024

@Neah-Ko I think we may be talking past each other here, which I apologize for. Let me try to understand: you are interested in benchmarking how much data is read into memory, not what is on disk. So my question is: why do you use the with_disk option if you're interested in the in-memory usage? And then, separately, does this read_backed method address your use case (which is presumably what motivates the current question about __sizeof__)?
