
lazy dataframes in .obs and .var with backed="r" mode #981

Open
gszep opened this issue Apr 15, 2023 · 13 comments · May be fixed by #983 or #1247

@gszep

gszep commented Apr 15, 2023

As far as I understand, backed mode works for .X (and possibly .layers?), since accessing that attribute returns an h5py.Dataset. But accessing .obs and .var still returns a pandas.DataFrame, which suggests that all of that data is loaded into memory. Ideally, accessing .obs or .var would return a dask.DataFrame (probably using dask's own read functions) that loads column and row slices as and when needed.
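
For illustration, a minimal sketch of the behaviour described above (the file path is hypothetical, and for sparse data .X is a backed sparse-dataset wrapper rather than an h5py.Dataset):

import anndata as ad

adata = ad.read_h5ad("example.h5ad", backed="r")
print(type(adata.X))    # h5py.Dataset (dense) or a backed sparse-dataset wrapper
print(type(adata.obs))  # pandas.DataFrame, i.e. already fully loaded into memory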

@gszep gszep changed the title lazy .obs and .var with backed="r" mode lazy dataframes in .obs and .var with backed="r" mode using dask.DataFrame Apr 15, 2023
@gszep gszep changed the title lazy dataframes in .obs and .var with backed="r" mode using dask.DataFrame lazy dataframes in .obs and .var with backed="r" mode Apr 15, 2023
@gszep
Author

gszep commented Apr 15, 2023

open to a pure h5py solution too that doesn't need additional deps

@gszep gszep linked a pull request Apr 16, 2023 that will close this issue
@ivirshup
Member

Hi @gszep! I like this idea! But I would note that @ilan-gold is already working on something similar over in:

What do you think of that approach?

@gszep
Author

gszep commented Apr 18, 2023

Fetching partial data over the internet would make sense for datasets that are on the order of TB in size. Otherwise I don't see much of an advantage over downloading the datasets to local disk and reading from there. It should be noted that I am not very familiar with zarr arrays, however, so I may be missing something.

The aim with this PR is to expose more of the lazy, constant-time random access capabilities of the HDF5 format. It would be a shame to restrict it to only .X. In our use case we have up to a billion rows and possibly hundreds of columns in .obs, of which we only need to slice subsets at any given moment.

In terms of implementation I prefer a multiple-dispatch-like approach, i.e. modifying the code in anndata/_io/specs/methods.py:

@_REGISTRY.register_read(H5Group, IOSpec("dataframe", "0.2.0"))
@_REGISTRY.register_read(ZarrGroup, IOSpec("dataframe", "0.2.0"))
def read_dataframe(elem, _reader):
    return LazyDataFrame(elem, _reader)

adding a new LazyDataFrame that behaves exactly like a pd.DataFrame requires touching fewer locations in the code. I tend to shy away from subclassing/creating abstract base classes, as in my experience no one agrees on what a "useful" level of abstraction is.
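
As a rough sketch of the idea (LazyDataFrame is hypothetical here, not an existing anndata class, and categorical/nullable columns are ignored), such a wrapper could read individual columns from the HDF5 group only on access:

import h5py
import pandas as pd

class LazyDataFrame:
    # Minimal column-lazy view over an anndata dataframe group.
    def __init__(self, group: h5py.Group, _reader=None):  # _reader unused in this sketch
        self._group = group
        # anndata stores the column order in the group's attributes
        self._columns = list(group.attrs.get("column-order", list(group.keys())))

    @property
    def columns(self):
        return self._columns

    def __getitem__(self, column: str) -> pd.Series:
        # only this column is read from disk; categorical columns (stored as
        # sub-groups of codes/categories) would need extra decoding here
        return pd.Series(self._group[column][...], name=column)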

@ilan-gold
Contributor

ilan-gold commented Apr 18, 2023

@gszep My approach will almost certainly not be zarr-specific. It's just easier for me at the moment because I can read what is happening off an HTTP server log while debugging. Also, the following:

adding a new LazyDataFrame that behaves exactly like a pd.DataFrame requires touching fewer locations on code

is a non-trivial problem, which is partially solved by my PR. In any case, the read_dispatched method should allow you to customize this if you have an implementation.

Also, out of curiosity, what sort of data do you have that is a billion rows and hundreds of columns if you can share?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity.
Please add a comment if you want to keep the issue open. Thank you for your contributions!

@Neah-Ko
Contributor

Neah-Ko commented Jan 31, 2024

I am having issues with the time it takes to analyze the memory footprint of big datasets, after contributing to #1230.
In use, it seems that AnnData still loads a good chunk of the file into memory when in backed mode.

I implemented a pure h5py function to see if we could, at least in principle, bypass this:

import h5py as h5
from anndata.experimental import sparse_dataset  # anndata >= 0.10

def cs_to_bytes(X) -> int:
    # size of a compressed sparse matrix: data + index pointers + indices
    return int(X.data.nbytes + X.indptr.nbytes + X.indices.nbytes)

def h5_size(filepath):
    hf = h5.File(filepath)
    count = 0
    for sub_idx in ("X", "layers", "obs", "obsm", "obsp",
                    "raw", "uns", "var", "varm", "varp"):
        subarr = hf.get(sub_idx, None)
        # wrap the HDF5 group as an anndata sparse dataset without loading it;
        # _to_backed() is a private anndata method returning a backed sparse matrix
        spd = sparse_dataset(subarr) if subarr else None
        count += cs_to_bytes(spd._to_backed()) if spd else 0
    return count

I then profiled and compared the two. You may reproduce this using this script, taking an .h5ad path as CLI argument:

Note: the datasets can be found on the CellXGene data portal; the filename contains the size in cells ("_xk").

~$ python3 h5py_script.py /datasets/thalamic_complex_13k.h5ad 
/datasets/thalamic_complex_13k.h5ad
Profiling:            <h5_size>  RSS:   1.83MB | VMS: 942.08kB | SHR 811.01kB | time:   2.18ms | result 618.07MB
Profiling:       <anndata_size>  RSS:  56.05MB | VMS:  55.58MB | SHR   1.45MB | time: 174.85ms | result 653.81MB
~$ python3 h5py_script.py /datasets/Submucosa_Epcam_16k.h5ad 
/datasets/Submucosa_Epcam_16k.h5ad
Profiling:            <h5_size>  RSS:   2.37MB | VMS:   1.43MB | SHR 811.01kB | time:   6.53ms | result 277.55MB
Profiling:       <anndata_size>  RSS:  58.94MB | VMS:  58.49MB | SHR   1.52MB | time: 205.04ms | result 314.66MB
~$ python3 h5py_script.py /datasets/mCT-seq_37k.h5ad 
/datasets/mCT-seq_37k.h5ad
Profiling:            <h5_size>  RSS:   2.09MB | VMS:   1.44MB | SHR 802.82kB | time:   2.73ms | result   2.06GB
Profiling:       <anndata_size>  RSS:  58.13MB | VMS:  56.64MB | SHR   1.34MB | time: 183.95ms | result   2.08GB
~$ python3 h5py_script.py /datasets/70k_1T.h5ad 
/datasets/70k_1T.h5ad
Profiling:            <h5_size>  RSS:   2.64MB | VMS:   1.99MB | SHR 811.01kB | time:   3.58ms | result   1.82GB
Profiling:       <anndata_size>  RSS:  54.75MB | VMS:  54.54MB | SHR   1.07MB | time: 183.13ms | result   1.85GB
~$ python3 h5py_script.py /datasets/70k_1T_2.h5ad 
/datasets/70k_1T_2.h5ad
Profiling:            <h5_size>  RSS:   2.57MB | VMS:   1.98MB | SHR 737.28kB | time:   3.64ms | result   1.47GB
Profiling:       <anndata_size>  RSS:  70.29MB | VMS:  69.69MB | SHR   1.35MB | time: 244.01ms | result   1.52GB
  • This "pure" h5py solution is off by a good margin, but in the right order of magnitude
  • We can see that AnnData, in comparison, takes far more time & memory

Investigating a bit with the debugger reveals that when reading an h5ad dataset in backed mode, the "attribute" sub-matrices are loaded into dataframes, as @gszep was pointing out.

anndata/anndata/_io/h5ad.py

Lines 143 to 168 in c790113

def read_h5ad_backed(filename: str | Path, mode: Literal["r", "r+"]) -> AnnData:
    d = dict(filename=filename, filemode=mode)

    f = h5py.File(filename, mode)

    attributes = ["obsm", "varm", "obsp", "varp", "uns", "layers"]
    df_attributes = ["obs", "var"]

    if "encoding-type" in f.attrs:
        attributes.extend(df_attributes)
    else:
        for k in df_attributes:
            if k in f:  # Backwards compat
                d[k] = read_dataframe(f[k])

    d.update({k: read_elem(f[k]) for k in attributes if k in f})

    d["raw"] = _read_raw(f, attrs={"var", "varm"})

    adata = AnnData(**d)

    # Backwards compat to <0.7
    if isinstance(f["obs"], h5py.Dataset):
        _clean_uns(adata)

    return adata

So X stays backed, but the attributes are loaded, i.e. read from disk and allocated in RAM, which is taxing on resources.

I see that there are some contributions to address this issue; however, they are now stale. Is this on the roadmap?

@ilan-gold
Contributor

@Neah-Ko I am actively working on a solution involving xarray. We have prototype branches (e.g., #1247); it would be great to have feedback. There are a few data-type stumbling blocks with xarray, which is why we are hesitant to release this. I am hopeful this can be in a release soon, though, as I am making progress.

@ilan-gold
Contributor

Relatedly, you could write your own I/O like here via read_dispatched. I imagine the arrays in obsm are taking up the most space, so this should alleviate that.
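
For example, a rough sketch (not the linked code) that skips obsm/varm at read time with anndata.experimental.read_dispatched, assuming the 0.10-era callback signature (func, elem_name, elem, iospec) and a hypothetical file path:

import h5py
from anndata.experimental import read_dispatched

def skip_obsm_varm(func, elem_name, elem, iospec):
    # returning None instead of calling func(elem) skips reading this element
    if elem_name.startswith(("/obsm", "/varm")):
        return None
    return func(elem)

with h5py.File("example.h5ad", "r") as f:
    adata = read_dispatched(f, skip_obsm_varm)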

@Neah-Ko
Contributor

Neah-Ko commented Jan 31, 2024

Hi @ilan-gold, thanks for your answer.

I've run my profiling test on your code: 1. check out the PR code, then 2. use it to load via the new feature:

from anndata.experimental.backed import read_backed

@profile  # measurement decorator from the profiling script used above
def anndata_size(filepath):
    adata = read_backed(filepath)
    return adata.__sizeof__(with_disk=True)

On the same datasets as above, the results look like this:

~$ python3 h5py_script.py /datasets/thalamic_complex_13k.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/thalamic_complex_13k.h5ad
Profiling:            <h5_size>  RSS:   1.88MB | VMS: 831.49kB | SHR 868.35kB | time:   2.18ms | result 618.07MB
Profiling:       <anndata_size>  RSS:  43.46MB | VMS:  44.13MB | SHR   2.62MB | time: 232.45ms | result 618.07MB
~$ python3 h5py_script.py /datasets/Submucosa_Epcam_16k.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/Submucosa_Epcam_16k.h5ad
Profiling:            <h5_size>  RSS:   2.08MB | VMS:   1.26MB | SHR 786.43kB | time:    2.4ms | result 277.55MB
Profiling:       <anndata_size>  RSS:  38.95MB | VMS:  38.46MB | SHR   2.57MB | time: 211.77ms | result 283.48MB
~$ python3 h5py_script.py /datasets/mCT-seq_37k.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/mCT-seq_37k.h5ad
Profiling:            <h5_size>  RSS:   2.08MB | VMS:   1.32MB | SHR 790.53kB | time:   2.76ms | result   2.06GB
Profiling:       <anndata_size>  RSS:  48.96MB | VMS:  48.58MB | SHR   2.07MB | time: 312.39ms | result   2.06GB
~$ python3 h5py_script.py /datasets/70k_1T.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/70k_1T.h5ad
Profiling:            <h5_size>  RSS:    2.7MB | VMS:   1.73MB | SHR 868.35kB | time:   3.76ms | result   1.82GB
Profiling:       <anndata_size>  RSS:  45.43MB | VMS:  45.99MB | SHR    2.1MB | time: 376.34ms | result   1.82GB
~$ python3 h5py_script.py /datasets/70k_1T_2.h5ad 
<frozen abc>:106: FutureWarning: xarray subclass Dataset2D should explicitly define __slots__
/datasets/70k_1T_2.h5ad
Profiling:            <h5_size>  RSS:   2.67MB | VMS:   1.86MB | SHR 843.78kB | time:   3.71ms | result   1.47GB
Profiling:       <anndata_size>  RSS:  84.01MB | VMS:  83.52MB | SHR   2.09MB | time: 356.92ms | result   1.47GB

As a result we see:

  • The two size estimates are now much closer, which could indicate that some data is missing from the count
  • Slightly less memory use
  • It doesn't seem to improve performance much, so some data realization is probably still happening

@ilan-gold
Contributor

@Neah-Ko I will need to check this but I would be shocked if that weren't a bug. We have tests to specifically check nothing is read in except what we have no choice but to read in (I think it's dataframes within obsm/varm and awkward arrays, but that's it).

@ilan-gold
Contributor

@Neah-Ko Why are you using the with_disk argument? I thought you'd be interested in with_disk=False, since you're interested in I/O. For example, based on the above PR I link to:

from anndata.tests.helpers import gen_adata
from anndata.experimental import read_backed
from scipy import sparse

adata = gen_adata((1000, 1000), sparse.csc_matrix)
backed_path = "backed.zarr"
adata.write_zarr(backed_path)
remote = read_backed(backed_path) 
# remote = read_backed('47bb980c-9482-4aee-9558-443c04170dec.h5ad')
size_backed = remote.__sizeof__()
size_backed_with_disk = remote.__sizeof__(with_disk=True)
size_in_memory = remote.to_memory().__sizeof__()
# the difference is orders of magnitude
assert size_backed < size_backed_with_disk and size_backed_with_disk < size_in_memory

I also tried this with an h5ad dataset from CellXGene and got similar results: Dissection: Thalamus (THM) - medial nuclear complex of thalamus (MNC) - mediodorsal nucleus of thalamus + reuniens nucleus (medioventral nucleus) of thalamus - MD + Re (not sure if the download link here works or not, but it's there to try).

Does this work for you? Am I misunderstanding the with_disk argument?

@Neah-Ko
Contributor

Neah-Ko commented Feb 1, 2024

@ilan-gold

For the with_disk argument, I redirect you to the conversation on PR #1230.
It was an addition suggested by @flying-sheep.

My initial use case was to estimate the size that a dataset is going to take into memory simply by opening it in backed mode.

The __sizeof__() function is supposed to return the current size of the object in memory; with_disk also adds the size that sparse/backed/lazy datasets would take if loaded.

@Neah-Ko I will need to check this but I would be shocked if that weren't a bug. We have tests to specifically check nothing is read in except what we have no choice but to read in (I think it's dataframes within obsm/varm and awkward arrays, but that's it).

So it is a bit hard for me to fully understand your read_backed function, as it seems to be bouncing around with read_dispatched. Let me know if you find out what causes this loading.

While looking at the original read_h5ad_backed function that I included in a comment above, I was thinking that maybe we could load into a {dask, x}array at that step and then trigger compute() on property access
(e.g. AnnData._X holds a lazy array, and AnnData.X is a getter that returns AnnData._X.compute()).
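
A rough sketch of that idea, with a hypothetical LazyBacked helper and assuming a dense X stored as a plain HDF5 dataset (sparse X and the .obs/.var dataframes would need their own wrappers):

import dask.array as da
import h5py

class LazyBacked:  # hypothetical helper, not anndata code
    def __init__(self, filepath: str):
        self._f = h5py.File(filepath, "r")
        # building the dask array only records shape/chunk metadata;
        # no matrix values are read yet
        self._X = da.from_array(self._f["X"], chunks="auto")

    @property
    def X(self):
        # values are only realized in memory when the property is accessed
        return self._X.compute()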

@ilan-gold
Contributor

ilan-gold commented Feb 1, 2024

@Neah-Ko I think we may be talking past each other here, which I apologize for. Let me try to understand: you are interested in benchmarking how much data is read into memory, not what is on disk. So my question is: why do you use the with_disk option if you're interested in the in-memory usage? And then, separately, does this read_backed method address your use case (which is presumably what motivates the current question about __sizeof__)?
