Reading parts/fields of adata (h5ad) only #436

Closed
Hrovatin opened this issue Oct 2, 2020 · 8 comments

@Hrovatin

Hrovatin commented Oct 2, 2020

It would be nice if one could read only individual fields (obs, var, etc.) from adata stored in h5ad format. This would enable faster reading when only metadata is required.

@Hrovatin changed the title from "Request: Reading parts/fields of adata (h5ad) only" to "Reading parts/fields of adata (h5ad) only" Oct 2, 2020
@Koncopd
Member

Koncopd commented Oct 2, 2020

@Hrovatin hi, you can do
read('file.h5ad', backed='r')
This keeps the metadata in memory and leaves .X on disk as a backed dataset.

@Hrovatin
Author

Hrovatin commented Oct 2, 2020

Thanks. This was not clear to me when I read the documentation.

@Hrovatin Hrovatin closed this as completed Oct 2, 2020
@ivirshup ivirshup reopened this Nov 23, 2020
@ivirshup
Member

Re-opening, since I think we can do more with this. Additional cases include:

  • Reading a single array from obsm
  • Reading a single column from obs
  • Reading all entries, but only for a subset of observations

@ivirshup ivirshup added this to the 0.8 milestone Nov 23, 2020
@ivirshup
Member

Currently this can be done with read_elem, write_elem from anndata._io.specs, if the user passes the underlying store. E.g.:

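import h5py
from anndata._io.specs import read_elem

# Open the file read-only; only the requested elements are loaded into memory.
with h5py.File("adata.h5ad", "r") as f:
    cell_types = read_elem(f["obs/celltype"])  # a single column of obs
    umap = read_elem(f["obsm/X_umap"])  # a single array from obsm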

I'm considering adding this to the experimental module for the next release.

@ivirshup ivirshup self-assigned this Jan 13, 2022
@ivirshup
Member

In the next release we will export read_elem and write_elem from anndata.experimental

@lamasJose

> Re-opening, since I think we can do more with this. Additional cases include:
>
> * Reading a single array from `obsm`
> * Reading a single column from `obs`
> * Reading all entries, but only for a subset of observations

Hi, is this implemented yet? I am trying to read only a few columns of the X layer with read_elem, but I can't find a way to do it. Maybe I am doing it wrong, but it would be very useful for very large datasets.

@ivirshup
Member

If you have the file open with h5py:

f = h5py.File("adata.h5ad", "r")

If X is stored as CSC, you can do (with anndata imported as ad):

ad.experimental.sparse_dataset(f["X"])[:, col_idx]

If it's dense you can do:

f["X"][:, col_idx]

If it's CSR, you're basically going to have to read through the whole thing, but dask will handle that for you if you take the read_sparse_as_dask function from this tutorial (https://scanpy.readthedocs.io/en/stable/tutorials/experimental/dask.html) and then do:

read_sparse_as_dask("adata.h5ad", "X", 10_000)[:, col_idx].compute()

@lamasJose

That worked, thanks!
