Load AnnData From Zarr #807

Merged
merged 40 commits into from
Dec 21, 2020

Conversation

ilan-gold
Collaborator

@ilan-gold ilan-gold commented Dec 15, 2020

I am starting this out as a draft since I think there are a few open questions:

  1. Do we want to formalize the options in a schema? Do we expect this to work more generally than just anndata-zarr? I think we would want the JSON schema to switch on an enum of file types in this case (i.e., for anndata-zarr, use one options schema as opposed to another), but I don't know if this is possible. For example, this is a schema for visualizing the habib data
  • Answered: Schema has been added
  2. @keller-mark I am terrible at this sort of thing, but we probably want to support a filtering operation for genes. What do you think we should call that? It would be good for the expression-matrix.zarr loader so that we can load obsm subsets of the cell x gene matrix. I think the idea here would be to let people use the highly variable genes without subsetting X as I do below.
  • I went with genesFilter, open to other options
  3. Right now, I am obeying the convention of dots in the schema instead of slashes for communicating paths, since it is similar to how it is done in Python. Should it be slashes? They are paths after all...
  4. Do we want some sort of defaults? Like X for the expression matrix? My sense is no.
  5. String parsing is pretty haphazard but seems to work. I am going to try to put some more effort into this.

As for what the PR does: you can map different parts of the AnnData store to parts of the Vitessce configuration, e.g., X for the expression matrix or obsm.leiden for cell set labels. This is all done through the view configuration, which has options for each of these mappings. We introduce a new set of loaders to handle this, such as anndata-cell-sets.zarr.
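
To make the mapping idea concrete, here is a minimal Python sketch of resolving dotted option paths against a store. Everything in it is an assumption for illustration: the option names (`matrix`, `cellSets`), the helpers `lookup` and `resolve_options`, and the nested dict standing in for the AnnData/Zarr store are hypothetical and are not the schema or loader code from this PR.

```python
# Hypothetical illustration only: a nested dict stands in for the AnnData store.
fake_store = {
    "X": [[1.0, 0.0], [0.0, 2.0]],
    "obsm": {"leiden": ["0", "1"]},
}

# Hypothetical option names; the actual schema in the PR may differ.
options = {"matrix": "X", "cellSets": "obsm.leiden"}


def lookup(store, dotted_path):
    """Walk a nested dict using a dot-separated path such as 'obsm.leiden'."""
    node = store
    for key in dotted_path.split("."):
        node = node[key]
    return node


def resolve_options(store, opts):
    """Map each option name to the data found at its path in the store."""
    return {name: lookup(store, path) for name, path in opts.items()}


resolved = resolve_options(fake_store, options)
print(resolved["cellSets"])
```

A missing key raises `KeyError` here, which is one simple way to surface a bad path early instead of silently rendering nothing.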

I also committed some example view configs, which I will remove, but the process for downloading and viewing them is as follows:

  • slide-seq: I can send you the data if you do not have it. Then I ran
sc.pp.highly_variable_genes(adata, n_top_genes=200)
adata.obsm['X_top_200_genes'] = adata[:, adata.var['highly_variable']].X.copy()
adata.write_zarr('secondary_analysis_hvg_obsm.zarr')

before writing to zarr.

  • habib: I ran
sc.pp.highly_variable_genes(adata, n_top_genes=200, inplace=True, subset=True)
adata.write_zarr('habib_200_hvg.zarr')

before writing to zarr. Demo is here

  • pbmc: No alterations. I got this dataset from here
adata = sc.datasets.pbmc68k_reduced()
adata.write_zarr('pbmc.zarr')

Demo is here

  • pbmc_processed: No alterations. I got this dataset from here
adata = sc.datasets.pbmc3k()
adata.write_zarr('pbmc3k_processed.zarr')

Demo is here
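
For readers without scanpy at hand, the slide-seq step above amounts to keeping only the gene columns flagged as highly variable and storing that subset alongside the full matrix. A rough pure-Python sketch of that column selection follows; the toy data and the helper `select_columns` are made up for illustration, and the real code operates on arrays through anndata rather than nested lists.

```python
def select_columns(matrix, mask):
    """Keep only the columns of a cell x gene matrix where mask is True."""
    if any(len(row) != len(mask) for row in matrix):
        raise ValueError("mask length must equal the number of genes")
    return [[value for value, keep in zip(row, mask) if keep] for row in matrix]


# Toy cell x gene matrix: 2 cells, 4 genes.
X = [[5, 0, 3, 1],
     [2, 7, 0, 4]]

# Stand-in for adata.var['highly_variable'].
highly_variable = [True, False, True, False]

subset = select_columns(X, highly_variable)
print(subset)  # [[5, 3], [2, 0]]
```

The full matrix `X` is left untouched, which is the point of storing the subset separately instead of subsetting in place.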

@ilan-gold
Collaborator Author

Was just thinking about something related and I think making separate schemas for the different kinds of options makes sense - no need for an enum, the options just have to be one of these schemas or none at all. This would allow us to have an OME-TIFF loader directly.
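
A minimal sketch of the "one of these schemas or none at all" rule, in plain Python rather than JSON Schema. The schema names and option keys below are hypothetical, chosen only to illustrate the check; they are not the schemas added in this PR.

```python
# Hypothetical schemas: name -> (required keys, allowed keys).
OPTION_SCHEMAS = {
    "expression-matrix": ({"matrix"}, {"matrix", "geneFilter"}),
    "cell-sets": ({"cellSets"}, {"cellSets"}),
}


def matches(options, schema):
    """True if options has every required key and no key outside allowed."""
    required, allowed = schema
    keys = set(options)
    return required <= keys <= allowed


def validate_options(options):
    """Return names of matching schemas; options of None is always valid."""
    if options is None:
        return []
    matched = [name for name, schema in OPTION_SCHEMAS.items()
               if matches(options, schema)]
    if not matched:
        raise ValueError(f"options match no known schema: {sorted(options)}")
    return matched


print(validate_options({"matrix": "X", "geneFilter": "var/highly_variable"}))
```

With this shape, adding a loader for another file type only means adding another entry to the table, with no enum tying file types to schemas.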

@keller-mark
Member

Yes direct raster.ome-zarr and raster.ome-tiff loaders would be great to have!

@keller-mark
Member

Right now, I am obeying the convention of dots in the schema instead of slashes for communicating paths since it is similar to how it is done in Python. Should it be slashes? They are paths after all...

Is there any reason not to use an array of strings instead? Then you would not need to do setName.replace(".", "/") and instead could do setPath.join(".") or setPath.join("/") depending on what you need
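
A small Python illustration of the array-of-strings idea (the variable names are hypothetical): the same list of segments can be rendered in either style, and a segment that itself contains a dot stays unambiguous.

```python
# Hypothetical path expressed as a list of segments.
set_path = ["obsm", "leiden"]

python_style = ".".join(set_path)  # 'obsm.leiden'
zarr_style = "/".join(set_path)    # 'obsm/leiden'

# A key containing a dot survives as one segment in list form, but a dotted
# string cannot be split back into the original segments.
tricky_path = ["obsm", "X_pca.v2"]
round_trip = ".".join(tricky_path).split(".")

print(python_style, zarr_style, round_trip)
```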

@keller-mark
Member

Can we change genesFilter to geneFilter to match the geneFilter coordination type? https://github.com/hubmapconsortium/vitessce/blob/master/src/app/state/coordination.js#L27

@manzt
Member

manzt commented Dec 18, 2020

Right now, I am obeying the convention of dots in the schema instead of slashes for communicating paths since it is similar to how it is done in Python. Should it be slashes? They are paths after all...

I would use posix-like ('/') paths relative to the root. All valid keys in a zarr store are posix-like paths, and since these are paths to arrays within the store, it makes sense to provide a path. This is what is done in the multiscale specification.

That's the point of being able to do:

await openArray({ store, path }); // path is a posix-like path with "/"

The "store" takes care of translating a zarr path:

# e.g. on windows
import zarr
root = zarr.open('mydataset.zarr') # DirectoryStore('mydataset.zarr')
arr = root.get('my/nested/array/path') # opens an array from within hierarchy
arr2 = root.get(r'my\nested\array\path') # raises an exception (raw string so \n and \a are not read as escapes)
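
As a side illustration of why "/" is the safe separator for store keys, the standard library's `pathlib` shows the difference without involving zarr at all. This is only a sketch using stdlib calls; the helper `to_store_key` is hypothetical and is not part of zarr or Vitessce.

```python
from pathlib import PurePosixPath, PureWindowsPath


def to_store_key(os_path):
    """Convert a Windows-style relative path into a '/'-separated key."""
    return PureWindowsPath(os_path).as_posix()


key = "my/nested/array/path"

# Under POSIX rules the key splits into its four segments.
posix_parts = PurePosixPath(key).parts

# Under POSIX rules a backslash is an ordinary character, so this is ONE segment.
backslash_parts = PurePosixPath(r"my\nested\array\path").parts

print(posix_parts)
print(len(backslash_parts))
print(to_store_key(r"my\nested\array\path"))
```

Keeping keys in the "/" form everywhere and converting only at the filesystem boundary avoids having to guess which separator a given string uses.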

@ilan-gold
Collaborator Author

@manzt Thanks for the comments on the zarr store. @keller-mark I went with slashes on Trevor's recommendation, mainly because I think the nested stuff can get a little hairy, and it is cleaner not to have to join, or replace and split, in our code.

Co-authored-by: Mark Keller <7525285+keller-mark@users.noreply.github.com>
3 participants