Load AnnData From Zarr #807

ilan-gold · 2020-12-15T21:14:10Z

I am starting this out as a draft since I think there are a few open questions:

Do we want to formalize the options in a schema? Do we expect this to work more generally than just anndata-zarr? I think we would want the JSON schema to operate on an enum of file types in this case, but I don't know if this is possible (i.e., for anndata-zarr, use one options schema as opposed to another). For example, this is a schema for visualizing the habib data.

Answered: Schema has been added

@keller-mark I am terrible at this sort of thing, but we probably want to support a filtering operation for genes. What do you think we should call that? It would be good for the expression-matrix.zarr loader so that we can load obsm subsets of the cell x gene matrix. I think the idea here would be to let people use the highly variable genes without subsetting X as I do below.

I went with genesFilter, but I am open to other options.
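
To make the gene filtering idea concrete, here is a minimal sketch in plain Python of picking a subset of gene columns out of a cell x gene matrix at load time, instead of copying them into obsm beforehand. The function name `filter_genes`, the toy matrix, and the gene names are all hypothetical and only for illustration; this is not how the loader in this PR is implemented.

```python
def filter_genes(matrix, gene_names, gene_filter):
    """Return (sub_matrix, kept_names) restricted to the genes in gene_filter.

    matrix is a list of rows (one row per cell); gene_names labels its columns.
    The output keeps the column order of gene_names, not of gene_filter.
    """
    wanted = set(gene_filter)
    keep = [i for i, name in enumerate(gene_names) if name in wanted]
    sub_matrix = [[row[i] for i in keep] for row in matrix]
    kept_names = [gene_names[i] for i in keep]
    return sub_matrix, kept_names


# Toy data: 2 cells x 4 genes (values are made up).
X = [[0, 3, 1, 7],
     [5, 0, 2, 4]]
genes = ["GeneA", "GeneB", "GeneC", "GeneD"]

sub, kept = filter_genes(X, genes, ["GeneD", "GeneB"])
print(kept)  # ['GeneB', 'GeneD']
print(sub)   # [[3, 7], [0, 4]]
```

The point of the sketch is only that the filter is a list of gene names applied to the columns of X, so the stored matrix itself does not need to be subset.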

Right now, I am obeying the convention of dots in the schema instead of slashes for communicating paths since it is similar to how it is done in Python. Should it be slashes? They are paths after all...
Do we want some sort of defaults? Like X for the expression matrix? My sense is no.
String parsing is pretty haphazard but seems to work. I am going to try to put some more effort into this.

As for what the PR does: you can specify different parts of the AnnData store to be mapped to parts of the Vitessce configuration, e.g. X for the expression matrix or obsm.leiden for cell set labels. This is all done through the view configuration, which has options for each of these. We introduce a new set of loaders to handle this (anndata-cell-sets.zarr, etc.).

I also committed some example view configs which I will remove but the process for downloading/viewing them is as follows:

slide-seq: I can send you the data if you do not have it. Then I ran

sc.pp.highly_variable_genes(adata, n_top_genes=200)
adata.obsm['X_top_200_genes'] = adata[:, adata.var['highly_variable']].X.copy()
adata.write_zarr('secondary_analysis_hvg_obsm.zarr')

before writing to zarr.

habib: I ran

sc.pp.highly_variable_genes(adata, n_top_genes=200, inplace=True, subset=True)
adata.write_zarr('habib_200_hvg.zarr')

before writing to zarr. Demo is here

pbmc: No alterations. I got this dataset from here

adata = sc.datasets.pbmc68k_reduced()
adata.write_zarr('pbmc.zarr')

Demo is here

pbmc_processed: No alterations. I got this dataset from here

adata = sc.datasets.pbmc3k()
adata.write_zarr('pbmc3k_processed.zarr')

Demo is here

ilan-gold · 2020-12-17T15:40:19Z

Was just thinking about something related and I think making separate schemas for the different kinds of options makes sense - no need for an enum, the options just have to be one of these schemas or none at all. This would allow us to have an OME-TIFF loader directly.
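
As a rough illustration of "one of these schemas or none at all", the sketch below mimics the idea in plain Python: options are valid if they are absent or if they match exactly one of several per-loader shapes. The schema names and keys in `OPTION_SCHEMAS` are hypothetical placeholders, not the schemas added in this PR, and the exact-key check is a deliberately simplified stand-in for real JSON Schema validation.

```python
# Hypothetical option shapes, keyed by an illustrative schema name.
# Each value is the exact set of keys that an options object must have.
OPTION_SCHEMAS = {
    "matrix_options": {"matrix"},
    "cell_sets_options": {"groupName", "setName"},
}


def matching_schemas(options):
    """Return the names of all schemas whose key set equals the options' keys."""
    keys = set(options)
    return [name for name, required in OPTION_SCHEMAS.items() if keys == required]


def options_valid(options):
    """Options are valid if omitted (None) or if they match exactly one schema."""
    if options is None:
        return True
    return len(matching_schemas(options)) == 1


print(options_valid(None))                                 # True
print(options_valid({"matrix": "X"}))                      # True
print(options_valid({"matrix": "X", "setName": "leiden"})) # False
```

With this shape there is no need for an enum that selects a schema by file type; each loader's options either fit one of the shapes or are left out.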

keller-mark · 2020-12-17T17:09:31Z

Yes, direct raster.ome-zarr and raster.ome-tiff loaders would be great to have!

src/loaders/anndata-loaders/MatrixZarrLoader.js

keller-mark · 2020-12-18T14:56:00Z

Right now, I am obeying the convention of dots in the schema instead of slashes for communicating paths since it is similar to how it is done in Python. Should it be slashes? They are paths after all...

Is there any reason not to use an array of strings instead? Then you would not need to do setName.replace(".", "/") and could instead do setPath.join(".") or setPath.join("/"), depending on what you need.
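
A small Python analogue of the JavaScript expressions above may help show the trade-off. The segment values (`obs`, `leiden`, `X_umap.v2`) are hypothetical; the sketch only demonstrates that joining a list of segments is unambiguous, while replacing dots in a single string breaks as soon as a key itself contains a dot.

```python
# Path stored as a list of segments (hypothetical values).
set_path = ["obs", "leiden"]

dotted = ".".join(set_path)   # Python-style display form
slashed = "/".join(set_path)  # store key form

# Path stored as one dotted string, converted by replacement.
set_name = "obs.leiden"
replaced = set_name.replace(".", "/")

# A key that contains a dot shows where the two approaches diverge.
tricky_path = ["obsm", "X_umap.v2"]
tricky_joined = "/".join(tricky_path)
tricky_replaced = ".".join(tricky_path).replace(".", "/")

print(dotted, slashed, replaced)       # obs.leiden obs/leiden obs/leiden
print(tricky_joined, tricky_replaced)  # obsm/X_umap.v2 obsm/X_umap/v2
```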

keller-mark · 2020-12-18T14:58:29Z

Can we change genesFilter to geneFilter to match the geneFilter coordination type? https://github.com/hubmapconsortium/vitessce/blob/master/src/app/state/coordination.js#L27

src/loaders/anndata-loaders/MatrixZarrLoader.js

manzt · 2020-12-18T16:19:07Z

Right now, I am obeying the convention of dots in the schema instead of slashes for communicating paths since it is similar to how it is done in Python. Should it be slashes? They are paths after all...

I would use posix-like ('/') paths relative to the root. All valid keys in a zarr store are posix-like paths, and since these are paths to arrays within the store, it makes sense to provide a path. This is what is done in the multiscale specification.

That's the point of being able to do:

await openArray({ store, path }); // path is a posix-like path with "/"

The "store" takes care of translating a zarr path:

# e.g. on windows
import zarr
root = zarr.open('mydataset.zarr') # DirectoryStore('mydataset.zarr')
arr = root.get('my/nested/array/path') # opens an array from within hierarchy
arr2 = root.get('my\nested\array\path') # raises an exception
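
For building such keys in code, a minimal standard-library sketch is shown below. The helper names `to_zarr_key` and `windows_to_zarr_key` are hypothetical and not part of zarr; the sketch only shows how to produce "/"-separated paths relative to the root on any operating system. It uses a raw string for the Windows-style input because, in an ordinary Python string literal, `\n` is a newline and `\a` is a bell character.

```python
import posixpath
from pathlib import PureWindowsPath


def to_zarr_key(*segments):
    """Join segments into a '/'-separated key relative to the store root."""
    return posixpath.join(*segments)


def windows_to_zarr_key(path):
    """Convert a backslash-separated relative path into a '/'-separated key."""
    return PureWindowsPath(path).as_posix()


key = to_zarr_key("my", "nested", "array", "path")
converted = windows_to_zarr_key(r"my\nested\array\path")

print(key)        # my/nested/array/path
print(converted)  # my/nested/array/path
```

Because `posixpath` and `PureWindowsPath` do not depend on the host platform, the same key comes out on Windows, macOS, and Linux.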

ilan-gold · 2020-12-18T21:08:10Z

@manzt Thanks for the comments on the zarr store. @keller-mark I went with slashes on Trevor's recommendation, only because I think the nested stuff can get a little hairy, and it is cleaner not to have to join, replace, and split in our code.

Co-authored-by: Mark Keller <7525285+keller-mark@users.noreply.github.com>

ilan-gold added 26 commits November 16, 2020 16:53

Scaffold data fetching for scatterplot

6e471a5

Merge branch 'master' into ilan-gold/load_anndata

6c8095e

[WIP] More code

87fbcb6

Displaying UMAP Works.

5478b1a

config

2484d1c

Improve robustness of loading text of cell names.

0bc6b87

Refactor

93da637

Load options for spatial for anndata zarr

17669c9

Small fixes. Options allowed.

0d0b70a

Custom cells.

2eb9cdc

[WIP] Habib loaded.

667e098

Filter on length

c7260f6

More aggressive string replacement.

67ec991

Matrix loader. More robust cell sets.

a9241aa

Merge branch 'master' into ilan-gold/load_anndata

b3bfd49

Reformat Cells Loader

bbdec07

No async - only promises

dec6b66

Wrap in promise

7c69392

Load sparse matrix

43ac4b0

Add factors to cells.

dc52bf4

Allow for multiple annotations.

023d596

Rename

0d70c2c

Changelog.

11dde49

Merge branch 'master' into ilan-gold/load_anndata

a262752

Support sparse column format

b8dddc1

Support longer cell name lists.

d7028a0

ilan-gold added 2 commits December 17, 2020 13:04

Allow for genes filter on matrix.

49b0f4f

Add docs. Slight refactor. Add pbmc full example.

a8810de

manzt reviewed Dec 18, 2020

src/loaders/anndata-loaders/MatrixZarrLoader.js

manzt reviewed Dec 18, 2020

src/loaders/anndata-loaders/MatrixZarrLoader.js

manzt reviewed Dec 18, 2020

src/loaders/anndata-loaders/MatrixZarrLoader.js

keller-mark reviewed Dec 18, 2020

src/loaders/anndata-loaders/MatrixZarrLoader.js

Address comments.

50a462c

Merge branch 'master' into ilan-gold/load_anndata

323f53f

manzt reviewed Dec 18, 2020

src/loaders/anndata-loaders/MatrixZarrLoader.js

ilan-gold added 2 commits December 18, 2020 16:48

Use Promise.all

d04c54d

Merge branch 'ilan-gold/load_anndata' of https://github.com/hubmapconsortium/vitessce into ilan-gold/load_anndata

c6e89e6

keller-mark reviewed Dec 18, 2020

src/schemas/config.schema.json