Handling "Non-Standard" AnnData Zarr Data #15

ilan-gold · 2020-12-08T22:34:19Z

Related to vitessce/vitessce#713, but this line for example

Lines 312 to 313 in 816e47d

    
    cluster_ids = adata.obs['CellType'].unique().tolist()
    cell_cluster_ids = adata.obs['CellType'].values.tolist()

is for a custom part of the AnnData object, so we should probably not treat it as a standard loadable object. Am I wrong? I can't find any documentation for CellType on the website. This part of the documentation lays out how to use __categories/MY_CATEGORY (which is part of the zarr store) but also has a reference to cell_type.

I am coming across this stuff as I write the JSON API for declaring parts of AnnData store for usage in Vitessce. It's tricky because it's not clear what we should have - for example, for spatial you might have something like

        {
          "type": "cells",
          "fileType": "anndata-cells.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "obsm.spatial": "xy",
            "obsm.poly": "poly"
          }
        },

Where the correspondence between our JSON schema terminology and this config is one to one but then for something like

        {
          "type": "cell-sets",
          "fileType": "anndata-cell-sets.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "obs.CellType": "sets"
          }
        }

you don't have such a nice correspondence at which point you are dealing with "magic" strings.

I think the reason I am mentioning this here is that we probably want to harmonize the terminology we use for custom parts of the store across languages. Thoughts?


keller-mark · 2020-12-08T22:45:14Z

Yes, 'CellType' is just specific to one .h5ad file I was working with at the time. I started to move away from it on this branch https://github.com/vitessce/vitessce-python/blob/keller-mark/widget-method/vitessce/wrappers.py#L315 where I added a parameter for the AnnDataWrapper constructor to identify which column of adata.obs to use for cell sets. This is still a bit limiting though since, as you point out, there can be multiple columns with categorical data and we may want to output all of them as multiple cell set hierarchies. Maybe we can instead support a list of columns, like cell_set_obs_cols = [], and then the user could write something like AnnDataWrapper(adata, cell_set_obs_cols=["CellType", "leiden"]) based on the columns they want to use for cell sets.
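A minimal sketch of the list-of-columns idea, using only the standard library. Here obs is modeled as a plain dict of column name to values rather than a real adata.obs DataFrame, and build_cell_sets is a hypothetical helper, not part of vitessce-python. The output shape is illustrative only and is not claimed to match the Vitessce cell-sets schema.

```python
from collections import defaultdict


def build_cell_sets(obs, cell_ids, cell_set_obs_cols):
    """Group cell IDs by category, once per requested obs column.

    obs: dict of column name -> per-cell values (stand-in for adata.obs).
    cell_ids: cell IDs aligned with the column values (stand-in for adata.obs.index).
    cell_set_obs_cols: columns to turn into cell set hierarchies.
    """
    hierarchies = []
    for col in cell_set_obs_cols:
        if col not in obs:
            raise KeyError(f"obs has no column {col!r}")
        groups = defaultdict(list)
        for cell_id, value in zip(cell_ids, obs[col]):
            groups[str(value)].append(cell_id)
        hierarchies.append({
            "name": col,
            "children": [{"name": name, "set": ids} for name, ids in groups.items()],
        })
    return hierarchies


# Hypothetical toy data: three cells, two categorical columns.
obs = {
    "CellType": ["neuron", "astrocyte", "neuron"],
    "leiden": ["0", "1", "1"],
}
cell_ids = ["cell_1", "cell_2", "cell_3"]
cell_sets = build_cell_sets(obs, cell_ids, ["CellType", "leiden"])
```

Each requested column yields one hierarchy, so adding a column to the list adds a hierarchy without touching the others.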

keller-mark · 2020-12-08T22:47:56Z

When reading straight from JS, I think it may be helpful if the options object is reversed:

        {
          "type": "cell-sets",
          "fileType": "anndata-cell-sets.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "sets": ["obs.CellType", "obs.leiden"]
          }
        }

and

        {
          "type": "cells",
          "fileType": "anndata-cells.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "xy": "obsm.spatial",
            "poly": "obsm.polygon",
            "mappings": {
              "UMAP": {
                "key": "obsm.X_umap",
                "dims": [0, 1]
              },
              "PCA": {
                "key": "obsm.X_pca",
                "dims": [3, 5] // use principal components 4 and 6
              }
            }
          }
        },
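As a rough illustration of what a "dims" entry could mean on the loading side, the sketch below picks the listed (zero-based) columns out of a cells-by-components matrix. The matrix is a plain list of lists standing in for an obsm array, and select_dims is a hypothetical helper, not an existing Vitessce function.

```python
def select_dims(embedding, dims):
    """Return only the requested columns of a cells-by-components matrix."""
    return [[row[d] for d in dims] for row in embedding]


# Hypothetical stand-in for obsm.X_pca: 2 cells, 6 principal components.
x_pca = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    [1.1, 1.2, 1.3, 1.4, 1.5, 1.6],
]
mapping = {"key": "obsm.X_pca", "dims": [3, 5]}

# Zero-based indices 3 and 5 correspond to principal components 4 and 6.
coords = select_dims(x_pca, mapping["dims"])
```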

ilan-gold · 2020-12-08T22:50:08Z

I was just typing that...the sets example was making me think that the direction was wrong in how I laid things out... I am still not a fan of magic strings but I like where it's headed. I think we should try to cover our bases in Python around customization as much as possible and then port it to JavaScript/JSON, since once it's out in JS we can't take it back, whereas here we have a bit of a sandbox to play in. I'll try loading Matt's datasets as well as going out and looking for others on the internet.

keller-mark · 2020-12-08T22:52:05Z

When these are the values rather than the keys, we don't need to use magic strings, we can use arrays like ["obsm", "X_umap"] which would mean obsm.X_umap

keller-mark · 2020-12-08T22:55:14Z

Maybe

        {
          "type": "cell-sets",
          "fileType": "anndata-cell-sets.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "sets": [["obs", "CellType"], ["obs", "leiden"]]
          }
        }

where each element of "sets" is a path to the column that should be used to define each cell set hierarchy
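A small sketch of how such path arrays might be resolved, assuming the store is modeled as nested dicts (a real zarr store would be walked group by group instead). resolve_path is a hypothetical helper. One practical upside of arrays over dotted strings is that a key which itself contains a dot stays unambiguous.

```python
def resolve_path(store, path):
    """Walk a nested dict one key at a time, e.g. ["obs", "CellType"]."""
    node = store
    for i, key in enumerate(path):
        if not isinstance(node, dict) or key not in node:
            # Report the portion of the path that could not be resolved.
            raise KeyError("/".join(path[: i + 1]))
        node = node[key]
    return node


# Hypothetical toy store standing in for an AnnData zarr store.
store = {
    "obs": {
        "CellType": ["neuron", "astrocyte"],
        "leiden": ["0", "1"],
        "cell.type": ["a", "b"],  # a dotted key would break "obs.cell.type"
    },
    "obsm": {"X_umap": [[0.0, 1.0], [2.0, 3.0]]},
}
options = {"sets": [["obs", "CellType"], ["obs", "leiden"]]}
columns = [resolve_path(store, path) for path in options["sets"]]
```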

ilan-gold · 2020-12-08T22:55:25Z

> When these are the values rather than the keys, we don't need to use magic strings, we can use arrays like ["obsm", "X_umap"] which would mean obsm.X_umap

I was referring more to sets as a key as opposed to xy and poly for example, but perhaps I am being too nit-picky. sets does not exist in our json schema but xy and poly do.

keller-mark · 2020-12-08T23:22:38Z

Ah I see. Actually then something more like the cell-sets-tabular schema may be better https://github.com/hubmapconsortium/vitessce/blob/master/src/schemas/cell-sets-tabular.schema.json

Maybe

        {
          "type": "cell-sets",
          "fileType": "anndata-cell-sets.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": [
            {
              "group_name": "Cell Type",
              "set_name": ["obs", "CellType"],
              "prediction_score": ["obs", "CellTypeScore"]
            },
            {
              "group_name": "Leiden Clustering",
              "set_name": ["obs", "leiden"]
            }
          ]
        }

where set_name and prediction_score are assumed to be paths to AnnData columns, and group_name is assumed to be the name of the hierarchy, and cell_id is not needed since it would be automatically assigned based on the Cell IDs in the AnnData index.

This has the benefit of easily supporting the prediction score/confidence values as well.
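To make the tabular proposal concrete, here is a rough sketch that turns such an options list into one table of rows per hierarchy, treating prediction_score as optional. The store is again modeled as nested dicts, and lookup and rows_from_options are hypothetical names; the row layout is an assumption loosely modeled on the description above, not the actual cell-sets-tabular schema.

```python
def lookup(store, path):
    """Follow a path such as ["obs", "CellType"] through nested dicts."""
    node = store
    for key in path:
        node = node[key]
    return node


def rows_from_options(store, cell_ids, options):
    """Build one list of rows per group_name; cell_id comes from the index."""
    tables = {}
    for entry in options:
        labels = lookup(store, entry["set_name"])
        if "prediction_score" in entry:
            scores = lookup(store, entry["prediction_score"])
        else:
            # No score column given for this hierarchy.
            scores = [None] * len(labels)
        tables[entry["group_name"]] = [
            {"cell_id": cell_id, "set_name": label, "prediction_score": score}
            for cell_id, label, score in zip(cell_ids, labels, scores)
        ]
    return tables


# Hypothetical toy store and index.
store = {
    "obs": {
        "CellType": ["neuron", "astrocyte"],
        "CellTypeScore": [0.9, 0.7],
        "leiden": ["0", "1"],
    }
}
cell_ids = ["cell_1", "cell_2"]
options = [
    {
        "group_name": "Cell Type",
        "set_name": ["obs", "CellType"],
        "prediction_score": ["obs", "CellTypeScore"],
    },
    {"group_name": "Leiden Clustering", "set_name": ["obs", "leiden"]},
]
tables = rows_from_options(store, cell_ids, options)
```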

keller-mark · 2020-12-31T19:06:29Z

I will close this now that we have vitessce/vitessce#807 and open a new issue for integrating it here

ilan-gold added help wanted investigation labels Dec 8, 2020

keller-mark closed this as completed Dec 31, 2020
