Handling "Non-Standard" AnnData Zarr Data #15

Closed
ilan-gold opened this issue Dec 8, 2020 · 8 comments

@ilan-gold
Collaborator

ilan-gold commented Dec 8, 2020

Related to vitessce/vitessce#713, but this code, for example,

cluster_ids = adata.obs['CellType'].unique().tolist()
cell_cluster_ids = adata.obs['CellType'].values.tolist()

is specific to a custom part of the AnnData object, so we should probably not treat it as a standard loadable object. Am I wrong? I can't find any documentation for CellType on the website. This part of the documentation lays out how to use __categories/MY_CATEGORY (which is part of the zarr store) but also has a reference to cell_type.

I am running into this as I write the JSON API for declaring which parts of an AnnData store Vitessce should use. It's tricky because it's not clear what we should have. For example, for spatial data you might have something like

        {
          "type": "cells",
          "fileType": "anndata-cells.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "obsm.spatial": "xy",
            "obsm.poly": "poly"
          }
        },

where the correspondence between our JSON schema terminology and this config is one-to-one, but then for something like

        {
          "type": "cell-sets",
          "fileType": "anndata-cell-sets.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "obs.CellType": "sets"
          }
        }

you don't have such a nice correspondence, at which point you are dealing with "magic" strings.

I think the reason I am mentioning this here is that we probably want to harmonize the terminology we use for custom parts of the store across languages. Thoughts?

@keller-mark
Member

Yes, 'CellType' is just specific to one .h5ad file I was working with at the time. I started to move away from it on this branch https://github.com/vitessce/vitessce-python/blob/keller-mark/widget-method/vitessce/wrappers.py#L315 where I added a parameter to the AnnDataWrapper constructor to identify which column of adata.obs to use for cell sets. This is still a bit limiting, though, since as you point out there can be multiple columns with categorical data, and we may want to output all of them as multiple cell set hierarchies. Maybe we can instead support a list of columns, like cell_set_obs_cols = [], and then the user could write something like AnnDataWrapper(adata, cell_set_obs_cols=["CellType", "leiden"]) based on the columns they want to use for cell sets.
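A minimal sketch of the list-of-columns idea, for illustration only. It assumes `adata.obs` can be stood in for by a plain dict of column name to per-cell labels (the real object is a pandas DataFrame), and `cell_sets_from_obs` is a hypothetical helper, not part of vitessce-python:

```python
def cell_sets_from_obs(obs, cell_ids, cell_set_obs_cols):
    """Build one cell set hierarchy per requested obs column.

    obs: dict mapping column name -> list of per-cell labels
         (a stand-in for adata.obs, which is an assumption of this sketch).
    cell_ids: list of cell IDs aligned with the column values.
    cell_set_obs_cols: names of the columns to turn into hierarchies.
    """
    hierarchies = {}
    for col in cell_set_obs_cols:
        if col not in obs:
            raise KeyError(f"obs has no column {col!r}")
        sets = {}
        # Group cell IDs by the label they carry in this column.
        for cell_id, label in zip(cell_ids, obs[col]):
            sets.setdefault(label, []).append(cell_id)
        hierarchies[col] = sets
    return hierarchies


# Made-up example data; the labels are placeholders.
obs = {
    "CellType": ["neuron", "astrocyte", "neuron"],
    "leiden": ["0", "1", "1"],
}
cell_ids = ["cell_1", "cell_2", "cell_3"]
result = cell_sets_from_obs(obs, cell_ids, ["CellType", "leiden"])
```

Each requested column becomes its own hierarchy, so adding a second clustering is a matter of appending one more column name to the list.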

@keller-mark
Member

keller-mark commented Dec 8, 2020

When reading straight from JS, I think it may be helpful if the options object is reversed:

        {
          "type": "cell-sets",
          "fileType": "anndata-cell-sets.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "sets": ["obs.CellType", "obs.leiden"]
          }
        }

and

        {
          "type": "cells",
          "fileType": "anndata-cells.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "xy": "obsm.spatial",
            "poly": "obsm.polygon",
            "mappings": {
              "UMAP": {
                "key": "obsm.X_umap",
                "dims": [0, 1]
              },
              "PCA": {
                "key": "obsm.X_pca",
                "dims": [3, 5] // use principal components 4 and 6
              }
            }
          }
        },
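To illustrate how the "mappings" option above might be consumed, here is a rough sketch. It assumes `obsm` is a plain dict of key to list of per-cell rows (standing in for the AnnData `obsm` arrays), and `resolve_mappings` is a hypothetical name, not an existing Vitessce function:

```python
def select_dims(embedding, dims):
    """Keep only the requested (zero-based) columns of each embedding row."""
    return [[row[d] for d in dims] for row in embedding]


def resolve_mappings(obsm, mappings):
    """Turn a 'mappings' option into per-mapping 2D coordinates.

    obsm: dict of key -> list of rows (an assumed stand-in for adata.obsm).
    mappings: dict shaped like the "mappings" option in the config above.
    """
    coords = {}
    for name, spec in mappings.items():
        group, _, key = spec["key"].partition(".")
        if group != "obsm":
            raise ValueError(f"expected an obsm key, got {spec['key']!r}")
        coords[name] = select_dims(obsm[key], spec["dims"])
    return coords


# Made-up data: two cells, a 2D UMAP and a 6-component PCA.
obsm = {
    "X_umap": [[0.1, 0.2], [0.3, 0.4]],
    "X_pca": [[10, 11, 12, 13, 14, 15], [20, 21, 22, 23, 24, 25]],
}
mappings = {
    "UMAP": {"key": "obsm.X_umap", "dims": [0, 1]},
    "PCA": {"key": "obsm.X_pca", "dims": [3, 5]},  # components 4 and 6
}
coords = resolve_mappings(obsm, mappings)
```

Because "dims" is zero-based, `[3, 5]` picks the fourth and sixth principal components, matching the comment in the config.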

@ilan-gold
Collaborator Author

I was just typing that... The sets example was making me think that the direction was wrong in how I laid things out. I am still not a fan of magic strings, but I like where it's headed. I think we should try to cover our bases in Python around customization as much as possible and then port it to JavaScript/JSON, since once it's out in JS we can't take it back, whereas here we have a bit of a sandbox to play in. I'll try loading Matt's datasets as well as going out and looking for others on the internet.

@keller-mark
Member

When these are the values rather than the keys, we don't need to use magic strings; we can use arrays like ["obsm", "X_umap"], which would mean obsm.X_umap.
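A small sketch of what resolving such a path array could look like, assuming the store can be modeled as nested dicts (a real zarr group would be indexed in a similar way, but that is not shown here). `resolve_path` is a hypothetical helper:

```python
def resolve_path(store, path):
    """Walk a nested mapping one path segment at a time.

    store: nested dicts standing in for the AnnData zarr store (an assumption).
    path: list of segments, e.g. ["obsm", "X_umap"].
    """
    node = store
    for part in path:
        node = node[part]
    return node


# Made-up store contents for illustration.
store = {
    "obsm": {"X_umap": [[0.0, 1.0]], "X.custom": [[2.0, 3.0]]},
    "obs": {"leiden": ["0"]},
}
umap = resolve_path(store, ["obsm", "X_umap"])
custom = resolve_path(store, ["obsm", "X.custom"])
```

One advantage over a dotted string is that a key which itself contains a dot, like the made-up "X.custom" above, stays unambiguous because no string splitting is involved.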

@keller-mark
Member

Maybe

        {
          "type": "cell-sets",
          "fileType": "anndata-cell-sets.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": {
            "sets": [["obs", "CellType"], ["obs", "leiden"]]
          }
        }

where each element of "sets" is a path to the column that should be used to define each cell set hierarchy.

@ilan-gold
Collaborator Author

When these are the values rather than the keys, we don't need to use magic strings; we can use arrays like ["obsm", "X_umap"], which would mean obsm.X_umap.

I was referring more to sets as a key, as opposed to xy and poly for example, but perhaps I am being too nitpicky. sets does not exist in our JSON schema, but xy and poly do.

@keller-mark
Member

keller-mark commented Dec 8, 2020

Ah, I see. In that case, something more like the cell-sets-tabular schema may be better: https://github.com/hubmapconsortium/vitessce/blob/master/src/schemas/cell-sets-tabular.schema.json

Maybe

        {
          "type": "cell-sets",
          "fileType": "anndata-cell-sets.zarr",
          "url": "http://127.0.0.1:8081/habib.zarr",
          "options": [
            {
              "group_name": "Cell Type",
              "set_name": ["obs", "CellType"],
              "prediction_score": ["obs", "CellTypeScore"]
            },
            {
              "group_name": "Leiden Clustering",
              "set_name": ["obs", "leiden"]
            }
          ]
        }

where set_name and prediction_score are assumed to be paths to AnnData columns, and group_name is assumed to be the name of the hierarchy, and cell_id is not needed since it would be automatically assigned based on the Cell IDs in the AnnData index.
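As a rough sketch of how such an options list could be expanded into tabular rows: the code below assumes the store is modeled as nested dicts and that cell IDs come from the AnnData index. `cell_set_rows` and the row field names are illustrative assumptions, not the actual cell-sets-tabular schema or a Vitessce API:

```python
def resolve_path(store, path):
    """Walk a nested mapping along a list of path segments."""
    node = store
    for part in path:
        node = node[part]
    return node


def cell_set_rows(store, cell_ids, options):
    """Expand each option entry into one row per cell.

    store: nested dicts standing in for the AnnData zarr store (an assumption).
    cell_ids: cell IDs taken from the AnnData index, so no cell_id path is needed.
    options: list shaped like the "options" array in the config above.
    """
    rows = []
    for opt in options:
        labels = resolve_path(store, opt["set_name"])
        if "prediction_score" in opt:
            scores = resolve_path(store, opt["prediction_score"])
        else:
            # The score is optional; leave it empty when no path is given.
            scores = [None] * len(labels)
        for cell_id, label, score in zip(cell_ids, labels, scores):
            rows.append({
                "group_name": opt["group_name"],
                "set_name": label,
                "cell_id": cell_id,
                "prediction_score": score,
            })
    return rows


# Made-up data for two cells.
store = {
    "obs": {
        "CellType": ["neuron", "astrocyte"],
        "CellTypeScore": [0.9, 0.7],
        "leiden": ["0", "1"],
    }
}
options = [
    {
        "group_name": "Cell Type",
        "set_name": ["obs", "CellType"],
        "prediction_score": ["obs", "CellTypeScore"],
    },
    {"group_name": "Leiden Clustering", "set_name": ["obs", "leiden"]},
]
rows = cell_set_rows(store, ["cell_1", "cell_2"], options)
```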

This has the benefit of easily supporting the prediction score/confidence values as well.

@keller-mark
Member

I will close this now that we have vitessce/vitessce#807, and open a new issue for integrating it here.
