# Curate `AnnData` based on the CELLxGENE schema

This guide shows how to curate an AnnData object with the help of [`laminlabs/cellxgene`](https://lamin.ai/laminlabs/cellxgene) and the [CELLxGENE schema v5.1.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/5.1.0/schema.md).

Initialize a LaminDB instance in which to register the curated AnnData:

In [None]:
# !pip install 'lamindb[bionty,jupyter]' cellxgene-lamin cellxgene-schema
!lamin init --storage ./test-cellxgene-curate --schema bionty

In [None]:
import lamindb as ln
import cellxgene_lamin as cxg

Let's start with an AnnData object that we'd like to inspect and curate:

In [None]:
adata = cxg.datasets.anndata_human_immune_cells(populate_registries=True)
adata.write_h5ad("anndata_human_immune_cells.h5ad")
adata

In [None]:
!cellxgene-schema validate anndata_human_immune_cells.h5ad

## Validate and curate metadata

Validate the AnnData object:

In [None]:
try:
    curate = cxg.Curate(adata)
except Exception as e:
    print(e)

The dataset stores donor information in a column named "donor", while the schema expects "donor_id". Let's rename it:

In [None]:
adata.obs.rename(columns={"donor": "donor_id"}, inplace=True)
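
After the rename, it can be useful to see which expected columns are still absent. The following is a minimal plain-Python sketch, not part of `cellxgene_lamin`: the `REQUIRED_OBS` list is an illustrative subset chosen for this example, and `missing_columns` is a hypothetical helper.

```python
# Hypothetical, illustrative subset of expected obs columns (not the full schema).
REQUIRED_OBS = ["assay", "cell_type", "donor_id", "tissue", "suspension_type"]


def missing_columns(columns, required=REQUIRED_OBS):
    """Return the required column names absent from `columns`, in list order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Made-up column names as they might look before and after the rename.
before = ["assay", "cell_type", "donor", "tissue"]
after = ["donor_id" if c == "donor" else c for c in before]

print(missing_columns(before))  # ['donor_id', 'suspension_type']
print(missing_columns(after))   # ['suspension_type']
```

With a real object, the column names would come from `adata.obs.columns`.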

For the missing columns, we can pass the default values suggested by CELLxGENE:

In [None]:
cxg.CellxGeneFields.OBS_FIELD_DEFAULTS

In [None]:
curate = cxg.Curate(adata, defaults=cxg.CellxGeneFields.OBS_FIELD_DEFAULTS, organism="human")
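
Conceptually, passing defaults amounts to adding a constant column for every field that is missing. The sketch below illustrates that idea in plain Python. It assumes that defaults never overwrite existing columns, which is an assumption of this example rather than documented behavior of `Curate`. The helper `apply_defaults` and the default values shown are hypothetical.

```python
def apply_defaults(obs, defaults, n_rows):
    """Add each default as a constant column, leaving existing columns untouched."""
    filled = dict(obs)  # shallow copy so the input mapping is not mutated
    for column, value in defaults.items():
        if column not in filled:
            filled[column] = [value] * n_rows
    return filled


# Made-up metadata: one list of values per column.
obs = {"donor_id": ["D1", "D2"], "tissue": ["lung", "blood"]}
# Hypothetical defaults for illustration; inspect OBS_FIELD_DEFAULTS for the real ones.
defaults = {"disease": "normal", "tissue": "unknown"}

filled = apply_defaults(obs, defaults, n_rows=2)
print(filled["disease"])  # ['normal', 'normal']
print(filled["tissue"])   # ['lung', 'blood'] (existing values are kept)
```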

In [None]:
curate.categoricals

In [None]:
validated = curate.validate(organism="human")
validated
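
To make the idea behind validation concrete: each categorical column is compared against the set of terms known to a registry, and any value outside that set is reported. The sketch below shows this with plain Python sets. The registry contents, column values, and the `find_non_validated` helper are all made up for illustration and do not reflect how `curate.validate` is implemented.

```python
def find_non_validated(values, registry):
    """Return the unique values not found in `registry`, sorted for stable output."""
    return sorted(set(values) - set(registry))


# Hypothetical registry contents and column values, for illustration only.
registries = {
    "tissue": {"lung", "blood", "spleen"},
    "sex": {"female", "male", "unknown"},
}
obs = {
    "tissue": ["lung", "lungg", "blood"],
    "sex": ["female", "male", "female"],
}

report = {col: find_non_validated(vals, registries[col]) for col, vals in obs.items()}
is_valid = not any(report.values())  # valid only if every column's list is empty

print(report)    # {'tissue': ['lungg'], 'sex': []}
print(is_valid)  # False
```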

## Register new metadata labels

Follow the suggestions above to register genes and labels that aren't yet present in the current instance:

(Note that our instance is rather empty. Once you have filled up the registries, you will rarely need to register new labels.)

In [None]:
curate.add_validated_from("all")

For donors, we register the new labels:

In [None]:
curate.add_new_from("donor_id")

An error is shown for the tissue label "lungg", which is a typo of "lung". Let's fix it:

In [None]:
tissues = curate.lookup().tissue
# using a lookup object to find the correct term
tissues.lung

In [None]:
adata.obs["tissue"] = adata.obs["tissue"].cat.rename_categories(
    {"lungg": tissues.lung.name}
)
curate.add_validated_from("tissue")
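
When the intended term is less obvious than in this case, fuzzy matching against known names can suggest candidates. The sketch below uses only the standard library's `difflib`. The list of tissue names and the `suggest` helper are hypothetical, and this is not how `curate.lookup()` works internally.

```python
import difflib

# Hypothetical registry names, for illustration only.
known_tissues = ["lung", "liver", "blood", "spleen"]


def suggest(term, choices, cutoff=0.8):
    """Return up to three close matches for `term`, best match first."""
    return difflib.get_close_matches(term, choices, n=3, cutoff=cutoff)


print(suggest("lungg", known_tissues))  # ['lung']
```

A fairly high `cutoff` keeps unrelated terms out of the suggestions. Lowering it returns more candidates at the cost of more noise.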

Let's validate the object again:

In [None]:
validated = curate.validate()
validated

In [None]:
adata.obs.head()

## Save artifact

In [None]:
artifact = curate.save_artifact(description="test h5ad file")

In [None]:
artifact.describe()

The following step is optional. It mimics how CELLxGENE groups `AnnData` objects into collections to link them to studies.

In [None]:
# register a new collection
collection = curate.save_collection(
    [artifact],  # the artifact registered above; the list can hold several artifacts
    name=(  # title of the publication
        "Cross-tissue immune cell analysis reveals tissue-specific features in humans"
        " (for test demo only)"
    ),
    description="10.1126/science.abl5197",  # DOI of the publication
    reference="E-MTAB-11536",  # accession number (e.g. GSE#, E-MTAB#, etc.)
    reference_type="ArrayExpress",  # source type (e.g. GEO, ArrayExpress, SRA, etc.)
)

## Return an input h5ad file for cellxgene-schema

In [None]:
adata_cxg = curate.to_cellxgene(is_primary_data=True)
adata_cxg

In [None]:
adata_cxg.write_h5ad("anndata_human_immune_cells_cxg.h5ad")

In [None]:
!cellxgene-schema validate anndata_human_immune_cells_cxg.h5ad
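
Outside a notebook, the same check can be run from a Python script. The sketch below wraps the command shown above with `subprocess`. The helpers `validate_command` and `run_validation` are hypothetical, and how to interpret the exit code should be confirmed against the tool's documentation.

```python
import shutil
import subprocess


def validate_command(path):
    """Build the argument list for `cellxgene-schema validate <path>`."""
    return ["cellxgene-schema", "validate", path]


def run_validation(path):
    """Run the validator if it is installed; return its exit code, else None."""
    if shutil.which("cellxgene-schema") is None:
        return None  # CLI not found on PATH, nothing to run
    completed = subprocess.run(validate_command(path), check=False)
    return completed.returncode


print(validate_command("anndata_human_immune_cells_cxg.h5ad"))
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.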

```{note}

The `Curate` class validates metadata for adherence to ontologies.
It does not reimplement every rule of the CELLxGENE schema, so we recommend also running [cellxgene-schema](https://github.com/chanzuckerberg/single-cell-curation) when full adherence beyond metadata is required.
```