# Query cellxgene-census using TileDB-SOMA

The [first guide](cellxgene) queried metadata and h5ad artifacts directly through LaminDB.

This guide uses the TileDB-SOMA API to run similar queries.

## Setup

Initialize a LaminDB instance to store the queried data:

In [None]:
!lamin init --storage ./test-cellxgene --schema bionty

In [None]:
import lamindb as ln
import bionty as bt
import cellxgene_census

census_version = "2023-07-25"

## Create lookup objects

We use metadata records in the `laminlabs/cellxgene` instance to generate lookups:

In [None]:
source = "laminlabs/cellxgene"
human = "homo_sapiens"

features = ln.Feature.using(source).lookup(return_field="name")
assays = bt.ExperimentalFactor.using(source).lookup(return_field="name")
cell_types = bt.CellType.using(source).lookup(return_field="name")
tissues = bt.Tissue.using(source).lookup(return_field="name")
ulabels = ln.ULabel.using(source).lookup()
suspension_types = ulabels.is_suspension_type.children.all().lookup(return_field="name")

## Query data

Compose a `value_filter` expression from the lookup values:

In [None]:
value_filter = (
    f'{features.tissue} == "{tissues.brain}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell}", "{cell_types.neuron}"] and'
    f' {features.suspension_type} == "{suspension_types.cell}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)

In [None]:
value_filter
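
For orientation, the filter is an ordinary string of column constraints joined by `and`. The sketch below builds an equivalent expression from plain literals, without the lookups. The helper `build_value_filter` and the literal names (`"brain"`, `"microglial cell"`, `"neuron"`, `"cell"`, `"10x 3' v3"`) are assumptions for illustration; the actual names come from the `laminlabs/cellxgene` registries and may differ.

```python
# Hypothetical helper and literal values standing in for the lookup results;
# the real names are resolved from the laminlabs/cellxgene registries.
def build_value_filter(equals, is_in):
    """Join column constraints into a single value_filter expression."""
    clauses = [f'{col} == "{val}"' for col, val in equals.items()]
    clauses += [
        f"{col} in [" + ", ".join(f'"{v}"' for v in vals) + "]"
        for col, vals in is_in.items()
    ]
    return " and ".join(clauses)


value_filter_sketch = build_value_filter(
    equals={
        "tissue": "brain",
        "suspension_type": "cell",
        "assay": "10x 3' v3",
    },
    is_in={"cell_type": ["microglial cell", "neuron"]},
)
print(value_filter_sketch)
```

The clause order differs from the expression above, but the constraints are the same because they are all combined with `and`.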

In [None]:
%%time

with cellxgene_census.open_soma(census_version=census_version) as census:
    # Reads SOMADataFrame as a slice
    cell_metadata = census["census_data"][human].obs.read(value_filter=value_filter)

    # Concatenates results to pyarrow.Table
    cell_metadata = cell_metadata.concat()

    # Converts to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()

In [None]:
cell_metadata.shape

In [None]:
cell_metadata.head()

## Create AnnData

In [None]:
%%time

with cellxgene_census.open_soma(census_version=census_version) as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism=human,
        obs_value_filter=value_filter,
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        },
    )

Index `var` by Ensembl gene ID so that `adata.var_names` can later be validated against `bt.Gene.ensembl_gene_id`:

In [None]:
adata.var = adata.var.set_index("feature_id")

In [None]:
adata

In [None]:
adata.var.head()

In [None]:
adata.obs.head()

## Register the queried AnnData

Track this notebook as the transform that produces the artifact:

In [None]:
ln.transform.stem_uid = "6oq3VJy5yxIU"
ln.transform.version = "0"
ln.track()

Register genes and features:

In [None]:
bt.settings.organism = "human"

In [None]:
genes = bt.Gene.from_values(adata.var_names, field=bt.Gene.ensembl_gene_id)
ln.save(genes)

features = ln.Feature.from_df(adata.obs)
ln.save(features)

Register the `AnnData` object:

In [None]:
artifact = ln.Artifact.from_anndata(
    adata,
    description=(
        "microglial and neuron cell data from 10x 3' v3 in brain queried from Census"
    ),
)

In [None]:
artifact.save()

Link validated metadata:

In [None]:
artifact.features._add_set_from_anndata(var_field=bt.Gene.ensembl_gene_id)

In [None]:
features_remote = ln.Feature.using(source).lookup().dict()
features = ln.Feature.lookup().dict()

for col, orm in {
    "assay": bt.ExperimentalFactor,
    "cell_type": bt.CellType,
    "tissue": bt.Tissue,
    "disease": bt.Disease,
    "suspension_type": ln.ULabel,
}.items():
    labels = orm.from_values(adata.obs[col])
    if len(labels) == 0:
        # no value validated against the registry: create records from the raw names
        labels = [orm(name=name) for name in adata.obs[col].unique()]
    ln.save(labels)
    artifact.labels.add(labels, features.get(col))
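
The control flow of the loop above can be sketched in plain Python. This is a toy model, not LaminDB's implementation: `from_values` here is a hypothetical stand-in that only keeps values found in an in-memory set, and `labels_for_column` and `known_tissues` are made-up names. It shows that the fallback creates new labels only when no value validates at all.

```python
def from_values(values, registry):
    """Toy stand-in for a registry lookup: keep only values the registry knows."""
    return [v for v in dict.fromkeys(values) if v in registry]


def labels_for_column(values, registry):
    """Mirror the loop's fallback: create labels only if nothing validated."""
    labels = from_values(values, registry)
    if len(labels) == 0:
        # Nothing matched the registry, so fall back to the unique raw values.
        labels = list(dict.fromkeys(values))
    return labels


known_tissues = {"brain", "lung"}  # hypothetical registry content

# One value validates, so the unknown "cortex" is not turned into a label.
validated = labels_for_column(["brain", "brain", "cortex"], known_tissues)

# No value validates, so labels are created from the raw names.
created = labels_for_column(["cortex", "cortex"], known_tissues)

print(validated)
print(created)
```

In this model, a column with a mix of known and unknown values keeps only the known ones, so it can be worth inspecting unvalidated values before linking labels.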

In [None]:
artifact.describe()

In [None]:
artifact.view_lineage()

In [None]:
# clean up test instance
!lamin delete --force test-cellxgene
!rm -r ./test-cellxgene