[![hub](https://img.shields.io/badge/Source%20%26%20report%20on%20LaminHub-mediumseagreen)](https://lamin.ai/laminlabs/cellxgene/transform/5FUyJ6RkVk0Dz8)

# CELLxGENE: scRNA-seq

[CZ CELLxGENE](https://cellxgene.cziscience.com/) hosts the world's largest standardized collection of scRNA-seq datasets.

LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, pdfs, notebooks, ML models, ...).

You can use the CELLxGENE data in two ways:

1. Query collections of `AnnData` objects (this page).
2. Query a big array store produced by concatenated `AnnData` objects via `tiledbsoma` ([see here](query-census)).

If you are interested in building similar data assets in-house:

1. See the [transfer guide](inv:docs#transfer) to zero-copy data to your own LaminDB instance.
2. See the [scRNA guide](inv:docs#scrna) for how to create a growing versioned queryable scRNA-seq dataset.
3. See the [curation guide](./cellxgene-curate) for validating, curating, and registering your own `AnnData` objects.

```{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/YHMYgXCfJTJvKPBmuh1S.png" width="700px">

```


Load the public LaminDB instance that mirrors CELLxGENE:

In [None]:
# !pip install 'lamindb[bionty,jupyter]'
!lamin load laminlabs/cellxgene

In [None]:
import lamindb as ln
import bionty as bt

## Query & understand metadata

### Auto-complete metadata

You can create look-up objects for any registry in LaminDB, including [basic biological entities](https://lamin.ai/laminlabs/docs/bionty) and things like users or storage locations.

Let's use auto-complete to look up cell types:

:::{dropdown} Show me a screenshot

<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/lgRNHNtMxjU0y8nIagt7.png" width="400px">

:::

In [None]:
cell_types = bt.CellType.lookup()
cell_types.effector_t_cell

You can also arbitrarily chain filters and create lookups from them:

In [None]:
users = ln.User.lookup()
organisms = bt.Organism.lookup()
experimental_factors = bt.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = bt.Tissue.lookup()  # tissue labels
suspension_types = ln.ULabel.filter(name="is_suspension_type").one().children.lookup()  # suspension types

### Search & filter metadata

We can use search & filters for metadata:

In [None]:
bt.CellType.search("effector T cell").df().head()

Or use a `uid` to get exactly one metadata record:

In [None]:
effector_t_cell = bt.CellType.get("3nfZTVV4")
effector_t_cell

### Understand ontologies

View the related ontology terms: 

In [None]:
effector_t_cell.view_parents(distance=2, with_children=True)

Or access them programmatically:

In [None]:
effector_t_cell.children.df()
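
To walk more than one level down the hierarchy, a breadth-first loop over `.children` is one option. The following is a minimal sketch that runs without a database connection: `collect_descendants`, `FakeTerm`, and `FakeManager` are hypothetical names, the term names are made up, and it assumes that `.children.all()` returns the direct children of a record, as a Django-style related manager would.

```python
def collect_descendants(record, max_depth=2):
    """Return the names of all terms up to `max_depth` levels below `record`."""
    names = []
    frontier = [record]
    for _ in range(max_depth):
        next_frontier = []
        for parent in frontier:
            for child in parent.children.all():
                # ontologies are graphs, so a term can be reached via several parents
                if child.name not in names:
                    names.append(child.name)
                    next_frontier.append(child)
        frontier = next_frontier
    return names


class FakeManager:
    """Stand-in for a related manager (assumption: `.all()` lists direct children)."""

    def __init__(self, records):
        self._records = list(records)

    def all(self):
        return list(self._records)


class FakeTerm:
    """Stand-in for an ontology record with a name and children."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = FakeManager(children)


grandchild = FakeTerm("toy grandchild")
root = FakeTerm(
    "toy root",
    children=[FakeTerm("toy child 1", children=[grandchild]), FakeTerm("toy child 2")],
)

one_level = collect_descendants(root, max_depth=1)
two_levels = collect_descendants(root, max_depth=2)
print(one_level)
print(two_levels)
```

Traversing level by level means each term is first reached at its shallowest depth, so the depth limit is applied consistently even when a term has several parents.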

## Query artifacts

Unlike the [tiledbsoma guide](query-census), this page queries sets of `.h5ad` files, which correspond to `AnnData` objects.

To see what you can query for, simply look at the registry representation:

In [None]:
ln.Artifact

Here is an example query that uses strings:

In [None]:
ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__name__in=["B cell", "T cell"],  # cell types measured in AnnData
    created_by__handle="sunnyosun"  # creator
).order_by(
    "created_at"
).df(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()

```{dropdown} What happens under the hood?

As you saw from inspecting `ln.Artifact`, `ln.Artifact.cell_types` relates artifacts with `bt.CellType`.

The expression `cell_types__name__in` performs the join of the underlying registries and matches `bt.CellType.name` to `["B cell", "T cell"]`.

The same applies to `created_by`, which relates artifacts with `ln.User`.

```
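
To make the double-underscore convention concrete, here is a toy, in-memory sketch of how an expression like `cell_types__name__in` can be read as "follow the relation, pick a field, apply a comparator". It is not how LaminDB implements the query (the real join runs in the database); `parse_lookup`, `filter_artifacts`, and both toy tables are hypothetical and exist only for illustration.

```python
# Toy stand-ins for two registries; the records are made up.
cell_type_table = {1: {"name": "B cell"}, 2: {"name": "T cell"}, 3: {"name": "neutrophil"}}
artifact_table = [
    {"key": "a.h5ad", "cell_types": [1, 3]},
    {"key": "b.h5ad", "cell_types": [3]},
    {"key": "c.h5ad", "cell_types": [2]},
]


def parse_lookup(expression):
    """Split 'cell_types__name__in' into (relation, field, comparator)."""
    relation, field, comparator = expression.split("__")
    return relation, field, comparator


def filter_artifacts(expression, values):
    """Return keys of artifacts with at least one linked record matching `values`."""
    relation, field, comparator = parse_lookup(expression)
    if comparator != "in":
        raise NotImplementedError(f"only 'in' is sketched here, got {comparator!r}")
    matches = []
    for artifact in artifact_table:
        # the "join": follow the relation from the artifact to its linked records
        linked = [cell_type_table[record_id] for record_id in artifact[relation]]
        if any(record[field] in values for record in linked):
            matches.append(artifact["key"])
    return matches


result = filter_artifacts("cell_types__name__in", ["B cell", "T cell"])
print(result)
```

An artifact matches as soon as one of its linked cell types has a name in the list, which is why `a.h5ad` is selected even though it is also linked to a neutrophil record.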

Queries by string are prone to typos. Let's query with auto-completed records instead.

In [None]:
ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__in=[cell_types.b_cell, cell_types.t_cell],  # cell types measured in AnnData
    created_by=users.sunnyosun   # creator
).order_by(
    "created_at"
).df(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()

## Query collections

Often, you work with collections of artifacts, which {class}`~lamindb.Collection` helps you manage.

Let's look at the collection that corresponds to the `cellxgene-census` release of `.h5ad` artifacts:

In [None]:
collection = ln.Collection.filter(name="cellxgene-census", version="2024-07-01").one()
collection

You can count all contained artifacts or get them as a dataframe.

In [None]:
collection.artifacts.count()

In [None]:
collection.artifacts.df().head()  # not tracking run & transform because read-only instance

You can query across artifacts by arbitrary metadata combinations, for instance:

In [None]:
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.df().head()  # convert to DataFrame

## Query arrays

```{note}

Here, we discuss slicing individual `AnnData` arrays. If you want to slice a large concatenated array store, see the [tiledbsoma guide](query-census).

```

In the query above, each artifact stores an array in the form of an `.h5ad` file, which corresponds to an `AnnData` object.

Let's look at the first array in the query and show its metadata using `.describe()`.

In [None]:
artifact = query.first()
artifact.describe()

:::{dropdown} More ways of accessing metadata

Access just features:

```
artifact.features
```

Or get labels given a feature:

```
artifact.labels.get(features.tissue).df()
```

```
artifact.labels.get(features.collection).one()
```

:::



If you want to query a slice of the array data, you have two options:

1. Cache & load the entire array into memory via `artifact.load() -> AnnData`. This caches the `.h5ad` file on disk, so you only download it once.
2. Stream the array using a (cloud-backed) accessor via `artifact.open() -> AnnDataAccessor`.

Both options run much faster close to the data, which is stored on AWS S3 on the US West Coast. Consider logging into hosted compute there.
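
One way to choose between the two options is by file size. The sketch below is only an illustration: `load_or_stream`, `max_in_memory_bytes`, and the 2 GiB default are hypothetical, and `FakeArtifact` is a stand-in so the example runs without a database connection. It relies on the artifact exposing `size` in bytes, the same field used in the `size__gt` filter above.

```python
def load_or_stream(artifact, max_in_memory_bytes=2 * 1024**3):
    """Load small arrays into memory; open large ones as a streaming accessor.

    Hypothetical helper: the threshold is an arbitrary choice for this sketch.
    """
    if artifact.size is not None and artifact.size <= max_in_memory_bytes:
        return artifact.load()  # downloads into the local cache, then loads into memory
    return artifact.open()  # streams from storage without downloading the whole file


class FakeArtifact:
    """Stand-in that mimics the three attributes the helper touches."""

    def __init__(self, size):
        self.size = size

    def load(self):
        return "in-memory"

    def open(self):
        return "streamed"


small = load_or_stream(FakeArtifact(size=10**6))
large = load_or_stream(FakeArtifact(size=10**11))
print(small, large)
```

A sensible threshold depends on the memory of the machine you run on, so treat the default as a placeholder.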

Cache & load:

In [None]:
adata = artifact.load()
adata

Now we have an `AnnData` object, which stores observation annotations matching our artifact-level query in the `.obs` slot, and we can re-use almost the same query on the array-level.

:::{dropdown} See the array-level query

```
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice
```

:::


:::{dropdown} See the artifact-level query

```
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
```

`AnnData` uses pandas to manage metadata and the syntax differs slightly. However, the same metadata records are used.

:::
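
The syntax difference comes down to how conditions are combined: keyword arguments passed to `.filter()` are joined with a logical AND, whereas in pandas each condition is a boolean Series that you combine explicitly with `&`. Here is a minimal sketch on a toy observation table; the column values and index labels are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `adata.obs`.
obs = pd.DataFrame(
    {
        "cell_type": ["dendritic cell", "neutrophil", "B cell", "neutrophil"],
        "tissue": ["kidney", "kidney", "kidney", "lung"],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Each condition yields a boolean Series; `&` combines them element-wise.
# The parentheses around `==` are required because `&` binds tighter than `==`.
mask = obs["cell_type"].isin(["dendritic cell", "neutrophil"]) & (
    obs["tissue"] == "kidney"
)

selected = obs[mask]
print(selected.index.tolist())
```

Indexing an `AnnData` object with such a mask, as in `adata[mask]`, selects the matching observations in the same way.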

Stream:

In [None]:
adata_backed = artifact.open()
adata_backed

We now have an `AnnDataAccessor` object, which behaves much like an `AnnData`, and the query looks the same.

:::{dropdown} See the query

```
adata_backed_slice = adata_backed[
    adata_backed.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata_backed.obs.tissue == tissues.kidney.name)
    & (adata_backed.obs.suspension_type == suspension_types.cell.name)
    & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
]

adata_backed_slice.to_memory()
```

:::

## Train ML models

You can directly train ML models on very large collections of AnnData objects.

See {doc}`docs:scrna5`.

## Exploring data by collection

Alternatively, you can:

- [search for a file on the LaminHub UI](https://lamin.ai/laminlabs/cellxgene/artifacts) and fetch it through `ln.Artifact.get(uid)`
- query for a collection you found on [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/collections)

Let's search the collections from CELLxGENE within the 2024-07-01 release:

In [None]:
ln.Collection.filter(version="2024-07-01").search("immune human kidney", limit=10)

Let's get the record of the top hit collection:

In [None]:
collection = ln.Collection.get("kqiPjpzpK9H9rdtnV67f")
collection

We see it's a Science paper and we could find more information using the [DOI](https://doi.org/10.1126/science.aat5031) or CELLxGENE [collection id](https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec).

Check different versions of this collection:

In [None]:
collection.versions.df()

Each collection has at least one {class}`~lamindb.Artifact` associated with it. Let's get the associated artifacts:

In [None]:
collection.artifacts.df()

```{toctree}
:maxdepth: 1
:hidden:

query-census
cellxgene-curate
```