# Setup

In [1]:
import anndata
import os

import sfaira



The following path will be used as a cache for downloads from the cellxgene data server:

In [2]:
cache_path = os.path.join(".", "data")
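
If you prefer to be sure the cache directory exists before anything is downloaded, a minimal sketch like the following can be run first. Whether sfaira creates the directory on its own is not stated here, so treat this as an optional precaution rather than a required step.

```python
import os

cache_path = os.path.join(".", "data")

# exist_ok=True turns the call into a no-op when the directory is already present.
os.makedirs(cache_path, exist_ok=True)
print(os.path.isdir(cache_path))
```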

# TLDR

All code necessary to download and load a collection in one go:

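``` python
import os
from sfaira.data.dataloaders.databases import DatasetSuperGroupDatabases

cache_path = os.path.join(".", "data")  # download cache
target_collections = ["9c8808ce-1138-4dbe-818c-171cff10e650"]  # collection ID(s) to fetch

dsg = DatasetSuperGroupDatabases(data_path=cache_path)
dsg.subset(key="collection_id", values=target_collections)
dsg.download()
dsg.load()
adatas = dsg.adata_ls  # list of AnnData objects, one per data set
```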

# Sfaira basics

We will be working with the `Dataset` API of sfaira, which you can also use to load sfaira-maintained data sets. Here, we use a specific part of the API that interfaces the cellxgene database as a group of data set groups (a super group in sfaira nomenclature): each cellxgene collection is a data set group, and each cellxgene-maintained h5ad file maps to one `Dataset` class in sfaira.

See this API summary for more details on the `Dataset` API, `DatasetGroup`s and `DatasetSuperGroup`s: https://sfaira.readthedocs.io/en/latest/api.html#module-sfaira.data

# Interaction with database

In [3]:
dsg = sfaira.data.dataloaders.databases.DatasetSuperGroupDatabases(data_path=cache_path)

Ontology <class 'sfaira.versions.metadata.base.OntologyUberonLifecyclestage'> is not a DAG, treat child-parent reasoning with care.
Ontology <class 'sfaira.versions.metadata.base.OntologyMondo'> is not a DAG, treat child-parent reasoning with care.
Ontology <class 'sfaira.versions.metadata.base.OntologyUberon'> is not a DAG, treat child-parent reasoning with care.


Let's check out which collections are available from cellxgene (you can also manually check on the website https://cellxgene.cziscience.com/):

In [4]:
dsg.show_summary()

37b21763-7f0f-41ae-9001-60bad6e2841d
	 ('cellxgene', 'Homo sapiens', 'islet of Langerhans', "10x 3' transcription profiling", 'healthy')
b07e5164-baf6-43d2-bdba-5a249d0da879
	 ('cellxgene', 'Homo sapiens', 'pancreas', 'CEL-seq2', 'healthy')
9dbab10c-118d-496b-966a-67f1763a6b7d
	 ('cellxgene', 'Homo sapiens', 'blood', "10x 3' v3", 'COVID-19')
030faa69-ff79-4d85-8630-7c874a114c19
	 ('cellxgene', 'Homo sapiens', None, "10x 3' v3", 'COVID-19')
28c696bb-9549-434b-9340-dc745a846f9a
	 ('cellxgene', 'Mus musculus', 'frontal cortex', 'Smart-seq', 'healthy')
4b9e0a15-c006-45d9-860f-b8a43ccf7d9d
	 ('cellxgene', 'Homo sapiens', None, 'Drop-seq', 'COVID-19')
ca421096-6240-4cee-8c12-d20899b3e005
	 ('cellxgene', 'Homo sapiens', None, 'Drop-seq', 'COVID-19')
dd018fc0-8da7-4033-a2ba-6b47de8ebb4f
	 ('cellxgene', 'Homo sapiens', 'peripheral zone of prostate', "10x 3' v2", 'benign prostatic hyperplasia (disease)')
2a262b59-7936-4ecd-b656-248247a0559f
	 ('cellxgene', 'Mus musculus', 'prostate gland', "10x 

We can also look at properties of each collection (data set group), such as its ID:

In [5]:
print([x.collection_id for x in dsg.dataset_groups])

['51544e44-293b-4c2b-8c26-560678423380', '6e8c5415-302c-492a-a5f9-f29c57ff18fb', '0a839c4b-10d0-4d64-9272-684c49a2c8ba', '2a79d190-a41e-4408-88c8-ac5c4d03c0fc', '45f0f67d-4b69-4a3c-a4e8-a63b962e843f', 'd0e9c47b-4ce7-4f84-b182-eddcfa0b2658', '4b54248f-2165-477c-a027-dd55082e8818', 'c9706a92-0e5f-46c1-96d8-20e42467f287', '180bff9c-c8a5-4539-b13b-ddbc00d643e6', '5e469121-c203-4775-962d-dcf2e5d6a472', '4f889ffc-d4bc-4748-905b-8eb9db47a2ed', '367d95c0-0eb0-4dae-8276-9407239421ee', '9b02383a-9358-4f0f-9795-a891ec523bcc', 'cdfb9ead-cb58-4a53-879d-5e4ed5329e73', '7d7cabfd-1d1f-40af-96b7-26a0825a306d', '625f6bf4-2f33-4942-962e-35243d284837', '44531dd9-1388-4416-a117-af0a99de2294', '48d354f5-a5ca-4f35-a3bb-fa3687502252', '00109df5-7810-4542-8db5-2288c46e0424', '5d445965-6f1a-4b68-ba3a-b8f765155d3a', 'af893e86-8e9f-41f1-a474-ef05359b1fb7', '24d42e5e-ce6d-45ff-a66b-a3b3b715deaf', '2f75d249-1bec-459b-bf2b-b86221097ced', 'a238e9fa-2bdf-41df-8522-69046f99baff', '38833785-fac5-48fd-944a-0f62a4c23ed1',

We will choose one collection for now to keep download time low. You can also select any number of collections at once, or skip subsetting entirely and download the entire cellxgene database. Note that all h5ad files will be saved into `cache_path`, so make sure to plan enough storage on that partition!
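
To get a rough idea of how much room is left before starting a large download, the standard library can report free space on a partition. This is a minimal sketch; the helper name `free_gib` is introduced here for illustration, and the path passed in must already exist.

```python
import shutil

def free_gib(path="."):
    """Return the free space, in GiB, on the partition that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3

# Check the partition of the current working directory, where "./data" lives.
print(f"{free_gib('.'):.1f} GiB free")
```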

In [6]:
target_collections = ["9c8808ce-1138-4dbe-818c-171cff10e650"]

Note that this collection is also described on the website at `https://cellxgene.cziscience.com/collections/{target_collections[i]}`, i.e. https://cellxgene.cziscience.com/collections/9c8808ce-1138-4dbe-818c-171cff10e650
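
The URL pattern above can be applied to every selected collection at once. A small sketch, in which `base_url` and `collection_urls` are names chosen here for illustration:

```python
target_collections = ["9c8808ce-1138-4dbe-818c-171cff10e650"]

# URL pattern taken from the note above: one page per collection ID.
base_url = "https://cellxgene.cziscience.com/collections"
collection_urls = [f"{base_url}/{cid}" for cid in target_collections]

for url in collection_urls:
    print(url)
```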

In [7]:
dsg.subset(key="collection_id", values=target_collections)

Let's briefly check that this subsetting did indeed leave us with one collection:

In [8]:
dsg.datasets

{'26ae14da-9e5f-4d18-abae-18a5a328feef': <sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader.Dataset at 0x7fb7b1190a00>,
 'cfa3c355-ee77-4fc8-9a00-78e61d23024c': <sfaira.data.dataloaders.databases.cellxgene.cellxgene_loader.Dataset at 0x7fb7b1190250>}

Note that `9c8808ce-1138-4dbe-818c-171cff10e650` is a collection: it corresponds to one study or project and contains two data sets here!

Now, we can proceed to download all selected collections:

In [9]:
dsg.download()

You can check out the downloaded files in your cache directory if you like!

In [10]:
os.listdir(cache_path)

['9c8808ce-1138-4dbe-818c-171cff10e650']
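
The top-level listing only shows one directory per collection. To see the individual h5ad files and their sizes, you can walk the cache directory. This is a sketch using only the standard library; the helper name `list_h5ad_files` is hypothetical, and it assumes the files carry the `.h5ad` extension.

```python
import os

def list_h5ad_files(cache_path):
    """Return sorted (relative_path, size_in_bytes) pairs for all .h5ad files under cache_path."""
    found = []
    for root, _dirs, files in os.walk(cache_path):
        for name in files:
            if name.endswith(".h5ad"):
                full = os.path.join(root, name)
                found.append((os.path.relpath(full, cache_path), os.path.getsize(full)))
    return sorted(found)

# Prints nothing if the cache directory is empty or does not exist yet.
for rel_path, size in list_h5ad_files(os.path.join(".", "data")):
    print(f"{rel_path}: {size / 1024**2:.1f} MiB")
```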

Because we are using a defined cache directory, the next time we try downloading this collection, the command will return directly because the files are already there:

In [11]:
dsg.download()
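
The general idea behind this kind of cache can be sketched as "skip the download if the target file exists". The code below is an illustration of that pattern only, not sfaira's actual implementation; `download_if_missing` and `fetch` are hypothetical names.

```python
import os
import tempfile

def download_if_missing(path, fetch):
    """Write fetch() to `path` unless the file already exists.

    Returns True if a download happened and False on a cache hit.
    `fetch` is a hypothetical callable that returns the file content as bytes.
    """
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(fetch())
    return True

# Demonstrate with a temporary directory and dummy content.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "collection", "dataset.h5ad")
first = download_if_missing(demo_file, lambda: b"payload")
second = download_if_missing(demo_file, lambda: b"payload")
print(first, second)  # True False
```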

# Usage of downloads

Next, we can load the selected collections from disk into memory, for example to access the adata objects:

In [12]:
dsg.load()

loading 26ae14da-9e5f-4d18-abae-18a5a328feef
loading cfa3c355-ee77-4fc8-9a00-78e61d23024c


We can access the list of all adatas associated with this collection or specifically query individual ones from the sfaira `dsg` object:

In [13]:
adatas = dsg.adata_ls
adatas

[AnnData object with n_obs × n_vars = 5625 × 14872
     obs: 'sample_strain', 'tissue_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'phase', 'nUMI', 'nGene', 'assay_ontology_term_id', 'cell_type_original', 'ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'ethnicity', 'development_stage'
     var: 'feature_biotype', 'feature_is_filtered', 'feature_name', 'feature_reference'
     uns: 'X_normalization', 'contributors', 'default_embedding', 'layer_descriptions', 'preprint_doi', 'publication_doi', 'schema_version', 'title'
     obsm: 'X_tSpace', 'X_tUMAP',
 AnnData object with n_obs × n_vars = 4355 × 14989
     obs: 'tissue_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'ethnicity_ontology_term_id', 'phase', 'nUMI', 'nGene', 'assay_ontology_term_id', 'cell_type_original', 'develo

In [14]:
adata = dsg.datasets["26ae14da-9e5f-4d18-abae-18a5a328feef"].adata
adata

AnnData object with n_obs × n_vars = 5625 × 14872
    obs: 'sample_strain', 'tissue_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'phase', 'nUMI', 'nGene', 'assay_ontology_term_id', 'cell_type_original', 'ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'ethnicity', 'development_stage'
    var: 'feature_biotype', 'feature_is_filtered', 'feature_name', 'feature_reference'
    uns: 'X_normalization', 'contributors', 'default_embedding', 'layer_descriptions', 'preprint_doi', 'publication_doi', 'schema_version', 'title'
    obsm: 'X_tSpace', 'X_tUMAP'

You can also use anndata and a manually assembled path to load these objects:

In [15]:
adata2 = anndata.read_h5ad(os.path.join(
    cache_path, 
    "9c8808ce-1138-4dbe-818c-171cff10e650/26ae14da-9e5f-4d18-abae-18a5a328feef.h5ad"
))
adata2

AnnData object with n_obs × n_vars = 5625 × 14872
    obs: 'sample_strain', 'tissue_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'phase', 'nUMI', 'nGene', 'assay_ontology_term_id', 'cell_type_original', 'ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'is_primary_data', 'organism_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'ethnicity', 'development_stage'
    var: 'feature_biotype', 'feature_is_filtered', 'feature_name', 'feature_reference'
    uns: 'X_normalization', 'contributors', 'default_embedding', 'layer_descriptions', 'preprint_doi', 'publication_doi', 'schema_version', 'title'
    obsm: 'X_tSpace', 'X_tUMAP'
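
The manually assembled path follows the layout `<cache_path>/<collection_id>/<dataset_id>.h5ad` seen above. A small helper can build it from the two IDs; this is a sketch, the name `h5ad_path` is introduced here for illustration, and the layout is assumed from the single example in this notebook.

```python
import os

def h5ad_path(cache_path, collection_id, dataset_id):
    """Build the path of a cached h5ad file from its collection and data set IDs."""
    return os.path.join(cache_path, collection_id, f"{dataset_id}.h5ad")

path = h5ad_path(
    os.path.join(".", "data"),
    "9c8808ce-1138-4dbe-818c-171cff10e650",
    "26ae14da-9e5f-4d18-abae-18a5a328feef",
)
print(path)
```

The resulting path can be passed to `anndata.read_h5ad` in the same way as in the cell above.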