Note: This tutorial has been tested with sfaira version 0.3.11

# Dataloader Tutorial

One of Sfaira's main features is easy access to publicly (or privately) contributed dataloaders. All loaded data comes in a common format allowing for homogeneous downstream analysis without fighting data sources.
This tutorial focuses on using Sfaira dataloaders to access data.

In [1]:
import sfaira
import os

# Set this path to your local sfaira data repository
basedir = '.'
datadir = os.path.join(basedir, 'raw')
metadir = os.path.join(basedir, 'meta')
cachedir = os.path.join(basedir, 'cache')
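
The cell above only builds path strings. Whether sfaira creates missing directories on its own is not stated in this tutorial, so the following minimal sketch creates them up front using only the standard library. The helper name `make_repo_dirs` is hypothetical and not part of sfaira, and it is demonstrated on a temporary directory rather than on your real `basedir`.

```python
import os
import tempfile


def make_repo_dirs(basedir):
    """Create raw/meta/cache sub-directories under basedir and return their paths.

    Hypothetical helper for illustration; not part of sfaira.
    """
    paths = {name: os.path.join(basedir, name) for name in ("raw", "meta", "cache")}
    for path in paths.values():
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return paths


# Demonstrated on a temporary directory; pass your own basedir in practice.
demo_paths = make_repo_dirs(tempfile.mkdtemp())
print(sorted(demo_paths))
```

The returned dictionary keys match the sub-directory names used for `datadir`, `metadir` and `cachedir` above.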

# Load all data sets from an organ

In [2]:
# Here we choose human eye.
# The Universe contains instances of Dataset, each of which corresponds to an individual data set.
ds = sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)  # This links all data sets available
ds.subset(key="organism", values=["Homo sapiens"])  # Subset to all human datasets
ds.subset(key="organ", values=["eye"])  # Subset further to eye organ
ds.download()  # Download the selected datasets to your local sfaira data repository
ds.load(verbose=1)  # This loads the anndata objects into memory
ds.streamline_features(match_to_release="104", subset_genes_to_type="protein_coding")  # Choose a reference genome by ensembl release and subset to only protein-coding genes
ds.streamline_metadata(schema="sfaira")  # Make sure the metadata annotations of all datasets are in line with the sfaira schema, so they can be cleanly concatenated in the next step
print(ds.adata)  # Use the adata object for your analysis or modelling.

Ontology <class 'sfaira.versions.metadata.base.OntologyMondo'> is not a DAG, treat child-parent reasoning with care.
Ontology <class 'sfaira.versions.metadata.base.OntologyUberon'> is not a DAG, treat child-parent reasoning with care.
Ontology <class 'sfaira.versions.metadata.base.OntologyUberonLifecyclestage'> is not a DAG, treat child-parent reasoning with care.
Downloading: menon19.processed.h5ad
Downloading: voigt19.processed.h5ad
Downloading: lukowski19.processed.h5ad
loading homosapiens_retina_2019_10x3v3_menon_001_10.1038/s41467-019-12780-8
loading homosapiens_retina_2019_10x3v3_voigt_001_10.1073/pnas.1914143116
loading homosapiens_retina_2019_10x3v2_lukowski_001_10.15252/embj.2018100811


  self._set_arrayXarray_sparse(i, j, x)
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.


AnnData object with n_obs × n_vars = 44120 × 19966
    obs: 'assay_sc', 'assay_differentiation', 'assay_type_differentiation', 'bio_sample', 'cell_line', 'development_stage', 'disease', 'ethnicity', 'id', 'individual', 'organ', 'organism', 'sex', 'state_exact', 'sample_source', 'tech_sample', 'assay_sc_ontology_term_id', 'cell_line_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'ethnicity_ontology_term_id', 'organ_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'dataset'
    var: 'ensembl', 'gene_symbol'
    uns: 'year', 'default_embedding', 'doi_journal', 'author', 'remove_gene_version', 'ethnicity', 'cell_line', 'download_url_data', 'state_exact', 'title', 'cell_type', 'assay_differentiation', 'primary_data', 'assay_type_differentiation', 'mapped_features', 'load_raw', 'id', 'organism', 'sample_source', 'normalization', 'download_url_meta', 'sex', 'disease', 'development_stage', '

# Load selected datasets for one organ

In [3]:
ds = sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)
ds.subset(key="organism", values=["Homo sapiens"])  # subset to human data sets
ds.subset(key="organ", values=["lung"])  # subset further to lung data sets
ds.ids  # Display the datasets remaining following subsetting

['homosapiens_lung_2020_10x3v2_miller_001_10.1016/j.devcel.2020.01.033',
 'homosapiens_lungparenchyma_2019_10x3transcriptionprofiling_braga_001_10.1038/s41591-019-0468-5',
 'homosapiens_lung_2019_dropseq_braga_001_10.1038/s41591-019-0468-5',
 'homosapiens_lungparenchyma_2020_None_habermann_001_10.1126/sciadv.aba1972',
 'homosapiens_lung_2019_10x3transcriptionprofiling_szabo_001_10.1038/s41467-019-12464-3',
 'homosapiens_lung_2019_10x3transcriptionprofiling_szabo_002_10.1038/s41467-019-12464-3',
 'homosapiens_lung_2019_10x3transcriptionprofiling_szabo_007_10.1038/s41467-019-12464-3',
 'homosapiens_lung_2019_10x3transcriptionprofiling_szabo_008_10.1038/s41467-019-12464-3',
 'homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x',
 'homosapiens_lung_2020_10x3v2_lukassen_001_10.15252/embj.20105114',
 'homosapiens_lung_2020_10x3v2_lukassen_002_10.15252/embj.20105114',
 'homosapiens_lung_2020_10x3v2_travaglini_001_10.1038/s41586-020-2922-4',
 'homosapiens_lung_2020_s
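
The IDs listed above appear to follow an underscore-separated layout (organism, organ, year, assay, first author, running index, DOI). That reading is inferred from the visible IDs only and is not confirmed by the sfaira documentation here, so treat the field names in this sketch as assumptions. The helper `split_dataset_id` is hypothetical and uses plain string handling, not a sfaira API.

```python
from typing import NamedTuple


class DatasetIdParts(NamedTuple):
    # Field names are assumptions inferred from the IDs shown in this tutorial.
    organism: str
    organ: str
    year: str
    assay: str
    author: str
    index: str
    doi: str


def split_dataset_id(dataset_id):
    """Split an ID string into seven fields (hypothetical helper, not part of sfaira)."""
    # Split on the first six underscores only, so the trailing DOI is kept whole.
    parts = dataset_id.split("_", 6)
    if len(parts) != 7:
        raise ValueError(f"unexpected dataset id layout: {dataset_id!r}")
    return DatasetIdParts(*parts)


parts = split_dataset_id(
    "homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x"
)
print(parts.organ, parts.author, parts.doi)
```

Something like this can help when scanning a long `ds.ids` list for a particular assay or publication before subsetting by `id`.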

In [4]:
# pick a specific dataset and check whether it has cell type annotations (in case that matters to you)
ds.datasets['homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x'].annotated

True
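
The same `.annotated` check can be applied to every remaining dataset at once. The sketch below assumes that `ds.datasets` is dict-like (the cell above indexes it by ID, and `.items()` is assumed to work as well). To stay runnable without sfaira or any downloads, it uses stand-in objects with made-up IDs; `annotated_ids` is a hypothetical helper name.

```python
from types import SimpleNamespace


def annotated_ids(datasets):
    """Return the IDs of all entries whose `.annotated` attribute is truthy."""
    return [dataset_id for dataset_id, dataset in datasets.items() if dataset.annotated]


# Stand-in for `ds.datasets`, with made-up IDs, so the sketch runs without sfaira.
fake_datasets = {
    "dataset_a": SimpleNamespace(annotated=True),
    "dataset_b": SimpleNamespace(annotated=False),
    "dataset_c": SimpleNamespace(annotated=True),
}
print(annotated_ids(fake_datasets))
```

With a real universe, passing `ds.datasets` instead of `fake_datasets` should give the IDs to hand to `ds.subset(key="id", values=...)`, provided the dict-like assumption holds.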

In [5]:
# subset to the selected dataset
ds.subset(key="id", values=["homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x"])  # keeps only this one data set

# download and load the specific dataset
ds.download()
ds.load(verbose=1)

# get the unmodified adata object of the dataset
adata = ds.datasets['homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x'].adata
print(adata)

Downloading: madissoon19_lung.processed.h5ad
loading homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x
AnnData object with n_obs × n_vars = 57020 × 25204
    obs: 'Donor', 'Time', 'donor_time', 'leiden', 'patient', 'sample', 'Celltypes'
    var: 'gene.ids.HCATisStab7509734', 'gene.ids.HCATisStab7509735', 'gene.ids.HCATisStab7509736', 'gene.ids.HCATisStab7587202', 'gene.ids.HCATisStab7587205', 'gene.ids.HCATisStab7587208', 'gene.ids.HCATisStab7587211', 'gene.ids.HCATisStab7646032', 'gene.ids.HCATisStab7646033', 'gene.ids.HCATisStab7646034', 'gene.ids.HCATisStab7646035', 'gene.ids.HCATisStab7659968', 'gene.ids.HCATisStab7659969', 'gene.ids.HCATisStab7659970', 'gene.ids.HCATisStab7659971', 'gene.ids.HCATisStab7747197', 'gene.ids.HCATisStab7747198', 'gene.ids.HCATisStab7747199', 'gene.ids.HCATisStab7747200', 'n.cells'
    obsm: 'X_pca', 'X_tsne', 'X_umap'


In [6]:
ds.datasets['homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x'].streamline_features(match_to_release="104", subset_genes_to_type="protein_coding")  # match the feature space to a reference annotation
ds.datasets['homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x'].streamline_metadata(schema="sfaira")  # convert the metadata annotation to the sfaira standard
adata = ds.datasets['homosapiens_lungparenchyma_2019_10x3v2_madissoon_001_10.1186/s13059-019-1906-x'].adata  # get the streamlined adata object of the selected dataset
print(adata)

Variable names are not unique. To make them unique, call `.var_names_make_unique`.


AnnData object with n_obs × n_vars = 57020 × 19966
    obs: 'assay_sc', 'assay_differentiation', 'assay_type_differentiation', 'bio_sample', 'cell_line', 'development_stage', 'disease', 'ethnicity', 'id', 'individual', 'organ', 'organism', 'sex', 'state_exact', 'sample_source', 'tech_sample', 'assay_sc_ontology_term_id', 'cell_line_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'ethnicity_ontology_term_id', 'organ_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id'
    var: 'ensembl', 'gene_symbol'
    uns: 'annotated', 'author', 'default_embedding', 'doi_journal', 'doi_preprint', 'download_url_data', 'download_url_meta', 'normalization', 'primary_data', 'title', 'year', 'load_raw', 'mapped_features', 'remove_gene_version', 'assay_sc', 'assay_differentiation', 'assay_type_differentiation', 'cell_line', 'cell_type', 'development_stage', 'disease', 'ethnicity', 'id', 'organ', 'organism', '

# Creating streamlined cross-organ datasets

In [7]:
ds = sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)  # This links all data sets available
ds.subset(key="organism", values=["Mus musculus"])  # subset to mouse datasets
ds.subset(key="organ", values=["lung", "liver"])  # subset further to liver and lung data sets
ds.download()  # Download the selected datasets to your local sfaira data repository
ds.load(verbose=1)  # This loads the anndata objects into memory
ds.streamline_features(match_to_release="104", subset_genes_to_type="protein_coding")  # Choose a reference genome and subset to only protein-coding genes
ds.streamline_metadata(schema="sfaira")  # Make sure the metadata annotations of all datasets are in line with the sfaira schema, so they can be cleanly concatenated in the next step
print(ds.adata)  # Use the adata object for your analysis or modelling.

Downloading: tabula-muris-senis-droplet-processed-official-annotations-Liver.h5ad
Downloading: tabula-muris-senis-facs-processed-official-annotations-Liver.h5ad
Downloading: tabula-muris-senis-droplet-processed-official-annotations-Lung.h5ad
Downloading: tabula-muris-senis-facs-processed-official-annotations-Lung.h5ad
Downloading: 5435866.zip
loading musmusculus_liver_2019_10x3v2_pisco_020_10.1038/s41586-020-2496-1



This is where adjacency matrices should go now.
  warn(

This is where adjacency matrices should go now.
  warn(
... storing 'development_stage' as categorical


loading musmusculus_liver_2019_smartseq2_pisco_021_10.1038/s41586-020-2496-1


... storing 'development_stage' as categorical


loading musmusculus_lung_2019_10x3v2_pisco_022_10.1038/s41586-020-2496-1


... storing 'development_stage' as categorical


loading musmusculus_lung_2019_smartseq2_pisco_023_10.1038/s41586-020-2496-1


... storing 'development_stage' as categorical


loading musmusculus_liver_2018_microwellseq_han_013_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_lung_2018_microwellseq_han_014_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_liver_2018_microwellseq_han_020_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_liver_2018_microwellseq_han_021_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_lung_2018_microwellseq_han_022_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_lung_2018_microwellseq_han_023_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_lung_2018_microwellseq_han_024_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_blood_2018_microwellseq_han_046_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_blood_2018_microwellseq_han_047_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_blood_2018_microwellseq_han_048_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_blood_2018_microwellseq_han_049_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_blood_2018_microwellseq_han_050_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical


loading musmusculus_blood_2018_microwellseq_han_051_10.1016/j.cell.2018.02.001


... storing 'ClusterID' as categorical
... storing 'Tissue' as categorical
... storing 'Batch' as categorical
... storing 'Annotation' as categorical
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are n

AnnData object with n_obs × n_vars = 67783 × 21885
    obs: 'assay_sc', 'assay_differentiation', 'assay_type_differentiation', 'bio_sample', 'cell_line', 'development_stage', 'disease', 'id', 'individual', 'organ', 'organism', 'sex', 'state_exact', 'sample_source', 'tech_sample', 'assay_sc_ontology_term_id', 'cell_line_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'ethnicity_ontology_term_id', 'ethnicity', 'organ_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'dataset'
    var: 'ensembl', 'gene_symbol'
    uns: 'year', 'default_embedding', 'doi_journal', 'author', 'remove_gene_version', 'ethnicity', 'cell_line', 'download_url_data', 'state_exact', 'title', 'cell_type', 'assay_differentiation', 'primary_data', 'assay_type_differentiation', 'mapped_features', 'load_raw', 'id', 'organism', 'sample_source', 'normalization', 'download_url_meta', 'sex', 'disease', 'development_stage', '
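
The organ-level and cross-organ cells in this tutorial repeat the same sequence of calls: subset, download, load, streamline features, streamline metadata. As a rough sketch, that sequence could be wrapped in one function. `load_streamlined` is a hypothetical helper, not part of sfaira; it only uses the method names and keyword arguments that appear in the cells above. The `RecordingUniverse` stub exists purely so the example runs without sfaira installed and without downloading anything.

```python
def load_streamlined(universe, organism, organs, release="104", gene_type="protein_coding"):
    """Apply the subset/download/load/streamline steps used in this tutorial.

    Hypothetical helper; `universe` is expected to behave like sfaira.data.Universe.
    """
    universe.subset(key="organism", values=[organism])
    universe.subset(key="organ", values=list(organs))
    universe.download()
    universe.load(verbose=1)
    universe.streamline_features(match_to_release=release, subset_genes_to_type=gene_type)
    universe.streamline_metadata(schema="sfaira")
    return universe.adata


class RecordingUniverse:
    """Stub that records method names, so the sketch runs without sfaira or downloads."""

    adata = "placeholder for the concatenated AnnData object"

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that are not defined, i.e. the method calls.
        def record(**kwargs):
            self.calls.append(name)
        return record


stub = RecordingUniverse()
result = load_streamlined(stub, "Mus musculus", ["lung", "liver"])
print(stub.calls)
```

With sfaira installed, passing `sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)` in place of the stub should mirror the cross-organ cell above. A fresh universe is needed for each call, because `subset` narrows the selection in place.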