# Dataloader Tutorial

One of Sfaira's main features is easy access to publicly (or privately) contributed dataloaders. All loaded data comes in a common format, which allows homogeneous downstream analysis without having to handle each data source separately.
This tutorial focuses on using Sfaira dataloaders to access data.

In [1]:
import sfaira
import os

# Set this path to your local sfaira data repository
basedir = '.'
datadir = os.path.join(basedir, 'raw')
metadir = os.path.join(basedir, 'meta')
cachedir = os.path.join(basedir, 'cache')



Ontology <class 'sfaira.versions.metadata.base.OntologyMondo'> is not a DAG, treat child-parent reasoning with care.
Ontology <class 'sfaira.versions.metadata.base.OntologyUberon'> is not a DAG, treat child-parent reasoning with care.
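
The three paths above point to subdirectories of `basedir`. As a minimal sketch, assuming you want these directories to exist before pointing Sfaira at them (whether Sfaira creates them on its own is not covered here), they can be created with the standard library. The helper name `make_data_dirs` is hypothetical and not part of Sfaira, and the example runs on a temporary directory so that it does not touch your real data repository.

```python
import os
import tempfile


def make_data_dirs(basedir):
    """Create raw/, meta/ and cache/ under basedir if missing and return their paths.

    Hypothetical helper for illustration; not part of the sfaira API.
    """
    paths = {name: os.path.join(basedir, name) for name in ("raw", "meta", "cache")}
    for path in paths.values():
        # exist_ok=True makes the call safe to repeat on an existing repository
        os.makedirs(path, exist_ok=True)
    return paths


# Demonstration on a throwaway directory; pass your own basedir instead
demo_basedir = tempfile.mkdtemp()
demo_dirs = make_data_dirs(demo_basedir)
print(sorted(demo_dirs))
```

The returned paths can then be passed as `data_path`, `meta_path` and `cache_path` in the cells that follow.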


# Load all data sets from an organ

In [3]:
# Here we choose mouse pancreas.
# The Universe contains instances of Dataset which correspond to individual data sets.
ds = sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)  # This links all data sets available
ds.subset(key="organism", values=["mouse"])  # Subset to all mouse datasets
ds.subset(key="organ", values=["pancreas"])  # Subset further to pancreas organ
ds.load()  # This loads the anndata objects into memory
ds.streamline_features(match_to_reference={"human": "Homo_sapiens.GRCh38.102", "mouse": "Mus_musculus.GRCm38.102"}, subset_genes_to_type="protein_coding")  # Choose a reference genome and subset to only protein-coding genes
ds.streamline_metadata(schema="sfaira")  # make sure the metadata annotation of all datasets is in line with the sfaira schema, so that they can be cleanly concatenated in the next step
print(ds.adata) # Use the adata object for your analysis or modelling.

loading mouse_pancreas_2019_10xsequencing_pisco_028_10.1101/661728
loading mouse_pancreas_2019_smartseq2_pisco_029_10.1101/661728
loading mouse_pancreas_2019_10xsequencing_thompson_001_10.1016/j.cmet.2019.01.021
loading mouse_pancreas_2019_10xsequencing_thompson_002_10.1016/j.cmet.2019.01.021
loading mouse_pancreas_2019_10xsequencing_thompson_003_10.1016/j.cmet.2019.01.021
loading mouse_pancreas_2019_10xsequencing_thompson_004_10.1016/j.cmet.2019.01.021
loading mouse_pancreas_2019_10xsequencing_thompson_005_10.1016/j.cmet.2019.01.021
loading mouse_pancreas_2019_10xsequencing_thompson_006_10.1016/j.cmet.2019.01.021
loading mouse_pancreas_2019_10xsequencing_thompson_007_10.1016/j.cmet.2019.01.021
loading mouse_pancreas_2019_10xsequencing_thompson_008_10.1016/j.cmet.2019.01.021
loading mouse_pancreas_2018_microwellseq_han_016_10.1016/j.cell.2018.02.001
loading mouse_pancreas_2018_microwellseq_han_045_10.1016/j.cell.2018.02.001


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.


AnnData object with n_obs × n_vars = 50776 × 21936
    obs: 'development_stage', 'sex', 'cell_ontology_class', 'cell_ontology_id', 'cell_types_original', 'dataset'
    var: 'ensembl', 'names'
    uns: 'mapped_features'


# Load selected datasets for one organ

In [4]:
ds = sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)
ds.subset(key="organism", values=["mouse"])  # subset to all mouse data sets
ds.subset(key="organ", values=["pancreas"])  # subset further to pancreas data sets
ds.ids  # Display the datasets remaining following subsetting

['mouse_pancreas_2019_10xsequencing_pisco_028_10.1101/661728',
 'mouse_pancreas_2019_smartseq2_pisco_029_10.1101/661728',
 'mouse_pancreas_2019_10xsequencing_thompson_001_10.1016/j.cmet.2019.01.021',
 'mouse_pancreas_2019_10xsequencing_thompson_002_10.1016/j.cmet.2019.01.021',
 'mouse_pancreas_2019_10xsequencing_thompson_003_10.1016/j.cmet.2019.01.021',
 'mouse_pancreas_2019_10xsequencing_thompson_004_10.1016/j.cmet.2019.01.021',
 'mouse_pancreas_2019_10xsequencing_thompson_005_10.1016/j.cmet.2019.01.021',
 'mouse_pancreas_2019_10xsequencing_thompson_006_10.1016/j.cmet.2019.01.021',
 'mouse_pancreas_2019_10xsequencing_thompson_007_10.1016/j.cmet.2019.01.021',
 'mouse_pancreas_2019_10xsequencing_thompson_008_10.1016/j.cmet.2019.01.021',
 'mouse_pancreas_2018_microwellseq_han_016_10.1016/j.cell.2018.02.001',
 'mouse_pancreas_2018_microwellseq_han_045_10.1016/j.cell.2018.02.001']
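
The IDs listed above appear to follow the pattern `organism_organ_year_assay_author_index_doi`. This layout is inferred from the output shown here and is an assumption, not a documented guarantee. Under that assumption, a small sketch for picking IDs by assay could look as follows. The names `DatasetId` and `parse_dataset_id` are hypothetical helpers, not part of Sfaira.

```python
from collections import namedtuple

# Field layout inferred from the IDs printed above (assumption, not a documented format)
DatasetId = namedtuple("DatasetId", "organism organ year assay author index doi")


def parse_dataset_id(dataset_id):
    """Split an ID into its seven fields; raises TypeError if fewer fields are present."""
    # maxsplit=6 keeps the DOI in one piece even if it contains underscores
    return DatasetId(*dataset_id.split("_", 6))


# A few IDs copied from the output above; in the notebook you would iterate over ds.ids
example_ids = [
    "mouse_pancreas_2019_10xsequencing_pisco_028_10.1101/661728",
    "mouse_pancreas_2019_smartseq2_pisco_029_10.1101/661728",
    "mouse_pancreas_2018_microwellseq_han_016_10.1016/j.cell.2018.02.001",
]

# Keep only the IDs whose assay field is 10xsequencing
tenx_ids = [i for i in example_ids if parse_dataset_id(i).assay == "10xsequencing"]
print(tenx_ids)
```

The selected IDs can then be used as keys into `ds.datasets`, as in the following cells.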

In [5]:
# pick a specific dataset and check whether it has cell type annotations (in case that matters to you)
ds.datasets['mouse_pancreas_2019_10xsequencing_thompson_004_10.1016/j.cmet.2019.01.021'].annotated

True

In [6]:
# load the specific dataset
ds.datasets['mouse_pancreas_2019_10xsequencing_thompson_004_10.1016/j.cmet.2019.01.021'].load()

# get the unmodified adata object of the dataset
adata = ds.datasets['mouse_pancreas_2019_10xsequencing_thompson_004_10.1016/j.cmet.2019.01.021'].adata
print(adata)

AnnData object with n_obs × n_vars = 3701 × 27998
    obs: 'celltypes'
    var: 'ensembl', 'names'


In [7]:
dataset = ds.datasets['mouse_pancreas_2019_10xsequencing_thompson_004_10.1016/j.cmet.2019.01.021']  # bind the selected dataset once instead of repeating the lookup
dataset.streamline_features(match_to_reference={"human": "Homo_sapiens.GRCh38.102", "mouse": "Mus_musculus.GRCm38.102"}, subset_genes_to_type="protein_coding")  # match the feature space to a reference annotation
dataset.streamline_metadata(schema="sfaira")  # convert the metadata annotation to the sfaira standard
adata = dataset.adata  # get the streamlined adata object of the selected dataset
print(adata)

AnnData object with n_obs × n_vars = 3701 × 21936
    obs: 'cell_ontology_class', 'cell_ontology_id', 'cell_types_original'
    var: 'ensembl', 'names'
    uns: 'annotated', 'author', 'default_embedding', 'doi', 'download_url_data', 'download_url_meta', 'id', 'mapped_features', 'ncells', 'normalization', 'primary_data', 'title', 'year', 'load_raw', 'remove_gene_version', 'assay_sc', 'assay_differentiation', 'assay_type_differentiation', 'bio_sample', 'cell_line', 'development_stage', 'disease', 'ethnicity', 'individual', 'organ', 'organism', 'sex', 'state_exact', 'sample_source', 'tech_sample'


# Creating streamlined cross-organ datasets

In [8]:
ds = sfaira.data.Universe(data_path=datadir, meta_path=metadir, cache_path=cachedir)  # This links all data sets available
ds.subset(key="organism", values=["mouse"])  # subset to mouse datasets
ds.subset(key="organ", values=["lung", "liver"])  # subset further to liver and lung data sets
ds.load()  # This loads the anndata objects into memory
ds.streamline_features(match_to_reference={"human": "Homo_sapiens.GRCh38.102", "mouse": "Mus_musculus.GRCm38.102"}, subset_genes_to_type="protein_coding")  # Choose a reference genome and subset to only protein-coding genes
ds.streamline_metadata(schema="sfaira")  # make sure the metadata annotation of all datasets is in line with the sfaira schema, so that they can be cleanly concatenated in the next step
print(ds.adata) # Use the adata object for your analysis or modelling.

loading mouse_liver_2019_10xsequencing_pisco_020_10.1101/661728
loading mouse_liver_2019_smartseq2_pisco_021_10.1101/661728
loading mouse_lung_2019_10xsequencing_pisco_022_10.1101/661728
loading mouse_lung_2019_smartseq2_pisco_023_10.1101/661728
loading mouse_liver_2018_microwellseq_han_013_10.1016/j.cell.2018.02.001
loading mouse_lung_2018_microwellseq_han_014_10.1016/j.cell.2018.02.001
loading mouse_liver_2018_microwellseq_han_020_10.1016/j.cell.2018.02.001
loading mouse_liver_2018_microwellseq_han_021_10.1016/j.cell.2018.02.001
loading mouse_lung_2018_microwellseq_han_022_10.1016/j.cell.2018.02.001
loading mouse_lung_2018_microwellseq_han_023_10.1016/j.cell.2018.02.001
loading mouse_lung_2018_microwellseq_han_024_10.1016/j.cell.2018.02.001


Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Variable names are not unique. To make them unique, call `.var_names_make_unique`.


AnnData object with n_obs × n_vars = 60688 × 21936
    obs: 'development_stage', 'sex', 'cell_ontology_class', 'cell_ontology_id', 'cell_types_original', 'dataset'
    var: 'ensembl', 'names'
    uns: 'mapped_features'
