In [1]:
from anndata.experimental import read_backed
import fsspec, zarr
import scanpy as sc
import os
%load_ext autoreload
%autoreload 2


First we set up a custom store for tracking how many requests we make.  This is just a light wrapper around `LRUStoreCache` that prints a key whenever it is requested and is not already in the cache, i.e. whenever it actually has to be fetched.

In [2]:
class AccessTrackingStore(zarr.LRUStoreCache):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def __getitem__(self, key):
        if key not in self._values_cache:
            print(key)
        return super().__getitem__(key)
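
The same idea can be sketched without `zarr` at all, which makes it easier to see what counts as a "request".  The class below is a hypothetical stand-in (`TrackingCache`, `remote` and `store_demo` are names invented for this illustration): it is a read-through cache over any mapping that records, rather than prints, each key it had to fetch.

```python
from collections.abc import Mapping


class TrackingCache(Mapping):
    """Read-through cache over any mapping that records each cache miss.

    Hypothetical stand-in for the store above; not part of zarr.
    """

    def __init__(self, source):
        self._source = source   # the slow backing mapping (e.g. a remote store)
        self._cache = {}        # keys already fetched
        self.misses = []        # keys that had to be fetched from the source

    def __getitem__(self, key):
        if key not in self._cache:
            self.misses.append(key)               # record the "request"
            self._cache[key] = self._source[key]
        return self._cache[key]

    def __iter__(self):
        return iter(self._source)

    def __len__(self):
        return len(self._source)


# Toy "remote" store with made-up contents.
remote = {".zmetadata": b"{}", "obs/_index/0": b"abc"}
store_demo = TrackingCache(remote)
store_demo["obs/_index/0"]
store_demo["obs/_index/0"]   # served from the cache, no new miss recorded
store_demo[".zmetadata"]
```

After these three reads, `store_demo.misses` holds two keys, because the repeated read was answered from the cache.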

In [3]:
mapper = fsspec.get_mapper('https://vitessce-demo-data.storage.googleapis.com/anndata-demos/BALF_VIB-UGent_processed_cleaned.zarr/')
store = AccessTrackingStore(mapper, max_size=2**28)

In [4]:
adata = read_backed(store)

.zmetadata
obs/_index/0
obs/_index/2
obs/_index/1
obs/_index/3
obs/_index/4
obs/_index/5
obs/_index/6
obs/_index/7
var/_index/0


In [5]:
adata

AnnDataBacked object with n_obs × n_vars = 275056 × 24740
    obs: 'orig.ident', 'Age', 'Sex', 'Race', 'Ethnicity', 'BMI', 'Pre-existing heart disease', 'Pre-existing lung disease', 'Pre-existing kidney disease', 'Pre-existing diabetes', 'Pre-existing hypertension', 'Pre-existing immunocompromised condition', 'Smoking', 'SARS-CoV-2 PCR', 'SARS-CoV-2 Ab', 'Symptomatic', 'Admitted to hospital', 'Highest level of respiratory support', 'Vasoactive agents required during hospitalization', '28-day death', '28-day outcome', 'Disease classification', 'Organ System', 'Source', 'Days since hospital admission', 'SOFA', 'Technology', 'Method', 'CITE-Seq panel', 'Reference', 'Institute', 'Creation date', 'Annotation'
    var: 'feature_type', 'gene_id'
    obsm: 'X_umap'
    layers: 'X_csc'

Great! We can see that with only a few requests, we can now view all the columns available in this new `AnnDataBacked` object.  This is a great start towards understanding what our data is.
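
Why so few requests?  A plausible reading of the log above is that the column names come from the consolidated `.zmetadata` file rather than from one request per column.  The sketch below assumes the zarr v2 consolidated-metadata layout and AnnData's `column-order` attribute on dataframe groups; the document itself is a toy with made-up contents, and `list_columns` is a helper invented for this illustration.

```python
import json

# Toy consolidated-metadata document (hypothetical values, zarr v2 layout).
zmetadata_text = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "obs/.zgroup": {"zarr_format": 2},
        "obs/.zattrs": {
            "_index": "_index",
            "column-order": ["Age", "Sex", "Annotation"],
            "encoding-type": "dataframe",
        },
        "var/.zgroup": {"zarr_format": 2},
        "var/.zattrs": {
            "_index": "_index",
            "column-order": ["feature_type", "gene_id"],
            "encoding-type": "dataframe",
        },
    },
})


def list_columns(consolidated_text, group):
    """Return the column names of a dataframe group from consolidated metadata."""
    metadata = json.loads(consolidated_text)["metadata"]
    return metadata[f"{group}/.zattrs"]["column-order"]


obs_columns = list_columns(zmetadata_text, "obs")
var_columns = list_columns(zmetadata_text, "var")
```

One small JSON document is enough to answer "which columns exist?" for both `obs` and `var`, without touching any array chunk.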

Immediately, we see that this is some sort of COVID-19 dataset (from https://www.covid19cellatlas.org/index.patient.html, "Bronchoalveolar lavage fluid").  Maybe we're only interested in those who were COVID-19 positive and died from it.  Let's try to get that subset and observe what happens.

Note the type of `obs`: an xarray `Dataset`.  More information can be found at https://docs.xarray.dev/; the point here is that it gives AnnData a dataframe API that feels familiar from Pandas while keeping everything lazily loaded.

In [6]:
adata.obs

obs/orig.ident/categories/0
obs/Age/categories/0
obs/Sex/categories/0
obs/Race/categories/0
obs/Ethnicity/categories/0
obs/BMI/categories/0
obs/Method/categories/0
obs/CITE-Seq panel/categories/0
obs/Reference/categories/0
obs/Institute/categories/0
obs/Annotation/categories/0
obs/Smoking/categories/0
obs/28-day outcome/categories/0
obs/Organ System/categories/0
obs/Source/categories/0
obs/Technology/categories/0


[xarray Dataset repr; each Dask-backed column is displayed with the same array summary:]

            Array                        Chunk
Bytes       2.10 MiB                     268.61 kiB
Shape       (275056,)                    (34382,)
Dask graph  8 chunks in 2 graph layers
Data type   float64 numpy.ndarray
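
The numbers in this summary can be checked by hand, which also tells us how many requests reading one whole column should cost.  The sketch below uses only values shown in the repr; the chunk key pattern (`<column path>/<chunk index>`) is the one our tracking store prints, applied here to the `28-day death` column as an example.

```python
import math

n_obs = 275056      # length of each obs column, from the repr
chunk_len = 34382   # chunk length, from the repr
itemsize = 8        # bytes per float64 value

n_chunks = math.ceil(n_obs / chunk_len)
array_mib = n_obs * itemsize / 2**20
chunk_kib = chunk_len * itemsize / 2**10

# Keys a tracking store would report if the whole column were read once.
chunk_keys = [f"obs/28-day death/{i}" for i in range(n_chunks)]

print(f"{n_chunks} chunks, {array_mib:.2f} MiB total, {chunk_kib:.2f} KiB per chunk")
```

So a full read of one float column should show up as eight chunk requests of roughly 269 KiB each (before compression).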


In [7]:
adata.obs['SARS-CoV-2 PCR']

            Array                        Chunk
Bytes       2.10 MiB                     268.61 kiB
Shape       (275056,)                    (34382,)
Dask graph  8 chunks in 2 graph layers
Data type   float64 numpy.ndarray


In [8]:
adata.obs['28-day death']

            Array                        Chunk
Bytes       2.10 MiB                     268.61 kiB
Shape       (275056,)                    (34382,)
Dask graph  8 chunks in 2 graph layers
Data type   float64 numpy.ndarray


These are both `xarray` `DataArray` objects backed by Dask arrays.  To create an interesting subset we first need to bring them into memory, but this is no problem for two columns.  Ideally these would be stored as booleans, but floats work just the same once cast.  Note that, for now, the indexing data needs to be in memory before it can be used to subset.  This will likely be improved in the future.
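
Before running this on the real columns, the cast-and-mask logic can be sketched on toy values in plain Python.  The lists below are made up and the `_demo` names are invented for this illustration.  One caveat worth knowing: any nonzero float is truthy, including NaN (NumPy's boolean cast behaves the same way), so a column that used NaN for missing values would silently count those rows as positive.

```python
# Toy stand-ins for two float-encoded obs columns (values are made up).
pcr_positive = [1.0, 1.0, 0.0, 1.0, 0.0]
died_28_day = [0.0, 1.0, 0.0, 1.0, 1.0]

# Cast floats to booleans, as .astype('bool') would do elementwise.
has_covid_demo = [bool(v) for v in pcr_positive]
did_die_demo = [bool(v) for v in died_28_day]

# Elementwise "and" / "and not", mirroring `a & b` and `a & ~b` on boolean arrays.
mortality_mask = [c and d for c, d in zip(has_covid_demo, did_die_demo)]
survived_mask = [c and not d for c, d in zip(has_covid_demo, did_die_demo)]

# A missing value stored as NaN is truthy, so it would be counted as positive.
nan_is_truthy = bool(float("nan"))
```

The two masks are disjoint by construction, and rows that are PCR-negative fall into neither group.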

In [9]:
has_covid = adata.obs['SARS-CoV-2 PCR'].data.compute().astype('bool')
did_die = adata.obs['28-day death'].data.compute().astype('bool')
mortality = adata[has_covid & did_die, :]
survived = adata[has_covid & ~did_die, :]
mortality

obs/SARS-CoV-2 PCR/0
obs/SARS-CoV-2 PCR/1
obs/SARS-CoV-2 PCR/2
obs/SARS-CoV-2 PCR/3
obs/SARS-CoV-2 PCR/4
obs/SARS-CoV-2 PCR/5
obs/SARS-CoV-2 PCR/6
obs/SARS-CoV-2 PCR/7
obs/28-day death/0
obs/28-day death/1
obs/28-day death/2
obs/28-day death/3
obs/28-day death/4
obs/28-day death/5
obs/28-day death/6
obs/28-day death/7


AnnDataBacked object with n_obs × n_vars = 57250 × 24740
    obs: 'orig.ident', 'Age', 'Sex', 'Race', 'Ethnicity', 'BMI', 'Pre-existing heart disease', 'Pre-existing lung disease', 'Pre-existing kidney disease', 'Pre-existing diabetes', 'Pre-existing hypertension', 'Pre-existing immunocompromised condition', 'Smoking', 'SARS-CoV-2 PCR', 'SARS-CoV-2 Ab', 'Symptomatic', 'Admitted to hospital', 'Highest level of respiratory support', 'Vasoactive agents required during hospitalization', '28-day death', '28-day outcome', 'Disease classification', 'Organ System', 'Source', 'Days since hospital admission', 'SOFA', 'Technology', 'Method', 'CITE-Seq panel', 'Reference', 'Institute', 'Creation date', 'Annotation'
    var: 'feature_type', 'gene_id'
    obsm: 'X_umap'
    layers: 'X_csc'

In [10]:
survived

AnnDataBacked object with n_obs × n_vars = 82764 × 24740
    obs: 'orig.ident', 'Age', 'Sex', 'Race', 'Ethnicity', 'BMI', 'Pre-existing heart disease', 'Pre-existing lung disease', 'Pre-existing kidney disease', 'Pre-existing diabetes', 'Pre-existing hypertension', 'Pre-existing immunocompromised condition', 'Smoking', 'SARS-CoV-2 PCR', 'SARS-CoV-2 Ab', 'Symptomatic', 'Admitted to hospital', 'Highest level of respiratory support', 'Vasoactive agents required during hospitalization', '28-day death', '28-day outcome', 'Disease classification', 'Organ System', 'Source', 'Days since hospital admission', 'SOFA', 'Technology', 'Method', 'CITE-Seq panel', 'Reference', 'Institute', 'Creation date', 'Annotation'
    var: 'feature_type', 'gene_id'
    obsm: 'X_umap'
    layers: 'X_csc'

That was pretty fast!  Now we're getting somewhere.  Let's look at the cell types present in our dataset.  Note that the backing `data` of the obs column (itself a `DataArray`) is special: it is a custom class that should nonetheless feel similar to the Pandas `Categorical` datatype.

We are immediately able to see the available categories here.  No "real data" has been read in yet.  COVID-19 is known to act (https://pubmed.ncbi.nlm.nih.gov/34861051/) on CD4, neutrophil and CD8+ cells inversely in survivors vs. non-survivors, so let's focus on those.

In [11]:
adata.obs['Annotation'].data

obs/Annotation/codes/0
obs/Annotation/codes/1


['Neutrophil', 'Doublet', 'CD8+ T-cell', 'Macrophage', 'CD8+ T-cell', ..., 'Neutrophil', 'Neutrophil', 'Neutrophil', 'Neutrophil', 'Neutrophil']
Length: 275056
Categories (14, object): ['B cell', 'Baso Mast', 'CD4+ T-cell', 'CD8+ T-cell', ..., 'Plasma cell', 'cDC', 'gd T-cell', 'pDC']
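
To make the "categories first, codes later" behaviour concrete, here is a toy categorical in plain Python.  It is only an illustration of the idea and not the class `AnnDataBacked` actually uses; `LazyCategorical`, `fetch_codes` and the values are all invented for this sketch.

```python
class LazyCategorical:
    """Toy categorical: categories are held eagerly, codes are fetched on demand.

    Hypothetical illustration only; not the class used by AnnDataBacked.
    """

    def __init__(self, categories, load_codes):
        self.categories = list(categories)
        self._load_codes = load_codes   # callable standing in for a remote read
        self._codes = None

    @property
    def codes(self):
        if self._codes is None:         # first access triggers the (single) read
            self._codes = list(self._load_codes())
        return self._codes

    def values(self):
        return [self.categories[c] for c in self.codes]

    def isin(self, wanted):
        # Compare small integer codes instead of strings.
        wanted_codes = {i for i, c in enumerate(self.categories) if c in wanted}
        return [c in wanted_codes for c in self.codes]


reads = []  # log of simulated remote reads


def fetch_codes():
    reads.append("codes")
    return [2, 0, 1, 2]


annotation = LazyCategorical(["CD4+ T-cell", "CD8+ T-cell", "Neutrophil"], fetch_codes)
reads_before = list(reads)   # categories are available, nothing read yet
mask = annotation.isin(["CD4+ T-cell", "CD8+ T-cell"])
labels = annotation.values()  # reuses the codes fetched above
```

Inspecting `annotation.categories` costs nothing, while the first operation that needs per-cell values triggers exactly one read of the codes.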

In [12]:
affected_cell_types = ['CD4+ T-cell', 'CD8+ T-cell', 'Neutrophil']

In [13]:
# Note we have to load the data into memory via .data[()] in order to index the AnnData object
mortality_affected_cell_types = mortality[mortality.obs['Annotation'].data[()].isin(affected_cell_types), :]
mortality_affected_cell_types

AnnDataBacked object with n_obs × n_vars = 44029 × 24740
    obs: 'orig.ident', 'Age', 'Sex', 'Race', 'Ethnicity', 'BMI', 'Pre-existing heart disease', 'Pre-existing lung disease', 'Pre-existing kidney disease', 'Pre-existing diabetes', 'Pre-existing hypertension', 'Pre-existing immunocompromised condition', 'Smoking', 'SARS-CoV-2 PCR', 'SARS-CoV-2 Ab', 'Symptomatic', 'Admitted to hospital', 'Highest level of respiratory support', 'Vasoactive agents required during hospitalization', '28-day death', '28-day outcome', 'Disease classification', 'Organ System', 'Source', 'Days since hospital admission', 'SOFA', 'Technology', 'Method', 'CITE-Seq panel', 'Reference', 'Institute', 'Creation date', 'Annotation'
    var: 'feature_type', 'gene_id'
    obsm: 'X_umap'
    layers: 'X_csc'

In [14]:
survived_affected_cell_types = survived[survived.obs['Annotation'].data[()].isin(affected_cell_types), :]
survived_affected_cell_types

AnnDataBacked object with n_obs × n_vars = 49751 × 24740
    obs: 'orig.ident', 'Age', 'Sex', 'Race', 'Ethnicity', 'BMI', 'Pre-existing heart disease', 'Pre-existing lung disease', 'Pre-existing kidney disease', 'Pre-existing diabetes', 'Pre-existing hypertension', 'Pre-existing immunocompromised condition', 'Smoking', 'SARS-CoV-2 PCR', 'SARS-CoV-2 Ab', 'Symptomatic', 'Admitted to hospital', 'Highest level of respiratory support', 'Vasoactive agents required during hospitalization', '28-day death', '28-day outcome', 'Disease classification', 'Organ System', 'Source', 'Days since hospital admission', 'SOFA', 'Technology', 'Method', 'CITE-Seq panel', 'Reference', 'Institute', 'Creation date', 'Annotation'
    var: 'feature_type', 'gene_id'
    obsm: 'X_umap'
    layers: 'X_csc'

We can now check the claim of the above-linked paper.  They claim "At admission, patients who later succumbed to COVID-19 had significantly lower frequencies of all memory CD8+ T cell subsets, resulting in increased CD4-to-CD8 T cell and neutrophil-to-CD8 T cell ratios."  Is this true?  We can check very easily by comparing the two ratios between the groups, and indeed it is!

In [15]:
cd8_count_survived = survived_affected_cell_types[survived_affected_cell_types.obs['Annotation'].data[()] == 'CD8+ T-cell', :].shape[0]
cd4_count_survived = survived_affected_cell_types[survived_affected_cell_types.obs['Annotation'].data[()] == 'CD4+ T-cell', :].shape[0]
neutrophil_count_survived = survived_affected_cell_types[survived_affected_cell_types.obs['Annotation'].data[()] == 'Neutrophil', :].shape[0]

print('Patients who Survived:')
print('----------------------')
print(f'CD4/CD8 Ratio: {cd4_count_survived / cd8_count_survived}')
print(f'Neutrophil/CD8 Ratio: {neutrophil_count_survived / cd8_count_survived}')

Patients who Survived:
----------------------
CD4/CD8 Ratio: 1.4132880871584212
Neutrophil/CD8 Ratio: 6.4724057867476334


In [16]:
cd8_count_mortality = mortality_affected_cell_types[mortality_affected_cell_types.obs['Annotation'].data[()] == 'CD8+ T-cell', :].shape[0]
cd4_count_mortality = mortality_affected_cell_types[mortality_affected_cell_types.obs['Annotation'].data[()] == 'CD4+ T-cell', :].shape[0]
neutrophil_count_mortality = mortality_affected_cell_types[mortality_affected_cell_types.obs['Annotation'].data[()] == 'Neutrophil', :].shape[0]

print('Patients who Died:')
print('----------------------')
print(f'CD4/CD8 Ratio: {cd4_count_mortality / cd8_count_mortality}')
print(f'Neutrophil/CD8 Ratio: {neutrophil_count_mortality / cd8_count_mortality}')

Patients who Died:
----------------------
CD4/CD8 Ratio: 2.9249146757679183
Neutrophil/CD8 Ratio: 14.858788395904437
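
The ratios above are plain count ratios over the annotated cells in each group.  As a minimal sketch of the arithmetic, the same quantities can be computed in one pass with a tally; the labels below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Made-up annotations for a handful of cells; not taken from the dataset.
labels = [
    "Neutrophil", "CD4+ T-cell", "CD8+ T-cell", "Neutrophil",
    "CD4+ T-cell", "Neutrophil", "CD8+ T-cell", "Neutrophil",
    "CD4+ T-cell",
]

counts = Counter(labels)  # one pass over the labels gives every count
cd4_cd8_ratio = counts["CD4+ T-cell"] / counts["CD8+ T-cell"]
neutrophil_cd8_ratio = counts["Neutrophil"] / counts["CD8+ T-cell"]

print(f"CD4/CD8 Ratio: {cd4_cd8_ratio}")
print(f"Neutrophil/CD8 Ratio: {neutrophil_cd8_ratio}")
```

With 3 CD4+, 2 CD8+ and 4 neutrophil labels, this gives ratios of 1.5 and 2.0.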


Remarkably, we immediately see that this dataset confirms the reported result.  And this short confirmatory analysis took place without ever loading the omics data (from `X`) into memory.  Indeed, all of the above code should cumulatively have taken no longer than 15 seconds to run.  But what if we do want to look at the omics data?  Let's try to do that, using a few cell-type markers reported for this dataset: https://www.medrxiv.org/content/10.1101/2020.11.20.20227355v1.full.pdf.  These should appear clearly in certain cell types when visualized.

In [17]:
genes = ['MUC5AC', 'FOXP3', 'CTLA4']
has_covid_adata = adata[:, genes]
has_covid_adata

AnnDataBacked object with n_obs × n_vars = 275056 × 3
    obs: 'orig.ident', 'Age', 'Sex', 'Race', 'Ethnicity', 'BMI', 'Pre-existing heart disease', 'Pre-existing lung disease', 'Pre-existing kidney disease', 'Pre-existing diabetes', 'Pre-existing hypertension', 'Pre-existing immunocompromised condition', 'Smoking', 'SARS-CoV-2 PCR', 'SARS-CoV-2 Ab', 'Symptomatic', 'Admitted to hospital', 'Highest level of respiratory support', 'Vasoactive agents required during hospitalization', '28-day death', '28-day outcome', 'Disease classification', 'Organ System', 'Source', 'Days since hospital admission', 'SOFA', 'Technology', 'Method', 'CITE-Seq panel', 'Reference', 'Institute', 'Creation date', 'Annotation'
    var: 'feature_type', 'gene_id'
    obsm: 'X_umap'
    layers: 'X_csc'

Now we want to visualize the data.  But we need to bring the data into memory for that.  Luckily, this is no problem as there is a convenient `to_memory` function provided with this new `AnnDataBacked` object.  Also, we note the presence of an `X_csc` layer.  `X` itself is stored in sparse `CSR` format, which is poorly suited to reading a handful of genes remotely because each gene's values are scattered across every row.  Thus we use the `X_csc` matrix, which stores each gene's values contiguously, for fast access to all cells given a subset of genes of interest.
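
To see why a column-compressed layout suits gene-wise access, here is a small pure-Python sketch of the CSC format (the same three arrays, `data`, `indices` and `indptr`, that appear as keys under `layers/X_csc`).  The matrix and the helper functions `to_csc` and `csc_column` are toy constructions for illustration only.

```python
def to_csc(dense):
    """Convert a dense matrix (list of rows) to CSC arrays."""
    n_rows, n_cols = len(dense), len(dense[0])
    data, indices, indptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            if dense[i][j] != 0:
                data.append(dense[i][j])
                indices.append(i)    # row index of each stored value
        indptr.append(len(data))     # where column j ends in data/indices
    return data, indices, indptr


def csc_column(data, indices, indptr, j, n_rows):
    """Rebuild dense column j by reading one contiguous slice of data/indices."""
    start, stop = indptr[j], indptr[j + 1]
    column = [0] * n_rows
    for value, row in zip(data[start:stop], indices[start:stop]):
        column[row] = value
    return column


# Rows are cells, columns are genes (toy counts).
dense = [
    [0, 3, 0],
    [1, 0, 0],
    [0, 0, 2],
    [4, 5, 0],
]
data, indices, indptr = to_csc(dense)
gene_1 = csc_column(data, indices, indptr, 1, n_rows=len(dense))
```

Reading one gene touches only `indptr` and a single contiguous slice of `data` and `indices`, so on a chunked remote store it should map to a small number of chunk requests.  In CSR the same gene's values would be spread over the whole `data` array.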

In [None]:
obs_keys_to_exclude = ['obs/' + v for v in has_covid_adata.obs.keys() if v != 'Annotation']
var_keys_to_exclude = ['var/' + v for v in has_covid_adata.var.keys()]
has_covid_in_memory_adata = has_covid_adata.to_memory(exclude=['X'] + obs_keys_to_exclude + var_keys_to_exclude)

var/feature_type/categories/0
var/gene_id/categories/0
obsm/X_umap/0.0
obsm/X_umap/0.1
obsm/X_umap/1.0
obsm/X_umap/1.1
obsm/X_umap/2.0
obsm/X_umap/2.1
obsm/X_umap/3.0
obsm/X_umap/3.1
layers/X_csc/indptr/0
layers/X_csc/data/961
layers/X_csc/data/404


Note the data accessed: basically only UMAP coordinates and a few chunks of the underlying sparse data.  The above should have only taken about 5 seconds.  Finally, we can use this in-memory object in `scanpy` to visualize the data.  Indeed, the genes only show up in subsets of the clusters because the paper reports a finer-grained cell typing than is given in the `AnnData` object.  For example, from the paper, "preliminary phenotyping of CD4 T cell subsets revealed...regulatory (FOXP3, CTLA4)....cells."  And indeed, we see those genes appearing within, but not throughout, the CD4+ T-cell cluster.

In [None]:
sc.pl.umap(has_covid_in_memory_adata, color="Annotation")

In [None]:
sc.pl.umap(has_covid_in_memory_adata, color=genes, layer='X_csc', ncols=1)

Lastly, note that if you were to rerun the notebook without restarting the kernel, you would load no data in.  That is because we used an `LRUStoreCache` for the `zarr` data, so the data is cached.  In total, the above notebook should not have taken more than 30 seconds to run.  This enables a new way of accessing data in `AnnData` that is either too far away or too big to fit into memory.
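
The caching behaviour depends on the size limit we passed earlier: `max_size=2**28` is 256 MiB, which is plenty for the chunks this notebook touches.  With a smaller limit, least-recently-used entries would be evicted and fetched again on reuse.  The class below is a hypothetical sketch of that idea in plain Python; zarr's `LRUStoreCache` is more involved, and `ByteBoundedLRU` and its toy `fetch` function are invented for this illustration.

```python
from collections import OrderedDict


class ByteBoundedLRU:
    """Toy LRU cache limited by total value size in bytes.

    Hypothetical sketch of the idea; zarr's LRUStoreCache is more involved.
    """

    def __init__(self, fetch, max_size):
        self._fetch = fetch            # callable: key -> bytes (the slow path)
        self._max_size = max_size
        self._values = OrderedDict()   # least recently used first
        self._current_size = 0
        self.fetched = []              # keys that went to the slow path

    def __getitem__(self, key):
        if key in self._values:
            self._values.move_to_end(key)    # mark as most recently used
            return self._values[key]
        value = self._fetch(key)
        self.fetched.append(key)
        if len(value) <= self._max_size:     # values too large to fit are not cached
            self._values[key] = value
            self._current_size += len(value)
            while self._current_size > self._max_size:
                _, evicted = self._values.popitem(last=False)
                self._current_size -= len(evicted)
        return value


# Each toy value is 8 bytes, so a 16-byte cache holds two entries.
cache = ByteBoundedLRU(fetch=lambda key: key.encode() * 2, max_size=16)
cache["aaaa"]   # fetched
cache["bbbb"]   # fetched; cache is now full
cache["aaaa"]   # hit; "bbbb" becomes least recently used
cache["cccc"]   # fetched; evicts "bbbb"
cache["aaaa"]   # still cached
cache["bbbb"]   # fetched again after eviction
```

Only the second request for `"bbbb"` is repeated, because it was the least recently used entry when space ran out.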