Let us now consider hierarchical cell-type relationships. For this, we use the Allen Brain datasets, which are annotated at two (non-overlapping) levels of cell-type granularity.

In [1]:
import sys
sys.path.append('../src/')

import json
import config
import logging
import scanpy as sc
from refcm import RefCM

config.start_logging(logging.DEBUG)

In [2]:
mtg = sc.read_h5ad('../data/MTG.h5ad')
alm = sc.read_h5ad('../data/ALM.h5ad')
visp = sc.read_h5ad('../data/VISp.h5ad')

[h5py._conv      ] [DEBUG   ] : Creating converter from 3 to 5


Let us first retrieve the hierarchical relationships between cell types:

In [3]:
labels = mtg.obs[['labels3', 'labels34']].set_index('labels3')
coarse_levels = labels.index.unique().to_list()

hierarchy = {
    level: labels.loc[level].drop_duplicates().values.flatten().tolist()
    for level in coarse_levels
}

print(json.dumps(hierarchy, indent=4, sort_keys=True))

{
    "Excitatory": [
        "Exc L5/6 IT 3",
        "Exc L6 CT",
        "Exc L6b",
        "Exc L4/5 IT",
        "Exc L5/6 IT 2",
        "Exc L6 IT 1",
        "Exc L6 IT 2",
        "Exc L5/6 NP",
        "Exc L2/3 IT",
        "Exc L5/6 IT 1",
        "Exc L3/5 IT",
        "Exc L5 PT"
    ],
    "Inhibitory": [
        "Sst 1",
        "Lamp5 Rosehip",
        "Vip 5",
        "Pvalb 2",
        "Sst 3",
        "Pvalb 1",
        "Vip 3",
        "Pax6",
        "Vip 1",
        "Vip Sncg",
        "Vip 4",
        "Lamp5 2",
        "Lamp5 Lhx6",
        "Sst 4",
        "Chandelier",
        "Vip 2",
        "Sst 2",
        "Sst 5",
        "Sst Chodl",
        "Lamp5 1"
    ],
    "Non-neuronal": [
        "Astrocyte",
        "Oligo"
    ]
}
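
The same coarse-to-fine dictionary can also be built with a `groupby`, which makes it easy to check that each granular label has exactly one coarse parent. The following is a minimal sketch on a toy stand-in for `mtg.obs`; the rows are illustrative and are not read from the dataset files.

```python
import pandas as pd

# Toy stand-in for mtg.obs (illustrative rows, not taken from the data files).
obs = pd.DataFrame({
    "labels3": ["Excitatory", "Excitatory", "Inhibitory",
                "Inhibitory", "Non-neuronal", "Inhibitory"],
    "labels34": ["Exc L6 CT", "Exc L6b", "Sst 1",
                 "Pax6", "Oligo", "Sst 1"],
})

# For each coarse label, collect the sorted unique granular labels under it.
hierarchy = (
    obs.groupby("labels3", observed=True)["labels34"]
    .unique()
    .apply(sorted)
    .to_dict()
)

# Sanity check: every granular label should have exactly one coarse parent.
parents_per_fine = obs.groupby("labels34", observed=True)["labels3"].nunique()
is_tree = bool((parents_per_fine == 1).all())

print(hierarchy)
print("each granular label has a single parent:", is_tree)
```

Sorting the children makes the resulting dictionary independent of the row order in `obs`.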


We can then map across these levels and datasets (here, from VISp to ALM) and evaluate the performance as follows. Let us first map from the granular to the coarse resolution.

In [4]:
# allow query clusters to merge without restriction (max_merges=-1)
rcm = RefCM(num_iter_max=1e7, max_merges=-1, cache_load=False, cache_save=False)
m = rcm.annotate(visp, 'VISp', alm, 'ALM', 'labels3', 'labels34')

[refcm           ] [INFO    ] : NOTE: raw counts expected in anndata .X attributes.
[refcm           ] [DEBUG   ] : Loading cached mapping costs from cache.json.
[refcm           ] [DEBUG   ] : Selecting joint gene subset for query and reference datasets
[refcm           ] [DEBUG   ] : Using 1503 genes.
[refcm           ] [DEBUG   ] : Computing Wasserstein distances.
|████████████████| [100.00% ] : 00:24
[refcm           ] [DEBUG   ] : starting LP optimization
[refcm           ] [DEBUG   ] : optimization terminated w. status "Optimal"


In [5]:
m.display_matching_costs(ground_truth_obs_key='labels34')

[matchings       ] [DEBUG   ] : astrocyte | 0.70 > non-neuronal < 1.00 | non-neuronal
[matchings       ] [DEBUG   ] : chandelier | 0.70 > inhibitory < 1.00 | inhibitory
[matchings       ] [DEBUG   ] : exc l2/3 it | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l3/5 it | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l4/5 it | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l5 pt | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l5/6 it 1 | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l5/6 it 2 | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l5/6 it 3 | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l5/6 np | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l6 ct | 0.70 > excitatory < 1.00 | excitatory
[matchings       ] [DEBUG   ] : exc l6 it 1 | 0.70 > excitatory < 1.00 | 

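Comparing against the hierarchy established above, every granular cluster listed is matched to its correct coarse cell type. The next two cells check the annotations at the level of individual cells. Note that their execution counts (In [25] and In [26]) are higher than those of the coarse-to-granular cells further down, so the recorded output may not correspond to this mapping.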

In [25]:
visp.obs

Unnamed: 0,labels3,labels34,refcm_clusters,refcm_annot
F1S4_160108_001_A01,Inhibitory,Vip 2,29,Non-neuronal
F1S4_160108_001_B01,Inhibitory,Lamp5 Rosehip,17,Non-neuronal
F1S4_160108_001_C01,Inhibitory,Lamp5 Rosehip,17,Non-neuronal
F1S4_160108_001_D01,Inhibitory,Vip 5,32,Non-neuronal
F1S4_160108_001_E01,Inhibitory,Lamp5 1,14,Non-neuronal
...,...,...,...,...
FYS4_171004_103_G01,Inhibitory,Sst 4,25,Non-neuronal
FYS4_171004_104_A01,Inhibitory,Vip Sncg,33,Non-neuronal
FYS4_171004_104_B01,Inhibitory,Pvalb 1,20,Inhibitory
FYS4_171004_104_C01,Excitatory,Exc L5 PT,5,Non-neuronal


In [26]:
(visp.obs['labels3'] == visp.obs['refcm_annot']).all()

False
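
When a check like the one above returns `False`, a single boolean says little about where the disagreement lies. A per-cell accuracy and a confusion table are more informative. The sketch below uses a toy stand-in for `visp.obs` with made-up rows; only the column names `labels3` and `refcm_annot` come from the notebook.

```python
import pandas as pd

# Toy stand-in for visp.obs (made-up rows; column names follow the notebook).
obs = pd.DataFrame({
    "labels3": ["Inhibitory", "Inhibitory", "Excitatory", "Non-neuronal"],
    "refcm_annot": ["Inhibitory", "Non-neuronal", "Excitatory", "Non-neuronal"],
})

# Cast to str first: comparing two categorical columns whose category sets
# differ raises a TypeError in pandas.
truth = obs["labels3"].astype(str)
pred = obs["refcm_annot"].astype(str)

# Fraction of cells whose annotation equals the ground-truth coarse label.
accuracy = (truth == pred).mean()

# Rows are ground-truth labels, columns are predicted labels.
confusion = pd.crosstab(truth, pred, rownames=["truth"], colnames=["predicted"])

print(f"per-cell accuracy: {accuracy:.2f}")
print(confusion)
```

Off-diagonal entries in the table show which coarse types are being confused with each other.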

Conversely, let us now map from coarse to granular annotations:

In [18]:
# allow query clusters to split as needed (max_splits=-1)
rcm = RefCM(numItermax=1e7, max_splits=-1)
m = rcm.annotate(visp, 'VISp', alm, 'ALM', 'labels34', 'labels3')

[refcm           ] [INFO    ] : NOTE: raw counts expected in anndata .X attributes.
[refcm           ] [DEBUG   ] : No existing matching db cost file db.json found.
[refcm           ] [DEBUG   ] : Selecting joint gene subset for query and reference datasets
[refcm           ] [DEBUG   ] : Using 1503 genes.
[refcm           ] [DEBUG   ] : Computing Wasserstein distances.
|████████████████| [100.00% ] : 00:32
[refcm           ] [DEBUG   ] : starting LP optimization
[refcm           ] [DEBUG   ] : optimization terminated w. status "Optimal"


In [19]:
m.display_matching_costs(ground_truth_obs_key='labels3')

Comparing this result with the previous one and with the established hierarchy, we conclude that the coarse-to-granular mapping also recovers the correct links!
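
This comparison against the hierarchy can also be done programmatically. The sketch below assumes the links have already been extracted into a plain dictionary from coarse label to matched granular labels. The `links` dictionary and the `wrong_links` helper are hypothetical and are not part of the RefCM API. The example deliberately contains one incorrect link.

```python
# Subset of the hierarchy printed earlier (coarse label -> granular labels).
hierarchy = {
    "Excitatory": ["Exc L6 CT", "Exc L6b"],
    "Inhibitory": ["Sst 1", "Pax6"],
    "Non-neuronal": ["Astrocyte", "Oligo"],
}

# Hypothetical mapping result: coarse query label -> matched granular labels.
# "Oligo" under "Inhibitory" is a deliberate error for illustration.
links = {
    "Excitatory": ["Exc L6 CT", "Exc L6b"],
    "Inhibitory": ["Sst 1", "Oligo"],
    "Non-neuronal": ["Astrocyte"],
}


def wrong_links(links, hierarchy):
    """Return (coarse, granular) pairs that contradict the hierarchy."""
    return [
        (coarse, fine)
        for coarse, fines in links.items()
        for fine in fines
        if fine not in hierarchy.get(coarse, [])
    ]


errors = wrong_links(links, hierarchy)
print(errors if errors else "all links agree with the hierarchy")
```

An empty list means every link connects a coarse type to one of its own granular subtypes.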