In [1]:
import sys
sys.path.append('../src/')

import json
import config
import logging
import scanpy as sc
from refcm import RefCM

config.start_logging(logging.DEBUG)

Let us now consider hierarchical cell-type relationships. For this, we use the Allen Brain datasets, which annotate every cell at two nested levels of granularity: 3 coarse classes (`labels3`) and 34 granular types (`labels34`).

In [13]:
mtg = sc.read_h5ad('../data/MTG.h5ad')
alm = sc.read_h5ad('../data/ALM.h5ad')
visp = sc.read_h5ad('../data/VISp.h5ad')

Let us first retrieve the hierarchical relationships between cell types:

In [3]:
labels = mtg.obs[['labels3', 'labels34']].set_index('labels3')
coarse_levels = labels.index.unique().to_list()

hierarchy = {
    level: labels.loc[level].drop_duplicates().values.flatten().tolist() 
    for level in coarse_levels
}

print(json.dumps(hierarchy, indent=4))

{
    "Inhibitory": [
        "Sst 1",
        "Lamp5 Rosehip",
        "Vip 5",
        "Pvalb 2",
        "Sst 3",
        "Pvalb 1",
        "Vip 3",
        "Pax6",
        "Vip 1",
        "Vip Sncg",
        "Vip 4",
        "Lamp5 2",
        "Lamp5 Lhx6",
        "Sst 4",
        "Chandelier",
        "Vip 2",
        "Sst 2",
        "Sst 5",
        "Sst Chodl",
        "Lamp5 1"
    ],
    "Excitatory": [
        "Exc L5/6 IT 3",
        "Exc L6 CT",
        "Exc L6b",
        "Exc L4/5 IT",
        "Exc L5/6 IT 2",
        "Exc L6 IT 1",
        "Exc L6 IT 2",
        "Exc L5/6 NP",
        "Exc L2/3 IT",
        "Exc L5/6 IT 1",
        "Exc L3/5 IT",
        "Exc L5 PT"
    ],
    "Non-neuronal": [
        "Astrocyte",
        "Oligo"
    ]
}
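
The same kind of mapping can also be built with a pandas `groupby`. The following is a minimal sketch on a made-up toy frame: the `obs` DataFrame and its rows are illustrative assumptions rather than the MTG data, and only the `labels3` and `labels34` column names are taken from the cell above.

```python
import pandas as pd

# toy stand-in for mtg.obs; the rows are made up for illustration only
obs = pd.DataFrame({
    "labels3": ["Inhibitory", "Inhibitory", "Excitatory", "Inhibitory", "Non-neuronal"],
    "labels34": ["Sst 1", "Vip 5", "Exc L6b", "Sst 1", "Astrocyte"],
})

# for each coarse type, collect its distinct granular types
# (sort=False keeps the coarse types in order of first appearance)
toy_hierarchy = (
    obs.groupby("labels3", sort=False, observed=True)["labels34"]
    .unique()
    .apply(list)
    .to_dict()
)

print(toy_hierarchy)
```

Compared with the index-based comprehension, this variant avoids the `.loc` lookup per coarse level, at the cost of being slightly less explicit about the dictionary layout.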


To verify the performance of RefCM, let us define a helper function that checks whether a given pair of labels, passed in either order, forms a `(coarse_level, granular_level)` pair in this hierarchy.

In [4]:
def is_valid_pair(a: str, b: str) -> bool:
    # accept either order: (coarse, granular) or (granular, coarse)
    return b in hierarchy.get(a, []) or a in hierarchy.get(b, [])

In [5]:
is_valid_pair('Excitatory', 'Exc L5 PT') \
    and is_valid_pair('Exc L5 PT', 'Excitatory') \
    and not is_valid_pair('Excitatory', 'not a type') \
    and not is_valid_pair('not a type', 'Exc L5 PT')

True

We can then map across these different levels and evaluate the performance as follows. Let us first consider mapping from granular to coarse resolutions.

In [28]:
# ensure we allow the query clusters to "merge" without restriction
rcm = RefCM(numItermax=1e7, max_merges=-1)
m = rcm.annotate(visp, 'VISp', alm, 'ALM', 'labels3', 'labels34')
m.display_matching_costs(ground_truth_obs_key='labels34')

[refcm           ] [INFO    ] : NOTE: raw counts expected in anndata .X attributes.
[refcm           ] [DEBUG   ] : No existing matching db cost file db.json found.
[refcm           ] [DEBUG   ] : Selecting joint gene subset for query and reference datasets
[refcm           ] [DEBUG   ] : Using 1503 genes.
[refcm           ] [DEBUG   ] : Computing Wasserstein distances.
|████████████████| [100.00% ] : 00:29
[refcm           ] [DEBUG   ] : starting LP optimization
[refcm           ] [DEBUG   ] : optimization terminated w. status "Optimal"


In [34]:
visp.obs

Unnamed: 0,labels3,labels34,refcm_clusters,refcm_annot
F1S4_160108_001_A01,Inhibitory,Vip 2,29,Inhibitory
F1S4_160108_001_B01,Inhibitory,Lamp5 Rosehip,17,Inhibitory
F1S4_160108_001_C01,Inhibitory,Lamp5 Rosehip,17,Inhibitory
F1S4_160108_001_D01,Inhibitory,Vip 5,32,Inhibitory
F1S4_160108_001_E01,Inhibitory,Lamp5 1,14,Inhibitory
...,...,...,...,...
FYS4_171004_103_G01,Inhibitory,Sst 4,25,Inhibitory
FYS4_171004_104_A01,Inhibitory,Vip Sncg,33,Inhibitory
FYS4_171004_104_B01,Inhibitory,Pvalb 1,20,Inhibitory
FYS4_171004_104_C01,Excitatory,Exc L5 PT,5,Excitatory


As the next snippet shows, every cell was assigned its correct coarse cell type.

In [37]:
(visp.obs['labels3'] == visp.obs['refcm_annot']).all()

True
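
A single boolean hides where errors occur when the check fails. As a hedged sketch, the accuracy and a confusion table can be computed as below. The toy `obs` frame, including its one deliberate mislabel, is made up for illustration; only the `labels3` and `refcm_annot` column names come from the output above.

```python
import pandas as pd

# toy stand-in for visp.obs; rows are made up and contain one deliberate error
obs = pd.DataFrame({
    "labels3": ["Inhibitory", "Inhibitory", "Excitatory", "Non-neuronal"],
    "refcm_annot": ["Inhibitory", "Excitatory", "Excitatory", "Non-neuronal"],
})

# fraction of cells whose annotation equals the ground truth
accuracy = (obs["labels3"] == obs["refcm_annot"]).mean()

# rows: ground-truth coarse type, columns: assigned annotation
confusion = pd.crosstab(obs["labels3"], obs["refcm_annot"])

print(f"accuracy: {accuracy:.2f}")
print(confusion)
```

Off-diagonal entries of the table show which coarse types are confused with each other.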

Conversely, let us now map from coarse to granular annotations.

In [14]:
# ensure we allow the query clusters to "split" without restriction
rcm = RefCM(numItermax=1e7, max_splits=-1)
m = rcm.annotate(visp, 'VISp', alm, 'ALM', 'labels34', 'labels3')
m.display_matching_costs()

[refcm           ] [INFO    ] : NOTE: raw counts expected in anndata .X attributes.
[refcm           ] [DEBUG   ] : No existing matching db cost file db.json found.
[refcm           ] [DEBUG   ] : Selecting joint gene subset for query and reference datasets
[refcm           ] [DEBUG   ] : Using 1503 genes.
[refcm           ] [DEBUG   ] : Computing Wasserstein distances.
|████████████████| [100.00% ] : 00:39
[refcm           ] [DEBUG   ] : starting LP optimization
[refcm           ] [DEBUG   ] : optimization terminated w. status "Optimal"


We can then determine the correctness of the mapping as follows. Here we cannot use a simple equality check, since we are mapping 3 clusters to 34 clusters without sub-clustering the coarse annotations before applying RefCM.

In [26]:
# TODO 

Unnamed: 0,labels3,labels34,refcm_clusters,refcm_annot
F1S4_160108_001_A01,Inhibitory,Vip 2,1,Chandelier
F1S4_160108_001_B01,Inhibitory,Lamp5 Rosehip,1,Chandelier
F1S4_160108_001_C01,Inhibitory,Lamp5 Rosehip,1,Chandelier
F1S4_160108_001_D01,Inhibitory,Vip 5,1,Chandelier
F1S4_160108_001_E01,Inhibitory,Lamp5 1,1,Chandelier
...,...,...,...,...
FYS4_171004_103_G01,Inhibitory,Sst 4,1,Chandelier
FYS4_171004_104_A01,Inhibitory,Vip Sncg,1,Chandelier
FYS4_171004_104_B01,Inhibitory,Pvalb 1,1,Chandelier
FYS4_171004_104_C01,Excitatory,Exc L5 PT,0,Exc L2/3 IT
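
One possible hierarchy-aware check, sketched here under stated assumptions, is to count a cell as consistent when its granular annotation is a subtype of its true coarse type. This is a weaker criterion than an exact match on `labels34`. The toy hierarchy, the label lists, and the `is_consistent` helper below are all made up for illustration. In the notebook, the inputs would presumably be `hierarchy`, `visp.obs['labels3']`, and `visp.obs['refcm_annot']`, assuming the annotation names match those in the hierarchy.

```python
# toy hierarchy and labels, made up for illustration only
toy_hierarchy = {
    "Inhibitory": ["Chandelier", "Vip 2"],
    "Excitatory": ["Exc L2/3 IT", "Exc L5 PT"],
}
coarse_truth = ["Inhibitory", "Inhibitory", "Excitatory", "Excitatory"]
granular_annot = ["Chandelier", "Exc L5 PT", "Exc L2/3 IT", "Exc L5 PT"]


def is_consistent(coarse: str, granular: str) -> bool:
    """Return True if the granular label is a subtype of the coarse label."""
    return granular in toy_hierarchy.get(coarse, [])


# per-cell consistency flags and the overall consistent fraction
flags = [is_consistent(c, g) for c, g in zip(coarse_truth, granular_annot)]
consistent_fraction = sum(flags) / len(flags)

print(flags)
print(consistent_fraction)
```

In this toy example the second cell is inconsistent, since an excitatory subtype was assigned to an inhibitory cell.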
