In this notebook, we plot the performance of RefCM across all currently available datasets and test which parameters yield the best performance.

In [1]:
import sys
sys.path.append('../src/')

import os
import json
import config
import logging
import numpy as np
import scanpy as sc
from refcm import RefCM

config.start_logging(logging.DEBUG)

In [2]:
%load_ext autoreload
%autoreload 2

All currently tested datasets and their associated `.obs` clustering keys:

In [3]:
DSS = {
    # pancreas datasets
    "pancreas_celseq": "celltype",
    "pancreas_celseq2": "celltype",
    "pancreas_fluidigmc1": "celltype",
    "pancreas_indrop1": "celltype",
    "pancreas_indrop2": "celltype",
    "pancreas_indrop3": "celltype",
    "pancreas_indrop4": "celltype",
    "pancreas_smarter": "celltype",
    "pancreas_smartseq2": "celltype",
    
    # Allen-Brain datasets
    "ALM": "labels34",
    "MTG": "labels34",
    "VISp": "labels34",
    
    # LGN datasets
    "LGN_human_intron": "cluster_label",
    "LGN_human_exon": "cluster_label",
    "LGN_macaque_intron": "cluster_label",
    "LGN_macaque_exon": "cluster_label",
    "LGN_mouse_intron": "cluster_label",
    "LGN_mouse_exon": "cluster_label",
    
    # pbmc datasets
    "pbmc_10Xv2": "labels",
    "pbmc_10Xv3": "labels",
    "pbmc_CEL-Seq": "labels",
    "pbmc_Drop-Seq": "labels",
    "pbmc_inDrop": "labels",
    "pbmc_Seq-Well": "labels",
    "pbmc_Smart-Seq2": "labels",
    
    # celltypist datasets
    "Blood": "cell_type",
    "Bone_marrow": "cell_type",
    "Heart": "cell_type",
    "Hippocampus": "cell_type",
    "Intestine": "cell_type",
    "Kidney": "cell_type",
    "Liver": "cell_type",
    "Lung": "cell_type",
    "Lymph_node": "cell_type",
    "Pancreas": "cell_type",
    "Skeletal_muscle": "cell_type",
    "Spleen": "cell_type",
}

Use the code block below to try out different parameters for a given query/reference pair and determine which choice yields the best results, and why that might be the case. You could also think about how to automate this as a grid-style parameter search.

In [4]:
# query / reference dataset choice
q_id = 'LGN_human_intron'
ref_id = 'LGN_macaque_intron'

q_ds = sc.read_h5ad(f'../data/{q_id}.h5ad')
ref_ds = sc.read_h5ad(f'../data/{ref_id}.h5ad')

# the parameters to tune
rcm = RefCM(target_sum=1e6, discovery_threshold=0.5)

# evaluation / display
m = rcm.annotate(q_ds, q_id, ref_ds, ref_id, DSS[q_id], DSS[ref_id])
m.eval(DSS[q_id])
m.display_matching_costs(DSS[q_id])

[h5py._conv      ] [DEBUG   ] : Creating converter from 3 to 5
[refcm           ] [INFO    ] : NOTE: raw counts expected in anndata .X attributes.
[refcm           ] [DEBUG   ] : Loading cached mapping costs from ws_cache.json.
[refcm           ] [DEBUG   ] : Using costs for LGN_human_intron->LGN_macaque_intron found in cache.
[refcm           ] [DEBUG   ] : starting LP optimization
[refcm           ] [DEBUG   ] : optimization terminated w. status "Optimal"
[matchings       ] [DEBUG   ] : GABA1                mapped to GABA1               
[matchings       ] [DEBUG   ] : GABA2                mapped to GABA3               
[matchings       ] [DEBUG   ] : GABA3                mapped to GABA4               
[matchings       ] [DEBUG   ] : K1                   mapped to Pulv                
[matchings       ] [DEBUG   ] : K2                   mapped to K2                  
[matchings       ] [DEBUG   ] : MP                   mapped to M                   
[matchings       ] [DEBUG   ] : OP
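
One way to automate the manual tuning above is an exhaustive grid search. The sketch below is a minimal, generic version: `grid_search` and `toy_score` are hypothetical names that are not part of RefCM, and the toy score only exists so that the example runs on its own.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence, Tuple


def grid_search(
    score_fn: Callable[..., float],
    grid: Dict[str, Sequence[Any]],
) -> List[Tuple[Dict[str, Any], float]]:
    """Score every parameter combination in `grid`, best first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        kwargs = dict(zip(names, values))
        results.append((kwargs, score_fn(**kwargs)))
    # sort by score, highest first
    results.sort(key=lambda item: item[1], reverse=True)
    return results


# Hypothetical stand-in for a real evaluation: it simply peaks at
# target_sum=1e6 and discovery_threshold=0.5 (an assumption for the demo).
def toy_score(target_sum: float, discovery_threshold: float) -> float:
    return -abs(discovery_threshold - 0.5) - abs(target_sum - 1e6) / 1e6


ranked = grid_search(
    toy_score,
    {"target_sum": [1e4, 1e5, 1e6], "discovery_threshold": [0.1, 0.3, 0.5]},
)
best_params, best_score = ranked[0]
print(best_params, best_score)
```

In this notebook, `score_fn` would wrap a call such as `RefCM(**kwargs).annotate(...)` and return a single number to maximize. How to extract that number from the returned matching object is not shown here, so that part is left open.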

If you have found better parameters for a certain query -> reference pair, enter them in the `params.json` file. In the example above, this would mean editing line 485 to:

```json
    ...
    "LGN_human_intron": {
            ...
            "LGN_macaque_intron": {"target_sum": 1000000, "discovery_threshold": 0.5},
            ...
    }
    ...
```
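
If you would rather not edit the file by hand, the update can also be scripted. The sketch below assumes that `params.json` is a nested mapping of query id -> reference id -> `RefCM` keyword arguments, as the snippet above suggests. `set_pair_params` is a hypothetical helper, and the demo writes to a temporary file rather than the real `params.json`.

```python
import json
import tempfile
from pathlib import Path


def set_pair_params(path, q_id, ref_id, **kwargs):
    """Store `kwargs` as the parameters for the q_id -> ref_id pair."""
    path = Path(path)
    # start from the existing file if there is one, otherwise from scratch
    params = json.loads(path.read_text()) if path.exists() else {}
    params.setdefault(q_id, {})[ref_id] = kwargs
    path.write_text(json.dumps(params, indent=4))
    return params


# Demo on a throwaway file so that the real params.json is left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "params.json"
    updated = set_pair_params(
        demo_path,
        "LGN_human_intron",
        "LGN_macaque_intron",
        target_sum=1000000,
        discovery_threshold=0.5,
    )
    reloaded = json.loads(demo_path.read_text())
```

Note that rewriting the file with `json.dumps` may change its layout, so line numbers such as the one mentioned above would no longer match.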

With the parameters available in `params.json`, we can then view our performance across all dataset combinations as follows:

In [21]:
with open('params.json', 'r') as f:
    params = json.load(f)

In [None]:
# mapping accuracy on overlapping cell types
acc_overlap = np.zeros((len(DSS), len(DSS)))

# accuracy on non-overlapping cell types
acc_different = np.zeros((len(DSS), len(DSS)))

# mapping all different combinations
for i, (a_id, a_key) in enumerate(DSS.items()):
    a_ds = sc.read_h5ad(f'../data/{a_id}.h5ad')
    
    for j, (b_id, b_key) in enumerate(DSS.items()):
        
        if j < i:
            continue
        
        b_ds = sc.read_h5ad(f'../data/{b_id}.h5ad')
        
        # run first with a as query and b as reference, then b as query and a as reference
        # to avoid re-reading certain files as often.
        
        # TODO: await the matching evaluation metrics rework & merge to complete
        # the below steps
        
        rcm = RefCM(**params[a_id][b_id])
        m = rcm.annotate(a_ds, a_id, b_ds, b_id, DSS[a_id], DSS[b_id])
        
        
        rcm = RefCM(**params[b_id][a_id])
        m = rcm.annotate(b_ds, b_id, a_ds, a_id, DSS[b_id], DSS[a_id])

In [None]:
# TODO: code for heatmap visualization