# Relative expression across cell types

Let the matrix of mean expression values for each gene $g$ in each cell type $ct$ for a given modality $m$ be defined as:
$$\langle X \rangle_{ct,g}^m = \frac{1}{\lvert C_{ct} \rvert} \sum\limits_{c\in C_{ct}} X_{c,g}^{m}$$ 

where $C_{ct}$ is the set of all cells in the cell type cluster $ct$ and $X_{c,g}^{m}$ is the matrix of normalized expression values for each gene $g$ in each cell $c$. 
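
As a minimal sketch of this averaging step, assuming a small made-up cell-by-gene table (the gene names `geneA`/`geneB` and the labels `T`/`B` are hypothetical), the per-cell-type mean can be computed with a pandas group-by:

```python
import pandas as pd

# Toy normalized expression: 4 cells x 2 genes (values are made up)
X = pd.DataFrame(
    {"geneA": [1.0, 3.0, 0.0, 2.0], "geneB": [0.0, 0.0, 4.0, 6.0]}
)
# Hypothetical cell-type label for each cell
X["celltype"] = ["T", "T", "B", "B"]

# Mean over the cells c in C_ct, computed separately for each gene g
mean_expr = X.groupby("celltype").mean()
print(mean_expr)
```

Each row of `mean_expr` is one cell type and each column one gene, which is the layout the function further down produces with `groupby(key).mean()`.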

Define the difference in mean expression between two cell types $(c_1,c_2)$ for a given gene $g$ and a given modality $m$ as
$$\delta_{c_1,c_2}^{m,g}= \langle X \rangle_{c_1,g}^m - \langle X \rangle_{c_2,g}^m $$
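
A short sketch of the pairwise differences using NumPy broadcasting, on a made-up mean-expression matrix (3 cell types, 2 genes); the array name `delta` is chosen here for illustration:

```python
import numpy as np

# Toy mean-expression matrix: rows = cell types, columns = genes (made-up values)
mean_expr = np.array([[0.0, 1.0],
                      [2.0, 1.0],
                      [5.0, 4.0]])

# delta[c1, c2, g] = mean_expr[c1, g] - mean_expr[c2, g]
delta = mean_expr[:, np.newaxis, :] - mean_expr[np.newaxis, :, :]

print(delta.shape)     # (3, 3, 2)
print(delta[2, 0, 0])  # 5.0
```

The result is antisymmetric in the two cell-type axes, so swapping $c_1$ and $c_2$ only flips the sign.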

To make these pairwise differences in mean cell type expression comparable across modalities, we divide by the sum of the absolute values of all pairwise differences between cell types for that gene. To keep the values comparable across datasets with different numbers of cell types, we scale the result by a factor of $N_{c}^2$, where $N_{c}$ is the number of cell types shared between the two modalities. We define the normalized pairwise difference in mean expression between two cell types $c_1,c_2$ in a given gene $g$ and a given modality $m$ as
$$\delta_{c_1,c_2}^{'m,g}= \frac{N_{c}^2 \delta_{c_1,c_2}^{m,g}}{\sum\limits_{c_i}\sum\limits_{c_j}|\delta_{c_i,c_j}^{m,g}|} $$
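
A sketch of the normalization on made-up values. Leaving genes with a zero denominator at zero is an assumption made here to avoid 0/0; it mirrors the masking used in the function further down.

```python
import numpy as np

# Toy mean expression: 3 cell types x 2 genes (made-up values)
mean_expr = np.array([[0.0, 1.0],
                      [2.0, 1.0],
                      [5.0, 4.0]])
n_celltypes = mean_expr.shape[0]
delta = mean_expr[:, None, :] - mean_expr[None, :, :]

# Per-gene sum of absolute pairwise differences (the denominator)
abs_sum = np.abs(delta).sum(axis=(0, 1))

# Scale by N_c^2 / abs_sum; genes with abs_sum == 0 are left at zero (assumption)
delta_norm = np.zeros_like(delta)
nonzero = abs_sum != 0
delta_norm[:, :, nonzero] = n_celltypes**2 * delta[:, :, nonzero] / abs_sum[nonzero]

# After normalization, the absolute values sum to N_c^2 = 9 for every nonzero gene
print(np.abs(delta_norm).sum(axis=(0, 1)))
```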

To compare the normalized pairwise differences in mean cell type expression between the spatial ($sp$) and single-cell ($sc$) modalities, we define $\triangle$ as
$$\triangle = \sum\limits_{g}\sum\limits_{c_1}\sum\limits_{c_2} |\delta_{c_1,c_2}^{'sp,g}- \delta_{c_1,c_2}^{'sc,g}|$$

Define the final metric $M$, bounded between 0 and 1. $M=1$ represents perfect similarity of relative cell type expression between the two modalities. $M=0$ represents perfect dissimilarity, where, for every gene, the expression values of each pair of cell types are swapped between the modalities.

$$M=1-\frac{\triangle}{2\sum\limits_{g,c_1,c_2}|\delta_{c_1,c_2}^{'sc,g}|}$$
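
The two bounds can be checked on a toy case. The sketch below assumes made-up mean expression for two cell types and two genes; the helper names `normalized_differences` and `similarity` are hypothetical and not part of the notebook.

```python
import numpy as np

def normalized_differences(mean_expr):
    """Normalized pairwise differences, shape (celltypes, celltypes, genes)."""
    n = mean_expr.shape[0]
    delta = mean_expr[:, None, :] - mean_expr[None, :, :]
    abs_sum = np.abs(delta).sum(axis=(0, 1))
    out = np.zeros_like(delta)
    ok = abs_sum != 0  # skip genes with no variation across cell types
    out[:, :, ok] = n**2 * delta[:, :, ok] / abs_sum[ok]
    return out

def similarity(mean_sp, mean_sc):
    """M = 1 - sum|d_sp - d_sc| / (2 * sum|d_sc|)."""
    d_sp = normalized_differences(mean_sp)
    d_sc = normalized_differences(mean_sc)
    return 1 - np.abs(d_sp - d_sc).sum() / (2 * np.abs(d_sc).sum())

# Rows = cell types, columns = genes (made-up values)
mean_sc = np.array([[1.0, 4.0],
                    [3.0, 0.0]])

print(similarity(mean_sc, mean_sc))        # identical modalities -> 1.0
print(similarity(mean_sc[::-1], mean_sc))  # cell types swapped   -> 0.0
```

Because the differences are normalized per gene, rescaling one modality by a constant factor leaves the score unchanged.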

We further compute the metric on a per-cell-type and a per-gene basis:

1. per cell type:

$$\triangle_{c_1}= \sum\limits_{g}\sum\limits_{c_2}|\delta_{c_1,c_2}^{'sp,g}- \delta_{c_1,c_2}^{'sc,g}|$$
$$M_{c_1}=1-\frac{\triangle_{c_1}}{2\sum\limits_{g,c_2}|\delta_{c_1,c_2}^{'sc,g}|}$$

2. per gene:
$$\triangle_{g}= \sum\limits_{c_1}\sum\limits_{c_2}|\delta_{c_1,c_2}^{'sp,g}- \delta_{c_1,c_2}^{'sc,g}|$$
$$M_{g}=1-\frac{\triangle_{g}}{2\sum\limits_{c_1,c_2}|\delta_{c_1,c_2}^{'sc,g}|}$$
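
The per-gene and per-cell-type variants differ from the overall metric only in which axes are summed. A sketch on made-up values, assuming arrays laid out as (cell type, cell type, gene); the helper `normalized_differences` is a hypothetical name:

```python
import numpy as np

def normalized_differences(mean_expr):
    """Normalized pairwise differences, shape (celltypes, celltypes, genes)."""
    n = mean_expr.shape[0]
    delta = mean_expr[:, None, :] - mean_expr[None, :, :]
    abs_sum = np.abs(delta).sum(axis=(0, 1))
    out = np.zeros_like(delta)
    ok = abs_sum != 0
    out[:, :, ok] = n**2 * delta[:, :, ok] / abs_sum[ok]
    return out

# Made-up mean expression, 3 cell types x 2 genes; gene 0 agrees, gene 1 differs
mean_sc = np.array([[0.0, 1.0], [2.0, 1.0], [5.0, 4.0]])
mean_sp = np.array([[0.0, 2.0], [2.0, 1.0], [5.0, 4.0]])

d_sc = normalized_differences(mean_sc)
d_sp = normalized_differences(mean_sp)
abs_diff = np.abs(d_sp - d_sc)

# Per gene: sum over both cell-type axes
m_gene = 1 - abs_diff.sum(axis=(0, 1)) / (2 * np.abs(d_sc).sum(axis=(0, 1)))
# Per cell type c_1: sum over the partner cell type c_2 and over genes
m_celltype = 1 - abs_diff.sum(axis=(1, 2)) / (2 * np.abs(d_sc).sum(axis=(1, 2)))

print(m_gene)      # one score per gene
print(m_celltype)  # one score per cell type
```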



In [11]:
import scanpy as sc
import numpy as np
import pandas as pd
from anndata import AnnData
from scipy.sparse import issparse
import math

In [12]:
adata_sc = sc.read_h5ad("/mnt/storage/adata_sc.h5ad")
adata_sc.layers["raw"] = adata_sc.X.copy()
sc.pp.normalize_total(adata_sc)
adata_sc.layers["norm"] = adata_sc.X.copy()
sc.pp.log1p(adata_sc)
adata_sc.layers["lognorm"] = adata_sc.X.copy()
adata_sp = sc.read_h5ad("/mnt/storage/adata_sp.h5ad")
adata_sp.layers["raw"] = adata_sp.X.copy()
sc.pp.normalize_total(adata_sp)
adata_sp.layers["norm"] = adata_sp.X.copy()
sc.pp.log1p(adata_sp)
adata_sp.layers["lognorm"] = adata_sp.X.copy()



In [68]:
def relative_celltype_expression(adata_sp: AnnData, adata_sc: AnnData, key: str = 'celltype', layer: str = 'lognorm'):
    """Calculate the similarity of relative cell type expression between spatial and scRNAseq data.

    Parameters
    ----------
    adata_sp : AnnData
        annotated ``AnnData`` object with counts from spatial data
    adata_sc : AnnData
        annotated ``AnnData`` object with counts from scRNAseq data
    key: str (default: 'celltype')
        .obs column of ``AnnData`` that contains celltype information
    layer: str (default: 'lognorm')
        layer of ``AnnData`` to use to compute the metric

    Returns
    -------
    overall_metric: float
        similarity of relative gene expression across all genes and celltypes, between the scRNAseq and spatial data
    per_gene_metric: pd.DataFrame
        similarity of relative gene expression per gene across all celltypes, between the scRNAseq and spatial data (column ``score``, indexed by gene)
    per_celltype_metric: pd.DataFrame
        similarity of relative gene expression per celltype across all genes, between the scRNAseq and spatial data (column ``score``, indexed by celltype)
  
    """   
    ### SET UP
    # set .X of each adata to the requested layer (log-normalized counts by default)
    adata_sp.X = adata_sp.layers[layer]
    adata_sc.X = adata_sc.layers[layer]
    
    # take the intersection of genes in adata_sp and adata_sc, as a list
    intersect = list(set(adata_sp.var_names).intersection(set(adata_sc.var_names)))
    
    # subset adata_sc and adata_sp to only include genes in the intersection of adata_sp and adata_sc 
    adata_sc=adata_sc[:,intersect]
    adata_sp=adata_sp[:,intersect]
    
    # sparse matrix support
    for a in [adata_sc, adata_sp]:
        if issparse(a.X):
            a.X = a.X.toarray()
            
    # find the unique celltypes in adata_sc that are also in adata_sp
    unique_celltypes=adata_sc.obs.loc[adata_sc.obs[key].isin(adata_sp.obs[key]),key].unique()
    
    
    
    #### FIND MEAN GENE EXPRESSION PER CELL TYPE FOR EACH MODALITY
    # get the adata_sc cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sc=pd.DataFrame(adata_sc.X,columns=adata_sc.var.index)
    
    # get the adata_sp cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sp=pd.DataFrame(adata_sp.X,columns=adata_sp.var.index)
    
    # add "celltype" label column to exp_sc & exp_sp cell x gene matrices 
    exp_sc[key]=list(adata_sc.obs[key])
    exp_sp[key]=list(adata_sp.obs[key])
    
    # delete all cells from the exp matrices if they aren't in the set of intersecting celltypes b/t sc & sp data
    exp_sc=exp_sc.loc[exp_sc[key].isin(unique_celltypes),:]
    exp_sp=exp_sp.loc[exp_sp[key].isin(unique_celltypes),:]
    
    # find the mean expression for each gene for each celltype in sc and sp data
    mean_celltype_sp=exp_sp.groupby(key).mean()
    mean_celltype_sc=exp_sc.groupby(key).mean()
    
    # sort genes in alphabetical order 
    mean_celltype_sc=mean_celltype_sc.loc[:,mean_celltype_sc.columns.sort_values()]
    mean_celltype_sp=mean_celltype_sp.loc[:,mean_celltype_sp.columns.sort_values()]
    
    
    #### CALCULATE PAIRWISE RELATIVE DISTANCES BETWEEN CELL TYPES
    mean_celltype_sc_np = mean_celltype_sc.T.to_numpy()
    pairwise_distances_sc = mean_celltype_sc_np[:,:,np.newaxis] - mean_celltype_sc_np[:,np.newaxis,:]
    pairwise_distances_sc = pairwise_distances_sc.transpose((1,2,0)) #results in np.array of dimensions (num_celltypes, num_celltypes, num_genes) 
       
    mean_celltype_sp_np = mean_celltype_sp.T.to_numpy()
    pairwise_distances_sp = mean_celltype_sp_np[:,:,np.newaxis] - mean_celltype_sp_np[:,np.newaxis,:]
    pairwise_distances_sp = pairwise_distances_sp.transpose((1,2,0)) #results in np.array of dimensions (num_celltypes,num_celltypes, num_genes) 
    
    #### NORMALIZE THESE PAIRWISE DISTANCES BETWEEN CELL TYPES
    #calculate sum of absolute distances
    abs_diff_sc = np.absolute(pairwise_distances_sc)
    abs_diff_sum_sc = np.sum(abs_diff_sc, axis=(0,1))
    
    abs_diff_sp = np.absolute(pairwise_distances_sp)
    abs_diff_sum_sp = np.sum(abs_diff_sp, axis=(0,1))
    
    norm_factor_sc = (1/(mean_celltype_sc.T.shape[1]**2)) * abs_diff_sum_sc
    norm_factor_sp = (1/(mean_celltype_sp.T.shape[1]**2)) * abs_diff_sum_sp
    
    
    # perform normalization in place, only for genes with a nonzero normalization factor
    # (a zero factor means all pairwise differences for that gene are zero, so an
    # unguarded division would give 0/0 = nan and emit a RuntimeWarning)
    pairwise_distances_sc[:,:,norm_factor_sc!=0] = np.divide(pairwise_distances_sc[:,:,norm_factor_sc!=0], 
                                                             norm_factor_sc[norm_factor_sc!=0])
    pairwise_distances_sp[:,:,norm_factor_sp!=0] = np.divide(pairwise_distances_sp[:,:,norm_factor_sp!=0], 
                                                             norm_factor_sp[norm_factor_sp!=0])
    norm_pairwise_distances_sc = pairwise_distances_sc
    norm_pairwise_distances_sp = pairwise_distances_sp
    
    
    ##### CALCULATE OVERALL SCORE,PER-GENE SCORES, PER-CELLTYPE SCORES
    overall_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=None)
    overall_metric = 1 - (overall_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=None)))
    
    per_gene_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(0,1))
    per_gene_metric = 1 - (per_gene_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(0,1))))
    per_gene_metric = pd.DataFrame(per_gene_metric, index=mean_celltype_sc.columns, columns=['score']) #add back the gene labels 


    per_celltype_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(1,2))
    per_celltype_metric = 1 - (per_celltype_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(1,2))))
    per_celltype_metric = pd.DataFrame(per_celltype_metric, index=mean_celltype_sc.index, columns=['score']) #add back the celltype labels 
    
    return overall_metric, per_gene_metric, per_celltype_metric
    

In [69]:
overall_metric, per_gene_metric, per_celltype_metric = relative_celltype_expression(adata_sp, adata_sc,'celltype', 'lognorm')




In [70]:
overall_metric


0.6439329547646605

In [76]:
per_gene_metric.loc['ELAVL4']

score    0.5
Name: ELAVL4, dtype: float32

In [78]:
per_gene_metric.sort_values(by = 'score')

| | score |
|---|---|
| PLN | 0.000917 |
| PCSK1N | 0.011523 |
| STMN2 | 0.013303 |
| ALDH1A1 | 0.053840 |
| ISL1 | 0.072022 |
| ... | ... |
| FABP3 | 0.902144 |
| ITLN1 | 0.920288 |
| TMEM100 | 0.939514 |
| CLDN5 | 0.974751 |


In [79]:
per_celltype_metric

| celltype | score |
|---|---|
| Atrial cardiomyocytes | 0.67969 |
| Capillary endothelium | 0.639844 |
| Cardiac neural crest cells | 0.423477 |
| Endothelium / pericytes | 0.578967 |
| Epicardial cells | 0.677966 |
| Fibroblast-like | 0.701047 |
| Myoz2-enriched cardiomyocytes | 0.700623 |
| Smooth muscle cells | 0.630102 |
| Ventricular cardiomyocytes | 0.723606 |


In [19]:
    key = 'celltype'
    ### SET UP
    # set the .X layer of each of the adatas to be log-normalized counts
    adata_sp.X = adata_sp.layers['lognorm']
    adata_sc.X = adata_sc.layers['lognorm']
    
    # take the intersection of genes in adata_sp and adata_sc, as a list
    intersect = list(set(adata_sp.var_names).intersection(set(adata_sc.var_names)))
    
    # subset adata_sc and adata_sp to only include genes in the intersection of adata_sp and adata_sc 
    adata_sc=adata_sc[:,intersect]
    adata_sp=adata_sp[:,intersect]
    
    # sparse matrix support
    for a in [adata_sc, adata_sp]:
        if issparse(a.X):
            a.X = a.X.toarray()
            
    # find the unique celltypes in adata_sc that are also in adata_sp
    unique_celltypes=adata_sc.obs.loc[adata_sc.obs[key].isin(adata_sp.obs[key]),key].unique()
    
    
    
    #### FIND MEAN GENE EXPRESSION PER CELL TYPE FOR EACH MODALITY
    # get the adata_sc cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sc=pd.DataFrame(adata_sc.X,columns=adata_sc.var.index)
    
    # get the adata_sp cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sp=pd.DataFrame(adata_sp.X,columns=adata_sp.var.index)
    
    # add "celltype" label column to exp_sc & exp_sp cell x gene matrices 
    exp_sc[key]=list(adata_sc.obs[key])
    exp_sp[key]=list(adata_sp.obs[key])
    
    # delete all cells from the exp matrices if they aren't in the set of intersecting celltypes b/t sc & sp data
    exp_sc=exp_sc.loc[exp_sc[key].isin(unique_celltypes),:]
    exp_sp=exp_sp.loc[exp_sp[key].isin(unique_celltypes),:]
    
    # find the mean expression for each gene for each celltype in sc and sp data
    mean_celltype_sp=exp_sp.groupby(key).mean()
    mean_celltype_sc=exp_sc.groupby(key).mean()
    
    # sort genes in alphabetical order 
    mean_celltype_sc=mean_celltype_sc.loc[:,mean_celltype_sc.columns.sort_values()]
    mean_celltype_sp=mean_celltype_sp.loc[:,mean_celltype_sp.columns.sort_values()]
    
    
    #### CALCULATE PAIRWISE RELATIVE DISTANCES BETWEEN CELL TYPES
    mean_celltype_sc_np = mean_celltype_sc.T.to_numpy()
    pairwise_distances_sc = mean_celltype_sc_np[:,:,np.newaxis] - mean_celltype_sc_np[:,np.newaxis,:]
    pairwise_distances_sc = pairwise_distances_sc.transpose((1,2,0)) #results in np.array of dimensions (num_celltypes, num_celltypes, num_genes) 
       
    mean_celltype_sp_np = mean_celltype_sp.T.to_numpy()
    pairwise_distances_sp = mean_celltype_sp_np[:,:,np.newaxis] - mean_celltype_sp_np[:,np.newaxis,:]
    pairwise_distances_sp = pairwise_distances_sp.transpose((1,2,0)) #results in np.array of dimensions (num_celltypes, num_celltypes, num_genes) 
    

In [20]:
mean_celltype_sc    

| celltype | ALDH1A1 | CCDC102B | CDK1 | CLDN5 | CLU | COL1A1 | COL1A2 | COL3A1 | COL9A2 | COX4I2 | ... | TBX18 | TBX5 | TCF21 | TM4SF18 | TMEM100 | TNNI1 | TNNT1 | TOP2A | TPM1 | TRIL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Atrial cardiomyocytes | 0.070622 | 0.014332 | 0.134023 | 0.0 | 0.173841 | 0.770815 | 0.631345 | 0.567033 | 0.011336 | 0.0 | ... | 0.315678 | 0.930767 | 0.212845 | 0.014428 | 0.014428 | 3.255895 | 0.058358 | 0.214871 | 4.437976 | 0.083696 |
| Capillary endothelium | 0.018484 | 0.008688 | 0.127884 | 0.121026 | 0.679183 | 0.848604 | 1.665936 | 3.010427 | 0.029125 | 0.05338 | ... | 0.004407 | 0.01232 | 0.024709 | 0.037562 | 1.816919 | 0.409603 | 0.034472 | 0.124519 | 1.557335 | 0.01331 |
| Cardiac neural crest cells | 0.0 | 0.0 | 0.2754 | 0.0 | 0.0 | 0.327139 | 0.105324 | 0.226673 | 0.106637 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.595246 | 0.221815 | 0.412716 | 1.205898 | 0.0 |
| Endothelium / pericytes | 0.021827 | 0.020329 | 0.277732 | 1.84266 | 0.091237 | 0.174514 | 0.617577 | 0.590308 | 0.0 | 0.0 | ... | 0.022038 | 0.0 | 0.016606 | 1.061958 | 0.04682 | 0.504329 | 0.0 | 0.277497 | 1.308317 | 0.024577 |
| Epicardial cells | 0.020325 | 0.193309 | 0.138615 | 0.025699 | 0.29825 | 3.206337 | 2.527879 | 2.527611 | 0.087567 | 0.034449 | ... | 0.497212 | 0.109456 | 1.388747 | 0.008125 | 0.047454 | 0.99715 | 0.963487 | 0.146544 | 2.759103 | 0.244209 |
| Fibroblast-like | 0.026235 | 0.34269 | 0.209518 | 0.0 | 1.048562 | 3.08209 | 2.69659 | 2.653021 | 0.597371 | 0.110919 | ... | 0.078199 | 0.109375 | 1.362039 | 0.0 | 0.102516 | 0.483224 | 0.045288 | 0.272569 | 2.439869 | 0.342264 |
| Myoz2-enriched cardiomyocytes | 0.0 | 0.0 | 0.408046 | 0.0 | 0.10199 | 0.307963 | 0.241134 | 0.216542 | 0.0 | 0.0 | ... | 0.0 | 0.182265 | 0.0 | 0.0 | 0.041576 | 4.009036 | 0.046568 | 0.180283 | 5.287023 | 0.0 |
| Smooth muscle cells | 0.296529 | 0.041946 | 0.265166 | 0.0 | 0.385315 | 3.310543 | 2.814162 | 2.820954 | 0.185865 | 0.076334 | ... | 0.201595 | 0.276645 | 1.048125 | 0.0 | 0.086988 | 0.674345 | 0.076691 | 0.380033 | 3.139091 | 0.088886 |
| Ventricular cardiomyocytes | 0.0 | 0.025544 | 0.073364 | 0.027397 | 0.346445 | 0.326636 | 0.192586 | 0.311006 | 0.041929 | 0.0 | ... | 0.0 | 0.202851 | 0.112871 | 0.010568 | 0.037833 | 3.854236 | 0.063967 | 0.08359 | 4.945629 | 0.020986 |


In [35]:
np.absolute(pairwise_distances_sc[:,:,0]).sum()

5.815171

In [38]:
np.sum(np.absolute(pairwise_distances_sc), axis=(0,1))

array([  5.815171 ,   8.113675 ,   8.790309 ,  31.256842 ,  26.789362 ,
       109.08644  ,  95.87685  , 102.411964 ,  12.784489 ,   3.255559 ,
        38.600258 ,  49.041695 ,  10.61685  ,  17.581856 ,   4.618783 ,
        43.98599  ,  80.97139  ,   7.15145  ,   1.2755082,  27.684597 ,
         7.08749  ,  32.64617  ,  83.51407  ,   1.9210945,  32.308155 ,
        17.339188 ,  51.665276 ,  16.361362 ,  31.646755 ,  52.276478 ,
         4.449986 ,  13.448379 ,  47.725346 ,  66.605255 ,  82.73471  ,
        92.859634 , 111.42126  ,  39.98896  ,  64.683876 ,   6.7342534,
        28.141737 ,   8.483444 ,  95.0742   ,   8.627323 ,  49.191463 ,
        43.088745 ,   0.978282 ,  85.99298  ,  46.194878 , 109.798874 ,
        16.442139 ,  12.551054 ,   3.198803 ,  56.48529  ,  60.479668 ,
         1.4930291,  13.651462 ,  20.027817 ,  47.569122 ,  17.59976  ,
        30.544506 , 121.66321  ,  17.98474  ,   9.748947 , 134.78775  ,
         9.2621765], dtype=float32)

In [None]:
    #### normalize these pairwise distances between cell types
    #calculate sum of absolute distances
    abs_diff_sc = np.absolute(pairwise_distances_sc)
    abs_diff_sum_sc = np.sum(abs_diff_sc, axis=(0,1))
    
    abs_diff_sp = np.absolute(pairwise_distances_sp)
    abs_diff_sum_sp = np.sum(abs_diff_sp, axis=(0,1))
    

In [None]:
    # calculate normalization factor
    norm_factor_sc = (1/(mean_celltype_sc.T.shape[1]**2)) * abs_diff_sum_sc
    norm_factor_sp = (1/(mean_celltype_sp.T.shape[1]**2)) * abs_diff_sum_sp
    
    #perform normalization
    norm_pairwise_distances_sc = np.divide(pairwise_distances_sc, norm_factor_sc)
    norm_pairwise_distances_sp = np.divide(pairwise_distances_sp, norm_factor_sp)

In [374]:
    ##### CALCULATE OVERALL SCORE,PER-GENE SCORES, PER-CELLTYPE SCORES
    overall_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=None)
    overall_metric = 1 - (overall_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=None)))
    
    per_gene_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(0,1))
    per_gene_metric = 1 - (per_gene_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(0,1))))
    per_gene_metric = pd.DataFrame(per_gene_metric, index=mean_celltype_sc.columns, columns=['score']) #add back the gene labels 
    
    overall_metric

nan
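
The `nan` above is consistent with an unguarded division by a zero normalization factor: a gene with identical mean expression in every cell type gives 0/0. That this is the cause in this particular run is an assumption. A small sketch on made-up values, showing how masking zero factors (as the function version does) keeps the sum finite:

```python
import numpy as np

# Made-up pairwise differences for 2 cell types x 2 genes; gene 1 is constant
pairwise = np.zeros((2, 2, 2), dtype=np.float32)
pairwise[0, 1, 0], pairwise[1, 0, 0] = -2.0, 2.0

# Per-gene normalization factor: sum of absolute differences / N_c^2
norm_factor = np.abs(pairwise).sum(axis=(0, 1)) / pairwise.shape[0] ** 2

# Unguarded division: gene 1 becomes 0/0 = nan and propagates into any sum
with np.errstate(divide="ignore", invalid="ignore"):
    unguarded = pairwise / norm_factor
print(np.isnan(unguarded.sum()))  # True

# Guarded division: only divide where the factor is nonzero
guarded = pairwise.copy()
ok = norm_factor != 0
guarded[:, :, ok] = pairwise[:, :, ok] / norm_factor[ok]
print(np.isnan(guarded.sum()))    # False
```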

In [376]:
    per_celltype_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(1,2))
    per_celltype_metric = 1 - (per_celltype_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(1,2))))
    per_celltype_metric = pd.DataFrame(per_celltype_metric, index=mean_celltype_sc.index, columns=['score']) #add back the celltype labels 
   

In [379]:
per_gene_metric

| | score |
|---|---|
| ALDH1A1 | 0.053840 |
| CCDC102B | 0.740426 |
| CDK1 | 0.564429 |
| CLDN5 | 0.974751 |
| CLU | 0.389904 |
| ... | ... |
| TNNI1 | 0.774648 |
| TNNT1 | 0.815976 |
| TOP2A | 0.597923 |
| TPM1 | 0.610392 |


In [380]:
per_gene_metric.sort_values(by = 'score')

| | score |
|---|---|
| PLN | 0.000917 |
| PCSK1N | 0.011523 |
| STMN2 | 0.013303 |
| ALDH1A1 | 0.053840 |
| ISL1 | 0.072022 |
| ... | ... |
| ITLN1 | 0.920288 |
| TMEM100 | 0.939514 |
| CLDN5 | 0.974751 |
| TM4SF18 | 0.975703 |
