Prior to computing each similarity metric, we subset the genes and the celltypes to those shared between both modalities. 

### Relative expression across genes
To assess the similarity of the relative expression of gene pairs in each celltype cluster between the two modalities, we compute the following metric:

Let the matrix of mean expression values for each gene $g$ in each celltype $ct$ for a given modality $m$ be defined as $$\langle X \rangle_{ct,g}^m = \frac{1}{\lvert C_{ct} \rvert} \sum\limits_{c\in C_{ct}} X_{c,g}^{m}$$ 

where $C_{ct}$ is the set of all cells in a given celltype cluster $ct$ and $X_{c,g}^{m}$ is the matrix of normalized expression values for each gene $g$ in each cell $c$. 

Then, define the difference in mean expression between two genes ($g_1$, $g_2$) in a given celltype $ct$ and a given modality $m$ as

$$\delta_{g1,g2}^{m,ct} = \langle X \rangle_{ct,g1}^m - \langle X \rangle_{ct,g2}^m$$
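As a minimal sketch of the two definitions above, the per-celltype means and their pairwise differences can be computed with NumPy broadcasting. The toy matrix `X`, the labels, and the helper name `pairwise_mean_differences` are assumptions made for illustration; the axis order used here is (celltype, gene 1, gene 2).

```python
import numpy as np

def pairwise_mean_differences(X, labels):
    """Return (celltypes, delta) with delta[ct, g1, g2] = mean[ct, g1] - mean[ct, g2]."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    celltypes = np.unique(labels)  # sorted unique celltype labels
    # mean expression of each gene within each celltype: shape (n_celltypes, n_genes)
    means = np.stack([X[labels == ct].mean(axis=0) for ct in celltypes])
    # broadcasting produces every gene-by-gene difference within each celltype
    delta = means[:, :, np.newaxis] - means[:, np.newaxis, :]
    return celltypes, delta

# toy data (hypothetical): 3 cells x 3 genes, two celltypes
X = np.array([[1.0, 3.0, 0.0],
              [3.0, 5.0, 2.0],
              [0.0, 4.0, 4.0]])
labels = ["A", "A", "B"]
celltypes, delta = pairwise_mean_differences(X, labels)
print(delta.shape)  # (n_celltypes, n_genes, n_genes)
```

By construction the differences are antisymmetric, so `delta[ct, g1, g2]` is the negative of `delta[ct, g2, g1]`.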
 
To normalize these pairwise differences in mean gene expression so that they are comparable across modalities, we divide by the sum of the absolute values of all possible pairwise differences between genes, $\sum\limits_{g1}\sum\limits_{g2}\lvert \delta_{g1,g2}^{m,ct} \rvert$. Furthermore, to ensure that the values are comparable across datasets with different numbers of genes, we scale the result by a factor of $N^2_{g}$. In sum, we define the normalized pairwise difference in mean expression between two genes ($g_1$, $g_2$) in a given celltype $ct$ and a given modality $m$ as 

$$\delta_{g1,g2}^{'m,ct} = \frac{N^2_{g} \delta_{g1,g2}^{m,ct}}{\sum\limits_{g1}\sum\limits_{g2}\lvert \delta_{g1,g2}^{m,ct} \rvert} $$

where $N_{g}$ is the total number of genes shared between the two modalities.
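The normalization can be sketched as follows, written to follow the formula above with $N_g^2$ in the numerator. The helper name `normalize_pairwise_differences` and the toy means are assumptions for illustration, and the axis order is (celltype, gene 1, gene 2).

```python
import numpy as np

def normalize_pairwise_differences(delta):
    """Normalize delta of shape (n_celltypes, n_genes, n_genes) per celltype."""
    n_genes = delta.shape[1]
    # sum of |delta| over all ordered gene pairs, kept as shape (n_celltypes, 1, 1)
    total = np.abs(delta).sum(axis=(1, 2), keepdims=True)
    # assumes at least two genes differ in every celltype, otherwise total is 0
    return n_genes**2 * delta / total

# toy per-celltype means (hypothetical): 2 celltypes x 3 genes
means = np.array([[2.0, 4.0, 1.0],
                  [0.0, 4.0, 4.0]])
delta = means[:, :, np.newaxis] - means[:, np.newaxis, :]
delta_norm = normalize_pairwise_differences(delta)

# rescaling the input by a constant leaves the normalized differences unchanged
delta_norm_rescaled = normalize_pairwise_differences(10 * delta)
print(np.allclose(delta_norm, delta_norm_rescaled))

# per-celltype sums of |delta_norm| equal n_genes**2 (9 here), up to rounding
print(np.abs(delta_norm).sum(axis=(1, 2)))
```

The invariance to a constant rescaling is what makes the normalized differences comparable between modalities with different overall expression scales.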

Finally, we compare the normalized pairwise differences in mean gene expression between modalities ($sc$ and $sp$), as follows: 
$$\triangle = \sum\limits_{ct} \sum\limits_{g1} \sum\limits_{g2} \lvert \delta_{g1,g2}^{'sp,ct} - \delta_{g1,g2}^{'sc,ct} \rvert $$

The final metric $M$ has a maximum of 1, which represents perfect similarity of relative gene expression between modalities. It is constructed so that a value of $0$ represents perfect dissimilarity, which occurs when the sign of every pairwise difference is flipped between modalities (the expression values of the two genes in each gene pair are swapped). 

$$ M = 1 - \frac{\triangle}{ 2 \sum\limits_{ct} \sum\limits_{g1} \sum\limits_{g2} \lvert \delta_{g1,g2}^{'sc,ct} \rvert } $$
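To illustrate the two extremes of the metric, here is a small sketch on toy per-celltype means. It uses the sum of absolute normalized differences in the scRNAseq data as the denominator, as the function tested in this notebook does. The helper name `similarity_metric` and the toy values are assumptions for illustration.

```python
import numpy as np

def similarity_metric(means_sp, means_sc):
    """Overall metric M from per-celltype mean matrices of shape (n_celltypes, n_genes)."""
    def normalized(means):
        delta = means[:, :, np.newaxis] - means[:, np.newaxis, :]
        n_genes = means.shape[1]
        return n_genes**2 * delta / np.abs(delta).sum(axis=(1, 2), keepdims=True)

    d_sp, d_sc = normalized(means_sp), normalized(means_sc)
    triangle = np.abs(d_sp - d_sc).sum()
    return 1 - triangle / (2 * np.abs(d_sc).sum())

# toy scRNAseq means (hypothetical): 1 celltype x 3 genes
means_sc = np.array([[2.0, 4.0, 1.0]])

# identical relative expression in both modalities
m_identical = similarity_metric(means_sc, means_sc)

# reflecting the means flips the sign of every pairwise difference
means_flipped = means_sc.max() - means_sc
m_flipped = similarity_metric(means_flipped, means_sc)

print(m_identical, m_flipped)
```

In this toy case the identical input gives a value of 1 and the sign-flipped input gives a value of 0, up to floating-point rounding.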


We can further compute the metric on a per-gene and per-celltype basis:

1) per gene:
$$\triangle_{g1} = \sum\limits_{ct} \sum\limits_{g2} \lvert \delta_{g1,g2}^{'sp,ct} - \delta_{g1,g2}^{'sc,ct} \rvert $$
$$ M_{g1} = 1 - \frac{\triangle_{g1}}{ 2 \sum\limits_{ct} \sum\limits_{g2} \lvert \delta_{g1,g2}^{'sc,ct} \rvert } $$

2) per celltype:
$$\triangle_{ct} = \sum\limits_{g1} \sum\limits_{g2} \lvert \delta_{g1,g2}^{'sp,ct} - \delta_{g1,g2}^{'sc,ct} \rvert $$
$$ M_{ct} = 1 - \frac{\triangle_{ct}}{ 2 \sum\limits_{g1} \sum\limits_{g2} \lvert \delta_{g1,g2}^{'sc,ct} \rvert } $$
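The per-gene and per-celltype variants differ from the overall metric only in which axes are summed over. A minimal sketch, assuming arrays laid out as (celltype, gene 1, gene 2) and a hypothetical helper name `per_gene_and_per_celltype` (the function tested in this notebook uses the order (gene 1, gene 2, celltype) instead):

```python
import numpy as np

def normalized_differences(means):
    """Normalized pairwise differences for means of shape (n_celltypes, n_genes)."""
    delta = means[:, :, np.newaxis] - means[:, np.newaxis, :]
    n_genes = means.shape[1]
    return n_genes**2 * delta / np.abs(delta).sum(axis=(1, 2), keepdims=True)

def per_gene_and_per_celltype(means_sp, means_sc):
    d_sp, d_sc = normalized_differences(means_sp), normalized_differences(means_sc)
    abs_diff = np.abs(d_sp - d_sc)  # axes: (celltype, gene 1, gene 2)
    abs_sc = np.abs(d_sc)
    # per gene: sum over celltypes (axis 0) and over the second gene (axis 2)
    per_gene = 1 - abs_diff.sum(axis=(0, 2)) / (2 * abs_sc.sum(axis=(0, 2)))
    # per celltype: sum over both gene axes
    per_celltype = 1 - abs_diff.sum(axis=(1, 2)) / (2 * abs_sc.sum(axis=(1, 2)))
    return per_gene, per_celltype

# toy means (hypothetical): celltype 0 agrees between modalities,
# celltype 1 has every pairwise difference sign-flipped
means_sc = np.array([[2.0, 4.0, 1.0],
                     [0.0, 4.0, 3.0]])
means_sp = np.array([[2.0, 4.0, 1.0],
                     [4.0, 0.0, 1.0]])
per_gene, per_celltype = per_gene_and_per_celltype(means_sp, means_sc)
print(per_gene)
print(per_celltype)
```

In this toy case the per-celltype scores separate the agreeing celltype (score 1) from the flipped one (score 0), while each per-gene score mixes contributions from both celltypes.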

### TEST the function code below:

In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
from anndata import AnnData
from scipy.sparse import issparse

In [6]:
def relative_gene_expression(adata_sp: AnnData, adata_sc: AnnData, key:str='celltype', layer:str='lognorm'):
    """Score the similarity of relative gene expression between spatial and scRNAseq data.

    For each shared celltype, the pairwise differences in mean expression between all
    shared genes are normalized and compared between the two modalities.

    Parameters
    ----------
    adata_sp : AnnData
        annotated ``AnnData`` object with counts from spatial data
    adata_sc : AnnData
        annotated ``AnnData`` object with counts from scRNAseq data
    key : str, optional
        column in ``.obs`` of both objects that holds the celltype labels (default: 'celltype')
    layer : str, optional
        layer in both objects that holds the log-normalized counts (default: 'lognorm')

    Returns
    -------
    overall_metric : float
        Similarity of relative gene expression across all shared genes and celltypes (maximum 1)
    per_gene_metric : pandas dataframe
        Similarity score for every shared gene, in a column named 'score'
    per_celltype_metric : pandas dataframe
        Similarity score for every shared celltype, in a column named 'score'
    """   
    ### SET UP
    # set the .X layer of each of the adatas to be log-normalized counts
    adata_sp.X = adata_sp.layers[layer]
    adata_sc.X = adata_sc.layers[layer]
    
    # take the intersection of genes in adata_sp and adata_sc, as a list
    intersect = list(set(adata_sp.var_names).intersection(set(adata_sc.var_names)))
    
    # subset adata_sc and adata_sp to only include genes in the intersection of adata_sp and adata_sc 
    adata_sc=adata_sc[:,intersect]
    adata_sp=adata_sp[:,intersect]
    
    # sparse matrix support
    for a in [adata_sc, adata_sp]:
        if issparse(a.X):
            a.X = a.X.toarray()
            
    # find the unique celltypes in adata_sc that are also in adata_sp
    unique_celltypes=adata_sc.obs.loc[adata_sc.obs[key].isin(adata_sp.obs[key]),key].unique()
    
    
    
    #### FIND MEAN GENE EXPRESSION PER CELL TYPE FOR EACH MODALITY
    # get the adata_sc cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sc=pd.DataFrame(adata_sc.X,columns=adata_sc.var.index)
    
    # get the adata_sp cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sp=pd.DataFrame(adata_sp.X,columns=adata_sp.var.index)
    
    # add "celltype" label column to exp_sc & exp_sp cell x gene matrices 
    exp_sc[key]=list(adata_sc.obs[key])
    exp_sp[key]=list(adata_sp.obs[key])
    
    # delete all cells from the exp matrices if they aren't in the set of intersecting celltypes b/t sc & sp data
    exp_sc=exp_sc.loc[exp_sc[key].isin(unique_celltypes),:]
    exp_sp=exp_sp.loc[exp_sp[key].isin(unique_celltypes),:]
    
    # find the mean expression for each gene for each celltype in sc and sp data
    mean_celltype_sp=exp_sp.groupby(key).mean()
    mean_celltype_sc=exp_sc.groupby(key).mean()
    
    # sort genes in alphabetical order 
    mean_celltype_sc=mean_celltype_sc.loc[:,mean_celltype_sc.columns.sort_values()]
    mean_celltype_sp=mean_celltype_sp.loc[:,mean_celltype_sp.columns.sort_values()]
    
    
    #### CALCULATE PAIRWISE RELATIVE DISTANCES BETWEEN GENES
    mean_celltype_sc_np = mean_celltype_sc.to_numpy()
    pairwise_distances_sc = mean_celltype_sc_np[:,:,np.newaxis] - mean_celltype_sc_np[:,np.newaxis,:]
    pairwise_distances_sc = pairwise_distances_sc.transpose((1,2,0))
       
    mean_celltype_sp_np = mean_celltype_sp.to_numpy()
    pairwise_distances_sp = mean_celltype_sp_np[:,:,np.newaxis] - mean_celltype_sp_np[:,np.newaxis,:]
    pairwise_distances_sp = pairwise_distances_sp.transpose((1,2,0))
    
    #### NORMALIZE THESE PAIRWISE DISTANCES BETWEEN GENES
    #calculate sum of absolute distances
    abs_diff_sc = np.absolute(pairwise_distances_sc)
    abs_diff_sum_sc = np.sum(abs_diff_sc, axis=(0,1))
    
    abs_diff_sp = np.absolute(pairwise_distances_sp)
    abs_diff_sum_sp = np.sum(abs_diff_sp, axis=(0,1))
    
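    # calculate normalization factor
    # (note: this divides by N_g^2, whereas the formula above multiplies by it;
    #  the constant cancels in the metric ratios below, so the scores are unaffected)
    norm_factor_sc = mean_celltype_sc.shape[1]**2 * abs_diff_sum_sc
    norm_factor_sp = mean_celltype_sp.shape[1]**2 * abs_diff_sum_sp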
    
    #perform normalization
    norm_pairwise_distances_sc = np.divide(pairwise_distances_sc, norm_factor_sc)
    norm_pairwise_distances_sp = np.divide(pairwise_distances_sp, norm_factor_sp)
    
    
    ##### CALCULATE OVERALL SCORE,PER-GENE SCORES, PER-CELLTYPE SCORES
    overall_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=None)
    overall_metric = 1 - (overall_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=None)))
    
    per_gene_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(1,2))
    per_gene_metric = 1 - (per_gene_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(1,2))))
    per_gene_metric = pd.DataFrame(per_gene_metric, index=mean_celltype_sc.columns, columns=['score']) #add back the gene labels 
    
    per_celltype_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(0,1))
    per_celltype_metric = 1 - (per_celltype_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(0,1))))
    per_celltype_metric = pd.DataFrame(per_celltype_metric, index=mean_celltype_sc.index, columns=['score']) #add back the celltype labels 
    
    return overall_metric, per_gene_metric, per_celltype_metric
    
     
    
    

#### import test data

In [3]:
## load in example datasets 
adata_sc = sc.read("/mnt/storage/adata_sc.h5ad")
adata_sp = sc.read("/mnt/storage/adata_sp.h5ad")

In [4]:
## first normalize the test adatas -- they don't have a lognorm layer yet
adata_sc.layers['raw'] = adata_sc.X
adata_sp.layers['raw'] = adata_sp.X

adata_sc.layers['norm'] = sc.pp.normalize_total(adata=adata_sc, target_sum=None, exclude_highly_expressed=False, max_fraction=0.05, key_added=None, layer=None, copy=False, inplace=False)['X']
adata_sp.layers['norm'] = sc.pp.normalize_total(adata=adata_sp, target_sum=None, exclude_highly_expressed=False, max_fraction=0.05, key_added=None, layer=None, copy=False, inplace=False)['X']

adata_sc.layers['lognorm'] = adata_sc.layers['norm'].copy()
adata_sp.layers['lognorm'] = adata_sp.layers['norm'].copy()

sc.pp.log1p(adata_sc, layer='lognorm')
sc.pp.log1p(adata_sp, layer='lognorm')



In [7]:
overall_metric, per_gene_metric, per_celltype_metric = relative_gene_expression(adata_sp, adata_sc)

In [8]:
overall_metric

0.5747764624851616

In [9]:
per_gene_metric

gene,score
ALDH1A1,0.343729
CCDC102B,0.619379
CDK1,0.624260
CLDN5,0.556060
CLU,0.481670
...,...
TNNI1,0.629613
TNNT1,0.548418
TOP2A,0.567718
TPM1,0.553430


In [10]:
per_gene_metric.sort_values(by = 'score')

gene,score
STMN2,0.009524
TBX18,0.150236
MYBPC3,0.194568
PAM,0.257041
MYRF,0.309373
...,...
TMEM100,0.691615
PTN,0.700911
COL1A2,0.762755
COL3A1,0.791830


In [11]:
per_celltype_metric

celltype,score
Atrial cardiomyocytes,0.603222
Capillary endothelium,0.649237
Cardiac neural crest cells,0.182063
Endothelium / pericytes,0.442457
Epicardial cells,0.461399
Fibroblast-like,0.744147
Myoz2-enriched cardiomyocytes,0.651561
Smooth muscle cells,0.751999
Ventricular cardiomyocytes,0.686895


In [12]:
per_celltype_metric.sort_values(by = 'score')

celltype,score
Cardiac neural crest cells,0.182063
Endothelium / pericytes,0.442457
Epicardial cells,0.461399
Atrial cardiomyocytes,0.603222
Capillary endothelium,0.649237
Myoz2-enriched cardiomyocytes,0.651561
Ventricular cardiomyocytes,0.686895
Fibroblast-like,0.744147
Smooth muscle cells,0.751999


### Finally, test the function in chunks...

In [13]:
layer='lognorm'
key='celltype'

### SET UP
# set the .X layer of each of the adatas to be log-normalized counts
adata_sp.X = adata_sp.layers[layer]
adata_sc.X = adata_sc.layers[layer]
    
# take the intersection of genes in adata_sp and adata_sc, as a list
intersect = list(set(adata_sp.var_names).intersection(set(adata_sc.var_names)))

# subset adata_sc and adata_sp to only include genes in the intersection of adata_sp and adata_sc 
adata_sc=adata_sc[:,intersect]
adata_sp=adata_sp[:,intersect]

# sparse matrix support
for a in [adata_sc, adata_sp]:
    if issparse(a.X):
        a.X = a.X.toarray()

# find the unique celltypes in adata_sc that are also in adata_sp
unique_celltypes=adata_sc.obs.loc[adata_sc.obs[key].isin(adata_sp.obs[key]),key].unique()


In [14]:
#### FIND MEAN GENE EXPRESSION PER CELL TYPE FOR EACH MODALITY
# get the adata_sc cell x gene matrix as a pandas dataframe (w gene names as column names)
exp_sc=pd.DataFrame(adata_sc.X,columns=adata_sc.var.index)

# get the adata_sp cell x gene matrix as a pandas dataframe (w gene names as column names)
exp_sp=pd.DataFrame(adata_sp.X,columns=adata_sp.var.index)

# add "celltype" label column to exp_sc & exp_sp cell x gene matrices 
exp_sc[key]=list(adata_sc.obs[key])
exp_sp[key]=list(adata_sp.obs[key])

# delete all cells from the exp matrices if they aren't in the set of intersecting celltypes b/t sc & sp data
exp_sc=exp_sc.loc[exp_sc[key].isin(unique_celltypes),:]
exp_sp=exp_sp.loc[exp_sp[key].isin(unique_celltypes),:]

# find the mean expression for each gene for each celltype in sc and sp data
mean_celltype_sp=exp_sp.groupby(key).mean()
mean_celltype_sc=exp_sc.groupby(key).mean()

# sort genes in alphabetical order 
mean_celltype_sc=mean_celltype_sc.loc[:,mean_celltype_sc.columns.sort_values()]
mean_celltype_sp=mean_celltype_sp.loc[:,mean_celltype_sp.columns.sort_values()]

In [17]:
#### CALCULATE PAIRWISE RELATIVE DISTANCES BETWEEN GENES
mean_celltype_sc_np = mean_celltype_sc.to_numpy()
pairwise_distances_sc = mean_celltype_sc_np[:,:,np.newaxis] - mean_celltype_sc_np[:,np.newaxis,:]
pairwise_distances_sc = pairwise_distances_sc.transpose((1,2,0))

mean_celltype_sp_np = mean_celltype_sp.to_numpy()
pairwise_distances_sp = mean_celltype_sp_np[:,:,np.newaxis] - mean_celltype_sp_np[:,np.newaxis,:]
pairwise_distances_sp = pairwise_distances_sp.transpose((1,2,0))


In [24]:
pairwise_distances_sc[:,:,0]

array([[ 0.        ,  0.05628961, -0.0634018 , ..., -0.14424898,
        -4.3673544 , -0.01307469],
       [-0.05628961,  0.        , -0.11969141, ..., -0.20053859,
        -4.423644  , -0.06936429],
       [ 0.0634018 ,  0.11969141,  0.        , ..., -0.08084717,
        -4.3039527 ,  0.05032711],
       ...,
       [ 0.14424898,  0.20053859,  0.08084717, ...,  0.        ,
        -4.2231054 ,  0.1311743 ],
       [ 4.3673544 ,  4.423644  ,  4.3039527 , ...,  4.2231054 ,
         0.        ,  4.3542795 ],
       [ 0.01307469,  0.06936429, -0.05032711, ..., -0.1311743 ,
        -4.3542795 ,  0.        ]], dtype=float32)

In [25]:
mean_celltype_sc.head()

celltype,ALDH1A1,CCDC102B,CDK1,CLDN5,CLU,COL1A1,COL1A2,COL3A1,COL9A2,COX4I2,...,TBX18,TBX5,TCF21,TM4SF18,TMEM100,TNNI1,TNNT1,TOP2A,TPM1,TRIL
Atrial cardiomyocytes,0.070622,0.014332,0.134023,0.0,0.173841,0.770815,0.631345,0.567033,0.011336,0.0,...,0.315678,0.930767,0.212845,0.014428,0.014428,3.255895,0.058358,0.214871,4.437976,0.083696
Capillary endothelium,0.018484,0.008688,0.127884,0.121026,0.679183,0.848604,1.665936,3.010427,0.029125,0.05338,...,0.004407,0.01232,0.024709,0.037562,1.816919,0.409603,0.034472,0.124519,1.557335,0.01331
Cardiac neural crest cells,0.0,0.0,0.2754,0.0,0.0,0.327139,0.105324,0.226673,0.106637,0.0,...,0.0,0.0,0.0,0.0,0.0,0.595246,0.221815,0.412716,1.205898,0.0
Endothelium / pericytes,0.021827,0.020329,0.277732,1.84266,0.091237,0.174514,0.617577,0.590308,0.0,0.0,...,0.022038,0.0,0.016606,1.061958,0.04682,0.504329,0.0,0.277497,1.308317,0.024577
Epicardial cells,0.020325,0.193309,0.138615,0.025699,0.29825,3.206337,2.527879,2.527611,0.087567,0.034449,...,0.497212,0.109456,1.388747,0.008125,0.047454,0.99715,0.963487,0.146544,2.759103,0.244209


In [27]:
## compare values to check that this pairwise difference operation worked
(mean_celltype_sc.iloc[0,0] - mean_celltype_sc.iloc[0,1])  == pairwise_distances_sc[0,1,0]

True

In [28]:
(mean_celltype_sc.iloc[8,1] - mean_celltype_sc.iloc[8,4])  == pairwise_distances_sc[1,4,8]

True

In [30]:
(mean_celltype_sc.iloc[5,63] - mean_celltype_sc.iloc[5,3])  == pairwise_distances_sc[63,3,5]

True
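The three spot checks above compare hand-picked entries with exact floating-point equality. A more exhaustive variant, sketched here on toy data, compares every entry against an explicit loop using `np.allclose`. The helper name `check_pairwise_differences` and the toy DataFrame are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def check_pairwise_differences(mean_df, pairwise):
    """Check pairwise[g1, g2, ct] == mean[ct, g1] - mean[ct, g2] for every entry."""
    values = mean_df.to_numpy()
    n_celltypes, n_genes = values.shape
    expected = np.empty((n_genes, n_genes, n_celltypes), dtype=values.dtype)
    for ct in range(n_celltypes):
        for i in range(n_genes):
            for j in range(n_genes):
                expected[i, j, ct] = values[ct, i] - values[ct, j]
    # tolerance-based comparison instead of exact equality
    return np.allclose(pairwise, expected)

# toy celltype x gene table of means (hypothetical values)
mean_celltype = pd.DataFrame([[2.0, 4.0, 1.0],
                              [0.0, 4.0, 3.0]],
                             index=["A", "B"], columns=["g1", "g2", "g3"])
values = mean_celltype.to_numpy()

# same broadcasting and transpose as in the cell above: result is (gene, gene, celltype)
pairwise = (values[:, :, np.newaxis] - values[:, np.newaxis, :]).transpose((1, 2, 0))
all_match = check_pairwise_differences(mean_celltype, pairwise)
print(all_match)
```

The same helper could be applied to `mean_celltype_sc` and `pairwise_distances_sc` from the cells above, since it only relies on the (gene, gene, celltype) layout.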

In [31]:
#### NORMALIZE THESE PAIRWISE DISTANCES BETWEEN GENES
#calculate sum of absolute distances
abs_diff_sc = np.absolute(pairwise_distances_sc)
abs_diff_sum_sc = np.sum(abs_diff_sc, axis=(0,1))

abs_diff_sp = np.absolute(pairwise_distances_sp)
abs_diff_sum_sp = np.sum(abs_diff_sp, axis=(0,1))

In [33]:
abs_diff_sc.min()

0.0

In [34]:
abs_diff_sum_sc #across all celltypes

array([3682.1406, 2009.1934,  831.8567, 1302.214 , 2967.219 , 2672.2893,
       4150.291 , 3265.994 , 4722.6807], dtype=float32)

In [35]:
# calculate normalization factor
norm_factor_sc = mean_celltype_sc.shape[1]**2 * abs_diff_sum_sc
norm_factor_sp = mean_celltype_sc.shape[1]**2 * abs_diff_sum_sp

In [36]:
norm_factor_sc

array([16039405. ,  8752046. ,  3623567.8,  5672444. , 12925206. ,
       11640492. , 18078668. , 14226669. , 20571996. ], dtype=float32)

In [37]:
#perform normalization
norm_pairwise_distances_sc = np.divide(pairwise_distances_sc, norm_factor_sc)
norm_pairwise_distances_sp = np.divide(pairwise_distances_sp, norm_factor_sp)

In [38]:
norm_pairwise_distances_sc.shape

(66, 66, 9)

In [39]:
norm_pairwise_distances_sc[0,1,0] == pairwise_distances_sc[0,1,0]/norm_factor_sc[0]

True

In [40]:
norm_pairwise_distances_sc[1,2,5] == pairwise_distances_sc[1,2,5]/norm_factor_sc[5]

True
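Note that the normalization factor computed above divides by the squared number of genes, whereas the written formula multiplies by it. Because the metric is a ratio of two sums that carry the same constant, this should not change the scores. The sketch below checks this on toy means; the helper name `metric_with_scale` and the toy values are assumptions for illustration.

```python
import numpy as np

def metric_with_scale(means_sp, means_sc, scale):
    """Overall metric when normalized differences are multiplied by a constant `scale`."""
    def normalized(means):
        delta = means[:, :, np.newaxis] - means[:, np.newaxis, :]
        return scale * delta / np.abs(delta).sum(axis=(1, 2), keepdims=True)

    d_sp, d_sc = normalized(means_sp), normalized(means_sc)
    return 1 - np.abs(d_sp - d_sc).sum() / (2 * np.abs(d_sc).sum())

# toy per-celltype means (hypothetical): 2 celltypes x 3 genes
means_sc = np.array([[2.0, 4.0, 1.0],
                     [0.0, 4.0, 3.0]])
means_sp = np.array([[1.0, 5.0, 2.0],
                     [1.0, 2.0, 6.0]])
n_genes = means_sc.shape[1]

m_formula = metric_with_scale(means_sp, means_sc, n_genes**2)   # multiply by N_g^2
m_code = metric_with_scale(means_sp, means_sc, 1 / n_genes**2)  # divide by N_g^2
print(np.isclose(m_formula, m_code))
```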

In [41]:
##### CALCULATE OVERALL SCORE,PER-GENE SCORES, PER-CELLTYPE SCORES
overall_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=None)
overall_metric = 1 - (overall_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=None)))


In [42]:
overall_score

0.0017571226

In [43]:
overall_metric

0.5747764624851616

In [44]:
per_gene_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(1,2))
per_gene_metric = 1 - (per_gene_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(1,2))))
per_gene_metric = pd.DataFrame(per_gene_metric, index=mean_celltype_sc.columns, columns=['score']) #add back the gene labels 



In [46]:
per_gene_metric

gene,score
ALDH1A1,0.343729
CCDC102B,0.619379
CDK1,0.624260
CLDN5,0.556060
CLU,0.481670
...,...
TNNI1,0.629613
TNNT1,0.548418
TOP2A,0.567718
TPM1,0.553430


In [47]:
per_celltype_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(0,1))
per_celltype_metric = 1 - (per_celltype_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(0,1))))
per_celltype_metric = pd.DataFrame(per_celltype_metric, index=mean_celltype_sc.index, columns=['score']) #add back the celltype labels 


In [48]:
per_celltype_metric

celltype,score
Atrial cardiomyocytes,0.603222
Capillary endothelium,0.649237
Cardiac neural crest cells,0.182063
Endothelium / pericytes,0.442457
Epicardial cells,0.461399
Fibroblast-like,0.744147
Myoz2-enriched cardiomyocytes,0.651561
Smooth muscle cells,0.751999
Ventricular cardiomyocytes,0.686895


woohoo! all checks passed...looks good.

### ARCHIVE - OLD

In [None]:
overall_metric

In [None]:
per_gene_metric

In [None]:
per_gene_metric.sort_values(by = 'score')

In [None]:
per_celltype_metric

In [None]:
per_celltype_metric.sort_values(by = 'score')

Everything looks good so far...

### Finally, test the function in chunks...

In [None]:
mean_celltype_sc

In [None]:
mean_celltype_sp

In [None]:
#### CALCULATE PAIRWISE RELATIVE DISTANCES BETWEEN GENES
transposed_data_sc = mean_celltype_sc.to_numpy().T
pairwise_distances_sc = transposed_data_sc[:,:,np.newaxis] - transposed_data_sc[:,np.newaxis,:]

transposed_data_sp = mean_celltype_sp.to_numpy().T
pairwise_distances_sp = transposed_data_sp[:,:,np.newaxis] - transposed_data_sp[:,np.newaxis,:]


In [None]:
new_data_sc = mean_celltype_sc.to_numpy()
new_pairwise_distances_sc = new_data_sc[:,:,np.newaxis] - new_data_sc[:,np.newaxis,:]
new_pairwise_distances_sc.transpose((1,2,0)).shape

In [None]:
new_pairwise_distances_sc.transpose((1,2,0)).shape

In [None]:
mean_celltype_sc.loc['Atrial cardiomyocytes',:]

In [None]:
0.070622 - 0.014332

In [None]:
0.070622 - 0.134023

In [None]:
new_pairwise_distances_sc[0,:,:].shape

In [None]:
pairwise_distances_sc.shape

In [None]:
pairwise_distances_sc.transpose((0,2,1)).shape

In [None]:
pairwise_distances_sc.shape

In [None]:
mean_celltype_sc

In [None]:
0.070622 - 0.018484

In [None]:
pd.DataFrame(pairwise_distances_sc[0,:,:])

In [None]:
0.070622 - 0.014332

In [None]:
#### NORMALIZE THESE PAIRWISE DISTANCES BETWEEN GENES
#calculate sum of absolute distances
abs_diff_sc = np.absolute(pairwise_distances_sc)
abs_diff_sum_sc = np.sum(abs_diff_sc, axis=(0,1))

abs_diff_sp = np.absolute(pairwise_distances_sp)
abs_diff_sum_sp = np.sum(abs_diff_sp, axis=(0,1))

# calculate normalization factor
norm_factor_sc = mean_celltype_sc.shape[1]**2 * abs_diff_sum_sc
norm_factor_sp = mean_celltype_sc.shape[1]**2 * abs_diff_sum_sp

#perform normalization
norm_pairwise_distances_sc = np.divide(pairwise_distances_sc, norm_factor_sc)
norm_pairwise_distances_sp = np.divide(pairwise_distances_sp, norm_factor_sp)


##### CALCULATE OVERALL SCORE,PER-GENE SCORES, PER-CELLTYPE SCORES
overall_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=None)
overall_metric = 1 - (overall_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=None)))

per_gene_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(1,2))
per_gene_metric = 1 - (per_gene_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(1,2))))
per_gene_metric = pd.DataFrame(per_gene_metric, index=mean_celltype_sc.columns, columns=['score']) #add back the gene labels 

per_celltype_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(0,1))
per_celltype_metric = 1 - (per_celltype_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(0,1))))
per_celltype_metric = pd.DataFrame(per_celltype_metric, index=mean_celltype_sc.index, columns=['score']) #add back the celltype labels 


In [None]:
## outtakes --> originally in the code but not used for this new metric

# find the mean expression of each gene in the sc dataset (excluding the celltype label column)
gene_means_sc=pd.DataFrame(exp_sc.drop(columns=key).mean(axis=0))

# sort the genes so that they're in alphabetical order in the gene_means_sc df
gene_means_sc=gene_means_sc.loc[gene_means_sc.index.sort_values(),:]

# find the mean expression of each gene in the sp dataset (excluding the celltype label column)
gene_means_sp=pd.DataFrame(exp_sp.drop(columns=key).mean(axis=0))

# sort the genes so that they're in alphabetical order in the gene_means_sp df
gene_means_sp=gene_means_sp.loc[gene_means_sp.index.sort_values(),:]

In [None]:
## outtakes -- helpful for Asli potentially
transposed_data_sc = mean_celltype_sc.to_numpy().T
pairwise_distances_sc = transposed_data_sc[:,:,np.newaxis] - transposed_data_sc[:,np.newaxis,:]

transposed_data_sp = mean_celltype_sp.to_numpy().T
pairwise_distances_sp = transposed_data_sp[:,:,np.newaxis] - transposed_data_sp[:,np.newaxis,:]
    