# Relative expression across cell types

Let the matrix of mean expression values for each gene $g$ in each cell type $ct$ for a given modality $m$ be defined as:
$$<X>_{ct,g}^m = \frac{1}{\lvert C_{ct} \rvert} \sum\limits_{c\in C_{ct}} X_{c,g}^{m}$$ 

where $C_{ct}$ is the set of all cells in the cell-type cluster $ct$ and $X_{c,g}^{m}$ is the matrix of normalized expression values for each gene $g$ in each cell $c$. 
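
As a minimal sketch of this averaging step, the following NumPy snippet computes the per-cell-type mean for a small toy matrix. The data, the labels and the helper name `mean_expression_per_celltype` are hypothetical and only serve as illustration; the notebook's own function uses a pandas `groupby` instead.

```python
import numpy as np

# Toy data (hypothetical): 5 cells x 2 genes, with one cell-type label per cell.
X = np.array([[1.0, 0.0],
              [3.0, 2.0],
              [0.0, 4.0],
              [2.0, 6.0],
              [4.0, 8.0]])
labels = np.array(["A", "A", "B", "B", "B"])

def mean_expression_per_celltype(X, labels):
    """Return (sorted cell types, mean matrix of shape (n_celltypes, n_genes))."""
    celltypes = np.unique(labels)
    # Average the rows (cells) that belong to each cell type.
    means = np.vstack([X[labels == ct].mean(axis=0) for ct in celltypes])
    return celltypes, means

celltypes, mean_expr = mean_expression_per_celltype(X, labels)
print(celltypes)   # ['A' 'B']
print(mean_expr)   # rows: A -> [2. 1.], B -> [2. 6.]
```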

Define the difference in mean expression between two cell types $(c_1,c_2)$ for a given gene $g$ and a given modality $m$ as
$$\delta_{c_1,c_2}^{m,g}= <X>_{c_1,g}^m - <X>_{c_2,g}^m $$
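
All pairwise differences for all genes can be obtained in one step with NumPy broadcasting. The sketch below assumes a hypothetical mean-expression matrix with cell types as rows and genes as columns; the function name is illustrative.

```python
import numpy as np

# Hypothetical mean-expression matrix: 3 cell types (rows) x 2 genes (columns).
mean_expr = np.array([[2.0, 1.0],
                      [2.0, 6.0],
                      [5.0, 1.0]])

def pairwise_differences(mean_expr):
    """Return delta with delta[g, c1, c2] = mean_expr[c1, g] - mean_expr[c2, g]."""
    by_gene = mean_expr.T                      # shape (n_genes, n_celltypes)
    return by_gene[:, :, np.newaxis] - by_gene[:, np.newaxis, :]

delta = pairwise_differences(mean_expr)
print(delta.shape)      # (2, 3, 3)
print(delta[1, 1, 0])   # gene 1: cell type 1 minus cell type 0 = 6 - 1 = 5.0
```

The result is antisymmetric in the two cell-type axes, so the signed differences of each gene sum to zero.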

Dividing each difference by the sum of the absolute values of all pairwise differences between cell types for the same gene normalizes the differences so that they are comparable across modalities. To keep the values comparable across datasets with different numbers of cell types, we scale the result by a factor of $N_{c}^2$, where $N_c$ is the number of cell types shared between the two modalities. We define the normalized pairwise difference in mean expression between two cell types $c_1,c_2$ for a given gene $g$ and a given modality $m$ as
$$\delta_{c_1,c_2}^{'m,g}= \frac{N_{c}^2 \delta_{c_1,c_2}^{m,g}}{\sum\limits_{c_1}\sum\limits_{c_2}|\delta_{c_1,c_2}^{m,g}|} $$
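
A minimal sketch of this normalization, following the formula as written (one normalization constant per gene). The toy values and the handling of genes without any difference between cell types are assumptions made for illustration.

```python
import numpy as np

def normalize_differences(delta):
    """Scale delta[g, c1, c2] so that, for each gene, sum |delta'| equals N_c**2."""
    n_celltypes = delta.shape[1]
    abs_sum = np.abs(delta).sum(axis=(1, 2), keepdims=True)   # one value per gene
    # Assumption: a gene with identical means in all cell types stays at zero.
    safe = np.where(abs_sum == 0, 1.0, abs_sum)
    return n_celltypes ** 2 * delta / safe

# Hypothetical means, arranged as (n_genes, n_celltypes).
by_gene = np.array([[2.0, 2.0, 5.0],
                    [1.0, 6.0, 1.0]])
delta = by_gene[:, :, np.newaxis] - by_gene[:, np.newaxis, :]
delta_norm = normalize_differences(delta)
print(np.abs(delta_norm).sum(axis=(1, 2)))   # N_c**2 = 9 for every gene
```

Because each gene is divided by its own sum of absolute differences, rescaling a gene's expression by a constant factor leaves the normalized values unchanged.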

To compare the normalized pairwise differences in mean cell-type expression between the spatial ($sp$) and single-cell ($sc$) modalities, we define $\triangle$ as
$$\triangle = \sum\limits_{g}\sum\limits_{c_1}\sum\limits_{c_2} |\delta_{c_1,c_2}^{'sp,g}- \delta_{c_1,c_2}^{'sc,g}|$$

Define the final metric $M$, bounded above by 1, which represents perfect similarity of relative gene expression between modalities, and below by 0, which represents perfect dissimilarity of relative cell-type expression between the two modalities: for every gene, the expression values of the two cell types in each pair are swapped.

$$M=1-\frac{\triangle}{2\sum\limits_{g,c_1,c_2}|\delta_{c_1,c_2}^{'sc,g}|}$$
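
The two bounds can be checked on toy data. The sketch below assumes the per-gene normalization defined above and a denominator that sums absolute values, as the notebook's code does (the signed differences would cancel to zero). The matrices and function names are hypothetical.

```python
import numpy as np

def normalized_differences(mean_expr):
    """mean_expr: (n_celltypes, n_genes) -> delta'[g, c1, c2], normalized per gene."""
    by_gene = mean_expr.T
    delta = by_gene[:, :, np.newaxis] - by_gene[:, np.newaxis, :]
    abs_sum = np.abs(delta).sum(axis=(1, 2), keepdims=True)
    return mean_expr.shape[0] ** 2 * delta / abs_sum

def overall_metric(mean_sp, mean_sc):
    d_sp = normalized_differences(mean_sp)
    d_sc = normalized_differences(mean_sc)
    triangle = np.abs(d_sp - d_sc).sum()
    return 1 - triangle / (2 * np.abs(d_sc).sum())

# Hypothetical scRNAseq means: 3 cell types x 2 genes.
mean_sc = np.array([[2.0, 1.0],
                    [2.5, 6.0],
                    [5.0, 1.5]])

# Same relative pattern at a different scale: expected to give M close to 1.
m_same = overall_metric(3.0 * mean_sc, mean_sc)
# Every pairwise difference has its sign flipped: expected to give M close to 0.
m_swapped = overall_metric(mean_sc.max(axis=0) - mean_sc, mean_sc)
print(m_same, m_swapped)
```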

Further compute the metric on a per-cell-type and a per-gene basis:

1. per cell type:

$$\triangle_{c_1}= \sum\limits_{g}\sum\limits_{c_2}|\delta_{c_1,c_2}^{'sp,g}- \delta_{c_1,c_2}^{'sc,g}|$$
$$M_{c_1}=1-\frac{\triangle_{c_1}}{2\sum\limits_{g,c_2}|\delta_{c_1,c_2}^{'sc,g}|}$$

2. per gene:

$$\triangle_{g}= \sum\limits_{c_1}\sum\limits_{c_2}|\delta_{c_1,c_2}^{'sp,g}- \delta_{c_1,c_2}^{'sc,g}|$$
$$M_{g}=1-\frac{\triangle_{g}}{2\sum\limits_{c_1,c_2}|\delta_{c_1,c_2}^{'sc,g}|}$$
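
The per-gene and per-cell-type variants differ from the overall metric only in the axes that are summed. The sketch below assumes an array layout `delta'[g, c1, c2]`, per-gene normalization, and absolute values inside the denominator sums; data and names are hypothetical.

```python
import numpy as np

def normalized_differences(mean_expr):
    """mean_expr: (n_celltypes, n_genes) -> delta'[g, c1, c2], normalized per gene."""
    by_gene = mean_expr.T
    delta = by_gene[:, :, np.newaxis] - by_gene[:, np.newaxis, :]
    abs_sum = np.abs(delta).sum(axis=(1, 2), keepdims=True)
    return mean_expr.shape[0] ** 2 * delta / abs_sum

def per_gene_and_per_celltype(mean_sp, mean_sc):
    d_sp = normalized_differences(mean_sp)
    d_sc = normalized_differences(mean_sc)
    diff = np.abs(d_sp - d_sc)
    ref = np.abs(d_sc)
    # Per gene: sum over both cell-type axes.
    per_gene = 1 - diff.sum(axis=(1, 2)) / (2 * ref.sum(axis=(1, 2)))
    # Per cell type c1: sum over genes and over the partner cell type c2.
    per_celltype = 1 - diff.sum(axis=(0, 2)) / (2 * ref.sum(axis=(0, 2)))
    return per_gene, per_celltype

mean_sc = np.array([[2.0, 1.0],
                    [2.5, 6.0],
                    [5.0, 1.5]])
# Hypothetical spatial means: gene 0 agrees, gene 1 has all differences flipped.
mean_sp = mean_sc.copy()
mean_sp[:, 1] = mean_sc[:, 1].max() - mean_sc[:, 1]

per_gene, per_celltype = per_gene_and_per_celltype(mean_sp, mean_sc)
print(per_gene)       # close to [1, 0]
print(per_celltype)   # one score per cell type
```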



In [1]:
import scanpy as sc
import numpy as np
import pandas as pd
from anndata import AnnData
from scipy.sparse import issparse
import math

In [158]:
adata_sc = sc.read_h5ad("/mnt/storage/adata_sc.h5ad")
adata_sc.layers["raw"] = adata_sc.X.copy()
sc.pp.normalize_total(adata_sc)
adata_sc.layers["norm"] = adata_sc.X.copy()
sc.pp.log1p(adata_sc)
adata_sc.layers["lognorm"] = adata_sc.X.copy()
adata_sp = sc.read_h5ad("/mnt/storage/adata_sp.h5ad")
adata_sp.layers["raw"] = adata_sp.X.copy()
sc.pp.normalize_total(adata_sp)
adata_sp.layers["norm"] = adata_sp.X.copy()
sc.pp.log1p(adata_sp)
adata_sp.layers["lognorm"] = adata_sp.X.copy()



In [308]:
def relative_celltype_expression(adata_sp: AnnData, adata_sc: AnnData, key: str = 'celltype', layer: str = 'lognorm'):
    """Calculate the similarity of relative gene expression across cell types between spatial and scRNAseq data.

    Parameters
    ----------
    adata_sp : AnnData
        annotated ``AnnData`` object with counts from spatial data
    adata_sc : AnnData
        annotated ``AnnData`` object with counts from scRNAseq data
    key: str (default: 'celltype')
        .obs column of ``AnnData`` that contains celltype information
    layer: str (default: 'lognorm')
        layer of ``AnnData`` to use to compute the metric

    Returns
    -------
    overall_metric: float
        similarity of relative gene expression across all genes and celltypes, b/t the scRNAseq and spatial data
    per_gene_metric: pd.DataFrame
        similarity of relative gene expression per gene across all celltypes, b/t the scRNAseq and spatial data
    per_celltype_metric: pd.DataFrame
        similarity of relative gene expression per celltype across all genes, b/t the scRNAseq and spatial data

    """
    ### SET UP
    # set .X of each adata to the requested layer (log-normalized counts by default);
    # note that this overwrites .X on the objects passed in
    adata_sp.X = adata_sp.layers[layer]
    adata_sc.X = adata_sc.layers[layer]
    
    # take the intersection of genes in adata_sp and adata_sc, as a list
    intersect = list(set(adata_sp.var_names).intersection(set(adata_sc.var_names)))
    
    # subset adata_sc and adata_sp to only include genes in the intersection of adata_sp and adata_sc 
    adata_sc=adata_sc[:,intersect]
    adata_sp=adata_sp[:,intersect]
    
    # sparse matrix support
    for a in [adata_sc, adata_sp]:
        if issparse(a.X):
            a.X = a.X.toarray()
            
    # find the unique celltypes in adata_sc that are also in adata_sp
    unique_celltypes=adata_sc.obs.loc[adata_sc.obs[key].isin(adata_sp.obs[key]),key].unique()
    
    
    
    #### FIND MEAN GENE EXPRESSION PER CELL TYPE FOR EACH MODALITY
    # get the adata_sc cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sc=pd.DataFrame(adata_sc.X,columns=adata_sc.var.index)
    
    # get the adata_sp cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sp=pd.DataFrame(adata_sp.X,columns=adata_sp.var.index)
    
    # add "celltype" label column to exp_sc & exp_sp cell x gene matrices 
    exp_sc[key]=list(adata_sc.obs[key])
    exp_sp[key]=list(adata_sp.obs[key])
    
    # delete all cells from the exp matrices if they aren't in the set of intersecting celltypes b/t sc & sp data
    exp_sc=exp_sc.loc[exp_sc[key].isin(unique_celltypes),:]
    exp_sp=exp_sp.loc[exp_sp[key].isin(unique_celltypes),:]
    
    # find the mean expression for each gene for each celltype in sc and sp data
    mean_celltype_sp=exp_sp.groupby(key).mean()
    mean_celltype_sc=exp_sc.groupby(key).mean()
    
    # sort genes in alphabetical order 
    mean_celltype_sc=mean_celltype_sc.loc[:,mean_celltype_sc.columns.sort_values()]
    mean_celltype_sp=mean_celltype_sp.loc[:,mean_celltype_sp.columns.sort_values()]
    
    
    #### CALCULATE PAIRWISE RELATIVE DISTANCES BETWEEN CELL TYPES
    mean_celltype_sc_np = mean_celltype_sc.T.to_numpy() # shape (num_genes, num_celltypes)
    pairwise_distances_sc = mean_celltype_sc_np[:,:,np.newaxis] - mean_celltype_sc_np[:,np.newaxis,:]
    pairwise_distances_sc = pairwise_distances_sc.transpose((1,2,0)) #results in np.array of dimensions (num_celltypes, num_celltypes, num_genes) 
       
    mean_celltype_sp_np = mean_celltype_sp.T.to_numpy() # shape (num_genes, num_celltypes)
    pairwise_distances_sp = mean_celltype_sp_np[:,:,np.newaxis] - mean_celltype_sp_np[:,np.newaxis,:]
    pairwise_distances_sp = pairwise_distances_sp.transpose((1,2,0)) #results in np.array of dimensions (num_celltypes, num_celltypes, num_genes) 
    
    #### NORMALIZE THESE PAIRWISE DISTANCES BETWEEN CELL TYPES
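    #calculate sum of absolute distances
    # after .T the arrays have shape (num_genes, num_celltypes, num_celltypes);
    # axis=(0,1) sums over genes and one celltype axis, leaving one value per celltype
    abs_diff_sc = np.absolute(pairwise_distances_sc.T)
    abs_diff_sum_sc = np.sum(abs_diff_sc, axis=(0,1))
    
    abs_diff_sp = np.absolute(pairwise_distances_sp.T)
    abs_diff_sum_sp = np.sum(abs_diff_sp, axis=(0,1))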
    
    norm_factor_sc = mean_celltype_sc.T.shape[1]**2 * abs_diff_sum_sc
    norm_factor_sp = mean_celltype_sp.T.shape[1]**2 * abs_diff_sum_sp
    
    #perform normalization
    norm_pairwise_distances_sc = np.divide(pairwise_distances_sc.T, norm_factor_sc)
    norm_pairwise_distances_sp = np.divide(pairwise_distances_sp.T, norm_factor_sp)
    
    ##### CALCULATE OVERALL SCORE,PER-GENE SCORES, PER-CELLTYPE SCORES
    overall_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=None)
    overall_metric = 1 - (overall_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=None)))
    
    per_gene_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(1,2))
    per_gene_metric = 1 - (per_gene_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(1,2))))
    per_gene_metric = pd.DataFrame(per_gene_metric, index=mean_celltype_sc.columns, columns=['score']) #add back the gene labels 


    
    per_celltype_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(0,1))
    per_celltype_metric = 1 - (per_celltype_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(0,1))))
    per_celltype_metric = pd.DataFrame(per_celltype_metric, index=mean_celltype_sc.index, columns=['score']) #add back the celltype labels 
    
    return overall_metric, per_gene_metric, per_celltype_metric
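
The array shapes in the normalization step can be traced on random toy data. This is a reading of the code above, not a statement of the author's intent: with the axes used in the function, the sum of absolute differences appears to yield one value per cell type, whereas the formula in the text defines one value per gene. One possible consequence is that per-gene scores are not confined to the interval from 0 to 1. The array sizes below are hypothetical.

```python
import numpy as np

n_celltypes, n_genes = 3, 4
rng = np.random.default_rng(0)
mean_expr = rng.random((n_celltypes, n_genes))     # stand-in for mean_celltype_sc

by_gene = mean_expr.T                                               # (genes, celltypes)
pairwise = by_gene[:, :, np.newaxis] - by_gene[:, np.newaxis, :]    # (genes, ct, ct)
pairwise = pairwise.transpose((1, 2, 0))                            # (ct, ct, genes)

abs_diff = np.absolute(pairwise.T)                                  # (genes, ct, ct)
sum_as_in_function = abs_diff.sum(axis=(0, 1))     # sums over genes and one ct axis
sum_as_in_formula = abs_diff.sum(axis=(1, 2))      # sums over both ct axes

print(abs_diff.shape)              # (4, 3, 3)
print(sum_as_in_function.shape)    # (3,)  one value per cell type
print(sum_as_in_formula.shape)     # (4,)  one value per gene
```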
    

In [309]:
overall_metric, per_gene_metric, per_celltype_metric = relative_celltype_expression(adata_sp, adata_sc,'celltype', 'lognorm')


In [310]:
overall_metric


0.58707479695175

In [311]:
per_gene_metric

| gene | score |
|---|---|
| ALDH1A1 | -2.452862 |
| CCDC102B | 0.759614 |
| CDK1 | 0.538605 |
| CLDN5 | 0.553949 |
| CLU | 0.505288 |
| ... | ... |
| TNNI1 | 0.636490 |
| TNNT1 | 0.530408 |
| TOP2A | 0.182024 |
| TPM1 | 0.574190 |




In [312]:
per_gene_metric.sort_values(by = 'score')

| gene | score |
|---|---|
| STMN2 | -23.574125 |
| PCSK1N | -18.955132 |
| ISL1 | -5.071395 |
| ALDH1A1 | -2.452862 |
| MYRF | -2.364538 |
| ... | ... |
| MYL3 | 0.798107 |
| COL1A1 | 0.808010 |
| NAV1 | 0.832547 |
| LDB3 | 0.863535 |


In [313]:
per_celltype_metric

| celltype | score |
|---|---|
| Atrial cardiomyocytes | 0.627987 |
| Capillary endothelium | 0.576846 |
| Cardiac neural crest cells | 0.451244 |
| Endothelium / pericytes | 0.487327 |
| Epicardial cells | 0.50208 |
| Fibroblast-like | 0.641672 |
| Myoz2-enriched cardiomyocytes | 0.64189 |
| Smooth muscle cells | 0.683424 |
| Ventricular cardiomyocytes | 0.671203 |


In [296]:
    ### SET UP
    key = 'celltype'  # .obs column that holds the celltype labels (used further down in this cell)
    # set the .X layer of each of the adatas to be log-normalized counts
    adata_sp.X = adata_sp.layers['lognorm']
    adata_sc.X = adata_sc.layers['lognorm']
    
    # take the intersection of genes in adata_sp and adata_sc, as a list
    intersect = list(set(adata_sp.var_names).intersection(set(adata_sc.var_names)))
    
    # subset adata_sc and adata_sp to only include genes in the intersection of adata_sp and adata_sc 
    adata_sc=adata_sc[:,intersect]
    adata_sp=adata_sp[:,intersect]
    
    # sparse matrix support
    for a in [adata_sc, adata_sp]:
        if issparse(a.X):
            a.X = a.X.toarray()
            
    # find the unique celltypes in adata_sc that are also in adata_sp
    unique_celltypes=adata_sc.obs.loc[adata_sc.obs[key].isin(adata_sp.obs[key]),key].unique()
    
    
    
    #### FIND MEAN GENE EXPRESSION PER CELL TYPE FOR EACH MODALITY
    # get the adata_sc cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sc=pd.DataFrame(adata_sc.X,columns=adata_sc.var.index)
    
    # get the adata_sp cell x gene matrix as a pandas dataframe (w gene names as column names)
    exp_sp=pd.DataFrame(adata_sp.X,columns=adata_sp.var.index)
    
    # add "celltype" label column to exp_sc & exp_sp cell x gene matrices 
    exp_sc[key]=list(adata_sc.obs[key])
    exp_sp[key]=list(adata_sp.obs[key])
    
    # delete all cells from the exp matrices if they aren't in the set of intersecting celltypes b/t sc & sp data
    exp_sc=exp_sc.loc[exp_sc[key].isin(unique_celltypes),:]
    exp_sp=exp_sp.loc[exp_sp[key].isin(unique_celltypes),:]
    
    # find the mean expression for each gene for each celltype in sc and sp data
    mean_celltype_sp=exp_sp.groupby(key).mean()
    mean_celltype_sc=exp_sc.groupby(key).mean()
    
    # sort genes in alphabetical order 
    mean_celltype_sc=mean_celltype_sc.loc[:,mean_celltype_sc.columns.sort_values()]
    mean_celltype_sp=mean_celltype_sp.loc[:,mean_celltype_sp.columns.sort_values()]
    
    
    #### CALCULATE PAIRWISE RELATIVE DISTANCES BETWEEN CELL TYPES
    mean_celltype_sc_np = mean_celltype_sc.T.to_numpy()
    pairwise_distances_sc = mean_celltype_sc_np[:,:,np.newaxis] - mean_celltype_sc_np[:,np.newaxis,:]
    pairwise_distances_sc = pairwise_distances_sc.transpose((1,2,0)) #results in np.array of dimensions (num_celltypes, num_celltypes, num_genes) 
       
    mean_celltype_sp_np = mean_celltype_sp.T.to_numpy()
    pairwise_distances_sp = mean_celltype_sp_np[:,:,np.newaxis] - mean_celltype_sp_np[:,np.newaxis,:]
    pairwise_distances_sp = pairwise_distances_sp.transpose((1,2,0)) #results in np.array of dimensions (num_celltypes, num_celltypes, num_genes) 
    

In [304]:
    #### normalize these pairwise distances between cell types
    #calculate sum of absolute distances
    abs_diff_sc = np.absolute(pairwise_distances_sc.T)
    abs_diff_sum_sc = np.sum(abs_diff_sc, axis=(0,1))
    
    abs_diff_sp = np.absolute(pairwise_distances_sp.T)
    abs_diff_sum_sp = np.sum(abs_diff_sp, axis=(0,1))
    

In [305]:
    # calculate normalization factor
    norm_factor_sc = mean_celltype_sc.T.shape[1]**2 * abs_diff_sum_sc
    norm_factor_sp = mean_celltype_sp.T.shape[1]**2 * abs_diff_sum_sp
    
    #perform normalization
    norm_pairwise_distances_sc = np.divide(pairwise_distances_sc.T, norm_factor_sc)
    norm_pairwise_distances_sp = np.divide(pairwise_distances_sp.T, norm_factor_sp)

In [299]:
    ##### CALCULATE OVERALL SCORE,PER-GENE SCORES, PER-CELLTYPE SCORES
    overall_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=None)
    overall_metric = 1 - (overall_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=None)))
    
    per_gene_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(1,2))
    per_gene_metric = 1 - (per_gene_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(1,2))))
    per_gene_metric = pd.DataFrame(per_gene_metric, index=mean_celltype_sc.columns, columns=['score']) #add back the gene labels 
    
    overall_metric

0.58707479695175

In [300]:
    per_celltype_score = np.sum(np.absolute(norm_pairwise_distances_sp - norm_pairwise_distances_sc), axis=(0,1))
    per_celltype_metric = 1 - (per_celltype_score/(2 * np.sum(np.absolute(norm_pairwise_distances_sc), axis=(0,1))))
    per_celltype_metric = pd.DataFrame(per_celltype_metric, index=mean_celltype_sc.index, columns=['score']) #add back the celltype labels 
   

In [301]:
per_gene_metric

| gene | score |
|---|---|
| ALDH1A1 | -2.452862 |
| CCDC102B | 0.759614 |
| CDK1 | 0.538605 |
| CLDN5 | 0.553949 |
| CLU | 0.505288 |
| ... | ... |
| TNNI1 | 0.636490 |
| TNNT1 | 0.530408 |
| TOP2A | 0.182024 |
| TPM1 | 0.574190 |


In [302]:
per_gene_metric.sort_values(by = 'score')

| gene | score |
|---|---|
| STMN2 | -23.574125 |
| PCSK1N | -18.955132 |
| ISL1 | -5.071395 |
| ALDH1A1 | -2.452862 |
| MYRF | -2.364538 |
| ... | ... |
| MYL3 | 0.798107 |
| COL1A1 | 0.808010 |
| NAV1 | 0.832547 |
| LDB3 | 0.863535 |


In [303]:
per_celltype_metric

| celltype | score |
|---|---|
| Atrial cardiomyocytes | 0.627987 |
| Capillary endothelium | 0.576846 |
| Cardiac neural crest cells | 0.451244 |
| Endothelium / pericytes | 0.487327 |
| Epicardial cells | 0.50208 |
| Fibroblast-like | 0.641672 |
| Myoz2-enriched cardiomyocytes | 0.64189 |
| Smooth muscle cells | 0.683424 |
| Ventricular cardiomyocytes | 0.671203 |
