# CellphoneDB scoring protocol
**B-cell signaling example**

In [None]:
%%capture
pip install --force-reinstall "git+https://github.com/ventolab/CellphoneDB.git@scoring"

### Load scanpy object

In [1]:
import scanpy as sc
adata = sc.read('/Users/rp23/Downloads/kevin_bcells_stroma/adata_subset_Bcells_stroma.h5ad')
adata.shape

(36445, 33712)

In [2]:
import os
# The default version of CellphoneDB data is the latest one, but you can change it to a previous version 
# at any point in this notebook (by re-setting the value of cpdb_version variable). 
# Please note that the format of the database from version v4.1.0 is incompatible with that of previous 
# versions, hence the lowest version number you may choose in this notebook is v4.1.0
cpdb_version = "v4.1.9"
# cpdb_dir holds the *_input.csv and cellphonedb.zip files that you download from https://github.com/ventolab/cellphonedb-data
# At the very least, replace the user id in the path below (rp23) with your own
cpdb_dir = os.path.join("/Users/rp23/.cpdb/releases", cpdb_version)
cpdb_file_path = os.path.join(cpdb_dir, "cellphonedb.zip")

### Downsample clusters
The protocol is not memory-optimized, so you may want to downsample each cluster (or request more memory).

In [3]:
# Name of column containing the cluster name
cluster_id_col = 'cell.labels'
# Fraction of cells to keep for each cluster
# Value between 0 and 1 (1 keeps every cell)
downsamp_percentage = 1
# Downsample each cluster to the specified fraction
adata_obs = adata.obs.groupby(cluster_id_col).sample(frac = downsamp_percentage)
adata = adata[list(adata_obs.index)]
adata

View of AnnData object with n_obs × n_vars = 36445 × 33712
    obs: 'cell.labels', 'doublets', 'fetal.ids', 'gender', 'is_doublet', 'is_doublet_poptrim', 'is_doublet_wolock', 'lanes', 'nGene', 'nUMI', 'orig.ident', 'percent.mito', 'processing.type', 'scrublet_cluster_score', 'scrublet_score', 'sequencing.type', 'sort.ids', 'april_cell.labels', 'cell.labels_20200708', 'cell.labels_20200713', 'cell.labels_20200718', 'nk_meta', 'mito.threshold'
    var: 'gene_ids-1', 'feature_types-1'
    obsm: 'X_orig_pca', 'X_pca', 'X_umap'
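
For intuition, per-cluster downsampling draws the same fraction of cells independently from every cluster. The sketch below reproduces that idea with the standard library only; the cell names, cluster labels and the `downsample_per_cluster` helper are hypothetical and not part of CellphoneDB. It fixes a seed so the draw is repeatable (the pandas `sample` call above accepts an optional `random_state` argument for the same purpose).

```python
import random
from collections import defaultdict


def downsample_per_cluster(labels, frac, seed=0):
    """Keep round(frac * cluster_size) cells from each cluster (hypothetical helper)."""
    if not 0 <= frac <= 1:
        raise ValueError("frac must be between 0 and 1")
    rng = random.Random(seed)
    # Group cell names by their cluster label
    by_cluster = defaultdict(list)
    for cell, cluster in labels.items():
        by_cluster[cluster].append(cell)
    # Sample without replacement inside each cluster
    kept = []
    for cells in by_cluster.values():
        kept.extend(rng.sample(cells, round(frac * len(cells))))
    return kept


# Toy data: 8 cells in one cluster, 4 in another
labels = {f"cell{i}": ("B cell" if i < 8 else "osteoclast") for i in range(12)}
kept = downsample_per_cluster(labels, frac=0.5)
print(len(kept), "cells kept out of", len(labels))
```

Because the fraction is applied within each cluster, small clusters shrink in proportion rather than disappearing.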

### Convert sparse normalized matrix to dense matrix
TODO: This should be optimized to use the sparse matrix rather than dense.

In [4]:
import pandas as pd
# Transpose the matrix so that genes are in rows and cells are in columns
norm_matrix = pd.DataFrame(adata.X.todense(),
                           columns = list(adata.var.index),
                           index = list(adata.obs.index)).transpose()
metadata = adata.obs
# Remove scanpy object to save some memory
del adata
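
To judge whether downsampling is needed, it can help to estimate the size of the dense matrix before building it. The sketch below is plain arithmetic on the shape reported earlier in this notebook; the `dense_size_gib` helper is hypothetical, and the bytes per value depend on the dtype of `adata.X`, which is not shown here, so both common float widths are printed.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value):
    """Memory needed for a dense n_rows x n_cols array, in GiB (hypothetical helper)."""
    return n_rows * n_cols * bytes_per_value / 2**30


# Shape reported by adata.shape above
n_cells, n_genes = 36445, 33712
for dtype_name, n_bytes in (("float32", 4), ("float64", 8)):
    size = dense_size_gib(n_cells, n_genes, n_bytes)
    print(f"{dtype_name}: {size:.2f} GiB")
```

This gives about 4.6 GiB at 4 bytes per value and twice that at 8 bytes, before counting any intermediate copies.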

### Apply functions to rank interactions

##### **Step 1**: Filter out genes expressed in fewer than `min_pct_cell` of the cells in a given cluster.

In [5]:
from cellphonedb.utils import scoring_utils
import time
t0 = time.time()
cpdb_f = scoring_utils.filter_genes_cluster(matrix = norm_matrix,
                              metadata = metadata,
                              min_pct_cell = 0.1,
                              cell_column_name = cluster_id_col)
print(time.time() - t0, "seconds wall time")

16.240373134613037 seconds wall time
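
As an illustration of the filtering criterion, the sketch below computes, for one gene, the fraction of cells in each cluster that express it and compares that fraction with the threshold. It is not the `scoring_utils` implementation: the toy data and the `expressed_fraction` helper are hypothetical, and it assumes that "expressed" means a value above zero and that a cluster passes when its fraction is at least `min_pct_cell`.

```python
from collections import defaultdict


def expressed_fraction(values, labels):
    """Per-cluster fraction of cells with a value above zero, for one gene."""
    n_cells = defaultdict(int)
    n_expressing = defaultdict(int)
    for cell, cluster in labels.items():
        n_cells[cluster] += 1
        # Cells missing from `values` are treated as zero expression
        if values.get(cell, 0.0) > 0:
            n_expressing[cluster] += 1
    return {c: n_expressing[c] / n_cells[c] for c in n_cells}


# Toy data: cluster A has 4 cells, cluster B has 2
labels = {"c1": "A", "c2": "A", "c3": "A", "c4": "A", "c5": "B", "c6": "B"}
gene_x = {"c1": 2.0, "c5": 1.0, "c6": 3.0}

min_pct_cell = 0.5
fractions = expressed_fraction(gene_x, labels)
# Assumption: a cluster passes when the fraction is >= the threshold
passing = {c for c, f in fractions.items() if f >= min_pct_cell}
print(fractions, passing)
```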


##### **Step 2**: Calculate each gene's mean expression per cluster.

In [6]:
cpdb_fm = scoring_utils.mean_expression_cluster(matrix = cpdb_f,
                                  metadata = metadata,
                                  cell_column_name = cluster_id_col)
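
The per-cluster mean is a plain group-and-average over cells. A minimal sketch for a single gene, using hypothetical toy data and a hypothetical `mean_per_cluster` helper rather than the library function:

```python
from collections import defaultdict
from statistics import fmean


def mean_per_cluster(values, labels):
    """Average one gene's expression over the cells of each cluster."""
    grouped = defaultdict(list)
    for cell, cluster in labels.items():
        # Cells missing from `values` count as zero expression
        grouped[cluster].append(values.get(cell, 0.0))
    return {cluster: fmean(vals) for cluster, vals in grouped.items()}


labels = {"c1": "A", "c2": "A", "c3": "A", "c4": "A", "c5": "B", "c6": "B"}
gene_x = {"c1": 2.0, "c5": 1.0, "c6": 3.0}
means = mean_per_cluster(gene_x, labels)
print(means)
```

Zeros are included in the average, so a gene expressed strongly in only a few cells of a cluster still gets a low cluster mean.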

##### **Step 3**: Calculate the geometric mean expression per heteromer.

In [7]:
cpdb_fmsh = scoring_utils.heteromer_geometric_expression(matrix = cpdb_fm,
                                                         cpdb_file_path = cpdb_file_path)

(33712, 28)
(1337, 28)
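
For a heteromer with n subunits, the geometric mean is the n-th root of the product of the subunit means, so it drops to zero as soon as any subunit is not expressed. The sketch below shows that behaviour on made-up numbers; the subunit names and the `geometric_mean` helper are hypothetical, and how `scoring_utils` handles missing subunits is not shown in this notebook.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n non-negative values."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    return math.prod(values) ** (1 / len(values))


# Mean expression of two hypothetical subunits in one cluster
subunit_means = {"SUBUNIT_1": 4.0, "SUBUNIT_2": 1.0}
complex_expression = geometric_mean(subunit_means.values())
print(complex_expression)

# A missing subunit silences the whole complex
print(geometric_mean([4.0, 0.0]))
```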


##### **Step 4**: Scale each gene's mean expression across clusters.

In [8]:
cpdb_fms = scoring_utils.scale_expression(cpdb_fmsh,
                            upper_range = 10)
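
One plausible reading of this step is a min-max rescaling of each gene's (or complex's) row so that its values span 0 to `upper_range` across clusters. The sketch below implements that reading on a toy row; whether `scale_expression` uses exactly this formula is an assumption, as is the choice to map a constant row to zeros. The `min_max_scale` helper is hypothetical.

```python
def min_max_scale(values, upper_range=10):
    """Rescale a row of per-cluster means to the interval [0, upper_range]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a row with no variation carries no signal
        return [0.0 for _ in values]
    return [upper_range * (v - lo) / (hi - lo) for v in values]


# Mean expression of one gene in three clusters
row = [0.0, 1.0, 4.0]
scaled_row = min_max_scale(row, upper_range=10)
print(scaled_row)
```

Scaling per row puts every gene on the same 0 to `upper_range` footing, so highly expressed genes do not dominate the product computed in the next step.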

##### **Step 5**: Calculate the ligand-receptor score.

In [9]:
import time
t0 = time.time()
# Score on the scaled matrix from Step 4 (cpdb_fms), not the unscaled cpdb_fmsh
cpdb_scoring = scoring_utils.score_product(matrix = cpdb_fms, 
                                           cpdb_file_path = cpdb_file_path,
                                           threads = 4)
print(time.time() - t0, "seconds wall time")
# 251.42471504211426 seconds wall time - single-threaded

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 406/406 [01:36<00:00,  4.19it/s]


107.52942204475403  got lr_outer_longs
107.83731889724731 seconds wall time
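
As the function name suggests, the score for an interaction in a given pair of clusters can be thought of as the product of the two partners' scaled expression, one value taken from each cluster. The sketch below illustrates that idea only: the gene names, values and the `pair_scores` helper are hypothetical, and any further normalisation applied by `score_product` is not reproduced.

```python
from itertools import product

# Scaled expression (0 to 10) of a hypothetical ligand and receptor per cluster
scaled = {
    "LIGAND_1": {"A": 10.0, "B": 2.0},
    "RECEPTOR_1": {"A": 0.0, "B": 5.0},
}


def pair_scores(ligand, receptor, scaled):
    """Score every ordered (sender, receiver) cluster pair for one interaction."""
    clusters = sorted(scaled[ligand])
    return {
        f"{sender}|{receiver}": scaled[ligand][sender] * scaled[receptor][receiver]
        for sender, receiver in product(clusters, repeat=2)
    }


scores = pair_scores("LIGAND_1", "RECEPTOR_1", scaled)
for pair, score in scores.items():
    print(pair, score)
```

A product is high only when both partners are high in their respective clusters, and it is zero whenever either partner is absent.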


### List all cell-pair comparisons
Results are stored as a dictionary of dataframes; each dataframe is named after the pair of cell types analyzed for cell-cell communication. \
Note that you will find `cell_A|cell_B` but not `cell_B|cell_A`: each dataframe also contains the partners swapped, so interactions can be compared in both directions.

In [10]:
list(cpdb_scoring.keys())[0:10]

[]
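
Because only one ordering of each pair is used as a key, a small lookup helper that tries both orderings can save some trial and error. The sketch below uses a stand-in dictionary instead of `cpdb_scoring`; the `get_pair_table` helper is hypothetical. It also counts the unordered pairs (self-pairs included) for 28 clusters, which is consistent with the 406 iterations shown in the progress bar above.

```python
from itertools import combinations_with_replacement


def get_pair_table(results, cell_a, cell_b, sep="|"):
    """Return (key, table) for a cell pair, whichever ordering was stored."""
    for key in (f"{cell_a}{sep}{cell_b}", f"{cell_b}{sep}{cell_a}"):
        if key in results:
            return key, results[key]
    raise KeyError(f"no entry for {cell_a!r} and {cell_b!r}")


# Stand-in for cpdb_scoring: the values would be dataframes in practice
results = {"endosteal fibroblast|osteoclast": "table placeholder"}
key, table = get_pair_table(results, "osteoclast", "endosteal fibroblast")
print(key)

# Unordered pairs of 28 clusters, self-pairs included: 28 * 29 / 2
n_pairs = sum(1 for _ in combinations_with_replacement(range(28), 2))
print(n_pairs)
```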

### Example of how to query results
Sort the results for one cell pair by score, highest first.

In [None]:
example_table = cpdb_scoring['endosteal fibroblast|osteoclast'].sort_values('Score',
                                                                            ascending = False)

In [None]:
example_table.head(20)

____