# Example 1
## Step 0 - prepare your data

Prepare CellphoneDB inputs starting from an anndata object.

In [1]:
import numpy as np
import pandas as pd
import scanpy as sc
import anndata
import os
import sys
from scipy import sparse



sc.settings.verbosity = 1  # verbosity: errors (0), warnings (1), info (2), hints (3)
sys.executable

'/home/jovyan/my-conda-envs/sc_analysis/bin/python'


### 1. Load anndata

The anndata object contains counts that have been normalized (per cell) and log-transformed.

In [2]:
adata = sc.read('endometrium_example_counts.h5ad')

### 2. Generate your meta

In this example, our input is an anndata containing the cluster/celltype information in anndata.obs['cell_type']

The object also has anndata.obs['lineage'] information, which will be used below for a hierarchical DEGs approach.

In [3]:
adata.obs['cell_type'].values.describe()

categories,counts,freqs
Endothelial ACKR1,100,0.051308
Endothelial SEMA3G,100,0.051308
Fibroblast C7,100,0.051308
Fibroblast dS,100,0.051308
Fibroblast eS,100,0.051308
Lymphoid,100,0.051308
Myeloid,100,0.051308
PV MYH11,100,0.051308
PV STEAP4,100,0.051308
epi_Ciliated,100,0.051308


In [4]:
df_meta = pd.DataFrame(data={'Cell':list(adata.obs.index),
                             'cell_type':[ i for i in adata.obs['cell_type']]
                            })
df_meta.set_index('Cell', inplace=True)
df_meta.to_csv('endometrium_example_meta.tsv', sep = '\t')
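Before moving on, it can help to confirm that the meta table covers exactly the cells in the expression matrix and contains no duplicated barcodes. The following is a minimal sketch using only the standard library; `check_meta` is a hypothetical helper name (not part of CellphoneDB or scanpy), and the barcodes are made-up toy values.

```python
from collections import Counter

def check_meta(count_barcodes, meta_barcodes):
    """Compare barcodes in the counts with those in the meta table.

    Hypothetical helper for illustration; returns a dict of three sorted lists.
    """
    counts = list(count_barcodes)
    meta = list(meta_barcodes)
    # Barcodes that appear more than once in the meta table
    duplicated = sorted(b for b, n in Counter(meta).items() if n > 1)
    return {
        "missing_in_meta": sorted(set(counts) - set(meta)),
        "missing_in_counts": sorted(set(meta) - set(counts)),
        "duplicated_in_meta": duplicated,
    }

# Toy barcodes (assumed values, not taken from the endometrium dataset)
report = check_meta(
    ["cell_A", "cell_B", "cell_C"],
    ["cell_A", "cell_B", "cell_B", "cell_D"],
)
print(report)
```

With the objects from this notebook, the call would be `check_meta(adata.obs_names, df_meta.index)`; all three lists should come back empty.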

### 3. Compute DEGs (optional)

We will import our gene expression into Seurat using rpy2 so that we can estimate the differentially expressed genes (DEGs) using Seurat `FindAllMarkers`.


In [5]:
# Convert to dense matrix for Seurat (skipped if X is already dense)
if sparse.issparse(adata.X):
    adata.X = adata.X.toarray()

In [6]:
import rpy2.rinterface_lib.callbacks
import logging
# Ignore R warning messages
#Note: this can be commented out to get more verbose R output
rpy2.rinterface_lib.callbacks.logger.setLevel(logging.ERROR)
import anndata2ri
anndata2ri.activate()
%load_ext rpy2.ipython


In [7]:
%%R -i adata
adata

class: SingleCellExperiment 
dim: 20975 1949 
metadata(0):
assays(1): X
rownames(20975): RP11-34P13.7 FO538757.2 ... AC004556.1 AC240274.1
rowData names(2): gene_ids n_cells
colnames(1949): 4861STDY7387181_AAACCTGAGGGCACTA
  4861STDY7387181_AAACCTGTCAATAAGG ... GSM4577315_TTGTTCAAGCCACCGT
  GSM4577315_TTTACGTTCGTAGGGA
colData names(20): sample_names log2p1_count ... cell_type n_counts
reducedDimNames(0):
altExpNames(0):


Use Seurat `FindAllMarkers` to compute differentially expressed genes and extract the corresponding data frame `DEGs`.
Here there are two options you may be interested in:
1. Identify DEGs for each cell type (compare cell type vs rest, most likely option)
2. Identify DEGs for each cell type using a per-lineage hierarchical approach (compare cell type vs rest in the lineage, as in the endometrium paper Garcia-Alonso et al 2021)

In the endometrium paper (Garcia-Alonso et al 2021) we are interested in the differences within the stromal and epithelial lineages, rather than the commonalities (for example, what is specific to epithelial cells in the glands compared to epithelial cells in the lumen). The reason is that epithelial and stromal subtypes vary in space and type, so we want to extract the subtle differences within each lineage to better understand their differential location and biological role.
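To make the difference between the two options concrete, the sketch below lists which cells form the "rest" group for one cell type under each option. It is plain Python on toy labels; `reference_cells` is a hypothetical helper, and the cell IDs, the `epi_Glandular` label and the lineage assignments are illustrative assumptions rather than values taken from the dataset.

```python
def reference_cells(cells, target, within_lineage):
    """Return the IDs of the cells a target cell type is compared against.

    cells: list of (cell_id, cell_type, lineage) tuples.
    within_lineage=False -> option 1: every cell of another cell type.
    within_lineage=True  -> option 2: other cell types from the target's lineage only.
    """
    target_lineages = {lin for _, ctype, lin in cells if ctype == target}
    ref = []
    for cell_id, ctype, lin in cells:
        if ctype == target:
            continue  # cells of the target type are never part of the reference
        if within_lineage and lin not in target_lineages:
            continue  # option 2 drops cells from other lineages
        ref.append(cell_id)
    return ref

# Toy labels; lineage assignments and 'epi_Glandular' are assumptions for illustration
toy = [
    ("c1", "epi_Ciliated", "Epithelial"),
    ("c2", "epi_Glandular", "Epithelial"),
    ("c3", "Fibroblast C7", "Stromal"),
    ("c4", "Lymphoid", "Immune"),
]
print(reference_cells(toy, "epi_Ciliated", within_lineage=False))  # ['c2', 'c3', 'c4']
print(reference_cells(toy, "epi_Ciliated", within_lineage=True))   # ['c2']
```

Under option 2 the reference group shrinks to the sibling cell types of the same lineage, which is why genes shared across the whole lineage no longer come up as markers.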


In [8]:
%%R -o DEGs

library(Seurat)
so = as.Seurat(adata, counts = "X", data = "X")
Idents(so) = so$cell_type

## OPTION 1 - compute DEGs for all cell types
## Extract DEGs for each cell_type
# DEGs <- FindAllMarkers(so, 
#                        test.use = 'LR', 
#                        verbose = F, 
#                        only.pos = T, 
#                        random.seed = 1, 
#                        logfc.threshold = 0.2, 
#                        min.pct = 0.1, 
#                        return.thresh = 0.05)


# OPTION 2 - optional - Re-compute hierarchical (per lineage) DEGs for Epithelial and Stromal lineages
DEGs = c()
for( lin in c('Epithelial', 'Stromal') ){
    message('Computing DEGs within lineage ', lin)
    so_in_lineage = subset(so, cells = Cells(so)[ so$lineage == lin ] )
    celltype_in_lineage = unique(so$cell_type[ so$lineage == lin ])
    DEGs_lin = FindAllMarkers(so_in_lineage, 
                       test.use = 'LR', 
                       verbose = F, 
                       only.pos = T, 
                       random.seed = 1, 
                       logfc.threshold = 0.2, 
                       min.pct = 0.1, 
                       return.thresh = 0.05)
    DEGs = rbind(DEGs_lin, DEGs)
}

Filter significant genes. Here we select genes with an adjusted p-value `< 0.05` and an average log2 fold change `> 0.1`.

In [9]:
DEGs.head()

Unnamed: 0,p_val,avg_log2FC,pct.1,pct.2,p_val_adj,cluster,gene
GSN,2.3219610000000002e-39,2.807124,0.91,0.725,4.870313e-35,Fibroblast C7,GSN
IGFBP5,4.016545e-36,3.52453,0.94,0.52,8.424704000000001e-32,Fibroblast C7,IGFBP5
RPL21,2.663697e-35,1.313335,1.0,0.99,5.5871040000000005e-31,Fibroblast C7,RPL21
RPS27,7.950448e-34,0.985078,1.0,1.0,1.6676060000000002e-29,Fibroblast C7,RPS27
ASPN,2.981168e-30,2.664842,0.55,0.025,6.253000999999999e-26,Fibroblast C7,ASPN


In [10]:
cond1 = DEGs['p_val_adj'] < 0.05
cond2 = DEGs['avg_log2FC'] > 0.1
# Keep rows that satisfy both conditions
mask = cond1 & cond2
fDEGs = DEGs[mask]

Save the significant DEGs to a file.
Important: the DEGs output file must contain
- 1st column = cluster
- 2nd column = gene
- 3rd column onwards = ignored

In [11]:
# 1st column = cluster; 2nd column = gene 
fDEGs = fDEGs[['cluster', 'gene', 'p_val_adj', 'p_val', 'avg_log2FC', 'pct.1', 'pct.2']] 
fDEGs.to_csv('endometrium_example_DEGs.tsv', index=False, sep='\t')
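As a quick sanity check of the column order described above, the header of the saved file can be inspected. This is a minimal sketch using the standard `csv` module; `first_two_columns` is a hypothetical helper, and the in-memory text stands in for the real file.

```python
import csv
import io

def first_two_columns(handle):
    """Read the header row of a tab-separated file and return its first two column names."""
    header = next(csv.reader(handle, delimiter="\t"))
    return header[:2]

# In-memory stand-in for the saved file, reusing one row shown earlier (values rounded)
example = io.StringIO(
    "cluster\tgene\tp_val_adj\tp_val\tavg_log2FC\tpct.1\tpct.2\n"
    "Fibroblast C7\tGSN\t4.870313e-35\t2.321961e-39\t2.807124\t0.91\t0.725\n"
)
cols = first_two_columns(example)
print(cols)
assert cols == ["cluster", "gene"], "first two columns must be cluster and gene"
```

For the real file, pass `open('endometrium_example_DEGs.tsv', newline='')` instead of the `StringIO` object.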