This tutorial will demonstrate how to pre-process single-cell raw UMI counts to generate expression matrices that can be used as input to cell-cell communication tools. We will assume appropriate quality-control (QC) has already been applied to the dataset (e.g., exclusion of low-quality cells and doublets). We recommend the tutorial by [Luecken & Theis](https://doi.org/10.15252/msb.20188746) as a starting point for a detailed overview of QC and single-cell RNAseq analysis pipelines in general. 

Here we will focus on:
1. Normalization
2. Interoperability between R and Python

We demonstrate a typical workflow using the popular single-cell analysis package [Seurat](https://satijalab.org/seurat/index.html). We will use a [BALF COVID dataset](https://doi.org/10.1038/s41591-020-0901-9), which contains 12 samples associated with "Healthy Control", "Moderate", or "Severe" COVID contexts.

Details and caveats regarding [batch correction](https://www.nature.com/articles/s41592-018-0254-1), which removes technical variation while preserving biological variation between samples, can be viewed in the additional examples tutorial entitled "S1_Batch_Correction".

In [1]:
library(Seurat)

# paths
data.path<-'/data3/hratch/ccc_protocols/'

Attaching SeuratObject

Attaching sp

Registered S3 method overwritten by 'SeuratDisk':
  method            from  
  as.sparse.H5Group Seurat



The 12 samples can be downloaded as .h5 files from [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926). You can also download the cell metadata from [here](https://raw.githubusercontent.com/zhangzlab/covid_balf/master/all.cell.annotation.meta.txt).

We download these files directly in the following cell:

In [4]:
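covid.input.path<-paste0(data.path, 'raw/covid_balf/')
dir.create(covid.input.path, recursive = TRUE, showWarnings = FALSE) # make sure the download directory exists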

# download the metadata
metadata.link <- 'https://raw.githubusercontent.com/zhangzlab/covid_balf/master/all.cell.annotation.meta.txt'
cmd <- paste0('wget ', metadata.link, ' -O ', covid.input.path, 'metadata.txt')
system(cmd, ignore.stdout = T, ignore.stderr = T)

# download the expression data
sample.links <- c(
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4339nnn/GSM4339769/suppl/GSM4339769%5FC141%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5',
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4339nnn/GSM4339770/suppl/GSM4339770%5FC142%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5',
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4339nnn/GSM4339771/suppl/GSM4339771%5FC143%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5', 
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4339nnn/GSM4339772/suppl/GSM4339772%5FC144%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5', 
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4339nnn/GSM4339773/suppl/GSM4339773%5FC145%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5',
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4339nnn/GSM4339774/suppl/GSM4339774%5FC146%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5', 
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4475nnn/GSM4475048/suppl/GSM4475048%5FC51%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5', 
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4475nnn/GSM4475049/suppl/GSM4475049%5FC52%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5', 
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4475nnn/GSM4475050/suppl/GSM4475050%5FC100%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5', 
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4475nnn/GSM4475051/suppl/GSM4475051%5FC148%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5', 
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4475nnn/GSM4475052/suppl/GSM4475052%5FC149%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5',
    'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4475nnn/GSM4475053/suppl/GSM4475053%5FC152%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5'
    )

for (sl in sample.links){
    cmd <- paste0('wget ', sl, ' -P ', covid.input.path)
    system(cmd, ignore.stdout = T, ignore.stderr = T)
}
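
Later cells parse the sample ID out of each downloaded file name, so it can help to see how a GEO link maps to a local name. The following is a minimal sketch, assuming the file is saved under the percent-decoded base name of its URL; `local.name` and `sample.id` are illustrative names, not part of the original workflow.

```r
# One of the GEO links from the list above
url <- 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4339nnn/GSM4339769/suppl/GSM4339769%5FC141%5Ffiltered%5Ffeature%5Fbc%5Fmatrix%2Eh5'

# Decode percent-escapes (%5F is '_', %2E is '.') and keep only the file name
local.name <- basename(utils::URLdecode(url))

# The sample ID is the second underscore-delimited field of the file name
sample.id <- unlist(strsplit(local.name, '_'))[[2]]

print(local.name)
print(sample.id)
```

If the files on disk are named differently (for example, with the escapes left in place), the underscore-based parsing would need to be adjusted accordingly.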

We can then format the downloaded files:

In [5]:
# format the metadata
md <- read.table(paste0(covid.input.path, 'metadata.txt'), header = T, row.names = 'ID')
colnames(md) = c('Sample.ID', 'sample_new', 'Context', 'disease', 'hasnCoV', 'cluster', 'cell.type')

context.map = c('Healthy.Control', 'Moderate.Covid', 'Severe.Covid')
names(context.map) <- c('HC', 'M', 'S')
md['Context'] <- unname(context.map[md$Context])
md$Context <- factor(md$Context, levels = c('Healthy.Control', 'Moderate.Covid', 'Severe.Covid'))

md<-md[md$Sample.ID != 'GSM3660650', ] # drop the non-scRNAseq dataset included in this file

md<-md[with(md, order(Context, Sample.ID)), ]
head(md)

| | Sample.ID | sample_new | Context | disease | hasnCoV | cluster | cell.type |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<fct>` | `<chr>` | `<chr>` | `<int>` | `<chr>` |
| AAACCCACAGCTACAT_3 | C100 | HC3 | Healthy.Control | N | N | 27 | B |
| AAACCCATCCACGGGT_3 | C100 | HC3 | Healthy.Control | N | N | 23 | Macrophages |
| AAACCCATCCCATTCG_3 | C100 | HC3 | Healthy.Control | N | N | 6 | T |
| AAACGAACAAACAGGC_3 | C100 | HC3 | Healthy.Control | N | N | 10 | Macrophages |
| AAACGAAGTCGCACAC_3 | C100 | HC3 | Healthy.Control | N | N | 10 | Macrophages |
| AAACGAAGTCTATGAC_3 | C100 | HC3 | Healthy.Control | N | N | 9 | T |
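
The recoding of `Context` relies on indexing a named vector by the raw codes and then fixing the factor level order. A minimal sketch of that pattern on toy input is shown here; the `codes` vector is made up for illustration, while the code-to-label mapping follows the cell above.

```r
# Toy condition codes standing in for the raw 'Context' column (made-up values)
codes <- c('HC', 'S', 'M', 'HC')

# Named vector: names are the raw codes, values are the readable labels
context.map <- c(HC = 'Healthy.Control', M = 'Moderate.Covid', S = 'Severe.Covid')

# Index by name to recode, then drop the names carried over from the lookup
labels <- unname(context.map[codes])

# An explicit level order makes sorting follow control < moderate < severe
labels <- factor(labels, levels = unname(context.map))

print(labels)
print(order(labels))
```

Because the levels are set explicitly, `order()` sorts by disease context rather than alphabetically, which is what the `order(Context, Sample.ID)` call depends on.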


In [7]:
balf.samples<-list()

suppressMessages({
    suppressWarnings({
        for (filename in list.files(covid.input.path)){
            if (endsWith(filename, '.h5')){
                sample<-unlist(strsplit(filename, '_'))[[2]]

                # subset and format metadata
                md.sample<-md[md$Sample.ID == sample,]
                rownames(md.sample) <- unname(sapply(rownames(md.sample), 
                                                   function(x) paste0(unlist(strsplit(x, '_'))[[1]], '-1')))
                # load the counts
                so <- Seurat::Read10X_h5(filename=paste0(covid.input.path, filename), unique.features=T)
                so <- so[, rownames(md.sample)] # only include cells present in the metadata

                # preprocess
                so <- CreateSeuratObject(counts=so, project=sample, meta.data=md.sample[c('Sample.ID', 'Context', 'cell.type')], 
                                          min.cells=3)        
                balf.samples[[sample]] <- so
            }
        }        
    })
})

`balf.samples` is a named list: each name is a sample ID, and each value is a Seurat object storing the raw UMI counts for that sample.

In [8]:
names(balf.samples)

In [9]:
balf.samples$C100

An object of class Seurat 
16566 features across 2566 samples within 1 assay 
Active assay: RNA (16566 features, 0 variable features)
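
The barcode handling inside the loading loop is easy to miss: metadata row names carry a per-sample suffix (e.g., `_3`), whereas the columns of the count matrix end in `-1`. The sketch below reproduces that renaming and the subsequent subsetting on a toy matrix; the gene names and count values are invented, and `sub()` is used here as a compact stand-in for the `strsplit`/`paste0` combination in the loop.

```r
# Two metadata row names taken from the metadata preview
md.barcodes <- c('AAACCCACAGCTACAT_3', 'AAACCCATCCACGGGT_3')

# Keep the barcode (text before the first underscore) and append '-1'
so.barcodes <- paste0(sub('_.*$', '', md.barcodes), '-1')

# Toy count matrix (made-up genes and values) whose columns end in '-1'
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c('geneA', 'geneB'),
                                 c('AAACCCACAGCTACAT-1', 'AAACCCATCCACGGGT-1', 'TTTGTTGCAGAGGGTT-1')))

# Keep only the cells that are present in the metadata
counts.kept <- counts[, so.barcodes, drop = FALSE]
print(colnames(counts.kept))
```

Note that indexing by name fails with a subscript error if a metadata barcode is absent from the matrix; intersecting the two sets of names first would be one way to guard against that.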

To normalize the raw UMI counts, we recommend log(1+CPM) normalization, as this keeps all values non-negative and is the expected input for many communication scoring functions:

In [10]:
balf.samples <- lapply(balf.samples, 
                    function(so) NormalizeData(so, normalization.method = "LogNormalize", scale.factor = 1e6))
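
To make the transformation concrete, here is a minimal base-R sketch of log(1+CPM) on a toy matrix with made-up counts. It follows the documented behavior of Seurat's `LogNormalize` method (divide each cell by its total counts, multiply by `scale.factor`, apply the natural-log `log1p`), but it is an illustration rather than Seurat's actual implementation.

```r
# Toy UMI count matrix: genes in rows, cells in columns (made-up values)
counts <- matrix(c(0, 2, 8,
                   5, 0, 5),
                 nrow = 3,
                 dimnames = list(c('geneA', 'geneB', 'geneC'), c('cell1', 'cell2')))

# Scale each cell (column) to counts per million
cpm <- sweep(counts, 2, colSums(counts), '/') * 1e6

# Natural log of (1 + CPM); zeros stay at zero and nothing becomes negative
log.cpm <- log1p(cpm)

print(round(log.cpm, 3))
```

Setting `scale.factor = 1e6` is what turns Seurat's default per-10,000 scaling into counts per million.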

In [156]:
balf.samples$C100

An object of class Seurat 
16566 features across 2566 samples within 1 assay 
Active assay: RNA (16566 features, 0 variable features)

In [11]:
ordered.genes<-sort(rownames(balf.samples$C100))
head(as.data.frame(balf.samples[['C100']]@assays$RNA@data)[ordered.genes,])

| | AAACCCACAGCTACAT-1 | AAACCCATCCACGGGT-1 | AAACCCATCCCATTCG-1 | AAACGAACAAACAGGC-1 | AAACGAAGTCGCACAC-1 | AAACGAAGTCTATGAC-1 | AAACGAAGTGTAGTGG-1 | AAACGCTGTCACGTGC-1 | AAACGCTGTTGGAGGT-1 | AAAGAACTCTAGAACC-1 | ⋯ | TTTGATCTCCCGAAAT-1 | TTTGGAGCAATACAGA-1 | TTTGGAGTCACCATAG-1 | TTTGGAGTCTCACCCA-1 | TTTGGTTAGATGGCGT-1 | TTTGGTTGTACCCAGC-1 | TTTGGTTGTTACTCAG-1 | TTTGTTGAGCTAGAGC-1 | TTTGTTGCAATGAAAC-1 | TTTGTTGCAGAGGGTT-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| A1BG | 0 | 0 | 0 | 3.492637 | 4.373578 | 0 | 0 | 0 | 4.111976 | 0.0 | ⋯ | 0 | 0 | 5.413881 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
| A1BG-AS1 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | ⋯ | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
| A2M | 0 | 0 | 0 | 4.855851 | 0.0 | 0 | 0 | 0 | 0.0 | 6.001009 | ⋯ | 0 | 0 | 5.127682 | 0 | 0 | 5.041668 | 0 | 0 | 0 | 5.305098 |
| A2M-AS1 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | ⋯ | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
| A2ML1 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | ⋯ | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
| A4GALT | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | ⋯ | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |


We can save this list of Seurat objects for future use in other scripts:

In [15]:
saveRDS(balf.samples, 
       paste0(data.path, 'interim/covid_balf_norm.rds'))
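
In a later script, the saved list can be restored with `readRDS`. The round trip is sketched below on a toy list written to a temporary file; plain numeric vectors stand in for the Seurat objects, and `rds.path` is a hypothetical path used only for this illustration.

```r
# Stand-in for the list of Seurat objects (plain vectors, for illustration only)
toy.samples <- list(C100 = c(1, 2, 3), C141 = c(4, 5))

# Write to a temporary file rather than the project directory
rds.path <- tempfile(fileext = '.rds')
saveRDS(toy.samples, rds.path)

# Reload and confirm that the names and contents survived the round trip
reloaded <- readRDS(rds.path)
print(names(reloaded))
print(identical(reloaded, toy.samples))
```

Unlike `save`/`load`, `readRDS` returns the object instead of restoring it under its original name, so the reloaded list can be assigned to any variable.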

# Interoperability

In [None]:
# interoperability
library(SeuratDisk, quietly = T)
library('reticulate')
anndata<-import('anndata')
library('Matrix')

## to Python

For use in Python, we can convert each Seurat object to an AnnData object using [SeuratDisk](https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html). The resulting h5ad file contains the same information as the AnnData object generated in the companion Python tutorial. See that tutorial for loading these saved files.

In [12]:
suppressMessages({
    for (sample in names(balf.samples)){
        file.name<-paste0(data.path, 'interim/covid_R_to_python/', sample, '.h5Seurat')
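        # alternative if the interim directory is not writable (e.g., permission issues):
        # file.name<-paste0(sample, '.h5Seurat')
        SaveH5Seurat(balf.samples[[sample]], filename = file.name, overwrite = TRUE) # overwrite so the cell can be rerun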
        Convert(file.name, dest = "h5ad", overwrite = TRUE)
    }
})

## from Python

Here, we load into Seurat the expression matrices that were generated with AnnData in the companion Python script and saved as h5ad files:

In [22]:
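adata_to_seurat<-function(adata){
    # raw UMI counts are expected in adata.raw; AnnData is cells x genes, so transpose to genes x cells
    raw<-adata$raw$to_adata()
    raw.counts<-t(as.matrix(raw$X))
    rownames(raw.counts)<-rownames(raw$var)
    colnames(raw.counts)<-rownames(raw$obs)

    # build the Seurat object from the raw counts and the cell-level metadata
    so<-CreateSeuratObject(counts=raw.counts, assay = 'RNA', meta.data = adata$obs)

    # normalized values (adata.X) go into the data slot, with names matched to the Seurat object
    norm.counts<-t(as.matrix(adata$X))
    rownames(norm.counts)<-rownames(so)
    colnames(norm.counts)<-colnames(so)

    # direct slot access; this assumes a Seurat v4-style Assay object
    so@assays$RNA@data<-as(norm.counts, "dgCMatrix")
    so@assays$RNA@meta.features<-adata$var

    return(so)
}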
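
The transposes in `adata_to_seurat` reflect a difference in orientation: AnnData stores cells in rows and genes in columns, whereas Seurat expects genes in rows and cells in columns. A minimal sketch of that conversion on a toy matrix follows; the values and names are made up, and `as(., 'CsparseMatrix')` is used as one way to obtain a column-compressed sparse matrix from the Matrix package.

```r
library(Matrix)

# Toy AnnData-style matrix: 2 cells (rows) x 3 genes (columns), made-up values
x <- matrix(c(0, 1.5,
              2, 0,
              0, 3),
            nrow = 2,
            dimnames = list(c('cell1', 'cell2'), c('geneA', 'geneB', 'geneC')))

# Transpose to genes x cells and store in column-compressed sparse form
norm.counts <- as(t(x), 'CsparseMatrix')

print(dim(norm.counts))
print(class(norm.counts))
```

Row and column names are carried through the transpose, which is why the function can match them against `rownames(so)` and `colnames(so)`.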

In [79]:
balf.samples.python<-list()
for (sample in names(balf.samples)){
    file.name<-paste0(data.path, 'interim/covid_python_to_R/', sample, '.h5ad')
    adata<-anndata$read_h5ad(file.name)
    balf.samples.python[[sample]]<-adata_to_seurat(adata)
}

While this Seurat object is not completely identical to the one generated in this script, it stores all the same information. Counts are stored in the 'RNA' assay: the raw UMI counts are in the counts slot, the log(1+CPM) matrix is in the data slot, and the relevant metadata is available. We can see that the expression matrix is the same, up to small floating-point differences:

In [154]:
balf.samples.python$C100

An object of class Seurat 
16566 features across 2566 samples within 1 assay 
Active assay: RNA (16566 features, 0 variable features)

In [155]:
ordered.genes<-sort(rownames(balf.samples.python$C100))
head(as.data.frame(balf.samples.python[['C100']]@assays$RNA@data)[ordered.genes,])

| | AAACCCACAGCTACAT-1 | AAACCCATCCACGGGT-1 | AAACCCATCCCATTCG-1 | AAACGAACAAACAGGC-1 | AAACGAAGTCGCACAC-1 | AAACGAAGTCTATGAC-1 | AAACGAAGTGTAGTGG-1 | AAACGCTGTCACGTGC-1 | AAACGCTGTTGGAGGT-1 | AAAGAACTCTAGAACC-1 | ⋯ | TTTGATCTCCCGAAAT-1 | TTTGGAGCAATACAGA-1 | TTTGGAGTCACCATAG-1 | TTTGGAGTCTCACCCA-1 | TTTGGTTAGATGGCGT-1 | TTTGGTTGTACCCAGC-1 | TTTGGTTGTTACTCAG-1 | TTTGTTGAGCTAGAGC-1 | TTTGTTGCAATGAAAC-1 | TTTGTTGCAGAGGGTT-1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| A1BG | 0 | 0 | 0 | 3.492637 | 4.373578 | 0 | 0 | 0 | 4.111976 | 0.0 | ⋯ | 0 | 0 | 5.413881 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
| A1BG-AS1 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | ⋯ | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
| A2M | 0 | 0 | 0 | 4.855852 | 0.0 | 0 | 0 | 0 | 0.0 | 6.001009 | ⋯ | 0 | 0 | 5.127682 | 0 | 0 | 5.041668 | 0 | 0 | 0 | 5.305098 |
| A2M-AS1 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | ⋯ | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
| A2ML1 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | ⋯ | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
| A4GALT | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 | ⋯ | 0 | 0 | 0.0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0.0 |
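
The R-derived and Python-derived tables agree except in the last displayed digit of one A2M entry (4.855851 versus 4.855852). A tolerance-based comparison is the usual way to check this kind of agreement; the sketch below uses the two displayed (rounded) values, and the tolerance of `1e-6` is an assumed, illustrative choice rather than a recommendation from the original workflow.

```r
# Values for A2M in cell AAACGAACAAACAGGC-1, as displayed in the two tables
from.r      <- 4.855851
from.python <- 4.855852

# Exact comparison is too strict for values computed by different software stacks
print(identical(from.r, from.python))

# all.equal() compares within a relative tolerance instead
print(isTRUE(all.equal(from.r, from.python, tolerance = 1e-6)))
```

The same `all.equal` call could in principle be applied to whole matrices after coercing both to a common dense or sparse class with matching row and column order.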
