This tutorial will demonstrate how to pre-process single-cell raw UMI counts to generate expression matrices that can be used as input to cell-cell communication tools. We will assume appropriate quality-control (QC) has already been applied to the dataset (e.g., exclusion of low-quality cells and doublets). We recommend the tutorial by [Luecken & Theis](https://doi.org/10.15252/msb.20188746) as a starting point for a detailed overview of QC and single-cell RNAseq analysis pipelines in general. 

Here we will focus on:
1. Normalization
2. Batch correction (for multiple samples/contexts)

We demonstrate a typical workflow using the popular single-cell analysis software [scanpy](https://scanpy.readthedocs.io/en/stable/) to generate an AnnData object that can be used downstream. Preferably, the workflow should maintain non-negative values, since Tensor-cell2cell uses a non-negative decomposition.

For use with Tensor-cell2cell, we want a dataset that represents >2 contexts. We also want a dataset that contains [replicates](https://www.nature.com/articles/nmeth.3091). Replicates allow us to ensure that the output factors are not simply due to technical effects (e.g., a factor with high loadings for just one replicate in the context dimension). We will use a [BALF COVID dataset](https://doi.org/10.1038/s41591-020-0901-9), which contains 12 samples associated with "Healthy Control", "Moderate", or "Severe" COVID contexts. This dataset does not contain technical replicates, since each sample was taken from a different patient, but the samples within each context are biological replicates. [Batch correction](https://www.nature.com/articles/s41592-018-0254-1) removes technical variation while preserving biological variation between samples. After appropriate batch correction, we can reasonably assume that the biological variation between contexts is greater than the variation within contexts. Thus, we expect Tensor-cell2cell to capture overall communication trends that differ between contexts. We can then assess whether output factors are simply due to technical effects by checking that they have similar loadings for biological replicates and do not have high loadings for just one sample in the context dimension.

In [1]:
import os

import scanpy as sc
import pandas as pd
import numpy as np

import sys
sys.path.insert(1, '/home/hratch/Projects/CCC/ccc_protocols/scripts/')
from cell2cell_dev.datasets.load_data import CovidBalf

import warnings
warnings.filterwarnings('ignore')

seed = 888

The 12 samples can be downloaded as .h5 files from [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926). You can also download the cell metadata from [here](https://raw.githubusercontent.com/zhangzlab/covid_balf/master/all.cell.annotation.meta.txt).

cell2cell has a helper function to download and format the raw UMI counts and metadata:

In [2]:
covid_balf_data = CovidBalf(data_path = '/data3/hratch/ccc_protocols/raw/covid_balf/')
# covid_balf_data.download_data()
md, balf_samples = covid_balf_data.format_data()

md.head()

cell_barcode,Sample_ID,sample_new,Context,disease,hasnCoV,cluster,cell_type
AAACCCACAGCTACAT_3,C100,HC3,Healthy_Control,N,N,27,B
AAACCCATCCACGGGT_3,C100,HC3,Healthy_Control,N,N,23,Macrophages
AAACCCATCCCATTCG_3,C100,HC3,Healthy_Control,N,N,6,T
AAACGAACAAACAGGC_3,C100,HC3,Healthy_Control,N,N,10,Macrophages
AAACGAAGTCGCACAC_3,C100,HC3,Healthy_Control,N,N,10,Macrophages
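
Because the analysis relies on having more than two contexts with replicates in each, it can be worth tallying the number of samples per context before going further. The sketch below does this in plain Python on a few hypothetical (Sample_ID, Context) pairs. Only C100 / Healthy_Control is taken from the table above; the other sample and context names are placeholders. With a pandas DataFrame such as md, `md.groupby('Context')['Sample_ID'].nunique()` should give the same tally.

```python
from collections import defaultdict

# Hypothetical per-cell (Sample_ID, Context) pairs. In the notebook these would
# come from the md table; only C100 / Healthy_Control appears in the output above.
cell_rows = [
    ("C100", "Healthy_Control"),
    ("C100", "Healthy_Control"),
    ("S1", "Healthy_Control"),
    ("S2", "Context_B"),
    ("S3", "Context_B"),
    ("S4", "Context_C"),
]


def samples_per_context(rows):
    """Count the distinct samples observed in each context."""
    samples = defaultdict(set)
    for sample_id, context in rows:
        samples[context].add(sample_id)
    return {context: len(ids) for context, ids in samples.items()}


replicate_counts = samples_per_context(cell_rows)

# Contexts represented by a single sample have no biological replicate.
unreplicated = sorted(c for c, n in replicate_counts.items() if n < 2)

print(replicate_counts)
print("Contexts without replicates:", unreplicated)
```

A context that shows up in the second printout would be a candidate for extra caution when interpreting factor loadings in the context dimension.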


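balf_samples is a dictionary whose keys are the sample IDs and whose values are AnnData objects storing the raw UMI counts for each sample. We first normalize each sample to counts per million (CPM) and then log-transform, giving log(1+CPM) values: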

In [3]:
balf_samples_norm = dict()
for sample, adata in balf_samples.items():
    adata_norm = adata.copy()
    sc.pp.normalize_total(adata_norm, target_sum=1e6) # CPM normalize
    sc.pp.log1p(adata_norm) # logarithmize
    
    balf_samples_norm[sample] = adata_norm
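
For intuition, the arithmetic behind the two calls above can be written out for a single cell. This is a minimal sketch in plain Python, not scanpy's implementation: `log1p_cpm` is a hypothetical helper, the toy counts are made up, and returning zeros for an empty cell is a choice made here for the illustration.

```python
import math


def log1p_cpm(cell_counts, target_sum=1e6):
    """Scale one cell's UMI counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(cell_counts)
    if total == 0:
        # Choice made for this sketch: an empty cell stays all zeros.
        return [0.0 for _ in cell_counts]
    return [math.log1p(count / total * target_sum) for count in cell_counts]


# Toy cell with 4 genes and 10 UMIs in total (made-up numbers).
toy_cell = [0, 1, 4, 5]
normalized = log1p_cpm(toy_cell)

# Zero counts map to zero, so the output stays non-negative.
print([round(value, 3) for value in normalized])

# Assumption: one UMI for a gene in a cell with 7,297 total counts.
single_umi = log1p_cpm([1, 7296])[0]
print(round(single_umi, 3))
```

Under the assumption that it reflects a single UMI, the second call gives roughly 4.93, which is in line with the non-zero log(1+CPM) values shown later in this tutorial for a cell with 7,297 total counts.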

# Batch correction:

Finally, we apply batch correction. The goal is to account for sample-to-sample technical variability. Here, we use ComBat since it is built into scanpy.

Note, the final input matrices to Tensor-cell2cell must be non-negative. We will demonstrate workarounds for negative values in the tensor building tutorial.

See [this benchmark](https://doi.org/10.1186/s13619-020-00041-9) of scanpy's batch correction methods.

In [32]:
batch_var = 'Sample_ID' # the batch variable in the metadata

Batch correction using ComBat:

In [33]:
# merge the log(1+CPM)-normalized samples
balf_corrected = sc.concat(balf_samples_norm.values())
balf_corrected.obs_names_make_unique()

# store log(1+CPM) values in "raw" attribute
balf_corrected.raw = balf_corrected

# do the batch correction (replaces balf_corrected.X with the corrected values)
sc.pp.combat(balf_corrected, key = batch_var)
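
ComBat fits an empirical Bayes model, which is beyond a short snippet. As a much simpler stand-in (not ComBat, and not what scanpy does internally), the sketch below shifts each batch so that its mean for one gene matches the overall mean. It is only meant to show how an additive correction can push zero entries below zero, which is why the non-negativity note above matters. The helper name and all values are made up.

```python
from statistics import mean


def center_batches(values, batches):
    """Shift each batch so its mean matches the overall mean (location-only adjustment)."""
    overall = mean(values)
    batch_means = {
        batch: mean(v for v, b in zip(values, batches) if b == batch)
        for batch in set(batches)
    }
    return [v - batch_means[b] + overall for v, b in zip(values, batches)]


# One gene's log(1+CPM) values across six cells from two hypothetical batches.
expression = [0.0, 0.0, 3.0, 0.0, 4.0, 5.0]
batch_labels = ["A", "A", "A", "B", "B", "B"]

corrected = center_batches(expression, batch_labels)
n_negative = sum(value < 0 for value in corrected)

print(corrected)
print("negative entries after correction:", n_negative)
```

In this toy case the zero in the batch with the higher mean becomes negative after the shift, while the batch means end up equal.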

At some point in the pipeline, we must account for batch. Batch correction is important since Tensor-cell2cell considers multiple samples to extract context-dependent patterns, and we want to make sure we are capturing true biological signals rather than sample-specific differences due to technical variability.

Ideally, we can use single-cell RNAseq batch correction methods. There are a few potential problems with this approach:

1) Batch correction methods often return a matrix in a reduced space that no longer contains the original gene features, which are needed for ligand-receptor (LR) scoring (see [Table 1](https://academic.oup.com/nargab/article/4/1/lqac022/6548822)).

2) Some cell-cell communication tools expect data in other formats, such as log(1+CPM).

3) Batch correction methods that do return gene counts often return negative counts, which can result in negative LR scores. Negative values in the tensor can bias non-negative TCD, the main algorithm used in Tensor-cell2cell.

In this tutorial, and its companion 01B for R users, we show pre-processing from raw counts to batch-corrected counts. Problem 1 can be dealt with simply by using only batch correction methods that return the original gene features. Problems 2-3 will be discussed further in Tutorials XXX. Essentially, Problems 2-3 can both be dealt with by instead introducing a technical covariate that accounts for batch directly in the decomposition. Problem 3 can also be dealt with either by masking negative values or by using a TCD approach that does not have a non-negative constraint.
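
As a rough illustration of the masking idea mentioned for Problem 3, the sketch below builds a 0/1 mask that marks non-negative entries and zeroes out the rest. It uses plain nested lists, a hypothetical helper name, and made-up values. The convention that 1 means "keep" is a choice made here; how Tensor-cell2cell actually consumes a mask is left to the later tutorials.

```python
def mask_negatives(matrix):
    """Return (cleaned, mask): negatives are set to 0.0 and mask is 1 where a value was kept."""
    mask = [[1 if value >= 0 else 0 for value in row] for row in matrix]
    cleaned = [
        [value if keep else 0.0 for value, keep in zip(row, mask_row)]
        for row, mask_row in zip(matrix, mask)
    ]
    return cleaned, mask


# Toy batch-corrected values (genes x cells), made up for illustration.
corrected_values = [
    [-0.026, 3.884, -0.026],
    [-0.276, -0.276, 2.241],
]

cleaned, mask = mask_negatives(corrected_values)

# Fraction of entries that would be ignored if the mask were applied.
n_entries = sum(len(row) for row in mask)
fraction_masked = 1 - sum(map(sum, mask)) / n_entries

print(cleaned)
print(mask)
print(round(fraction_masked, 3))
```

Tracking the masked fraction is a quick way to judge how much information a masking strategy would discard before committing to it.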

The next two cells, unused, show examples of other batch correction methods. See [this workshop](https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_03_integration.html) for more tutorials on batch correction.

Batch correction with scanorama:

In [8]:
# import scanorama

# # merge all the log(1+CPM)-normalized samples into a single object
# balf_log = sc.concat(balf_samples_norm.values())
# balf_log.obs_names_make_unique()

# # correct with scanorama
# balf_corrected = scanorama.correct_scanpy(adatas=list(balf_samples_norm.values()), return_dimred=False)

# # aggregate into one object
# balf_corrected = sc.concat(balf_corrected)
# balf_corrected.obs_names_make_unique()

# # store log(1+CPM) values in "raw" attribute
# balf_corrected.raw = balf_log

Batch correction using a simple linear regression:

In [9]:
# # merge the log(1+CPM)-normalized samples
# balf_corrected = sc.concat(balf_samples_norm.values())
# balf_corrected.obs_names_make_unique()

# # store log(1+CPM) values in "raw" attribute
# balf_corrected.raw = balf_corrected

# # do the batch correction
# sc.pp.regress_out(balf_corrected, keys = batch_var)

Calculate a PCA manifold on the batch-corrected counts:

In [17]:
# get the top 2000 highly variable genes
sc.pp.highly_variable_genes(balf_corrected, n_top_genes = 2000)

# get PCA to 100 PCs
sc.tl.pca(balf_corrected, use_highly_variable = True, svd_solver='arpack', random_state = seed, 
         n_comps = 100)
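
To give a feel for what restricting PCA to highly variable genes means, the sketch below ranks genes by their variance across cells and keeps the top few. This is a deliberately simplified stand-in: `sc.pp.highly_variable_genes` uses more involved dispersion-based criteria, and the helper name, gene labels, and values here are all hypothetical.

```python
from statistics import pvariance


def top_variable_genes(expression_by_gene, n_top):
    """Rank genes by variance across cells and return the n_top most variable names."""
    variances = {
        gene: pvariance(values) for gene, values in expression_by_gene.items()
    }
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_top]


# Toy expression values for three placeholder genes across four cells.
toy_expression = {
    "GENE_A": [0.0, 0.0, 0.0, 0.0],  # constant, carries no variation
    "GENE_B": [0.0, 5.0, 0.0, 5.0],  # strongly variable
    "GENE_C": [1.0, 2.0, 1.0, 2.0],  # mildly variable
}

selected = top_variable_genes(toy_expression, n_top=2)
print(selected)
```

Genes that barely change across cells contribute little to the leading principal components, which is the motivation for filtering before computing the manifold.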

In [None]:
# TODO: make this corrected object, the raw data, and metadata available to download somewhere
out_path = '/data3/hratch/c2c_general/'
balf_corrected.write_h5ad(out_path + 'batch_corrected_balf_covid.h5ad') # 6.7Gb

The final "balf_corrected" AnnData object has the following attributes:
1) X: batch-corrected counts matrix (preferably non-negative) <br>
2) obs: cell metadata that includes the cell group (cluster or type), Sample ID, and Context <br>
3) raw: log(1+CPM) normalized AnnData object <br>
4) obsm['X_pca']: the cell manifold 

Regardless of the preprocessing pipeline used, these four pieces of information will be necessary for some parts of the Tensor-cell2cell analyses. 

In [20]:
# corrected counts matrix
balf_corrected.to_df().T.head()

Unnamed: 0,AAACCTGAGGAATCGC-1,AAACCTGTCCAGAAGG-1,AAACCTGTCCAGTAGT-1,AAACCTGTCTGGGCCA-1,AAACGGGCACGAGGTA-1,AAACGGGGTACATCCA-1,AAACGGGGTCTCCCTA-1,AAACGGGTCTAGAGTC-1,AAAGATGTCGTGGGAA-1,AAAGCAAAGGGATACC-1,...,TTTGGTTAGCACGCCT-1,TTTGGTTAGTGGTAAT-1,TTTGGTTAGTTGTAGA-1-1,TTTGGTTCATACTACG-1,TTTGTCAAGATTACCC-1,TTTGTCAAGTGGTAAT-1,TTTGTCACAGAAGCAC-1,TTTGTCATCAACCAAC-1,TTTGTCATCCAAACAC-1,TTTGTCATCGCGTTTC-1
LINC00115,-0.026211,3.884047,-0.026211,-0.026211,4.615011,-0.026211,-0.026211,-0.026211,-0.026211,4.423844,...,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741,-0.031741
NOC2L,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,-0.276392,...,-0.434179,-0.434179,2.241069,-0.434179,-0.434179,-0.434179,-0.434179,-0.434179,-0.434179,-0.434179
KLHL17,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,-0.045777,...,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993,-0.019993
PLEKHN1,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,-0.061546,...,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228,0.002228
HES4,-0.444072,4.559903,-0.444072,-0.444072,-0.444072,-0.444072,-0.444072,-0.444072,-0.444072,-0.444072,...,0.562998,0.562998,0.562998,0.562998,0.562998,0.562998,0.562998,0.562998,0.562998,6.038211


In [25]:
# cell metadata
balf_corrected.obs.head()

Unnamed: 0,Sample_ID,Context,cell_type,n_genes,n_genes_by_counts,total_counts,total_counts_mt,pct_counts_mt
AAACCTGAGGAATCGC-1,C148,Severe_Covid,Macrophages,606,606,1342.0,60.0,4.470939
AAACCTGTCCAGAAGG-1,C148,Severe_Covid,Macrophages,2035,2034,7297.0,334.0,4.577224
AAACCTGTCCAGTAGT-1,C148,Severe_Covid,Macrophages,1660,1658,4959.0,324.0,6.533575
AAACCTGTCTGGGCCA-1,C148,Severe_Covid,Macrophages,4965,4964,31956.0,1374.0,4.299662
AAACGGGCACGAGGTA-1,C148,Severe_Covid,T,1290,1288,2892.0,119.0,4.114799


In [28]:
# log(1+CPM) counts matrix
balf_corrected.raw.to_adata().to_df().T.head()

Unnamed: 0,AAACCTGAGGAATCGC-1,AAACCTGTCCAGAAGG-1,AAACCTGTCCAGTAGT-1,AAACCTGTCTGGGCCA-1,AAACGGGCACGAGGTA-1,AAACGGGGTACATCCA-1,AAACGGGGTCTCCCTA-1,AAACGGGTCTAGAGTC-1,AAAGATGTCGTGGGAA-1,AAAGCAAAGGGATACC-1,...,TTTGGTTAGCACGCCT-1,TTTGGTTAGTGGTAAT-1,TTTGGTTAGTTGTAGA-1-1,TTTGGTTCATACTACG-1,TTTGTCAAGATTACCC-1,TTTGTCAAGTGGTAAT-1,TTTGTCACAGAAGCAC-1,TTTGTCATCAACCAAC-1,TTTGTCATCCAAACAC-1,TTTGTCATCGCGTTTC-1
LINC00115,0.0,4.927562,0.0,0.0,5.848695,0.0,0.0,0.0,0.0,5.607794,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
NOC2L,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.458022,0.0,0.0,0.0,0.0,0.0,0.0,0.0
KLHL17,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
PLEKHN1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
HES4,0.0,6.021334,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.693111


In [36]:
# cell manifold
pd.DataFrame(balf_corrected.obsm['X_pca'], 
            columns = ['PC' + str(i) for i in range(1, 101)], 
                      index = balf_corrected.obs.index.tolist()).head()

Unnamed: 0,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC91,PC92,PC93,PC94,PC95,PC96,PC97,PC98,PC99,PC100
AAACCTGAGGAATCGC-1,-12.27951,-11.569539,2.333932,-0.792567,-2.818772,2.152842,1.414543,-2.81945,-0.843046,-0.359107,...,0.636127,-0.555673,-1.174195,0.382731,-1.154741,1.880821,-2.175864,0.964448,2.638466,2.88271
AAACCTGTCCAGAAGG-1,4.993004,-9.634439,-4.017168,6.867777,-3.659773,2.181219,2.381791,5.002475,4.327793,-2.199009,...,1.056841,-0.549429,-1.792231,-1.502628,3.608827,0.246818,2.4785,-0.412751,2.093467,-2.841379
AAACCTGTCCAGTAGT-1,-4.327682,-9.474691,-0.37374,-1.574611,1.026707,0.264881,-0.642445,2.645939,0.76808,1.401344,...,-1.219072,-1.50904,0.535264,-2.544001,-2.009928,0.51789,-0.087034,0.123314,0.31421,-3.137026
AAACCTGTCTGGGCCA-1,21.99651,2.566066,-3.94439,-13.916259,7.445151,3.117134,1.296663,2.814822,-1.560574,-0.914551,...,-1.616594,-0.872157,-2.061459,-0.744164,-1.183877,-4.804202,-1.866552,2.740477,1.292481,-1.16091
AAACGGGCACGAGGTA-1,-18.817038,4.994419,-9.384811,-6.844615,-3.056098,2.77496,-5.017649,0.400225,0.813033,0.663741,...,1.176409,0.047743,-0.064789,2.626716,0.151935,0.872773,-1.538529,-0.425076,1.31526,-0.842731


In [None]:
# from typing import Dict
# def split_adata(adata, sample_col = 'Sample_ID'):
#     """Split an AnnData object with corrected counts into its respective samples.

#     Parameters
#     ----------
#     adata : AnnData
#         merged AnnData object across samples (see sc.concat)
#     sample_col : str, optional
#         the metadata (adata.obs) column specifying the samples, by default 'Sample_ID'

#     Returns
#     -------
#     balf_samples : Dict[str, AnnData]
#         the set of AnnData objects corresponding to each sample
#     """
    
#     balf_samples = {sample: adata[adata.obs[adata.obs[sample_col] == sample].index] for sample in adata.obs[sample_col].unique()}
#     return balf_samples


# balf_corrected_split = split_adata(adata=balf_corrected)