### Discussion

Here, we discuss the intricacies of batch correction with regard to CCC scoring and Tensor-cell2cell.

For use with Tensor-cell2cell, we want a dataset that represents >2 contexts. Multiple contexts are typically derived from different samples and/or measurements, and thus introduce batch effects.

#### Replicates
Our dataset should contain [replicates](https://www.nature.com/articles/nmeth.3091). Replicates allow us to ensure that the output factors are not simply due to technical effects (e.g., a factor with high loadings for just one replicate in the context dimension). We will use a [BALF COVID dataset](https://doi.org/10.1038/s41591-020-0901-9), which contains 12 samples associated with "Healthy Control", "Moderate", or "Severe" COVID contexts. This dataset does not contain technical replicates, since each sample was taken from a different patient, but the samples associated with a given context are biological replicates of one another.


#### Batch correction
[Batch correction](https://www.nature.com/articles/s41592-018-0254-1) removes technical variation while preserving biological variation between samples. After appropriate batch correction, we can reasonably assume that the biological variation between contexts is greater than the variation within contexts. Thus, we expect Tensor-cell2cell to capture overall communication trends that differ between contexts. We can then verify that the output factors are not simply due to technical effects by checking that they have similar loadings for biological replicates and do not have high loadings for just one sample in the context dimension.
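
The replicate check described above can be sketched in a few lines of base R. This is only an illustration: the loadings, the sample names, and the 0.5 cutoff are hypothetical values chosen for the example, not output from Tensor-cell2cell or a published threshold.

```r
# Hypothetical context loadings for a single factor (illustrative values only)
context.loadings <- c(HC1 = 0.05, HC2 = 0.07, HC3 = 0.06,
                      M1 = 0.20, M2 = 0.22, M3 = 0.18,
                      S1 = 0.55, S2 = 0.50, S3 = 0.60)
sample.context <- c(rep("Healthy Control", 3), rep("Moderate", 3), rep("Severe", 3))

# Share of the factor's total loading carried by each sample
loading.share <- context.loadings / sum(context.loadings)

# Flag the factor if a single sample carries most of the loading
# (the 0.5 cutoff is an arbitrary assumption for this sketch)
dominated.by.one.sample <- max(loading.share) > 0.5

# Coefficient of variation among the biological replicates of each context;
# small values mean the replicates have similar loadings
within.context.cv <- tapply(context.loadings, sample.context,
                            function(x) sd(x) / mean(x))

print(dominated.by.one.sample)
print(round(within.context.cv, 3))
```

A factor that is flagged here, or that shows a large spread among replicates of the same context, would be worth inspecting for technical effects before interpreting it biologically.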

#### Benchmarking Batch Effects

Using simulated data and metrics of batch severity (kBET and NMI), we saw that Tensor-cell2cell is robust to batch effects approaching **XXX** severity (see benchmarking/batch_correction for details). Applying these metrics to your own dataset should help you determine whether batch correction is necessary prior to running Tensor-cell2cell.
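
As a rough illustration of one of these metrics, normalized mutual information (NMI) between two label vectors can be computed in base R. This is a generic sketch, not the benchmarking code: the function name `nmi`, the toy labels, and the choice to normalize by the mean of the two entropies are all assumptions made for this example.

```r
# Normalized mutual information between two label vectors (base R only).
# Normalizing by the mean of the two entropies is one common convention
# and is an assumption here.
nmi <- function(a, b) {
  joint <- table(a, b) / length(a)          # joint label frequencies
  pa <- rowSums(joint)                      # marginal of a
  pb <- colSums(joint)                      # marginal of b
  entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
  nz <- joint > 0                           # skip empty cells to avoid log(0)
  mi <- sum(joint[nz] * log(joint[nz] / outer(pa, pb)[nz]))
  denom <- (entropy(pa) + entropy(pb)) / 2
  if (denom == 0) return(0)
  mi / denom
}

# Toy labels: two batches of four cells each
batch <- rep(c("b1", "b2"), each = 4)
clusters.by.batch <- rep(c("c1", "c2"), each = 4)   # clusters mirror the batches
clusters.mixed <- rep(c("c1", "c2"), times = 4)     # clusters independent of batch

print(nmi(batch, clusters.by.batch))
print(nmi(batch, clusters.mixed))
```

In this toy setup, cluster labels that simply mirror the batch labels give the maximal score, while cluster labels that are independent of batch give the minimal one.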

**add png here**

#### Introduction of Negative Counts

Since CCC uses gene expression values to infer communication, a prerequisite for selecting a batch correction method is that it returns a corrected counts matrix rather than a latent/reduced space representation (see [Table 1](https://doi.org/10.1093/nargab/lqac022) for examples).

Additionally, most batch correction methods that return a counts matrix introduce negative counts. In the Application section below, we show a simple example with a batch correction method that 1) returns a corrected counts matrix, and 2) returns non-negative counts. First, we discuss the problems that negative counts cause in CCC.

* Problem 1: Negative expression values can distort scoring functions that multiply ligand and receptor expression. For example, if both the ligand and the receptor have negative counts, their product is a positive communication score, which would be interpreted as a strong interaction.
* Problem 2: Negative expression values can yield negative communication scores, which the non-negative tensor decomposition algorithm used by Tensor-cell2cell disregards in its optimization. 
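
Both problems can be seen with a few made-up numbers. The expression values below are hypothetical, and the product and mean are stand-ins for multiplicative and additive scoring functions in general rather than the scoring function of any particular tool.

```r
# Hypothetical batch-corrected expression values (not from the BALF dataset)
ligand <- -0.5
receptor <- -0.4

# Problem 1: two negative values multiply to a positive score
product.score <- ligand * receptor

# An additive score (here, the mean) stays negative for the same pair
mean.score <- mean(c(ligand, receptor))

# Problem 2: a well-expressed ligand paired with a negative receptor value
# gives a negative multiplicative score
mixed.score <- 2.0 * receptor

print(c(product = product.score, mean = mean.score, mixed = mixed.score))
```

The additive score keeps the pair on the low end of the scale, whereas the product makes two negative values look like a real interaction.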

Regardless, we show that Tensor-cell2cell can robustly identify communication patterns even in the presence of negative counts introduced during batch correction (see benchmarking/batch_correction for details). This is likely because negative counts and communication scores represent lower-strength interactions that do not strongly influence the overall communication. If your preferred batch correction method introduces negative counts, follow these recommendations to address the problems above:
* Recommendation 1: Try using methods that have additive rather than multiplicative functions for scoring of ligand-receptor pairs. 
* Recommendation 2a: If the scoring method cannot handle negative values, replace them with NaN. These values correspond to lowly expressed genes, so disregarding their communication scores is acceptable.
* Recommendation 2b: If the scoring method can handle negative values, the final tensor will contain negative values. Use a mask to have Tensor-cell2cell disregard these values when running the decomposition. Assuming an additive scoring function was used, these are lower-strength communication scores, so disregarding them is acceptable.
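
Recommendations 2a and 2b can be sketched in base R as follows. The matrix, the score array, and the convention that a mask value of 1 means "use" and 0 means "ignore" are assumptions made for this illustration; check the Tensor-cell2cell documentation for how a mask should be supplied.

```r
# Recommendation 2a: replace negative expression values with NaN
# (hypothetical genes x cells matrix of batch-corrected values)
expr <- matrix(c(1.2, -0.3, 0.0, 2.5, -1.1, 0.7), nrow = 2,
               dimnames = list(c("geneA", "geneB"), c("cell1", "cell2", "cell3")))
expr.masked <- expr
expr.masked[expr.masked < 0] <- NaN

# Recommendation 2b: build a mask with the same shape as a (hypothetical)
# 3D array of communication scores; 1 = keep, 0 = ignore (assumed convention)
scores <- array(c(0.8, -0.2, 0.5, -0.1, 0.3, 0.9, -0.4, 0.6), dim = c(2, 2, 2))
mask <- (scores >= 0) * 1

print(expr.masked)
print(mask)
```

Multiplying the logical array by 1 keeps the dimensions of `scores`, so the mask lines up with the tensor element by element.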



### Application 

In [5]:
library(Seurat, quietly = TRUE)

seed <- 888
set.seed(seed)
data.path <- '/data3/hratch/ccc_protocols/'

First, let's load our normalized expression data from Tutorial 1:

In [6]:
balf.samples<-readRDS(paste0(data.path, 'interim/covid_balf_norm.rds'))

## Simple Example: Batch corrected counts with only non-negative values

[scVI](https://doi.org/10.1038/s41592-018-0229-2) implements a batch correction method that can return non-negative corrected counts, and it has also been [benchmarked](https://doi.org/10.1038/s41592-021-01336-8) to work well.

We format our list by [merging](https://satijalab.org/seurat/articles/merge_vignette.html) it into a single Seurat Object:

In [7]:
library(RCurl)

library(sceasy)
library(reticulate)
scvi <- import("scvi", convert = FALSE)

In [8]:
balf.combined<- merge(balf.samples[[1]], y = balf.samples[2:length(balf.samples)], add.cell.ids = names(balf.samples))

Since CCC inference tools only consider a subset of the genes (those present in ligand-receptor databases), we do not filter for highly variable genes as this would exclude too many LRs and decrease the power of communication inference. 

However, if runtime with scVI is a concern, we can conduct the following optional step prior to batch correction: filtering for only genes present in the LR database that you will use for communication scoring. Here, we use the [CellChat](https://doi.org/10.1038/s41467-021-21246-9) database as an example. 

**To do**: can we change this to be from LIANA directly?

In [9]:
# optional step: 

# get the CellChatDB
hl <- RCurl::getURL('https://raw.githubusercontent.com/LewisLabUCSD/Ligand-Receptor-Pairs/master/Human/Human-2020-Jin-LR-pairs.csv')
lr.pairs <- read.csv(text = hl)

# separate complexes (subunits are joined by '&') and pool ligands and receptors
receptors <- unlist(strsplit(as.character(lr.pairs$receptor_symbol), '&', fixed = TRUE))
ligands <- unlist(strsplit(as.character(lr.pairs$ligand_symbol), '&', fixed = TRUE))
lrs <- sort(unique(c(receptors, ligands)))

# subset to the LR genes present in the dataset
balf.combined <- subset(balf.combined, features = lrs[lrs %in% rownames(balf.combined)])

In [10]:
balf.combined

An object of class Seurat 
765 features across 63103 samples within 1 assay 
Active assay: RNA (765 features, 0 variable features)

In [11]:
head(balf.combined@assays$RNA@counts)

   [[ suppressing 34 column names ‘C100_AAACCCACAGCTACAT-1’, ‘C100_AAACCCATCCACGGGT-1’, ‘C100_AAACCCATCCCATTCG-1’ ... ]]



6 x 63103 sparse Matrix of class "dgCMatrix"
                                                                          
ACKR2  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ACKR3  . . . 5 . . . . . 6 . . . . . . . 1 5 1 . . . . 2 . . 1 . 1 . . . .
ACKR4  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
ACVR1  1 . . . . . . 1 . 2 1 . . . . . . . . 1 1 . . . . . . 1 . . . . 1 .
ACVR1B . . . . . . . . . . . . . . . . . . . . . 1 . . . . 1 . . . . . . .
ACVR1C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
             
ACKR2  ......
ACKR3  ......
ACKR4  ......
ACVR1  ......
ACVR1B ......
ACVR1C ......

 .....suppressing 63069 columns in show(); maybe adjust 'options(max.print= *, width = *)'
 ..............................

Next, we can run scVI according to the [tutorial](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/scvi_in_R.html):

In [12]:
balf.combined<-sceasy::convertFormat(balf.combined, from ="seurat", to="anndata", main_layer="counts", 
                                     drop_single_values=FALSE)

In [8]:
set.seed(0) # seeds R only; scVI's Python-side training remains stochastic
scvi$model$SCVI$setup_anndata(balf.combined, batch_key = 'Sample.ID')
model <- scvi$model$SCVI(balf.combined, n_layers = 2L, n_latent = 30L, gene_likelihood = "nb") # non-default args
model$train()

None

None

scVI's batch corrected matrix has the added benefit of being formatted like a depth-normalized matrix. Transforming this with log1p will put it in a similar format as log(1+CPM).
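
For reference, the log(1+CPM) format mentioned above can be reproduced on a toy matrix in base R. The counts below are made up, and this sketch shows only the depth normalization and log1p steps, without any batch correction.

```r
# Hypothetical raw counts (genes x cells)
counts <- matrix(c(10, 0, 90,
                   5, 5, 40), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))

# Depth normalization: scale each cell (column) to one million total counts
cpm <- sweep(counts, 2, colSums(counts), "/") * 1e6

# log1p transformation, i.e. log(1 + CPM)
log.cpm <- log1p(cpm)

print(colSums(cpm))
print(round(log.cpm, 2))
```

Setting `library_size = 1e6` in scVI plays the role of the depth normalization step here, so applying `log1p` to its output gives values on a comparable scale.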

In [9]:
# library size and log1p make it similar to log(1+CPM) normalization, but with batch correction
# batch corrected counts: https://discourse.scverse.org/t/how-to-extract-batch-corrected-expression-matrix-from-trained-scvi-vae-model/151

contexts <- sort(unlist(unname(unique(reticulate::py_to_r(balf.combined$obs$orig.ident)))))
corrected.data <- model$get_normalized_expression(transform_batch = contexts,
                                                  library_size = 1e6) # depth normalization
corrected.data <- t(log1p(reticulate::py_to_r(corrected.data))) # log1p transformation, transposed to genes x cells
write.csv(corrected.data, paste0(data.path, 'interim/R_scvi_corrected_counts.csv')) # data.path as defined above
head(corrected.data)

Unnamed: 0,C100_AAACCCACAGCTACAT-1,C100_AAACCCATCCACGGGT-1,C100_AAACCCATCCCATTCG-1,C100_AAACGAACAAACAGGC-1,C100_AAACGAAGTCGCACAC-1,C100_AAACGAAGTCTATGAC-1,C100_AAACGAAGTGTAGTGG-1,C100_AAACGCTGTCACGTGC-1,C100_AAACGCTGTTGGAGGT-1,C100_AAAGAACTCTAGAACC-1,⋯,C52_TTTGTCAGTGTCAATC-1,C52_TTTGTCAGTGTGAAAT-1,C52_TTTGTCATCAGTTAGC-1,C52_TTTGTCATCCAGTATG-1,C52_TTTGTCATCCCTAATT-1,C52_TTTGTCATCGATAGAA-1,C52_TTTGTCATCGGAAATA-1,C52_TTTGTCATCGGTCCGA-1,C52_TTTGTCATCTCACATT-1,C52_TTTGTCATCTCCAACC-1
ACKR2,1.283241,2.014318,1.1279008,0.2590926,0.3994529,1.923254,0.4924334,1.420789,0.7050416,0.2992088,⋯,0.9756578,0.7842626,1.252644,1.275348,0.7998077,1.336386,1.938972,0.7883438,0.8397672,0.4684761
ACKR3,5.225464,5.081991,3.2819411,6.4156503,5.4199048,2.955636,4.4852005,3.531504,5.8149713,6.9652176,⋯,3.9997808,4.1867009,4.507971,3.874483,3.556751,4.037932,3.59541,3.63284,5.1635154,6.2748763
ACKR4,1.270339,3.800042,0.9486992,0.5444351,1.774679,2.982699,0.9904066,2.485552,1.8266945,1.7041898,⋯,1.541676,1.149508,2.011045,1.164759,2.4805874,3.03931,1.793038,1.6229188,1.0483807,1.143764
ACVR1,6.221755,6.519434,5.9303213,5.2539786,5.4150169,5.154817,5.8265839,5.560836,5.4222423,5.7798103,⋯,5.2057511,4.9446762,5.995573,5.401917,5.4384499,5.121652,5.443285,5.1498417,5.7672026,5.3364292
ACVR1B,4.020944,6.719118,4.7649436,5.3688274,6.2614555,4.467631,6.0154724,4.592467,6.2905916,5.7646415,⋯,5.3702008,5.018803,5.825969,6.594711,5.6006534,5.515707,5.754275,5.7002365,5.242444,5.829326
ACVR1C,3.447239,4.362259,4.7644632,1.508062,1.1731801,2.356233,1.0052726,2.494201,2.053203,1.6715363,⋯,1.5956861,1.1101224,2.032625,1.644986,2.168508,1.38359,1.155425,0.9965123,2.8551377,1.4848984


This corrected data matrix can replace the log(1+CPM) matrix used in Tutorial 02 onwards for downstream analyses, if desired. Note that the outputs will not be identical to those of the companion Python tutorial, due to stochastic steps in scVI.

**To do: should I show how to replace this in the actual Seurat object?**

## Complex Example: Batch corrected counts containing negative values