For use with Tensor-cell2cell, we want a dataset that represents more than two contexts and that contains [replicates](https://www.nature.com/articles/nmeth.3091). Replicates let us check that the output factors are not simply due to technical effects (for example, a factor with high loadings for just one replicate in the context dimension). We will use a [BALF COVID dataset](https://doi.org/10.1038/s41591-020-0901-9), which contains 12 samples associated with "Healthy Control", "Moderate", or "Severe" COVID contexts. This dataset does not contain technical replicates, since each sample was taken from a different patient, but the samples associated with the same context serve as biological replicates.

[Batch correction](https://www.nature.com/articles/s41592-018-0254-1) removes technical variation while preserving biological variation between samples. Once appropriate batch correction has removed technical variation, we can reasonably assume that the biological variation between contexts will be greater than the variation within contexts. Thus, we expect Tensor-cell2cell to capture overall communication trends that differ between contexts. We can then assess whether output factors are simply due to technical effects by checking that they have similar loadings for biological replicates and do not have high loadings for just one sample in the context dimension.
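
The replicate check described above can be sketched in base R. This is only an illustration on made-up numbers: the loadings matrix, the sample labels, and the 0.5 cutoff are all hypothetical, and real loadings would come from a Tensor-cell2cell decomposition rather than being typed in by hand.

```r
# Toy context loadings: rows are samples, columns are factors (all values are made up).
context.loadings <- matrix(
  c(0.30, 0.28, 0.33, 0.05, 0.04, 0.06,   # Factor1: similar across the "Severe" replicates
    0.02, 0.03, 0.90, 0.02, 0.01, 0.02),  # Factor2: driven by a single sample
  ncol = 2,
  dimnames = list(paste0("S", 1:6), c("Factor1", "Factor2"))
)
contexts <- c(rep("Severe", 3), rep("Control", 3))  # hypothetical context labels

# Share of each factor's total loading carried by its top sample.
top.share <- apply(context.loadings, 2, function(v) max(v) / sum(v))

# Flag factors where one sample carries more than half of the loading (arbitrary cutoff).
suspect <- names(top.share)[top.share > 0.5]
print(round(top.share, 2))
print(suspect)

# Mean loading per context, to compare against the individual replicates.
context.means <- rowsum(context.loadings, contexts) / as.vector(table(contexts))
print(round(context.means, 2))
```

A factor flagged this way is not necessarily technical, but it is a candidate for closer inspection before interpreting it biologically.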

Finally, we apply a batch correction. The goal here is to account for sample-to-sample technical variability. 

At this point, we diverge from the Python preprocessing tutorial in order to leverage Seurat's built-in batch correction functions. To decrease run time, we will use reciprocal PCA instead of CCA. See https://satijalab.org/seurat/articles/integration_introduction.html and Seurat's other integration vignettes for additional details. To apply ComBat as in scanpy, see the commented code further below.

Note that the final input matrices to Tensor-cell2cell must be non-negative. Batch correction can introduce negative values; we will demonstrate workarounds for these in the tensor building tutorial.
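
As a quick sanity check for the non-negativity requirement, the sketch below counts negative entries in a matrix. It uses a small made-up matrix as a stand-in; on the real object, the corrected values would presumably be extracted from the integrated assay first (for example with Seurat's `GetAssayData`), which is an assumption not shown here.

```r
# Toy stand-in for a batch-corrected expression matrix (genes x cells); values are made up.
corrected <- matrix(c(0.8, -0.1, 0.0, 1.2, -0.3, 0.5), nrow = 2,
                    dimnames = list(c("geneA", "geneB"), paste0("cell", 1:3)))

# Count the negative entries and the fraction of the matrix they represent.
n.negative <- sum(corrected < 0)
frac.negative <- mean(corrected < 0)

# Summarize before deciding whether a workaround is needed.
cat(sprintf("%d negative entries (%.1f%% of values), minimum = %.2f\n",
            n.negative, 100 * frac.negative, min(corrected)))
```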

In [None]:
batch.var <- 'Sample.ID' # the batch variable in the metadata

In [None]:
# get the HVGs for each sample separately
balf.samples <- lapply(balf.samples, 
                      function(so) FindVariableFeatures(so, selection.method = "vst", nfeatures = 2000))
                       
# find the common HVGs across samples
integration.features <- SelectIntegrationFeatures(object.list = balf.samples)

                       
# # to use CCA instead of reciprocal PCA, run the commented anchor-finding call (lines 10-12) instead of the per-sample PCA and rpca anchor steps (lines 14-22)
# # find the integration anchors
# integration.anchors <- FindIntegrationAnchors(object.list = balf.samples, 
#                                               anchor.features = integration.features)    
                       
# calculate PCA on each sample separately
balf.samples <- lapply(X = balf.samples, FUN = function(x) {
    x <- ScaleData(x, features = integration.features, verbose = FALSE)
    RunPCA(x, features = integration.features, verbose = FALSE)  # last expression is the returned object
})

# find the integration anchors
integration.anchors <- FindIntegrationAnchors(object.list = balf.samples, 
                                              anchor.features = integration.features, reduction = "rpca")

# do the batch correction
balf.corrected <- IntegrateData(anchorset = integration.anchors)

In [None]:
# 2000 top variable features were already calculated

# get PCA to 100 PCs
balf.corrected <- ScaleData(balf.corrected, verbose = FALSE)
balf.corrected <- RunPCA(balf.corrected, npcs = 100, verbose = FALSE)