# Normalization

## Motivation

Up to this point, we have removed doublets and low-quality cells from the dataset, and the data is available as a count matrix. These counts represent the capture, reverse transcription and sequencing of molecules in the scRNA-seq experiment. Each of these steps adds a degree of variability to the measured count depth of otherwise identical cells, so differences in gene expression between cells in the count data might simply be due to sampling effects. The preprocessing step of "normalization" addresses this problem. Several normalization techniques of varying complexity are used in practice. We highlight three popular ones: proportional fitting with log plus one transformation (log1pPF), scran normalization with log plus one transformation, and sctransform normalization. A complete and independent benchmark comparing all common normalization techniques is still an open research task. Existing comparisons of normalization techniques and their impact on downstream analysis tasks are either incomplete (not covering all existing methods) or reach different conclusions {cite}`Booeshaghi2022, germain_pipecomp_2020, Crowell2020, Brown2021`. We therefore introduce the reader to log1pPF, scran and sctransform normalization and recommend assessing the results and impact of the chosen normalization technique during downstream clustering.

We first import all required Python packages and load the dataset from which we removed ambient RNA and doublets and filtered out low-quality cells.

In [1]:
import scanpy as sc
import anndata2ri
import logging
from scipy.sparse import issparse

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

sc.settings.verbosity = 0
sc.settings.set_figure_params(
    dpi=80,
    facecolor="white",
    # color_map="YlGnBu",
    frameon=False,
)

rcb.logger.setLevel(logging.ERROR)
ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [2]:
# switch to figshare afterwards
adata = sc.read("s4d8_subset_gex_qc.h5ad")

## log1p proportional fitting

One common approach is equalizing the count depth for all cells with subsequent variance stabilization through log plus one (log1p) transformation. Count depth scaling normalizes the data to a “size factor” such as ten thousand (CP10k) or one million (CPM, counts per million). CPM normalization assumes that all cells in the dataset initially contained an equal number of molecules and that differences in count depth are only due to sampling. However, as datasets usually consist of heterogeneous cell populations with different cell sizes and molecule counts, more complex normalization methods are needed.

A method similar to log1pCP10k or log1pCPM is log1p proportional fitting (log1pPF). Log1pPF describes cell depth normalization to the mean cell depth followed by log plus one transformation. This normalization technique was adapted from bulk RNA-seq. Booeshaghi et al. adjusted this normalization technique by adding another step of proportional fitting, yielding PFlog1pPF. They showed that PFlog1pPF
* Effectively stabilizes variance
* Leads to a low fraction of false-positive differentially expressed genes and
* Recovers within cell-type Spearman correlation

log1pPF and PFlog1pPF can be easily computed with scanpy. 

In [3]:
# proportional fitting of cell depth; with target_sum=None, scanpy scales each cell
# to the median cell depth rather than the mean
proportional_fitting = sc.pp.normalize_total(adata, target_sum=None, inplace=False)
# log1p transform
adata.layers["log1pPF_normalization"] = sc.pp.log1p(proportional_fitting["X"])
# proportional fitting
adata.layers["PFlog1pPF_normalization"] = sc.pp.normalize_total(
    adata, target_sum=None, layer="log1pPF_normalization", inplace=False
)["X"]
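To make the arithmetic behind log1pPF and PFlog1pPF explicit, here is a minimal NumPy sketch on a toy matrix. It follows the definition in the text (scaling to the mean cell depth) and assumes that every cell has a non-zero depth, which should hold after quality control. The names `proportional_fitting`, `toy_counts`, `toy_log1p_pf` and `toy_pf_log1p_pf` are illustrative and not part of scanpy. Note that scanpy's `target_sum=None` scales to the median cell depth, so values from the cell above will differ slightly from this sketch.

```python
import numpy as np


def proportional_fitting(counts: np.ndarray) -> np.ndarray:
    """Scale each cell (row) so that its total equals the mean total across cells."""
    depth = counts.sum(axis=1, keepdims=True)
    return counts * (depth.mean() / depth)


# toy cells x genes matrix with three cells of different depth (40, 4 and 20)
toy_counts = np.array(
    [
        [10.0, 0.0, 30.0],
        [1.0, 2.0, 1.0],
        [5.0, 5.0, 10.0],
    ]
)

# log1pPF: proportional fitting followed by log1p
toy_log1p_pf = np.log1p(proportional_fitting(toy_counts))
# PFlog1pPF: a second round of proportional fitting on the log1p values
toy_pf_log1p_pf = proportional_fitting(toy_log1p_pf)

print(toy_pf_log1p_pf.sum(axis=1))  # all rows now share the same total
```

The second round of proportional fitting equalizes the per-cell totals again, because the log1p transformation re-introduces depth differences between cells.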

## scran normalization

The scran method leverages a deconvolution approach. Cells are partitioned into pools, the counts of the cells in each pool are summed, and a size factor is estimated for each pool. The resulting system of linear equations is then solved to obtain size factors for individual cells.
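The pooling and deconvolution idea can be illustrated with a small NumPy sketch. This is a strongly simplified toy version under stated assumptions, not scran's implementation: the data is noise-free, the pools are sliding windows of two and three cells over a ring of six cells, and all names (`toy`, `true_factors`, `estimated`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells = 50, 6

# noise-free toy data: each cell is the same expression profile scaled by its size factor
gene_means = rng.uniform(1.0, 20.0, size=n_genes)
true_factors = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 2.0])
toy = np.outer(true_factors, gene_means)  # cells x genes

# reference pseudo-cell: average profile across all cells
reference = toy.mean(axis=0)

rows, pooled_factors = [], []
for pool_size in (2, 3):
    for start in range(n_cells):
        members = [(start + k) % n_cells for k in range(pool_size)]
        # sum the counts of the pooled cells and compare them to the reference
        pool_profile = toy[members].sum(axis=0)
        pooled_factors.append(np.median(pool_profile / reference))
        # each pool contributes one equation: sum of member size factors = pool factor
        indicator = np.zeros(n_cells)
        indicator[members] = 1.0
        rows.append(indicator)

# deconvolve the pool-level factors into per-cell size factors via least squares
estimated, *_ = np.linalg.lstsq(np.array(rows), np.array(pooled_factors), rcond=None)

print(np.round(estimated, 3))
print(np.round(true_factors / true_factors.mean(), 3))
```

In this idealized setting, the estimated factors match the true size factors up to scaling by their mean. Pooling helps on real data because summed counts contain far fewer zeros than the counts of individual cells.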


We first load the additionally required Python and R packages.

In [4]:
from scipy.sparse import csr_matrix, issparse

In [5]:
%%R
library(scran)
library(BiocParallel)

scran benefits from a coarse clustering as input, which improves the size factor estimation. In this tutorial, we use a simple preprocessing approach and cluster the data at a low resolution to obtain an input for the size factor estimation. The basic preprocessing assumes that all size factors are equal (library size normalization to counts per million, CPM) and log-transforms the count data.

In [6]:
# Preliminary clustering for differentiated normalisation
adata_pp = adata.copy()
# normalize_per_cell is deprecated in scanpy; normalize_total is its replacement
sc.pp.normalize_total(adata_pp, target_sum=1e6)
sc.pp.log1p(adata_pp)
sc.pp.pca(adata_pp, n_comps=15)
sc.pp.neighbors(adata_pp)
sc.tl.leiden(adata_pp, key_added="groups")

We now add `data_mat` and our computed groups into our R environment. 

In [7]:
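# scran estimates size factors from raw counts, so we pass the unnormalized
# count matrix (genes x cells) rather than the log-transformed copy
data_mat = adata.X.T
# convert to CSC if possible. See https://github.com/MarioniLab/scran/issues/70
if issparse(data_mat):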
    if data_mat.nnz > 2**31 - 1:
        data_mat = data_mat.tocoo()
    else:
        data_mat = data_mat.tocsc()
ro.globalenv["data_mat"] = data_mat
ro.globalenv["input_groups"] = adata_pp.obs["groups"]

We can now delete the copy of our AnnData object, as we have obtained everything needed to run scran.

In [8]:
del adata_pp

We now compute the size factors based on the groups of cells we calculated before. 

In [9]:
%%R -o size_factors

size_factors = sizeFactors(
    computeSumFactors(
        SingleCellExperiment(list(counts = data_mat)),
        clusters = input_groups,
        min.mean = 0.1,
        BPPARAM = MulticoreParam()
    )
)

We save `size_factors` in `.obs` and are now able to normalize the data and subsequently apply a log1p transformation.

In [10]:
adata.obs["size_factors"] = size_factors
scran = adata.X / adata.obs["size_factors"].values[:, None]
adata.layers["scran_normalization"] = csr_matrix(sc.pp.log1p(scran))
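Dividing a sparse matrix by a dense column vector, as in the cell above, typically produces a dense intermediate, which can be memory-intensive for large datasets. A possible alternative, shown here as a sketch on toy data, is to scale the rows with a sparse diagonal matrix so that the result stays sparse throughout. The names `toy_counts`, `toy_size_factors` and `toy_norm` are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, issparse

# toy cells x genes count matrix and hypothetical per-cell size factors
toy_counts = csr_matrix(
    np.array(
        [
            [4.0, 0.0, 2.0],
            [0.0, 3.0, 0.0],
            [1.0, 0.0, 5.0],
        ]
    )
)
toy_size_factors = np.array([2.0, 1.0, 0.5])

# multiplying by diag(1 / size_factor) divides each row by its size factor
scaled = csr_matrix(diags(1.0 / toy_size_factors) @ toy_counts)
# log1p maps zero to zero, so it can be applied to the stored values only
toy_norm = scaled.log1p()

print(toy_norm.toarray())
```

Because log1p leaves zeros unchanged, the sparsity pattern of the count matrix is preserved in the normalized layer.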

## scTransform normalization

We now introduce the reader to normalization with *sctransform*. Sctransform was motivated by the observation that cell-to-cell variation in scRNA-seq data might confound biological heterogeneity with technical effects. The method uses Pearson residuals from a 'regularized negative binomial regression' to model technical noise in the data. Sctransform includes the count depth as a covariate in a generalized linear model. {cite}`germain_pipecomp_2020` showed in an independent comparison of different normalization techniques that this method removed the impact of sampling effects while preserving cell heterogeneity in the dataset. Sctransform does not require downstream heuristic steps like pseudo count addition or log-transformation.

The output of sctransform is a matrix of normalized values that can be positive or negative. A negative residual for a given cell and gene indicates that fewer counts were observed than expected given the gene's average expression and the cell's sequencing depth. A positive residual indicates that more counts were observed than expected.
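The sign and scale of such residuals can be illustrated with a small NumPy sketch. It is a simplified stand-in and not the sctransform algorithm: instead of fitting a regularized negative binomial regression per gene, it derives the expected count directly from the cell depth and the gene total, and it uses a fixed overdispersion parameter `theta` whose value is chosen arbitrarily here. The function name `pearson_residuals_sketch` is hypothetical.

```python
import numpy as np


def pearson_residuals_sketch(counts: np.ndarray, theta: float = 100.0) -> np.ndarray:
    """Pearson residuals under a negative binomial model with fixed overdispersion.

    Assumes a cells x genes matrix in which every cell and gene has a non-zero total.
    """
    cell_depth = counts.sum(axis=1, keepdims=True)
    gene_total = counts.sum(axis=0, keepdims=True)
    # expected count given the cell's depth and the gene's share of all counts
    expected = cell_depth * gene_total / counts.sum()
    # negative binomial variance: mu + mu^2 / theta
    return (counts - expected) / np.sqrt(expected + expected**2 / theta)


toy_counts = np.array(
    [
        [10.0, 0.0, 5.0],
        [20.0, 2.0, 8.0],
        [1.0, 3.0, 1.0],
    ]
)
toy_residuals = pearson_residuals_sketch(toy_counts)

print(np.round(toy_residuals, 2))
```

In the toy matrix, the zero count of the second gene in the first cell yields a negative residual because about 1.5 counts would be expected, while the three counts of the same gene in the shallow third cell yield a positive residual.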

We first load the additionally required R packages.

In [11]:
%%R
library(Seurat)
library(sctransform)

The dataset used in this chapter has a sparse representation of the count matrix, so we sort the indices and add the AnnData object to the R environment.

In [12]:
if issparse(adata.X):
    if not adata.X.has_sorted_indices:
        adata.X.sort_indices()
ro.globalenv["adata"] = adata

The object is now transformed into a Seurat object with the original expression annotated as "RNA", and we can call sctransform with the "glmGamPoi" method. Sctransform allows the user to keep only variable genes; we set this option to `FALSE` as we are only interested in the normalization of the data. We additionally set the number of subsampled cells `ncells` to 3000 to reduce the runtime of sctransform. This number of cells is used to fit the NB regression, and the default is 5000. You can adjust this number based on the compute power you have at hand.

In [13]:
%%R
seurat_obj = as.Seurat(adata, counts="X", data = NULL)
seurat_obj = RenameAssays(seurat_obj, originalexp = "RNA")
res = SCTransform(object = seurat_obj, method = "glmGamPoi", ncells = 3000, return.only.var.genes = FALSE)



Sctransform stores the result in the "SCT" assay in the Seurat object. The "SCT" assay contains the following matrices:

* `res[["SCT"]]@scale.data` stores the normalized values, the residuals, which can be used as input to PCA. This matrix is non-sparse, so it is rather memory-costly for all genes. By setting the argument `return.only.var.genes` to `TRUE`, we can save memory, as sctransform will then only store variable genes. However, in this case the sctransform feature selection method is used, whereas {cite}`germain_pipecomp_2020` recommend using deviance, as described in the following chapter.

* sctransform additionally stores the 'corrected' UMI counts, which can be interpreted as the number of counts one would observe if all cells were sequenced to the same depth. They are stored in `res[["SCT"]]@counts`.

* `res[["SCT"]]@data` contains a log-normalized version of the corrected counts. They may be helpful for visualization, differential expression analysis and integration. Generally, the sctransform authors recommend using the residuals directly for downstream tasks.

We now extract the residuals and save them to the AnnData object. As `sctransform` returns a gene by cell matrix, we transpose it and save it as a new layer. The residuals can then be used for further downstream analysis steps.

In [14]:
norm_x = ro.r("res@assays$SCT@scale.data").T
adata.layers["scTransform_normalization"] = norm_x

We applied different normalization techniques to our dataset and saved them as separate layers in our AnnData object. Depending on the downstream analysis task, it can be favourable to use a differently normalized layer and assess the result.

In [15]:
adata.write("s4d8_subset_gex_qc_norm.h5ad")

## Key Takeaways

1. Try a simple normalization technique like log1pPF and assess the normalization result by visualizing it on a UMAP with respect to total counts and highly expressed genes in your dataset.
2. Try scran normalization and assess if rare cell populations can still be recovered.
3. If rare cell populations are removed with scran normalization, try scTransform.
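As a rough numeric complement to the visual UMAP inspection, one could check how strongly the first principal component of a normalized layer still tracks count depth. The sketch below is a swapped-in heuristic and not part of the workflow in this chapter; the function name `depth_pc1_correlation` and the toy data are hypothetical, and the function expects a dense cells x genes array.

```python
import numpy as np


def depth_pc1_correlation(normalized: np.ndarray, depth: np.ndarray) -> float:
    """Absolute Pearson correlation between the first PC scores and count depth."""
    centered = normalized - normalized.mean(axis=0)
    # PCA via singular value decomposition of the centered matrix
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pc1_scores = u[:, 0] * s[0]
    return float(abs(np.corrcoef(pc1_scores, depth)[0, 1]))


# toy example in which all variation is driven by depth
toy_depth = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_matrix = np.outer(toy_depth, np.array([1.0, 2.0, 0.5]))
toy_score = depth_pc1_correlation(toy_matrix, toy_depth)

print(round(toy_score, 3))
```

Applied to the layers created in this chapter, a value close to one would suggest that count depth still dominates the main axis of variation, whereas lower values suggest that more of the depth effect has been removed. Such a single number cannot replace inspecting the embedding itself.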

## References

```{bibliography}
:filter: docname in docnames
```