# Preprocessing & visualization introduction

This chapter will guide you through the steps of data preprocessing and visualization. Pre-processing aims to address the following analysis steps: quality control, normalization, feature selection, and dimensionality reduction. 

This chapter will additionally introduce the advanced topics of ambient RNA, single nuclei and interactive visualization. 

The starting point of this notebook are raw sequencing data generated by sequencing machines which are processed and aligned to obtain matrices of molecular counts, so called count matrices, or read counts (read matrices). The difference between cound and read matrices depends on whether unique molecular identifiers (UMIs) were included in the single-cell library construction protocol. 

Read and count matrices have the dimension number of barcodes x number of transcripts. It is important to note that the term "barcode" is used here instead of "cell" as a barcode might wrongly have tagged multiple cells (doublet) or might have not tagged any cell (empty droplet/well). We will elaborate more on this in the next section "Doublet detraction & quality control". 

### Imports
Before we introduce the first pre-processing step of doublet detection and quality control, we import the respective packages from python and R. 

In [1]:
import warnings
warnings.filterwarnings('ignore')

import scanpy as sc
import numpy as np
import pandas as pd

import anndata2ri
import logging

import rpy2.rinterface_lib.callbacks as rcb
import rpy2.robjects as ro

sc.settings.verbosity = 0
rcb.logger.setLevel(logging.ERROR)

ro.pandas2ri.activate()
anndata2ri.activate()

%load_ext rpy2.ipython

In [2]:
%%R
library(Seurat)
library(scDblFinder)
library(scater)
library(BiocParallel)

### Dataset

In [3]:
adata = sc.read('cellranger_neurips21_bmmc.h5ad')
adata.var_names_make_unique()
adata

Variable names are not unique. To make them unique, call `.var_names_make_unique`.


AnnData object with n_obs × n_vars = 131515 × 36601
    obs: 'site', 'donor'
    var: 'gene_ids', 'feature_types', 'genome'

The dataset has the shape `n_obs` 131,515 x `n_vars` 36,601, so barcodes x number of transcripts. We additionally inspect some further information in `.obs` and `.var` which we will neglect for now. 

# 2.1 Doublet Detection

Doublets are defined as two cells that are sequenced under the same cellular barcode, so for example if they were captured in the same droplet. We therefore used so far the term "barcode" instead if "cell". A doublet is called homotypic if it is formed by the same cell type (but from different individuals) and heterotypic otherwise. Homotypic doublets are not necessarily identifiable from count matrices and are often considered innocuous as they can be identified with cell hashing or SNPs. Hence, their identification is not the main goal of the doublet detection methods. 

Doublets formed from different cell types or states are called heterotypic. Their identification is crucial as they are most likedly misclassified and can lead to distorted downstream analysis steps. Hence, doublet detection and removal is typically an initial preprocessing step. Doublets can be either identified through their high number of reads and detected features, or with methods that create atrifical doublets and comparing these with each cell. Doublet detection methods are computationally efficient and several software packages exist for this analysis them. To mention a few: *scDblFinder*,  *scds*, *DoubletFinder* or *scran’s doubletCells*. These methods are not yet benchmarked  in an independent comparison, but compared in a pipeline setting by {cite}`germain_pipecomp_2020` that collects several preprocessung steps. scDblFinder showed to enhance downstream clustering results. 

#### scDblFinder

It proved to be benefical and computationally efficient with software packages such as *scDblFinder* or *scds*. We will demonstrate the usage of scDblFinder which is a Bioconductor package working on raw count matrices. 

The scDblFinder R package collects several methods for the detection of doublets in scRNA-seq data and identifies heterotypic doublets which are in many cases also the most critical ones as stated above. 

scDblFinder can be used for datasets with multiple samples. It will use this information to generate artificial doublets and the nearest neighbors are identifies for each sample separatly. The overall scoring is then performed globally and sample-specific doublet rates will be considered for the threshold.

scDblFinde expects an `SingleCellExperiment` object of shape number of transcripts x barcodes which does not contain empty drops, but other than that has not been filtered beore. We will now apply the method to the transposed raw count matrix `.X`. To use scDblFinder on multiple samples we additionally provide a vector of the sample ids and allow multithreading using the BPPARAM parameter. Both parameters can be ignored if your dataset only consists of one sample.

In [4]:
data_mat = adata.X.T
sample_ids = pd.factorize(adata.obs['donor'])[0]

We can now launch the doublet detection by using `data_mat` as input to scDblFinder within a SingleCellExperiment.

scBblFinder adds several columns to the colData of sce, three of them are in particular intersting for the user:

* `sce$scDblFinder.score` : the final doublet score (the higher the more likely that the cell is a doublet)

* `sce$scDblFinder.ratio` : the ratio of artificial doublets in the cell's neighborhood

* `sce$scDblFinder.class` : the classification (doublet or singlet)

We will only output the `class` argument. The other arguments can be added to the anndata object in a similar way. 

In [5]:
%%R -i data_mat -i sample_ids -o droplet_class

sce = scDblFinder(
    SingleCellExperiment(
        list(counts=data_mat),
    ), 
    samples=sample_ids, 
    BPPARAM=MulticoreParam(3), 
    verbose=TRUE
)
droplet_class = sce$scDblFinder.class

We now store the result in `.obs` and show the value counts for both classes singlet and doublet in the dataset.

In [6]:
adata.obs['scDblFinder_class'] = np.expand_dims(droplet_class, axis=1)
adata.obs['scDblFinder_class'].value_counts()

singlet    108183
doublet     23332
Name: scDblFinder_class, dtype: int64

The resulting classification into singlets and doublets is added to our anndata object. In our dataset, scDblFinder identified 58,264 droplets as doublets. Those are removed from the data.

In [7]:
print('Total number of cells: {:d}'.format(adata.n_obs))
adata = adata[adata.obs['scDblFinder_class']=='singlet'].copy()

print('Number of cells after filtering of doublets: {:d}'.format(adata.n_obs))

Total number of cells: 131515
Number of cells after filtering of doublets: 108183


## 2.2 Quality Control

Apart from doublets, the dataset might contain low-quality cells which are filtered by cell quality control (QC). Cell QC is typically performed on the following three QC covariates: 

* number of counts per barcode (count depth),

* number of genes per barcode, and

* fraction of counts from mitochondrial genes per barcode. 

In cell QC these covariates are filtered via thresholding as outlier barcodes might correspond to dying cells, so cells with broken membranes whose cytoplasmic mRNA has leaked out and therefore only the mRNA in the mitochondria is still present. These cells might then show a low count depth, few detected genes and a high fraction of mitochondrial reads.

However, it is important to consider the three QC covariates jointly as otherwise this might lead to misinterpretation of cellular signal. One example are cells with a relativly high fraction of mitochondrial counts which might be involved in respiratory processes and should not be filtered out. Another example are cells with low or high counts which might correspond to quiescent cell populations or cells larger in size. Therefore, it is prefered to consider multiple covariates when univariate thesholding decisions are made. In general, it is advised to exclude fewer cells and be as permissive as possible to avoid filtering out viable cell populations or small sub-populations. 

QC on datasets is often performed in a manual fashion by looking at the distribution of different QC covariates. However, as these datasets grow in size it might be worth to consider automatic thresholding via MAD (median absolute deviations). {cite}`germain_pipecomp_2020` proposed in their study on different preprocessing steps to mark cells as outliers if they differ by 5 MADs which relects a relatively permissive filtering strategy. 

In the analysis workflow, one first needs to calculate the QC covariates (QC metrics). We use the scanpy function `sc.pp.calculate_qc_metrics()` for this manner and additionally compute the log1p transformed values. 

In [8]:
sc.pp.calculate_qc_metrics(adata, inplace=True, percent_top=[20], log1p=True)

In [9]:
adata

AnnData object with n_obs × n_vars = 108183 × 36601
    obs: 'site', 'donor', 'scDblFinder_class', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes'
    var: 'gene_ids', 'feature_types', 'genome', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts'

Next, we define a function that allows automatic filtering based on MAD. This function takes a `metric`, i.e. a column in `.obs` and the number of MADs (nmad) that is still permissive within the filtering strategy. 

In [23]:
def isOutlier(metric, nmads):
    M = adata.obs[metric]
    outlier = (M < np.median(M) - nmads * M.mad()) | (np.median(M) + nmads * M.mad() < M)
    return outlier

We now apply this function to the `log1p_total_counts`, `log1p_n_genes_by_counts` and `pct_counts_in_top_20_genes` QC covariates each with a threshold of 5 MADs.

In [24]:
adata.obs['outlier'] = isOutlier('log1p_total_counts', 5) | isOutlier('log1p_n_genes_by_counts', 5) | isOutlier('pct_counts_in_top_20_genes', 5) 
adata.obs['outlier'].value_counts()

False    104883
True       3067
Name: outlier, dtype: int64

So we classify 3,067 cells as outlier with respect to these three metrics. 

As a next step, we calculate the percentage of mitochondrial counts as they are often associated with cell degradation. It is important to note that mitochondrial counts are annotated either with the prefix "mt-" or "MT-" depending on the species considered in the dataset. The dataset considered in this notebook is .... ?

In [None]:
mt_gene_mask = [gene.startswith('MT-') for gene in adata.var_names]
adata.obs['pct_counts_Mt'] = adata.X.toarray()[:, mt_gene_mask].sum(1)/adata.obs['total_counts']

`pct_counts_Mt` is filtered with 3 MADs. Additionally, cells with a percentage of mitochondrial counts exceding 8 % are filtered out.

In [25]:
adata.obs['mt_outlier'] = isOutlier('pct_counts_Mt', 3) | (adata.obs['pct_counts_Mt'] > 0.08)
adata.obs['mt_outlier'].value_counts()

False    101043
True       6907
Name: mt_outlier, dtype: int64

So we classify 6,907 cells as outlier with respect to mitochonrial reads.

We now filter our anndata object based on these two additional columns. 

In [26]:
adata = adata[(~adata.obs['outlier']) & (~adata.obs['mt_outlier'])].copy()

In [27]:
#Filter genes:
print('Total number of genes: {:d}'.format(adata.n_vars))

# Min 5 cells - filters out 0 count genes
sc.pp.filter_genes(adata, min_cells=5)
print('Number of genes after cell filter: {:d}'.format(adata.n_vars))

Total number of genes: 36601
Number of genes after cell filter: 29437


## 2.3 Normalization

### 2.3.1 *sctransform* normalization

### 2.3.2 *scran* normalization

## References

```{bibliography}
:filter: docname in docnames
```

In [30]:
import session_info
session_info.show()