## Gene set enrichment and pathway analysis 

1. Introduction
2. Types of gene set tests
3. Pathway/gene set collections
4. Gene set tests and pathway analysis
5. Technical considerations
    - Filtering out the gene sets with low number of genes 
    - Data normalisation
6. Take aways
7. Case study
8. References

### 1. Introduction

Differential gene expression analysis is almost always followed by *gene set enrichment analysis*, which aims to identify the molecular mechanisms, such as biological processes, gene ontology terms or regulatory pathways, that are over-represented in an experimental condition compared to control or other conditions, on the basis of differentially expressed (DE) genes.

Single-cell RNA-seq provides unprecedented insights into variations in cell types between conditions, tissue types, species and individuals. Often, it is of interest to identify pathways or biological processes enriched in a particular cell type, for example, in a disease condition compared to control. To determine the pathways enriched in a cell type-specific manner between two conditions, the relevant collection(s) of gene set signatures is first selected, where each gene set defines a biological process (e.g. epithelial to mesenchymal transition, metabolism etc.) or pathway (e.g. MAPK signalling). For each gene set in the collection, the DE genes present in the gene set are used to assess the enrichment. Depending on the type of enrichment test chosen, gene expression measurements may or may not be used to compute the test statistic.

In this chapter, we first provide an overview of different types of gene set enrichment tests, introduce some commonly used gene signature collections and discuss best practices for pathway enrichment analysis, or functional enrichment analysis in general.  

### 2. Types of gene set tests

Gene set tests can be *competitive* or *self-contained*, as defined by Goeman and Bühlmann (2007) {cite}`goeman2007analyzing`. A competitive gene set test asks whether the genes in the set are highly ranked in terms of differential expression relative to the genes not in the set. The sampling unit here is the gene, so the test can be done with a single sample, but it requires genes that are not in the set. In self-contained gene set testing, the sampling unit is the subject, so multiple samples per group are required, but genes outside the set are not needed. A self-contained gene set test asks whether the genes in the set are differentially expressed, without regard to any other gene measured in the dataset. Note that genes in biological data are correlated with one another, and only a few tests accommodate these inter-gene correlations. We will discuss these methods later.
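
To make the distinction concrete, the following is a minimal sketch on simulated data, not a reference implementation. The toy expression matrix, the use of a Wilcoxon rank-sum test as the competitive test and the sample-label permutation scheme as the self-contained test are all illustrative assumptions, and neither test accounts for inter-gene correlations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 genes x 12 samples (6 control, then 6 treated).
n_genes, n_per_group = 200, 6
expr = rng.normal(size=(n_genes, 2 * n_per_group))
labels = np.array([0] * n_per_group + [1] * n_per_group)

# The gene set of interest is the first 20 genes, shifted upwards in treated samples.
in_set = np.zeros(n_genes, dtype=bool)
in_set[:20] = True
expr[np.ix_(in_set, labels == 1)] += 2.0


def gene_t(expr, labels):
    """Per-gene two-sample t-statistics (treated vs control)."""
    return stats.ttest_ind(expr[:, labels == 1], expr[:, labels == 0], axis=1).statistic


t_obs = gene_t(expr, labels)

# Competitive test: genes are the sampling unit.
# Are the statistics of genes in the set shifted relative to genes outside the set?
comp_p = stats.mannwhitneyu(t_obs[in_set], t_obs[~in_set], alternative="two-sided").pvalue

# Self-contained test: samples are the sampling unit.
# Only genes in the set are used; the null is built by permuting sample labels.
obs_stat = np.mean(t_obs[in_set] ** 2)
n_perm = 500
null_stats = np.array(
    [np.mean(gene_t(expr[in_set], rng.permutation(labels)) ** 2) for _ in range(n_perm)]
)
self_p = (1 + np.sum(null_stats >= obs_stat)) / (1 + n_perm)

print(f"competitive p-value: {comp_p:.3g}")
print(f"self-contained p-value: {self_p:.3g}")
```

Note that the competitive test needs the genes outside the set but only one vector of gene-level statistics, whereas the self-contained test ignores genes outside the set but needs replicate samples to permute.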

### 3. Pathway/gene set collections

Gene sets are curated lists of gene names (or gene ids) that are known, from previous studies and/or experiments, to be involved in a biological process. The Molecular Signatures Database (MSigDB) {cite}`subramanian2005gene,liberzon2011molecular` is the most comprehensive database, consisting of 9 collections of gene sets. Some commonly used collections are C5, the gene ontology (GO) collection, and C2, a collection of curated gene signatures from published studies that are typically context (e.g. tissue, condition) specific but that also includes KEGG and REACTOME gene signatures. For cancer studies, the Hallmark collection is commonly used, and for immunologic studies the C7 collection is a common choice. Note that these signatures are mainly derived from bulk RNA-seq measurements and measure continuous phenotypes. More recently, with the widespread availability of scRNA-seq datasets, databases have emerged that provide curated marker lists derived from published single-cell studies and that define cell types in various tissues and species. These include CellMarker {cite}`zhang2019cellmarker` and PanglaoDB {cite}`franzen2019panglaodb`.

### 4. Gene set tests and pathway analysis

As with differential gene expression, researchers can run pathway analysis either at the single-cell level or on pseudo-bulk samples of cells. Note that we use the terms pathway analysis, pathway enrichment analysis, gene set enrichment analysis and functional analysis interchangeably in this chapter.

In scRNA-seq data analysis, gene set enrichment is generally carried out on clusters of cells or cell types, one at a time. Genes differentially expressed in a cluster or cell type are used to identify over-represented gene sets from the selected collection, using simple hypergeometric tests or Fisher's exact test (as in *Enrichr* {cite}`chen2013enrichr`, for example).
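
As a rough illustration of such an over-representation test, the sketch below computes the one-sided hypergeometric p-value for the overlap between a DE gene list and a single gene set, and checks that it matches the one-sided Fisher's exact test on the corresponding 2x2 table. The gene names, the set sizes and the helper `over_representation` are hypothetical; in practice the p-values across all tested gene sets would also be adjusted for multiple testing.

```python
from scipy.stats import fisher_exact, hypergeom

# Hypothetical toy inputs: 1000 measured genes, a 50-gene set, 100 DE genes.
universe = [f"g{i}" for i in range(1000)]
gene_set = set(universe[:50])
de_genes = set(universe[:15]) | set(universe[500:585])


def over_representation(de_genes, gene_set, universe):
    """One-sided over-representation test of a gene set among DE genes."""
    universe = set(universe)
    de = set(de_genes) & universe  # restrict both lists to measured genes
    gs = set(gene_set) & universe
    k = len(de & gs)               # DE genes that fall in the gene set

    # P(overlap >= k) when len(de) genes are drawn at random from the universe.
    p_hyper = hypergeom.sf(k - 1, len(universe), len(gs), len(de))

    # The same test expressed as a 2x2 contingency table.
    table = [
        [k, len(gs) - k],
        [len(de) - k, len(universe) - len(gs) - len(de) + k],
    ]
    _, p_fisher = fisher_exact(table, alternative="greater")
    return k, p_hyper, p_fisher


k, p_hyper, p_fisher = over_representation(de_genes, gene_set, universe)
print(f"overlap={k}, hypergeometric p={p_hyper:.3g}, Fisher p={p_fisher:.3g}")
```

The choice of `universe` matters: it should contain only genes that could have been called DE (for example, genes that passed filtering), otherwise the p-values are too optimistic.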

More commonly, *fgsea* {cite}`korotkevich2021fast` is used, which computes an enrichment score from signed statistics of the genes in the gene set, for example the t-statistics from a differential expression test. An empirical null distribution for the enrichment score is computed from random gene sets of the same size, and a p-value is computed to determine the significance of the enrichment score. The p-values are then adjusted for multiple hypothesis testing. *fgsea* is a computationally faster implementation of the well-established *Gene Set Enrichment Analysis (GSEA)* algorithm {cite}`subramanian2005gene`.
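
The idea can be sketched in a few lines of NumPy. This is a simplified, illustrative version of a GSEA-style weighted running-sum enrichment score with a null built from random gene sets of the same size; it is not the algorithm used internally by *fgsea*, and the simulated statistics, function names and parameters are assumptions made for the example.

```python
import numpy as np


def enrichment_score(gene_stats, in_set, weight=1.0):
    """GSEA-style running-sum enrichment score for one gene set."""
    order = np.argsort(-gene_stats)            # rank genes from most up to most down
    hits = in_set[order]
    w = np.abs(gene_stats[order]) ** weight
    hit_step = np.where(hits, w, 0.0)
    hit_step = hit_step / hit_step.sum()       # steps up at genes in the set
    miss_step = np.where(hits, 0.0, 1.0 / (~hits).sum())  # steps down elsewhere
    running = np.cumsum(hit_step - miss_step)
    return running[np.argmax(np.abs(running))]  # maximum deviation from zero


def gene_set_pvalue(gene_stats, in_set, n_perm=1000, seed=0):
    """Empirical p-value using random gene sets of the same size."""
    rng = np.random.default_rng(seed)
    es = enrichment_score(gene_stats, in_set)
    size = int(in_set.sum())
    null = np.empty(n_perm)
    for i in range(n_perm):
        random_set = np.zeros(gene_stats.size, dtype=bool)
        random_set[rng.choice(gene_stats.size, size=size, replace=False)] = True
        null[i] = enrichment_score(gene_stats, random_set)
    # Compare only against null scores of the same sign as the observed score.
    same_sign = null[np.sign(null) == np.sign(es)]
    p = (1 + np.sum(np.abs(same_sign) >= abs(es))) / (1 + same_sign.size)
    return es, p


# Hypothetical gene-level statistics: 500 genes, the first 25 form an up-regulated set.
rng = np.random.default_rng(1)
gene_stats = rng.normal(size=500)
in_set = np.zeros(500, dtype=bool)
in_set[:25] = True
gene_stats[in_set] += 2.0

es, p = gene_set_pvalue(gene_stats, in_set)
print(f"ES={es:.2f}, p={p:.3g}")
```

With plain permutation the smallest attainable p-value is limited by the number of random gene sets drawn, which is one reason dedicated implementations are preferred in practice.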

While these approaches test for gene set enrichment per cluster or cell type, that is, they yield p-values per cluster or group of cells, *decoupler* {cite}`badia2022decoupler`, a suite of tools for biological activity inference, additionally allows per-cell assessment of gene set enrichment, that is, per-cell p-values, simply using Fisher's exact test. Differential expression testing on groups of individual cells (the "per-cell grouping" approach) followed by a gene set enrichment test works well in scRNA-seq datasets that are free of complex variation such as batch effects or biological variation due to multiple experimental perturbations, subjects or patients. The false discovery rate can be very high if such variation is not properly accounted for in the differential tests (for example in t-tests), which consequently impacts the gene set enrichment results.

An alternative to the "per-cell grouping" approach described above is to create pseudo-bulk samples from single cells and use gene set enrichment methods developed for bulk RNA-seq. Several gene set enrichment tests, including the self-contained test *fry* and the competitive test *camera*, are implemented in *limma* {cite}`ritchie2015limma`, which uses linear models and empirical Bayes moderated test statistics to test for differential expression. Linear models can accommodate complex experimental designs (e.g. subjects, perturbations, batches, nested contrasts, interactions etc.) through the design matrix. In addition, the gene set tests implemented in *limma* account for inter-gene correlations. Gene set tests in *limma* can also be applied to (properly transformed) single-cell measurements without pseudo-bulk generation. However, there are currently no benchmarks that have assessed the accuracy of gene set test results when these methods are applied directly to cells.
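
Pseudo-bulk generation itself is a simple aggregation step. The sketch below sums raw counts over all cells that share the same donor and cell type; the simulated count matrix and the metadata column names `donor` and `cell_type` are hypothetical stand-ins for whatever sample and annotation columns a real dataset provides.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy data: raw counts for 300 cells x 5 genes, with per-cell metadata.
n_cells, n_genes = 300, 5
genes = [f"gene{i}" for i in range(n_genes)]
counts = pd.DataFrame(rng.poisson(2.0, size=(n_cells, n_genes)), columns=genes)
obs = pd.DataFrame(
    {
        "donor": rng.choice(["d1", "d2", "d3"], size=n_cells),
        "cell_type": rng.choice(["Monocyte", "T cell"], size=n_cells),
    }
)

# One pseudo-bulk sample per (donor, cell type): sum the raw counts of its cells.
pseudobulk = counts.groupby([obs["donor"], obs["cell_type"]]).sum()

# Keep track of how many cells contributed to each pseudo-bulk sample.
n_cells_per_sample = obs.groupby(["donor", "cell_type"]).size()

print(pseudobulk)
print(n_cells_per_sample)
```

The resulting samples-by-genes count matrix, together with sample-level metadata, has the shape that bulk RNA-seq workflows such as those in *limma* operate on. Raw counts, rather than normalised values, are summed here so that bulk normalisation can be applied afterwards.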

#### Gene set test vs. pathway activity inference 

Gene set tests assess whether a pathway is enriched, in other words over-represented, in one condition compared to others, say, in the monocyte population of healthy donors compared to severe COVID-19 patients. An alternative approach is to simply score the activity of a pathway or gene signature, in an absolute sense, in individual cells, rather than testing for differential activity between conditions. Widely used tools for inferring gene set activity in general (including pathway activity) in individual cells include *VISION* {cite}`detomaso2019functional`, *AUCell* {cite}`aibar2017scenic`, pathway overdispersion analysis using *Pagoda2* {cite}`fan2016characterizing,lake2018integrative` and the simple combined z-score {cite}`lee2008inferring`.
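
Of these, the combined z-score is simple enough to sketch directly: each gene is standardised across cells, and the z-scores of the genes in the set are summed per cell and divided by the square root of the set size. The example below is a minimal illustration on simulated, already normalised data; the matrix, the gene indices and the function name are assumptions.

```python
import numpy as np


def combined_z_score(expr, gene_idx):
    """Per-cell activity score for one gene set.

    expr: cells x genes matrix of normalised expression values.
    gene_idx: column indices of the genes in the set.
    Genes with zero variance must be removed beforehand to avoid division by zero.
    """
    sub = expr[:, gene_idx]
    z = (sub - sub.mean(axis=0)) / sub.std(axis=0, ddof=1)  # standardise each gene
    return z.sum(axis=1) / np.sqrt(len(gene_idx))


# Hypothetical toy data: 100 cells x 50 genes; the first 50 cells
# have higher expression of the 10 genes in the set.
rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 50))
gene_idx = np.arange(10)
expr[:50, :10] += 2.0

scores = combined_z_score(expr, gene_idx)
print(scores[:50].mean(), scores[50:].mean())
```

Because every gene is centred across all cells, the scores are relative to the cells included in the matrix: adding or removing a cell population changes the score of every other cell.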

*DoRothEA* {cite}`garcia2019benchmark` and *PROGENy* {cite}`schubert2018perturbation` are among the functional analysis tools originally developed to infer transcription factor (TF)-target activities in bulk RNA-seq data. Holland et al. {cite}`holland2020robustness` found that the bulk RNA-seq methods *DoRothEA* and *PROGENy* have optimal performance in simulated scRNA-seq data, and even partially outperform tools specifically designed for scRNA-seq analysis, despite the drop-out events and low library sizes in single-cell data. Holland et al. also concluded that pathway and TF activity inference is more sensitive to the choice of gene sets than to the choice of statistical method. In contrast to Holland et al., Zhang et al. {cite}`zhang2020benchmarking` found that single-cell-based tools outperform bulk-based methods in terms of accuracy, stability and scalability. It should be noted that pathway and gene set activity inference tools inherently do not account for batch effects or for biological variation other than the biological variation of interest.

Furthermore, while the tools mentioned here score every gene set in individual cells, they cannot select the most biologically relevant gene sets among all scored gene sets. scDECAF {cite} is a gene set activity inference tool that allows data-driven selection of the most informative gene sets, which helps to dissect meaningful cellular heterogeneity.

%# TODO: scDECAF citation

### 5. Technical considerations

#### Filtering out the gene sets with low number of genes 

A common practice is to exclude, in the pre-processing step, any gene set that has only a small number of genes overlapping the data, or a small overlap with the Highly Variable Genes (HVG) if the data have been subsetted to these genes. Zhang et al. {cite}`zhang2020benchmarking` found that the performance of both single-cell-based and bulk-based methods drops as gene coverage, that is, the number of genes in pathways/gene sets, decreases. Holland et al. {cite}`holland2020robustness` also found that low gene coverage adversely impacts the performance of the bulk-seq tools *DoRothEA* and *PROGENy* on single-cell data. These reports collectively support that filtering out gene sets with low gene coverage, say fewer than 10 or 12 genes in the set, is beneficial in pathway analysis.
Zhang et al. additionally found that pathway analysis was susceptible to normalisation procedures applied to gene expression measurements.
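
The gene set filtering step described above might look like the following sketch. The helper `filter_gene_sets`, the toy gene sets and the threshold used in the example call are hypothetical; a real analysis would pass the genes actually present in the dataset (or its HVG subset) and a threshold in the range discussed above.

```python
def filter_gene_sets(gene_sets, measured_genes, min_overlap=10):
    """Keep gene sets with at least `min_overlap` genes present in the data.

    gene_sets: dict mapping gene set name -> iterable of gene names.
    measured_genes: iterable of gene names present in the dataset.
    Returns a dict of the retained sets, restricted to the measured genes.
    """
    measured = set(measured_genes)
    kept = {}
    for name, genes in gene_sets.items():
        overlap = sorted(measured.intersection(genes))
        if len(overlap) >= min_overlap:
            kept[name] = overlap
    return kept


# Hypothetical toy example with a deliberately small threshold.
gene_sets = {
    "SET_A": ["g1", "g2", "g3", "g4"],
    "SET_B": ["g1", "x1", "x2"],
}
measured_genes = ["g1", "g2", "g3", "g5"]

kept = filter_gene_sets(gene_sets, measured_genes, min_overlap=3)
print(kept)
```

Restricting each retained set to the measured genes also ensures that downstream tests and scores are computed on the same genes that were counted towards the coverage threshold.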

#### Data normalisation

Read counts in single-cell experiments are typically normalised early in the pre-processing pipeline to ensure that measurements are comparable across cells with different library sizes. Zhang et al. {cite}`zhang2020benchmarking` found that normalisation by *SCTransform* {cite}`hafemeister2019normalization` and *scran* {cite}`lun2016step` generally improves the performance of both single-cell- and bulk-based pathway scoring tools. They also found that the performance of *AUCell* and *z-score* is particularly sensitive to the choice of normalisation method.

### 6. Take aways

- Normalise your data using standard scRNA-seq normalisation methods such as SCTransform and scran, and filter out gene sets with low gene coverage in your data prior to pathway analysis.
- Be aware of the different types of gene set tests (i.e. competitive vs self-contained) and use one that suits your application.
- Be aware of the differences between gene set enrichment and gene set activity inference. *fgsea* is a widely used gene set test in single-cell studies; *Pagoda2* was found to outperform other pathway activity scoring tools. If your dataset has a complex experimental design, consider pseudo-bulk analysis with the gene set tests implemented in *limma*, as they account for inter-gene correlations.

### 7. Case study: Pathway enrichment analysis in human COVID-19 PBMCs

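```python
import scanpy as sc
import anndata as ad
import numpy as np
import pandas as pd
import anndata2ri
import rpy2
from rpy2.robjects import r

# Enable conversion between AnnData (Python) and SingleCellExperiment (R) objects
anndata2ri.activate()
```

```python
# Load the rpy2 IPython extension, which provides the %%R cell magic
%load_ext rpy2.ipython
```

```r
%%R
# Load the R packages used in this case study without startup messages
suppressPackageStartupMessages({
    library(SingleCellExperiment)
    library(Seurat)
    library(SeuratDisk)
})
```

```r
%%R -i adata_
# -i adata_ passes the Python object `adata_` into R;
# the 'X' assay is used as counts when building the Seurat object
cite = as.Seurat(adata_, counts='X', data=NULL)
cite
```

```r
%%R
# Record R and package versions for reproducibility
sessionInfo()
```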