# Pathway Analysis

The primary goal of Pathway Analysis (PA) is to interpret data collected from high-throughput technologies by discovering meaningful groups of genes that are altered in case samples compared with controls. PA methods address the challenge of comprehending long lists of important but isolated genes, detached from biological context, which are the principal output of differential expression analysis. Pathway enrichment analysis is a common strategy for tackling this problem: it summarizes the large gene list as a smaller list of more easily interpretable pathways. Pathways are statistically tested for over-representation in the experimental gene list relative to what would be expected by chance, using one of several common statistical tests that take into account the number of genes detected in the experiment, their relative ranking, and the number of genes annotated to a pathway of interest.

A good technique to analyze gene sets is to compare them to well-annotated gene sets (biological pathways). For example, Over-representation Analysis (ORA) counts the genes shared by an input gene set and each annotated gene set and applies a statistical test to evaluate the significance of the overlap. A p-value cutoff, e.g. 0.05, is used to identify annotated gene sets that overlap significantly with the input gene set. Gene Set Enrichment Analysis (GSEA) seeks to avoid the need for an ad hoc cutoff (e.g. on expression fold change) when defining the input gene set. GSEA ranks all genes in the genome by differential expression level and checks whether the genes of any annotated gene set are ranked unexpectedly high or low.
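To make the ranking idea concrete, the sketch below computes a GSEA-style running-sum enrichment score in base R. It is a simplified illustration, not the implementation used by any package: the function name `enrichment_score`, the toy statistics, and the gene names are invented for this example, and the weighting exponent `p = 1` is an assumption.

```r
# Running-sum enrichment score for one gene set (illustrative sketch)
enrichment_score <- function(stats, gene_set, p = 1) {
  stats  <- sort(stats, decreasing = TRUE)          # rank genes, highest first
  in_set <- names(stats) %in% gene_set
  # Step up at genes in the set, weighted by |statistic|^p
  hit  <- cumsum(ifelse(in_set, abs(stats)^p, 0)) / sum(abs(stats[in_set])^p)
  # Step down by a constant amount at genes outside the set
  miss <- cumsum(!in_set) / sum(!in_set)
  running <- hit - miss
  # The score is the largest deviation of the running sum from zero
  running[which.max(abs(running))]
}

# Toy ranked statistics (hypothetical values)
toy_stats <- c(g1 = 5, g2 = 4, g3 = 3, g4 = -1, g5 = -2, g6 = -6)

es_up   <- enrichment_score(toy_stats, c("g1", "g2"))  # set at the top of the ranking
es_down <- enrichment_score(toy_stats, c("g5", "g6"))  # set at the bottom of the ranking
```

A set concentrated at the top of the ranking gets a positive score and a set concentrated at the bottom gets a negative one; here the two toy sets reach the extremes, 1 and -1.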

## Learning Objectives:
1. Introduction to enrichment analysis.
2. Enrichment analysis using different methods.
3. Saving results.

## Enrichment analysis
Gene set enrichment analysis methods help gain insight into gene lists by identifying pathways that are enriched in a gene list more than would be expected by chance. The process involves three major steps: definition of a gene list from omics data, determination of statistically enriched pathways, and visualization and interpretation of the results.

## Enrichment analysis using ORA
Over-representation (or enrichment) analysis is a statistical method that determines whether genes from pre-defined sets (e.g. those belonging to a specific GO term or KEGG pathway) are present more often than would be expected by chance (over-represented) in a subset of your data. The p-value can be calculated from the hypergeometric distribution:

![](./images/Module4/ora_p_value.png)

where N is the total number of genes in the background distribution, M is the number of genes within that distribution that are annotated (either directly or indirectly) to the gene set of interest, n is the size of the list of genes of interest, and k is the number of genes within that list that are annotated to the gene set. By default, the background distribution consists of all genes that have an annotation.
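As a worked illustration of these definitions, and assuming the formula in the figure is the usual upper-tail probability P(X >= k), the p-value can be computed in base R with phyper(). The numbers below are invented for the example.

```r
N <- 20000  # genes in the background (hypothetical)
M <- 200    # background genes annotated to the gene set
n <- 500    # genes of interest
k <- 15     # genes of interest annotated to the gene set

# P(X >= k): phyper() returns P(X <= q), so evaluate the upper tail at k - 1
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same quantity written as an explicit sum of point probabilities
p_check <- sum(dhyper(k:min(n, M), M, N - M, n))
```

By chance alone, about n * M / N = 5 annotated genes would be expected in the list, so observing 15 gives a small p-value.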

### Over Representation Analysis Using GO

In [35]:
data <- read.csv("./data/DE_genes.csv")
rownames(data) <- data$ID
head(data)

| | X | ID | adj.P.Val | P.Value | t | B | logFC | Gene.symbol | Gene.title |
|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` |
| 222178_s_at | 222178_s_at | 222178_s_at | 2.962397e-50 | 5.418192e-55 | -71.93947 | 81.51371 | -1.750469 | | |
| 224687_at | 224687_at | 224687_at | 1.5750449999999998e-38 | 5.761483e-43 | -42.51195 | 71.37488 | -5.0270459 | ANKIB1 | ankyrin repeat and IBR domain containing 1 |
| 207488_at | 207488_at | 207488_at | 1.328826e-35 | 7.291224e-40 | -37.04099 | 67.63002 | -1.0532279 | | |
| 239226_at | 239226_at | 239226_at | 5.317826e-31 | 3.890499e-35 | 29.94059 | 61.01151 | 0.7890331 | | |
| 216490_x_at | 216490_x_at | 216490_x_at | 2.320159e-28 | 2.121773e-32 | 26.41134 | 56.70533 | 1.0811623 | | |
| 234109_x_at | 234109_x_at | 234109_x_at | 3.581024e-28 | 3.929793e-32 | -26.08643 | 56.2667 | -0.5401491 | ONECUT3 | one cut homeobox 3 |


In [67]:
mask <- data$adj.P.Val < 0.05 &
        abs(data$logFC) > log2(2)
deGenes <- rownames(data[mask,])

In [68]:
geneUniverse <- rownames(data)

In [69]:
library("hgu133plus2.db")
# Map probe IDs to gene symbols. AnnotationDbi::select is called explicitly
# because other attached packages (e.g. dplyr) mask select().
deGenes <- AnnotationDbi::select(hgu133plus2.db, keys = deGenes,
                                 columns = "SYMBOL", keytype = "PROBEID")
# Remove duplicated probe IDs
deGenes <- deGenes[!duplicated(deGenes[,1]),]
deGenes <- deGenes$SYMBOL

geneUniverse <- AnnotationDbi::select(hgu133plus2.db, keys = geneUniverse,
                                      columns = "SYMBOL", keytype = "PROBEID")
# Remove duplicated probe IDs
geneUniverse <- geneUniverse[!duplicated(geneUniverse[,1]),]
geneUniverse <- geneUniverse$SYMBOL


In [54]:
length(deGenes)
length(geneUniverse)

In [56]:
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("clusterProfiler")

Bioconductor version 3.15 (BiocManager 1.30.18), R 4.2.1 (2022-06-23 ucrt)


In [70]:
library(clusterProfiler)

In [57]:

ans.go <- enrichGO(gene = deGenes, ont = "BP",
                   OrgDb ="org.Hs.eg.db",
                   universe = geneUniverse,
                   readable=TRUE,
                   pvalueCutoff = 0.05)




--> No gene can be mapped....

--> Expected input gene ID: 23424,7349,219793,9466,898,10468

--> return NULL...



In [60]:
tab.go <- as.data.frame(ans.go)

In [61]:
ans.go

NULL

In [None]:
tab.go<- subset(tab.go, Count>5)
tab.go[1:5, 1:6]

ERROR: Error in eval(e, x, parent.frame()): object 'Count' not found

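The enrichGO() call above could not map any gene because, with its default keyType, it expects Entrez gene IDs. To assess functional enrichment, both the DE gene list and the gene universe must therefore be annotated with Entrez IDs: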


In [39]:
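library(org.Hs.eg.db)
# Map gene symbols to Entrez IDs (assumes deGenes and geneUniverse hold gene
# symbols); symbols without a match yield NA and are dropped
deGenes <- unlist(mget(deGenes[!is.na(deGenes)], envir=org.Hs.egSYMBOL2EG,
                       ifnotfound = NA))
deGenes <- unique(deGenes[!is.na(deGenes)])

geneUniverse <- unlist(mget(geneUniverse[!is.na(geneUniverse)], envir=org.Hs.egSYMBOL2EG,
                            ifnotfound = NA))
geneUniverse <- unique(geneUniverse[!is.na(geneUniverse)])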

In [40]:
deGenes


The GO enrichment analysis using clusterProfiler is performed as follows:


In [None]:
library(clusterProfiler)


In [None]:
ans.go <- enrichGO(gene = deGenes, ont = "BP",
                   OrgDb ="org.Hs.eg.db",
                   universe = geneUniverse,
                   readable=TRUE,
                   pvalueCutoff = 0.05)

In [None]:
tab.go <- as.data.frame(ans.go)


In [None]:
tab.go<- subset(tab.go, Count>5)


In [None]:
tab.go[1:5, 1:6]
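The learning objectives list saving results, which is not otherwise shown in this section. The following is a minimal sketch using base R only: the toy data frame stands in for tab.go, and the output file name is hypothetical.

```r
# Toy stand-in for the enrichment table; the real tab.go has more columns
tab_example <- data.frame(
  ID          = c("set_A", "set_B"),
  Description = c("first toy gene set", "second toy gene set"),
  Count       = c(12L, 8L),
  stringsAsFactors = FALSE
)

# Write the table to a CSV file (hypothetical file name, in a temp directory)
out_file <- file.path(tempdir(), "GO_enrichment_results.csv")
write.csv(tab_example, out_file, row.names = FALSE)

# Read the file back to confirm that the table round-trips
saved <- read.csv(out_file, stringsAsFactors = FALSE)
```

Setting row.names = FALSE avoids an extra unnamed first column when the file is read back later.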


## Visualization


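All analyses performed with clusterProfiler can be visualized with different plots. The cells below use ans.kegg and ans.dis, two enrichment results (KEGG pathways and disease terms, judging by the plot titles) that are not created in this section; they must be computed beforehand in the same way as ans.go, for example with enrichKEGG() and a disease enrichment function such as DOSE::enrichDGN().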


In [None]:
library(enrichplot)
library(ggplot2)  # provides ggtitle()


In [None]:
p1 <- barplot(ans.dis, showCategory=10)


In [None]:
p1

In [None]:
p2 <- dotplot(ans.kegg, showCategory=20) + ggtitle("KEGG")
p3 <- dotplot(ans.dis, showCategory=20) + ggtitle("Disease")


In [None]:
cowplot::plot_grid(p2, p3, nrow=2)


In [None]:
p4 <- upsetplot(ans.dis)


In [None]:
p4

In [None]:
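# Recent enrichplot versions need the term similarity matrix before emapplot()
p5 <- emapplot(enrichplot::pairwise_termsim(ans.kegg))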


In [None]:
p5

In [None]:
 install.packages("cowplot")

In [None]:
cowplot::plot_grid(p1, p3, p5, ncol=2, labels=LETTERS[1:3])


## Enrichment Analysis using FGSEA


This submodule describes FGSEA (Fast preranked Gene Set Enrichment Analysis), a tool for evaluating pathway enrichment in transcriptional data. It can quickly and accurately calculate arbitrarily low GSEA P-values for a collection of gene sets, based on an adaptive multi-level split Monte Carlo scheme.

A typical experimental design consists of comparing two conditions with several replicates using a differential gene expression test, followed by preranked GSEA performed against a collection of hundreds or thousands of pathways. However, the reference implementation of this method cannot accurately estimate small P-values, which significantly limits its sensitivity once multiple hypothesis correction is applied. FGSEA, on the other hand, can estimate arbitrarily low GSEA P-values with high accuracy in a matter of minutes or even seconds.
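To see why a limited number of permutations caps sensitivity, consider a plain permutation test. This is an illustrative assumption, not the reference GSEA code: with nperm permutations and the common (b + 1) / (nperm + 1) estimator, the smallest reportable P-value is 1 / (nperm + 1). The sketch below shows that a single pathway at this resolution limit does not survive Benjamini-Hochberg correction across thousands of pathways. All numbers are hypothetical.

```r
nperm      <- 1000   # permutations per pathway (hypothetical)
n_pathways <- 5000   # pathways tested (hypothetical)

# Smallest P-value a permutation test can report when no permuted score
# is as extreme as the observed one: (0 + 1) / (nperm + 1)
p_min <- 1 / (nperm + 1)

# One enriched pathway at the resolution limit, all others with p = 1
pvals <- c(p_min, rep(1, n_pathways - 1))
padj  <- p.adjust(pvals, method = "BH")

padj[1]  # adjusted P-value of the enriched pathway
```

In this scenario the adjusted P-value is 1, however strong the true enrichment is, which is the limitation that estimating arbitrarily low P-values removes.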


## GSEA Enrichment Analysis
### Inputs
- gene_list: ranked gene list (numeric vector; the names of the vector should be gene names)
- GO_file: path to the "gmt" GO file on your system
- pval: P-value threshold for returning results


### Steps
1. Run GSEA (package: fgsea)
2. Collapse redundant GO terms using a permutation test
3. Return GSEA plot and data.frame of results

### FGSEA Using GO

In [None]:
library(tidyverse)


In [None]:
data <- read_csv("./data/DE_genes.csv")


In [None]:
data

In [None]:
mask <- data$adj.P.Val < 0.05 &
  abs(data$logFC) > log2(2)

In [None]:
# read_csv() returns a tibble without row names, so take the IDs from the ID column
deGenes <- data$ID[mask]


In [None]:
deGenes

In [None]:
length(deGenes)


In [None]:
# read_csv() returns a tibble without row names, so take the IDs from the ID column
geneUniverse <- data$ID


In [None]:
length(geneUniverse)


## The FGSEA Analysis Procedure (from the author)
Loading Libraries


In [None]:
library(data.table)


In [None]:
library(fgsea)


In [None]:
library(ggplot2)


Loading example pathways and gene-level statistics:


In [None]:
data(examplePathways)


In [None]:
data(exampleRanks)


fgsea applies a lower bound eps when estimating P-values (1e-10 by default according to the author; the default may differ between fgsea versions, see ?fgsea). If you need to estimate P-values more accurately, you can set the eps argument to zero in the fgsea function.


In [None]:
fgseaRes <- fgsea(pathways = examplePathways,
                  stats    = exampleRanks,
                  eps      = 0.0,
                  minSize  = 15,
                  maxSize  = 500)

In [None]:
head(fgseaRes[order(pval), ])


One can make an enrichment plot for a pathway:


In [None]:
plotEnrichment(examplePathways[["5991130_Programmed_Cell_Death"]],
               exampleRanks) + labs(title="Programmed Cell Death")


Or make a table plot for a bunch of selected pathways:


In [None]:
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]


In [None]:
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]


In [None]:
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))


In [None]:
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes,
              gseaParam=0.5)

## FGSEA Enrichment Analysis With KEGG Data

In [None]:
# To load the GMT file for enrichment analysis, we can use GSA.read.gmt function available in the GSA R package. Here is the code to install the package and load the GMT file

suppressMessages({
  suppressWarnings(install.packages("GSA"))
  suppressPackageStartupMessages({
    suppressWarnings(library(GSA))
  })
})

In [None]:
# Load the GMT file from disk; capture.output() wrapped in invisible() suppresses the verbose messages printed by GSA.read.gmt

invisible(capture.output(pathways <- GSA::GSA.read.gmt("./data/KEGG_pathways.gmt")))

In [None]:
# View first five pathways and related gene sets
pathways$genesets[1:5]

In [None]:
# View the name of the pathways
pathways$geneset.names[1:5]

In [None]:
# View the description of each pathway
pathways$geneset.descriptions[1:5]
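For readers unfamiliar with the GMT format: each line holds a gene set name, a description, and then the member genes, all separated by tabs. The base R sketch below parses a small made-up file into the same three components inspected above (genesets, geneset.names, geneset.descriptions). The helper name `read_gmt_simple` and the toy content are hypothetical, and the sketch is not a replacement for GSA.read.gmt.

```r
# Minimal GMT parser (illustrative): one gene set per tab-separated line
read_gmt_simple <- function(path) {
  lines  <- readLines(path)
  fields <- strsplit(lines, "\t", fixed = TRUE)
  list(
    genesets             = lapply(fields, function(x) x[-(1:2)]),  # genes start at field 3
    geneset.names        = vapply(fields, function(x) x[1], character(1)),
    geneset.descriptions = vapply(fields, function(x) x[2], character(1))
  )
}

# Write a toy GMT file with two made-up pathways
gmt_file <- tempfile(fileext = ".gmt")
writeLines(c("PATHWAY_A\tfirst toy pathway\tGENE1\tGENE2\tGENE3",
             "PATHWAY_B\tsecond toy pathway\tGENE2\tGENE4"),
           gmt_file)

toy <- read_gmt_simple(gmt_file)
toy$geneset.names
toy$genesets[[2]]
```

Keeping names, descriptions, and gene sets as parallel components means the i-th entry of each refers to the same pathway.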

# FGSEA Using KEGG

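# Gene list
DE <- readRDS("./data/DE_genes.rds")

# Rank genes by the t-statistic: GSEA needs a signed, named statistic.
# Assumes DE has the same columns as DE_genes.csv and that the GMT file
# identifies genes by symbol.
DE <- DE[!is.na(DE$Gene.symbol) & DE$Gene.symbol != "", ]
gene_list <- sort(setNames(DE$t, DE$Gene.symbol), decreasing = TRUE)
head(gene_list)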
GSEA = function(gene_list, pathways, pval) {
  set.seed(54321)

  library(dplyr)
  library(fgsea)
  if ( any( duplicated(names(gene_list)) )  ) {
    warning("Duplicates in gene names")
    gene_list = gene_list[!duplicated(names(gene_list))]
  }
  if  ( !all( order(gene_list, decreasing = TRUE) == 1:length(gene_list)) ){
    warning("Gene list not sorted")
    gene_list = sort(gene_list, decreasing = TRUE)
  }

  myGO = fgsea::gmtPathways(pathways)

  fgRes <- fgsea::fgsea(pathways = myGO,
                        stats = gene_list,
                        minSize=15, ## minimum gene set size
                        maxSize=400, ## maximum gene set size
                        nperm=10000) %>%
    as.data.frame() %>%
    dplyr::filter(padj < !!pval) %>%
    arrange(desc(NES))
  message(paste("Number of significant gene sets =", nrow(fgRes)))


  message("Collapsing Pathways -----")
  concise_pathways = collapsePathways(data.table::as.data.table(fgRes),
                                      pathways = myGO,
                                      stats = gene_list)
  fgRes = fgRes[fgRes$pathway %in% concise_pathways$mainPathways, ]
  message(paste("Number of gene sets after collapsing =", nrow(fgRes)))

  fgRes$Enrichment = ifelse(fgRes$NES > 0, "Up-regulated", "Down-regulated")
  filtRes = rbind(head(fgRes, n = 10),
                  tail(fgRes, n = 10 ))


  total_up = sum(fgRes$Enrichment == "Up-regulated")
  total_down = sum(fgRes$Enrichment == "Down-regulated")
  header = paste0("Top 10 (Total pathways: Up=", total_up,", Down=",    total_down, ")")


  colos = setNames(c("firebrick2", "dodgerblue2"),
                   c("Up-regulated", "Down-regulated"))


  g1= ggplot(filtRes, aes(reorder(pathway, NES), NES)) +
    geom_point( aes(fill = Enrichment, size = size), shape=21) +
    scale_fill_manual(values = colos ) +
    scale_size_continuous(range = c(2,10)) +
    geom_hline(yintercept = 0) +
    coord_flip() +
    labs(x="Pathway", y="Normalized Enrichment Score",
         title=header) +
    theme_minimal()  # assumed stand-in for the theme object th, which is not defined here

  output = list("Results" = fgRes, "Plot" = g1)
  return(output)
}



In [None]:
## Enrichment analysis using gsa


In [None]:
# // GSA using GO

In [None]:
# // GSA with KEGG
