# Gene set and pathway

![](./images/Module3/Workflow3.jpg)

Differential expression (DE) analysis typically yields a list of genes or proteins. We use such lists to gain insight into
the genes and proteins that may play a role in a given phenomenon, phenotype, or disease progression. In many cases, however,
gene lists generated from DE analysis are difficult to interpret because they are large and lack useful annotations.
Pathway analysis (also known as gene set analysis or over-representation analysis) aims to reduce this complexity
by mapping the listed genes to known (i.e., annotated) biological pathways, processes, and functions.
This learning submodule introduces two common curated biological databases: Gene Ontology (GO) and the
Kyoto Encyclopedia of Genes and Genomes (KEGG).

## Learning Objectives:
1. Introduction to Ontology and Gene Ontology.
2. Introduction to KEGG Pathway Database.
3. Download terms and pathway gene sets from GO and KEGG.
4. Save the results in the GMT file format.

In [3]:
suppressMessages({
suppressWarnings(library(IRdisplay))
suppressWarnings(IRdisplay::display_html(data = NULL, file ="Quizzes/Quiz_Submodule3-1.html"))
})

## Ontology and Gene Ontology
### Overview
In this section, we will learn about the concept of gene ontology in bioinformatics. An ontology is a set of concepts and categories, defined by a shared vocabulary, that denotes the properties of the concepts as well as the relationships between them.
Ontologies play an important role in bioinformatics. They enable unambiguous communication, e.g., by providing
a way to understand different groups’ annotations of various genomes. They also structure knowledge so that computer programs can perform automated analyses.

The Gene Ontology (GO) database defines a structured, common, and controlled vocabulary to describe attributes of genes and gene products
across organisms. Collaboration is key to building a consensus vocabulary. The term gene ontology, or GO, is commonly used
to refer both to the terms themselves and to the associations between genes and terms, which can cause confusion. To avoid this, we will use the term “GO”
to describe the set of terms and their hierarchical structure, and “GO annotations” to describe the set of associations between
genes and GO terms. GO is divided into three categories that describe genes and gene products from three different
angles: Molecular Function, Biological Process, and Cellular Component.

The structure of GO can be described as a set of directed acyclic graphs (DAGs), where each GO term is a node
and the relationships between the terms are edges between the nodes. GO is loosely hierarchical, with ‘child’ terms
being more specialized than their ‘parent’ terms, but unlike a strict hierarchy, a term may have more than one parent
term (note that the parent/child model does not hold true for all types of relations). The structure of the controlled
vocabularies is intended to reflect true biological relationships. In contrast to strict hierarchies, DAGs allow
multiple relationships between a more granular (child) term and a more general (parent) term. The relationship between
terms affects how queries are made. For example, a query for all genes with binding activity would include transcription
factors as well as genes with other types of binding activity (such as protein binding or ligand binding). The categories
and structure of GO are illustrated in the figure below:

![](./images/Module3/GO_Structure.jpg)
*(Source: https://www.ebi.ac.uk/, http://geneontology.org/)*
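
The effect of the DAG structure on queries can be sketched with a small toy graph in base R. The term labels and gene names below are hypothetical and do not correspond to real GO IDs or annotations; the point is only that a term can have two parents and that a query on a general term also returns genes annotated to its more specific descendants.

```r
# Toy DAG (illustrative labels, not real GO IDs): each term lists its direct parents.
parents <- list(
  binding            = character(0),
  protein_binding    = "binding",
  dna_binding        = "binding",
  tf_complex_binding = c("protein_binding", "dna_binding")  # a term with two parents
)

# Genes annotated directly to each term (hypothetical gene names).
direct_annotations <- list(
  protein_binding    = "geneA",
  dna_binding        = "geneB",
  tf_complex_binding = "geneC"
)

# Recursively collect every ancestor of a term.
ancestors <- function(term) {
  direct <- parents[[term]]
  unique(c(direct, unlist(lapply(direct, ancestors))))
}

# A query for a term returns genes annotated to it or to any of its descendants.
genes_for <- function(term) {
  hit <- vapply(names(direct_annotations),
                function(annotated_term) {
                  annotated_term == term || term %in% ancestors(annotated_term)
                },
                logical(1))
  sort(unique(unlist(direct_annotations[hit], use.names = FALSE)))
}

genes_for("binding")      # genes from all three more specific terms
genes_for("dna_binding")  # genes from dna_binding and its child term
```

Real GO tools perform this propagation over the full ontology; the sketch only mirrors the idea on four terms.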

### Gene ontology relationship
In a DAG, *terms* are represented as *nodes* and *relations* (also known as *object properties*) between the *terms*
are represented as *edges*. Commonly used relations in GO include *is a* (is a *subtype of*), *part of*, *has part*, *regulates*,
*negatively regulates*, and *positively regulates*. All terms (except the root terms representing each aspect) have a sub-class relationship to another term.

Examples:

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Example:</b> **GO:1904659:glucose transport** *is a* **GO:0015749:monosaccharide transport**.

The *is a* relation forms the basic structure of GO. If we say A *is a* B, we mean that node A is a subtype of node B.

</div>

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Example:</b> **GO:0031966:mitochondrial membrane** *is part of* **GO:0005740:mitochondrial envelope**

The *part of* relation is used to represent part-whole relationships. A *part of* relation would only be added between
A and B if B is **necessarily** *part of* A: wherever B exists, it is as *part of* A, and the presence of B implies
the presence of A. However, given the occurrence of A, we cannot say for certain that B exists.
</div>

<div class="alert alert-block alert-warning">
    <i class="fa fa-pencil" aria-hidden="true"></i>
    <b>Example:</b> **GO:0098689:latency-replication decision** *regulates* **GO:0019046:release from viral latency**

The *regulates* relation describes a case in which one process directly affects the manifestation of another process or quality,
i.e., the former *regulates* the latter.
</div>


A more specific case with more nodes and edges is shown in the figure below:
<br>
![](./images/Module3/GO_Relation.jpg)
*(Source: https://advaitabio.com/)* <br>
For more technical information about relations and their properties used in GO and other ontologies see the
<a href="https://obofoundry.org/ontology/ro.html">OBO Relations Ontology (RO)</a>


In [4]:
suppressMessages({
suppressWarnings(library(IRdisplay))
suppressWarnings(IRdisplay::display_html(data = NULL, file ="Quizzes/Quiz_Submodule3.html"))
})

### GO storage file formats
GO is released monthly in the following formats:
* OBO 1.4 files are human-readable (in addition to machine-readable) and can be opened in any text editor.
* OWL files can be opened with the <a href="https://protege.stanford.edu/">Protégé</a> ontology editor.

In this learning submodule, we will only use the OBO format to obtain GO terms. The OBO file format is designed for representing ontologies and controlled vocabularies, and it attempts to achieve the following goals:
 * Human readability
 * Ease of parsing
 * Extensibility
 * Minimal redundancy

The file structure is shown in the following figure.


![](./images/Module3/OBO_Format.jpg)

The OBO file starts with a header, which is an unlabeled section at the beginning of the document. The header ends when the first term is encountered. Each term is then represented in a labeled section (a stanza) that starts with the tag *[Term]*. Under each term, we can find information such as the term ID, official name, category (namespace), term definition, synonyms, and relations to other GO terms.
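
To make the layout concrete, here is a minimal base R sketch that parses `[Term]` stanzas from a few lines of OBO-style text. The two stanzas are heavily abbreviated and reuse the mitochondrial membrane example from above purely for illustration; a real `go.obo` file contains many more tags per term, and this sketch is not a complete OBO parser.

```r
# Abbreviated OBO-style text: one header line followed by two [Term] stanzas.
obo_text <- c(
  "ontology: go",
  "",
  "[Term]",
  "id: GO:0031966",
  "name: mitochondrial membrane",
  "namespace: cellular_component",
  "relationship: part_of GO:0005740 ! mitochondrial envelope",
  "",
  "[Term]",
  "id: GO:0005740",
  "name: mitochondrial envelope",
  "namespace: cellular_component"
)

# Number the stanzas: lines before the first [Term] get 0 and form the header.
stanza_id <- cumsum(obo_text == "[Term]")
is_tag_line <- grepl("^[^:]+: ", obo_text) & stanza_id > 0

# Split each "tag: value" line at the first colon and group values by tag.
parse_stanza <- function(lines) {
  tags   <- sub("^([^:]+): .*$", "\\1", lines)
  values <- sub("^[^:]+: (.*)$", "\\1", lines)
  split(values, tags)
}

terms <- lapply(split(obo_text[is_tag_line], stanza_id[is_tag_line]), parse_stanza)
names(terms) <- vapply(terms, function(x) x[["id"]], character(1))

terms[["GO:0031966"]][["name"]]
terms[["GO:0031966"]][["relationship"]]
```

Grouping values by tag keeps repeated tags (for example several `is_a` lines) together as a character vector.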

At this step, we still don't know which genes are associated with which GO terms. To make it possible to retrieve GO annotations for any list of genes, NCBI publishes the Gene2GO file, which lists GO terms together with the Entrez Gene IDs annotated to them. The file can be downloaded from <a href="https://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz">here</a> and, once decompressed, viewed in any text editor. Its structure is shown in the figure below:

![](./images/Module3/Gene2GO.jpg)

The OBO and Gene2GO files are used in combination to obtain GO terms and their related genes for enrichment analysis.
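
As a rough sketch of how a Gene2GO-style table can be turned into gene sets, the base R code below reads a few tab-delimited rows and groups Entrez Gene IDs by GO term for human (taxonomy ID 9606). The rows are made up for illustration: the gene IDs, the GO IDs `GO:9999991` and `GO:9999992`, and the term names are placeholders, not real annotations. With the real file, the inline text would be replaced by the path to the decompressed download.

```r
# Made-up rows in the Gene2GO column layout (placeholder IDs, not real annotations).
gene2go_text <- paste(
  "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory",
  "9606\t1001\tGO:9999991\tIEA\t-\ttoy process one\t-\tProcess",
  "9606\t1002\tGO:9999991\tIDA\t-\ttoy process one\t-\tProcess",
  "9606\t1001\tGO:9999992\tIEA\t-\ttoy process two\t-\tProcess",
  "10090\t2001\tGO:9999991\tIEA\t-\ttoy process one\t-\tProcess",
  "9606\t1001\tGO:9999991\tIDA\t-\ttoy process one\t-\tProcess",
  sep = "\n"
)

# Read everything as character so that IDs are not converted to numbers.
gene2go <- read.delim(text = gene2go_text, check.names = FALSE,
                      colClasses = "character")

# Keep human rows only, then collect the unique gene IDs for each GO term.
human <- gene2go[gene2go[["#tax_id"]] == "9606", ]
go2genes <- lapply(split(human$GeneID, human$GO_ID), unique)

go2genes
```

The same gene can appear several times for one term with different evidence codes, which is why `unique()` is applied per term.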

### Retrieving GO terms from DE gene list
This section focuses on retrieving the GO terms related to the DE gene list obtained from the DE analysis in the previous submodule.
Here, we will use the `topGO` and `hgu133plus2.db` R packages to obtain GO terms. The `topGO` package has built-in annotation functions that retrieve GO terms for the gene IDs given by the DE analysis. Since the dataset we used in submodule 02 was generated from human samples, we will use the `hgu133plus2.db` database to map probe IDs to gene symbols.
The packages can be installed with the script below:

In [1]:
# Install topGO, hgu133plus2.db and AnnotationDbi from Bioconductor
suppressMessages({
  if (!require("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  suppressWarnings(BiocManager::install("topGO", update = FALSE))
  suppressWarnings(BiocManager::install("hgu133plus2.db", update = FALSE))
  suppressWarnings(BiocManager::install("AnnotationDbi", update = FALSE))
})




In [2]:
# Loading the library
suppressPackageStartupMessages({
  suppressWarnings(library("topGO"))
  suppressWarnings(library("hgu133plus2.db"))
  suppressWarnings(library("AnnotationDbi"))
})


groupGOTerms: 	GOBPTerm, GOMFTerm, GOCCTerm environments built.



Load the DE genelist generated from the DE analysis using `limma`.

In [3]:
# Loading the DE result
data = readRDS("./data/DE_genes.rds")

By default, the DE result produced by `limma` contains multiple columns. However, the adjusted *p-value* and the gene ID are the most important ones for enrichment analysis. We can use the following code to build a vector of adjusted *p-values* named by gene ID.

In [4]:
# Get p-value from DE results
genelist <- data$adj.P.Val
# Assign gene IDs to associated p-values
names(genelist) <- data$PROBEID

After successfully obtaining the genelist, we need to map the gene IDs to the gene symbols using `hgu133plus2.db`.

In [5]:
# Map probe IDs to gene symbols
gene <- suppressMessages(AnnotationDbi::select(hgu133plus2.db, keys = names(genelist), columns = "SYMBOL", keytype = "PROBEID"))

In [6]:
# Remove duplicated gene IDs
gene <- gene[!duplicated(gene[,1]),]

# Assign the result to a new gene list named by gene symbol
geneList2 <- genelist
names(geneList2) <- gene[,2]
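
After this mapping, some probes may have no gene symbol (`NA`) and several probes may map to the same symbol. The cells above keep all of these entries. As an optional illustration only, the base R sketch below shows one possible way to collapse such a list on toy data, dropping unmapped probes and keeping the smallest p-value per symbol. The probe names, symbols, and the choice of the minimum are assumptions made for this example, not part of the module's workflow.

```r
# Toy p-values named by probe ID (hypothetical probes).
toy_pvals <- c(p1 = 0.01, p2 = 0.20, p3 = 0.03, p4 = 0.50)

# Toy probe-to-symbol table: p2 maps to two symbols, p1 and p3 share a symbol,
# and p4 has no symbol.
toy_map <- data.frame(
  PROBEID = c("p1", "p2", "p2", "p3", "p4"),
  SYMBOL  = c("GENE1", "GENE2", "GENE2B", "GENE1", NA),
  stringsAsFactors = FALSE
)

# Keep the first symbol for each probe, as in the cell above.
toy_map <- toy_map[!duplicated(toy_map$PROBEID), ]

# The positional renaming is only valid if the probe order is unchanged.
stopifnot(identical(toy_map$PROBEID, names(toy_pvals)))

# Drop unmapped probes and keep the smallest p-value for each symbol.
mapped <- !is.na(toy_map$SYMBOL)
best_p <- vapply(split(unname(toy_pvals[mapped]), toy_map$SYMBOL[mapped]),
                 min, numeric(1))

best_p
```

The `stopifnot()` check makes the ordering assumption explicit, which the positional assignment `names(geneList2) <- gene[,2]` relies on.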

Now, we can search for related GO terms based on the new gene list using `topGO` package. First, we need to create a `topGOdata`
object.

In [7]:
# Retrieve all the GO terms related to the genelist obtained from the expression matrix
GOdata <- new("topGOdata", description = "",ontology = "BP",
                    allGenes = geneList2, geneSel = function(x)x, nodeSize = 10,
                    annot = annFUN.org, ID = "alias", mapping = "org.Hs.eg")


Building most specific GOs .....

	( 12457 GO terms found. )


Build GO DAG topology ..........

	( 15840 GO terms and 35901 relations. )


Annotating nodes ...............

	( 16332 genes annotated to the GO terms. )



We can obtain the genes annotated to each GO term using the `genesInTerm` function and view the first few terms with their associated genes.

In [9]:
# Obtain a list of genes for each GO term
allGO = genesInTerm(GOdata)
allGO[1:5]

We now have GO terms with their genes. However, we still do not know what each GO term means. The `GO.db` package provides a set of annotation maps describing the entire Gene Ontology, assembled using data from GO. We can use the following code to install the `GO.db` R package.

In [10]:
suppressMessages({if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  suppressWarnings(BiocManager::install("GO.db", update = F))
})
library(GO.db)

Then, we can use the following command to obtain the GO terms description.

In [11]:
# Getting the name of each GO term
terms <- names(allGO)
# Getting the description of each GO term
descriptions <- lapply(Term(terms), `[[`, 1)

To perform enrichment analysis in later submodules, we need to save the GO terms and gene sets in a standard format. One commonly used format is the Gene Matrix Transposed file format *(\*.gmt)*, a tab-delimited format in which each row represents one gene set: the first field is the gene set name, the second is a description, and the remaining fields are the genes. (In the related GMX format, each column represents a gene set.) We can save the GO terms and gene sets to a *\*.gmt* file using the following function:

In [12]:
# A function to save the GO terms with geneset to the local repository
writeGMT <- function(genesets, descriptions, outfile) {

  if (file.exists(outfile)) {
    file.remove(outfile)
  }
  for (gs in names(genesets)) {
    write(c(gs, gsub("\t", " ", descriptions[[gs]]), genesets[[gs]]), file=outfile, sep="\t", append=TRUE, ncolumns=length(genesets[[gs]]) + 2)
  }
}
outfile <- "./data/GO_terms.gmt"
writeGMT(allGO, descriptions, outfile)
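
To check that a GMT file has the expected layout, it can be read back into R. The sketch below uses a hypothetical helper, `readGMT`, written in base R for this example (it is not part of any package used in this module), and demonstrates it on a small temporary file with made-up gene sets.

```r
# Hypothetical helper: read a GMT file into gene sets and descriptions.
readGMT <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  # Field 1 is the set name, field 2 the description, the rest are genes.
  genesets <- lapply(fields, function(f) f[-(1:2)])
  names(genesets) <- vapply(fields, function(f) f[1], character(1))
  descriptions <- vapply(fields, function(f) f[2], character(1))
  names(descriptions) <- names(genesets)
  list(genesets = genesets, descriptions = descriptions)
}

# Write a tiny GMT file with two made-up gene sets, then read it back.
tmp <- tempfile(fileext = ".gmt")
writeLines(c("SET_A\tfirst toy set\tGENE1\tGENE2\tGENE3",
             "SET_B\tsecond toy set\tGENE2\tGENE4"),
           tmp)
gmt <- readGMT(tmp)

lengths(gmt$genesets)
gmt$descriptions
```

Pointing `readGMT()` at `./data/GO_terms.gmt` would give a quick way to confirm that each row holds a GO ID, its description, and its genes.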

```{admonition} Saving data to the Google Cloud Bucket
gsutil cp ./data/GO_terms.gmt gs://cpa-output
```

## Kyoto Encyclopedia of Genes and Genomes (KEGG)
### Overview
KEGG is a collection of databases dealing with genomes, biological pathways, diseases, drugs, and chemical substances. It is used for bioinformatics research and education, including data analysis in genomics, metagenomics, metabolomics, and other omics studies, modeling and simulation in systems biology, and translational research in drug development.

The KEGG database project was initiated in 1995 by Minoru Kanehisa, professor at the Institute for Chemical Research, Kyoto University, under the then ongoing Japanese Human Genome Program. Foreseeing the need for a computerized resource that could be used for biological interpretation of genome sequence data, he started developing the KEGG PATHWAY database, a collection of manually drawn pathway maps representing experimental knowledge on metabolism and various other functions of the cell and the organism. Each pathway map contains a network of molecular interactions and reactions and is designed to link genes in the genome to gene products (mostly proteins) in the pathway. This enables the analysis called KEGG pathway mapping, whereby the gene content of a genome is compared with the KEGG PATHWAY database to examine which pathways and associated functions are likely to be encoded in the genome.

KEGG is a "computer representation" of the biological system. It integrates building blocks and wiring diagrams of the system: more specifically, genetic building blocks of genes and proteins, chemical building blocks of small molecules and reactions, and wiring diagrams of molecular interaction and reaction networks. The structure of KEGG is illustrated in the figure below.
![](./images/Module3/KEGG.jpg)


In [5]:
suppressMessages({
suppressWarnings(library(IRdisplay))
suppressWarnings(IRdisplay::display_html(data = NULL, file ="Quizzes/Quiz_Submodule3-2.html"))
})

### Retrieving pathways from KEGG databases
In this section, we will retrieve pathways and their gene sets from the KEGG database from within R. We will use the `KEGGREST` R package, which provides a client interface to the KEGG REST server. `KEGGREST` can be installed from Bioconductor using the following command.

In [13]:
suppressMessages({if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
  suppressWarnings(BiocManager::install("KEGGREST", update = F))
})
suppressPackageStartupMessages({
  library(KEGGREST)
})

KEGG contains a number of databases. To get an idea of what is available, run `listDatabases()`:

In [14]:
KEGGREST::listDatabases()

We can use these databases in further queries. Note that in many cases you can also use a three-letter KEGG organism code or a “T number” (genome identifier) in the same place you would use one of these database names.

We can obtain the list of organisms available in KEGG with the `keggList()` function:

In [15]:
organism <- keggList("organism")

In [16]:
print(paste0("KEGG supports ",dim(organism)[1]," organisms"))

[1] "KEGG supports 8668 organisms"


To view the supported organisms we can use the following command:

In [17]:
# View the first few supported organisms
head(organism)

| T.number | organism | species | phylogeny |
|---|---|---|---|
| T01001 | hsa | Homo sapiens (human) | Eukaryotes;Animals;Vertebrates;Mammals |
| T01005 | ptr | Pan troglodytes (chimpanzee) | Eukaryotes;Animals;Vertebrates;Mammals |
| T02283 | pps | Pan paniscus (bonobo) | Eukaryotes;Animals;Vertebrates;Mammals |
| T02442 | ggo | Gorilla gorilla gorilla (western lowland gorilla) | Eukaryotes;Animals;Vertebrates;Mammals |
| T01416 | pon | Pongo abelii (Sumatran orangutan) | Eukaryotes;Animals;Vertebrates;Mammals |
| T03265 | nle | Nomascus leucogenys (northern white-cheeked gibbon) | Eukaryotes;Animals;Vertebrates;Mammals |


In submodule 02, we performed DE analysis on a human dataset, so we need to download the human pathways. The KEGG organism code for human is `hsa`, and we can use the `keggList` function to get the list of pathways.

In [6]:
suppressMessages({
suppressWarnings(library(IRdisplay))
suppressWarnings(IRdisplay::display_html(data = NULL, file ="Quizzes/Quiz_Submodule3-3.html"))
})

In [19]:
# Obtain the pathways that belong to human
pathways.list <- keggList("pathway", "hsa")

Each entry in the pathway list pairs a pathway code with its description. To see the first five pathways, we can use the following command:

In [20]:
# View the first five pathways
pathways.list[1:5]

We can see that, for each entry, the name holds the pathway code, prefixed with `path:`, while the text in quotation marks holds the pathway description. To extract the pathway codes from the pathway list, we can use the following command:

In [9]:
suppressMessages({
suppressWarnings(library(IRdisplay))
suppressWarnings(IRdisplay::display_html(data = NULL, file ="Quizzes/Quiz_Submodule3-4.html"))
})

In [22]:
# Retrieve all the pathway IDs that belong to human
pathway.codes <- sub("path:", "", names(pathways.list))
pathway.codes
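
The prefix handling can be tried without a network connection on a small named vector shaped like the pathway list described above. The two entries are written by hand for illustration. Anchoring the pattern with `^` means that names which already lack the `path:` prefix are left unchanged, which may be useful if the installed `KEGGREST` version returns bare codes (an assumption worth checking against your own output).

```r
# Hand-written stand-in for the pathway list: names are codes, values are descriptions.
toy_pathways <- c(
  "path:hsa00010" = "Glycolysis / Gluconeogenesis - Homo sapiens (human)",
  "path:hsa00020" = "Citrate cycle (TCA cycle) - Homo sapiens (human)"
)

# Strip the "path:" prefix only when it appears at the start of the name.
codes <- sub("^path:", "", names(toy_pathways))

# Optionally strip the organism suffix from the descriptions as well.
titles <- sub(" - Homo sapiens \\(human\\)$", "", toy_pathways)
names(titles) <- codes

codes
titles
```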

We can use the following command to check how many pathways are available for human.

In [23]:
print(paste0("Number of available pathways for human are: ", length(pathway.codes)))

[1] "Number of available pathways for human are: 352"


The following code retrieves the list of genes and the description for every pathway available for human.

In [8]:
# Get the gene symbols for each pathway
genes.by.pathway <- sapply(pathway.codes,
                           function(pwid){
                             pw <- keggGet(pwid)
                             if (is.null(pw[[1]]$GENE)) return(NA)
                             # GENE alternates between gene IDs and "SYMBOL; description" entries; keep the latter
                             pw2 <- pw[[1]]$GENE[c(FALSE,TRUE)]
                             # Keep only the gene symbol before the first ";"
                             pw2 <- unlist(lapply(strsplit(pw2, split = ";", fixed = TRUE), function(x)x[1]))
                             return(pw2)
                           }
)
# Get the description (NAME field) of each pathway
description.by.pathway <- sapply(pathway.codes,
                                 function(pwid){
                                   pw <- keggGet(pwid)
                                   if (is.null(pw[[1]]$NAME)) return(NA)
                                   pw2 <- pw[[1]]$NAME
                                   return(pw2)
                                 }
)
# Convert the pathway descriptions to a list
description.by.pathway <- as.list(description.by.pathway)
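
The indexing with `c(FALSE, TRUE)` can be hard to read at first, so here is a small offline sketch of what it does. The vector below is written by hand to mimic the alternating layout of the `GENE` field (a gene ID followed by a "SYMBOL; description" string); treat the values as illustrative rather than as an exact copy of a KEGG record.

```r
# Hand-written stand-in for pw[[1]]$GENE: IDs and "SYMBOL; description" alternate.
toy_gene_field <- c(
  "3101", "HK3; hexokinase 3 [KO:K00844] [EC:2.7.1.1]",
  "3098", "HK1; hexokinase 1 [KO:K00844] [EC:2.7.1.1]"
)

# The logical index is recycled: c(TRUE, FALSE) keeps positions 1, 3, ...
# and c(FALSE, TRUE) keeps positions 2, 4, ...
entrez_ids  <- toy_gene_field[c(TRUE, FALSE)]
symbol_desc <- toy_gene_field[c(FALSE, TRUE)]

# Keep only the text before the first ";" to get the gene symbol.
symbols <- vapply(strsplit(symbol_desc, ";", fixed = TRUE),
                  function(x) x[1], character(1))

entrez_ids
symbols
```

Keeping `entrez_ids` alongside the symbols would allow gene sets to be built from Entrez IDs instead, should a later analysis need them.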



We can view the first five pathways with their gene sets using the following command.

In [22]:
# View the first five pathways with their gene sets
genes.by.pathway[1:5]

Use the following command to see the descriptions of the first five pathways.

In [7]:
# View the description of the first five pathways
description.by.pathway[1:5]



Then we can save the output to a *\*.gmt* file using the following commands.

In [None]:
# Saving the pathway information to the local repository
outfile <- "./data/KEGG_pathways.gmt"
writeGMT(genes.by.pathway, description.by.pathway, outfile)


```{admonition} Saving data to the Google Cloud Bucket
gsutil cp ./data/KEGG_pathways.gmt gs://cpa-output
```


In the next submodule, we will perform pathway analysis.
