# Functional Enrichment Analysis

Author: Ashley Schwartz

Date: October 2023

## Purpose and Background

This tutorial shows how to retrieve the genes in a particular pathway from the Kyoto Encyclopedia of Genes and Genomes (KEGG), retrieve the genes annotated to a particular function in Gene Ontology (GO), and perform gene enrichment analysis on these gene sets using several methods.

**Databases Supported**

| Database   | Abbreviation | Description   | Link |
|------|--------------|------------------|-------------------|
| Kyoto Encyclopedia of Genes and Genomes Pathway Database | KEGG Pathway | Molecular pathways and interactions              | [link](https://www.genome.jp/kegg/pathway.html)        |
| Kyoto Encyclopedia of Genes and Genomes Disease Database | KEGG Disease | Genes and pathways associated with diseases      | [link](https://www.genome.jp/kegg/disease/)           |
| Gene Ontology Biological Process Database                | GO BP        | Molecular events in biological processes         | [link](http://geneontology.org/docs/ontology-documentation/#biological-process-ontology-bp) |
| Gene Ontology Cellular Component Database                | GO CC        | Cellular structures and locations                | [link](http://geneontology.org/docs/ontology-documentation/#cellular-component-ontology-cc) |
| Gene Ontology Molecular Function Database                | GO MF        | Specific activities or functions of gene products | [link](http://geneontology.org/docs/ontology-documentation/#molecular-function-ontology-mf) |

**Database Naming Conventions**

Each database follows its own naming conventions for pathways and functions. The examples below give a quick overview of how each database identifies a particular pathway or function:

| Database      | ID Format      | Example ID   | Pathway/Function Name | Notes |
|---------------|----------------|--------------|-----------------------|-------|
| KEGG Pathway  | Numeric with Prefix | dre00120 (Zebrafish), hsa00120 (Human)| Primary bile acid biosynthesis | KEGG pathway IDs are species-specific.|
| KEGG Disease  | Alphanumeric  | H00001 (Human) | B-cell acute lymphoblastic leukemia | KEGG disease IDs are defined for human only; there are no zebrafish-specific disease entries.|
| GO BP         | Alphanumeric  | GO:0007165 | Signal transduction | GO BP IDs are shared across species; only the gene annotations attached to a term are species-specific.|
| GO CC         | Alphanumeric  | GO:0005634 | Nucleus | GO CC IDs are shared across species; only the gene annotations attached to a term are species-specific.|
| GO MF         | Alphanumeric  | GO:0003824 | Catalytic activity | GO MF IDs are shared across species; only the gene annotations attached to a term are species-specific.|
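
The ID formats in the table can be told apart with a few regular expressions. The sketch below is only an illustration: the patterns are inferred from the example IDs shown here, and `classify_id` and `split_kegg_pathway_id` are hypothetical helpers, not part of `danRerLib`.

```python
import re

# Patterns inferred from the example IDs in the table (assumptions,
# not an official grammar for either database).
KEGG_PATHWAY = re.compile(r"^(?P<org>[a-z]{3,4})?(?P<num>\d{5})$")
KEGG_DISEASE = re.compile(r"^H\d{5}$")
GO_TERM = re.compile(r"^GO:\d{7}$")


def classify_id(identifier: str) -> str:
    """Return a rough database label for an identifier (hypothetical helper)."""
    if GO_TERM.match(identifier):
        return "GO"
    if KEGG_DISEASE.match(identifier):
        return "KEGG Disease"
    if KEGG_PATHWAY.match(identifier):
        return "KEGG Pathway"
    return "unknown"


def split_kegg_pathway_id(identifier: str):
    """Split 'dre00120' into ('dre', '00120'); the prefix is None if absent."""
    match = KEGG_PATHWAY.match(identifier)
    if match is None:
        raise ValueError(f"not a KEGG pathway id: {identifier!r}")
    return match.group("org"), match.group("num")


for example in ["dre00120", "hsa00120", "H00001", "GO:0007165"]:
    print(example, "->", classify_id(example))
```

Note that a GO ID alone does not reveal whether the term belongs to BP, CC, or MF; that information lives in the ontology itself.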

**Enrichment Methods Supported**

| Enrichment Method     | Description                                                      |
|-----------------------|------------------------------------------------------------------|
| Fisher's Exact Test   | A statistical method for assessing the significance of gene enrichment within predefined functional categories or pathways using categorical gene set data. |
| Logistic Regression    | A modeling technique for predicting gene enrichment in specific functional categories, offering flexibility to handle diverse data types within gene sets. |

**Organisms Supported**

`danRerLib` is built for zebrafish and supports three organism types:

| Organism       | Abbreviation | Description                               |
| ---------------|--------------| ------------------------------------------|
| Zebrafish      | 'dre'        | The zebrafish taxonomy                     |
| Human          | 'hsa'        | The human taxonomy                         |
| Mapped Zebrafish | 'dreM'      | An organism defined through orthology     |


## Requirements

This tutorial uses:
- the `danrerlib` Python package
    - see the install notes if it is not currently installed.


In [1]:
# IMPORT PYTHON PACKAGE
# ---------------------
from danrerlib import KEGG, utils, GO
import pandas as pd

## Download Gene Sets

If you want to identify the genes within a gene set, you can download the genes for a given KEGG pathway id, KEGG disease id, or GO id (BP, CC, or MF). A key advantage of `danRerLib` is that it can also generate the gene list for a pathway that exists for human but not for zebrafish, and then map those genes to zebrafish genes if desired.

### KEGG Pathway

_Purpose: Given a KEGG pathway id, retrieve a list of all genes in said pathway._

As a reminder, `danRerLib` supports three organism types (see the background section for more information): `hsa`, `dre`, and `dreM`. You are likely most interested in the zebrafish genes, but human genes are also provided for comparison. Let's look at the KEGG pathway `00400`, phenylalanine, tyrosine and tryptophan biosynthesis. This pathway exists for both human (`hsa00400`) and zebrafish (`dre00400`). Here are a few examples of gathering the genes in this pathway for different organisms:

In [2]:
kegg_id = '00400'
human_genes = KEGG.get_genes_in_pathway(kegg_id, 'hsa')
human_genes

Unnamed: 0,Human NCBI Gene ID
0,137362
1,259307
2,2805
3,2806
4,5053
5,6898


In [3]:
dre_genes = KEGG.get_genes_in_pathway(kegg_id, 'dre')
dre_genes

Unnamed: 0,NCBI Gene ID
0,335974
1,337166
2,378962
3,406330
4,406688
5,561410
6,791730


In [4]:
mapped_genes = KEGG.get_genes_in_pathway(kegg_id, 'dreM')
mapped_genes

Unnamed: 0,NCBI Gene ID
0,561410
1,335974
2,337166
3,378962
4,406330
5,406688
6,791730
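
For this pathway, the `dre` and `dreM` outputs list the same genes in a different row order. A quick way to confirm this is to compare the two lists as sets. The sketch below copies the IDs from the outputs above into plain Python lists so that it runs without `danRerLib`; in general the two organism types are not guaranteed to agree.

```python
# NCBI Gene IDs copied from the 'dre' and 'dreM' outputs above.
dre_ids = [335974, 337166, 378962, 406330, 406688, 561410, 791730]
mapped_ids = [561410, 335974, 337166, 378962, 406330, 406688, 791730]

# Sets ignore row order, so only membership is compared.
shared = set(dre_ids) & set(mapped_ids)
only_dre = set(dre_ids) - set(mapped_ids)
only_mapped = set(mapped_ids) - set(dre_ids)

print(f"shared: {len(shared)}, dre only: {len(only_dre)}, dreM only: {len(only_mapped)}")
# prints "shared: 7, dre only: 0, dreM only: 0"
```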


It is also possible to include the organism prefix in the KEGG id itself and omit the separate organism argument. In the example below, we pass only the full KEGG id.

In [5]:
KEGG.get_genes_in_pathway('hsa00400')

Unnamed: 0,Human NCBI Gene ID
0,137362
1,259307
2,2805
3,2806
4,5053
5,6898


### KEGG Disease

_Purpose: Given a KEGG disease id, retrieve a list of all genes in said disease._

We can also get the list of genes for a particular disease listed in the KEGG database. This function takes the disease id and the organism. Note that the zebrafish gene list in this case comes from orthology mapping, because disease gene lists are not annotated for zebrafish within the KEGG database.

In [6]:
# human organism
KEGG.get_genes_in_disease('H00001', 'hsa')

Unnamed: 0,Human NCBI Gene ID
0,25
1,4297
2,4299
3,6929
4,5087
5,861
6,4609
7,64109
8,5079


In [7]:
# zebrafish organism
KEGG.get_genes_in_disease('H00001', 'dre')

Unnamed: 0,NCBI Gene ID
0,100000720
1,557048
2,100537394
3,664768
4,30310
5,58138
6,570960
7,58126
8,393141
9,30686


As described in the mapping tutorial, the number of genes in the human gene set is not always equal to the number of genes in the zebrafish set, because orthology between human and zebrafish is not always a 1:1 mapping. Here, for example, 9 human genes map to 10 zebrafish genes.
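
The effect of a mapping that is not 1:1 can be sketched with a toy orthology table. Everything below is hypothetical: the gene names are placeholders and `map_gene_set` is an illustrative helper, not the mapping routine used by `danRerLib`.

```python
# Hypothetical orthology table with placeholder IDs (not real genes).
ORTHOLOGS = {
    "human_A": ["zfish_A1", "zfish_A2"],  # one human gene, two zebrafish orthologs
    "human_B": ["zfish_B"],               # one-to-one
    "human_C": [],                        # no known zebrafish ortholog
    "human_D": ["zfish_D1", "zfish_D2"],  # one human gene, two zebrafish orthologs
}


def map_gene_set(human_genes, orthologs):
    """Map human genes to zebrafish genes, dropping duplicates but keeping order."""
    mapped = []
    for gene in human_genes:
        for ortholog in orthologs.get(gene, []):
            if ortholog not in mapped:
                mapped.append(ortholog)
    return mapped


human_set = ["human_A", "human_B", "human_C", "human_D"]
zebrafish_set = map_gene_set(human_set, ORTHOLOGS)
print(len(human_set), len(zebrafish_set))  # prints "4 5"
```

Duplicated orthologs enlarge the mapped set, while genes without an ortholog shrink it, so the two sizes can differ in either direction.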

### Gene Ontology

_Purpose: Given a GO concept id, retrieve a list of all genes in said GO concept._

`danRerLib` supports Gene Ontology Biological Processes (BP), Cellular Components (CC), and Molecular Functions (MF). There is one primary function in the GO module that retrieves the genes in any given concept, regardless of whether it comes from BP, CC, or MF. This can be done for any supported organism (`hsa`, `dre`, or `dreM`). Retrieval of the genes for GO id `GO:0033554`, cellular response to stress, is shown below for zebrafish.

In [9]:
GO.get_genes_in_GO_concept('GO:0033554', 'dre')

Unnamed: 0,ZFIN ID
0,ZDB-GENE-030131-8638
1,ZDB-GENE-091204-246
2,ZDB-GENE-040718-255
3,ZDB-GENE-050320-35
4,A0A2R8S050
5,ZDB-GENE-030131-6701
6,ZDB-GENE-040825-4
7,ZDB-GENE-030131-1096
8,ZDB-GENE-030131-9531
9,ZDB-GENE-030826-18
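
The column is labeled `ZFIN ID`, but one of the rows shown (`A0A2R8S050`) does not follow the `ZDB-GENE-...` form of the others and looks more like a protein accession. If downstream steps expect ZFIN gene IDs only, it may help to separate the two kinds of identifier first. The pattern below is an assumption inferred from the rows displayed here, not an official ZFIN specification.

```python
import re

# IDs copied from the output above.
ids = [
    "ZDB-GENE-030131-8638", "ZDB-GENE-091204-246", "ZDB-GENE-040718-255",
    "ZDB-GENE-050320-35", "A0A2R8S050", "ZDB-GENE-030131-6701",
    "ZDB-GENE-040825-4", "ZDB-GENE-030131-1096", "ZDB-GENE-030131-9531",
    "ZDB-GENE-030826-18",
]

# Assumed pattern, inferred from the rows shown: ZDB-GENE-<6 digits>-<digits>.
ZFIN_GENE = re.compile(r"^ZDB-GENE-\d{6}-\d+$")

zfin_ids = [i for i in ids if ZFIN_GENE.match(i)]
other_ids = [i for i in ids if not ZFIN_GENE.match(i)]

print(len(zfin_ids), other_ids)  # prints "9 ['A0A2R8S050']"
```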
