# Perform Enrichment

Author: Ashley Schwartz

Date: July 17, 2023

## Purpose and Background

This tutorial explains how to perform gene set enrichment testing, a method for assessing predefined, biologically relevant gene sets such as pathways or biological processes. The goal is to determine whether these sets contain more significant genes from an experimental dataset than would be expected by random chance.

Given a differential expression dataset in which each gene has an associated p-value, we can identify the gene sets (concepts) whose genes are more significant than expected at random. Testing can cover every concept ID within a concept, or only the few you are interested in.
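To make "more than expected by random chance" concrete, the sketch below computes a one-sided hypergeometric (over-representation) p-value for a single gene set. It is a generic illustration with made-up counts and is not necessarily the statistical test this package runs internally; the function name `hypergeom_upper_tail` is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(universe, significant, set_size, overlap):
    """Return P(X >= overlap) for a gene set of `set_size` genes drawn from
    `universe` genes, of which `significant` are significant.
    Hypothetical helper for illustration only."""
    total = comb(universe, set_size)
    upper = min(set_size, significant)
    tail = sum(
        comb(significant, k) * comb(universe - significant, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Made-up numbers for illustration only
universe = 1000     # genes measured in the experiment
significant = 100   # genes called significant
set_size = 50       # genes in the gene set
overlap = 12        # significant genes that fall in the gene set

expected = set_size * significant / universe
p_value = hypergeom_upper_tail(universe, significant, set_size, overlap)

print(f"expected overlap by chance: {expected:.1f}")
print(f"observed overlap: {overlap}, p-value: {p_value:.4g}")
```

An observed overlap well above the expected overlap yields a small p-value, which is the sense in which a gene set is called enriched.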

A major limitation of many enrichment methods is the limited annotation available for zebrafish. To overcome this, we have developed a new organism type, 'dreM', that mirrors the human 'hsa' organism using zebrafish genes. For example, at the time of writing this tutorial, the KEGG database has 355 annotated pathways for humans (hsa) but only 177 for zebrafish (dre). We have mapped all genes in the 355 hsa pathways to zebrafish genes to create 355 dreM pathways.

If you would like to reference true zebrafish pathways, choose `org = dre`.

If you would like to reference mapped zebrafish pathways, choose `org = dreM`.
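The idea behind the mapped organism can be sketched as follows: take the human gene list of a pathway, translate each gene through a human-to-zebrafish ortholog table, and store the result under a `dreM` ID. Every pathway ID and gene ID below is a made-up placeholder, and `build_mapped_pathways` is a hypothetical helper, not the package's actual mapping routine.

```python
def build_mapped_pathways(hsa_pathways, orthologs):
    """Translate human pathway gene lists into zebrafish gene lists.
    Hypothetical helper for illustration only."""
    mapped = {}
    for pathway_id, human_genes in hsa_pathways.items():
        zebrafish_genes = set()
        for gene in human_genes:
            # A human gene may map to zero, one, or several zebrafish genes
            zebrafish_genes.update(orthologs.get(gene, []))
        # Swap the organism prefix: hsa -> dreM
        mapped_id = pathway_id.replace("hsa", "dreM", 1)
        mapped[mapped_id] = sorted(zebrafish_genes)
    return mapped


# Placeholder data (not real gene IDs or pathway contents)
hsa_pathways = {"hsa00001": ["H1", "H2", "H3"]}
orthologs = {"H1": ["Z1a", "Z1b"], "H2": ["Z2"]}  # H3 has no ortholog here

dreM_pathways = build_mapped_pathways(hsa_pathways, orthologs)
print(dreM_pathways)
```

Because the pathway definitions come from the human annotation, the number of mapped pathways matches the number of human pathways even when a true zebrafish pathway does not exist.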

Some key definitions:

| Term | Definition/Description |
| - | - |
| concept | a database resource such as KEGG or GO |
| KEGG | a concept; see https://www.kegg.jp/kegg/pathway.html |
| KEGG pathway | `pathway` is a database within the KEGG concept |
| KEGG disease | `disease` is a database within the KEGG concept |

In this tutorial we will be using the following key elements:

| Item | Description |
|-|-|
| `data/test_data/TPP.txt` | A differential expression dataset containing Gene IDs, Log2FC values, and associated p-values |

You do not need a strong foundation in Python to run the code in this tutorial, but a general understanding of absolute and relative paths is useful.
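As a quick refresher, a relative path is interpreted from the current working directory, while an absolute path starts at the filesystem root. A minimal sketch using `pathlib`; the paths are the ones used later in this tutorial and do not need to exist for the snippet to run:

```python
from pathlib import Path

cwd = Path.cwd()                            # absolute path of the working directory
relative = Path("data/test_data/TPP.txt")   # relative: depends on where you run from
absolute = cwd / relative                   # absolute: anchored at the filesystem root

print(relative.is_absolute())  # False
print(absolute.is_absolute())  # True

# '..' climbs one directory level before descending again
package_folder = (cwd / Path("../src/danRerlib")).resolve()
print(package_folder)
```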


## Set up Python environment

In [1]:
# IMPORT PYTHON PACKAGES
# ----------------------

# makes the notebook cell print all outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
# path packages
import sys
from pathlib import Path
# data processing packages
import pandas as pd

In [2]:
# SET UP MY LOCAL PACKAGE
# -----------------------
# this step is only needed because the local package has not been released through pip

cwd = Path().absolute()

package_folder = cwd / Path('../src/danRerlib')
sys.path.append(str(package_folder))
import mapping, KEGG, enrichment, utils

# SET UP DATA DIRECTORY
# ---------------------
test_data_dir = cwd / Path('data/test_data/')
out_data_dir = cwd / Path('data/out_data/')

# note: I am using the Path package to take care of any operating
#       system differences for users of this tutorial

## Find KEGG Enrichment 

### KEGG Pathway Enrichment

_Purpose: Given a dataset with genes and associated p-values for significant differential expression, determine the enriched KEGG pathways._

__Step 1:__ Read your data into the workspace. The supported format is a dataset with columns 'NCBI Gene ID' and 'PValue'. We will be using the test dataset at the relative path `data/test_data/TPP.txt`, an example differential expression dataset.

In [3]:
file_path = test_data_dir / Path('TPP.txt')
tpp_df = pd.read_csv(file_path, sep='\t')

We can also quickly print some stats to see what we are working with.

In [4]:
rows, cols = tpp_df.shape
print(f"- The column names for this dataset are: {tpp_df.columns.values}")
print(f'- The data has {rows} entries (genes).')

- The column names for this dataset are: ['NCBI Gene ID' 'PValue' 'logFC']
- The data has 21854 entries (genes).
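Before running enrichment, it can help to confirm that the expected columns are present and to see how many genes fall below a significance cutoff. A minimal sketch, using a tiny made-up table in place of `tpp_df` and an assumed cutoff of 0.05 (the cutoff the package applies is not stated in this tutorial):

```python
import pandas as pd

# Tiny made-up stand-in for tpp_df (values are not from TPP.txt)
demo_df = pd.DataFrame({
    "NCBI Gene ID": [1001, 1002, 1003, 1004],
    "PValue": [0.001, 0.20, 0.04, 0.75],
    "logFC": [1.8, -0.2, -1.1, 0.05],
})

# Check that the columns named in Step 1 are present
required = {"NCBI Gene ID", "PValue"}
missing = required - set(demo_df.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")

cutoff = 0.05  # assumed significance cutoff, for illustration
n_sig = int((demo_df["PValue"] < cutoff).sum())
print(f"{n_sig} of {len(demo_df)} genes have PValue < {cutoff}")
```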


__Step 2:__ Specify the Gene ID type currently used in your dataset. 

In [5]:
gene_id_type = 'NCBI Gene ID'

As you'll notice, this matches the first column name in the dataset. A quick reminder: the Gene ID type must be one of the supported types and is case- and spelling-sensitive. The options are NCBI Gene ID, ZFIN ID, Symbol, and Ensembl ID. Many of the databases use the NCBI Gene ID, so that is usually the preferred format, but any supported ID type will work here.
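Because the match is case- and spelling-sensitive, a small guard can catch typos early. The sketch below uses the four options listed above; `check_gene_id_type` is a hypothetical helper and not part of the package:

```python
SUPPORTED_ID_TYPES = ("NCBI Gene ID", "ZFIN ID", "Symbol", "Ensembl ID")


def check_gene_id_type(gene_id_type):
    """Return the ID type unchanged if it is supported, else raise.
    Hypothetical helper; the comparison is exact and case-sensitive."""
    if gene_id_type not in SUPPORTED_ID_TYPES:
        raise ValueError(
            f"'{gene_id_type}' is not supported; choose one of {SUPPORTED_ID_TYPES}"
        )
    return gene_id_type


print(check_gene_id_type("NCBI Gene ID"))

try:
    check_gene_id_type("ncbi gene id")  # wrong case, so it is rejected
except ValueError as err:
    print(err)
```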

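__Step 3:__ Launch the `enrich_KEGG` function, which performs enrichment against the KEGG database. Called with only the dataset, as here, it returns results for the KEGG pathway database, and the concept IDs in the output carry the `dreM` prefix of the mapped zebrafish pathways.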

In [6]:
out = enrichment.enrich_KEGG(tpp_df)

In [7]:
out

| | Concept Type | Concept ID | # Genes in Concept in Universe | # Sig Genes Belong to Concept | Proportion of Genes | Coeff | P-value | FDR | Odds Ratio | Enriched |
|-|-|-|-|-|-|-|-|-|-|-|
| 173 | KEGG pathway | dreM04550 | 155 | 47 | 0.303226 | 0.255060 | 4.961454e-07 | 4.961454e-07 | 1.290539 | enriched |
| 272 | KEGG pathway | dreM05225 | 173 | 56 | 0.323699 | 0.236088 | 2.403849e-06 | 2.403849e-06 | 1.266286 | enriched |
| 235 | KEGG pathway | dreM04961 | 64 | 18 | 0.281250 | 0.314233 | 2.836166e-06 | 2.836166e-06 | 1.369209 | enriched |
| 10 | KEGG pathway | dreM00030 | 28 | 14 | 0.500000 | 0.379057 | 5.718199e-06 | 5.718199e-06 | 1.460907 | enriched |
| 189 | KEGG pathway | dreM04659 | 93 | 29 | 0.311828 | 0.275132 | 8.003806e-06 | 8.003806e-06 | 1.316705 | enriched |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 230 | KEGG pathway | dreM04979 | 61 | 12 | 0.196721 | 0.183676 | 4.658255e-02 | 4.658255e-02 | 1.201626 | enriched |
| 137 | KEGG pathway | dreM04668 | 110 | 27 | 0.245455 | 0.147690 | 4.661458e-02 | 4.661458e-02 | 1.159154 | enriched |
| 335 | KEGG pathway | dreM05034 | 120 | 29 | 0.241667 | 0.142774 | 4.671743e-02 | 4.671743e-02 | 1.153469 | enriched |
| 170 | KEGG pathway | dreM04520 | 118 | 31 | 0.262712 | 0.142940 | 4.822297e-02 | 4.822297e-02 | 1.153660 | enriched |
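Two of the derived columns can be checked by hand. In the rows shown, `Proportion of Genes` equals the number of significant genes divided by the number of genes in the concept, and `Odds Ratio` equals the exponential of `Coeff`. The sketch below verifies both for the first row (dreM04550). That these relationships hold for every row is an inference from the printed values, not a documented guarantee.

```python
import math

# Values copied from the first output row (dreM04550)
genes_in_concept = 155
sig_genes = 47
coeff = 0.255060

proportion = sig_genes / genes_in_concept
odds_ratio = math.exp(coeff)

print(f"proportion = {proportion:.6f}")  # 0.303226, as in 'Proportion of Genes'
print(f"odds ratio = {odds_ratio:.6f}")  # 1.290539, as in 'Odds Ratio'
```

An odds ratio above 1 corresponds to a positive coefficient, which is consistent with every row shown being labeled `enriched`.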
