
# **GSEA-InContext**

The theory and approach behind **GSEA-InContext** are described in the following publication:

- **GSEA-InContext: identifying novel and common patterns in expression experiments**  
  *Rani K. Powers, Andrew Goodspeed, Harrison Pielke-Lombardo, Aik-Choon Tan, James C. Costello*  
  Bioinformatics, Volume 34, Issue 13, July 2018, Pages i555–i564  
  [DOI: 10.1093/bioinformatics/bty271](https://doi.org/10.1093/bioinformatics/bty271)  
  Published: 27 June 2018


## **Input File 1: `.rnk` Files**

### **What Are `.rnk` Files?**  

A **`.rnk` (rank) file** is a tab-delimited text file used as input for **Gene Set Enrichment Analysis (GSEA)**, particularly in **GSEA Preranked analysis**. It contains **ranked genes** based on a specific metric (e.g., fold change, correlation, t-statistic).


---

## **Structure of an `.rnk` File**  

Each `.rnk` file typically has **two columns**:  

| Gene Symbol / ID | Ranking Score |
|------------------|--------------|
| GeneA           | 2.34         |
| GeneB           | -1.56        |
| GeneC           | 0.87         |
| GeneD           | -3.21        |

- **First column** → Gene name or Entrez ID  
- **Second column** → A ranking metric (e.g., log2 fold-change, correlation with phenotype)  
- **No header row**  
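The layout above can be produced from a results table with pandas. The following is a minimal sketch: the gene names and values are the illustrative ones from the table, and the column names `gene` and `log2fc` are assumptions, not part of any required schema.

```python
import io
import pandas as pd

# Hypothetical differential-expression results (illustrative values only).
de = pd.DataFrame({
    "gene": ["GeneA", "GeneB", "GeneC", "GeneD"],
    "log2fc": [2.34, -1.56, 0.87, -3.21],
})

# Sort from the most up-regulated gene to the most down-regulated gene.
ranked = de.sort_values("log2fc", ascending=False)

# Write two tab-separated columns with no header row and no index column.
buffer = io.StringIO()
ranked.to_csv(buffer, sep="\t", header=False, index=False)
rnk_text = buffer.getvalue()
print(rnk_text)
```

To write to disk instead of an in-memory buffer, pass a file path such as `"my_list.rnk"` (a hypothetical name) as the first argument of `to_csv`.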

---

## **Where Are `.rnk` Files Used?**  

### 1️⃣ **GSEA Preranked Analysis**  
   - Used when you have a predefined ranking of genes instead of raw expression data.  
   - Helps determine if certain gene sets are **enriched** at the top or bottom of the ranked list.  
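The idea of enrichment at the top or bottom of a ranked list can be illustrated with the weighted running-sum statistic used by standard GSEA. This is a simplified sketch for intuition, not the code that `gseapy` runs; the function name `enrichment_score` and the toy genes are assumptions.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Return the running-sum enrichment score for one gene set.

    ranked_genes and scores must already be sorted by descending score.
    """
    in_set = np.array([g in gene_set for g in ranked_genes])
    n_hits = in_set.sum()
    if n_hits == 0 or n_hits == len(ranked_genes):
        raise ValueError("gene set must overlap the list without covering it")
    # Genes in the set push the sum up in proportion to |score| ** weight.
    hit_weights = np.abs(np.asarray(scores, dtype=float)) ** weight * in_set
    step_up = hit_weights / hit_weights.sum()
    # Genes outside the set push the sum down by a constant amount.
    step_down = (~in_set) / (len(ranked_genes) - n_hits)
    running = np.cumsum(step_up - step_down)
    # The score is the running-sum value farthest from zero.
    return running[np.argmax(np.abs(running))]

# Toy ranked list (made-up gene names and scores).
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top = enrichment_score(genes, scores, {"G1", "G2"})
es_bottom = enrichment_score(genes, scores, {"G5", "G6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list yields a positive score and a set concentrated at the bottom yields a negative one.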

### 2️⃣ **Differential Gene Expression Analysis (DGEA)**  
   - If you compare **two conditions** (e.g., Control vs. Disease), genes are ranked by **log fold-change** or **statistical significance**.  
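One common way to combine direction and significance into a single ranking metric is a signed significance score. The sketch below assumes a results table with hypothetical columns `log2fc` and `pvalue`; the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical results table; column names and values are illustrative.
de = pd.DataFrame({
    "gene": ["GeneA", "GeneB", "GeneC"],
    "log2fc": [1.5, -2.0, 0.2],
    "pvalue": [0.001, 0.01, 0.5],
})

# Signed significance: large and positive for strongly up-regulated genes,
# large and negative for strongly down-regulated genes.
de["signed_sig"] = np.sign(de["log2fc"]) * -np.log10(de["pvalue"])
ranked = de.sort_values("signed_sig", ascending=False).reset_index(drop=True)
print(ranked[["gene", "signed_sig"]])
```

Ranking by log fold-change alone ignores how reliable each estimate is, while ranking by p-value alone loses the direction of change; the signed score keeps both.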

### 3️⃣ **Functional Pathway Analysis**  
   - Identifies pathways significantly affected in a dataset.  

---





## **Example of a `.rnk` File**  




In [8]:
import pandas as pd

# Define file path
rnk_file = "data/rnk_lists/GSE4773_DEG_Expt1_Control_vs_Group1_gene.rnk"

# Read the tab-delimited file
rnk_data = pd.read_csv(rnk_file, sep="\t", header=None)

# Display the first few rows
print(rnk_data.head())

      0         1
0   NTS  1.837808
1  GBP1  1.825789
2   ID2  1.497050
3   ADM  1.305856
4  LHX8  1.264238


## **Input File 2: GMT Files**

### **What Is a GMT File?**

A **GMT (Gene Matrix Transposed)** file is a text-based format primarily used for storing **gene sets** in **Gene Set Enrichment Analysis (GSEA)**. Each line in the GMT file represents a **gene set** and contains the gene set name, a brief description, and a list of genes associated with that gene set.

---

## **Structure of a GMT File**

Each line in a **GMT file** consists of:
1. **Gene set name**: A unique identifier for the gene set (e.g., the name of a biological pathway).
2. **Description**: A brief description of the gene set. The field must be present, but its content is optional and may be a placeholder such as `na` or a URL.
3. **List of genes**: A list of gene symbols (or gene IDs) that belong to the gene set.

The elements in a line are **tab-separated**, and there is **no header row**.
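For lookups by gene set name, the same structure maps naturally onto a dictionary. The following is a minimal sketch; `parse_gmt` is a hypothetical helper and the example lines are made up.

```python
def parse_gmt(lines):
    """Map each gene set name to its list of genes, skipping incomplete lines."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # need a name, a description, and at least one gene
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets

# Made-up gene sets for illustration only.
example = [
    "SET_ONE\tna\tGeneA\tGeneB\tGeneC\n",
    "SET_TWO\thttp://example.org/set_two\tGeneB\tGeneD\n",
    "\n",
]
sets = parse_gmt(example)
print(sets)
```

With a real file, pass the open file handle directly, for example `parse_gmt(open(path))`, since iterating a file yields its lines.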

---

## **Example of GMT File**




In [10]:
import pandas as pd

# Read GMT file into a structured format
gmt_data = []
gmt_file = "data/gene_sets/c2.cp.kegg.v6.0.symbols.gmt"

with open(gmt_file, "r") as f:
    for line in f:
        data = line.strip().split("\t")
        gene_set = data[0]  # Gene set name
        description = data[1]  # Description
        genes = data[2:]  # List of genes
        gmt_data.append([gene_set, description, genes])

# Convert to DataFrame
gmt_df = pd.DataFrame(gmt_data, columns=["Gene Set", "Description", "Genes"])

# Display first few rows
display(gmt_df.head())


|   | Gene Set | Description | Genes |
|---|----------|-------------|-------|
| 0 | KEGG_GLYCOLYSIS_GLUCONEOGENESIS | http://www.broadinstitute.org/gsea/msigdb/card... | [ACSS2, GCK, PGK2, PGK1, PDHB, PDHA1, PDHA2, P... |
| 1 | KEGG_CITRATE_CYCLE_TCA_CYCLE | http://www.broadinstitute.org/gsea/msigdb/card... | [IDH3B, DLST, PCK2, CS, PDHB, PCK1, PDHA1, LOC... |
| 2 | KEGG_PENTOSE_PHOSPHATE_PATHWAY | http://www.broadinstitute.org/gsea/msigdb/card... | [RPE, RPIA, PGM2, PGLS, PRPS2, FBP2, PFKM, PFK... |
| 3 | KEGG_PENTOSE_AND_GLUCURONATE_INTERCONVERSIONS | http://www.broadinstitute.org/gsea/msigdb/card... | [UGT1A10, UGT1A8, RPE, UGT1A7, UGT1A6, UGT2B28... |
| 4 | KEGG_FRUCTOSE_AND_MANNOSE_METABOLISM | http://www.broadinstitute.org/gsea/msigdb/card... | [MPI, PMM2, PMM1, FBP2, PFKM, GMDS, PFKFB4, PF... |


## **Input File 3: Background Ranked List**

### **Background Ranked List: `all_442_lists_permuted_x100.csv` in GSEA-InContext**

### **Purpose of `all_442_lists_permuted_x100.csv`**

The **`all_442_lists_permuted_x100.csv`** file is the **background ranked list** input to **GSEA-InContext**. It holds precomputed permuted gene rankings that serve as the background for the enrichment analysis. **GSEA-InContext** compares the observed enrichment score of each gene set against the distribution of scores obtained from these permuted lists to judge whether the observed enrichment is statistically significant.

### **Key Details**

1. **Background Rank Lists**:
   - Loading the file gives a table of shape `(100, 16893)`: 100 rows, each a permuted ranking of 16,893 genes.
   - The `x100` in the filename matches those 100 permutations, which together form the **null distribution** of enrichment scores. The `442` most likely refers to the number of background experiments the permutations were generated from, not to the number of lists in the file.

2. **Use in GSEA-InContext**:
   - The observed enrichment score for a gene set is compared with the scores that the same gene set obtains on the **permuted lists**.
   - The comparison indicates whether the observed enrichment is statistically significant or consistent with what the background alone produces.

3. **Application in GSEA-InContext**:
   - Pass the path to this file as the `background_rnks` argument when calling `gseapy.incontext`.
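Comparing an observed score with a null distribution can be sketched as an empirical p-value. This is an illustration of the general idea, not the implementation inside GSEA-InContext: `empirical_p_value` is a hypothetical helper, the null scores are made-up numbers standing in for enrichment scores computed on the permuted background lists, and restricting the comparison to null scores of the same sign follows the convention of standard GSEA.

```python
import numpy as np

def empirical_p_value(observed, null_scores):
    """Fraction of same-sign null scores at least as extreme as the observed score."""
    null_scores = np.asarray(null_scores, dtype=float)
    if observed >= 0:
        same_sign = null_scores[null_scores >= 0]
        extreme = same_sign >= observed
    else:
        same_sign = null_scores[null_scores < 0]
        extreme = same_sign <= observed
    if same_sign.size == 0:
        return float("nan")  # no null scores on this side to compare against
    return float(extreme.mean())

# Made-up null scores; in practice there would be one per permuted list.
null = [0.2, -0.1, 0.4, 0.3, -0.5, 0.1, 0.6, -0.2, 0.25, 0.35]
p_pos = empirical_p_value(0.5, null)
p_neg = empirical_p_value(-0.3, null)
print(p_pos, p_neg)
```

With only 100 permuted lists the smallest non-zero p-value this approach can report is on the order of 0.01, which is worth keeping in mind when interpreting results.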

### **Example Usage in Python**



In [15]:
import pandas as pd

# Define file path
file_path = "data/bg_rnk_lists/all_442_lists_permuted_x100.csv"

# Read CSV file
df = pd.read_csv(file_path, header=None)

# Display the first few rows
display(df.head())

print(df.shape)


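|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16883 | 16884 | 16885 | 16886 | 16887 | 16888 | 16889 | 16890 | 16891 | 16892 |
|---|---|---|---|---|---|---|---|---|---|---|-----|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| 0 | RGS2 | SMG1P7 | PAGE5 | ARSG | DTX4 | TDO2 | ILK | S100A7 | RHOA | NLRC4 | ... | EHF | SNORA71B | CCR2 | NPIPB4 | DNAI1 | HMX2 | CTR9 | RTP4 | OSR2 | SLC5A3 |
| 1 | DHFR | HIST1H2BG | ZNF207 | IL6 | MALAT1 | NAV3 | KDM4B | UCA1 | LINC00238 | CLDN16 | ... | IGFBP5 | HOXB3 | TNFRSF19 | PLCB4 | MYRFL | C8orf34-AS1 | GPRIN3 | CLDN4 | MS4A3 | SMCO3 |
| 2 | ITPKB | GBP1 | INSIG1 | NPIPB4 | UBE2G2 | ZNF280B | MIR3945HG | MYLK | CXCL3 | CST6 | ... | TNFRSF17 | ZNF503 | NEIL3 | CHAC2 | CRIP2 | RGS18 | FAM138F | KDELC2 | RPL10 | PTX3 |
| 3 | SNORA71B | SSBP1 | COL6A3 | CD69 | SERPINB2 | LINC00467 | LINC01215 | SH2D1A | ITGB8 | KRTAP19-1 | ... | DLX6 | SPEM1 | HEMGN | ADAMTS17 | ADM | ZNF92 | PAQR7 | IL4I1 | VDAC3 | SDHAF3 |
| 4 | CAPNS1 | TOP1MT | NPL | CENPF | SNORA48 | RNU6-646P | CDC42EP1 | ZBTB40 | TFAP2C | OGN | ... | IGSF5 | EHBP1L1 | EPHB4 | TM4SF5 | GTSF1 | SUB1 | MGP | ZNF681 | HPF1 | SUSD5 |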


(100, 16893)


# **Running GSEA Preranked and GSEA-InContext using `gseapy`**

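The script below runs **GSEA Preranked** and **GSEA-InContext** analyses on the same `.rnk` file, once with the KEGG gene sets and once with the Hallmark gene sets. It imports `gseapy` from the local `gseapy` directory, which it places first on `sys.path`, so run it from the root of the GSEA-InContext repository.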

## **Python Code**


In [None]:
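import os
import sys

# Notebooks do not define __file__, so use the current working directory;
# run this from the root of the GSEA-InContext repository.
gsea_incontext_path = os.getcwd()
print(gsea_incontext_path)

# Put the repository's local gseapy directory first on sys.path so that it is
# imported in preference to any installed version.
gsea_local_path = os.path.join(gsea_incontext_path, 'gseapy')
sys.path.insert(0, gsea_local_path)
import gseapy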



rnk = "GSE4773_DEG_Expt1_Control_vs_Group1_gene.rnk" 


# Run GSEA Preranked - KEGG
prerank_results = gseapy.prerank(
	rnk='data/rnk_lists/' + rnk,
	gene_sets='data/gene_sets/c2.cp.kegg.v6.0.symbols.gmt', 
	outdir='out/Prerank_KEGG/' + rnk[:-4], 
	permutation_num=100, 
	no_plot=False,
	processes=4
)

# Run GSEA-InContext - KEGG
gseapen_results = gseapy.incontext(
	rnk='data/rnk_lists/' + rnk,
	gene_sets='data/gene_sets/c2.cp.kegg.v6.0.symbols.gmt', 
	background_rnks='data/bg_rnk_lists/all_442_lists_permuted_x100.csv', 
	outdir='out/InContext_KEGG/' + rnk[:-4], 
	permutation_num=100,
	no_plot=False,
	processes=4
)

# Run GSEA preranked - Hallmarks
prerank_results = gseapy.prerank(
	rnk='data/rnk_lists/' + rnk,
	gene_sets='data/gene_sets/hallmarks.gmt', 
	outdir='out/Prerank_Hallmarks/' + rnk[:-4], 
	permutation_num=100, 
	no_plot=False,
	processes=4
)

# Run GSEA-InContext - Hallmarks
gseapen_results = gseapy.incontext(
	rnk='data/rnk_lists/' + rnk,
	gene_sets='data/gene_sets/hallmarks.gmt', 
	background_rnks='data/bg_rnk_lists/all_442_lists_permuted_x100.csv', 
	outdir='out/InContext_Hallmarks/' + rnk[:-4], 
	permutation_num=100,
	no_plot=False,
	processes=4
)

print('Done!')

2025-03-03 23:06:27,637 Input gene rankings contains NA values(gene name and ranking value), drop them all!


/media/sutanu/wd_costello_lab_dell_laptop_sn/projects/GSEA-InContext


2025-03-03 23:06:34,335 Input gene rankings contains NA values(gene name and ranking value), drop them all!


100


2025-03-03 23:07:20,063 Input gene rankings contains NA values(gene name and ranking value), drop them all!
