# Enrichment Analysis


## 1. EOO (excess of overlap) -- Chromosome-Specific Enrichment Analysis of Annotations Using Block Jackknife
### Purpose

The goal of **EOO (Excess of Overlap)** enrichment analysis is to evaluate whether significant variants are enriched within specific genomic annotations.

Using a **Leave-One-Chromosome-Out (LOCO) block jackknife** approach, the method estimates the **Odds Ratio (OR)** and **Enrichment** statistics for each annotation column, offering insights into the overlap between significant variants and annotated genomic regions.

---

### Input

The analysis requires **two main input files**:

### 1. Significant Variants File (`--significant_variants_path`)

- **Format**: RDS, TSV, or TXT
- **Required columns**:
    - `chr`: Chromosome number
    - `pos`: Genomic position

### 2. Baseline Annotation File (`--baseline_anno_path`)

- **Format**: Tabular text file
- **Required columns**:
    - `CHR`: Chromosome number
    - `BP`: Genomic position
    - **Annotation columns**: Binary values (0/1) starting from the 7th column onward
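
The layout described above can be checked before running the workflow. The following is a minimal sketch in Python (used here only for illustration). The names of columns 3 to 6 (`SNP`, `CM`, `REF`, `ALT`), the annotation names, and the helper `check_baseline_annotation` are hypothetical; only `CHR`, `BP`, and the "binary values from the 7th column onward" rule come from the description above.

```python
import csv
import io

# Hypothetical miniature baseline annotation table. The names of columns 3-6
# and of the two annotation columns are illustrative assumptions.
example = (
    "CHR\tBP\tSNP\tCM\tREF\tALT\tannot_A\tannot_B\n"
    "1\t1000\trs1\t0.0\tA\tG\t1\t0\n"
    "1\t2000\trs2\t0.1\tC\tT\t0\t0\n"
    "2\t1500\trs3\t0.0\tG\tA\t1\t1\n"
)

def check_baseline_annotation(handle):
    """Return the annotation column names after checking the expected layout."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    missing = {"CHR", "BP"} - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    annotation_names = header[6:]  # annotations start at the 7th column
    if not annotation_names:
        raise ValueError("no annotation columns found from the 7th column onward")
    for line_number, row in enumerate(reader, start=2):
        for name, value in zip(annotation_names, row[6:]):
            if value not in ("0", "1"):
                raise ValueError(
                    f"line {line_number}: {name} has non-binary value {value!r}"
                )
    return annotation_names

annotation_names = check_baseline_annotation(io.StringIO(example))
print(annotation_names)
```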

---

### Methods and Steps

### Core Statistics

- **Odds Ratio (OR)**
    
    Quantifies the odds of a variant being significant given the presence of an annotation versus its absence.
    
- **Enrichment**
    
    Measures the proportion of significant variants within an annotation relative to its background proportion.
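
One way to read the two definitions above is as functions of a 2×2 table of variant counts (significant or not, annotated or not). The sketch below, in Python for illustration, follows that reading. The function name, argument names, and counts are hypothetical, and the pipeline's R implementation may differ in details such as continuity corrections.

```python
def overlap_statistics(n_sig_annot, n_sig_other, n_bg_annot, n_bg_other):
    """Compute (odds ratio, enrichment) from a 2x2 table of variant counts.

    n_sig_annot: significant variants inside the annotation
    n_sig_other: significant variants outside the annotation
    n_bg_annot:  non-significant variants inside the annotation
    n_bg_other:  non-significant variants outside the annotation
    """
    # Odds of being significant inside the annotation versus outside it.
    odds_ratio = (n_sig_annot / n_bg_annot) / (n_sig_other / n_bg_other)

    # Share of significant variants that fall in the annotation, relative to
    # the share of all variants that fall in the annotation.
    n_sig = n_sig_annot + n_sig_other
    n_total = n_sig + n_bg_annot + n_bg_other
    enrichment = (n_sig_annot / n_sig) / ((n_sig_annot + n_bg_annot) / n_total)
    return odds_ratio, enrichment

# Hypothetical counts: 100 significant variants, 1000 variants in total.
odds_ratio, enrichment = overlap_statistics(30, 70, 170, 730)
print(f"OR = {odds_ratio:.3f}, Enrichment = {enrichment:.3f}")
```

Values above 1 indicate that significant variants overlap the annotation more often than the background does; values below 1 indicate depletion.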
    

### LOCO Jackknife Method

Steps:

1. For each chromosome $i$: remove all data from chromosome $i$
2. Compute OR and Enrichment on the remaining data
3. Aggregate across all chromosomes to estimate:
    - Mean statistic
    - Standard error (SE)
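
The steps above can be sketched as a delete-one jackknife over chromosomes. This Python sketch is illustrative only: the per-chromosome counts and function names are hypothetical, and the standard error uses the textbook delete-one jackknife formula, which is an assumption because the pipeline's exact formula is not shown here.

```python
import math

def enrichment(n_sig_annot, n_sig_other, n_bg_annot, n_bg_other):
    """Share of significant variants in the annotation over the annotation's overall share."""
    n_sig = n_sig_annot + n_sig_other
    n_total = n_sig + n_bg_annot + n_bg_other
    return (n_sig_annot / n_sig) / ((n_sig_annot + n_bg_annot) / n_total)

def loco_jackknife(per_chrom_counts, statistic):
    """Leave one chromosome out at a time; return (mean, jackknife SE)."""
    chromosomes = sorted(per_chrom_counts)
    leave_one_out = []
    for held_out in chromosomes:
        # Pool the counts of every chromosome except the held-out one.
        totals = [0, 0, 0, 0]
        for chrom in chromosomes:
            if chrom != held_out:
                totals = [t + c for t, c in zip(totals, per_chrom_counts[chrom])]
        leave_one_out.append(statistic(*totals))
    n = len(leave_one_out)
    mean = sum(leave_one_out) / n
    # Delete-one jackknife standard error (assumed formula).
    se = math.sqrt((n - 1) / n * sum((x - mean) ** 2 for x in leave_one_out))
    return mean, se

# Hypothetical counts per chromosome, ordered as
# (sig & annotated, sig & not annotated, non-sig & annotated, non-sig & not annotated).
counts = {1: (30, 70, 170, 730), 2: (20, 80, 180, 720), 3: (25, 75, 175, 725)}
mean_enrichment, se_enrichment = loco_jackknife(counts, enrichment)
print(f"Enrichment = {mean_enrichment:.3f} (SE {se_enrichment:.3f})")
```

In the real analysis the loop would run over the 22 autosomes, producing one leave-one-out value per chromosome and annotation.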

### Workflow Execution

- Main computations are implemented in R using a function that calculates statistics across all annotations and chromosomes.

---

### Output

The output is an `.RDS` file containing the following components:

- `summary`: A summary `data.frame` of:
    - `OR`
    - `OR_SE`
    - `Enrichment`
    - `Enrichment_SE`
- `OR_blockJacknife`: A $22 \times N$ matrix of log2(OR) values, one row per chromosome
- `Enrichment_blockJacknife`: A $22 \times N$ matrix of enrichment values
- `OR`: Mean OR per annotation column
- `Enrichment`: Mean enrichment per annotation
- `OR_sd`: Standard error of OR per annotation
- `Enrichment_sd`: Standard error of enrichment per annotation

In [None]:
sos run pipeline/eoo_enrichment.ipynb enrichment \
    --significant_variants_path data/eoo_enrichment/colocboost_binary_vcp0.1_hg38_annotation.tsv.gz \
    --baseline_anno_path data/eoo_enrichment/colocboost_binary_vcp0.1_hg38_annotation_data.tsv \
    --name enrichment_results \
    --cwd output/eoo_enrichment

INFO: Running enrichment: 
INFO: enrichment is completed.
INFO: enrichment output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/eoo_enrichment/enrichment/enrichment_results.enrichment_results.rds
INFO: Workflow enrichment (ID=wb945681d54e9f1a9) is executed successfully with 1 completed step.


In [None]:
# R CODE
setwd('/home/ubuntu/xqtl_protocol_exercise')
library(data.table)
eoo_results = fread('output/eoo_enrichment/enrichment/enrichment_results.enrichment_results_summary.tsv.gz')
head(eoo_results)

“package ‘data.table’ was built under R version 4.4.3”


| Annotation | OR | OR_SE | OR_log2 | OR_SE_log2 | Enrichment | Enrichment_SE | Enrichment_log2 | Enrichment_SE_log2 | Enrichment_Z_score | Enrichment_P_value | Enrichment_log2_z_scores | Enrichment_log2_p_values |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` |
| LHX2_TSS_H3K27ac | 0.9681361 | 0.02522293 | -0.04671829 | 0.03765948 | 0.9918868 | 0.006557302 | -0.011754081 | 0.009542351 | 151.2645 | 0 | -1.2317805 | 0.2180311 |
| LHX2_enhancer | 0.9677172 | 0.02470836 | -0.04734252 | 0.03680956 | 0.9916835 | 0.006458475 | -0.012049768 | 0.009394118 | 153.5476 | 0 | -1.2826928 | 0.1995997 |
| LHX2_enhancer_atac | 1.0230761 | 0.03173695 | 0.0329135 | 0.04480019 | 1.005684 | 0.007727166 | 0.008175039 | 0.011087937 | 130.1491 | 0 | 0.7372913 | 0.4609452 |


## 2. Pathway

### Purpose

Pathway enrichment analysis identifies biological pathways that are statistically overrepresented in a given gene set. This allows researchers to infer potential biological functions, disease relevance, or regulatory mechanisms associated with the gene set.

---

### Input

The required input is a gene list with group labels:

```
group       gene_id
Neuron      ENSG00000139618
Neuron      ENSG00000091831
Microglia   ENSG00000196839

```
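
As a small illustration of how such a gene list is split by group (the first step of the analysis), the Python sketch below parses the example rows shown above. The variable names are hypothetical, and the pipeline itself performs this step in R.

```python
from collections import defaultdict

# The example rows from the input description above.
gene_table = """group gene_id
Neuron ENSG00000139618
Neuron ENSG00000091831
Microglia ENSG00000196839
"""

genes_by_group = defaultdict(list)
lines = gene_table.strip().splitlines()
for line in lines[1:]:  # skip the header row
    group, gene_id = line.split()
    genes_by_group[group].append(gene_id)

for group, genes in genes_by_group.items():
    print(group, genes)
```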

Parameters:

- `organism = "hsa"`: species set to human.
- `pvalue_cutoff = 1`: p-value threshold for pathway filtering. A value of 1 keeps every tested pathway; lower it to report only significant pathways.

---

### Method

The analysis is conducted using the **clusterProfiler** R package with the KEGG database. The procedure includes:

1. Subset genes by group.
2. Perform KEGG pathway enrichment analysis for each group using the hypergeometric test.
3. Calculate pathway-level enrichment statistics:
    - Raw p-values
    - Adjusted p-values (FDR)
    - Fold enrichment
    - RichFactor
    - z-score
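
The pathway-level statistics listed above can be written out from the four counts behind `GeneRatio` and `BgRatio`. The sketch below, in Python for illustration, uses the counts of the Alzheimer disease example row (2/5 and 391/9393). The function name is hypothetical, and the z-score formula (standardizing by the hypergeometric mean and variance) is an assumption that happens to agree with that example row; it is not taken from the clusterProfiler source.

```python
import math

def pathway_statistics(k, n, K, N):
    """k: input genes in the pathway, n: input genes with annotation,
    K: background genes in the pathway, N: background genes."""
    # Hypergeometric upper tail: P(X >= k) when drawing n genes from N.
    p_value = sum(
        math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / math.comb(N, n)

    gene_ratio = k / n
    bg_ratio = K / N
    expected = n * bg_ratio
    variance = n * bg_ratio * (1 - bg_ratio) * (N - n) / (N - 1)
    return {
        "pvalue": p_value,
        "FoldEnrichment": gene_ratio / bg_ratio,
        "RichFactor": k / K,  # matched genes over pathway size
        "zScore": (k - expected) / math.sqrt(variance),
    }

# Counts from the Alzheimer disease example row: GeneRatio 2/5, BgRatio 391/9393.
stats = pathway_statistics(k=2, n=5, K=391, N=9393)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

The printed values can be compared with the `pvalue`, `FoldEnrichment`, `RichFactor`, and `zScore` columns of that row. Multiple-testing adjustment (`p.adjust`) is applied across all tested pathways and is not reproduced here.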

---

### Output

The output is a table where each row represents one pathway enriched in a specific gene group. An example:

| ID | Description | GeneRatio | BgRatio | RichFactor | FoldEnrichment | zScore | pvalue | p.adjust | Count | group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hsa05010 | Alzheimer disease | 2/5 | 391/9393 | 0.0051 | 9.609 | 4.01 | 0.0159 | 0.1208 | 2 | Neuron |

---

### Output Column Definitions

| Column Name | Description |
| --- | --- |
| `ID` | KEGG pathway identifier |
| `Description` | Name of the biological pathway |
| `GeneRatio` | Proportion of input genes in the pathway (e.g., 2/5) |
| `BgRatio` | Proportion of background genes in the pathway (e.g., 391/9393) |
| `RichFactor` | Number of input genes in the pathway divided by the total number of genes annotated to the pathway (e.g., 2/391 ≈ 0.0051) |
| `FoldEnrichment` | GeneRatio divided by BgRatio |
| `zScore` | Standardized enrichment statistic |
| `pvalue` | Unadjusted p-value from enrichment test |
| `p.adjust` | Adjusted p-value after multiple testing correction (FDR) |
| `Count` | Number of input genes matched to the pathway |
| `group` | The group label of the gene set |


In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
sos run pipeline/gsea.ipynb pathway_analysis \
    --genes_file data/pathway/test_pathway_genes_input.tsv \
    --cwd output/pathway_results --name test_genes

INFO: Running pathway_analysis: 
INFO: pathway_analysis is completed.
INFO: pathway_analysis output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/pathway_results/pathway_analysis/test_genes.pathway_results.rds
INFO: Workflow pathway_analysis (ID=w371947595f21bbcc) is executed successfully with 1 completed step.


In [None]:
# r code
pathway_result = readRDS('output/pathway_results/pathway_analysis/test_genes.pathway_results.rds')
head(pathway_result)

|  | category | subcategory | ID | Description | GeneRatio | BgRatio | RichFactor | FoldEnrichment | zScore | pvalue | p.adjust | qvalue | geneID | Count | group |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<int>` | `<chr>` |
| hsa05224 | Human Diseases | Cancer: specific types | hsa05224 | Breast cancer | 2/5 | 148/9393 | 0.013513514 | 25.386486 | 6.900875 | 0.002390337 | 0.09083282 | 0.0679359 | 675/2099 | 2 | Neuron |
| hsa04010 | Environmental Information Processing | Signal transduction | hsa04010 | MAPK signaling pathway | 2/5 | 300/9393 | 0.006666667 | 12.524 | 4.68153 | 0.009537746 | 0.12082701 | 0.09036923 | 51135/4137 | 2 | Neuron |
| hsa05010 | Human Diseases | Neurodegenerative disease | hsa05010 | Alzheimer disease | 2/5 | 391/9393 | 0.00511509 | 9.609207 | 4.012911 | 0.01589726 | 0.12082701 | 0.09036923 | 351/4137 | 2 | Neuron |
| hsa03440 | Genetic Information Processing | Replication and repair | hsa03440 | Homologous recombination | 1/5 | 41/9393 | 0.024390244 | 45.819512 | 6.637191 | 0.021639633 | 0.12082701 | 0.09036923 | 675 | 1 | Neuron |
| hsa05022 | Human Diseases | Neurodegenerative disease | hsa05022 | Pathways of neurodegeneration - multiple diseases | 2/5 | 483/9393 | 0.004140787 | 7.778882 | 3.529964 | 0.023788463 | 0.12082701 | 0.09036923 | 351/4137 | 2 | Neuron |
| hsa04961 | Organismal Systems | Excretory system | hsa04961 | Endocrine and other factor-regulated calcium reabsorption | 1/5 | 53/9393 | 0.018867925 | 35.445283 | 5.803256 | 0.027901787 | 0.12082701 | 0.09036923 | 2099 | 1 | Neuron |


## 3. sLDSC Enrichment
**Stratified LD Score Regression (S-LDSC)** is designed to quantify the contribution of different genomic functional annotations to the heritability of complex traits and assess their statistical significance. By integrating GWAS summary statistics with genome annotations, S-LDSC distinguishes true polygenic signals from confounding effects.

---

### Input

- **Annotation File**: Contains information about genomic regions (e.g., coding regions, enhancers). Can be single or combined annotations.
- **Baseline Annotation Files**: Per-chromosome `.annot.gz` files representing reference annotations used in the regression model.
- **Genomic Reference Files**: Per-chromosome PLINK-format genotype reference files used to compute LD scores.
- **SNP List**: A list of SNP IDs used in LDSC analysis.
- **Traits List**: A list of GWAS summary statistics filenames corresponding to different traits or trait groups.

---

### Workflow Steps

1. **Annotation Preprocessing**
    
    Generate LD Score files and genome annotation matrices from raw annotations.
    
2. **GWAS Summary Preparation**
    
    Preprocess GWAS summary statistics to a format compatible with LDSC.
    
3. **Heritability Estimation**
    
    Perform LD Score regression to estimate the heritability of each trait.
    
4. **Initial Tau Computation**
    
    Calculate tau statistics (functional effect sizes) and prepare intermediate outputs for meta-analysis.
    
5. **Meta-Analysis**
    
    Integrate tau and enrichment estimates across multiple traits or trait groups.
    

---
### Output

The main output is an RDS file containing the meta-analyzed **tau** and **enrichment** results. An initial processed RDS file is also written for each GWAS trait before meta-analysis.

The columns of the meta-analysis output are defined as follows:

| Column | Description |
| --- | --- |
| **Mean** | The meta-analyzed estimate. For **tau**, it reflects the independent contribution of the annotation to trait heritability. For **enrichment**, it indicates how many times more heritability is explained by the annotation than expected by its proportion of SNPs. |
| **SD** | Standard error of the meta-analyzed estimate (reported in the `SD` column); smaller values indicate a more precise estimate. |
| **P** | The meta-analyzed p-value, assessing the statistical significance of the effect (tau or enrichment). Smaller p-values indicate more significant results. |
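
The enrichment definition in the table can be made concrete as a ratio of two proportions. The sketch below, in Python for illustration, uses the rounded `Prop._h2` and `Prop._SNPs` values that appear in the per-trait example output later in this section. The function name is hypothetical.

```python
def heritability_enrichment(prop_h2, prop_snps):
    """Proportion of heritability explained by the annotation divided by
    the proportion of SNPs that the annotation contains."""
    return prop_h2 / prop_snps

# Rounded values from the per-trait example output (Prop._h2 and Prop._SNPs).
enrichment = heritability_enrichment(prop_h2=0.417, prop_snps=0.0164)
print(round(enrichment, 1))
```

An enrichment of 1 means the annotation explains exactly its share of heritability; here the annotation covers under 2% of SNPs but is estimated to explain about 42% of heritability, with a large standard error in this small example.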


### Commands

In [None]:
# Step 1: build annotation files for the provided SNP list and compute LD scores
sos run pipeline/sldsc_enrichment.ipynb make_annotation_files_ldscore \
    --annotation_file  data/polyfun/input/colocboost_test_annotation_path.txt \
    --reference_anno_file data/polyfun/input/reference_annotation0.txt \
    --genome_ref_file data/polyfun/input/genome_reference_bfile.txt \
    --annotation_name test_colocboost \
    --plink_name reference. \
    --baseline_name annotations. \
    --weight_name weights. \
    --python_exec python \
    --polyfun_path data/github/polyfun \
    --cwd output/polyfun/ -j 22
 

INFO: Running make_annotation_files_ldscore: 
INFO: Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO: NumExpr defaulting to 16 threads.

### Calculate heritability for each GWAS summary statistics file

In [None]:
# Step 2: LD score regression to estimate heritability
cd /home/ubuntu/xqtl_protocol_exercise
sos run pipeline/sldsc_enrichment.ipynb get_heritability \
    --target_anno_dir output/polyfun/test_colocboost \
    --sumstat_dir data/polyfun/example_data \
    --baseline_ld_dir data/polyfun/example_data \
    --python_exec python \
    --polyfun_path data/github/polyfun \
    --weights_dir data/polyfun/example_data \
    --plink_name reference. \
    --baseline_name annotations. \
    --weight_name weights. \
    --annotation_name test_colocboost \
    --cwd output/polyfun/test_colocboost/sumstats/ \
    --all_traits_file data/polyfun/input/sumstats_test_all.txt \
    -s build -j 2

INFO: Running get_heritability: 
INFO: Note: NumExpr detected 32 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 16.
INFO: NumExpr defaulting to 16 threads.
/bin/bash: /home/al4225/miniconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
No frq files found at ./reference.*.frq, using --not-M-5-50 option
  'i.e., \ell_j := \sum_k p_k(1-p_k)r^2_{jk}, where p_k denotes the MAF '
  'i.e., \ell_j := \sum_k (p_k(1-p_k))^a r^2_{jk}, where p_k denotes the MAF '

In [None]:
# Step 3: summarize results for each GWAS trait and run the meta-analysis across trait groups
sos run pipeline/sldsc_enrichment.ipynb processed_stats \
    --target_anno_dir output/polyfun/test_colocboost \
    --sumstat_dir data/polyfun/example_data \
    --baseline_ld_dir data/polyfun/example_data \
    --python_exec python \
    --polyfun_path data/github/polyfun \
    --weights_dir data/polyfun/example_data \
    --plink_name reference. \
    --baseline_name annotations. \
    --weight_name weights. \
    --annotation_name test_colocboost \
    --cwd output/polyfun/test_colocboost/sumstats/ \
    --trait_group_paths "data/polyfun/input/sumstats_test_all.txt data/polyfun/input/sumstats_test_category1.txt" \
    --trait_group_names "All category1" \
    --all_traits_file data/polyfun/input/sumstats_test_all.txt \
    -s force   


INFO: Running processed_stats_1: 
INFO: processed_stats_1 is completed.
INFO: processed_stats_1 output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/polyfun/test_colocboost/sumstats/test_colocboost.single_tau.initial_processed_stats.rds
INFO: Running processed_stats_2: 
INFO: processed_stats_2 is completed.
INFO: processed_stats_2 output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/polyfun/test_colocboost/sumstats/processed_stats_2/single_tau.test_colocboost.meta_processed_stats.rds
INFO: Workflow processed_stats (ID=w1b294dd9e067cdfc) is executed successfully with 2 completed steps.


In [None]:
# output
setwd('/home/ubuntu/xqtl_protocol_exercise')
sldsc_each_gwas_results = readRDS('output/polyfun/test_colocboost/sumstats/test_colocboost.single_tau.initial_processed_stats.rds')
str(sldsc_each_gwas_results)

List of 3
 $ sumstats.parquet :List of 2
  ..$ single_tau:List of 3
  .. ..$ h2g      : num 0.0032
  .. ..$ sd_annot : num [1, 1] 0.127
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr "ANNOT"
  .. .. .. ..$ : chr "ANNOT"
  .. ..$ sc_matrix: num [1:200, 1] 0.0543 0.0569 0.0567 0.0519 0.0527 ...
  ..$ enrichment:List of 3
  .. ..$ enrichment_summary:Classes ‘data.table’ and 'data.frame':	1 obs. of  5 variables:
  .. .. ..$ Enrichment.Enrichment                    : num 25.4
  .. .. ..$ Enrichment_std_error.Enrichment_std_error: num 17.8
  .. .. ..$ Prop._h2.Prop._h2                        : num 0.417
  .. .. ..$ Prop._SNPs.Prop._SNPs                    : num 0.0164
  .. .. ..$ Enrichment_p                             : num 0.413
  .. .. ..- attr(*, ".internal.selfref")=<externalptr> 
  .. ..$ meta_enrstat      :List of 3
  .. .. ..$ enrich_stat: num 2.28e-05
  .. .. ..$ enrich_z   : num -0.819
  .. .. ..$ enrich_sd  : num -2.78e-05
  .. ..$ meta_enr          :List of 2
  .

In [None]:
# Meta-analysis results across trait groups
setwd('/home/ubuntu/xqtl_protocol_exercise')
sldsc_meta_results = readRDS('output/polyfun/test_colocboost/sumstats/processed_stats_2/single_tau.test_colocboost.meta_processed_stats.rds')
sldsc_meta_results

|  | Mean | SD | P |
| --- | --- | --- | --- |
| All | 0.04598254 | 0.04027347 | 0.2535548 |
| category1 | 0.04208514 | 0.0482427 | 0.3830104 |

|  | Mean | SD | P |
| --- | --- | --- | --- |
| All | 20.46742 | 9.91854 | 0.3909615 |
| category1 | 18.26617 | 11.93938 | 0.9727415 |
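
As a reading aid for the first table above (the one with the small `Mean` values, presumably tau), its `P` column is numerically consistent with a two-sided normal test of `Mean / SD`. The Python sketch below shows that calculation. It is an observation about these example numbers and not a statement of how the pipeline computes P; the function name is hypothetical.

```python
import math

def two_sided_normal_p(mean, sd):
    """Two-sided p-value for z = mean / sd under a standard normal."""
    z = mean / sd
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))

# Mean and SD taken from the first table of the example output.
p_all = two_sided_normal_p(0.04598254, 0.04027347)
p_category1 = two_sided_normal_p(0.04208514, 0.0482427)
print(f"All: {p_all:.4f}, category1: {p_category1:.4f}")
```

The `P` values in the second table (the one with `Mean` values near 20, presumably enrichment) do not follow from `Mean / SD` in this way, which suggests they come from a separate test statistic.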
