╔══════════════════════════════════════════════════════════╗
║ ║
║ ██████╗ █████╗ ██████╗███████╗ ║
║ ██╔══██╗██╔══██╗██╔════╝██╔════╝ ║
║ ██████╔╝███████║██║ █████╗ ║
║ ██╔═══╝ ██╔══██║██║ ██╔══╝ ║
║ ██║ ██║ ██║╚██████╗███████╗ ║
║ ╚═╝ ╚═╝ ╚═╝ ╚═════╝╚══════╝ ║
║ ║
║ Predicting Activity-Contact for Enhancer-promoter ║
║ ║
╚══════════════════════════════════════════════════════════╝
Author: Linyong Shen @ Northwest A&F University (西北农林科技大学)
PACE is a flexible multi-omics computational framework for predicting enhancer-promoter regulatory interactions in livestock species, based on the Activity-by-Contact (ABC) Model.
PACE extends the ABC model with flexible multi-omics integration:
$$
\text{PACE Score}(E,G) = \frac{A(E) \times C(E,G) \times W_{\text{expr}}(G)}{\sum_{e} A(e) \times C(e,G) \times W_{\text{expr}}(G)}
$$
Where Activity can integrate multiple data types:
Activity = f(Accessibility, Histones, TFs, Methylation, ...)
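The score formula can be illustrated with a small sketch. It assumes that each candidate enhancer-gene pair already carries an activity, a contact, and an expression weight, and that the sum in the denominator runs over all candidate enhancers of the same gene. The function name and dictionary keys are hypothetical and are not part of PACE's API.

```python
from collections import defaultdict


def pace_scores(pairs):
    """Normalize activity x contact x expression weight per gene.

    `pairs` is a list of dicts with illustrative keys:
    enhancer, gene, activity, contact, w_expr.
    """
    # Denominator: sum of the raw products over all enhancers of each gene.
    totals = defaultdict(float)
    for p in pairs:
        totals[p["gene"]] += p["activity"] * p["contact"] * p["w_expr"]

    scores = {}
    for p in pairs:
        raw = p["activity"] * p["contact"] * p["w_expr"]
        denom = totals[p["gene"]]
        # Guard against genes whose candidates all have zero signal.
        scores[(p["enhancer"], p["gene"])] = raw / denom if denom > 0 else 0.0
    return scores


# Made-up values for three candidate enhancers of one gene.
pairs = [
    {"enhancer": "e1", "gene": "G1", "activity": 4.0, "contact": 0.5, "w_expr": 1.0},
    {"enhancer": "e2", "gene": "G1", "activity": 2.0, "contact": 1.0, "w_expr": 1.0},
    {"enhancer": "e3", "gene": "G1", "activity": 4.0, "contact": 1.0, "w_expr": 1.0},
]
scores = pace_scores(pairs)
```

As the formula is written, W_expr(G) is the same for every enhancer of a gene, so it cancels in the ratio. The sketch keeps the term only to mirror the formula.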
- 🐄 Universal for Livestock: Pig, cattle, sheep, chicken, horse, and other domestic animals
- 📊 Flexible Multi-omics Integration: From minimal (ATAC-only) to comprehensive (all data types)
- 🧬 RNA-seq Integration: Filter predictions to expressed genes
- 🔬 Methylation Support: DNA methylation as inhibitory signal
- 🎯 TF Binding: Incorporate transcription factor ChIP-seq
- ✅ eQTL Validation: Validate predictions with genetic evidence
- 🚀 Automated Pipeline: Snakemake workflow for one-command execution
| Document | Description |
|---|---|
| 📖 Quick Start Guide | 5-minute guide to get started |
| 📊 Methods & Algorithm | Detailed methods and formulas |
| 🤖 ML Integration | Machine learning module guide |
# Clone the repository
git clone https://github.com/shenlinyong/PACE.git
cd PACE
# Run setup script
bash setup.sh
# Activate environment
conda activate pace-env

| Component | Version | Required |
|---|---|---|
| Python | 3.8 - 3.12 | ✓ |
| Conda/Mamba | Latest | ✓ |
| Memory | 8GB+ RAM | ✓ |
| Storage | 5GB+ | ✓ |
| Package | Version | Purpose |
|---|---|---|
| numpy | ≥1.20 | Numerical computing |
| pandas | ≥1.3 | Data manipulation |
| scipy | ≥1.7 | Scientific computing |
| matplotlib | ≥3.4 | Visualization |
| seaborn | ≥0.11 | Statistical plots |
| Package | Version | Purpose |
|---|---|---|
| bedtools | ≥2.30 | BED file operations |
| samtools | ≥1.15 | BAM processing |
| macs2 | ≥2.2 | Peak calling |
| pybigwig | ≥0.3 | BigWig I/O |
| pybedtools | ≥0.9 | BED operations |
| Package | Version | Purpose |
|---|---|---|
| scikit-learn | ≥1.0 | Machine learning |
| hic-straw | ≥1.3 | Hi-C data |
| snakemake | 7.x | Workflow |
# Install system tools first (via conda or apt)
conda install -c bioconda bedtools samtools macs2
# Then install Python packages
pip install -r requirements.txt

To confirm that the testing environment is correctly configured, run the full test suite:
# Run comprehensive test suite (37 tests)
python test_pace_complete.py

This command evaluates the following components:
- Module imports (7 tests)
- Core functions (9 tests)
- CLI scripts (6 tests)
- Complete pipeline (5 tests)
- Machine learning module (10 tests)
Successful completion of all tests indicates that the environment has been properly set up and that all major components of the PACE framework are functioning as expected.
We provide a complete example dataset to help you get started quickly:
# Run the example pipeline (takes ~2-5 minutes)
bash example/run_example.sh

This will:
- Generate synthetic test data (5 Mb genome, 50 genes)
- Prepare reference files
- Run the complete PACE pipeline
- Show prediction results
See example/README.md for a detailed step-by-step tutorial.
# 1. Prepare reference files
python scripts/prepare_reference.py \
--gtf annotation.gtf.gz \
--fasta genome.fa.gz \
--output_dir reference/my_species
# 2. Configure your analysis:
# - Edit config/config.yaml for pipeline parameters
# - Edit config/config_biosamples.tsv for sample information
# 3. Run the pipeline
snakemake --cores 8

| Data Type | Format | Description |
|---|---|---|
| ATAC-seq | BAM/tagAlign | Chromatin accessibility |
| DNase-seq | BAM | Chromatin accessibility |
| Data Type | Format | Description | Effect |
|---|---|---|---|
| H3K27ac | BAM/bigWig | Active enhancers | Activating |
| RNA-seq | TSV (TPM) | Gene expression | Filter/Weight |
| Data Type | Format | Description | Effect |
|---|---|---|---|
| H3K4me1 | BAM/bigWig | Enhancer mark | Activating |
| H3K4me3 | BAM/bigWig | Promoter mark | Activating |
| H3K36me3 | BAM/bigWig | Transcribed regions | Activating |
| H3K27me3 | BAM/bigWig | Polycomb repression | Inhibitory |
| H3K9me3 | BAM/bigWig | Heterochromatin | Inhibitory |
| DNA Methylation | bedGraph/bigWig | CpG methylation | Inhibitory |
| TF ChIP-seq | BAM/bigWig | TF binding | Activating |
| Hi-C | .hic/bedpe | 3D chromatin contact | Contact |
| eQTL | TSV | Genetic evidence | Validation |
PACE supports multiple methods for combining signals:
| Method | Formula | Use Case |
|---|---|---|
| geometric_mean | ∏(Sᵢ)^(1/n) | Simple, like original ABC |
| weighted_geometric | ∏(Sᵢ^wᵢ) | Prioritize certain signals |
| weighted_sum | Σ(wᵢ × Sᵢ) | Linear combination |

| Signal | Type | Default Weight |
|---|---|---|
| ATAC/DNase | Accessibility | 1.5 |
| H3K27ac | Active enhancer | 1.0 |
| H3K4me1 | Enhancer mark | 0.8 |
| H3K4me3 | Promoter mark | 0.5 |
| H3K27me3 | Repressive | 0.5 (inhibitory) |
| Methylation | CpG | 0.5 (inhibitory) |
| TF binding | Transcription factor | 0.3 |
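The three combination formulas can be sketched in a few lines. The sketch covers activating signals only and applies the weights exactly as the formulas are written, without normalizing them. How PACE folds in inhibitory signals is not shown here, and the function names are hypothetical.

```python
import math


def geometric_mean(signals):
    """prod(S_i) ** (1/n) over the signal values."""
    values = list(signals.values())
    return math.prod(values) ** (1.0 / len(values))


def weighted_geometric(signals, weights):
    """prod(S_i ** w_i); weights are looked up by signal name."""
    return math.prod(s ** weights[name] for name, s in signals.items())


def weighted_sum(signals, weights):
    """sum(w_i * S_i); weights are looked up by signal name."""
    return sum(weights[name] * s for name, s in signals.items())


# Made-up signal values; the weights are the defaults from the table.
signals = {"ATAC": 4.0, "H3K27ac": 9.0}
weights = {"ATAC": 1.5, "H3K27ac": 1.0}

gm = geometric_mean(signals)               # sqrt(4 * 9)
wg = weighted_geometric(signals, weights)  # 4**1.5 * 9**1.0
ws = weighted_sum(signals, weights)        # 1.5*4 + 1.0*9
```

With these inputs, the weighted geometric form amplifies the accessibility signal much more strongly than the weighted sum does, so the choice of method changes how sensitive the activity is to the weights.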
The following examples show three configuration levels for config/config.yaml, from minimal to comprehensive.

Accessibility only (H3K27ac disabled):

activity_method: "geometric_mean"
histone_marks:
  H3K27ac:
    enabled: false

Accessibility plus H3K27ac:

activity_method: "geometric_mean"
histone_marks:
  H3K27ac:
    enabled: true
    weight: 1.0

Weighted multi-omics (histone marks, methylation, and expression):

activity_method: "weighted_geometric"
accessibility:
  weight: 1.5
histone_marks:
  H3K27ac: {enabled: true, weight: 1.0}
  H3K4me1: {enabled: true, weight: 0.8}
  H3K4me3: {enabled: true, weight: 0.5}
methylation:
  enabled: true
  weight: 0.5
expression:
  enabled: true
  min_expression: 1.0
  weight_method: "log"

Edit config/config_biosamples.tsv to add your samples. The file uses tab-separated format with the following columns:
| Column | Description |
|---|---|
| biosample | Sample name |
| default_accessibility_feature | ATAC or DHS |

| Column | Description |
|---|---|
| ATAC | ATAC-seq tagAlign/BAM file |
| DHS | DNase-seq BAM file |

| Column | Description |
|---|---|
| H3K27ac, H3K4me1, H3K4me3, etc. | Histone ChIP-seq files |
| RNA_seq | Gene expression file (TPM) |
| methylation | DNA methylation file |
| HiC_file, HiC_type, HiC_resolution | Hi-C data |
| TF_binding, TF_names | TF ChIP-seq files |

biosample ATAC H3K27ac RNA_seq default_accessibility_feature
Pig_Liver data/atac.tagAlign.gz data/h3k27ac.bam data/rnaseq.tsv ATAC

See config/config_biosamples.tsv for the complete column list and more examples.
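Before launching the pipeline, it can help to sanity-check the sample table. The sketch below uses only the column names listed above. The rule that the column named by default_accessibility_feature must be filled in is an assumption made for illustration, and `check_biosamples` is a hypothetical helper, not a PACE script.

```python
import csv
import io

REQUIRED = ("biosample", "default_accessibility_feature")


def check_biosamples(tsv_text):
    """Return a list of problems found in a biosamples table (empty if none)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    fields = reader.fieldnames or []
    problems = [f"missing column: {col}" for col in REQUIRED if col not in fields]
    if problems:
        return problems
    for row in reader:
        feature = row["default_accessibility_feature"]
        if feature not in ("ATAC", "DHS"):
            problems.append(f"{row['biosample']}: unknown feature {feature!r}")
        elif not row.get(feature):
            # Assumed rule: the chosen accessibility column must point to a file.
            problems.append(f"{row['biosample']}: column {feature} is empty")
    return problems


# The example row from above, with real tab separators.
example = (
    "biosample\tATAC\tH3K27ac\tRNA_seq\tdefault_accessibility_feature\n"
    "Pig_Liver\tdata/atac.tagAlign.gz\tdata/h3k27ac.bam\tdata/rnaseq.tsv\tATAC\n"
)
problems = check_biosamples(example)
```

To check a file on disk, read it first, for example with `check_biosamples(open("config/config_biosamples.tsv").read())`.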
results/{biosample}/
├── Peaks/
│ ├── macs2_peaks.narrowPeak # MACS2 peak calls
│ └── macs2_peaks.narrowPeak.sorted.candidateRegions.bed
├── Neighborhoods/
│ ├── EnhancerList.txt # Enhancers with activity scores
│ └── GeneList.txt # Genes with expression
├── Predictions/
│ ├── EnhancerPredictionsAllPutative.tsv.gz # All E-G predictions
│ ├── EnhancerPredictionsFull_*.tsv # Filtered (all columns)
│ └── EnhancerPredictions_*.tsv # Filtered (key columns)
├── Metrics/
│ ├── QCSummary_*.tsv # QC metrics summary
│ └── QCPlots_*.pdf # QC visualization
└── logs/
└── *.log # Pipeline logs
| Column | Description |
|---|---|
| chr, start, end | Enhancer coordinates |
| TargetGene | Target gene symbol |
| ABC.Score | Prediction score (0-1) |
| distance | Enhancer-TSS distance |
| class | Enhancer class (promoter/proximal/distal) |
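As a hedged illustration of working with these columns, the sketch below filters a predictions table by ABC.Score using only the standard library. The rows are made up, `filter_predictions` is a hypothetical helper, and whether PACE applies the cutoff as `>=` or `>` is an assumption.

```python
import csv
import io


def filter_predictions(tsv_text, threshold=0.02):
    """Keep rows whose ABC.Score is at or above the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row["ABC.Score"]) >= threshold]


# Illustrative rows using the column names from the table above.
example = (
    "chr\tstart\tend\tTargetGene\tABC.Score\tdistance\tclass\n"
    "chr1\t1000\t1500\tGene1\t0.150\t12000\tdistal\n"
    "chr1\t5000\t5500\tGene1\t0.010\t8000\tdistal\n"
    "chr2\t3000\t3500\tGene3\t0.030\t500\tproximal\n"
)
kept = filter_predictions(example)
strict = filter_predictions(example, threshold=0.05)
```

For the gzip-compressed output, the text can be read with `gzip.open(path, "rt").read()` before it is passed to the function.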
# Pig (Sus scrofa)
wget https://ftp.ensembl.org/pub/release-111/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.111.gtf.gz

# Cattle (Bos taurus)
wget https://ftp.ensembl.org/pub/release-111/fasta/bos_taurus/dna/Bos_taurus.ARS-UCD1.3.dna.toplevel.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/bos_taurus/Bos_taurus.ARS-UCD1.3.111.gtf.gz

# Sheep (Ovis aries, Rambouillet)
wget https://ftp.ensembl.org/pub/release-111/fasta/ovis_aries_rambouillet/dna/Ovis_aries_rambouillet.Oar_rambouillet_v1.0.dna.toplevel.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/ovis_aries_rambouillet/Ovis_aries_rambouillet.Oar_rambouillet_v1.0.111.gtf.gz

# Chicken (Gallus gallus)
wget https://ftp.ensembl.org/pub/release-111/fasta/gallus_gallus/dna/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.dna.toplevel.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/gallus_gallus/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.111.gtf.gz

# Horse (Equus caballus)
wget https://ftp.ensembl.org/pub/release-111/fasta/equus_caballus/dna/Equus_caballus.EquCab3.0.dna.toplevel.fa.gz
wget https://ftp.ensembl.org/pub/release-111/gtf/equus_caballus/Equus_caballus.EquCab3.0.111.gtf.gz

| Species | Genome Size (bp) |
|---|---|
| Pig | 2.5e9 |
| Cattle | 2.7e9 |
| Sheep | 2.6e9 |
| Chicken | 1.0e9 |
| Horse | 2.5e9 |
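One plausible use of these values is the effective genome size that MACS2 takes through its `-g` option during peak calling. The sketch below builds such a command line. It assumes that this is what the table is for, and the helper name and file paths are hypothetical. PACE's own Snakemake rules may pass the value differently.

```python
# Genome sizes in bp, copied from the table above.
GENOME_SIZE_BP = {
    "pig": 2.5e9,
    "cattle": 2.7e9,
    "sheep": 2.6e9,
    "chicken": 1.0e9,
    "horse": 2.5e9,
}


def macs2_callpeak_args(treatment, name, species):
    """Build an argument list for `macs2 callpeak` with a species-specific -g."""
    size = int(GENOME_SIZE_BP[species])
    return ["macs2", "callpeak", "-t", treatment, "-n", name, "-g", str(size)]


args = macs2_callpeak_args("data/atac.tagAlign.gz", "Pig_Liver", "pig")
```

The list can be passed to `subprocess.run(args, check=True)` in an environment where MACS2 is installed.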
Q: Can I run PACE with only ATAC-seq data?
A: Yes, PACE works with minimal data (ATAC-seq only) and scales up with more data types.
Q: Can I run without Hi-C data?
A: Yes, PACE uses a power-law function to estimate 3D contact when Hi-C is unavailable.
Q: How do I choose the score threshold?
A: The default threshold is 0.02. Increase it to 0.03-0.05 for higher precision, or decrease it to 0.015 for higher recall.
Q: How does methylation affect predictions?
A: DNA methylation is treated as an inhibitory signal - high methylation reduces enhancer activity scores.
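The Hi-C answer above mentions a power-law estimate of contact. A minimal sketch of that idea follows, assuming the common form contact(d) = scale × d^(−gamma) with a floor on the distance. The function name and all parameter values are placeholders for illustration and are not PACE's defaults.

```python
def powerlaw_contact(distance_bp, gamma=1.0, scale=1.0, min_distance_bp=5000):
    """Estimate enhancer-promoter contact from genomic distance.

    Distances below `min_distance_bp` are clamped so that very close
    elements do not receive unbounded contact values.
    """
    d = max(abs(distance_bp), min_distance_bp)
    return scale * d ** (-gamma)


near = powerlaw_contact(10_000)
far = powerlaw_contact(1_000_000)
clamped = powerlaw_contact(100)
```

Under this form, contact falls off monotonically with distance, so without Hi-C data the score is driven by enhancer activity and by proximity to the TSS.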
PACE includes an optional ML module that can learn optimal feature combinations from validated E-G pairs (e.g., eQTL data).
Default: Gradient Boosting Classifier
- Provides feature importance for interpretability
- Handles non-linear feature interactions
- Robust to class imbalance (few validated pairs vs many predictions)
- Widely used in genomics applications
Alternative: Random Forest (--model_type random_forest)
See docs/ML_INTEGRATION.md for detailed model comparison.
- You have validated E-G pairs (eQTL, CRISPRi, etc.)
- You want to optimize predictions for your specific tissue/species
- You want to understand which features are most predictive
Create a TSV file with validated E-G pairs:
enhancer gene validated
chr1:1000-1500 Gene1 1
chr1:5000-5500 Gene2 1
chr2:3000-3500 Gene3 0

python scripts/pace_ml.py train \
--predictions results/Sample/Predictions/EnhancerPredictionsAllPutative.tsv.gz \
--validation data/eqtl_validated_pairs.tsv \
--output models/my_model.pkl \
--balance_classes

python scripts/pace_ml.py predict \
--predictions results/NewSample/Predictions/EnhancerPredictionsAllPutative.tsv.gz \
--model models/my_model.pkl \
--output results/NewSample/Predictions/EnhancerPredictions_ML.tsv \
--abc_weight 0.5

The ML module adds:
- ML.Score: ML-based prediction probability
- Combined.Score: Weighted combination of ABC.Score and ML.Score
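One way to read "weighted combination" is a linear blend controlled by `--abc_weight`. The sketch below assumes that interpretation. The formula and the function name are illustrative guesses, so check scripts/pace_ml.py for the actual definition.

```python
def combined_score(abc_score, ml_score, abc_weight=0.5):
    """Blend ABC.Score and ML.Score linearly (assumed form)."""
    if not 0.0 <= abc_weight <= 1.0:
        raise ValueError("abc_weight must be between 0 and 1")
    return abc_weight * abc_score + (1.0 - abc_weight) * ml_score


equal_blend = combined_score(0.2, 0.8)                # equal weighting
abc_only = combined_score(0.2, 0.8, abc_weight=1.0)   # ignores ML.Score
ml_only = combined_score(0.2, 0.8, abc_weight=0.0)    # ignores ABC.Score
```

Under this reading, an `abc_weight` of 1.0 reproduces the ABC-style ranking and 0.0 relies on the trained model alone.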
In config/config.yaml:
ml_integration:
  enabled: false                    # Set to true to enable
  model_type: "gradient_boosting"   # or "random_forest"
  use_pretrained: false             # Use pre-trained model

Note: ML is experimental and requires scikit-learn (pip install scikit-learn).
If you use PACE in your research, please cite:
@software{PACE,
author = {Shen, Linyong},
title = {PACE: Predicting Activity-Contact for Enhancer-promoter},
year = {2026},
publisher = {GitHub},
url = {https://github.com/shenlinyong/PACE}
}

And the original ABC model papers:
@article{fulco2019activity,
title={Activity-by-contact model of enhancer--promoter regulation from thousands of CRISPR perturbations},
author={Fulco, Charles P and Nasser, Joseph and others},
journal={Nature Genetics},
volume={51},
pages={1664--1669},
year={2019}
}
@article{nasser2021genome,
title={Genome-wide enhancer maps link risk variants to disease genes},
author={Nasser, Joseph and Bergman, Drew T and others},
journal={Nature},
volume={593},
pages={238--243},
year={2021}
}

- Author: Linyong Shen (申林用)
- Institution: Northwest A&F University (西北农林科技大学)
- Email: shenlinyong@nwafu.edu.cn
- Issues: GitHub Issues
MIT License - see LICENSE file for details.
PACE is built upon the ABC-Enhancer-Gene-Prediction framework developed by the Broad Institute and Engreitz Lab.