Skip to content
Genome-wide contact analysis using sklearn
Python
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
example screenshot for readme Aug 23, 2019
peakachu make pre-trained model work for KR-normalized matrix Aug 23, 2019
scripts update command-line help information Aug 22, 2019
.DS_Store make pre-trained model work for KR-normalized matrix Aug 23, 2019
LICENSE Create LICENSE Aug 5, 2019
README.md commented on cross-platform Aug 23, 2019
setup.py Update setup.py Aug 21, 2019

README.md

Introduction

What is Peakachu

Peakachu is an acronym that standands for Unveil Hi-C Anchors and Peaks. It takes genome-wide contact data as input and returns coordinates of likely interactions such as chromatin loops. A machine learning framework based on sklearn is employed to generate random forest models trained on example interactions predicted by an arbitrary experiment. For example, loops can be predicted in Hi-C data using models trained with the Hi-C profiles of interactions detected via ChIA-PET. Although Hi-C is the namesake of Peakachu, it is designed to accept any genome-wide contact map including those from Micro-C and DNA SPRITE.

Citation

Tarik J. Salameh, Xiaotao Wang, Fan Song, Bo Zhang, Sage M. Wright, Chachrit Khunsriraksakul, Feng Yue. A supervised learning framework for chromatin loop detection in genome-wide contact maps. biorXiv. doi: https://doi.org/10.1101/739698

Peakachu requires Python3 and several scientific packages to run. It is best to set up a conda environment then install from one of PyPI, github, or conda. Copy and paste one of the command snippets below:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n 3dgenome python=3.6 scikit-learn=0.20.2 numpy scipy pandas h5py cooler
source activate 3dgenome
pip install peakachu
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda create -n 3dgenome python=3.6 scikit-learn=0.20.2 numpy scipy pandas h5py cooler
source activate 3dgenome
git clone https://github.com/tariks/peakachu
cd peakachu
python setup.py install

Peakachu should now be installed as a command-line tool within the new environment. Options for all peakachu commands and sub-commands can be accessed with the -h option.

peakachu -h
usage: peakachu [-h] {train,score_chromosome,score_genome,depth,pool} ...

Train Random Forest with Hi-C data and training peaklist.

positional arguments:
  {train,score_chromosome,score_genome,depth,pool}
    train               Train RandomForest model per chromosome
    score_chromosome    Calculate interaction probability per pixel for a
                        chromosome
    score_genome        Calculate interaction probability per pixel for the
                        whole genome
    depth               Print total intra-chromosomal read count
    pool                Print centroid loci from score_x output

optional arguments:
  -h, --help            show this help message and exit

Example: predicting loops in GM12878 Hi-C

GM12878 is a commonly studied cell-line based on lymphoblasts from an adult individual. The following example will download a cooler file from a public source, train a series of models using ChIA-PET or HiChIP data, then predict loops using the trained models.

Data preparation

Peakachu requires the contact map to be a cooler file and any training input to be a text file in bedpe format. Example training data is included in the github repository, consisting of bedpe files prepared from supplementary tables in Tang et al and Mumbach et al. Cooler files may be found at the 4DN data portal.

wget ftp://cooler.csail.mit.edu/coolers/hg19/Rao2014-GM12878-MboI-allreps-filtered.10kb.cool

Train a model and predict loops

It is always a good idea to call the help function immediately before entering a command:

peakachu train -h
usage: peakachu train [-h] [-r RESOLUTION] [-p PATH] [--balance] [-O OUTPUT]
                      [-w WIDTH] [-b BEDPE]

optional arguments:
  -h, --help            show this help message and exit
  -r RESOLUTION, --resolution RESOLUTION
                        Resolution in bp, default 10000
  -p PATH, --path PATH  Path to a .cool URI string or a .hic file.
  --balance             Whether or not using the ICE/KR-balanced matrix.
  -O OUTPUT, --output OUTPUT
                        Folder path to store results.
  -w WIDTH, --width WIDTH
                        Number of bins added to center of window. default
                        width=5 corresponds to 11x11 windows
  -b BEDPE, --bedpe BEDPE
                        Path to the bedpe file containing positive training
                        set.
peakachu train -p Rao2014-GM12878-MboI-allreps-filtered.10kb.cool --balance -O models -b hg19.mumbach.h3k27ac.hichip.bedpe

This will train one 23 random forest models, each labeled by a chromosome. Every model was trained on all of the interactions from the bedpe files EXCEPT for the chromosome which it is labeled as. The purpose of this is to allow Peakachu to predict loops from the same map it used for training, without overfitting. To use these models, you may either use the score_chromosome function to predict loops in only one chromosome, or the score_genome function when using a trained model to predict loops in a new contact map.

peakachu score_chromosome -h
usage: peakachu score_chromosome [-h] [-r RESOLUTION] [-p PATH] [--balance]
                                 [-O OUTPUT] [-w WIDTH] [-m MODEL] [-l LOWER]
                                 [-u UPPER]

optional arguments:
  -h, --help            show this help message and exit
  -r RESOLUTION, --resolution RESOLUTION
                        Resolution in bp, default 10000
  -p PATH, --path PATH  Path to a .cool URI string or a .hic file.
  --balance             Whether or not using the ICE/KR-balanced matrix.
  -O OUTPUT, --output OUTPUT
                        Folder path to store results.
  -w WIDTH, --width WIDTH
                        Number of bins added to center of window. default
                        width=5 corresponds to 11x11 windows
  -m MODEL, --model MODEL
                        Path to pickled model file.
  -l LOWER, --lower LOWER
                        Lower bound of distance between loci in bins (default
                        2).
  -u UPPER, --upper UPPER
                        Upper bound of distance between loci in bins (default
                        300).
for i in models/*pkl; do peakachu score_chromosome -p Rao2014-GM12878-MboI-allreps-filtered.10kb.cool --balance -O scores -m $i; done
for i in scores/*; do peakachu pool -i $i -t .9 > ${i}.loops.txt; done

The pool function serves to select the most significant non-redundant results from per-pixel probabilities calculated by the score functions. It is recommended to try different probability thresholds to achieve the best sensitivity-specificity tradeoff. The output is a standard bedpe file with the 7th and final column containing the predicted probability from the sklearn model, to support further filtering. The results can be visualized in juicer by loading as 2D annotations. Here is an example screenshot of predicted GM12878 loops in juicer: Predicted loops from model trained on H3K27ac HiChIP interactions

Using Peakachu as a standard loop caller

Models for predicting loops in Hi-C have already been prepared using CTCF ChIA-PET and H3K27ac HiChIP as training sets, at a variety of read depths. Simply download the appropriate model file and run the score_genome function.

Total intra reads CTCF Model Link H3K27ac Model Link
> 2 billion CTCF total H3K27ac total
1.9 - 2 billion CTCF 90% H3K27ac 90%
1.7 - 1.9 billion CTCF 80% H3K27ac 80%
1.5 - 1.7 billion CTCF 70% H3K27ac 70%
1.3 - 1.5 billion CTCF 60% H3K27ac 60%
1.1 - 1.3 billion CTCF 50% H3K27ac 50%
0.9 - 1.1 billion CTCF 40% H3K27ac 40%
0.7 - 0.9 billion CTCF 30% H3K27ac 30%
0.4 - 0.7 billion CTCF 20% H3K27ac 20%
< 400 million CTCF 10% H3K27ac 10%

To make it clear, let's download another Hi-C dataset from 4DN: https://data.4dnucleome.org/files-processed/4DNFI5IHU27G/@@download/4DNFI5IHU27G.mcool. Peakachu provides a handy function peakachu depth to extract the total number of intra-chromosomal pairs from cool URI:

peakachu depth -p 4DNFI5IHU27G.mcool:resolutions/10000

592991890

According to the table, for ~600 million intra-reads, we recommend using the 20% models in your prediction. Please refer to our paper to learn the differences between the CTCF and H3K27ac models.

peakachu score_genome -r 10000 --balance -p 4DNFI5IHU27G.mcool:resolutions/10000 -O scores -m down20.ctcf.pkl
for i in scores/*; do peakachu pool -i $i -t .9 > ${i}.loops.txt; done

Not just Hi-C

Peakachu has been tested on Hi-C, Micrco-C, and DNA SPRITE contact maps with good results. For training sets, ChIA-PET, HiChIP, and PLAC-Seq have been tested. The purpose of this software is ultimately to facilitate the interpretation of results from multiple types of experiments, and the user is encouraged to apply Peakachu's training framework to newer approaches as they become available.

You can’t perform that action at this time.