Skip to content
forked from dejunlin/hicrep

Python implementation of HiCRep stratum-adjusted correlation coefficient of Hi-C data with Cooler sparse contact matrix support

License

Notifications You must be signed in to change notification settings

xieting0603/hicrep

 
 

Repository files navigation

hicrep

Build Status License: GPL v3 PyPI version Python version

Python implementation of the HiCRep: a stratum-adjusted correlation coefficient (SCC) for Hi-C data with support for Cooler sparse contact matrices

The algorithm is published in:

HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Tao Yang Feipeng Zhang Galip Gürkan Yardımcı Fan Song Ross C. Hardison William Stafford Noble Feng Yue and Qunhua Li, Genome Res. 2017 Nov;27(11):1939-1949. doi: 10.1101/gr.220640.117.

This implementation takes a pair of Hi-C data sets in Cooler format (.cool for single binsize or .mcool multiple binsizes) and computes the HiCRep SCC scores for each pair of chromosomes between the two data sets. A guide for how to convert a file of read-pairs into the appropriate .cool or .mcool format is available in the Cooler documentation here.

The HiCRep SCC computed from this implementaion is consistent with the original R implementaion (https://github.com/MonkeyLB/hicrep/) and it's more than 10x faster than the R version:

comparison

Usage

To use as a python module, install the package

pip install hicrep

and then use the util function readMcool

from hicrep.utils import readMcool

to read a pair of mcool files and specify the bin size to compute SCC with:

fmcool1 = "mydata1.mcool"
fmcool2 = "mydata2.mcool"
binSize = 100000
cool1, binSize1 = readMcool(fmcool1, binSize)
cool2, binSize2 = readMcool(fmcool2, binSize)

or a pair of .cool files with built-in bin size:

fcool1 = "mydata1.cool"
fcool2 = "mydata2.cool"
cool1, binSize1 = readMcool(fmcool1, -1)
cool2, binSize2 = readMcool(fmcool2, -1)
# binSize1 and binSize2 will be set to the bin size built in the cool file
binSize = binSize1

then define the parameters for computing HiCRep SCC:

from hicrep import hicrepSCC

# smoothing window half-size
h = 1

# maximal genomic distance to include in the calculation
dBPMax = 500000

# whether to perform down-sampling or not 
# if set True, it will bootstrap the data set # with larger contact counts to
# the same number of contacts as in the other data set; otherwise, the contact 
# matrices will be normalized by the respective total number of contacts
bDownSample = False

# compute the SCC score
# this will result in a SCC score for each chromosome available in the data set
# listed in the same order as the chromosomes are listed in the input Cooler files
scc = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample)

# Optionally you can get SCC score from a subset of chromosomes
sccSub = hicrepSCC(cool1, cool2, h, dBPMax, bDownSample, np.array(['myChr1', 'myOtherChr'], dtype=str))

To use as a command line tool, install this package by

pip install hicrep

then run

hicrep mydata1.mcool mydata2.mcool outputSCC.txt --binSize 100000 --h 1 --dBPMax 500000 

when passing in an .mcool file with multiple binsizes or

hicrep mydata1.cool mydata2.cool outputSCC.txt --h 1 --dBPMax 500000 

when passing in a .cool file with a single bultin binsize. The output outputSCC.txt has a list of SCC scores for each chromosome in the input. The output SCC scores are listed in the same order as the chromosomes are listed in the input Cooler files. To see the list of command line options:

hicrep -h

You can optionally compute SCC scores for a subset of chromosomes using

hicrep mydata1.cool mydata2.cool outputSCC_Subset.txt --h 1 --dBPMax 500000 --chrNames 'myChr1' 'myOtherChr'

About

Python implementation of HiCRep stratum-adjusted correlation coefficient of Hi-C data with Cooler sparse contact matrix support

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 100.0%