ENT3C is a method for quantifying the similarity of micro-C/Hi-C derived chromosomal contact matrices. It is based on the von Neumann entropy [1] and recent work on entropy quantification of Pearson correlation matrices [2]. For a contact matrix, ENT3C records the change in local pattern complexity of smaller Pearson-transformed submatrices along the matrix diagonal to generate a characteristic signal. Similarity is defined as the Pearson correlation between the respective entropy signals of two contact matrices.
https://doi.org/10.1093/nargab/lqae076
- ENT3C loads the cooler files and looks for shared empty bins.
- ENT3C will first take the logarithm of an input matrix $\mathbf{M}$.
- Next, smaller submatrices $\mathbf{a}$ of dimension $n\times n$ are extracted along the diagonal of the input contact matrix $\mathbf{M}$.
- $nan$ values in $\mathbf{a}$ are set to the minimum value in $\mathbf{a}$.
- $\mathbf{a}$ is transformed into a Pearson correlation matrix $\mathbf{P}$.
- $\mathbf{P}$ is transformed into $\boldsymbol{\rho}=\mathbf{P}/n$ to fulfill the conditions for computing the von Neumann entropy.
- The von Neumann entropy of $\boldsymbol{\rho}$ is computed as $S(\boldsymbol{\rho})=-\sum_j \lambda_j \log \lambda_j$, where $\lambda_j$ is the $j$th eigenvalue of $\boldsymbol{\rho}$.
- This is repeated for subsequent submatrices along the diagonal of the input matrix and stored in the entropy signal $\mathbf{S}_{M}$.
- Similarity $Q$ is defined as the Pearson correlation $r$ between the entropy signals of two matrices: $Q(\mathbf{M}_1,\mathbf{M}_2) = r(\mathbf{S}_{\mathbf{M}_1},\mathbf{S}_{\mathbf{M}_2})$.
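
The steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the code of `ENT3C.jl` or `ENT3C.m`: the function names (`von_neumann_entropy`, `entropy_signal`, `similarity`) are hypothetical, shared empty bins are assumed to have been removed already, non-positive contacts are assumed to become NaN after the logarithm, and undefined correlations of constant rows are assumed to be set to 0.

```python
import numpy as np

def von_neumann_entropy(a):
    """Entropy of one n x n submatrix (assumes at least one finite entry)."""
    a = np.array(a, dtype=float)          # work on a copy
    n = a.shape[0]
    a[np.isnan(a)] = np.nanmin(a)         # NaN -> minimum value of the submatrix
    P = np.corrcoef(a)                    # Pearson correlation between rows
    P = np.nan_to_num(P, nan=0.0)         # assumption: undefined correlations -> 0
    np.fill_diagonal(P, 1.0)
    rho = P / n                           # unit trace, as required for the entropy
    lam = np.linalg.eigvalsh(rho)         # eigenvalues of the symmetric matrix rho
    lam = lam[lam > 1e-12]                # convention: 0 * log(0) = 0
    return float(-np.sum(lam * np.log(lam)))

def entropy_signal(M, n, phi=1):
    """Entropies of n x n submatrices taken every phi bins along the diagonal."""
    M = np.asarray(M, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        L = np.log(M)
    L[~np.isfinite(L)] = np.nan           # assumption: log of zero counts is missing
    N = L.shape[0]
    starts = range(0, N - n + 1, phi)     # 1 + floor((N - n) / phi) windows
    return np.array([von_neumann_entropy(L[s:s + n, s:s + n]) for s in starts])

def similarity(M1, M2, n, phi=1):
    """Q: Pearson correlation between the entropy signals of two matrices."""
    S1 = entropy_signal(M1, n, phi)
    S2 = entropy_signal(M2, n, phi)
    return float(np.corrcoef(S1, S2)[0, 1])
```

Both matrices passed to `similarity` are assumed to have the same size after empty-bin removal, so that their entropy signals have the same length.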
Exemplary depiction of the ENT3C derivation of the entropy signal
Julia or MATLAB.
- Dependencies, packages and version information for the Julia implementation are defined in `Project.toml` and `Manifest.toml`.
- Set `--install-deps=yes` if you wish to automatically install the packages and resolve the environment.
- For the Julia implementation, Ubuntu's hdf5-tools is also required.
Both Julia and MATLAB implementations (`ENT3C.jl` and `ENT3C.m`) were tested on Hi-C and micro-C contact matrices binned at 40 kb in `cool` format.
micro-C
Cell line | Biological Replicate (BR) | Accession (Experiment set) | Accession (pairs) |
---|---|---|---|
H1-hESC | 1 | 4DNES21D8SP8 | 4DNFING6ZFDF, 4DNFIBMG8YA3, 4DNFIMT4PHZ1, 4DNFI8GM4EL9 |
H1-hESC | 2 | 4DNES21D8SP8 | 4DNFIIYUGYBU, 4DNFI89L17XY, 4DNFIXP9MVBU, 4DNFI2YHYAJO, 4DNFIULY29IQ |
HFFc6 | 1 | 4DNESphiT3UBH | 4DNFIN7IIIY6, 4DNFIJZDEIZ3, 4DNFIYBTHGNA, 4DNFIK8UIB5B |
HFFc6 | 2 | 4DNESphiT3UBH | 4DNFIF5F4HRG, 4DNFIK82YRNM, 4DNFIATCW955, 4DNFIZU6ADT1, 4DNFIKWV6BY2 |
HFFc6 | 3 | 4DNESphiT3UBH | 4DNFIFJL4JIH, 4DNFIONHB78N, 4DNFIG1ZOVIM, 4DNFIPKVL9YI, 4DNFIJM966UR, 4DNFIV8JNJB8 |
Hi-C
Cell line | Biological Replicate (BR) | Accession (Experiment set) | Accession (BAM) |
---|---|---|---|
G401 | 1 | ENCSR079VIJ | ENCFF649MAY |
G401 | 2 | ENCSR079VIJ | ENCFF758WUD |
LNCaP | 1 | ENCSR346DCU | ENCFF977XHB |
LNCaP | 2 | ENCSR346DCU | ENCFF204XII |
A549 | 1 | ENCSR444WCZ | ENCFF867DCM |
A549 | 2 | ENCSR444WCZ | ENCFF532XBC |
- For the Hi-C data, `bam` files were downloaded from the ENCODE data portal and converted into `pairs` files using the `pairtools parse` function [3]:

  `pairtools parse --chroms-path hg38.fa.sizes -o <OUT.pairs.gz> --assembly hg38 --no-flip --add-columns mapq --drop-sam --drop-seq --nproc-in 15 --nproc-out 15 <IN.bam>`
- For the micro-C data, `pairs` of technical replicates (TRs) were merged with `pairtools merge`. E.g. for H1-hESC, BR1 (4DNES21D8SP8):

  `pairtools merge -o <hESC.BR1.pairs.gz> --nproc 10 4DNFING6ZFDF.pairs.gz 4DNFIBMG8YA3.pairs.gz 4DNFIMT4PHZ1.pairs.gz 4DNFI8GM4EL9.pairs.gz`
- 40 kb coolers were generated from the Hi-C/micro-C pairs files with the `cooler cload pairs` function [4]:

  `cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 --assembly hg38 <CHRSIZE_FILE:40000> <IN.pairs.gz> <OUT.cool>`
- The main ENT3C parameter affecting the final entropy signal $S$ is the dimension of the submatrices, `SUB_M_SIZE_FIX`.
  - `SUB_M_SIZE_FIX` can either be set to a fixed value or, alternatively, one can specify `CHRSPLIT`; in this case `SUB_M_SIZE_FIX` will be computed internally to fit the desired number of times the contact matrix is to be partitioned.
  - `PHI=1+floor((N-SUB_M_SIZE)./phi)`, where `N` is the size of the input contact matrix, `phi` is the window shift, and `PHI` is the number of evaluated submatrices (consequently the number of data points in $S$).
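
As a worked example of the `PHI` formula, the following Python sketch counts the evaluated submatrices and lists their start offsets. The function names and the 0-based offsets are illustrative assumptions; `N = 2044` is a hypothetical matrix size chosen so that the result matches a sub-matrix dimension of 292 and a window shift of 2.

```python
def n_windows(N, sub_m_size, phi):
    """PHI = 1 + floor((N - SUB_M_SIZE) / phi)."""
    if sub_m_size > N:
        raise ValueError("submatrix is larger than the contact matrix")
    return 1 + (N - sub_m_size) // phi

def window_starts(N, sub_m_size, phi):
    """0-based start offsets of the submatrices along the diagonal (assumed convention)."""
    return [k * phi for k in range(n_windows(N, sub_m_size, phi))]

# Hypothetical matrix size N; sub-matrix dimension 292 and window shift 2.
print(n_windows(2044, 292, 2))   # 1 + floor(1752 / 2) = 877
print(window_starts(10, 4, 3))   # [0, 3, 6]
```

A larger `phi` shortens the entropy signal roughly in proportion, since the windows are spaced further apart along the diagonal.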
Both Julia and MATLAB implementations (`ENT3C.jl` and `ENT3C.m`) use a configuration file in JSON format.
- for Julia, specify `--config-file=config.json`
- for MATLAB, please set the configuration filename directly in the `ENT3C.m` script
ENT3C parameters are defined in `config/config.json`:
- `"DATA_PATH": "DATA"`
- `"FILES": ["ENCSR079VIJ.BioRep1.40kb.cool", "G401_BR1", "ENCSR079VIJ.BioRep2.40kb.cool", "G401_BR2"]`: entries are given in the format `[<COOL_FILENAME>, <SHORT_NAME>]`.

  💡 ENT3C also takes `mcool` files as input. Please refer to biological replicates as "_BR%d" in the `<SHORT_NAME>`.
- `"OUT_DIR": "OUTPUT/"`: `OUT_DIR` will be concatenated with `OUTPUT/JULIA/` or `OUTPUT/MATLAB/`.
- `"OUT_PREFIX": "40kb"`
- `"Resolution": "40e3,100e3"`
- `"ChrNr": "15,16,17,18,19,20,21,22,X"`
- `"NormM": 0`: if `NormM` is set to 1, the balancing weights in the cooler are applied, and ENT3C expects the weights to be in the dataset `/resolutions/<resolution>/bins/<WEIGHTS_NAME>`.
- `"WEIGHTS_NAME": "weight"`
- `"SUB_M_SIZE_FIX": null`
- `"CHRSPLIT": 10`
- `"phi": 1`
- `"PHI_MAX": 1000`
julia ENT3C.jl --config-file=config/config.test.json --install-deps=no
matlab -nodesktop -nosplash -nodisplay -r "ENT3C('config/config.test.json'); exit"
Associated functions are contained in the directories `JULIA_functions/` and `MATLAB_functions/`.
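
To illustrate how the string-valued entries of the configuration file passed to the commands above could be interpreted, here is a minimal Python sketch. It is not how `ENT3C.jl` or `ENT3C.m` parse the file; the function name `summarize_config` and the returned keys are hypothetical, and the example configuration only repeats a subset of the parameters listed earlier.

```python
import json

EXAMPLE_CONFIG = """
{
  "DATA_PATH": "DATA",
  "FILES": ["ENCSR079VIJ.BioRep1.40kb.cool", "G401_BR1",
            "ENCSR079VIJ.BioRep2.40kb.cool", "G401_BR2"],
  "Resolution": "40e3,100e3",
  "ChrNr": "15,16,17,18,19,20,21,22,X",
  "SUB_M_SIZE_FIX": null,
  "CHRSPLIT": 10
}
"""

def summarize_config(text):
    """Split FILES into (filename, short name) pairs and parse the list-like strings."""
    cfg = json.loads(text)
    files = cfg["FILES"]
    if len(files) % 2 != 0:
        raise ValueError("FILES must alternate <COOL_FILENAME>, <SHORT_NAME>")
    return {
        # FILES alternates cooler filename and short name.
        "samples": list(zip(files[0::2], files[1::2])),
        # "40e3,100e3" -> [40000, 100000]
        "resolutions": [int(float(r)) for r in cfg["Resolution"].split(",")],
        # Chromosome labels stay strings because of "X".
        "chromosomes": cfg["ChrNr"].split(","),
    }

summary = summarize_config(EXAMPLE_CONFIG)
print(summary["samples"])
print(summary["resolutions"])
```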
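Output files:
`40kb_ENT3C_similarity.csv` lists the pairwise comparisons: the columns `Sample1` and `Sample2` contain the short names specified in `FILES`, and the column `Q` the corresponding similarity score.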
Resolution | ChrNr | Sample1 | Sample2 | Q |
---|---|---|---|---|
40000 | 15 | A549_BR1 | A549_BR2 | 0.995462832813044 |
40000 | 15 | A549_BR1 | G401_BR1 | 0.565465091507697 |
40000 | 15 | A549_BR1 | G401_BR2 | 0.587395560010108 |
40000 | 15 | A549_BR1 | H1-hESC_BR1 | 0.511892949109715 |
40000 | 15 | A549_BR1 | H1-hESC_BR2 | 0.46675009291503 |
... | ... | ... | ... | ... |
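
Because biological replicates carry the "_BR%d" suffix in their short names, similarity rows can be split into replicate and non-replicate comparisons. The Python sketch below shows one way to do this on rows copied from the example above; the helper names `cell_line` and `is_replicate_pair` are hypothetical and not part of ENT3C.

```python
import re

def cell_line(short_name):
    """Strip a trailing _BR<number> suffix from a short name."""
    return re.sub(r"_BR\d+$", "", short_name)

def is_replicate_pair(sample1, sample2):
    """True if both samples are biological replicates of the same cell line."""
    return cell_line(sample1) == cell_line(sample2)

# Rows copied from the example similarity table: (Resolution, ChrNr, Sample1, Sample2, Q)
rows = [
    (40000, "15", "A549_BR1", "A549_BR2", 0.995462832813044),
    (40000, "15", "A549_BR1", "G401_BR1", 0.565465091507697),
    (40000, "15", "A549_BR1", "H1-hESC_BR2", 0.46675009291503),
]

replicate_q = [q for _, _, s1, s2, q in rows if is_replicate_pair(s1, s2)]
other_q = [q for _, _, s1, s2, q in rows if not is_replicate_pair(s1, s2)]
print(replicate_q, other_q)
```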
40kb_ENT3C_OUT.csv
Name | ChrNr | Resolution | n | PHI | phi | binNrStart | binNrEND | START | END | S |
---|---|---|---|---|---|---|---|---|---|---|
G401_BR1 | 15 | 40000 | 292 | 877 | 2 | 1 | 369 | 0 | 14760000 | 3.70691992953067 |
G401_BR1 | 15 | 40000 | 292 | 877 | 2 | 3 | 371 | 80000 | 14840000 | 3.68605952020314 |
G401_BR1 | 15 | 40000 | 292 | 877 | 2 | 12 | 373 | 440000 | 14920000 | 3.67630110653009 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Each row corresponds to an evaluated submatrix with the fields `Name` (the short name specified in `FILES`), `ChrNr`, `Resolution`, the sub-matrix dimension `n` (`sub_m_dim`), `PHI=1+floor((N-SUB_M_SIZE)./phi)`, and the window shift `phi`; `binNrStart` and `binNrEND` correspond to the start and end bin of the submatrix, `START` and `END` are the corresponding genomic coordinates, and `S` is the computed von Neumann entropy.
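
The relation between bin numbers and genomic coordinates is not spelled out here. The Python sketch below encodes the relation that the three example rows are consistent with, namely 1-based bin numbers with `START = (binNrStart - 1) * Resolution` and `END = binNrEND * Resolution`. This is an inference from the example output rather than a documented rule, and the function name is hypothetical.

```python
def bin_range_to_coords(bin_start, bin_end, resolution):
    """Map 1-based start/end bin numbers to genomic START/END (inferred relation)."""
    start = (bin_start - 1) * resolution
    end = bin_end * resolution
    return start, end

# The three example rows at 40 kb resolution:
for bin_start, bin_end in [(1, 369), (3, 371), (12, 373)]:
    print(bin_range_to_coords(bin_start, bin_end, 40000))
```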
40kb_ENT3C_signals.png
Entropy signals generated by `ENT3C.jl` for contact matrices of chromosomes 15-22 binned at 40 kb in various cell lines.
Entropy signals generated by `ENT3C.m` for contact matrices of chromosomes 15-22 binned at 40 kb in various cell lines.
1. Neumann, J. von, Thermodynamik quantenmechanischer Gesamtheiten. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse. 1927. 273-291.
2. Felippe, H., et al., Threshold-free estimation of entropy from a Pearson matrix. EPL. 141(3):31003. 2023.
3. Open2C et al., Pairtools: from sequencing data to chromosome contacts. bioRxiv. 2023.
4. Abdennur, N., and Mirny, L.A., Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. 2020.