ENT3C is a method for quantifying the similarity of micro-C/Hi-C derived chromosomal contact matrices. It is based on the von Neumann entropy [1] and recent work on entropy quantification of Pearson correlation matrices [2]. For a contact matrix, ENT3C records the change in local pattern complexity of smaller Pearson-transformed submatrices along the matrix diagonal to generate a characteristic signal. Similarity is defined as the Pearson correlation between the respective entropy signals of two contact matrices.
https://doi.org/10.1101/2024.01.30.577923
- ENT3C loads the cooler files and looks for shared empty bins.
- ENT3C extracts smaller submatrices $\mathbf{a}$ of dimension $n\times n$ along the diagonal of an input contact matrix $\mathbf{M}$.
- The logarithm of $\mathbf{a}$ is taken ($nan$ values are set to zero).
- $\mathbf{a}$ is transformed into a Pearson correlation matrix $\mathbf{P}$ ($nan$ values are set to zero).
- $\mathbf{P}$ is transformed into $\boldsymbol{\rho}=\mathbf{P}/n$ to fulfill the conditions for computing the von Neumann entropy.
- The von Neumann entropy of $\boldsymbol{\rho}$ is computed as $S(\boldsymbol{\rho})=-\sum_j \lambda_j \log \lambda_j$, where $\lambda_j$ is the $j$th eigenvalue of $\boldsymbol{\rho}$.
- This is repeated for subsequent submatrices along the diagonal of the input matrix and stored in the entropy signal $\mathbf{S}_{M}$.
- Similarity $Q$ is defined as the Pearson correlation $r$ between the entropy signals of two matrices: $Q(\mathbf{M}_1,\mathbf{M}_2) = r(\mathbf{S}_{\mathbf{M}_1},\mathbf{S}_{\mathbf{M}_2})$.
Exemplary depiction of ENT3C derivation of the entropy signal
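The derivation of the entropy signal can be sketched in a few lines of Python with NumPy. This is a minimal illustration of the steps listed above, not the code in `ENT3C.jl` or `ENT3C.m`: the function names `vn_entropy` and `entropy_signal` are hypothetical, empty-bin handling is left out, and treating `log(0) = -inf` like a `nan` (setting it to zero) is an assumption.

```python
import numpy as np


def vn_entropy(a):
    """Von Neumann entropy of one n x n contact submatrix (illustrative sketch)."""
    a = np.asarray(a, dtype=float)
    n = a.shape[0]
    with np.errstate(divide="ignore", invalid="ignore"):
        log_a = np.log(a)
        # Assumption: non-finite entries (nan, and -inf from log(0)) are set to zero.
        log_a[~np.isfinite(log_a)] = 0.0
        # Pearson correlation between the rows of the log-transformed submatrix.
        P = np.corrcoef(log_a)
    P = np.nan_to_num(P, nan=0.0)  # constant rows yield nan correlations
    rho = P / n                    # scale so that the eigenvalues sum to at most 1
    lam = np.linalg.eigvalsh(rho)  # rho is symmetric
    lam = lam[lam > 1e-12]         # convention: 0 * log(0) = 0
    return float(-np.sum(lam * np.log(lam)))


def entropy_signal(M, n, ws=1):
    """Entropy of every n x n submatrix along the diagonal of M, shifted by ws bins."""
    M = np.asarray(M, dtype=float)
    N = M.shape[0]
    starts = range(0, N - n + 1, ws)
    return np.array([vn_entropy(M[s:s + n, s:s + n]) for s in starts])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((30, 30)) + 0.1
    M = A + A.T  # toy symmetric "contact matrix", not real data
    print(entropy_signal(M, n=10, ws=2))
```

Each value of the signal lies between 0 (all rows of the submatrix perfectly correlated) and $\log n$.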
Julia or MATLAB
Both Julia and MATLAB implementations (`ENT3C.jl` and `ENT3C.m`) were tested on Hi-C and micro-C contact matrices binned at 40 kb in `cool` format.
micro-C
Cell line | Biological Replicate (BR) | Accession (Experiment set) | Accession (pairs) |
---|---|---|---|
H1-hESC | 1 | 4DNES21D8SP8 | 4DNFING6ZFDF, 4DNFIBMG8YA3, 4DNFIMT4PHZ1, 4DNFI8GM4EL9 |
H1-hESC | 2 | 4DNES21D8SP8 | 4DNFIIYUGYBU, 4DNFI89L17XY, 4DNFIXP9MVBU, 4DNFI2YHYAJO, 4DNFIULY29IQ |
HFFc6 | 1 | 4DNESWST3UBH | 4DNFIN7IIIY6, 4DNFIJZDEIZ3, 4DNFIYBTHGNA, 4DNFIK8UIB5B |
HFFc6 | 2 | 4DNESWST3UBH | 4DNFIF5F4HRG, 4DNFIK82YRNM, 4DNFIATCW955, 4DNFIZU6ADT1, 4DNFIKWV6BY2 |
HFFc6 | 3 | 4DNESWST3UBH | 4DNFIFJL4JIH, 4DNFIONHB78N, 4DNFIG1ZOVIM, 4DNFIPKVL9YI, 4DNFIJM966UR, 4DNFIV8JNJB8 |
Hi-C
Cell line | Biological Replicate (BR) | Accession (Experiment set) | Accession (BAM) |
---|---|---|---|
G401 | 1 | ENCSR079VIJ | ENCFF649MAY |
G401 | 2 | ENCSR079VIJ | ENCFF758WUD |
LNCaP | 1 | ENCSR346DCU | ENCFF977XHB |
LNCaP | 2 | ENCSR346DCU | ENCFF204XII |
A549 | 1 | ENCSR444WCZ | ENCFF867DCM |
A549 | 2 | ENCSR444WCZ | ENCFF532XBC |
- For the Hi-C data, `bam` files were downloaded from the ENCODE data portal and converted into `pairs` files using the `pairtools parse` function [3]:
  `pairtools parse --chroms-path hg38.fa.sizes -o <OUT.pairs.gz> --assembly hg38 --no-flip --add-columns mapq --drop-sam --drop-seq --nproc-in 15 --nproc-out 15 <IN.bam>`
- For the micro-C data, `pairs` files of technical replicates (TRs) were merged with `pairtools merge`, e.g. for H1-hESC, BR1 (4DNES21D8SP8):
  `pairtools merge -o <hESC.BR1.pairs.gz> --nproc 10 4DNFING6ZFDF.pairs.gz 4DNFIBMG8YA3.pairs.gz 4DNFIMT4PHZ1.pairs.gz 4DNFI8GM4EL9.pairs.gz`
- 40 kb coolers were generated from the Hi-C/micro-C pairs files with the `cooler cload pairs` function [4]:
  `cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 --assembly hg38 <CHRSIZE_FILE:40000> <IN.pairs.gz> <OUT.cool>`
Both Julia and MATLAB implementations (`ENT3C.jl` and `ENT3C.m`) call a configuration file in JSON format.
💡 The main ENT3C parameter affecting the final entropy signal is `SUB_M_SIZE_FIX`.
`SUB_M_SIZE_FIX` can either be fixed directly or, alternatively, one can specify `CHRSPLIT`; in this case `SUB_M_SIZE_FIX` is computed internally to fit the desired number of times the contact matrix is to be partitioned.
`WN=1+floor((N-SUB_M_SIZE)./WS)`
where `N` is the size of the input contact matrix, `WS` is the window shift, and `WN` is the number of evaluated submatrices (consequently the number of data points in the entropy signal $\mathbf{S}_{M}$).
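As a worked example of the window-count formula, the short Python sketch below evaluates `WN` for given `N`, `SUB_M_SIZE` and `WS`. The helper name `window_count` is hypothetical, and `N = 2045` is an assumed matrix size, chosen only because it reproduces the `WN` of 1841 shown in the example output for `sub_m_dim = 205` and `WS = 1`. How ENT3C derives `SUB_M_SIZE_FIX` from `CHRSPLIT` is not covered here.

```python
def window_count(N, sub_m_size, ws=1):
    """Number of submatrices WN = 1 + floor((N - SUB_M_SIZE) / WS)."""
    if sub_m_size > N:
        raise ValueError("submatrix is larger than the contact matrix")
    if ws < 1:
        raise ValueError("window shift must be a positive integer")
    return 1 + (N - sub_m_size) // ws


if __name__ == "__main__":
    # Assumed N = 2045 (hypothetical), sub_m_dim = 205, WS = 1.
    print(window_count(2045, 205, 1))
```

A larger `WS` shortens the entropy signal roughly in proportion, since fewer submatrices are evaluated along the diagonal.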
ENT3C parameters set in `config/config.json`:

- `DATA_PATH: "DATA"`
- `FILES: ["ENCSR079VIJ.BioRep1.40kb.cool","G401_BR1" ...]`, given in the format `[<COOL_FILENAME>, <SHORT_NAME>]`.
  💡 ENT3C also takes `mcool` files as input. Please refer to biological replicates as "_BR%d" in the `<SHORT_NAME>`.
- `OUT_DIR: "OUTPUT/"`. `OUT_DIR` will be concatenated with `OUTPUT/JULIA/` or `OUTPUT/MATLAB/`.
- `OUT_PREFIX: "40kb"`
- `Resolution: 40000`
- `ChrNr: 14`
- `NormM: 0`. For `NormM: 1`, balancing weights in cooler are applied.
- `SUB_M_SIZE_FIX: null`
- `CHRSPLIT: 10`
- `WS: 1`
- `WN_MAX: 1000`
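To make the configuration layout concrete, the following Python sketch parses a JSON document carrying the parameters listed above and splits the short names into cell line and biological replicate number. It is an illustration only: the flat, alternating layout of `FILES` is an assumption read off the example, the second file name is hypothetical (patterned on the first), and `pair_files` and `split_short_name` are hypothetical helpers, not ENT3C functions.

```python
import json
import re

# Values follow the parameter list above; the second FILES entry
# (BioRep2 / G401_BR2) is a hypothetical addition for illustration.
CONFIG_TEXT = """
{
  "DATA_PATH": "DATA",
  "FILES": ["ENCSR079VIJ.BioRep1.40kb.cool", "G401_BR1",
            "ENCSR079VIJ.BioRep2.40kb.cool", "G401_BR2"],
  "OUT_DIR": "OUTPUT/",
  "OUT_PREFIX": "40kb",
  "Resolution": 40000,
  "ChrNr": 14,
  "NormM": 0,
  "SUB_M_SIZE_FIX": null,
  "CHRSPLIT": 10,
  "WS": 1,
  "WN_MAX": 1000
}
"""


def pair_files(files):
    """Group a flat FILES list into (cool_filename, short_name) tuples (assumed layout)."""
    if len(files) % 2 != 0:
        raise ValueError("FILES must alternate file names and short names")
    return list(zip(files[0::2], files[1::2]))


def split_short_name(short_name):
    """Split e.g. 'G401_BR1' into ('G401', 1), following the '_BR%d' convention."""
    match = re.fullmatch(r"(.+)_BR(\d+)", short_name)
    if match is None:
        raise ValueError(f"short name {short_name!r} does not end in _BR<number>")
    return match.group(1), int(match.group(2))


config = json.loads(CONFIG_TEXT)
pairs = pair_files(config["FILES"])

if __name__ == "__main__":
    for cool_file, short_name in pairs:
        print(cool_file, split_short_name(short_name))
```

JSON `null` becomes `None` in Python, so a check such as `config["SUB_M_SIZE_FIX"] is None` distinguishes the `CHRSPLIT` case from a fixed submatrix size.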
Upon modifying `config/config.json` as desired, `ENT3C.jl` and `ENT3C.m` will run using the specified parameters.
Associated functions are contained in the directories `JULIA_functions/` and `MATLAB_functions/`.
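Output files:
`40kb_ENT3C_similarity.csv` lists the pairwise comparisons: the columns `Sample1` and `Sample2` contain the short names specified in `FILES`, and the column `Q` the corresponding similarity score.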
ChrNr Sample1 Sample2 Q
15 A549_BR1 A549_BR2 0.991055440056682
15 A549_BR1 G401_BR1 0.558213759737198
15 A549_BR1 G401_BR2 0.593612526651996
. . .
. . .
. . .
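The score in the `Q` column is the Pearson correlation between two entropy signals. A minimal pure-Python sketch of that correlation is given below; the function name `similarity_q` is hypothetical, and the signals in the usage example are made-up toy values, not ENT3C output.

```python
import math


def similarity_q(signal_1, signal_2):
    """Pearson correlation r between two equally long entropy signals."""
    if len(signal_1) != len(signal_2):
        raise ValueError("entropy signals must have the same length")
    if len(signal_1) < 2:
        raise ValueError("at least two data points are required")
    mean_1 = sum(signal_1) / len(signal_1)
    mean_2 = sum(signal_2) / len(signal_2)
    cov = sum((x - mean_1) * (y - mean_2) for x, y in zip(signal_1, signal_2))
    var_1 = sum((x - mean_1) ** 2 for x in signal_1)
    var_2 = sum((y - mean_2) ** 2 for y in signal_2)
    if var_1 == 0 or var_2 == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / math.sqrt(var_1 * var_2)


if __name__ == "__main__":
    # Toy signals for illustration only.
    s_a = [3.27, 3.25, 3.30, 3.10, 3.05]
    s_b = [3.20, 3.19, 3.26, 3.02, 3.00]
    print(similarity_q(s_a, s_b))
```

Because the Pearson correlation ignores offset and scale, two signals with the same shape along the chromosome score close to 1 even if their absolute entropy levels differ.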
40kb_ENT3C_OUT.csv
Name ChrNr Resolution sub_m_dim WN WS binNrStart binNrEND START END S
G401_BR1 15 40000 205 1841 1 1 282 0 11280000 3.27338758846104
G401_BR1 15 40000 205 1841 1 2 283 40000 11320000 3.27061968317512
G401_BR1 15 40000 205 1841 1 3 284 80000 11360000 3.25483843736616
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
Each row corresponds to an evaluated submatrix with the fields `Name` (the short name specified in `FILES`), `ChrNr`, `Resolution`, the submatrix dimension `sub_m_dim`, the number of evaluated submatrices `WN=1+floor((N-SUB_M_SIZE)./WS)`, and the window shift `WS`. `binNrStart` and `binNrEND` are the start and end bin of the submatrix, `START` and `END` are the corresponding genomic coordinates, and `S` is the computed von Neumann entropy.
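The relation between bin numbers and genomic coordinates can be read off the example rows: `START` equals `(binNrStart - 1) * Resolution` and `END` equals `binNrEND * Resolution`. The Python sketch below checks this against the three rows shown above. The relation is inferred from those rows only, not taken from the ENT3C source, and `window_coordinates` is a hypothetical helper.

```python
def window_coordinates(bin_start, bin_end, resolution):
    """Genomic START/END of a submatrix from 1-based bin numbers (inferred relation)."""
    start = (bin_start - 1) * resolution
    end = bin_end * resolution
    return start, end


# (binNrStart, binNrEND, START, END) copied from the example rows above.
EXAMPLE_ROWS = [
    (1, 282, 0, 11280000),
    (2, 283, 40000, 11320000),
    (3, 284, 80000, 11360000),
]

if __name__ == "__main__":
    for bin_start, bin_end, start, end in EXAMPLE_ROWS:
        print(window_coordinates(bin_start, bin_end, 40000) == (start, end))
```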
40kb_ENT3C_signals.png
Entropy signals from `ENT3C.jl` for contact matrices of chromosomes 15-22 binned at 40 kb in various cell lines:
Entropy signals from `ENT3C.m` for contact matrices of chromosomes 15-22 binned at 40 kb in various cell lines:
1. Neumann, J. von. Thermodynamik quantenmechanischer Gesamtheiten. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse 1927: 273-291. 1927.
2. Felippe, H., et al. Threshold-free estimation of entropy from a Pearson matrix. EPL. 141(3):31003. 2023.
3. Open2C et al. Pairtools: from sequencing data to chromosome contacts. bioRxiv. 2023.
4. Abdennur, N., and Mirny, L.A. Cooler: scalable storage for Hi-C data and other genomically labeled arrays. Bioinformatics. 2020.