# Infer enhancer-promoter loops

The ``infer_ep`` command uses the pre-trained ChromBERT model to infer enhancer-promoter loops for user-specified candidate enhancer regions (for example, cell-type-specific chromatin accessibility peaks).

**Note**: The remaining examples show only direct command usage.

If you need to run the commands inside a Singularity container, refer to the [`singularity_use.ipynb`](./singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.

In [1]:
import pandas as pd
import numpy as np
import os
workdir = "/mnt/Storage2/home/chenqianqian/projects/chrombert_tools/2.test/pull/ChromBERT-tools/examples/cli"  # replace with your own working directory
os.chdir(workdir)


In [2]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome             Extract cell-specific cistrome...
  embed_cell_gene                 Extract cell-specific gene embeddings
  embed_cell_region               Extract cell-specific region embeddings
  embed_cell_regulator            Extract cell-specific regulator...
  embed_cistrome                  Extract general cistrome embeddings on...
  embed_gene                      Extract general gene embeddings
  embed_region                    Extract general region embeddings
  embed_regulator                 Extract general regulator embeddings on...
  find_context_specific_cofactor  Find context-specific cofactors in...
  find_driver_in_transition       Find driver factors in cell

In [3]:
!chrombert-tools infer_ep -h

Usage: chrombert-tools infer_ep [OPTIONS]

  Infer enhancer-promoter loop

Options:
  --region FILE                   Region file.  [required]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb|200bp|2kb|4kb]
                                  Resolution. Mouse only supports 1kb
                                  resolution.  [default: 1kb]
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache dir.   [default:
                                  ~/.cache/chrombert/data]
  --chrombert-region-file FILE    ChromBERT region BED file. If not provided,
                                  use the default hg38_6k_1kb_region.bed in
                                  the cache dir.
  --chrombert-region-emb-file FILE
                                  ChromBERT region embedding file. If not
                                  provided, use the default
                          

In [4]:
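# Load the hESC ATAC-seq peaks (tab-separated BED without a header row:
# chrom, start, end, peak name, and a numeric score column)
data = pd.read_csv("../data/hESC_GSM2386582_ATAC.bed", sep='\t', header=None)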
data

Unnamed: 0,0,1,2,3,4
0,chr1,10073,10454,peak1,22.18917
1,chr1,180752,181717,peak2,12.85029
2,chr1,629306,630507,peak3,582.86035
3,chr1,631166,631506,peak4,16.29220
4,chr1,632314,632941,peak5,14.54147
...,...,...,...,...,...
54403,chrY,19566985,19567894,peak54404,46.00989
54404,chrY,19744495,19744989,peak54405,5.62020
54405,chrY,20575493,20576092,peak54406,28.80352
54406,chrY,56728043,56728337,peak54407,34.13103


In [5]:
!mkdir -p 'output_infer_ep'

In [6]:
data[data[0]=='chr1'].to_csv('output_infer_ep/hESC_chr1.bed',sep='\t',header=False,index=False)
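
Before passing a region file to `infer_ep`, it can help to confirm that the table looks like a valid BED file. The following is a minimal sketch, not part of `chrombert-tools`: the helper `check_bed` and the malformed row are hypothetical, and the checks (at least three columns, integer coordinates, `start < end`) are generic BED expectations rather than documented requirements of the tool.

```python
import pandas as pd


def check_bed(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found in a BED-like table (hypothetical helper)."""
    if df.shape[1] < 3:
        return ["fewer than 3 columns"]
    start, end = df.iloc[:, 1], df.iloc[:, 2]
    is_int = pd.api.types.is_integer_dtype
    if not (is_int(start) and is_int(end)):
        return ["non-integer coordinates"]
    problems = []
    if ((start < 0) | (end <= start)).any():
        problems.append("intervals with start < 0 or end <= start")
    return problems


# Two rows copied from the peak table shown above
peaks = pd.DataFrame([
    ["chr1", 10073, 10454, "peak1", 22.18917],
    ["chr1", 180752, 181717, "peak2", 12.85029],
])
# A made-up malformed row (end before start), for illustration only
bad = pd.DataFrame([["chr1", 500, 400, "peakX", 1.0]])

print(check_bed(peaks))
print(check_bed(bad))
```

An empty list means none of these generic checks failed; it does not guarantee that the file satisfies every expectation of `infer_ep`.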

In [11]:
!chrombert-tools infer_ep \
    --region 'output_infer_ep/hESC_chr1.bed' \
    --odir 'output_infer_ep' \
    --genome 'hg38' \
    --resolution "1kb"

Region summary - total: 5262, overlapping with ChromBERT: 5490 (one region may overlap multiple ChromBERT regions, We keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region),non-overlapping: 33
Finished!
Cosine similarity between tss and region pairs saved to: output_infer_ep/tss_region_pairs_cos.tsv
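
The region summary reports more overlaps (5490) than input regions (5262) because one input region can be matched to several ChromBERT bins. The log states that an overlap is kept when it covers at least 50% of either the ChromBERT bin or the input region. The sketch below illustrates that rule under stated assumptions: `overlap_bp` and `keep_pair` are hypothetical helpers, not the tool's implementation, and the bin coordinates are invented for the example rather than taken from the ChromBERT region file.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of overlapping bases between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def keep_pair(region, chrombert_bin, min_frac=0.5):
    """Keep the pair if the overlap covers >= min_frac of either interval."""
    ov = overlap_bp(region[0], region[1], chrombert_bin[0], chrombert_bin[1])
    region_len = region[1] - region[0]
    bin_len = chrombert_bin[1] - chrombert_bin[0]
    return ov / region_len >= min_frac or ov / bin_len >= min_frac


# Input peak from the example data; the two bins are hypothetical 1 kb windows
peak = (180752, 181717)
print(keep_pair(peak, (180000, 181000)))  # 248 bp overlap: below 50% of both
print(keep_pair(peak, (181000, 182000)))  # 717 bp overlap: about 74% of the peak

# A hypothetical 2 kb peak spanning two 1 kb bins is kept for both,
# which is how a single region can produce more than one overlap
wide = (1000, 3000)
print([keep_pair(wide, b) for b in [(1000, 2000), (2000, 3000)]])
```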


In [12]:
# Load the inferred enhancer-promoter pairs.
# cos_sim: cosine similarity between the embedding of the distal (enhancer) region and the
# embedding of the gene promoter (TSS) region; a higher value suggests a more likely enhancer-promoter loop.
tss_region_pairs_cos = pd.read_csv("output_infer_ep/tss_region_pairs_cos.tsv",sep='\t')
tss_region_pairs_cos


Unnamed: 0,chrom,gene_id,gene_name,tss,tss_build_region_index,distal_region_start,distal_region_end,distal_region_build_region_index,dist,dist_bin,cos_sim
0,chr1,ENSG00000278267,MIR6859-1,17436,2,10073,10454,0,-6982,-2,0.609375
1,chr1,ENSG00000278267,MIR6859-1,17436,2,180752,181717,39,163316,37,0.738281
2,chr1,ENSG00000227232,WASH7P,29570,3,10073,10454,0,-19116,-3,0.599121
3,chr1,ENSG00000227232,WASH7P,29570,3,180752,181717,39,151182,36,0.706543
4,chr1,ENSG00000243485,MIR1302-2HG,29554,3,10073,10454,0,-19100,-3,0.599121
...,...,...,...,...,...,...,...,...,...,...,...
83939,chr1,ENSG00000200495,RNU6-1205P,248912795,183959,248872833,248874568,183923,-38227,-36,0.021606
83940,chr1,ENSG00000200495,RNU6-1205P,248912795,183959,248872833,248874568,183924,-38227,-35,0.332031
83941,chr1,ENSG00000200495,RNU6-1205P,248912795,183959,248897546,248898122,183945,-14673,-14,0.868652
83942,chr1,ENSG00000200495,RNU6-1205P,248912795,183959,248905862,248907049,183953,-5746,-6,0.082703
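
For readers unfamiliar with the `cos_sim` score, the sketch below shows how cosine similarity is computed for two vectors with NumPy. The vectors are small made-up examples, not ChromBERT embeddings, and the helper `cosine_similarity` is hypothetical; it is meant only to illustrate the quantity, not how `infer_ep` computes it internally.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for a promoter embedding and three distal-region embeddings
promoter = [1.0, 2.0, 0.0]
same_direction = [2.0, 4.0, 0.0]   # parallel to the promoter vector
orthogonal = [0.0, 0.0, 3.0]       # shares no component with it
opposite = [-1.0, -2.0, 0.0]       # points the other way

for name, vec in [("same_direction", same_direction),
                  ("orthogonal", orthogonal),
                  ("opposite", opposite)]:
    print(name, round(cosine_similarity(promoter, vec), 6))
```

The score depends only on the direction of the two vectors, not on their length, and ranges from -1 to 1.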


In [13]:
tss_region_pairs_cos.query("gene_name == 'RNVU1-15'").sort_values(by='cos_sim',ascending=False)

Unnamed: 0,chrom,gene_id,gene_name,tss,tss_build_region_index,distal_region_start,distal_region_end,distal_region_build_region_index,dist,dist_bin,cos_sim
47075,chr1,ENSG00000207205,RNVU1-15,144412576,98925,144546304,144546997,99004,133728,79,0.966797
47076,chr1,ENSG00000207205,RNVU1-15,144412576,98925,144551352,144552375,99008,138776,83,0.910645
47077,chr1,ENSG00000207205,RNVU1-15,144412576,98925,144560735,144561133,99015,148159,90,0.794922
47072,chr1,ENSG00000207205,RNVU1-15,144412576,98925,144461299,144461938,98949,48723,24,0.699219
47071,chr1,ENSG00000207205,RNVU1-15,144412576,98925,144418804,144419421,98930,6228,5,0.688477
47074,chr1,ENSG00000207205,RNVU1-15,144412576,98925,144523881,144524433,98985,111305,60,0.651367
47078,chr1,ENSG00000207205,RNVU1-15,144412576,98925,144567583,144567937,99021,155007,96,0.431885
47073,chr1,ENSG00000207205,RNVU1-15,144412576,98925,144490250,144490622,98966,77674,41,0.307373
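
In the rows displayed in this notebook, `dist` matches the signed distance from the TSS to the nearest edge of the distal region, and `dist_bin` matches the difference between the two ChromBERT region indices. This is an observation from the output, not a documented guarantee. The sketch below checks it on three rows copied from the tables above; `signed_dist` is a hypothetical helper, and its return value of 0 for a region containing the TSS is an assumption.

```python
import pandas as pd

# Three rows copied from the outputs above (one upstream, two downstream of the TSS)
rows = pd.DataFrame({
    "gene_name": ["MIR6859-1", "RNVU1-15", "RNVU1-15"],
    "tss": [17436, 144412576, 144412576],
    "tss_build_region_index": [2, 98925, 98925],
    "distal_region_start": [10073, 144546304, 144551352],
    "distal_region_end": [10454, 144546997, 144552375],
    "distal_region_build_region_index": [0, 99004, 99008],
    "dist": [-6982, 133728, 138776],
    "dist_bin": [-2, 79, 83],
})


def signed_dist(row):
    """Signed distance from the TSS to the nearest edge of the distal region."""
    if row["distal_region_end"] < row["tss"]:
        return row["distal_region_end"] - row["tss"]    # region upstream: negative
    if row["distal_region_start"] > row["tss"]:
        return row["distal_region_start"] - row["tss"]  # region downstream: positive
    return 0  # assumption: a region containing the TSS gets distance 0


rows["dist_check"] = rows.apply(signed_dist, axis=1)
rows["dist_bin_check"] = (
    rows["distal_region_build_region_index"] - rows["tss_build_region_index"]
)

print((rows["dist_check"] == rows["dist"]).all())
print((rows["dist_bin_check"] == rows["dist_bin"]).all())
```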


In [14]:
tss_region_pairs_cos.query("gene_name == 'MOB3C'").sort_values(by='cos_sim',ascending=False)

Unnamed: 0,chrom,gene_id,gene_name,tss,tss_build_region_index,distal_region_start,distal_region_end,distal_region_build_region_index,dist,dist_bin,cos_sim
32993,chr1,ENSG00000142961,MOB3C,46616811,40462,46667925,46668704,40512,51114,50,0.97998
32991,chr1,ENSG00000142961,MOB3C,46616811,40462,46603897,46604538,40452,-12273,-10,0.934082
32995,chr1,ENSG00000142961,MOB3C,46616811,40462,46718661,46719445,40562,101850,100,0.915039
32974,chr1,ENSG00000142961,MOB3C,46616811,40462,46394157,46395083,40257,-221728,-205,0.8125
32978,chr1,ENSG00000142961,MOB3C,46616811,40462,46442369,46443232,40300,-173579,-162,0.80957
32983,chr1,ENSG00000142961,MOB3C,46616811,40462,46489219,46490138,40345,-126673,-117,0.723145
32976,chr1,ENSG00000142961,MOB3C,46616811,40462,46406183,46406626,40268,-210185,-194,0.716309
32980,chr1,ENSG00000142961,MOB3C,46616811,40462,46466315,46466979,40323,-149832,-139,0.709473
32987,chr1,ENSG00000142961,MOB3C,46616811,40462,46532951,46533643,40387,-83168,-75,0.672852
32992,chr1,ENSG00000142961,MOB3C,46616811,40462,46613503,46614118,40459,-2693,-3,0.651367
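
The per-gene queries above can be generalized to rank candidates for many genes at once. The following is a minimal sketch on a toy table built from a few rows of the two outputs above. The helper `top_candidates` is hypothetical, and the `min_cos` and `top_n` values are arbitrary illustration choices, not cutoffs recommended by ChromBERT-tools.

```python
import pandas as pd

# A few rows copied from the RNVU1-15 and MOB3C outputs above
pairs = pd.DataFrame({
    "gene_name": ["RNVU1-15", "RNVU1-15", "RNVU1-15", "MOB3C", "MOB3C", "MOB3C"],
    "distal_region_start": [144546304, 144418804, 144490250,
                            46667925, 46603897, 46613503],
    "dist": [133728, 6228, 77674, 51114, -12273, -2693],
    "cos_sim": [0.966797, 0.688477, 0.307373, 0.97998, 0.934082, 0.651367],
})


def top_candidates(df, min_cos=0.7, top_n=2):
    """Keep pairs with cos_sim >= min_cos, then the top_n highest per gene."""
    kept = df[df["cos_sim"] >= min_cos]
    ranked = kept.sort_values(["gene_name", "cos_sim"], ascending=[True, False])
    return ranked.groupby("gene_name", sort=False).head(top_n).reset_index(drop=True)


result = top_candidates(pairs)
print(result)
```

With the full table, the same function could be applied to `tss_region_pairs_cos` directly, since it uses only the `gene_name` and `cos_sim` columns.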
