# Extract general embeddings from ChromBERT

All commands below extract general embeddings from the pre-trained ChromBERT model.

- ``embed_cistrome``: extracts general cistrome embeddings on specified regions
- ``embed_gene``: extracts general gene embeddings
- ``embed_region``: extracts general region embeddings
- ``embed_regulator``: extracts general regulator embeddings on specified regions


**Note**: The remaining examples show only the direct command usage.

To run the commands inside a Singularity container, see the [`singularity_use.ipynb`](./singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.


In [1]:
import os
workdir="/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli"
os.chdir(workdir)
os.environ["CUDA_VISIBLE_DEVICES"] = "0" # gpu device

In [2]:
import pickle
import h5py
import pandas as pd
import numpy as np

In [3]:
region_file = '../data/CTCF_ENCFF664UGR_sample100.bed'

In [4]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome             Extract cell-specific cistrome...
  embed_cell_gene                 Extract cell-specific gene embeddings
  embed_cell_region               Extract cell-specific region embeddings
  embed_cell_regulator            Extract cell-specific regulator...
  embed_cistrome                  Extract general cistrome embeddings on...
  embed_gene                      Extract general gene embeddings
  embed_region                    Extract general region embeddings
  embed_regulator                 Extract general regulator embeddings on...
  find_context_specific_cofactor  Find context-specific cofactors in...
  find_driver_in_transition       Find driver factors in cell

## Embed cistromes
Extract general cistrome embeddings on specified regions

In [5]:
!chrombert-tools embed_cistrome -h

Usage: chrombert-tools embed_cistrome [OPTIONS]

  Extract general cistrome embeddings on specified regions

Options:
  --region FILE                   Region file.  [required]
  --cistrome TEXT                 GSM/ENCODE id or factor:cell, e.g. ENCSR...
                                  or GSM... or ATAC-seq:HEK293T or
                                  BCL11A:GM12878. Use ';' to separate
                                  multiple.  [required]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --oname TEXT                    Output name of the cistrome embeddings.
                                  [default: cistrome_emb]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb|200bp|2kb|4kb]
                                  Resolution.  [default: 1kb]
  --batch-size INTEGER            Batch size.  [default: 64]
  --num-workers INTEGER           Dataloader workers.  [default: 8]
  --chrombert-cache-dir DIRECTORY
                            

In [6]:
# region_file is a BED file
!chrombert-tools embed_cistrome \
    --region {region_file} \
    --cistrome "ENCSR440VKE_2;GSM1208591;ATAC-seq:HEK293T;BCL11A:GM12878" \
    --odir "./output_emb_cistrome" \
    --genome "hg38" \
    --resolution "1kb"


Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, We keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region),non-overlapping: 0
Note: All cistrome names were converted to lowercase for matching.
Cistromes count summary - requested: 4, matched in ChromBERT meta: 4, not found: 0, not found cistromes: []
ChromBERT cistromes metas: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.tsv
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
use organisim hg38; max sequence length is 6391
100%|█████████████████████████████████████████████| 2/2 [00:02<00:00,  1.30s/it]
Finished!
Saved mean ci
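
The region summary in the log states that an overlap is kept when it covers at least 50% of either the ChromBERT bin or the input region. The snippet below is a minimal sketch of that rule as worded in the log, not the tool's actual implementation. The function names, the half-open interval convention, and the two 1 kb bin coordinates are assumptions made for illustration.

```python
def overlap_fraction(region, chrombert_bin):
    """Return (fraction of the input region covered, fraction of the bin covered).

    Both arguments are (start, end) pairs, treated as half-open intervals.
    """
    r_start, r_end = region
    b_start, b_end = chrombert_bin
    overlap = max(0, min(r_end, b_end) - max(r_start, b_start))
    return overlap / (r_end - r_start), overlap / (b_end - b_start)


def keeps_overlap(region, chrombert_bin, min_fraction=0.5):
    # Assumed rule: keep the pair if either side is covered by at least min_fraction.
    region_frac, bin_frac = overlap_fraction(region, chrombert_bin)
    return region_frac >= min_fraction or bin_frac >= min_fraction


# First region of the sample data against two hypothetical 1 kb bins.
region = (37989946, 37990368)
left_bin = (37989000, 37990000)    # hypothetical bin coordinates
right_bin = (37990000, 37991000)   # hypothetical bin coordinates
print(keeps_overlap(region, left_bin), keeps_overlap(region, right_bin))
```

Under these assumptions only the second bin is kept, because it covers most of the 422 bp input region while the first bin covers 54 bp of it.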

In [7]:
region_file2 = "../data/CTCF_ENCFF664UGR_sample100.csv"

In [8]:
# region_file2 is a CSV file with columns: chrom, start, end
!chrombert-tools embed_cistrome \
    --region {region_file2} \
    --cistrome "ENCSR440VKE_2;GSM1208591;ATAC-seq:HEK293T;BCL11A:GM12878" \
    --odir "./output_emb_cistrome2" \
    --genome "hg38" \
    --resolution "1kb"

Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, We keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region),non-overlapping: 0
Note: All cistrome names were converted to lowercase for matching.
Cistromes count summary - requested: 4, matched in ChromBERT meta: 4, not found: 0, not found cistromes: []
ChromBERT cistromes metas: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.tsv
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
use organisim hg38; max sequence length is 6391
100%|█████████████████████████████████████████████| 2/2 [00:02<00:00,  1.32s/it]
Finished!
Saved mean ci
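
If you only have a BED file and want the CSV layout used above (columns `chrom`, `start`, `end`), a conversion could look like the sketch below. The helper name `bed_to_csv` is hypothetical, and the presence of a header row in the CSV is an assumption based on the column names mentioned in the cell comment.

```python
import csv
import os
import tempfile


def bed_to_csv(bed_path, csv_path):
    """Copy the first three BED columns into a CSV with a chrom,start,end header."""
    with open(bed_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["chrom", "start", "end"])
        for line in src:
            # Skip blank lines and BED track/browser/comment lines.
            if not line.strip() or line.startswith(("track", "browser", "#")):
                continue
            chrom, start, end = line.split()[:3]
            writer.writerow([chrom, int(start), int(end)])


# Small demonstration on a temporary file with two regions from the sample data.
with tempfile.TemporaryDirectory() as tmp:
    bed = os.path.join(tmp, "demo.bed")
    out = os.path.join(tmp, "demo.csv")
    with open(bed, "w") as f:
        f.write("chr1\t37989946\t37990368\nchr11\t2400199\t2400617\n")
    bed_to_csv(bed, out)
    with open(out) as f:
        csv_text = f.read()
print(csv_text)
```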

In [9]:
# mean_cistrome_emb.pkl: one 768-dim vector per cistrome (averaged across all regions)
with open("./output_emb_cistrome2/mean_cistrome_emb.pkl", "rb") as f:
    mean_cistrome_emb_dict2 = pickle.load(f)


In [10]:

for key, value in mean_cistrome_emb_dict2.items():
    print(key, value.shape)


encsr440vke_2 (768,)
gsm1208591 (768,)
atac-seq:hek293t (768,)
bcl11a:gm12878 (768,)


In [11]:
with open("./output_emb_cistrome/mean_cistrome_emb.pkl", "rb") as f:
    mean_cistrome_emb_dict1 = pickle.load(f)


for key, value in mean_cistrome_emb_dict1.items():
    print(key, value.shape)

encsr440vke_2 (768,)
gsm1208591 (768,)
atac-seq:hek293t (768,)
bcl11a:gm12878 (768,)


In [12]:
for key, value in mean_cistrome_emb_dict1.items():
    assert (value == mean_cistrome_emb_dict2[key]).all()
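
Beyond checking that the two runs agree, the mean embeddings can be compared with each other. The sketch below computes pairwise cosine similarity for a dictionary laid out like `mean_cistrome_emb_dict1` (name mapped to a 768-dim vector). The function name is hypothetical and the vectors are random stand-ins, so the printed numbers carry no biological meaning.

```python
import numpy as np


def cosine_similarity_table(emb_dict):
    """Return (names, matrix) of pairwise cosine similarities between embeddings."""
    names = list(emb_dict)
    mat = np.stack([np.asarray(emb_dict[n], dtype=float) for n in names])
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return names, unit @ unit.T


# Random stand-in data; replace `demo` with the dictionary loaded from the pickle.
rng = np.random.default_rng(0)
demo = {
    name: rng.normal(size=768)
    for name in ["encsr440vke_2", "gsm1208591", "atac-seq:hek293t"]
}
names, sim = cosine_similarity_table(demo)
print(names)
print(sim.round(3))
```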






In [13]:
# cistrome_emb_on_region.hdf5: per-region embeddings; each cistrome has a matrix of shape (N_regions, 768)
# region: [chrom, start, end, build_region_index]
# build_region_index maps to ChromBERT's reference regions
with h5py.File("./output_emb_cistrome2/cistrome_emb_on_region.hdf5", "r") as f:
    print(f.keys())
    print(f['emb'].keys())
    emb_key = list(f['emb'].keys())
    region = f['region'][:]
    for key in emb_key:
        emb = f['emb'][key][:]
        print(key, emb.shape)

region[0:10]


<KeysViewHDF5 ['emb', 'region']>
<KeysViewHDF5 ['atac-seq:hek293t', 'bcl11a:gm12878', 'encsr440vke_2', 'gsm1208591']>
atac-seq:hek293t (100, 768)
bcl11a:gm12878 (100, 768)
encsr440vke_2 (100, 768)
gsm1208591 (100, 768)


array([[       1, 37989946, 37990368,    32658],
       [      11,  2400199,  2400617,   289179],
       [      12,  6778809,  6779319,   391108],
       [      12, 52980788, 52981316,   424926],
       [      12, 53676021, 53676448,   425578],
       [      14, 21092401, 21092968,   560876],
       [      14, 23057979, 23058458,   562483],
       [      14, 23120727, 23121190,   562542],
       [      14, 23379895, 23380314,   562781],
       [      14, 23588973, 23589439,   562958]])

In [14]:
# overlap_region.bed: input regions overlapped with ChromBERT's reference regions; contains columns: chrom, start, end, build_region_index
overlap_region = pd.read_csv("./output_emb_cistrome2/overlap_region.bed",sep='\t',header=None, names=['chrom','start','end','build_region_index'])
overlap_region.head()

   chrom     start       end  build_region_index
0   chr1  37989946  37990368               32658
1  chr11   2400199   2400617              289179
2  chr12   6778809   6779319              391108
3  chr12  52980788  52981316              424926
4  chr12  53676021  53676448              425578


In [15]:

assert (overlap_region['build_region_index'].values == region[:,-1]).all()
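
Since the last column of `region` matches `build_region_index` in `overlap_region.bed`, it can serve as a lookup key into the embedding matrices. The sketch below assumes that row `i` of each embedding matrix corresponds to row `i` of `region`, which the file layout suggests but this tutorial does not state outright. The small 4-column `emb` array is a stand-in for a real `(N_regions, 768)` matrix, and `embedding_for` is a hypothetical helper.

```python
import numpy as np

# First rows of the `region` array printed above: chrom, start, end, build_region_index.
region = np.array([
    [1, 37989946, 37990368, 32658],
    [11, 2400199, 2400617, 289179],
    [12, 6778809, 6779319, 391108],
])
# Stand-in for f['emb'][key][:]; real matrices have shape (N_regions, 768).
emb = np.arange(region.shape[0] * 4, dtype=float).reshape(region.shape[0], 4)

# Map each ChromBERT region index to its row position in the embedding matrix.
row_of_index = {int(idx): row for row, idx in enumerate(region[:, -1])}


def embedding_for(build_region_index):
    """Return the embedding row for one ChromBERT region index (assumed row alignment)."""
    return emb[row_of_index[build_region_index]]


print(embedding_for(289179))
```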

## Embed genes
Extract general gene embeddings

In [16]:
!chrombert-tools embed_gene -h

Usage: chrombert-tools embed_gene [OPTIONS]

  Extract general gene embeddings

Options:
  --gene TEXT                     Gene symbols or IDs. e.g.
                                  ENSG00000170921;TANC2;DPYD. Use ';' to
                                  separate multiple genes.  [required]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --oname TEXT                    Output name of the gene embeddings.
                                  [default: gene_emb]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb|200bp|2kb|4kb]
                                  Resolution. Mouse only supports 1kb
                                  resolution.  [default: 1kb]
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache dir.   [default:
                                  ~/.cache/chrombert/data]
  --chrombert-region-file FILE    ChromBERT region BED file. If not provided,
                                  use the defa

In [17]:
! chrombert-tools embed_gene \
    --gene "ENSG00000170921;TANC2;ENSG00000200997;DPYD;SNORA70;tp53;brd4" \
    --odir "./output_emb_genes" \
    --genome "hg38" \
    --resolution "1kb"

Finished!
Note: All gene names were converted to lowercase for matching.
Gene count summary - requested: 7, matched: 7, not found: 0
Gene meta file: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/anno/hg38_1kb_gene_meta.tsv
Region embedding source: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/anno/hg38_1kb_region_emb.npy
Gene embeddings saved to: ./output_emb_genes/gene_emb.pkl
Matched gene meta saved to: ./output_emb_genes/overlap_genes_meta.tsv


In [18]:
# gene_emb.pkl: dict mapping each gene to a 768-dim embedding (file name as reported in the log above)
with open("./output_emb_genes/gene_emb.pkl", "rb") as f:
    gene_emb_dict = pickle.load(f)
for key, value in gene_emb_dict.items():
    print(key, value.shape)


ensg00000170921 (768,)
tanc2 (768,)
ensg00000200997 (768,)
dpyd (768,)
snora70 (768,)
tp53 (768,)
brd4 (768,)


In [19]:
# overlap_genes_meta.tsv: Input genes whose TSS overlapped with ChromBERT's reference regions; 
# contains columns: chrom, loc1 (gene start), loc2 (gene end), strand, tss (transcription start site), gene_id, gene_name, gene_biotype, start (ChromBERT region), end (ChromBERT region), build_region_index (ChromBERT region index)
overlap_genes_meta = pd.read_csv("./output_emb_genes/overlap_genes_meta.tsv", sep="\t")
overlap_genes_meta.head()

   chrom       loc1       loc2  strand        tss          gene_id  gene_name    gene_biotype      start        end  build_region_index
0   chr1   97077743   97995000       -   97995000  ensg00000188641       dpyd  protein_coding   97995000   97996000               81283
1   chr1   12221148   12221271       -   12221271  ensg00000252969    snora70          snoRNA   12221000   12222000               10149
2   chr1  202527310  202527427       -  202527427  ensg00000253042    snora70          snoRNA  202527000  202528000              144650
3   chr3  108574565  108574698       +  108574565  ensg00000202379    snora70          snoRNA  108574000  108575000             1283102
4   chr5   88382772   88382907       +   88382772  ensg00000206958    snora70          snoRNA   88382000   88383000             1545129
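
The preview above shows that one gene symbol can map to several Ensembl IDs (here `snora70`), each with its own TSS and ChromBERT region. Before interpreting a symbol-level embedding, it can help to list such symbols. The sketch below does this with the rows copied from the preview; how `chrombert-tools` combines multiple loci for one symbol is not shown in this tutorial, so the snippet only counts them.

```python
from collections import Counter

# (gene_id, gene_name) pairs copied from the overlap_genes_meta.tsv preview above.
rows = [
    ("ensg00000188641", "dpyd"),
    ("ensg00000252969", "snora70"),
    ("ensg00000253042", "snora70"),
    ("ensg00000202379", "snora70"),
    ("ensg00000206958", "snora70"),
]

# Count how many gene IDs share each gene name.
ids_per_name = Counter(name for _, name in rows)
ambiguous = sorted(name for name, n in ids_per_name.items() if n > 1)
print(ambiguous)
```

With the full table loaded in pandas, `overlap_genes_meta.groupby("gene_name")["gene_id"].nunique()` gives the same counts.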


## Embed regions
Extract general region embeddings

In [20]:
!chrombert-tools embed_region -h

Usage: chrombert-tools embed_region [OPTIONS]

  Extract general region embeddings

Options:
  --region FILE                   Region file.  [required]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --oname TEXT                    Output name of the region embeddings.
                                  [default: region_emb]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb|200bp|2kb|4kb]
                                  Resolution. Mouse only supports 1kb
                                  resolution.  [default: 1kb]
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache dir.   [default:
                                  ~/.cache/chrombert/data]
  --chrombert-region-file FILE    ChromBERT region BED file. If not provided,
                                  use the default hg38_6k_1kb_region.bed in
                                  the cache dir.
  --chrombert-region-emb-file FILE
                       

In [21]:
!chrombert-tools embed_region \
    --region {region_file} \
    --odir './output_emb_region' \
    --genome "hg38" \
    --resolution "1kb"
    

Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, We keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region),non-overlapping: 0
Finished!
Focus region summary - total: 100, overlapping with ChromBERT: 100, It is possible for a single region to overlap multiple ChromBERT regions,non-overlapping: 0
Overlapping focus regions BED file: ./output_emb_region/overlap_region.bed
Non-overlapping focus regions BED file: ./output_emb_region/no_overlap_region.bed
Overlapping focus region embeddings saved to: ./output_emb_region/region_emb.npy


In [22]:
# region_emb.npy: one 768-dim vector per region (file name as reported in the log above)
overlap_region_emb = np.load("./output_emb_region/region_emb.npy")
print(overlap_region_emb.shape)

# overlap_region.bed: input regions overlapped with ChromBERT's reference regions; contains columns: chrom, start, end, build_region_index

overlap_region = pd.read_csv("./output_emb_region/overlap_region.bed",sep='\t',header=None, names=['chrom','start','end','build_region_index'])
len(overlap_region),overlap_region.head()

(100, 768)


(100,
    chrom     start       end  build_region_index
 0   chr1  37989946  37990368               32658
 1  chr11   2400199   2400617              289179
 2  chr12   6778809   6779319              391108
 3  chr12  52980788  52981316              424926
 4  chr12  53676021  53676448              425578)
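
The embedding matrix and the BED table both have 100 rows, so a natural next step is to select embeddings by genomic location. The sketch below assumes that row `i` of the matrix belongs to row `i` of `overlap_region.bed`; this alignment is an assumption and not confirmed by the output above. The arrays are stand-ins with the same layout as the tutorial objects.

```python
import numpy as np

# Stand-ins: in the tutorial these come from np.load(...) and overlap_region['chrom'].values.
chroms = np.array(["chr1", "chr11", "chr12", "chr12", "chr12"])
region_emb = np.random.default_rng(0).normal(size=(len(chroms), 768))

# Assumption: row i of the embedding matrix belongs to row i of the BED file.
mask = chroms == "chr12"
chr12_emb = region_emb[mask]          # embeddings of all chr12 regions
chr12_mean = chr12_emb.mean(axis=0)   # one averaged 768-dim vector
print(chr12_emb.shape, chr12_mean.shape)
```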

## Embed regulators
Extract general regulator embeddings on specified regions

In [23]:
!chrombert-tools embed_regulator -h

Usage: chrombert-tools embed_regulator [OPTIONS]

  Extract general regulator embeddings on specified regions

Options:
  --region FILE                   Region file.  [required]
  --regulator TEXT                Regulators of interest, e.g. EZH2 or
                                  EZH2;BRD4. Use ';' to separate multiple
                                  regulators.  [required]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --oname TEXT                    Output name of the regulator embeddings.
                                  [default: regulator_emb]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb|200bp|2kb|4kb]
                                  Resolution.  [default: 1kb]
  --batch-size INTEGER            Batch size.  [default: 64]
  --num-workers INTEGER           Dataloader workers.  [default: 8]
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache dir (contains config/
                    

In [24]:
!chrombert-tools embed_regulator \
    --region {region_file} \
    --regulator "EZH2;BRD4;CTCF;FOXA3;myod1;myF5" \
    --odir "./output_emb_regulator_1kb" \
    --genome "hg38" \
    --resolution "1kb"

Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, We keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region),non-overlapping: 0
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 6, matched in ChromBERT: 5, not found: 1, not found regulator: ['foxa3']
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
use organisim hg38; max sequence length is 6391
100%|█████████████████████████████████████████████| 2/2 [00:02<00:00,  1.24s/it]
Finished!
Save
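
The log reports that names are lowercased for matching and that `foxa3` was not found. A pre-check along the following lines can catch missing regulators before launching a run. It is a sketch: `split_by_availability` is a hypothetical helper, and the `available` list is a hand-written subset, since the layout of `hg38_6k_regulators_list.txt` is not shown in this tutorial.

```python
def split_by_availability(requested, available):
    """Split a ';'-separated request string into (found, missing), case-insensitively."""
    known = {name.strip().lower() for name in available}
    wanted = [name.strip().lower() for name in requested.split(";") if name.strip()]
    found = [n for n in wanted if n in known]
    missing = [n for n in wanted if n not in known]
    return found, missing


# Hypothetical subset of the regulator list, taken from the matched names in the log.
available = ["ctcf", "brd4", "myod1", "myf5", "ezh2"]
found, missing = split_by_availability("EZH2;BRD4;CTCF;FOXA3;myod1;myF5", available)
print(found, missing)
```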

In [25]:
# mean_regulator_emb.pkl: one 768-dim vector per regulator (averaged across all regions)
with open("./output_emb_regulator_1kb/mean_regulator_emb.pkl", "rb") as f:
    mean_regulator_emb_dict = pickle.load(f)

for key, value in mean_regulator_emb_dict.items():
    print(key, value.shape)


ctcf (768,)
brd4 (768,)
myod1 (768,)
myf5 (768,)
ezh2 (768,)


In [26]:
# regulator_emb_on_region.hdf5: per-region embeddings; each regulator has a matrix of shape (N_regions, 768)
with h5py.File("./output_emb_regulator_1kb/regulator_emb_on_region.hdf5", "r") as f:
    print(f.keys())
    print(f['emb'].keys())
    emb_key = list(f['emb'].keys())
    region = f['region'][:]
    for key in emb_key:
        emb = f['emb'][key][:]
        print(key, emb.shape)

<KeysViewHDF5 ['emb', 'region']>
<KeysViewHDF5 ['brd4', 'ctcf', 'ezh2', 'myf5', 'myod1']>
brd4 (100, 768)
ctcf (100, 768)
ezh2 (100, 768)
myf5 (100, 768)
myod1 (100, 768)


In [27]:
region[0:10]

array([[       1, 37989946, 37990368,    32658],
       [      11,  2400199,  2400617,   289179],
       [      12,  6778809,  6779319,   391108],
       [      12, 52980788, 52981316,   424926],
       [      12, 53676021, 53676448,   425578],
       [      14, 21092401, 21092968,   560876],
       [      14, 23057979, 23058458,   562483],
       [      14, 23120727, 23121190,   562542],
       [      14, 23379895, 23380314,   562781],
       [      14, 23588973, 23589439,   562958]])

In [28]:
# overlap_region.bed: input regions overlapped with ChromBERT's reference regions; contains columns: chrom, start, end, build_region_index
overlap_region = pd.read_csv("./output_emb_regulator_1kb/overlap_region.bed",sep='\t',header=None, names=['chrom','start','end','build_region_index'])
overlap_region.head()

assert (overlap_region['build_region_index'].values == region[:,-1]).all()

In [29]:
!chrombert-tools embed_regulator \
    --region {region_file} \
    --regulator "EZH2;BRD4;CTCF;FOXA3;myod1;myF5" \
    --odir "./output_emb_regulator_200bp" \
    --genome "hg38" \
    --resolution "200bp"

Region summary - total: 100, overlapping with ChromBERT: 227 (one region may overlap multiple ChromBERT regions, We keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region),non-overlapping: 99
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 6, matched in ChromBERT: 5, not found: 1, not found regulator: ['foxa3']
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
use organisim hg38; max sequence length is 6391
100%|█████████████████████████████████████████████| 4/4 [00:04<00:00,  1.19s/it]
Finished!
Sav
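
At 200bp resolution the log reports 227 overlapping ChromBERT regions for 100 input regions, so one input region can contribute several embedding rows. If a single vector per input region is needed, the rows can be averaged. The sketch below assumes the HDF5 `region` array repeats the input `chrom, start, end` for every overlapping bin, as the 1kb output suggests; this layout is an assumption, the helper name is hypothetical, and the demo values are made up.

```python
import numpy as np


def mean_per_input_region(region, emb):
    """Average embedding rows that share the same (chrom, start, end)."""
    coords, inverse = np.unique(region[:, :3], axis=0, return_inverse=True)
    inverse = np.asarray(inverse).reshape(-1)   # flat group id per row
    sums = np.zeros((len(coords), emb.shape[1]))
    np.add.at(sums, inverse, emb)               # accumulate rows per group
    counts = np.bincount(inverse, minlength=len(coords))
    return coords, sums / counts[:, None]


# Made-up example: the first input region overlaps two bins, the second overlaps one.
region = np.array([
    [1, 100, 500, 10],
    [1, 100, 500, 11],
    [2, 50, 250, 20],
])
emb = np.array([[1.0, 3.0], [3.0, 5.0], [10.0, 20.0]])
coords, mean_emb = mean_per_input_region(region, emb)
print(coords)
print(mean_emb)
```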