# Extract cell-type-specific embeddings from fine-tuned ChromBERT.

All commands below fine-tune ChromBERT on cell-type-specific accessibility data (BigWig + peaks) 
and then extract embeddings from the fine-tuned model. If a fine-tuned checkpoint is provided, 
fine-tuning is skipped and embeddings are generated directly from the checkpoint.

- ``embed_cell_cistrome``: extracts cistrome embeddings
- ``embed_cell_gene``: extracts gene embeddings reflecting cell-type-specific regulatory context
- ``embed_cell_region``: extracts region embeddings reflecting cell-type-specific regulatory patterns
- ``embed_cell_regulator``: extracts regulator embeddings

# Example: 
Extract cell-type-specific embeddings for myoblast, including cistrome, gene, region, and regulator embeddings.

``Recommended``: First run infer_cell_trn to generate a fine-tuned checkpoint and infer key regulators/TRNs. Then use that checkpoint to extract cell-type-specific embeddings (cistrome, gene, region, and regulator).

In [1]:
import os
os.chdir("/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli")
os.environ["CUDA_VISIBLE_DEVICES"]='0'

In [2]:
import pickle
import h5py
import pandas as pd
import numpy as np
import glob

In [3]:
region_file = '../data/CTCF_ENCFF664UGR_sample100.bed' # only for example
ft_ckpt_dir = "/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/**/*.ckpt" # if you have already run infer_cell_trn, you can use the fine-tuned checkpoint to extract cell-type-specific embeddings

ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]
ft_ckpt

'/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt'

In [4]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome         Extract cell-specific cistrome embeddings...
  embed_cell_gene             Extract cell-specific gene embeddings
  embed_cell_region           Extract cell-specific region embeddings
  embed_cell_regulator        Extract cell-specific regulator embeddings...
  embed_cistrome              Extract general cistrome embeddings on...
  embed_gene                  Extract general gene embeddings
  embed_region                Extract general region embeddings
  embed_regulator             Extract general regulator embeddings on...
  find_driver_in_dual_region  Find driver factors in dual functional...
  find_driver_in_transition   Find driver factors in cell state transit

## Embed cistromes
Extract cell-specific cistrome embeddings on specified regions

In [5]:
!chrombert-tools embed_cell_cistrome -h

Usage: chrombert-tools embed_cell_cistrome [OPTIONS]

  Extract cell-specific cistrome embeddings on specified regions

Options:
  --region FILE                   Region file where cistrome embeddings will
                                  be computed.  [required]
  --cistrome TEXT                 GSM/ENCODE id or factor:cell, e.g. ENCSR...
                                  or GSM... or ATAC-seq:HEK293T or
                                  BCL11A:GM12878. Use ';' to separate
                                  multiple.  [required]
  --cell-type-bw FILE             Cell type accessibility BigWig file.
                                  Required if --ft-ckpt is not provided.
  --cell-type-peak FILE           Cell type accessibility Peak BED file.
                                  Required if --ft-ckpt is not provided.
  --ft-ckpt FILE                  Fine-tuned ChromBERT checkpoint. If
                                  provided, skip fine-tuning and use this
                              

In [7]:
# # if you not provided checkpoint
# !chrombert-tools embed_cell_cistrome \
#     --cell-type-bw "../data/myoblast_ENCFF149ERN_signal.bigwig" \
#     --cell-type-peak "../data/myoblast_ENCFF647RNC_peak.bed" \
#     --region {region_file} \
#     --cistrome "GSM837613;CTCF:HSMM;h3k27ac:LHCNM2" \
#     --odir "./output_cell_specific_emb_cistrome" \
#     --genome "hg38" \
#     --resolution "1kb" 2> "./tmp/infer_cell_trn.stderr.log" # redirect stderr to log file

In [8]:
# region_file is a bed file
!chrombert-tools embed_cell_cistrome \
    --region {region_file} \
    --ft-ckpt {ft_ckpt} \
    --cistrome "GSM837613;CTCF:HSMM;h3k27ac:LHCNM2" \
    --odir "./output_cell_specific_emb_cistrome" \
    --genome "hg38" \
    --resolution "1kb"

Stage 1: Preparing regions
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions), non-overlapping: 0
Note: All cistrome names were converted to lowercase for matching.
Cistromes count summary - requested: 3, matched in ChromBERT meta: 3, not found: 0, not found cistromes: []
ChromBERT cistromes metas: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.tsv
Found 3 cistromes: ['gsm837613', 'ctcf:hsmm', 'h3k27ac:lhcnm2']
Finished stage 1
Stage 2: Using provided fine-tuned ChromBERT checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt
use organisim hg38; max sequence length is 6391
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55

In [9]:
# cell_specific_mean_cistrome_emb.pkl: one 768-dim vector per cistrome (averaged across all regions)
with open("./output_cell_specific_emb_cistrome/cell_specific_mean_cistrome_emb.pkl", "rb") as f:
    mean_cistrome_emb_dict2 = pickle.load(f)


In [10]:

for key, value in mean_cistrome_emb_dict2.items():
    print(key, value.shape)


gsm837613 (768,)
ctcf:hsmm (768,)
h3k27ac:lhcnm2 (768,)


In [11]:
# cell_specific_cistrome_emb_on_region.hdf5: per-region embeddings; each cistrome has a matrix of shape (N_regions, 768)
# region: [chrom, start, end, build_region_index]
# build_region_index maps to ChromBERT's reference regions
with h5py.File("./output_cell_specific_emb_cistrome/cell_specific_cistrome_emb_on_region.hdf5", "r") as f:
    print(f.keys())
    print(f['emb'].keys())
    emb_key = list(f['emb'].keys())
    region = f['region'][:]
    for key in emb_key:
        emb = f['emb'][key][:]
        print(key, emb.shape)

region[0:10]

<KeysViewHDF5 ['emb', 'region']>
<KeysViewHDF5 ['ctcf:hsmm', 'gsm837613', 'h3k27ac:lhcnm2']>
ctcf:hsmm (100, 768)
gsm837613 (100, 768)
h3k27ac:lhcnm2 (100, 768)


array([[       1, 37989946, 37990368,    32658],
       [      11,  2400199,  2400617,   289179],
       [      12,  6778809,  6779319,   391108],
       [      12, 52980788, 52981316,   424926],
       [      12, 53676021, 53676448,   425578],
       [      14, 21092401, 21092968,   560876],
       [      14, 23057979, 23058458,   562483],
       [      14, 23120727, 23121190,   562542],
       [      14, 23379895, 23380314,   562781],
       [      14, 23588973, 23589439,   562958]])

In [12]:
# overlap_region.bed: input regions overlapped with ChromBERT's reference regions; contains columns: chrom, start, end, build_region_index
overlap_region = pd.read_csv("./output_cell_specific_emb_cistrome/overlap_region.bed",sep='\t',header=None, names=['chrom','start','end','build_region_index'])
overlap_region.head()

Unnamed: 0,chrom,start,end,build_region_index
0,chr1,37989946,37990368,32658
1,chr11,2400199,2400617,289179
2,chr12,6778809,6779319,391108
3,chr12,52980788,52981316,424926
4,chr12,53676021,53676448,425578


## Embed genes
Extract cell-specific gene embeddings

In [19]:
!chrombert-tools embed_cell_gene -h

Usage: chrombert-tools embed_cell_gene [OPTIONS]

  Extract cell-specific gene embeddings

Options:
  --gene TEXT                     Gene symbols or IDs. e.g.
                                  ENSG00000170921;TANC2;DPYD. Use ';' to
                                  separate multiple genes.  [required]
  --cell-type-bw FILE             Cell type accessibility BigWig file.
  --cell-type-peak FILE           Cell type accessibility Peak BED file.
  --ft-ckpt FILE                  Fine-tuned ChromBERT checkpoint. If
                                  provided, skip fine-tuning and use this
                                  ckpt. If you not provide, you should provide
                                  --cell-type-bw and --cell-type-peak to train
                                  a cell-specific model.
  --odir DIRECTORY                Output directory.  [default: ./output]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb|200bp|2kb|4kb]
                          

In [21]:
! chrombert-tools embed_cell_gene \
    --gene "myod1;myf5;tp53;brd4;ENSG00000009709" \
    --odir "./output_emb_cell_genes" \
    --ft-ckpt {ft_ckpt} \
    --genome "hg38" \
    --resolution "1kb"

Finished stage 1
Stage 2: Using provided fine-tuned ChromBERT checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt
use organisim hg38; max sequence length is 6391
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt
Loading from pl module, remove prefix 'model.'
Loaded 110/110 parameters
Finished stage 2
Stage 3: Computing cell-specific gene embeddings
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Computing region embeddings: 100%|████████████████| 2/2 [00:01<00:00,  1.46it/s]
Finished stage 3

Finished all s

In [22]:
# embs_dict.pkl: gene embed dict
with open("./output_emb_genes/embs_dict.pkl", "rb") as f:
    gene_emb_dict = pickle.load(f)
for key, value in gene_emb_dict.items():
    print(key, value.shape)


ensg00000170921 (768,)
tanc2 (768,)
ensg00000200997 (768,)
dpyd (768,)
snora70 (768,)
tp53 (768,)
brd4 (768,)


## Embed region
Extract cell-specific region embeddings

In [23]:
!chrombert-tools embed_cell_region -h

Usage: chrombert-tools embed_cell_region [OPTIONS]

  Extract cell-specific region embeddings

Options:
  --region FILE                   Region file to compute embeddings for.
                                  [required]
  --cell-type-bw FILE             Cell type accessibility BigWig file.
                                  Required if --ft-ckpt is not provided.
  --cell-type-peak FILE           Cell type accessibility Peak BED file.
                                  Required if --ft-ckpt is not provided.
  --ft-ckpt FILE                  Fine-tuned ChromBERT checkpoint. If
                                  provided, skip fine-tuning and use this
                                  ckpt. If not provided, you must provide
                                  --cell-type-bw and --cell-type-peak to train
                                  a cell-specific model.
  --odir DIRECTORY                Output directory.  [default: ./output]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  

In [24]:
! chrombert-tools embed_cell_region \
    --region {region_file} \
    --ft-ckpt {ft_ckpt} \
    --odir "./output_cell_specific_emb_region" \
    --genome "hg38" \
    --resolution "1kb"

Stage 1: Overlapping focus regions with ChromBERT regions
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions), non-overlapping: 0
Found 100 overlapping regions
Finished stage 1
Stage 2: Using provided fine-tuned ChromBERT checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt
use organisim hg38; max sequence length is 6391
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt
Loading from pl module, remove prefix 'model.'
Loaded 110/110 parameters
Finished stage 2
Stage 3: Computing cell-specific region embeddings
Your supervised_file does not contain the 'label' column. Please v

In [25]:
# cell_specific_overlap_region_emb.npy: one 768-dim vector per region
overlap_region_emb = np.load("./output_cell_specific_emb_region/cell_specific_overlap_region_emb.npy")
print(overlap_region_emb.shape)

# overlap_region.bed: input regions overlapped with ChromBERT's reference regions; contains columns: chrom, start, end, build_region_index

overlap_region = pd.read_csv("./output_emb_region/overlap_region.bed",sep='\t',header=None, names=['chrom','start','end','build_region_index'])
len(overlap_region),overlap_region.head()

(100, 768)


(100,
    chrom     start       end  build_region_index
 0   chr1  37989946  37990368               32658
 1  chr11   2400199   2400617              289179
 2  chr12   6778809   6779319              391108
 3  chr12  52980788  52981316              424926
 4  chr12  53676021  53676448              425578)

## Embed regulator
Extract cell-specific regulator embeddings on specified regions

In [26]:
!chrombert-tools embed_cell_regulator -h

Usage: chrombert-tools embed_cell_regulator [OPTIONS]

  Extract cell-specific regulator embeddings on specified regions

Options:
  --region FILE                   Region file where regulator embeddings will
                                  be computed.  [required]
  --regulator TEXT                Regulators of interest, e.g. EZH2 or
                                  EZH2;BRD4. Use ';' to separate multiple
                                  regulators.  [required]
  --cell-type-bw FILE             Cell type accessibility BigWig file.
                                  Required if --ft-ckpt is not provided.
  --cell-type-peak FILE           Cell type accessibility Peak BED file.
                                  Required if --ft-ckpt is not provided.
  --ft-ckpt FILE                  Fine-tuned ChromBERT checkpoint. If
                                  provided, skip fine-tuning and use this
                                  ckpt. If not provided, you must provide
                     

In [27]:
!chrombert-tools embed_cell_regulator \
    --region {region_file} \
    --ft-ckpt {ft_ckpt} \
    --regulator "EZH2;BRD4;CTCF;FOXA3;myod1;myF5" \
    --odir "./output_cell_specific_emb_regulator" \
    --genome "hg38" \
    --resolution "1kb"
    

Stage 1: Preparing regions and regulators
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions), non-overlapping: 0
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 6, matched in ChromBERT: 5, not found: 1, not found regulator: ['foxa3']
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Found 5 regulators: ['myod1', 'ctcf', 'myf5', 'brd4', 'ezh2']
Finished stage 1
Stage 2: Using provided fine-tuned ChromBERT checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt
use organisim hg38; max sequence length is 6391
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_tr

In [28]:
# cell_specific_mean_regulator_emb.pkl: one 768-dim vector per regulator (averaged across all regions)
with open("./output_cell_specific_emb_regulator/cell_specific_mean_regulator_emb.pkl", "rb") as f:
    mean_regulator_emb_dict = pickle.load(f)

for key, value in mean_regulator_emb_dict.items():
    print(key, value.shape)

myod1 (768,)
ctcf (768,)
myf5 (768,)
brd4 (768,)
ezh2 (768,)


In [29]:
# cell_specific_regulator_emb_on_region.hdf5: Per-region embeddings: matrix of shape (N_regions, 768) for each regulator
with h5py.File("./output_cell_specific_emb_regulator/cell_specific_regulator_emb_on_region.hdf5", "r") as f:
    print(f.keys())
    print(f['emb'].keys())
    emb_key = list(f['emb'].keys())
    region = f['region'][:]
    for key in emb_key:
        emb = f['emb'][key][:]
        print(key, emb.shape)

<KeysViewHDF5 ['emb', 'region']>
<KeysViewHDF5 ['brd4', 'ctcf', 'ezh2', 'myf5', 'myod1']>
brd4 (100, 768)
ctcf (100, 768)
ezh2 (100, 768)
myf5 (100, 768)
myod1 (100, 768)
