# Infer cell-type-specific key regulators

The ``infer_cell_key_regulator`` command fine-tunes ChromBERT on cell-type-specific accessibility data (BigWig + peaks) and then infers cell-type-specific key regulators. If a fine-tuned checkpoint is provided, fine-tuning is skipped and the key regulators are inferred directly from the checkpoint.

**Note**: The remaining examples show only the direct command usage.

If you need to use a Singularity container, refer to the [`singularity_use.ipynb`](singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.

## Example: infer cell-type-specific key regulators for myoblast

In [17]:
import pandas as pd
import numpy as np
import os
workdir="/mnt/Storage2/home/chenqianqian/projects/chrombert_tools/2.test/pull/ChromBERT-tools/examples/cli" # replace with your own working directory
os.chdir(workdir)

In [2]:
os.environ["CUDA_VISIBLE_DEVICES"]='0'

In [3]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome             Extract cell-specific cistrome...
  embed_cell_gene                 Extract cell-specific gene embeddings
  embed_cell_region               Extract cell-specific region embeddings
  embed_cell_regulator            Extract cell-specific regulator...
  embed_cistrome                  Extract general cistrome embeddings on...
  embed_gene                      Extract general gene embeddings
  embed_region                    Extract general region embeddings
  embed_regulator                 Extract general regulator embeddings on...
  find_context_specific_cofactor  Find context-specific cofactors in...
  find_driver_in_transition       Find driver factors in cell

In [4]:
!chrombert-tools infer_cell_key_regulator -h

Usage: chrombert-tools infer_cell_key_regulator [OPTIONS]

  Infer cell-specific key regulators

Options:
  --cell-type-bw FILE             Cell type accessibility BigWig file.
                                  [required]
  --cell-type-peak FILE           Cell type accessibility Peak BED file.
                                  [required]
  --ft-ckpt FILE                  Fine-tuned ChromBERT checkpoint. If
                                  provided, skip fine-tuning and use this
                                  ckpt.
  --genome [hg38|mm10]            Reference genome (hg38 or mm10).  [default:
                                  hg38]
  --resolution [200bp|1kb|2kb|4kb]
                                  ChromBERT resolution.  [default: 1kb]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --mode [fast|full]              Fast: downsample regions to 20k for
                                  training; Full: use all regions.  [default:
                              

#### Download the myoblast BigWig and peak files from ENCODE

In [12]:
# Download the myoblast peak file (ENCODE accession ENCFF647RNC) if it is not already present
import subprocess
if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
    cmd = 'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
    subprocess.run(cmd, shell=True)
    # Decompress to ../data/myoblast_ENCFF647RNC_peak.bed
    cmd = "gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
    subprocess.run(cmd, shell=True)

--2026-02-11 00:27:39--  https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz
Resolving www.encodeproject.org (www.encodeproject.org)... 34.211.244.144
Connecting to www.encodeproject.org (www.encodeproject.org)|34.211.244.144|:443... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://encode-public.s3.amazonaws.com/2020/11/18/eec1946c-813d-4b8d-8c71-78fd6655bcb6/ENCFF647RNC.bed.gz?response-content-disposition=attachment%3B%20filename%3DENCFF647RNC.bed.gz&AWSAccessKeyId=ASIATGZNGCNX2CRFPU3I&Signature=CfPaWF%2BdObt8fWkgXvsDNyzVMMY%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEOD%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLXdlc3QtMiJHMEUCIBaY3pDkjQeXFBVCC4cZAQ5JZ1FybAwSZOGxmbBLeofLAiEA25CdOKGBbZaqQcmKpdwyYagSl8sSXIRYvaJxqmH01OoqvAUIqf%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FARAAGgwyMjA3NDg3MTQ4NjMiDKVewssq%2F6jJQBSVTSqQBYWoIZ8zcDhW0c4ZTCOgEiJCtkSaM3za%2B18H1ORDJbFIYf0RpfKMQt4XoK1DIN0l6UexMkZ9J7rLjCGa5HZMayYDTjRRDTbBV0wXfVcR4j5BXSBktV%2FGrKJsuUUO3

In [6]:
# Download the myoblast signal BigWig (ENCODE accession ENCFF149ERN) if it is not already present
import subprocess
if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
    cmd = 'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
    subprocess.run(cmd, shell=True)
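
Before starting the run, it can help to confirm that the downloaded peak file parses as expected. The sketch below assumes a tab-separated BED file with chromosome, start, and end in the first three columns (the usual BED layout, not verified here for this particular ENCODE file). `summarize_bed` is a hypothetical helper and is not part of `chrombert-tools`.

```python
import statistics

def summarize_bed(path):
    """Return simple statistics for a tab-separated BED-like peak file."""
    widths = []
    chroms = set()
    with open(path) as handle:
        for line in handle:
            # Skip blank lines and optional header/comment lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            # Only the first three columns are used; extra columns are ignored.
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            chroms.add(chrom)
            widths.append(int(end) - int(start))
    return {
        "n_peaks": len(widths),
        "n_chroms": len(chroms),
        "median_width": statistics.median(widths) if widths else None,
    }
```

For example, `summarize_bed("../data/myoblast_ENCFF647RNC_peak.bed")` would report how many peaks were read and over how many chromosomes; an empty or very small result suggests the download or decompression did not complete.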

## Run

In [7]:
!mkdir -p ./tmp

In [13]:
# takes approximately 20-60 minutes to run
!chrombert-tools infer_cell_key_regulator \
    --cell-type-bw "../data/myoblast_ENCFF149ERN_signal.bigwig" \
    --cell-type-peak "../data/myoblast_ENCFF647RNC_peak.bed" \
    --odir "./output_infer_cell_key_regulator" \
    --genome "hg38" \
    --resolution "1kb"  2> "./tmp/infer_cell_key_regulator.stderr.log" # redirect stderr to log file
    

Stage 1: Praparing the dataset
Total regions: 324464
Fast mode: downsampling to 20k regions
Finished stage 1
Stage 2: Fine-tuning the model

[Attempt 0/2] seed=55
use organisim hg38; max sequence length is 6391
Epoch 0:  20%|████▍                 | 800/4000 [02:06<08:26,  6.32it/s, v_num=0]
Validation: |                                             | 0/? [00:00<?, ?it/s][A
Validation:   0%|                                       | 0/250 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s][A
Validation DataLoader 0: 100%|████████████████| 250/250 [00:22<00:00, 11.28it/s][A
Epoch 0:  40%|▍| 1600/4000 [04:44<07:06,  5.62it/s, v_num=0, default_validation/[A
Validation: |                                             | 0/? [00:00<?, ?it/s][A
Validation:   0%|                                       | 0/250 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s][A
Validation DataLoader 0: 100%|██████
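
Outside a notebook, the same command can be assembled as an argument list instead of relying on the `!` shell escape. This is a minimal sketch: `build_infer_command` is a hypothetical helper, and it uses only the flags listed in the help output above.

```python
import shlex

def build_infer_command(bw, peak, odir, genome="hg38", resolution="1kb", ft_ckpt=None):
    """Assemble the argument list for `chrombert-tools infer_cell_key_regulator`."""
    cmd = [
        "chrombert-tools", "infer_cell_key_regulator",
        "--cell-type-bw", bw,
        "--cell-type-peak", peak,
        "--odir", odir,
        "--genome", genome,
        "--resolution", resolution,
    ]
    if ft_ckpt is not None:
        # Per the help text, providing a checkpoint makes the tool skip fine-tuning.
        cmd += ["--ft-ckpt", ft_ckpt]
    return cmd

cmd = build_infer_command(
    "../data/myoblast_ENCFF149ERN_signal.bigwig",
    "../data/myoblast_ENCFF647RNC_peak.bed",
    "./output_infer_cell_key_regulator",
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(cmd))
```

The list can then be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems when paths contain spaces or special characters.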

In [18]:
# factor_importance_rank.csv: ranked key regulators for myoblast, with three columns:
#   - factors: regulator names
#   - similarity: cosine similarity of regulator embeddings between highly accessible regions and background regions
#   - rank: importance ranking (in the output below, a lower similarity corresponds to a higher rank)

factor_importance_rank = pd.read_csv("./output_infer_cell_key_regulator/results/factor_importance_rank.csv")
factor_importance_rank.head(n=25)


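| Unnamed: 0 | factors | similarity | rank |
|---|---|---|---|
| 0 | yap1 | 0.219407 | 1 |
| 1 | myf5 | 0.243851 | 2 |
| 2 | cbx6 | 0.251597 | 3 |
| 3 | tead1 | 0.254286 | 4 |
| 4 | tcf21 | 0.27184 | 5 |
| 5 | myod1 | 0.299275 | 6 |
| 6 | ring1 | 0.308147 | 7 |
| 7 | myog | 0.326159 | 8 |
| 8 | rb1 | 0.360637 | 9 |
| 9 | kdm6b | 0.365583 | 10 |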
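
To check where particular factors of interest fall in the ranking, a small lookup can help. The sketch assumes the `factors`, `similarity`, and `rank` columns shown above. `lookup_factors` is a hypothetical helper, and the inline rows simply copy values from the output above so that the block runs without the CSV file.

```python
import pandas as pd

def lookup_factors(ranking, names):
    """Return the rows for the requested factors, ordered by rank."""
    # Factor names appear in lowercase in the output, so match case-insensitively.
    wanted = {name.lower() for name in names}
    hits = ranking[ranking["factors"].str.lower().isin(wanted)]
    return hits.sort_values("rank").reset_index(drop=True)

# Values copied from the table above, for illustration only.
ranking = pd.DataFrame({
    "factors": ["yap1", "myf5", "cbx6", "tead1", "tcf21", "myod1"],
    "similarity": [0.219407, 0.243851, 0.251597, 0.254286, 0.271840, 0.299275],
    "rank": [1, 2, 3, 4, 5, 6],
})
print(lookup_factors(ranking, ["MYOD1", "MYF5"]))
```

With the real results, pass the `factor_importance_rank` DataFrame loaded above instead of the inline example.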


### Load the fine-tuned checkpoint to infer key regulators for myoblast (skip fine-tuning)

In [9]:
import glob

In [19]:
# If you have already run infer_cell_key_regulator, you can reuse its fine-tuned checkpoint to infer cell-type-specific key regulators without fine-tuning again
ft_ckpt_dir = "./output_infer_cell_key_regulator/train/**/*.ckpt"

ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]

In [13]:
ft_ckpt

'./output_infer_cell_key_regulator/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt'
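
Taking the first glob match works when exactly one checkpoint exists, but the training log suggests that several attempts (and therefore several checkpoints) are possible, and `[0]` raises an unhelpful `IndexError` when nothing matches. A slightly more defensive sketch follows. `find_latest_checkpoint` is a hypothetical helper, and choosing the most recently modified file is an assumption, not the tool's own selection rule.

```python
import glob
import os

def find_latest_checkpoint(train_dir):
    """Return the most recently modified .ckpt file under train_dir."""
    pattern = os.path.join(train_dir, "**", "*.ckpt")
    candidates = glob.glob(pattern, recursive=True)
    if not candidates:
        raise FileNotFoundError(f"No .ckpt files found under {train_dir!r}")
    # Assumption: the newest file is the one to reuse; adjust if you prefer another attempt.
    return max(candidates, key=os.path.getmtime)
```

For the run above, the call would be `find_latest_checkpoint("./output_infer_cell_key_regulator/train")`.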

In [14]:
# takes approximately 3-5 minutes to run
!chrombert-tools infer_cell_key_regulator \
    --cell-type-bw "../data/myoblast_ENCFF149ERN_signal.bigwig" \
    --cell-type-peak "../data/myoblast_ENCFF647RNC_peak.bed" \
    --ft-ckpt {ft_ckpt} \
    --odir "./output_infer_cell_key_regulator_load_cpkt" \
    --genome "hg38" \
    --resolution "1kb"  2> "./tmp/infer_cell_trn.stderr2.log" # redirect stderr to log file

Stage 1: Praparing the dataset
Total regions: 324464
Fast mode: downsampling to 20k regions
Finished stage 1
Use fine-tuned ChromBERT checkpoint file: ./output_infer_cell_key_regulator/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt to infer cell-specific trn
use organisim hg38; max sequence length is 6391
Loading checkpoint from ./output_infer_cell_key_regulator/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt
Loading from pl module, remove prefix 'model.'
Loaded 110/110 parameters
Finished stage 2
Stage 3: generate regulator embedding on different activity regions
Finished stage 3
Stage 4: find key regulator
Finished stage 4: identify cell-specific key regulators (top 25)
        factors  similarity  rank
0          yap1    0.219407     1
1          myf5    0.243851     2
2          cbx6    0.251597     3
3         tead1    0.254286     4
4         tcf21    0.271840     5
5         myod1    0

In [16]:
factor_importance_rank = pd.read_csv("./output_infer_cell_key_regulator_load_cpkt/results/factor_importance_rank.csv")
factor_importance_rank.head(n=25)

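| Unnamed: 0 | factors | similarity | rank |
|---|---|---|---|
| 0 | yap1 | 0.219407 | 1 |
| 1 | myf5 | 0.243851 | 2 |
| 2 | cbx6 | 0.251597 | 3 |
| 3 | tead1 | 0.254286 | 4 |
| 4 | tcf21 | 0.27184 | 5 |
| 5 | myod1 | 0.299275 | 6 |
| 6 | ring1 | 0.308147 | 7 |
| 7 | myog | 0.326159 | 8 |
| 8 | rb1 | 0.360637 | 9 |
| 9 | kdm6b | 0.365583 | 10 |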
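
The rows shown here match those from the full run, which is what one would expect when the same checkpoint is reused. Rather than comparing the tables by eye, the two result files can be compared programmatically. This is a sketch under the assumption that both CSVs contain `factors`, `similarity`, and `rank` columns; `rankings_match` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def rankings_match(first, second, tol=1e-6):
    """Check that two ranking tables list the same factors with matching ranks and scores."""
    merged = first.merge(second, on="factors", suffixes=("_a", "_b"), how="outer")
    if merged[["rank_a", "rank_b"]].isna().any().any():
        return False  # a factor is missing from one of the tables
    same_rank = (merged["rank_a"] == merged["rank_b"]).all()
    same_score = np.allclose(merged["similarity_a"], merged["similarity_b"], atol=tol)
    return bool(same_rank and same_score)
```

To use it, load `./output_infer_cell_key_regulator/results/factor_importance_rank.csv` and `./output_infer_cell_key_regulator_load_cpkt/results/factor_importance_rank.csv` with `pd.read_csv` and pass both DataFrames to `rankings_match`.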
