# Infer cell-type-specific transcriptional regulatory networks (TRNs)

The ``infer_cell_trn`` command fine-tunes ChromBERT on cell-type-specific accessibility data (BigWig + peaks) and then infers a cell-type-specific transcriptional regulatory network (TRN) and key regulators. If a fine-tuned checkpoint is provided, fine-tuning is skipped and the TRN is inferred directly from the checkpoint.
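
The two modes described above (fine-tune first, or skip fine-tuning when a checkpoint is supplied) differ only in whether `--ft-ckpt` is passed. The following is a minimal sketch of assembling the command from Python, assuming the flag names and defaults printed by `chrombert-tools infer_cell_trn -h`; the helper `build_infer_cell_trn_cmd` is hypothetical and not part of `chrombert-tools`.

```python
import shlex
from typing import List, Optional


def build_infer_cell_trn_cmd(
    bigwig: str,
    peaks: str,
    odir: str = "./output",
    genome: str = "hg38",
    resolution: str = "1kb",
    ft_ckpt: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for `chrombert-tools infer_cell_trn` (hypothetical helper)."""
    cmd = [
        "chrombert-tools", "infer_cell_trn",
        "--cell-type-bw", bigwig,
        "--cell-type-peak", peaks,
        "--odir", odir,
        "--genome", genome,
        "--resolution", resolution,
    ]
    if ft_ckpt is not None:
        # When a checkpoint is given, the command skips fine-tuning.
        cmd += ["--ft-ckpt", ft_ckpt]
    return cmd


cmd = build_infer_cell_trn_cmd(
    "../data/myoblast_ENCFF149ERN_signal.bigwig",
    "../data/myoblast_ENCFF647RNC_peak.bed",
    odir="./output_infer_cell_trn",
)
print(shlex.join(cmd))
# To execute the command: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with long checkpoint paths.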

**Note**: The examples below show only the direct command usage.

If you need to use a Singularity container, refer to the [`singularity_use.ipynb`](singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.

# Example: infer key regulators and cell-type-specific transcriptional regulatory networks (TRNs) for myoblast

In [4]:
import pandas as pd
import numpy as np
import os
os.chdir("/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli")

In [5]:
os.environ["CUDA_VISIBLE_DEVICES"]='0'

In [6]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome         Extract cell-specific cistrome embeddings...
  embed_cell_gene             Extract cell-specific gene embeddings
  embed_cell_region           Extract cell-specific region embeddings
  embed_cell_regulator        Extract cell-specific regulator embeddings...
  embed_cistrome              Extract general cistrome embeddings on...
  embed_gene                  Extract general gene embeddings
  embed_region                Extract general region embeddings
  embed_regulator             Extract general regulator embeddings on...
  find_driver_in_dual_region  Find driver factors in dual functional...
  find_driver_in_transition   Find driver factors in cell state transit

In [7]:
!chrombert-tools infer_cell_trn -h

Usage: chrombert-tools infer_cell_trn [OPTIONS]

  Infer cell-specific TRN (Transition Regulatory Network)

Options:
  --cell-type-bw FILE             Cell type accessibility BigWig file.
                                  [required]
  --cell-type-peak FILE           Cell type accessibility Peak BED file.
                                  [required]
  --ft-ckpt FILE                  Fine-tuned ChromBERT checkpoint. If
                                  provided, skip fine-tuning and use this
                                  ckpt.
  --genome [hg38|mm10]            Reference genome (hg38 or mm10).  [default:
                                  hg38]
  --resolution [200bp|1kb|2kb|4kb]
                                  ChromBERT resolution.  [default: 1kb]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --mode [fast|full]              Fast: downsample regions to 20k for
                                  training; Full: use all regions.  [default:
                   

#### Download the myoblast BigWig and peak files from ENCODE

In [5]:
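# Download the myoblast peak file from ENCODE (uncomment to run).
# Note: the downloaded file is gzip-compressed (.bed.gz) even though it is saved under a .bed name.
# import subprocess
# if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
#     cmd = 'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed'
#     subprocess.run(cmd, shell=True, check=True)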

In [6]:
# Download the myoblast accessibility signal (BigWig) from ENCODE (uncomment to run).
# import subprocess
# if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
#     cmd = 'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
#     subprocess.run(cmd, shell=True, check=True)
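
The peak URL above ends in `.bed.gz` while the local file is saved with a `.bed` name, so it can be worth checking what is actually on disk before running the command. The following is a minimal sketch, assuming nothing about how `chrombert-tools` reads its inputs; the helpers `is_gzipped` and `count_bed_records` are hypothetical, and the demo coordinates are made up.

```python
import gzip
import os
import tempfile


def is_gzipped(path: str) -> bool:
    """Return True if the file starts with the gzip magic bytes 0x1f 0x8b."""
    with open(path, "rb") as fh:
        return fh.read(2) == b"\x1f\x8b"


def count_bed_records(path: str) -> int:
    """Count data lines in a plain-text or gzip-compressed BED file."""
    opener = gzip.open if is_gzipped(path) else open
    n = 0
    with opener(path, "rt") as fh:
        for line in fh:
            # Skip blank lines and the optional BED header lines.
            if line.strip() and not line.startswith(("#", "track", "browser")):
                n += 1
    return n


# Self-contained demo: a gzip-compressed file saved under a .bed name.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo_peak.bed")
    with gzip.open(demo, "wt") as fh:
        fh.write("chr1\t100\t200\nchr1\t300\t400\n")
    demo_is_gz = is_gzipped(demo)
    demo_n = count_bed_records(demo)

print(demo_is_gz, demo_n)  # True 2
```

On the real data, `is_gzipped("../data/myoblast_ENCFF647RNC_peak.bed")` would show whether the file needs decompressing for tools that expect plain-text BED.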

## Run

In [11]:
!mkdir -p ./tmp

In [12]:
# takes approximately 20-60 minutes to run
!chrombert-tools infer_cell_trn \
    --cell-type-bw "../data/myoblast_ENCFF149ERN_signal.bigwig" \
    --cell-type-peak "../data/myoblast_ENCFF647RNC_peak.bed" \
    --odir "./output_infer_cell_trn" \
    --genome "hg38" \
    --resolution "1kb"  2> "./tmp/infer_cell_trn.stderr.log" # redirect stderr to log file
    

Stage 1: Praparing the dataset
Finished stage 1
Stage 2: Fine-tuning the model

[Attempt 0/2] seed=55
use organisim hg38; max sequence length is 6391
Epoch 0:  20%|████▍                 | 800/4000 [02:18<09:12,  5.79it/s, v_num=0]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.20it/s]
Epoch 0:  40%|▍| 1600/4000 [05:01<07:32,  5.30it/s, v_num=0, default_validation/
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.16it/s]
Epoch 0:  60%

In [14]:
# factor_importance_rank.csv: ranked key regulators for myoblast, with three data columns
# (plus an unnamed index column):
#   - factors: regulator names
#   - similarity: cosine similarity of each regulator's embeddings between up-regulated and unchanged regions
#   - rank: importance ranking (in the output below, lower similarity ranks higher)

factor_importance_rank = pd.read_csv("./output_infer_cell_trn/results/factor_importance_rank.csv")
factor_importance_rank.head(n=25)


Unnamed: 0,factors,similarity,rank
0,yap1,0.128829,1
1,myf5,0.155469,2
2,tead1,0.177529,3
3,cbx6,0.188549,4
4,myod1,0.216889,5
5,ring1,0.228451,6
6,tcf21,0.253095,7
7,chd4,0.265094,8
8,myog,0.270762,9
9,rb1,0.298078,10
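
In the table above, the regulator with the lowest `similarity` receives rank 1. The following is a minimal sketch of that idea, assuming each regulator has one embedding for up-regulated regions and one for unchanged regions. The three-dimensional vectors and the names `factor_a` to `factor_c` are made up for illustration, and this is not the tool's actual implementation.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_factors(
    up: Dict[str, Sequence[float]],
    unchanged: Dict[str, Sequence[float]],
) -> List[Tuple[str, float, int]]:
    """Rank factors by ascending similarity: the largest embedding shift gets rank 1."""
    sims = {f: cosine_similarity(up[f], unchanged[f]) for f in up}
    ordered = sorted(sims.items(), key=lambda kv: kv[1])
    return [(f, s, i + 1) for i, (f, s) in enumerate(ordered)]


# Toy embeddings (hypothetical values).
up = {
    "factor_a": [1.0, 0.0, 0.0],
    "factor_b": [1.0, 1.0, 0.0],
    "factor_c": [0.0, 1.0, 0.0],
}
unchanged = {
    "factor_a": [0.0, 1.0, 0.0],
    "factor_b": [1.0, 0.0, 0.0],
    "factor_c": [0.0, 2.0, 0.0],
}

ranking = rank_factors(up, unchanged)
for factor, sim, rank in ranking:
    print(f"{factor}\t{sim:.4f}\t{rank}")
```

Here `factor_a` ranks first because its two embeddings are orthogonal, while `factor_c` ranks last because its embeddings point in the same direction.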


In [13]:
import glob

In [17]:
total_graph_edge_file = glob.glob("./output_infer_cell_trn/results/total_graph_edge_threshold*_quantile*.tsv")[0]
total_graph_edge_file

'./output_infer_cell_trn/results/total_graph_edge_threshold0.74_quantile0.99.tsv'
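
The edge file name encodes the similarity threshold and the quantile. The following is a minimal sketch that recovers both numbers from the path, assuming the `total_graph_edge_threshold<value>_quantile<value>.tsv` naming pattern seen in this output; `parse_edge_filename` is a hypothetical helper.

```python
import re
from typing import Tuple

# Pattern inferred from the file name shown in this tutorial.
EDGE_FILE_PATTERN = re.compile(
    r"total_graph_edge_threshold([0-9.]+)_quantile([0-9.]+)\.tsv$"
)


def parse_edge_filename(path: str) -> Tuple[float, float]:
    """Return (threshold, quantile) parsed from an edge file path."""
    match = EDGE_FILE_PATTERN.search(path)
    if match is None:
        raise ValueError(f"Unexpected edge file name: {path}")
    return float(match.group(1)), float(match.group(2))


threshold, quantile = parse_edge_filename(
    "./output_infer_cell_trn/results/total_graph_edge_threshold0.74_quantile0.99.tsv"
)
print(threshold, quantile)  # 0.74 0.99
```

Note that `glob.glob(...)[0]` raises `IndexError` when no file matches, so checking that the returned list is non-empty first gives a clearer error message.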

In [18]:
# total_graph_edge_*.tsv: edges (regulator pairs) in the cell-specific regulatory network 
# for up-regulated regions where cosine similarity >= threshold
total_graph_edge = pd.read_csv(total_graph_edge_file, sep="\t")
total_graph_edge

Unnamed: 0,node1,node2,cosine_similarity
0,5hmc,rloop,0.758960
1,adnp,atf5,0.809457
2,adnp,creb3,0.762606
3,adnp,creb3l4,0.794392
4,adnp,ets2,0.753899
...,...,...,...
5747,znf860,znf93,0.824218
5748,zscan20,zscan23,0.769609
5749,zscan20,zscan5a,0.828537
5750,zscan22,zscan31,0.838648
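
The file name pairs a threshold of 0.74 with a quantile of 0.99. One plausible reading, which is an assumption and not confirmed by the tool's documentation, is that the threshold is the 0.99 quantile of all pairwise regulator similarities and that only pairs at or above it are kept as edges. The following is a minimal sketch of that kind of quantile thresholding on made-up similarity values, using linear interpolation between order statistics.

```python
from typing import Dict, Iterable, List, Tuple


def quantile(values: Iterable[float], q: float) -> float:
    """Quantile with linear interpolation between neighbouring order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def threshold_edges(
    pair_sims: Dict[Tuple[str, str], float], q: float
) -> Tuple[float, List[Tuple[str, str, float]]]:
    """Keep the pairs whose similarity is at or above the q-th quantile."""
    cutoff = quantile(pair_sims.values(), q)
    edges = sorted((a, b, s) for (a, b), s in pair_sims.items() if s >= cutoff)
    return cutoff, edges


# Hypothetical pairwise similarities between four placeholder regulators.
pair_sims = {
    ("a", "b"): 0.9,
    ("a", "c"): 0.2,
    ("a", "d"): 0.1,
    ("b", "c"): 0.3,
    ("b", "d"): 0.4,
    ("c", "d"): 0.8,
}

cutoff, edges = threshold_edges(pair_sims, q=0.5)
print(round(cutoff, 2), edges)
```

With `q=0.5` the cutoff falls between 0.3 and 0.4, so three of the six pairs survive. A quantile of 0.99 would keep roughly the top 1% of pairs.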


In [23]:
total_graph_edge.query("node1 =='myf5' or node2 =='myf5'")

Unnamed: 0,node1,node2,cosine_similarity
1604,myf5,myod1,0.88547
1605,myf5,myog,0.815771
1606,myf5,neurog2,0.80864
1607,myf5,tcf21,0.785743
1608,myf5,tead1,0.811371
1609,myf5,yap1,0.793563
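
The query above checks both `node1` and `node2` because each regulator pair appears to be stored only once in the edge table. The following is a minimal sketch of turning such an edge list into a symmetric adjacency map, so that a regulator's partners can be looked up directly. It uses the `myf5` rows shown above as input, and `build_adjacency` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Edges copied from the myf5 query output above.
edges = [
    ("myf5", "myod1", 0.88547),
    ("myf5", "myog", 0.815771),
    ("myf5", "neurog2", 0.80864),
    ("myf5", "tcf21", 0.785743),
    ("myf5", "tead1", 0.811371),
    ("myf5", "yap1", 0.793563),
]


def build_adjacency(
    edge_list: Iterable[Tuple[str, str, float]]
) -> Dict[str, Dict[str, float]]:
    """Store every undirected edge in both directions."""
    adj: Dict[str, Dict[str, float]] = defaultdict(dict)
    for node1, node2, sim in edge_list:
        adj[node1][node2] = sim
        adj[node2][node1] = sim
    return adj


adj = build_adjacency(edges)

# Partners of myf5, strongest similarity first.
partners = sorted(adj["myf5"].items(), key=lambda kv: kv[1], reverse=True)
for name, sim in partners:
    print(name, sim)
```

With the full table loaded, the same map gives each regulator's number of partners via `len(adj[name])`.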


In [24]:
total_graph_edge.query("node1 =='pax7' or node2 =='pax7'")

Unnamed: 0,node1,node2,cosine_similarity
705,dux4,pax7,0.755378


### Load the fine-tuned checkpoint to infer key regulators and TRN for myoblast (skip fine-tuning)

In [8]:
# takes approximately 3-5 minutes to run
!chrombert-tools infer_cell_trn \
    --cell-type-bw "../data/myoblast_ENCFF149ERN_signal.bigwig" \
    --cell-type-peak "../data/myoblast_ENCFF647RNC_peak.bed" \
    --ft-ckpt "/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt" \
    --odir "./output_infer_cell_trn_load_ckpt" \
    --genome "hg38" \
    --resolution "1kb"  2> "./tmp/infer_cell_trn.stderr2.log" # redirect stderr to log file

Stage 1: Praparing the dataset
Finished stage 1
Use fine-tuned ChromBERT checkpoint file: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt to infer cell-specific trn
use organisim hg38; max sequence length is 6391
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_infer_cell_trn/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=3-step=226.ckpt
Loading from pl module, remove prefix 'model.'
Loaded 110/110 parameters
Finished stage 2
Stage 3: generate regulator embedding on different activity regions
Finished stage 3
Stage 4: find key regulator
        factors  similarity  rank
0          yap1    0.128829     1
1          myf5    0.155469     2
2         tead1    0.177529     3
3          cbx6    0.188549     4
4         my

In [9]:
factor_importance_rank = pd.read_csv("./output_infer_cell_trn_load_ckpt/results/factor_importance_rank.csv")
factor_importance_rank.head(n=25)

Unnamed: 0,factors,similarity,rank
0,yap1,0.128829,1
1,myf5,0.155469,2
2,tead1,0.177529,3
3,cbx6,0.188549,4
4,myod1,0.216889,5
5,ring1,0.228451,6
6,tcf21,0.253095,7
7,chd4,0.265094,8
8,myog,0.270762,9
9,rb1,0.298078,10
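
The rows shown for the checkpoint run match those from the first run. The following is a minimal sketch of checking this programmatically rather than by eye, assuming both CSV files have the `factors`, `similarity`, and `rank` columns shown above. The helpers `load_ranks` and `diff_ranks` are hypothetical, and the demo uses in-memory copies of the first three rows instead of the real files.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def load_ranks(handle: TextIO) -> Dict[str, Tuple[float, int]]:
    """Map each factor to its (similarity, rank) pair."""
    reader = csv.DictReader(handle)
    return {
        row["factors"]: (float(row["similarity"]), int(row["rank"]))
        for row in reader
    }


def diff_ranks(
    a: Dict[str, Tuple[float, int]],
    b: Dict[str, Tuple[float, int]],
    tol: float = 1e-6,
) -> List[str]:
    """Return factors that are missing from one run or whose values differ."""
    changed = []
    for factor in sorted(set(a) | set(b)):
        if factor not in a or factor not in b:
            changed.append(factor)
            continue
        (sim_a, rank_a), (sim_b, rank_b) = a[factor], b[factor]
        if rank_a != rank_b or abs(sim_a - sim_b) > tol:
            changed.append(factor)
    return changed


# Rows copied from the two tables above (the first header field is the unnamed index).
text = ",factors,similarity,rank\n0,yap1,0.128829,1\n1,myf5,0.155469,2\n2,tead1,0.177529,3\n"
run1 = load_ranks(io.StringIO(text))
run2 = load_ranks(io.StringIO(text))

differences = diff_ranks(run1, run2)
print(differences)  # []
```

For the real outputs, replace the `io.StringIO` objects with `open(path, newline="")` on the two `factor_importance_rank.csv` files.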


In [10]:
import glob

In [11]:
total_graph_edge_file = glob.glob("./output_infer_cell_trn_load_ckpt/results/total_graph_edge_threshold*_quantile*.tsv")[0]
total_graph_edge_file

'./output_infer_cell_trn_load_ckpt/results/total_graph_edge_threshold0.74_quantile0.99.tsv'

In [12]:
# total_graph_edge_*.tsv: edges (regulator pairs) in the cell-specific regulatory network 
# for up-regulated regions where cosine similarity >= threshold
total_graph_edge = pd.read_csv(total_graph_edge_file, sep="\t")
total_graph_edge

Unnamed: 0,node1,node2,cosine_similarity
0,5hmc,rloop,0.758960
1,adnp,atf5,0.809457
2,adnp,creb3,0.762606
3,adnp,creb3l4,0.794392
4,adnp,ets2,0.753899
...,...,...,...
5747,znf860,znf93,0.824218
5748,zscan20,zscan23,0.769609
5749,zscan20,zscan5a,0.828537
5750,zscan22,zscan31,0.838648


In [13]:
total_graph_edge.query("node1 =='myf5' or node2 =='myf5'")

Unnamed: 0,node1,node2,cosine_similarity
1604,myf5,myod1,0.88547
1605,myf5,myog,0.815771
1606,myf5,neurog2,0.80864
1607,myf5,tcf21,0.785743
1608,myf5,tead1,0.811371
1609,myf5,yap1,0.793563
