# Identify driver factors in cell state transitions

The ``find_driver_in_transition`` command identifies key transcription factors that drive changes in gene expression and/or chromatin accessibility during cell state transitions (e.g., differentiation or reprogramming).

You can run this command with:
- expression only,
- accessibility only, or
- both expression and accessibility.

Provide the corresponding input files for the analyses you want to perform.
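
As a rough illustration of the three modes, the sketch below assembles the argument list for each case. It uses only flags that appear in this tutorial (`--exp-tpm1/2`, `--acc-peak1/2`, `--acc-signal1/2`, `--genome`, `--resolution`, `--odir`). The helper `build_driver_command`, its defaults, and the file names are hypothetical, and the command is only constructed, not executed.

```python
def build_driver_command(odir, genome="hg38", resolution="1kb", exp=None, acc=None):
    """Assemble a chrombert-tools argument list (hypothetical helper).

    exp: optional (tpm1, tpm2) tuple of expression CSV paths.
    acc: optional (peak1, peak2, signal1, signal2) tuple of accessibility paths.
    """
    if exp is None and acc is None:
        raise ValueError("provide expression data, accessibility data, or both")
    cmd = ["chrombert-tools", "find_driver_in_transition",
           "--genome", genome, "--resolution", resolution, "--odir", odir]
    if exp is not None:
        tpm1, tpm2 = exp
        cmd += ["--exp-tpm1", tpm1, "--exp-tpm2", tpm2]
    if acc is not None:
        peak1, peak2, signal1, signal2 = acc
        cmd += ["--acc-peak1", peak1, "--acc-peak2", peak2,
                "--acc-signal1", signal1, "--acc-signal2", signal2]
    return cmd

# Expression-only run (placeholder file names)
exp_only = build_driver_command("out_exp", exp=("state1_tpm.csv", "state2_tpm.csv"))
print(" ".join(exp_only))
```

The resulting list could be passed to `subprocess.run` or joined into a shell command, depending on how you prefer to launch the tool.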

**Note**: The examples below show only direct command usage.

If you need to use a Singularity container, refer to the [`singularity_use.ipynb`](singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.

## Example:

Find driver regulators in the fibroblast-to-myoblast transition using both expression and accessibility.


In [None]:
import pandas as pd
import numpy as np
import os
workdir="/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli"
os.chdir(workdir)

In [2]:
os.environ["CUDA_VISIBLE_DEVICES"]='2'

In [1]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome             Extract cell-specific cistrome...
  embed_cell_gene                 Extract cell-specific gene embeddings
  embed_cell_region               Extract cell-specific region embeddings
  embed_cell_regulator            Extract cell-specific regulator...
  embed_cistrome                  Extract general cistrome embeddings on...
  embed_gene                      Extract general gene embeddings
  embed_region                    Extract general region embeddings
  embed_regulator                 Extract general regulator embeddings on...
  find_context_specific_cofactor  Find context-specific cofactors in...
  find_driver_in_transition       Find driver factors in cell

In [4]:
!chrombert-tools find_driver_in_transition -h

Usage: chrombert-tools find_driver_in_transition [OPTIONS]

  Find driver factors in cell state transitions.

  This tool identifies key transcription factors that drive cell state
  transitions by analyzing changes in gene expression and/or chromatin
  accessibility between two cell states.

  You must provide at least one of the following: - Expression data (--exp-
  tpm1 and --exp-tpm2) - Accessibility data (--acc-peak1, --acc-peak2, --acc-
  signal1, --acc-signal2)

  Providing both expression and accessibility data yields more confident
  results.

Options:
  --exp-tpm1 FILE                 Expression (TPM) file for cell state 1. CSV
                                  format with 'gene_id' and 'tpm' columns.
  --exp-tpm2 FILE                 Expression (TPM) file for cell state 2. CSV
                                  format with 'gene_id' and 'tpm' columns.
  --acc-peak1 FILE                Chromatin accessibility peak BED file for
                                  cell state 1.
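
The help text states that each expression file is a CSV with `gene_id` and `tpm` columns. A quick pre-flight check along those lines might look like the sketch below. `check_expression_csv` is a hypothetical helper, not part of `chrombert-tools`, and the numeric check on `tpm` is an assumption about what the tool expects.

```python
import csv
import io

def check_expression_csv(handle):
    """Return the number of data rows after checking the required columns."""
    reader = csv.DictReader(handle)
    missing = {"gene_id", "tpm"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    n_rows = 0
    for row in reader:
        float(row["tpm"])  # raises ValueError if a TPM value is not numeric
        n_rows += 1
    return n_rows

# Tiny in-memory example; gene IDs and values are made up for illustration.
example = io.StringIO("gene_id,tpm\nGENE_A,12.5\nGENE_B,0.0\n")
print(check_expression_csv(example))
```

For a real file, open it with `open(path, newline="")` and pass the handle to the function.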
 

## Run

In [5]:
# Runtime estimates:
#   - fast mode: ~3-5 hours
#     (uses all ~19,620 genes for expression analysis, but downsamples 
#      chromatin accessibility regions to 20k for faster training)
#
# Note: Both modes (fast and full) use the complete gene expression dataset; 'fast' mode
# downsamples only the chromatin accessibility regions, not the genes.
#
# To keep this test run short (about 40-100 minutes), the expression files used here
# were downsampled to 5,000 genes.

!chrombert-tools find_driver_in_transition \
  --exp-tpm1 "../data/fibroblast_expression_sample5000.csv" \
  --exp-tpm2 "../data/myoblast_expression_sample5000.csv" \
  --acc-peak1 "../data/fibroblast_ENCFF184KAM_peak.bed" \
  --acc-peak2 "../data/myoblast_ENCFF647RNC_peak.bed" \
  --acc-signal1 "../data/fibroblast_ENCFF361BTT_signal.bigwig" \
  --acc-signal2 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
  --genome 'hg38' \
  --resolution '1kb' \
  --odir output_find_driver_in_transition \
  --direction "2-1" 2> "./tmp/hg38_1kb.stderr.log"

Stage 1: prepare dataset
Expression dataset already exists in output_find_driver_in_transition/exp/dataset
Processing stage 1: prepare chromatin accessibility dataset
Finished Stage 1
Whether to train ChromBERT to predict expression changes in cell state transition: True
Whether to train ChromBERT to predict chromatin accessibility changes in cell state transition: True
Processing stage 2 (exp): train ChromBERT to predict expression changes in cell state transition
Stage 2 (exp): train ChromBERT to predict expression changes in cell state transition

[Attempt 0/2] seed=55
use organisim hg38; max sequence length is 6391
Epoch 0:  20%|████▍                 | 336/1688 [03:54<15:43,  1.43it/s, v_num=0]
Validation: |                                             | 0/? [00:00<?, ?it/s][A
Validation: |                                             | 0/? [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|                          | 0/105 [00:00<?, ?it/s][A
Validation DataLoader 0: 100%|██████████
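
The command above redirects stderr to `./tmp/hg38_1kb.stderr.log`, and the shell cannot create that file if `./tmp` does not exist. A minimal sketch for creating the directory beforehand and peeking at the last lines of the log afterwards follows. `tail_log` is a hypothetical helper written for this example.

```python
import os

def tail_log(path, n=5):
    """Return the last n lines of a text log, or an empty list if it is missing."""
    if not os.path.exists(path):
        return []
    with open(path, "r", errors="replace") as fh:
        return fh.read().splitlines()[-n:]

log_path = os.path.join("tmp", "hg38_1kb.stderr.log")
os.makedirs(os.path.dirname(log_path), exist_ok=True)  # run this before the command
for line in tail_log(log_path):
    print(line)
```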

### Load the fine-tuned checkpoints to infer key regulators and the TRN for myoblast (skip fine-tuning)

In [31]:
!chrombert-tools find_driver_in_transition \
  --ft-ckpt-exp "/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=0-step=21.ckpt" \
  --ft-ckpt-acc "/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt" \
  --genome 'hg38' \
  --exp-tpm1 "../data/fibroblast_expression_sample5000.csv" \
  --exp-tpm2 "../data/myoblast_expression_sample5000.csv" \
  --acc-peak1 "../data/fibroblast_ENCFF184KAM_peak.bed" \
  --acc-peak2 "../data/myoblast_ENCFF647RNC_peak.bed" \
  --acc-signal1 "../data/fibroblast_ENCFF361BTT_signal.bigwig" \
  --acc-signal2 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
  --resolution '1kb' \
  --odir output_find_driver_in_transition \
  --direction "2-1" 2> "./tmp/hg38_1kb.stderr.log"

Stage 1: prepare dataset
Expression dataset already exists in output_find_driver_in_transition/exp/dataset
Chromatin accessibility dataset already exists in output_find_driver_in_transition/acc/dataset
Finished Stage 1
Whether to train ChromBERT to predict expression changes in cell state transition: True
Whether to train ChromBERT to predict chromatin accessibility changes in cell state transition: True
Processing stage 2 (exp): train ChromBERT to predict expression changes in cell state transition
Use fine-tuned ChromBERT checkpoint file: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=0-step=21.ckpt to find driver factors in different expression activity genes
use organisim hg38; max sequence length is 6391
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/
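
The checkpoint paths passed to `--ft-ckpt-exp` and `--ft-ckpt-acc` above are long and machine-specific. Assuming the output layout matches the paths shown in this run (`<odir>/<exp|acc>/train/<attempt>/lightning_logs/.../checkpoints/*.ckpt`), a sketch for locating checkpoints relative to `--odir` could look like this. `find_checkpoints` is a hypothetical helper; deciding which checkpoint to use when several attempts exist is left to the reader.

```python
import glob
import os

def find_checkpoints(odir, modality):
    """List .ckpt files under <odir>/<modality>/train, sorted by path.

    The directory layout is inferred from the paths in this tutorial and may
    differ between versions of the tool.
    """
    pattern = os.path.join(odir, modality, "train", "**", "checkpoints", "*.ckpt")
    return sorted(glob.glob(pattern, recursive=True))

for modality in ("exp", "acc"):
    print(modality, find_checkpoints("output_find_driver_in_transition", modality))
```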

In [32]:
odir = "./output_find_driver_in_transition"
exp_odir = f'{odir}/exp'
exp_results_odir = f"{exp_odir}/results"
exp_rank_df = pd.read_csv(os.path.join(exp_results_odir, "factor_importance_rank.csv"))
exp_rank_df

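|     | factors | similarity | rank |
| --- | --- | --- | --- |
| 0 | cbx6 | 0.961502 | 1 |
| 1 | cbx7 | 0.963570 | 2 |
| 2 | cbx8 | 0.976251 | 3 |
| 3 | ring1 | 0.979848 | 4 |
| 4 | brd7 | 0.980695 | 5 |
| ... | ... | ... | ... |
| 986 | stat4 | 0.996563 | 987 |
| 987 | gtf2i | 0.996564 | 988 |
| 988 | irf5 | 0.996580 | 989 |
| 989 | tfcp2 | 0.996936 | 990 |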


In [33]:
acc_odir = f'{odir}/acc'
acc_results_odir = f"{acc_odir}/results"
acc_rank_df = pd.read_csv(os.path.join(acc_results_odir, "factor_importance_rank.csv"))
acc_rank_df

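|     | factors | similarity | rank |
| --- | --- | --- | --- |
| 0 | myog | 0.163047 | 1 |
| 1 | pax3-foxo1a | 0.202741 | 2 |
| 2 | myf5 | 0.212604 | 3 |
| 3 | myod1 | 0.242216 | 4 |
| 4 | pax7 | 0.346181 | 5 |
| ... | ... | ... | ... |
| 986 | znf28 | 0.981777 | 987 |
| 987 | znf26 | 0.981883 | 988 |
| 988 | znf714 | 0.982148 | 989 |
| 989 | znf266 | 0.982677 | 990 |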


In [34]:
merge_df = pd.merge(exp_rank_df,acc_rank_df,on='factors',how='inner',suffixes=['_exp','_acc'])
merge_df

|     | factors | similarity_exp | rank_exp | similarity_acc | rank_acc |
| --- | --- | --- | --- | --- | --- |
| 0 | cbx6 | 0.961502 | 1 | 0.925670 | 119 |
| 1 | cbx7 | 0.963570 | 2 | 0.927905 | 124 |
| 2 | cbx8 | 0.976251 | 3 | 0.930284 | 133 |
| 3 | ring1 | 0.979848 | 4 | 0.923826 | 114 |
| 4 | brd7 | 0.980695 | 5 | 0.932991 | 142 |
| ... | ... | ... | ... | ... | ... |
| 986 | stat4 | 0.996563 | 987 | 0.961282 | 519 |
| 987 | gtf2i | 0.996564 | 988 | 0.976028 | 911 |
| 988 | irf5 | 0.996580 | 989 | 0.970488 | 787 |
| 989 | tfcp2 | 0.996936 | 990 | 0.980216 | 974 |
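
An inner merge silently drops any factor that is present in only one of the two rankings. Both tables here have 990 rows and so does the merged result, so nothing was lost in this run. For other data, a sketch like the following can make the check explicit. It uses two ranks copied from the tables above plus one made-up factor (`only_in_acc`) to show the effect.

```python
import pandas as pd

exp_demo = pd.DataFrame({"factors": ["cbx6", "cbx7"], "rank": [1, 2]})
acc_demo = pd.DataFrame({"factors": ["cbx6", "cbx7", "only_in_acc"],
                         "rank": [119, 124, 1]})

# An outer merge with indicator=True records where each row came from.
check = pd.merge(exp_demo, acc_demo, on="factors", how="outer",
                 suffixes=("_exp", "_acc"), indicator=True)
unmatched = check.loc[check["_merge"] != "both", "factors"].tolist()
print(unmatched)
```

An empty `unmatched` list means the inner merge kept every factor.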


In [35]:
merge_df['total_rank']=((merge_df['rank_exp']+merge_df['rank_acc'])/2).rank().astype(int)
merge_df = merge_df.sort_values('total_rank').reset_index(drop=True)
merge_df

|     | factors | similarity_exp | rank_exp | similarity_acc | rank_acc | total_rank |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | yap1 | 0.985587 | 17 | 0.487734 | 10 | 1 |
| 1 | chd4 | 0.983453 | 11 | 0.700076 | 24 | 2 |
| 2 | neurog2 | 0.987556 | 29 | 0.444524 | 8 | 3 |
| 3 | tead1 | 0.988670 | 39 | 0.455782 | 9 | 4 |
| 4 | tcf21 | 0.987744 | 32 | 0.526704 | 16 | 4 |
| ... | ... | ... | ... | ... | ... | ... |
| 986 | znf266 | 0.996125 | 963 | 0.982677 | 990 | 986 |
| 987 | maff | 0.996367 | 977 | 0.980712 | 979 | 988 |
| 988 | znf26 | 0.996320 | 974 | 0.981883 | 988 | 989 |
| 989 | pitx1 | 0.996537 | 986 | 0.980594 | 977 | 990 |
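
Note that `Series.rank()` defaults to `method='average'`, so tied factors receive a fractional rank that `astype(int)` then truncates. That is why `tead1` and `tcf21` (both with a mean rank of 24) share `total_rank` 4 in the table above. The sketch below reproduces this with the five top rows and shows `method='min'` as one possible alternative that yields whole-number ranks directly. Whether that tie rule is preferable is a judgment call, not something the tool prescribes.

```python
import pandas as pd

demo = pd.DataFrame({
    "factors": ["yap1", "chd4", "neurog2", "tead1", "tcf21"],
    "rank_exp": [17, 11, 29, 39, 32],
    "rank_acc": [10, 24, 8, 9, 16],
})
mean_rank = (demo["rank_exp"] + demo["rank_acc"]) / 2

# Default tie handling: the two tied rows get 4.5, which astype(int) truncates to 4.
demo["total_rank"] = mean_rank.rank().astype(int)
# Alternative: every tied row gets the lowest rank in its group.
demo["total_rank_min"] = mean_rank.rank(method="min").astype(int)
print(demo)
```

For a two-way tie both columns agree. With three rows tied at positions 4 to 6, the default yields 5 for all three, while `method='min'` yields 4.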


In [36]:
merge_df.query("factors == 'myod1'")

|     | factors | similarity_exp | rank_exp | similarity_acc | rank_acc | total_rank |
| --- | --- | --- | --- | --- | --- | --- |
| 8 | myod1 | 0.990031 | 55 | 0.242216 | 4 | 9 |
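
To look up several factors at once, `Series.isin` is a convenient alternative to chaining `query` calls. The sketch below rebuilds the top five rows of the accessibility ranking shown earlier so that it runs on its own; the selection in `factors_of_interest` is only an example. Factor names appear in lowercase in these tables, so the lookup uses lowercase as well.

```python
import pandas as pd

# Top five rows of the accessibility ranking, copied from the output above.
acc_top = pd.DataFrame({
    "factors": ["myog", "pax3-foxo1a", "myf5", "myod1", "pax7"],
    "similarity": [0.163047, 0.202741, 0.212604, 0.242216, 0.346181],
    "rank": [1, 2, 3, 4, 5],
})

factors_of_interest = ["myod1", "myog", "myf5"]  # example selection
hits = acc_top[acc_top["factors"].isin(factors_of_interest)]
print(hits)
```

The same filter can be applied to `merge_df` to compare expression and accessibility ranks for the selected factors.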


In [1]:
import pandas as pd
merge_df = pd.read_csv("/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/merge/factor_importance_rank.csv")

In [3]:
merge_df.head(n=5)

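|     | factors | similarity_exp | rank_exp | similarity_acc | rank_acc | total_rank |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | yap1 | 0.985587 | 17 | 0.487734 | 10 | 1 |
| 1 | chd4 | 0.983453 | 11 | 0.700076 | 24 | 2 |
| 2 | neurog2 | 0.987556 | 29 | 0.444524 | 8 | 3 |
| 3 | tead1 | 0.98867 | 39 | 0.455782 | 9 | 4 |
| 4 | tcf21 | 0.987744 | 32 | 0.526704 | 16 | 4 |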
