# Identify driver factors distinguishing two sets of genomic regions

The ``find_driver_in_dual_region`` command trains a classifier to distinguish two sets of genomic regions and identifies regulatory factors that contribute most to the classification.

**Note**: The examples below show direct command usage only.

If you need to run inside a Singularity container, please refer to the [`singularity_use.ipynb`](singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.

## Example
EZH2, the catalytic subunit of Polycomb Repressive Complex 2 (PRC2), operates in two distinct modes: 
a classical H3K27me3-dependent repressive role and a non-classical H3K27me3-independent role.

We fine-tuned ChromBERT to classify EZH2 ChIP-seq peaks in human embryonic stem cells 
into classical and non-classical categories.

Using this fine-tuned model, we can identify key regulators that contribute to the classification
and plot EZH2 dual-function regulator subnetworks.


In [1]:
import pandas as pd
import numpy as np
import os
os.chdir("/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli")

In [2]:
os.environ["CUDA_VISIBLE_DEVICES"]='2'

In [3]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome         Extract cell-specific cistrome embeddings...
  embed_cell_gene             Extract cell-specific gene embeddings
  embed_cell_region           Extract cell-specific region embeddings
  embed_cell_regulator        Extract cell-specific regulator embeddings...
  embed_cistrome              Extract general cistrome embeddings on...
  embed_gene                  Extract general gene embeddings
  embed_region                Extract general region embeddings
  embed_regulator             Extract general regulator embeddings on...
  find_driver_in_dual_region  Find driver factors in dual functional...
  find_driver_in_transition   Find driver factors in cell state transit

In [4]:
!chrombert-tools find_driver_in_dual_region -h

Usage: chrombert-tools find_driver_in_dual_region [OPTIONS]

  Find driver factors in dual functional regions.

Options:
  --function1-bed TEXT            Different genomic regions for function1. Use
                                  ';' to separate multiple BED files.
                                  [required]
  --function1-mode [and|or]       Logic mode for function1 regions: 'and'
                                  requires all conditions; 'or' requires any
                                  condition.  [default: and]
  --function2-bed TEXT            Different genomic regions for function2. Use
                                  ';' to separate multiple BED files.
                                  [required]
  --function2-mode [and|or]       Logic mode for function2 regions: 'and'
                                  requires all conditions; 'or' requires any
                                  condition.  [default: and]
  --dual-regulator TEXT           Dual-functional regulator(s). Use

## Run

In [5]:
!mkdir -p ./tmp

In [6]:
# takes approximately 20-40 minutes to run
!chrombert-tools find_driver_in_dual_region \
    --function1-bed "../data/hESC_GSM1003524_EZH2.bed;../data/hESC_GSM1498900_H3K27me3.bed" \
    --function2-bed "../data/hESC_GSM1003524_EZH2.bed" \
    --dual-regulator "EZH2" \
    --ignore-regulator "H3K27me3;H3K27me3/H3K4me3" \
    --odir "./output_find_driver_in_dual_region" \
    --genome "hg38" \
    --resolution "1kb"  2> "./tmp/find_driver_in_dual_region.log" # redirect stderr to log file

Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 1, matched in ChromBERT: 1, not found: 0, not found regulator: []
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 2, matched in ChromBERT: 2, not found: 0, not found regulator: []
Stage 1: Praparing the dataset
  Function1 regions (positive): 5736
  Function2 regions (negative): 5272
  Total dataset size: 11008
  Fast mode: downsampling to 20k regions (10k per class)
Finished stage 1
Stage 2: Fine-tuning the model

[Attempt 0/2] seed=55
use organisim hg38; max sequence length is 6391
Ignoring 206 cistromes and 2 regulators
Epoch 0:  20%|████▍                 | 440/2202 [01:14<04:58,  5.90it/s, v_num=0]
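Because the run takes a while, it can help to confirm that the result files used in the following cells exist before loading them. The sketch below assumes the file names that appear later in this tutorial; `missing_results` is a hypothetical helper, not part of `chrombert-tools`.

```python
from pathlib import Path

# File names taken from this tutorial. Per the tutorial, the PDF is only
# written when --dual-regulator is given.
EXPECTED_RESULTS = [
    "factor_importance_rank.csv",
    "regulator_cosine_similarity_on_function1_regions.csv",
    "regulator_cosine_similarity_on_function2_regions.csv",
    "dual_regulator_subnetwork.pdf",
]

def missing_results(odir, expected=EXPECTED_RESULTS):
    """Return the expected result files that are absent from {odir}/results."""
    results_dir = Path(odir) / "results"
    return [name for name in expected if not (results_dir / name).is_file()]

missing = missing_results("./output_find_driver_in_dual_region")
print("missing:", missing)
```

An empty list means all four files are present; otherwise the stderr log at `./tmp/find_driver_in_dual_region.log` is the first place to look.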

In [7]:
# factor_importance_rank.csv: regulators ranked by their contribution to classifying EZH2 ChIP-seq peaks in hESCs as classical or non-classical:
#   - factors: regulator name
#   - similarity: cosine similarity of the regulator's embedding between function1 and function2 regions
#   - rank: importance rank (in the output below, lower similarity ranks higher)

factor_importance_rank = pd.read_csv("./output_find_driver_in_dual_region/results/factor_importance_rank.csv")
factor_importance_rank.head(n=25)

Unnamed: 0,factors,similarity,rank
0,suz12,0.75839,1
1,ezh2,0.827892,2
2,ezh1,0.872999,3
3,dnase,0.877582,4
4,h3k27ac,0.88252,5
5,fgfr1,0.887095,6
6,h3k9ac,0.888212,7
7,cbx3,0.892973,8
8,supt5h,0.894362,9
9,cbx8,0.896639,10
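In the rows shown above, `rank` follows ascending `similarity`: the regulator whose embedding differs most between the two region sets comes first. The sketch below reproduces that ordering for the first five rows. It assumes the rank is derived from similarity alone, which is consistent with the displayed rows but is not stated by the tool.

```python
import pandas as pd

# First five rows copied from the output above
df = pd.DataFrame({
    "factors": ["suz12", "ezh2", "ezh1", "dnase", "h3k27ac"],
    "similarity": [0.75839, 0.827892, 0.872999, 0.877582, 0.88252],
})

# Assumption: lower similarity means a larger embedding shift, hence a better rank
df["rank"] = df["similarity"].rank(method="first").astype(int)
print(df.sort_values("rank"))
```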


In [8]:
# regulator_cosine_similarity_on_function1_regions.csv: pairwise cosine similarity between regulators on function1 regions
#   - rows and columns: regulator names (square matrix, read with index_col=0)
#   - values: cosine similarity of the two regulators' embeddings on function1 regions


# regulator_cosine_similarity_on_function2_regions.csv: the same matrix computed on function2 regions
#   - rows and columns: regulator names (square matrix, read with index_col=0)
#   - values: cosine similarity of the two regulators' embeddings on function2 regions

reg_cos_sim_func1 = pd.read_csv("./output_find_driver_in_dual_region/results/regulator_cosine_similarity_on_function1_regions.csv", index_col=0)
reg_cos_sim_func2 = pd.read_csv("./output_find_driver_in_dual_region/results/regulator_cosine_similarity_on_function2_regions.csv", index_col=0)
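For readers unfamiliar with these matrices, the sketch below shows how a regulator-by-regulator cosine similarity matrix can be built from one embedding vector per regulator. The toy embeddings and the helper `cosine_similarity_matrix` are assumptions for illustration; how `chrombert-tools` aggregates embeddings over the regions of each function is not shown in this tutorial.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(embeddings: pd.DataFrame) -> pd.DataFrame:
    """Pairwise cosine similarity between rows (one row per regulator)."""
    x = embeddings.to_numpy(dtype=float)
    # Normalise each row to unit length, then take all dot products at once
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return pd.DataFrame(unit @ unit.T, index=embeddings.index, columns=embeddings.index)

# Toy 3-dimensional embeddings (made-up values; real embeddings have many more dimensions)
toy = pd.DataFrame(
    [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]],
    index=["ezh2", "suz12", "brd4"],
)
sim = cosine_similarity_matrix(toy)
print(sim.round(3))
```

The result has the same shape as the two CSV files loaded above: regulators on both axes, so `sim.loc["ezh2", :]` gives the similarity of every regulator to `ezh2`.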



In [19]:
# Infer the dual-functional regulator subnetwork (if --dual-regulator was provided, the plot is saved to {odir}/results/dual_regulator_subnetwork.pdf)

# Thresholds: 95th percentile of all pairwise similarities in each matrix
thre_func1 = np.percentile(reg_cos_sim_func1.values.flatten(), 95)
thre_func2 = np.percentile(reg_cos_sim_func2.values.flatten(), 95)

# Both matrices must list regulators in the same order
assert (reg_cos_sim_func1.index == reg_cos_sim_func2.index).all()
df_cos_reg = pd.DataFrame(
    index=reg_cos_sim_func1.index,
    data={
        "function1": reg_cos_sim_func1.loc["ezh2", :],  # similarity to ezh2 (dual-functional regulator) on function1 regions
        "function2": reg_cos_sim_func2.loc["ezh2", :],  # similarity to ezh2 on function2 regions
    },
)
df_cos_reg["diff"] = df_cos_reg["function1"] - df_cos_reg["function2"]
# Keep regulators whose similarity to ezh2 differs by more than 0.1 between the two region sets
df_candidate = df_cos_reg[df_cos_reg["diff"].abs() > 0.1]
# Candidates above the function-specific threshold form each subnetwork
topN_pos = df_candidate.query("function1 > @thre_func1").index.values
topN_neg = df_candidate.query("function2 > @thre_func2").index.values
top_pairs = np.union1d(topN_pos, topN_neg)
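The selection logic in the cell above can be wrapped in a small function and checked on toy data. This is a sketch under assumptions: `select_partners` is a hypothetical helper name, the two 4x4 matrices are made up, and the toy call uses the 50th percentile because in such a small matrix the 95th percentile is dominated by the diagonal values of 1.0.

```python
import numpy as np
import pandas as pd

def select_partners(sim1, sim2, regulator, diff_cutoff=0.1, pct=95):
    """Return (function1_partners, function2_partners) for one regulator.

    Thresholds are percentiles over every entry of each similarity matrix,
    mirroring the cell above.
    """
    thr1 = np.percentile(sim1.to_numpy().ravel(), pct)
    thr2 = np.percentile(sim2.to_numpy().ravel(), pct)
    df = pd.DataFrame({"function1": sim1.loc[regulator], "function2": sim2.loc[regulator]})
    df["diff"] = df["function1"] - df["function2"]
    cand = df[df["diff"].abs() > diff_cutoff]
    pos = cand.index[cand["function1"] > thr1].tolist()
    neg = cand.index[cand["function2"] > thr2].tolist()
    return pos, neg

names = ["ezh2", "a", "b", "c"]  # "a", "b", "c" are placeholder regulators
sim1 = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.5], [0.9, 1.0, 0.3, 0.4], [0.2, 0.3, 1.0, 0.1], [0.5, 0.4, 0.1, 1.0]],
    index=names, columns=names,
)
sim2 = pd.DataFrame(
    [[1.0, 0.3, 0.8, 0.5], [0.3, 1.0, 0.2, 0.4], [0.8, 0.2, 1.0, 0.1], [0.5, 0.4, 0.1, 1.0]],
    index=names, columns=names,
)

pos, neg = select_partners(sim1, sim2, "ezh2", pct=50)
print(pos, neg)
```

Here `a` is close to `ezh2` only in the first matrix and `b` only in the second, so each lands in a different list, while `c` is dropped because its similarity to `ezh2` does not change.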



In [20]:
df_candidate

Unnamed: 0,function1,function2,diff
adnp,0.271903,0.414827,-0.142924
aebp2,0.192908,0.335669,-0.142762
aff4,0.397071,0.522204,-0.125134
ago2,0.239193,0.342009,-0.102816
ahrr,0.227926,0.375435,-0.147509
...,...,...,...
znf85,0.344781,0.455589,-0.110808
znf860,0.285424,0.415684,-0.130260
zscan2,0.247262,0.457844,-0.210582
zscan23,0.161990,0.272005,-0.110016


In [27]:
thre_func1, thre_func2

(0.6006548976289199, 0.5585828818148895)

In [21]:
# function1 subnetwork
df_candidate.loc[topN_pos]

Unnamed: 0,function1,function2,diff
cbx2,0.64504,0.543022,0.102017
h2ak119ub,0.690942,0.56201,0.128932
jarid2,0.649965,0.536699,0.113266
kdm2b,0.65293,0.538311,0.114619
pcgf1,0.653823,0.525092,0.128731


In [23]:
# function2 subnetwork
df_candidate.loc[topN_neg]

Unnamed: 0,function1,function2,diff
brca1,0.400947,0.596643,-0.195696
brd4,0.428495,0.58657,-0.158075
cdk8,0.434739,0.560862,-0.126123
cdk9,0.413138,0.563439,-0.1503
crebbp,0.432765,0.579736,-0.146971
e2f1,0.411585,0.582629,-0.171044
e2f4,0.401642,0.579386,-0.177745
ep300,0.473747,0.641044,-0.167297
foxm1,0.43354,0.647759,-0.214219
h2ak119ub,0.690942,0.56201,0.128932
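Note that `h2ak119ub` appears in both tables: its `function2` similarity (0.56201) is above `thre_func2` (about 0.5586) even though its `diff` is positive. If disjoint subnetworks are preferred, one option is to also require the sign of `diff` to match the subnetwork. This extra condition is an assumption of the sketch, not part of the original cells. The example uses three rows copied from the tables above, all of which already pass the `|diff| > 0.1` filter.

```python
import pandas as pd

# Rows copied from the two tables above
df_candidate = pd.DataFrame(
    {
        "function1": [0.690942, 0.64504, 0.428495],
        "function2": [0.56201, 0.543022, 0.58657],
    },
    index=["h2ak119ub", "cbx2", "brd4"],
)
df_candidate["diff"] = df_candidate["function1"] - df_candidate["function2"]

# Thresholds as printed earlier in this tutorial
thre_func1, thre_func2 = 0.6006548976289199, 0.5585828818148895

# Extra condition (assumption): the sign of diff must agree with the subnetwork
is_func1 = (df_candidate["function1"] > thre_func1) & (df_candidate["diff"] > 0)
is_func2 = (df_candidate["function2"] > thre_func2) & (df_candidate["diff"] < 0)
func1_only = df_candidate.index[is_func1].tolist()
func2_only = df_candidate.index[is_func2].tolist()
print(func1_only)
print(func2_only)
```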
