# Identify context-specific cofactors in different regions.

The ``find_context_specific_cofactor`` command trains a classifier to distinguish two sets of genomic regions, identifies regulatory factors that contribute most to the classification, and prioritizes context-specific cofactors for dual-functional target factors across the two contexts.

**Note**: The remaining examples will only show the direct command usage. 

If you need to use a Singularity container, please refer to the [`singularity_use.ipynb`](singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.

## Example:
EZH2, the catalytic subunit of Polycomb Repressive Complex 2 (PRC2), operates in two distinct modes: 
a classical H3K27me3-dependent repressive role and a non-classical H3K27me3-independent role.

We fine-tuned ChromBERT to classify EZH2 ChIP-seq peaks in human embryonic stem cells 
into classical and non-classical categories.

Using this fine-tuned model, we identify regulators that most strongly distinguish the two region sets and visualize context-specific EZH2 cofactor subnetworks.


In [1]:
import pandas as pd
import numpy as np
import os
os.chdir("/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli")

In [2]:
os.environ["CUDA_VISIBLE_DEVICES"]='1'

In [3]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome             Extract cell-specific cistrome...
  embed_cell_gene                 Extract cell-specific gene embeddings
  embed_cell_region               Extract cell-specific region embeddings
  embed_cell_regulator            Extract cell-specific regulator...
  embed_cistrome                  Extract general cistrome embeddings on...
  embed_gene                      Extract general gene embeddings
  embed_region                    Extract general region embeddings
  embed_regulator                 Extract general regulator embeddings on...
  find_context_specific_cofactor  Find context-specific cofactors in...
  find_driver_in_transition       Find driver factors in cell

In [4]:
!chrombert-tools find_context_specific_cofactor -h

Usage: chrombert-tools find_context_specific_cofactor [OPTIONS]

  Find context-specific cofactors in different regions.

Options:
  --function1-bed TEXT            Different genomic regions for function1. Use
                                  ';' to separate multiple BED files.
                                  [required]
  --function1-mode [and|or]       Logic mode for function1 regions: 'and'
                                  requires all conditions; 'or' requires any
                                  condition.  [default: and]
  --function2-bed TEXT            Different genomic regions for function2. Use
                                  ';' to separate multiple BED files.
                                  [required]
  --function2-mode [and|or]       Logic mode for function2 regions: 'and'
                                  requires all conditions; 'or' requires any
                                  condition.  [default: and]
  --dual-regulator TEXT           Dual-functional regulat
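
The help text above says that `and` requires all conditions and `or` requires any condition when several BED files are given for one function. The sketch below illustrates that logic on toy data, under the assumption that regions are compared as fixed-size genomic bins; how `chrombert-tools` actually computes overlaps is not shown in this tutorial, and the helper names `bin_regions` and `combine` are hypothetical.

```python
def bin_regions(intervals, resolution=1000):
    """Map (chrom, start, end) intervals to the set of fixed-size bins they overlap."""
    bins = set()
    for chrom, start, end in intervals:
        # BED intervals are half-open, so the last covered base is end - 1.
        for b in range(start // resolution, (end - 1) // resolution + 1):
            bins.add((chrom, b))
    return bins


def combine(bin_sets, mode="and"):
    """Combine per-file bin sets: 'and' keeps bins found in every file, 'or' in any file."""
    if mode == "and":
        return set.intersection(*bin_sets)
    if mode == "or":
        return set.union(*bin_sets)
    raise ValueError(f"unknown mode: {mode}")


# Toy intervals standing in for two BED files (hypothetical coordinates).
ezh2_bins = bin_regions([("chr1", 1000, 3000), ("chr1", 8000, 9000)])
h3k27me3_bins = bin_regions([("chr1", 2000, 4000)])

and_bins = combine([ezh2_bins, h3k27me3_bins], mode="and")
or_bins = combine([ezh2_bins, h3k27me3_bins], mode="or")
print(sorted(and_bins))
print(sorted(or_bins))
```

In the run below, `--function1-bed` lists an EZH2 file and an H3K27me3 file with the default `and` mode, which under this reading selects regions covered by both marks.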

## Run

In [5]:
!mkdir -p ./tmp

In [6]:
# takes approximately 20-40 minutes to run
!chrombert-tools find_context_specific_cofactor \
    --function1-bed "../data/hESC_GSM1003524_EZH2.bed;../data/hESC_GSM1498900_H3K27me3.bed" \
    --function2-bed "../data/hESC_GSM1003524_EZH2.bed" \
    --dual-regulator "EZH2" \
    --ignore-regulator "H3K27me3;H3K27me3/H3K4me3" \
    --odir "./output_find_context_specific_cofactor" \
    --genome "hg38" \
    --resolution "1kb"  2> "./tmp/find_context_specific_cofactor.log" # redirect stderr to log file

Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 1, matched in ChromBERT: 1, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 2, matched in ChromBERT: 2, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Stage 1: Praparing the dataset
  Function1 regions (positive): 5736
  Function2 regions (negative): 5272
  Total dataset size: 11008
  Fast mode: downsampling to 20k regions (10k per class)
Finished stage 1
Stage 2: Fine-tuning the model

[Attempt 0/2] seed=55
use organisim hg38; max sequence length is 6391
Ignoring 206 cistromes and 2 regulators
Epoch 0:  20%|████▍                 | 440/2202 [01:14<04:57,  5.91it/s, v_num=0]
Validati

In [7]:
# factor_importance_rank.csv: regulators ranked by their contribution to classifying EZH2 ChIP-seq peaks in hESC into classical and non-classical categories:
#   - factors: regulator names
#   - similarity: cosine similarity of regulator embeddings between function1 and function2 regions
#   - rank: importance ranking (lower similarity ranks higher)

factor_importance_rank = pd.read_csv("./output_find_context_specific_cofactor/results/factor_importance_rank.csv")
factor_importance_rank.head(n=25)

Unnamed: 0,factors,similarity,rank
0,suz12,0.745108,1
1,ezh2,0.820978,2
2,ezh1,0.875315,3
3,dnase,0.880462,4
4,h3k27ac,0.88169,5
5,fgfr1,0.882949,6
6,cbx3,0.889959,7
7,h3k9ac,0.890341,8
8,h3k27me1,0.890884,9
9,ssu72,0.891237,10
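
The table pairs each regulator's cosine similarity with a rank, and the lowest similarity ranks first. The following is a minimal sketch of that kind of ranking on toy vectors. It assumes one embedding vector per regulator per region set (for example, a mean over regions); the pooling used by `chrombert-tools` is not shown here, and `rank_by_similarity` is a hypothetical helper, not part of the tool.

```python
import numpy as np
import pandas as pd


def rank_by_similarity(emb_a, emb_b, names):
    """Rank regulators by row-wise cosine similarity between two embedding sets."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    # Row-wise cosine similarity: dot product divided by the product of norms.
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    df = pd.DataFrame({"factors": names, "similarity": cos})
    # Lower similarity means the embedding changes more between contexts.
    df = df.sort_values("similarity").reset_index(drop=True)
    df["rank"] = np.arange(1, len(df) + 1)
    return df


# Toy 2-dimensional embeddings for three made-up regulators.
toy_names = ["a", "b", "c"]
toy_func1 = [[1, 0], [1, 1], [0, 1]]
toy_func2 = [[1, 0], [1, 0], [1, 0]]

toy_rank = rank_by_similarity(toy_func1, toy_func2, toy_names)
print(toy_rank)
```

Regulator `c` points in an orthogonal direction in the two contexts, so it receives rank 1, while `a` is unchanged and ranks last.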


In [8]:
# regulator_cosine_similarity_on_function1_regions.csv: cosine similarity of regulator-regulator pairs on function1 regions:
#   - node1: regulator name
#   - node2: regulator name
#   - similarity: cosine similarity of the two regulators' embeddings on function1 regions


# regulator_cosine_similarity_on_function2_regions.csv: cosine similarity of regulator-regulator pairs on function2 regions:
#   - node1: regulator name
#   - node2: regulator name
#   - similarity: cosine similarity of the two regulators' embeddings on function2 regions

reg_cos_sim_func1 = pd.read_csv("./output_find_context_specific_cofactor/results/regulator_cosine_similarity_on_function1_regions.csv",index_col = 0)
reg_cos_sim_func2 = pd.read_csv("./output_find_context_specific_cofactor/results/regulator_cosine_similarity_on_function2_regions.csv",index_col = 0)



In [9]:
# Infer the dual-functional regulator subnetwork (if --dual-regulator was provided, it is saved in {odir}/results/dual_regulator_subnetwork.pdf)

# Thresholds: 95th percentile of all pairwise similarities in each context
thre_func1 = np.percentile(reg_cos_sim_func1.values.flatten(), 95)
thre_func2 = np.percentile(reg_cos_sim_func2.values.flatten(), 95)

# Both matrices must list the regulators in the same order
assert (reg_cos_sim_func1.index == reg_cos_sim_func2.index).all()
# Similarity of every regulator to ezh2 (the dual-functional regulator) in each context
df_cos_reg = pd.DataFrame(
    index=reg_cos_sim_func1.index,
    data={
        "function1": reg_cos_sim_func1.loc["ezh2", :],
        "function2": reg_cos_sim_func2.loc["ezh2", :],
    },
)
df_cos_reg["diff"] = df_cos_reg["function1"] - df_cos_reg["function2"]
# Keep regulators whose similarity to ezh2 differs by more than 0.1 between contexts
df_candidate = df_cos_reg[df_cos_reg["diff"].abs() > 0.1]
# Among the candidates, select those above each context's threshold
topN_pos = df_candidate.query("function1 > @thre_func1").index.values
topN_neg = df_candidate.query("function2 > @thre_func2").index.values
top_pairs = np.union1d(topN_pos, topN_neg)
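
To make the selection logic in the cell above easier to follow, the sketch below wraps the same steps in a function and runs it on a small toy matrix. The function name `select_cofactors`, the regulator names, and the similarity values are all made up for illustration. The toy call uses the 50th percentile because a 4x4 matrix is too small for a meaningful 95th percentile.

```python
import numpy as np
import pandas as pd


def select_cofactors(sim1, sim2, target, diff_cutoff=0.1, pct=95):
    """Return candidates and context-specific partners of `target`.

    sim1, sim2: square regulator-by-regulator similarity DataFrames
    that share the same index and column order.
    """
    thr1 = np.percentile(sim1.values.flatten(), pct)
    thr2 = np.percentile(sim2.values.flatten(), pct)
    df = pd.DataFrame({
        "function1": sim1.loc[target, :],
        "function2": sim2.loc[target, :],
    })
    df["diff"] = df["function1"] - df["function2"]
    # Candidates: similarity to the target shifts by more than diff_cutoff.
    cand = df[df["diff"].abs() > diff_cutoff]
    pos = cand.index[cand["function1"] > thr1].tolist()
    neg = cand.index[cand["function2"] > thr2].tolist()
    return cand, pos, neg


names = ["t", "x", "y", "z"]  # "t" plays the role of the dual-functional regulator
toy_sim1 = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.5], [0.9, 1.0, 0.1, 0.1], [0.2, 0.1, 1.0, 0.1], [0.5, 0.1, 0.1, 1.0]],
    index=names, columns=names,
)
toy_sim2 = pd.DataFrame(
    [[1.0, 0.3, 0.8, 0.5], [0.3, 1.0, 0.1, 0.1], [0.8, 0.1, 1.0, 0.1], [0.5, 0.1, 0.1, 1.0]],
    index=names, columns=names,
)

toy_cand, toy_pos, toy_neg = select_cofactors(toy_sim1, toy_sim2, "t", pct=50)
print(toy_cand)
print(toy_pos, toy_neg)
```

Here `x` is close to `t` only in the first context and `y` only in the second, so they end up in the function1 and function2 lists respectively, while `z` is dropped because its similarity to `t` does not change.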



In [10]:
df_candidate

Unnamed: 0,function1,function2,diff
adnp,0.252271,0.409365,-0.157094
aebp2,0.173025,0.340098,-0.167073
aff4,0.389687,0.520966,-0.131279
ahrr,0.203003,0.364030,-0.161028
alkbh3,0.214632,0.368570,-0.153939
...,...,...,...
zscan20,0.275069,0.401078,-0.126009
zscan23,0.130454,0.268913,-0.138459
zscan31,0.303008,0.427073,-0.124065
zscan5a,0.175975,0.313683,-0.137708


In [11]:
thre_func1, thre_func2

(0.5884455459537972, 0.5512158747683246)

In [12]:
# function1 subnetwork
df_candidate.loc[topN_pos]

Unnamed: 0,function1,function2,diff
bcor,0.67947,0.56663,0.11284
cbx2,0.64198,0.536068,0.105911
ezh1,0.686177,0.582428,0.10375
h2ak119ub,0.692308,0.567569,0.124739
jarid2,0.66136,0.535041,0.126319
kdm2b,0.650839,0.519836,0.131003
pcgf1,0.651266,0.514883,0.136383
rybp,0.643115,0.540906,0.102209


In [13]:
# function2 subnetwork
df_candidate.loc[topN_neg]

Unnamed: 0,function1,function2,diff
bcor,0.67947,0.56663,0.11284
brca1,0.402579,0.596531,-0.193951
brd4,0.436702,0.590984,-0.154282
cdk8,0.435299,0.557559,-0.12226
cdk9,0.417467,0.556912,-0.139445
crebbp,0.435355,0.581948,-0.146593
e2f1,0.419981,0.586126,-0.166145
e2f4,0.397484,0.580344,-0.182861
ep300,0.474239,0.636712,-0.162474
ezh1,0.686177,0.582428,0.10375
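
In the listing above, `bcor` and `ezh1` pass the function2 threshold even though their `diff` is positive, so they appear in both subnetworks. If a stricter, context-exclusive list is wanted, one option is to also require the sign of `diff` to match the context. This is an optional post-filter sketched here on a few rows copied from the tables above, not something the tool is stated to do.

```python
import pandas as pd

# A few (function1, function2) similarity values copied from the tables above.
rows = {
    "bcor": (0.67947, 0.56663),
    "ezh1": (0.686177, 0.582428),
    "jarid2": (0.66136, 0.535041),
    "brca1": (0.402579, 0.596531),
    "ep300": (0.474239, 0.636712),
}
subset = pd.DataFrame.from_dict(rows, orient="index", columns=["function1", "function2"])
subset["diff"] = subset["function1"] - subset["function2"]

# Thresholds as printed by the notebook cell that displays thre_func1, thre_func2.
thre_func1, thre_func2 = 0.5884455459537972, 0.5512158747683246

# Require both a high similarity in the context and a shift toward that context.
func1_only = subset[(subset["function1"] > thre_func1) & (subset["diff"] > 0)].index.tolist()
func2_only = subset[(subset["function2"] > thre_func2) & (subset["diff"] < 0)].index.tolist()
print(func1_only)
print(func2_only)
```

With this extra condition, `bcor` and `ezh1` remain only in the function1 list, and the function2 list keeps regulators such as `brca1` and `ep300` whose similarity to EZH2 is higher on function2 regions.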
