# Identify context-specific cofactors in different regions.

The ``find_context_specific_cofactor`` command trains a classifier to distinguish two sets of genomic regions, identifies regulatory factors that contribute most to the classification, and prioritizes context-specific cofactors for dual-functional target factors across the two contexts.

**Note**: The examples below show only direct command usage. To run the commands inside a Singularity container, see the [`singularity_use.ipynb`](singularity_use.ipynb) tutorial for detailed instructions on using `singularity exec` with `chrombert-tools`.

## Example:
EZH2, the catalytic subunit of Polycomb Repressive Complex 2 (PRC2), operates in two distinct modes: 
a classical H3K27me3-dependent repressive role and a non-classical H3K27me3-independent role.

We fine-tuned ChromBERT to classify EZH2 ChIP-seq peaks in human embryonic stem cells 
into classical and non-classical categories.

Using this fine-tuned model, we identify regulators that most strongly distinguish the two region sets and visualize context-specific EZH2 cofactor subnetworks.


In [17]:
import pandas as pd
import numpy as np
import os
workdir="/mnt/Storage2/home/chenqianqian/projects/chrombert_tools/2.test/pull/ChromBERT-tools/examples/cli" # your workdir
os.chdir(workdir)

In [5]:
os.environ["CUDA_VISIBLE_DEVICES"]='0'

In [6]:
!chrombert-tools -h

Usage: chrombert-tools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  embed_cell_cistrome             Extract cell-specific cistrome...
  embed_cell_gene                 Extract cell-specific gene embeddings
  embed_cell_region               Extract cell-specific region embeddings
  embed_cell_regulator            Extract cell-specific regulator...
  embed_cistrome                  Extract general cistrome embeddings on...
  embed_gene                      Extract general gene embeddings
  embed_region                    Extract general region embeddings
  embed_regulator                 Extract general regulator embeddings on...
  find_context_specific_cofactor  Find context-specific cofactors in...
  find_driver_in_transition       Find driver factors in cell

In [7]:
!chrombert-tools find_context_specific_cofactor -h

Usage: chrombert-tools find_context_specific_cofactor [OPTIONS]

  Find context-specific cofactors in different regions.

Options:
  --function1-bed TEXT            Different genomic regions for function1. Use
                                  ';' to separate multiple BED files.
                                  [required]
  --function1-mode [and|or]       Logic mode for function1 regions: 'and'
                                  requires all conditions; 'or' requires any
                                  condition.  [default: and]
  --function2-bed TEXT            Different genomic regions for function2. Use
                                  ';' to separate multiple BED files.
                                  [required]
  --function2-mode [and|or]       Logic mode for function2 regions: 'and'
                                  requires all conditions; 'or' requires any
                                  condition.  [default: and]
  --dual-regulator TEXT           Dual-functional regulat
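
The `--function1-mode` and `--function2-mode` options shown above decide how multiple BED files are combined. One plausible reading of the help text, sketched on toy sets of region identifiers, is that `and` keeps regions present in every file and `or` keeps regions present in any file. The helper name `combine_region_sets` and the `(chromosome, start)` bin representation are assumptions made for illustration; they are not taken from the `chrombert-tools` source.

```python
from functools import reduce

def combine_region_sets(region_sets, mode="and"):
    """Combine several sets of region identifiers.

    Hypothetical helper: 'and' keeps regions found in every set,
    'or' keeps regions found in at least one set.
    """
    if not region_sets:
        raise ValueError("at least one region set is required")
    if mode == "and":
        return reduce(set.intersection, region_sets)
    if mode == "or":
        return reduce(set.union, region_sets)
    raise ValueError(f"unknown mode: {mode!r}")

# Toy 1 kb bins identified by (chromosome, start); values are made up.
ezh2_bins = {("chr1", 1000), ("chr1", 2000), ("chr2", 5000)}
h3k27me3_bins = {("chr1", 2000), ("chr2", 5000), ("chr3", 0)}

# 'and': bins covered by both EZH2 and H3K27me3
classical = combine_region_sets([ezh2_bins, h3k27me3_bins], mode="and")
# 'or': bins covered by either mark
any_mark = combine_region_sets([ezh2_bins, h3k27me3_bins], mode="or")
```

Under this reading, passing an EZH2 peak file together with an H3K27me3 peak file in `and` mode would select EZH2 regions that also carry H3K27me3.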

## Run

In [8]:
!mkdir -p ./tmp

In [9]:
# takes approximately 20-40 minutes to run
!chrombert-tools find_context_specific_cofactor \
    --function1-bed "../data/hESC_GSM1003524_EZH2.bed;../data/hESC_GSM1498900_H3K27me3.bed" \
    --function2-bed "../data/hESC_GSM1003524_EZH2.bed" \
    --dual-regulator "EZH2" \
    --ignore-regulator "H3K27me3;H3K27me3/H3K4me3" \
    --odir "./output_find_context_specific_cofactor" \
    --genome "hg38" \
    --resolution "1kb"  2> "./tmp/find_context_specific_cofactor.log" # redirect stderr to log file

Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 1, matched in ChromBERT: 1, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 2, matched in ChromBERT: 2, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Stage 1: Praparing the dataset
  Function1 regions (positive): 5736
  Function2 regions (negative): 5272
  Total dataset size: 11008
  Fast mode: downsampling to 20k regions (10k per class)
Finished stage 1
Stage 2: Fine-tuning the model

[Attempt 0/2] seed=55
use organisim hg38; max sequence length is 6391
Ignoring 206 cistromes and 2 regulators
Epoch 0:  20%|████▍                 | 440/2202 [01:08<04:33,  6.44it/s, v_num=0]
Validati

In [18]:
# factor_importance_rank.csv: regulators ranked by their contribution to classifying EZH2 ChIP-seq peaks in hESCs into classical and non-classical categories:
#   - factors: regulator names
#   - similarity: cosine similarity of regulator embeddings between classical and non-classical regions
#   - rank: importance ranking (rank 1 has the lowest similarity)

factor_importance_rank = pd.read_csv("./output_find_context_specific_cofactor/results/factor_importance_rank.csv")
factor_importance_rank.head(n=25)

Unnamed: 0,factors,similarity,rank
0,suz12,0.756165,1
1,ezh2,0.828705,2
2,ezh1,0.873341,3
3,h3k27ac,0.880482,4
4,dnase,0.883265,5
5,h3k9ac,0.884529,6
6,fgfr1,0.885017,7
7,supt5h,0.888846,8
8,sin3a,0.889702,9
9,cbx3,0.889848,10
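
In the table above, the regulator with the lowest cosine similarity between the two region sets receives rank 1. The sketch below illustrates that ranking idea on toy data. The mean pooling over regions, the `(n_regions, n_regulators, dim)` array layout, and the helper name `rank_by_embedding_shift` are assumptions for illustration, not the confirmed internals of `chrombert-tools`.

```python
import numpy as np

def rank_by_embedding_shift(emb1, emb2, names):
    """Rank regulators by cosine similarity of their mean embeddings.

    emb1, emb2: arrays of shape (n_regions, n_regulators, dim) (assumed layout).
    Returns a list of (name, similarity, rank), lowest similarity first.
    """
    mean1 = emb1.mean(axis=0)  # (n_regulators, dim)
    mean2 = emb2.mean(axis=0)
    dot = (mean1 * mean2).sum(axis=1)
    norms = np.linalg.norm(mean1, axis=1) * np.linalg.norm(mean2, axis=1)
    sim = dot / norms
    order = np.argsort(sim)  # ascending: largest shift first
    return [(names[i], float(sim[i]), rank)
            for rank, i in enumerate(order, start=1)]

# Toy data: three made-up regulators, five regions, 8-dimensional embeddings.
rng = np.random.default_rng(0)
names = ["reg_a", "reg_b", "reg_c"]
base = rng.normal(size=(3, 8))
emb1 = np.repeat(base[None], 5, axis=0)
emb2 = emb1.copy()
emb2[:, 0, :] = -emb1[:, 0, :]  # reg_a points the opposite way in context 2

ranking = rank_by_embedding_shift(emb1, emb2, names)
```

Here `reg_a` is ranked first because its embedding differs most between the two toy contexts, while the unchanged regulators have a similarity of about 1.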


In [11]:
# regulator_cosine_similarity_on_function1_regions.csv: cosine similarity of regulator-regulator pairs on function1 regions:
#   - node1: regulator name
#   - node2: regulator name
#   - similarity: cosine similarity between the two regulators' embeddings, computed on function1 regions

# regulator_cosine_similarity_on_function2_regions.csv: cosine similarity of regulator-regulator pairs on function2 regions:
#   - node1: regulator name
#   - node2: regulator name
#   - similarity: cosine similarity between the two regulators' embeddings, computed on function2 regions

reg_cos_sim_func1 = pd.read_csv("./output_find_context_specific_cofactor/results/regulator_cosine_similarity_on_function1_regions.csv",index_col = 0)
reg_cos_sim_func2 = pd.read_csv("./output_find_context_specific_cofactor/results/regulator_cosine_similarity_on_function2_regions.csv",index_col = 0)
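
For readers unfamiliar with this kind of matrix, the following minimal sketch shows how a regulator-by-regulator cosine similarity matrix can be computed from one embedding vector per regulator. The toy vectors and the helper name `pairwise_cosine` are illustrative assumptions; how `chrombert-tools` pools embeddings over regions before this step is not shown here.

```python
import numpy as np

def pairwise_cosine(mat):
    """Cosine similarity between all rows of a (n_regulators, dim) matrix."""
    unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit-length rows
    return unit @ unit.T

# Three made-up regulator vectors in two dimensions.
toy = np.array([[1.0, 0.0],
                [0.0, 2.0],
                [3.0, 3.0]])
sim = pairwise_cosine(toy)
```

A matrix built this way is symmetric with ones on the diagonal, so any statistic taken over all of its entries counts each pair twice and includes the self-similarities.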



In [12]:
# Infer the dual-functional regulator subnetwork (if --dual-regulator was provided, it is saved in {odir}/results/dual_regulator_subnetwork.pdf)

# Per-context thresholds: 95th percentile of all regulator-regulator similarities
thre_func1 = np.percentile(reg_cos_sim_func1.values.flatten(), 95)
thre_func2 = np.percentile(reg_cos_sim_func2.values.flatten(), 95)

# Both matrices must list regulators in the same order
assert (reg_cos_sim_func1.index == reg_cos_sim_func2.index).all()
df_cos_reg = pd.DataFrame(
    index=reg_cos_sim_func1.index,
    data={
        "function1": reg_cos_sim_func1.loc['ezh2', :], # ezh2 (dual-functional regulator)
        "function2": reg_cos_sim_func2.loc['ezh2', :], # ezh2 (dual-functional regulator)
    },
)
# Keep regulators whose similarity to EZH2 differs by more than 0.1 between contexts,
# then select those that exceed the threshold in function1 or in function2
df_cos_reg["diff"] = df_cos_reg["function1"] - df_cos_reg["function2"]
df_candidate = df_cos_reg[df_cos_reg["diff"].abs() > 0.1]
topN_pos = df_candidate.query("function1 > @thre_func1").index.values
topN_neg = df_candidate.query("function2 > @thre_func2").index.values
top_pairs = np.union1d(topN_pos, topN_neg)
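
To inspect a subnetwork as a table rather than a plot, one option is to list the pairs among the selected regulators whose similarity exceeds the context threshold. The sketch below does this on a toy matrix. The helper name `subnetwork_edges` and the toy values are assumptions for illustration, and this is not necessarily how the saved PDF is produced.

```python
import pandas as pd

def subnetwork_edges(sim, nodes, threshold):
    """List pairs among `nodes` whose similarity in `sim` exceeds `threshold`.

    sim: square DataFrame indexed and labelled by regulator name.
    Each unordered pair is reported once.
    """
    nodes = list(nodes)
    edges = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            s = sim.loc[a, b]
            if s > threshold:
                edges.append((a, b, float(s)))
    return pd.DataFrame(edges, columns=["node1", "node2", "similarity"])

# Toy similarity matrix with made-up cofactor names and values.
toy_names = ["ezh2", "cof_a", "cof_b"]
toy_sim = pd.DataFrame(
    [[1.0, 0.7, 0.2],
     [0.7, 1.0, 0.65],
     [0.2, 0.65, 1.0]],
    index=toy_names, columns=toy_names,
)
edges = subnetwork_edges(toy_sim, toy_names, threshold=0.6)
```

With the objects from this notebook, a call such as `subnetwork_edges(reg_cos_sim_func1, ["ezh2", *topN_pos], thre_func1)` would give the function1 edges under the same assumptions.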



In [13]:
df_candidate

Unnamed: 0,function1,function2,diff
adnp,0.264486,0.408844,-0.144358
aebp2,0.179473,0.333748,-0.154275
aff4,0.394589,0.525682,-0.131093
ago2,0.241129,0.347329,-0.106200
ahrr,0.202957,0.367545,-0.164588
...,...,...,...
zscan20,0.272939,0.386961,-0.114021
zscan23,0.137165,0.259789,-0.122624
zscan31,0.327292,0.438583,-0.111292
zscan5a,0.179316,0.294775,-0.115459


In [14]:
thre_func1, thre_func2

(0.5941971146274339, 0.5555836736785533)

In [15]:
# function1 subnetwork: candidates above the function1 threshold
df_candidate.loc[topN_pos]

Unnamed: 0,function1,function2,diff
h2ak119ub,0.692093,0.560984,0.131109
jarid2,0.657833,0.534949,0.122885
kdm2b,0.655377,0.537247,0.11813
pcgf1,0.653553,0.519462,0.134091
rybp,0.642865,0.541592,0.101273


In [16]:
# function2 subnetwork: candidates above the function2 threshold
df_candidate.loc[topN_neg]

Unnamed: 0,function1,function2,diff
brca1,0.409142,0.605985,-0.196844
brd4,0.434379,0.59467,-0.160291
cdk8,0.436817,0.566043,-0.129226
cdk9,0.410774,0.566285,-0.155511
chd1,0.509027,0.610589,-0.101561
crebbp,0.436169,0.583665,-0.147496
e2f1,0.422902,0.590371,-0.167469
e2f4,0.396903,0.58145,-0.184547
ep300,0.469552,0.64115,-0.171598
fosl1,0.384548,0.566791,-0.182243
