# Example of uncovering cellular dynamics using ChromBERT: Transcriptome
To fully uncover cellular dynamics, it is essential to combine analyses of chromatin accessibility and the transcriptome. In this tutorial, we demonstrate how to use ChromBERT to uncover cellular dynamics in the transcriptome during a specific transdifferentiation process (fibroblast to myoblast). To follow this tutorial, you need to have the checkpoint (ckpt) files downloaded (see the README for details).


In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'
import chrombert
import pandas as pd
import numpy as np
from torchinfo import summary
import subprocess
import torch
import lightning.pytorch as pl 
base_dir =  os.path.expanduser("~/.cache/chrombert/data") ### to_path_chrombert/data

## Preprocessing transcriptome dataset
You need to prepare transcriptome data, including the TSS and TPM for each gene. This tutorial shows how to transform the data into the format required by ChromBERT.
To fine-tune on transcriptome data, the label for each gene is the change in log1p-transformed TPM during transdifferentiation, log1p(TPM_myoblast) - log1p(TPM_fibroblast), referred to here as the log1p-transformed fold change. You also need the ChromBERT 1 kb region bin that contains the TSS of each gene.
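
As a minimal, dependency-free sketch of these two transformations, the snippet below computes the label and the 1 kb bin for one gene. The helper names `log1p_fold_change` and `tss_to_bin` are illustrative and not part of ChromBERT; the numbers are those of ENSG00000121410 in the expression tables of this tutorial.

```python
import math

def log1p_fold_change(tpm_start, tpm_end):
    """log1p(TPM in the end state) minus log1p(TPM in the start state)."""
    return math.log1p(tpm_end) - math.log1p(tpm_start)

def tss_to_bin(tss, bin_size=1000):
    """Return (start, end) of the bin of width bin_size that contains the TSS."""
    start = tss // bin_size * bin_size
    return start, start + bin_size

# ENSG00000121410: fibroblast TPM 12.774133, myoblast TPM 22.236894, TSS 58353492
label = log1p_fold_change(12.774133, 22.236894)
region = tss_to_bin(58353492)
print(round(label, 4), region)
```

A gene with zero TPM in both cell types gets a label of exactly 0, which is why many rows in the prepared table have `label` equal to 0.0.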

In [2]:
gep_dir = f'{base_dir}/demo/transdifferentiation/transcriptome'
fibroblast_exp = pd.read_csv(f'{gep_dir}/fibroblast_expression.csv')
myoblast_exp = pd.read_csv(f'{gep_dir}/myoblast_expression.csv')
myoblast_exp.head(), fibroblast_exp.head()

(   chrom       tss          gene_id        tpm
 0  chr19  58353492  ENSG00000121410  22.236894
 1  chr19  58347718  ENSG00000268895   9.317134
 2  chr10  50885675  ENSG00000148584   0.000000
 3  chr12   9116229  ENSG00000175899   0.993828
 4  chr12   9065163  ENSG00000245105   0.124228,
    chrom       tss          gene_id        tpm
 0  chr19  58353492  ENSG00000121410  12.774133
 1  chr19  58347718  ENSG00000268895   2.939181
 2  chr10  50885675  ENSG00000148584   0.000000
 3  chr12   9116229  ENSG00000175899   0.226091
 4  chr12   9065163  ENSG00000245105   0.226091)

#### Prepare the log1p-transformed fold change data

In [3]:
from chrombert.scripts.chrombert_make_dataset import get_regions
merge_exp = pd.merge(fibroblast_exp,myoblast_exp,left_on=['chrom','tss','gene_id'],right_on=['chrom','tss','gene_id'],suffixes=['_fibroblast','_myoblast'])
merge_exp['fold_change']= np.log1p(merge_exp['tpm_myoblast']) - np.log1p(merge_exp['tpm_fibroblast'])
merge_exp['start'] = merge_exp['tss']//1000 * 1000
merge_exp['end'] = (merge_exp['tss']//1000 + 1) * 1000
foldchange_exp = merge_exp [['chrom','start','end','tss','gene_id','fold_change']]

chrom_regions = get_regions(base_dir,genome='hg38',high_resolution=False) # 1kb
chrom_regions
chrom_regions_df = pd.read_csv(chrom_regions,sep='\t',names=['chrom','start','end','build_region_index'])
chrom_regions_df
merge_region = pd.merge(foldchange_exp,chrom_regions_df,left_on=['chrom','start','end'],right_on=['chrom','start','end'],how='inner')[['chrom','start','end','build_region_index','fold_change','tss','gene_id']]
gep_df = merge_region.rename(columns={'fold_change':'label'})
gep_df.to_csv(f'{gep_dir}/fibroblast_to_myoblast_expression_changes.csv',index=False)
gep_df.head()

Unnamed: 0,chrom,start,end,build_region_index,label,tss,gene_id
0,chr19,58353000,58354000,917950,0.522949,58353492,ENSG00000121410
1,chr19,58347000,58348000,917944,0.962833,58347718,ENSG00000268895
2,chr10,50885000,50886000,221904,0.0,50885675,ENSG00000148584
3,chr12,9116000,9117000,393001,0.486225,9116229,ENSG00000175899
4,chr12,9065000,9066000,392961,-0.086734,9065163,ENSG00000245105
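
The inner merge above keeps a gene only if its TSS bin is present in the ChromBERT region file, so the resulting table can contain fewer genes than the expression tables. A toy sketch of that behavior in plain Python follows; the gene names, coordinates, and region indices are made up for illustration.

```python
# Toy fold-change records keyed by their 1 kb bin (chrom, start, end).
fold_changes = [
    {"gene_id": "geneA", "bin": ("chr1", 1000, 2000), "label": 0.5},
    {"gene_id": "geneB", "bin": ("chr1", 5000, 6000), "label": -0.2},
    {"gene_id": "geneC", "bin": ("chr2", 3000, 4000), "label": 1.3},
]

# Toy region table: bin -> region index (hypothetical indices).
region_index = {
    ("chr1", 1000, 2000): 0,
    ("chr2", 3000, 4000): 7,
}

# Inner join: keep a gene only if its bin has a region index.
merged = [
    {**rec, "build_region_index": region_index[rec["bin"]]}
    for rec in fold_changes
    if rec["bin"] in region_index
]
kept = [rec["gene_id"] for rec in merged]
print(kept)
```

Here `geneB` is dropped because its bin has no entry in the region table.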


#### Preprocess the transcriptome fine-tuning data
Split the data into training, test, and validation sets with an 8:1:1 ratio, then downsample each set so that the fine-tuning process can be tested quickly.
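
The idea of an 8:1:1 split can be sketched without pandas as follows. This is only an illustration: `split_811` is a hypothetical helper, and although it reuses the seed value 55, it does not select the same rows as `DataFrame.sample`.

```python
import random

def split_811(items, seed=55):
    """Shuffle items and split them into train/test/valid with an 8:1:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_test = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    valid = shuffled[n_train + n_test:]
    return train, test, valid

train, test, valid = split_811(range(1000))
print(len(train), len(test), len(valid))
```

The three subsets are disjoint and together cover every item, which is the property the pandas cell achieves by dropping the training and test indices before taking the validation set.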

In [4]:
train_data = gep_df.sample(frac=0.8,random_state=55)
test_data = gep_df.drop(train_data.index).sample(frac=0.5,random_state=55)
valid_data = gep_df.drop(train_data.index).drop(test_data.index)
train_data_sample = train_data.sample(n=80,random_state=55)
test_data_sample = test_data.sample(n=20,random_state=55)
valid_data_sample = valid_data.sample(n=20,random_state=55)
train_data_sample.to_csv(f'{gep_dir}/train_sample.csv',index=False)
test_data_sample.to_csv(f'{gep_dir}/test_sample.csv',index=False)
valid_data_sample.to_csv(f'{gep_dir}/valid_sample.csv',index=False)
train_data_sample.head()

Unnamed: 0,chrom,start,end,build_region_index,label,tss,gene_id
4496,chr6,138795000,138796000,1719247,0.0,138795911,ENSG00000203734
13237,chr8,96261000,96262000,1933979,-0.351699,96261902,ENSG00000156471
17079,chr22,29555000,29556000,1186796,-0.565257,29555216,ENSG00000100296
4118,chr8,1622000,1623000,1866390,0.0,1622417,ENSG00000253267
13482,chr7,66682000,66683000,1792879,0.402664,66682164,ENSG00000154710


#### Prepare the upregulated genes and unchanged genes for downstream analysis.

In [5]:
up_data = gep_df[gep_df['label']>1]
nochange_data = gep_df[(gep_df['label']>-0.5) & (gep_df['label']<0.5)]

up_data_sample = up_data.sample(n=100,random_state=55)
nochange_data_sample = nochange_data.sample(n=100,random_state=55)


up_data_sample.to_csv(f'{gep_dir}/up_data_sample.csv',index=False)
nochange_data_sample.to_csv(f'{gep_dir}/nochange_data_sample.csv',index=False)
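
To make the thresholds above easier to interpret: a label `d` corresponds to a ratio (1 + TPM_myoblast) / (1 + TPM_fibroblast) of exp(d), so `label > 1` means a ratio above roughly 2.72. The sketch below converts the thresholds and applies them to a few labels; `classify` is an illustrative helper, not a ChromBERT function.

```python
import math

def classify(label, up=1.0, flat=0.5):
    """Apply the thresholds used above: > 1 is up, (-0.5, 0.5) is unchanged."""
    if label > up:
        return "up"
    if -flat < label < flat:
        return "unchanged"
    return "other"

# Ratios of (1 + TPM) that correspond to the label thresholds.
ratio_up = math.exp(1.0)
ratio_flat_low, ratio_flat_high = math.exp(-0.5), math.exp(0.5)
print(round(ratio_up, 3), round(ratio_flat_low, 3), round(ratio_flat_high, 3))

classes = [classify(x) for x in (1.3, 0.486225, -0.086734, 0.962833)]
print(classes)
```

Genes with labels between 0.5 and 1, or at or below -0.5, fall into neither set and are not used in the downstream comparison.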

## Fine-tune

In this section, we provide a tutorial for fine-tuning ChromBERT to predict genome-wide changes in the transcriptome.
Because the influence of regions adjacent to transcription start sites (TSSs) on gene expression is uncertain, the fine-tuning setup for transcriptome changes includes a few key modifications:
- Dataset config preparation: we use the `multi_flank_window` preset dataset config and set the `flank_window` parameter to 4.
- Model instantiation: we use the `gep` preset model config and set the `gep_flank_window` parameter to 4.

Let's get started!

#### Set dataset config and data_module

We use the `multi_flank_window` preset dataset config and set the `flank_window` parameter to 4.
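
As a rough illustration of what a flank window covers, the sketch below lists the 1 kb bins around a TSS bin. It assumes that `flank_window=4` means the TSS bin plus four bins on each side (nine bins in total); this interpretation is an assumption and should be checked against the ChromBERT documentation. `flank_bins` is a hypothetical helper that works on genomic coordinates, not on `build_region_index`.

```python
def flank_bins(start, flank_window, bin_size=1000):
    """List (start, end) bins from flank_window bins upstream to flank_window bins downstream."""
    return [
        (start + k * bin_size, start + (k + 1) * bin_size)
        for k in range(-flank_window, flank_window + 1)
    ]

# TSS bin of ENSG00000121410 from the prepared table
bins = flank_bins(58353000, flank_window=4)
print(len(bins), bins[0], bins[-1])
```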

In [6]:
dataset_config = chrombert.get_preset_dataset_config("multi_flank_window",supervised_file = None, batch_size = 4, num_workers = 4,flank_window=4)
dataset_config

update path: hdf5_file = hg38_6k_1kb.hdf5
update path: meta_file = config/hg38_6k_meta.json


DatasetConfig({'hdf5_file': '/home/chenqianqian/.cache/chrombert/data/hg38_6k_1kb.hdf5', 'supervised_file': None, 'kind': 'MultiFlankwindowDataset', 'meta_file': '/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.json', 'ignore': False, 'ignore_object': None, 'batch_size': 4, 'num_workers': 4, 'shuffle': False, 'perturbation': False, 'perturbation_object': None, 'perturbation_value': 0, 'prompt_kind': None, 'prompt_regulator': None, 'prompt_regulator_cache_file': None, 'prompt_celltype': None, 'prompt_celltype_cache_file': None, 'fasta_file': None, 'flank_window': 4})

In [7]:
gep_dir = f'{base_dir}/demo/transdifferentiation/transcriptome'

In [8]:
data_module = chrombert.LitChromBERTFTDataModule(
    config = dataset_config, 
    train_params = {'supervised_file': f'{gep_dir}/train_sample.csv'}, 
    val_params = {'supervised_file':f'{gep_dir}/valid_sample.csv'}, 
    test_params = {'supervised_file':f'{gep_dir}/test_sample.csv'}
)
data_module.setup()

#### Set model config and model loading
We use the `gep` preset model config and set the `gep_flank_window` parameter to 4.

In [9]:
model_config = chrombert.get_preset_model_config("gep",gep_flank_window=4)
model_config

update path: mtx_mask = config/hg38_6k_mask_matrix.tsv
update path: pretrain_ckpt = checkpoint/hg38_6k_1kb_pretrain.ckpt


ChromBERTFTConfig(genome='hg38', task='gep', dim_output=1, mtx_mask='/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_mask_matrix.tsv', dropout=0.1, pretrain_ckpt='/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt', finetune_ckpt=None, ignore=False, ignore_index=(None, None), gep_flank_window=4, gep_parallel_embedding=False, gep_gradient_checkpoint=False, gep_zero_inflation=True, prompt_kind='cistrome', prompt_dim_external=512, dnabert2_ckpt=None)

In [10]:
model = model_config.init_model()
model.freeze_pretrain(2) ### freeze 6 ChromBERT transformer blocks, leaving 2 trainable
summary(model)

use organisim hg38; max sequence length including cls is 6392


Layer (type:depth-idx)                                       Param #
ChromBERTGEP                                                 --
├─PoolFlankWindow: 1-1                                       --
│    └─ChromBERT: 2-1                                        --
│    │    └─BERTEmbedding: 3-1                               (4,916,736)
│    │    └─ModuleList: 3-2                                  51,978,240
├─GepHeader: 1-2                                             --
│    └─CistromeEmbeddingManager: 2-2                         --
│    └─Conv2d: 2-3                                           769
│    └─ReLU: 2-4                                             --
│    └─ResidualBlock: 2-5                                    --
│    │    └─Linear: 3-3                                      1,090,560
│    │    └─Linear: 3-4                                      1,049,600
│    │    └─LayerNorm: 3-5                                   2,048
│    │    └─Linear: 3-6                                      1,0

#### Set train config and finetune

We fine-tune the model using PyTorch Lightning. A simple training configuration holds the parameters, and tuning is performed on a limited dataset to save time.
Note: The tuning process is random, so results may vary. To achieve the best results, consider increasing the number of epochs and the size of the dataset used.


In [11]:
train_config = chrombert.finetune.TrainConfig(kind='zero_inflation',        
loss='zero_inflation',
max_epochs=2,
accumulate_grad_batches=2,
val_check_interval=2,
limit_val_batches=10,
tag='gep')
pl_module = train_config.init_pl_module(model) # wrap model with PyTorch Lightning module
type(pl_module)



chrombert.finetune.train.pl_module.ZeroInflationPLModule

Then we start tuning!   
The trainer will save logs in a format compatible with TensorBoard, and the checkpoint with the lowest validation loss will be saved during the process.
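
The settings used in this tutorial imply a small number of optimizer steps and validation runs. The arithmetic below is a sketch that assumes a single GPU, no dropped batches, and Lightning's convention that an integer `val_check_interval` counts training batches; the variable names are chosen for illustration.

```python
import math

n_train = 80                 # rows in train_sample.csv
batch_size = 4               # from the dataset config
accumulate_grad_batches = 2  # from the train config
val_check_interval = 2       # validate every 2 training batches
max_epochs = 2

batches_per_epoch = math.ceil(n_train / batch_size)
effective_batch_size = batch_size * accumulate_grad_batches
optimizer_steps_per_epoch = math.ceil(batches_per_epoch / accumulate_grad_batches)
validations_per_epoch = batches_per_epoch // val_check_interval

total_steps = optimizer_steps_per_epoch * max_epochs
total_validations = validations_per_epoch * max_epochs
print(batches_per_epoch, effective_batch_size, total_steps, total_validations)
```

With so few optimizer steps, the resulting checkpoint is only useful for checking that the pipeline runs end to end.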



In [12]:
callback_ckpt = pl.callbacks.ModelCheckpoint(monitor = f"{train_config.tag}_validation/{train_config.loss}", mode = "min")
trainer = pl.Trainer(
    max_epochs=train_config.max_epochs,
    log_every_n_steps=1, 
    limit_val_batches = train_config.limit_val_batches,
    val_check_interval = train_config.val_check_interval,
    accelerator="gpu", 
    accumulate_grad_batches= train_config.accumulate_grad_batches, 
    fast_dev_run=False, 
    precision="bf16-mixed",
    strategy="auto",
    callbacks=[
        pl.callbacks.LearningRateMonitor(),
        callback_ckpt,   
    ],
    logger=pl.loggers.TensorBoardLogger("lightning_logs", name='gep'))
trainer.fit(pl_module,data_module)

Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Loading `train_dataloader` to estimate number of stepping batches.

  | Name  | Type         | Params | Mode 
-----------------------------------------------
0 | model | ChromBERTGEP | 62.8 M | train
-----------------------------------------------
18.9 M    Trainable params
43.9 M    Non-trainable params
62.8 M    Total params
251.022   Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/home/chenqianqian/.conda/envs/demo/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:475: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.


Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=2` reached.


## Load fine-tuned checkpoint and evaluation
ChromBERT is now fine-tuned! You can access the tuned model directly using `pl_module.model`. However, because of specific settings in flash-attention, the dropout probability of that model cannot be changed, which may introduce some randomness in its output.

For consistent results, we recommend saving the checkpoint and loading it into a newly initialized model with `dropout=0`, so that you work with the tuned weights and deterministic output.

#### Load the checkpoint fine-tuned on limited data for predicting genome-wide changes in the transcriptome
Due to time constraints in this tutorial, we use only the downsampled test data for evaluation.

In [13]:
import glob
gep_ft_ckpt = os.path.abspath(glob.glob('./lightning_logs/gep/version_0/checkpoints/*.ckpt')[0])
print(gep_ft_ckpt)

model_config = chrombert.get_preset_model_config("gep",gep_flank_window=4, dropout=0)
ft_model = model_config.init_model(finetune_ckpt = gep_ft_ckpt)
summary(ft_model)

/shared/chenqianqian/data_copy1/chenqianqian/finetune/test_model/ChromBERT_public/examples/tutorials/lightning_logs/gep/version_0/checkpoints/epoch=1-step=14.ckpt
update path: mtx_mask = config/hg38_6k_mask_matrix.tsv
update path: pretrain_ckpt = checkpoint/hg38_6k_1kb_pretrain.ckpt
use organisim hg38; max sequence length including cls is 6392
Loading checkpoint from /shared/chenqianqian/data_copy1/chenqianqian/finetune/test_model/ChromBERT_public/examples/tutorials/lightning_logs/gep/version_0/checkpoints/epoch=1-step=14.ckpt
Loading from pl module, remove prefix 'model.'
Loaded 112/112 parameters


Layer (type:depth-idx)                                       Param #
ChromBERTGEP                                                 --
├─PoolFlankWindow: 1-1                                       --
│    └─ChromBERT: 2-1                                        --
│    │    └─BERTEmbedding: 3-1                               4,916,736
│    │    └─ModuleList: 3-2                                  51,978,240
├─GepHeader: 1-2                                             --
│    └─CistromeEmbeddingManager: 2-2                         --
│    └─Conv2d: 2-3                                           769
│    └─ReLU: 2-4                                             --
│    └─ResidualBlock: 2-5                                    --
│    │    └─Linear: 3-3                                      1,090,560
│    │    └─Linear: 3-4                                      1,049,600
│    │    └─LayerNorm: 3-5                                   2,048
│    │    └─Linear: 3-6                                      1,090

#### Evaluation

In [14]:
from tqdm import tqdm
import torchmetrics as tm
dl = data_module.test_dataloader()
ft_model.cuda()
with torch.no_grad():
    y_preds = []
    y_labels = []
    for idx, batch in enumerate(tqdm(dl, total=len(dl))):
        for k in batch:
            if isinstance(batch[k], torch.Tensor):
                batch[k] = batch[k].cuda()
        y_pred = ft_model(batch)[1].cpu()
        y_label = batch['label'].cpu()
        y_preds.append(y_pred)
        y_labels.append(y_label)
    y_preds = torch.cat(y_preds)
    y_labels = torch.cat(y_labels)
predicts = y_preds.view(-1)
labels = y_labels.view(-1)
metrics_pearsonr = tm.PearsonCorrCoef()
metrics_spearmanr = tm.SpearmanCorrCoef()
metrics_mse = tm.MeanSquaredError()
metrics_mae = tm.MeanAbsoluteError()
metrics_r2 = tm.R2Score()
score_pearsonr = metrics_pearsonr(predicts, labels)
score_spearmanr = metrics_spearmanr(predicts, labels)
score_mse = metrics_mse(predicts, labels)
score_mae = metrics_mae(predicts, labels)
score_r2 = metrics_r2(predicts, labels)
scores = {
    "pearsonr": score_pearsonr,
    "spearmanr": score_spearmanr,
    "mse": score_mse,
    "mae": score_mae,
    "r2": score_r2,
    }
print(scores)

100%|██████████| 5/5 [00:09<00:00,  1.80s/it]

{'pearsonr': tensor(-0.0313), 'spearmanr': tensor(0.1015), 'mse': tensor(0.4036), 'mae': tensor(0.5084), 'r2': tensor(-1.1500)}
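
A negative R² such as the one above means the predictions fit the labels worse than simply predicting the mean label. The sketch below re-implements the textbook definitions of Pearson correlation, MSE, MAE, and R² in plain Python to show how R² can be negative even when the correlation is perfect. It is an illustration, not the torchmetrics implementation, and it omits the Spearman correlation.

```python
import math

def regression_metrics(preds, labels):
    """Pearson r, MSE, MAE, and R^2 for two equal-length sequences of numbers."""
    n = len(labels)
    mean_p = sum(preds) / n
    mean_y = sum(labels) / n
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(preds, labels))
    var_p = sum((p - mean_p) ** 2 for p in preds)
    ss_tot = sum((y - mean_y) ** 2 for y in labels)      # spread of the labels
    ss_res = sum((y - p) ** 2 for p, y in zip(preds, labels))  # squared errors
    return {
        "pearsonr": cov / math.sqrt(var_p * ss_tot),
        "mse": ss_res / n,
        "mae": sum(abs(y - p) for p, y in zip(preds, labels)) / n,
        "r2": 1 - ss_res / ss_tot,
    }

labels = [0.0, 1.0, 2.0, 3.0]
good = regression_metrics([0.1, 0.9, 2.1, 2.9], labels)
# Perfectly correlated with the labels, but offset by 3
shifted = regression_metrics([3.0, 4.0, 5.0, 6.0], labels)
print(good)
print(shifted)
```

In the second case the squared errors exceed the spread of the labels around their mean, so R² drops below zero while the Pearson correlation stays at 1.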





#### Load the checkpoint fine-tuned on the entire dataset for predicting genome-wide changes in the transcriptome
The checkpoint fine-tuned on limited data performs poorly in evaluation. We therefore provide a checkpoint fine-tuned on the entire dataset, which predicts genome-wide changes in the transcriptome more accurately. Due to time constraints in this tutorial, we again use only the downsampled test data for evaluation.

In [15]:
gep_ft_ckpt = f'{gep_dir}/gep_fibroblast_to_myoblast.ckpt'

model_config = chrombert.get_preset_model_config("gep",gep_flank_window=4, dropout=0)
ft_model = model_config.init_model(finetune_ckpt = gep_ft_ckpt)

dl = data_module.test_dataloader()
ft_model.cuda()
with torch.no_grad():
    y_preds = []
    y_labels = []
    for batch in tqdm(dl,total=len(dl)):
        for k in batch:
            if isinstance(batch[k], torch.Tensor):
                batch[k] = batch[k].cuda()
        y_pred = ft_model(batch)[1].cpu()
        y_label = batch['label'].cpu()
        y_preds.append(y_pred)
        y_labels.append(y_label)
    y_preds = torch.cat(y_preds)
    y_labels = torch.cat(y_labels)
predicts = y_preds.view(-1)
labels = y_labels.view(-1)
metrics_pearsonr = tm.PearsonCorrCoef()
metrics_spearmanr = tm.SpearmanCorrCoef()
metrics_mse = tm.MeanSquaredError()
metrics_mae = tm.MeanAbsoluteError()
metrics_r2 = tm.R2Score()
score_pearsonr = metrics_pearsonr(predicts, labels)
score_spearmanr = metrics_spearmanr(predicts, labels)
score_mse = metrics_mse(predicts, labels)
score_mae = metrics_mae(predicts, labels)
score_r2 = metrics_r2(predicts, labels)
scores = {
    "pearsonr": score_pearsonr,
    "spearmanr": score_spearmanr,
    "mse": score_mse,
    "mae": score_mae,
    "r2": score_r2,
    }
print(scores)

update path: mtx_mask = config/hg38_6k_mask_matrix.tsv
update path: pretrain_ckpt = checkpoint/hg38_6k_1kb_pretrain.ckpt
use organisim hg38; max sequence length including cls is 6392
Loading checkpoint from /home/chenqianqian/.cache/chrombert/data/demo/transdifferentiation/transcriptome/gep_fibroblast_to_myoblast.ckpt
Loaded 112/112 parameters


100%|██████████| 5/5 [00:08<00:00,  1.80s/it]

{'pearsonr': tensor(0.4961), 'spearmanr': tensor(0.4652), 'mse': tensor(0.1490), 'mae': tensor(0.3108), 'r2': tensor(0.2062)}





## Use the fine-tuned model to get regulator embeddings and identify key regulators in cell state transition

Because the checkpoint fine-tuned on limited data performs poorly in evaluation, we load the checkpoint fine-tuned on the entire dataset to obtain regulator embeddings. Due to time constraints in this tutorial, we use only downsampled data.


In [16]:
model_tuned = chrombert.get_preset_model_config(
    "gep", 
    gep_flank_window = 4,
    dropout = 0,
    finetune_ckpt = f'{gep_dir}/gep_fibroblast_to_myoblast.ckpt').init_model() # use absolute path here, to avoid mixing of preset

update path: mtx_mask = config/hg38_6k_mask_matrix.tsv
update path: pretrain_ckpt = checkpoint/hg38_6k_1kb_pretrain.ckpt
update path: finetune_ckpt = /home/chenqianqian/.cache/chrombert/data/demo/transdifferentiation/transcriptome/gep_fibroblast_to_myoblast.ckpt
use organisim hg38; max sequence length including cls is 6392
Loading checkpoint from /home/chenqianqian/.cache/chrombert/data/demo/transdifferentiation/transcriptome/gep_fibroblast_to_myoblast.ckpt
Loaded 112/112 parameters


In [17]:
model_emb = model_tuned.get_embedding_manager().cuda()
summary(model_emb)

Layer (type:depth-idx)                                       Param #
ChromBERTEmbedding                                           --
├─PoolFlankWindow: 1-1                                       --
│    └─ChromBERT: 2-1                                        --
│    │    └─BERTEmbedding: 3-1                               4,916,736
│    │    └─ModuleList: 3-2                                  51,978,240
├─CistromeEmbeddingManager: 1-2                              --
Total params: 56,894,976
Trainable params: 56,894,976
Non-trainable params: 0

In [18]:
dataset_config = chrombert.get_preset_dataset_config("multi_flank_window",supervised_file = f'{gep_dir}/up_data_sample.csv', batch_size = 32, num_workers = 4)
dl = dataset_config.init_dataloader()
up_gep_embs = []
for batch in tqdm(dl):
    with torch.no_grad():
        for k, v in batch.items():
            if isinstance(v, torch.Tensor):
                batch[k] = v.cuda()
        emb = model_emb(batch).cpu()
        up_gep_embs.append(emb)
up_gep_embs = torch.cat(up_gep_embs)
up_gep_embs.shape

update path: hdf5_file = hg38_6k_1kb.hdf5
update path: meta_file = config/hg38_6k_meta.json


100%|██████████| 4/4 [00:43<00:00, 10.98s/it]


torch.Size([100, 1064, 768])

In [19]:
dataset_config = chrombert.get_preset_dataset_config("multi_flank_window",supervised_file = f'{gep_dir}/nochange_data_sample.csv', batch_size = 32, num_workers = 4)
dl = dataset_config.init_dataloader()
nochange_gep_embs = []
for batch in tqdm(dl):
    with torch.no_grad():
        for k, v in batch.items():
            if isinstance(v, torch.Tensor):
                batch[k] = v.cuda()
        emb = model_emb(batch).cpu()
        nochange_gep_embs.append(emb)
nochange_gep_embs = torch.cat(nochange_gep_embs)
nochange_gep_embs.shape

update path: hdf5_file = hg38_6k_1kb.hdf5
update path: meta_file = config/hg38_6k_meta.json


100%|██████████| 4/4 [00:44<00:00, 11.12s/it]


torch.Size([100, 1064, 768])

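We keep only the factors listed in `hg38_6k_factors_list.txt`, removing histone modifications and chromatin accessibility. For each factor, we then compute the cosine similarity between its mean embedding over the up-regulated genes and its mean embedding over the unchanged genes. Factors with the lowest similarity differ most between the two gene sets and are ranked first as candidate key regulators.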
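
The ranking idea can be sketched with toy vectors in plain Python. The factor names and 3-dimensional embeddings below are made up; the tutorial itself uses 768-dimensional embeddings and scikit-learn's `cosine_similarity`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy mean embeddings per factor in each gene set (made-up values).
up = {"factor_a": [1.0, 0.0, 0.0], "factor_b": [1.0, 1.0, 0.0], "factor_c": [0.0, 1.0, 1.0]}
flat = {"factor_a": [1.0, 0.0, 0.0], "factor_b": [1.0, 0.0, 0.0], "factor_c": [1.0, 0.0, 0.0]}

similarity = {name: cosine(up[name], flat[name]) for name in up}
# Rank 1 is the factor whose embedding differs most between the two gene sets.
ranked = sorted(similarity, key=similarity.get)
ranks = {name: i + 1 for i, name in enumerate(ranked)}
print(ranks)
```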

In [20]:
with open(os.path.join(base_dir, "config","hg38_6k_factors_list.txt"),"r") as f:
    factors = f.read().strip().split("\n")
factors = [f.strip().lower() for f in factors]


indices = np.in1d(model_emb.list_regulator,factors)
names = np.array(model_emb.list_regulator)[indices]
up_gep_embs = up_gep_embs.mean(axis=0)[indices]
nochange_gep_embs = nochange_gep_embs.mean(axis=0)[indices]

In [21]:
up_gep_embs.shape, nochange_gep_embs.shape

(torch.Size([996, 768]), torch.Size([996, 768]))

In [22]:
from sklearn.metrics.pairwise import cosine_similarity
gep_similarity = [cosine_similarity(up_gep_embs[i].reshape(1, -1), nochange_gep_embs[i].reshape(1, -1))[0, 0] for i in range(up_gep_embs.shape[0])]
gep_similarity_df = pd.DataFrame({'factors':names,'similarity':gep_similarity}).sort_values(by='similarity').reset_index(drop=True)
gep_similarity_df['rank']=gep_similarity_df.index + 1
gep_similarity_df.to_csv(f'{gep_dir}/gep_similarity_df.csv',index=False)
gep_similarity_df

Unnamed: 0,factors,similarity,rank
0,rpe,0.902802,1
1,klf15,0.925641,2
2,nr3c2,0.926707,3
3,myf5,0.938897,4
4,pgr,0.950597,5
...,...,...,...
991,dbp,0.998239,992
992,tfcp2,0.998259,993
993,mafk,0.998279,994
994,tfam,0.998362,995


In [23]:
gep_similarity_df[gep_similarity_df['factors']=='myod1']

Unnamed: 0,factors,similarity,rank
17,myod1,0.971539,18


## Average ranking across the two modalities: chromatin accessibility and the transcriptome

In [24]:
chrom_accessibility_path = f'{gep_dir}/../chrom_accessibility/chromatin_accessibility_similarity_df.csv'
if not os.path.exists(chrom_accessibility_path):
    raise ValueError("Please follow the chromatin accessibility transdifferentiation tutorial to rank the regulatory factors within this context")
else:
    chrom_acc_similarity_df = pd.read_csv(chrom_accessibility_path)
    average_rank_df = pd.merge(gep_similarity_df, chrom_acc_similarity_df, on='factors', how='inner', suffixes=('_gep', '_chrom_acc'))
    average_rank_df['averge_rank'] = ((average_rank_df['rank_gep']+average_rank_df['rank_chrom_acc'])/2).rank().astype(int)
    average_rank_df=average_rank_df.sort_values(by='averge_rank')
    average_rank_df
    
average_rank_df[average_rank_df['factors']=='myod1']

Unnamed: 0,factors,similarity_gep,rank_gep,similarity_chrom_acc,rank_chrom_acc,averge_rank
17,myod1,0.971539,18,0.038196,1,5
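
The combination step above averages the two per-modality ranks of each factor and then ranks the averages again. A simplified sketch in plain Python follows; the factor names and ranks are made up, and ties are broken by name here, whereas pandas `.rank()` assigns tied entries their average position.

```python
# Toy per-modality ranks (made-up factors and values).
rank_gep = {"factor_a": 18, "factor_b": 2, "factor_c": 5, "factor_d": 40}
rank_chrom_acc = {"factor_a": 1, "factor_b": 30, "factor_c": 6, "factor_d": 3}

# Keep factors present in both modalities (inner join) and average their ranks.
shared = sorted(set(rank_gep) & set(rank_chrom_acc))
mean_rank = {name: (rank_gep[name] + rank_chrom_acc[name]) / 2 for name in shared}

# Re-rank the averaged values so the final ranks are 1, 2, 3, ...
ordered = sorted(shared, key=lambda name: (mean_rank[name], name))
combined_rank = {name: i + 1 for i, name in enumerate(ordered)}
print(combined_rank)
```

A factor that ranks near the top in only one modality can still place well overall, as `factor_a` does in this toy example.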
