# Describing the problem & goals of this test

### Task & data

I chose the sequence-to-expression task because the random promoter data recently generated by the Carl de Boer lab allows us to quickly prototype and identify model architectures capable of predicting gene expression readout. This data describes the transcriptional activity of just one yeast cell state, but across 10 million random sequences, which explore regulatory sequence space more deeply than is possible by training models on the natural yeast genome. It is formatted in a way that makes it a convenient starting point, whereas the majority of problems relevant to perturbing aged and dysfunctional cell states would require analysis and processing likely to take months to optimise.

An important caveat is that models optimised to perform well on this data would quite possibly not be optimal for predicting total RNA abundance in human, mouse and other ageing-relevant model organisms, because these organisms have more cell types and more complex regulation than yeast. The required specificity of transcription in vertebrates is likely to require both 1) more complex interactions between transcription factors that bind the promoter sequence near the transcription start site and 2) more complex interactions between transcription factors bound at the promoter and those bound at distal regulatory sites.

In addition, yeast don't have complex 3D genome organisation such as topologically associating domains (TADs) and make more limited use of activity-associated compartments (A/B compartments seen in Hi-C data, more diverse liquid-liquid phase separation compartments). It is also likely that the major gene regulation rules responsible for the specification of cell states differ between vertebrates and yeast (an active area of study, for example by the Arnau Sebe-Pedros lab). However, a model that will succeed in mammals should likely be sufficiently complex to deal with this relatively simple data. Therefore, this data provides a quick test of model architecture robustness.

An additional problem with this task is that, to explain how the same genome is used by and leads to different cell states, we need to consider not just the genome sequence but also which transcription factors and other regulatory proteins are present in those cell states. In contrast, this data presents the genome as if it were a static entity that always leads to the same activity. This task is therefore not a realistic stand-in for the task of identifying how to perturb aged or dysfunctional cells.

### Bayesian modelling, CNN & what is tested here

Bayesian modelling is not very frequently used in industry for large-scale models. However, it provides a principled and intuitive way to include prior information about biology as inductive biases on model parameters, in addition to the inductive biases provided by the model architecture. There are also statistical arguments that maximum likelihood inference leads to suboptimal solutions for high-dimensional problems, where estimating a variational posterior could give estimates that are closer to the ground truth.

Even when Bayesian models are not used with parameters informed by measurements external to the training task, the priors can be useful for specifying reasonable ranges of values and the default behaviour of the model. For example, if we want to learn assay sensitivity, we may want to regularise the parameter that represents assay sensitivity to be close to 1 to avoid over-normalising biological differences. Similarly, we can regularise the convolutional neural network weights that represent transcription factor DNA recognition preferences to be either 1) small, thus implicitly asking the model to learn simpler motifs, or 2) similar to experimentally determined motifs for transcription factors, which are extremely well characterised in yeast and so should provide a very good reference for this model. Both settings are tested in this experiment.
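
The two regularisation settings described above can be read as Gaussian priors on the filter weights that differ only in their mean. The following is a minimal plain-Python sketch of that idea, not the code used in `prixfixe.bayesian`; the function name `gaussian_log_prior`, the toy weights and the prior scale of 0.5 are all hypothetical choices made for illustration.

```python
import math

def gaussian_log_prior(weights, means, scale):
    """Sum of Normal(mean, scale) log-densities over matching weight/mean pairs."""
    const = -0.5 * math.log(2 * math.pi * scale ** 2)
    return sum(const - 0.5 * ((w - m) / scale) ** 2 for w, m in zip(weights, means))

# Hypothetical flattened filter: one motif position, channels A, C, G, T.
learned = [0.9, 0.05, 0.0, 0.05]
# Stand-in for an experimentally determined motif (assumed values).
reference_motif = [1.0, 0.0, 0.0, 0.0]

# Setting 1: zero-centred prior, which favours small weights (simpler motifs).
log_p_small = gaussian_log_prior(learned, [0.0] * len(learned), scale=0.5)

# Setting 2: prior centred on the reference motif, which favours staying close to it.
log_p_motif = gaussian_log_prior(learned, reference_motif, scale=0.5)

print(log_p_small, log_p_motif)
```

Because the toy weights sit close to the reference motif, the motif-centred prior assigns them a higher log-density than the zero-centred one. In training, a term of this kind would be added to the log-likelihood, with the scale controlling how strongly the weights are pulled towards the prior mean.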

In this task, we are going to train a simple Bayesian convolutional neural network that predicts the same outputs as the winning solution of the random promoter challenge. The winning solution defined a dual likelihood for MPRA data that did not just predict the estimated average sequence activity (the output of the pre-processing workflow), but also predicted in which of the FACS bins a signal will be observed. This approach uses an inductive bias about how the experiment was designed: FACS sorting was done according to reporter protein fluorescence, so the measurement for every sequence is not read out on a continuous scale but is instead quantified in bins. In general, MPRA technologies with both sequencing and FACS readout could benefit from better formalised likelihoods that don't require arbitrary preprocessing of experimentally measured values (we could collaborate with the Lars Velten and Jay Shendure labs to optimise this).
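
To make the idea of a dual objective concrete, here is a minimal plain-Python sketch for a single sequence. It is an illustration under stated assumptions rather than the winning team's or this notebook's implementation: the model is assumed to emit one logit per FACS bin, the bin term is a cross-entropy against the observed bin distribution, and the activity term is a squared error on the expected bin index. The names `dual_loss` and `expr_weight` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dual_loss(logits, y_probs, y, expr_weight=1.0):
    """Cross-entropy over bins plus squared error on the expected bin index."""
    probs = softmax(logits)
    # Bin term: cross-entropy between the observed bin distribution and the prediction.
    bin_nll = -sum(t * math.log(p) for t, p in zip(y_probs, probs) if t > 0)
    # Activity term: expected bin index compared with the scalar target.
    expected_bin = sum(i * p for i, p in enumerate(probs))
    return bin_nll + expr_weight * (expected_bin - y) ** 2

n_bins = 18
# Toy sequence observed equally often in bins 9 and 10.
target = [0.0] * n_bins
target[9] = 0.5
target[10] = 0.5

good = [0.0] * n_bins   # logits that put most mass on bins 9 and 10
good[9] = 5.0
good[10] = 5.0
bad = [0.0] * n_bins    # logits that put most mass on bin 2
bad[2] = 5.0

print(dual_loss(good, target, y=9.5), dual_loss(bad, target, y=9.5))
```

The logits that concentrate probability on the observed bins score a much lower loss on both terms. The relative weighting of the two terms is a design choice that this sketch leaves as a free parameter.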

Since this data is quite simple, representing both a single cell state and yeast, an organism with relatively simple regulation, we likely don't need very complex models such as Enformer or a deep CNN to represent the underlying mechanisms, and we can use a simple two-layer convolutional neural network with Bayesian parameters.

#### Install environment

```bash
# Setting this variable makes Python ignore the user site-packages directory
export PYTHONNOUSERSITE=1
conda env create -n dream -f environment.yml
conda activate dream
python -m ipykernel install --user --name=dream --display-name='Environment (dream)'
```

```bash
# Setting this variable makes Python ignore the user site-packages directory
export PYTHONNOUSERSITE=1
conda env create -n dream2 -f environment.yml
conda activate dream2
python -m ipykernel install --user --name=dream2 --display-name='Environment (dream2)'
```

#### Load JASPAR motifs

```bash
wget https://jaspar.elixir.no/download/data/2024/CORE/JASPAR2024_CORE_fungi_non-redundant_pfms_meme.txt
```

# Import libraries and functions

In [1]:
import pandas as pd
import torch
import os
from prixfixe.autosome import AutosomeDataProcessor, AutosomeFinalLayersBlock, AutosomeTrainer, AutosomePredictor
from prixfixe.bhi import BHIFirstLayersBlock, BHICoreBlock
from prixfixe.unlockdna import UnlockDNACoreBlock
from prixfixe.prixfixe import PrixFixeNet
from prixfixe.bayesian import BayesianFinalLayersBlock, BayesianTrainer, setup_pyro_model, get_jaspar_motifs, motif_dict_to_array


# Initialize paths and variables

In [2]:
TRAIN_DATA_PATH = "data/demo_train.txt"  # change filename to the actual training data
VALID_DATA_PATH = "data/demo_val.txt"  # change filename to the actual validation data
TRAIN_BATCH_SIZE = 512  # replace with 1024; if that doesn't fit in GPU memory, halve it (512, 256)
BATCH_PER_EPOCH = 10  # replace with the total number of batches in the training data
N_PROCS = 8
VALID_BATCH_SIZE = 4096
BATCH_PER_VALIDATION = 10  # replace with the total number of batches in the validation data
PLASMID_PATH = "data/plasmid.json"
SEQ_SIZE = 150
NUM_EPOCHS = 5  # replace with 80
CUDA_DEVICE_ID = 0
lr = 0.005  # 0.001 for attention layers in coreBlock

In [3]:
jaspar_dict = get_jaspar_motifs(fixed_motifs_path="./JASPAR2024_CORE_fungi_non-redundant_pfms_meme.txt", genome='yeast')
jaspar_array = motif_dict_to_array(jaspar_dict, 15)

jaspar_array.shape

(177, 15, 4)
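
JASPAR motifs vary in length, while `jaspar_array` above has a fixed width of 15 positions for each of its 177 motifs. How `motif_dict_to_array` reaches that width is not shown here. The sketch below is one plausible approach under stated assumptions: shorter motifs are centred and padded with a uniform background of 0.25, and longer motifs are cropped around their centre. The helper `pad_motif` and the toy matrix are hypothetical.

```python
def pad_motif(pfm, width, background=0.25):
    """Centre a (length x 4) position frequency matrix in a fixed-width window."""
    length = len(pfm)
    if length >= width:
        # Crop symmetric flanks when the motif is longer than the window.
        start = (length - width) // 2
        return [list(row) for row in pfm[start:start + width]]
    left = (width - length) // 2
    right = width - length - left
    # Flanking positions carry no base preference.
    left_pad = [[background] * 4 for _ in range(left)]
    right_pad = [[background] * 4 for _ in range(right)]
    return left_pad + [list(row) for row in pfm] + right_pad

# Toy 3-position motif with columns A, C, G, T (assumed order).
toy = [
    [0.90, 0.03, 0.04, 0.03],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.05, 0.85],
]
padded = pad_motif(toy, 15)
print(len(padded), len(padded[0]))
```

Stacking the padded matrices for all motifs gives an array with the same `(n_motifs, width, 4)` layout as the shape printed above.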

# DataProcessor

In [4]:
generator = torch.Generator()
generator.manual_seed(2147483647)

dataprocessor = AutosomeDataProcessor(
    path_to_training_data=TRAIN_DATA_PATH,
    path_to_validation_data=VALID_DATA_PATH,
    train_batch_size=TRAIN_BATCH_SIZE, 
    batch_per_epoch=BATCH_PER_EPOCH,
    train_workers=N_PROCS,
    valid_batch_size=VALID_BATCH_SIZE,
    valid_workers=N_PROCS,
    shuffle_train=True,
    shuffle_val=False,
    plasmid_path=PLASMID_PATH,
    seqsize=SEQ_SIZE,
    generator=generator,
    dataset_kwargs={
        "use_single_channel": False,
        "use_reverse_channel": False,
    } 
)

In [5]:
batch = next(dataprocessor.prepare_train_dataloader())

batch.keys()

dict_keys(['x', 'y_probs', 'y'])

In [6]:
batch['x'].shape

torch.Size([512, 4, 150])

In [7]:
batch['y'].shape

torch.Size([512])

In [8]:
batch['y_probs'].shape

torch.Size([512, 18])
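
The shape of `batch['x']`, `[512, 4, 150]`, corresponds to 512 sequences, 4 nucleotide channels and 150 positions. As a minimal illustration of that layout for one sequence, the plain-Python sketch below one-hot encodes DNA into a 4 x length nested list. The channel order A, C, G, T and the all-zero encoding of unknown bases are assumptions, and `one_hot` is a hypothetical helper rather than part of `AutosomeDataProcessor`.

```python
def one_hot(seq, alphabet="ACGT"):
    """Return a 4 x len(seq) nested list; unknown bases (e.g. N) stay all-zero."""
    index = {base: i for i, base in enumerate(alphabet)}
    encoded = [[0.0] * len(seq) for _ in alphabet]
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[index[base]][pos] = 1.0
    return encoded

x = one_hot("ACGTN")
for channel, row in zip("ACGT", x):
    print(channel, row)
```

Each column has at most one non-zero entry, so a first-layer convolution filter of shape `(width, 4)` applied to this input behaves like a motif scanner.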

# Prix-Fixe Model

### DREAM-CNN Model

In [9]:
final = BayesianFinalLayersBlock(
    in_channels=5,
    seqsize=5,
    n_out=18,
    fixed_motifs=jaspar_array,
)
model = PrixFixeNet(
    first=None,
    core=None,
    final=final,
    generator=generator
)
setup_pyro_model(
    dataloader=dataprocessor.prepare_train_dataloader(), 
    pl_module=model.final,
)

TypeError: BayesianPyroModel.__init__() got an unexpected keyword argument 'dna_sequence'

In [None]:
# MODEL_LOG_DIR = f"prix_fixe_model_weights/0_1_0_0"
# model.load_state_dict(torch.load(os.path.join(MODEL_LOG_DIR, 'model_best.pth')))

# Trainer

In [None]:
trainer = AutosomeTrainer(
    model,
    device=torch.device(f"cuda:{CUDA_DEVICE_ID}"),
    model_dir="data/bayesian_model_weights",
    dataprocessor=dataprocessor,
    num_epochs=NUM_EPOCHS,
    lr=lr,
)

In [16]:
trainer.fit()

  0%|                                                     | 0/5 [00:00<?, ?it/s]
Train epoch: 100%|██████████████████████████████| 10/10 [00:03<00:00,  3.51it/s

# Predict

In [17]:
import random
predictor = AutosomePredictor(model=model, model_pth='data/model_weights/model_best.pth', device=torch.device(f"cuda:{CUDA_DEVICE_ID}"))
# Random 80-bp insert between the two fixed flanking sequences
dna = "TGCATTTTTTTCACATC" + ''.join(random.choice('ACGT') for _ in range(80)) + "GGTTACGGCTGTT"
predictor.predict(dna)

10.054896354675293

# Prediction on the test dataset

In [18]:
test_df = pd.read_csv('data/filtered_test_data_with_MAUDE_expression.txt', header=None, sep='\t')

from tqdm import tqdm
pred_expr = []
for seq in tqdm(test_df.iloc[:, 0]):
    pred_expr.append(predictor.predict(seq))

In [20]:
from scipy.stats import pearsonr, spearmanr
print(pearsonr(pred_expr, list(test_df.iloc[:, 1])), spearmanr(pred_expr, list(test_df.iloc[:, 1])))

# Score your submission on DREAM Challenge test dataset

In [22]:
pred_expr = pd.read_csv('data/sample_submission.txt', sep = '\t', header = None).iloc[:,1]
from prixfixe.evaluation import evaluate_predictions
evaluate_predictions(pred_expr)

******************************************************
Pearson Score: 0.7657255844881551

Spearman Score: 0.8228750904214907

******************************************************
all r: 0.957144539749361

all r²: 0.916125669972016

all ρ: 0.961451653086994

******************************************************
high r: 0.6200899915391505

low r: 0.6211738513565918

yeast r: 0.8382821111688279

random r: 0.9677444394489736

challenging r: 0.9354983554787447

SNVs r: 0.8227819183935022

motif perturbation r: 0.9671482009080143

motif tiling r: 0.9449999831802987

******************************************************
high ρ: 0.5754373259429003

low ρ: 0.596033541641311

yeast ρ: 0.839060331461191

random ρ: 0.970287191964816

challenging ρ: 0.9289802083256298

SNVs ρ: 0.6775184531537061

motif perturbation ρ: 0.9611406141596464

motif tiling ρ: 0.9273541130425778

******************************************************
