# Modeling + Design

Notes:

- Run the scripts back-to-back to ensure compatibility.
- This notebook automatically tries to grab the most recent artifact matching `/tmp/model_artifacts_*`.
    - You may need to update this path, or copy the artifact by hand between `train.py` and `generate.py`.
- Settings (mostly for training) are adjusted for fast testing and will not reproduce the results reported in the paper.
- To approach those results you will need to modify the hyperparameters yourself. An obvious first change is to set `--min_epochs 60` and `--max_epochs 200`.

## Deploy training
Train the model and deposit the resulting artifact in `/tmp/`. The progress bar doesn't render well in the notebook, so you'll have to scroll through some output.

In [1]:
!python /home/ubuntu/boda2/src/train.py \
  --data_module=MPRA_DataModule \
    --datafile_path=gs://tewhey-public-data/CODA_resources/MPRA_ALL_HD_v2.txt \
    --sep space --sequence_column nt_sequence \
    --activity_columns K562_mean HepG2_mean SKNSH_mean \
    --stderr_columns lfcSE_k562 lfcSE_hepg2 lfcSE_sknsh \
    --synth_val_pct=0.0 --synth_test_pct=99.98 \
    --batch_size=1076 --duplication_cutoff=0.5 --std_multiple_cut=6.0 \
    --val_chrs 7 13 --test_chrs 9 21 X \
    --padded_seq_len=600 --use_reverse_complements=True --num_workers=8 \
  --model_module=BassetBranched \
    --input_len 600 \
    --conv1_channels=300 --conv1_kernel_size=19 \
    --conv2_channels=200 --conv2_kernel_size=11 \
    --conv3_channels=200 --conv3_kernel_size=7 \
    --linear_activation=ReLU --linear_channels=1000 \
    --linear_dropout_p=0.11625456877954289 \
    --branched_activation=ReLU --branched_channels=140 \
    --branched_dropout_p=0.5757068086404574 \
    --n_outputs=3 --n_linear_layers=1 \
    --n_branched_layers=3 \
    --use_batch_norm=True --use_weight_norm=False \
    --loss_criterion=L1KLmixed --beta=5.0 \
    --reduction=mean \
  --graph_module=CNNTransferLearning \
    --parent_weights=gs://tewhey-public-data/CODA_resources/my-model.epoch_5-step_19885.pkl \
    --frozen_epochs=0 \
    --optimizer=Adam --amsgrad=True \
    --lr=0.0032658700881052086 --eps=1e-08 --weight_decay=0.0003438210249762151 \
    --beta1=0.8661062881299633 --beta2=0.879223105336538 \
    --scheduler=CosineAnnealingWarmRestarts --scheduler_interval=step \
    --T_0=4096 --T_mult=1 --eta_min=0.0 --last_epoch=-1 \
    --checkpoint_monitor=entropy_spearman --stopping_mode=max \
    --stopping_patience=30 --accelerator=gpu --devices=1 --min_epochs=1 --max_epochs=3 \
    --precision=16 --default_root_dir=/tmp/output/artifacts \
    --artifact_path=/tmp/

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
--------------------------------------------------

K562 | top cut value: 10.54, bottom cut value: -6.43
HepG2 | top cut value: 9.78, bottom cut value: -5.74
SKNSH | top cut value: 10.36, bottom cut value: -6.4

Number of examples discarded from top: 0
Number of examples discarded from bottom: 0

Number of examples available: 666435

--------------------------------------------------

Padding sequences... 

Creating train/val/test datasets with tokenized sequences... 

--------------------------------------------------

Number of examples in train: 1691500 (253.81%)
Number of examples in val:   55841 (8.38%)
Number of examples in test:  45330 (6.8%)

Excluded from train: -1126236 (-168.99)%
--------------------------------------------------
Copying gs://tewhey-public-data/CODA_resources/m

In [9]:
files = !ls /tmp/

artifacts = sorted(p for p in files if 'model_artifacts' in p)
print('\n'.join(artifacts))
model_artifact = artifacts[-1]  # timestamped names sort chronologically

model_artifacts__20240216_154316__693863.tar.gz
model_artifacts__20240216_160210__448365.tar.gz
model_artifacts__20240216_162013__449584.tar.gz
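
The cell above picks the last matching filename, which works only because the timestamp embedded in each name sorts lexicographically. A minimal sketch of a more explicit selection follows. It assumes the `model_artifacts__YYYYMMDD_HHMMSS__<id>.tar.gz` pattern seen in the listing above; the helper name `latest_artifact` is hypothetical and not part of `boda2`.

```python
import re
from datetime import datetime

# Pattern inferred from the filenames listed above; an assumption, not a documented format.
ARTIFACT_RE = re.compile(r"model_artifacts__(\d{8}_\d{6})__\d+\.tar\.gz$")

def latest_artifact(filenames):
    """Return the filename with the most recent embedded timestamp."""
    stamped = []
    for name in filenames:
        match = ARTIFACT_RE.search(name)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
            stamped.append((stamp, name))
    if not stamped:
        raise FileNotFoundError("no model_artifacts__*.tar.gz files found")
    # Tuples compare by timestamp first, so max() yields the newest artifact.
    return max(stamped)[1]

# Deliberately unsorted input to show that listing order does not matter.
files = [
    "model_artifacts__20240216_160210__448365.tar.gz",
    "model_artifacts__20240216_162013__449584.tar.gz",
    "model_artifacts__20240216_154316__693863.tar.gz",
]
model_artifact = latest_artifact(files)
print(model_artifact)
```

In the notebook, `files` could come from `os.listdir('/tmp')` instead of the hard-coded list.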


## Design sequences

### Fast SeqProp

In [15]:
!python /home/ubuntu/boda2/src/generate.py \
    --params_module StraightThroughParameters \
        --batch_size 256 --n_channels 4 \
        --length 200 --n_samples 10 \
        --use_norm True --use_affine False \
    --energy_module MinGapEnergy \
        --target_feature 0 --bending_factor 1.0 --a_min -2.0 --a_max 6.0 \
        --model_artifact /tmp/{model_artifact} \
    --generator_module FastSeqProp \
         --n_steps 200 --learning_rate 0.5 \
    --energy_threshold -0.5 --max_attempts 40 \
    --n_proposals 1000 \
    --proposal_path /tmp/test__k562__fsp


archive unpacked in /tmp/tmpkf2h83rp
Loaded model from 20240216_162013 in eval mode
Starting round: 0, generate 1000 proposals
Steps:   0%|                                            | 0/200 [00:00<?, ?it/s]Penalty not implemented
Steps: 100%|█████████████| 200/200 [00:54<00:00,  3.65it/s, Loss=-5.19, LR=1e-6]
Steps: 100%|█████████████| 200/200 [00:53<00:00,  3.75it/s, Loss=-5.22, LR=1e-6]
Steps: 100%|█████████████| 200/200 [00:53<00:00,  3.76it/s, Loss=-5.16, LR=1e-6]
Steps: 100%|█████████████| 200/200 [00:54<00:00,  3.69it/s, Loss=-5.17, LR=1e-6]
finished round
Proposals deposited at:
	/tmp/test__k562__fsp__20240216_162917__734583.pt


In [22]:
import torch
fsp_props = !ls /tmp/
fsp_props = [ p for p in fsp_props if 'test__k562__fsp' in p ][-1]
torch.load(f'/tmp/{fsp_props}')

{'proposals': [{'states': tensor([[[-3.7377, -2.7889, -3.6888,  ..., -2.0091, -0.8512, -1.1654],
            [-2.8875, 15.6401, -4.9318,  ...,  2.0829,  2.9065,  4.5684],
            [ 8.8135, -3.0222, -6.0940,  ..., -1.1214, -0.8691, -0.1751],
            [-3.1758, -1.7191, 11.4275,  ...,  1.0498, -1.3573, -0.9823]],
   
           [[-3.5919, -3.8258, -3.3337,  ..., -1.7833, -0.8523, -0.0423],
            [ 1.5290,  4.8438, -2.4072,  ..., -0.7916,  1.8525,  1.1274],
            [ 5.3218, -2.5268, -2.2526,  ...,  2.0384,  0.2067,  0.7588],
            [-1.2801,  1.5561,  5.4194,  ..., -0.0931,  0.0430, -2.0858]],
   
           [[-3.6997, -3.4999, -3.2232,  ..., -3.7783, -2.9031, -4.1824],
            [ 0.4595,  8.3717, -3.2004,  ..., -3.0341, -4.0613, -5.0852],
            [ 9.8184, -2.8284, -4.2960,  ..., -4.2540, 11.8247, -5.7616],
            [-2.0383, -2.2265,  5.9676,  ...,  7.8807, -4.6413,  8.0427]],
   
           ...,
   
           [[-4.1132, -3.3941, -2.6604,  ..., -4.4621,

### Simulated Annealing

In [17]:
!python /home/ubuntu/boda2/src/generate.py \
    --params_module BasicParameters \
        --batch_size 256 --n_channels 4 \
        --length 200 \
    --energy_module MinGapEnergy \
        --target_feature 0 --bending_factor 0.0 --a_min -2.0 --a_max 6.0 \
        --model_artifact /tmp/{model_artifact} \
    --generator_module SimulatedAnnealing \
         --n_steps 2000 --n_positions 5 \
         --a 1.0 --b 1.0 --gamma 0.501 \
    --energy_threshold -0.5 --max_attempts 40 \
    --n_proposals 1000 \
    --proposal_path /tmp/test__k562__sa


archive unpacked in /tmp/tmpv5iflges
Loaded model from 20240216_162013 in eval mode
Starting round: 0, generate 1000 proposals
collect samples
  0%|                                                  | 0/2000 [00:00<?, ?it/s]Penalty not implemented
100%|███████████████████████████████████████| 2000/2000 [00:34<00:00, 58.01it/s]
attempt 1 acceptance rate: 256/256
collect samples
100%|███████████████████████████████████████| 2000/2000 [00:31<00:00, 63.12it/s]
attempt 2 acceptance rate: 256/256
collect samples
100%|███████████████████████████████████████| 2000/2000 [00:31<00:00, 62.88it/s]
attempt 3 acceptance rate: 256/256
collect samples
100%|███████████████████████████████████████| 2000/2000 [00:31<00:00, 63.17it/s]
attempt 4 acceptance rate: 256/256
finished round
Proposals deposited at:
	/tmp/test__k562__sa__20240216_163141__730829.pt
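
The log above shows four attempts of 256 sequences each, which is consistent with `--batch_size 256` and `--n_proposals 1000`: if every sequence in a batch passes `--energy_threshold`, four batches (1024 sequences) are the fewest that cover 1000 proposals. The sketch below reproduces that arithmetic. The function name `attempts_needed` is hypothetical, and the loop is only an assumed reading of how these flags interact, not code taken from `generate.py`.

```python
def attempts_needed(n_proposals, accepted_per_attempt, max_attempts):
    """Count attempts until enough accepted sequences are collected.

    `accepted_per_attempt` lists how many sequences passed the energy
    threshold in each attempt. Returns None if `max_attempts` is
    exhausted before `n_proposals` sequences are collected.
    """
    collected = 0
    for attempt, accepted in enumerate(accepted_per_attempt[:max_attempts], start=1):
        collected += accepted
        if collected >= n_proposals:
            return attempt
    return None

# Values from the run above: every attempt accepted 256 of 256 sequences.
print(attempts_needed(1000, [256] * 40, max_attempts=40))
```

With a lower acceptance rate, more attempts would be needed, and a run could hit `--max_attempts` before collecting the requested number of proposals.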


In [23]:
import torch
sa_props = !ls /tmp/
sa_props = [ p for p in sa_props if 'test__k562__sa' in p ][-1]
torch.load(f'/tmp/{sa_props}')

{'proposals': [{'proposals': tensor([[[0., 1., 0.,  ..., 0., 0., 1.],
            [0., 0., 0.,  ..., 1., 1., 0.],
            [1., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 1.,  ..., 0., 0., 0.]],
   
           [[0., 0., 0.,  ..., 0., 1., 0.],
            [0., 0., 0.,  ..., 1., 0., 1.],
            [1., 0., 0.,  ..., 0., 0., 0.],
            [0., 1., 1.,  ..., 0., 0., 0.]],
   
           [[0., 0., 0.,  ..., 0., 1., 0.],
            [1., 1., 0.,  ..., 0., 0., 1.],
            [0., 0., 0.,  ..., 0., 0., 0.],
            [0., 0., 1.,  ..., 1., 0., 0.]],
   
           ...,
   
           [[0., 0., 1.,  ..., 0., 0., 1.],
            [0., 1., 0.,  ..., 1., 0., 0.],
            [1., 0., 0.,  ..., 0., 1., 0.],
            [0., 0., 0.,  ..., 0., 0., 0.]],
   
           [[0., 0., 0.,  ..., 0., 1., 1.],
            [0., 0., 1.,  ..., 0., 0., 0.],
            [1., 0., 0.,  ..., 1., 0., 0.],
            [0., 1., 0.,  ..., 0., 0., 0.]],
   
           [[0., 0., 0.,  ..., 0., 0., 0.],
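
The proposals printed above are one-hot matrices with one row per nucleotide channel (`--n_channels 4`) and one column per position (`--length 200`). A minimal sketch for turning one such matrix into a DNA string follows. It works on nested lists, for example the result of calling `.tolist()` on a single proposal tensor. The channel order `A, C, G, T` is an assumption that should be checked against the tokenization used in training, and `decode_one_hot` is a hypothetical helper.

```python
# Channel order is an assumption; confirm it against the tokenization used in training.
ALPHABET = "ACGT"

def decode_one_hot(one_hot):
    """Convert one (4, length) one-hot matrix, given as nested lists, to a string."""
    length = len(one_hot[0])
    letters = []
    for pos in range(length):
        column = [row[pos] for row in one_hot]
        # Each position must have exactly one active channel.
        if sorted(column) != [0, 0, 0, 1]:
            raise ValueError(f"position {pos} is not one-hot: {column}")
        letters.append(ALPHABET[column.index(1)])
    return "".join(letters)

# Toy (4, 5) example; a real matrix would come from one entry of a proposals tensor.
example = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
]
print(decode_one_hot(example))
```

The explicit one-hot check is there to fail loudly if the function is given real-valued scores, such as the `states` entries, rather than discrete proposals.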
     