# Designing an mRNA Vaccine with AIDO.ModelGenerator

This demo walks through the core functionality of AIDO.ModelGenerator, and shows how it provides access to multiple foundation models for complex biological applications.

# AIDO.ModelGenerator Core Design-Build-Test Cycles

AIDO.ModelGenerator provides a simple but highly customizable framework for adapting, applying, and evaluating foundation models. See the [documentation](https://genbio-ai.github.io/ModelGenerator/) for a deep dive into any individual component. The core design-build-test cycle starts with fitting a model.

`mgen fit` lets you quickly start finetuning a model by providing
- a foundation model backbone (e.g. aido_dna_dummy for debugging here),
- a dataset (e.g. TranslationEfficiency),
- and a task to connect the two (e.g. SequenceRegression).

It handles all the heavy lifting and hardware coordination under the hood.

In [None]:
# Fit
!CUDA_VISIBLE_DEVICES=0 mgen fit --model SequenceRegression \
    --model.backbone aido_dna_dummy \
    --data TranslationEfficiency

After fitting a model, you can find the checkpoint under `logs/`. 
Checkpoints for best validation loss will be automatically saved and stored, as well as best training loss and the last checkpoint.

Using `mgen test` with the previous arguments and the checkpoint path, you can evaluate the test performance.
Using `mgen validate` does the same on the validation set.

Also make sure to check out the full `config.yaml` saved with the model checkpoint. 
This includes all the automatically configured behavior, which can be updated, hacked-on, and re-run with
```
mgen fit --config <path_to_config>.yaml
```
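
To pass a checkpoint to `mgen test` you first need its path. The sketch below assumes only that checkpoints end in `.ckpt` somewhere under `logs/`; the exact subdirectory layout is not specified here, and `latest_checkpoint` is a hypothetical helper, not part of ModelGenerator. It picks the most recently modified checkpoint, which is not necessarily the one with the best validation loss.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(log_dir: str = "logs") -> Optional[Path]:
    """Return the most recently modified .ckpt file under log_dir, or None."""
    root = Path(log_dir)
    if not root.is_dir():
        return None
    # Search recursively, since the subdirectory layout is an unknown here
    checkpoints = list(root.rglob("*.ckpt"))
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda p: p.stat().st_mtime)


# Prints None if nothing has been trained yet
print(latest_checkpoint("logs"))
```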

In [None]:
# Test
!CUDA_VISIBLE_DEVICES=0 mgen test --model SequenceRegression \
    --model.backbone aido_dna_dummy \
    --data TranslationEfficiency \
    --ckpt_path <your-checkpoint>.ckpt

In [None]:
# Don't be afraid to ask for help!
# !mgen --help
# !mgen fit --help
# !mgen fit --model.help SequenceRegression
# !mgen fit --model.help SequenceRegression.backbone

In [None]:
# See the full documentation for more help
# https://genbio-ai.github.io/ModelGenerator/

Now save the predictions to file and inspect what the model learned (or debug any issues).
ModelGenerator allows pass-through for most values from the dataset that aren't used for training, so you can easily inspect even when things get out of order during high-performance distributed training.
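
Once predictions are on disk, a quick agreement check between predictions and labels is often the first thing to look at. The sketch below assumes a tab-separated file with `predictions` and `labels` columns holding plain numbers (the column names match those used elsewhere in this demo; the toy data and the helper names are made up for illustration). It uses only the standard library so it runs anywhere.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def score_predictions(tsv_file):
    """Read a predictions TSV (any file-like object) and return Pearson r."""
    rows = list(csv.DictReader(tsv_file, delimiter="\t"))
    preds = [float(r["predictions"]) for r in rows]
    labels = [float(r["labels"]) for r in rows]
    return pearson(preds, labels)


# Toy stand-in for a real predictions file
toy = io.StringIO(
    "sequences\tpredictions\tlabels\n"
    "ACGT\t0.1\t0.0\n"
    "GGCC\t0.9\t1.0\n"
    "TTAA\t0.5\t0.4\n"
)
print(f"Pearson r = {score_predictions(toy):.3f}")
```

For a real run, replace the `io.StringIO` object with `open(path, newline="")` on the written predictions file.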

In [None]:
# Predict
!CUDA_VISIBLE_DEVICES=0 mgen predict --model SequenceRegression \
    --model.backbone aido_dna_dummy \
    --data TranslationEfficiency \
    --ckpt_path <your-checkpoint>.ckpt \
    --config configs/examples/save_predictions.yaml

In [None]:
# Change the backbone and adapter to larger models
# Apply mixed-precision, FSDP sharding, and parameter-efficient fine-tuning (PEFT) for large model training
!CUDA_VISIBLE_DEVICES=0,1 mgen fit --model SequenceRegression \
    --model.backbone aido_rna_1b600m \
    --model.adapter MLPPoolAdapter \
    --model.adapter.pooling cls_pooling \
    --model.adapter.dropout 0.1 \
    --data TranslationEfficiency \
    --trainer.strategy.class_path lightning.pytorch.strategies.FSDPStrategy \
    --trainer.strategy.init_args.sharding_strategy FULL_SHARD \
    --trainer.strategy.init_args.auto_wrap_policy modelgenerator.distributed.fsdp.wrap.AutoWrapPolicy \
    --model.backbone.use_peft true \
    --trainer.precision 16-mixed
# More backbones and datasets
# https://huggingface.co/genbio-ai

In [None]:
# Even import an external backbone from Huggingface
!CUDA_VISIBLE_DEVICES=0 mgen fit --model SequenceRegression \
    --model.backbone Huggingface \
    --model.backbone.model_path multimolecule/rnafm \
    --model.backbone.modules_for_model_registration+=multimolecule \
    --model.use_legacy_adapter true \
    --data TranslationEfficiency
# https://huggingface.co/multimolecule/rnafm

In [None]:
# Use your own local data
!CUDA_VISIBLE_DEVICES=0 mgen fit --model SequenceRegression \
    --model.backbone aido_dna_dummy \
    --data SequenceRegressionDataModule \
    --data.path predictions \
    --data.train_split_files predict_predictions.tsv \
    --data.x_col sequences \
    --data.y_col labels

In [None]:
# Automated hyperparameter sweeps
!wandb sweep genbio_scripts/wandb_sweep/slurm_sweep.yaml

# Demo: Developing a SARS-CoV-2 mRNA vaccine

This is a simple example showing how ModelGenerator handles multiple large-scale foundation models in an in-silico experiment design workflow.

We will optimize a few attributes of a SARS-CoV-2 mRNA vaccine.
1. **5' UTR Design**: Design a 5' UTR for efficient translation in human muscle cells.

2. **Coding Sequence Design**: Optimize the coding sequence for protein abundance in humans.

3. **mRNA Folding**: Predict the structure of the optimized mRNA.

4. **Protein Stabilization**: Improve the stability of the SARS-CoV-2 spike protein.

5. **Protein Folding**: Predict the 3D structure of the optimized spike protein and compare to the wild-type structure.

In [None]:
# Grab the antigen sequence: Spike protein coding sequence
# https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta&from=21563&to=25384
sequence = """
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAAT
TACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGT
TTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCATGCTATACATGTC
TCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTT
CCACTGAGAAGTCTAACATAATAAGAGGCTGGATTTTTGGTACTACTTTAGATTCGAAGACCCAGTCCCT
ACTTATTGTTAATAACGCTACTAATGTTGTTATTAAAGTCTGTGAATTTCAATTTTGTAATGATCCATTT
TTGGGTGTTTATTACCACAAAAACAACAAAAGTTGGATGGAAAGTGAGTTCAGAGTTTATTCTAGTGCGA
ATAATTGCACTTTTGAATATGTCTCTCAGCCTTTTCTTATGGACCTTGAAGGAAAACAGGGTAATTTCAA
AAATCTTAGGGAATTTGTGTTTAAGAATATTGATGGTTATTTTAAAATATATTCTAAGCACACGCCTATT
AATTTAGTGCGTGATCTCCCTCAGGGTTTTTCGGCTTTAGAACCATTGGTAGATTTGCCAATAGGTATTA
ACATCACTAGGTTTCAAACTTTACTTGCTTTACATAGAAGTTATTTGACTCCTGGTGATTCTTCTTCAGG
TTGGACAGCTGGTGCTGCAGCTTATTATGTGGGTTATCTTCAACCTAGGACTTTTCTATTAAAATATAAT
GAAAATGGAACCATTACAGATGCTGTAGACTGTGCACTTGACCCTCTCTCAGAAACAAAGTGTACGTTGA
AATCCTTCACTGTAGAAAAAGGAATCTATCAAACTTCTAACTTTAGAGTCCAACCAACAGAATCTATTGT
TAGATTTCCTAATATTACAAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTGTT
TATGCTTGGAACAGGAAGAGAATCAGCAACTGTGTTGCTGATTATTCTGTCCTATATAATTCCGCATCAT
TTTCCACTTTTAAGTGTTATGGAGTGTCTCCTACTAAATTAAATGATCTCTGCTTTACTAATGTCTATGC
AGATTCATTTGTAATTAGAGGTGATGAAGTCAGACAAATCGCTCCAGGGCAAACTGGAAAGATTGCTGAT
TATAATTATAAATTACCAGATGATTTTACAGGCTGCGTTATAGCTTGGAATTCTAACAATCTTGATTCTA
AGGTTGGTGGTAATTATAATTACCTGTATAGATTGTTTAGGAAGTCTAATCTCAAACCTTTTGAGAGAGA
TATTTCAACTGAAATCTATCAGGCCGGTAGCACACCTTGTAATGGTGTTGAAGGTTTTAATTGTTACTTT
CCTTTACAATCATATGGTTTCCAACCCACTAATGGTGTTGGTTACCAACCATACAGAGTAGTAGTACTTT
CTTTTGAACTTCTACATGCACCAGCAACTGTTTGTGGACCTAAAAAGTCTACTAATTTGGTTAAAAACAA
ATGTGTCAATTTCAACTTCAATGGTTTAACAGGCACAGGTGTTCTTACTGAGTCTAACAAAAAGTTTCTG
CCTTTCCAACAATTTGGCAGAGACATTGCTGACACTACTGATGCTGTCCGTGATCCACAGACACTTGAGA
TTCTTGACATTACACCATGTTCTTTTGGTGGTGTCAGTGTTATAACACCAGGAACAAATACTTCTAACCA
GGTTGCTGTTCTTTATCAGGATGTTAACTGCACAGAAGTCCCTGTTGCTATTCATGCAGATCAACTTACT
CCTACTTGGCGTGTTTATTCTACAGGTTCTAATGTTTTTCAAACACGTGCAGGCTGTTTAATAGGGGCTG
AACATGTCAACAACTCATATGAGTGTGACATACCCATTGGTGCAGGTATATGCGCTAGTTATCAGACTCA
GACTAATTCTCCTCGGCGGGCACGTAGTGTAGCTAGTCAATCCATCATTGCCTACACTATGTCACTTGGT
GCAGAAAATTCAGTTGCTTACTCTAATAACTCTATTGCCATACCCACAAATTTTACTATTAGTGTTACCA
CAGAAATTCTACCAGTGTCTATGACCAAGACATCAGTAGATTGTACAATGTACATTTGTGGTGATTCAAC
TGAATGCAGCAATCTTTTGTTGCAATATGGCAGTTTTTGTACACAATTAAACCGTGCTTTAACTGGAATA
GCTGTTGAACAAGACAAAAACACCCAAGAAGTTTTTGCACAAGTCAAACAAATTTACAAAACACCACCAA
TTAAAGATTTTGGTGGTTTTAATTTTTCACAAATATTACCAGATCCATCAAAACCAAGCAAGAGGTCATT
TATTGAAGATCTACTTTTCAACAAAGTGACACTTGCAGATGCTGGCTTCATCAAACAATATGGTGATTGC
CTTGGTGATATTGCTGCTAGAGACCTCATTTGTGCACAAAAGTTTAACGGCCTTACTGTTTTGCCACCTT
TGCTCACAGATGAAATGATTGCTCAATACACTTCTGCACTGTTAGCGGGTACAATCACTTCTGGTTGGAC
CTTTGGTGCAGGTGCTGCATTACAAATACCATTTGCTATGCAAATGGCTTATAGGTTTAATGGTATTGGA
GTTACACAGAATGTTCTCTATGAGAACCAAAAATTGATTGCCAACCAATTTAATAGTGCTATTGGCAAAA
TTCAAGACTCACTTTCTTCCACAGCAAGTGCACTTGGAAAACTTCAAGATGTGGTCAACCAAAATGCACA
AGCTTTAAACACGCTTGTTAAACAACTTAGCTCCAATTTTGGTGCAATTTCAAGTGTTTTAAATGATATC
CTTTCACGTCTTGACAAAGTTGAGGCTGAAGTGCAAATTGATAGGTTGATCACAGGCAGACTTCAAAGTT
TGCAGACATATGTGACTCAACAATTAATTAGAGCTGCAGAAATCAGAGCTTCTGCTAATCTTGCTGCTAC
TAAAATGTCAGAGTGTGTACTTGGACAATCAAAAAGAGTTGATTTTTGTGGAAAGGGCTATCATCTTATG
TCCTTCCCTCAGTCAGCACCTCATGGTGTAGTCTTCTTGCATGTGACTTATGTCCCTGCACAAGAAAAGA
ACTTCACAACTGCTCCTGCCATTTGTCATGATGGAAAAGCACACTTTCCTCGTGAAGGTGTCTTTGTTTC
AAATGGCACACACTGGTTTGTAACACAAAGGAATTTTTATGAACCACAAATCATTACTACAGACAACACA
TTTGTGTCTGGTAACTGTGATGTTGTAATAGGAATTGTCAACAACACAGTTTATGATCCTTTGCAACCTG
AATTAGACTCATTCAAGGAGGAGTTAGATAAATATTTTAAGAATCATACATCACCAGATGTTGATTTAGG
TGACATCTCTGGCATTAATGCTTCAGTTGTAAACATTCAAAAAGAAATTGACCGCCTCAATGAGGTTGCC
AAGAATTTAAATGAATCTCTCATCGATCTCCAAGAACTTGGAAAGTATGAGCAGTATATAAAATGGCCAT
GGTACATTTGGCTAGGTTTTATAGCTGGCTTGATTGCCATAGTAATGGTGACAATTATGCTTTGCTGTAT
GACCAGTTGCTGTAGTTGTCTCAAGGGCTGTTGTTCTTGTGGATCCTGCTGCAAATTTGATGAAGACGAC
TCTGAGCCAGTGCTCAAAGGAGTCAAATTACATTACACATAA
"""
sequence = sequence.replace('\n', '')
print(sequence)

sample = {
    'id': 'SPIKE_SARS2',
    'sequence': sequence,
    'label': 0,  # dummy label to fit with our other data later
}
# save as tsv
import os
import pandas as pd
os.makedirs("tmp", exist_ok=True)
pd.DataFrame([sample]).to_csv("tmp/spike_genes.tsv", index=False, sep="\t")
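
Before feeding a coding sequence to downstream models, it can help to confirm that the pasted FASTA is a well-formed CDS. The requested range 21563 to 25384 spans 3,822 nt, which is a whole number of codons. The sketch below is a minimal set of checks; `check_cds` is a hypothetical helper written for this demo, and the example string is a toy sequence rather than the spike gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(cds: str) -> list:
    """Return a list of problems found in a coding sequence (empty if none)."""
    problems = []
    if set(cds) - set("ACGT"):
        problems.append("contains characters other than A, C, G, T")
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    # Look for stop codons in frame, excluding the final codon
    internal = [i for i in range(0, len(cds) - 3, 3) if cds[i:i + 3] in STOP_CODONS]
    if internal:
        problems.append(
            f"in-frame stop codon before the end (first at position {internal[0] + 1})"
        )
    return problems


# Toy example: ATG TTT GTT TAA
print(check_cds("ATGTTTGTTTAA"))
```

In the notebook, the same call would be `check_cds(sequence)` after the newline stripping above.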

## Designing a 5' UTR for efficient translation in human muscle cells

We're going to splice a new 5' UTR onto the SARS-CoV-2 spike protein coding sequence for efficient translation in human muscle cells.

In [None]:
# Human Insulin mRNA
# Split on the first start codon to get the 5' UTR
utr_sequence = """
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGATCACTGTCCTTCTGCCATGGCCCTGTG
GATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAAC
CAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACA
CACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGC
AGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACC
AGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGCAACTAGACGCAGCCCGCAGGCAGCCCCACACCCG
CCGCCTCCTGCACCGAGAGAGATGGAATAAAGCCCTTGAACCAGC
"""

utr_sequence = utr_sequence.replace('\n', '')
utr_sequence = utr_sequence.split('ATG')[0]  # Get only the upstream regulatory region
print(utr_sequence)

sample = {
    'id': 'INS',
    'sequence': utr_sequence,
    'label': 0,  # dummy label to fit with our other data later
}
# save as tsv
import pandas as pd
pd.DataFrame([sample]).to_csv("tmp/genes.tsv", index=False, sep="\t")
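
One caveat with `split('ATG')[0]`: if the sequence contains no `ATG`, it silently returns the whole sequence. The sketch below shows a slightly more defensive variant using `str.partition`; `five_prime_utr` is a hypothetical helper name, and the input is a toy sequence. Like the cell above, it treats the first `ATG` as the start codon, which is an assumption that does not hold for transcripts with upstream ATGs.

```python
def five_prime_utr(mrna: str) -> str:
    """Return everything upstream of the first ATG in an mRNA sequence."""
    utr, start_codon, _rest = mrna.partition("ATG")
    if not start_codon:
        # partition returns an empty separator when ATG is absent
        raise ValueError("no ATG found; cannot locate the start codon")
    return utr


print(five_prime_utr("GGCCACCATGGCTTAA"))
```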

### Download the AIDO.RNA Translation Efficiency Predictor

Many finetuned models are already available on HuggingFace.
This one was trained to predict translation efficiency, the ratio of ribosome footprint density to mRNA abundance: $\frac{\text{RiboSeq}}{\text{RNAseq}}$

We will optimize this for our 5' UTR design.

In [None]:
!huggingface-cli download genbio-ai/AIDO.RNA-1.6B-translation-efficiency-muscle \
    --local-dir tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-translation-efficiency-muscle

### Validate the pretrained predictor and save a reference dataset

In [None]:
# Configs make it easy to reproduce results.
# Test the model on the benchmark, and save the outputs for reference
!CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 mgen test --config tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-translation-efficiency-muscle/config.yaml \
    --ckpt_path tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-translation-efficiency-muscle/fold0/model.ckpt \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/efficiency_test_predictions \
    --trainer.callbacks.filetype tsv \
    --trainer.callbacks.write_cols+=sequences \
    --trainer.callbacks.write_cols+=predictions \
    --trainer.callbacks.write_cols+=labels

In [None]:
import plotting_utils
plotting_utils.prediction_histogram('tmp/efficiency_test_predictions/test_predictions.tsv')

### Predict on the new gene's 5' UTR

In [None]:
# Predict the translation efficiency of the UTR candidate
!CUDA_VISIBLE_DEVICES=1 mgen predict --config tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-translation-efficiency-muscle/config.yaml \
    --ckpt_path tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-translation-efficiency-muscle/fold0/model.ckpt \
    --data SequenceRegressionDataModule \
    --data.path tmp \
    --data.config_name null \
    --data.test_split_files genes.tsv \
    --data.cv_num_folds 0 \
    --data.cv_fold_id_col null \
    --data.batch_size 1 \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/efficiency_gene_predictions \
    --trainer.callbacks.filetype tsv \
    --trainer.callbacks.write_cols+=sequences \
    --trainer.callbacks.write_cols+=predictions \
    --trainer.callbacks.write_cols+=labels

In [None]:
plotting_utils.prediction_histogram(
    background_pred = 'tmp/efficiency_test_predictions/test_predictions.tsv',
    gene_pred = 'tmp/efficiency_gene_predictions/predict_predictions.tsv'
)

### Mutate the UTR and predict mutation effects

In [None]:
# Mutate the UTR
# Read in the genes and vocabulary
genes = pd.read_csv('tmp/genes.tsv', sep='\t')
vocab = ['A', 'T', 'C', 'G']

rows = []
# For each gene in the genes dataframe
for i, (id, gene, label) in genes.iterrows():
    gene_list = list(gene)
    for j in range(len(gene_list)):
        for v in vocab:
            new_gene = gene_list.copy()
            if new_gene[j] == v:
                continue
            new_gene[j] = v
            rows.append({'id': f'{id}_{gene_list[j]}{j+1}{v}', 'sequence': ''.join(new_gene), 'label': label})
    df = pd.DataFrame(rows)
    df.to_csv(f'tmp/{id}_mutated.tsv', sep='\t', index=False)

In [None]:
# Predict translation efficiency of mutant UTRs
!CUDA_VISIBLE_DEVICES=1,2,3,4 mgen predict --config tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-translation-efficiency-muscle/config.yaml \
    --ckpt_path tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-translation-efficiency-muscle/fold0/model.ckpt \
    --data SequenceRegressionDataModule \
    --data.path tmp \
    --data.test_split_files INS_mutated.tsv \
    --data.config_name null \
    --data.cv_num_folds 0 \
    --data.cv_fold_id_col null \
    --data.batch_size 4 \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/efficiency_mutant_predictions \
    --trainer.callbacks.filetype tsv \
    --trainer.callbacks.write_cols+=sequences \
    --trainer.callbacks.write_cols+=predictions \
    --trainer.callbacks.write_cols+=labels

In [None]:
plotting_utils.prediction_histogram(
    background_pred = 'tmp/efficiency_test_predictions/test_predictions.tsv',
    gene_pred = 'tmp/efficiency_gene_predictions/predict_predictions.tsv',
    mutated_pred = 'tmp/efficiency_mutant_predictions/predict_predictions.tsv',
    mutated_ids = 'tmp/INS_mutated.tsv'
)
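
Beyond the histogram, you will usually want the names of the best-scoring mutants. The sketch below joins a predictions file (columns `sequences`, `predictions`, `labels`, as written by the callback above) to the mutant ID file (columns `id`, `sequence`, `label`) by sequence and returns the top entries. It assumes the `predictions` column holds plain numbers; `rank_mutants` and the toy data are hypothetical.

```python
import csv
import io


def rank_mutants(pred_file, id_file, top_n=5):
    """Join predictions to mutant IDs by sequence; return the top_n by prediction."""
    seq_to_id = {
        row["sequence"]: row["id"] for row in csv.DictReader(id_file, delimiter="\t")
    }
    scored = [
        (seq_to_id.get(row["sequences"], "unknown"), float(row["predictions"]))
        for row in csv.DictReader(pred_file, delimiter="\t")
    ]
    # Highest predicted value first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy stand-ins for the mutant predictions and the mutant ID file
toy_preds = io.StringIO("sequences\tpredictions\tlabels\nCCG\t0.2\t0\nTCG\t0.9\t0\n")
toy_ids = io.StringIO("id\tsequence\tlabel\nINS_A1T\tTCG\t0\nINS_A1C\tCCG\t0\n")
for mutant_id, score in rank_mutants(toy_preds, toy_ids):
    print(mutant_id, score)
```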

### Alternative design workflow 1: Conditional generation

Foundation models can be used to generate new sequences, and ModelGenerator helps to adapt them to direct the generation process.

Here, we invert the efficiency prediction problem to generate new 5' UTR sequences given a desired translation efficiency.

In [None]:
# Fit the model
# !CUDA_VISIBLE_DEVICES=1 mgen fit \
#     --config train_conditional_diffusion.yaml \

# Test the model
# !CUDA_VISIBLE_DEVICES=0,1,2,3 mgen test \
#     --config train_conditional_diffusion.yaml \
#     --ckpt_path <path-to-your-ckpt>.ckpt

# Generate 5' UTR sequences for specific efficiencies
import numpy as np
expr_range = np.linspace(-2, 2, 10)
rows = []
for i, expr in enumerate(expr_range):
    rows.append({'id': f'gene_{i}', 'label': expr, 'sequence': ('[MASK]' * 50) + 'ATG'})
pd.DataFrame(rows).to_csv('tmp/genes_masked.tsv', sep='\t', index=False)

!CUDA_VISIBLE_DEVICES=0 mgen predict \
    --ckpt_path <your-ckpt>.ckpt \
    --model ConditionalDiffusion \
    --model.backbone aido_rna_650m_cds \
    --model.num_denoise_steps 50 \
    --model.sample_seq true \
    --model.sampling_temperature 0.5 \
    --model.verbose true \
    --data ConditionalDiffusionDataModule \
    --data.path experiments/Demo \
    --data.test_split_files genes_masked.tsv \
    --data.batch_size 1 \
    --data.normalize false

### Alternative workflow 2: Embedding Painting

In previous examples, we used finetuning to refine foundation models for specific tasks.

Sometimes foundation models learn useful properties from self-supervision, and no finetuning is necessary.

Here, we explore the pre-trained model's embeddings to see if it might have learned about translation efficiency without any further supervision.

This helps enable design even when data is so limited it can't be used for finetuning.

In [None]:
# Embed the wild-type sequence
!CUDA_VISIBLE_DEVICES=1 mgen predict --model Embed \
    --model.backbone aido_rna_1b600m \
    --data SequencesDataModule \
    --data.path tmp \
    --data.test_split_files genes.tsv \
    --data.batch_size 1 \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/efficiency_gene_embeddings \
    --trainer.callbacks.filetype pt

In [None]:
# Embed the mutant sequences
!CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 mgen predict --model Embed \
    --model.backbone aido_rna_1b600m \
    --data SequencesDataModule \
    --data.path tmp \
    --data.test_split_files INS_mutated.tsv \
    --data.batch_size 16 \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/efficiency_mutant_embeddings \
    --trainer.callbacks.filetype pt

In [None]:
# Compile the embeddings
!python get_mean_embeddings.py --directory tmp/efficiency_gene_embeddings
!python get_mean_embeddings.py --directory tmp/efficiency_mutant_embeddings

In [None]:
# Visualize the embeddings
plotting_utils.prediction_embeddings(
    gene_pred = 'tmp/efficiency_gene_predictions/predict_predictions.tsv',
    mutated_pred = 'tmp/efficiency_mutant_predictions/predict_predictions.tsv',
    mutated_ids = 'tmp/INS_mutated.tsv',
    mutated_embeddings = 'tmp/efficiency_mutant_embeddings/mean_embeddings.pt',
    gene_embedding = 'tmp/efficiency_gene_embeddings/mean_embeddings.pt',
)
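
The file name `mean_embeddings.pt` suggests that per-token embeddings are averaged into one vector per sequence, although `get_mean_embeddings.py` is not shown here. Under that assumption, the sketch below illustrates the two operations involved in comparing a mutant to the wild type: mean pooling and cosine similarity. It uses plain Python lists and made-up numbers to stay dependency-free; with real `.pt` files you would do the same with tensors.

```python
import math


def mean_pool(token_embeddings):
    """Average per-token vectors (a list of equal-length lists) into one vector."""
    n = len(token_embeddings)
    return [sum(column) / n for column in zip(*token_embeddings)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-dimensional token embeddings for a wild type and a mutant
wild_type = mean_pool([[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]])
mutant = mean_pool([[1.0, 0.0], [0.1, 0.9], [0.9, 0.1]])
print(f"cosine similarity = {cosine_similarity(wild_type, mutant):.3f}")
```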

## Optimizing the spike protein coding sequence for protein abundance in human cells

### Download the AIDO.RNA Abundance Predictor and run through the simple mutation workflow

In [None]:
!huggingface-cli download genbio-ai/AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens \
    --local-dir tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens

In [None]:
# Validate the model using the config it comes with, save the predictions to compare
!mgen test --config tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens/config.yaml \
    --ckpt_path tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens/fold0/model.ckpt \
    --model.strict_loading false \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/protein_abundance_test_predictions \
    --trainer.callbacks.filetype tsv \
    --trainer.callbacks.write_cols+=sequences \
    --trainer.callbacks.write_cols+=predictions \
    --trainer.callbacks.write_cols+=labels

In [None]:
import plotting_utils
plotting_utils.prediction_histogram('tmp/protein_abundance_test_predictions/test_predictions.tsv')

In [None]:
!mgen predict --config tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens/config.yaml \
    --ckpt_path tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens/fold0/model.ckpt \
    --model.strict_loading false \
    --data SequenceRegressionDataModule \
    --data.path tmp \
    --data.config_name null \
    --data.test_split_files spike_genes.tsv \
    --data.cv_num_folds 0 \
    --data.cv_fold_id_col null \
    --data.batch_size 1 \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/protein_abundance_gene_predictions \
    --trainer.callbacks.filetype tsv \
    --trainer.callbacks.write_cols+=sequences \
    --trainer.callbacks.write_cols+=predictions \
    --trainer.callbacks.write_cols+=labels

In [None]:
plotting_utils.prediction_histogram(
    background_pred = 'tmp/protein_abundance_test_predictions/test_predictions.tsv',
    gene_pred = 'tmp/protein_abundance_gene_predictions/predict_predictions.tsv'
)

### Mutate the sequence with __silent__ mutations and predict the effects on protein abundance

Some mutations improve abundance without changing the resulting antigen!

In [None]:
# Mutate the gene
# Read in the genes and vocabulary
genes = pd.read_csv('tmp/spike_genes.tsv', sep='\t')
vocab = ['A', 'T', 'C', 'G']

# Get codon mapping
codon_lookup = {
    'TCA': 'S',
    'TCC': 'S',
    'TCG': 'S',
    'TCT': 'S',
    'TTC': 'F',
    'TTT': 'F',
    'TTA': 'L',
    'TTG': 'L',
    'TAC': 'Y',
    'TAT': 'Y',
    'TAA': '*',
    'TAG': '*',
    'TGC': 'C',
    'TGT': 'C',
    'TGA': '*',
    'TGG': 'W',
    'CTA': 'L',
    'CTC': 'L',
    'CTG': 'L',
    'CTT': 'L',
    'CCA': 'P',
    'CCC': 'P',
    'CCG': 'P',
    'CCT': 'P',
    'CAC': 'H',
    'CAT': 'H',
    'CAA': 'Q',
    'CAG': 'Q',
    'CGA': 'R',
    'CGC': 'R',
    'CGG': 'R',
    'CGT': 'R',
    'ATA': 'I',
    'ATC': 'I',
    'ATT': 'I',
    'ATG': 'M',
    'ACA': 'T',
    'ACC': 'T',
    'ACG': 'T',
    'ACT': 'T',
    'AAC': 'N',
    'AAT': 'N',
    'AAA': 'K',
    'AAG': 'K',
    'AGC': 'S',
    'AGT': 'S',
    'AGA': 'R',
    'AGG': 'R',
    'GTA': 'V',
    'GTC': 'V',
    'GTG': 'V',
    'GTT': 'V',
    'GCA': 'A',
    'GCC': 'A',
    'GCG': 'A',
    'GCT': 'A',
    'GAC': 'D',
    'GAT': 'D',
    'GAA': 'E',
    'GAG': 'E',
    'GGA': 'G',
    'GGC': 'G',
    'GGG': 'G',
    'GGT': 'G' 
}
# Make reverse lookup
codon_rev_lookup = {}
for k, v in codon_lookup.items():
    codon_rev_lookup[v] = codon_rev_lookup.get(v, [])
    codon_rev_lookup[v].append(k)

rows = []
# For each gene in the genes dataframe
for i, (id, gene, label) in genes.iterrows():
    gene_list = list(gene)
    for j in range(0, len(gene_list), 3):
        if j > 1024:
            break
        codon = ''.join(gene_list[j:j+3])
        if len(codon) < 3:
            continue
        # Only make synonymous mutations
        aa = codon_lookup[codon]
        synonymous_codons = codon_rev_lookup[aa]
        for v in synonymous_codons:
            if v == codon:
                continue
            new_gene = gene_list.copy()
            new_gene[j:j+3] = list(v)
            rows.append({'id': f'{id}_{j}{v}', 'sequence': ''.join(new_gene), 'label': label})
    df = pd.DataFrame(rows)
    df.to_csv(f'tmp/{id}_silent_mutated.tsv', sep='\t', index=False)
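
A quick way to confirm that the generated mutants really are silent is to translate each one and compare it to the wild-type protein. The sketch below builds the standard genetic code from its conventional compact form (bases in `TCAG` order) and checks the `TGT` to `TGC` swap as an example. `translate` and `is_silent` are hypothetical helpers for this demo.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a DNA coding sequence codon by codon; trailing partial codons are ignored."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


def is_silent(wild_type: str, mutant: str) -> bool:
    """True if both sequences encode the same protein."""
    return translate(wild_type) == translate(mutant)


print(translate("ATGTGTGTT"))
print(is_silent("ATGTGTGTT", "ATGTGCGTT"))
```

Applied to the notebook, `all(is_silent(sequence, m) for m in df['sequence'])` would check every row of the mutant table.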

In [None]:
!CUDA_VISIBLE_DEVICES=1,2 taskset -c 32-63 mgen predict --config tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens/config.yaml \
    --ckpt_path tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-CDS-protein-abundance-hsapiens/fold0/model.ckpt \
    --model.strict_loading false \
    --data SequenceRegressionDataModule \
    --data.path tmp \
    --data.config_name null \
    --data.test_split_files SPIKE_SARS2_silent_mutated.tsv \
    --data.cv_num_folds 0 \
    --data.cv_fold_id_col null \
    --data.batch_size 4 \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/spike_mutant_predictions \
    --trainer.callbacks.filetype tsv \
    --trainer.callbacks.write_cols+=sequences \
    --trainer.callbacks.write_cols+=predictions \
    --trainer.callbacks.write_cols+=labels

In [None]:
plotting_utils.prediction_histogram(
    background_pred = 'tmp/protein_abundance_test_predictions/test_predictions.tsv',
    gene_pred = 'tmp/protein_abundance_gene_predictions/predict_predictions.tsv',
    mutated_pred = 'tmp/spike_mutant_predictions/predict_predictions.tsv',
    mutated_ids = 'tmp/SPIKE_SARS2_silent_mutated.tsv',
)

### Wild-type CDS
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAG __TGT__ GTTAATCTTACAACCAGAACTCAAT
TACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGT
...
### Optimized CDS
ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAG __TGC__ GTTAATCTTACAACCAGAACTCAAT
TACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGT
...

In [None]:
# Compile UTR and CDS for the candidate molecule
utr = "AATTTGCAGCCTCAGCCCCCAGCCATCTGCCGACCCCCCCACCCCAGGCCCTA"
candidate = utr + sequence
print(candidate)

pd.DataFrame([{'id': 'SPIKE_SARS2_Candidate', 'sequence': candidate[:1024], 'label': 0}]).to_csv("tmp/spike_candidate.tsv", index=False, sep="\t")

# Predict Candidate mRNA Folding

Many of the predicted optimizations will relate to structural changes in the mRNA. Understanding the folding of the mRNA can help inform future designs.

In [None]:
## make a directory to store the sample for inference
!mkdir -p tmp/RNA_SS/CUSTOM_SS_DATA/test/ 

In [None]:
## 1. Chunk the sequence into fragments of length 500 (the model was trained on sequences of at most 1000 nt, and vRAM is limited). We merge the fragments later; this is a commonly used approximation.
## 2. Generate .ct files with a "dummy ss" just to keep the same format (this will NOT be used for evaluation in an inference-only run).

plotting_utils.generate_ct_files(candidate, "tmp/RNA_SS/CUSTOM_SS_DATA/test/spike_gene")
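
The internals of `generate_ct_files` are not shown here, but the chunking step it describes can be sketched as follows. `chunk_sequence` is a hypothetical helper that splits a sequence into consecutive, non-overlapping fragments and records each fragment's offset so that predictions can be merged back later. One consequence of this approximation is that base pairs spanning a fragment boundary cannot be predicted.

```python
def chunk_sequence(seq: str, chunk_len: int = 500):
    """Split seq into (offset, fragment) pairs of at most chunk_len characters."""
    return [
        (start, seq[start:start + chunk_len])
        for start in range(0, len(seq), chunk_len)
    ]


# Toy sequence of 1200 nt
toy = "ACGU" * 300
for offset, fragment in chunk_sequence(toy):
    print(offset, len(fragment))
```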

In [None]:
!huggingface-cli download genbio-ai/AIDO.RNA-1.6B-bpRNA_secondary_structure_prediction \
    --local-dir tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-bpRNA_secondary_structure_prediction

In [None]:
## run inference for the candidate
!mgen predict --config experiments/AIDO.RNA/rna_secondary_structure_prediction/rna_ss_prediction.yaml \
    --data.path tmp/RNA_SS/ \
    --data.dataset CUSTOM_SS_DATA \
    --trainer.default_root_dir tmp/RNA_SS/outputs/ \
    --trainer.callbacks.ft_schedule_path ../rna_secondary_structure_prediction/ft_schedules/layers_0_32.yaml \
    --ckpt_path tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-bpRNA_secondary_structure_prediction/model.ckpt \
    --trainer.devices 0,

In [None]:
## merge the ss of individual chunks and convert to dot-bracket and matrix formats
import numpy as np
from matplotlib import pyplot as plt
from genbio_scripts import plotting_utils

nucleotides, rna_ss, ss_probab_matrix = plotting_utils.ct_to_dot_bracket("tmp/RNA_SS/outputs/", "tmp/RNA_SS/spike_pred_ss.dot", len(candidate))
print(">inferred\n" + nucleotides + "\n" + rna_ss + "\n")

ss_binary_matrix = plotting_utils.dot_bracket_to_matrix(rna_ss)
plt.figure(figsize=(10, 10))
plt.imshow(ss_binary_matrix, cmap='gray_r', interpolation='nearest')
plt.title('SS binary contact map')
plt.show()
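
For readers unfamiliar with dot-bracket notation, the conversion to a contact map can be sketched with a stack: each `(` is pushed, and each `)` pairs with the most recent unmatched `(`. The helpers below are an illustrative re-implementation, not the code in `plotting_utils`, and they handle only plain parentheses (no pseudoknot brackets).

```python
def dot_bracket_to_pairs(structure: str):
    """Return sorted (i, j) index pairs for matching parentheses in a dot-bracket string."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


def pairs_to_matrix(pairs, length: int):
    """Build a symmetric 0/1 contact matrix from base-pair indices."""
    matrix = [[0] * length for _ in range(length)]
    for i, j in pairs:
        matrix[i][j] = matrix[j][i] = 1
    return matrix


# A small hairpin: positions 0-5 and 1-4 are paired
structure = "((..))"
pairs = dot_bracket_to_pairs(structure)
print(pairs)
for row in pairs_to_matrix(pairs, len(structure)):
    print(row)
```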

# Alternative workflow: Dependency Mapping

[Nucleotide Dependency Mapping](https://www.biorxiv.org/content/10.1101/2024.07.27.605418v1) produces a 2D grid of dependency values between every pair of nucleotides, indicating their co-conservation and likely functional dependency.
The grid of predicted dependencies is composed of dependency pixels $e_{i, j}$ for each nucleotide pair $i$ and $j$, where

$$e_{i, j} = \max_{k,q\in\{A,T,C,G\}} \left| \log_2 \left( \frac{\hat{\text{odds}}(n_j=k \mid n_1, ..., n_i=q, ..., n_L)}{\hat{\text{odds}}(n_j=k \mid n_1, ..., n_L)} \right) \right|,$$

such that $k$ and $q$ are the key and query nucleotide types, $n$ is a length $L$ DNA sequence, and $\hat{\text{odds}}$ are the odds inferred using the pretrained AIDO.DNA.
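
To make the formula concrete, the sketch below computes a single dependency pixel $e_{i,j}$ from probabilities at position $j$. The numbers are made up; in the real workflow they would come from the model's predictions for the reference sequence and for each substitution $q$ at position $i$. The function names are hypothetical.

```python
import math


def odds(p: float) -> float:
    """Convert a probability to odds."""
    return p / (1.0 - p)


def dependency(ref_probs, alt_probs_by_query):
    """Max absolute log2 odds ratio over query substitutions q and target nucleotides k.

    ref_probs: {k: P(n_j = k | reference sequence)}
    alt_probs_by_query: {q: {k: P(n_j = k | sequence with n_i = q)}}
    """
    return max(
        abs(math.log2(odds(alt_probs[k]) / odds(ref_probs[k])))
        for alt_probs in alt_probs_by_query.values()
        for k in ref_probs
    )


# Made-up probabilities at position j for the reference and two substitutions at i
reference = {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10}
substituted = {
    "C": {"A": 0.40, "C": 0.30, "G": 0.20, "T": 0.10},
    "G": {"A": 0.65, "C": 0.15, "G": 0.10, "T": 0.10},
}
print(f"e_ij = {dependency(reference, substituted):.3f}")
```

A value near zero means that changing nucleotide $i$ barely moves the model's belief about nucleotide $j$; large values flag candidate interactions.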

In [None]:
# Dependency Mapping
!mgen predict --config ../dependency_mapping/config.yaml \
    --data.path tmp \
    --data.test_split_files spike_candidate.tsv

!python experiments/AIDO.RNA/dependency_mapping/plot_dependency_maps.py \
    -i depmap_predictions \
    -o depmap_plots \
    -v ../dependency_mapping/DNA.txt \
    -t ../../../modelgenerator/huggingface_models/rnabert/vocab.txt

In [None]:
from PIL import Image
# Image.open('depmap_plots/>NR_002092.1|RNAseP|Drosophila_heatmap.png')
Image.open('depmap_plots/SPIKE_SARS2_Candidate_heatmap.png')

# Protein Stabilization

Here we make single mutations to the protein to improve stability and minimize degradation, while keeping the antigen structure similar.

In [None]:
# download protein sequence
from modelgenerator.structure_tokenizer.datasets.protein import Protein
from modelgenerator.structure_tokenizer.utils.constants import residue_constants as RC
import pandas as pd

# 6vxx A chain
pdb_id, chain_id = '6vxx', 'A'
!wget -qnc https://files.rcsb.org/download/{pdb_id}.pdb

aatype_tensor = Protein.from_pdb_file_path(f'{pdb_id}.pdb', chain_id).aatype
seq = "".join(list(RC.restype_1to3)[i] for i in aatype_tensor)
seq

In [None]:
# save as tsv
sample = {
    'id': '6vxx_A',
    'sequence': seq,
    'label': 0,
}
pd.DataFrame([sample]).to_csv("tmp/protein.tsv", index=False, sep="\t")

In [None]:
# Todo: upload stability predictor to HF
# !huggingface-cli download genbio-ai/AIDO.RNA-1.6B-bpRNA_secondary_structure_prediction \
#     --local-dir tmp/hf_models/genbio-ai/AIDO.RNA-1.6B-bpRNA_secondary_structure_prediction

In [None]:
# Predict protein stability
!CUDA_VISIBLE_DEVICES=0 mgen predict --config ../../AIDO.Protein/xTrimo/configs/stability.yaml \
    --ckpt_path best_val:epoch=5-val_spearman=0.821.ckpt \
    --data SequenceRegressionDataModule \
    --data.path tmp \
    --data.test_split_files protein.tsv \
    --data.config_name null \
    --data.cv_num_folds 0 \
    --data.cv_fold_id_col null \
    --data.batch_size 1 \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/stability_protein_predictions \
    --trainer.callbacks.filetype tsv \
    --trainer.callbacks.write_cols+=sequences \
    --trainer.callbacks.write_cols+=predictions \
    --trainer.callbacks.write_cols+=labels

In [None]:
# Get all possible single-residue mutations
vocab = ["L","A", "G", "V", "S", "E", "R", "T", "I", "D", "P", "K", "Q", "N", "F", "Y", "M", "H", "W", "C", "X"]
rows = []
gene_list = list(seq)
# add original seq
rows.append({'mutation': f'', 'sequence': seq, 'label':-1})
for j in range(len(gene_list)):
    for v in vocab:
        new_gene = gene_list.copy()
        if new_gene[j] == v:
            continue
        new_gene[j] = v
        # fake label required by the sequence regression module
        rows.append({'mutation': f'{gene_list[j]}{j+1}{v}', 'sequence': ''.join(new_gene), 'label':-1})
df = pd.DataFrame(rows)
df.to_csv(f'tmp/protein_mutated.tsv', sep='\t', index=False)

In [None]:
!CUDA_VISIBLE_DEVICES=0,1,2,3 mgen predict --config ../../AIDO.Protein/xTrimo/configs/stability.yaml \
    --ckpt_path best_val:epoch=5-val_spearman=0.821.ckpt \
    --data SequenceRegressionDataModule \
    --data.path tmp \
    --data.test_split_files protein_mutated.tsv \
    --data.config_name null \
    --data.cv_num_folds 0 \
    --data.cv_fold_id_col null \
    --data.batch_size 16 \
    --trainer.callbacks+=modelgenerator.callbacks.PredictionWriter \
    --trainer.callbacks.output_dir tmp/stability_mutant_predictions \
    --trainer.callbacks.filetype tsv \
    --trainer.callbacks.write_cols+=sequences \
    --trainer.callbacks.write_cols+=predictions \
    --trainer.callbacks.write_cols+=labels

In [None]:
plotting_utils.prediction_histogram(
    background_pred = 'tmp/stability_mutant_predictions/predict_predictions.tsv',
    gene_pred = 'tmp/stability_protein_predictions/predict_predictions.tsv',
    background_pred_var = 'predictions',
)

In [None]:
# Get the wild-type sequence score and the mutated sequences with the highest stability scores
import pandas as pd
df = pd.read_csv('tmp/stability_mutant_predictions/predict_predictions.tsv', sep='\t')
top_10 = df.nlargest(10, 'predictions')[['predictions', 'sequences']]
origin_seq_score = df[df['sequences']==''.join(gene_list)]['predictions'].values[0]
print(f'COVID protein stability score {origin_seq_score:.4f}')
print(top_10)
best_mutated_seq = top_10.iloc[0]['sequences']
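
The predictions file carries full sequences rather than mutation names, so it is useful to recover which substitution each top candidate corresponds to. The sketch below compares two equal-length sequences and reports differences in the same `{ref}{position}{alt}` style used when the mutants were generated. `describe_substitutions` is a hypothetical helper; positions index into the sequence extracted from the PDB chain, which may not match canonical spike residue numbering.

```python
def describe_substitutions(wild_type: str, mutant: str):
    """List substitutions between two equal-length sequences, e.g. ['D3F']."""
    if len(wild_type) != len(mutant):
        raise ValueError("sequences must have the same length")
    return [
        f"{ref}{position}{alt}"
        for position, (ref, alt) in enumerate(zip(wild_type, mutant), start=1)
        if ref != alt
    ]


# Toy peptides differing at position 3
print(describe_substitutions("ACDE", "ACFE"))
```

In the notebook this would be called as `describe_substitutions(seq, best_mutated_seq)`.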

# Protein Folding

Check if the stabilized protein has a similar 3D structure to the wild-type antigen.

In [None]:
# Structure prediction
!rm -rf tmp_str # cleaning up is necessary because the decoder will skip already existing files
!mkdir -p tmp_str

In [None]:
# visualize top 10 structures
df = pd.DataFrame(data={"idx": [i for i in range(10)], "aa_seq": [top_10['sequences'].values[i] for i in range(10)], "seq_len": [len(best_mutated_seq)]*10})
df.to_csv("tmp_str/tmp.csv", index=False)

In [None]:
# language model: amino acid sequence -> structure tokens
!WANDB_MODE=dryrun CUDA_VISIBLE_DEVICES=0 mgen predict --config ../AIDO.StructureTokenizer/protein2structoken_16b.yaml \
            --data.init_args.path "csv" \
            --data.init_args.test_split_files ["tmp_str/tmp.csv"]

In [None]:
# post process
!python ../../AIDO.StructureTokenizer/struct_token_format_conversion.py logs/protein2structoken_16b/predict_predictions.tsv logs/protein2structoken_16b/predict_predictions.pt
!python ../../AIDO.StructureTokenizer/extract_structure_tokenizer_codebook.py --output_path logs/protein2structoken_16b/codebook.pt

In [None]:
# Decode: structure tokens -> 3D coordinates
!WANDB_MODE=dryrun CUDA_VISIBLE_DEVICES=0,1,2,3 mgen predict --config ../AIDO.StructureTokenizer/decode.yaml \
 --data.init_args.config.struct_tokens_datasets_configs.name=protein2structoken_16b \
 --data.init_args.config.struct_tokens_datasets_configs.struct_tokens_path=./logs/protein2structoken_16b/predict_predictions.pt \
 --data.init_args.config.struct_tokens_datasets_configs.codebook_path=./logs/protein2structoken_16b/codebook.pt \
 --data.init_args.config.struct_tokens_datasets_configs.batch_size=1

In [None]:
from modelgenerator.structure_tokenizer.datasets.protein import Protein
from modelgenerator.structure_tokenizer.utils.constants import residue_constants as RC
prediction = "logs/protstruct_decode/protein2structoken_16b_pdb_files/4__output.pdb"
ground_truth = f"{pdb_id}_{chain_id}.pdb"
# drop the additional chain in the ground truth before visualization
Protein.from_pdb_file_path(f'{pdb_id}.pdb', chain_id).to_pdb(f"{pdb_id}_{chain_id}.pdb")

# Ground Truth Structure

In [None]:
from genbio_scripts import plotting_utils
pdb_id, chain_id = '6vxx', 'A'
plotting_utils.show_protein(f"{pdb_id}_{chain_id}.pdb")

# Stabilized Structure

In [None]:
plotting_utils.show_protein(prediction)