# Background

## DNA Melting Point

__Wikipedia:__

Nucleic acid thermodynamics is the study of how temperature affects the nucleic acid structure of double-stranded DNA (dsDNA). The melting temperature (Tm) is defined as the temperature at which half of the DNA strands are in the random coil or single-stranded (ssDNA) state. Tm depends on the length of the DNA molecule and its specific nucleotide sequence. DNA, when in a state where its two strands are dissociated (i.e., the dsDNA molecule exists as two independent strands), is referred to as having been denatured by the high temperature.
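
To make the dependence on length and sequence concrete, here is a minimal sketch of the Wallace rule, a textbook rule of thumb for short oligonucleotides (roughly 14 nt or fewer) that assigns 2 °C per A/T base and 4 °C per G/C base. It is not the method used to produce the melting points in this notebook's dataset, and the function name `wallace_tm` is our own.

```python
def wallace_tm(sequence: str) -> float:
    """Estimate Tm in degrees Celsius with the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


# Longer and more GC-rich sequences get higher estimates
for oligo in ["ATATATAT", "ATGCATGC", "GCGCGCGC", "GCGCGCGCGCGC"]:
    print(f"{oligo}: {wallace_tm(oligo):.0f} C")
```

The rule ignores base order and salt concentration, which is one reason a learned model over sequence embeddings may be attractive for gene-length sequences.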


## This notebook

In this notebook we use the GenSLM 25M-parameter language model to generate embeddings for DNA sequences, then train a downstream model on those embeddings to predict each sequence's melting point. This workflow is common to many bioinformatics tasks and adapts easily to other regression and classification problems.

In [None]:
# Installing GenSLM
# NOTE: You may need to run this cell twice due to the environment reload
!pip install git+https://github.com/ramanathanlab/genslm

Collecting git+https://github.com/ramanathanlab/genslm
  Cloning https://github.com/ramanathanlab/genslm to /tmp/pip-req-build-ix5b5ho0
  Running command git clone --filter=blob:none --quiet https://github.com/ramanathanlab/genslm /tmp/pip-req-build-ix5b5ho0
  Resolved https://github.com/ramanathanlab/genslm to commit e4fbf3b8e641150d708c18e12d551de8ed0cae1c
  Preparing metadata (setup.py) ... done
Collecting transformers@ git+https://github.com/maxzvyagin/transformers (from genslm==0.0.4a1)
  Cloning https://github.com/maxzvyagin/transformers to /tmp/pip-install-llnybjky/transformers_6c0a0afb79344dea9bf91c9be61e3141
  Running command git clone --filter=blob:none --quiet https://github.com/maxzvyagin/transformers /tmp/pip-install-llnybjky/transformers_6c0a0afb79344dea9bf91c9be61e3141
  Resolved https://github.com/maxzvyagin/transformers to commit ffd5aba0ad41a1ebd1897a77f6a3782fc2d75e1f
  Installing build dependencies ... done
  Getting requirements to build whe

In [None]:
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn import svm
from google.colab import drive
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

from genslm import GenSLM, SequenceDataset

# Acquiring Model and Data

Visit: https://drive.google.com/drive/folders/1oYgda4Px-tugapgE2uumiUIf2p3PqIQI?usp=drive_link

- Right-click the `UMichSciFM-2024` folder and choose Organize -> Add Shortcut -> All Locations -> My Drive

Executing the cell below mounts your Google Drive in this notebook, giving you access to the model checkpoint and data.

In [None]:
# Mount and see file structure
drive.mount("/content/drive")

!ls drive/MyDrive/UMichSciFM-2024/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
data  model


In [None]:
# Load and view the dataset (the train/test split happens in the next cell)
data = pd.read_csv("drive/MyDrive/UMichSciFM-2024/data/meltingpoint.csv", index_col=0)
data

Unnamed: 0,Sequence,MeltingPoint
0,atgattatttccgcagccagcgattatcgcgccgcagcacaacgca...,82.112491
1,atggctaagctgaccaagcgcatgcgcgtgatccgtgacaaagttg...,80.338892
2,atgtttaaaaataaaatgatgatttgtctttatatgtttctattat...,76.102904
3,atgggtcgactggaaggaaaggtagcgatcgtcacgggcggtgcgc...,86.743695
4,atgcgtctaaaccccggccaacaacaagctgtcgaattcgttaccg...,81.709235
...,...,...
9411,gtggatatgagtaatacaagtgcagcaccacgtgacacgtgggggt...,78.878742
9412,ttggttgagcgccacgacatcgcaaccggtgccaccgggcgtaacc...,82.666703
9413,atgttccgttcgcttcttcgcctgtctgcagcgttgctggccttga...,85.151774
9414,gtgaaattactagatttattgtcaaaaggaattgtaataggtgatg...,75.071559


In [None]:
# Split dataset for use later

# Returns two independent dataframes that we will use for
# melting point modelling
train, test = train_test_split(data, train_size=1000, test_size=200)
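
`train_test_split` shuffles the rows before splitting, so each run of the cell above selects a different 1000/200 subset unless a seed is fixed (scikit-learn exposes this through the `random_state` argument). The sketch below illustrates the idea with a plain-Python stand-in, not scikit-learn's implementation; `split_indices` and the seed value are hypothetical, and the row count assumes the index shown above (0 to 9414) is contiguous.

```python
import random


def split_indices(n_rows, train_size, test_size, seed):
    """Return disjoint, reproducible train/test row indices."""
    if train_size + test_size > n_rows:
        raise ValueError("train_size + test_size exceeds the number of rows")
    rng = random.Random(seed)  # seeded RNG: same shuffle on every run
    order = list(range(n_rows))
    rng.shuffle(order)
    train_idx = order[:train_size]
    test_idx = order[train_size:train_size + test_size]
    return train_idx, test_idx


# 9415 rows assumes a contiguous index from 0 to 9414
train_idx, test_idx = split_indices(9415, train_size=1000, test_size=200, seed=0)
print(len(train_idx), len(test_idx), set(train_idx).isdisjoint(test_idx))
```

Fixing the seed makes downstream numbers such as the test-set score repeatable between runs.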

# Begin Modelling

Below is an example of generating embeddings with GenSLM-25M. We will follow this general workflow to generate embeddings for our dataset, then use a downstream model to predict the melting point of an input sequence.

In [None]:


# Load model
model = GenSLM("genslm_25M_patric", model_cache_dir="drive/MyDrive/UMichSciFM-2024/model")
model.eval()

# Select GPU device if it is available, else use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Input data is a list of gene sequences
sequences = [
    "ATGAAAGTAACCGTTGTTGGAGCAGGTGCAGTTGGTGCAAGTTGCGCAGAATATATTGCA",
    "ATTAAAGATTTCGCATCTGAAGTTGTTTTGTTAGACATTAAAGAAGGTTATGCCGAAGGT",
]

example_dataset = SequenceDataset(sequences, model.seq_length, model.tokenizer)
example_dataloader = DataLoader(example_dataset, batch_size=2)

# Compute averaged-embeddings for each input sequence
embeddings = []
with torch.no_grad():
    for batch in example_dataloader:
        outputs = model(
            batch["input_ids"].to(device),
            batch["attention_mask"].to(device),
            output_hidden_states=True,
        )
        # outputs.hidden_states shape: (layers, batch_size, sequence_length, hidden_size)
        # Use the embeddings of the last layer
        emb = outputs.hidden_states[-1].detach().cpu().numpy()
        # Compute average over sequence length
        emb = np.mean(emb, axis=1)
        embeddings.append(emb)

# Concatenate embeddings into an array of shape (num_sequences, hidden_size)
embeddings = np.concatenate(embeddings)
embeddings.shape

Tokenizing...: 100%|██████████| 2/2 [00:00<00:00, 204.53it/s]


(2, 512)
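
`np.mean(emb, axis=1)` averages over every token position, including any padding positions. If `SequenceDataset` pads sequences to `model.seq_length` (an assumption about its behavior, not something this notebook verifies), a masked mean that weights positions by the attention mask is one alternative. A minimal NumPy sketch, with the hypothetical helper `masked_mean`:

```python
import numpy as np


def masked_mean(hidden, attention_mask):
    """Average hidden states over non-padded positions only.

    hidden:         (batch_size, sequence_length, hidden_size)
    attention_mask: (batch_size, sequence_length), 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Toy batch: 2 sequences, 3 positions, hidden size 2
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                   [[5.0, 6.0], [100.0, 100.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0],
                 [1, 0, 0]])
pooled = masked_mean(hidden, mask)
print(pooled.shape)
```

In the notebook's loop this would take `emb` and `batch["attention_mask"].numpy()` in place of the plain `np.mean`. The results printed in this notebook were produced with the unmasked mean.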

In [None]:
# Get embeddings for training dataset
train_dataset = SequenceDataset(train.Sequence.values, model.seq_length, model.tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=8)

# Compute averaged-embeddings for each input sequence
train_embeddings = []
with torch.no_grad():
    for batch in tqdm(train_dataloader, desc="Embedding"):
        outputs = model(
            batch["input_ids"].to(device),
            batch["attention_mask"].to(device),
            output_hidden_states=True,
        )
        # outputs.hidden_states shape: (layers, batch_size, sequence_length, hidden_size)
        # Use the embeddings of the last layer
        emb = outputs.hidden_states[-1].detach().cpu().numpy()
        # Compute average over sequence length
        emb = np.mean(emb, axis=1)
        train_embeddings.append(emb)

# Concatenate embeddings into an array of shape (num_sequences, hidden_size)
train_embeddings = np.concatenate(train_embeddings)
train_embeddings.shape

Tokenizing...: 100%|██████████| 1000/1000 [00:08<00:00, 123.48it/s]
Embedding: 100%|██████████| 125/125 [01:38<00:00,  1.27it/s]


(1000, 512)
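
The same embedding loop is used for the example, training, and test sets. One way to avoid repeating it is a small helper that walks over the inputs in batches and concatenates the per-batch arrays. The sketch below is framework-agnostic: `embed_in_batches` and the `embed_batch` callable are hypothetical names, and a toy function stands in for the GenSLM forward pass.

```python
import numpy as np


def embed_in_batches(items, embed_batch, batch_size=8):
    """Apply embed_batch to consecutive chunks of items and stack the results.

    embed_batch must return an array of shape (len(chunk), hidden_size).
    """
    chunks = []
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        chunks.append(np.asarray(embed_batch(chunk)))
    return np.concatenate(chunks)


def toy_embed(batch):
    # Stand-in for the model: 2 features per sequence (length and G/C count)
    return np.array(
        [[len(s), s.count("G") + s.count("C")] for s in batch], dtype=float
    )


seqs = ["ATGC", "GGCC", "AT", "ATATGC", "G"]
features = embed_in_batches(seqs, toy_embed, batch_size=2)
print(features.shape)
```

With GenSLM, `embed_batch` would wrap the tokenization, forward pass, and averaging steps shown in the cells above.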

In [None]:
# Train SVM on embeddings for melting point
mp_regr = svm.SVR()
mp_regr.fit(train_embeddings, train.MeltingPoint.values)


# Evaluation

In [None]:
# Get embeddings for evaluation dataset
test_dataset = SequenceDataset(test.Sequence.values, model.seq_length, model.tokenizer)
test_dataloader = DataLoader(test_dataset, batch_size=8)

# Compute averaged-embeddings for each input sequence
test_embeddings = []
with torch.no_grad():
    for batch in tqdm(test_dataloader, desc="Embedding"):
        outputs = model(
            batch["input_ids"].to(device),
            batch["attention_mask"].to(device),
            output_hidden_states=True,
        )
        # outputs.hidden_states shape: (layers, batch_size, sequence_length, hidden_size)
        # Use the embeddings of the last layer
        emb = outputs.hidden_states[-1].detach().cpu().numpy()
        # Compute average over sequence length
        emb = np.mean(emb, axis=1)
        test_embeddings.append(emb)

# Concatenate embeddings into an array of shape (num_sequences, hidden_size)
test_embeddings = np.concatenate(test_embeddings)
test_embeddings.shape

Tokenizing...: 100%|██████████| 200/200 [00:00<00:00, 508.65it/s]
Embedding: 100%|██████████| 25/25 [00:19<00:00,  1.27it/s]


(200, 512)

In [None]:
# Evaluate the performance of the regressor on a held-out test set

r2 = mp_regr.score(test_embeddings, test.MeltingPoint.values)

print(f"Regressor R^2 {r2} for test set")

# Test a few examples and see predictions
example_predictions = mp_regr.predict(test_embeddings[:10])

# zip stops after the 10 predictions, so only the first 10 test rows are printed
for (_, row), pred_val in zip(test.iterrows(), example_predictions):
    print(f"Empirical melting point: {row['MeltingPoint']:.3f}\t\tPredicted melting point: {pred_val:.3f}")

Regressor R^2 0.9679511430339508 for test set
Empirical melting point: 76.496		Predicted melting point: 74.922
Empirical melting point: 84.790		Predicted melting point: 84.878
Empirical melting point: 81.261		Predicted melting point: 81.668
Empirical melting point: 74.420		Predicted melting point: 75.971
Empirical melting point: 81.411		Predicted melting point: 81.480
Empirical melting point: 80.978		Predicted melting point: 81.500
Empirical melting point: 88.151		Predicted melting point: 88.375
Empirical melting point: 80.331		Predicted melting point: 79.350
Empirical melting point: 85.033		Predicted melting point: 84.702
Empirical melting point: 82.632		Predicted melting point: 82.331
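
`mp_regr.score` reports the coefficient of determination, R^2 = 1 - SS_res / SS_tot. As a rough illustration, the sketch below computes R^2 and the mean absolute error by hand for the ten example pairs printed above. Because it covers only ten rows, its R^2 will not match the 200-row test-set score; the helper names are our own.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# The ten empirical/predicted pairs printed by the cell above
empirical = [76.496, 84.790, 81.261, 74.420, 81.411,
             80.978, 88.151, 80.331, 85.033, 82.632]
predicted = [74.922, 84.878, 81.668, 75.971, 81.480,
             81.500, 88.375, 79.350, 84.702, 82.331]

print(f"R^2 on the ten examples: {r_squared(empirical, predicted):.3f}")
print(f"MAE on the ten examples: {mean_absolute_error(empirical, predicted):.3f}")
```

Reporting an error in the target's own units alongside R^2 makes it easier to judge whether the predictions are accurate enough for a given use.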
