# Problem Setup

**Problem Statement:**  
We aim to optimize artificial protein synthesis by predicting masked parts of a DNA sequence using a transformer model. Initially, we provide a partially masked DNA sequence and predict the missing tokens, minimizing the cross-entropy loss between the predicted and true tokens. Later, this approach may be extended to generate sequences based on functional descriptions.

**Objective:**  
Minimize the loss between the predicted and actual tokens at masked positions while ensuring that generated sequences remain biologically valid.

## Mathematical Formulation

For each DNA sequence \(X\) with masked positions \(\mathcal{M}\), the model predicts a probability distribution \(p(x_i\mid X_{\setminus \mathcal{M}})\) for every \(i \in \mathcal{M}\). The objective function is defined as:

\[
\mathcal{L} = - \sum_{i \in \mathcal{M}} \log p(x_i^{\text{true}}\mid X_{\setminus \mathcal{M}}) 
\]

subject to constraints ensuring valid base pairs and acceptable sequence lengths.
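
As a small worked example of this objective, the sketch below evaluates \(\mathcal{L}\) for one sequence with two masked positions. The probabilities are made-up illustrative values, not model outputs. It also prints the mean over masked positions, because `nn.CrossEntropyLoss` with its default `reduction='mean'` averages over the non-ignored targets rather than summing them, so the losses printed by the training loops in this notebook are per-token averages.

```python
import math

# Hypothetical predicted probabilities for the true base at each masked position.
# Keys are positions i in the masked set M; values are p(x_i^true | unmasked context).
p_true = {3: 0.50, 5: 0.25}

# Summed negative log-likelihood, as in the formula above.
loss_sum = -sum(math.log(p) for p in p_true.values())

# Per-token average, which is what a mean-reduced cross-entropy reports.
loss_mean = loss_sum / len(p_true)

print(f"sum:  {loss_sum:.4f}")   # sum:  2.0794
print(f"mean: {loss_mean:.4f}")  # mean: 1.0397
```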

## Data Requirements and Success Metrics

**Data Requirements:**  
- Synthetic DNA sequences from a small vocabulary: `A`, `C`, `G`, and `T`.
- Randomly mask tokens in each sequence to simulate missing data.

**Success Metrics:**  
- Achieve >90% prediction accuracy on masked tokens.
- Low validation loss and biologically plausible sequences (with future integration of tools like AlphaFold).

## Implementation Overview

This notebook includes:

- Data generation and preprocessing for synthetic DNA sequences.
- A simple transformer model using PyTorch for masked token prediction.
- An objective function (cross-entropy loss) and an optimization loop.
- Basic logging, validation, and resource monitoring.
- Documentation of design decisions, known limitations, and next steps.

Note: The implementation is kept small-scale so it can run efficiently on a MacBook.

In [36]:
# ===============================
# Setup & Imports
# ===============================

# Required Imports
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np 
from torch.utils.data import Dataset, DataLoader, random_split
import time
import os
from tqdm import tqdm  # for progress bars

# For resource monitoring 
import psutil

# Set random seeds for reproducibility
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)

# Check device (GPU if available, else CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cpu


In [20]:
# Data Generation: Create synthetic DNA sequences and mask tokens

# Define vocabulary and special token for masking
vocab = ['A', 'C', 'G', 'T']
mask_token = '[MASK]'
vocab.append(mask_token)
vocab_to_idx = {token: idx for idx, token in enumerate(vocab)}
idx_to_vocab = {idx: token for token, idx in vocab_to_idx.items()}

# Parameters
sequence_length = 10  # Keeping sequences small
num_sequences = 100   # Number of sequences for training
mask_prob = 0.3       # Probability of masking a token

def generate_sequence(length):
    """Generates a random DNA sequence of given length."""
    return [random.choice(vocab[:-1]) for _ in range(length)]

def mask_sequence(seq, mask_prob):
    """
    Randomly masks tokens in the sequence with probability mask_prob.
    Returns the masked sequence and labels where non-masked positions are set to -100.
    """
    masked_seq = []
    labels = []
    for token in seq:
        if random.random() < mask_prob:
            masked_seq.append(mask_token)
            labels.append(vocab_to_idx[token])  # Store the true token
        else:
            masked_seq.append(token)
            labels.append(-100)  # -100 will be ignored in loss computation
    return masked_seq, labels

# Generate the dataset
data = []
for _ in range(num_sequences):
    seq = generate_sequence(sequence_length)
    masked_seq, labels = mask_sequence(seq, mask_prob)
    data.append((seq, masked_seq, labels))

print("Sample original sequence:", data[0][0])
print("Sample masked sequence:", data[0][1])
print("Sample labels:", data[0][2])

Sample original sequence: ['A', 'G', 'T', 'A', 'A', 'C', 'C', 'T', 'T', 'C']
Sample masked sequence: ['A', 'G', 'T', '[MASK]', 'A', '[MASK]', '[MASK]', 'T', 'T', 'C']
Sample labels: [-100, -100, -100, 0, -100, 1, 1, -100, -100, -100]


In [21]:
# Define a PyTorch Dataset for the synthetic DNA data
class DNADataset(Dataset):
    def __init__(self, data, vocab_to_idx):
        self.data = data
        self.vocab_to_idx = vocab_to_idx
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        original, masked_seq, labels = self.data[idx]
        # Convert tokens to indices
        input_ids = [self.vocab_to_idx[token] for token in masked_seq]
        return torch.tensor(input_ids, dtype=torch.long), torch.tensor(labels, dtype=torch.long)

# Split data into training and validation sets (80/20 split)
train_size = int(0.8 * len(data))
val_size = len(data) - train_size
train_data = data[:train_size]
val_data = data[train_size:]

train_dataset = DNADataset(train_data, vocab_to_idx)
val_dataset = DNADataset(val_data, vocab_to_idx)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

In [24]:
# Define a simple Transformer-based model for masked token prediction
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, num_heads=2, num_layers=2, dropout=0.1):
        super(TransformerModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(embed_dim, vocab_size)
        
    def forward(self, x):
        """
        x: [batch_size, seq_length]
        returns: logits [batch_size, seq_length, vocab_size]
        """
        embedded = self.embedding(x)  # [batch_size, seq_length, embed_dim]
        # Transformer expects input shape: [seq_length, batch_size, embed_dim]
        embedded = embedded.transpose(0, 1)
        transformer_out = self.transformer_encoder(embedded)
        transformer_out = transformer_out.transpose(0, 1)
        logits = self.fc_out(transformer_out)
        return logits

# Initialize the model
vocab_size = len(vocab_to_idx)
model = TransformerModel(vocab_size).to(device)
print(model)

TransformerModel(
  (embedding): Embedding(5, 16)
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=16, out_features=16, bias=True)
        )
        (linear1): Linear(in_features=16, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=16, bias=True)
        (norm1): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((16,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (fc_out): Linear(in_features=16, out_features=5, bias=True)
)
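
Neither `nn.Embedding` nor `nn.TransformerEncoderLayer` adds position information, so the model as printed treats its input as an unordered set of tokens: with dropout disabled, every `[MASK]` position in a sequence receives the same input embedding and therefore the same output logits. A common remedy is to add a positional encoding to the embeddings before the encoder. The sketch below shows one such option, fixed sinusoidal encodings. The class name `SinusoidalPositionalEncoding` and the `max_len` parameter are illustrative and not part of this notebook, and the code assumes an even `embed_dim`.

```python
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    """Adds fixed sine/cosine position signals to [batch, seq_len, embed_dim] embeddings.

    Illustrative sketch; assumes embed_dim is even.
    """

    def __init__(self, embed_dim, max_len=512):
        super().__init__()
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # [max_len, 1]
        div_term = torch.exp(
            torch.arange(0, embed_dim, 2, dtype=torch.float32)
            * (-math.log(10000.0) / embed_dim)
        )  # [embed_dim / 2]
        pe = torch.zeros(max_len, embed_dim)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer("pe", pe.unsqueeze(0))  # [1, max_len, embed_dim]

    def forward(self, x):
        # x: [batch, seq_len, embed_dim]; add the first seq_len position vectors.
        return x + self.pe[:, : x.size(1)]


pos_enc = SinusoidalPositionalEncoding(embed_dim=16)
out = pos_enc(torch.zeros(2, 10, 16))
print(out.shape)
```

In `forward`, this would be applied as `embedded = self.pos_enc(self.embedding(x))` before the transpose, with `self.pos_enc = SinusoidalPositionalEncoding(embed_dim)` created in `__init__`. Whether it improves accuracy on this data is untested.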




In [25]:
# Re-create the model so that training starts from freshly initialized weights
vocab_size = len(vocab_to_idx)
model = TransformerModel(vocab_size).to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=-100)  # Ignore positions that are not masked
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Key training parameters
num_epochs = 5  # Small number for demonstration purposes

In [26]:
# Training Loop with Logging

start_time = time.time()
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs = inputs.to(device)
        labels = labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(inputs)  # [batch_size, seq_length, vocab_size]
        
        # Reshape outputs and labels for loss computation
        outputs = outputs.view(-1, vocab_size)
        labels = labels.view(-1)
        
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
    
    avg_loss = running_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

end_time = time.time()
print("Training Time: {:.2f} seconds".format(end_time - start_time))

Epoch 1/5, Loss: 1.6040
Epoch 2/5, Loss: 1.4515
Epoch 3/5, Loss: 1.4331
Epoch 4/5, Loss: 1.4269
Epoch 5/5, Loss: 1.4206
Training Time: 0.23 seconds


In [27]:
# Validation and Test Cases

model.eval()
total_correct = 0
total_masked = 0

with torch.no_grad():
    for inputs, labels in val_loader:
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)  # [batch_size, seq_length, vocab_size]
        predictions = torch.argmax(outputs, dim=-1)  # [batch_size, seq_length]
        
        # Evaluate only on masked positions (labels != -100)
        mask = labels != -100
        total_masked += mask.sum().item()
        total_correct += ((predictions == labels) * mask).sum().item()

if total_masked > 0:
    accuracy = total_correct / total_masked * 100
    print(f"Validation Accuracy on masked tokens: {accuracy:.2f}%")
else:
    print("No masked tokens in validation set.")

Validation Accuracy on masked tokens: 27.42%


In [28]:
# Basic Resource Monitoring and Performance Measurements

if 'psutil' in globals():
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    print("Memory Usage: {:.2f} MB".format(mem_info.rss / (1024 * 1024)))
else:
    print("psutil not available. Skipping resource monitoring.")

# Additional performance measurements can be added as needed.

Memory Usage: 309.80 MB


In [32]:
from Bio import Entrez, SeqIO
import time


Entrez.email = "your_email@example.com"


def fetch_ids(search_term, max_records=10000):
    """
    Fetch up to max_records sequence IDs matching the search term.
    """
    print(f"Fetching up to {max_records} IDs for search term: '{search_term}'")
    search_handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=max_records)
    search_results = Entrez.read(search_handle)
    search_handle.close()
    ids = search_results["IdList"]
    print(f"Fetched {len(ids)} IDs.")
    return ids

def fetch_sequences_in_batches(id_list, batch_size=500, delay=0.5):
    """
    Fetch sequences given a list of IDs in smaller batches.
    """
    sequences = []
    total_ids = len(id_list)
    for start in range(0, total_ids, batch_size):
        end = min(total_ids, start + batch_size)
        id_batch = id_list[start:end]
        print(f"Fetching sequences for IDs {start} to {end}...")
        try:
            fetch_handle = Entrez.efetch(db="nucleotide", id=",".join(id_batch),
                                         rettype="fasta", retmode="text")
            records = list(SeqIO.parse(fetch_handle, "fasta"))
            fetch_handle.close()
            sequences.extend(records)
        except Exception as e:
            print(f"Error fetching batch {start}-{end}: {e}")
        time.sleep(delay)
    return sequences

# Define your search term 
search_term = "Homo sapiens[Organism] AND gene"

# Fetch only the first 10,000 IDs matching the search term
ids = fetch_ids(search_term, max_records=10000)

# Fetch the sequences corresponding to the retrieved IDs in batches
sequences = fetch_sequences_in_batches(ids, batch_size=500, delay=0.5)
print(f"Number of sequences fetched: {len(sequences)}")


for record in sequences[:5]:
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}\n")

Fetching up to 10000 IDs for search term: 'Homo sapiens[Organism] AND gene'
Fetched 10000 IDs.
Fetching sequences for IDs 0 to 500...
Fetching sequences for IDs 500 to 1000...
Fetching sequences for IDs 1000 to 1500...
Fetching sequences for IDs 1500 to 2000...
Fetching sequences for IDs 2000 to 2500...
Fetching sequences for IDs 2500 to 3000...
Fetching sequences for IDs 3000 to 3500...
Fetching sequences for IDs 3500 to 4000...
Fetching sequences for IDs 4000 to 4500...
Fetching sequences for IDs 4500 to 5000...
Fetching sequences for IDs 5000 to 5500...
Fetching sequences for IDs 5500 to 6000...
Fetching sequences for IDs 6000 to 6500...
Fetching sequences for IDs 6500 to 7000...
Fetching sequences for IDs 7000 to 7500...
Fetching sequences for IDs 7500 to 8000...
Fetching sequences for IDs 8000 to 8500...
Fetching sequences for IDs 8500 to 9000...
Fetching sequences for IDs 9000 to 9500...
Fetching sequences for IDs 9500 to 10000...
Number of sequences fetched: 10000
ID: pdb|9IJ4|C

In [42]:


# ===============================
# Define Vocabulary and Parameters
# ===============================
# Our DNA vocabulary and a special [MASK] token
vocab = ['A', 'C', 'G', 'T']
mask_token = '[MASK]'
vocab.append(mask_token)
vocab_to_idx = {token: idx for idx, token in enumerate(vocab)}
idx_to_vocab = {idx: token for token, idx in vocab_to_idx.items()}

# Fixed sequence segment length 
max_seq_len = 100
mask_prob = 0.3 # probability to mask a token

# ===============================
# Data Processing: Masking Tokens in Real Data
# ===============================
def process_sequence(seq_record, max_seq_len, mask_prob):
    """
    Process a Biopython SeqRecord:
      - Convert the sequence to uppercase string.
      - If the sequence is at least max_seq_len long, randomly extract a contiguous segment.
      - For each character in the segment, with probability mask_prob, replace it with the [MASK] token.
      - Create a labels list: the true token index if masked, or -100 if not (to ignore in loss).
    Returns:
      (original_tokens, masked_tokens, labels)
    If the sequence is too short, returns None.
    """
    seq_str = str(seq_record.seq).upper()
    if len(seq_str) < max_seq_len:
        return None
    # Randomly choose a contiguous segment of length max_seq_len
    start_index = random.randint(0, len(seq_str) - max_seq_len)
    segment = seq_str[start_index : start_index + max_seq_len]
    tokens = list(segment)
    masked_tokens = []
    labels = []
    for token in tokens:
        # If token not in our allowed vocabulary, leave it unchanged and do not predict it.
        if token not in vocab_to_idx:
            masked_tokens.append(token)
            labels.append(-100)
        else:
            if random.random() < mask_prob:
                masked_tokens.append(mask_token)
                labels.append(vocab_to_idx[token])
            else:
                masked_tokens.append(token)
                labels.append(-100)
    return tokens, masked_tokens, labels

# Process all sequences (fetched from NCBI) and filter out any that are too short
processed_data = []
for record in sequences:
    result = process_sequence(record, max_seq_len, mask_prob)
    if result is not None:
        processed_data.append(result)

print(f"Processed {len(processed_data)} sequences out of {len(sequences)} total fetched sequences.")

# ===============================
# Create a PyTorch Dataset
# ===============================
class RealDNADataset(Dataset):
    def __init__(self, processed_data, vocab_to_idx):
        self.data = processed_data
        self.vocab_to_idx = vocab_to_idx
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        original, masked_seq, labels = self.data[idx]
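        # Convert the masked tokens to indices. Characters outside the vocabulary
        # (e.g. ambiguity codes such as N) fall back to index 0, the same index as 'A';
        # their labels are -100, so they are never scored.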
        input_ids = [self.vocab_to_idx[token] if token in self.vocab_to_idx else 0 for token in masked_seq]
        return torch.tensor(input_ids, dtype=torch.long), torch.tensor(labels, dtype=torch.long)

dataset = RealDNADataset(processed_data, vocab_to_idx)

# Split dataset: 80% training, 20% validation
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
print(f"Training samples: {len(train_dataset)}, Validation samples: {len(val_dataset)}")

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)

# ===============================
# Define (another) Transformer Model
# ===============================
class FullTransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, num_heads=4, num_layers=2, dropout=0.02):
        super(FullTransformerModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.fc_out = nn.Linear(embed_dim, vocab_size)
        
    def forward(self, x):
        # x: [batch_size, seq_length]
        embedded = self.embedding(x)  # [batch_size, seq_length, embed_dim]
        # Transformer expects shape: [seq_length, batch_size, embed_dim]
        embedded = embedded.transpose(0, 1)
        transformer_out = self.transformer_encoder(embedded)
        transformer_out = transformer_out.transpose(0, 1)  # [batch_size, seq_length, embed_dim]
        logits = self.fc_out(transformer_out)  # [batch_size, seq_length, vocab_size]
        return logits

vocab_size = len(vocab_to_idx)
model = FullTransformerModel(vocab_size).to(device)
print(model)

# ===============================
# Setup Loss Function and Optimizer
# ===============================
criterion = nn.CrossEntropyLoss(ignore_index=-100)  # Ignore positions not masked
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 30

# ===============================
# Training Loop with Progress Bar
# ===============================
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    train_batches = 0
    train_pbar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs} Training", leave=False)
    for inputs, labels in train_pbar:
        inputs = inputs.to(device)
        labels = labels.to(device)
        
        optimizer.zero_grad()
        outputs = model(inputs)  # [batch_size, seq_length, vocab_size]
        # Flatten outputs and labels for loss computation
        outputs = outputs.view(-1, vocab_size)
        labels = labels.view(-1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        train_batches += 1
        train_pbar.set_postfix(loss=running_loss/train_batches)
    avg_loss = running_loss / train_batches
    print(f"Epoch {epoch+1}/{num_epochs} - Training Loss: {avg_loss:.4f}")
    
    # ===============================
    # Validation Loop with Progress Bar
    # ===============================
    model.eval()
    total_correct = 0
    total_masked = 0
    with torch.no_grad():
        val_pbar = tqdm(val_loader, desc="Validation", leave=False)
        for inputs, labels in val_pbar:
            inputs = inputs.to(device)
            labels = labels.to(device)
            outputs = model(inputs)
            predictions = torch.argmax(outputs, dim=-1)
            # Evaluate only on positions where labels != -100
            mask = labels != -100
            total_correct += ((predictions == labels) * mask).sum().item()
            total_masked += mask.sum().item()
    if total_masked > 0:
        accuracy = total_correct / total_masked * 100
    else:
        accuracy = 0
    print(f"Epoch {epoch+1} - Validation Accuracy on Masked Tokens: {accuracy:.2f}%")

Processed 851 sequences out of 10000 total fetched sequences.
Training samples: 680, Validation samples: 171
FullTransformerModel(
  (embedding): Embedding(5, 32)
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=32, out_features=32, bias=True)
        )
        (linear1): Linear(in_features=32, out_features=2048, bias=True)
        (dropout): Dropout(p=0.02, inplace=False)
        (linear2): Linear(in_features=2048, out_features=32, bias=True)
        (norm1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.02, inplace=False)
        (dropout2): Dropout(p=0.02, inplace=False)
      )
    )
  )
  (fc_out): Linear(in_features=32, out_features=5, bias=True)
)

Epoch 1/30 - Training Loss: 1.4133
Epoch 1 - Validation Accuracy on Masked Tokens: 23.91%
Epoch 2/30 - Training Loss: 1.3888
Epoch 2 - Validation Accuracy on Masked Tokens: 28.36%
Epoch 3/30 - Training Loss: 1.3761
Epoch 3 - Validation Accuracy on Masked Tokens: 30.51%
Epoch 4/30 - Training Loss: 1.3702
Epoch 4 - Validation Accuracy on Masked Tokens: 29.83%
Epoch 5/30 - Training Loss: 1.3678
Epoch 5 - Validation Accuracy on Masked Tokens: 31.16%
Epoch 6/30 - Training Loss: 1.3668
Epoch 6 - Validation Accuracy on Masked Tokens: 30.06%
Epoch 7/30 - Training Loss: 1.3697
Epoch 7 - Validation Accuracy on Masked Tokens: 28.63%
Epoch 8/30 - Training Loss: 1.3674
Epoch 8 - Validation Accuracy on Masked Tokens: 30.88%
Epoch 9/30 - Training Loss: 1.3655
Epoch 9 - Validation Accuracy on Masked Tokens: 30.45%
Epoch 10/30 - Training Loss: 1.3654
Epoch 10 - Validation Accuracy on Masked Tokens: 30.30%
Epoch 11/30 - Training Loss: 1.3656
Epoch 11 - Validation Accuracy on Masked Tokens: 29.85%
Epoch 12/30 - Training Loss: 1.3651
Epoch 12 - Validation Accuracy on Masked Tokens: 30.30%
Epoch 13/30 - Training Loss: 1.3626
Epoch 13 - Validation Accuracy on Masked Tokens: 29.83%
Epoch 14/30 - Training Loss: 1.3640
Epoch 14 - Validation Accuracy on Masked Tokens: 30.84%
Epoch 15/30 - Training Loss: 1.3649
Epoch 15 - Validation Accuracy on Masked Tokens: 29.79%
Epoch 16/30 - Training Loss: 1.3635
Epoch 16 - Validation Accuracy on Masked Tokens: 30.65%
Epoch 17/30 - Training Loss: 1.3623
Epoch 17 - Validation Accuracy on Masked Tokens: 29.14%
Epoch 18/30 - Training Loss: 1.3641
Epoch 18 - Validation Accuracy on Masked Tokens: 30.75%
Epoch 19/30 - Training Loss: 1.3607
Epoch 19 - Validation Accuracy on Masked Tokens: 29.90%
Epoch 20/30 - Training Loss: 1.3607
Epoch 20 - Validation Accuracy on Masked Tokens: 30.69%
Epoch 21/30 - Training Loss: 1.3608
Epoch 21 - Validation Accuracy on Masked Tokens: 30.51%
Epoch 22/30 - Training Loss: 1.3629
Epoch 22 - Validation Accuracy on Masked Tokens: 30.35%
Epoch 23/30 - Training Loss: 1.3608
Epoch 23 - Validation Accuracy on Masked Tokens: 30.71%
Epoch 24/30 - Training Loss: 1.3620
Epoch 24 - Validation Accuracy on Masked Tokens: 30.34%
Epoch 25/30 - Training Loss: 1.3620
Epoch 25 - Validation Accuracy on Masked Tokens: 30.22%
Epoch 26/30 - Training Loss: 1.3614
Epoch 26 - Validation Accuracy on Masked Tokens: 30.28%
Epoch 27/30 - Training Loss: 1.3608
Epoch 27 - Validation Accuracy on Masked Tokens: 29.88%
Epoch 28/30 - Training Loss: 1.3617
Epoch 28 - Validation Accuracy on Masked Tokens: 30.10%
Epoch 29/30 - Training Loss: 1.3623
Epoch 29 - Validation Accuracy on Masked Tokens: 30.79%
Epoch 30/30 - Training Loss: 1.3600
Epoch 30 - Validation Accuracy on Masked Tokens: 29.36%


In [43]:
# Basic Resource Monitoring and Performance Measurements

if 'psutil' in globals():
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    print("Memory Usage: {:.2f} MB".format(mem_info.rss / (1024 * 1024)))
else:
    print("psutil not available. Skipping resource monitoring.")

# Additional performance measurements can be added as needed.

Memory Usage: 71.88 MB


## Known Limitations, Debug Strategies, and Next Steps

**Known Limitations:**  
- The dataset is small (851 usable 100-character segments out of 10,000 fetched records) and may not capture the complexity of real DNA sequences, especially longer ones.
- The model is simple; larger and more complex architectures may be needed for real-world data.

**Debug/Test Strategies:**  
- Validate model predictions on edge cases (e.g., sequences with no masked tokens).
- Log intermediate outputs and losses for analysis.
- Experiment with varying sequence lengths and mask probabilities.
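
For the no-masked-token edge case, it helps to know how often it occurs and to guard against it, since a batch in which every label is `-100` typically yields `nan` from a mean-reduced cross-entropy. The sketch below computes the probability that a sequence has no masked token under independent masking, and shows a hypothetical helper, `mask_at_least_one`, that forces one masked position when none was drawn. The helper is an assumption for illustration and is not part of the notebook's pipeline.

```python
import random


def prob_no_mask(seq_len, mask_prob):
    """Probability that independent masking leaves a sequence with zero masked tokens."""
    return (1.0 - mask_prob) ** seq_len


def mask_at_least_one(seq, mask_prob, mask_token="[MASK]", rng=random):
    """Mask each token with probability mask_prob; if none was masked, mask one at random.

    Returns (masked_seq, masked_positions). Hypothetical helper for illustration.
    """
    masked_positions = [i for i in range(len(seq)) if rng.random() < mask_prob]
    if not masked_positions and seq:
        masked_positions = [rng.randrange(len(seq))]
    chosen = set(masked_positions)
    masked_seq = [mask_token if i in chosen else tok for i, tok in enumerate(seq)]
    return masked_seq, masked_positions


# The notebook's two settings: length 10 and length 100, both with mask_prob = 0.3.
print(f"{prob_no_mask(10, 0.3):.4f}")   # chance that a 10-token sequence has no mask
print(f"{prob_no_mask(100, 0.3):.2e}")  # effectively zero for 100-token segments

# With mask_prob = 0.0 nothing is drawn, so exactly one position is forced.
masked, positions = mask_at_least_one(list("ACGT"), mask_prob=0.0, rng=random.Random(0))
print(masked, positions)
```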

**Next Steps:**  
- Validation accuracy (about 30%) is only slightly above the 25% expected from random guessing over four bases. We should try different hyperparameters or entirely different model architectures.
- Scale up using much more real genomic data.
- We may need to encode sequence structure as an input feature in some form.
- Integrate external validation tools (e.g., AlphaFold for folding predictions) if these steps succeed.
- Refine the model architecture and hyperparameters based on further experiments.
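
To make the comparison with random guessing concrete: with four equally likely bases, a context-free predictor has an expected accuracy of 25% and a cross-entropy of \(\ln 4 \approx 1.386\) nats, so the final training loss of 1.3600 and validation accuracy of 29.36% reported above are only slightly better than that. A somewhat stronger reference point is the unigram baseline, which always predicts the most frequent base. The sketch below computes both. The function name `unigram_baseline` and the toy sequences are illustrative assumptions, and it has not been run on the fetched data.

```python
import math
from collections import Counter


def unigram_baseline(sequences, alphabet="ACGT"):
    """Return (majority_accuracy, entropy_nats) of the base frequencies in sequences.

    majority_accuracy: accuracy of always predicting the most frequent base.
    entropy_nats: cross-entropy achieved by predicting the empirical frequencies.
    Characters outside `alphabet` are ignored.
    """
    counts = Counter(ch for seq in sequences for ch in seq if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no bases from the alphabet found")
    freqs = [c / total for c in counts.values()]
    majority_accuracy = max(freqs)
    entropy_nats = -sum(f * math.log(f) for f in freqs)
    return majority_accuracy, entropy_nats


# Uniform reference: four equally likely bases.
uniform_loss = math.log(4)
print(f"uniform accuracy: {1 / 4:.2%}, uniform loss: {uniform_loss:.4f}")

# Toy example with a skewed composition (illustrative only).
acc, ent = unigram_baseline(["AAAC", "AAGT"])
print(f"majority accuracy: {acc:.2%}, entropy: {ent:.4f}")
```

If the majority accuracy on the real segments turned out to be close to 30%, that would suggest the model has mostly learned base composition rather than context.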

## Development Process and Alternative Approaches

**Exploration:**  
- We experimented with various transformer depths and embedding sizes, as well as batch sizes, learning rates, masking probabilities, and sequence lengths.
- Alternative models (such as RNNs) were considered but discarded for their limited capacity to capture long-range dependencies. However, given the current accuracy, we may revisit them.

**Failed Attempts:**  
- Early versions trained entirely on synthetic data produced similar results. This implies that our model sees little difference between real and synthetic data, which we need to address.
- Data preprocessing required several iterations to correctly handle token masking and label assignments.
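
Because masking and label assignment were the error-prone part, a small consistency check can catch regressions early. The sketch below assumes the `(original, masked, labels)` triple layout and the `-100` ignore index used in this notebook; the function name `check_masking` is hypothetical.

```python
IGNORE_INDEX = -100
MASK_TOKEN = "[MASK]"
VOCAB_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3, MASK_TOKEN: 4}


def check_masking(original, masked, labels, vocab_to_idx=VOCAB_TO_IDX):
    """Raise AssertionError if an (original, masked, labels) triple is inconsistent."""
    assert len(original) == len(masked) == len(labels), "length mismatch"
    for i, (orig_tok, masked_tok, label) in enumerate(zip(original, masked, labels)):
        if masked_tok == MASK_TOKEN:
            # Masked position: the label must be the index of the hidden true token.
            assert label == vocab_to_idx[orig_tok], f"wrong label at position {i}"
        else:
            # Unmasked position: the token is unchanged and excluded from the loss.
            assert masked_tok == orig_tok, f"token changed at position {i}"
            assert label == IGNORE_INDEX, f"unmasked position {i} has a label"
    return True


# The sample printed by the synthetic data cell passes the check.
original = ['A', 'G', 'T', 'A', 'A', 'C', 'C', 'T', 'T', 'C']
masked = ['A', 'G', 'T', '[MASK]', 'A', '[MASK]', '[MASK]', 'T', 'T', 'C']
labels = [-100, -100, -100, 0, -100, 1, 1, -100, -100, -100]
print(check_masking(original, masked, labels))  # → True
```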

**Design Decisions:**  
- A lightweight transformer was chosen to ensure the code runs efficiently on a MacBook.
- Cross-entropy loss with the ignore index (-100) is used to focus training on masked tokens.

**Safety Considerations:**  
- Outputs are validated against expected token patterns to ensure biologically plausible predictions.
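
One simple form such a check could take is shown below. Because the output layer has one logit per vocabulary entry, including `[MASK]`, a prediction is only a valid base if its index maps to `A`, `C`, `G`, or `T`. This is a minimal sketch under that assumption; the function `invalid_predictions` is hypothetical, and it checks token validity only, not biological plausibility in any deeper sense.

```python
IDX_TO_VOCAB = {0: "A", 1: "C", 2: "G", 3: "T", 4: "[MASK]"}
VALID_BASES = {"A", "C", "G", "T"}


def invalid_predictions(predicted_ids, idx_to_vocab=IDX_TO_VOCAB):
    """Return the positions whose predicted index is not one of the four bases."""
    return [
        pos
        for pos, idx in enumerate(predicted_ids)
        if idx_to_vocab.get(idx) not in VALID_BASES
    ]


# Example: index 4 is the [MASK] token, which is never a valid prediction.
print(invalid_predictions([0, 3, 4, 1]))  # → [2]
```

With predictions from `torch.argmax`, `predictions[0].tolist()` would give such a list for the first sequence in a batch.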

**Alternative Approaches:**  
- Future work may explore larger models, alternative masking strategies, and integration with protein structure predictors.