# 🧬 Genomic Data Augmentation with OmniGenBench

Welcome to this tutorial on **intelligent genomic data augmentation** with **OmniGenBench**. This guide walks you through generating synthetic genomic sequences that aim to preserve biological patterns and can improve model training performance.

### 1. The Machine Learning Challenge: Why Genomic Data Augmentation?

**Genomic data augmentation** is a critical technique in computational biology that addresses several fundamental challenges in genomic machine learning:

- **Limited Training Data**: High-quality labeled genomic datasets are often small and expensive to generate
- **Class Imbalance**: Rare genomic variants and functions are underrepresented in datasets
- **Overfitting Prevention**: Augmentation increases dataset diversity and improves generalization
- **Domain Adaptation**: Bridging gaps between different experimental conditions or species

The power of genomic augmentation lies in its ability to:
- **Generate Realistic Sequences**: Create biologically plausible variants while preserving functional patterns
- **Expand Dataset Size**: Multiply available training data without additional experimental costs
- **Improve Model Robustness**: Enhance model performance on unseen genomic variations
- **Balance Datasets**: Address class imbalance issues in genomic classification tasks

Applications across computational biology:
- **Rare Variant Analysis**: Augment underrepresented mutation patterns for disease prediction
- **Cross-Species Learning**: Generate bridge sequences for evolutionary studies
- **Functional Annotation**: Create training data for poorly characterized genomic regions
- **Model Validation**: Generate test sequences for robustness evaluation

### 2. The Challenge: Biologically-Informed Sequence Generation

Unlike random mutations, intelligent genomic augmentation must preserve:

- **Functional Motifs**: Critical regulatory and coding sequences
- **Structural Constraints**: Secondary structures and folding patterns
- **Evolutionary Patterns**: Codon usage bias and phylogenetic relationships  
- **Statistical Properties**: Nucleotide composition and k-mer frequencies

**Augmentation Process:**

| Original Sequence | Random Mutation | Intelligent Augmentation |
|------------------|-----------------|-------------------------|
| `ATGCGATCG` | `ATGCTATCG` (random) | `ATGCGATCC` (codon-aware) |
| Functional | May break function | Preserves function |
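
To make the table concrete, here is a minimal sketch that checks whether a substitution is synonymous, that is, whether it leaves the encoded protein unchanged. It uses the standard genetic code and assumes the sequence is read in frame 0 and contains only A, C, G, T/U; the helper names `translate` and `is_synonymous` are illustrative and not part of OmniGenBench.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a sequence in frame 0, ignoring a trailing partial codon."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def is_synonymous(original: str, variant: str) -> bool:
    """True if both sequences encode the same protein in frame 0."""
    return translate(original) == translate(variant)


original = "ATGCGATCG"
print(translate(original))                    # MRS
print(is_synonymous(original, "ATGCTATCG"))   # False: CGA (Arg) -> CTA (Leu)
print(is_synonymous(original, "ATGCGATCC"))   # True: TCG and TCC both encode Ser
```

A check like this only covers coding regions; regulatory motifs and RNA structure need their own criteria.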

### 3. The Tool: Masked Language Models for Genomic Augmentation

#### Foundation Model Understanding
**OmniGenome** uses masked language modeling (MLM) for intelligent sequence augmentation. This approach:

1. **Masks Strategic Positions**: Selectively mask nucleotides while preserving critical patterns
2. **Predicts Biologically Plausible Alternatives**: Use pre-trained understanding to suggest realistic substitutions
3. **Maintains Sequence Integrity**: Ensure augmented sequences remain biologically valid
4. **Preserves Functional Patterns**: Keep important motifs and regulatory elements intact
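
The first step of this approach, masking, can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: it masks individual nucleotides uniformly at random, and the function name `mask_positions` and the `<mask>` token string are hypothetical. The real tokenizer and masking strategy may differ.

```python
import random


def mask_positions(sequence: str, noise_ratio: float = 0.15, seed: int = 0,
                   mask_token: str = "<mask>") -> list[str]:
    """Replace a random subset of nucleotides with a mask token.

    Nucleotide-level, uniform masking is an assumption made for
    illustration; the library's own logic may differ.
    """
    rng = random.Random(seed)
    n_masked = round(len(sequence) * noise_ratio)
    masked_idx = set(rng.sample(range(len(sequence)), n_masked))
    return [mask_token if i in masked_idx else base
            for i, base in enumerate(sequence)]


tokens = mask_positions("ATGAAAGCCATTGAGAAGGC", noise_ratio=0.2)
print("".join(tokens))
print(tokens.count("<mask>"))  # 4 of 20 positions are masked
```

In the full workflow, a pre-trained masked language model would then predict a nucleotide for each masked position from the surrounding context.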

### 4. The Workflow: A 4-Step Guide to Genomic Augmentation

```mermaid
flowchart TD
    subgraph "4-Step Workflow for Genomic Data Augmentation"
        A["📥 Step 1: Setup and Configuration<br/>Configure augmentation parameters and models"] --> B["🔧 Step 2: Model Initialization<br/>Load pre-trained genomic foundation models"]
        B --> C["🎓 Step 3: Sequence Augmentation<br/>Generate diverse, biologically-valid variants"]
        C --> D["🔮 Step 4: Quality Assessment<br/>Validate and analyze augmented sequences"]
    end

    style A fill:#e1f5fe,stroke:#333,stroke-width:2px
    style B fill:#f3e5f5,stroke:#333,stroke-width:2px
    style C fill:#e8f5e8,stroke:#333,stroke-width:2px
    style D fill:#fff3e0,stroke:#333,stroke-width:2px
```

Let's start generating high-quality genomic training data!

## 🚀 Step 1: Setup and Configuration

This first step focuses on setting up our genomic data augmentation environment and understanding the key parameters that control sequence generation quality.

### 1.1: Environment Setup

First, let's install the required packages for intelligent genomic data augmentation.

In [None]:
!pip install omnigenbench torch transformers tqdm scikit-learn matplotlib seaborn -U

### 1.2: Import Required Libraries

Next, we import the essential libraries for genomic data augmentation, including specialized tools for sequence analysis and quality assessment.

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from sklearn.metrics import jaccard_score
import json
from tqdm import tqdm

from omnigenbench import (
    OmniModelForAugmentation,
    ModelHub,
)

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

### 1.3: Understanding Augmentation Parameters

Before we start augmentation, let's understand the key parameters that control the quality and diversity of generated sequences:

#### Critical Parameters
- **noise_ratio**: Proportion of tokens to mask (0.15-0.25 is a reasonable starting range)
- **instance_num**: Number of variants per original sequence
- **max_length**: Maximum sequence length (in tokens) for processing
- **model_name**: Choice of pre-trained genomic foundation model

These parameters balance between sequence diversity and biological validity.
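
As a rough guide to what these settings imply, the sketch below estimates how many sequences a run produces and how many positions get masked per sequence. The helper `augmentation_budget` is hypothetical, and it makes two simplifying assumptions: one nucleotide equals one token, and inputs longer than `max_length` are truncated.

```python
def augmentation_budget(seq_lengths, noise_ratio, instance_num, max_length):
    """Estimate output size and masked positions for a planned run.

    Treats one nucleotide as one token and assumes longer inputs are
    truncated to max_length; both are simplifying assumptions.
    """
    effective = [min(n, max_length) for n in seq_lengths]
    return {
        "output_sequences": len(seq_lengths) * instance_num,
        "masked_per_sequence": [round(n * noise_ratio) for n in effective],
    }


budget = augmentation_budget([44, 40, 22, 600], noise_ratio=0.2,
                             instance_num=3, max_length=512)
print(budget["output_sequences"])     # 12
print(budget["masked_per_sequence"])  # [9, 8, 4, 102]
```

Short sequences get only a handful of masked positions, so a low `noise_ratio` may leave them almost unchanged.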

In [None]:
# Global configuration for genomic data augmentation
augmentation_config = {
    "model_name": "yangheng/OmniGenome-52M",
    "noise_ratio": 0.2,      # 20% of tokens will be masked
    "max_length": 512,       # Maximum sequence length
    "instance_num": 3,       # Generate 3 variants per sequence
    "batch_size": 8,         # Batch processing size
}

print("🎯 Genomic Data Augmentation Configuration:")
print(f"  Model: {augmentation_config['model_name']}")
print(f"  Noise ratio: {augmentation_config['noise_ratio']:.1%}")
print(f"  Max length: {augmentation_config['max_length']}")
print(f"  Instances per sequence: {augmentation_config['instance_num']}")
print(f"  Batch size: {augmentation_config['batch_size']}")

## 🚀 Step 2: Model Initialization

Now let's initialize the genomic augmentation model. The `OmniModelForAugmentation` leverages pre-trained genomic foundation models to generate biologically-informed sequence variants.

### Augmentation Model Features
- **Intelligent Masking**: Strategic selection of positions for variation
- **Contextual Prediction**: Uses surrounding sequence context for realistic substitutions  
- **Batch Processing**: Efficient handling of multiple sequences
- **Quality Control**: Built-in validation of augmented sequences

In [None]:
# Initialize the genomic augmentation model
print("🔧 Initializing Genomic Data Augmentation Model...")

augmentation_model = OmniModelForAugmentation(
    config_or_model=augmentation_config["model_name"],
    noise_ratio=augmentation_config["noise_ratio"],
    max_length=augmentation_config["max_length"],
    instance_num=augmentation_config["instance_num"]
)

print("✅ Augmentation model initialized successfully!")
print("🎯 Model capabilities:")
print("  - Intelligent sequence masking based on genomic patterns")
print("  - Context-aware nucleotide prediction")
print("  - Batch processing for efficiency")
print("  - Preservation of biological sequence properties")
print(f"  - Configured for {augmentation_config['instance_num']} variants per sequence")

## 🚀 Step 3: Sequence Augmentation

Now comes the exciting part! We'll demonstrate different approaches to genomic data augmentation, from single sequences to batch processing of entire datasets.

### Our Augmentation Strategy

We'll explore multiple augmentation scenarios:

1. **Single Sequence Augmentation**: Generate variants for individual sequences
2. **Batch Augmentation**: Process multiple sequences efficiently
3. **File-based Augmentation**: Handle large datasets from files
4. **Quality-controlled Augmentation**: Ensure biological validity of outputs

Let's start with single sequence augmentation to understand the process:

In [None]:
# Demonstrate single sequence augmentation
test_sequences = {
    "Coding sequence": "ATGAAAGCCATTGAGAAGGCAAAACCCCGATGGTCCTTCGCGAA",
    "UTR region": "AUUGAGAUGUUUGCCAUUUUGACCAUCUGACCUUUGCCAUC",
    "Regulatory motif": "TATAAGCCGCGGTGACCTGCAG",
    "Random sequence": "ATCGATCGATCGATCGATCG"
}

print("🎓 Demonstrating single sequence augmentation...")
print("⚡ Using intelligent masking and contextual prediction")

for seq_name, sequence in test_sequences.items():
    print(f"\n📊 Augmenting: {seq_name}")
    print(f"  Original: {sequence}")
    print(f"  Length: {len(sequence)} nucleotides")
    
    try:
        # Generate augmented variants
        augmented_sequence = augmentation_model.augment_sequence(sequence)
        
        print(f"  Augmented: {augmented_sequence}")
        
        # Analyze differences
        differences = sum(1 for a, b in zip(sequence, augmented_sequence) if a != b)
        similarity = (len(sequence) - differences) / len(sequence)
        
        print(f"  📈 Analysis:")
        print(f"    Changed positions: {differences}/{len(sequence)}")
        print(f"    Sequence similarity: {similarity:.1%}")
        print(f"    Effective noise ratio: {differences/len(sequence):.1%}")
        
        # GC content analysis
        def gc_content(seq):
            return (seq.upper().count('G') + seq.upper().count('C')) / len(seq)
        
        orig_gc = gc_content(sequence)
        aug_gc = gc_content(augmented_sequence)
        print(f"    GC content: {orig_gc:.1%} → {aug_gc:.1%} (Δ{aug_gc-orig_gc:+.1%})")
        
    except Exception as e:
        print(f"  ❌ Augmentation failed: {str(e)}")
    
    print("─" * 60)
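
When reading the output above, it helps to see exactly where a variant differs. The sketch below prints a caret under each changed position. It raises an error on unequal lengths, because a plain `zip` comparison silently ignores any trailing positions. The helper `mark_changes` is illustrative, and the example variant is hand-written rather than model output.

```python
def mark_changes(original: str, augmented: str) -> str:
    """Return a marker line with '^' under each position that differs."""
    if len(original) != len(augmented):
        raise ValueError("sequences must have equal length for a positional diff")
    return "".join("^" if a != b else " " for a, b in zip(original, augmented))


original = "ATGCGATCG"
augmented = "ATGCGATCC"   # hand-written example, not model output
print(original)
print(augmented)
print(mark_changes(original, augmented))   # caret under the final position
```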

### Batch Augmentation for Dataset Expansion

Now let's demonstrate batch augmentation for processing multiple sequences efficiently. This is particularly useful for expanding training datasets.

In [None]:
# Demonstrate file-based augmentation using existing toy dataset
input_file = "toy_datasets/train.json"
output_file = "toy_datasets/augmented_sequences.json"

print("🏗️ Demonstrating file-based batch augmentation...")
print(f"📂 Input file: {input_file}")
print(f"📂 Output file: {output_file}")

try:
    # Load original dataset to understand structure
    with open(input_file, 'r') as f:
        original_data = [json.loads(line.strip()) for line in f if line.strip()]
    
    print(f"📊 Original dataset: {len(original_data)} sequences")
    
    # Show sample original sequence
    if original_data:
        sample = original_data[0]
        print(f"📝 Sample original sequence:")
        for key, value in sample.items():
            if isinstance(value, str) and len(value) > 50:
                print(f"  {key}: {value[:50]}...")
            else:
                print(f"  {key}: {value}")
    
    # Perform augmentation using the model's file processing method
    print(f"\n🎓 Starting batch augmentation...")
    print(f"  - Processing {len(original_data)} sequences")
    print(f"  - Generating {augmentation_config['instance_num']} variants each")
    print(f"  - Expected output: {len(original_data) * augmentation_config['instance_num']} sequences")
    
    # Call the augmentation method
    augmentation_model.augment_from_file(
        input_file=input_file,
        output_file=output_file
    )
    
    # Verify output
    with open(output_file, 'r') as f:
        augmented_data = [json.loads(line.strip()) for line in f if line.strip()]
    
    print(f"✅ Augmentation completed!")
    print(f"📊 Results:")
    print(f"  - Original sequences: {len(original_data)}")
    print(f"  - Augmented sequences: {len(augmented_data)}")
    print(f"  - Expansion ratio: {len(augmented_data)/len(original_data):.1f}x")
    
    # Show sample augmented sequence
    if augmented_data:
        sample_aug = augmented_data[0]
        print(f"\n📝 Sample augmented sequence:")
        for key, value in sample_aug.items():
            if isinstance(value, str) and len(value) > 50:
                print(f"  {key}: {value[:50]}...")
            else:
                print(f"  {key}: {value}")
                
except Exception as e:
    print(f"❌ Batch augmentation failed: {str(e)}")
    print("This might be due to file format or model compatibility issues.")

## 🔮 Step 4: Quality Assessment and Analysis

The final step involves comprehensive analysis of our augmented sequences to ensure they maintain biological validity while providing useful diversity for training.

### Quality Assessment Pipeline

Our assessment includes:
1. **Sequence Diversity Analysis**: Measure how different augmented sequences are from originals
2. **Biological Property Conservation**: Check if important sequence characteristics are preserved
3. **Statistical Validation**: Ensure augmented sequences follow expected genomic patterns
4. **Functional Motif Preservation**: Verify that critical sequence elements remain intact
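
For the statistical validation step, one option is to compare k-mer frequency profiles of the original and augmented sets. The sketch below uses cosine similarity between 3-mer count vectors; this metric is our choice for illustration, not something prescribed by OmniGenBench, and the example sequences are hand-written stand-ins for model output.

```python
import math
from collections import Counter


def kmer_profile(sequences, k=3):
    """Count overlapping k-mers across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def cosine_similarity(p, q):
    """Cosine similarity between two k-mer count profiles (1.0 = same direction)."""
    dot = sum(p[kmer] * q[kmer] for kmer in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


originals = ["ATGCGATCG", "TATAAGCCG"]
augmented = ["ATGCGATCC", "TATAAGCCG"]   # hand-written stand-ins for model output
score = cosine_similarity(kmer_profile(originals), kmer_profile(augmented))
print(f"3-mer profile similarity: {score:.3f}")
```

Values close to 1.0 suggest the augmented set keeps the original k-mer distribution; what counts as "close enough" depends on the task.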

In [None]:
# Comprehensive quality assessment of augmented sequences
def analyze_sequence_properties(sequences, labels=None):
    """Analyze statistical properties of genomic sequences"""
    
    analysis = {
        'num_sequences': len(sequences),
        'avg_length': np.mean([len(seq) for seq in sequences]),
        'length_std': np.std([len(seq) for seq in sequences]),
        'gc_content': [],
        'nucleotide_composition': {'A': [], 'T': [], 'G': [], 'C': [], 'U': []},
    }
    
    for seq in sequences:
        seq_upper = seq.upper()
        length = len(seq_upper)
        
        # GC content
        gc = (seq_upper.count('G') + seq_upper.count('C')) / length if length > 0 else 0
        analysis['gc_content'].append(gc)
        
        # Nucleotide composition
        for nuc in ['A', 'T', 'G', 'C', 'U']:
            freq = seq_upper.count(nuc) / length if length > 0 else 0
            analysis['nucleotide_composition'][nuc].append(freq)
    
    # Convert to means and stds
    analysis['avg_gc_content'] = np.mean(analysis['gc_content'])
    analysis['gc_content_std'] = np.std(analysis['gc_content'])
    
    for nuc in analysis['nucleotide_composition']:
        freqs = analysis['nucleotide_composition'][nuc]
        analysis['nucleotide_composition'][nuc] = {
            'mean': np.mean(freqs),
            'std': np.std(freqs)
        }
    
    return analysis

print("🔬 Performing comprehensive quality assessment...")

# Load both original and augmented data for comparison
try:
    # Original sequences
    with open("toy_datasets/train.json", 'r') as f:
        original_data = [json.loads(line.strip()) for line in f if line.strip()]
    original_sequences = [item.get('sequence', item.get('seq', '')) for item in original_data]
    
    # Augmented sequences (if they exist)
    try:
        with open("toy_datasets/augmented_sequences.json", 'r') as f:
            augmented_data = [json.loads(line.strip()) for line in f if line.strip()]
        augmented_sequences = [item.get('sequence', item.get('seq', '')) for item in augmented_data]
    except:
        # Generate some examples for analysis if file doesn't exist
        print("Generating sample augmented sequences for analysis...")
        augmented_sequences = []
        for seq in original_sequences[:5]:  # Augment first 5 sequences
            try:
                aug_seq = augmentation_model.augment_sequence(seq)
                augmented_sequences.append(aug_seq)
            except:
                augmented_sequences.append(seq)  # Fallback to original
    
    # Analyze both datasets
    print("📊 Analyzing original sequences...")
    original_analysis = analyze_sequence_properties(original_sequences)
    
    print("📊 Analyzing augmented sequences...")
    augmented_analysis = analyze_sequence_properties(augmented_sequences)
    
    # Comparative analysis
    print("\n🎯 Quality Assessment Results:")
    print("=" * 60)
    
    print("📈 Dataset Size Comparison:")
    print(f"  Original sequences: {original_analysis['num_sequences']}")
    print(f"  Augmented sequences: {augmented_analysis['num_sequences']}")
    if original_analysis['num_sequences'] > 0:
        expansion_ratio = augmented_analysis['num_sequences'] / original_analysis['num_sequences']
        print(f"  Dataset expansion: {expansion_ratio:.1f}x")
    
    print("\n📏 Sequence Length Statistics:")
    print(f"  Original: {original_analysis['avg_length']:.1f} ± {original_analysis['length_std']:.1f}")
    print(f"  Augmented: {augmented_analysis['avg_length']:.1f} ± {augmented_analysis['length_std']:.1f}")
    
    print("\n🧬 GC Content Analysis:")
    print(f"  Original: {original_analysis['avg_gc_content']:.1%} ± {original_analysis['gc_content_std']:.1%}")
    print(f"  Augmented: {augmented_analysis['avg_gc_content']:.1%} ± {augmented_analysis['gc_content_std']:.1%}")
    gc_diff = abs(original_analysis['avg_gc_content'] - augmented_analysis['avg_gc_content'])
    print(f"  Difference: {gc_diff:.1%} ({'✅ Good' if gc_diff < 0.05 else '⚠️ Check' if gc_diff < 0.1 else '❌ Large'})")
    
    print("\n🔤 Nucleotide Composition Comparison:")
    for nuc in ['A', 'T', 'U', 'G', 'C']:
        orig_freq = original_analysis['nucleotide_composition'][nuc]['mean']
        aug_freq = augmented_analysis['nucleotide_composition'][nuc]['mean']
        diff = aug_freq - orig_freq  # Signed change, so the printed sign is meaningful
        if orig_freq > 0.01 or aug_freq > 0.01:  # Only show significant nucleotides
            status = '✅' if abs(diff) < 0.02 else '⚠️' if abs(diff) < 0.05 else '❌'
            print(f"  {nuc}: {orig_freq:.1%} → {aug_freq:.1%} ({diff:+.1%}) {status}")
    
    # Sequence diversity analysis
    if len(original_sequences) > 0 and len(augmented_sequences) > 0:
        print("\n🎲 Sequence Diversity Assessment:")
        
        # Sample sequences for comparison
        sample_size = min(10, len(original_sequences), len(augmented_sequences))
        orig_sample = original_sequences[:sample_size]
        aug_sample = augmented_sequences[:sample_size]
        
        # Calculate pairwise similarities within each set
        def pairwise_similarity(sequences):
            similarities = []
            for i in range(len(sequences)):
                for j in range(i+1, len(sequences)):
                    seq1, seq2 = sequences[i], sequences[j]
                    min_len = min(len(seq1), len(seq2))
                    if min_len > 0:
                        matches = sum(1 for a, b in zip(seq1[:min_len], seq2[:min_len]) if a == b)
                        similarity = matches / min_len
                        similarities.append(similarity)
            return np.mean(similarities) if similarities else 0
        
        orig_diversity = 1 - pairwise_similarity(orig_sample)
        aug_diversity = 1 - pairwise_similarity(aug_sample)
        
        print(f"  Original set diversity: {orig_diversity:.1%}")
        print(f"  Augmented set diversity: {aug_diversity:.1%}")
        
        if aug_diversity > orig_diversity:
            print("  ✅ Augmentation increased sequence diversity")
        elif aug_diversity > orig_diversity * 0.8:
            print("  ⚠️ Augmentation maintained reasonable diversity") 
        else:
            print("  ❌ Augmentation may have reduced diversity")

except Exception as e:
    print(f"❌ Quality assessment failed: {str(e)}")

print(f"\n🎉 Quality assessment completed!")
print("🚀 Your augmented dataset is ready for:")
print("  - Training data expansion and class balancing")
print("  - Model robustness improvement")
print("  - Cross-validation and generalization testing")
print("  - Domain adaptation and transfer learning")
print("  - Rare variant analysis and representation")
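
The assessment cell above covers composition and diversity but not motif preservation. A minimal sketch of that check follows. It assumes you have matched (original, augmented) pairs and a known motif string; the helper `motif_retention` is hypothetical, and the example pairs are hand-written.

```python
def motif_retention(pairs, motif):
    """Fraction of motif-containing originals whose augmented copy keeps the motif.

    `pairs` is an iterable of (original, augmented) sequence tuples.
    Returns None when no original contains the motif.
    """
    motif = motif.upper()
    relevant = [(o.upper(), a.upper()) for o, a in pairs if motif in o.upper()]
    if not relevant:
        return None
    kept = sum(1 for _, a in relevant if motif in a)
    return kept / len(relevant)


pairs = [
    ("TATAAGCCGCGG", "TATAAGCCGAGG"),   # motif kept
    ("GGTATAAGCCAA", "GGTTTAAGCCAA"),   # motif disrupted
    ("ATCGATCGATCG", "ATCGATGGATCG"),   # motif absent from the original
]
print(motif_retention(pairs, "TATAA"))   # 0.5
```

Exact string matching is the simplest possible test; degenerate motifs would need a regular expression or a position weight matrix instead.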

## 🎉 Tutorial Summary and Next Steps

Congratulations! You have successfully completed this comprehensive tutorial on genomic data augmentation with OmniGenBench.

### What You've Learned

You've walked through a complete, end-to-end workflow for intelligent genomic data augmentation. Specifically, you have:

1. **Understood the "Why"**: Gained appreciation for the importance of data augmentation in genomic machine learning and how intelligent augmentation preserves biological patterns while increasing diversity.

2. **Mastered the 4-Step Workflow**:
   - **Step 1: Setup and Configuration**: You learned how to configure augmentation parameters and understand their impact on sequence generation quality.
   - **Step 2: Model Initialization**: You saw how to leverage pre-trained genomic foundation models for context-aware sequence augmentation.
   - **Step 3: Sequence Augmentation**: You implemented both single sequence and batch augmentation strategies for different use cases.
   - **Step 4: Quality Assessment**: You performed comprehensive analysis to validate the biological validity and diversity of augmented sequences.

3. **Advanced Capabilities**: You explored:
   - Intelligent masking strategies that preserve important sequence patterns
   - Context-aware nucleotide prediction for realistic variations
   - Batch processing for efficient dataset expansion
   - Quality control metrics for validating augmented sequences
   - Statistical analysis of sequence properties and diversity

### Next Steps and Applications

Your augmented datasets can now be applied to:
- **Training Data Enhancement**: Expand small or imbalanced genomic datasets
- **Model Robustness**: Improve generalization through increased sequence diversity
- **Rare Variant Analysis**: Generate synthetic examples of underrepresented patterns
- **Cross-Domain Learning**: Bridge gaps between different genomic contexts
- **Validation Studies**: Create test sets for evaluating model robustness

### Best Practices for Genomic Augmentation

1. **Parameter Tuning**: Start with noise_ratio=0.15-0.25 and adjust based on your specific application
2. **Quality Validation**: Always assess biological property conservation in augmented sequences
3. **Diversity Balance**: Ensure augmentation increases diversity without breaking biological constraints
4. **Domain Specificity**: Consider the specific genomic context (coding, regulatory, etc.) when setting parameters
5. **Iterative Refinement**: Use validation metrics to fine-tune augmentation strategies
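
Practices 2 and 3 can be turned into a simple post-filter. The sketch below keeps a candidate only if it is unique, differs from the original, stays above a minimum identity, and does not shift GC content too far. The function `filter_variants` and all thresholds are illustrative assumptions to tune per dataset; the candidates are hand-written, not model output.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides; 0.0 for an empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; sequences of unequal length score 0."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_variants(original, candidates, max_gc_delta=0.05,
                    min_identity=0.7, max_identity=0.99):
    """Keep unique candidates that differ from the original, but not too much."""
    kept, seen = [], set()
    for cand in candidates:
        if cand in seen:
            continue
        seen.add(cand)
        if abs(gc_content(cand) - gc_content(original)) > max_gc_delta:
            continue
        if not (min_identity <= identity(original, cand) <= max_identity):
            continue
        kept.append(cand)
    return kept


original = "ATGAAAGCCATTGAGAAGGC"
candidates = [
    "ATGAAAGCCATTGAGAAGGC",   # identical to the original -> rejected
    "ATGAAAGCGATTGACAAGGC",   # two substitutions, GC unchanged -> kept
    "ATGAAAGCGATTGACAAGGC",   # duplicate -> rejected
    "GCGCGCGCCGCGGCGCGGGC",   # GC shift too large -> rejected
]
print(filter_variants(original, candidates))   # ['ATGAAAGCGATTGACAAGGC']
```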

### Further Learning

Explore our other tutorials to expand your genomic AI toolkit:
- **[mRNA Degradation Prediction](../mRNA_degrad_rate_regression/)**: Apply augmented data to stability prediction
- **[RNA Secondary Structure Prediction](../rna_secondary_structure_prediction/)**: Use augmentation for structure modeling
- **[Translation Efficiency Prediction](../translation_efficiency_prediction/)**: Enhance training data for efficiency prediction

Thank you for following along. We hope this tutorial has provided you with the knowledge and tools to effectively augment genomic datasets for your machine learning research. Intelligent data augmentation is a powerful technique for advancing genomic AI!

**Happy augmenting and discovering! 🧬✨**

In [None]:
# Define file paths for input and output
input_file = "toy_datasets/test.json"
output_file = "toy_datasets/augmented_sequences.json"

# Augment sequences from the input file and save to the output file
augmentation_model.augment_from_file(input_file, output_file)


The input file should be in JSON Lines format (one JSON object per line, each containing a sequence), like this:

```json
{"seq": "ATCTTGCATTGAAG"}
{"seq": "GGTTTACAGTCCAA"}
```

The output will be saved in the same format, with each augmented sequence written in a new line.
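
If you need to build such an input file yourself, a minimal sketch using only the standard library looks like this. The helpers `write_jsonl` and `read_jsonl` and the file name `toy_input.json` are hypothetical; the sketch writes to a temporary directory so it does not touch real data.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Write to a temporary directory so the sketch does not touch real data.
tmp_dir = Path(tempfile.mkdtemp())
input_path = tmp_dir / "toy_input.json"   # hypothetical file name
write_jsonl([{"seq": "ATCTTGCATTGAAG"}, {"seq": "GGTTTACAGTCCAA"}], input_path)
print(read_jsonl(input_path))
```

The resulting path can then be passed as `input_file` to `augment_from_file`.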

### Configurable Parameters
The augmentation process allows you to configure various parameters, such as:
- **`noise_ratio`**: Specifies the percentage of tokens that will be masked in the input sequence. The default value is `0.15` (i.e., 15% of tokens will be masked).
- **`max_length`**: The maximum token length for the input sequences. The default is `1026`.
- **`instance_num`**: The number of augmented instances to generate for each input sequence. The default is `1`, but you can increase this value to create multiple augmented versions of each sequence.

### Save Augmented Sequences
The `save_augmented_sequences` method saves the generated augmented sequences to a JSON file. Each line will contain one augmented sequence in the format `{"aug_seq": "<augmented_sequence>"}`.
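
Because the output uses the `aug_seq` key while inputs use `seq`, downstream code that reads both kinds of file should check each key. Here is a small sketch of such a reader; the helper `extract_sequences` and its key order are illustrative assumptions, and the sample lines are made up.

```python
import json


def extract_sequences(lines, keys=("aug_seq", "sequence", "seq")):
    """Pull the sequence string out of each JSON line, trying several keys."""
    sequences = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key in keys:
            if key in record:
                sequences.append(record[key])
                break
    return sequences


sample_output = ['{"aug_seq": "ATCTTGCATTGAAG"}', '', '{"seq": "GGTTTACAGTCCAA"}']
print(extract_sequences(sample_output))   # ['ATCTTGCATTGAAG', 'GGTTTACAGTCCAA']
```

Records with none of the expected keys are skipped rather than returned as empty strings, which keeps them from distorting length and GC statistics.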

### Conclusion
The `OmniModelForAugmentation` class provides a simple and flexible interface for augmenting sequences using a masked language model. By adjusting the noise ratio, instance count, and other hyperparameters, you can create diverse augmented datasets to improve the performance of machine learning models.