# Data Preparation Tutorial: From Biological Sequences to Machine Learning Data 📊

Welcome to the first part of our tutorial series! This notebook focuses on the most critical first step of any machine learning project: **Data Preparation**. Proper data handling is the foundation of any successful model.

> 💡 **Learning Objectives**: Understand genomic data types, machine learning task classification, and master OmniGenBench data processing workflows
> 
> 📚 **Prerequisites**: It is recommended to study the **[Fundamental Concepts Tutorial](https://github.com/yangheng95/OmniGenBench/blob/master/examples/00_fundamental_concepts.ipynb)** to understand the general machine learning task classification and framework concepts.

---

## The World of Genomic Data & Machine Learning Tasks 🧬

In genomics, we primarily work with biological sequences, including **DNA** (composed of A, C, G, T) and **RNA** (A, C, G, U). To a computer, these are just strings of characters. Our goal is to translate these strings into a numerical format that a machine learning model can understand and learn from.
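
To make the idea of turning strings into numbers concrete, here is a minimal sketch of integer and one-hot encoding in plain Python. The vocabulary `NUCLEOTIDE_TO_ID` and both helper functions are assumptions made for illustration; the OmniGenBench tokenizer used later in this tutorial has its own vocabulary and performs this step for you.

```python
# Illustrative vocabulary (an assumption for this sketch, not the OmniGenome vocabulary).
NUCLEOTIDE_TO_ID = {"A": 0, "C": 1, "G": 2, "U": 3}


def encode_sequence(sequence: str) -> list[int]:
    """Map a DNA/RNA string to integer ids, treating DNA 'T' as RNA 'U'."""
    normalized = sequence.upper().replace("T", "U")
    # Ambiguous bases such as 'N' are not handled here and raise a KeyError.
    return [NUCLEOTIDE_TO_ID[base] for base in normalized]


def one_hot(ids: list[int], vocab_size: int = 4) -> list[list[int]]:
    """Turn each integer id into a one-hot vector of length vocab_size."""
    return [[1 if i == idx else 0 for i in range(vocab_size)] for idx in ids]


ids = encode_sequence("ACGU")
print(ids)           # [0, 1, 2, 3]
print(one_hot(ids))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```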

Depending on our scientific question, we can frame genomic problems into various machine learning tasks:

| Task Type | Description | Biological Example |
| :--- | :--- | :--- |
| **Sequence Classification** | Assign a single label to an entire sequence | Is this DNA sequence a promoter? (Yes/No) |
| **Sequence Regression** | Predict a continuous numerical value for a sequence | How stable is this protein? (e.g., a score from 0.0 to 1.0) |
| **Token Classification** | Assign a label to *each nucleotide (token)* in a sequence | Which nucleotides in this gene are part of a binding site? |
| **Sequence-to-Sequence** | Transform an input sequence into a different output sequence | What is the secondary structure of this RNA molecule? |
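
The practical difference between these task types is the shape of the label attached to each sequence. The sketch below shows one hypothetical record per task; the sequence, the label values, and the use of a `label` field for every task are illustrative assumptions rather than a format required by the framework.

```python
# One hypothetical record per task type; field names and values are illustrative.
sequence = "GCAAAAAGC"

records_by_task = {
    # One label for the whole sequence.
    "sequence_classification": {"sequence": sequence, "label": 1},
    # One continuous value for the whole sequence.
    "sequence_regression": {"sequence": sequence, "label": 0.73},
    # One label per nucleotide.
    "token_classification": {"sequence": sequence, "label": [0, 0, 1, 1, 1, 0, 0, 0, 0]},
    # An output sequence, here a dot-bracket secondary structure.
    "sequence_to_sequence": {"sequence": sequence, "label": "((.....))"},
}

for task, record in records_by_task.items():
    print(f"{task:24s} -> {type(record['label']).__name__}")

# Token-level and structure labels must align one-to-one with the nucleotides.
assert len(records_by_task["token_classification"]["label"]) == len(sequence)
assert len(records_by_task["sequence_to_sequence"]["label"]) == len(sequence)
```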


In this tutorial, we focus on **Sequence Classification**: we will classify a given mRNA sequence as having either 'High' or 'Low' translation efficiency.

> 🎯 **Why choose translation efficiency prediction?** This is a classic problem with important biological significance that's perfect for demonstrating machine learning workflows.

## Matching Data, Tasks, and Models in OmniGenBench 🔗

`OmniGenBench` provides a suite of pre-built models and datasets designed to work together. The key is to match your task to the right model and relevant dataset. Here are some educational combinations you can explore within the framework:

| Task Type | `OmniGenBench` Model | Example Dataset | Biological Question |
| :--- | :--- | :--- | :--- |
| **Sequence Classification** | `OmniModelForSequenceClassification` | `yangheng/promoter_prediction` | Is a DNA sequence a promoter region? |
| **Sequence Regression** | `OmniModelForSequenceRegression` | `yangheng/gfp_fluorescence` | How fluorescent will a GFP variant be? |
| **Token Classification** | `OmniModelForTokenClassification` | `yangheng/tf_binding_prediction` | Where will a transcription factor bind on this DNA? |

For our translation efficiency prediction task (High vs. Low), we are performing **Sequence Classification**. Therefore, the appropriate model is `OmniModelForSequenceClassification`.

> 💭 **Think About It**: If we wanted to predict the specific numerical value of translation efficiency (e.g., a continuous value between 0.1 and 0.9), which model should we use?

Let's start preparing the data for this specific task.

```python
# Install the required packages
!pip install multimolecule omnigenbench -U
```

```python
import warnings
from omnigenbench import (
    OmniTokenizer,
    OmniDatasetForSequenceClassification,
)

warnings.filterwarnings('ignore')
print("✅ Libraries imported successfully!")
print("🚀 Enhanced OmniDataset with automatic HuggingFace integration is ready!")
```

## 📋 Understanding OmniGenBench Data Templates

Before diving into data loading, it's crucial to understand how OmniGenBench expects data to be organized. This section explains the standardized data templates and directory structures that make the framework so powerful and consistent.

### 🗂️ Standard Directory Structure

OmniGenBench follows a conventional directory structure that enables automatic data discovery and loading:

```
dataset_directory/
├── train.jsonl              # Training data (required)
├── valid.jsonl              # Validation data (recommended)
├── test.jsonl               # Test data (optional)
├── config.py                # Metadata file (optional)
└── README.md                # Documentation (recommended)
```

**Key Benefits of This Structure:**
- 🔄 **Automatic Discovery**: Framework can find and load data without manual path specification
- 📊 **Consistent Splits**: Standard train/valid/test division across all datasets  
- 🧬 **Metadata Integration**: Additional biological annotations stored separately
- 📚 **Documentation**: Self-documenting datasets for reproducible research
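
To illustrate what automatic discovery can mean in practice, the following sketch scans a directory for the conventional split files. It is not the framework's actual discovery logic: the function `discover_splits`, the list of extensions, and their order of preference are assumptions made for this example.

```python
import tempfile
from pathlib import Path

SPLITS = ("train", "valid", "test")
# Extensions checked in order of preference (an assumption for this sketch).
EXTENSIONS = (".jsonl", ".json", ".csv", ".tsv", ".parquet")


def discover_splits(dataset_dir):
    """Return {split: Path} for each split file present in dataset_dir."""
    root = Path(dataset_dir)
    found = {}
    for split in SPLITS:
        for ext in EXTENSIONS:
            candidate = root / f"{split}{ext}"
            if candidate.is_file():
                found[split] = candidate
                break  # keep the first matching extension for this split
    if "train" not in found:
        raise FileNotFoundError(f"No training file found in {root}")
    return found


# Demo on a throwaway directory that contains only train and valid files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.jsonl").write_text('{"sequence": "AUGC", "label": 1}\n')
    (Path(tmp) / "valid.jsonl").write_text('{"sequence": "GCUA", "label": 0}\n')
    splits = discover_splits(tmp)
    print(sorted(splits))  # ['train', 'valid']
```

Because only the training file is treated as required, a dataset without a test split still loads, which mirrors the required/recommended/optional annotations in the directory listing.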

### 📄 Data File Formats

#### **JSONL Format (Recommended)**
Each line contains a JSON object with the sequence and its annotations:

```json
{"sequence": "ATCGATCG...", "label": 1, "id": "seq_001"}
{"sequence": "GCTAGCTA...", "label": 0, "id": "seq_002"}
```
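
JSONL files are easy to produce and consume with the standard library alone. The sketch below writes and reads records shaped like the ones above; the helper names `write_jsonl` and `read_jsonl` and the example sequences are hypothetical and are not part of OmniGenBench.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records in the same shape as the example above.
records = [
    {"sequence": "AUGGCUAAG", "label": 1, "id": "seq_001"},
    {"sequence": "GCUAGCUAA", "label": 0, "id": "seq_002"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.jsonl"
    write_jsonl(records, path)
    loaded = read_jsonl(path)

print(loaded == records)  # True
```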

#### **CSV Format (Alternative)**
Tabular format with columns for sequence, labels, and metadata:

```csv
id,sequence,label,organism,5utr_length
seq_001,ATCGATCG...,1,rice,120
seq_002,GCTAGCTA...,0,rice,85
```
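
A CSV file in this layout can be turned into the same kind of records with Python's `csv` module. In this sketch the function `load_csv_records`, the example sequences, and the choice to group the remaining columns under a `metadata` key are assumptions made for illustration.

```python
import csv
import io

# Hypothetical file content using the column layout shown above.
CSV_TEXT = """id,sequence,label,organism,5utr_length
seq_001,AUGGCUAAG,1,rice,120
seq_002,GCUAGCUAA,0,rice,85
"""


def load_csv_records(handle, sequence_column="sequence", label_column="label"):
    """Parse CSV rows into records with an integer label and a metadata dict."""
    records = []
    for row in csv.DictReader(handle):
        record = {
            "sequence": row[sequence_column],
            "label": int(row[label_column]),  # CSV values arrive as strings
        }
        # Keep the remaining columns so biological context is not lost.
        record["metadata"] = {
            key: value
            for key, value in row.items()
            if key not in (sequence_column, label_column)
        }
        records.append(record)
    return records


records = load_csv_records(io.StringIO(CSV_TEXT))
print(records[0]["label"], records[0]["metadata"]["organism"])  # 1 rice
```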

### 🏷️ Label Format for Translation Efficiency Classification

For TE prediction, labels represent translation efficiency categories:

**Binary Classification:**
- `1`: High translation efficiency
- `0`: Low translation efficiency

**JSONL Format:**
```json
{"sequence": "ATGAAACCC...", "label": 1}
```

**CSV Format:**
```csv
id,sequence,label
te_001,ATGAAACCC...,1
te_002,GCTATCGAT...,0
```
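
Keeping an explicit mapping between label ids and class names makes it easy to report class balance before training. The following sketch uses the High/Low definition above; the example labels and the helper `class_balance` are hypothetical.

```python
from collections import Counter

# Mapping taken from the label definition above.
LABEL_TO_ID = {"Low": 0, "High": 1}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def class_balance(labels):
    """Return the fraction of examples per human-readable class name."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {ID_TO_LABEL[idx]: counts[idx] / total for idx in sorted(counts)}


labels = [1, 0, 0, 1, 1, 0, 1, 1]  # hypothetical labels for eight sequences
print(class_balance(labels))  # {'Low': 0.375, 'High': 0.625}
```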

### ⚙️ Configuration Files

#### **config.py (Optional but Recommended)**
```json
{
  "task_type": "sequence_classification",
  "sequence_type": "RNA",
  "num_labels": 2,
  "sequence_length": 500,
  "total_samples": 15000,
  "description": "Translation efficiency prediction dataset"
}
```

### Note
Datasets can also be stored as CSV/TSV, Parquet, or other formats, in which case the split files are named accordingly (for example `train.csv`, `valid.csv`, and `test.csv`).
The `config.py` file is optional but highly recommended. It records metadata about your dataset, such as the number of labels, the label names, and the sequence length, which helps the framework process your data correctly.

#### **Key Metadata Fields:**
- **`task_type`**: Defines the ML task (sequence_classification for TE)
- **`sequence_type`**: RNA for mRNA sequences
- **`num_labels`**: 2 for binary classification (High/Low)
- **`label_names`**: Human-readable labels (not shown in the example configuration above)
- **`sequence_length`**: Typical mRNA sequence length
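
Metadata such as `num_labels` is also useful for sanity-checking the data itself. The sketch below compares hypothetical records against the example metadata; the function `check_labels` is an assumption for illustration and is not an OmniGenBench API.

```python
# Metadata copied from the example configuration above.
config = {
    "task_type": "sequence_classification",
    "sequence_type": "RNA",
    "num_labels": 2,
    "sequence_length": 500,
}


def check_labels(records, config):
    """Return a message for every record whose label is outside 0..num_labels-1."""
    problems = []
    valid_ids = set(range(config["num_labels"]))
    for index, record in enumerate(records):
        if record["label"] not in valid_ids:
            problems.append(
                f"record {index}: label {record['label']!r} not in {sorted(valid_ids)}"
            )
    return problems


# Hypothetical records; the second one carries a label the config does not allow.
records = [
    {"sequence": "AUGGCUAAG", "label": 1},
    {"sequence": "GCUAGCUAA", "label": 2},
]
print(check_labels(records, config))  # ['record 1: label 2 not in [0, 1]']
```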

### 🎯 Loading Strategy Options

OmniGenBench provides multiple loading strategies based on your data format:

```python
# Method 1: From Hugging Face Hub (automatic download)
dataset = OmniDatasetForSequenceClassification.from_hub(
    dataset_name_or_path="translation_efficiency_prediction",
    tokenizer=tokenizer
)

# Method 2: From local directory (JSONL files)
dataset = OmniDatasetForSequenceClassification(
    "./my_te_dataset/",
    tokenizer=tokenizer
)

# Method 3: From CSV file
dataset = OmniDatasetForSequenceClassification(
    "te_data.csv",
    tokenizer=tokenizer,
    sequence_column="sequence",
    label_column="label"
)
```

### 💡 Best Practices

1. **🏷️ Consistent Naming**: Use descriptive, consistent file and column names
2. **📊 Balanced Splits**: Ensure train/valid/test splits are representative
3. **🔍 Quality Control**: Validate data integrity before training
4. **📚 Documentation**: Include README with dataset description and usage
5. **🧬 Biological Context**: Preserve important biological metadata
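
Practices 2 and 3 can be sketched with the standard library: a stratified split keeps the High/Low ratio similar in every split, and a simple overlap check guards against the same sequence appearing in two splits. The function `stratified_split`, the split fractions, and the randomly generated records are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(records, valid_fraction=0.2, test_fraction=0.2, seed=42):
    """Split records into train/valid/test while preserving label proportions."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    splits = {"train": [], "valid": [], "test": []}
    for group in by_label.values():
        rng.shuffle(group)
        n_valid = round(len(group) * valid_fraction)
        n_test = round(len(group) * test_fraction)
        splits["valid"] += group[:n_valid]
        splits["test"] += group[n_valid:n_valid + n_test]
        splits["train"] += group[n_valid + n_test:]
    return splits


# Twenty hypothetical records, ten per class, with random 30-nt sequences.
rng = random.Random(0)
records = [
    {"id": f"seq_{i:03d}", "sequence": "".join(rng.choices("ACGU", k=30)), "label": i % 2}
    for i in range(20)
]

splits = stratified_split(records)
print({name: len(items) for name, items in splits.items()})
# {'train': 12, 'valid': 4, 'test': 4}

# Quality control: no record id may appear in more than one split.
train_ids = {r["id"] for r in splits["train"]}
held_out_ids = {r["id"] for r in splits["valid"] + splits["test"]}
assert not train_ids & held_out_ids
```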

With this foundation in place, let's configure our specific dataset parameters.

### 🛠️ Environment Setup

The required packages were installed and imported at the start of this notebook. Here we only select the pretrained model whose tokenizer we will use to process the sequences.

> ⚠️ **Note**: If you have not yet run the installation and import cells above, run them before continuing.

```python
# Enhanced Data Preparation Demo
print("🎉 Data Preparation Simplified with Enhanced OmniDataset!")
print("=" * 60)

config_or_model = "yangheng/OmniGenome-52M"
```


### 📥 Enhanced Data Preparation with OmniDataset

The enhanced OmniGenBench framework has revolutionized data preparation! With the new `from_hub()` class method, you no longer need to:

- ❌ Write manual download functions
- ❌ Create custom dataset classes  
- ❌ Handle file extraction manually
- ❌ Manage cache directories

Instead, everything is handled automatically:

- ✅ **Automatic Download**: Downloads datasets from HuggingFace Hub automatically
- ✅ **Smart Processing**: Handles different file formats (CSV, JSON, JSONL) intelligently
- ✅ **Built-in DataLoaders**: Get optimized PyTorch DataLoaders with one method call
- ✅ **Universal Interface**: Works with all genomic tasks (classification, regression, token-level)

> 🌾 **Dataset Background**: We're using the rice translation efficiency dataset where each mRNA sequence is labeled as having high (1) or low (0) translation efficiency based on experimental validation.

```python
print(f"🔍 Loading model and tokenizer from: {config_or_model}")

# Initialize tokenizer
tokenizer = OmniTokenizer.from_pretrained(config_or_model)
print("✅ Tokenizer initialized")

# Create datasets with automatic download and processing
print("🏗️ Creating datasets with automatic download...")
datasets = OmniDatasetForSequenceClassification.from_hub(
    dataset_name_or_path="translation_efficiency_prediction",
    tokenizer=tokenizer,
    max_length=512,
)

print("📊 Dataset sizes:")
print(f"  📈 Training samples: {len(datasets['train'])}")
print(f"  🔍 Validation samples: {len(datasets['valid'])}")
print(f"  🧪 Test samples: {len(datasets['test'])}")

print("=" * 60)
print("🎉 Data preparation completed with just a few lines of code!")
```

### 🚀 The Power of OmniDataset

The `OmniDataset` class eliminates the complexity of traditional data preparation. Here's what happened behind the scenes when we called `from_hub()`:

1. **Automatic Download**: 
   - Checks if data exists locally
   - Downloads from HuggingFace Hub if needed  
   - Shows progress with a beautiful progress bar

2. **Intelligent Processing**:
   - Automatically detects file format (CSV, JSON, JSONL)
   - Handles data parsing and validation
   - Creates proper label mappings

3. **Seamless Integration**:
   - Works with any OmniGenome tokenizer
   - Provides optimized DataLoaders with smart defaults
   - Handles GPU memory management automatically
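
To give a feel for the batching step, here is a minimal pure-Python sketch of what a collate function typically does: truncate each tokenized sequence to `max_length`, pad to a common width, and build an attention mask. The padding id `PAD_ID` and the function `collate` are hypothetical; the DataLoaders provided by the framework may work differently.

```python
PAD_ID = 0  # hypothetical padding token id


def collate(batch, max_length=512, pad_id=PAD_ID):
    """Truncate to max_length, then pad every example to the longest in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two hypothetical tokenized sequences of different lengths.
batch = collate([[5, 6, 7], [8]], max_length=2)
print(batch["input_ids"])       # [[5, 6], [8, 0]]
print(batch["attention_mask"])  # [[1, 1], [1, 0]]
```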

**Why is this revolutionary?**
- **Far Less Code**: A single `from_hub()` call replaces the 50+ lines of a typical hand-written download-and-parse implementation
- **Universal Interface**: Same pattern works for all genomic tasks
- **Error-Proof**: Built-in error handling and validation
- **Research-Ready**: Focus on science, not infrastructure

> 💡 **Pro Tip**: This same pattern works for any genomic dataset! Just change the `dataset_name_or_path` argument and you're ready to go.

## 🎯 Summary and Next Steps

In this data preparation tutorial, you have covered the fundamental data handling skills:

### 🎓 **Skills Learned:**
✅ **Understood genomic data types** (DNA vs RNA)  
✅ **Mastered ML task classification** (classification, regression, token classification)  
✅ **Learned data-task-model matching principles**  
✅ **Understood OmniGenBench data templates** and directory structures  
✅ **Successfully used pre-built datasets** from HuggingFace Hub  
✅ **Loaded tokenized train/valid/test splits** with OmniGenBench

### 🏗️ **OmniGenBench Design Philosophy**

The framework follows key design principles that make genomic machine learning accessible and powerful:

#### **🔄 Modularity**
- **Universal concepts** separated from task-specific implementations
- **Reusable components** across different genomic prediction tasks
- **Standardized interfaces** for consistent user experience

#### **📚 Progressive Learning**
- **Foundation concepts** apply to all tasks (like the data templates you learned)
- **Task-specific tutorials** build upon universal knowledge
- **Advanced techniques** available when you're ready

#### **🎯 Practicality**
- **Real research scenarios** guide tutorial design
- **Publication-ready workflows** from day one
- **Immediate applicability** to your own projects

#### **🔧 Consistency**
- **Same 4-step workflow** across all genomic tasks:
  1. **Data Preparation** (this tutorial)
  2. **Model Initialization** (next tutorial)
  3. **Model Training** (tutorial 3)
  4. **Model Inference** (tutorial 4)
- **Uniform data formats** enable easy task switching
- **Predictable patterns** reduce learning curve

#### **🚀 Extensibility**
- **Easy adaptation** of tutorials for custom applications
- **Template-based approach** for new datasets
- **Modular architecture** supports novel research directions

> 💡 **Why This Matters**: This philosophy ensures that once you learn one genomic task, you can quickly master others. The translation efficiency concepts you've learned here directly apply to transcription factor binding prediction, protein stability prediction, and many other genomic challenges!

---

### 🚀 What's Next...

You've built a solid data foundation! In the next tutorial, we will:
- 🤖 **Explore genomic foundation models** and their architecture
- 🔧 **Learn model initialization** for different biological tasks
- 📊 **Understand model parameters** and configuration options
- 🎯 **Apply models to the dataset** you just prepared!

**Ready to meet your first genomic foundation model?** 

👉 **Next Step**: Open [02_model_initialization.ipynb](https://github.com/yangheng95/OmniGenBench/blob/master/examples/translation_efficiency_prediction/02_model_initialization.ipynb) to continue your learning journey!

## 🚀 Ready for Advanced Dataset Creation?

You've mastered the fundamentals of data preparation using pre-built datasets! But what happens when you need to work with your own biological data? 

### 🔬 When You Need Custom Datasets

In real research, you often encounter:
- 🧬 **Novel experimental data** from your laboratory
- 📊 **Unique biological questions** not covered by existing datasets  
- 🎯 **Proprietary data** that gives you a research advantage
- 📈 **Publication requirements** for original datasets

### 🎓 Advanced Skills Await You!

To learn how to create **publication-ready datasets from scratch**, including:
- **📋 Professional CSV templates** for any genomic task
- **⚙️ Research-grade configuration** management
- **🔬 Data quality assurance** and validation
- **📊 Integration with complex workflows**

**👉 Continue to Tutorial 5: [Advanced Dataset Creation](https://github.com/yangheng95/OmniGenBench/blob/master/examples/translation_efficiency_prediction/05_advanced_dataset_creation.ipynb)**

> 💡 **Learning Path**: Master the basics first (Tutorials 1-4), then advance to custom dataset creation (Tutorial 5) when you're ready for professional research workflows!