# 🔬 Advanced Dataset Creation: From Raw Data to ML-Ready Datasets

Welcome to the **advanced tutorial** in our series! Now that you've mastered the basics of data preparation and model training, it's time to tackle the real-world challenge every computational biologist faces: **creating custom datasets from your own biological data**.

> 🎯 **Learning Objectives**: Master custom dataset creation, professional data templates, configuration management, and publication-ready data workflows

---

## 🧬 Why Custom Datasets Matter in Genomics Research

In your research career, you'll encounter scenarios where existing datasets don't meet your needs:

- 🔬 **Novel Research Questions**: Your hypothesis requires unique data combinations
- 📊 **Proprietary Data**: Laboratory experiments generate exclusive datasets
- 🎯 **Domain-Specific Tasks**: Specialized applications (drug discovery, agriculture, personalized medicine)
- 📈 **Publication Requirements**: Original datasets strengthen research impact

This tutorial teaches you to transform raw biological data into **publication-ready, machine learning datasets** that integrate seamlessly with OmniGenBench.

### 🎓 Real-World Research Scenario

Imagine you're investigating **gene regulation** and have:
- DNA sequences from ChIP-seq experiments
- Experimental validation of promoter activity
- Metadata including tissue type, condition, GC content
- A hypothesis about sequence-function relationships

**Your Goal**: Build a robust classifier to predict promoter activity from DNA sequence alone.

> 💡 **Research Impact**: Sequence-to-function classifiers of this kind are widely used in published regulatory genomics research.

## 📋 The Universal Data Template: CSV Format Mastery

OmniGenBench uses a **standardized CSV format** that works across all genomic tasks. This universality is what makes the framework so powerful for research.

### 🔑 Core Template Structure

| Field Name | Description | Required? | Data Type | Example |
|------------|-------------|-----------|-----------|---------|
| **`sequence`** | Your biological sequence (DNA/RNA/Protein) | ✅ **Required** | String | `ATCGATCGATCG...` |
| **`label`** | The target variable you want to predict | ✅ **Required** | Numeric/String | `1`, `0.75`, `"high"` |
| **`id`** | Unique identifier for each sample | ⚠️ **Recommended** | String | `seq_001`, `gene_XYZ` |
| **`split`** | Dataset partition (train/valid/test) | ⚠️ **Recommended** | String | `train`, `valid`, `test` |

### 🧬 Optional Biological Metadata

The beauty of this format is **flexibility**! Add any biological annotations:

| Field Type | Examples | Research Value |
|------------|----------|----------------|
| **Sequence Properties** | `gc_content`, `length`, `complexity` | Sequence-based features |
| **Experimental Context** | `tissue`, `condition`, `treatment` | Biological context |
| **Functional Annotations** | `gene_family`, `pathway`, `domain` | Functional categories |
| **Quality Metrics** | `confidence_score`, `read_depth` | Data reliability |
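
The sequence-property columns above can be computed directly from the `sequence` field. The sketch below is one possible way to do it. Treating `complexity` as per-base Shannon entropy is an assumption made here for illustration, since the template does not define that column, and the helper names `gc_content` and `shannon_entropy` are hypothetical.

```python
import math
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G/C bases; case-insensitive, 0.0 for an empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def shannon_entropy(seq: str) -> float:
    """Shannon entropy in bits per base (assumed definition of 'complexity').

    0 means a single repeated base; 2 means A, C, G, T are equally frequent.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    counts = Counter(seq)
    return -sum((n / len(seq)) * math.log2(n / len(seq)) for n in counts.values())

example = "ATCGATCGATCGTACGAA"
features = {
    "length": len(example),
    "gc_content": round(gc_content(example), 2),
    "complexity": round(shannon_entropy(example), 2),
}
print(features)
```

Upper-casing first means soft-masked (lowercase) bases are counted as well, which matters for sequences exported from masked genome assemblies.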

### 📊 Example CSV Structure

```csv
sequence,label,id,split,gc_content,tissue,confidence,source
ATCGATCGATCGTACGAA,1,promoter_001,train,0.44,liver,0.95,human_genome
GGGGCCCCAAAATTTTCC,0,random_001,train,0.56,liver,0.87,synthetic
TACGATCGAAATTTCGAT,1,promoter_002,valid,0.33,brain,0.91,mouse_genome
AAAAAAAATTTTTTTTGG,0,random_002,test,0.11,brain,0.79,synthetic
```
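
Before training, it can help to check a table such as the one above against the template. The following is a minimal sketch, not an OmniGenBench API: `validate_template`, `REQUIRED_COLUMNS`, and `ALLOWED_SPLITS` are hypothetical names, and the checks simply mirror the required and recommended fields listed in the core template.

```python
import pandas as pd

REQUIRED_COLUMNS = ("sequence", "label")       # required by the template
ALLOWED_SPLITS = {"train", "valid", "test"}    # values listed for `split`

def validate_template(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = [f"missing required column: {c}" for c in REQUIRED_COLUMNS if c not in df.columns]
    if problems:
        return problems  # the remaining checks need both required columns
    if df["sequence"].isna().any() or (df["sequence"].astype(str).str.len() == 0).any():
        problems.append("empty or missing sequences")
    if df["label"].isna().any():
        problems.append("missing labels")
    if "id" in df.columns and df["id"].duplicated().any():
        problems.append("duplicate ids")
    if "split" in df.columns:
        unknown = set(df["split"].dropna().unique()) - ALLOWED_SPLITS
        if unknown:
            problems.append(f"unknown split values: {sorted(unknown)}")
    return problems

demo = pd.DataFrame({
    "sequence": ["ATCGATCG", "GGGGCCCC"],
    "label": [1, 0],
    "id": ["promoter_001", "random_001"],
    "split": ["train", "valid"],
})
print(validate_template(demo))
```

Returning a list of problems rather than raising on the first one makes it easier to fix a file in a single pass.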

> 🔬 **Pro Tip**: This same template works for **all biological sequence tasks** - just change the `label` column meaning!

### ⚠️ Dataset Synthesis & Label Semantics (Important)

All sequence examples in this tutorial are **synthetic and illustrative**. Labels from each data source reflect *different biological notions* and are **not directly comparable** until explicitly harmonized:

| Source | Column `label` Meaning | Positive Class (1) | Negative Class (0) |
|--------|------------------------|--------------------|--------------------|
| FASTA | Promoter-like vs background sequence | Promoter-like | Background/intergenic |
| GFF | Gene biotype functional potential | protein_coding | Non-coding / other biotypes |
| BED | Regulatory activity status | promoter / enhancer / silencer / insulator | background / other |
| JSON | Expression-level categorization | High expression (>= threshold) | Low expression |

During integration we add a `label_origin` column to preserve provenance. If you intend to train a single model, **first define a unified task** (e.g., “active regulatory regions”) and remap labels accordingly.
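
As a rough illustration of that integration step, the sketch below tags each source with `label_origin`, keeps the source-specific label in a separate column, and only then derives a unified `label`. The toy frames, the `tag_origin` helper, the `source_label` column, and the choice of which origins are compatible with the unified task are all assumptions for demonstration.

```python
import pandas as pd

def tag_origin(df: pd.DataFrame, origin: str) -> pd.DataFrame:
    """Keep the sequence, move the source-specific label aside, and record provenance."""
    out = df[["sequence", "label"]].rename(columns={"label": "source_label"})
    out["label_origin"] = origin
    return out

# Toy stand-ins for per-format conversion results
fasta_part = pd.DataFrame({"sequence": ["ATCG", "TTTT"], "label": [1, 0]})
bed_part = pd.DataFrame({"sequence": ["GCGC", "AAAA"], "label": [1, 0]})
gff_part = pd.DataFrame({"sequence": ["CCGG", "TATA"], "label": [1, 0]})

combined = pd.concat(
    [tag_origin(fasta_part, "FASTA"), tag_origin(bed_part, "BED"), tag_origin(gff_part, "GFF")],
    ignore_index=True,
)

# Hypothetical unified task: "active regulatory region".
# Assumption: only FASTA and BED labels express that notion; GFF labels
# describe gene biotype and are excluded rather than silently mixed in.
COMPATIBLE_ORIGINS = {"FASTA", "BED"}
unified = combined[combined["label_origin"].isin(COMPATIBLE_ORIGINS)].copy()
unified["label"] = unified["source_label"].astype(int)
print(unified)
```

Keeping `source_label` next to `label_origin` means the remapping can be revisited later without re-running the format conversions.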

> Disclaimer: Do **not** use these synthetic examples for biological inference; they are provided solely for pipeline illustration.

## 🔄 Converting Biological Data Formats to CSV/TSV

In genomics research, biological data comes in various specialized formats (FASTA, GFF, BED, JSON, etc.). This section teaches you how to systematically convert these common formats into machine learning-ready CSV/TSV datasets compatible with OmniGenBench.

### 📁 Common Biological Data Format Conversion Matrix

| Source Format | Description | Target Output | Research Applications |
|---------------|-------------|---------------|----------------------|
| **FASTA** | Sequence data (DNA/RNA/Protein) | CSV with sequences | Sequence classification, function prediction |
| **GFF/GTF** | Genomic annotation files | CSV with genomic features | Gene expression, regulatory analysis |
| **BED** | Genomic coordinate intervals | CSV with positional data | Peak detection, interval analysis |
| **VCF** | Variant call format | CSV with genetic variants | Variant effect prediction |
| **JSON** | Structured experimental data | CSV with metadata | Complex feature engineering |
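
VCF appears in the matrix, so here is a brief sketch of how its fixed columns could be read into a table. It is a minimal illustration that handles only the eight mandatory VCF columns and ignores per-sample genotype fields; `parse_vcf_lines` and the output column names are hypothetical. The result has no `sequence` or `label` column yet: those would have to come from a reference genome and from whatever variant-effect annotation your study provides.

```python
import pandas as pd

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_lines(lines):
    """Parse VCF text lines into a DataFrame with one row per alternate allele."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information lines and the column header
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # skip malformed records
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        # Multi-allelic sites list several ALT alleles separated by commas
        for alt in record["ALT"].split(","):
            rows.append({
                "id": f'{record["CHROM"]}_{record["POS"]}_{record["REF"]}_{alt}',
                "chromosome": record["CHROM"],
                "position": int(record["POS"]),  # VCF positions are 1-based
                "ref": record["REF"],
                "alt": alt,
                "filter": record["FILTER"],
            })
    return pd.DataFrame(rows)

sample_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=100",
    "chr2\t67890\trs1\tC\tT,G\t99\tPASS\tDP=80",
]
vcf_df = parse_vcf_lines(sample_vcf)
print(vcf_df)
```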

### 🎯 Learning Objectives

By the end of this tutorial, you will master:
1. **Format Recognition**: Understand characteristics of different biological data formats
2. **Systematic Conversion**: Implement automated conversion pipelines using Python
3. **Quality Control**: Validate data integrity and biological relevance
4. **Best Practices**: Apply professional standards for reproducible research

Let's start with a step-by-step approach to handle each format:

## Environment Setup and Library Imports

First, let's set up our working environment and import the necessary Python libraries for biological data processing.

In [None]:
# Essential libraries for biological data processing and conversion
import pandas as pd          # Data manipulation and CSV handling
import numpy as np           # Numerical operations and random data generation
import os                    # File system operations
import json                  # JSON data parsing
from io import StringIO      # String buffer operations

print("🔬 Biological Data Format Conversion Tutorial")
print("=" * 50)
print("✅ All required libraries imported successfully")

# Create working directory for conversion examples
os.makedirs('conversion_examples', exist_ok=True)
print("📂 Created working directory: conversion_examples/")

# Set random seed for reproducibility
np.random.seed(42)
print("🎲 Random seed set to 42 for reproducible results")

## FASTA Format Conversion

FASTA is the most common format for biological sequences. Let's learn how to convert FASTA files into CSV format suitable for machine learning.

### 🧬 FASTA Format Structure
```
>sequence_id|annotation_info
ATCGATCGATCG...
>next_sequence_id|annotation
GCGCGCGC...
```

### 🎯 Conversion Strategy
1. **Parse headers**: Extract sequence IDs and annotations
2. **Extract sequences**: Read multi-line sequences 
3. **Generate labels**: Create target variables based on biological annotations
4. **Create CSV**: Structure data in OmniGenBench-compatible format

In [None]:
print("Step 2: FASTA → CSV Conversion")
print("-" * 40)

# Create sample FASTA file with realistic biological sequences
fasta_content = """>seq1|promoter_human_chr1|BRCA1_upstream
ATCGATCGATCGTACGAATTCCGGAAATTTCCCGGGAAATTTGGGCCCAAATTTAAAGGG
>seq2|random_sequence_1|intergenic_control
AAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTT
>seq3|promoter_mouse_chrX|TP53_regulatory
GCGCGCGCATATATATATGCGCGCGCATATATATATGCGCGCGCATATATATAT
>seq4|intergenic_region_1|background_control
TTTTTTTTAAAAAAAACCCCCCCCGGGGGGGGTTTTTTTAAAAAAAACCCCCCCC
>seq5|promoter_rat_chr2|MYC_enhancer
TATAAAAAGCGCGCGCCCCCGGGGAAAATTTTTCCCGGGAAATTTAGCTAGCTAG
"""

# Save sample FASTA file
fasta_path = 'conversion_examples/sample_sequences.fasta'
with open(fasta_path, 'w') as f:
    f.write(fasta_content)
print(f"✅ Created sample FASTA file: {fasta_path}")

def parse_fasta_to_csv(fasta_file, output_csv):
    """
    Convert FASTA format to CSV with biological annotations.
    
    Parameters:
    -----------
    fasta_file : str
        Path to input FASTA file
    output_csv : str 
        Path for output CSV file
    
    Returns:
    --------
    pd.DataFrame
        Processed dataset with sequences and labels
    """
    sequences = []
    labels = []
    ids = []
    annotations = []
    
    with open(fasta_file, 'r') as f:
        current_seq = ""
        current_id = ""
        current_annotation = ""
        
        for line in f:
            line = line.strip()
            
            if line.startswith('>'):
                # Save previous sequence if exists
                if current_seq:
                    sequences.append(current_seq)
                    ids.append(current_id)
                    annotations.append(current_annotation)
                    
                    # Generate binary label based on biological annotation
                    # 1 = functional sequence (promoter), 0 = background/control
                    label = 1 if 'promoter' in current_annotation.lower() else 0
                    labels.append(label)
                
                # Parse new header: >id|type|annotation
                header_parts = line[1:].split('|')
                current_id = header_parts[0] if len(header_parts) > 0 else "unknown"
                current_annotation = '|'.join(header_parts[1:]) if len(header_parts) > 1 else "no_annotation"
                current_seq = ""
            else:
                # Accumulate sequence lines
                current_seq += line
        
        # Handle last sequence
        if current_seq:
            sequences.append(current_seq)
            ids.append(current_id)
            annotations.append(current_annotation)
            label = 1 if 'promoter' in current_annotation.lower() else 0
            labels.append(label)
    
    # Create structured DataFrame
    df = pd.DataFrame({
        'sequence': sequences,
        'label': labels,
        'id': ids,
        'annotation': annotations,
        'split': 'train'  # Default assignment - can be modified later
    })
    
    # Add sequence-level features for quality control
    df['sequence_length'] = df['sequence'].apply(len)
    df['gc_content'] = df['sequence'].apply(
        # Upper-case first so soft-masked (lowercase) bases are counted too
        lambda seq: (seq.upper().count('G') + seq.upper().count('C')) / len(seq) if len(seq) > 0 else 0
    )
    
    # Save to CSV
    df.to_csv(output_csv, index=False)
    return df

# Execute FASTA conversion
csv_path = 'conversion_examples/fasta_converted.csv'
fasta_df = parse_fasta_to_csv(fasta_path, csv_path)

print(f"✅ Conversion completed: {len(fasta_df)} sequences processed")
print(f"📊 Dataset composition:")
print(f"   - Promoter sequences: {fasta_df['label'].sum()}")
print(f"   - Background sequences: {len(fasta_df) - fasta_df['label'].sum()}")
print(f"📁 Output saved: {csv_path}")

# Display sample results
print(f"\n📋 Sample converted data:")
print(fasta_df[['id', 'sequence_length', 'gc_content', 'label', 'annotation']].head(3))
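
The converter assigns every row to `train` by default. One way to fill the `split` column afterwards is a stratified random assignment, so each label keeps roughly the same share in every partition. This is a sketch under stated assumptions: `assign_splits`, its default fractions, and the seed are illustrative choices, and the DataFrame index is assumed to be unique.

```python
import numpy as np
import pandas as pd

def assign_splits(df, label_col="label", valid_frac=0.1, test_frac=0.1, seed=42):
    """Assign train/valid/test within each label group (stratified split)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    out["split"] = "train"
    for _, idx in out.groupby(label_col).groups.items():
        shuffled = rng.permutation(np.asarray(idx))   # shuffle this class's row labels
        n_valid = int(round(len(shuffled) * valid_frac))
        n_test = int(round(len(shuffled) * test_frac))
        splits = np.array(["train"] * len(shuffled), dtype=object)
        splits[:n_valid] = "valid"
        splits[n_valid:n_valid + n_test] = "test"
        out.loc[shuffled, "split"] = splits           # positional match with `shuffled`
    return out

demo = pd.DataFrame({"sequence": ["ACGT"] * 20, "label": [1] * 10 + [0] * 10})
split_df = assign_splits(demo)
print(pd.crosstab(split_df["label"], split_df["split"]))
```

With only a handful of sequences per class, as in the five-record sample file, the rounded counts can be zero, so this is meant for realistically sized datasets. Random splitting also does not account for homologous sequences leaking between partitions.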

#### 🔎 Label Meaning (FASTA)
`label = 1` → sequence header contains a promoter-like keyword; `0` → background / control. This is a heuristic.

> If you need multi-class (e.g., promoter subclasses), add a `promoter_type` column instead of overloading `label`.

## GFF/GTF Format Conversion

GFF (General Feature Format) and GTF (Gene Transfer Format) files contain genomic annotations. These are tab-delimited files describing genomic features and their coordinates.

### 📋 GFF/GTF Structure
```
seqname  source  feature  start  end  score  strand  frame  attributes
chr1     RefSeq  gene     1000   2000   .      +       .     ID=gene1;Name=BRCA1
```

### 🎯 Conversion Strategy
1. **Parse coordinates**: Extract genomic positions (chromosome, start, end)
2. **Extract features**: Focus on specific feature types (genes, exons, etc.)
3. **Process attributes**: Parse semicolon-separated attribute fields
4. **Generate sequences**: Create mock sequences based on coordinates (or extract from genome)

In [None]:
print("Step 3: GFF/GTF → CSV Conversion")
print("-" * 40)

# Create sample GFF file with realistic genomic annotations
gff_content = """##gff-version 3
chr1	RefSeq	gene	1000	2000	.	+	.	ID=gene1;Name=BRCA1;biotype=protein_coding;description=tumor_suppressor
chr1	RefSeq	exon	1000	1200	.	+	.	ID=exon1;Parent=gene1;exon_number=1
chr1	RefSeq	CDS	1050	1150	.	+	0	ID=cds1;Parent=gene1
chr2	RefSeq	gene	5000	6000	.	-	.	ID=gene2;Name=TP53;biotype=protein_coding;description=tumor_suppressor
chr2	RefSeq	exon	5000	5300	.	-	.	ID=exon2;Parent=gene2;exon_number=1
chr3	RefSeq	gene	8000	9500	.	+	.	ID=gene3;Name=MYC;biotype=protein_coding;description=oncogene
chr3	RefSeq	pseudogene	12000	12500	.	+	.	ID=pseudo1;Name=PSEU1;biotype=pseudogene;description=non_coding
chrX	RefSeq	gene	15000	16000	.	-	.	ID=geneX;Name=XIST;biotype=lncRNA;description=X_inactivation
"""

# Save sample GFF file
gff_path = 'conversion_examples/sample_annotations.gff'
with open(gff_path, 'w') as f:
    f.write(gff_content)
print(f"✅ Created sample GFF file: {gff_path}")

def parse_gff_to_csv(gff_file, output_csv, feature_type='gene', max_sequence_length=200):
    """
    Convert GFF annotations to CSV format with genomic features.
    
    Parameters:
    -----------
    gff_file : str
        Path to input GFF file
    output_csv : str
        Path for output CSV file  
    feature_type : str
        GFF feature type to extract (default: 'gene')
    max_sequence_length : int
        Maximum length for generated sequences
    
    Returns:
    --------
    pd.DataFrame
        Processed dataset with genomic features
    """
    data = []
    
    with open(gff_file, 'r') as f:
        for line in f:
            # Skip comment lines
            if line.startswith('#'):
                continue
                
            # Parse GFF fields
            fields = line.strip().split('\t')
            if len(fields) < 9:
                continue
                
            seqname, source, feature, start, end, score, strand, frame, attributes = fields
            
            # Process only specified feature type
            if feature == feature_type:
                # Parse attributes field (key=value pairs separated by semicolons)
                attr_dict = {}
                for attribute in attributes.split(';'):
                    if '=' in attribute:
                        key, value = attribute.split('=', 1)
                        attr_dict[key.strip()] = value.strip()
                
                # Extract key information
                gene_id = attr_dict.get('ID', 'unknown')
                gene_name = attr_dict.get('Name', 'unnamed')
                biotype = attr_dict.get('biotype', 'unknown')
                description = attr_dict.get('description', 'no_description')
                
                # Generate a mock sequence whose length follows the feature coordinates.
                # GFF coordinates are 1-based and inclusive, so length = end - start + 1.
                feature_length = min(int(end) - int(start) + 1, max_sequence_length)
                
                # Vary base composition by description so the mock classes differ.
                # These biases are illustrative only and make no biological claim.
                if 'tumor_suppressor' in description:
                    # GC-rich mock sequence (G/C drawn 60% of the time)
                    mock_sequence = ''.join(np.random.choice(['G', 'C', 'A', 'T'], 
                                                           feature_length, 
                                                           p=[0.3, 0.3, 0.2, 0.2]))
                elif 'oncogene' in description:
                    # AT-rich mock sequence (A/T drawn 60% of the time)
                    mock_sequence = ''.join(np.random.choice(['A', 'T', 'G', 'C'], 
                                                           feature_length, 
                                                           p=[0.3, 0.3, 0.2, 0.2]))
                else:
                    # Uniform random sequence
                    mock_sequence = ''.join(np.random.choice(['A', 'T', 'C', 'G'], feature_length))
                
                # Create binary classification labels based on biotype
                # 1 = protein_coding genes, 0 = non-coding elements
                label = 1 if biotype == 'protein_coding' else 0
                
                # Compile feature data
                feature_data = {
                    'sequence': mock_sequence,
                    'label': label,
                    'id': f"{gene_name}_{seqname}_{start}_{end}",
                    'chromosome': seqname,
                    'start_position': int(start),
                    'end_position': int(end),
                    'strand': strand,
                    'gene_name': gene_name,
                    'biotype': biotype,
                    'description': description,
                    'split': 'train'
                }
                
                data.append(feature_data)
    
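    # Create DataFrame
    df = pd.DataFrame(data)
    if df.empty:
        # No features of the requested type: write an empty file and stop early
        print(f"⚠️ No '{feature_type}' features found; empty DataFrame returned.")
        df.to_csv(output_csv, index=False)
        return df
    
    # Add computed features (GFF intervals are 1-based and inclusive)
    df['feature_length'] = df['end_position'] - df['start_position'] + 1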
    df['gc_content'] = df['sequence'].apply(
        lambda seq: (seq.count('G') + seq.count('C')) / len(seq) if len(seq) > 0 else 0
    )
    
    # Save to CSV
    df.to_csv(output_csv, index=False)
    return df

# Execute GFF conversion
gff_csv_path = 'conversion_examples/gff_converted.csv'
gff_df = parse_gff_to_csv(gff_path, gff_csv_path, feature_type='gene')

print(f"✅ GFF conversion completed: {len(gff_df)} genomic features processed")
print(f"📊 Gene classification:")
print(f"   - Protein-coding genes: {gff_df['label'].sum()}")
print(f"   - Non-coding features: {len(gff_df) - gff_df['label'].sum()}")
print(f"📁 Output saved: {gff_csv_path}")

# Display sample results  
print(f"\n📋 Sample genomic features:")
print(gff_df[['gene_name', 'biotype', 'chromosome', 'feature_length', 'gc_content', 'label']].head())
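
The converter above fills `sequence` with random bases. With a reference genome available, the same coordinates can be used to pull the real sequence instead. A minimal sketch, assuming the genome is already loaded into a plain dict that maps chromosome names to sequence strings (`toy_genome`, `extract_feature_sequence`, and `reverse_complement` are hypothetical names, not part of OmniGenBench):

```python
# Translation table for complementing bases (N stays N, case is preserved)
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_feature_sequence(genome: dict, seqname: str, start: int, end: int, strand: str = "+") -> str:
    """Slice a GFF feature (1-based, inclusive coordinates) out of an in-memory genome."""
    if start < 1 or end < start:
        raise ValueError(f"invalid GFF interval: {start}-{end}")
    chrom = genome[seqname]
    subseq = chrom[start - 1:end]  # convert to Python's 0-based, end-exclusive slice
    # Features on the minus strand are reported as the reverse complement
    return reverse_complement(subseq) if strand == "-" else subseq

toy_genome = {"chr1": "AAACCCGGGTTT"}
print(extract_feature_sequence(toy_genome, "chr1", 4, 6))        # CCC
print(extract_feature_sequence(toy_genome, "chr1", 1, 6, "-"))   # GGGTTT
```

For whole genomes, an indexed FASTA reader is usually a better fit than holding every chromosome in memory, but the coordinate arithmetic stays the same.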

#### 🔎 Label Meaning (GFF/GTF)
`label = 1` → feature biotype == `protein_coding`; `0` → other (lncRNA, pseudogene, etc.).

> For transcript-level tasks, consider constructing exon/CDS aggregation features instead.

## BED Format Conversion

BED (Browser Extensible Data) format describes genomic intervals and is widely used for ChIP-seq peaks, regulatory elements, and genomic annotations.
OmniGenBench can already load BED data directly, but understanding how to convert and structure your BED files is still important for effective dataset creation.

### 📋 BED Format Structure
```
chr1    1000    2000    peak1    100    +    enhancer
chr2    3000    3500    peak2     85    -    promoter
```

### 📊 Column Definitions
- **Columns 1-3**: Required (chromosome, start, end)
- **Columns 4-6**: Optional (name, score, strand)  
- **Column 7+**: Custom annotations (regulatory type, cell type, etc.)
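
One detail worth keeping in mind when combining BED with GFF/GTF data: BED intervals use a 0-based start and an exclusive end, whereas GFF/GTF coordinates are 1-based and inclusive. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def bed_width(start: int, end: int) -> int:
    """Width of a BED interval (0-based start, end-exclusive)."""
    return end - start

def bed_to_one_based(start: int, end: int) -> tuple:
    """Convert a BED interval to 1-based, inclusive coordinates (GFF/GTF style)."""
    return start + 1, end

def one_based_to_bed(start: int, end: int) -> tuple:
    """Convert 1-based, inclusive coordinates back to a BED interval."""
    return start - 1, end

# The first peak from the structure example: chr1 1000 2000
print(bed_width(1000, 2000))         # 1000
print(bed_to_one_based(1000, 2000))  # (1001, 2000)
print(one_based_to_bed(1001, 2000))  # (1000, 2000)
```

Both representations describe the same 1000 bases; only the numbering of the first base changes.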

### 🎯 Conversion Strategy
1. **Parse coordinates**: Extract genomic intervals
2. **Score analysis**: Interpret peak scores or confidence values
3. **Regulatory classification**: Use annotations for label generation
4. **Quality metrics**: Add peak width and score-based features

In [None]:
print("Step 4: BED → CSV Conversion")
print("-" * 40)

# Create sample BED file with ChIP-seq peaks and regulatory annotations
bed_content = """chr1\t1000\t2000\tenhancer_peak_1\t150\t+\tenhancer\tH3K27ac\tliver_specific
chr1\t3000\t3500\tpromoter_peak_1\t200\t+\tpromoter\tH3K4me3\thousekeeping
chr2\t5000\t5300\tenhancer_peak_2\t120\t-\tenhancer\tH3K4me1\ttissue_specific
chr3\t8000\t8200\tpromoter_peak_2\t180\t+\tpromoter\tH3K4me3\tinducible
chr3\t9000\t9400\tsilencer_peak_1\t95\t-\tsilencer\tH3K27me3\trepressive
chr4\t12000\t12600\tinsulator_peak_1\t110\t+\tinsulator\tCTCF\tboundary
chrX\t15000\t15400\tenhancer_peak_3\t140\t-\tenhancer\tH3K27ac\tsex_specific
chr2\t20000\t20250\tbackground_1\t60\t.\tbackground\tlow_signal\tcontrol
"""

# Save sample BED file
bed_path = 'conversion_examples/sample_peaks.bed'
with open(bed_path, 'w') as f:
    f.write(bed_content)
print(f"✅ Created sample BED file: {bed_path}")

def parse_bed_to_csv(bed_file, output_csv, min_score=50):
    """
    Convert BED format peaks to CSV with regulatory annotations.

    Peaks scoring below `min_score` are dropped. Score normalization falls
    back to 0.5 when every peak has the same score, avoiding division by zero.
    """
    data = []

    with open(bed_file, 'r') as f:
        for line in f:
            fields = line.strip().split('\t')
            if len(fields) >= 3:
                chromosome = fields[0]
                try:
                    start = int(fields[1]); end = int(fields[2])
                except ValueError:
                    continue  # skip malformed
                peak_name = fields[3] if len(fields) > 3 else f"peak_{start}_{end}"
                try:
                    score = int(fields[4]) if len(fields) > 4 else 0
                except ValueError:
                    score = 0
                strand = fields[5] if len(fields) > 5 else '.'
                regulatory_type = fields[6] if len(fields) > 6 else 'unknown'
                histone_mark = fields[7] if len(fields) > 7 else 'no_mark'
                functional_class = fields[8] if len(fields) > 8 else 'unclassified'

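                if score >= min_score:
                    # BED intervals are 0-based and end-exclusive, so width = end - start
                    peak_width = max(1, end - start)
                    # Mock sequences: base composition depends on the regulatory type
                    # (illustrative only, not a biological model)
                    if regulatory_type == 'promoter':
                        # GC-biased pool (G/C drawn 60% of the time)
                        seq_pattern = ['G', 'C'] * 3 + ['A', 'T'] * 2
                        if np.random.random() > 0.4:
                            # About 60% of mock promoters end with a TATA-box-like motif
                            tata_seq = list("TATAAA")
                            body_len = max(0, peak_width - 6)
                            mock_sequence = ''.join(np.random.choice(seq_pattern, body_len)) + ''.join(tata_seq)
                        else:
                            mock_sequence = ''.join(np.random.choice(seq_pattern, peak_width))
                    elif regulatory_type == 'enhancer':
                        # Uniform base composition
                        seq_pattern = ['A', 'T', 'G', 'C'] * 2
                        mock_sequence = ''.join(np.random.choice(seq_pattern, peak_width))
                    elif regulatory_type == 'silencer':
                        # AT-biased pool (A/T drawn 75% of the time)
                        seq_pattern = ['A', 'T'] * 3 + ['G', 'C']
                        mock_sequence = ''.join(np.random.choice(seq_pattern, peak_width))
                    else:
                        mock_sequence = ''.join(np.random.choice(['A', 'T', 'C', 'G'], peak_width))

                    # Cap mock sequences at 500 bases; peak_width keeps the full interval width
                    mock_sequence = mock_sequence[:500]
                    # Binary label: any annotated regulatory element counts as positive
                    active_elements = ['promoter', 'enhancer', 'silencer', 'insulator']
                    label = 1 if regulatory_type in active_elements else 0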

                    data.append({
                        'sequence': mock_sequence,
                        'label': label,
                        'id': peak_name,
                        'chromosome': chromosome,
                        'start_position': start,
                        'end_position': end,
                        'peak_width': peak_width,
                        'peak_score': score,
                        'strand': strand,
                        'regulatory_type': regulatory_type,
                        'histone_mark': histone_mark,
                        'functional_class': functional_class,
                        'split': 'train'
                    })

    df = pd.DataFrame(data)
    if len(df) == 0:
        print("⚠️ No peaks passed filters; empty DataFrame returned.")
        df.to_csv(output_csv, index=False)
        return df

    df['sequence_length'] = df['sequence'].apply(len)
    df['gc_content'] = df['sequence'].apply(lambda seq: (seq.count('G') + seq.count('C')) / len(seq) if len(seq) > 0 else 0)

    score_range = df['peak_score'].max() - df['peak_score'].min()
    if score_range > 0:
        df['normalized_score'] = (df['peak_score'] - df['peak_score'].min()) / score_range
    else:
        df['normalized_score'] = 0.5  # uniform fallback

    df.to_csv(output_csv, index=False)
    return df

# Execute BED conversion
bed_csv_path = 'conversion_examples/bed_converted.csv'
bed_df = parse_bed_to_csv(bed_path, bed_csv_path, min_score=60)

print(f"✅ BED conversion completed: {len(bed_df)} regulatory peaks processed")
print(f"📊 Regulatory element classification:")
print(f"   - Active regulatory elements: {bed_df['label'].sum()}")
print(f"   - Background regions: {len(bed_df) - bed_df['label'].sum()}")
print(f"📋 Regulatory type distribution:")
for rt, cnt in bed_df['regulatory_type'].value_counts().items():
    print(f"   - {rt}: {cnt} peaks")
print(f"📁 Output saved: {bed_csv_path}")
print("\n📋 Sample regulatory elements:")
print(bed_df[['id', 'regulatory_type', 'peak_score', 'gc_content', 'label']].head())

#### 🔎 Label Meaning (BED)
`label = 1` → regulatory_type in {promoter, enhancer, silencer, insulator}; `0` → background / other.
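
If the downstream task needs to tell element types apart, a separate multi-class column can sit alongside the binary `label` instead of replacing it. The sketch below is one hypothetical encoding: the class order, the `label_multiclass` column name, and the decision to map unrecognized types to `background` are all assumptions.

```python
import pandas as pd

REGULATORY_CLASSES = ["background", "promoter", "enhancer", "silencer", "insulator"]
CLASS_TO_INDEX = {name: i for i, name in enumerate(REGULATORY_CLASSES)}

def add_multiclass_label(df, type_col="regulatory_type", unknown_class="background"):
    """Add an integer class column while leaving the binary `label` untouched."""
    out = df.copy()
    out["label_multiclass"] = (
        out[type_col]
        .map(CLASS_TO_INDEX)                    # unrecognized types become NaN
        .fillna(CLASS_TO_INDEX[unknown_class])  # assumption: treat them as background
        .astype(int)
    )
    return out

demo = pd.DataFrame({
    "regulatory_type": ["promoter", "enhancer", "silencer", "background", "unknown"],
    "label": [1, 1, 1, 0, 0],
})
print(add_multiclass_label(demo))
```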

> In real studies, silencers and insulators may be modeled separately; avoid collapsing if downstream task differs.

## JSON Format Conversion

JSON format is increasingly common for complex experimental data with nested metadata, multi-omics datasets, and high-throughput screening results. 

OmniGenBench can already load JSON data directly, but understanding how to convert and structure your JSON files is still important for effective dataset creation.

### 📋 JSON Structure Example
```json
{
  "experiment_id": "exp_001",
  "sequence": "ATCGATCG...",
  "measurements": {
    "expression": 8.5,
    "binding_affinity": 0.75
  },
  "metadata": {
    "tissue": "liver", 
    "condition": "treatment"
  }
}
```

### 🎯 Conversion Strategy
1. **Flatten nested structures**: Convert hierarchical data to flat columns
2. **Handle arrays**: Process list-type experimental measurements
3. **Type conversion**: Ensure proper data types for ML compatibility
4. **Missing data**: Implement robust handling of incomplete records
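
As a compact illustration of the first two steps, pandas' `json_normalize` can flatten nested dictionaries in one call, while list-valued fields need an explicit summary. This is a sketch on toy records; the `replicate_values` field and the derived column names are hypothetical and do not appear in the sample data used in this tutorial.

```python
import pandas as pd

records = [
    {
        "experiment_id": "exp_001",
        "sequence": "ATCGATCG",
        "measurements": {"expression": 8.5, "binding_affinity": 0.75},
        "metadata": {"tissue": "liver", "condition": "treatment"},
        "replicate_values": [8.1, 8.9],  # hypothetical list-type measurement
    },
    {
        "experiment_id": "exp_002",
        "sequence": "GGCCGGCC",
        "measurements": {"expression": 3.2},  # binding_affinity missing
        "metadata": {"tissue": "brain"},      # condition missing
        "replicate_values": [3.0, 3.4, 3.2],
    },
]

# Nested dicts become columns such as measurements_expression;
# keys absent from a record are filled with NaN
flat = pd.json_normalize(records, sep="_")

# Lists are kept as Python objects, so reduce them to scalar summaries
flat["replicate_mean"] = flat["replicate_values"].apply(
    lambda values: sum(values) / len(values) if values else float("nan")
)
flat["replicate_count"] = flat["replicate_values"].apply(len)
flat = flat.drop(columns=["replicate_values"])

print(sorted(flat.columns))
```

The hand-written flattening used in this section gives finer control over column prefixes; `json_normalize` is handy when the nesting is regular and the key names are acceptable as they are.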

In [None]:
print("Step 5: JSON → CSV Conversion")
print("-" * 40)

# Create sample JSON data representing multi-omics experimental results
experimental_data = [
    {
        "experiment_id": "EXPR_001",
        "sequence": "ATCGATCGATCGTACGAATTCCGGAAATTTCCCGGGAAATTTGGGCCCAAATTTAAAGGG",
        "measurements": {
            "rna_expression": 8.5,
            "protein_abundance": 12.3,
            "binding_affinity": 0.82,
            "stability_score": 7.1
        },
        "experimental_conditions": {
            "tissue_type": "liver",
            "treatment": "control",
            "timepoint": "24h",
            "replicate": 1
        },
        "sequence_features": {
            "gc_content": 0.45,
            "length": 60,
            "conservation_score": 0.85,
            "secondary_structure_energy": -15.2
        },
        "quality_metrics": {
            "sequencing_depth": 1250,
            "mapping_quality": 0.95,
            "technical_noise": 0.08
        }
    },
    {
        "experiment_id": "EXPR_002", 
        "sequence": "GCGCGCGCATATATATATGCGCGCGCATATATATATGCGCGCGCATATATATAT",
        "measurements": {
            "rna_expression": 15.7,
            "protein_abundance": 8.9,
            "binding_affinity": 0.91,
            "stability_score": 9.2
        },
        "experimental_conditions": {
            "tissue_type": "brain",
            "treatment": "drug_A",
            "timepoint": "48h",
            "replicate": 1
        },
        "sequence_features": {
            "gc_content": 0.44,
            "length": 54,
            "conservation_score": 0.92,
            "secondary_structure_energy": -22.1
        },
        "quality_metrics": {
            "sequencing_depth": 2100,
            "mapping_quality": 0.98,
            "technical_noise": 0.04
        }
    },
    {
        "experiment_id": "EXPR_003",
        "sequence": "AAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTTCCCCGGGGAAAATTTT",
        "measurements": {
            "rna_expression": 2.8,
            "protein_abundance": 3.1,
            "binding_affinity": 0.45,
            "stability_score": 4.2
        },
        "experimental_conditions": {
            "tissue_type": "liver", 
            "treatment": "drug_B",
            "timepoint": "72h",
            "replicate": 2
        },
        "sequence_features": {
            "gc_content": 0.43,
            "length": 56,
            "conservation_score": 0.45,
            "secondary_structure_energy": -8.3
        },
        "quality_metrics": {
            "sequencing_depth": 890,
            "mapping_quality": 0.89,
            "technical_noise": 0.12
        }
    },
    {
        "experiment_id": "EXPR_004",
        "sequence": "TATAAAAAGCGCGCGCCCCCGGGGAAAATTTTTCCCGGGAAATTTAGCTAGCTAG",
        "measurements": {
            "rna_expression": 11.2,
            "protein_abundance": 14.6,
            "binding_affinity": 0.78,
            "stability_score": 8.9
        },
        "experimental_conditions": {
            "tissue_type": "heart",
            "treatment": "control",
            "timepoint": "12h",
            "replicate": 1
        },
        "sequence_features": {
            "gc_content": 0.49,
            "length": 55,
            "conservation_score": 0.76,
            "secondary_structure_energy": -18.7
        },
        "quality_metrics": {
            "sequencing_depth": 1780,
            "mapping_quality": 0.94,
            "technical_noise": 0.06
        }
    }
]

# Save sample JSON data
json_path = 'conversion_examples/experimental_data.json'
with open(json_path, 'w') as f:
    json.dump(experimental_data, f, indent=2)
print(f"✅ Created sample JSON file: {json_path}")

def parse_json_to_csv(json_file, output_csv, expression_threshold=7.0):
    """
    Convert JSON experimental data to CSV format with flattened structure.
    
    Parameters:
    -----------
    json_file : str
        Path to input JSON file
    output_csv : str
        Path for output CSV file
    expression_threshold : float
        Threshold for binary classification of expression levels
    
    Returns:
    --------
    pd.DataFrame
        Processed dataset with flattened experimental data
    """
    # Load JSON data
    with open(json_file, 'r') as f:
        data = json.load(f)
    
    flattened_records = []
    
    for record in data:
        # Start with core fields
        flat_record = {
            'sequence': record.get('sequence', ''),
            'id': record.get('experiment_id', 'unknown_id')
        }
        
        # Flatten measurements (main experimental outcomes)
        if 'measurements' in record:
            for key, value in record['measurements'].items():
                flat_record[f'measurement_{key}'] = value
        
        # Flatten experimental conditions
        if 'experimental_conditions' in record:
            for key, value in record['experimental_conditions'].items():
                flat_record[f'condition_{key}'] = value
        
        # Flatten sequence features
        if 'sequence_features' in record:
            for key, value in record['sequence_features'].items():
                flat_record[f'feature_{key}'] = value
        
        # Flatten quality metrics
        if 'quality_metrics' in record:
            for key, value in record['quality_metrics'].items():
                flat_record[f'quality_{key}'] = value
        
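        # Create classification labels based on RNA expression.
        # Records with no expression value are labeled 0 (low); drop them
        # beforehand if you would rather exclude unmeasured samples.
        rna_expr = flat_record.get('measurement_rna_expression')
        flat_record['label'] = 1 if rna_expr is not None and rna_expr >= expression_threshold else 0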
        
        # Add data split assignment
        flat_record['split'] = 'train'
        
        flattened_records.append(flat_record)
    
    # Create DataFrame
    df = pd.DataFrame(flattened_records)
    
    # Add derived features for ML compatibility
    df['sequence_length'] = df['sequence'].apply(len)
    
    # Calculate composite scores
    if 'measurement_rna_expression' in df.columns and 'measurement_protein_abundance' in df.columns:
        df['expression_protein_ratio'] = df['measurement_rna_expression'] / (df['measurement_protein_abundance'] + 1e-6)
    
    if 'measurement_binding_affinity' in df.columns and 'measurement_stability_score' in df.columns:
        df['functional_score'] = (df['measurement_binding_affinity'] * df['measurement_stability_score']) / 10
    
    # Handle missing values
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
    
    categorical_columns = df.select_dtypes(include=['object']).columns
    for col in categorical_columns:
        if col not in ['sequence', 'id', 'split']:
            df[col] = df[col].fillna('unknown')
    
    # Save to CSV
    df.to_csv(output_csv, index=False)
    return df

# Execute JSON conversion
json_csv_path = 'conversion_examples/json_converted.csv'
json_df = parse_json_to_csv(json_path, json_csv_path, expression_threshold=7.0)

print(f"✅ JSON conversion completed: {len(json_df)} experimental records processed")
print(f"📊 Expression classification (threshold ≥ 7.0):")
print(f"   - High expression: {json_df['label'].sum()}")
print(f"   - Low expression: {len(json_df) - json_df['label'].sum()}")

# Show column structure after flattening
print(f"\n📋 Flattened data structure ({len(json_df.columns)} columns):")
column_groups = {}
for col in json_df.columns:
    if col.startswith('measurement_'):
        column_groups.setdefault('Measurements', []).append(col)
    elif col.startswith('condition_'):
        column_groups.setdefault('Conditions', []).append(col)
    elif col.startswith('feature_'):
        column_groups.setdefault('Features', []).append(col)
    elif col.startswith('quality_'):
        column_groups.setdefault('Quality', []).append(col)
    else:
        column_groups.setdefault('Core', []).append(col)

for group, cols in column_groups.items():
    print(f"   - {group}: {len(cols)} columns")

print(f"📁 Output saved: {json_csv_path}")

# Display sample results (only columns that exist after flattening)
print("\n📋 Sample flattened experimental data:")
display_cols = [c for c in ['id', 'measurement_rna_expression', 'condition_tissue_type', 'feature_gc_content', 'label'] if c in json_df.columns]
print(json_df[display_cols].head(3))

#### 🔎 Label Meaning (JSON)
`label = 1` → RNA expression ≥ configurable threshold (default 7.0); `0` otherwise.

> Prefer percentile-based thresholds (e.g., top 30%) for heterogeneous datasets instead of fixed numeric cutoffs.
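
The percentile-based alternative can be sketched as follows. This is a minimal illustration rather than part of the conversion function above: the helper name `label_by_percentile` and the toy values are assumptions, and the column name simply mirrors the flattened `measurement_rna_expression` field.

```python
import pandas as pd

def label_by_percentile(df, value_col="measurement_rna_expression", top_fraction=0.30):
    """Label the top `top_fraction` of rows (ranked by `value_col`) as 1, the rest as 0.

    Hypothetical helper for illustration; not part of OmniGenBench.
    """
    if not 0 < top_fraction < 1:
        raise ValueError("top_fraction must be between 0 and 1 (exclusive)")
    # The cutoff is the (1 - top_fraction) quantile of the observed values; NaNs are ignored.
    cutoff = df[value_col].quantile(1 - top_fraction)
    labeled = df.copy()
    # Rows with missing values compare as False and therefore receive label 0.
    labeled["label"] = (labeled[value_col] >= cutoff).astype(int)
    return labeled, cutoff

# Toy data (made up for the example)
toy = pd.DataFrame({"measurement_rna_expression": [float(v) for v in range(1, 11)]})
toy_labeled, toy_cutoff = label_by_percentile(toy, top_fraction=0.30)
print(f"cutoff={toy_cutoff:.2f}, positives={toy_labeled['label'].sum()}")
```

Because the cutoff is derived from the data, compute it on the training portion only if the labels will later be used for evaluation; otherwise the threshold itself leaks information from the test set.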

## Multi-Source Data Integration and Quality Control

Now that we have converted data from multiple formats, let's integrate them into a unified dataset and perform comprehensive quality control checks.

### 🎯 Integration Objectives
1. **Standardize Schema**: Align column names and data types across sources
2. **Quality Filtering**: Remove low-quality or problematic sequences
3. **Dataset Balancing**: Ensure appropriate class distribution
4. **Split Strategy**: Implement biologically meaningful train/validation/test splits
5. **Validation Pipeline**: Comprehensive data integrity checks

### 🔍 Quality Control Metrics
- **Sequence validity**: Check for non-standard nucleotides
- **Length distribution**: Identify outliers and truncation needs
- **Label balance**: Assess class distribution for ML compatibility
- **Missing data**: Quantify and handle incomplete records
- **Biological plausibility**: Validate GC content and other features

In [None]:
print("Step 6: Multi-Source Data Integration & Quality Control")
print("=" * 55)

# Optional harmonization switches
ENABLE_LABEL_ORIGIN = True
ADD_UNIFIED_FUNCTIONAL_LABEL = True  # demonstration: create 'functional_activity' meta label

FUNCTIONAL_POSITIVE_SOURCES = {
    'FASTA': lambda row: row['label'] == 1,  # promoter-like
    'GFF': lambda row: row['label'] == 1,    # protein-coding
    'BED': lambda row: row['label'] == 1,    # active regulatory
    'JSON': lambda row: row['label'] == 1    # high expression
}

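def validate_dna_sequence(sequence):
    """Return True if the sequence is a non-empty string of standard nucleotides (A, T, C, G) or N."""
    # Guard against NaN/None values that pandas produces for missing cells
    if not isinstance(sequence, str) or not sequence:
        return False
    valid_bases = set('ATCGN')
    return all(base in valid_bases for base in sequence.upper())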

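def calculate_sequence_complexity(sequence):
    """Return the Shannon entropy (in bits) of the base composition; at most 2.0 when only A/C/G/T occur."""
    if len(sequence) == 0:
        return 0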
    base_counts = {}
    for base in sequence.upper():
        base_counts[base] = base_counts.get(base, 0) + 1
    length = len(sequence)
    entropy = 0
    for count in base_counts.values():
        if count > 0:
            prob = count / length
            entropy -= prob * np.log2(prob)
    return entropy

def integrate_multi_source_data():
    print("📊 Loading converted datasets...")
    core_columns = ['sequence', 'label', 'id', 'split']
    integrated_data = []
    datasets_info = [
        ('conversion_examples/fasta_converted.csv', 'FASTA'),
        ('conversion_examples/gff_converted.csv', 'GFF'),
        ('conversion_examples/bed_converted.csv', 'BED'),
        ('conversion_examples/json_converted.csv', 'JSON')
    ]
    for file_path, source_type in datasets_info:
        if os.path.exists(file_path):
            try:
                df = pd.read_csv(file_path)
                missing_cols = [c for c in core_columns if c not in df.columns]
                if missing_cols:
                    print(f"⚠️ {source_type}: Missing {missing_cols}, skipped")
                    continue
                standardized_df = df[core_columns].copy()
                standardized_df['source'] = source_type
                if 'gc_content' not in df.columns:
                    # Upper-case first so soft-masked (lowercase) bases are counted as well
                    standardized_df['gc_content'] = df['sequence'].apply(
                        lambda s: (s.upper().count('G') + s.upper().count('C')) / len(s) if len(s) > 0 else 0
                    )
                else:
                    standardized_df['gc_content'] = df['gc_content']
                integrated_data.append(standardized_df)
                print(f"   ✅ {source_type}: {len(standardized_df)} records")
            except Exception as e:
                print(f"   ❌ {source_type}: load error {e}")
        else:
            print(f"   ⚠️ {source_type}: file not found")
    if not integrated_data:
        raise ValueError("No datasets loaded")
    combined_df = pd.concat(integrated_data, ignore_index=True)
    if ENABLE_LABEL_ORIGIN:
        combined_df['label_origin'] = combined_df['source']
    if ADD_UNIFIED_FUNCTIONAL_LABEL:
        # Meta functional activity: any positive biological evidence across sources
        never_positive = lambda row: False  # fallback for sources without a rule
        combined_df['functional_activity'] = combined_df.apply(
            lambda r: int(FUNCTIONAL_POSITIVE_SOURCES.get(r['source'], never_positive)(r)),
            axis=1
        )
    print(f"\n📋 Initial integration: {len(combined_df)} records")
    return combined_df

def perform_quality_control(df, min_length=10, max_length=1000, min_complexity=0.5, gc_range=(0.1, 0.9)):
    """Apply sequential QC filters and return (filtered_df, qc_report)."""
    print("🔍 Performing Quality Control Checks...")
    initial_count = len(df)
    qc_report = {'initial_count': initial_count}

    # 1. Keep only sequences made of valid nucleotides
    df_valid = df[df['sequence'].apply(validate_dna_sequence)].copy()
    qc_report['invalid_sequences'] = initial_count - len(df_valid)

    # 2. Length filter
    df_valid['seq_length'] = df_valid['sequence'].apply(len)
    length_mask = (df_valid['seq_length'] >= min_length) & (df_valid['seq_length'] <= max_length)
    df_length = df_valid[length_mask].copy()
    qc_report['length_filtered'] = len(df_valid) - len(df_length)

    # 3. Complexity filter (Shannon entropy of base composition)
    df_length['complexity'] = df_length['sequence'].apply(calculate_sequence_complexity)
    df_comp = df_length[df_length['complexity'] >= min_complexity].copy()
    qc_report['low_complexity'] = len(df_length) - len(df_comp)

    # 4. GC-content filter
    gc_mask = (df_comp['gc_content'] >= gc_range[0]) & (df_comp['gc_content'] <= gc_range[1])
    df_gc = df_comp[gc_mask].copy()
    qc_report['gc_outliers'] = len(df_comp) - len(df_gc)

    # 5. Remove duplicates within each source (duplicates across sources are kept)
    df_unique = df_gc.drop_duplicates(subset=['sequence', 'source']).copy()
    qc_report['duplicates'] = len(df_gc) - len(df_unique)

    qc_report['final_count'] = len(df_unique)
    # Avoid division by zero when the input frame is empty
    qc_report['retention_rate'] = qc_report['final_count'] / initial_count if initial_count else 0.0
    print(f"   Retained {qc_report['final_count']} / {initial_count} ({qc_report['retention_rate']:.1%})")
    return df_unique, qc_report

def create_balanced_splits(df, test_size=0.15, valid_size=0.15, stratify_column='label', random_state=42):
    """Assign train/valid/test splits per class, preserving class proportions (stratified, not resampled)."""
    print(f"📊 Creating balanced data splits (stratified by {stratify_column})...")
    np.random.seed(random_state)
    classes = df[stratify_column].unique()
    split_indices = {'train': [], 'valid': [], 'test': []}
    for c in classes:
        idx = df[df[stratify_column] == c].index.tolist()
        np.random.shuffle(idx)
        n = len(idx)
        # int() truncates, so very small classes may contribute no valid/test samples
        n_test = int(n * test_size)
        n_valid = int(n * valid_size)
        n_train = n - n_test - n_valid
        split_indices['test'].extend(idx[:n_test])
        split_indices['valid'].extend(idx[n_test:n_test + n_valid])
        split_indices['train'].extend(idx[n_test + n_valid:])
        print(f"   Class {c}: {n_train} train / {n_valid} valid / {n_test} test")
    df_split = df.copy()
    for name, inds in split_indices.items():
        df_split.loc[inds, 'split'] = name
    dist = df_split['split'].value_counts()
    for part in ['train', 'valid', 'test']:
        print(f"   {part}: {dist.get(part, 0)} samples")
    return df_split

try:
    integrated_df = integrate_multi_source_data()
    qc_df, qc_report = perform_quality_control(integrated_df, min_length=20, max_length=500, min_complexity=0.8, gc_range=(0.2,0.8))
    # Choose which label to stratify: original 'label' OR new 'functional_activity'
    stratify_col = 'functional_activity' if ADD_UNIFIED_FUNCTIONAL_LABEL else 'label'
    final_df = create_balanced_splits(qc_df, test_size=0.15, valid_size=0.15, stratify_column=stratify_col)
    output_path = 'conversion_examples/integrated_multi_source_dataset.csv'
    final_df.to_csv(output_path, index=False)
    print("\n🎉 Integration Complete!")
    print(f"📁 Integrated dataset saved: {output_path}")
    print(f"📊 Final dataset: {len(final_df):,} sequences")
    if 'functional_activity' in final_df.columns:
        print(f"🧪 Functional Activity Distribution: {final_df['functional_activity'].value_counts().to_dict()}")
    print(f"🏷️ Label distribution: {final_df['label'].value_counts().to_dict()}")
    print(f"📋 Source distribution: {final_df['source'].value_counts().to_dict()}")
    print("\n📋 Sample integrated data:")
    sample_cols = [c for c in ['id','source','seq_length','gc_content','complexity','label','functional_activity','split'] if c in final_df.columns]
    print(final_df[sample_cols].head())
except Exception as e:
    print(f"❌ Integration failed: {e}")
    print("   Ensure previous steps completed.")

## Best Practices and Professional Guidelines

This section consolidates the essential best practices for biological data conversion and provides professional guidelines for research-grade dataset creation.

### ✅ Data Conversion Success Factors

| Factor | Importance | Implementation Guidelines |
|--------|------------|--------------------------|
| **🔑 Standardized Schema** | ⭐⭐⭐⭐⭐ | Always use `sequence`, `label`, `id`, `split` as core columns |
| **📊 Quality Control** | ⭐⭐⭐⭐⭐ | Validate sequence format, length distribution, missing values |
| **🔄 Data Validation** | ⭐⭐⭐⭐ | Check sequence validity (ATCG only), label ranges, biological plausibility |
| **📈 Class Balance** | ⭐⭐⭐⭐ | Ensure reasonable positive/negative ratios, avoid severe imbalance |
| **🗂️ Metadata Preservation** | ⭐⭐⭐ | Retain biological annotations for downstream analysis and interpretation |
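
To make the class-balance row concrete, the sketch below reports class counts, the majority-to-minority ratio, and inverse-frequency class weights. The helper name `class_balance_report` is an assumption for illustration. The weighting formula, `n_samples / (n_classes * class_count)`, is the one scikit-learn documents for `class_weight='balanced'`. What counts as "severe" imbalance depends on the task and is left to the reader.

```python
from collections import Counter

def class_balance_report(labels):
    """Summarize class counts, the majority/minority ratio, and inverse-frequency weights.

    Hypothetical helper for illustration; not part of OmniGenBench.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    n_samples = sum(counts.values())
    n_classes = len(counts)
    # Inverse-frequency weighting: rarer classes receive larger weights
    weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return {"counts": dict(counts), "imbalance_ratio": imbalance_ratio, "weights": weights}

# Toy labels (made up for the example)
report = class_balance_report([0, 0, 0, 1])
print(report)
```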

### ⚠️ Common Pitfalls and Solutions

#### 🧬 Sequence Format Issues
- **❌ Problem**: Non-standard nucleotides (N, R, Y, etc.) causing tokenization errors
- **✅ Solution**: Filter ambiguous bases or use specialized tokenizers, document handling approach
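
One way to apply and document such a filter is sketched below, assuming records are plain dictionaries with a `sequence` key. The helper names and the `max_fraction` parameter are illustrative, not an OmniGenBench API. Sequences containing characters outside the IUPAC nucleotide alphabet are always dropped.

```python
STANDARD_BASES = set("ACGT")
IUPAC_AMBIGUITY_CODES = set("RYSWKMBDHVN")

def ambiguous_fraction(sequence):
    """Return the fraction of IUPAC ambiguity codes in a sequence (0.0 for empty input)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(base in IUPAC_AMBIGUITY_CODES for base in seq) / len(seq)

def filter_ambiguous(records, max_fraction=0.0):
    """Split records into (kept, dropped) by their share of ambiguous bases."""
    kept, dropped = [], []
    for rec in records:
        seq = rec["sequence"].upper()
        # Anything that is neither a standard base nor an IUPAC code is invalid
        has_invalid = any(
            b not in STANDARD_BASES and b not in IUPAC_AMBIGUITY_CODES for b in seq
        )
        if has_invalid or ambiguous_fraction(seq) > max_fraction:
            dropped.append(rec)
        else:
            kept.append(rec)
    return kept, dropped

# Toy records (made up for the example)
example_records = [
    {"id": "a", "sequence": "ACGT"},
    {"id": "b", "sequence": "ACNT"},
    {"id": "c", "sequence": "ACXT"},
]
strict_kept, strict_dropped = filter_ambiguous(example_records)
lenient_kept, _ = filter_ambiguous(example_records, max_fraction=0.25)
print([r["id"] for r in strict_kept], [r["id"] for r in lenient_kept])
```

Reporting the number of dropped records alongside the chosen `max_fraction` is a simple way to document the handling approach.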

#### 📏 Length Inconsistency  
- **❌ Problem**: Extreme length variation causing training instability
- **✅ Solution**: Implement length normalization, truncation, or dynamic padding strategies
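
As a minimal sketch of the truncation option, the helpers below summarize the length distribution and crop over-long sequences around their center. Center cropping is an assumption made for illustration. If the informative region sits at a known position (for example, near a transcription start site), anchoring the crop there may be more appropriate. Padding is usually handled later by the tokenizer or data collator, so it is not shown.

```python
import statistics

def summarize_lengths(sequences):
    """Return min, median, and max sequence length to help choose a truncation limit."""
    lengths = sorted(len(s) for s in sequences)
    if not lengths:
        raise ValueError("sequences must not be empty")
    return {"min": lengths[0], "median": statistics.median(lengths), "max": lengths[-1]}

def center_crop(sequence, max_length):
    """Keep the central `max_length` bases; shorter sequences are returned unchanged."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if len(sequence) <= max_length:
        return sequence
    start = (len(sequence) - max_length) // 2
    return sequence[start:start + max_length]

# Toy sequences (made up for the example)
length_summary = summarize_lengths(["A", "ACG", "ACGTACGT"])
cropped = center_crop("AACCGGTT", 4)
print(length_summary, cropped)
```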

#### 🏷️ Label Encoding Errors
- **❌ Problem**: String labels ("positive"/"negative") not converted to numeric format
- **✅ Solution**: Explicit label mapping with clear documentation (0/1 for binary, 0/1/2... for multi-class)
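
An explicit mapping might look like the sketch below. The `LABEL_MAP` contents and the `encode_labels` helper are illustrative assumptions. The point is that unknown labels raise an error instead of being silently coerced.

```python
# Documented, explicit mapping from raw string labels to integers (illustrative values)
LABEL_MAP = {"negative": 0, "positive": 1}

def encode_labels(raw_labels, label_map=LABEL_MAP):
    """Convert string labels to integers, failing loudly on anything unmapped."""
    encoded = []
    for raw in raw_labels:
        key = str(raw).strip().lower()  # tolerate case and stray whitespace
        if key not in label_map:
            raise ValueError(f"Unmapped label: {raw!r}")
        encoded.append(label_map[key])
    return encoded

encoded_example = encode_labels(["positive", "Negative ", "positive"])
print(encoded_example)
```

Storing the mapping next to the dataset (for example, in its README or metadata file) keeps the encoding reproducible.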

#### 📊 Data Leakage Risks
- **❌ Problem**: Related sequences (same gene, homologs) split across train/test sets
- **✅ Solution**: Gene/protein-level splitting rather than random sequence splitting
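
A group-level split can be sketched as follows, assuming each record carries a grouping key such as a gene identifier. The `gene_id` field and the `split_by_group` helper are hypothetical. All records that share a group land in the same split, so related sequences cannot straddle train and test.

```python
import random
from collections import defaultdict

def split_by_group(records, group_key="gene_id", test_size=0.15, valid_size=0.15, seed=42):
    """Assign every record of a group to the same split (train/valid/test)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)

    # Shuffle group identifiers, not individual records
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)

    n_test = int(len(group_ids) * test_size)
    n_valid = int(len(group_ids) * valid_size)
    assignment = {}
    for i, gid in enumerate(group_ids):
        if i < n_test:
            assignment[gid] = "test"
        elif i < n_test + n_valid:
            assignment[gid] = "valid"
        else:
            assignment[gid] = "train"

    # Return copies so the input records are not modified
    return [dict(rec, split=assignment[rec[group_key]]) for rec in records]

# Toy records: 20 hypothetical genes with 3 sequences each
toy_records = [
    {"id": f"g{g}_s{s}", "gene_id": f"gene{g}", "sequence": "ACGT"}
    for g in range(20) for s in range(3)
]
grouped_split = split_by_group(toy_records)
print({s: sum(r["split"] == s for r in grouped_split) for s in ("train", "valid", "test")})
```

Split sizes are proportional to the number of groups, not the number of sequences, so uneven group sizes shift the final ratios. scikit-learn's `GroupShuffleSplit` provides a comparable group-aware split for array-style data.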

#### 🔍 Insufficient Quality Control
- **❌ Problem**: Low-quality, duplicate, or biologically implausible sequences
- **✅ Solution**: Multi-step QC pipeline with sequence validation, complexity analysis, duplicate removal

### 🚀 Advanced Dataset Enhancement Techniques

#### 📊 Biological Feature Engineering
```python
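# Example: Advanced sequence features for ML enhancement
def calculate_biological_features(sequence):
    """Calculate composition-based features for a DNA sequence."""
    seq = sequence.upper()
    length = len(seq)
    if length == 0:
        raise ValueError("Cannot compute features for an empty sequence")

    return {
        'gc_content': (seq.count('G') + seq.count('C')) / length,
        'purine_content': (seq.count('A') + seq.count('G')) / length,
        'pyrimidine_content': (seq.count('C') + seq.count('T')) / length,
        # Number of distinct overlapping dinucleotides (at most 16 for A/C/G/T)
        'dinucleotide_diversity': len({seq[i:i+2] for i in range(length - 1)}),
        # Non-overlapping occurrences of 'CG'
        'cpg_sites': seq.count('CG'),
        # Highest count of non-overlapping homopolymer triplets (AAA, TTT, CCC, GGG), divided by length
        'repeat_content': max(seq.count(base * 3) for base in 'ATCG') / length
    }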
```

#### 🔄 Data Augmentation Strategies
- **Reverse Complement**: Generate biologically valid sequence variants
- **Sliding Windows**: Create overlapping subsequences for longer genomic regions  
- **Homolog Integration**: Include sequences from related species for robustness
- **Synthetic Generation**: Use generative models for balanced dataset creation
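
The first two strategies can be sketched with the standard library alone. The helper names are illustrative, and the complement table covers only A/C/G/T/N; other IUPAC codes pass through unchanged.

```python
# Translation table for complementing A/C/G/T/N in either case
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]

def sliding_windows(sequence, window, stride):
    """Cut a sequence into overlapping windows of length `window`, stepping by `stride`.

    Sequences no longer than `window` are returned as a single item. A trailing
    remainder shorter than `window` is dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(sequence) <= window:
        return [sequence]
    return [sequence[i:i + window] for i in range(0, len(sequence) - window + 1, stride)]

rc_example = reverse_complement("ATGC")
windows_example = sliding_windows("ACGTACGT", window=4, stride=2)
print(rc_example, windows_example)
```

Augmented variants of a sequence should stay in the same split as the original, so it is safest to augment after the train/validation/test assignment and only within the training set. Reverse-complement augmentation also assumes the label is strand-independent, which does not hold for every task.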

### 📖 Recommended Professional Tools

| Tool Category | Recommended Tools | Use Case |
|---------------|-------------------|----------|
| **🧬 Sequence Processing** | BioPython, pysam, BLAST+ | FASTA/FASTQ parsing, sequence alignment |
| **🔧 Genomic Annotations** | pyranges, pybedtools, HTSeq | BED/GFF processing, interval operations |
| **📊 Data Manipulation** | pandas, polars, dask | Large-scale CSV processing, data integration |
| **🎯 ML Pipeline** | scikit-learn, imbalanced-learn | Data splitting, preprocessing, validation |
| **🧪 Quality Control** | FastQC, MultiQC, custom scripts | Sequence quality assessment, batch processing |

### 🎓 Research Publication Guidelines

When publishing research using converted datasets:

1. **📋 Methods Section**: Document exact conversion parameters, QC thresholds, software versions
2. **📊 Data Availability**: Provide processed datasets and conversion scripts for reproducibility  
3. **🔍 Quality Metrics**: Report retention rates, class distributions, validation results
4. **⚖️ Bias Assessment**: Analyze potential biases from source data integration
5. **🧪 Validation**: Include biological validation or benchmark comparisons
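
A lightweight way to support the first three points is to save a provenance record next to each exported dataset. The sketch below is one possible shape, not a required format: `build_provenance` and its field names are assumptions, and the `qc_report` argument is meant to mirror the dictionary returned by `perform_quality_control`.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone

def build_provenance(dataset_bytes, qc_params, qc_report):
    """Collect parameters, QC results, and a checksum for a processed dataset."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        # The checksum lets readers confirm they have the exact file that was analyzed
        "sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "qc_params": qc_params,
        "qc_report": qc_report,
    }

# Toy inputs (made up for the example)
provenance = build_provenance(
    dataset_bytes=b"sequence,label,id,split\nACGT,1,seq1,train\n",
    qc_params={"min_length": 20, "max_length": 500, "min_complexity": 0.8, "gc_range": [0.2, 0.8]},
    qc_report={"initial_count": 1, "final_count": 1, "retention_rate": 1.0},
)
provenance_json = json.dumps(provenance, indent=2)
print(provenance_json)
```

In practice the bytes would come from reading the saved CSV in binary mode, and the versions of key libraries (for example, `pd.__version__`) can be added to the record in the same way.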

### 💡 Key Takeaways for Success

1. **🎯 Plan Early**: Design conversion strategy before data collection
2. **📋 Document Everything**: Maintain detailed logs of processing steps and decisions
3. **🔄 Validate Continuously**: Implement checkpoints throughout the conversion pipeline
4. **🧬 Think Biologically**: Ensure computational decisions align with biological knowledge
5. **📈 Iterate and Improve**: Refine conversion based on downstream model performance

> **🎉 Congratulations!** You now have the expertise to convert common biological data formats (FASTA, GFF, BED, JSON) into research-grade, ML-ready datasets compatible with OmniGenBench and other genomics frameworks.

### 7.6 (Optional Preview) Minimal VCF Parsing Stub
The stub below illustrates how a lightweight VCF parsing function could be structured. It is left unexecuted so that the tutorial does not imply a dependency on a full reference genome. A future extension will formalize it.

```python
import re
import pandas as pd
from typing import List, Dict

VCF_HEADER_PREFIX = '#'

VCF_INFO_AF_PATTERN = re.compile(r'(^|;)AF=([^;]+)')

def parse_vcf_minimal(vcf_path: str, max_records: int = 1000) -> pd.DataFrame:
    rows: List[Dict] = []
    with open(vcf_path, 'r') as f:
        for line in f:
            if line.startswith(VCF_HEADER_PREFIX):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 8:
                continue
            chrom, pos, vid, ref, alts, qual, flt, info = parts[:8]
            pos_int = int(pos)
            alt_list = alts.split(',')
            # Extract allele frequency (if present)
            m = VCF_INFO_AF_PATTERN.search(info)
            af_values = []
            if m:
                try:
                    af_values = [float(x) for x in m.group(2).split(',') if x.strip()]
                except ValueError:
                    af_values = []
            for i, alt in enumerate(alt_list):
                af = af_values[i] if i < len(af_values) else None
                variant_id = f"{chrom}:{pos}:{ref}:{alt}"
                # Placeholder: no sequence context extraction here
                rows.append({
                    'variant_id': variant_id,
                    'chrom': chrom,
                    'position': pos_int,
                    'ref_base': ref,
                    'alt_base': alt,
                    'allele_frequency': af,
                    # Heuristic binary label example (rare = 1):
                    'label': 1 if (af is not None and af < 0.01) else 0,
                    'source': 'VCF'
                })
                if len(rows) >= max_records:
                    break
            if len(rows) >= max_records:
                break
    return pd.DataFrame(rows)

# Example (commented out):
# vcf_df = parse_vcf_minimal('path/to/example.vcf')
# vcf_df.to_csv('conversion_examples/vcf_converted.csv', index=False)
# print(vcf_df.head())
```

Notes:
- Full integration would add sequence window extraction around POS using a reference genome FASTA.
- Additional INFO fields (e.g., CADD, SIFT, ClinVar pathogenicity) can enrich multi-task labeling.
- Multi-allelic sites are expanded one row per ALT allele.

### 7.5 Upcoming Extension: VCF (Variant Call Format) Integration
(Planned) A future subsection will cover parsing VCF files to derive variant-centric features (REF/ALT allele context, functional annotations from INFO, derived k-mer windows) and integrating them as additional rows or auxiliary tables. This preserves current tutorial scope while signaling forthcoming capability.

Planned minimal columns will include:
- variant_id (chrom:pos:ref:alt)
- sequence (local ±k bp window around the variant)
- ref_base / alt_base
- label (e.g., pathogenic vs benign heuristic or allele frequency derived threshold)
- allele_frequency (parsed from the INFO field, e.g., `AF=`, or taken from an external annotation)
- source = 'VCF'

Harmonization Strategy (to be shown):
1. Parse core fields CHROM, POS, ID, REF, ALT, INFO.
2. Extract AF (allele frequency) or set placeholder if absent.
3. Construct local reference window from a provided reference genome (requires FASTA index; not included in lightweight example).
4. Optional label derivation: AF < 0.01 → rare (potentially functional), else common.
5. Integrate with existing pipeline via the same normalization and QC steps.
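
Step 3 can be sketched without any genome tooling by treating the reference as an in-memory string. This is an illustrative assumption: the `variant_window` helper is hypothetical, and a real pipeline would fetch the region from an indexed reference FASTA instead. The sketch converts the 1-based VCF position to a 0-based index and checks that REF matches the reference before building the windows.

```python
def variant_window(reference, pos, ref, alt, flank=5):
    """Return (ref_window, alt_window) around a variant.

    `pos` is 1-based, as in VCF. `flank` bases are taken on each side where
    available; windows are shorter near the ends of the reference.
    """
    start = pos - 1              # convert to 0-based index
    end = start + len(ref)
    if start < 0 or end > len(reference):
        raise ValueError("Variant lies outside the reference sequence")
    if reference[start:end].upper() != ref.upper():
        raise ValueError("REF allele does not match the reference at POS")
    left = reference[max(0, start - flank):start]
    right = reference[end:end + flank]
    return left + ref + right, left + alt + right

# Toy reference and variant (made up for the example)
toy_reference = "AAACCCGGGTTT"
ref_window, alt_window = variant_window(toy_reference, pos=5, ref="C", alt="T", flank=2)
print(ref_window, alt_window)
```

The REF check catches the most common integration error, a VCF called against a different genome build than the reference being used.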

Rationale: Including VCF expands the tutorial beyond regulatory and expression-centric labeling into variant effect modeling, bridging toward the variant effect prediction tasks already present elsewhere in OmniGenBench.


### 7.4 TSV Export (Optional)
If a tab-delimited copy is required for downstream bioinformatics tools (which often default to TSV):

```python
# Export the final integrated dataset as TSV
final_df.to_csv('conversion_examples/integrated_multi_source_dataset.tsv', sep='\t', index=False)
print('TSV written to conversion_examples/integrated_multi_source_dataset.tsv')
```

Why TSV?
- Reduces ambiguity when free‐text metadata fields may contain commas.
- Many command-line genomics utilities (awk, cut, bedtools wrappers) expect or more easily parse tab separation.
- Whichever delimiter you choose, keep the encoding as UTF-8 and avoid re-saving the file in Excel, which may alter line endings or truncate long sequences.

Verification Tip:
```python
import pandas as pd
check_tsv = pd.read_csv('conversion_examples/integrated_multi_source_dataset.tsv', sep='\t')
print(check_tsv.head(2))
```
