# biometal: ARM-Native Bioinformatics

This notebook demonstrates the key features of biometal:
- **Streaming architecture**: Constant ~5 MB memory
- **ARM NEON acceleration**: 16-25× speedup on ARM processors
- **Network streaming**: Analyze without downloading
- **ML preprocessing**: K-mer extraction for BERT models

In [None]:
import biometal
print(f"biometal version: {biometal.__version__}")

## 1. Streaming FASTQ Files

Process FASTQ files one record at a time with constant memory.
Memory usage remains ~5 MB regardless of file size (even TB-scale files).
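
To make the idea of record-at-a-time processing concrete, here is a minimal pure-Python sketch of a streaming FASTQ reader. It is not biometal's implementation: `FastqRecord`, `iter_fastq`, and `open_fastq` are hypothetical names used only for illustration, and the sketch assumes the common four-lines-per-record FASTQ layout.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """Hypothetical record type for this sketch (not biometal's class)."""
    id: str
    sequence: str
    quality: str


def iter_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record at a time from a four-lines-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\r\n")
        if header[0] != "@" or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # The read ID is the header text up to the first space.
        yield FastqRecord(header[1:].split(" ", 1)[0], sequence, quality)


def open_fastq(path: str) -> TextIO:
    """Open plain or gzip-compressed FASTQ as text, chosen by file extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


# Two in-memory records stand in for a file on disk.
demo = io.StringIO("@read1 sample\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n")
records = list(iter_fastq(demo))
print(records[0].id, records[1].sequence)
```

Because the generator holds only the current record, memory use depends on the longest record rather than on file size. The `list(...)` call above is for the demo only; a real pipeline would iterate directly.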

In [None]:
# Stream FASTQ file (works with .fq, .fastq, .fq.gz, .fastq.gz)
stream = biometal.FastqStream.from_path("../tests/test_data.fq.gz")

# Process records one at a time
for record in stream:
    print(f"Read ID: {record.id}")
    print(f"Sequence: {record.sequence_str}")
    print(f"Quality: {record.quality_str}")
    print()

## 2. ARM NEON-Accelerated Operations

All operations use ARM NEON SIMD, giving a 16-25× speedup over scalar code on ARM platforms.
On x86_64, biometal automatically falls back to scalar code.
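
To see which of the two code paths to expect on the current machine, the host architecture can be inspected with the standard library. This is only a sketch: `is_arm64` is a hypothetical helper, and the actual choice between NEON and scalar code is made inside biometal, not by this check.

```python
import platform


def is_arm64(machine: str) -> bool:
    """Return True for machine strings commonly reported by 64-bit ARM systems."""
    return machine.lower() in {"arm64", "aarch64"}


machine = platform.machine()
path = "NEON" if is_arm64(machine) else "scalar fallback"
print(f"Machine: {machine or 'unknown'} -> expected code path: {path}")
```

Apple Silicon Macs typically report `arm64`, while 64-bit ARM Linux systems report `aarch64`.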

### GC Content (20.3× speedup on ARM)

In [None]:
sequence = b"ATGCATGCGCGCGCGC"
gc = biometal.gc_content(sequence)
print(f"GC content: {gc:.2%}")
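
For cross-checking results on small inputs, a pure-Python reference for GC content is sketched below. `gc_fraction` is a hypothetical helper, not part of biometal, and it assumes that only A, C, G, and T count toward the denominator; biometal's handling of `N` and other ambiguity codes may differ.

```python
def gc_fraction(sequence: bytes) -> float:
    """Fraction of G and C among the A/C/G/T bases of `sequence` (case-insensitive)."""
    upper = sequence.upper()
    gc = upper.count(b"G") + upper.count(b"C")
    at = upper.count(b"A") + upper.count(b"T")
    total = gc + at
    # Assumption: an input with no A/C/G/T bases is reported as 0.0.
    return gc / total if total else 0.0


# 12 of the 16 bases in this sequence are G or C.
print(f"{gc_fraction(b'ATGCATGCGCGCGCGC'):.2%}")
```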

### Base Counting (16.7× speedup on ARM)

In [None]:
counts = biometal.count_bases(sequence)
print(f"Base counts: {counts}")

### Quality Filtering (25.1× speedup on ARM)

In [None]:
quality = b"IIIIIIII"  # Phred+33 encoding
mean_q = biometal.mean_quality(quality)
print(f"Mean quality score: {mean_q:.1f}")

if mean_q > 30.0:
    print("✓ High quality read")
else:
    print("✗ Low quality read")
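
The Phred+33 encoding mentioned in the comment above stores each quality score as an ASCII byte offset by 33, so `I` (ASCII 73) decodes to Q40. The sketch below decodes a quality string by hand; `phred33_scores` and `error_probability` are hypothetical helpers, and it is an assumption that `mean_quality` averages the decoded scores in the same way.

```python
from typing import List


def phred33_scores(quality: bytes) -> List[int]:
    """Decode Phred+33 bytes into integer quality scores (ASCII code minus 33)."""
    return [byte - 33 for byte in quality]


def error_probability(q: float) -> float:
    """Convert a Phred score to an error probability: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


scores = phred33_scores(b"IIIIIIII")
mean_q = sum(scores) / len(scores)
print(f"Q per base: {scores[0]}, mean Q: {mean_q:.1f}")
print(f"Error probability at Q30: {error_probability(30):.4f}")
```

A threshold of Q30 corresponds to an expected error rate of 1 in 1,000 base calls. Note that the arithmetic mean of Phred scores is more optimistic than the score implied by the mean error probability, so a strict filter may prefer the latter.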

## 3. Integrated Workflow

Combine streaming with NEON operations for efficient analysis.

In [None]:
stream = biometal.FastqStream.from_path("../tests/test_data.fq.gz")

# Statistics
total_reads = 0
total_bases = 0
total_gc = 0.0
high_quality_reads = 0

# Process stream
for record in stream:
    seq_bytes = bytes(record.sequence)
    qual_bytes = bytes(record.quality)
    
    # Calculate metrics
    gc = biometal.gc_content(seq_bytes)
    mean_q = biometal.mean_quality(qual_bytes)
    counts = biometal.count_bases(seq_bytes)
    
    # Update statistics
    total_reads += 1
    base_count = sum(counts.values())
    total_bases += base_count
    total_gc += gc * base_count
    
    if mean_q > 30.0:
        high_quality_reads += 1
    
    print(f"{record.id}: {base_count} bp, GC={gc:.2%}, Q={mean_q:.1f}")

# Summary (ratios are skipped for an empty input to avoid dividing by zero)
print("\nSummary:")
print(f"  Total reads: {total_reads}")
print(f"  Total bases: {total_bases:,}")
if total_reads and total_bases:
    print(f"  Average GC: {(total_gc/total_bases):.2%}")
    print(f"  High quality: {high_quality_reads}/{total_reads} ({high_quality_reads/total_reads:.1%})")
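
The workflow above accumulates `gc * base_count` rather than `gc` alone, which yields a length-weighted average. The following sketch, using made-up read lengths and GC fractions, shows how the two averages can differ when read lengths vary.

```python
# Hypothetical (length, GC fraction) pairs for two reads.
reads = [(100, 0.40), (900, 0.60)]

# Unweighted: every read counts equally, whatever its length.
unweighted = sum(gc for _, gc in reads) / len(reads)

# Length-weighted: equivalent to pooling all bases before computing GC.
weighted = sum(length * gc for length, gc in reads) / sum(length for length, _ in reads)

print(f"unweighted={unweighted:.2f} weighted={weighted:.2f}")
```

The weighted value matches the GC content of the pooled bases, which is usually the quantity of interest for a whole sample.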

## 4. K-mer Extraction for ML

Extract k-mers for downstream machine learning tasks (e.g., BERT models).

In [None]:
sequence = b"ATGCATGCATGC"

# Overlapping k-mers
kmers_3 = biometal.extract_kmers(sequence, 3)
print(f"3-mers (overlapping): {kmers_3}")

# Non-overlapping k-mers
kmers_4 = biometal.extract_kmers_non_overlapping(sequence, 4)
print(f"4-mers (non-overlapping): {kmers_4}")
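
The difference between the two extraction modes is the step between windows: one base for overlapping k-mers, `k` bases for non-overlapping ones. A pure-Python sketch follows; the helper names are hypothetical, and both the return type and the choice to drop a trailing partial window are assumptions that may not match biometal.

```python
from typing import List


def kmers_overlapping(sequence: bytes, k: int) -> List[bytes]:
    """All length-k windows, advancing one base at a time."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmers_non_overlapping(sequence: bytes, k: int) -> List[bytes]:
    """Length-k windows advancing k bases at a time; a trailing partial window is dropped."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]


seq = b"ATGCATGCATGC"  # 12 bases
print(len(kmers_overlapping(seq, 3)), kmers_overlapping(seq, 3)[:3])
print(len(kmers_non_overlapping(seq, 4)), kmers_non_overlapping(seq, 4))
```

A sequence of length `n` yields `n - k + 1` overlapping k-mers and `n // k` non-overlapping ones, so the 12-base example gives 10 and 3 respectively.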

### Extract k-mers from FASTQ for ML preprocessing

In [None]:
stream = biometal.FastqStream.from_path("../tests/test_data.fq.gz")

# Collect k-mers from all high-quality reads.
# Note: this list grows with the input, so it suits small files only.
all_kmers = []

for record in stream:
    seq_bytes = bytes(record.sequence)
    qual_bytes = bytes(record.quality)
    
    # Only use high-quality reads
    mean_q = biometal.mean_quality(qual_bytes)
    if mean_q > 30.0:
        kmers = biometal.extract_kmers(seq_bytes, 3)
        all_kmers.extend(kmers)

print(f"Extracted {len(all_kmers)} k-mers from high-quality reads")
print(f"First 10 k-mers: {all_kmers[:10]}")

# Count k-mer frequencies
from collections import Counter
kmer_counts = Counter(all_kmers)
print("\nMost common k-mers:")
for kmer, count in kmer_counts.most_common(5):
    print(f"  {kmer}: {count}")
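
For larger inputs, k-mer frequencies can be accumulated one read at a time instead of first collecting every k-mer in a list. The sketch below uses plain slicing in place of `biometal.extract_kmers` so that it runs on its own; `count_kmers` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def count_kmers(sequences: Iterable[bytes], k: int) -> Counter:
    """Count overlapping k-mers across sequences without storing them all."""
    counts: Counter = Counter()
    for seq in sequences:
        # The generator feeds k-mers to the counter one by one.
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


demo_reads = [b"ATGCATGC", b"GCATGCAT"]
counts = count_kmers(demo_reads, 3)
print(f"{sum(counts.values())} k-mers, {len(counts)} distinct")
```

With this approach, memory scales with the number of distinct k-mers (at most 4^k for an A/C/G/T alphabet) rather than with the total number of k-mers seen.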

## 5. FASTA Streaming

Stream FASTA files (reference genomes, assemblies, etc.) one record at a time.

In [None]:
# Stream FASTA file
stream = biometal.FastaStream.from_path("../tests/test_data.fa.gz")

for record in stream:
    seq_bytes = bytes(record.sequence)
    gc = biometal.gc_content(seq_bytes)
    counts = biometal.count_bases(seq_bytes)
    
    print(f"Sequence: {record.id}")
    print(f"  Length: {len(record.sequence):,} bp")
    print(f"  GC content: {gc:.2%}")
    print(f"  Base composition: {counts}")
    print()
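
Unlike FASTQ, a FASTA record may wrap its sequence across many lines, so a reader has to join lines until the next `>` header. A minimal pure-Python sketch of that logic is shown below; `iter_fasta` is a hypothetical helper and not biometal's parser.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) pairs, joining sequence lines that wrap across rows."""
    seq_id: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the header text up to the first space.
            seq_id = line[1:].split(" ", 1)[0]
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)  # final record


demo = io.StringIO(">chr1 test\nACGT\nGGCC\n>chr2\nTTAA\n")
fasta_records = list(iter_fasta(demo))
print(fasta_records)
```

This sketch buffers one full record at a time, so its memory use follows the longest sequence in the file.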

## 6. Visualization with Pandas

Collect per-read metrics into a pandas DataFrame, then plot them with matplotlib.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Collect data from stream
stream = biometal.FastqStream.from_path("../tests/test_data.fq.gz")

data = []
for record in stream:
    seq_bytes = bytes(record.sequence)
    qual_bytes = bytes(record.quality)
    
    data.append({
        'read_id': record.id,
        'length': len(record.sequence),
        'gc_content': biometal.gc_content(seq_bytes),
        'mean_quality': biometal.mean_quality(qual_bytes),
    })

# Create DataFrame
df = pd.DataFrame(data)
print(df)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

df['gc_content'].plot(kind='bar', ax=axes[0], title='GC Content by Read')
axes[0].set_ylabel('GC Content')
axes[0].set_xlabel('Read')

df['mean_quality'].plot(kind='bar', ax=axes[1], title='Mean Quality by Read', color='orange')
axes[1].set_ylabel('Mean Quality Score')
axes[1].set_xlabel('Read')
axes[1].axhline(y=30, color='r', linestyle='--', label='Q30 threshold')
axes[1].legend()

plt.tight_layout()
plt.show()

## Performance Notes

### ARM NEON Speedups (vs scalar baseline)
- GC content: 20.3× faster
- Base counting: 16.7× faster
- Quality filtering: 25.1× faster

### Memory Usage
- Streaming: Constant ~5 MB (regardless of file size)
- Parallel decompression: ~1 MB bounded
- Total footprint: ~6 MB for TB-scale files
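
One way to check memory figures like these on your own data is to read the process's peak resident set size before and after a streaming loop. The sketch below uses the standard-library `resource` module, which is available on Unix-like systems only; `peak_rss_mb` is a hypothetical helper.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of this process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
# ... run the streaming loop of interest here ...
after = peak_rss_mb()
print(f"Peak RSS grew by {after - before:.1f} MB")
```

The reading includes the Python interpreter and any imported libraries, so the growth between the two calls is more informative than the absolute value.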

### Evidence Base
All optimizations are validated through 1,357 experiments (40,710 measurements, N=30).
See `OPTIMIZATION_RULES.md` in the repository for details.
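
The figures above are consistent with 30 measurements per experiment (1,357 × 30 = 40,710). As a rough illustration of that kind of repeated measurement, the sketch below times a scalar GC computation 30 times and summarizes the samples. `time_repeated` and `scalar_gc` are hypothetical stand-ins, not the project's benchmark harness.

```python
import statistics
import time
from typing import Callable, List


def time_repeated(func: Callable[[], object], repeats: int = 30) -> List[float]:
    """Run `func` `repeats` times and return the wall-clock duration of each run."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


payload = b"ACGT" * 25_000  # 100 kb synthetic sequence


def scalar_gc() -> float:
    """Scalar baseline: GC fraction computed with bytes.count."""
    return (payload.count(b"G") + payload.count(b"C")) / len(payload)


samples = time_repeated(scalar_gc)
mean_us = statistics.mean(samples) * 1e6
stdev_us = statistics.stdev(samples) * 1e6
print(f"n={len(samples)} mean={mean_us:.1f} us stdev={stdev_us:.1f} us")
```

To estimate a speedup on your own hardware, time the equivalent biometal call with the same helper and compare the two means.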