# Getting Started with biometal

**Duration**: 15-20 minutes  
**Level**: Beginner  
**Prerequisites**: Basic Python knowledge, familiarity with FASTQ format

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. ✅ Install and import biometal
2. ✅ Stream FASTQ files with constant memory
3. ✅ Calculate GC content and base composition
4. ✅ Analyze quality scores
5. ✅ Understand ARM NEON performance benefits

---

## Why biometal?

biometal is designed to solve a critical problem in bioinformatics: **analyzing massive datasets on consumer hardware**.

### Key Features:
- **Constant Memory**: ~5 MB regardless of dataset size (stream, don't load)
- **ARM-Native Speed**: 16-25× faster on Apple Silicon (M1/M2/M3/M4)
- **Network Streaming**: Analyze without downloading (5 TB → 5 MB)
- **Production Quality**: 347 tests, Grade A code quality

### Traditional Approach (Bad):
```python
# Load entire file into memory (BAD!)
records = list(load_all_reads("huge_file.fq.gz"))  # 💥 Out of memory!
```

### biometal Approach (Good):
```python
# Stream one record at a time (GOOD!)
stream = biometal.FastqStream.from_path("huge_file.fq.gz")
for record in stream:  # ✅ Constant 5 MB memory
    process(record)
```
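
To make "stream, don't load" concrete, here is a minimal standard-library sketch of the same idea. It is not biometal's implementation: `iter_fastq` and the demo file are hypothetical names used only for illustration, and the reader assumes well-formed four-line FASTQ records.

```python
import gzip
import os
import tempfile
from typing import Iterator, Tuple


def iter_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) one record at a time.

    Hypothetical helper: assumes four-line records with no wrapped sequences.
    """
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # readline() returns "" only at end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header.rstrip("\n")[1:], sequence, quality


# Write a tiny two-record file to a temporary directory and stream it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fq.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nHHHH\n")

streamed_ids = [read_id for read_id, _seq, _qual in iter_fastq(demo_path)]
print(streamed_ids)
```

Because the generator holds only one record at a time, memory use depends on the longest read rather than on the number of reads in the file.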

## Installation

Install biometal from PyPI:

```bash
pip install biometal-rs
```

**Note**: The package is published on PyPI as `biometal-rs` because the name `biometal` was already taken, but you still import it as `biometal`.

In [None]:
# Import biometal
import biometal

# Check version
print(f"biometal version: {biometal.__version__}")
print(f"Expected: 1.2.0 or higher")

## 1. Streaming FASTQ Files

Let's start by streaming a FASTQ file. biometal uses a **streaming architecture** that processes one record at a time, keeping memory constant.

### Why Streaming Matters:

| Approach | 1M Reads | 100M Reads | 1B Reads |
|----------|----------|------------|----------|
| **Load All** | 1.3 GB | 134 GB | 💥 Crash |
| **Stream** | 5 MB | 5 MB | 5 MB |

### Demo Data:
We'll use a small test file for this tutorial. In real analysis, you'd use files with millions of reads.

In [None]:
# Create a small test FASTQ file for demonstration
import gzip

test_data = """@read1
ATGCATGCATGCATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read2
GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC
+
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@read3
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
"""

# Write test file
with gzip.open("test_reads.fq.gz", "wt") as f:
    f.write(test_data)

print("✅ Created test_reads.fq.gz")

In [None]:
# Stream FASTQ file
stream = biometal.FastqStream.from_path("test_reads.fq.gz")

# Process records one at a time
read_count = 0
for record in stream:
    read_count += 1
    print(f"Read {read_count}: {record.id}")
    print(f"  Sequence length: {len(record.sequence)} bp")
    print(f"  Quality length: {len(record.quality)} scores")
    print()

print(f"\n✅ Processed {read_count} reads with constant memory")

### Understanding the FastqRecord

Each record has three attributes:
- **`id`**: Read identifier (string)
- **`sequence`**: DNA sequence (bytes)
- **`quality`**: Phred quality scores (bytes)

To work with sequences as strings, use the `.sequence_str` property:

In [None]:
# Re-open stream (streams are consumed once)
stream = biometal.FastqStream.from_path("test_reads.fq.gz")

for record in stream:
    # Access as bytes (for biometal functions)
    seq_bytes = record.sequence
    print(f"{record.id}:")
    print(f"  Bytes: {seq_bytes[:10]}...")
    
    # Access as string (for display/other tools)
    seq_str = record.sequence_str
    print(f"  String: {seq_str[:10]}...")
    print()
    break  # Just show first record
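
The difference between bytes and strings can be seen with plain Python, without biometal. In this small sketch, `seq_bytes` is a stand-in for `record.sequence`. Whether `.sequence_str` is exactly equivalent to an ASCII decode is an assumption here.

```python
seq_bytes = b"ATGCATGCAT"  # stand-in for record.sequence

# Slicing bytes yields bytes; indexing yields an integer byte value.
first_four = seq_bytes[:4]   # a bytes object
first_byte = seq_bytes[0]    # an int (the ASCII code of the first base)

# Decoding converts bytes to a str for display or for string-based tools.
seq_str = seq_bytes.decode("ascii")

print(first_four, first_byte, seq_str)
```

This is why the bytes form prints with a `b'...'` prefix in the cell above, while the string form does not.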

## 2. GC Content Analysis

GC content is a fundamental metric in genomics:
- **Bacteria**: Typically 40-70% GC
- **Humans**: ~41% GC
- **Higher GC than expected for the organism**: May indicate contamination
- **Lower GC than expected**: May indicate AT-rich regions or poor quality

biometal's `gc_content()` function is **16-25× faster on ARM** thanks to NEON SIMD acceleration.
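
As a reference for what the metric means, GC content is the number of G and C bases divided by the sequence length. The sketch below is a plain-Python illustration, not biometal's NEON code path. `gc_fraction` is a hypothetical name, and its handling of lowercase and empty input is an assumption that may differ from `biometal.gc_content()`.

```python
def gc_fraction(sequence: bytes) -> float:
    """Return the fraction of G and C bases (0.0 to 1.0), case-insensitive."""
    if not sequence:
        return 0.0  # assumption: treat an empty sequence as 0% GC
    upper = sequence.upper()
    return (upper.count(b"G") + upper.count(b"C")) / len(sequence)


# The same three sequences used in this tutorial's test file.
test_sequences = (b"ATGC" * 8, b"GC" * 16, b"A" * 32)
gc_values = [gc_fraction(seq) for seq in test_sequences]
print(gc_values)
```

A scalar reference like this is handy for cross-checking results on a handful of reads.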

In [None]:
# Calculate GC content
stream = biometal.FastqStream.from_path("test_reads.fq.gz")

total_gc = 0.0
read_count = 0

for record in stream:
    # Calculate GC content (0.0 to 1.0)
    gc = biometal.gc_content(record.sequence)
    
    print(f"{record.id}:")
    print(f"  Sequence: {record.sequence_str}")
    print(f"  GC content: {gc:.2%}")
    print()
    
    total_gc += gc
    read_count += 1

avg_gc = total_gc / read_count
print(f"Average GC content: {avg_gc:.2%}")

## 3. Base Counting

Count the frequency of each nucleotide base. Useful for:
- Quality control (balanced base composition)
- Detecting adapter contamination
- Identifying systematic biases

`count_bases()` is also **16.7× faster on ARM** with NEON acceleration.
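
For comparison, the same tally can be sketched with the standard library. This is an illustration only: `count_acgt` is a hypothetical helper, and ignoring non-ACGT symbols such as `N` is an assumption, not a statement about how `biometal.count_bases()` treats them.

```python
from collections import Counter


def count_acgt(sequence: bytes) -> dict:
    """Count A, C, G and T in a sequence; other symbols are ignored."""
    counts = Counter(sequence.decode("ascii").upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


example_counts = count_acgt(b"ATGCATGCNN")
print(example_counts)
```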

In [None]:
# Count bases
stream = biometal.FastqStream.from_path("test_reads.fq.gz")

# Accumulate counts across all reads
total_counts = {"A": 0, "C": 0, "G": 0, "T": 0}

for record in stream:
    # Returns dict with counts
    counts = biometal.count_bases(record.sequence)
    
    print(f"{record.id}:")
    print(f"  A: {counts['A']}, C: {counts['C']}, "
          f"G: {counts['G']}, T: {counts['T']}")
    
    # Accumulate
    for base in "ACGT":
        total_counts[base] += counts[base]

print(f"\nTotal base counts:")
for base, count in total_counts.items():
    print(f"  {base}: {count}")

# Calculate percentages
total_bases = sum(total_counts.values())
print(f"\nBase composition:")
for base, count in total_counts.items():
    pct = 100 * count / total_bases
    print(f"  {base}: {pct:.1f}%")

## 4. Quality Score Analysis

Quality scores (Phred scores) indicate base calling confidence:
- **Q20**: 99% accuracy (1 in 100 error)
- **Q30**: 99.9% accuracy (1 in 1000 error)
- **Q40**: 99.99% accuracy (1 in 10,000 error)

Common thresholds:
- Keep reads with mean Q ≥ 20
- Trim bases with Q < 20

biometal's `mean_quality()` is **25.1× faster on ARM**.
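
The accuracy figures above follow from the Phred definition, where the error probability is 10^(-Q/10). The sketch below decodes a quality string by hand, assuming the common Phred+33 ASCII encoding. The helper names are hypothetical, and taking the arithmetic mean of the Q values is an assumption about what "mean quality" means, not a description of biometal's internals.

```python
def phred_scores(quality: bytes, offset: int = 33) -> list:
    """Decode an ASCII quality string into integer Phred scores."""
    return [byte - offset for byte in quality]


def error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def mean_phred(quality: bytes, offset: int = 33) -> float:
    """Arithmetic mean of the decoded Phred scores."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)


# 'I', 'H' and 'F' are the quality characters used in this tutorial's test file.
decoded = phred_scores(b"IHF")
print(decoded, mean_phred(b"IIII"), error_probability(20))
```

With this encoding, all three test reads decode to scores well above the Q30 threshold.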

In [None]:
# Analyze quality scores
stream = biometal.FastqStream.from_path("test_reads.fq.gz")

for record in stream:
    # Calculate mean quality (Phred score)
    mean_q = biometal.mean_quality(record.quality)
    
    print(f"{record.id}:")
    print(f"  Quality string: {record.quality_str}")
    print(f"  Mean quality: Q{mean_q:.1f}")
    
    # Quality assessment
    if mean_q >= 30:
        status = "✅ Excellent (Q ≥ 30)"
    elif mean_q >= 20:
        status = "⚠️  Acceptable (Q ≥ 20)"
    else:
        status = "❌ Poor (Q < 20)"
    
    print(f"  Status: {status}")
    print()

## 5. Complete Analysis Workflow

Let's combine everything into a typical QC analysis:

In [None]:
# Complete QC analysis
stream = biometal.FastqStream.from_path("test_reads.fq.gz")

# Initialize metrics
total_reads = 0
total_bases = 0
total_gc = 0.0
high_quality_reads = 0

print("📊 Quality Control Report\n")
print(f"{'Read ID':<15} {'Length':<10} {'GC%':<10} {'Mean Q':<10} {'Status'}")
print("-" * 65)

for record in stream:
    # Calculate metrics
    length = len(record.sequence)
    gc = biometal.gc_content(record.sequence)
    mean_q = biometal.mean_quality(record.quality)
    
    # Quality check
    if mean_q >= 20:
        status = "✅ Pass"
        high_quality_reads += 1
    else:
        status = "❌ Fail"
    
    # Display
    print(f"{record.id:<15} {length:<10} {gc*100:<10.1f} {mean_q:<10.1f} {status}")
    
    # Accumulate
    total_reads += 1
    total_bases += length
    total_gc += gc

# Summary
print("-" * 65)
print(f"\n📈 Summary:")
print(f"  Total reads: {total_reads}")
print(f"  Total bases: {total_bases:,} bp")
print(f"  Average GC: {total_gc/total_reads:.2%}")
print(f"  High quality reads (Q≥20): {high_quality_reads}/{total_reads} "
      f"({100*high_quality_reads/total_reads:.1f}%)")

## 6. ARM NEON Performance

biometal automatically uses ARM NEON SIMD instructions on Apple Silicon (M1/M2/M3/M4) for massive speedups:

| Operation | Scalar | NEON | Speedup |
|-----------|--------|------|---------|
| Base counting | 315 Kseq/s | 5,254 Kseq/s | **16.7×** |
| GC content | 294 Kseq/s | 5,954 Kseq/s | **20.3×** |
| Quality filter | 245 Kseq/s | 6,143 Kseq/s | **25.1×** |

On other platforms (such as x86_64), biometal falls back to optimized scalar code (the 1× baseline).
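
The speedup column is the NEON throughput divided by the scalar throughput. A short check of that arithmetic, using only the figures from the table above:

```python
# (scalar, NEON) throughput in thousands of sequences per second, from the table.
throughput_kseq_per_s = {
    "Base counting": (315, 5254),
    "GC content": (294, 5954),
    "Quality filter": (245, 6143),
}

# Speedup = NEON throughput / scalar throughput, rounded to one decimal place.
speedups = {
    operation: round(neon / scalar, 1)
    for operation, (scalar, neon) in throughput_kseq_per_s.items()
}

for operation, factor in speedups.items():
    print(f"{operation}: {factor}x")
```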

### Check Your Platform:

In [None]:
import platform

arch = platform.machine()
print(f"Your architecture: {arch}")

if arch == "arm64":
    print("✅ You have ARM (Apple Silicon) - NEON acceleration enabled!")
    print("   Expected speedup: 16-25× over scalar code")
elif arch == "aarch64":
    print("✅ You have ARM (Linux) - NEON acceleration enabled!")
    print("   Expected speedup: 6-10× faster (varies by platform)")
else:
    print(f"ℹ️  You have {arch} - using optimized scalar code")
    print("   Still faster than pure Python, but no NEON acceleration")

## 7. Memory Efficiency Demonstration

Let's demonstrate constant memory usage:

In [None]:
import psutil
import os

# Get current process
process = psutil.Process(os.getpid())

# Measure memory before
mem_before = process.memory_info().rss / 1024 / 1024  # MB

# Stream through records
stream = biometal.FastqStream.from_path("test_reads.fq.gz")
for record in stream:
    # Process record
    gc = biometal.gc_content(record.sequence)
    mean_q = biometal.mean_quality(record.quality)

# Measure memory after
mem_after = process.memory_info().rss / 1024 / 1024  # MB

print(f"Memory before: {mem_before:.1f} MB")
print(f"Memory after:  {mem_after:.1f} MB")
print(f"Memory change: {mem_after - mem_before:.1f} MB")
print(f"\n✅ Memory stayed flat while streaming")
print(f"   (Small file here; the same pattern applies to TB-size datasets)")
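
A three-read file is too small to show a real difference, so the following sketch contrasts the two approaches on synthetic data using the standard library's `tracemalloc`. It is an illustration under stated assumptions: `fake_reads`, `load_all` and `stream_all` are hypothetical names, and `tracemalloc` only tracks allocations made through Python's allocator, so it says nothing about memory used inside biometal's native code.

```python
import tracemalloc

READ_TEMPLATE = b"ACGT" * 25  # one synthetic 100-base read


def fake_reads(n: int):
    """Yield n synthetic reads, each a fresh 100-byte object."""
    for _ in range(n):
        yield bytearray(READ_TEMPLATE)


def peak_bytes(consume) -> int:
    """Run consume() and return the peak traced Python memory in bytes."""
    tracemalloc.start()
    consume()
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def load_all() -> int:
    reads = list(fake_reads(20_000))  # every read is held in memory at once
    return sum(len(read) for read in reads)


def stream_all() -> int:
    return sum(len(read) for read in fake_reads(20_000))  # one read at a time


peak_load_all = peak_bytes(load_all)
peak_stream = peak_bytes(stream_all)
print(f"Load all: {peak_load_all / 1024:.0f} KiB peak")
print(f"Stream:   {peak_stream / 1024:.0f} KiB peak")
```

The load-all peak grows with the number of reads, while the streaming peak stays roughly the size of a single read.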

## Key Takeaways

✅ **Streaming Architecture**: Constant ~5 MB memory regardless of file size  
✅ **Simple API**: `FastqStream.from_path()` → iterate → process  
✅ **ARM Performance**: 16-25× faster on Apple Silicon (automatic)  
✅ **Production Ready**: 347 tests, Grade A quality  

## What's Next?

Continue learning with:

**→ [02_quality_control_pipeline.ipynb](02_quality_control_pipeline.ipynb)**
- Complete QC pipeline (trim → filter → mask)
- Trimmomatic-compatible trimming
- Quality-based masking
- Production workflows

Or explore:
- **03_kmer_analysis.ipynb**: K-mer extraction for ML (DNABert)
- **04_sra_streaming.ipynb**: Analyze without downloading (5 TB → 5 MB)

---

## Exercises

Try these on your own:

1. **Create your own FASTQ file** with different sequences and analyze them
2. **Filter reads by GC content** (e.g., keep only 40-60% GC)
3. **Calculate min/max quality** scores across a file
4. **Find reads with high A/T content** (potential poly-A tails)

---

## Resources

- **Documentation**: https://docs.rs/biometal
- **GitHub**: https://github.com/shandley/biometal
- **PyPI**: https://pypi.org/project/biometal-rs/
- **Issues**: https://github.com/shandley/biometal/issues

---

**biometal v1.2.0** - ARM-native bioinformatics with streaming architecture