# Quality Control Pipeline with biometal

**Duration**: 30-40 minutes  
**Level**: Intermediate  
**Prerequisites**: Complete [01_getting_started.ipynb](01_getting_started.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. ✅ Trim reads by fixed positions (adapter removal)
2. ✅ Trim reads by quality scores (Trimmomatic-style)
3. ✅ Filter reads by length requirements
4. ✅ Mask low-quality bases (variant calling pipelines)
5. ✅ Build complete QC workflows (trim → filter → mask)
6. ✅ Visualize QC metrics and results

---

## Why Quality Control?

Raw sequencing reads contain errors and artifacts that can affect downstream analysis:

### Common Issues:
- **Low-quality ends**: Base calling confidence drops toward read ends
- **Adapter sequences**: Sequencing adapters not fully removed
- **Short reads**: Fragments too short for alignment
- **Systematic errors**: Quality drops in specific regions

### QC Solutions:
- **Trimming**: Remove low-quality bases from ends
- **Masking**: Replace low-quality bases with 'N' (preserves length)
- **Filtering**: Remove reads that don't meet quality thresholds

### biometal v1.2.0 Phase 4 Features:

This notebook showcases the **20 new Python functions** added in biometal v1.2.0:
- **7 trimming operations** (fixed position + quality-based)
- **2 masking operations** (quality-based masking)
- **5 record operations** (length filtering, region extraction)
- **6 sequence operations** (reverse complement, validation)

All with **constant memory streaming** architecture!

In [None]:
# Import biometal
import biometal
print(f"biometal version: {biometal.__version__}")
print(f"Expected: 1.2.0 or higher (Phase 4 features)")

## 1. Create Test Data

Let's create reads with realistic quality patterns:
- High quality in the middle
- Lower quality at the ends (typical Illumina pattern)

In [None]:
import gzip

# Create test FASTQ with realistic quality patterns
test_data = """@read1_good_quality
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
+
FFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFF
@read2_low_quality_end
GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII########
@read3_low_quality_start
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
####IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read4_low_quality_both
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
####IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII#####
@read5_short_after_trim
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+
######IIII########IIII#######
"""

# Write test file
with gzip.open("qc_test_reads.fq.gz", "wt") as f:
    f.write(test_data)

print("✅ Created qc_test_reads.fq.gz")
print("\nQuality encoding (Phred+33):")
print("  I = Q40 (99.99% accuracy) - Excellent")
print("  F = Q37 (99.98% accuracy) - Good")
print("  # = Q2  (37% accuracy) - Poor")

## 2. Fixed Position Trimming

Remove a fixed number of bases from read ends. Useful for:
- Adapter removal (if adapter length known)
- Systematic quality drops at specific positions
- Removing UMI/barcode sequences

### Functions:
- `trim_start(record, bases)` - Remove N bases from 5' end
- `trim_end(record, bases)` - Remove N bases from 3' end
- `trim_both(record, start_bases, end_bases)` - Trim both ends
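
Conceptually, fixed-position trimming slices the sequence and quality strings in lockstep. The sketch below is a pure-Python illustration on plain strings, not biometal's implementation: the `Read` tuple and `slice_read` helper are hypothetical, and clamping an oversized trim to an empty read is an assumption about edge-case behavior.

```python
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical stand-in for a FASTQ record (not a biometal type)."""
    id: str
    sequence: str
    quality: str


def slice_read(read: Read, start_bases: int = 0, end_bases: int = 0) -> Read:
    """Drop start_bases from the 5' end and end_bases from the 3' end."""
    if start_bases < 0 or end_bases < 0:
        raise ValueError("trim amounts must be non-negative")
    # Clamp so that an oversized trim yields an empty read (assumption).
    stop = max(len(read.sequence) - end_bases, start_bases)
    # Slice sequence and quality identically so they stay the same length.
    return read._replace(
        sequence=read.sequence[start_bases:stop],
        quality=read.quality[start_bases:stop],
    )


read = Read("demo", "ACGTACGTAC", "##IIIIII##")
print(slice_read(read, start_bases=2))               # 5' trim
print(slice_read(read, end_bases=2))                 # 3' trim
print(slice_read(read, start_bases=2, end_bases=2))  # both ends
```

Whatever the implementation, the sequence and quality strings must have equal length after trimming.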

In [None]:
# Demonstrate fixed position trimming
stream = biometal.FastqStream.from_path("qc_test_reads.fq.gz")

for record in stream:
    print(f"Original: {record.id}")
    print(f"  Length: {len(record.sequence)} bp")
    print(f"  Sequence: {record.sequence_str[:20]}...{record.sequence_str[-20:]}")
    print(f"  Quality:  {record.quality_str[:20]}...{record.quality_str[-20:]}")
    
    # Trim 5 bases from start
    trimmed = biometal.trim_start(record, 5)
    print(f"\nAfter trim_start(5):")
    print(f"  Length: {len(trimmed.sequence)} bp")
    print(f"  Sequence: {trimmed.sequence_str[:20]}...")
    
    # Trim 5 bases from end
    trimmed = biometal.trim_end(record, 5)
    print(f"\nAfter trim_end(5):")
    print(f"  Length: {len(trimmed.sequence)} bp")
    print(f"  Sequence: ...{trimmed.sequence_str[-20:]}")
    
    # Trim both ends
    trimmed = biometal.trim_both(record, 5, 5)
    print(f"\nAfter trim_both(5, 5):")
    print(f"  Length: {len(trimmed.sequence)} bp")
    print(f"  Removed: {len(record.sequence) - len(trimmed.sequence)} bp total")
    print("-" * 60)
    
    break  # Just show first record

## 3. Quality-Based Trimming

Trim bases based on quality scores (Phred+33 encoding). Because the cut point adapts to each read, this is more flexible than fixed trimming.

### Quality Thresholds:
- **Q20**: 99% accuracy (1 in 100 error) - Minimum acceptable
- **Q30**: 99.9% accuracy (1 in 1000 error) - High quality
- **Q40**: 99.99% accuracy (1 in 10,000 error) - Excellent
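
These accuracy figures follow from the Phred definition, Q = -10 · log10(P_error). In Phred+33 FASTQ, each quality character encodes Q as `ord(char) - 33`. A small sketch of that arithmetic (the helper names are illustrative, not biometal functions):

```python
def phred33_to_q(char: str) -> int:
    """Decode one Phred+33 quality character to its Q score."""
    return ord(char) - 33


def error_probability(q: int) -> float:
    """Probability that a base call with score q is wrong."""
    return 10 ** (-q / 10)


for char in "IF5#":
    q = phred33_to_q(char)
    accuracy = 100 * (1 - error_probability(q))
    print(f"{char!r}: Q{q}, accuracy {accuracy:.2f}%")
```

The character `5` decodes to Q20, so in a raw quality string any character that sorts below `5` falls under the Q20 threshold.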

### Functions:
- `trim_quality_end(record, min_quality)` - Trim from 3' end
- `trim_quality_start(record, min_quality)` - Trim from 5' end
- `trim_quality_both(record, min_quality)` - Trim both ends (single-pass)
- `trim_quality_window(record, min_quality, window_size)` - Trimmomatic-style
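
One plausible reading of end trimming is to walk inward from each end and drop bases until the first one that meets the threshold. The pure-Python sketch below implements that reading on plain strings. It is an assumption about the semantics rather than biometal's code, and the helper names are hypothetical.

```python
def decode(quality: str) -> list[int]:
    """Phred+33 quality string -> list of Q scores."""
    return [ord(c) - 33 for c in quality]


def quality_trim_bounds(quality: str, min_quality: int) -> tuple[int, int]:
    """Return (start, stop) of the span left after trimming both ends."""
    scores = decode(quality)
    start, stop = 0, len(scores)
    # Walk in from the 5' end past bases below the threshold.
    while start < stop and scores[start] < min_quality:
        start += 1
    # Walk in from the 3' end the same way.
    while stop > start and scores[stop - 1] < min_quality:
        stop -= 1
    return start, stop


sequence = "TTTTACGTACGTTTTT"
quality = "####IIIIIIII####"
start, stop = quality_trim_bounds(quality, min_quality=20)
print(sequence[start:stop], quality[start:stop])
```

Under this reading, low-quality bases in the interior of the read are left alone; only the ends are affected.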

In [None]:
# Quality-based trimming demonstration
stream = biometal.FastqStream.from_path("qc_test_reads.fq.gz")

min_quality = 20  # Q20 threshold (99% accuracy)

print(f"Quality threshold: Q{min_quality} (99% accuracy)\n")

for record in stream:
    original_len = len(record.sequence)
    
    print(f"{record.id}:")
    print(f"  Original: {original_len} bp")
    print(f"  Quality:  {record.quality_str}")
    
    # Trim low-quality 3' end
    trimmed_end = biometal.trim_quality_end(record, min_quality)
    print(f"  After trim_quality_end(Q{min_quality}): {len(trimmed_end.sequence)} bp "
          f"(-{original_len - len(trimmed_end.sequence)} bp)")
    
    # Trim low-quality 5' end
    trimmed_start = biometal.trim_quality_start(record, min_quality)
    print(f"  After trim_quality_start(Q{min_quality}): {len(trimmed_start.sequence)} bp "
          f"(-{original_len - len(trimmed_start.sequence)} bp)")
    
    # Trim both ends (single-pass optimized)
    trimmed_both = biometal.trim_quality_both(record, min_quality)
    print(f"  After trim_quality_both(Q{min_quality}): {len(trimmed_both.sequence)} bp "
          f"(-{original_len - len(trimmed_both.sequence)} bp)")
    print()

print("✅ Quality-based trimming preserves high-quality regions")

## 4. Sliding Window Trimming (Trimmomatic-Style)

The `trim_quality_window()` function implements Trimmomatic's SLIDINGWINDOW algorithm:
- Slides a window across the read (5' → 3')
- Trims when average quality in window drops below threshold
- More aggressive than simple end trimming

### Use Cases:
- Trimmomatic-compatible workflows
- Aggressive quality filtering
- Cutting reads at internal low-quality regions (everything 3' of the cut is discarded)

### Typical Parameters:
- **Window size**: 4 bp (Trimmomatic default)
- **Quality**: Q20 or Q30
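
A simplified sketch of the sliding-window idea: scan 5' → 3' and cut the read at the start of the first window whose mean quality falls below the threshold. This is pure Python for illustration only. Trimmomatic and biometal may handle edge cases (reads shorter than the window, trailing bases) differently, and `sliding_window_cut` is a hypothetical helper.

```python
def sliding_window_cut(quality: str, min_quality: int, window_size: int = 4) -> int:
    """Return how many bases to keep from the 5' end."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    scores = [ord(c) - 33 for c in quality]  # Phred+33 decoding
    if len(scores) < window_size:
        # Assumption: treat a read shorter than the window as one window.
        window_size = max(len(scores), 1)
    for start in range(len(scores) - window_size + 1):
        window = scores[start:start + window_size]
        if sum(window) / window_size < min_quality:
            return start  # cut here; everything 3' of this point is dropped
    return len(scores)


quality = "IIIIIIII####IIII"
keep = sliding_window_cut(quality, min_quality=20, window_size=4)
print(f"keep {keep} of {len(quality)} bases")
```

In this example the high-quality bases after the `####` stretch are lost as well, which is why the sliding window is more aggressive than end trimming.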

In [None]:
# Sliding window trimming (Trimmomatic SLIDINGWINDOW)
stream = biometal.FastqStream.from_path("qc_test_reads.fq.gz")

window_size = 4  # Trimmomatic default
min_quality = 20  # Q20 threshold

print(f"Sliding window: {window_size} bp, Q{min_quality} average\n")

for record in stream:
    original_len = len(record.sequence)
    
    # Apply sliding window trimming
    trimmed = biometal.trim_quality_window(record, min_quality, window_size)
    
    print(f"{record.id}:")
    print(f"  Original: {original_len} bp")
    print(f"  Quality:  {record.quality_str}")
    print(f"  Trimmed:  {len(trimmed.sequence)} bp (-{original_len - len(trimmed.sequence)} bp)")
    
    if len(trimmed.sequence) > 0:
        print(f"  Final quality: {trimmed.quality_str}")
    else:
        print(f"  ⚠️  All bases removed (entire read low quality)")
    print()

print("✅ Sliding window is more aggressive than simple end trimming")

## 5. Length Filtering

After trimming, reads may become too short for alignment. Filter by length:

### Function:
- `meets_length_requirement(record, min_len, max_len)` - Returns True if length OK

### Typical Thresholds:
- **Short reads** (Illumina): Keep 50-150 bp
- **Long reads** (PacBio/Nanopore): Keep 1,000-50,000 bp
- **Depends on use case**: Alignment vs assembly vs k-mer analysis
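
The length check itself is a simple range test. The sketch below assumes both bounds are inclusive, which this notebook does not state, and uses a hypothetical helper on plain strings to tally outcomes:

```python
from collections import Counter


def length_status(sequence: str, min_len: int, max_len: int) -> str:
    """Classify a read by length; bounds are assumed inclusive."""
    if len(sequence) < min_len:
        return "too_short"
    if len(sequence) > max_len:
        return "too_long"
    return "pass"


# Synthetic reads of 30, 40, 75, 100 and 101 bp.
reads = ["A" * n for n in (30, 40, 75, 100, 101)]
tally = Counter(length_status(r, min_len=40, max_len=100) for r in reads)
print(dict(tally))
```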

In [None]:
# Length filtering demonstration
stream = biometal.FastqStream.from_path("qc_test_reads.fq.gz")

min_length = 40  # Minimum 40 bp after trimming
max_length = 100  # Maximum 100 bp
min_quality = 20  # Q20 trimming threshold

print(f"Pipeline: Trim (Q{min_quality}) → Filter ({min_length}-{max_length} bp)\n")

passed = 0
failed_too_short = 0
failed_too_long = 0

for record in stream:
    # Step 1: Trim by quality
    trimmed = biometal.trim_quality_both(record, min_quality)
    
    # Step 2: Check length
    passes = biometal.meets_length_requirement(trimmed, min_length, max_length)
    
    print(f"{record.id}:")
    print(f"  Original: {len(record.sequence)} bp")
    print(f"  Trimmed:  {len(trimmed.sequence)} bp")
    
    if passes:
        print(f"  ✅ PASS: Length {min_length}-{max_length} bp")
        passed += 1
    else:
        if len(trimmed.sequence) < min_length:
            print(f"  ❌ FAIL: Too short (<{min_length} bp)")
            failed_too_short += 1
        else:
            print(f"  ❌ FAIL: Too long (>{max_length} bp)")
            failed_too_long += 1
    print()

print(f"\n📊 Results:")
print(f"  Passed: {passed}")
print(f"  Failed (too short): {failed_too_short}")
print(f"  Failed (too long): {failed_too_long}")

## 6. Quality-Based Masking

**Masking vs Trimming**:
- **Trimming**: Removes bases (changes read length)
- **Masking**: Replaces bases with 'N' (preserves read length)

### When to Use Masking:
- **Variant calling**: Preserves read structure for alignment
- **Length-sensitive analysis**: Need to maintain positions
- **Alternative to trimming**: When length must stay constant

### Functions:
- `mask_low_quality(record, min_quality)` - Replace bases with 'N'
- `count_masked_bases(record)` - Count N's for QC metrics
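
As a mental model, masking walks the sequence and quality strings together and substitutes `N` wherever the score is below the threshold. The sketch below is a pure-Python illustration with a hypothetical helper, not biometal's implementation:

```python
def mask_by_quality(sequence: str, quality: str, min_quality: int) -> str:
    """Replace each base whose Phred+33 score is below min_quality with 'N'."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return "".join(
        base if ord(q) - 33 >= min_quality else "N"
        for base, q in zip(sequence, quality)
    )


sequence = "ACGTACGTAC"
quality = "II#III##II"
masked = mask_by_quality(sequence, quality, min_quality=20)
mask_rate = masked.count("N") / len(masked)
print(masked, f"({mask_rate:.0%} masked)")
```

Counting `N` characters this way also counts any `N` already present in the read before masking, which is worth keeping in mind when interpreting the mask rate.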

In [None]:
# Quality-based masking demonstration
stream = biometal.FastqStream.from_path("qc_test_reads.fq.gz")

min_quality = 20  # Q20 threshold

print(f"Masking: Replace bases with Q < {min_quality} with 'N'\n")

for record in stream:
    # Mask low-quality bases
    masked = biometal.mask_low_quality(record, min_quality)
    
    # Count masked bases
    n_count = biometal.count_masked_bases(masked)
    mask_rate = n_count / len(masked.sequence)
    
    print(f"{record.id}:")
    print(f"  Original: {record.sequence_str}")
    print(f"  Quality:  {record.quality_str}")
    print(f"  Masked:   {masked.sequence_str}")
    print(f"  N count:  {n_count}/{len(masked.sequence)} ({mask_rate*100:.1f}% masked)")
    
    # QC decision: Reject if >10% masked
    if mask_rate <= 0.1:
        print(f"  ✅ PASS: ≤10% masked")
    else:
        print(f"  ❌ FAIL: >10% masked")
    print()

print("✅ Masking preserves read length (unlike trimming)")

## 7. Complete QC Pipeline

Let's build a production-ready QC pipeline combining everything:

### Workflow:
1. **Trim**: Remove low-quality ends (Q20, sliding window)
2. **Filter**: Keep only 40-100 bp reads (sized for this short test data; 50-150 bp is more typical for Illumina)
3. **Mask**: Replace remaining low-quality bases with 'N'
4. **Final QC**: Reject if >10% masked

This is a **typical pre-alignment QC workflow** used in genomics.

In [None]:
# Complete QC pipeline
stream = biometal.FastqStream.from_path("qc_test_reads.fq.gz")

# QC parameters
TRIM_QUALITY = 20       # Q20 for trimming
TRIM_WINDOW = 4         # 4 bp sliding window
MIN_LENGTH = 40         # Minimum 40 bp
MAX_LENGTH = 100        # Maximum 100 bp
MASK_QUALITY = 20       # Q20 for masking
MAX_MASK_RATE = 0.10    # Maximum 10% masked

print("🔬 Quality Control Pipeline\n")
print(f"Parameters:")
print(f"  Trim: Q{TRIM_QUALITY}, {TRIM_WINDOW}bp window")
print(f"  Length: {MIN_LENGTH}-{MAX_LENGTH} bp")
print(f"  Mask: Q{MASK_QUALITY}, max {MAX_MASK_RATE*100}% masked\n")
print("-" * 70)

# Track statistics
total_reads = 0
passed_reads = 0
failed_trimming = 0
failed_length = 0
failed_masking = 0

for record in stream:
    total_reads += 1
    original_len = len(record.sequence)
    
    print(f"\n{record.id}:")
    print(f"  Step 0: Original = {original_len} bp")
    
    # Step 1: Quality-based trimming (sliding window)
    trimmed = biometal.trim_quality_window(record, TRIM_QUALITY, TRIM_WINDOW)
    print(f"  Step 1: Trimmed = {len(trimmed.sequence)} bp (-{original_len - len(trimmed.sequence)} bp)")
    
    if len(trimmed.sequence) == 0:
        print(f"  ❌ FAIL: All bases removed (low quality)")
        failed_trimming += 1
        continue
    
    # Step 2: Length filter
    if not biometal.meets_length_requirement(trimmed, MIN_LENGTH, MAX_LENGTH):
        print(f"  ❌ FAIL: Length {len(trimmed.sequence)} bp (not {MIN_LENGTH}-{MAX_LENGTH} bp)")
        failed_length += 1
        continue
    
    print(f"  Step 2: Length OK ({MIN_LENGTH}-{MAX_LENGTH} bp)")
    
    # Step 3: Mask remaining low-quality bases
    masked = biometal.mask_low_quality(trimmed, MASK_QUALITY)
    n_count = biometal.count_masked_bases(masked)
    mask_rate = n_count / len(masked.sequence)
    
    print(f"  Step 3: Masked = {n_count} bases ({mask_rate*100:.1f}%)")
    
    # Step 4: Final QC check
    if mask_rate > MAX_MASK_RATE:
        print(f"  ❌ FAIL: {mask_rate*100:.1f}% masked (>{MAX_MASK_RATE*100}%)")
        failed_masking += 1
        continue
    
    print(f"  ✅ PASS: High-quality read ready for analysis")
    passed_reads += 1

# Summary statistics
print("\n" + "=" * 70)
print("\n📊 QC Summary:")
print(f"  Total reads: {total_reads}")
print(f"  Passed:      {passed_reads} ({100*passed_reads/total_reads:.1f}%)")
print(f"  Failed:      {total_reads - passed_reads} ({100*(total_reads-passed_reads)/total_reads:.1f}%)")
print(f"\n  Failure reasons:")
print(f"    - Trimming (no bases left):  {failed_trimming}")
print(f"    - Length (too short/long):   {failed_length}")
print(f"    - Masking (>10% masked):     {failed_masking}")
print("\n✅ QC pipeline complete!")

## 8. Production Pipeline Function

Here's a reusable function for production QC:

In [None]:
def quality_control_pipeline(
    input_path,
    trim_quality=20,
    window_size=4,
    min_length=50,
    max_length=150,
    mask_quality=20,
    max_mask_rate=0.10
):
    """
    Complete QC pipeline: trim → filter → mask
    
    Args:
        input_path: Path to FASTQ file
        trim_quality: Minimum quality for trimming (Phred score)
        window_size: Window size for sliding window trimming
        min_length: Minimum read length after trimming
        max_length: Maximum read length
        mask_quality: Minimum quality for masking
        max_mask_rate: Maximum fraction of masked bases (0.0-1.0)
    
    Returns:
        Generator yielding (record, status, reason) tuples. On "PASS" the
        record is the trimmed and masked read; on "FAIL" it is the original read.
    """
    stream = biometal.FastqStream.from_path(input_path)
    
    for record in stream:
        # Step 1: Trim
        trimmed = biometal.trim_quality_window(record, trim_quality, window_size)
        
        if len(trimmed.sequence) == 0:
            yield (record, "FAIL", "trimmed_empty")
            continue
        
        # Step 2: Length filter
        if not biometal.meets_length_requirement(trimmed, min_length, max_length):
            yield (record, "FAIL", f"length_{len(trimmed.sequence)}bp")
            continue
        
        # Step 3: Mask
        masked = biometal.mask_low_quality(trimmed, mask_quality)
        n_count = biometal.count_masked_bases(masked)
        mask_rate = n_count / len(masked.sequence)
        
        # Step 4: Final QC
        if mask_rate > max_mask_rate:
            yield (record, "FAIL", f"masked_{mask_rate*100:.1f}%")
            continue
        
        yield (masked, "PASS", "quality_ok")

# Usage example
print("Running production QC pipeline...\n")

for result_record, status, reason in quality_control_pipeline("qc_test_reads.fq.gz"):
    if status == "PASS":
        print(f"✅ {result_record.id}: {status} ({reason})")
    else:
        print(f"❌ {result_record.id}: {status} ({reason})")

print("\n✅ Production function ready for use!")

## Key Takeaways

✅ **7 Trimming Operations**: Fixed position + quality-based (v1.2.0)  
✅ **2 Masking Operations**: Quality-based masking (v1.2.0)  
✅ **Length Filtering**: Keep reads within size range  
✅ **Complete Pipeline**: Trim → Filter → Mask workflow  
✅ **Trimmomatic Compatible**: Sliding window trimming  
✅ **Production Ready**: Reusable function for real analysis  

## Trimming vs Masking

| Feature | Trimming | Masking |
|---------|----------|----------|
| **Changes length** | ✅ Yes | ❌ No |
| **Removes bases** | ✅ Yes | ❌ No (replaces with N) |
| **Use for alignment** | ✅ Yes | ⚠️ Depends |
| **Use for variant calling** | ⚠️ Possible | ✅ Preferred |
| **Preserves positions** | ❌ No | ✅ Yes |

## What's Next?

Continue learning with:

**→ [03_kmer_analysis.ipynb](03_kmer_analysis.ipynb)**
- K-mer extraction for DNABERT/DNABERT-2
- Minimizers for indexing
- Parallel extraction
- Machine learning preprocessing

Or explore:
- **04_sra_streaming.ipynb**: Analyze without downloading
- **01_getting_started.ipynb**: Review basics

---

## Exercises

Try these on your own:

1. **Adjust parameters**: Try Q30 threshold instead of Q20
2. **Create test data**: Generate reads with different quality patterns
3. **Compare methods**: Trimming vs masking - which preserves more reads?
4. **Optimize pipeline**: Find parameters that maximize passed reads
5. **Add visualizations**: Plot read length distributions before/after QC
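
For exercise 5, a dependency-free starting point is a text histogram of read lengths. The sketch below uses only the standard library; `text_histogram` is a hypothetical helper, and the length lists are made-up illustration values, not results from this notebook. Swapping in a plotting library is left as part of the exercise.

```python
from collections import Counter


def text_histogram(lengths: list[int], bin_size: int = 10) -> str:
    """Render read lengths as one row of '#' marks per length bin."""
    bins = Counter((length // bin_size) * bin_size for length in lengths)
    rows = []
    for start in sorted(bins):
        label = f"{start:>4}-{start + bin_size - 1:<4}"
        rows.append(f"{label} {'#' * bins[start]} ({bins[start]})")
    return "\n".join(rows)


# Hypothetical read lengths, for illustration only.
before = [52, 52, 50, 50, 30]
after = [52, 44, 46, 40]
print("Before QC:\n" + text_histogram(before))
print("After QC:\n" + text_histogram(after))
```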

---

**biometal v1.2.0** - Phase 4 sequence manipulation operations