# Network Streaming: Analyze Without Downloading

**Duration**: 30-40 minutes  
**Level**: Intermediate  
**Prerequisites**: Complete [01_getting_started.ipynb](01_getting_started.ipynb)

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. ✅ Stream FASTQ data directly from HTTP URLs without downloading
2. ✅ Understand network streaming architecture (HTTP range requests)
3. ✅ Access public genomics datasets (ENA, cloud storage)
4. ✅ Achieve constant memory (~5 MB) with network data
5. ✅ Configure caching and prefetching for performance
6. ✅ Understand SRA concepts and current limitations

---

## Why Network Streaming?

### The Problem

Traditional bioinformatics workflow:
1. **Download** 50 GB FASTQ file (30 minutes, uses 50 GB disk)
2. **Wait** for download to complete
3. **Analyze** the file (finally!)
4. **Delete** to free space

### The Solution: Network Streaming

biometal's workflow:
1. **Stream** directly from URL (start immediately)
2. **Constant memory**: ~5 MB regardless of file size
3. **No disk space** used (besides cache)
4. **Works on laptops**: Analyze TB-scale data!

### Evidence (Entry 028)

- **I/O bottleneck**: 264-352× slower than compute
- **Network streaming**: Addresses critical democratization barrier
- **Memory savings**: 99.5% reduction (Entry 026)
- **Target**: LMIC researchers, field work, limited resources

### biometal v1.0.0 Network Features

- **HTTP streaming**: `DataSource::Http(url)` with range requests
- **Smart caching**: 50 MB LRU cache (byte-bounded)
- **Automatic retry**: Exponential backoff (3 attempts)
- **Background prefetching**: 4 blocks ahead
- **Constant memory**: ~5 MB per stream

In [None]:
# Import biometal
import biometal
print(f"biometal version: {biometal.__version__}")
print(f"Expected: 1.0.0 or higher (network streaming)")

## 1. HTTP Streaming Basics

### How It Works

biometal uses **HTTP range requests** (RFC 7233) to fetch only needed data:

```
Client → Server: GET /file.fq.gz
                 Range: bytes=0-65535

Server → Client: 206 Partial Content
                 Content-Range: bytes 0-65535/1048576
                 [64 KB of data]
```
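
The exchange above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not biometal's client code: `range_header` and `parse_content_range` are hypothetical helper names that build the request header and decode the server's `Content-Range` reply.

```python
import re

def range_header(start: int, length: int) -> dict:
    """Build a Range header for `length` bytes starting at `start`.

    HTTP byte ranges are inclusive on both ends, hence the `- 1`.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    return {"Range": f"bytes={start}-{start + length - 1}"}

# "bytes <first>-<last>/<total>", where total may be "*" if unknown
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")

def parse_content_range(value: str):
    """Return (first, last, total) from a Content-Range value; total is None for '*'."""
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised Content-Range: {value!r}")
    first, last = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return first, last, total

print(range_header(0, 65536))                         # {'Range': 'bytes=0-65535'}
print(parse_content_range("bytes 0-65535/1048576"))   # (0, 65535, 1048576)
```

The total after the slash is what lets a client detect end-of-file without fetching the whole object.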

### Key Features

1. **Partial Downloads**: Only fetch blocks you read (not entire file)
2. **Constant Memory**: ~5 MB buffer regardless of file size
3. **LRU Cache**: 50 MB cache reduces redundant fetches
4. **Automatic Retry**: Handle transient network errors
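
To make "byte-bounded LRU" concrete, here is a small sketch of the idea in plain Python. It assumes nothing about biometal's internals: `ByteBoundedLRU` is a hypothetical class that evicts the least recently used blocks once the total cached bytes (not the number of entries) exceed a limit.

```python
from collections import OrderedDict

class ByteBoundedLRU:
    """LRU cache whose capacity is measured in bytes rather than entries."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._blocks = OrderedDict()  # key -> bytes, oldest first

    def get(self, key):
        block = self._blocks.get(key)
        if block is not None:
            self._blocks.move_to_end(key)  # mark as most recently used
        return block

    def put(self, key, block: bytes) -> None:
        if len(block) > self.max_bytes:
            return  # a block larger than the whole cache is never stored
        old = self._blocks.pop(key, None)
        if old is not None:
            self.current_bytes -= len(old)
        self._blocks[key] = block
        self.current_bytes += len(block)
        # Evict least recently used blocks until we fit again
        while self.current_bytes > self.max_bytes:
            _, evicted = self._blocks.popitem(last=False)
            self.current_bytes -= len(evicted)

cache = ByteBoundedLRU(max_bytes=100)
cache.put(0, b"a" * 40)
cache.put(1, b"b" * 40)
cache.get(0)              # touch block 0 so block 1 becomes the oldest
cache.put(2, b"c" * 40)   # 120 bytes > 100, so block 1 is evicted
print(cache.get(1) is None, cache.current_bytes)  # True 80
```

Bounding by bytes keeps memory predictable even when block sizes vary, which is the property the 50 MB figure relies on.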

### Requirements

Server must support:
- ✅ HTTP range requests (206 Partial Content)
- ✅ Content-Length headers
- ❌ Without range support → downloads entire file (fallback)

### Public Data Sources

These support range requests:
- **ENA**: ftp.sra.ebi.ac.uk (via HTTP)
- **NCBI SRA**: sra-pub-run-odp.s3.amazonaws.com
- **AWS S3**: Public buckets
- **Google Cloud Storage**: Public objects
- **Azure Blob Storage**: Public containers

## 2. Basic HTTP Streaming

Let's stream a small FASTQ file through the streaming API.

For this demo, we'll create a local gzipped file and read it with `FastqStream.from_path`, the same call that accepts HTTP URLs. In production, you'd pass a real URL.

In [None]:
# Create test FASTQ data
import gzip

test_fastq = """@read1
ATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read2
GCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGCGC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read3
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
"""

with gzip.open("network_test.fq.gz", "wt") as f:
    f.write(test_fastq)

print("✅ Test FASTQ file created (network_test.fq.gz)")
print(f"   Size: {len(test_fastq)} bytes (uncompressed)")

# Get compressed size
import os
compressed_size = os.path.getsize("network_test.fq.gz")
print(f"   Size: {compressed_size} bytes (compressed)")

In [None]:
# Stream from local file (same API as HTTP)
print("📡 Streaming FASTQ records...\n")

stream = biometal.FastqStream.from_path("network_test.fq.gz")

for idx, record in enumerate(stream, 1):
    print(f"Record {idx}: {record.id}")
    print(f"  Sequence: {record.sequence.decode()[:30]}...")
    print(f"  Length: {len(record.sequence)} bp")
    
    # Analyze
    gc = biometal.gc_content(record.sequence)
    mean_q = biometal.mean_quality(record.quality)
    print(f"  GC content: {gc*100:.1f}%")
    print(f"  Mean quality: Q{mean_q:.1f}")
    print()

print("✅ Streaming complete!")
print("   Memory used: ~5 MB (constant)")
print(f"   File size: {compressed_size} bytes")
print(f"   File is {compressed_size / (5 * 1024 * 1024) * 100:.3f}% of the ~5 MB buffer (savings appear with larger files)")

## 3. Memory Efficiency: Download vs Stream

Let's demonstrate the memory advantage of streaming.

### Traditional Download Approach

```python
# ❌ BAD: Download entire file first
import gzip
import io
import urllib.request

url = "https://example.com/data.fq.gz"  # placeholder URL

# Reads the entire 5 GB response body into memory!
data = urllib.request.urlopen(url).read()  # 5 GB RAM!

# Then process
with gzip.open(io.BytesIO(data)) as f:
    for line in f:
        pass  # ... process each line
```

**Memory**: 5 GB (or more with decompression)

### biometal Streaming Approach

```python
# ✅ GOOD: Stream directly (same call as for local files)
stream = biometal.FastqStream.from_path(url)

for record in stream:
    pass  # Process one record at a time
```

**Memory**: ~5 MB (constant)
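
Why streaming keeps memory flat can be shown without biometal at all. The sketch below uses only the standard library and is an illustration of the principle, not biometal's implementation: `iter_decompressed` is a hypothetical helper that decompresses a gzip stream chunk by chunk, so only one compressed chunk and its decompressed output are held at a time. It also restarts the decoder at member boundaries, since bgzip files consist of many concatenated gzip members.

```python
import gzip
import zlib

def iter_decompressed(chunks):
    """Yield decompressed bytes from an iterable of gzip-compressed chunks."""
    # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
    for chunk in chunks:
        data = chunk
        while data:
            out = decoder.decompress(data)
            if out:
                yield out
            if decoder.eof:
                # End of one gzip member: continue with any leftover bytes
                data = decoder.unused_data
                decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
            else:
                data = b""

# Simulate a network stream: 1,000 FASTQ records delivered in 4 KB chunks
payload = b"".join(b"@read%d\nACGT\n+\nIIII\n" % i for i in range(1000))
compressed = gzip.compress(payload)
chunks = (compressed[i:i + 4096] for i in range(0, len(compressed), 4096))

line_count = sum(block.count(b"\n") for block in iter_decompressed(chunks))
print(line_count // 4)  # 1000
```

Peak memory here depends on the chunk size and compression ratio, not on the total size of the stream.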

### Comparison

| File Size | Download RAM | biometal RAM | Reduction |
|-----------|--------------|--------------|------------|
| 50 MB     | 50 MB        | ~5 MB        | **90%**   |
| 500 MB    | 500 MB       | ~5 MB        | **99%**   |
| 5 GB      | 5 GB         | ~5 MB        | **99.9%** |
| 50 GB     | ❌ OOM       | ~5 MB        | **∞×**    |

**Evidence**: Entry 026 (99.5% memory reduction validated)

In [None]:
# Demonstrate constant memory with larger file
print("🔬 Memory Efficiency Demonstration\n")

# Generate larger test file (1000 reads)
large_fastq = ""
for i in range(1000):
    large_fastq += f"@read{i}\n"
    large_fastq += "ATGC" * 25 + "\n"  # 100 bp, matching the 100 quality characters below
    large_fastq += "+\n"
    large_fastq += "I" * 100 + "\n"

with gzip.open("large_test.fq.gz", "wt") as f:
    f.write(large_fastq)

file_size = os.path.getsize("large_test.fq.gz")
print(f"Created test file:")
print(f"  Records: 1,000")
print(f"  Size: {file_size:,} bytes ({file_size / 1024:.1f} KB)")
print(f"  Uncompressed: {len(large_fastq):,} bytes ({len(large_fastq) / 1024:.1f} KB)\n")

# Stream and measure memory efficiency
print("Streaming analysis:")
stream = biometal.FastqStream.from_path("large_test.fq.gz")

record_count = 0
total_bases = 0

for record in stream:
    record_count += 1
    total_bases += len(record.sequence)

print(f"  Processed: {record_count:,} records")
print(f"  Total bases: {total_bases:,}")
print(f"  File size: {file_size / 1024:.1f} KB")
print(f"  Memory used: ~5 MB (constant)")
print(f"  File is {file_size / (5 * 1024 * 1024) * 100:.1f}% of the ~5 MB buffer; streaming pays off once files exceed it\n")

# Extrapolate to large files
print("📊 Extrapolation to Real Datasets:\n")
for size_gb in [1, 5, 10, 50]:
    size_mb = size_gb * 1024
    print(f"  {size_gb} GB file:")
    print(f"    Download approach: {size_mb} MB RAM")
    print(f"    biometal: ~5 MB RAM")
    print(f"    Savings: {size_mb - 5} MB ({(size_mb - 5) / size_mb * 100:.1f}%)\n")

print("✅ Constant memory regardless of file size!")

## 4. Network Streaming Configuration

biometal's HTTP client is configurable for different network conditions.

### Default Configuration

```python
# Automatic configuration (works for most cases)
stream = biometal.FastqStream.from_path("https://example.com/data.fq.gz")
```

Defaults:
- **Cache size**: 50 MB (LRU, byte-bounded)
- **Chunk size**: 64 KB (65,536 bytes, the bgzip block size limit)
- **Prefetch**: 4 blocks ahead (background)
- **Retries**: 3 attempts (exponential backoff: 100ms → 200ms → 400ms)
- **Timeout**: 30 seconds per request
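
The retry schedule can be sketched as follows. This is an illustration under stated assumptions rather than biometal's code: `fetch_with_retry` is a hypothetical helper, and it interprets the default as up to three retries after the first try, so the waits are 0.1 s, 0.2 s, and 0.4 s. The `sleep` parameter exists only so the demo can record delays instead of actually waiting.

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=0.1, sleep=time.sleep):
    """Call fetch(); on a network-style error, wait base_delay * 2**n and try again."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError:  # ConnectionError and urllib's URLError are OSError subclasses
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))

# Demo: a fetch that fails twice, then succeeds
delays = []
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return b"ok"

result = fetch_with_retry(flaky, sleep=delays.append)
print(result, delays)  # b'ok' [0.1, 0.2]
```

Doubling the delay gives a struggling server or link progressively more time to recover while keeping the total wait short for brief glitches.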

### When to Tune

Adjust for:
- ✅ **Slow connections**: Increase timeout
- ✅ **Fast connections**: Increase prefetch
- ✅ **Limited memory**: Reduce cache size
- ✅ **Flaky networks**: Increase retries

### Evidence-Based Defaults

These values come from:
- **Entry 028**: Network vs compute balance
- **Entry 027**: Block size optimization (10K records)
- **Entry 032**: Cache effectiveness (50 MB threshold)

### Architecture

```
FastqStream
    ↓
HttpReader (buffers 64 KB chunks)
    ↓
HttpClient (manages cache + retries)
    ↓
Network (range requests)
```
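
The reader layer in this stack can be approximated in a few lines. The sketch below is a simplified stand-in, not biometal's `HttpReader`: `ChunkedReader` is a hypothetical class that serves arbitrary `read(n)` calls by pulling fixed-size chunks through a `fetch(start, end)` callable, which in a real client would issue a range request. Here the callable slices an in-memory buffer so the example runs offline.

```python
class ChunkedReader:
    """Serve read(n) calls from fixed-size chunks fetched on demand."""

    def __init__(self, fetch, size, chunk_size=64 * 1024):
        self._fetch = fetch            # fetch(start, end) -> bytes, end inclusive
        self._size = size              # total object size (e.g. from Content-Length)
        self._chunk_size = chunk_size
        self._pos = 0                  # offset of the next byte to fetch
        self._buf = b""                # unread remainder of the current chunk
        self.requests = 0              # number of fetches issued so far

    def read(self, n: int) -> bytes:
        out = bytearray()
        while len(out) < n and (self._buf or self._pos < self._size):
            if not self._buf:
                end = min(self._pos + self._chunk_size, self._size) - 1
                self._buf = self._fetch(self._pos, end)
                self._pos = end + 1
                self.requests += 1
            take = n - len(out)
            out += self._buf[:take]
            self._buf = self._buf[take:]
        return bytes(out)

blob = bytes(range(256)) * 1024  # 256 KB stand-in for a remote file
reader = ChunkedReader(lambda start, end: blob[start:end + 1], size=len(blob))
first = reader.read(100_000)
print(len(first), reader.requests)  # 100000 2
```

Only the current chunk is buffered, so the reader's footprint is set by `chunk_size` rather than by the size of the remote object.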

In [None]:
# Demonstrate streaming configuration
print("⚙️  Network Streaming Configuration\n")

print("Default Configuration:")
print("  • Cache: 50 MB (LRU, byte-bounded)")
print("  • Chunk: 64 KB (bgzip block size)")
print("  • Prefetch: 4 blocks ahead")
print("  • Retries: 3 attempts")
print("  • Timeout: 30 seconds")
print("  • Backoff: 100ms → 200ms → 400ms\n")

print("Memory Breakdown:")
print("  • Stream buffer: ~5 MB")
print("  • Cache: 50 MB (configurable)")
print("  • Total: ~55 MB (constant)\n")

print("Performance Features:")
print("  1. Range Requests: Fetch only needed bytes")
print("  2. LRU Cache: Byte-bounded eviction")
print("  3. Background Prefetch: Hide network latency")
print("  4. Automatic Retry: Handle transient failures")
print("  5. EOF Detection: HEAD request for Content-Length\n")

print("Server Requirements:")
print("  ✅ HTTP range requests (206 Partial Content)")
print("  ✅ Content-Length headers")
print("  ✅ Accept-Ranges: bytes")
print("  ❌ Without range support: Falls back to full download\n")

print("✅ Optimized for cloud storage (S3, GCS, Azure) and FTP mirrors!")

## 5. Complete Network Streaming Pipeline

Let's combine network streaming with QC operations from notebook 02.

### Workflow

1. **Stream** from HTTP (no download)
2. **QC filter** (quality, length)
3. **Analyze** (GC content, base composition)
4. **Constant memory** (~5 MB)

This workflow works on datasets of ANY size!
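
The quality filter in this workflow is expressed in Phred scores, where Q = -10 · log10(p) and p is the probability that a base call is wrong. The sketch below shows the conversion and a plain arithmetic mean over a Phred+33 quality string. Both helper names are hypothetical, and whether `biometal.mean_quality` averages in exactly this way is an assumption.

```python
def phred_to_error(q: float) -> float:
    """Error probability for a Phred score: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def mean_phred(quality: bytes, offset: int = 33) -> float:
    """Arithmetic mean of Phred scores in a Phred+33 encoded quality string."""
    return sum(byte - offset for byte in quality) / len(quality)

print(f"{phred_to_error(20):.4f}")  # 0.0100
print(mean_phred(b"IIII"))          # 40.0
```

So Q20 corresponds to a 1% error rate (99% accuracy), and the `I` characters used in this notebook's test data encode Q40.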

In [None]:
def network_streaming_pipeline(
    fastq_path,
    min_quality=20,
    min_length=50,
    max_length=150
):
    """
    Complete QC pipeline with network streaming.
    
    Works with local files or HTTP URLs - same API!
    Memory: Constant ~5 MB regardless of file size.
    
    Args:
        fastq_path: Local path or HTTP URL
        min_quality: Minimum mean quality score (Q20 = 99%)
        min_length: Minimum read length after trimming
        max_length: Maximum read length (filter chimeras)
    
    Yields:
        (record, status, reason) tuples
    """
    # Stream from any source (local, HTTP, future: SRA)
    stream = biometal.FastqStream.from_path(fastq_path)
    
    for record in stream:
        # Step 1: Quality filter
        mean_q = biometal.mean_quality(record.quality)
        if mean_q < min_quality:
            yield (record, "FAIL", f"quality_Q{mean_q:.1f}")
            continue
        
        # Step 2: Length filter
        read_length = len(record.sequence)
        if not biometal.meets_length_requirement(record, min_length, max_length):
            yield (record, "FAIL", f"length_{read_length}bp")
            continue
        
        # Step 3: Pass
        yield (record, "PASS", "quality_ok")

# Run pipeline on our test data
print("🔬 Network Streaming Pipeline\n")
print("Configuration:")
print("  Min quality: Q20 (99%)")
print("  Length: 50-150 bp")
print("  Memory: ~5 MB (constant)\n")

results = {"PASS": 0, "FAIL": 0}
reasons = {}

for record, status, reason in network_streaming_pipeline("large_test.fq.gz"):
    results[status] += 1
    reasons[reason] = reasons.get(reason, 0) + 1

print("Results:")
print(f"  PASS: {results['PASS']:,} reads")
print(f"  FAIL: {results['FAIL']:,} reads")
print(f"  Pass rate: {results['PASS'] / sum(results.values()) * 100:.1f}%\n")

print("Failure reasons:")
for reason, count in sorted(reasons.items(), key=lambda x: x[1], reverse=True):
    if reason != "quality_ok":
        print(f"  {reason}: {count:,} reads")

print(f"\n✅ Pipeline complete!")
print(f"   Memory: ~5 MB (constant)")
print(f"   Works with HTTP URLs - same code!")

## 6. Public Genomics Data

Many public repositories provide HTTP access to genomics data.

### European Nucleotide Archive (ENA)

ENA provides direct FASTQ access via FTP/HTTP:

```python
# Example: E. coli from ENA (HTTPS endpoint, so range requests apply)
url = "https://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000589/ERR000589.fastq.gz"
stream = biometal.FastqStream.from_path(url)
```

### Cloud Storage Buckets

Many datasets are hosted on cloud storage:

```python
# AWS S3 public bucket
url = "https://s3.amazonaws.com/bucket/path/sample.fq.gz"

# Google Cloud Storage
url = "https://storage.googleapis.com/bucket/sample.fq.gz"

# Azure Blob Storage
url = "https://account.blob.core.windows.net/container/sample.fq.gz"
```

### Finding Public Data

Resources:
- **ENA**: https://www.ebi.ac.uk/ena/browser/
- **NCBI SRA**: https://www.ncbi.nlm.nih.gov/sra
- **1000 Genomes**: https://www.internationalgenome.org/data-portal
- **Registry of Open Data (AWS)**: https://registry.opendata.aws/

### Checking Range Support

Test if a server supports range requests:

```bash
curl -I -H "Range: bytes=0-1023" URL
```

Look for:
- ✅ `206 Partial Content` (supports ranges)
- ❌ `200 OK` (ignores range header, downloads full file)
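
The same check can be done from Python with the standard library. This is a rough sketch: `classify_range_response` and `probe` are hypothetical helper names, and unlike `curl -I` the probe sends a GET with a `Range` header and closes the connection without reading the body.

```python
import urllib.request

def classify_range_response(status: int, headers: dict) -> str:
    """Interpret the status line and headers returned for a ranged request."""
    names = {key.lower() for key in headers}
    if status == 206 and "content-range" in names:
        return "ranges supported"
    if status == 200:
        return "range ignored (full body)"
    return f"unexpected status {status}"

def probe(url: str, timeout: float = 30.0) -> str:
    """Request the first 1 KB of `url` and report how the server responded."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-1023"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        headers = dict(response.headers.items())
        return classify_range_response(response.status, headers)

print(classify_range_response(206, {"Content-Range": "bytes 0-1023/5000"}))
print(classify_range_response(200, {"Content-Length": "5000"}))
```

Calling `probe("https://example.com/data.fq.gz")` with a real URL in place of the placeholder would return one of the three strings above.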

## 7. SRA (Sequence Read Archive) Concepts

### What is SRA?

The **Sequence Read Archive** (SRA) is NCBI's repository for high-throughput sequencing data.

### Accession Types

| Type | Description | Example |
|------|-------------|----------|
| **SRR** | Run | SRR390728 (most common) |
| **SRX** | Experiment | SRX012345 |
| **SRS** | Sample | SRS123456 |
| **SRP** | Project/Study | SRP001234 |
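
Since only run accessions point at sequence files, it can help to check the accession type before building a URL. The helper below is a hypothetical sketch, not part of biometal. It assumes the usual INSDC convention that the first letter names the archive (`S` for NCBI, `E` for ENA, `D` for DDBJ) and the third letter names the object type.

```python
import re

ACCESSION_KINDS = {"R": "run", "X": "experiment", "S": "sample", "P": "study"}

# Archive letter, "R", type letter, then at least six digits
_ACCESSION = re.compile(r"[SED]R([RXSP])\d{6,}")

def accession_kind(accession: str) -> str:
    """Return 'run', 'experiment', 'sample', or 'study' for an SRA-style accession."""
    match = _ACCESSION.fullmatch(accession)
    if match is None:
        raise ValueError(f"not an SRA-style accession: {accession!r}")
    return ACCESSION_KINDS[match.group(1)]

for acc in ["SRR390728", "SRX012345", "SRS123456", "SRP001234"]:
    print(acc, accession_kind(acc))
```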

### Current Limitation

⚠️ **Important**: SRA files use **NCBI's proprietary binary format**, not FASTQ.

biometal can generate SRA URLs (the `sra_to_url` helper is not yet exposed to Python, so the call below shows the intended usage):
```python
url = biometal.sra_to_url("SRR390728")
# Returns: https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR390728/SRR390728
```

But the file is in SRA format, not FASTQ. Direct FASTQ streaming from SRA requires the **SRA Toolkit** to decode the format.

### Workaround: Use ENA

ENA (European mirror of SRA) provides FASTQ files:

```python
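# Convert SRR accession to ENA FASTQ URL (HTTPS, so range requests apply)
# This layout holds for six-digit accessions; longer ones add a subdirectory,
# and paired-end runs are split into _1/_2 files.
srr = "SRR390728"
ena_url = f"https://ftp.sra.ebi.ac.uk/vol1/fastq/{srr[:6]}/{srr}/{srr}.fastq.gz"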
```
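
The one-line pattern above covers six-digit run accessions. A slightly more general sketch follows, based on ENA's published directory layout as I understand it: runs with more than six digits get an extra zero-padded subdirectory built from the digits beyond the sixth. `ena_fastq_dir` and `ena_fastq_candidates` are hypothetical helpers, and serving these paths over HTTPS is an assumption worth verifying for your dataset.

```python
import re

ENA_FASTQ_ROOT = "https://ftp.sra.ebi.ac.uk/vol1/fastq"

def ena_fastq_dir(run: str) -> str:
    """Return the ENA directory URL that holds the FASTQ files for a run accession."""
    if not re.fullmatch(r"[SED]RR\d{6,9}", run):
        raise ValueError(f"not a run accession: {run!r}")
    digits = run[3:]
    parts = [ENA_FASTQ_ROOT, run[:6]]
    if len(digits) > 6:
        # e.g. ERR1234567 -> "007": digits past the sixth, zero-padded to three
        parts.append(digits[6:].zfill(3))
    parts.append(run)
    return "/".join(parts)

def ena_fastq_candidates(run: str) -> list:
    """Candidate file URLs: single-end name first, then the paired-end _1/_2 names."""
    base = ena_fastq_dir(run)
    return [f"{base}/{run}{suffix}.fastq.gz" for suffix in ("", "_1", "_2")]

print(ena_fastq_dir("SRR390728"))
print(ena_fastq_dir("ERR1234567"))
```

Which of the candidate files actually exists depends on the run's library layout, so a range probe or the ENA browser should confirm the URL before streaming.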

### Future Work

biometal is evaluating:
1. **SRA Toolkit wrapper**: Shell out to `fastq-dump` (500-1,000 LOC)
2. **VDB native decoder**: Pure Rust implementation (5,000-10,000 LOC)

A recent experiment (sra-decoder) found that the SRA Toolkit wrapper is the pragmatic choice:
- ✅ Lower complexity
- ✅ NCBI maintains format
- ✅ Proven reliability
- ❌ External dependency

See: `experiments/sra-decoder/FINDINGS.md` for analysis

In [None]:
# SRA URL generation (works, but returns SRA binary format)
print("🔗 SRA URL Generation\n")

accessions = ["SRR390728", "SRR000001", "SRX012345"]

print("NCBI SRA S3 URLs:")
for acc in accessions:
    # This would work if biometal.sra_to_url was exposed to Python
    # For now, show the pattern
    url = f"https://sra-pub-run-odp.s3.amazonaws.com/sra/{acc}/{acc}"
    print(f"  {acc} → {url}")

print("\n⚠️  These URLs return SRA binary format, not FASTQ.")
print("   Use ENA for direct FASTQ access instead.\n")

# ENA workaround
print("ENA FASTQ URLs (recommended):")
srr = "SRR390728"
# ENA path: /vol1/fastq/SRR390/SRR390728/SRR390728.fastq.gz
ena_base = "https://ftp.sra.ebi.ac.uk/vol1/fastq"
ena_url = f"{ena_base}/{srr[:6]}/{srr}/{srr}.fastq.gz"
print(f"  {srr} → {ena_url}")

print("\n✅ ENA provides FASTQ files that biometal can stream directly!")
print("   Same API as local files - just pass the URL.")

## 8. Production Example: Remote QC Pipeline

Let's build a complete pipeline that works with any data source.

In [None]:
def remote_qc_analysis(source_path, max_records=None):
    """
    Complete QC analysis from any source (local, HTTP, future: SRA).
    
    Demonstrates:
    - Network streaming (constant memory)
    - QC operations (from notebook 02)
    - Statistics aggregation
    - Production-ready code
    
    Args:
        source_path: Local file path or HTTP URL
        max_records: Limit processing (for demo)
    
    Returns:
        Dictionary with QC statistics
    """
    print(f"📡 Streaming from: {source_path}")
    print(f"   Memory: ~5 MB (constant)\n")
    
    stream = biometal.FastqStream.from_path(source_path)
    
    # Statistics
    stats = {
        "total_reads": 0,
        "passed_reads": 0,
        "failed_reads": 0,
        "total_bases": 0,
        "total_gc": 0.0,
        "total_quality": 0.0,
        "base_counts": [0, 0, 0, 0],  # A, C, G, T
        "failure_reasons": {},
    }
    
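    for record in stream:
        # Enforce the limit here too: records that fail QC skip the check at the end of the loop
        if max_records is not None and stats["total_reads"] >= max_records:
            break
        stats["total_reads"] += 1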
        
        # QC checks
        mean_q = biometal.mean_quality(record.quality)
        read_len = len(record.sequence)
        
        # Quality filter
        if mean_q < 20:
            stats["failed_reads"] += 1
            reason = f"low_quality_Q{mean_q:.1f}"
            stats["failure_reasons"][reason] = stats["failure_reasons"].get(reason, 0) + 1
            continue
        
        # Length filter
        if not biometal.meets_length_requirement(record, 50, 150):
            stats["failed_reads"] += 1
            reason = f"length_{read_len}bp"
            stats["failure_reasons"][reason] = stats["failure_reasons"].get(reason, 0) + 1
            continue
        
        # Pass - collect statistics
        stats["passed_reads"] += 1
        stats["total_bases"] += read_len
        stats["total_quality"] += mean_q
        
        # GC content
        gc = biometal.gc_content(record.sequence)
        stats["total_gc"] += gc
        
        # Base composition
        bases = biometal.count_bases(record.sequence)
        for i, count in enumerate(bases):
            stats["base_counts"][i] += count
        
        # Limit for demo
        if max_records and stats["total_reads"] >= max_records:
            print(f"\n⚠️  Reached limit of {max_records} records\n")
            break
    
    return stats

# Run analysis
print("🔬 Complete Remote QC Analysis\n")

stats = remote_qc_analysis("large_test.fq.gz", max_records=1000)

print("\n📊 QC Summary:\n")
print(f"Reads processed:    {stats['total_reads']:>8,}")
print(f"Passed QC:          {stats['passed_reads']:>8,} ({stats['passed_reads']/stats['total_reads']*100:.1f}%)")
print(f"Failed QC:          {stats['failed_reads']:>8,} ({stats['failed_reads']/stats['total_reads']*100:.1f}%)\n")

print(f"Total bases:        {stats['total_bases']:>8,}")
print(f"Average length:     {stats['total_bases']/stats['passed_reads']:>8.1f} bp\n")

avg_gc = (stats['total_gc'] / stats['passed_reads']) * 100
avg_q = stats['total_quality'] / stats['passed_reads']
print(f"Average GC:         {avg_gc:>8.1f}%")
print(f"Average quality:    {avg_q:>8.1f} (Q{avg_q:.0f})\n")

total_bases = sum(stats['base_counts'])
print("Base composition:")
for base, count in zip(['A', 'C', 'G', 'T'], stats['base_counts']):
    pct = (count / total_bases) * 100
    print(f"  {base}: {count:>10,} ({pct:>5.1f}%)")

if stats['failure_reasons']:
    print("\nFailure reasons:")
    for reason, count in sorted(stats['failure_reasons'].items(), key=lambda x: x[1], reverse=True):
        print(f"  {reason}: {count:>6,} reads")

print("\n✅ Analysis complete!")
print("   Memory: ~5 MB (constant)")
print("   This exact code works with HTTP URLs!")

## Key Takeaways

✅ **HTTP Streaming**: Analyze without downloading (constant ~5 MB memory)  
✅ **Range Requests**: Fetch only needed data (not entire file)  
✅ **Public Data**: ENA, cloud storage provide FASTQ access  
✅ **Production Ready**: Same API for local files and URLs  
✅ **SRA Limitation**: SRA files need toolkit decoder (future work)  
✅ **Evidence-Based**: Entry 028 (I/O 264-352× bottleneck)  

## Network Streaming Benefits

| Benefit | Impact |
|---------|--------|
| **No download wait** | Start immediately |
| **No disk space** | Save 50-500 GB per dataset |
| **Constant memory** | ~5 MB regardless of size |
| **Works on laptops** | Analyze TB-scale data |
| **Democratization** | LMIC researchers, students |

## What's Next?

You've completed the intermediate tutorials! Next steps:

**→ [05_complete_pipeline.ipynb](05_complete_pipeline.ipynb)** (Coming soon)
- Combine all techniques (streaming + QC + k-mers)
- Production pipeline example
- Performance optimization

Or revisit:
- **02_quality_control_pipeline.ipynb**: QC operations
- **03_kmer_analysis.ipynb**: K-mer extraction for ML
- **01_getting_started.ipynb**: Review basics

---

## Exercises

Try these on your own:

1. **Find public data**: Search ENA for a dataset in your field
2. **Stream analysis**: Analyze from URL without downloading
3. **Compare memory**: Measure RAM for download vs stream
4. **Performance tuning**: Try different cache sizes
5. **Complete pipeline**: Combine streaming + QC + k-mers

---

## Resources

- **ENA Browser**: https://www.ebi.ac.uk/ena/browser/
- **NCBI SRA**: https://www.ncbi.nlm.nih.gov/sra
- **Registry of Open Data**: https://registry.opendata.aws/
- **biometal Docs**: https://docs.rs/biometal
- **Evidence Base**: Entry 028 (I/O bottleneck)

---

**biometal v1.0.0** - Network streaming with constant memory