# FASTA Parser - Comprehensive Tutorial

This notebook walks through the features of the FASTA parser toolkit, from basic parsing to visualization and quality-control workflows.

## Features Covered:
- Parsing compressed and uncompressed FASTA files
- Calculating statistics (N50, GC content, etc.)
- Filtering and sorting sequences
- Creating visualizations
- Real-world bioinformatics workflows

---

## 1. Setup and Installation

First, ensure you have the required dependencies installed:

In [None]:
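# Install required packages (uncomment if needed).
# Note: --break-system-packages overrides pip's externally-managed-environment
# guard (PEP 668); prefer a virtual environment where one is available.
# !pip install matplotlib numpy seaborn --break-system-packages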

# Import the FASTA parser
from fasta_parser import FastaParser, FastaSequence
import matplotlib.pyplot as plt
import numpy as np

# Set up matplotlib for notebook
%matplotlib inline

print("✓ All imports successful!")

## 2. Generate Sample Data

Let's create a sample FASTA file to work with:

In [None]:
import random
import gzip

# Fix the seed so the generated sample files are reproducible between runs
random.seed(42)

def generate_random_sequence(length):
    """Generate a random DNA sequence."""
    return ''.join(random.choices('ATGC', k=length))

# Create a sample FASTA file
with open('notebook_sample.fasta', 'w') as f:
    for i in range(50):
        # Vary sequence lengths
        if i < 5:
            length = random.randint(5000, 15000)  # Long sequences
        elif i < 15:
            length = random.randint(1000, 5000)   # Medium sequences
        else:
            length = random.randint(100, 1000)    # Short sequences
        
        seq = generate_random_sequence(length)
        header = f"sequence_{i+1} length={length} sample_data"
        
        f.write(f">{header}\n")
        for j in range(0, len(seq), 80):
            f.write(seq[j:j+80] + '\n')

# Also create a separate gzip-compressed file (20 different sequences, not a copy of the file above)
with gzip.open('notebook_sample.fasta.gz', 'wt') as f:
    for i in range(20):
        length = random.randint(500, 5000)
        seq = generate_random_sequence(length)
        header = f"compressed_seq_{i+1} length={length}"
        f.write(f">{header}\n")
        for j in range(0, len(seq), 80):
            f.write(seq[j:j+80] + '\n')

print("✓ Sample FASTA files created:")
print("  - notebook_sample.fasta (50 sequences)")
print("  - notebook_sample.fasta.gz (20 sequences, compressed)")

## 3. Basic Parsing

Let's parse a FASTA file and explore the basic functionality:

In [None]:
# Parse the FASTA file
fasta = FastaParser()
fasta.parse('notebook_sample.fasta')

print(f"Loaded {len(fasta)} sequences\n")

# Access individual sequences
first_seq = fasta[0]
print("First sequence details:")
print(f"  Header: {first_seq.header}")
print(f"  ID: {first_seq.id}")
print(f"  Length: {first_seq.length:,} bp")
print(f"  GC Content: {first_seq.gc_content:.2f}%")
print(f"  First 60 bp: {first_seq.sequence[:60]}...")
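
For reference, GC content is conventionally the percentage of bases that are G or C. The following is a minimal sketch of that calculation, independent of `fasta_parser`; the function name `gc_content_percent` is hypothetical, and counting every character (including `N`) in the denominator is an assumption that may differ from what `FastaSequence.gc_content` does.

```python
def gc_content_percent(sequence: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence.

    Assumption: every character counts toward the denominator, including N.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    gc = upper.count("G") + upper.count("C")
    return 100.0 * gc / len(upper)


print(gc_content_percent("ATGC"))      # 2 of 4 bases are G or C
print(gc_content_percent("ggccAATT"))  # lowercase input is handled too
```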

## 4. Comprehensive Statistics

Generate detailed statistics about the FASTA file:

In [None]:
# Print comprehensive statistics
fasta.print_statistics()
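
One of the reported statistics, N50, is the length L such that sequences of length L or longer together contain at least half of all bases. The sketch below illustrates that definition under stated assumptions: the function name `n50` is hypothetical, and the parser's own implementation may treat ties or empty input differently.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running total covers half the bases
        if running >= half_total:
            return length
    return 0


# Total is 54 bases; 10 + 9 + 8 = 27 reaches the halfway point at length 8
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```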

## 5. Sequence Iteration

The parser is iterable, making it easy to loop through sequences:

In [None]:
# Iterate through sequences
print("Top 10 sequences by length:\n")
for i, seq in enumerate(sorted(fasta, key=lambda x: x.length, reverse=True)[:10]):
    print(f"{i+1:2d}. {seq.id:20s} {seq.length:7,} bp  GC: {seq.gc_content:5.2f}%")

## 6. Filtering Sequences

Filter sequences by length or other criteria:

In [None]:
# Filter sequences by length
print(f"Original: {len(fasta)} sequences\n")

# Re-parse the file so that filtering does not modify the original `fasta` object
fasta_filtered = FastaParser().parse('notebook_sample.fasta')
fasta_filtered.filter_by_length(min_length=1000, max_length=10000)

print(f"After filtering (1000-10000 bp): {len(fasta_filtered)} sequences")
print(f"Total bases: {sum(s.length for s in fasta_filtered):,}")

# Show some filtered sequences
print("\nFiltered sequences:")
for seq in fasta_filtered[:5]:
    print(f"  {seq.id}: {seq.length:,} bp")

## 7. Sorting Sequences

Sort sequences by different criteria:

In [None]:
# Sort by length (descending)
fasta_by_length = FastaParser().parse('notebook_sample.fasta')
fasta_by_length.sort_sequences(key='length', reverse=True)

print("Sorted by LENGTH (longest first):")
for seq in fasta_by_length[:5]:
    print(f"  {seq.id:20s} {seq.length:7,} bp")

# Sort by GC content (descending)
fasta_by_gc = FastaParser().parse('notebook_sample.fasta')
fasta_by_gc.sort_sequences(key='gc_content', reverse=True)

print("\nSorted by GC CONTENT (highest first):")
for seq in fasta_by_gc[:5]:
    print(f"  {seq.id:20s} {seq.gc_content:5.2f}%")

# Sort by ID (alphabetical)
fasta_by_id = FastaParser().parse('notebook_sample.fasta')
fasta_by_id.sort_sequences(key='id', reverse=False)

print("\nSorted by ID (alphabetical):")
for seq in fasta_by_id[:5]:
    print(f"  {seq.id}")

## 8. Method Chaining

The parser supports fluent method chaining:

In [None]:
# Method chaining example
result = (FastaParser()
    .parse('notebook_sample.fasta')
    .filter_by_length(min_length=2000, max_length=8000)
    .sort_sequences(key='gc_content', reverse=True))

print(f"Chained operations resulted in {len(result)} sequences")
print("\nTop 3 by GC content (2-8kb range):")
for i, seq in enumerate(result[:3]):
    print(f"  {i+1}. {seq.id}: {seq.length:,} bp, GC={seq.gc_content:.2f}%")
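
Fluent chaining of this kind usually works because each method returns the object it was called on. The sketch below shows the pattern with a hypothetical stand-in class, `MiniCollection`, that stores `(id, length)` tuples; it is not the `FastaParser` implementation.

```python
class MiniCollection:
    """Hypothetical stand-in for a parser that stores (id, length) records."""

    def __init__(self, records):
        self.records = list(records)

    def filter_by_length(self, min_length=0, max_length=float("inf")):
        # Mutate in place, then return self so another call can follow
        self.records = [r for r in self.records
                        if min_length <= r[1] <= max_length]
        return self

    def sort_by_length(self, reverse=False):
        self.records.sort(key=lambda r: r[1], reverse=reverse)
        return self

    def __len__(self):
        return len(self.records)


result = (MiniCollection([("a", 500), ("b", 3000), ("c", 7000), ("d", 9000)])
          .filter_by_length(min_length=2000, max_length=8000)
          .sort_by_length(reverse=True))
print(result.records)
```

With this design, chained calls alter the object they start from, which is presumably why the earlier cells re-parse the file before filtering or sorting.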

## 9. Compressed File Handling

The parser reads gzip-compressed (`.gz`) files through the same `parse()` call:

In [None]:
# Parse compressed file
fasta_compressed = FastaParser().parse('notebook_sample.fasta.gz')

print(f"Loaded {len(fasta_compressed)} sequences from compressed file")
print(f"Total bases: {sum(s.length for s in fasta_compressed):,}")

# Statistics work the same way
lengths = [s.length for s in fasta_compressed]
print(f"\nLength range: {min(lengths):,} - {max(lengths):,} bp")
print(f"Mean length: {np.mean(lengths):,.0f} bp")
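
Transparent handling of compressed input can be approximated with the standard library alone. The sketch below assumes detection by the two-byte gzip magic number; the helper names `open_fasta_text` and `count_records` are hypothetical, and how `FastaParser` detects compression is not stated in this notebook.

```python
import gzip
import os
import tempfile


def open_fasta_text(path):
    """Open a FASTA file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"  # gzip magic number
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "r")


def count_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open_fasta_text(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo: write the same two records plain and compressed, then count both
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "demo.fasta")
    gz_path = os.path.join(tmp, "demo.fasta.gz")
    text = ">seq1\nATGC\n>seq2\nGGCC\n"
    with open(plain_path, "w") as f:
        f.write(text)
    with gzip.open(gz_path, "wt") as f:
        f.write(text)
    plain_count = count_records(plain_path)
    gz_count = count_records(gz_path)

print(plain_count, gz_count)
```

Checking the magic number rather than the `.gz` suffix also covers compressed files that have been renamed.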

## 10. Comprehensive Visualization

Generate a 6-panel visualization showing various analyses:

In [None]:
# Create comprehensive visualization
fasta.visualize(output_file='notebook_analysis.png')

# The visualization is also displayed inline in notebooks
print("✓ Visualization created and saved to notebook_analysis.png")

## 11. Custom Analysis - GC Content Distribution

Let's perform some custom analysis using the parsed data:

In [None]:
import matplotlib.pyplot as plt

# Analyze GC content distribution
gc_contents = [seq.gc_content for seq in fasta]
lengths = [seq.length for seq in fasta]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# GC content histogram
ax1.hist(gc_contents, bins=20, color='seagreen', edgecolor='black', alpha=0.7)
ax1.axvline(np.mean(gc_contents), color='red', linestyle='--', linewidth=2, label=f'Mean: {np.mean(gc_contents):.1f}%')
ax1.axvline(np.median(gc_contents), color='blue', linestyle='--', linewidth=2, label=f'Median: {np.median(gc_contents):.1f}%')
ax1.set_xlabel('GC Content (%)', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('GC Content Distribution', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(alpha=0.3)

# Length vs GC scatter with regression line
ax2.scatter(lengths, gc_contents, alpha=0.6, s=80, c=gc_contents, cmap='viridis', edgecolors='black', linewidth=0.5)
z = np.polyfit(lengths, gc_contents, 1)
p = np.poly1d(z)
# Use the '+' format flag so a negative intercept is not rendered as "+-"
ax2.plot(lengths, p(lengths), "r--", alpha=0.8, linewidth=2, label=f'Trend: y={z[0]:.2e}x{z[1]:+.2f}')
ax2.set_xlabel('Sequence Length (bp)', fontsize=12)
ax2.set_ylabel('GC Content (%)', fontsize=12)
ax2.set_title('Length vs GC Content Correlation', fontsize=14, fontweight='bold')
ax2.legend()
ax2.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('custom_gc_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

# Calculate correlation
correlation = np.corrcoef(lengths, gc_contents)[0, 1]
print(f"Correlation between length and GC content: {correlation:.3f}")
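
The value returned by `np.corrcoef` is the Pearson correlation coefficient. As a rough illustration of what that number measures, here is a standard-library sketch; the function name `pearson_r` is hypothetical, and for real data `np.corrcoef` remains the more convenient choice.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly linear, increasing
print(pearson_r([1, 2, 3], [3, 2, 1]))  # perfectly linear, decreasing
```

Because the sample sequences here are random, a coefficient near zero is the expected outcome.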

## 12. Writing Sorted Output

Write sorted sequences to new FASTA files:

In [None]:
# Sort by length and write to file
fasta_to_write = FastaParser().parse('notebook_sample.fasta')
fasta_to_write.sort_sequences(key='length', reverse=True)
fasta_to_write.write('sorted_by_length.fasta')

print("✓ Written sorted_by_length.fasta")

# Also write compressed version
fasta_to_write.write('sorted_by_length.fasta.gz')
print("✓ Written sorted_by_length.fasta.gz (compressed)")

# Verify by reading back
verify = FastaParser().parse('sorted_by_length.fasta')
print(f"\nVerification: Read back {len(verify)} sequences")
print(f"First sequence length: {verify[0].length:,} bp (should be longest)")
print(f"Last sequence length: {verify[-1].length:,} bp (should be shortest)")
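
For comparison, writing FASTA by hand takes only a few lines. The sketch below wraps sequence lines at a fixed width, mirroring the 80-column wrapping used when the sample files were generated; the function name `write_fasta` is hypothetical, and whether `FastaParser.write()` wraps lines the same way is not stated here.

```python
import io


def write_fasta(records, handle, width=80):
    """Write (header, sequence) pairs to a text handle, wrapping sequence lines."""
    for header, sequence in records:
        handle.write(f">{header}\n")
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Demo with an in-memory buffer and a narrow width to make the wrapping visible
buffer = io.StringIO()
write_fasta([("seq1 demo", "ATGCATGCAT")], buffer, width=4)
print(buffer.getvalue())
```

Any text-mode file object works as `handle`, including one returned by `gzip.open(path, 'wt')`.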

## 13. Sequence Retrieval by ID

Find specific sequences by their ID:

In [None]:
# Get sequence by ID
seq_id = 'sequence_1'
seq = fasta.get_sequence_by_id(seq_id)

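# Assumes get_sequence_by_id() returns None when the ID is missing; an explicit
# None check avoids treating an empty (falsy) sequence object as "not found"
if seq is not None: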
    print(f"Found sequence: {seq_id}")
    print(f"  Length: {seq.length:,} bp")
    print(f"  GC Content: {seq.gc_content:.2f}%")
    print(f"  First 80 bp: {seq.sequence[:80]}")
else:
    print(f"Sequence {seq_id} not found")

# Search for sequences matching a pattern
print("\nAll sequences with 'sequence_1' in ID:")
matches = [s for s in fasta if 'sequence_1' in s.id]
for s in matches:
    print(f"  {s.id}: {s.length:,} bp")
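
When many lookups are needed, a dictionary keyed by ID avoids scanning the whole collection each time. The sketch below assumes the ID is the first whitespace-delimited token of the header, which matches the `sequence_1` example above but is not confirmed as the parser's rule; `build_id_index` is a hypothetical helper.

```python
def build_id_index(headers):
    """Map each sequence ID (first whitespace-delimited token) to its header."""
    index = {}
    for header in headers:
        parts = header.split(None, 1)
        if not parts:
            continue  # skip empty headers
        # Keep the first occurrence if an ID appears more than once
        index.setdefault(parts[0], header)
    return index


headers = [
    "sequence_1 length=120 sample_data",
    "sequence_10 length=300 sample_data",
]
index = build_id_index(headers)
print(index.get("sequence_1"))
print(index.get("sequence_99"))  # missing IDs yield None
```

Unlike the substring search above, an exact-key lookup for `sequence_1` does not also match `sequence_10`.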

## 14. Real-World Workflow: Assembly QC

A practical example of assembly quality control:

In [None]:
def assembly_qc(fasta_parser):
    """Perform quality control checks on an assembly."""
    print("=" * 70)
    print("ASSEMBLY QUALITY CONTROL")
    print("=" * 70)
    
    lengths = [s.length for s in fasta_parser]
    gc_contents = [s.gc_content for s in fasta_parser]
    
    # Basic metrics
    print(f"\nBasic Metrics:")
    print(f"  Total sequences: {len(fasta_parser):,}")
    print(f"  Total bases: {sum(lengths):,}")
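    # _calculate_n50 is a private helper (leading underscore); prefer a public
    # statistics accessor if the parser provides one
    print(f"  N50: {fasta_parser._calculate_n50(lengths):,} bp")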
    
    # Quality checks
    print("\nQuality Checks:")
    
    # Check for short contigs
    short = sum(1 for length in lengths if length < 500)
    if short > 0:
        print(f"  ⚠ {short} sequences < 500 bp ({short/len(fasta_parser)*100:.1f}%)")
    else:
        print("  ✓ No sequences < 500 bp")
    
    # Check for extreme GC content
    extreme_gc = sum(1 for gc in gc_contents if gc < 20 or gc > 80)
    if extreme_gc > 0:
        print(f"  ⚠ {extreme_gc} sequences with extreme GC content")
    else:
        print(f"  ✓ All sequences have reasonable GC content")
    
    # Assembly fragmentation
    if len(fasta_parser) > 1000:
        print(f"  ⚠ Highly fragmented assembly ({len(fasta_parser):,} sequences)")
    else:
        print(f"  ✓ Assembly fragmentation is reasonable")
    
    # Length distribution
    print(f"\nLength Distribution:")
    print(f"  Longest: {max(lengths):,} bp")
    print(f"  Shortest: {min(lengths):,} bp")
    print(f"  Mean: {np.mean(lengths):,.0f} bp")
    print(f"  Median: {np.median(lengths):,.0f} bp")
    
    print("\n" + "=" * 70)

# Run QC on our sample data
assembly_qc(fasta)

## 15. Real-World Workflow: Size Selection

Create size-selected sequence libraries:

In [None]:
def create_size_libraries(input_file):
    """Create size-selected libraries."""
    print("Creating size-selected libraries...\n")
    
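    # Note: if filter_by_length() treats both bounds as inclusive, a sequence of
    # exactly 500, 2,000, or 5,000 bp is written to two adjacent libraries
    bins = [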
        (0, 500, "very_short"),
        (500, 2000, "short"),
        (2000, 5000, "medium"),
        (5000, float('inf'), "long")
    ]
    
    for min_len, max_len, name in bins:
        lib = FastaParser().parse(input_file)
        lib.filter_by_length(min_length=min_len, max_length=max_len)
        
        if len(lib) > 0:
            filename = f"{name}_library.fasta"
            lib.write(filename)
            
            print(f"{name.upper()} ({min_len:,}-{max_len:,} bp):")
            print(f"  Sequences: {len(lib):,}")
            print(f"  Total bases: {sum(s.length for s in lib):,}")
            print(f"  File: {filename}")
            print()

create_size_libraries('notebook_sample.fasta')
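
An alternative that avoids re-parsing the file for each bin is to assign every sequence to exactly one bin in a single pass. The sketch below uses half-open intervals `[low, high)` with `bisect`, so a length that falls exactly on an edge belongs to one bin only; `EDGES`, `NAMES`, `assign_bin`, and `bin_lengths` are hypothetical names, and the sketch works on plain lengths rather than parser objects.

```python
from bisect import bisect_right

# Bin edges matching the libraries above: <500, 500-1999, 2000-4999, >=5000
EDGES = [500, 2000, 5000]
NAMES = ["very_short", "short", "medium", "long"]


def assign_bin(length):
    """Return the bin name for a length using half-open intervals [low, high)."""
    return NAMES[bisect_right(EDGES, length)]


def bin_lengths(lengths):
    """Group lengths by bin name in one pass."""
    bins = {name: [] for name in NAMES}
    for length in lengths:
        bins[assign_bin(length)].append(length)
    return bins


print(bin_lengths([100, 499, 500, 2000, 4999, 5000, 12000]))
```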

## 16. Advanced: Outlier Detection

Identify sequences that are statistical outliers:

In [None]:
def detect_outliers(fasta_parser, threshold=3):
    """Detect outliers using Z-score."""
    lengths = [s.length for s in fasta_parser]
    gc_contents = [s.gc_content for s in fasta_parser]
    
    mean_len = np.mean(lengths)
    std_len = np.std(lengths)
    
    mean_gc = np.mean(gc_contents)
    std_gc = np.std(gc_contents)
    
    print(f"Outlier Detection (|Z-score| > {threshold})\n")
    
    # Length outliers
    print("LENGTH OUTLIERS:")
    length_outliers = []
    for seq in fasta_parser:
        z_score = (seq.length - mean_len) / std_len
        if abs(z_score) > threshold:
            length_outliers.append((seq, z_score))
    
    if length_outliers:
        for seq, z in sorted(length_outliers, key=lambda x: abs(x[1]), reverse=True):
            print(f"  {seq.id:20s} {seq.length:7,} bp  Z={z:+.2f}")
    else:
        print("  None detected")
    
    # GC outliers
    print("\nGC CONTENT OUTLIERS:")
    gc_outliers = []
    for seq in fasta_parser:
        z_score = (seq.gc_content - mean_gc) / std_gc
        if abs(z_score) > threshold:
            gc_outliers.append((seq, z_score))
    
    if gc_outliers:
        for seq, z in sorted(gc_outliers, key=lambda x: abs(x[1]), reverse=True):
            print(f"  {seq.id:20s} {seq.gc_content:5.2f}%  Z={z:+.2f}")
    else:
        print("  None detected")

detect_outliers(fasta)
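
The Z-score test above can fail quietly when all values are identical, because the standard deviation is then zero. The sketch below restates the same test with the standard library and an explicit guard for that case; `zscore_outliers` is a hypothetical helper, and it uses the population standard deviation to match the default behavior of `np.std`.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, z) pairs whose |z| exceeds threshold; [] if no spread."""
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)  # population std, like np.std with ddof=0
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    scored = [(i, (v - mu) / sigma) for i, v in enumerate(values)]
    return [(i, z) for i, z in scored if abs(z) > threshold]


# Nineteen identical lengths and one much longer sequence
data = [100] * 19 + [10000]
print(zscore_outliers(data))
```

With the population standard deviation, the largest attainable |z| among n values is √(n−1), so a threshold of 3 cannot flag anything when n is 10 or fewer.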

## 17. Comparison of Different Assemblies

Compare statistics between different FASTA files:

In [None]:
def compare_assemblies(file1, file2):
    """Compare two assemblies."""
    f1 = FastaParser().parse(file1)
    f2 = FastaParser().parse(file2)
    
    lengths1 = [s.length for s in f1]
    lengths2 = [s.length for s in f2]
    
    print("=" * 70)
    print("ASSEMBLY COMPARISON")
    print("=" * 70)
    
    print(f"\nFile 1: {file1}")
    print(f"  Sequences: {len(f1):,}")
    print(f"  Total bases: {sum(lengths1):,}")
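    # _calculate_n50 is a private helper (leading underscore); prefer a public
    # statistics accessor if the parser provides one
    print(f"  N50: {f1._calculate_n50(lengths1):,} bp")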
    print(f"  Longest: {max(lengths1):,} bp")
    
    print(f"\nFile 2: {file2}")
    print(f"  Sequences: {len(f2):,}")
    print(f"  Total bases: {sum(lengths2):,}")
    print(f"  N50: {f2._calculate_n50(lengths2):,} bp")
    print(f"  Longest: {max(lengths2):,} bp")
    
    print("\n" + "=" * 70)

# Compare the two sample files (they hold different sequences: 50 uncompressed vs. 20 gzip-compressed)
compare_assemblies('notebook_sample.fasta', 'notebook_sample.fasta.gz')

## 18. Summary and Next Steps

### What We've Covered:

1. ✅ Basic parsing of FASTA files
2. ✅ Accessing sequence properties (length, GC%, etc.)
3. ✅ Comprehensive statistics and N50 calculation
4. ✅ Filtering sequences by length
5. ✅ Sorting by multiple criteria
6. ✅ Handling compressed files (.gz)
7. ✅ Creating visualizations
8. ✅ Writing sorted output
9. ✅ Custom analyses and quality control
10. ✅ Real-world bioinformatics workflows

### Key Takeaways:

- The parser handles both compressed and uncompressed files automatically
- Method chaining enables clean, readable code
- FastaSequence objects provide easy access to sequence properties
- Comprehensive statistics help assess assembly quality
- Visualizations provide quick insights into sequence characteristics

### Next Steps:

1. Try with your own FASTA files
2. Customize the visualizations for your needs
3. Integrate into your existing workflows
4. Extend with additional analyses
5. Combine with other bioinformatics tools

### Additional Resources:

- See `README.md` for complete API documentation
- Check `QUICK_REFERENCE.txt` for common patterns
- Review `bioinformatics_examples.py` for more workflows

## 19. Cleanup (Optional)

Remove generated files if desired:

In [None]:
import os

# Uncomment to clean up generated files
# files_to_remove = [
#     'notebook_sample.fasta',
#     'notebook_sample.fasta.gz',
#     'sorted_by_length.fasta',
#     'sorted_by_length.fasta.gz',
#     'notebook_analysis.png',
#     'custom_gc_analysis.png',
#     'very_short_library.fasta',
#     'short_library.fasta',
#     'medium_library.fasta',
#     'long_library.fasta'
# ]

# for f in files_to_remove:
#     if os.path.exists(f):
#         os.remove(f)
#         print(f"Removed {f}")

print("Notebook complete! Check the generated files for results.")