##### 📄 What is FASTA Format?

FASTA is the most common format for storing DNA, RNA, and protein sequences.

**Structure:**
- Header line starts with `>` followed by sequence ID/description
- Sequence data follows on subsequent lines
- Multiple sequences can be in one file

**Example:**
```
>Gene1 Arabidopsis NPR1
ATGCGTACGTTAGC
GCTAGCTAGCTAG
>Gene2 Arabidopsis FLS2
ATGATGATGATG
```


##### 🔧 Part 1: Creating a Sample FASTA File

First, let's create a sample FASTA file to work with:


In [None]:
# Create a sample FASTA file
fasta_content = """>Gene1 Arabidopsis NPR1
ATGCGTACGTTAGCGCTAGCTAGCTAGATGATCGTAGCTAGCTAG
GCTAGCTAGCTAGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Gene2 Arabidopsis FLS2
ATGATGATGATGCCCGGGAAATTTGCTAGCTAGCTAGCTAG
GCTAGCTAGCTAGGCTAGCTA
>Gene3 Arabidopsis RBOHD
ATGCGCGCGCGCGCGCGCGCGCGCGCTAGCTAGCTAG
"""

# Write to file
with open("sample.fasta", "w") as f:
    f.write(fasta_content)

print("✓ Created sample.fasta")

##### 📖 Part 2: Reading a File Line by Line

Before parsing FASTA, let's practice reading a file:


In [None]:
# Read the file and print each line
with open("sample.fasta") as f:
    for line in f:
        print(repr(line))  # repr() shows hidden characters like \n

##### 🔍 Notice:
- Each line ends with `\n` (newline character)
- Lines starting with `>` are headers
- Other lines are sequence data
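
The points above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory string (`fake_file`, a stand-in we made up here) instead of the real `sample.fasta`, so it runs on its own:

```python
import io

# Stand-in for an open file; the content mirrors the FASTA example above
fake_file = io.StringIO(">Gene1 Arabidopsis NPR1\nATGCGTACGTTAGC\n")

headers = []
sequence_lines = []
for line in fake_file:
    clean = line.rstrip("\n")  # drop only the trailing newline
    if clean.startswith(">"):
        headers.append(clean)         # header line
    else:
        sequence_lines.append(clean)  # sequence data

print("Headers:", headers)
print("Sequence lines:", sequence_lines)
```

Here `rstrip("\n")` removes just the newline; `strip()` would also remove spaces and tabs at both ends of the line.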


##### 🧬 Part 3: Parsing FASTA into a Dictionary

Now let's parse the FASTA file and store sequences in a dictionary:


In [None]:
# Parse FASTA file
records = {}

with open("sample.fasta") as f:
    current_id = None
    seq_parts = []
    
    for line in f:
        line = line.strip()  # Remove whitespace and newline
        
        if line.startswith(">"):
            # Save previous sequence if exists
            if current_id:
                records[current_id] = "".join(seq_parts)
            
            # Start new sequence
            current_id = line[1:]  # Remove the '>'
            seq_parts = []
        else:
            # Add sequence line
            seq_parts.append(line)
    
    # Don't forget the last sequence!
    if current_id:
        records[current_id] = "".join(seq_parts)

# Display results
print(f"Parsed {len(records)} sequences:\n")
for gene_id, sequence in records.items():
    print(f"ID: {gene_id}")
    print(f"Sequence: {sequence[:50]}...")  # First 50 bases
    print(f"Length: {len(sequence)} bp\n")
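
The same logic can be wrapped in a reusable function. The sketch below is one possible version, not the only way to do it: `parse_fasta` is a hypothetical helper name, it skips blank lines, and it keeps only the first word of each header as the key (the cell above keeps the full header line instead). It reads from an in-memory string so it runs without `sample.fasta`:

```python
import io

def parse_fasta(handle):
    """Return {sequence_id: sequence} from an open FASTA file or file-like object."""
    records = {}
    current_id = None
    seq_parts = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Save the record we were building before starting a new one
            if current_id is not None:
                records[current_id] = "".join(seq_parts)
            # Assumption: the ID is the first word of the header
            header = line[1:].split(maxsplit=1)
            current_id = header[0] if header else ""
            seq_parts = []
        else:
            seq_parts.append(line)
    # Save the final record after the loop ends
    if current_id is not None:
        records[current_id] = "".join(seq_parts)
    return records

demo = io.StringIO(
    ">Gene1 Arabidopsis NPR1\nATGC\nGGTA\n\n>Gene2 Arabidopsis FLS2\nATGATG\n"
)
parsed = parse_fasta(demo)
print(parsed)
```

With a real file you would call it as `with open("sample.fasta") as f: records = parse_fasta(f)`.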

##### 📊 Part 4: Calculate Sequence Statistics

Let's define a function to calculate GC content:


In [None]:
def calculate_gc_content(sequence):
    """Calculate GC percentage in a DNA sequence."""
    gc_count = 0
    for base in sequence:
        if base == "G" or base == "C":
            gc_count += 1
    
    gc_percent = (gc_count / len(sequence)) * 100
    return gc_percent

# Test it
test_seq = "ATGCGCGCTAGC"
print(f"Test sequence: {test_seq}")
print(f"GC content: {calculate_gc_content(test_seq):.1f}%")
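
For comparison, here is a shorter sketch of the same calculation using the built-in `str.count()` method. The function name `gc_content_count` is made up for this example. Unlike the loop version, it also accepts lowercase input and returns `0.0` for an empty sequence instead of raising `ZeroDivisionError`:

```python
def gc_content_count(sequence):
    """GC percentage using str.count(); returns 0.0 for an empty sequence."""
    if not sequence:
        return 0.0
    seq = sequence.upper()  # treat 'g' and 'c' like 'G' and 'C'
    gc_count = seq.count("G") + seq.count("C")
    return gc_count / len(seq) * 100

print(f"GC content: {gc_content_count('ATGCGCGCTAGC'):.1f}%")
```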

##### 📋 Part 5: Create a Summary Table

Now let's analyze all sequences from our FASTA file:


In [None]:
# Create summary table
print("=" * 70)
print(f"{'Gene ID':<30} {'Length (bp)':<15} {'GC%':<10}")
print("=" * 70)

for gene_id, sequence in records.items():
    length = len(sequence)
    gc_percent = calculate_gc_content(sequence)
    print(f"{gene_id:<30} {length:<15} {gc_percent:<10.1f}")

print("=" * 70)

##### 🐍 Introduction to BioPython

**What is BioPython?**

BioPython is a powerful library specifically designed for biological computation.
- Handles FASTA, GenBank, and many other formats
- Provides tools for sequence analysis
- Industry-standard for bioinformatics

**Why use BioPython?**
- ✅ Handles complex file formats automatically
- ✅ More robust error handling
- ✅ Built-in sequence analysis tools
- ✅ Used by researchers worldwide

**Installation:**
```bash
pip install biopython
```


##### 📥 Reading FASTA with BioPython

BioPython uses `SeqIO.parse()` to read FASTA files. It returns **SeqRecord** objects that contain:
- `.id` - sequence identifier (the first word of the header)
- `.description` - full header line (without the leading `>`)
- `.seq` - the actual sequence (as a Seq object)

**Key Difference from Manual Parsing:**
- Manual: You write loops to handle `>` lines
- BioPython: `SeqIO.parse()` splits headers and joins multi-line sequences for you


In [None]:
# Import BioPython
import Bio
from Bio import SeqIO

print("BioPython imported successfully!")
print(f"Version: {Bio.__version__}")

# Parse FASTA using BioPython
print("\nParsing sample.fasta with BioPython...")
for record in SeqIO.parse("sample.fasta", "fasta"):
    print(f"\nID: {record.id}")
    print(f"Description: {record.description}")
    print(f"Sequence: {str(record.seq)[:50]}...")
    print(f"Length: {len(record.seq)} bp")


##### 🔄 Converting BioPython Records to Dictionary

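You can easily convert BioPython records to a dictionary like the one we built earlier. Note that `record.id` holds only the first word of the header (e.g. `Gene1`), whereas our manual parser used the full header line as the key: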


In [None]:
# Convert BioPython records to dictionary
biopython_records = {}

for record in SeqIO.parse("sample.fasta", "fasta"):
    biopython_records[record.id] = str(record.seq)

print("BioPython parsed sequences:")
for gene_id, sequence in biopython_records.items():
    print(f"  {gene_id}: {len(sequence)} bp")


##### 💾 Writing FASTA with BioPython

BioPython makes writing FASTA files even easier. You create SeqRecord objects and use `SeqIO.write()`:


In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Create new sequence records
new_records = [
    SeqRecord(
        Seq("ATGCGTACGTAGCTAGC"),
        id="Gene_A",
        description="Test gene A"
    ),
    SeqRecord(
        Seq("GCTAGCTAGCTAGCTAG"),
        id="Gene_B",
        description="Test gene B"
    )
]

# Write to FASTA file
SeqIO.write(new_records, "output.fasta", "fasta")
print("✓ Created output.fasta with BioPython")

# Read it back to verify
print("\nVerifying output.fasta:")
for record in SeqIO.parse("output.fasta", "fasta"):
    print(f"  {record.id}: {record.seq}")


##### ⚖️ Manual Parsing vs BioPython: When to Use Each?

**Use Manual Parsing when:**
- ✅ Learning how file formats work
- ✅ Simple, custom text processing
- ✅ No external dependencies allowed

**Use BioPython when:**
- ✅ Working with real research data
- ✅ Need robust error handling
- ✅ Multiple file formats (FASTA, GenBank, etc.)
- ✅ Using other BioPython features

**Best Practice:** Learn both! Understanding manual parsing helps you understand what BioPython does behind the scenes.


##### 🔍 BioPython Sequence Analysis Example

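BioPython `Seq` objects have built-in methods such as `complement()` and `reverse_complement()`, and `Bio.SeqUtils` adds helper functions such as `gc_fraction` (available in Biopython 1.80 and later):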


In [None]:
from Bio import SeqIO
from Bio.SeqUtils import gc_fraction

# Read first sequence
for record in SeqIO.parse("sample.fasta", "fasta"):
    seq = record.seq
    
    print(f"Analyzing: {record.id}")
    print(f"Sequence: {seq[:40]}...")
    print(f"\nBuilt-in BioPython methods:")
    print(f"  Length: {len(seq)}")
    print(f"  GC Content: {gc_fraction(seq) * 100:.1f}%")
    print(f"  Complement: {seq.complement()[:40]}...")
    print(f"  Reverse Complement: {seq.reverse_complement()[:40]}...")
    break  # Just show first sequence


##### 🎯 Part 6: Your Turn - Practice Exercise

Complete the following tasks:


In [None]:
# Exercise 1: Find the longest sequence
# YOUR CODE HERE


# Exercise 2: Find the sequence with highest GC content
# YOUR CODE HERE


# Exercise 3: Count total number of 'A' bases across all sequences
# YOUR CODE HERE

##### 🌟 Part 7: Challenge - Handling Ambiguous Bases

Real genomic data often contains ambiguous bases (N = unknown).
Let's modify our GC calculator to handle them:


In [None]:
def calculate_gc_content_robust(sequence):
    """Calculate GC% excluding ambiguous bases (N)."""
    gc_count = 0
    valid_bases = 0
    
    for base in sequence.upper():
        if base in "ATGC":
            valid_bases += 1
            if base in "GC":
                gc_count += 1
    
    if valid_bases == 0:
        return 0
    
    return (gc_count / valid_bases) * 100

# Test with ambiguous bases
test_seq_with_n = "ATGCNNNGCTAGC"
print(f"Sequence: {test_seq_with_n}")
print(f"GC% (ignoring N): {calculate_gc_content_robust(test_seq_with_n):.1f}%")

##### 💾 Part 8: Writing Results to a File

Let's save our analysis results:


In [None]:
# Write results to output file
with open("sequence_analysis.txt", "w") as output:
    output.write("Sequence Analysis Results\n")
    output.write("=" * 50 + "\n\n")
    
    for gene_id, sequence in records.items():
        output.write(f"Gene: {gene_id}\n")
        output.write(f"Length: {len(sequence)} bp\n")
        output.write(f"GC Content: {calculate_gc_content(sequence):.1f}%\n")
        output.write(f"First 30 bases: {sequence[:30]}\n")
        output.write("\n")

print("✓ Results saved to sequence_analysis.txt")

# Read and display the file
with open("sequence_analysis.txt") as f:
    print(f.read())

##### 🤔 Reflection Questions

1. **Why do we join sequence parts instead of keeping them in a list?**
   - Answer: FASTA files can have sequences split across multiple lines. Joining creates one continuous sequence string that's easier to work with.

2. **Why are standardized formats like FASTA important?**
   - Universal compatibility across tools
   - Promotes research reproducibility
   - Enables global collaboration

3. **What happens if we forget to save the last sequence?**
   - The final sequence won't be added to our dictionary because the loop ends before we save it!
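
To see question 3 in action, here is a small sketch of a parser that deliberately leaves out the final save. The function name `parse_without_final_save` and the tiny input list are made up for this demonstration:

```python
def parse_without_final_save(lines):
    """Deliberately buggy: never stores the record still in progress when input ends."""
    records = {}
    current_id = None
    seq_parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A record is only saved when the NEXT header appears
            if current_id:
                records[current_id] = "".join(seq_parts)
            current_id = line[1:]
            seq_parts = []
        else:
            seq_parts.append(line)
    return records  # missing: the save step after the loop

lines = [">Gene1", "ATGC", ">Gene2", "GGCC"]
buggy = parse_without_final_save(lines)
print(buggy)
print("Gene2 missing:", "Gene2" not in buggy)
```

Only `Gene1` ends up in the dictionary: no header follows `Gene2`, so nothing ever triggers its save.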

---

##### 🏠 Homework Challenge

1. Create your own FASTA file with 3-5 sequences
2. Parse it and calculate:
   - Average sequence length
   - Average GC content
   - Sequence with most 'A' bases
3. Write results to an output file


##### 🔗 Connection to Capstone Project

FASTA parsing is essential for your final project because:
- Major public sequence databases (NCBI, UniProt) let you download sequences in FASTA format
- You'll need to read multi-sequence files
- Real research involves analyzing hundreds or thousands of sequences

**Next lesson:** We'll learn about codon translation and the genetic code! 🧬


---

##### 🚀 Next Lesson

Ready to continue? Open the next lesson notebook:
**[Lesson 06: Translation](lesson06_translation_notebook.ipynb)**
