#### 📄 What is FASTA Format?

FASTA is a plain-text format, and one of the most widely used, for storing DNA, RNA, and protein sequences.

**Structure:**
- Each record begins with a header line: `>` followed by the sequence ID and an optional description
- Sequence data follows on subsequent lines
- Multiple sequences can be in one file

**Example:**
```
>Gene1 Arabidopsis NPR1
ATGCGTACGTTAGC
GCTAGCTAGCTAG
>Gene2 Arabidopsis FLS2
ATGATGATGATG
```
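
To make the header structure concrete, here is a minimal sketch that splits a header line into an ID and a description. The helper name `split_header` is hypothetical, and it assumes the ID is everything up to the first whitespace, which is a common convention rather than a strict rule:

```python
def split_header(header_line):
    """Split a FASTA header into (sequence_id, description)."""
    text = header_line.rstrip("\n")
    if not text.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    # Assumption: the ID ends at the first run of whitespace
    parts = text[1:].split(maxsplit=1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description

seq_id, description = split_header(">Gene1 Arabidopsis NPR1\n")
print(f"ID: {seq_id}")
print(f"Description: {description}")
```

A header with no description, such as `>Gene2`, gives an empty description string.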

#### 🔧 Part 1: Creating a Sample FASTA File

First, let's create a sample FASTA file to work with:

In [None]:
# Create a sample FASTA file
fasta_content = """>Gene1 Arabidopsis NPR1
ATGCGTACGTTAGCGCTAGCTAGCTAGATGATCGTAGCTAGCTAG
GCTAGCTAGCTAGGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
>Gene2 Arabidopsis FLS2
ATGATGATGATGCCCGGGAAATTTGCTAGCTAGCTAGCTAG
GCTAGCTAGCTAGGCTAGCTA
>Gene3 Arabidopsis RBOHD
ATGCGCGCGCGCGCGCGCGCGCGCGCTAGCTAGCTAG
"""

# Write to file
with open("sample.fasta", "w") as f:
    f.write(fasta_content)

print("✓ Created sample.fasta")

#### 📖 Part 2: Reading a File Line by Line

Before parsing FASTA, let's practice reading a file:

In [None]:
# Read the file and print each line
with open("sample.fasta") as f:
    for line in f:
        print(repr(line))  # repr() shows hidden characters like \n

#### 🔍 Notice:
- Each line ends with `\n` (newline character)
- Lines starting with `>` are headers
- Other lines are sequence data
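
These observations can be turned into a small check. The sketch below uses a hypothetical helper, `classify_line`, and a short hard-coded list of lines (not read from the file) to show how removing the newline and testing for `>` separates headers from sequence data:

```python
# Illustrative lines, written the way they would come out of a file
raw_lines = [">Gene1 Arabidopsis NPR1\n", "ATGCGTACGTTAGC\n", "GCTAGCTAGCTAG\n"]

def classify_line(raw_line):
    """Return ('header' or 'sequence', line without its trailing newline)."""
    line = raw_line.rstrip("\n")
    kind = "header" if line.startswith(">") else "sequence"
    return kind, line

for raw in raw_lines:
    kind, line = classify_line(raw)
    print(f"{kind:<8} {line}")
```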

#### 🧬 Part 3: Parsing FASTA into a Dictionary

Now let's parse the FASTA file and store sequences in a dictionary:

In [None]:
# Parse FASTA file
records = {}

with open("sample.fasta") as f:
    current_id = None
    seq_parts = []
    
    for line in f:
        line = line.strip()  # Remove whitespace and newline
        
        if line.startswith(">"):
            # Save previous sequence if exists
            if current_id is not None:
                records[current_id] = "".join(seq_parts)

            # Start new sequence
            current_id = line[1:]  # Remove the '>'
            seq_parts = []
        else:
            # Add sequence line
            seq_parts.append(line)

    # Don't forget the last sequence!
    if current_id is not None:
        records[current_id] = "".join(seq_parts)

# Display results
print(f"Parsed {len(records)} sequences:\n")
for gene_id, sequence in records.items():
    print(f"ID: {gene_id}")
    print(f"Sequence: {sequence[:50]}...")  # First 50 bases
    print(f"Length: {len(sequence)} bp\n")
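
If you want to reuse this logic, one option is to wrap it in a function. The sketch below is one possible version, not the only way to do it: the name `parse_fasta` is hypothetical, it accepts any iterable of lines (an open file works), and it assumes that blank lines and any sequence lines appearing before the first header can be ignored. It uses `io.StringIO` as a stand-in for a file so the example runs on its own:

```python
import io

def parse_fasta(lines):
    """Parse FASTA lines into a dict of {header text: sequence}."""
    records = {}
    current_id = None
    seq_parts = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # Assumption: blank lines carry no data
        if line.startswith(">"):
            if current_id is not None:
                records[current_id] = "".join(seq_parts)
            current_id = line[1:]
            seq_parts = []
        elif current_id is not None:
            # Assumption: sequence lines before any header are ignored
            seq_parts.append(line)
    if current_id is not None:
        records[current_id] = "".join(seq_parts)
    return records

# In-memory stand-in for a FASTA file
demo = io.StringIO(">Gene1 demo\nATGC\nGGCC\n\n>Gene2 demo\nTTAA\n")
parsed = parse_fasta(demo)
for header, seq in parsed.items():
    print(header, len(seq))
```

With a real file, the call would look like `with open("sample.fasta") as f: records = parse_fasta(f)`.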

#### 📊 Part 4: Calculate Sequence Statistics

Let's define a function to calculate GC content:

In [None]:
def calculate_gc_content(sequence):
    """Calculate GC percentage in a DNA sequence (expects uppercase bases)."""
    if not sequence:
        return 0.0  # Avoid dividing by zero on an empty sequence

    gc_count = 0
    for base in sequence:
        if base == "G" or base == "C":
            gc_count += 1

    gc_percent = (gc_count / len(sequence)) * 100
    return gc_percent

# Test it
test_seq = "ATGCGCGCTAGC"
print(f"Test sequence: {test_seq}")
print(f"GC content: {calculate_gc_content(test_seq):.1f}%")
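
The loop makes the counting explicit, but the same percentage can also be computed with the built-in `str.count` method. The sketch below is an alternative, not a replacement for the function above. The name `gc_content_by_count` is hypothetical, and it assumes lowercase input should be treated the same as uppercase:

```python
def gc_content_by_count(sequence):
    """Return GC percentage using str.count instead of an explicit loop."""
    seq = sequence.upper()  # Assumption: treat 'g'/'c' like 'G'/'C'
    if not seq:
        return 0.0  # Avoid dividing by zero on an empty sequence
    return (seq.count("G") + seq.count("C")) / len(seq) * 100

print(f"GC content: {gc_content_by_count('ATGCGCGCTAGC'):.1f}%")
```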

#### 📋 Part 5: Create a Summary Table

Now let's analyze all sequences from our FASTA file:

In [None]:
# Create summary table
print("=" * 70)
print(f"{'Gene ID':<30} {'Length (bp)':<15} {'GC%':<10}")
print("=" * 70)

for gene_id, sequence in records.items():
    length = len(sequence)
    gc_percent = calculate_gc_content(sequence)
    print(f"{gene_id:<30} {length:<15} {gc_percent:<10.1f}")

print("=" * 70)

#### 🎯 Part 6: Your Turn - Practice Exercise

Complete the following tasks:

In [None]:
# Exercise 1: Find the longest sequence
# YOUR CODE HERE


# Exercise 2: Find the sequence with the highest GC content
# YOUR CODE HERE


# Exercise 3: Count the total number of 'A' bases across all sequences
# YOUR CODE HERE

#### 🌟 Part 7: Challenge - Handling Ambiguous Bases

Real genomic data often contains ambiguous bases, such as `N` for an unknown base.
Let's modify our GC calculator so it counts only A, T, G, and C and skips everything else:

In [None]:
def calculate_gc_content_robust(sequence):
    """Calculate GC% over A, T, G, and C only (skips N and other symbols)."""
    gc_count = 0
    valid_bases = 0
    
    for base in sequence.upper():
        if base in "ATGC":
            valid_bases += 1
            if base in "GC":
                gc_count += 1
    
    if valid_bases == 0:
        return 0.0  # No valid bases to count
    
    return (gc_count / valid_bases) * 100

# Test with ambiguous bases
test_seq_with_n = "ATGCNNNGCTAGC"
print(f"Sequence: {test_seq_with_n}")
print(f"GC% (ignoring N): {calculate_gc_content_robust(test_seq_with_n):.1f}%")

#### 💾 Part 8: Writing Results to a File

Let's save our analysis results:

In [None]:
# Write results to output file
with open("sequence_analysis.txt", "w") as output:
    output.write("Sequence Analysis Results\n")
    output.write("=" * 50 + "\n\n")
    
    for gene_id, sequence in records.items():
        output.write(f"Gene: {gene_id}\n")
        output.write(f"Length: {len(sequence)} bp\n")
        output.write(f"GC Content: {calculate_gc_content(sequence):.1f}%\n")
        output.write(f"First 30 bases: {sequence[:30]}\n")
        output.write("\n")

print("✓ Results saved to sequence_analysis.txt")

# Read and display the file
with open("sequence_analysis.txt") as f:
    print(f.read())
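
If the results will be opened in a spreadsheet or read by another script, a tab-separated table can be easier to work with than free text. The sketch below uses Python's standard `csv` module. The function name `write_summary_tsv`, the column names, and the demo records are all made up for illustration, and it writes to an in-memory buffer so it runs without touching any files:

```python
import csv
import io

def write_summary_tsv(records, handle):
    """Write one tab-separated row per sequence to an open file-like object."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_id", "length_bp", "gc_percent"])
    for gene_id, sequence in records.items():
        seq = sequence.upper()
        gc = (seq.count("G") + seq.count("C")) / len(seq) * 100 if seq else 0.0
        writer.writerow([gene_id, len(seq), f"{gc:.1f}"])

# Made-up records, standing in for the parsed dictionary
demo_records = {"GeneA demo": "ATGC", "GeneB demo": "GGCC"}
buffer = io.StringIO()
write_summary_tsv(demo_records, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("some_name.tsv", "w", newline="")` in place of the buffer.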

#### 🤔 Reflection Questions

1. **Why do we join sequence parts instead of keeping them in a list?**
   - FASTA files can have sequences split across multiple lines. Joining creates one continuous sequence string that's easier to work with.

2. **Why are standardized formats like FASTA important?**
   - Universal compatibility across tools
   - Promotes research reproducibility
   - Enables global collaboration

3. **What happens if we forget to save the last sequence?**
   - A sequence is only saved when the next header is found. The final sequence has no header after it, so without the extra save after the loop it never reaches our dictionary!
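
The answer to question 3 can be checked directly. The sketch below is a stripped-down parser with a hypothetical `save_last` switch, added only so the two behaviours can be compared side by side on a tiny made-up input:

```python
def parse_lines(lines, save_last=True):
    """Parse FASTA lines; optionally skip the final save to show the bug."""
    records = {}
    current_id = None
    seq_parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current_id is not None:
                records[current_id] = "".join(seq_parts)
            current_id = line[1:]
            seq_parts = []
        else:
            seq_parts.append(line)
    # This is the step that is easy to forget
    if save_last and current_id is not None:
        records[current_id] = "".join(seq_parts)
    return records

demo_lines = [">Gene1", "ATGC", ">Gene2", "GGCC"]
print("With final save:   ", list(parse_lines(demo_lines, save_last=True)))
print("Without final save:", list(parse_lines(demo_lines, save_last=False)))
```

Without the final save, only `Gene1` is stored, because `Gene2` is still waiting in `seq_parts` when the loop ends.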

---

#### 🏠 Homework Challenge

1. Create your own FASTA file with 3-5 sequences
2. Parse it and calculate:
   - Average sequence length
   - Average GC content
   - Sequence with most 'A' bases
3. Write results to an output file

#### 🔗 Connection to Capstone Project

FASTA parsing is essential for your final project because:
- Major public sequence databases, such as NCBI and UniProt, provide sequences in FASTA format
- You'll need to read multi-sequence files
- Real research involves analyzing hundreds or thousands of sequences

**Next lesson:** We'll learn about codon translation and the genetic code! 🧬

---

## 🚀 Next Lesson

Ready to continue? Open the next lesson notebook:
**[Lesson 06: Translation](lesson06_translation_notebook.ipynb)**