# Day 2: The Genetic Code Mystery 🧬
*Strings as DNA Sequences: Reading Life's Instructions*

---

## Today's Biological Mystery

**"Why do some genetic mutations cause disease while others are harmless?"**

You've discovered a DNA sequence from a patient with a rare genetic disorder. Your mission: analyze this genetic code to understand how mutations affect protein function.

Today you'll learn that **strings in Python are like DNA sequences** - ordered letters that carry crucial information when read correctly.

---

## 🔬 The Biological Context

**Your patient data:**
- **Patient DNA sequence:** `"ATGAAATTTGGGCCCAAATAG"`
- **Normal reference:** `"ATGAAATTCGGGCCCAAATAG"`
- **Gene name:** CFTR (related to cystic fibrosis)
- **Symptom:** Patient has breathing difficulties

**Your biological questions:**
1. What's different between patient and normal DNA?
2. How does this change affect the protein?
3. Could this explain the symptoms?

**Your coding challenge:** Use Python strings to analyze genetic sequences like a bioinformatician.

## 💡 The Biological Analogy

Think of **Python strings like DNA sequences**:

| DNA Biology | Python Programming |
|---|---|
| **DNA sequence** "ATCG..." | **String** "Hello..." |
| **Base position** within a sequence | **String indexing** [0], [1], [2] |
| **Sequence length** (base pairs) | **String length** len() |
| **Finding motifs** in DNA | **Finding substrings** in text |
| **Reverse complement** strand | **String manipulation** methods |
| **Mutations** change bases | **String replacement** changes characters |

Just like DNA carries genetic instructions, strings carry textual information that programs can read and analyze!
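
As a quick, hedged illustration of the table above, the sketch below applies one standard string operation per row to the patient sequence. The variable names are chosen here for illustration only, and the reverse complement uses `str.maketrans`, which is just one of several ways to do it.

```python
# Toy sequence from this notebook's patient data
dna = "ATGAAATTTGGGCCCAAATAG"

first_base = dna[0]                      # indexing: position 0 is the first base
length = len(dna)                        # sequence length in bases
motif_index = dna.find("GGG")            # index of the first "GGG", or -1 if absent
mutated = dna.replace("TTT", "TTC", 1)   # replace only the first "TTT"

# Reverse complement: swap each base for its partner, then reverse the string
complement_table = str.maketrans("ATGC", "TACG")
reverse_complement = dna.translate(complement_table)[::-1]

print(first_base, length, motif_index)   # A 21 9
print(mutated)                           # ATGAAATTCGGGCCCAAATAG
print(reverse_complement)                # CTATTTGGGCCCAAATTTCAT
```

Python strings are immutable, so `replace` and `translate` return new strings and leave `dna` unchanged.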

## 🧪 Lab Exercise 1: Store and Examine DNA Sequences

**Your task:** Store the DNA sequences and explore their basic properties.

**Think like a geneticist:** First, you'd sequence the DNA and compare it to reference databases.

In [None]:
# Store the DNA sequences
patient_dna = "ATGAAATTTGGGCCCAAATAG"
normal_dna = "ATGAAATTCGGGCCCAAATAG"
gene_name = "CFTR"

# Examine basic properties
print(f"Analysis of {gene_name} gene sequences:")
print(f"Patient DNA: {patient_dna}")
print(f"Normal DNA:  {normal_dna}")
print(f"\nSequence lengths:")
print(f"Patient: {len(patient_dna)} base pairs")
print(f"Normal:  {len(normal_dna)} base pairs")

# Check if sequences are the same length (should be for comparison)
if len(patient_dna) == len(normal_dna):
    print("✓ Sequences are same length - good for comparison")
else:
    print("⚠ Sequences are different lengths - may indicate insertion/deletion")

## 🧪 Lab Exercise 2: Find the Mutation

**Biological goal:** Compare the sequences position by position to find the exact mutation.

**Your task:** Write code to identify where the sequences differ.

**Hint:** Use string indexing to check each position: `sequence[0]` gives first letter, `sequence[1]` gives second, etc.

In [None]:
# Find differences between sequences
print("Mutation Analysis:")
print("Position | Normal | Patient | Match?")
print("-" * 35)

# Check each position
mutations_found = []

for position in range(len(normal_dna)):
    normal_base = normal_dna[position]
    patient_base = patient_dna[position]
    
    if normal_base == patient_base:
        match_status = "✓"
    else:
        match_status = "✗ MUTATION"
        mutations_found.append((position, normal_base, patient_base))
    
    print(f"   {position:2d}    |   {normal_base}    |    {patient_base}    | {match_status}")

print(f"\nMutations found: {len(mutations_found)}")
for position, normal, patient in mutations_found:
    print(f"Position {position}: {normal} → {patient}")
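
The same position-by-position comparison can also be written more compactly. The sketch below is one possible alternative, not a replacement for the loop above: it pairs bases with `zip` and numbers them with `enumerate`. The name `differences` is introduced here for illustration.

```python
patient_dna = "ATGAAATTTGGGCCCAAATAG"
normal_dna = "ATGAAATTCGGGCCCAAATAG"

# zip pairs up bases at the same position; enumerate supplies the 0-based index
differences = [
    (position, normal_base, patient_base)
    for position, (normal_base, patient_base) in enumerate(zip(normal_dna, patient_dna))
    if normal_base != patient_base
]

print(differences)  # [(8, 'C', 'T')]
```

Note that `zip` stops at the end of the shorter sequence, so the length check from Exercise 1 still matters before relying on this result.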

## 🧪 Lab Exercise 3: Analyze Nucleotide Composition

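**Biological context:** Base composition shapes how a DNA region behaves:
- **GC content** (% of G and C bases): G-C pairs share three hydrogen bonds versus two for A-T, so GC-rich DNA is more thermally stable
- **AT content**: AT-rich stretches separate more easily, which is one reason they appear in promoter elements such as the TATA box
- **Codon usage**: which synonymous codons a gene uses can affect how efficiently it is translated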

**Your task:** Calculate nucleotide frequencies for both sequences.

In [None]:
# Function to analyze nucleotide composition
def analyze_dna_composition(dna_sequence, sequence_name):
    print(f"\n{sequence_name} DNA Composition:")
    
    # Count each nucleotide
    a_count = dna_sequence.count('A')
    t_count = dna_sequence.count('T') 
    g_count = dna_sequence.count('G')
    c_count = dna_sequence.count('C')
    
    total_bases = len(dna_sequence)
    
    # Calculate percentages
    a_percent = (a_count / total_bases) * 100
    t_percent = (t_count / total_bases) * 100
    g_percent = (g_count / total_bases) * 100
    c_percent = (c_count / total_bases) * 100
    
    print(f"A: {a_count} bases ({a_percent:.1f}%)")
    print(f"T: {t_count} bases ({t_percent:.1f}%)")
    print(f"G: {g_count} bases ({g_percent:.1f}%)")
    print(f"C: {c_count} bases ({c_percent:.1f}%)")
    
    # Calculate GC content (important for gene stability)
    gc_content = ((g_count + c_count) / total_bases) * 100
    print(f"\nGC content: {gc_content:.1f}%")
    
    return gc_content

# Analyze both sequences
normal_gc = analyze_dna_composition(normal_dna, "Normal")
patient_gc = analyze_dna_composition(patient_dna, "Patient")

# Compare GC content
gc_difference = abs(patient_gc - normal_gc)
print(f"\nGC content difference: {gc_difference:.1f}%")

if gc_difference > 5:
    print("⚠ Significant GC content change - may affect gene stability")
else:
    print("✓ GC content similar - mutation unlikely to affect overall stability")
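
If you only need the GC percentage, a shorter version is possible. The sketch below assumes a non-empty sequence containing only A, T, G, and C, and uses `collections.Counter` from the standard library to count all bases in one pass. The function name `gc_content` is hypothetical and not part of the exercise code.

```python
from collections import Counter

def gc_content(dna_sequence):
    """Return the percentage of G and C bases in a non-empty DNA string."""
    counts = Counter(dna_sequence)  # e.g. Counter({'A': 8, 'G': 5, ...})
    return (counts["G"] + counts["C"]) / len(dna_sequence) * 100

normal_gc_pct = gc_content("ATGAAATTCGGGCCCAAATAG")
patient_gc_pct = gc_content("ATGAAATTTGGGCCCAAATAG")

print(f"Normal:  {normal_gc_pct:.1f}%")   # Normal:  42.9%
print(f"Patient: {patient_gc_pct:.1f}%")  # Patient: 38.1%
```

A `Counter` returns 0 for a missing key, so a sequence with no G or C bases still works without extra checks.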

## 🧪 Lab Exercise 4: Translate to Protein

**Biological context:** DNA codes for proteins through the genetic code:
- Every 3 DNA bases (a codon) code for 1 amino acid
- ATG = Start codon (Methionine)
- TAA, TAG, and TGA = Stop codons (this sequence ends in TAG)
- Several codons can encode the same amino acid, so a mutation may change the protein or leave it unchanged (a silent mutation)

**Your task:** Translate DNA to amino acids to see how the mutation affects the protein.

In [None]:
# Simplified genetic code dictionary (codon → amino acid)
genetic_code = {
    'ATG': 'Met', 'AAA': 'Lys', 'AAT': 'Asn', 'TTT': 'Phe', 'TTC': 'Phe',
    'GGG': 'Gly', 'CCC': 'Pro', 'TAG': 'STOP', 'TAA': 'STOP', 'TGA': 'STOP'
}

def translate_dna(dna_sequence, sequence_name):
    print(f"\n{sequence_name} Translation:")
    print(f"DNA: {dna_sequence}")
    
    # Split into codons (groups of 3)
    amino_acids = []
    codons = []
    
    for i in range(0, len(dna_sequence), 3):
        codon = dna_sequence[i:i+3]
        if len(codon) == 3:  # Only process complete codons
            codons.append(codon)
            if codon in genetic_code:
                amino_acid = genetic_code[codon]
                amino_acids.append(amino_acid)
                if amino_acid == 'STOP':
                    break  # Stop translation at stop codon
            else:
                amino_acids.append('?')  # Unknown codon
    
    print(f"Codons: {' '.join(codons)}")
    print(f"Protein: {'-'.join(amino_acids)}")
    
    return amino_acids, codons

# Translate both sequences
normal_protein, normal_codons = translate_dna(normal_dna, "Normal")
patient_protein, patient_codons = translate_dna(patient_dna, "Patient")

# Compare proteins
print(f"\nProtein Comparison:")
if normal_protein == patient_protein:
    print("✓ Proteins are identical - mutation is silent")
else:
    print("✗ Proteins are different - mutation affects protein!")
    for i, (normal_aa, patient_aa) in enumerate(zip(normal_protein, patient_protein)):
        if normal_aa != patient_aa:
            print(f"  Position {i+1}: {normal_aa} → {patient_aa}")
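
The comparison above tells you whether the proteins differ. To name the kind of change at a single codon, something like the sketch below could be used. `classify_codon_change` is a hypothetical helper written for this illustration, and it relies on the same simplified codon table, so any codon outside that table is reported as "unknown".

```python
# Same simplified codon table as in the exercise (not the full genetic code)
genetic_code = {
    'ATG': 'Met', 'AAA': 'Lys', 'AAT': 'Asn', 'TTT': 'Phe', 'TTC': 'Phe',
    'GGG': 'Gly', 'CCC': 'Pro', 'TAG': 'STOP', 'TAA': 'STOP', 'TGA': 'STOP'
}

def classify_codon_change(normal_codon, patient_codon):
    """Label a codon change as no change, silent, missense, nonsense, stop-loss, or unknown."""
    if normal_codon == patient_codon:
        return "no change"
    normal_aa = genetic_code.get(normal_codon, "?")
    patient_aa = genetic_code.get(patient_codon, "?")
    if "?" in (normal_aa, patient_aa):
        return "unknown"    # codon missing from the simplified table
    if normal_aa == patient_aa:
        return "silent"     # different codon, same amino acid
    if patient_aa == "STOP":
        return "nonsense"   # amino acid replaced by a stop codon
    if normal_aa == "STOP":
        return "stop-loss"  # stop codon replaced by an amino acid
    return "missense"       # one amino acid replaced by another

print(classify_codon_change("TTC", "TTT"))  # silent
print(classify_codon_change("AAA", "TAA"))  # nonsense
print(classify_codon_change("AAA", "AAT"))  # missense
```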

## 🧪 Lab Exercise 5: Clinical Interpretation

**Your task:** Combine your findings to interpret the clinical significance.

**Think like a genetic counselor:** How do you explain these results to the patient?

In [None]:
# Generate clinical report
print("🏥 GENETIC ANALYSIS REPORT")
print("=" * 50)
print(f"Gene analyzed: {gene_name}")
print(f"Patient symptoms: Breathing difficulties")

print(f"\n📊 FINDINGS:")
print(f"• Mutations detected: {len(mutations_found)}")

for position, normal, patient in mutations_found:
    # position is a 0-based string index; codon number and within-codon position are reported 1-based
    codon_position = position // 3 + 1
    within_codon = position % 3 + 1
    print(f"  - Index {position} (0-based): {normal}→{patient} (Codon {codon_position}, position {within_codon})")

print(f"\n🧬 PROTEIN IMPACT:")
# An altered protein is a candidate cause, not proof of pathogenicity
protein_altered = normal_protein != patient_protein
if len(mutations_found) == 0:
    clinical_significance = "No mutations found - normal sequence"
elif not protein_altered:
    clinical_significance = "Silent mutation - no protein change"
else:
    clinical_significance = "Protein-altering mutation - potentially pathogenic"

print(f"• {clinical_significance}")

print(f"\n💊 CLINICAL INTERPRETATION:")
# The normal protein also ends in STOP, so a premature stop is detected by
# comparing protein lengths rather than by testing for 'STOP' alone
if 'STOP' in patient_protein and len(patient_protein) < len(normal_protein):
    interpretation = "Premature stop - shortened protein likely causes symptoms"
elif protein_altered:
    interpretation = "Protein change detected - may contribute to patient symptoms"
else:
    interpretation = "Mutation unlikely to explain symptoms - investigate other genes"

print(f"• {interpretation}")

print(f"\n📋 RECOMMENDATION:")
# Branch on the boolean instead of comparing against the message text
if protein_altered:
    print("• Confirm with family genetic testing")
    print("• Consider targeted therapy options")
    print("• Genetic counseling recommended")
else:
    print("• Test additional genes related to breathing disorders")
    print("• Consider environmental factors")
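
The report converts a string index into a codon number with integer division and remainder. The sketch below works through that arithmetic for the mismatch at index 8 and slices out the affected codon from each sequence. It uses `divmod` as a shorthand for `//` and `%`, and the variable names are chosen here for illustration.

```python
patient_dna = "ATGAAATTTGGGCCCAAATAG"
normal_dna = "ATGAAATTCGGGCCCAAATAG"

position = 8  # 0-based index of the mismatch found in Exercise 2

# divmod returns (position // 3, position % 3) in one call
codon_index, offset = divmod(position, 3)
codon_start = codon_index * 3

normal_codon = normal_dna[codon_start:codon_start + 3]
patient_codon = patient_dna[codon_start:codon_start + 3]

# Add 1 to each value to report biologist-friendly 1-based numbers
print(f"Codon {codon_index + 1}, base {offset + 1}: {normal_codon} -> {patient_codon}")
# Codon 3, base 3: TTC -> TTT
```

The change falls on the third base of its codon, the position where the genetic code is most redundant, which is why many third-base changes leave the amino acid unchanged.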

## 🤔 Biological Reflection

**Answer these questions by modifying the text below:**

1. **How did the single DNA base change affect the protein?**
   *Your analysis here...*

2. **Why might this protein change cause breathing difficulties?**
   *Your biological reasoning here...*

3. **How are DNA strings similar to Python strings in terms of processing?**
   *Your coding insight here...*

4. **What would happen if the mutation occurred in a different position?**
   *Your prediction here...*
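
One way to explore question 4 is to introduce your own point mutations and translate the result. The sketch below is a minimal experiment under stated assumptions: `mutate` and `translate` are hypothetical helpers written for this illustration, and the codon table is the same simplified one from Exercise 4, so mutations that create codons outside it show up as `?`.

```python
# Same simplified codon table as in Exercise 4 (not the full genetic code)
genetic_code = {
    'ATG': 'Met', 'AAA': 'Lys', 'AAT': 'Asn', 'TTT': 'Phe', 'TTC': 'Phe',
    'GGG': 'Gly', 'CCC': 'Pro', 'TAG': 'STOP', 'TAA': 'STOP', 'TGA': 'STOP'
}

def mutate(dna_sequence, position, new_base):
    """Return a copy of the sequence with one base changed (strings are immutable)."""
    return dna_sequence[:position] + new_base + dna_sequence[position + 1:]

def translate(dna_sequence):
    """Translate complete codons until a stop codon is reached."""
    protein = []
    for i in range(0, len(dna_sequence) - 2, 3):
        amino_acid = genetic_code.get(dna_sequence[i:i + 3], "?")
        protein.append(amino_acid)
        if amino_acid == "STOP":
            break
    return protein

normal_dna = "ATGAAATTCGGGCCCAAATAG"

early_stop = mutate(normal_dna, 3, "T")  # codon 2: AAA -> TAA
missense = mutate(normal_dna, 5, "T")    # codon 2: AAA -> AAT

print("-".join(translate(normal_dna)))   # Met-Lys-Phe-Gly-Pro-Lys-STOP
print("-".join(translate(early_stop)))   # Met-STOP
print("-".join(translate(missense)))     # Met-Asn-Phe-Gly-Pro-Lys-STOP
```

Try other positions and bases, then compare each outcome with your prediction.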

## 🎯 Today's Key Insights

### Biological Concepts:
- DNA mutations and their protein consequences
- The genetic code and translation
- Clinical interpretation of genetic variants
- Silent vs. pathogenic mutations

### Programming Concepts:
- **Strings** store sequential data like DNA sequences
- **String indexing** accesses specific positions like DNA bases
- **String methods** (such as .count()) and **slicing** analyze text data
- **Loops** process sequences systematically
- **Dictionaries** map codes (like genetic code translations)

### The Connection:
Just as biologists read DNA sequences to understand genetic information, programmers use strings to store and analyze textual data. Both require systematic, position-by-position analysis!

---

## 📋 Before You Finish

1. **Save this notebook** with your completed solutions
2. **Ask Claude Code to review your work**: "Claude, please review my Day2_DNA_Strings.ipynb notebook"
3. **Connect to yesterday**: How do variables from Day 1 work with strings from Day 2?
4. **Preview tomorrow**: Day 3 explores protein families as lists

**Tomorrow's mystery:** "Why do related proteins have similar but not identical functions?"

*Excellent work decoding life's instructions! 🧬🔍*