# Day 2: The Genetic Code Mystery 🧬
*Strings as DNA Sequences: Reading Life's Instructions*

---

## Today's Biological Mystery

**"Why do some genetic mutations cause disease while others are harmless?"**

You've discovered a DNA sequence from a patient with a rare genetic disorder. Your mission: analyze this genetic code to understand how mutations affect protein function.

Today you'll learn that **strings in Python are like DNA sequences** - ordered letters that carry crucial information when read correctly.

---

## 🔬 The Biological Context

**Your patient data:**
- **Patient DNA sequence:** `"ATGAAATTTGGGCCCAAATAG"`
- **Normal reference:** `"ATGAAATTCGGGCCCAAATAG"`
- **Gene name:** CFTR (related to cystic fibrosis)
- **Symptom:** Patient has breathing difficulties

**Your biological questions:**
1. What's different between patient and normal DNA?
2. How does this change affect the protein?
3. Could this explain the symptoms?

**Your coding challenge:** Use Python strings to analyze genetic sequences like a bioinformatician.

## 💡 The Biological Analogy

Think of **Python strings like DNA sequences**:

| DNA Biology | Python Programming |
|---|---|
| **DNA sequence** "ATCG..." | **String** "Hello..." |
| **Reading frame** (start position) | **String indexing** [0], [1], [2] |
| **Sequence length** (base pairs) | **String length** len() |
| **Finding motifs** in DNA | **Finding substrings** in text |
| **Reverse complement** strand | **String manipulation** methods |
| **Mutations** change bases | **String replacement** changes characters |

Just like DNA carries genetic instructions, strings carry textual information that programs can read and analyze!
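Each row of the table corresponds to a short Python operation. The sketch below is illustrative only: the sequence `dna` and the helper `reverse_complement` are names made up for this example, not part of the exercises.

```python
# Illustrative sequence (hypothetical, not the patient data)
dna = "ATGCGTTAG"

# Indexing: the first base sits at index 0
first_base = dna[0]

# Sequence length in bases
n_bases = len(dna)

# Finding a motif: index of the first match, or -1 if the motif is absent
motif_start = dna.find("CGT")

# Reverse complement: swap each base for its partner, then reverse the result
def reverse_complement(seq):
    complement = str.maketrans("ATCG", "TAGC")
    return seq.translate(complement)[::-1]

rev_comp = reverse_complement(dna)

# Mutation: strings are immutable, so build a new string around the changed base
mutated = dna[:3] + "A" + dna[4:]

print(first_base, n_bases, motif_start, rev_comp, mutated)
```

Slicing is used for the mutation because `.replace()` would change every matching character, not just the one at a chosen position.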

## 🧪 Lab Exercise 1: Store and Examine DNA Sequences

**Your task:** Store the DNA sequences and explore their basic properties.

**Think like a geneticist:** First, you'd sequence the DNA and compare it to reference databases.

In [12]:
# Your code here - store the DNA sequences from the biological context above
patient = "ATGAAATTTGGGCCCAAATAG"
reference = "ATGAAATTCGGGCCCAAATAG"

# Your code here - examine basic properties
print(len(patient))
print(len(reference))
print(patient == reference)
print(patient.count('A'))
print(patient.count('T'))
print(patient.count('C'))
print(patient.count('G'))
print(reference.count('A'))
print(reference.count('T'))
print(reference.count('C'))
print(reference.count('G'))

# Your code here - check if sequences are same length
print(len(patient) == len(reference))

21
21
False
8
5
3
5
8
4
4
5
True
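The unlabeled numbers above are hard to read at a glance. One possible alternative, sketched here with a hypothetical helper `base_counts`, collects the counts into a dictionary so each number carries its base label.

```python
patient = "ATGAAATTTGGGCCCAAATAG"
reference = "ATGAAATTCGGGCCCAAATAG"

def base_counts(seq):
    """Return a dict mapping each base to how often it appears in seq."""
    return {base: seq.count(base) for base in "ATCG"}

# Print the name, length, and labeled counts for each sequence
for name, seq in [("Patient", patient), ("Reference", reference)]:
    print(name, len(seq), base_counts(seq))
```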


## 🧪 Lab Exercise 2: Find the Mutation

**Biological goal:** Compare the sequences position by position to find the exact mutation.

**Your task:** Write code to identify where the sequences differ.

**Hint:** Use string indexing to check each position: `sequence[0]` gives the first letter, `sequence[1]` gives the second, and so on. Python counts from 0, so index 8 is the 9th base.

In [16]:
# Your code here - find differences between sequences
for i in range(len(patient)):
    if patient[i] == reference[i]:
        print("no change")
    else:
        print("mutation")
        print(f"Patient DNA is mutated at index {i} (base {i + 1}): {reference[i]} -> {patient[i]}")

no change
no change
no change
no change
no change
no change
no change
no change
mutation
Patient DNA is mutated at index 8 (base 9): C -> T
no change
no change
no change
no change
no change
no change
no change
no change
no change
no change
no change
no change
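Printing a line for every position gets noisy on longer sequences. A minimal sketch of a quieter approach, assuming both sequences have the same length and differ only by substitutions, collects just the mismatches. The function name `find_mutations` is hypothetical, and positions are reported 1-based, as biologists usually count.

```python
patient = "ATGAAATTTGGGCCCAAATAG"
reference = "ATGAAATTCGGGCCCAAATAG"

def find_mutations(sample, ref):
    """Return (position, ref_base, sample_base) for each mismatch, 1-based."""
    if len(sample) != len(ref):
        raise ValueError("sequences must be the same length to compare by position")
    return [
        (i + 1, r, s)
        for i, (s, r) in enumerate(zip(sample, ref))
        if s != r
    ]

mutations = find_mutations(patient, reference)
for position, ref_base, sample_base in mutations:
    print(f"Position {position}: {ref_base} -> {sample_base}")
```

`zip` pairs the bases at matching positions and `enumerate` supplies the index, so no manual `range(len(...))` loop is needed.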


## 🧪 Lab Exercise 3: Analyze Nucleotide Composition

**Biological context:** DNA composition affects gene function:
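- **GC content** (% of G and C bases) influences DNA stability: G-C pairs form three hydrogen bonds, A-T pairs only two
- **AT content**: AT-rich regions separate more easily, which matters wherever the two strands must open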
- **Codon usage** patterns matter for protein translation

**Your task:** Calculate nucleotide frequencies for both sequences.

In [20]:
# Your code here - create a function to analyze DNA composition

def analyze_dna_composition(dna_sequence, sequence_name):
    # Label the output so the two sequences can be told apart
    print(f"{sequence_name} composition:")
    # Count each base in the sequence
    a_count = dna_sequence.count('A')
    t_count = dna_sequence.count('T')
    g_count = dna_sequence.count('G')
    c_count = dna_sequence.count('C')
    print(f"Number of A: {a_count}")
    print(f"Number of T: {t_count}")
    print(f"Number of G: {g_count}")
    print(f"Number of C: {c_count}")
    # GC content: percentage of bases that are G or C
    gc_content = (g_count + c_count) / len(dna_sequence) * 100
    print(f"GC content is {gc_content:.1f}%")

# Your code here - analyze both sequences
analyze_dna_composition(patient, "Patient")
analyze_dna_composition(reference, "Reference")

Patient composition:
Number of A: 8
Number of T: 5
Number of G: 5
Number of C: 3
GC content is 38.1%
Reference composition:
Number of A: 8
Number of T: 4
Number of G: 5
Number of C: 4
GC content is 42.9%
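The biological context for this exercise also mentions AT content and codon usage, which the function above does not cover. A rough sketch of both follows. The helper names `at_content` and `codon_usage` are hypothetical, and the codon count assumes the reading frame starts at the first base.

```python
from collections import Counter

patient = "ATGAAATTTGGGCCCAAATAG"

def at_content(seq):
    """Percentage of bases in seq that are A or T."""
    return (seq.count("A") + seq.count("T")) / len(seq) * 100

def codon_usage(seq):
    """Count complete codons, reading in frame from the first base."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return Counter(codons)

print(f"AT content: {at_content(patient):.1f}%")
print(codon_usage(patient).most_common())
```

Stopping the range at `len(seq) - 2` skips any incomplete codon left over at the end of the sequence.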


## 🧪 Lab Exercise 4: Translate to Protein

**Biological context:** DNA codes for proteins through the genetic code:
- Every 3 DNA bases (codon) codes for 1 amino acid
- ATG = Start codon (Methionine)
- TAG, TAA, and TGA = Stop codons
- Mutations can change amino acids and affect protein function

**Your task:** Translate DNA to amino acids to see how the mutation affects the protein.

In [23]:
# Simplified genetic code dictionary (codon → amino acid)
genetic_code = {
    'ATG': 'Met', 'AAA': 'Lys', 'AAT': 'Asn', 'TTT': 'Phe', 'TTC': 'Phe',
    'GGG': 'Gly', 'CCC': 'Pro', 'TAG': 'STOP', 'TAA': 'STOP', 'TGA': 'STOP'
}

# Your code here - create a function to translate DNA to protein

def translate_dna(dna_sequence, sequence_name):
    # Initialize an empty list to store amino acids
    protein = []
    
    # Loop through the DNA sequence in steps of 3 (codons)
    for i in range(0, len(dna_sequence), 3):
        # Get the current codon (3 bases)
        codon = dna_sequence[i:i+3]
        
        # Check if we have a complete codon
        if len(codon) == 3:
            # Look up the amino acid in our genetic code dictionary
            amino_acid = genetic_code.get(codon, 'X')  # 'X' for unknown codons
            protein.append(amino_acid)
    
    # Join the amino acids into a protein sequence
    protein_sequence = '-'.join(protein)
    
    print(f"\n{sequence_name} protein sequence:")
    print(protein_sequence)
    return protein_sequence
    

# Your code here - translate both sequences
patient_protein = translate_dna(patient, "Patient")
reference_protein = translate_dna(reference, "Reference")

# Your code here - compare the resulting proteins
print("\nProtein comparison:")
if patient_protein == reference_protein:
    print("The proteins are identical")
else:
    print("The proteins are different")
    # Find where they differ
    for i, (p, r) in enumerate(zip(patient_protein.split('-'), reference_protein.split('-'))):
        if p != r:
            print(f"Position {i+1}: Patient has {p}, Reference has {r}")


Patient protein sequence:
Met-Lys-Phe-Gly-Pro-Lys-STOP

Reference protein sequence:
Met-Lys-Phe-Gly-Pro-Lys-STOP

Protein comparison:
The proteins are identical
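Identical proteins from different DNA point to a silent mutation: TTT and TTC both code for Phe. The sketch below shows one way to label a single-base change automatically and to stop translating at the first stop codon, as a ribosome would. Both function names are hypothetical. The code assumes equal-length sequences that differ only by substitution, and it reuses the simplified codon table from this exercise, so codons missing from that table are reported as unknown.

```python
genetic_code = {
    'ATG': 'Met', 'AAA': 'Lys', 'AAT': 'Asn', 'TTT': 'Phe', 'TTC': 'Phe',
    'GGG': 'Gly', 'CCC': 'Pro', 'TAG': 'STOP', 'TAA': 'STOP', 'TGA': 'STOP'
}

def translate_until_stop(seq):
    """Translate codon by codon and stop at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = genetic_code.get(seq[i:i + 3], 'X')  # 'X' = not in table
        if amino_acid == 'STOP':
            break
        protein.append(amino_acid)
    return protein

def classify_mutation(sample, ref):
    """Label the first differing codon as silent, missense, or nonsense."""
    for i in range(0, len(ref) - 2, 3):
        ref_codon, sample_codon = ref[i:i + 3], sample[i:i + 3]
        if ref_codon == sample_codon:
            continue
        ref_aa = genetic_code.get(ref_codon)
        sample_aa = genetic_code.get(sample_codon)
        if ref_aa is None or sample_aa is None:
            return "unknown"   # codon missing from the simplified table
        if ref_aa == sample_aa:
            return "silent"    # same amino acid despite the base change
        if sample_aa == 'STOP':
            return "nonsense"  # premature stop codon
        return "missense"      # different amino acid
    return "no change"

patient = "ATGAAATTTGGGCCCAAATAG"
reference = "ATGAAATTCGGGCCCAAATAG"
print(classify_mutation(patient, reference))
print(translate_until_stop(patient))
```

A change from AAA to TAA in the second codon would instead be labeled nonsense, and translation would end after Met.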


## 🧪 Lab Exercise 5: Clinical Interpretation

**Your task:** Combine your findings to interpret the clinical significance.

**Think like a genetic counselor:** How do you explain these results to the patient?
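One way to approach this is to assemble the report from the findings of the earlier exercises instead of typing it by hand. The sketch below is a teaching illustration: the function name `clinical_report`, its parameters, and the report wording are all assumptions, and the values passed in come from Exercises 2 and 4.

```python
def clinical_report(gene, position, ref_base, sample_base, ref_aa, sample_aa):
    """Build a short plain-text summary of a single-base variant."""
    if ref_aa == sample_aa:
        effect = f"silent: the codon still codes for {ref_aa}"
        note = "The amino acid sequence is conserved."
    elif sample_aa == "STOP":
        effect = f"nonsense: {ref_aa} is replaced by a stop codon"
        note = "The protein would be cut short."
    else:
        effect = f"missense: {ref_aa} is replaced by {sample_aa}"
        note = "The amino acid sequence is altered."
    lines = [
        f"Gene: {gene}",
        f"Variant: {ref_base} -> {sample_base} at base {position}",
        f"Effect: {effect}",
        f"Note: {note}",
    ]
    return "\n".join(lines)

# Findings from the exercises: C -> T at base 9, codon TTC -> TTT, both Phe
report = clinical_report("CFTR", 9, "C", "T", "Phe", "Phe")
print(report)
```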

In [5]:
# Generate a clinical report
print("The C-to-T change at base 9 is a silent mutation: TTC and TTT both code for Phe, so the amino acid sequence is conserved. This variant alone is unlikely to explain the patient's symptoms.")

## 🤔 Biological Reflection

**Answer these questions by modifying the text below:**

1. **How did the single DNA base change affect the protein?**
   *Your analysis here...*

2. **Could this DNA change explain the breathing difficulties? Why or why not?**
   *Your biological reasoning here...*

3. **How are DNA strings similar to Python strings in terms of processing?**
   *Your coding insight here...*

4. **What would happen if the mutation occurred in a different position?**
   *Your prediction here...*

## 🎯 Today's Key Insights

### Biological Concepts:
- DNA mutations and their protein consequences
- The genetic code and translation
- Clinical interpretation of genetic variants
- Silent vs. pathogenic mutations

### Programming Concepts:
- **Strings** store sequential data like DNA sequences
- **String indexing** accesses specific positions like DNA bases
- **String methods and operations** (.count(), slicing, comparison with ==) analyze text data
- **Loops** process sequences systematically
- **Dictionaries** map codes (like genetic code translations)

### The Connection:
Just as biologists read DNA sequences to understand genetic information, programmers use strings to store and analyze textual data. Both require systematic, position-by-position analysis!

---

## 📋 Before You Finish

1. **Save this notebook** with your completed solutions
2. **Ask Claude Code to review your work**: "Claude, please review my Day2_DNA_Strings.ipynb notebook"
3. **Connect to yesterday**: How do variables from Day 1 work with strings from Day 2?
4. **Preview tomorrow**: Day 3 explores protein families as lists

**Tomorrow's mystery:** "Why do related proteins have similar but not identical functions?"

*Excellent work decoding life's instructions! 🧬🔍*