#### 🧬 The Genetic Code

#### Key Concepts:
- **Codon** = 3 nucleotides that code for 1 amino acid
- **64 possible codons** (4³ = 64)
- **20 amino acids**, so most are encoded by more than one codon (the code is redundant)
- **Start codon**: ATG (codes for Methionine)
- **Stop codons**: TAA, TAG, TGA
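
The 4³ = 64 count can be checked by enumerating every triplet over the four DNA bases. The snippet below is a minimal sketch; `BASES` and `all_codons` are illustrative names that are not used elsewhere in this lesson.

```python
from itertools import product

BASES = "ACGT"  # the four DNA nucleotides

# Every ordered triplet of bases: 4 * 4 * 4 = 64 codons
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

print(len(all_codons))   # 64
print(all_codons[:4])    # ['AAA', 'AAC', 'AAG', 'AAT']
```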

#### Example:
```
DNA:     ATG GCC TAA
Codons:  ATG GCC TAA
Protein:  M   A  STOP
```

#### 📖 The Codon Table

Let's create a codon-to-amino-acid mapping:

In [None]:
# Standard codon table for DNA (1-letter amino acid codes, "*" = stop)
CODON_TABLE = {
    # Start codon
    "ATG": "M",  # Methionine (Start)
    
    # Amino acid codons
    "TTT": "F", "TTC": "F",  # Phenylalanine
    "TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",  # Leucine
    "ATT": "I", "ATC": "I", "ATA": "I",  # Isoleucine
    "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",  # Valine
    "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S",  # Serine
    "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",  # Proline
    "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",  # Threonine
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # Alanine
    "TAT": "Y", "TAC": "Y",  # Tyrosine
    "CAT": "H", "CAC": "H",  # Histidine
    "CAA": "Q", "CAG": "Q",  # Glutamine
    "AAT": "N", "AAC": "N",  # Asparagine
    "AAA": "K", "AAG": "K",  # Lysine
    "GAT": "D", "GAC": "D",  # Aspartic acid
    "GAA": "E", "GAG": "E",  # Glutamic acid
    "TGT": "C", "TGC": "C",  # Cysteine
    "TGG": "W",  # Tryptophan
    "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R",  # Arginine
    "AGT": "S", "AGC": "S",  # Serine
    "AGA": "R", "AGG": "R",  # Arginine
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # Glycine
    
    # Stop codons
    "TAA": "*",  # Stop
    "TAG": "*",  # Stop
    "TGA": "*"   # Stop
}

# Display some examples
print("Sample Codon Translations:")
for codon in ["ATG", "GCC", "TAA", "TTT", "TGG"]:
    aa = CODON_TABLE.get(codon, "?")
    print(f"  {codon} → {aa}")
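
To see the redundancy concretely, the table can be inverted so that each amino acid lists its codons. The sketch below rebuilds the standard genetic code in compact form so that it runs on its own. `BASES`, `AMINO_ACIDS`, `compact_table` and `codons_by_aa` are illustrative names, not part of the lesson's code, and the same grouping loop could be applied to `CODON_TABLE` directly.

```python
from collections import defaultdict
from itertools import product

# Assumption: standard genetic code, laid out in the conventional T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Pair each of the 64 triplets with its 1-letter amino acid code
compact_table = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid (or stop signal) they encode
codons_by_aa = defaultdict(list)
for codon, aa in compact_table.items():
    codons_by_aa[aa].append(codon)

for aa in sorted(codons_by_aa):
    codons = codons_by_aa[aa]
    print(f"{aa}: {len(codons)} codon(s) -> {' '.join(codons)}")
```

Amino acids such as leucine appear with six codons, while methionine and tryptophan have only one each.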

#### 🔄 Basic Translation Function

In [None]:
def translate(dna):
    """
    Translate DNA sequence to protein.
    
    Args:
        dna (str): DNA sequence
    
    Returns:
        str: Protein sequence (amino acids)
    """
    protein = []
    
    # Process 3 bases at a time
    for i in range(0, len(dna), 3):
        codon = dna[i:i+3]
        
        # Skip incomplete codons
        if len(codon) < 3:
            break
        
        # Stop at stop codons
        if codon in ["TAA", "TAG", "TGA"]:
            break
        
        # Translate codon
        aa = CODON_TABLE.get(codon, "?")
        protein.append(aa)
    
    return "".join(protein)

# Test translation
dna = "ATGGCCGATTAA"
protein = translate(dna)

print(f"DNA:     {dna}")
print(f"Codons:  {' '.join([dna[i:i+3] for i in range(0, len(dna), 3)])}")
print(f"Protein: {protein}")
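
`translate` looks codons up exactly as written, so lowercase letters, embedded whitespace, or RNA-style `U` would produce `?` instead of amino acids. One possible way to guard against that is to normalize the input first. The helper below is a hypothetical sketch (the name `clean_dna` is an assumption), and it deliberately rejects ambiguity codes such as `N`.

```python
def clean_dna(raw):
    """Normalize a sequence string before translation (hypothetical helper)."""
    # Remove all whitespace, uppercase, and map RNA's U to DNA's T
    seq = "".join(raw.split()).upper().replace("U", "T")

    # Report anything that is not A, C, G or T
    invalid = sorted(set(seq) - set("ACGT"))
    if invalid:
        raise ValueError(f"Unexpected characters in sequence: {invalid}")
    return seq


print(clean_dna("atg gcc\ngat taa"))  # ATGGCCGATTAA
print(clean_dna("AUGGCC"))            # ATGGCC
```

The cleaned string can then be passed to `translate` unchanged.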

#### 🎨 Detailed Translation with Visualization

In [None]:
def translate_detailed(dna):
    """
    Translate DNA with detailed output showing each step.
    """
    print(f"Translating: {dna}")
    print(f"Length: {len(dna)} bases\n")
    
    protein = []
    
    print("Position | Codon | Amino Acid")
    print("-" * 35)
    
    for i in range(0, len(dna), 3):
        codon = dna[i:i+3]
        
        if len(codon) < 3:
            print(f"{i:4d}-{i+len(codon)-1:<3d} | {codon:<5} | Incomplete (skipped)")
            break
        
        aa = CODON_TABLE.get(codon, "?")
        
        if aa == "*":
            print(f"{i:4d}-{i+2:<3d} | {codon:<5} | STOP")
            break
        
        status = "Start" if codon == "ATG" and i == 0 else ""
        print(f"{i:4d}-{i+2:<3d} | {codon:<5} | {aa} {status}")
        protein.append(aa)
    
    result = "".join(protein)
    print(f"\nFinal Protein: {result}")
    print(f"Length: {len(result)} amino acids")
    return result

# Test it
test_dna = "ATGGCCGATTAA"
translate_detailed(test_dna)
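
Both translation functions start reading at position 0, which is only one of three possible reading frames on this strand. As a rough illustration, the sketch below splits the same test sequence into codons at each offset. `codons_in_frame` and `example_dna` are hypothetical names introduced here.

```python
def codons_in_frame(dna, frame):
    """Split dna into complete codons, starting at offset frame (0, 1 or 2)."""
    # Stopping at len(dna) - 2 guarantees every slice has exactly 3 bases
    return [dna[i:i+3] for i in range(frame, len(dna) - 2, 3)]


example_dna = "ATGGCCGATTAA"
for frame in range(3):
    print(f"Frame {frame}: {' '.join(codons_in_frame(example_dna, frame))}")
```

Shifting the frame by a single base changes every codon, which is why the position of the start codon matters.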

#### 🔍 Finding Open Reading Frames (ORFs)

Find all potential coding sequences (start codon → stop codon):

In [None]:
def find_orfs(dna):
    """
    Find all open reading frames (ATG to stop codon).
    
    Args:
        dna (str): DNA sequence
    
    Returns:
        list: List of (start_pos, end_pos, protein) tuples, where positions
            are 0-based and end_pos is the last base of the stop codon
    """
    orfs = []
    stop_codons = ["TAA", "TAG", "TGA"]
    
    # Search for ATG
    for i in range(len(dna) - 2):
        if dna[i:i+3] == "ATG":
            # Found a start codon, now find the stop
            protein = []
            
            for j in range(i, len(dna), 3):
                codon = dna[j:j+3]
                
                if len(codon) < 3:
                    break
                
                if codon in stop_codons:
                    # Found complete ORF
                    orfs.append((i, j+2, "".join(protein)))
                    break
                
                aa = CODON_TABLE.get(codon, "?")
                protein.append(aa)
    
    return orfs

# Test with sequence containing multiple ORFs
dna = "ATGGCCGATTAAATGATCGCGTAG"
orfs = find_orfs(dna)

print(f"Sequence: {dna}\n")
print(f"Found {len(orfs)} ORF(s):\n")

for idx, (start, end, protein) in enumerate(orfs, 1):
    orf_seq = dna[start:end+1]
    print(f"ORF {idx}:")
    print(f"  Position: {start}-{end}")
    print(f"  DNA: {orf_seq}")
    print(f"  Protein: {protein}")
    print(f"  Length: {len(protein)} aa\n")
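
Because `find_orfs` starts a new search at every ATG, an in-frame ATG inside an ORF is reported as a second, shorter ORF that ends at the same stop codon. If only the longest candidate per stop codon is wanted, a post-filter along these lines may help. This is a sketch, and `longest_orfs` is a hypothetical helper rather than part of the lesson's code.

```python
def longest_orfs(orfs):
    """Keep one ORF per stop position: the one with the earliest start."""
    best = {}
    for start, end, protein in orfs:
        # ORFs that share an end position share a stop codon (and a frame),
        # so the earliest start is the longest of them
        if end not in best or start < best[end][0]:
            best[end] = (start, end, protein)
    return sorted(best.values())


# Tuples in the (start_pos, end_pos, protein) form returned by find_orfs.
# These are the two results for "ATGAAAATGCCCTAA", where both ATGs share one stop.
nested = [(0, 14, "MKMP"), (6, 14, "MP")]
print(longest_orfs(nested))  # [(0, 14, 'MKMP')]
```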

#### 📊 Comparing Multiple Translations

In [None]:
# Compare translations of different sequences
sequences = {
    "Gene 1": "ATGGCCGATTAA",
    "Gene 2": "ATGATCGCGTAG",
    "Gene 3": "ATGGGGCCCAAATAG"
}

print("Translation Comparison")
print("=" * 60)

for name, seq in sequences.items():
    protein = translate(seq)
    print(f"\n{name}:")
    print(f"  DNA ({len(seq)} bp):       {seq}")
    print(f"  Protein ({len(protein)} aa):  {protein}")
    
    # Count unknown amino acids
    unknown = protein.count("?")
    if unknown > 0:
        print(f"  ⚠️  Contains {unknown} unknown codon(s)")

#### 🎯 Practice Exercise 1

Expand the codon table and translate a longer sequence:

In [None]:
# Add more codons to the table
# Then translate this sequence
dna = "ATGGCCTTTTGGTAA"

# YOUR CODE HERE
# Hint: Check which codons, if any, are missing from CODON_TABLE (a complete table has 64 entries)

#### 🎯 Practice Exercise 2

Count the mutations that change the protein sequence:

In [None]:
# Compare two sequences and count amino acid changes
dna1 = "ATGGCCGATTAA"
dna2 = "ATGGCGGATTAA"  # One base different

protein1 = translate(dna1)
protein2 = translate(dna2)

print(f"Sequence 1: {dna1}")
print(f"Protein 1:  {protein1}")
print(f"\nSequence 2: {dna2}")
print(f"Protein 2:  {protein2}")

# YOUR CODE HERE
# Count how many amino acids are different

#### 🧪 Real Example: Insulin Gene Fragment

In [None]:
# Partial insulin coding sequence
insulin_dna = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC"

print("Human Insulin Gene Fragment")
print("=" * 60)
protein = translate_detailed(insulin_dna)

print("\n\nAmino Acid Composition:")
# Sort so the output order is the same on every run
for aa in sorted(set(protein)):
    count = protein.count(aa)
    print(f"  {aa}: {count}")
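
The composition can also be tallied in a single pass with `collections.Counter` from the standard library. The sketch below hard-codes the 20-residue translation of the fragment above so that it runs on its own; `fragment_protein` and `composition` are illustrative names.

```python
from collections import Counter

# Translation of the 60-base insulin fragment (20 codons, no stop codon)
fragment_protein = "MALWMRLLPLLALLALWGPD"

# Counter tallies every residue in one pass over the string
composition = Counter(fragment_protein)

# most_common() lists residues from most to least frequent
for aa, count in composition.most_common():
    share = count / len(fragment_protein)
    print(f"  {aa}: {count} ({share:.0%})")
```

Leucine (L) accounts for 8 of the 20 residues in this fragment.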

#### 🤔 Reflection Questions

1. How many bases encode one amino acid? Why?
2. What does a stop codon do computationally?
3. Why might we get "?" in our protein sequence?
4. Can different codons code for the same amino acid? (Look at the table!)

#### 🏠 Homework Challenge

1. Confirm that the codon table contains ALL 64 codons, adding any that are missing
2. Implement a function that:
   - Scans for ATG (start codon)
   - Translates from ATG to first stop codon
   - Returns only the translated region
3. Test with: `"GCTAGCATGGCCTTTTGGTAAGCTA"`

In [None]:
# Homework coding space
# YOUR CODE HERE

#### 🎉 Summary

You've learned:
- ✅ Codons are triplets that code for amino acids
- ✅ How to implement DNA → protein translation
- ✅ Start codon (ATG) and stop codons (TAA, TAG, TGA)
- ✅ Finding open reading frames (ORFs)
- ✅ Handling incomplete codons and unknown sequences
- ✅ Comparing proteins to identify mutations

**Next lesson:** Probability models and sequence composition! 📊