##### 🧬 Biological Sequences as Strings

##### DNA - 4 bases
**A**denine, **T**hymine, **G**uanine, **C**ytosine

##### RNA - 4 bases  
**A**denine, **U**racil, **G**uanine, **C**ytosine (U replaces T)

##### Proteins - 20 amino acids
Represented by 20 different letters (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y)

Let's create examples of each:


In [1]:
# Different types of biological sequences
dna = "ATGCCGATTA"
rna = "AUGCCGAUUA"  # Notice U instead of T
protein = "MPADEF"   # Amino acid sequence

print("DNA:    ", dna)
print("RNA:    ", rna)
print("Protein:", protein)

DNA:     ATGCCGATTA
RNA:     AUGCCGAUUA
Protein: MPADEF
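
Because each sequence type has a fixed alphabet, a simple membership check can catch typos such as a stray `U` in a DNA string. The sketch below shows one way to do this; the helper name `is_valid` and the alphabet constants are illustrative and not part of the lesson.

```python
# Alphabets taken from the lists above (the constant names are illustrative)
DNA_BASES = set("ATGC")
RNA_BASES = set("AUGC")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid(sequence, alphabet):
    """Return True if every character of sequence belongs to alphabet."""
    return set(sequence) <= alphabet

print(is_valid("ATGCCGATTA", DNA_BASES))   # True
print(is_valid("AUGCCGAUUA", DNA_BASES))   # False: U is not a DNA base
print(is_valid("MPADEF", AMINO_ACIDS))     # True
```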


##### 📍 Indexing - Accessing Individual Positions

**Important:** Python counts from **0**, not 1!

```
DNA:   A  T  G  C  C  G  A  T  T  A
Index: 0  1  2  3  4  5  6  7  8  9
```

##### Positive Indexing (from start)


In [2]:
dna = "ATGCCGATTA"

print(f"Full sequence: {dna}")
print(f"\nIndividual bases:")
print(f"First base (index 0):  {dna[0]}")
print(f"Second base (index 1): {dna[1]}")
print(f"Third base (index 2):  {dna[2]}")
print(f"Fourth base (index 3): {dna[3]}")

Full sequence: ATGCCGATTA

Individual bases:
First base (index 0):  A
Second base (index 1): T
Third base (index 2):  G
Fourth base (index 3): C
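
Valid indices run from 0 to `len(dna) - 1`, so asking for a position past the end raises an `IndexError`. A small sketch of that behaviour:

```python
dna = "ATGCCGATTA"

last_index = len(dna) - 1          # 9 for a 10-base sequence
print(f"Last valid index: {last_index} -> {dna[last_index]}")

try:
    dna[len(dna)]                  # index 10 does not exist
except IndexError as error:
    print(f"IndexError: {error}")
```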


##### Negative Indexing (from end)

You can also count backwards from the end using negative numbers:


In [3]:
dna = "ATGCCGATTA"

print(f"Full sequence: {dna}")
print(f"\nCounting from the end:")
print(f"Last base (index -1):    {dna[-1]}")
print(f"2nd last (index -2):     {dna[-2]}")
print(f"3rd last (index -3):     {dna[-3]}")

Full sequence: ATGCCGATTA

Counting from the end:
Last base (index -1):    A
2nd last (index -2):     T
3rd last (index -3):     T
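
A negative index `-k` points at the same position as the positive index `len(dna) - k`. The following check illustrates the equivalence:

```python
dna = "ATGCCGATTA"
n = len(dna)   # 10

print(dna[-1] == dna[n - 1])   # True: both are the last base
print(dna[-3] == dna[n - 3])   # True: both are index 7
print(dna[-n] == dna[0])       # True: -10 is the first base
```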


##### ✂️ Slicing - Extracting Subsequences

Syntax: `sequence[start:end]`
- Includes the character at `start`
- **Excludes** the character at `end`

##### Basic Slicing


In [4]:
dna = "ATGCCGATTA"

# Extract first 3 bases (a codon)
first_codon = dna[0:3]
print(f"First 3 bases: {first_codon}")

# Extract the second codon: indices 3, 4, 5 (index 6 is excluded)
second_codon = dna[3:6]
print(f"Bases 3-6: {second_codon}")

# Extract the third codon: indices 6, 7, 8 (index 9 is excluded)
third_codon = dna[6:9]
print(f"Bases 6-9: {third_codon}")

First 3 bases: ATG
Bases 3-6: CCG
Bases 6-9: ATT
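
Because the end index is excluded, a slice `sequence[start:end]` contains `end - start` characters as long as both indices fall inside the sequence. Unlike indexing, slicing past the end does not raise an error; Python simply stops at the last character. A brief sketch:

```python
dna = "ATGCCGATTA"

codon = dna[3:6]
print(len(codon) == 6 - 3)   # True: three bases, indices 3, 4, 5

# Slices are clamped to the sequence, so this returns only the final base
tail = dna[9:12]
print(repr(tail))            # 'A'

# A slice that starts past the end is empty rather than an error
print(repr(dna[20:23]))      # ''
```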


##### Shorthand Slicing Tricks


In [6]:
dna = "ATGCCGATTA"

# From start to position (leave out the 0)
print(f"First 5 bases:  {dna[:5]}")

# From position to end (leave out the end)
print(f"From position 5 to end: {dna[5:]}")

# Last 3 bases using negative indexing
print(f"Last 3 bases: {dna[-3:]}")

# Everything except last 3
print(f"All but last 3: {dna[:-3]}")

First 5 bases:  ATGCC
From position 5 to end: GATTA
Last 3 bases: TTA
All but last 3: ATGCCGA
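
One consequence of excluding the end index is that `dna[:k]` and `dna[k:]` split a sequence into two pieces with no overlap and no gap. A quick check:

```python
dna = "ATGCCGATTA"
k = 5

head = dna[:k]    # ATGCC
tail = dna[k:]    # GATTA
print(head + tail == dna)              # True

# The same holds for a negative split point
print(dna[:-3] + dna[-3:] == dna)      # True
```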


##### 🧪 Extracting Codons

Codons are groups of 3 nucleotides. Most code for an amino acid; a few, such as `TAG` in the example below, signal the end of translation instead:


In [5]:
# A coding sequence
coding_seq = "ATGCCGATTAAGTAG"

print(f"Full sequence: {coding_seq}")
print(f"Length: {len(coding_seq)} bases\n")

# Extract codons manually
codon1 = coding_seq[0:3]
codon2 = coding_seq[3:6]
codon3 = coding_seq[6:9]
codon4 = coding_seq[9:12]
codon5 = coding_seq[12:15]

print("Codons:")
print(f"  1: {codon1}  (Start codon - Methionine)")
print(f"  2: {codon2}")
print(f"  3: {codon3}")
print(f"  4: {codon4}")
print(f"  5: {codon5}  (Stop codon)")

Full sequence: ATGCCGATTAAGTAG
Length: 15 bases

Codons:
  1: ATG  (Start codon - Methionine)
  2: CCG
  3: ATT
  4: AAG
  5: TAG  (Stop codon)
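
Writing out each slice by hand works for five codons but not for a whole gene. As a preview of the looping tools covered later, the sketch below steps through the sequence three bases at a time. The name `split_codons` is illustrative, and the function assumes the reading frame starts at index 0.

```python
def split_codons(sequence):
    """Split a sequence into consecutive 3-base codons, dropping any leftover bases."""
    usable = len(sequence) - len(sequence) % 3   # ignore a trailing partial codon
    return [sequence[i:i + 3] for i in range(0, usable, 3)]

coding_seq = "ATGCCGATTAAGTAG"
print(split_codons(coding_seq))
# ['ATG', 'CCG', 'ATT', 'AAG', 'TAG']
```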


##### 🔄 DNA vs RNA Conversion

Converting DNA to RNA: replace every T with U. This treats the DNA string as the coding strand, whose sequence matches the mRNA.


In [7]:
# DNA to RNA conversion
dna = "ATGCCGATTA"
rna = dna.replace("T", "U")

print(f"DNA: {dna}")
print(f"RNA: {rna}")

# RNA back to DNA
dna_again = rna.replace("U", "T")
print(f"\nBack to DNA: {dna_again}")

DNA: ATGCCGATTA
RNA: AUGCCGAUUA

Back to DNA: ATGCCGATTA
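
Sequence files sometimes contain lowercase letters, which `replace("T", "U")` would silently skip. One way to guard against that, sketched here with the illustrative helper names `dna_to_rna` and `rna_to_dna`, is to upper-case the input first:

```python
def dna_to_rna(dna):
    """Return the RNA version of a coding-strand DNA sequence."""
    return dna.upper().replace("T", "U")

def rna_to_dna(rna):
    """Return the DNA version of an RNA sequence."""
    return rna.upper().replace("U", "T")

print(dna_to_rna("atgccgatta"))              # AUGCCGAUUA
print(rna_to_dna(dna_to_rna("ATGCCGATTA")))  # ATGCCGATTA
```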


##### 🎯 Practice Exercise 1: Extracting Gene Regions

Given a mock mRNA sequence, extract the start codon region:


In [8]:
# A mock mRNA sequence
mrna = "GCUAGCAUGCCGUUAAGGCUAG"

print(f"mRNA sequence: {mrna}")
print(f"Length: {len(mrna)} bases\n")

# Find and extract the start codon (AUG)
# First, let's find where AUG appears
start_position = mrna.find("AUG")
print(f"Start codon AUG found at position: {start_position}")

# Extract the start codon
start_codon = mrna[start_position:start_position+3]
print(f"Start codon: {start_codon}")

# Extract everything from the start codon onward (the coding region begins here)
coding_region = mrna[start_position:]
print(f"\nCoding region: {coding_region}")

mRNA sequence: GCUAGCAUGCCGUUAAGGCUAG
Length: 22 bases

Start codon AUG found at position: 6
Start codon: AUG

Coding region: AUGCCGUUAAGGCUAG
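
`str.find` returns `-1` when the pattern is absent, and `-1` is also a valid index, so slicing with it would quietly return the wrong region. A defensive sketch (the function name `extract_from_start` is illustrative):

```python
def extract_from_start(mrna):
    """Return the sequence from the first AUG onward, or None if there is no AUG."""
    start_position = mrna.find("AUG")
    if start_position == -1:       # find() signals "not found" with -1
        return None
    return mrna[start_position:]

print(extract_from_start("GCUAGCAUGCCGUUAAGGCUAG"))   # AUGCCGUUAAGGCUAG
print(extract_from_start("GCUAGC"))                   # None
```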


##### 🎯 Practice Exercise 2: Analyzing a Protein Sequence

Work with amino acid sequences:


In [9]:
# Human insulin precursor (preproinsulin) protein sequence, 110 amino acids
insulin = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

print(f"Insulin protein sequence:")
print(f"Length: {len(insulin)} amino acids\n")

# First 10 amino acids
print(f"First 10 amino acids: {insulin[:10]}")

# Last 10 amino acids
print(f"Last 10 amino acids: {insulin[-10:]}")

# Count specific amino acids
print(f"\nAmino acid composition:")
print(f"  Leucine (L): {insulin.count('L')}")
print(f"  Glycine (G): {insulin.count('G')}")
print(f"  Alanine (A): {insulin.count('A')}")
print(f"  Cysteine (C): {insulin.count('C')}")

Insulin protein sequence:
Length: 110 amino acids

First 10 amino acids: MALWMRLLPL
Last 10 amino acids: SLYQLENYCN

Amino acid composition:
  Leucine (L): 20
  Glycine (G): 12
  Alanine (A): 10
  Cysteine (C): 6
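
Calling `count` once per amino acid gets repetitive. The standard library's `collections.Counter` tallies every letter in a single pass. The sketch below uses only the first 17 residues of the sequence above to keep the output short.

```python
from collections import Counter

peptide = "MALWMRLLPLLALLALW"   # first 17 residues of the sequence above

composition = Counter(peptide)
print(composition["L"])            # 8
print(composition.most_common(2))  # [('L', 8), ('A', 3)]
```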


##### 🎯 Your Turn - Complete These Challenges


In [None]:
# Challenge 1: Extract the second codon from this sequence
seq = "ATGCGTAAGTAG"

# YOUR CODE HERE - replace None with a slice covering indices 3, 4, 5
second_codon = None

print(f"Second codon: {second_codon}")

In [None]:
# Challenge 2: Find a protein online and analyze it
# Visit: https://www.uniprot.org/
# Pick any short protein sequence

my_protein = "PASTE_YOUR_PROTEIN_HERE"

# List 5 distinct amino acids present
print(f"Protein length: {len(my_protein)}")
print(f"\n5 distinct amino acids found:")
# YOUR CODE HERE

##### 🔍 Visualizing Indexing and Slicing


In [10]:
# Helper function to visualize indexing
def show_indices(sequence):
    """Display a sequence with index numbers."""
    print("Sequence:", " ".join(sequence))
    print("Index:   ", " ".join(str(i) for i in range(len(sequence))))
    print("Negative:", " ".join(str(i-len(sequence)) for i in range(len(sequence))))

# Try it
dna = "ATGCCGATTA"
show_indices(dna)

Sequence: A T G C C G A T T A
Index:    0 1 2 3 4 5 6 7 8 9
Negative: -10 -9 -8 -7 -6 -5 -4 -3 -2 -1
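
In the output above, the negative indices drift out of line with the bases because `-10` is wider than `A`. A possible variant, named `show_indices_aligned` here for illustration, pads every cell to the same width:

```python
def show_indices_aligned(sequence):
    """Display a sequence with positive and negative indices in aligned columns."""
    n = len(sequence)
    width = len(str(-n))   # widest label, e.g. "-10" needs 3 characters

    def row(values):
        # Right-align every value in a column of the same width
        return " ".join(f"{value:>{width}}" for value in values)

    lines = [
        "Sequence: " + row(sequence),
        "Index:    " + row(range(n)),
        "Negative: " + row(range(-n, 0)),
    ]
    print("\n".join(lines))
    return lines

show_indices_aligned("ATGC")
```

The function also returns the three lines so they can be reused or checked, which is a design choice of this sketch rather than of the original helper.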


##### 💡 Why Slicing Matters in Bioinformatics

Slicing is essential for:
1. **Extracting genes** from chromosomes
2. **Finding codons** in coding sequences
3. **Identifying promoters** (regulatory regions)
4. **Analyzing specific domains** in proteins
5. **Comparing sequence regions** across species
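
Annotation formats such as GFF report regions with 1-based, inclusive coordinates, whereas Python slices are 0-based and exclude the end. Converting between the two is a common source of off-by-one errors. A minimal sketch, with `extract_region` as an illustrative helper name:

```python
def extract_region(sequence, start, end):
    """Return the bases from position start to position end (1-based, inclusive)."""
    # Shift only the start: the inclusive 1-based end equals the exclusive 0-based end
    return sequence[start - 1:end]

dna = "ATGCCGATTA"
print(extract_region(dna, 1, 3))    # ATG  (positions 1 to 3)
print(extract_region(dna, 4, 6))    # CCG  (positions 4 to 6)
```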


##### 🤔 Reflection Questions

**Answer these:**

1. Why does Python start counting from 0 instead of 1?
2. Why is slicing helpful in bioinformatics? Give one specific example.
3. What's the difference between DNA and RNA in terms of bases?


##### Your Answers:

1. 

2. 

3. 


##### 🏠 Homework

1. Find a real protein sequence online (NCBI, UniProt)
2. In the cell below:
   - Store it in a variable
   - Print the length
   - List 5 distinct amino acid letters present
   - Extract and print the first 20 amino acids


In [None]:
# Homework coding space
# YOUR CODE HERE

##### 🎉 Summary

You've learned:
- ✅ How to represent DNA, RNA, and proteins as strings
- ✅ Indexing to access individual positions (positive and negative)
- ✅ Slicing to extract subsequences
- ✅ The difference between DNA (T) and RNA (U)
- ✅ Why these skills matter for genomic analysis

**Next lesson:** We'll use loops to analyze sequences automatically! 🔄


---

##### 🚀 Next Lesson

Ready to continue? Open the next lesson notebook:
**[Lesson 03: Loops](lesson03_loops_notebook.ipynb)**
