[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sathyasjali/EditTheGenome/blob/main/curriculum/student_materials/lesson01_intro_notebook.ipynb)

#### 🧬 What is Bioinformatics?

**Bioinformatics** = Using computers and programming to analyze biological data

#### Real-World Applications:
1. **Genome Sequencing** - Reading the complete DNA of organisms
2. **Disease Research** - Finding genetic causes of diseases
3. **Evolutionary Studies** - Comparing DNA across species
4. **Drug Development** - Designing new medicines
5. **Personalized Medicine** - Tailoring treatments to individual genetics

#### Key Question:
> How do scientists store DNA information on a computer?

**Answer:** As text! A DNA sequence is written with 4 letters, one for each base: **A, T, G, C**

#### 👋 Your First Python Program

Let's start with the classic "Hello World" - bioinformatics style!

In [5]:
# Your first bioinformatics program!
print("Hello Bioinformatics!")

Hello Bioinformatics!


#### Strings

A string is text inside quotes: `"ATGCGT"` or `'ATGCGT'`. DNA, RNA, and protein sequences are stored as strings. You can index, slice, and count characters.
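
As a quick preview of those three operations, here is a minimal sketch. The sequence and the variable names (`seq`, `first_base`, `first_codon`, `g_count`) are made up for illustration; indexing and slicing are covered properly in the next lesson.

```python
# Hypothetical example sequence, not taken from a real gene
seq = "ATGCGT"

first_base = seq[0]        # indexing starts at 0, so this is "A"
first_codon = seq[0:3]     # slicing takes positions 0, 1 and 2: "ATG"
g_count = seq.count("G")   # how many times "G" appears: 2

print(first_base, first_codon, g_count)
```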

#### 🧪 Working with DNA Sequences

In Python, we store DNA sequences as **strings** (text).

#### Creating a DNA Variable

In [6]:
# Store a DNA sequence in a variable
dna = "ATGCGT"

print("DNA:", dna)
print("Type:", type(dna))

DNA: ATGCGT
Type: <class 'str'>


In Python, f-strings (formatted string literals) provide a clean and fast way to insert variables or expressions directly inside a string. The leading `f` tells Python this is a formatted string.

Inside the string, anything inside `{ }` is evaluated as a Python expression, and the result is inserted as text.

`len(dna)` returns the length of the string `dna`, which for a DNA sequence is the number of bases.
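
To see that the braces can hold a whole expression and not just a variable name, here is a small sketch. The names `gc` and `message` are chosen only for this illustration.

```python
dna = "ATGCGT"

# Any expression can go inside the braces, including function calls
gc = dna.count("G") + dna.count("C")
message = f"{dna} has {len(dna)} bases, {gc} of them G or C"

print(message)
```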

#### 📏 Finding the Length of DNA

Use the `len()` function to count how many bases are in a sequence:

In [8]:
# Count the number of bases
dna = "ATGCGT"
length = len(dna)

print(f"DNA: {dna}")
print(f"Length: {length} bases")
# Same result with string concatenation; the number must be converted with str() first
print("Length: " + str(length) + " bases")


DNA: ATGCGT
Length: 6 bases
Length: 6 bases


Before f-strings and `.format()`, Python strings were formatted with the percent operator (`%`). `%s` is a placeholder inside the string:

- `%` → insert a value here
- `s` → format it as a string

The value after the `%` operator (`length`) is inserted where `%s` appears.

In [9]:
length = 150
print("Length: %s bases" % length)


Length: 150 bases
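
For comparison, the sketch below builds the same text with all three formatting styles mentioned so far. The variable names are made up for this example; in new code, f-strings are usually the most readable choice.

```python
length = 150

percent_style = "Length: %s bases" % length        # older percent operator
format_style = "Length: {} bases".format(length)   # str.format() method
fstring_style = f"Length: {length} bases"          # f-string

print(percent_style)
print(format_style)
print(fstring_style)
```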


#### 🔢 Counting Specific Nucleotides

Python strings have a `.count()` method to count occurrences:

In [10]:
# Count individual bases
dna = "ATGCGTATGC"

count_a = dna.count("A")
count_t = dna.count("T")
count_g = dna.count("G")
count_c = dna.count("C")

print(f"DNA Sequence: {dna}")
print(f"Length: {len(dna)} bases")
print(f"\nNucleotide Counts:")
print(f"  A: {count_a}")
print(f"  T: {count_t}")
print(f"  G: {count_g}")
print(f"  C: {count_c}")
print(f"\nTotal: {count_a + count_t + count_g + count_c}")

DNA Sequence: ATGCGTATGC
Length: 10 bases

Nucleotide Counts:
  A: 2
  T: 3
  G: 3
  C: 2

Total: 10
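
Calling `.count()` four times works well for DNA. As an optional alternative, the standard library's `collections.Counter` can tally every character in one pass. This is only a sketch of another way to get the same numbers, not a replacement for the approach above.

```python
from collections import Counter

dna = "ATGCGTATGC"

# Counter builds a dictionary-like tally of every character in the string
counts = Counter(dna)

for base in "ATGC":
    print(f"{base}: {counts[base]}")
```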


#### 🎯 Your Turn - Practice Exercise

Try changing the DNA sequence and running the code:

In [11]:
# Create your own DNA sequence (at least 10 bases)
my_dna = "ATGCTAGCTA"  # Change this!

# Print the sequence
print(f"My DNA: {my_dna}")

# Calculate length
print(f"Length: {len(my_dna)} bases")

# Count each base
print(f"\nA: {my_dna.count('A')}")
print(f"T: {my_dna.count('T')}")
print(f"G: {my_dna.count('G')}")
print(f"C: {my_dna.count('C')}")

My DNA: ATGCTAGCTA
Length: 10 bases

A: 3
T: 3
G: 2
C: 2


#### 🔤 String Methods - Upper and Lower Case

DNA sequences are usually written in uppercase. Python can convert between cases: use `.upper()` or `.lower()` to convert a string to upper or lower case.

In [12]:
# Working with case
mixed_dna = "atgcGTac"

print(f"Original: {mixed_dna}")
print(f"Uppercase: {mixed_dna.upper()}")
print(f"Lowercase: {mixed_dna.lower()}")

Original: atgcGTac
Uppercase: ATGCGTAC
Lowercase: atgcgtac
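
Case matters when counting, because `.count("A")` does not match a lowercase `"a"`. The sketch below (variable names are illustrative) shows why it helps to normalize a sequence with `.upper()` first.

```python
mixed_dna = "atgcGTac"

raw_a = mixed_dna.count("A")                  # misses the lowercase "a" characters
normalized_a = mixed_dna.upper().count("A")   # counts every adenine

print(f"Without upper(): {raw_a}")
print(f"With upper():    {normalized_a}")
```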


#### 🔣 Escape Sequences

Sometimes you need to include special characters in strings. Python uses **escape sequences** starting with a backslash (`\`):

| Escape Sequence | Meaning |
|-----------------|---------|
| `\\` | Backslash (keep a \) |
| `\'` | Single quote (keeps the ') |
| `\"` | Double quote (keeps the ") |
| `\n` | Newline (creates a new line) |
| `\t` | Tab (creates horizontal spacing) |

Let's see how they work:

In [14]:
# Examples of escape sequences
print("DNA sequences can have descriptions:")
print("Gene: 'insulin'\tOrganism: Human\nFunction: Regulates blood sugar")

print("\n" + "="*50 + "\n")

# Using quotes inside strings
print("The scientist said, \"DNA is amazing!\"")
print('It\'s important to understand escape sequences')

print("\n" + "="*50 + "\n")

# Tabs for indentation
dna = "ATGC"
print("DNA sequence:")
print("\tA - Adenine")
print("\tT - Thymine")
print("\tG - Guanine")
print("\tC - Cytosine")

DNA sequences can have descriptions:
Gene: 'insulin'	Organism: Human
Function: Regulates blood sugar

==================================================

The scientist said, "DNA is amazing!"
It's important to understand escape sequences

==================================================

DNA sequence:
	A - Adenine
	T - Thymine
	G - Guanine
	C - Cytosine
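
An escape sequence is typed as two characters but stored as one. The sketch below checks this with `len()`. The file path is a made-up example, used only to show how `\\` produces a single backslash.

```python
tab = "\t"
backslash = "\\"

print(len(tab))        # a tab is a single character
print(len(backslash))  # an escaped backslash is also a single character

# Hypothetical Windows-style path: each \\ prints as one backslash
path = "C:\\data\\genome.fasta"
print(path)
```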


#### 🔍 Checking Python Version

Let's verify your Python installation:

In [15]:
import sys

print(f"Python version: {sys.version}")
print(f"\nYou're ready for bioinformatics! 🎉")

Python version: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ]

You're ready for bioinformatics! 🎉
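
If you want to check the version in code rather than by reading it, `sys.version_info` gives the numbers directly. The sketch below tests for Python 3.6 or newer, the first version that supports f-strings. The name `supports_fstrings` is made up for this example, and the block uses `%` formatting so that it would also run on an older interpreter.

```python
import sys

# version_info behaves like a tuple: (major, minor, micro, ...)
major, minor = sys.version_info[:2]
supports_fstrings = (major, minor) >= (3, 6)

print("Python %d.%d" % (major, minor))
print("f-strings available: %s" % supports_fstrings)
```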


#### 💡 Comparing Manual vs Programmatic Counting

**Manual Counting:**
- Slow for long sequences
- Error-prone
- Not practical for genomic data (millions of bases!)

**Programmatic Counting:**
- Fast (even for huge genomes)
- Accurate
- Reproducible

In [16]:
# Example: A longer sequence
long_dna = "ATGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC" * 10

print(f"Sequence length: {len(long_dna)} bases")
print(f"Number of 'A': {long_dna.count('A')}")
print(f"\nImagine counting this manually! 😅")

Sequence length: 480 bases
Number of 'A': 120

Imagine counting this manually! 😅
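
To see what `.count()` saves you from, here is a sketch of the same count written as an explicit loop, which mirrors what a person does when counting by hand. The 400-base sequence is made up for illustration, and loops are introduced properly in a later lesson.

```python
# Hypothetical 400-base sequence built by repetition
long_dna = "ATGC" * 100

# "Manual" approach: look at every base, one at a time
manual_count = 0
for base in long_dna:
    if base == "A":
        manual_count += 1

# Built-in approach: one method call
builtin_count = long_dna.count("A")

print(f"Loop count:     {manual_count}")
print(f".count() count: {builtin_count}")
```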


#### üßë‚Äçüî¨ Real-World Example

Let's analyze a short segment from a real gene:

In [17]:
# Partial sequence from human insulin gene
insulin_segment = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC"

print("Human Insulin Gene Segment")
print("=" * 50)
print(f"Sequence: {insulin_segment}")
print(f"\nLength: {len(insulin_segment)} bases")
print(f"\nBase Composition:")
print(f"  Adenine (A):  {insulin_segment.count('A'):>3}  ({insulin_segment.count('A')/len(insulin_segment)*100:.1f}%)")
print(f"  Thymine (T):  {insulin_segment.count('T'):>3}  ({insulin_segment.count('T')/len(insulin_segment)*100:.1f}%)")
print(f"  Guanine (G):  {insulin_segment.count('G'):>3}  ({insulin_segment.count('G')/len(insulin_segment)*100:.1f}%)")
print(f"  Cytosine (C): {insulin_segment.count('C'):>3}  ({insulin_segment.count('C')/len(insulin_segment)*100:.1f}%)")

Human Insulin Gene Segment
==================================================
Sequence: ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC

Length: 60 bases

Base Composition:
  Adenine (A):    4  (6.7%)
  Thymine (T):   13  (21.7%)
  Guanine (G):   20  (33.3%)
  Cytosine (C):  23  (38.3%)
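
The four nearly identical `print` lines above repeat the same calculation. One possible way to tidy this up is a small helper function, sketched below. The name `base_percentages` is hypothetical, and the function assumes a non-empty sequence (an empty string would cause a division by zero).

```python
def base_percentages(sequence):
    """Return a dict mapping each base to its percentage of the sequence.

    Assumes the sequence is not empty.
    """
    sequence = sequence.upper()   # normalize case before counting
    total = len(sequence)
    return {base: sequence.count(base) / total * 100 for base in "ATGC"}


insulin_segment = "ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC"
percentages = base_percentages(insulin_segment)

for base, pct in percentages.items():
    print(f"{base}: {pct:.1f}%")

# GC content is the combined share of G and C
gc_content = percentages["G"] + percentages["C"]
print(f"GC content: {gc_content:.1f}%")
```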


#### 📝 Exit Ticket Questions

Answer these in the cell below:

1. **Why does bioinformatics matter?**
2. **What Python function finds the length of a string?**
3. **What are the 4 DNA bases?**

#### Your Answers:

1. Bioinformatics matters because...

2. The function to find length is...

3. The 4 DNA bases are...

#### 🏠 Homework Challenge

**Research Assignment:**
1. Find one bioinformatics career (examples: computational biologist, genomics researcher, bioinformatics software engineer)
2. Write 3-4 sentences about what they do
3. Share one interesting fact you learned

**Coding Challenge:**
- Create a DNA sequence with at least 20 bases
- Calculate what percentage of the sequence is each nucleotide
- Format your output nicely with labels

In [None]:
# Homework coding space
# YOUR CODE HERE

#### 🎉 Congratulations!

You've completed Lesson 1! You now know:
- ✅ What bioinformatics is
- ✅ How to store DNA as strings
- ✅ How to count nucleotides
- ✅ Basic Python operations

**Next lesson:** We'll learn about indexing and slicing to extract specific parts of DNA sequences! 🧬