#### 🎲 Probability Basics

#### Key Concept:
**Probability = Frequency / Total**

If A appears 30 times in 100 bases:
- **Frequency:** 30
- **Probability:** 30/100 = 0.30 = 30%
- **Interpretation:** a base picked at random from this sequence has a 30% chance of being A

#### Why This Matters:
- Model random mutations
- Predict sequence patterns
- Identify unusual regions (e.g., candidate regulatory elements)
- Compare expected vs actual composition

#### 📊 Computing Frequency Distributions

In [None]:
def nucleotide_probabilities(seq):
    """
    Calculate probability of each nucleotide in a sequence.
    
    Args:
        seq (str): DNA sequence
    
    Returns:
        dict: Probability for each nucleotide
    """
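    seq = seq.upper()  # accept lowercase input
    total = len(seq)
    if total == 0:
        raise ValueError("Sequence must not be empty")
    probabilities = {}
    
    # Characters other than A, T, G, C (e.g. N) are counted in the total
    # but not reported, so the probabilities may then sum to less than 1
    for base in "ATGC":
        count = seq.count(base)
        probabilities[base] = count / total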
    
    return probabilities

# Test it
dna = "ATGCGCGCTAGCTAGC"
probs = nucleotide_probabilities(dna)

print(f"Sequence: {dna}")
print(f"Length: {len(dna)} bases\n")

print("Nucleotide Probabilities:")
for base, prob in probs.items():
    count = dna.count(base)
    print(f"  {base}: {count:2d}/{len(dna)} = {prob:.3f} ({prob*100:5.1f}%)")

# Verify probabilities sum to 1
total_prob = sum(probs.values())
print(f"\nTotal probability: {total_prob:.3f} (should be 1.000)")

#### 🎯 Expected vs Observed

What if all bases were equally likely?

In [None]:
def compare_to_uniform(seq):
    """
    Compare observed frequencies to uniform distribution.
    """
    observed = nucleotide_probabilities(seq)
    expected = 0.25  # Uniform: each base = 25%
    
    print(f"Sequence: {seq}")
    print(f"Length: {len(seq)} bases\n")
    
    print(f"{'Base':<6} {'Observed':<12} {'Expected':<12} {'Difference':<12}")
    print("=" * 50)
    
    for base in "ATGC":
        obs = observed[base]
        diff = obs - expected
        
        status = "✓" if abs(diff) < 0.05 else "⚠️"
        print(f"{base:<6} {obs:>6.3f} ({obs*100:>5.1f}%) {expected:>6.3f} (25.0%)  {diff:>+7.3f} {status}")

# Test with different sequences
sequences = [
    "ATGCATGCATGCATGC",  # Balanced
    "GCGCGCGCGCGCGCGC",  # GC-rich
    "ATATATATATATATAT"   # AT-rich
]

for seq in sequences:
    compare_to_uniform(seq)
    print()
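
The per-base differences above can also be condensed into one number. The sketch below computes a chi-square statistic against the uniform model; the function name `chi_square_vs_uniform` is an assumption made for this illustration and is not part of the lesson code. Larger values mean a stronger departure from 25% per base. With 3 degrees of freedom, a value above roughly 7.81 corresponds to the usual 5% significance level, although that approximation is unreliable for sequences as short as the ones used here.

```python
from collections import Counter

def chi_square_vs_uniform(seq):
    """Chi-square statistic of observed base counts against a uniform model."""
    if not seq:
        raise ValueError("Sequence must not be empty")
    counts = Counter(seq.upper())
    expected = len(seq) / 4  # expected count per base if all are equally likely
    # Sum of (observed - expected)^2 / expected over the four bases
    return sum((counts[base] - expected) ** 2 / expected for base in "ATGC")

for seq in ["ATGCATGCATGCATGC", "GCGCGCGCGCGCGCGC"]:
    print(f"{seq}: chi-square = {chi_square_vs_uniform(seq):.1f}")
```

A perfectly balanced sequence scores 0, and the score grows as the composition becomes more skewed.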

#### 🎲 Random Sequence Generation

Generate sequences based on probability models:

In [None]:
import random

def generate_random_sequence(length, probabilities=None):
    """
    Generate a random DNA sequence.
    
    Args:
        length (int): Length of sequence to generate
        probabilities (dict): Optional base probabilities
    
    Returns:
        str: Random DNA sequence
    """
    if probabilities is None:
        # Uniform distribution
        bases = "ATGC"
        return "".join(random.choice(bases) for _ in range(length))
    else:
        # Weighted distribution
        bases = list(probabilities.keys())
        weights = list(probabilities.values())
        return "".join(random.choices(bases, weights=weights, k=length))

# Generate uniform random sequence
random_seq = generate_random_sequence(50)
print("Uniformly Random Sequence (50 bp):")
print(random_seq)
print(f"\nProbabilities:")
for base, prob in nucleotide_probabilities(random_seq).items():
    print(f"  {base}: {prob:.3f}")

# Generate GC-rich sequence
print("\n" + "="*50)
gc_rich_probs = {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35}
gc_rich_seq = generate_random_sequence(50, gc_rich_probs)
print("\nGC-Rich Random Sequence (50 bp):")
print(gc_rich_seq)
print(f"\nProbabilities:")
for base, prob in nucleotide_probabilities(gc_rich_seq).items():
    print(f"  {base}: {prob:.3f}")
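
Random sequences differ on every run, which makes results hard to compare. One common remedy is to seed the generator. The sketch below is an assumption-laden variant rather than the lesson's own function: `generate_seeded_sequence` and its `seed` parameter are hypothetical names, and it uses a private `random.Random` instance so the global random state is left alone.

```python
import random

def generate_seeded_sequence(length, seed, probabilities=None):
    """Generate a reproducible random DNA sequence from a given seed."""
    rng = random.Random(seed)  # private generator; global state is untouched
    if probabilities is None:
        # Uniform model: every base equally likely
        probabilities = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
    bases = list(probabilities)
    weights = list(probabilities.values())
    return "".join(rng.choices(bases, weights=weights, k=length))

first = generate_seeded_sequence(20, seed=42)
second = generate_seeded_sequence(20, seed=42)
print(first)
print("Same seed, same sequence:", first == second)
```

Changing the seed gives a different sequence, while reusing it reproduces the same one.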

#### 🧬 Mutation Simulation

Simulate random point mutations:

In [None]:
def simulate_mutations(seq, num_mutations):
    """
    Simulate random point mutations in a sequence.
    
    Args:
        seq (str): Original DNA sequence
        num_mutations (int): Number of mutations to introduce
    
    Returns:
        tuple: (mutated_sequence, list_of_changes)
    """
    bases = "ATGC"
    seq_list = list(seq)
    changes = []
    
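    # Sample distinct positions so that no site is mutated twice
    # (a repeated hit could revert or overwrite an earlier change).
    # random.sample raises ValueError if num_mutations > len(seq).
    for pos in random.sample(range(len(seq)), num_mutations):
        old_base = seq_list[pos]
        
        # Choose different base
        new_base = random.choice([b for b in bases if b != old_base])
        
        seq_list[pos] = new_base
        changes.append((pos, old_base, new_base))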
    
    return "".join(seq_list), changes

# Test mutation simulation
original = "ATGCGCGCTAGCTAGC"
mutated, changes = simulate_mutations(original, 3)

print("Mutation Simulation")
print("=" * 50)
print(f"Original:  {original}")
print(f"Mutated:   {mutated}")
print(f"\nChanges made:")
for pos, old, new in changes:
    print(f"  Position {pos:2d}: {old} → {new}")

#### 📊 Mutation Rate Analysis

In [None]:
def calculate_mutation_rate(original, mutated):
    """
    Calculate the mutation rate between two sequences.
    
    Args:
        original (str): Original sequence
        mutated (str): Mutated sequence
    
    Returns:
        tuple: (rate, differences, positions), where rate is the
            fraction of positions that differ
    """
    if len(original) != len(mutated):
        raise ValueError("Sequences must be same length")
    
    differences = 0
    positions = []
    
    for i, (base1, base2) in enumerate(zip(original, mutated)):
        if base1 != base2:
            differences += 1
            positions.append(i)
    
    rate = differences / len(original)
    
    return rate, differences, positions

# Test with multiple mutation levels
original = "ATGCGCGCTAGCTAGCATGC"

for n in [1, 3, 5]:
    mutated, _ = simulate_mutations(original, n)
    rate, diff, positions = calculate_mutation_rate(original, mutated)
    
    print(f"\n{n} Mutation(s):")
    print(f"  Original: {original}")
    print(f"  Mutated:  {mutated}")
    print(f"  Differences: {diff}")
    print(f"  Mutation rate: {rate:.3f} ({rate*100:.1f}%)")
    print(f"  Positions: {positions}")

#### 🎯 Mini-Challenge: Composition Checker

Create a function that checks if a sequence composition is unusual:

In [None]:
def check_unusual_composition(seq, threshold=0.1):
    """
    Check if any nucleotide is unusually over/under-represented.
    
    Args:
        seq (str): DNA sequence
        threshold (float): Deviation from 0.25 to flag as unusual
    
    Returns:
        bool: True if composition is unusual
    """
    probs = nucleotide_probabilities(seq)
    expected = 0.25
    
    print(f"Analyzing: {seq}")
    print(f"Length: {len(seq)} bases\n")
    
    unusual = False
    
    for base, prob in probs.items():
        deviation = prob - expected  # signed: positive means over-represented
        
        if abs(deviation) > threshold:
            unusual = True
            status = "⚠️  UNUSUAL"
        else:
            status = "✓  Normal"
        
        print(f"{base}: {prob:.3f} (deviation: {deviation:+.3f}) {status}")
    
    print(f"\nOverall: {'UNUSUAL composition' if unusual else 'Normal composition'}")
    return unusual

# Test with different sequences
test_seqs = [
    "ATGCATGCATGCATGC",  # Balanced
    "GGGGGGGGGGGGGGGG",  # All G
    "ATATATATATATATAT"   # Only AT
]

for seq in test_seqs:
    check_unusual_composition(seq)
    print("\n" + "="*50 + "\n")

#### 🧮 Probability of Finding a Motif

What's the probability that a specific motif appears at a given position by chance, assuming each base is drawn independently?

In [None]:
def motif_probability(motif, base_probs=None):
    """
    Calculate probability that a motif occurs at one given position
    by random chance, assuming bases are independent.
    
    Args:
        motif (str): Sequence motif
        base_probs (dict): Base probabilities (default: uniform)
    
    Returns:
        float: Probability of the motif at a single position
    """
    if base_probs is None:
        base_probs = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}
    
    prob = 1.0
    for base in motif:
        prob *= base_probs[base]
    
    return prob

# Test with different motifs
motifs = ["ATG", "TATA", "ATGCGCTA"]

print("Motif Probability (Uniform Distribution):\n")
for motif in motifs:
    prob = motif_probability(motif)
    print(f"{motif:12} → {prob:.6f} (1 in {1/prob:.0f})")

# With GC-rich distribution
gc_rich = {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35}
print("\nMotif Probability (GC-Rich Distribution):\n")
for motif in motifs:
    prob = motif_probability(motif, gc_rich)
    print(f"{motif:12} → {prob:.6f} (1 in {1/prob:.0f})")
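
The value returned by `motif_probability` applies to one starting position. A longer sequence offers many starting positions, so the expected number of chance hits is the per-position probability times the number of windows, `seq_length - len(motif) + 1`. The sketch below illustrates this; `expected_motif_count` is a hypothetical helper name, and it assumes independent bases, as in the cell above.

```python
def expected_motif_count(motif, seq_length, base_probs=None):
    """Expected number of chance occurrences of a motif in a random sequence."""
    if base_probs is None:
        base_probs = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

    # Probability of the motif at one fixed position (independent bases)
    per_position = 1.0
    for base in motif:
        per_position *= base_probs[base]

    # Number of positions where the motif could start; 0 if it does not fit
    windows = max(seq_length - len(motif) + 1, 0)
    return windows * per_position

for motif in ["ATG", "TATA", "ATGCGCTA"]:
    count = expected_motif_count(motif, 1000)
    print(f"{motif:12} expected hits in 1000 bp: {count:.3f}")
```

Short motifs are expected to turn up by chance many times in a long sequence, so finding one is weak evidence on its own.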

#### 🎯 Practice Exercise

Complete this challenge:

In [None]:
# Challenge: Weighted Mutation Simulator
# Task: Create mutations that prefer certain base changes
#       (e.g., transitions: A↔G, C↔T are more common than transversions)

def weighted_mutation(seq, num_mutations, transition_bias=0.7):
    """
    Simulate mutations with transition bias.
    
    Args:
        seq (str): DNA sequence
        num_mutations (int): Number of mutations
        transition_bias (float): Probability of transition (vs transversion)
    
    Returns:
        tuple: (mutated_sequence, changes_list)
    """
    # YOUR CODE HERE
    # Hint: Transitions are A↔G and C↔T
    # Hint: Transversions are A↔C, A↔T, G↔C, G↔T
    pass

# Test it
original = "ATGCGCGCTAGCTAGC"
# YOUR TEST CODE HERE
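
One possible starting point for the challenge, not a full solution: two small helpers that encode the transition and transversion pairs from the hints. The names `TRANSITION_PARTNER`, `is_transition`, and `transversion_targets` are assumptions made for this sketch.

```python
# Each base has exactly one transition partner (purine<->purine, pyrimidine<->pyrimidine)
TRANSITION_PARTNER = {"A": "G", "G": "A", "C": "T", "T": "C"}

def is_transition(old_base, new_base):
    """Return True if old_base -> new_base is a transition."""
    return TRANSITION_PARTNER.get(old_base) == new_base

def transversion_targets(base):
    """Return the two bases reachable from `base` by a transversion."""
    partner = TRANSITION_PARTNER[base]
    return [b for b in "ATGC" if b != base and b != partner]

print(is_transition("A", "G"))
print(is_transition("A", "T"))
print(transversion_targets("A"))
```

Inside `weighted_mutation`, a random draw compared against `transition_bias` could then decide which of the two kinds of change to apply at each position.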

#### 🤔 Reflection Questions

1. If A appears 30 times in 100 bases, what's P(A)?
2. Explain one limitation of treating mutations as uniformly random.
3. Why might real genomes not have uniform base probabilities?
4. How could probability help identify important genomic regions?

#### 🏠 Homework

1. Generate 100 random sequences (length 50 each)
2. Calculate average GC content across all sequences
3. Compare to a single real gene sequence
4. Determine if the real gene is unusual compared to random expectation

In [None]:
# Homework coding space
# YOUR CODE HERE
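
A possible building block for step 2 of the homework: GC content is the fraction of bases that are G or C. The helper name `gc_content` is an assumption for this sketch, and the averaging and comparison steps are left for you to write.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("Sequence must not be empty")
    seq = seq.upper()  # accept lowercase input
    return (seq.count("G") + seq.count("C")) / len(seq)

for seq in ["ATGCATGC", "GGCCGGCC", "ATATATAT"]:
    print(f"{seq}: GC content = {gc_content(seq):.2f}")
```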

#### 🎉 Summary

You've learned:
- ✅ Computing nucleotide frequency distributions
- ✅ Converting frequencies to probabilities
- ✅ Expected vs observed patterns
- ✅ Generating random sequences based on probability models
- ✅ Simulating mutations
- ✅ Calculating mutation rates
- ✅ Identifying unusual sequence composition

**Next lesson:** More advanced mutation simulation! 🧬