#### 🧬 Similarity vs Identity

#### Key Concepts:
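- **Similarity** - How alike two sequences are; besides exact matches, it can count substitutions that tend to preserve function (for example, chemically similar amino acids)
- **Identity** - The fraction of aligned positions where both sequences have exactly the same residue
- **Distance** - How different two sequences are (for example, the number or fraction of mismatched positions)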
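The gap between identity and similarity is easiest to see on a short protein fragment. The sketch below is only illustrative: the property groups in `GROUPS`, the helper names, and the two peptide strings are assumptions made for this example, not a standard substitution matrix.

```python
# Toy grouping of amino acids by chemical property (an illustrative assumption,
# not a standard scoring matrix).
GROUPS = [
    set("AVLIM"),  # aliphatic / hydrophobic
    set("FWY"),    # aromatic
    set("KRH"),    # positively charged
    set("DE"),     # negatively charged
    set("STNQ"),   # polar, uncharged
]

def same_group(x, y):
    """Return True if both residues fall in the same toy property group."""
    return any(x in g and y in g for g in GROUPS)

def identity_and_similarity(a, b):
    """Return (percent identity, percent similarity) for equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("Sequences must be non-empty and the same length")
    identical = sum(x == y for x, y in zip(a, b))
    # A position counts as similar if it is identical or in the same group
    similar = sum(x == y or same_group(x, y) for x, y in zip(a, b))
    return 100 * identical / len(a), 100 * similar / len(a)

# Hypothetical peptides: K/R and V/I are conservative swaps, E/Q is not
ident, simil = identity_and_similarity("MKVLDE", "MRILDQ")
print(f"Identity: {ident:.1f}%  Similarity: {simil:.1f}%")
```

Under these assumed groups, similarity is never lower than identity, because every identical position also counts as similar.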

#### Example:
```
Seq1: ATGCGT
Seq2: ATGCTT
      ||||.|

Matches: 5/6 = 83.3% identity
Differences: 1/6 = 16.7% distance
```
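The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the variable names are chosen here for illustration.

```python
seq1 = "ATGCGT"
seq2 = "ATGCTT"

# Marker line: "|" for a match, "." for a mismatch (same convention as above)
markers = "".join("|" if x == y else "." for x, y in zip(seq1, seq2))
matches = markers.count("|")

identity = 100 * matches / len(seq1)
p_distance = 100 - identity  # share of differing positions, as a percentage

print(markers)
print(f"Identity: {identity:.1f}%  Distance: {p_distance:.1f}%")
```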

#### 📏 Hamming Distance

**Hamming Distance** = Number of positions where sequences differ

**Limitation:** Only works for equal-length sequences (no gaps allowed)

In [None]:
def hamming(a, b):
    """
    Calculate Hamming distance between two sequences.
    
    Args:
        a, b (str): Sequences to compare
    
    Returns:
        int: Number of differing positions
    """
    if len(a) != len(b):
        raise ValueError(f"Lengths differ: {len(a)} vs {len(b)}")
    
    diff = 0
    for x, y in zip(a, b):
        if x != y:
            diff += 1
    
    return diff

# Test it
seq1 = "ATGCGTAA"
seq2 = "ATGCTTAA"

distance = hamming(seq1, seq2)
print(f"Sequence 1: {seq1}")
print(f"Sequence 2: {seq2}")
print(f"\nHamming distance: {distance}")
print(f"Identity: {100 * (1 - distance/len(seq1)):.1f}%")

#### 🎨 Visual Alignment

In [None]:
def show_alignment(a, b):
    """
    Display visual alignment of two sequences.
    """
    if len(a) != len(b):
        print("Error: Sequences must be same length!")
        return
    
    matches = sum(1 for x, y in zip(a, b) if x == y)
    differences = len(a) - matches
    identity = (matches / len(a)) * 100
    
    # Create marker line
    markers = "".join("|" if x == y else "X" for x, y in zip(a, b))
    
    print("Sequence Alignment:")
    print("=" * 50)
    print(f"Seq1: {a}")
    print(f"      {markers}")
    print(f"Seq2: {b}")
    print("=" * 50)
    print(f"Length: {len(a)}")
    print(f"Matches: {matches}")
    print(f"Differences: {differences}")
    print(f"Identity: {identity:.1f}%")

# Test with different pairs
pairs = [
    ("ATGCGTAA", "ATGCTTAA"),
    ("ATGCGTAA", "ATGCGTAA"),
    ("ATGCGTAA", "TACGCATT")
]

for seq1, seq2 in pairs:
    show_alignment(seq1, seq2)
    print("\n")

#### 📊 Percent Identity Calculator

In [None]:
def percent_identity(a, b):
    """
    Calculate percent identity between two sequences.
    
    Args:
        a, b (str): Sequences to compare
    
    Returns:
        float: Percent identity (0-100)
    """
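    if len(a) != len(b):
        raise ValueError("Sequences must be same length")
    if not a:
        # Avoid a ZeroDivisionError on empty input
        raise ValueError("Sequences must not be empty")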
    
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return (matches / len(a)) * 100

# Test with various identity levels
test_pairs = [
    ("ATGCGTAA", "ATGCGTAA", "Identical"),
    ("ATGCGTAA", "ATGCTTAA", "One mismatch (position 5)"),
    ("ATGCGTAA", "ATCCGTAA", "One mismatch (position 3)"),
    ("ATGCGTAA", "TACGCATT", "No matches (base-by-base complement)")
]

print("Percent Identity Comparison:")
print("=" * 60)

for seq1, seq2, description in test_pairs:
    identity = percent_identity(seq1, seq2)
    distance = hamming(seq1, seq2)
    print(f"\n{description}:")
    print(f"  Seq1: {seq1}")
    print(f"  Seq2: {seq2}")
    print(f"  Identity: {identity:.1f}%")
    print(f"  Distance: {distance}")

#### 🔢 Pairwise Comparison Matrix

In [None]:
def compare_all(sequences, names=None):
    """
    Create a pairwise comparison matrix.
    
    Args:
        sequences (list): List of sequences
        names (list): Optional names for sequences
    """
    n = len(sequences)
    
    if names is None:
        names = [f"Seq{i+1}" for i in range(n)]
    
    # Print header
    print("Pairwise Identity Matrix (%):\n")
    print(f"{'':10}", end="")
    for name in names:
        print(f"{name:>10}", end="")
    print()
    print("=" * (10 + 10 * n))
    
    # Calculate and print matrix
    for i, (seq_i, name_i) in enumerate(zip(sequences, names)):
        print(f"{name_i:10}", end="")
        for j, seq_j in enumerate(sequences):
            if len(seq_i) != len(seq_j):
                print(f"{'N/A':>10}", end="")
            else:
                identity = percent_identity(seq_i, seq_j)
                print(f"{identity:>10.1f}", end="")
        print()

# Test with 4 sequences
sequences = [
    "ATGCGTAA",
    "ATGCTTAA",
    "ATCCGTAA",
    "TACGCATT"
]

names = ["Gene_A", "Gene_B", "Gene_C", "Gene_D"]

compare_all(sequences, names)

print("\n\nInterpretation (rough rules of thumb; only meaningful for much longer sequences):")
print("  100% = Identical sequences")
print("  >80% = Highly similar (often homologs)")
print("  50-80% = Moderately similar")
print("  <50% = Distantly related or unrelated")

#### 🧬 Finding Most Similar Pair

In [None]:
def find_most_similar(sequences, names=None):
    """
    Find the most similar pair of sequences.
    """
    if names is None:
        names = [f"Seq{i+1}" for i in range(len(sequences))]
    
    best_identity = -1
    best_pair = None
    
    # Compare all pairs
    for i in range(len(sequences)):
        for j in range(i+1, len(sequences)):
            if len(sequences[i]) == len(sequences[j]):
                identity = percent_identity(sequences[i], sequences[j])
                if identity > best_identity:
                    best_identity = identity
                    best_pair = (i, j)
    
    if best_pair:
        i, j = best_pair
        print("Most Similar Pair:")
        print("=" * 50)
        print(f"{names[i]}: {sequences[i]}")
        print(f"{names[j]}: {sequences[j]}")
        print(f"\nIdentity: {best_identity:.1f}%")
        print("\nThis pair has the highest percent identity (on a tie, the first pair found is reported).")
    else:
        print("No comparable pairs (different lengths)")

# Test it
sequences = [
    "ATGCGTAA",
    "ATGCTTAA",  # Very similar to first
    "ATCCGTAA",
    "TACGCATT"
]

names = ["E.coli_geneA", "Salmonella_geneA", "Yersinia_geneA", "Distant_gene"]

find_most_similar(sequences, names)

#### 🎯 Simple Scoring System

In [None]:
def score_alignment(a, b, match=1, mismatch=-1):
    """
    Score an alignment with match/mismatch values.
    
    Args:
        a, b (str): Sequences
        match (int): Score for matching position
        mismatch (int): Penalty for mismatching position
    
    Returns:
        int: Total alignment score
    """
    if len(a) != len(b):
        raise ValueError("Sequences must be same length")
    
    score = 0
    for x, y in zip(a, b):
        if x == y:
            score += match
        else:
            score += mismatch
    
    return score

# Test different scoring schemes
seq1 = "ATGCGTAA"
seq2 = "ATGCTTAA"

print(f"Seq1: {seq1}")
print(f"Seq2: {seq2}\n")

print("Different Scoring Schemes:")
print("=" * 50)

schemes = [
    (1, -1, "Match: +1, Mismatch: -1"),
    (2, -1, "Match: +2, Mismatch: -1"),
    (1, -2, "Match: +1, Mismatch: -2"),
]

for match, mismatch, desc in schemes:
    score = score_alignment(seq1, seq2, match, mismatch)
    print(f"{desc:30} Score: {score:+3d}")

#### 🔬 Real-World Example: Comparing Species

In [None]:
# Compare the same gene region across different species
# (These are hypothetical partial sequences for illustration, not real insulin gene data)

insulin_sequences = {
    "Human":   "ATGGGCTCC CTGGTGATTG CGGCTCTCTG GGGCGCCGA",
    "Mouse":   "ATGGGTTCC CTGGTGATCG CGGCTCTCTG GGGCGCCGA",
    "Rat":     "ATGGGTTCC CTGGTGATCG CGGCTCTCTG GGGCGCCGA",
    "Chicken": "ATGGGCTCC CTGGTGATTG CCGCTCTCTG GGGAGCCGA"
}

# Remove spaces
insulin_sequences = {k: v.replace(" ", "") for k, v in insulin_sequences.items()}

print("Insulin Gene Comparison Across Species\n")

species = list(insulin_sequences.keys())
sequences = list(insulin_sequences.values())

# Show all sequences
for name, seq in insulin_sequences.items():
    print(f"{name:10} {seq}")

print("\n" + "=" * 60 + "\n")

# Compare human to each other species
human_seq = insulin_sequences["Human"]

for species_name, species_seq in insulin_sequences.items():
    if species_name != "Human":
        identity = percent_identity(human_seq, species_seq)
        distance = hamming(human_seq, species_seq)
        print(f"Human vs {species_name}:")
        print(f"  Identity: {identity:.1f}%")
        print(f"  Differences: {distance} positions\n")

#### 🎯 Practice Exercise

In [None]:
# YOUR TURN: Create a function that finds the least similar pair

def find_least_similar(sequences, names=None):
    """
    Find the least similar (most distant) pair of sequences.
    """
    # YOUR CODE HERE
    pass

# Test it
sequences = [
    "ATGCGTAA",
    "ATGCTTAA",
    "ATCCGTAA",
    "TACGCATT"
]

find_least_similar(sequences)

#### 📐 Limitations of Hamming Distance

#### Key Limitations:
1. **No gaps allowed** - Can't handle insertions/deletions
2. **Equal length required** - Can't compare sequences of different lengths
3. **Position-dependent** - Doesn't account for shifts: one inserted or deleted base misaligns every position after it

#### Example Problem:
```
Seq1: ATGCGTAA
Seq2: ATGCGT--  (2 bases deleted)

Hamming distance: Can't calculate! Different lengths.
```
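The position-dependence problem shows up even when the lengths match. In the minimal sketch below, `hamming_sketch` is a compact stand-in written for this example, and the "mutant" is a hypothetical sequence built by inserting one base at the start and dropping the last base to keep the length equal.

```python
def hamming_sketch(a, b):
    """Count mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))

original = "ATGCGTAA"
# Hypothetical mutant: one "G" inserted at the start, last base dropped
# so that the lengths still match.
shifted = ("G" + original)[:len(original)]

print(original)
print(shifted)
print("Hamming distance:", hamming_sketch(original, shifted))
```

A single insertion event makes 7 of the 8 positions mismatch, even though the two sequences share a 7-base stretch (`ATGCGTA`) that is merely shifted by one position.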

#### Solution:
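Alignment algorithms such as **Needleman-Wunsch** (global) and **Smith-Waterman** (local) allow gaps, so they can compare sequences of different lengths. **BLAST** is a fast heuristic that searches large databases for local alignments.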
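To give a feel for how a gap-aware method differs from Hamming distance, here is a minimal sketch of Smith-Waterman scoring. It returns only the best local alignment score, with no traceback, and uses a linear gap penalty. The function name and the default values for `match`, `mismatch`, and `gap` are assumptions chosen for illustration; real tools use tuned scoring schemes.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score (score only, no traceback)."""
    # prev[j] is the best score of a local alignment ending at the
    # previous character of a and the j-th character of b.
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            up = prev[j] + gap        # x aligned to a gap in b
            left = curr[j - 1] + gap  # y aligned to a gap in a
            cell = max(0, diag, up, left)  # 0 lets a new local alignment start
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

# Different lengths are fine: the shared ATGCGT stretch scores 6
print(smith_waterman_score("ATGCGTAA", "ATGCGT"))
# One base deleted in the middle: 7 matches and 1 gap also score 6
print(smith_waterman_score("ATGCGTAA", "ATGGTAA"))
```

Because every cell is floored at zero, the score reflects the best-matching region rather than the whole sequence, which is what makes the alignment local.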

#### 🤔 Reflection Questions

1. If two sequences differ at 2 positions out of 10, what is the percent identity?
2. Write one limitation of Hamming distance vs real sequence alignment.
3. What does BLAST fundamentally attempt to do?
4. Why might high sequence similarity suggest an evolutionary relationship?

#### Your Answers:

1. 

2. 

3. 

4. 

#### 🏠 Homework Challenge

1. Find 3-4 real DNA sequences from NCBI for the same gene in different species
2. Trim them to equal length
3. Calculate all pairwise identities
4. Create a similarity matrix
5. Determine which species are most closely related
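For steps 1 and 2, one possible starting point is sketched below. It assumes you paste FASTA-formatted text into a string. The helper names `parse_fasta` and `trim_to_shortest` are hypothetical, and the two records are placeholders, not real NCBI data.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name
            name = (line[1:].split() or ["unnamed"])[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def trim_to_shortest(records):
    """Cut every sequence to the length of the shortest one."""
    shortest = min(len(seq) for seq in records.values())
    return {name: seq[:shortest] for name, seq in records.items()}

# Placeholder records, NOT real NCBI data: replace with your downloaded FASTA text.
fasta_text = """>species_A example header
ATGCGTAAGG
CTA
>species_B example header
ATGCTTAAGG
"""

trimmed = trim_to_shortest(parse_fasta(fasta_text))
for name, seq in trimmed.items():
    print(name, seq)
```

Trimming from the start is a crude shortcut: it only gives meaningful identities if the sequences begin at the same position in the gene.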

In [None]:
# Homework coding space
# YOUR CODE HERE

#### 🎉 Summary

You've learned:
- ✅ Difference between similarity and identity
- ✅ How to calculate Hamming distance
- ✅ Computing percent identity
- ✅ Creating pairwise comparison matrices
- ✅ Simple alignment scoring
- ✅ Limitations of simple distance metrics
- ✅ Finding most/least similar sequences

**What's Next?**
- Learn about more sophisticated alignment algorithms (Smith-Waterman, BLAST)
- Study phylogenetic tree construction
- Explore real genomic databases
- Apply these skills to your capstone project! 🚀