# Module 3: Assignment A #
---

## Limitations of the Hamming Distance

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. In other words, it is the minimum number of substitutions required to change one string into the other. The Hamming distance is named after Richard Hamming, who introduced the concept in the context of error-correcting codes.

For example, consider two binary strings,
`1100111001` and `0100010101`. The Hamming distance is 4, because the symbols
mismatch at positions 1, 5, 7 and 8.


```
1100111001
0100010101
-+++-+--++
```

If we consider two DNA sequences, `AGCTAAC` and `AACTCCA`, the Hamming
distance between them is 4, because the symbols mismatch at positions 2, 5, 6 and 7.

```
AGCTAAC
AACTCCA
+-++---
```


Write a function that computes the Hamming distance between two sequences. 

In [1]:
# Hamming Distance

def hamming_distance(seq1, seq2):
    '''
    This function computes the Hamming Distance between two sequences.

    Input: 
        seq1, seq2 (str): the two sequences to compare. Must be of equal length.

    Returns:
        distance (int): the number of mismatches between the sequences
    '''
    # Check that the given sequences are of equal length
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")

    distance = 0
    
    # pair up the symbols at corresponding positions of the two sequences
    combined_seq = zip(seq1, seq2)

    # comparing corresponding elements from both sequences 
    for n1, n2 in combined_seq:
        # count mismatches between positions
        if n1 != n2: 
            distance += 1
    return distance


In [2]:
# From the given example:
seq1 = "AGCTAAC"
seq2 = "AACTCCA"
distance = hamming_distance(seq1, seq2)

print(f"Sequences:\n\n{seq1}\n{seq2}\n\nHamming Distance: {distance}")

Sequences:

AGCTAAC
AACTCCA

Hamming Distance: 4
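
As a quick sanity check, the function can also be run against the binary example from the prompt and against a pair of unequal length. The sketch below repeats a compact version of `hamming_distance` so that it runs on its own; the `sum(...)` form is an assumption of this sketch and is meant to behave the same way as the loop above.

```python
def hamming_distance(seq1, seq2):
    """Count mismatched positions between two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")
    # True counts as 1, so summing the comparisons counts the mismatches
    return sum(a != b for a, b in zip(seq1, seq2))

binary_distance = hamming_distance("1100111001", "0100010101")
dna_distance = hamming_distance("AGCTAAC", "AACTCCA")
print(binary_distance, dna_distance)  # 4 4

# Unequal lengths are rejected rather than silently truncated
try:
    hamming_distance("AGCT", "AGC")
except ValueError as err:
    print(f"Rejected: {err}")
```

Both worked examples from the prompt give a distance of 4, which matches the positions listed in the text.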


Use your Hamming Distance function to answer the following question: Is a hippopotamus more closely related to a pig or to a whale? 


Use the following sequences of Hemoglobin subunit beta from each organism for your assessment. Hemoglobin represents a model protein used to study molecular adaptation in vertebrates. You may chop up the sequences however you see fit to ensure they are of equal length. Please explain your approach to normalizing the sequence length and your rationale.


```
>PIG | A0A8D0NPF5_PIG Hemoglobin subunit beta
MVHLSAEESAAVLGLQGKVNMVELGGEALGRLLVVYPWTQRFFESFGDLSNADAVMGNPK
VKAHGKKVLQSFSDGLKHLDNLKGTFAKLSELHCDQLHVDPENFRLLGNVIVVILARRFG
NDFNPDLQAAFQKVVAGVANALAHYH

>WHALE | A0A140GN68_KOGSI Hemoglobin subunit beta
MVHLTAEEKSAVTALWAKVNVEEIGGEALGRLLVVYPWTQRFFEHFGDLSTADAVMKNAS
VKSHGKKVLASFSEGLKHFDNLKGTFAQLSELHCDKLHVDPENFRLLGNVLVVVLARHFG
KEFTPELQTAFQKLVAGVANALAHKYH

>HIPPOPOTAMUS | HBB_HIPAM Hemoglobin subunit beta
VHLTAEEKDAVLGLWGKVNVQEVGGEALGRLLVVYPWTQRFFESFGDLSSADAVMNNPKV
KAHGKKVLDSFADGLKHLDNLKGTFAALSELHCDQLHVDPENFRLLGNELVVVLARTFGK
EFTPELQAAYQKVVAGVANALAHRYH
```

What are some of the limitations you observe in using the Hamming Distance for biological sequence comparison?

_Answer here_

In [3]:
# Sequences
pig_seq = ("MVHLSAEESAAVLGLQGKVNMVELGGEALGRLLVVYPWTQRFFESFGDLSNADAVMGNPK"
           "VKAHGKKVLQSFSDGLKHLDNLKGTFAKLSELHCDQLHVDPENFRLLGNVIVVILARRFG"
           "NDFNPDLQAAFQKVVAGVANALAHYH")

pig_len = len(pig_seq)

whale_seq = ("MVHLTAEEKSAVTALWAKVNVEEIGGEALGRLLVVYPWTQRFFEHFGDLSTADAVMKNAS"
             "VKSHGKKVLASFSEGLKHFDNLKGTFAQLSELHCDKLHVDPENFRLLGNVLVVVLARHFG"
             "KEFTPELQTAFQKLVAGVANALAHKYH")

whale_len = len(whale_seq)

hippo_seq = ("VHLTAEEKDAVLGLWGKVNVQEVGGEALGRLLVVYPWTQRFFESFGDLSSADAVMNNPKV"
             "KAHGKKVLDSFADGLKHLDNLKGTFAALSELHCDQLHVDPENFRLLGNELVVVLARTFGK"
             "EFTPELQAAYQKVVAGVANALAHRYH")

hippo_len = len(hippo_seq)

In [4]:
print(f"pig_len: {pig_len}\nwhale_len: {whale_len}\nhippo_len: {hippo_len}\n")

pig_len: 146
whale_len: 147
hippo_len: 146



#### Since the sequences are not of equal length, we must normalize them.
Pick the minimum sequence length and trim the longer sequences to that size. The minimum sequence length is 146 (shared by the hippo and the pig).

### Normalization Approach: 
Sequences are aligned at the beginning of the sequence (N-terminal). The tails of the longer sequences are cut.

In [5]:
min_len = min(pig_len, whale_len, hippo_len)
print(f"minimum length of sequence: {min_len}")

minimum length of sequence: 146


### Normalize:
#### Cut the tail

In [6]:
# Normalize all sequences to the length of the shortest sequence
pig_seq_norm = pig_seq[:min_len]
whale_seq_norm = whale_seq[:min_len]
hippo_seq_norm = hippo_seq[:min_len]
#print(f"pig: {pig_seq_norm}\nwhale: {whale_seq_norm}\nhippo: {hippo_seq_norm}")

### Pairings:
- Pig and Whale
- Pig and Hippo
- Whale and Hippo

In [7]:
# Compute Hamming distances
pig_v_whale = hamming_distance(pig_seq_norm, whale_seq_norm)
pig_v_hippo = hamming_distance(pig_seq_norm, hippo_seq_norm)
whale_v_hippo = hamming_distance(whale_seq_norm, hippo_seq_norm)

# Output results
print(f"Hamming distance between Pig and Whale: {pig_v_whale}")
print(f"Hamming distance between Pig and Hippo: {pig_v_hippo}")
print(f"Hamming distance between Whale and Hippo: {whale_v_hippo}")


Hamming distance between Pig and Whale: 32
Hamming distance between Pig and Hippo: 130
Hamming distance between Whale and Hippo: 134


#### Comparing hippo to a pig vs. whale:
We want the smallest difference between sequences.

When aligning from the start, the hippopotamus appears more closely related to the pig (130) than to the whale (134), but overall the pig and whale (32) are more similar to each other than either is to the hippopotamus.

#### Cut the start

In [8]:
# Normalize all sequences to the length of the shortest sequence by trimming from the start
pig_start = pig_len - min_len
whale_start = whale_len - min_len
hippo_start = hippo_len - min_len

pig_seq_norm = pig_seq[pig_start:]
whale_seq_norm = whale_seq[whale_start:]
hippo_seq_norm = hippo_seq[hippo_start:]

# Compute Hamming distances
pig_v_whale = hamming_distance(pig_seq_norm, whale_seq_norm)
pig_v_hippo = hamming_distance(pig_seq_norm, hippo_seq_norm)
whale_v_hippo = hamming_distance(whale_seq_norm, hippo_seq_norm)

# Output results
print(f"Hamming distance between Pig and Whale: {pig_v_whale}")
print(f"Hamming distance between Pig and Hippo: {pig_v_hippo}")
print(f"Hamming distance between Whale and Hippo: {whale_v_hippo}")


Hamming distance between Pig and Whale: 131
Hamming distance between Pig and Hippo: 130
Hamming distance between Whale and Hippo: 24


#### Hippo Results:
When aligning from the end, the hippopotamus is more closely related to the whale (24) than to the pig (130). The whale and hippo distance is much smaller here than when comparing from the start (134).

This indicates that it is important to consider whether the start or the end of the sequence contains more significant information.

Based on the smaller Hamming distance, I conclude that the whale and the hippo are more closely related than the pig and the hippo. The differing results from the two normalizations make it difficult to give a definitive answer, but this highlights a limitation of the Hamming distance: alignment can have a large impact on the results.
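
One likely contributor to these swings is a positional offset: the pig and whale records begin with `M`, while the hippopotamus record begins directly with `VHL`. The toy example below is only a sketch with a made-up eight-residue sequence and a compact stand-in for `hamming_distance`; it suggests how a single extra leading residue can turn two otherwise identical sequences into a complete mismatch.

```python
def hamming_distance(seq1, seq2):
    """Count mismatched positions between two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")
    return sum(a != b for a, b in zip(seq1, seq2))

# Hypothetical sequences, chosen only to illustrate the offset effect
reference = "VHLTAEKD"
with_leading_m = "M" + reference[:-1]  # "MVHLTAEK": same residues, shifted by one

# Compared position by position, every residue is off by one
shifted = hamming_distance(reference, with_leading_m)

# Dropping the extra leading residue (and the unmatched last one) restores the match
realigned = hamming_distance(reference[:-1], with_leading_m[1:])

print(shifted, realigned)  # 8 0
```

Under this assumption, the choice of which end to trim mostly decides which pairs happen to line up, rather than measuring how different the proteins are.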

### Limitations:
#### 1. Loss of information
   Since the tails of the longer sequences were cut, mismatches toward the end are excluded. This can be a useful approach if the biological significance of the protein is concentrated at the beginning, but this may omit meaningful data from the latter part of the sequences. This means that the requirement for equal length may lead to loss of information. 
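
A minimal sketch of this effect, using made-up sequences and a compact stand-in for `hamming_distance`: two sequences that differ only in the trimmed tail end up with a distance of 0, so the differences in the tail never enter the count.

```python
def hamming_distance(seq1, seq2):
    """Count mismatched positions between two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")
    return sum(a != b for a, b in zip(seq1, seq2))

# Hypothetical sequences: identical start, three extra symbols at the end
short_seq = "AGCTAAC"
long_seq = "AGCTAACGGT"

min_len = min(len(short_seq), len(long_seq))
distance = hamming_distance(short_seq[:min_len], long_seq[:min_len])

# Number of positions that were cut and therefore never compared
ignored = max(len(short_seq), len(long_seq)) - min_len

print(distance, ignored)  # 0 3
```

Reporting the number of ignored positions next to the distance at least makes the amount of discarded data visible.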

#### 2. Biological considerations
   This calculation does not take biological context into account and is purely a computational approach. This means that some information regarding dependencies in the protein structures might be omitted.
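
One concrete instance of this is that every mismatch counts the same, whether a residue is swapped for a chemically similar one or a very different one. The sketch below uses a small, hand-picked residue grouping (`RESIDUE_CLASS`) and a helper (`class_changes`); both are assumptions made only for illustration and are not an established scoring scheme.

```python
# Coarse residue classes for a handful of amino acids (illustrative assumption)
RESIDUE_CLASS = {
    "L": "hydrophobic", "I": "hydrophobic", "V": "hydrophobic",
    "D": "acidic", "E": "acidic",
    "K": "basic", "R": "basic",
}

def hamming_distance(seq1, seq2):
    """Count mismatched positions between two equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")
    return sum(a != b for a, b in zip(seq1, seq2))

def class_changes(seq1, seq2):
    """Count positions where the residue class (not just the letter) differs."""
    if len(seq1) != len(seq2):
        raise ValueError("Sequences must be of equal length")
    return sum(RESIDUE_CLASS[a] != RESIDUE_CLASS[b] for a, b in zip(seq1, seq2))

original = "LKVE"
conservative = "IRVD"  # L->I, K->R, E->D: each swap stays within its class
radical = "DEVK"       # L->D, K->E, E->K: each swap changes class

print(hamming_distance(original, conservative), hamming_distance(original, radical))  # 3 3
print(class_changes(original, conservative), class_changes(original, radical))        # 0 3
```

The Hamming distance is 3 in both cases, even though only the second variant changes the chemical character of the residues.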


### References:
- Information on hemoglobin subunit beta: https://en.wikipedia.org/wiki/Hemoglobin_subunit_beta
- Why cut the tail? Information on the importance of the N-terminus: https://en.wikipedia.org/wiki/N-terminus
- More information on the Hamming distance in biology: https://fiveable.me/key-terms/introduction-computational-molecular-biology/hamming-distance

#### Code:
- Merge two sequences using zip() function: https://www.geeksforgeeks.org/zip-in-python/#
- Sample computation with zip: https://stackoverflow.com/questions/54172831/hamming-distance-between-two-strings-in-python