# Bioinformatic Algorithms, Week 1

This course is pretty tough. This Jupyter notebook is part of my effort to organize my work so far so I can tackle the Week 4 content. 

**Genome replication**: Before a cell can divide, it must replicate its genome so that each of the two daughters inherits its own copy. 

- 1953: **Watson & Crick**'s DNA double helix paper: *"It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material."*
- Replication begins at the region of the genome called the replication origin (*ori*) and is carried out by **DNA polymerases**.

How do we find the *ori*, given a genome?

Computational methods are now the only realistic way to answer many questions in modern biology. 
1. They are much faster than experimental methods. Experimental methods for finding *ori* are time-consuming and have been successful in only a handful of species.
2. Results of many experiments cannot be interpreted without computational analyses.

In this chapter, we focus on the relatively easy case of finding *ori* in bacterial genomes, most of which consist of a single circular chromosome. 
- The region of the bacterial genome encoding *ori* is typically a few hundred nucleotides long.
- Begin with a bacterium in which *ori* is known, then determine what makes this genomic region special, to design a computational approach for finding *ori* in other bacteria.
- We use the genome of *Vibrio cholerae* as our "known". The *ori* of *V. cholerae* is as follows:

> atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc

In [1]:
with open("Vibrio_cholerae.txt", "r") as f:
    v_cholerae_genome = f.read()
print(f'The Vibrio cholerae genome is {len(v_cholerae_genome)} nucleotides long.')
print(f'The first 100 nucleotides are: {v_cholerae_genome[:100]}')

The Vibrio cholerae genome is 1108250 nucleotides long.
The first 100 nucleotides are: ACAATGAGGTCACTATGTTCGAGCTCTTCAAACCGGCTGCGCATACGCAGCGGCTGCCATCCGATAAGGTGGACAGCGTCTATTCACGCCTTCGTTGGCA


How does the bacterial cell know to begin replication in the *ori* region of *V. cholerae*? 
- The initiation of replication is mediated by ***DnaA***, a protein that binds to a short segment within the *ori* known as a ***DnaA box***.
- That is, a sequence in the *ori* genomic region called a *DnaA box* tells a protein called *DnaA* to "bind here!".
- *DnaA*, in turn, recruits other proteins to initiate the separation of the DNA strands at the origin.

<img src="https://upload.wikimedia.org/wikipedia/commons/3/3c/PDB_1j1v_EBI.jpg" width="200" height="200" style="margin:auto"/>

The **Hidden Message Problem** is a first, intuitive attempt at formalizing the task at hand. But it is not yet a clearly stated, meaningful computational problem. 

> **Hidden Message Problem**: *Find a "hidden message" in the replication origin.*
>    - **Input**: A string *Text* (representing the replication origin of a genome)
>    - **Output**: A hidden message in *Text* (representing the *DnaA box*).

### The Frequent Words Problem
For various biological processes, certain nucleotide strings appear surprisingly often in small regions of the genome, because certain proteins can only bind to DNA if a specific string of nucleotides is present. The same holds for the *DnaA* protein and *DnaA box* sequences. A better-defined approach is therefore to look for "words" that are surprisingly frequent (i.e. above chance) in *Text*. 
- For example, in the Text "ACA**ACTAT**GCAT**ACTAT**CGGGA**ACTAT**CCT", the word **ACTAT** is surprisingly frequent (n=3).
- We formalize this observation as follows:

> *Count*(ACA**ACTAT**GCAT**ACTAT**CGGGA**ACTAT**CCT, ACTAT) = 3.

To compute *Count*(*Text*, *Pattern*), our plan is to slide a moving window down *Text*, checking whether each substring of *Text* that has the length of *Pattern*—we call it a *k*-mer, where *k*=length of *Pattern*—matches *Pattern* itself. 

In [2]:
def pattern_count(text, pattern):
    count = 0
    # range() is upper-bound exclusive
    for i in range(len(text)-len(pattern)+1): 
        if text[i:i+len(pattern)] == pattern:
            count += 1
    return count

In [3]:
with open("patternCount.txt", "r") as f:
    pc_input = f.read().split("\n")

t = pc_input[0]
p = pc_input[1]
print(pattern_count(t, p))

22
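As a quick sanity check, the sliding-window count can be tried on the example string from earlier. This is only a minimal, self-contained sketch: `count_overlapping` is a hypothetical name that mirrors `pattern_count`. It also suggests why Python's built-in `str.count` is not a drop-in replacement here, since it only counts non-overlapping matches.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)

example = "ACAACTATGCATACTATCGGGAACTATCCT"
print(count_overlapping(example, "ACTAT"))  # 3, matching Count(Text, ACTAT)

# Overlapping matches: ATA starts at positions 0 and 2 of ATATA.
print(count_overlapping("ATATA", "ATA"))    # 2
print("ATATA".count("ATA"))                 # 1, because str.count skips overlaps
```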


Now, we can say that *Pattern* is a **most frequent *k*-mer** in *Text* if it maximizes *Count*(*Text*, *Pattern*) among all *k*-mers. For example, **ACTAT** is a most frequent 5-mer in ACA**ACTAT**GCAT**ACTAT**CGGGA**ACTAT**CCT. This gives us a rigorously defined computational problem.

**Frequent Words Problem**: *Find the most frequent k-mers in a string*.
- **Input**: A string *Text* and an integer *k*.
- **Output**: All most frequent *k*-mers in *Text*.

There is a naive way and a fast way to do this. The **naive way** would be the following:

In [4]:
def frequent_words_naive(text, k):
    frequent_patterns = set()
    counts = []
    for i in range(len(text)-k+1):  # +1 so the final k-mer is included
        pattern = text[i:i+k]
        counts.append(pattern_count(text, pattern))
    max_count = max(counts)
    for i in range(len(text)-k+1):
        if counts[i] == max_count:
            frequent_patterns.add(text[i:i+k])
    return frequent_patterns

In [5]:
with open("frequentWords.txt", "r") as f:
    pc_input = f.read().split("\n")

t = pc_input[0]
k = int(pc_input[1])
print(" ".join(frequent_words_naive(t, k)))

GTACCAGGCGCA TACCAGGCGCAG GGTACCAGGCGC GTAGGTACCAGG AGGTACCAGGCG TAGGTACCAGGC


This approach loops through *Text* and, at each iteration:
1. Grabs the *k*-mer substring starting at the current position in *Text*,
2. Counts how many times that *k*-mer appears in *Text* (looping through *Text* again), and
3. Appends the count to a list `counts`.

It then finds the maximum count in `counts`, collects all *k*-mers that have the maximum count, and returns them.

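**This is a slow algorithm**: it calls *Count*(*Text*, *Pattern*) once for every *k*-mer position in *Text*, and each call scans *Text* again, so the number of window comparisons grows as O(n<sup>2</sup>) in the length *n* of *Text* (each comparison additionally costs up to *k* character checks). 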
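To get a feel for how quickly the quadratic term grows, here is a rough back-of-the-envelope sketch. It assumes every one of the n-k+1 windows is compared against every other window; the function names are made up for illustration. The two lengths are an *ori*-sized region of 500 and the genome length printed earlier.

```python
def num_windows(n, k):
    """Number of k-mer start positions in a text of length n."""
    return n - k + 1

def naive_comparisons(n, k):
    """Window comparisons made when every k-mer is counted by rescanning the text."""
    return num_windows(n, k) ** 2

for n in (500, 1_108_250):
    print(f"n={n}: {num_windows(n, 9)} windows, {naive_comparisons(n, 9)} comparisons")
```

For n=500 this is a few hundred thousand comparisons, which is fine; for the whole genome it is on the order of 10<sup>12</sup>.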

The **fast way** to solve the **Frequent Words Problem** uses a hash table (a Python `dict`). Rather than iterate through *Text* within another iteration through *Text*, which results in the O(n<sup>2</sup>) runtime, we can:
1) Iterate through *Text* once to build a hash table mapping each *k*-mer to its count, like the **frequency table** below,
2) Then return the keys (*k*-mers) that are associated with the greatest value. 

<img src="https://miro.medium.com/max/2129/1*Mr6KrtsUqhQ732SwYIljfg.jpeg" width=400px style="margin:auto"/>


To return the keys that are associated with the greatest value, we first need that greatest value; this is the job of the helper function **MaxMap**, defined inside `better_frequent_words()`.

In [6]:
def frequency_table(text, k):
    freq_map = {}
    for i in range(len(text)-k+1):
        pattern = text[i:i+k]

        if pattern not in freq_map:
            freq_map[pattern] = 1
        else:
            freq_map[pattern] += 1
    return freq_map

In [7]:
def better_frequent_words(text, k):
    
    def max_map(table): # Return the maximum value in a dictionary
        return max(table.values())
    
    frequent_patterns = []
    freq_map = frequency_table(text, k)
    max_val = max_map(freq_map)
    for pattern in freq_map.keys():
        if freq_map[pattern] == max_val:
            frequent_patterns.append(pattern)
    return frequent_patterns

In [8]:
with open("frequentWords.txt", "r") as f:
    pc_input = f.read().split("\n")

t = pc_input[0]
k = int(pc_input[1])
" ".join(better_frequent_words(t, k))

'GTAGGTACCAGG TAGGTACCAGGC AGGTACCAGGCG GGTACCAGGCGC GTACCAGGCGCA TACCAGGCGCAG'
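For comparison, the same single-pass idea can be sketched with `collections.Counter` from the standard library, which does the dictionary bookkeeping for us. This is an alternative sketch, not the course's reference solution; `frequent_words_counter` is a hypothetical name, and it returns a set rather than a list.

```python
from collections import Counter

def frequent_words_counter(text, k):
    """Return the set of most frequent k-mers in text using one pass."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:               # text shorter than k: there are no k-mers
        return set()
    max_count = max(counts.values())
    return {kmer for kmer, c in counts.items() if c == max_count}

print(frequent_words_counter("ACAACTATGCATACTATCGGGAACTATCCT", 5))  # {'ACTAT'}
```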

Pretty neat! Now, trying the **Frequent Words Problem** on the *ori* region of *V. cholerae* for different *k*-values gives us the following:

In [9]:
cholerae_ori = "atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaacctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggattacgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtttccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc"

cholerae_dict = {}
for i in range(3, 10):
    cholerae_dict[i] = better_frequent_words(cholerae_ori, i)
    
for key, value in cholerae_dict.items():
    print(f'{key}-mer: found {len(value)}, {" ".join(value)}')

3-mer: found 1, tga
4-mer: found 1, atga
5-mer: found 2, tgatc gatca
6-mer: found 1, tgatca
7-mer: found 1, atgatca
8-mer: found 1, atgatcaa
9-mer: found 4, atgatcaag ctcttgatc tcttgatca cttgatcat


Four different 9-mers tie for most frequent! And each appears three times in the *ori* region:

In [10]:
for p in cholerae_dict[9]:
    print(f'{p} appears {pattern_count(cholerae_ori, p)} times')

atgatcaag appears 3 times
ctcttgatc appears 3 times
tcttgatca appears 3 times
cttgatcat appears 3 times


Experiments reveal that bacterial *DnaA* boxes are usually 9 nucleotides long. The probability that there exists a 9-mer appearing 3 or more times in a randomly generated DNA string of length 500 is approximately 1/1300. So finding four such 9-mers in the *ori* is quite unusual! 
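One way to get a rough feel for how rare this is would be a small simulation. The sketch below assumes each nucleotide is drawn independently and uniformly from A, C, G and T, which real genomes do not satisfy exactly. The function names, seed and trial count are arbitrary choices for illustration. With an event this rare, a few thousand trials only give a very noisy estimate, so treat the printed number as an order-of-magnitude check rather than a confirmation of 1/1300.

```python
import random
from collections import Counter

def has_repeated_kmer(text, k, t):
    """True if some k-mer occurs at least t times in text (overlaps allowed)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    return max(counts.values()) >= t

def estimate_probability(length=500, k=9, t=3, trials=2000, seed=0):
    """Fraction of random strings containing a k-mer with at least t occurrences."""
    rng = random.Random(seed)    # fixed seed so reruns give the same estimate
    hits = 0
    for _ in range(trials):
        text = "".join(rng.choices("ACGT", k=length))
        hits += has_repeated_kmer(text, k, t)
    return hits / trials

print(estimate_probability())
```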

### Reverse Complements and Pattern Matching

DNA replication is **semiconservative**: a double-strand unfurls into two **template strands**, and onto each template strand, the **complementary strand** is assembled out of free-floating nucleotides, as confirmed by [Meselson and Stahl's 1958 experiment](https://en.wikipedia.org/wiki/Meselson%E2%80%93Stahl_experiment). 

<img src="https://bioinformaticsalgorithms.com/images/Replication/reverse_complement.png" width=400px style="margin:auto" />

DNA strands are read 5' to 3', and the two complementary strands run in opposite directions, so a strand's complement is read in the reverse direction. Since A pairs with T and G pairs with C, it is a straightforward task to determine the reverse complement of a string of nucleotides.

In [11]:
def reverse_complement(pattern):
    return pattern[::-1].translate(str.maketrans('ACGT', 'TGCA'))

In [12]:
with open("reverseComplement.txt", "r") as f:
    rc_input = f.read()
print(reverse_complement(rc_input)[:100]) # truncated output for readability

TGATAAGGCATGCCGCTCCGCCCCGACTATGGTCTTGCGGTGGTAGGTTTTAAGCCCACATTCACTGGAAAGTCAGACGCTAGGAACTTGAGCAGGCAGT


Of the four 9-mers that occur suspiciously frequently in the *ori* of *V. cholerae*, ATGATCAAG and CTTGATCAT are the reverse complements of each other. **So it seems quite likely, now, that ATGATCAAG and CTTGATCAT are *DnaA* boxes in *V. cholerae*!** We just have to check whether this sequence occurs just as frequently in the rest of the genome. 
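The reverse-complement claim is easy to check directly. Here is a small self-contained sketch; `revcomp` is a hypothetical variant of `reverse_complement` that also accepts lowercase input, since the *ori* strings in this notebook are lowercase.

```python
# Map each base to its partner; lowercase letters are included as well.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def revcomp(pattern):
    """Reverse the string, then complement every base."""
    return pattern[::-1].translate(COMPLEMENT)

print(revcomp("ATGATCAAG"))                 # CTTGATCAT
print(revcomp("CTTGATCAT") == "ATGATCAAG")  # True
print(revcomp("atgatcaag"))                 # cttgatcat
```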

**Pattern Matching Problem**: *Find all occurrences of a pattern in a string.*
- **Input**: Strings *Pattern* and *Genome*.
- **Output**: All starting positions in *Genome* where *Pattern* appears as a substring.

In [13]:
def pattern_match(pattern, genome):
    k = len(pattern)
    positions = []
    for i in range(len(genome)-k+1):
        if genome[i:i+k] == pattern:
            positions.append(i)
    return positions

In [14]:
print(" ".join([str(i) for i in pattern_match("ATGATCAAG", v_cholerae_genome)]))

116556 149355 151913 152013 152394 186189 194276 200076 224527 307692 479770 610980 653338 679985 768828 878903 985368


In [15]:
print(" ".join([str(i) for i in pattern_match("CTTGATCAT", v_cholerae_genome)]))

60039 98409 129189 152283 152354 152411 163207 197028 200160 357976 376771 392723 532935 600085 622755 1065555


It looks like ATGATCAAG/CTTGATCAT form a single clump around position 152,000 (six occurrences between 151913 and 152411), while all other occurrences are pretty spread out. This is promising.
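Rather than eyeballing the two lists, the clump can be picked out programmatically. The sketch below copies the start positions from the two outputs above and groups neighbours that are at most 500 positions apart. The 500 cutoff and the helper name `group_positions` are assumptions made for illustration.

```python
# Start positions copied from the two pattern_match outputs.
forward = [116556, 149355, 151913, 152013, 152394, 186189, 194276, 200076,
           224527, 307692, 479770, 610980, 653338, 679985, 768828, 878903, 985368]
reverse = [60039, 98409, 129189, 152283, 152354, 152411, 163207, 197028,
           200160, 357976, 376771, 392723, 532935, 600085, 622755, 1065555]

def group_positions(positions, max_gap):
    """Group sorted positions so consecutive members are at most max_gap apart."""
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] <= max_gap:
            groups[-1].append(pos)   # close enough: extend the current group
        else:
            groups.append([pos])     # too far away: start a new group
    return groups

groups = group_positions(forward + reverse, max_gap=500)
largest = max(groups, key=len)
print(largest)  # [151913, 152013, 152283, 152354, 152394, 152411]
```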

### Clump Finding: Finding the *ori* region

Though ATGATCAAG/CTTGATCAT appears to be the *DnaA* box for *V. cholerae*, it is possible that the ATGATCAAG/CTTGATCAT clump in *V. cholerae* is a statistical fluke. We should also not assume that all bacteria share the same *DnaA* box. Let's search for the same patterns in the *ori* of *Thermotoga petrophila*:

In [16]:
T_petrophilia_ori = "aactctatacctcctttttgtcgaatttgtgtgatttatagagaaaatcttattaactgaaactaaaatggtaggtttggtggtaggttttgtgtacattttgtagtatctgatttttaattacataccgtatattgtattaaattgacgaacaattgcatggaattgaatatatgcaaaacaaacctaccaccaaactctgtattgaccattttaggacaacttcagggtggtaggtttctgaagctctcatcaatagactattttagtctttacaaacaatattaccgttcagattcaagattctacaacgctgttttaatgggcgttgcagaaaacttaccacctaaaatccagtatccaagccgatttcagagaaacctaccacttacctaccacttacctaccacccgggtggtaagttgcagacattattaaaaacctcatcagaagcttgttcaaaaatttcaatactcgaaacctaccacctgcgtcccctattatttactactactaataatagcagtataattgatctga"

print(pattern_count(T_petrophilia_ori, "ATGATCAAG"))
print(pattern_count(T_petrophilia_ori, "CTTGATCAT"))

0
0


So *T. petrophila* doesn't have the same sequences after all. If we look for common 9-mers in the *ori* of *T. petrophila*, we obtain some other promising sequences instead. 

In [17]:
for key, value in frequency_table(T_petrophilia_ori, 9).items():
    if value >= 3:
        print(key.upper())

TGGTAGGTT
GGTAGGTTT
AAACCTACC
AACCTACCA
ACCTACCAC
CCTACCACC


The instructors cheat by using pre-made software called Ori-Finder, and find that CCTACCACC and its reverse complement GGTGGTAGG are the most likely *DnaA* boxes. Together, they appear 5 times in the *ori*.

In any case, the fact that a *DnaA* box sequence and its reverse complement appear so frequently in the *ori* means that, **if we find a region of 9-mer clumps, it is a good candidate for an *ori* region**. This problem can be formalized as follows:

**Clump Finding Problem**: *Find patterns forming clumps in a string.*
- **Input**: A string *Genome*, and integers *k* (*k*-mer length), *L* (window **l**ength for clump search), and *t* (count **t**hreshold for clumphood).
- **Output**: All distinct *k*-mers forming (L, t)-clumps in *Genome*.

The strategy here is to iterate through the *Genome* once with a moving window of length *L*:
- In each window, we create a map of the *k*-mers within the window.
- Then, we search for only the *k*-mers that occur at least as frequently as the threshold count within the window.
- We finally return the qualifying *k*-mers.

This algorithm has a time complexity of O(n*L), since for every window in *Genome*, `frequency_table()` iterates through that window.
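Before the full implementation, the strategy can be tried on a toy string to see how the window length and threshold interact. This is a self-contained sketch: `clumps_in_windows` and the toy string are made up for illustration, and it assumes *k* is no larger than *L*.

```python
from collections import Counter

def clumps_in_windows(genome, k, L, t):
    """Return k-mers that occur at least t times inside some length-L window."""
    found = set()
    for start in range(len(genome) - L + 1):
        window = genome[start:start + L]
        counts = Counter(window[i:i + k] for i in range(L - k + 1))
        found.update(kmer for kmer, c in counts.items() if c >= t)
    return found

toy = "ACGACGTTTTACG"
print(sorted(clumps_in_windows(toy, k=3, L=6, t=2)))   # ['ACG', 'TTT']
print(sorted(clumps_in_windows(toy, k=3, L=13, t=3)))  # ['ACG']
```

With the short window, both ACG and the overlapping TTT pair qualify. With the whole string as the window and a threshold of 3, only ACG does.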

In [18]:
def find_clump(genome, k, length, threshold):
    patterns = set()
    for i in range(len(genome)-length+1):
        window = genome[i:i+length]
        freq_map = frequency_table(window, k)
        for key in freq_map.keys():
            if freq_map[key] >= threshold:
                patterns.add(key)
    return patterns

In [19]:
with open("findClump.txt", "r") as f:
    fc_input = f.read().split("\n")
genome = fc_input[0]
# The second line holds k, L and t separated by single spaces
k, L, t = map(int, fc_input[1].split(" ")[:3])
" ".join(find_clump(genome, k, L, t))

'CCACCGTTGT TCCGGCCGAC AGAAGATAAA ACAAGGATGT TATATTATTA TATTGCGTGA ATCCAAATTA AGGCCTTCGC TAGGGTAATG'

Very cool. But with a runtime of O(n*L), this won't be practical on the actual genome of *E. coli*. It's much too slow! Let's rewrite this function so that it has a more reasonable runtime. The key insight is that, when we shift the window by 1, we don't have to recount the whole window: exactly one *k*-mer leaves on the left and exactly one enters on the right, so only those two counts change. Rather than calling `frequency_table()` again and again, we call it exactly once at the beginning. 

In [20]:
def find_clump_faster(genome, k, L, t):
    patterns = set()

    # Loop 0
    window = genome[0:L]
    freq_map = frequency_table(window, k)
    for pattern, freq in freq_map.items():
        if freq >= t:
            patterns.add(pattern)

    # Loop 1 and on
    for i in range(1, len(genome)-L+1):
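        # k-mer that just left the window on the left
        previous_first_k_mer = genome[i-1:i+k-1]
        # k-mer that just entered on the right, sliced straight from genome
        # so we don't copy the whole length-L window on every iteration
        current_last_k_mer = genome[i+L-k:i+L]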

        freq_map[previous_first_k_mer] -= 1

        if current_last_k_mer not in freq_map:
            freq_map[current_last_k_mer] = 1
        else:
            freq_map[current_last_k_mer] += 1

        if freq_map[current_last_k_mer] >= t:
            patterns.add(current_last_k_mer)
            
    return patterns
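A cheap way to gain confidence in the sliding-window rewrite is to compare it against the recount-everything version on small random inputs. The sketch below reimplements both ideas under hypothetical names (`clumps_naive`, `clumps_sliding`) so that it runs on its own. It assumes *k* ≤ *L* and uses a two-letter alphabet only so that clumps show up often in short strings.

```python
import random
from collections import Counter

def clumps_naive(genome, k, L, t):
    """Recount every window from scratch."""
    found = set()
    for i in range(len(genome) - L + 1):
        window = genome[i:i + L]
        counts = Counter(window[j:j + k] for j in range(L - k + 1))
        found.update(kmer for kmer, c in counts.items() if c >= t)
    return found

def clumps_sliding(genome, k, L, t):
    """Count the first window once, then update two k-mers per shift."""
    if len(genome) < L:
        return set()
    counts = Counter(genome[j:j + k] for j in range(L - k + 1))
    found = {kmer for kmer, c in counts.items() if c >= t}
    for i in range(1, len(genome) - L + 1):
        counts[genome[i - 1:i - 1 + k]] -= 1   # k-mer leaving on the left
        entering = genome[i + L - k:i + L]     # k-mer entering on the right
        counts[entering] += 1
        if counts[entering] >= t:              # only this count can have risen
            found.add(entering)
    return found

rng = random.Random(1)
for _ in range(50):
    genome = "".join(rng.choices("AC", k=60))
    assert clumps_naive(genome, 3, 12, 3) == clumps_sliding(genome, 3, 12, 3)
print("naive and sliding-window results agree on 50 random strings")
```

Only the entering *k*-mer needs to be checked against the threshold after each shift, because it is the only one whose count can have increased.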

How many (500, 3)-clump forming 9-mer sequences are there in the *E. coli* genome?

In [21]:
with open("E_coli.txt", "r") as f:
    e_coli_genome = f.read()

In [22]:
print(f"There are {len(find_clump_faster(e_coli_genome, 9, 500, 3))} 9-mers that form clumps")

There are 1904 9-mers that form clumps


1904 is a lot of 9-mers, and it's unclear which of these might represent *DnaA* boxes in the *ori* region of the *E. coli* genome. We'll have to keep looking! 