<img align="left" style="padding-right:10px;" src="figures/cartel.jpg">
<!--COURSE_INFORMATION-->
## This notebook contains a unit from the course [Biology Meets Programming](https://www.coursera.org/learn/bioinformatics/home/welcome) by the University of California San Diego on Coursera


### The content is available [on GitHub](https://github.com/vencejo/Curso_BiologyMeetsProgramming).

<!--NAVIGATION-->
< [1.2 Hidden Messages in the Replication Origin Part 1](1.2 Hidden Messages in the Replication Origin Part 1.ipynb) | [Contents](Index.ipynb) | [1.4 Some Hidden Messages are More Surprising than Others](1.4 Some Hidden Messages are More Surprising than Others.ipynb)  >

### The Frequent Words Problem

We say that Pattern is a **most frequent k-mer** in Text if it maximizes PatternCount(Pattern, Text) among all k-mers.

You can verify that "ACTAT" is a most frequent 5-mer for Text = "ACAACTATGCATACTATCGGGAACTATCCT", and "ATA" is a most frequent 3-mer for Text = "CGATATATCCATAG".
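Both claims can be checked with a few lines of Python. The sketch below assumes PatternCount is the sliding-window counter from the previous unit, which counts overlapping occurrences; it is repeated here so the block runs on its own. The helper MaxKmerCount is a hypothetical name introduced only for this check.

``` python
def PatternCount(Pattern, Text):
    # Slide a window of len(Pattern) across Text and count exact matches
    # (overlapping occurrences are counted).
    count = 0
    for i in range(len(Text) - len(Pattern) + 1):
        if Text[i:i+len(Pattern)] == Pattern:
            count += 1
    return count

def MaxKmerCount(Text, k):
    # Hypothetical helper: the largest count reached by any k-mer of Text.
    return max(PatternCount(Text[i:i+k], Text) for i in range(len(Text) - k + 1))

Text5 = "ACAACTATGCATACTATCGGGAACTATCCT"
Text3 = "CGATATATCCATAG"

print(PatternCount("ACTAT", Text5), MaxKmerCount(Text5, 5))  # 3 3
print(PatternCount("ATA", Text3), MaxKmerCount(Text3, 3))    # 3 3
```

In both cases the pattern's count equals the largest count over all k-mers, which is exactly what "most frequent" means.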

Exercise Break (1 point): Find the most frequent 2-mer of "GATCCAGATCCCCATAC". (You should solve this exercise by hand; how can it be done quickly?)

STOP and Think: Can a string have multiple most frequent k-mers?



We now have a rigorously defined computational problem for finding frequent words in the replication origin. A computational problem consists of a specification of the input data together with a precise specification of the output data that solves the problem.

    Frequent Words Problem:  Find the most frequent k-mers in a string.
     * Input: A string Text and an integer k.
     * Output: All most frequent k-mers in Text.


A straightforward algorithm for finding the most frequent words in a string Text computes how many times each k-mer substring of Text appears in Text, then selects the k-mers occurring the most. To implement this algorithm, called FrequentWords, we will need to generate an array (i.e., a one-row table) denoted Count, where Count[i] is the number of times that the i-th k-mer of Text appears in Text. That is, Count[i] stores PatternCount(Pattern, Text) for Pattern = Text[i:i+k] (figure below).

<img align="left" style="padding-right:10px;" src="figures/fig3.png">



Figure: The array Count for Text = "CGATATATCCATAG" and k = 3. For example, Count[3] = Count[5] = 2 because "TAT" appears twice in Text at positions 3 and 5.
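As a rough illustration, the Count array in the figure can be reproduced with a plain Python list. This is only a sketch: it assumes the PatternCount function from the previous unit (repeated here so the block is self-contained) and stores the counts in a list rather than in the dictionary introduced next.

``` python
def PatternCount(Pattern, Text):
    # Count (possibly overlapping) occurrences of Pattern in Text.
    count = 0
    for i in range(len(Text) - len(Pattern) + 1):
        if Text[i:i+len(Pattern)] == Pattern:
            count += 1
    return count

Text = "CGATATATCCATAG"
k = 3

# Count[i] is the number of times the k-mer starting at position i appears in Text.
Count = [PatternCount(Text[i:i+k], Text) for i in range(len(Text) - k + 1)]
print(Count)  # [1, 1, 3, 2, 3, 2, 1, 1, 1, 1, 3, 1]
```

Positions 3 and 5 both hold 2 (the two copies of "TAT"), while positions 2, 4, and 10 hold 3 (the three copies of "ATA").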



Arrays can be represented in Python using a data structure called a dictionary (often abbreviated as dict). You can think of a dictionary as a set of keys (first row in the figure above), where each key refers to a value (second row in the figure above). Learn more about how to work with dictionaries in the following set of exercises.
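For readers who want a quick preview before the exercises, here is a minimal sketch of the dictionary operations used in this unit. The keys and values are the first few entries of the Count array for Text = "CGATATATCCATAG" and k = 3; they are typed in by hand purely for illustration.

``` python
# A dictionary maps keys to values; here the keys are positions in Text
# and the values are the corresponding counts.
Count = {0: 1, 1: 1, 2: 3, 3: 2}

print(Count[2])              # look up the value stored under key 2 -> 3
Count[4] = 3                 # add a new key-value pair
print(list(Count.keys()))    # [0, 1, 2, 3, 4]
print(list(Count.values()))  # [1, 1, 3, 2, 3]

# Looping over a dictionary visits its keys.
for i in Count:
    print(i, Count[i])
```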

**Python Practice: Complete the “Python Lists and Dictionaries” lesson (14 exercises) in Unit 5 on Codecademy.**

Now that you know how to work with dictionaries, we can write a Python function that takes a string Text and an integer k as input and returns the Count dictionary for k-mers in Text. The code below uses the notation Count = {} to initialize a blank dictionary containing no items. It also uses PatternCount as a subroutine, or a function that is called within another function. Subroutines are vital to programming because they allow us to reuse code without needing to copy it multiple times.

``` python
def CountDict(Text, k):
    Count = {}
    for i in range(len(Text)-k+1):
        Pattern = Text[i:i+k]
        Count[i] = PatternCount(Pattern, Text)
    return Count
```

Code Challenge (1 point): Re-type this function in the allotted space below. (Make sure that you understand each line!) Since CountDict uses PatternCount as a subroutine, you should copy PatternCount below as well. Then add CountDict to Replication.py.



To identify the most frequent k-mers in Text, we simply need to find the maximum value of the Count dictionary. Python dictionaries have a method called values() that returns the values of a dictionary (as a list in Python 2, and as a view object that can be looped over in Python 3). You have already learned about lists in the preceding Python Practice; we can therefore compute the maximum of all values in a given list using the following function. (This function uses the form of the for loop for ranging over the items in a list that we learned in the last Python Practice.)

``` python
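def max(values):
    # Shown for illustration only: this definition shadows Python's built-in max.
    values = list(values)  # also accepts a dictionary view such as Count.values()
    m = values[0]
    for item in values:
        if item > m:
            m = item
    return m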
```

As a result, we can find the maximum value in the dictionary Count by simply calling max(Count.values()). In fact, there is no reason to even write the function above, since Python provides max as a built-in function!
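A small sketch of the built-in in action, using a hand-typed dictionary rather than the output of CountDict. It also shows one edge case worth knowing about: the built-in max raises ValueError when given no values, which is what would happen if k were longer than Text and Count were empty.

``` python
Count = {0: 1, 1: 1, 2: 3, 3: 2, 4: 3}

# In Python 3, values() returns a view object rather than a list;
# the built-in max accepts any iterable, so no conversion is needed.
m = max(Count.values())
print(m)  # 3

# Edge case: an empty dictionary has no maximum.
empty_raises = False
try:
    max({}.values())
except ValueError:
    empty_raises = True
print(empty_raises)  # True
```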



Now that we know how to find the maximum value of Count, we just need to use a for loop to pass through Count and find each index i such that Count[i] is maximized. This index corresponds to a frequent k-mer Text[i:i+k] in Text. We can then add this k-mer to a growing list of strings called FrequentPatterns.

**Python Practice: Complete exercises 10-14 of the “Loops” lesson in Unit 8 on Codecademy to learn how to apply a for loop to lists, strings, and dictionaries.**

We can now generate the most frequent k-mers in Text with the following code. Note that this code uses a Python-specific form of the for loop ranging over the keys of the dictionary Count.

``` python
def FrequentWords(Text, k):
    FrequentPatterns = []
    Count = CountDict(Text, k)
    m = max(Count.values())
    for i in Count:
        if Count[i] == m:
            FrequentPatterns.append(Text[i:i+k])
    return FrequentPatterns
```

Code Challenge (1 point): Re-type the FrequentWords function (and all required subroutines) in the allotted space below.




Exercise Break (1 point): Now that we have implemented FrequentWords, print the result of calling FrequentWords on Text = "GATCCAGATCCCCATAC" and k = 2.

Sample Input:

ACGTTGCATGTCGCATGATGCATGAGAGCT
4

Sample Output:

['GCAT', 'CATG', 'GCAT', 'CATG', 'GCAT', 'CATG']

We have nearly solved the Frequent Words Problem, except for one tiny wrinkle. For the example Text = "CGATATATCCATAG", there is only one most frequent 3-mer ("ATA"). However, we will add "ATA" to FrequentPatterns three separate times, when i is equal to 2, 4, and 10 (why?). Therefore, we need to remove duplicates from FrequentPatterns by writing a function remove_duplicates(items) that takes a list items and returns a list containing all the objects from items without duplicates. We leave this task to you as an exercise, which is covered in the following Python Practice.
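To see where the duplicates come from, the following sketch lists every position whose count reaches the maximum and the k-mer that starts there. It assumes PatternCount from the previous unit (repeated so the block is self-contained) and builds the position-indexed dictionary inline instead of calling CountDict.

``` python
def PatternCount(Pattern, Text):
    # Count (possibly overlapping) occurrences of Pattern in Text.
    count = 0
    for i in range(len(Text) - len(Pattern) + 1):
        if Text[i:i+len(Pattern)] == Pattern:
            count += 1
    return count

Text = "CGATATATCCATAG"
k = 3

# Same content as CountDict(Text, k): one entry per starting position.
Count = {i: PatternCount(Text[i:i+k], Text) for i in range(len(Text) - k + 1)}
m = max(Count.values())

positions = [i for i in Count if Count[i] == m]
print(positions)                         # [2, 4, 10]
print([Text[i:i+k] for i in positions])  # ['ATA', 'ATA', 'ATA']
```

Because Count is indexed by position, a k-mer that occurs m times contributes m entries with the maximum value, one for each place it starts.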

**Python Practice: Complete exercise 14 in the “Practice Makes Perfect” lesson (Unit 8) on Codecademy. Then copy your remove_duplicates function into Replication.py.**



When we put everything together, we have a function FrequentWords solving the Frequent Words Problem.

```python
def FrequentWords(Text, k):
    FrequentPatterns = []
    Count = CountDict(Text, k)
    m = max(Count.values())
    for i in Count:
        if Count[i] == m:
            FrequentPatterns.append(Text[i:i+k])
    FrequentPatternsNoDuplicates = remove_duplicates(FrequentPatterns)
    return FrequentPatternsNoDuplicates 
```

Code Challenge (1 point): Re-type the FrequentWords function in the allotted space below. Then add FrequentWords to Replication.py.


Sample Input:

ACGTTGCATGTCGCATGATGCATGAGAGCT
4

Sample Output:

CATG GCAT
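For comparison, here is a sketch of a different approach from the one used in this unit: the dictionary is keyed by the k-mer itself rather than by its position. The name FrequentWordsByKmer is hypothetical and is not part of the course code. Each distinct k-mer gets exactly one entry, so no remove_duplicates step is needed, and Text is scanned once instead of calling PatternCount for every position.

``` python
def FrequentWordsByKmer(Text, k):
    # Hypothetical alternative, not the course's implementation:
    # key the dictionary by the k-mer instead of by its starting position.
    Count = {}
    for i in range(len(Text) - k + 1):
        Pattern = Text[i:i+k]
        Count[Pattern] = Count.get(Pattern, 0) + 1
    if not Count:  # k is longer than Text, so there are no k-mers at all
        return []
    m = max(Count.values())
    # Dictionaries keep insertion order (Python 3.7+), so k-mers are
    # returned in order of first appearance in Text.
    return [Pattern for Pattern in Count if Count[Pattern] == m]

result = FrequentWordsByKmer("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)
print(" ".join(sorted(result)))  # CATG GCAT
```

On the sample input this yields the same two 4-mers as the sample output; sorting is applied only to match the order in which the sample lists them.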

Exercise Break (1 point): Apply your solution to the Frequent Words Problem if Text is the oriC of Vibrio cholerae (click [here](dnas/v_cholerae_oric.txt) to download) and k = 10. What are the most frequent words?

``` python
# Copy your updated FrequentWords function (along with all required subroutines) below this line
def PatternCount(Pattern, Text):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count

def CountDict(Text, k):
    Count = {}
    for i in range(len(Text)-k+1):
        Pattern = Text[i:i+k]
        Count[i] = PatternCount(Pattern, Text)
    return Count

def remove_duplicates(lista):
    listaUnicos = []
    for elem in lista:
        if elem not in listaUnicos:
            listaUnicos.append(elem)
    return listaUnicos
    
def FrequentWords(Text, k):
    FrequentPatterns = []
    Count = CountDict(Text, k)
    m = max(Count.values())
    for i in Count:
        if Count[i] == m:
            FrequentPatterns.append(Text[i:i+k])
    FrequentPatternsNoDuplicates = remove_duplicates(FrequentPatterns)
    return FrequentPatternsNoDuplicates


# Now set Text equal to the Vibrio cholerae oriC and k equal to 10
Text = """ATCAATGATCAACGTAAGCTTCTAAGCATGATCAAGGTGCTCACACAGTTTATCCACAACCTGAGTGGATGACATCAAGATAGGTCGTTGTATCTCCTTCCTCTCGTACTCTCATGACCACGGAAAGATGATCAAGAGAGGATGATTTCTTGGCCATATCGCAATGAATACTTGTGACTTGTGCTTCCAATTGACATCTTCAGCGCCATATTGCGCTGGCCAAGGTGACGGAGCGGGATTACGAAAGCATGATCATGGCTGTTGTTCTGTTTATCTTGTTTTGACTGAGACTTGTTAGGATAGACGGTTTTTCATCACTGACTAGCCAAAGCCTTACTCTGCCTGACATCGACCGTAAATTGATAATGAATTTACATGCTTCCGCGACGATTTACCTCTTGATCATCGATCCGATTGAAGATCTTCAATTGTTAATTCTCTTGCCTCGACTCATAGCCATGATGAGCTCTTGATCATGTTTCCTTAACCCTCTATTTTTTACGGAAGAATGATCAAGCTGCTGCTCTTGATCATCGTTTC"""

k = 10

# Finally, print the result of calling FrequentWords on Text and k.

print(FrequentWords(Text, k))
```

The most frequent 10-mers in the Vibrio cholerae oriC are CTCTTGATCA and TCTTGATCAT.

### Frequent words in Vibrio cholerae 

The figure below reveals the most frequent k-mers in the ori region from Vibrio cholerae.

<img align="left" style="padding-right:10px;" src="figures/fig4.png">

Figure: The most frequent k-mers in the ori region of Vibrio cholerae for k ranging from 3 to 9, along with the number of times that each k-mer occurs.

STOP and Think: Do any of the counts in the figure seem surprisingly large?

For example, the 9-mer **"ATGATCAAG"** appears three times in the ori region of Vibrio cholerae — is it surprising?

atcaatgatcaacgtaagcttctaagc**ATGATCAAG**gtgctcacacagtttatccacaac
ctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgacca
cggaaag**ATGATCAAG**agaggatgatttcttggccatatcgcaatgaatacttgtgactt
gtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggatt
acgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttagga
tagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaat
tgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaag
atcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtt
tccttaaccctctattttttacggaaga**ATGATCAAG**ctgctgctcttgatcatcgtttc

We highlight a most frequent 9-mer instead of using some other value of k because experiments have revealed that bacterial DnaA boxes are usually 9 nucleotides long. Furthermore, it is very unlikely that a 9-mer would appear three or more times in a randomly generated DNA string of length 500. In fact, there are four different 9-mers repeated three or more times in this region: "ATGATCAAG", "CTTGATCAT", "TCTTGATCA", and "CTCTTGATC".
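A back-of-the-envelope calculation gives a feel for why such repeats are unexpected. The sketch below assumes a simplified random model that is not stated in the text: each nucleotide is chosen independently and uniformly from A, C, G, and T. Under that assumption it computes the expected number of occurrences of one fixed 9-mer in a string of length 500. This is not the probability that some 9-mer appears three or more times, only a rough indication of scale.

``` python
n = 500  # length of the random DNA string
k = 9    # length of the k-mer

windows = n - k + 1   # number of positions where a 9-mer can start
p = 1 / 4**k          # chance that a fixed 9-mer matches at one position
                      # (assumes independent, uniformly random nucleotides)
expected = windows * p

print(windows)   # 492
print(4**k)      # 262144
print(expected)  # expected occurrences of one fixed 9-mer
```

Under this model a given 9-mer is expected to occur roughly 0.002 times in 500 nucleotides, so seeing the same 9-mer three times stands out.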

**The low likelihood of witnessing even one repeated 9-mer in the ori region of Vibrio cholerae leads us to the working hypothesis that one of these four 9-mers may represent a potential DnaA box that, when appearing multiple times in a short region, jump-starts replication. But which one?**
