## Counting words
Operating under the assumption that DNA is a language of its own, let’s borrow Legrand’s method and see if we can find any surprisingly frequent “words” within the ori of Vibrio cholerae. We have added reason to look for frequent words in the ori because for various biological processes, certain nucleotide strings appear surprisingly often in small regions of the genome. This is often because certain proteins can only bind to DNA if a specific string of nucleotides is present, and if there are more occurrences of the string, then it is more likely that binding will successfully occur. (It is also less likely that a mutation will disrupt the binding process.)

For example, "**ACTAT**" is a surprisingly frequent **substring** of 

"ACA**ACTAT**GCAT**ACTAT**CGGGA**ACTAT**CCT".

We use the term **k-mer** for a string of length k and define PatternCount(Pattern, Text) as the number of times that a k-mer Pattern appears as a substring of Text. Following the above example,

PatternCount("ACTAT", "ACA**ACTAT**GCAT**ACTAT**CGGGA**ACTAT**CCT") = 3. 

Note that PatternCount("ATA", "CG**ATATAT**CC**ATA**G") is equal to 3 (not 2) since we should account for overlapping occurrences of Pattern in Text.

Before looking for frequent words, we would like to compute PatternCount(Pattern, Text). Because this is your first biological algorithm, we will walk you through the details. To do so, we first create an integer variable count that we set equal to zero:

```python
count = 0
```

As illustrated in the figure below, our plan is to “slide a window” down Text, checking whether each k-mer substring of Text matches Pattern. If it does, then we add 1 to count (adding 1 to a variable is called incrementing it). The value of count after we have slid the window to the end of Text will be equal to PatternCount(Pattern, Text). The question, then, is how to convert the idea in the figure into a working program. Doing so will require a little more knowledge of Python.

![patterncount](patterncount.png)

**Figure**: Sliding a window to compute PatternCount(Pattern, Text) = 3 for Pattern = "ATA" and Text = "CGATATATCCATAG". We initialize count to zero and then increment it each time that Pattern appears in Text (shown in green).

Before thinking about sliding the window down Text, let’s solve the simpler problem of determining whether Pattern matches a k-mer of Text in a fixed window. In Python, the k-mer beginning at position i of Text is denoted `Text[i:i+k]`; the slice includes position i but stops just before position i+k. For example, if Text = "GACCATACTG", then `Text[4:7]` is "ATA". Python uses **0-based indexing**, in which the first symbol of the string occurs at position 0 instead of 1; as a result, Text ends at position len(Text)-1, where len(Text) is the number of symbols in Text.
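As a quick check of the slicing notation, the following sketch uses the example string from the text; the variable names `i` and `k` are chosen here purely for illustration.

```python
Text = "GACCATACTG"

# The slice Text[i:i+k] covers positions i through i+k-1 (the end index is excluded).
i = 4
k = 3
print(Text[i:i+k])         # ATA

# 0-based indexing: the first symbol is at position 0, the last at len(Text)-1.
print(Text[0])             # G
print(Text[len(Text)-1])   # G
print(len(Text))           # 10
```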

We can now use an **if** statement, shown below, to determine whether Pattern matches `Text[i:i+k]`; if it does, then we increment count.

```python
if Text[i:i+len(Pattern)] == Pattern:
    count = count+1
```

In the above Python code, make sure to note the difference between the equals symbol (=), which assigns a value to a variable, and the double equals symbol (==), which tests whether two values are equal.

In Python, the statement

```python
for i in range(n):
```

runs the indented block beneath it once for each value of i from 0 to n-1. (This is called a **for loop**, and Codecademy will spend more time on it later, but for now we note that the variable i can be given any name you like.)

For example, we could use the following code to print all even numbers between 0 and 100, inclusive.

```python
for number in range(51):
    print(2*number)
```

Note that we used range(51) and not range(50): because range(n) runs from 0 to n-1, number must reach 50 for 100 to be printed.

In general, the final k-mer of a string of length n begins at position n-k; for example, the final 3-mer of "GACCATACTG", which has length 10, begins at position 10 - 3 = 7. This observation implies that the window should slide between position 0 and position len(Text)-len(Pattern).

Thus, to slide our window from position 0 to `len(Text)-len(Pattern)`, and because range(n) stops at n-1, we will need a for loop of the form

```python
for i in range(len(Text)-len(Pattern)+1):
```
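To see which windows this loop visits, the sketch below lists every 3-mer of the example string used earlier. The name `k` stands in for len(Pattern), and the list `windows` is introduced only for illustration.

```python
Text = "GACCATACTG"
k = 3  # stands in for len(Pattern)

# range(len(Text)-k+1) yields 0, 1, ..., len(Text)-k, so the last window
# starts at position 10 - 3 = 7 and ends exactly at the end of Text.
windows = []
for i in range(len(Text)-k+1):
    windows.append(Text[i:i+k])
    print(i, Text[i:i+k])

# Dropping the "+1" would stop at position 6 and miss the final 3-mer "CTG".
```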

We are now ready to use this for loop along with our previous if statement to expand our code for PatternCount.

```python 
count = 0
for i in range(len(Text)-len(Pattern)+1):
    if Text[i:i+len(Pattern)] == Pattern:
        count = count+1 
```
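The loop above assumes that Text and Pattern have already been defined. As a minimal sketch, here it is run on the values from the figure, with an added print statement (not part of the original code) that reports where each match occurs.

```python
Text = "CGATATATCCATAG"
Pattern = "ATA"

count = 0
for i in range(len(Text)-len(Pattern)+1):
    if Text[i:i+len(Pattern)] == Pattern:
        # Report where the window matched, then increment the counter.
        print("match at position", i)
        count = count+1

print(count)  # 3
```

The matches at positions 2 and 4 overlap, which is why the window has to advance one position at a time.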

```python
# Read the whole genome file into one string (read-only mode is enough here).
with open("/home/duansq/Documents/python/codecademy/vibrio_cholerae_dna_seq.txt", "r") as dna_file:
    dna_seq = dna_file.read()
```

```python
def PatternCount(Pattern, Text):
    count = 0
    # The last window starts at len(Text)-len(Pattern), so range needs the +1.
    for i in range(len(Text) - len(Pattern) + 1):
        if Text[i:i + len(Pattern)] == Pattern:
            count += 1
    return count
```

```python
PatternCount("ATC", dna_seq)
```

21259
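Before trusting a count on a whole genome, it can help to check the function against the small examples worked out earlier. The sketch below restates the loop under the illustrative name `pattern_count` (not part of the original notebook) so that it runs on its own, and compares it with Python's built-in `str.count`, which counts only non-overlapping occurrences.

```python
def pattern_count(pattern, text):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    # The last window starts at len(text)-len(pattern), hence the +1.
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count += 1
    return count

print(pattern_count("ACTAT", "ACAACTATGCATACTATCGGGAACTATCCT"))  # 3
print(pattern_count("ATA", "CGATATATCCATAG"))                    # 3

# str.count skips past each match, so it misses the overlapping occurrence.
print("CGATATATCCATAG".count("ATA"))                             # 2
```

A pattern that sits at the very end of the text, such as "CTG" in "GACCATACTG", is a useful extra test, since it is found only when the loop reaches the final window.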