# Finding a Motif in DNA

## Problem

Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s (both inclusive); for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).
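
The problem's notation is 1-based with an inclusive end position, whereas Python strings are 0-based with an exclusive end index. As a minimal sketch of the translation between the two, the snippet below reproduces the two examples given above; the helper name `substring` is hypothetical and introduced only for illustration.

```python
s = "AUGCUUCAGAAAGGUCUUACG"

def substring(s, j, k):
    """Return s[j:k] in the problem's 1-based, inclusive notation."""
    # Shift the start down by one; the inclusive end k equals Python's exclusive end.
    return s[j - 1:k]

# 1-based positions of every 'U' in s
u_positions = [i + 1 for i, symbol in enumerate(s) if symbol == "U"]

print(u_positions)         # [2, 5, 6, 15, 17, 18]
print(substring(s, 2, 5))  # UGCU
```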

_Given_: Two DNA strings s and t (each of length at most 1 kbp).

_Return_: All locations of t as a substring of s.

**Sample Dataset**

    GATATATGCATATACTT
    ATAT

**Sample Output**

    2 4 10

____________________
## Solution

Finding a motif in a string is a relatively simple task that can be accomplished in $O((l-k+1) \cdot k)$ time, where $l$ is the length of the string and $k$ is the length of the motif. The function below steps through the input one character at a time, up to the last 0-based start index $l-k$, and compares the $k$-character substring starting at each index to the motif, appending the index to the output list whenever the two match. The boolean 'start_at_one' lets us switch easily between 1-based and 0-based numbering.

In [1]:
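def motif_locations(s, t, start_at_one=False):
    """Return every position at which t occurs in s (overlaps included)."""
    locations = []
    k = len(t)
    b = int(start_at_one)  # offset: 1 for 1-based positions, 0 for 0-based
    # The last window that can hold t starts at index len(s) - k.
    for i in range(len(s) - k + 1):
        if s[i:i + k] == t:
            locations.append(i + b)
    return locations

s = 'GATATATGCATATACTT'
t = 'ATAT'
print(motif_locations(s, t, start_at_one=True))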

[2, 4, 10]
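
As an alternative to slicing at every index, the same scan can be sketched with Python's built-in `str.find`, restarting the search one character past each hit so that overlapping occurrences are still reported. This is not the approach used above; the function name `motif_locations_find` is an assumption made for this example.

```python
def motif_locations_find(s, t, start_at_one=False):
    """Locate every (possibly overlapping) occurrence of t in s using str.find."""
    locations = []
    offset = int(start_at_one)  # 1 for 1-based positions, 0 for 0-based
    i = s.find(t)
    while i != -1:
        locations.append(i + offset)
        # Resume one character after the last hit so overlaps are not skipped.
        i = s.find(t, i + 1)
    return locations

found = motif_locations_find('GATATATGCATATACTT', 'ATAT', start_at_one=True)
print(*found)  # space-separated, matching the Sample Output: 2 4 10
```

Unpacking the list with `print(*found)` gives the space-separated form shown in the Sample Output rather than Python's list representation.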


Biopython has built-in methods for finding patterns that, although overkill for this simple task, are powerful and useful in more advanced pattern searching. We start by defining a motif object from a list containing all the patterns we want to search for. In our case the list has only one element, but it can be useful to include every variation of the motif that lies within Hamming distance _d_ of it.

In [2]:
from Bio.Seq import Seq
from Bio import motifs

m = motifs.create([Seq(t)])

print(m)
# Note that the length of our motif object is the length of the patterns included,
# which have to be of the same length, and not the number of different patterns.
print(len(m))

ATAT

4


Then we can search for all motif instances in a string using the 'search' function. This generator yields the position of every motif instance in the string, together with the matching sequence.

In [3]:
test_seq = Seq(s)
# Each hit is a (position, matching sequence) pair.
for pos, seq in m.instances.search(test_seq):
    print(pos, seq)

1 ATAT
3 ATAT
9 ATAT


Note that the positions reported by 'search' are 0-based; adding 1 to each gives the 1-based locations 2, 4 and 10 required by the problem.
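
The earlier remark about including every variation of a motif within Hamming distance _d_ can be made concrete with a small generator of such variations. The sketch below uses only the standard library; the function name `neighbors` and the default DNA alphabet "ACGT" are assumptions made for illustration. The resulting strings, each wrapped in `Seq`, would presumably form the list handed to `motifs.create`, though that step is not shown here.

```python
from itertools import combinations, product

def neighbors(pattern, d, alphabet="ACGT"):
    """Return all strings over alphabet within Hamming distance d of pattern."""
    result = {pattern}
    for r in range(1, d + 1):
        # Choose which r positions to rewrite, then every way of filling them.
        for positions in combinations(range(len(pattern)), r):
            for letters in product(alphabet, repeat=r):
                candidate = list(pattern)
                for p, c in zip(positions, letters):
                    candidate[p] = c
                # Fills that repeat the original letter give a closer neighbour;
                # the set removes the resulting duplicates.
                result.add("".join(candidate))
    return sorted(result)

variants = neighbors("ATAT", 1)
print(len(variants))  # 13: the motif itself plus 4 positions x 3 substitutions
```

All variants have the same length as the original motif, which matches the requirement noted above that the patterns in a motif object be of equal length.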