# Rolling dice to find motifs

We will now turn to **randomized algorithms** that flip coins and roll dice in order to search for motifs. Making random algorithmic decisions may sound like a disastrous idea; just imagine a chess game in which every move would be decided by rolling a die. However, an 18th Century French mathematician and naturalist, Comte de Buffon, first proved that randomized algorithms are useful by randomly dropping needles onto parallel strips of wood and using the results of this experiment to accurately approximate the constant π. For more details, see "DETOUR: Buffon's Needle" in the print companion.

Randomized algorithms may be nonintuitive because they lack the control of traditional algorithms. Some randomized algorithms are **Las Vegas** algorithms, which deliver solutions that are guaranteed to be exact, despite the fact that they rely on making random decisions. Yet most randomized algorithms, including the motif finding algorithms that we will consider in this chapter, are **Monte Carlo** algorithms. These algorithms are not guaranteed to return exact solutions, but they do quickly find *approximate* solutions. Because of their speed, they can be run many times, allowing us to choose the best approximation from thousands of runs.

```
RandomizedMotifSearch(Dna, k, t)
    randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
    BestMotifs ← Motifs
    while forever
        Profile ← Profile(Motifs)
        Motifs ← Motifs(Profile, Dna)
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
        else
            return BestMotifs
```

In [56]:
f = open("randomizedMotifSearch.txt", "r")
input = f.read().split("\n")
f.close()

k = int(input[0].split(" ")[0]) # length of k-mers
t = int(input[0].split(" ")[1]) # number of strings
dnas = input[1].split(" ") # collection of dna strings

In [39]:
import random
import numpy as np

In [45]:
def hamming(s1, s2):
    """
    Find Hamming distance between two strings
    """
    dist = 0
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            dist += 1
    return dist
    
def score(motifs):
    '''Returns the score of the dna list motifs.'''
    score = 0
    for i in range(len(motifs[0])):
        motif = ''.join([motifs[j][i] for j in range(len(motifs))])
        # print motif
        # print [homogeneous*len(motif) for homogeneous in 'ACGT']
        # Calculate the min score between motif and [AAAAA, CCCCC, GGGGG, TTTTT]
        # avoiding find the consensus strings
        score += min([hamming(motif, homogeneous*len(motif)) for homogeneous in 'ACGT'])
    return score

def laplace_profile(dna):
    score_dicts = []
    counts = {}
    for i in range(len(dna[0])):
        nt_counts = {"A":1, "C":1, "G":1, "T":1}
        for seq in dna:
            nt_counts[seq[i]] += 1
        for k, v in nt_counts.items(): # necessary because of floating point error if done in prev for in
            score_dicts.append(nt_counts)
    score_profile = np.array([list(i.values()) for i in score_dicts])
    score_profile = score_profile.T # transpose
    return score_profile

def most_probable_kmer(text, k, profile):
    max_prob = -1
    for i in range(len(text)-k+1):
        kmer = text[i:i+k]
        prob = 1
        for j in range(k):
            if kmer[j] == "A":
                prob *= profile[0][j]
            elif kmer[j] == "C":
                prob *= profile[1][j]
            elif kmer[j] == "G":
                prob *= profile[2][j]
            elif kmer[j] == "T":
                prob *= profile[3][j]
        if prob > max_prob:
            max_prob = prob
            max_kmer = kmer
    return max_kmer

def randomizedMotifSearch(dnas, k, t):
    string_length = len(dnas[0])
    motifs = []
    for i in range(t):
        j = random.randint(0,string_length-k)
        motifs.append(dnas[i][j:j+k])
    best_motifs = motifs
    while True:    
        prof = laplace_profile(motifs)
        motifs = [most_probable_kmer(dnas[i], k, prof) for i in range(len(dnas))]
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
        else:
            return best_motifs
        
def thousand_times(dnas, k, t):
    best_score_overall = float("inf")
    best_motifs_overall = []
    for i in range(1000):
        current_motifs = randomizedMotifSearch(dnas, k, t)
        current_score = score(current_motifs)
        if current_score < best_score_overall:
            best_motifs_overall = current_motifs
            best_score_overall = current_score
    return best_motifs_overall
            

In [48]:
" ".join(thousand_times(dnas, k, t))

'AATTGG AACTGA AATTGG TAATGG AATTGG AAGCGA AATTGG AAAAGG'

In [63]:
import random

def Score(motifs):
    score = 0
    t = len(motifs)
    k = len(motifs[0])
    for j in range(k):
        column = [motifs[i][j] for i in range(t)]
        max_freq = max(column.count('A'), column.count('C'), column.count('G'), column.count('T'))
        score += (t - max_freq)
    return score

def ProfileWithPseudocounts(motifs):
    t = len(motifs)
    k = len(motifs[0])
    profile = {'A': [1] * k, 'C': [1] * k, 'G': [1] * k, 'T': [1] * k}
    for i in range(t):
        for j in range(k):
            profile[motifs[i][j]][j] += 1
    for nucleotide in profile:
        for j in range(k):
            profile[nucleotide][j] /= (t + 4)
    return profile

def ProfileMostProbableKmer(text, k, profile):
    n = len(text)
    max_prob = -1
    most_prob_kmer = text[0:k]
    for i in range(n - k + 1):
        kmer = text[i:i+k]
        prob = 1
        for j in range(k):
            prob *= profile[kmer[j]][j]
        if prob > max_prob:
            max_prob = prob
            most_prob_kmer = kmer
    return most_prob_kmer

def RandomMotifs(Dna, k, t):
    motifs = []
    for i in range(t):
        start = random.randint(0, len(Dna[i]) - k)
        motifs.append(Dna[i][start:start + k])
    return motifs

def RandomizedMotifSearch(Dna, k, t):
    motifs = RandomMotifs(Dna, k, t)
    best_motifs = motifs
    while True:
        profile = ProfileWithPseudocounts(motifs)
        motifs = [ProfileMostProbableKmer(seq, k, profile) for seq in Dna]
        if Score(motifs) < Score(best_motifs):
            best_motifs = motifs
        else:
            return best_motifs

def RunRandomizedMotifSearch(Dna, k, t, iterations=1000):
    best_motifs = RandomizedMotifSearch(Dna, k, t)
    best_score = Score(best_motifs)
    for _ in range(iterations - 1):
        motifs = RandomizedMotifSearch(Dna, k, t)
        current_score = Score(motifs)
        if current_score < best_score:
            best_motifs = motifs
            best_score = current_score
    return best_motifs

# Sample Input
k = 8
t = 5
Dna = [
    "CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA",
    "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
    "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
    "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
    "AATCCACCAGCTCCACGTGCAATGTTGGCCTA"
]

# Running the RandomizedMotifSearch
best_motifs = RunRandomizedMotifSearch(Dna, k, t)
print("Best Motifs:")
print(" ".join(best_motifs))

Best Motifs:
AACGGCCA AAGTGCCA TAGTACCG AAGTTTCA ACGTGCAA


In [57]:
" ".join(thousand_times(dnas, k, t))

'CGTGGTTCCATATGT CGGCTGCCCAGACAT CGTGGTCCCAGACAT AGGTATCCCAGACGA CGGTCGGCCAGACAT CGGTATCATCGACAT CGGTATCCGCAACAT CATAATCCCAGACAT CGGTAAGTCAGACAT CGGTATCCCAGCGGT CGGTACAGCAGACAT CGGTATTTTAGACAT ACTTATCCCAGACAT GCGTATCCCAGACAA CGGTATTGGAGACAT CGGTATCCCAGAGTA CGGAGGCCCAGACAT CGGTGGTCCAGACAT CGGTATCCCAACTAT CGGTATCCCTTCCAT'

In [61]:
def randomized_motifs(dnas, k, t):
    string_length = len(dnas[0])
    motifs = []
    for i in range(t):
        j = random.randint(0, string_length-k)
        motifs.append(dnas[i][j:j+k])
    return motifs
    
def randomizedMotifSearch(dnas, k, t):
    motifs = randomized_motifs(dnas, k, t)
    best_motifs = motifs
    print(laplace_profile(motifs))
    print(profile_with_pseudocounts(motifs))
    """
    while True:    
        prof = laplace_profile(motifs)
        motifs = [most_probable_kmer(dnas[i], k, prof) for i in range(len(dnas))]
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
        else:
            return best_motifs
    """

In [62]:
randomizedMotifSearch(dnas, k, t)

[[10 10 10 10  5  5  5  5  9  9  9  9  6  6  6  6  6  6  6  6  7  7  7  7
   6  6  6  6 10 10 10 10  7  7  7  7  8  8  8  8  6  6  6  6  3  3  3  3
   2  2  2  2  2  2  2  2  6  6  6  6]
 [ 3  3  3  3  7  7  7  7  6  6  6  6  5  5  5  5  5  5  5  5  6  6  6  6
   4  4  4  4  6  6  6  6  4  4  4  4  7  7  7  7  8  8  8  8  9  9  9  9
  11 11 11 11  7  7  7  7  6  6  6  6]
 [ 7  7  7  7  6  6  6  6  3  3  3  3  6  6  6  6  6  6  6  6  5  5  5  5
   8  8  8  8  5  5  5  5  6  6  6  6  5  5  5  5  6  6  6  6  4  4  4  4
   6  6  6  6  7  7  7  7  8  8  8  8]
 [ 4  4  4  4  6  6  6  6  6  6  6  6  7  7  7  7  7  7  7  7  6  6  6  6
   6  6  6  6  3  3  3  3  7  7  7  7  4  4  4  4  4  4  4  4  8  8  8  8
   5  5  5  5  8  8  8  8  4  4  4  4]]
{'A': [0.4166666666666667, 0.20833333333333334, 0.375, 0.25, 0.25, 0.2916666666666667, 0.25, 0.4166666666666667, 0.2916666666666667, 0.3333333333333333, 0.25, 0.125, 0.08333333333333333, 0.08333333333333333, 0.25], 'C': [0.125, 0.2916666666666667, 0.2