# Bioinformatics Algorithms, Week 4

Last week, we began thinking about ways to find regulatory motifs in DNA. After trial and error, we came up with a **Greedy Search with Pseudocounts** as a decent way to find regulatory motifs. This week, we are looking at randomized algorithms as a way to improve upon our solution to the **Motif Finding** problem. 

Randomized algorithms can be unintuitive because they rely on chance to find the solution, but they can be very helpful. **Las Vegas** algorithms return solutions guaranteed to be exact, while **Monte Carlo** algorithms quickly find approximate solutions that are not guaranteed to be exact. Because **Monte Carlo** algorithms are so fast, though, we can run them thousands of times and choose the best approximation.

### Randomized Motif Search

Last week, we defined *Profile*(*Motifs*) as the profile matrix constructed from a collection of *k*-mers *Motifs*. Now, given a collection of strings *Dna* and an arbitrary profile matrix *Profile*, we can go the other way and define a function *Motifs*(*Profile*, *Dna*) as the collection of *Profile*-most probable *k*-mers, one from each string in *Dna*.

In [1]:
def profile_most_probable_kmer(text, k, profile):
    char_index_dict = {"A": 0, "C": 1, "G": 2, "T": 3}
    most_probable_kmer = ""
    max_probability = float('-inf')
    for i in range(len(text)-k+1):
        kmer = text[i:i+k]
        current_prob = 1
        for j in range(k):
            nt = char_index_dict[kmer[j]]
            current_prob *= profile[nt][j]
        if current_prob > max_probability:
            max_probability = current_prob
            most_probable_kmer = kmer
    return most_probable_kmer

def motifs_function(profile, dna):
    return [profile_most_probable_kmer(text, len(profile[0]), profile) for text in dna]

In [2]:
profile = [[0.8, 0, 0, 0.2], [0, 0.6, 0.2, 0], [0.2, 0.2, 0.8, 0], [0, 0.2, 0, 0.8]]
dna=["TTACCTTAAC", "GATGTCTGTC", "ACGGCGTTAG", "CCCTAACGAG", "CGTCAGAGGT"]
print(" ".join(motifs_function(profile, dna)))

ACCT ATGT GCGT ACGA AGGT
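To see why `ACCT` is picked from the first string, it helps to tabulate the probability of every 4-mer in that string under the same profile. The sketch below does this; `kmer_probability`, `example_profile` and `example_text` are names made up for this illustration. Every 4-mer other than `ACCT` hits a zero entry somewhere in the profile, so `ACCT`, with probability 0.8 · 0.6 · 0.2 · 0.8 = 0.0768, is the only candidate with a nonzero probability.

```python
def kmer_probability(kmer, profile):
    """Multiply the profile entries matching each nucleotide of kmer."""
    rows = {"A": 0, "C": 1, "G": 2, "T": 3}
    prob = 1.0
    for j, nt in enumerate(kmer):
        prob *= profile[rows[nt]][j]
    return prob

# Same profile and first DNA string as in the cell above
example_profile = [[0.8, 0, 0, 0.2], [0, 0.6, 0.2, 0], [0.2, 0.2, 0.8, 0], [0, 0.2, 0, 0.8]]
example_text = "TTACCTTAAC"
width = len(example_profile[0])

# Probability of each 4-mer (all 4-mers in this string happen to be distinct)
table = {
    example_text[i:i + width]: kmer_probability(example_text[i:i + width], example_profile)
    for i in range(len(example_text) - width + 1)
}
for kmer, p in table.items():
    print(kmer, round(p, 4))
```

The zeros are also a reminder of why the later cells switch to a profile with pseudocounts: a single zero entry rules a *k*-mer out completely.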


This is pretty interesting stuff. We can first construct a set *Motifs* from randomly selected *k*-mers in *Dna*, then obtain *Profile*(*Motifs*). Then, we can feed that *Profile* and *Dna* back into the *Motifs* function to obtain a new set of *Motifs*. We can keep alternating these two steps until the score of the motifs stops improving.

*Profile*(*Motifs*) -> *Motifs*(*Profile*(*Motifs*)) -> *Profile*(*Motifs*(*Profile*(*Motifs*))) ...

```
RandomizedMotifSearch(Dna, k, t)
    randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
    BestMotifs ← Motifs
    while forever
        Profile ← Profile(Motifs)
        Motifs ← Motifs(Profile, Dna)
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
        else
            return BestMotifs
```

In [3]:
import random
import numpy as np

def laplace_profile(motifs):
    def count(motifs):
        k = len(motifs[0])
        counts = [[0]*k for i in range(4)] # empty matrix to fill
        nt_index_map = {"A": 0, "C": 1, "G": 2, "T": 3}
        for motif in motifs:
            for i in range(k):
                nt = motif[i]
                counts[nt_index_map[nt]][i] += 1
        return counts

    counts = np.array(count(motifs))
    return (counts+1)/(len(motifs)+4)

def profile_most_probable_kmer(text, k, profile):
    char_index_dict = {"A": 0, "C": 1, "G": 2, "T": 3}
    
    most_probable_kmer = ""
    max_probability = float('-inf')
    for i in range(len(text)-k+1):
        kmer = text[i:i+k]
        current_prob = 1
        for j in range(k):
            nt = char_index_dict[kmer[j]]
            current_prob *= profile[nt][j]
        if current_prob > max_probability:
            max_probability = current_prob
            most_probable_kmer = kmer
    return most_probable_kmer

def score(motifs):
    # Score = number of symbols that differ from the most common symbol in each column
    k = len(motifs[0])
    total = 0
    for i in range(k):
        nt_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
        for motif in motifs:
            nt_counts[motif[i]] += 1
        # Everything except the most frequent nucleotide counts as a mismatch
        total += len(motifs) - max(nt_counts.values())
    return total


def randomized_motif_search(dna, k, t):
    # t (the number of strings) is kept to match the pseudocode; the loops use dna directly
    motifs = []
    for text in dna:
        # Use this string's own length so strings of unequal length stay in range
        i = random.randint(0, len(text)-k)
        motifs.append(text[i:i+k])
    best_motifs = motifs
    
    while True:
        profile = laplace_profile(motifs)
        motifs = [profile_most_probable_kmer(text, k, profile) for text in dna]
        if score(motifs) < score(best_motifs):
            best_motifs = motifs
        else:
            return best_motifs
        
def thousand_times(dna, k, t):
    best_list = [randomized_motif_search(dna, k, t) for i in range(1000)]
    best_dict = {}
    for bl in best_list:
        best_dict[" ".join(bl)] = score(bl)

    min_value = min(best_dict.values())
    min_keys = [key for key, value in best_dict.items() if value == min_value]
    best_list = [" ".join(i) for i in best_list]
    for min_key in min_keys:
        print(min_key)
        print(best_list.count(min_key))
    

In [4]:
k = 8 
t = 5
dna = [
    "CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA",
    "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
    "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
    "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
    "AATCCACCAGCTCCACGTGCAATGTTGGCCTA"
]

In [5]:
thousand_times(dna, k, t)

AACGGCCA AAGTGCCA TAGTACCG AAGTTTCA ACGTGCAA
1
TCTCGGGG CCAAGGTG TACAGGCG TTCAGGTG TCCACGTG
5


The answer the instructors want is "TCTCGGGG CCAAGGTG TACAGGCG TTCAGGTG TCCACGTG", and we obtain it about half of the time with this function. Because several motif sets can tie for the lowest score, I ended up printing every set that achieves it, along with the number of the 1,000 runs that produced it. In the run above, "AACGGCCA AAGTGCCA TAGTACCG AAGTTTCA ACGTGCAA" ties with the expected answer. This is kind of frustrating given the structure of the course, since they ask for one correct answer.
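One way to return a single answer is to break ties between lowest-scoring motif sets by how often each one was found. The sketch below is only an illustration of that idea: `best_of_runs` is a hypothetical helper, and the tie-breaking rule (most frequent first, then alphabetical) is my own assumption, not something the course specifies.

```python
from collections import Counter

def best_of_runs(search, score, runs=1000):
    """Call search() `runs` times and return one lowest-scoring result.

    Returns (motifs, score, count), where count is the number of runs that
    produced the chosen motifs. Ties on score are broken by higher count,
    then alphabetically, so the choice does not depend on run order.
    """
    results = [tuple(search()) for _ in range(runs)]
    frequency = Counter(results)
    best = min(frequency, key=lambda r: (score(r), -frequency[r], r))
    return list(best), score(best), frequency[best]

# Possible use with the functions defined earlier in this notebook:
# best_of_runs(lambda: randomized_motif_search(dna, k, t), score)
```

Frequency is a reasonable tie-breaker here because a motif set that many random starts converge to is less likely to be a fluke, but it is still a heuristic.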

In [6]:
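with open("./randomizedMotifSearch.txt", 'r') as f:
    rand_input = f.read().split("\n")

k = int(rand_input[0].split(" ")[0])
t = int(rand_input[0].split(" ")[1])
# Skip blank lines (e.g. from a trailing newline) so they are not treated as DNA strings
dna = [line for line in rand_input[1:] if line]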

In [7]:
thousand_times(dna, k, t)

TTTTTGCAGCCTACC GTTAGACTACCGACC GTTCAGTTACCGACC GCAGTTCTACCGACC GTTCTTCACACGACC CGTCTTCTACCGACA GTTAGCCTACCGACC GTTCTCAAACCGACC TCACTTCTACCGACC GTTCTTCTAAGTACC TTTCTTCTACCGAAA GTTCTTCTACGACCC GTTCGCATACCGACC GTTCTTTAGCCGACC GTTCTAAAACCGACC GTTCTTCTACCGGAG GTTCTTAATCCGACC GTTCTTCTCGTGACC GTGACTCTACCGACC GTTCTTCTACCTTTC
23


In [9]:
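import math
# Probability of at least one success in 10 independent trials, each succeeding
# with probability 1/586: sum over i = 1..10 successes of the binomial terms
# C(10, i) * (1/586)**i * (585/586)**(10-i)
prob=0
for i in range(1,11):
    tmp=((585**(10-i))/(586**10))*math.comb(10,i)
    prob+=tmp

print(prob)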

0.01693439692910998


### Gibbs Sampling

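The **Randomized Motif Search** algorithm can replace every *k*-mer in *Motifs* in a single iteration. Gibbs sampling is more cautious: it replaces only one *k*-mer per iteration, and it picks the replacement at random, weighted by the profile, rather than always taking the most probable *k*-mer.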

```
GibbsSampler(Dna, k, t, N)
    randomly select k-mers Motifs = (Motif1, …, Motift) in each string from Dna
    BestMotifs ← Motifs
    for j ← 1 to N
        i ← Random(t)
        Profile ← profile matrix constructed from all strings in Motifs except for Motifi
        Motifi ← Profile-randomly generated k-mer in the i-th sequence
        if Score(Motifs) < Score(BestMotifs)
            BestMotifs ← Motifs
    return BestMotifs
```

To describe how GibbsSampler updates *Motifs*, we will need a slightly more advanced random number generator. Given a probability distribution (p1, …, pn), this random number generator, denoted *Random*(p1, …, pn), models an n-sided biased die and returns integer i with probability pi. For example, the standard six-sided fair die represents the random number generator Random(1/6, 1/6, 1/6, 1/6, 1/6, 1/6), whereas a biased die might represent the random number generator Random(0.1, 0.2, 0.3, 0.05, 0.1, 0.25).

GibbsSampler further generalizes the random number generator by using the function Random(p1, …, pn) defined for any set of non-negative numbers, i.e., not necessarily satisfying the condition that the pi sum to 1. If the pi sum to some C > 0 instead, then Random(p1, …, pn) is defined as Random(p1/C, …, pn/C), where (p1/C, …, pn/C) is a probability distribution. For example, for (0.1, 0.2, 0.3) with 0.1 + 0.2 + 0.3 = 0.6,

Random(0.1, 0.2, 0.3) = Random(0.1/0.6, 0.2/0.6, 0.3/0.6) = Random(1/6, 1/3, 1/2).

STOP and Think: Implement the random number generator Random(p1, …, pn) so that it uses RandomNumber(X) (for an appropriately chosen integer X) as a subroutine.
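One possible answer to the exercise, as a rough sketch: draw a uniform integer, map it onto the interval [0, C), and walk the cumulative sums of the weights to see which side of the die it lands on. Here I assume RandomNumber(X) returns a uniform integer from 0 to X − 1 and model it with `random.randrange`; `biased_random` and `resolution` are names invented for this sketch. The returned index is 0-based, unlike the 1-based pseudocode, and the probabilities are only accurate to about 1/`resolution`.

```python
import bisect
import random
from itertools import accumulate

def biased_random(weights, resolution=10**6, rng=random):
    """Return index i with probability weights[i] / sum(weights).

    weights: non-negative numbers with a positive sum (need not sum to 1).
    resolution: the X passed to the uniform integer generator.
    """
    cumulative = list(accumulate(weights))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    r = rng.randrange(resolution)            # stands in for RandomNumber(X)
    point = (r + 0.5) / resolution * total   # a point strictly inside (0, total)
    # First index whose cumulative weight exceeds the point
    return bisect.bisect_right(cumulative, point)

print(biased_random([0.1, 0.2, 0.3]))
```

Dividing by the total inside the function means the caller never has to normalize first. The cells below take a shortcut and use `random.choices` rather than a hand-rolled generator.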

Because the course just doesn't work for my brain, I am following this [Medium post](https://abhinavmanc.medium.com/python-programs-for-beginner-bioinformatics-part-9-eae936c47370) to try to help me along.

In [10]:
def normalize(probs):
    sum_probs = sum(probs.values())
    return {key: value/sum_probs for key, value in probs.items()}
    
probs = {
  "GCGT": 0.1,
  "CGTT": 0.15,
  "GTTA": 0.25
}

normalize(probs)

{'GCGT': 0.2, 'CGTT': 0.3, 'GTTA': 0.5}

In [11]:
def weighted_die(probs):
    probs = normalize(probs)
    sides = list(probs.keys())
    weights = list(probs.values())
    return random.choices(sides, weights=weights)[0]

n = 10000
test_dict = {}
for i in range(n):
    result = weighted_die(probs)
    test_dict[result] = test_dict.get(result, 0) + 1
    
print(test_dict)

{'GTTA': 4982, 'CGTT': 3009, 'GCGT': 2009}


In [12]:
import numpy as np

def count(motifs):
    k = len(motifs[0])
    counts = [[0]*k for i in range(4)] # empty matrix to fill
    nt_index_map = {"A": 0, "C": 1, "G": 2, "T": 3}
    for motif in motifs:
        for i in range(k):
            nt = motif[i]
            counts[nt_index_map[nt]][i] += 1
    return counts

def profile(motifs):
    counts = np.array(count(motifs))
    return counts/len(motifs)

def probability(text, profile):
    char_index_dict = {"A": 0, "C": 1, "G": 2, "T": 3}
    prob = 1
    for i in range(len(text)):
        ind = char_index_dict[text[i]]
        prob *= profile[ind][i]
    return prob


In [13]:
dna = ["AAA", "AAA", "AAA", "AAA"]
prof = laplace_profile(dna)
probability("AAA", prof)

0.244140625

In [14]:
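def profile_randomly_generated_kmer(text, profile, k):
    # Key by start position rather than by k-mer: a k-mer that occurs more than
    # once in text would otherwise collapse into a single dictionary entry
    probs = {i: probability(text[i:i+k], profile) for i in range(len(text)-k+1)}
    start = weighted_die(probs)
    return text[start:start+k]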

In [15]:
profile_randomly_generated_kmer("ATGCATATTA", prof, 3)

'ATT'


In [None]:
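# Unexecuted draft: random_kmer_selection, Score, Random, Profile and prgkst
# are not defined in this notebook, so this cell will not run as written.
# The working version is gibbs_sampler in the next cell.
def gibbssampler(k,t,N,l,kmers_array):
    motifs = random_kmer_selection(k,t,l,kmers_array)
    # Start BestMotifs from the same random selection, as in the pseudocode
    bestmotifs = motifs[:]
    score_bestmotifs = Score(bestmotifs,k,t)
    for j in range(N):
        i = Random(t)
        motifs.pop(i)
        profile = Profile(motifs,k,t)
        motifs_i = prgkst(k,kmers_array[i],profile)
        motifs.insert(i,motifs_i)
        score_motifs = Score(motifs,k,t)
        if score_motifs  < score_bestmotifs:
            # Copy, otherwise later pop/insert calls would also change bestmotifs
            bestmotifs = motifs[:]
            score_bestmotifs = score_motifs
    return (bestmotifs,score_bestmotifs)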

In [92]:
def gibbs_sampler(dna, k, t, n):
    motifs = []
    for text in dna:
        # Use this string's own length so strings of unequal length stay in range
        i = random.randint(0, len(text)-k)
        motifs.append(text[i:i+k])
    best_motifs = motifs.copy()
    for j in range(n):
        i = random.randint(0,t-1)
        # Leave motif i out and build the profile from the remaining t-1 motifs
        motifs.pop(i)
        prof = laplace_profile(motifs)
        # Re-draw motif i from string i, weighted by that profile
        motif_i = profile_randomly_generated_kmer(dna[i], prof, k)
        motifs.insert(i, motif_i)
        
        if score(motifs) < score(best_motifs):
            best_motifs = motifs.copy()
            
    return best_motifs

In [102]:
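with open('rosalind_ba2g.txt', 'r') as f:
    gibbs_input = f.read().split("\n")
k = int(gibbs_input[0].split(" ")[0])
t = int(gibbs_input[0].split(" ")[1])
n = int(gibbs_input[0].split(" ")[2])
# Skip blank lines (e.g. from a trailing newline) so they are not treated as DNA strings
dna = [line for line in gibbs_input[1:] if line]
print(dna)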

['AGGTAGCGCCTGATACGAAGCGGCGTCCTCATTCGAAGTAAGGTAAAGGGGTAGCTCCAGTCCAGACCGTCAGGTGTTCGAAGACGGGTCTATGGGGAGGGTAAGCCGTGCTCGCTAAAGACACCACACCATGATGGCGATTGCCACAGTGAAGGGAAATGCTCACTTAATATGCCGACCGACCGGGCCGATATATTGAGTTGGCCTTAGAATCTACAGTTAATACAGGAGGTCTCGAGAGAATTAGTAACCTACTGCTCCGATTCTCGGGTTACCTATATACCGCGGACATTCTCAATCATGAGGTAGCGCCTGATA', 'CGAAGCGGCGTCCTCATTCGAAGTAAGGTAAAGGGGTAGCTCCAGTCCAGACCGTCAGGTGTTCGAAGACGGGTCTATGGGGAGGGTAAGCCGTGCTCGCTAAAGACACCACACCATGATGGCGATTGCCACAGTGAAGGGAAATGCTCACTTAATATGCGCGGTCCGTTCCTCCCGACCGACCGGGCCGATATATTGAGTTGGCCTTAGAATCTACAGTTAATACAGGAGGTCTCGAGAGAATTAGTAACCTACTGCTCCGATTCTCGGGTTACCTATATACCGCGGACATTCTCAATCATGAGGTAGCGCCTGATA', 'TAAGTTTCGAGGTGGGGCACCGCGGGTCAGCTCGTTCTGCTTGCAGCGACTTATGTACTGGTCTACGGTCCTTTGCGCCCTGTTACCGGACGTCTAACTTTTAGGCTGTTACACTCACGCCAAAAGAGCAATAACTATGCCGCATGTCAAAAATGGCTGCCCCATAACGCCCTACGTTCCTCATAGGGAAGCTTGGTGCCTGTCCTAACGTAGGCAGGAGCAGTGGCCTGACGTGCTAGGCTATTATATCTCTGGTTAAGTCTTCCTCCCAAGGCAAATCAACTCAGGCCGCCGAGCCTACAAACTGCTCTTAAAGCA', 'GCAGTTCGTCCACGAGACTAGCGGACGGGGCC

In [103]:
'''
dna = [
    "CGCCCCTCTCGGGGGTGTTCAGTAAACGGCCA",
    "GGGCGAGGTATGTGTAAGTGCCAAGGTGCCAG",
    "TAGTACCGAGACCGAAAGAAGTATACAGGCGT",
    "TAGATCAAGTTTCAGGTGCACGTCGGTGAACC",
    "AATCCACCAGCTCCACGTGCAATGTTGGCCTA"
]
k = 8
t = 5
n = 100
'''


best = gibbs_sampler(dna, k, t, n)
s = score(best)

for x in range(20): # 20 more independent runs (random restarts) of the gibbs sampler
    sample = gibbs_sampler(dna, k, t, n)
    sco = score(sample)
    if sco < s: # if the score of the sample is less than the score of the best
        s = sco # set s to the score of the sample
        best = sample[:] # set best to a copy of the sample
print("\n".join(best))
print(score(best))



CGAAGCGGCGTCCTC
CGCGGTCCGTTCCTC
CGCCCTACGTTCCTC
CGCCCAACGTTCCTC
CGCCTGTAGTTCCTC
CGCCTCGACTTCCTC
CGCCTCCCGTTCTGA
CAATTCCCGTTCCTC
CGCCTCCCGCGACTC
CGCCTCCCGTTGTGC
CGCCTCTTTTTCCTC
ACCCTCCCGTTCCTT
CGCCTATGGTTCCTC
ACGCTCCCGTTCCTC
CGCCTCCCGTAGATC
CGCCTCCCAGCCCTC
GGCCTCCCGTTCCGG
CGGTGCCCGTTCCTC
CGCCTCCGAATCCTC
CGCAGTCCGTTCCTC
64

