### A k-mer tends to have a higher probability when it is more similar to the consensus string of a profile.

### Greedy motif search:

#### Given a profile matrix Profile, we can evaluate the probability of every k-mer in a string Text and find a Profile-most probable k-mer in Text, i.e., the k-mer most likely to have been generated by Profile among all k-mers in Text. For example, ACGGGGATTACC is the Profile-most probable 12-mer in GGTACGGGGATTACCT for a profile (not reproduced here) under which every other 12-mer in this string has probability 0. If there are multiple Profile-most probable k-mers in Text, we select the first one occurring in Text.
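
The probability that a profile generates a k-mer is the product of the profile entries selected by the k-mer's nucleotides, one entry per column. Written out (the symbol $p_i$, the nucleotide at position $i$ of Pattern, is notation assumed here rather than taken from the notebook):

$$\Pr(\text{Pattern} \mid \text{Profile}) = \prod_{i=1}^{k} \text{Profile}[p_i][i]$$

A single zero entry along the way makes the whole product zero, which is how a k-mer ends up with probability 0.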

## Profile-most Probable k-mer Problem: Find a Profile-most probable k-mer in a string.

    Input: A string Text, an integer k, and a 4 × k matrix Profile.
    Output: A Profile-most probable k-mer in Text.
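
As a rough illustration of the problem statement, the sketch below scores every 3-mer of a short string against a toy profile and keeps the first best one. The profile values, the string, and the names `toy_profile`, `kmer_probability`, and `most_probable_kmer` are made up for this example and are not part of the notebook.

```python
# Toy 4 x 3 profile; rows are A, C, G, T and each column sums to 1.
# The numbers are invented purely for illustration.
ROWS = {"A": 0, "C": 1, "G": 2, "T": 3}
toy_profile = [
    [0.5, 0.1, 0.2],  # A
    [0.2, 0.6, 0.1],  # C
    [0.2, 0.2, 0.1],  # G
    [0.1, 0.1, 0.6],  # T
]

def kmer_probability(kmer, profile):
    """Multiply the profile entries picked out by each nucleotide of kmer."""
    p = 1.0
    for i, base in enumerate(kmer):
        p *= profile[ROWS[base]][i]
    return p

def most_probable_kmer(text, k, profile):
    """Return the first k-mer of text with the highest probability."""
    best_kmer, best_p = None, -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        p = kmer_probability(kmer, profile)
        if p > best_p:  # strict ">" keeps the earliest k-mer on ties
            best_kmer, best_p = kmer, p
    return best_kmer

toy_text = "GACTTACT"
print(most_probable_kmer(toy_text, 3, toy_profile))
```

Starting the running maximum at -1 means a k-mer is returned even when every k-mer has probability 0.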


In [9]:
import sys
import math
import numpy as np

In [10]:
def Profile(motifs):
    k = len(motifs[0])
    n = len(motifs)
    s = 1 / n
    # Map each nucleotide (upper or lower case) to its row index: A=0, C=1, G=2, T=3.
    seq1 = 'ACGTacgt01230123'
    seq_dict = { seq1[i]:int(seq1[i+8]) for i in range(8) }
    P = [[0 for _ in range(k)] for __ in range(4)]
    for motif in motifs:
        for i in range(k):
            P[seq_dict[motif[i]]][i] += s
    return P

In [11]:
def Pr(pattern, profile):
    seq1 = 'ACGTacgt01230123'
    seq_dict = { seq1[i]:int(seq1[i+8]) for i in range(8) }
    p = 1
    k = len(pattern)
    for i in range(k):
        p *= profile[seq_dict[pattern[i]]][i]
    return p

In [12]:
def ProfileMostPr_kmer(seq, k, profile):
    l = len(seq)
    pmax = -1
    imax = -1
    for i in range(l-k+1):
        p = Pr(seq[i:i+k], profile)
        if p > pmax:
            pmax = p
            imax = i
    return seq[imax:imax+k]

In [14]:
def Score(motifs):
    k = len(motifs[0])
    n = len(motifs)
    seq1 = 'ACGTacgt01230123'
    seq_dict = { seq1[i]:int(seq1[i+8]) for i in range(8) }
    P = [[0 for _ in range(4)] for __ in range(k)]
    for motif in motifs:
        for i in range(k):
            P[i][seq_dict[motif[i]]] += 1
    Sm = 0
    for i in range(k):
        Sm += max(P[i])
    return n * k - Sm
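
The score counts, column by column, how many letters differ from the most common letter in that column. A minimal sketch of the same idea on a toy motif set, using `collections.Counter` (the name `motif_score` and the three motifs are assumptions made for this example):

```python
from collections import Counter

def motif_score(motifs):
    """Count letters that differ from the most common letter of their column."""
    score = 0
    for column in zip(*motifs):  # zip(*motifs) yields one tuple per column
        most_common_count = Counter(column).most_common(1)[0][1]
        score += len(column) - most_common_count
    return score

toy_motifs = ["ACGT", "ACGA", "TCGT"]
# Column 1 (A, A, T) and column 4 (T, A, T) each contribute one mismatch.
print(motif_score(toy_motifs))
```

A lower score means the motifs agree more closely with their consensus string.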

In [24]:
def PseudoProfile(motifs):
    k = len(motifs[0])
    n = len(motifs)
    s = 1 / (n + 4)
    seq1 = 'ACGTacgt01230123'
    seq_dict = { seq1[i]:int(seq1[i+8]) for i in range(8) }
    # Every nucleotide starts with a pseudocount of 1, i.e. 1 / (n + 4) once normalized.
    P = [[s for _ in range(k)] for __ in range(4)]
    for motif in motifs:
        for i in range(k):
            P[seq_dict[motif[i]]][i] += s
    return P
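
With pseudocounts, each nucleotide count in a column is increased by 1 before normalizing, so every entry becomes (count + 1) / (n + 4) and no entry is ever 0. The sketch below checks this on two toy motifs; `laplace_profile` and the motifs are hypothetical names and data chosen for illustration, and NumPy is used only for convenience.

```python
import numpy as np

def laplace_profile(motifs):
    """Profile with a pseudocount of 1 added to every nucleotide in every column."""
    rows = {"A": 0, "C": 1, "G": 2, "T": 3}
    k = len(motifs[0])
    counts = np.ones((4, k))  # start every count at 1 instead of 0
    for motif in motifs:
        for i, base in enumerate(motif.upper()):
            counts[rows[base], i] += 1
    # Each column now sums to n + 4, so dividing by that makes it sum to 1.
    return counts / (len(motifs) + 4)

toy = laplace_profile(["ACG", "ACT"])
print(toy)
print(toy.sum(axis=0))  # each column should sum to 1 up to rounding
```

Because no entry is 0, a k-mer with one unseen nucleotide still gets a small positive probability.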

In [25]:
def GreedyMotifSearch(dna, k, t):
    BestMotifs = [dna[i][0:k] for i in range(t)]
    BestScore = float('inf')
    dna1 = dna[0]
    l1 = len(dna1)
    for i in range(l1-k+1):
        motifs = []
        motifs.append(dna1[i:i+k])
        for j in range(1, t):  # j avoids shadowing the outer loop index i
            P = Profile(motifs)
            motifs.append(ProfileMostPr_kmer(dna[j], k, P))
        currScore = Score(motifs)
        if currScore < BestScore:
            BestMotifs = motifs
            BestScore = currScore
    return BestMotifs

In [26]:
def GreedyMotifSearch2(dna, k, t):
    #GreedyMotifSearch with pseudocounts
    BestMotifs = [dna[i][0:k] for i in range(t)]
    BestScore = float('inf')
    dna1 = dna[0]
    l1 = len(dna1)
    for i in range(l1-k+1):
        motifs = []
        motifs.append(dna1[i:i+k])
        for j in range(1, t):  # j avoids shadowing the outer loop index i
            P = PseudoProfile(motifs)
            motifs.append(ProfileMostPr_kmer(dna[j], k, P))
        currScore = Score(motifs)
        if currScore < BestScore:
            BestMotifs = motifs
            BestScore = currScore
    return BestMotifs 
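
Both greedy functions expect `dna` to be a list of `t` strings rather than a single string. The compact, self-contained sketch below runs the pseudocount variant on a small illustrative input. All names (`greedy_search`, `most_probable`, `toy_dna`, and so on) are assumptions for this example. Unlike the functions above, it compares each candidate against the score of the initial first-k-mer motifs rather than against infinity.

```python
import numpy as np

ROWS = {"A": 0, "C": 1, "G": 2, "T": 3}

def profile_with_pseudocounts(motifs):
    """4 x k profile with a pseudocount of 1 per nucleotide per column."""
    counts = np.ones((4, len(motifs[0])))
    for motif in motifs:
        for i, base in enumerate(motif):
            counts[ROWS[base], i] += 1
    return counts / (len(motifs) + 4)

def most_probable(text, k, profile):
    """First k-mer of text with the highest probability under profile."""
    best, best_p = text[:k], -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        p = np.prod([profile[ROWS[b], j] for j, b in enumerate(kmer)])
        if p > best_p:
            best, best_p = kmer, p
    return best

def score(motifs):
    """Number of letters that disagree with the column-wise majority."""
    total = 0
    for column in zip(*motifs):
        total += len(column) - max(column.count(b) for b in "ACGT")
    return total

def greedy_search(dna, k):
    best = [text[:k] for text in dna]  # start from the first k-mer of each string
    for i in range(len(dna[0]) - k + 1):
        motifs = [dna[0][i:i + k]]
        for text in dna[1:]:
            # Rebuild the profile from the motifs chosen so far, then extend greedily.
            motifs.append(most_probable(text, k, profile_with_pseudocounts(motifs)))
        if score(motifs) < score(best):
            best = motifs
    return best

toy_dna = ["GGCGTTCAGGCA", "AAGAATCAGTCA", "CAAGGAGTTCGC",
           "CACGTCAATCAC", "CAATAATATTCG"]
result = greedy_search(toy_dna, 3)
print(result)
```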

In [27]:
dna = "CGGCTGCTATAGGGATTGAAAACTGACAATCTACAAGGTAAGCTGCTGTGAGTTGGGAGGCCCCCATTCCCTGGCCCTTTCTTGAACATCCTGATTTATAAACTCGGTGTGTACACGACGATGCAACGGAGCATTTCGGAGGAGAGTTCATGACTAGCGAGCTCAGTACGAGCTTTGATACCCGGGGGCCATCCCCGCCTTCGGAATATCGCGGGCTTTAGGAAGGCAGGTTCAGTAACTCTCGTAATTGCCTGCAACGTCGATATGCGTCGAAAATGCTCTGTACGTAAACTTACATCTCTGTAAATATGAGCGCAGTTCGTGGGCCTAAATGAGTGTGAAAGTAGAACGCCCGGTTGACGTGCCAATGGGCAAGATAGACTCCTCTTCACAGGTCGCTGATGTTTCGCCGCGTAACGTCAGATTCAGTTAGAACACTGAGTCTGCAGGTTAACTCACCTAACACGTCGTTAATTCTGCGTCGGTGGGGATCATGTCGCCATAACAAAGATAATGCGTAAGTAATACATAATGTTCGGATAAGGGTTATTTTATCACCTCTCCGGTTTCGTTATCTTGGCTCATAACGTGCGTTCGCGCCGAGCGTAGAACCGCCGCTACAATTCTATTAAATCATAAACCTAAAATGCGCGCGACGATCCGCGCTGTGCGGATGGTGCCCAAGTAGTTCAAGAATACGACATACTCAGCGTGAAGCAAATCTACGGCTTCATCCGTCTGCAAGGGGGAGTCAGGTGTTCGCTAGCAACATAAACGTTATGAAGTAATGCCGTAGGAAGGAAGAAAGTGGGGGCGAACCAAAACAGGACATATAATACACTAACCCAAGCCCCGTCACTATTACGGAACAGCGAAATCACTCCATGTGTCCGGGAAGTTCTCCTTAAGCTCCCATGGTACACAATAGTTAGTCTAGATGGTCGTCGATCAACGTCGGTTTCCCTAGTCGCCACAGAGTCGATATACCCTGATCGGTCGG"

In [30]:
data = np.array([0.211, 0.282, 0.296, 0.31, 0.282, 0.296, 0.296, 0.239, 0.296, 0.31, 0.239, 0.225, 0.282, 0.282,
0.197, 0.197, 0.225, 0.211, 0.099, 0.155, 0.324, 0.183, 0.225, 0.282, 0.225, 0.268, 0.282, 0.282,
0.352, 0.254, 0.282, 0.239, 0.296, 0.296, 0.155, 0.366, 0.254, 0.225, 0.338, 0.127, 0.254, 0.197,
0.239, 0.268, 0.197, 0.239, 0.324, 0.254, 0.225, 0.211, 0.225, 0.183, 0.197, 0.38, 0.183,0.239])

In [32]:
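# data holds 56 = 4 × 14 values, so it is treated here as a 4 × 14 profile
# flattened row by row (rows A, C, G, T); this layout is an assumption.
profile = data.reshape(4, 14)
# Solve the Profile-most Probable k-mer Problem for the single string dna.
print(ProfileMostPr_kmer(dna, 14, profile))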