# 2. Which DNA Patterns Play The Role of Molecular Clocks? (Part 1)

```
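Motif finding problem

Motifs matrix : t DNA strings, each of length k.
Count matrix : e.g., for DNA strings of length 6, a 4 x 6 matrix holding the count of each nucleotide in each column.
Profile matrix : the count matrix divided by t (the number of DNA strings), so each entry is a frequency. Each column sums to 1.

Consensus string : the most popular nucleotide in each column of the motif matrix (ties broken arbitrarily).

Motif Score : for each column j, count the symbols in column j of Motifs that differ from the consensus symbol at position j, then sum these counts over all columns.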
```

In [1]:
motifs = ['AACGTA', 
         'CCCGTT', 
         'CACCTT', 
         'GGATTA', 
         'TTCCGG']

In [2]:
def CountMatrix(Motifs):
    t = len(Motifs)      # number of DNA strings
    k = len(Motifs[0])   # length of each string
    
    count = {'A':[0]*k, 'C':[0]*k, 'G':[0]*k,'T':[0]*k}

    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count[symbol][j] += 1
    
    return count

In [3]:
CountMatrix(motifs)

{'A': [1, 2, 1, 0, 0, 2],
 'C': [2, 1, 4, 2, 0, 0],
 'G': [1, 1, 0, 2, 1, 1],
 'T': [1, 1, 0, 1, 4, 2]}

In [4]:
def ProfileMatrix(Motifs):
    t = len(Motifs)
    k = len(Motifs[0])
    
    profile = {'A':[0]*k, 'C':[0]*k, 'G':[0]*k,'T':[0]*k}

    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            profile[symbol][j] += 1
            
    # Divide every count by t to turn counts into frequencies.
    for val in profile.values():
        val[:] = [x / t for x in val]
        
    return profile

In [5]:
profile = ProfileMatrix(motifs)

for k, v in profile.items():
    print(k, ':', v)

A : [0.2, 0.4, 0.2, 0.0, 0.0, 0.4]
C : [0.4, 0.2, 0.8, 0.4, 0.0, 0.0]
G : [0.2, 0.2, 0.0, 0.4, 0.2, 0.2]
T : [0.2, 0.2, 0.0, 0.2, 0.8, 0.4]
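
As a quick sanity check on the definition above (each column of a profile matrix sums to 1), the sketch below recomputes the profile for the same five motifs and verifies the column sums. The helper names `profile_matrix` and `column_sums` are illustrative and not part of the notebook; `math.isclose` is used because floating-point sums can differ from 1 by rounding error.

```python
import math

motifs = ['AACGTA', 'CCCGTT', 'CACCTT', 'GGATTA', 'TTCCGG']

def profile_matrix(motifs):
    """Column-wise nucleotide frequencies (hypothetical helper)."""
    t = len(motifs)
    k = len(motifs[0])
    counts = {symbol: [0] * k for symbol in "ACGT"}
    for motif in motifs:
        for j, symbol in enumerate(motif):
            counts[symbol][j] += 1
    # Divide counts by the number of strings to get frequencies.
    return {symbol: [c / t for c in row] for symbol, row in counts.items()}

def column_sums(profile):
    """Sum the four nucleotide frequencies in every column."""
    k = len(profile['A'])
    return [sum(profile[symbol][j] for symbol in "ACGT") for j in range(k)]

sums = column_sums(profile_matrix(motifs))
print(all(math.isclose(s, 1.0) for s in sums))  # True
```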


In [6]:
def ConsensusMatrix(Motifs):
    # Despite its name, this returns the consensus string, not a matrix.
    t = len(Motifs)      # number of DNA strings
    k = len(Motifs[0])   # length of each string
    
    count_matrix = {'A':[0]*k, 'C':[0]*k, 'G':[0]*k,'T':[0]*k}

    for i in range(t):
        for j in range(k):
            symbol = Motifs[i][j]
            count_matrix[symbol][j] += 1

    consensus = ""
    for j in range(k):
        m = 0
        frequentSymbol = ""
        
        # Strict '>' keeps the first symbol in "ACGT" order when counts tie.
        for symbol in "ACGT":
            if count_matrix[symbol][j] > m:
                m = count_matrix[symbol][j]
                frequentSymbol = symbol
                
        consensus += frequentSymbol
        
    return consensus

In [7]:
ConsensusMatrix(motifs)

'CACCTA'

In [8]:
def HammingDistance(p, q):
    # Hamming distance is only defined for strings of equal length.
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return sum(1 for a, b in zip(p, q) if a != b)

def Score(Motifs):
    consensus = ConsensusMatrix(Motifs)
    
    score = 0
    for motif in Motifs:
        score += HammingDistance(consensus, motif)
    
    return score

In [9]:
Score(motifs)

14
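
The score can also be read directly off the columns, without building the consensus string first: in each column, every symbol that is not the most frequent one counts as a mismatch. The sketch below is one way to do that, assuming all motifs have equal length; `score_by_columns` is a hypothetical name, not a function from the notebook.

```python
from collections import Counter

def score_by_columns(motifs):
    """Sum, over all columns, the number of non-majority symbols."""
    t = len(motifs)
    score = 0
    # zip(*motifs) yields one tuple of symbols per column.
    for column in zip(*motifs):
        most_common_count = Counter(column).most_common(1)[0][1]
        score += t - most_common_count
    return score

motifs = ['AACGTA', 'CCCGTT', 'CACCTT', 'GGATTA', 'TTCCGG']
print(score_by_columns(motifs))  # 14
```

Because only the size of the largest count matters, the result does not depend on how ties in the consensus string are broken.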

## Greedy motif search algorithm

```
profile = {
    A : [0.2, 0.4, 0.2, 0.0, 0.0, 0.4]
    C : [0.4, 0.2, 0.8, 0.4, 0.0, 0.0]
    G : [0.2, 0.2, 0.0, 0.4, 0.2, 0.2]
    T : [0.2, 0.2, 0.0, 0.2, 0.8, 0.4]
}

P( AACGTA, profile ) = 0.2 * 0.4 * 0.8 * 0.4 * 0.8 * 0.4 = 0.008192

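- Probability of A in column 1 = 0.2
- Probability of A in column 2 = 0.4
- Probability of C in column 3 = 0.8
- Probability of G in column 4 = 0.4
- Probability of T in column 5 = 0.8
- Probability of A in column 6 = 0.4

--> the probability of AACGTA under this profile, i.e. how likely it is to be a motif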
```
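
The product above is an instance of a general rule. Written out, with $\text{Text}_j$ denoting the $j$-th symbol of a k-mer (notation introduced here only for illustration), the probability of the k-mer under a profile is:

$$
\Pr(\text{Text} \mid \text{Profile}) = \prod_{j=1}^{k} \text{Profile}_{\text{Text}_j,\, j}
$$

For $\text{Text} = \texttt{AACGTA}$ this gives $0.2 \cdot 0.4 \cdot 0.8 \cdot 0.4 \cdot 0.8 \cdot 0.4 = 0.008192$.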

In [10]:
profile = {
    'A': [ 0.2 ,0.2 ,0.0 ,0.0 ,0.0 ,0.0, 0.9, 0.1, 0.1, 0.1, 0.3, 0.0],
    'C': [ 0.1 ,0.6 ,0.0 ,0.0 ,0.0 ,0.0 ,0.0 ,0.4 ,0.1, 0.2, 0.4, 0.6], 
    'G': [ 0.0 ,0.0 ,1.0 ,1.0 ,0.9 ,0.9 ,0.1 ,0.0, 0.0 ,0.0 ,0.0, 0.0], 
    'T': [ 0.7 ,0.2 ,0.0 ,0.0 ,0.1 ,0.1 ,0.0, 0.5 ,0.8, 0.7, 0.3, 0.4]
}

motif = 'ACGGGGATTACC'

In [11]:
def ProbMotif(Text, Profile):
    prob = 1
    for index, nucleotide in enumerate(Text):
        prob *= Profile[nucleotide][index]
    
    return prob

In [12]:
ProbMotif(motif, profile)

0.0008398080000000002
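
For the short k-mers used here the plain product works well. For much longer patterns, repeated multiplication of small numbers can underflow, and a common workaround is to sum logarithms instead. The sketch below illustrates that idea; `LogProbMotif` is a hypothetical helper that is not part of the notebook, and returning negative infinity for a zero entry is an assumption made for this example.

```python
import math

profile = {
    'A': [0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.9, 0.1, 0.1, 0.1, 0.3, 0.0],
    'C': [0.1, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.1, 0.2, 0.4, 0.6],
    'G': [0.0, 0.0, 1.0, 1.0, 0.9, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    'T': [0.7, 0.2, 0.0, 0.0, 0.1, 0.1, 0.0, 0.5, 0.8, 0.7, 0.3, 0.4]
}
motif = 'ACGGGGATTACC'

def LogProbMotif(Text, Profile):
    """Natural log of the k-mer probability (hypothetical helper)."""
    total = 0.0
    for index, nucleotide in enumerate(Text):
        p = Profile[nucleotide][index]
        if p == 0:
            # log(0) is undefined; treat an impossible k-mer as -infinity.
            return float('-inf')
        total += math.log(p)
    return total

log_p = LogProbMotif(motif, profile)
# Exponentiating recovers the plain product up to rounding error.
print(math.isclose(math.exp(log_p), 0.000839808))  # True
```

Comparing log-probabilities ranks k-mers in the same order as comparing the probabilities themselves, since the logarithm is monotonically increasing.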

### Most probable k-mer

In [13]:
def ProfileMostProbableKmer(text, k, profile):
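    # Start below any valid probability so that the first k-mer is kept
    # when every k-mer in the text has probability 0.
    p = -1
    result = text[0:k]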
    for i in range(len(text)-k+1):
        seq = text[i:i+k]
        pr = ProbMotif(seq,profile)
        if pr > p:
            p = pr
            result = seq
    return result

In [14]:
string = 'ACCTGTTTATTGCCTAAGTTCCGAACAAACCCAATATAGCCCGAGGGCCT'

profile = {
    'A': [0.2, 0.2, 0.3, 0.2, 0.3],
    'C': [0.4, 0.3, 0.1, 0.5, 0.1],
    'G': [0.3, 0.3, 0.5, 0.2, 0.4],
    'T': [0.1, 0.2, 0.1, 0.1, 0.2]
}
k = 5
best_word = ProfileMostProbableKmer(string, k, profile)
print(best_word)

CCGAG


In [15]:
def window(s, k):
    for i in range(1 + len(s) - k):
        yield s[i:i+k]
        
def GreedyMotifSearch(Dna, k, t):
    # Start from the first k-mer of each of the t DNA strings.
    bestMotifs = [string[:k] for string in Dna]
    bestScore = Score(bestMotifs)

    # Try every k-mer of the first DNA string as the seed motif.
    for seed in window(Dna[0], k):
        newMotifs = [seed]

        # For each remaining string, build a profile from the motifs chosen
        # so far and pick that string's profile-most probable k-mer.
        for j in range(1, t):
            profile = ProfileMatrix(newMotifs)
            pattern = ProfileMostProbableKmer(Dna[j], k, profile)
            newMotifs.append(pattern)

        currentScore = Score(newMotifs)
        if k == 3:
            # Debug trace, printed only for the small k = 3 example.
            print(currentScore, newMotifs)

        # '<=' means a later candidate wins when scores tie.
        if currentScore <= bestScore:
            bestScore = currentScore
            bestMotifs = newMotifs

    return bestMotifs

In [16]:
Dna = [
    "GGCGTTCAGGCA",
    "AAGAATCAGTCA",
    "CAAGGAGTTCGC",
    "CACGTCAATCAC",
    "CAATAATATTCG"
]

GreedyMotifSearch(Dna, 3, len(Dna))

7 ['GGC', 'AAG', 'AAG', 'CAC', 'CAA']
5 ['GCG', 'AAG', 'AAG', 'ACG', 'CAA']
4 ['CGT', 'AAG', 'AAG', 'AAT', 'AAT']
4 ['GTT', 'AAG', 'AAG', 'AAT', 'AAT']
6 ['TTC', 'AAG', 'AAG', 'ATC', 'TTC']
3 ['TCA', 'TCA', 'CAA', 'TCA', 'TAA']
2 ['CAG', 'CAG', 'CAA', 'CAA', 'CAA']
5 ['AGG', 'AAG', 'AAG', 'CAC', 'CAA']
7 ['GGC', 'AAG', 'AAG', 'CAC', 'CAA']
6 ['GCA', 'AAG', 'AAG', 'ACG', 'CAA']


['CAG', 'CAG', 'CAA', 'CAA', 'CAA']

### Finding regulatory motifs in genes of Mycobacterium tuberculosis (MTB), the bacterium that causes tuberculosis

In [17]:
Dna = ["GCGCCCCGCCCGGACAGCCATGCGCTAACCCTGGCTTCGATGGCGCCGGCTCAGTTAGGGCCGGAAGTCCCCAATGTGGCAGACCTTTCGCCCCTGGCGGACGAATGACCCCAGTGGCCGGGACTTCAGGCCCTATCGGAGGGCTCCGGCGCGGTGGTCGGATTTGTCTGTGGAGGTTACACCCCAATCGCAAGGATGCATTATGACCAGCGAGCTGAGCCTGGTCGCCACTGGAAAGGGGAGCAACATC", 
       "CCGATCGGCATCACTATCGGTCCTGCGGCCGCCCATAGCGCTATATCCGGCTGGTGAAATCAATTGACAACCTTCGACTTTGAGGTGGCCTACGGCGAGGACAAGCCAGGCAAGCCAGCTGCCTCAACGCGCGCCAGTACGGGTCCATCGACCCGCGGCCCACGGGTCAAACGACCCTAGTGTTCGCTACGACGTGGTCGTACCTTCGGCAGCAGATCAGCAATAGCACCCCGACTCGAGGAGGATCCCG", 
       "ACCGTCGATGTGCCCGGTCGCGCCGCGTCCACCTCGGTCATCGACCCCACGATGAGGACGCCATCGGCCGCGACCAAGCCCCGTGAAACTCTGACGGCGTGCTGGCCGGGCTGCGGCACCTGATCACCTTAGGGCACTTGGGCCACCACAACGGGCCGCCGGTCTCGACAGTGGCCACCACCACACAGGTGACTTCCGGCGGGACGTAAGTCCCTAACGCGTCGTTCCGCACGCGGTTAGCTTTGCTGCC", 
       "GGGTCAGGTATATTTATCGCACACTTGGGCACATGACACACAAGCGCCAGAATCCCGGACCGAACCGAGCACCGTGGGTGGGCAGCCTCCATACAGCGATGACCTGATCGATCATCGGCCAGGGCGCCGGGCTTCCAACCGTGGCCGTCTCAGTACCCAGCCTCATTGACCCTTCGACGCATCCACTGCGCGTAAGTCGGCTCAACCCTTTCAAACCGCTGGATTACCGACCGCAGAAAGGGGGCAGGAC", 
       "GTAGGTCAAACCGGGTGTACATACCCGCTCAATCGCCCAGCACTTCGGGCAGATCACCGGGTTTCCCCGGTATCACCAATACTGCCACCAAACACAGCAGGCGGGAAGGGGCGAAAGTCCCTTATCCGACAATAAAACTTCGCTTGTTCGACGCCCGGTTCACCCGATATGCACGGCGCCCAGCCATTCGTGACCGACGTCCCCAGCCCCAAGGCCGAACGACCCTAGGAGCCACGAGCAATTCACAGCG", 
       "CCGCTGGCGACGCTGTTCGCCGGCAGCGTGCGTGACGACTTCGAGCTGCCCGACTACACCTGGTGACCACCGCCGACGGGCACCTCTCCGCCAGGTAGGCACGGTTTGTCGCCGGCAATGTGACCTTTGGGCGCGGTCTTGAGGACCTTCGGCCCCACCCACGAGGCCGCCGCCGGCCGATCGTATGACGTGCAATGTACGCCATAGGGTGCGTGTTACGGCGATTACCTGAAGGCGGCGGTGGTCCGGA", 
       "GGCCAACTGCACCGCGCTCTTGATGACATCGGTGGTCACCATGGTGTCCGGCATGATCAACCTCCGCTGTTCGATATCACCCCGATCTTTCTGAACGGCGGTTGGCAGACAACAGGGTCAATGGTCCCCAAGTGGATCACCGACGGGCGCGGACAAATGGCCCGCGCTTCGGGGACTTCTGTCCCTAGCCCTGGCCACGATGGGCTGGTCGGATCAAAGGCATCCGTTTCCATCGATTAGGAGGCATCAA", 
       "GTACATGTCCAGAGCGAGCCTCAGCTTCTGCGCAGCGACGGAAACTGCCACACTCAAAGCCTACTGGGCGCACGTGTGGCAACGAGTCGATCCACACGAAATGCCGCCGTTGGGCCGCGGACTAGCCGAATTTTCCGGGTGGTGACACAGCCCACATTTGGCATGGGACTTTCGGCCCTGTCCGCGTCCGTGTCGGCCAGACAAGCTTTGGGCATTGGCCACAATCGGGCCACAATCGAAAGCCGAGCAG", 
       "GGCAGCTGTCGGCAACTGTAAGCCATTTCTGGGACTTTGCTGTGAAAAGCTGGGCGATGGTTGTGGACCTGGACGAGCCACCCGTGCGATAGGTGAGATTCATTCTCGCCCTGACGGGTTGCGTCTGTCATCGGTCGATAAGGACTAACGGCCCTCAGGTGGGGACCAACGCCCCTGGGAGATAGCGGTCCCCGCCAGTAACGTACCGCTGAACCGACGGGATGTATCCGCCCCAGCGAAGGAGACGGCG", 
       "TCAGCACCATGACCGCCTGGCCACCAATCGCCCGTAACAAGCGGGACGTCCGCGACGACGCGTGCGCTAGCGCCGTGGCGGTGACAACGACCAGATATGGTCCGAGCACGCGGGCGAACCTCGTGTTCTGGCCTCGGCCAGTTGTGTAGAGCTCATCGCTGTCATCGAGCGATATCCGACCACTGATCCAAGTCGGGGGCTCTGGGGACCGAAGTCCCCGGGCTCGGAGCTATCGGACCTCACGATCACC"]

In [18]:
k = 15
t = len(Dna)

motifs = GreedyMotifSearch(Dna, k, t)
print(motifs)
print(Score(motifs))

['CCAATCGCAAGGATG', 'CCGATCGGCATCACT', 'ACCGTCGATGTGCCC', 'GGGTCAGGTATATTT', 'GTAGGTCAAACCGGG', 'CTGTTCGCCGGCAGC', 'CTGTTCGATATCACC', 'CGCGTCCGTGTCGGC', 'CTGGGAGATAGCGGT', 'CTCATCGCTGTCATC']
64
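
Four of the ten motifs above ('CCGATCGGCATCACT', 'ACCGTCGATGTGCCC', 'GGGTCAGGTATATTT', 'GTAGGTCAAACCGGG') are simply the first 15-mer of their strings. A likely explanation is that a profile built from only a few k-mers contains many zeros, so every k-mer in the next string gets probability 0 and the search falls back to the first window. The sketch below reproduces this effect on the small k = 3 example from earlier; `profile_from` and `prob_motif` are illustrative helpers, not functions from the notebook.

```python
def profile_from(motifs):
    """Frequency profile of a list of equal-length k-mers."""
    t = len(motifs)
    k = len(motifs[0])
    profile = {symbol: [0.0] * k for symbol in "ACGT"}
    for motif in motifs:
        for j, symbol in enumerate(motif):
            profile[symbol][j] += 1
    return {symbol: [c / t for c in row] for symbol, row in profile.items()}

def prob_motif(text, profile):
    """Product of the profile entries selected by each symbol of text."""
    prob = 1.0
    for j, nucleotide in enumerate(text):
        prob *= profile[nucleotide][j]
    return prob

k = 3
seed = 'GGC'              # first 3-mer of the first string in the small example
text = 'AAGAATCAGTCA'     # second string in the small example
profile = profile_from([seed])

windows = [text[i:i + k] for i in range(len(text) - k + 1)]
probs = [prob_motif(w, profile) for w in windows]

print(all(p == 0 for p in probs))                          # True
# max() returns the first maximal item, so a full tie yields the first window.
print(max(windows, key=lambda w: prob_motif(w, profile)))  # AAG
```

This matches the first line of the k = 3 trace, where the seed 'GGC' is followed by 'AAG'.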
