How do you find the shared 15-mer regulatory motif in the strings below?
```
1 "atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg"
2 "acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga"
3 "tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga"
4 "gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga"
5 "tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag"
6 "gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa"
7 "cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat"
8 "aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta"
9 "ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag"
10 "ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga"
```

Since the motif occurs in these strings without mutations, we can concatenate the strings and reuse the FrequentWords algorithm we made earlier.

In [128]:
def frequency_table(text, k):
    """
    Returns a frequency table of k-mers in a text.

    Args:
        text (str): The text.
        k (int): the length of the k-mers.

    Returns:
        freq_map (dict): The map of each k-mer to its frequency.
    """
    freq_map = {}
    n = len(text)
    # indexing excludes end of range so must add +1 to n-k
    for i in range(0, n-k+1):
        pattern = text[i:i+k]
        if pattern not in freq_map:
            freq_map[pattern] = 1
        else:
            freq_map[pattern] += 1
    return freq_map


def maxmap(input_map):
    """
    Returns the maximum value (not key) in a map.

    Args:
        input_map (dict): the map (type(value) == int)

    Returns:
        max_value (int): the maximum value.
    """
    max_value = max(input_map.values())
    return max_value

    
def frequent_words(text, k):
    """
    Returns the most frequent k-mers from a text.

    Args:
        text (str): the text.
        k (int): the length of the k-mers.

    Returns:
        frequent_patterns (list): the most frequent k-mers.
    """
    frequent_patterns = []
    freq_map = frequency_table(text, k)
    max_num = maxmap(freq_map)
    for pattern, freq in freq_map.items():
        if freq == max_num:
            frequent_patterns.append(pattern)
    return frequent_patterns

In [129]:
text = "atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaagggggggatgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttataggtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga"
frequent_words(text, 15)

['aaaaaaaaggggggg']

But when the regulatory motif has mutations, even the Frequent Words with Mismatches algorithm becomes too slow. Also, unlike a DnaA box, a regulatory motif only has to appear at least once in each of the genome regions associated with the regulatory gene. A brute-force algorithm is a reasonable first attempt at this problem.
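One way to get a feel for the cost of allowing mismatches is to count how many k-mers lie within d mismatches of a single k-mer. The sketch below is only an illustration of that count; `neighborhood_size` is a hypothetical helper introduced here, not part of the notebook.

```python
from math import comb

def neighborhood_size(k, d):
    """Number of k-mers within Hamming distance d of one fixed k-mer (alphabet ACGT)."""
    # pick i mismatch positions; each can hold any of the 3 other nucleotides
    return sum(comb(k, i) * 3**i for i in range(d + 1))

print(neighborhood_size(15, 4))  # 123841
```

Every 15-mer in the input would contribute a neighborhood of that size when d = 4, which gives a rough sense of why the search space grows so quickly.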

Given a collection of strings Dna and an integer d, a k-mer is a (k,d)-motif if it appears in every string from Dna with at most d mismatches. For example, the implanted 15-mer in the strings above represents a (15,4)-motif.

**Implanted Motif Problem**: Find all (k, d)-motifs in a collection of strings.

    **Input**: A collection of strings Dna, and integers k and d.
    **Output**: All (k, d)-motifs in Dna.
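The definition can be checked directly for a single candidate k-mer. This is a minimal sketch, assuming uppercase DNA strings; the helper names (`appears_with_mismatches`, `is_kd_motif`) and the toy data are introduced here for illustration only.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))


def appears_with_mismatches(pattern, text, d):
    """True if some k-mer of text is within Hamming distance d of pattern."""
    k = len(pattern)
    return any(
        hamming_distance(pattern, text[i:i + k]) <= d
        for i in range(len(text) - k + 1)
    )


def is_kd_motif(pattern, dna, d):
    """True if pattern appears in every string of dna with at most d mismatches."""
    return all(appears_with_mismatches(pattern, text, d) for text in dna)


toy_dna = ["ATTTGGC", "TGCCTTA", "CGGTATC", "GAAAATT"]
print(is_kd_motif("ATA", toy_dna, 1))  # True
print(is_kd_motif("ATA", toy_dna, 0))  # False
```

Note that "ATA" is a (3,1)-motif here even though it never occurs exactly in the first string: a motif does not have to be a substring of the input.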


In [130]:
def hamming(s1, s2):
    """
    Find Hamming distance between two strings
    """
    dist = 0
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            dist += 1
    return dist

    
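def neighbors(pattern, d):
    """
    Returns the d-neighborhood of a pattern: all strings of the same
    length within Hamming distance d of it.

    Note: generated nucleotides are uppercase, so pattern should be too.
    """
    # handle edge cases first
    if d == 0:
        return [pattern] # must return a list, since callers iterate over the result
    if len(pattern) == 1:
        return ["A", "T", "G", "C"]
    neighborhood = []
    suffix_neighbors = neighbors(pattern[1:], d)
    for text in suffix_neighbors:
        if hamming(pattern[1:], text) < d:
            # the suffix has a mismatch to spare: any first nucleotide works
            for nt in ["A", "T", "G", "C"]:
                neighborhood.append(nt+text)
        else:
            # the suffix already uses all d mismatches: keep the original first nucleotide
            neighborhood.append(pattern[0]+text)
    return neighborhood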

In [131]:
def kmer_in_all_dna(kmer, kmers_dict):
    for array in kmers_dict.values():
        if kmer not in array:
            return False
    return True
    
def motif_enumeration(dna, k, d):
    patterns = set()
    kmers_dict = {}
    all_kmers = set()

    for pattern in dna:
        kmers = {pattern[i:i+k] for i in range(len(pattern)-k+1)}
        kmers_dict[pattern] = set()
        for kmer in kmers:
            kmer_neighbors = neighbors(kmer, d)
            for i in kmer_neighbors:
                kmers_dict[pattern].add(i) # d-neighbors of the k-mer
                all_kmers.add(i)
            kmers_dict[pattern].add(kmer) # the k-mer itself
            all_kmers.add(kmer)
                        
    for kmer in all_kmers:
        if kmer_in_all_dna(kmer, kmers_dict):
            patterns.add(kmer)
    
    return patterns

In [132]:
dna = ["AGTGCATCTGATTAGTCGTTGGATG", "GAGTCTCCAGAGTTGCATATGGAAA", "ATCTAAGTATGCACTTCATCAGGCC", "TCCGAAGTTTTGACGCCCGGAATTA", "CACCATTTTGAGTTCTTGCAACCTA", "CAACCTAGGCACCATGGCAAAGTTG", "TAGTAGGGCTTGTATCCCAGAGTGG", "AGTACTTCGCCCTTATCCTACAAGG", "TCCGGGTCTGGTGATCTTCTAGTTC", "AGTTATTGACCTCCAGGCGAATCAC"]
k = 5
d = 2
" ".join(motif_enumeration(dna, k, d))

# aside: converting a list to a set drops duplicate entries
samples = ["brown", "brown", "slushy", "hard"]
set(samples)

{'brown', 'hard', 'slushy'}

In [133]:
dna = ["TCGGGGGTTTTT",
    "CCGGTGACTTAC",
    "ACGGGGATTTTC",
    "TTGGGGACTTTT",
    "AAGGGGACTTCC",
    "TTGGGGACTTCC",
    "TCGGGGATTCAT",
    "TCGGGGATTCCT",
    "TAGGGGAACTAC",
    "TCGGGTATAACC"]

In [134]:
def score(dna):
    score_dicts = []
    scores = []
    for i in range(len(dna[0])):
        nt_counts = {"A":0, "T":0, "G":0, "C":0}
        for seq in dna:
            nt_counts[seq[i]] += 1
        score_dicts.append(nt_counts)
    for d in score_dicts:
        print(d)
        dict_score = 0
        for key, value in d.items():
            if key != max(d, key=d.get):
                dict_score += value
        scores.append(dict_score)
    return scores

In [135]:
score(dna)

{'A': 2, 'T': 7, 'G': 0, 'C': 1}
{'A': 2, 'T': 2, 'G': 0, 'C': 6}
{'A': 0, 'T': 0, 'G': 10, 'C': 0}
{'A': 0, 'T': 0, 'G': 10, 'C': 0}
{'A': 0, 'T': 1, 'G': 9, 'C': 0}
{'A': 0, 'T': 1, 'G': 9, 'C': 0}
{'A': 9, 'T': 0, 'G': 1, 'C': 0}
{'A': 1, 'T': 5, 'G': 0, 'C': 4}
{'A': 1, 'T': 8, 'G': 0, 'C': 1}
{'A': 1, 'T': 7, 'G': 0, 'C': 2}
{'A': 3, 'T': 3, 'G': 0, 'C': 4}
{'A': 0, 'T': 4, 'G': 0, 'C': 6}


[3, 4, 0, 0, 1, 1, 1, 5, 2, 3, 6, 4]
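The per-column scores above are simply "number of motifs minus the count of the most common nucleotide". As a cross-check, here is a compact sketch of the same calculation using `collections.Counter`; `column_scores` is a hypothetical name, and the motif matrix is copied from the cell above.

```python
from collections import Counter

def column_scores(motifs):
    """For each column, count the nucleotides that differ from the most common one."""
    scores = []
    for column in zip(*motifs):  # zip(*...) iterates over columns
        counts = Counter(column)
        scores.append(len(column) - max(counts.values()))
    return scores

motifs = ["TCGGGGGTTTTT", "CCGGTGACTTAC", "ACGGGGATTTTC", "TTGGGGACTTTT",
          "AAGGGGACTTCC", "TTGGGGACTTCC", "TCGGGGATTCAT", "TCGGGGATTCCT",
          "TAGGGGAACTAC", "TCGGGTATAACC"]

print(column_scores(motifs))       # [3, 4, 0, 0, 1, 1, 1, 5, 2, 3, 6, 4]
print(sum(column_scores(motifs)))  # 30
```

Summing the column scores gives the total score of the motif matrix, which is the quantity the greedy search later tries to minimize.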

In [136]:
def count(dna):
    """
    Returns, for each column of the motif matrix, the fraction of each nucleotide.
    """
    score_dicts = []
    for i in range(len(dna[0])):
        nt_counts = {"A":0, "T":0, "G":0, "C":0}
        for seq in dna:
            nt_counts[seq[i]] += 1
        # divide once after counting; accumulating 1/len(dna) in the loop above would add floating-point rounding error
        for k, v in nt_counts.items():
            nt_counts[k] = v/len(dna)
        score_dicts.append(nt_counts)
    return score_dicts

In [137]:
count(dna)

[{'A': 0.2, 'T': 0.7, 'G': 0.0, 'C': 0.1},
 {'A': 0.2, 'T': 0.2, 'G': 0.0, 'C': 0.6},
 {'A': 0.0, 'T': 0.0, 'G': 1.0, 'C': 0.0},
 {'A': 0.0, 'T': 0.0, 'G': 1.0, 'C': 0.0},
 {'A': 0.0, 'T': 0.1, 'G': 0.9, 'C': 0.0},
 {'A': 0.0, 'T': 0.1, 'G': 0.9, 'C': 0.0},
 {'A': 0.9, 'T': 0.0, 'G': 0.1, 'C': 0.0},
 {'A': 0.1, 'T': 0.5, 'G': 0.0, 'C': 0.4},
 {'A': 0.1, 'T': 0.8, 'G': 0.0, 'C': 0.1},
 {'A': 0.1, 'T': 0.7, 'G': 0.0, 'C': 0.2},
 {'A': 0.3, 'T': 0.3, 'G': 0.0, 'C': 0.4},
 {'A': 0.0, 'T': 0.4, 'G': 0.0, 'C': 0.6}]

In [138]:
import math
def safe_log2(value):
    """
    Returns log2(value), or 0 when the logarithm is undefined (value <= 0),
    so that a term 0*log2(0) contributes 0 to the entropy.
    """
    try:
        return math.log2(value)
    except ValueError:
        return 0
        
def entropy(d):
    """
    Calculate the entropy of a column, given a dictionary with the probability of each nucleotide
    """
    plogp = [value*safe_log2(value) for key, value in d.items()]
    ent = -1*sum(plogp)
    return ent

def total_entropy(dna):
    count_matrix = count(dna)
    entropies = [entropy(d) for d in count_matrix]
    return sum(entropies)

In [139]:
entropy({'A': 0.2, 'T': 0.2, 'G': 0.0, 'C': 0.6})

1.3709505944546687

In [140]:
total_entropy(dna)

9.916290005356972

In [141]:
# adapted from code posted in the comment section
def hamming_dist(p, q):
    """
    Find Hamming distance between two strings of equal length
    """
    if len(p) != len(q):
        raise ValueError("Strings must have equal length.")
    dist = 0
    for i in range(len(p)):
        if p[i] != q[i]:
            dist += 1
    return dist
    
def closest_match(dna, pattern):
    """
    Given a dna fragment and a pattern with length=k, 
    find the k-mer in the dna fragment that has the 
    smallest Hamming distance from the pattern.
    If several k-mers tie, the first one is returned.
    """
    distances = {}
    for i in range(len(dna)-len(pattern)+1):
        kmer = dna[i:i+len(pattern)]
        distances[kmer] = hamming_dist(kmer, pattern)
    min_val = min(distances, key=distances.get)
    return min_val

def median_string(dnas, pattern):
    """
    Within each dna in dnas, find the k-mer that is the closest match to the given pattern.
    Despite its name, this returns one closest k-mer per string, not a median string.
    An exact occurrence of the pattern (e.g. "AAA"), if present, is the closest match.
    """
    kmers = []
    for d in dnas:
        min_val = closest_match(d, pattern)
        kmers.append(min_val)
    return kmers

print(*median_string(["TTACCTTAAC","GATATCTGTC","ACGGCGTTCG","CCCTAAAGAG","CGTCAGAGGT"],"AAA"))


TAA ATA ACG AAA AGA


```
DistanceBetweenPatternAndStrings(Pattern, Dna)
    k ← |Pattern|
    distance ← 0
    for each string Text in Dna
        HammingDistance ← ∞
        for each k-mer Pattern’ in Text
            if HammingDistance > HammingDistance(Pattern, Pattern’)
                HammingDistance ← HammingDistance(Pattern, Pattern’)
        distance ← distance + HammingDistance
    return distance
```

In [142]:
# my own implementation
def DistanceBetweenPatternAndStrings(pattern, dnas):
    k = len(pattern)
    distance = 0
    for dna in dnas:
        min_dist = len(dna) # upper bound; stands in for the pseudocode's infinity
        for i in range(len(dna)-k+1):
            kmer = dna[i:i+k]
            current_hamming = hamming(pattern, kmer)
            if min_dist > current_hamming:
                min_dist = current_hamming
        distance += min_dist
    return distance

In [143]:
pattern = "TTACG"
dnas = ["GCCCGTACGCGCCGGGCATTAATTGAGGGGCCAAAACCTTGACGAGGGAAAAGTGCCTACATAACTAGCCGTAACATCTGTCCGACTGCGAAACAAGGGAAGAGC",
"AAAGTACGTTATATATGCTCCCGCCACATACGGCGGGCTAATTACAATGCAAATGCCGCTTAATGATTGCAATGGTGATAACAGTATGGCAATGTTTAAGGGACG",
"GCGCGATACCACCGTGAGGCGGAGATGAGTGGCGTTCGTCCGTAACTAGTGAGCATGTTACCACGGGTCCAATAGATCCCATAGGACCCTCAATTATGGTGGAGA",
"CTTCTGTGCACGGATATTTTACCCGGGAGAGATGCGGCTCTGGTTGCGACAATAACCGGGGTGGGTTTAAGTTGGAACTGGTGACGATAGTTAATAGACCATCGT", 
"TCAAATAGCTCCTGGATCAGGATTGATTAGCGTATGACGGGTCACCGCGCTCATGCATTCAGCGTATAGAAGTCCCCTCCGACGCAAACCCGCGCCGAGACGCTC",
"AATCGAATGGGGGCGGTTTTGCACCAGTTCGATAGGTAGTGCCCCGAATACACACGTCGTGCTCATAACATTCGCTGCCTTTTGGGCTGGTAATTTTGGAAATCT",
"CGGGAGGTATCCACTGAATCCGACCAGCCCTGGGTTTTGGTTCACCCGCTTCTCGTCTTGGAGTACACGTTTGCTGCGCGCGATTGTCATTTTGGATATGTCAAC",
"CTGAGACCAATCGGCGCATATTAGTTGACCGATGTACCAAACTGATAACACACCCAGCTTCCTTCATCCAATCATCGGCTAGTGAGCTAGAGCCGGCCCGGCCCT",
"CAAGCAGCATCCCCTCAAGAGCCCATTTTAAAGCGCGAGCCATTTAGGGCCGTTTTTCCTGTTTCCTCTCTCGTAAGCAGGTTGCTCCTACCTACAACAGACCAA",
"GTAGAAGTGATGACTGTCGGCTGTTGTTGTGACTGGTCTCAACCGTGTGGCGTCAAGATAGTGCGGTATCCTCTTCGATTATCAGATTTTCTGGGAGAGATTTAC",
"CGCCACATGATGAATATAACATCTGTAGGGGTGTGTCGACAATGGCCTTGATGTCTCTTTCCCCGACGTCTGTGAAACAAAATTTTGTCGGCAGGTGGGCCCACT",
"GACAAGTATTCACGCTCGTTTATTTGGGTTGAAGATTTAGCCTATCGACAACAGCCGTGGGGGGCGGTGGGCGGCAGAACTAACTCTATACTCGGATACCTTATA",
"AAGGACACTGGTACGATCCTAGAAGGGGCAATGAGAATATGGACACTGGTCTAGAATGGATCCGGTAAGCTCCTCATAAGCCACTATCGGGATGGCCGGATTCTC",
"GCTGCAGGCCGCTTTTCTCCGTTTAGGCCCCACCTGACGGGGTAACGAGTTTAGCGAGCCCTCTTTGAAAGGGGGAACGCATTTGATGGTTGCAGTAGACATAGT",
"CAGATTTACATAACATCTAATTATATAGGCACTACCTACATGTGGTGATGTGCACGTGAACACGGAGTGGGCGCATTCCGCGTTATGTATGCCAGAGACCTTTCA",
"CAATGGGTCTTTTATAACAAGGTTATACAATATTAGAGAATACCAAACGACGGCCCGGTATACCTGAGTTGCTCCAAACCGACTGATTCTGGTTGAACAAAGGTT",
"TACTACCATGAGTTTATCGTCAGAGTATTCGCACTCGTAGACGAGATTCTGAAGCGCATTAACTGCTCCTTGGGCTGGGTTCCAAGGTCGGTTAGTCTTAGGGAT",
"TGTAAGCGCCGCCTACAGCGTTACTCCTCTACATCAATCATAGTTGGTAGAAAGTCAGTAGGTACCCAACGACACAATGTTTATAGATCCTAATGTGAGTGGCTC",
"CCTTACGTTGAGGTATGCTTACTTCGGCAAATGAACCGTTCGAACAAACCGGGTCAAATAGCTAGAGAGACTCAGAGTGGAGGCCATAATATATGGAGAGACATG",
"GGGTAATTACTCTAATATTCTCAAATCACGGCAATACGGAGATAAACCTCTTTACCGACAGAATCTTAGGAAGCACATAATAACTAGGTAGGTAACATTATCAAC",
"CCCATTCAGGTACCATGAAAAAAATTTATTTTCATGGTTGCCAAAATCCACGTTACACATGCCGGCAGCTTGTGGGCAACTGAGCAAGTCCGAGGACGTGAAGCT",
"TTCATTACTGTGCATCAGGCACTGTACGTCTACCCAACTTAAGCTTATGATAAACGCTTCAGACCTTGGGTATTACCTCTGCTTGCTGACCTGGATAGCGCAAAT",
"TAGTGCAGCTTGCAAGTAGAAAACTATATTCGGGGTCCGCTCACTATGAGGGTATACGATCCTACTGGGAGGCCTCTATGAAACACACCGTTAGGGGAGCCCATA",
"CAGCCCATACATAGAAGAGGACGCACTCTACGTGTTGCATCTAGTAGTCAGCACTAATTACAACTCTATCCATAGTAGCAAAGGAATAAACGACAAAAGGTCTAA",
"GTATGCTTCGTCTGTCTCACATAAACAGCGGATACTGTAAGTAACAAACAGATTGACCATAGCGTCAGGTCCGCTAGCACCCCCTGGCATTGCGGGACGAGTCTC",
"CAGTCTTTAACCAGGACCAACAATCCTCGTAAACCTGTGGCTCTCTCCAGGTCTGTTTATTTATTGCATCAGCGTCTCGCACCTTGTGCACAAACGACGTCAATG",
"GCAATGTAGATTTAACATTGTCGACAGTTACCCAGAGCCTGAGACACCTTTCGTCTTTCTAGGCCACGACTTCGCCAGGTTAGATAAGTGAGCCCGTGTGGAAGG",
"CGCATAACAGACAAATGTAACTGTGTACGTCTAAGGCGCCCTTTTTGTAGTCATCCCACCGCCACTATGGCTTGCTCCGTTCTAGCTGTGGTAGACGGCGCCCCA",
"CACCGAGCATGAATTATCAGTTCTTTTGACTGGGAGGACTGATTGGTGGATAACCTGAACGGTGGATTACGGGTAACTCACTTAGCATGACCCCACTAGGTTATA",
"CAGGTGTCCGTTTGTCAATACTCTCCAGAATGTAGGAAGTTCGTACTTACTCAGGCTCTTCCGGTTGCCTCAACCCCTCTAAATTGTCACGACGTTTGTCGAGTA",
"AGTAACCCGAAAGCCTATATGTTGAATTCCTGGTGTTGCAAGTATATGCCAGAGCAGATTTTGGAAGATCCCGCCAGGCGCGGCTTGTTAAACTATTTACTATTC",
"CGATCGGGGTAAAGCAGCTTCAAGCCCTCCGTTAAGGCGCAGTCTACTGCATCTTCGTCAGTAATCTTAAGGCAGCGCATCGTTTGAGTACTCGTGGATTAACGG",
"CCTGAAGCCCGACGAGAAATGGGAACAACTTTTAGCCGTTCGAACGTTCTGCATCGGCGGAAGCTTGATACGTATCTTGCGCATTTGGCGACCAGACAAGCTAAG",
"TTGGCAACGAACCTGCAGATTTCTAACGGAACGTTCTTCCGCGCCATTGAAGTATGACTCGCCTACCAGTTTCGGAATGTCGTAGCATTACCTACTTTACTGGTC"]
        

DistanceBetweenPatternAndStrings(pattern, dnas)

39

```
MedianString(Dna, k)
    distance ← ∞
    Patterns ← AllStrings(k)
    for i ← 0 to |Patterns|
        Pattern ← Patterns[i]
        if distance > DistanceBetweenPatternAndStrings(Pattern, Dna)
            distance ← DistanceBetweenPatternAndStrings(Pattern, Dna)
            Median ← Pattern
    return Median
```

In [144]:
def medianString(dnas, k):
    distance = float('inf')
    # with d = k, neighbors() enumerates all 4**k possible k-mers
    patterns = neighbors("A" * k, k)
    for pattern in patterns:
        current_distance = DistanceBetweenPatternAndStrings(pattern, dnas)
        if distance > current_distance:
            distance = current_distance
            median = pattern
    return median

In [145]:
k=6

dnas = [
    "TATGCCATCGGCCCAAGAGCAGGTGTGCGGCTGAATGAGTCA",
    "GATAGTTGTTAATGACTACTCTTTATCGGGACCCGTCATCCG",
    "CTGGGCCGTAGTAAACGTGGCGCCATCGGTATAGCACGTAAT",
    "CGTACCGCCTGCATCTTTATCGGAACTCCGATCCAACCCATC",
    "ATCGGTGATCAGCTTTGCCCTCCCGGCTGGGAGTACCTACAA",
    "AGCAAATTAGGCGATGGCACGACGCTCAGCCCTACGATCGGC",
    "GGCGGATCGTCCGGTCGGAGCCTGATCGGCCTTGTTTCACAC",
    "AGTGCCTAACCTGTCGTGATCGGAAAACCGAACATGGCAGGT",
    "TTGAGTCCGGAAAATCCACCAGGAAATATTAATGACATCGGT",
    "TCAAGGTCTCATTCGCTAGACCCAATCGGCCGACATTCGCGG"]
medianString(dnas, k)

'ATCGGC'

In [149]:
k=7
dnas = ["CTCGATGAGTAGGAAAGTAGTTTCACTGGGCGAACCACCCCGGCGCTAATCCTAGTGCCC", 
        "GCAATCCTACCCGAGGCCACATATCAGTAGGAACTAGAACCACCACGGGTGGCTAGTTTC",
        "GGTGTTGAACCACGGGGTTAGTTTCATCTATTGTAGGAATCGGCTTCAAATCCTACACAG"]

def medianString_all(dnas, k):
    medians = []
    distance = float('inf')
    # with d = k, neighbors() enumerates all 4**k possible k-mers
    patterns = neighbors("A" * k, k)
    for pattern in patterns:
        current_distance = DistanceBetweenPatternAndStrings(pattern, dnas)
        if distance > current_distance:
            # strictly better: discard the previous, worse candidates
            distance = current_distance
            medians = [pattern]
        elif distance == current_distance:
            medians.append(pattern)
            
    return medians
    
medianString_all(dnas, k)

['GTAGGAA', 'AATCCTA', 'GAACCAC', 'TAGTTTC']

In [None]:
from itertools import product

def hamming_distance(pattern1, pattern2):
    """Calculate the Hamming distance between two strings."""
    return sum(c1 != c2 for c1, c2 in zip(pattern1, pattern2))

def median_string(dna, k):
    """Find the median string with the smallest total Hamming distance to all strings in the given DNA sequences."""
    distance = float('inf')
    median = None
    
    # Generate all possible k-mers
    for pattern in [''.join(p) for p in product('ACGT', repeat=k)]:
        # Calculate the total distance between the current pattern and all strings in DNA
        total_distance = sum(min(hamming_distance(pattern, dna_str[j:j+k]) for j in range(len(dna_str) - k + 1)) for dna_str in dna)
        
        # Update the median and distance if the total distance is smaller
        if total_distance < distance:
            distance = total_distance
            median = pattern
    
    return median

In [None]:
median_string(dnas, k)

In [27]:
def most_probable_kmer(text, k, profile):
    max_prob = -1
    for i in range(len(text)-k+1):
        kmer = text[i:i+k]
        prob = 1
        for j in range(k):
            if kmer[j] == "A":
                prob *= profile[0][j]
            elif kmer[j] == "C":
                prob *= profile[1][j]
            elif kmer[j] == "G":
                prob *= profile[2][j]
            elif kmer[j] == "T":
                prob *= profile[3][j]
        if prob > max_prob:
            max_prob = prob
            max_kmer = kmer
    return max_kmer

In [30]:
with open("mostprobable.txt", "r") as f:
    lines = f.read().split("\n") # named lines to avoid shadowing the built-in input()


text = lines[0]
k = int(lines[1])
profile = [i.split(" ") for i in lines[2:]]
profile = [[float(i) for i in line] for line in profile]
print(profile)

[[0.179, 0.143, 0.107, 0.429, 0.214, 0.107, 0.214], [0.143, 0.321, 0.179, 0.107, 0.286, 0.25, 0.214], [0.25, 0.321, 0.571, 0.143, 0.25, 0.357, 0.321], [0.429, 0.214, 0.143, 0.321, 0.25, 0.286, 0.25]]


In [29]:
most_probable_kmer(text, k, profile)

'TCGTGGT'

Implement **GreedyMotifSearch**.

    Input: Integers k and t, followed by a space-separated collection of strings Dna.
    Output: A collection of strings BestMotifs resulting from applying GreedyMotifSearch(Dna, k, t). If at any step you find more than one Profile-most probable k-mer in a given string, use the one occurring first.


 Sample Input:

3 5
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG

Sample Output:

CAG CAG CAA CAA CAA
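The tie-breaking rule in the problem statement can be handled with a strict `>` comparison. The following is a minimal sketch rather than the solution used below: the function name and the dict-based profile layout are assumptions made for this example. It also shows what happens when every k-mer has probability 0, which is common for a profile built from very few motifs.

```python
def profile_most_probable(text, k, profile):
    """Return the first k-mer of text with the highest probability under profile.

    profile is assumed to map each nucleotide to a list of k column probabilities.
    """
    best_kmer, best_prob = None, -1.0
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        prob = 1.0
        for j, nt in enumerate(kmer):
            prob *= profile[nt][j]
        if prob > best_prob:  # strict comparison: a later tie never replaces an earlier k-mer
            best_kmer, best_prob = kmer, prob
    return best_kmer

# profile built from the single motif "GGC", without pseudocounts
single_motif_profile = {
    "A": [0.0, 0.0, 0.0],
    "C": [0.0, 0.0, 1.0],
    "G": [1.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0],
}

# no 3-mer of this string matches "GGC", so all probabilities are 0 and the first 3-mer wins
print(profile_most_probable("AAGAATCAGTCA", 3, single_motif_profile))  # AAG
```

Starting `best_prob` below zero guarantees that some k-mer is always returned, even when all probabilities are 0.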

In [125]:
k = 3
t = 5
dnas = ["GGCGTTCAGGCA", "AAGAATCAGTCA", "CAAGGAGTTCGC", "CACGTCAATCAC", "CAATAATATTCG"]

with open("rosalind_ba2d.txt", "r") as f:
    lines = f.read().rstrip("\n").split("\n") # named lines to avoid shadowing the built-in input()

k = int(lines[0].split(" ")[0])
t = int(lines[0].split(" ")[1])
dnas = lines[1:]
print(k)
print(t)
print(dnas)

12
25
['GCCAGAGCCTACTGTTCGGTGCTGTGACAGCAGATATTATACGCCTGGACAGATATCGCACCTTCCACCGTACTCGTTATATACTCCTACCGCAGGTGCCTCGCCTAATCTCTTGGGATTGTCCCTGAACTCGCGGTCGAATAAGAACCCAGGGAG', 'CATAGTCCTTAGCTCGTCATAACGTCACGGTTCTCGCATTAGCCTGGAACTTCCTCCGTATATAGGCCATCAGAGCCCTGGTACCCTGCAAGTAAGTCCTGGCAGGCGCTCAACAGGACTTGGTCCGCTGTAGGCCTGATAAAATTTCAATAAGAC', 'ATCGGGTGGCGCTGCCCTTATGGATCTTCACCGGTAGTATCTTATGCCACAACCGCATGACTTCTTAGTGTTACGGGCCTAATACACCGTTCAGCGTTTTATGGTGTAATGACAAGTGATCACGGCATCACACGGTAGACTAAATGTATCGAACGA', 'GTTGGAATTGCAGACTTTCTCTTGGCAGATAGCCCTCGTGTGAATCAATTGAGGCGGCATCGTGCCCGATAAGAAGACCCGGTAGCAATGTAGCGTTGCGAGCGGTCATCACCGTATCAATAGATTCGTGCGTCTTCGCCTGTAACTGCCGGAGCA', 'TGAGTCGTACCGGCGGGTATAGCCCATGCAATTTCACTCACTGTACGAGACAAAGACCGGACTGTTCGTACGGACAACAATGTGACTTCCTCGGTAGTCGATATCAGTAACCGTCTGACGCTGACGCCTGCTATCCCAAGTACCTTTACCTCTTAC', 'GTGTTGAATCCCGAAGTGACGGCACAGGAACGCCCGAGAAAAGCGTTGCCTTCTCCAGTAGTAGCGATCTGAATTTACCGAAACGCCTTCATGGCGAATGCCTAATGTGTTAGGAATCGAGGTACTCGCTGTCGACCCCAACGCAACCTGTTCGAC', 'AGTGGCACTTTGCGTCCAGGTAATGGATAAGG

In [126]:
import numpy as np

def score(dna):
    score_dicts = []
    scores = []
    for i in range(len(dna[0])):
        nt_counts = {"A":0, "T":0, "G":0, "C":0}
        for seq in dna:
            nt_counts[seq[i]] += 1
        score_dicts.append(nt_counts)
    for d in score_dicts:
        dict_score = 0
        for key, value in d.items():
            if key != max(d, key=d.get):
                dict_score += value
        scores.append(dict_score)
    return sum(scores)


def profile(dna):
    score_dicts = []
    for i in range(len(dna[0])):
        nt_counts = {"A":0, "C":0, "G":0, "T":0}
        for seq in dna:
            nt_counts[seq[i]] += 1
        # divide once after counting; accumulating 1/len(dna) in the loop above would add floating-point rounding error
        for k, v in nt_counts.items():
            nt_counts[k] = v/len(dna)
        score_dicts.append(nt_counts)
    # after transposing, rows are A, C, G, T and columns are motif positions
    score_profile = np.array([list(i.values()) for i in score_dicts])
    score_profile = score_profile.T # transpose
    return score_profile


def most_probable_kmer(text, k, profile):
    max_prob = -1
    for i in range(len(text)-k+1):
        kmer = text[i:i+k]
        prob = 1
        for j in range(k):
            if kmer[j] == "A":
                prob *= profile[0][j]
            elif kmer[j] == "C":
                prob *= profile[1][j]
            elif kmer[j] == "G":
                prob *= profile[2][j]
            elif kmer[j] == "T":
                prob *= profile[3][j]
        if prob > max_prob:
            max_prob = prob
            max_kmer = kmer
    return max_kmer


def greedy_motif_search(k, t, dnas):
    # arbitrarily choose the first k chars from each dna string
    best_motifs = [dna[:k] for dna in dnas] 
    # arbitrarily start with the first dna string and find all k-mers in this string
    dna0_motifs = [dnas[0][i:i+k] for i in range(len(dnas[0])-k+1)] 
    
    for motif in dna0_motifs:
        current_motifs = [motif] # initialize a current_motifs list for each motif.
        for i in range(1, t): # for each dna string aside from the first,
            current_dna = dnas[i]
            prof = profile(current_motifs) # get the probability profile of the motifs chosen so far
            motifi = most_probable_kmer(current_dna, k, prof) # find the most probable k-mer in this dna string under that profile
            current_motifs.append(motifi) # and update current_motifs
        if score(current_motifs) < score(best_motifs): # if current_motifs has lower disagreement
            best_motifs = current_motifs
    return best_motifs

In [127]:
print(" ".join(greedy_motif_search(k, t, dnas)))

CCTTCCACCGTA CATAGTCCTTAG ATCGGGTGGCGC CCCTCGTGTGAA ACTTCCTCGGTA GTGTTGAATCCC CCTTCCCCTGTA CCCAGCACGGAA CCCTCGACGCGC ACCACTTCCGCC GCTTTCCCCGTA CCCTTCCCCGCC ATCTCCAACGAC ACTTCGCCTGTA ACCTGGCCGTTA AATTGGTCTTGC CTTATCCCCGGA ACTTCGACTGTA ACTTCGCCCGTA GCCGCTCCTGGC ACTAGTTGTTTA CCTTCTACTGTA CCTTCGCCGGTA CACACCCGGGGC GCTTCCCCGGTA


In [None]:
def laplace_profile(dna):
    score_dicts = []
    for i in range(len(dna[0])):
        nt_counts = {"A":1, "C":1, "G":1, "T":1} # pseudocount of 1 for each nucleotide
        for seq in dna:
            nt_counts[seq[i]] += 1
        # divide once after counting; accumulating fractions in the loop above would add floating-point rounding error
        for k, v in nt_counts.items():
            nt_counts[k] = v/(len(dna)+4) # +4 since we add 1 to each nucleotide
        score_dicts.append(nt_counts)
    # after transposing, rows are A, C, G, T and columns are motif positions
    score_profile = np.array([list(i.values()) for i in score_dicts])
    score_profile = score_profile.T # transpose
    return score_profile


def greedy_motif_search_pseudocounts(k, t, dnas):
    # arbitrarily choose the first k chars from each dna string
    best_motifs = [dna[:k] for dna in dnas] 
    # arbitrarily start with the first dna string and find all k-mers in this string
    dna0_motifs = [dnas[0][i:i+k] for i in range(len(dnas[0])-k+1)] 
    
    for motif in dna0_motifs:
        current_motifs = [motif] # initialize a current_motifs list for each motif.
        for i in range(1, t): # for each dna string aside from the first,
            current_dna = dnas[i]
            prof = laplace_profile(current_motifs) # get the pseudocount profile of the motifs chosen so far
            motifi = most_probable_kmer(current_dna, k, prof) # find the most probable k-mer in this dna string under that profile
            current_motifs.append(motifi) # and update current_motifs
        if score(current_motifs) < score(best_motifs): # if current_motifs has lower disagreement
            best_motifs = current_motifs
    return best_motifs
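To make the effect of the pseudocounts concrete, here is a small sketch that applies the same +1 rule to a single column. `laplace_column` is a hypothetical helper written for this example; it is not used by the functions above.

```python
def laplace_column(column):
    """Nucleotide probabilities for one motif column, with a pseudocount of 1 each."""
    counts = {nt: 1 for nt in "ACGT"}
    for nt in column:
        counts[nt] += 1
    total = len(column) + 4  # one extra count per nucleotide
    return {nt: c / total for nt, c in counts.items()}

col = laplace_column("AAT")  # column taken from three motifs
print(col)
# A: 3/7, C: 1/7, G: 1/7, T: 2/7 -- no probability is 0 anymore
```

Because no entry is 0, a k-mer that disagrees with the current motifs in one position still gets a small nonzero probability instead of being ruled out entirely.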

In [150]:
prof = [[0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
        [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
        [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
        [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]]
k = 6
text = "GAGCTA"
def calculate_probability(text, k, profile):
    prob = 1
    for i in range(k):
        if text[i] == "A":
            prob *= profile[0][i]
        elif text[i] == "C":
            prob *= profile[1][i]
        elif text[i] == "G":
            prob *= profile[2][i]
        elif text[i] == "T":
            prob *= profile[3][i]
    return prob

calculate_probability(text, k, prof)

0.0054
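The result is the product of one profile entry per position of the k-mer. As a sketch that spells out those factors, assuming the same A, C, G, T row order as above (`ROW` and `kmer_probability` are names introduced for this example):

```python
from math import prod

ROW = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed row order of the profile matrix

def kmer_probability(kmer, profile):
    """Probability of kmer under profile: one factor per position."""
    return prod(profile[ROW[nt]][j] for j, nt in enumerate(kmer))

prof = [[0.4, 0.3, 0.0, 0.1, 0.0, 0.9],
        [0.2, 0.3, 0.0, 0.4, 0.0, 0.1],
        [0.1, 0.3, 1.0, 0.1, 0.5, 0.0],
        [0.3, 0.1, 0.0, 0.4, 0.5, 0.0]]

factors = [prof[ROW[nt]][j] for j, nt in enumerate("GAGCTA")]
print(factors)  # [0.1, 0.3, 1.0, 0.4, 0.5, 0.9]
print(kmer_probability("GAGCTA", prof))  # approximately 0.0054
```

The lookup table replaces the chain of `if`/`elif` branches; the arithmetic is the same, 0.1 × 0.3 × 1.0 × 0.4 × 0.5 × 0.9.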