# Find the Most Frequent Words with Mismatches in a String

[ba1i](https://rosalind.info/problems/ba1i/)

We defined a mismatch in “Compute the Hamming Distance Between Two Strings”. We now generalize “Find the Most Frequent Words in a String” to incorporate mismatches as well.

Given strings Text and Pattern as well as an integer d, we define Count_d(Text, Pattern) as the total number of occurrences of Pattern in Text with at most d mismatches. For example, Count_1(AACAAGCTGATAAACATTTAAAGAG, AAAAA) = 4 because AAAAA appears four times in this string with at most one mismatch: AACAA, ATAAA, AAACA, and AAAGA. Note that two of these occurrences overlap.
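
The definition can be checked directly by sliding a window of length |Pattern| across Text and counting the windows that differ from Pattern in at most d positions. The following is a minimal sketch; the names `hamming_distance` and `approximate_count` are chosen here for illustration and are not taken from the problem statement.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_count(text, pattern, d):
    """Count the windows of text that match pattern with at most d mismatches."""
    k = len(pattern)
    return sum(
        hamming_distance(text[i:i + k], pattern) <= d
        for i in range(len(text) - k + 1)
    )


text = "AACAAGCTGATAAACATTTAAAGAG"
print(approximate_count(text, "AAAAA", 1))  # 4, as in the example above
```

Overlapping windows are counted separately because every start position is examined.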

A most frequent k-mer with up to d mismatches in Text is simply a string Pattern maximizing Count_d(Text, Pattern) among all k-mers. Note that Pattern does not need to actually appear as a substring of Text; for example, AAAAA is the most frequent 5-mer with 1 mismatch in AACAAGCTGATAAACATTTAAAGAG, even though AAAAA does not appear exactly in this string. Keep this in mind while solving the following problem.
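
To make this point concrete, the short check below confirms that AAAAA never occurs exactly in the example string, yet four windows lie within one mismatch of it. This is only an illustrative sketch; the helper `mismatches` and the variable names are assumptions introduced here.

```python
text = "AACAAGCTGATAAACATTTAAAGAG"
pattern = "AAAAA"
k = len(pattern)


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Does the pattern occur exactly anywhere in the text?
exact = pattern in text

# Start positions and windows that are within one mismatch of the pattern.
hits = [
    (i, text[i:i + k])
    for i in range(len(text) - k + 1)
    if mismatches(text[i:i + k], pattern) <= 1
]

print(exact)  # False
print(hits)   # [(0, 'AACAA'), (9, 'ATAAA'), (11, 'AAACA'), (19, 'AAAGA')]
```

A solution that only considers substrings of Text as candidates would therefore never report AAAAA.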

### Frequent Words with Mismatches Problem

Find the most frequent k-mers with mismatches in a string.

Given: A string Text as well as integers k and d.

Return: All most frequent k-mers with up to d mismatches in Text.

Sample Dataset

    ACGTTGCATGTCGCATGATGCATGAGAGCT
    4 1

Sample Output

    GATG ATGC ATGT

In [27]:
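from itertools import product

# Enumerate every 5-mer over the DNA alphabet and start each count at zero.
alphabet = 'ACGT'
{''.join(kmer): 0 for kmer in product(alphabet, repeat=5)}
# The same k-mers as a plain list:
# [''.join(i) for i in product(alphabet, repeat=5)]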

{'AAAAA': 0,
 'AAAAC': 0,
 'AAAAG': 0,
 'AAAAT': 0,
 'AAACA': 0,
 'AAACC': 0,
 'AAACG': 0,
 'AAACT': 0,
 'AAAGA': 0,
 'AAAGC': 0,
 'AAAGG': 0,
 'AAAGT': 0,
 'AAATA': 0,
 'AAATC': 0,
 'AAATG': 0,
 'AAATT': 0,
 'AACAA': 0,
 'AACAC': 0,
 'AACAG': 0,
 'AACAT': 0,
 'AACCA': 0,
 'AACCC': 0,
 'AACCG': 0,
 'AACCT': 0,
 'AACGA': 0,
 'AACGC': 0,
 'AACGG': 0,
 'AACGT': 0,
 'AACTA': 0,
 'AACTC': 0,
 'AACTG': 0,
 'AACTT': 0,
 'AAGAA': 0,
 'AAGAC': 0,
 'AAGAG': 0,
 'AAGAT': 0,
 'AAGCA': 0,
 'AAGCC': 0,
 'AAGCG': 0,
 'AAGCT': 0,
 'AAGGA': 0,
 'AAGGC': 0,
 'AAGGG': 0,
 'AAGGT': 0,
 'AAGTA': 0,
 'AAGTC': 0,
 'AAGTG': 0,
 'AAGTT': 0,
 'AATAA': 0,
 'AATAC': 0,
 'AATAG': 0,
 'AATAT': 0,
 'AATCA': 0,
 'AATCC': 0,
 'AATCG': 0,
 'AATCT': 0,
 'AATGA': 0,
 'AATGC': 0,
 'AATGG': 0,
 'AATGT': 0,
 'AATTA': 0,
 'AATTC': 0,
 'AATTG': 0,
 'AATTT': 0,
 'ACAAA': 0,
 'ACAAC': 0,
 'ACAAG': 0,
 'ACAAT': 0,
 'ACACA': 0,
 'ACACC': 0,
 'ACACG': 0,
 'ACACT': 0,
 'ACAGA': 0,
 'ACAGC': 0,
 'ACAGG': 0,
 'ACAGT': 0,
 'ACATA': 0,

In [28]:
def get_hamming_distance(dna1, dna2):
    """Return the number of positions at which dna1 and dna2 differ."""
    return sum(x != y for x, y in zip(dna1, dna2))

In [29]:
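def get_frequent_words(dna, k, d):
    """Return the most frequent k-mers in dna with at most d mismatches, space-separated."""
    # Brute force: score all 4**k possible k-mers, because the answer
    # does not have to occur in dna exactly.
    kmers = {''.join(kmer): 0 for kmer in product('ACGT', repeat=k)}
    max_count = 0
    for i in range(len(dna) - k + 1):
        window = dna[i:i + k]
        for kmer in kmers:
            if get_hamming_distance(window, kmer) <= d:
                kmers[kmer] += 1
                max_count = max(max_count, kmers[kmer])

    return ' '.join(kmer for kmer, count in kmers.items() if count == max_count)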
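
The function above compares every window against all 4^k candidate k-mers, which grows quickly with k. One possible alternative, sketched here rather than taken from the notebook, is to generate only the d-neighborhood of each window (all k-mers within Hamming distance d of it) and tally those. The names `neighbors` and `frequent_words_by_neighbors` are hypothetical helpers introduced for this sketch.

```python
from collections import Counter


def neighbors(pattern, d, alphabet="ACGT"):
    """Return the set of strings within Hamming distance d of pattern."""
    if d == 0:
        return {pattern}
    if len(pattern) == 1:
        return set(alphabet)
    result = set()
    suffix = pattern[1:]
    for candidate in neighbors(suffix, d, alphabet):
        distance = sum(x != y for x, y in zip(suffix, candidate))
        if distance < d:
            # Room for one more mismatch: any first letter is allowed.
            result.update(c + candidate for c in alphabet)
        else:
            # Mismatch budget used up: the first letter must match.
            result.add(pattern[0] + candidate)
    return result


def frequent_words_by_neighbors(text, k, d):
    """Return the most frequent k-mers with up to d mismatches, sorted."""
    counts = Counter()
    for i in range(len(text) - k + 1):
        # Each window adds one occurrence to every k-mer within distance d.
        counts.update(neighbors(text[i:i + k], d))
    if not counts:
        return []
    best = max(counts.values())
    return sorted(kmer for kmer, count in counts.items() if count == best)


result = frequent_words_by_neighbors("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4, 1)
print(' '.join(result))  # ATGC ATGT GATG
```

On the sample dataset this yields the same three k-mers as the sample output, listed in sorted order. Only k-mers that are close to some window are ever stored, so the dictionary stays much smaller than 4^k when d is small relative to k.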

In [30]:
file = "rosalind_ba1i.txt"
# file = "input.txt"
# The first line holds the DNA string; the second holds k and d.
with open(file, 'r') as f:
    lines = f.readlines()
    dna = lines[0].strip()
    k, d = map(int, lines[1].split())

print(get_frequent_words(dna, k, d))

CCTCCT
