# Find the Most Frequent Words in a String

[ba1b](https://rosalind.info/problems/ba1b/)

We say that Pattern is a most frequent k-mer in Text if it maximizes Count(Text, Pattern) among all k-mers. For example, "ACTAT" is a most frequent 5-mer in "ACAACTATGCATCACTATCGGGAACTATCCT", and "ATA" is a most frequent 3-mer of "CGATATATCCATAG".
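
Both examples rely on Count(Text, Pattern) counting overlapping occurrences. A minimal sketch of such a counter follows; the name `pattern_count` is an assumption made for illustration and does not appear in the notebook cells further down.

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    # Slide a window of length k over text and compare it with pattern.
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


print(pattern_count("ACAACTATGCATCACTATCGGGAACTATCCT", "ACTAT"))  # 3
print(pattern_count("CGATATATCCATAG", "ATA"))  # 3 (two of the matches overlap)
```

Python's built-in `str.count` skips overlapping matches, so it would report 2 for the second example; that is why the sketch slides a window explicitly.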

    FREQUENTWORDS(Text, k)
        FrequentPatterns <-- an empty set
        for i <-- 0 to |Text| - k
            Pattern <-- the k-mer Text(i, k)
            COUNT(i) <-- PATTERNCOUNT(Text, Pattern)
        maxCount <-- maximum value in array COUNT
        for i <-- 0 to |Text| - k
            if COUNT(i) = maxCount
                add Text(i, k) to FrequentPatterns
        remove duplicates from FrequentPatterns
        return FrequentPatterns
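
The pseudocode can be translated almost line for line into Python. The sketch below is only one possible rendering; the names `pattern_count` and `frequent_words` are assumptions for illustration, and a set stands in for the explicit "remove duplicates" step.

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def frequent_words(text, k):
    """Direct translation of the FrequentWords pseudocode."""
    n = len(text) - k + 1  # number of k-mer start positions
    # COUNT(i): how often the k-mer starting at position i occurs in text.
    count = [pattern_count(text, text[i:i + k]) for i in range(n)]
    max_count = max(count, default=0)
    # Building a set removes duplicate k-mers automatically.
    return {text[i:i + k] for i in range(n) if count[i] == max_count}


print(sorted(frequent_words("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)))  # ['CATG', 'GCAT']
```

Because `pattern_count` is called once per start position, this version does roughly |Text|² · k character comparisons, which is acceptable for short strings but slow for long ones.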

### Frequent Words Problem

Find the most frequent k-mers in a string.

Given: A DNA string Text and an integer k.

Return: All most frequent k-mers in Text (in any order).

Sample Dataset

    ACGTTGCATGTCGCATGATGCATGAGAGCT
    4

Sample Output

    CATG GCAT

    from collections import defaultdict
    from sys import path

    # Make the parent directory importable so the shared helper module `common` can be found.
    path.append("../")
    import common

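    def get_frequent_words(dna, k):
        """Count every k-mer in one pass with a dictionary and return the most frequent ones."""
        max_count = 0
        res = defaultdict(int)
        for i in range(len(dna) - k + 1):
            kmer = dna[i:i+k]
            res[kmer] += 1
            # Track the highest count seen so far.
            max_count = max(max_count, res[kmer])

        # Keep every k-mer whose count equals the maximum.
        return ' '.join(key for key, value in res.items() if value == max_count)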

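    def get_faster_frequent_words(dna, k):
        """Find the most frequent k-mers using a frequency array indexed by encoded k-mer."""
        freq_patterns = set()
        # The loop below expects one entry per possible k-mer, i.e. 4**k entries.
        freq_array = common.compute_frequencies(dna, k)
        max_count = max(freq_array)
        for i in range(4**k):
            if freq_array[i] == max_count:
                # Decode the index back into its k-mer.
                freq_patterns.add(common.number_to_pattern(i, k))
        return ' '.join(freq_patterns)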
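
The helpers imported from `common` are not shown in this notebook. The sketch below gives hypothetical stand-ins that would be compatible with the calls above, assuming the usual base-4 encoding A=0, C=1, G=2, T=3; the real module may be implemented differently.

```python
# Hypothetical stand-ins for the helpers in `common`; the real module is not
# shown, so this is an assumption about its behaviour.
SYMBOLS = "ACGT"


def pattern_to_number(pattern):
    """Read a DNA string as a base-4 number (A=0, C=1, G=2, T=3)."""
    number = 0
    for symbol in pattern:
        number = 4 * number + SYMBOLS.index(symbol)
    return number


def number_to_pattern(number, k):
    """Inverse of pattern_to_number for a k-mer."""
    symbols = []
    for _ in range(k):
        number, remainder = divmod(number, 4)
        symbols.append(SYMBOLS[remainder])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(symbols))


def compute_frequencies(text, k):
    """Return a list of 4**k counts, indexed by pattern_to_number(k-mer)."""
    freq = [0] * 4 ** k
    for i in range(len(text) - k + 1):
        freq[pattern_to_number(text[i:i + k])] += 1
    return freq


print(pattern_to_number("AGT"))  # 11
print(number_to_pattern(11, 3))  # AGT
print(compute_frequencies("ACGCGGCTCTGAAA", 2))
```

The frequency array always has 4**k entries regardless of the length of the text, which is why this approach is only practical for small k.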

    def get_faster_frequent_words_by_sorting(dna, k):
        """Find the most frequent k-mers by sorting their numeric encodings."""
        freq_patterns = set()
        # Encode every k-mer as a number and sort, so equal k-mers become adjacent.
        index = sorted([common.pattern_to_number(dna[i:i+k]) for i in range(len(dna) - k + 1)])
        count = [1] * (len(dna) - k + 1)
        # count[i] is the length of the run of equal values ending at position i.
        for i in range(1, len(dna) - k + 1):
            if index[i] == index[i-1]:
                count[i] = count[i-1] + 1
        max_count = max(count)
        for i in range(len(dna) - k + 1):
            if count[i] == max_count:
                freq_patterns.add(common.number_to_pattern(index[i], k))
        return ' '.join(freq_patterns)
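
The sorting variant works because identical k-mers end up next to each other once sorted, so the length of each run is that k-mer's count. The following self-contained sketch shows the same idea on plain strings, without the `common` encoding helpers; the function name `frequent_words_by_sorting` is an assumption for illustration.

```python
from itertools import groupby


def frequent_words_by_sorting(text, k):
    """Most frequent k-mers, found by sorting the k-mers and measuring runs."""
    kmers = sorted(text[i:i + k] for i in range(len(text) - k + 1))
    # After sorting, identical k-mers are adjacent, so each run length is a count.
    runs = [(kmer, sum(1 for _ in group)) for kmer, group in groupby(kmers)]
    max_count = max((count for _, count in runs), default=0)
    return [kmer for kmer, count in runs if count == max_count]


print(frequent_words_by_sorting("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))  # ['CATG', 'GCAT']
```

Memory use grows with the number of k-mers in the text rather than with 4**k, which is the reason this approach remains usable when k is large.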

    file = "rosalind_ba1b.txt"
    # file = "input.txt"
    with open(file, 'r') as f:
        lines = f.readlines()
        # First line: the DNA string; second line: k.
        dna, k = lines[0].strip(), int(lines[1])

    print(get_frequent_words(dna, k))  # dictionary counting in a single pass over dna
    print(get_faster_frequent_words(dna, k))  # frequency array of size 4**k, practical only for small k
    print(get_faster_frequent_words_by_sorting(dna, k))  # sorted k-mer indices, no 4**k array, suited to larger k

Output:

    TCGCATGACAT
    TCGCATGACAT
    TCGCATGACAT
