# Find Patterns Forming Clumps in a String

[ba1e](https://rosalind.info/problems/ba1e/)

Given integers L and t, a string Pattern forms an (L, t)-clump inside a (larger) string Genome if there is an interval of Genome of length L in which Pattern appears at least t times. For example, TGCA forms a (25,3)-clump in the following Genome: gatcagcataagggtcccTGCAATGCATGACAAGCCTGCAgttgttttac.
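The definition can be checked directly on this example. The sketch below uses a hypothetical helper, `forms_clump`, which is not part of the solution that follows. It collects the start position of every occurrence of Pattern (overlapping ones included) and tests whether any t consecutive occurrences fit inside a window of length L. It assumes matching is case-insensitive, since the example writes the genome in mixed case.

```python
def forms_clump(genome, pattern, L, t):
    """Return True if pattern forms an (L, t)-clump in genome (illustrative helper)."""
    genome, pattern = genome.upper(), pattern.upper()  # assumption: ignore case
    k = len(pattern)
    if L > len(genome):
        return False
    # Start positions of all occurrences, overlapping ones included.
    starts = [i for i in range(len(genome) - k + 1) if genome[i:i + k] == pattern]
    # t consecutive occurrences fit in a window of length L when the span from
    # the first start to the end of the t-th occurrence is at most L.
    return any(starts[i + t - 1] + k - starts[i] <= L
               for i in range(len(starts) - t + 1))


genome = "gatcagcataagggtcccTGCAATGCATGACAAGCCTGCAgttgttttac"
print(forms_clump(genome, "TGCA", 25, 3))  # True: occurrences start at 18, 23 and 36
```

The three occurrences span 22 characters (positions 18 through 39, counting from 0), so any L of at least 22 would also work for t = 3.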

### Clump Finding Problem

Find patterns forming clumps in a string.

Given: A string Genome, and integers k, L, and t.

Return: All distinct k-mers forming (L, t)-clumps in Genome.

### Sample Dataset

    CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACACGACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC
    5 75 4

### Sample Output

    CGACA GAAGA AATGT

    from collections import defaultdict

    def get_frequent_words(seq, k, t):
        """Return the k-mers that occur at least t times in seq."""
        clumps = set()
        kmers = defaultdict(int)
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i+k]
            kmers[kmer] += 1
            if kmers[kmer] >= t:
                clumps.add(kmer)
        return clumps

    def get_clumps(dna, k, L, t):
        """Brute force: recount the k-mers of every length-L window of dna."""
        clumps = set()
        for i in range(len(dna) - L + 1):
            clumps.update(get_frequent_words(dna[i:i+L], k, t))
        return clumps

    def symbol_to_number(symbol):
        symbols = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
        return symbols[symbol]

    def number_to_symbol(number):
        return 'ACGT'[number]

    def pattern_to_number(pattern):
        """Read pattern as a base-4 number with digits A=0, C=1, G=2, T=3."""
        if len(pattern) == 0:
            return 0
        return 4 * pattern_to_number(pattern[:-1]) + symbol_to_number(pattern[-1])

    def compute_frequencies(dna, k):
        """Return an array of 4**k counts, indexed by pattern_to_number(k-mer)."""
        freq_array = [0] * 4**k
        for i in range(len(dna) - k + 1):
            pattern = dna[i:i+k]
            j = pattern_to_number(pattern)
            freq_array[j] += 1
        return freq_array

    def get_quotient(index):
        return index // 4

    def get_remainder(index):
        return index % 4

    def number_to_pattern(index, k):
        """Inverse of pattern_to_number for k-mers of length k."""
        if k == 1:
            return number_to_symbol(index)
        prefix_index = get_quotient(index)
        symbol = number_to_symbol(get_remainder(index))
        prefix_pattern = number_to_pattern(prefix_index, k-1)
        return prefix_pattern + symbol
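`pattern_to_number` and `number_to_pattern` treat a k-mer as a base-4 number with digits A=0, C=1, G=2, T=3, so each undoes the other. A minimal self-contained sketch of the same encoding, written iteratively rather than recursively (the names `encode` and `decode` are illustrative and do not appear in the notebook):

```python
DIGITS = "ACGT"  # A=0, C=1, G=2, T=3


def encode(pattern):
    """Base-4 value of a DNA string, most significant symbol first."""
    number = 0
    for symbol in pattern:
        number = 4 * number + DIGITS.index(symbol)
    return number


def decode(number, k):
    """k-mer whose base-4 value is number, left-padded with 'A' (the zero digit)."""
    symbols = []
    for _ in range(k):
        number, remainder = divmod(number, 4)
        symbols.append(DIGITS[remainder])
    return "".join(reversed(symbols))


print(encode("ATGCAA"))  # 912
print(decode(45, 4))     # AGTC
```

The length k has to be passed to `decode` because leading `A` symbols contribute nothing to the number: `AAC` and `C` both encode to 1.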

    def get_faster_clumps(dna, k, L, t):
        """Frequency-array version: rebuild the 4**k counts for every window."""
        freq_patterns = set()
        clump = [0] * 4**k

        for i in range(len(dna) - L + 1):
            pattern = dna[i:i+L]
            freq_array = compute_frequencies(pattern, k)
            for index in range(4**k):
                if freq_array[index] >= t:
                    clump[index] = 1
        for i in range(4**k):
            if clump[i] == 1:
                freq_patterns.add(number_to_pattern(i, k))

        return freq_patterns

    def get_better_faster_clumps(dna, k, L, t):
        """Sliding window: build the counts once, then update them per shift."""
        freq_patterns = set()
        clump = [0] * 4**k

        # The first window is dna[0:L]; slicing to L+1 would count the k-mer
        # ending at position L twice once the window starts sliding.
        pattern = dna[0:L]
        freq_array = compute_frequencies(pattern, k)
        for i in range(4**k):
            if freq_array[i] >= t:
                clump[i] = 1

        for i in range(1, len(dna) - L + 1):
            # The k-mer that started the previous window drops out.
            first_pattern = dna[i-1:i-1+k]
            index = pattern_to_number(first_pattern)
            freq_array[index] = freq_array[index] - 1
            # The k-mer that ends the current window comes in.
            last_pattern = dna[i+L-k:i+L]
            index = pattern_to_number(last_pattern)
            freq_array[index] = freq_array[index] + 1
            if freq_array[index] >= t:
                clump[index] = 1

        for i in range(4**k):
            if clump[i] == 1:
                freq_patterns.add(number_to_pattern(i, k))

        return freq_patterns
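The frequency array has 4**k entries, so its size grows quickly with k. As a variation on the same sliding-window idea, the sketch below keeps the counts in a dictionary, so only k-mers actually present in the genome are stored. `find_clumps_sliding` is a hypothetical name, and this is an alternative sketch rather than the notebook's method. It is run here on the sample dataset.

```python
from collections import Counter


def find_clumps_sliding(dna, k, L, t):
    """Distinct k-mers forming (L, t)-clumps, using dictionary counts (sketch)."""
    if L > len(dna) or k > L:
        return set()
    # Count the k-mers of the first window, dna[0:L].
    counts = Counter(dna[i:i + k] for i in range(L - k + 1))
    clumps = {kmer for kmer, n in counts.items() if n >= t}
    for i in range(1, len(dna) - L + 1):
        counts[dna[i - 1:i - 1 + k]] -= 1  # k-mer that left the window
        entering = dna[i + L - k:i + L]    # k-mer that entered the window
        counts[entering] += 1
        # Only the entering k-mer's count went up, so only it can newly reach t.
        if counts[entering] >= t:
            clumps.add(entering)
    return clumps


dna = ("CGGACTCGACAGATGTGAAGAAATGTGAAGACTGAGTGAAGAGAAGAGGAAACACGACAC"
       "GACATTGCGACATAATGTACGAATGTAATGTGCCTATGGC")
print(sorted(find_clumps_sliding(dna, 5, 75, 4)))  # ['AATGT', 'CGACA', 'GAAGA']
```

Sorting the result before printing gives a stable order, which plain iteration over a set does not.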

    file = "rosalind_ba1e.txt"
    # file = "input.txt"
    with open(file, 'r') as f:
        lines = f.readlines()
        dna = lines[0].strip()
        k, L, t = map(int, lines[1].split())

    # Brute force: recounts every window from scratch.
    print(' '.join(get_clumps(dna, k, L, t)))
    # Frequency array rebuilt for every window: all 4**k counts each time, so too slow on real data.
    print(' '.join(get_faster_clumps(dna, k, L, t)))
    # Sliding frequency array: the most efficient of the three on real data.
    print(' '.join(get_better_faster_clumps(dna, k, L, t)))

Output (the same six k-mers on each line; the order differs because the results are sets):

    CCGGTACTGT GAACGCCAGG TAACAACCTC GAGTTACATG CTTCAAGCGT CTATTATTGC
    GAACGCCAGG CCGGTACTGT CTTCAAGCGT TAACAACCTC GAGTTACATG CTATTATTGC
    GAACGCCAGG CCGGTACTGT CTTCAAGCGT TAACAACCTC GAGTTACATG CTATTATTGC
