This notebook demonstrates implementations of classic motif discovery algorithms in bioinformatics. Each section presents a biological problem, the algorithms that address it, and example runs on real or artificial genomic data.

**Finding localized frequent k-mers**

Genomes often contain short DNA words of length k (k-mers) that are unusually frequent within localized regions. Transcription factor binding sites, for example, are often clustered together near promoters. The following code searches for k-mers that occur repeatedly within a fixed-length window of a genome. The search is implemented in the FindClumps function, and the E. coli genome is used as the example.
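
To make the idea concrete, here is a minimal sketch of one way such a clump search could work. It is not the FindClumps implementation from algorithms.Utils, whose internals are not shown in this notebook; the name `find_clumps_sketch` and its parameter names are hypothetical. The sketch counts the k-mers in the first window, then slides the window one base at a time, updating only the k-mer that leaves and the k-mer that enters.

```python
from collections import Counter


def find_clumps_sketch(genome, k, window, t):
    """Return the set of k-mers that occur at least t times in some
    window of length `window`. Hypothetical helper for illustration,
    not the notebook's FindClumps."""
    clumps = set()
    if window < k or len(genome) < window:
        return clumps

    # Count every k-mer in the first window.
    counts = Counter(genome[i:i + k] for i in range(window - k + 1))
    clumps.update(kmer for kmer, c in counts.items() if c >= t)

    # Slide the window: one k-mer leaves on the left, one enters on the right.
    for start in range(1, len(genome) - window + 1):
        leaving = genome[start - 1:start - 1 + k]
        counts[leaving] -= 1
        entering = genome[start + window - k:start + window]
        counts[entering] += 1
        # Only the entering k-mer can newly reach the threshold.
        if counts[entering] >= t:
            clumps.add(entering)
    return clumps


# Toy input: "ACG" appears 3 times in the first 9 bases, followed by a run of T.
toy_genome = "ACGACGACGTTTTTTTTTTACG"
print(sorted(find_clumps_sketch(toy_genome, k=3, window=9, t=3)))
# → ['ACG', 'TTT']
```

Updating the counts incrementally keeps the work per window constant, which matters on a genome of several million bases; recounting every window from scratch would repeat almost all of the work.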

In [None]:
from algorithms.Utils import read_data_file, FindClumps

string_list = read_data_file("E_coli.txt")

Genome = string_list[0]
k = 5  # k-mer length
L = 500  # window length
t = 3  # minimum number of occurrences within a window

# Run FindClumps on the E. coli genome to find short DNA words that are unusually frequent in localized regions.
# This models the detection of regulatory motifs, e.g. transcription factor binding sites, which are often clustered together near promoters.
Unique_Patterns = FindClumps(Genome, k, L, t)
no_UP = len(Unique_Patterns)
print(f"The {no_UP} unique {k}-mers occurring at least {t} times in some window of length {L} are: {' '.join(Unique_Patterns)}")

In [None]:
from algorithms.Utils import PlotSkew, MinimumSkew, FrequentWordsWithMismatchesAndRevComp
