# Finding Hidden Messages in DNA
## Week 1

* Within a genome, replication begins at the replication origin (ori) and is carried out by DNA polymerases.
* Replication initiation is mediated by the DnaA protein, which binds to a short segment known as a DnaA box.

    * How might we detect DnaA boxes in the ori region?
    * We should expect to find an enrichment of DnaA box sequences, as well as their reverse complements, within the ori region
    * Start by identifying enriched k-mers within a region
    * A k-mer is a string of length k; Count(Text, Pattern) is the number of times the k-mer Pattern appears in Text

In [1]:
def PatternCount(text, pattern):
    '''
    Count how many times a pattern appears in a given text.
    
    Args:
        text (string): The text to search through.
        pattern (string): The pattern to count within text.
        
    Returns:
        int: Number of times pattern appears in text.
    '''
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count += 1
    return count

In [2]:
# Test/Run PatternCount Function
with open('./Data/dataset_2_7.txt') as inFile:
    data = inFile.readlines()
text = data[0].strip()
pattern = data[1].strip()

print(f"Sequence: {text}\nPattern: {pattern}\nCount: {PatternCount(text, pattern)}")

Sequence: GCTCGGCGCCTAGGGAGCTCGGCCCAAGCTCGGCGGCTCGGCAGCTCGGCTTGCTCGGCTGTTGGGCTCGGCGCATTGTGCTCGGCAGCTCGGCGCTCGGCACTTAACATGGGGGCTCGGCCACTCGCTCGGCCAGCATAGCTCGGCTCGCCGCTCGGCAATGCTCGGCTACGACGCTCGGCGCTCGGCTGGCCGCTCGGCTAGGCTCGGCAGCTCGGCACTATCGCTCGGCAGCTTGGCGCTCGGCCTCTGATCCCGCTCGGCTGCTCGGCGCAGGCTCGGCCGCTCGGCGGCTCGGCTACAAGGGCTCGGCGGCTCGGCGCACGCTCGGCGGCTCGGCGCTCGGCCGCATTTCTGCGGGACGCTCGGCATTCGCTCGGCGCTCGGCGCTCGGCGTCGCTCGGCCGCGCTCGGCTCTTGCAGGCTCGGCCACCGCTCGGCGCTCGGCGACGCTCGGCCGCTCGGCCTCGCTCGGCAGGCTCGGCCGAGGCTCGGCCGCTCGGCGGCTCGGCTAGCTAGCTCGGCACTGGAGGCGCGGGTGCTCGGCCTGGCTCGGCAAGGCTCGGCGAGCTCGGCGGGCTCGGCGCTCGGCGGCTCGGCCGCTCGGCGCTCGGCCGAAGGCTCGGCGCGCTCGGCGTGCGCTCGGCGCTCGGCTAGGCTCGGCTGCTCGGCCGCTCGGCGCCTAAAACGAGCTCGGCCGGCTCGGCAGGCTCGGCGCTCGGCGCTCGGCGACCCGCTCGGCACAGGTGAAGTCTGCTCGGCCCGCTCGGCCCATTGCCCCCGCACGCTCGGCGCTCGGCCCCATTTAGAACTGCTCGGCTGTGACAGTGCTCGGCGGCTCGGCCGGGCTCGGCGCTCGGCGCTCGGCCGCCCCCACTGGCTCGGCGTGGCTCGGCGAATAGCTCGGCCGCTCGGCACGCTCGGCCGCTCGGCTGCTCGGCTGCTCGGCCCAGCTCGGCCATGGCTCGGCTTGCTCGGCAAGCTCGGCCG
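
One detail worth noting: `PatternCount` slides one position at a time, so it counts overlapping occurrences, whereas Python's built-in `str.count` only counts non-overlapping ones. A minimal sketch of the difference, where `count_overlapping` is an illustrative re-implementation (the name is an assumption, not part of the notebook):

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)

text = "GATATATGCATATACTT"
pattern = "ATAT"

overlapping = count_overlapping(text, pattern)  # checks every start position
non_overlapping = text.count(pattern)           # resumes after the end of each match

print(overlapping, non_overlapping)
```

For DNA motifs the overlapping count is usually the one we want, which is why the explicit loop is used here rather than `str.count`.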

Now that k-mers can be identified and counted, we want to find which k-mers appear most frequently within a given sequence.
* We say that Pattern is a most frequent k-mer in Text if it has the highest "Count(Text, Pattern)" among all k-mers

In [19]:
def FrequentWords(text, k, ret_max=True):
    '''
    Find the most frequently occurring k-mer(s) in text.
    
    Args:
        text (string): Text to search through.
        k (int): k-mer length to find within text.
        ret_max (bool, optional): Return Max, defaults to True;
            When True returns only the max k-mer(s).
            When False returns all k-mers and their respective counts.
            
    Returns:
        Depending on the ret_max parameter either:
            (list[strings]): maximally occurring k-mer(s) or the
            (dict{'string': int}): entire dictionary of k-mers and respective counts.
    '''
    patterns = dict()
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        if kmer not in patterns:
            patterns[kmer] = 1
        else:
            patterns[kmer] += 1
    if not ret_max:
        return patterns
    else:
        res = []
        mx_count = max(patterns.values())
        for kmer in patterns:
            if patterns[kmer] == mx_count:
                res.append(kmer)
        return res

In [20]:
%%time
#Test/Run FrequentWords Function

# print(*FrequentWords("CGCCTAAATAGCCTCGCGGAGCCTTATGTCATACTCGTCCT", 3))
print(FrequentWords("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
# with open("./Data/dataset_2_10.txt") as inFile:
#     data = inFile.readlines()
    
# text = data[0].strip()
# k = int(data[1].strip())
    
# print(f"K-Mer Length: {k}\nMost frequent k-mers in text: {FrequentWords(text, k)}\n")

['GCAT', 'CATG']
Wall time: 0 ns
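
The same counting idea can be written more compactly with the standard library's `collections.Counter`. This is only a sketch of an alternative, not a replacement for the function above; the name `frequent_words_counter` is hypothetical, and the guard for texts shorter than k is an added assumption about the desired behaviour.

```python
from collections import Counter

def frequent_words_counter(text, k):
    """Return the most frequent k-mer(s) in text, in order of first appearance."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:  # text shorter than k: there are no k-mers to report
        return []
    max_count = max(counts.values())
    return [kmer for kmer, n in counts.items() if n == max_count]

result = frequent_words_counter("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)
print(result)
```

Because `Counter` is a dictionary subclass, it preserves insertion order, so ties come back in the order the k-mers first occur in the text.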


If a DnaA box occurs frequently in the ori region, we expect its reverse complement to occur frequently there as well, because the box may lie on either strand.

In [9]:
def revComp(pattern):
    '''
    Generate the reverse complement DNA sequence for a given sequence.
    
    Args:
        pattern (string): Input DNA sequence.
        
    Returns:
        string: Reverse complementary DNA sequence to input pattern.
    '''
    # Map each nucleotide to its Watson-Crick complement.
    complement = {'A': 'T',
                  'T': 'A',
                  'G': 'C',
                  'C': 'G'}
    return ''.join(complement[x] for x in pattern[::-1])

In [10]:
# Test/Run revComp Function
seq = "AAAACCCGGT"
print(f"Sequence:{seq}\nReverse Complement:{revComp(seq)}")

Sequence:AAAACCCGGT
Reverse Complement:ACCGGGTTTT
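
Since a DnaA box may appear on either strand, it can be useful to count a pattern together with its reverse complement. The sketch below is one way to do that; `reverse_complement` and `count_with_rc` are hypothetical names, and the example text is simply the 9-mer `ATGATCAAG` followed by its reverse complement `CTTGATCAT`.

```python
# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(pattern):
    """Complement every base, then reverse the string."""
    return pattern.translate(COMPLEMENT)[::-1]

def count_with_rc(text, pattern):
    """Count overlapping occurrences of pattern or its reverse complement."""
    k = len(pattern)
    # A set, so a pattern equal to its own reverse complement is not counted twice.
    targets = {pattern, reverse_complement(pattern)}
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] in targets)

rc = reverse_complement("AAAACCCGGT")
combined = count_with_rc("ATGATCAAGCTTGATCAT", "ATGATCAAG")
print(rc, combined)
```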


Given a pattern, identify each position in the genome at which it appears.

In [11]:
def patternMatch(pattern, genome):
    '''
    Find every starting index of pattern within a genome.
    
    Args:
        pattern (string): Pattern to find starting index of
            each instance within genome.
        genome (string): Text to search through to find pattern
        
    Returns:
        list[int]: List of integers of all starting positions of
            pattern within genome.
    '''
    res = []
    for i in range(len(genome) - len(pattern) + 1):
        if genome[i:i + len(pattern)] == pattern:
            res.append(i)
    return res

In [14]:
# Test/Run patternMatch function

pat = 'ATAT'
geno = 'GATATATGCATATACTT'
print(f"Pattern: {pat}\nGenome: {geno}\nStarting Pos: {*patternMatch(pat, geno),}")

# with open("./Data/dataset_3_5.txt") as inFile:
#     data = inFile.readlines()
    
# pattern = data[0].strip()
# genome = data[1].strip()
# print(*patternMatch(pattern, genome))

# with open("./Data/Vibrio_cholerae.txt") as inFile:
#     data = inFile.readlines()

# pattern = "CTTGATCAT"
# genome = data[0].strip()
# print(f"Pattern: {pattern}\nPattern Start Positions in Genome:\n{*patternMatch(pattern, genome),}")

Pattern: ATAT
Genome: GATATATGCATATACTT
Starting Pos: (1, 3, 9)


Now imagine we have a new DNA sequence for which we do not know the ori region. How can we find sequences that are possible DnaA boxes?<br/>
Break the genome into windows and check whether any k-mer is enriched within a given window.<br/>
This could help identify possible DnaA boxes (frequent k-mers) within possible ori regions (windows).

In [23]:
def checkWindow(cords, L, t, gLen):
    '''
    Helper function for clumpFind.
    Determines from the coordinates of a k-mer's occurrences whether it appears
    at least t times within some window of length L in the genome.
    
    Args:
        cords (list[tuple]): List of (start, end) coordinates, in increasing
            order, for each occurrence of a candidate k-mer.
        L (int): Length of window to search.
        t (int): Minimum number of occurrences of the k-mer within a window to be
            regarded as significant.
        gLen (int): Length of the total genome (not used by the current check).
        
    Returns:
        True if there are at least t instances within any given L window of the genome.
            False otherwise.
    '''
    for i in range(len(cords)):
        start = cords[i][0]
        count = 1
        j = i + 1
        # Count later occurrences that end inside the window [start, start + L].
        while j < len(cords) and cords[j][1] <= start + L:
            count += 1
            j += 1
        if count >= t:
            return True
    return False

def clumpFind(genome, k, L, t):
    '''
    Find all k-mers forming clumps in a string.
    
    Args:
        genome (string): The text to search through for clumps.
        k (int): Length of the k-mers to search genome for.
        L (int): Length of window within genome to search for clumps.
        t (int): Minimum number of occurrences of a k-mer within any L window
            of genome for it to count as a clump.
        
    Returns:
        list[strings]: All distinct k-mers occurring at least t times
            within some L window of genome.
    '''
    allKs = []
    kmerDict = dict()
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if kmer not in kmerDict:
            kmerDict[kmer] = [(i, i + k)]
        else:
            kmerDict[kmer].append((i, i + k))
    
    for kmer in kmerDict:
        if len(kmerDict[kmer]) >= t:
            if checkWindow(kmerDict[kmer], L, t, len(genome)):
                allKs.append(kmer)
    return allKs

In [24]:
%%time
# test/run clumpFind function

clumpFind("CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA", 5, 50, 4)
# with open("./Data/E_coli.txt") as inFile:
#     data = inFile.readlines()

# clumpFind(data[0].strip(), 9, 500, 3)

Wall time: 0 ns


['CGACA', 'GAAGA']
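
As a sanity check on the result above, the clump definition can also be applied literally: slide a window of length L along the genome and count the k-mers inside each window. The sketch below does exactly that. It is a slow brute-force cross-check under stated assumptions, not the method used above: `clump_find_bruteforce` is a hypothetical name, and treating a genome shorter than L as a single window is an added assumption.

```python
from collections import Counter

def clump_find_bruteforce(genome, k, L, t):
    """Return the set of k-mers occurring at least t times in some window of length L."""
    clumps = set()
    # If the genome is shorter than L, treat the whole genome as a single window.
    for start in range(max(1, len(genome) - L + 1)):
        window = genome[start:start + L]
        counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
        clumps.update(kmer for kmer, n in counts.items() if n >= t)
    return clumps

genome = "CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA"
found = sorted(clump_find_bruteforce(genome, 5, 50, 4))
print(found)
```

Recounting every window costs roughly len(genome) × L operations, so this is only practical for small inputs; the position-list approach above scales much better to full genomes.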

To speed up the FrequentWords algorithm we can construct a **frequency array**: an array of length $4^k$ whose i-th element is the number of times the i-th k-mer (in lexicographic order) appears in Text. To build the frequency array and also read useful information back out of it, we need to be able to convert a pattern to a number and vice versa.

Example:
FrequencyArray(AAGCAAAGGTGGG, 2)
<img src="./img/frequency_array.png" width="650">
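
The index of a k-mer in lexicographic order is simply its value as a base-4 number with A=0, C=1, G=2, T=3. For example, GT maps to 2·4 + 3 = 11, which is the slot of the frequency array holding the count for GT. A minimal iterative sketch of both directions, using illustrative names (`pattern_to_number`, `number_to_pattern`) that are assumptions rather than part of the notebook:

```python
SYMBOLS = "ACGT"  # a symbol's position in this string is its base-4 digit

def pattern_to_number(pattern):
    """Read the pattern left to right as a base-4 number."""
    number = 0
    for symbol in pattern:
        number = 4 * number + SYMBOLS.index(symbol)
    return number

def number_to_pattern(index, k):
    """Peel off base-4 digits from the right, then restore left-to-right order."""
    symbols = []
    for _ in range(k):
        index, digit = divmod(index, 4)
        symbols.append(SYMBOLS[digit])
    return "".join(reversed(symbols))

print(pattern_to_number("GT"))
print(number_to_pattern(11, 2))
```

Note that k is needed when decoding: 11 decodes to GT for k = 2 but to AGT for k = 3, because leading A's are leading zeros.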

In [1]:
def symToNum(symbol):
    '''Convert symbol (nucleotide) to integer.'''
    symMap = {'A': 0,
              'C': 1,
              'G': 2,
              'T': 3}
    return symMap[symbol]

def numToSym(num):
    '''Convert integer to corresponding symbol (nucleotide)'''
    numMap = {0: 'A',
              1: 'C',
              2: 'G',
              3: 'T'}
    return numMap[num]

def numToPat(index, k):
    '''
    Convert a number into its representative pattern.
    
    Args:
        index (int): Pattern encoded in an integer. 
        k (int): length of target pattern
        
    Returns:
        string: The k-mer of length k encoded by index.
    '''
    if k == 1:
        return numToSym(index)
    pfixIndex = index // 4  # integer division; int(index / 4) loses precision for very large indices
    r = index % 4
    sym = numToSym(r)
    pfixPattern = numToPat(pfixIndex, k - 1)
    return pfixPattern + sym

def patToNum(pattern):
    '''
    Convert a pattern to an encoded integer.
    
    Args:
        pattern (string): DNA string.
        
    Returns:
        int: Integer conversion of pattern.
    '''
    if not pattern:
        return 0
    symbol = pattern[-1]
    prefix = pattern[:-1]
    return 4 * patToNum(prefix) + symToNum(symbol)

def computeFreq(text, k):
    '''
    For a given sequence generate the corresponding k-mer frequency array.
    
    Args:
        text (string): DNA string.
        k (int): target k-mer length to compute frequency for.
        
    Returns:
        list[int]: Corresponding FrequencyArray of text.
    '''
    freq = [0] * (4 ** k)
    for i in range(len(text) - k + 1):
        pat = text[i:i + k]
        j = patToNum(pat)
        freq[j] += 1
    return freq

In [6]:
# test/run computeFreq function

print(*computeFreq("AAGCAAAGGTGGG", 2))

# with open("./Data/dataset_2994_5.txt") as inFile:
#     data = inFile.readlines()

# print(*computeFreq(data[0].strip(), int(data[1].strip())), sep='')

3 0 2 0 1 0 0 0 0 1 3 1 0 0 1 0
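
With the frequency array in hand, the most frequent k-mers are the ones whose slots hold the maximum count. The sketch below puts the pieces together in one self-contained function; `faster_frequent_words` is a hypothetical name, and the encoding and decoding are inlined here rather than reusing the helpers above.

```python
def faster_frequent_words(text, k):
    """Find the most frequent k-mer(s) in text via a frequency array of size 4**k."""
    symbols = "ACGT"
    freq = [0] * (4 ** k)
    for i in range(len(text) - k + 1):
        index = 0
        for symbol in text[i:i + k]:  # encode the k-mer as a base-4 number
            index = 4 * index + symbols.index(symbol)
        freq[index] += 1

    max_count = max(freq)
    if max_count == 0:  # text shorter than k: no k-mers were counted
        return []

    result = []
    for index, count in enumerate(freq):
        if count == max_count:
            kmer = ""
            for _ in range(k):  # decode the index back into a k-mer
                index, digit = divmod(index, 4)
                kmer = symbols[digit] + kmer
            result.append(kmer)
    return result

top = faster_frequent_words("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4)
print(top)
```

The results come back in lexicographic order because that is the order of the array slots. The array has $4^k$ entries regardless of the text length, so this approach is only sensible for small k.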
