# Finding Hidden Messages in DNA
## Week 1

* Within a genome, replication begins at the replication origin (ori) and is carried out by DNA polymerases.
* Replication initiation is mediated by the DnaA protein, which binds to a short segment known as a DnaA box.

    * How might we detect DnaA boxes in the ori regions?
    * We should expect the ori region to be enriched for the DnaA box sequence as well as its reverse complement
    * Start by identifying enriched k-mers within a region
    * A k-mer is a string of length k; the number of times a k-mer Pattern occurs in Text (overlaps included) is Count(Text, Pattern)
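To make the k-mer and count definitions concrete, here is a minimal sketch on a toy string. The helper names `kmers` and `count_overlapping` are illustrative only and are not part of the course code. It also shows why Python's built-in `str.count` is not a drop-in replacement: it skips overlapping matches.

```python
def kmers(text, k):
    """Return every length-k substring of text, in order, overlaps included."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlapping matches."""
    return sum(1 for kmer in kmers(text, len(pattern)) if kmer == pattern)

example = "CGATATATCCATAG"
print(kmers("CGATA", 3))                  # ['CGA', 'GAT', 'ATA']
print(count_overlapping(example, "ATA"))  # 3 (positions 2, 4 and 10)
print(example.count("ATA"))               # 2, because str.count ignores overlaps
```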

In [1]:
# Pattern Count Function Count(Text, Pattern)
# Input: Strings Text(DNA Sequence) and Pattern(k-mer we are searching)
# Output: Integer count of the number of times the Pattern appears in Text

def PatternCount(text, pattern):
    count = 0
    for i in range(len(text) - len(pattern) + 1):
        if text[i:i + len(pattern)] == pattern:
            count += 1
    return count

with open('./Data/dataset_2_7.txt') as inFile:
    data = inFile.readlines()
text = data[0].strip()
pattern = data[1].strip()

print(f"Sequence: {text}\nPattern: {pattern}\nCount: {PatternCount(text, pattern)}")

Sequence: GCTCGGCGCCTAGGGAGCTCGGCCCAAGCTCGGCGGCTCGGCAGCTCGGCTTGCTCGGCTGTTGGGCTCGGCGCATTGTGCTCGGCAGCTCGGCGCTCGGCACTTAACATGGGGGCTCGGCCACTCGCTCGGCCAGCATAGCTCGGCTCGCCGCTCGGCAATGCTCGGCTACGACGCTCGGCGCTCGGCTGGCCGCTCGGCTAGGCTCGGCAGCTCGGCACTATCGCTCGGCAGCTTGGCGCTCGGCCTCTGATCCCGCTCGGCTGCTCGGCGCAGGCTCGGCCGCTCGGCGGCTCGGCTACAAGGGCTCGGCGGCTCGGCGCACGCTCGGCGGCTCGGCGCTCGGCCGCATTTCTGCGGGACGCTCGGCATTCGCTCGGCGCTCGGCGCTCGGCGTCGCTCGGCCGCGCTCGGCTCTTGCAGGCTCGGCCACCGCTCGGCGCTCGGCGACGCTCGGCCGCTCGGCCTCGCTCGGCAGGCTCGGCCGAGGCTCGGCCGCTCGGCGGCTCGGCTAGCTAGCTCGGCACTGGAGGCGCGGGTGCTCGGCCTGGCTCGGCAAGGCTCGGCGAGCTCGGCGGGCTCGGCGCTCGGCGGCTCGGCCGCTCGGCGCTCGGCCGAAGGCTCGGCGCGCTCGGCGTGCGCTCGGCGCTCGGCTAGGCTCGGCTGCTCGGCCGCTCGGCGCCTAAAACGAGCTCGGCCGGCTCGGCAGGCTCGGCGCTCGGCGCTCGGCGACCCGCTCGGCACAGGTGAAGTCTGCTCGGCCCGCTCGGCCCATTGCCCCCGCACGCTCGGCGCTCGGCCCCATTTAGAACTGCTCGGCTGTGACAGTGCTCGGCGGCTCGGCCGGGCTCGGCGCTCGGCGCTCGGCCGCCCCCACTGGCTCGGCGTGGCTCGGCGAATAGCTCGGCCGCTCGGCACGCTCGGCCGCTCGGCTGCTCGGCTGCTCGGCCCAGCTCGGCCATGGCTCGGCTTGCTCGGCAAGCTCGGCCG

Now that k-mers can be identified and counted, we want to find which k-mers appear most frequently within a given sequence.
* Pattern is a most frequent k-mer in Text if it has the highest Count(Text, Pattern) among all k-mers
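One compact way to express this definition is with `collections.Counter` from the standard library. The sketch below is an alternative formulation, not the notebook's own function. The name `most_frequent_kmers` is hypothetical, and ties are returned in sorted order purely for readability.

```python
from collections import Counter

def most_frequent_kmers(text, k):
    """Return all k-mers that share the highest count in text, sorted."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:  # text shorter than k: there are no k-mers at all
        return []
    top = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == top)

print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
# ['CATG', 'GCAT'] (each appears 3 times)
```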

In [21]:
%%time
# Frequent Words Function FrequentWords(Text, k)
# Input: A String Text(DNA Sequence) and an integer k(length of kmer we are looking at)
# Output: List of the most frequent k-mers (or the full count dictionary if ret_max is False)

def FrequentWords(text, k, ret_max=True):
    patterns = dict()  # Maps each k-mer to the number of times it occurs

    # Count every k-mer in a single pass over the text
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        patterns[kmer] = patterns.get(kmer, 0) + 1

    if not ret_max:
        return patterns

    # Collect every k-mer whose count equals the maximum. Doing this after
    # counting avoids stale or duplicate entries in the result and also
    # handles the case where every k-mer occurs exactly once.
    maxCount = max(patterns.values(), default=0)
    return [kmer for kmer in patterns if patterns[kmer] == maxCount]

        
#print(*FrequentWords("CGCCTAAATAGCCTCGCGGAGCCTTATGTCATACTCGTCCT", 3))
with open("./Data/dataset_2_10.txt") as inFile:
    data = inFile.readlines()
    
text = data[0].strip()
k = int(data[1].strip())
    
print(f"K-Mer Length: {k}\nMost frequent k-mers in text: {FrequentWords(text, k)}\n")

K-Mer Length: 13
Most frequent k-mers in text: ['TCTCAACTATCCA', 'TCCATTCGGTCCA']

Wall time: 1.02 ms


A DnaA box can sit on either strand, so a k-mer that marks DnaA boxes in the ori region should appear frequently along with its reverse complement.
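As a rough illustration of counting a k-mer together with its reverse complement, the sketch below uses `str.maketrans` and `str.translate`. The function names are hypothetical, and the toy text is simply the two 9-mers `ATGATCAAG` and `CTTGATCAT` (reverse complements of each other) joined end to end.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(pattern):
    """Complement each base, then reverse the string."""
    return pattern.translate(COMPLEMENT)[::-1]

def count_with_reverse_complement(text, pattern):
    """Count overlapping occurrences of pattern or its reverse complement."""
    # A set avoids double counting when pattern equals its own reverse complement
    targets = {pattern, reverse_complement(pattern)}
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] in targets)

print(reverse_complement("ATGATCAAG"))  # CTTGATCAT
print(count_with_reverse_complement("ATGATCAAGCTTGATCAT", "ATGATCAAG"))  # 2
```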

In [23]:
# Reverse Complement revComp(Sequence)
# Input: String Pattern(DNA sequence)
# Output: String reverse complement of pattern

def revComp(pattern):
    # Maps each nucleotide to its complementary base
    complement = {'A': 'T',
                  'T': 'A',
                  'G': 'C',
                  'C': 'G'}
    return ''.join(complement[x] for x in pattern[::-1])

print(revComp("AAAACCCGGT"))
    

ACCGGGTTTT


Given a pattern, identify every position in the genome at which it appears.
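A short sketch of one way to do this with the built-in `str.find`, restarting the search one character past each hit so that overlapping matches are kept. The name `find_positions` is illustrative, and positions are 0-based.

```python
def find_positions(pattern, genome):
    """Return the 0-based start index of every (possibly overlapping) match."""
    positions = []
    start = genome.find(pattern)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping matches are not skipped
        start = genome.find(pattern, start + 1)
    return positions

print(find_positions("ATAT", "GATATATGCATATACTT"))  # [1, 3, 9]
```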

In [34]:
# Pattern Match patternMatch(Pattern, Genome)
# Input: Strings Pattern and Genome
# Output: List of integers representing starting positions of Pattern in Genome

def patternMatch(pattern, genome):
    res = []
    for i in range(len(genome) - len(pattern) + 1):
        if genome[i:i + len(pattern)] == pattern:
            res.append(i)
    return res

# with open("./Data/dataset_3_5.txt") as inFile:
#     data = inFile.readlines()
    
# pattern = data[0].strip()
# genome = data[1].strip()
# print(*patternMatch(pattern, genome))

with open("./Data/Vibrio_cholerae.txt") as inFile:
    data = inFile.readlines()

pattern = "CTTGATCAT"
genome = data[0].strip()
print(f"Pattern: {pattern}\nPattern Start Positions in Genome:\n{*patternMatch(pattern, genome),}")

Pattern: CTTGATCAT
Pattern Start Positions in Genome:
(60039, 98409, 129189, 152283, 152354, 152411, 163207, 197028, 200160, 357976, 376771, 392723, 532935, 600085, 622755, 1065555)


Now imagine we have a new DNA sequence whose ori location we do not know. How can we find sequences that are possible DnaA boxes?<br/>
Slide a window of length L along the genome and check whether any k-mer is enriched within a window.<br/>
This could help identify possible DnaA boxes (frequent k-mers) within possible ori regions (windows).
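The window idea can be sketched directly: count the k-mers in the first window, then slide one position at a time, removing the k-mer that leaves and adding the one that enters. This is one possible formulation under assumed names (`find_clumps` is hypothetical) and differs from the coordinate-based approach in the cell that follows.

```python
from collections import Counter

def find_clumps(genome, k, L, t):
    """Return the set of k-mers occurring at least t times in some window of length L."""
    L = min(L, len(genome))  # assumption: a short genome is treated as one window
    # Count all k-mers in the first window, genome[0:L]
    counts = Counter(genome[i:i + k] for i in range(L - k + 1))
    clumps = {kmer for kmer, n in counts.items() if n >= t}
    for start in range(1, len(genome) - L + 1):
        # The k-mer starting at start - 1 leaves the window
        counts[genome[start - 1:start - 1 + k]] -= 1
        # The k-mer ending at start + L enters the window
        entering = genome[start + L - k:start + L]
        counts[entering] += 1
        # Only the entering k-mer can newly reach the threshold
        if counts[entering] >= t:
            clumps.add(entering)
    return clumps

sample = "CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA"
print(sorted(find_clumps(sample, 5, 50, 4)))  # ['CGACA', 'GAAGA']
```

Because each slide changes only two counts, the work per window stays constant instead of recounting all k-mers in every window.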

In [6]:
%%time
# Clump Finding Problem clumpFind(Genome, k, L, t)
# Input: String Genome, Ints k(kmer length), L(window size), and t(min number of times a kmer appears)
# Output: A list of all distinct kmers that appear at least t times in any given L window


# Cycle through the coordinates of one kmer and check if at least t occurrences fit in any window of length L
def checkWindow(cords, L, t):
    for i in range(len(cords)):
        tmp = 1
        start = cords[i][0]
        j = i + 1  # Separate index so the loop variable i is not modified
        # An occurrence lies inside the window if its stop coordinate is at most start + L
        while j < len(cords) and cords[j][1] <= start + L:
            tmp += 1
            j += 1
        if tmp >= t:
            return True
    return False

def clumpFind(genome, k, L, t):
    allKs = [] # List of all kmers that appear at least t times in any given L window
    kmerDict = dict()
    # Build dictionary of all k-mers in the genome, saving the start and stop coordinates for each
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        if kmer not in kmerDict:
            kmerDict[kmer] = [(i, i + k)]
        else:
            kmerDict[kmer].append((i, i + k))

    # Only kmers with at least t occurrences genome-wide can form a clump
    for kmer in kmerDict:
        if len(kmerDict[kmer]) >= t and checkWindow(kmerDict[kmer], L, t):
            allKs.append(kmer)
    return allKs

#clumpFind("CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA", 5, 50, 4)
with open("./Data/E_coli.txt") as inFile:
    data = inFile.readlines()

print(len(clumpFind(data[0].strip(), 9, 500, 3)))

1904
Wall time: 5.71 s


In [9]:
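# Map a nucleotide to its base-4 digit (A=0, C=1, G=2, T=3)
def symToNum(symbol):
    symMap = {'A': 0,
              'C': 1,
              'G': 2,
              'T': 3}
    return symMap[symbol]

# Map a base-4 digit back to its nucleotide
def numToSym(num):
    numMap = {0: 'A',
              1: 'C',
              2: 'G',
              3: 'T'}
    return numMap[num]

# Convert an index in [0, 4**k) back to its k-mer (inverse of patToNum)
def numToPat(index, k):
    if k == 1:
        return numToSym(index)
    pfixIndex = index // 4  # Integer division avoids float rounding for large indices
    r = index % 4
    sym = numToSym(r)
    pfixPattern = numToPat(pfixIndex, k - 1)
    return pfixPattern + sym

# Encode a k-mer as a base-4 number, treating the last symbol as the lowest digit
def patToNum(pattern):
    if not pattern:
        return 0
    symbol = pattern[-1]
    prefix = pattern[:-1]
    return 4 * patToNum(prefix) + symToNum(symbol)

# Frequency array: freq[j] is the number of occurrences of the k-mer whose index is j
def computeFreq(text, k):
    freq = [0] * (4 ** k)
    for i in range(len(text) - k + 1):
        pat = text[i:i + k]
        j = patToNum(pat)
        freq[j] += 1
    return freq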

# with open("./Data/dataset_2994_5.txt") as inFile:
#     data = inFile.readlines()

# print(*computeFreq(data[0].strip(), int(data[1].strip())), sep='')

0000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000110000000010000000000000010000000000000000000000000000000000100010000000000000100000000000000000100100000000000000000100000000010000000001000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000010000000010000001010000000000000000010000000000000000001000000000010000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000010000000100000000000000000000000010000200000000000000020000000000000101000000100000000000000000000000000000000000100100000000000000000000000000000000000000000000000100000000010001010010000000100001000000000000000000000000000001000000000100000000000000010000000000000100000000000000000000000000000000000000000100000000000000000000000000000000000000001

0000000000100000000000001001010000100000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000001000000000000000000100000000000000010000000000000000000000000000000010000000000000000000000000000000000000000000000000010000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000200000000000000100010000000000000100100000000000000000000000000000000000000000001000100000000000000000000000000000000000000000000000000000000000000001000000000000000000000010000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000010100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000000000000000010000000010000000000100000000000000000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000000000