# K-MERS IN REPLICATION ORIGIN

Goal: find a potentially conserved domain in the replication origin that may be important for its function

Approach: find k-mers that appear more frequently than expected

In [9]:
dna = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
k = 5

# Build a dictionary of k-mers and their counts
# Input: DNA string + k-value
# Output: Dictionary containing all k-mers and their respective counts
def frequency_dictionary(dna, k):
    kmer_counter = {}
    # Slide a window of width k across the sequence, one position at a time
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i + k]
        # Start at 0 for a k-mer not seen before, then add one
        kmer_counter[kmer] = kmer_counter.get(kmer, 0) + 1
    return kmer_counter
print(frequency_dictionary(dna, k))

# Use the frequency_dictionary() function to identify the k-mer(s) that appear most frequently
# Input: DNA string + k-value
# Output: String listing the most frequent k-mer sequence(s) and their count
def frequent_kmers (dna,k):
    all_kmers = frequency_dictionary(dna,k)
    max_count = max(all_kmers.values())     # Identify the highest frequency
    kmers = []

    for kmer, kmer_count in all_kmers.items():
        if kmer_count == max_count:
            kmers.append(kmer)
    return ('K-mers: ' + str(kmers) + '\t' + 'Count: ' + str(max_count))
print (frequent_kmers(dna,k))

{'CGGAC': 1, 'GGACT': 1, 'GACTC': 2, 'ACTCG': 2, 'CTCGA': 2, 'TCGAC': 2, 'CGACA': 4, 'GACAG': 2, 'ACAGA': 2, 'CAGAT': 1, 'AGATG': 1, 'GATGT': 1, 'ATGTG': 2, 'TGTGA': 2, 'GTGAA': 3, 'TGAAG': 3, 'GAAGA': 4, 'AAGAA': 1, 'AGAAC': 1, 'GAACG': 1, 'AACGA': 1, 'ACGAC': 2, 'GACAA': 1, 'ACAAT': 1, 'CAATG': 1, 'AATGT': 1, 'AAGAC': 1, 'AGACT': 1, 'GACAC': 1, 'ACACG': 1, 'CACGA': 1, 'CAGAG': 1, 'AGAGT': 1, 'GAGTG': 1, 'AGTGA': 1, 'AAGAG': 2, 'AGAGA': 1, 'GAGAA': 1, 'AGAAG': 1, 'AGAGG': 1, 'GAGGA': 1, 'AGGAA': 1, 'GGAAA': 1, 'GAAAC': 1, 'AAACA': 1, 'AACAT': 1, 'ACATT': 1, 'CATTG': 1, 'ATTGT': 1, 'TTGTA': 1, 'TGTAA': 1}
K-mers: ['CGACA', 'GAAGA']	Count: 4
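
As a cross-check on the cell above, the same counting can be sketched with `collections.Counter` from the standard library. This is only an illustrative alternative, not the notebook's own method: the function name `most_frequent_kmers` is hypothetical, and it returns a tuple rather than a formatted string so the result is easier to reuse.

```python
from collections import Counter

def most_frequent_kmers(dna, k):
    """Return (sorted list of the most frequent k-mers, their count).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Count every overlapping window of width k
    counts = Counter(dna[i:i + k] for i in range(len(dna) - k + 1))
    if not counts:
        # Sequence shorter than k: there are no k-mers to report
        return [], 0
    max_count = max(counts.values())
    winners = sorted(kmer for kmer, n in counts.items() if n == max_count)
    return winners, max_count

dna = 'CGGACTCGACAGATGTGAAGAACGACAATGTGAAGACTCGACACGACAGAGTGAAGAGAAGAGGAAACATTGTAA'
print(most_frequent_kmers(dna, 5))
```

For the 5-mer example this should agree with the output shown above (`CGACA` and `GAAGA`, each counted 4 times).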


Applying the above functions to the replication origin of Vibrio cholerae

In [16]:
vibrio_ori = 'atcaatgatcaacgtaagcttctaagcatgatcaaggtgctcacacagtttatccacaacctgagtggatgacatcaagataggtcgttgtatctccttcctctcgtactctcatgaccacggaaagatgatcaagagaggatgatttcttggccatatcgcaatgaatacttgtgacttgtgcttccaattgacatcttcagcgccatattgcgctggccaaggtgacggagcgggattacgaaagcatgatcatggctgttgttctgtttatcttgttttgactgagacttgttaggatagacggtttttcatcactgactagccaaagccttactctgcctgacatcgaccgtaaattgataatgaatttacatgcttccgcgacgatttacctcttgatcatcgatccgattgaagatcttcaattgttaattctcttgcctcgactcatagccatgatgagctcttgatcatgtttccttaaccctctattttttacggaagaatgatcaagctgctgctcttgatcatcgtttc'
k = 8
length = len(vibrio_ori)
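# Estimate how often a given k-mer is expected to occur by chance,
# assuming the four nucleotides are equally likely and independent
# (there are 4**k possible k-mers). The dna argument is currently unused.
def Calculate_exp_kmer (dna,k):
    chance_appearance = pow(4, k)
    return ('We expect one ' +str(k)+'-mer every ' + str(chance_appearance) + ' nucleotides ' )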

print(Calculate_exp_kmer(vibrio_ori,k))
print ('Length of DNA sequence is: ' + str(length))
print(frequent_kmers(vibrio_ori,k).upper())

We expect one 8-mer every 65536 nucleotides 
Length of DNA sequence is: 540
K-MERS: ['ATGATCAA']	COUNT: 4


An 8-mer appearing 4 times within only 540 nucleotides is surprising, since by chance a given 8-mer is expected only about once every 65536 nucleotides. The domain ATGATCAA is therefore worth investigating for a possible role in the initiation of replication.
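
To put a rough number on "surprising", the sketch below estimates how many times one particular 8-mer would be expected in a 540-nucleotide sequence. It assumes, as a simplification, that the four nucleotides are equally likely and independent, which real genomes only approximate. The function name `expected_kmer_count` is hypothetical and not part of the notebook.

```python
def expected_kmer_count(seq_length, k, alphabet_size=4):
    """Expected occurrences of one specific k-mer in a random sequence.

    Assumes uniform, independent nucleotides (a simplification).
    """
    # Number of overlapping windows of width k in the sequence
    windows = seq_length - k + 1
    if windows <= 0:
        return 0.0
    # Each window matches the chosen k-mer with probability 1 / alphabet_size**k
    return windows / alphabet_size ** k

expected = expected_kmer_count(540, 8)
observed = 4  # count of ATGATCAA reported above
print('Expected count of a given 8-mer: ' + format(expected, '.4f'))
print('Observed / expected: ' + format(observed / expected, '.0f'))
```

Under these assumptions the expected count is well below 1 (roughly 0.008), so 4 occurrences is several hundred times more than chance alone would suggest. Note that this applies to a single pre-specified k-mer; the most frequent k-mer among all those present will tend to look somewhat more enriched than it really is.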