# Finding Hidden Messages in DNA (Bioinformatics I)

[Link to the course](https://www.coursera.org/learn/dna-analysis/home/week/1)

# Week 1

Learnings:
 - bacteria have circular DNA
 - oriC: the specific region of the DNA where replication begins (in E. coli, oriC is 245 nucleotides long)
 - DnaA: the protein that initiates DNA replication
 - DnaA box: a short segment within oriC where DnaA binds. oriC usually contains several DnaA boxes
 - DnaA boxes are usually 9 nucleotides long in bacteria
 - 9-mers that are repeated much more often than expected by chance are a good clue for locating oriC in a large genome
 - the E. coli genome is about 4.6 million nucleotides long
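
To make "much more often than expected by chance" concrete, here is a rough back-of-the-envelope sketch. It assumes a uniformly random sequence (each nucleotide with probability 1/4, positions independent), which real genomes only approximate. The helper name `expected_kmer_count` is mine, not from the course.

```python
def expected_kmer_count(seq_length, k, alphabet_size=4):
    """Expected occurrences of one fixed k-mer in a uniformly random sequence."""
    n_positions = seq_length - k + 1  # possible start positions of a k-mer
    return n_positions / alphabet_size ** k

# One fixed 9-mer in a 500-nucleotide window
print(expected_kmer_count(500, 9))
# One fixed 9-mer in a genome of roughly E. coli size
print(expected_kmer_count(4_600_000, 9))
```

Under this assumption a given 9-mer is expected about 0.002 times in a 500-nucleotide window, so seeing the same 9-mer three or more times in such a window is far from typical.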

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Chapter-1.2,-PatternCount" data-toc-modified-id="Chapter-1.2,-PatternCount-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Chapter 1.2, PatternCount</a></span></li><li><span><a href="#Chapter-1.2,-BetterFrequentWords" data-toc-modified-id="Chapter-1.2,-BetterFrequentWords-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Chapter 1.2, BetterFrequentWords</a></span></li><li><span><a href="#Chapter-1.3,-ReverseComplement" data-toc-modified-id="Chapter-1.3,-ReverseComplement-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Chapter 1.3, ReverseComplement</a></span></li><li><span><a href="#Chapter-1.3,-PatternMatching" data-toc-modified-id="Chapter-1.3,-PatternMatching-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Chapter 1.3, PatternMatching</a></span></li><li><span><a href="#Vibrio-cholerae" data-toc-modified-id="Vibrio-cholerae-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Vibrio cholerae</a></span></li><li><span><a href="#Chapter-1.4,-ClumpFinding" data-toc-modified-id="Chapter-1.4,-ClumpFinding-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Chapter 1.4, ClumpFinding</a></span></li><li><span><a href="#E.-coli" data-toc-modified-id="E.-coli-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>E. coli</a></span></li></ul></div>

In [1]:
from utils import read_inputs, generate_random_dna_sequence

### Chapter 1.2, PatternCount

Code Challenge: Implement PatternCount (reproduced below).  
> Input: Strings Text and Pattern.  
> Output: Count(Text, Pattern).

In [2]:
seq, pat = read_inputs("dataset_30272_6.txt")

In [3]:
seq

'ATATTGGGATATTGGATATTGGTACTACCGCTATATTGGCATATTGGGATATTGGATAAATATTGGGATATTGGAATATTGGGATATTGGGTAATATTGGAAGCCGGAATATTGGATATTGGATATTGGTTATATTGGAATATTGGAATATTGGTGATATTGGGCCAACTTAGATATTGGGGATATTGGGAACTATATTGGTAGATATTGGATGGTATATTGGAAGATAATATTGGCTAGGGAATATTGGGATATTGGACTCATATTGGTTCCAGATATTGGCTTATATTGGCGGATATTGGCGATATTGGGATATTGGTGGTATATTGGATATTGGAGAATATTGGAATATTGGTCCCGGTATATTGGGTATGAATATTGGATATTGGTATATTGGGAATATTGGCTATATTGGCGGCTAAATATTGGATATTGGCATATTGGATATTGGGTATATTGGTCATATTGGCATATTGGATATTGGCATATTGGATATTGGCGCCGCATATTGGCATATTGGATATTGGCAATATTGGTCTTATATTGGTTGGAATATTGGGCCCAATATTGGTTATATTGGAGCCATATTGGCCAATATTGGATATTGGATATTGGATATTGGATATTGGCAATATTGGGAGCATATTGGCCGATTGAATGTATATTGGATATTGGATATTGGATATTGGCTACAAGGAATATTGGTATATTGGTATATTGGATATTGGCAGTATATTGGCTCAAAATATTGGTACATATTGGTATATTGGAGAATATAATATTGGCATATTGGATATTGGCATAGACCGTAATATTGGATATTGGATATTGGAAGTATATTGGACGATATTGGATATTGGATATTGGATATTGGGAGTGGATATTGGCGAATATTGGATATTGGCGGGTATATTGGAGGTATATTGG'

In [5]:
pat

'ATATTGGAT'

In [6]:
pat = "ATATTGGAT"

Counting every k-mer is overkill for this problem, but I had already written a k-mer counter, so I reuse it here.

In [136]:
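def get_k_mers_counter(seq, k):
    # Map each k-mer of seq to its number of (possibly overlapping) occurrences
    counter = dict()

    # There are len(seq) - k + 1 start positions for a k-mer
    for i in range(len(seq) - k + 1):
        subseq = seq[i:i + k]
        counter[subseq] = counter.get(subseq, 0) + 1

    return counter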

In [10]:
len(pat)

9

In [16]:
{k: v for k, v in get_k_mers_counter(seq, 9).items() if v > 10}

{'GGATATTGG': 34,
 'GATATTGGA': 17,
 'ATATTGGAT': 27,
 'TATTGGATA': 26,
 'ATTGGATAT': 25,
 'TTGGATATT': 25,
 'TGGATATTG': 26,
 'GAATATTGG': 12,
 'GTATATTGG': 13,
 'GATATTGGC': 12}

In [17]:
get_k_mers_counter(seq, 9)[pat]

27
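
For comparison, a direct PatternCount only needs to slide over the text once and compare each window with the pattern. This is a minimal sketch; the name `pattern_count` is illustrative and not part of my `utils` module.

```python
def pattern_count(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    k = len(pattern)
    count = 0
    for i in range(len(text) - k + 1):
        if text[i:i + k] == pattern:
            count += 1
    return count

print(pattern_count("GCGCG", "GCG"))  # 2, because the two matches overlap
print("GCGCG".count("GCG"))           # 1, str.count skips overlapping matches
```

The second line shows why the built-in `str.count` is not a drop-in replacement here: it only counts non-overlapping occurrences.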

### Chapter 1.2, BetterFrequentWords

Code Challenge: Solve the Frequent Words Problem.  
> Input: A string Text and an integer k.  
> Output: All most frequent k-mers in Text.

In [18]:
seq, k = read_inputs("dataset_30272_13.txt")

In [19]:
seq

'TGTTCTACTAAAATATGATAAAATATGATGCAAACACAGGAAAGGGTGTTCTACTGGAAAGGGAAAATATGATTCAGCGTAAAAATATGATTGTTCTACTTGTTCTACTGGAAAGGGTGTTCTACTTGTTCTACTTCAGCGTATGTTCTACTTGTTCTACTTGTTCTACTGGAAAGGGTGTTCTACTAAAATATGATGCAAACACAGCAAACACAAAAATATGATAAAATATGATGCAAACACATCAGCGTAGGAAAGGGGCAAACACATGTTCTACTAAAATATGATGGAAAGGGGCAAACACATGTTCTACTTCAGCGTATCAGCGTATGTTCTACTGGAAAGGGGGAAAGGGGCAAACACAGCAAACACATCAGCGTAAAAATATGATTGTTCTACTTCAGCGTATCAGCGTAGGAAAGGGTGTTCTACTTGTTCTACTGCAAACACATGTTCTACTTCAGCGTAGCAAACACATCAGCGTATCAGCGTAAAAATATGATTGTTCTACTGGAAAGGGAAAATATGATTGTTCTACTGCAAACACATCAGCGTAGGAAAGGGTGTTCTACTAAAATATGATAAAATATGATGCAAACACAAAAATATGATTCAGCGTATGTTCTACTAAAATATGATAAAATATGATTGTTCTACTTGTTCTACTGCAAACACATCAGCGTAGCAAACACAGGAAAGGGGGAAAGGGTCAGCGTATGTTCTACTTGTTCTACTGCAAACACAGGAAAGGGGGAAAGGGAAAATATGATGGAAAGGGTGTTCTACTGGAAAGGGGGAAAGGGGCAAACACATCAGCGTAGCAAACACAGGAAAGGGTGTTCTACTAAAATATGATAAAATATGATTCAGCGTATCAGCGTAAAAATATGATGGAAAGGGAAAATATGATGCAAACACA'

In [21]:
k = int(k)

In [22]:
k

11

In [25]:
def find_most_frequent_k_mers(seq, k):
    counter = get_k_mers_counter(seq, k)

    max_count = max(counter.values())

    # Use `kmer` rather than `k` to avoid shadowing the k-mer length
    most_frequent_k_mers = [kmer for kmer, count in counter.items() if count == max_count]

    return most_frequent_k_mers, max_count

In [26]:
find_most_frequent_k_mers(seq, k)

(['TAAAATATGAT'], 11)
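
The same idea can be written with `collections.Counter` from the standard library. This is a self-contained sketch with an illustrative name (`most_frequent_kmers`); it also returns an empty result when the text is shorter than k, instead of raising on `max()` of an empty collection.

```python
from collections import Counter

def most_frequent_kmers(text, k):
    """Return (sorted list of most frequent k-mers, their count)."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:  # text shorter than k
        return [], 0
    max_count = max(counts.values())
    return sorted(kmer for kmer, c in counts.items() if c == max_count), max_count

print(most_frequent_kmers("ACGTTGCATGTCGCATGATGCATGAGAGCT", 4))
# (['CATG', 'GCAT'], 3)
```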

### Chapter 1.3, ReverseComplement

Reverse Complement Problem: Find the reverse complement of a DNA string.  
> Input: A DNA string Pattern.  
> Output: Pattern_rc, the reverse complement of Pattern.

In [2]:
dna_complement = {
    "A": "T",
    "T": "A",
    "G": "C",
    "C": "G"
}

In [3]:
def reverse_complement(seq):
    # Complement each nucleotide, then reverse the result
    seq_complement = "".join(dna_complement[nuc] for nuc in seq)

    return seq_complement[::-1]
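
An alternative sketch uses `str.maketrans` and `str.translate`, which do the nucleotide substitution in a single pass. The names `COMPLEMENT_TABLE` and `reverse_complement_translate` are illustrative. Note that, unlike the dictionary lookup, `translate` leaves characters outside `ACGT` unchanged instead of raising a `KeyError`.

```python
# Translation table mapping each nucleotide to its complement
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def reverse_complement_translate(pattern):
    """Complement every nucleotide, then reverse the string."""
    return pattern.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate("AAAACCCGGT"))  # ACCGGGTTTT
```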

In [4]:
seq = read_inputs("dataset_30273_2.txt")

In [5]:
seq

'TCCCACACGGGCTAGTTGCGGCCAAGCCAGTGATGAGAGACGTATATAACCTGTCTACAACCTGGCTCCTACTGAGCTTGAAGTACTATTCATTCTGACTTTCATCCTATTTCGCAGGGGGACGCCAGGATGCAACTTTCACAGAACCGATTGGATGGCCGATTTCAGTGGAGACACGAGGTCCCGGGACTGGCGAGCTGCACGTCTGTTCGCAGCGTCACCGTAAGCTATGCTATCCTTCATGATGGATTAGATCCGTTCCCCTACTGATGATGCGTCGAGGATTACTCGGAATTGGACAAAGTCTAGCTCATTAGCATATGCTGTACGTACTGAGACTCTAGGGGACGCTAACCCAGAGGCCCGACCATCCGGTCTCAAAAAACTGGAAACCCCGTGTAGTGCCGAAACTGTTAAGATATAGTAACACCTGTGCCCGTTTTTGGTAGCGTTAGAAGACCTCGTCGACCTGAGTAACGACTCAAGATTGCACACGTCCGAGCGGTGGAAGCTCCACTTATGATACAAGTCGTGTTCGGAGAGGCGCTAGAACACGTTTCTCAATGCGCAATGACCACCTCCTATCAAGGCGCCTCTTGCAGTTGAAAGAGGTGTGCAGGTAGCTCGGTTTTTTGCACTATCGTCGTAGGCGAGCCAGGAGTCGAGGGTCTCTCGGGAGCAACGACGGTAATTAGAATACCGACACACATACAGGGAGTAGGTTCCTTTGTGTTGAGGGAAGACACTCAAGGCGTTTCAGTCATCTTCCCCCTCAAACCGCATTCACACAAGTCGCTCCTTGCTCAGATTACGTGCTCTACGTATAAAGCCAGCTCTTGCAACGATGCACAGACTACTCATGGTCGGCCCTGGGCATGTGTTGTCAACAGATGGGGAGACGATACGGCAGCCATGTTCTACCTGGTTGGGTATGCTCTACACTTTGACTTAAGGATAAACGGGTGCATCAATAGGGATAACCGCATCCGTTAGCCAGTC

In [6]:
reverse_complement(seq)

'GGGTACGCGTTGATAGGTATCTTACGTTAGTACCTCAATCACATGGTAGAAGGTATCCTGGTGATGGGAATCGACGTATGGTTGGGCCGGAAGCTCATATATTAAATGAATGCGACAGCAGGGATGTCTAGCCCAAGCCTGCACCTCTACAGCGCCGTGAGCTTAACACATCTAAACACTCTTTACCATTTAACCGTATAAATCATACAGCAGCACCTAGATGTCTTCCTACCTTGTTCTGACAAGCGGATGGGCAGAGCCGGCCGATTGCTGCATCACCGTTCAGTTGATTAGTATGGACTGCTATCAAAACACAAGGCCACATGTCAAAGGCCTAGCAATTGAGTTGTTTACCGATCAGCGTTGACAGGCTTAACGCTCGATGTAGCTCATTCAAAAGGCGCACTGTCTATGAGACCCATGGCAGAGCGGTAGTACTCGTTGCATAAGTGTAATCCGCACAGCATGTACGTACGAGAACTTTCGGTTCAAAGGCACGGGCCCGCCGGGAGGCCCATTCCACATTTATGTTAAGACCCTTTCAAGGGCGCTATATCTTAATACAACCACTCCGCCAATAAGGACGTAGCACTTAAGGAGTCCACATCTCTTAGCAGCTAGCGCGGGTGCTACCTTGATAGGCAGAAGTATATCATCACCAGGTAAGGCCATCGAATTCGATTTGTATGCTTTCGATTGCCGATATCTGCCACCCAACACATATTTCTTGTCCCGAATGTTAAGTCGTCCTCAATGGATCGTCTCTCCATAAGTATCACTTGAACAAAGATGTGGTCCTCTAGGATGTATATGCCGCGCGTGATCATTGCAATTATGGATCACCCCTCTTCTGCGCCCTAAGCATCGGTATCAGCGGCAGGTTACACCGGGTGTATATCTTGTACCTAACTAGCTGCTACTCCATATAAGAGCCACATCGATGGGTCTAAGAGGCGCCCTATGATCGACCCTGATCAGTCAAATTAGCCCTGAGCCAAC

### Chapter 1.3, PatternMatching

Code Challenge: Solve the Pattern Matching Problem.  
> Input: Two strings, Pattern and Genome.  
> Output: A collection of space-separated integers specifying all starting positions where Pattern appears as a substring of Genome.


In [120]:
def find_pattern_indices(seq, pat):
    k = len(pat)
    ids = []

    # Only start positions that leave room for a full k-mer can match
    for i in range(len(seq) - k + 1):
        subseq = seq[i:i + k]

        if subseq == pat:
            ids.append(i)

    return ids
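
A variant that leans on the built-in `str.find` is sketched below. The function name `find_pattern_indices_find` is illustrative. The important detail is to resume the search one position after each hit rather than after the whole match, otherwise overlapping occurrences are lost.

```python
def find_pattern_indices_find(genome, pattern):
    """Return all (possibly overlapping) start positions of pattern in genome."""
    positions = []
    start = genome.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one, not by len(pattern), so overlapping matches are kept
        start = genome.find(pattern, start + 1)
    return positions

print(find_pattern_indices_find("GATATATGCATATACTT", "ATAT"))  # [1, 3, 9]
```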

In [12]:
pat, seq = read_inputs("dataset_30273_5.txt")

In [15]:
pat

'AAGGCCCAA'

In [21]:
find_pattern_indices(seq, pat)

[67,
 82,
 143,
 150,
 157,
 178,
 228,
 374,
 381,
 424,
 496,
 533,
 561,
 601,
 639,
 711,
 729,
 736,
 782,
 798,
 816,
 843,
 917,
 967,
 974,
 1004,
 1011,
 1031,
 1088,
 1133,
 1140,
 1202,
 1231,
 1248,
 1264,
 1338,
 1356,
 1363,
 1375,
 1415,
 1430,
 1437,
 1493,
 1510,
 1561,
 1598,
 1618,
 1630,
 1647,
 1689,
 1707,
 1727,
 1735,
 1777,
 1784,
 1842,
 1967,
 2026,
 2033,
 2091,
 2107,
 2183,
 2203,
 2210,
 2240,
 2247,
 2265,
 2273,
 2280,
 2410,
 2455,
 2467,
 2501,
 2526,
 2639,
 2646,
 2654,
 2670,
 2780,
 2843,
 2868,
 2884,
 2917,
 2969,
 3006,
 3041,
 3048,
 3069,
 3122,
 3131,
 3160,
 3167,
 3174,
 3200,
 3216,
 3287,
 3324,
 3331,
 3377,
 3403,
 3410,
 3417,
 3434,
 3441,
 3448,
 3481,
 3497,
 3518,
 3534,
 3550,
 3562,
 3588,
 3603,
 3640,
 3796,
 3803,
 3827,
 3834,
 3851,
 3858,
 3865,
 3915,
 4016,
 4041,
 4048,
 4066,
 4084,
 4091,
 4236,
 4255,
 4293,
 4308,
 4359,
 4367,
 4374,
 4408,
 4432,
 4447,
 4454,
 4469,
 4590,
 4609,
 4616,
 4640,
 4647,
 4688,
 4703

In [22]:
" ".join([str(n) for n in _])

'67 82 143 150 157 178 228 374 381 424 496 533 561 601 639 711 729 736 782 798 816 843 917 967 974 1004 1011 1031 1088 1133 1140 1202 1231 1248 1264 1338 1356 1363 1375 1415 1430 1437 1493 1510 1561 1598 1618 1630 1647 1689 1707 1727 1735 1777 1784 1842 1967 2026 2033 2091 2107 2183 2203 2210 2240 2247 2265 2273 2280 2410 2455 2467 2501 2526 2639 2646 2654 2670 2780 2843 2868 2884 2917 2969 3006 3041 3048 3069 3122 3131 3160 3167 3174 3200 3216 3287 3324 3331 3377 3403 3410 3417 3434 3441 3448 3481 3497 3518 3534 3550 3562 3588 3603 3640 3796 3803 3827 3834 3851 3858 3865 3915 4016 4041 4048 4066 4084 4091 4236 4255 4293 4308 4359 4367 4374 4408 4432 4447 4454 4469 4590 4609 4616 4640 4647 4688 4703 4712 4719 4726 4744 4809 4816 4844 4881 4896 4913 4920 4938 4945 4987 5004 5012 5019 5026 5109 5136 5151 5177 5192 5199 5216 5224 5277 5318 5366 5399 5461 5547 5594 5640 5647 5654 5661 5687 5752 5771 5807 5865 5904 5911 5926 5994 6029 6045 6096 6103 6188 6195 6202 6217 6252 6271 6325 6364 6

### Vibrio cholerae

In [59]:
seq = read_inputs("./genomes/vibrio_cholerae.txt")

In [60]:
len(seq)

1108250

In [25]:
find_pattern_indices(seq, "CTTGATCAT")

[60039,
 98409,
 129189,
 152283,
 152354,
 152411,
 163207,
 197028,
 200160,
 357976,
 376771,
 392723,
 532935,
 600085,
 622755,
 1065555]

In [26]:
" ".join([str(n) for n in _])

'60039 98409 129189 152283 152354 152411 163207 197028 200160 357976 376771 392723 532935 600085 622755 1065555'
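
Three of these hits (152283, 152354, 152411) sit very close together. The sketch below picks out such groups from a list of positions. The helper `clumped_positions` and its parameters are hypothetical, and the window of 500 is just an assumed value for illustration.

```python
def clumped_positions(positions, k, window_length, min_count):
    """Positions that belong to a run of >= min_count occurrences of a k-mer
    fitting entirely inside one window of window_length."""
    positions = sorted(positions)
    clumped = set()
    for i in range(len(positions) - min_count + 1):
        first = positions[i]
        last = positions[i + min_count - 1]
        # The last occurrence ends at last + k and must still fit in the window
        if last + k - first <= window_length:
            clumped.update(positions[i:i + min_count])
    return sorted(clumped)

hits = [60039, 98409, 129189, 152283, 152354, 152411, 163207, 197028,
        200160, 357976, 376771, 392723, 532935, 600085, 622755, 1065555]
print(clumped_positions(hits, k=9, window_length=500, min_count=3))
# [152283, 152354, 152411]
```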

### Chapter 1.4, ClumpFinding

Clump Finding Problem: Find patterns forming clumps in a string.  
> Input: A string Genome, and integers k, L, and t.  
> Output: All distinct k-mers forming (L, t)-clumps in Genome.

In [50]:
def find_clumps_of_k_mers(seq, k, window_length, n_appeareances):
    kmers_clumped_seq = set()

    # Recount all k-mers from scratch in every window (slow for long genomes)
    for i in range(len(seq) - window_length + 1):
        subseq = seq[i:i + window_length]
        kmers_counter = get_k_mers_counter(subseq, k)

        # Use `kmer` rather than `k` to avoid shadowing the k-mer length
        kmers_clumped_subseq = {kmer for kmer, count in kmers_counter.items() if count >= n_appeareances}

        kmers_clumped_seq.update(kmers_clumped_subseq)

    return kmers_clumped_seq

In [54]:
seq, other = read_inputs("dataset_30274_5.txt")

In [55]:
other

'9 29 3'

In [56]:
k, L, t = [int(n) for n in other.split(" ")]

In [57]:
k, L, t

(9, 29, 3)

In [58]:
find_clumps_of_k_mers(seq=seq, k=k, window_length=L, n_appeareances=t)

set()

### E. coli

In [61]:
seq = read_inputs("./genomes/e_coli.txt")

In [62]:
len(seq)

4639675

In [67]:
%%time
find_clumps_of_k_mers(seq, k=9, window_length=500, n_appeareances=3)

CPU times: user 7min 46s, sys: 21.3 ms, total: 7min 46s
Wall time: 7min 46s


{'CTGGTAGCT',
 'TTCACGCCG',
 'CCTTCGGGT',
 'CGCAACAAC',
 'GCCTGATAA',
 'GCTAATGCG',
 'GCCCTACAT',
 'AGCGTGATT',
 'CCCCGCAAC',
 'TCGCGAGTT',
 'GAGCAGCCT',
 'ACATCCAAC',
 'GCAGCCTGG',
 'CGTCGCATC',
 'ACCGATAAG',
 'AGAGCACCT',
 'CCCCCACGT',
 'AACCGGTTG',
 'GACAGTCAT',
 'ACAACCGAT',
 'ATACCGCTA',
 'CAAGCGTCG',
 'GCGCGTCTT',
 'GACATTATT',
 'TGGCGGTGA',
 'CGCATCCGA',
 'TGTGTGCAA',
 'TGAAATGAT',
 'CATCGGGAA',
 'AACGCGTCT',
 'GTTTATCCC',
 'CAGCGCACC',
 'ATCAACGCC',
 'GCCAGCAGC',
 'TCGAACCCC',
 'TGAACGCCT',
 'GATGCATCG',
 'CGCTGTAAT',
 'CGGTTCAAA',
 'GTAGGTCGG',
 'GAATCTGTA',
 'CATCTGCGC',
 'TCGGTTTAT',
 'TCCAGCTGA',
 'GAGAGCACC',
 'AGCAGTTGA',
 'GGGAATAGC',
 'CGAAGTTGA',
 'AAAAATTGA',
 'TCGGGGTCG',
 'ACGCGGGGT',
 'GTACGAGCT',
 'TGCGCACGA',
 'ACAGTCATT',
 'TCAGGCGTT',
 'GGTCGGCGG',
 'CGTAGGCCG',
 'GAGCCGGTT',
 'CGGCGTGAA',
 'GACTTATCA',
 'GCGGTGAGG',
 'GGTCGGGGC',
 'ACGCTGTCG',
 'TGGAACAGC',
 'TCGGATAAG',
 'GACGCGACT',
 'TGGAGTTTG',
 'ACCTCCCTT',
 'AACAGGCTA',
 'GGCGCGAGC',
 'TGGCGCACA',
 'GCTC

In [68]:
len(_)

1904

Let's try to optimize this.

In [99]:
def find_clumps_of_k_mers_2(seq, k, window_length, n_appeareances):
    kmers_clumped_seq = set()

    # count the k-mers of the first window from scratch
    initial_seq = seq[:window_length]
    kmers_counter = get_k_mers_counter(initial_seq, k)
    kmers_clumped_subseq = {kmer for kmer, count in kmers_counter.items() if count >= n_appeareances}
    kmers_clumped_seq.update(kmers_clumped_subseq)

    # when sliding the window by one position, we just need to
    #  subtract 1 for the k-mer at the beginning of the previous window
    #  add 1 for the k-mer at the end of the new window
    for i in range(1, len(seq) - window_length + 1):
        kmer_beginning = seq[i - 1:i - 1 + k]
        kmers_counter[kmer_beginning] -= 1

        kmer_ending = seq[i + window_length - k:i + window_length]
        kmers_counter[kmer_ending] = kmers_counter.get(kmer_ending, 0) + 1

        # only kmer_ending can form a new clump at this step
        if kmers_counter[kmer_ending] >= n_appeareances:
            kmers_clumped_seq.add(kmer_ending)

    return kmers_clumped_seq
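
Before trusting an optimization like this, it is worth checking it against the slow version on small random inputs. The sketch below is self-contained: it re-implements both approaches under the illustrative names `clumps_naive` and `clumps_sliding` and assumes the genome is at least as long as the window (for shorter genomes the sliding version treats the whole genome as one window, so the two would differ).

```python
import random

def kmer_counts(text, k):
    counts = {}
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts

def clumps_naive(genome, k, window_length, t):
    """Recount every window from scratch."""
    found = set()
    for i in range(len(genome) - window_length + 1):
        counts = kmer_counts(genome[i:i + window_length], k)
        found.update(kmer for kmer, c in counts.items() if c >= t)
    return found

def clumps_sliding(genome, k, window_length, t):
    """Update the counts incrementally as the window moves one step."""
    counts = kmer_counts(genome[:window_length], k)
    found = {kmer for kmer, c in counts.items() if c >= t}
    for i in range(1, len(genome) - window_length + 1):
        leaving = genome[i - 1:i - 1 + k]  # k-mer dropped on the left
        counts[leaving] -= 1
        entering = genome[i + window_length - k:i + window_length]  # k-mer gained on the right
        counts[entering] = counts.get(entering, 0) + 1
        if counts[entering] >= t:
            found.add(entering)
    return found

rng = random.Random(0)  # fixed seed so the check is repeatable
for trial in range(200):
    k = rng.randint(1, 4)
    window_length = rng.randint(4, 20)
    t = rng.randint(1, 4)
    length = rng.randint(window_length, 60)
    # A two-letter alphabet makes repeated k-mers (and thus clumps) common
    genome = "".join(rng.choice("AC") for _ in range(length))
    assert clumps_naive(genome, k, window_length, t) == clumps_sliding(genome, k, window_length, t)
print("sliding-window and naive results agree")
```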

In [101]:
%%time
find_clumps_of_k_mers_2(seq, k=9, window_length=500, n_appeareances=3)

CPU times: user 3.04 s, sys: 3 µs, total: 3.04 s
Wall time: 3.04 s


{'CTGGTAGCT',
 'TTCACGCCG',
 'CCTTCGGGT',
 'GCCTGATAA',
 'GCCCTACAT',
 'TCGCGAGTT',
 'GAGCAGCCT',
 'GCAGCCTGG',
 'CGTCGCATC',
 'ACCGATAAG',
 'AGAGCACCT',
 'CCCCCACGT',
 'AACCGGTTG',
 'GACAGTCAT',
 'CAAGCGTCG',
 'GCGCGTCTT',
 'TGGCGGTGA',
 'TGTGTGCAA',
 'TGAAATGAT',
 'CATCGGGAA',
 'GTTTATCCC',
 'GCCAGCAGC',
 'TGAACGCCT',
 'GATGCATCG',
 'CGGTTCAAA',
 'GTAGGTCGG',
 'TCCAGCTGA',
 'GAGAGCACC',
 'AGCAGTTGA',
 'GGGAATAGC',
 'CGAAGTTGA',
 'AAAAATTGA',
 'TCGGGGTCG',
 'ACGCGGGGT',
 'GTACGAGCT',
 'ACAGTCATT',
 'TCAGGCGTT',
 'GGTCGGCGG',
 'GAGCCGGTT',
 'CGGCGTGAA',
 'GCGGTGAGG',
 'TGGAACAGC',
 'TCGGATAAG',
 'GACGCGACT',
 'AACAGGCTA',
 'GGCGCGAGC',
 'TGGCGCACA',
 'TGATGCGAC',
 'AAGCCGCCG',
 'TCGCAGGTT',
 'ACTGTAGGT',
 'TGTTCATAT',
 'TGATGGTGG',
 'TTCATCTTT',
 'AGCGCTGCG',
 'GAGGGGGTC',
 'CCACCTCTT',
 'GCCCGCTCA',
 'ACGCCAGAC',
 'CTGGCTGGC',
 'AATAGCCTG',
 'GCTCTCTCG',
 'AGAGATGGT',
 'CTGCCCCTC',
 'GGGGTCGCG',
 'AGAAGAACA',
 'TGGCAACAG',
 'CCGACATCC',
 'TCGTACGAG',
 'GATGGTGCA',
 'GTAGAGCAG',
 'TCGG

In [102]:
len(_)

1904

A huge speed-up: from 7 min 46 s down to about 3 s, with the same 1904 k-mers found.
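
A rough count of dictionary updates suggests where the difference in run time comes from. This is only an estimate under a simplified cost model (one unit per k-mer counted or uncounted, ignoring slicing and set operations); the variable names are mine.

```python
n, k, window_length = 4_639_675, 9, 500  # E. coli genome length and the parameters used above

n_windows = n - window_length + 1
kmers_per_window = window_length - k + 1

# Naive version: every window recounts all of its k-mers
naive_updates = n_windows * kmers_per_window
# Sliding version: one full count, then one removal and one addition per step
sliding_updates = kmers_per_window + 2 * (n_windows - 1)

print(naive_updates)    # 2282474592
print(sliding_updates)  # 9278842
print(round(naive_updates / sliding_updates))  # 246
```

The estimated ratio of about 246 is in the same range as the measured one (7 min 46 s against 3.04 s, roughly 150 times faster).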

1904 clumped 9-mers are too many to inspect, so let's make the criteria more demanding.

After some trial and error with the parameters:

In [119]:
%%time
find_clumps_of_k_mers_2(seq, k=20, window_length=100, n_appeareances=5)

CPU times: user 2.96 s, sys: 184 ms, total: 3.14 s
Wall time: 3.14 s


{'ATGAAATGATGAAATGATGA', 'GCACTATGGCACTATGGCAC'}

This is striking: ATGAAATGATGAAATGATGA appears 5 times within at least one window of only 100 nucleotides.

In [122]:
find_pattern_indices(seq, "ATGAAATGATGAAATGATGA")

[1197676, 1197684, 1197692, 1197700, 1197708]

In [123]:
len(_)

5

Indeed, consecutive occurrences start 8 positions apart, because the 20-mer is itself periodic and overlaps with shifted copies of itself.
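
The spacing of 8 matches the internal period of the two 20-mers found above. A small sketch to compute that period (the helper name `smallest_period` is illustrative):

```python
def smallest_period(text):
    """Smallest p > 0 such that text[i] == text[i + p] for every valid i."""
    for p in range(1, len(text) + 1):
        if all(text[i] == text[i + p] for i in range(len(text) - p)):
            return p
    return 0  # only reached for the empty string

print(smallest_period("ATGAAATGATGAAATGATGA"))  # 8
print(smallest_period("GCACTATGGCACTATGGCAC"))  # 8
```

Both 20-mers are built from a repeated 8-nucleotide unit (ATGAAATG and GCACTATG), so shifting the 20-mer by 8 positions inside a longer run of that unit gives another exact match.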

The same holds for the second 20-mer:

In [124]:
find_pattern_indices(seq, "GCACTATGGCACTATGGCAC")

[2763433, 2763441, 2763449, 2763457, 2763465]

In [137]:
get_k_mers_counter("TAAACGTGAGAGAAACGTGCTGATTACACTTGTTCGTGTGGTAT", 3)

{'TAA': 1,
 'AAA': 2,
 'AAC': 2,
 'ACG': 2,
 'CGT': 3,
 'GTG': 4,
 'TGA': 2,
 'GAG': 2,
 'AGA': 2,
 'GAA': 1,
 'TGC': 1,
 'GCT': 1,
 'CTG': 1,
 'GAT': 1,
 'ATT': 1,
 'TTA': 1,
 'TAC': 1,
 'ACA': 1,
 'CAC': 1,
 'ACT': 1,
 'CTT': 1,
 'TTG': 1,
 'TGT': 2,
 'GTT': 1,
 'TTC': 1,
 'TCG': 1,
 'TGG': 1,
 'GGT': 1,
 'GTA': 1,
 'TAT': 1}

In [138]:
find_pattern_indices("ATGACTTCGCTGTTACGCGC", "CGC")

[7, 15, 17]