# Programming Homework 2 Instructions 

In a practical, we saw Python code implementing the Boyer-Moore algorithm. Some of the code is for preprocessing the pattern P into the tables needed to execute the bad character and good suffix rules — we did not discuss that code. But we did discuss the code that performs the algorithm given those tables:



In [23]:
# Print in color
def cprint(dna):
    from termcolor import colored
    colors = {'A':'red', 'C' : 'green', 'G' :'magenta', 'T' : 'blue','N':'white'}
    print("".join(colored(base, colors[base] if base in 'ATCGN' else 'white') for base in dna))

In [5]:
def readGenome(filename):
    genome = ''
    with open(filename, 'r') as f:
        for line in f:
            # skip FASTA header lines, which start with '>'
            if not line.startswith('>'):
                genome += line.rstrip()
    return genome

In [6]:
def naive(pattern,text):
    occurrences = []
    for i in range(len(text) - len(pattern) + 1): #loop over alignments
        match = True
        for j in range(len(pattern)):
            if text[i+j] != pattern[j]:
                match = False
                break
        if match:
            occurrences.append(i)
    return occurrences

In [7]:
def naive_count(pattern,text):
    occurrences = []
    match_count = 0      # character comparisons made (matching or not)
    alignment_count = 0  # alignments tried
    for i in range(len(text) - len(pattern) + 1): #loop over alignments
        match = True
        alignment_count += 1
        for j in range(len(pattern)):
            match_count += 1
            if text[i+j] != pattern[j]:
                match = False
                break
        if match:
            occurrences.append(i)
    print('Count: ',alignment_count) # equals len(text) - len(pattern) + 1
    print('Match Count: ',match_count )
    return occurrences

### Question 1
How many alignments does the naive exact matching algorithm try when matching the string `GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG` (derived from human Alu sequences) to the excerpt of human chromosome 1?  (Don't consider reverse complements.)

In [8]:
human_chr1 = readGenome('data/chr1.GRCh38.excerpt.fasta')

pattern = 'GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG'
naive_count(pattern,human_chr1)
print("Alignment Count :", len(human_chr1) - len(pattern) +1)

# Upper bound on character comparisons: every alignment compares the whole pattern
print(799954 * len(pattern))

Count:  799954
Match Count:  984143
Alignment Count : 799954
37597838


In [19]:
def boyer_moore(p, p_bm, t):
    """ Do Boyer-Moore matching. p=pattern, t=text,
        p_bm=BoyerMoore object for p """
    i = 0
    occurrences = []
    alignment_count = 0 # Added
    while i < len(t) - len(p) + 1:
        alignment_count += 1
        shift = 1
        mismatched = False
        for j in range(len(p)-1, -1, -1):
            if p[j] != t[i+j]:
                skip_bc = p_bm.bad_character_rule(j, t[i+j])
                skip_gs = p_bm.good_suffix_rule(j)
                shift = max(shift, skip_bc, skip_gs)
                mismatched = True
                break
        if not mismatched:
            occurrences.append(i)
            skip_gs = p_bm.match_skip()
            shift = max(shift, skip_gs)
        i += shift
    return occurrences, alignment_count
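
The function above counts alignments only. A minimal sketch of a variant that also counts character comparisons is given below. It uses the name `boyer_moore_with_counts` and the return order that appear later in this notebook. The `UnitShift` class is a hypothetical stand-in, not part of `bm_preproc`. It always reports a skip of 0, so the sketch runs without the real `BoyerMoore` object and every shift is 1.

```python
def boyer_moore_with_counts(p, p_bm, t):
    """Boyer-Moore matching that also counts the work done.

    Returns (occurrences, num_alignments, num_character_comparisons).
    p_bm must offer bad_character_rule, good_suffix_rule and match_skip,
    as the BoyerMoore object from bm_preproc does.
    """
    i = 0
    occurrences = []
    num_alignments = 0
    num_character_comparisons = 0
    while i < len(t) - len(p) + 1:
        num_alignments += 1
        shift = 1
        mismatched = False
        for j in range(len(p) - 1, -1, -1):  # compare right to left
            num_character_comparisons += 1
            if p[j] != t[i + j]:
                skip_bc = p_bm.bad_character_rule(j, t[i + j])
                skip_gs = p_bm.good_suffix_rule(j)
                shift = max(shift, skip_bc, skip_gs)
                mismatched = True
                break
        if not mismatched:
            occurrences.append(i)
            shift = max(shift, p_bm.match_skip())
        i += shift
    return occurrences, num_alignments, num_character_comparisons


class UnitShift:
    """Hypothetical stand-in for BoyerMoore: never skips, so each shift is 1."""
    def bad_character_rule(self, j, c):
        return 0

    def good_suffix_rule(self, j):
        return 0

    def match_skip(self):
        return 0


occ, n_align, n_cmp = boyer_moore_with_counts('word', UnitShift(), 'a word')
print(occ, n_align, n_cmp)
```

With the real `BoyerMoore` object passed as `p_bm`, the shifts are usually larger, so both counts would normally be lower than with the stand-in.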

In [23]:
pattern_bm = BoyerMoore(pattern, alphabet='ACGT')
occurrences, alignments = boyer_moore(pattern, pattern_bm, human_chr1)
print(alignments)

127974


In [21]:
from bm_preproc import BoyerMoore

The `bm_preproc` module provides the `BoyerMoore` class, which encapsulates the preprocessing information used by the `boyer_moore` function above. Second, download the provided excerpt of human chromosome 1:

In [3]:
import wget
url = 'http://d28rh4a8wq0iu5.cloudfront.net/ads1/data/chr1.GRCh38.excerpt.fasta'
wget.download(url,out='data/')

'data//chr1.GRCh38.excerpt.fasta'

Third, implement versions of the naive exact matching and Boyer-Moore algorithms that additionally count and return (a) the number of character comparisons performed and (b) the number of alignments tried. Roughly speaking, these measure how much work the two different algorithms are doing.
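
For the naive algorithm, one way to return both counts might look like the sketch below. The function name `naive_with_counts` and the `(p, t)` argument order are assumptions, chosen to mirror `boyer_moore_with_counts`. The logic is the same as in `naive_count` above, except that the counts are returned rather than printed.

```python
def naive_with_counts(p, t):
    """Naive exact matching.

    Returns (occurrences, num_alignments, num_character_comparisons).
    """
    occurrences = []
    num_alignments = 0
    num_character_comparisons = 0
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        num_alignments += 1
        match = True
        for j in range(len(p)):           # loop over pattern characters
            num_character_comparisons += 1
            if t[i + j] != p[j]:
                match = False
                break
        if match:
            occurrences.append(i)
    return occurrences, num_alignments, num_character_comparisons


p = 'word'
t = 'there would have been a time for such a word'
occurrences, num_alignments, num_character_comparisons = naive_with_counts(p, t)
print(occurrences, num_alignments, num_character_comparisons)  # [40] 41 46
```

The text has 44 characters, so there are 44 - 4 + 1 = 41 alignments. Only the alignments at "would" and "word" get past the first character, which gives 39 + 3 + 4 = 46 comparisons.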

In [7]:
p = 'word'
t = 'there would have been a time for such a word'
lowercase_alphabet = 'abcdefghijklmnopqrstuvwxyz '
p_bm = BoyerMoore(p, lowercase_alphabet)
occurrences, num_alignments, num_character_comparisons = boyer_moore_with_counts(p, p_bm, t)
print(occurrences, num_alignments, num_character_comparisons)

<bm_preproc.BoyerMoore at 0x20b5d808e50>


### Question 
Index-assisted approximate matching. In practicals, we built a Python class called **`Index`** implementing an ordered-list version of the k-mer index. The **`Index`** class is copied below.

In [9]:
import bisect

class Index(object):
    def __init__(self, t, k):
        ''' Create index from all substrings of t of length k '''
        self.k = k  # k-mer length (k)
        self.index = []
        for i in range(len(t) - k + 1):  # for each k-mer
            self.index.append((t[i:i+k], i))  # add (k-mer, offset) pair
        self.index.sort()  # alphabetize by k-mer
    
    def query(self, p):
        ''' Return index hits for first k-mer of P '''
        kmer = p[:self.k]  # query with first k-mer
        i = bisect.bisect_left(self.index, (kmer, -1))  # binary search
        hits = []
        while i < len(self.index):  # collect matching index entries
            if self.index[i][0] != kmer:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits
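
To make the mechanics concrete, the sketch below replays what `Index.__init__` and `Index.query` do on a toy string, using plain lists and `bisect`. The text, pattern and `k` are illustrative assumptions and are not part of the assignment data.

```python
import bisect

t = 'GCTACGATCTAGAATCTA'  # toy text, chosen only for illustration
p = 'CTAG'
k = 2

# Build the index the same way Index.__init__ does: sorted (k-mer, offset) pairs.
index = sorted((t[i:i + k], i) for i in range(len(t) - k + 1))

# Look up the first k-mer of p, as Index.query does.
kmer = p[:k]
i = bisect.bisect_left(index, (kmer, -1))  # first entry whose k-mer is >= kmer
hits = []
while i < len(index) and index[i][0] == kmer:
    hits.append(index[i][1])
    i += 1

# An index hit only guarantees that the first k characters match,
# so the rest of p still has to be verified against t.
verified = [h for h in hits if t[h:h + len(p)] == p]

print(hits)      # [1, 8, 15]
print(verified)  # [8]
```

Two of the three index hits are false leads, which is why a verification step always follows the index lookup.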

We also implemented the pigeonhole principle using Boyer-Moore as our exact matching algorithm.

Implement the pigeonhole principle using `Index` to find exact matches for the partitions. Assume P always has length 24, and that we are looking for approximate matches with up to 2 mismatches (substitutions). We will use an 8-mer index.

Download the Python module for building a k-mer index. 

In [2]:
import wget
url = 'https://d28rh4a8wq0iu5.cloudfront.net/ads1/code/kmer_index.py'
wget.download(url)

'kmer_index.py'

---
Write a function that, given a length-24 pattern P and given an `Index` object built on 8-mers, finds all approximate occurrences of P within T with up to 2 mismatches. Insertions and deletions are not allowed. Don't consider any reverse complements.

How many times does the string `GGCGCGGTGGCTCACGCCTGTAAT`, which is derived from a human Alu sequence, occur with up to 2 substitutions in the excerpt of human chromosome 1?  (Don't consider reverse complements here.)

Hint 1: Multiple index hits might direct you to the same match multiple times, but be careful not to count a match more than once.

Hint 2: You can check your work by comparing the output of your new function to that of the `naive_2mm` function implemented in the previous module.
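
The `naive_2mm` function is not reproduced in this notebook. The sketch below is an assumption about what it could look like: the naive matcher, relaxed to accept up to 2 mismatches per alignment. The `max_mismatches` parameter is a hypothetical addition for illustration.

```python
def naive_2mm(p, t, max_mismatches=2):
    """Naive matching that tolerates up to max_mismatches substitutions."""
    occurrences = []
    for i in range(len(t) - len(p) + 1):  # loop over alignments
        mismatches = 0
        for j in range(len(p)):
            if t[i + j] != p[j]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break                 # too many mismatches, give up early
        if mismatches <= max_mismatches:
            occurrences.append(i)
    return occurrences


result = naive_2mm('ACTTTA', 'ACTTACTTGATAAAGT')
print(result)  # [0, 4]
```

Because it checks every alignment, this version is slow on long texts, but its output is a convenient reference when testing an index-based implementation.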

---

In [15]:
humanIndex = Index(human_chr1,8)

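def queryIndex(pattern,text,index):
    ''' Return offsets of exact occurrences of pattern in text: look up the
        first k-mer in the index, then verify the rest of the pattern '''
    k = index.k
    offset = []
    for i in index.query(pattern):
        # index hit at i: the first k characters match, so compare the remainder
        if pattern[k:] == text[i + k: i + len(pattern)]:
            offset.append(i)
    return offset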



In [16]:
alu_pattern = 'GGCGCGGTGGCTCACGCCTGTAAT'
queryIndex(alu_pattern,human_chr1,humanIndex)

[56922, 262042, 364263, 657496, 717706]
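
`queryIndex` reports exact occurrences only. A possible sketch of the pigeonhole version follows. It splits the pattern into `n + 1` partitions, queries the index with each one, and verifies every candidate offset by counting mismatches. The name `approximate_match_index` and the parameter `n` are hypothetical, and the `Index` class is a condensed copy of the one above so that the block runs on its own. The sketch assumes each partition is at least `index.k` characters long, as with a length-24 pattern, 2 mismatches and an 8-mer index.

```python
import bisect


class Index(object):
    """Condensed copy of the ordered-list k-mer index shown earlier."""
    def __init__(self, t, k):
        self.k = k
        self.index = sorted((t[i:i + k], i) for i in range(len(t) - k + 1))

    def query(self, p):
        kmer = p[:self.k]
        i = bisect.bisect_left(self.index, (kmer, -1))
        hits = []
        while i < len(self.index) and self.index[i][0] == kmer:
            hits.append(self.index[i][1])
            i += 1
        return hits


def approximate_match_index(p, t, index, n=2):
    """Return sorted offsets where p occurs in t with at most n mismatches.

    If p matches with <= n mismatches, at least one of its n+1 partitions
    must match exactly (pigeonhole principle), so index hits for the
    partitions cover every possible occurrence.
    """
    segment_length = len(p) // (n + 1)
    matches = set()  # a set, so an offset reached via several hits counts once
    for part in range(n + 1):
        start = part * segment_length
        for hit in index.query(p[start:start + segment_length]):
            offset = hit - start  # where p would begin in t
            if offset < 0 or offset + len(p) > len(t):
                continue          # p would run off either end of t
            window = t[offset:offset + len(p)]
            mismatches = sum(1 for a, b in zip(p, window) if a != b)
            if mismatches <= n:
                matches.add(offset)
    return sorted(matches)


toy_text = 'ACTTACTTGATAAAGT'  # toy data, not the chromosome excerpt
toy_index = Index(toy_text, 2)
toy_matches = approximate_match_index('ACTTTA', toy_text, toy_index)
print(toy_matches)  # [0, 4]
```

For the homework question, the same function would be called with `alu_pattern`, `human_chr1` and `humanIndex`. The number of matches is the length of the returned list.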

In [31]:
from Bio import SeqIO

for record in SeqIO.parse("data/hg38_chr1.fa", "fasta"):
    cprint(record.seq[956422:956422 + 100])

CCCAGGAAAAGGACTGGGAGTTCCAGATCCTGGACTATGGCCACCTCCAGGTGGTATCTGGAGCTCTCCGTATCCTTGTCCCTGGAAAAAACACTGTGA

In [32]:

humanIndex = Index(record.seq,8)

MemoryError: 