## Module 2. Preprocessing, Indexing, and Approximate Matching

This week, we'll look at the **Boyer-Moore** algorithm. Like naive exact matching, it tries alignments one at a time, but it uses what it learns from each character comparison to skip alignments that cannot possibly match.

The insight is as follows. Given $T$ = `there would have been a time for such a word` and $P$ = `word`, suppose we line up `word` against `woul` and hit a mismatch at the third character:

```
there would have been a time for such a word
      word
      >>*
```

Because `u` does not occur anywhere within `word`, no alignment that overlaps this `u` can match. We can skip the next two alignments and move the pattern to just past the `u`.

```
there would have been a time for such a word
       >>word
```

Boyer-Moore makes a further modification to the naive algorithm. It still tries alignments left-to-right, but it tries character comparisons **right-to-left**. 

### Bad Character Rule

Under the **bad character rule**, upon finding a mismatch, we skip alignments until:
* the mismatch becomes a match, or
* the pattern moves past the mismatch.

Given the following, with a mismatch at the fourth position from the right:

```
GCTTCTGCTACCTTTTGCGC
CCTTTTGC
    *<<<
```

We skip two alignments until we hit a `C` in the pattern. Then, we try the character comparison again:
```
GCTTCTGCTACCTTTTGCGC
 >>CCTTTTGC
         *<
```

We have another mismatch, this time at the second position from the right. Since there is no `A` in the pattern, we move the pattern entirely past the `A`, skipping six alignments. Then, we try the character comparisons again, and find that we have a match.

```
GCTTCTGCTACCTTTTGCGC
    >>>>>>CCTTTTGC
          <<<<<<<<
```

### Good Suffix Rule

Boyer-Moore also has another rule: the **good suffix rule**. If a substring *t* of the text matches a suffix of the pattern and is followed by a mismatch, then we skip alignments until one of the following holds:
* another occurrence of *t* within the pattern lines up with *t* in the text,
* a prefix of the pattern matches a suffix of *t*, or
* the pattern moves past *t* entirely.

Given the following, we have a *t* = `TAC`, followed by a mismatch.

```
      ...
CGTGCCTACTTACTTACTTACTTA
CTTACTTAC
     *<<<
```

Then, we skip three alignments, until we find another `TAC` in the pattern. We perform our character comparisons again, and find that we have a new *t* = `TACTTAC`. 

```
      .......
CGTGCCTACTTACTTACTTACTTA
 >>>CTTACTTAC
     *<<<<<<<       
```

Because *t* does not occur anywhere else in the pattern, we fall back on the second case of the good suffix rule: the prefix `CTTAC` of the pattern matches a suffix of *t*, so we skip three alignments to line them up, and find a match.

```
CGTGCCTACTTACTTACTTACTTA
     >>>CTTACTTAC
        <<<<<<<<<       
```
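
The two shifts in this example can be checked with a brute-force sketch. The helper `good_suffix_shift` is a hypothetical name, not part of the course code, and it assumes the weak form of the good suffix rule: it simply tries every shift and returns the first one that stays consistent with the matched suffix *t*.

```python
def good_suffix_shift(p, j):
    """Brute-force weak good suffix shift after a mismatch at p[j].

    Hypothetical helper for illustration, not a table-based implementation.
    Returns the smallest shift s >= 1 such that every character of the
    matched suffix t = p[j+1:] either lines up with an equal pattern
    character or no longer has any pattern character underneath it.
    """
    m = len(p)
    for s in range(1, m + 1):
        if all(i - s < 0 or p[i - s] == p[i] for i in range(j + 1, m)):
            return s
    return m  # only reached for an empty pattern


p = "CTTACTTAC"
shift_after_tac = good_suffix_shift(p, 5)     # t = 'TAC', mismatch at index 5
shift_after_long_t = good_suffix_shift(p, 1)  # t = 'TACTTAC', mismatch at index 1
# A shift of s moves the pattern s positions, i.e. skips s - 1 alignments.
print(shift_after_tac - 1, shift_after_long_t - 1)  # 3 3
```

Trying every shift costs time quadratic in the pattern length per mismatch, which is why a real implementation precomputes these shifts once per pattern.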

### Putting It All Together

Let's put it all together. If we can apply both rules, we take whichever returns the larger skip. Given the following example, the very first character comparison results in a mismatch. 

```
GTTATAGCTGATCGCGGCGTAGCGGCGAA
GTAGCGGCG
        *
```

There is no good suffix (gs: 0), but the bad character rule gives us six alignment skips (bc: 6), so we apply the bad character rule. Then, we perform character comparisons again and find a bad character at the fourth comparison, along with a good suffix *t* = `GCG`.

```
             ...
GTTATAGCTGATCGCGGCGTAGCGGCGAA
 >>>>>>GTAGCGGCG
            *<<<
```

Applying the bad character rule gives us 0 alignment skips (bc: 0), while applying the good suffix rule gives us 2 alignment skips (gs: 2), so we go with the good suffix rule.

```
             ......
GTTATAGCTGATCGCGGCGTAGCGGCGAA
        >>GTAGCGGCG
            *<<<<<<
```

Now, the bad character rule gives us 2 alignment skips to the right of `C` (bc: 2), but the good suffix rule gives us 7 alignment skips (gs: 7; the prefix `G` of the pattern matches the suffix `G` of *t*), so we go with the good suffix rule. 

```
GTTATAGCTGATCGCGGCGTAGCGGCGAA
           >>>>>>>GTAGCGGCG
                  <<<<<<<<<
```

And now we have a match. In total, we skipped 15 alignments in this example, and 11 characters of the text were never examined at all. This is why Boyer-Moore typically does far less work than the naive exact matching algorithm.
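
The walk-through above can be replayed with a short sketch. It assumes brute-force versions of both rules (the hypothetical helpers `bad_char_skip` and `good_suffix_skip`, which scan the pattern instead of using lookup tables) and stops at the first match.

```python
def bad_char_skip(p, j, c):
    """Alignments skipped when text character c mismatches p[j]."""
    for k in range(j - 1, -1, -1):
        if p[k] == c:
            return j - k - 1   # nearest c to the left lines up with the text
    return j                   # c is absent to the left: move past it entirely


def good_suffix_skip(p, j):
    """Alignments skipped by the (weak) good suffix rule for a mismatch at p[j]."""
    m = len(p)
    for s in range(1, m + 1):
        if all(i - s < 0 or p[i - s] == p[i] for i in range(j + 1, m)):
            return s - 1
    return m - 1


def trace_first_match(p, t):
    """Return (offset of the first match, list of (bc, gs) skips per mismatch)."""
    i, skips = 0, []
    while i <= len(t) - len(p):
        j = len(p) - 1
        while j >= 0 and p[j] == t[i + j]:   # compare right-to-left
            j -= 1
        if j < 0:
            return i, skips
        bc, gs = bad_char_skip(p, j, t[i + j]), good_suffix_skip(p, j)
        skips.append((bc, gs))
        i += max(bc, gs) + 1                 # take the larger skip
    return None, skips


offset, skips = trace_first_match("GTAGCGGCG", "GTTATAGCTGATCGCGGCGTAGCGGCGAA")
print(offset, skips)                          # 18 [(6, 0), (0, 2), (2, 7)]
print(sum(max(bc, gs) for bc, gs in skips))   # 15
```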

### Preprocessing

Another advantage of Boyer-Moore is that the work of deciding how far to skip can be done ahead of time, by **preprocessing** the pattern. Before any matching begins, Boyer-Moore builds two lookup tables from *P* alone, one for the bad character rule and one for the good suffix rule, covering **all possible mismatch scenarios**.

For the **bad character rule**, the table stores the number of alignments to skip for every combination of pattern offset and alphabet character. A `-` marks the combinations where the character matches the pattern, so no mismatch can occur there. For example, given *P* = `TCGC`, we build the lookup table as follows:

```
  T C G C
A 0 1 2 3  # A is not in P, so we skip all the way past A
C 0 - 0 -  # C mismatching at G: the C just to its left lines up next, so nothing is skipped
G 0 1 - 0
T - 0 1 2  # we skip until the T at the start of P lines up
```
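
As a rough sketch of how such a table could be filled in, the hypothetical function `bad_character_skips` below walks the pattern once per alphabet character and remembers where that character was last seen. It stores skip counts exactly as in the table above, with `None` in place of `-`.

```python
def bad_character_skips(p, alphabet="ACGT"):
    """Map each alphabet character to a row of skip counts, one per offset of p.

    None marks cells where the character equals p[j], so no mismatch can occur.
    """
    table = {}
    for c in alphabet:
        row = []
        last = -1  # index of the most recent occurrence of c to the left of j
        for j, pj in enumerate(p):
            if pj == c:
                row.append(None)
                last = j
            else:
                # shifting by j - last lines up that occurrence (or moves past
                # c when last == -1); one fewer alignment is skipped
                row.append(j - last - 1)
        table[c] = row
    return table


table = bad_character_skips("TCGC")
for c, row in table.items():
    print(c, " ".join("-" if v is None else str(v) for v in row))
```

With this table, the bad character rule during matching is a single lookup, `table[c][j]`, instead of a scan of the pattern.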

One might wonder whether we can preprocess the text *T* rather than the pattern *P*. It is actually possible, and it can be a good idea if we are searching through *T* multiple times. An algorithm that preprocesses the text is called an **offline algorithm**; an algorithm that does not is an **online algorithm**. An offline algorithm may or may not preprocess the pattern *P*. 

So, a naive exact matching algorithm is an online algorithm, and so is Boyer-Moore. But a modern search engine likely uses an offline algorithm to be able to search through a very large *T* and return a result quickly. Likewise, a search through a reference genome might want to use an offline algorithm. 

In [14]:
import string

def z_array(s):
    """ Use Z algorithm (Gusfield theorem 1.4.1) to preprocess s """
    assert len(s) > 1
    z = [len(s)] + [0] * (len(s)-1)
    # Initial comparison of s[1:] with prefix
    for i in range(1, len(s)):
        if s[i] == s[i-1]:
            z[1] += 1
        else:
            break
    r, l = 0, 0
    if z[1] > 0:
        r, l = z[1], 1
    for k in range(2, len(s)):
        assert z[k] == 0
        if k > r:
            # Case 1
            for i in range(k, len(s)):
                if s[i] == s[i-k]:
                    z[k] += 1
                else:
                    break
            r, l = k + z[k] - 1, k
        else:
            # Case 2
            # Calculate length of beta
            nbeta = r - k + 1
            zkp = z[k - l]
            if nbeta > zkp:
                # Case 2a: Zkp wins
                z[k] = zkp
            else:
                # Case 2b: Compare characters just past r
                nmatch = 0
                for i in range(r+1, len(s)):
                    if s[i] == s[i - k]:
                        nmatch += 1
                    else:
                        break
                l, r = k, r + nmatch
                z[k] = r - k + 1
    return z


def n_array(s):
    """ Compile the N array (Gusfield theorem 2.2.2) from the Z array """
    return z_array(s[::-1])[::-1]


def big_l_prime_array(p, n):
    """ Compile L' array (Gusfield theorem 2.2.2) using p and N array.
        L'[i] = largest index j less than n such that N[j] = |P[i:]| """
    lp = [0] * len(p)
    for j in range(len(p)-1):
        i = len(p) - n[j]
        if i < len(p):
            lp[i] = j + 1
    return lp


def big_l_array(p, lp):
    """ Compile L array (Gusfield theorem 2.2.2) using p and L' array.
        L[i] = largest index j less than n such that N[j] >= |P[i:]| """
    l = [0] * len(p)
    l[1] = lp[1]
    for i in range(2, len(p)):
        l[i] = max(l[i-1], lp[i])
    return l


def small_l_prime_array(n):
    """ Compile lp' array (Gusfield theorem 2.2.4) using N array. """
    small_lp = [0] * len(n)
    for i in range(len(n)):
        if n[i] == i+1:  # prefix matching a suffix
            small_lp[len(n)-i-1] = i+1
    for i in range(len(n)-2, -1, -1):  # "smear" them out to the left
        if small_lp[i] == 0:
            small_lp[i] = small_lp[i+1]
    return small_lp


def good_suffix_table(p):
    """ Return tables needed to apply good suffix rule. """
    n = n_array(p)
    lp = big_l_prime_array(p, n)
    return lp, big_l_array(p, lp), small_l_prime_array(n)


def good_suffix_mismatch(i, big_l_prime, small_l_prime):
    """ Given a mismatch at offset i, and given L/L' and l' arrays,
        return amount to shift as determined by good suffix rule. """
    length = len(big_l_prime)
    assert i < length
    if i == length - 1:
        return 0
    i += 1  # i points to leftmost matching position of P
    if big_l_prime[i] > 0:
        return length - big_l_prime[i]
    return length - small_l_prime[i]


def good_suffix_match(small_l_prime):
    """ Given a full match of P to T, return amount to shift as
        determined by good suffix rule. """
    return len(small_l_prime) - small_l_prime[1]


def dense_bad_char_tab(p, amap):
    """ Given pattern string and list with ordered alphabet characters, create
        and return a dense bad character table.  Table is indexed by offset
        then by character. """
    tab = []
    nxt = [0] * len(amap)
    for i in range(0, len(p)):
        c = p[i]
        assert c in amap
        tab.append(nxt[:])
        nxt[amap[c]] = i+1
    return tab


class BoyerMoore(object):
    """ Encapsulates pattern and associated Boyer-Moore preprocessing. """
    
    def __init__(self, p, alphabet='ACGT'):
        self.p = p
        self.alphabet = alphabet
        # Create map from alphabet characters to integers
        self.amap = {}
        for i in range(len(self.alphabet)):
            self.amap[self.alphabet[i]] = i
        # Make bad character rule table
        self.bad_char = dense_bad_char_tab(p, self.amap)
        # Create good suffix rule table
        _, self.big_l, self.small_l_prime = good_suffix_table(p)
    
    def bad_character_rule(self, i, c):
        """ Return # skips given by bad character rule at offset i """
        assert c in self.amap
        ci = self.amap[c]
        assert i > (self.bad_char[i][ci]-1)
        return i - (self.bad_char[i][ci]-1)
    
    def good_suffix_rule(self, i):
        """ Given a mismatch at offset i, return amount to shift
            as determined by (weak) good suffix rule. """
        length = len(self.big_l)
        assert i < length
        if i == length - 1:
            return 0
        i += 1  # i points to leftmost matching position of P
        if self.big_l[i] > 0:
            return length - self.big_l[i]
        return length - self.small_l_prime[i]
    
    def match_skip(self):
        """ Return amount to shift in case where P matches T """
        return len(self.small_l_prime) - self.small_l_prime[1]

In [15]:
def boyer_moore(p, p_bm, t):
    """ Do Boyer-Moore matching """
    i = 0
    occurrences = []
    while i < len(t) - len(p) + 1:
        shift = 1
        mismatched = False
        for j in range(len(p)-1, -1, -1):
            if p[j] != t[i+j]:
                skip_bc = p_bm.bad_character_rule(j, t[i+j])
                skip_gs = p_bm.good_suffix_rule(j)
                shift = max(shift, skip_bc, skip_gs)
                mismatched = True
                break
        if not mismatched:
            occurrences.append(i)
            skip_gs = p_bm.match_skip()
            shift = max(shift, skip_gs)
        i += shift
    return occurrences

In [17]:
p = "AATTTG"
t = "CACTTAATTTG"
p_bm = BoyerMoore(p, alphabet='ACGT')
boyer_moore(p, p_bm, t)

[5]

### *k*-mer Ordered Index

How might we preprocess the text *T*? One good way is to implement an **ordered index**. Given a short text *T* = `GTGCGTGTGGGGG` and *k*=3, we can create an index of 3-mers as follows:

`{GTG: 0, TGC: 1, GCG: 2, CGT: 3, GTG: 4, TGT: 5, GTG: 6, TGG: 7, GGG: 8, GGG: 9, GGG: 10}`

Then, we sort the keys alphabetically. 

`{CGT: 3, GCG: 2, GGG: 8, GGG: 9, GGG: 10, GTG: 0, GTG: 4, GTG: 6, TGC: 1, TGG: 7, TGT: 5}`

Sorting the keys allows us to use a **binary search** to efficiently locate a key, and thereby the locations of that 3-mer in *T*. Python provides `bisect.bisect_left(a, x)`, which returns the leftmost position in a sorted list `a` at which `x` can be inserted while keeping the list sorted. This is useful for implementing a binary search.

In [2]:
import bisect
a = ["CGT", "GCG", "GGG", "GGG", "GGG", "GTG", "GTG", "GTG", "TGC", "TGG", "TGT"]
bisect.bisect_left(a, "GTG")

5

In the above example, once we've found the leftmost position where `GTG` can be inserted and still maintain order, we just try all the `GTG`s and see at what positions we find hits (0, 4, and 6). 
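
Putting the pieces together, here is a small sketch that builds the sorted list of (3-mer, offset) pairs from *T* and collects the hits for `GTG`. Using `bisect.bisect_right` to find the end of the run of equal keys is a choice made for this sketch; scanning forward from the leftmost position works just as well.

```python
import bisect

t = "GTGCGTGTGGGGG"
k = 3

# Sorted (k-mer, offset) pairs; ties on the k-mer are ordered by offset.
pairs = sorted((t[i:i + k], i) for i in range(len(t) - k + 1))
keys = [kmer for kmer, _ in pairs]

lo = bisect.bisect_left(keys, "GTG")    # first entry equal to 'GTG'
hi = bisect.bisect_right(keys, "GTG")   # one past the last entry equal to 'GTG'
hits = [offset for _, offset in pairs[lo:hi]]
print(hits)  # [0, 4, 6]
```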

### Detour: Hash Tables

We can also use another data structure, a **hash table**, to represent the same kind of mapping as the *k*-mer ordered index. In a hash table, we have a list of "buckets", and a hash function assigns each *k*-mer to a bucket. Each *k*-mer entry has three fields: the *k*-mer, its position in *T*, and a pointer that starts out null and can be set to point to another entry if the hash function assigns another *k*-mer to the same bucket.

When different *k*-mers are assigned to the same bucket, we call it a **collision**. Collisions are to be expected, since there are usually many more possible *k*-mers than buckets, but many collisions can slow down querying.

When we want to query from the hash table, we use the hash function to tell us in which bucket we can find the query string. Then, we look at the entries in that bucket and look for the right keys.

In Python, a dictionary *is* a hash table, so it's quite easy. All the nitty-gritty of the hash function is hidden from us, but that's not such a bad thing.

In [5]:
t = 'GTGCGTGTGGGGG'
table = {'CGT': [3], 'GCG': [2], 'GGG': [8, 9, 10], 'GTG': [0, 4, 6], 'TGC': [1], 'TGG': [7], 'TGT': [5]}
table['GGG']

[8, 9, 10]
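
The table above was typed in by hand. A minimal sketch of building it from *T*, assuming a hypothetical helper `build_kmer_table` and using `collections.defaultdict` so that each new *k*-mer starts with an empty list of offsets:

```python
from collections import defaultdict


def build_kmer_table(t, k):
    """Map every k-mer of t to the list of offsets where it occurs."""
    table = defaultdict(list)
    for i in range(len(t) - k + 1):
        table[t[i:i + k]].append(i)   # offsets are appended in increasing order
    return dict(table)


table = build_kmer_table("GTGCGTGTGGGGG", 3)
print(table["GTG"])  # [0, 4, 6]
```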

### Implementing a *k*-mer index
To implement a *k*-mer index, we do the following.

In [28]:
import bisect # use binary search from bisect library

class Index: # create an Index object

    # init method preprocesses string
    def __init__(self, t, k):
        self.k = k
        self.index = [(t[i:i+k], i) for i in range(len(t)-k+1)] # list of tuples
        self.index.sort() # sort for binary search

    # query method 
    def query(self, p): 
        kmer = p[:self.k] # get the first k-mer from pattern
        
        # return index of the first occurrence of (kmer, n) in self.index
        # search for position for (kmer, -1) to ensure leftmost found
        i = bisect.bisect_left(self.index, (kmer, -1)) 
        
        hits = []
        
        while i < len(self.index): # iterate through the right "half" of self.index
            
            if self.index[i][0] != kmer:
                # if kmer no longer matches then no point in continuing comparison
                break 
            
            hits.append(self.index[i][1])
            i += 1
        return hits

A note to myself, since I'm not used to classes in Python: without an `__init__()` method, we have to set the attributes on each instance by hand after creating it, as below:

In [20]:
class IndexNoInit:
    # no __init__ method, so instances start out with no attributes
    def hello_world(self):
        print("hello world")
        
ind = IndexNoInit()
ind.k = 3
ind.t = 'GTGCGTGTGGGGG'
print(ind.k, ind.t)

3 GTGCGTGTGGGGG


Assigning to `self.k` inside `__init__()` stores the value on the new instance, so it is still available after `__init__()` returns. Without `self.`, the assignment only creates a local variable, which disappears when `__init__()` finishes running, so we cannot refer to it later.

In [30]:
class IndexNoSelf:
    def __init__(self, k, t):
        k = k
        index = [(t[i:i+k], i) for i in range(len(t)-k+1)] # list of tuples
        index.sort() # sort for binary search
        
    def hello_world(self):
        print("hello world")

ind = IndexNoSelf(3, 'GTGCGTGTGGGGG')
ind.k

AttributeError: 'IndexNoSelf' object has no attribute 'k'

OK, back to implementing our *k*-mer indexing. Now, we can write a wrapper function to feed everything into our Index class and return a list of hit positions.

In [31]:
def query_index(pattern, text, index):
    k = index.k
    hits = index.query(pattern)
    offsets = [i for i in hits if pattern[k:] == text[i+k:i+len(pattern)]]
    return offsets

In [34]:
text = "GCTACGATCTAGAATCTATCTG"
pattern = "TCTA"
print(query_index(pattern, text, Index(text, 2))) # k < len(pattern)

[7, 14]


### Variations on *k*-mer indexing

We can save time on our binary searches, and save some memory, by indexing only every other *k*-mer in *T* (i.e. only those at even offsets, or only those at odd offsets). The price is that a query has to look up the *k*-mers at two adjacent offsets of the pattern, since an occurrence may start at either an even or an odd offset of *T*.
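
A minimal sketch of this variation, assuming a dictionary-based index over even offsets only and two hypothetical helpers, `build_even_index` and `query_even_index`. The pattern must be at least *k*+1 characters long so that both of its first two *k*-mers exist.

```python
def build_even_index(t, k):
    """Index only the k-mers that start at even offsets of t."""
    index = {}
    for i in range(0, len(t) - k + 1, 2):
        index.setdefault(t[i:i + k], []).append(i)
    return index


def query_even_index(p, t, index, k):
    """Find exact occurrences of p by querying with p[0:k] and p[1:k+1]."""
    occurrences = set()
    for shift in (0, 1):
        for hit in index.get(p[shift:shift + k], []):
            start = hit - shift   # where p would have to begin in t
            # verify the whole pattern at the implied offset
            if start >= 0 and t[start:start + len(p)] == p:
                occurrences.add(start)
    return sorted(occurrences)


text = "GCTACGATCTAGAATCTATCTG"
result = query_even_index("TCTA", text, build_even_index(text, 2), 2)
print(result)  # [7, 14]
```

The occurrence at offset 14 is found through the first *k*-mer of the pattern, while the one at odd offset 7 is only found through the second.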

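You can also index subsequences of *T* (for example, every other character) rather than substrings. Hits from a subsequence index tend to be more specific, so less time is wasted verifying hits that do not lead to a match.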

You can also index the suffixes of *T*. A **suffix array** stores only the sorted list of suffix offsets rather than the suffixes themselves, so the index grows linearly with the length of *T*.
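
To make the idea concrete, here is a naive sketch of a suffix array with a binary search over it. The names `build_suffix_array` and `find_occurrences` are hypothetical. Sorting with `t[i:]` as the key is simple but slow and memory-hungry for long texts; only the resulting list of offsets is linear in size.

```python
def build_suffix_array(t):
    """Offsets of all suffixes of t, sorted by the suffix they start."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def find_occurrences(p, t, sa):
    """Binary search the suffix array for suffixes that begin with p."""
    lo, hi = 0, len(sa)
    while lo < hi:   # find the leftmost suffix whose len(p)-prefix is >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and t[sa[lo]:sa[lo] + len(p)] == p:
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)


t = "GTGCGTGTGGGGG"
sa = build_suffix_array(t)
found = find_occurrences("GTG", t, sa)
print(found)  # [0, 4, 6]
```

Unlike a *k*-mer index, the same suffix array can answer queries of any length.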

The Burrows-Wheeler transform can be used to create an "FM index" that is very compact.  

### Approximate Matching

In practice, exact matching alone is not enough, because sequencing reads contain errors and individuals differ from the reference genome. We have to tolerate some number of differences by looking for approximate matches.

The kinds of differences are:
* Substitution
* Insertion
* Deletion

**Hamming distance** deals only with substitutions. 

**Edit distance** (**Levenshtein distance**) deals with substitutions, insertions, and deletions.
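
Both distances are short to compute. The sketch below uses hypothetical helper names; `edit_distance` uses the standard dynamic-programming recurrence, keeping only the previous row of the table.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x, y):
    """Minimum number of substitutions, insertions, and deletions turning x into y."""
    prev = list(range(len(y) + 1))   # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # delete a from x
                            curr[j - 1] + 1,          # insert b into x
                            prev[j - 1] + (a != b)))  # substitute, or match
        prev = curr
    return prev[-1]


print(hamming_distance("GCTA", "TCTA"))  # 1
print(edit_distance("TCTA", "TCA"))      # 1
```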

Can we modify our earlier naive exact matching algorithm to account for approximate matches? Yes!

In [35]:
def naive(p, t):
    occurrences = []
    for i in range(len(t) - len(p) + 1):
        match = True
        for j in range(len(p)):
            if t[i+j] != p[j]:
                match = False
                break
        if match:
            occurrences.append(i)
    return occurrences

In [36]:
def naive_hamming(p, t, max_mismatch):
    occurrences = []
    for i in range(len(t) - len(p) + 1):
        num_mismatch = 0 # add mismatch counter
        for j in range(len(p)):
            if t[i+j] != p[j]:
                num_mismatch += 1
                if num_mismatch > max_mismatch:
                    break
        if num_mismatch <= max_mismatch: # allow up to max_mismatch mismatches
            occurrences.append(i)
    return occurrences

In [37]:
text = "GCTACGATCTAGAATCTATCTG"
pattern = "TCTA"
naive_hamming(pattern, text, 2)

[0, 3, 7, 9, 14, 16, 18]

### Pigeonhole Principle

But adapting something like Boyer-Moore to find approximate matches is harder. The **pigeonhole principle** gives us a more general way to look for approximate matches on top of an exact matching algorithm.

The idea is that, if we allow *n* mismatches, then if we divide the pattern *P* into *n*+1 subsections, at least one of the subsections must be free of mismatches.

So, if one subsection matches the text exactly, we check the rest of the pattern around that hit and make sure the total is at most *n* mismatches. This is the **verification step**.
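
A tiny illustration of the principle, assuming a hypothetical `partition` helper that splits the pattern into *n*+1 contiguous segments of near-equal length (one of several reasonable ways to choose the boundaries):

```python
def partition(p, n):
    """Split p into n+1 contiguous segments, returned as (start, end) pairs."""
    bounds = [round(i * len(p) / (n + 1)) for i in range(n + 2)]
    return list(zip(bounds[:-1], bounds[1:]))


p, window, n = "AACTTG", "CACTTA", 2   # window differs from p at 2 positions
segments = partition(p, n)
exact = [(s, e) for s, e in segments if p[s:e] == window[s:e]]
print(segments)  # [(0, 2), (2, 4), (4, 6)]
print(exact)     # [(2, 4)]
```

The two mismatches can spoil at most two of the three segments, so the middle segment `CT` still matches exactly and can be found with any exact matching method.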

Let's implement the pigeonhole principle with Boyer-Moore:

In [42]:
def approximate_match(p, t, n):
    
    segment_length = round(len(p)/(n+1))
    all_matches = set()

    # for each of the n+1 segments of the pattern
    for i in range(n+1):
        start = i*segment_length
        end = min((i+1)*segment_length, len(p))

        # run BM on the segment
        p_bm = BoyerMoore(p[start:end], alphabet="ACGT")

        # get matches in text
        matches = boyer_moore(p[start:end], p_bm, t)

        for m in matches:

            # if extending this hit to the full pattern would run off
            # either end of the text, there is no valid alignment,
            # so we ignore it
            if m < start or m-start+len(p) > len(t):
                continue
                
            mismatches = 0
            for j in range(0, start):
                if not p[j] == t[m-start+j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            for j in range(end, len(p)):
                if not p[j] == t[m-start+j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            
            if mismatches <= n:
                all_matches.add(m-start)
                
    return list(all_matches)

In [43]:
p = 'AACTTG'
t = 'CACTTAATTTG'
approximate_match(p, t, 2)

[0, 5]

## Homework

### Q1 & 2 

How many alignments does the naive exact matching algorithm try when matching the string GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG (derived from human Alu sequences) to the excerpt of human chromosome 1?  (Don't consider reverse complements.)

How many character comparisons does the naive exact matching algorithm try when matching the string GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG (derived from human Alu sequences) to the excerpt of human chromosome 1?  (Don't consider reverse complements.)

In [5]:
def read_genome(filename):
    # skip the FASTA header line and join the remaining lines
    with open(filename, "r") as f:
        genome = f.read().split("\n")[1:]
    return "".join(genome)

In [36]:
chromosome = read_genome('genomes/chr1.GRCh38.excerpt.fasta')

In [9]:
def naive_with_counts(p, t):
    occurrences = []
    num_alignments = 0
    num_char_comparisons = 0
    for i in range(len(t) - len(p) + 1):
        match = True
        for j in range(len(p)):
            num_char_comparisons += 1
            if t[i+j] != p[j]:
                match = False
                break
        if match:
            occurrences.append(i)
        num_alignments += 1
    print(f'Number of character comparisons: {num_char_comparisons}')
    print(f'Number of alignments: {num_alignments}')
    return occurrences, num_alignments, num_char_comparisons

In [20]:
pattern = 'GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG'
a, b, c = naive_with_counts(pattern, chromosome)

Number of character comparisons: 984143
Number of alignments: 799954


### Q3

How many alignments does Boyer-Moore try when matching the string GGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGG (derived from human Alu sequences) to the excerpt of human chromosome 1?  (Don't consider reverse complements.)

In [18]:
def boyer_moore_with_counts(p, p_bm, t):
    """ Do Boyer-Moore matching """
    i = 0
    occurrences = []
    num_alignments = 0
    while i < len(t) - len(p) + 1:
        shift = 1
        mismatched = False
        for j in range(len(p)-1, -1, -1):
            if p[j] != t[i+j]:
                skip_bc = p_bm.bad_character_rule(j, t[i+j])
                skip_gs = p_bm.good_suffix_rule(j)
                shift = max(shift, skip_bc, skip_gs)
                mismatched = True
                break
        if not mismatched:
            occurrences.append(i)
            skip_gs = p_bm.match_skip()
            shift = max(shift, skip_gs)
        i += shift
        num_alignments += 1
    print(f'Number of alignments: {num_alignments}')
    return occurrences

In [21]:
p_bm = BoyerMoore(pattern, alphabet='ACGT')
a = boyer_moore_with_counts(pattern, p_bm, chromosome)

Number of alignments: 127974


### Q4. Index-assisted approximate matching

In practicals, we built a Python class called `Index` implementing an ordered-list version of the k-mer index. The `Index` class is copied below.

In [24]:
import bisect # use binary search from bisect library

class Index: # create an Index object

    # init method preprocesses string
    def __init__(self, t, k):
        self.k = k
        self.index = [(t[i:i+k], i) for i in range(len(t)-k+1)] # list of tuples
        self.index.sort() # sort for binary search

    # query method 
    def query(self, p): 
        kmer = p[:self.k] # get the first k-mer from pattern
        
        # return index of the first occurrence of (kmer, n) in self.index
        # search for position for (kmer, -1) to ensure leftmost found
        i = bisect.bisect_left(self.index, (kmer, -1)) 
        
        hits = []
        
        while i < len(self.index): # iterate through the right "half" of self.index
            
            if self.index[i][0] != kmer:
                # if kmer no longer matches then no point in continuing comparison
                break 
            
            hits.append(self.index[i][1])
            i += 1
        return hits

We also implemented the pigeonhole principle using Boyer-Moore as our exact matching algorithm.

Implement the pigeonhole principle using `Index` to find exact matches for the partitions. Assume *P* always has length 24, and that we are looking for approximate matches with up to 2 mismatches (substitutions). We will use an 8-mer index.

In [66]:
def index_approximate_match(p, t, n):
    
    segment_length = round(len(p)/(n+1))
    all_matches = set()
    ind = Index(t, 8)
    ind_hits = 0

    # for each of the n+1 segments of the pattern
    for i in range(n+1):
        start = i*segment_length
        end = min((i+1)*segment_length, len(p))
        matches = ind.query(p[start:end])
        for m in matches:
            ind_hits += 1

            # if extending this hit to the full pattern would run off
            # either end of the text, there is no valid alignment,
            # so we ignore it
            if m < start or m-start+len(p) > len(t):
                continue
                
            mismatches = 0
            for j in range(0, start):
                if not p[j] == t[m-start+j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            for j in range(end, len(p)):
                if not p[j] == t[m-start+j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            
            if mismatches <= n:
                all_matches.add(m-start)
                
    return list(all_matches), ind_hits

In [67]:
a, b = index_approximate_match('GGCGCGGTGGCTCACGCCTGTAAT', chromosome, 2)
print(len(a))

19


### Q5. 

Using the instructions given in Question 4, how many total index hits are there when searching for occurrences of GGCGCGGTGGCTCACGCCTGTAAT with up to 2 substitutions in the excerpt of human chromosome 1? (Don't consider reverse complements.)

In [68]:
print(b)

90


### Q6.

Let's examine whether there is a benefit to using an index built using subsequences of T rather than substrings, as we discussed in the "Variations on k-mer indexes" video.  We'll consider subsequences involving every N characters.  For example, if we split `ATATAT` into two substring partitions, we would get partitions `ATA` (the first half) and `TAT` (second half).  But if we split `ATATAT` into two  subsequences by taking every other character, we would get `AAA` (first, third and fifth characters) and `TTT` (second, fourth and sixth).

Another way to visualize this is using numbers to show how each character of *P* is allocated to a partition.  Splitting a length-6 pattern into two substrings could be represented as `111222`, and splitting into two subsequences of every other character could be represented as `121212`.
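
In Python, the two ways of splitting correspond to ordinary slices and slices with a step. A quick sketch using the `ATATAT` example:

```python
p = "ATATAT"

# Substring partitions (111222): contiguous halves.
substrings = [p[:3], p[3:]]

# Subsequence partitions (121212): every other character, starting at 0 and at 1.
subsequences = [p[0::2], p[1::2]]

print(substrings)    # ['ATA', 'TAT']
print(subsequences)  # ['AAA', 'TTT']
```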

The following class `SubseqIndex` is a more general implementation of `Index` that additionally handles subsequences. It only considers subsequences that take every Nth character:

In [72]:
import bisect
   
class SubseqIndex(object):
    """ Holds a subsequence index for a text T """
    
    def __init__(self, t, k, ival):
        """ Create index from all subsequences consisting of k characters
            spaced ival positions apart.  E.g., SubseqIndex("ATAT", 2, 2)
            extracts ("AA", 0) and ("TT", 1). """
        self.k = k  # num characters per subsequence extracted
        self.ival = ival  # space between them; 1=adjacent, 2=every other, etc
        self.index = []
        self.span = 1 + ival * (k - 1)
        for i in range(len(t) - self.span + 1):  # for each subseq
            self.index.append((t[i:i+self.span:ival], i))  # add (subseq, offset)
        self.index.sort()  # alphabetize by subseq
    
    def query(self, p):
        """ Return index hits for first subseq of p """
        subseq = p[:self.span:self.ival]  # query with first subseq
        i = bisect.bisect_left(self.index, (subseq, -1))  # binary search
        hits = []
        while i < len(self.index):  # collect matching index entries
            if self.index[i][0] != subseq:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits

For example, if we do the below, we see `[('AAA', 0), ('TTT', 1)]`.

In [73]:
ind = SubseqIndex('ATATAT', 3, 2)
print(ind.index)

[('AAA', 0), ('TTT', 1)]


And if we query this index with `TTATAT`, we don't get a hit, because the subsequence `TAA` is not in the index.

In [74]:
p = 'TTATAT'
print(ind.query(p))

[]


But if we query with the second subsequence, we do get a hit, because `TTT` is in the index.

In [75]:
print(p[1:])
print(ind.query(p[1:]))

TATAT
[1]


Write a function that, given a length-24 pattern P and given a SubseqIndex object built with k = 8 and ival = 3, finds all approximate occurrences of P within T with up to 2 mismatches.

When using this function, how many total index hits are there when searching for GGCGCGGTGGCTCACGCCTGTAAT with up to 2 substitutions in the excerpt of human chromosome 1?  (Again, don't consider reverse complements.)

In [87]:
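def subseq_approximate_match(p, t, n):
    
    all_matches = set()
    ind = SubseqIndex(t, 8, 3)
    ind_hits = 0

    # query once per starting offset 0..n; with k = 8 and ival = 3, the
    # subsequences starting at offsets 0, 1 and 2 partition a length-24 pattern
    for i in range(n+1):
        start = i # no end: the subsequence takes every 3rd char from start
        matches = ind.query(p[start:])
        for m in matches:
            ind_hits += 1

            # ignore hits whose implied alignment runs off either end of t
            if m < start or m-start+len(p) > len(t):
                continue

            # verify the whole pattern: the subsequence that produced the hit
            # is interleaved with the rest of p, so there is no contiguous
            # matched block to skip over
            mismatches = 0
            for j in range(len(p)):
                if not p[j] == t[m-start+j]:
                    mismatches += 1
                    if mismatches > n:
                        break
            
            if mismatches <= n:
                all_matches.add(m-start)
                
    return list(all_matches), ind_hits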

In [86]:
a, b = subseq_approximate_match('GGCGCGGTGGCTCACGCCTGTAAT', chromosome, 2)
print(b)

79
