## Naive overlap (basic approach)

We want to implement a naive approach to finding the longest overlap between two strings. We define a minimum overlap length and are only interested in overlaps of at least that length.

As shown in the lecture slides, one way is to take the minimal prefix of the second string (its first *minimum overlap length* characters) and search for it in the first string from left to right. The logic is that the longer the overlap, the further to the left this minimal prefix occurs in the first string.

For illustration, compare these two cases, where *minimum overlap length = 3*:

  X: ACC**TAGCA**   vs    X:  CCCGG**TAG**

  Y: **TAGCA**TGA   vs      Y: **TAG**AATGC
  
   
We can see that in the left (first) case, where the matching prefix is longer, TAG occurs further to the left in the first string than in the right case, where TAG is the whole (shorter) overlap.

We exploit this by scanning the first string from left to right for an occurrence of this minimal prefix. Once we find one, instead of extending the match character by character in both strings, we check whether the suffix of the first string that starts at this occurrence is a prefix of the second string. If yes, we have found the longest overlap. If no, we search further to the right for the next occurrence of the minimal prefix in the first string. If no suffix of the first string that starts with the minimal prefix of the second string is also a prefix of the second string, the two strings have no overlap of the required length.

In [1]:
def overlap(a, b, min_length=3):
    """Return the length of the longest suffix of a that matches a prefix of b
    and is at least min_length long; return 0 if there is no such overlap."""
    start = 0  # initial offset into a
    while True:
        # look for the minimal prefix of b in a, starting at the current offset
        start = a.find(b[:min_length], start)
        if start == -1:  # no further occurrence, so there is no overlap
            return 0
        # occurrence found: check whether this suffix of a is also a prefix of b
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1  # move the offset past this occurrence

In [2]:
overlap('abcdefcd','cd',2)

2
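
As a quick check of the reasoning above, the following sketch runs the same search on the two illustrated cases and on a third, made-up pair in which the first occurrence of the minimal prefix does not extend to the end of the first string. The function is restated only so that the snippet runs on its own; the third input is a hypothetical example, not taken from the lecture.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that is a prefix of b (0 if none >= min_length)."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

# Left case: TAG is found at index 3 and the suffix TAGCA is a prefix of Y
print(overlap('ACCTAGCA', 'TAGCATGA'))  # 5
# Right case: TAG is found at index 5 and the suffix TAG is a prefix of Y
print(overlap('CCCGGTAG', 'TAGAATGC'))  # 3
# Hypothetical case: the occurrence at index 0 fails, the one at index 5 succeeds
print(overlap('TAGCTTAG', 'TAGAATGC'))  # 3
```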

## Shortest common superstring

As explained in *Gusfield 16.17.2*, we can define an objective function for the shortest common superstring of a set of reads in terms of permutations.


*Throughout the discussion of superstrings, we assume that no string in the input set is a substring of another input string.* Any such substring can easily be detected and removed. Clearly, a superstring of the remaining set is also a superstring of the original set.
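
The removal step mentioned above could look roughly like the sketch below. The helper name `remove_substrings` is an assumption made for illustration, and the quadratic scan is only meant for small inputs.

```python
def remove_substrings(reads):
    """Drop every read that occurs inside another read, plus exact duplicates."""
    unique = list(dict.fromkeys(reads))  # remove exact duplicates, keep input order
    return [r for r in unique
            if not any(r != other and r in other for other in unique)]

# 'TGCC' is contained in 'ACTGCCC', and 'CCCCATG' appears twice
print(remove_substrings(['ACTGCCC', 'TGCC', 'CCCCATG', 'CCCCATG']))
# ['ACTGCCC', 'CCCCATG']
```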


First let's define S(L), the concatenation of non-overlapping prefixes for a given ordering L of the strings.

> *S(L) = pref(So1, So2)pref(So2, So3)...pref(Sot-1, Sot)Sot*

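where the Soi are the reads (strings) in the chosen order and *pref(Soi, Soj)* is the prefix of Soi that ends where the longest overlap of Soi and Soj begins.

Or we can define it using the overlap suffix, where *suf(Soi, Soj)* is the part of Soj that remains after removing its longest overlap with Soi:
> *S(L) = So1 suf(So1, So2) suf(So2, So3) suf(So3, So4)...suf(Sot-1, Sot)*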

Then S(L) is a common superstring (not necessarily the shortest) of the strings So1, So2, ..., Sot.
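
To make the two definitions concrete, here is a small sketch that builds S(L) for one fixed order, once from prefixes and once from suffixes. The names `longest_overlap`, `superstring_by_pref` and `superstring_by_suf` are hypothetical, and `longest_overlap` is a simplified stand-in that accepts any overlap of length 1 or more.

```python
def longest_overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def superstring_by_pref(order):
    """S(L) = pref(S1,S2) pref(S2,S3) ... pref(S_{t-1},S_t) S_t."""
    parts = []
    for a, b in zip(order, order[1:]):
        k = longest_overlap(a, b)
        parts.append(a[:len(a) - k])  # pref(a, b): a without its overlapping tail
    parts.append(order[-1])
    return ''.join(parts)

def superstring_by_suf(order):
    """S(L) = S1 suf(S1,S2) suf(S2,S3) ... suf(S_{t-1},S_t)."""
    parts = [order[0]]
    for a, b in zip(order, order[1:]):
        k = longest_overlap(a, b)
        parts.append(b[k:])           # suf(a, b): b without its overlapping head
    return ''.join(parts)

order = ['ACTGCCC', 'CCCCATG', 'GTCAC']
print(superstring_by_pref(order))  # ACTGCCCCATGTCAC
print(superstring_by_suf(order))   # ACTGCCCCATGTCAC
```

Both constructions give the same string; they only differ in which side of each overlap is cut away.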


Each S(L) corresponds to one permutation of the input strings in which all overlaps between neighbours are merged. The SCS is then the S(L) of the permutation that yields the shortest string (more in *Gusfield 16.17.2*). This is our objective function: find the permutation that *minimizes* the length of S(L). Equivalently, find the permutation that *maximizes* the total overlap between neighbouring strings.

We can implement this as a function that goes through all possible permutations of the input strings. For each permutation, it concatenates neighbouring strings by merging them at their maximum overlap. The permutation yielding the shortest superstring gives us the *shortest common superstring*. We've implemented this using the overlap suffix (instead of the prefix) since it's more convenient that way.

The function receives a list of strings (reads) and returns the shortest common superstring. As mentioned above, we assume that no string in the input is a substring of another.

In [4]:
import itertools

def scs(ss):
    """ Returns shortest common superstring of given strings,
        assuming no string is a strict substring of another """
    shortest_sup = None
    # iterate through all possible permutations
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]
        # walk through the permutation, appending each overlap suffix
        for i in range(len(ss)-1):
            olen = overlap(ssperm[i], ssperm[i+1], min_length=1)
            sup += ssperm[i+1][olen:]
        # checking if this is minimal superstring
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup
    return shortest_sup

In [5]:
reads = ['ACTGCCC', 'CCCCATG', 'GTTGTGT', 'CGTCGG', 'GTCAC']

In [6]:
%%timeit
sup=scs(reads)

1000 loops, best of 3: 872 µs per loop
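
The timing above is for five reads. Since `scs` tries every ordering, the number of permutations it has to examine grows as n!, which the following sketch makes visible. The helper name `count_orderings` is an assumption for illustration.

```python
import itertools
import math

def count_orderings(n):
    """Number of permutations a brute-force search visits for n reads."""
    return math.factorial(n)

# Sanity check against an explicit enumeration for a small n
assert count_orderings(5) == sum(1 for _ in itertools.permutations(range(5)))

for n in (5, 8, 10, 12):
    print(n, count_orderings(n))
# 5 120
# 8 40320
# 10 3628800
# 12 479001600
```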


In [7]:
sup=scs(reads)
print('SCS is %s with length %d'% (sup, len(sup)))

SCS is CGTCGGTTGTGTCACTGCCCCATG with length 24
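
A result like this is easy to sanity-check: every read must occur somewhere in the superstring. The sketch below does that for the string printed above; `is_common_superstring` is a hypothetical helper name.

```python
def is_common_superstring(candidate, reads):
    """True if every read occurs somewhere inside candidate."""
    return all(read in candidate for read in reads)

reads = ['ACTGCCC', 'CCCCATG', 'GTTGTGT', 'CGTCGG', 'GTCAC']
sup = 'CGTCGGTTGTGTCACTGCCCCATG'  # the result printed above

print(is_common_superstring(sup, reads), len(sup))  # True 24
```

Note that this only confirms that the string is a common superstring, not that it is the shortest one.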


## Greedy approach to SCS

Now let's look at the greedy approach. First we define a function that greedily picks the maximum overlap. It takes a set of reads and a minimum overlap length and returns the two reads with the largest overlap, together with the overlap length.

To do so, we generate all ordered pairs (2-permutations) of the input strings, iterate through them, and return the pair with the largest overlap.

In [8]:
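def pick_maximum_overlap(reads, min_overlap=3):
    """Return (overlap length, first read, second read) for the ordered pair
    with the largest overlap; the length is 0 if no pair overlaps."""
    best_overlap = 0
    first = None
    second = None
    for a, b in itertools.permutations(reads, 2):
        # pass the threshold on, so that min_overlap actually takes effect
        ovlp_length = overlap(a, b, min_length=min_overlap)
        if ovlp_length > best_overlap:
            best_overlap = ovlp_length
            first = a
            second = b
    return best_overlap, first, second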
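
To see what the greedy step has to choose from, it can help to list every ordered pair that overlaps at all. The sketch below does this for the reads used earlier; `longest_overlap` and `overlap_table` are hypothetical names, and `longest_overlap` is a simplified stand-in for the `overlap` function.

```python
import itertools

def longest_overlap(a, b, min_length=3):
    """Longest suffix of a that is a prefix of b, or 0 if shorter than min_length."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_table(reads, min_length=3):
    """Map each ordered pair (a, b) with a non-zero overlap to its overlap length."""
    table = {}
    for a, b in itertools.permutations(reads, 2):
        k = longest_overlap(a, b, min_length)
        if k > 0:
            table[(a, b)] = k
    return table

reads = ['ACTGCCC', 'CCCCATG', 'GTTGTGT', 'CGTCGG', 'GTCAC']
print(overlap_table(reads, min_length=3))
# {('ACTGCCC', 'CCCCATG'): 3}
```

With a minimum overlap of 3, only one ordered pair of these reads qualifies, so that is the pair the greedy step picks first.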

Next, using the previously defined function, we create a greedy SCS function. It receives a list of strings and returns a greedily constructed common superstring.

In each loop iteration we remove the two reads that were just merged and add the merged read to the set. Then we pick the maximum overlap from the updated set and repeat until no overlapping reads remain; whatever reads are left are simply concatenated.


In [9]:
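def greedy_scs(reads, min_overlap=3):
    reads = list(reads)  # work on a copy so the caller's list is not modified
    # olen (not "overlap") so the name does not shadow the overlap() function
    olen, read1, read2 = pick_maximum_overlap(reads, min_overlap)
    # in each loop we remove the merged reads and add the newly created read
    while olen > 0:
        reads.remove(read1)
        reads.remove(read2)
        reads.append(read1 + read2[olen:])
        olen, read1, read2 = pick_maximum_overlap(reads, min_overlap)
    # no overlapping pair is left: concatenate whatever remains
    return ''.join(reads)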

In [10]:
%%timeit
greedy_scs(reads)

The slowest run took 4.90 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.18 µs per loop


In [11]:
sup=greedy_scs(reads)
print('Greedy SCS is %s with length %d'% (sup, len(sup)))

Greedy SCS is GTTGTGTCGTCGGGTCACACTGCCCCATG with length 29
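
The greedy result is 5 characters longer than the brute-force one. One visible reason is the threshold: `scs` calls `overlap` with `min_length=1`, while the greedy functions use a minimum overlap of 3. The sketch below records each merge so this can be seen directly; `greedy_merge_trace` and `longest_overlap` are hypothetical names, and the code is a self-contained approximation of the loop above rather than the notebook's own functions.

```python
import itertools

def longest_overlap(a, b, min_length=3):
    """Longest suffix of a that is a prefix of b, or 0 if shorter than min_length."""
    for k in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_merge_trace(reads, min_overlap=3):
    """Return (list of (a, b, overlap) merges, final concatenated string)."""
    reads = list(reads)  # do not modify the caller's list
    steps = []
    while True:
        best = (0, None, None)
        for a, b in itertools.permutations(reads, 2):
            k = longest_overlap(a, b, min_overlap)
            if k > best[0]:
                best = (k, a, b)
        k, a, b = best
        if k == 0:  # no overlapping pair left
            return steps, ''.join(reads)
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[k:])
        steps.append((a, b, k))

reads = ['ACTGCCC', 'CCCCATG', 'GTTGTGT', 'CGTCGG', 'GTCAC']
steps, sup = greedy_merge_trace(reads, min_overlap=3)
print(steps)           # [('ACTGCCC', 'CCCCATG', 3)]
print(sup, len(sup))   # GTTGTGTCGTCGGGTCACACTGCCCCATG 29
```

Only one merge happens here; the other reads share no overlap of length 3 or more and are appended unchanged.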
