### Algorithms for DNA sequencing: Programming Homework 3 Problems 3-4

In [None]:
Question 3
In a practical, we saw a function for finding the longest exact overlap (suffix/prefix match) between two strings. 
The function is copied below.

In [None]:
def overlap(a, b, min_length=3):
    """ Return length of longest suffix of 'a' matching
        a prefix of 'b' that is at least 'min_length'
        characters long.  If no such overlap exists,
        return 0. """
    start = 0  # start all the way at the left
    while True:
        start = a.find(b[:min_length], start)  # look for b's prefix in a
        if start == -1:  # no more occurrences to right
            return 0
        # found occurrence; check for full suffix/prefix match
        if b.startswith(a[start:]):
            return len(a)-start
        start += 1  # move just past previous match

Say we are concerned only with overlaps that (a) are exact matches (no differences allowed), and (b) are at least `k` bases long. To make an overlap graph, we could call `overlap(a, b, min_length=k)` on every possible pair of reads from the dataset.  Unfortunately, that will be very slow!

Consider this: Say we are using k=6, and we have a read `a` whose length-6 suffix is `GTCCTA`.  Say `GTCCTA` does not occur in any other read in the dataset.  In other words, the 6-mer `GTCCTA` occurs at the end of read `a` and nowhere else.  It follows that `a`'s suffix cannot possibly overlap the prefix of any other read by 6 or more characters.

Put another way, if we want to find the overlaps involving a suffix of read `a` and a prefix of some other read, we can ignore any reads that don't contain the length-k suffix of `a`.  This is good news because it can save us a lot of work!

Here is a suggestion for how to implement this idea.  You don't have to do it this way, but this might help you.  Let every k-mer in the dataset have an associated Python `set` object, which starts out empty.  We use a Python dictionary to associate each k-mer with its corresponding `set`.

(1) For every k-mer in a read, we add the read to the `set` object corresponding to that k-mer.  If our read is `GATTA` and k=3, we would add `GATTA` to the `set` objects for `GAT`, `ATT` and `TTA`.  We do this for every read so that, at the end, each `set` contains all reads containing the corresponding k-mer.

(2) Now, for each read `a`, we find all overlaps involving a suffix of `a`.  To do this, we take `a`'s length-k suffix, find all reads containing that k-mer (obtained from the corresponding `set`) and call `overlap(a, b, min_length=k)` for each.
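As a minimal sketch of step (1), the index for a tiny made-up dataset can be built as follows. The function name `build_kmer_index` and the second read `ATTAC` are illustrative assumptions, not part of the assignment.

```python
from collections import defaultdict

def build_kmer_index(reads, k):
    """Map each k-mer to the set of reads that contain it (hypothetical helper)."""
    index = defaultdict(set)
    for read in reads:
        # Slide a window of width k over the read; only full-length k-mers are visited.
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read)
    return index

# Toy data: GATTA is the example from the text, ATTAC is made up.
index = build_kmer_index(["GATTA", "ATTAC"], 3)
print(sorted(index["GAT"]))  # ['GATTA']
print(sorted(index["ATT"]))  # ['ATTAC', 'GATTA']
```

For step (2), only the reads in `index[a[-k:]]` would then need to be passed to `overlap`.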

The most important point is that we do not call `overlap(a, b, min_length=k)` if `b` does not contain the length-k suffix of `a`.

Download and parse the read sequences from the provided Phi-X FASTQ file. We'll just use their base sequences, so you can ignore read names and base qualities.  Also, no two reads in the FASTQ have the same sequence of bases.  This makes things simpler.

https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ERR266411_1.for_asm.fastq

Next, find all pairs of reads with an exact suffix/prefix match of length at least 30. Don't overlap a read with itself; if a read has a suffix/prefix match to itself, ignore that match.  Ignore reverse complements.

Hint 1: Your function should not take much more than 15 seconds to run on this 10,000-read dataset, and maybe much less than that.  (Our solution takes about 3 seconds.) If your function is much slower, there is a problem somewhere.

Hint 2: Remember not to overlap a read with itself. If you do, your answers will be too high.

Hint 3: You can test your implementation by making up small examples, then checking that (a) your implementation runs quickly, and (b) you get the same answer as if you had simply called `overlap(a, b, min_length=k)` on every pair of reads.  We also have provided a couple examples you can check against.
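One way to carry out the check in Hint 3 is sketched below. The helper names (`overlap_length`, `naive_overlap_pairs`, `indexed_overlap_pairs`) and the three toy reads are assumptions made up for illustration; `overlap_length` is a copy of the `overlap` function from the question, renamed so the sketch is self-contained.

```python
from itertools import permutations

def overlap_length(a, b, min_length=3):
    """Length of the longest suffix of a matching a prefix of b, or 0."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def naive_overlap_pairs(reads, k):
    """Reference answer: try every ordered pair of distinct reads."""
    return {(a, b) for a, b in permutations(reads, 2)
            if overlap_length(a, b, k) > 0}

def indexed_overlap_pairs(reads, k):
    """Same result, but only compare a against reads containing a's k-suffix."""
    index = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            index.setdefault(read[i:i + k], set()).add(read)
    pairs = set()
    for a in reads:
        for b in index.get(a[-k:], ()):
            if a != b and overlap_length(a, b, k) > 0:
                pairs.add((a, b))
    return pairs

small_reads = ["ABCDEFG", "EFGHIJ", "HIJABC"]  # made-up test reads
print(naive_overlap_pairs(small_reads, 3) == indexed_overlap_pairs(small_reads, 3))  # True
print(len(indexed_overlap_pairs(small_reads, 3)))  # 3
```

Because the naive version is quadratic in the number of reads, it is only practical as a cross-check on small inputs.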



Q3: Picture the overlap graph corresponding to the overlaps just calculated.  How many edges are in the graph?  In other words, how many distinct pairs of reads overlap?

In [None]:
# Problems 3 & 4
# Read FASTQ function

def readFastq(filename):
    sequences = []
    qualities = []
    with open(filename) as fh:
        while True:
            fh.readline() # skip name line
            seq = fh.readline().rstrip() # read base sequence
            fh.readline() # skip placeholder line
            qual = fh.readline().rstrip() #base quality line
            if len(seq) == 0:
                break
            sequences.append(seq)
            qualities.append(qual)
    return sequences, qualities

In [None]:
reads, _ = readFastq('ERR266411_1.for_asm.fastq')

In [None]:
len(reads)

10000

In [None]:
# Answers 3 & 4
# Build dictionary of k-mers
from collections import defaultdict

def kmer_dictionary(reads, k):
    kmer_dict = defaultdict(set)  # each k-mer starts out with an empty set
    for read in reads:  # go through the list of reads
        for i in range(len(read) - k + 1):  # every start position of a full-length k-mer
            kmer = read[i:i+k]  # get the k-mer
            kmer_dict[kmer].add(read)  # add the read that the k-mer is found in
    return kmer_dict

In [None]:
# Answers 3 & 4
def overlap(a, b, min_length=30):
    """ Return the pair (a, b) if a suffix of 'a' matches a
        prefix of 'b' over at least 'min_length' characters.
        If no such overlap exists, return 0. """
    start = 0  # start all the way at the left
    while True:
        start = a.find(b[:min_length], start)  # look for b's prefix in a, from index start
        if start == -1:  # no more occurrences to the right
            return 0
        # found occurrence; check that the prefix of b equals the suffix of a from start
        if b.startswith(a[start:]):
            return (a, b)  # return the reads themselves rather than the overlap length
        start += 1  # move just past the previous match before searching again

In [None]:
# Quiz answer 3: the edges in the graph
def overlap_all_reads(reads, k):
    overlapping_pairs = []
    kmer_dict = kmer_dictionary(reads, k)
    for read in reads:
        suffix = read[-k:]  # length-k suffix of the read
        reads_with_kmer = kmer_dict[suffix]  # only reads containing that suffix can overlap
        for r in reads_with_kmer:
            if r != read:  # don't overlap a read with itself
                overlaps = overlap(read, r, k)
                if overlaps:
                    overlapping_pairs.append(overlaps)

    return overlapping_pairs

In [None]:
my_overlaps = overlap_all_reads(reads, 30)  # list of overlapping (a, b) pairs, i.e. edges

In [None]:
# Answer 3: number of edges in the graph
len(my_overlaps)

904746

In [None]:
my_overlaps[0:10]

[('TAAACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTC',
  'AAACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTCT'),
 ('TAAACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTC',
  'AACAAGCAGTAGTAATTCCTGCTTTATCAAGATAATTTTTCGACTCATCAGAAATATCCGAAAGTGTTAACTTCTGCGTCATGGAAGCGATAAAACTCTG'),
 ('AGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATC',
  'AAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCAGGAGGCAGAAGTTAACACTTTCGGATATTTCTGAGGAGTCGAAAAATAATCTTGATA'),
 ('AGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATC',
  'TTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGTTGGGCC'),
 ('AGCCGACGTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAATTTATGCGCGCTTCGATAAAAATGATTGGCGTATCCAACCTGCAGAGTTTTATC',
  'GTTTTGGCGGCGCAACCTGTGACGACAAATCTGCTCAAA

In [None]:
Q4: Picture the overlap graph corresponding to the overlaps computed for the previous question. How many nodes in this graph have at least one outgoing edge?  
(In other words, how many reads have a suffix involved in an overlap?)
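Given a list of overlapping `(a, b)` pairs, the two quantities asked about in Q3 and Q4 can be read off directly. This is only a sketch on made-up pairs; the variable names are illustrative.

```python
# Made-up edges of a small overlap graph: each pair is (source read, target read).
pairs = [("ABCDEFG", "EFGHIJ"), ("EFGHIJ", "HIJABC"), ("ABCDEFG", "EFGXYZ")]

# Edges: every distinct ordered pair counts once.
num_edges = len(set(pairs))

# Nodes with at least one outgoing edge: distinct reads appearing as the source.
sources = {a for a, _ in pairs}

print(num_edges)     # 3
print(len(sources))  # 2
```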

In [None]:
# Q4: the number of nodes that have at least one outgoing edge
def overlap_all_reads(reads, k):
    overlapping_pairs = {}  # keyed by read, so each source read is counted once
    kmer_dict = kmer_dictionary(reads, k)
    for read in reads:
        suffix = read[-k:]  # length-k suffix of the read
        reads_with_kmer = kmer_dict[suffix]
        for r in reads_with_kmer:
            if r != read:  # don't overlap a read with itself
                overlaps = overlap(read, r, k)
                if overlaps:
                    overlapping_pairs[read] = overlaps  # later overlaps overwrite earlier ones

    return overlapping_pairs

In [None]:
reads_involved = overlap_all_reads(reads, 30)  # dict keyed by reads with an outgoing edge

In [None]:
#Answer Q4:  Number of nodes with at least 1 outgoing edge
len(reads_involved)

7161