# Week 4: Algorithms for Assembly

## Naive Shortest Common Superstring

In [1]:
def overlap(a, b, min_length=3):
    """Return the length of longest suffix of var 'a'
       matching a prefix of var 'b' that is at least min_length
       long. Return 0 if none exists."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a)-start
        start += 1

In [2]:
overlap("ATGTGTG", "TGTG")

4

In [3]:
import itertools

def scs(ss):
    shortest_sup = None
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]
        for i in range(len(ss)-1):
            olen = overlap(ssperm[i], ssperm[i+1], min_length = 1)
            sup += ssperm[i+1][olen:]
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup
    return shortest_sup

In [4]:
scs(['ACT', 'CTA', 'AAG'])

'ACTAAG'

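This brute-force approach is very slow: it tries every ordering of the input strings, so its running time grows at least as fast as O(n!), where n is the number of strings.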
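
To get a feel for how quickly this blows up, the sketch below counts the orderings `scs` has to examine for a few input sizes. The sizes are arbitrary illustrative choices, not values from the course.

```python
import math

# Number of orderings the brute-force scs() examines for n input strings.
# The values of n are illustrative choices.
num_orderings = {n: math.factorial(n) for n in (3, 6, 10, 12)}

for n, count in num_orderings.items():
    print(f"{n:>2} strings -> {count:,} permutations")
```

Each permutation also costs several `overlap` calls, so the real running time is somewhat worse than the raw count suggests.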

## Greedy Shortest Common Superstring

This algorithm makes a series of locally best decisions that get us to a solution quickly, but it may not give us the optimal solution. At each step we merge the pair of reads with the longest overlap.

In [5]:
# helper function
def pick_maximal_overlap(reads, k):
    reada, readb = None, None
    best_olen = 0
    for a, b in itertools.permutations(reads, 2):
        olen = overlap(a, b, min_length = k)
        if olen > best_olen:
            reada, readb = a, b
            best_olen = olen
    return reada, readb, best_olen
    
def greedy_scs(reads, k):
    reads = list(reads)  # work on a copy so the caller's list is not modified
    reada, readb, olen = pick_maximal_overlap(reads, k)
    while olen > 0:
        # merge the best-overlapping pair into a single read
        reads.remove(reada)
        reads.remove(readb)
        reads.append(reada+readb[olen:])
        reada, readb, olen = pick_maximal_overlap(reads, k)
    return "".join(reads) # concatenate any reads left without overlap

In [6]:
greedy_scs(['ABCD', 'CDBC', 'BCDA'], 1)

'CDBCABCDA'

In [7]:
scs(['ABCD', 'CDBC', 'BCDA'])

'ABCDBCDA'

So the greedy algorithm doesn't always give us the *actual* shortest common superstring (here it returns 9 characters where 8 suffice), but in most cases the time saved by using it will be worth it.
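
To see why the greedy choice goes wrong on this input, it can help to list every pairwise overlap. The following is a small sketch that repeats the `overlap` function so it runs on its own; the `pair_overlaps` name is just for illustration.

```python
import itertools

def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that matches a prefix of b,
    or 0 if no such match of at least min_length exists."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

reads = ['ABCD', 'CDBC', 'BCDA']

# Overlap length for every ordered pair of distinct reads.
pair_overlaps = {(a, b): overlap(a, b, min_length=1)
                 for a, b in itertools.permutations(reads, 2)}

# Print the pairs from longest to shortest overlap.
for (a, b), olen in sorted(pair_overlaps.items(), key=lambda kv: -kv[1]):
    print(a, '->', b, olen)
```

Greedy first merges `ABCD` and `BCDA` (overlap 3) into `ABCDA`, which no longer overlaps `CDBC` at either end, so the total overlap used is 3. The order `ABCD`, `CDBC`, `BCDA` uses two overlaps of length 2, for a total of 4 and a superstring one character shorter.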

## The Third Law of Assembly: Repeats are Bad

If the genome contains repeats, the shortest common superstring will not be the best solution: some copies of a repeat are lost, which is called **overcollapsing**. About half of the human genome consists of repetitive sequence, so this is a significant problem.
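
A toy example makes the effect concrete. The sketch below uses a made-up "genome" containing three tandem copies of `ABC`, takes every length-4 substring as a read, and runs the brute-force `scs` on them. The genome string and read length are assumptions chosen for illustration, and `overlap` and `scs` are repeated so the block runs on its own.

```python
import itertools

def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that matches a prefix of b,
    or 0 if no such match of at least min_length exists."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def scs(ss):
    """Brute-force shortest common superstring over all orderings of ss."""
    shortest_sup = None
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]
        for i in range(len(ss) - 1):
            olen = overlap(ssperm[i], ssperm[i + 1], min_length=1)
            sup += ssperm[i + 1][olen:]
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup
    return shortest_sup

genome = 'ABCABCABCD'   # made-up genome with a tandem repeat of ABC
read_len = 4            # illustrative read length
reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

sup = scs(reads)
print(genome, len(genome))
print(sup, len(sup))
```

Every read is still a substring of the result, yet the result is shorter than the genome because copies of the repeat have been collapsed together.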

## De Bruijn Graphs and Eulerian Walks

A de Bruijn graph has an Eulerian walk (a walk that crosses every edge exactly once) as long as it contains exactly one edge for each occurrence of a *k*-mer in the genome.

In [8]:
def de_bruijn_ize(st, k):
    edges = []
    nodes = set()
    for i in range(len(st)-k+1):
        left, right = st[i:i+k-1], st[i+1:i+k]
        edges.append((left, right))
        nodes.add(left)
        nodes.add(right)
    return nodes, edges

In [9]:
nodes, edges = de_bruijn_ize('ACGCGTCG', 3)

print(nodes)
print(edges)
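
Given such a graph, an Eulerian walk can be found with Hierholzer's algorithm. The following is a minimal sketch, not code from the course: `eulerian_walk` and `spell` are hypothetical helper names, and the walk function assumes that an Eulerian walk actually exists. `de_bruijn_ize` is repeated so the block runs on its own.

```python
from collections import defaultdict

def de_bruijn_ize(st, k):
    """Return the nodes ((k-1)-mers) and edges (one per k-mer) for string st."""
    edges = []
    nodes = set()
    for i in range(len(st) - k + 1):
        left, right = st[i:i + k - 1], st[i + 1:i + k]
        edges.append((left, right))
        nodes.add(left)
        nodes.add(right)
    return nodes, edges

def eulerian_walk(edges):
    """Return the node sequence of a walk that uses every edge exactly once.
    Hypothetical helper; assumes such a walk exists."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)   # out-degree minus in-degree
    for src, dst in edges:
        out_edges[src].append(dst)
        balance[src] += 1
        balance[dst] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, d in balance.items() if d == 1), edges[0][0])
    stack, walk = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())   # follow an unused edge
        else:
            walk.append(stack.pop())              # no unused edges left here
    return walk[::-1]

def spell(walk):
    """Spell the string for a walk: first node, then the last base of each later node."""
    return walk[0] + ''.join(node[-1] for node in walk[1:])

_, edges = de_bruijn_ize('ACGCGTCG', 3)
assembled = spell(eulerian_walk(edges))
print(assembled)
```

This graph has two Eulerian walks, spelling `ACGCGTCG` and `ACGTCGCG`, so the walk returned need not spell the original string. Both use exactly the same 3-mers, which is the repeat ambiguity described above showing up in the graph.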

In [11]:
def visualize_de_bruijn(st, k):
    nodes, edges = de_bruijn_ize(st, k)
    dot_str = 'digraph "DeBruijn graph" {\n'
    for node in nodes:
        dot_str += ' %s [label="%s"] ; \n' % (node, node)
    for src, dst in edges:
        dot_str += ' %s -> %s ; \n' % (src, dst)
    return dot_str + '}\n'

In [12]:
#import graphviz

In [13]:
# graphviz.Source(visualize_de_bruijn('ACGCGTCG', 3))

Assembly in practice requires a graph of some kind, whether an overlap graph or a de Bruijn graph, and we have to trim away the spurious dead ends and islands created by sequencing errors. Polyploidy also complicates assembly: we can have two different versions of a gene, one from Mom and one from Dad, which causes "bubbles" to appear in the graph. Even if the overall graph is ambiguous, we can still report **contigs**, the stretches that can be assembled unambiguously.
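
To illustrate where the spurious parts of the graph come from, the sketch below compares the *k*-mers of an error-free read with those of the same read carrying one substitution. Both reads and the `kmers` helper are made up for illustration.

```python
def kmers(read, k):
    """Return the set of all length-k substrings of read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}

k = 3
true_read = 'ATGCCGTAGA'    # made-up error-free read
error_read = 'ATGCCATAGA'   # same read with one substitution (G -> A at index 5)

# k-mers that exist only because of the error.
spurious = kmers(error_read, k) - kmers(true_read, k)
print(sorted(spurious))
```

A single substitution can introduce up to *k* *k*-mers that do not occur in the genome, and each one becomes an extra edge in a de Bruijn graph. Whether those edges form a dead end, an island, or a bubble depends on where the error falls in the read and on what other reads cover the same region.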

The solution to ambiguity is, really, longer reads. If a read spans an entire repeat region and captures non-repeat sequence on both sides, we can unambiguously determine the length of the repeat region. The drawback is that longer reads tend to suffer from more sequencing errors.

Paired-end sequencing roughly doubles the amount of sequence we get from each fragment, but that is still not enough to span long repeats.

## Homework

### 1. What is the length of the shortest common superstring of the following strings?

In [15]:
ss = ["CCT", "CTT", "TGC", "TGG", "GAT", "ATT"]
len(scs(ss))

11

### 2. How many different shortest common superstrings are there?

In [23]:
def scs_with_counts(ss):
    shortest_sup = None
    scss = []
    for ssperm in itertools.permutations(ss):
        sup = ssperm[0]
        for i in range(len(ss)-1):
            olen = overlap(ssperm[i], ssperm[i+1], min_length = 1)
            sup += ssperm[i+1][olen:]
        if shortest_sup is None or len(sup) < len(shortest_sup):
            shortest_sup = sup
            scss = [sup]
        elif len(sup) == len(shortest_sup):
            scss.append(sup)
    return shortest_sup, scss

In [26]:
a, b = scs_with_counts(ss)

In [28]:
print(len(b))

4


### 3. Download this FASTQ file containing synthetic sequencing reads from a mystery virus:

https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/ads1_week4_reads.fq

All the reads are the same length (100 bases) and are exact copies of substrings from the forward strand of the virus genome.  You don't have to worry about sequencing errors, ploidy, or reads coming from the reverse strand.

Assemble these reads using one of the approaches discussed, such as greedy shortest common superstring.  Since there are many reads, you might consider ways to make the algorithm faster, such as the one discussed in the programming assignment in the previous module.

How many As are there in the full, assembled genome?

Hint: the virus genome you are assembling is exactly 15,894 bases long

In [48]:
def read_fastq(filename):
    with open(filename, 'r') as f:
        content = f.read().rstrip("\n").split("\n")
        seqs = [content[i] for i in range(len(content)) if i%4 == 1]
        quals = [content[i] for i in range(len(content)) if i%4 == 3]
    return seqs, quals

In [49]:
seqs, _ = read_fastq('ads1_week4_reads.fq')

In [50]:
%%time
result = greedy_scs(seqs, 10)

CPU times: total: 13min 24s
Wall time: 14min 7s
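
Most of that time goes into `pick_maximal_overlap`, which calls `overlap` on every ordered pair of reads in every round. One way to cut this down, assumed here to be the kind of speed-up the question hints at, is to index the reads by *k*-mer so that `overlap` is only called on pairs that share the length-*k* suffix of the first read. The following is a minimal sketch; `overlaps_with_index` is a hypothetical name, and it assumes every read is at least `k` bases long.

```python
from collections import defaultdict

def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that matches a prefix of b,
    or 0 if no such match of at least min_length exists."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

def overlaps_with_index(reads, k):
    """Map each ordered pair (a, b) with an overlap of at least k to its length.
    Hypothetical helper; assumes every read has at least k bases."""
    # Index: k-mer -> set of reads that contain it.
    index = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read)
    pairs = {}
    for a in reads:
        # Only reads containing a's length-k suffix can overlap a by k or more.
        for b in index[a[-k:]]:
            if a != b:
                olen = overlap(a, b, min_length=k)
                if olen > 0:
                    pairs[(a, b)] = olen
    return pairs

example_reads = ['ABCDEFG', 'EFGHIJ', 'HIJABC']   # toy reads for illustration
example_pairs = overlaps_with_index(example_reads, 3)
print(example_pairs)
```

This only finds the overlapping pairs. Using it inside `greedy_scs` would also require updating the index after each merge, which is not shown here.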


In [51]:
len(result)

15894

In [52]:
result.count("A")

4633

### 4. How many Ts are there in the full, assembled genome from the previous question?

In [53]:
result.count("T")

3723