# Genome Sequencing, Weeks 1 & 2: Genome Assembly

In weeks 1 and 2, we address the question of **genome assembly** using **graph algorithms**.

We cannot read an organism's genome from start to end, as we do with books. We must first digest the DNA strands into short fragments called **reads**. Then, we must combine the reads into a single genome, in a task called **genome assembly**. The instructors compare genome assembly to blowing up a stack of newspapers with dynamite, then piecing together a complete newspaper from the fragments. 

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/sequencing_overview.png" width=500px /> 
</center>

But this is not a trivial task! Although bacterial genomes can be pretty small, many organisms have genomes with billions of base pairs: humans have ~3 billion bps, and the fern *Tmesipteris oblanceolata* has a genome of [about 160 billion bps](https://doi.org/10.1016/j.isci.2024.109889). This means that, to sequence a full genome, we have to combine millions of DNA fragments in the right order. Moreover, DNA is double-stranded, so for a given read we cannot know which strand it came from, that is, whether we should use the read itself or its reverse complement. Modern sequencing techniques also still produce errors, so reads cannot be assembled with 100% certainty, and some regions may simply fail to be sequenced. 
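
To make the strand ambiguity concrete, here is a minimal sketch of the reverse complement of a read. The helper name `reverse_complement` is an assumption for illustration and is not part of the course code.

```python
# Hypothetical helper, not from the course: maps each base to its partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(read):
    """Return the opposite strand of `read`, written in reading order."""
    # complement every base, then reverse the result
    return read.translate(COMPLEMENT)[::-1]

print(reverse_complement("AATCG"))
```

A read `AATCG` sampled from one strand shows up as `CGATT` when the same region is read from the other strand, so an assembler working with real data has to consider both forms.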

The reads generated by modern sequencers often have the same length, so we will assume that all reads have the same length. For now, we will also assume that all reads come from the same strand, have no errors, and exhibit perfect coverage. We'll relax these assumptions later for more realistic datasets.

## The String Reconstruction Problem

The **String Reconstruction** problem is the task of reconstructing a genome string from a collection of reads. 

**String Reconstruction Problem**: *Reconstruct a genome from reads.*
- **Input**: A collection of strings *Reads*.
- **Output**: A string *Genome* reconstructed from *Reads*.

But this is not a computational problem—it lacks well-defined conditions that the output *Genome* must satisfy. We will first define the **String Composition** problem, which is the **inverse** of the String Reconstruction problem we eventually would like to solve. Given a string *Text*, we want to give the *k*-mer composition of *Text*, including any repeated *k*-mers.

**String Composition Problem**: *Generate the k-mer composition of a string.*
- **Input**: An integer *k* and a string *Text*.
- **Output**: *Composition*<sub>*k*</sub>(*Text*) (the *k*-mers can be provided in any order).

This is trivial. We just iterate through *Text* and return all substrings of length *k*.

In [1]:
def composition(k, text):
    return [text[i:i+k] for i in range(len(text)-k+1)]

In [2]:
print(" ".join(composition(5, "CAATCCAAC")))

CAATC AATCC ATCCA TCCAA CCAAC


Now, we can revise the String Reconstruction problem to turn it into a computational problem.

**String Reconstruction Problem (Revised)**: *Reconstruct a string from its k-mer composition.*
- **Input**: An integer *k* and a collection *Patterns* of *k*-mers.
- **Output**: A string *Genome* such that *Composition<sub>k</sub>*(*Genome*) is equal to *Patterns*.

This is a more difficult problem to solve. Given a 3-mer composition:

    AAT  ATG  ATG  ATG  CAT  CCA  GAT  GCC  GGA  GGG  GTT  TAA  TGC  TGG  TGT

The **naive approach** is to start with any 3-mer, say `TAA`, and proceed by chaining it together with an overlapping 3-mer, say `AAT`, and so on: 

    TAA
     AAT
      ATG
    -----
    TAATG

At this point, we can consider three 3-mers that begin with `TG` in our composition: `TGC`, `TGG`, and `TGT`. The choice is important: choosing `TGT` means we can only add `GTT` next, and we have no 3-mers that begin with `TT`, so we hit a dead end with `TAATGTT`. 


    TAA
     AAT
      ATG
       TGT
        GTT
    -------
    TAATGTT



With a little bit of thinking ahead, we could reserve `TGT` and `GTT` for the very end of the string. If we extend `ATG` by `TGC` instead, we can continue the process and obtain `TAATGCCATGGATGTT`:


    TAA
     AAT
      ATG
       TGC
        GCC
         CCA
          CAT
           ATG
            TGG
             GGA
              GAT
               ATG
                TGT
                 GTT
    ----------------
    TAATGCCATGGATGTT



Yet, devilishly, the instructors tell us that this is not the correct string, because we omitted `GGG`! The correct string would be `TAATGCCATGGGATGTT`.
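
The check that exposes the mistake can be sketched in a few lines: compare the 3-mers we were given against the composition of a candidate string. `missing_kmers` is a hypothetical helper introduced here, and `composition` is repeated from above so the snippet runs on its own.

```python
from collections import Counter

def composition(k, text):
    return [text[i:i+k] for i in range(len(text) - k + 1)]

def missing_kmers(candidate, kmers):
    """k-mers from the input that the candidate string leaves unused."""
    k = len(kmers[0])
    # Counter subtraction keeps only positive counts, so repeated k-mers
    # such as ATG are matched copy for copy
    return Counter(kmers) - Counter(composition(k, candidate))

kmers = "AAT ATG ATG ATG CAT CCA GAT GCC GGA GGG GTT TAA TGC TGG TGT".split()

print(missing_kmers("TAATGCCATGGATGTT", kmers))   # the greedy attempt
print(missing_kmers("TAATGCCATGGGATGTT", kmers))  # the correct string
```

The first call reports one unused copy of `GGG`, and the second reports nothing left over. This only checks one direction (input 3-mers not covered by the candidate), which is enough for this example.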

The difficulty with choosing between `TGG`, `TGC`, and `TGT` arises because the 3-mer `ATG` occurs three times in the genome. Real genomes contain many, many repeats: about 50% of the human genome is made of repeats, of which the [**Alu sequence**](https://en.wikipedia.org/wiki/Alu_element) alone accounts for over a million copies. This poses a real challenge to genome assembly, so we need to design a better algorithm.

## String Reconstruction as a Walk in the Overlap Graph

Repeats in the genome mean that, whenever there are multiple options, we need some way of "looking ahead" to choose the read that leads to a correct reconstruction. Consider first the unlikely scenario in which we are simply given the reads in their correct order. The collection of *k*-mers *Patterns*, arranged in the order in which they occur in the string *Genome*, is called the **genome path**. 

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/path_graph.png" />
</center>
<br>

**String Spelled by a Genome Path Problem**: *Reconstruct a string from its genome path.*
- **Input**: A sequence path of *k*-mers *Pattern<sub>1</sub>*, … , *Pattern<sub>n</sub>* such that the last *k*–1 symbols of *Pattern<sub>i</sub>* are equal to the first *k*–1 symbols of *Pattern*<sub>*i*+1</sub> for 1 ≤ *i* ≤ *n*–1.
- **Output**: A string *Text* of length *k*+*n*–1 such that the *i*-th *k*-mer in *Text* is equal to *Pattern<sub>i</sub>* (for 1 ≤ *i* ≤ *n*).

Given the genome path, it is quite easy to reconstruct *Genome*. There is no need to even check whether the last *k*–1 symbols of *Pattern<sub>i</sub>* are equal to the first *k*–1 symbols of *Pattern*<sub>*i*+1</sub>—we just append, to the first read, the last letter of every subsequent read, since we know that the reads are *k*-mers, and the overlaps are *k*–1 long.

In [3]:
def path_to_genome(path):
    return path[0] + "".join([pattern[-1] for pattern in path[1:]])

In [4]:
path = ["AGCAGATC", "GCAGATCA", "CAGATCAT", "AGATCATC", "GATCATCG", "ATCATCGG"]
print(path_to_genome(path))

AGCAGATCATCGG
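
If the path comes from an untrusted source, the overlap condition can be checked while the string is spelled. The following is a sketch of that variant; `path_to_genome_checked` is a hypothetical name and not part of the course code.

```python
def path_to_genome_checked(path):
    """Spell the string for a genome path, verifying each (k-1)-overlap."""
    genome = path[0]
    for prev, curr in zip(path, path[1:]):
        # suffix of the previous k-mer must equal prefix of the next one
        if prev[1:] != curr[:-1]:
            raise ValueError(f"{prev} does not overlap {curr}")
        genome += curr[-1]
    return genome

path = ["AGCAGATC", "GCAGATCA", "CAGATCAT", "AGATCATC", "GATCATCG", "ATCATCGG"]
print(path_to_genome_checked(path))
```

On the path above it spells the same string as `path_to_genome`, and on a sequence such as `["AAT", "TGC"]` it raises a `ValueError` instead of silently returning a wrong answer.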


But, in real life, we are not given the genome path. Rather, we have to determine it. So, what do we do? Observing that each node on the genome path shares an overlap of length *k*–1 with at least one other node, we can create an **overlap graph** in which we draw an arrow from node A to node B whenever the last *k*–1 symbols of node A equal the first *k*–1 symbols of node B.

<br>
<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/overlap_graph_lex.png" width=700px />
</center>
<br>

**Overlap Graph Problem**: *Construct the overlap graph of a collection of k-mers.*
- **Input**: A collection *Patterns* of *k*-mers.
- **Output**: The overlap graph *Overlap*(*Patterns*), in the form of an adjacency list.

In [5]:
def prefix(pattern):
    return pattern[:-1]
    
def suffix(pattern):
    return pattern[1:]
    
def overlap_graph(patterns):
    adjacency_list = {}
    for p1 in patterns:
        adj = [p2 for p2 in patterns if suffix(p1) == prefix(p2)]
        if len(adj) != 0:
            adjacency_list[p1] = adj
    return adjacency_list

In [6]:
patterns = ["ATGCG", "GCATG", "CATGC", "AGGCA", "GGCAT", "GGCAC"]

for key, value in overlap_graph(patterns).items():
    print(f'{key}: {" ".join(value)}')

GCATG: CATGC
CATGC: ATGCG
AGGCA: GGCAT GGCAC
GGCAT: GCATG


Then, we can find a candidate for the genome path in the overlap graph by tracing a **Hamiltonian path**, defined as a path that visits each node exactly once (there can be multiple Hamiltonian paths in a single graph).

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/overlap_graph_highlighted.png" width=700px/>    
</center>
<br>

**Hamiltonian Path Problem**: *Construct a Hamiltonian path in a graph.*
- **Input**: A directed graph.
- **Output**: A path visiting every node in the graph exactly once (if such a path exists).

But the Hamiltonian path problem is NP-complete: no algorithm is known that solves it in time polynomial in the size of the graph (*O*(*n*<sup>*c*</sup>) for some constant *c*), and none is expected to exist unless P = NP. Indeed, if the Hamiltonian path were not highlighted in the overlap graph above, we would have a very difficult time identifying it. We will have to adopt an alternative strategy.
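
To get a feel for why this does not scale, here is a brute-force sketch that tries every ordering of the *k*-mers until one chains up. It is an illustration only, not the course's method, and `hamiltonian_path_brute_force` is a hypothetical name.

```python
from itertools import permutations

def hamiltonian_path_brute_force(patterns):
    """Return the first ordering in which consecutive k-mers overlap by k-1."""
    for order in permutations(patterns):
        # each adjacent pair must be joined by an edge of the overlap graph
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            return list(order)
    return None  # no Hamiltonian path exists

patterns = ["CTTA", "ACCA", "TACC", "GGCT", "GCTT", "TTAC"]
print(hamiltonian_path_brute_force(patterns))
```

With six 4-mers there are at most 6! = 720 orderings to test, which is instant. With only twenty *k*-mers there are already about 2.4 × 10<sup>18</sup> orderings, and a real genome yields millions of reads.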

## Another Graph for String Reconstruction

Nicolaas de Bruijn, a 20th-century Dutch mathematician, was interested in the problem of finding a ***k*-universal string**. Given some length *k*, he wanted to find a string that contains every binary *k*-mer exactly once. For example, 0001110100 is a 3-universal string because it contains each of the eight binary 3-mers, 000, 001, 011, 111, 110, 101, 010, and 100, as a substring exactly once. 
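
The definition is easy to test mechanically. The sketch below checks whether a string is *k*-universal; `is_universal` and its `alphabet` parameter are assumptions made for illustration.

```python
from itertools import product

def is_universal(text, k, alphabet="01"):
    """True if every k-mer over `alphabet` occurs in `text` exactly once."""
    kmers = [text[i:i+k] for i in range(len(text) - k + 1)]
    expected = {"".join(p) for p in product(alphabet, repeat=k)}
    # same number of windows as possible k-mers, and no k-mer missing,
    # together imply that each one appears exactly once
    return len(kmers) == len(expected) and set(kmers) == expected

print(is_universal("0001110100", 3))
print(is_universal("0001110101", 3))
```

The first string is the example from the text and passes. The second differs only in its last symbol, which duplicates 101 and drops 100, so it fails.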

Finding a *k*-universal string, which contains every *k*-mer in a given collection... That sounds a lot like finding a Hamiltonian path, which visits every node in an overlap graph! In fact, the *k*-universal string problem can be reduced to the Hamiltonian path problem: a *k*-universal string corresponds to a Hamiltonian path in the overlap graph of all binary *k*-mers. 

<center>
    <img src="https://ucarecdn.com/f7ce673d-b1fc-4121-9aeb-9b50666f09e5/" width=400px />    
</center>
<br>

However, rather than dealing with the intractable Hamiltonian path problem, de Bruijn developed a different way of representing a *k*-mer collection, using a graph now called the **de Bruijn graph**. Instead of putting each *k*-mer on a node, de Bruijn placed each *k*-mer on an edge, and placed on each node the (*k*–1)-mer shared by the edges on either side. For example, the below overlap graph:

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/path_graph.png" width=600px />    
</center>
<br>

is translated into the below de Bruijn graph:

<center>
    <img src="https://ucarecdn.com/63655550-8250-4284-95ab-7c77ddbca891/" width=600px />    
</center>
<br>

This move, by itself, is not helpful. But, de Bruijn proceeded to "glue" or merge identical nodes together:

<center>
    <img src="debruijn.png" width=400px />    
</center>
<br>

This innovation offers key advantages. In solving the **Bridges of Königsberg** problem, Leonhard Euler showed how to decide whether a graph has an **Eulerian cycle** (a cycle that traverses every edge in the graph exactly once), and his argument yields an efficient, **polynomial**-time algorithm for constructing one whenever it exists. Not recognizing that this more efficient approach was available, biologists used overlap graphs to assemble genomes, including the human genome, for the first two decades following the invention of DNA sequencing. Today, the de Bruijn graph is the dominant approach to genome assembly.

Let us first implement the creation of a de Bruijn graph from a string.

**De Bruijn Graph from a String Problem**: *Construct the de Bruijn graph of a string.*
- **Input**: An integer *k* and a string *Text*.
- **Output**: *DeBruijn*<sub>*k*</sub>(*Text*).


In [7]:
def de_bruijn_from_string(k, text):
    adjacency_list = {}
    edges = [text[i:i+k] for i in range(len(text)-k+1)]
    for edge in edges:
        pref = prefix(edge)
        suff = suffix(edge)
        adjacency_list.setdefault(pref, []).append(suff)
    return adjacency_list

In [48]:
k = 4
text = "AAGATTCTCTAC"

for key, value in de_bruijn_from_string(k, text).items():
    print(f'{key} -> {",".join(value)}')


AAG -> AGA
AGA -> GAT
GAT -> ATT
ATT -> TTC
TTC -> TCT
TCT -> CTC,CTA
CTC -> TCT
CTA -> TAC


## Walking in the de Bruijn Graph

With an overlap graph, we had to find a Hamiltonian path, and that was an intractable problem. But with a de Bruijn graph, we have to find an **Eulerian path** (a path that traverses **every edge** exactly once, rather than visiting every node exactly once) to find a candidate for the genome path. 

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/debruijn_graph_eulerian_path.png" width=400px />    
</center>
<br>

But hold on! In the **de Bruijn Graph from a String** problem, we constructed the de Bruijn graph *given a genome*, but we're trying to reconstruct the genome! Thankfully, we can just as easily construct a de Bruijn graph from isolated reads or *k*-mers. For each *k*-mer, we create two nodes, its prefix and its suffix, and draw a directed edge from the prefix to the suffix. The resulting collection of isolated edges is the **composition graph**. Gluing all nodes with identical labels in the composition graph results in the de Bruijn graph.

<center>
    <img src="composition.png" width=600px />    
</center>
<br>

The instructors say that this method, building the composition graph first and then gluing identical nodes, will prove useful later.

```
DeBruijn(Patterns)
    dB ← graph in which every k-mer in Patterns is an isolated edge between its prefix and suffix
    dB ← graph resulting from gluing all nodes in dB with identical labels
    return dB
```

But creating a de Bruijn graph from a collection of *k*-mers *Patterns* is actually quite easy. Just a small tweak to the previous implementation from the **de Bruijn Graph from a String** problem will do.

In [9]:
def de_bruijn_from_kmers(patterns):
    adjacency_list = {}
    for p in patterns:
        pref = prefix(p)
        suff = suffix(p)
        adjacency_list.setdefault(pref, []).append(suff)
    return adjacency_list

In [53]:
patterns = ["GAGG", "CAGG", "GGGG", "GGGA", "CAGG", "AGGG", "GGAG"] # small sample input (not used below)

# read one k-mer per line from the Rosalind dataset
with open("rosalind_ba3e.txt", "r") as f:
    input_ba3e = f.read().rstrip("\n").split("\n")

# write the adjacency list, one "node -> neighbors" line per node
with open("ba3e_output.txt", "w") as f:
    for key, value in de_bruijn_from_kmers(input_ba3e).items():
        f.write(f'{key} -> {",".join(value)}\n')

## Euler's Theorem

Week 2 content begins here. 

As mentioned briefly in Week 1, Euler proved that there exists an efficient, polynomial-time algorithm to construct an Eulerian cycle for every graph, if there exists an Eulerian cycle in said graph. We will examine his algorithm.

A graph that possesses an Eulerian cycle is an **Eulerian graph**. For a graph to be Eulerian, it has to be 1) **balanced** and 2) **strongly connected**. In a **balanced graph**, each node has the same **indegree** (the number of incoming edges) as **outdegree** (the number of outgoing edges). If an ant walking along an Eulerian cycle enters a node through one edge, it must be able to exit through another, so the indegree must equal the outdegree. And in a **strongly connected** graph, we have to be able to walk from any node to any other node in the graph. So every Eulerian graph is both balanced and strongly connected, but not all balanced graphs are Eulerian, and not all strongly connected graphs are Eulerian.
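
Both conditions can be tested directly on an adjacency list. The sketch below is one way to do it, assuming the same dictionary format used elsewhere in these notes; the function names are hypothetical.

```python
def is_balanced(graph):
    """Every node has as many incoming as outgoing edges."""
    net = {}  # outdegree minus indegree, per node
    for node, targets in graph.items():
        net[node] = net.get(node, 0) + len(targets)
        for t in targets:
            net[t] = net.get(t, 0) - 1
    return all(v == 0 for v in net.values())

def reachable(graph, start):
    """All nodes that can be reached from `start` (depth-first search)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def is_strongly_connected(graph):
    nodes = set(graph) | {t for ts in graph.values() for t in ts}
    if not nodes:
        return True
    reverse = {}  # same graph with every edge flipped
    for node, targets in graph.items():
        for t in targets:
            reverse.setdefault(t, []).append(node)
    start = next(iter(nodes))
    # one node reaches everything, and everything reaches that node
    return reachable(graph, start) == nodes and reachable(reverse, start) == nodes

def is_eulerian(graph):
    return is_balanced(graph) and is_strongly_connected(graph)

print(is_eulerian({0: [1], 1: [2], 2: [0]}))           # a single directed cycle
print(is_eulerian({0: [1], 1: [0], 2: [3], 3: [2]}))   # two separate cycles
print(is_eulerian({0: [1, 1], 1: [0]}))                # connected, one extra edge
```

The second graph is balanced but falls apart into two pieces, and the third is strongly connected but unbalanced, so only the first is Eulerian.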

To prove that a balanced, strongly connected *Graph* has an Eulerian cycle, we can imagine an ant walking randomly on the graph. We place the ant on any node and let it walk only along untravelled edges. Because the graph is balanced, the ant can only get stuck at the node where it started, so it either finds an Eulerian cycle or gets stuck after completing a smaller cycle. For example, the below ant started out at the green node (say *v*<sub>0</sub>), and became stuck after completing a smaller cycle. Once the ant has completed a cycle, we call the cycle *Cycle*<sub>0</sub>.

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/Leo-2.png" width=300px />    
</center>
<br>

Then, we can place the ant on a new node within *Cycle*<sub>0</sub> that still has at least one unused edge, and instruct it to first complete a circuit through *Cycle*<sub>0</sub>. Let us call the new node *v*<sub>1</sub>. Once the ant returns to *v*<sub>1</sub>, it will find an unused edge to escape *Cycle*<sub>0</sub> and continue its journey. 

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/Leo-3.png" width=300px />    
</center>
<br>

If the ant becomes stuck again, like below, we can name *Cycle*<sub>1</sub> the newly completed cycle (which includes *Cycle*<sub>0</sub>).

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/Leo-4.png" width=300px />    
</center>
<br>

And we repeat the procedure, placing the ant on a node in *Cycle*<sub>1</sub> with an unused edge, until the ant succeeds in traversing all edges.

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/Leo_final_steps.png" width=600px />    
</center>
<br>

## From Euler's Theorem to an Algorithm for Finding Eulerian Cycles

This is a **constructive proof**, which not only proves the result but also provides us with an algorithm. Let us implement Euler's proof.

```
EulerianCycle(Graph)
    form a cycle Cycle by randomly walking in Graph (don't visit the same edge twice!)
    while there are unexplored edges in Graph
        select a node newStart in Cycle with still unexplored edges
        form Cycle’ by traversing Cycle (starting at newStart) and then randomly walking 
        Cycle ← Cycle’
    return Cycle
```

In [11]:
import random

def eulerian_cycle(graph):
    
    def random_walk(graph, current_node):
        '''
        Complete a cycle by walking through an Eulerian graph randomly.
        '''
        cycle = [current_node]
        while True:
            if current_node in graph.keys(): # if we're not stuck
                next_node = random.choice(graph[current_node])
                cycle.append(next_node)
    
                # remove new edge to be traversed from future consideration
                if len(graph[current_node]) > 1:
                    graph[current_node].remove(next_node)
                else:
                    graph.pop(current_node)
                    
                current_node = next_node
            else:
                break
        return cycle, graph

    def prev_cycle(cycle, new_start):
        '''
        Once new start node is selected, shift previous cycle to begin
        with new start node. 
        '''
        new_start_index = cycle.index(new_start) # nodes may repeat in cycle; rotating at the first occurrence is enough
        return cycle[new_start_index:] + cycle[1:new_start_index+1]
    
    current_node = random.choice(list(graph.keys()))    
    cycle, graph = random_walk(graph, current_node)

    while len(graph) > 0:
        # choose new node to start from, given the previous iteration's cycle. 
        # if node is **left** in graph.keys(), i.e. still has unused outgoing edge(s)
        new_start = random.choice([node for node in cycle if node in graph.keys()])
        
        prev = prev_cycle(cycle, new_start)
        new, graph = random_walk(graph, new_start)
        cycle = prev + new[1:]
    
    return cycle

In [12]:
graph = {0: [3], 1: [0], 2: [1, 6], 3: [2], 4: [2], 5: [4], 6: [5, 8], 7: [9], 8: [7], 9: [6]}
print(" ".join(str(x) for x in eulerian_cycle(graph)))

0 3 2 6 8 7 9 6 5 4 2 1 0
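
For comparison, here is a sketch of a different, commonly used formulation: an iterative, stack-based variant usually attributed to Hierholzer. It swaps the random walk for a deterministic one and works on a copy, so the input graph is left untouched. The name `eulerian_cycle_iterative` and the `start` parameter are assumptions, and the graph is assumed to be Eulerian.

```python
from collections import Counter

def eulerian_cycle_iterative(graph, start=None):
    # copy the adjacency lists so the caller's graph is not modified
    remaining = {node: list(targets) for node, targets in graph.items()}
    if start is None:
        start = next(iter(remaining))
    stack, cycle = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            # keep walking along any unused edge
            stack.append(remaining[node].pop())
        else:
            # stuck: this node is finished, so emit it while backtracking
            cycle.append(stack.pop())
    return cycle[::-1]

graph = {0: [3], 1: [0], 2: [1, 6], 3: [2], 4: [2], 5: [4], 6: [5, 8], 7: [9], 8: [7], 9: [6]}
cycle = eulerian_cycle_iterative(graph, start=0)
print(" ".join(str(x) for x in cycle))

# sanity check: the walk uses every edge of the graph exactly once
all_edges = Counter((u, v) for u, targets in graph.items() for v in targets)
print(Counter(zip(cycle, cycle[1:])) == all_edges)
```

Backtracking plays the role of splicing sub-cycles together: a node is only emitted once all of its outgoing edges are used, so side loops are inserted at the right place without having to rotate the cycle.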


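We can now construct an Eulerian *cycle*, but not yet an Eulerian *path*. This is straightforwardly remedied. In a **nearly balanced** graph, all nodes are balanced except two: one with an extra outgoing edge (where the path must start) and one with an extra incoming edge (where it must end). A nearly balanced graph has an Eulerian path if and only if adding an edge from the latter to the former makes the graph balanced and strongly connected. So, we can **simply connect the two unbalanced nodes**, find an Eulerian cycle, and then cut the cycle at the added edge.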

<center>
    <img src="http://bioinformaticsalgorithms.com/images/Assembly/Eulerian_path_to_cycle.png" width=600px />    
</center>
<br>

In [13]:
def find_unbalanced(graph):
    '''
    Find unbalanced nodes in a nearly balanced graph with a Eulerian path.
    '''
    sources = graph.keys()
    dests = sum(list(graph.values()), []) # list of arrow destinations
    all_nodes = set(list(sources) + list(dests))
    degrees = {node: 0 for node in all_nodes}

    for node in sources:
        degrees[node] += len(graph[node]) # outdegrees
    for node in dests:
        degrees[node] -= 1 # indegrees

    orig, term = None, None
    for node, degree in degrees.items():
        if degree == 1:
            orig = node
        if degree == -1:
            term = node

    if orig is None or term is None:
        return None, None
    return orig, term
    
def eulerian_path(graph):
    orig, term = find_unbalanced(graph)
    # compare against None explicitly: a node labeled 0 is falsy
    if orig is None:
        # balanced graph: an Eulerian cycle is already an Eulerian path
        return eulerian_cycle(graph)

    graph.setdefault(term, []).append(orig) # add temporary edge to graph
    path = eulerian_cycle(graph)[:-1]       # drop the repeated start node

    # slice Eulerian cycle at temporary edge
    if path[0] == orig and path[-1] == term:
        return path
    for i in range(len(path)-1):
        if [path[i], path[i+1]] == [term, orig]:
            return path[i+1:] + path[0:i+1]

    return path

In [14]:
data = {0: [2], 1: [3], 2: [1], 3: [0, 4], 6: [3, 7], 7: [8], 8: [9], 9: [6]}
print(" ".join([str(x) for x in eulerian_path(data)]))

6 7 8 9 6 3 0 2 1 3 4
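
Because the walk is random, the printed path can differ between runs. A small checker helps confirm that any particular result is valid. This is a sketch, and `is_eulerian_path` is a hypothetical helper.

```python
from collections import Counter

def is_eulerian_path(graph, path):
    """True if consecutive nodes of `path` use every edge exactly once."""
    edges = Counter((u, v) for u, targets in graph.items() for v in targets)
    walked = Counter(zip(path, path[1:]))
    # multiset comparison, so parallel edges are counted correctly
    return edges == walked

data = {0: [2], 1: [3], 2: [1], 3: [0, 4], 6: [3, 7], 7: [8], 8: [9], 9: [6]}
print(is_eulerian_path(data, [6, 7, 8, 9, 6, 3, 0, 2, 1, 3, 4]))  # path printed above
print(is_eulerian_path(data, [6, 3, 0, 2, 1, 3, 4]))              # skips the 6-7-8-9 loop
```

Note that `eulerian_cycle` above removes edges from the dictionary it is given, so the checker needs the graph as it was before the call (for example, a fresh literal or a `copy.deepcopy`).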


In [58]:
f = open("rosalind_ba3g.txt", "r")
ba3g_input = f.read().rstrip("\n").split("\n")
f.close()

graph = {line.split(" -> ")[0]: line.split(" -> ")[1].split(",") for line in ba3g_input}

print("->".join(eulerian_path(graph)))

2131->2132->260->77->11->472->2465->2466->2464->472->474->473->11->12->35->36->686->687->685->931->932->933->1716->1715->1714->933->685->36->34->390->459->2188->2190->2189->2621->2620->2622->2189->2362->2364->2363->2189->459->457->458->390->388->407->812->813->811->1993->1994->1995->2138->2139->2137->1995->811->1527->1525->1526->1676->1677->1675->1526->811->407->406->408->702->700->701->408->388->1296->1783->1785->1784->2635->2637->2636->1784->1296->1294->1295->388->415->1748->1749->1747->415->416->417->388->1966->1967->1968->388->389->1674->1673->1672->389->34->133->135->134->1954->1955->1956->134->34->566->567->1233->1231->2632->2634->2633->2780->2781->2779->2633->1231->1531->1532->2667->2665->2666->1532->1533->1231->1232->1659->1658->1657->1232->567->565->856->857->858->565->34->123->284->283->478->479->572->870->868->869->572->571->573->479->1599->1597->1598->479->480->283->285->123->1504->1772->1870->1872->1871->1772->1773->1771->1504->2494->2495->2496->1504->1506->1505->123->1159

Somewhat unceremoniously, we have now arrived at a solution to the **String Reconstruction** problem. Given a collection of reads *Patterns*, we first create the de Bruijn graph, then we create the Eulerian path, and then we string together the Eulerian path into a single string *Text*.



In [15]:
def string_reconstruction(patterns):
    de_bruijn_graph = de_bruijn_from_kmers(patterns)
    path = eulerian_path(de_bruijn_graph)
    return path_to_genome(path)

In [60]:
patterns = ["CTTA", "ACCA", "TACC", "GGCT", "GCTT", "TTAC"]
string_reconstruction(patterns)

'GGCTTACCA'

In [62]:
f = open("rosalind_ba3h.txt", "r")
ba3h_input = f.read().rstrip("\n").split("\n")
f.close()

In [63]:
patterns = ba3h_input[1:]
print(string_reconstruction(patterns))

TGTTAAAAGCAATATATCGACCGAATCGGAATTAGCATGGTGGATTTAGTCCCTCGGAGTTGCAACCCACTGCAACGCGTTTTGATGGTCGTAATTGCCCGGGGACTTCGGCTCGCAGGCATGGGACCCCCTTGATGGAGAGATACTAACAACATAGTCACGTGAACTGAGACGGGTGAATAGACGCTCGCTTTCTTTAATTAAACGCGCAGGTAATACACGCGGCTAAATGGGGCAAGTCTGGAAGACATTTTAGACATGGTCAACAGTAACTTACTCTTTGTGTATGCAGAGACTAGCTGTCTGGAGGGAATGTCACACCTACCGGCGCAAAGACTGCAATGCCGTGCTAATCGCCTCGTTCATGTATGTCGGTGAGATAAGATGTTGTGCGAAGGATAGGGCCTTTTGCCGTGGTCAAATCAGACGTGGCTGCCCAGGTATATTCAGTGTCCGACAGTGTCTTGAGGTGAGGTGACCGAATGAGGACATTAGGGTGCGTATGTCTCCCCGGAGGCTTTGAGTTTCCCCAGGTATTACGACGTTGCAACGAATTTCCGTGAATCACGCATTCAATTGTCGGAATGTTCAATTAACTGTCTGCGTGTGGGCCCAACCACTATTCATTCTACATTAACCTTCCGCTCAGGATTTCGTAGCGATCCACCCTTGTTCATCTATGTCTTACGGGCGTCGCGCGCGGGCAGACCAACCTCCGGGTTGCCATCGTCTACCTAAGCTACGTTTCCTACCTTAATGAAAGAGGATGGCGATAAATACTTGTGATATTTAACAAGTGACCTGGCACCCGATCTCAACGCTCTTCTATCAAGCTCGTAATCGGGGAACAGGGCGCCTTCAAGCGTTTATAGAGGCCCGAGACGCACGGATGACATGCAAAGCTTACAAAGAGGCAGAATGATCATTCACTAAAAACAAATTACTTTATAGAGGTACAAAATTCAGCCATCGCCTCCAATTTGAGAATCCCGAGAGGCAA

Recall that de Bruijn's goal was to find the *k*-universal binary string for any given *k*. The instructors tell us that de Bruijn was actually interested in finding *k*-universal *circular* strings. This is quite similar to the problem of assembling a circular genome, like those of most bacteria. We can easily use the **Eulerian Cycle** algorithm to answer this question. 

In [64]:
# binary strings function taken from StackOverflow: 
# https://stackoverflow.com/questions/64890117/what-is-the-best-way-to-generate-all-binary-strings-of-the-given-length-in-pytho
def generate_binary_strings(bit_count):
    binary_strings = []
    def genbin(n, bs=''):
        if len(bs) == n:
            binary_strings.append(bs)
        else:
            genbin(n, bs + '0')
            genbin(n, bs + '1')

    genbin(bit_count)
    return binary_strings

def k_universal_circular_string(k):
    patterns = generate_binary_strings(k)
    de_bruijn_graph = de_bruijn_from_kmers(patterns)
    path = eulerian_cycle(de_bruijn_graph)
    # the cycle ends where it began, so its last k-1 nodes only re-spell
    # the first k-1 symbols of the string; drop them to close the circle
    return path_to_genome(path[:-(k-1)])

In [65]:
print(k_universal_circular_string(9))

11111111001010110010100100000111101011010001100101100000100101010001111100011011101001001111110000101000010111101001110000110101100010011101011111110110101010111100100110111100000110001110110111000111100010111001011111001111011000110000000001110001000011001111101000100011010000111010000000100000010101010011000101011101110011101111101111011101100100100010010010110101110000001101100001001100100001000101100110011010010111010100000101101111110101011011011001110010001110011011010011010100101000101001111001100001
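
The trimming step is easier to see on a small case. The sketch below hard-codes one Eulerian cycle of the de Bruijn graph for *k* = 3 (chosen by hand for illustration), spells it, and removes the wrap-around. `circular_kmers` is a hypothetical helper, and `path_to_genome` is repeated so the snippet runs on its own.

```python
def path_to_genome(path):
    return path[0] + "".join(p[-1] for p in path[1:])

def circular_kmers(text, k):
    """All k-mers of `text` when its end is joined back to its start."""
    wrapped = text + text[:k-1]
    return [wrapped[i:i+k] for i in range(len(text))]

k = 3
# one Eulerian cycle through the nodes 00, 01, 10, 11 (eight edges)
cycle = ["00", "00", "01", "11", "11", "10", "01", "10", "00"]

linear = path_to_genome(cycle)   # 2**k + k - 1 symbols
circular = linear[:-(k-1)]       # 2**k symbols
print(linear, circular)
print(sorted(circular_kmers(circular, k)))
```

Spelling the whole cycle gives `0001110100`, whose last two symbols repeat its first two. Dropping them leaves `00011101`, and reading that string around the circle yields each of the eight binary 3-mers exactly once. Trimming the last *k*–1 nodes of the path before spelling, as the function above does, has the same effect, since each node after the first contributes one symbol.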


## Assembling Genomes from Read-Pairs

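A **(*k*,*d*)-mer** is a pair of *k*-mers in *Text* separated by a gap of *d* symbols, written (*Pattern*<sub>1</sub>|*Pattern*<sub>2</sub>). As a warm-up exercise: generate the (3,2)-mer composition of TAATGCCATGGGATGTT in lexicographic order. Include repeats, and return your answer as a list on a single line. As a hint to help with formatting, the answer should begin "(AAT|CAT) (ATG|ATG)..."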

In [19]:
def read_pairs(k, d, text):
    pairs = []
    for i in range(len(text)-2*k-d+1):
        pairs.append([text[i:i+k], text[i+k+d:i+2*k+d]])
    pairs.sort()
    return pairs

In [20]:
text = "TAATGCCATGGGATGTT"
print(" ".join([f'({pair[0]}|{pair[1]})' for pair in read_pairs(3, 2, text)]))

(AAT|CAT) (ATG|ATG) (ATG|ATG) (CAT|GAT) (CCA|GGA) (GCC|GGG) (GGG|GTT) (TAA|CCA) (TGC|TGG) (TGG|TGT)


**String Reconstruction from Read-Pairs Problem**: *Reconstruct a string from its paired composition.*
- **Input**: A collection of paired *k*-mers *PairedReads* and an integer *d*.
- **Output**: A string *Text* with (*k*,*d*)-mer composition equal to *PairedReads* (if such a string exists).


```
StringSpelledByGappedPatterns(GappedPatterns, k, d)
    FirstPatterns ← the sequence of initial k-mers from GappedPatterns
    SecondPatterns ← the sequence of terminal k-mers from GappedPatterns
    PrefixString ← StringSpelledByPatterns(FirstPatterns, k)
    SuffixString ← StringSpelledByPatterns(SecondPatterns, k)
    for i = k + d + 1 to |PrefixString|
        if the i-th symbol in PrefixString does not equal the (i - k - d)-th symbol in SuffixString
            return "there is no string spelled by the gapped patterns"
    return PrefixString concatenated with the last k + d symbols of SuffixString
```

In [76]:
def string_from_gapped_patterns(gapped_patterns, k, d):
    first_patterns = [p.split("|")[0] for p in gapped_patterns]
    second_patterns = [p.split("|")[1] for p in gapped_patterns]
    prefix_string = path_to_genome(first_patterns)
    suffix_string = path_to_genome(second_patterns)
    for i in range(k+d, len(prefix_string)):
        if prefix_string[i] != suffix_string[i-k-d]:
            return None
    return prefix_string[:k+d] + suffix_string

In [77]:
patterns = ["GAGA|TTGA", "TCGT|GATG", "CGTG|ATGT", "TGGT|TGAG", "GTGA|TGTT", "GTGG|GTGA", "TGAG|GTTG", "GGTC|GAGA", "GTCG|AGAT"]

In [78]:
def de_bruijn_from_paired_kmers(patterns):
    adjacency_list = {}
    for p in patterns:
        pair = p.split("|")
        pref = f'{prefix(pair[0])}|{prefix(pair[1])}'
        suff = f'{suffix(pair[0])}|{suffix(pair[1])}'
        adjacency_list.setdefault(pref, []).append(suff)
    return adjacency_list

In [79]:
def gapped_string_reconstruction(patterns, k, d):
    graph = de_bruijn_from_paired_kmers(patterns)
    gapped_patterns = eulerian_path(graph)
    return string_from_gapped_patterns(gapped_patterns, k, d)

In [80]:
gapped_string_reconstruction(patterns, 4, 2)

'GTGGTCGTGAGATGTTGA'
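
One way to gain confidence in this result is to go in the inverse direction: compute the (*k*,*d*)-mer composition of the reconstructed string and compare it with the input. This is a sketch, and `paired_composition` is a hypothetical helper that mirrors `read_pairs` above but emits `first|second` strings.

```python
def paired_composition(k, d, text):
    """All (k,d)-mers of `text` as 'first|second' strings, sorted."""
    return sorted(
        f"{text[i:i+k]}|{text[i+k+d:i+2*k+d]}"
        for i in range(len(text) - 2*k - d + 1)
    )

patterns = ["GAGA|TTGA", "TCGT|GATG", "CGTG|ATGT", "TGGT|TGAG", "GTGA|TGTT",
            "GTGG|GTGA", "TGAG|GTTG", "GGTC|GAGA", "GTCG|AGAT"]

# the reconstruction is valid if it reproduces exactly the input read-pairs
print(paired_composition(4, 2, "GTGGTCGTGAGATGTTGA") == sorted(patterns))
```

The comparison comes out true for the string above. The same check can be applied to the larger datasets below, where inspecting the output by eye is not practical.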

In [105]:
f = open("gapped.txt", "r")
readin = f.read().rstrip("\n").split("\n")
f.close()
k = int(readin[0].split(" ")[0])
d = int(readin[0].split(" ")[1])
patterns = readin[1].split(" ")

print(gapped_string_reconstruction(patterns, k, d))

TATGCACCGATGGGAGAACTCTAGTAGATCCGTTACAACACGTTCGCCACCTCCACATCCTCCGCCGGATCTTGTGAGGACTCTCATCTTCCTGCCTCAGTATGCTCCTAGCTTGTGTCGTGATCCATAGTGGCTCGGCTCGCTGATCTGCCATATCGGGGACTCAATTATAGCGGACACTTTATTTTCGGACTACTCATCCCAAGTCGCTACAGAAACGATATACACTCTGATTAGCGTGGCTGGCTTACCGCCTTCGGGGCCCGCCAGCACACTTCGCGGTGATCGGAGAGATGGCTCGATTAGGGAGGTTTGGCGTGGTATTGGAACCCCCCCATGGGCCGGAGTAAGAGAAACGGACAAGTGTTGGTGGGAGACAGACACACAAAGCGGGGAATAAAGCGGGGTACTGTGTCCCTCTAACGACAGTGTTGATTATTTTCGGACTACTCATCCCAAGTCGCTACAGAAACGATATACACTCTCCAGCTTGACAGCCTACTCGCAATGGTCCGTGAGTCGATTCAAACTGTAGCAGTAGTCTCCCATACAATGCAACTAAATGCGGACCGCGCGTGGTTGCAGGGACTCGGATTCTTCCCCTCCCAAATGCATTATTGGCCGCGCGCTAGAGTCCCGCGATCAAGAACTGCCTGACCAGGTCTAGTCGCGTTTAAACTTATTTTCGGACTACTCATCCCAAGTCGCTACAGAAACGATATACACTCTGGACTCACTTATTTTCGGACTACTCATCCCAAGTCGCTACAGAAACGATATACACTCTTGGCCTTGCCGAAACAGACCCCCATGGATTAAGGTAAAATTGCCATGCCGGGCGAGACTACGTAAAGCCAATGCAGCAAATATTTTACGACCGGTGTTTACTGGATACTTGATTACCCGCTGACCCAACTGATTATTTTCGGACTACTCATCCCAAGTCGCTACAGAAACGATATACACTCTTGACATGCGCCTCGATTGAAACTTTCACAGG

In [104]:
f = open("rosalind_ba3j.txt", "r")
readin = f.read().rstrip("\n").split("\n")
f.close()
k = int(readin[0].split(" ")[0])
d = int(readin[0].split(" ")[1])
patterns = readin[1:]

print(gapped_string_reconstruction(patterns, k, d))

TTTCTTCTCTTTGGACACTTTCGAGGCTGATCAATACCCTATCCTTTGCCGTTGCTCATGGTGATAGCTCCATATAACTCAAGCGAGGCTGATCAACGAGGCTGATCAATACCCTATCCTTTGCCGTACCCTATCCTTTGCCGGCGGTTTCTGTCTGTAATCTCGCGGTCTTTGCTAGAATACGCCCACCACACGACGAGGCTGATCAATACCCTATCCTTTGCCGACCTTCGCGAGGCTGATCAATACCCTATCCTTTGCCGCAACGGCGACTATACAGCCACCAACAGGCAATTTTTTCCCCCAAATCCCCTCTCTAATGTACCGTATATACCAGCCAAGATCAAAATAATACCGCCAAAAATGATCGGTCAGTCCCTAAATCCGAGGCTGATCAATACCCTATCCTTTGCCGAGCAGGGCCCACCAACCGCGGTCGCTGGGATACTCACGCCGGGCCGTTAGTTTTAGATACTAGACATGATGACCCCAAGCTGGATTAATGTTACATACCCTTTTCTCGAGGCTGATCAATACCCTATCCTTTGCCGGCATAGTCACATGGGACGAGGCTGATCAATACCCTATCCTTTGCCGTTAACTTCCGCTCAGCACACTCTACCGAGTAGTACGTACCCCCTATGAACCATTGCGCATTGAACTTCCATGACGAGGCTGATCAATACCCTATCCTTTGCCGACTTTGGGGTAATCCTGTGTATACCTTCTACGCAGATGGCGGCGATCGAGATATGCTGAACGCAAAACCCCGTTGTAGTGCAGTGATATATTTTGTAATACATAAGTCAACCCGAAGCTGATCGGGGATCGAGGCTGATCAATACCCTATCCTTTGCCGGCGAACGGCGCGTCTTGCAACATACATCAATATATGACGTATTTTTCCTGAAGTATATGTGTGACTAGGAAAGTTCACACCCACACTTTGAACTAACACTATAAGATCGGCGGGGGATAAGAATTTACGCCCTGCCCGAGG

## Maximal Non-Branching Paths in a Graph

A path in a graph is **non-branching** if every intermediate node has exactly one incoming and one outgoing edge (a 1-in-1-out node), and **maximal** if it cannot be extended into a longer non-branching path. The pseudocode below collects all such paths, together with any isolated cycles.

```
MaximalNonBranchingPaths(Graph)
    Paths ← empty list
    for each node v in Graph
        if v is not a 1-in-1-out node
            if out(v) > 0
                for each outgoing edge (v, w) from v
                    NonBranchingPath ← the path consisting of single edge (v, w)
                    while w is a 1-in-1-out node
                        extend NonBranchingPath by the edge (w, u) 
                        w ← u
                    add NonBranchingPath to the set Paths
    for each isolated cycle Cycle in Graph
        add Cycle to Paths
    return Paths
```

In [106]:
def find_isolated_cycles(graph, degrees):
    '''
    Find isolated, non-branching cycles.

    graph maps each node to a list of successor nodes.
    degrees maps each node to [indegree, outdegree].
    Each cycle is returned as a list of nodes that starts and ends at the same node.
    '''
    # a set gives constant-time membership checks
    one_in_one_out = {v for v in graph if degrees[v] == [1, 1]}

    cycles = []
    explored = set()
    for start in graph:
        if start not in one_in_one_out or start in explored:
            continue
        # walk forward through 1-in-1-out nodes until we return to start
        # or reach a node that branches or terminates
        non_branching_path = [start]
        w = graph[start][0]
        while w in one_in_one_out and w != start:
            non_branching_path.append(w)
            w = graph[w][0]
        # nodes visited on this walk cannot start a new cycle, whether or not the walk closed
        explored.update(non_branching_path)
        if w == start:
            cycles.append(non_branching_path + [start])
    return cycles
    
def maximal_nonbranching_paths(graph):
    '''
    Return every maximal non-branching path in graph, followed by its isolated cycles.
    graph maps each node to a list of successor nodes.
    '''
    paths = []

    # degrees[node] = [indegree, outdegree]
    degrees = {}
    for node, successors in graph.items():
        degrees.setdefault(node, [0, 0])[1] += len(successors)
        for succ in successors:
            degrees.setdefault(succ, [0, 0])[0] += 1

    for v in graph:
        if degrees[v] != [1, 1]:
            # start one path along every outgoing edge of a node that is not 1-in-1-out
            for w in graph[v]:
                non_branching_path = [v, w]
                # keep extending while the current end node is 1-in-1-out
                while degrees[w] == [1, 1]:
                    w = graph[w][0]
                    non_branching_path.append(w)
                paths.append(non_branching_path)

    paths.extend(find_isolated_cycles(graph, degrees))
    return paths

In [107]:
data = {
    1: [2],
    2: [3],
    3: [4, 5],
    6: [7],
    7: [6]
}

'''
f = open("max_nb.txt", "r")
data = f.read().split("\n")
f.close()
data = {line.split(" -> ")[0]: line.split(" -> ")[1].split(",") for line in data}
'''

for line in maximal_nonbranching_paths(data):
    print(" ".join([str(x) for x in line]))

1 2 3
3 4
3 5
6 7 6


In [108]:
def contig_generation(patterns):
    de_bruijn_graph = de_bruijn_from_kmers(patterns)
    paths = maximal_nonbranching_paths(de_bruijn_graph)
    return [path_to_genome(path) for path in paths]
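
Since `contig_generation` leans on helpers defined earlier in the notebook, here is a self-contained sketch of the same idea for quick experimentation. It is an illustration under stated assumptions, not the notebook's implementation: the name `contigs_from_kmers` is made up for this example, every k-mer is treated as one edge from its prefix to its suffix, and repeated k-mers are kept as repeated edges.

```python
from collections import defaultdict

def contigs_from_kmers(kmers):
    """Sketch: contigs are the maximal non-branching paths of the de Bruijn graph."""
    graph = defaultdict(list)   # prefix -> list of suffixes
    indeg = defaultdict(int)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
        indeg[kmer[1:]] += 1

    def one_in_one_out(node):
        return indeg[node] == 1 and len(graph.get(node, [])) == 1

    contigs = []
    # paths that start at a node that is not 1-in-1-out
    for v in list(graph):
        if one_in_one_out(v):
            continue
        for w in graph[v]:
            contig = v + w[-1]
            while one_in_one_out(w):
                w = graph[w][0]
                contig += w[-1]
            contigs.append(contig)

    # isolated cycles made only of 1-in-1-out nodes
    visited = set()
    for start in list(graph):
        if not one_in_one_out(start) or start in visited:
            continue
        cycle = [start]
        w = graph[start][0]
        while w != start and one_in_one_out(w):
            cycle.append(w)
            w = graph[w][0]
        visited.update(cycle)
        if w == start:
            contigs.append(start + "".join(n[-1] for n in cycle[1:]) + start[-1])
    return contigs

sample = ["ATG", "ATG", "TGT", "TGG", "CAT", "GGA", "GAT", "AGA"]
print(" ".join(sorted(contigs_from_kmers(sample))))
# AGA ATG ATG CAT GAT TGGA TGT
```

Only `TGGA` is longer than a single k-mer here, because `GG` is the only 1-in-1-out node in the sample graph. Every other node branches or is an endpoint, so the paths through them stop after one edge.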

In [111]:
# sample dataset; replaced below by the full Rosalind dataset
data = ["ATG", "ATG", "TGT", "TGG", "CAT", "GGA", "GAT", "AGA"]
with open("rosalind_ba3k.txt", "r") as f:
    data = f.read().rstrip("\n").split("\n")
print(" ".join(contig_generation(data)))

ATGCTTGGTCGCAGGCGCG ATGCTTGGTCGCAGGCGCG GCATTCTTCTGGGGAGTAG GCATTCTTCTGGGGAGTAG CAATCCCCACGGGTCCCGG CAATCCCCACGGGTCCCGG CTTTAGCTCCGGAGCATTT CTTTAGCTCCGGAGCATTT TCGCACGCTTCGGGTGGGC TCGCACGCTTCGGGTGGGC CGGATAAAACAAACGTTAC CGGATAAAACAAACGTTAC TAGCGCGAGGCCGGAAGTG TAGCGCGAGGCCGGAAGTG TTGACGGACGACAAAGCTA TTGACGGACGACAAAGCTA CGTCGCACGCTTCGGGTGG CGTCGCACGCTTCGGGTGG ACAGGGGGAATGTCATTCCACCCCACCTAACCGCA ATTGCTCTGTCCTCACTGC ATTGCTCTGTCCTCACTGC ACCTGACGGGTAGGCGGTA ACCTGACGGGTAGGCGGTA GCTCAACCGAAGTCGGGCC GCTCAACCGAAGTCGGGCC TGTTCACGACCGGTCACCG TGTTCACGACCGGTCACCG AGCGTATCCTACGAGCTCA AGCGTATCCTACGAGCTCA ATTGTAATCATCCGTTACGCCCGGGATTGTTGGA TCCCCACGGGTCCCGGCTCCTTACTAACTTCC TCCATTGTGATGCAAGTGC TCCATTGTGATGCAAGTGC GATCCAATACGTCTCAATG GATCCAATACGTCTCAATG GTCGCACGCTTCGGGTGGG GTCGCACGCTTCGGGTGGG TTGTGCAGACTGCTAATAC TTGTGCAGACTGCTAATAC CTCGCGCCCTTATGAATAC CTCGCGCCCTTATGAATAC CTCCAATGCGAAAAAGACACAATCATCCCGGCGA TTTGTGCAGACTGCTAATA TTTGTGCAGACTGCTAATA CCATTGTGATGCAAGTGCA CCATTGTGATGCAAGTGCA TACTCCAATGCGAAAAAGA T

## Challenge

This initially seemed fairly straightforward, but I was made to eat my words. This is hard.

In [30]:
def find_unbalanced_modified(graph):
    '''
    Find the unbalanced nodes in a nearly balanced graph with an Eulerian path.

    Returns (orig, term), where orig has one more outgoing than incoming edge
    and term has one more incoming than outgoing edge.
    Returns (None, None) if either node is missing.
    '''
    # degrees[node] = outdegree - indegree
    degrees = {}
    for node, successors in graph.items():
        degrees[node] = degrees.get(node, 0) + len(successors) # outdegrees
        for succ in successors:
            degrees[succ] = degrees.get(succ, 0) - 1 # indegrees

    orig, term = None, None
    for node, degree in degrees.items():
        if degree == 1:
            orig = node
        elif degree == -1:
            term = node

    if orig is None or term is None:
        return None, None
    return orig, term
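
When a reconstruction fails, it can help to check that the graph really is nearly balanced before looking for an Eulerian path. The sketch below is a hypothetical debugging helper (the name `nearly_balanced_endpoints` is not from the course). It only inspects degree balance, which is necessary but not sufficient for an Eulerian path, since connectivity is not checked.

```python
from collections import Counter

def nearly_balanced_endpoints(graph):
    """Return (start, end) if exactly one node has out - in == 1, exactly one
    has out - in == -1, and all others are balanced; otherwise return None."""
    balance = Counter()  # outdegree minus indegree
    for node, successors in graph.items():
        balance[node] += len(successors)
        for succ in successors:
            balance[succ] -= 1
    unbalanced = {n: b for n, b in balance.items() if b != 0}
    if sorted(unbalanced.values()) != [-1, 1]:
        return None
    start = next(n for n, b in unbalanced.items() if b == 1)
    end = next(n for n, b in unbalanced.items() if b == -1)
    return start, end

toy = {0: [2], 1: [3], 2: [1], 3: [0, 4], 6: [3, 7], 7: [8], 8: [9], 9: [6]}
print(nearly_balanced_endpoints(toy))          # (6, 4)
print(nearly_balanced_endpoints({1: [2, 3]}))  # None: node 1 has two surplus outgoing edges
```

A `None` result on a de Bruijn graph built from real reads usually points to gaps in coverage or to several disconnected contigs rather than to a bug in the path-finding code.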

In [31]:
with open("C_rudii.txt", "r") as f:
    readin = f.read().rstrip("\n").split("\n")
k = int(readin[0].split(" ")[0])
d = int(readin[0].split(" ")[1])
patterns_first = [line.split("|")[0] for line in readin[1:]]
patterns_second = [line.split("|")[1] for line in readin[1:]]

# break the first read of every pair into 12-mers
# (assumption: composition(k, text), defined earlier, returns the list of k-mers of text)
broken_up_patterns = []
for pattern in patterns_first:
    broken_up_patterns.extend(composition(12, pattern))

# assumption: the graph is meant to be built from the 12-mers collected above
graph = de_bruijn_from_kmers(broken_up_patterns)

with open("challenge.txt", "w") as f:
    for key, value in graph.items():
        f.write(f'{key}: {value}\n')
#print(find_unbalanced_modified(graph))


# try single-string reconstruction first, because read pairs aren't working yet
