## De Bruijn Graphs

(*inspired by Ben Langmead*)

In this notebook, we will build and visualize De Bruijn graphs constructed from a genome sequence and from a set of reads of length k.

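First up, let's define a function that builds the set of (k-1)-mer nodes and the list of edges from a genome string. Each k-mer in the string contributes one edge, from its left (k-1)-mer to its right (k-1)-mer.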

In [None]:
def de_bruijn_ize(st, k):
    """ Return the set of k-1-mer nodes and a list of edges, where each
        edge pairs a k-mer's left k-1-mer with its right k-1-mer """
    edges = []
    nodes = set()
    for i in range(len(st) - k + 1):
        edges.append((st[i:i+k-1], st[i+1:i+k]))
        nodes.add(st[i:i+k-1])
        nodes.add(st[i+1:i+k])
    return nodes, edges



Call this function to define the De Bruijn graph for a given "genome" sequence and for a given k-mer length k. 

In [None]:
nodes, edges = de_bruijn_ize("ATGGCGTGCAATG", 4)


In [None]:
nodes

In [None]:
edges
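As a quick sanity check on this output: a string of length n contains n - k + 1 k-mers, so the edge list should have that many entries, and every node should be a (k-1)-mer. The sketch below restates `de_bruijn_ize` so that it runs on its own, and uses `collections.Counter` to look for repeated edges. The names `genome`, `edge_counts`, and `repeated` are illustrative and not part of the notebook.

```python
from collections import Counter

def de_bruijn_ize(st, k):
    """Return the set of (k-1)-mer nodes and the list of edges for st."""
    edges = []
    nodes = set()
    for i in range(len(st) - k + 1):
        edges.append((st[i:i+k-1], st[i+1:i+k]))
        nodes.add(st[i:i+k-1])
        nodes.add(st[i+1:i+k])
    return nodes, edges

genome = "ATGGCGTGCAATG"
k = 4
nodes, edges = de_bruijn_ize(genome, k)

# One edge per k-mer, and every node is a (k-1)-mer.
assert len(edges) == len(genome) - k + 1
assert all(len(n) == k - 1 for n in nodes)

# Edges that occur more than once make the graph a multigraph.
edge_counts = Counter(edges)
repeated = {e: c for e, c in edge_counts.items() if c > 1}
print(len(nodes), len(edges), repeated)  # prints: 10 10 {}
```

For this genome no edge is repeated, but the (k-1)-mer ATG occurs at both ends of the string, which is why there are 10 distinct nodes rather than 11.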

Let's visualize the graph with the graphviz library. Colab displays a graphviz object as a formatted graph. 

In [None]:
import graphviz

In [None]:
def visualize_de_bruijn(nodes, edges):
    """ Visualize a directed multigraph using graphviz """

    g = graphviz.Digraph(comment='DeBruijn graph')

    for n in nodes:
        g.node(n, n)
    for src, dst in edges:
        g.edge(src, dst)
    return g

In [None]:
visualize_de_bruijn(nodes, edges)
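Outside Colab, or when the graphviz package is not installed, it can help to look at the graph as plain DOT text. The following is a minimal sketch that builds that text by hand; `de_bruijn_to_dot` is a hypothetical helper, not part of the notebook or of the graphviz package, and the small `nodes` and `edges` values are a toy example.

```python
def de_bruijn_to_dot(nodes, edges):
    """Return DOT source for a directed multigraph (hypothetical helper)."""
    lines = ["digraph {"]
    # Sort the nodes so the output is stable across runs.
    for n in sorted(nodes):
        lines.append(f'    "{n}";')
    # Emit one line per edge; repeated edges appear as repeated lines.
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

# Toy input: the first two edges of the graph built above.
nodes = {"ATG", "TGG", "GGC"}
edges = [("ATG", "TGG"), ("TGG", "GGC")]
dot_text = de_bruijn_to_dot(nodes, edges)
print(dot_text)
```

The printed text can be saved to a file and passed to any Graphviz renderer.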

Okay, next up, let's build a De Bruijn graph from a set of reads, each of length k.

In [None]:
reads = ["ATGG",
         "TGGC",
         "GGCG",
         "GCGT",
         "CGTG",
         "GTGC",
         "TGCA",
         "GCAA",
         "CAAT",
         "AATG"]


We will define a similar De Bruijn graph construction function, this time taking a list of reads. 

In [None]:
def de_bruijn_ize_reads(reads, k):
    """ Return the set of k-1-mer nodes and a list of edges, where each
        edge pairs a read's left k-1-mer with its right k-1-mer.
        Assumes every read has length k """
    edges = []
    nodes = set()
    for read in reads:
        edges.append((read[:k-1], read[1:k]))
        nodes.add(read[:k-1])
        nodes.add(read[1:k])
    return nodes, edges

In [None]:
nodes2, edges2 = de_bruijn_ize_reads(reads, 4)


In [None]:
nodes2

In [None]:
edges2

In [None]:
visualize_de_bruijn(nodes2, edges2)
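The reads used here are exactly the ten 4-mers of the genome string "ATGGCGTGCAATG", so the two constructions should give the same graph. The sketch below checks that, restating both functions so it runs on its own. It regenerates the reads by slicing the genome instead of retyping the list, and the names `same_nodes` and `same_edges` are illustrative.

```python
def de_bruijn_ize(st, k):
    """Return nodes and edges of the De Bruijn graph of a genome string."""
    edges = []
    nodes = set()
    for i in range(len(st) - k + 1):
        edges.append((st[i:i+k-1], st[i+1:i+k]))
        nodes.add(st[i:i+k-1])
        nodes.add(st[i+1:i+k])
    return nodes, edges

def de_bruijn_ize_reads(reads, k):
    """Return nodes and edges of the De Bruijn graph of length-k reads."""
    edges = []
    nodes = set()
    for read in reads:
        edges.append((read[:k-1], read[1:k]))
        nodes.add(read[:k-1])
        nodes.add(read[1:k])
    return nodes, edges

genome = "ATGGCGTGCAATG"
k = 4
# Every k-mer of the genome, in order; this matches the reads list above.
reads = [genome[i:i+k] for i in range(len(genome) - k + 1)]

nodes, edges = de_bruijn_ize(genome, k)
nodes2, edges2 = de_bruijn_ize_reads(reads, k)

same_nodes = nodes == nodes2
# Sort before comparing so the check does not depend on read order.
same_edges = sorted(edges) == sorted(edges2)
print(same_nodes, same_edges)  # prints: True True
```

With real sequencing data the reads would cover the genome unevenly and could contain errors, so this equality would not hold in general.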