# 01 - BIOINFORMATICS - De Novo Assembly Method to Obtain the Gene Sequence


## Project Description

The data is in _FASTA_ format. Every two lines represent one read (a DNA fragment) of 60 bp: the first line starts with '>' and contains the name of the read, and the second line is the nucleotide sequence.
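
To make the two-line record layout concrete, here is a minimal sketch of a FASTA reader. It is not the code behind `db.read_reads`, whose implementation is not shown in this notebook; `parse_fasta` and the toy read names are hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each read name (the text after '>') to its nucleotide sequence."""
    reads: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            reads[name] = ""
        elif name is not None:
            # Append, so a sequence wrapped over several lines still works.
            reads[name] += line.upper()
    return reads


# Toy input (hypothetical names, much shorter than the 60 bp reads).
example = [">read_1", "ACGTAC", ">read_2", "GTACGG"]
parsed = parse_fasta(example)
print(parsed)
```

For a real file, the same function could be called as `parse_fasta(open(fname))`, since a file object iterates over its lines.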

Tips:
- Repeated sequences longer than 25 bp do not occur in the original sequence.
- Each read overlaps with other reads by at least 25 bp.
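
The overlap guarantee can be checked directly on a pair of reads. The sketch below finds the longest suffix of one read that equals a prefix of another and ignores anything shorter than a threshold. `overlap_length` is a hypothetical helper, not part of the `debruijn` module, and the toy strings use a threshold of 3 instead of 25 to stay readable.

```python
def overlap_length(left: str, right: str, min_overlap: int = 25) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` characters exists.
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate first and stop at the first match.
    for size in range(longest_possible, max(min_overlap, 1) - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


# Toy reads: the suffix "ACGT" of the first is a prefix of the second.
print(overlap_length("ACGTACGT", "ACGTTT", min_overlap=3))
```

Because every read shares at least 25 bp with a neighbour, any word length k of 25 or less leaves overlapping reads with k-mers in common, which is what lets a de Bruijn graph link them.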


## Project Objective
The reads (DNA fragments) overlap one another. The task is to build a de novo method that assembles these reads and recovers the gene sequence.

![image.png](attachment:image.png)

In [51]:
import networkx as nx
import matplotlib.pyplot as plt
import debruijn as db

In [52]:
fname = '/Users/yuliabezginova/PycharmProjects/01_BIO_DeNovoAssembly/20Xreads.fasta'
reads = db.read_reads(fname)

In [64]:
# Construct a de Bruijn graph from the set of short reads, using words (k-mers) of length k = 10
bruijn_graph = db.construct_graph(reads, 10)

In [65]:
type(bruijn_graph)

tuple
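
How `db.construct_graph` builds its tuple is not shown in this notebook, so the following is only a minimal sketch of one common construction. It assumes nodes are k-mers and that an edge joins each k-mer to the k-mer that follows it in a read; `build_kmer_graph` and its plain-dictionary return type are hypothetical and differ from the module's tuple.

```python
from typing import Dict, Iterable, List


def build_kmer_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Adjacency list in which each k-mer points to the k-mers that follow it."""
    graph: Dict[str, List[str]] = {}
    for read in reads:
        for i in range(len(read) - k):
            source = read[i:i + k]
            target = read[i + 1:i + k + 1]
            # Duplicates are kept: one edge per occurrence in the reads.
            graph.setdefault(source, []).append(target)
            # Make sure nodes without outgoing edges are present too.
            graph.setdefault(target, [])
    return graph


# Two toy reads that overlap in "CGTA", with k = 3.
toy_graph = build_kmer_graph(["ACGTA", "CGTAC"], k=3)
print(toy_graph)
```

The transition from `CGT` to `GTA` is seen in both toy reads, so it is stored twice; repeated edges of this kind reflect how many reads cover a transition.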

In [66]:
# Search for an Eulerian path in the graph and output the assembled sequence
path = db.output_contigs(bruijn_graph)

In [67]:
path

'GGATTGCTTTCAGCACTCGCAGCCGTGGACCGCCGTGCGGTCCTTTCCTCCGCAGTGAGCCGATTTGCTCTGCCAGCAGCTGTCGGTGCCGCGCTCGACACCGAGTCCTAGCTAGGCGCTCACAGAATACGCGCTCCCTCCCTCCCCCTTCTCTGTCCCCCGCCTCTCGCTCACCCCGGCCCACTCCAGCGGCGACTTTGAGGGATTCCCTCTCTGGCGGCCTCTGCAGCAGCACAGCCGGCCTCATTCGGGGCACTGCGAGTATGGATCTCCAAGGAAGAGGGGTCCCCAGCATCGACAGACTTCGAGTTCTCCTGATGTTGTTCCGTGAGTAGCGATTTGGCGACTGGGAGAGAGGACGGGCACTCTTTGCAGAGAACAGAGCCGGCAAAGTAAAGTCAATTAGCTTGAGATAAAAAATATGGCCCTTAATGTTTTGCGGTGTTTTTTCTTTTTTTCTTTTAGGGGAGGAGGGAGGGCCTTCCTCGCCGGTGGCAGGGAATTCCCCTGAGAGCCATACACCTCTGTGGCATATTCCAGACAGAGCCTAGACAGGCCAGCTTCGTGAGGAGGATATCCTTAAGCTCCAGGGAGGGAAGGGTACCTGAAGGCTGAGCCGTTTAAGGCCAGCTGCTTCTGCTGAGTGTGGAGCAAGGCTGTCTCTCAGTCTGTTTAGCCTCTGAAAGTTTTGGGATTACAGGCAGGATCCACCATGCCCAGCCTAGGTTTTATACAGTAACAAGGGTTGGGTTTTTAAGTATCATTCTCCTCTTCAAACAGTTTCAGTCCATGGTGCTAGGAGAGCCTGAGGAAGCAGAGAGCTTGCCTCTGCTTCCCATTGCCTATCCCAATGCCATTCATTGGGGGTGCAATGAAGCGCTCTTGTAGTTTATCTCAGGGAGTCAATATAATTTTTTTAACTGTATTTTCATCTCACTTTAGTCCTTTCTTCTCAGTCTCCCTTTGTTTCCTCCTTCTTCTTTTCTTTACCATACTATA
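
For intuition about the Eulerian path step, here is a minimal sketch of Hierholzer's algorithm on a toy graph. It is not the code of `db.output_contigs`; `eulerian_path` and `spell` are hypothetical helpers, and the sketch assumes that the graph has at least one edge and that an Eulerian path exists.

```python
from typing import Dict, List


def eulerian_path(graph: Dict[str, List[str]]) -> List[str]:
    """Return the nodes of an Eulerian path (Hierholzer's algorithm)."""
    # Work on a copy so the caller's adjacency lists are not consumed.
    remaining = {node: list(targets) for node, targets in graph.items()}
    indegree: Dict[str, int] = {}
    for targets in graph.values():
        for target in targets:
            indegree[target] = indegree.get(target, 0) + 1
            remaining.setdefault(target, [])
    # Start where outdegree exceeds indegree by one, if such a node exists.
    start = next(
        (n for n in remaining if len(remaining[n]) - indegree.get(n, 0) == 1),
        None,
    )
    if start is None:
        start = next(n for n in remaining if remaining[n])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining[node]:
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(node_path: List[str]) -> str:
    """Join nodes that overlap by all but their last character."""
    return node_path[0] + "".join(node[-1] for node in node_path[1:])


toy = {"ACG": ["CGT"], "CGT": ["GTA"], "GTA": ["TAC"], "TAC": []}
nodes = eulerian_path(toy)
print(spell(nodes))  # ACGTAC
```

Each edge is used exactly once, so a graph that keeps one edge per read occurrence would need its duplicate edges collapsed before a path like this spells the sequence only once.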

In [68]:
# Print the graph's nodes and edges in a (somewhat) readable form
db.print_graph(bruijn_graph)

name:  GGATTGCTTT . indegree:  0 . outdegree:  1
Edges: 

name:  GATTGCTTTC . indegree:  1 . outdegree:  2
Edges: 
ATTGCTTTCA

name:  ATTGCTTTCA . indegree:  2 . outdegree:  2
Edges: 
TTGCTTTCAG

name:  TTGCTTTCAG . indegree:  2 . outdegree:  3
Edges: 
TGCTTTCAGC
TGCTTTCAGC

name:  TGCTTTCAGC . indegree:  3 . outdegree:  4
Edges: 
GCTTTCAGCA
GCTTTCAGCA
GCTTTCAGCA

name:  GCTTTCAGCA . indegree:  4 . outdegree:  4
Edges: 
CTTTCAGCAC
CTTTCAGCAC
CTTTCAGCAC

name:  CTTTCAGCAC . indegree:  4 . outdegree:  4
Edges: 
TTTCAGCACT
TTTCAGCACT
TTTCAGCACT

name:  TTTCAGCACT . indegree:  4 . outdegree:  4
Edges: 
TTCAGCACTC
TTCAGCACTC
TTCAGCACTC

name:  TTCAGCACTC . indegree:  4 . outdegree:  5
Edges: 
TCAGCACTCG
TCAGCACTCG
TCAGCACTCG
TCAGCACTCG

name:  TCAGCACTCG . indegree:  5 . outdegree:  5
Edges: 
CAGCACTCGC
CAGCACTCGC
CAGCACTCGC
CAGCACTCGC

name:  CAGCACTCGC . indegree:  5 . outdegree:  5
Edges: 
AGCACTCGCA
AGCACTCGCA
AGCACTCGCA
AGCACTCGCA

name:  AGCACTCGCA . indegree:  5 . outdegree:  5
Edges

name:  TGCAGGGGAG . indegree:  11 . outdegree:  12
Edges: 
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA
GCAGGGGAGA

name:  GCAGGGGAGA . indegree:  12 . outdegree:  12
Edges: 
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG
CAGGGGAGAG

name:  CAGGGGAGAG . indegree:  12 . outdegree:  12
Edges: 
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA
AGGGGAGAGA

name:  AGGGGAGAGA . indegree:  12 . outdegree:  12
Edges: 
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG
GGGGAGAGAG

name:  GGGGAGAGAG . indegree:  12 . outdegree:  12
Edges: 
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT
GGGAGAGAGT

name:  GGGAGAGAGT . indegree:  12 . outd

AACGATTAAG
AACGATTAAG
AACGATTAAG
AACGATTAAG
AACGATTAAG
AACGATTAAG
AACGATTAAG

name:  AACGATTAAG . indegree:  14 . outdegree:  14
Edges: 
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT
ACGATTAAGT

name:  ACGATTAAGT . indegree:  14 . outdegree:  14
Edges: 
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG
CGATTAAGTG

name:  CGATTAAGTG . indegree:  14 . outdegree:  14
Edges: 
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT
GATTAAGTGT

name:  GATTAAGTGT . indegree:  14 . outdegree:  15
Edges: 
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT
ATTAAGTGTT

name:  ATTAAGTGTT . indegree:  15 . outdegree:  16
Edge

GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG
GCGTGCCCTG

name:  GCGTGCCCTG . indegree:  15 . outdegree:  15
Edges: 
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC
CGTGCCCTGC

name:  CGTGCCCTGC . indegree:  15 . outdegree:  16
Edges: 
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT
GTGCCCTGCT

name:  GTGCCCTGCT . indegree:  16 . outdegree:  16
Edges: 
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT
TGCCCTGCTT

name:  TGCCCTGCTT . indegree:  16 . outdegree:  17
Edges: 
GCCCTGCTTT
GCCCTGCTTT
GCCCTGCTTT
GCCCTGCTTT
GCCCTGCTTT
GCCCTGCTTT
GCCCTGCTTT
GCCCTGCTTT
G




In [69]:
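class GraphVisualization:
    """Collect directed edges and draw them with networkx."""

    def __init__(self):
        self.visual = []       # queued [source, target] pairs
        self.G = nx.DiGraph()  # a DiGraph keeps one edge per ordered node pair

    def addEdge(self, a, b):
        """Queue a directed edge a -> b for drawing."""
        self.visual.append([a, b])

    def visualize(self):
        """Add the queued edges to the graph and draw it."""
        options = {
            'node_color': 'blue',
            'node_size': 50,
            'width': 1,
            'arrowstyle': '-|>',
            'arrowsize': 10,
        }
        self.G.add_edges_from(self.visual)
        # A fixed seed makes the spring layout reproducible between runs.
        pos = nx.spring_layout(self.G, seed=1500)
        nx.draw(self.G, pos, with_labels=False, **options)
        plt.show()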

In [None]:
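graph_plot = GraphVisualization()

# construct_graph returned a tuple: element 0 holds the nodes, element 1 the edges
vertices = bruijn_graph[0]
edges = bruijn_graph[1]

# Queue one directed edge per (node, outgoing edge) pair, then draw the graph
for vertex in vertices:
    for edge in edges[vertex]:
        graph_plot.addEdge(vertex, edge.label)
graph_plot.visualize()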
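
An `nx.DiGraph` stores at most one edge per ordered pair of nodes, so the repeated edges seen in the printed graph collapse into single arrows in the plot. If the multiplicity matters, it can be counted before drawing. The sketch below does this on a plain adjacency dictionary; `edge_multiplicities` and the toy data are hypothetical and do not use the `debruijn` module's own node and edge objects.

```python
from collections import Counter
from typing import Dict, List


def edge_multiplicities(adjacency: Dict[str, List[str]]) -> Counter:
    """Count how many times each (source, target) edge occurs."""
    counts: Counter = Counter()
    for source, targets in adjacency.items():
        for target in targets:
            counts[(source, target)] += 1
    return counts


# Toy adjacency list with one duplicated edge.
toy_adjacency = {"CGT": ["GTA", "GTA"], "GTA": ["TAC"], "TAC": []}
weights = edge_multiplicities(toy_adjacency)
print(weights)
```

The counts could then be passed to networkx as `(source, target, count)` triples through `add_weighted_edges_from`, which stores each count as a `weight` attribute on the collapsed edge.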