# 01 - Bioinformatics - De Novo Assembly Method to Obtain the Gene Sequence


## Project Description

The data is in _FASTA_ format. Every two lines represent one read (a DNA fragment) of 60 bp: the first line starts with `>` and contains the name of the read, and the second line is the nucleotide sequence.
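
The two-line record layout described above can be read with a few lines of plain Python. This is a minimal sketch, not the implementation of `db.read_reads` (which is not shown in this notebook); the helper name `parse_fasta` and the toy records are hypothetical.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each read name to its nucleotide sequence (hypothetical helper)."""
    reads: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # header line: drop the leading '>'
            reads[name] = ""
        elif name is not None:
            reads[name] += line.upper()  # sequence line belongs to the last header
    return reads


# Toy records, much shorter than the 60 bp reads in the real file.
example = [">read_1", "ACGTAC", ">read_2", "GTACGG"]
print(parse_fasta(example))
```

To read from disk, the same function can be given an open file handle, since a text file iterates line by line.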

Tips: 
- The original sequence contains no repeated sequences longer than 25 bp.
- Each read overlaps other reads by at least 25 bp.
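
Taken together, the two tips suggest that a suffix-prefix match of 25 bp or more between two reads is a good sign that they are neighbours in the original sequence, assuming error-free reads from the same strand. The sketch below measures such an overlap; `overlap_length` is a hypothetical helper and is not part of the `debruijn` module.

```python
def overlap_length(left: str, right: str, min_overlap: int = 25) -> int:
    """Longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the maximum.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


# Toy example with a lowered threshold; real reads would use the default of 25.
print(overlap_length("AAACCCGGG", "CCGGGTTT", min_overlap=3))  # → 5
```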


## Project objective
These reads (DNA fragments) sample the same sequence repeatedly, so they overlap one another. The task is to build a de novo method that assembles the reads and recovers the gene sequence.

![image.png](attachment:image.png)

In [51]:
import networkx as nx
import matplotlib.pyplot as plt
import debruijn as db

In [52]:
fname = '/Users/yuliabezginova/PycharmProjects/01_BIO_DeNovoAssembly/20Xreads.fasta'
reads = db.read_reads(fname)

In [53]:
# Construct a de Bruijn graph from the set of short reads, using words (k-mers) of length k = 5
bruijn_graph = db.construct_graph(reads, 5)
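
The internals of `db.construct_graph` are not shown here. Judging only from the printed graph in this notebook, nodes appear to be k-mers, and each pair of consecutive k-mers in a read appears to add one edge, with parallel edges kept. A minimal sketch under that assumption (the name `build_kmer_graph` and the plain-dict representation are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_kmer_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Map each k-mer to the k-mers that directly follow it in some read."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        # Consecutive k-mers share k - 1 bases; every such pair becomes an edge.
        for i in range(len(read) - k):
            graph[read[i:i + k]].append(read[i + 1:i + k + 1])
    return dict(graph)


# Toy read; the notebook itself uses 60 bp reads and k = 5.
print(build_kmer_graph(["ACGTAC"], 3))
```

Because edges are stored in a list, a pair of k-mers seen in many reads appears many times, which would match the repeated edge labels in the printed output.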

In [54]:
type(bruijn_graph)

tuple

In [55]:
# Search for an Eulerian path in the graph and output the assembled sequence
path = db.output_contigs(bruijn_graph)
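
For illustration, here is one common way to find an Eulerian path and spell out a sequence from it: Hierholzer's algorithm on the textbook de Bruijn graph, where nodes are (k-1)-mers and each distinct k-mer is one edge. This is a sketch under those assumptions and may differ from what `db.output_contigs` does; the function names and toy reads are hypothetical, and the code assumes a non-empty graph in which an Eulerian path exists.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def eulerian_path(edges: Dict[str, List[str]]) -> List[str]:
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    out_edges = {node: list(targets) for node, targets in edges.items()}
    balance: Dict[str, int] = defaultdict(int)  # out-degree minus in-degree
    for node, targets in edges.items():
        balance[node] += len(targets)
        for target in targets:
            balance[target] -= 1
    # Start where out-degree exceeds in-degree by one; otherwise at any node.
    start = next((n for n, b in balance.items() if b == 1), next(iter(edges)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def assemble(reads: Iterable[str], k: int) -> str:
    """Spell the sequence along an Eulerian path over distinct k-mers."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    edges: Dict[str, List[str]] = defaultdict(list)
    for kmer in sorted(kmers):
        edges[kmer[:-1]].append(kmer[1:])  # prefix node -> suffix node
    path = eulerian_path(edges)
    return path[0] + "".join(node[-1] for node in path[1:])


toy_reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
print(assemble(toy_reads, 4))  # → ATGGCGTGCA
```

Using the set of distinct k-mers removes the effect of coverage, so each position in the sequence contributes one edge regardless of how many reads cover it.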

In [56]:
path

'TGCCGCGCTCACTGCTTTCAGCACTCGCAGCCGTGGACCGCCGTGCGGTCCTTTCCTCCGCAGTGAGCCGATTTTCAAAATGAAATAAAACACACTATTCTCTGGCTTGTTAACATTTATGCTTCTCACATTACATGGCCAACATATATGTGTGAGTCTGTGTTCCAATAAGGAATGAGGCTTATCAATCTGTTTTTCTGAAAGTTTTGGGATTGCTTTATAATCAAACCAATCATGGGCAACAGAGTGTCTGCGGTCACCATGCCCAGCCTATGGATCCACCTTGACCTCTGAAACGTCATACTAAGTAGGATCTCCAAGGAAGAATCAATTTAGACCCGGGTTAGAGAAACGGCTCGTAGAGCGCACAGCTGCAAACCCTGCAATCTGTGTTAGAAATAATACTTTTTTATTATCTCAGATTTGCGGTGAAAACTACAAGCAGCCTCACAGCTGGGCTCCATTTCCCTTTTGGTACTTGGTATTTTTACCTCCATACAGGCAGGATGACTGAAAAATAGTCCCCAGCATCGACAGACTTCGAGTATGGCCTGCAAAGCTTCCTTGAGACATTAGGTTAGGATGTGAAACATTCTTTTTAAGCTTCATTCCACATCCAACCTTTTTGACATTATCATCATGATCCTGTCTCTATCTTTTGAAACGTCAAACCTCATAGATTCTCCTAGCTAGGCGCTCCCACCATGAGGACCGCCGTGGAGTATGCAAATTTTCAAGGAGTAAGGATTACAGCCAGAATACGCGCTCGACACCGAATGGGATGACTGAAAACTACAAGGGTGTTACAAGTTCTCTGTCATTGGGTGACTATAGTTGCCTAGTTAACATGTGAGAAATCTTGTTTAAGTCGTTGTTTATAGTTAATTACATTTCAGTGTTGTACATGTGAAAATAGCCACATCACCAACAAGCAGGCAGGCTCTGGGCAACACTTGAGAAATCAAACCCTTTCTTGCCCTGCCAAAGCACAGAAGCAGC

In [57]:
# Print the contents of the graph in a (somewhat) readable form
db.print_graph(bruijn_graph)

name:  GGATT . indegree:  347 . outdegree:  347
Edges: 
GATTC
GATTC
GATTT
GATTA
GATTC
GATTT
GATTT
GATTT
GATTT
GATTT
GATTT
GATTT
GATTT
GATTT
GATTT
GATTC
GATTT
GATTT
GATTT
GATTC
GATTC
GATTT
GATTA
GATTT
GATTT
GATTT
GATTT
GATTT
GATTT
GATTT
GATTT
GATTA
GATTC
GATTA
GATTC
GATTT
GATTC
GATTT
GATTC
GATTC
GATTT
GATTA
GATTC
GATTA
GATTC
GATTT
GATTT
GATTC
GATTT
GATTC
GATTA
GATTC
GATTT
GATTT
GATTC
GATTA
GATTT
GATTA
GATTC
GATTC
GATTT
GATTA
GATTT

name:  GATTG . indegree:  182 . outdegree:  182
Edges: 
ATTGC
ATTGT
ATTGT
ATTGT
ATTGA
ATTGT
ATTGT
ATTGT
ATTGT
ATTGC
ATTGC
ATTGT
ATTGT
ATTGT
ATTGG
ATTGC
ATTGT
ATTGT
ATTGT
ATTGG
ATTGG
ATTGG
ATTGT
ATTGT
ATTGT
ATTGT
ATTGG
ATTGT
ATTGG
ATTGA
ATTGG
ATTGT
ATTGG

name:  ATTGC . indegree:  237 . outdegree:  234
Edges: 
TTGCT
TTGCT
TTGCT
TTGCA
TTGCC
TTGCA
TTGCT
TTGCT
TTGCA
TTGCT
TTGCT
TTGCA
TTGCT
TTGCT
TTGCT
TTGCT
TTGCT
TTGCC
TTGCA
TTGCT
TTGCC
TTGCT
TTGCT
TTGCC
TTGCT
TTGCA
TTGCT
TTGCC
TTGCC
TTGCT
TTGCT
TTGCC
TTGCA
TTGCT
TTGCT
TTGCC
TTGCT

name:  TTGCT . indegree:  373 .

In [None]:
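class GraphVisualization:
    """Collect directed edges and draw them with networkx."""

    def __init__(self):
        self.visual = []       # list of [source, target] pairs
        self.G = nx.DiGraph()  # note: a DiGraph stores parallel edges only once

    def addEdge(self, a, b):
        """Record a directed edge from node a to node b."""
        self.visual.append([a, b])

    def visualize(self):
        """Draw all recorded edges using a spring layout."""
        options = {
            'node_color': 'blue',
            'node_size': 200,
            'width': 2,
            'arrowstyle': '-|>',
            'arrowsize': 1,
        }
        self.G.add_edges_from(self.visual)
        # A fixed seed keeps the layout reproducible between runs
        pos = nx.spring_layout(self.G, seed=1500)
        nx.draw(self.G, pos, with_labels=False, **options)
        plt.show()

# bruijn_graph is a tuple: element 0 holds the vertices,
# element 1 maps each vertex to its outgoing edges
graph_plot = GraphVisualization()
vertexes = bruijn_graph[0]
edges = bruijn_graph[1]
for vertex in vertexes:
    for edge in edges[vertex]:
        graph_plot.addEdge(vertex, edge.label)
graph_plot.visualize()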