# Graphical Methods
This is a Rosalind-based assignment. Complete the 7 new problems listed below, all of which come from chapter 3 of the text.

You are encouraged to keep reusable functions and classes in a module. If you do, remember to submit that module with your notebook.

| Program Name | Rosalind Problem |
|:-------------|:------------------------------------------------------|
|problem8| Generate the k-mer Composition of a String|
|problem9| Reconstruct a String from its Genome Path|
|problem10| Construct the Overlap Graph of a Collection of k-mers|
|problem11| Construct the De Bruijn Graph of a String |
|problem12| Construct the De Bruijn Graph of a Collection of k-mers|
|problem13| Find an Eulerian Path in a Graph|
|problem14| Reconstruct a String from its k-mer Composition|





## @TIANGE YU(JILL)
## BME205 HW3 @UCSC 

# Problem 8: 

In [9]:
def composition(Text, k):
    '''
    Generate the k-mer composition of a string.
    input:
        Text: DNA string
        k: integer length of each k-mer
    return:
        lexicographically sorted list of all k-mers in Text'''
    comp = []
    for i in range(0, len(Text) - k + 1):
        comp.append(Text[i:i+k])
    # sort once, after all k-mers are collected, rather than on every iteration
    comp.sort()

    return comp

def main1(inFile=None):
    '''
    read k and Text from the dataset and write the sorted k-mers to file'''
    # context managers close both files even if an error occurs
    with open(inFile, 'r') as fr:
        k = int(fr.readline().strip())
        Text = fr.readline().strip()

    temp = composition(Text, k)

    with open('p8ans.txt', 'w') as fw:
        for seq in temp:
            fw.write(seq + '\n')
        
if __name__ == "__main__":
    main1(inFile = 'rosalind_ba3a.txt')   

## Inspection Results
- Gia Hagos: The code is very clean looking and the doc strings make it even simpler to understand
- William Gao: great docstrings and very concise code. I would recommend using `with open(file) as f: ...` so it'll properly close the file
- Lucy Zheng: Very clean and concise docstrings that explain your method and main. Make sure to create a class, as the TAs mentioned.
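
The file-handling suggestion above can be sketched as follows. This is only an illustration, not the submitted solution: the names `kmer_composition`, `solve`, `in_path`, and `out_path` are hypothetical, and the input layout (k on the first line, the string on the second) is assumed to match what `main1` reads.

```python
def kmer_composition(text, k):
    """Return every k-mer of text in lexicographic order."""
    return sorted(text[i:i + k] for i in range(len(text) - k + 1))


def solve(in_path, out_path):
    """Hypothetical driver: read k and the string, write one k-mer per line."""
    # Each `with` block closes its file on exit, even if an error is raised.
    with open(in_path) as fin:
        k = int(fin.readline().strip())
        text = fin.readline().strip()
    with open(out_path, "w") as fout:
        fout.write("\n".join(kmer_composition(text, k)) + "\n")


print(kmer_composition("CAATCCAAC", 5))
# ['AATCC', 'ATCCA', 'CAATC', 'CCAAC', 'TCCAA']
```

Sorting once with `sorted` keeps the work at a single O(n log n) pass instead of re-sorting after every append.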


# Problem 9: 

In [10]:
def reconstruction(comp):
    '''
    Find the string spelled by a genome path
    Input:
        comp: list of n k-mers in path order, where the last k-1 symbols
              of each k-mer equal the first k-1 symbols of the next
    Output:
        a string Text of length k+n-1'''
    # The path is already ordered, so start from the first k-mer and
    # append the final symbol of every following k-mer. This also works
    # when a (k-1)-mer repeats along the path.
    path = comp[0]
    for seq in comp[1:]:
        path += seq[-1]

    return path

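def main2(inFile=None):
    '''
    read the k-mers of the genome path (one per line) into a list
    and write the spelled string to file'''
    with open(inFile) as f:
        # skip blank lines so an empty string is never treated as a k-mer
        comp = [line.strip() for line in f if line.strip()]

    res = reconstruction(comp)

    with open('p9ans.txt', 'w') as fw:
        fw.write(res)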
   
if __name__ == "__main__":
    main2(inFile ='rosalind_ba3b.txt')  

## Inspection Results
- Gia Hagos: Docstrings could better explain what certain things are for but otherwise all looks fine
- William Gao: I agree that docstrings could be better, but otherwise this looks good
- Lucy Zheng: Overall, the code looks organized. Maybe add some comments to explain the different parts of your code.
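
To make the expected output concrete, here is a minimal sketch of spelling a string from a genome path, assuming the k-mers arrive in path order so that consecutive k-mers overlap in k-1 symbols. The function name `spell_path` is hypothetical and not part of the notebook.

```python
def spell_path(kmers):
    """Spell the string for an ordered genome path of k-mers."""
    if not kmers:
        return ""
    # First k-mer in full, then only the last symbol of each later k-mer.
    return kmers[0] + "".join(kmer[-1] for kmer in kmers[1:])


path = ["ACCGA", "CCGAA", "CGAAG", "GAAGC", "AAGCT"]
text = spell_path(path)
print(text)  # ACCGAAGCT

# n k-mers of length k spell a string of length k + n - 1.
assert len(text) == len(path[0]) + len(path) - 1
```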

In [11]:
class Graph:
    
    def __init__(self):
        self.graph = {}
        
    def getNodes(self):
        '''get all the nodes of the graph'''
        return list(self.graph.keys())
    
    def getEdges(self):
        '''get all the edges of the graph in the form (node1, node2)'''
        edges = []
        for vertex in self.getNodes():
            for node in self.graph[vertex]:
                edges.append((vertex, node))
        return edges
                    
    def addNodes(self, node):
        '''add a new node to the graph'''
        if node not in self.getNodes():
            self.graph[node] = []
            
    def addEdges(self, n1, n2):
        '''add new edge to the graph'''
        if n1 not in self.getNodes():
            self.addNodes(n1)
        if n2 not in self.getNodes():
            self.addNodes(n2)
        if n2 not in self.graph[n1]:
            self.graph[n1].append(n2)
            
    def addEdges_dB(self, n1, n2):
        '''add new edge to de bruijn graph, 
        duplication allowed'''
        if n1 not in self.getNodes():
            self.addNodes(n1)
        if n2 not in self.getNodes():
            self.addNodes(n2)
        self.graph[n1].append(n2)  
          
    def getSuccessors(self, node):
        '''get all the successors of certain node on graph'''
        return list(self.graph[node])
    
    def getPredecessors(self, node):
        '''get all the predecessors of a node, one entry per incoming edge'''
        pred = []
        for k in self.getNodes():
            # record k once per parallel edge so inDegree is counted correctly
            for successor in self.graph[k]:
                if successor == node:
                    pred.append(k)
        return pred
    
    def outDegree(self, node):
        '''get number of out edges of certain node'''
        return len(self.graph[node])
    
    def inDegree(self, node):
        '''get number of in edges of certain node'''
        return len(self.getPredecessors(node))
        

# Problem 10: 

In [12]:
class OverlapGraph(Graph):
    
    def __init__(self):
        Graph.__init__(self)
    
    def get_overlap_graph(self, patterns):
        '''construct the overlap graph of a collection of k-mers
        input:
            patterns: list of strings, a collection of k-mers
        output:
            dictionary: the overlap graph in the form of an adjacency list'''
        for p in patterns:
            suf = p[1:]
            for pattern in patterns:
                pred = pattern[:-1]
                if pred == suf:
                    self.addEdges(p, pattern)
        for k in self.getNodes():
            if len(self.graph[k]) == 0:
                del self.graph[k]
        return self.graph     

def main3(inFile=None):
    '''read file into list of patterns 
    and write adjacency list to file'''
    patterns = []
    with open(inFile, 'r') as f: 
        for line in f:
            patterns.append(line.strip())
  
    og = OverlapGraph()
    g = og.get_overlap_graph(patterns)  
    
    with open('p10ans.txt', 'w') as fw:
        for key in list(g.keys()):
            s = key + ' -> ' + ', '.join(g[key]) + '\n'
            fw.write(s)

   
if __name__ == "__main__":
    main3(inFile='rosalind_ba3c.txt')

## Inspection Results
- Gia Hagos: I like the acknowledgement of nodes and edges in this code, as the question doesn't directly bring it up but it's still applicable. It's very clear and concise and shows good understanding of the material.
- William Gao: I think the preferred way to call the parent constructor is to use `super().__init__()` (though your way works great as well), otherwise I like the object-oriented approach here.
- Lucy Zheng: Overall, the code is neat and concise, organized and easy to read
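
As an alternative to comparing every pair of k-mers, the overlap graph can be built by indexing k-mers by their prefix, so each suffix lookup is a single dictionary access. The sketch below is a swapped-in technique, not the notebook's method; the function name `overlap_graph` is hypothetical, and duplicate k-mers are assumed to collapse into a single edge, as `addEdges` does.

```python
from collections import defaultdict


def overlap_graph(patterns):
    """Map each k-mer to the k-mers whose prefix equals its suffix."""
    by_prefix = defaultdict(list)
    for p in patterns:
        by_prefix[p[:-1]].append(p)

    graph = {}
    for p in patterns:
        targets = by_prefix.get(p[1:], [])
        if targets:
            # Only nodes with at least one outgoing edge are kept.
            graph[p] = sorted(set(targets))
    return graph


kmers = ["ATGCG", "GCATG", "CATGC", "AGGCA", "GGCAT"]
for node, successors in sorted(overlap_graph(kmers).items()):
    print(node, "->", ", ".join(successors))
```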

# Problem 11: 

In [13]:
class DeBruijnGraph(Graph):
    
    def __init__(self):
        Graph.__init__(self)
        
    def build_DeBruijn(self, comp):
        '''
        Construct the de Bruijn graph of a string
        Input: 
            comp: list of k-mers
        Output:
            De Bruijn graph in form of adjacency list'''
        for seq in comp:
            self.addNodes(seq[1:])
            self.addNodes(seq[:-1])
            self.addEdges_dB(seq[:-1], seq[1:])
            
        for k in self.getNodes():
            if len(self.graph[k]) == 0:
                del self.graph[k]
        return self.graph
    
def main4(inFile=None):
    '''
    read k and Text from file, build the de Bruijn graph of Text,
    and write its adjacency list to file'''
    with open(inFile, 'r') as fr:
        k = int(fr.readline().strip())
        Text = fr.readline().strip()

    comp = composition(Text, k)
    dB = DeBruijnGraph()
    g = dB.build_DeBruijn(comp)
    
    with open('p11ans.txt', 'w') as fw:
        for key in list(g.keys()):
            s = key + ' -> ' + ', '.join(g[key]) + '\n'
            fw.write(s)
    
if __name__ == "__main__":
    main4(inFile='rosalind_ba3d.txt')    

## Inspection Results
- Gia Hagos: The code is very efficiently written and easy to follow
- William Gao: great work! I like that you are writing to an output file for this one since the output may get large.
- Lucy Zheng: I like the edge and node variable and how you made it easier to understand that within your code

# Problem 12: 

In [14]:
def dB_Graph_Kmer(patterns):
    '''
    construct the de bruijn from a collection of kmers
    input:
        list of k-mers patterns
    output:
        the de bruijn graph in form of adjacency list
    '''
    dB = DeBruijnGraph()      
    dB.build_DeBruijn(patterns)
    
    return dB.graph

def main5(inFile=None):
    '''
    read file into list and write file'''
    with open(inFile, 'r') as f:
        # skip blank lines so an empty string is never treated as a k-mer
        seqs = [line.strip() for line in f if line.strip()]
        
    g = dB_Graph_Kmer(seqs)
    
    with open('p12ans.txt', 'w') as fw:
        for key in list(g.keys()):
            s = key + ' -> ' + ', '.join(g[key]) + '\n'
            fw.write(s)
    
if __name__ == "__main__":
    main5(inFile='rosalind_ba3e.txt')   

## Inspection Results
- Gia Hagos: Although it wasn't apparent at first, nice use of reusing code from other parts of the notebook
- William Gao: I like that you recognized that this is almost the same as the last problem so you're reusing that code.
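
The reuse the reviewers point out can be sketched in a few lines: the de Bruijn graph of a string is the de Bruijn graph of its k-mer composition. The function names below are hypothetical. The sketch keeps repeated edges, as `addEdges_dB` does, because each occurrence of a k-mer must be traversed separately when searching for an Eulerian path.

```python
from collections import defaultdict


def de_bruijn_from_kmers(kmers):
    """Adjacency list: prefix (k-1)-mer -> list of suffix (k-1)-mers."""
    graph = defaultdict(list)
    for kmer in kmers:
        # A repeated k-mer appends the same successor again (parallel edge).
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def de_bruijn_from_text(text, k):
    """Build the graph of a string by reusing the k-mer version."""
    return de_bruijn_from_kmers(text[i:i + k] for i in range(len(text) - k + 1))


kmers = ["GAGG", "CAGG", "GGGG", "GGGA", "CAGG", "AGGG", "GGAG"]
graph = de_bruijn_from_kmers(kmers)
print(graph["CAG"])  # ['AGG', 'AGG']
```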

# Problem 13: 

In [15]:
class Euler(Graph):
    
    def __init__(self):
        Graph.__init__(self)
        
    def edges(self, g):
        '''
        Find all edges on a graph
        input:
            a directed graph
        output:
            a list of tuples, each representing an edge (n1, n2)'''
        edges = []
        for vertex in list(g.keys()):
            for node in g[vertex]:
                edges.append((vertex, node))    
        return edges
    
        
    def Eulerian_cycle(self, g):
        '''
        Find the Eulerian cycle in a graph
        input: 
            a directed graph that contains an Eulerian cycle
        output:
            an Eulerian cycle of this graph as a list of nodes'''
        visits = self.edges(g) # list of edges
        res = []
        start = visits[0][0]
        res.append(start)
        cur = visits[0][1]
        res.append(cur)
        visits.remove((start, cur))
        
        while len(visits) != 0:
            while cur != start:
                for edge in visits:
                    if edge[0] == cur:
                        res.append(edge[1])
                        visited_pair = (edge[0], edge[1])
                        visits.remove(visited_pair)
                        cur = res[-1]
                        start = res[0]
                        
            # looking for node with unused edge           
            for i in range(0, len(res)):
                find = False
                for j in range(0, len(visits)):
                    if res[i] == visits[j][0]:
                        # found unused edge and revisit res
                        res = res[i:] + res[1:i+1]
                        res.append(visits[j][1])
                        visits.remove((visits[j][0], visits[j][1]))
                        cur = res[-1]
                        start = res[0]
                        # FOUND
                        find = True
                        break                    
                if find:
                    break
                          
        return res
    
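    def Predecessors(self, node, g):
        '''get all the predecessors of a node, one entry per incoming edge'''
        pred = []
        for k in list(g.keys()):
            # record k once per parallel edge so the in-degree is counted correctly
            for successor in g[k]:
                if successor == node:
                    pred.append(k)
        return pred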
    
    def outD(self, node, g):
        '''get number of out edges of certain node'''
        return len(g[node])
    
    def inD(self, node, g):
        '''get number of in edges of certain node'''
        return len(self.Predecessors(node, g))
    
    def Eulerian_path(self, g):
        '''
        Find Eulerian path from graph
        input:
            the adjacency list of a directed graph that has an Eulerian path
        output:
            an Eulerian path in this graph'''
        # edge that closes the path into a cycle, in the form [path end, path start]
        new_edge = [None, None]
        for node in list(g.keys()):
            outD = self.outD(node, g)
            inD = self.inD(node, g)
            if outD - inD == -1:
                # one more incoming than outgoing edge: the path ends here
                new_edge[0] = node
                                  
            elif inD - outD == -1:
                # one more outgoing than incoming edge: the path starts here
                new_edge[1] = node 
                
        # handle the case where the end node has only incoming edges,
        # so it never appears as a key of the adjacency list
        if new_edge[0] is None:
            keys = set(g.keys())
            for n in list(g.keys()):
                find = False
                for s in g[n]:
                    if s not in keys:
                        new_edge[0] = s  
                        find = True
                        break
                if find:
                    break  
      
        # draw an edge from the path end to the path start to form an Eulerian cycle
        # (note: this adds the edge to g in place)
        s = new_edge[0]
        e = new_edge[1]
        if s not in g:
            g[s] = []
        g[s].append(e)
        # get cycle
        cycle = self.Eulerian_cycle(g)
        # get path: cut the cycle at the added edge s -> e
        path = None
        for i in range(0, len(cycle) - 1):
            if cycle[i] == s and cycle[i+1] == e:  
                path = cycle[i+1:] + cycle[1:i+1]
                break

        return path
    
def main6(inFile=None):
    '''read given file into dictionary representing a directed graph
    write eulerian path into answer file'''
    
    # initialization
    graph = Graph()
    euler = Euler()
    
    # open file
    with open(inFile) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            temp = line.strip().split(' -> ')
            graph.addNodes(temp[0])
            for n in temp[1].split(','):
                # addEdges_dB keeps parallel edges, which an Eulerian path
                # must traverse separately
                graph.addEdges_dB(temp[0], n.strip())
                
    # get path
    g = graph.graph
    path = euler.Eulerian_path(g)
    
    # write path into file; the context manager closes it afterwards
    with open('p13ans.txt', 'w') as fw:
        fw.write('->'.join(str(i) for i in path))
    
if __name__ == "__main__":
    main6(inFile='rosalind_ba3g.txt')           

## Inspection Results
- Gia Hagos: Very well organised and easy to follow, no problems here
- William Gao: I really like that you have a `Graph` class and inheriting from it in many questions. very organized!
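
For comparison, an Eulerian path can also be found without first closing the path into a cycle. The sketch below uses the iterative, stack-based form of Hierholzer's algorithm, which is a different technique from the `Euler` class above. It assumes the input is a non-empty adjacency list that actually contains an Eulerian path; the function name `eulerian_path` is hypothetical.

```python
from collections import defaultdict


def eulerian_path(graph):
    """Return a list of nodes that uses every edge of graph exactly once."""
    # Work on a copy so the caller's adjacency list is not consumed.
    adj = {u: list(vs) for u, vs in graph.items()}

    # balance[n] = out-degree minus in-degree.
    balance = defaultdict(int)
    for u, vs in adj.items():
        balance[u] += len(vs)
        for v in vs:
            balance[v] -= 1

    # Start at the node with one extra outgoing edge; if the graph is
    # balanced (an Eulerian cycle), any node with outgoing edges works.
    start = next((n for n, b in balance.items() if b == 1), next(iter(adj)))

    stack, path = [start], []
    while stack:
        node = stack[-1]
        if adj.get(node):
            # Follow an unused edge out of the current node.
            stack.append(adj[node].pop())
        else:
            # Dead end: this node is final in the remaining path.
            path.append(stack.pop())
    return path[::-1]


g = {0: [2], 1: [3], 2: [1], 3: [0, 4], 6: [3, 7], 7: [8], 8: [9], 9: [6]}
print("->".join(str(n) for n in eulerian_path(g)))
```

Each edge is pushed and popped once, so the running time is linear in the number of edges, whereas repeatedly scanning and removing from a list of unused edges is quadratic in the worst case.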

# Problem 14: 

In [16]:
def StringReconstruction(Patterns):
    '''
    get the string reconstruction from kmers
    input: 
        a list of kmers patterns
    output:
        a string text with kmer composition equal to patterns
        if multiple answers exist, return any one'''
    euler = Euler()  
    dB = dB_Graph_Kmer(Patterns) 
    path = euler.Eulerian_path(dB)
    Text = reconstruction(path)
    return Text

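def main7(inFile=None):
    '''
    read k-mers from file (skipping the first line, which holds k)
    and write the reconstructed string to file'''
    with open(inFile, 'r') as f:
        f.readline()
        # skip blank lines so an empty string is never treated as a k-mer
        patterns = [line.strip() for line in f if line.strip()]
    
    genome = StringReconstruction(patterns)
    with open('p14ans.txt', 'w') as fw:
        fw.write(genome)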

if __name__ == "__main__":
    main7(inFile='rosalind_ba3h.txt')     

## Inspection Results
- Gia Hagos: Great reuse of code here, very clean
- William Gao: again on file I/O, it should be better to use `with open('p14ans.txt', 'w') as fw:` for output as well, otherwise this is very clean!
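
Because several strings can share the same k-mer composition, a reconstruction is easiest to check by recomputing its composition rather than comparing against one expected answer. A minimal sketch of such a check follows; the helper name `has_composition` is hypothetical, and `patterns` is assumed to be a non-empty list of equal-length k-mers.

```python
from collections import Counter


def has_composition(text, patterns):
    """True if the k-mers of text match patterns, counting repeats."""
    k = len(patterns[0])
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    # Counter compares multisets, so order is ignored but multiplicity is not.
    return Counter(kmers) == Counter(patterns)


patterns = ["CTTA", "ACCA", "TACC", "GGCT", "GCTT", "TTAC"]
print(has_composition("GGCTTACCA", patterns))  # True
```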
