# Background

Networks arise everywhere in the practical world, especially in biology. Networks are prevalent in popular applications such as modeling the spread of disease, but the extent of network applications spreads far beyond popular science. Our first question asks how to computationally model a network without actually needing to render a picture of the network.

# Problem

A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by ( v, w ) (but not by ( w, v ) ). A directed loop is a directed edge of the form ( v, v ).
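
As a minimal sketch of these definitions, a digraph can be stored as a list of `(tail, head)` pairs. The node labels and the helper names `has_edge` and `directed_loops` are illustrative assumptions, not part of the problem statement.

```python
# Illustrative digraph stored as a list of (tail, head) pairs.
# The labels "v", "w", "x" are made up for this example.
edges = [("v", "w"), ("w", "x"), ("x", "x")]


def has_edge(edge_list, tail, head):
    """Return True if the directed edge (tail, head) is in the list."""
    return (tail, head) in edge_list


def directed_loops(edge_list):
    """Return every edge whose tail and head are the same node."""
    return [(tail, head) for tail, head in edge_list if tail == head]


print(has_edge(edges, "v", "w"))  # orientation matters: (v, w) is present
print(has_edge(edges, "w", "v"))  # ... but (w, v) is not
print(directed_loops(edges))
```

Because each pair is ordered, `("v", "w")` and `("w", "v")` are distinct edges, which mirrors the distinction between tail and head.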

For a collection of strings and a positive integer **k**, the overlap graph for the strings is a directed graph $O_{k}$ in which each string is represented by a node, and string **s** is connected to string **t** with a directed edge when there is a length **k** suffix of **s** that matches a length **k** prefix of **t**, as long as **s** ≠ **t**; we demand **s** ≠ **t** to prevent directed loops in the overlap graph (although directed cycles may be present).
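
The definition can be checked by hand on a small collection. The sketch below is a brute-force illustration, assuming made-up labels `a` to `e` and a hypothetical helper `overlap_edges`; it compares every ordered pair of strings.

```python
def overlap_edges(strings, k):
    """Return the edges of O_k for a {label: string} mapping (brute force)."""
    edges = []
    for label_s, s in strings.items():
        for label_t, t in strings.items():
            # Edge s -> t when the length-k suffix of s equals the
            # length-k prefix of t, excluding s == t (no directed loops).
            if s != t and s[-k:] == t[:k]:
                edges.append((label_s, label_t))
    return edges


# Toy data chosen for illustration only.
toy = {
    "a": "AAATAAA",
    "b": "AAATTTT",
    "c": "TTTTCCC",
    "d": "AAATCCC",
    "e": "GGGTGGG",
}

print(overlap_edges(toy, 3))
# [('a', 'b'), ('a', 'd'), ('b', 'c')]
```

String `e` ends and begins with `GGG`, but it contributes no edge because an edge from a string to itself is excluded.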

Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.

Return: The adjacency list corresponding to $O_{3}$. You may return edges in any order.

# Solution

In [1]:
# Function checks whether two strings have overlapping sequences
# Aligns the suffix of the 1st string with the prefix of the 2nd string
# Returns True if they match, False otherwise
# The parameter k holds the length of the prefix/suffix
def overlap_check(string1, string2, k):
    # Identical strings are excluded to prevent directed loops
    if string1 == string2:
        return False
    
    suffix = string1[-k:]
    prefix = string2[:k]
    
    return suffix == prefix

In [2]:
# Sanity check
s1 = 'AAATTTT'
s2 = 'AAATCCC'

s3 = 'AAATAAA'
s4 = 'AAATTTT'
k = 3

print(overlap_check(s1, s2, k))  # suffix 'TTT' vs prefix 'AAA'
print(overlap_check(s3, s4, k))  # suffix 'AAA' vs prefix 'AAA'

False
True


In [3]:
# BioPython package is used to parse the FASTA file
from Bio import SeqIO

filename = 'input_files/rosalind_grph.txt'
adjacency_list = []
k = 3  # overlap length required by the problem (O_3)

with open(filename, 'r') as f:
    records = SeqIO.parse(f, 'fasta')
    
    # Store all sequences in a dictionary mapping each ID to its sequence
    sequence_dictionary = {} 
    for record in records:
        sequence_dictionary[record.id] = str(record.seq)
        
    # Go through each sequence and see if its suffix overlaps with the 
    # other sequences' prefixes
    # If they do overlap, create a tuple of the two IDs and add it to the adjacency list 
    for ID1, seq1 in sequence_dictionary.items():
        for ID2, seq2 in sequence_dictionary.items():
            if overlap_check(seq1, seq2, k):
                edge_tup = (ID1, ID2)
                adjacency_list.append(edge_tup)
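
The nested loop above compares every ordered pair of sequences. As an alternative sketch (not the approach used in this notebook), the sequences can first be grouped by their length-k prefix, so each suffix needs only one dictionary lookup. The function name `overlap_edges_indexed` and the toy input are assumptions made for illustration.

```python
from collections import defaultdict


def overlap_edges_indexed(sequences, k):
    """Build the overlap-graph edges for a {ID: sequence} mapping
    using a prefix index instead of comparing all pairs."""
    # Map each length-k prefix to the IDs of the sequences that start with it.
    by_prefix = defaultdict(list)
    for seq_id, seq in sequences.items():
        by_prefix[seq[:k]].append(seq_id)

    edges = []
    for seq_id, seq in sequences.items():
        # Look up every sequence whose prefix equals this sequence's suffix.
        for other_id in by_prefix.get(seq[-k:], []):
            # Skip identical sequences, as overlap_check does.
            if sequences[other_id] != seq:
                edges.append((seq_id, other_id))
    return edges


# Toy input for illustration only.
toy_sequences = {
    "a": "AAATAAA",
    "b": "AAATTTT",
    "c": "TTTTCCC",
    "d": "AAATCCC",
    "e": "GGGTGGG",
}
print(overlap_edges_indexed(toy_sequences, 3))
# [('a', 'b'), ('a', 'd'), ('b', 'c')]
```

For an input of at most 10 kbp the pairwise loop is already fast enough, so the index matters mainly for larger collections.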

In [4]:
# Open in write mode so that re-running the cell does not append duplicate edges
with open('output_files/overlap_graph.txt', 'w') as f:
    for tup in adjacency_list:
        f.write(f'{tup[0]} {tup[1]} \n')

In [5]:
with open('output_files/overlap_graph.txt', 'r') as f:
    content = f.read()
    print(content)

Rosalind_7067 Rosalind_3715 
Rosalind_7067 Rosalind_2801 
Rosalind_6107 Rosalind_6086 
Rosalind_3222 Rosalind_1542 
Rosalind_3583 Rosalind_6107 
Rosalind_3583 Rosalind_8581 
Rosalind_3583 Rosalind_7303 
Rosalind_0208 Rosalind_4604 
Rosalind_3814 Rosalind_5155 
Rosalind_3814 Rosalind_2895 
Rosalind_0549 Rosalind_6107 
Rosalind_0549 Rosalind_3583 
Rosalind_0549 Rosalind_8581 
Rosalind_0549 Rosalind_7303 
Rosalind_6673 Rosalind_0644 
Rosalind_1624 Rosalind_6483 
Rosalind_1624 Rosalind_0385 
Rosalind_1624 Rosalind_2317 
Rosalind_1624 Rosalind_0664 
Rosalind_5748 Rosalind_7550 
Rosalind_1749 Rosalind_7933 
Rosalind_1726 Rosalind_5155 
Rosalind_1726 Rosalind_2895 
Rosalind_5618 Rosalind_9671 
Rosalind_1638 Rosalind_9671 
Rosalind_1967 Rosalind_9757 
Rosalind_1967 Rosalind_1638 
Rosalind_6086 Rosalind_5058 
Rosalind_5630 Rosalind_6528 
Rosalind_0235 Rosalind_5155 
Rosalind_0235 Rosalind_2895 
Rosalind_7803 Rosalind_4387 
Rosalind_7803 Rosalind_6080 
Rosalind_5059 Rosalind_7406 
Rosalind_5059 