# Problem 9: Identifying Unknown DNA Quickly

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
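The definition above can be checked directly on the example string. The following is a minimal sketch; the helper names `gc_content` and `reverse_complement` are illustrative and are not part of the problem statement.

```python
def gc_content(dna: str) -> float:
    """Return the percentage of symbols in dna that are 'C' or 'G'."""
    if not dna:
        raise ValueError("GC-content is undefined for an empty string")
    return (dna.count("C") + dna.count("G")) / len(dna) * 100


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of dna (swap A<->T and C<->G, then reverse)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


example = "AGCTATAG"
print(gc_content(example))                      # 37.5
print(gc_content(reverse_complement(example)))  # 37.5
```

Complementing swaps every 'C' with a 'G' and vice versa, so the combined count of the two symbols is unchanged, which is why both calls agree.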

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
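As a rough illustration of the format described above, the sketch below parses FASTA text into a mapping from label to string. The function name `parse_fasta` and the two sample records are made up for this example; they are not taken from a real Rosalind dataset.

```python
def parse_fasta(text: str) -> dict:
    """Map each label (the text after '>') to its string, joining wrapped lines."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            label = line[1:]      # a '>' line starts a new record
            records[label] = ""
        elif label is not None:
            records[label] += line  # sequence lines are concatenated
    return records


# Made-up records, for illustration only
sample = """>Rosalind_0001
ACGT
TTGA
>Rosalind_0002
GGCC
"""
print(parse_fasta(sample))
# {'Rosalind_0001': 'ACGTTTGA', 'Rosalind_0002': 'GGCC'}
```

Keeping the parser separate from the GC-content calculation makes it easy to reuse for any later problem whose input is in FASTA format.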

    Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

    Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

In [1]:
# Edit the file path to match your computer.
path = r'/Users/Sid/Downloads/rosalind_gc-2.txt'

with open(path, 'r') as file:
    content = file.read().splitlines()

# Parse FASTA: each '>' line starts a new record; the lines that follow are joined into one string.
records = []  # list of [label, DNA]
for line in content:
    if not line:
        continue  # skip blank lines
    if line.startswith('>'):
        records.append([line[1:], ''])
    else:
        records[-1][1] += line

# Compute the GC-content (as a percentage) of each record.
# Keep the value numeric so that max() compares numbers rather than formatted strings.
GC_content = []
for label, DNA in records:
    gc = (DNA.count('C') + DNA.count('G')) / len(DNA) * 100
    GC_content.append([gc, label])

Max = max(GC_content)

print(Max[1])
print("%.6f" % Max[0])

Rosalind_0399
51.541307


Rosalind_6823: 48.09976247030879
Rosalind_6572: 49.668874172185426
Rosalind_9210: 47.109207708779444
50.28121484814398


# Problem 10 


In [2]:
# Edit the file path to match your computer.
with open(r'/Users/Sid/Downloads/rosalind_subs.txt', 'r') as file:
    content = file.read().splitlines()

# Assumption: the first line holds the DNA string and the second line holds the substring to locate.
sequence, subSequence = content[0], content[1]

for i in range(len(sequence) - len(subSequence) + 1):
    # startswith(sub, i) tests for a match beginning exactly at index i, so overlapping matches are found
    if sequence.startswith(subSequence, i):
        print(i + 1)  # add 1 for Rosalind's 1-based positions

# Problem 11 

A graph whose nodes have all been labeled can be represented by an adjacency list, in which each row of the list contains the two node labels corresponding to a unique edge.

A directed graph (or digraph) is a graph containing directed edges, each of which has an orientation. That is, a directed edge is represented by an arrow instead of a line segment; the starting and ending nodes of an edge form its tail and head, respectively. The directed edge with tail v and head w is represented by (v,w) (but not by (w,v)). A directed loop is a directed edge of the form (v,v). For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph O(k) in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).
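To make the definition concrete, here is a small sketch that builds the edges of O(k) for three made-up strings. The function name `overlap_edges` and the labels `s1`, `s2`, `s3` are hypothetical, and the sketch assumes every string has length at least k.

```python
from itertools import permutations


def overlap_edges(strings: dict, k: int) -> list:
    """Return (s_label, t_label) pairs where the length-k suffix of s equals the length-k prefix of t."""
    edges = []
    # permutations(..., 2) yields every ordered pair of distinct entries
    for (s_label, s), (t_label, t) in permutations(strings.items(), 2):
        if s != t and s[-k:] == t[:k]:
            edges.append((s_label, t_label))
    return edges


# Made-up strings, for illustration only
sample = {"s1": "AAATAAA", "s2": "AAATTTT", "s3": "TTTTCCC"}
print(overlap_edges(sample, 3))
# [('s1', 's2'), ('s2', 's3')]
```

Note that `s1` ends and begins with "AAA", so it would form a directed loop with itself; requiring s≠t is what excludes that edge.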

    Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
    
    Return: The adjacency list corresponding to O(3). You may return edges in any order.

In [3]:
from Bio import SeqIO

data = "/Users/Sid/Downloads/rosalind_grph-2.txt" # input file 
n = 3 # overlap size 

name, sequence = [], [] # init lists to hold IDs, DNA sequences 
with open (data,'r') as fa:
    for seq_record  in SeqIO.parse(fa,'fasta'):
        name.append(str(seq_record.name)) # add all FASTA IDs to list
        sequence.append(str(seq_record.seq)) # add all FASTA seqs to list 

# Goal: compare every ordered pair of distinct sequences
# i indexes the sequence that supplies the suffix (the edge's tail)
for i in range(len(sequence)): 
    # j indexes the sequence that supplies the prefix (the edge's head)
    for j in range(len(sequence)): 
        # skip comparing a sequence with itself (no directed loops)
        if i != j: 
            # compare the final n bps of the ith seq to the first n bps of the jth seq
            if sequence[i][-n:] == sequence[j][:n]: 
                print(name[i], name[j])


Rosalind_1396 Rosalind_4115
Rosalind_1396 Rosalind_1273
Rosalind_1396 Rosalind_3677
Rosalind_8152 Rosalind_6957
Rosalind_3027 Rosalind_5712
Rosalind_3027 Rosalind_7167
Rosalind_3027 Rosalind_2924
Rosalind_6886 Rosalind_5712
Rosalind_6886 Rosalind_7167
Rosalind_6886 Rosalind_2924
Rosalind_4501 Rosalind_4261
Rosalind_4501 Rosalind_9615
Rosalind_4501 Rosalind_0065
Rosalind_4501 Rosalind_0053
Rosalind_3797 Rosalind_7947
Rosalind_3797 Rosalind_6195
Rosalind_3797 Rosalind_3109
Rosalind_3797 Rosalind_5006
Rosalind_0496 Rosalind_5818
Rosalind_0496 Rosalind_0249
Rosalind_5807 Rosalind_0009
Rosalind_5807 Rosalind_7828
Rosalind_3295 Rosalind_3302
Rosalind_6307 Rosalind_6757
Rosalind_6307 Rosalind_0776
Rosalind_2750 Rosalind_6602
Rosalind_2750 Rosalind_7399
Rosalind_2750 Rosalind_6801
Rosalind_5017 Rosalind_4193
Rosalind_5017 Rosalind_1512
Rosalind_5017 Rosalind_9796
Rosalind_3180 Rosalind_5017
Rosalind_3180 Rosalind_9740
Rosalind_3180 Rosalind_6653
Rosalind_3180 Rosalind_2530
Rosalind_0277 Rosali