The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
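
The definition above can be checked directly with a minimal sketch. The helper names `gc_content` and `reverse_complement` are illustrative and are not part of the notebook's own code.

```python
def gc_content(dna: str) -> float:
    """Return the percentage of symbols in `dna` that are 'C' or 'G'."""
    if not dna:
        raise ValueError("GC-content is undefined for an empty string")
    return 100 * sum(base in "GC" for base in dna) / len(dna)


def reverse_complement(dna: str) -> str:
    """Swap A<->T and C<->G, then reverse the string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


example = "AGCTATAG"
print(gc_content(example))                      # 37.5
print(gc_content(reverse_complement(example)))  # 37.5
```

Complementing swaps each 'C' with a 'G' and each 'A' with a 'T', so the number of 'C' and 'G' symbols is unchanged, and reversing does not change any counts.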

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, each string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself, which may span several lines; the next line to begin with '>' marks the label of the next string.
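
As a rough illustration of the format, the sketch below reads FASTA text line by line without any third-party library. The function name `parse_fasta` and the two-record sample are made up for this example and do not come from the Rosalind dataset.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each label (the text after '>') to its sequence, joined across lines."""
    records: dict[str, str] = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record.
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # Any other line continues the current record's sequence.
            records[label] += line
    return records


# Made-up sample: the first sequence is split over two lines.
sample = ">Rosalind_0001\nACGT\nGGCC\n>Rosalind_0002\nTTAA\n"
print(parse_fasta(sample))
# {'Rosalind_0001': 'ACGTGGCC', 'Rosalind_0002': 'TTAA'}
```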

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
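
If it helps to sanity-check labels before processing, the ID pattern described above can be expressed as a regular expression. This is an optional sketch; the name `is_rosalind_id` is hypothetical and the notebook's solution does not depend on it.

```python
import re

# "Rosalind_" followed by exactly four ASCII digits (0000 to 9999).
ROSALIND_ID = re.compile(r"Rosalind_[0-9]{4}")


def is_rosalind_id(label: str) -> bool:
    """Return True if `label` has the form Rosalind_xxxx."""
    return ROSALIND_ID.fullmatch(label) is not None


print(is_rosalind_id("Rosalind_0808"))  # True
print(is_rosalind_id("Rosalind_808"))   # False
```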

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
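
One way to compare a computed answer with an expected one under the 0.001 tolerance is sketched below. It assumes both answers consist of an ID and a number separated by whitespace; the function name `answers_match` and the expected strings in the usage lines are illustrative only.

```python
import math


def answers_match(result: str, expected: str, tolerance: float = 0.001) -> bool:
    """Compare two answers: IDs must be equal, values must agree within `tolerance`."""
    result_id, result_value = result.split()
    expected_id, expected_value = expected.split()
    # rel_tol=0.0 makes this a purely absolute-error comparison.
    return result_id == expected_id and math.isclose(
        float(result_value), float(expected_value), rel_tol=0.0, abs_tol=tolerance
    )


print(answers_match("Rosalind_0808\n60.91954022988506", "Rosalind_0808\n60.919540"))  # True
print(answers_match("Rosalind_0808\n60.91954022988506", "Rosalind_0808\n60.93"))      # False
```

A check like this could be applied to the contents of a sample output file instead of comparing the two strings for exact equality, since the printed float usually carries more digits than the reference answer.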

In [4]:
with open("./sample_input.txt", "r") as file:
    input = file.read()

with open("./sample_output.txt", "r") as file:
    output = file.read()

In [5]:
import pandas as pd

def compute_gc_content(fasta_text):
    """Return the ID and GC-content (%) of the record with the highest GC-content."""
    records = []
    # Each record starts with '>'; the chunk before the first '>' is empty.
    for record in fasta_text.split('>'):
        if not record.strip():
            continue
        # The first line is the label; the remaining lines form the sequence.
        lines = record.splitlines()
        sequence = ''.join(line.strip() for line in lines[1:])
        records.append({"Name": lines[0].strip(), "Sequence": sequence})

    # Build the DataFrame once instead of calling pd.concat inside the loop.
    df = pd.DataFrame(records, columns=["Name", "Sequence"])
    df['Count_CG'] = df['Sequence'].str.count('C') + df['Sequence'].str.count('G')
    df['Sequence_length'] = df['Sequence'].str.len()
    df['GC_Content'] = df['Count_CG'] / df['Sequence_length'] * 100

    # The first row after a descending sort has the highest GC-content.
    best = df.sort_values(by='GC_Content', ascending=False).iloc[0]
    return best['Name'] + '\n' + str(best['GC_Content'])

print(compute_gc_content(input))


Rosalind_0808
60.91954022988506


In [6]:
with open("./rosalind_gc.txt", "r") as file:
    real_input = file.read()

print(compute_gc_content(real_input))


Rosalind_9404
51.95729537366548
