# Computing GC Content
![Rosalind](logo.jpg)


## Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
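
As a quick check of the definition, the sketch below computes the GC-content of the example string and of its reverse complement. The helper names `gc_percent` and `reverse_complement` are illustrative assumptions and are not part of the problem statement.

```python
def gc_percent(dna):
    """Return the percentage of symbols in dna that are 'G' or 'C'."""
    return 100 * (dna.count('G') + dna.count('C')) / len(dna)

# Translation table that maps each DNA symbol to its complement.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(dna):
    """Complement every symbol, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

example = 'AGCTATAG'
print(gc_percent(example))                      # 37.5
print(gc_percent(reverse_complement(example)))  # 37.5
```

Complementing swaps 'G' with 'C' and 'A' with 'T', so the combined count of 'G' and 'C' is unchanged, and reversing does not change any counts.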

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the next line that begins with '>' marks the label of the following string.
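
The layout described above can be sketched with a small in-memory parser. This is a minimal illustration, assuming well-formed input; the function name `parse_fasta` and the two made-up records are hypothetical and exist only for this example.

```python
def parse_fasta(text):
    """Map each label (without the leading '>') to its full string."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                  # ignore blank lines
        if line.startswith('>'):
            label = line[1:]          # a new record starts here
            records[label] = ''
        elif label is not None:
            records[label] += line    # continuation of the current string
    return records

# Made-up two-record input; labels and strings are illustrative only.
text = """>seq_one
ACGT
TTGA
>seq_two
GGCC
"""
print(parse_fasta(text))
# {'seq_one': 'ACGTTTGA', 'seq_two': 'GGCC'}
```

The string for `seq_one` spans two lines in the input and is joined into a single value, which is the behavior the format requires.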

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

#### Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

#### Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
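
One way to read the error allowance is that a submitted value counts as correct when it differs from the expected value by no more than 0.001. The sketch below assumes this absolute-difference interpretation; `within_tolerance` is a hypothetical helper and not Rosalind's actual checker.

```python
def within_tolerance(answer, expected, tol=0.001):
    """True when answer differs from expected by at most tol."""
    return abs(answer - expected) <= tol

print(within_tolerance(60.9195, 60.919540))  # True
print(within_tolerance(60.92, 60.919540))    # True
print(within_tolerance(60.9, 60.919540))     # False
```

Under this reading, printing the GC-content with three or more correct decimal places is enough.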

### Sample Dataset
```
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
```
### Sample Output
```
Rosalind_0808
60.919540
```
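
The sample can be checked end to end with a few lines of Python. This is a sketch rather than a reference solution: it assumes labels contain no whitespace, splits the text on '>' instead of reading line by line, and uses the illustrative names `sample`, `gc_percent`, and `records`.

```python
sample = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""

def gc_percent(dna):
    """Return the percentage of symbols in dna that are 'G' or 'C'."""
    return 100 * (dna.count('G') + dna.count('C')) / len(dna)

# Everything before the first '>' is empty, so skip the first chunk.
# Each remaining chunk is a label followed by one or more sequence lines.
records = {}
for chunk in sample.split('>')[1:]:
    label, *lines = chunk.split()
    records[label] = ''.join(lines)

best = max(records, key=lambda name: gc_percent(records[name]))
print(best)                                 # Rosalind_0808
print(f'{gc_percent(records[best]):.6f}')   # 60.919540
```

Rosalind_0808 has 53 'G' or 'C' symbols out of 87, which gives the 60.919540 shown in the sample output.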

In [None]:
def ReadFASTA(filename):
    '''Extract sequence names and sequences from a FASTA text file.
    Input   : filename
    Output  : dict with sequence name as key and sequence as value
    '''
    FASTA = dict()
    seqname = None
    with open(filename) as file:
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                # Header line: start a new record, dropping the leading '>'.
                seqname = line[1:]
                FASTA[seqname] = ''
            elif line and seqname is not None:
                # Sequence line: append it to the current record.
                FASTA[seqname] += line
    return FASTA

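In [60]:
def gc_content(sequence):
    '''
    Input: nucleotide sequence as a string
    Output: GC-content as a percentage (0.0 for an empty sequence)
    '''
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty record
    return 100 * (sequence.count('G') + sequence.count('C')) / len(sequence)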

In [83]:
### Python
#!/usr/bin/env python
'''
A solution to a ROSALIND bioinformatics problem.
Problem Title: Computing GC Content
Rosalind ID: GC
Rosalind #: 005
URL: http://rosalind.info/problems/gc/
'''

# Our data is in FASTA form.
dna_dict = ReadFASTA('data/rosalind_gc.txt')

highest_GC = -1
highest_seq = ''
for seqname, dna_seq in dna_dict.items():
    gc = gc_content(dna_seq)  # compute once per sequence
    if gc > highest_GC:
        highest_seq, highest_GC = seqname, gc

# Print the solution: ID on one line, GC-content on the next.
print(highest_seq, highest_GC, sep='\n')

# Write the solution to a text file.
with open('output/005_GC.txt', 'w') as output_data:
    output_data.write(f'{highest_seq}\n{highest_GC}\n')

Rosalind_5198
52.844638949671776

