# Complementary DNA sequence

*Deoxyribonucleic acid* (DNA) is the carrier of genetic information. In its simplest form it can be viewed as a long string of characters that encodes biological features and attributes.

DNA is an organic molecule made of two strands held together by complementary nucleotide bases (adenine-thymine [A-T] and cytosine-guanine [C-G]). Each nucleotide consists of a nitrogenous base, a sugar (deoxyribose), and a phosphate group (Figure 1). The sequence of nucleotides carries genetic information, and through the two-step process of transcription and translation organisms decode it to assemble amino acids into specific proteins. This process is shared across organisms and is known as the central dogma of molecular biology.

<img src="https://upload.wikimedia.org/wikipedia/commons/e/e4/DNA_chemical_structure.svg" alt="sketch_image" width="400"/>

*Figure 1: In the DNA segment shown, the 5′ to 3′ directions are down the left strand and up the right strand. Figure created by Madeleine Price Ball.* Source: [Wikipedia](https://en.wikipedia.org/wiki/File:DNA_chemical_structure.svg#file)

The goal of this exercise is to write a function that finds the complementary DNA sequence of a portion of DNA from *Saccharomyces cerevisiae* (baker's yeast). This is an actual DNA sequence written in the 5' to 3' direction, so the complementary sequence generated by the function will read in the 3' to 5' direction. The file has extra header lines at the top, which you will need to handle using Python.

Use your knowledge of basic biology to assign the complementary DNA bases (i.e. A-T and C-G).

The reasoning behind this challenge is as follows:

- Scan each base of a long DNA sequence using a for loop.
- Use a conditional (boolean) statement to identify the nucleotide.
- Append the matching nucleotide to the output strand.
- Move on to the next base.
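The steps above can be sketched on a very short strand. This is only an illustration, not the solution developed later: the name `complement_sketch` and the `PAIRS` table are hypothetical, the chain of conditional checks is swapped for a dictionary lookup, and the strand is assumed to contain only A, T, C, and G.

```python
# Base-pairing rules (A-T and C-G) stored as a lookup table (hypothetical name)
PAIRS = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def complement_sketch(strand):
    """Hypothetical helper: scan a strand base by base and collect the matching bases."""
    matching = ''
    for base in strand:          # scan each base in order
        matching += PAIRS[base]  # look up and append the matching base
    return matching

print(complement_sketch('GATC'))  # CTAG
```

Because the sketch assumes valid input, an unknown base would raise a `KeyError` here instead of producing a friendly message.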

## Set some rules

- The function must accept a single, long `input` in `string` format consisting of DNA bases.

- The function must `return` the matching DNA strand as a single, long `string` of DNA bases.

- If the function finds an unknown base other than A, G, T, and C (e.g. U for Uracil or something else, perhaps a typo), then the function must return a message detailing the position and value of the unknown base.

>The original DNA sequence is not corrupted, but our code should handle unknown bases. When an unknown base is found, the code should simply stop and let the user know. That is all. In real life a human would need to check what is going on.

- The function should not print the string when called (it might be too long).
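The unknown-base rule can be explored on its own before writing the full function. The sketch below only covers the detection step; the helper name `first_unknown_base` is hypothetical, positions follow Python's zero-based indexing, and only uppercase A, T, C, and G are assumed to be valid.

```python
# Set of bases considered valid in this sketch (uppercase only, an assumption)
VALID_BASES = set('ATCG')

def first_unknown_base(strand):
    """Hypothetical helper: return (position, base) of the first unknown base, or None."""
    for position, base in enumerate(strand):
        if base not in VALID_BASES:
            return position, base  # stop at the first problem
    return None                    # every base was recognized

print(first_unknown_base('GATCCTCCAT'))  # None
print(first_unknown_base('GAUCCTCCAT'))  # (2, 'U')
```

The position and value returned here are the two pieces of information that the message required by the rules needs to report.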

## Dataset

The sequence was obtained from the National Center for Biotechnology Information. You can find more information about this sequence (that I picked at random) in the file header lines.

Before you start writing any code, I highly recommend opening and inspecting the file containing the DNA sequence to learn how the DNA information is organized. Some coding decisions will depend on the structure of the file.

>**Important**: Avoid removing the header lines or altering the DNA sequence by hand. Doing so defeats the purpose of using code to generate reproducible research. Header lines might seem annoying, but they provide valuable information about the dataset.



In [3]:
# Read file (the context manager closes the file automatically)
with open("../datasets/dna_sequence.txt") as f:
    dna = f.read().split('\n')


In [4]:
# Inspect data
print(dna[0:10])


['LOCUS: SCU49845', 'ACCESSION: U49845', "ORGANISM: Saccharomyces cerevisiae (baker's yeast)          ", 'AUTHORS: Roemer,T., Madden,K., Chang,J. and Snyder,M.', 'TITLE: Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein', 'JOURNAL: Genes Dev. 10 (7), 777-793 (1996)', 'PUBMED: 8846915', 'SOURCE: https://www.ncbi.nlm.nih.gov/nuccore/U49845.1?report=genbank&to=5028', 'GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG', 'ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA']


In [5]:
# Remove the 8 header lines and merge the remaining list elements into a single string
dna = ''.join(dna[8:])


In [6]:
# Let's examine the type and length of the returned variable 
print(type(dna)) # type string
print(len(dna))


<class 'str'>
5028
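The header lines were skipped with a fixed index of 8, which works for this file. As a possible variant, the sequence lines could be recognized by their content instead. The sketch below is self-contained: `lines` holds a few elements copied from the inspection output as a stand-in for the file, and `is_sequence_line` is a hypothetical helper. It assumes that no header line consists solely of the letters A, T, C, and G.

```python
# A few lines copied from the inspection output, standing in for the file contents
lines = [
    'LOCUS: SCU49845',
    'ACCESSION: U49845',
    'PUBMED: 8846915',
    'GATCCTCCATATACAACGGTATCTCCACCTCAGGTTTAGATCTCAACAACGGAACCATTGCCGACATGAG',
    'ACAGTTAGGTATCGTCGAGAGTTACAAGCTAAAACGAGCAGTAGTCAGCTCTGCATCTGAAGCCGCTGAA',
    '',  # splitting on '\n' leaves an empty element when the file ends with a newline
]

def is_sequence_line(line):
    """Hypothetical check: a non-empty line made up only of the letters A, T, C and G."""
    return line != '' and set(line) <= set('ATCG')

# Keep only the sequence lines and merge them into a single string
sequence = ''.join(line for line in lines if is_sequence_line(line))
print(sequence[:10])  # GATCCTCCAT
```

The trade-off is that a line containing an unexpected character would be dropped silently, so the fixed index may still be the safer choice when unknown bases must be reported.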


In [13]:
# Define function
def dnamatching(strand):
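    '''
    Function that finds the matching (complementary) strand of a sequence of DNA bases.
    Input: string of DNA bases (A, T, C, G)
    Output: string with the matching DNA bases. The function stops at the first
            unknown base and reports its position and value.
    Author: Andres Patrignani
    Date: 22-Feb-2019
    '''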
    matching_strand = ''
    for k,base in enumerate(strand):
        if base == 'A':
            matching_strand += 'T'
        elif base == 'T':
            matching_strand += 'A'
        elif base == 'C':
            matching_strand += 'G'
        elif base == 'G':
            matching_strand += 'C'
        else:
            # Return (rather than print) the message, as required by the rules
            return f'Unknown base {base} found in position: {k}'
        
    return matching_strand

In [14]:
# Test function using a trivial case
dnamatching('GATCCTCCAT')


'CTAGGAGGTA'

In [15]:
# Test function using a trivial case with corrupted nucleotides
dnamatching('GAUCCTCCAT')


'Unknown base U found in position: 2'


In [16]:
# Call function
dna_matching_strand = dnamatching(dna)
print(dna_matching_strand)


CTAGGAGGTATATGTTGCCATAGAGGTGGAGTCCAAATCTAGAGTTGTTGCCTTGGTAACGGCTGTACTCTGTCAATCCATAGCAGCTCTCAATGTTCGATTTTGCTCGTCATCAGTCGAGACGTAGACTTCGGCGACTTCAAGATGATTCCCACCTATTGTAGTAGGCACGTTCTGGTTCTTGGCGGTTATCTGTTGTATACATTGTATAAATCCTATATGGAGCTTTTATTATTTGGCGGTGTGACAGTAATAATATTAATCTTTGTCTTGCGTTTTTAATAGGTGATATATTAAGTTTCTGCGCTTTTTTTTTCTTGTTGCGCAGTATCTTGAAAACCGTTAAGCGCAGTGTTTATTTAAAACCGTTGAATACAAAGGAGAAGCTCGTCATGAGCTCGGGACAGAGTTCTTACATTATTATGGGTAGCATCCATACCAATTTCTATCGTAGAGGTGTTGGAGTTTCGAGGAACGGCTCTCAGCGGGAGGAAACAGCTCATTAAAAGTGAAAAGTATACTCTTGAATAAAAGAATAAGAAATGAGAGTGTAGGACATCACTAACTGTGACGTTGTCGGTGGTAGTGATCTTCTTGTCTTGTTAATGAATTATCTTTTTAATATAGAAGGAGCTTTGCTAAAGGACGAAGGTTGTAGATGCATATAGTTCTTCGTAAGTGAATGGTACTGTGTCGAAGTCTAAAGTAATAACGACTGTCGATGATATAGTGATGAGGTAGATCATCACCGGTGCGGGATACTCCGTATAGGATAGCCTTTTGTTATGGGGGGTCACCGTTCTCAGTTACTTAGCAAATGTAAAGTTTAAAGGTTACTATGGATATTTAGCAGACATCTGTTCTGTCGAGTTTATTGTATGTTAACGAAGCTGAATGGCTCGACCGAAAGCAAACTGAGATCAAGATCTTGCAAGAGTCCACTTGGAAGAAGACTGAATGATAGACTACGCTTGTGGTGCAACATAAAGTTACATTATGA
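
As a possible alternative to the base-by-base loop, Python's built-in `str.maketrans` and `str.translate` can apply the pairing rules in one call. This is only a sketch under the same rules as above: the names `COMPLEMENT_TABLE` and `complement_translate` are hypothetical, and the unknown-base check is still done with a short loop so that the position can be reported.

```python
# Translation table built from the base-pairing rules: A<->T and C<->G
COMPLEMENT_TABLE = str.maketrans('ATCG', 'TAGC')

def complement_translate(strand):
    """Hypothetical alternative: complement a strand with a single str.translate call."""
    # Stop at the first unknown base and report its position and value
    for position, base in enumerate(strand):
        if base not in 'ATCG':
            return f'Unknown base {base} found in position: {position}'
    # All bases are valid, so the whole strand can be translated at once
    return strand.translate(COMPLEMENT_TABLE)

print(complement_translate('GATCCTCCAT'))  # CTAGGAGGTA
```

For the trivial test case this gives the same result as `dnamatching`, and it avoids building the output string one character at a time.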