# Complementing a Strand of DNA

## Problem

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string $s$ is the string $s^c$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

_Given_: A DNA string $s$ of length at most 1000 bp.

_Return_: The reverse complement $s^c$ of $s$.

**Sample Dataset**

    AAAACCCGGT
    
**Sample Output**

    ACCGGGTTTT

__________________

## Solution

In the previous notebook we looked at Python's built-in 'replace' method, which can be used to replace 'T's with 'U's. But a single 'replace' call handles only one substitution at a time. We will look at a few different ways to compute complements: a simple loop, a dictionary, and chained 'replace' calls with a dummy character, paying particular attention to the tradeoff between time and space complexity. We will also see how to reverse a string with slicing, which creates a reversed copy of the original. Finally, we will explore another Biopython Seq method which directly computes the reverse complement.

We start by looking at how to compute complements with a simple loop. Note that strings differ from lists in that they are immutable: they do not support item assignment. In a list we can modify the item in the ith position with the simple assignment list[i] = new_value. Strings do not support this kind of assignment, so if we are going to use a loop we need to build a new string:

In [6]:
sample = 'AAAACCCGGT'

def complement_loop(some_string):
    complement = ''
    for base in some_string:
        if base == 'A':
            complement += 'T'
        elif base == 'C':
            complement += 'G'
        elif base == 'G':
            complement += 'C'
        elif base == 'T':
            complement += 'A'
        else:
            # Reject symbols that are not DNA bases instead of silently emitting 'A'
            raise ValueError('unexpected symbol: ' + base)
    return complement

print(complement_loop(sample))

TTTTGGGCCA


This simple approach has a time complexity of _O_(N). Note, however, that if we wanted to generalize this function to an alphabet of _k_ symbols, the time complexity would be _O_(N\*k), _k_ being the number of comparisons we make inside the loop! The space complexity of this approach is also _O_(N), since we need to create a new string containing the complement.
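
As a small aside on the remark that strings do not support item assignment, the sketch below shows the error Python raises and one common workaround: collect the pieces in a list and join them once at the end. The function name `complement_join` is made up for this illustration and is not part of the original notebook.

```python
sample = 'AAAACCCGGT'

# Strings are immutable: item assignment raises a TypeError.
try:
    sample[0] = 'T'
except TypeError as err:
    print('TypeError:', err)

def complement_join(some_string):
    """Hypothetical variant of complement_loop that builds a list, then joins it."""
    pieces = []
    for base in some_string:
        if base == 'A':
            pieces.append('T')
        elif base == 'C':
            pieces.append('G')
        elif base == 'G':
            pieces.append('C')
        elif base == 'T':
            pieces.append('A')
        else:
            # Fail loudly on anything that is not a DNA base.
            raise ValueError(f'unexpected symbol: {base!r}')
    # A single join at the end avoids growing a string piece by piece.
    return ''.join(pieces)

print(complement_join(sample))
```

The result should match the output of 'complement_loop' on the same sample.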

Another approach that also uses _O_(N) space, but that can be significantly faster when many comparisons are needed inside the loop, is to first encode the replacements in a dictionary. Since a dictionary lookup takes _O_(1) time on average, we go from _O_(N\*k) in the general case down to _O_(N), at the expense of having to store the dictionary, which takes _O_(k) space.

In [9]:
replacements = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}

def complement_dict(some_string, repl_dict):
    complement = ''
    for i in range(len(some_string)):
        complement += repl_dict[some_string[i]]
    return complement

print(complement_dict(sample, replacements))

TTTTGGGCCA
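
A closely related option, not used in the original notebook, is Python's built-in `str.translate` with a table built by `str.maketrans`. It is the same lookup-table idea as the dictionary, with the loop handled by the string method itself. The names `COMPLEMENT_TABLE` and `complement_translate` are chosen here for illustration only.

```python
# Translation table mapping each base to its complement: A->T, C->G, G->C, T->A.
COMPLEMENT_TABLE = str.maketrans('ACGT', 'TGCA')

def complement_translate(some_string):
    """Hypothetical helper: complement a DNA string in a single translate call."""
    # Characters missing from the table are left unchanged by translate.
    return some_string.translate(COMPLEMENT_TABLE)

print(complement_translate('AAAACCCGGT'))
```

Unlike the dictionary version, this one does not raise an error on unexpected symbols, so it assumes the input has already been checked.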


If time isn't a concern but you are watching out for space, then we can turn back to the 'replace' method. Recall that 'replace' takes a string and replaces each occurrence of a particular character with another character. We cannot simply chain together four 'replace' calls, since whatever we do when replacing 'A's with 'T's will be undone when we replace 'T's with 'A's, and the same goes for 'C's and 'G's! There is a clever trick to circumvent this: we can use a dummy character. Instead of replacing all 'A's with 'T's directly, we first replace them with a dummy character that does not occur in the string, then replace 'T's with 'A's, and finally turn the dummy character into 'T'. This adds two more 'replace' calls, but we do not need to store a lookup table. Note that it is not truly in-place: Python strings are immutable, so each 'replace' call returns a new string:

In [10]:
def complement_replace(some_string):
    return some_string.replace('A','+').replace('T','A').replace('+','T')\
                      .replace('C','+').replace('G','C').replace('+','G')
    
print(complement_replace(sample))

TTTTGGGCCA
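
To see concretely why the dummy character is needed, here is a short sketch of what happens when the four 'replace' calls are chained naively on the sample string. The variable names are illustrative only.

```python
sample = 'AAAACCCGGT'

step1 = sample.replace('A', 'T')   # every A becomes T
step2 = step1.replace('T', 'A')    # every T, original or freshly created, becomes A
print(step1)
print(step2)

# The full naive chain collapses the bases instead of swapping them.
naive = sample.replace('A', 'T').replace('T', 'A').replace('C', 'G').replace('G', 'C')
print(naive)
```

After the second call the string contains no 'T' at all, so the information about which positions held 'A' originally is lost. The dummy-character trick also rests on the assumption that the dummy ('+' above) never appears in the input.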


So now that we have the complement, how can we reverse it? A very _pythonic_ way of reversing a string is to use extended slicing. The basic slicing structure is string[begin:end:step], so by using a step of -1 and leaving 'begin' and 'end' empty (meaning we consider the whole string) we get a reversed copy of the string. Let's test it out on the sample string:

In [13]:
print(sample[::-1])

TGGCCCAAAA
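
For comparison, the sketch below contrasts reversing a copy with reversing in place. In-place reversal is only possible on a mutable sequence, so the string is first converted to a list. The variable names are illustrative.

```python
sample = 'AAAACCCGGT'

bases = list(sample)           # mutable list of the characters
reversed_copy = bases[::-1]    # slicing builds a new list; bases is unchanged
assert bases == list(sample)

bases.reverse()                # list.reverse() works in place and returns None

print(''.join(reversed_copy))
print(''.join(bases))
print(''.join(reversed(sample)))  # reversed() yields the characters back to front
```

All three lines should print the same reversed string as the slicing example above.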


Now that we have a way to compute the complement and we know how to reverse a string, we can simply combine the two in a single function. We will reuse the approach from 'complement_dict':

In [14]:
replacements = {'A':'T', 'C':'G', 'G':'C', 'T':'A'}

def reverse_complement_dict(some_string, repl_dict):
    complement = ''
    for i in range(len(some_string)):
        complement += repl_dict[some_string[i]]
    return complement[::-1]

print(reverse_complement_dict(sample, replacements))

ACCGGGTTTT
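
The same idea can be written more compactly by walking the string backwards and looking up each base as we go. This is a sketch, not the notebook's own function; `COMPLEMENT` and `reverse_complement_join` are names introduced here for illustration.

```python
COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def reverse_complement_join(some_string):
    """Hypothetical variant: complement each base while iterating in reverse."""
    return ''.join(COMPLEMENT[base] for base in reversed(some_string))

result = reverse_complement_join('AAAACCCGGT')
print(result)

# Sanity check: the reverse complement of the reverse complement is the original strand.
print(reverse_complement_join(result) == 'AAAACCCGGT')
```

The second check is a cheap way to catch a mistake in the lookup table, since applying the operation twice should always return the input.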


Finally, just like we did in the last notebook, we look at Biopython. Remember that Biopython has an object called 'Seq' which stores a string and an alphabet associated with it. Depending on what the string represents (DNA, RNA, protein chain), we can call different methods on it. If we associate a DNA alphabet with our string, we can simply use the 'reverse_complement' method. (The 'Bio.Alphabet' module used below was removed in Biopython 1.78; on newer versions, construct the sequence with Seq(sample) and no alphabet argument.)

In [17]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna

dna = Seq(sample, generic_dna)
print(dna.reverse_complement())

ACCGGGTTTT


Note that the 'reverse_complement' method does not work on all alphabets: it makes no sense to look at the reverse complement of a protein!

In [18]:
from Bio.Alphabet import generic_protein

prots = Seq(sample, generic_protein)
print(prots.reverse_complement())

ValueError: Proteins do not have complements!