
## Define DNA and DNA-complement

In [1]:
dna = "TACATCTTTCGATCGATCGGACAATTTGTCGGTGACTCGATCTAACAT"

In [2]:
dna_comp = "ATGTAGAAAGCTAGCTAGCCTGTTAAACAGCCACTGAGCTAGATTGTA"

## Conversion Reference

In [3]:
ref = {"A": "T", "C": "G", "G": "C", "T": "A"}

## Define Functions

In [4]:
import re
import numpy as np

In [5]:
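def dna_complement1(dna: str):
    # NumPy approach: split the string into a character array, then fill
    # the output with boolean masks, one mask per base.
    x = np.array(list(dna))
    # Note: empty_like leaves entries uninitialized for characters outside A/C/G/T.
    y = np.empty_like(x)
    y[x == "A"] = "T"
    y[x == "C"] = "G"
    y[x == "G"] = "C"
    y[x == "T"] = "A"
    return ''.join(y.tolist())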

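def dna_complement2(dna: str):
    # Two-pass regex substitution: bases -> digit placeholders -> complements.
    # The placeholders keep a freshly written base from being replaced again.
    y = dna
    for i, k in enumerate(ref.keys()):
        y = re.sub(k, str(i), y)
    for i, v in enumerate(ref.values()):
        y = re.sub(str(i), v, y)
    return y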

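def dna_complement3(dna: str):
    # Like method 2, but only "A" and "C" need digit placeholders: by the time
    # "G" -> "C" and "T" -> "A" run, the original "C"s and "A"s are already digits.
    y = dna
    for i, (k, v) in enumerate(ref.items()):
        if k in ["A", "C"]:
            y = re.sub(k, str(i), y)
        else:
            y = re.sub(k, v, y)
    # Resolve the placeholders: "0" -> ref["A"], "1" -> ref["C"].
    for i, k in enumerate(["A", "C"]):
        y = re.sub(str(i), ref[k], y)
    return y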

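def dna_complement4(dna: str):
    # Dictionary lookup per base via a generator expression; input is upper-cased.
    return ''.join(ref.get(e) for e in dna.upper())

def dna_complement5(dna: str):
    # Same lookup via map(), materialized as a list before joining.
    return ''.join(list(map(lambda e: ref[e], dna.upper())))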
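
For comparison, the same character-by-character mapping can probably also be expressed with Python's built-in `str.translate`. The sketch below is not one of the notebook's five methods and was not timed here; the names `COMPLEMENT_TABLE` and `dna_complement_translate` are hypothetical, and `ref` and `dna` are repeated only to keep the block self-contained.

```python
# Hypothetical sixth variant (not part of the original notebook).
ref = {"A": "T", "C": "G", "G": "C", "T": "A"}

# str.maketrans accepts a dict whose keys are single characters.
COMPLEMENT_TABLE = str.maketrans(ref)

def dna_complement_translate(dna: str) -> str:
    # Characters missing from the table (e.g. "N") pass through unchanged.
    return dna.upper().translate(COMPLEMENT_TABLE)

dna = "TACATCTTTCGATCGATCGGACAATTTGTCGGTGACTCGATCTAACAT"
print(dna_complement_translate(dna))
```

Unlike `dna_complement5`, which raises `KeyError` on an unexpected character, this sketch leaves such characters untouched, so it may need an explicit validation step if strict input checking matters.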

## Check Performance over Repeated Function Calls

In [6]:
%timeit dna_complement1(dna)

The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 27.7 µs per loop


In [7]:
%timeit dna_complement2(dna)

The slowest run took 17.99 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 5: 20 µs per loop


In [8]:
%timeit dna_complement3(dna)

100000 loops, best of 5: 15.4 µs per loop


In [9]:
%timeit dna_complement4(dna)

100000 loops, best of 5: 7.53 µs per loop


In [10]:
%timeit dna_complement5(dna)

100000 loops, best of 5: 6.55 µs per loop
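
The `%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a roughly equivalent "best of 5" measurement can be sketched with the standard-library `timeit` module. The helper name `best_time_us` and its default loop counts are assumptions for illustration, and the numbers it prints will differ from the ones above depending on the machine.

```python
import timeit

ref = {"A": "T", "C": "G", "G": "C", "T": "A"}
dna = "TACATCTTTCGATCGATCGGACAATTTGTCGGTGACTCGATCTAACAT"

def dna_complement4(dna: str):
    # Copied from the notebook so the block runs on its own.
    return ''.join(ref.get(e) for e in dna.upper())

def best_time_us(func, arg, number=10_000, repeat=5):
    # Hypothetical helper: timeit.repeat returns one total time in seconds
    # per repeat; take the fastest repeat and convert to microseconds per call.
    totals = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(totals) / number * 1e6

print(f"{best_time_us(dna_complement4, dna):.2f} us per loop (best of 5)")
```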


## Check Performance over a Single Function Call

In [11]:
%time dna_complement1(dna)

CPU times: user 97 µs, sys: 2 µs, total: 99 µs
Wall time: 103 µs


'ATGTAGAAAGCTAGCTAGCCTGTTAAACAGCCACTGAGCTAGATTGTA'

In [12]:
%time dna_complement2(dna)

CPU times: user 79 µs, sys: 0 ns, total: 79 µs
Wall time: 84.4 µs


'ATGTAGAAAGCTAGCTAGCCTGTTAAACAGCCACTGAGCTAGATTGTA'

In [13]:
%time dna_complement3(dna)

CPU times: user 51 µs, sys: 1e+03 ns, total: 52 µs
Wall time: 55.6 µs


'ATGTAGAAAGCTAGCTAGCCTGTTAAACAGCCACTGAGCTAGATTGTA'

In [14]:
%time dna_complement4(dna)

CPU times: user 23 µs, sys: 0 ns, total: 23 µs
Wall time: 26.9 µs


'ATGTAGAAAGCTAGCTAGCCTGTTAAACAGCCACTGAGCTAGATTGTA'

In [15]:
%time dna_complement5(dna)

CPU times: user 21 µs, sys: 1e+03 ns, total: 22 µs
Wall time: 25.5 µs


'ATGTAGAAAGCTAGCTAGCCTGTTAAACAGCCACTGAGCTAGATTGTA'

## Check Correctness

In [16]:
assert dna_comp == dna_complement1(dna), "Method-1 Mismatch Error"
assert dna_comp == dna_complement2(dna), "Method-2 Mismatch Error"
assert dna_comp == dna_complement3(dna), "Method-3 Mismatch Error"
assert dna_comp == dna_complement4(dna), "Method-4 Mismatch Error"
assert dna_comp == dna_complement5(dna), "Method-5 Mismatch Error"
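
The assertions above check a single sequence. As a complementary sketch, the check can be extended to random sequences by relying on the fact that complementing twice returns the original strand. `check_complement` and the stand-in `dna_complement` are hypothetical names; any of the five methods could be passed in instead.

```python
import random

ref = {"A": "T", "C": "G", "G": "C", "T": "A"}

def dna_complement(dna: str) -> str:
    # Stand-in using a plain dictionary lookup, in the spirit of methods 4 and 5.
    return "".join(ref[e] for e in dna.upper())

def check_complement(func, trials=100, length=48, seed=0):
    # Hypothetical helper: property-style check on random A/C/G/T sequences.
    rng = random.Random(seed)
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        comp = func(seq)
        assert len(comp) == len(seq), "Length Mismatch Error"
        # Each base must map to its partner in ref.
        assert all(ref[a] == b for a, b in zip(seq, comp)), "Base Mismatch Error"
        # Complementing twice must give back the original sequence.
        assert func(comp) == seq, "Round-trip Mismatch Error"
    return True

print(check_complement(dna_complement))
```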