# Counting DNA Nucleotides

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG."

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

In [1]:
def counting_DNA_Nucleotides(s):
    s = s.upper()
    A = s.count('A')
    C = s.count('C')
    G = s.count('G')
    T = s.count('T')
    
    return A, C, G, T

In [2]:
s = "ATGCTTCAGAAAGGTCTTACG"
counting_DNA_Nucleotides(s)

(6, 4, 5, 6)

# Transcribing DNA into RNA

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

In [3]:
def transcribe_DNA_to_RNA(t):
    t = t.upper()
    t = t.replace('T', 'U')
    return t

In [4]:
t = 'GATGGAACTTGACTACGTAAATT'
transcribe_DNA_to_RNA(t)

'GAUGGAACUUGACUACGUAAAUU'

# Complementing a Strand of DNA

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string sc formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement sc of s.

In [5]:
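def complement_DNA(s):
    s = s.upper()
    # Map each base to its complement
    complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}
    # Reverse s, then complement each symbol to get the reverse complement
    sc = ''.join(complement[base] for base in reversed(s))
    return sc

In [6]:
s = 'AAAACCCGGT'
complement_DNA(s)

'ACCGGGTTTT'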

# Mendel's First Law

Probability is the mathematical study of randomly occurring phenomena. We will model such a phenomenon with a random variable, which is simply a variable that can take a number of different distinct outcomes depending on the result of an underlying random process.

For example, say that we have a bag containing 3 red balls and 2 blue balls. If we let X represent the random variable corresponding to the color of a drawn ball, then the probability of each of the two outcomes is given by Pr(X=red)=3/5 and Pr(X=blue)=2/5.

Random variables can be combined to yield new random variables. Returning to the ball example, let Y model the color of a second ball drawn from the bag (without replacing the first ball). The probability of Y being red depends on whether the first ball was red or blue. To represent all outcomes of X and Y, we therefore use a probability tree diagram. This branching diagram represents all possible individual probabilities for X and Y, with outcomes at the endpoints ("leaves") of the tree. The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree.

An event is simply a collection of outcomes. Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes. For our colored ball example, let A be the event "Y is blue." Pr(A) is equal to the sum of the probabilities of two different outcomes: Pr(X=blue and Y=blue)+Pr(X=red and Y=blue), or 1/10+3/10=2/5.
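The tree calculation can be checked by walking both branches of the first draw and summing the leaves in which the second ball is blue. This is only a minimal sketch with exact fractions; the helper name `second_draw_blue` and its parameters are illustrative and not part of the problem statement.

```python
from fractions import Fraction

def second_draw_blue(red, blue):
    """Return Pr(Y = blue) for two draws without replacement (illustrative helper)."""
    total = red + blue
    prob = Fraction(0)
    # Walk both branches for the first draw X
    for first, first_count in (("red", red), ("blue", blue)):
        p_first = Fraction(first_count, total)
        # One ball is gone; count the blue balls that remain
        blue_left = blue - 1 if first == "blue" else blue
        p_second_blue = Fraction(blue_left, total - 1)
        # The probability of a leaf is the product along its path
        prob += p_first * p_second_blue
    return prob

pr_a = second_draw_blue(3, 2)
```

With 3 red and 2 blue balls, `pr_a` is the exact fraction 2/5, matching the sum of the two leaves above.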

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

In [7]:
import scipy.special

def dominate_allele(k, m, n):
    # Calculate total number of organisms in the population:
    total_pop = k + m + n
    # Calculate the number of combos that could be made (valid or not):
    total_comb = scipy.special.comb(total_pop, 2)
    # Count the pairs, weighting each kind of pairing by the probability
    # that its offspring carries a dominant allele:
    valid_comb = scipy.special.comb(k, 2) + k*m + k*n + 0.75*scipy.special.comb(m, 2) + 0.5*m*n + 0.0
    # Calculate the probability of valid combos
    prob = valid_comb / total_comb
    return prob   

In [8]:
dominate_allele(2,2,2)

0.7833333333333333
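As a cross-check on the closed-form count, the same probability can be obtained by enumerating every possible pair of organisms. The sketch below is an assumption-laden alternative, not the notebook's method: the function name `dominant_probability_exact` and the genotype labels are made up for illustration, and it assumes the population holds at least two organisms.

```python
from fractions import Fraction
from itertools import combinations

def dominant_probability_exact(k, m, n):
    """Brute-force check: average over all unordered pairs of organisms."""
    # Label each organism by its genotype
    population = ["AA"] * k + ["Aa"] * m + ["aa"] * n
    pairs = list(combinations(population, 2))
    total = Fraction(0)
    for g1, g2 in pairs:
        # Chance that each parent passes on the recessive allele 'a'
        p1 = Fraction(g1.count("a"), 2)
        p2 = Fraction(g2.count("a"), 2)
        # The offspring shows the dominant phenotype unless both pass on 'a'
        total += 1 - p1 * p2
    return total / len(pairs)

exact = dominant_probability_exact(2, 2, 2)
```

For k = m = n = 2 this gives the exact fraction 47/60, which agrees with the floating-point result above.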

# Rabbits and Recurrence Relations

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence $(\pi, -\sqrt{2}, 0, \pi)$ and the infinite sequence of odd numbers $(1, 3, 5, 7, 9, \ldots)$. We use the notation $a_n$ to represent the $n$-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if $F_n$ represents the number of rabbit pairs alive after the $n$-th month, then we obtain the Fibonacci sequence having terms $F_n$ that are defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ (with $F_1 = F_2 = 1$ to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.

When finding the n-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of n. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

Given: Positive integers n≤40 and k≤5.

Return: The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

https://stackoverflow.com/questions/25111106/recursive-fibonacci-algorithm-with-variant

In [9]:
def fibonacci(n, k):
    # This code is 1-based, so months may not be 0 or less
    if n <= 0:
        raise ValueError("n must be greater than 0")
    # In months 1 and 2 there is only 1 pair of rabbits
    if n == 1 or n == 2:
        return 1
    # Calculate the total population using the generalized recurrence
    # F(n) = F(n-1) + k*F(n-2) (with F(1) = F(2) = 1 to initiate the sequence)
    return fibonacci(n-1, k) + k * fibonacci(n-2, k)

In [10]:
fibonacci(5,3)

19
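The recursive version recomputes the same terms many times, which becomes slow as n approaches 40. A bottom-up loop in the spirit of the dynamic programming described above keeps only the last two terms. This is a sketch under the same 1-based convention; the name `rabbit_pairs` is illustrative.

```python
def rabbit_pairs(n, k):
    """Iterative form of F(n) = F(n-1) + k * F(n-2), with F(1) = F(2) = 1."""
    if n <= 0:
        raise ValueError("n must be greater than 0")
    previous, current = 1, 1  # F(1) and F(2)
    # Each pass advances the pair (F(i-1), F(i)) by one month
    for _ in range(n - 2):
        previous, current = current, current + k * previous
    return current

pairs_after_5 = rabbit_pairs(5, 3)
```

`rabbit_pairs(5, 3)` gives 19, the same value as the recursive call, and the loop runs in time linear in n.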

# Computing GC Content

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
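Both claims in this paragraph are easy to check on the example string. The sketch below uses two small helpers, `gc_percent` and `reverse_complement`, whose names are assumptions made for this illustration only.

```python
def gc_percent(seq):
    """GC-content of seq as a percentage."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def reverse_complement(seq):
    """Reverse seq, then swap A<->T and C<->G."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq.upper()))

example = "AGCTATAG"
gc_forward = gc_percent(example)
gc_reverse = gc_percent(reverse_complement(example))
```

Both `gc_forward` and `gc_reverse` come out to 37.5, because complementing swaps every 'C' with a 'G' and so leaves their combined count unchanged.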

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
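To make the format concrete, here is a minimal sketch that parses FASTA text held in a string rather than a file. The helper name `parse_fasta` and the two records are made up for illustration; they are not taken from a Rosalind dataset.

```python
def parse_fasta(text):
    """Map each FASTA label to its sequence (illustrative helper)."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' line starts a new record; the rest of the line is its label
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # Sequence lines are concatenated until the next '>' line
            records[label] += line
    return records

# Made-up records, for illustration only
sample = ">Rosalind_0001\nACGT\nGGCC\n>Rosalind_0002\nTTAA\n"
records = parse_fasta(sample)
```

Here `records` maps `Rosalind_0001` to `ACGTGGCC` and `Rosalind_0002` to `TTAA`: the sequence split across two lines is joined back into one string.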

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

In [2]:
# Split the file contents at '>' to get a list of strings representing entries
def read_FASTA_strings(filename):
    with open(filename) as file:
        return file.read().split('>')[1:]

# Partition the strings to separate the first line from the rest
def read_FASTA_entries(filename):
    return [seq.partition('\n') for seq in read_FASTA_strings(filename)]

# Remove the newlines from the sequence data
def read_FASTA_sequences(filename):
    return [(info, seq.replace('\n', ''))
            for info, ignore, seq in  # ignore holds the '\n' separator and is unused
            read_FASTA_entries(filename)]

# Create a sequence dictionary from sequence data
def make_indexed_sequences_dictionary(filename):
    return {info: seq for info, seq in read_FASTA_sequences(filename)}

In [11]:
def gc_content(base_sequence):
    # Make base_sequence an upper-case string for uniform processing
    seq = str(base_sequence).upper()
    # Count G and C and return the GC-content as a percentage
    g = float(seq.count('G'))
    c = float(seq.count('C'))
    length = float(len(seq))
    return (g + c) / length * 100

def gc_content_dict(fasta_dict):
    gc_dict = {}
    # For each sequence in fasta_dict, find the GC-content and store it in a new dict
    # (gc_content already returns a percentage, so no further scaling is needed)
    for key in fasta_dict.keys():
        gc_dict[key] = gc_content(fasta_dict[key])
    return gc_dict

def key_with_max_val(d):
    # Return the key with the largest value (the first such key on ties)
    return max(d, key=d.get)

def highest_gc_content(filename):
    # Create a sequence dictionary from sequence data in filename
    fasta_dict = make_indexed_sequences_dictionary(filename)
    # Create a gc content dictionary
    gc_dict = gc_content_dict(fasta_dict)
    # find the sequence with the largest gc content
    max_key = key_with_max_val(gc_dict)
    # return key name and gc_content value
    return max_key, gc_dict[max_key]

In [12]:
highest_gc_content('test_gc.fasta')

('Rosalind_0808', 60.91954022988506)

# Translating RNA into Protein

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).

Return: The protein string encoded by s.

In [21]:
# Translating RNA into Protein
RNA_codon_table = {
    "UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",
    "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S",
    "UAU":"Y", "UAC":"Y", "UAA":"Stop", "UAG":"Stop",
    "UGU":"C", "UGC":"C", "UGA":"Stop", "UGG":"W",
    "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
    "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
    "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
    "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
    "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
    "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
    "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
    "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
    "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
    "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
    "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
    "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G",
}

def translate_RNA_codon(codon):
    return RNA_codon_table[codon]

def aa_generator(rnaseq):
    # Return a generator object that produces an amino acid by translating
    # three characters of rnaseq at a time; a trailing partial codon is skipped
    return (translate_RNA_codon(rnaseq[n:n+3]) for n in range(0, len(rnaseq) - 2, 3))

def translate(rnaseq):
    rnaseq = rnaseq.upper()
    seq = ''
    # Translate rnaseq codon by codon, stopping at the first stop codon
    for aa in aa_generator(rnaseq):
        if aa == 'Stop':
            break
        seq += aa
    return seq

In [23]:
test_seq_2 = 'AUGCGUGGCCGCCACUUCUACAAUGCACGCAGUUCAAGCACCAAAUCUGUAGGCUUAGUACUUCCUAAAGUCUAUCCUUGUCUACGCGGUCGGGUUUUGGACACACGUUGUGAUUUCUACACUAGGAAACAAAAGUUCUUUGACCAAAGUCUGAGAAAUGUGAAGUGUCUAAUCCUUAAAUAUUUCGCUGAAUUUUUUGCGUGGAGACGAACCGUAUCCGGGCUGGCCUGGGUGUAUAUAAGGGGUCGCCUGGCCGGCACUCUUCGUUUGGGGUACAGUCUUCUAGGAACGGACUGCACGCAGGCUCGACGGCGGCAGGUAAAUUAUCCCGCGUCAGUGUUUUGUCGGCGCGUUAAUGUUUCUUAUGGUUGUCCCCCAGAUUACCUGCCAACUGCGACUGAUGAUAUGCCGGCCGACCCUGUACCGACGGUUCAUUGCACCUCUUUACUGACCCUCCCUCAAAAAAAGGCAAACGCAUCCUAUGUGCCCGGUGGUCGACCGGACGAACACCUAGAAGCACCUGAGAAAAUCAAGGUUUCGUGUGCAGCGGUGAUACCCACCUGGUCAGUAAACAAGACCGCCAACACUCGACCGUGGUCGGAACAACUGGCCUUAUUACCCUUCAUCUCCGUAAAGAAAGGAUGUCUUCAGUGUCGUAGGCGGGAUCGUAGGAUCAGAGGGACCCACCCUCUUAAUAUUGUCGCAGUAGAGGUUGAUCCGUUUCUACAGCGUAUCAAGACUUUAUUAAACCGGCUUGGCACAGGGUUAGGGUUUCUCACCCAGAGCGUCUAUCGACCACACCAUGGCGAUUGUUUCGAUCCCGGUUACGUUUAUUGUGUUUGUUGGAGCGAAUCAGGACUCCAUUUUGUCCCGAACCGCGGCACGGACCAGGGGUGUAAUCCUUCAGAGGAACUCGGUGCCUCCGCUGGAAUGUCGCCCCGGCGGUUAAUGACCCCGCUCCUAGCAUCUCGCGUCCUGUGUGGUUAUUACGAUUUCCCGGCCGAAAAUGGGUUUCCCCCUUCUACCGACGACUGGGACCUCACAUGGUCAUUGUUGUUCAAGUAUACUCAUGAUGGGAGAUCCAAGGCAGUAGUCAACCUGAGAUUGGCACGUCUUUACGGGUGUAUGCAACGACCCCAGUUAACUCCCCAUCUGGUAGCGAUACUAUAUCCUUUUCAAACUGCCAGGGGCCUUAGACGUAACGUUGGCGGUACGGUUCCACAACAUAAGUCUAUACUAACGGGUUGUCCCCGCAAGAGUUCCGAAUAUAGAGCAGGCAGAGAAUUUGACGUCACAGCUUUCAGCCUAGUAUCUCAAGGGAUAGCGAAAUUUCGGCCUUACAACCAACCAGUUUAUUCACUGAGUGCGCAGUCUUAUCCUUACGCGUCCCGAUGGAGGCAGCUUGGCUAUCGAUCUCGCGUAUUACUCGCGAAAGGUACCCCGACUUUUAACACUGCAAGCACAUCCGACCCUUCCUUGAGUGCUCUGCACUAUCUAUAUGUGGCCCUACAGGGGCAUCCGAUGGUUCCUAUUGCGGAUUGUCUGGCUACAAAGCGACGGCCUAGGCGUCCUUCCCAUCUAUAUCGCUAUGUUUUAAUAGUGGCGCUCAGGUCUAUACGGGCGGGCGAUACUACUUCAAUGAAUUGUGAUAAGCCACGACGAACAAUCUGGAUUGUUAUGUUCGAGACCCAGGAGAGCGGUGCAGGGUAUAACAGUGAUAAUAUCUGCGAAACGGAAAAGCCGGUAUACUGGCACGGCCACACCCUUAGCCGUGUGGCCAAGAAAUUAUCUAGUGCCGGUAGACGAUGCUCAGCGGAUCUGAUAGCUAACCUAGCGUUUAACACCUUGCUCGAGGGCACUCCUGCUUGUCCGACUUGCGAGAGGUACGGCAUAUGCACGAGUAUUAUACGUGGUUGGUCGGCUAGCAUCGCGAUCUUCUCUCUUUAUGCAAGCGCCUUGCGCAACAUAUUGUCAUUAGCGAGUCUA
AGCGCGGACCGCGAUAAUCGGAGAUCGGACUCCCCGACGGAUGCCUUUUUAGCGACCUCCACCGCCGUGAGCAGCUUUGGGUCAAUCAUGAGGAGAUGCGUUUUUAUGACUAUAUGGCCCAGCCAGCCGACUAUAGGUUCGAAGAGCCGUUGUACGUGCGGCCACGACUCACCCACUAUAGGCUCGUGGCACACUAUAUCGCUUAGUGCAGUCUUUACGCCUGACUCGGGCCGUGUCGAAGUAUUGUGGGUCCCCACUGAGUUAGCCGGAAUGUAUGACGGGCGGACGUUCUAUCUCGGCUGGAGCGAGUGCGUUCGUUGCGUCCCUUCCCUGCAGAGUACUAUGGGCAGUACAGGCGGUUCGCGGAAAGUUGAGCGCUCUCAUCUAGAGGGCAAAGCUUUCGGCCGGGGCGCCCUACACAUUCCCACCCCAUUUGUAGGCUGUGCCUGUCUAUGCGCGCCCGGGCGGGUUCGGCCGACGCAGUCGAUCGGUGCUGGUGAAUUAUACGAUGCUGUCCCGUCGGGAUCGCUGGAUAGAAGACACUGGGGAUACACCCUUGCGGAUACAUGCUGCACUCAAUAUCUGCAAGACGUCACACUUCCACCGGUUUGUUAUGCUCAUUUAAGUUAUGCUGGGUGUCGGCCAACCACCCUCAUAGAGGAGACAGGAUCCAGGAAUCGCUUCGAAUGCGUCUUAUUCCAUGCGGACUUAGUACCAUCGAUGGGCAACUAUGGGGACCUAGGGGACUCGAAGCGAUGUUUUGACUCGGGGUAUUUGCAUACAUCGCCCGGCAGAGCGCGAAUCCCAAGGGUAUGUCCGCACCGGCAGCAGUACGAUCGUUCUCGGGGAGCGAGAGAUUUGCCUCUGCUCUGGUAUCUUGUCUACUUUACCGCAGUGCUAGAAUCGGGAGCAACCGCCCCGCGAUGGGGGGCAGAGCCUAGUGCUGCUCUAUGUCCCCUAGGAAGUGUCCCACGGGUAGCAAGAGCAGACCCGCCCACAAAAGGCACAUACUACUCCGUCACCUCUGAUACUUGGAACGACGCUUGGUUCAAGGCAGAAGCCCGACUGACGCUGGUAGCCGUGGCGUCCAAGAAGGACGUUCCGACCCUACAAAAGCCCGGUCAGAAUAAGGGUCAUCUUAAACGAAUCGCUGAAUACAACUUAAGGCCUGGGGCUAGGGGUGCAUCUACUUCGCUAGUGACGUUAUCUGUGGCCAGCGCUGGGACCUUGUUUACAGGUCAUCGGGUAAUUUUCCUAACAAUAGAGUUGUGUAACAAAGCACAGGUAAUGCCCGCUAUCCGUCACCUUCCAAGUGAGGUACCCGGUUUAAAAACCCCGGCCACAUCAUGUCGCUCGGUGCACAUCGGUCCUUGUCGAAAGAGUCUGUUUGGAUACGAGGCAGUAUGGAAAGGGGCGGUAGCGAAAUCUUUAUGGACACAGCCCACUUAUAAUUUGGAUAACCAUGAGCAUCGAAUCUGUCCUAGCGCGCGACGAGGCCGCCCCAAGGGCACAUCGCAACCAUCCCGCCGUAUGCUGUAUGUCCCAGCUUGCCACCCCCGUCCAUCACACUACCAUGUUCGUAUGCCACGCACGUGUAUUGCGGCAGCACAAGAAUGUGUGACGAGGAUUUACUUUCAAAUGGAAUAUGUUGACGAAGGGGGCGCCCUAGAGUGUUUAGUCGUCUUUCGACGCUGGCGAAGCUAUCGAGCUAAAUCCCAAGAAAUGAAAGCGAUGGCGACCUGCAAUCUUGCGCUCAAGAAGCUUAUUGUACUAGCGACUACGAUCUGGGGGGCCUACUCAAUGCCGUUAUUUGAAUACGAGCAAGCGGUCGGGCGUUAUGUCAGCUGUAUCCGGGGCGCACUAAUAAGGAGUGGGCCAUUGUGCUGUUUAAUGUAUCGUACUUCCGGGUAUCGGACCAUAGCGUCCGGAUACGUGCACCAGAAAGCCCUUGUAUCCAGCUUGGUCCACGUGUUUGAAAUCAGUGAGACGAAUAUAGUAUU
GGUCCCAAAGCUUGGUAAUCGACGCCCUCCUGUUUGGGUACUAGAGUUAACGUGGACCUUCAUUGCCAAACCCAUAUGCUUCGUUAUUAGGGCGCGCAGAACAUCAGACCGAAAAAUAGAACUUUUGGUAGAGCCAUUUCCUAAAACCGGAGAAGAACGUUUUUGGUUGGAUAAAUACAGUUCAGAAGAUAUUCUCCUGUACCACACCCCCCCCUUGGCGUCUUUUCCGAGUUGGACCGCAGGUCGGCAAUUACCCCUUGCGGAGGCGUUGACUCUAGAUUCGUCAAGGUUGAUGGUACCCCUUCACCCGCUGAUGUGCUUUCCCGCGACAGAGAGAGAUGUUAAAUGUGAAGGGAAUGUCUACUUCCAAUUUUGUGAAAUGGGCUGUCUCGCUCGAUCCACUGGCUGGACUUUUUUGGGAGAACGAUCGGACAGCGAAAGCACCAAAUCCUACUCCCCGGUUAUAAACGUUGGAGGUGCUCCCGCAUUCUGCCCACACCCGCCAGGAAAGCGCAUUGGGAGUCCUCAGUCGGAGAGUUACACGUUUGGGCCGCCCCUAACGUACUCUUCGCAUCGAGAUAUGAUCAAUGGGGUCCUGAUACUAAGAUCCGUUUCUAGGAGGGAUCGUCGCCCAGAAUUCCACGCGGUUGCCAUAUAUAUCGUUAUUCGAGGCAGCUCUCGGGAGGACCAUUGGAUACAUCACGUUGUCGCUGGCCUGUACUAUGAGACACGGUCCUUGUACACUUGGUAUACCUCGCAAUCUGUGCAUAGUCUCAGUGAUUCAGAGAGUGAAGGCCAAUUGGUACGCCAUAACGCGGCUGUCUCUUCUAAUUAUUUCAGGAGCCUGUUAGGACUGUGCAAAGCCAUUAGCGUACGACGGACAGCAGUUAACCUAUGCUACACCGAAUCUACGUCGGGGACGGUUGCGGCAUCUGACUCUGCAUAUAGCGUGGCUUAUUGUCUACGAACAGGCGAGACAUUCACAAUUCUGCCGAAGCCGCGGAGUACCCUCGACGAGCCAGCCACGACCCUAGCCCAUUCCGAGCGAGCGCAAUUCCUUGGCGAUGGGGGCGUACAACGCGAAAUCGAAUGCCCCGGACUAAUACUCAGAACCCAUCAGGUAGCGCAAGGAGACGCGACAGCAUGCUCGGUGACUGCUUCAUCGGGGCUCGAAUACAACAGAGUCGCCCCCCUGGCAGUAGCCGCGUUAGUUGCGGUCCACGCGCUUAGAAAGGUCUUCCCAGAAUGGAAGCUGACCCCCUUUAGCAUCAGACCGCGUGCAAAGCCGAUAGUAACAAUUAUUCCGGUGUCGACCAGUUCUACGUGCAGGACUGAGUCAUUGCUAGUUGGUACCGGCUGGCAGGACAAGGUGCGUAGUUCGUUGUUGGGACGCCAUACCGAUUCAGUGCGGGGCUGUCAGCCACAAUAUGUGCAUAACACCGACCAGUUCUCUCCGCCGCAUCGCGAGACACCUGUCUCGAUAAGCAGUUUAGCGACCACUACGCCCCGUAGGCUUGACGUAGAUACCUUGUUGACAGUCAUUGUUAGGAGCCCCUCACCCUAUCGGGUAACUUUGUGUGGACUCUCACCUGCUCAAGACGAUGACAUCACGUUUCUCUCAACAAAGGGUAUCCGGUUAUGGAUAAGUGCUCGAAACGCCGAACGGUUGGUAGUGGGGUGGAAUGCUUCGCGGCAACCAGCCGACCUCCUUGCUGUAUUCUGUAUUUCGGGUAUGCCAUUGUCCUCUUGCCUGUUCCUGCAAGGCUCAUUACCAGCCGGGGCGGGUUCGGAGCGAUUUUACUCGACCAAAGCACAACUGACUAACACUCCUGUAGAUCCUGUCAAUUGUUCGCCUAAUGGGACGAUCAGGCUCGUCGCCGUUUUGGGGUCGAGAUGCGUAGGUCCUGUUACAGAUGGAGCUUCGCGUCAGCUUCGUUUCUACCCUACAGGGGGGCUGCUCCAGCCACGUACAUCCUAUCCUGCCAUGCGCGUUCCAAGGA
UACAGCAGUCAGUGAUGCCUUGGGGUGAAACCUUUCGUAACAAUACAAUAAGGUCCGCAAAUCACGUCCACCAGCGUGUGCCCAGUCGAACCCAAGUCUAUCGGCUCGGGGAGACCAGGCUGGGUGCCCUAUGCGUUAGGUCCAGACAGCUGUGCACGUGCGGGUCUAAUGGGGAUCAGCAACCCAGGAGCUGGGUCCCCAAGCAGGGCUUUCGGACACAAGUACCUAUGUCUGAUUUCACCCGUAUAUAUUUUGGUCUCCCAGAAUCUUACAUGCAGAAACCCCGGGCUUCAUCUCAAUGUGCAUAUAUCGUCGCUUUAUGUGAAGGACGUAAACAGAUCGGCACGGAGUACUCUGCACGGAUAGUCGCGAAACCGUCGUCAAACCUGGUCCGAUAUAGGUACAAACGGGAAAUGGAGACGCCGGAAAGAAAAUACAAGGAGAAAUUCGGUCUUGCAGAACUAACGGUGUUAUUAAAGACGAAUACGCUUGCCUCGACCAAAAUGAAGAGCCCCUUCUGGAGGCCGAGAAGCUGGGACUAUCUGCUUAACAUGCCCUCUACGUUUAACAAUUCUUGGCAGGGGUUUCGGGAGGUUAUGGCACCAGUAAUCAAUCCCGCCCAGAUGCAUCGGGCCUCUCUUCAAGCUCCCAUUCGCUCCUUUGGGAGUCGCCUGGUGCAAGUACCUAAAUGUGACAGUAUCUCGCAGCAGUCUUCGCCUGAUCUUGCGUGGGCGCUCCGCUUCAAGGACUACUCGUUAGUUAGUACCAAGGUCAAUCACAAAACUUUGGCCCGUAACAGACUGAAUCAAUGUUUCCGUAGAUUUCCCAAACACGUUUCGAAUGGCCAAAUGCUACUCAUUGCGUCACUGAAGUCUCCCAUCCCCAACCUUCGUAAUGGACUAUAUCGAUCAAAUUUUAAUAUUGCUAGGGUAGAUGAAACUUUCACUUGGGGUUCAUUGAUUAUCUGUGCCAAAGUCAGUUCCACGAGUGUCGUUAUAACGAAACGUGAUCUCUCCACACCGAGUACGCUCACGCUUGCCCCAUUACAUCUGUAUAGGUCAGGACCUACCACUGUGCAGAGAGCGUCGCACCAGCUCGCCCCUCCAAUCAACAAGAUGAGAUCAUUCGAAAAAUCCAGGUCAACAUGGGUCAAGGUACACCUCGACUUUGGUGAUUGCCCCUCUGUCCUAUCUGUCCCAUCCCCAGCUUCUGCCACGUACCAGUACCCUCGAUUAGUCCAUAAGGGUUGCAUACCCUAUAGCUUGGGAGAUAAAUCGGACCGGCGCGCGCAUCACUCGCGUCGCAAUACUCCGUGGGUGUGUGCACUUAGGCUCAUUAAAUUUCCGGCGUUCGAGCGGGCUGCUGACUUGGGAAUACUAGGACAGGGUAUUCGGGGCUAUGAGUACUACUAUCCAGGACCCAUUCAGAGAGGUAAAGCGAGUCUGCUCCUAGCGUGUCAUUCCAAACAACUGCUACAGGAGCGGGCAGAACGUAACUACAGAUCAACAAAAACCAACGGCCCAACGACAUCCUUAUCGGUACGCGUGCAACACUCGCCACGAGCGUAUUCCCCGAGAGGGACGAAGCCAAGACAAGGGCGCCCUGCAGAAAAGAUUUAUACCAGAGUGGCAGCUUGGCCAUUGUGUUUAGGGGGAUGCUUGGACAGACCGCACCCUUCGAGUGGCGUAAUCUUCUCUGAAUCUUCUGCCGAUAUACCUAUCAUUUUUACUGACCGGGCCUACACAAGCCAGCCGGCCGCUUCUAGGUGGAAGACCCGCGAUGGGGGACGUCCGCCGCGUGAAUCGCCAUAUUCCAGAGCGCCGAUCGACGCAAAAACCGGCUACUGCCAGUACACCGACCAAGACGCGUUAGUAUCGGAGUUAGUCCUAGUGCCGGACCGAGUUCGAACAGGCGGGAGCGUGCGUGGUCCAUCAACUUCAACGUGCGUUCGGCCGAUGGUACGCUAUCGCUGGGAUAAAGUCGCCGGAAGCCCACAUAUU
GGGUGUAACCCCGCCAAGGGUACUGGCCUUACCAAAGUAACUGUCGCGGCGUACUGGGCUCGGUCGCAAAAAGGCAGCAAUGCGAGUAGACGGAAGUCACACAGCUCAGCGCGCGCUGGUAAUUCUCGAAUCCCGCGCGUGCGAAGGUGUGUGGUCGACACACGUAGACCCUUCAACACAACGGGUUAUCUCCGAUCUCAAAAAUUCGUGUUACAAUGCAACGGUAUGAGUCCAAAAAACAGGCAGUCAUCCGUUGAAUCUGGGAUUGGGGUAGCUAUAGCGAUUAAGAAAUUGCAUGCCUUCCCCGCAGCCGCAGUUUCUUCCGCUACCAAGAGCCAGCUCGGAUCAACUCCAAUUUCGCAGAUACCAAUGGGAGAUUCCUCAACUUCAAGUCCGGUUAUUUUCGAAAUCAGGACAGAGUUCGUGUAUCAAUAUUCGACUCCGCCAUGUAGCGGCAGGAACUGA'
translate(test_seq_2)

# Finding a Motif in DNA

Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).
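These definitions count positions from 1 and include both endpoints, whereas Python indexes from 0 and excludes the end of a slice. The sketch below shows the conversion; the helper names `positions` and `substring` are illustrative assumptions.

```python
def positions(symbol, s):
    """1-based positions of a single symbol in s."""
    return [i + 1 for i, ch in enumerate(s) if ch == symbol]

def substring(s, j, k):
    """The substring written s[j:k] in the text: 1-based, inclusive at both ends."""
    # Shift the start by one; Python's exclusive end already matches k
    return s[j - 1:k]

s = "AUGCUUCAGAAAGGUCUUACG"
u_positions = positions("U", s)
piece = substring(s, 2, 5)
```

`u_positions` is `[2, 5, 6, 15, 17, 18]` and `piece` is `"UGCU"`, matching the examples in the text.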

Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s.

In [46]:
import re

def find_motif(motif, seq):
    # create lookahead string to find overlapping motifs
    # (re.escape guards against regex metacharacters in motif)
    q = '(?=' + re.escape(motif) + ')'
    # create iterator to find all the motifs in the seq
    iterator = re.finditer(q, seq)
    # find all start locations of motifs in the seq
    indices = [m.start(0) for m in iterator]
    # make start locations of indices for base 1
    indices_base_1 = [x+1 for x in indices]
    
    return indices_base_1

In [47]:
motif = 'ATAT'
seq = 'GATATATGCATATACTT'

find_motif(motif, seq)

[2, 4, 10]

# Counting Point Mutations

Given two strings s and t of equal length, the Hamming distance between s and t, denoted $d_H(s,t)$, is the number of corresponding symbols that differ in s and t.

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).

Return: The Hamming distance $d_H(s,t)$.

In [48]:
def hamming_distance(seq1, seq2):
    # Check to see if sequences are of equal length
    if len(seq1) != len(seq2):
        raise ValueError("Undefined for sequences of unequal length")
    # Count the positions at which seq1 and seq2 differ (zip pairs them up)
    return sum(aa1 != aa2 for aa1, aa2 in zip(seq1, seq2))

In [49]:
seq1 = 'GAGCCTACTAACGGGAT'
seq2 = 'CATCGTAATGACGGCCT'

hamming_distance(seq1, seq2)

7

# Calculating Expected Offspring

For a random variable X taking integer values between 1 and n, the expected value of X is $E(X) = \sum_{k=1}^{n} k \times \Pr(X=k)$. The expected value offers us a way of taking the long-term average of a random variable over a large number of trials.

As a motivating example, let X be the number on a six-sided die. Over a large number of rolls, we should expect to obtain an average of 3.5 on the die (even though it's not possible to roll a 3.5). The formula for expected value confirms that $E(X) = \sum_{k=1}^{6} k \times \Pr(X=k) = 3.5$.

More generally, a random variable for which every one of a number of equally spaced outcomes has the same probability is called a uniform random variable (in the die example, this "equal spacing" is equal to 1). We can generalize our die example to find that if X is a uniform random variable with minimum possible value a and maximum possible value b, then $E(X) = \frac{a+b}{2}$. You may also wish to verify that for the dice example, if Y is the random variable associated with the outcome of a second die roll, then $E(X+Y) = 7$.
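The die calculations can be verified directly from the definition of expected value. The following sketch uses exact fractions; the function name `expected_value` and the dictionary representation of a distribution are assumptions made for this example.

```python
from fractions import Fraction

def expected_value(distribution):
    """E(X) for a dict mapping each outcome to its probability."""
    return sum(outcome * prob for outcome, prob in distribution.items())

# A fair six-sided die: each face has probability 1/6
die = {face: Fraction(1, 6) for face in range(1, 7)}
e_x = expected_value(die)

# Distribution of the sum of two independent rolls
two_dice = {}
for a, pa in die.items():
    for b, pb in die.items():
        two_dice[a + b] = two_dice.get(a + b, 0) + pa * pb
e_sum = expected_value(two_dice)
```

`e_x` is 7/2, which equals (1 + 6)/2 as the uniform formula predicts, and `e_sum` is 7.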

Given: Six nonnegative integers, each of which does not exceed 20,000. The integers correspond to the number of couples in a population possessing each genotype pairing for a given factor. In order, the six given integers represent the number of couples having the following genotypes:

1. AA-AA
2. AA-Aa
3. AA-aa
4. Aa-Aa
5. Aa-aa
6. aa-aa

Return: The expected number of offspring displaying the dominant phenotype in the next generation, under the assumption that every couple has exactly two offspring.

In [69]:
def make_geno_dict(num_couples):
    # declare new dict geno and list of genotype pairs
    geno_dict = {}
    geno = ["AA-AA", "AA-Aa", "AA-aa", "Aa-Aa", "Aa-aa", "aa-aa"]
    
    # for each genotype pair add it and the number of couples to the geno dict
    for idx, g in enumerate(geno):
        geno_dict[g] = num_couples[idx]
        
    return geno_dict

def cal_exp_offspring(num_couples):
    # make dict from num_couples
    geno = make_geno_dict(num_couples)
    
    # Probability of each pairing producing offspring with the dominant phenotype
    multiplier = {"AA-AA": 1.0, 
                  "AA-Aa": 1.0, 
                  "AA-aa": 1.0, 
                  "Aa-Aa": 0.75, 
                  "Aa-aa": 0.50, 
                  "aa-aa": 0.0}
    # Each couple has exactly 2 offspring, so add 2 times the number of
    # couples with each genotype pairing times the probability that an
    # offspring of that pairing displays the dominant phenotype
    exp = 0
    for x in geno.keys():
        exp = exp + 2 * geno[x] * multiplier[x]
    return exp

In [71]:
def read_iev_file(filename):
    with open(filename) as file:
        # split() with no argument also discards the trailing newline
        return [float(i) for i in file.read().split()]
    
num_couples = read_iev_file('rosalind_iev.txt')
# num_couples = [1, 0, 0, 1, 0, 1]
cal_exp_offspring(num_couples)

149949.5

# Mortal Fibonacci Rabbits

Recall the definition of the Fibonacci numbers from “Rabbits and Recurrence Relations”, which followed the recurrence relation $F_n = F_{n-1} + F_{n-2}$ and assumed that each pair of rabbits reaches maturity in one month and produces a single pair of offspring (one male, one female) each subsequent month.

Our aim is to somehow modify this recurrence relation to achieve a dynamic programming solution in the case that all rabbits die out after a fixed number of months. See Figure 4 for a depiction of a rabbit tree in which rabbits live for three months (meaning that they reproduce only twice before dying).

Given: Positive integers n≤100 and m≤20.

Return: The total number of pairs of rabbits that will remain after the n-th month if all rabbits live for m months.

In [72]:
def mortal_fibonacci(n, m):
    # This code is 1-based, so n and m may not be 0 or less
    if n <= 0 or m <= 0:
        raise ValueError("n and m must be greater than 0")
    # ages[i] is the number of pairs that are i months old;
    # month 1 starts with a single newborn pair
    ages = [1] + [0] * (m - 1)
    for month in range(2, n + 1):
        # Every pair that is at least one month old produces one new pair
        newborns = sum(ages[1:])
        # All pairs age by one month; pairs in the last slot die
        ages = [newborns] + ages[:-1]
    return sum(ages)

# Inferring mRNA from Protein

For positive integers a and n, a modulo n (written a mod n in shorthand) is the remainder when a is divided by n. For example, 29 mod 11=7 because 29=11×2+7.

Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that a and b are congruent modulo n if a mod n = b mod n; in this case, we use the notation a ≡ b mod n.

Two useful facts in modular arithmetic are that if a ≡ b mod n and c ≡ d mod n, then a + c ≡ b + d mod n and a × c ≡ b × d mod n. To check your understanding of these rules, you may wish to verify these relationships for a=29, b=73, c=10, d=32, and n=11.

As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.
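The suggested check, and the idea of reducing after every multiplication, can be sketched as follows. The variable names and the helper `product_mod` are illustrative assumptions, not part of the problem statement.

```python
a, b, c, d, n = 29, 73, 10, 32, 11

# a and b are congruent modulo n, as are c and d
congruent_ab = (a % n == b % n)
congruent_cd = (c % n == d % n)

# Sums and products of congruent numbers stay congruent
sums_agree = ((a + c) % n == (b + d) % n)
products_agree = ((a * c) % n == (b * d) % n)

def product_mod(values, modulus):
    """Multiply values, reducing modulo `modulus` after every step."""
    result = 1
    for value in values:
        result = (result * value) % modulus
    return result

stepwise = product_mod([a, b, c, d], n)
all_at_once = (a * b * c * d) % n
```

All four checks evaluate to `True`, and `stepwise` equals `all_at_once`. Reducing at each step gives the same answer as reducing once at the end while keeping every intermediate value smaller than the modulus.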

Given: A protein string of length at most 1000 aa.

Return: The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don't neglect the importance of the stop codon in protein translation.)

In [34]:
RNA_codon_table = {
    "UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L",
    "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S",
    "UAU":"Y", "UAC":"Y", "UAA":"STOP", "UAG":"STOP",
    "UGU":"C", "UGC":"C", "UGA":"STOP", "UGG":"W",
    "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
    "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
    "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
    "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
    "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M",
    "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T",
    "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K",
    "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R",
    "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
    "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
    "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
    "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G",
}

def reverse_codon_table(table):
    # Map each amino acid (or "STOP") to the list of codons that encode it
    reverse = {}
    for codon, aa in table.items():
        reverse.setdefault(aa, []).append(codon)
    return reverse

def codon_freq_table(table):
    # Count how many codons encode each amino acid (or "STOP")
    reverse_table = reverse_codon_table(table)
    return {aa: len(codons) for aa, codons in reverse_table.items()}

In [35]:
# https://stackoverflow.com/questions/46135385/inferring-mrna-from-protein-rosalind
from functools import reduce
from operator import mul

def num_rna_strings(protein, modulo, freq):
    # Multiply the codon counts of every amino acid, starting from the number
    # of stop codons; pass modulo=None to get the exact (unreduced) product
    if modulo:
        reduce_fn = lambda a, b: (a * b) % modulo
    else:
        reduce_fn = mul
    freqs = (freq[aa] for aa in protein)
    return reduce(reduce_fn, freqs, freq["STOP"])

In [36]:
freq = codon_freq_table(RNA_codon_table)
protein = 'MIFEYKCQPFSMNDDSTFQPFDPDGNGPVVMKTQCMYPAYKWNVQSKSIVKRHNHTYHISYAKLDHVVCDWPVDFVDPVGITNHQVKKHQFTIETDRRTTCEMRQSMVLFYLWGHPDVSFKPGWQEPVPTGAYLGQSHCGPKSGGTKDYVAHMKFTSDLCMKWTTHWQHECQNHMHHGGAWYFLGAHSHMYRRKSFRCNGTACHWNDHTIGQHLAGWMFIFSGQHWGWDRADYCQWMTMFNYNPSSDQWRTGHREVGGANKGDWDWFYWIYSKIDWLSYWQGYYPVGICNIMHGLPGFWLTWGTEWNQELKRTKGYEDVVCAQFADNNIFGDEGVSHYQDSQKEVCGFMYMMFCNLVYDVLFNDQLDPFNRIIFDTTRQWQMFQWNHFMILSQDHYHCVLCFCQARERRCYKEEDQIQETHRSNTLHTRALEMLPCCMRYCRDGMPASLVQFKAQIYPQIGPCEWEEPPWHFLNWYGAFFNPCSHLYMNLIHYCRQMVLRILRGEDPIIMFNEDFEHIASKCCNHFDPFRWAMRITHNRTFTTFHKGFMFYMMIYCPIQGGPVCHDSLQEKTGCAGINEEGIFKDMQRQYQQGNDSLTDIHWQMTPDMDATFKDPYWPNWVDREFCTKAEDPIRMHTCPFEAFEYVVHVSTHGMPHCEMCLPLPHTMWHGGAENNSHKGPIRMHRVYFMAEWNIDRWIFHDEYYTWCHAMAKCYREPSCHLFITFNYNIYSYIEQDDKFRKNMTKTQKHIGPRRRMTQMLPNNHYDFHGAMCETPPSRMAHYGSHLPNPNSNDFCDTANSYDPNREFQSCDCMNNKGQFPACCAVNPIVDHHFMWAFQHNKEDRYNSKGISLCLHGTYGALLHIWVEVIMKYSEQCQVPIILRQSLDKFRKKRSVMVICSIDFTMMGPHCGWDDGWKFRSSSVPRNWFMDCERMITRKQPQMVDDASCYTHGRPPKAFYFFEAKEHGHTYAFYFCIYHWVKPMETSYEFNVMNHQTQAT'
num_rna_strings(protein, 1000000, freq)
