## Counting DNA Nucleotides

Problem
A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length 21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG".

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

In [47]:
f = open("rosalind_dna1.txt", "r")
a = f.read()

In [48]:
def count_bases(x):
    counter = {"A": 0,
               "C": 0,
               "G": 0,
               "T": 0}
    for base in x:
        try:
            counter[base] += 1
        except KeyError:
            print("unknown character")
    return counter

def cast_result(counter_dict):
    return "{} {} {} {}".format(counter_dict["A"], counter_dict["C"], counter_dict["G"], counter_dict["T"])

In [49]:
res = count_bases(a)
cast_result(res)

unknown character


'202 212 230 200'

In [50]:
res

{'A': 202, 'C': 212, 'G': 230, 'T': 200}

In [51]:
total = sum(res.values())  # builtin sum avoids the numpy dependency
total


844

In [52]:
len(a)

845
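
The one-character gap between `len(a)` (845) and the summed counts (844) matches the single "unknown character" message above. A likely cause, though the input file is not shown here, is a trailing newline read from disk. A minimal sketch, assuming the input is a plain string and using a hypothetical helper name `count_bases_stripped`, strips whitespace first and counts with `collections.Counter`:

```python
from collections import Counter

def count_bases_stripped(dna):
    """Return the A, C, G, T counts as a space-separated string (hypothetical helper)."""
    counts = Counter(dna.strip().upper())  # strip() drops a trailing newline, if any
    return " ".join(str(counts[base]) for base in "ACGT")

# The 21-symbol example string from the problem statement, with a trailing newline
print(count_bases_stripped("ATGCTTCAGAAAGGTCTTACG\n"))
```

Unlike `count_bases`, this version silently ignores any other unexpected symbols instead of reporting them.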

## Transcribing DNA into RNA

Problem
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

In [231]:
def dna_to_rna(base):
    if base.lower() == "t":
        return "U"
    elif base.lower() in ["a", "c", "g"]:
        return base
    else:
        return ""

def transcribe_strand(s):
    t = [dna_to_rna(base) for base in s]
    return "".join(t)

In [63]:
f = open("rosalind_rna.txt", "r")
a = f.read()
transcribe_strand(a)

'ACAACGCCCAACAGCCCUUUUCUACUUCGUUUUUAAACAUCUUUAUACUUUUUGGCUUGAUUCCUCUGCCGAGUUCGCAGGUCGCCCCUAUACCGGUUCAAAGGCGAUAGGCGUUAGUUCACCAUAUUAGGCGGGUUCAUGCCUAGACACGGUACUGUGAUAGUACUAACGGCAGAUGAGUGCCCUCACAGGCUUUGGUCAAUUAAACUCCGGGAAUCUCCACCGCAUGCUGCAGGAUCUACCUAUGCUCCCCAAAGCACAGUUUUUCAAACGACUGGGGAGCAACACAGUCGAAGCGACGACAGCUACGUGCGCAUCGGCUCGACGAGGCCCCCACAUCGGGAAACCGACACAUAACAAGGAAUCAGCCUACUCCGUAUGGGACGGUUCCUUUAAUCUGAGGCUAAGAACUGCCCGCGUUGCAUUCCCUUGGUUUGGGAUGUAGUCGUCCCAUGUGGGUUCAGUACUAUAAAAUCUCAUUGGGCAGCCCUUACGCUUUCUUAAGCCUUGAGGGACAGGUCGACAUUACUAGCAUGGAAUGCGAGUCGACUAGUGCCGAGGGUUCCCCGGGCAGCAAGUCCGAAACUAGAUCAUCAUAAGUGGUAAGGUGGUUAGACAGAAGCGAAAAGGUAAUGCGACGCGCCCGGCGUGAAAAAUCGUACUGUAAAGGAGACGUGUUACCGAUUCAAUCUAUCAAGCGUUGCAGACCCGGUAAUAUGCGACUUAACCUCCUGAAAAAGCUAUGGACCUUAGGGGUCGUAUUGUUUUCGGCUUAUGCGCCCAUACUUGUACUUGAGUAGGAGGGGCGAAGCGGGUCCCUGACGGUCGUUUGUCAGUGUCGCACACUGAUGGACCGGUAUCGUAGCUCCCUCUAUGUAGUCUUUUUCUCGCAUCCAUACUUUUCGCUGUGCUGUCAGCAAAUCGGCCGGACAGACCCCAGCAGUAAGUCCGUUCAGCUAACAUAUAAAAGAUCG'

In [232]:
b = transcribe_strand(a)
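
Because transcription here is a plain symbol substitution, the same result can be sketched with `str.replace`. The function name `transcribe` is an illustrative assumption, and unlike `dna_to_rna` this version upper-cases the input and does not drop unknown symbols other than surrounding whitespace:

```python
def transcribe(dna):
    """Replace every 'T' with 'U'; surrounding whitespace (e.g. a trailing newline) is removed."""
    return dna.strip().upper().replace("T", "U")

print(transcribe("GATGGAACTTGACTACGTAAATT\n"))
```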

## Complementing a Strand of DNA

Problem
In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string s^c formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement s^c of s.

In [7]:
def check_base(base):
    if base.lower() in ["a", "c", "g", "t"]:
        return base
    else:
        return ""

def prune_strand(s):
    s_ = [check_base(b) for b in s]
    return "".join(s_)
        
def reverse_strand(s):
    return s[::-1]

def complementary_base(base):
    compl_dict = {"a": "T", "c": "G", "g": "C", "t": "A"}
    
    try:
        return compl_dict[base.lower()]
    except KeyError:
        print("Unknown base!")
        return "_"
    
def reverse_complement(s):
    s_pruned = prune_strand(s)
    s_rev = reverse_strand(s_pruned)
    s_compl = [complementary_base(b) for b in s_rev]
    return "".join(s_compl)

In [87]:
f = open("rosalind_revc.txt", "r")
a = f.read()

reverse_complement(a)

'GTGGGTTGCGATGAAGCATCTATCGATCTGCGAAGACACACTTAAGAAGGAAGGGGGTCTATCTGCAATTCGCCTGTACATTAATCAATGGGATATATAACCCTGACTAGCCACCATCTTTTTCATTTCCGGTTACAGCGTGTCCTCGTTATTCAAGCAGCCCGTACTCGCCCGCCATAGGTATGGTGCAGTGATTGGCGCATAATATTCCCCACAACATTCGTTCTAACGTAGTGAGTCGGATACGACGAGTCAGTCAAGCATTTAGCGGCGTTGTGAGACGCCTGAGTATATTAACGCCGGTTAGTCGTCGCGATCGAGGTTGGTAGTGAACCTTGTAAATAGTGGTAAGCGCTAGAGCGGCAGGGCTAGTCTCGTTCCTTCATAAACGCCTAGCAGTAGTACTGCTGTACCTGTATTGATTAGAACACCCTTCCCTCCGCACGACCACCCTGGGTACGACGTCAAGCCAATAGTTCTGTGAAGTTTACTATATCATTAACCCGTGCTTTTCCTCCCGGCGATCAATCATTCTCATGGAACACCCTCCGATGCAATTCGGGCGAAGCAGCGGCTTTATATATTCGTTATTCCGCAGTTCGGCGGGCTAGGGCCTACCTGGTTCTACATCCCAGTAAATCCTGCATTGCAGGCGGAGGAGGGGGGCCCAGACGAATTAATGCAAGGCCGTTTGTACCAACCTGAGAGCCGTCGCGACAATTTGGCATAGTGTGGCTATTGGTCCACCTCTGAAGTTCCCATACGCGCTCCGGAGGATCAGCCCATATGATGCCCCGGCCAGATCGACACCGTTATGATTGTCCGCCTATCACACCAGTAGAAAACTTGATGGAGTTGAGACCGGCAAGACAATGCTCAATAACCATAAAGATTAGTGTCACAACGACATATCA'

In [88]:
a

'TGATATGTCGTTGTGACACTAATCTTTATGGTTATTGAGCATTGTCTTGCCGGTCTCAACTCCATCAAGTTTTCTACTGGTGTGATAGGCGGACAATCATAACGGTGTCGATCTGGCCGGGGCATCATATGGGCTGATCCTCCGGAGCGCGTATGGGAACTTCAGAGGTGGACCAATAGCCACACTATGCCAAATTGTCGCGACGGCTCTCAGGTTGGTACAAACGGCCTTGCATTAATTCGTCTGGGCCCCCCTCCTCCGCCTGCAATGCAGGATTTACTGGGATGTAGAACCAGGTAGGCCCTAGCCCGCCGAACTGCGGAATAACGAATATATAAAGCCGCTGCTTCGCCCGAATTGCATCGGAGGGTGTTCCATGAGAATGATTGATCGCCGGGAGGAAAAGCACGGGTTAATGATATAGTAAACTTCACAGAACTATTGGCTTGACGTCGTACCCAGGGTGGTCGTGCGGAGGGAAGGGTGTTCTAATCAATACAGGTACAGCAGTACTACTGCTAGGCGTTTATGAAGGAACGAGACTAGCCCTGCCGCTCTAGCGCTTACCACTATTTACAAGGTTCACTACCAACCTCGATCGCGACGACTAACCGGCGTTAATATACTCAGGCGTCTCACAACGCCGCTAAATGCTTGACTGACTCGTCGTATCCGACTCACTACGTTAGAACGAATGTTGTGGGGAATATTATGCGCCAATCACTGCACCATACCTATGGCGGGCGAGTACGGGCTGCTTGAATAACGAGGACACGCTGTAACCGGAAATGAAAAAGATGGTGGCTAGTCAGGGTTATATATCCCATTGATTAATGTACAGGCGAATTGCAGATAGACCCCCTTCCTTCTTAAGTGTGTCTTCGCAGATCGATAGATGCTTCATCGCAACCCAC\n'
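
As an alternative to the per-base dictionary lookup, the complement step can be sketched with a translation table built by `str.maketrans`. The names `COMPLEMENT` and `reverse_complement_translate` are assumptions made for this example, and symbols outside the DNA alphabet are passed through unchanged rather than pruned:

```python
# Map each base to its complement, preserving case
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement_translate(dna):
    """Complement every base, then reverse the string."""
    return dna.strip().translate(COMPLEMENT)[::-1]

print(reverse_complement_translate("GTCA"))  # the example from the problem statement: TGAC
```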

## Computing GC Content

Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

In [133]:
def read_fasta(filename):
    data = {}
    with open(filename) as f:
        lines = f.readlines()
        for l in lines:
            if l[0]==">":
                key = l[1:].replace("\n", "")
                data[key] = ""
            else:
                string = l
                data[key] += string.replace("\n", "")
    return data
                    

def determine_GC(sequence):
    # GC-content as a percentage of the sequence length
    N = len(sequence)
    n_gc = sum(1 for b in sequence if b.lower() in ("c", "g"))
    return n_gc/N * 100

def rank_highestGC(all_seqs):
    gc_highest = 0
    seq_highest = ""
    for seq_name, seq in all_seqs.items():
        gc_seq = determine_GC(seq)
        if gc_seq > gc_highest:
            gc_highest = gc_seq
            seq_highest = seq_name
            
    return "{}: {:.6f}".format(seq_highest, gc_highest)
        

In [136]:
data = read_fasta("rosalind_gc.txt")
data

{'Rosalind_8542': 'CGTTTCACGCCAATCCTTCCGCTCTGTGCACTCTAAGTCAGTCCGGAGCATGCCGTTGGCCTGGATGGGCAGGGAAACAGAGACAGCAATGACCTGCCTGAGATGATGCTGCTCGTGCGTGGGAACTGTGTCGATGATATGACCATGGGGAACAAGCGATCGGGATAAGATAGCTCCTTTTGGACCTGCCATCCCTCTTTACGTACGGGTTCTGCTTCCACTCTGTCATTGGTAACGTGTTGTCTGTCGAAACCACAGATATCTCTCATCCTGCCCATAAGACATTGTTCTCGAGGCAAGCATTTCACTACTTCAGAATTCCGTGTCGTAATTGAGACTGTCTCGCTGAGTTCTCGTCGGTCCTAGACTATGCGCGGGTTTGTGTATCCCCCCGGCTTGTATTACGACGTTGCCTCAGACTACAGAGGTGGACGAGTTATGGCATCCTCGCGCTCAAAGGCTCGCGATACACCGCGCCAAGACTAGCCGGCATAAGTTATACTACGAAGTCACGGCTCAATCAGCAACTGCGGCCATTGACTCCAGCCATCCTGTGCATACTAAACCACGTTAGATTACCACCGTCGATGCCAGCAAGTTGTTTGAGCAAAAAACCGAACGAAGACTGGGGGGAATTGTAAAACCGTGACGAATGCCCGTTGCAGTCTGAGTTCCTGGTAGAAGATATCCGGATCCTTATCTTTGGTTCACAAGGCGCCTGCCACCCCCGGCCGTTCGTAAGAGTCGAGTGGGGACCCGTCCATACACCACGGACATACGGTAGTTCTCACCCAGCGTTCACATCTGCGCCGCCAGGCCTCTCGTTTCAGGTAGACCCTCGATCTATTTCCCAGGACGGCTC',
 'Rosalind_6521': 'ACGATCATGCCCCTATAAATTGGCCTGGAACTTACCCTTGCATGAGGCATGTTGTGACCCATTCAACCAACACTTAATATTATCTGGCTGCCGCTAA

In [137]:
rank_highestGC(data)

'Rosalind_8542: 53.132251'
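
The ranking step can also be sketched with `str.count` and `max`. This is only an illustration under the assumption that the records are already parsed into a dict of ID to sequence; `gc_content` and `highest_gc` are hypothetical names:

```python
def gc_content(seq):
    """Percentage of symbols that are 'C' or 'G' (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

def highest_gc(records):
    """Return (ID, GC-content) of the record with the highest GC-content."""
    best = max(records, key=lambda name: gc_content(records[name]))
    return best, gc_content(records[best])

print(gc_content("AGCTATAG"))  # the example from the problem statement: 37.5
```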

## Counting Point Mutations

Problem

Figure 2. The Hamming distance between these two strings is 7. Mismatched symbols are colored red.
Given two strings s and t of equal length, the Hamming distance between s and t, denoted d_H(s,t), is the number of corresponding symbols that differ in s and t. See Figure 2.

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).

Return: The Hamming distance d_H(s,t).

In [228]:
def _check_base(base):
    if base.lower() in ["a", "c", "g", "t"]:
        return base
    else:
        return ""

def _prune_strand(s):
    s_ = [_check_base(b) for b in s]
    return "".join(s_)

def hamming_distance(seq1, seq2):
    # count positions where the two sequences differ (case-insensitive)
    return sum(1 for b1, b2 in zip(seq1, seq2) if b1.lower() != b2.lower())

In [229]:
f = open("rosalind_hamm.txt", "r")
a = f.read()

seq1, seq2, _ = a.split("\n")

In [146]:
hamming_distance(seq1, seq2)

476
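
Since `zip` silently stops at the shorter input, a sketch that enforces the equal-length condition from the definition may be useful. The function name `hamming` and the two test strings (chosen to differ in 7 positions) are assumptions made for this example:

```python
def hamming(s, t):
    """Number of positions at which s and t differ; both must have the same length."""
    if len(s) != len(t):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(a != b for a, b in zip(s, t))

print(hamming("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
```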

## Mendel's First Law 

Problem

Figure 2. The probability of any outcome (leaf) in a probability tree diagram is given by the product of probabilities from the start of the tree to the outcome. For example, the probability that X is blue and Y is blue is equal to (2/5)(1/4), or 1/10.
Probability is the mathematical study of randomly occurring phenomena. We will model such a phenomenon with a random variable, which is simply a variable that can take a number of different distinct outcomes depending on the result of an underlying random process.

For example, say that we have a bag containing 3 red balls and 2 blue balls. If we let X represent the random variable corresponding to the color of a drawn ball, then the probability of each of the two outcomes is given by Pr(X=red)=3/5 and Pr(X=blue)=2/5.

Random variables can be combined to yield new random variables. Returning to the ball example, let Y model the color of a second ball drawn from the bag (without replacing the first ball). The probability of Y being red depends on whether the first ball was red or blue. To represent all outcomes of X and Y, we therefore use a probability tree diagram. This branching diagram represents all possible individual probabilities for X and Y, with outcomes at the endpoints ("leaves") of the tree. The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree; see Figure 2 for an illustrative example.

An event is simply a collection of outcomes. Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes. For our colored ball example, let A be the event "Y is blue." Pr(A) is equal to the sum of the probabilities of two different outcomes: Pr(X=blue and Y=blue)+Pr(X=red and Y=blue), or 1/10+3/10=2/5 (see Figure 2 above).

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

In [156]:
def probability_recessive_only(k, m , n):
    N = k + m + n
    
    # mating of two recessive homozygous:
    p1 = (n/N) * ((n-1)/(N-1))
    
    # mating of recessive homozygous & heterozygous:
    p2a = (n/N) * (0.5 * m/(N-1))
    p2b = (0.5* m/N) * (n/(N-1))
    
    # mating of two heterozygous:
    p3 = (0.5*(m/N)) * (0.5*(m-1)/(N-1))
    
    return p1 + p2a + p2b + p3

def probability_dominant(k, m , n):
    return 1 - probability_recessive_only(k, m , n)

In [170]:
f = open("rosalind_iprb.txt", "r")
a = f.read()
a

'23 27 29\n'

In [171]:
b = a.replace("\n", "").split(" ")
k, m , n = [int(x) for x in b]
k, m , n 

(23, 27, 29)

In [172]:
probability_dominant(k, m , n)

0.7126744563453424
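
The closed-form terms above can be cross-checked by brute force. The following sketch, with the hypothetical name `dominant_probability_exact` and the genotype labels "AA", "Aa", "aa" chosen for illustration, enumerates every ordered pair of distinct parents and uses exact fractions, so it is slower but leaves no room for algebra slips:

```python
from fractions import Fraction
from itertools import permutations

def dominant_probability_exact(k, m, n):
    """Exact probability that the offspring of a random pair carries a dominant allele."""
    # Probability that a parent of each genotype passes on the recessive allele
    passes_recessive = {"AA": Fraction(0), "Aa": Fraction(1, 2), "aa": Fraction(1)}
    population = ["AA"] * k + ["Aa"] * m + ["aa"] * n
    pairs = list(permutations(population, 2))  # ordered pairs of distinct individuals
    # The offspring is homozygous recessive only if both parents pass the recessive allele
    recessive = sum(passes_recessive[p] * passes_recessive[q] for p, q in pairs)
    return 1 - recessive / len(pairs)

print(float(dominant_probability_exact(23, 27, 29)))
```

For the input above this agrees with `probability_dominant` up to floating-point rounding.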

## Translating RNA into Protein

Problem
The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).

Return: The protein string encoded by s.

In [242]:
def get_aa(codon, codon_table):
    return codon_table[codon]

def translate_strand(s, codon_table):
    n = len(s)
    assert n%3 == 0
    n_codons = int(n / 3)
    protein = ""
    for i in range(n_codons):
        codon = s[3*i:3*(i+1)]
        
#         print(codon)
        aa = get_aa(codon, codon_table)
#         print(aa)
        protein += aa
        
    return protein

In [245]:
# Read the codon table

f = open("codon_table.txt", "r")
a = f.readlines()

codon_table = {}
for line in a:
    # drop all whitespace so that every 4-character chunk is a codon plus its amino acid
    parsed_line = line.replace("\n", "").replace(" ", "")
    no_matches = int(len(parsed_line)/4)
    
    for i in range(no_matches):
        section = parsed_line[4*i:4*(i+1)]  
        codon = section[0:3]
        aa = section[-1]
        
        codon_table[codon] = aa
codon_table

{'UUU': 'F',
 'CUU': 'L',
 'AUU': 'I',
 'GUU': 'V',
 'UUC': 'F',
 'CUC': 'L',
 'AUC': 'I',
 'GUC': 'V',
 'UUA': 'L',
 'CUA': 'L',
 'AUA': 'I',
 'GUA': 'V',
 'UUG': 'L',
 'CUG': 'L',
 'AUG': 'M',
 'GUG': 'V',
 'UCU': 'S',
 'CCU': 'P',
 'ACU': 'T',
 'GCU': 'A',
 'UCC': 'S',
 'CCC': 'P',
 'ACC': 'T',
 'GCC': 'A',
 'UCA': 'S',
 'CCA': 'P',
 'ACA': 'T',
 'GCA': 'A',
 'UCG': 'S',
 'CCG': 'P',
 'ACG': 'T',
 'GCG': 'A',
 'UAU': 'Y',
 'CAU': 'H',
 'AAU': 'N',
 'GAU': 'D',
 'UAC': 'Y',
 'CAC': 'H',
 'AAC': 'N',
 'GAC': 'D',
 'UAA': '*',
 'CAA': 'Q',
 'AAA': 'K',
 'GAA': 'E',
 'UAG': '*',
 'CAG': 'Q',
 'AAG': 'K',
 'GAG': 'E',
 'UGU': 'C',
 'CGU': 'R',
 'AGU': 'S',
 'GGU': 'G',
 'UGC': 'C',
 'CGC': 'R',
 'AGC': 'S',
 'GGC': 'G',
 'UGA': '*',
 'CGA': 'R',
 'AGA': 'R',
 'GGA': 'G',
 'UGG': 'W',
 'CGG': 'R',
 'AGG': 'R',
 'GGG': 'G'}

In [244]:
f = open("rosalind_prot.txt", "r")
a = f.read().replace("\n", "")

translate_strand(a, codon_table)

'MHRGYDHANVTNGAPFKTLRVRGPCPSNECNRVIRPRRYQDLSDLGVSDSKNLFDPSIKDERELSQGQTLRATEVGSLKHGFARCFSEAWRSPLGSPSDHGRCAHFTIVPVRYPPPDRYLNTLAVVDAVLMQYCLEVVPCFMMSLVKEQGIHIRISRAPIQLSHMYFKSEPVACGQGLATFANLLNADFPSRWSIIFNHSQNFGYIVIPVPRRLAGASGATPTLHLIDTEAVMSQFHQVRLCHSEVYGQPRRSRVERSASGLYSTQLKWLPPSNENRFFGSHGTYSHADTRTKRSCWRGWTTQVALRLTQVTIVSVCVFFGLQLPAEDTLGVSYSTSPAHSGTLGVAVDDIPRYIHPVCKGMCLRAVLPMAALNKILGPDRIGRCSSQVRCTEIRLTSAPLLGASRKKIKLQAGYLFGVAVAPGVTMSKPVVLPRHTPLSITITSIDWFVTNNLESEKSEFTWLISTIHRSSNIILCVRSCSAKTLEWNVSRKPPMLTQREIASQAVLTAYLAPCTSHFPFYTGSTYLSSPNTLFLPRHILRVIHYQWEKQSPYSSWPKVIPGYVSSSAQFPALAPAHMPDTVHGGGYPSYLLRNSWRATCLKTCIDAINWYVTIRISLGRPHMPLIVQYTYHYQGGPCRRTGQLMLQYSVLVSRGRRGGRVFEGRETHRKRAILCTLYGLDAKKVAGIDPDYSIIPAEDTYQADRSLERYQHYGHLLVTLTLVVSKINTNYWRLRFPARESLTSSVVDLNCDQPTSLPAFSDAVHWSLRLKSLVRNKGHRGCYVSELCTSQYPKRRTIMSRLYRLPDVVRDYLPLTRSSFIGNAHSSQLHLLLRLVLTKRSVTTYPPPTPYNPSTCVDLKRHSMRLCHFRFLRAARGHYSYTSELLRRYLRPLAVLIQTATADGQTAWALSDQTVAVMSAVVDGTCVSPRRCLHPSPGSAYVLCGFTEASNTAKDDALRRSSSNNGTDNDTPDTLRVEQEIPTGSLGRLPRIFGMNHA
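
`translate_strand` translates every codon, including stop codons, which it records as '*'. A common variant ends the protein at the first stop codon instead. The sketch below assumes that reading and builds the standard codon table in code rather than from `codon_table.txt`; the names `CODON_TABLE` and `translate_until_stop` are illustrative:

```python
from itertools import product

# Standard genetic code with codons enumerated in U, C, A, G order; '*' marks a stop
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product("UCAG", repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

def translate_until_stop(rna):
    """Translate codon by codon and stop at the first stop codon (or the end of the string)."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate_until_stop("AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"))
```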

## Inferring mRNA from Protein

Problem
For positive integers a and n, a modulo n (written a mod n in shorthand) is the remainder when a is divided by n. For example, 29 mod 11 = 7 because 29 = 11×2 + 7.

Modular arithmetic is the study of addition, subtraction, multiplication, and division with respect to the modulo operation. We say that a and b are congruent modulo n if a mod n = b mod n; in this case, we use the notation a ≡ b mod n.

Two useful facts in modular arithmetic are that if a ≡ b mod n and c ≡ d mod n, then a+c ≡ b+d mod n and a×c ≡ b×d mod n. To check your understanding of these rules, you may wish to verify these relationships for a=29, b=73, c=10, d=32, and n=11.
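
The suggested check can be run directly with Python's `%` operator; this short sketch uses exactly the values proposed in the text:

```python
a, b, c, d, n = 29, 73, 10, 32, 11

# a and b are congruent modulo n, and so are c and d
assert a % n == b % n == 7
assert c % n == d % n == 10

# Sums and products of congruent numbers stay congruent
print((a + c) % n, (b + d) % n)  # 6 6
print((a * c) % n, (b * d) % n)  # 4 4
```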

As you will see in this exercise, some Rosalind problems will ask for a (very large) integer solution modulo a smaller number to avoid the computational pitfalls that arise with storing such large numbers.

Given: A protein string of length at most 1000 aa.

Return: The total number of different RNA strings from which the protein could have been translated, modulo 1,000,000. (Don't neglect the importance of the stop codon in protein translation.)

In [81]:
def get_reverse_codon_table(codon_table):
    rev_table = {}
    for codon, aa in codon_table.items():
        if aa not in rev_table.keys():
            rev_table[aa] = [codon]
        else:
            rev_table[aa].append(codon)
    return rev_table

In [82]:
def possible_rna(prot_seq, rev_table):
    counter = 1
    for aa in prot_seq:
        n_codons = len(rev_table[aa])
        counter *= n_codons
        # By the product rule above, reducing after every step gives the same
        # residue as a single modulo at the end. Note that 1e6 is a float, so
        # the returned count is a float as well.
        counter = counter % 1e6
    return counter


In [83]:
rev_table = get_reverse_codon_table(codon_table)
rev_table

{'F': ['UUU', 'UUC'],
 'L': ['CUU', 'CUC', 'UUA', 'CUA', 'UUG', 'CUG'],
 'I': ['AUU', 'AUC', 'AUA'],
 'V': ['GUU', 'GUC', 'GUA', 'GUG'],
 'M': ['AUG'],
 'S': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'],
 'P': ['CCU', 'CCC', 'CCA', 'CCG'],
 'T': ['ACU', 'ACC', 'ACA', 'ACG'],
 'A': ['GCU', 'GCC', 'GCA', 'GCG'],
 'Y': ['UAU', 'UAC'],
 'H': ['CAU', 'CAC'],
 'N': ['AAU', 'AAC'],
 'D': ['GAU', 'GAC'],
 '*': ['UAA', 'UAG', 'UGA'],
 'Q': ['CAA', 'CAG'],
 'K': ['AAA', 'AAG'],
 'E': ['GAA', 'GAG'],
 'C': ['UGU', 'UGC'],
 'R': ['CGU', 'CGC', 'CGA', 'AGA', 'CGG', 'AGG'],
 'G': ['GGU', 'GGC', 'GGA', 'GGG'],
 'W': ['UGG']}

In [85]:
f = open("rosalind_mrna.txt", "r")
prot = f.read().replace("\n", "")
prot += "*"
prot

'MTNDNEYCDDEDAAPWRTNRNKYYGSSEQSVHFQVSDSCAQFLHILPWTQYQREVHFPAENPDMLNEYKQHYLLCCCFDGCDSPHHIMWFRCSFCAVIMFRDNNKFMHLWLRLAFYYCWWDYIVIESCVHLLMIDFNAQCWWTYEFIKQEELFHHCWYLDTHPTESACLGRSNGKAVCYMGIHLDAGDSEMYCYITCGSPPTKWCPGRHMQYGSGPFCIDMVNKEWRYHGEYYKMSWCTSRTLKLYYNIVSWEFQVAMYCPWDFLGSVMVSGRFYWQRCVFDLLGHRLDRMSIDFFTTYNFYVKHTLLVDMQRGIANDTQALVCCCADLSFNRWFITAEDQCAWSMGMDQKFVYSYFYEACDVFSFEMNASSVEKSEQQVNTISQGYRRDHSKCPLRLHIGWSLIKKPMCVRECVAFHGNVCMAKGGEWPVLIMDKFLRENMVDTGEAELKSNTNVACIDRGGTVYEHDCCTKLQELCLWCQKARYAGLTGRTFFILADLVWPQMHWFHEDVFTEEMCMALWPTEDLGPYDTWAHCCFLPFQYVQNFKDIEFVSLGRHFIEHFPLRSVRPGMFTQIYWDDVVVSMFHRCLFGSITNEAYFHQIQKIYTFDGMIRCLECIIFMRKLMDYDWQQYKQFPYGQMFKEVMMEEGVSFAVLKIMKKTKFKTIQAPGMLHRAVYTSKVRRKTPNYDQAQNNPLYTYYHLPNPSECIELMGLEQAHFCAHTMCQSQMTKKVNCVKVLELPVLDINIMGGQFPCIKALPIQMGCQEMPGAEGLAFKFRPFVGNDVDGAPWCDAHEHPQSRTLTHIWQFCYDQLLWTEMKKTILRCCCVGKRFDWNQNWKHCETFGCLSMTSIAAPIYDDSTVVRHITSSESRLAKCCDMMIWDEWIPVGARTSDSLEYAWIFFHLLPGNNIGRPRKHYMNHYIPDRERSREFLPQIEIEETFIDIETQKETALCCAAFRWDQEMFAYYSCACGMHLGWTGVQWSVDPHWMGRKMS*'

In [86]:
possible_rna(prot, rev_table)

378048.0
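
The trailing `.0` appears because `1e6` is a float. A sketch that stays in integer arithmetic is shown below; the per-residue codon counts are taken from the reverse codon table above, while the names `CODON_COUNTS`, `STOP_CODONS`, and `count_rna_strings` are assumptions made for this example:

```python
# Number of codons per amino acid, as listed in the reverse codon table
CODON_COUNTS = {
    "F": 2, "L": 6, "I": 3, "V": 4, "M": 1, "S": 6, "P": 4, "T": 4, "A": 4, "Y": 2,
    "H": 2, "N": 2, "D": 2, "Q": 2, "K": 2, "E": 2, "C": 2, "R": 6, "G": 4, "W": 1,
}
STOP_CODONS = 3  # UAA, UAG, UGA

def count_rna_strings(protein, modulus=1_000_000):
    """Count the RNA strings that translate to protein (plus a stop codon), modulo modulus."""
    total = STOP_CODONS % modulus  # account for the terminating stop codon
    for aa in protein:
        total = (total * CODON_COUNTS[aa]) % modulus  # reduce at each step to keep numbers small
    return total

print(count_rna_strings("MA"))  # 1 * 4 * 3 = 12
```

Here the stop codon is handled inside the function, so the protein string is passed without the appended "*".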

## Open Reading Frames

Problem
Either strand of a DNA double helix can serve as the coding strand for RNA transcription. Hence, a given DNA string implies six total reading frames, or ways in which the same region of DNA can be translated into amino acids: three reading frames result from reading the string itself, whereas three more result from reading its reverse complement.

An open reading frame (ORF) is one that starts with the start codon and ends with a stop codon, with no other stop codons in between. Thus, a candidate protein string is derived by translating an open reading frame into amino acids until a stop codon is reached.

Given: A DNA string s of length at most 1 kbp in FASTA format.

Return: Every distinct candidate protein string that can be translated from ORFs of s. Strings can be returned in any order.

In [325]:
def read_fasta(filename):
    data = {}
    with open(filename) as f:
        lines = f.readlines()
        for l in lines:
            if l[0]==">":
                key = l[1:].replace("\n", "")
                data[key] = ""
            else:
                string = l
                data[key] += string.replace("\n", "")
    return data

In [9]:
def reverse_strand(s):
    return s[::-1]

def complementary_base(base):
    compl_dict = {"a": "T", "c": "G", "g": "C", "t": "A"}
    
    try:
        return compl_dict[base.lower()]
    except KeyError:
        print("Unknown base!")
        return "_"
    
def reverse_complement(s):
    s_pruned = prune_strand(s)
    s_rev = reverse_strand(s_pruned)
    s_compl = [complementary_base(b) for b in s_rev]
    return "".join(s_compl)

In [311]:
def dna_to_rna(base):
    if base.lower() == "t":
        return "U"
    elif base.lower() in ["a", "c", "g"]:
        return base
    else:
        return ""

def transcribe_strand(s):
    t = [dna_to_rna(base) for base in s]
    return "".join(t)


def get_aa(codon, codon_table):
    return codon_table[codon]

def translate_strand(s, codon_table):
    n = len(s)
    assert n%3 == 0
    n_codons = int(n / 3)
    protein = ""
    for i in range(n_codons):
        codon = s[3*i:3*(i+1)]
        
#         print(codon)
        aa = get_aa(codon, codon_table)
#         print(aa)
        protein += aa
        
    return protein

In [141]:
import re

def find_orf(seq, codon_table):
    # Translate each of the three reading frames of seq and print candidate proteins.
    N = len(seq)
    
    for start in range(3):
        print("Shift #{}".format(start+1))
        
        no_codons = (N - start) // 3  # number of complete codons in this frame
        
        out = translate_strand(seq[start : start + 3*no_codons], codon_table)
        print("a.a. sequence: " , out)
        out_clean = _prune_proteins(out)
        print(out_clean)
        print(":::::")
        
def _prune_proteins(aa_seq):
    # 'M', one or more non-stop residues, then a stop
    matches = re.findall('M[a-zA-Z]+[*]', aa_seq)
    # one or more 'M' immediately followed by a stop
    matches2 = re.findall('M+[*]', aa_seq)
    # re.findall returns non-overlapping matches only, so an ORF that starts at a
    # later 'M' inside an earlier match is not reported. Nothing is returned, which
    # is why the caller prints None.
    print(matches)
    print(matches2)

In [132]:
seq = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"

In [133]:
tseq = transcribe_strand(seq)
tseq_rev = transcribe_strand(reverse_complement(seq))

tseq, tseq_rev

('AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG',
 'CUGAGAUGCUACUCGGAUCAUUCAGGCUUAUUCCAAAAGAGACUCUAAUCCAAGUCGCGGGGUCAUCCCCAUGUAACCUGAGUUAGCUACAUGGCU')

In [142]:
find_orf(tseq, codon_table)
print("")
find_orf(tseq_rev, codon_table)
# find_orf(t_s2, codon_table)

Shift #1
a.a. sequence:  SHVANSGYMGMTPRLGLESLLE*A*MIRVASQ
['MGMTPRLGLESLLE*']
[]
None
:::::
Shift #2
a.a. sequence:  AM*LTQVTWG*PRDLD*SLFWNKPE*SE*HL
[]
['M*']
None
:::::
Shift #3
a.a. sequence:  PCS*LRLHGDDPATWIRVSFGISLNDPSSIS
[]
[]
None
:::::

Shift #1
a.a. sequence:  LRCYSDHSGLFQKRL*SKSRGHPHVT*VSYMA
[]
[]
None
:::::
Shift #2
a.a. sequence:  *DATRIIQAYSKRDSNPSRGVIPM*PELATW
[]
['M*']
None
:::::
Shift #3
a.a. sequence:  EMLLGSFRLIPKETLIQVAGSSPCNLS*LHG
['MLLGSFRLIPKETLIQVAGSSPCNLS*']
[]
None
:::::
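
The regex search above reports only non-overlapping matches, so a protein that begins at a later start codon inside another ORF (here MTPRLGLESLLE inside MGMTPRLGLESLLE) does not appear in the output. One way to collect every distinct candidate is to begin a translation at each AUG on both strands and keep it only if a stop codon is reached. The sketch below is self-contained and makes several assumptions: the names `CODON_TABLE`, `reverse_complement_dna`, and `candidate_proteins` are illustrative, and the codon table is built in code rather than read from `codon_table.txt`:

```python
from itertools import product

# Standard genetic code with codons enumerated in U, C, A, G order; '*' marks a stop
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(("".join(c) for c in product("UCAG", repeat=3)), AMINO_ACIDS))
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement_dna(dna):
    return dna.translate(DNA_COMPLEMENT)[::-1]

def candidate_proteins(dna):
    """Return the set of proteins translated from every ORF on both strands."""
    proteins = set()
    for strand in (dna, reverse_complement_dna(dna)):
        rna = strand.replace("T", "U")
        for start in range(len(rna) - 2):
            if rna[start:start + 3] != "AUG":
                continue  # an ORF must begin with the start codon
            protein = []
            for i in range(start, len(rna) - 2, 3):
                aa = CODON_TABLE[rna[i:i + 3]]
                if aa == "*":
                    proteins.add("".join(protein))  # keep only translations that reach a stop
                    break
                protein.append(aa)
    return proteins

seq = "AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG"
for protein in sorted(candidate_proteins(seq)):
    print(protein)
```

Scanning every position for AUG covers all three reading frames of each strand, and the set removes duplicates such as the single-residue protein "M" that occurs on both strands.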


## Enumerating Gene Orders 

Problem
A permutation of length n is an ordering of the positive integers {1,2,…,n}. For example, π=(5,3,2,1,4) is a permutation of length 5.

Given: A positive integer n≤7.

Return: The total number of permutations of length n, followed by a list of all such permutations (in any order).

In [172]:
import math
import itertools

def my_permutations(n):
    x = list(range(1, n+1))
    print("No. permutations:", math.factorial(n))
    print("All permutations:")
    for y in itertools.permutations(x):
        y_ = [str(i) for i in y]
        print(" ".join(y_))

In [174]:
my_permutations(6)

No. permutations: 720
All permutations:
1 2 3 4 5 6
1 2 3 4 6 5
1 2 3 5 4 6
1 2 3 5 6 4
1 2 3 6 4 5
1 2 3 6 5 4
1 2 4 3 5 6
1 2 4 3 6 5
1 2 4 5 3 6
1 2 4 5 6 3
1 2 4 6 3 5
1 2 4 6 5 3
1 2 5 3 4 6
1 2 5 3 6 4
1 2 5 4 3 6
1 2 5 4 6 3
1 2 5 6 3 4
1 2 5 6 4 3
1 2 6 3 4 5
1 2 6 3 5 4
1 2 6 4 3 5
1 2 6 4 5 3
1 2 6 5 3 4
1 2 6 5 4 3
1 3 2 4 5 6
1 3 2 4 6 5
1 3 2 5 4 6
1 3 2 5 6 4
1 3 2 6 4 5
1 3 2 6 5 4
1 3 4 2 5 6
1 3 4 2 6 5
1 3 4 5 2 6
1 3 4 5 6 2
1 3 4 6 2 5
1 3 4 6 5 2
1 3 5 2 4 6
1 3 5 2 6 4
1 3 5 4 2 6
1 3 5 4 6 2
1 3 5 6 2 4
1 3 5 6 4 2
1 3 6 2 4 5
1 3 6 2 5 4
1 3 6 4 2 5
1 3 6 4 5 2
1 3 6 5 2 4
1 3 6 5 4 2
1 4 2 3 5 6
1 4 2 3 6 5
1 4 2 5 3 6
1 4 2 5 6 3
1 4 2 6 3 5
1 4 2 6 5 3
1 4 3 2 5 6
1 4 3 2 6 5
1 4 3 5 2 6
1 4 3 5 6 2
1 4 3 6 2 5
1 4 3 6 5 2
1 4 5 2 3 6
1 4 5 2 6 3
1 4 5 3 2 6
1 4 5 3 6 2
1 4 5 6 2 3
1 4 5 6 3 2
1 4 6 2 3 5
1 4 6 2 5 3
1 4 6 3 2 5
1 4 6 3 5 2
1 4 6 5 2 3
1 4 6 5 3 2
1 5 2 3 4 6
1 5 2 3 6 4
1 5 2 4 3 6
1 5 2 4 6 3
1 5 2 6 3 4
1 5 2 6 4 3
1 5 3 2 4 6
1 5 3 2 6 4


## Calculating Protein Mass

Problem
In a weighted alphabet, every symbol is assigned a positive real number called a weight. A string formed from a weighted alphabet is called a weighted string, and its weight is equal to the sum of the weights of its symbols.

The standard weight assigned to each member of the 20-symbol amino acid alphabet is the monoisotopic mass of the corresponding amino acid.

Given: A protein string P of length at most 1000 aa.

Return: The total weight of P. Consult the monoisotopic mass table.

In [185]:
# Read the monoisotopic mass table

f = open("aa_mass_table.txt", "r")
a = f.readlines()

table = {}
for line in a:
    # drop all whitespace: the first character is the amino acid, the rest is its mass
    parsed_line = line.replace("\n", "").replace(" ", "")
    aa = parsed_line[0]
    mass = float(parsed_line[1:])
    table[aa] = mass
table

{'A': 71.03711,
 'C': 103.00919,
 'D': 115.02694,
 'E': 129.04259,
 'F': 147.06841,
 'G': 57.02146,
 'H': 137.05891,
 'I': 113.08406,
 'K': 128.09496,
 'L': 113.08406,
 'M': 131.04049,
 'N': 114.04293,
 'P': 97.05276,
 'Q': 128.05858,
 'R': 156.10111,
 'S': 87.03203,
 'T': 101.04768,
 'V': 99.06841,
 'W': 186.07931,
 'Y': 163.06333}

In [186]:
def protein_mass(seq, mass_table):
    mass = 0
    
    for aa in seq:
        mass += mass_table[aa]
    return mass

In [188]:
f = open("rosalind_prtm.txt", "r")
seq = f.read().replace("\n", "")
seq

'YFKPKQKSNYQGLFGAKMGEGAWNVVTYTVSQHWFENWACCAQNRKSCIAGTRTCSRDDFQWCWNIWDSLPTCMRVMTFIFFMWTMQNLARVMNAMYLYSSYAFKKSQPHHNTYCSCVMPQQFCPHWGPHMLDACNDPRVVGSEKTWLEGGNTAYQVTCNMHQYHVHPIRRCIINTTIALADTIYFALLSCVWMCLWIEREWMDCCGYEFWMVYAGVSYVQWPACDIIIPYSRNEYRQHRYVKIPSWMKEIRYLQSIHQVLTGKWKMIHLQQGKIYHQCAHCCTSVSFTCRGMPHLKEWERHNYQEPKIYALVLWGYKVQIWYGAFQLGIWAEHIVIKLPMNVWKNPDMVGHVVLRRWLIHAMECIGLEGMFLFPQFYAFCMHNHGMKDWVLGYGVDDPQHMKHRVHEQFHSCICLEFKQVWGAVRICQVQHGYFFAYCRHKSCNDDWSNHCDEHKAMNKVVGHLDVSETPHPPIVDDQGMAISFYLIYLCLGPIFPFERWFPHTSYKDHWEIRHHTEQRERMNDYWKVHCHLSAYRESKQTNKFMQEYVLVHEICCQNTVDEGGMQCWEHWTYHIWFAAIIPSRNDTSSHVSMAKLNTSCKERATVFRIQWLRSKQTCFYYHVFFWWPHCSHTQMGRSCCQKPAGNRCILWYLCNVSDMCAKCVSWPFIGTFYVRQPSPWGACGPLYQATHCILYDEVAFATMTHMHPMHSRNHETDAQACHAINTGCYELDTIYNYPQRSNPGNDFQLHDKKPMNQPDTGAVIEFCASFYRWIILRQTHLCHMMKKYPACGYMPIQNLIIHRMAQHKVPCFHESMMAFPWTWAPPDMYCLIPCQGIDYSCHYRHQSTFPFCHPPELNMGLKTFNKSCKEKRHCQCVHHICMVEWKRDMRPCMECARGNQCYGSYSLVEWDIDKETKHYRFKFHQIDGCACGDDPSNPINAEGENMFVFNIATEIIEWQQLTPEEDISICALHMHIMGKENMFNHADSVYHR'

In [189]:
protein_mass(seq, table)

117673.94449000055

## Locating Restriction Sites

Problem

Figure 2. Palindromic recognition site
A DNA string is a reverse palindrome if it is equal to its reverse complement. For instance, GCATGC is a reverse palindrome because its reverse complement is GCATGC. See Figure 2.

Given: A DNA string of length at most 1 kbp in FASTA format.

Return: The position and length of every reverse palindrome in the string having length between 4 and 12. You may return these pairs in any order.

In [276]:
def find_palindromic(seq, n_min=4, n_max=12):
    N = len(seq)
    for i in range(N):
        for window in range(n_min, n_max+1, 2):
            if i + window <= N:
                _check_palindromic(seq[i: i + window], i, window)
    return

def _check_palindromic(section, i, window):
    section_rev = reverse_complement(section)
    if section == section_rev:
        print(i+1, window)
    return

In [277]:
f = read_fasta("rosalind_revp.txt")
# seq = f.read().replace("\n", "")
# f
for k, v in f.items():
    seq = v
seq
# seq = "TATATA"
# seq = "TCAATGCATGCGGGTCTATATGCAT"

'CGCGCCGATTGCGGTTATAACTGTAAATGTGGTGTTGACCATGGCTACGCCTCGGTGACCGGAGACGGAGTTGCTCTTACGGAACGAGTTGTCCAATAATCTTCAAATAACTCAAGGAATAACTTGGCTAAACGCCATATCCGCCCGGTGCAGGATGGCACTACCCACTAGGGACCCAGGGTCGCCAATGGGTCCGTCTTTTAGCCCAGGCAAGCGCCTCCATAGGACAGCTTCAACCAGTCCGGGCAACCTCGGTACCAGATGGAGTGCCGAAGTGTAGTAAAGAAGAGCGGTTACAGATATACAGGCAAGTAATTCGGAGCGCATGCCTGCGAATTTTCGATCATCCTCTCACTGCAGATACCCACGTAGATGAATGGTCCTCTCTCTACAATAAGCGAATCAACGTGTGACAGATTCGGACACCGGTCGCGCGACGTTCGAAGGCCGCGGCCTATACCTGTAAATGCTACTGAGCAATGAGATGAGTTTGGCGATTGAAGATACCACCACGGGTAAACCGACTTCAAGAGTTGCACGGGCCCAAACATCACGGGCCTTTTCGTAAAGATAATCCCAGTAAGTCTCCTATCCGAAGAGCAGCTGACATGGGCCGAAAGATCGCTAAGTCATGTAGGAACCTGGCACGTTTCAGCAGGGCCCCCACCCGGGTAGAGGTAATCGAAGTGTACGGAAGATGGGAGCGTAACCAGACGCCATCAAGTTATCAAGTCTCGGGTCCTGCGACCTTGAAGTAACGAATTACGGCTGTCCAACCCCAGATTAAGTGTACCCTCCGCGCTACGCTCCCTTGCGGCCAATGCTCTAAGAAGGTAAAAGGAGATAGTTGCAAACGTCATTCTAAAATACAGGGGAAGGGTGTGGATTGGTCATGGCGCG'

In [278]:
find_palindromic(seq, n_min=4, n_max=12)

1 4
2 4
14 8
15 6
16 4
39 6
40 4
59 4
137 4
145 4
149 4
168 4
214 4
229 4
242 4
254 6
255 4
300 4
301 4
314 4
322 4
324 6
325 4
335 4
340 4
342 4
355 6
356 4
367 4
406 4
425 6
426 4
429 10
430 8
431 4
431 6
432 4
433 4
437 4
440 6
441 4
445 12
446 4
446 10
447 8
448 6
449 4
452 4
456 4
535 4
540 6
541 4
556 4
601 6
602 4
608 4
612 4
620 4
631 4
647 4
658 6
659 4
666 8
667 6
668 4
682 4
689 4
761 4
784 4
790 4
798 4
799 4
816 4
848 6
849 4
854 4
892 4
896 4
897 4
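
For reuse in later steps it can be handier to return the hits than to print them. The following self-contained sketch does that; `reverse_palindromes` is a hypothetical name, the short test string is an assumption chosen for illustration, and it checks every length in the range (odd lengths can never match, because the middle base would have to be its own complement):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_palindromes(seq, n_min=4, n_max=12):
    """Return (1-based position, length) for every reverse palindrome in seq."""
    hits = []
    for i in range(len(seq) - n_min + 1):
        for length in range(n_min, n_max + 1):
            window = seq[i:i + length]
            if len(window) < length:
                break  # the window would run past the end of the sequence
            if window == window.translate(COMPLEMENT)[::-1]:
                hits.append((i + 1, length))
    return hits

for position, length in reverse_palindromes("TCAATGCATGCGGGTCTATATGCAT"):
    print(position, length)
```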


## Finding a Motif in DNA 

Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".
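
The problem's convention differs from Python's: positions here start at 1 and s[j:k] includes both ends, while Python indexes from 0 and excludes the end of a slice. A minimal sketch of the conversion, using the example string from the text (`positions_of` and `substring_1based` are hypothetical helper names):

```python
s = "AUGCUUCAGAAAGGUCUUACG"

def positions_of(symbol, s):
    """1-based positions of every occurrence of a single symbol in s."""
    return [i + 1 for i, ch in enumerate(s) if ch == symbol]

def substring_1based(s, j, k):
    """The problem's s[j:k]: 1-based, with both ends included."""
    return s[j - 1:k]

print(positions_of("U", s))       # [2, 5, 6, 15, 17, 18]
print(substring_1based(s, 2, 5))  # UGCU
```

Only the start index needs shifting: the inclusive 1-based end k coincides with Python's exclusive 0-based end.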

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).

Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s.
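
One possible way to collect every location, sketched here with Python's built-in `str.find` rather than a manual window comparison. The search restarts one symbol after each hit so that overlapping occurrences are also reported. `find_locations` is a hypothetical name, and the input strings are a short illustrative example rather than data from this notebook.

```python
def find_locations(s, t):
    """Return the 1-based start positions of every (possibly overlapping) occurrence of t in s."""
    locations = []
    start = s.find(t)
    while start != -1:
        locations.append(start + 1)   # convert 0-based index to 1-based position
        start = s.find(t, start + 1)  # step by one so overlapping matches are found
    return locations

print(find_locations("GATATATGCATATACTT", "ATAT"))  # [2, 4, 10]
```

The occurrences at positions 2 and 4 overlap, which is why the search advances by one symbol instead of by the length of t.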

In [234]:
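def find_substring_location(s, t):
    """Return the 1-based locations of t in s as a space-separated string."""
    Ns = len(s)
    Nt = len(t)
    # range(Ns - Nt + 1) includes the last possible start, so a match at the very end of s is not missed
    # i + 1 converts Python's 0-based index to the 1-based position used in the problem
    locs = [str(i + 1) for i in range(Ns - Nt + 1) if s[i:i + Nt] == t]
    return " ".join(locs)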

In [233]:
# s = "TCGACCCCTCGACCCTTCGACCCTTCGACCCTGGTAACGTTCGACCCTCGACCCGTCGACCCTCGACCCGTCACAGCTGTCGACCCTAGATTTCGACCCGGTCGACCCGGCATTCGACCCGTTCGACCCTGTCGACCCCTCGACCCTCGACCCGCATCTATTATCGACCCGTCGACCCACGTCGACCCTCGACCCTCCTCGACCCACACGTCGACCCGCGACTTCGACCCGTGTTCGACCCTCGACCCCCCTCGACCCTTTGAGCTTTACTTTCGACCCTTTCGACCCTCGACCCAATCGACCCTTTCGACCCCGTCGACCCGCTAAGTTCGACCCCATCGACCCAACGTCGACCCTTCGACCCCTCGACCCTGATCGACCCTCGACCCGTCGACCCGAGGATCGACCCTCGACCCATCGACCCTCGACCCTGTCGACCCGTATCGACCCTCGACCCTTCGACCCCTATAATCGACCCGTGTAGCTCTCGACCCTAGGTATCGACCCTCGACCCGGTCGACCCCAGTCGACCCACGGTTCGACCCCTCGACCCGGGTTCGACCCATCGACCCTTCGACCCGCATTGACGATCGACCCCCTCTCGACCCACTTCGACCCTCGACCCTTTCTCGACCCATCGACCCCTTGTCTTCGACCCCTCGACCCTCGACCCTCGACCCCCATATCGACCCAGGGGAACATCGACCCGATTCGACCCCTTCGACCCCCTCGACCCGGGTTGGTCGACCCATTTCGACCCATCGACCCACTTCGACCCGGTCCGCTCGACCCAATGTCGACCCTCGACCCTTCGAGTCGACCCTCGACCCTCGACCCCAAATCGACCCGTCGACCCCGGTCGACCCTAAGTCGACCCGTCGACCCTCGGGAGTCGAATCGACCCTTCCGTCGACCCTGAATCGACCCCTCGTCGCTCCATCGACCCATCGACCCGGCAATCGACCCACTCGACCCTTTCGACCCTCGACCCTCGACCC"
# t = "TCGACCCTC"

In [244]:
with open("rosalind_subs.txt", "r") as f:
    a = f.readlines()

# The first line of the file is s and the second line is t
s, t = (line.strip() for line in a[:2])


In [245]:
find_substring_location(s, t)


'41 56 140 182 189 235 282 376 403 418 444 501 612 660 667 797 817 824 879 978 985'

## RNA Splicing

After identifying the exons and introns of an RNA string, we only need to delete the introns and concatenate the exons to form a new string ready for translation.

Given: A DNA string s (of length at most 1 kbp) and a collection of substrings of s acting as introns. All strings are given in FASTA format.

Return: A protein string resulting from transcribing and translating the exons of s. (Note: Only one solution will exist for the dataset provided.)
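
The whole pipeline (delete introns, transcribe, translate) can be sketched in a self-contained form. This is a minimal illustration and not the notebook's own implementation: `AMINO_ACIDS`, `CODONS` and `splice_and_translate` are hypothetical names, the codon table is built from the standard genetic code written compactly, each intron is assumed to occur once, and translation stops at the first stop codon instead of emitting `*`. The inputs are the sample strings that also appear in this notebook.

```python
from itertools import product

# Standard genetic code, with codons ordered by U, C, A, G at each position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AMINO_ACIDS)}

def splice_and_translate(dna, introns):
    """Remove each intron once, transcribe to RNA, and translate up to the first stop codon."""
    for intron in introns:
        dna = dna.replace(intron, "", 1)  # drop only the first occurrence
    rna = dna.replace("T", "U")           # transcription of the coding strand
    protein = []
    for i in range(0, len(rna) - 2, 3):   # walk over complete codons only
        aa = CODONS[rna[i:i + 3]]
        if aa == "*":                     # stop codon
            break
        protein.append(aa)
    return "".join(protein)

s = "ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG"
introns = ["ATCGGTCGAA", "ATCGGTCGAGCGTGT"]
print(splice_and_translate(s, introns))  # MVYIADKQHVASREAYGHMFKVCA
```

Using `str.replace` with a count of 1 keeps the sketch short. It removes the leftmost match only, which is enough when every intron occurs exactly once in the sequence.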

In [319]:
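def splice(seq, introns):
    """Remove the first occurrence of each intron from seq."""
    seq_out = seq
    for int_i in introns:
        # find_substring_location returns all 1-based locations as one space-separated string
        locs = find_substring_location(seq_out, int_i).split()
        if not locs:
            continue  # this intron does not occur in the (partially spliced) sequence
        i = int(locs[0]) - 1  # back to a 0-based index

        seq_out = seq_out[0:i] + seq_out[i+len(int_i):]
    return seq_out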

In [322]:
# s = "ATGGTCTACATAGCTGACAAACAGCACGTAGCAATCGGTCGAATCTCGAGAGGCATATGGTCACATGATCGGTCGAGCGTGTTTCAAAGTTTGCGCCTAG"
# int1 = "ATCGGTCGAA"
# int2 = "ATCGGTCGAGCGTGT"

In [323]:
# x = splice(s, [int1])
# x

In [324]:
# translate_strand(transcribe_strand(x), codon_table)

In [326]:
f = read_fasta("rosalind_splc.txt")
f

{'Rosalind_0588': 'ATGCAGGCTCATGAGAACTTAAGTAACATTCCTTTTGGTGGGACTGTTGATCCCTGATCGTGGCATGCAAATATTGAGGGGAGCTGACATGTGCTCAGTATTTTCCCCCGTCGGTAGAACGCCCAGCACCTCGATATAACGCTGCTCATGGCCATCGACGACACCTCCTTACGCGTTTATATGTGAACGCGCGTGCCCTTCCAATCTTTTGAGACGGCGAAATATCTCGATACACGCACGTACCATAGTCGACCGAGACTGGTATTGGTTTGGGACTTATATCGCAGGCTAATAAACTCACTAACGCAACACCCTTACCGTAGGTAAATTTTTGCAAACCGCTAGCACCAGACTAATGGATGCCTGCTGGGATCGTGTTCGGTGTGGCTAGGATCATTGTCCGCTCGCTATTGCTGGCTTGTCGTGTATCATATTGGCAGACGATGGGGGAGGAGTACTTGCCTCGGGTATTTAACTAACATAGCGTACTTCTTCGCCGGCTTTTCTCTCCCGGACTTCTAGCCGAAACCGTCAGAGGCACGTCTCTCGATAGCCCGGGAGATAGGCTGCCCATGCCCGTCTGCAGTGCCTCTGAACCCTTATCCTTTCTCCCGTCGGTTAAGTCTCCAGTCGTTAGGGGTCTCCGTCCAAAGGGGGAGGAGTTTCTGTCCAAAAAGCGGTCCTCTAGCAGTTACCAATACATGCACCCTCCCAGTGGCCCTCCACGTATCACGGAAACTGTGCGTGCCGCCCCTTTATATAACTGCAGATTGTTGGTTACGTTTACGGACTGGTAGCCTGCCATGTCGATGATCCGTCAATGGCTACGGTTCCGATCCCTAACCGTTGGTATTGTGACGCCGGTCGAGGCACAACCCCGAACATTTAGTGGGCACCGCAGAATACTACTCCCGTTTGGCAAATTAAGATTGCCGGGGATAGTGAGGAGAGGTCGTAA',
 'Rosalind_4577': 'G

In [328]:
# The first FASTA record is the full sequence; the remaining records are the introns
seq, *introns = f.values()
seq, introns

('ATGCAGGCTCATGAGAACTTAAGTAACATTCCTTTTGGTGGGACTGTTGATCCCTGATCGTGGCATGCAAATATTGAGGGGAGCTGACATGTGCTCAGTATTTTCCCCCGTCGGTAGAACGCCCAGCACCTCGATATAACGCTGCTCATGGCCATCGACGACACCTCCTTACGCGTTTATATGTGAACGCGCGTGCCCTTCCAATCTTTTGAGACGGCGAAATATCTCGATACACGCACGTACCATAGTCGACCGAGACTGGTATTGGTTTGGGACTTATATCGCAGGCTAATAAACTCACTAACGCAACACCCTTACCGTAGGTAAATTTTTGCAAACCGCTAGCACCAGACTAATGGATGCCTGCTGGGATCGTGTTCGGTGTGGCTAGGATCATTGTCCGCTCGCTATTGCTGGCTTGTCGTGTATCATATTGGCAGACGATGGGGGAGGAGTACTTGCCTCGGGTATTTAACTAACATAGCGTACTTCTTCGCCGGCTTTTCTCTCCCGGACTTCTAGCCGAAACCGTCAGAGGCACGTCTCTCGATAGCCCGGGAGATAGGCTGCCCATGCCCGTCTGCAGTGCCTCTGAACCCTTATCCTTTCTCCCGTCGGTTAAGTCTCCAGTCGTTAGGGGTCTCCGTCCAAAGGGGGAGGAGTTTCTGTCCAAAAAGCGGTCCTCTAGCAGTTACCAATACATGCACCCTCCCAGTGGCCCTCCACGTATCACGGAAACTGTGCGTGCCGCCCCTTTATATAACTGCAGATTGTTGGTTACGTTTACGGACTGGTAGCCTGCCATGTCGATGATCCGTCAATGGCTACGGTTCCGATCCCTAACCGTTGGTATTGTGACGCCGGTCGAGGCACAACCCCGAACATTTAGTGGGCACCGCAGAATACTACTCCCGTTTGGCAAATTAAGATTGCCGGGGATAGTGAGGAGAGGTCGTAA',
 ['GATCCGTCAATGGCTACGGTTC',
  'GTCAGA

In [331]:
seq_spliced = splice(seq, introns)
seq_spliced

'ATGCAGGCTCATGAGAACTTAAGTAACATTGAGCTGACATGTGCTAACGCCCAGCACCTCGATATAACGCTGCTCATGGCCATCGACGACACCTCCTTACCCTTCCAATCTTTTGAGACGGCGAAATATCTCGATACACGCACGTACCATAGTCGAAAACTCACTAACGCAACACCCTTACCGCTAATGGATGCCTGCTGGGATCGTGTTTATCATATTGGCAGAGTACTTCTTCGCCGGCTTTTCTCTCCCGGACTTCTAGCCGAAACCCGTCTCTCGATAGCCCGGGAGATAGGCTGCCCATTATCCTTTCTCCCGTCGGTTAAGTCTCCAGTCGTTAGGGGTCTCCGTCCAAAGGGGAAAAGCGGTCCTCTAGCAGTTACCAATACATGCACCCTCCCAGTGGCCCTCCACGTAATTGTTGGTTACGTTTACGGACTGGTAGCCTGCCATGTCGATCGATCCCTAACCGTTGGTATTGTGACGCCGGTCGAGGCACAACCCCGAACATTTCAAATTAAGATTGCCGGGGATAGTGAGGAGAGGTCGTAA'

In [332]:
translate_strand(transcribe_strand(seq_spliced), codon_table)

'MQAHENLSNIELTCANAQHLDITLLMAIDDTSLPFQSFETAKYLDTRTYHSRKLTNATPLPLMDACWDRVYHIGRVLLRRLFSPGLLAETRLSIAREIGCPLSFLPSVKSPVVRGLRPKGKSGPLAVTNTCTLPVALHVIVGYVYGLVACHVDRSLTVGIVTPVEAQPRTFQIKIAGDSEERS*'

## S