#### Biological Sequences Exercise Answers

-----------------------------------------------------------------------------------------------------------------

**Exercise: Guess the Sequence**

Given the above information, try to figure out what type of biomolecule is represented by each of the following sequences. For each, write down whether you know the answer for sure, or whether it's just likely given the sequence:


**Sequence 1:** MEELVVEVRGSNGAFYKAFVKDVHEDSITVAFENNWQPDRQIPFHDVRFPPPVGYNKDIN

**Sequence 2:** AUGAUCGUACGUCAGCUCGUACGUCGGCGGUGAUGCUAGCGCUACGUGACUACGUAGCUA

**Sequence 3:** ATGACAACACAATTAAATCCCTATTTTGGTGAATTTGGCGGAATGTATGTGCCGGAAATT


**Sequence 4:** AGC

-----------------------------------------------------------------------------------------------------------------
    
**Answers:**
Obviously in any real-world situation it is important to carefully check the source documentation to figure out what your data file is, and the methods used to generate it. That said, it is useful to be able to rapidly form a good working hypothesis about which file has which type of data in it. Here's how I would do so if presented with unlabelled data similar to the above:

**Sequence 1:** This is unambiguously a protein sequence: it contains letters (such as E, L, and F) that are not present in the RNA or DNA alphabets.

**Sequence 2:** This is unambiguously an RNA sequence. Notice that the letters are AUGC. The letter U does not occur in either the DNA or protein alphabets. 

**Sequence 3:** This is composed entirely of A (Adenine), G (Guanine), C (Cytosine), and T (Thymine), the four letters representing DNA nucleotides. We can rule out sequence 3 being an RNA sequence because it contains Ts (thymine) rather than Us (uracil). However, while in practice I would have no problem assuming this was a DNA sequence, this is just a (very good) guess. Strictly speaking, the letters A, G, T, and C are also valid amino acid codes, representing Alanine (A), Glycine (G), Threonine (T), and Cysteine (C). An interesting exercise for readers familiar with statistics (or to return to after reading the chapter on statistics in this text) is to calculate the probability that a protein with equal amino acid frequencies would consist entirely of the letters A, T, C, and G. You will find that this probability is non-trivial for very short sequences (say 1-3 residues), but rapidly becomes vanishingly small for a longer sequence like this one.
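
The probability calculation suggested above can be sketched in a few lines of python. This is only a minimal sketch: it assumes that each position in the protein is drawn independently from the 20 standard amino acids with equal frequency, and the function name `prob_only_shared_letters` is made up for this illustration.

```python
def prob_only_shared_letters(seq_length, n_shared_letters=4, n_amino_acids=20):
    """Probability that a random protein uses only the letters A, C, G and T

    Assumes every amino acid is equally likely at every position.
    """
    #Chance that a single position is one of the 4 shared letters
    per_position = n_shared_letters / n_amino_acids
    #Positions are assumed independent, so multiply the chances together
    return per_position ** seq_length

sequence_3 = "ATGACAACACAATTAAATCCCTATTTTGGTGAATTTGGCGGAATGTATGTGCCGGAAATT"

for length in [1, 3, len(sequence_3)]:
    prob = prob_only_shared_letters(length)
    print(f"Length {length}: probability = {prob:.3g}")
```

Under these assumptions the probability is 0.2 for a single letter, but it shrinks by a factor of 5 with every added position, so for a sequence as long as sequence 3 it is effectively zero.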

**Sequence 4:** Similar to sequence 3, except that in this case the sequence could very possibly be RNA or DNA (since we don't have a T or U that would differentiate the two). It could even plausibly be protein, since it is short enough that there is a reasonable chance of getting 3 amino acid letters that also happen to be used (for different molecules) in the DNA and RNA alphabets.
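
The reasoning used for all four sequences (which alphabets contain every letter in the sequence?) can be sketched in code. This is a rough illustration rather than a robust classifier: it assumes only the standard 4-letter nucleotide alphabets and the 20 standard amino acid letters (no ambiguity codes), and the function name `possible_sequence_types` is hypothetical. It uses functions and for loops, which are introduced in later sections.

```python
#Alphabets assumed for this sketch (standard letters only, no ambiguity codes)
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWY")

def possible_sequence_types(seq):
    """Return the name of every alphabet that contains all letters in seq"""
    letters = set(seq.upper())
    alphabets = {"DNA": DNA_LETTERS, "RNA": RNA_LETTERS, "protein": PROTEIN_LETTERS}
    possible_types = []
    for name, alphabet in alphabets.items():
        #<= on sets asks: is every letter in the sequence also in the alphabet?
        if letters <= alphabet:
            possible_types.append(name)
    return possible_types

#Short example sequences (the first two are prefixes of sequences 2 and 3)
for seq in ["AUGAUCGUACGU", "ATGACAACACAA", "AGC"]:
    print(seq, "could be:", possible_sequence_types(seq))
```

Note that the function only reports which types are *possible*. Deciding which one is most likely still takes the kind of judgment described above, for example about sequence length.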

**Exercise 2**. *Escherichia coli*, or *E. coli*, is a bacterium that is commonly used in laboratory experiments. Imagine that in the laboratory you chemically modified the structure of one of *E. coli*'s proteins, such that one amino acid was replaced by another. Use the Central Dogma of Molecular Biology to predict whether this chemical modification will be passed on to the next generation of *E. coli*, and justify your prediction.


The Central Dogma of Molecular Biology states that information flows from DNA --> RNA --> Protein. Just as erasing the contents of a computer's RAM (e.g. by restarting it) will not delete the information on the hard drive, modifying or destroying an RNA or protein will not retroactively affect the DNA that produced that RNA or protein. Thus, such changes will not be passed to the next generation.

On the other hand, a mutation in *E. coli*'s DNA sequence would be passed on to the next generation. If the mutation was in a protein-coding gene, that DNA mutation would also change any RNA transcripts produced from the gene, and those changes in the RNA could in turn potentially alter the amino acid sequence of proteins produced from that RNA.
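
The one-way flow of information can be illustrated with a toy example in python. This is a heavily simplified sketch, not real bioinformatics code: the names `CODON_TABLE`, `transcribe` and `translate` are made up for this illustration, the codon table contains only 3 of the 64 codons in the genetic code, and the DNA is assumed to be the coding strand so that transcription is just a matter of replacing T with U.

```python
#Deliberately tiny codon table for illustration (the real genetic code has 64 codons)
CODON_TABLE = {"AUG": "M", "GCU": "A", "GAU": "D"}

def transcribe(dna_seq):
    """Convert a coding-strand DNA sequence to RNA by replacing T with U"""
    return dna_seq.replace("T", "U")

def translate(rna_seq):
    """Translate an RNA sequence one codon (3 letters) at a time"""
    protein = ""
    for i in range(0, len(rna_seq) - 2, 3):
        codon = rna_seq[i:i+3]
        protein += CODON_TABLE[codon]
    return protein

original_dna = "ATGGCT"
mutant_dna = "ATGGAT"  #a single C to A mutation in the second codon

for dna in [original_dna, mutant_dna]:
    rna = transcribe(dna)
    protein = translate(rna)
    print(f"DNA {dna} --> RNA {rna} --> protein {protein}")
```

Changing the DNA changes everything downstream of it. Notice that there is no function here that goes from protein back to DNA: editing the `protein` string would leave `original_dna` untouched, just as chemically modifying a protein leaves the gene that encoded it unchanged.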


-----------------------------------------------------------------------------------------------------------------
**Short Exercises: Representing and manipulating biological sequences using python strings**

**1. Write python code to count the number of 'N' (ambiguous) characters in a DNA sequence. Test it with this short sequence**: 

<pre>short_dna_seq = "NNNATGNNCNN"</pre>. Did you get 7 as expected? If so, apply your code to count the N characters in a longer sequence like this one: 

<pre>dna_seq="TACGTAGCTCACGTACGTACGTACGTACGTAGCTACGTAGCTAGCTCGCGGCGCGCCGCGNATCGTACGTACCTACGTACGTCNGTACGTACGTATCGTACGCTGACTACGTACGTACGTACGATCGTACGTAGCTACGTACTCAGNNNNTACGTACGTCACGNTACGTACTGACTGCN"</pre>

**Basic Answer:** We want to take some input DNA sequence, count up the number of N characters in it, and report that number to the user. Right now I can see the input sequence, and could just count the N characters myself (either in a text editor or by hand). However, in most practical applications what we're writing now will be part of a broader project where we won't know the exact sequence we're going to be analyzing while we're writing the code. So to get the most out of the exercise, it's best to write all of your code assuming the DNA sequence could be *any* DNA sequence.


In [2]:
#Define a DNA sequence. 
dna_seq = "NNNATGNNCNN"

#Count the number of N's and save it in a variable
N_count = dna_seq.count("N")

#Format a string to tell the user the report
report_str = f"{N_count} N's were found in the DNA sequence {dna_seq}"

print(report_str)


7 N's were found in the DNA sequence NNNATGNNCNN


**Improved answer**: This answer is basically what I was looking for and should answer the question. There are, however, multiple ways it could be improved if you were using it for research. For example, we counted upper case N's only. Do we want our code to work on both upper case and lower case DNA sequences? If so, we might convert the DNA sequence to upper case before running the rest of the code. Similarly, do we think we might want to reuse the code for other nucleotides in the future? If so, we might consider making the character we count ("N") a variable, so we could also use the code to count G's, T's, A's, etc.

In [3]:
#revised answer

dna_seq = "NNNATGNNCNN"

#nt is a common abbreviation for nucleotide
nt_to_count = "N"
nt_count = dna_seq.count(nt_to_count)
report_str = f"The character {nt_to_count} appears {nt_count} times in the sequence {dna_seq}"
print(report_str)


The character N appears 7 times in the sequence NNNATGNNCNN


Notice that with the revised answer, if we wanted to count up T characters, we'd just have to change nt_to_count to "T" and the rest of the code would work. Moreover the variable names would still make sense. This makes the code more reusable for other problems.  
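
The improved answer also raised the question of lower case sequences, which the revised code does not handle. One possible way to deal with them (a sketch, using a made-up mixed-case test sequence and hypothetical variable names) is to convert both the sequence and the character we are counting to upper case first:

```python
#A made-up test sequence that mixes upper and lower case
dna_seq = "nnnATGnnCNN"
nt_to_count = "n"

#Convert both the sequence and the character to uppercase
#so that "n" and "N" are counted together
uppercase_dna_seq = dna_seq.upper()
uppercase_nt = nt_to_count.upper()

nt_count = uppercase_dna_seq.count(uppercase_nt)
print(f"The character {uppercase_nt} appears {nt_count} times in the sequence {uppercase_dna_seq}")
```

Without the conversion, `dna_seq.count("N")` would find only the two upper case N's at the end of this sequence.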

**Advanced answer (optional)**: this final answer draws on three methods introduced in later sections: defining our own functions, if statements, and using an assert statement to check that our code is getting the expected answer. It's OK to ignore it if this is your first time through and you are new to python. Think of it as where this is headed. 

The final way to increase reusability is to encode our general code (from the revised answer) into a function that takes the sequence and the nucleotide we want to count as arguments, and returns the count. This function can then be called to get the number, which might be used in other calculations or just used to print a message to the screen as before. We might want the user to be able to control conversion to uppercase if they want. Finally we might want to handle whitespace (like "\n" characters indicating linebreaks). 

This code will be much longer than the above, but less likely to give the wrong answer in common situations we might use it in, and easier to reuse multiple times:

In [28]:
def count_nucleotide(seq,nt,convert_to_uppercase=True,
  remove_trailing_whitespace=True, verbose=True):
    """Return the count of a nucleotide in a sequence
    
    parameters
    ----------
    
    seq -- a string representing the sequence
    
    nt -- a nucleotide
    
    convert_to_uppercase -- if True convert the sequence
      and the nucleotide to uppercase before counting
    
    remove_trailing_whitespace -- if True remove leading and
      trailing whitespace and any newline characters from seq
      
    verbose -- if True, print a message describing the count
    """
    if convert_to_uppercase:
        #prevent case-based mismatches by 
        #converting both nt and seq to uppercase
        nt = nt.upper()
        seq = seq.upper()
        
    if remove_trailing_whitespace:
        seq = seq.strip().replace("\n","")

    nt_count = seq.count(nt)
    
    if verbose:
        #Report the sequence that was passed to the function (seq),
        #not an unrelated variable defined elsewhere in the notebook
        print(f"The character {nt} appears {nt_count} times in the sequence {seq}")
    
    return nt_count
    
#Example usage:
DNA_seq = "ATTGC"
nt_count = count_nucleotide(DNA_seq,"T")

The character T appears 2 times in the sequence ATTGC


How do we really know that the function works? The best way to find out is to test it in lots of cases where we know the answer:

In [23]:
#Let's informally test the function

#Test that it works on an RNA sequence
seq = """AUUACGUAGCUACGUACGUAC"""
U_count = count_nucleotide(seq,"U")
expected_U_count = 6
assert (U_count == expected_U_count)

#Test that the function works even if multiple newline (\n) characters are present
seq = """ATACG\n
         NNATCGC\n"""
N_count = count_nucleotide(seq,"n")
expected_N_count = 2 #Not 3!
assert N_count == expected_N_count

The character U appears 6 times in the sequence AUUACGUAGCUACGUACGUAC
The character N appears 2 times in the sequence ATACG         NNATCGC


Believe it or not, even in code as relatively simple as this I -still- caught a bug in the above code using these tests (the print message previously referred to the wrong variable and would always say "N" even if you were counting something else).

If you start testing your code, you will very quickly realize why professional programmers and industry regard test code as essential.
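
One lightweight way to organize checks like these is to list each test sequence next to its expected answer and loop over the list. The sketch below is self-contained, so instead of the full `count_nucleotide` function it uses a simplified stand-in called `simple_count` (a hypothetical name) that only handles case conversion.

```python
def simple_count(seq, nt):
    """Simplified stand-in: count nt in seq, ignoring case"""
    return seq.upper().count(nt.upper())

#Each test case is: (sequence, nucleotide to count, expected count)
test_cases = [
    ("NNNATGNNCNN", "N", 7),
    ("AUUACGUAGCUACGUACGUAC", "U", 6),
    ("attgc", "T", 2),  #lower case sequence
    ("", "A", 0),       #empty sequence
]

for seq, nt, expected in test_cases:
    observed = simple_count(seq, nt)
    #If a test fails, the message says which case failed and why
    assert observed == expected, f"Expected {expected} {nt}'s in {seq!r}, got {observed}"

print(f"All {len(test_cases)} tests passed")
```

Adding a new test is then just a matter of adding one line to `test_cases`, which makes it easy to cover unusual inputs like empty or lower case sequences.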

**2. Write python code to convert the below RNA sequence into uppercase:**

<pre>rna_seq = "augcggguacuacguacgucgcggcgcgcuagcuacggugcuacggggcuagc"</pre>


In [14]:
#Answer
rna_seq = "augcggguacuacguacgucgcggcgcgcuagcuacggugcuacggggcuagc"
uppercase_rna_seq = rna_seq.upper()

**3. Write python code that prints to screen "Your sequence is 17 nucleotides long" for a sequence of 17 nucleotides, and automatically changes the number based on the number of nucleotides in your sequence. (So for example if your sequence was 53 nucleotides long it should print "Your sequence is 53 nucleotides long")**


In [17]:
dna_seq = "ATACGATGCATCG"
dna_seq_length = len(dna_seq)
print(f"Your sequence is {dna_seq_length} nucleotides long")

Your sequence is 13 nucleotides long


**4. Write python code that converts an RNA sequence to a DNA sequence by replacing all characters representing uracil with characters representing thymine.**

In [31]:
RNA_seq = "AUGGCU"
DNA_seq = RNA_seq.replace("U","T")

**5. Aligned DNA sequences often have gap ('-') characters in them. Sometimes you aren't comparing sequences and so want to remove these gaps. Write code that removes gaps from a sequence. HINT: replacing a character with an empty string ('') is equivalent to removing it from a sequence.**



In [33]:
gappy_seq = "TACC-GTAGCTACGTCAGCGC----ACTAGCA-----"
seq = gappy_seq.replace("-","")
print("Sequence with gaps removed:",seq)

Sequence with gaps removed: TACCGTAGCTACGTCAGCGCACTAGCA


**6. Write code to calculate the percentage of a sequence that is gaps. HINT: you might count the number of gaps directly using the count method, or you might use your answer to number 5 to generate an ungapped sequence, and infer the percentage of gaps from the change in sequence length when converting to ungapped.**

In [36]:
gappy_seq = "TACC-GTAGCTACGTCAGCGC----ACTAGCA-----"
n_gaps = gappy_seq.count("-")
seq_length = len(gappy_seq.strip())
#Multiply the fraction of gaps by 100 to express it as a percentage
percent_gaps = (n_gaps/seq_length)*100
percent_gaps_rounded = round(percent_gaps,2)
print(f"The percentage of gaps is: {percent_gaps_rounded}%")

The percentage of gaps is: 27.03%
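
The second approach mentioned in the hint, inferring the number of gaps from the change in length after removing them, might look something like the sketch below. Here the result is expressed as a true percentage (the fraction of gaps multiplied by 100), and the assert statement checks that both approaches agree on the number of gaps.

```python
gappy_seq = "TACC-GTAGCTACGTCAGCGC----ACTAGCA-----"

#Remove the gaps, as in the answer to number 5
ungapped_seq = gappy_seq.replace("-","")

#Every character that disappeared must have been a gap
n_gaps = len(gappy_seq) - len(ungapped_seq)
percent_gaps = (n_gaps / len(gappy_seq)) * 100

#Check that this agrees with counting the gaps directly
assert n_gaps == gappy_seq.count("-")

print(f"The percentage of gaps is: {round(percent_gaps, 2)}%")
```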


-----------------------------------------------------------------------------------------------------------------
**Exercise: calculate the frequency of each nucleotide in an RNA sequence.**

Use the approaches outlined above to write code to calculate the frequency of each nucleotide in an RNA sequence.

Keep these things in mind:
* Be sure that the code can be easily run on new sequences. 
* Use DRY coding methods and a for loop to avoid lots of repeated code
* Be sure to check your code using a sequence where you know the right answer. For example, on the sequence:
  "AAUUGGGG", your code should return frequencies A: 25%, U:25%, G:50%, C:0%.

-----------------------------------------------------------------------------------------------------------------
**Answer**

There are multiple correct answers - just like there are multiple correct ways to write an essay. Here's mine. I used a similar strategy to our code for calculating the composition of DNA, RNA, or protein.  
    
    

In [8]:
#Goal: calculate the composition of an RNA sequence
rna_seq = "AAUUGGGG"
rna_nucleotides = ['A','U','C','G']

#The sequence length is the same for every nucleotide,
#so calculate it once before the loop
seq_length = len(rna_seq)

rna_profile = {}
for nucleotide in rna_nucleotides:
    nucleotide_count = rna_seq.count(nucleotide)
    nucleotide_percent = (nucleotide_count/seq_length)*100
    rna_profile[nucleotide] = nucleotide_percent 
    
print("The percentage of each nucleotide is:",rna_profile)

The percentage of each nucleotide is: {'A': 25.0, 'U': 25.0, 'C': 0.0, 'G': 50.0}


We can further test this code by changing the rna_seq variable to "AAAAAAAAAAA". As expected, the result is:
{'A': 100.0, 'U': 0.0, 'C': 0.0, 'G': 0.0}
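
To make the code easier to run on new sequences, one option is to wrap it in a function and turn the checks above into assert statements. This is one possible sketch rather than the only right answer. The function name `nucleotide_percentages` is hypothetical, and converting to upper case and raising an error for empty sequences are extra design choices that the exercise did not ask for.

```python
def nucleotide_percentages(seq, nucleotides="AUCG"):
    """Return a dict mapping each nucleotide to its percentage of seq"""
    seq = seq.upper()
    seq_length = len(seq)
    if seq_length == 0:
        #Avoid dividing by zero on an empty sequence
        raise ValueError("Cannot calculate percentages for an empty sequence")

    profile = {}
    for nucleotide in nucleotides:
        profile[nucleotide] = (seq.count(nucleotide) / seq_length) * 100
    return profile

#Test using sequences where we know the right answer
assert nucleotide_percentages("AAUUGGGG") == {'A': 25.0, 'U': 25.0, 'C': 0.0, 'G': 50.0}
assert nucleotide_percentages("AAAAAAAAAAA") == {'A': 100.0, 'U': 0.0, 'C': 0.0, 'G': 0.0}

#The function also works on lower case sequences
print(nucleotide_percentages("aauugggg"))
```

Because the nucleotides to count are passed in as a parameter, the same function could be reused for DNA by calling it with `nucleotides="ATCG"`.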