# Week V: Sequence Analysis: Transcription, Translation, Mutation

## Sequence Objects

In the field of Bioinformatics, biological sequences are arguably the primary focus. In this chapter, we will introduce the Biopython Seq object, which serves as a mechanism for handling sequences.

Biological sequences are essentially composed of strings of letters, such as AGTACACTGGT. This format is common and intuitive since it mirrors the way sequences are typically represented in biological file formats.

The key distinction between Seq objects and regular Python strings lies in their available methods. While the Seq object shares many methods with standard strings, it differentiates itself through its `translate()` method, which performs biological translation. Moreover, it offers additional biologically relevant methods like `reverse_complement()`.

### Sequences act like strings

Mostly, we can handle Seq objects just like regular Python strings, which includes tasks like determining their length or iterating through their elements:

In [None]:
from Bio.Seq import Seq
my_seq = Seq("GATCG")
for index, letter in enumerate(my_seq):
    print("%i %s" % (index, letter))

In [None]:
print("The first letter:", my_seq[0]) 
print("The third letter:", my_seq[2])
print("The last letter:", my_seq[-1])

In [None]:
print(len(my_seq))

The Seq object has a .count() method, just like a string.

In [None]:
Seq("AATTAA").count("A")

In [None]:
from Bio.Seq import Seq
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
len(my_seq)

In [None]:
my_seq.count("G")

In [None]:
my_seq.count("ATG")

In [None]:
100 * (my_seq.count("G") + my_seq.count("C")) / len(my_seq)

While you could use the above snippet of code to calculate a GC%, note that the Bio.SeqUtils module has several GC functions already built. For example:

In [None]:
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATGGC")
gc_fraction(my_seq)
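As a rough illustration of what a GC calculation involves, here is a plain-Python sketch. The helper name `gc_fraction_sketch` is hypothetical and this is not Biopython's implementation: it counts G, C and the ambiguity code S (G or C) case-insensitively and divides by the full sequence length, whereas the library function may treat other ambiguous letters differently.

```python
def gc_fraction_sketch(sequence):
    """Fraction of letters that are G, C or S, ignoring case (hypothetical helper)."""
    text = str(sequence).upper()
    if not text:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc_total = sum(text.count(letter) for letter in "GCS")
    return gc_total / len(text)


print(gc_fraction_sketch("GATCGATGGGCCTATATAGGATCGAAAATGGC"))
print(gc_fraction_sketch("gatcSS"))
```

Multiplying the result by 100 gives the GC percentage computed by hand in the earlier cell.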

Locating the first typical start codon, ATG, in a DNA sequence:

In [None]:
my_seq.find("ATG")

Locating the last typical start codon, ATG, in a DNA sequence:

In [None]:
my_seq.rfind("ATG")

Returns -1 if the subsequence is NOT found.
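Because `find` reports only one position at a time and signals "not found" with -1, collecting every occurrence takes a small loop. The sketch below uses a plain Python string, and the helper name `find_all` is hypothetical; the same loop should work on a Seq object, since it supports `find` with a start index.

```python
def find_all(sequence, motif):
    """Return the start index of every (possibly overlapping) occurrence of motif."""
    positions = []
    start = sequence.find(motif)
    while start != -1:  # find() returns -1 once nothing more is found
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


dna = "GATCGATGGGCCTATATAGGATCGAAAATGGC"
print(find_all(dna, "ATG"))
print(find_all(dna, "TTT"))
```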

Return True if the Seq starts or ends with the given prefix, False otherwise.

In [None]:
my_seq.startswith("ATG")

In [None]:
my_seq.startswith(("ATG", "GAT", "GGG"), 0)

In [None]:
my_seq.endswith("GGC")

Note that using the Bio.SeqUtils.gc_fraction() function should automatically cope with mixed case sequences and the ambiguous nucleotide S which means G or C.

Also note that just like a normal Python string, the Seq object is in some ways “read-only”. If you need to edit your sequence, for example simulating a point mutation, look at the section below which talks about the MutableSeq object.

### Slicing a sequence

Let’s get a slice of the sequence:

In [None]:
from Bio.Seq import Seq
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq[4:12]

If you really do just need a plain string, for example to write to a file or insert into a database, then this is very easy to get:

In [None]:
str(my_seq)

The `print` function does this conversion, too:

In [None]:
print(my_seq)

Also like a Python string, you can do slices with a start, stop and stride (the step size, which defaults to one). For example, we can get the first, second and third codon positions of this DNA sequence:

In [None]:
print(my_seq[0::3])

In [None]:
print(my_seq[1::3])

In [None]:
print(my_seq[2::3])
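The stride slices above pick out every third letter. To get whole codons instead, the sequence can be cut into consecutive three-letter slices. This is a minimal sketch on a plain string; `split_codons` is a hypothetical helper name, and any trailing one or two letters that do not form a full codon are dropped.

```python
def split_codons(sequence):
    """Cut a sequence into consecutive three-letter codons, dropping leftovers."""
    usable_length = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable_length, 3)]


print(split_codons("GATCGATGGG"))
```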

Another stride trick you might have seen with a Python string is the use of a -1 stride to reverse the string. You can do this with a Seq object too:

In [None]:
print(my_seq[::-1])

You can also use the Seq object directly with a `%s` placeholder when using the Python string formatting or interpolation operator (`%`):

In [None]:
fasta_format_string = ">Name\n%s\n" % my_seq
print(fasta_format_string)
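FASTA files usually wrap long sequences over several lines. The sketch below extends the formatting idea above with line wrapping. The function name `format_fasta` and the default width of 60 are illustrative assumptions, not part of Biopython.

```python
def format_fasta(name, sequence, width=60):
    """Return a FASTA record with the sequence wrapped at `width` letters per line."""
    text = str(sequence)
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    return ">%s\n%s\n" % (name, "\n".join(lines))


record = format_fasta("Name", "GATCGATGGGCCTATATAGGATCGAAAATCGC", width=10)
print(record)
```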

### Concatenating or adding sequences

Two Seq objects can be concatenated by adding them:

In [None]:
from Bio.Seq import Seq
seq1 = Seq("ACGT")
seq2 = Seq("AACCGG")
print(seq1 + seq2)

In [None]:
seq = Seq('ATG') * 2
print(seq)
seq *= 2
print(seq)

Biopython does not check the sequence contents and will not raise an exception if for example you concatenate a protein sequence and a DNA sequence (which is likely a mistake):

In [None]:
from Bio.Seq import Seq
protein_seq = Seq("EVRNAK")
dna_seq = Seq("ACGT")
print(protein_seq + dna_seq)

You may often have many sequences to add together, which can be done with a for loop like this:

In [None]:
from Bio.Seq import Seq
list_of_seqs = [Seq("ACGT"), Seq("AACC"), Seq("GGTT")]
concatenated = Seq("")
for s in list_of_seqs:
    concatenated += s

print(concatenated)

Or, a more elegant approach is to use the built-in `sum` function with its optional start value argument (which otherwise defaults to zero):

In [None]:
print(sum(list_of_seqs, Seq("")))

Like Python strings, Biopython Seq also has a .join method:

In [None]:
from Bio.Seq import Seq
contigs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
spacer = Seq("N" * 10)
print(spacer.join(contigs))

### Changing case
Python strings have very useful `upper` and `lower` methods for changing the case, and Seq objects provide the same methods.

In [None]:
from Bio.Seq import Seq
dna_seq = Seq("acgtACGT")
print(dna_seq)
print(dna_seq.upper())
print(dna_seq.lower())

These are useful for doing case insensitive matching:

In [None]:
"GTAC" in dna_seq

In [None]:
"GTAC" in dna_seq.upper()

## Nucleotide sequences and (reverse) complements

For nucleotide sequences, you can easily obtain the **complement** or **reverse complement** of a Seq object using its built-in methods:

In [None]:
from Bio.Seq import Seq
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print("-Original sequence-")
print(my_seq)
print(my_seq.complement())
print("-Complementary sequence-")

In [None]:
print("Original sequence:")
print(my_seq)
print("Reversed sequence:")
print(my_seq[::-1])

In [None]:
print("Original sequence:")
print(my_seq)
print("-Reversed sequence-")
print(my_seq[::-1])
print(my_seq.reverse_complement())
print("-Reverse Complementary sequence-")

Biopython does not check that a sequence really is a nucleotide sequence. If you accidentally take the (reverse) complement of a protein sequence, no exception is raised; the letters are simply treated as nucleotide codes and the result is biologically meaningless:

In [None]:
from Bio.Seq import Seq
protein_seq = Seq("EVRNAK")
protein_seq.complement()
print("Proteins do not have complements!")
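The complement and reverse complement operations used in this section can be sketched in plain Python with a translation table. The function names are hypothetical, and only the unambiguous letters A, C, G and T (in either case) are handled here; Biopython's own methods also deal with ambiguity codes.

```python
# Map each base to its Watson-Crick partner, preserving case.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement_sketch(dna):
    """Complement each base, keeping the original left-to-right order."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement_sketch(dna):
    """Complement each base and reverse, giving the opposite strand read 5' to 3'."""
    return complement_sketch(dna)[::-1]


print(complement_sketch("GATCG"))
print(reverse_complement_sketch("GATCG"))
```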

## Transcription

Consider the following (made up) stretch of double stranded DNA which encodes a short peptide:

Before Transcription |
-----:|
DNA coding strand (Crick strand, strand +1)	|
5'-ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG-3'	|
3'-TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC-5'	|
DNA template strand (Watson strand, strand -1)	|
**After Transcription**	|
Single stranded messenger RNA	|
5'-AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG-3'	|
 		 
The actual biological transcription process works from the template strand (-1), doing a complement (TAC -> AUG) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand (+1) because this means we can get the mRNA sequence just by switching T -> U.

Now let’s actually get down to doing a transcription in Biopython. First, let’s create Seq objects for the coding and template DNA strands:

In [None]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print("Coding strand:")
print("5'-", coding_dna, "-3'")
print("Template strand:")
template_dna = coding_dna.reverse_complement()
print("5'-",template_dna,"-3'")

In [None]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print("Coding strand:")
print("5'-", coding_dna, "-3'")
print("Wrong Template strand:")
complement_dna = coding_dna.complement()
print("5'-",complement_dna,"-3'")

In [None]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(" DNA: 5'-", coding_dna, "-3'")
messenger_rna = coding_dna.transcribe()
print("mRNA: 5'-", messenger_rna, "-3'")

As you can see, all this does is replace T with U.

In [None]:
# You can also directly change DNA to RNA
messenger_rna = coding_dna.replace('T','U')
print(messenger_rna)

If you do want to do a true biological transcription starting with the template strand, then this becomes a two-step process:

In [None]:
print("Template strand:")
print("5'-", template_dna, "-3'")
print("mRNA:")
print("5'-", template_dna.reverse_complement().transcribe(), "-3'")
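The same two-step idea can be written out in plain Python to make each step visible: first recover the coding strand from the template strand, then replace T with U. The function name `transcribe_from_template` is hypothetical, and the input is assumed to be the template strand written 5' to 3' using only A, C, G and T.

```python
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def transcribe_from_template(template_5_to_3):
    """Return the mRNA (5' to 3') for a template strand given 5' to 3'."""
    # Step 1: the reverse complement of the template is the coding strand.
    coding = template_5_to_3.translate(COMPLEMENT_TABLE)[::-1]
    # Step 2: the mRNA has the coding strand's sequence with U in place of T.
    return coding.replace("T", "U")


print(transcribe_from_template("GGCCAT"))
```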

The Seq object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U → T substitution:

In [None]:
print(messenger_rna.back_transcribe())

## Translation

In the first script, we translate a DNA sequence into a protein ourselves, using a dictionary that maps each RNA codon to its one-letter amino acid code and a small function that walks through the sequence codon by codon.

In [10]:
# Here is the genetic code of the amino acids defined as dictionaries
standard_genetic_code = {'UUU':'F', 'UUC':'F', 'UCU':'S', 'UCC':'S',
                        'UAU':'Y', 'UAC':'Y', 'UGU':'C', 'UGC':'C',
                        'UUA':'L', 'UCA':'S', 'UAA':'*', 'UGA':'*',
                        'UUG':'L', 'UCG':'S', 'UAG':'*', 'UGG':'W',
                        'CUU':'L', 'CUC':'L', 'CCU':'P', 'CCC':'P',
                        'CAU':'H', 'CAC':'H', 'CGU':'R', 'CGC':'R',
                        'CUA':'L', 'CUG':'L', 'CCA':'P', 'CCG':'P',
                        'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGG':'R',
                        'AUU':'I', 'AUC':'I', 'ACU':'T', 'ACC':'T',
                        'AAU':'N', 'AAC':'N', 'AGU':'S', 'AGC':'S',
                        'AUA':'I', 'ACA':'T', 'AAA':'K', 'AGA':'R',
                        'AUG':'M', 'ACG':'T', 'AAG':'K', 'AGG':'R',
                        'GUU':'V', 'GUC':'V', 'GCU':'A', 'GCC':'A',
                        'GAU':'D', 'GAC':'D', 'GGU':'G', 'GGC':'G',
                        'GUA':'V', 'GUG':'V', 'GCA':'A', 'GCG':'A',
                        'GAA':'E', 'GAG':'E', 'GGA':'G', 'GGG':'G'
                        }

def proteinTranslation(seq, geneticCode):
    """ This function translates a nucleic acid sequence into a
    protein sequence, until the end or until it comes across
    a stop codon """
    # Changes all the T into U, DNA to RNA
    seq = seq.replace('T','U') # Make sure we have RNA sequence
    proteinSeq = [] # Initializing the proteinSeq list to store the output
    
    i = 0
    while i+2 < len(seq):
        # Get codons of three letters
        codon = seq[i:i+3] 
        # Look up the matching amino acid (None for an unrecognised codon)
        aminoAcid = geneticCode.get(codon)
        # Stop looping at a stop codon ('*') or an unrecognised codon
        if aminoAcid is None or aminoAcid == '*':
            break
        # Otherwise add that amino acid to the proteinSeq list
        proteinSeq.append(aminoAcid)
        i += 3 
    
    return ''.join(proteinSeq)

In [None]:
# DNA sequence
dnaSeq = 'ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG'

print(proteinTranslation(dnaSeq, standard_genetic_code))

Sticking with the same example discussed in the transcription section above, now let’s translate this mRNA into the corresponding protein sequence - again taking advantage of one of the Seq object’s biological methods:

In [None]:
print(messenger_rna.translate())

You can also translate directly from the coding strand DNA sequence:

In [None]:
print(coding_dna.translate())

You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).

By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:

In [None]:
print(coding_dna.translate(table="Vertebrate Mitochondrial"))

Now, you may want to translate the nucleotides up to the first in frame stop codon, and then stop (as happens in nature):

In [None]:
print(coding_dna.translate())
print(coding_dna.translate(to_stop=True))

In [None]:
print(coding_dna.translate(table="Vertebrate Mitochondrial", to_stop=True))
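To make the effect of `to_stop` concrete, here is a plain-Python sketch of translation with the standard genetic code. It is an illustration, not Biopython's implementation: the names `STANDARD_TABLE` and `translate_sketch` are hypothetical, only unambiguous codons are supported, and other genetic codes are not handled.

```python
from itertools import product

# Standard genetic code (NCBI table 1), written in the usual T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(sequence, to_stop=False):
    """Translate DNA or RNA codon by codon; optionally halt at the first stop."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = STANDARD_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break  # the stop symbol itself is not included in the output
        protein.append(amino_acid)
    return "".join(protein)


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate_sketch(coding))
print(translate_sketch(coding, to_stop=True))
```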

Now, suppose you have a complete coding sequence (CDS), which is to say a nucleotide sequence (e.g. mRNA after any splicing) which is a whole number of codons (i.e. the length is a multiple of three), commences with a start codon, ends with a stop codon, and has no internal in-frame stop codons. In general, given a complete CDS, the default translate method will do what you want (perhaps with the to_stop option). However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria, for example the gene yaaX in E. coli K12:

In [None]:
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" +
           "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" +
           "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" +
           "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" +
           "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA")
print(gene.translate(table="Bacterial"))

In the bacterial genetic code GTG is a valid start codon, and while it does normally encode Valine, if used as a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a complete CDS:

In [None]:
print(gene.translate(table="Bacterial", cds=True))

In addition to telling Biopython to translate an alternative start codon as methionine, using this option also makes sure your sequence really is a valid CDS (you’ll get an exception if not).
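The conditions that define a complete CDS can be spelled out as explicit checks. The sketch below is only an approximation of that validation: the function name `cds_problems` is hypothetical, the start codon set is an illustrative assumption rather than the full list from any NCBI table, and the stop codons are those of the standard code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: illustrative subset only


def cds_problems(dna):
    """Return a list of reasons why dna is not a complete CDS (empty if none)."""
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("does not begin with a start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("contains an internal in-frame stop codon")
    return problems


print(cds_problems("GTGAAATAA"))
print(cds_problems("ATGTAAAAATAG"))
```

Returning a list of problems rather than raising on the first one makes it easier to see everything that is wrong with a candidate sequence at once.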

### Translation Tables

Seq object translation methods internally use codon table objects derived from the NCBI information at ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt, also shown on http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi in a much more readable layout.

Let’s just focus on two choices: the Standard translation table, and the Bacterial translation table.

In [6]:
from Bio.Data import CodonTable
standard_codon_table = CodonTable.unambiguous_dna_by_name["Standard"]
bacterial_codon_table = CodonTable.unambiguous_dna_by_name["Bacterial"]

You can compare the actual tables visually by printing them:

In [7]:
print(standard_codon_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [None]:
print(bacterial_codon_table)

In [None]:
print(standard_codon_table.stop_codons)
print(standard_codon_table.start_codons)
print(standard_codon_table.forward_table["ACG"])
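A forward table is essentially a dictionary from codon to amino acid, so it can be inverted to list the synonymous codons for each amino acid. The sketch below builds its own copy of the standard code in plain Python. The names `codon_to_amino_acid` and `codons_for` are hypothetical, and unlike a Biopython codon table this dictionary also contains the stop codons, marked `*`.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI table 1) in T, C, A, G order; '*' marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_to_amino_acid = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the mapping: amino acid -> list of codons that encode it.
codons_for = defaultdict(list)
for codon, amino_acid in codon_to_amino_acid.items():
    codons_for[amino_acid].append(codon)

print(codon_to_amino_acid["ACG"])
print(codons_for["T"])
print(codons_for["*"])
```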

## Comparing Seq objects
Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two sequences are equal. The basic problem is that the meaning of the letters in a sequence is context dependent: the letter “A” could be part of a DNA, RNA or protein sequence. Current Biopython Seq objects do not record the molecule type, so comparing two Seq objects with `==` looks only at the letters, just as when comparing Python strings.

In [98]:
seq1 = Seq("ACGT")
seq2 = Seq("ACGT")

In [None]:
print(seq1 == seq2)

In [None]:
print(id(seq1) == id(seq2))

In [None]:
print(id(seq1))
print(id(seq2))

In [None]:
print(str(seq1) == str(seq2))

## MutableSeq Objects
Just like the normal Python string, the Seq object is “read only”, or in Python terminology, immutable. Apart from wanting the Seq object to act like a string, this is also a useful default since in many biological applications you want to ensure you are not changing your sequence data:

In [None]:
my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
try:
    my_seq[0] = "A"
except Exception as e:
    print(e)

However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it.

In [None]:
from Bio.Seq import MutableSeq
mutable_seq = MutableSeq(my_seq)
mutable_seq

Alternatively, you can create a MutableSeq object directly from a string:

In [111]:
from Bio.Seq import MutableSeq
mutable_seq = MutableSeq("GCCATTGTAATG")

Either way will give you a sequence object which can be changed:

In [None]:
mutable_seq[0] = "C"
print(mutable_seq)
mutable_seq.remove("A")
print(mutable_seq)
mutable_seq.reverse()
print(mutable_seq)

In [None]:
# Add a subsequence to the mutable sequence object:
mutable_seq.append('A')
print(mutable_seq)

# Add a subsequence to the mutable sequence object at a given index:
mutable_seq.insert(8,'G')
print(mutable_seq)

# Remove and return the letter at a given index;
# with no argument, pop() removes the last letter.
mutable_seq.pop()
print(mutable_seq)
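The same kind of in-place editing is what a simulated point mutation needs. As a minimal sketch on a plain Python string (which, like Seq, is immutable), the letters can be copied into a list, one position changed, and the result joined back together. The helper name `point_mutation` and the chosen position are illustrative assumptions.

```python
def point_mutation(sequence, position, new_base):
    """Return a copy of sequence with the letter at position replaced by new_base."""
    letters = list(sequence)  # a list is mutable, unlike a string
    letters[position] = new_base
    return "".join(letters)


original = "GCCATTGTAATG"
mutant = point_mutation(original, 4, "A")
# Positions at which the two sequences differ.
differences = [i for i, (a, b) in enumerate(zip(original, mutant)) if a != b]
print(original)
print(mutant)
print(differences)
```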

Once you have finished editing your MutableSeq object, it’s easy to get back to a read-only Seq object should you need to:

In [None]:
from Bio.Seq import Seq
new_seq = Seq(mutable_seq)
new_seq

## Working with strings directly
To close this chapter, for those of you who really don’t want to use the sequence objects (or who prefer a functional programming style to an object-oriented one), there are module-level functions in Bio.Seq that accept plain Python strings, Seq objects or MutableSeq objects:

In [None]:
from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
print(reverse_complement(my_string))
print(transcribe(my_string))
print(back_transcribe(my_string))
print(translate(my_string))

## Homework

Prepare a Jupyter notebook to achieve the goals below and upload it to the 'Homeworks/Week_05' folder under the Google Drive directory of the course.

* Translate the TP53 mRNA CDS to protein and save it to a FASTA file with a description
* Read that FASTA file
* Then calculate the frequency of each amino acid, create a table, save it to a TSV/CSV file and draw a histogram