# 5. Sequence Objects
Content:
- 5.1 Sequences and Alphabets
- 5.2 Sequences act like strings
- 5.3 Working with sequences
- 5.4 Complements and reverse complements
- 5.5 Transcription
- 5.6 Translation

---


Biological sequences are arguably the central object in bioinformatics. In this chapter we will discover how Biopython makes working with them fairly easy. We will learn how to:
- Work with the Seq object and its most frequently used methods and arguments
- Use the built-in transcription and translation methods
- Compare different sequences with each other


## 5.1 Sequences and Alphabets
We start by importing the Seq object with the following line. This should always be the start of your script (or in this case Notebook). 

In [5]:
from Bio.Seq import Seq

Imagine that you have a DNA sequence that you want to investigate further. For this introduction, let's assume that it consists of a handful of bases, which we pass as an argument to the Seq object. The new variable `my_seq` contains these letters, but (Bio)python now recognizes it as a sequence rather than a plain string.

In [None]:
my_seq = Seq("AGTACACTGG")
my_seq

By default, Seq will associate a generic alphabet with this sequence. You can verify the alphabet by inspecting the `.alphabet` attribute.

In [None]:
my_seq.alphabet

Specifying the alphabet prevents possible misinterpretation in your scripts, e.g. a sequence being interpreted as DNA when it is supposed to be a protein.

We can specify that we're working with DNA by passing the alphabet as a second argument to the Seq object. In the example below we assign a DNA sequence to one variable and a protein sequence to another.

In [None]:
# importing the IUPAC alphabets
from Bio.Alphabet import IUPAC
my_seq = Seq("AGTACACTGG", IUPAC.unambiguous_dna)
my_prot = Seq("AGTACACTGG", IUPAC.protein)

We've learned how to create sequences explicitly using Biopython. This has many advantages which we will start exploiting now. 

## 5.2 Sequences act like strings

Once created, sequence objects can be used much like normal Python strings, for example to retrieve the length of the sequence or to iterate over its characters.

In [None]:
# Creating a new DNA sequence object
my_seq = Seq("GATTACCGATACGAGACCTATACATGATCG", IUPAC.unambiguous_dna)

# Find the length of the sequence
print(len(my_seq))

# Iterating over the elements of the sequence
for index, letter in enumerate(my_seq):
    print("{} {}".format(index, letter))

In [None]:
# Get element at position 0
my_seq[0]

The Seq object has a `.count()` method, just like a string, which gives a non-overlapping count:

In [None]:
# Find how many times "GAT" appears in the sequence
my_seq.count("GAT")

In [None]:
# Find how many A's there are in the sequence
my_seq.count("A")
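To make "non-overlapping" concrete, here is a minimal sketch using a plain Python string, whose `.count()` behaves the same way in this respect. The helper `count_overlapping` is a hypothetical name introduced only for this illustration; it is not part of Biopython.

```python
def count_overlapping(sequence, motif):
    """Count occurrences of motif in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Move one position forward so overlapping hits are found too
        start = sequence.find(motif, start + 1)
    return count


sequence = "AAAA"
non_overlapping = sequence.count("AA")           # built-in, non-overlapping
overlapping = count_overlapping(sequence, "AA")  # hypothetical helper

print(non_overlapping, overlapping)
```

The built-in count skips past each match before searching again, which is why it reports fewer hits than the overlapping search on a repetitive sequence.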

---

### 5.2.1 Exercise
Calculate the GC-content in the following sequence:
```
GATTACCACTCACTGACTCACTGACACGAGACCTATACATGATCGCCGGATGATACGAGAATTACTGACGACTAATCCCGGATACTGCATACACTGACGACGACT
```
- Use the `.count()` method as shown above
- Search through Bio.SeqUtils for a function that might help you

In [None]:
ex_seq = Seq("GATTACCACTCACTGACTCACTGACACGAGACCTATACATGATCGCCGGATGATACGAGAATTACTGACGACTAATCCCGGATACTGCATACACTGACGACGACT", IUPAC.unambiguous_dna)

In [None]:
# GC content
100*float(ex_seq.count("G")+ex_seq.count("C"))/len(ex_seq)

In [None]:
# Use the built-in method of Bio.SeqUtils
from Bio.SeqUtils import GC
GC(ex_seq)

# the Bio.SeqUtils.GC() function should automatically cope with mixed case 
# sequences and the ambiguous nucleotide S which means G or C.
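For comparison, the same calculation can be sketched in plain Python without Biopython. The function name `gc_percent` is an assumption made for this example. It only mimics the two behaviours mentioned in the comment above (mixed case and the ambiguous nucleotide S) and is not claimed to match `Bio.SeqUtils.GC` in every other detail.

```python
def gc_percent(sequence):
    """Return the GC percentage of a nucleotide string (0.0 for empty input)."""
    upper = sequence.upper()  # cope with mixed-case input
    if not upper:
        return 0.0
    # Count G, C and the ambiguity code S (G or C)
    gc_count = sum(upper.count(base) for base in "GCS")
    return 100.0 * gc_count / len(upper)


print(gc_percent("ACGT"))
print(gc_percent("acgtSS"))
```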

---

Can you change the characters in a sequence? 

In [None]:
my_seq[2]

In [None]:
#my_seq[2] = 'A'

Just like the normal Python string, the Seq object is *read only* (immutable), as in many biological applications you want to ensure you are not changing your sequence data. 

If you need to edit your sequence, for example simulating a point mutation, you'll need the MutableSeq object (see below). 
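The read-only behaviour can be reproduced with a plain Python string, which is immutable in the same sense. The sketch below is not the MutableSeq approach; it only shows that item assignment fails and that a "point mutation" on an immutable sequence means building a new object. The variable names are chosen for this example.

```python
sequence = "GATTACCGATACGAGACCTATACATGATCG"

# Item assignment is rejected for immutable strings
try:
    sequence[2] = "A"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# A point mutation therefore creates a new string from slices
position = 2
mutated = sequence[:position] + "A" + sequence[position + 1:]

print(assignment_failed)
print(sequence[:6], "->", mutated[:6])
```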

## 5.3 Working with sequences

**1. Slicing**

In [None]:
# Slicing in its most basic form
my_seq[2:6]

Slicing sequences follows the normal conventions for Python strings. 
- Indexing starts at 0, so the first element of the sequence has index 0 (which is normal for computer science, but not so normal for biology).
- The start index is included (i.e. position 2 in this case).
- The end index is excluded (i.e. position 6 in this case).

```
GATTACCGATACGAGACCTATACATGATCG
  TTAC
```

This is the way things work in Python, but of course not necessarily the way everyone in the world would expect. 

Also important to note is that slicing returns a new Seq object that keeps the alphabet of the original, rather than a plain string.

In [None]:
# Slicing every third starting from 0
my_seq[0::3]

In [None]:
# Slicing every third starting from 1
my_seq[1::3]

In [None]:
# Slicing every third starting from 2
my_seq[2::3]

The three cells above extract the first, second and third codon positions of the sequence, respectively. The cell below displays a trick to reverse the sequence.

In [None]:
my_seq[::-1]
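The stride-based slices above can be cross-checked in plain Python. This sketch, which uses an ordinary string instead of a Seq object, splits the sequence into triplets and confirms that slicing with a step of 3 picks out one codon position at a time. All variable names are assumptions made for this example.

```python
sequence = "GATTACCGATACGAGACCTATACATGATCG"

# Split the sequence into consecutive triplets (codons)
codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

# Collect the first, second and third letter of every codon
first = "".join(codon[0] for codon in codons)
second = "".join(codon[1] for codon in codons)
third = "".join(codon[2] for codon in codons)

# Each of these matches the corresponding stride-3 slice
print(first == sequence[0::3], second == sequence[1::3], third == sequence[2::3])

# A step of -1 walks the sequence backwards
print(sequence[::-1] == "".join(reversed(sequence)))
```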

**2. Turning Seq objects into strings**


In [None]:
str(my_seq)

In [None]:
print(my_seq)

Besides converting the sequence to a string, we can use string formatting to construct a simple FASTA format record.

In [None]:
# Python 3.6 and later (f-string)
fasta_format_string = f">Random sequence\n{my_seq}\n"
# Older Python versions (str.format)
# fasta_format_string = ">Random sequence\n{}\n".format(my_seq)

print(fasta_format_string)
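For longer sequences, FASTA files usually wrap the sequence over several lines. The sketch below is a hypothetical helper (`to_fasta` is not a Biopython function) that builds such a record from plain strings. The default width of 60 characters is an assumption based on common practice, not a requirement.

```python
def to_fasta(name, sequence, width=60):
    """Build a FASTA record with the sequence wrapped at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">{}\n{}\n".format(name, "\n".join(lines))


record = to_fasta("Random sequence", "GATTACCGATACGAGACCTATACATGATCG", width=10)
print(record)
```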

**3. Concatenating or adding sequences**

Any two Seq objects can be added together, just like Python strings, to concatenate them. However, you can't add sequences with incompatible alphabets.

In [17]:
# This will work
protein_seq1 = Seq("EVRNAK", IUPAC.protein)
protein_seq2 = Seq("AGGATC", IUPAC.protein)
protein_seq1 + protein_seq2

Seq('EVRNAKAGGATC', IUPACProtein())

In [None]:
# This won't work
protein_seq = Seq("EVRNAK", IUPAC.protein)
dna_seq = Seq("AGGATC", IUPAC.unambiguous_dna)
#protein_seq + dna_seq

If you really wanted to do this, you'd have to give both sequences a generic alphabet. 

In [None]:
from Bio.Alphabet import generic_alphabet
protein_seq.alphabet = generic_alphabet
dna_seq.alphabet = generic_alphabet
protein_seq + dna_seq

Other combinations are also possible, but they need to be handled carefully. Adding a sequence with a generic nucleotide alphabet to one with an IUPAC unambiguous DNA alphabet results in a sequence with a generic NucleotideAlphabet.

In [7]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.Alphabet import generic_nucleotide
nuc_seq = Seq("GATCGATG", generic_nucleotide)
dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
print(nuc_seq)
print(dna_seq)
nuc_seq + dna_seq


GATCGATG
ACGT


Seq('GATCGATGACGT', NucleotideAlphabet())

--- 
### 5.3.1 Exercise
Can you concatenate the following sequences using a for-loop?
- Seq("ACGT", generic_dna)
- Seq("GCTA", generic_dna)
- Seq("TACG", generic_dna)

In [4]:
# Method 1
from Bio.Alphabet import generic_dna
list_of_seqs = [Seq("ACGT", generic_dna), Seq("GCTA", generic_dna), Seq("TACG", generic_dna)]
concatenated = Seq("", generic_dna)
for s in list_of_seqs:
    concatenated += s

print(list_of_seqs)
concatenated

[Seq('ACGT', DNAAlphabet()), Seq('GCTA', DNAAlphabet()), Seq('TACG', DNAAlphabet())]


Seq('ACGTGCTATACG', DNAAlphabet())

In [5]:
# Method 2
list_of_seqs = [Seq("ACGT", generic_dna), Seq("GCTA", generic_dna),Seq("TACG", generic_dna)]
sum(list_of_seqs, Seq("",generic_dna))

Seq('ACGTGCTATACG', DNAAlphabet())

---

**4. Changing case**

Just like Python strings, the Seq object has `.upper()` and `.lower()` methods, which are useful for case-insensitive matching:

In [6]:
dna_seq = Seq("acgtACGT", generic_dna)
dna_seq

Seq('acgtACGT', DNAAlphabet())

In [7]:
dna_seq.upper()

Seq('ACGTACGT', DNAAlphabet())

In [8]:
dna_seq.lower()

Seq('acgtacgt', DNAAlphabet())

In [9]:
# Case insensitive matching
"GTAC" in dna_seq

False

In [10]:
"GTAC" in dna_seq.upper()

True

In [11]:
# IUPAC alphabets are for upper case sequences only
dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
dna_seq

Seq('ACGT', IUPACUnambiguousDNA())

In [12]:
dna_seq.lower()

Seq('acgt', DNAAlphabet())

## 5.4 Complements and reverse complements

For nucleotide sequences, you can easily obtain the complement or reverse complement of a `Seq`-object using its built-in methods:

In [13]:
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
my_seq.complement()

Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())

In [14]:
my_seq.reverse_complement()

Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

In [15]:
# to simply reverse a Seq object (or a Python string), slice it with a step of -1
my_seq[::-1]

Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())

The alphabet is always maintained. This is very useful in case you accidentally try to do something odd, like taking the (reverse) complement of a protein sequence:

In [19]:
protein_seq1 = Seq("EVRNAK", IUPAC.protein)
#protein_seq1.reverse_complement()
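What the two methods compute can be sketched in plain Python with a translation table. This is an illustration of the idea for unambiguous DNA only, not Biopython's implementation, and the function names are assumptions made for this example.

```python
# Map every base to its Watson-Crick partner (unambiguous DNA, both cases)
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complement, read in the same direction as the input."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(dna))
print(reverse_complement(dna))
```

Applied to the sequence used above, the two functions reproduce the outputs of `my_seq.complement()` and `my_seq.reverse_complement()`.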

## 5.5 Transcription

Transcription terminology is easily confused: coding and non-coding, sense and antisense, complements and reverse complements. Consider the following stretch of double stranded DNA which encodes a short peptide:

![transcription](img/3_8_transcription.png)

The actual biological transcription process works from the template strand, doing a reverse complement (TCAG --> CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T --> U.

Now let's actually get down to doing a transcription in Biopython. First, let's create Seq objects for the coding and template DNA strands:

In [14]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
print(f"Original DNA (gene) sequence is:    {coding_dna}")
print(f"Complement DNA sequence is:         {coding_dna.complement()}")
print(f"Reverse complement DNA sequence is: {coding_dna.reverse_complement()}")

Original DNA (gene) sequence is:    ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
Complement DNA sequence is:         TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC
Reverse complement DNA sequence is: CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT


In [12]:
# Template DNA is reverse complement of coding DNA strand (in 5' to 3' direction)
template_dna = coding_dna.reverse_complement()
template_dna

Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())

These should match the figure above. Remember that by convention nucleotide sequences are normally read in the 5' to 3' direction, while in the figure the template strand is shown reversed.

Now let's transcribe the coding strand into the corresponding mRNA, using the Seq object's built-in `transcribe` method:

In [23]:
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

In [29]:
# switch T --> U, and adjust the alphabet
messenger_rna = coding_dna.transcribe()
messenger_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

In [30]:
# True biological transcription becomes a two step process: 
template_dna.reverse_complement().transcribe()

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

The Seq object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U --> T substitution and associated change of alphabet:

In [26]:
# back transcription method for mRNA to coding strand
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
messenger_rna.back_transcribe()

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
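The relationship between the coding strand, the template strand and the mRNA can be sketched with plain Python strings. This is only an illustration of the substitutions described above, not how Biopython implements them, and the function names are chosen for this example.

```python
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement the strand and read it in the opposite direction."""
    return dna.translate(COMPLEMENT_TABLE)[::-1]


def transcribe(coding_dna):
    """Coding strand to mRNA: a simple T --> U substitution."""
    return coding_dna.replace("T", "U")


def back_transcribe(rna):
    """mRNA to coding strand: a simple U --> T substitution."""
    return rna.replace("U", "T")


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
template = reverse_complement(coding)

# Shortcut: transcribe the coding strand directly
mrna = transcribe(coding)
# "Biological" route: start from the template strand (two steps)
mrna_from_template = transcribe(reverse_complement(template))

print(mrna)
print(mrna == mrna_from_template)
print(back_transcribe(mrna) == coding)
```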

## 5.6 Translation
Sticking with the same example discussed in the transcription section above, now let's translate this mRNA into the corresponding protein sequence:

In [31]:
messenger_rna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In [32]:
# also translate directly from the coding strand DNA sequence
coding_dna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).


Translation tables available in Biopython are based on those from the NCBI. Which table is appropriate depends on the organism (and genome) the sequence comes from, since the same codon can have a different meaning under a different genetic code. By default, the standard genetic code is used (NCBI table ID 1: [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)). Let's have a look at that table:

In [50]:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead. Let's first inspect the vertebrate mitochondrial codon table and then translate the same coding DNA sequence:

In [53]:
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
print(mito_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [54]:
coding_dna.translate(table="Vertebrate Mitochondrial")

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

In [55]:
# specify the table using the NCBI table number which is shorter
coding_dna.translate(table=2)

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

Often you will want to translate the nucleotides only up to the first in-frame stop codon, and then stop (as happens in nature):

In [35]:
coding_dna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In [36]:
coding_dna.translate(to_stop=True)

Seq('MAIVMGR', IUPACProtein())

In [37]:
coding_dna.translate(table=2)

Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

In [38]:
coding_dna.translate(table=2, to_stop=True)

Seq('MAIVMGRWKGAR', IUPACProtein())

Notice that when you use the `to_stop` argument, the stop codon itself is not translated, and the stop symbol is not included at the end of your protein sequence.

In [39]:
# Use a custom stop symbol instead of the default "*":
coding_dna.translate(table=2, stop_symbol="@")

Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
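The effect of the table choice, `to_stop` and `stop_symbol` shown above can be mimicked with a small dictionary-based translator in plain Python. This is a sketch under simplifying assumptions (unambiguous upper-case DNA, incomplete trailing codons ignored), not Biopython's implementation. The names `STANDARD`, `VERTEBRATE_MITO` and `translate` are introduced only for this example. The mitochondrial table is built from the four codons that differ between the two tables printed above.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AMINO_ACIDS = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
STANDARD = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), STANDARD_AMINO_ACIDS)
}
# Vertebrate mitochondrial code: only four codons differ from the standard one
VERTEBRATE_MITO = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(dna, table=STANDARD, to_stop=False, stop_symbol="*"):
    """Translate codon by codon; optionally stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*":
            if to_stop:
                break  # the stop codon itself is not included
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding))
print(translate(coding, to_stop=True))
print(translate(coding, table=VERTEBRATE_MITO))
print(translate(coding, table=VERTEBRATE_MITO, stop_symbol="@"))
```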

However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria, for example the gene yaaX in E. coli K12:

In [42]:
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
 "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
 "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
 "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
 "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA", generic_dna)

In [43]:
gene.translate(table="Bacterial")

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [44]:
gene.translate(table="Bacterial",to_stop=True)

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())

In the bacterial genetic code GTG is a valid start codon. It normally encodes valine, but when used as a start codon it should be translated as methionine. This happens if you tell Biopython that your sequence is a complete CDS:

In [45]:
# telling Biopython your sequence is a complete coding sequence! 
gene.translate(table="Bacterial", cds=True)

Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())
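Treating a sequence as a complete coding sequence implies some sanity checks: a valid start codon, a length that is a multiple of three, and a single in-frame stop codon at the very end. The sketch below illustrates such checks in plain Python. `check_cds` is a hypothetical helper, and `START_CODONS` is an illustrative subset chosen for this example rather than the full list of start codons in the bacterial table.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: illustrative subset of start codons, not the complete list
START_CODONS = {"ATG", "GTG", "TTG"}


def check_cds(dna):
    """Return a list of problems; an empty list means the checks passed."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of three")
    if not codons or codons[0] not in START_CODONS:
        problems.append("first codon is not a start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("last codon is not a stop codon")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        problems.append("internal stop codon")
    return problems


print(check_cds("GTGAAATAA"))     # toy CDS with a GTG start
print(check_cds("GTGTAAAAATAA"))  # contains an internal stop
```

When all checks pass, the first codon would be reported as methionine (M) regardless of the amino acid it normally encodes, which is what the `cds=True` output above shows for GTG.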

## 5.7 Next session
Click here to go to the [next session](06_Biopython_Sequence_annotation.ipynb). 