# 5. Sequence Objects
Content:
- 5.1 Sequences and Alphabets
- 5.2 Sequences act like strings
- 5.3 Working with sequences
- 5.4 Complements and reverse complements
- 5.5 Transcription
- 5.6 Translation
- 5.7 Exercise
---


Biological sequences are arguably the central object in bioinformatics. In this chapter we will discover how Biopython makes working with sequences fairly easy. We will learn how to:
- Create a `Seq` object and use its most common methods and arguments
- Use the built-in transcription and translation methods
- Compare different sequences with each other


## 5.1 Sequences and Alphabets
We start by importing the Seq object with the following line. This should always be the start of your script (or in this case Notebook). 

In [1]:
# Import the Seq object
from Bio.Seq import Seq

Imagine that you have a DNA sequence that you want to investigate further. For this introduction, let's assume it consists of just a few bases, which we pass as an argument to `Seq`. The new variable `my_seq` holds these letters, but (Bio)python now recognizes it as a sequence rather than a plain string.

In [2]:
# Creating our first Sequence object. 
my_seq = Seq("AGTACACTGG")
my_seq

Seq('AGTACACTGG')

In [3]:
dir(my_seq)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_data',
 '_get_seq_str_and_check_alphabet',
 'alphabet',
 'back_transcribe',
 'complement',
 'count',
 'count_overlap',
 'endswith',
 'find',
 'join',
 'lower',
 'lstrip',
 'reverse_complement',
 'rfind',
 'rsplit',
 'rstrip',
 'split',
 'startswith',
 'strip',
 'tomutable',
 'transcribe',
 'translate',
 'ungap',
 'upper']

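By default, `Seq` associates a generic alphabet with the sequence. You can check it through the `.alphabet` attribute. (The `Bio.Alphabet` module used throughout this chapter was removed in Biopython 1.78, so these cells need an older release.)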

In [None]:
# Check the alphabet
my_seq.alphabet

In this case, no alphabet was passed during creation, so the sequence keeps the generic default. Specifying the alphabet avoids misinterpretation in your scripts, e.g. a sequence being treated as DNA when it is supposed to be a protein.

We can specify that we're working with DNA by passing a second argument to `Seq`. In the example below we assign a DNA sequence to one variable and a protein sequence to another.

In [None]:
# importing the IUPAC alphabets
from Bio.Alphabet import IUPAC

my_seq = Seq("AGTACACTGG", IUPAC.unambiguous_dna)
my_prot = Seq("AGTACACTGG", IUPAC.protein)

We've learned how to create sequences explicitly using Biopython. This has many advantages which we will start exploiting now. 

## 5.2 Sequences act like strings

Once created, `Seq` objects can be used much like normal Python strings, for example to retrieve the length of the sequence or to iterate over its characters:

In [None]:
# Creating a new DNA sequence object
my_seq = Seq("GATTACCGATACGAGACCTATACATGATCG", IUPAC.unambiguous_dna)

# Find the length of the sequence
print(len(my_seq))

# Iterating over the elements of the sequence
for index, letter in enumerate(my_seq):
    print("At index: {}, letter: {}".format(index, letter))

In [None]:
# Index the element at position 0
my_seq[0]

The `Seq` object has a `.count()` method, just like a string, which gives a non-overlapping count:

In [None]:
# Find how many times "GAT" appears in the sequence
my_seq.count("GAT")

In [None]:
# Find how many A's there are in the sequence
my_seq.count("A")
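
To make the difference between a non-overlapping and an overlapping count concrete, here is a minimal sketch. It uses plain Python strings instead of `Seq` objects (an assumption, so that it runs without Biopython), and `count_overlapping` is a hypothetical helper written for this illustration; the `dir()` output above also lists a `count_overlap` method on `Seq`.

```python
# Plain Python strings stand in for Seq objects in this sketch.
def count_overlapping(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position only, so overlapping hits are found too
        start = sequence.find(pattern, start + 1)
    return count

non_overlapping = "AAAA".count("AA")           # hits at positions 0 and 2
overlapping = count_overlapping("AAAA", "AA")  # hits at positions 0, 1 and 2
print(non_overlapping, overlapping)
```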

You can find a list of all the possible methods by using the `dir()` function. 

In [None]:
dir(my_seq)

---

### 5.2.1 Exercise
Calculate the GC-content in the following sequence:
```
GATTACCACTCACTGACTCACTGACACGAGACCTATACATGATCGCCGGATGATACGAGAATTACTGACGACTAATCCCGGATACTGCATACACTGACGACGACT
```
- Use the `.count()` method as shown above
- Search through Bio.SeqUtils for a function that might help you

In [None]:
ex_seq = Seq("GATTACCACTCACTGACTCACTGACACGAGACCTATACATGATCGCCGGATGATACGAGAATTACTGACGACTAATCCCGGATACTGCATACACTGACGACGACT", IUPAC.unambiguous_dna)

---

Can you change the characters in a sequence? 

In [None]:
my_seq[2]

In [None]:
#my_seq[2] = 'A'

Just like the normal Python string, the `Seq` object is *read only* (immutable), as in many biological applications you want to ensure you are not changing your sequence data. If you need to edit your sequence, for example simulating a point mutation, you'll need the `MutableSeq` object. 
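
The read-only behaviour can be illustrated with a plain Python string, which is immutable in the same way. The sketch below is only an illustration: `point_mutation` is a hypothetical helper, and it builds a new string rather than using Biopython's `MutableSeq`.

```python
def point_mutation(sequence, position, new_base):
    """Return a copy of sequence with the base at position replaced."""
    if not 0 <= position < len(sequence):
        raise IndexError("position outside the sequence")
    return sequence[:position] + new_base + sequence[position + 1:]

original = "GATTACCGATACGAGACCTATACATGATCG"

# Item assignment on an immutable string raises a TypeError
try:
    original[2] = "A"
except TypeError as error:
    print("Read only:", error)

# Building a new sequence leaves the original untouched
mutated = point_mutation(original, 2, "A")
print(original)
print(mutated)
```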

## 5.3 Working with sequences

**1. Slicing**

In [None]:
# Slicing in its most basic form
my_seq[2:6]

Slicing sequences follows the normal conventions for Python strings. 
- The first element of the sequence has index 0 (which is normal for computer science, but not so normal for biology).
- The first item is included (i.e. 2 in this case) 
- The last item is excluded (i.e. 6 in this case).

```
GATTACCGATACGAGACCTATACATGATCG
  TTAC
```

This is the way things work in Python, but of course not necessarily the way everyone in the world would expect. 
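
Because biological coordinates are usually 1-based and inclusive, it can help to wrap the conversion in a small function. This is a sketch on a plain Python string; `slice_one_based` is a hypothetical helper, not part of Biopython.

```python
def slice_one_based(sequence, start, end):
    """Return the bases from 1-based position start to end, both inclusive."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Shift the start by one; the exclusive Python end equals the inclusive end
    return sequence[start - 1:end]

my_seq = "GATTACCGATACGAGACCTATACATGATCG"
print(my_seq[2:6])                    # Python convention: 0-based, end excluded
print(slice_one_based(my_seq, 3, 6))  # same bases in 1-based, inclusive terms
```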

Also note that slicing returns a new `Seq` object, so the result is still a sequence rather than a plain string. Assign it to a new variable if you want to keep working with it.

In [None]:
# Slicing every third starting from 0
my_seq[0::3]

In [None]:
# Slicing every third starting from 1
my_seq[1::3]

In [None]:
# Slicing every third starting from 2
my_seq[2::3]

The three cells above extract the first, second and third codon positions of the sequence, respectively. The cell below shows a trick to reverse the sequence.

In [None]:
my_seq[::-1]
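
To see what the stride of 3 does, the sequence can be split into codons first. The sketch below uses a plain Python string as a stand-in for the `Seq` object and assumes that reading frame 0 is the frame of interest.

```python
my_seq = "GATTACCGATACGAGACCTATACATGATCG"

# Split into complete codons, starting at index 0
codons = [my_seq[i:i + 3] for i in range(0, len(my_seq) - 2, 3)]
print(codons)

# Collecting the first letter of every codon gives the same result as [0::3]
first_positions = "".join(codon[0] for codon in codons)
print(first_positions == my_seq[0::3])
```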

**2. Turning `Seq` objects into strings**


In [None]:
str(my_seq)

In [None]:
print(my_seq)

Besides converting the sequence to a string, we can use string formatting to construct a simple FASTA format record.

In [None]:
# Python 3.6+ (f-strings)
fasta_format_string = f'>Random sequence\n{my_seq}\n'

# Older Python versions
# fasta_format_string = '>Random sequence\n{}\n'.format(my_seq)

print(fasta_format_string)
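
FASTA files usually wrap long sequences over several lines. The following sketch extends the idea above; `to_fasta` is a hypothetical helper and the line width of 60 is an assumption, not a requirement of the format.

```python
def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with wrapped sequence lines."""
    text = str(sequence)
    # Cut the sequence into chunks of at most `width` characters
    lines = [text[i:i + width] for i in range(0, len(text), width)]
    return ">{}\n{}\n".format(name, "\n".join(lines))

record_text = to_fasta("demo", "ACGT" * 40)
print(record_text)
```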

**3. Concatenating or adding sequences**

Any two `Seq` objects can be added together, just like Python strings, to concatenate them. However, you can't add sequences with incompatible alphabets:

In [None]:
# This will work
protein_seq1 = Seq("EVRNAK", IUPAC.protein)
protein_seq2 = Seq("AGGATC", IUPAC.protein)

# Concatenating two Seqs with the same alphabet. 
protein_seq1 + protein_seq2

In [None]:
# This won't work
protein_seq = Seq("EVRNAK", IUPAC.protein)
dna_seq = Seq("AGGATC", IUPAC.unambiguous_dna)

# Uncomment to see the result of this concatenation
#protein_seq + dna_seq

If you really wanted to do this, you'd have to give both sequences a generic alphabet. 

In [None]:
from Bio.Alphabet import generic_alphabet

# Assign alphabets
protein_seq.alphabet = generic_alphabet
dna_seq.alphabet = generic_alphabet

# Concatenate Seqs together (with the same, though generic alphabets)
protein_seq + dna_seq

Other combinations are also possible, but need to be handled with care. Adding a sequence with a generic nucleotide alphabet to one with an IUPAC unambiguous DNA alphabet will result in a `NucleotideAlphabet`.

In [None]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.Alphabet import generic_nucleotide

# Assign Alphabets
nuc_seq = Seq("GATCGATG", generic_nucleotide)
dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)

# Adding Seqs together (with different alphabet)
nuc_seq + dna_seq

--- 
### 5.3.1 Exercise
Can you concatenate the following sequences using a for-loop?
- Seq("ACGT", generic_dna)
- Seq("GCTA", generic_dna)
- Seq("TACG", generic_dna)

---

**4. Changing case**

Just like Python strings, the `Seq` object has `.upper()` and `.lower()` methods, which are useful for case-insensitive matching:

In [None]:
from Bio.Alphabet import generic_dna

# Create Sequence object with small and capitalized letters
dna_seq = Seq("acgtACGT", generic_dna)
dna_seq

In [None]:
# Capitalize 
dna_seq.upper()

In [None]:
# Lower letters of sequence
dna_seq.lower()

Python is case sensitive, so a search in a string, and therefore also in a `Seq` object, finds nothing if the case doesn't match.

In [None]:
# Case sensitive matching
"GTAC" in dna_seq

In [None]:
# Case insensitive matching
"GTAC" in dna_seq.upper()

In [None]:
# IUPAC alphabets are for upper case sequences only
dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
dna_seq

In [None]:
dna_seq.lower()

## 5.4 Complements and reverse complements

For nucleotide sequences, you can easily obtain the complement or reverse complement of a `Seq`-object using its built-in methods:

In [None]:
# Complement
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
my_seq.complement()

In [None]:
# Reverse complement
my_seq.reverse_complement()

In [None]:
# To just reverse a Seq object (or a Python string), slice it with a step of -1
my_seq[::-1]
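
To see what these methods compute, here is a minimal pure-Python sketch on plain strings. It only handles the four unambiguous DNA bases (an assumption; other characters such as ambiguity codes are left unchanged), and it is not meant to describe how Biopython implements `complement()` internally.

```python
# Translation table mapping each base to its complement
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(sequence):
    """Return the complementary strand, read in the same direction."""
    return sequence.translate(COMPLEMENT)

def reverse_complement(sequence):
    """Return the complementary strand, read 5' to 3'."""
    return complement(sequence)[::-1]

dna = "AGTACACTGG"
print(complement(dna))
print(reverse_complement(dna))
```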

The alphabet property is always maintained. This is very useful in case you accidentally end up trying to do something weird like take the (reverse) complement of a protein sequence:

In [None]:
protein_seq1 = Seq("EVRNAK", IUPAC.protein)

# Uncomment to see what Python thinks of this
#protein_seq1.reverse_complement()

## 5.5 Transcription

Transcription terminology is easy to mix up: coding and non-coding, sense and antisense, complements and reverse complements. Consider the following stretch of double-stranded DNA, which encodes a short peptide:

![transcription](img/transcriptionprocess.png)
Source: [link](https://haygot.s3.amazonaws.com/questions/1308251_1631549_ans_f0c68de70b54468fa7116e7de655ad71.png)

The actual biological transcription process works from the template strand, doing a reverse complement (TCAG --> CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching
T --> U.

Now let's actually get down to doing a transcription in Biopython. First, let's create Seq objects for the coding and template DNA strands:

In [None]:
coding_dna = Seq("ATGATCTCGTAA", IUPAC.unambiguous_dna)
print(f'Original DNA (gene) sequence is: {str(coding_dna):>18}')
print(f"Complement DNA sequence is: {str(coding_dna.complement()):>23}") 
print(f"Reverse complement DNA sequence is: {str(coding_dna.reverse_complement()):>15}")

In [None]:
# Template DNA is reverse complement of coding DNA strand (in 5' to 3' direction)
template_dna = coding_dna.reverse_complement()
template_dna

These should match the figure above - remember by convention nucleotide sequences are normally read from
the 5' to 3' direction, while in the figure the template strand is shown reversed.
Now let's transcribe the coding strand into the corresponding mRNA, using the `Seq` object's built in
transcribe method:

In [None]:
coding_dna

In [None]:
# switch T --> U, and adjust the alphabet
mRNA = coding_dna.transcribe()
mRNA

In [None]:
# True biological transcription becomes a two step process: 
template_dna.reverse_complement().transcribe()

The `Seq` object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U --> T substitution and associated change of alphabet:

In [None]:
# back transcription method for mRNA to coding strand
mRNA = Seq("AUGAUCUCGUAA", IUPAC.unambiguous_rna)
mRNA.back_transcribe()
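
The two routes to the mRNA can be compared in a small pure-Python sketch. It works on plain upper-case strings (an assumption) and the helper functions are written for this illustration only; they mirror the T --> U and U --> T substitutions described above.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]

def transcribe(coding_dna):
    """Coding strand to mRNA: replace T by U."""
    return coding_dna.replace("T", "U")

def back_transcribe(mrna):
    """mRNA to coding strand: replace U by T."""
    return mrna.replace("U", "T")

coding_dna = "ATGATCTCGTAA"
template_dna = reverse_complement(coding_dna)

# Shortcut used in bioinformatics: work directly on the coding strand
mrna_direct = transcribe(coding_dna)
# Two-step route starting from the template strand
mrna_from_template = transcribe(reverse_complement(template_dna))

print(template_dna, mrna_direct, mrna_from_template)
```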

## 5.6 Translation
Let's translate a longer mRNA sequence into the corresponding protein sequence:

In [None]:
mRNA = Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUUGA', IUPAC.unambiguous_rna)
mRNA.translate()

You should notice in the above protein sequence that, in addition to the stop character at the end, there is an internal stop as well. This was a deliberate choice, as it gives an excuse to talk about some optional arguments, including different translation tables (genetic codes).


The translation tables available in Biopython are based on those from the NCBI. Depending on the organism that was sequenced, translation must use the matching genetic code. By default, the standard genetic code is used (NCBI table ID 1: [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)). Let's have a look at that table:

In [None]:
from Bio.Data import CodonTable

# Import the standard codon table
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
print(standard_table)

Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead. Let's first inspect the vertebrate mitochondrial codon table and then translate the same coding DNA sequence:

In [None]:
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
print(mito_table)

In [None]:
# Using the same sequence (now translating from DNA) with a different codon table
coding_dna = Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTTGA', IUPAC.unambiguous_dna)
coding_dna.translate(table="Vertebrate Mitochondrial")

In [None]:
# specify the table using the NCBI table number which is shorter
coding_dna.translate(table=2)

It often makes sense to translate the nucleotides only up to the first in-frame stop codon and then stop (as happens in nature):

In [None]:
# Stop translation at the first occurrence of a stop codon
coding_dna.translate(to_stop=True)

In [None]:
# Combination of table and stop codon
coding_dna.translate(table=2, to_stop=True)

Notice that when you use the to_stop argument, the stop codon itself is not translated - and the stop symbol
is not included at the end of your protein sequence.
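
The effect of the table and of `to_stop` can be reproduced with a small pure-Python sketch. The dictionaries below are built by hand for this illustration: the standard code follows NCBI table 1, and the vertebrate mitochondrial code is derived from it by changing the four codons that differ in NCBI table 2. The `translate` function is a hypothetical stand-in that only accepts unambiguous upper-case DNA; it is not Biopython's implementation.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD = {"".join(codon): aa
            for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}

# Vertebrate mitochondrial code (NCBI table 2): four codons differ
MITO = dict(STANDARD, AGA="*", AGG="*", ATA="M", TGA="W")

def translate(dna, code=STANDARD, to_stop=False, stop_symbol="*"):
    """Translate complete codons of dna; optionally halt at the first stop."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = code[dna[i:i + 3]]
        if amino_acid == "*":
            if to_stop:
                break  # the stop codon itself is not added to the protein
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)

coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTTGA"
print(translate(coding_dna))                   # standard code
print(translate(coding_dna, MITO))             # TGA now codes for W
print(translate(coding_dna, to_stop=True))     # halt at the first stop
print(translate(coding_dna, stop_symbol="@"))  # custom stop symbol
```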

In [None]:
# Specify the stop symbol nevertheless:
coding_dna.translate(table=2, stop_symbol="@")

However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria, e.g. the gene yaaX in E. coli K12:

In [None]:
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
 "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
 "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
 "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
 "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA", generic_dna)

In [None]:
gene.translate(table="Bacterial")

In [None]:
gene.translate(table="Bacterial",to_stop=True)

In the bacterial genetic code, GTG is a valid start codon. It normally encodes valine, but when used as a start codon it should be translated as methionine:

In [None]:
# telling Biopython your sequence is a complete coding sequence! 
gene.translate(table="Bacterial", cds=True)
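
The kind of checks that a complete coding sequence has to pass can be sketched in pure Python. Everything here is an assumption made for illustration: `translate_cds` is a hypothetical function, the start codon set is only a subset of the start codons in NCBI table 11, the short test sequence is made up, and the exact checks Biopython performs for `cds=True` may differ.

```python
from itertools import product

# Amino acid assignments of the standard code, which table 11 shares
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(codon): aa
        for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}
STOP_CODONS = {codon for codon, aa in CODE.items() if aa == "*"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: subset of table 11 starts

def translate_cds(dna):
    """Translate a complete CDS, reading the start codon as methionine."""
    if len(dna) < 6 or len(dna) % 3 != 0:
        raise ValueError("length must be a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if codons[0] not in START_CODONS:
        raise ValueError("first codon is not a start codon")
    if codons[-1] not in STOP_CODONS:
        raise ValueError("last codon is not a stop codon")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        raise ValueError("internal stop codon found")
    # Start codon becomes M, the final stop codon is dropped
    return "M" + "".join(CODE[codon] for codon in codons[1:-1])

toy_gene = "GTGAAAAAGATGTAA"  # made-up sequence starting with GTG
plain = "".join(CODE[toy_gene[i:i + 3]] for i in range(0, len(toy_gene), 3))
print(plain)                    # GTG read as valine, stop shown as *
print(translate_cds(toy_gene))  # GTG read as methionine, stop dropped
```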

## 5.7 Exercise

Identifying genes is possible by looking for open reading frames (ORFs). For eukaryotic genes we know that there is a complex interaction between promoters, start codons, exons and introns. For prokaryotic and viral genes, however, a simple ORF search is still useful.

Depending on the organism, you also need to use the appropriate codon table. In this example we use a bacterial plasmid FASTA file, for which we need codon [table 11](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG11). The following block of code stores the sequence in the variable `record`, sets the translation table, and requires a candidate protein to have a minimum length of 100 amino acids.


In [None]:
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.fna", "fasta")
table = 11
min_pro_len = 100

The output might look something like this: 
```
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF, - length 315, strand 1, frame 0
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE, - length 285, strand 1, frame 1
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA, - length 355, strand 1, frame 2
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD, - length 114, strand -1, frame 0
```

Once your solution works as a loop, you could easily extend it to build up a list of the candidate proteins, or convert it to a list comprehension.

## 5.8 Next session
Click here to go to the [next session](06_Biopython_Sequence_annotation.ipynb). 