# 5. Sequence Objects
Content:
- 5.1 Sequences and Alphabets
- 5.2 Sequences act like strings
- 5.3 Working with sequences
- 5.4 Complements and reverse complements
- 5.5 Transcription
- 5.6 Translation
- 5.7 Exercise
---


Biological sequences are arguably the central object in bioinformatics. In this chapter we will discover how Biopython makes working with them fairly easy. We will learn:
- How the `Seq` object works and which methods and arguments are often used
- How to use the built-in transcription and translation methods
- How to compare different sequences with each other


## 5.1 Sequences and Alphabets
We start by importing the Seq object with the following line. This should always be the start of your script (or in this case Notebook). 

In [2]:
# Import the Seq object
from Bio.Seq import Seq

Imagine that you have a DNA sequence that you want to investigate further. For this introduction, let's assume it consists of a handful of bases, which we pass as an argument to the `Seq` object. The new variable `my_seq` contains these letters, but (Bio)python now recognizes it as a sequence rather than a plain string.

In [6]:
# Creating our first Sequence object. 
my_seq = Seq("AGTACACTGG")
my_seq

Seq('AGTACACTGG')

We can have an overview of all possible functions that we can call on this newly created object:

In [4]:
dir(my_seq)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_data',
 'back_transcribe',
 'complement',
 'complement_rna',
 'count',
 'count_overlap',
 'encode',
 'endswith',
 'find',
 'index',
 'join',
 'lower',
 'lstrip',
 'reverse_complement',
 'reverse_complement_rna',
 'rfind',
 'rindex',
 'rsplit',
 'rstrip',
 'split',
 'startswith',
 'strip',
 'tomutable',
 'transcribe',
 'translate',
 'ungap',
 'upper']

Minor note: before Oct 2020, each `Seq` object was associated with a specific alphabet type that identified its biological origin (DNA, RNA, protein, and all derivatives). `Bio.Alphabet` has since been removed from Biopython. 

## 5.2 Sequences act like strings

Once created, `Seq` objects can be used much like normal Python strings, for example to retrieve the length of the sequence or to iterate over its characters.

In [10]:
# Creating a new DNA sequence object
my_seq = Seq("GATTACCGATACGAGACCTATACATGATCG")

# Find the length of the sequence
print(len(my_seq))

# Iterating over the elements of the sequence
for index, letter in enumerate(my_seq):
    print("At index: {}, letter: {}".format(index, letter))

30
At index: 0, letter: G
At index: 1, letter: A
At index: 2, letter: T
At index: 3, letter: T
At index: 4, letter: A
At index: 5, letter: C
At index: 6, letter: C
At index: 7, letter: G
At index: 8, letter: A
At index: 9, letter: T
At index: 10, letter: A
At index: 11, letter: C
At index: 12, letter: G
At index: 13, letter: A
At index: 14, letter: G
At index: 15, letter: A
At index: 16, letter: C
At index: 17, letter: C
At index: 18, letter: T
At index: 19, letter: A
At index: 20, letter: T
At index: 21, letter: A
At index: 22, letter: C
At index: 23, letter: A
At index: 24, letter: T
At index: 25, letter: G
At index: 26, letter: A
At index: 27, letter: T
At index: 28, letter: C
At index: 29, letter: G


In [None]:
# Retrieve the element at position 0 (indexing)
my_seq[0]

The `Seq` object has a `.count()` method, just like a string, which gives a non-overlapping count:

In [None]:
# Find how many times "GAT" appears in the sequence
my_seq.count("GAT")

In [None]:
# Find how many A's there are in the sequence
my_seq.count("A")
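To make "non-overlapping" concrete, the sketch below compares the built-in count with a hand-written overlapping count on a plain Python string. The helper name `count_overlapping` is made up for this illustration; the `dir()` listing earlier also shows a `count_overlap` method on `Seq`, which is meant for the same purpose.

```python
def count_overlapping(seq, sub):
    """Count occurrences of sub in seq, allowing overlaps (illustrative helper)."""
    count = 0
    start = seq.find(sub)
    while start != -1:
        count += 1
        # Move forward by one position only, so overlapping hits are found too
        start = seq.find(sub, start + 1)
    return count

text = "AAAA"
non_overlapping = text.count("AA")            # hits at index 0 and 2
overlapping = count_overlapping(text, "AA")   # hits at index 0, 1 and 2
print(non_overlapping, overlapping)
```

For most motif searches the difference does not matter, but for repeats such as `AA` in `AAAA` the two counts differ.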

You can find a list of all the possible methods by using the `dir()` function. 

In [None]:
dir(my_seq)

---

### 5.2.1 Exercise
Calculate the GC-content in the following sequence:
```
ATGGATTACCACTCACTGCCTCACTGACACGAGACCTATACATG
```
- Use the `.count()` method as shown above
- Search through Bio.SeqUtils for a function that might help you

In [24]:
ex_seq = Seq("ATGGATTACCACTCACTGCCTCACTGACACGAGACCTATACATG")

### 5.2.2 Extra exercise
- Find all occurrences of the subsequence `TGA` and their positions. `TGA` codes for a stop codon during translation. Knowing where it occurs, extract the subsequence that precedes the first occurrence.  
- Calculate the molecular weight of the sequence and of its translation product "ATGGCCATTGTAATGG"

In [49]:
from Bio.SeqUtils import nt_search
ex_seq = "ATGGATTACCACTCACTGCCTCACTGACACGAGACCTATACATG"
stop_seq = "TGA"
occ_TGA = nt_search(ex_seq, stop_seq)
sub_seq = ex_seq[:occ_TGA[1]]
sub_seq

'ATGGATTACCACTCACTGCCTCAC'

Note: the `nt_search` function does not accept `Seq` objects as input, so we pass plain strings. This is suboptimal, but we have to work with what we get. 
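As a cross-check that does not rely on `nt_search`, the positions can also be found with plain string methods. This is only a sketch: `find_all` is a hypothetical helper written for this example, and unlike `nt_search` (whose result list starts with the search pattern, which is why the code above uses index 1) it returns positions only.

```python
def find_all(seq, sub):
    """Return the start index of every (possibly overlapping) occurrence of sub."""
    positions = []
    start = seq.find(sub)
    while start != -1:
        positions.append(start)
        start = seq.find(sub, start + 1)
    return positions

ex_seq = "ATGGATTACCACTCACTGCCTCACTGACACGAGACCTATACATG"
stop_positions = find_all(ex_seq, "TGA")

# Everything before the first TGA
first_part = ex_seq[:stop_positions[0]]
print(stop_positions, first_part)
```

Note that a plain text search ignores the reading frame: a hit only acts as a stop codon if its position is a multiple of three relative to the start codon.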

In [54]:
from Bio.SeqUtils import molecular_weight
weightDNA = molecular_weight(Seq(ex_seq))
weightProt = molecular_weight(Seq(sub_seq).translate(), 'protein')
print(f"Molecular weight of DNA: {weightDNA:.2f}\nMolecular weight of Protein: {weightProt:.2f}")

Molecular weight of DNA: 13484.62
Molecular weight of Protein: 999.10


---

Can you change the characters in a sequence? 

In [None]:
my_seq[2]

In [None]:
#my_seq[2] = 'A'

Just like the normal Python string, the `Seq` object is *read only* (immutable), as in many biological applications you want to ensure you are not changing your sequence data. If you need to edit your sequence, for example simulating a point mutation, you'll need the `MutableSeq` object. 
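A minimal sketch of such a point mutation is shown below. Converting through `str()` in both directions is a choice made here to keep the example independent of version-specific conversion helpers; other routes may exist in your Biopython version.

```python
from Bio.Seq import Seq, MutableSeq

my_seq = Seq("GATTACCGATACGAGACCTATACATGATCG")

# Build an editable copy of the sequence
mutable_seq = MutableSeq(str(my_seq))
mutable_seq[2] = "A"   # point mutation: T -> A at index 2

# Convert back to an immutable Seq once editing is done
mutated_seq = Seq(str(mutable_seq))
print(my_seq[:6], mutated_seq[:6])
```

The original `my_seq` is left untouched, so both versions remain available for comparison.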

## 5.3 Working with sequences

**1. Slicing**

In [None]:
# Slicing in its most basic form
my_seq[2:6]

Slicing sequences follows the normal conventions for Python strings. 
- The first element of the sequence is 0 (which is normal for computer science, but not so normal for biology).
- The first item is included (i.e. 2 in this case) 
- The last item is excluded (i.e. 6 in this case).

```
GATTACCGATACGAGACCTATACATGATCG
  TTAC
```

This is the way things work in Python, but of course not necessarily the way everyone in the world would expect. 

Also important to note: slicing a `Seq` object returns a new `Seq` object, not a plain string. Assign the result to a new variable if you want to work with it further. 

In [None]:
# Slicing every third starting from 0
my_seq[0::3]

In [None]:
# Slicing every third starting from 1
my_seq[1::3]

In [None]:
# Slicing every third starting from 2
my_seq[2::3]

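The three cells above pick out every third nucleotide, that is, the first, second and third codon positions when the reading frame starts at index 0. Note that this is not the same as shifting the reading frame, which would be `my_seq[1:]` or `my_seq[2:]`. The cell below displays a trick to reverse the sequence. 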

In [None]:
my_seq[::-1]
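To see how the stride-3 slices relate to codons and reading frames, the following sketch splits a plain string into codons for a given frame. The helper `codons` is hypothetical and only meant as an illustration; it works the same way on a `Seq` object because it only uses slicing and `len()`.

```python
def codons(seq, frame=0):
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    shifted = seq[frame:]
    usable = len(shifted) - len(shifted) % 3   # drop a trailing partial codon
    return [shifted[i:i + 3] for i in range(0, usable, 3)]

seq = "GATTACCGATACGAGACCTATACATGATCG"

frame0 = codons(seq, 0)
frame1 = codons(seq, 1)

# seq[0::3] collects the first letter of every frame-0 codon
first_positions = "".join(codon[0] for codon in frame0)
print(frame0[:3], frame1[:3], first_positions == seq[0::3])
```

Shifting the frame changes which triplets are read, whereas a stride of 3 keeps the frame and selects one position out of every codon.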

**2. Turning `Seq` objects into strings**


In [None]:
str(my_seq)

In [None]:
print(my_seq)

Besides converting the sequence to a string, we can use string formatting to construct a simple FASTA format record. 

In [None]:
# Python 3.6 and later (f-strings)
fasta_format_string = f'>Random sequence\n{my_seq}\n'

# Older Python versions (plain string with .format, no f-prefix)
# fasta_format_string = '>Name\n{}\n'.format(my_seq)

print(fasta_format_string)

**3. Concatenating or adding sequences**

Any two `Seq` objects can be added together - just like you can with Python strings - to concatenate them. 

In [57]:
# Define two protein sequences
protein_seq1 = Seq("EVRNAK")
protein_seq2 = Seq("AGGATC")

# Concatenating two Seqs 
protein_seq1 + protein_seq2

Seq('EVRNAKAGGATC')

In [58]:
# Define DNA and protein sequence
dna_seq = Seq("AGGATC")
protein_seq = Seq("EVRNAK")

# This works now; earlier, when the two had different alphabets, it did not
protein_seq + dna_seq

Seq('EVRNAKAGGATC')

Hence, be careful with the simplicity that Biopython offers: it will not stop you from combining sequences that make no biological sense together. Keep thinking of the underlying biological processes. 

--- 
### 5.3.1 Exercise
Can you concatenate the following sequences (using a `for`-loop or the built-in `sum` function)?
- Seq("ACGT")
- Seq("GCTA")
- Seq("TACG")

---

**4. Changing case**

Just like Python strings, the Seq object has an `.upper()` and `.lower()` method, useful for doing case insensitive matching:

In [60]:
# Create Sequence object with small and capitalized letters
dna_seq = Seq("acgtACGT")
dna_seq

Seq('acgtACGT')

In [62]:
# Capitalize 
dna_seq.upper()

Seq('ACGTACGT')

In [63]:
# Lower letters of sequence
dna_seq.lower()

Seq('acgtacgt')

Python is case sensitive, so letters in a string, and therefore also in a `Seq` object, are not matched if the case differs. 

In [64]:
# Case sensitive matching
"GTAC" in dna_seq

False

In [65]:
# Case insensitive matching
"GTAC" in dna_seq.upper()

True

## 5.4 Complements and reverse complements

For nucleotide sequences, you can easily obtain the complement or reverse complement of a `Seq`-object using its built-in methods:

In [66]:
# Complement
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq.complement()

Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG')

In [67]:
# Reverse complement
my_seq.reverse_complement()

Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC')

In [None]:
# To simply reverse a Seq object (or a Python string), slice it with a step of -1
my_seq[::-1]
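What these two methods do can be sketched with plain Python strings. This is an illustration of the idea, not Biopython's implementation, and it deliberately handles only the four unambiguous DNA letters.

```python
# Only A, C, G, T are handled; IUPAC ambiguity codes are left out on purpose
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(dna):
    """Replace every base by its pairing partner."""
    return dna.translate(COMPLEMENT)

def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]

dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(complement(dna))
print(reverse_complement(dna))
```

Both results should agree with the `Seq` outputs shown above for the same sequence.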

Since the removal of the alphabet property, it is technically possible to call reverse complement or back-transcription on what ought to be a protein sequence. This doesn't really make sense biologically: Biopython simply treats the letters as (ambiguous) nucleotide codes.

In [75]:
protein_seq1 = Seq("EVRNAK")

# Do you agree with this output? 
print(protein_seq1.reverse_complement())
print(protein_seq1.back_transcribe())

MTNYBE
EVRNAK


## 5.5 Transcription

The terminology around transcription is easily confusing: coding and non-coding, sense and antisense, complements and reverse complements. Consider the following stretch of double-stranded DNA, which encodes a short peptide:

![transcription](img/transcriptionprocess.png)
Source: [link](https://haygot.s3.amazonaws.com/questions/1308251_1631549_ans_f0c68de70b54468fa7116e7de655ad71.png)

The actual biological transcription process works from the template strand, doing a reverse complement (TCAG --> CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand, because this means we can get the mRNA sequence just by switching T --> U.

Now let's actually get down to doing a transcription in Biopython. First, let's create Seq objects for the coding and template DNA strands:

In [77]:
coding_dna = Seq("ATGATCTCGTAA")
print(f'Original DNA (gene) sequence is: {str(coding_dna):>18}')
print(f"Complement DNA sequence is: {str(coding_dna.complement()):>23}") 
print(f"Reverse complement DNA sequence is: {str(coding_dna.reverse_complement()):>15}")

Original DNA (gene) sequence is:       ATGATCTCGTAA
Complement DNA sequence is:            TACTAGAGCATT
Reverse complement DNA sequence is:    TTACGAGATCAT


In [78]:
# Template DNA is reverse complement of coding DNA strand (in 5' to 3' direction)
template_dna = coding_dna.reverse_complement()
template_dna

Seq('TTACGAGATCAT')

These should match the figure above. Remember that by convention nucleotide sequences are normally read in the 5' to 3' direction, while in the figure the template strand is shown reversed.

Now let's transcribe the coding strand into the corresponding mRNA, using the `Seq` object's built-in `transcribe` method:

In [79]:
coding_dna

Seq('ATGATCTCGTAA')

In [80]:
# switch T --> U
mRNA = coding_dna.transcribe()
mRNA

Seq('AUGAUCUCGUAA')

In [81]:
# True biological transcription becomes a two step process: 
template_dna.reverse_complement().transcribe()

Seq('AUGAUCUCGUAA')

The `Seq` object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U --> T substitution:

In [83]:
# back transcription method for mRNA to coding strand
mRNA = Seq("AUGAUCUCGUAA")
mRNA.back_transcribe()

Seq('ATGATCTCGTAA')
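The relationship between the coding strand, the template strand and the mRNA can be summarised in a few lines of plain Python. The function names below are chosen for this sketch and only mimic what the `Seq` methods do for unambiguous DNA.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def transcribe(coding_dna):
    """Coding strand -> mRNA: a simple T --> U substitution."""
    return coding_dna.replace("T", "U")

def back_transcribe(mrna):
    """mRNA -> coding strand: a simple U --> T substitution."""
    return mrna.replace("U", "T")

def transcribe_from_template(template_dna):
    """Template strand -> mRNA: reverse complement first, then T --> U."""
    coding_dna = template_dna.translate(COMPLEMENT)[::-1]
    return transcribe(coding_dna)

coding_dna = "ATGATCTCGTAA"
template_dna = "TTACGAGATCAT"
mrna = transcribe(coding_dna)
print(mrna, transcribe_from_template(template_dna), back_transcribe(mrna))
```

Both routes give the same mRNA, which is why working from the coding strand is the more convenient convention.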

## 5.6 Translation
Let's translate a longer mRNA sequence into the corresponding protein sequence:

In [84]:
mRNA = Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUUGA')
mRNA.translate()

Seq('MAIVMGR*KG*')

You should notice in the above protein sequence that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).


**Translation** tables available in Biopython are based on those from the NCBI. Which codon table is appropriate depends on the organism (or organelle) that was sequenced, so the translation should use the matching table. By default, the standard genetic code is used (NCBI table ID 1: [here](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)). Let's have a look at that table:

In [85]:
from Bio.Data import CodonTable

# Import the standard codon table
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead. Let's first inspect the vertebrate mitochondrial codon table and then translate the same coding DNA sequence:

In [86]:
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
print(mito_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--
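Instead of reading the differences off the printed tables, the codon table objects can also be queried directly. The sketch below assumes the `stop_codons`, `start_codons` and `forward_table` attributes of the table objects; check them against your installed Biopython version.

```python
from Bio.Data import CodonTable

standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

# Stop codons differ between the two genetic codes
print(standard_table.stop_codons)
print(mito_table.stop_codons)

# Codons that may start a protein in the mitochondrial code
print(mito_table.start_codons)

# forward_table maps sense codons to amino acids; stop codons are not in it
print(mito_table.forward_table["TGA"])
print("TGA" in standard_table.forward_table)
```

The `TGA` codon is the one to watch in the examples that follow: it is a stop codon in the standard table but encodes tryptophan in the vertebrate mitochondrial table.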

In [87]:
# Using the same sequence (now translating from DNA) with a different codon table
coding_dna = Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTTGA')
coding_dna.translate(table="Vertebrate Mitochondrial")

Seq('MAIVMGRWKGW')

In [88]:
# specify the table using the NCBI table number which is shorter
coding_dna.translate(table=2)

Seq('MAIVMGRWKGW')

Often you want to translate the nucleotides only up to the first in-frame stop codon and then stop (as happens in nature):

In [89]:
# Stop translation at the first occurrence of a stop codon
coding_dna.translate(to_stop=True)

Seq('MAIVMGR')

In [90]:
# Combination of table and stop codon
coding_dna.translate(table=2, to_stop=True)

Seq('MAIVMGRWKGW')

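Notice that when you use the `to_stop` argument, the stop codon itself is not translated, and the stop symbol is not included at the end of your protein sequence. With table 2 the output above is unchanged, because this sequence contains no stop codon in the vertebrate mitochondrial code (`TGA` is read as tryptophan, W).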

In [91]:
# Specify a custom stop symbol (no effect here, since table 2 finds no stop codon in this sequence)
coding_dna.translate(table=2, stop_symbol="@")

Seq('MAIVMGRWKGW')
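How the `table`, `to_stop` and `stop_symbol` arguments interact can be illustrated with a small hand-written translator. This is a sketch under strong assumptions: the dictionaries contain only the codons that occur in the example sequence, and the function is not Biopython's implementation.

```python
# Partial codon table: only the codons that occur in the example sequence
STANDARD = {
    "ATG": "M", "GCC": "A", "ATT": "I", "GTA": "V", "GGC": "G",
    "CGC": "R", "AAG": "K", "GGT": "G", "TGA": "*",
}
# For these codons, the vertebrate mitochondrial code differs only in TGA
MITOCHONDRIAL = {**STANDARD, "TGA": "W"}

def translate(dna, table=STANDARD, to_stop=False, stop_symbol="*"):
    """Translate complete codons; optionally halt at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*":
            if to_stop:
                break                     # the stop itself is not emitted
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)

coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTTGA"
print(translate(coding_dna))
print(translate(coding_dna, to_stop=True))
print(translate(coding_dna, stop_symbol="@"))
print(translate(coding_dna, table=MITOCHONDRIAL, to_stop=True, stop_symbol="@"))
```

With the mitochondrial table neither `to_stop` nor `stop_symbol` changes the result, because no codon in this sequence is a stop in that code.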

However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria, e.g. the gene yaaX in E. coli K12:

In [104]:
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
 "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
 "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
 "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
 "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA")

In [105]:
gene.translate(table="Bacterial")

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*')

In [106]:
gene.translate(table="Bacterial",to_stop=True)

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR')

In the bacterial genetic code, GTG is a valid start codon. It normally encodes valine, but when used as a start codon it should be translated as methionine:

In [107]:
# telling Biopython your sequence is a complete coding sequence! 
gene.translate(table="Bacterial", cds=True)

Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR')

If there is an internal stop codon, an error will be raised. This is because `cds=True` requires the sequence to be a complete CDS: a valid start codon, a stop codon at the end, a length that is a multiple of three, and no internal stop codons.
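The four requirements can be written out as explicit checks. The sketch below is a plain-Python illustration, not what Biopython runs internally: the function `cds_problems` is hypothetical, the start and stop codon sets are typed in by hand for NCBI table 11, and the test sequences are made up.

```python
# Start and stop codons of NCBI table 11 (bacterial), written out by hand
START_CODONS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def cds_problems(dna):
    """Return the reasons why dna is not a complete CDS (empty list if none)."""
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("no valid start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no final stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems

print(cds_problems("GTGAAATAA"))      # made-up complete CDS
print(cds_problems("GTGTAAAAATAA"))   # made-up CDS with an internal stop
print(cds_problems("AAATAA"))         # made-up fragment without a start codon
```

Running such checks before calling `translate(..., cds=True)` can help to tell which of the requirements a sequence violates.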

## 5.7 Exercise

Genes can be identified by looking for open reading frames (ORFs). For eukaryotic genes we know that there is a complex interaction between promoters, start codons, exons and introns. For prokaryotic and viral genes, however, this approach is still useful. 

Depending on the organism, you also need to use the appropriate codon table. In this exercise we use a bacterial plasmid FASTA file, for which we need codon [table 11](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG11). Write a function that accepts a DNA sequence and stores the translated sequences in a pandas DataFrame. The function takes the translation table as an argument and only keeps candidate proteins with a minimum length of 100 amino acids. 

Input arguments of the function:
- `record`: DNA sequence record (`SeqRecord` object, as read by `SeqIO`)
- `strand`: sense or antisense (+1/-1)
- `frame`: reading frame offset (0/1/2)
- `table`: translation table (e.g. 11)
- `min_len`: minimum length of protein sequences to be included (e.g. 100)


The output might look something like this: 

|   |                                          Sequence | Length | Strand | Frame |
|--:|--------------------------------------------------:|-------:|-------:|------:|
| 0 | WGKLQVIGLSMWMVLFSQRFDDWLNEQEDALQEKVLADLKKLQVYG... |    125 |     -1 |     1 |
| 1 | RGIFMSDTMVVNGSGGVPAFLFSGSTLSSYRPNFEANSITIALPHY... |    361 |     -1 |     1 |
| 2 | WDVKTVTGVLHHPFHLTFSLCPEGATQSGREAHLLAELPQRRMEPV... |    111 |     -1 |     1 |

In [None]:
def extract_ORF(record, strand, frame, table, min_len):
    """extract_ORF accepts a sequence record object as argument together with a strand orientation 
    and frameshift and will give you as an output all of the possible ORFs from that sequence record object
    that are longer than a predefined minimal length of AAs using a specific codon table"""

    # Create empty dataframe that will store all the information
    
    # Change DNA sequence according to strand orientation 

    # Change DNA sequence according to frameshift mutation

    # Iterate over each possible translation 
    
    # If the possible translation is longer than min_len, add it to the DataFrame
    
    
    return # DataFrame

Test code: 

In [117]:
import pandas as pd
from Bio import SeqIO
record = SeqIO.read("data/NC_005816.fna", "fasta")
table = 11
min_len = 100
extract_ORF(record=record, strand=-1, frame=1, table=table, min_len=min_len)

## 5.8 Next session
Click here to go to the [next session](06_Biopython_Sequence_annotation.ipynb). 