# Chapter 3: Sequence objects

Biological sequences are arguably the central object in Bioinformatics, and in this chapter we’ll introduce the Biopython mechanism for dealing with sequences, the `Seq` object. Chapter 4 will introduce the related `SeqRecord` object, which combines the sequence information with any annotation, used again in Chapter 5 for Sequence Input/Output.

Sequences are essentially strings of letters like `AGTACACTGGT`, which seems very natural since this is the most common way that sequences are seen in biological file formats.

The most important difference between `Seq` objects and standard Python strings is that they have different methods. Although the `Seq` object supports many of the same methods as a plain string, its `translate()` method differs by doing biological translation, and there are also additional biologically relevant methods like `reverse_complement()`.

## 3.1 Sequences act like strings


In most ways, we can deal with `Seq` objects as if they were normal Python strings, for example getting the length, or iterating over the elements:

In [1]:
from Bio.Seq import Seq
my_seq = Seq("GATCG")
for index, letter in  enumerate(my_seq):
    print("%i %s" % (index, letter))

0 G
1 A
2 T
3 C
4 G


In [2]:
print(len(my_seq))

5


#### You can access elements of the sequence in the same way as for strings (but remember, Python counts from zero!)

In [3]:
print(my_seq[0])

G


In [4]:
print(my_seq[2])

T


In [5]:
print(my_seq[-1])

G


The `Seq` object has a `.count()` method, just like a string. Note that this means that, like a Python string, it gives a non-overlapping count:

In [6]:
from Bio.Seq import Seq
"AAAA".count("AA") # Normal String

2

In [8]:
Seq("AAAA").count("AA") # Seq Object

2

#### For some biological uses, you may actually want an overlapping count (i.e. 3 in this trivial example). When searching for single letters, this makes no difference:

In [9]:
from Bio.Seq import Seq
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
len(my_seq)

32

In [10]:
my_seq.count("G")

9

In [11]:
100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)

46.875

While you could use the above snippet of code to calculate a GC%, note that the `Bio.SeqUtils` module has several GC functions already built. For example:

In [12]:
from Bio.Seq import Seq
from Bio.SeqUtils import GC
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
GC(my_seq)

46.875

#### Note that using the `Bio.SeqUtils.GC()` function should automatically cope with mixed case sequences and the ambiguous nucleotide S which means G or C.
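To make the mixed-case and ambiguity handling concrete, here is a plain-Python sketch of a GC percentage that counts `G`, `C` and `S` in either case. The name `gc_percent` is made up for this illustration, and the code only mirrors the behaviour described above; it is not Biopython's own implementation.

```python
def gc_percent(sequence):
    """Return the GC percentage of a sequence (hypothetical helper).

    Counts G, C and the ambiguity code S (G or C), ignoring case.
    """
    sequence = str(sequence)
    if not sequence:
        return 0.0
    gc_count = sum(1 for letter in sequence.upper() if letter in "GCS")
    return 100.0 * gc_count / len(sequence)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))  # 46.875
print(gc_percent("gatcGATGggccTATATAggatCGAAaatcGC"))  # same value in mixed case
```

Because the argument is passed through `str()`, the same sketch should accept either a plain string or a `Seq` object.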

#### Also note that just like a normal Python string, the `Seq` object is in some ways “read-only”. If you need to edit your sequence, for example simulating a point mutation, look at Section 3.12 below which talks about the `MutableSeq` object.
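Returning to the overlapping count mentioned earlier in this section: one way to get it is to restart the search one letter after each match rather than after the end of the match. The function below is a minimal sketch with a made-up name, `count_overlapping`, and works on anything that `str()` can convert. Recent Biopython releases also offer a `count_overlap()` method on `Seq`, if your version provides it.

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    sequence, sub = str(sequence), str(sub)
    if not sub:
        raise ValueError("sub must not be empty")
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Advance by one letter only, so overlapping matches are found too.
        start = sequence.find(sub, start + 1)
    return count


print("AAAA".count("AA"))               # 2 (non-overlapping)
print(count_overlapping("AAAA", "AA"))  # 3 (overlapping)
```

For single letters the two approaches agree, since a one-letter match cannot overlap itself.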

## 3.2 Slicing a sequence

For a more complicated example, let’s get a slice of the sequence:

In [13]:
from Bio.Seq import Seq
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq[4:12]

Seq('GATGGGCC')

Note that `Seq` objects follow the usual indexing conventions for Python strings, with the first element of the sequence numbered 0. When you do a slice the first item is included (i.e. 4 in this case) and the last is excluded (12 in this case).

Also like a Python string, you can do slices with a start, stop and stride (the step size, which defaults to one). For example, we can get the first, second and third codon positions of this DNA sequence:


In [14]:
my_seq[0::3]

Seq('GCTGTAGTAAG')

In [15]:
my_seq[1::3]

Seq('AGGCATGCATC')

In [16]:
my_seq[2::3]

Seq('TAGCTAAGAC')
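The stride slices above pick out one position from every codon. A related idiom, sketched here on a plain string, cuts the sequence into whole codons with ordinary start and stop slices. The variable names are illustrative only.

```python
dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"

# Ignore any trailing letters that do not form a complete codon.
usable_length = len(dna) - len(dna) % 3
codons = [dna[i:i + 3] for i in range(0, usable_length, 3)]

print(codons[:4])   # ['GAT', 'CGA', 'TGG', 'GCC']
print(len(codons))  # 10 complete codons; the final 'GC' is left over
```

The first letters of these codons are the start of what `my_seq[0::3]` returns, which is one way to check that the two views agree.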

Another stride trick you might have seen with a Python string is the use of a -1 stride to reverse the string. You can do this with a `Seq` object too:

In [17]:
my_seq[::-1]

Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG')

## 3.3 Turning Seq objects into strings

If you just need a plain Python string, for example to write to a file or to use with string formatting, call `str()` on the `Seq` object. The `print` function and the `%s` placeholder do this conversion for you:

In [18]:
str(my_seq)

'GATCGATGGGCCTATATAGGATCGAAAATCGC'

In [19]:
print(my_seq)

GATCGATGGGCCTATATAGGATCGAAAATCGC


In [21]:
fasta_format_string = ">Name\n%s\n" % my_seq
print(fasta_format_string)

>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC



This line of code constructs a simple FASTA format record (without worrying about line wrapping). Section 4.6 describes a neat way to get a FASTA formatted string from a `SeqRecord` object, while the more general topic of reading and writing FASTA format sequence files is covered in Chapter 5.
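If you do want line wrapping in a hand-built record, a small helper can split the sequence into fixed-width lines. This is only a sketch: `to_fasta` is a hypothetical name, and the default width of 60 is an assumption chosen for illustration rather than a requirement of the format.

```python
def to_fasta(name, sequence, width=60):
    """Build a FASTA record as a string, wrapping the sequence lines."""
    sequence = str(sequence)
    lines = [">%s" % name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# A narrow width is used here just to make the wrapping visible.
print(to_fasta("Name", "GATCGATGGGCCTATATAGGATCGAAAATCGC", width=10))
```

For real files, the `SeqRecord` and `Bio.SeqIO` routes described in the later chapters are likely the better choice.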

## 3.4 Concatenating or adding sequences

In [23]:
from Bio.Seq import Seq

protein_seq = Seq("EVRNAK")
dna_seq = Seq("ACGT")

# Deliberately mixing DNA and protein like this is likely a mistake though...
protein_seq + dna_seq 


Seq('EVRNAKACGT')

In [24]:
# You may often have many sequences to add together, which can be done with a for loop like this:
from Bio.Seq import Seq
list_of_seqs = [Seq("ACGT"), Seq("AACC"), Seq("GGTT")]
concatenated = Seq("")

for seq in list_of_seqs:
    concatenated += seq

print(concatenated)

ACGTAACCGGTT


In [25]:
# Like Python strings, the Biopython Seq also has a .join method:
from Bio.Seq import Seq
contigs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
spacer = Seq("N"*10)
spacer.join(contigs)

Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA')

## 3.5 Changing case

In [26]:
# Python strings have very useful upper and lower methods for changing the case. For example,
from Bio.Seq import Seq
dna_seq = Seq("acgtACGT")
dna_seq

Seq('acgtACGT')

In [27]:
dna_seq.upper()

Seq('ACGTACGT')

In [29]:
dna_seq.lower()

Seq('acgtacgt')

In [30]:
"GTAC" in dna_seq

False

In [31]:
"GTAC" in dna_seq.upper()

True

## 3.6 Nucleotide sequences and (reverse) complements

For nucleotide sequences, you can easily obtain the complement or reverse complement of a `Seq` object using its built-in methods:

In [32]:
from Bio.Seq import Seq

In [33]:
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC')

In [34]:
my_seq.complement()

Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG')

In [35]:
my_seq.reverse_complement()

Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC')

In [36]:
# As mentioned earlier, an easy way to just reverse a Seq object (or a Python string) is to slice it with a -1 step
my_seq[::-1]


Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG')

If you do accidentally end up trying to do something weird like taking the (reverse) complement of a protein sequence, the results are biologically meaningless:

In [37]:
from Bio.Seq import Seq
protein_seq = Seq("EVRNAK")
protein_seq.complement()

Seq('EBYNTM')

Here the letter “E” is not a valid IUPAC ambiguity code for nucleotides, so was not complemented. However, `V` means `A`, `C` or `G` and has complement `B`, and so on.
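The behaviour above follows from a letter-by-letter lookup: every IUPAC nucleotide code maps to the code for the complementary set of bases, and any other letter is left alone. The sketch below reproduces that idea on plain strings with `str.maketrans`. The function names are made up, and this is not a copy of Biopython's code.

```python
# IUPAC DNA letters and their complements, position by position.
LETTERS = "ACGTRYSWKMBDHVN"
COMPLEMENTS = "TGCAYRSWMKVHDBN"
COMPLEMENT_TABLE = str.maketrans(
    LETTERS + LETTERS.lower(), COMPLEMENTS + COMPLEMENTS.lower()
)


def complement_sketch(sequence):
    """Complement each letter; letters not in the table pass through."""
    return str(sequence).translate(COMPLEMENT_TABLE)


def reverse_complement_sketch(sequence):
    """Complement, then reverse with a -1 stride."""
    return complement_sketch(sequence)[::-1]


print(complement_sketch("EVRNAK"))  # EBYNTM
print(reverse_complement_sketch("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
```

Here `E` passes through unchanged because it is not in the table, while `V` becomes `B`, matching the output shown above.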

The example in Section 5.5.3 combines the `Seq` object’s reverse complement method with `Bio.SeqIO` for sequence input/output.

## 3.7 Transcription
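Before talking about transcription, I want to try to clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide:

    DNA coding strand
    5'  ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG  3'
        |||||||||||||||||||||||||||||||||||||||
    3'  TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC  5'
    DNA template strand

                       ||
                  Transcription
                       ||
                       \/

    5'  AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG  3'
    Single stranded mRNA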

The actual biological transcription process works from the template strand, doing a reverse complement (`TCAG → CUGA`) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching `T → U`.

Now let’s actually get down to doing a transcription in Biopython. First, let’s create `Seq` objects for the coding and template DNA strands:

In [38]:
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

In [39]:
template_dna = coding_dna.reverse_complement()
template_dna

Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT')

These should match the figure above - remember by convention nucleotide sequences are normally read from the 5’ to 3’ direction, while in the figure the template strand is shown reversed.

Now let’s transcribe the coding strand into the corresponding mRNA, using the `Seq` object’s built-in `transcribe` method:


In [40]:
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

In [43]:
messenger_rna = coding_dna.transcribe()
messenger_rna


Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

As you can see, all this does is to replace T by U.

If you do want to do a true biological transcription starting with the template strand, then this becomes a two-step process:

In [44]:
template_dna.reverse_complement().transcribe()

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

The `Seq` object also includes a back-transcription method, `back_transcribe()`, for going from the mRNA to the coding strand of the DNA. Again, this is a simple `U → T` substitution.
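Since both directions are described as plain letter substitutions, they can be sketched on ordinary strings with `str.replace`. The helper names `dna_to_rna` and `rna_to_dna` are invented for this example, to avoid confusion with the Biopython functions.

```python
def dna_to_rna(coding_dna):
    """Transcribe a coding strand by switching T to U."""
    return str(coding_dna).replace("T", "U").replace("t", "u")


def rna_to_dna(messenger_rna):
    """Back-transcribe an mRNA by switching U to T."""
    return str(messenger_rna).replace("U", "T").replace("u", "t")


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
mrna = dna_to_rna(coding)
print(mrna)              # AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
print(rna_to_dna(mrna))  # ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
```

Going to the mRNA and back returns the original coding strand, which is what you would expect from a pair of inverse substitutions.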

## 3.8 Translation

Sticking with the same example discussed in the transcription section above, now let’s translate this mRNA into the corresponding protein sequence - again taking advantage of one of the `Seq` object’s biological methods:

In [45]:
from Bio.Seq import Seq
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
messenger_rna
messenger_rna.translate()

Seq('MAIVMGR*KGAR*')

In [46]:
# You can also translate directly from the coding strand DNA sequence
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

In [47]:
coding_dna.translate()

Seq('MAIVMGR*KGAR*')

You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).

The translation tables available in Biopython are based on those from the NCBI (see the next section of this tutorial). By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:


In [48]:
coding_dna.translate(table="Vertebrate Mitochondrial")

Seq('MAIVMGRWKGAR*')

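You can also specify the table using the NCBI table number, which is shorter and is often included in the feature annotation of GenBank files (`table=2` selects the Vertebrate Mitochondrial code). You may also want to translate only up to the first in-frame stop codon, as happens in nature, which is what the `to_stop` argument does. The following cells compare these options: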

In [49]:
coding_dna.translate()

Seq('MAIVMGR*KGAR*')

In [50]:
coding_dna.translate(to_stop=True)

Seq('MAIVMGR')

In [51]:
coding_dna.translate(table=2)

Seq('MAIVMGRWKGAR*')

In [52]:
coding_dna.translate(table=2, to_stop=True)

Seq('MAIVMGRWKGAR')

Notice that when you use the `to_stop` argument, the stop codon itself is not translated - and the stop symbol is not included at the end of your protein sequence.

You can even specify the stop symbol if you don’t like the default asterisk:

In [53]:
coding_dna.translate(table=2, stop_symbol="@")

Seq('MAIVMGRWKGAR@')
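To see how these options interact, here is a minimal plain-Python sketch of codon-by-codon translation with the standard genetic code. The 64-letter string lists the amino acids in the compact T, C, A, G ordering used by the NCBI tables. The function name `translate_sketch` and its structure are assumptions made for this illustration, not how Biopython implements `translate()`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1): TTT, TTC, TTA, TTG, TCT, ... GGG.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), STANDARD_AAS)
}


def translate_sketch(sequence, table=STANDARD_TABLE, to_stop=False, stop_symbol="*"):
    """Translate unambiguous DNA or RNA; raises KeyError on other codons."""
    dna = str(sequence).upper().replace("U", "T")
    protein = []
    # Trailing letters that do not form a full codon are ignored.
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*":
            if to_stop:
                break  # stop before the stop codon; no symbol is added
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate_sketch(coding))                   # MAIVMGR*KGAR*
print(translate_sketch(coding, to_stop=True))     # MAIVMGR
print(translate_sketch(coding, stop_symbol="@"))  # MAIVMGR@KGAR@
```

Passing a different dictionary as `table` would model another genetic code; ambiguous codons and start-codon handling are deliberately left out of this sketch.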

Now, suppose you have a complete coding sequence (**CDS**), which is to say a nucleotide sequence (e.g. mRNA – after any splicing) which is a whole number of codons (i.e. the length is a multiple of three), commences with a start codon, ends with a stop codon, and has no internal in-frame stop codons. In general, given a complete CDS, the default translate method will do what you want (perhaps with the `to_stop` option). However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria – for example the gene yaaX in `E. coli K12`:

In [54]:
gene = Seq(
    "GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
    "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
    "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
    "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
    "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA"
)

In [56]:
print(gene.translate(table="Bacterial"))

VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR*


In [58]:
print(gene.translate(table="Bacterial", to_stop=True))

VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDHGWWKQHYEWRGNRWHLHGPPPPPRHHKKAPHDHHGGHGPGKHHR


In [59]:
gene.translate(table="Bacterial", cds=True)

Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR')

In addition to telling Biopython to translate an alternative start codon as methionine, using this option also makes sure your sequence really is a valid CDS (you’ll get an exception if not).

The example in Section 20.1.3 combines the `Seq` object’s translate method with `Bio.SeqIO` for sequence input/output.
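The CDS conditions listed earlier in this section (whole number of codons, start codon first, stop codon last, no internal in-frame stop) can be written down as a small check on a plain string. This is a sketch only: `check_cds` is a hypothetical name, and the default start and stop codon sets are illustrative assumptions rather than a complete table for any particular genetic code.

```python
def check_cds(sequence,
              start_codons=("ATG", "GTG", "TTG"),   # illustrative subset only
              stop_codons=("TAA", "TAG", "TGA")):
    """Raise ValueError if the sequence is not a complete CDS."""
    dna = str(sequence).upper()
    if len(dna) % 3 != 0:
        raise ValueError("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if not codons or codons[0] not in start_codons:
        raise ValueError("does not begin with a start codon")
    if codons[-1] not in stop_codons:
        raise ValueError("does not end with a stop codon")
    if any(codon in stop_codons for codon in codons[1:-1]):
        raise ValueError("contains an internal in-frame stop codon")
    return True


print(check_cds("GTGAAAAAGTAA"))  # True: GTG start, TAA stop, no internal stop

try:
    check_cds("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
except ValueError as error:
    print("Not a valid CDS:", error)  # the internal TGA is reported
```

The second call uses the coding strand from the transcription section, which fails the check because of its deliberate internal stop codon.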

## 3.9 Translation Tables

In the previous sections we talked about the `Seq` object translation method (and mentioned the equivalent function in the `Bio.Seq` module – see Section 3.14). Internally these use codon table objects derived from the NCBI information at [ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt](ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt), also shown on [https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) in a much more readable layout.

As before, let’s just focus on two choices: the Standard translation table, and the translation table for Vertebrate Mitochondrial DNA.

In [1]:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

# Alternatively, these tables are labeled with ID numbers 1 and 2, respectively:
standard_table = CodonTable.unambiguous_dna_by_id[1]
mito_table = CodonTable.unambiguous_dna_by_id[2]

In [2]:
print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [4]:
print(mito_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [5]:
mito_table.stop_codons

['TAA', 'TAG', 'AGA', 'AGG']

In [6]:
mito_table.start_codons

['ATT', 'ATC', 'ATA', 'ATG', 'GTG']

In [7]:
mito_table.forward_table["ACG"]

'T'
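Comparing the two printed tables by eye is error-prone, so a short script can list the codons whose meaning differs. The sketch below does not use Biopython; it encodes each table as a 64-letter amino acid string in T, C, A, G order, transcribed from the two tables printed above, with `*` standing for a stop codon.

```python
from itertools import product

CODONS = ["".join(codon) for codon in product("TCAG", repeat=3)]

# Amino acids in the order TTT, TTC, TTA, TTG, TCT, ... GGG ('*' = stop).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

differences = {
    codon: (standard_aa, mito_aa)
    for codon, standard_aa, mito_aa in zip(CODONS, STANDARD, VERT_MITO)
    if standard_aa != mito_aa
}

for codon, (standard_aa, mito_aa) in differences.items():
    print(codon, standard_aa, "->", mito_aa)
# TGA * -> W
# ATA I -> M
# AGA R -> *
# AGG R -> *
```

These four codons account for the differences seen earlier, such as the internal `TGA` being read as `W` when table 2 is used.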

## 3.10 Comparing Seq objects

Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two sequences are equal. The basic problem is the meaning of the letters in a sequence are context dependent - the letter 'A' could be part of a DNA, RNA or protein sequence. Biopython can track the molecule type, so comparing two `Seq` objects could mean considering this too.

Should a DNA fragment `ACG` and an RNA fragment `ACG` be equal? What about the peptide `ACG`? Or the Python string `ACG`? In everyday use, your sequences will generally all be the same type (all DNA, all RNA, or all protein). Well, as of Biopython 1.65, sequence comparison only looks at the sequence and compares like the Python string.

In [8]:
from Bio.Seq import Seq
seq1 = Seq("ACGT")

In [9]:
"ACGT" == seq1

True

In [10]:
seq1 == "ACGT"

True

## 3.11 Sequences with unknown sequence contents

In some cases, the length of a sequence may be known but not the actual letters constituting it. For example, GenBank and EMBL files may represent a genomic DNA sequence only by its contig information, without specifying the sequence contents explicitly. Such sequences can be represented by creating a `Seq` object with the argument `None`, followed by the sequence length:

In [11]:
from Bio.Seq import Seq
unknown_seq = Seq(None, 10)

The `Seq` object thus created has a well-defined length. Any attempt to access the sequence contents, however, will raise an `UndefinedSequenceError`.

In [12]:
unknown_seq

Seq(None, length=10)

In [13]:
len(unknown_seq)

10

In [None]:
print(unknown_seq)

## 3.12 Mutable Seq objects

Just like the normal Python string, the `Seq` object is “read only”, or in Python terminology, immutable. Apart from wanting the `Seq` object to act like a string, this is also a useful default since in many biological applications you want to ensure you are not changing your sequence data:

In [16]:
from Bio.Seq import Seq
my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")

In [17]:
my_seq[5] = "G"

TypeError: 'Seq' object does not support item assignment

In [18]:
from Bio.Seq import MutableSeq
mutable_seq = MutableSeq(my_seq)
mutable_seq

# Alternatively, you can create a MutableSeq object directly from a string:
mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")


MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [20]:
mutable_seq[5] = "C"
mutable_seq 

MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [21]:
mutable_seq.remove("T")
mutable_seq

MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [22]:
mutable_seq.reverse()
mutable_seq

MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG')

Do note that unlike the `Seq` object, the `MutableSeq` object’s methods like `reverse_complement()` and `reverse()` act in-situ!

An important technical difference between mutable and immutable objects in Python means that you can’t use a `MutableSeq` object as a dictionary key, but you can use a Python string or a `Seq` object in this way.
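The same rule applies to Python's built-in types, which makes it easy to try out without Biopython. In the sketch below, `str` and `bytes` stand in for the immutable case and `bytearray` for the mutable one; these are analogies chosen for illustration, not the `Seq` and `MutableSeq` classes themselves.

```python
frozen = "GCCATTGTAATGG"                # immutable stand-in (str)
editable = bytearray(b"GCCATTGTAATGG")  # mutable stand-in (bytearray)

lookup = {frozen: "immutable keys are fine"}

try:
    lookup[editable] = "this assignment fails"
    mutable_key_accepted = True
except TypeError as error:
    mutable_key_accepted = False
    print("Rejected as a key:", error)

# Converting to an immutable type first makes the value usable as a key.
lookup[bytes(editable)] = "converted copy works"
print(len(lookup))  # 2
```

Dictionary keys must be hashable, and Python's mutable built-in containers are not, because a key whose contents changed after insertion could no longer be found.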

Once you have finished editing your `MutableSeq` object, it’s easy to get back to a read-only `Seq` object should you need to:

In [23]:
from Bio.Seq import Seq
new_seq = Seq(mutable_seq)
new_seq

Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG')

## 3.13 UnknownSeq objects
Note that `UnknownSeq` is deprecated. To represent a sequence of known length but unknown sequence contents, please use `Seq(None, length)`.

The `UnknownSeq` object is a subclass of the basic `Seq` object and its purpose is to represent a sequence where we know the length, but not the actual letters making it up. You could of course use a normal `Seq` object in this situation, but it wastes rather a lot of memory to hold a string of a million `N` characters when you could just store a single letter `N` and the desired length as an integer.

In [24]:
from Bio.Seq import UnknownSeq
unk = UnknownSeq(20)
unk



UnknownSeq(20, character='?')

## 3.14 Working with strings directly

To close this chapter, for those of you who really don’t want to use the sequence objects (or who prefer a functional programming style to an object-oriented one), there are module-level functions in `Bio.Seq` which will accept plain Python strings, `Seq` objects (including `UnknownSeq` objects) or `MutableSeq` objects:

In [25]:
from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"

In [26]:
reverse_complement(my_string)

'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'

In [27]:
transcribe(my_string)

'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'

In [28]:
back_transcribe(my_string)

'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'

In [29]:
translate(my_string)

'AVMGRWKGGRAAG*'