# 5. Sequence Objects
Some further reading on sequence objects


## 5.7 Comparing Seq objects


You might argue that the two DNA Seq objects Seq("ACGT", IUPAC.unambiguous_dna)
and Seq("ACGT", IUPAC.ambiguous_dna) should be equal, even though they have different alphabets.
Depending on the context, this could be important.


This gets worse: suppose you think Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT") (i.e. the
default generic alphabet) should be equal. Then, logically, Seq("ACGT", IUPAC.protein) and Seq("ACGT")
should also be equal. Now, in logic, if A = B and B = C, by transitivity we expect A = C. So for logical
consistency we would require Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.protein) to be
equal, which most people would agree is just not right. This transitivity also has implications for using Seq
objects as Python dictionary keys.
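
The transitivity problem can be reproduced without Biopython. The following is only a sketch: `ToySeq` and its "lenient" equality rule are hypothetical, written to illustrate the argument above, and are not how Biopython compares sequences.

```python
class ToySeq:
    """Hypothetical stand-in for a sequence with an optional alphabet label."""

    def __init__(self, letters, alphabet=None):
        self.letters = letters
        self.alphabet = alphabet  # None plays the role of the generic alphabet

    def __eq__(self, other):
        if not isinstance(other, ToySeq):
            return NotImplemented
        # Lenient rule (an assumption for this sketch): a generic alphabet
        # matches anything, otherwise the alphabets must agree.
        compatible = (
            self.alphabet is None
            or other.alphabet is None
            or self.alphabet == other.alphabet
        )
        return compatible and self.letters == other.letters

    # Equality this inconsistent cannot support a sensible hash.
    __hash__ = None


dna = ToySeq("ACGT", "dna")
generic = ToySeq("ACGT")
protein = ToySeq("ACGT", "protein")

dna_equals_generic = dna == generic          # A == B holds
generic_equals_protein = generic == protein  # B == C holds
dna_equals_protein = dna == protein          # but A == C does not
```

With this rule the first two comparisons succeed while the third fails, so equality is no longer transitive, and objects like these could not behave predictably as dictionary keys.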

In [57]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
seq2 = Seq("ACGT", IUPAC.ambiguous_dna)
str(seq1) == str(seq2)
#str(seq1) == str(seq1)

True

As of Biopython 1.65, sequence comparison only looks at the sequence, essentially ignoring the alphabet:

In [58]:
seq1 == seq2

True

In [59]:
seq1 == "ACGT"

True

As an extension of this, using sequence objects as keys in a Python dictionary is now equivalent to using
the sequence as a plain string for the key.
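
To see what "equivalent to using the plain string" means for a dictionary, here is a minimal sketch. `StringLikeSeq` is a hypothetical class, not part of Biopython, that compares and hashes purely by its letters.

```python
class StringLikeSeq:
    """Hypothetical sequence that compares and hashes like its string."""

    def __init__(self, letters, alphabet=None):
        self.letters = letters
        self.alphabet = alphabet  # kept for information, ignored when comparing

    def __str__(self):
        return self.letters

    def __eq__(self, other):
        return str(self) == str(other)

    def __hash__(self):
        # Same hash as the plain string, so both find the same dictionary slot.
        return hash(str(self))


counts = {StringLikeSeq("ACGT", "dna"): 1}

# A plain string with the same letters addresses the same entry...
counts["ACGT"] += 1

# ...and so does an object with a different alphabet label.
counts[StringLikeSeq("ACGT", "protein")] += 1

number_of_keys = len(counts)
final_count = counts["ACGT"]
```

All three keys land on a single entry, which is the behaviour described above: the alphabet plays no part in the lookup.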

Note that if you compare sequences with incompatible alphabets (e.g. DNA vs RNA, or nucleotide versus
protein), you will get a warning, but for the comparison itself only the string of letters in the sequence
is used:

In [60]:
from Bio.Alphabet import generic_dna, generic_protein
dna_seq = Seq("ACGT", generic_dna)
prot_seq = Seq("ACGT", generic_protein)
dna_seq == prot_seq



True

## 5.8 MutableSeq objects
Just like the normal Python string, the Seq object is "read only" (immutable).

This is also a useful default, since in many biological
applications you want to ensure you are not changing your sequence data.

In [None]:
my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
my_seq[5]

In [None]:
# This won't work because it's not possible to change the sequence in a Seq object
# my_seq[5] = "G"

However, you can convert it into a mutable sequence (a MutableSeq object):

In [None]:
mutable_seq = my_seq.tomutable()
mutable_seq

In [None]:
# create a MutableSeq object directly from a string
from Bio.Seq import MutableSeq
mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

In [None]:
mutable_seq

In [None]:
mutable_seq[5] = "G"
mutable_seq

In [None]:
# Remove the first T
mutable_seq.remove("T")
mutable_seq

In [None]:
mutable_seq.reverse()
mutable_seq

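Note that, unlike the Seq object, the MutableSeq object's methods such as reverse_complement() and
reverse() act in place!

Because it is mutable, you can't use a MutableSeq object as a dictionary key, but you can use a Python string or a Seq object in this
way.

Once you have finished editing a MutableSeq object, you can convert it back to a read-only Seq object: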


In [None]:
new_seq = mutable_seq.toseq()
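
The in-place behaviour and the dictionary-key restriction described in this section are general Python properties of mutable objects. As a rough analogy only (this sketch uses the built-in `str`, `bytearray` and `bytes` types, not Biopython classes), the same pattern looks like this:

```python
immutable = "GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
mutable = bytearray(immutable, "ascii")

# An in-place method changes the object itself and returns None.
result = mutable.reverse()
reversed_text = mutable.decode("ascii")

# A string is never changed; slicing builds a new string instead.
reversed_copy = immutable[::-1]

# Immutable objects are hashable, so they can be dictionary keys.
lookup = {immutable: "original"}

# Mutable objects are not hashable: hash() raises TypeError.
try:
    hash(mutable)
    mutable_is_hashable = True
except TypeError:
    mutable_is_hashable = False

# Converting to an immutable type gives a usable key again,
# loosely comparable to converting a mutable sequence back to a read-only one.
frozen = bytes(mutable)
lookup[frozen] = "reversed"
```

The design trade-off is the usual one: in-place edits are convenient, but an object whose contents can change cannot safely be hashed.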

## 5.9 UnknownSeq objects
The UnknownSeq object is a subclass of the basic Seq object, and its purpose is to represent a sequence where
we know the length, but not the actual letters making it up.

A normal Seq object would waste a lot of memory holding a string of a million "N" characters, when you could instead store a single letter "N" and the desired length as an integer:

In [None]:
from Bio.Seq import UnknownSeq
unk = UnknownSeq(20)
unk

In [None]:
len(unk)
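
The memory argument can be illustrated with a small pure-Python sketch. `ToyUnknownSeq` is a hypothetical class written for this example; it is not Biopython's implementation, only the same idea of storing one character plus a length.

```python
import sys


class ToyUnknownSeq:
    """Hypothetical sketch: keep only a length and a placeholder character."""

    def __init__(self, length, character="?"):
        if length < 0:
            raise ValueError("length must not be negative")
        self._length = length
        self._character = character

    def __len__(self):
        return self._length

    def __str__(self):
        # The full string is only built on demand.
        return self._character * self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice of unknown letters is just a shorter unknown sequence.
            new_length = len(range(*index.indices(self._length)))
            return ToyUnknownSeq(new_length, self._character)
        if not -self._length <= index < self._length:
            raise IndexError("index out of range")
        return self._character


million = ToyUnknownSeq(1_000_000, "N")
full_string = "N" * 1_000_000

# sys.getsizeof is shallow, but the only attributes here are one int
# and one single-character string, so the real footprint stays tiny.
compact_size = sys.getsizeof(million)
string_size = sys.getsizeof(full_string)
```

Here `string_size` is at least a million bytes, while the compact object stays small no matter how long the sequence it represents.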

You can also specify an alphabet, meaning that for nucleotide sequences the letter defaults to "N" and
for proteins "X", rather than just "?".

In [None]:
unk_dna = UnknownSeq(20, alphabet = IUPAC.ambiguous_dna)
unk_dna

In [None]:
print(unk_dna)

You can use all the usual Seq object methods too. Note that these give back memory-saving UnknownSeq
objects where appropriate, as you might expect:

In [None]:
unk_dna
unk_dna.reverse_complement()
unk_dna.complement()
unk_dna.transcribe()
unk_prot = unk_dna.translate()
unk_prot

You may be able to find a use for the `UnknownSeq` object in your own code, but it is more likely that you
will first come across them in a `SeqRecord` object created by `Bio.SeqIO`. Some sequence file
formats don't always include the actual sequence, for example GenBank and EMBL files may include a list
of features but for the sequence just present the contig information (set of overlapping DNA segments that together represent a consensus region of DNA). Alternatively, the QUAL² files used in sequencing work hold quality scores but they never contain a sequence - instead there is a partner FASTA file which does have the sequence.
 
 ² A QUAL file contains quality scores for one or more biological sequences stored in a corresponding FASTA file. QUAL format is very similar to FASTA format: it stores records sequentially, with each record beginning with a header line containing a sequence ID and description.

## 5.10 Working with strings directly
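Not keen on object-oriented programming? A functional style is possible as well: Bio.Seq provides module-level functions that accept plain Python strings as well as Seq, MutableSeq and UnknownSeq objects. Working with Seq objects by default is nevertheless encouraged.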
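
As an illustration of the functional style on plain strings, here is a pure-Python sketch. It does not call Biopython; the two helper functions below are simplified stand-ins written for this example, and they assume unambiguous DNA letters only.

```python
# Translation table for complementing unambiguous DNA (an assumption:
# ambiguity codes such as N or R are left unchanged by this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a plain DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA string."""
    return dna.replace("T", "U").replace("t", "u")


my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"

# Functions take a string and return a new string; nothing is modified.
rc_string = reverse_complement(my_string)
rna_string = transcribe(my_string)
```

Each function takes a string and returns a new one, so calls can be chained freely, at the cost of losing the extra information a sequence object would carry.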