# Biopython Sequences

In [43]:
%pip install biopython

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## DNA sequences

Let's start with a DNA sequence. Here are **various operations** we can perform on a sequence, bearing in mind that the sequence itself is immutable:

In [6]:
from Bio.Seq import Seq

random_seq = 'agggcctatagctcaacagcgggggcaaa'
dna = Seq(random_seq)  # wrap the plain string in a Seq object

print('DNA sequence:', dna)   
print('Single element:', dna[5])
print('GG count:', dna.count('gg'), '\n')

sub_dna = dna[15:]
print('Slice:', sub_dna)
dna = dna + sub_dna
print('Revised sequence:', dna)   
print('Length:', len(dna), '\n')  

print('Uppercase:', dna.upper())  
print('Complement:', dna.complement()) 
print('Reverse complement:', dna.reverse_complement()) 

DNA sequence: agggcctatagctcaacagcgggggcaaa
Single element: c
GG count: 3 

Slice: acagcgggggcaaa
Revised sequence: agggcctatagctcaacagcgggggcaaaacagcgggggcaaa
Length: 43 

Uppercase: AGGGCCTATAGCTCAACAGCGGGGGCAAAACAGCGGGGGCAAA
Complement: tcccggatatcgagttgtcgcccccgttttgtcgcccccgttt
Reverse complement: tttgcccccgctgttttgcccccgctgttgagctataggccct

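As a rough illustration of what `complement()` and `reverse_complement()` compute, here is a minimal plain-Python sketch. The helper functions below are hypothetical stand-ins written for this note, not Biopython's implementation, and they assume the sequence contains only the unambiguous DNA letters a, c, g and t (in either case).

```python
# Plain-Python sketch; these helpers are illustrative, not Biopython's own code.
# Assumption: only unambiguous DNA letters (a, c, g, t) appear in the input.
_COMPLEMENT = str.maketrans('acgtACGT', 'tgcaTGCA')


def complement(seq: str) -> str:
    """Swap each base for its pairing partner (a<->t, c<->g)."""
    return seq.translate(_COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement the sequence, then read it in the opposite direction."""
    return complement(seq)[::-1]


seq = 'agggcctatagctcaacagcgggggcaaa'
print('Complement:', complement(seq))
print('Reverse complement:', reverse_complement(seq))
```

The reverse complement is the sequence of the opposite strand read in its own 5' to 3' direction, which is why it is usually the more useful of the two operations.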

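Like Python strings, `Seq` objects are **immutable**. If you know you will need to modify a sequence repeatedly, you can convert it to a **mutable sequence object** (`tomutable()` is the spelling used by older Biopython releases; newer ones construct the object with `MutableSeq(dna)` instead):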

In [9]:
mutable_seq = dna.tomutable()
print(type(mutable_seq))
print('Sequence:', mutable_seq)
mutable_seq.reverse()
print('Reversed:', mutable_seq)
mutable_seq[1] = 'g'
print('Modified 1:', mutable_seq)
mutable_seq[-3:-1] = mutable_seq[2:4]
print('Modified 2:', mutable_seq)
mutable_seq.remove('g')
print('Modified 3:', mutable_seq)

<class 'Bio.Seq.MutableSeq'>
Sequence: agggcctatagctcaacagcgggggcaaaacagcgggggcaaa
Reversed: aaacgggggcgacaaaacgggggcgacaactcgatatccggga
Modified 1: agacgggggcgacaaaacgggggcgacaactcgatatccggga
Modified 2: agacgggggcgacaaaacgggggcgacaactcgatatccgaca
Modified 3: aacgggggcgacaaaacgggggcgacaactcgatatccgaca

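The in-place edits above closely mirror what a Python list of characters supports. The following sketch replays the same four steps on a plain list. This is only an analogy to show the semantics of each operation and makes no claim about how `MutableSeq` stores its data internally.

```python
# Sketch: replay the MutableSeq edits on a plain list of characters.
# This is an analogy only, not a description of MutableSeq internals.
bases = list('agggcctatagctcaacagcgggggcaaaacagcgggggcaaa')

bases.reverse()            # reverse the order in place
bases[1] = 'g'             # overwrite a single position
bases[-3:-1] = bases[2:4]  # overwrite a two-letter slice with another slice
bases.remove('g')          # delete only the first occurrence of 'g'

result = ''.join(bases)
print('Modified 3:', result)
```

Note that `remove` deletes a single letter, so the sequence ends up one base shorter than it started.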

## Transcription and translation

Now we will **transcribe** the DNA into mRNA (and **reverse-transcribe** the mRNA into DNA too): 

In [11]:
mrna = dna.transcribe()
print('Original DNA:', dna)
print('mRNA sequence:', mrna)
print('Reverse transcription:', mrna.back_transcribe())

Original DNA: agggcctatagctcaacagcgggggcaaaacagcgggggcaaa
mRNA sequence: agggccuauagcucaacagcgggggcaaaacagcgggggcaaa
Reverse transcription: agggcctatagctcaacagcgggggcaaaacagcgggggcaaa

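As the output shows, `transcribe()` treats the DNA as the coding strand: it does not complement anything, it only swaps thymine for uracil. A minimal plain-Python sketch of that behaviour, using hypothetical helper names of our own, could look like this:

```python
# Sketch of coding-strand transcription; helper names are illustrative only.
def transcribe(dna: str) -> str:
    """Replace thymine with uracil, keeping the letter case."""
    return dna.replace('t', 'u').replace('T', 'U')


def back_transcribe(rna: str) -> str:
    """Replace uracil with thymine, keeping the letter case."""
    return rna.replace('u', 't').replace('U', 'T')


dna_text = 'agggcctatagctcaacagcgggggcaaa'
rna_text = transcribe(dna_text)
print('mRNA sequence:', rna_text)
print('Round trip matches:', back_transcribe(rna_text) == dna_text)
```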

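**Translation** to protein adds the potential for "biological error" to creep in, because the result depends on the reading frame and on the genetic code (codon table) you choose. Let's start with our existing mRNA sequence. Its length (43 bases) is not a multiple of 3, so the last base is left over as a "partial codon". (Biopython flags a warning the first time you run the code.) The second half of the cell repeats the translation with the vertebrate mitochondrial table, which reads the first codon, `agg`, as a stop (`*`) rather than arginine (`R`). The `alphabet` attribute printed on the last line exists only in Biopython releases before 1.78.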

In [12]:
prot = mrna.translate()
print('Default protein sequence:', prot)
print('Same sequence direct from DNA:', dna.translate())
print('protein length:', len(prot), '\n')

mitochondrial_prot = mrna.translate(table="Vertebrate Mitochondrial")
print('Mitochondrial protein sequence:', mitochondrial_prot)
print('Mitochondrial protein alphabet:', mitochondrial_prot.alphabet)

Default protein sequence: RAYSSTAGAKQRGQ
Same sequence direct from DNA: RAYSSTAGAKQRGQ
protein length: 14 

Mitochondrial protein sequence: *AYSSTAGAKQRGQ
Mitochondrial protein alphabet: HasStopCodon(ExtendedIUPACProtein(), '*')

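To make the role of the codon table concrete, here is a minimal plain-Python sketch of translation. It is not Biopython's implementation: the `translate` helper and the table names are our own, the tables are built from the NCBI standard code (table 1) listed in TCAG order, and the vertebrate mitochondrial code (table 2) is expressed as four overrides of the standard one. The sketch assumes unambiguous bases only and silently drops a trailing partial codon instead of warning.

```python
from itertools import product

BASES = 'TCAG'
# Amino acids for all 64 codons in TCAG order (NCBI translation table 1).
STANDARD_AA = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
STANDARD = {
    ''.join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AA)
}

# Vertebrate mitochondrial code (NCBI table 2), written as its
# differences from the standard code.
VERT_MITO = {**STANDARD, 'AGA': '*', 'AGG': '*', 'ATA': 'M', 'TGA': 'W'}


def translate(seq: str, table=STANDARD) -> str:
    """Translate codon by codon, ignoring a trailing partial codon."""
    dna = seq.upper().replace('U', 'T')  # accept DNA or RNA letters
    usable = len(dna) - len(dna) % 3     # drop 1 or 2 leftover bases
    return ''.join(table[dna[i:i + 3]] for i in range(0, usable, 3))


mrna_text = 'agggccuauagcucaacagcgggggcaaaacagcgggggcaaa'
print('Standard:     ', translate(mrna_text))
print('Mitochondrial:', translate(mrna_text, VERT_MITO))
```

The two results differ only in the first residue, because `AGG` is the only codon in this sequence that the two tables read differently.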



In [14]:
help(mitochondrial_prot.alphabet)

Help on HasStopCodon in module Bio.Alphabet object:

class HasStopCodon(AlphabetEncoder)
 |  HasStopCodon(alphabet, stop_symbol='*')
 |  
 |  Alphabets which contain a stop symbol.
 |  
 |  Method resolution order:
 |      HasStopCodon
 |      AlphabetEncoder
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, alphabet, stop_symbol='*')
 |      Initialize the class.
 |  
 |  contains(self, other)
 |      Test if the other alphabet is contained in this one (OBSOLETE?).
 |      
 |      Returns a boolean.  This relies on the Alphabet subclassing
 |      hierarchy, and attempts to check the stop symbol.  This fails
 |      if the other alphabet does not have a stop symbol!
 |  
 |  ----------------------------------------------------------------------
 |  Methods inherited from AlphabetEncoder:
 |  
 |  __getattr__(self, key)
 |      Proxy method for accessing attributes of the wrapped alphabet.
 |  
 |  __repr__(self)
 |      Represent the alphabet encoder clas

The following code tests whether a **"mystery" DNA sequence** codes for a target protein, either on the forward strand or on its reverse complement:

In [15]:
mystery_dna = 'aaccgttgtagagttgtt'
target_prot = 'NNSTTV'

this_dna = Seq(mystery_dna)
mystery_prot = this_dna.translate()
if mystery_prot == target_prot:
    print('Found!')
else:
    mystery_rc = this_dna.reverse_complement()
    mystery_prot_reversed = mystery_rc.translate()
    if mystery_prot_reversed == target_prot:
        print('Found (reversed)!')
    else:
        print('Not found!')

Found (reversed)!


Are these two DNA sequences the same? If not, do they still produce **the same protein sequence** when translated using the Standard Code? Write a script that:
- Prints out `The same DNA` if the DNA sequences are the same
- Prints out `The same protein` if the DNA sequences are different but the protein sequences are the same
- Prints out `Different` if the protein sequences don't match


In [37]:
dna1 = Seq('aaccgttgtagagttgttaaccgttgtagagttgtt')
dna2 = Seq('AACCGTTGTAGAGTTGTCAACCGTTGTAGAGTTGTT')
dna2 = dna2.lower()  # comparison is case-sensitive, so normalise the case first
prot1 = dna1.translate()
prot2 = dna2.translate()

if dna1 == dna2:
    print('The same DNA')
elif prot1 == prot2:
    print('The same protein')
else:
    # Also reached when the proteins differ because the DNA lengths differ
    print('Different')

The same protein


# Sequence alignments

Let's read a Clustal alignment file and see what the resulting alignment object looks like:

In [38]:
from Bio import AlignIO
alignment = AlignIO.read('../data/hAPP.clustal', 'clustal')

In [39]:
print(alignment)

SingleLetterAlphabet() alignment with 12 rows and 772 columns
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...--- sp|P05067-2|A4_
--MLPGLALLLLAAWTARAL------------------------...MQN sp|P05067-10|A4
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-3|A4_
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-4|A4_
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-5|A4_
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-6|A4_
MDQLEDLLVL-------FINYVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-11|A4
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067|A4_HU
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-1|A4_
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-9|A4_
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-7|A4_
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...MQN sp|P05067-8|A4_


We can select only the first two sequences, using a slice:

In [40]:
print(alignment[:2])

SingleLetterAlphabet() alignment with 2 rows and 772 columns
--MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLN...--- sp|P05067-2|A4_
--MLPGLALLLLAAWTARAL------------------------...MQN sp|P05067-10|A4


We can also select the first 10 positions (columns) of the alignment, using a slice again:

In [41]:
print(alignment[:,:10])

SingleLetterAlphabet() alignment with 12 rows and 10 columns
--MLPGLALL sp|P05067-2|A4_
--MLPGLALL sp|P05067-10|A4
--MLPGLALL sp|P05067-3|A4_
--MLPGLALL sp|P05067-4|A4_
--MLPGLALL sp|P05067-5|A4_
--MLPGLALL sp|P05067-6|A4_
MDQLEDLLVL sp|P05067-11|A4
--MLPGLALL sp|P05067|A4_HU
--MLPGLALL sp|P05067-1|A4_
--MLPGLALL sp|P05067-9|A4_
--MLPGLALL sp|P05067-7|A4_
--MLPGLALL sp|P05067-8|A4_


Both can be combined: the first 10 positions, in the first two sequences:

In [42]:
print(alignment[:2,:10])

SingleLetterAlphabet() alignment with 2 rows and 10 columns
--MLPGLALL sp|P05067-2|A4_
--MLPGLALL sp|P05067-10|A4

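The two-part index works like `alignment[rows, columns]`. As a rough mental model, the sketch below reproduces the same selections on a plain list of strings. The three rows are the first 10 columns of the `P05067-2`, `P05067-10` and `P05067-11` entries copied from the output above. This is an analogy only, since a real alignment object also carries the record identifiers.

```python
# Sketch: mimic alignment[rows, columns] with a plain list of strings.
# The rows are copied from the first 10 columns printed above.
rows = [
    '--MLPGLALL',  # sp|P05067-2
    '--MLPGLALL',  # sp|P05067-10
    'MDQLEDLLVL',  # sp|P05067-11
]

first_two_rows = rows[:2]                     # like alignment[:2]
first_five_cols = [row[:5] for row in rows]   # like alignment[:, :5]
both = [row[:5] for row in rows[:2]]          # like alignment[:2, :5]

print(first_two_rows)
print(first_five_cols)
print(both)
```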