In [None]:
#Biopython Submission
#Dec 1 2020
#Zeke Van Dehy

# Biopython #

We have worked with modules before -- random and statistics, for example. Biopython is a large and fairly complex package that implements many of the simple sequence operations we've learned this semester, so you'll never have to write your own protein translation routine again unless you want to.

Biopython is big enough that you don't usually want to load the entire thing at once. Instead, use the syntax "from Bio.Xxx import xxxx" to load just the sub-module that you want.

A big reference for Biopython can be found here: https://biopython.org/wiki/Documentation

To see what one of the core Biopython classes has in it, import just the Seq class from the Bio.Seq module, and then use dir() to list its attributes and methods.

In [2]:
from Bio.Seq import Seq
dir(Seq)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'back_transcribe',
 'complement',
 'complement_rna',
 'count',
 'count_overlap',
 'encode',
 'endswith',
 'find',
 'index',
 'join',
 'lower',
 'lstrip',
 'reverse_complement',
 'reverse_complement_rna',
 'rfind',
 'rindex',
 'rsplit',
 'rstrip',
 'split',
 'startswith',
 'strip',
 'tomutable',
 'transcribe',
 'translate',
 'ungap',
 'upper']

# New object type: Seq #

Back in the early part of the class, when we learned about objects and the methods that can modify them, I mentioned that it was possible to make new types of objects (other than the ones we're used to like strings, lists, file objects, etc).

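Biopython has a couple of new object types with their own methods. A Biopython Seq object is sort of like a string, but with extra methods for biological sequences. In releases before Biopython 1.78, a Seq also carried an alphabet attribute that specified whether it was protein or DNA; that attribute was removed in 1.78, and the outputs in this notebook come from a release without it.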

First, create a new sequence object from the sequence in the cell below. 

We have to import the Seq class from the Bio.Seq module (a little confusing? -- yes.)

Then, to turn the string literal I've given you into a Seq object, call Seq("string") and then print the Seq object.

In [7]:
from Bio.Seq import Seq
my_seq = Seq("ACTTGATGCCGTATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACG")
print(my_seq)


ACTTGATGCCGTATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACG


# OK, maybe that's useful #

In older Biopython releases (before 1.78), the sequence type was built into the object through its alphabet attribute, so that you couldn't accidentally concatenate DNA and protein sequences. The alphabet had other uses within Biopython, but its basic function was telling you what kind of sequence you had and not letting you do anything stupid with it. From Biopython 1.78 on, the alphabet is gone, so keeping track of the sequence type is up to you.
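Where no alphabet attribute is available (Biopython 1.78 and later), one way to guard against mixing sequence types is to check the letters yourself. The sketch below is only an illustration: looks_like_dna is a hypothetical helper written for this example, not part of Biopython.

```python
from Bio.Seq import Seq

# Unambiguous DNA letters only; ambiguity codes such as N are deliberately excluded.
DNA_LETTERS = set("ACGT")


def looks_like_dna(seq):
    """Hypothetical helper: True if every letter is an unambiguous DNA base."""
    return set(str(seq).upper()) <= DNA_LETTERS


dna = Seq("ACTTGATGCC")
protein = Seq("KSMKPPRTHLIMHWIIL")

print(looks_like_dna(dna))      # True
print(looks_like_dna(protein))  # False
```

This is a heuristic, not a guarantee: a protein made only of alanine, cysteine, glycine, and threonine (A, C, G, T) would also pass the check.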

# String Methods on Sequences #

If you look at the directory of Seq using dir(Seq), you can probably see some familiar names of string methods. In the cells below, try out five string methods that you think will also work on a Biopython Seq object, just based on what you see is available in the directory.

In [4]:
my_seq = Seq("ACTTGATGCCGTATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACG")
print(len(my_seq))
print(my_seq[0:10])
print(my_seq.count("C"))

67
ACTTGATGCC
14
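Beyond len(), slicing, and .count(), several other names in the dir(Seq) listing behave like their string counterparts. A short sketch using the same sequence, restricted to methods that appear in that listing:

```python
from Bio.Seq import Seq

my_seq = Seq("ACTTGATGCCGTATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACG")

first_atg = my_seq.find("ATG")        # index of the first "ATG", or -1 if absent
starts_act = my_seq.startswith("ACT")
ends_acg = my_seq.endswith("ACG")
lowered = my_seq.lower()              # returns a new Seq, like str.lower() returns a new str
has_gatc = "GATC" in my_seq           # membership test via __contains__

print(first_atg)       # 5
print(starts_act)      # True
print(ends_acg)        # True
print(lowered[:10])    # acttgatgcc
print(has_gatc)        # True
```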


In [5]:
for i, letter in enumerate(my_seq):
    print(i, letter)

0 A
1 C
2 T
3 T
4 G
5 A
6 T
7 G
8 C
9 C
10 G
11 T
12 A
13 T
14 T
15 A
16 A
17 G
18 G
19 T
20 T
21 C
22 C
23 T
24 A
25 A
26 A
27 G
28 G
29 A
30 C
31 A
32 G
33 C
34 T
35 A
36 A
37 T
38 T
39 G
40 C
41 T
42 T
43 C
44 C
45 G
46 C
47 G
48 A
49 T
50 T
51 C
52 A
53 T
54 G
55 A
56 A
57 G
58 A
59 T
60 C
61 T
62 A
63 T
64 A
65 C
66 G


In [6]:
p = Seq("KSMKPPRTHLIMHWIIL")
print(p[0])

K


# Seqs have len() #

Check the length of your Seq object using len()

In [None]:
#see above

# Seqs can be sliced #

Try using slice notation on your string.

In [None]:
#see above

# Seqs are iterable #

Try writing a for loop over your Seq object. Use enumerate, and print out both the index and the character at that index.

In [None]:
#see above

# Special string methods #

The Seq object type also has some special string methods that work their own way on biosequences.

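Try out the .reverse_complement() method on your DNA Seq object. In the second cell, try it out on a protein Seq. (Older releases marked protein sequences with the IUPAC.protein alphabet; from Biopython 1.78 on there is no alphabet to set.)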

In [10]:
s = my_seq
print(s)
print(s.complement())
print(s.reverse_complement())
print(s.transcribe())

ACTTGATGCCGTATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACG
TGAACTACGGCATAATTCCAAGGATTTCCTGTCGATTAACGAAGGCGCTAAGTACTTCTAGATATGC
CGTATAGATCTTCATGAATCGCGGAAGCAATTAGCTGTCCTTTAGGAACCTTAATACGGCATCAAGT
ACUUGAUGCCGUAUUAAGGUUCCUAAAGGACAGCUAAUUGCUUCCGCGAUUCAUGAAGAUCUAUACG
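The three methods in the cell above are related in simple ways, which can be checked directly. The sketch below uses a shorter sequence (the first ten bases of the one above) and compares the results as plain strings.

```python
from Bio.Seq import Seq

s = Seq("ACTTGATGCC")

comp = s.complement()
revcomp = s.reverse_complement()
rna = s.transcribe()

# The reverse complement is the complement read backwards.
print(str(comp)[::-1] == str(revcomp))                   # True
# Taking the reverse complement twice returns the original sequence.
print(str(revcomp.reverse_complement()) == str(s))       # True
# Transcription here just swaps T for U on the given strand.
print(str(rna) == str(s).replace("T", "U"))              # True
```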


# .translate() is also special #

Try out the .translate() method on a DNA sequence. Then try it on a protein sequence. (In releases before Biopython 1.78 these were the sequences defined with the IUPAC.unambiguous_dna and IUPAC.protein alphabets.)

In [12]:
print((s+"N").translate())

T*CRIKVPKGQLIASAIHEDLY
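.translate() works codon by codon, so a sequence whose length is not a multiple of three leaves a partial codon at the end, and Biopython warns about it. A minimal sketch, assuming you simply want to drop the leftover bases before translating (padding with N is the other common choice):

```python
from Bio.Seq import Seq

s = Seq("ACTTGATGCCGTATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACG")

# Keep only complete codons: round the length down to a multiple of 3.
whole = len(s) - len(s) % 3
trimmed = s[:whole]

protein = trimmed.translate()

print(len(s), len(trimmed))  # 67 66
print(protein)               # T*CRIKVPKGQLIASAIHEDLY
```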


# Biopython .translate() is pretty smart #

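Unlike the basic .translate() method for strings in Python, which only swaps individual characters, Biopython's .translate() method reads the sequence three letters (one codon) at a time and returns the corresponding amino acids. In releases that still had alphabets (before 1.78), it also knew that you can't translate a sequence that's already a protein!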

It also knows about stop codons. In the cell below, make a DNA Seq object out of the given sequence and try to translate it:

In [20]:
print(Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG").translate())
print(Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG").translate(to_stop=True))

MAIVMGR*KGAR*
MAIVMGR


# .translate() knows different genetic codes #

Try translating the same seq object using .translate() with the argument table=2 in the parentheses.

Also try table="Vertebrate Mitochondrial"

Are these translations the same as the one you got above with the default? (With no table argument, .translate() uses table=1, the standard genetic code.)

In [16]:
print(Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG").translate(table=11))
print(Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG").translate(table=2))

MAIVMGR*KGAR*
MAIVMGRWKGAR*
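The table argument accepts either the NCBI table number or the table's name. A short sketch comparing the standard code with the vertebrate mitochondrial code (table 2), and locating the position where the two translations differ:

```python
from Bio.Seq import Seq

cds = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

standard = cds.translate()                                   # table=1 by default
mito_by_id = cds.translate(table=2)
mito_by_name = cds.translate(table="Vertebrate Mitochondrial")

print(standard)      # MAIVMGR*KGAR*
print(mito_by_id)    # MAIVMGRWKGAR*
print(str(mito_by_id) == str(mito_by_name))  # True

# Positions (0-based, in amino acids) where the two codes disagree.
differences = [
    i for i, (a, b) in enumerate(zip(str(standard), str(mito_by_id))) if a != b
]
print(differences)   # [7]
```

The single difference is the TGA codon, which is a stop in the standard code and tryptophan (W) in the vertebrate mitochondrial code.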


# .translate() can stop at a stop codon #

You can stop at the first stop codon by passing the argument to_stop=True inside the parentheses.

You can also combine both arguments: (table=2, to_stop=True).

Try them below and see what happens.
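A minimal sketch of both calls on the sequence used earlier. Where translation stops depends on which codons the chosen table treats as stops.

```python
from Bio.Seq import Seq

cds = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Standard code: TGA is a stop, so translation ends after seven residues.
standard_to_stop = cds.translate(to_stop=True)

# Vertebrate mitochondrial code: TGA codes for W, so translation
# continues to the final TAG stop.
mito_to_stop = cds.translate(table=2, to_stop=True)

print(standard_to_stop)  # MAIVMGR
print(mito_to_stop)      # MAIVMGRWKGAR
```

With to_stop=True the stop symbol (*) itself is left out of the result.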

In [22]:
from Bio.Data import CodonTable
print(CodonTable.unambiguous_dna_by_name["Standard"])
print(CodonTable.unambiguous_dna_by_id[11])

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------
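Besides printing a whole table, you can query a CodonTable object directly. The sketch below looks up individual codons to show why TGA translated differently under table 2; it relies on the forward_table dictionary and the stop_codons list that CodonTable objects carry.

```python
from Bio.Data import CodonTable

standard = CodonTable.unambiguous_dna_by_id[1]
mito = CodonTable.unambiguous_dna_by_id[2]

# forward_table maps coding codons to amino acids; stop codons are listed separately.
print(standard.forward_table["ATG"])    # M
print("TGA" in standard.stop_codons)    # True
print("TGA" in standard.forward_table)  # False

# In the vertebrate mitochondrial code, TGA codes for tryptophan instead.
print(mito.forward_table["TGA"])        # W
print("TGA" in mito.stop_codons)        # False
```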

# Seq objects are immutable #

If you recall, when you try to reassign a single character value in a string object, you get an error because strings are immutable:

```
dna = "ATCGATGCATCG"
dna[1] = "C"
```

gives the error

```
TypeError: 'str' object does not support item assignment
```

The same will happen if you try to change a single character in a Seq...UNLESS you make it mutable. That is right -- Seq objects have a corresponding mutable type that you can switch to if you are in the mood to change individual positions in a sequence.

Use the .tomutable() method to make a Seq object mutable, print the original seq, and then change the value of one position in the seq to a small letter 'c'. Then print the new seq.

In [26]:
m = s.tomutable()
print(m)
m[1] = "c"
print(m)
dir(m)

ACTTGATGCCGTATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACG
AcTTGATGCCGTATTAAGGTTCCTAAAGGACAGCTAATTGCTTCCGCGATTCATGAAGATCTATACG


['__add__',
 '__class__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'append',
 'complement',
 'count',
 'count_overlap',
 'data',
 'extend',
 'index',
 'insert',
 'join',
 'pop',
 'remove',
 'reverse',
 'reverse_complement',
 'toseq']
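The same contrast can be shown side by side. The sketch below builds the mutable version by calling MutableSeq directly rather than through .tomutable(); that is an alternative chosen here on the assumption that it is the more portable route, since .tomutable() may not be available in every Biopython release.

```python
from Bio.Seq import Seq, MutableSeq

s = Seq("ATCGATGCATCG")

# A Seq refuses item assignment, just like a str.
assignment_failed = False
try:
    s[1] = "C"
except TypeError as err:
    assignment_failed = True
    print("Seq is immutable:", err)

# A MutableSeq built from the same letters can be edited in place.
m = MutableSeq(str(s))
m[1] = "c"

print(s)  # ATCGATGCATCG
print(m)  # AcCGATGCATCG
```

The original Seq is untouched; only the mutable copy changes.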

# SeqRecord objects: more information #

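In addition to the Seq object, Biopython has SeqRecord objects that bundle a sequence with more information, such as an id, a name, and a description. (In releases before Biopython 1.78, a Seq had two attributes, the sequence itself and its alphabet; the alphabet was removed in 1.78.)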

We don't typically make a SeqRecord manually -- usually we get them from the SeqIO module. If you did want to make a SeqRecord manually, you'd use code like you'll see in the cell below.

In [28]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
                   id="YP_025292.1", name="HokC", description="toxic membrane protein")
print(record)
help(SeqRecord)

ID: YP_025292.1
Name: HokC
Description: toxic membrane protein
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF')
Help on class SeqRecord in module Bio.SeqRecord:

class SeqRecord(builtins.object)
 |  SeqRecord(seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None)
 |  
 |  A SeqRecord object holds a sequence and information about it.
 |  
 |  Main attributes:
 |   - id          - Identifier such as a locus tag (string)
 |   - seq         - The sequence itself (Seq object or similar)
 |  
 |  Additional attributes:
 |   - name        - Sequence name, e.g. gene name (string)
 |   - description - Additional text (string)
 |   - dbxrefs     - List of database cross references (list of strings)
 |   - features    - Any (sub)features defined (list of SeqFeature objects)
 |   - annotations - Further information about the whole sequence (dictionary).
 |     Most entries 

# Accessing SeqRecord attributes #

The SeqRecord we defined above has several attributes, and a Seq object makes up a part of it. In the cells below, try out print statements to individually print:

- just the sequence
- just the alphabet (only in Biopython releases before 1.78, where record.seq.alphabet exists)
- the whole SeqRecord
- just the id
- just the name
- just the description

In [30]:
print(record.id)

YP_025292.1


In [31]:
print(record.seq)

MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF


In [32]:
print(record.name)

HokC


In [29]:
print(record.description)

toxic membrane protein
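Because record.seq is an ordinary Seq object, everything from the earlier sections still applies to it. A short sketch that rebuilds the same record and checks a few of its parts:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

protein_letters = "MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"
record = SeqRecord(Seq(protein_letters),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein")

print(isinstance(record.seq, Seq))   # True
print(record.seq[:5])                # MKQHK
print(len(record.seq) == len(protein_letters))  # True

# No features or annotations were supplied, so these start out empty.
print(record.features)     # []
print(record.annotations)  # {}
```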


# Formatting a SeqRecord #

Biopython knows many common sequence file formats and can output a SeqRecord in a compliant format with the .format() method. Try formatting your sequence record using the .format() arguments "fasta" and "genbank".

In [34]:
print(record.format("fasta"))

>YP_025292.1 toxic membrane protein
MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF
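The "genbank" format needs a little more information than "fasta". The sketch below assumes Biopython 1.78 or later, where the GenBank writer expects a molecule_type entry in the record's annotations; older releases took this information from the alphabet instead.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein",
                   # Assumption: required by the GenBank writer in Biopython >= 1.78.
                   annotations={"molecule_type": "protein"})

genbank_text = record.format("genbank")
print(genbank_text)
```

The printed record begins with a LOCUS line and places the sequence itself under ORIGIN.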



# Making SeqRecords from files #

Typically, we would not make a SeqRecord by hand -- we'd build it from an input file. To do that, we'll also need the SeqIO module. I gave you the statement to get it below.

I've given you some input files. The first one is called insulin.gbk. It's the nucleotide sequence of the human insulin gene (INS). It's a GenBank file, which is a standard file format you have not seen before. We'd like to open that file and see it as a FASTA file instead.

To do that, we use SeqIO.parse() to get the file in, and then we can print with record.format() to the format we want.

SeqIO.parse takes two arguments -- the filename itself, and the format. e.g. ("insulin.gbk","genbank").

Try to parse this file into a record in the cell below. Then, once you've got it, print it back out as a FASTA with record.format()

In [36]:
from Bio import SeqIO
for i, record in enumerate(SeqIO.parse("insulin.gbk", "genbank")):
    print(i, record.format("fasta"))


0 >NG_007114.1 Homo sapiens insulin (INS), RefSeqGene on chromosome 11
GGCGGCCAGGGAAGGTCTCTGCCGCCAGGGAAGTGTCCCAGAGACCCCTGGAGGGGCTGC
TGACACCCCCGGTGCCCCCACCTCGAGCATGACCCAGGGCTGCCTCTCCCCATCCTTCAT
CCTCCCTGCTCCACAGGACATTGGCCTGGCGTCCCTGGGGGCCTCGGATGAGGAAATTGA
GAAGCTGTCCACGGTGGGTTGACCCCTCCCTGCAGGGCCTGGGGTGTGGGTTTGGGGGTC
TGAATCCAGGCCTCACCCTCTTGCCGTCCAGGCTGAGGCCTCTCCTTCCACCCACGAATT
GTGACCCTCACCCTGGCCTGCCTGCATCCTGGCCTGGCCTCCCTGGGGGTGGTATCCTGG
TCACGGGTGACCAGGGGCTGCCCGGTGGGCGGCAGCTGTCTCTGGGCTGATGCTGCCCGG
CTTCCCCGCAGCTGTACTGGTTCACGGTGGAGTTCGGGCTGTGTAAGCAGAACGGGGAGG
TGAAGGCCTATGGTGCCGGGCTGCTGTCCTCCTACGGGGAGCTCCTGGTGAGAGTCTCTC
CTTGCTGCAGCCCCCAGCAGAGGGGCAGGGCTGGGGGACGGTGCAGGGAGGGGACAGGCT
CCCAGTGGGAGGAAACTGAGGCCTGGACCTCCAGGACTCAGGCTCTGTTTGGGAGAAGGC
TTGTCTCTGCCCAGTCCTCACCCCACATTATCCCAGGCCTCCGAAGGCCCGGCGGGGGAG
ATGGGGGTGACTCTACCCAAGGAACCCACCCAGCGTCAGGCCACGGTGCCCCAGTTCCCT
CGGGGACCTGGGTGCAGTGGAGTCAGTGATGCCATTGGCCTCCTGCCAGCACTGCCTGTC
TGAGGAGCCTGAGATTCGGGCCTTCGACCCTGAGGCTGCGGCCGTGCAGCCCTACCAAGA
CCAGACGTACCAGT
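SeqIO.parse() returns an iterator of SeqRecord objects, which is why the cell above loops over it. The sketch below shows the same pattern without needing a file on disk: it parses a small made-up FASTA text from an in-memory handle (the two records are invented for illustration), and uses SeqIO.read() for the case where exactly one record is expected.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text standing in for a file on disk.
fasta_text = (
    ">seq1 first demo sequence\n"
    "ACTTGATGCC\n"
    ">seq2 second demo sequence\n"
    "GTATTAAGGT\n"
)

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for i, rec in enumerate(records):
    print(i, rec.id, len(rec.seq))

# SeqIO.read() returns a single SeqRecord and raises ValueError
# if the input holds zero records or more than one.
single = SeqIO.read(StringIO(">only_one\nACGT\n"), "fasta")
print(single.id, single.seq)
```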

# Converting formats with Bio.SeqIO #

SeqIO has lots of functions that we'll look at next time. Most work with iterators, so records can be processed one at a time. For now, take a look at this straightforward bit of code:

from Bio import SeqIO
count = SeqIO.convert("cor6_6.gb", "genbank", "cor6_6.fasta", "fasta")

In the cell below, try to use SeqIO.convert() to convert the insulin GenBank file to FASTA.

In [39]:
from Bio import SeqIO
count = SeqIO.convert("insulin.gbk", "genbank", "insulin.fasta", "fasta")
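The value returned by SeqIO.convert() is the number of records it converted. The sketch below shows this with in-memory handles instead of files, converting a small made-up FASTA text to SeqIO's simple tab-separated format ("tab"); the input records are invented for illustration.

```python
from io import StringIO

from Bio import SeqIO

# Made-up input; with real data these would be filenames as in the cell above.
fasta_in = StringIO(">seq1\nACTTGATGCC\n>seq2\nGTATTAAGGT\n")
tab_out = StringIO()

count = SeqIO.convert(fasta_in, "fasta", tab_out, "tab")

print(count)               # 2
print(tab_out.getvalue())  # one "id<TAB>sequence" line per record
```

Checking count after a conversion is a quick way to confirm that every record you expected made it into the output file.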