# Getting Started with Biopython
*Myles O'Neill - based on http://biopython.org/DIST/docs/tutorial/Tutorial.html*

Biopython is a library we can use to analyze bioinformatic data. Let's have a look at some of the things we can do with it!

This script goes through a selection of tutorial exercises from the Biopython cookbook, which I've adapted to work with the Drosophila dataset we are using. Overall, it should make it easier to get started with this data on Kaggle.

First, let's import numpy, pandas, and Bio (Biopython):

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import Bio
print("Biopython v" + Bio.__version__)

## 2.2  Working with sequences

Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we’ll start with a quick introduction to the Biopython mechanisms for dealing with sequences, the Seq object.

Most of the time when we think about sequences we have in mind a string of letters like ‘AGTACACTGGT’. You can create a Seq object with this sequence as follows:



In [None]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
print(my_seq)
my_seq.alphabet

What we have here is a sequence object with a generic alphabet - reflecting the fact we have not specified if this is a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and Threonines!).

In addition to having an alphabet, the Seq object differs from the Python string in the methods it supports. You can’t do this with a plain string:

In [None]:
print(my_seq + " - Sequence")
print(my_seq.complement() + " - Complement")
print(my_seq.reverse_complement() + " - Reverse Complement")

The next most important class is the SeqRecord or Sequence Record. This holds a sequence (as a Seq object) with additional annotation including an identifier, name and description. The Bio.SeqIO module for reading and writing sequence file formats works with SeqRecord objects.

This covers the basic features and uses of the Biopython sequence class. Now that you’ve got some idea of what it is like to interact with the Biopython libraries, it’s time to delve into the fun, fun world of dealing with biological file formats!

## 2.4  Parsing sequence file formats

A large part of much bioinformatics work involves dealing with the many types of file formats designed to hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing these files into a format so that you can manipulate them with some kind of programming language. However, the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that formats may contain small subtleties which can break even the most well-designed parsers.

For this example, let's grab the first 6 sequences in the genomic data set:

In [None]:
from Bio import SeqIO
sequences = []  # A list to save our sequences for the next step

for seq_record in SeqIO.parse("../input/genome.fa", "fasta"):
    if len(sequences) >= 6:
        break  # Stop parsing once we have the first six records
    sequences.append(seq_record)
    print("Id: " + seq_record.id + " \t " + "Length: " + "{:,d}".format(len(seq_record)))
    print(repr(seq_record.seq) + "\n")

In [None]:
# Let's set these sequences up for easy access later

chr2L = sequences[0].seq
chr2R = sequences[1].seq
chr3L = sequences[2].seq
chr3R = sequences[3].seq
chr4 = sequences[4].seq
chrM = sequences[5].seq
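
For some intuition about what a FASTA parser has to do, here is a minimal plain-Python sketch. It is not how `SeqIO.parse` is implemented, the `read_fasta` helper is a name I made up for illustration, and the records are invented toy data rather than anything from the Drosophila file:

```python
import io
from itertools import islice

def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            words = line[1:].split()
            identifier, chunks = (words[0] if words else ""), []
        elif identifier is not None:
            chunks.append(line)  # sequence data may span several lines
    if identifier is not None:
        yield identifier, "".join(chunks)

# Made-up toy records, purely for illustration.
toy_fasta = io.StringIO(">seqA toy record\nACGT\nACGT\n>seqB\nGGCC\n>seqC\nTTAA\n")

# islice stops reading after two records, much like grabbing only the first few above.
first_two = list(islice(read_fasta(toy_fasta), 2))
for identifier, sequence in first_two:
    print(identifier, len(sequence), sequence)
```

Because `read_fasta` is a generator, nothing beyond the requested records is read, which matters for files as large as a whole genome.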

## 3.2  Sequences act like strings

In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length, or iterating over the elements:

Let's start by printing the length of the first sequence we grabbed before:

In [None]:
print(len(chr2L))

You can access elements of the sequence in the same way as for strings:

In [None]:
print("First Letter: " + chr2L[0])
print("Third Letter: " + chr2L[2])
print("Last Letter: " + chr2L[-1])

The Seq object has a .count() method, just like a string. Note that this means that like a Python string, this gives a non-overlapping count:

In [None]:
print("AAAA".count("AA"))
print(Seq("AAAA").count("AA"))

For some biological uses, you may actually want an overlapping count (i.e. 3 in this trivial example). When searching for single letters, this makes no difference.
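
If you do need the overlapping count, one way to get it is a small helper like the sketch below. It is written for plain Python strings, and `count_overlapping` is a hypothetical name of my own, not a Biopython function:

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing matches to overlap."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Resume one position after the last match start so overlaps are found.
        start = sequence.find(sub, start + 1)
    return count

print("AAAA".count("AA"))               # non-overlapping count
print(count_overlapping("AAAA", "AA"))  # overlapping count
```

The only difference from the built-in `.count()` is that the search resumes one character after each match instead of after the whole match.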

Let's count the number of Gs in the sequence:

In [None]:
print("Length:\t" + str(len(chr2L)))
print("G Count:\t" + str(chr2L.count("G")))

The GC Content of a DNA sequence is important and relates to how stable the molecule will be. We can calculate it manually like this:

In [None]:
print("GC%:\t\t" + str(100 * float((chr2L.count("G") + chr2L.count("C")) / len(chr2L) ) ))

While you could use the above snippet of code to calculate a GC%, note that the Bio.SeqUtils module has several GC functions already built. For example:

In [None]:
from Bio.SeqUtils import GC
print("GC% Package:\t" + str(GC(chr2L)))

But wait a minute! Why aren't those two numbers the same?

The answer lies in the sequence. We only counted uppercase G and C characters, but the actual sequence also contains lowercase g and c characters. In addition, S and s characters, which represent an ambiguous G or C, are counted toward GC content by the package. Let's add those and check again:



In [None]:
print("GgCcSs%:\t" + str(100 * float((chr2L.count("G") + chr2L.count("g") + chr2L.count("C") + chr2L.count("c") + chr2L.count("S") + chr2L.count("s") ) / len(chr2L) ) ))
print("GC% Package:\t" + str(GC(chr2L)))

Much better!
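
The same idea can be wrapped up as a small reusable function. This is a plain-Python sketch of the counting rule described above (G, C, and S in either case), not the package's own code, and `gc_percent` is a name I picked for illustration:

```python
def gc_percent(sequence):
    """Percentage of G, C, and S (ambiguous G-or-C) letters, ignoring case."""
    text = str(sequence).upper()
    if not text:
        return 0.0  # avoid dividing by zero for an empty sequence
    gc_letters = sum(text.count(letter) for letter in "GCS")
    return 100.0 * gc_letters / len(text)

print(gc_percent("ACGTacgt"))
print(gc_percent("GgCcSsAT"))
```

Uppercasing once and then counting three letters keeps it short; for a chromosome-sized sequence this still makes only a handful of passes over the data.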

## 3.3  Slicing a sequence

For a more complicated example, let’s get a slice of the sequence:

In [None]:
print(chr2L[4:12])

Two things are worth noting here. First, when you take a slice, the first index is included (4 in this case) and the last is excluded (12 in this case).

Second, the slice is performed on the sequence data string, but the new object produced is another Seq object which retains the alphabet information from the original Seq object.

Also like a Python string, you can do slices with a start, stop and stride (the step size, which defaults to one). For example, we can get the first, second and third codon positions of this DNA sequence:

We are using a short subset of chr2L, since we don't want to print millions of characters.

In [None]:
chr2LSHORT = chr2L[0:20]
print("Short chr2L: " + chr2LSHORT)

print("Codon Pos 1: " + chr2LSHORT[0::3])
print("Codon Pos 2: " + chr2LSHORT[1::3])
print("Codon Pos 3: " + chr2LSHORT[2::3])

Another stride trick you might have seen with a Python string is the use of a -1 stride to reverse the string. You can do this with a Seq object too:

In [None]:
print("Reversed: " + chr2LSHORT[::-1])
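
Building on the start/stop/stride slices above, the same slicing syntax can also split a sequence into whole codons. Here is a plain-string sketch; `split_codons` is a hypothetical helper and the sequence is a made-up example:

```python
def split_codons(sequence):
    """Split a sequence into complete three-letter codons, dropping leftover letters."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]

toy = "ATGGCCATTGTAAT"  # made-up example, 14 letters, so 2 are left over
codons = split_codons(toy)
print(codons)

# The first letters of the codons match the stride-3 slice starting at index 0.
print("".join(codon[0] for codon in codons), toy[0::3])
```

Note that the stride slice `toy[0::3]` also picks up the first letter of the incomplete trailing codon, which `split_codons` deliberately drops.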

## 3.5  Concatenating or adding sequences

Naturally, you can in principle add any two Seq objects together - just like you can with Python strings to concatenate them. However, you can’t add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence. Two DNA sequences concatenate without any trouble:

In [None]:
chr2LSHORT = chr2L[0:20]
print("Short chr2L: " + chr2LSHORT)

chr2RSHORT = chr2R[0:20]
print("Short chr2R: " + chr2RSHORT)

concat = chr2LSHORT + chr2RSHORT
print("Concat: " + concat)

If you really want to concatenate sequences from different alphabets, you’d have to first give both sequences generic alphabets:

In [None]:
from Bio.Alphabet import IUPAC
protein_seq = Seq("EVRNAK", IUPAC.protein)
dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)

# This will fail since they have different alphabets
# print(protein_seq + dna_seq)

# Error: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()

# But if we give them the same generic alphabet it works

from Bio.Alphabet import generic_alphabet
protein_seq.alphabet = generic_alphabet
dna_seq.alphabet = generic_alphabet
print(protein_seq + dna_seq)

You may often have many sequences to add together, which can be done with a for loop like this:

In [None]:
from Bio.Alphabet import generic_dna

list_of_seqs = [Seq("ACGT", generic_dna), Seq("AACC", generic_dna), Seq("GGTT", generic_dna)]
concatenated = Seq("", generic_dna)
for s in list_of_seqs:
    concatenated += s
print(concatenated)

Or, a more elegant approach is to use the built-in sum function with its optional start value argument (which otherwise defaults to zero):

In [None]:
list_of_seqs = [Seq("ACGT", generic_dna), Seq("AACC", generic_dna), Seq("GGTT", generic_dna)]
print(sum(list_of_seqs, Seq("", generic_dna)))

Unlike the Python string, the Biopython Seq does not (currently) have a .join method.

## 3.6  Changing case

In [None]:
dna_seq = Seq("acgtACGT", generic_dna)
print("Original: " + dna_seq)
print("Upper: " + dna_seq.upper())
print("Lower: " + dna_seq.lower())

These are useful for doing case insensitive matching:

In [None]:
print("GTAC" in dna_seq)
print("GTAC" in dna_seq.upper())
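
Case-insensitive matching is especially handy with this genome, since it mixes uppercase and lowercase letters. Here is a plain-Python sketch that reports where a motif occurs regardless of case; `find_motif` is a name I made up, not part of Biopython:

```python
def find_motif(sequence, motif):
    """Return the start index of every match (overlaps included), ignoring case."""
    haystack, needle = str(sequence).upper(), str(motif).upper()
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions

print(find_motif("acgtACGT", "GTAC"))
print(find_motif("acgtACGT", "cg"))
```

Both inputs are uppercased before searching, which is the same trick as calling `.upper()` before using `in`.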

## 3.7  Nucleotide sequences and (reverse) complements

For nucleotide sequences, you can easily obtain the complement or reverse complement of a Seq object using its built-in methods:


In [None]:
print("Original: " + chr2LSHORT)
print("Complement: " + chr2LSHORT.complement())
print("Reverse Complement: " + chr2LSHORT.reverse_complement())

In all of these operations, the alphabet property is maintained. This is very useful in case you accidentally end up trying to do something weird like take the (reverse) complement of a protein sequence (which will give an error).
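
To make it concrete what these two methods compute, here is a plain-string sketch. It only handles unambiguous A, C, G, and T (in either case) and is not Biopython's implementation; the function names simply mirror the methods for readability:

```python
# Translation table that maps each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(dna):
    """Swap each base for its partner, keeping the original order."""
    return dna.translate(_COMPLEMENT)

def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]

example = "AGTACACTGGT"
print("Original:           " + example)
print("Complement:         " + complement(example))
print("Reverse Complement: " + reverse_complement(example))
```

Taking the reverse complement twice gets you back to the original sequence, which is a handy sanity check.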

## 3.8  Transcription

The actual biological transcription process works from the template strand, doing a reverse complement (TCAG → CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T → U.

In [None]:
print("Coding DNA: " + chr2LSHORT)
template_dna = chr2LSHORT.reverse_complement()
print("Template DNA: " + template_dna)

Biology Note: (remember by convention nucleotide sequences are normally read from the 5’ to 3’ direction)

Now let’s transcribe the coding strand into the corresponding mRNA, using the Seq object’s built in transcribe method:


In [None]:
messenger_rna = chr2LSHORT.transcribe()
print("Messenger RNA: " + messenger_rna)

As you can see, all this does is switch T → U, and adjust the alphabet.
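
The relationship between the two strands can be checked with a few lines of plain Python. This is only a sketch on an uppercase, unambiguous toy sequence, and the helper names are my own rather than Biopython's:

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement each base and reverse the order."""
    return dna.translate(_COMPLEMENT)[::-1]

def transcribe(coding_dna):
    """The bioinformatics shortcut: copy the coding strand, switching T to U."""
    return coding_dna.replace("T", "U")

def transcribe_from_template(template_dna):
    """Closer to the biology: reverse complement the template strand, then use U for T."""
    return reverse_complement(template_dna).replace("T", "U")

coding = "ATGGCCATTGTA"  # short made-up coding strand
template = reverse_complement(coding)

print("mRNA from coding strand:   " + transcribe(coding))
print("mRNA from template strand: " + transcribe_from_template(template))
```

Both routes give the same mRNA, which is why working directly with the coding strand is the usual convention.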

## 3.9  Translation

Using a new example, let’s translate this mRNA into the corresponding protein sequence - again taking advantage of one of the Seq object’s biological methods:



In [None]:
from Bio.Alphabet import IUPAC
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
print("Messenger RNA: " + messenger_rna)
print("Protein Sequence: " + messenger_rna.translate())

You can also translate directly from the coding strand DNA sequence:


In [None]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
print("Coding DNA: " + coding_dna)
print("Protein Sequence: " + coding_dna.translate())

You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).

The translation tables available in Biopython are based on those from the NCBI (see the next section of this tutorial). By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:

NCBI Tables: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

In [None]:
print("Vertebrate Mitochondrial Table Result: " + coding_dna.translate(table="Vertebrate Mitochondrial"))

You can also specify the table using the NCBI table number which is shorter, and often included in the feature annotation of GenBank files:

In [None]:
print("Table 2 Result: " + coding_dna.translate(table=2))

Now, you may want to translate the nucleotides up to the first in-frame stop codon, and then stop (as happens in nature):

In [None]:
print("Standard Translation: " + coding_dna.translate())
print("Stop as in Biology: " + coding_dna.translate(to_stop=True))
print("Table 2 Translation: " + coding_dna.translate(table=2))
print("Table 2 Translation with Stop: " + coding_dna.translate(table=2, to_stop=True))

Notice that when you use the to_stop argument, the stop codon itself is not translated - and the stop symbol is not included at the end of your protein sequence.
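
To see how the table choice and the to_stop option interact, here is a small plain-Python sketch of codon-by-codon translation. It is not Biopython's implementation: the two amino acid strings are NCBI tables 1 and 2 as I understand them (written in the NCBI's TCAG codon order), and the `translate` function below is my own simplified stand-in for the Seq method:

```python
from itertools import product

BASES = "TCAG"  # codon order used by the NCBI genetic code listings
AMINO_ACIDS = {
    # Table 1: the standard code
    1: "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR" "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG",
    # Table 2: vertebrate mitochondrial
    2: "FFLLSSSSYY**CCWW" "LLLLPPPPHHQQRRRR" "IIMMTTTTNNKKSS**" "VVVVAAAADDEEGGGG",
}

def codon_table(table_id):
    """Map each of the 64 codons to its one-letter amino acid (or * for stop)."""
    codons = ("".join(triplet) for triplet in product(BASES, repeat=3))
    return dict(zip(codons, AMINO_ACIDS[table_id]))

def translate(dna, table_id=1, to_stop=False):
    """Translate codon by codon; with to_stop, halt before the first stop codon."""
    table = codon_table(table_id)
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table.get(dna[i:i + 3], "X")  # X for unrecognised codons
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)

coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding))
print(translate(coding, to_stop=True))
print(translate(coding, table_id=2))
print(translate(coding, table_id=2, to_stop=True))
```

The internal TGA codon is a stop in table 1 but encodes tryptophan (W) in table 2, which is why the to_stop result is longer under the mitochondrial table.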

Now, suppose you have a complete coding sequence (CDS), which is to say a nucleotide sequence (e.g. mRNA – after any splicing) which is a whole number of codons (i.e. the length is a multiple of three), commences with a start codon, ends with a stop codon, and has no internal in-frame stop codons. In general, given a complete CDS, the default translate method will do what you want (perhaps with the to_stop option). However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria – for example the gene yaaX in E. coli K12:

In [None]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCAGCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGATAATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACATTATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCATAAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA", generic_dna)
print(gene)
print("Bacterial Translation With Stop: " + gene.translate(table="Bacterial", to_stop=True))

In the bacterial genetic code GTG is a valid start codon, and while it does normally encode Valine, if used as a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a complete CDS:

In [None]:
print("Bacterial Translation of CDS: " + gene.translate(table="Bacterial", cds=True))

In addition to telling Biopython to translate an alternative start codon as methionine, using this option also makes sure your sequence really is a valid CDS (you’ll get an exception if not).
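
The CDS rules listed earlier (whole number of codons, start codon first, stop codon last, no internal stops) are simple enough to sketch in plain Python. This is only an illustration of those rules, and Biopython's own check may differ in its details. The start and stop codon sets are my reading of the NCBI bacterial table (table 11), so treat them as assumptions; `check_cds` is a hypothetical helper:

```python
# Assumed start and stop codons for the bacterial genetic code (NCBI table 11).
START_CODONS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(dna, start_codons=START_CODONS, stop_codons=STOP_CODONS):
    """Return a list of problems found; an empty list means the CDS rules pass."""
    dna = dna.upper()
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    if not codons or codons[0] not in start_codons:
        problems.append("does not begin with a start codon")
    if not codons or codons[-1] not in stop_codons:
        problems.append("does not end with a stop codon")
    if any(codon in stop_codons for codon in codons[:-1]):
        problems.append("contains an internal in-frame stop codon")
    return problems

print(check_cds("GTGAAATAA"))     # short made-up CDS
print(check_cds("GTGTAAAAATAA"))  # made-up CDS with an internal stop
```

Returning a list of problems rather than raising on the first one makes it easier to see everything that is wrong with a candidate sequence at once.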

## 3.11  Comparing Seq objects

Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two sequences are equal. The basic problem is the meaning of the letters in a sequence are context dependent - the letter “A” could be part of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of each Seq object to try to capture this information - so comparing two Seq objects could mean considering both the sequence strings and the alphabets.

For example, you might argue that the two DNA Seq objects Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.ambiguous_dna) should be equal, even though they do have different alphabets. Depending on the context this could be important.

This gets worse – suppose you think Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT") (i.e. the default generic alphabet) should be equal. Then, logically, Seq("ACGT", IUPAC.protein) and Seq("ACGT") should also be equal. Now, in logic if A=B and B=C, by transitivity we expect A=C. So for logical consistency we’d require Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.protein) to be equal – which most people would agree is just not right. This transitivity also has implications for using Seq objects as Python dictionary keys.

Now, in everyday use, your sequences will probably all have the same alphabet, or at least all be the same type of sequence (all DNA, all RNA, or all protein). What you probably want is to just compare the sequences as strings – which you can do explicitly:

In [None]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
seq2 = Seq("ACGT", IUPAC.ambiguous_dna)
print(str(seq1) == str(seq2))
print(str(seq1) == str(seq1))

So, what does Biopython do? Well, as of Biopython 1.65, sequence comparison only looks at the sequence, essentially ignoring the alphabet:

In [None]:
print(seq1 == seq2)
print(seq1 == "ACGT")

That's all for now. I hope this tutorial run-through was helpful! You can find lots more info about Biopython here: http://biopython.org/DIST/docs/tutorial/Tutorial.html