Quick Start – What can you do with Biopython? {#chapter:quick-start}
=============================================

This section is designed to get you started quickly with Biopython, and
to give a general overview of what is available and how to use it. All
of the examples in this section assume that you have some general
working knowledge of Python, and that you have successfully installed
Biopython on your system. If you think you need to brush up on your
Python, the main Python web site provides quite a bit of free
documentation to get started with (<http://www.python.org/doc/>).

Since much biological work on the computer involves connecting with
databases on the internet, some of the examples will also require a
working internet connection in order to run.

Now that that is all out of the way, let’s get into what we can do with
Biopython.

General overview of what Biopython provides
-------------------------------------------

As mentioned in the introduction, Biopython is a set of libraries to
provide the ability to deal with “things” of interest to biologists
working on the computer. In general this means that you will need to
have at least some programming experience (in Python, of course!) or at
least an interest in learning to program. Biopython’s job is to make
your job easier as a programmer by supplying reusable libraries so that
you can focus on answering your specific question of interest, instead
of focusing on the internals of parsing a particular file format (of
course, if you want to help by writing a parser that doesn’t exist and
contributing it to Biopython, please go ahead!). So Biopython’s job is
to make you happy!

One thing to note about Biopython is that it often provides multiple
ways of “doing the same thing.” Things have improved in recent releases,
but this can still be frustrating as in Python there should ideally be
one right way to do something. However, this can also be a real benefit
because it gives you lots of flexibility and control over the libraries.
The tutorial helps to show you the common or easy ways to do things so
that you can just make things work. To learn more about the alternative
possibilities, look in the Cookbook (Chapter \[chapter:cookbook\], which
has some cool tricks and tips), the Advanced section
(Chapter \[chapter:advanced\]), the built-in “docstrings” (via the
Python help command, or the [API
documentation](http://biopython.org/DIST/docs/api/)) or ultimately the
code itself.

Working with sequences {#sec:sequences}
----------------------

Disputably (of course!), the central object in bioinformatics is the
sequence. Thus, we’ll start with a quick introduction to the Biopython
mechanism for dealing with sequences, the `Seq` object, which we’ll
discuss in more detail in Chapter \[chapter:Bio.Seq\].

Most of the time when we think about sequences we have in mind a
string of letters like ‘`AGTACACTGGT`’. You can create a `Seq` object
with this sequence as follows:



In [None]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
my_seq


In [None]:
print(my_seq)


In [None]:
my_seq.alphabet



What we have here is a sequence object with a *generic* alphabet -
reflecting the fact we have *not* specified if this is a DNA or protein
sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines
and Threonines!). We’ll talk more about alphabets in
Chapter \[chapter:Bio.Seq\].

In addition to having an alphabet, the `Seq` object differs from the
Python string in the methods it supports. You can’t do this with a plain
string:



In [None]:
my_seq


In [None]:
my_seq.complement()


In [None]:
my_seq.reverse_complement()
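
For intuition about what these two methods compute, here is a minimal sketch in plain Python for unambiguous DNA. It is not how Biopython implements them, and the helper names `complement` and `reverse_complement` below are our own stand-ins; the `Seq` methods also cope with cases (such as ambiguity codes) that this sketch ignores.

```python
# Plain-Python sketch of complementing unambiguous DNA (not Biopython's code).
_DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the complement of a DNA string containing only A, C, G, T."""
    return dna.translate(_DNA_COMPLEMENT)


def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]


print(complement("AGTACACTGGT"))
print(reverse_complement("AGTACACTGGT"))
```

A plain `str` offers neither operation as a method, which is why the examples above work on a `Seq` object.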



The next most important class is the `SeqRecord` or Sequence Record.
This holds a sequence (as a `Seq` object) with additional annotation
including an identifier, name and description. The `Bio.SeqIO` module
for reading and writing sequence file formats works with `SeqRecord`
objects, which will be introduced below and covered in more detail by
Chapter \[chapter:Bio.SeqIO\].

This covers the basic features and uses of the Biopython sequence class.
Now that you’ve got some idea of what it is like to interact with the
Biopython libraries, it’s time to delve into the fun, fun world of
dealing with biological file formats!

A usage example {#sec:orchids}
---------------

Before we jump right into parsers and everything else to do with
Biopython, let’s set up an example to motivate everything we do and make
life more interesting. After all, if there wasn’t any biology in this
tutorial, why would you want to read it?

Since I love plants, I think we’re just going to have to have a plant
based example (sorry to all the fans of other organisms out there!).
Having just completed a recent trip to our local greenhouse, we’ve
suddenly developed an incredible obsession with Lady Slipper Orchids (if
you wonder why, have a look at some [Lady Slipper Orchids photos on
Flickr](http://www.flickr.com/search/?q=lady+slipper+orchid&s=int&z=t),
or try a [Google Image
Search](http://images.google.com/images?q=lady%20slipper%20orchid)).

Of course, orchids are not only beautiful to look at, they are also
extremely interesting for people studying evolution and systematics. So
let’s suppose we’re thinking about writing a funding proposal to do a
molecular study of Lady Slipper evolution, and would like to see what
kind of research has already been done and how we can add to that.

After a little bit of reading up we discover that the Lady Slipper
Orchids are in the Orchidaceae family and the Cypripedioideae sub-family
and are made up of 5 genera: *Cypripedium*, *Paphiopedilum*,
*Phragmipedium*, *Selenipedium* and *Mexipedium*.

That gives us enough to get started delving for more information. So,
let’s look at how the Biopython tools can help us. We’ll start with
sequence parsing in Section \[sec:sequence-parsing\], but the orchids
will be back later on as well - for example we’ll search PubMed for
papers about orchids and extract sequence data from GenBank in
Chapter \[chapter:entrez\], extract data from Swiss-Prot for certain
orchid proteins in Chapter \[chapter:swiss\_prot\], and work with
ClustalW multiple sequence alignments of orchid proteins in
Section \[sec:align\_clustal\].

Parsing sequence file formats {#sec:sequence-parsing}
-----------------------------

A large part of bioinformatics work involves dealing with the many
types of file formats designed to hold biological data. These files are
loaded with interesting biological data, and a special challenge is
parsing them into a form that you can manipulate with
some kind of programming language. However, the task of parsing these
files can be frustrated by the fact that the formats can change quite
regularly, and that formats may contain small subtleties which can break
even the most well-designed parsers.

We are now going to briefly introduce the `Bio.SeqIO` module – you can
find out more in Chapter \[chapter:Bio.SeqIO\]. We’ll start with an
online search for our friends, the lady slipper orchids. To keep this
introduction simple, we’re just using the NCBI website by hand. Let’s
just take a look through the nucleotide databases at NCBI, using an
Entrez online search
(<http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide>) for
everything mentioning the text Cypripedioideae (this is the subfamily of
lady slipper orchids).

When this tutorial was originally written, this search gave us only 94
hits, which we saved as a FASTA formatted text file and as a GenBank
formatted text file (files
[ls\_orchid.fasta](http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta)
and
[ls\_orchid.gbk](http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk),
also included with the Biopython source code under
`docs/tutorial/examples/`).

If you run the search today, you’ll get hundreds of results! When
following the tutorial, if you want to see the same list of genes, just
download the two files above or copy them from `docs/examples/` in the
Biopython source code. In
Section \[sec:connecting-with-biological-databases\] we will look at how
to do a search like this from within Python.

### Simple FASTA parsing example {#sec:fasta-parsing}

If you open the lady slipper orchids FASTA file
[ls\_orchid.fasta](http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.fasta)
in your favourite text editor, you’ll see that the file starts like
this:




It contains 94 records, each of which has a line starting with “`>`”
(greater-than symbol) followed by the sequence on one or more lines. Now
try this in Python:




You should get something like this on your screen:




Notice that the FASTA format does not specify the alphabet, so
`Bio.SeqIO` has defaulted to the generic `SingleLetterAlphabet()`
rather than something DNA-specific.
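
To make the FASTA record structure described above concrete, the following is a minimal plain-Python sketch of reading such records: a header line starting with `>` followed by one or more sequence lines. It is only an illustration and not how `Bio.SeqIO` works internally; the function names `read_fasta` and `_make_record` and the two toy records are invented for this example and are not taken from `ls_orchid.fasta`.

```python
import io


def read_fasta(handle):
    """Yield (identifier, description, sequence) tuples from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # The identifier is the first whitespace-delimited word of the header.
    identifier = header.split(None, 1)[0] if header.strip() else ""
    return identifier, header, "".join(chunks)


# Two invented toy records (not real orchid data).
toy_fasta = io.StringIO(
    ">toy1 first made-up record\n"
    "ACGTAC\n"
    "GGTT\n"
    ">toy2 second made-up record\n"
    "TTGACA\n"
)

for identifier, description, sequence in read_fasta(toy_fasta):
    print(identifier, len(sequence), sequence)
```

With Biopython installed, `Bio.SeqIO` does this work for you and hands back `SeqRecord` objects rather than bare tuples.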

### Simple GenBank parsing example

Now let’s load the GenBank file
[ls\_orchid.gbk](http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk)
instead - notice that the code to do this is almost identical to the
snippet used above for the FASTA file - the only difference is we change
the filename and the format string:




This should give:

