# Chapter 02: Quick Start - What can I do with Biopython?


This section is designed to get you started quickly with Biopython, and to give a general overview of what is available and how to use it. All of the examples in this section assume that you have some general working knowledge of Python, and that you have successfully installed Biopython on your system. If you think you need to brush up on your Python, the main Python web site provides quite a bit of free documentation to get started with: [https://docs.python.org/3/](https://docs.python.org/3/).

Since much biological work on the computer involves connecting with databases on the internet, some of the examples will also require a working internet connection in order to run.

Now that that is all out of the way, let’s get into what we can do with Biopython.

## 2.1 General overview of what Biopython provides

As mentioned in the introduction, Biopython is a set of libraries to provide the ability to deal with “things” of interest to biologists working on the computer. In general this means that you will need to have at least some programming experience (in Python, of course!) or at least an interest in learning to program. Biopython’s job is to make your job easier as a programmer by supplying reusable libraries so that you can focus on answering your specific question of interest, instead of focusing on the internals of parsing a particular file format (of course, if you want to help by writing a parser that doesn’t exist and contributing it to Biopython, please go ahead!). So Biopython’s job is to make you happy!

One thing to note about Biopython is that it often provides multiple ways of “**doing the same thing**”. Things have improved in recent releases, but this can still be frustrating, as in Python there should ideally be one right way to do something. However, this can also be a real benefit because it gives you lots of flexibility and control over the libraries. The tutorial helps to show you the common or easy ways to do things so that you can just make things work. To learn more about the alternative possibilities, look in the Cookbook (Chapter 20, which has some cool tricks and tips), the Advanced section (Chapter 22), the built-in “docstrings” (via the Python help command, or the API documentation) or ultimately the code itself.

## 2.2 Working with sequences
Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we’ll start with a quick introduction to the Biopython mechanism for dealing with sequences, the `Seq` object, which we’ll discuss in more detail in Chapter 3.

Most of the time when we think about sequences we have in mind a string of letters like ‘`AGTACACTGGT`’. You can create a `Seq` object with this sequence as follows. Each `In [n]:` label marks a notebook input cell containing what you would type in, and the line after the cell shows the result:


In [2]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
my_seq

Seq('AGTACACTGGT')

In [3]:
print(my_seq)

AGTACACTGGT


The `Seq` object differs from the Python string in the methods it supports. You can’t do this with a plain string:

In [4]:
my_seq.complement()

Seq('TCATGTGACCA')

In [5]:
my_seq.reverse_complement()

Seq('ACCAGTGTACT')
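For comparison, here is a minimal plain-Python sketch of what you would have to write by hand to get the same results from an ordinary string. It is not how Biopython implements these methods; the function names are made up for this illustration, and it assumes the input contains only the unambiguous DNA letters A, C, G and T.

```python
# Hand-rolled helpers for plain strings (illustrative names, not Biopython API).
# Assumption: the input contains only the unambiguous DNA letters A, C, G, T.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna: str) -> str:
    """Return the complementary strand, read in the same direction."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, reversed."""
    return complement(dna)[::-1]


print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

The printed strings match the `Seq` results above, but the `Seq` object spares you from maintaining helpers like these yourself.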

The next most important class is the `SeqRecord` or Sequence Record. This holds a sequence (as a `Seq` object) with additional annotation including an identifier, name and description. The `Bio.SeqIO` module for reading and writing sequence file formats works with `SeqRecord` objects, which will be introduced below and covered in more detail by Chapter 5.

This covers the basic features and uses of the Biopython sequence class. Now that you’ve got some idea of what it is like to interact with the Biopython libraries, it’s time to delve into the fun, fun world of dealing with biological file formats!

## 2.3 A usage example
Before we jump right into parsers and everything else to do with Biopython, let’s set up an example to motivate everything we do and make life more interesting. After all, if there wasn’t any biology in this tutorial, why would you want to read it?

Since I love plants, I think we’re just going to have to have a plant based example (sorry to all the fans of other organisms out there!). Having just completed a recent trip to our local greenhouse, we’ve suddenly developed an incredible obsession with Lady Slipper Orchids (if you wonder why, have a look at some Lady Slipper Orchids photos on Flickr, or try a Google Image Search).

Of course, orchids are not only beautiful to look at, they are also extremely interesting for people studying evolution and systematics. So let’s suppose we’re thinking about writing a funding proposal to do a molecular study of Lady Slipper evolution, and would like to see what kind of research has already been done and how we can add to that.

After a little bit of reading up we discover that the Lady Slipper Orchids are in the Orchidaceae family and the Cypripedioideae sub-family, and are made up of 5 genera: *Cypripedium*, *Paphiopedilum*, *Phragmipedium*, *Selenipedium* and *Mexipedium*.

That gives us enough to get started delving for more information. So, let’s look at how the Biopython tools can help us. We’ll start with sequence parsing in Section 2.4, but the orchids will be back later on as well - for example we’ll search PubMed for papers about orchids and extract sequence data from GenBank in Chapter 9, extract data from Swiss-Prot from certain orchid proteins in Chapter 10, and work with ClustalW multiple sequence alignments of orchid proteins in Section 6.5.1.


## 2.4 Parsing sequence file formats
A large part of much bioinformatics work involves dealing with the many types of file formats designed to hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing these files into a format so that you can manipulate them with some kind of programming language. However, the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that formats may contain small subtleties which can break even the most well designed parsers.

We are now going to briefly introduce the `Bio.SeqIO` module - you can find out more in Chapter 5. We’ll start with an online search for our friends, the lady slipper orchids. To keep this introduction simple, we’re just using the NCBI website by hand. Let’s just take a look through the nucleotide databases at NCBI, using an Entrez online search [https://www.ncbi.nlm.nih.gov/nuccore/?term=Cypripedioideae](https://www.ncbi.nlm.nih.gov/nuccore/?term=Cypripedioideae) for everything mentioning the text Cypripedioideae (this is the subfamily of lady slipper orchids).

When this tutorial was originally written, this search gave us only 94 hits, which we saved as a FASTA formatted text file and as a GenBank formatted text file (files `ls_orchid.fasta` and `ls_orchid.gbk`, also included with the Biopython source code under `Doc/examples/`).

If you run the search today, you’ll get hundreds of results! When following the tutorial, if you want to see the same list of genes, just download the two files above or copy them from `Doc/examples/` in the Biopython source code. In Section 2.5 we will look at how to do a search like this from within Python.

### 2.4.1    Simple FASTA parsing example
If you open the lady slipper orchids FASTA file `ls_orchid.fasta` in your favourite text editor, you’ll see that it is plain text. To follow along, first download the file into the `data` directory with the following code:

In [16]:
#!pip install wget
import os
import wget
_path = 'data/'
os.makedirs(_path, exist_ok=True)  # make sure the target directory exists
url = 'https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta'
wget.download(url, out=_path)


'data//ls_orchid (1).fasta'

The file contains 94 records, each of which has a line starting with `>` (greater-than symbol) followed by the sequence on one or more lines. Now try this in Python:

In [21]:
from Bio import SeqIO
for seq_record in SeqIO.parse(_path + "ls_orchid.fasta","fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))  # repr() gives the Seq('...') representation, truncated for long sequences
    print(len(seq_record))


gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')
704
gi|2765649|emb|Z78524.1|CFZ78524
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')
740
gi|2765648|emb|Z78523.1|CHZ78523
Seq('CGTAACCAGGTTTCCGT
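To make the FASTA layout concrete (a `>` header line, then the sequence on one or more lines), here is a minimal plain-Python reader. It is only a sketch of the idea, not the parser behind `Bio.SeqIO`: the name `read_fasta` and the two toy records are invented for this example, and real files have subtleties that it ignores.

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Two made-up records; the second sequence is split over two lines.
example = StringIO(
    ">record_1 first toy record\nAGTACACTGGT\n"
    ">record_2 second toy record\nACGT\nTTGA\n"
)
for header, sequence in read_fasta(example):
    print(header.split()[0], len(sequence))
# record_1 11
# record_2 8
```

`SeqIO.parse` does this bookkeeping for you and returns `SeqRecord` objects rather than bare tuples.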

### 2.4.2  Simple GenBank parsing example
Now let’s load the GenBank file `ls_orchid.gbk` instead - notice that the code to do this is almost identical to the snippet used above for the FASTA file - the only difference is we change the filename and the format string:

In [22]:
#!pip install wget
import wget
_path = 'data/'
url = 'https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.gbk' 
wget.download(url, out=_path)

'data//ls_orchid.gbk'

In [26]:
from Bio import SeqIO
for seq_record in SeqIO.parse(_path + "ls_orchid.gbk",'genbank'):
    print(seq_record.id)
    print(repr(seq_record.seq))  # repr() gives the Seq('...') representation, truncated for long sequences
    print(len(seq_record))

Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
740
Z78532.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC')
753
Z78531.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA')
748
Z78530.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT')
744
Z78529.1
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA')
733
Z78527.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC')
718
Z78526.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT')
730
Z78525.1
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GCA')
704
Z78524.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATAGTAG...AGC')
740
Z78523.1
Seq('CGTAACCAGGTTTCCGTAGGTGAACCTGCGGCAGGATCATTGTTGAGACAGCAG...AAG')
709
Z78522.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...GAG')
700
Z78521.1
Seq('GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATATGATCGAGT...ACC')
726
Z78520.1
Seq('CGTAACAAGGTTTC
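Because only the filename and the format string change between the two examples, the loop can be wrapped in one small helper. The sketch below is one possible way to do that, not something Biopython provides: `FORMAT_BY_SUFFIX`, `guess_format` and `summarise` are hypothetical names, and the extension mapping covers only the two file types used in this section.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the Bio.SeqIO format string.
# Only the two formats used in this section are listed.
FORMAT_BY_SUFFIX = {".fasta": "fasta", ".gbk": "genbank"}


def guess_format(filename: str) -> str:
    """Return the SeqIO format string implied by the file extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return FORMAT_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no known format for {filename!r}") from None


def summarise(filename: str):
    """Return (id, length) for every record in the file."""
    # Imported here so that guess_format() works even without Biopython.
    from Bio import SeqIO

    fmt = guess_format(filename)
    return [(record.id, len(record)) for record in SeqIO.parse(filename, fmt)]


print(guess_format("data/ls_orchid.fasta"))  # fasta
print(guess_format("data/ls_orchid.gbk"))    # genbank
```

With the files downloaded as above, `summarise("data/ls_orchid.gbk")` should give the same identifiers and lengths that the loop printed.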

### 2.4.3 I love parsing – please don’t stop talking about it!
Biopython has a lot of parsers, and each has its own little special niches based on the sequence format it is parsing and all of that. Chapter 5 covers `Bio.SeqIO` in more detail, while Chapter 6 introduces `Bio.AlignIO` for sequence alignments.

While the most popular file formats have parsers integrated into `Bio.SeqIO` and/or `Bio.AlignIO`, for some of the rarer and unloved file formats there is either no parser at all, or an old parser which has not been linked in yet. Please also check the wiki pages [http://biopython.org/wiki/SeqIO](http://biopython.org/wiki/SeqIO) and [http://biopython.org/wiki/AlignIO](http://biopython.org/wiki/AlignIO) for the latest information, or ask on the mailing list. The wiki pages should include an up to date list of supported file types, and some additional examples.

The next place to look for information about specific parsers and how to do cool things with them is in the Cookbook (Chapter 20 of this Tutorial). If you don’t find the information you are looking for, please consider helping out your poor overworked documentors and submitting a cookbook entry about it! (once you figure out how to do it, that is!)

## 2.5 Connecting with biological databases

One of the very common things that you need to do in bioinformatics is extract information from biological databases.  It can be quite tedious to access these databases manually, especially if you have a lot of repetitive work to do.  Biopython attempts to save you time and energy by making some on-line databases available from Python scripts.  Currently, Biopython has code to extract information from the following databases:
- Entrez (and PubMed) from the NCBI – See Chapter 9.
- ExPASy – See Chapter 10.
- SCOP – See the `Bio.SCOP.search()` function.


The code in these modules basically makes it easy to write Python code that interacts with the CGI scripts on these pages, so that you can get results in an easy to deal with format. In some cases, the results can be tightly integrated with the Biopython parsers to make it even easier to extract information.
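As a rough illustration of the Entrez route, the sketch below repeats the Cypripedioideae search from Section 2.4 from within Python, using `Bio.Entrez.esearch` and `Bio.Entrez.read`. The function name `count_nucleotide_hits` and the e-mail address are placeholders invented for this example; the call needs a working internet connection, and the count returned will change over time. Chapter 9 covers Entrez properly.

```python
def count_nucleotide_hits(term: str, email: str) -> int:
    """Return how many NCBI nucleotide records match the search term."""
    # Imported inside the function so that defining it needs neither
    # Biopython nor a network connection; both are used only when called.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks you to identify yourself
    handle = Entrez.esearch(db="nucleotide", term=term)
    try:
        result = Entrez.read(handle)  # parsed search result, dict-like
    finally:
        handle.close()
    return int(result["Count"])


# Example call (placeholder address, requires internet access):
# count_nucleotide_hits("Cypripedioideae", "your.name@example.org")
```

Replace the placeholder address with your own before running the call.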

## 2.6 What to do next

Now that you’ve made it this far, you hopefully have a good understanding of the basics of Biopython and are ready to start using it for doing useful work. The best thing to do now is finish reading this tutorial, and then, if you want, start snooping around in the source code and looking at the automatically generated documentation.

Once you get a picture of what you want to do, and what libraries in Biopython will do it, you should take a peek at the Cookbook (Chapter 20), which may have example code to do something similar to what you want to do.

If you know what you want to do, but can’t figure out how to do it, please feel free to post questions to the main Biopython list (see [http://biopython.org/wiki/Mailing_lists](http://biopython.org/wiki/Mailing_lists)). This will not only help us answer your question, it will also allow us to improve the documentation so it can help the next person do what you want to do.

Enjoy the code!
